
[Introduction to Python tutorial] Use Python 3 to extract the main content of any HTML page

2022-01-29 14:07:31 Mengy7762 Mengya

0x1 Tool preparation

As the saying goes, a worker must sharpen his tools before he can do a good job. The foundation of crawling a corpus is Python.

We develop on Python 3, mainly using the following modules: requests, lxml and json.

Here is a brief introduction to what each module does.

01|requests

requests is a third-party Python library that makes working with URL resources particularly convenient. Its official documentation carries the big slogan "HTTP for Humans". Compared with the experience of Python's built-in urllib, the author finds requests an order of magnitude better.

Let's make a simple comparison:

urllib:

from urllib import parse, request

URL_GET = "https://api.douban.com/v2/event/list"
# Build the request parameters
params = parse.urlencode({'loc': '108288', 'day_type': 'weekend', 'type': 'exhibition'})

# Send the request
response = request.urlopen('?'.join([URL_GET, params]))
# Response headers
print(response.info())
# Response code
print(response.getcode())
# Response body
print(response.read())

requests:

import requests

URL_GET = "https://api.douban.com/v2/event/list"
# Build the request parameters
params = {'loc': '108288', 'day_type': 'weekend', 'type': 'exhibition'}

# Send the request
response = requests.get(URL_GET, params=params)
# Response headers
print(response.headers)
# Response code
print(response.status_code)
# Response body
print(response.text)

We can see that there are some differences between the two libraries:

1. Building the parameters: with urllib you have to run the parameters through urlencode yourself, which is more troublesome; requests needs no extra encoding and is very simple.

2. Sending the request: with urllib you additionally have to splice the URL and the parameters into the required form yourself; requests is much more concise, you simply call get with the link and the parameters.

3. Connection handling: look at the "connection" field in the returned headers. With the urllib library, "connection": "close" means the socket channel is closed after every request. The requests library uses urllib3, so multiple requests can reuse one socket; "connection": "keep-alive" means several requests share one connection and consume fewer resources (see the short sketch after this list).

4. Encoding: the requests library's support for Accept-Encoding is more complete; no example is given here.
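To make point 3 concrete, here is a minimal sketch, reusing the douban URL from the comparison above purely as an illustration, of how a requests Session keeps one connection alive across several calls:

import requests

URL_GET = "https://api.douban.com/v2/event/list"
params = {'loc': '108288', 'day_type': 'weekend', 'type': 'exhibition'}

# A Session keeps the underlying socket open between requests (keep-alive),
# so several calls can reuse one connection instead of opening a new one each time
with requests.Session() as session:
    for _ in range(3):
        response = session.get(URL_GET, params=params)
        # Servers that support it usually answer with "Connection: keep-alive"
        print(response.status_code, response.headers.get('Connection'))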

To sum up, requests is more concise and easier to understand, which greatly facilitates our development.

02|lxml

BeautifulSoup is a library, while XPath is a technology; the XPath library most commonly used in Python is lxml.

After we get the page returned by requests, how do we pull out the data we want? This is where the powerful HTML/XML parsing tool lxml comes in. Python has no shortage of parsing libraries, so why choose lxml among so many of them? Let's compare it with BeautifulSoup, another well-known HTML parsing library.

Let's make a simple comparison:

BeautifulSoup:

from bs4 import BeautifulSoup  # import the library
# Assume html is the html to be parsed

# Pass html into the BeautifulSoup constructor to get a document object
soup = BeautifulSoup(html, 'html.parser', from_encoding='utf-8')
# Find all h4 tags
links = soup.find_all("h4")

lxml:

from lxml import etree
# Assume html is the html to be parsed

# Pass html into the etree constructor to get a document object
root = etree.HTML(html)
# Find all h4 tags
links = root.xpath("//h4")

We can see that there are some differences between the two libraries:

1. Parsing HTML: BeautifulSoup parses in a way similar to jQuery, its API is very human-friendly and it supports CSS selectors; lxml's syntax has a certain learning cost (a short sketch comparing the two follows this list).

2. Performance: BeautifulSoup is DOM-based; it loads the whole document and parses the whole DOM tree, so its time and memory overhead are much larger. lxml only traverses locally, and in addition lxml is written in C while BeautifulSoup is written in Python, so the performance difference is obvious: lxml >> BeautifulSoup.
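As an illustration of point 1, here is a minimal sketch of the same query written once as a CSS selector in BeautifulSoup and once as an XPath expression in lxml; the markup and the list-item class are made up for the example:

from bs4 import BeautifulSoup
from lxml import etree

html = "<ul><li class='list-item'><h4>LOL</h4></li><li class='list-item'><h4>DOTA2</h4></li></ul>"

# BeautifulSoup: a jQuery-like CSS selector
soup = BeautifulSoup(html, 'html.parser')
titles_bs = [h4.get_text() for h4 in soup.select("li.list-item h4")]

# lxml: the same query as an XPath expression
root = etree.HTML(html)
titles_lxml = root.xpath("//li[@class='list-item']//h4/text()")

print(titles_bs)    # ['LOL', 'DOTA2']
print(titles_lxml)  # ['LOL', 'DOTA2']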

To sum up, BeautifulSoup is more concise and easier to use, while lxml has a certain learning cost but is on the whole also concise and easy to understand. Most importantly, lxml is written in C and is a lot faster, so with the author's perfectionism the natural choice is lxml.

03|json

Python comes with its own json library, which is completely sufficient for basic JSON handling. But if you want to be lazier, you can use a third-party json library; common ones are demjson and simplejson.

Between the two, simplejson is better both in module import speed and in encoding and decoding speed, and its compatibility is also better. So if you want to use a third-party library, you can go with simplejson.
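A minimal sketch of the drop-in usage, falling back to the standard library when simplejson is not installed; the data here is only an example:

try:
    import simplejson as json  # third-party, faster encoding and decoding
except ImportError:
    import json                # the built-in library is enough for basic work

data = {'game': 'LOL', 'viewers': 102400}
text = json.dumps(data)
print(json.loads(text)['game'])  # LOL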

0x2 Determine the corpus source

With the weapons ready, the next step is to decide the crawling direction.

Take an e-sports corpus as an example: we are going to crawl text related to e-sports. The e-sports platforms we are most familiar with are Penguin E-sports, Penguin E-sports and Penguin E-sports (wink), so we use the games live-streamed on Penguin E-sports as the data source to crawl.

We log in to the official website of Penguin E-sports and open the game list page. There are many games on the page, and typing out these game names by hand is obviously not worth the effort, so we start the first step of our crawler: crawling the game list.

import requests
from lxml import etree

# Update the game list
def _updateGameList():
    # HTTP request headers used to disguise the request as a browser
    heads = {
        'Connection': 'Keep-Alive',
        'Accept': 'text/html, application/xhtml+xml, */*',
        'Accept-Language': 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3',
        'Accept-Encoding': 'gzip, deflate',
        'User-Agent': 'Mozilla/6.1 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko'
    }
    # Game list page to crawl
    url = 'https://egame.qq.com/gamelist'

    # Fetch the html; the connection times out after 10 seconds
    res = requests.get(url, headers=heads, verify=False, timeout=10)
    # Set the encoding to utf-8 to avoid garbled text
    res.encoding = 'utf-8'
    # Build the html into an XPath document
    root = etree.HTML(res.content)
    # Use XPath syntax to get the game names
    gameList = root.xpath("//ul[@class='livelist-mod']//li//p//text()")
    # Print the game names we crawled
    print(gameList)

Once we have these dozens of game names, the next step is to crawl the article corpus for them. That raises a question: which site should we crawl the guides for these games from? TapTap? 17173? Some other game portal? After analyzing these sites, we find that they only have article corpora for a few popular games; for unpopular or low-heat games such as "Soul Chips", "Miracle: Awaken" and "Final Destination", it is hard to find a large number of articles on them, as shown in the figure.

We can see that the article corpora for "Miracle: Awaken" and "Soul Chips" are very small and do not meet our requirements. So is there a more general resource site with an incomparably rich corpus of articles that can satisfy our needs?

Actually, if we calm down and think about it, there is such a resource site and we use it every day: Baidu. We search Baidu News for a given game and get a list of search results, and the pages behind the links in that list are almost all strongly related to the search keyword, which easily solves the problem of insufficient data sources.
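Collecting those Baidu News result links is itself straightforward. A rough sketch follows; the query parameters and the XPath expression are assumptions that would have to be checked against the real result page:

import requests
from lxml import etree

def baidu_news_links(keyword):
    # Assumed query parameters for a Baidu News search; verify them in a browser first
    url = 'https://www.baidu.com/s'
    params = {'tn': 'news', 'word': keyword}
    heads = {'User-Agent': 'Mozilla/5.0'}
    res = requests.get(url, params=params, headers=heads, timeout=10)
    root = etree.HTML(res.content)
    # Assumed XPath: collect the outgoing links of the result titles
    return root.xpath("//h3/a/@href")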

But now a new and much harder problem appears: how do we grab the article content of an arbitrary web page? Every website has a different page structure, we cannot predict which sites we will end up crawling, and we certainly cannot write a dedicated crawler for every site: the workload would be unimaginable. Yet we also cannot crudely grab all the text on a page; training on such a corpus would be a nightmare.

After wrestling with various websites, searching for information and thinking it over, the author finally found a fairly general scheme. Here is the idea.

0x3 Crawl the article corpus of any website

01| Extraction method

1) Text extraction based on the DOM tree

2) Finding text blocks based on web page segmentation

3) Text extraction based on tag windows

4) Extraction based on data mining or machine learning

5) Text extraction based on the line-block distribution function

02| Extraction principle

Seeing these methods listed you may be a little confused: how exactly does each of them extract the text? Let me go through them one by one.

1) Text extraction based on the DOM tree:

This method first normalizes the HTML and builds a DOM tree from it, then exhaustively traverses the DOM, comparing and identifying all kinds of non-text information, including advertisements, links and unimportant nodes. Once the non-text information is removed, what remains is naturally the body text.
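As a minimal sketch of this idea, here is what the extraction could look like with lxml building the DOM tree; the list of "non-text" tags is only an assumption for illustration:

from lxml import etree

def dom_extract(html):
    root = etree.HTML(html)
    # Drop nodes that are usually not body text: scripts, styles, navigation and links
    # (for simplicity the tail text of the removed nodes is dropped as well)
    for node in root.xpath("//script | //style | //nav | //a"):
        node.getparent().remove(node)
    # What remains in the tree is treated as the text information
    return "".join(root.itertext())

How well this works depends entirely on how well that tag list matches the page, which leads straight to the weaknesses below.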

But this approach has two problems:

It relies heavily on the HTML being well structured; if we crawl a page that is not written according to the W3C standard, this method is not very applicable.

Building and traversing the tree has high time and space complexity, and the way the tree is traversed also has to vary with the different HTML tags.

2) Finding text blocks based on web page segmentation:

One approach uses the dividing lines in the HTML tags together with some visual information (such as text color, font size, text information and so on).

This method has a problem:

The HTML of different websites is styled differently and cannot be unified, so generality cannot be guaranteed.

3) Text extraction based on tag windows:

Let's start with a concept: the tag window. A pair of tags together with the text they enclose form a tag window (for example, in <h1>I am h1</h1>, "I am h1" is the content of the tag window). The method takes out the text of every tag window.

This method first takes the article title and all the tag windows in the HTML and segments them into words. It then computes the word distance L between the title's word sequence and each tag window's word sequence; if L is smaller than a threshold, the text in that tag window is considered part of the body.
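A minimal sketch of the idea, assuming a crude whitespace word split and a simple set-overlap distance in place of a real word segmenter and a proper word-distance measure; the threshold is likewise only an illustration:

import re

def tag_windows(html):
    # Every pair of tags with the text they enclose forms a tag window
    return re.findall(r'<(\w+)[^>]*>([^<]+)</\1>', html)

def word_distance(title, text):
    # Stand-in distance: 1 minus the overlap of the two word sets
    a, b = set(title.split()), set(text.split())
    return 1 - len(a & b) / max(len(a | b), 1)

def extract_body(html, title, threshold=0.8):
    body = []
    for tag, text in tag_windows(html):
        # A window whose text is close enough to the title is treated as body text
        if word_distance(title, text) < threshold:
            body.append(text.strip())
    return "\n".join(body)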

Although this method looks good, it also has problems:

All the text on the page has to be segmented into words, which is not efficient.

The word-distance threshold is hard to determine, and different articles need different thresholds.

4) Extraction based on data mining or machine learning

Train on a large amount of data and let the machine learn to extract the body text.

This method is certainly excellent, but it needs HTML together with the corresponding body text as training data. We will not discuss it here.

5) Text extraction based on the line-block distribution function

For any web page, the body text and the tags are always mixed together. The core of this method has two highlights: the density of the text area and the length of the line blocks. The body of a web page is one of the areas where text information is most densely distributed, but it is not necessarily the largest one (comments can be long and the body short), so the line-block length is used as an additional criterion.

Implementation idea:

First strip the HTML of its tags, keeping only the text, and keep the blank position information of every line left behind after the tags are removed; we call each line a Ctext;

For each Ctext, take the surrounding k lines (k < 5) and call them together a Cblock;

Remove all blanks from each Cblock; the total length of the remaining text is called Clen;

Set up a coordinate system with the Ctext line number on the horizontal axis and each line's Clen on the vertical axis.

Take this page as an example: www.gov.cn/ldhd/2009-1… The body of the page runs from line 145 to line 182.

As can be seen from the figure above, the correct body area is a continuous region containing the maximum values of the distribution function, and this region usually starts with a sudden rise and ends with a sudden drop. The problem of extracting a web page's body is therefore turned into finding two boundary points on the line-block distribution function; the region between these two boundary points contains the longest line blocks of the current page and is continuous.

A large number of experiments show that this method extracts the body of Chinese web pages with high accuracy. Its advantage is that the line-block function does not depend on the HTML code or on particular HTML tags; it is simple to implement and highly accurate.

The main logic is roughly as follows (the pre-processing steps here are a simplified sketch of the idea described above):

import re

# Assume content is the html you have already fetched
# Strip the tags and keep the text of every line; each line is a Ctext
text_only = re.sub(r'(?is)<(script|style)[^>]*>.*?</\1>|<[^>]*>', '', content)
lines = [re.sub(r'\s+', '', line) for line in text_only.split('\n')]

# For each Ctext take the surrounding k lines (k < 5), called a Cblock;
# the total text length of a Cblock with all blanks removed is its Clen
blocksWidth = 3
Ctext_len = [sum(len(lines[i + j]) for j in range(blocksWidth))
             for i in range(len(lines) - blocksWidth)]

max_text_len = 50               # a Cblock longer than this may start the body
boolstart, boolend = False, False
start, end = 0, 0
main_text = []

for i in range(len(Ctext_len) - 3):
    # The Cblock is longer than the threshold and the body has not started yet
    if Ctext_len[i] > max_text_len and (not boolstart):
        # The 3 Cblocks below are not 0, so treat this as the start of the body
        if Ctext_len[i + 1] != 0 or Ctext_len[i + 2] != 0 or Ctext_len[i + 3] != 0:
            boolstart = True
            start = i
            continue
    if boolstart:
        # One of the 2 Cblocks below is 0, so the body ends here
        if Ctext_len[i] == 0 or Ctext_len[i + 1] == 0:
            end = i
            boolend = True
    tmp = []

    # Judge whether there is text between start and end
    if boolend:
        for ii in range(start, end + 1):
            if len(lines[ii]) < 5:      # skip lines that are too short
                continue
            tmp.append(lines[ii] + "\n")
        main_text.append("".join(tmp))
        boolstart, boolend = False, False

0x4 Conclusion

At this point we can get the article corpus from any page, but this is only the beginning. After obtaining the corpus we still need to clean it, segment it into words, do part-of-speech tagging and so on before we have a corpus that can really be used.

copyright notice
author [Mengy7762 Mengya]. Please keep the original link when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201291407233733.html
