How Python parses web pages using BS4

2022-01-30 04:31:22 Clever crane

bs4 is the package name of Beautiful Soup 4, a Python library that provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. With it, parsing HTML takes very little code.

This article briefly introduces the bs4 functions you will use most often; they cover most situations.

1. Locating tags

Before crawling, you need to locate the tag that holds the data. Open the browser's developer tools with F12, click the element picker button, and then click the part of the page you are interested in; the tools jump straight to the corresponding tag in the page source. I won't go into detail here. Try it yourself: it is simple and very handy.

(Figure: Developer tools)

Now for the main topic: how to get the tag you just located from code.

The two BeautifulSoup functions to know are find() and find_all().

First, look at the tag you want: what kind of tag is it, and does it have a class or id attribute? (If not, look at its parent tag and keep going up until you find one that does.) Filtering on class or id leaves very few false matches, and with a little luck the first result is the one you want.

For example, to get the div whose id is ozoom (the one the arrow points to in the figure above), we can write:

import bs4

# html is the page content returned by the earlier request
bsobj = bs4.BeautifulSoup(html, 'html.parser')

# Find the div tag whose id is ozoom (search by id)
div = bsobj.find('div', attrs={'id': 'ozoom'})

# Inside that div, find the div tag whose class is list_t (search by class)
title = div.find('div', attrs={'class': 'list_t'})
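The snippet above depends on a page fetched earlier. As a self-contained sketch of the same two lookups, the example below parses a small HTML string made up for illustration (only the ozoom and list_t names come from the article). It also shows that find() returns None when nothing matches, which is worth checking before chaining a second find().

```python
import bs4

# Hypothetical page fragment, invented for this example.
html = """
<html><body>
  <div id="header"><div class="list_t">Site banner</div></div>
  <div id="ozoom">
    <div class="list_t">Article title</div>
    <p>Body text</p>
  </div>
</body></html>
"""

bsobj = bs4.BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None when nothing matches.
div = bsobj.find("div", attrs={"id": "ozoom"})
missing = bsobj.find("div", attrs={"id": "no-such-id"})

# Searching inside `div` skips the other list_t block in the header.
title = div.find("div", attrs={"class": "list_t"}) if div is not None else None

title_text = title.text if title is not None else ""
print(title_text)
```

Calling find() on None raises an AttributeError, so the guard matters whenever the page layout might differ from what you expect.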

Note: if the tag has an id attribute, search by id whenever you can, because an id is supposed to be unique within a page. Before searching by class, press Ctrl + F in the browser's page source view and check how many tags share that class. If there are many, find the parent tag first and search within it.
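The same check can also be done in code. The sketch below, again on a made-up HTML string, counts how many tags share a class across the whole page and then within one parent tag, to show how much narrowing helps.

```python
import bs4

# Hypothetical markup: three tags share the class list_t.
html = """
<div id="sidebar">
  <div class="list_t">Related</div>
  <div class="list_t">Popular</div>
</div>
<div id="ozoom">
  <div class="list_t">Article title</div>
</div>
"""

bsobj = bs4.BeautifulSoup(html, "html.parser")

# Across the whole page the class alone is ambiguous.
page_count = len(bsobj.find_all("div", attrs={"class": "list_t"}))

# Inside the parent found by id, only one candidate remains.
parent = bsobj.find("div", attrs={"id": "ozoom"})
narrowed_count = len(parent.find_all("div", attrs={"class": "list_t"}))

print(page_count, narrowed_count)
```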

Next is the find_all function, which is for finding many tags of the same kind at once, for example the items of a list.

Each li tag in the list holds one piece of data, and we need all of them. The find function returns only one li tag per call, so we use find_all instead: it collects every tag that matches and returns them together as a list.

The li tags have neither an id nor a class, and the page contains many unrelated li tags, so we first find their parent tag to narrow the search. Once we have the div whose id is titleList, every li inside it is one we need, so a single find_all call gets them all.

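import bs4

# html is the content of the target page (fetchUrl and pageUrl are defined elsewhere)
html = fetchUrl(pageUrl)
bsobj = bs4.BeautifulSoup(html, 'html.parser')

# Narrow the search to the div whose id is titleList, then collect every li in it
pDiv = bsobj.find('div', attrs={'id': 'titleList'})
titleList = pDiv.find_all('li')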
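To try this without a network request, here is a minimal sketch that runs the same find plus find_all combination on an invented HTML string. The navigation list and the headline text are assumptions for illustration; only the titleList id comes from the article.

```python
import bs4

# Hypothetical page: a navigation list plus the list we actually want.
html = """
<ul id="nav"><li>Home</li><li>About</li></ul>
<div id="titleList">
  <ul>
    <li><a href="/a1.html">First headline</a></li>
    <li><a href="/a2.html">Second headline</a></li>
    <li><a href="/a3.html">Third headline</a></li>
  </ul>
</div>
"""

bsobj = bs4.BeautifulSoup(html, "html.parser")

# Searching the whole page also picks up the navigation items.
all_li = bsobj.find_all("li")

# Searching inside the parent div returns only the items we need.
pDiv = bsobj.find("div", attrs={"id": "titleList"})
titleList = pDiv.find_all("li")

# The result behaves like a list: it has a length and can be looped over.
headlines = [li.text.strip() for li in titleList]
print(len(all_li), len(titleList), headlines)
```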

By combining find and find_all you can handle almost any HTML page. It really is a case of one trick that works everywhere.

2. Extracting the data

Once you have found the tag, how do you get the data out of it?

The data usually sits in one of two places.

<!-- Case 1: in the tag's content -->
    <p> This is data. This is data </p>

<!-- Case 2: in one of the tag's attributes -->
    <a href="/xxx.xxx_xx_xx.html"></a>

The first case is simple: pTip.text is enough (pTip is the p tag we found earlier).
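As a small sketch of the first case, the example below reads the text of a p tag from a made-up HTML string. The nested b tag is an assumption added to show that .text also includes the text of child tags, and that surrounding whitespace is kept unless you strip it.

```python
import bs4

# Hypothetical markup; the <b> tag is added to show nested text.
html = "<div><p> This is data. <b>This is data</b> </p></div>"

bsobj = bs4.BeautifulSoup(html, "html.parser")
pTip = bsobj.find("p")

# .text joins the text of the tag and all of its descendants, whitespace included.
raw = pTip.text

# Strip the result yourself, or let get_text() strip each piece and join with a space.
clean = raw.strip()
joined = pTip.get_text(" ", strip=True)

print(repr(raw), repr(clean), repr(joined))
```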

In the second case it depends on which attribute holds the data. For example, to get the link in the href attribute of the a tag above, use link = aTip["href"] (aTip is the a tag we found earlier).
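A short sketch of the second case follows. The second a tag and the base URL are assumptions for illustration. Indexing with ["href"] raises a KeyError when the attribute is absent, so .get("href") is the safer choice when some tags may lack it. Because the link in the example is relative, the sketch also resolves it against the page address with urljoin from the standard library.

```python
from urllib.parse import urljoin

import bs4

# Hypothetical markup: one link with an href, one anchor without.
html = '<a href="/xxx.xxx_xx_xx.html">Read more</a><a name="top">No link here</a>'

bsobj = bs4.BeautifulSoup(html, "html.parser")
aTip, anchor = bsobj.find_all("a")

# Dictionary-style access, as in the article.
link = aTip["href"]

# .get() returns None instead of raising when the attribute is missing.
missing = anchor.get("href")

# Assumed address of the page the link was found on.
base_url = "https://example.com/news/index.html"
full_link = urljoin(base_url, link)

print(link, missing, full_link)
```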

Copyright notice
Author: Clever crane. Please include the original link when reprinting. Thank you.