
Beautiful Soup, a powerful tool for Python crawlers: introduction, detailed explanation, and hands-on summary!

2022-01-29 21:32:05 Skin shrimp


I'm Code Pipi shrimp, a simple, fun, and slightly goofy guy. Like most of my friends, I enjoy listening to music and playing games, and I'm also interested in writing. It's a long road, so let's work hard together.

If you find this article useful, please follow me.


1、 Introduction

Beautiful Soup is a Python library for extracting data from HTML or XML files. It works with your favorite parser to provide convenient ways of navigating, searching, and modifying the document. Beautiful Soup can save you hours or even days of work.


2、 Parsing library

**A flexible and convenient web page parsing library that processes documents efficiently and supports multiple parsers.**

**It lets you extract information from web pages easily without writing regular expressions.**

| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; good document fault tolerance | Poor document fault tolerance in versions before Python 2.7.3 or 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; good document fault tolerance | Requires a C library to be installed |
| lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires a C library to be installed |
| html5lib | BeautifulSoup(markup, "html5lib") | Best fault tolerance; parses the document the way a browser does; produces HTML5-format documents | Slow; requires installing the external html5lib package |
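
Since lxml and html5lib are optional installs, a script can fall back to the built-in parser when they are missing. The following is only a minimal sketch: `make_soup` is a hypothetical helper written for this illustration, and the preference order is an assumption, not a recommendation from the library.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Hypothetical helper for illustration; returns (soup, parser_name).
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # Beautiful Soup raises FeatureNotFound when the requested
            # parser is not installed, so try the next candidate.
            continue
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>Hello<b>world")
print(used)           # name of the parser that was actually used
print(soup.b.string)  # text of the <b> tag after the parser repaired the markup
```

Because html.parser ships with Python, the final fallback should always succeed.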

3、 Detailed explanation

3.1、Tag (tag selector)

== Select an element ==

import requests

from bs4 import BeautifulSoup

html = ''' <html><head><title>The Dormouse's story</title></head> <body> <p class="title"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> '''
# Parse the HTML with BeautifulSoup
# The parser used here is Python's built-in html.parser
soup = BeautifulSoup(html, "html.parser")

# Get the <title> tag from the HTML
print(soup.title)


Note: by default only the first matching tag is returned. If the document contains several identical tags and you want one of the later ones, locate it by its class value or some other method, which is covered later.

== Get the name ==

print(soup.title.name)

== Get attributes ==
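
A tag's attributes behave much like a dictionary. Here is a minimal sketch, assuming a single link taken from the sample document used in this article; the variable name `link` is only illustrative.

```python
from bs4 import BeautifulSoup

# One link taken from the sample document used in this article
html = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

print(link.attrs)         # all attributes of the tag as a dict
print(link["href"])       # dictionary-style access; raises KeyError if absent
print(link.get("title"))  # .get() returns None for a missing attribute
print(link["class"])      # class can hold several values, so it is a list
```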

== Get content ==
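
A tag's text can be read with `.string` or `.get_text()`. The sketch below assumes a shortened copy of the sample document; note that `.string` is None when a tag has more than one child.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document (an assumption for this sketch)
html = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time; <a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string   # the single string inside <title>
bold_text = soup.p.string        # first <p> has exactly one child <b>, so its string is passed through
story = soup.find("p", class_="story")
story_string = story.string      # None, because this <p> has several children
story_text = story.get_text()    # all nested text joined into one str

print(title_text)
print(bold_text)
print(story_string)
print(story_text)
```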


== Nested selection ==
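
Each attribute-style lookup returns another Tag, so lookups can be chained. A minimal sketch, assuming a shortened copy of the sample document:

```python
from bs4 import BeautifulSoup, Tag

# Shortened copy of the sample document (an assumption for this sketch)
html = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p></body></html>"""
soup = BeautifulSoup(html, "html.parser")

head_title = soup.head.title   # <title> looked up inside <head>
bold = soup.body.p.b           # <b> inside the first <p> inside <body>

print(isinstance(head_title, Tag))  # every step of the chain is a Tag
print(head_title.string)
print(bold.string)
```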

== Child node ==

A tag's .contents attribute returns its child nodes as a list. A tag's .children generator lets you loop over its child nodes.

import requests

from bs4 import BeautifulSoup

html = ''' <html><head><title>The Dormouse's story</title></head> <body> <p class="title"> <b>The Dormouse's story</b> </p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> '''
soup = BeautifulSoup(html, "html.parser")

print(soup.p.contents)
print("="*30)
for i in soup.p.children:
    print(i)

== Parent node ==

The .parent attribute returns an element's parent node.
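
A minimal sketch of `.parent`, assuming a shortened copy of the sample document. Strings have a parent too: the tag that contains them.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document (an assumption for this sketch)
html = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p></body></html>"""
soup = BeautifulSoup(html, "html.parser")

bold_parent = soup.b.parent                # the <p> that wraps <b>
title_parent = soup.title.parent           # the <head> that wraps <title>
string_parent = soup.title.string.parent   # the <title> that holds the text

print(bold_parent.name)
print(title_parent.name)
print(string_parent.name)
```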


An element's .parents attribute iterates over all of its ancestors, from the direct parent up to the document itself.
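
A minimal sketch of `.parents`, assuming a shortened copy of the sample document. The last ancestor is the BeautifulSoup object itself, whose name is `[document]`.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document (an assumption for this sketch)
html = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p></body></html>"""
soup = BeautifulSoup(html, "html.parser")

# Walk upward from <b>, collecting the name of every ancestor
ancestors = [parent.name for parent in soup.b.parents]
print(ancestors)
```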


== Sibling nodes ==
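
Siblings are nodes that share the same parent. The sketch below assumes one paragraph from the sample document with the attributes trimmed for brevity. Keep in mind that the text between tags also counts as a sibling, so `find_next_sibling` and `find_previous_sibling` are often more convenient when you want tags only.

```python
from bs4 import BeautifulSoup

# One paragraph from the sample document, attributes trimmed (an assumption)
html = ('<p class="story"><a id="link1">Elsie</a>, '
        '<a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>')
soup = BeautifulSoup(html, "html.parser")
link2 = soup.find(id="link2")

before = link2.previous_sibling             # the text node ", "
after = link2.next_sibling                  # the text node " and "
next_tag = link2.find_next_sibling("a")     # skips text, returns the next <a>
prev_tag = link2.find_previous_sibling("a") # skips text, returns the previous <a>

print(repr(before), repr(after))
print(next_tag["id"], prev_tag["id"])
print(list(link2.next_siblings))            # everything after link2 under the same parent
```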


3.2、 Standard selectors (find, find_all)

3.2.1、find_all()

find_all(name, attrs, recursive, string, **kwargs)
The find_all() method searches all tag descendants of the current tag and returns those that match the filter conditions.
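
A minimal sketch of the most common `find_all()` filters, assuming two paragraphs from the sample document with the attributes trimmed for brevity:

```python
from bs4 import BeautifulSoup

# Two paragraphs from the sample document, attributes trimmed (an assumption)
html = ('<p class="title"><b>The Dormouse\'s story</b></p>'
        '<p class="story"><a id="link1">Elsie</a>, '
        '<a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>')
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")                      # every <a> tag
names = [a.string for a in links]               # their text
mixed = soup.find_all(["a", "b"])               # a list matches any of the names
first_two = soup.find_all("a", limit=2)         # stop after two matches
direct = soup.find_all("a", recursive=False)    # direct children only: no <a> at top level
strings = soup.find_all(string="Elsie")         # search text nodes instead of tags

print(names)
print(len(mixed), len(first_two), len(direct))
print(strings)
```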


== Keyword arguments ==

If a keyword argument's name is not one of the built-in search parameter names, it is treated as a filter on the tag attribute of that name. For example, if you pass an argument named id, Beautiful Soup filters on each tag's "id" attribute.
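
A minimal sketch of keyword-argument filters, assuming the three links from the sample document. Because `class` is a reserved word in Python, Beautiful Soup uses `class_` instead.

```python
import re

from bs4 import BeautifulSoup

# The three links from the sample document
html = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>"""
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find_all(id="link2")                  # filter on the id attribute
by_class = soup.find_all(class_="sister")          # class is a Python keyword, hence class_
by_href = soup.find_all(href=re.compile("elsie"))  # a regular expression as the filter
with_id = soup.find_all(id=True)                   # any tag that has an id attribute at all

print(by_id)
print(len(by_class), len(with_id))
print(by_href[0]["id"])
```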


== Custom attribute lookup: attrs ==
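
Some attribute names cannot be written as keyword arguments: `data-foo` is not a valid Python identifier, and `name` collides with the first parameter of `find_all()`. In those cases the filter can be passed as a dictionary through `attrs`. A minimal sketch, assuming a small hypothetical HTML fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment written for this sketch
html = ('<div data-foo="value">foo!</div>'
        '<div data-foo="other">bar</div>'
        '<input name="email"/>')
soup = BeautifulSoup(html, "html.parser")

# data-foo contains a hyphen, so it cannot be a keyword argument
by_data = soup.find_all(attrs={"data-foo": "value"})
# name= would be read as the tag name, so use attrs for the HTML name attribute
by_name = soup.find_all(attrs={"name": "email"})

print(by_data)
print(by_name)
```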



3.2.2、find()

find(name, attrs, recursive, string, **kwargs)
find returns a single element (the first match, or None if nothing matches); find_all returns all matching elements.
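
A minimal sketch comparing the two, assuming one paragraph from the sample document with the attributes trimmed. Since `find()` returns None when nothing matches, check the result before using it.

```python
from bs4 import BeautifulSoup

# One paragraph from the sample document, attributes trimmed (an assumption)
html = ('<p class="story"><a id="link1">Elsie</a>, '
        '<a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>')
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")       # a single Tag: the first match
all_links = soup.find_all("a")    # a list-like ResultSet with every match
missing = soup.find("table")      # None, because there is no <table>

print(first_link["id"])
print(len(all_links))
if missing is None:
    print("no <table> in this document")
```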



3.3、CSS selectors (select)

==select==

select returns every element that matches the CSS selector.

import requests

from bs4 import BeautifulSoup

html = ''' <html><head><title>The Dormouse's story</title></head> <body> <p class="title"> <b>The Dormouse's story</b> </p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> '''
soup = BeautifulSoup(html, "html.parser")

print(soup.select("p b"))
print(soup.select("p a"))
print(soup.select("head title"))

==select_one==

select_one returns only the first element that matches the selector.
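
A minimal sketch of `select_one()`, assuming one paragraph from the sample document. Like `find()`, it returns None when the selector matches nothing.

```python
from bs4 import BeautifulSoup

# One paragraph from the sample document (an assumption for this sketch)
html = ('<p class="story">'
        '<a class="sister" id="link1">Elsie</a>, '
        '<a class="sister" id="link2">Lacie</a> and '
        '<a class="sister" id="link3">Tillie</a></p>')
soup = BeautifulSoup(html, "html.parser")

first = soup.select_one("p.story a")   # first <a> inside <p class="story">
by_id = soup.select_one("#link2")      # CSS id selector
nothing = soup.select_one("table")     # None when nothing matches

print(first["id"])
print(by_id.string)
print(nothing)
```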



4、 Hands-on practice

This hands-on example uses the Baidu home page.


import requests

from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36"
}

url = "https://www.baidu.com"
response = requests.get(url=url,headers=headers)

soup = BeautifulSoup(response.text,"html.parser")

# Get all tags whose class is "mnav c-font-normal c-color-t" and loop over them
divs = soup.find_all(class_="mnav c-font-normal c-color-t")
for div in divs:
    print(div)
    print("="*40)

As you can see, it works.

Next, get the URL and text of each module.

for div in divs:
    print(div['href'])
    print(div.text)
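
Live pages change over time, and `tag['href']` raises KeyError when a matched tag has no href. Here is a slightly more defensive sketch of the same idea. Everything in it is an assumption for illustration: `extract_links` is a hypothetical helper, and the sample markup with its example.com URLs is a stand-in so that the code runs offline. It matches on the single class `mnav`, because searching for the full class string only matches when the attribute value is exactly that string, in that order.

```python
from bs4 import BeautifulSoup


def extract_links(html, css_class):
    """Return (href, text) pairs for every tag that carries `css_class`.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for tag in soup.find_all(class_=css_class):
        href = tag.get("href")  # None instead of KeyError when href is absent
        if href:
            pairs.append((href, tag.get_text(strip=True)))
    return pairs


# Hypothetical stand-in for a downloaded page, so the sketch runs offline
sample = """
<div id="nav">
  <a class="mnav c-font-normal c-color-t" href="http://news.example.com">News</a>
  <a class="mnav c-font-normal c-color-t" href="http://map.example.com">Maps</a>
  <span class="mnav">no link here</span>
</div>
"""
links = extract_links(sample, "mnav")
for href, text in links:
    print(href, text)
```

For a real page, the same function could be called with `response.text` from the requests call shown above.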

import requests

from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36"
}

url = "https://www.baidu.com"
response = requests.get(url=url,headers=headers)

soup = BeautifulSoup(response.text,"html.parser")

# Method 1
# Use contents to get the child nodes
a_data = soup.find(class_="hot-title").contents
print(a_data[0].text)

# Method 2
# First locate the element by its class with find, then call find again to get the div under it, which is what we need
a_data2 = soup.find(class_="hot-title").find("div")
print(a_data2.text)



Finally

I'm Code Pipi shrimp, a shrimp lover who enjoys sharing knowledge. I'll keep posting articles that I hope are useful to you, and I look forward to your follow!

Writing takes effort. If this post helped you, I'd appreciate a like, a favorite, and a follow. Thank you for your support, and see you next time!



copyright notice
Author: [Skin shrimp]. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201292132037504.html
