Beautiful Soup, the Python Crawler's Weapon: Introduction, Detailed Explanation, and Hands-On Summary
2022-01-29 21:32:05 【Skin shrimp】
I'm Code Pipi Shrimp, a simple and playful guy. Like most of my friends, I enjoy music and games, and beyond that I've developed an interest in writing. It's a long road, so let's keep working at it together.
If you find this useful, please give me a follow.
1、Introduction
Beautiful Soup is a Python library for extracting data from HTML or XML files. Working through the parser of your choice, it provides idiomatic ways of navigating, searching, and modifying the parse tree. It can save you hours or even days of work.
2、Parsers
Beautiful Soup is a flexible and convenient web-page parsing library. It is efficient and supports multiple parsers, so you can extract information from a page easily without writing regular expressions.
Parser | Usage | Advantages | Disadvantages
---|---|---|---
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents | Versions before Python 2.7.3 / 3.2.2 handle malformed documents poorly
lxml HTML parser | BeautifulSoup(markup, "lxml") | Very fast; tolerant of malformed documents | Requires a C library to be installed
lxml XML parser | BeautifulSoup(markup, "xml") | Very fast; the only parser that supports XML | Requires a C library to be installed
html5lib | BeautifulSoup(markup, "html5lib") | Best fault tolerance; parses documents the way a browser does; generates valid HTML5 | Very slow; external Python dependency
3、Detailed explanation
3.1、Tag (tag selector)
==Selecting elements==
from bs4 import BeautifulSoup
html = ''' <html><head><title>The Dormouse's story</title></head> <body> <p class="title"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> '''
# Parse the HTML with BeautifulSoup
# Here we use the Python standard library parser, html.parser
soup = BeautifulSoup(html, "html.parser")
# Get the <title> tag from the HTML
print(soup.title)
Note: by default only the first match is returned. If the page contains several identical tags and you want a later one, locate it by its class value or some other attribute; we will come back to this below.
==Getting the name==
print(soup.title.name)
==Getting attributes==
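Attributes can be read with subscript notation or with .get(); a minimal sketch, using a trimmed-down version of the sample document above:

```python
from bs4 import BeautifulSoup

html = '<p class="title"><a href="http://example.com/elsie" class="sister" id="link1">Elsie</a></p>'
soup = BeautifulSoup(html, "html.parser")

print(soup.a["id"])      # subscript access to the "id" attribute: link1
print(soup.a.get("id"))  # .get() returns None instead of raising KeyError
print(soup.a.attrs)      # all attributes as a dict
```

Note that multi-valued attributes such as class come back as a list, e.g. ['sister'].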
==Getting text content==
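Text is usually read through .string (for a tag with a single text child) or .text (all descendant text concatenated); a minimal sketch:

```python
from bs4 import BeautifulSoup

html = """<p class="title"><b>The Dormouse's story</b></p>"""
soup = BeautifulSoup(html, "html.parser")

print(soup.b.string)  # text of a tag with a single text child
print(soup.p.text)    # concatenated text of all descendants
```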
==Nested selection==
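Tag attributes can be chained, since each step returns another Tag; a minimal sketch:

```python
from bs4 import BeautifulSoup

html = """<html><head><title>The Dormouse's story</title></head><body></body></html>"""
soup = BeautifulSoup(html, "html.parser")

# head is selected first, then the title tag nested inside it
print(soup.head.title)
print(soup.head.title.string)
```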
==Child nodes==
A tag's .contents attribute returns its child nodes as a list, while the .children generator lets you loop over those same child nodes.
from bs4 import BeautifulSoup
html = ''' <html><head><title>The Dormouse's story</title></head> <body> <p class="title"> <b>The Dormouse's story</b> </p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> '''
soup = BeautifulSoup(html, "html.parser")
# .contents returns the children as a list
print(soup.p.contents)
print("=" * 30)
# .children is a generator over the same nodes
for i in soup.p.children:
    print(i)
==Parent nodes==
The .parent attribute returns an element's parent node.
The .parents attribute walks recursively through all of the element's ancestors.
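A minimal sketch of .parent and .parents, using a stripped-down version of the sample document:

```python
from bs4 import BeautifulSoup

html = '<html><body><p class="story"><a id="link1">Elsie</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

print(soup.a.parent.name)  # the direct parent: p
# .parents walks up through every ancestor to the document root
print([tag.name for tag in soup.a.parents])  # ['p', 'body', 'html', '[document]']
```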
==Sibling nodes==
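The original has no example for this section, so here is a sketch: .next_sibling and .previous_sibling step one node at a time (and may land on text nodes such as punctuation or whitespace), while find_next_sibling and find_previous_sibling skip straight to the next matching tag:

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Elsie</a>,<a id="link2">Lacie</a>,<a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")

link2 = soup.find(id="link2")
print(repr(link2.previous_sibling))            # the "," text node, not a tag
print(link2.find_previous_sibling("a")["id"])  # link1
print(link2.find_next_sibling("a")["id"])      # link3
```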
3.2、Standard selectors (find, find_all)
3.2.1、find_all()
find_all( name , attrs , recursive , string , **kwargs )
The find_all() method searches all the tag children of the current tag and checks each one against the filter conditions.
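A minimal sketch of the common call patterns (name, limit, and string are all part of the documented signature):

```python
from bs4 import BeautifulSoup

html = '''<p class="story">
<a class="sister" id="link1">Elsie</a>
<a class="sister" id="link2">Lacie</a>
</p>'''
soup = BeautifulSoup(html, "html.parser")

print(soup.find_all("a"))            # filter by tag name
print(soup.find_all("a", limit=1))   # stop after the first match
print(soup.find_all(string="Elsie")) # match text nodes instead of tags
```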
==keyword arguments==
If an argument's name is not one of the built-in search parameter names, the search treats it as a filter on a tag attribute of that name. For example, passing an id argument makes Beautiful Soup search every tag's "id" attribute.
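For example (note that class is a reserved word in Python, so Beautiful Soup uses class_ for the "class" attribute):

```python
from bs4 import BeautifulSoup

html = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
soup = BeautifulSoup(html, "html.parser")

print(soup.find_all(id="link1"))       # searches every tag's "id" attribute
print(soup.find_all(class_="sister"))  # class_ filters on the "class" attribute
```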
==Searching by custom attributes: attrs==
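attrs takes a dict mapping attribute names to values. It is the way to filter on attribute names that are not valid Python identifiers, such as data-* attributes; a minimal sketch:

```python
from bs4 import BeautifulSoup

html = '<div data-role="nav">menu</div><div data-role="footer">about</div>'
soup = BeautifulSoup(html, "html.parser")

# "data-role" cannot be passed as a keyword argument, so use attrs
print(soup.find_all(attrs={"data-role": "nav"}))
```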
3.2.2、find()
find( name , attrs , recursive , string , **kwargs )
find returns a single element, while find_all returns all matching elements.
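A side-by-side sketch of the difference; note that find returns None when nothing matches, while find_all returns an empty list:

```python
from bs4 import BeautifulSoup

html = '<a id="link1">Elsie</a><a id="link2">Lacie</a>'
soup = BeautifulSoup(html, "html.parser")

print(soup.find("a"))      # first matching Tag only
print(soup.find_all("a"))  # every match, always in a list
print(soup.find("table"))  # None -- no exception is raised
```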
3.3、CSS selectors (select)
==select==
select returns every element matching a CSS selector.
from bs4 import BeautifulSoup
html = ''' <html><head><title>The Dormouse's story</title></head> <body> <p class="title"> <b>The Dormouse's story</b> </p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> '''
soup = BeautifulSoup(html, "html.parser")
print(soup.select("p b"))
print(soup.select("p a"))
print(soup.select("head title"))
==select_one==
select_one returns only the first element that satisfies the selector.
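A minimal sketch comparing select and select_one on the same selector:

```python
from bs4 import BeautifulSoup

html = '<p><a class="sister" id="link1">Elsie</a><a class="sister" id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

print(soup.select("a.sister"))      # every match, as a list
print(soup.select_one("a.sister"))  # only the first match
print(soup.select_one("#link2").text)
```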
4、Hands-on example
This example uses the Baidu home page.
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36"
}
url = "https://www.baidu.com"
response = requests.get(url=url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
# Get every tag whose class is "mnav c-font-normal c-color-t", then iterate over them
divs = soup.find_all(class_="mnav c-font-normal c-color-t")
for div in divs:
    print(div)
    print("=" * 40)
As the output shows, it works.
Next, let's extract each module's URL and text.
for div in divs:
    print(div['href'])
    print(div.text)
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36"
}
url = "https://www.baidu.com"
response = requests.get(url=url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
# Method 1: locate the element by class, then read its child nodes via .contents
a_data = soup.find(class_="hot-title").contents
print(a_data[0].text)
# Method 2: locate by class with find, then use find again to grab the div nested inside it
a_data2 = soup.find(class_="hot-title").find("div")
print(a_data2.text)
Finally
I'm Code Pipi Shrimp, a shrimp lover who enjoys sharing knowledge. I'll keep publishing posts that are useful to you, and I look forward to your follow!!!
Creating content isn't easy. If this post helped you, please like, comment, and share. Thank you for your support, and see you next time~~~
Copyright notice
Author: [Skin shrimp]. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201292132037504.html