Python crawler from entry to mastery (III): Implementing a simple crawler
2022-01-31 17:37:53 【zhulin1028】
One. Probably the simplest crawler demo in history
The simplest crawler demo:
Our first crawler, written in just two lines of code:
import urllib.request  # Python3
print(urllib.request.urlopen(urllib.request.Request("http://example.webscraping.com")).read())
These two lines of code run normally under Python 3.6 and fetch the content of the page at example.webscraping.com.
Note: you can achieve the same thing with the requests library, again in two lines:
import requests #Python3
print(requests.get('http://example.webscraping.com').text)
If the requests library is not installed, install it with the command pip install requests.
Note: most of the code in this handout uses Python 3.6 as its blueprint. Appendix A of the handout contains a comparison table of the most important crawler libraries in Python 2 and Python 3; with that table you can easily port crawler code between Python 2 and Python 3.
Two. A review of the HTTP and HTTPS protocols
1. About URLs:
URL (short for Uniform / Universal Resource Locator): a uniform resource locator, a way to completely describe the address of a web page or other resource on the Internet.
Basic format: scheme://host[:port#]/path/…/[?query-string][#anchor]
scheme: the protocol (for example http, https, ftp)
host: the server's IP address or domain name
port#: the server's port (optional; if omitted, the protocol's default port is used, e.g. 80 for HTTP)
path: the path to the resource being accessed
query-string: parameters, i.e. data sent to the HTTP server
anchor: anchor (jumps to the specified anchor position within the page)
For example:
ftp://192.168.1.118:8081/index
The URL is the entry point of a crawler, so it is very important.
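These components can also be inspected programmatically. Below is a minimal sketch using the standard library's urllib.parse; the URL itself is a made-up illustration, not from the original text:

# Split a URL into the components listed above using the standard library.
from urllib.parse import urlparse

parts = urlparse("http://example.webscraping.com:8081/places/default/index?page=2#results")
print(parts.scheme)    # 'http'                    -> scheme (protocol)
print(parts.hostname)  # 'example.webscraping.com' -> host
print(parts.port)      # 8081                      -> port#
print(parts.path)      # '/places/default/index'   -> path
print(parts.query)     # 'page=2'                  -> query-string
print(parts.fragment)  # 'results'                 -> anchor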
2. The HTTP and HTTPS protocols
The HTTP protocol (HyperText Transfer Protocol, hypertext transfer protocol) is a method for publishing and receiving HTML pages. HTTP is an application-layer protocol; it is connectionless (each connection handles only one request) and stateless (each connection's transmission is independent).
HTTPS (HyperText Transfer Protocol over Secure Socket Layer) is simply the secure version of HTTP, with an SSL layer added beneath HTTP. HTTPS = HTTP + SSL (Secure Sockets Layer). It is mainly used as a secure transport protocol for the Web: it encrypts the network connection at the transport layer to protect data transmitted over the Internet.
HTTP's port number is 80; HTTPS's port number is 443.
3. The two common HTTP request methods:
GET: retrieves information from the server; the data travels in the URL, so it is exposed in transit (less secure), and its size is limited;
POST: sends data to the server in the request body, so it is not exposed in the URL, and in theory there is no size limit;
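To make the difference concrete, here is a small sketch using the requests library against httpbin.org, a public request-echo service (an assumption of this example, not mentioned in the original): a GET carries its parameters in the URL, a POST carries them in the request body.

import requests

r_get = requests.get("http://httpbin.org/get", params={"q": "python"})
print(r_get.url)               # http://httpbin.org/get?q=python -> data is visible in the URL

r_post = requests.post("http://httpbin.org/post", data={"q": "python"})
print(r_post.url)              # http://httpbin.org/post -> no data in the URL
print(r_post.json()["form"])   # {'q': 'python'} -> the data travelled in the request body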

4. About User-Agent
User Agent (user agent, abbreviated UA) is a special string header that enables the server to identify the client's operating system and version, CPU type, browser and version, browser rendering engine, browser language, browser plug-ins, and so on.
Let's take a look at what User-Agent our simplest crawler reports to the server when it runs. Through this example we find that a Python crawler carries a default User-Agent with a version number, which makes it easy to recognize as a program written in Python. So if we use the default User-Agent, anti-crawler programs can spot our Python crawler at a glance, which is bad for our crawler.
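As a quick check, here is a sketch assuming the public echo service httpbin.org is reachable (it simply reports back the User-Agent header it received):

import urllib.request

with urllib.request.urlopen("http://httpbin.org/user-agent") as resp:
    print(resp.read().decode("utf-8"))
# Typically prints something like: {"user-agent": "Python-urllib/3.6"}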
So, how do we modify this User-Agent to disguise our crawler?
# Header information of the HTTP request
from urllib import request

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"}
req = request.Request("http://www.sina.com.cn", headers=headers)
# Returns an http.client.HTTPResponse
response = request.urlopen(req)
5. HTTP Response status codes:
2xx means success; 3xx means a redirect;
4xx and 5xx mean something went wrong:
Note: the status code the server returns tells us whether our crawler is running normally.
When an abnormal error occurs: generally, on a 5xx error the crawler should go to sleep, since the server is down or overloaded; on a 4xx error you need to reconsider the crawler's crawl strategy, since perhaps the website has been updated or the crawler has been banned. In a distributed crawler system it is easier to spot such errors and adjust the crawler's strategy.
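A minimal sketch of this policy (the function name, retry count, and sleep time are illustrative assumptions, not from the original):

import time
import requests

def fetch(url, retries=3):
    for _ in range(retries):
        resp = requests.get(url)
        if resp.status_code == 200:            # success: return the page
            return resp.text
        if 500 <= resp.status_code < 600:      # 5xx: server trouble -> sleep, then retry
            time.sleep(60)
            continue
        if 400 <= resp.status_code < 500:      # 4xx: rethink the crawl strategy
            raise RuntimeError(f"{resp.status_code} for {url}: site changed or crawler banned?")
    return None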
6. The HTTP response body is the part of the protocol that we crawler writers really care about:
Through Python's interactive environment we can view the request and response information intuitively and conveniently, which also shows Python's role as a Swiss Army knife.
>>> import requests #Python3
>>> html = requests.get('http://example.webscraping.com')
>>> print(html.status_code)
200
>>> print(html.elapsed)
0:00:00.818880
>>> print(html.encoding)
utf-8
>>> print(html.headers)
{'Server': 'nginx', 'Date': 'Thu, 01 Feb 2018 09:23:30 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'X-Powered-By': 'web2py', 'Set-Cookie': 'session_id_places=True; httponly; Path=/, session_data_places="6853de2931bf0e3a629e019a5c352fca:1Ekg3FlJ7obeqV0rcDDmjBm3y4P4ykVgQojt-qrS33TLNlpfFzO2OuXnY4nyl5sDvdq7p78_wiPyNNUPSdT2ApePNAQdS4pr-gvGc0VvnXo3TazWF8EPT7DXoXIgHLJbcXoHpfleGTwrWJaHq1WuUk4yjHzYtpOhAbnrdBF9_Hw0OFm6-aDK_J25J_asQ0f7"; Path=/', 'Expires': 'Thu, 01 Feb 2018 09:23:30 GMT', 'Pragma': 'no-cache', 'Cache-Control': 'no-store, no-cache, must-revalidate, post-check=0, pre-check=0', 'Content-Encoding': 'gzip'}
>>> print(html.content)
# (output omitted: too long)
Three. About the crawler's crawl strategy
Generally, when crawling data, we don't stop after grabbing the data of a single entry URL. When there is more than one URL link to crawl, what should we do?
1. Depth-first algorithm
Depth-first means the crawler first follows one link on a page, enters the linked page and grabs its content, then continues along a link on the current page, and so on, until the deepest page has no further links; only then does the crawler backtrack and follow the next link on the first page. (The original illustration is omitted.)
2. Breadth-first (width-first) algorithm
Breadth-first is the opposite process: it first traverses all the links at the current level, and only then continues downward to the next level. (The original illustration is omitted.)
Exercise: construct a complete binary tree and implement its depth-first and breadth-first traversal algorithms (a sketch of one solution follows the expected results below).
A binary tree in which every level except the last is completely full, the nodes on the last level are packed into the leftmost positions, and nodes may be missing only on the right of that last level, is called a complete binary tree.
(Figure: the complete binary tree, omitted.)
Depth-first traversal result: [1, 3, 5, 7, 9, 4, 12, 11, 2, 6, 14, 13, 8, 10]
Breadth-first traversal result: [1, 3, 2, 5, 4, 6, 8, 7, 9, 12, 11, 14, 13, 10]
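A sketch of one possible solution: store the complete binary tree as a level-order list, so the node at index i has its children at indices 2*i+1 and 2*i+2. Breadth-first traversal is then just the list order, and depth-first traversal is a preorder walk; with the node values above it reproduces both results exactly.

# Complete binary tree stored as a level-order list.
tree = [1, 3, 2, 5, 4, 6, 8, 7, 9, 12, 11, 14, 13, 10]

def depth_first(nodes, i=0):
    # Preorder walk: node first, then the left subtree, then the right subtree.
    if i >= len(nodes):
        return []
    return [nodes[i]] + depth_first(nodes, 2 * i + 1) + depth_first(nodes, 2 * i + 2)

def breadth_first(nodes):
    # Level order: for a list-stored complete tree this is the list itself.
    return list(nodes)

print(depth_first(tree))    # [1, 3, 5, 7, 9, 4, 12, 11, 2, 6, 14, 13, 8, 10]
print(breadth_first(tree))  # [1, 3, 2, 5, 4, 6, 8, 7, 9, 12, 11, 14, 13, 10]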
3. How to combine crawl strategies in practice
1. Generally speaking, important web pages sit close to the entry site;
2. Breadth-first crawling is conducive to parallel cooperation among multiple crawlers;
3. We can combine depth and breadth in the crawl strategy: give priority to breadth-first, and cap the depth at a maximum value;
Summary: the flow of a general crawler is as follows (the original flowchart is omitted; a sketch follows):
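This sketch follows the strategy recommended above, breadth-first with a maximum-depth cap; the regex-based link extraction and all names are illustrative simplifications, not the original author's code.

import re
from collections import deque
import requests

def crawl(seed_url, max_depth=2):
    headers = {"User-Agent": "Mozilla/5.0"}    # disguise the crawler, as in section Two
    queue = deque([(seed_url, 0)])             # FIFO queue of (url, depth) -> breadth-first
    seen = {seed_url}
    while queue:
        url, depth = queue.popleft()
        resp = requests.get(url, headers=headers)
        if resp.status_code != 200:            # react to the status code
            continue
        print(f"fetched {url} at depth {depth}")
        # ... parse and store resp.text here ...
        if depth < max_depth:                  # cap the depth
            for link in re.findall(r'href="(http[^"]+)"', resp.text):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))

# crawl("http://example.webscraping.com")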
copyright notice
author[zhulin1028]. Please include a link to the original when reprinting. Thank you.
https://en.pythonmana.com/2022/01/202201311737519367.html