
Python Crawler from Entry to Mastery (III): Implementing a Simple Crawler

2022-01-31 17:37:53 zhulin1028


I. Probably the Simplest Crawler Demo in History

Our first crawler, written in two lines of code:

import urllib.request  # Python 3

# The target address was mangled in this copy: "GitHub - richardpenman/wswp_places"
# is link text, apparently pointing at the wswp_places demo site.
# Put the URL you want to fetch here.
print(urllib.request.urlopen(urllib.request.Request("http://example.webscraping.com")).read())

These two lines run normally under Python 3.6 and fetch the content of that page.

Note: the same fetch can also be written with the requests library:

import requests  # Python 3

# (the fetch line was missing in this copy; with requests it would look like)
print(requests.get("http://example.webscraping.com").text)

If the requests library is not installed, first install it with the command pip install requests.

Explanation: most of the code in this handout uses Python 3.6 as its blueprint. Appendix A of the handout contains a comparison table of the most important crawler libraries in Python 2 and Python 3; with that table you can easily port crawler code between Python 2 and Python 3.

II. A Review of the HTTP and HTTPS Protocols

1. About URLs

A URL (short for Uniform / Universal Resource Locator) completely describes the address of a web page or any other resource on the Internet.

The basic format: scheme://host[:port#]/path/…/[?query-string][#anchor]

scheme: the protocol (for example http, https, ftp)

host: the server's IP address or domain name

port#: the server's port (can be omitted when the protocol's default port is used; HTTP defaults to 80)

path: the path to the resource being accessed

query-string: parameters, i.e. data sent to the HTTP server

anchor: an anchor (jumps to the specified anchor position in the page)

For example: http://www.example.com:8080/path/page.html?key=value#section (an illustrative address).
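The components listed above can be pulled apart with Python's standard urllib.parse module. A minimal sketch, using an illustrative address (www.example.com is a placeholder, not a site from the handout):

```python
from urllib.parse import urlparse

# Decompose an illustrative URL into the parts listed above.
parts = urlparse("http://www.example.com:8080/path/page.html?key=value#section")
print(parts.scheme)    # scheme       -> http
print(parts.hostname)  # host         -> www.example.com
print(parts.port)      # port#        -> 8080
print(parts.path)      # path         -> /path/page.html
print(parts.query)     # query-string -> key=value
print(parts.fragment)  # anchor       -> section
```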

The URL is the crawler's entry point, so it is very important.

2. The HTTP and HTTPS protocols

HTTP (HyperText Transfer Protocol) is the method for publishing and receiving HTML pages. HTTP is an application-layer protocol; it is connectionless (each connection handles only one request) and stateless (each connection's transfer is independent).

HTTPS (HyperText Transfer Protocol over Secure Socket Layer) is simply a secure version of HTTP that adds an SSL layer beneath HTTP. HTTPS = HTTP + SSL (Secure Sockets Layer). It is mainly used as the secure transport protocol for the Web: it encrypts the connection at the transport layer, guaranteeing the security of data transmitted over the Internet.

HTTP's default port number is 80; HTTPS's default port number is 443.

3. The two common methods of an HTTP request:

GET: retrieves information from the server; the data travels in the URL, so it is exposed in transit and limited in size.

POST: passes data to the server in the request body, so it is not exposed in the URL, and in theory its size is unlimited.
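The difference shows up in how the parameters are packed. A small sketch with urllib.parse (the address and parameters are made up for illustration):

```python
from urllib.parse import urlencode

params = {"q": "crawler", "page": "1"}

# GET: the data rides in the URL's query string, visible and size-limited.
get_url = "http://www.example.com/search?" + urlencode(params)
print(get_url)  # http://www.example.com/search?q=crawler&page=1

# POST: the same data is encoded into the request body instead of the URL.
post_body = urlencode(params).encode("utf-8")
print(post_body)  # b'q=crawler&page=1'
```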

4. About User-Agent

The User Agent (UA for short) is a special string header that lets the server identify the client's operating system and version, CPU type, browser and version, browser rendering engine, browser language, browser plug-ins, and so on.

Let's look at what User-Agent our simplest crawler reports to the server when it runs. As it turns out, a Python crawler carries a default User-Agent with a version number, which makes it easy to recognize as a program written in Python. So if you keep the default User-Agent, anti-crawler systems will spot our Python crawler at a glance, which is bad for the crawler.
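You can check that default identity without any network traffic: a freshly built urllib opener announces itself as "Python-urllib" plus the interpreter version (the exact number depends on your Python):

```python
import urllib.request

# A default opener carries a "Python-urllib/<version>" User-Agent,
# which anti-crawler filters can recognize at a glance.
opener = urllib.request.build_opener()
default_ua = dict(opener.addheaders)["User-agent"]
print(default_ua)  # e.g. Python-urllib/3.10
```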

So, how do we modify this User-Agent to disguise our crawler?

# The request-header information in the HTTP protocol
from urllib import request

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"}

# the target URL was elided in this copy
req = request.Request("", headers=headers)

# returns http.client.HTTPResponse
response = request.urlopen(req)

5. The status codes of the HTTP response:

200-series codes mean success, 300-series codes mean redirection (a jump);

400- and 500-series codes mean something went wrong:

Explanation: the information the server returns can tell us whether our crawler is running normally.

When an abnormal error occurs: generally, on a 500-series error the crawler should go to sleep, since it indicates the server is in trouble; on a 400-series error you need to consider revising the crawler's crawl strategy, since perhaps the website has been updated, or the crawler has been banned. In a distributed crawler system it is even easier to detect such errors and adjust the crawler's strategy.
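That policy can be sketched as a small decision function (the function name and return labels are my own, not from the handout):

```python
def handle_status(status_code, attempt, max_retries=2):
    """Decide the crawler's next move from an HTTP response status code."""
    if status_code < 400:
        return "ok"            # 2xx/3xx: the fetch (or redirect) went through
    if 500 <= status_code < 600 and attempt < max_retries:
        return "sleep-retry"   # server-side trouble: back off, then try again
    return "revise-strategy"   # 4xx (or retries exhausted): site changed or crawler banned

print(handle_status(200, 0))  # ok
print(handle_status(503, 0))  # sleep-retry
print(handle_status(404, 0))  # revise-strategy
```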

6. The HTTP response body is the part of the protocol that we crawlers need to care about:

Through Python's interactive environment we can view the request and response information directly and conveniently, which again shows Python's role as a Swiss Army knife.

>>> import requests  # Python 3

>>> html = requests.get('')  # the URL was elided in this copy

>>> print(html.status_code)

>>> print(html.elapsed)

>>> print(html.encoding)

>>> print(html.headers)

{'Server': 'nginx', 'Date': 'Thu, 01 Feb 2018 09:23:30 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'X-Powered-By': 'web2py', 'Set-Cookie': 'session_id_places=True; httponly; Path=/, session_data_places="6853de2931bf0e3a629e019a5c352fca:1Ekg3FlJ7obeqV0rcDDmjBm3y4P4ykVgQojt-qrS33TLNlpfFzO2OuXnY4nyl5sDvdq7p78_wiPyNNUPSdT2ApePNAQdS4pr-gvGc0VvnXo3TazWF8EPT7DXoXIgHLJbcXoHpfleGTwrWJaHq1WuUk4yjHzYtpOhAbnrdBF9_Hw0OFm6-aDK_J25J_asQ0f7"; Path=/', 'Expires': 'Thu, 01 Feb 2018 09:23:30 GMT', 'Pragma': 'no-cache', 'Cache-Control': 'no-store, no-cache, must-revalidate, post-check=0, pre-check=0', 'Content-Encoding': 'gzip'}

>>> print(html.content)

# omitted: there is too much content to show

III. Crawl Strategies

Generally, when crawling data we don't stop after grabbing a single entry URL. When there is more than one URL link to crawl, what should we do?

1. The depth-first algorithm

Depth-first means the crawler first follows one link on a page, enters the linked page and grabs its content, then continues crawling along a link on that page, and so on, until the deepest page has no more links to follow; the crawler then backtracks and follows another link on the first page. As shown in the figure below.

2. The breadth-first (width-first) algorithm

Breadth-first works the other way: it first traverses all the links on the current level, and only then moves down to the next level.

As shown in the figure below:

Exercise: construct a complete binary tree and implement depth-first and breadth-first traversal over it.

A binary tree is complete when each node has degree at most 2, only the lowest level may be missing nodes, the nodes on that lowest level are packed into its leftmost positions, and any missing nodes sit on the right of the last level.

The complete binary tree is as follows:

The result of depth first traversal :[1, 3, 5, 7, 9, 4, 12, 11, 2, 6, 14, 13, 8, 10]

The result of breadth first traversal :[1, 3, 2, 5, 4, 6, 8, 7, 9, 12, 11, 14, 13, 10]
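One way to do the exercise (a sketch; the list-backed layout is my choice, not the handout's) is to store the tree in level order, so that node i's children sit at indices 2*i + 1 and 2*i + 2. The values below reproduce the handout's tree, and the two traversals reproduce the results above:

```python
from collections import deque

# The complete binary tree in level order; node i's children are at
# indices 2*i + 1 and 2*i + 2 (values taken from the handout's figure).
values = [1, 3, 2, 5, 4, 6, 8, 7, 9, 12, 11, 14, 13, 10]

def depth_first(i=0):
    """Pre-order depth-first traversal of the list-backed tree."""
    if i >= len(values):
        return []
    return [values[i]] + depth_first(2 * i + 1) + depth_first(2 * i + 2)

def breadth_first():
    """Level-order (breadth-first) traversal using a queue."""
    order, queue = [], deque([0])
    while queue:
        i = queue.popleft()
        if i < len(values):
            order.append(values[i])
            queue.extend((2 * i + 1, 2 * i + 2))
    return order

print(depth_first())   # [1, 3, 5, 7, 9, 4, 12, 11, 2, 6, 14, 13, 8, 10]
print(breadth_first()) # [1, 3, 2, 5, 4, 6, 8, 7, 9, 12, 11, 14, 13, 10]
```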

3. How to combine crawl strategies in practice

1. Generally speaking, important web pages sit close to the portal site;

2. Breadth-first makes it easier for multiple crawlers to cooperate in parallel;

3. We can therefore combine depth and breadth into one crawl strategy: crawl breadth-first, but limit the crawl to a maximum depth.
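The combined strategy can be sketched as a breadth-first crawl with a depth cap. In this sketch, fetch_links(url) is an assumed helper that returns the links found on a page (here faked with a toy link graph):

```python
from collections import deque

def crawl(seed_url, fetch_links, max_depth=3):
    """Breadth-first crawl, but never deeper than max_depth from the seed."""
    seen = {seed_url}
    queue = deque([(seed_url, 0)])
    order = []                       # pages in the order they were fetched
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue                 # depth cap: don't expand this page's links
        for link in fetch_links(url):
            if link not in seen:     # de-duplicate URLs
                seen.add(link)
                queue.append((link, depth + 1))
    return order

# A toy link graph standing in for real pages:
graph = {"a": ["b", "c"], "b": ["d"], "d": ["e"]}
print(crawl("a", lambda u: graph.get(u, []), max_depth=2))
# ['a', 'b', 'c', 'd']  ("e" lies beyond the depth cap)
```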

Summary: the flow of a general crawler is as follows:


Copyright notice
Author: zhulin1028. Please include a link to the original when reprinting, thank you.
