
20,000 words take you into the Python crawler requests library, the most complete in history!

2022-02-01 19:09:50 User 7634986061731

One. Introduction to the requests library

Requests is a simple and elegant HTTP library designed for human beings. It is a native HTTP library that is easier to use than urllib3: it sends native HTTP/1.1 requests, there is no need to manually add query strings to the URL or to form-encode POST data, and, compared with urllib3, keep-alive and HTTP connection pooling are fully automatic. The requests library provides the following features:

* Keep-Alive & connection pooling
* International domain names and URLs
* Sessions with persistent cookies
* Browser-style SSL verification
* Automatic content decoding
* Basic/Digest authentication
* Elegant key/value cookies
* Automatic decompression
* Unicode response bodies
* HTTP(S) proxy support
* Chunked file uploads
* Streaming downloads
* Connection timeouts
* Chunked requests
* .netrc support

1.1 Installing requests

```
pip install requests
```

1.2 Basic use of requests

Code 1-1: send a GET request and view the returned result.

```python
import requests

url = 'www.tipdm.com/tipdm/index…'
# Generate a GET request
rqg = requests.get(url)
# View the result type
print('View the result type:', type(rqg))
# Check the status code
print('Status code:', rqg.status_code)
# View the encoding
print('Encoding:', rqg.encoding)
# View the response headers
print('Response headers:', rqg.headers)
# Print and view the web page content
print('View web content:', rqg.text)
```

```
View the result type: <class 'requests.models.Response'>
Status code: 200
Encoding: ISO-8859-1
Response headers: {'Date': 'Mon, 18 Nov 2019 04:45:49 GMT', 'Server': 'Apache-Coyote/1.1', 'Accept-Ranges': 'bytes', 'ETag': 'W/"15693-1562553126764"', 'Last-Modified': 'Mon, 08 Jul 2019 02:32:06 GMT', 'Content-Type': 'text/html', 'Content-Length': '15693', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive'}
```

1.3 Basic request methods

You can send any HTTP request through the requests library:

requests.get("httpbin.org/get") #GET request requests.post("httpbin.org/post") #POST request requests.put("httpbin.org/put") #PUT request requests.delete("httpbin.org/delete") #DELETE request requests.head("httpbin.org/get") #HEAD request requests.options("httpbin.org/get") #OPTIONS request Two 、 Use Request send out GET request HTTP One of the most common requests in GET request , Let's first learn more about the use of requests structure GET Requested method .  GET Parameter description : get(url, params=None, **kwargs): * URL: URL to be requested * params :( Optional ) Dictionaries , List the tuples or bytes sent for the requested query string * **kwargs: Variable length keyword parameters First , Build a simple GET request , The requested link is httpbin.org/get , The website will judge if the client initiates GET If you ask , It returns the corresponding request information , The following is the use of requests Construct a GET request import requests r = requests.get('httpbin.org/get') print(r.text) { "args": {}, "headers": { "Accept": "/", "Accept-Encoding": "gzip, deflate", "Host": "httpbin.org", "User-Agent": "python-requests/2.24.0", "X-Amzn-Trace-Id": "Root=1-5fb5b166-571d31047bda880d1ec6c311" }, "origin": "36.44.144.134", "url": "httpbin.org/get" } You can find , We successfully launched GET request , The returned result contains the request header 、URL 、IP Etc . that , about GET request , If you want to add additional information , How do you usually add ?

2.1 Sending a request with headers

First, let's try to request the Zhihu Explore page:

```python
import requests

response = requests.get('https://www.zhihu.com/explore')
print(f"The response status code of the current request is: {response.status_code}")
print(response.text)
```

```
The response status code of the current request is: 400
400 Bad Request
openresty
```

The response status code is 400, which means our request failed: the site has detected that we are a crawler. We therefore need to disguise ourselves as a browser by adding the corresponding User-Agent information.

```python
import requests

headers = {"user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
response = requests.get('https://www.zhihu.com/explore', headers=headers)
print(f"The response status code of the current request is: {response.status_code}")
print(response.text)
```

```
The response status code of the current request is: 200
.......
```

Here we added the headers information, which contains the User-Agent field, i.e. the browser identification information. The disguise obviously succeeded! Faking the browser identity like this defeats one of the simplest anti-crawling measures.

GET parameter description: sending a request that carries request headers

requests.get(url, headers=headers)

  • The headers parameter receives the request headers in the form of a dictionary
  • The request header field names are the keys, and the corresponding field values are the values

Practice: request Baidu's home page www.baidu.com, carrying headers, and print the header information of the request. Solution:

```python
import requests

url = 'https://www.baidu.com'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}

# Put the User-Agent in the request headers to impersonate a browser
response = requests.get(url, headers=headers)
print(response.content)

# Print the request header information
print(response.request.headers)
```

2.2 Sending a request with parameters

When we use Baidu search we often find a '?' in the URL; what follows the question mark is the request parameters, also called the query string. Usually we do not only visit basic web pages; especially when crawling dynamic pages we need to pass different parameters to get different content. GET passes parameters in two ways: you can add the parameters directly to the link, or use params.

2.2.1 Carrying parameters in the URL

Send the request directly on a URL that already carries the parameters:

```python
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
url = 'www.baidu.com/s?wd=python…'
response = requests.get(url, headers=headers)
```

2.2.2 Carrying parameters through the params dictionary

  1. Build a dictionary of request parameters
  2. Carry the parameter dictionary when sending the request to the interface, passing it as params

import requests headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit /537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}

This is the target url

url = ’www.baidu.com/s?wd=python…

Finally, whether there is a question mark or not, the results are the same

url = ’www.baidu.com/s?’

The request parameter is a dictionary namely wd=python

kw = {’wd’: ’python’}

Initiate a request with the request parameters , Get a response

response = requests.get(url, headers=headers, params=kw) print(response.content) Through the running results, we can judge , The requested link is automatically constructed as :
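As mentioned in the GET parameter description earlier, params is not limited to a plain dictionary. A small, hedged sketch (the keys and values here are arbitrary) showing that a list of 2-tuples is also accepted, that a key whose value is None is dropped, and that a list value is repeated in the query string:

```python
import requests

# params may be a dict or a list of 2-tuples
params = [('wd', 'python'), ('pn', '1')]
r = requests.get('https://httpbin.org/get', params=params)
print(r.url)   # e.g. https://httpbin.org/get?wd=python&pn=1

# A None value is skipped, a list value is repeated in the query string
r = requests.get('https://httpbin.org/get', params={'wd': 'python', 'ie': None, 'tn': ['a', 'b']})
print(r.url)   # e.g. https://httpbin.org/get?wd=python&tn=a&tn=b
```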

Back to the example above: from the running results we can judge that the requested link is automatically constructed, e.g. httpbin.org/get?key2=va… In addition, the return type of the page is actually str, but it is special: it is in JSON format. So if you want to parse the returned result directly and get a dictionary, you can call the json() method. An example is as follows:

```python
import requests

r = requests.get("https://httpbin.org/get")
print(type(r.text))
print(r.json())
print(type(r.json()))
```

```
<class 'str'>
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.24.0', 'X-Amzn-Trace-Id': 'Root=1-5fb5b3f9-13f7c2192936ec541bf97841'}, 'origin': '36.44.144.134', 'url': 'https://httpbin.org/get'}
<class 'dict'>
```

You can see that by calling the json() method the returned JSON-format string is converted into a dictionary. Note, however, that if the returned result is not in JSON format, parsing fails and a json.decoder.JSONDecodeError exception is raised.
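As the text notes, calling json() on a non-JSON body raises a decoding error, so crawler code often guards the call. A minimal, hedged sketch (httpbin's /html endpoint is used only because it returns HTML rather than JSON); json.decoder.JSONDecodeError is a subclass of ValueError, so catching ValueError covers it:

```python
import requests

r = requests.get('https://httpbin.org/html')   # returns HTML, not JSON
try:
    data = r.json()
except ValueError as e:   # json.decoder.JSONDecodeError subclasses ValueError
    print('The response body is not valid JSON:', e)
    data = None
```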

As a supplement: a dictionary passed in will be automatically encoded and sent as part of the URL, as follows:

```python
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'}
wd = 'Zhang San'
pn = 1
response = requests.get('https://www.baidu.com/s', params={'wd': wd, 'pn': pn}, headers=headers)
print(response.url)
```

Output:

```
www.baidu.com/s?wd=%E9%9B…C%E5%AD%A6&pn=1
```

So the URL has been automatically encoded.

The code above is equivalent to the following code; params transcoding essentially uses urlencode:

```python
import requests
from urllib.parse import urlencode

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'}
wd = 'Zhang San'
encode_res = urlencode({'k': wd}, encoding='utf-8')
keyword = encode_res.split('=')[1]
print(keyword)

# Then splice it into the url
url = 'www.baidu.com/s?wd=%s&pn=…' % keyword
response = requests.get(url, headers=headers)
print(response.url)
```

Output:

```
www.baidu.com/s?wd=%E9%9B…%90%8C%E5%AD%A6&pn=1
```

2.3 Using GET requests to grab web pages

The request links above return strings in JSON form; if we request a normal web page, we can of course get the corresponding content!

```python
import requests
import re

headers = {"user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
response = requests.get('https://www.zhihu.com/explore', headers=headers)
# The closing tag of the original pattern was stripped when the article was rendered; </a> is assumed here
result = re.findall("(ExploreSpecialCard-contentTitle|ExploreRoundtableCard questionTitle).*?>(.*?)</a>", response.text)
print([i[1] for i in result])
```

```
['What delicious food is there on Huimin Street in Xi'an?', 'What are the treasure shops worth visiting in Xi'an?', 'Which business districts in Xi'an carry your youth?',
 'What good driving habits can you share?', 'What are the driving skills that only experienced drivers know?', 'Care about your car: everyone should master this driving knowledge, it can save lives at the critical moment',
 'Welcome aboard! Zhihu universe member recruitment notice', 'Planet landing question: with ten dollars and a trip into the future, how would you get by?', 'Planet landing question: what kinds of "super powers" exist in the universe? How would you use them?',
 'For Norwegian salmon, origin is crucial', 'What are the most attractive places in Norway?', 'What is it like to live in Norway?',
 'How to view BOE's mass production of flexible AMOLED screens? What is the future?', 'Can flexible screens bring revolutionary influence to the mobile phone industry?', 'What is an ultra-thin flexible battery? Will it have a significant impact on the endurance of smart phones?',
 'How can we learn art well and get high marks in the art exam?', 'Is the Tsinghua Academy of Fine Arts looked down upon?', 'Are art students really bad?',
 'How should people live this life?', 'What should one pursue in life?', 'Will humans go crazy when they learn the ultimate truth of the world?',
 'Is anxiety due to lack of ability?', 'What kind of experience is social phobia?', '"If you're busy you won't have time to be depressed": is this sentence reasonable?']
```

Here we added the headers information, which contains the User-Agent field, i.e. the browser identification information. Without it, grabbing the page is almost certain to be blocked.

Grabbing binary data: in the example above we grabbed a Zhihu page, and what was returned is in fact an HTML document. What if you want to grab images, audio, video and other files? Images, audio and video files are essentially binary files; because they have specific saving formats and corresponding parsing methods, we can see all this multimedia. So, to grab them, we need to get their binary code. Let's take GitHub's site icon as an example:

import requests response = requests.get("github.com/favicon.ico") with open(’github.ico’, ’wb’) as f: f.write(response.content) Response Two properties of an object , One is text, The other is content. Where the former represents string type text , The latter means bytes Type data , similarly , Audio and video files can also be obtained in this way . 2.4 stay Headers Parameters carry cookie Websites often take advantage of... In the request header Cookie Field to maintain the user access state , So we can do that headers Add... To the parameter Cookie , Simulate the request of ordinary users . 2.4.1 Cookies Acquisition In order to get the login page through the crawler , Or solve through cookie Back climbing , Need to use request To deal with it cookie Related requests import requests url = ’www.baidu.com’ req = requests.get(url) print(req.cookies)

2.4 Carrying cookies in the headers parameter

Websites often use the Cookie field in the request headers to keep track of the user's access state, so we can add a Cookie to the headers parameter to simulate the request of an ordinary user.

2.4.1 Obtaining cookies

To reach pages behind a login through the crawler, or to get around cookie-based anti-crawling, we need requests to handle cookie-related requests:

```python
import requests

url = 'https://www.baidu.com'
req = requests.get(url)
print(req.cookies)

# The cookies of the response
for key, value in req.cookies.items():
    print(f"{key} = {value}")
```

```
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
BDORZ = 27315
```

Here we first call the cookies property to get the cookies; you can see it is of type RequestsCookieJar. Then we use the items() method to convert it to a list of tuples and iterate over it, outputting the name and value of each cookie.

2.4.2 Carrying cookies to log in

The benefit of carrying cookies and a session: you can request pages that sit behind a login.
The drawback of carrying cookies and a session: one set of cookies and session often corresponds to a single user; if requests come too fast or too often, the server easily recognizes the client as a crawler.
Try not to use cookies when you do not need them. But to get a page behind a login, we have to send requests that carry cookies.

We can use the Cookie directly to maintain the login state. Let's take Zhihu as an example. First log in to Zhihu and copy the Cookie content from the request headers.

* Copy the User-Agent and Cookie from the browser
* The request header fields and values in the browser must be consistent with the headers parameter
* The value corresponding to the Cookie key in the headers parameter dictionary is a string

```python
import requests
import re

# Construct the request header dictionary
headers = {
    # User-Agent copied from the browser
    "user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    # Cookie copied from the browser
    "cookie": 'xxx this is the copied cookie string'}

# The request header parameter dictionary carries the cookie string
response = requests.get('https://www.zhihu.com/creator', headers=headers)
# The closing tag of the original pattern was stripped when the article was rendered; </a> is assumed here
data = re.findall('CreatorHomeAnalyticsDataItem-title.*?>(.*?)</a>', response.text)
print(response.status_code)
print(data)
```

When we make the request without carrying cookies:

```python
import requests
import re

headers = {"user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
response = requests.get('https://www.zhihu.com/creator', headers=headers)
data = re.findall('CreatorHomeAnalyticsDataItem-title.*?>(.*?)</a>', response.text)
print(response.status_code)
print(data)
```

```
200
[]
```

The printed output is empty. Comparing the two runs, the headers parameter successfully carried the cookie string and got us a page that can only be accessed after logging in!

2.4.3 Using the cookies parameter

In the previous section we carried the cookie in the headers parameter; we can also use the dedicated cookies parameter.

* 1. The form of the cookies parameter: a dictionary, cookies = {"cookie name": "cookie value"}
  • The dictionary corresponds to the Cookie string in the request headers, in which each key/value pair is separated by a semicolon and a space
  • The left side of each equals sign is a cookie name, which becomes a key of the cookies dictionary
  • The right side of each equals sign becomes the corresponding value of the cookies dictionary
* 2. How to use the cookies parameter: response = requests.get(url, cookies=cookies_dict)
* 3. Converting a cookie string into the dictionary required by the cookies parameter:
  cookies_dict = {cookie.split('=')[0]: cookie.split('=')[-1] for cookie in cookies_str.split('; ')}
* 4. Note: cookies usually have an expiration time; once expired you need to obtain them again

```python
import requests
import re

url = 'https://www.zhihu.com/creator'
cookies_str = 'the copied cookies'
headers = {"user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
cookies_dict = {cookie.split('=', 1)[0]: cookie.split('=', 1)[-1] for cookie in cookies_str.split('; ')}

# Carry the cookie dictionary through the cookies parameter
resp = requests.get(url, headers=headers, cookies=cookies_dict)
data = re.findall('CreatorHomeAnalyticsDataItem-title.*?>(.*?)</a>', resp.text)
print(resp.status_code)
print(data)
```

```
200
['How do I integrate elements whose id differs but whose class is the same in Python?', "My parents can't afford to buy me a computer, what should I do?", 'Describe your current life in one sentence?']
```

2.4.4 Constructing a RequestsCookieJar object to set cookies

Here we can also construct a RequestsCookieJar object to set the cookies. The sample code is as follows:

```python
import requests
import re

url = 'https://www.zhihu.com/creator'
cookies_str = 'the copied cookies'
headers = {"user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}

jar = requests.cookies.RequestsCookieJar()
for cookie in cookies_str.split(';'):
    key, value = cookie.split('=', 1)
    jar.set(key, value)

# Carry the cookies through the cookies parameter
resp = requests.get(url, headers=headers, cookies=jar)
data = re.findall('CreatorHomeAnalyticsDataItem-title.*?>(.*?)</a>', resp.text)
print(resp.status_code)
print(data)
```

```
200
['How do I integrate elements whose id differs but whose class is the same in Python?', "My parents can't afford to buy me a computer, what should I do?", 'Describe your current life in one sentence?']
```

Here we first create a RequestsCookieJar object, then split the copied cookies with the split() method, set the key and value of each cookie with the set() method, and finally pass the jar to the cookies parameter of requests' get() method. Because of Zhihu's own restrictions the headers parameter is still required; you just no longer need to set the cookie field in the original headers. After testing, login works normally this way as well.

2.4.5 Converting a cookieJar object to a cookies dictionary

The response object obtained with requests has a cookies attribute. Its value is of type cookieJar, and it contains the cookies set locally by the remote server. How do we convert it into a cookies dictionary?

* 1. The conversion method: cookies_dict = requests.utils.dict_from_cookiejar(response.cookies)
* 2. Where response.cookies returns an object of type cookieJar
* 3. The requests.utils.dict_from_cookiejar function returns a cookies dictionary

```python
import requests
import re

url = 'https://www.zhihu.com/creator'
cookies_str = 'the copied cookies'
headers = {"user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
cookie_dict = {cookie.split('=', 1)[0]: cookie.split('=', 1)[-1] for cookie in cookies_str.split('; ')}

# The cookie string is carried through the cookies parameter
resp = requests.get(url, headers=headers, cookies=cookie_dict)
data = re.findall('CreatorHomeAnalyticsDataItem-title.*?>(.*?)</a>', resp.text)
print(resp.status_code)
print(data)

# A dictionary can be turned into a requests.cookies.RequestsCookieJar object
cookiejar = requests.utils.cookiejar_from_dict(cookie_dict, cookiejar=None, overwrite=True)
type(cookiejar)      # requests.cookies.RequestsCookieJar
type(resp.cookies)   # requests.cookies.RequestsCookieJar
# The jar built in the RequestsCookieJar example above has the same type
# cookiejar to dictionary
requests.utils.dict_from_cookiejar(cookiejar)
```

2.5 Timeout settings

While surfing the Internet we often run into network fluctuations; at such times a request may wait for a long time and still produce no result. In a crawler, a request that stays fruitless for a long time makes the whole project very inefficient, so we need to force the request to return a result within a specific time, or raise an error.

* 1. How to use the timeout parameter: response = requests.get(url, timeout=3)
* 2. timeout=3 means: a response must be returned within 3 seconds of sending the request, otherwise an exception is thrown
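Because an overly slow request raises an exception instead of returning a response, crawler code usually guards the call. A minimal, hedged sketch (the httpbin delay endpoint is only an illustration) that catches requests.exceptions.Timeout; it also shows that timeout may be given as a (connect timeout, read timeout) tuple:

```python
import requests

url = 'https://httpbin.org/delay/3'   # this endpoint waits 3 seconds before answering
try:
    # timeout can be a single number, or a (connect timeout, read timeout) tuple
    response = requests.get(url, timeout=(3.05, 1))
    print(response.status_code)
except requests.exceptions.Timeout:
    print('The request timed out; consider retrying or logging the URL')
```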

```python
url = 'www.tipdm.com/tipdm/index…'
# Set the timeout to 2
print('The timeout is 2:', requests.get(url, timeout=2))
# If the timeout is too short, an error is raised
requests.get(url, timeout=0.1)
```

```
The timeout is 2: <Response [200]>
```

Three. Sending POST requests with requests

Think about it: where do we use POST requests?

  1. Logging in and registering (in the eyes of web engineers POST is more secure than GET: the user's account, password and other information are not exposed in the URL address)
  2. When large text content needs to be transmitted (a POST request has no length requirement on the data)

Likewise, our crawler needs to simulate the browser sending POST requests in these two situations. Sending a POST request is actually very similar to sending a GET request; we just pass the parameters in data instead.

POST parameter description: post(url, data=None, json=None, **kwargs)

* url: the URL to be requested
* data: (optional) dictionary, list of tuples, bytes or file-like object, sent in the body of the request
* json: (optional) JSON data, sent in the body of the request
* **kwargs: variable-length keyword arguments

```python
import requests

payload = {'key1': 'value1', 'key2': 'value2'}
req = requests.post("https://httpbin.org/post", data=payload)
print(req.text)
```

3.1 POSTing JSON data

Often the data you want to send is not form-encoded; I found this comes up with many Java-based websites. If you pass a string instead of a dict, the data is posted as-is. We can use json.dumps() to turn a dict into a str; besides encoding the dict yourself, you can also pass the json parameter directly and it is encoded automatically:

```python
import json
import requests

url = 'https://httpbin.org/post'
payload = {'some': 'data'}
req1 = requests.post(url, data=json.dumps(payload))
req2 = requests.post(url, json=payload)
print(req1.text)
print(req2.text)
```

You can see that we got the returned result; the form part contains the submitted data, which proves the POST request was sent successfully.

Note: the requests module carries request parameters in three ways: data, json and params. params is used in GET requests, while data and json are used in POST requests. The data parameter accepts dictionaries, strings, bytes and file objects.

* With the json parameter, whether the value is of type str or dict, if Content-Type is not specified in the headers it defaults to application/json.
* With the data parameter and a dict value, if Content-Type is not specified in the headers it defaults to application/x-www-form-urlencoded, equivalent to an ordinary form submission: the form data is converted into key/value pairs, the server can read it from request.POST, and request.body contains something like a=1&b=2.
* With the data parameter and a str value, the string is sent as the request body as-is, without form encoding.

When data is used to submit the dictionary, request.body looks like a=1&b=2; when json is used, request.body looks like '{"a": 1, "b": 2}'.
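To make the note above concrete, here is a small, hedged comparison against httpbin, which simply echoes the request back: the same dictionary sent with data= arrives as a form-encoded body, while json= arrives as a JSON body with Content-Type set to application/json:

```python
import requests

payload = {'a': 1, 'b': 2}

# data=dict -> application/x-www-form-urlencoded, echoed back under "form"
r1 = requests.post('https://httpbin.org/post', data=payload)
print(r1.json()['headers']['Content-Type'])   # application/x-www-form-urlencoded
print(r1.json()['form'])                       # {'a': '1', 'b': '2'}

# json=dict -> application/json, echoed back under "json"
r2 = requests.post('https://httpbin.org/post', json=payload)
print(r2.json()['headers']['Content-Type'])   # application/json
print(r2.json()['json'])                       # {'a': 1, 'b': 2}
```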

3.2 Uploading files with POST

If we want to upload files with the crawler, we can use the files parameter:

```python
import requests

url = 'https://httpbin.org/post'
files = {'file': open('test.xlsx', 'rb')}
req = requests.post(url, files=files)
print(req.text)
```

If you are familiar with web development you will know that when a very large file is sent as a multipart/form-data request, you may want to stream it. requests does not support this by default, but the third-party requests-toolbelt library does.
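A minimal sketch of the streaming upload just mentioned, assuming requests-toolbelt has been installed (pip install requests-toolbelt) and that a local test.xlsx exists; the field names and content type are only illustrative, see the requests-toolbelt documentation for details:

```python
import requests
from requests_toolbelt.multipart.encoder import MultipartEncoder

# Build a streaming multipart/form-data body instead of reading the whole file into memory
m = MultipartEncoder(fields={
    'field0': 'value0',
    'file': ('test.xlsx', open('test.xlsx', 'rb'),
             'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'),
})
req = requests.post('https://httpbin.org/post', data=m,
                    headers={'Content-Type': m.content_type})
print(req.status_code)
```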

3.3 Using POST requests to grab a web page

The main job is to find the web page to be parsed:

```python
import requests

# Prepare the data to be translated
kw = input("Please input the word to be translated: ")
ps = {"kw": kw}

# Prepare the disguised request headers
headers = {
    # User-Agent identifies who is sending the request; we use the browser's identity
    # directly, so the site thinks the request was made by a browser (hiding the crawler)
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36 Edg/85.0.564.41"}

# Send the POST request and attach the form data to be translated, passed as a dictionary
response = requests.post("https://fanyi.baidu.com/sug", data=ps)

# Print the returned data
print(response.content)
print(response.content.decode("unicode_escape"))
```

Four. Requests Advanced (1): Session maintenance

This part mainly covers session maintenance with Session, and the use of proxy IPs.

In requests, you can simulate web page requests directly with get(), post() and so on, but each call is in fact a separate session; it is as if you opened two pages with two different browsers. Imagine this scenario: the first request logs into a website with post(), and the second request, wanting to fetch your personal information after the successful login, uses get() on the personal information page. In effect this opens two browsers: they are two completely unrelated sessions. Can the personal information be obtained successfully? Of course not.

Someone might say: can't I just set the same cookies on both requests? You can, but it is obviously cumbersome. There is a simpler solution: maintain the same session, which is like opening a new browser tab rather than a new browser. But we don't want to set cookies every time, so what do we do? This is where the new weapon, the Session object, comes in. With it, we can easily maintain a session without worrying about cookies; it handles them for us automatically.

The Session class of the requests module automatically handles the cookies produced while sending requests and receiving responses, and thereby achieves state keeping. Let's learn how to use it.

4.1 The role and application scenarios of requests.session

* The role of requests.session: it handles cookies automatically, i.e. the next request carries the previous cookie
* Application scenario of requests.session: handling cookies automatically across several consecutive requests

4.2 How to use requests.session

After a session instance requests a website, the cookies set locally by the remote server are saved in the session; the next time the session is used to request that server, the previous cookies are carried along.

```python
session = requests.session()  # instantiate a session object
response = session.get(url, headers, ...)
response = session.post(url, data, ...)
```

The parameters for sending get or post requests through a session object are exactly the same as those of the requests module.
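Before moving on to the GitHub example, here is a small, hedged demonstration of the state keeping described above, using httpbin's cookie echo endpoints (the cookie name and value are arbitrary): the cookie set by the first request is carried automatically on the second request made with the same session.

```python
import requests

session = requests.session()

# The server sets a cookie on this first request ...
session.get('https://httpbin.org/cookies/set/sessioncookie/123456789')

# ... and the same Session object sends it back automatically on the next one
resp = session.get('https://httpbin.org/cookies')
print(resp.json())   # {'cookies': {'sessioncookie': '123456789'}}

# A plain requests.get() call starts from a clean slate and sees no cookies
print(requests.get('https://httpbin.org/cookies').json())   # {'cookies': {}}
```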

4.3 Using Session to keep the GitHub login state

* Capture the whole process of logging into GitHub and then visiting a page that can only be accessed after logging in
* Determine the URL address, request method and required request parameters of the login request
  • Some request parameters are found in the response content of other URLs; they can be extracted with the re module
* Determine the URL address and request method of the page that can only be accessed after logging in
* Complete the code using requests.session

```python
import requests
import re

# Construct the request header dictionary
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36',
}

# Instantiate the session object
session = requests.session()

# Visit the login page to obtain the parameters required by the login request
response = session.get('https://github.com/login', headers=headers)
# Use a regular expression to get the parameter required by the login request
authenticity_token = re.search('name="authenticity_token" value="(.*?)" />', response.text).group(1)

# Construct the login request parameter dictionary
data = {
    'commit': 'Sign in',  # fixed value
    'utf8': ' ',          # fixed value
    'authenticity_token': authenticity_token,  # this parameter is in the response content of the login page
    'login': input('Input your github account: '),
    'password': input('Input your github password: ')}

# Send the login request (there is no need to inspect the response to this request)
session.post('https://github.com/session', headers=headers, data=data)

# Print a page that requires login to access
response = session.get('github.com/settings/pr…', headers=headers)
print(response.text)
```

You can proofread the result with a text comparison tool!

Five. Requests Advanced (2): Using proxies

For some websites, a few requests during testing fetch content normally. But once large-scale crawling begins, faced with large and frequent requests the website may pop up a captcha, jump to a login verification page, or even block the client IP outright, making it inaccessible for a while. To prevent this, we need to set up a proxy, which is where the proxies parameter comes in. A proxy is set by specifying a proxy IP and letting the forward proxy server that the proxy IP points to forward the request we send. Let's first learn about proxy IPs and proxy servers.

5.1 The process of using a proxy

  1. A proxy IP is an IP that points to a proxy server
  2. The proxy server helps us forward the request to the target server

5.2 Forward and reverse proxies

It was mentioned above that the proxy IP specified by the proxies parameter points to a forward proxy server; correspondingly there are also reverse proxy servers. Let's look at the difference between the two:

* Forward and reverse proxies are distinguished from the point of view of the party that sends the request
* A proxy that forwards requests on behalf of the browser or client (the party sending the request) is called a forward proxy
  • The browser knows the real IP address of the server that finally handles the request, for example a VPN
* A proxy that does not forward requests on behalf of the browser or client, but instead forwards requests to the server that finally handles them, is called a reverse proxy
  • The browser does not know the real address of the server, for example nginx

5.3 Classification of proxy IPs (proxy servers)

By degree of anonymity, proxy IPs can be divided into three categories:

* Transparent proxy (Transparent Proxy): a transparent proxy can "hide" your IP address directly, but the other side can still find out who you are. The request headers received by the target server look like:
  REMOTE_ADDR = Proxy IP
  HTTP_VIA = Proxy IP
  HTTP_X_FORWARDED_FOR = Your IP
* Anonymous proxy (Anonymous Proxy): with an anonymous proxy, the other side can only know that you used a proxy, not who you are. The request headers received by the target server look like:
  REMOTE_ADDR = Proxy IP
  HTTP_VIA = Proxy IP
  HTTP_X_FORWARDED_FOR = Proxy IP
* Elite proxy (Elite Proxy or High Anonymity Proxy): a high-anonymity proxy makes it impossible for the other side to tell that you are using a proxy at all, so it is the best choice. Without a doubt, high-anonymity proxies work best. The request headers received by the target server look like:
  REMOTE_ADDR = Proxy IP
  HTTP_VIA = not determined
  HTTP_X_FORWARDED_FOR = not determined

Depending on the protocol used by the website, you need a proxy service for the corresponding protocol. By protocol, proxy services can be divided into:

* http proxy: the target URL uses the http protocol
* https proxy: the target URL uses the https protocol
* socks tunnel proxy (for example socks5):
  1. A socks proxy simply passes packets along and does not care about the application protocol (FTP, HTTP, HTTPS, etc.)
  2. A socks proxy takes less time than an http or https proxy
  3. A socks proxy can forward both http and https requests

5.4 Using the proxies parameter

To make the server believe that it is not the same client making the requests, and to avoid getting an IP blocked for sending too many requests to one domain, we use proxy IPs. Let's learn the basic usage of proxy IPs with the requests module:

response = requests.get(url, proxies=proxies)

proxies takes the form of a dictionary:

```python
proxies = {
    "http": "http://12.34.56.79:9527",
    "https": "https://12.34.56.79:9527",
}
```

Note: if the proxies dictionary contains several key/value pairs, the proxy IP is chosen according to the protocol of the URL being requested.

```python
import requests

proxies = {
    "http": "http://124.236.111.11:80",
    "https": "https://183.220.145.3:8080"}
req = requests.get('https://www.baidu.com', proxies=proxies)
req.status_code
```
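The example above uses open proxies. If a proxy requires authentication, requests also accepts credentials embedded in the proxy URL; a minimal, hedged sketch in which host, port, user and password are all placeholder values to be replaced with a real proxy:

```python
import requests

# Hypothetical proxy with HTTP basic authentication: user:password@host:port
proxies = {
    'http':  'http://user:password@10.10.1.10:3128',
    'https': 'http://user:password@10.10.1.10:3128',
}
try:
    req = requests.get('https://www.baidu.com', proxies=proxies, timeout=5)
    print(req.status_code)
except requests.exceptions.ProxyError:
    print('The proxy refused the connection or the credentials are wrong')
```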

Six. Requests Advanced (3): SSL certificate verification

requests also provides certificate verification. When an HTTPS request is sent, the SSL certificate is checked, and we can use the verify parameter to control whether this check is performed. If you do not pass the verify parameter at all, the default is True and the certificate is verified automatically. Let's test it with requests:

```python
import requests

url = 'cas.xijing.edu.cn/xjtyrz/logi…'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
req = requests.get(url, headers=headers)
```

```
SSLError: HTTPSConnectionPool(host='cas.xijing.edu.cn', port=443): Max retries exceeded with url: /xjtyrz/login (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)')))
```

An SSL error is raised, indicating a certificate verification error. So if an HTTPS site is requested and its certificate fails verification, this error is reported. How do we avoid it? Simple: set the verify parameter to False. The relevant code is as follows:

```python
import requests

url = 'https://www.jci.edu.cn/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
req = requests.get(url, headers=headers, verify=False)
req.status_code
```

```
200
```

We fetched the page without doing SSL verification. However, a warning is printed suggesting that we provide a certificate. We can silence it by telling urllib3 to ignore the warning:

```python
import requests
from requests.packages import urllib3

urllib3.disable_warnings()
url = 'https://www.jci.edu.cn/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
req = requests.get(url, headers=headers, verify=False)
req.status_code
```

```
200
```

Or ignore the warning by capturing it into the logging system:

```python
import logging
import requests

logging.captureWarnings(True)
url = 'https://www.jci.edu.cn/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
req = requests.get(url, headers=headers, verify=False)
req.status_code
```

```
200
```

Of course, we can also specify a local certificate to be used as the client certificate; this can be a single file (containing the key and the certificate) or a tuple of two file paths:

```python
import requests

response = requests.get('https://www.12306.cn', cert=('./path/server…'))
print(response.status_code)
```

```
200
```

Of course, the code above is only a demonstration; we need to have the crt and key files and specify their paths. Note that the key of the local private certificate must be decrypted: an encrypted key is not supported. There are few such websites now!
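Between the two extremes of verify=False and a client certificate, verify can also point at a trusted CA bundle file, which keeps certificate checking enabled while trusting a private CA; a minimal sketch in which the bundle path and URL are placeholders:

```python
import requests

# Instead of disabling verification, point verify at a CA bundle file (path is hypothetical)
response = requests.get('https://example.com', verify='/path/to/ca-bundle.crt')
print(response.status_code)

# The same path can also be set once for a whole session
session = requests.session()
session.verify = '/path/to/ca-bundle.crt'
```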

Seven. Other contents of the requests library

7.1 Viewing the response content

After the request is sent, the natural next step is the response. In the examples above we used text and content to get the body of the response. In addition, there are many properties and methods for getting other information, such as the status code, response headers, cookies and so on. Examples:

```python
import requests

url = 'https://www.baidu.com'
req = requests.get(url)
print(req.status_code)  # status code of the response
print(req.text)         # text content of the response
print(req.content)      # binary content of the response
print(req.cookies)      # cookies of the response
print(req.encoding)     # encoding of the response
print(req.headers)      # header information of the response
print(req.url)          # URL of the response
print(req.history)      # history of the response
```
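Among the attributes above, req.history is easiest to see on a redirecting URL. A small, hedged sketch: requests follows redirects by default and records the intermediate responses in history, and allow_redirects=False turns this off (http://github.com, which redirects to HTTPS, is used only as a convenient example):

```python
import requests

# http://github.com redirects to https://github.com/
r = requests.get('http://github.com')
print(r.url)       # final URL after the redirect
print(r.history)   # e.g. [<Response [301]>]

# Disable automatic redirects to inspect the original response yourself
r = requests.get('http://github.com', allow_redirects=False)
print(r.status_code)            # 301
print(r.headers['Location'])    # target of the redirect
```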

7.2 Checking the status code and encoding

rqg.status_code shows the status code returned by the server, and rqg.encoding shows the web page encoding taken from the HTTP headers returned by the server. Note that when requests guesses wrong, the encoding needs to be specified manually to avoid garbled characters in the returned page content.

7.3 Sending a GET request and specifying the encoding manually

Code 1-2: send a GET request and specify the encoding manually.

```python
url = 'www.tipdm.com/tipdm/index…'
rqg = requests.get(url)
print('Status code', rqg.status_code)
print('Encoding', rqg.encoding)
rqg.encoding = 'utf-8'  # specify the encoding manually
print('Modified encoding', rqg.encoding)
print(rqg.text)
```

```
Status code 200
Encoding ISO-8859-1
Modified encoding utf-8
```

Note: specifying the encoding manually is not flexible and cannot adapt to the different page encodings met while crawling, whereas using the chardet library is simple and flexible. chardet is an excellent string/file encoding detection module.

7.4 Using the chardet library

The detect method of the chardet library detects the encoding of a given string. Its syntax is as follows:

chardet.detect(byte_str)

Common parameter of the detect method: byte_str, which receives the (byte) string whose encoding needs to be detected. No default value.
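Before applying detect to a response body, a quick, hedged illustration of what it returns: given a byte string it guesses the encoding and reports a confidence value (the sample strings are arbitrary, and the exact confidence depends on the input):

```python
import chardet

raw = '这是一段用来检测编码的中文文本'.encode('utf-8')
print(chardet.detect(raw))
# e.g. {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}; exact values vary with the input

print(chardet.detect('hello world'.encode('ascii')))
# plain ASCII bytes are reported as 'ascii'
```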

7.5 Using the detect method to detect and then specify the encoding

Code 1-3: use the detect method to detect the encoding and specify it.

```python
import chardet
import requests

url = 'www.tipdm.com/tipdm/index…'
rqg = requests.get(url)
print(rqg.encoding)
print(chardet.detect(rqg.content))
rqg.encoding = chardet.detect(rqg.content)['encoding']  # access the dictionary element
print(rqg.encoding)
```

```
ISO-8859-1
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
utf-8
```

7.6 A comprehensive requests test

Send a complete GET request to the website 'www.tipdm.com/tipdm/index…', including the link, request headers, response headers, timeout and status code, with the encoding set correctly.

Code 1-6: generate a complete HTTP request.

```python
# Import the related libraries
import requests
import chardet

# Set the url
url = 'www.tipdm.com/tipdm/index…'

# Set the request headers
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"}

# Generate the GET request and set the timeout to 2
rqg = requests.get(url, headers=headers, timeout=2)

# Check the status code
print("Status code", rqg.status_code)

# Check the encoding
print('Encoding', rqg.encoding)

# Correct the encoding with the detect method of the chardet library
rqg.encoding = chardet.detect(rqg.content)['encoding']

# Check the corrected encoding
print('Corrected encoding:', rqg.encoding)

# View the response headers
print('Response headers:', rqg.headers)

# View the web page content
# print(rqg.text)
```

```
Status code 200
Encoding ISO-8859-1
Corrected encoding: utf-8
Response headers: {'Date': 'Mon, 18 Nov 2019 06:28:56 GMT', 'Server': 'Apache-Coyote/1.1', 'Accept-Ranges': 'bytes', 'ETag': 'W/"15693-1562553126764"', 'Last-Modified': 'Mon, 08 Jul 2019 02:32:06 GMT', 'Content-Type': 'text/html', 'Content-Length': '15693', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive'}
```

Copyright notice
Author: User 7634986061731. Please include a link to the original when reprinting, thank you.
https://en.pythonmana.com/2022/02/202202011909442208.html
