20,000 Words to Take You Inside the Python Crawler's Requests Library: The Most Complete Guide Ever!!
2022-02-01 19:09:50 【User 7634986061731】
One. An overview of the requests library

Requests is a simple, elegant HTTP library designed for human beings. It is a native HTTP library that is easier to use than urllib3: it sends native HTTP/1.1 requests, so there is no need to add the query string to the URL by hand or to form-encode POST data yourself. Compared with urllib3, requests handles Keep-Alive and HTTP connection pooling fully automatically. The requests library provides the following features:

* Keep-Alive & connection pooling
* International domain names and URLs
* Sessions with persistent cookies
* Browser-style SSL verification
* Automatic content decoding
* Basic/Digest authentication
* Elegant key/value cookies
* Automatic decompression
* Unicode response bodies
* HTTP(S) proxy support
* Multipart file uploads
* Streaming downloads
* Connection timeouts
* Chunked requests
* .netrc support

1.1 Installing Requests

pip install requests

1.2 Basic use of Requests

Code 1-1: send a GET request and inspect the returned result.

```python
import requests

url = 'www.tipdm.com/tipdm/index…'
# Generate the GET request
rqg = requests.get(url)
# View the result type
print('Result type:', type(rqg))
# Check the status code
print('Status code:', rqg.status_code)
# View the encoding
print('Encoding:', rqg.encoding)
# View the response headers
print('Response headers:', rqg.headers)
# Print and view the page content
print('Page content:', rqg.text)
```

```
Result type: <class 'requests.models.Response'>
Status code: 200
Encoding: ISO-8859-1
Response headers: {'Date': 'Mon, 18 Nov 2019 04:45:49 GMT', 'Server': 'Apache-Coyote/1.1', 'Accept-Ranges': 'bytes', 'ETag': 'W/"15693-1562553126764"', 'Last-Modified': 'Mon, 08 Jul 2019 02:32:06 GMT', 'Content-Type': 'text/html', 'Content-Length': '15693', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive'}
```

1.3 Basic request methods

You can send every common HTTP request through the requests library:

```python
requests.get("https://httpbin.org/get")        # GET request
requests.post("https://httpbin.org/post")      # POST request
requests.put("https://httpbin.org/put")        # PUT request
requests.delete("https://httpbin.org/delete")  # DELETE request
requests.head("https://httpbin.org/get")       # HEAD request
requests.options("https://httpbin.org/get")    # OPTIONS request
```
requests.get("httpbin.org/get") #GET request requests.post("httpbin.org/post") #POST request requests.put("httpbin.org/put") #PUT request requests.delete("httpbin.org/delete") #DELETE request requests.head("httpbin.org/get") #HEAD request requests.options("httpbin.org/get") #OPTIONS request Two 、 Use Request send out GET request HTTP One of the most common requests in GET request , Let's first learn more about the use of requests structure GET Requested method . GET Parameter description : get(url, params=None, **kwargs): * URL: URL to be requested * params :( Optional ) Dictionaries , List the tuples or bytes sent for the requested query string * **kwargs: Variable length keyword parameters First , Build a simple GET request , The requested link is httpbin.org/get , The website will judge if the client initiates GET If you ask , It returns the corresponding request information , The following is the use of requests Construct a GET request import requests r = requests.get('httpbin.org/get') print(r.text) { "args": {}, "headers": { "Accept": "/", "Accept-Encoding": "gzip, deflate", "Host": "httpbin.org", "User-Agent": "python-requests/2.24.0", "X-Amzn-Trace-Id": "Root=1-5fb5b166-571d31047bda880d1ec6c311" }, "origin": "36.44.144.134", "url": "httpbin.org/get" } You can find , We successfully launched GET request , The returned result contains the request header 、URL 、IP Etc . that , about GET request , If you want to add additional information , How do you usually add ?
2.1 Sending a request with headers

First, let's try requesting the Zhihu "explore" home page:

```python
import requests

response = requests.get('https://www.zhihu.com/explore')
print(f"The response status code of the current request is: {response.status_code}")
print(response.text)
```

```
The response status code of the current request is: 400

400 Bad Request
openresty
```

The response status code is 400, which means the request failed: the site has figured out that we are a crawler. So we need to disguise ourselves as a browser and add the corresponding UA information:

```python
import requests

headers = {"user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
response = requests.get('https://www.zhihu.com/explore', headers=headers)
print(f"The response status code of the current request is: {response.status_code}")
print(response.text)
```

```
The response status code of the current request is: 200
.......
```

Here we added headers information including the User-Agent field, that is, the browser identification. The disguise obviously succeeded! Impersonating the browser this way defeats one of the simplest anti-crawling measures.

GET parameter description, sending a request that carries request headers: requests.get(url, headers=headers)
- The headers parameter receives the request headers in dictionary form
- The header field names are used as keys, and the values corresponding to the fields are used as values
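To see what actually changes when the headers parameter is supplied, here is a minimal sketch against httpbin.org (assuming the endpoint is reachable; the User-Agent string is only an example) comparing the default identification with a disguised one:

```python
import requests

# Without custom headers, requests identifies itself as python-requests/x.y.z
default = requests.get("https://httpbin.org/headers")
print(default.json()["headers"]["User-Agent"])    # e.g. python-requests/2.24.0

# With a browser-style User-Agent, the server sees a "normal" browser
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"}
disguised = requests.get("https://httpbin.org/headers", headers=headers)
print(disguised.json()["headers"]["User-Agent"])
```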
Practice: request Baidu's home page www.baidu.com, carrying headers, and print the headers of the request. Solution:
```python
import requests

url = 'https://www.baidu.com'
# Put the User-Agent in the request headers to impersonate a browser when sending the request
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
response = requests.get(url, headers=headers)
print(response.content)
# Print the headers of the request that was actually sent
print(response.request.headers)
```

2.2 Sending a request with parameters

When we search with Baidu we often find a '?' in the URL; what follows the question mark is the request parameters, also called the query string! Usually we don't only visit basic web pages; especially when crawling dynamic pages we need to pass different parameters to get different content. GET passes parameters in two ways: add the parameters directly to the link, or use params.

2.2.1 Carrying the parameters in the URL

Send the request directly on a URL that already carries the parameters:

```python
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
url = 'www.baidu.com/s?wd=python…'
response = requests.get(url, headers=headers)
```

2.2.2 Carrying a parameter dictionary via params
- Build the request parameter dictionary
- Pass the parameter dictionary when sending the request to the interface, setting it as params
import requests headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit /537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
This is the target url
url = ’www.baidu.com/s?wd=python…
Finally, whether there is a question mark or not, the results are the same
url = ’www.baidu.com/s?’
The request parameter is a dictionary namely wd=python
kw = {’wd’: ’python’}
Initiate a request with the request parameters , Get a response
response = requests.get(url, headers=headers, params=kw) print(response.content) Through the running results, we can judge , The requested link is automatically constructed as :
httpbin.org/get?key2=va… . In addition, the returned web page body is actually of type str, but it is special: it is in JSON format. So if you want to parse the returned result directly and get a dictionary, you can call the json() method. For example:
```python
import requests

r = requests.get("https://httpbin.org/get")
print(type(r.text))
print(r.json())
print(type(r.json()))
```
```
<class 'str'>
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.24.0', 'X-Amzn-Trace-Id': 'Root=1-5fb5b3f9-13f7c2192936ec541bf97841'}, 'origin': '36.44.144.134', 'url': 'https://httpbin.org/get'}
<class 'dict'>
```

You can see that calling the json() method turns the returned JSON-format string into a dictionary. Note, however, that if the returned result is not in JSON format, parsing fails and a json.decoder.JSONDecodeError exception is thrown.
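When the body might not be JSON (an HTML error page, for example), it is safer to guard the json() call; here is a minimal sketch, assuming any URL that may return non-JSON content:

```python
import requests

r = requests.get("https://www.baidu.com")   # returns HTML, not JSON
try:
    data = r.json()
except ValueError:
    # covers json.decoder.JSONDecodeError, which is a subclass of ValueError
    data = None
    print("Response is not JSON, first 30 characters:", r.text[:30])
```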
As supplementary content, a dictionary of strings passed in is automatically encoded and appended to the url, as follows:

```python
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'}
wd = 'Zhang San'
pn = 1
response = requests.get('https://www.baidu.com/s', params={'wd': wd, 'pn': pn}, headers=headers)
print(response.url)
```

```
www.baidu.com/s?wd=%E9%9B…C%E5%AD%A6&pn=1
```

So the url has been encoded automatically.

The code above is equivalent to the code below; the transcoding done by params essentially uses urlencode:

```python
import requests
from urllib.parse import urlencode

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'}
wd = 'Zhang San'
encode_res = urlencode({'k': wd}, encoding='utf-8')
keyword = encode_res.split('=')[1]
print(keyword)
# Then splice it into the url
url = 'www.baidu.com/s?wd=%s&pn=…' % keyword
response = requests.get(url, headers=headers)
print(response.url)
```

```
www.baidu.com/s?wd=%E9%9B…%90%8C%E5%AD%A6&pn=1
```

2.3 Using GET requests to grab web pages

The request links above return JSON-form strings, so if we request a normal web page we can of course get the corresponding content!
import requests import re headers = {"user-agent": ’Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit /537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36’} response = requests.get(’www.zhihu.com/explore’, headers=headers) result = re.findall("(ExploreSpecialCard-contentTitle|ExploreRoundtableCard questionTitle).?>(.?)", response.text) print([i[1] for i in result])
```
["What delicious food is there on Huimin Street in Xi'an?", "What are the treasure shops worth visiting in Xi'an?", "Which business districts in Xi'an carry your youth?", 'What good driving habits can you share?', 'What are the driving skills that only experienced drivers know?', 'Things to watch out for in a car: everyone should master this driving knowledge, it can save lives at the critical moment', 'Welcome to the landing! Zhihu universe member recruitment notice', "Planet landing question: I'll give you ten dollars to cross into the future, how do we get by?", 'Planet landing question: what kinds of "super powers" exist in the universe? How would you use them?', 'Norwegian salmon: the origin is crucial', 'What are the most attractive places in Norway?', 'What is living in Norway like?', "How should we view BOE's mass production of flexible AMOLED screens? What is their future?", 'Can flexible screens bring revolutionary influence to the mobile phone industry?', 'What is an ultra-thin flexible battery? Will it have a significant impact on the endurance of smartphones?', 'How can we learn art well and get high marks in the art exam?', 'Is the Tsinghua Academy of Fine Arts looked down on?', 'Are art students really that bad?', 'How should people live this life?', 'What should one pursue in life?', 'Will humans go crazy when they learn the ultimate truth of the world?', 'Is anxiety due to lack of ability?', 'What kind of experience is social phobia?', '"If you are busy, you will not have time to be depressed": is this sentence reasonable?']
```

Here we added headers information including the User-Agent field, i.e. the browser identification; without it, grabbing the page would basically be forbidden.

Grabbing binary data: in the example above we grabbed a Zhihu page, and what was returned is in fact an HTML document. What if you want to grab images, audio, video and other files? Images, audio and video files are essentially binary data; it is only thanks to their specific storage formats and the corresponding parsing methods that we can see all this multimedia. So, to grab them, we need to get their binary content. Let's take GitHub's site icon as an example:

```python
import requests

response = requests.get("https://github.com/favicon.ico")
with open('github.ico', 'wb') as f:
    f.write(response.content)
```

The Response object has two relevant attributes: one is text, the other is content. The former is str-type text, while the latter is bytes-type data. Audio and video files can be obtained in the same way.
2.4 Carrying a cookie in the headers parameter

Websites often use the Cookie field in the request header to maintain the user's access state, so we can add a Cookie to the headers parameter to simulate the request of an ordinary user.

2.4.1 Getting cookies

To fetch pages behind a login through the crawler, or to get past cookie-based anti-crawling, we need requests to handle cookie-related requests:

```python
import requests

url = 'https://www.baidu.com'
req = requests.get(url)
# The cookies of the response
print(req.cookies)
for key, value in req.cookies.items():
    print(f"{key} = {value}")
```

```
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
BDORZ = 27315
```

Here we first call the cookies attribute to get the cookies, and find that it is of type RequestsCookieJar. We then use the items() method to convert it into a list of tuples and iterate over it, printing the name and value of each cookie.

2.4.2 Logging in by carrying cookies

Benefits of cookies and sessions: they let us request the pages behind a login.
Drawbacks of cookies and sessions: a set of cookies and a session often corresponds to a single user, and requesting too fast or too often is easily recognized by the server as a crawler.

If you don't need a cookie, try not to use one. But to get the page behind a login, we have to send requests that carry cookies. We can use the Cookie directly to maintain the login state. Let's take Zhihu as an example: first log in to Zhihu and copy the Cookie content from the Headers in the browser.

* Copy the User-Agent and Cookie from the browser
* The request header fields and values must be consistent with those in the browser
* The value corresponding to the Cookie key in the headers parameter dictionary is a string
```python
import requests
import re

# Construct the request header dictionary
headers = {
    # User-Agent copied from the browser
    "user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    # Cookie copied from the browser
    "cookie": 'xxx -- this is the copied cookie string'
}
# The cookie string is carried inside the request header dictionary
response = requests.get('https://www.zhihu.com/creator', headers=headers)
data = re.findall('CreatorHomeAnalyticsDataItem-title.*?>(.*?)', response.text)
print(response.status_code)
print(data)
```

When we make the request without carrying cookies:
import requests import re headers = {"user-agent": ’Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit /537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36’} response = requests.get(’www.zhihu.com/creator’, headers=headers) data = re.findall(’CreatorHomeAnalyticsDataItem-title.?>(.?)’,response.text) print(response.status_code) print(data)
200 [] Empty in the printed output result , Both comparisons , Then make successful use of headers Parameters to carry cookie , Get the page that can only be accessed after logging in ! 2.4.3 cookies Use of parameters In the last section, we were headers Parameters carry cookie , You can also use special cookies Parameters * 1. cookies The form of the parameter : Dictionaries cookies = "cookie Of name":"cookie Of value" * The dictionary corresponds to... In the request header Cookie character string , With a semicolon 、 Space splits each pair of dictionary key value pairs * To the left of the equal sign is a cookie Of name , Corresponding cookies Dictionary key * The right side of the equal sign corresponds to cookies Dictionary value * 2.cookies How to use parameters response = requests.get(url, cookies) * 3. take cookie String conversion to cookies The dictionary required for the parameter : cookies_dict = { cookie . split ( ’=’ ) [ 0 ]: cookie . split ( ’=’ ) [- 1 ] for cookie in cookies_str . split ( ’; ’ ) } * 4. Be careful : cookie There is usually an expiration time , Once expired, you need to retrieve response = requests.get(url, cookies) import requests import re url = ’www.zhihu.com/creator’ cookies_str = ’ Copy of the cookies’ headers = {"user-agent": ’Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit /537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36’} cookies_dict = {cookie.split(’=’, 1)[0]:cookie.split(’=’, 1)[-1] for cookie in cookies_str.split(’; ’)}
```python
# Carry the cookies dictionary via the cookies parameter
resp = requests.get(url, headers=headers, cookies=cookies_dict)
data = re.findall('CreatorHomeAnalyticsDataItem-title.*?>(.*?)', resp.text)
print(resp.status_code)
print(data)
```

```
200
['How can I write an integration in python for these elements whose id differs but whose class is the same?', "My parents can't afford to buy me a computer, what should I do?", 'Describe your current life in one sentence?']
```

2.4.4 Building a RequestsCookieJar object to set cookies

Here we can also construct a RequestsCookieJar object to set the cookies. The sample code is as follows:

```python
import requests
import re

url = 'https://www.zhihu.com/creator'
cookies_str = '<the copied cookies>'
headers = {"user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
jar = requests.cookies.RequestsCookieJar()
for cookie in cookies_str.split(';'):
    key, value = cookie.split('=', 1)
    jar.set(key, value)
# Carry the RequestsCookieJar via the cookies parameter
resp = requests.get(url, headers=headers, cookies=jar)
data = re.findall('CreatorHomeAnalyticsDataItem-title.*?>(.*?)', resp.text)
print(resp.status_code)
print(data)
```

```
200
['How can I write an integration in python for these elements whose id differs but whose class is the same?', "My parents can't afford to buy me a computer, what should I do?", 'Describe your current life in one sentence?']
```

Here we first create a RequestsCookieJar object, split the copied cookies with split(), use the set() method to set the key and value of each cookie, and finally pass the jar to the cookies parameter of requests.get(). Of course, because of Zhihu's own restrictions the headers parameter is still required; you just no longer need to set the cookie field inside headers. After testing, logging in this way also works normally.

2.4.5 Converting a cookieJar object into a cookies dictionary

The response object obtained with requests has a cookies attribute. Its value is a cookieJar, which contains the cookies set locally by the remote server. How do we convert it into a cookies dictionary?

* 1. Conversion method: cookies_dict = requests.utils.dict_from_cookiejar(response.cookies)
* 2. Here response.cookies returns an object of type cookieJar
* 3. The requests.utils.dict_from_cookiejar function returns a cookies dictionary
```python
import requests
import re

url = 'https://www.zhihu.com/creator'
cookies_str = '<the copied cookies>'
headers = {"user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
cookies_dict = {cookie.split('=', 1)[0]: cookie.split('=', 1)[-1]
                for cookie in cookies_str.split('; ')}
# Carry the cookies dictionary via the cookies parameter
resp = requests.get(url, headers=headers, cookies=cookies_dict)
data = re.findall('CreatorHomeAnalyticsDataItem-title.*?>(.*?)', resp.text)
print(resp.status_code)
print(data)

# You can turn a dictionary into a requests.cookies.RequestsCookieJar object
cookiejar = requests.utils.cookiejar_from_dict(cookies_dict, cookiejar=None, overwrite=True)
type(cookiejar)     # requests.cookies.RequestsCookieJar
type(resp.cookies)  # requests.cookies.RequestsCookieJar
# The jar built in "Building a RequestsCookieJar object to set cookies" above has the same type
# Turn the cookieJar back into a dictionary
requests.utils.dict_from_cookiejar(cookiejar)
```

2.5 Setting a timeout

While surfing the Internet we often run into network fluctuations, and then a request may wait for a very long time without producing any result. In a crawler, a request that stays fruitless for a long time makes the whole project very inefficient, so we need to force the request to return a result within a specific time, or raise an error otherwise.

* 1. How the timeout parameter is used: response = requests.get(url, timeout=3)
* 2. timeout=3 means: after the request is sent, a response must come back within 3 seconds, otherwise an exception is thrown.
```python
url = 'www.tipdm.com/tipdm/index…'
# Set the timeout to 2 seconds
print('Timeout of 2:', requests.get(url, timeout=2))
```

```
Timeout of 2: <Response [200]>
```

If the timeout is too short, an error is raised:

```python
requests.get(url, timeout=0.1)   # a timeout this short causes the request to fail
```
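In a real crawler the timeout is usually combined with exception handling so that one slow site does not abort the whole run; here is a minimal sketch (the URL is just a placeholder):

```python
import requests

url = 'https://www.example.com'
try:
    response = requests.get(url, timeout=3)
    print('Status code:', response.status_code)
except requests.exceptions.Timeout:
    # raised when no response arrives within 3 seconds; retry or skip here
    print('The request timed out')
```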
Three. Sending POST requests with Requests

Think about it: where do we use POST requests?
- Logging in and registering (in the eyes of web engineers, POST is more secure than GET: the user's account, password and other information are not exposed in the URL)
- When large text content needs to be transmitted (a POST request has no requirement on the length of the data)
Correspondingly, our crawler also needs to simulate the browser and send POST requests in these two situations. Sending a POST request is actually very similar to GET; we just need to pass the parameters in data:

POST parameter description: post(url, data=None, json=None, **kwargs):

* url: the URL to request
* data: (optional) dictionary, list of tuples, bytes, or file-like object to send in the body of the Request
* json: (optional) JSON data to send in the body of the Request
* **kwargs: optional keyword arguments

```python
import requests

payload = {'key1': 'value1', 'key2': 'value2'}
req = requests.post("https://httpbin.org/post", data=payload)
print(req.text)
```

3.1 POSTing JSON data

Often the data you want to send is not form-encoded; I found this comes up a lot when crawling Java-based sites. If you pass a string instead of a dict, the data is posted as-is. We can use json.dumps() to turn the dict into a str ourselves; alternatively, instead of encoding the dict by hand, we can pass it directly with the json parameter and it will be encoded automatically:

```python
import json
import requests

url = 'https://httpbin.org/post'
payload = {'some': 'data'}
req1 = requests.post(url, data=json.dumps(payload))
req2 = requests.post(url, json=payload)
print(req1.text)
print(req2.text)
```

You can see that we get the result back, and the form/json part contains the submitted data, which proves the POST request was sent successfully.

Note: the requests module sends requests with three ways of carrying parameters: data, json and params. params is used in GET requests; data and json are used in POST requests. data can receive a dictionary, a string, bytes, or a file object.

* With the json parameter, whether the value is a str or a dict, if you don't specify the content-type in headers, it defaults to application/json.
* With the data parameter and a dict message, if you don't specify the content-type in headers, it defaults to application/x-www-form-urlencoded, the equivalent of an ordinary form submission, which converts the form data into key-value pairs; on the server side the data can then be read from request.POST, and request.body holds the key-value form a=1&b=2.
* With the data parameter and a str message, if you don't specify the content-type in headers, it defaults to application/json.

When submitting data with the data parameter, request.body contains a=1&b=2; when submitting with the json parameter, request.body contains '{"a": 1, "b": 2}'.
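The difference between data= and json= can be observed directly with httpbin.org, which echoes back the received headers and body; here is a minimal sketch illustrating the note above:

```python
import requests

payload = {'a': 1, 'b': 2}

# data=dict -> form-encoded body, Content-Type: application/x-www-form-urlencoded
r_form = requests.post('https://httpbin.org/post', data=payload)
print(r_form.json()['headers']['Content-Type'])
print(r_form.json()['form'])     # {'a': '1', 'b': '2'}

# json=dict -> JSON body, Content-Type: application/json
r_json = requests.post('https://httpbin.org/post', json=payload)
print(r_json.json()['headers']['Content-Type'])
print(r_json.json()['json'])     # {'a': 1, 'b': 2}
```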
3.2 Uploading files with POST

If we want to use the crawler to upload a file, we can use the files parameter:

```python
import requests

url = 'https://httpbin.org/post'
files = {'file': open('test.xlsx', 'rb')}
req = requests.post(url, files=files)
req.text
```

If you are familiar with web development you will know that when sending a very large file as a multipart/form-data request, you may want to stream the request. By default requests does not support this; you can use the third-party requests-toolbelt library.

3.3 Using a POST request to grab a web page

The main thing is to find the web page to be parsed:
```python
import requests

# Prepare the data to translate
kw = input("Please input the word to be translated: ")
ps = {"kw": kw}
# Prepare the disguised request headers.
# User-Agent (note the capitalization) identifies who is sending the request; we simply reuse a
# browser's identity string to forge the crawler request, so the server believes the request
# was initiated by a browser (hiding the crawler information).
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36 Edg/85.0.564.41"
}
# Send the POST request, attaching the form data to be translated as a dictionary
response = requests.post("https://fanyi.baidu.com/sug", data=ps, headers=headers)
# Print the returned data
print(response.content)
print(response.content.decode("unicode_escape"))
```

Four. Requests Advanced (1): Session maintenance

This part mainly introduces maintaining a session with Session, and using proxy IPs.

In requests, you can indeed simulate web requests directly with get(), post() and the other methods, but each call is effectively a separate session: it is like opening different pages in two different browsers. Imagine this scenario: the first request uses post() to log in to a website; the second time, wanting your personal information after the successful login, you use get() again to request the personal information page. In fact this is equivalent to opening two browsers, two completely unrelated sessions. Can you get the personal information successfully? Of course not.

Some readers may say: can't I just set the same cookies on both requests? You can, but it is obviously cumbersome; we have a simpler solution. The main way to solve this is to maintain the same session, equivalent to opening a new browser tab rather than a new browser. But if we don't want to set cookies every time, what should we do? That's where the new weapon comes in: the Session object. With it we can easily maintain a session without worrying about cookies; it handles them for us automatically.

The Session class of the requests module automatically handles the cookies produced while sending requests and receiving responses, thereby maintaining state. Let's learn it next.

4.1 The role and application scenarios of requests.session

* Role of requests.session: handles cookies automatically, i.e. the next request carries the previous cookie
* Application scenario: automatically handling cookies across multiple consecutive requests

4.2 How to use requests.session

After a session instance requests a website, the cookies set locally by the remote server are saved in the session; the next time this session is used to request that server, the previous cookies are carried along.

```python
session = requests.session()                  # instantiate a session object
response = session.get(url, headers=headers)  # same parameters as requests.get
response = session.post(url, data=data)       # same parameters as requests.post
```

The parameters for sending a get or post request through a session object are exactly the same as for the requests module.
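Before the GitHub example below, the cookie-carrying behaviour of a session can be verified against httpbin.org, which provides endpoints for setting and reading cookies; here is a minimal sketch (the httpbin endpoints are used purely for illustration):

```python
import requests

session = requests.session()
# The server sets a cookie; the session stores it automatically
session.get('https://httpbin.org/cookies/set/sessioncookie/123456789')
# The next request through the same session carries that cookie along
print(session.get('https://httpbin.org/cookies').json())   # {'cookies': {'sessioncookie': '123456789'}}

# A plain requests.get() starts from scratch, so no cookie is sent
print(requests.get('https://httpbin.org/cookies').json())  # {'cookies': {}}
```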
4.3 Using Session to maintain the GitHub login state

* Capture the whole flow of logging in to GitHub and visiting a page that can only be accessed after login
* Determine the URL, request method and required request parameters of the login request
  - Some request parameters are found in the response content of another URL and can be extracted with the re module
* Determine the URL and request method of the page that can only be accessed after login
* Complete the code with requests.session
```python
import requests
import re

# Construct the request header dictionary
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36',
}
# Instantiate a session object
session = requests.session()
# Visit the login page to obtain the parameters required by the login request
response = session.get('https://github.com/login', headers=headers)
# Use a regular expression to extract the parameter required by the login request
authenticity_token = re.search('name="authenticity_token" value="(.*?)" />', response.text).group(1)
# Construct the login request parameter dictionary
data = {
    'commit': 'Sign in',                       # fixed value
    'utf8': ' ',                               # fixed value
    'authenticity_token': authenticity_token,  # this parameter is in the response content of the login page
    'login': input('Enter the GitHub account: '),
    'password': input('Enter the GitHub password: ')
}
# Send the login request (no need to pay attention to this response)
session.post('https://github.com/session', headers=headers, data=data)
# Print a page that requires login to access
response = session.get('github.com/settings/pr…', headers=headers)
print(response.text)
```

You can proofread the result with a text-comparison tool!

Five. Requests Advanced (2): Using proxies

For some websites, a few requests during testing can fetch content normally. But once large-scale crawling starts, with large and frequent requests, the site may pop up a captcha, jump to a login/verification page, or even block the client IP outright, making it inaccessible for a while. To prevent this we need to set up a proxy, and that is where the proxies parameter comes in. The proxies parameter works by specifying a proxy IP, so that the forward proxy server behind that IP forwards the requests we send. So let's first learn about proxy IPs and proxy servers.

5.1 The process of using a proxy
- A proxy IP is an IP that points to a proxy server
- The proxy server forwards our requests to the target server on our behalf
5.2 Forward and reverse proxies

As mentioned earlier, the proxy IP specified by the proxies parameter points to a forward proxy server, so there is correspondingly a reverse proxy server. Let's look at the difference between the two:

* Forward and reverse proxies are distinguished from the perspective of the party that sends the request.
* A proxy that forwards requests on behalf of the browser or client (the party sending the request) is called a forward proxy.
  - The browser knows the real IP address of the server that finally handles the request, e.g. a VPN.
* A proxy that does not forward requests for the browser or client, but instead forwards requests to the server that ultimately handles them, is called a reverse proxy.
  - The browser does not know the real address of the server, e.g. nginx.
5.3 Classification of proxy IPs (proxy servers)

According to the degree of anonymity, proxy IPs can be divided into three categories:

* Transparent proxy (Transparent Proxy): a transparent proxy can "hide" your IP address directly, but the target can still find out who you are. The request headers received by the target server look like:
  REMOTE_ADDR = Proxy IP
  HTTP_VIA = Proxy IP
  HTTP_X_FORWARDED_FOR = Your IP
* Anonymous proxy (Anonymous Proxy): with an anonymous proxy, others can only know that you used a proxy, not who you are. The request headers received by the target server look like:
  REMOTE_ADDR = Proxy IP
  HTTP_VIA = Proxy IP
  HTTP_X_FORWARDED_FOR = Proxy IP
* Elite proxy (high anonymity proxy): a high-anonymity proxy makes it impossible for others to detect that you are using a proxy at all, so it is the best choice. There is no doubt that high-anonymity proxies work best. The request headers received by the target server look like:
  REMOTE_ADDR = Proxy IP
  HTTP_VIA = not determined
  HTTP_X_FORWARDED_FOR = not determined

Depending on the protocol used by the website, you need a proxy service for the corresponding protocol. By protocol, proxy services can be divided into:

* http proxy: the target url uses the http protocol
* https proxy: the target url uses the https protocol
* socks tunnel proxy (e.g. socks5), where:
  1. A socks proxy simply passes packets along and does not care about the application protocol (FTP, HTTP, HTTPS, etc.).
  2. A socks proxy takes less time than an http or https proxy.
  3. A socks proxy can forward both http and https requests.

5.4 Using the proxies parameter

To keep the server from thinking that the same client is making all the requests, and to keep our IP from being blocked for requesting one domain too often, we need to use proxy IPs. So let's learn the basic usage of proxy IPs in the requests module:

response = requests.get(url, proxies=proxies)

proxies takes the form of a dictionary:

```python
proxies = {
    "http": "http://12.34.56.79:9527",
    "https": "https://12.34.56.79:9527",
}
```

Note: if the proxies dictionary contains multiple key-value pairs, the proxy IP matching the protocol of the requested URL is chosen when the request is sent.

```python
import requests

proxies = {
    "http": "http://124.236.111.11:80",
    "https": "https://183.220.145.3:8080"
}
req = requests.get('https://www.baidu.com', proxies=proxies)
req.status_code
```
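Free proxy IPs fail often, so in practice the proxies parameter is usually combined with a timeout and exception handling; here is a minimal sketch (the proxy addresses are the placeholder values above and will most likely be dead):

```python
import requests

proxies = {
    "http": "http://124.236.111.11:80",
    "https": "https://183.220.145.3:8080",
}
try:
    req = requests.get('https://www.baidu.com', proxies=proxies, timeout=5)
    print(req.status_code)
except (requests.exceptions.ProxyError, requests.exceptions.ConnectTimeout) as e:
    # The proxy refused the connection or took too long: switch to another proxy IP
    print('Proxy failed:', e)
```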
Six. Requests Advanced (3): SSL certificate verification

In addition, requests provides certificate verification. When sending an HTTPS request, it checks the SSL certificate, and we can use the verify parameter to control whether this check happens. In fact, if you don't pass verify, the default is True and verification is done automatically. Let's test it with requests:

```python
import requests

url = 'cas.xijing.edu.cn/xjtyrz/logi…'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
req = requests.get(url, headers=headers)
```
```
SSLError: HTTPSConnectionPool(host='cas.xijing.edu.cn', port=443): Max retries exceeded with url: /xjtyrz/login (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)')))
```

An SSLError is raised here, indicating a certificate verification error. So if you request an HTTPS site whose certificate fails verification, an error is raised. How do we avoid this? It's simple: set the verify parameter to False. The relevant code is as follows:

```python
import requests

url = 'https://www.jci.edu.cn/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
req = requests.get(url, headers=headers, verify=False)
req.status_code
```
```
200
```

It's hard to find a web page that skips SSL verification these days, how annoying! However, a warning is printed suggesting that we assign it a certificate. We can mask this warning by setting it to be ignored:

```python
import requests
from requests.packages import urllib3

urllib3.disable_warnings()
url = 'https://www.jci.edu.cn/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
req = requests.get(url, headers=headers, verify=False)
req.status_code
```
```
200
```

Or ignore the warning by capturing it into the log:
```python
import logging
import requests

logging.captureWarnings(True)
url = 'https://www.jci.edu.cn/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
req = requests.get(url, headers=headers, verify=False)
req.status_code
```
```
200
```

Of course, we can also specify a local certificate as the client certificate, either as a single file (containing the key and the certificate) or as a tuple of two file paths:

```python
import requests

response = requests.get('https://www.12306.cn', cert=('./path/server…'))
print(response.status_code)
```
```
200
```

Of course, the code above is only a demonstration: we need to actually have the crt and key files and specify their paths. Note that the local private key must be decrypted; an encrypted key is not supported. There are few such websites now!

Seven. Other features of the Requests library

7.1 Viewing the response content

After the request is sent, what naturally comes back is the response. In the examples above we used text and content to get the body of the response. In addition, there are many properties and methods for getting other information, such as the status code, the response headers, cookies, and so on. For example:
```python
import requests

url = 'https://www.baidu.com'
req = requests.get(url)
print(req.status_code)  # status code of the response
print(req.text)         # text content of the response
print(req.content)      # binary content of the response
print(req.cookies)      # cookies of the response
print(req.encoding)     # encoding of the response
print(req.headers)      # headers of the response
print(req.url)          # URL of the response
print(req.history)      # history (redirects) of the response
```
7.2 Checking the status code and encoding

rqg.status_code gives the status code returned by the server, while rqg.encoding gives the web page encoding derived from the HTTP headers returned by the server. Note that when requests guesses the encoding wrong, you need to specify it manually to avoid garbled characters in the returned page content.

7.3 Sending a GET request and specifying the encoding manually

Code 1-2: send a GET request and specify the encoding manually.

```python
url = 'www.tipdm.com/tipdm/index…'
rqg = requests.get(url)
print('Status code:', rqg.status_code)
print('Encoding:', rqg.encoding)
rqg.encoding = 'utf-8'  # specify the encoding manually
print('Modified encoding:', rqg.encoding)
print(rqg.text)
```
```
Status code: 200
Encoding: ISO-8859-1
Modified encoding: utf-8
```

Note: specifying the encoding manually is inflexible and cannot adapt to the different page encodings met while crawling, whereas using the chardet library is simple and flexible. chardet is an excellent string/file encoding detection module.

7.4 Using the chardet library

The detect method of the chardet library detects the encoding of a given byte string. Its syntax is:

chardet.detect(byte_str)

Common parameter of the detect method: byte_str receives the byte string whose encoding is to be detected. It has no default.
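As a quick illustration of detect() on its own, independent of any web page (the byte string here is just an example, and the exact confidence value will vary):

```python
import chardet

sample = 'Requests 是一个优雅而简单的 HTTP 库'.encode('utf-8')
print(chardet.detect(sample))
# typically reports an encoding such as 'utf-8' together with a confidence value
```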
7.5 Detecting the encoding with detect and applying it

Code 1-3: use the detect method to detect the encoding and apply it.

```python
import chardet
import requests

url = 'www.tipdm.com/tipdm/index…'
rqg = requests.get(url)
print(rqg.encoding)
print(chardet.detect(rqg.content))
# Access the dictionary element to apply the detected encoding
rqg.encoding = chardet.detect(rqg.content)['encoding']
print(rqg.encoding)
```

```
ISO-8859-1
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
utf-8
```

7.6 A comprehensive requests test

Send a complete GET request to the website 'www.tipdm.com/tipdm/index…'. The request includes the link, request headers, response headers, timeout and status code, with the encoding set correctly.

Code 1-6: generate a complete HTTP request.
```python
# Import the related libraries
import requests
import chardet

# Set the url
url = 'www.tipdm.com/tipdm/index…'
# Set the request headers
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"}
# Generate the GET request and set the timeout to 2 seconds
rqg = requests.get(url, headers=headers, timeout=2)
# Check the status code
print('Status code:', rqg.status_code)
# Check the encoding
print('Encoding:', rqg.encoding)
# Use the detect method of the chardet library to correct the encoding
rqg.encoding = chardet.detect(rqg.content)['encoding']
# Check the corrected encoding
print('Corrected encoding:', rqg.encoding)
# View the response headers
print('Response headers:', rqg.headers)
# View the page content
# print(rqg.text)
```

```
Status code: 200
Encoding: ISO-8859-1
Corrected encoding: utf-8
Response headers: {'Date': 'Mon, 18 Nov 2019 06:28:56 GMT', 'Server': 'Apache-Coyote/1.1', 'Accept-Ranges': 'bytes', 'ETag': 'W/"15693-1562553126764"', 'Last-Modified': 'Mon, 08 Jul 2019 02:32:06 GMT', 'Content-Type': 'text/html', 'Content-Length': '15693', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive'}
```
copyright notice
author[User 7634986061731],Please bring the original link to reprint, thank you.
https://en.pythonmana.com/2022/02/202202011909442208.html