
Python crawler series - network requests

2022-02-01 02:25:33 Internet Lao Xin

This is day 16 of my participation in the November More-Text Challenge. For event details, see: 2021 Last More-Text Challenge.

Let's start with urllib.

Introduction to urllib

urllib is Python's standard library for network requests. It requires no installation; you just import it. It is mainly used for crawler development, API data acquisition, and testing.

The urllib library consists of four modules:

  • urllib.request: for opening and reading URLs
  • urllib.error: contains the exceptions raised by urllib.request
  • urllib.parse: for parsing URLs
  • urllib.robotparser: for parsing robots.txt (see the quick sketch below)
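The rest of this article uses urllib.request and urllib.parse. As a quick taste of the last module, here is a minimal sketch of urllib.robotparser; the target site is just an illustrative assumption:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://www.baidu.com/robots.txt')  # assumed example site
rp.read()  # download and parse robots.txt
# Ask whether a given user agent may fetch a given path
print(rp.can_fetch('*', 'https://www.baidu.com/s'))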

Example

# Author: Internet Lao Xin
# Created: 2021/4/5 8:23
import urllib.parse

kw = {'wd': 'Internet Lao Xin'}
# URL-encode the query parameters
result = urllib.parse.urlencode(kw)
print(result)
# Decode back to the original string
res = urllib.parse.unquote(result)
print(res)

[screenshot] In the browser, the keyword 'Internet Lao Xin' is converted into a non-Chinese (percent-encoded) form.

Search for the same keyword in your browser, then copy the URL you're browsing: [screenshot]

www.baidu.com/s?ie=utf-8&…

Look closely: the bold part is exactly the wd result our code printed.
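Putting the two halves together, here is a minimal sketch (the keyword is just an assumed example) of building a complete Baidu search URL from a parameter dict:

import urllib.parse

params = {'ie': 'utf-8', 'wd': 'Internet Lao Xin'}  # assumed example keyword
# urlencode percent-encodes the values and joins them into a query string
url = 'https://www.baidu.com/s?' + urllib.parse.urlencode(params)
print(url)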

Sending a request

  • The urllib.request module

Simulates a browser initiating an HTTP request and obtains the response.

  • The signature of urllib.request.urlopen:

urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Parameter description: url: the address to visit, as a str, e.g. https://www.baidu.com; data: defaults to None. The urlopen function returns an http.client.HTTPResponse object.
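One detail worth noting: if you pass data (it must be bytes), urlopen sends a POST instead of a GET. A minimal sketch, using httpbin.org as an assumed test endpoint:

import urllib.parse
import urllib.request

# urlencode the form fields, then encode to bytes as data requires
data = urllib.parse.urlencode({'name': 'laoxin'}).encode('utf-8')
resp = urllib.request.urlopen('https://httpbin.org/post', data=data)
print(resp.read().decode('utf-8'))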

Code example

GET request

# Author: Internet Lao Xin
# Created: 2021/4/5 8:23
import urllib.request

url = "http://www.geekyunwei.com/"
resp = urllib.request.urlopen(url)
html = resp.read().decode('utf-8')  # decode the response bytes as utf-8
print(html)

Why decode as utf-8 rather than gbk? Check the page source to see which charset the page declares: [screenshot]
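Rather than hardcoding the encoding, you can also ask the response itself. A minimal sketch, reusing the URL above; get_content_charset() reads the charset the server declares in its Content-Type header:

import urllib.request

resp = urllib.request.urlopen("http://www.geekyunwei.com/")
# Fall back to utf-8 if the server declares no charset
charset = resp.headers.get_content_charset() or 'utf-8'
html = resp.read().decode(charset)
print(charset)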

Sending a request with Request

Let's try crawling Douban:

# Author: Internet Lao Xin
# Created: 2021/4/5 8:23
import urllib.request

url = "https://movie.douban.com/"

resp = urllib.request.urlopen(url)
print(resp)

Douban has an anti-crawler strategy and responds directly with a 418 error: [screenshot] To get past this, we need to disguise the request headers. In the browser's developer tools we find the User-Agent:

User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3861.400 QQBrowser/10.7.4313.400

# Author: Internet Lao Xin
# Created: 2021/4/5 8:23
import urllib.request

url = "https://movie.douban.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3861.400 QQBrowser/10.7.4313.400'}

# Build the request object
req = urllib.request.Request(url, headers=headers)
# Open the request with urlopen
resp = urllib.request.urlopen(req)
# Read the data from the response
html = resp.read().decode('utf-8')
print(html)

With this, we've successfully used Python to disguise itself as a browser and fetch the data.
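If you want to confirm the disguise worked before dumping the whole page, you can inspect the status line and headers of the response. A minimal sketch (the User-Agent is whatever you copied from your own browser):

import urllib.request

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36'}
req = urllib.request.Request('https://movie.douban.com/', headers=headers)
resp = urllib.request.urlopen(req)
print(resp.status)                     # 200 on success instead of the earlier 418
print(resp.getheader('Content-Type'))  # e.g. text/html; charset=utf-8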

IP proxies

Using an opener: build your own opener to send the request.

# Author: Internet Lao Xin
# Created: 2021/4/5 8:23
import urllib.request

url = "https://www.baidu.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3861.400 QQBrowser/10.7.4313.400'}

# Build the request object
req = urllib.request.Request(url, headers=headers)

# Build an opener, then use it to open the request
opener = urllib.request.build_opener()
resp = opener.open(req)
print(resp.read().decode())

If you keep sending requests, the site may ban your IP, so we switch to a different IP proxy every so often.

Types of IP proxies:

  • Transparent proxy: the target site knows you're using a proxy and also knows your source IP address, which clearly defeats our purpose
  • Anonymous proxy: the site knows you're using a proxy, but doesn't know your source IP
  • High-anonymity proxy: the safest option; the target site doesn't even know you're using a proxy

Sources of IP proxies: free: www.xicidaili.com/nn/ ; paid services such as Daxiang Proxy, Kuaidaili, and Zhima Proxy.

# Author: Internet Lao Xin
# Created: 2021/4/5 8:23
from urllib.request import build_opener
from urllib.request import ProxyHandler

# Route https requests through the proxy server
proxy = ProxyHandler({'https': '222.184.90.241:4278'})

opener = build_opener(proxy)

url = 'https://www.baidu.com/'
resp = opener.open(url)
print(resp.read().decode('utf-8'))

Baidu also has anti-crawling measures; even a high-anonymity proxy can't guarantee a 100% bypass.
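To rotate proxies as suggested above, one simple approach is to pick a random proxy from a pool on each request. A minimal sketch; the pool entries are placeholders you would replace with live proxies from the sources listed earlier:

import random
from urllib.request import ProxyHandler, build_opener

# Hypothetical pool; fill in working proxies of your own
proxy_pool = ['10.0.0.1:8080', '10.0.0.2:8080', '10.0.0.3:8080']

def fetch(url):
    # Build a fresh opener around a randomly chosen proxy
    proxy = ProxyHandler({'https': random.choice(proxy_pool)})
    opener = build_opener(proxy)
    return opener.open(url, timeout=10).read()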

Using cookies

Why use cookies? Mainly to work around the statelessness of HTTP.

Steps:

  • Instantiate MozillaCookieJar (to save the cookies)
  • Create a handler object (the cookie processor)
  • Create an opener object
  • Open the web page (send the request and get the response)
  • Save the cookie file

Example: fetch the Baidu Tieba cookies and store them


# Author: Internet Lao Xin
import urllib.request
from http import cookiejar

filename = 'cookie.txt'

def get_cookie():
    # Instantiate MozillaCookieJar (saves cookies in Mozilla format)
    cookie = cookiejar.MozillaCookieJar(filename)
    # Create the handler object (the cookie processor)
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    # Request the URL
    url = 'https://tieba.baidu.com/f?kw=python3&fr=index'
    resp = opener.open(url)
    # Save the cookies
    cookie.save()

# Read the data back
def use_cookie():
    # Instantiate MozillaCookieJar
    cookie = cookiejar.MozillaCookieJar()
    # Load the cookie file
    cookie.load(filename)
    print(cookie)

if __name__ == '__main__':
    use_cookie()
    # get_cookie()
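One caveat: by default MozillaCookieJar skips session cookies (those without an expiry time) when saving and loading. Pass ignore_discard=True (and ignore_expires=True for expired ones) to keep them. A minimal sketch of loading with those flags:

from http import cookiejar

cookie = cookiejar.MozillaCookieJar()
# ignore_discard: also load session cookies; ignore_expires: also load expired ones
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
print(cookie)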

Exception handling

Let's crawl a website we can't access, so we can catch the exception:

# Author: Internet Lao Xin
# Created: 2021/4/6 7:38
import urllib.request
import urllib.error

url = 'https://www.google.com'
try:
    resp = urllib.request.urlopen(url)
except urllib.error.URLError as e:
    print(e.reason)
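For finer-grained handling, note that urllib.error.HTTPError (raised for HTTP error status codes, like the 418 we got from Douban) is a subclass of URLError, so catch it first. A minimal sketch:

import urllib.request
import urllib.error

try:
    resp = urllib.request.urlopen('https://movie.douban.com/')
except urllib.error.HTTPError as e:
    print(e.code, e.reason)   # HTTP-level errors, e.g. 418
except urllib.error.URLError as e:
    print(e.reason)           # network-level failures (DNS, connection refused, ...)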

You can see the exception was caught: [screenshot] That wraps up network requests. Next we'll learn a few common libraries, and then you can start crawling data.

Copyright notice
Author: Internet Lao Xin. Please include a link to the original when reprinting, thank you.
https://en.pythonmana.com/2022/02/202202010225321720.html
