
Python crawler notes: use proxy to prevent local IP from being blocked

2022-02-01 16:41:55 Clever crane

This is the 13th day of my participation in the November Gengwen Challenge. Check out the event details: 2021 Last Gengwen Challenge.

 

Using a proxy is a common countermeasure against anti-crawler mechanisms. Many websites monitor how often a given external IP address hits the server within a certain time window. If the number or pattern of requests violates the site's security policy, that IP is banned from accessing the server. Crawler authors can therefore route requests through proxy servers, hiding their real IP address and avoiding the ban.

In urllib, ProxyHandler is used to configure requests to go through a proxy server.

There are generally two kinds of proxies on the network: free proxies and paid proxies. Free proxies can be found through Baidu/Google searches, or on sites such as Xici Free Proxy IP, Kuaidaili's free proxy list, Proxy360, and other proxy-IP listing sites...

  • Free open proxies are usually shared by many users, and they tend to have short lifespans, slow speeds, low anonymity, and unstable HTTP/HTTPS support (as the saying goes, you get what you pay for).
  • Professional crawler engineers and crawler companies use high-quality private proxies instead. These are usually bought from a dedicated proxy vendor and authorized with a username/password (as the saying goes, you can't catch a wolf without sacrificing a lamb).

You can maintain a list of proxies and pick from it at random under some timing strategy, so that the server cannot easily block your access.
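As a minimal sketch of that idea (the proxy addresses below are placeholders, and the delay range is an arbitrary choice), a loop might pick a random proxy and wait a random interval between requests:

import random
import time
import urllib.request

# Placeholder proxies; substitute addresses that actually work for you
proxies = [
    {'http': '10.0.0.1:8080'},
    {'http': '10.0.0.2:8080'},
]

urls = ['http://www.baidu.com'] * 3  # pages to fetch

for url in urls:
    proxy = random.choice(proxies)  # a different random proxy for each request
    opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxy))
    try:
        response = opener.open(url, timeout=5)
        print(proxy, response.status)
    except OSError as e:            # URLError is a subclass of OSError
        print(proxy, 'failed:', e)
    time.sleep(random.uniform(1, 3))  # random delay: the "timing strategy"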

# Using a proxy
# demo 1: use ProxyHandler with a free proxy to visit the target site
import urllib.request

request = urllib.request.Request('http://www.baidu.com')

# The proxy here is a free one found online; there is no telling when it will expire
proxy_support = urllib.request.ProxyHandler({'http':'210.1.58.212:8080'})

opener = urllib.request.build_opener(proxy_support)
response = opener.open(request)
print(response.read().decode('utf-8'))
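One detail worth noting: the key in the ProxyHandler dict is the URL scheme, so the mapping above only applies to http:// URLs. To proxy https:// requests as well, you would add an 'https' entry (the same placeholder address is repeated here purely for illustration):

proxy_support = urllib.request.ProxyHandler({
    'http': '210.1.58.212:8080',
    'https': '210.1.58.212:8080',  # needed for https:// URLs to go through the proxy
})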
# Using a proxy
# demo 2: using a proxy that requires authentication

import urllib.request

# The username, password and proxy below are made up; replace them with ones that actually work
# What matters is the flow itself; once you understand the process, the rest follows
username = 'leo'
password = 'leo'
proxydict = {'http':'106.185.26.199:25'}

proxydict['http'] = username + ':' + password + '@' + proxydict['http']
httpWithProxyHandler = urllib.request.ProxyHandler(proxydict)

opener = urllib.request.build_opener(httpWithProxyHandler)
request = urllib.request.Request('http://www.baidu.com')

resp = opener.open(request)
print(resp.read().decode('utf-8'))
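If the username or password contains characters such as '@' or ':', embedding them directly in the user:pass@host string breaks the URL. A sketch of the usual workaround (the credentials here are, again, made up) is to percent-encode them first with urllib.parse.quote:

from urllib.parse import quote

username = 'leo@example'  # contains '@', so it must be encoded
password = 'p:ss'         # contains ':', same problem
proxy = quote(username, safe='') + ':' + quote(password, safe='') + '@106.185.26.199:25'
proxydict = {'http': proxy}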
# Using a proxy
# demo 3: the flow above, rewritten the way urllib recommends

import urllib.request

username = 'leo'
password = 'leo'
proxyserver = '106.185.26.199:25'   # a plain string, not a set

# 1. Build a password manager object to store the username and password
passwordMgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()

# 2. Add the user information. The first parameter, realm, identifies the authentication
#    domain on the remote server; None matches any realm (the actual realm can be seen
#    in the response headers). The remaining three parameters are: proxy server, username, password
passwordMgr.add_password(None, proxyserver, username, password)

# 3. Build a Handler object for proxy basic username/password authentication,
#    passing the password manager as its argument
proxyauth_handler = urllib.request.ProxyBasicAuthHandler(passwordMgr)

# 4. Build the opener with build_opener(). A plain ProxyHandler is also needed so that
#    requests are actually routed through the proxy; the auth handler only answers
#    the proxy's authentication challenge
proxy_handler = urllib.request.ProxyHandler({'http': proxyserver})
opener = urllib.request.build_opener(proxy_handler, proxyauth_handler)

# 5. Construct the request
request = urllib.request.Request('http://www.baidu.com')

# 6. Send the request with the customized opener
response = opener.open(request)

# 7. Print the response
print(response.read().decode('utf-8'))
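To check that traffic is really going out through the proxy, one option (a sketch that assumes httpbin.org is reachable from your network) is to fetch an IP-echo endpoint and compare the reported address with the proxy's:

import urllib.request

opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({'http': '106.185.26.199:25'})  # the made-up proxy from above
)
# httpbin echoes back the IP address it sees; it should match the proxy, not your own
print(opener.open('http://httpbin.org/ip', timeout=5).read().decode('utf-8'))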
# Using a proxy
# demo 4: use proxy lists obtained from http://www.goubanjia.com/ or www.kuaidaili.com/dps
# Kuaidaili also lets you test a proxy's usability online

import random
import urllib.request

# Free proxies taken from the Kuaidaili website on October 18
proxylist = [
    {'http':'210.1.58.212:8080'},
    {'http':'106.185.26.199:25'},
    {'http':'124.206.192.210:38621'},
    {'http':'222.249.224.61:48114'},
    {'http':'115.218.217.184:9000'},
    {'http':'183.129.244.17:10010'},
    {'http':'120.26.199.103:8118'},
]

def randomTryProxy(retry):
    '''Pick a proxy from the proxy list RANDOMLY and try it. retry: number of retries left.'''
    # Strategy 1: random choice
    try:
        proxy = random.choice(proxylist)

        print('Try %s : %s' % (retry, proxy))

        httpProxyHandler = urllib.request.ProxyHandler(proxy)
        opener = urllib.request.build_opener(httpProxyHandler)
        request = urllib.request.Request('http://www.baidu.com')
        response = opener.open(request, timeout=5)

        print('Worked !')

    except Exception:
        print('Connect error:Please retry')
        if retry > 0:
            randomTryProxy(retry - 1)
        
def inorderTryProxy(proxy):
    '''Try the given proxy; called for each proxy in the list IN ORDER. proxy: the proxy dict to try.'''
    # Strategy 2: try the proxies one by one
    try:

        print('Try %s ' % (proxy))

        httpProxyHandler = urllib.request.ProxyHandler(proxy)
        opener = urllib.request.build_opener(httpProxyHandler)
        request = urllib.request.Request('http://www.baidu.com')
        response = opener.open(request, timeout=5)

        print('Worked !')

    except Exception:
        print('Connect error:Please retry')
        
        
if __name__ == '__main__':
    # Random selection suits the case where most proxies in the list are usable
    randomTryProxy(5)
    print('--' * 20)
    # Trying in order suits the case where most proxies in the list are unusable
    for p in proxylist:
        inorderTryProxy(p)
Running results:

Try 5 : {'http': '115.218.217.184:9000'}
Connect error:Please retry
Try 4 : {'http': '115.218.217.184:9000'}
Connect error:Please retry
Try 3 : {'http': '222.249.224.61:48114'}
Connect error:Please retry
Try 2 : {'http': '210.1.58.212:8080'}
Worked !
----------------------------------------
Try {'http': '210.1.58.212:8080'} 
Worked !
Try {'http': '106.185.26.199:25'} 
Connect error:Please retry
Try {'http': '124.206.192.210:38621'} 
Connect error:Please retry
Try {'http': '222.249.224.61:48114'} 
Connect error:Please retry
Try {'http': '115.218.217.184:9000'} 
Connect error:Please retry
Try {'http': '183.129.244.17:10010'} 
Connect error:Please retry
Try {'http': '120.26.199.103:8118'} 
Connect error:Please retry
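Building on demo 4, a natural next step is to filter the raw list down to the proxies that currently work and keep only those in rotation. A minimal sketch (reusing the proxylist above; the test URL and timeout are arbitrary choices):

def filterWorkingProxies(proxies, test_url='http://www.baidu.com', timeout=5):
    '''Return the subset of proxies that answered within the timeout.'''
    working = []
    for proxy in proxies:
        opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxy))
        try:
            opener.open(test_url, timeout=timeout)
            working.append(proxy)
        except Exception:
            pass  # dead or slow proxy: drop it
    return working

workingList = filterWorkingProxies(proxylist)
print('%d of %d proxies usable' % (len(workingList), len(proxylist)))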

Copyright notice
Author: Clever crane. Please include a link to the original when reprinting, thank you.
https://en.pythonmana.com/2022/02/202202011641544353.html
