Python crawler notes: using a proxy to keep your local IP from being blocked
2022-02-01 16:41:55 【Clever crane】
This is day 13 of my participation in the November More Text Challenge. Check out the event details: 2021 Last More Text Challenge.
Using proxies is a common countermeasure against anti-crawler mechanisms. Many websites track how often a given external IP address hits their servers over a period of time, and if the volume or pattern of requests violates their security policy, they ban that IP from accessing the server. Crawler authors can therefore route traffic through proxy servers, hiding their real IP address and escaping the ban.
In urllib, the ProxyHandler class is used to route requests through a proxy server.
Proxies found on the net generally fall into two categories: free proxies and paid proxies. Free proxies can be found through a Baidu/Google search, or on sites such as Xici free proxy IP, Kuaidaili's free proxy list, Proxy360, and other proxy IP sites...
- Free, open proxies are typically shared by many users and suffer from short lifetimes, slow speeds, low anonymity, unstable HTTP/HTTPS support, and similar drawbacks (as the saying goes, you get what you pay for)
- Professional crawler engineers and crawler companies use high-quality private proxies, usually bought from a dedicated proxy vendor and authorized with a username/password (as the saying goes, nothing ventured, nothing gained)
You can maintain a list of proxies and pick from it at random under some time-based policy, so that the server never sees enough traffic from one IP to block you. A minimal rotation sketch follows; demo 4 below fleshes out the selection strategies.
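The post only names the time-based strategy, so here is a minimal sketch of what random rotation on a fixed interval could look like. The ProxyRotator name and the 60-second default are illustrative assumptions, not from the original.
import random
import time

class ProxyRotator:
    def __init__(self, proxies, interval=60):
        self.proxies = proxies      # list of {'http': 'host:port'} dicts
        self.interval = interval    # seconds to stick with one proxy (arbitrary default)
        self._current = None
        self._since = 0.0

    def get(self):
        # Re-pick a random proxy once the current one has been used long enough
        now = time.time()
        if self._current is None or now - self._since > self.interval:
            self._current = random.choice(self.proxies)
            self._since = now
        return self._current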
# Using a proxy
# demo 1: use ProxyHandler to visit the target site through a free proxy
import urllib.request

request = urllib.request.Request('http://www.baidu.com')
# This is a free proxy found online; there is no telling when it will stop working
proxy_support = urllib.request.ProxyHandler({'http': '210.1.58.212:8080'})
opener = urllib.request.build_opener(proxy_support)
response = opener.open(request)
print(response.read().decode('utf-8'))
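One way to confirm that traffic really leaves through the proxy is to fetch a page that echoes the caller's IP. The sketch below uses http://httpbin.org/ip as an example echo service; that endpoint is an assumption, not part of the original post, and any similar service will do.
import urllib.request

# Same free proxy as above; expect the echoed IP to be the proxy's, not yours
proxy_support = urllib.request.ProxyHandler({'http': '210.1.58.212:8080'})
opener = urllib.request.build_opener(proxy_support)
with opener.open('http://httpbin.org/ip', timeout=5) as response:
    print(response.read().decode('utf-8'))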
# Using a proxy
# demo 2: using a proxy that requires authentication
import urllib.request

# The username, password, and proxy below are made up; substitute working
# values of your own. What matters here is the flow, not the concrete data.
username = 'leo'
password = 'leo'
proxydict = {'http': '106.185.26.199:25'}
# Embed the credentials in user:password@host:port form
proxydict['http'] = username + ':' + password + '@' + proxydict['http']
httpWithProxyHandler = urllib.request.ProxyHandler(proxydict)
opener = urllib.request.build_opener(httpWithProxyHandler)
request = urllib.request.Request('http://www.baidu.com')
resp = opener.open(request)
print(resp.read().decode('utf-8'))
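One caveat about the user:password@host form used above: if the username or password contains reserved characters such as ':' or '@', they must be percent-encoded before being embedded, or urllib will split the proxy string in the wrong place. A small sketch with placeholder credentials:
from urllib.parse import quote

username = 'leo'                # placeholder credentials
password = 'p@ss:word'          # contains '@' and ':' on purpose
proxy = '106.185.26.199:25'
# quote(..., safe='') encodes every reserved character in the credential
auth = quote(username, safe='') + ':' + quote(password, safe='')
proxydict = {'http': auth + '@' + proxy}
print(proxydict)   # {'http': 'leo:p%40ss%3Aword@106.185.26.199:25'}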
# Using a proxy
# demo 3: the approach urllib recommends for authenticated proxies
import urllib.request

username = 'leo'
password = 'leo'
proxyserver = '106.185.26.199:25'
# 1. Build a password manager object to hold the username and password
passwordMgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# 2. Add the user information. The first parameter, realm, names the auth domain
#    on the remote server; None means the default realm (you can check the realm
#    in the response headers). The remaining three parameters are:
#    proxy server, username, password
passwordMgr.add_password(None, proxyserver, username, password)
# 3. Build a Handler that performs proxy basic auth, fed with the password manager
proxyauth_handler = urllib.request.ProxyBasicAuthHandler(passwordMgr)
# 4. Build the opener object with build_opener()
opener = urllib.request.build_opener(proxyauth_handler)
# 5. Construct the request
request = urllib.request.Request('http://www.baidu.com')
# 6. Send the request with the custom opener
response = opener.open(request)
# 7. Print the response
print(response.read().decode('utf-8'))
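If every request in the program should go through the authenticated proxy, urllib.request.install_opener() registers the opener as the process-wide default, after which plain urlopen() calls use it automatically:
# Register the opener built in demo 3 as the default for this process
urllib.request.install_opener(opener)
# From here on urlopen() goes through the authenticated proxy
response = urllib.request.urlopen('http://www.baidu.com')
print(response.status)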
# Using a proxy
# demo 4: use proxy lists collected from http://www.goubanjia.com/ or www.kuaidaili.com/dps
# Kuaidaili also offers an online tool for checking whether a proxy still works
import random
import urllib.request

# Free proxies taken from the Kuaidaili site on October 18
proxylist = [
    {'http': '210.1.58.212:8080'},
    {'http': '106.185.26.199:25'},
    {'http': '124.206.192.210:38621'},
    {'http': '222.249.224.61:48114'},
    {'http': '115.218.217.184:9000'},
    {'http': '183.129.244.17:10010'},
    {'http': '120.26.199.103:8118'},
]

def randomTryProxy(retry):
    """Strategy 1: pick a proxy from the list at RANDOM.
    retry: number of retries left"""
    try:
        proxy = random.choice(proxylist)
        print('Try %s : %s' % (retry, proxy))
        httpProxyHandler = urllib.request.ProxyHandler(proxy)
        opener = urllib.request.build_opener(httpProxyHandler)
        request = urllib.request.Request('http://www.baidu.com')
        response = opener.open(request, timeout=5)
        print('Worked !')
    except OSError:    # URLError and socket timeouts both derive from OSError
        print('Connect error:Please retry')
        if retry > 0:
            randomTryProxy(retry - 1)

def inorderTryProxy(proxy):
    """Strategy 2: try the given proxy; the caller walks the list IN ORDER.
    proxy: the proxy dict to try"""
    try:
        print('Try %s ' % (proxy))
        httpProxyHandler = urllib.request.ProxyHandler(proxy)
        opener = urllib.request.build_opener(httpProxyHandler)
        request = urllib.request.Request('http://www.baidu.com')
        response = opener.open(request, timeout=5)
        print('Worked !')
    except OSError:
        print('Connect error:Please retry')

if __name__ == '__main__':
    # Random selection suits lists where most of the proxies still work
    randomTryProxy(5)
    print('--' * 20)
    # Trying each entry in turn suits lists where most of the proxies are dead
    for p in proxylist:
        inorderTryProxy(p)
Running results:
Try 5 : {'http': '115.218.217.184:9000'}
Connect error:Please retry
Try 4 : {'http': '115.218.217.184:9000'}
Connect error:Please retry
Try 3 : {'http': '222.249.224.61:48114'}
Connect error:Please retry
Try 2 : {'http': '210.1.58.212:8080'}
Worked !
----------------------------------------
Try {'http': '210.1.58.212:8080'}
Worked !
Try {'http': '106.185.26.199:25'}
Connect error:Please retry
Try {'http': '124.206.192.210:38621'}
Connect error:Please retry
Try {'http': '222.249.224.61:48114'}
Connect error:Please retry
Try {'http': '115.218.217.184:9000'}
Connect error:Please retry
Try {'http': '183.129.244.17:10010'}
Connect error:Please retry
Try {'http': '120.26.199.103:8118'}
Connect error:Please retry
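The two strategies can also be combined: shuffle the list once, then walk it in order and return the first proxy that answers. A minimal sketch against the same proxylist as above (the function name is my own):
import random
import urllib.request

def firstWorkingProxy(proxylist, url='http://www.baidu.com', timeout=5):
    """Return the first responsive proxy from a shuffled copy of the list."""
    candidates = proxylist[:]         # do not reorder the caller's list
    random.shuffle(candidates)
    for proxy in candidates:
        try:
            opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxy))
            opener.open(url, timeout=timeout)
            return proxy              # first proxy that answered in time
        except OSError:
            continue
    return None                       # every proxy in the list failed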
Copyright notice
Author: Clever crane. Please include a link to the original when reprinting. Thank you.
https://en.pythonmana.com/2022/02/202202011641544353.html