Collecting 8 proxy IP sites to pave the way for a Python proxy pool: number 15 of 120 crawler examples
2022-02-01 05:07:54 【Dream eraser】
This is day 22 of my participation in the November writing challenge. For event details, see: "2021: one last writing challenge".
Many crawler veterans build their own IP proxy pools. Want to know how an IP proxy pool is created? If you happen to have this need, welcome to this article.
This case is one example from the "120 crawler examples" column, so it is implemented with requests + lxml.
Starting with the 89IP site
One of the target proxy IP websites is www.89ip.cn/index_1.htm…. First, write a function that returns a random User-Agent; the function's return value can be used directly as the request headers, i.e. the headers parameter.
import random

def get_headers():
    uas = [
        "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
        "Mozilla/5.0 (compatible; Baiduspider-render/2.0; +http://www.baidu.com/search/spider.html)",
        "Baiduspider-image+(+http://www.baidu.com/search/spider.htm)",
        "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 YisouSpider/5.0 Safari/537.36",
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "Mozilla/5.0 (compatible; Googlebot-Image/1.0; +http://www.google.com/bot.html)",
        "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)",
        "Sogou News Spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0);",
        "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
        "Sosospider+(+http://help.soso.com/webspider.htm)",
        "Mozilla/5.0 (compatible; Yahoo! Slurp China; http://misc.yahoo.com.cn/help.html)"
    ]
    ua = random.choice(uas)
    headers = {
        "user-agent": ua,
        "referer": "https://www.baidu.com"
    }
    return headers
The uas variable in the code above holds the User-Agents of the major search engine spiders; later installments will keep expanding this list, with the goal of turning it into a standalone module. A value is picked from the list at random with random.choice, which requires the random module (imported at the top of the snippet).
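As a preview of that refactor, here is a minimal sketch of what such a module might look like; the module name user_agents.py and its trimmed-down list are assumptions for illustration, not part of the original code.

# user_agents.py -- hypothetical standalone User-Agent module
import random

UAS = [
    "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    # ...extend with more UA strings over time
]

def get_headers():
    # Build request headers with a randomly chosen User-Agent
    return {
        "user-agent": random.choice(UAS),
        "referer": "https://www.baidu.com"
    }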
Writing the requests request function
Extract a common request function; this makes it convenient to extend collection to multiple proxy sites later.
import requests

def get_html(url):
    headers = get_headers()
    try:
        res = requests.get(url, headers=headers, timeout=5)
        return res.text
    except Exception as e:
        print("Request URL exception:", e)
        return None
The code above first calls the get_headers function to obtain the request headers, then issues a basic request through requests.
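Free proxy sites are often slow or flaky, so a simple retry wrapper around get_html can help; this is an optional sketch, not part of the original code (the function name and retries parameter are assumptions).

def get_html_with_retry(url, retries=3):
    # Try the request up to `retries` times before giving up
    for _ in range(retries):
        text = get_html(url)
        if text is not None:
            return text
    return None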
Writing the 89IP parsing code
This step is split in two: first write extraction code specific to the 89IP site, then factor out the common function.
The extraction code is as follows:
from lxml import etree

def ip89():
    url = "https://www.89ip.cn/index_1.html"
    text = get_html(url)
    ip_xpath = '//tbody/tr/td[1]/text()'
    port_xpath = '//tbody/tr/td[2]/text()'
    # List of IPs and ports to return
    ret = []
    html = etree.HTML(text)
    ips = html.xpath(ip_xpath)
    ports = html.xpath(port_xpath)
    # For testing; delete this line before running for real
    print(ips, ports)
    ip_port = zip(ips, ports)
    for ip, port in ip_port:
        item_dict = {
            "ip": ip.strip(),
            "port": port.strip()
        }
        ret.append(item_dict)
    return ret
The code above first obtains the page response, then parses it with lxml, i.e. etree.HTML(text), extracts the data with XPath expressions, and finally assembles the results into a list of dictionaries to return.
The parsing part can be factored out, so the code above splits into two parts.
# Proxy IP site source fetching part
def ip89():
    url = "https://www.89ip.cn/index_1.html"
    text = get_html(url)
    ip_xpath = '//tbody/tr/td[1]/text()'
    port_xpath = '//tbody/tr/td[2]/text()'
    ret = format_html(text, ip_xpath, port_xpath)
    print(ret)

# HTML parsing part
def format_html(text, ip_xpath, port_xpath):
    # List of IPs and ports to return
    ret = []
    if text is None:
        # The request failed, so there is nothing to parse
        return ret
    html = etree.HTML(text)
    ips = html.xpath(ip_xpath)
    ports = html.xpath(port_xpath)
    # For testing; delete this line before running for real
    print(ips, ports)
    ip_port = zip(ips, ports)
    for ip, port in ip_port:
        item_dict = {
            "ip": ip.strip(),  # guard against whitespace such as \n and \t
            "port": port.strip()
        }
        ret.append(item_dict)
    return ret
Test the code; the output prints the extracted IPs and ports.
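A minimal way to run this test, assuming all of the snippets above live in one file (the __main__ guard is an addition for illustration):

if __name__ == "__main__":
    ip89()  # prints the raw XPath results and the assembled list of dicts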
Extending to other proxy IP sites
With the 89IP code written, the other sites can be implemented the same way. Each site's code is as follows:
def ip66():
    url = "http://www.66ip.cn/1.html"
    text = get_html(url)
    ip_xpath = '//table/tr[position()>1]/td[1]/text()'
    port_xpath = '//table/tr[position()>1]/td[2]/text()'
    ret = format_html(text, ip_xpath, port_xpath)
    print(ret)

def ip3366():
    url = "https://proxy.ip3366.net/free/?action=china&page=1"
    text = get_html(url)
    ip_xpath = '//td[@data-title="IP"]/text()'
    port_xpath = '//td[@data-title="PORT"]/text()'
    ret = format_html(text, ip_xpath, port_xpath)
    print(ret)

def ip_huan():
    url = "https://ip.ihuan.me/?page=b97827cc"
    text = get_html(url)
    ip_xpath = '//tbody/tr/td[1]/a/text()'
    port_xpath = '//tbody/tr/td[2]/text()'
    ret = format_html(text, ip_xpath, port_xpath)
    print(ret)

def ip_kuai():
    url = "https://www.kuaidaili.com/free/inha/2/"
    text = get_html(url)
    ip_xpath = '//td[@data-title="IP"]/text()'
    port_xpath = '//td[@data-title="PORT"]/text()'
    ret = format_html(text, ip_xpath, port_xpath)
    print(ret)

def ip_jiangxi():
    url = "https://ip.jiangxianli.com/?page=1"
    text = get_html(url)
    ip_xpath = '//tbody/tr[position()!=7]/td[1]/text()'
    port_xpath = '//tbody/tr[position()!=7]/td[2]/text()'
    ret = format_html(text, ip_xpath, port_xpath)
    print(ret)

def ip_kaixin():
    url = "http://www.kxdaili.com/dailiip/1/1.html"
    text = get_html(url)
    ip_xpath = '//tbody/tr/td[1]/text()'
    port_xpath = '//tbody/tr/td[2]/text()'
    ret = format_html(text, ip_xpath, port_xpath)
    print(ret)
As you can see, once the shared helpers are extracted, the per-site code is very similar. Everything above extracts only one page of data; extending to other pages is implemented further below. Before that, one special site needs handling first: www.nimadaili.com/putong/1/.
This proxy site differs from those above in that the IP and port sit together in a single td cell. It needs a dedicated parsing function, shown below, which extracts the IP and port number by splitting the string.
def ip_nima():
    url = "http://www.nimadaili.com/putong/1/"
    text = get_html(url)
    ip_xpath = '//tbody/tr/td[1]/text()'
    ret = format_html_ext(text, ip_xpath)
    print(ret)

# Extended HTML parsing function
def format_html_ext(text, ip_xpath):
    # List of IPs and ports to return
    ret = []
    html = etree.HTML(text)
    ips = html.xpath(ip_xpath)
    print(ips)
    for ip in ips:
        item_dict = {
            "ip": ip.split(":")[0],
            "port": ip.split(":")[1]
        }
        ret.append(item_dict)
    return ret
Verifying the collected IPs
Check each collected IP for usability, and store the usable ones in a file.
There are two detection methods; the code for both is as follows:
import telnetlib

# Proxy detection function
def check_ip_port(ip_port):
    for item in ip_port:
        ip = item["ip"]
        port = item["port"]
        try:
            tn = telnetlib.Telnet(ip, port=port, timeout=2)
        except:
            print('[-] ip:{}:{}'.format(ip, port))
        else:
            print('[+] ip:{}:{}'.format(ip, port))
            with open('ipproxy.txt', 'a') as f:
                f.write(ip + ':' + port + '\n')
    print("Batch detection finished")

def check_proxy(ip_port):
    for item in ip_port:
        ip = item["ip"]
        port = item["port"]
        url = 'https://api.ipify.org/?format=json'
        proxies = {
            "http": "http://{}:{}".format(ip, port),
            "https": "https://{}:{}".format(ip, port),
        }
        try:
            res = requests.get(url, proxies=proxies, timeout=3).json()
            if 'ip' in res:
                print(res['ip'])
        except Exception as e:
            print(e)
The first method is implemented with the Telnet class of the telnetlib module; the second works by requesting a fixed address through each proxy.
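The second check works because api.ipify.org echoes back the IP address a request appears to come from, so if the proxy is effective, the echoed address should be the proxy's rather than your own. A sketch of that comparison (the helper is_anonymous is hypothetical, not part of the original code):

def is_anonymous(ip, port):
    # True if the echoed IP equals the proxy IP, i.e. the proxy
    # masks our real address; a plain-HTTP proxy scheme is assumed
    proxies = {
        "http": "http://{}:{}".format(ip, port),
        "https": "http://{}:{}".format(ip, port),
    }
    try:
        res = requests.get("https://api.ipify.org/?format=json",
                           proxies=proxies, timeout=3).json()
        return res.get("ip") == ip
    except Exception:
        return False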
Scaling up the number of collected IPs
All of the IP detection above runs against a single page of data; next, switch to multi-page collection, again taking 89IP as the example.
Add a pagesize parameter to the function, then loop over the pages.
def ip89(pagesize):
    url_format = "https://www.89ip.cn/index_{}.html"
    for page in range(1, pagesize + 1):
        url = url_format.format(page)
        text = get_html(url)
        ip_xpath = '//tbody/tr/td[1]/text()'
        port_xpath = '//tbody/tr/td[2]/text()'
        ret = format_html(text, ip_xpath, port_xpath)
        # Check whether the proxies are usable
        check_ip_port(ret)
        # check_proxy(ret)
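The same multi-page pattern applies to every other site as well; as a sketch, the per-site logic could be folded into one generic helper (the name crawl_pages and its parameters are assumptions for illustration):

def crawl_pages(url_format, pagesize, ip_xpath, port_xpath):
    # Collect proxies from pages 1..pagesize of a single site
    ret = []
    for page in range(1, pagesize + 1):
        text = get_html(url_format.format(page))
        ret.extend(format_html(text, ip_xpath, port_xpath))
    return ret

For example, crawl_pages("https://www.89ip.cn/index_{}.html", 3, '//tbody/tr/td[1]/text()', '//tbody/tr/td[2]/text()') would cover the first three pages of 89IP.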
Running the code at this point produces the detection output.
In the code above, whenever an IP turns out to be usable, it has already been stored:
with open('ipproxy.txt', 'a') as f:
    f.write(ip + ':' + port + '\n')
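To use the stored proxies later, the file can be read back into the same dict format; a minimal sketch (the helper name load_proxies is an assumption):

def load_proxies(path="ipproxy.txt"):
    # Read "ip:port" lines back into the dict format used above
    ret = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            ip, port = line.split(":")
            ret.append({"ip": ip, "port": port})
    return ret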
Comment time
Code download address: codechina.csdn.net/hihell/pyth… Could you give it a Star?
== Since you've come this far, how about leaving a comment or a like? ==
Copyright notice
Author: Dream eraser. Please include a link to the original when reprinting, thank you.
https://en.pythonmana.com/2022/02/202202010507526631.html