
Collecting 8 proxy IP sites to pave the way for a Python proxy pool: the 15th of 120 crawler examples

2022-02-01 05:07:54 Dream eraser

This is day 22 of my participation in the November blogging challenge. Check the event details at: 2021, one last blogging challenge.

Many experienced crawler developers build their own IP proxy pools. Do you want to know how an IP proxy pool is created? If you happen to have this need, this article is for you.

This case is an example from the "120 crawler examples" column, and it is implemented with requests + lxml.

Starting with the 89IP site

One of the target proxy IP websites is www.89ip.cn/index_1.htm…. First, write a function that returns a random User-Agent; the function's return value is used as the request header, i.e. the headers parameter.

import random

def get_headers():
    # User-Agent strings of major search engine spiders
    uas = [
        "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
        "Mozilla/5.0 (compatible; Baiduspider-render/2.0; +http://www.baidu.com/search/spider.html)",
        "Baiduspider-image+(+http://www.baidu.com/search/spider.htm)",
        "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 YisouSpider/5.0 Safari/537.36",
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "Mozilla/5.0 (compatible; Googlebot-Image/1.0; +http://www.google.com/bot.html)",
        "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)",
        "Sogou News Spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0);",
        "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
        "Sosospider+(+http://help.soso.com/webspider.htm)",
        "Mozilla/5.0 (compatible; Yahoo! Slurp China; http://misc.yahoo.com.cn/help.html)"
    ]
    # Pick one UA at random for each request
    ua = random.choice(uas)
    headers = {
        "user-agent": ua,
        "referer": "https://www.baidu.com"
    }
    return headers

The uas variable in the code above holds the User-Agent strings of major search engine spiders. Subsequent cases will keep expanding this list, with the aim of turning it into a standalone module (a sketch follows below).

A value is picked from the list at random with random.choice; note the import of the random module at the top of the snippet.
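As a rough sketch of what that standalone module might look like (the file name ua_pool.py and its layout are assumptions for illustration, not part of the original code):

# ua_pool.py -- hypothetical standalone User-Agent module
import random

UAS = [
    "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    # ... the remaining UA strings from the list above
]

def get_headers():
    # Same logic as above: a random UA plus a fixed referer
    return {
        "user-agent": random.choice(UAS),
        "referer": "https://www.baidu.com"
    }

Other scripts would then import it with from ua_pool import get_headers.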

Writing the requests request function

Factor out a common request function, so that collecting data from multiple proxy sites is easy to extend later.

import requests

def get_html(url):
    headers = get_headers()
    try:
        res = requests.get(url, headers=headers, timeout=5)
        return res.text
    except Exception as e:
        print("Request URL exception", e)
        return None


The code above first calls the get_headers function to obtain the request headers, then issues a basic request through requests.
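A quick way to smoke-test the two helpers together (a sanity check added here, not part of the original article) is to fetch one page and print the start of the response:

if __name__ == "__main__":
    html = get_html("https://www.89ip.cn/index_1.html")
    if html:
        print(html[:300])  # first 300 characters of the page
    else:
        print("request failed")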

Writing the 89IP parsing code

This part is done in two steps: first write the extraction code for the 89IP site, then factor out the common function.

The extraction code is as follows:

from lxml import etree

def ip89():
    url = "https://www.89ip.cn/index_1.html"
    text = get_html(url)
    ip_xpath = '//tbody/tr/td[1]/text()'
    port_xpath = '//tbody/tr/td[2]/text()'
    # List of IP/port dictionaries to return
    ret = []
    html = etree.HTML(text)
    ips = html.xpath(ip_xpath)
    ports = html.xpath(port_xpath)
    # Debug output; remove once the code runs correctly
    print(ips, ports)
    ip_port = zip(ips, ports)
    for ip, port in ip_port:
        item_dict = {
            "ip": ip.strip(),
            "port": port.strip()
        }
        ret.append(item_dict)
    return ret

The code above first obtains the web page response, then parses it with lxml, i.e. etree.HTML(text), extracts the data with XPath expressions, and finally assembles and returns a list of dictionary items.
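If the XPath step is unfamiliar, the following minimal, self-contained sketch shows what etree.HTML plus xpath do, using an invented table fragment shaped like the 89IP page:

from lxml import etree

sample = """
<table><tbody>
  <tr><td> 1.2.3.4 </td><td> 8080 </td></tr>
  <tr><td> 5.6.7.8 </td><td> 3128 </td></tr>
</tbody></table>
"""
html = etree.HTML(sample)
print(html.xpath('//tbody/tr/td[1]/text()'))  # [' 1.2.3.4 ', ' 5.6.7.8 ']
print(html.xpath('//tbody/tr/td[2]/text()'))  # [' 8080 ', ' 3128 ']

The surrounding whitespace in the extracted text is exactly why the parsing code calls strip() on every value.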

The parsing part can be factored out, so the code above can be split into two parts.

# Source-code acquisition part for the proxy IP site
def ip89():
    url = "https://www.89ip.cn/index_1.html"
    text = get_html(url)
    ip_xpath = '//tbody/tr/td[1]/text()'
    port_xpath = '//tbody/tr/td[2]/text()'
    ret = format_html(text, ip_xpath, port_xpath)
    print(ret)

# HTML parsing part
def format_html(text, ip_xpath, port_xpath):
    # List of IP/port dictionaries to return
    ret = []
    html = etree.HTML(text)
    ips = html.xpath(ip_xpath)
    ports = html.xpath(port_xpath)
    # Debug output; remove once the code runs correctly
    print(ips, ports)
    ip_port = zip(ips, ports)
    for ip, port in ip_port:
        item_dict = {
            "ip": ip.strip(),  # strip() removes whitespace such as \n and \t
            "port": port.strip()
        }
        ret.append(item_dict)
    return ret

Test the code; running it prints the list of IP/port dictionaries.

Extending to other proxy IP sites

With the 89IP code written, other sites can be implemented as extensions; each site's version is as follows:

def ip66():
    url = "http://www.66ip.cn/1.html"
    text = get_html(url)
    ip_xpath = '//table/tr[position()>1]/td[1]/text()'
    port_xpath = '//table/tr[position()>1]/td[2]/text()'
    ret = format_html(text, ip_xpath, port_xpath)
    print(ret)

def ip3366():
    url = "https://proxy.ip3366.net/free/?action=china&page=1"
    text = get_html(url)
    ip_xpath = '//td[@data-title="IP"]/text()'
    port_xpath = '//td[@data-title="PORT"]/text()'
    ret = format_html(text, ip_xpath, port_xpath)
    print(ret)

def ip_huan():
    url = "https://ip.ihuan.me/?page=b97827cc"
    text = get_html(url)
    ip_xpath = '//tbody/tr/td[1]/a/text()'
    port_xpath = '//tbody/tr/td[2]/text()'
    ret = format_html(text, ip_xpath, port_xpath)
    print(ret)

def ip_kuai():
    url = "https://www.kuaidaili.com/free/inha/2/"
    text = get_html(url)
    ip_xpath = '//td[@data-title="IP"]/text()'
    port_xpath = '//td[@data-title="PORT"]/text()'
    ret = format_html(text, ip_xpath, port_xpath)
    print(ret)

def ip_jiangxi():
    url = "https://ip.jiangxianli.com/?page=1"
    text = get_html(url)
    ip_xpath = '//tbody/tr[position()!=7]/td[1]/text()'
    port_xpath = '//tbody/tr[position()!=7]/td[2]/text()'
    ret = format_html(text, ip_xpath, port_xpath)
    print(ret)

def ip_kaixin():
    url = "http://www.kxdaili.com/dailiip/1/1.html"
    text = get_html(url)
    ip_xpath = '//tbody/tr/td[1]/text()'
    port_xpath = '//tbody/tr/td[2]/text()'
    ret = format_html(text, ip_xpath, port_xpath)
    print(ret)

As you can see, once the common method is extracted, the code is very similar across sites. The above only collects one page of data per site; extending to more pages is implemented later. Before that, a special site needs handling first: www.nimadaili.com/putong/1/.

This proxy site differs from the ones above in that the IP and the port sit together in a single td cell. It therefore needs a dedicated parsing function, shown below, which extracts the IP and the port number by splitting the string.

def ip_nima():
    url = "http://www.nimadaili.com/putong/1/"
    text = get_html(url)
    ip_xpath = '//tbody/tr/td[1]/text()'
    ret = format_html_ext(text, ip_xpath)
    print(ret)

# Extended HTML parsing function for "ip:port" cells
def format_html_ext(text, ip_xpath):
    # List of IP/port dictionaries to return
    ret = []
    html = etree.HTML(text)
    ips = html.xpath(ip_xpath)
    # Debug output; remove once the code runs correctly
    print(ips)
    for ip in ips:
        item_dict = {
            "ip": ip.split(":")[0],
            "port": ip.split(":")[1]
        }
        ret.append(item_dict)
    return ret

Verifying the acquired IPs

Run an availability check on the acquired IPs, and store the usable ones in a file.

There are two detection methods; the code for both is as follows:

import telnetlib

# Proxy detection function: try opening a TCP connection to ip:port
def check_ip_port(ip_port):
    for item in ip_port:
        ip = item["ip"]
        port = item["port"]
        try:
            tn = telnetlib.Telnet(ip, port=port, timeout=2)
        except Exception:
            print('[-] ip:{}:{}'.format(ip, port))
        else:
            print('[+] ip:{}:{}'.format(ip, port))
            # Store the available proxy in a file
            with open('ipporxy.txt', 'a') as f:
                f.write(ip + ':' + port + '\n')
    print("Batch detection complete")


# Proxy detection function: request a fixed address through the proxy
def check_proxy(ip_port):
    for item in ip_port:
        ip = item["ip"]
        port = item["port"]
        url = 'https://api.ipify.org/?format=json'
        # Most free proxies are plain HTTP proxies, so both entries
        # use the http:// scheme for the proxy address itself
        proxies = {
            "http": "http://{}:{}".format(ip, port),
            "https": "http://{}:{}".format(ip, port),
        }
        try:
            res = requests.get(url, proxies=proxies, timeout=3).json()
            if 'ip' in res:
                print(res['ip'])
        except Exception as e:
            print(e)

The first is implemented with the Telnet class of the telnetlib module; the second works by requesting a fixed address through the proxy and checking whether a response comes back. (Note that telnetlib was removed from the standard library in Python 3.13, so the second method is the more future-proof one.)
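Once ipporxy.txt has entries, a stored proxy can be plugged back into requests, for example like this (a minimal sketch, assuming the ip:port-per-line format written above; httpbin.org/ip is just a convenient echo endpoint):

import random
import requests

# Read the verified proxies back from the file written by check_ip_port
with open('ipporxy.txt') as f:
    proxy_list = [line.strip() for line in f if line.strip()]

proxy = random.choice(proxy_list)
proxies = {"http": "http://" + proxy, "https": "http://" + proxy}
res = requests.get("http://httpbin.org/ip", proxies=proxies, timeout=5)
print(res.text)  # should show the proxy's IP, not yours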

Expanding the number of IPs retrieved

All of the IP collection above works on a single page of data; next, change it to multiple pages, again taking 89IP as the example.

Add a new pagesize parameter to the function, then implement the paging with a loop.

def ip89(pagesize):
    url_format = "https://www.89ip.cn/index_{}.html"
    for page in range(1, pagesize + 1):
        url = url_format.format(page)
        text = get_html(url)
        ip_xpath = '//tbody/tr/td[1]/text()'
        port_xpath = '//tbody/tr/td[2]/text()'
        ret = format_html(text, ip_xpath, port_xpath)
        # Check whether the proxies are available
        check_ip_port(ret)
        # check_proxy(ret)

At this point, running the code prints a [+] or [-] line for each proxy. Whenever an IP is available, it is also written to a file:

with open('ipporxy.txt','a') as f:
    f.write(ip+':'+port+'\n')
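To collect from every site in one go, a simple driver loop is one option (a sketch, not from the original article; it assumes the single-page site functions are changed to return ret instead of printing it):

def run_all():
    # Assumes ip66, ip3366, ip_huan, ip_kuai, ip_jiangxi,
    # ip_kaixin and ip_nima return their ret lists
    collectors = [ip66, ip3366, ip_huan, ip_kuai, ip_jiangxi, ip_kaixin, ip_nima]
    for collector in collectors:
        ret = collector()
        if ret:
            check_ip_port(ret)

if __name__ == "__main__":
    ip89(pagesize=5)  # multi-page collection from 89IP
    run_all()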

Comment time

Code download address: codechina.csdn.net/hihell/pyth…. Could you give it a Star?

==You've come this far; won't you leave a comment? Did you like it?==
