
Learn these 10,000 jokes and become the humorist of the IT workplace. Python crawler lessons 8-9

2022-01-31 18:56:33 Dream eraser

「This is day 17 of my participation in the November More-Text Challenge. Check out the event details: 2021 Last More-Text Challenge」

A modern professional should be warm, interesting, useful, and tasteful. All right, drop the "should" and that person is you. To become a workplace humor master you need material, that is, jokes: only by reading more can you say more, and tell high-level jokes.

Analysis before crawling

The target website for this crawl is: www.wllxy.net/gxqmlist.as…


The overall difficulty of this crawl is low, and the analysis work can basically be skipped; after all, having studied this far, you have already mastered requests to the tune of 70-80%.

This article focuses on introducing the proxy-related features of requests.

Crawler basics time

What is a proxy

A proxy obtains network information on behalf of the network user. In plain language, it finds a way to hide the user's own IP and other network-related information so that the target site cannot obtain them.

Types of proxies

High anonymity proxy: a high anonymity proxy forwards packets unchanged. To the target website's server it looks like a real, ordinary user visiting, and the IP it sees is the proxy server's own IP address, which perfectly hides the user's original IP. High anonymity proxies are therefore the first choice for crawler work.

Ordinary anonymous proxy: an ordinary anonymous proxy makes some changes to the packets, adding fixed parameters to the HTTP headers. Because of these fixed parameters, the target server can tell that the request came through a proxy, so websites with strong anti-crawling measures can easily judge whether the user is a crawler.

Transparent proxy: no need to elaborate on this one; using it is as good as using none, because the target server can detect it easily.
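If you want to check which category a given proxy falls into, one quick approach is to send a request through it to an echo service such as httpbin.org and inspect what the server actually receives. A minimal sketch, assuming a placeholder proxy address:

import requests

proxies = {"http": "http://10.10.1.10:3128"}  # placeholder address

# httpbin echoes back what the server saw: /ip shows the origin IP,
# /headers shows the request headers. A transparent proxy typically leaks
# your real IP (e.g. via X-Forwarded-For), an anonymous proxy reveals that
# a proxy is in use, and a high anonymity proxy reveals neither.
print(requests.get("http://httpbin.org/ip", proxies=proxies, timeout=10).json())
print(requests.get("http://httpbin.org/headers", proxies=proxies, timeout=10).json())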

As for proxy types, they are sometimes also distinguished by HTTP versus HTTPS. By now most websites have upgraded to the HTTPS protocol, but HTTP has not been abandoned, and such sites can generally still be crawled. One thing to note: HTTPS needs multiple handshakes and is comparatively slow, and it gets slower still through a proxy. So when you later crawl a site served over HTTP, try to stick to the HTTP protocol, including when using a proxy.

Using proxies in requests

requests supports multiple proxy schemes, and the setup is simple: pass the proxies parameter to any request method to configure that single request, as in the code below. (This proxy material is only an introduction, because in practice it turned out the target data for this case can easily be fetched without a proxy.)

import requests

# Map each URL scheme to the proxy that should handle requests for it
proxies = {
  "http": "http://10.10.1.10:3128",
  "https": "http://10.10.1.10:1080",
}

requests.get("http://example.org", proxies=proxies)

Note that proxies is a dictionary parameter; it can contain the http key, the https key, or both.
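If your proxy requires HTTP Basic authentication, requests accepts credentials embedded in the proxy URL. A small sketch; the user, password, and address below are placeholders:

import requests

# user:password@host is the documented syntax for proxy authentication
proxies = {
    "http": "http://user:password@10.10.1.10:3128",
}

requests.get("http://example.org", proxies=proxies)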

Also note that requests supports SOCKS proxies. That knowledge point is a bit more advanced, so it is not explained in depth here.
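For completeness, here is a minimal sketch of the SOCKS support just mentioned. It requires the optional pysocks dependency, and the address 127.0.0.1:1080 is only a placeholder:

# Requires the optional SOCKS extra:  pip install "requests[socks]"
import requests

# Use socks5h:// instead if DNS should also resolve through the proxy
proxies = {
    "http": "socks5://127.0.0.1:1080",
    "https": "socks5://127.0.0.1:1080",
}

requests.get("http://example.org", proxies=proxies)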

Code time

With the proxy-related knowledge covered, let's move on to the actual coding.

import requests
import re
import threading

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"}

flag_page = 0
# Lock protecting the page counter shared by the worker threads
flag_lock = threading.Lock()

# Parse with regular expressions; the three result lists are merged with zip
def anay(html):
    # The regex engine scans the page three times here; finding a more
    # efficient single-pass approach is left to you as an exercise.
    # NOTE: the label literals below were translated from the page's original
    # (Chinese) markup and must match the site's actual text to work.
    pattern = re.compile(
        r'<td class="diggtdright">[.\s]*<a href=".*?" target="_blank">\s*(.*?)</a>')
    titles = pattern.findall(html)
    times = re.findall(r' Release time :(\d+[-]\d+[-]\d+)', html)
    diggt = re.findall(r' Get the ticket :(\d+) Person time ', html)
    return zip(titles, times, diggt)

def save(data):
    # Appending from several threads can interleave lines; a thread-safe
    # version is sketched after this section.
    with open("newdata.csv", "a+", encoding="utf-8-sig") as f:
        f.write(f"{data[0]},{data[1]},{data[2]}\n")

def get_page():
    global flag_page
    while True:
        # Claim the next page number under the lock so no two threads
        # ever crawl the same page.
        with flag_lock:
            if flag_page >= 979:
                break
            flag_page += 1
            page = flag_page
        url = f"http://www.wllxy.net/gxqmlist.aspx?p={page}"
        print(f"Crawling {url}")
        r = requests.get(url=url, headers=headers)

        ok_data = anay(r.text)
        for data in ok_data:
            print(data)
            # Saving to disk is left for you to complete
            # save(data)

if __name__ == "__main__":
    threads = [threading.Thread(target=get_page) for _ in range(5)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

Be careful: zip takes iterables as arguments, packs their corresponding elements into tuples, and returns a zip object. To display the contents as a list, you need to convert it manually with list().

If the iterables have different numbers of elements, the result is as long as the shortest one.
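A quick illustration of both behaviors:

titles = ["joke A", "joke B", "joke C"]
times = ["2021-01-01", "2021-01-02"]

z = zip(titles, times)
print(z)        # <zip object at 0x...>, not directly displayable
print(list(z))  # [('joke A', '2021-01-01'), ('joke B', '2021-01-02')]
                # truncated to the length of the shorter input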

What remains is the data-saving part: the save function in the code above is a starting point you can complete yourself.
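Because several threads may call save at the same time, plain appends can interleave. One way to finish the saving part is to serialize writes with a lock and let the csv module handle quoting; a sketch, where the lock is an addition not present in the code above:

import csv
import threading

write_lock = threading.Lock()  # added: serializes file access across threads

def save(data):
    # The lock keeps rows from different threads from interleaving;
    # csv.writer also quotes commas that may appear inside joke titles.
    with write_lock:
        with open("newdata.csv", "a+", encoding="utf-8-sig", newline="") as f:
            csv.writer(f).writerow(data)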

In closing

This series of crawler lessons mainly introduces the requests library. After finishing it, you will have a fairly complete picture of what requests can do.

copyright notice
Author: Dream eraser. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201311856294080.html
