
A website full of temptation for Python crawler writers, "Lovely Picture Network". Just look at the name of this website.

2022-01-30 07:22:47 Dream eraser


Dual-thread crawling of Lovely Picture Network

This blog post is about speeding up a Python crawler by implementing a dual-threaded crawler, along with some unexpected gains made during the crawl.

Crawling target analysis

The crawling target

  • Lovely Picture Network: www.keaitupian.net/
  • The image categories are very rich, and there is everything you might want to grab, for example cute girls and sexy beauties. But in order to learn the technique better, I decided to focus only on the cartoon category and leave the rest to you.


Python modules used

  • This time we use requests, re, and threading.
  • threading is added to run the crawler in parallel threads.

Key learning content

  1. The basic crawler workflow;
  2. Crawling data when the number of pages is unknown;
  3. A crawler with a fixed number of threads.

List page and detail page analysis

Because the total number of pages in the list cannot be obtained directly, a large-number test was used: when https://www.keaitupian.net/dongman/list-110.html is requested, the page shows as non-existent.


After actual testing, the list of the target category turned out to have 77 pages.
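If you would rather not probe the page count by hand, a small helper can run the large-number test automatically. This is only a minimal sketch: the function name find_last_list_page is mine, and it assumes a missing list page can be recognized by a non-200 status code. If the site instead returns a normal page with a "not found" message, you would have to check the page text instead.

import requests

headers = {"User-Agent": "Mozilla/5.0"}

def find_last_list_page(max_page=200):
    # walk the list pages upward until one no longer exists
    last = 0
    for n in range(1, max_page + 1):
        url = f"https://www.keaitupian.net/dongman/list-{n}.html"
        res = requests.get(url, headers=headers, timeout=10)
        if res.status_code != 200:
            break
        last = n
    return last

# print(find_last_list_page())  # the manual test above found 77 pages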

Click any picture to open its detail page and view the specific content. The detail page also has paging, and this paging can jump across list pages; for example, after paging to 9/9, you can move on to the next set of pictures. Therefore, data can be captured directly from the detail pages.

Open the last group of photos on list page 77 and look at its paging code: on the last page, the link for turning to the next page is empty, so the page cannot be turned any further.

URL of the last detail page for reference: https://www.keaitupian.net/article/280-8.html#.

With the analysis of the target website complete, sort out the overall logic and derive the requirements.

Sorting out the requirements

  1. Randomly select a detail page address as the crawler's starting page;
  2. One thread saves the pictures;
  3. One thread collects the next-page addresses.

Code time

Grabbing the target request addresses

Based on the above requirements, the first step is to implement the thread that loops to collect URLs. This thread repeatedly crawls URL addresses and saves them into a global list.

Use threading.Thread to create and start the thread. To pass data safely between threads, a thread mutex (lock) is also needed.

Declaring the lock:

mutex = threading.Lock()

Using the lock:

global urls
# acquire the lock
mutex.acquire()

urls.append(next_url)
# release the lock
mutex.release()
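As a side note, Python's standard library also offers queue.Queue, which does this locking internally, so the global list plus mutex could be swapped for it. A minimal sketch of the same hand-off, assuming the rest of the code stays unchanged (the code below keeps the explicit lock):

from queue import Queue

url_queue = Queue()

# producer side: replaces mutex.acquire() / urls.append(next_url) / mutex.release()
url_queue.put("https://www.keaitupian.net/article/202389.html")

# consumer side: get() blocks until an item is available, so no busy waiting
img_url = url_queue.get()
print(img_url)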

The full URL-collection code is as follows:

import requests
import re
import threading
import time
import os

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"}

# global list of URLs shared between the two threads
urls = []

mutex = threading.Lock()


# loop to collect URLs
def get_image(start_url):
    global urls
    urls.append(start_url)
    next_url = start_url
    while next_url != "#":
        res = requests.get(url=next_url, headers=headers)

        if res is not None:
            html = res.text
            pattern = re.compile('<a class="next_main_img" href="(.*?)">')
            match = pattern.search(html)
            if match:
                next_url = match.group(1)
                # an empty or "#" link means the last page has been reached
                if next_url in ("", "#"):
                    break
                if next_url.find('www.keaitupian') < 0:
                    next_url = f"https://www.keaitupian.net{next_url}"
                print(next_url)
                # acquire the lock before touching the shared list
                mutex.acquire()
                urls.append(next_url)
                # release the lock
                mutex.release()
            else:
                # no "next page" link found, stop crawling
                break


if __name__ == '__main__':
    # URL collection thread
    gets = threading.Thread(target=get_image, args=(
        "https://www.keaitupian.net/article/202389.html",))
    gets.start()

Run the code to collect the target addresses to be captured; each captured address is printed to the console.

Extracting the pictures from the target addresses

Here is the final step: for each link address captured by the code above, extract the picture address and save the picture.

Saving the pictures is also done in a thread; the corresponding save_image function is as follows:

# image saving thread
def save_image():
    global urls
    print(urls)
    # make sure the output directory exists before saving
    os.makedirs("images", exist_ok=True)

    while True:
        # acquire the lock before touching the shared list
        mutex.acquire()
        if len(urls) > 0:
            # take the first item in the list
            img_url = urls[0]
            # delete the first item from the list
            del urls[0]
            # release the lock
            mutex.release()
            res = requests.get(url=img_url, headers=headers)

            if res is not None:
                html = res.text

                pattern = re.compile(
                    '<img class="img-responsive center-block" src="(.*?)"/>')

                img_match = pattern.search(html)

                if img_match:
                    img_data_url = img_match.group(1)
                    print("Grabbing picture:", img_data_url)
                    try:
                        res = requests.get(img_data_url)
                        with open(f"images/{time.time()}.png", "wb+") as f:
                            f.write(res.content)
                    except Exception as e:
                        print(e)
        else:
            # release the lock here as well, otherwise the next acquire() blocks forever
            mutex.release()
            print("Waiting for URLs; if this keeps printing for a long time, the program can simply be closed")
            time.sleep(1)
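Instead of closing the program by hand once the console keeps printing the waiting message, the collecting thread could push a sentinel value after the last page, and the saving thread could exit when it sees it. This is only a minimal sketch of that idea, with stripped-down producer and consumer functions that are not wired into the code above:

import threading
import time

STOP = object()          # sentinel pushed once the producer reaches the last page
urls = []
mutex = threading.Lock()

def producer():
    # ... crawl the pages here, appending each URL to urls under the lock ...
    with mutex:
        urls.append(STOP)  # signal that no more work is coming

def consumer():
    while True:
        with mutex:
            item = urls.pop(0) if urls else None
        if item is STOP:
            break            # the producer has finished, exit cleanly
        if item is None:
            time.sleep(1)    # nothing to do yet, check again shortly
            continue
        # ... download and save item here ...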

Also create a thread based on this function in the main block and start it:

if __name__ == '__main__':
    # URL collection thread
    gets = threading.Thread(target=get_image, args=(
        "https://www.keaitupian.net/article/202389.html",))
    gets.start()

    # image saving thread
    save = threading.Thread(target=save_image)
    save.start()
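Learning point 3 mentioned a crawler with a fixed number of threads. Because save_image only touches the shared list while holding the lock, several copies of it can run side by side; the sketch below starts three saving threads instead of one. The thread count of 3 is an arbitrary choice of mine, not a value from the original setup.

if __name__ == '__main__':
    # URL collection thread (producer)
    gets = threading.Thread(target=get_image, args=(
        "https://www.keaitupian.net/article/202389.html",))
    gets.start()

    # a fixed number of image saving threads (consumers)
    THREAD_COUNT = 3
    savers = [threading.Thread(target=save_image) for _ in range(THREAD_COUNT)]
    for save in savers:
        save.start()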


Copyright notice
Author: Dream eraser. Please include a link to the original when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201300722466591.html
