
Learn Python, get to know more succulents, and become the tech circle's "succulent expert": one article is enough

2022-01-30 05:31:10 Dream eraser

This article takes part in the "Digging Force Star Program" to win a creator gift pack and compete for the creation incentive fund.

With 20 lines of code, become the succulent expert of the tech circle

Crawl target

The goal of this post is the succulent image gallery at https://www.zhimengo.com/duoroutu, covering both the list pages and the detail pages.

Libraries used

  • requests, re
  • Some readers ask why no more advanced framework is used. Answer: because this belongs to a 120-example crawler series that moves from basic to advanced, and we are currently only at the 5th article of the third stage.

Key learning content

  • GET requests;
  • Dual-process crawling: one process grabs pages 1-25, the other grabs pages 26-55;
  • Numbering the images when naming them.

List page and detail page analysis

  • The pagination is easy to identify, and the total page count can be read directly from it (see the sketch just after this list);
  • The detail-page links are directly available.
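The final code below does not actually parse the total page count; the page range is entered by hand. If you did want to read it from the pagination block, a minimal sketch might look like the following. The markup assumed inside <ul class="pagination"> (numbered links such as <a ...>55</a>) is a guess and should be adjusted to the site's real HTML.

import re

def get_total_pages(html: str) -> int:
    """Read the largest page number from the pagination block.

    Assumes <ul class="pagination"> contains numbered links like <a ...>55</a>.
    """
    start = html.find('<ul class="pagination">')
    end = html.find('</ul>', start)
    numbers = [int(n) for n in re.findall(r'>(\d+)<', html[start:end])]
    return max(numbers) if numbers else 1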


The detail page contains many pictures, as shown below. During crawling, the images need to be numbered when they are saved, for example Baifeng succulent 1.png, Baifeng succulent 2.png.

(Screenshot: a detail page containing multiple succulent images.)

Code time

Step 1: get the detail-page addresses from the list page.

This step follows the same approach as earlier cases: fetch the page source and parse it with a regular expression. Before parsing, the target region of the HTML can be cut out with string slicing.

import requests
import re

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36"
}


def get_list():
    """Get the links to all detail pages on the first list page."""
    res = requests.get(
        "https://www.zhimengo.com/duoroutu?page=1", headers=headers)
    html = res.text
    # Cut out the target region between the list <ul> and the pagination <ul>.
    start = '<ul class="list2">'
    end = '<ul class="pagination">'
    html = html[html.find(start):html.find(end)]
    # Each match is a (detail-page URL, title) tuple.
    pattern = re.compile(
        '<h3><a href="(.*?)" target="_blank" title="(.*?)">.*?</a></h3>')
    all_list = pattern.findall(html)

    return all_list

Write a run function and call get_list from it.

def run():
    url_list = get_list()
    print(url_list)

if __name__ == "__main__":
    run()

After running the code, you get the following list of detail pages; the title is used later when saving the images.

(Screenshot: the printed list of detail-page URLs and titles.)
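For reference, each element returned by get_list is a (detail-page URL, title) tuple, in the order of the two regex groups. A purely illustrative example (not real crawl output; the URL and title are the ones used later in this article):

# Illustrative shape of get_list()'s return value, not actual crawl output.
url_list = [
    ("https://www.zhimengo.com/duoroutu/24413", "Pink vine succulent"),
]
for url, title in url_list:
    print(title, "->", url)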

Step 2: get all the images from the detail page. While coding this, note that the HTML returned to the Python request is not the same as what the developer tools show; take the HTML that Python receives as the reference, with the differences shown below. The biggest impact of this difference is on how the regular expression is written.

(Screenshot: the HTML returned to the Python request compared with the DOM rendered in the developer tools.)
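A simple way to see exactly what your script receives, rather than what the developer tools render, is to dump the response to a file and write the regular expression against that file. A minimal sketch (dump_html and the output file name are illustrative helpers, not part of the original article):

import requests

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36"
}

def dump_html(url, path="detail.html"):
    """Save the raw HTML that the Python request actually receives."""
    res = requests.get(url, headers=headers)
    with open(path, "w", encoding="utf-8") as f:
        f.write(res.text)

dump_html("https://www.zhimengo.com/duoroutu/24413")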

The code that captures the image addresses on a detail page is shown below. A practical tip here is to test the regular expression against a single fixed detail-page address first.

def get_detail(title, url):
    res = requests.get(url=url, headers=headers)
    html = res.text
    print(html)
    # Each match is the address of one image on the detail page.
    pattern = re.compile(
        '<img alt=".*?" src="(.*?)">')
    imgs = pattern.findall(html)
    for index, url in enumerate(imgs):
        print(title, index, url)
        # save_img(title, index, url)


def run():
    url_list = get_list()
    # print(url_list)
    # for url, title in url_list:
    # Test against a single fixed detail page first.
    get_detail("Pink vine succulent", "https://www.zhimengo.com/duoroutu/24413")

In the code above, save_img is the function that saves the images; its implementation is as follows:

def save_img(title, index, url):
    try:
        img_res = requests.get(url, headers=headers)
        img_data = img_res.content
        print(f"Fetching: {url}")
        # Write the binary image data; "wb" overwrites any existing file,
        # so a re-run does not append to an already saved image.
        with open(f"images/{title}_{index}.png", "wb") as f:
            f.write(img_data)
    except Exception as e:
        print(e)
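Note that save_img assumes an images/ directory already exists next to the script; otherwise the open call raises FileNotFoundError. A one-time setup near the top of the script (a small addition, not shown in the original code) avoids that:

import os

# Create the output directory once, before any image is written.
os.makedirs("images", exist_ok=True)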

Next, we keep reworking the code. This case implements a two-process crawler, that is, dual-process crawling: one process grabs pages 1-25 and the other grabs pages 26-55.

The main change here is to the run function, as follows:

def run(start, end):
    # Build the list-page URLs for the requested page range.
    wait_url = [
        f"https://www.zhimengo.com/duoroutu?page={i}" for i in range(int(start), int(end) + 1)]
    print(wait_url)

    url_list = []
    for item in wait_url:
        ret = get_list(item)
        # print(len(ret))
        print(f"Captured {len(ret)} records")
        url_list.extend(ret)

    # print(len(url_list))
    for url, title in url_list:
        get_detail(title, url)


if __name__ == "__main__":
    start = input("Please enter the start page: ")
    end = input("Please enter the end page: ")
    run(start, end)
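The article does not show the updated get_list, but the new run passes it a list-page URL. A minimal sketch of the parameterized version, assuming only the URL changes compared with the step-1 code and that the same imports and headers are in scope:

def get_list(url):
    """Fetch one list page and return (detail-page URL, title) tuples."""
    res = requests.get(url, headers=headers)
    html = res.text
    start = '<ul class="list2">'
    end = '<ul class="pagination">'
    html = html[html.find(start):html.find(end)]
    pattern = re.compile(
        '<h3><a href="(.*?)" target="_blank" title="(.*?)">.*?</a></h3>')
    return pattern.findall(html)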

Run the program and test it according to your network speed. The two processes are obtained by launching the Python script twice; you split the page ranges between the two runs manually.
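If you would rather start both workers from a single entry point instead of two terminals, a hedged sketch using the standard multiprocessing module (not part of the original article) could replace the input()-based __main__ block:

from multiprocessing import Process

if __name__ == "__main__":
    # One worker crawls pages 1-25, the other pages 26-55,
    # mirroring the manual two-terminal setup described above.
    p1 = Process(target=run, args=(1, 25))
    p2 = Process(target=run, args=(26, 55))
    p1.start()
    p2.start()
    p1.join()
    p2.join()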


copyright notice
Author: Dream eraser. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201300531092024.html
