
I took 100G of pictures offline overnight with Python just to prevent the website from disappearing

2022-01-30 06:48:36 Dream eraser


About 20 lines of core code to earn a little cred in the tech circle

Crawling 100G of coser pictures with Python

The goal of this blog

Crawl target

  • Target data source: www.cosplay8.com/pic/chinaco… , yet another cos website that could easily disappear from the internet. To preserve its data, we archive it.


Python modules used

  • requests, re, os

Key learning points

  • Today's focus is pagination on the detail page, a technique not covered in previous blogs; pay special attention to it while writing the code.

List page and detail page analysis

The developer tools make it easy to identify the tags wrapping the target data.

Click any image to open its detail page, which displays the target image one picture per page.

<a href="javascript:dPlayNext();" id="infoss">
  <img src="/uploads/allimg/210601/112879-210601143204.jpg" id="bigimg" width="800" alt="" border="0" /></a>
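As a quick offline sanity check, the `bigimg` tag above can be matched with a small regular expression. Note that this sample HTML uses double quotes while the crawler code later in the post matches single quotes; the quoting evidently varies, so match against the markup you actually receive.

```python
import re

# Sample detail-page fragment, copied from the markup shown above
html = """<a href="javascript:dPlayNext();" id="infoss">
  <img src="/uploads/allimg/210601/112879-210601143204.jpg" id="bigimg" width="800" alt="" border="0" /></a>"""

# Match the src of the <img> carrying id="bigimg"
pattern = re.compile(r'<img src="(.*?)" id="bigimg"')
match = pattern.search(html)
print(match.group(1))  # /uploads/allimg/210601/112879-210601143204.jpg
```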

The URL generation rules for the list pages and detail pages are as follows:

List page: http://www.cosplay8.com/pic/chinacos/list_{category}_{page}.html

Detail page: first page {id}.html, subsequent pages {id}_{page}.html

Note that the first page of a detail set carries no _1 suffix, so while extracting the total page count we also need to save the first page's picture.
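The numbering rule can be sketched as a small helper. The detail URL below is a hypothetical example that merely follows the site's pattern; the real crawler derives the stem the same way, with `rindex`.

```python
# Sketch of the detail-page URL rule: page 1 has no "_1" suffix,
# pages 2..N insert "_{page}" before ".html".
def detail_page_urls(first_url, page_count):
    stem = first_url[:first_url.rindex(".")]  # strip the ".html" suffix
    return [first_url] + [f"{stem}_{i}.html" for i in range(2, page_count + 1)]

# Hypothetical detail-page id, for illustration only
urls = detail_page_urls("http://www.cosplay8.com/pic/chinacos/12345.html", 3)
print(urls)
# ['http://www.cosplay8.com/pic/chinacos/12345.html',
#  'http://www.cosplay8.com/pic/chinacos/12345_2.html',
#  'http://www.cosplay8.com/pic/chinacos/12345_3.html']
```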

Code time

The target site groups its images into categories, namely domestic cos, overseas cos, Hanfu, and Lolita, so the category can be entered dynamically at runtime, making the crawl target user-configurable.


import os
import re

import requests

# The post never shows the headers definition; a typical browser UA header
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}


def run(category, start, end):
    # Generate the list pages to crawl
    wait_url = [
        f"http://www.cosplay8.com/pic/chinacos/list_{category}_{i}.html"
        for i in range(int(start), int(end) + 1)]
    print(wait_url)

    url_list = []
    for item in wait_url:
        # get_list is defined later in this post
        ret = get_list(item)

        print(f"Captured {len(ret)} links")
        url_list.extend(ret)


if __name__ == "__main__":

    # Example list page: http://www.cosplay8.com/pic/chinacos/list_22_2.html
    category = input("Enter the category number: ")
    start = input("Enter the start page: ")
    end = input("Enter the end page: ")
    run(category, start, end)

The code above generates the target URLs from user input, then passes each URL to the get_list function, whose code is as follows:

def get_list(url):
    """Collect the links to all detail pages on one list page."""
    res = requests.get(url, headers=headers)
    html = res.text
    pattern = re.compile(r'<li><a href="(.*?)">')
    all_list = pattern.findall(html)

    return all_list

The regular expression <li><a href="(.*?)"> matches every detail-page address on the list page and returns them as a list.
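A minimal offline check of this pattern, using a hypothetical list-page fragment shaped like the site's markup:

```python
import re

# Hypothetical list-page fragment; the hrefs are made-up examples
html = '''<li><a href="/pic/chinacos/12345.html">Set A</a></li>
<li><a href="/pic/chinacos/67890.html">Set B</a></li>'''

pattern = re.compile(r'<li><a href="(.*?)">')
links = pattern.findall(html)
print(links)  # ['/pic/chinacos/12345.html', '/pic/chinacos/67890.html']
```

Note that the addresses come back relative, which is why the crawler prefixes the domain before requesting them.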

Now extend the run function to fetch each detail page's image material and save the captured pictures.

def run(category, start, end):
    # List pages to crawl
    wait_url = [
        f"http://www.cosplay8.com/pic/chinacos/list_{category}_{i}.html"
        for i in range(int(start), int(end) + 1)]
    print(wait_url)

    url_list = []
    for item in wait_url:
        ret = get_list(item)

        print(f"Captured {len(ret)} links")
        url_list.extend(ret)

    print(url_list)
    # print(len(url_list))
    for url in url_list:
        get_detail(f"http://www.cosplay8.com{url}")

Because the matched detail-page addresses are relative, they are formatted into full addresses before the request. The get_detail function is as follows:
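For reference, the standard library's urljoin builds the same full addresses and also tolerates links that are already absolute. This is a sketch with hypothetical paths, not the post's code:

```python
from urllib.parse import urljoin

BASE = "http://www.cosplay8.com"

# Hypothetical hrefs: one relative (as the list regex returns), one absolute
hrefs = ["/pic/chinacos/12345.html",
         "http://www.cosplay8.com/pic/chinacos/67890.html"]

# urljoin resolves relative paths against BASE and keeps absolute URLs intact
full_urls = [urljoin(BASE, h) for h in hrefs]
print(full_urls)
```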

def get_detail(url):
    # Request the detail page
    res = requests.get(url=url, headers=headers)
    # Set the encoding
    res.encoding = "utf-8"
    # Page source
    html = res.text

    # Page count (the site renders it as "共N页"); also used to save the first picture
    size_pattern = re.compile(r'<span>共(\d+)页: </span>')
    # Title pattern; some pages use a different suffix, so the regex was widened
    # title_pattern = re.compile(r'<title>(.*?)-Cosplay中国</title>')
    title_pattern = re.compile(r'<title>(.*?)-Cosplay(中国|8)</title>')
    # Big-image pattern
    first_img_pattern = re.compile(r"<img src='(.*?)' id='bigimg'")
    try:
        # Try to match the page count
        page_size = size_pattern.search(html).group(1)
        # Try to match the title
        title = title_pattern.search(html).group(1)
        # Try to match the image address
        first_img = first_img_pattern.search(html).group(1)

        print(f"The URL has {page_size} pages:", title, first_img)
        # Build the save path
        path = f'images/{title}'
        # Create the directory if it does not exist
        if not os.path.exists(path):
            os.makedirs(path)

        # Save the first picture
        save_img(path, title, first_img, 1)

        # Request the remaining pages (page 1 carries no "_1" suffix)
        urls = [f"{url[0:url.rindex('.')]}_{i}.html" for i in range(2, int(page_size) + 1)]

        for index, child_url in enumerate(urls):
            try:
                res = requests.get(url=child_url, headers=headers)

                html = res.text
                first_img = first_img_pattern.search(html).group(1)

                # enumerate starts at 0 while urls starts at page 2, so offset
                # by 2 (passing index directly would overwrite picture 1)
                save_img(path, title, first_img, index + 2)
            except Exception as e:
                print("Failed to grab a subpage:", e)

    except Exception as e:
        print(url, e)

The core logic is explained in the comments above; the part worth a closer look is the regular expression for the title. The initial version was:

<title>(.*?)-Cosplay中国</title>

It later turned out that this did not match every page, so it was widened to:

<title>(.*?)-Cosplay(中国|8)</title>
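A quick check that the widened pattern accepts both title variants; the sample titles here are made up for illustration:

```python
import re

# The alternation (中国|8) accepts both suffixes seen on the site
pattern = re.compile(r'<title>(.*?)-Cosplay(中国|8)</title>')

samples = ['<title>Some set-Cosplay中国</title>',
           '<title>Another set-Cosplay8</title>']
titles = [pattern.search(s).group(1) for s in samples]
print(titles)  # ['Some set', 'Another set']
```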

The remaining piece is the save_img function:

def save_img(path, title, first_img, index):
    try:
        # Request the picture itself
        img_res = requests.get(f"http://www.cosplay8.com{first_img}", headers=headers)
        img_data = img_res.content

        with open(f"{path}/{title}_{index}.png", "wb") as f:
            f.write(img_data)
    except Exception as e:
        print(e)
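One caveat: save_img hardcodes a .png extension even though the site serves .jpg files. The bytes are saved intact, but the filenames are misleading. A sketch of keeping the real extension instead (the helper name is my own, not from the post):

```python
import os

# Derive the save filename from the image URL's own extension
def img_filename(path, title, img_url, index):
    ext = os.path.splitext(img_url)[1] or ".jpg"  # fall back if no extension
    return f"{path}/{title}_{index}{ext}"

name = img_filename("images/demo", "demo",
                    "/uploads/allimg/210601/112879-210601143204.jpg", 1)
print(name)  # images/demo/demo_1.jpg
```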


Copyright notice
Author: Dream eraser. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201300648352733.html
