
The kid said "you haven't even seen Ultraman", so I hurriedly studied it with Python. Unexpectedly...

2022-01-30 05:09:58 Dream eraser


I didn't expect there really were so many kinds of Ultraman.

The goals of this blog

Scraping target

  • all the Ultraman images on http://www.ultramanclub.com/allultraman/

Libraries used

  • requests, re

Key learning points

  • GET requests;
  • the requests timeout setting (the timeout parameter);
  • re module regular expressions;
  • data deduplication;
  • URL splicing.

List page analysis

A quick look in the developer tools shows that each Ultraman card sits in a <li class="item"></li> tag, and the link to its details page is carried by an <a href="…"> tag nested inside it.

The specific elements of the tag are shown in the figure below:

[Figure: list page DOM structure]

The regular expressions will be worked out later, based on the data actually returned by the requests.
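As a preview, here is a minimal sketch of how such a regex pulls the details page links out of the list markup; the HTML fragment below is made up for illustration:

import re

# Hypothetical fragment mimicking the list page structure described above.
html = '<ul class="lists"><li class="item"><a href="./tiga/index.html">Tiga</a></li></ul>'

# Non-greedy capture of the href attribute inside each card item.
links = re.findall('<li class="item"><a href="(.*?)">', html)
print(links)  # ['./tiga/index.html']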

Details page

Click any card to open its details page, which shows the Ultraman picture. The location of the picture address is shown in the figure below.

[Figure: location of the picture address on the details page]

Right-click the picture to inspect its tag.

[Figure: the picture's <img> tag]

The requirements, sorted out, are as follows:

  1. From the list page, crawl the addresses of all Ultraman details pages;
  2. Open each details page and crawl the picture address on it;
  3. Download and save the pictures.

Code implementation

Crawl all Ultraman details page addresses

While crawling the list page, I found that the Ultraman page uses iframe nesting, which is also the simplest anti-scraping technique. After extracting the real link, the target data source switches to http://www.ultramanclub.com/allultraman/ .
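For reference, here is a minimal sketch of how the real link could be pulled out of an iframe tag; the outer page URL and the iframe markup are assumptions for illustration, not something confirmed from the site:

import re
import requests

# Assumed outer page that embeds the list through an iframe.
outer_url = "http://www.ultramanclub.com/"

res = requests.get(outer_url, timeout=10)
# Capture the src attribute of the first iframe tag, if one exists.
match = re.search(r'<iframe[^>]*src="(.*?)"', res.text)
if match:
    print("Real data source:", match.group(1))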


import requests
import re


# Crawler entry point
def run():
    url = "http://www.ultramanclub.com/allultraman/"
    try:
        # The site responds slowly, so a timeout must be set
        res = requests.get(url=url, timeout=10)
        res.encoding = "gb2312"
        html = res.text
        get_detail_list(html)

    except Exception as e:
        print("Request exception", e)


# Get all Ultraman details page links
def get_detail_list(html):
    start_index = '<ul class="lists">'
    start = html.find(start_index)
    html = html[start:]
    links = re.findall('<li class="item"><a href="(.*)">', html)
    print(len(links))  # count before deduplication
    links = list(set(links))
    print(len(links))  # count after deduplication


if __name__ == '__main__':
    run()

While coding, I found that the site responds slowly, so the timeout parameter is set to 10 to guard against request exceptions.

When the regular expression matches the data, there are duplicates, so a set is used for deduplication, and the result is finally converted back to a list.
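A quick illustration of that deduplication step, with made-up duplicate links:

# Made-up matches containing a duplicate entry.
links = ["./tiga/index.html", "./dyna/index.html", "./tiga/index.html"]
print(len(links))         # 3 before deduplication
links = list(set(links))  # set() drops duplicates; order is not preserved
print(len(links))         # 2 after deduplication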

Next, the matched links are spliced a second time to obtain the full details page addresses. The code is as follows:

# Get all Ultraman details page links
def get_detail_list(html):
    start_index = '<ul class="lists">'
    start = html.find(start_index)
    html = html[start:]
    links = re.findall('<li class="item"><a href="(.*)">', html)
    # links = list(set(links))
    links = [f"http://www.ultramanclub.com/allultraman/{i.split('/')[1]}/" for i in set(links)]
    print(links)
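To see what the splice produces, here is a quick example; the sample relative link is an assumption about the shape of the matched href values:

# Assumed shape of a matched link: a relative path such as "./tiga/index.html".
i = "./tiga/index.html"
# split("/") gives [".", "tiga", "index.html"], so index 1 is the directory name.
print(f"http://www.ultramanclub.com/allultraman/{i.split('/')[1]}/")
# -> http://www.ultramanclub.com/allultraman/tiga/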

Get all the full-size Ultraman pictures

This step first grabs the page title, then uses it to name the saved Ultraman picture.

The crawling logic is simple: loop over the details page addresses collected above, then match the page content with regular expressions.

The modified code is as follows; see the comments at the key points.

import os
import re

import requests

# Declare the User-Agent
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36"
}
# Store failed URLs so they can be crawled again
error_list = []


# Crawler entry point
def run():
    url = "http://www.ultramanclub.com/allultraman/"
    try:
        # The site responds slowly, so a timeout must be set
        res = requests.get(url=url, headers=headers, timeout=10)
        res.encoding = "gb2312"
        html = res.text
        return get_detail_list(html)

    except Exception as e:
        print("Request exception", e)


# Get all Ultraman details page links
def get_detail_list(html):
    start_index = '<ul class="lists">'
    start = html.find(start_index)
    html = html[start:]
    links = re.findall('<li class="item"><a href="(.*)">', html)
    # links = list(set(links))
    links = [
        f"http://www.ultramanclub.com/allultraman/{i.split('/')[1]}/" for i in set(links)]
    return links


def get_image(url):
    try:
        # The site responds slowly, so a timeout must be set
        res = requests.get(url=url, headers=headers, timeout=15)
        res.encoding = "gb2312"
        html = res.text
        print(url)
        # Grab the details page title to use as the image file name
        title = re.search(r'<title>(.*?)\[', html).group(1)
        # Grab the relative (short) image address
        image_short = re.search(
            r'<figure class="image tile">[\s\S]*?<img src="(.*?)"', html).group(1)

        # Splice the full image address
        img_url = "http://www.ultramanclub.com/allultraman/" + image_short[3:]
        # Download the image data
        img_data = requests.get(img_url, headers=headers, timeout=15).content
        print(f"Crawling {title}")
        if title is not None and image_short is not None:
            with open(f"images/{title}.png", "wb") as f:
                f.write(img_data)

    except Exception as e:
        print("*" * 100)
        print(url)
        print("Request exception", e)

        error_list.append(url)


if __name__ == '__main__':
    # Make sure the output directory exists before saving pictures
    os.makedirs("images", exist_ok=True)

    details = run()
    for detail in details:
        get_image(detail)

    while len(error_list) > 0:
        print("Crawling failed URLs again")
        detail = error_list.pop()
        get_image(detail)

    print("Ultraman image crawling complete")

Run the code and watch the pictures land one after another in the local images directory.
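And since the whole point is to tell the kid how many Ultramen there are, here is a quick way to count the downloaded files, assuming they all landed in the images directory created above:

import os

# Count the saved pictures in the images/ directory.
print(f"{len(os.listdir('images'))} Ultramen collected")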

Code notes:

In the main block, the code loops over the details pages captured from the list page. That is this part of the code:

for detail in details:
    get_image(detail)

Because this site responds slowly, the get request inside the get_image function uses timeout=15.
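As an alternative to the manual error_list retry loop above, requests can also retry failed connections automatically through an HTTPAdapter; this is a minimal sketch of that approach, not what the original code does:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry a failing request up to 3 times with exponential backoff.
session.mount("http://", HTTPAdapter(max_retries=Retry(total=3, backoff_factor=1)))

res = session.get("http://www.ultramanclub.com/allultraman/", timeout=15)
print(res.status_code)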

The regex matching of the image address and the URL splicing use the following code:

# Grab the details page title to use as the image file name
title = re.search(r'<title>(.*?)\[', html).group(1)
# Grab the relative (short) image address
image_short = re.search(
    r'<figure class="image tile">[\s\S]*?<img src="(.*?)"', html).group(1)

# Splice the full image address
img_url = "http://www.ultramanclub.com/allultraman/" + image_short[3:]
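Note the image_short[3:] slice: it assumes the short address starts with a "../" prefix that has to be dropped before splicing. A quick check of that logic with a made-up path:

# Made-up relative address as it might appear in the <img src="..."> attribute.
image_short = "../img/ultraman-tiga.jpg"
# Dropping the first three characters strips the leading "../".
print("http://www.ultramanclub.com/allultraman/" + image_short[3:])
# -> http://www.ultramanclub.com/allultraman/img/ultraman-tiga.jpg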

Ah, these Ultramen really do all look different.



Copyright notice
Author: Dream eraser. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201300509568304.html
