2022-01-30

Enjoy cats with code! This article is an entry in the 【Meow Star essay contest】.


Collection target

Web resource address…

Tool preparation

Development tool: PyCharm
Development environment: Python 3.7, Windows 11
Toolkit used: requests

Project approach

To build a crawler, you first need to be clear about your collection target. Here Bai Youbai collects all the images on the current web page. Once you have a target, lay out your coding process. A crawler has four basic steps:

  • Step one: get the web resource address
  • Step two: send a network request to that address
  • Step three: extract the data you need
    • Common extraction methods are regular expressions, XPath, bs4, jsonpath, and CSS selectors
  • Step four: save the data
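
The four steps map naturally onto one small function. The sketch below is only an illustration, not the article's own code: `run_crawler`, `fetch`, `extract`, and `save` are hypothetical names, and the in-memory `PAGES` dictionary stands in for a real website so the flow can be run without network access.

```python
from typing import Callable, Dict, Iterable


def run_crawler(
    start_url: str,
    fetch: Callable[[str], bytes],
    extract: Callable[[bytes], Iterable[str]],
    save: Callable[[str, bytes], None],
) -> int:
    """Run the four steps once and return the number of items saved."""
    listing = fetch(start_url)           # steps one and two: address + request
    saved = 0
    for item_url in extract(listing):    # step three: extract the data
        save(item_url, fetch(item_url))  # step four: save each item
        saved += 1
    return saved


# Hypothetical stand-in for a website: URL -> response body.
PAGES: Dict[str, bytes] = {
    "https://example.com/list": b"https://example.com/a.jpg\nhttps://example.com/b.jpg",
    "https://example.com/a.jpg": b"image-a",
    "https://example.com/b.jpg": b"image-b",
}
saved_files: Dict[str, bytes] = {}


def fake_fetch(url: str) -> bytes:
    """Return the canned body for a URL instead of using the network."""
    return PAGES[url]


def extract_lines(listing: bytes) -> Iterable[str]:
    """Treat every line of the listing as one item URL."""
    return listing.decode("utf-8").splitlines()


def save_in_memory(url: str, data: bytes) -> None:
    """Store the data under the last part of the URL."""
    saved_files[url.rsplit("/", 1)[-1]] = data


count = run_crawler("https://example.com/list", fake_fetch, extract_lines, save_in_memory)
print(count, sorted(saved_files))
```

In a real crawler, `fetch` would wrap an HTTP request and `save` would write to disk; keeping them as parameters makes each step easy to swap or test on its own.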

Step one: find the data address

Data is generally loaded in one of two ways: statically or dynamically. On the current page, new data keeps loading as the page refreshes, so the loading mode is dynamic. Dynamic data has to be located with the browser's packet capture tool: right-click and choose Inspect, or press the F12 shortcut, then find the address the data is loaded from.

Once you find the corresponding request, click it and open Preview in the panel that pops up. Preview shows the data returned to us and is convenient when there is a lot of it. That data is fetched from a URL, and the URL is listed in the request details. This is the address to send the network request to.

Step two: send the network request in code

Many toolkits can send requests; beginners mostly use requests. requests is a third-party package and has to be installed first: pip install requests. Keep in mind that the request is sent from code: the web server inspects the HTTP request message to tell whether it comes from a browser or a crawler, and crawlers are not welcome, so the crawler has to disguise itself. Pass headers along with the request. The value is a dictionary of key-value pairs, and the User-Agent (UA) field is the important one: it acts as the browser's ID card.
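
A minimal sketch of sending a request with a browser-like User-Agent follows. `EXAMPLE_URL` and `fetch_text` are hypothetical names introduced for illustration (the article does not give the real address), and the timeout value is an assumption. The last lines only prepare a request without sending it, to show which headers would go out.

```python
from typing import Dict, Optional

import requests

# Hypothetical address used only for illustration.
EXAMPLE_URL = "https://example.com/api/images"

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36"
    ),
}


def fetch_text(url: str, headers: Optional[Dict[str, str]] = None, timeout: float = 10.0) -> str:
    """Send a GET request with browser-like headers and return the body text."""
    response = requests.get(url, headers=headers or HEADERS, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


# Build the request without sending it, to inspect what would be transmitted.
prepared = requests.Request("GET", EXAMPLE_URL, headers=HEADERS).prepare()
print(prepared.method, prepared.headers["User-Agent"])
```

Without the `headers` argument, requests identifies itself with its own default User-Agent, which is an easy signal for a server to filter on.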

Step three: extract the data

The data obtained here is dynamic data, and dynamic data is generally JSON. JSON can be extracted directly with jsonpath, or converted into a dictionary and extracted with plain Python. The end goal is to extract the image URL addresses.
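
The two extraction routes can be compared on a small sample. `SAMPLE_RESPONSE` below is a made-up payload: the `thumbURL` field name comes from the source code at the end of this article, but the surrounding `data` list and the other fields are assumptions about the response shape.

```python
import json
import re
from typing import List

# Made-up payload; only the "thumbURL" key is taken from the article's code.
SAMPLE_RESPONSE = (
    '{"data":['
    '{"thumbURL":"https://example.com/img/cat1.jpg","fromPageTitle":"cat one"},'
    '{"thumbURL":"https://example.com/img/cat2.jpg","fromPageTitle":"cat two"},'
    '{}'
    ']}'
)


def urls_via_dict(text: str) -> List[str]:
    """Convert the JSON text to a dictionary and read the field from each item."""
    payload = json.loads(text)
    return [item["thumbURL"] for item in payload.get("data", []) if "thumbURL" in item]


def urls_via_regex(text: str) -> List[str]:
    """Match the field directly in the raw text, as the article's code does."""
    return re.findall(r'"thumbURL":"(.*?)",', text)


print(urls_via_dict(SAMPLE_RESPONSE))
print(urls_via_regex(SAMPLE_RESPONSE))
```

The regular expression is shorter but depends on the exact text layout (no spaces after the colon, a comma after the value); the dictionary route keeps working if the formatting changes, as long as the response is valid JSON.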


After extracting the new addresses, send a request to each of them. What we need is the image data itself, while the JSON usually stores only a link to it. Sending a request to that link returns the binary data of the image.

Step four: save the data

Once you have the data, store it. Choose where to store it and pick a write mode: the data we get is binary, so open the file in wb mode and write the downloaded image bytes to it. The file name must end with the image's extension. You can name files by title; Bai Youbai uses the last part of the URL as the name.
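
As a sketch of this step, the helpers below derive a file name from the last part of a URL and write bytes in `wb` mode. `filename_from_url` and `save_image` are hypothetical names, the sample URL is invented, and dropping the query string before taking the name is an added precaution rather than something the article's code does.

```python
import posixpath
import tempfile
from pathlib import Path
from urllib.parse import urlsplit


def filename_from_url(url: str) -> str:
    """Use the last path segment of the URL, ignoring any ?query part."""
    return posixpath.basename(urlsplit(url).path)


def save_image(data: bytes, folder: Path, url: str) -> Path:
    """Write binary image data into folder, named after the URL's last segment."""
    folder.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    target = folder / filename_from_url(url)
    with open(target, "wb") as f:  # "wb": write binary, no text encoding applied
        f.write(data)
    return target


# Stand-in bytes (the PNG file signature) instead of a real download.
fake_bytes = b"\x89PNG\r\n\x1a\n"
with tempfile.TemporaryDirectory() as tmp:
    saved = save_image(fake_bytes, Path(tmp) / "picture", "https://example.com/img/cat1.png?size=large")
    saved_name = saved.name
    round_trip = saved.read_bytes()
print(saved_name)
```

Opening the file in text mode (`"w"`) would raise a `TypeError` when given bytes, which is why the binary mode matters here.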

Simple source code

import os  # Create the output folder
import re  # Regular-expression matching

import requests  # Third-party HTTP toolkit

# Add request headers
headers = {
    # User agent
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36",
    # Request data source
    # "Referer": "",
    # "Host": ""
}

# Search keyword (presumably used in the request URL, which is left blank below)
key = input("Please enter the picture to download: ")
# Folder to save the pictures in
path = "picture/"
os.makedirs(path, exist_ok=True)
# Request the data interface, one batch per iteration
for i in range(5, 50):
    url = ""  # Data interface address (left blank in the original)
    # Send the request
    response = requests.get(url, headers=headers)
    # Match the image URLs with a regular expression
    url_list = re.findall('"thumbURL":"(.*?)",', response.text)

    # Loop over the image URLs
    for new_url in url_list:
        # Request the image itself to get its binary content
        result = requests.get(new_url).content
        # Split the URL and use its last part as the picture name
        name = new_url.split("/")[-1]
        # Write the file in binary mode
        with open(path + name, "wb") as f:
            f.write(result)
