
Which cat is the most popular? Crawling cat pictures from across the web with Python. Which one is your favorite?

2022-01-30 23:43:42 White and white I

Enjoy cats with code! This article is an entry in the [Meow Star essay contest].

Preface

Acquisition target

Web resource address: image.baidu.com/search/inde…


Tool preparation

Development tool: PyCharm
Development environment: Python 3.7, Windows 11
Toolkit used: requests

Analysis of project ideas

To build a crawler, you first need to be clear about what you want to collect. Here Bai Youbai collects all the pictures on the current web page. Once the goal is set, plan the coding process. A crawler has four basic steps:

  • Step one: get the web resource address
  • Step two: send a network request to that address
  • Step three: extract the required data
    • Common extraction methods are regular expressions, XPath, bs4, jsonpath, and CSS selectors
  • Step four: save the data

Step one: Find the data address

Web pages generally load data in one of two ways: statically or dynamically. This page keeps loading new data as it is refreshed, so its loading mode is dynamic. Dynamic data has to be located with the browser's packet-capture tool: right-click and choose Inspect, or press the F12 shortcut, then find the address from which the data is loaded.


Find the corresponding data address and click it. In the panel that pops up, click Preview: the Preview tab shows the data that was returned to us, which is convenient when there is a lot of it. The data is obtained through a URL, and that URL is listed in the request information. The next step is to send a network request to that URL.
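The captured data address carries its settings in the query string, so splitting it apart is a quick way to see which parameters matter. The following is a minimal sketch using only the standard library. The URL is an abbreviated copy of the address used in the source code at the end of this article, and query_params is a hypothetical helper name. Reading pn as a result offset and rn as a page size is an assumption based on the parameter values, not documented behavior.

```python
from urllib.parse import parse_qs, urlsplit

# Abbreviated copy of the captured data address (most empty parameters dropped)
CAPTURED_URL = (
    "https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj"
    "&word=%E7%8C%AB%E5%92%AA&queryWord=%E7%8C%AB%E5%92%AA"
    "&ie=utf-8&oe=utf-8&pn=120&rn=30"
)


def query_params(url):
    """Return the query-string parameters of url as a plain dict."""
    parsed = parse_qs(urlsplit(url).query)
    # parse_qs maps each name to a list of values; keep the first of each
    return {name: values[0] for name, values in parsed.items()}


params = query_params(CAPTURED_URL)
print(params["word"])  # the percent-decoded search keyword
print(params["pn"], params["rn"])
```

Seeing the decoded keyword in word and the numeric values in pn and rn suggests which parts of the address to vary when requesting other keywords or further pages.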

Step two: Send the network request in code

Many toolkits can send requests; beginners mostly use requests. requests is a third-party toolkit, so it has to be installed first: pip install requests. Note that when a request is sent from code, the web server uses the HTTP request message to tell whether the client is a browser or a crawler. Crawlers are not welcome, so the crawler code has to disguise itself by sending headers with the request. The headers are passed as a dictionary of key-value pairs, and the User-Agent (UA) field matters most: it acts as the browser's ID card.
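To make the disguise concrete, the sketch below prepares a request without sending it, so the headers and final URL can be inspected offline. It assumes requests is installed. build_request is a hypothetical helper name, and the word, pn, and rn values are illustrative, taken from the address used in this article's source code.

```python
import requests

# User-Agent string copied from the source code in this article
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36"
    ),
}


def build_request(url, params=None):
    """Prepare (but do not send) a GET request carrying the browser-like headers."""
    return requests.Request("GET", url, headers=HEADERS, params=params).prepare()


prepared = build_request(
    "https://image.baidu.com/search/acjson",
    {"word": "猫咪", "pn": 0, "rn": 30},
)
# Without a custom header, requests identifies itself like this:
print(requests.utils.default_user_agent())
# With the headers dictionary, the server sees a browser-style identifier:
print(prepared.headers["User-Agent"])
print(prepared.url)
```

A prepared request can then be sent with requests.Session().send(prepared, timeout=10); in everyday code, requests.get(url, headers=HEADERS) does both steps at once.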

Step three: Extract the data

The data acquired here is dynamic data, and dynamic data is generally JSON. JSON can be extracted directly with jsonpath, or it can be converted into a dictionary and read with ordinary Python (the source code in this article uses a regular expression). The ultimate goal is to extract the image URL addresses.
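The sketch below contrasts the dictionary route with the regular-expression route. The sample text is made up for illustration: only the thumbURL key comes from this article's source code, while the surrounding data list, the width field, and the example.com addresses are assumptions about the response shape. Both function names are hypothetical.

```python
import json
import re

# Hypothetical response text; the real interface returns far more fields
SAMPLE = (
    '{"data": ['
    '{"thumbURL": "https://example.com/img/cat1.jpg", "width": 500}, '
    '{"thumbURL": "https://example.com/img/cat2.jpg", "width": 640}, '
    "{}]}"
)


def thumb_urls_from_json(text):
    """Convert the JSON text into a dictionary and collect every thumbURL."""
    payload = json.loads(text)
    return [item["thumbURL"] for item in payload.get("data", []) if "thumbURL" in item]


def thumb_urls_from_regex(text):
    """Collect every thumbURL by pattern matching on the raw text."""
    return re.findall(r'"thumbURL":\s*"(.*?)"', text)


print(thumb_urls_from_json(SAMPLE))
print(thumb_urls_from_regex(SAMPLE))
```

The regular expression is shorter and tolerates text that is not strictly valid JSON, but it returns the raw characters. json.loads also decodes escape sequences such as \/ inside strings, which matters if the addresses are escaped in the response.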


After extracting the new addresses, send a request to each of them in turn. What we need is the picture data itself, but the extracted data usually stores only links, so a second request is needed to get the binary data of each picture.

Step four: Save the data

Once the data has been obtained, store it. Choose where to store it and select the write mode: the data we get is binary, so the file access mode is wb. Then simply write the picture data into the file. The file suffix must match the format of the picture. You could name each file after its title; this article uses the last part of the URL as the name.
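The saving step can be sketched as two small helpers. This is one possible approach under stated assumptions, not the article's exact method: filename_from_url and save_bytes are hypothetical names, the .jpg fallback suffix is a guess for addresses that end without an extension, and the character replacement exists because Windows rejects some characters in file names.

```python
import re
from pathlib import Path
from urllib.parse import urlsplit


def filename_from_url(url, default_suffix=".jpg"):
    """Build a file name from the last path segment of url."""
    name = urlsplit(url).path.rsplit("/", 1)[-1]
    # Replace characters that Windows does not allow in file names
    name = re.sub(r'[\\/:*?"<>|]', "_", name) or "image"
    # Assumption: fall back to default_suffix when the segment has no extension
    if not Path(name).suffix:
        name += default_suffix
    return name


def save_bytes(data, directory, name):
    """Write binary data to directory/name, creating the directory if needed."""
    target = Path(directory)
    target.mkdir(parents=True, exist_ok=True)
    file_path = target / name
    # "wb" opens the file for writing in binary mode
    with open(file_path, "wb") as f:
        f.write(data)
    return file_path


print(filename_from_url("https://example.com/a/cat.png?w=500"))
print(filename_from_url("https://example.com/it/u=123,456&fm=253"))
```

In the crawler, save_bytes would receive the .content of each picture response together with the name derived from that picture's address.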

Simple source code

import os  # Create the output directory
import re  # Regular-expression matching
from urllib.parse import quote  # Percent-encode the search keyword

import requests  # Third-party toolkit for sending requests

# Request headers
headers = {
    # User agent: makes the request look like it comes from a browser
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36",
    # Request source (left disabled)
    # "Referer": "https://tupian.baidu.com/search/index",
    # "Host": "tupian.baidu.com"
}

key = input("Please enter the pictures to download: ")
word = quote(key)  # for example, "猫咪" becomes "%E7%8C%AB%E5%92%AA"
# Directory in which to save the pictures; created if it does not exist
path = "picture/"
os.makedirs(path, exist_ok=True)
# Request the data interface page by page
for i in range(5, 50):
    # Assumption: pn is the result offset and rn=30 the page size, so page i starts at i * 30
    url = f"https://image.baidu.com/search/acjson?tn=resultjson_com&logid=12114112735054631287&ipn=rj&ct=201326592&is=&fp=result&fr=&word={word}&queryWord={word}&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=&hd=&latest=&copyright=&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&expermode=&nojc=&isAsync=&pn={i * 30}&rn=30&gsm=78&1635836468641="
    # Send the request
    response = requests.get(url, headers=headers)
    print(response.text)
    # Match the thumbnail URLs with a regular expression
    url_list = re.findall('"thumbURL":"(.*?)",', response.text)

    print(url_list)
    # Loop over each picture URL
    for new_url in url_list:
        # Request the picture itself; .content is the raw binary data
        result = requests.get(new_url, headers=headers).content
        # Use the last part of the URL as the file name, replacing
        # characters that Windows does not allow in file names
        name = re.sub(r'[\\/:*?"<>|]', "_", new_url.split("/")[-1])
        print(name)
        # Write the file in binary mode
        with open(path + name, "wb") as f:
            f.write(result)


I am White and white i, a programmer who likes to share knowledge.
If you are interested, follow my official account: White and white Python [Thank you very much for your likes, favorites, follows, and comments.]

copyright notice
Author: White and white I. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201302343408305.html
