
Python Web Crawler in Practice: Download All the Pictures in Zhihu Answers

2022-02-01 07:10:14 Clever crane

This is day 7 of my participation in the November Gengwen (More Text) Challenge. Check out the event details: 2021 Last Gengwen Challenge.

When browsing Zhihu, you often come across lots of great pictures: beautiful wallpapers, funny meme images, interesting screenshots, and so on, and there is always the urge to save them all.

So, at a reader's request, I reworked my earlier Zhihu crawler into a new one that can download all the pictures in the answers.

1. Analyzing the website

Zhihu has been crawled many times on this blog:

Python Web Crawler in Practice: Crawling all the questions under one Zhihu topic

Python Web Crawler in Practice: 18934 answer records under one Zhihu topic

Python Web Crawler in Practice: Nearly a thousand Mid-Autumn Festival greetings to make you the most popular kid among relatives and friends

Therefore, the packet-capture part of the website analysis is only explained briefly here, not in detail (if you want to know more, you can read the earlier articles).


This article uses the Zhihu question "What good-looking computer wallpapers are worth sharing?" as an example to explain the crawler.

Sample URL: www.zhihu.com/question/31…

From the earlier analysis we know that Zhihu's answer data is loaded dynamically through Ajax: every time the page scrolls to the bottom, 5 new answers are requested. The API is shown in the figure below.

There are 4 parameters: include controls what fields the server returns, limit controls how many records are requested each time, offset controls the offset (that is, the paging position), and sort_by controls the sort order.

In our crawler, we only need to change offset to control which page is crawled; the other parameters can stay unchanged.
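As a rough sketch of the paging idea (the include value here is simplified and may not be exactly what the server expects; the full parameter string used in this article appears in the main function below, which builds the URL with str.format instead):

import requests

# Minimal paging sketch: bump `offset` by `limit` after each request.
# `qid` is the question id; the `include` value is trimmed for readability.
qid = 316039999
base = "https://www.zhihu.com/api/v4/questions/{}/answers".format(qid)
params = {
    "include": "data[*].content,voteup_count,comment_count,updated_time",
    "limit": 5,
    "offset": 0,        # increase by 5 (the limit) for the next page
    "sort_by": "default",
}
r = requests.get(base, params=params, headers={"User-Agent": "Mozilla/5.0"})
print(r.status_code)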

The data returned by this API is in JSON format. The answers are located in data[i] --> content, stored as HTML.

After formatting the HTML, we find that the pictures are in img tags, with the picture link in the src attribute.

Strangely, though, the same picture appears in two identical img tags, one of which sits inside a noscript tag. (I am not sure what the intention is; if anyone knows, feel free to share it in the comments.)

Also, while inspecting the site, I found that the picture link obtained through img["src"] is not the original, full-resolution image.

The original image link is in the data-original attribute.

Whether to download the original image depends on your needs.
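As a quick sketch (the html string below is a made-up stand-in for one answer's content field), you can prefer data-original and fall back to src when it is missing:

from bs4 import BeautifulSoup

# Hypothetical sample: a trimmed answer fragment with both attributes.
html = '<noscript><img src="https://pic.example/xx_720w.jpg" data-original="https://pic.example/xx_r.jpg"/></noscript>'
soup = BeautifulSoup(html, "lxml")
for img in soup.find_all("img"):
    # Prefer the full-size link; fall back to the thumbnail if it is absent.
    link = img.get("data-original") or img.get("src")
    print(link)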

2. Coding

With the analysis done, we move on to coding.

First, import the required libraries.

from bs4 import BeautifulSoup
import pandas as pd
import requests
import json
import time
import os

The fetchUrl function, used to issue network requests.

def fetchUrl(url):
    header = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36',
    }

    r = requests.get(url, headers=header)
    r.encoding = "utf-8"
    return r

The parseJson function, used to parse the JSON data returned by the API.

It parses the 5 answers returned by each request, extracts the author's nickname, signature, publication date, number of likes, number of comments, and answer content, and yields them one by one.

def parseJson(jsonStr):

    jsonObj = json.loads(jsonStr)
    data = jsonObj['data']

    for item in data:
        name = item["author"]["name"]
        print("Crawling the answer of", name)
        headline = item["author"]["headline"]
        dateTime = time.strftime("%Y-%m-%d", time.localtime(item['updated_time']))
        comment_count = item['comment_count']
        voteup_count = item['voteup_count']
        content = parseHtml(item["content"])

        # print(name, headline, dateTime, comment_count, voteup_count, content)

        yield [[name, headline, dateTime, comment_count, voteup_count, content]]

The parseHtml function, used to parse the HTML answer content.

It parses each answer: if picture tags are found, the pictures are downloaded; the remaining text is converted to a plain string and returned.

def parseHtml(html):

    bsObj = BeautifulSoup(html, "lxml")
    images = bsObj.find_all("noscript")

    if len(images) == 0:
        print("There are no pictures in this answer")
    else:
        print("This answer has", len(images), "pictures, downloading ...")
        for item in images:
            link = item.img['data-original']
            downloadImage(link, "Images/")
        print("Picture download completed")

    return bsObj.text

The downloadImage function, used to download pictures.

def downloadImage(url, path):

    content = fetchUrl(url).content
    # url : https://pic3.zhimg.com/c7ad985268e7144b588d7bf94eedb487_r.jpg?source=1940ef5c
    # filename: c7ad985268e7144b588d7bf94eedb487_r.jpg
    filename = url.split("?")[0].split("/")[-1]

    # Create the folder automatically if it does not exist
    if not os.path.exists(path):
        os.makedirs(path)

    with open(path + filename, "wb+") as f:
        f.write(content)

The saveData function, used to save the answer data to a CSV file.

def saveData(data, filename):

    dataframe = pd.DataFrame(data)
    dataframe.to_csv(filename, mode='a', index=False, sep=',', header=False, encoding="utf_8_sig")

Finally, the main function, which serves as the program entry point and schedules the crawler.

if __name__ == "__main__":

    # Name of the CSV output file
    filename = "data.csv"
    qid = 316039999
    offset = 0
    totalNum = 50

    while offset < totalNum:
        url = "https://www.zhihu.com/api/v4/questions/{0}/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics&limit=5&offset={1}&platform=desktop&sort_by=default" .format(qid, offset)
        html = fetchUrl(url).text

        for data in parseJson(html):
            # print(data)
            saveData(data, filename)

            offset += 1
            print(" The climb has been completed ",offset," Answer data , Altogether ",totalNum, " strip ")
            print("---"*20)

Here, filename is the name of the CSV file that stores the answer data, qid is the ID of the Zhihu question to crawl, and offset and totalNum control where the crawler starts and stops.
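For example, to resume a crawl from the 100th answer and stop after the 200th (the numbers here are only illustrative), you would only change these initial values in the main function:

    # Illustrative values: start at answer 100 and stop once 200 have been crawled.
    offset = 100
    totalNum = 200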

3. Results and summary

3.1 Running results

Run the program and let it crawl for a while.

Program output

The downloaded pictures.

3.2 Summary and improvements

In this article, we completed a crawler that downloads all the pictures (original, full-resolution images) in Zhihu answers, and included the full crawler code.

However, there is still a lot of room for improvement in the crawler code.

3.2.1 The code is not robust enough

To implement the crawler quickly and keep the code easy for beginners to read, many error-prone places are not checked or validated. This makes the program less robust and prone to crashing.

For instance, in the parseHtml function above, the part that parses the picture links:

for item in images:
    link = item.img['data-original']
    downloadImage(link, "Images/")
print("Picture download completed")

Written like this, it works fine most of the time.

But every now and then it throws an error, causing the program to crash and exit.

<img class="content_image" data-rawheight="34" data-rawwidth="40" data-size="normal" src="https://pic2.zhimg.com/50/v2-56d491ec13d5b3ad6c1c1bd40ad9f0a5_720w.jpg?source=1940ef5c" width="40"/>

Because some img tags do not have a data-original attribute.

The right thing to do is to surround the error-prone code with try ... except ..., catching and handling exceptions. For example:

for item in images:
    try:
        link = item.img['data-original']
        downloadImage(link, "Images/")
    except:
        print(item.img)
print("Picture download completed")

This way the program is more robust and does not crash as easily.
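Alternatively, here is a sketch that avoids the exception altogether: BeautifulSoup's .get() returns None when an attribute is missing, so you can check before downloading:

for item in images:
    img = item.img
    # .get() returns None instead of raising KeyError when the attribute is absent.
    link = img.get("data-original") if img else None
    if link:
        downloadImage(link, "Images/")
    else:
        print("Skipping an image without data-original:", img)
print("Picture download completed")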

3.2.2 Crawling efficiency is too low

To keep the code easy to read, this crawler crawls in a single thread.

Before fetching the next batch of answers, the crawler has to wait until all the pictures in the previous answer have finished downloading.

When the answers to a question contain a large number of pictures, crawling becomes very slow.

We can modify the crawler to use multiple threads, as sketched after the list below:

  1. First, separate the two tasks of crawling answer data and downloading pictures. The former only stores the parsed image links into a queue or array and does not wait for the downloads; the latter only takes links from the queue and downloads them, without caring where they came from.
  2. The image-download part can then be made multithreaded to increase the crawl rate.
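A minimal sketch of that producer/consumer split, reusing the downloadImage function defined earlier and a thread pool for the downloads (the function name parseHtmlAsync is made up for this example):

from concurrent.futures import ThreadPoolExecutor
from bs4 import BeautifulSoup

def parseHtmlAsync(html, executor):
    bsObj = BeautifulSoup(html, "lxml")
    for item in bsObj.find_all("noscript"):
        link = item.img.get("data-original") if item.img else None
        if link:
            # Hand the link to the pool and return immediately, instead of
            # blocking until the picture has finished downloading.
            executor.submit(downloadImage, link, "Images/")
    return bsObj.text

# Usage: create one pool for the whole crawl and pass it into the parser.
# with ThreadPoolExecutor(max_workers=8) as executor:
#     content = parseHtmlAsync(item["content"], executor)

Note that downloadImage as written above would also need os.makedirs(path, exist_ok=True) to be safe when called from several threads at once.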

3.2.3 Automatically skipping already-downloaded pictures

During a crawl, the crawler may exit for all sorts of reasons. If you then crawl again, all the pictures have to be downloaded from scratch, wasting work and time.

So before saving a picture, we can first check whether it already exists locally: if it does, skip it; if not, download it.

def downloadImage(url, filename, path):
    content = fetchUrl(url).content
    # Create the folder automatically if it does not exist
    if not os.path.exists(path):
        os.makedirs(path)
    with open(path + filename, "wb+") as f:
        f.write(content)


path = "Images/"
for item in images:
    try:
        link = item.img['data-original']
        filename = link.split("?")[0].split("/")[-1]
        if not os.path.exists(path + filename):
            # Download only if the picture does not already exist locally
            downloadImage(link, filename, path)
    except:
        print(item.img)

If anything in this article is unclear or explained incorrectly, feel free to leave a comment, or scan the QR code below to add me on WeChat so we can learn and improve together.

 add me into your friend list

copyright notice
author [Clever crane]. Please include a link to the original when reprinting, thank you.
https://en.pythonmana.com/2022/02/202202010710121328.html
