
Baidu Tieba: Downloading High-Definition Images with Python

2022-01-31 16:07:51 Clever crane

This is day 3 of my participation in the November writing challenge. For event details, see: 2021 final writing challenge.

A while ago, a friend asked me to scrape the high-definition images from a post on Baidu Tieba.

Here's what happened: my friend found a lot of beautiful pictures in a Tieba post and wanted to download the originals to use as wallpaper. But the post contains too many pictures and he wants all of them, so I offered to write a crawler to download them in bulk.

There are only two requirements:

  1. Download the original images
  2. Download them in batch

Enough talk, let's get started.

1. Analyzing the website

The post address my friend provided: … .

First, look at the form of the URL. We can guess that 6516084831 is the post id.

After ticking "only show the original poster" and turning the page, the link becomes …, with two more parameters in the URL. It is almost certain that see_lz=1 means only the original poster's replies are shown, and pn=1 means the current page is page 1.
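
The two query parameters can also be assembled in code. Below is a minimal sketch using the standard library's urllib.parse. The base address is a placeholder (the real post URL is not reproduced here), and build_post_url is a hypothetical helper name, not something from the original script.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Placeholder base address, NOT the real post URL (assumption for illustration).
BASE_URL = "https://example.com/p/"


def build_post_url(tid, page=1, only_op=True):
    """Build a post URL carrying the see_lz and pn query parameters."""
    params = {}
    if only_op:
        params["see_lz"] = 1  # only show the original poster's replies
    params["pn"] = page       # page number, starting at 1
    return f"{BASE_URL}{tid}?{urlencode(params)}"


url = build_post_url(6516084831, page=2)
print(url)

# Parse the query string back to confirm what each parameter carries.
query = parse_qs(urlsplit(url).query)
print(query["see_lz"], query["pn"])
```

Building the query string with urlencode instead of string concatenation keeps the parameters correctly escaped if more of them are added later.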

Open the browser's developer tools and switch to the Network tab to capture the traffic.

It turns out that the post content is rendered directly into the HTML page (there is no separate data API). In other words, we only need to parse this web page to get the post's image data.


2. Checking for anti-scraping mechanisms

Write a simple piece of Python code to test for anti-scraping mechanisms:

import requests

url = ""  # the post URL goes here
r = requests.get(url)
print(r.text)  # inspect the returned HTML

Testing shows there is no special anti-scraping mechanism. The site does not even check the User-Agent, so the data can be scraped directly.


3. Extracting the data

There is no anti-scraping mechanism and the data sits in a static web page, so we can look directly at the page source and parse the data out of it.


The pictures are in img tags whose class is BDE_Image, and the picture link is the tag's src attribute.

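import requests
from bs4 import BeautifulSoup

url = ""  # the post URL goes here
r = requests.get(url)
html = r.text
bsObj = BeautifulSoup(html, "lxml")
# every picture in the post is an <img> tag with class "BDE_Image"
imgList = bsObj.find_all("img", attrs = {"class": "BDE_Image"})
for img in imgList:
    print(img["src"])  # the picture link is the src attribute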

We use the find_all function of the BeautifulSoup library to find every img tag that matches, and then simply take its src attribute.
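
To try the same extraction without any network access or third-party package, here is a rough sketch that swaps BeautifulSoup for the standard library's html.parser. The sample markup and the ImageSrcCollector class are made up for illustration and are only loosely modelled on a post page.

```python
from html.parser import HTMLParser


class ImageSrcCollector(HTMLParser):
    """Collect the src of every <img> whose class list contains BDE_Image."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        # An attribute without a value is reported as None, hence the "or".
        classes = (attr_map.get("class") or "").split()
        if "BDE_Image" in classes and attr_map.get("src"):
            self.sources.append(attr_map["src"])


# Made-up sample markup (assumption), not copied from a real page.
SAMPLE_HTML = """
<div class="post">
  <img class="BDE_Image" src="https://example.com/thumb/a1.jpg" width="580">
  <img class="avatar" src="https://example.com/face/u.png">
  <img class="BDE_Image" src="https://example.com/thumb/b2.jpg">
</div>
"""

collector = ImageSrcCollector()
collector.feed(SAMPLE_HTML)
print(collector.sources)
```

Only the two BDE_Image links are collected; the avatar image is ignored because its class does not match.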


4. Downloading the pictures

Downloading a picture is essentially the same as scraping text. The only difference is that the data is binary.

import requests

imgUrl = ""  # the picture link goes here
r = requests.get(imgUrl)
content = r.content  # raw bytes of the response
with open("image.jpg", "wb") as f:  # "wb": write in binary mode
    f.write(content)

When handling the response of the network request, take .content, and when saving the file, set mode to wb. That's all there is to it.
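
Because the body is written as raw bytes, it can be worth checking that it really is an image before saving it. The sketch below is an optional extra that the original script does not do: it relies on the fact that JPEG files begin with the bytes FF D8 FF, and save_if_jpeg is a hypothetical helper name. The byte strings stand in for r.content and are not real images.

```python
import os
import tempfile

JPEG_MAGIC = b"\xff\xd8\xff"  # the first three bytes of a JPEG file


def save_if_jpeg(path, content):
    """Write bytes to path in binary mode; return False if not JPEG-like."""
    if not content.startswith(JPEG_MAGIC):
        return False
    with open(path, "wb") as f:
        f.write(content)
    return True


# Stand-ins for r.content (assumption): fake bytes, not a real image.
fake_jpeg = JPEG_MAGIC + b"\xe0" + b"\x00" * 16
error_page = b"<html>blocked</html>"

with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "image.jpg")
    print(save_if_jpeg(target, fake_jpeg))
    print(save_if_jpeg(os.path.join(folder, "bad.jpg"), error_page))
    with open(target, "rb") as f:
        print(f.read() == fake_jpeg)
```

A check like this keeps an HTML error page from being saved under a .jpg name.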

5. Getting the link to the original picture

However, we soon find that the downloaded picture is not the original but a thumbnail (its resolution is only 580x326, while the original is 1920x1080).


After some digging, I found the download link for the original picture on the page that shows the picture at full size.



Original picture: …

Comparing the two links shows that the original link can be obtained with a small change to the thumbnail link: keep only the file name and prepend a fixed prefix.

"…" + "0f72eed3fd1f4134d5a75d01321f95cad0c85ead.jpg"

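The rewrite described above can be wrapped in a small function. This is a sketch under stated assumptions: ORIGINAL_PREFIX is a placeholder rather than the real prefix, the thumbnail address is invented around the file name shown above, and to_original_url is a hypothetical helper name.

```python
import posixpath
from urllib.parse import urlsplit

# Placeholder prefix, NOT the real original-picture address (assumption).
ORIGINAL_PREFIX = "https://example.com/original/"


def to_original_url(thumb_url, prefix=ORIGINAL_PREFIX):
    """Keep only the file name of the thumbnail URL and prepend the prefix."""
    file_name = posixpath.basename(urlsplit(thumb_url).path)
    return prefix + file_name


# Invented thumbnail address built around the file name from the post.
thumb = "https://example.com/thumb/w%3D580/0f72eed3fd1f4134d5a75d01321f95cad0c85ead.jpg"
print(to_original_url(thumb))
```

Taking the file name from the parsed path, rather than splitting the whole URL on "/", also copes with a thumbnail link that carries a query string.
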
6. Putting the code together

With the analysis above, we can implement batch downloading of the original pictures from a Tieba post.

Below is the complete source code after tidying up.

import requests
from bs4 import BeautifulSoup
import os

def fetchUrl(url):
    # request a page and return its HTML
    r = requests.get(url)
    return r.text

def parseHtml(html):
    # collect every <img> tag with class "BDE_Image"
    bsObj = BeautifulSoup(html, "lxml")
    imgList = bsObj.find_all("img", attrs = {"class": "BDE_Image"})
    return imgList

def getPageNum(url):
    # read the total number of pages from the page-jump input box
    html = fetchUrl(url)
    bsObj = BeautifulSoup(html, "lxml")
    maxPage = bsObj.find("input", attrs={"id" : "jumpPage4"})["max-page"]
    return int(maxPage)

def downLoadImage(imgList):
    for img in imgList:
        imgName = img['src'].split("/")[-1]
        imgUrl = "" + imgName  # original-picture prefix + file name
        if os.path.exists("HD images/" + imgName):
            print("Skip:", imgName)
            continue  # already downloaded, move on to the next picture
        picReq = requests.get(imgUrl)
        saveFile("HD images/", imgName, picReq.content)

def saveFile(path, filename, content):
    # create the target folder on first use, then write the bytes
    if not os.path.exists(path):
        os.makedirs(path)
    with open(path + filename, "wb") as f:
        f.write(content)

def run(tid):
    url = "" % tid  # post URL template taking the post id goes here
    totalNum = getPageNum(url)
    for page in range(1, totalNum + 1):
        url = "" % (tid, page)  # post URL template taking the post id and page number goes here
        html = fetchUrl(url)
        imgList = parseHtml(html)
        downLoadImage(imgList)

if __name__ == "__main__":
    tid = 6516084831
    run(tid)

If anything in the article is unclear or explained incorrectly, feel free to leave a comment, or scan the QR code below to add me on WeChat, so we can learn from each other and improve together.


Copyright notice
Author: Clever crane. Please include the original link when reprinting. Thank you.
