Baidu Tieba: Downloading High-Definition Pictures with Python

2022-01-31 16:07:51 Clever crane

This is day 3 of my participation in the November writing challenge. For event details, see: 2021 Final Writing Challenge

Some time ago, a friend asked me to scrape the high-definition pictures from a Baidu Tieba post.

Here is the background: my friend found many beautiful pictures in a Tieba post and wanted to download the originals to use as wallpapers. There were too many pictures in the post and he wanted all of them, so I offered to write a crawler to download them in bulk.

There are only two requirements:

  1. Download the original pictures
  2. Support batch downloading

Enough talk, let's get started.

1. Analyze the website

The post address provided by my friend: tieba.baidu.com/p/651608483… .

First, look at the form of the URL. We can guess that 6516084831 is the post ID.

After ticking "Only show the original poster" and turning the page, the link becomes tieba.baidu.com/p/651608483… , with two more parameters in the URL. It is almost certain that see_lz=1 means showing only the original poster's posts, and pn=1 means the current page is page 1.
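
The URL pattern described above can be sketched as a small helper. This is only an illustration: the names build_post_url, BASE_URL and only_poster are made up here, and the meaning of see_lz and pn is the guess stated above, not documented behavior.

```python
from urllib.parse import urlencode

# Assumed base address, taken from the post URL shown in this article
BASE_URL = "https://tieba.baidu.com/p/"

def build_post_url(tid, page=1, only_poster=True):
    """Return the URL of one page of a post (hypothetical helper)."""
    params = {}
    if only_poster:
        params["see_lz"] = 1   # guessed meaning: show only the original poster
    params["pn"] = page        # guessed meaning: page number
    return f"{BASE_URL}{tid}?{urlencode(params)}"

print(build_post_url(6516084831, page=2))
# https://tieba.baidu.com/p/6516084831?see_lz=1&pn=2
```

Building the query string with urlencode keeps the parameters in one place, which is convenient once the page number has to change in a loop.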

Open the browser's developer tools and switch to the Network tab to capture the requests.

It turns out that the post content is rendered directly in the HTML page (there is no separate data interface). In other words, we only need to parse the page https://tieba.baidu.com/p/6516084831?see_lz=1&pn=2 to get the post's picture data.

2. Check for anti-scraping mechanisms

Write a simple piece of Python code to test for anti-scraping mechanisms:

import requests

url = "https://tieba.baidu.com/p/6516084831?see_lz=1&pn=1"
r = requests.get(url)
print(r.text)

Testing shows that there is no special anti-scraping mechanism. The data can be fetched directly without even setting a User-Agent.
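
If the site ever started rejecting default clients, a browser-like User-Agent header could be attached to the request. The sketch below uses the standard library's urllib.request only so that it runs without network access; it builds the request but does not send it. The header value and the build_request name are made up for illustration. With requests, the same dictionary would be passed as requests.get(url, headers=HEADERS).

```python
from urllib.request import Request

# Hypothetical browser-like value; any realistic User-Agent string would do
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

def build_request(url):
    """Create (but do not send) a GET request carrying the custom headers."""
    return Request(url, headers=HEADERS)

req = build_request("https://tieba.baidu.com/p/6516084831?see_lz=1&pn=1")
# urllib normalizes header names with str.capitalize(), hence "User-agent"
print(req.get_header("User-agent"))
```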

3. Extract the data

There is no anti-scraping mechanism and the data is in a static page, so let's look directly at the page source and parse the data.

The pictures are in img tags whose class is BDE_Image, and the picture link is the tag's src attribute.

import requests
from bs4 import BeautifulSoup

url = "https://tieba.baidu.com/p/6516084831?see_lz=1&pn=1"
r = requests.get(url)
html = r.text
bsObj = BeautifulSoup(html, "lxml")
imgList = bsObj.find_all("img", attrs = {"class": "BDE_Image"})
for img in imgList:
    print(img["src"])

We use the find_all function from the BeautifulSoup library to find all img tags that meet the requirement, and then take their src attribute.
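
To see the same filtering rule work without touching the network, here is a dependency-free sketch that swaps BeautifulSoup for the standard library's html.parser. The sample markup and the ImageSrcParser class are made up for illustration and are far simpler than a real Tieba page.

```python
from html.parser import HTMLParser

class ImageSrcParser(HTMLParser):
    """Collect the src of every <img> whose class list contains BDE_Image."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if "BDE_Image" in classes and attr_map.get("src"):
            self.sources.append(attr_map["src"])

# Made-up sample markup, only for illustration
SAMPLE_HTML = """
<div class="post">
  <img class="BDE_Image" src="http://example.com/a.jpg">
  <img class="BDE_Smiley" src="http://example.com/smiley.png">
  <img class="BDE_Image" src="http://example.com/b.jpg">
</div>
"""

parser = ImageSrcParser()
parser.feed(SAMPLE_HTML)
print(parser.sources)
# ['http://example.com/a.jpg', 'http://example.com/b.jpg']
```

Only the two BDE_Image tags are kept; the other img tag is ignored because its class does not match.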

4. Download the pictures

Downloading pictures is essentially the same as crawling text. The only difference is that the data is binary.

import requests

imgUrl = "http://tiebapic.baidu.com/forum/w%3D580/sign=1ecd59e749df8db1bc2e7c6c3922dddb/0f72eed3fd1f4134d5a75d01321f95cad0c85ead.jpg"
r = requests.get(imgUrl)
content = r.content  # raw bytes of the response body
# Open the file in binary mode ("wb") because the data is bytes, not text
with open("image.jpg", "wb") as f:
    f.write(content)

When handling the response of the network request, take .content; when saving the file, set mode to wb. That is all.
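
Why binary mode matters can be shown offline with a few bytes. The sketch below is only an illustration: the four bytes are the start of a JFIF-style JPEG file, not a full picture, and the file is written to a temporary directory.

```python
import os
import tempfile

# First bytes of a JFIF-style JPEG file; 0xff is never valid in UTF-8
jpeg_bytes = b"\xff\xd8\xff\xe0"

# Trying to treat the picture data as text fails
try:
    jpeg_bytes.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False

# Writing in "wb" mode and reading in "rb" mode leaves the bytes unchanged
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "image.jpg")
    with open(path, "wb") as f:
        f.write(jpeg_bytes)
    with open(path, "rb") as f:
        round_trip = f.read()

print(decodable, round_trip == jpeg_bytes)
# False True
```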

5. Get the link to the original picture

However, we soon find that the downloaded picture is not the original but a thumbnail (its resolution is only 580x326, while the original is 1920x1080).

After some digging, I found the download link of the original picture on the page for viewing the large picture.

Thumbnail: tiebapic.baidu.com/forum/w%3D5…

Original picture: tiebapic.baidu.com/forum/pic/i…

Comparing the two, the original-picture link can be built from the thumbnail link with a small change: the original-picture path followed by the thumbnail's file name.

"tiebapic.baidu.com/forum/pic/i…" + "0f72eed3fd1f4134d5a75d01321f95cad0c85ead.jpg"

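The rewrite rule can be sketched as a small function. The prefix is the one used in the complete source code in section 6; the names to_original_url and ORIGINAL_PREFIX are made up for illustration, and the rule itself is only what was observed on this post, not a documented API.

```python
import posixpath
from urllib.parse import urlsplit

# Prefix observed for original pictures (same value as in the full source)
ORIGINAL_PREFIX = "http://tiebapic.baidu.com/forum/pic/item/"

def to_original_url(thumb_url):
    """Map a thumbnail URL to the original-picture URL by reusing its file name."""
    # Take the last path segment, ignoring any query string
    file_name = posixpath.basename(urlsplit(thumb_url).path)
    return ORIGINAL_PREFIX + file_name

thumb = "http://tiebapic.baidu.com/forum/w%3D580/sign=1ecd59e749df8db1bc2e7c6c3922dddb/0f72eed3fd1f4134d5a75d01321f95cad0c85ead.jpg"
print(to_original_url(thumb))
# http://tiebapic.baidu.com/forum/pic/item/0f72eed3fd1f4134d5a75d01321f95cad0c85ead.jpg
```

Parsing the URL first, instead of splitting the raw string on "/", keeps the file name clean even if a thumbnail link carries a query string.
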
6. Organize the code

With the analysis above, we can implement batch downloading of original pictures from Tieba.

Below is the complete source code after cleanup.

import requests
from bs4 import BeautifulSoup
import os

def fetchUrl(url):
    r = requests.get(url)
    return r.text

def parseHtml(html):
    bsObj = BeautifulSoup(html, "lxml")
    imgList = bsObj.find_all("img", attrs = {"class": "BDE_Image"})
    return imgList

def getPageNum(url):
    html = fetchUrl(url)
    bsObj = BeautifulSoup(html, "lxml")
    maxPage = bsObj.find("input", attrs={"id" : "jumpPage4"})["max-page"]
    print(maxPage)
    return int(maxPage)

def downLoadImage(imgList):
    for img in imgList:
        # Reuse the thumbnail's file name to build the original-picture link
        imgName = img['src'].split("/")[-1]
        imgUrl = "http://tiebapic.baidu.com/forum/pic/item/" + imgName

        # Skip pictures that were already downloaded
        if os.path.exists("HD_images/" + imgName):
            print("Skip :", imgName)
            continue

        picReq = requests.get(imgUrl)
        saveFile("HD_images/", imgName, picReq.content)
        print(imgName)

def saveFile(path, filename, content):
    # Create the target folder on first use
    if not os.path.exists(path):
        os.makedirs(path)

    # Pictures are binary data, so write in "wb" mode
    with open(path + filename, "wb") as f:
        f.write(content)
        
def run(tid):
    url = "https://tieba.baidu.com/p/%d?see_lz=1&pn=1" %tid
    totalNum = getPageNum(url)
    for page in range(1, totalNum + 1):
        url = "https://tieba.baidu.com/p/%d?see_lz=1&pn=%d" % (tid, page)
        html = fetchUrl(url)
        imgList = parseHtml(html)
        downLoadImage(imgList)
    
if __name__ == "__main__":
    tid = 6516084831
    run(tid)
    print("over")
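
One fragile spot in getPageNum is that find returns None when the jumpPage4 input is absent, which makes the subscript raise a TypeError. Whether Tieba actually omits that element on some posts is an assumption, not something verified here. A defensive variant could fall back to a single page, sketched below with a regular expression so it runs without extra libraries; parse_max_page and the sample markup are made up for illustration. In the article's code, the equivalent guard would be checking the result of find for None.

```python
import re

def parse_max_page(html, default=1):
    """Return the max-page value of the jumpPage4 input, or default if absent."""
    # Locate the whole <input ...> tag whose id is jumpPage4
    tag = re.search(r'<input\b[^>]*\bid="jumpPage4"[^>]*>', html)
    if tag is None:
        return default
    # Read the max-page attribute inside that tag
    value = re.search(r'\bmax-page="(\d+)"', tag.group(0))
    return int(value.group(1)) if value else default

# Made-up markup for illustration; real pages contain much more
with_pager = '<input type="text" id="jumpPage4" max-page="7">'
without_pager = "<div>single page</div>"

print(parse_max_page(with_pager), parse_max_page(without_pager))
# 7 1
```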

If anything in the article is unclear or explained incorrectly, feel free to leave a comment so we can learn from each other and improve together.

Copyright notice
Author: Clever crane. Please include the original link when reprinting. Thank you.
https://en.pythonmana.com/2022/01/202201311607505917.html
