
[Python data acquisition] page image crawling and saving

2022-01-30 19:32:32 liedmirror


Preface

In this article, we will use the requests and urllib.request libraries to fetch all the images from a given URL. The approach works on any statically rendered page.

Approach

Fetching the HTML

Construct a get_html function that takes a URL and a request mode and returns the HTML text:

import re
import urllib.request

import requests


def get_html(url: str, request_type: str) -> str:
    """Fetch HTML.

    :param url: address to request
    :param request_type: request mode: "urllib.request", "requests",
        or the name of a cache file
    :return: HTML text
    """
    if request_type == "urllib.request":
        # fetch with urllib.request
        return urllib.request.urlopen(url).read().decode("utf-8")
    elif request_type == "requests":
        # fetch with requests
        response = requests.get(url)
        return response.text
    else:
        # read a cached file
        with open(f'./.cache/{request_type}.html', 'r', encoding='utf-8') as f:
            return f.read()
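The last branch of get_html reads a file from ./.cache/ that some earlier run must have written. The article does not show that step; below is a minimal sketch of a save_cache helper (the name and location convention are assumptions, not part of the article's code) that writes files where get_html expects them:

```python
import os

def save_cache(html: str, name: str) -> None:
    # Hypothetical helper: store fetched HTML under ./.cache/<name>.html
    # so a later get_html(url, name) call can reload it without a request.
    os.makedirs('./.cache', exist_ok=True)
    with open(f'./.cache/{name}.html', 'w', encoding='utf-8') as f:
        f.write(html)
```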

Parsing the images

Use a regular expression to extract the src attribute of every img tag:

imgList = re.findall(r'<img.*?src="(.*?)"', html, re.S)

Note: the re.S (DOTALL) flag must be set here; otherwise img tags that span multiple lines will not be matched.
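The effect of re.S can be checked with a small, self-contained snippet (the sample HTML is made up for illustration):

```python
import re

# An <img> tag whose src sits on a different line than the tag opening.
html = '<img class="pic"\n     src="/images/a.jpg">'

# Without re.S, '.' does not match newlines, so the tag is missed.
print(re.findall(r'<img.*?src="(.*?)"', html))        # []

# With re.S (DOTALL), '.' also matches '\n', so the src is found.
print(re.findall(r'<img.*?src="(.*?)"', html, re.S))  # ['/images/a.jpg']
```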

Saving the images

The file content is simply the body returned when requesting the image URL. Use with open to create a file object, setting the mode to wb (binary write):

resp = requests.get(img_url)
with open(f'./download/{img.split("/")[-1]}', 'wb') as f:
    f.write(resp.content)

Use string splitting to extract the file name from the URL (this keeps the correct extension):

img.split("/")[-1]
# or split on "." to extract only the extension, and name the file yourself.
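Splitting on "/" keeps any query string in the file name (e.g. photo.jpg?v=2). If that can happen, the standard library offers a cleaner route; filename_from_url is a hypothetical helper, not part of the article's code:

```python
import os
from urllib.parse import urlparse

def filename_from_url(img_url: str) -> str:
    # Parse the URL, take only its path component (query string dropped),
    # then take the last path segment as the file name.
    path = urlparse(img_url).path
    return os.path.basename(path)

print(filename_from_url("http://news.fzu.edu.cn/attach/photo.jpg?v=2"))  # photo.jpg
```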

Finally, f.write(resp.content) performs the save.

def get_imgs(html: str, download: bool = True) -> None:
    """Collect all image addresses and optionally download them.

    :param html: input HTML
    :param download: whether to download the images
    :return: None
    """
    imgList = re.findall(r'<img.*?src="(.*?)"', html, re.S)
    print(imgList)
    print(f'{len(imgList)} images in total')
    if download:
        for i, img in enumerate(imgList):
            img_url = "http://news.fzu.edu.cn" + img
            print(f"Saving image {i + 1}, url: {img_url}")
            resp = requests.get(img_url)
            with open(f'./download/{img.split("/")[-1]}', 'wb') as f:
                f.write(resp.content)
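get_imgs prepends the site root "http://news.fzu.edu.cn" to every src, which produces broken URLs when a page mixes relative and absolute addresses (e.g. CDN-hosted images). A sketch of a more robust resolution step using urllib.parse.urljoin, which leaves already-absolute URLs untouched (the base URL is the example site from the article, the CDN address is made up):

```python
from urllib.parse import urljoin

BASE_URL = "http://news.fzu.edu.cn"  # example site used in the article

def absolute_url(src: str) -> str:
    # urljoin resolves relative paths against BASE_URL
    # and returns absolute URLs unchanged.
    return urljoin(BASE_URL, src)

print(absolute_url("/attach/photo.jpg"))                 # http://news.fzu.edu.cn/attach/photo.jpg
print(absolute_url("http://cdn.example.com/photo.jpg"))  # http://cdn.example.com/photo.jpg
```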

Running results

[Screenshot: console output listing the extracted image URLs]

Saved pictures:

[Screenshot: the downloaded image files in ./download]

Copyright notice
Author: liedmirror. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201301932302775.html
