
Python crawler in practice: using the requests module to scrape street-snap photos from Toutiao (Today's Headlines)

2022-02-01 04:42:16 Dai mubai

「This is day 19 of my participation in the November Writing Challenge. Event details: 2021 Last Writing Challenge」.

Preface

This post uses Python to scrape street-snap photos from Toutiao (Today's Headlines). Without further ado,

let's get started.

Development tools

Python version: 3.6.4

Related modules:

re;

requests;

and some of Python's built-in modules.

Environment setup

Install Python and add it to the PATH environment variable, then use pip to install the required modules.

Detailed browser information

1.png

2.png

3.png

Code for getting the article links:

import requests
import json
import re

# Browser-like User-Agent so the request looks like a normal visit
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
}

def get_first_data(offset):
    """Request one page of search results and return the raw JSON text."""
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': '街拍',  # search keyword: "street snap"
        'autoload': 'true',
        'count': '20',
        'cur_tab': '1',
        'from': 'search_tab'
    }
    try:
        response = requests.get(url='https://www.toutiao.com/search_content/', headers=headers, params=params)
        response.raise_for_status()
        return response.text
    except requests.RequestException:
        print("Failed to fetch the search results")
        return None

def handle_first_data(html):
    """Yield the article URL of each item in the search results."""
    if not html:
        return
    data = json.loads(html)
    if data and "data" in data:
        for item in data.get("data"):
            yield item.get("article_url")

A note on error handling in the requests module: the response object's raise_for_status() method raises an exception if the request returned an error status code (4xx or 5xx). Wrap the call in try and except statements to handle the error so that the program does not crash.
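
As a minimal sketch of this pattern, the snippet below builds `requests.Response` objects by hand (so no network access is needed; real code would get them from `requests.get()`) and shows `raise_for_status()` raising `requests.exceptions.HTTPError` for a 404 status. The helper name `text_or_none` is hypothetical and not part of the original code.

```python
import requests


def text_or_none(response):
    """Return the response body, or None if the status code signals an error."""
    try:
        # Raises requests.exceptions.HTTPError for 4xx and 5xx status codes.
        response.raise_for_status()
    except requests.exceptions.HTTPError as exc:
        print("Request failed:", exc)
        return None
    return response.text


# Hand-built responses, for illustration only.
bad = requests.Response()
bad.status_code = 404

ok = requests.Response()
ok.status_code = 200  # no body attached, so .text is an empty string

print(text_or_none(bad))
print(repr(text_or_none(ok)))
```

Catching the specific exception class rather than a bare `Exception` keeps unrelated bugs from being silently swallowed.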

The requests module documentation is available at: http://cn.python-requests.org/zh_CN/latest/

Code for getting the image links:

def get_second_data(url):
    """Fetch an article page and return its HTML, or None on failure."""
    if url:
        try:
            response = requests.get(url, headers=headers)
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            print("An error occurred while opening the link")
            return None

def handle_second_data(html):
    """Extract the list of image URLs from the gallery data embedded in the page."""
    if html:
        pattern = re.compile(r'gallery: JSON.parse\((.*?)\),', re.S)
        result = re.search(pattern, html)
        if result:
            imageurl = []
            # The captured group is a JSON string whose content is itself JSON, so decode twice
            data = json.loads(json.loads(result.group(1)))
            if data and "sub_images" in data:
                sub_images = data.get("sub_images")
                images = [item.get('url') for item in sub_images]
                for image in images:
                    imageurl.append(image)  # append each URL, not the whole list
                return imageurl
        else:
            print("No gallery data found")
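
The double `json.loads` in `handle_second_data` is needed because the page passes the gallery to `JSON.parse` as a string literal, so the captured group is a JSON string whose content is itself JSON. Here is a minimal sketch on a made-up page fragment. The `sample_html` string and its URLs are invented for illustration, and the real page layout may differ or may have changed since the article was written.

```python
import json
import re

# Invented fragment imitating the structure the regular expression expects.
inner = json.dumps({"sub_images": [{"url": "http://example.com/a"},
                                   {"url": "http://example.com/b"}]})
sample_html = ("var page = {\n  gallery: JSON.parse(" + json.dumps(inner)
               + "),\n  other: 1\n};")

pattern = re.compile(r'gallery: JSON.parse\((.*?)\),', re.S)
match = pattern.search(sample_html)

# First loads: string literal -> JSON text. Second loads: JSON text -> dict.
data = json.loads(json.loads(match.group(1)))
image_urls = [item.get("url") for item in data.get("sub_images", [])]
print(image_urls)
```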

Code for downloading the images:

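def download_image(imageUrl):
    """Download every image in the list and save it as a .jpg file."""
    for url in imageUrl:
        try:
            image = requests.get(url).content
        except requests.RequestException:
            continue  # skip this image if the download fails
        # The last 10 characters of the URL serve as the file name
        with open("images" + str(url[-10:]) + ".jpg", "wb") as ob:
            ob.write(image)
        print(url[-10:] + " downloaded successfully! " + url)

def main():
    html = get_first_data(0)
    if not html:
        return  # nothing to process if the search request failed
    for url in handle_first_data(html):
        html = get_second_data(url)
        if html:
            result = handle_second_data(html)
            if result:
                try:
                    download_image(result)
                except KeyError:
                    print("{0} has a problem, skipping".format(result))
                    continue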

if __name__ == '__main__':
    main()

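
`main()` above requests only the first page of results (`offset` 0). Assuming the search endpoint pages by advancing `offset` in steps of `count` (20 in the parameters above), which the original does not state, the offsets for further pages could be generated as in this sketch. The function name `page_offsets` is hypothetical.

```python
def page_offsets(pages, count=20):
    """Yield the offset for each of the first `pages` result pages.

    Assumes the endpoint advances by `count` items per page.
    """
    for page in range(pages):
        yield page * count


offsets = list(page_offsets(3))
print(offsets)  # [0, 20, 40]
```

Each offset could then be passed to `get_first_data(offset)` in place of the fixed `0`.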

Finally, the download was successful

4.png

5.png

Check the details

6.png

7.png

8.png

copyright notice
Author: Dai mubai. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/02/202202010442153685.html
