Huxiu 24-hour auto-liker, a case that teaches one crawler technique, Python crawler lesson 7-9

2022-01-31 15:56:50 Dream eraser

Many platforms have a like button, and the approach presented today applies to a lot of them. I hope you master this technique and build your own auto-liker. The target of this case is the like feature of Huxiu's 24-hour channel.

Analysis before crawling

Analyze the hot-topic data source

This case spends some time on the analysis stage, and the process is a little tricky. Once you understand it, you will have learned a very common crawler-writing technique.

Drag the browser scroll bar to the bottom of the page and capture the request https://moment-api.huxiu.com/web-v2/moment/feed. Its request data and returned data are shown in the figures below.

The request method is POST.

[Figure: the feed request in the developer tools, request method POST]

The request data consists of three values. The last two appear to be fixed; for the first one, last_dateline, we need to find where its value comes from in order to solve this case.

[Figure: request data of the feed request: last_dateline, platform, is_ai]

The returned data has many nested levels and looks complicated, but it is plain JSON, so normal parsing is enough.

[Figure: the nested JSON returned by the feed request]

The returned data holds another key piece of information: it actually provides the value for the last_dateline parameter. You can therefore keep fetching in an endless loop and stop when data acquisition fails.

At this point the analysis of data acquisition is complete. The conclusions are as follows:

  • The request method is POST;
  • Request address is moment-api.huxiu.com/web-v2/mome…
  • The request parameters are last_dateline, platform, and is_ai;
  • The first request can use a fixed last_dateline value;
  • Loop to fetch all the hot-topic items.

Analyze the like interface

Click the like button and capture the request to https://moment-api.huxiu.com/web/moment/agree. This link is the like request address.

The request method and address are shown below; it is again a POST request:

[Figure: the like request in the developer tools, request method POST]

The request data is shown in the figure below:

[Figure: request data of the like request: moment_id, platform]

The moment_id parameter should be the moment_id of a hot-topic item obtained above. Its location is shown in the figure:

[Figure: location of moment_id in the returned feed data]
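As a rough illustration of where moment_id sits, the sketch below walks a hand-made response that mimics only the nesting of the feed data (data, moment_list, datalist, datalist) and collects the ids. The sample values are made up, and the idea that every item carries a moment_id key directly is an assumption, not something taken from a real response.

```python
def extract_moment_ids(body):
    """Collect moment_id values from a feed-shaped response dictionary."""
    ids = []
    # Outer datalist: groups of items; inner datalist: the individual items.
    groups = body.get("data", {}).get("moment_list", {}).get("datalist", [])
    for group in groups:
        for item in group.get("datalist", []):
            # Assumption: every item exposes its id under the "moment_id" key.
            if "moment_id" in item:
                ids.append(item["moment_id"])
    return ids


# Hand-made sample with made-up ids, mimicking only the nesting.
sample = {
    "success": True,
    "data": {
        "moment_list": {
            "datalist": [
                {"datalist": [{"moment_id": 130725}, {"moment_id": 130726}]},
                {"datalist": [{"moment_id": 130727}]},
            ]
        }
    },
}

print(extract_moment_ids(sample))  # [130725, 130726, 130727]
```

Using .get with defaults means a response with missing levels yields an empty list instead of raising KeyError.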

The format of the data returned after a successful request is shown in the figure below.

[Figure: data returned by a successful like request]

At this point the analysis of the like interface is complete. The conclusions are as follows:

  • The request method is POST;
  • Request address is moment-api.huxiu.com/web/moment/…
  • The request parameters are moment_id and platform;
  • When testing the code, you can send a single request first, with moment_id fixed to one value.

Crawler writing time

Introduction to requests POST requests

Now for the actual crawler writing. First, let's look at how the requests library sends a POST request.

There are several ways to make a POST request in requests, as follows.

The plain POST is the most commonly used form: simply pass a dictionary to the data parameter.

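import requests

# httpbin.org is a site made for testing HTTP requests
url = 'http://httpbin.org/post'
data = {'key1': 'value1', 'key2': 'value2'}
r = requests.post(url, data=data)
print(r)          # the Response object, e.g. <Response [200]>
print(r.text)     # response body decoded as text
print(r.content)  # response body as raw bytes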

Note that <Response [200]> in the printed output means the request succeeded.

Passing a list of tuples as data

This is especially useful when several fields in the submitted form share the same key.

import requests

# httpbin.org is a site made for testing HTTP requests
url = 'http://httpbin.org/post'
# A sequence of pairs lets the same key appear more than once
payload = (('key1', 'value1'), ('key1', 'value2'))
r = requests.post(url, data=payload)

print(r.text)

Passing a JSON string

When the data to send is not form-encoded, pass a string rather than a dictionary.

import requests
import json

# httpbin.org is a site made for testing HTTP requests
url_json = 'http://httpbin.org/post'
# json.dumps serializes a Python object into a JSON string
data_json = json.dumps({'key1': 'value1', 'key2': 'value2'})
r_json = requests.post(url_json, data=data_json)

print(r_json.text)

Uploading files

This goes slightly beyond the outline of this crawler course, so we skip it for now.

Note that, in requests, what sets a POST call apart from a GET call is mainly the data parameter, which carries the request body. Now we can write the actual code.
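To make the three styles above more concrete, here is a small standard-library sketch of the kind of body each one produces. It does not call requests at all: it uses urllib.parse.urlencode and json.dumps to reproduce form-encoded and JSON bodies, so treat it as an illustration of the encodings rather than of requests internals.

```python
import json
from urllib.parse import urlencode

# Dictionary: one value per key, form-encoded.
form_from_dict = urlencode({"key1": "value1", "key2": "value2"})

# Sequence of pairs: the same key may appear more than once.
form_from_tuples = urlencode([("key1", "value1"), ("key1", "value2")])

# JSON string: sent as-is, not form-encoded.
json_body = json.dumps({"key1": "value1", "key2": "value2"})

print(form_from_dict)    # key1=value1&key2=value2
print(form_from_tuples)  # key1=value1&key1=value2
print(json_body)         # {"key1": "value1", "key2": "value2"}
```

requests.post also accepts a json= keyword, which serializes the object for you and sets the Content-Type header to application/json.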

Get the data to be liked

First, get one page of data to like. The code is as follows:

import requests

def get_feed():
    # Build the request data dictionary
    data = {
        "last_dateline": "1605591900",
        "platform": "www",
        "is_ai": 0
    }
    r = requests.post(
        "https://moment-api.huxiu.com/web-v2/moment/feed", data=data)
    data = r.json()
    # data["success"]

    # Extract the items; the nesting is deep, so go one level at a time
    datalist = data["data"]["moment_list"]["datalist"][0]["datalist"]
    print(datalist)

Experiments show that last_dateline can also be left empty on the first request. The code for the follow-up loop is left to you: the crawler course is coming to an end, and the rest requires you to build on your own Python basics.
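For readers who want a starting point for that loop, here is one possible sketch. The article only says that the returned data provides the next last_dateline, not where it sits, so the key path data.moment_list.last_dateline used below is an assumption to verify against a real response. The names fetch_page and iter_items are made up for this example, and a page cap replaces the truly endless loop.

```python
FEED_URL = "https://moment-api.huxiu.com/web-v2/moment/feed"


def fetch_page(last_dateline):
    """POST one feed request and return the decoded JSON body."""
    import requests  # imported here so the paging logic can be tried offline

    payload = {"last_dateline": last_dateline, "platform": "www", "is_ai": 0}
    r = requests.post(FEED_URL, data=payload, timeout=10)
    r.raise_for_status()
    return r.json()


def iter_items(fetch=fetch_page, first_dateline="", max_pages=5):
    """Yield feed items page by page until the data runs out."""
    last_dateline = first_dateline
    for _ in range(max_pages):  # hard cap instead of a truly endless loop
        body = fetch(last_dateline)
        if not body.get("success"):
            break
        moment_list = body.get("data", {}).get("moment_list", {})
        for group in moment_list.get("datalist", []):
            yield from group.get("datalist", [])
        # ASSUMPTION: the next value is stored at data.moment_list.last_dateline.
        next_dateline = moment_list.get("last_dateline")
        if not next_dateline or next_dateline == last_dateline:
            break
        last_dateline = next_dateline
```

A call such as `for item in iter_items(): print(item)` would then walk the first few pages. Passing the fetch function in as a parameter makes it easy to swap in canned responses while experimenting.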

Writing the like code

I ran into a small problem while writing the like code, and it involves cookies, that old and recurring difficulty.

import requests

# Pretend to be a regular browser
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"
}

def agree(moment_id):
    # Form data expected by the like interface
    data = {
        "moment_id": moment_id,
        "platform": "www",
    }

    r = requests.post(
        "https://moment-api.huxiu.com/web/moment/agree", data=data, headers=headers)
    res_data = r.json()
    print(res_data)
    # print(res_data["success"] is True)

if __name__ == "__main__":
    agree(130725)

Running the above code returns the following result:

{'success': False, 'message': ' Please turn on COOKIE'}

Checking the request captured by the developer tools again, I did find a distinctive cookie, as follows:

[Figure: the cookie sent with the like request, including huxiu_analyzer_wcy_id]

It is a huxiu_analyzer_wcy_id parameter, and the problem seems to lie here. Obtaining this value is more troublesome, so we need to find a way to capture it.

Eraser solved this with the following steps. I hope you pick up a general way to handle this type of problem and to track down cookies.

Open Chrome in incognito mode and go directly to https://www.huxiu.com/moment/. Incognito mode is used to make sure no cookie is cached.

Next, search for huxiu_analyzer_wcy_id directly in the developer tools.

The result is as follows:

[Figure: search results for huxiu_analyzer_wcy_id in the developer tools]

This locates where the cookie is set: it is set by the response of the checklogin interface.

[Figure: the checklogin response that sets the cookie]
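To make "the response sets the cookie" a bit more tangible, here is a tiny standard-library sketch that parses a Set-Cookie header, which is the kind of line the search in the developer tools turns up. The header value below is made up for illustration; the real one has to be read from the checklogin response.

```python
from http.cookies import SimpleCookie

# Made-up header value; the real one appears in the checklogin response headers.
set_cookie = "huxiu_analyzer_wcy_id=example-value; Path=/; Domain=.huxiu.com"

cookie = SimpleCookie()
cookie.load(set_cookie)

morsel = cookie["huxiu_analyzer_wcy_id"]
print(morsel.value)      # example-value
print(morsel["domain"])  # .huxiu.com
```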

The solution that follows is crude: look at how checklogin is called, request it ourselves, save the cookie value from its response, and then request the like interface.

Finally, modify the like code to complete the task:

def agree(moment_id):
    # A Session keeps cookies between requests
    s = requests.Session()
    # Request checklogin first so that the response sets the cookie
    r = s.get("https://www-api.huxiu.com/v1/user/checklogin", headers=headers)

    jar = r.cookies
    print(jar.items())
    data = {
        "moment_id": moment_id,
        "platform": "www",
    }

    # The session already carries the cookie; passing it explicitly is harmless
    r = s.post(
        "https://moment-api.huxiu.com/web/moment/agree", data=data, headers=headers, cookies=jar)
    res_data = r.json()
    print(res_data)
    # print(res_data["success"] is True)

Running the above code returns the message confirming that the like succeeded.

{'success': True, 'message': ' I like it '}
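Since a requests.Session keeps the cookies it receives, the same idea extends to liking several items with a single checklogin call. The sketch below is only one way to arrange it: like_many is a made-up name, the shortened user-agent string and the one-second pause are assumptions, and the session is passed in so the function can be tried with a stand-in object.

```python
import time

CHECKLOGIN_URL = "https://www-api.huxiu.com/v1/user/checklogin"
AGREE_URL = "https://moment-api.huxiu.com/web/moment/agree"
HEADERS = {"user-agent": "Mozilla/5.0"}  # shortened; use a full browser string in practice


def like_many(session, moment_ids, delay=1.0):
    """Like each id through one shared session; return {moment_id: success}."""
    # One request is enough: the session keeps the cookie set by checklogin.
    session.get(CHECKLOGIN_URL, headers=HEADERS, timeout=10)
    results = {}
    for moment_id in moment_ids:
        data = {"moment_id": moment_id, "platform": "www"}
        r = session.post(AGREE_URL, data=data, headers=HEADERS, timeout=10)
        results[moment_id] = bool(r.json().get("success"))
        time.sleep(delay)  # pause between likes to avoid hammering the server
    return results
```

With the real library this would be called as `like_many(requests.Session(), [130725])`.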

Final words

This post mainly introduces how the requests library sends POST requests and, along the way, shows a general way to obtain cookies. A reminder: when writing crawlers, the search feature in the developer tools comes in handy often, especially when dealing with JS encryption.


copyright notice
author[Dream eraser],Please bring the original link to reprint, thank you.
https://en.pythonmana.com/2022/01/202201311556460821.html
