
Python crawler in practice: using the requests module to capture a video's barrage

2022-01-31 02:18:06 Dai mubai

「This is day 4 of my participation in the November Gengwen Challenge. Check out the event details: 2021 Last Gengwen Challenge」.

Preface

Using Python to capture a video's barrage (bullet comments), without further ado.

Let's start happily ~

Development tools

Python version: 3.6.4

Related modules:

requests module;

pandas module;

as well as some Python built-in modules.

Environment setup

Install Python, add it to the environment variables, and use pip to install the required modules (e.g. pip install requests pandas).

Approach

Taking the film 《Revolutionary》 as an example, this article explains how to crawl a video's barrage and comments!

Target website

https://v.qq.com/x/cover/mzc00200m72fcup.html

Capturing the barrage

Analyze the website

As usual, open the browser's developer tools and capture packets. Every 30 seconds of playback a new JSON packet is loaded, and it contains the barrage data we need.

Barrage data (screenshot)

The captured URLs:

https://mfm.video.qq.com/danmu?otype=json&callback=jQuery19109541041335587612_1628947050538&target_id=7220956568%26vid%3Dt0040z3o3la&session_key=0%2C32%2C1628947057&timestamp=15&_=1628947050569
https://mfm.video.qq.com/danmu?otype=json&callback=jQuery19109541041335587612_1628947050538&target_id=7220956568%26vid%3Dt0040z3o3la&session_key=0%2C32%2C1628947057&timestamp=45&_=1628947050572

The parameters that differ are timestamp and _. The _ parameter is a Unix timestamp. timestamp works like a page number: the first URL uses 15, and later ones increase in steps of 30, matching the packet update interval, up to the video duration of 7245 seconds. After removing the unnecessary parameters, we get this URL:

https://mfm.video.qq.com/danmu?otype=json&target_id=7220956568%26vid%3Dt0040z3o3la&session_key=0%2C18%2C1628418094&timestamp=15&_=1628418086509
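Before writing the full loop, it can help to request a single packet and confirm the JSON layout. Below is a minimal sketch (the comments key and content field are the same ones used in the final code; the target_id and vid belong to this particular video):

import requests

# Probe one barrage packet (the 15-second mark) and print a few comments
# to confirm the structure of the JSON response.
url = ("https://mfm.video.qq.com/danmu?otype=json"
       "&target_id=7220956568%26vid%3Dt0040z3o3la"
       "&session_key=0%2C18%2C1628418094&timestamp=15&_=1628418086509")
headers = {'User-Agent': 'Googlebot'}

data = requests.get(url, headers=headers).json()
for comment in data['comments'][:5]:
    print(comment['content'])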

Code implementation

import pandas as pd
import time
import requests

headers = {
    'User-Agent': 'Googlebot'
}
# timestamp starts at 15; 7245 is the video length in seconds; each link advances by 30 seconds
df = pd.DataFrame()
for i in range(15, 7245, 30):
    url = "https://mfm.video.qq.com/danmu?otype=json&target_id=7220956568%26vid%3Dt0040z3o3la&session_key=0%2C18%2C1628418094&timestamp={}&_=1628418086509".format(i)
    html = requests.get(url, headers=headers).json()
    time.sleep(1)
    for comment in html['comments']:
        content = comment['content']
        print(content)
        text = pd.DataFrame({'barrage': [content]})
        df = pd.concat([df, text])
df.to_csv('Revolutionary_barrage.csv', encoding='utf-8', index=False)
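As a quick sanity check, the resulting CSV can be read back with pandas (assuming it was written as Revolutionary_barrage.csv, as above):

import pandas as pd

# Read the scraped barrage file back and inspect it.
df = pd.read_csv('Revolutionary_barrage.csv', encoding='utf-8')
print(df.shape)   # number of barrage lines collected
print(df.head())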

Result

Result (screenshot)

Capturing the comments

Analyze the website

The comment data sits at the bottom of the page and is also loaded dynamically. Open the developer tools and capture packets as follows:

Packet capture (screenshot)

Click to load more comments; the new packets contain the comment data we need. The real URLs:

https://video.coral.qq.com/varticle/6655100451/comment/v2?callback=_varticle6655100451commentv2&orinum=10&oriorder=o&pageflag=1&cursor=0&scorecursor=0&orirepnum=2&reporder=o&reppageflag=1&source=132&_=1628948867522
https://video.coral.qq.com/varticle/6655100451/comment/v2?callback=_varticle6655100451commentv2&orinum=10&oriorder=o&pageflag=1&cursor=6786869637356389636&scorecursor=0&orirepnum=2&reporder=o&reppageflag=1&source=132&_=1628948867523

Delete the callback and _ parameters from the URL. The key parameter is cursor: in the first URL cursor equals 0, and only from the second URL onward does it take a real value, so we need to find where cursor comes from. After some observation, cursor turns out to be the last field returned by the previous request:

cursor field (screenshot)
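In other words, the pagination follows a cursor pattern: each response carries a last field that becomes the cursor of the next request. A minimal sketch of that hand-off, fetching just a few pages (the data, last and oriCommList fields are the same ones used in the full code below):

import requests

headers = {'User-Agent': 'Mozilla/5.0'}
base = ('https://video.coral.qq.com/varticle/6655100451/comment/v2'
        '?orinum=10&oriorder=o&pageflag=1&cursor={}&scorecursor=0'
        '&orirepnum=2&reporder=o&reppageflag=1&source=132')

cursor = 0  # the first page always starts from cursor=0
for _ in range(3):
    res = requests.get(base.format(cursor), headers=headers).json()
    print('cursor', cursor, '->', len(res['data']['oriCommList']), 'comments')
    cursor = res['data']['last']  # the "last" field feeds the next request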

Code implementation

import requests
import pandas as pd
import time
import random

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
df = pd.DataFrame()
a = 1
# The number of loops must be set here, otherwise the crawl never stops.
# 281 comes from oritotal in the packet: each packet holds 10 comments, so 280 loops fetch 2800
# top-level comments (the replies underneath them are not counted in oritotal).
# commentnum in the packet is the total including replies; since each packet already carries the
# replies of its 10 comments, dividing 2800 by 10 and adding 1 gives the loop count.
while a < 281:
    if a == 1:
        url = 'https://video.coral.qq.com/varticle/6655100451/comment/v2?orinum=10&oriorder=o&pageflag=1&cursor=0&scorecursor=0&orirepnum=2&reporder=o&reppageflag=1&source=132'
    else:
        url = f'https://video.coral.qq.com/varticle/6655100451/comment/v2?orinum=10&oriorder=o&pageflag=1&cursor={cursor}&scorecursor=0&orirepnum=2&reporder=o&reppageflag=1&source=132'
    res = requests.get(url, headers=headers).json()
    cursor = res['data']['last']
    for i in res['data']['oriCommList']:
        ids = i['id']
        times = i['time']
        up = i['up']
        content = i['content'].replace('\n', '')
        text = pd.DataFrame({'ids': [ids], 'times': [times], 'up': [up], 'content': [content]})
        df = pd.concat([df, text])
    a += 1
    time.sleep(random.uniform(2, 3))
    df.to_csv('Revolutionary_comments.csv', encoding='utf-8', index=False)
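Instead of hard-coding 281 iterations, the loop count could be derived from the totals reported by the first packet. A rough sketch, assuming oritotal sits under res['data'] as described in the comments above (the exact field location may differ):

import requests

headers = {'User-Agent': 'Mozilla/5.0'}
first_url = ('https://video.coral.qq.com/varticle/6655100451/comment/v2'
             '?orinum=10&oriorder=o&pageflag=1&cursor=0&scorecursor=0'
             '&orirepnum=2&reporder=o&reppageflag=1&source=132')
res = requests.get(first_url, headers=headers).json()

# Assumption: oritotal (top-level comment count) lives under res['data'].
total = int(res['data'].get('oritotal', 0))
pages = total // 10 + 1   # each packet returns 10 top-level comments
print(f'{total} top-level comments -> {pages} pages to fetch')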

Result

Result (screenshot)

Copyright notice
Author: Dai mubai. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201310218050324.html
