Python Crawler in Practice: Using the requests Module to Capture a Video's Bullet Comments
2022-01-31 02:18:06 【Dai mubai】
「 This is day 4 of my participation in the November writing challenge. Check out the event details: 2021 Last Writing Challenge 」
Preface
In this article we use Python to capture a video's bullet comments (danmu). Without further ado,
let's get started!
Development Tools
Python version: 3.6.4
Required modules:
requests;
pandas;
plus some built-in Python modules.
Environment Setup
Install Python, add it to your environment variables, and pip-install the required modules.
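For example, the third-party modules can be installed from the command line (a minimal sketch; the Python interpreter itself is installed separately from python.org):

```shell
# Install the two third-party modules the scripts below rely on
pip install requests pandas
```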
Approach
This article takes the film Revolutionary as an example to show how to crawl a video's bullet comments and regular comments.
Target website
https://v.qq.com/x/cover/mzc00200m72fcup.html
Capturing the Bullet Comments
Analyzing the site
As usual, open the browser's developer tools and capture packets. While the video plays, a new JSON packet arrives every 30 seconds, containing the bullet-comment data we need.
The captured URLs:
https://mfm.video.qq.com/danmu?otype=json&callback=jQuery19109541041335587612_1628947050538&target_id=7220956568%26vid%3Dt0040z3o3la&session_key=0%2C32%2C1628947057&timestamp=15&_=1628947050569
https://mfm.video.qq.com/danmu?otype=json&callback=jQuery19109541041335587612_1628947050538&target_id=7220956568%26vid%3Dt0040z3o3la&session_key=0%2C32%2C1628947057&timestamp=45&_=1628947050572
The parameters that differ between the two URLs are timestamp and _. _ is a millisecond timestamp, while timestamp works like a page number: the first URL uses 15, and subsequent values increase in steps of 30, matching the 30-second packet update interval, up to the video duration of 7245 seconds. After deleting the unnecessary parameters, we get the URL:
https://mfm.video.qq.com/danmu?otype=json&target_id=7220956568%26vid%3Dt0040z3o3la&session_key=0%2C18%2C1628418094&timestamp=15&_=1628418086509
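The pagination arithmetic above can be checked with a quick sketch; the three constants come straight from the analysis (first packet at 15 s, 30-second step, 7245-second duration):

```python
# Each danmu packet covers 30 seconds; the first one is at 15 s and the
# video lasts 7245 s, so the timestamp values form a simple range.
timestamps = list(range(15, 7245, 30))

print(timestamps[:3])   # [15, 45, 75]
print(len(timestamps))  # 241 requests in total
```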
Code implementation
```python
import time

import pandas as pd
import requests

headers = {
    'User-Agent': 'Googlebot'
}

# The first packet is at 15 s, the video is 7245 s long,
# and the links increase in steps of 30 s.
df = pd.DataFrame()
for page in range(15, 7245, 30):
    url = ('https://mfm.video.qq.com/danmu?otype=json'
           '&target_id=7220956568%26vid%3Dt0040z3o3la'
           '&session_key=0%2C18%2C1628418094'
           '&timestamp={}&_=1628418086509'.format(page))
    html = requests.get(url, headers=headers).json()
    time.sleep(1)
    for comment in html['comments']:
        content = comment['content']
        print(content)
        text = pd.DataFrame({'danmu': [content]})
        df = pd.concat([df, text])
df.to_csv('Revolutionary_danmu.csv', encoding='utf-8', index=False)
```
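One caveat: if the callback parameter is left in the URL, the endpoint returns JSONP (JSON wrapped in a function call) and response.json() raises an error, which is why it was deleted above. A sketch of unwrapping such a response anyway; the wrapper name and sample payload here are made up for illustration:

```python
import json
import re

def strip_jsonp(text):
    """Remove a JSONP wrapper like `callbackName({...})` and parse the JSON inside."""
    match = re.match(r'^[^(]*\((.*)\)\s*;?\s*$', text, re.S)
    if match:
        return json.loads(match.group(1))
    return json.loads(text)  # already plain JSON

sample = 'jQuery123({"comments": [{"content": "hello"}]})'
data = strip_jsonp(sample)
print(data["comments"][0]["content"])  # hello
```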
Results
Capturing the Comments
Analyzing the site
The comment data sits at the bottom of the page and is also dynamically loaded, so again open the developer tools and capture packets.
Click "view more comments": the resulting packet contains the comment data we need, and from it we get the real URLs:
https://video.coral.qq.com/varticle/6655100451/comment/v2?callback=_varticle6655100451commentv2&orinum=10&oriorder=o&pageflag=1&cursor=0&scorecursor=0&orirepnum=2&reporder=o&reppageflag=1&source=132&_=1628948867522
https://video.coral.qq.com/varticle/6655100451/comment/v2?callback=_varticle6655100451commentv2&orinum=10&oriorder=o&pageflag=1&cursor=6786869637356389636&scorecursor=0&orirepnum=2&reporder=o&reppageflag=1&source=132&_=1628948867523
Delete the callback and _ parameters from the URL. The important parameter is cursor: in the first URL cursor equals 0, and only the second URL carries a real value, so we need to find where cursor comes from. After some observation, it turns out the cursor parameter is simply the last field from the previous response:
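This cursor-based pagination pattern can be sketched with a stand-in fetch function; fake_fetch and its data are illustrative, not the real API, but the field names (data, last, oriCommList) mirror the packets above:

```python
# Illustrative cursor pagination: each response carries a `last` value
# that becomes the `cursor` of the next request, just like the comment
# API. `fake_fetch` stands in for requests.get(...).json().
PAGES = {
    "0":  {"data": {"last": "a1", "oriCommList": [{"content": "first"}]}},
    "a1": {"data": {"last": "a2", "oriCommList": [{"content": "second"}]}},
    "a2": {"data": {"last": "",   "oriCommList": []}},
}

def fake_fetch(cursor):
    return PAGES[cursor]

def crawl_all(fetch):
    cursor, comments = "0", []
    while True:
        res = fetch(cursor)
        batch = res["data"]["oriCommList"]
        if not batch:                     # empty page: no more comments
            break
        comments.extend(c["content"] for c in batch)
        cursor = res["data"]["last"]      # becomes next request's cursor
    return comments

print(crawl_all(fake_fetch))  # ['first', 'second']
```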
Code implementation
```python
import random
import time

import pandas as pd
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

df = pd.DataFrame()
a = 1
# A loop limit must be set here, otherwise the crawl never ends.
# 281 comes from `oritotal` in the packet: each packet holds 10 comments,
# so 280 loops fetch 2800 comments, not counting the replies below them.
# `commentnum` in the packet is the total including replies; since each
# packet holds 10 top-level comments plus their replies, dividing 2800
# by 10 and adding 1 gives the loop count.
while a < 281:
    if a == 1:
        url = ('https://video.coral.qq.com/varticle/6655100451/comment/v2'
               '?orinum=10&oriorder=o&pageflag=1&cursor=0&scorecursor=0'
               '&orirepnum=2&reporder=o&reppageflag=1&source=132')
    else:
        url = ('https://video.coral.qq.com/varticle/6655100451/comment/v2'
               f'?orinum=10&oriorder=o&pageflag=1&cursor={cursor}'
               '&scorecursor=0&orirepnum=2&reporder=o&reppageflag=1&source=132')
    res = requests.get(url, headers=headers).json()
    cursor = res['data']['last']
    for comment in res['data']['oriCommList']:
        ids = comment['id']
        times = comment['time']
        up = comment['up']
        content = comment['content'].replace('\n', '')
        text = pd.DataFrame({'ids': [ids], 'times': [times],
                             'up': [up], 'content': [content]})
        df = pd.concat([df, text])
    a += 1
    time.sleep(random.uniform(2, 3))
df.to_csv('Revolutionary_comments.csv', encoding='utf-8', index=False)
```
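Because successive cursor pages can overlap, the same comment may be collected twice; it can be worth deduplicating by comment id before saving. A small pandas sketch (the column names follow the script above; the toy data is made up):

```python
import pandas as pd

# Toy frame mimicking the columns collected above; the repeated id 1
# simulates the same comment appearing in two packets.
df = pd.DataFrame({
    "ids":     [1, 2, 1],
    "times":   [100, 101, 100],
    "up":      [5, 3, 5],
    "content": ["nice", "great", "nice"],
})

deduped = df.drop_duplicates(subset="ids", keep="first").reset_index(drop=True)
print(len(deduped))  # 2
```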
Results
Copyright notice
Author: Dai mubai. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201310218050324.html