
Python crawler in practice: using the requests module to scrape the danmaku (bullet comments) and comments of a Bilibili video

2022-01-31 03:58:09 Dai mubai

This is day 5 of my participation in the November Update Challenge. For event details, see: 2021 Last Update Challenge.

Preface

This article uses Python to scrape the danmaku and comments of a Bilibili video. Without further ado,

let's get started!

Development tools

Python version: 3.6.4

Related modules:

requests module;

re module;

pandas module;

as well as some of Python's built-in modules.

Environment setup

Install Python and add it to your environment variables, then use pip to install the required modules.

Approach

This article uses the video "This is the biggest Olympic champion of the Chinese team I have ever seen" as an example to explain how to scrape the danmaku and comments of a Bilibili video.

Target URL

https://www.bilibili.com/video/BV1wq4y1Q7dp

Scraping the danmaku

Page analysis

Unlike Tencent Video, Bilibili does not trigger the danmaku request simply by playing the video. You need to expand the danmaku list on the right side of the page and then click "View historical danmaku" to get the link that lists the dates, from the start date to the end date, on which the video has danmaku:

 link

The link ends with the oid and the starting month, which together form the danmaku date-index URL:

https://api.bilibili.com/x/v2/dm/history/index?type=1&oid=384801460&month=2021-08

Next, click any valid date to load the danmaku data packet for that date. Its contents are not readable at first glance, but it can be identified as the danmaku packet because it is loaded only after the date is clicked, and its link is related to the previous one:

 link 2

The resulting URL:

https://api.bilibili.com/x/v2/dm/web/history/seg.so?type=1&oid=384801460&date=2021-08-08

The oid in the URL is the id of the video's danmaku, and the date parameter is simply the date. To get all of the video's danmaku, you only need to change the date parameter. The dates can be taken from the date-index URL above, or you can construct them yourself. The date index is returned in JSON format.
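If you prefer to construct the dates yourself rather than read them from the date index, a minimal sketch might look like this. The helper names `iter_dates` and `build_seg_url` are hypothetical; only the URL pattern and the oid value come from the example above. Dates built this way are not guaranteed to have danmaku, so the date index remains the safer source of valid dates.

```python
from datetime import date, timedelta

# URL pattern taken from the example above
SEG_URL = "https://api.bilibili.com/x/v2/dm/web/history/seg.so?type=1&oid={oid}&date={day}"


def iter_dates(start, end):
    """Yield every date from start to end, inclusive, as YYYY-MM-DD strings."""
    current = start
    while current <= end:
        yield current.isoformat()
        current += timedelta(days=1)


def build_seg_url(oid, day):
    """Fill the danmaku URL pattern with a video oid and a date string."""
    return SEG_URL.format(oid=oid, day=day)


if __name__ == "__main__":
    for day in iter_dates(date(2021, 8, 8), date(2021, 8, 10)):
        print(build_seg_url("384801460", day))
```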

Code implementation

import re

import pandas as pd
import requests


def data_resposen(url):
    # Send the request with your own login cookie and a browser user-agent
    headers = {
        "cookie": "your cookie",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36",
    }
    resposen = requests.get(url, headers=headers)
    return resposen


def main(oid, month):
    rows = []
    url = f'https://api.bilibili.com/x/v2/dm/history/index?type=1&oid={oid}&month={month}'
    list_data = data_resposen(url).json()['data']  # all dates in this month
    print(list_data)
    for date in list_data:
        urls = f'https://api.bilibili.com/x/v2/dm/web/history/seg.so?type=1&oid={oid}&date={date}'
        # Keep only the runs of Chinese characters found in the raw response
        text = re.findall(".*?([\u4E00-\u9FA5]+).*?", data_resposen(urls).text)
        for e in text:
            print(e)
            rows.append(e)
    # Build the table once instead of concatenating one row at a time
    df = pd.DataFrame({'danmaku': rows})
    df.to_csv('danmaku.csv', encoding='utf-8', index=False, mode='a+')


if __name__ == '__main__':
    oid = '384801460'  # id value from the video's danmaku link
    month = '2021-08'  # starting month
    main(oid, month)

Results

B3.png

Scraping the comments

Page analysis

The comments of a Bilibili video are at the bottom of the page. With the browser's developer tools open, just scroll down to load the data packets:

![ Data packets (p3-juejin.byteimg.com/tos-cn-i-k3…?)

The real URLs:

https://api.bilibili.com/x/v2/reply/main?callback=jQuery1720034332372316460136_1629011550479&jsonp=jsonp&next=0&type=1&oid=589656273&mode=3&plat=1&_=1629012090500
https://api.bilibili.com/x/v2/reply/main?callback=jQuery1720034332372316460136_1629011550483&jsonp=jsonp&next=2&type=1&oid=589656273&mode=3&plat=1&_=1629012513080
https://api.bilibili.com/x/v2/reply/main?callback=jQuery1720034332372316460136_1629011550484&jsonp=jsonp&next=3&type=1&oid=589656273&mode=3&plat=1&_=1629012803039

The URLs differ in the next parameter and in the _ and callback parameters. _ is a timestamp and callback is an interference parameter, so both can be deleted. next is 0 in the first URL, 2 in the second, and 3 in the third, so next is fixed at 0 for the first request and increases from 2 onward. The response is in JSON format.
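The clean-up described above can be sketched with the standard library. The helper names `clean_reply_url` and `next_values` are hypothetical; the parameter names and the 0, 2, 3 sequence are taken from the captured URLs.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def clean_reply_url(url, page):
    """Drop callback and _ from a captured URL and set the next parameter."""
    parts = urlsplit(url)
    params = [(key, value) for key, value in parse_qsl(parts.query)
              if key not in ("callback", "_", "next")]
    params.append(("next", str(page)))
    return urlunsplit(parts._replace(query=urlencode(params)))


def next_values(count):
    """Return the first `count` values of next: 0, then 2, 3, 4, ..."""
    return [0 if n == 1 else n for n in range(1, count + 1)]


if __name__ == "__main__":
    captured = ("https://api.bilibili.com/x/v2/reply/main"
                "?callback=jQuery1720034332372316460136_1629011550479"
                "&jsonp=jsonp&next=0&type=1&oid=589656273&mode=3&plat=1&_=1629012090500")
    for page in next_values(3):
        print(clean_reply_url(captured, page))
```

The order of the query parameters changes, which should not matter for a query string.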

Code implementation

import pandas as pd
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
}
rows = []
try:
    a = 1
    while True:
        # The first request uses next=0; later requests use next=2, 3, ...
        next_page = 0 if a == 1 else a
        # URL with the unnecessary callback and _ parameters removed
        url = f'https://api.bilibili.com/x/v2/reply/main?jsonp=jsonp&next={next_page}&type=1&oid=589656273&mode=3&plat=1'
        print(url)
        html = requests.get(url, headers=headers).json()
        replies = html['data']['replies']
        if not replies:  # no more comments, stop paging
            break
        for i in replies:
            rows.append({
                'username': i['member']['uname'],
                'gender': i['member']['sex'],
                'user id': i['mid'],
                'level': i['member']['level_info']['current_level'],
                'comment': i['content']['message'].replace('\n', ''),
                'likes': i['like'],
                'comment time': i['ctime'],
            })
        a += 1
except Exception as e:
    print(e)
df = pd.DataFrame(rows)
df.to_csv('Olympic Games.csv', encoding='utf-8')
print(df.shape)
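The ctime field is saved as a raw number. Assuming it is a Unix timestamp in seconds (an assumption, since the response format is not documented here), it could be turned into a readable time with a small helper such as the hypothetical `format_ctime` below.

```python
from datetime import datetime, timezone


def format_ctime(ctime):
    """Convert a Unix timestamp in seconds to a UTC 'YYYY-MM-DD HH:MM:SS' string."""
    # Assumption: ctime counts seconds since 1970-01-01 00:00:00 UTC
    moment = datetime.fromtimestamp(ctime, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    print(format_ctime(1629012090))
```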

The results are shown below. The data obtained does not include second-level replies; if you need them, you can scrape them yourself, and the steps are similar:

B5.png

Copyright notice
Author: Dai mubai. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201310358064672.html
