
Scraping Detective Conan danmaku with Python + Gephi to map out the main plot

2022-02-02 04:30:09 Baekhyun

Introduction: how did you spend your childhood?

Contra on the NES, World of Warcraft, Fantasy Westward Journey, CrossFire, League of Legends, Plants vs. Zombies, Super Mario, Minesweeper and other games; or marbles, card-flipping, and hopscotch out on the street; or a whole gang of kids crowded around one TV; or the endless summer reruns of My Fair Princess. But I'm sure plenty of you couldn't live without comics! As a kid, nothing beat camping out in the bookstore reading manga all afternoon.

Conan is this blogger's favorite cartoon character.

Conan's classic catchphrases:

1. "I'm Edogawa Conan, a detective"

Every time Conan exposes the murderer's true face or helps someone crack a case, the killer or a bystander always asks, "Who on earth are you?" Conan then strikes a cool pose, the corner of his mouth lifting slightly, and answers, "Edogawa Conan, a detective." A pity that our Mouri Kogoro, after taking so many tranquilizer darts, has never once asked, "Who on earth are you?"

2. "I see, so that's how it was!"

With each new case, Conan usually cannot solve it right away and hits some bottleneck. Then a casual gesture or remark from someone else makes his eyes light up, and they flash with "I see, so that's how it was!" At that moment, many viewers are surely still completely lost, wondering how Conan figured it out yet again.

3. "So the only person who could be the culprit is..."

In most cases, even when Conan has worked out the trick behind the crime, he lacks conclusive evidence and cannot name the real culprit. After his "careful" investigation, he always turns up the key clue and evidence, deduces the culprit's true identity from it, and then declares, "So the only person who could be the culprit is..."

4. "There is only one truth"

"There is only one truth" may well be the most classic line in Detective Conan; it appears in the opening of basically every episode, and Conan delivers it solemnly every time. In real life many fans imitate Conan and drop it into daily conversation: "Shinjitsu wa itsumo hitotsu!"

5. "Are-re?"

If we're talking about Conan's most memorable catchphrase in Detective Conan, besides "There is only one truth" above, it has to be "Are-re?" In the story, whenever Conan finds a clue at the crime scene, he deliberately says "Are-re?" in a childish tone to tip off the others. And in the recent Scarlet School Trip arc, even after Conan regained his identity as Shinichi Kudo, he still blurted out "Are-re?" while solving the case. You can see how deeply the habit has sunk in.

Okay, enough reminiscing. Down to business.

1. Crawling the danmaku

Capturing packets with Chrome's developer tools shows that Bilibili stores each episode's danmaku as an XML document, as shown below (3,000 danmaku in total for this file).

Its URL is: comment.bilibili.com/183362119.x…

The number 183362119 is the video's unique ID; swapping in a different number fetches the corresponding danmaku file. Open the video for episode 1 and view the page source, as shown below.

It is easy to see that cid is the ID corresponding to each video. Next we extract it with a regular expression.
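Each danmaku file is plain XML: every `<d>` element holds one comment, and its `p` attribute packs comma-separated metadata (conventionally the first field is the playback time in seconds; the exact field meanings are an assumption here, not stated in the original). A minimal stdlib sketch over a made-up two-line sample:

```python
import xml.etree.ElementTree as ET

# Toy sample mimicking the layout of a Bilibili danmaku XML file
sample = '''<i>
  <d p="23.8,1,25,16777215,1422201084,0,abc123,99">There is only one truth!</d>
  <d p="60.2,1,25,16777215,1422201090,0,def456,100">Are-re?</d>
</i>'''

root = ET.fromstring(sample)
# The comment text is the element body
comments = [d.text for d in root.findall('d')]
# The first comma-separated field of p is the send time within the video
times = [float(d.get('p').split(',')[0]) for d in root.findall('d')]
print(comments)
print(times)  # [23.8, 60.2]
```

The full crawl below only needs the text bodies, which is why it simply collects `<d>` elements with BeautifulSoup.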

The complete crawling code is as follows:

import requests
import re
from bs4 import BeautifulSoup as BS
import os

path = 'C:/Users/dell/Desktop/Conan'
if not os.path.exists(path):
    os.makedirs(path)
os.chdir(path)

def gethtml(url, header):
    r = requests.get(url, headers=header)
    r.encoding = 'utf-8'
    return r.text

def crawl_comments(r_text):
    # Pull every episode's cid out of the page source already fetched in __main__
    pat = r'"cid":(\d+)'
    chapter_total = re.findall(pat, r_text)[1:-2]
    count = 1
    for chapter in chapter_total:
        url_base = 'http://comment.bilibili.com/{}.xml'.format(chapter)
        txt2 = gethtml(url_base, header)
        soup = BS(txt2, 'lxml')
        all_d = soup.find_all('d')  # each <d> element is one danmaku
        with open('{}.txt'.format(count), 'w', encoding='utf-8') as f:
            for d in all_d:
                f.write(d.get_text() + '\n')
        print('Episode {} danmaku written'.format(count))
        count += 1

if __name__ == '__main__':
    url = 'https://www.bilibili.com/bangumi/play/ep321808'
    header = {'user-agent': 'Opera/12.80 (Windows NT 5.1; U; en) Presto/2.10.289 Version/12.02'}
    r_text = gethtml(url, header)
    crawl_comments(r_text)

All the resulting danmaku files end up in the "Conan" folder on the desktop.
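Pulling nearly a thousand XML files in one run can hit transient network errors, and the crawl above has no error handling. A hedged sketch of a retry wrapper (the `fetch_with_retry` helper and the stand-in `flaky` fetcher are illustrative, not part of the original code):

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url); on failure wait and retry, re-raising after the last attempt."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(delay)

# Usage with a stand-in fetcher that fails twice, then succeeds
calls = {'n': 0}
def flaky(url):
    calls['n'] += 1
    if calls['n'] < 3:
        raise IOError('timeout')
    return '<i>...</i>'

xml_text = fetch_with_retry(flaky, 'http://comment.bilibili.com/183362119.xml', delay=0.01)
print(xml_text)  # '<i>...</i>' after two failed attempts
```

In the real crawl you would pass a small lambda around `gethtml`, and the delay also keeps the request rate polite.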

Note: we crawled 980 danmaku files in total. [After episode 941 the numbering jumps to 994 (the episodes in between are for premium members only). Although the series has updated to episode 1032, there is no content for 1032 yet, as shown in the figure below.]

2. Danmaku visualization

I. Total mention counts of the main characters

(1) Counting totals. Note: role.txt is the file of main-character names. (Bear in mind that danmaku rarely use a character's full name; most use nicknames. Counting full names only could differ substantially from reality.)

import jieba
import os
import pandas as pd

os.chdir('C:/Users/dell/Desktop')
jieba.load_userdict('role.txt')  # teach jieba the character nicknames
role = [i.replace('\n', '') for i in open('role.txt', 'r', encoding='utf-8').readlines()]
txt_all = os.listdir('./Conan/')
txt_all.sort(key=lambda x: int(x.split('.')[0]))  # sort files by episode number

def role_count():
    df = pd.DataFrame()
    count = 1
    for chapter in txt_all:
        names = {}
        with open('./Conan/{}'.format(chapter), 'r', encoding='utf-8') as f:
            for line in f.readlines():
                for word in jieba.cut(line):
                    if word in role:
                        names[word] = names.get(word, 0) + 1
        df_new = pd.DataFrame.from_dict(names, orient='index', columns=[str(count)])
        df = pd.concat([df, df_new], axis=1)
        print('Episode {} character counts finished'.format(count))
        count += 1
    df = df.T
    df.index.name = 'episode'  # name the index so later set_index('episode') works
    df.to_csv('role_count.csv', encoding='gb18030')

role_count()

(2) Visualization

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = ['kaiti']  # render CJK labels
plt.style.use('ggplot')
df = pd.read_csv('role_count.csv', encoding='gbk')
df = df.fillna(0).set_index('episode')
plt.figure(figsize=(10, 5))
role_sum = df.sum().to_frame().sort_values(by=0, ascending=False)
g = sns.barplot(x=role_sum.index, y=role_sum[0], palette='Set3', alpha=0.8)
index = np.arange(len(role_sum))
for name, count in zip(index, role_sum[0]):
    g.text(name, count + 50, int(count), ha='center', va='bottom')
plt.title('Bilibili Detective Conan danmaku: total mentions of the main characters')
plt.ylabel('Mentions')
plt.show()

Although Conan has been a primary schooler for ten thousand years, there are still moments when he changes back into Shinichi, and the plot is more than just "find the culprit, catch the culprit". Next, let's pick out some standout episodes from a data perspective.

II. Episodes where Conan changes back into Shinichi

Considering that in some episodes Shinichi appears only in flashbacks, the mention threshold is set at 250 to reduce that bias, and the following distribution is drawn.

The mention counts and the corresponding episode titles are shown in the table below.

Interested readers can verify for themselves: apart from episode 235, these are all episodes in which Conan changes back into Shinichi.

The relevant code is as follows :

# continues from the pandas/matplotlib setup above
df = pd.read_csv('role_count.csv', encoding='gbk')
df = df.fillna(0).set_index('episode')
xinyi = df[df['Shinichi'] >= 250]['Shinichi'].to_frame()
print(xinyi)  # episodes where Shinichi is heavily discussed
plt.figure(figsize=(10, 5))
plt.plot(df.index, df['Shinichi'], label='Shinichi', color='blue', alpha=0.6)
plt.annotate('Episode: 50, mentions: 309',
             xy=(50, 309),
             xytext=(40, 330),
             arrowprops=dict(color='red', headwidth=8, headlength=8)
            )
plt.annotate('Episode: 206, mentions: 263',
             xy=(206, 263),
             xytext=(195, 280),
             arrowprops=dict(color='red', headwidth=8, headlength=8)
            )
plt.annotate('Episode: 571, mentions: 290',
             xy=(571, 290),
             xytext=(585, 310),
             arrowprops=dict(color='red', headwidth=8, headlength=8)
            )
plt.hlines(xmin=df.index.min(), xmax=df.index.max(), y=250, linestyles='--', colors='red')
plt.legend(loc='best', frameon=False)
plt.xlabel('Episode')
plt.ylabel('Mentions')
plt.title('Distribution of Shinichi Kudo mentions')
plt.show()

Taking the most-discussed episode, 572, we draw a word cloud (excluding the high-frequency word "Shinichi" so it does not drown out other information), as shown below:

As the chart shows, the most frequent words are plastic surgery, Hattori, voice, love, and so on. (It seems the culprit committed the crime with a surgically altered face, and there is some Shinichi-Ran romance in the mix. Worth watching.)
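The post does not show the word-cloud code itself. Whatever library draws it (e.g. the `wordcloud` package's `generate_from_frequencies`), the input is just a frequency table with the dominating word removed; a stdlib sketch over a made-up token list (the tokens are hypothetical, standing in for jieba's segmentation of episode 572's danmaku):

```python
from collections import Counter

# Hypothetical tokens after segmentation; 'Shinichi' dominates the episode
tokens = ['Shinichi', 'surgery', 'Shinichi', 'Hattori', 'voice',
          'Shinichi', 'love', 'voice']

# Drop the dominating word so the other terms stay visible in the cloud
freq = Counter(t for t in tokens if t != 'Shinichi')
print(freq.most_common(1))  # [('voice', 2)]
```

The resulting `freq` mapping is exactly what a frequency-based word-cloud renderer consumes.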

III. Main-plot episode analysis

The main plot revolves around the Black Organization members (Gin, Vodka, Vermouth). Their mention distribution is drawn as follows:

plt.figure(figsize=(10, 5))
names = ['Gin', 'Vodka', 'Vermouth']
colors = ['#090707', '#004e66', '#EC7357']
alphas = [0.8, 0.7, 0.6]
for name, color, alpha in zip(names, colors, alphas):
    plt.plot(df.index, df[name], label=name, color=color, alpha=alpha)
plt.legend(loc='best', frameon=False)
plt.annotate('Episode: {}, mentions: {}'.
             format(df['Vermouth'].idxmax(), int(df['Vermouth'].max())),
             xy=(df['Vermouth'].idxmax(), df['Vermouth'].max()),
             xytext=(df['Vermouth'].idxmax() + 30, df['Vermouth'].max()),
             arrowprops=dict(color='red', headwidth=8, headlength=8)
            )
plt.xlabel('Episode')
plt.ylabel('Mentions')
plt.title('Distribution of mentions of the Black Organization members')
plt.hlines(xmin=df.index.min(), xmax=df.index.max(), y=200, linestyles='--', colors='red')
plt.ylim(0, 400)

# Output the main-plot episodes
mainline = set(list(df[df['Vermouth'] >= 200].index) + list(df[df['Gin'] >= 200].index))  # Vodka is negligible
print(mainline)

The analysis above shows that the organization members' activity basically tracks together. Among the three, Vermouth is the most discussed, especially in episode 375 (a head-on clash with the Black Organization), where mentions reach 379. In addition, the episodes with more than 200 mentions are as follows:

Taking episode 375, the most discussed, we draw a word cloud (excluding the high-frequency word "Vermouth" so it does not drown out other information), as follows.

The picture shows that words like angel, Gin, godmother, heartbreaking, and sniper appear frequently. And judging from lower-frequency words such as "failure", the organization's operation appears to have ended in failure.

3. Character network analysis

I. Merging the txt files

To reflect viewers' descriptions of the character as fully as possible while keeping processing costs down (each episode carries up to 3,000 danmaku), only the 20 episodes that mention the target character the most are merged for analysis.

import os
import pandas as pd

df = pd.read_csv('role_count.csv', encoding='gbk')
df = df.fillna(0).set_index('episode')
# the 20 episodes that mention Haibara Ai the most
huiyuan_ep = list(df.sort_values(by='Haibara', ascending=False).index[:20])
mergefiledir = 'C:/Users/dell/Desktop/Conan'
file = open('txt_all.txt', 'w', encoding='UTF-8')
count = 0
for filename in huiyuan_ep:
    filepath = mergefiledir + '/' + str(filename) + '.txt'
    for line in open(filepath, encoding='UTF-8'):
        file.writelines(line)
    file.write('\n')
    count += 1
    print('Episode {} merged'.format(count))
file.close()

II. Persona visualization

We borrow the idea of a co-occurrence matrix: if two specified words appear in the same danmaku line, their pair count increases by 1. The Source node is set to Haibara Ai. The code is as follows (note: stopwords.txt is the stop-word file, role.txt the character-nickname file):

import codecs
import csv
import jieba

linesName = []
names = {}
relationship = {}
jieba.load_userdict('role.txt')
stopwords = [line.strip() for line in open('stopwords.txt', 'r', encoding='utf-8')]
name_list = [i.replace('\n', '') for i in open('role.txt', 'r', encoding='utf-8').readlines()]

def base(path):
    # Tokenize each danmaku line, dropping stop words, and count word totals
    with codecs.open(path, 'r', 'UTF-8') as f:
        for line in f.readlines():
            line = line.replace('\r\n', '')
            poss = jieba.cut(line)
            linesName.append([])
            for word in poss:
                if word in stopwords:
                    continue
                linesName[-1].append(word)
                if names.get(word) is None:
                    names[word] = 0
                    relationship[word] = {}
                names[word] += 1
    return linesName, relationship

def relationships(linesName, relationship, name_list):
    # Every co-occurrence of a character name with another word in the
    # same danmaku line adds 1 to that edge's weight
    for line in linesName:
        for name1 in line:
            if name1 in name_list:
                for name2 in line:
                    if name1 == name2:
                        continue
                    if relationship[name1].get(name2) is None:
                        relationship[name1][name2] = 1
                    else:
                        relationship[name1][name2] += 1
    return relationship

def write_csv(relationship):
    # Keep only edges seen more than 10 times to reduce noise
    with open('edges.csv', 'w', encoding='gb18030') as csv_file:
        writer = csv.writer(csv_file, delimiter=',', lineterminator='\n')
        writer.writerow(['Source', 'Target', 'Weight'])
        for name, edges in relationship.items():
            for k, v in edges.items():
                if v > 10:
                    writer.writerow([name, k, v])

if __name__ == '__main__':
    linesName, relationship = base('txt_all.txt')
    data = relationships(linesName, relationship, name_list)
    write_csv(data)

Import the generated file into Gephi to obtain the following character network.

The thicker the edge, the stronger the association. It is easy to see that viewers' impressions of Ai-chan are mainly "pretty", "cute", and "heartbreaking".
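Before (or after) loading `edges.csv` into Gephi, it is easy to sanity-check the edge list in Python, e.g. by summing weights per source node. A sketch over an in-memory toy edge list in the same Source,Target,Weight layout the script writes (the names and weights here are made up):

```python
import csv
import io
from collections import defaultdict

# Toy edges.csv content in the Source,Target,Weight layout
toy_csv = 'Source,Target,Weight\nHaibara,cute,40\nHaibara,pretty,25\nConan,truth,30\n'

totals = defaultdict(int)
for row in csv.DictReader(io.StringIO(toy_csv)):
    totals[row['Source']] += int(row['Weight'])

print(dict(totals))  # {'Haibara': 65, 'Conan': 30}
```

The per-source totals roughly correspond to node size in Gephi's weighted-degree layout, so a quick look at them catches encoding or parsing mistakes before styling the graph.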

END:

Well, that's the whole of this Python hands-on project. If you enjoyed it, remember to give this blogger the triple combo (like, coin, favorite)!

Your support is the biggest motivation for the daily updates to come!!

For more Python source code, feel free to DM me~~

Copyright notice
Author: Baekhyun. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/02/202202020430048193.html
