
Python crawler in action: visualizing Douban review data with the pyecharts module

2022-02-01 11:44:29 Dai mubai

「This is day 24 of my participation in the November Gengwen (writing) Challenge. Check out the event details: 2021 Last Gengwen Challenge」.

Preface

Using Python to visualize Douban review data. Without further ado,

let's get started happily~

Development tools

Python version: 3.6.4

Related modules:

requests module

proxy2808

pandas module

pyecharts module;

as well as some Python built-in modules.

Environment setup

Install Python, add it to the environment variables, and use pip to install the required modules.
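For example (a sketch; the proxy2808 client is distributed by the proxy vendor, so install it according to the vendor's instructions):

pip install requests pandas pyecharts beautifulsoup4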

Because Douban's anti-crawling measures are still fairly strict,

[Image: anti-crawling]

this project uses the proxy service provided by 2808PROXY.

Without a proxy, it's almost impossible.

Analyzing the page

Although there are more than 20,000 comments, Douban only releases 500 of them, even when you are logged in.

This time only the data under the all-comments and bad-reviews tabs was collected, roughly 900 entries in total.

[Image: page analysis]

Next, fetch each user's registration date.

900-odd users means 900-odd requests.

Believe me, without a proxy it would be absolute Game Over.


Getting the data

Code to fetch the comments and user information:

import time
import requests
import proxy2808
from bs4 import BeautifulSoup

USERNAME = 'your username'
PASSWORD = 'your password'

headers = {
    'Cookie': 'your cookie value',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
}


def get_comments(page, proxy_url_secured):
    """Fetch one page of comments"""
    # Hot comments
    url = 'https://movie.douban.com/subject/26797690/comments?start=' + str(page) + '&limit=20&sort=new_score&status=P'
    # Good reviews
    # url = 'https://movie.douban.com/subject/26797690/comments?start=' + str(page) + '&limit=20&sort=new_score&status=P&percent_type=h'
    # Neutral reviews
    # url = 'https://movie.douban.com/subject/26797690/comments?start=' + str(page) + '&limit=20&sort=new_score&status=P&percent_type=m'
    # Bad reviews
    # url = 'https://movie.douban.com/subject/26797690/comments?start=' + str(page) + '&limit=20&sort=new_score&status=P&percent_type=l'
    # Route the request through the 2808proxy proxy
    response = requests.get(url=url, headers=headers, proxies={'http': proxy_url_secured, 'https': proxy_url_secured})
    soup = BeautifulSoup(response.text, 'html.parser')
    for div in soup.find_all(class_='comment-item'):
        time.sleep(3)
        # Comment info block
        comment_info = div.find(class_='comment-info')
        # Username
        user_name = comment_info.find('a').get_text()
        print(user_name)
        # User homepage URL
        user_url = comment_info.find('a').attrs['href']
        print(user_url)
        # User registration date -- essential for spotting shill accounts
        registered_time = get_user(user_url, proxy_url_secured)
        print(registered_time)
        # Star rating, taken from the rating span's class name
        score = comment_info.find_all('span')[1].attrs['class'][0][-2:-1]
        print(score)
        # Rating label text
        eva = comment_info.find_all('span')[1].attrs['title']
        print(eva)
        # Number of "useful" votes
        useful_num = div.find(class_='votes').get_text()
        print(useful_num)
        # Comment date (the trailing space in 'comment-time ' matches Douban's HTML)
        date = comment_info.find(class_='comment-time ').attrs['title'].split(' ')[0]
        print(date)
        # Comment time
        comment_time = comment_info.find(class_='comment-time ').attrs['title'].split(' ')[1]
        print(comment_time)
        # Comment text; swap ASCII commas for full-width ones so the CSV stays intact
        comment = div.find(class_='short').get_text().replace('\n', '').strip().replace(',', '，').replace(' ', '')
        print(comment)
        # Append one row to the CSV file
        with open('comments_douban_l.csv', 'a', encoding='utf-8-sig') as f:
            f.write(user_name + ',' + user_url + ',' + registered_time + ',' + score + ',' + date + ',' + comment_time + ',' + useful_num + ',' + comment + '\n')


def get_user(user_url, proxy_url_secured):
    """Fetch a user's registration date"""
    # Route the request through the 2808proxy proxy
    response = requests.get(url=user_url, headers=headers, proxies={'http': proxy_url_secured, 'https': proxy_url_secured})
    soup = BeautifulSoup(response.text, 'html.parser')
    user_message = soup.find(class_='basic-info')
    # Parse the registration date out of the profile text
    try:
        user_registered = user_message.find(class_='pl')
        # The profile text ends with '加入' ("joined"); strip it to keep just the date
        registered_time = user_registered.get_text().split(' ')[1].replace('加入', '')
    except Exception:
        registered_time = 'unknown'
    return registered_time


def main():
    num = 0
    for i in range(0, 500, 20):
        cli = proxy2808.Client(username=USERNAME, password=PASSWORD)
        cli.release_all()
        p = cli.get_proxies(amount=1, expire_seconds=300)[0]
        # Build an authenticated proxy URL: scheme://user:password@host:port
        proxy_url_secured = "%s://%s:%s@%s:%d" % ('http', USERNAME, PASSWORD, p['ip'], p['http_port_secured'])
        print(proxy_url_secured)
        get_comments(i, proxy_url_secured)
        num += 1


if __name__ == '__main__':
    main()

Data fetched under the all-comments tab (500 rows):

[Image: data 1]

The red box marks each user's registration date.

If I could crawl all the comments, I reckon I could catch the shills ("the navy").

My personal take: shills are mostly just an unusually large number of newly registered users...

However, Douban doesn't give us that opportunity.

Data fetched under the bad-reviews tab (482 rows):

[Image: data 2]

Look at the registration dates of the users who left bad reviews,

and compare them with the registration dates of the users who left good reviews. It's kind of interesting:

the registration dates skew relatively late.
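A minimal sketch of how this comparison can be reproduced from the scraped CSVs, assuming the column layout written by the scraper above and that an all-comments file comments_douban_a.csv was produced the same way:

import pandas as pd

cols = ['user_name', 'user_url', 'registered_time', 'score',
        'date', 'comment_time', 'useful_num', 'comment']

# Load both scraped files (the filenames follow the scraper above)
bad = pd.read_csv('comments_douban_l.csv', header=None, names=cols)
all_comments = pd.read_csv('comments_douban_a.csv', header=None, names=cols)

# Registration dates look like 'YYYY-MM-DD'; drop 'unknown' rows and tally by year
bad_years = bad.loc[bad['registered_time'] != 'unknown',
                    'registered_time'].str[:4].value_counts().sort_index()
all_years = all_comments.loc[all_comments['registered_time'] != 'unknown',
                             'registered_time'].str[:4].value_counts().sort_index()

# Side-by-side counts of registration years for the two groups
print(pd.DataFrame({'all': all_years, 'bad': bad_years}).fillna(0))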

Sentiment analysis

Sentiment analysis of the comments uses Baidu's natural language processing service.

[Image: sentiment analysis]

Take the example on the official site:

[Image: example]

For the specifics you can read the documentation on the official site; here is just a brief outline.

Log in to the Baidu AI development platform with your Baidu account and create a new natural language processing application.

After obtaining the 「API Key」 and 「Secret Key」,

call the sentiment analysis API to get the sentiment results.

import urllib.request
import pandas
import json
import time


def get_access_token():
    """Get an Access Token for the Baidu AI platform"""
    # Fill in your own API Key and Secret Key
    host = 'https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&client_id=[API Key]&client_secret=[Secret Key]'
    request = urllib.request.Request(host)
    request.add_header('Content-Type', 'application/json; charset=UTF-8')
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    rdata = json.loads(content)
    return rdata['access_token']


def sentiment_classify(text, acc):
    """Return the sentiment polarity of the text (negative, neutral or positive).
    Args: text: str, the comment text
    """
    raw = {"text": ""}
    raw['text'] = text
    data = json.dumps(raw).encode('utf-8')
    # Sentiment analysis endpoint
    host = "https://aip.baidubce.com/rpc/2.0/nlp/v1/sentiment_classify?charset=UTF-8&access_token=" + acc
    request = urllib.request.Request(url=host, data=data)
    request.add_header('Content-Type', 'application/json')
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    rdata = json.loads(content)
    return rdata


# Get the access_token
access_token = get_access_token()
# Bad-reviews file
df = pandas.read_csv('comments_douban_l.csv', header=None, names=['user_name', 'user_url', 'registered_time', 'score', 'date', 'comment_time', 'useful_num', 'comment'])
# All-comments file
# df = pandas.read_csv('comments_douban_a.csv', header=None, names=['user_name', 'user_url', 'registered_time', 'score', 'date', 'comment_time', 'useful_num', 'comment'])

# Sentiment polarity of each comment -- 0: negative, 1: neutral, 2: positive
sentiments = []
for text in df['comment']:
    time.sleep(1)
    result = sentiment_classify(str(text), access_token)
    value = result['items'][0]['sentiment']
    sentiments.append(value)
    # print(result)
    print(result['items'][0]['sentiment'], text)

# Append the score column and the sentiment column
df['score1'] = df['score']
df['emotional'] = sentiments
# Bad-reviews output
df.to_csv('comments_douban_ll.csv', header=0, index=False, encoding='utf-8-sig')
# All-comments output
# df.to_csv('comments_douban_al.csv', header=0, index=False, encoding='utf-8-sig')

The sentiment analysis results are as follows.


Overall, the comments with 5-star ratings are mostly classified as positive (2).

Of course, there are some negative (0) results,

but that's still acceptable.

[Image: results]

The comments with 1-star ratings mostly lean negative.

Here the positive ones are circled in red; you can judge them for yourself.

After all, a machine's recognition ability is limited; the chance of 100% accuracy is practically zero.

Data visualization

Distribution of comment dates

[Image: all short reviews (500): comment date distribution]

[Image: all bad reviews (500): comment date distribution]

Comments surge as the TV series airs, then gradually level off.

The bad reviews, however, show some fluctuation later on.
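The post doesn't show its plotting code; a minimal pyecharts sketch of this kind of date-distribution bar chart could look like the following (assuming pyecharts v1+ and the CSV layout produced by the scraper above):

from pyecharts.charts import Bar
from pyecharts import options as opts
import pandas as pd

cols = ['user_name', 'user_url', 'registered_time', 'score',
        'date', 'comment_time', 'useful_num', 'comment']
df = pd.read_csv('comments_douban_l.csv', header=None, names=cols)

# Count how many comments were posted on each date
counts = df['date'].value_counts().sort_index()

bar = (
    Bar()
    .add_xaxis(counts.index.tolist())
    .add_yaxis('comments', counts.values.tolist())
    .set_global_opts(title_opts=opts.TitleOpts(title='Comment date distribution'))
)
# pyecharts renders to an interactive HTML file
bar.render('comment_dates.html')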

Comment time-of-day distribution

[Image: all short reviews (500): comment time distribution]

[Image: all bad reviews (500): comment time distribution]

Most comments are posted in the evening, which is normal.

Rating distribution

[Image: all short reviews (500): rating distribution]

[Image: all bad reviews (500): rating distribution]

Among all short reviews, 5-star ratings account for the majority.

Among the bad reviews, 1- and 2-star ratings take the lion's share.
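In the same spirit, a hedged sketch of the rating distribution as a pyecharts pie chart (same assumptions as the bar-chart sketch above):

from pyecharts.charts import Pie
from pyecharts import options as opts
import pandas as pd

cols = ['user_name', 'user_url', 'registered_time', 'score',
        'date', 'comment_time', 'useful_num', 'comment']
df = pd.read_csv('comments_douban_l.csv', header=None, names=cols)

# Tally the 1-5 star ratings and feed (label, count) pairs to the pie chart
counts = df['score'].value_counts().sort_index()
pairs = [(str(star) + ' star', int(n)) for star, n in counts.items()]

pie = (
    Pie()
    .add('ratings', pairs)
    .set_global_opts(title_opts=opts.TitleOpts(title='Rating distribution'))
)
pie.render('ratings.html')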

Comment sentiment analysis

[Image: all short reviews (500): sentiment analysis]

[Image: all bad reviews (500): sentiment analysis]

Here 「2」 stands for positive, 「1」 for neutral, and 「0」 for negative.

Positive results account for the majority of all short reviews.

The all-short-reviews tab is ranked by the number of likes,

so on the whole the show is quite well received.

Distribution of user registration dates

[Image: all short reviews: user registration date distribution]

[Image: all bad reviews: user registration date distribution]

Generating comment word clouds

Good-review word cloud

[Image: good-review word cloud]

Bad-review word cloud

[Image: bad-review word cloud]
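The word-cloud code isn't shown in the post either; below is a minimal sketch using pyecharts' WordCloud. Using jieba for Chinese word segmentation is my own assumption, not something from the original:

import jieba  # assumption: jieba for Chinese word segmentation
from collections import Counter
import pandas as pd
from pyecharts.charts import WordCloud

cols = ['user_name', 'user_url', 'registered_time', 'score',
        'date', 'comment_time', 'useful_num', 'comment']
df = pd.read_csv('comments_douban_l.csv', header=None, names=cols)

# Segment every comment and count word frequencies, skipping single characters
words = Counter()
for text in df['comment'].dropna():
    words.update(w for w in jieba.cut(str(text)) if len(w) > 1)

# Feed the 100 most frequent (word, count) pairs to the word cloud
wc = WordCloud()
wc.add('', list(words.most_common(100)), word_size_range=[12, 60])
wc.render('comment_cloud.html')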

Copyright notice

Author: Dai mubai. Please keep the original link when reprinting, thank you.
https://en.pythonmana.com/2022/02/202202011144277059.html
