
Are people who like drinking tea successful? I used Python to make a tea guide! Do you like it?

2022-01-30 06:35:01 Programming small code farmer

Preface

       Throw in a handful of green leaves, sigh that the years are still like a dream; pour me a pool of clear spring water, and ponder floating life in all its forms.

Why did I suddenly think of writing a small Python case study about judging tea? Because today my boss called me into the office and we had two cups of tea together. I'm not the kind of person who likes drinking tea, and I have never studied it! So today's tutorial teaches you some Python and fills this gap in knowledge at the same time. At the very least, you should know the most common tea-drinking etiquette, so you don't make a fool of yourself in the future!


Start

Read this article and the source code, and you can learn the following along with me: crawling data with XPath expressions, multi-process crawling, basic pandas operations, pyecharts visualization, stylecloud word clouds, text cosine similarity, KMeans, and keyword extraction algorithms (TextRank, TF-IDF, and the LDA topic model).

The source code is available at the end of the article.

I found a tea-related website:

chaping.chayu.com/?bid=1

 picture


Data acquisition

From the home page, go to the tea reviews, where you can see the basic information for every tea. The results span multiple pages. Collect all the basic information, including title, score, brand, place of origin, tea type, detail link, and id:

 picture

 picture

Then follow each collected link and crawl one level deeper to get each tea's recommendation index, general comment, and overall ranking:

 picture

Also crawl the corresponding user comments, following the pagination when a tea has more than one page of them. The fields include reviewer, reviewer rating, score, comment text, and comment time:

 picture

 picture

Finally, save the results as two CSV files, tea.csv and comment.csv:

 picture

 picture

That is the whole crawler process. It uses XPath for extraction and multiple processes for crawling. The logic is not complicated; see the source code for the detailed implementation.
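To make the XPath step concrete, here is a minimal sketch. It is not the crawler from the source code: the HTML snippet, the class names, and the `parse_listing` helper are all made up for illustration, and it uses the standard library's `xml.etree.ElementTree`, which only supports a limited XPath subset and needs well-formed markup. A real crawler would more likely download the pages and parse them with a full XPath engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for one listing page (not the real site's markup).
SAMPLE_PAGE = """
<ul>
  <li class="tea"><a href="/tea/101">Brand A Pu'er Cake</a><span class="score">8.6</span></li>
  <li class="tea"><a href="/tea/102">Brand B Longjing</a><span class="score">7.9</span></li>
</ul>
"""


def parse_listing(page):
    """Extract id, title, score, and link for every tea on one listing page."""
    root = ET.fromstring(page.strip())
    rows = []
    # ".//li[@class='tea']" selects every <li class="tea"> below the root.
    for item in root.findall(".//li[@class='tea']"):
        link = item.find("a")
        score = item.find("span[@class='score']")
        href = link.get("href")
        rows.append({
            "id": href.rsplit("/", 1)[-1],  # last path segment of the detail link
            "title": link.text,
            "score": float(score.text),
            "link": href,
        })
    return rows


teas = parse_listing(SAMPLE_PAGE)
```

Each returned dictionary is one row of what would end up in tea.csv. Since every listing page is parsed independently, the same function can be handed to several worker processes, one page per task.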

Data analysis

There are about 30,000 records in total. Once you have the data, you can start exploring.

Look at the titles first. Each title is made up of a brand and a name, so I process it to keep only the name part and draw a word cloud.

Black tea, Baidudan, Tie Guanyin, green tea, Maojian, and many other familiar tea names show up:

 picture

Tea scores range from 0 to 10. I bin the scores into 2-point intervals and draw a histogram.

Judging from the results, scores are very high overall, and only a few teas score below 4. I pulled out those records and took a look: the general comments on these low-scoring teas are not particularly kind:

 picture
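The binning step can be sketched as follows. pandas has `cut` for this kind of job; the version below is a dependency-free stand-in, and the `bin_scores` helper and the sample scores are invented for the example. It assumes the top score of 10 should fall into the last bin, 8-10.

```python
from collections import Counter


def bin_scores(scores, width=2, top=10):
    """Count scores per `width`-point interval; a score equal to `top` joins the last bin."""
    counts = Counter()
    for score in scores:
        # Lower edge of the interval, clamped so that a score of 10 lands in "8-10".
        lower = min(int(score // width) * width, top - width)
        counts[f"{lower}-{lower + width}"] += 1
    return counts


# Made-up scores, only to show the shape of the result.
histogram = bin_scores([9.1, 8.0, 3.5, 10, 7.9])
```

The resulting counter maps interval labels to counts, which is exactly what a histogram chart needs as its x and y values.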

These days almost every kind of tea is sold under a dedicated brand, so I count the brands and draw a word cloud.

Douji Tea, Chinese Tea, Dayi (translated literally as "Great Benefit"), and Tianfu Tea stand out. Even if you don't know tea, you have probably heard of these brands or seen them on the street:

 picture

Every tea has its own place of origin, so I draw a heat map of the origins.

Yunnan turns out to be the most common origin, with counts reaching into the thousands. I looked it up: Yunnan is one of the most important tea-producing regions and is regarded as the oldest home of tea.

Fujian comes next. It has a tea culture going back more than 1,000 years and is one of the most important tea-producing areas in China:

 picture

The site divides tea into ten categories: Pu'er, green tea, black tea, oolong, dark tea, white tea, scented tea, yellow tea, bagged tea, and instant tea. Each category has many subcategories. I count the teas in each category and draw a bar chart.

Pu'er has the most entries, followed by green tea and black tea. Seeing this, I realized that I seldom drink Pu'er:

 picture

Hot-search rank reflects how popular a tea is. I select the top 10 teas by hot search and pull out their details.

A classic Pu'er ranks first, and Pu'er is also the category with the most varieties. I might buy some specially and try it later:

 picture

Comment times can be viewed at several granularities (time, month, and year), so I plot the comment trend with year-over-year and month-over-month comparisons.

Commenter activity rose steadily from 2014 to 2017 and then fell:

 picture
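The yearly trend can be sketched like this. The date format `%Y-%m-%d`, the helper names, and the sample dates are assumptions for the example; the real comment.csv may store times differently, and the source code probably does this with pandas.

```python
from collections import Counter
from datetime import datetime


def yearly_counts(timestamps, fmt="%Y-%m-%d"):
    """Count comments per calendar year, returned in ascending year order."""
    years = Counter(datetime.strptime(ts, fmt).year for ts in timestamps)
    return dict(sorted(years.items()))


def year_over_year(counts):
    """Relative change of each year's count against the previous year in the data."""
    years = sorted(counts)
    return {
        year: (counts[year] - counts[prev]) / counts[prev]
        for prev, year in zip(years, years[1:])
    }


# Made-up comment dates.
sample = ["2014-03-01", "2015-01-10", "2015-06-02",
          "2016-02-02", "2016-03-03", "2016-04-04"]
counts = yearly_counts(sample)
growth = year_over_year(counts)
```

A positive value in `growth` means more comments than the year before, a negative value means fewer. The month-over-month version works the same way with `(year, month)` pairs as keys.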

At this point the exploratory analysis is complete. It mainly uses pandas, stylecloud, jieba, and pyecharts. See the source code for the detailed implementation.


Keyword extraction

The data contains a general comment field, which is the overall comment on each tea, and a comment field for each user review. These two fields are used for text keyword extraction.

For the general comments, we want to group teas whose general comments are similar. The KMeans clustering algorithm can do this, but the general comments are text data.

So the keywords in each general comment have to be extracted first. I use the TextRank algorithm: it segments the text into words sentence by sentence, gives each word a weight, and takes the highest-scoring words as keywords.
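The idea behind TextRank can be sketched in a few lines. This is a simplified, unweighted version written for illustration, not the implementation used in the source code; the `textrank_keywords` function and the token list are made up, and the input is assumed to be already segmented into words (for Chinese text, that is the job of a segmenter such as jieba). Words that appear close together are linked in a graph, and a PageRank-style iteration then scores each word by the scores of its neighbors.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=3):
    """Rank words with a PageRank-style iteration over a co-occurrence graph."""
    # Link every word to the words that follow it within `window` positions.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Each word repeatedly collects an equal share of every neighbor's score.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping) + damping * sum(
                scores[n] / len(neighbors[n]) for n in neighbors[word]
            )
            for word in neighbors
        }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Made-up, already segmented text in which "tea" co-occurs with everything else.
tokens = ["aroma", "tea", "taste", "tea", "smooth", "tea", "sweet"]
keywords = textrank_keywords(tokens, window=1, top_k=1)
```

Because "tea" is connected to every other word in this toy input, it collects the highest score and comes out as the top keyword.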

The keywords are then vectorized, the cosine similarity between them is computed, and finally the clustering algorithm is applied, which yields two clusters.

Cluster 1 judges the tea mainly by taste: aroma, flavor, mouthfeel, smoothness, and so on.

Cluster 2 judges the tea mainly by appearance: shape, leaf strips, color and luster, raw material, and so on:

 picture
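The similarity step in the middle of that pipeline can be sketched as below. It assumes each general comment has already been reduced to a list of keywords and treats that list as a bag-of-words vector; the `cosine_similarity` helper and the keyword lists are invented for the example, and the KMeans step itself is not shown.

```python
import math
from collections import Counter


def cosine_similarity(keywords_a, keywords_b):
    """Cosine similarity between two keyword lists treated as count vectors."""
    a, b = Counter(keywords_a), Counter(keywords_b)
    # Only words present in both lists contribute to the dot product.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty keyword list is similar to nothing
    return dot / (norm_a * norm_b)


# Made-up keyword lists: two taste-oriented comments and one appearance-oriented one.
taste_1 = ["aroma", "taste", "smooth"]
taste_2 = ["aroma", "taste", "sweet"]
looks = ["shape", "color", "luster"]

sim_same_topic = cosine_similarity(taste_1, taste_2)
sim_other_topic = cosine_similarity(taste_1, looks)
```

Comments that share keywords score close to 1 and comments with no keywords in common score 0, which is the kind of signal a clustering algorithm needs to separate the taste group from the appearance group.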

For the user comments, I first use the TF-IDF algorithm for keyword extraction. It consists of two parts, TF and IDF.

TF: calculate the frequency of each word across all the texts.

IDF: for each word, count how many comments it appears in and map that count to a score, so that words found in fewer comments score higher.

Finally, take the 10 keywords with the highest TF*IDF scores:

 picture
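The TF-IDF scoring described above can be sketched as follows. This is my own minimal version, not the code from the source: the `tfidf_top_keywords` function and the sample comments are made up, the comments are assumed to be already segmented into words, and the smoothed IDF formula `log((1 + N) / (1 + df)) + 1` is one common variant chosen here so that a word appearing in every comment still gets a non-zero score.

```python
import math
from collections import Counter


def tfidf_top_keywords(comments, top_k=10):
    """Score each word by corpus-wide TF times smoothed IDF and return the top_k pairs."""
    n_comments = len(comments)
    # TF: how often each word occurs across all comments, as a share of all words.
    term_freq = Counter(word for comment in comments for word in comment)
    total_words = sum(term_freq.values())
    # DF: in how many comments each word occurs at least once.
    doc_freq = Counter(word for comment in comments for word in set(comment))

    scores = {}
    for word, count in term_freq.items():
        tf = count / total_words
        idf = math.log((1 + n_comments) / (1 + doc_freq[word])) + 1
        scores[word] = tf * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


# Made-up, already segmented comments.
comments = [
    ["tea", "sweet", "sweet"],
    ["tea", "bitter"],
    ["tea", "sweet"],
]
top = tfidf_top_keywords(comments, top_k=3)
```

In this toy input "tea" and "sweet" are equally frequent, but "tea" appears in every comment, so its IDF is lower and "sweet" ranks above it.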

The second method uses the LDA topic model for keyword extraction. You first have to decide the number of topics and then extract the keywords. Here I choose 1 topic and take its top 10 keywords:

 picture

 picture

Source code

You can get the source code here

Copyright notice
Author: [Programming small code farmer]. Please include the original link when reprinting. Thank you.
https://en.pythonmana.com/2022/01/202201300634595933.html
