
Python crawler in practice with the requests module: visualizing IMDb top movie data

2022-01-31 17:02:15 Dai mubai

This is day 15 of my participation in the November writing challenge. For event details, see: 2021 Last Writing Challenge.

Preface

This post uses Python to crawl IMDb movie data. Without further ado,

let's get started.

Development tools

Python version: 3.6.4

Related modules:

requests module;

random module;

bs4 module;

and some built-in Python modules.

Environment setup

Install Python and add it to your PATH environment variable, then use pip to install the required modules.

Douban is the usual starting point for crawler tutorials, and experts have already analyzed it thoroughly. Meanwhile, as China's film industry develops, it is worth turning our attention to the international market and using data analysis to learn which movies foreign audiences are interested in.

Approach

IMDb Top 250 home page

[Figure: IMDb Top 250 home page]

IMDb movie detail page (1)

[Figure: IMDb movie detail page 1]

IMDb movie detail page (2)

[Figure: IMDb movie detail page 2]

From the page structure above, we only need each movie's unique detail-page ID. With two "hops" from the home page we can reach detail pages (1) and (2) and extract the country and genre from the first, and the ratings and vote counts from the second. To make this easier to follow, the crawl is summarized in the mind map below:

[Figure: mind map of the crawl]
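The two hops can be sketched as follows. This is a minimal illustration rather than the author's code: `BASE_URL`, the helper name `build_detail_urls`, and the `ratings` path used for detail page (2) are assumptions. Only the 17-character slice of the link comes from the crawler listing in this post.

```python
from urllib.parse import urljoin

# Assumed site root; the post never shows the value it actually uses.
BASE_URL = "https://www.imdb.com"


def build_detail_urls(href, base=BASE_URL):
    """Return (detail_page_1, detail_page_2) for one Top 250 link.

    `href` is the link found on the home page, for example
    "/title/tt0111161/?ref_=chttp_tt_1". Its first 17 characters are
    "/title/", the 9-character movie ID, and a trailing slash.
    """
    movie_path = href[0:17]
    detail_1 = urljoin(base, movie_path)
    # Assumption: the per-group ratings breakdown lives under ".../ratings".
    detail_2 = urljoin(detail_1, "ratings")
    return detail_1, detail_2


page_1, page_2 = build_detail_urls("/title/tt0111161/?ref_=chttp_tt_1")
print(page_1)
print(page_2)
```

Slicing a fixed number of characters only works while every movie ID has the same length, so a more defensive version might split on "/" instead.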

The crawler code

IMDb Top 250 home page

# Import libraries -------------------------------------------
from urllib import request
from chardet import detect
from bs4 import BeautifulSoup
import pandas as pd
import time
import random

# Base URL prepended to detail-page links.
# Assumed value: url2 is used below but never defined in the original listing.
url2 = 'https://www.imdb.com'

# Fetch the page source and build a soup object -------------------------
def getSoup(url):
    with request.urlopen(url) as fp:
        byt = fp.read()
        det = detect(byt)
        # Pause 1-4 seconds between requests to avoid hammering the site
        time.sleep(random.randrange(1, 5))
        return BeautifulSoup(byt.decode(det['encoding']), 'lxml')

# Parse the data -------------------------------------------
def getData(soup):
    # Get the scores
    ol = soup.find('tbody', attrs={'class': 'lister-list'})
    score_info = ol.find_all('td', attrs={'class': 'imdbRating'})
    film_scores = [k.text.replace('\n', '') for k in score_info]
    # Get the movie name, director/actors, release year, and detail-page link
    film_info = ol.find_all('td', attrs={'class': 'titleColumn'})
    film_names = [k.find('a').text for k in film_info]
    film_actors = [k.find('a').attrs['title'] for k in film_info]
    film_years = [k.find('span').text[1:5] for k in film_info]
    next_nurl = [url2 + k.find('a').attrs['href'][0:17] for k in film_info]
    data = pd.DataFrame({'name': film_names, 'year': film_years, 'score': film_scores, 'actors': film_actors, 'newurl': next_nurl})
    return data
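Note that getData keeps the year and the score as strings, while the analysis later in the post needs numbers. One possible conversion step is sketched here; the helper name `clean_home_page_row` and the sample row are hypothetical and are not part of the original crawler.

```python
def clean_home_page_row(name, year_text, score_text):
    """Convert one scraped row to typed values.

    Hypothetical helper: it expects the year as a 4-digit string and the
    score as text that may still carry whitespace or newlines.
    """
    return {
        "name": name.strip(),
        "year": int(year_text),
        "score": float(score_text.strip()),
    }


# Made-up sample row, shaped like the strings getData produces.
row = clean_home_page_row(" Example Movie ", "1994", "\n9.2\n")
print(row)
```

With a DataFrame, the same conversion could be applied column by column, but the plain-function form makes the expected input format explicit.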

IMDb Top 250 movie detail pages

# Get detail-page data -------------------------------------------
# detail is the soup for detail page (1); detail1 is the soup for detail page (2)
def nextUrl(detail, detail1):
    # Get the movie's country
    detail_list = detail.find('div', attrs={'id': 'titleDetails'}).find_all('div', attrs={'class': 'txt-block'})
    detail_str = [k.text.replace('\n', '') for k in detail_list]
    detail_str = [k for k in detail_str if k.find(':') >= 0]
    detail_dict = {k.split(':')[0]: k.split(':')[1] for k in detail_str}
    country = detail_dict['Country']
    # Get the movie's genre
    detail_list1 = detail.find('div', attrs={'class': 'title_wrapper'}).find_all('div', attrs={'class': 'subtext'})
    detail_str1 = [k.find('a').text for k in detail_list1]
    movie_type = pd.DataFrame({'Type': detail_str1})
    # Get the detailed ratings and vote counts for each audience group
    div_list = detail1.find_all('td', attrs={'align': 'center'})
    value = [k.find('div', attrs={'class': 'bigcell'}).text.strip() for k in div_list]
    num = [k.find('div', attrs={'class': 'smallcell'}).text.strip() for k in div_list]
    scores = pd.DataFrame({'value': value, 'num': num})
    return country, movie_type, scores
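The detail-page parser turns text blocks such as "Country:USA" into a dictionary by splitting on ":". A small standalone sketch of that step follows. It swaps in `str.partition` for `split(':')[1]`, so values that themselves contain a colon (a URL, for example) are not cut short. The sample strings and the helper name `blocks_to_dict` are made up for illustration.

```python
def blocks_to_dict(blocks):
    """Map "Key: value" strings to a dict, skipping text without a colon."""
    result = {}
    for block in blocks:
        text = block.replace("\n", " ").strip()
        key, sep, value = text.partition(":")
        if sep:  # partition returns an empty separator when ":" is absent
            result[key.strip()] = value.strip()
    return result


# Made-up text blocks, shaped like the text of the txt-block divs.
sample_blocks = [
    "\nCountry:\nUSA\n",
    "Official Sites: https://example.com/film",
    "See more",
]
details = blocks_to_dict(sample_blocks)
print(details["Country"])
print(details["Official Sites"])
```

Looking up `details.get("Country")` instead of `details["Country"]` would also avoid a KeyError on pages where the block is missing.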

Result display

[Figure: sample of the scraped results]

Data analysis

Comparison by genre

First, let's look at the share of each film genre:

[Figure: share of each film genre]

By share of the Top 250, the three largest genres are comedy, crime, and action.

Tension and excitement, together with a lighthearted plot, make for the viewing experiences fans remember most.
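The genre share discussed above can be computed by counting genre tags. The sketch below uses only the standard library and a tiny made-up sample (not real IMDb data); `film_genres` and `genre_share` are illustrative names. Because one film can carry several genres, the shares are taken over all tags rather than over films.

```python
from collections import Counter

# Hypothetical sample: one list of genre tags per film (not real IMDb data).
film_genres = [
    ["Crime", "Drama"],
    ["Action", "Crime"],
    ["Comedy"],
    ["Comedy", "Action"],
]


def genre_share(films):
    """Return {genre: share of all genre tags}, largest share first."""
    counts = Counter(genre for genres in films for genre in genres)
    total = sum(counts.values())
    return {genre: n / total for genre, n in counts.most_common()}


shares = genre_share(film_genres)
for genre, share in shares.items():
    print(f"{genre}: {share:.1%}")
```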

Next, let's compare the scores of the different genres.

[Figure: score comparison by genre]

By genre, Westerns score the highest. This may be because the audience is small and its free-spirited enthusiasts tend to give high scores. Crime, action, adventure, mystery, and horror films also tend to score highly.
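A per-genre comparison like this comes down to grouping scores by genre and averaging them. A minimal sketch, with made-up numbers and hypothetical names (`records`, `score_by_genre`), is shown here. It also returns the number of films per genre, since an average over very few films is less reliable.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (genre, score) pairs; not real IMDb data.
records = [
    ("Western", 8.5),
    ("Western", 8.3),
    ("Crime", 8.4),
    ("Crime", 8.0),
    ("Comedy", 8.1),
]


def score_by_genre(pairs):
    """Return [(genre, mean score, film count)], highest mean first."""
    grouped = defaultdict(list)
    for genre, score in pairs:
        grouped[genre].append(score)
    summary = [(genre, mean(scores), len(scores)) for genre, scores in grouped.items()]
    return sorted(summary, key=lambda row: row[1], reverse=True)


summary = score_by_genre(records)
for genre, avg, count in summary:
    print(f"{genre}: mean {avg:.2f} over {count} film(s)")
```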

Comparison by year

First, let's look at the release years of the Top 250 films.

[Figure: number of Top 250 films by release year]

In the Top 250, the years 1957, 1995, and 2014 have the most films. After 1975 the number of films on the list rises noticeably, which may be related to the growing maturity of the film industry.

As for 1995, film fans may know that it marked the 100th anniversary of world cinema. Countless talented filmmakers seem to have wanted to mark the occasion, and many great works were born that year, including familiar titles such as "The Shawshank Redemption", "Forrest Gump", "Pulp Fiction", "Four Weddings and a Funeral", "Se7en", and "The Lion King".

Now let's look at the ratings of films from each year.

[Figure: ratings by release year]

Comparing ratings across release years shows no significant upward or downward trend, which suggests that film as an art does not lose its value over time. For movies, technology does not come first; emotional resonance carries more weight. Which movie is the best? The answer lies with each of us.
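One way to check for a trend like this numerically, rather than by eye, is to fit a straight line of score against year and look at the slope. The sketch below computes an ordinary least-squares slope with the standard library only; the sample years and scores are invented, and `linear_slope` is a hypothetical helper rather than the method used for the chart.

```python
def linear_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical yearly average scores (not real IMDb data).
years = [1960, 1970, 1980, 1990, 2000, 2010]
scores = [8.3, 8.2, 8.3, 8.2, 8.3, 8.2]

slope = linear_slope(years, scores)
print(f"score change per decade: {slope * 10:+.3f}")
```

A slope close to zero relative to the spread of the scores is consistent with "no clear trend", although it does not by itself prove that none exists.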

Comparison by country

Let's look at the share of each country and region in the Top 250.

[Figure: share of each country and region]

This data is interesting and a bit like the Nobel Prize: American films account for half of the list, and the other countries divide up the rest of the cake. Next come Britain, France, Japan, and Germany. As for China, only one film makes the list: "In the Mood for Love".

If mainstream Western values were the reason, consider Japan, a neighboring country that also represents Eastern culture, which has 16 films on the list. Western values therefore cannot be the main reason so few Chinese films appear. In recent years China has released high-quality works such as "Big Fish & Begonia" and the recent "The Wandering Earth", but the response in the international market remains lukewarm. I believe that movies share a common language and that universal values really do exist. How to build an international film industry and tell stories to audiences around the world is the next question for Chinese filmmakers to explore.

Comparison by director

Let's look at the directors who appear most often in the Top 250.

[Figure: directors with the most films in the Top 250]

Treating the list as the Nobel Prize of the film world, let's see which directors are on it. Since readers may not be familiar with the names of foreign directors, I have put together a table of directors and their representative works for reference. Notably, Ridley Scott, James Cameron, and David Fincher directed "Alien", "Aliens", and "Alien 3" respectively. A single franchise put three directors on the list, which shows the influence of the series.

[Figure: directors on the list and their representative films]

Comparison by audience group

First, let's look at the scores given by different groups.

[Figure: ratings by audience group]

By gender, men are more likely than women to give high scores. By age, for both men and women, minors are the most likely to give high scores. Scores get sharper (harsher) with age, and the over-45 group gives the lowest scores. Is it that, having seen it all, a hardened heart is harder to move? Or are older viewers simply well informed enough to judge a film fairly and objectively? Perhaps this deserves a study of its own, something like "A Scientific Method for Allocating Film Festival Judges by Age Group".

Knowing the scores is not enough, though; we also need to know the share of each group.

[Figure: share of each audience group]

Although older men and women give lower scores, a film's reputation does not depend much on them. The data tell us that if a film satisfies men in the 30-44 and 18-29 age groups, its word of mouth will not be bad. The good reputations earned in recent years by war action films such as "Wolf Warrior" and "Operation Red Sea" give a glimpse of how the scoring works.
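The point that large groups dominate the overall score can be made concrete with a vote-weighted average. The numbers below are invented purely for illustration, and `weighted_score` is a hypothetical helper, not IMDb's actual rating formula.

```python
def weighted_score(groups):
    """Average of group scores, weighted by each group's vote count.

    `groups` maps a group label to a (score, votes) pair.
    """
    total_votes = sum(votes for _, votes in groups.values())
    return sum(score * votes for score, votes in groups.values()) / total_votes


# Hypothetical per-group scores and vote counts (not real IMDb data).
group_ratings = {
    "males 18-29": (8.6, 5000),
    "males 30-44": (8.4, 4000),
    "males 45+": (7.9, 500),
    "females 45+": (7.8, 500),
}

overall = weighted_score(group_ratings)
simple = sum(score for score, _ in group_ratings.values()) / len(group_ratings)
print(f"vote-weighted: {overall:.3f}, unweighted mean of groups: {simple:.3f}")
```

In this toy sample the two large groups pull the weighted figure well above the unweighted mean, which is the effect the paragraph describes.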

Relationship between genre, age, and score

First, let's use a heat map to see how each group rates the different genres.

[Figure: heat map of ratings by group and genre]

Different age groups prefer different genres. For example, underage males and females show great interest in mystery films and Westerns, while men and women over 45 favor science fiction and film noir.
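A heat map of this kind is drawn from a genre-by-group matrix of mean scores. The sketch below builds such a matrix with the standard library; the records are made up and `pivot_mean` is a hypothetical helper. The resulting rows could then be handed to whatever plotting library is used for the chart.

```python
# Hypothetical (genre, group, score) records; not real IMDb data.
ratings = [
    ("Western", "M<18", 8.9),
    ("Western", "M<18", 8.7),
    ("Western", "F45+", 7.8),
    ("Sci-Fi", "M<18", 8.4),
    ("Sci-Fi", "F45+", 8.1),
    ("Film-Noir", "F45+", 8.2),
]


def pivot_mean(records):
    """Return (genres, groups, matrix) of mean scores; None marks empty cells."""
    cells = {}
    for genre, group, score in records:
        cells.setdefault((genre, group), []).append(score)
    genres = sorted({genre for genre, _ in cells})
    groups = sorted({group for _, group in cells})
    matrix = [
        [
            sum(cells[(genre, group)]) / len(cells[(genre, group)])
            if (genre, group) in cells
            else None
            for group in groups
        ]
        for genre in genres
    ]
    return genres, groups, matrix


genres, groups, matrix = pivot_mean(ratings)
print("columns:", groups)
for genre, row in zip(genres, matrix):
    print(genre, row)
```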

The scores also need to be analyzed together with each group's share.

[Figure: combined analysis of scores and group share]

This time we refine the data down to individual age groups. Combining this with the scores from each group, we list below the recommended Top 250 movies for each age group.
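A per-group recommendation can be as simple as picking the films with the highest score within that group. The sketch below shows one way to do that; the films, scores, and the helper name `recommend` are placeholders, and the post does not say exactly how its own lists were produced.

```python
import heapq

# Hypothetical films with per-group scores (not real IMDb data).
films = [
    {"name": "Film A", "scores": {"M18-29": 8.9, "F18-29": 8.1}},
    {"name": "Film B", "scores": {"M18-29": 8.5, "F18-29": 8.8}},
    {"name": "Film C", "scores": {"M18-29": 8.7, "F18-29": 8.4}},
]


def recommend(films, group, n=2):
    """Return the names of the n films rated highest by `group`."""
    rated = [film for film in films if group in film["scores"]]
    best = heapq.nlargest(n, rated, key=lambda film: film["scores"][group])
    return [film["name"] for film in best]


print(recommend(films, "M18-29"))
print(recommend(films, "F18-29"))
```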

Movie recommendation

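Underage males (<18)

[Figure: recommended movies for underage males (<18)]

Males aged 18-29

[Figure: recommended movies for males aged 18-29]

Males aged 30-44

[Figure: recommended movies for males aged 30-44]

Males aged 45+

[Figure: recommended movies for males aged 45+]

Underage females (<18)

[Figure: recommended movies for underage females (<18)]

Females aged 18-29

[Figure: recommended movies for females aged 18-29]

Females aged 30-44

[Figure: recommended movies for females aged 30-44]

Females aged 45+

[Figure: recommended movies for females aged 45+]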

The above are movie recommendations based on IMDb Top 250 data. Apologies if any of them do not match your taste; after all, the preferences of American audiences still differ somewhat from those in China.

Copyright notice
Author: [Dai mubai]. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201311702135390.html
