
Data analysis starts from scratch: Pandas reads HTML pages + data processing and analysis

2022-02-01 15:48:02 cousin

This is day 19 of my participation in the November Gengwen Challenge. Check out the event details: the last Gengwen challenge of 2021.

Zero: Preface

This series of study notes is based on the book *Data Analysis in Practice* by Tomaz Drobas. I will share my notes from working through this book as the series 『Data analysis starts from scratch』.

Click to view the first article: # Data analysis starts from scratch, Pandas reads and writes CSV data

Click to view the second article: # Data analysis starts from scratch, Pandas reads and writes TSV/JSON data

Click to view the third article: # Data analysis starts from scratch, Pandas reads and writes Excel/XML data

The previous three articles covered setting up a virtual environment for data analysis and using pandas to read and write data in CSV, TSV, JSON, Excel, and XML formats. Today we continue exploring pandas.

One: Overview of the basics

1. Using Pandas to parse HTML pages (the read_html function)

2. Hands-on practice: fetching page data directly with read_html

3. Basic data processing: header handling, plus dropna and fillna in detail

4. A basic data visualization and analysis case

Two: Diving in

1. Pandas' read_html function

Here we introduce read_html, the function Pandas uses to parse HTML pages. Looking at the source code, you can see the function takes many parameters; below I'll pick out the key ones and explain them.

(1) io (the most important parameter)

Source code comments

		A URL, a file-like object, or a raw string containing HTML. Note that
        lxml only accepts the http, ftp and file url protocols. If you have a
        URL that starts with ``'https'`` you might try removing the ``'s'``.

My understanding

	Where the data lives: a URL, a path to a file containing HTML, or a raw HTML string.
	Note that lxml only accepts the HTTP, FTP, and file URL protocols.
	If you have a URL that starts with "https", you can try deleting the "s" before passing it in.

(2)match

Source code comments

		str or compiled regular expression, optional
        The set of tables containing text matching this regex or string will be
        returned. Unless the HTML is extremely simple you will probably need to
        pass a non-empty string here. Defaults to '.+' (match any non-empty
        string). The default value will return all tables contained on a page.
        This value is converted to a regular expression so that there is
        consistent behavior between Beautiful Soup and lxml.

My understanding

	A string or compiled regular expression; optional.
	The set of tables containing text that matches this regex or string will be returned.
	Unless the HTML is extremely simple, you will probably need to pass a non-empty string here.
	The default is ".+" (match any non-empty string), which returns every <table> on the page.
	The value is converted to a regular expression so that behavior is consistent between Beautiful Soup and lxml.

(3)flavor

Source code comments

		flavor : str or None, container of strings
        The parsing engine to use. 'bs4' and 'html5lib' are synonymous with
        each other, they are both there for backwards compatibility. The
        default of ``None`` tries to use ``lxml`` to parse and if that fails it
        falls back on ``bs4`` + ``html5lib``.

My understanding

	The parsing engine to use. 'bs4' and 'html5lib' are synonyms;
	both exist for backwards compatibility. The default, None, first tries
	to parse with lxml and, if that fails, falls back to bs4 + html5lib.
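To make io, match, and the return type concrete, here is a minimal sketch on a raw HTML string (the two tables are made up for illustration; parsing requires lxml, or bs4 + html5lib, to be installed):

```python
from io import StringIO

import pandas as pd

html = """
<table>
  <tr><th>name</th><th>score</th></tr>
  <tr><td>Alice</td><td>90</td></tr>
</table>
<table>
  <tr><th>city</th><th>pop</th></tr>
  <tr><td>Paris</td><td>2</td></tr>
</table>
"""

# read_html returns a *list* of DataFrames, one per matching <table>
all_tables = pd.read_html(StringIO(html))               # default match='.+': both tables
city_only = pd.read_html(StringIO(html), match="city")  # only tables whose text matches
```

Passing the string through StringIO makes it an explicit file-like object, which keeps newer pandas versions from warning about literal HTML input.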

2. Basic data processing

(1) Processing column names
# Process column names
import re

# A regular expression that matches any run of whitespace in a string
space = re.compile(r"\s+")

def fix_string_spaces(columnsToFix):
    ''' Convert whitespace characters in column names to underscores '''
    tempColumnNames = []   # Holds the processed column names
    # Loop over all the columns
    for item in columnsToFix:
        # The name contains whitespace
        if space.search(item):
            # space.split(item) splits the name at each run of whitespace,
            # then '_'.join(...) glues the pieces back together with underscores
            tempColumnNames.append('_'.join(space.split(item)))
        else:
            # Otherwise add the name to the list unchanged
            tempColumnNames.append(item)
    return tempColumnNames

The code above comes from the book. Its purpose is to process column names, converting any whitespace inside a name to an underscore. Think about it carefully and you'll see the pattern generalizes well: for example, checking whether a row of data is empty, or cleaning out empty entries in a list. It's very reusable.
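As a side note, the same cleanup can be done with pandas' vectorized string methods; a minimal sketch, with a made-up two-column frame:

```python
import pandas as pd

df = pd.DataFrame({"first name": [1], "net  worth": [2]})

# str.replace with regex=True collapses each run of whitespace into one underscore
df.columns = df.columns.str.replace(r"\s+", "_", regex=True)
```

This avoids the explicit loop entirely while producing the same result as fix_string_spaces.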

(2) Handling missing data: the dropna function

The dropna() function filters out missing data. Analysis of its common parameters: axis

Source code comments

		 axis : {0 or 'index', 1 or 'columns'}, default 0
            Determine if rows or columns which contain missing values are removed.
            * 0, or 'index' : Drop rows which contain missing values.
            * 1, or 'columns' : Drop columns which contain missing value.
            .. deprecated:: 0.23.0: Pass tuple or list to drop on multiple axes.

My understanding

	Not used much in practice. The default, 0, drops rows that contain missing values; 1 drops columns that contain missing values.

how

Source code comments

		how : {'any', 'all'}, default 'any'
            Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.
            * 'any' : If any NA values are present, drop that row or column.
            * 'all' : If all values are NA, drop that row or column.

My understanding

	The default, 'any', drops the row or column if it contains any NA (null) value;
	'all' drops the row or column only if all of its values are NA.

thresh

Source code comments

		thresh : int, optional
            Require that many non-NA values.

My understanding

	The required number of non-NA values: rows that meet the threshold are kept, rows that fall short are dropped.

inplace

Source code comments

		inplace : bool, default False
            If True, do operation inplace and return None.

My understanding

	The default, False, means the operation is not performed on the original object;
	instead a copy is operated on and returned.
	True means the operation modifies the original object directly (and returns None).
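Putting axis, how, and thresh together on a toy DataFrame (the frame itself is made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [np.nan, np.nan, 6.0],
                   "c": [7.0, 8.0, 9.0]})

any_dropped = df.dropna()           # how='any' (default): rows 0 and 1 contain NA
all_dropped = df.dropna(how="all")  # no row is entirely NA, so nothing is dropped
col_dropped = df.dropna(axis=1)     # columns 'a' and 'b' both contain NA
kept = df.dropna(thresh=2)          # keep only rows with at least 2 non-NA values
```

Each call returns a new frame and leaves df untouched, which is the inplace=False default described above.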
(3) Handling missing data: the fillna function

The fillna() function fills missing data with a specified value or an interpolated one. Analysis of its common parameters: value

Source code comments

	value : scalar, dict, Series, or DataFrame
            Value to use to fill holes (e.g. 0), alternately a
            dict/Series/DataFrame of values specifying which value to use for
            each index (for a Series) or column (for a DataFrame). (values not
            in the dict/Series/DataFrame will not be filled). This value cannot
            be a list.

My understanding

	Simply put, this is the replacement for NA (null) values. Passing a plain value replaces every NA;
	passing a dict of {column name: replacement value} replaces the nulls only in those columns.
	Note that the value cannot be a list.
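A quick sketch of the scalar and dict forms on a made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 2.0]})

zeros = df.fillna(0)           # scalar: every NA becomes 0
a_only = df.fillna({"a": -1})  # dict: only column 'a' is filled; 'b' keeps its NA
```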

method

Source code comments

	method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
            Method to use for filling holes in reindexed Series
            pad / ffill: propagate last valid observation forward to next valid
            backfill / bfill: use NEXT valid observation to fill gap

My understanding

	The method used to fill holes in a reindexed Series.
	pad / ffill: scan down each column and propagate the last non-null value forward into the gap.
	backfill / bfill: scan down each column and fill the gap with the next non-null value.
	Note: this parameter cannot be used together with value.

limit

Source code comments

limit : int, default None
            If method is specified, this is the maximum number of consecutive.
            NaN values to forward/backward fill. In other words, if there is a gap 
            with more than this number of consecutive NaNs, it will only be partially
             filled. If method is not specified, this is the maximum number of entries
              along the entire axis where NaNs will be filled. Must be greater than 0 if not None.

My understanding

	Quite simple: when scanning a column for nulls, limit caps the number of consecutive nulls that get filled.
	For example, limit=2 means that, searching top-down within a column, only the first two consecutive nulls are replaced; any after that are left alone.
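Both behaviors on a made-up Series. Note that recent pandas versions deprecate fillna(method=...) in favor of the equivalent ffill()/bfill() methods, which is what this sketch uses:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

forward = s.ffill()          # same as fillna(method='ffill'): carry 1.0 forward
limited = s.ffill(limit=2)   # fill at most 2 consecutive NAs; the third stays NA
```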

A gripe: the English comments in the source look simple word by word, but they're so terse they barely form sentences. Only by trying each parameter out and translating literally did I work out what each one actually means.

3. Hands-on data scraping

Five lines of code to scrape the 2019 rich list (fortunes above $6 billion)

import pandas as pd

# The list spans 15 pages
for i in range(15):
    # Page address
    url = "https://www.phb123.com/renwu/fuhao/shishi_%d.html" % (i + 1)
    # Call read_html; it parses the page and returns a list of DataFrames
    url_read = pd.read_html(url, header=0)[0]
    # Append the data to a CSV file
    url_read.to_csv(r'rich_list.csv', mode='a', encoding='utf_8_sig', header=0, index=False)

The page data and the scraped result. From the hands-on above, you should take away two things: 1. Don't assume it's always this simple (I picked this site deliberately: the page has a single data table and the data is fairly clean). 2. In real work the website may not cooperate and the data may not be this tidy; the best approach then is to stay flexible and read more of the page source.

4. Hands-on data visualization and analysis

Based on the data we scraped above, let's produce a simple data visualization and analysis report. We now have the 2019 rich list (fortunes above $6 billion), including rank, name, wealth, source of wealth, and country. With the data attributes defined, what questions can we analyze? A few come to mind: (1) How many people from each country made the list, and which countries have the most? (2) Which companies have the most people on the list? (3) What is the industry distribution of the people on the list?

(0) Reading the data, and data visualization

To read the data we use pandas' read_csv function directly.

import pandas as pd

# Path to the raw data file
rpath_csv = 'rich_list.csv'
# Read the data
csv_read = pd.read_csv(rpath_csv)
# Each extracted column is a pandas Series object,
# which can be converted to a list for later processing
name_list = csv_read[" name "]
money_list = csv_read[" Wealth (10 Billion dollars )"]
company_list = csv_read[" Source of wealth "]
country_list = csv_read[" Country / region "]

For data visualization, we start with the simplest option, the pyecharts module.

	pip install pyecharts==0.5.11
(1) How many people from each country made the list? Which countries have the most?
# How many people from each country made the list? Which countries have the most?
""" 1. Counting: use the Counter class from the collections module """
from collections import Counter

country_list = list(country_list)
dict_number = Counter(country_list)
# Split the counts into parallel key/value lists for plotting
key_list = list(dict_number.keys())
values_list = list(dict_number.values())

""" 2. Visualization: use the Bar class from the pyecharts module """
from pyecharts import Bar

bar = Bar("Bar chart of the billionaires' countries")
bar.add("Billionaires", key_list, values_list, is_more_utils=True, is_datazoom_show=True,
        xaxis_interval=0, xaxis_rotate=30, yaxis_rotate=30, mark_line=["average"], mark_point=["max", "min"])
bar.render("rich_country.html")

From the chart we can clearly see that, among the nationalities on the rich list, Americans are the most numerous, and by a wide margin: of the 300 people in total, 106 hold American nationality, more than a third of the data. This is easy to understand; the United States has long been a superpower, ranking near the top of the world in every aspect of development.

In second place is China with 43 people, also a large share. For China, getting here was not easy: from its founding in 1949 to 2019 is 70 years, from "studying for the rise of China" to "realizing the Chinese dream and building a prosperous, strong, democratic, civilized, harmonious and beautiful modern socialist country". As a Chinese person, I am proud.

Third place is shared by Germany and Russia, with 20 people each. Germany is a great industrial nation and Europe's largest economy, so its strength is obvious. As for Russia, the largest country in the world: the Soviet Union was once the world's second-largest economy, and although Russia after the breakup is not what it was, the economy has recovered steadily under Putin in recent years.

Most of the countries after that are European; fifth is India, whose technology sector is highly developed.

(2) Which companies have the most people on the list?

Keep in mind that the bar to make this list is a fortune of $6 billion. Statistically, Mars has the most people on the list: 6 of the richest come from Mars, followed by Walmart with 3. Both are consumer goods companies. Next come Microsoft, Facebook, and Google, all technology companies.

Before checking, I honestly didn't know that Snickers ("Hungry? Grab a Snickers") and silky-smooth Dove chocolate come from the same company, and that the company is Mars. Also, Walmart topped the Fortune Global 500 in 2018; in a sense, that makes it the strongest company in the universe. (As a kid I thought the local supermarket was the best; growing up I thought Wanda was the most powerful; now I get it: it's Walmart!)
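The company tally described above can be reproduced with Counter.most_common; a minimal sketch, with toy counts standing in for the real scraped company_list:

```python
from collections import Counter

# Toy data standing in for the scraped "source of wealth" column
company_list = ["Mars"] * 6 + ["Walmart"] * 3 + ["Microsoft"] * 2

# most_common(n) returns the n largest (company, count) pairs, biggest first
top3 = Counter(company_list).most_common(3)
```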

(3) What is the industry distribution of the people on the list?

This part is actually not easy to do, because the data we scraped contains nothing directly tied to industry. The only field with some connection to industry is the company, which means we would have to infer (or look up online) each company's category from its name, e.g. internet company versus traditional industry.

Three: A parting message

Persist and strive: you'll get back what you put in.

The idea is complex,

The implementation is fun,

As long as you don't give up,

Your day of fame will come.

— "Laobiao's Limericks"

See you next time! I'm Laobiao ("cousin"), who loves cats and technology. If you found this article helpful for your studies, feel free to like, comment, and follow me!

Copyright notice
Author: [cousin]. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/02/202202011548009725.html
