
Python data deduplication and missing value processing

2022-01-31 23:23:31 D porridge

This is day 7 of my participation in the November Writing Challenge. For the event details, see: the last writing challenge of 2021

Suppose we have some fresh data scraped from the Internet, with four columns: name (the title), author (the author), grade (the score), and stats (how many people have viewed it).

Reading data

Use the pandas read_csv method to read the data. The usecols parameter selects which columns to read; by default, all columns are read.

import pandas as pd
df = pd.read_csv("foodInfo.csv", usecols=['name', 'author', 'grade', 'stats'])

You can print the first five rows to check the result: print(df.head())
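To see what usecols does without the real file, here is a minimal sketch. The CSV text and the extra url column are made up for illustration; only the four column names come from the data described above.

```python
import io
import pandas as pd

# Hypothetical stand-in for foodInfo.csv, with one extra column ("url")
csv_text = """name,author,grade,stats,url
Braised pork,Author A,9.1,1200,http://example.com/1
Fried rice,Author B,8.4,560,http://example.com/2
"""

# Only the four listed columns are loaded; "url" is skipped
sample = pd.read_csv(io.StringIO(csv_text), usecols=['name', 'author', 'grade', 'stats'])
print(sample.columns.tolist())
print(sample.head())
```

Leaving usecols out would load all five columns, including url.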

Removing duplicates

print(df.duplicated().value_counts())


The output shows that there are 103 rows of data, one of which is a duplicate. We can also print df.duplicated() to check which row is the duplicate.
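As a rough illustration of how duplicated() can point at the repeated rows, here is a small sketch on made-up data (the three rows below are hypothetical, not taken from foodInfo.csv):

```python
import pandas as pd

# Hypothetical sample: the last row repeats the first one
sample = pd.DataFrame({
    'name': ['Braised pork', 'Fried rice', 'Braised pork'],
    'author': ['Author A', 'Author B', 'Author A'],
    'grade': [9.1, 8.4, 9.1],
    'stats': [1200, 560, 1200],
})

# duplicated() marks every repeat of an earlier row with True
mask = sample.duplicated()
print(mask.value_counts())

# Boolean indexing shows the duplicated rows themselves
dupes = sample[mask]
print(dupes)
```

Here only the row at index 2 is flagged, because by default the first occurrence is not counted as a duplicate.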

df.drop_duplicates(keep='first', inplace=True)

drop_duplicates has 3 main parameters to cover different deduplication cases

subset : a list of column names, all columns by default. Rows whose values in the specified columns are identical are treated as duplicates and deleted

keep : the default is first. first keeps only the first occurrence of each duplicated row, last keeps only the last occurrence, and False deletes all duplicated rows

inplace : if True, the original data is modified directly; if False, a new object is returned and you need to assign it to a variable
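The effect of subset and keep is easier to see side by side. This is only a sketch on a made-up three-row frame; the values are assumptions for illustration.

```python
import pandas as pd

sample = pd.DataFrame({
    'name': ['Braised pork', 'Braised pork', 'Fried rice'],
    'author': ['Author A', 'Author A', 'Author B'],
    'grade': [9.1, 8.8, 8.4],
})

# Rows 0 and 1 differ in grade, so a full-row comparison keeps both
full = sample.drop_duplicates()

# Comparing only name and author treats rows 0 and 1 as duplicates
first = sample.drop_duplicates(subset=['name', 'author'], keep='first')
last = sample.drop_duplicates(subset=['name', 'author'], keep='last')
none = sample.drop_duplicates(subset=['name', 'author'], keep=False)

print(len(full))             # 3
print(first.index.tolist())  # [0, 2]
print(last.index.tolist())   # [1, 2]
print(none.index.tolist())   # [2]
```

Since inplace is left at its default of False, sample itself is unchanged and each result is a new DataFrame.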

Missing value processing

# See which columns have missing values
print(df.isnull().any())

# Locate the rows that contain missing values
data = df[df.isnull().any(axis=1)]

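Besides knowing which columns have gaps, it often helps to know how many gaps each column has. A minimal sketch on made-up data (the rows are hypothetical):

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'name': ['Braised pork', 'Fried rice', 'Noodles'],
    'author': ['Author A', None, 'Author B'],
    'grade': [9.1, np.nan, np.nan],
})

# Number of missing values in each column
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Rows with at least one missing value, each listed once
rows_with_missing = sample[sample.isnull().any(axis=1)]
print(rows_with_missing)
```

isnull() treats both None and NaN as missing, so the author column reports 1 and the grade column reports 2.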

Deleting missing values: dropna

df.dropna(how='any', inplace=True)

axis : 0 means rows, 1 means columns. The default is rows

subset : only check the specified columns for missing values

how : any deletes the entire row as soon as it has 1 missing value, all deletes the row only if every column in it is missing

thresh : the minimum number of non-missing values a row needs in order to be kept. Rows with fewer are deleted

inplace : if True, the original data is modified directly; if False, a new object is returned and you need to assign it to a variable
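The dropna parameters can be compared on one small frame. This is a sketch with made-up rows; note that thresh counts the non-missing values in a row.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'name': ['Braised pork', 'Fried rice', None, 'Noodles'],
    'author': ['Author A', 'Author B', None, None],
    'grade': [9.1, np.nan, np.nan, 7.5],
})

# Drop a row as soon as it has one missing value
any_missing = sample.dropna(how='any')

# Drop a row only when every value in it is missing
all_missing = sample.dropna(how='all')

# Only look at the author column when deciding
by_author = sample.dropna(subset=['author'])

# Keep rows that have at least 2 non-missing values
at_least_two = sample.dropna(thresh=2)

print(any_missing.index.tolist())   # [0]
print(all_missing.index.tolist())   # [0, 1, 3]
print(by_author.index.tolist())     # [0, 1]
print(at_least_two.index.tolist())  # [0, 1, 3]
```

The calls are kept separate on purpose: how and thresh express two different rules, so each one is shown on its own.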

Filling missing values: fillna

Here I specify a value to replace the missing values: the average score of the author in the data

def fillByAuthor(author):
    # Average grade of the given author; mean() skips missing grades
    grades = df.loc[df.author == author, 'grade']
    return round(grades.mean(), 2)

# Fill the missing values with that author's average score
a = fillByAuthor('Wang Guangguang')
df.fillna(a, inplace=True)
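If the gaps belong to several authors, one possible variation (not the approach used above) is to fill each missing grade with its own author's average using groupby and transform. The data below is made up for illustration.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'author': ['Author A', 'Author A', 'Author A', 'Author B', 'Author B'],
    'grade': [9.0, 8.0, np.nan, 7.0, np.nan],
})

# Mean grade per author, aligned to the original rows (NaN is skipped)
author_mean = sample.groupby('author')['grade'].transform('mean').round(2)

# Fill only the grade column, each gap with its own author's average
sample['grade'] = sample['grade'].fillna(author_mean)
print(sample)
```

Filling only the grade column also avoids writing a numeric average into text columns such as name or author.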

inplace : if True, the original data is modified directly; if False, a new object is returned and you need to assign it to a variable

method : pad/ffill fills a missing value with the previous non-missing value; backfill/bfill fills a missing value with the next non-missing value; None (the default) means a specified value is used to replace the missing value

limit : limits how many missing values are filled

axis : the axis along which to fill
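Here is a small sketch of the filling directions and limit on a made-up Series. It uses the ffill() and bfill() methods, which do the same job as method='ffill' and method='bfill'; as far as I know, recent pandas releases deprecate the method argument of fillna in favor of these two.

```python
import numpy as np
import pandas as pd

grades = pd.Series([9.0, np.nan, np.nan, 7.0, np.nan])

# Fill each gap with the previous non-missing value
forward = grades.ffill()

# Fill each gap with the next non-missing value (the last gap has none)
backward = grades.bfill()

# Forward fill, but at most 1 consecutive gap
limited = grades.ffill(limit=1)

# Replace every gap with a fixed value
constant = grades.fillna(0)

print(forward.tolist())   # [9.0, 9.0, 9.0, 7.0, 7.0]
print(constant.tolist())  # [9.0, 0.0, 0.0, 7.0, 0.0]
print(backward)
print(limited)
```

With bfill the final value stays missing because nothing follows it, and with limit=1 the second of the two consecutive gaps stays missing.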

Saving the data

df.to_csv("clean_data.csv")
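One thing to be aware of: by default to_csv also writes the row index as an unnamed first column. If you do not want it in the file, index=False leaves it out. A minimal sketch on a made-up frame, writing to a string instead of a file:

```python
import pandas as pd

sample = pd.DataFrame({'name': ['Braised pork'], 'grade': [9.1]})

# With no path given, to_csv returns the CSV text as a string
with_index = sample.to_csv()
without_index = sample.to_csv(index=False)

print(with_index.splitlines()[0])     # ,name,grade
print(without_index.splitlines()[0])  # name,grade
```

Whether to keep the index depends on how the file will be read later; after dropping rows the index has gaps, so it usually carries little meaning.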


copyright notice
Author: D porridge. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201312323282158.html
