
Handling duplicate values in Pandas

2022-01-30 19:03:51 Dream, killer


Example data:

import pandas as pd

df = pd.DataFrame({'a':['Python', 'Python', 'Java', 'Java', 'C'], 'b': [2, 2, 6, 8, 10]})
df

Check whether a single column contains duplicate values

  1. Use value_counts() to count how many times each value occurs in the column. The results are sorted in descending order by default, so you only need to look at the count in the first row: if it is 1 there are no duplicates, and if it is greater than 1 the column contains duplicate values.
df['a'].value_counts()

  2. Use drop_duplicates() to delete duplicate values, keeping only the first occurrence, then check whether the result equals the original df. If the comparison returns False, the column contains duplicate values.
df.equals(df.drop_duplicates(subset=['a'], keep='first'))

False
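
The two single-column checks above can be sketched as follows. This is a minimal illustration on the example data; the variable names are hypothetical, and the `is_unique` check is an alternative that the original text does not use.

```python
import pandas as pd

df = pd.DataFrame({'a': ['Python', 'Python', 'Java', 'Java', 'C'],
                   'b': [2, 2, 6, 8, 10]})

# value_counts() sorts by count in descending order by default,
# so the first entry holds the largest number of occurrences.
counts = df['a'].value_counts()
has_dupes_by_count = bool(counts.iloc[0] > 1)

# Alternative (not from the original text): Series.is_unique is False
# as soon as any value in the column repeats.
has_dupes_by_unique = not df['a'].is_unique

print(has_dupes_by_count, has_dupes_by_unique)
```

Both flags should agree; the `is_unique` form avoids building the full table of counts when only a yes/no answer is needed.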

Check whether there are duplicate rows across all columns

Again use drop_duplicates() to delete duplicates, keeping only the first occurrence. This time do not set the subset parameter, so all columns are compared by default. Then check whether the result equals the original df. If the comparison returns False, there are duplicate rows.

df.equals(df.drop_duplicates(keep='first'))

False
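
As a hedged alternative to comparing with `equals()`, the same question can be answered with `DataFrame.duplicated()`, which returns a boolean mask instead of a new frame. The names `mask` and `has_duplicate_rows` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'a': ['Python', 'Python', 'Java', 'Java', 'C'],
                   'b': [2, 2, 6, 8, 10]})

# duplicated() marks every row that repeats an earlier row
# (keep='first' is the default; all columns are compared).
mask = df.duplicated()
has_duplicate_rows = bool(mask.any())

# The approach described in the text, for comparison.
equals_check = df.equals(df.drop_duplicates(keep='first'))

print(has_duplicate_rows, equals_check)
```

Note that the two results have opposite senses: `has_duplicate_rows` is True when duplicates exist, while the `equals()` comparison is False in that case.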

Count the number of duplicate rows

len(df) - len(df.drop_duplicates(keep="first"))
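
A short sketch of the count, assuming the example data above. Summing the boolean mask from `duplicated()` is an alternative to the length difference and is not part of the original text; the `subset` variant shows how the count changes when only column `a` is considered.

```python
import pandas as pd

df = pd.DataFrame({'a': ['Python', 'Python', 'Java', 'Java', 'C'],
                   'b': [2, 2, 6, 8, 10]})

# Length difference, as in the text: rows removed by drop_duplicates().
n_by_len = len(df) - len(df.drop_duplicates(keep='first'))

# Alternative: each True in the mask is one extra copy of an earlier row.
n_by_mask = int(df.duplicated(keep='first').sum())

# Restricting the comparison to column 'a' counts repeats in that column only.
n_in_column_a = int(df.duplicated(subset=['a']).sum())

print(n_by_len, n_by_mask, n_in_column_a)
```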

Show duplicate data rows

First delete the duplicate rows, keeping only the first occurrence, to get a data set of unique rows. Then use drop_duplicates() again to delete every row of df that has a duplicate, this time without keeping the first occurrence. Concatenate the two result sets and call drop_duplicates() with keep=False on the combined data set. What remains is one copy of each duplicated row.

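# DataFrame.append() was removed in pandas 2.0, so pd.concat() is used to combine the two results
pd.concat([df.drop_duplicates(keep="first"), df.drop_duplicates(keep=False)]).drop_duplicates(keep=False)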
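
The three-step approach above can also be expressed with a boolean mask. The sketch below is an alternative, not the author's method: `duplicated(keep=False)` flags every copy of a repeated row, and a further `drop_duplicates()` reduces that to one copy per group. The variable names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'a': ['Python', 'Python', 'Java', 'Java', 'C'],
                   'b': [2, 2, 6, 8, 10]})

# keep=False marks ALL copies of a repeated row, including the first one.
all_copies = df[df.duplicated(keep=False)]

# One representative row per group of duplicates.
one_copy = all_copies.drop_duplicates(keep='first')

# The approach described in the text, written with pd.concat
# (DataFrame.append was removed in pandas 2.0).
via_concat = pd.concat([df.drop_duplicates(keep='first'),
                        df.drop_duplicates(keep=False)]).drop_duplicates(keep=False)

print(all_copies)
print(one_copy)
```

On the example data, `one_copy` and `via_concat` hold the same single row, while `all_copies` shows both occurrences, which is often more useful when inspecting the data by eye.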


copyright notice
Author: [Dream, killer]. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/01/202201301903484431.html
