current position:Home>You can easily get started with Excel. Python data analysis package pandas (V): duplicate value processing

You can easily get started with Excel. Python data analysis package pandas (V): duplicate value processing

2021-08-23 04:59:48 Excel catalyst

> I often listen to others Python How powerful in the data field , As a result, I studied for a long time , Even data processing is a death of trouble . Only later , It's not Python Data processing is powerful , But he has a data analysis artifact —— pandas

Preface

Sometimes duplicate values appear in the data , It may cause errors in the final statistical results , therefore , Finding and removing duplicate values is a common operation in data processing . Today we'll look at it pandas How to realize .

Excel Handle duplicate values

Excel It directly provides the function of removing duplication , Therefore, simple operation can be realized . as follows :

  • - Function card " data "," Data tools " There is " Delete duplicates " Button
  • - Then you can choose which columns to use as repeated judgment

> besides ,Excel Conditional formatting can also be used in 、 Advanced filtering or function formula to achieve similar functions

pandas Tag duplicate value

pandas Also provides a simple way to mark duplicate values , And ratio Excel There are more flexible ways for you to choose , Let's see :

  • - DataFrame.duplicated() , Generate a boolean flag whether it is a duplicate record . By default, all data in the whole row is used as the judgment basis
  • - The results are clear , The last line is the repeating line , So the value of the last row of the tag column is True

We can specify , When there are duplicate values , Where to keep the row . as follows :

  • - By default ,duplicated() Of keep Parameter is "first", You mean for " Keep the first "
  • - Now let's keep Set to "last", So keep the last one , So now the first of the repeated lines is marked True

besides , We can also put keep Parameter set to False, intend " No reservation ", as follows :

  • - Now, where there are duplicate lines , Are marked True

Through parameters subset Which columns can be specified as the basis for judgment :

image Excel Remove duplicates as well

In fact, after marking the duplicate value , Only simple filtering is needed to get non duplicate records . however pandas There is a direct method to remove duplication . as follows :

  • - call DataFrame.drop_duplicates() , You can remove duplicate
  • - His parameters and rules are the same as duplicated As like as two peas . It's actually a duplicated() Marked as True Just remove the line

Last

  1. - DataFrame.duplicated() , Mark duplicates . Use subset Specify duplicate value judgment column ,keep={'first','last',False} Specify how to determine which are duplicates
  2. - DataFrame.drop_duplicates() , Remove duplicates

Next section , We'll look at the implementation of the sorting function . Stay tuned .

** If you want to learn from scratch pandas , Then you can look at my pandas special column .**

This article is from WeChat official account. - Excel catalyst (ExcelCuiHuaJi)

The source and reprint of the original text are detailed in the text , If there is any infringement , Please contact the [email protected] Delete .

Original publication time : 2019-08-21

Participation of this paper Tencent cloud media sharing plan , You are welcome to join us , share .

copyright notice
author[Excel catalyst],Please bring the original link to reprint, thank you.
https://en.pythonmana.com/2021/08/20210823045945687N.html

Random recommended