current position:Home>You can easily get started with Excel. Python data analysis package pandas (XI): segment matching

You can easily get started with Excel. Python data analysis package pandas (XI): segment matching

2021-08-22 22:36:13 Excel catalyst

> I often listen to others Python How powerful in the data field , As a result, I studied for a long time , Even data processing is a death of trouble . Only later , It's not Python Data processing is powerful , But he has a data analysis artifact —— pandas

Preface

There is already an article in this series that introduces pandas To realize Excel Of vlookup How to function , however vlookup One more " Fuzzy matching " The function of , It is mainly used for segment matching , Come and have a look today pandas How to achieve the same effect in .

Case study 1

today , You have received an urgent and temporary demand , The data table is as follows :

  • - 200 Million lines of records
  • - Need to follow fraction Column , Calculate according to the rules The rating

The rule table is as follows :

  • - This is specially for Vlookup And the rule table designed

> If pressed pandas To design the rule table , that Vlookup The solution will be troublesome .

What do I do ? There's so much data Excel I don't open it , Whatever you are Vlookup still Xlookup It's no use .

Multifunctional Vlookup

Because there are too many data this time , use Excel This file can no longer be opened , So I'll use a small amount of data to briefly demonstrate how to use Vlookup Solve this problem :

  • - It's actually the last parameter input 1(True), that will do
  • - It's important , Rule table value Column , Remember to put it in order , Otherwise, the result will be disordered, and you don't know

pandas Segment matching in

This requirement is commonly referred to as... In data processing " Separate boxes ",pandas Use in cut Method to achieve :

  • - We from csv Reading data , from Excel Read the rule table
  • - Notice this is pandas The top-level method , So it is pd.cut()
  • - The first 1 The parameter is passed into the judgment data column
  • - The first 2 Parameters are passed into the rule table value , however cut The method must give the boundaries of all intervals . Like the rule table in this example , No higher than 120 The end boundary of the score , We need to add a large value as the end boundary
  • - Parameters right, Set to False , Just to be with vlookup Just the same effect , Express : " The boundary of the right section is open ", such as : 120 branch , Be divided into A+ The rating
  • - Parameters labels, Is the result returned

> You can see pandas You can easily read data from any data source , In this case, even if your data source is in various databases

> Be careful ,bins When there is no ascending sort , Will report a mistake . It's a very good design

It's hard to read , Look at this diagram , It should be much clearer :

Case study 2: Automatic partition

In practical analysis , You may not know at first how to define the nodes of the rule table .

For example, you now want to divide 3 Segments , But you don't know how to define the boundary between segments . Then you can call cut Method :

  • - cut Methods the first 2 Parameters , We specify 3, To indicate a division 3 paragraph
  • - Do not specify parameters labels, You can see the divided interval .

You can also specify labels:

The final division results shall be balanced in the number of each interval as much as possible , Look at the number of intervals :

summary

  • - pd.cut() , Process the data in boxes
  • - Parameters bins You can specify your own rule table , You can also directly specify the number of segments
  • - When specifying the number of segments , Each partition interval will be defined automatically
  • - When the specified bins When the rule table is not sorted in ascending order , Will report a mistake

This article is from WeChat official account. - Excel catalyst (ExcelCuiHuaJi)

The source and reprint of the original text are detailed in the text , If there is any infringement , Please contact the [email protected] Delete .

Original publication time : 2019-09-06

Participation of this paper Tencent cloud media sharing plan , You are welcome to join us , share .

copyright notice
author[Excel catalyst],Please bring the original link to reprint, thank you.
https://en.pythonmana.com/2021/08/20210822223609215v.html

Random recommended