# You can easily get started with Excel. Python data analysis package pandas (XI): segment matching

2021-08-22 22:36:13

> I often listen to others Python How powerful in the data field , As a result, I studied for a long time , Even data processing is a death of trouble . Only later , It's not Python Data processing is powerful , But he has a data analysis artifact —— pandas

Preface

There is already an article in this series that introduces pandas To realize Excel Of vlookup How to function , however vlookup One more " Fuzzy matching " The function of , It is mainly used for segment matching , Come and have a look today pandas How to achieve the same effect in .

Case study 1

today , You have received an urgent and temporary demand , The data table is as follows ：

• - 200 Million lines of records
• - Need to follow fraction Column , Calculate according to the rules The rating

The rule table is as follows ：

• - This is specially for Vlookup And the rule table designed

> If pressed pandas To design the rule table , that Vlookup The solution will be troublesome .

What do I do ？ There's so much data Excel I don't open it , Whatever you are Vlookup still Xlookup It's no use .

Multifunctional Vlookup

Because there are too many data this time , use Excel This file can no longer be opened , So I'll use a small amount of data to briefly demonstrate how to use Vlookup Solve this problem ：

• - It's actually the last parameter input 1(True), that will do
• - It's important , Rule table value Column , Remember to put it in order , Otherwise, the result will be disordered, and you don't know

pandas Segment matching in

This requirement is commonly referred to as... In data processing " Separate boxes ",pandas Use in cut Method to achieve ：

• - We from csv Reading data , from Excel Read the rule table
• - Notice this is pandas The top-level method , So it is pd.cut()
• - The first 1 The parameter is passed into the judgment data column
• - The first 2 Parameters are passed into the rule table value , however cut The method must give the boundaries of all intervals . Like the rule table in this example , No higher than 120 The end boundary of the score , We need to add a large value as the end boundary
• - Parameters right, Set to False , Just to be with vlookup Just the same effect , Express ： " The boundary of the right section is open ", such as ： 120 branch , Be divided into A+ The rating
• - Parameters labels, Is the result returned

> You can see pandas You can easily read data from any data source , In this case, even if your data source is in various databases

> Be careful ,bins When there is no ascending sort , Will report a mistake . It's a very good design

It's hard to read , Look at this diagram , It should be much clearer ：

Case study 2： Automatic partition

In practical analysis , You may not know at first how to define the nodes of the rule table .

For example, you now want to divide 3 Segments , But you don't know how to define the boundary between segments . Then you can call cut Method ：

• - cut Methods the first 2 Parameters , We specify 3, To indicate a division 3 paragraph
• - Do not specify parameters labels, You can see the divided interval .

You can also specify labels：

The final division results shall be balanced in the number of each interval as much as possible , Look at the number of intervals ：

summary

• - pd.cut() , Process the data in boxes
• - Parameters bins You can specify your own rule table , You can also directly specify the number of segments
• - When specifying the number of segments , Each partition interval will be defined automatically
• - When the specified bins When the rule table is not sorted in ascending order , Will report a mistake

The source and reprint of the original text are detailed in the text , If there is any infringement , Please contact the [email protected] Delete .

Original publication time ： 2019-09-06

Participation of this paper Tencent cloud media sharing plan , You are welcome to join us , share .