current position:Home>Similarities and differences of five pandas combinatorial functions

Similarities and differences of five pandas combinatorial functions

2022-01-30 02:33:18 Crayon Xi

1. explain

In the daily processing of data , Often encounter different dataframe The connection of 、 Combination and other operations , At the beginning , It's going to be a little tricky , After all, the following functions are similar , Easy to confuse .

concat

join merge combine append

Let's explain with the simplest example 、 Distinguish the role of several functions .

>>> import pandas as pd

>>> df0 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> df0
   a  b
0  1  4
1  2  5
2  3  6

>>> df1 = pd.DataFrame({"c": [2, 3, 4], "d": [5, 6, 7]})
>>> df1
   c  d
0  2  5
1  3  6
2  4  7
 Copy code 

2. concat

concat, full pinyin “concatenation”, Allow horizontal or vertical side-by-side Combined data .

2.1 Line splicing

When merging data with the same column (2 individual df Keep the column names consistent , Line splicing ), You can specify the axis as 0( The default value is ) To call .

When axis=0 It should be understood as , Splice in the direction of the column ( Splice down ), That is to splice them in rows .

The column names must be consistent , What you get is what you want ,df1.rename(columns={"c": "a", "d": "b"}).

>>> pd.concat([df0, df1.rename(columns={"c": "a", "d": "b"})]
                , axis=0)
 
   a  b
0  1  4
1  2  5
2  3  6
0  2  5
1  3  6
2  4  7

 Copy code 

If the column names are inconsistent , You will not get the desired row splicing result .

In [9]: pd.concat([df0,df1],axis=0)
Out[9]:
     a    b    c    d
0  1.0  4.0  NaN  NaN
1  2.0  5.0  NaN  NaN
2  3.0  6.0  NaN  NaN
0  NaN  NaN  2.0  5.0
1  NaN  NaN  3.0  6.0
2  NaN  NaN  4.0  7.0
 Copy code 

Summary :

Line splicing , If no extra columns are generated , Note that the column names are consistent

2.2 Column splicing

Splicing by columns ,axis=1 It can be understood as , Splice in the direction of the row ( Splice to the right ), That is to splice the columns .

>>> pd.concat([df0, df1], axis=1)
   a  b  c  d
0  1  4  2  5
1  2  5  3  6
2  3  6  4  7
 Copy code 

By default , When horizontally combining data ( Along the column ) when , Will try to use the index . When they are not the same , Will see NaN Fill in non overlapping data , As shown below :

>>> df2 = df1.copy()
>>> df2.index = [1, 2, 3]
>>> pd.concat([df0, df2], axis=1)
     a    b    c    d
0  1.0  4.0  NaN  NaN
1  2.0  5.0  2.0  5.0
2  3.0  6.0  3.0  6.0
3  NaN  NaN  4.0  7.0
 Copy code 

If you want to unify the index for splicing , You have to reset their indexes first :

>>> pd.concat([df0.reset_index(drop=True), 
               df2.reset_index(drop=True)], axis=1)
   a  b  c  d
0  1  4  2  5
1  2  5  3  6
2  3  6  4  7
 Copy code 

Summary :

Column splicing , If no extra rows are generated , Note that the index is consistent

3. join

And concat comparison ,join Dedicated to Use index join DataFrame Columns between objects .

df0,df1 The index of is consistent :

>>> df0.join(df1)
   a  b  c  d
0  1  4  2  5
1  2  5  3  6
2  3  6  4  7
 Copy code 

When the index is inconsistent , The connection is left on the left by default DataFrame The line of ( The default left table is the drive table ); If the right side DataFrame There is no left side in the DataFrame Rows that match the index in , On the right side DataFrame Be deleted 、 use Null fill , As shown below :

>>> df0.join(df2)
   a  b    c    d
0  1  4  NaN  NaN
1  2  5  2.0  5.0
2  3  6  3.0  6.0
 Copy code 

You can also set how Parameter to change the drive table , That is to say SQL Several association connections in .

# "right" uses df2’s index
>>> df0.join(df2, how="right")
     a    b  c  d
1  2.0  5.0  2  5
2  3.0  6.0  3  6
3  NaN  NaN  4  7# "outer" uses the union
>>> df0.join(df2, how="outer")
     a    b    c    d
0  1.0  4.0  NaN  NaN
1  2.0  5.0  2.0  5.0
2  3.0  6.0  3.0  6.0
3  NaN  NaN  4.0  7.0# "inner" uses the intersection
>>> df0.join(df2, how="inner")
   a  b  c  d
1  2  5  2  5
2  3  6  3  6
 Copy code 

Summary :

join Is an index based connection , Only column connections , And sql The association is similar to

4. merge

And join comparison ,merge More general , You can merge columns and indexes .

stay a Column :

>>> df0.merge(df1.rename(columns={"c": "a"}), 
               on="a", how="inner")
   a  b  d
0  2  5  5
1  3  6  6
 Copy code 

If you want to keep the associated columns at the same time , You can write like this :

>>> df0.merge(df1, left_on="a", right_on="c")
   a  b  c  d
0  2  5  2  5
1  3  6  3  6
 Copy code 

When two DataFrame Objects have the same columns , Instead of merging ,suffixes Parameter sets the suffix for renaming these columns ; By default , Left 、 The suffixes of the right data frame are “_x” and “_y”, You can also customize it .

>>> df0.merge(df1.rename(columns={"c": "a", "d": "b"}), 
               on="a", 
               how="outer", 
               suffixes=("_l", "_r"))

   a  b_l  b_r
0  1  4.0  NaN
1  2  5.0  5.0
2  3  6.0  6.0
3  4  NaN  7.0
 Copy code 

5. combine

combine Functions also act on 2 individual DataFrame Between objects , Combine by column , But it is very different from the above functions .

combine The special thing about the function is that it requires a Function parameter . This function takes two series, Every series Corresponds to each DataFrame Merge columns in , And return a series The final value of the operation as an element of the same column .

It's a little twisted , Take an example :

>>> def taking_larger_square(s1, s2):
...     return s1 * s1 if s1.sum() > s2.sum() else s2 * s2

>>> df0.combine(df1.rename(columns={"c": "a", "d": "b"}), 
                  taking_larger_square)
    a   b
0   4  25
1   9  36
2  16  49

 Copy code 

take_larger_square Function pair df0 and df1 Medium a And df0 and df1 Medium b Column to operate . In two columns a And two columns b Between ,taking_larger_square Take the square of the value in the larger column . under these circumstances ,df1 Of a and b The column will be used as the square , Produce the final value , This happens to be df1 Of a、b Is greater than df0 Of a、b, If df1 One of the big 、 One small , Then take the largest one as the square .

In [13]: df0,df1
Out[13]:
(   a  b
 0  1  2
 1  2  3
 2  3  4,
    c  d
 0  4  1
 1  5  2
 2  6  3)

 In [11]: df0.combine(df1.rename(columns={"c": "a", "d": "b"}),
                      taking_larger_square)
Out[11]:
    a   b
0  16   4
1  25   9
2  36  16

 Copy code 

Summary :

use combine Are combined , Mainly aimed at 2 individual DataFrame Object series Do function processing , Take one as the result .

6. append

append The function is dedicated to Append rows to existing DataFrame object , Create a new object .

>>> df0.append(df1.rename(columns={"c": "a", "d": "b"}))

   a  b
0  1  4
1  2  5
2  3  6
0  2  5
1  3  6
2  4  7
 Copy code 

This sum concat( ,axis=0) The effect is the same .

append It is unique in that it can also add dict object , This gives us the flexibility to append different types of data . Be careful , Must be ignore_index Set to True, because dict Objects have no DataFrame Available index information .

>>> df0.append({"a": 1, "b": 2}, ignore_index=True)

   a  b
0  1  4
1  2  5
2  3  6
3  1  2
 Copy code 

7. summary

  • concat: Combine data by row and column

  • join: Use index , Merge data by row

  • merge: Merge data by column , More like database connection operation

  • combine: Merge data by column , With inter column ( Same column ) Element operation

  • append: With DataFrame or dict Append data line by line in the form of object


Welcome to follow individual public number :Distinct Count

copyright notice
author[Crayon Xi],Please bring the original link to reprint, thank you.
https://en.pythonmana.com/2022/01/202201300233174832.html

Random recommended