current position:Home>Pandas advanced tutorial: category data type

Pandas advanced tutorial: category data type

2022-06-24 08:04:58Procedural stuff

brief introduction

Pandas There is a special data type called category. It represents a category , Generally used in statistical classification , Like gender , Blood type , classification , Levels and so on . It's kind of like java Medium enum.

Let's talk about it in detail today category Usage of .

establish category

Use Series establish

Creating Series Add dtype=”category” You can create category 了 .category In two parts , Part of it is order, Part of it is literal :

In [1]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

In [2]: s
Out[2]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']

Can be DF Medium Series Convert to category:

In [3]: df = pd.DataFrame({"A": ["a", "b", "c", "a"]})

In [4]: df["B"] = df["A"].astype("category")

In [5]: df["B"]
Out[32]: 
0    a
1    b
2    c
3    a
Name: B, dtype: category
Categories (3, object): [a, b, c]

You can create a pandas.Categorical , Pass it as a parameter to Series:

In [10]: raw_cat = pd.Categorical(
   ....:     ["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=False
   ....: )
   ....: 

In [11]: s = pd.Series(raw_cat)

In [12]: s
Out[12]: 
0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): ['b', 'c', 'd']

Use DF establish

establish DataFrame When , Can also be passed in dtype=”category”:

In [17]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")}, dtype="category")

In [18]: df.dtypes
Out[18]: 
A    category
B    category
dtype: object

DF Medium A and B It's all one category:

In [19]: df["A"]
Out[19]: 
0    a
1    b
2    c
3    a
Name: A, dtype: category
Categories (3, object): ['a', 'b', 'c']

In [20]: df["B"]
Out[20]: 
0    b
1    c
2    c
3    d
Name: B, dtype: category
Categories (3, object): ['b', 'c', 'd']

Or use df.astype(“category”) take DF All of the Series Convert to category:

In [21]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

In [22]: df_cat = df.astype("category")

In [23]: df_cat.dtypes
Out[23]: 
A    category
B    category
dtype: object

Create controls

By default dtype=’category’ created category The default value is used :

  1. Categories It's inferred from the data .
  2. Categories There is no order of size .

You can display create CategoricalDtype To change the two default values above :

In [26]: from pandas.api.types import CategoricalDtype

In [27]: s = pd.Series(["a", "b", "c", "a"])

In [28]: cat_type = CategoricalDtype(categories=["b", "c", "d"], ordered=True)

In [29]: s_cat = s.astype(cat_type)

In [30]: s_cat
Out[30]: 
0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): ['b' < 'c' < 'd']

alike CategoricalDtype It can also be used in DF in :

In [31]: from pandas.api.types import CategoricalDtype

In [32]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

In [33]: cat_type = CategoricalDtype(categories=list("abcd"), ordered=True)

In [34]: df_cat = df.astype(cat_type)

In [35]: df_cat["A"]
Out[35]: 
0    a
1    b
2    c
3    a
Name: A, dtype: category
Categories (4, object): ['a' < 'b' < 'c' < 'd']

In [36]: df_cat["B"]
Out[36]: 
0    b
1    c
2    c
3    d
Name: B, dtype: category
Categories (4, object): ['a' < 'b' < 'c' < 'd']

Convert to original type

Use Series.astype(original_dtype) perhaps np.asarray(categorical) Can be Category Convert to original type :

In [39]: s = pd.Series(["a", "b", "c", "a"])

In [40]: s
Out[40]: 
0    a
1    b
2    c
3    a
dtype: object

In [41]: s2 = s.astype("category")

In [42]: s2
Out[42]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']

In [43]: s2.astype(str)
Out[43]: 
0    a
1    b
2    c
3    a
dtype: object

In [44]: np.asarray(s2)
Out[44]: array(['a', 'b', 'c', 'a'], dtype=object)

categories The operation of

obtain category Properties of

Categorical The data are categories and ordered Two attributes . Can pass s.cat.categories and s.cat.ordered To get :

In [57]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

In [58]: s.cat.categories
Out[58]: Index(['a', 'b', 'c'], dtype='object')

In [59]: s.cat.ordered
Out[59]: False

rearrangement category The order of :

In [60]: s = pd.Series(pd.Categorical(["a", "b", "c", "a"], categories=["c", "b", "a"]))

In [61]: s.cat.categories
Out[61]: Index(['c', 'b', 'a'], dtype='object')

In [62]: s.cat.ordered
Out[62]: False

rename categories

By giving s.cat.categories Assignments can be renamed categories:

In [67]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

In [68]: s
Out[68]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']

In [69]: s.cat.categories = ["Group %s" % g for g in s.cat.categories]

In [70]: s
Out[70]: 
0    Group a
1    Group b
2    Group c
3    Group a
dtype: category
Categories (3, object): ['Group a', 'Group b', 'Group c']

Use rename_categories Can achieve the same effect :

In [71]: s = s.cat.rename_categories([1, 2, 3])

In [72]: s
Out[72]: 
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [1, 2, 3]

Or use dictionary objects :

# You can also pass a dict-like object to map the renaming
In [73]: s = s.cat.rename_categories({1: "x", 2: "y", 3: "z"})

In [74]: s
Out[74]: 
0    x
1    y
2    z
3    x
dtype: category
Categories (3, object): ['x', 'y', 'z']

Use add_categories add to category

have access to add_categories To add category:

In [77]: s = s.cat.add_categories([4])

In [78]: s.cat.categories
Out[78]: Index(['x', 'y', 'z', 4], dtype='object')

In [79]: s
Out[79]: 
0    x
1    y
2    z
3    x
dtype: category
Categories (4, object): ['x', 'y', 'z', 4]

Use remove_categories Delete category

In [80]: s = s.cat.remove_categories([4])

In [81]: s
Out[81]: 
0    x
1    y
2    z
3    x
dtype: category
Categories (3, object): ['x', 'y', 'z']

Delete unused ones cagtegory

In [82]: s = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c", "d"]))

In [83]: s
Out[83]: 
0    a
1    b
2    a
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [84]: s.cat.remove_unused_categories()
Out[84]: 
0    a
1    b
2    a
dtype: category
Categories (2, object): ['a', 'b']

Reset cagtegory

Use set_categories() You can add and delete at the same time category operation :

In [85]: s = pd.Series(["one", "two", "four", "-"], dtype="category")

In [86]: s
Out[86]: 
0     one
1     two
2    four
3       -
dtype: category
Categories (4, object): ['-', 'four', 'one', 'two']

In [87]: s = s.cat.set_categories(["one", "two", "three", "four"])

In [88]: s
Out[88]: 
0     one
1     two
2    four
3     NaN
dtype: category
Categories (4, object): ['one', 'two', 'three', 'four']

category Sort

If category Created with ordered=True , Then you can sort them :

In [91]: s = pd.Series(["a", "b", "c", "a"]).astype(CategoricalDtype(ordered=True))

In [92]: s.sort_values(inplace=True)

In [93]: s
Out[93]: 
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a' < 'b' < 'c']

In [94]: s.min(), s.max()
Out[94]: ('a', 'c')

have access to as_ordered() perhaps as_unordered() To force sort or not sort :

In [95]: s.cat.as_ordered()
Out[95]: 
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a' < 'b' < 'c']

In [96]: s.cat.as_unordered()
Out[96]: 
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

Reorder

Use Categorical.reorder_categories() It's possible to do something about the existing category Reorder :

In [103]: s = pd.Series([1, 2, 3, 1], dtype="category")

In [104]: s = s.cat.reorder_categories([2, 3, 1], ordered=True)

In [105]: s
Out[105]: 
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]

Multi column sorting

sort_values Support multi column sorting :

In [109]: dfs = pd.DataFrame(
   .....:     {
   .....:         "A": pd.Categorical(
   .....:             list("bbeebbaa"),
   .....:             categories=["e", "a", "b"],
   .....:             ordered=True,
   .....:         ),
   .....:         "B": [1, 2, 1, 2, 2, 1, 2, 1],
   .....:     }
   .....: )
   .....: 

In [110]: dfs.sort_values(by=["A", "B"])
Out[110]: 
   A  B
2  e  1
3  e  2
7  a  1
6  a  2
0  b  1
5  b  1
1  b  2
4  b  2

Comparison operation

If it was set when it was created orderedTrue , that category Can be compared between . Support ==, !=, >, >=, <, and <= These operators .

In [113]: cat = pd.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))

In [114]: cat_base = pd.Series([2, 2, 2]).astype(CategoricalDtype([3, 2, 1], ordered=True))

In [115]: cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))
In [119]: cat > cat_base
Out[119]: 
0     True
1    False
2    False
dtype: bool

In [120]: cat > 2
Out[120]: 
0     True
1    False
2    False
dtype: bool

Other operating

Cagetory In essence, it's still a Series, therefore Series The operation of category Basically, it can be used , such as : Series.min(), Series.max() and Series.mode().

value_counts:

In [131]: s = pd.Series(pd.Categorical(["a", "b", "c", "c"], categories=["c", "a", "b", "d"]))

In [132]: s.value_counts()
Out[132]: 
c    2
a    1
b    1
d    0
dtype: int64

DataFrame.sum():

In [133]: columns = pd.Categorical(
   .....:     ["One", "One", "Two"], categories=["One", "Two", "Three"], ordered=True
   .....: )
   .....: 

In [134]: df = pd.DataFrame(
   .....:     data=[[1, 2, 3], [4, 5, 6]],
   .....:     columns=pd.MultiIndex.from_arrays([["A", "B", "B"], columns]),
   .....: )
   .....: 

In [135]: df.sum(axis=1, level=1)
Out[135]: 
   One  Two  Three
0    3    3      0
1    9    6      0

Groupby:

In [136]: cats = pd.Categorical(
   .....:     ["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"]
   .....: )
   .....: 

In [137]: df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})

In [138]: df.groupby("cats").mean()
Out[138]: 
      values
cats        
a        1.0
b        2.0
c        4.0
d        NaN

In [139]: cats2 = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])

In [140]: df2 = pd.DataFrame(
   .....:     {
   .....:         "cats": cats2,
   .....:         "B": ["c", "d", "c", "d"],
   .....:         "values": [1, 2, 3, 4],
   .....:     }
   .....: )
   .....: 

In [141]: df2.groupby(["cats", "B"]).mean()
Out[141]: 
        values
cats B        
a    c     1.0
     d     2.0
b    c     3.0
     d     4.0
c    c     NaN
     d     NaN

Pivot tables:

In [142]: raw_cat = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])

In [143]: df = pd.DataFrame({"A": raw_cat, "B": ["c", "d", "c", "d"], "values": [1, 2, 3, 4]})

In [144]: pd.pivot_table(df, values="values", index=["A", "B"])
Out[144]: 
     values
A B        
a c       1
  d       2
b c       3
  d       4

This article has been included in http://www.flydean.com/08-python-pandas-category/ The most popular interpretation , The deepest dry goods , The most concise tutorial , There are so many tricks you don't know about waiting for you to discover ! Welcome to my official account. :「 Program those things 」, Know technology , Know you better !

copyright notice
author[Procedural stuff],Please bring the original link to reprint, thank you.
https://en.pythonmana.com/2022/175/20210628191923949o.html

Random recommended