Data type conversion in pandas module

2022-06-24 07:50:29 AI technology base camp

author | Junxin

source | About data analysis and visualization

When cleaning data, we often run into data type errors. Today I'd like to share some techniques for data type conversion in the pandas module. It's packed with practical tips!

Import datasets and modules

As usual, we start by importing the pandas module and creating a dataset. The code is as follows:

import pandas as pd
import numpy as np

df = pd.DataFrame({
        'string_col': ['1','2','3','4'],
        'int_col': [1,2,3,4],
        'float_col': [1.1,1.2,1.3,4.7],
        'mix_col': ['a', 2, 3, 4],
        'missing_col': [1.0, 2, 3, np.nan],
        'money_col': ['£1,000.00', '£2,400.00', '£2,400.00', '£2,400.00'],
        'boolean_col': [True, False, True, True],
        'custom': ['Y', 'Y', 'N', 'N']
  })
  
df

output

[image: rendered DataFrame]

Let's first look at the data type of each column. The code is as follows:

df.dtypes

output

string_col      object
int_col          int64
float_col      float64
mix_col         object
missing_col    float64
money_col       object
boolean_col       bool
custom          object
dtype: object

We can also call the info() method to get the same information, along with non-null counts and memory usage. The code is as follows:

df.info()

output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   string_col   4 non-null      object 
 1   int_col      4 non-null      int64  
 2   float_col    4 non-null      float64
 3   mix_col      4 non-null      object 
 4   missing_col  3 non-null      float64
 5   money_col    4 non-null      object 
 6   boolean_col  4 non-null      bool   
 7   custom       4 non-null      object 
dtypes: bool(1), float64(2), int64(1), object(4)
memory usage: 356.0+ bytes

Data type conversion

Next, we start converting data types. The most commonly used tool is the astype() method. For example, we can convert floating-point data to integers (the fractional part is truncated, not rounded). The code is as follows:

df['float_col'] = df['float_col'].astype('int')
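The truncation is easy to overlook, so here is a minimal sketch of the difference between converting directly and rounding first. The standalone series `s` is a hypothetical copy of the “float_col” values, and the explicit 'int64' target is an assumption chosen to keep the result the same on every platform.

```python
import pandas as pd

# Hypothetical standalone series mirroring 'float_col' from the article
s = pd.Series([1.1, 1.2, 1.3, 4.7])

# astype() simply drops the fractional part
truncated = s.astype('int64')

# Rounding first gives the nearest integer instead
rounded = s.round().astype('int64')

print(truncated.tolist())  # [1, 1, 1, 4]
print(rounded.tolist())    # [1, 1, 1, 5]
```

Which of the two is appropriate depends on the data; the article's code uses the direct conversion.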

Or we can convert the “string_col” column to integers. The code is as follows:

df['string_col'] = df['string_col'].astype('int')

To save memory, we can also convert to a smaller integer type such as int8, int16 or int32:

df['string_col'] = df['string_col'].astype('int8')
df['string_col'] = df['string_col'].astype('int16')
df['string_col'] = df['string_col'].astype('int32')

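Then let's take a look at the data type of each column after conversion. Note that astype('int') is platform-dependent: it typically gives int64, and the int32 shown for “float_col” below is what Windows with NumPy versions before 2.0 produces.

df.dtypes

output

string_col       int32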
int_col          int64
float_col        int32
mix_col         object
missing_col    float64
money_col       object
boolean_col       bool
custom          object
dtype: object
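To make the memory argument concrete, the following sketch compares the bytes used by the same values stored as int64 and as int8. The series `s` is a hypothetical stand-in for “string_col”; the `downcast` option of to_numeric() is not used in the article and is shown here only as one possible way to let pandas pick the smallest integer type.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for 'string_col'
s = pd.Series(['1', '2', '3', '4'])

as_int64 = s.astype('int64')
as_int8 = s.astype('int8')

# Bytes used by the values alone (index excluded)
print(as_int64.memory_usage(index=False))  # 32
print(as_int8.memory_usage(index=False))   # 4

# Check the representable range before choosing a narrow type
print(np.iinfo('int8').min, np.iinfo('int8').max)  # -128 127

# Let pandas choose the smallest integer type that fits the data
auto = pd.to_numeric(s, downcast='integer')
print(auto.dtype)  # int8
```

A narrow type only pays off when every value fits its range, so it is worth checking the minimum and maximum of the column first.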

But when a column contains more than one data type, the conversion raises an error, as with the “mix_col” column:

df['mix_col'] = df['mix_col'].astype('int')

output

ValueError: invalid literal for int() with base 10: 'a'

In this case we can call the to_numeric() function with the errors parameter. Setting errors='coerce' turns values that cannot be parsed into NaN. The code is as follows:

df['mix_col'] = pd.to_numeric(df['mix_col'], errors='coerce')
df

output

[image: rendered DataFrame]
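Because errors='coerce' silently replaces bad values, it can help to check what was lost. The sketch below works on a hypothetical copy of “mix_col” and keeps the original series around so the offending entries can be listed; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical copy of 'mix_col'
s = pd.Series(['a', 2, 3, 4])

converted = pd.to_numeric(s, errors='coerce')

# The result is float64 because NaN is a floating-point value
print(converted.dtype)         # float64

# Count and list the entries that could not be parsed
print(converted.isna().sum())  # 1
bad_rows = s[converted.isna()]
print(bad_rows.tolist())       # ['a']
```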

Missing values also cause an error during type conversion. The code is as follows:

df['missing_col'].astype('int')

output

ValueError: Cannot convert non-finite values (NA or inf) to integer

We can first call the fillna() method to replace the missing values with another value, and then convert the type. The code is as follows:

df["missing_col"] = df["missing_col"].fillna(0).astype('int')
df

output

[image: rendered DataFrame]
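Filling with 0 changes the data, which is not always acceptable. As an alternative that the article does not cover, pandas also offers a nullable integer type, written 'Int64' with a capital I, which can hold missing values. A minimal sketch on a hypothetical copy of “missing_col”:

```python
import numpy as np
import pandas as pd

# Hypothetical copy of 'missing_col'
s = pd.Series([1.0, 2, 3, np.nan])

# Approach from the article: fill the gap, then convert
filled = s.fillna(0).astype('int64')
print(filled.tolist())        # [1, 2, 3, 0]

# Alternative: nullable integer dtype keeps the missing value
nullable = s.astype('Int64')
print(nullable.dtype)         # Int64
print(nullable.isna().sum())  # 1
```

The nullable type preserves the fact that a value was missing, at the cost of using a pandas extension type rather than a plain NumPy integer.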

Finally, there is the “money_col” column. It contains currency symbols and thousands separators, so the first step is to strip those characters, and then convert the data type. The code is as follows:

df['money_replace'] = df['money_col'].str.replace('£', '').str.replace(',','')
df['money_replace'] = pd.to_numeric(df['money_replace'])
df['money_replace']

output

0    1000.0
1    2400.0
2    2400.0
3    2400.0
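The two chained replace calls can also be written as a single regular-expression replacement. This is only a sketch of an equivalent approach, using a hypothetical series `money` with two of the article's values; the character class `[£,]` matches either the pound sign or a comma.

```python
import pandas as pd

# Hypothetical series with values taken from 'money_col'
money = pd.Series(['£1,000.00', '£2,400.00'])

# Remove the currency symbol and the thousands separator in one pass
cleaned = money.str.replace(r'[£,]', '', regex=True)

amounts = pd.to_numeric(cleaned)
print(amounts.tolist())  # [1000.0, 2400.0]
print(amounts.dtype)     # float64
```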

When encountering time series data

To convert data in date format, we usually call the to_datetime() function. The code is as follows:

df = pd.DataFrame({'date': ['3/10/2015', '3/11/2015', '3/12/2015'],
                   'value': [2, 3, 4]})
df

output

[image: rendered DataFrame]

Let's first look at the data type of each column:

df.dtypes

output

date     object
value     int64
dtype: object

The call to to_datetime() looks like this:

pd.to_datetime(df['date'])

output

0   2015-03-10
1   2015-03-11
2   2015-03-12
Name: date, dtype: datetime64[ns]
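As the output shows, strings such as '3/10/2015' are read month first by default. If the data is actually written day first, the dayfirst parameter of to_datetime() changes the interpretation. The sketch below uses a hypothetical standalone series with the same strings to show both readings.

```python
import pandas as pd

# Hypothetical standalone series with the article's date strings
dates = pd.Series(['3/10/2015', '3/11/2015', '3/12/2015'])

# Default: month/day/year
month_first = pd.to_datetime(dates)
print(month_first.dt.month.tolist())  # [3, 3, 3]

# dayfirst=True: day/month/year
day_first = pd.to_datetime(dates, dayfirst=True)
print(day_first.dt.month.tolist())    # [10, 11, 12]
```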

Of course, this does not mean that you cannot call the astype() method; the result is the same as above. Note that pandas 2.0 and later require an explicit unit such as 'datetime64[ns]'. The code is as follows:

df['date'].astype('datetime64[ns]')

When the dates are in a custom format, we also call to_datetime(), but the format parameter must match the layout of the data. Here the format string reads the day before the month:

df = pd.DataFrame({'date': ['2016-6-10 20:30:0', 
                            '2016-7-1 19:45:30', 
                            '2013-10-12 4:5:1'],
                   'value': [2, 3, 4]})

df['date'] = pd.to_datetime(df['date'], format="%Y-%d-%m %H:%M:%S")

output

[image: rendered DataFrame]
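Since the format string "%Y-%d-%m %H:%M:%S" puts the day before the month, it is worth checking how each value was read. The sketch below repeats the conversion on a hypothetical standalone series and then shows, with an invented value that does not fit the format, how errors='coerce' turns a mismatch into NaT instead of raising.

```python
import pandas as pd

# Hypothetical standalone series with the article's strings
raw = pd.Series(['2016-6-10 20:30:0',
                 '2016-7-1 19:45:30',
                 '2013-10-12 4:5:1'])

fmt = '%Y-%d-%m %H:%M:%S'
parsed = pd.to_datetime(raw, format=fmt)

# The second field is read as the day, the third as the month
print(parsed.dt.day.tolist())    # [6, 7, 10]
print(parsed.dt.month.tolist())  # [10, 1, 12]

# Invented value: 14 is not a valid month, so it cannot match the format
bad = pd.to_datetime(pd.Series(['2016-13-14 0:0:0']),
                     format=fmt, errors='coerce')
print(bad.isna().tolist())       # [True]
```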

Is it possible to achieve the goal in one step?

Finally, someone may ask whether several columns can be converted in one step. Of course they can, by passing a dictionary of column names and target types to astype(). The code is as follows:

df = pd.DataFrame({'date_start': ['3/10/2000', '3/11/2000', '3/12/2000'],
                   'date_end': ['3/11/2000', '3/12/2000', '3/13/2000'],
                   'string_col': ['1','2','3'],
                   'float_col': [1.1,1.2,1.3],
                   'value': [2, 3, 4]})
                   
df = df.astype({
    'date_start': 'datetime64[ns]',
    'date_end': 'datetime64[ns]',
    'string_col': 'int32',
    'float_col': 'int64',
    'value': 'float32',
})

Let's take a look at the results

df

output

[image: rendered DataFrame]
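The rendered table does not show the column types, so a quick check of dtypes confirms that the dictionary was applied. The sketch below is a reduced version of the example with hypothetical variable names; it assumes the explicit 'datetime64[ns]' spelling, which recent pandas versions require.

```python
import pandas as pd

# Reduced, hypothetical version of the article's DataFrame
raw = pd.DataFrame({'date_start': ['3/10/2000', '3/11/2000'],
                    'string_col': ['1', '2'],
                    'float_col': [1.1, 1.2],
                    'value': [2, 3]})

# One call converts every listed column to its target type
converted = raw.astype({
    'date_start': 'datetime64[ns]',
    'string_col': 'int32',
    'float_col': 'int64',
    'value': 'float32',
})

print(converted.dtypes)
```

Columns that are not listed in the dictionary keep their original type.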


copyright notice
author[AI technology base camp],Please bring the original link to reprint, thank you.
https://en.pythonmana.com/2022/175/202206240348504151.html
