[Pandas] A primer on processing CSV datasets with Pandas (data preprocessing for neural network/machine learning algorithms)

2022-08-06 06:33:04 little girl

Motivation

The data I collected with a senior colleague was in CSV format, and I had never worked with CSV data before. When I used it to train a neural network, I ran into a lot of pitfalls. I am recording them here, which should also make things easier for people who learn this later.

Processing CSV files with Pandas

There are probably quite a few packages for handling CSV files; this is a pandas tutorial (I haven't used the others, hhhh). Here I take one of my own datasets as an example to demonstrate some common processing methods.

Reading the file

  1. Statement:
    import pandas as pd
    origin_data = pd.read_csv("origin_data.csv", na_values=" NaN")
    
  2. What are the null values (NaN) in a CSV file? This is a big pitfall. I recommend passing the na_values parameter shown above when reading a CSV, so that missing values are uniformly parsed as NaN. That way, if you need to filter out missing values manually later, you can locate them by position. I tried it before: without this parameter, the missing values were not any of False, 0, or "NaN".
  3. Result:
    [image]
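One possible explanation for the pitfall above is that the file writes missing entries as " NaN" with a leading space, which pandas does not treat as missing by default. The sketch below illustrates that case with a small in-memory CSV. The sample data and the leading-space assumption are hypothetical, not taken from the original file.

```python
import io

import pandas as pd

# Hypothetical stand-in for origin_data.csv. The missing entries are written
# as " NaN" with a leading space, which is an assumption about the real file.
csv_text = "Sex,Height\nM,175\n NaN,160\nF, NaN\n"

# Default parsing: " NaN" is not one of pandas' default missing-value markers
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values, the " NaN" strings are parsed as real missing values
clean = pd.read_csv(io.StringIO(csv_text), na_values=" NaN")

print(raw.isna().sum())    # missing values detected per column, default parsing
print(clean.isna().sum())  # missing values detected per column, with na_values
```

If the leading-space assumption holds for your file, the `skipinitialspace=True` option of `read_csv` may also help, since it strips the space after each delimiter before the value is interpreted.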

Indexing a column of a DataFrame

The CSV data read in by pandas is wrapped in a structure called a DataFrame, which can be converted to a numpy array. Let's first see how to work with a DataFrame.

  1. Statement: use data.name to index a column by its label.
    origin_data.Height
    
  2. Result:
    [image]
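As a small sketch of column access and the numpy conversion mentioned above: attribute access only works when the column name is a valid Python identifier, so a name with a space needs brackets. The sample values here are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the column names mirror the ones used in this post
df = pd.DataFrame({"Height": [175, 160], "Weight change": [1.5, -0.5]})

heights = df.Height            # attribute access, same as df["Height"]
changes = df["Weight change"]  # names containing spaces need bracket access

# Convert a single column or the whole DataFrame to a numpy array
height_array = df["Height"].to_numpy()
all_values = df.to_numpy()
print(height_array.shape, all_values.shape)
```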

Deleting a column

  1. Statement: the del keyword deletes a column by its label
    del origin_data["Weight change"]
    
  2. Result: you can see that the "Weight change" column has been deleted
    [image]
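The del statement modifies the DataFrame in place. If you would rather keep the original around, `DataFrame.drop` is an alternative that returns a new object. A minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"Height": [175, 160], "Weight change": [1.5, -0.5]})

# drop returns a new DataFrame and leaves df untouched
trimmed = df.drop(columns=["Weight change"])
columns_before_del = list(df.columns)

# del removes the column from df itself
del df["Weight change"]

print(list(trimmed.columns), columns_before_del, list(df.columns))
```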

Deleting rows/columns that contain missing values

Missing values can generally either be filled in by interpolation or the affected data can simply be discarded. Here I demonstrate deleting the rows that contain NaN values.

  1. Statement: the .dropna() method deletes rows containing NaN values by default. You can use .dropna(axis=1) to delete columns containing NaN values instead. This is the most common usage; you can look up the others yourself.
    origin_data = origin_data.dropna()
    
  2. Result: you can see that there are fewer rows and no NaN values remain.
    [image]
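The options mentioned in this section can be compared side by side on a toy DataFrame. The data below is invented for illustration, and the `subset` argument and `interpolate()` call go slightly beyond what the post itself uses.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one NaN in Height and one in Weight
df = pd.DataFrame({
    "Height": [175.0, np.nan, 180.0, 165.0],
    "Weight": [70.0, 55.0, np.nan, 60.0],
    "Age": [30, 25, 41, 35],
})

rows_kept = df.dropna()                      # drop rows containing any NaN
cols_kept = df.dropna(axis=1)                # drop columns containing any NaN
height_only = df.dropna(subset=["Height"])   # only check the Height column

# Filling instead of discarding: linear interpolation between neighboring rows
filled = df.interpolate()

print(rows_kept.index.tolist(), cols_kept.columns.tolist(), len(height_only))
```

Note that `rows_kept` keeps the original row labels, so its index is no longer continuous.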

Resetting the index

After the data has been processed, its index is likely to be messed up. For example, here we deleted some rows, so the index is no longer continuous. If we then traverse the data by index label, a lookup such as .loc[i] on a removed label raises a KeyError. Therefore it is generally necessary to reset the index after the data is processed.
[image]

  1. Statement: the key point here is the drop parameter. drop=True means the old index is not needed, so it is discarded and the rows are renumbered. drop=False means the index is reset and the old index is kept as a column.
    origin_data = origin_data.reset_index(drop=True)
    
  2. Result:
    [image]
    [image]
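A minimal sketch of the problem and both settings of drop, using made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; dropna() removes the middle row
df = pd.DataFrame({"Height": [175.0, np.nan, 180.0]}).dropna()
print(df.index.tolist())  # labels that survive after dropping the middle row

try:
    df.loc[1, "Height"]   # label 1 no longer exists
except KeyError:
    print("KeyError: label 1 was removed by dropna()")

renumbered = df.reset_index(drop=True)  # old index discarded
kept = df.reset_index(drop=False)       # old index kept as an "index" column

print(renumbered.index.tolist(), kept.columns.tolist())
```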

Modifying values conditionally

During data preprocessing, we need to convert some non-numeric values to numbers, such as gender or province/city. Taking gender as an example, I want to convert M/F to 0/1 for the neural network to process.

  1. Statement: use .loc[row, column_label] to get the entry to modify, then change its value with a conditional expression
    # Relies on the continuous 0..n-1 index produced by reset_index(drop=True)
    for i in range(len(origin_data)):
        origin_data.loc[i, 'Sex'] = 1 if origin_data.loc[i, 'Sex'] == "F" else 0
    
  2. Result: here I changed the data of two columns; the result is shown in the figure
    [image]
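The row-by-row loop works, but the same rule can also be written as a whole-column operation, which does not depend on the index being continuous. This is an alternative sketch rather than the method used in the post, and the sample data is made up.

```python
import pandas as pd

# Hypothetical sample data with a non-continuous index
df = pd.DataFrame({"Sex": ["M", "F", "F", "M"]}, index=[0, 2, 5, 7])

# Same rule as the loop: "F" becomes 1, everything else becomes 0
as_flags = (df["Sex"] == "F").astype(int)

# An explicit mapping is handy when there are more than two categories
as_codes = df["Sex"].map({"M": 0, "F": 1})

df["Sex"] = as_flags
print(df["Sex"].tolist())
```

With `map`, any value missing from the dictionary becomes NaN, so unexpected categories show up instead of silently turning into 0.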

Copyright notice
Author: little girl. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/218/202208060519291274.html