# [Python Data Science Quick Start Series | 03] Playing with Data Extraction: NumPy Indexing and Slicing

2022-08-06 04:14:19

# 1. Indexing and Slicing

Like a list, NumPy's data container `ndarray` supports indexing and slicing, both for reading and for modifying data. For the basics of indexing and slicing, see the earlier post "Python Data Containers: Sequences".

``````# Load the iris dataset
import numpy as np

data = []
with open(file='iris.txt', mode='r') as f:
    # Print the header row
    print(f"title:{f.readline().strip()}")

    for line in f.readlines():  # Process one row at a time
        data.append(line.strip().split(','))
``````
``````title:sepal_length,sepal_width,petal_length,petal_width,target_names
``````
``````# The output above shows that the dataset has 5 header fields, one per column
# The first 4 columns are iris features; the last column is the iris class
# sepal_length - sepal length
# sepal_width - sepal width
# petal_length - petal length
# petal_width - petal width
# target_names - target (class) name

# Take the first 2 rows to inspect the values and their types
data[:2]
``````
``````[['5.1', '3.5', '1.4', '0.2', '0'], ['4.9', '3.0', '1.4', '0.2', '0']]
``````
``````# The first 2 rows extracted above show that the values are strings,
# so convert them to float while turning data from a list into an ndarray
data = np.array(data, dtype=float)

``````
``````# Like lists, ndarray objects can be accessed with slices
# Extract the first 4 rows to check the value ranges and the data type
data[:4]
``````
``````array([[5.1, 3.5, 1.4, 0.2, 0. ],
[4.9, 3. , 1.4, 0.2, 0. ],
[4.7, 3.2, 1.3, 0.2, 0. ],
[4.6, 3.1, 1.5, 0.2, 0. ]])
``````
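The string-to-float conversion can be tried without the iris file. The sketch below uses two hand-typed rows (copied from the output above) in place of the file contents; `rows` and `arr` are illustrative names, not part of the original code.

```python
import numpy as np

# Two rows as produced by line.strip().split(','): every value is a string
rows = [['5.1', '3.5', '1.4', '0.2', '0'],
        ['4.9', '3.0', '1.4', '0.2', '0']]

# dtype=float asks NumPy to parse each string as a floating-point number
arr = np.array(rows, dtype=float)

print(arr.dtype)   # float64
print(arr.shape)   # (2, 5)
print(arr[:1])     # row slicing works the same way as on a list
```

If any field cannot be parsed as a number, the conversion raises a `ValueError`, which is a useful early check on the file contents.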

## 1.1 How to extract data from a specified column

Use slicing to extract a column. First, check the shape of the data:

``````data.shape
``````
``````(150, 5)
``````

``````# The ":" before the comma selects all rows; the 2 after the comma selects the 3rd column
data[:,2]
``````
``````array([1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.6, 1.4,
1.1, 1.2, 1.5, 1.3, 1.4, 1.7, 1.5, 1.7, 1.5, 1. , 1.7, 1.9, 1.6,
1.6, 1.5, 1.4, 1.6, 1.6, 1.5, 1.5, 1.4, 1.5, 1.2, 1.3, 1.4, 1.3,
1.5, 1.3, 1.3, 1.3, 1.6, 1.9, 1.4, 1.6, 1.4, 1.5, 1.4, 4.7, 4.5,
4.9, 4. , 4.6, 4.5, 4.7, 3.3, 4.6, 3.9, 3.5, 4.2, 4. , 4.7, 3.6,
4.4, 4.5, 4.1, 4.5, 3.9, 4.8, 4. , 4.9, 4.7, 4.3, 4.4, 4.8, 5. ,
4.5, 3.5, 3.8, 3.7, 3.9, 5.1, 4.5, 4.5, 4.7, 4.4, 4.1, 4. , 4.4,
4.6, 4. , 3.3, 4.2, 4.2, 4.2, 4.3, 3. , 4.1, 6. , 5.1, 5.9, 5.6,
5.8, 6.6, 4.5, 6.3, 5.8, 6.1, 5.1, 5.3, 5.5, 5. , 5.1, 5.3, 5.5,
6.7, 6.9, 5. , 5.7, 4.9, 6.7, 4.9, 5.7, 6. , 4.8, 4.9, 5.6, 5.8,
6.1, 6.4, 5.6, 5.1, 5.6, 6.1, 5.6, 5.5, 4.8, 5.4, 5.6, 5.1, 5.1,
5.9, 5.7, 5.2, 5. , 5.2, 5.4, 5.1])
``````
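One detail worth knowing when extracting a column: an integer index drops the column axis, while a one-element slice keeps it. A minimal sketch on a small stand-in array (`demo` is a hypothetical 3x5 array, not the iris data):

```python
import numpy as np

# A small 3x5 stand-in for the (150, 5) iris array
demo = np.arange(15, dtype=float).reshape(3, 5)

col_1d = demo[:, 2]      # integer index: the column axis is dropped
col_2d = demo[:, 2:3]    # one-element slice: the column axis is kept

print(col_1d.shape)      # (3,)
print(col_2d.shape)      # (3, 1)
```

The 2-D form is handy when a later step expects a matrix with one column rather than a flat vector.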

## 1.2 How to separate feature data and label data

``````# Use a slice to extract the first 4 columns as the feature data
X_data = data[:, :4]  # or: X_data = data[:, :-1]

# Use a slice to extract the last column as the label data
y_data = data[:, -1]

data.shape, X_data.shape, y_data.shape
``````
``````((150, 5), (150, 4), (150,))
``````
``````# View the first 5 samples of the unsplit dataset
data[:5]
``````
``````array([[5.1, 3.5, 1.4, 0.2, 0. ],
[4.9, 3. , 1.4, 0.2, 0. ],
[4.7, 3.2, 1.3, 0.2, 0. ],
[4.6, 3.1, 1.5, 0.2, 0. ],
[5. , 3.6, 1.4, 0.2, 0. ]])
``````
``````# View the first 5 samples of the feature data
X_data[:5]
``````
``````array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2]])
``````
``````# View the first 5 samples of the label data
y_data[:5]
``````
``````array([0., 0., 0., 0., 0.])
``````
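Basic slices such as `data[:, :-1]` return views that share memory with the original array, so writing to the slice also changes the source. The sketch below illustrates this on a small stand-in array (`demo`, `X_view`, and `X_copy` are illustrative names); calling `.copy()` is one way to get an independent array.

```python
import numpy as np

demo = np.arange(10, dtype=float).reshape(2, 5)

X_view = demo[:, :-1]          # basic slice: a view onto demo's memory
X_copy = demo[:, :-1].copy()   # independent copy

X_view[0, 0] = 99.0            # writes through to demo
X_copy[0, 1] = -1.0            # does not touch demo

print(demo[0, 0])                         # 99.0
print(demo[0, 1])                         # 1.0
print(np.shares_memory(demo, X_view))     # True
print(np.shares_memory(demo, X_copy))     # False
```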

Next, load the data dynamically, one batch at a time. This uses a generator; if you are not familiar with generators, see the author's earlier post: How to gracefully load massive data.

``````# First, check the result of extracting the first 32 samples
X_data[:32,:]
``````
``````array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
[5. , 3.4, 1.5, 0.2],
[4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1],
[5.4, 3.7, 1.5, 0.2],
[4.8, 3.4, 1.6, 0.2],
[4.8, 3. , 1.4, 0.1],
[4.3, 3. , 1.1, 0.1],
[5.8, 4. , 1.2, 0.2],
[5.7, 4.4, 1.5, 0.4],
[5.4, 3.9, 1.3, 0.4],
[5.1, 3.5, 1.4, 0.3],
[5.7, 3.8, 1.7, 0.3],
[5.1, 3.8, 1.5, 0.3],
[5.4, 3.4, 1.7, 0.2],
[5.1, 3.7, 1.5, 0.4],
[4.6, 3.6, 1. , 0.2],
[5.1, 3.3, 1.7, 0.5],
[4.8, 3.4, 1.9, 0.2],
[5. , 3. , 1.6, 0.2],
[5. , 3.4, 1.6, 0.4],
[5.2, 3.5, 1.5, 0.2],
[5.2, 3.4, 1.4, 0.2],
[4.7, 3.2, 1.6, 0.2],
[4.8, 3.1, 1.6, 0.2],
[5.4, 3.4, 1.5, 0.4]])
``````
``````# Add an incrementing index i to slide over the data 32 samples at a time: X_data[32*i:32*(i+1), :]

# Define a generator that extracts one batch of samples per step
def get_data(X_data):
    # len(X_data)//32 counts only full batches, so leftover samples
    # that do not fill a batch of 32 are not yielded
    for i in range(len(X_data)//32):
        yield X_data[32*i:32*(i+1), :]

f_batch_data = get_data(X_data=X_data)
print(next(f_batch_data))

``````
``````[[5.1 3.5 1.4 0.2]
[4.9 3.  1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5.  3.6 1.4 0.2]
[5.4 3.9 1.7 0.4]
[4.6 3.4 1.4 0.3]
[5.  3.4 1.5 0.2]
[4.4 2.9 1.4 0.2]
[4.9 3.1 1.5 0.1]
[5.4 3.7 1.5 0.2]
[4.8 3.4 1.6 0.2]
[4.8 3.  1.4 0.1]
[4.3 3.  1.1 0.1]
[5.8 4.  1.2 0.2]
[5.7 4.4 1.5 0.4]
[5.4 3.9 1.3 0.4]
[5.1 3.5 1.4 0.3]
[5.7 3.8 1.7 0.3]
[5.1 3.8 1.5 0.3]
[5.4 3.4 1.7 0.2]
[5.1 3.7 1.5 0.4]
[4.6 3.6 1.  0.2]
[5.1 3.3 1.7 0.5]
[4.8 3.4 1.9 0.2]
[5.  3.  1.6 0.2]
[5.  3.4 1.6 0.4]
[5.2 3.5 1.5 0.2]
[5.2 3.4 1.4 0.2]
[4.7 3.2 1.6 0.2]
[4.8 3.1 1.6 0.2]
[5.4 3.4 1.5 0.4]]
``````
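With 150 samples and a batch size of 32, the generator above yields 4 full batches and skips the remaining 22 samples. If those should be included, one possible variation steps through the start offsets instead of counting full batches. This is a sketch, not the author's code; `iter_batches`, `batch_size`, and `X_demo` are hypothetical names.

```python
import numpy as np

def iter_batches(X, batch_size=32):
    """Yield consecutive row batches; the last one may be shorter."""
    for start in range(0, len(X), batch_size):
        # Slicing past the end is safe: NumPy clips the slice to the array length
        yield X[start:start + batch_size, :]

# Stand-in with the same shape as X_data
X_demo = np.zeros((150, 4))
batch_sizes = [len(batch) for batch in iter_batches(X_demo)]
print(batch_sizes)   # [32, 32, 32, 32, 22]
```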

# 2. Advanced Indexing

NumPy's `ndarray` object supports array indexing, boolean indexing, and fancy indexing.

## 2.1 Array Indexing

``````X_data[[0,4,7],[0, 1, 2]]
``````
``````array([5.1, 3.6, 1.5])
``````
``````X_data[:7,:]
``````
``````array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3]])
``````
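As the first 7 rows confirm, `X_data[[0,4,7],[0, 1, 2]]` pairs the two index lists element by element, picking positions (0, 0), (4, 1), and (7, 2) rather than a 3x3 block. The sketch below shows the same pairing on a stand-in array whose values encode their position, and contrasts it with `np.ix_`, which builds the cross product of rows and columns (`demo`, `paired`, and `block` are illustrative names).

```python
import numpy as np

demo = np.arange(32).reshape(8, 4)   # demo[r, c] == 4 * r + c

# Paired lookup: elements at (0, 0), (4, 1), (7, 2)
paired = demo[[0, 4, 7], [0, 1, 2]]
print(paired.tolist())   # [0, 17, 30]

# Cross product: rows 0, 4, 7 combined with columns 0, 1, 2
block = demo[np.ix_([0, 4, 7], [0, 1, 2])]
print(block.shape)       # (3, 3)
```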

## 2.2 Boolean Indexing

Boolean indexing is used heavily in machine learning and deep learning, and is especially useful for filtering data.

``````# In X_data[:, 0], the first index selects the sample rows (all of them) and the second selects the first column

X_data_column0 = X_data[X_data[:,0]>4.5]
percent = len(X_data_column0)/len(X_data)
percent
``````
``````0.9666666666666667
``````

This shows that samples with a sepal length greater than 4.5 make up a very high proportion of the dataset (about 96.7%).
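The proportion can also be read directly off the boolean mask, since `True` counts as 1 when averaged. The sketch below uses a few hand-typed sepal-length values rather than the full dataset; `sepal_length`, `mask`, and `both` are illustrative names.

```python
import numpy as np

# A handful of sepal-length values taken from the rows shown earlier
sepal_length = np.array([5.1, 4.9, 4.7, 4.6, 5.0, 4.4])

mask = sepal_length > 4.5          # boolean array, one entry per sample
print(mask.tolist())               # [True, True, True, True, True, False]

# True counts as 1, so the mean of the mask is the proportion of matches
print(mask.mean())                 # 5 of the 6 samples match

# Conditions combine with & and |, each wrapped in parentheses
both = (sepal_length > 4.5) & (sepal_length < 5.0)
print(sepal_length[both].tolist()) # [4.9, 4.7, 4.6]
```

The parentheses matter: `&` binds more tightly than the comparison operators, so leaving them out changes the meaning of the expression.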

## 2.3 Fancy Indexing

To be covered in a future post.

