
[Python Data Science Quick Start Series | 03] Playing with Data Extraction: Numpy Indexing and Slicing

2022-08-06 04:14:19 · machine future

This is the 41st article from Machine Future.

Originally published at: https://blog.csdn.net/RobotFutures/article/details/126176343


1. Indexing and Slicing

Like Python lists, NumPy's ndarray data container supports access and modification via indexing and slicing. For the basics of indexing and slicing, see the previous post [Python Sequence Containers].

Here we use the iris dataset as an example.

# Load the iris dataset
import numpy as np

data = []
with open(file='iris.txt', mode='r') as f:
    # Print the header row
    print(f"title:{f.readline()}")

    # f.readlines() reads all remaining lines of the file
    for line in f.readlines():  # one line at a time
        data.append(line.strip().split(','))
title:sepal_length,sepal_width,petal_length,petal_width,target_names
# As the output above shows, the dataset has 5 header fields, corresponding to 5 columns of data
# The first 4 columns are iris features; the last column is the iris class
# sepal_length - sepal length
# sepal_width - sepal width
# petal_length - petal length
# petal_width - petal width
# target_names - target name

# Take the first 2 rows to inspect the data values and types
data[:2]
[['5.1', '3.5', '1.4', '0.2', '0'], ['4.9', '3.0', '1.4', '0.2', '0']]
# The first 2 rows extracted above show that the values are strings; when converting data from a list to an ndarray, the data type can be converted directly to float
data = np.array(data, dtype=float)

# Like lists, ndarray objects can also be accessed with slices
# Here we extract the first 4 rows to inspect the range and type of the data values
data[:4]  
array([[5.1, 3.5, 1.4, 0.2, 0. ],
       [4.9, 3. , 1.4, 0.2, 0. ],
       [4.7, 3.2, 1.3, 0.2, 0. ],
       [4.6, 3.1, 1.5, 0.2, 0. ]])
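As a side note, the read/split/convert steps above can also be done in a single call with np.genfromtxt. The sketch below uses a small in-memory sample standing in for iris.txt (same 5-column layout with one header row, as the output above suggests):

```python
import numpy as np
from io import StringIO

# np.genfromtxt accepts file-like objects too; a tiny sample standing
# in for iris.txt (same 5-column layout, header row included)
sample = StringIO(
    "sepal_length,sepal_width,petal_length,petal_width,target_names\n"
    "5.1,3.5,1.4,0.2,0\n"
    "4.9,3.0,1.4,0.2,0\n"
)
data = np.genfromtxt(sample, delimiter=',', skip_header=1)
print(data.shape)  # (2, 5)
print(data.dtype)  # float64
```

With the real file, `np.genfromtxt('iris.txt', delimiter=',', skip_header=1)` would replace the whole manual loop.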

1.1 How to extract data from a specified column

Question: during data preprocessing we may want to analyze the correlation between a feature column and the label column, which means extracting a single named column of data. So how do we extract one column?
Use slicing. First check the data's shape:

data.shape
(150, 5)

The data has 150 rows and 5 columns. To extract the 3rd column, do the following:

# the : before the comma selects all rows; the 2 after the comma selects the 3rd column (index 2)
data[:,2]
array([1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.6, 1.4,
       1.1, 1.2, 1.5, 1.3, 1.4, 1.7, 1.5, 1.7, 1.5, 1. , 1.7, 1.9, 1.6,
       1.6, 1.5, 1.4, 1.6, 1.6, 1.5, 1.5, 1.4, 1.5, 1.2, 1.3, 1.4, 1.3,
       1.5, 1.3, 1.3, 1.3, 1.6, 1.9, 1.4, 1.6, 1.4, 1.5, 1.4, 4.7, 4.5,
       4.9, 4. , 4.6, 4.5, 4.7, 3.3, 4.6, 3.9, 3.5, 4.2, 4. , 4.7, 3.6,
       4.4, 4.5, 4.1, 4.5, 3.9, 4.8, 4. , 4.9, 4.7, 4.3, 4.4, 4.8, 5. ,
       4.5, 3.5, 3.8, 3.7, 3.9, 5.1, 4.5, 4.5, 4.7, 4.4, 4.1, 4. , 4.4,
       4.6, 4. , 3.3, 4.2, 4.2, 4.2, 4.3, 3. , 4.1, 6. , 5.1, 5.9, 5.6,
       5.8, 6.6, 4.5, 6.3, 5.8, 6.1, 5.1, 5.3, 5.5, 5. , 5.1, 5.3, 5.5,
       6.7, 6.9, 5. , 5.7, 4.9, 6.7, 4.9, 5.7, 6. , 4.8, 4.9, 5.6, 5.8,
       6.1, 6.4, 5.6, 5.1, 5.6, 6.1, 5.6, 5.5, 4.8, 5.4, 5.6, 5.1, 5.1,
       5.9, 5.7, 5.2, 5. , 5.2, 5.4, 5.1])
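Note that data[:, 2] returns a 1-D array. If you need to keep the result as a column (a 2-D array of shape (150, 1)), slice with 2:3 instead of indexing with 2 — a useful distinction when feeding data to APIs that expect 2-D input. A minimal sketch with a small stand-in array:

```python
import numpy as np

data = np.arange(20, dtype=float).reshape(4, 5)  # stand-in for the (150, 5) iris array

col_1d = data[:, 2]    # integer index -> 1-D array, shape (4,)
col_2d = data[:, 2:3]  # slice -> 2-D column, shape (4, 1)

print(col_1d.shape)  # (4,)
print(col_2d.shape)  # (4, 1)
```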

1.2 How to separate feature data and label data

Question: now that the dataset is loaded, to train a model we need to separate the feature data (the inputs) from the label data (the targets). After training on the feature inputs we obtain predicted labels, which are then compared against the label data (the actual results). How do we separate them?

# Use slicing to extract the first 4 columns as feature data
X_data = data[:, :4]  # 或者 X_data = data[:, :-1]

# Use slicing to extract the last column as label data
y_data = data[:, -1]

data.shape, X_data.shape, y_data.shape
((150, 5), (150, 4), (150,))
# View the first 5 samples of the unsplit dataset
data[:5]
array([[5.1, 3.5, 1.4, 0.2, 0. ],
       [4.9, 3. , 1.4, 0.2, 0. ],
       [4.7, 3.2, 1.3, 0.2, 0. ],
       [4.6, 3.1, 1.5, 0.2, 0. ],
       [5. , 3.6, 1.4, 0.2, 0. ]])
# View the first 5 samples of the feature data
X_data[:5]
array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])
# View the first 5 samples of the label data
y_data[:5]
array([0., 0., 0., 0., 0.])

The results show that the feature data and the labels have been separated.
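As an alternative sketch, the same split can be written with np.hsplit, cutting the columns at index 4. The slicing form above is more common, but hsplit reads naturally when cutting an array into named pieces:

```python
import numpy as np

data = np.arange(15, dtype=float).reshape(3, 5)  # stand-in for the (150, 5) array

# Split the columns at index 4: features = columns 0-3, labels = column 4
X_part, y_part = np.hsplit(data, [4])

print(X_part.shape)  # (3, 4)
print(y_part.shape)  # (3, 1) -- note hsplit keeps the 2-D shape, unlike data[:, -1]
```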

1.3 How to dynamically load small batches of samples instead of loading the entire dataset at once

Question: with a very large dataset containing a huge number of samples, the data must be loaded in batches, say 32 samples per batch. How do we extract the batches?

To load batch-sized data dynamically, use a generator. If you are not familiar with generators, see this blogger's earlier article: How to gracefully load massive data.

# First verify by extracting the first 32 samples
X_data[:32,:]
array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4]])
# Add an auto-incrementing index to slide a 32-sample window: X_data[32*i:32*(i+1),:]

# Define a generator that dynamically yields batch-sized chunks
def get_data(X_data):
    # len(X_data)//32: the final group of fewer than 32 samples is dropped
    for i in range(len(X_data)//32):
        yield X_data[32*i:32*(i+1),:]

f_batch_data = get_data(X_data=X_data)
print(next(f_batch_data))

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]]
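Note that the generator above drops the final incomplete batch: for 150 samples and batch size 32, the last 150 - 4*32 = 22 samples are never yielded. A sketch of a variant that also yields the remainder, with the batch size as a parameter:

```python
import numpy as np

def get_batches(X, batch_size=32):
    # Step through the array in batch_size strides; a slice past the
    # end simply returns the shorter final batch instead of failing
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size]

X = np.arange(150 * 4, dtype=float).reshape(150, 4)  # stand-in for X_data
sizes = [len(b) for b in get_batches(X)]
print(sizes)  # [32, 32, 32, 32, 22]
```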

2. Advanced Indexing

NumPy's ndarray objects support integer array indexing, boolean indexing, and fancy indexing.

2.1 Integer Array Indexing

For example, in the code below the index arrays [0,4,7] and [0, 1, 2] extract the elements at positions (0,0), (4,1), and (7,2): the two arrays are paired element by element as row and column coordinates.

X_data[[0,4,7],[0, 1, 2]]
array([5.1, 3.6, 1.5])
X_data[:7,:]
array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3]])

Compare this against the original X_data feature data above; the results match the description.
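The pairing of row and column index arrays can be verified element by element. This small sketch (with a made-up array) shows that X[rows, cols] is equivalent to picking each X[r, c] pair by hand:

```python
import numpy as np

X = np.arange(12).reshape(4, 3)

rows = [0, 2, 3]
cols = [0, 1, 2]

picked = X[rows, cols]  # pairs (0,0), (2,1), (3,2)
manual = np.array([X[r, c] for r, c in zip(rows, cols)])

print(picked)                    # [ 0  7 11]
print((picked == manual).all())  # True
```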

2.2 Boolean Indexing

Boolean indexing is used heavily in machine learning and deep learning and is very useful for filtering data.
Its format is X[condition expression].

Question: we now need to find what proportion of the whole sample has sepal length > 4.5. How do we handle this?

# The boolean mask is built from column 0 (sepal_length) across all rows; X_data[mask] keeps only the matching rows

X_data_column0 = X_data[X_data[:,0]>4.5]
percent = len(X_data_column0)/len(X_data)
percent
0.9666666666666667

As the result shows, samples with sepal length > 4.5 make up a very large proportion.
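Conditions can also be combined. Note that boolean masks use the elementwise operators & and | (with parentheses around each condition), not Python's `and`/`or` — a common pitfall. A sketch with a made-up array and thresholds:

```python
import numpy as np

X = np.array([[5.1, 3.5], [4.4, 2.9], [6.3, 3.3], [4.8, 3.0]])

# Rows where column 0 > 4.5 AND column 1 > 3.0 -- parentheses are required
mask = (X[:, 0] > 4.5) & (X[:, 1] > 3.0)
print(mask)     # [ True False  True False]
print(X[mask])  # rows [5.1, 3.5] and [6.3, 3.3]
```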

2.3 Fancy Indexing

To be covered in a future post.
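As a brief preview until that post: "fancy indexing" generally refers to indexing with an integer array to pick out whole rows, possibly reordered or repeated. A minimal sketch with a made-up array:

```python
import numpy as np

X = np.arange(12).reshape(4, 3)

# An integer array as the row index selects (and reorders) whole rows
print(X[[3, 0, 0]])
# [[ 9 10 11]
#  [ 0  1  2]
#  [ 0  1  2]]
```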

Closing notes:

  • About the blog: focused on the AIoT field, chasing the pulse of the coming era and recording technical growth along the way!
  • About this column: master the common data science libraries Numpy, Matplotlib, and Pandas from 0 to 1.
  • Intended audience: beginning AI learners
  • Column plan: a series of posts on stepping into artificial intelligence will follow; stay tuned

Copyright notice
Author: machine future. Please include the original link when reprinting. Thank you.
https://en.pythonmana.com/2022/218/202208060347136978.html
