# "Compared to Excel, easy learning Python data analysis," reading notes -- -- -- -- -- - data operation

2022-08-06 07:52:59

# 8 数据运算

## 8.1 算术运算

#### Excel

Use of basic formulas

=Arithmetic operands 算术运算符(+ - * /) Arithmetic operands


#### Python

SeriesObjects can perform operations such as addition, subtraction, multiplication, and division directly through arithmetic operators,返回一个新的Series对象

df = pd.DataFrame({
"A": [1, 2, 3], "B": [4, 5, 6]})

# 两列相加
print(df["A"] + df["B"])
""" 0 5 1 7 2 9 dtype: int64 """
# 两列相减
print(df["A"] - df["B"])
""" 0 -3 1 -3 2 -3 dtype: int64 """

#两列相乘
print(df["A"] * df["B"])
""" 0 4 1 10 2 18 dtype: int64 """

#两列相除
print(df["A"] / df["B"])
""" 0 0.25 1 0.40 2 0.50 dtype: float64 """

# each element of the column+1
print(df["A"] + 1)
""" 0 2 1 3 2 4 Name: A, dtype: int64 """

# each element of the column-1
print(df["A"] - 1)
""" 0 0 1 1 2 2 Name: A, dtype: int64 """

#Multiply each element of the column2
print(df["A"] * 2)
""" 0 2 1 4 2 6 Name: A, dtype: int64 """

#Divide each element of the column2
print(df["A"] / 2)
""" 0 0.5 1 1.0 2 1.5 Name: A, dtype: float64 """


## 8.2 比较运算

#### Excel

Use of basic formulas

=Compare operands 比较运算符(> = < >= <= <>[不等于]) Compare operands


#### Python

SeriesObjects can pass comparators directly（> == < >= <= !=)进行比较运算,返回一个新的Series对象,results from the comparison(True,False)组成.

df = pd.DataFrame({
"A": [1, 3, 4], "B": [2, 3, 5]})
print(df["A"] < df["B"])
""" 0 True 1 False 2 True dtype: bool """

print(df["A"] == df["B"])
""" 0 False 1 True 2 False dtype: bool """


## 8.3 汇总运算

### 8.3.1 计算总个数count

#### Excel

=COUNTA(区域)   # Counts the total number of non-blank cells in the specified range
=COUNT(区域)	  # Calculates the number of numeric data in the specified range


#### Python

df.count(axis: 'Axis' = 0,
numeric_only: 'bool' = False)

• axisSet the row(0"index")Count still put columns(1"columns")计数
• numeric_onlySet to count only numeric data rows/列(int,float,bool).
• The return value is an array by row name(axis=1)或列名(axis=0)作索引,各列/A row of non-null valuesSeries对象,

df = pd.DataFrame({

"A": [1, 3, 4, 5, 5, 6],
"B": [2, 3, 5, None, None, 2],
"C": ["hello", "world", 1, 2, 3, None]
},index=["一","二","三","四","五","六"])

result = df.count()
print(result)
""" A 6 B 4 C 5 dtype: int64 """

result = df.count(axis=1)
print(result)
""" 一 3 二 3 三 3 四 2 五 2 六 2 dtype: int64 """

result = df.count(numeric_only=True)
print(result)
""" A 6 B 4 dtype: int64 """

Series.count()

• 返回该SeriesThe total number of non-null values ​​in the object

n = df["B"].count()
print(n)
# 4


### 8.3.2 计算总和sum

#### Excel

=SUM(区域)


#### Python

df.sum(axis=None,
skipna=True,
numeric_only=None,
min_count=0)

• axisSet the row(0"index")Summation or put columns(1"columns")求和
• skipnaSet to ignore null values
• numeric_onlySet to count only numeric data rows/列(int,float,bool).
• min_countSet the minimum number of summations.How to this column/The number of numeric data in the row is less than the minimum number of sums,The summation operation will not be performed.
• The return value is an array by row name(axis=1)或列名(axis=0)作索引,各列/It consists of the sum of the data of the rowSeries对象,

df = pd.DataFrame(
{

"A": [1, 3, 4, 5, 5, 6],
"B": [2, 3, 5, 19, 20, 2],
"C": [12, 3, None, None, None, None]
},
index=["一", "二", "三", "四", "五", "六"])

print(df.sum())
""" A 24.0 B 51.0 C 15.0 dtype: float64 """
print(df.sum(axis=1))
""" 一 15.0 二 9.0 三 9.0 四 24.0 五 25.0 六 8.0 dtype: float64 """
print(df.sum(min_count=3))
""" A 24.0 B 51.0 C NaN dtype: float64 """

Series.sum()

• 返回该SeriesThe sum of the object's data

s = df["B"].sum()
print(s)
# 51


### 8.3.3 计算算术平均数mean

#### Excel

=AVGERAGE(区域)


#### Python

Parameters and return values ​​are similarsum()

df.mean(axis: 'int | None | lib.NoDefault' = <no_default>,
skipna=True,
numeric_only=None)

Series.mean()


df = pd.DataFrame(
{

"A": [1, 3, 4, 5, 5, 6],
"B": [2, 3, 5, 19, 20, 2],
"C": [12, 3, None, None, None, None]
},
index=["一", "二", "三", "四", "五", "六"])

print(df.mean())
""" A 4.0 B 8.5 C 7.5 dtype: float64 """

print(df.mean(axis=1))
""" 一 5.0 二 3.0 三 4.5 四 12.0 五 12.5 六 4.0 dtype: float64 """

print(df["B"].mean())
# 8.5


### 8.3.4 计算最值max min

#### Excel

=MAX(区域)
=MIN(区域)


#### Python

Parameters and return values ​​are similarsum()

df.max(axis: 'int | None | lib.NoDefault' = <no_default>,
skipna=True,
numeric_only=None)

df.min(axis: 'int | None | lib.NoDefault' = <no_default>,
skipna=True,
numeric_only=None)

Series.max()
Series.min()


df = pd.DataFrame(
{

"A": [1, 3, 4, 5, 5, 6],
"B": [2, 3, 5, 19, 20, 2],
"C": [12, 3, None, None, None, None]
},
index=["一", "二", "三", "四", "五", "六"])

print(df.max())
""" A 6.0 B 20.0 C 12.0 dtype: float64 """
print(df.min(axis=1))
""" 一 1.0 二 3.0 三 4.0 四 5.0 五 5.0 六 2.0 dtype: float64 """
print(df["B"].max())
# 20


### 8.3.5 计算中位数median

#### Excel

=MEDIAN(区域)


#### Python

Parameters and return values ​​are similarsum()

df.median(axis: 'int | None | lib.NoDefault' = <no_default>,
skipna=True,
numeric_only=None)

Series.sum()


df = pd.DataFrame(
{

"A": [1, 3, 4, 5, 5, 6],
"B": [2, 3, 5, 19, 20, 2],
"C": [12, 3, None, None, None, None]
},
index=["一", "二", "三", "四", "五", "六"])

print(df.median())
""" A 4.5 B 4.0 C 7.5 dtype: float64 """
print(df.median(axis=1))
""" 一 2.0 二 3.0 三 4.5 四 12.0 五 12.5 六 4.0 dtype: float64 """
print(df["B"].median())
# 4


### 8.3.6 计算众数mode

#### Excel

=MODE(区域)


#### Python

df.mode(axis: 'Axis' = 0,
numeric_only: 'bool' = False,
dropna: 'bool' = True)

Series.mode()

• There may be more than one mode.故返回值为DataFrameSeries

df = pd.DataFrame(
{

"A": [1, 3, 4, 5, 5, 6],
"B": [2, 3, 5, 19, 20, 2],
"C": [12, 3, None, None, None, None]
},
index=["一", "二", "三", "四", "五", "六"])

print(df.mode())   # C列3和12The most frequent and equal occurrences
""" A B C 0 5.0 2.0 3.0 1 NaN NaN 12.0 """

print(df.mode(axis=1))
""" 0 1 2 一 1.0 2.0 12.0 二 3.0 NaN NaN 三 4.0 5.0 NaN 四 5.0 19.0 NaN 五 5.0 20.0 NaN 六 2.0 6.0 NaN """
print(df["B"].mode())
""" 0 2 Name: B, dtype: int64 """


### 8.3.7 计算方差 var

v a r ( A n ) = 1 n [ ( a 1 − a ˉ ) 2 + ( a 2 − a ˉ ) 2 + ⋅ ⋅ ⋅ + ( a n − a ˉ ) 2 ] = 1 n ∑ i = 1 n ( a i − a ˉ ) 2 其 中 , a ˉ 为 数 据 集 A n 的 平 均 值 var (A_n)=\frac{1}{n}[(a_1-\bar a)^2+(a_2-\bar a)^2+···+(a_n-\bar a)^2]=\frac{1}{n}\sum_{i=1}^{n}(a_i-\bar a)^2\\ 其中,\bar a为数据集A_n的平均值

#### Excel

=VAR(区域)


#### Python

Parameters and return values ​​are similarsum()

df.var(axis: 'Axis' = 0,
numeric_only: 'bool' = False,
dropna: 'bool' = True)

Series.var()


df = pd.DataFrame(
{

"A": [1, 3, 4, 5, 5, 6],
"B": [2, 3, 5, 19, 20, 2],
"C": [12, 3, None, None, None, None]
},
index=["一", "二", "三", "四", "五", "六"])

print(df.var())
""" A 3.2 B 73.9 C 40.5 dtype: float64 """
print(df.var(axis=1))
""" 一 37.0 二 0.0 三 0.5 四 98.0 五 112.5 六 8.0 dtype: float64 """
print(df["B"].var())
# 73.9


### 8.3.8 计算标准差std

s t d ( A n ) = v a r ( A n ) = 1 n ∑ i = 1 n ( a i − a ˉ ) 2 std(A_n)=\sqrt {var(A_n)}=\sqrt {\frac{1}{n}\sum_{i=1}^{n}(a_i-\bar a)^2}

#### Excel

=STDEVP(区域)


#### Python

Parameters and return values ​​are similarsum()

df.std(axis: 'Axis' = 0,
numeric_only: 'bool' = False,
dropna: 'bool' = True)

Series.std()


df = pd.DataFrame(
{

"A": [1, 3, 4, 5, 5, 6],
"B": [2, 3, 5, 19, 20, 2],
"C": [12, 3, None, None, None, None]
},
index=["一", "二", "三", "四", "五", "六"])

print(df.std())
""" A 1.788854 B 8.596511 C 6.363961 dtype: float64 """
print(df.std(axis=1))
""" 一 6.082763 二 0.000000 三 0.707107 四 9.899495 五 10.606602 六 2.828427 dtype: float64 """
print(df["B"].std())
""" 8.596510920134982 """


### 8.3.9 计算分位数quantile

q 分 位 数 : 将 n 个 数 据 按 从 小 到 大 的 顺 序 排 列 , 位 于   1 + ( n − 1 ) ⋅ q   位 置 的 数 据 其 中 0 < q < 1 q分位数:将nThe data are arranged in ascending order,位于\ 1+(n-1)·q\ 位置的数据\\ 其中0<q<1

#### Excel

=PERCENTILE(区域,q)


#### Python

df.quantile(q=0.5,
axis: 'Axis' = 0,
numeric_only: 'bool' = True,
interpolation: 'str' = 'linear')

Series.quantile(q=0.5,interpolation: 'str' = 'linear')

• q设置分位数.默认0.5,即中位数.也可以是列表.
• interpolation设置插值方式.假设 j = 1 + ( n − 1 ) ⋅ q    ,   i < j < i + 1 j=1+(n-1)·q\ \ ,\ i< j<i+1
• linear：取 a i + ( j − i ) ( a i + 1 − a i ) a_i+(j-i)(a_{i+1}-a_i) .
• lower:取 a i a_i
• higher:取 a i + 1 a_{i+1}
• nearest：令 i , i + 1 i,i+1 中离 j j closer to k k ,取 a k a_k
• midpoint：取 a i + a i + 1 2 \displaystyle \frac{a_i+a_{i+1}}{2}

df = pd.DataFrame(
{

"A": [1, 3, 4, 5, 5, 6],
"B": [2, 3, 5, 19, 20, 2],
"C": [12, 3, None, None, None, None]
},
index=["一", "二", "三", "四", "五", "六"])

qdf=df.quantile([0.25,0.5,0.75])
print(qdf)
""" A B C 0.25 3.25 2.25 5.25 0.50 4.50 4.00 7.50 0.75 5.00 15.50 9.75 """

qdf=df.quantile([0.25,0.5,0.75],interpolation="higher")
print(qdf)
""" A B C 0.25 4 3 12.0 0.50 5 5 12.0 0.75 5 19 12.0 """

print(df["A"].quantile(0.25))
# 3.25


### 8.3.10 计算协方差cov

c o v [ X n , Y n ] = ∑ i = 1 n ( x i − x ˉ ) ( y i − y ˉ ) n c o v 为 正 表 示 两 列 数 据 正 相 关 . 为 负 代 表 负 相 关 . cov[X_n,Y_n]=\frac{\sum_{i=1}^{n}(x_i-\bar x)(y_i-\bar y)}{n}\\ covPositive means that the two columns of data are positively correlated.为负代表负相关.

#### Excel

=COVAR(列1,列2)


#### Python

df.cov()

• Returns the composition of the covariance between the columnsDataFrame
Series.cov(other)

• other为另一个Series对象.
• 返回两个SeriesThe covariance value of

df = pd.DataFrame(
{

"A": [1, 3, 4, 5, 5, 6],
"B": [2, 3, 5, 19, 20, 2],
"C": [12, 3, None, None, None, None]
},
index=["一", "二", "三", "四", "五", "六"])
print(df.cov())
""" A B C A 3.5 -3.5 7.400000 B -3.5 3.5 -7.400000 C 7.4 -7.4 41.866667 """

c=df["A"].cov(df["B"])
print(c)
""" -3.5 """


### 8.3.11 计算相关系数correl

The most common is the Pearson correlation coefficient:
c o r r e l ( X n , Y n ) = c o v [ X n , Y n ] v a r ( X n ) ⋅ v a r ( Y n ) = ∑ i = 1 n ( x i − x ˉ ) ( y i − y ˉ ) ∑ i = 1 n ( x i − x ˉ ) 2 ⋅ ∑ i = 1 n ( y i − y ˉ ) 2 c o r r e l 绝 对 值 越 小 说 明 相 关 性 越 弱 . 正 负 代 表 正 负 相 关 . − 1 ≤ c o r r e l ≤ 1 correl(X_n,Y_n)=\frac{cov[X_n,Y_n]}{\sqrt {var(X_n)·var(Y_n)}}=\frac{\sum_{i=1}^{n}(x_i-\bar x)(y_i-\bar y)}{\sqrt {\sum_{i=1}^n(x_i-\bar x)^2·\sum_{i=1}^n(y_i-\bar y)^2}}\\ correl The smaller the absolute value, the weaker the correlation.Positive and negative represent positive and negative correlations.-1\le correl \le 1

#### Excel

=CORREL(列1,列2)


#### Python

df.corr()

• Returns the composition of the correlation coefficients between the columnsDataFrame
Series.corr(other)

• other为另一个Series对象.
• 返回两个Series的相关系数

df = pd.DataFrame(
{

"A": [1, 2, 3, 4, 5, 6],
"B": [6, 5, 4, 3, 2, 1],
"C": [12, 1,8, 10, 10, 21]
},
index=["一", "二", "三", "四", "五", "六"])
print(df.corr())
""" A B C A 1.000000 -1.000000 0.611313 B -1.000000 1.000000 -0.611313 C 0.611313 -0.611313 1.000000 """
r=df["A"].cov(corr["B"])
print(r)
# -1.0