current position:Home>"Compared to Excel, easy learning Python data analysis," reading notes -- -- -- -- -- - data operation
"Compared to Excel, easy learning Python data analysis," reading notes -- -- -- -- -- - data operation
2022-08-06 07:52:59【Muzi Hikari】
8 数据运算
8.1 算术运算
Excel
Use of basic formulas
=Arithmetic operands 算术运算符(+ - * /) Arithmetic operands
Python
Series
Objects can perform operations such as addition, subtraction, multiplication, and division directly through arithmetic operators,返回一个新的Series
对象
df = pd.DataFrame({
"A": [1, 2, 3], "B": [4, 5, 6]})
# 两列相加
print(df["A"] + df["B"])
""" 0 5 1 7 2 9 dtype: int64 """
# 两列相减
print(df["A"] - df["B"])
""" 0 -3 1 -3 2 -3 dtype: int64 """
#两列相乘
print(df["A"] * df["B"])
""" 0 4 1 10 2 18 dtype: int64 """
#两列相除
print(df["A"] / df["B"])
""" 0 0.25 1 0.40 2 0.50 dtype: float64 """
# each element of the column+1
print(df["A"] + 1)
""" 0 2 1 3 2 4 Name: A, dtype: int64 """
# each element of the column-1
print(df["A"] - 1)
""" 0 0 1 1 2 2 Name: A, dtype: int64 """
#Multiply each element of the column2
print(df["A"] * 2)
""" 0 2 1 4 2 6 Name: A, dtype: int64 """
#Divide each element of the column2
print(df["A"] / 2)
""" 0 0.5 1 1.0 2 1.5 Name: A, dtype: float64 """
8.2 比较运算
Excel
Use of basic formulas
=Compare operands 比较运算符(> = < >= <= <>[不等于]) Compare operands
Python
Series
Objects can pass comparators directly(> == < >= <= !=)进行比较运算,返回一个新的Series
对象,results from the comparison(True
,False
)组成.
df = pd.DataFrame({
"A": [1, 3, 4], "B": [2, 3, 5]})
print(df["A"] < df["B"])
""" 0 True 1 False 2 True dtype: bool """
print(df["A"] == df["B"])
""" 0 False 1 True 2 False dtype: bool """
8.3 汇总运算
8.3.1 计算总个数count
Excel
=COUNTA(区域) # Counts the total number of non-blank cells in the specified range
=COUNT(区域) # Calculates the number of numeric data in the specified range
Python
df.count(axis: 'Axis' = 0,
numeric_only: 'bool' = False)
axis
Set the row(0
或"index"
)Count still put columns(1
或"columns"
)计数numeric_only
Set to count only numeric data rows/列(int,float,bool).- The return value is an array by row name(
axis=1
)或列名(axis=0
)作索引,各列/A row of non-null valuesSeries
对象,
如:
df = pd.DataFrame({
"A": [1, 3, 4, 5, 5, 6],
"B": [2, 3, 5, None, None, 2],
"C": ["hello", "world", 1, 2, 3, None]
},index=["一","二","三","四","五","六"])
result = df.count()
print(result)
""" A 6 B 4 C 5 dtype: int64 """
result = df.count(axis=1)
print(result)
""" 一 3 二 3 三 3 四 2 五 2 六 2 dtype: int64 """
result = df.count(numeric_only=True)
print(result)
""" A 6 B 4 dtype: int64 """
Series.count()
- 返回该
Series
The total number of non-null values in the object
如:
n = df["B"].count()
print(n)
# 4
8.3.2 计算总和sum
Excel
=SUM(区域)
Python
df.sum(axis=None,
skipna=True,
numeric_only=None,
min_count=0)
axis
Set the row(0
或"index"
)Summation or put columns(1
或"columns"
)求和skipna
Set to ignore null valuesnumeric_only
Set to count only numeric data rows/列(int,float,bool).min_count
Set the minimum number of summations.How to this column/The number of numeric data in the row is less than the minimum number of sums,The summation operation will not be performed.- The return value is an array by row name(
axis=1
)或列名(axis=0
)作索引,各列/It consists of the sum of the data of the rowSeries
对象,
如:
df = pd.DataFrame(
{
"A": [1, 3, 4, 5, 5, 6],
"B": [2, 3, 5, 19, 20, 2],
"C": [12, 3, None, None, None, None]
},
index=["一", "二", "三", "四", "五", "六"])
print(df.sum())
""" A 24.0 B 51.0 C 15.0 dtype: float64 """
print(df.sum(axis=1))
""" 一 15.0 二 9.0 三 9.0 四 24.0 五 25.0 六 8.0 dtype: float64 """
print(df.sum(min_count=3))
""" A 24.0 B 51.0 C NaN dtype: float64 """
Series.sum()
- 返回该
Series
The sum of the object's data
如:
s = df["B"].sum()
print(s)
# 51
8.3.3 计算算术平均数mean
算 术 平 均 数 : m e a n ( A n ) = a 1 + a 2 + ⋅ ⋅ ⋅ + a n n = 1 n ∑ i = 1 n a i 算术平均数:mean(A_n)=\frac{a_1+a_2+···+a_n}{n}=\frac{1}{n}\sum_{i=1}^{n}a_i \\ 算术平均数:mean(An)=na1+a2+⋅⋅⋅+an=n1i=1∑nai
Excel
=AVGERAGE(区域)
Python
Parameters and return values are similarsum()
df.mean(axis: 'int | None | lib.NoDefault' = <no_default>,
skipna=True,
numeric_only=None)
Series.mean()
如:
df = pd.DataFrame(
{
"A": [1, 3, 4, 5, 5, 6],
"B": [2, 3, 5, 19, 20, 2],
"C": [12, 3, None, None, None, None]
},
index=["一", "二", "三", "四", "五", "六"])
print(df.mean())
""" A 4.0 B 8.5 C 7.5 dtype: float64 """
print(df.mean(axis=1))
""" 一 5.0 二 3.0 三 4.5 四 12.0 五 12.5 六 4.0 dtype: float64 """
print(df["B"].mean())
# 8.5
8.3.4 计算最值max min
Excel
=MAX(区域)
=MIN(区域)
Python
Parameters and return values are similarsum()
df.max(axis: 'int | None | lib.NoDefault' = <no_default>,
skipna=True,
numeric_only=None)
df.min(axis: 'int | None | lib.NoDefault' = <no_default>,
skipna=True,
numeric_only=None)
Series.max()
Series.min()
如:
df = pd.DataFrame(
{
"A": [1, 3, 4, 5, 5, 6],
"B": [2, 3, 5, 19, 20, 2],
"C": [12, 3, None, None, None, None]
},
index=["一", "二", "三", "四", "五", "六"])
print(df.max())
""" A 6.0 B 20.0 C 12.0 dtype: float64 """
print(df.min(axis=1))
""" 一 1.0 二 3.0 三 4.0 四 5.0 五 5.0 六 2.0 dtype: float64 """
print(df["B"].max())
# 20
8.3.5 计算中位数median
中 位 数 : 将 数 据 从 小 到 大 排 列 , 位 于 正 中 间 的 一 个 数 据 或 两 个 数 据 的 平 均 值 m e d i a n ( A 2 k ) = a k + a k + 1 2 m e d i a n ( A 2 k + 1 ) = a k + 1 其 中 , a k ≤ a k + 1 , k = 1 , 2 , 3 ⋅ ⋅ ⋅ 中位数:将数据从小到大排列,The average of a data or two data in the middle\\ median \ (A_{2k})=\frac{a_k+a_{k+1}}{2} \\ median \ (A_{2k+1})=a_{k+1}\\ 其中,a_k\le a_{k+1},k=1,2,3··· 中位数:将数据从小到大排列,位于正中间的一个数据或两个数据的平均值median (A2k)=2ak+ak+1median (A2k+1)=ak+1其中,ak≤ak+1,k=1,2,3⋅⋅⋅
Excel
=MEDIAN(区域)
Python
Parameters and return values are similarsum()
df.median(axis: 'int | None | lib.NoDefault' = <no_default>,
skipna=True,
numeric_only=None)
Series.sum()
如:
df = pd.DataFrame(
{
"A": [1, 3, 4, 5, 5, 6],
"B": [2, 3, 5, 19, 20, 2],
"C": [12, 3, None, None, None, None]
},
index=["一", "二", "三", "四", "五", "六"])
print(df.median())
""" A 4.5 B 4.0 C 7.5 dtype: float64 """
print(df.median(axis=1))
""" 一 2.0 二 3.0 三 4.5 四 12.0 五 12.5 六 4.0 dtype: float64 """
print(df["B"].median())
# 4
8.3.6 计算众数mode
众 数 : 一 定 区 域 的 数 据 中 出 现 次 数 最 多 的 数 据 值 众数:The data value that occurs most frequently in the data of a certain area 众数:一定区域的数据中出现次数最多的数据值
Excel
=MODE(区域)
Python
参数类似sum()
df.mode(axis: 'Axis' = 0,
numeric_only: 'bool' = False,
dropna: 'bool' = True)
Series.mode()
- There may be more than one mode.故返回值为
DataFrame
或Series
如:
df = pd.DataFrame(
{
"A": [1, 3, 4, 5, 5, 6],
"B": [2, 3, 5, 19, 20, 2],
"C": [12, 3, None, None, None, None]
},
index=["一", "二", "三", "四", "五", "六"])
print(df.mode()) # C列3和12The most frequent and equal occurrences
""" A B C 0 5.0 2.0 3.0 1 NaN NaN 12.0 """
print(df.mode(axis=1))
""" 0 1 2 一 1.0 2.0 12.0 二 3.0 NaN NaN 三 4.0 5.0 NaN 四 5.0 19.0 NaN 五 5.0 20.0 NaN 六 2.0 6.0 NaN """
print(df["B"].mode())
""" 0 2 Name: B, dtype: int64 """
8.3.7 计算方差 var
v a r ( A n ) = 1 n [ ( a 1 − a ˉ ) 2 + ( a 2 − a ˉ ) 2 + ⋅ ⋅ ⋅ + ( a n − a ˉ ) 2 ] = 1 n ∑ i = 1 n ( a i − a ˉ ) 2 其 中 , a ˉ 为 数 据 集 A n 的 平 均 值 var (A_n)=\frac{1}{n}[(a_1-\bar a)^2+(a_2-\bar a)^2+···+(a_n-\bar a)^2]=\frac{1}{n}\sum_{i=1}^{n}(a_i-\bar a)^2\\ 其中,\bar a为数据集A_n的平均值 var(An)=n1[(a1−aˉ)2+(a2−aˉ)2+⋅⋅⋅+(an−aˉ)2]=n1i=1∑n(ai−aˉ)2其中,aˉ为数据集An的平均值
Excel
=VAR(区域)
Python
Parameters and return values are similarsum()
df.var(axis: 'Axis' = 0,
numeric_only: 'bool' = False,
dropna: 'bool' = True)
Series.var()
如:
df = pd.DataFrame(
{
"A": [1, 3, 4, 5, 5, 6],
"B": [2, 3, 5, 19, 20, 2],
"C": [12, 3, None, None, None, None]
},
index=["一", "二", "三", "四", "五", "六"])
print(df.var())
""" A 3.2 B 73.9 C 40.5 dtype: float64 """
print(df.var(axis=1))
""" 一 37.0 二 0.0 三 0.5 四 98.0 五 112.5 六 8.0 dtype: float64 """
print(df["B"].var())
# 73.9
8.3.8 计算标准差std
s t d ( A n ) = v a r ( A n ) = 1 n ∑ i = 1 n ( a i − a ˉ ) 2 std(A_n)=\sqrt {var(A_n)}=\sqrt {\frac{1}{n}\sum_{i=1}^{n}(a_i-\bar a)^2} std(An)=var(An)=n1i=1∑n(ai−aˉ)2
Excel
=STDEVP(区域)
Python
Parameters and return values are similarsum()
df.std(axis: 'Axis' = 0,
numeric_only: 'bool' = False,
dropna: 'bool' = True)
Series.std()
如:
df = pd.DataFrame(
{
"A": [1, 3, 4, 5, 5, 6],
"B": [2, 3, 5, 19, 20, 2],
"C": [12, 3, None, None, None, None]
},
index=["一", "二", "三", "四", "五", "六"])
print(df.std())
""" A 1.788854 B 8.596511 C 6.363961 dtype: float64 """
print(df.std(axis=1))
""" 一 6.082763 二 0.000000 三 0.707107 四 9.899495 五 10.606602 六 2.828427 dtype: float64 """
print(df["B"].std())
""" 8.596510920134982 """
8.3.9 计算分位数quantile
q 分 位 数 : 将 n 个 数 据 按 从 小 到 大 的 顺 序 排 列 , 位 于 1 + ( n − 1 ) ⋅ q 位 置 的 数 据 其 中 0 < q < 1 q分位数:将nThe data are arranged in ascending order,位于\ 1+(n-1)·q\ 位置的数据\\ 其中0<q<1 q分位数:将n个数据按从小到大的顺序排列,位于 1+(n−1)⋅q 位置的数据其中0<q<1
Excel
=PERCENTILE(区域,q)
Python
df.quantile(q=0.5,
axis: 'Axis' = 0,
numeric_only: 'bool' = True,
interpolation: 'str' = 'linear')
Series.quantile(q=0.5,interpolation: 'str' = 'linear')
q
设置分位数.默认0.5,即中位数.也可以是列表.interpolation
设置插值方式.假设 j = 1 + ( n − 1 ) ⋅ q , i < j < i + 1 j=1+(n-1)·q\ \ ,\ i< j<i+1 j=1+(n−1)⋅q , i<j<i+1linear
:取 a i + ( j − i ) ( a i + 1 − a i ) a_i+(j-i)(a_{i+1}-a_i) ai+(j−i)(ai+1−ai) .lower
:取 a i a_i aihigher
:取 a i + 1 a_{i+1} ai+1nearest
:令 i , i + 1 i,i+1 i,i+1中离 j j j closer to k k k,取 a k a_k akmidpoint
:取 a i + a i + 1 2 \displaystyle \frac{a_i+a_{i+1}}{2} 2ai+ai+1
如:
df = pd.DataFrame(
{
"A": [1, 3, 4, 5, 5, 6],
"B": [2, 3, 5, 19, 20, 2],
"C": [12, 3, None, None, None, None]
},
index=["一", "二", "三", "四", "五", "六"])
qdf=df.quantile([0.25,0.5,0.75])
print(qdf)
""" A B C 0.25 3.25 2.25 5.25 0.50 4.50 4.00 7.50 0.75 5.00 15.50 9.75 """
qdf=df.quantile([0.25,0.5,0.75],interpolation="higher")
print(qdf)
""" A B C 0.25 4 3 12.0 0.50 5 5 12.0 0.75 5 19 12.0 """
print(df["A"].quantile(0.25))
# 3.25
8.3.10 计算协方差cov
c o v [ X n , Y n ] = ∑ i = 1 n ( x i − x ˉ ) ( y i − y ˉ ) n c o v 为 正 表 示 两 列 数 据 正 相 关 . 为 负 代 表 负 相 关 . cov[X_n,Y_n]=\frac{\sum_{i=1}^{n}(x_i-\bar x)(y_i-\bar y)}{n}\\ covPositive means that the two columns of data are positively correlated.为负代表负相关. cov[Xn,Yn]=n∑i=1n(xi−xˉ)(yi−yˉ)cov为正表示两列数据正相关.为负代表负相关.
Excel
=COVAR(列1,列2)
Python
df.cov()
- Returns the composition of the covariance between the columns
DataFrame
Series.cov(other)
other
为另一个Series
对象.- 返回两个
Series
The covariance value of
如:
df = pd.DataFrame(
{
"A": [1, 3, 4, 5, 5, 6],
"B": [2, 3, 5, 19, 20, 2],
"C": [12, 3, None, None, None, None]
},
index=["一", "二", "三", "四", "五", "六"])
print(df.cov())
""" A B C A 3.5 -3.5 7.400000 B -3.5 3.5 -7.400000 C 7.4 -7.4 41.866667 """
c=df["A"].cov(df["B"])
print(c)
""" -3.5 """
8.3.11 计算相关系数correl
The most common is the Pearson correlation coefficient:
c o r r e l ( X n , Y n ) = c o v [ X n , Y n ] v a r ( X n ) ⋅ v a r ( Y n ) = ∑ i = 1 n ( x i − x ˉ ) ( y i − y ˉ ) ∑ i = 1 n ( x i − x ˉ ) 2 ⋅ ∑ i = 1 n ( y i − y ˉ ) 2 c o r r e l 绝 对 值 越 小 说 明 相 关 性 越 弱 . 正 负 代 表 正 负 相 关 . − 1 ≤ c o r r e l ≤ 1 correl(X_n,Y_n)=\frac{cov[X_n,Y_n]}{\sqrt {var(X_n)·var(Y_n)}}=\frac{\sum_{i=1}^{n}(x_i-\bar x)(y_i-\bar y)}{\sqrt {\sum_{i=1}^n(x_i-\bar x)^2·\sum_{i=1}^n(y_i-\bar y)^2}}\\ correl The smaller the absolute value, the weaker the correlation.Positive and negative represent positive and negative correlations.-1\le correl \le 1 correl(Xn,Yn)=var(Xn)⋅var(Yn)cov[Xn,Yn]=∑i=1n(xi−xˉ)2⋅∑i=1n(yi−yˉ)2∑i=1n(xi−xˉ)(yi−yˉ)correl绝对值越小说明相关性越弱.正负代表正负相关.−1≤correl≤1
Excel
=CORREL(列1,列2)
Python
df.corr()
- Returns the composition of the correlation coefficients between the columns
DataFrame
Series.corr(other)
other
为另一个Series
对象.- 返回两个
Series
的相关系数
如:
df = pd.DataFrame(
{
"A": [1, 2, 3, 4, 5, 6],
"B": [6, 5, 4, 3, 2, 1],
"C": [12, 1,8, 10, 10, 21]
},
index=["一", "二", "三", "四", "五", "六"])
print(df.corr())
""" A B C A 1.000000 -1.000000 0.611313 B -1.000000 1.000000 -0.611313 C 0.611313 -0.611313 1.000000 """
r=df["A"].cov(corr["B"])
print(r)
# -1.0
copyright notice
author[Muzi Hikari],Please bring the original link to reprint, thank you.
https://en.pythonmana.com/2022/218/202208060749318272.html
The sidebar is recommended
- Get the input and output information of the onnx model Python script
- Common parameters of python matplotlib and drawing examples
- python axessubplot_ use matplotlib's savefig to save plots generated from python pandas (AxesSubPlot)
- Qixi Festival_A Python program that moves the mouse to play the confession balloon (available in August 2022)
- Python knowledge points: Python variables, data types
- Python open virtual environment
- Python uses Hive to query data
- python+opencv study notes
- python--log processing logging.handlers.TimedRotatingFileHandler
- The problem that yum is unavailable after python upgrade
guess what you like
Application of bubbling method in program thought in advanced scl programming in python and 1200PLC
Tensorflow C++ deployment practice - python environment establishment on linux platform (2)
Python graduation design works based on django framework personal blog system completed finished product (7) mid-term inspection report
Python graduation design works based on django framework personal blog system completed design (8) graduation design thesis template
Python graduation design works are based on the django framework enterprise company website. The finished product (1) Development overview
Application of bubbling method in program thought in advanced scl programming in python and 1200PLC
python get all characters before or after a specified character
[Python | Word Cloud] Draw a super beautiful word cloud from chat records (Happy Qixi Festival, classmate Zeng)
Python data visualization-----make a global earthquake scatter plot
Python. Iterator object iter() (Based on the iterator feature, dismantle the complicated single-line "forced code", and understand the "secret method" for selecting fixed-length elements of the sequence.)
Random recommended
- Python's common modules of threading and Thread modules The first stage of thread implementation
- Blender Python Programming: Creating Emitting Materials
- Python multiprocessing
- How does python implement image edge detection
- Django paging method
- django regex
- How does Python represent time?2 modules, 3 ways, 1 text to understand~
- Modify column name and row index in pandas DataFrame
- [python pandas groupby]
- Python Daily Practice (New Question Bank of Niu Ke) - Day 20: Dictionary Practice
- [LeetCode brush questions] Hash table - some questions are implemented with built-in functions (with Python code)
- [LeetCode brush questions] Linked list topic (1) (with Python code)
- [Small case of python learning] Simulation system invasion to enhance interest
- Getting Started with Python Basics - Essential Knowledge for Getting Started
- How does Python represent time?2 modules, 3 ways, 1 text to get it~
- Python office software automation, master openpyxl operation in 5 minutes
- Introduction to Python Basics - Variables, Strings
- [python2] remove the u in front of the unicode string
- How to use the Python Color class to draw text
- How to use Asyncio Python web programming
- 27 Python artificial intelligence libraries have been sorted out, it is recommended to collect!
- [Python] Word2Vec predicts what will be left if I subtract 'love' from my 'life'
- When I export a pandas package, there is a problem. If I don't import it, there is no problem. Is this not enough memory?
- Python version 3.7.4 How can I install locust successfully?
- How does python use pyinstaller to package music into exe, that is, play background music when running the packaged program?
- Python use pyinstaller how to wrap up music exe, is to run a packaged program play background music?
- Rescue 007 of graph traversal application, detailed analysis of python version
- 27 Python artificial intelligence libraries have been sorted out, it is recommended to collect~
- pandas DataFrame data filtering (2)
- Python is how to represent time?- two modules, three ways, complete the ~ 1
- The definition of pycharm writing python code
- Problems defining functions in Python
- Python Socket Network Programming
- Django server running error
- Python image processing notes - image matching based on Harris corners (skimage)
- (Thirteen--1) Concurrent programming of python (thread, queue, process, coroutine, process thread coroutine comparison)
- (12) Python's memory management mechanism
- Python crawler entry case day07: Hippopx
- Django reports an error ModuleNotFoundError: No module named 'mysqlclient'
- Python study notes