Python Data Science Libraries -- Learning Time Series Data
2022-06-24 07:51:07 [Bayesian grandson]
Problem description one: count the number of each type of emergency in the data
Option one: set values in a DataFrame of zeros (boolean indexing)
Option two: iterate over the whole DataFrame with a for loop
Option three: add a category column, then group with groupby
(Two) Using a time series in a DataFrame
(Three) pandas resampling
Downsampling (high-frequency data to low-frequency data)
Upsampling (low-frequency data to high-frequency data)
(One) Data initialization
(Two) Counting the number of 911 calls per month
(Three) Visual analysis: plotting
We now have 250,000 records of 911 emergency calls from 2015 to 2017. Count the number of emergencies of each type in this data. If we also want to know how the number of emergency calls of each type changes from month to month, what should we do?
Data source: https://www.kaggle.com/mchirico/montcoalert/data
Case practice
First, import some basic data analysis packages, read the data, and inspect it with head() and info().
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
df = pd.read_csv("./911.csv")
print(df.head(3))
print(df.info())
>>> df.head(3)
lat lng desc \
0 40.297876 -75.581294 REINDEER CT & DEAD END; NEW HANOVER; Station ...
1 40.258061 -75.264680 BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP...
2 40.121182 -75.351975 HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-St...
zip title timeStamp twp \
0 19525.0 EMS: BACK PAINS/INJURY 2015-12-10 17:10:52 NEW HANOVER
1 19446.0 EMS: DIABETIC EMERGENCY 2015-12-10 17:29:21 HATFIELD TOWNSHIP
2 19401.0 Fire: GAS-ODOR/LEAK 2015-12-10 14:39:21 NORRISTOWN
addr e
0 REINDEER CT & DEAD END 1
1 BRIAR PATH & WHITEMARSH LN 1
2 HAWS AVE 1
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 249737 entries, 0 to 249736
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 lat 249737 non-null float64
1 lng 249737 non-null float64
2 desc 249737 non-null object
3 zip 219391 non-null float64
4 title 249737 non-null object
5 timeStamp 249737 non-null object
6 twp 249644 non-null object
7 addr 249737 non-null object
8 e 249737 non-null int64
dtypes: float64(3), int64(1), object(5)
memory usage: 17.1+ MB
Problem description one: count the number of each type of emergency.
We need to split the contents of the title column and extract the category part ([EMS, Fire, Traffic]).
data_1 = df["title"].str.split(":").tolist()
data_1[0:5]
>>>
[['EMS', ' BACK PAINS/INJURY'],
['EMS', ' DIABETIC EMERGENCY'],
['Fire', ' GAS-ODOR/LEAK'],
['EMS', ' CARDIAC EMERGENCY'],
['EMS', ' DIZZINESS']]
Next, extract the category information from data_1.
cate_list = list(set(i[0] for i in data_1))
cate_list
>>>
['Fire', 'EMS', 'Traffic']
At this point there are several ways to count the number of each type of emergency.
Option one: set values in a DataFrame of zeros (boolean indexing)
zeros_df = pd.DataFrame(np.zeros((df.shape[0], len(cate_list))), columns=cate_list)
for cate in cate_list:
    # use .loc to set matching rows to 1 and avoid chained-assignment warnings
    zeros_df.loc[df["title"].str.contains(cate), cate] = 1
print(zeros_df)
sum_ret = zeros_df.sum(axis=0)
print(sum_ret)
>>>
Fire EMS Traffic
0 0.0 1.0 0.0
1 0.0 1.0 0.0
2 1.0 0.0 0.0
3 0.0 1.0 0.0
4 0.0 1.0 0.0
... ... ... ...
249732 0.0 1.0 0.0
249733 0.0 1.0 0.0
249734 0.0 1.0 0.0
249735 1.0 0.0 0.0
249736 0.0 0.0 1.0
[249737 rows x 3 columns]
Fire 37432.0
EMS 124844.0
Traffic 87465.0
dtype: float64
Then just sum along axis 0 (down the rows).
zeros_df.sum(axis = 0)
>>>
Fire 37432.0
EMS 124844.0
Traffic 87465.0
dtype: float64
Option two: iterate over the whole DataFrame with a for loop
Iterate over every row directly; this approach is quite slow.
# (re-create zeros_df first if you already ran option one)
for i in range(df.shape[0]):
    zeros_df.loc[i, data_1[i][0]] = 1
print(zeros_df)
>>>
Fire EMS Traffic
0 0.0 1.0 0.0
1 0.0 1.0 0.0
2 1.0 0.0 0.0
3 0.0 1.0 0.0
4 0.0 1.0 0.0
... ... ... ...
249732 0.0 1.0 0.0
249733 0.0 1.0 0.0
249734 0.0 1.0 0.0
249735 1.0 0.0 0.0
249736 0.0 0.0 1.0
[249737 rows x 3 columns]
zeros_df.sum(axis=0)
Option three: add a category column, then group with groupby
Add a category column, then groupby, and finally count.
cate_list = [i[0] for i in data_1]
df["cate"] = pd.DataFrame(np.array(cate_list).reshape((df.shape[0],1)))
print(df.groupby(by="cate").count()["title"])
>>>
cate
EMS 124840
Fire 37432
Traffic 87465
Name: title, dtype: int64
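The two approaches give slightly different totals for EMS (124844 with str.contains vs 124840 here), most likely because str.contains matches the category name anywhere in the title, so a call whose description mentions another category is counted under both. As a quick cross-check, a minimal one-line sketch using value_counts:
# count each category directly from the prefix of the title column
df["title"].str.split(":").str[0].value_counts()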
Problem description two: count how the number of each type of emergency call changes from month to month.
This involves time series analysis.
Handling time series in pandas is very straightforward.
Time series analysis
(One) Generating a time range
pd.date_range(start=None, end=None, periods=None, freq='D')
start, end and freq together generate a set of time indexes within the range [start, end] at frequency freq.
start, periods and freq together generate periods time indexes starting from start at frequency freq.
Of the four parameters start, end, periods, and freq, exactly three must be specified.
import pandas as pd
pd.date_range(start="20170909",end = "20180908",freq = "M")
>>>
DatetimeIndex(['2017-09-30', '2017-10-31', '2017-11-30', '2017-12-31',
'2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
'2018-05-31', '2018-06-30', '2018-07-31', '2018-08-31'],
dtype='datetime64[ns]', freq='M')
pd.date_range(start="20170909",periods = 5,freq = "D")
>>>
DatetimeIndex(['2017-09-09', '2017-09-10', '2017-09-11', '2017-09-12',
'2017-09-13'],
dtype='datetime64[ns]', freq='D')
freq is the time frequency. Common frequency aliases include: D (calendar day), B (business day), H (hour), T or min (minute), S (second), W (week), M (month end), MS (month start), Q (quarter end), and A/Y (year end).
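A minimal sketch of a few of these aliases in action (the dates are arbitrary):
pd.date_range("20170909", periods=3, freq="H")      # hourly
pd.date_range("20170909", periods=3, freq="10D")    # every 10 days
pd.date_range("20170909", periods=3, freq="W-MON")  # weekly, anchored on Monday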
(Two) Using a time series in a DataFrame
import numpy as np
index=pd.date_range("20170101",periods=10)
df = pd.DataFrame(np.random.rand(10),index=index)
df
>>>
0
2017-01-01 0.090949
2017-01-02 0.996337
2017-01-03 0.737334
2017-01-04 0.405381
2017-01-05 0.743721
2017-01-06 0.681303
2017-01-07 0.606283
2017-01-08 0.917397
2017-01-09 0.167316
2017-01-10 0.155164
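With a DatetimeIndex in place, rows can also be selected by date labels; a minimal sketch on the df just created:
# label-based slicing on a DatetimeIndex; partial strings such as "2017-01" also work
df.loc["2017-01-03":"2017-01-05"]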
Going back to the 911 data from the beginning of the case, we can use the method pandas provides to convert the time strings into a time series.
df["timeStamp"] = pd.to_datetime(df["timeStamp"])
In most cases the format parameter can be omitted, but for time strings that pandas cannot parse on its own (for example, ones containing Chinese characters), we can pass format to describe the layout explicitly.
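A minimal sketch of passing an explicit format (the sample string and format here are illustrative, not taken from the 911 data):
# parse a non-standard day/month/year layout with an explicit format string
pd.to_datetime(pd.Series(["10/12/2015 17:10:52"]), format="%d/%m/%Y %H:%M:%S")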
So here comes the question:
if we now need to count the number of calls per month or per quarter, what should we do?
(Three) pandas resampling
Resampling is the process of converting a time series from one frequency to another. Converting high-frequency data to low-frequency data is downsampling; converting low-frequency data to high-frequency data is upsampling. pandas provides a resample method to perform the frequency conversion.
1. Count the change in the number of 911 calls per month.
2. Count the change in the number of 911 calls of each type per month.
pandas.DataFrame.resample
pandas.DataFrame.resample() is mainly used to change the frequency of a time series. Its signature (from an older pandas version) is as follows:
DataFrame.resample(rule, how=None, axis=0, fill_method=None, closed=None, label=None, convention='start', kind=None, loffset=None, limit=None, base=0, on=None, level=None)
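As a side note, the on parameter in that signature lets you resample by a datetime column without first setting it as the index; a minimal sketch, assuming df["timeStamp"] has already been converted with pd.to_datetime:
# monthly call counts without touching the index
df.resample("M", on="timeStamp")["title"].count()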
Downsampling (high-frequency data to low-frequency data):
import pandas as pd
import numpy as np
index=pd.date_range('20190115','20190125',freq='D')
data1=pd.Series(np.arange(len(index)),index=index)
data1
>>>
2019-01-15 0
2019-01-16 1
2019-01-17 2
2019-01-18 3
2019-01-19 4
2019-01-20 5
2019-01-21 6
2019-01-22 7
2019-01-23 8
2019-01-24 9
2019-01-25 10
Freq: D, dtype: int64
data1.resample(rule='3D').sum()
>>>
2019-01-15     3
2019-01-18    12
2019-01-21    21
2019-01-24    19
Freq: 3D, dtype: int64
data1.resample(rule='3D').mean()
>>>
2019-01-15    1.0
2019-01-18    4.0
2019-01-21    7.0
2019-01-24    9.5
Freq: 3D, dtype: float64
The label parameter controls which bin edge is used as the label of each aggregated group. With label='right', the right edge of each bin becomes the new label.
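A minimal sketch of the effect on data1 (the aggregated values stay the same; only the labels shift to the right bin edge, so the first bin is labelled 2019-01-18 instead of 2019-01-15):
data1.resample(rule='3D', label='right').sum()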
Upsampling (low-frequency data to high-frequency data)
The process of downsampling was demonstrated above; now let's demonstrate upsampling. By definition, we only need to change the frequency passed to resample. Unlike downsampling, however, the newly created timestamps have no values (NaN), so resample also provides three ways to fill them. Let's demonstrate with code.
The three filling methods are:
ffill (fill with the previous value)
bfill (fill with the next value)
interpolate (linear interpolation)
data1.resample(rule='12H').asfreq()
>>>
2019-01-15 00:00:00 0.0
2019-01-15 12:00:00 NaN
2019-01-16 00:00:00 1.0
2019-01-16 12:00:00 NaN
2019-01-17 00:00:00 2.0
2019-01-17 12:00:00 NaN
2019-01-18 00:00:00 3.0
2019-01-18 12:00:00 NaN
2019-01-19 00:00:00 4.0
2019-01-19 12:00:00 NaN
2019-01-20 00:00:00 5.0
2019-01-20 12:00:00 NaN
2019-01-21 00:00:00 6.0
2019-01-21 12:00:00 NaN
2019-01-22 00:00:00 7.0
2019-01-22 12:00:00 NaN
2019-01-23 00:00:00 8.0
2019-01-23 12:00:00 NaN
2019-01-24 00:00:00 9.0
2019-01-24 12:00:00 NaN
2019-01-25 00:00:00 10.0
Freq: 12H, dtype: float64
Upsampling the original daily data to a 12-hour frequency produces many missing values. For these, resample provides three filling methods: ffill (previous value), bfill (next value) and interpolate (linear interpolation). Let's test each of them:
(1) When ffill is called with no argument, every NaN is filled; we can also pass a number to limit how many consecutive missing values are filled.
data1.resample(rule='12H').ffill()
# forward fill: each NaN takes the previous value
2019-01-15 00:00:00 0
2019-01-15 12:00:00 0
2019-01-16 00:00:00 1
2019-01-16 12:00:00 1
2019-01-17 00:00:00 2
2019-01-17 12:00:00 2
2019-01-18 00:00:00 3
2019-01-18 12:00:00 3
2019-01-19 00:00:00 4
2019-01-19 12:00:00 4
2019-01-20 00:00:00 5
2019-01-20 12:00:00 5
2019-01-21 00:00:00 6
2019-01-21 12:00:00 6
2019-01-22 00:00:00 7
2019-01-22 12:00:00 7
2019-01-23 00:00:00 8
2019-01-23 12:00:00 8
2019-01-24 00:00:00 9
2019-01-24 12:00:00 9
2019-01-25 00:00:00 10
Freq: 12H, dtype: int64
data1.resample(rule='12H').ffill(2)
# here a limit of 2 gives the same result as ffill(), since each gap contains only one missing value
>>>
2019-01-15 00:00:00     0
2019-01-15 12:00:00     0
2019-01-16 00:00:00     1
2019-01-16 12:00:00     1
2019-01-17 00:00:00     2
2019-01-17 12:00:00     2
2019-01-18 00:00:00     3
2019-01-18 12:00:00     3
2019-01-19 00:00:00     4
2019-01-19 12:00:00     4
2019-01-20 00:00:00     5
2019-01-20 12:00:00     5
2019-01-21 00:00:00     6
2019-01-21 12:00:00     6
2019-01-22 00:00:00     7
2019-01-22 12:00:00     7
2019-01-23 00:00:00     8
2019-01-23 12:00:00     8
2019-01-24 00:00:00     9
2019-01-24 12:00:00     9
2019-01-25 00:00:00    10
Freq: 12H, dtype: int64
data1.resample(rule='12H').bfill()
>>>
2019-01-15 00:00:00 0
2019-01-15 12:00:00 1
2019-01-16 00:00:00 1
2019-01-16 12:00:00 2
2019-01-17 00:00:00 2
2019-01-17 12:00:00 3
2019-01-18 00:00:00 3
2019-01-18 12:00:00 4
2019-01-19 00:00:00 4
2019-01-19 12:00:00 5
2019-01-20 00:00:00 5
2019-01-20 12:00:00 6
2019-01-21 00:00:00 6
2019-01-21 12:00:00 7
2019-01-22 00:00:00 7
2019-01-22 12:00:00 8
2019-01-23 00:00:00 8
2019-01-23 12:00:00 9
2019-01-24 00:00:00 9
2019-01-24 12:00:00 10
2019-01-25 00:00:00 10
Freq: 12H, dtype: int64
data1.resample(rule='12H').interpolate()
# linear interpolation between the known values
>>>
2019-01-15 00:00:00 0.0
2019-01-15 12:00:00 0.5
2019-01-16 00:00:00 1.0
2019-01-16 12:00:00 1.5
2019-01-17 00:00:00 2.0
2019-01-17 12:00:00 2.5
2019-01-18 00:00:00 3.0
2019-01-18 12:00:00 3.5
2019-01-19 00:00:00 4.0
2019-01-19 12:00:00 4.5
2019-01-20 00:00:00 5.0
2019-01-20 12:00:00 5.5
2019-01-21 00:00:00 6.0
2019-01-21 12:00:00 6.5
2019-01-22 00:00:00 7.0
2019-01-22 12:00:00 7.5
2019-01-23 00:00:00 8.0
2019-01-23 12:00:00 8.5
2019-01-24 00:00:00 9.0
2019-01-24 12:00:00 9.5
2019-01-25 00:00:00 10.0
Freq: 12H, dtype: float64
(One) Data initialization
Set the index of the original data to the time column, as follows.
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
df = pd.read_csv("./911.csv")
df["timeStamp"] = pd.to_datetime(df["timeStamp"])
df.set_index("timeStamp",inplace=True)
(Two) Counting the number of 911 calls per month
count_by_month = df.resample("M").count()["title"]
print(count_by_month)
>>>
timeStamp
2015-12-31 7916
2016-01-31 13096
2016-02-29 11396
2016-03-31 11059
2016-04-30 11287
2016-05-31 11374
2016-06-30 11732
2016-07-31 12088
2016-08-31 11904
2016-09-30 11669
2016-10-31 12502
2016-11-30 12091
2016-12-31 12162
2017-01-31 11605
2017-02-28 10267
2017-03-31 11684
2017-04-30 11056
2017-05-31 11719
2017-06-30 12333
2017-07-31 11768
2017-08-31 11753
2017-09-30 7276
Freq: M, Name: title, dtype: int64
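The question earlier also mentioned quarters; a minimal sketch of the same count at quarterly frequency (assuming df is still indexed by timeStamp):
# quarterly call counts; only the rule string changes
count_by_quarter = df.resample("Q").count()["title"]
print(count_by_quarter)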
(Three) Visual analysis: plotting
# plot the monthly call counts
_x = count_by_month.index
_y = count_by_month.values
_x = [i.strftime("%Y%m%d") for i in _x]
plt.figure(figsize=(20,8),dpi=80)
plt.plot(range(len(_x)),_y)
plt.xticks(range(len(_x)),_x,rotation=45)
plt.show()
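As a simpler alternative sketch, pandas can plot the Series directly and let matplotlib format the datetime axis:
count_by_month.plot(figsize=(20, 8))
plt.show()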
Extended practice: changes in the number of 911 calls of each type per month
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
# Convert the time string to time type and set it as index
df = pd.read_csv("./911.csv")
df["timeStamp"] = pd.to_datetime(df["timeStamp"])
# add a column indicating the category
temp_list = df["title"].str.split(": ").tolist()
cate_list = [i[0] for i in temp_list]
# print(np.array(cate_list).reshape((df.shape[0],1)))
df["cate"] = pd.DataFrame(np.array(cate_list).reshape((df.shape[0],1)))
df.set_index("timeStamp",inplace=True)
print(df.head(1))
plt.figure(figsize=(20, 8), dpi=80)
# group by category and plot one line per category
for group_name, group_data in df.groupby(by="cate"):
    # monthly counts for this category
    count_by_month = group_data.resample("M").count()["title"]
    # plotting
    _x = count_by_month.index
    print(_x)
    _y = count_by_month.values
    _x = [i.strftime("%Y%m%d") for i in _x]
    plt.plot(range(len(_x)), _y, label=group_name)
plt.xticks(range(len(_x)), _x, rotation=45)
plt.legend(loc="best")
plt.show()
>>>
lat lng \
timeStamp
2015-12-10 17:10:52 40.297876 -75.581294
desc \
timeStamp
2015-12-10 17:10:52 REINDEER CT & DEAD END; NEW HANOVER; Station ...
zip title twp \
timeStamp
2015-12-10 17:10:52 19525.0 EMS: BACK PAINS/INJURY NEW HANOVER
addr e cate
timeStamp
2015-12-10 17:10:52 REINDEER CT & DEAD END 1 EMS
DatetimeIndex(['2015-12-31', '2016-01-31', '2016-02-29', '2016-03-31',
'2016-04-30', '2016-05-31', '2016-06-30', '2016-07-31',
'2016-08-31', '2016-09-30', '2016-10-31', '2016-11-30',
'2016-12-31', '2017-01-31', '2017-02-28', '2017-03-31',
'2017-04-30', '2017-05-31', '2017-06-30', '2017-07-31',
'2017-08-31', '2017-09-30'],
dtype='datetime64[ns]', name='timeStamp', freq='M')
DatetimeIndex(['2015-12-31', '2016-01-31', '2016-02-29', '2016-03-31',
'2016-04-30', '2016-05-31', '2016-06-30', '2016-07-31',
'2016-08-31', '2016-09-30', '2016-10-31', '2016-11-30',
'2016-12-31', '2017-01-31', '2017-02-28', '2017-03-31',
'2017-04-30', '2017-05-31', '2017-06-30', '2017-07-31',
'2017-08-31', '2017-09-30'],
dtype='datetime64[ns]', name='timeStamp', freq='M')
DatetimeIndex(['2015-12-31', '2016-01-31', '2016-02-29', '2016-03-31',
'2016-04-30', '2016-05-31', '2016-06-30', '2016-07-31',
'2016-08-31', '2016-09-30', '2016-10-31', '2016-11-30',
'2016-12-31', '2017-01-31', '2017-02-28', '2017-03-31',
'2017-04-30', '2017-05-31', '2017-06-30', '2017-07-31',
'2017-08-31', '2017-09-30'],
dtype='datetime64[ns]', name='timeStamp', freq='M')
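An alternative sketch that builds the same monthly-by-category counts as a single DataFrame (one column per category) and lets pandas plot all three lines at once; this assumes df is indexed by timeStamp and already has the cate column added above:
# one row per month, one column per category
monthly_by_cate = df.groupby("cate").resample("M")["title"].count().unstack(level=0)
monthly_by_cate.plot(figsize=(20, 8))
plt.legend(loc="best")
plt.show()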
A demo with PM2.5 data
# coding=utf-8
import pandas as pd
from matplotlib import pyplot as plt
file_path = "./PM2.5/BeijingPM20100101_20151231.csv"
df = pd.read_csv(file_path)
# combine the separate year/month/day/hour columns into a pandas time type via PeriodIndex
period = pd.PeriodIndex(year=df["year"],month=df["month"],day=df["day"],hour=df["hour"],freq="H")
df["datetime"] = period
# print(df.head(10))
# set datetime as the index
df.set_index("datetime",inplace=True)
# downsampling: weekly means (on recent pandas you may need .mean(numeric_only=True) to skip non-numeric columns)
df = df.resample("7D").mean()
print(df.head())
# select the two PM series; missing readings remain NaN after the weekly mean
# print(df["PM_US Post"])
data = df["PM_US Post"]
data_china = df["PM_Nongzhanguan"]
print(data_china.head(100))
# plotting
_x = data.index
_x = [i.strftime("%Y%m%d") for i in _x]
_x_china = [i.strftime("%Y%m%d") for i in data_china.index]
print(len(_x), len(_x_china))
_y = data.values
_y_china = data_china.values
plt.figure(figsize=(20,8),dpi=80)
plt.plot(range(len(_x)),_y,label="US_POST",alpha=0.7)
plt.plot(range(len(_x_china)),_y_china,label="CN_POST",alpha=0.7)
plt.xticks(range(0,len(_x_china),10),list(_x_china)[::10],rotation=45)
plt.legend(loc="best")
plt.show()
Note:
The separate year/month/day/hour columns are combined into a pandas time type via PeriodIndex:
period = pd.PeriodIndex(year=df["year"], month=df["month"], day=df["day"], hour=df["hour"], freq="H")
df["datetime"] = period
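A minimal alternative sketch: pd.to_datetime can also build the timestamps directly from the component columns, provided they are named year/month/day/hour as in this dataset:
# same idea as the PeriodIndex approach, but producing datetime64 values
df["datetime"] = pd.to_datetime(df[["year", "month", "day", "hour"]])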
Copyright notice
Author: [Bayesian grandson]. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/175/202206240340585596.html