
Python Data Science Libraries: Working with Time Series Data

2022-06-24 07:51:07Bayesian grandson

Case practice

Problem 1: Count the number of emergencies of each type

Option 1: the set method

Option 2: loop over the whole DataFrame with for

Option 3: add a column, then group with groupby

Time series analysis

(1) Generate a time range

(2) Use time series in a DataFrame

(3) Resampling in pandas

Downsampling (high-frequency data to low-frequency data)

Upsampling (low-frequency data to high-frequency data)

(1) Data initialization

(2) Count the number of 911 calls in each month

(3) Visual analysis: plotting

Extended practice: changes in the number of 911 calls of each type from month to month

A PM2.5 demo


We have about 250,000 records of 911 emergency calls made between 2015 and 2017. The first task is to count the number of emergencies of each type. If we also want to know how the number of calls of each type changes from month to month, what should we do?

Data source: https://www.kaggle.com/mchirico/montcoalert/data

  Case practice

First, import the basic data analysis packages, read the data, and inspect it with head() and info().

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
df = pd.read_csv("./911.csv")
print(df.head(3))
df.info()  # info() prints its summary itself and returns None


>>> df.head()

         lat        lng                                               desc  \
0  40.297876 -75.581294  REINDEER CT & DEAD END;  NEW HANOVER; Station ...   
1  40.258061 -75.264680  BRIAR PATH & WHITEMARSH LN;  HATFIELD TOWNSHIP...   
2  40.121182 -75.351975  HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-St...   

       zip                    title            timeStamp                twp  \
0  19525.0   EMS: BACK PAINS/INJURY  2015-12-10 17:10:52        NEW HANOVER   
1  19446.0  EMS: DIABETIC EMERGENCY  2015-12-10 17:29:21  HATFIELD TOWNSHIP   
2  19401.0      Fire: GAS-ODOR/LEAK  2015-12-10 14:39:21         NORRISTOWN   

                         addr  e  
0      REINDEER CT & DEAD END  1  
1  BRIAR PATH & WHITEMARSH LN  1  
2                    HAWS AVE  1  

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 249737 entries, 0 to 249736
Data columns (total 9 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   lat        249737 non-null  float64
 1   lng        249737 non-null  float64
 2   desc       249737 non-null  object 
 3   zip        219391 non-null  float64
 4   title      249737 non-null  object 
 5   timeStamp  249737 non-null  object 
 6   twp        249644 non-null  object 
 7   addr       249737 non-null  object 
 8   e          249737 non-null  int64  
dtypes: float64(3), int64(1), object(5)
memory usage: 17.1+ MB

Problem 1: Count the number of emergencies of each type.

We need to split the contents of the title column and extract the category, which is one of EMS, Fire, or Traffic.

data_1 = df["title"].str.split(":").tolist()
data_1[0:5]

>>>
[['EMS', ' BACK PAINS/INJURY'],
 ['EMS', ' DIABETIC EMERGENCY'],
 ['Fire', ' GAS-ODOR/LEAK'],
 ['EMS', ' CARDIAC EMERGENCY'],
 ['EMS', ' DIZZINESS']]

Next, extract the category from each entry of data_1.

cate_list = list(set(i[0] for i in data_1))
cate_list

>>>

['Fire', 'EMS', 'Traffic']

There are now several ways to count the number of emergencies of each type.

Option 1: the set method

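zeros_df = pd.DataFrame(np.zeros((df.shape[0],len(cate_list))),columns=cate_list)

for cate in cate_list:
    # Mark rows whose title contains the category name; a single .loc call avoids chained assignment
    zeros_df.loc[df["title"].str.contains(cate), cate] = 1
print(zeros_df)

# Column sums give the number of rows marked for each category
sum_ret = zeros_df.sum(axis=0)
print(sum_ret)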

>>>

        Fire  EMS  Traffic
0        0.0  1.0      0.0
1        0.0  1.0      0.0
2        1.0  0.0      0.0
3        0.0  1.0      0.0
4        0.0  1.0      0.0
...      ...  ...      ...
249732   0.0  1.0      0.0
249733   0.0  1.0      0.0
249734   0.0  1.0      0.0
249735   1.0  0.0      0.0
249736   0.0  0.0      1.0

[249737 rows x 3 columns]
Fire        37432.0
EMS        124844.0
Traffic     87465.0
dtype: float64

Summing each column then gives the count for each category.

zeros_df.sum(axis = 0)
>>>

Fire        37432.0
EMS        124844.0
Traffic     87465.0
dtype: float64
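
The three counts above add up to 249,741, which is four more than the 249,737 rows in the data. A likely cause is that str.contains matches the category name anywhere in the title, not only in the prefix before the colon. The sketch below shows the effect on a few made-up titles (the last title is hypothetical, written so that "EMS" appears outside the prefix). It illustrates the pitfall and is not a diagnosis of the actual rows.

```python
import pandas as pd

# Made-up titles for illustration; the last one is hypothetical and
# mentions "EMS" after the colon even though its category is Fire.
titles = pd.Series([
    "EMS: BACK PAINS/INJURY",
    "Fire: GAS-ODOR/LEAK",
    "Traffic: VEHICLE ACCIDENT",
    "Fire: ASSIST EMS CREW",
])

# Substring match: counts every title that contains "EMS" anywhere.
contains_count = int(titles.str.contains("EMS").sum())

# Prefix match: counts only titles whose category (text before ":") is EMS.
prefix_count = int((titles.str.split(":").str[0] == "EMS").sum())

print(contains_count, prefix_count)
```

Comparing on the prefix only keeps each row in exactly one category.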

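Option 2: loop over the whole DataFrame with for

Iterate over every row directly. This approach is quite slow.

# Start again from an all-zero frame so that marks from Option 1 do not carry over
zeros_df = pd.DataFrame(np.zeros((df.shape[0],len(cate_list))),columns=cate_list)
for i in range(df.shape[0]):
    # data_1[i][0] is the category prefix of row i
    zeros_df.loc[i,data_1[i][0]] = 1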
print(zeros_df)

>>>
      Fire  EMS  Traffic
0        0.0  1.0      0.0
1        0.0  1.0      0.0
2        1.0  0.0      0.0
3        0.0  1.0      0.0
4        0.0  1.0      0.0
...      ...  ...      ...
249732   0.0  1.0      0.0
249733   0.0  1.0      0.0
249734   0.0  1.0      0.0
249735   1.0  0.0      0.0
249736   0.0  0.0      1.0

[249737 rows x 3 columns]
zeros_df.sum(axis = 0)

Option 3: add a column, then group with groupby

Add a category column, group by it, and count the rows in each group.

cate_list = [i[0] for i in data_1]
# A plain list is aligned by position, so no reshape is needed
df["cate"] = cate_list
print(df.groupby(by="cate").count()["title"])

>>>
cate
EMS        124840
Fire        37432
Traffic     87465
Name: title, dtype: int64
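
As a shorter alternative to the three options above, the category can be extracted with the .str accessor and counted with value_counts. This is a sketch on a small made-up sample rather than the real 911.csv, and the variable names are illustrative.

```python
import pandas as pd

# Small made-up sample standing in for the "title" column of 911.csv.
sample = pd.DataFrame({
    "title": [
        "EMS: BACK PAINS/INJURY",
        "EMS: DIABETIC EMERGENCY",
        "Fire: GAS-ODOR/LEAK",
        "Traffic: VEHICLE ACCIDENT",
        "EMS: DIZZINESS",
    ]
})

# Take the text before the first ":" as the category.
sample["cate"] = sample["title"].str.split(":").str[0]

# Count how many rows fall into each category.
counts = sample["cate"].value_counts()
print(counts)
```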

Problem 2: Track how the number of emergency calls of each type changes from month to month.

This involves time series analysis.

Handling time series in pandas is straightforward.

Time series analysis

(1) Generate a time range

pd.date_range(start=None, end=None, periods=None, freq='D')

Used together, start, end, and freq produce a time index that covers the range from start to end at frequency freq.

Used together, start, periods, and freq produce periods time index values, beginning at start and spaced at frequency freq.

Of the four parameters start, end, periods, and freq, exactly three must be specified.

import pandas as pd

pd.date_range(start="20170909",end = "20180908",freq = "M")

>>>
DatetimeIndex(['2017-09-30', '2017-10-31', '2017-11-30', '2017-12-31',
               '2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
               '2018-05-31', '2018-06-30', '2018-07-31', '2018-08-31'],
              dtype='datetime64[ns]', freq='M')
pd.date_range(start="20170909",periods = 5,freq = "D")

>>>
DatetimeIndex(['2017-09-09', '2017-09-10', '2017-09-11', '2017-09-12',
               '2017-09-13'],
              dtype='datetime64[ns]', freq='D')

freq is the frequency of the time index.

pandas supports many more frequency abbreviations (offset aliases) than the two used above.
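
A few more parameter combinations can be sketched as follows. The aliases used here ("W" for weekly, "MS" for month start, "B" for business day) are a small selection, and the dates are arbitrary examples. The last call omits freq and passes start, end, and periods instead.

```python
import pandas as pd

# start + periods + freq: three weekly timestamps (weeks end on Sunday by default).
weekly = pd.date_range(start="2017-09-09", periods=3, freq="W")

# "MS" is month start, "B" is business day.
month_starts = pd.date_range(start="2017-09-09", periods=3, freq="MS")
business_days = pd.date_range(start="2017-09-09", periods=3, freq="B")

# start + end + periods (no freq): evenly spaced points between the two ends.
evenly_spaced = pd.date_range(start="2017-09-09", end="2017-09-13", periods=3)

for idx in (weekly, month_starts, business_days, evenly_spaced):
    print(list(idx.strftime("%Y-%m-%d")))
```

Because 2017-09-09 is a Saturday and not a month start, the first three ranges begin at the next matching date rather than at start itself.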

(2) Use time series in a DataFrame

import numpy as np
index=pd.date_range("20170101",periods=10)
df = pd.DataFrame(np.random.rand(10),index=index)
df

>>>
0
2017-01-01	0.090949
2017-01-02	0.996337
2017-01-03	0.737334
2017-01-04	0.405381
2017-01-05	0.743721
2017-01-06	0.681303
2017-01-07	0.606283
2017-01-08	0.917397
2017-01-09	0.167316
2017-01-10	0.155164

Going back to the 911 data from the beginning, we can use the method pandas provides to convert the time strings into a time series:

df["timeStamp"] = pd.to_datetime(df["timeStamp"])

The format parameter can usually be omitted. For time strings that pandas cannot parse on its own, such as strings that contain Chinese characters, pass format explicitly.
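
A minimal sketch of the format parameter, assuming date strings written with Chinese year, month, and day characters. The sample strings are made up for illustration.

```python
import pandas as pd

# Made-up date strings that mix digits with Chinese characters.
raw = pd.Series(["2017年09月09日", "2018年01月31日"])

# The literal characters in the format string must match the input exactly.
parsed = pd.to_datetime(raw, format="%Y年%m月%d日")
print(parsed)
```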

This raises a question:

How do we count the number of calls in each month or each quarter?

(3) Resampling in pandas

Resampling is the process of converting a time series from one frequency to another. Converting high-frequency data to low-frequency data is downsampling; converting low-frequency data to high-frequency data is upsampling. pandas provides the resample method for frequency conversion.

1. Count how the number of 911 calls changes from month to month.

2. Count how the number of 911 calls of each type changes from month to month.

pandas.DataFrame.resample

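pandas.DataFrame.resample() is mainly used for frequency conversion of time series. The prototype below comes from an older pandas release; later versions dropped parameters such as how and fill_method, so aggregation and filling are done by calling a method on the result, as in the examples that follow: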

DataFrame.resample(rule, how=None, axis=0, fill_method=None, closed=None, label=None, convention='start', kind=None, loffset=None, limit=None, base=0, on=None, level=None)

 

Downsampling (high-frequency data to low-frequency data)

import pandas as pd
import numpy as np
index=pd.date_range('20190115','20190125',freq='D')
data1=pd.Series(np.arange(len(index)),index=index)
data1

>>>
2019-01-15     0
2019-01-16     1
2019-01-17     2
2019-01-18     3
2019-01-19     4
2019-01-20     5
2019-01-21     6
2019-01-22     7
2019-01-23     8
2019-01-24     9
2019-01-25    10
Freq: D, dtype: int64
data1.resample(rule='3D').sum()

>>>
2019-01-15     3
2019-01-18    12
2019-01-21    21
2019-01-24    19
Freq: 3D, dtype: int64

data1.resample(rule='3D').mean()

>>>
2019-01-15    1.0
2019-01-18    4.0
2019-01-21    7.0
2019-01-24    9.5
Freq: 3D, dtype: float64

The label parameter controls which bin edge is used to label each aggregated group. With label='right', the right edge of each bin becomes the new label.
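
The effect of label can be sketched with the same data1 series as above. The block rebuilds the series so that it runs on its own. Only the labels change; the bins and the sums stay the same.

```python
import numpy as np
import pandas as pd

index = pd.date_range("20190115", "20190125", freq="D")
data1 = pd.Series(np.arange(len(index)), index=index)

# Default for a '3D' rule: each bin is labelled with its left edge.
left = data1.resample(rule="3D").sum()

# label='right': same bins and sums, labelled with the right edge instead.
right = data1.resample(rule="3D", label="right").sum()

print(left)
print(right)
```

With label='right', the first bin (15 to 17 January) is labelled 2019-01-18 instead of 2019-01-15.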

Upsampling (low-frequency data to high-frequency data)

The downsampling process was demonstrated above; now for upsampling. By definition, we only need to change the frequency passed to resample. Unlike downsampling, however, the new time points created by upsampling have no values, so pandas provides three ways to fill them.

The three fill methods are:

ffill (use the previous value)

bfill (use the next value)

interpolate (linear interpolation)

data1.resample(rule='12H').asfreq()

>>>
2019-01-15 00:00:00     0.0
2019-01-15 12:00:00     NaN
2019-01-16 00:00:00     1.0
2019-01-16 12:00:00     NaN
2019-01-17 00:00:00     2.0
2019-01-17 12:00:00     NaN
2019-01-18 00:00:00     3.0
2019-01-18 12:00:00     NaN
2019-01-19 00:00:00     4.0
2019-01-19 12:00:00     NaN
2019-01-20 00:00:00     5.0
2019-01-20 12:00:00     NaN
2019-01-21 00:00:00     6.0
2019-01-21 12:00:00     NaN
2019-01-22 00:00:00     7.0
2019-01-22 12:00:00     NaN
2019-01-23 00:00:00     8.0
2019-01-23 12:00:00     NaN
2019-01-24 00:00:00     9.0
2019-01-24 12:00:00     NaN
2019-01-25 00:00:00    10.0
Freq: 12H, dtype: float64

Upsampling the original daily data to 12-hour intervals produces many null values. The three fill methods, ffill, bfill, and interpolate, are tested in turn below:

(1) When ffill is called with no arguments, all NaN values are filled. We can also pass a number to limit how many consecutive null values are filled.

data1.resample(rule='12H').ffill()
# Forward fill: replace each NaN with the previous value

>>>
2019-01-15 00:00:00     0
2019-01-15 12:00:00     0
2019-01-16 00:00:00     1
2019-01-16 12:00:00     1
2019-01-17 00:00:00     2
2019-01-17 12:00:00     2
2019-01-18 00:00:00     3
2019-01-18 12:00:00     3
2019-01-19 00:00:00     4
2019-01-19 12:00:00     4
2019-01-20 00:00:00     5
2019-01-20 12:00:00     5
2019-01-21 00:00:00     6
2019-01-21 12:00:00     6
2019-01-22 00:00:00     7
2019-01-22 12:00:00     7
2019-01-23 00:00:00     8
2019-01-23 12:00:00     8
2019-01-24 00:00:00     9
2019-01-24 12:00:00     9
2019-01-25 00:00:00    10
Freq: 12H, dtype: int64


data1.resample(rule='12H').ffill(limit=2)
# limit=2 fills at most two consecutive NaN values; each gap here holds only one NaN, so the result matches ffill()

>>>
2019-01-15 00:00:00     0
2019-01-15 12:00:00     0
2019-01-16 00:00:00     1
2019-01-16 12:00:00     1
2019-01-17 00:00:00     2
2019-01-17 12:00:00     2
2019-01-18 00:00:00     3
2019-01-18 12:00:00     3
2019-01-19 00:00:00     4
2019-01-19 12:00:00     4
2019-01-20 00:00:00     5
2019-01-20 12:00:00     5
2019-01-21 00:00:00     6
2019-01-21 12:00:00     6
2019-01-22 00:00:00     7
2019-01-22 12:00:00     7
2019-01-23 00:00:00     8
2019-01-23 12:00:00     8
2019-01-24 00:00:00     9
2019-01-24 12:00:00     9
2019-01-25 00:00:00    10
Freq: 12H, dtype: int64
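
The limit argument only makes a visible difference when a gap holds more than one null value. A minimal sketch, assuming a made-up two-point daily series upsampled to 6 hours; the rule is passed as a Timedelta so the example does not depend on how the hour alias is spelled in a given pandas version.

```python
import pandas as pd

# Two daily observations; upsampling to 6 hours leaves three gaps between them.
s = pd.Series([0, 10], index=pd.date_range("2019-01-15", periods=2, freq="D"))

# The rule is given as a Timedelta, which avoids depending on how the
# hour alias is spelled in a particular pandas version.
six_hours = pd.Timedelta(hours=6)

fill_all = s.resample(six_hours).ffill()         # fills every gap
fill_one = s.resample(six_hours).ffill(limit=1)  # fills at most one gap per run

print(fill_all.tolist())
print(fill_one.tolist())
```

With limit=1, only the 06:00 slot is filled and the 12:00 and 18:00 slots stay null.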


data1.resample(rule='12H').bfill()
>>>
2019-01-15 00:00:00     0
2019-01-15 12:00:00     1
2019-01-16 00:00:00     1
2019-01-16 12:00:00     2
2019-01-17 00:00:00     2
2019-01-17 12:00:00     3
2019-01-18 00:00:00     3
2019-01-18 12:00:00     4
2019-01-19 00:00:00     4
2019-01-19 12:00:00     5
2019-01-20 00:00:00     5
2019-01-20 12:00:00     6
2019-01-21 00:00:00     6
2019-01-21 12:00:00     7
2019-01-22 00:00:00     7
2019-01-22 12:00:00     8
2019-01-23 00:00:00     8
2019-01-23 12:00:00     9
2019-01-24 00:00:00     9
2019-01-24 12:00:00    10
2019-01-25 00:00:00    10
Freq: 12H, dtype: int64

data1.resample(rule='12H').interpolate()
# Linear interpolation between the neighbouring values
>>>
2019-01-15 00:00:00     0.0
2019-01-15 12:00:00     0.5
2019-01-16 00:00:00     1.0
2019-01-16 12:00:00     1.5
2019-01-17 00:00:00     2.0
2019-01-17 12:00:00     2.5
2019-01-18 00:00:00     3.0
2019-01-18 12:00:00     3.5
2019-01-19 00:00:00     4.0
2019-01-19 12:00:00     4.5
2019-01-20 00:00:00     5.0
2019-01-20 12:00:00     5.5
2019-01-21 00:00:00     6.0
2019-01-21 12:00:00     6.5
2019-01-22 00:00:00     7.0
2019-01-22 12:00:00     7.5
2019-01-23 00:00:00     8.0
2019-01-23 12:00:00     8.5
2019-01-24 00:00:00     9.0
2019-01-24 12:00:00     9.5
2019-01-25 00:00:00    10.0
Freq: 12H, dtype: float64

(1) Data initialization:

Set the index of the original data to the timestamp column, as follows.

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt


df = pd.read_csv("./911.csv")

df["timeStamp"] = pd.to_datetime(df["timeStamp"])

df.set_index("timeStamp",inplace=True)
df

>>>

(2) Count the number of 911 calls in each month

count_by_month = df.resample("M").count()["title"]
print(count_by_month)

>>>
timeStamp
2015-12-31     7916
2016-01-31    13096
2016-02-29    11396
2016-03-31    11059
2016-04-30    11287
2016-05-31    11374
2016-06-30    11732
2016-07-31    12088
2016-08-31    11904
2016-09-30    11669
2016-10-31    12502
2016-11-30    12091
2016-12-31    12162
2017-01-31    11605
2017-02-28    10267
2017-03-31    11684
2017-04-30    11056
2017-05-31    11719
2017-06-30    12333
2017-07-31    11768
2017-08-31    11753
2017-09-30     7276
Freq: M, Name: title, dtype: int64
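
The same monthly count can also be sketched by grouping on the calendar month of each timestamp. This is a different technique from resample: the result is labelled by monthly period rather than by month-end date, and months with no rows are simply absent. The timestamps below are made up and stand in for the real data.

```python
import pandas as pd

# Made-up timestamps standing in for the timeStamp index of the 911 data.
stamps = pd.to_datetime([
    "2015-12-10 17:10:52",
    "2015-12-10 17:29:21",
    "2016-01-03 08:00:00",
    "2016-02-14 12:30:00",
    "2016-02-20 23:59:59",
])
calls = pd.Series(1, index=stamps)

# Group by calendar month and count the rows in each group.
count_by_month = calls.groupby(calls.index.to_period("M")).size()
print(count_by_month)
```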

(3) Visual analysis: plotting

# Plot the monthly counts
_x = count_by_month.index
_y = count_by_month.values

# Format the timestamps as strings for the tick labels
_x = [i.strftime("%Y%m%d") for i in _x]

plt.figure(figsize=(20,8),dpi=80)

plt.plot(range(len(_x)),_y)

plt.xticks(range(len(_x)),_x,rotation=45)

plt.show()

Extended practice: changes in the number of 911 calls of each type from month to month

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

# Convert the time strings to datetime (the column becomes the index further down)
df = pd.read_csv("./911.csv")
df["timeStamp"] = pd.to_datetime(df["timeStamp"])

# Add a column holding the category
temp_list = df["title"].str.split(": ").tolist()
cate_list = [i[0] for i in temp_list]
# The index is still the default RangeIndex here, so a plain list lines up row by row
df["cate"] = cate_list

df.set_index("timeStamp",inplace=True)

print(df.head(1))

plt.figure(figsize=(20, 8), dpi=80)

# Group by category
for group_name,group_data in df.groupby(by="cate"):

    # Count the calls per month for this category
    count_by_month = group_data.resample("M").count()["title"]

    # Plot one line per category
    _x = count_by_month.index
    print(_x)
    _y = count_by_month.values

    _x = [i.strftime("%Y%m%d") for i in _x]

    plt.plot(range(len(_x)), _y, label=group_name)


plt.xticks(range(len(_x)), _x, rotation=45)
plt.legend(loc="best")
plt.show()

>>>
lat        lng  \
timeStamp                                   
2015-12-10 17:10:52  40.297876 -75.581294   

                                                                  desc  \
timeStamp                                                                
2015-12-10 17:10:52  REINDEER CT & DEAD END;  NEW HANOVER; Station ...   

                         zip                   title          twp  \
timeStamp                                                           
2015-12-10 17:10:52  19525.0  EMS: BACK PAINS/INJURY  NEW HANOVER   

                                       addr  e cate  
timeStamp                                            
2015-12-10 17:10:52  REINDEER CT & DEAD END  1  EMS  
DatetimeIndex(['2015-12-31', '2016-01-31', '2016-02-29', '2016-03-31',
               '2016-04-30', '2016-05-31', '2016-06-30', '2016-07-31',
               '2016-08-31', '2016-09-30', '2016-10-31', '2016-11-30',
               '2016-12-31', '2017-01-31', '2017-02-28', '2017-03-31',
               '2017-04-30', '2017-05-31', '2017-06-30', '2017-07-31',
               '2017-08-31', '2017-09-30'],
              dtype='datetime64[ns]', name='timeStamp', freq='M')
DatetimeIndex(['2015-12-31', '2016-01-31', '2016-02-29', '2016-03-31',
               '2016-04-30', '2016-05-31', '2016-06-30', '2016-07-31',
               '2016-08-31', '2016-09-30', '2016-10-31', '2016-11-30',
               '2016-12-31', '2017-01-31', '2017-02-28', '2017-03-31',
               '2017-04-30', '2017-05-31', '2017-06-30', '2017-07-31',
               '2017-08-31', '2017-09-30'],
              dtype='datetime64[ns]', name='timeStamp', freq='M')
DatetimeIndex(['2015-12-31', '2016-01-31', '2016-02-29', '2016-03-31',
               '2016-04-30', '2016-05-31', '2016-06-30', '2016-07-31',
               '2016-08-31', '2016-09-30', '2016-10-31', '2016-11-30',
               '2016-12-31', '2017-01-31', '2017-02-28', '2017-03-31',
               '2017-04-30', '2017-05-31', '2017-06-30', '2017-07-31',
               '2017-08-31', '2017-09-30'],
              dtype='datetime64[ns]', name='timeStamp', freq='M')
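
The per-category monthly counts can also be collected into one table before plotting. The sketch below groups by month and category in a single step; it uses made-up rows in place of the real data, and the variable names are illustrative.

```python
import pandas as pd

# Made-up rows standing in for the 911 data after set_index("timeStamp").
sample = pd.DataFrame(
    {"cate": ["EMS", "Fire", "EMS", "EMS", "Traffic"]},
    index=pd.to_datetime([
        "2015-12-10 17:10:52",
        "2015-12-10 14:39:21",
        "2016-01-03 08:00:00",
        "2016-01-15 09:30:00",
        "2016-01-20 10:00:00",
    ]),
)
sample.index.name = "timeStamp"

# Count rows per (month, category), then spread categories into columns.
month = sample.index.to_period("M")
table = sample.groupby([month, "cate"]).size().unstack(fill_value=0)
print(table)
```

Each column of table is then one category's monthly series, so the three lines can be drawn from a single object instead of a loop.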

A PM2.5 demo

# coding=utf-8
import pandas as pd
from matplotlib import  pyplot as plt
file_path = "./PM2.5/BeijingPM20100101_20151231.csv"

df = pd.read_csv(file_path)

# Combine the separate year, month, day, and hour columns into a pandas time type with PeriodIndex
period = pd.PeriodIndex(year=df["year"],month=df["month"],day=df["day"],hour=df["hour"],freq="H")
df["datetime"] = period
# print(df.head(10))

# Set datetime as the index
df.set_index("datetime",inplace=True)

# Downsample to 7-day means
df = df.resample("7D").mean()
print(df.head())
# Missing data could be handled here; this demo leaves it as is
# print(df["PM_US Post"])

data  =df["PM_US Post"]
data_china = df["PM_Nongzhanguan"]

print(data_china.head(100))
# Plot
_x = data.index
_x = [i.strftime("%Y%m%d") for i in _x]
_x_china = [i.strftime("%Y%m%d") for i in data_china.index]
print(len(_x),len(_x_china))
_y = data.values
_y_china = data_china.values


plt.figure(figsize=(20,8),dpi=80)

plt.plot(range(len(_x)),_y,label="US_POST",alpha=0.7)
plt.plot(range(len(_x_china)),_y_china,label="CN_POST",alpha=0.7)

plt.xticks(range(0,len(_x_china),10),list(_x_china)[::10],rotation=45)

plt.legend(loc="best")

plt.show()

Note:

Time components stored in separate columns can be combined into a pandas time type with PeriodIndex:

period=pd.PeriodIndex(year=df["year"],month=df["month"],day=df["day"],hour=df["hour"],freq="H")
df["datetime"] = period
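
As an alternative sketch, pd.to_datetime can assemble timestamps from columns named year, month, day, and hour. This yields timestamps (a datetime64 column) rather than periods, so it is a different technique from the PeriodIndex call above. The rows below are made up and only mirror the column names of the PM2.5 file.

```python
import pandas as pd

# Made-up rows with the same year/month/day/hour columns as the PM2.5 file.
parts = pd.DataFrame({
    "year": [2010, 2010],
    "month": [1, 1],
    "day": [1, 1],
    "hour": [0, 1],
})

# to_datetime assembles one timestamp per row from the named columns.
parts["datetime"] = pd.to_datetime(parts[["year", "month", "day", "hour"]])
print(parts["datetime"])
```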

For more about resampling, see:

http://t.csdn.cn/ViZmt

copyright notice
author[Bayesian grandson],Please bring the original link to reprint, thank you.
https://en.pythonmana.com/2022/175/202206240340585596.html
