데이터 한 그릇

Extra) Time Series Forecasting with Deep Learning

Time Series Analysis / Practical Time Series Analysis

장사이언스 2022. 10. 14. 17:19

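Time series data generally has a trend and seasonality, and may also contain a cycle.

Because a cycle has no fixed, known period, it is hard to identify in advance and is usually left out of the model.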
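
To make the trend and seasonality split concrete, here is a minimal sketch on synthetic data, assuming an additive model (series = trend + seasonality) and a known seasonal period. The period of 7, the slope, the amplitude, and the helper name `centered_moving_average` are all assumptions made for illustration, not part of the original post. No cycle term is included, in line with the note above.

```python
import math

def centered_moving_average(series, window):
    """Centered moving average for an odd window; None where the window does not fit."""
    half = window // 2
    out = [None] * len(series)
    for i in range(half, len(series) - half):
        out[i] = sum(series[i - half:i + half + 1]) / window
    return out

period = 7   # assumed seasonal period (hypothetical)
n = 70       # number of synthetic observations

# Synthetic additive series: linear trend plus a sinusoidal seasonal pattern
trend = [0.5 * t for t in range(n)]
seasonal = [3.0 * math.sin(2 * math.pi * t / period) for t in range(n)]
series = [a + b for a, b in zip(trend, seasonal)]

# Averaging over exactly one full period cancels the seasonal part,
# so what remains is an estimate of the trend
trend_estimate = centered_moving_average(series, window=period)

# Subtracting the trend estimate leaves the seasonal part
seasonal_estimate = [
    None if t is None else x - t for x, t in zip(series, trend_estimate)
]

print(trend_estimate[10], trend[10])
print(seasonal_estimate[10], seasonal[10])
```

On real data the recovery is only approximate, because noise and any irregular cycle end up in the remainder.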

 

Traditional statistical approaches include AR, MA, ARIMA, and related models.

 

From a machine learning perspective, an AR model maps naturally onto a neural network.

Each node in the input layer corresponds to the value at one past time step (x_{t-1}, ..., x_{t-k}).
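
The mapping can be sketched in a few lines: an AR(k) one-step forecast and a single dense unit with linear activation compute the same weighted sum when the lagged values are fed in as inputs. The function names `ar_forecast` and `dense_linear` and the coefficient values are hypothetical, chosen only for illustration.

```python
# AR(k):   x_t = c + phi_1 * x_{t-1} + ... + phi_k * x_{t-k} + noise
# Neuron:  y   = b + w_1 * in_1      + ... + w_k * in_k   (linear activation)

def ar_forecast(history, phi, c):
    """One-step AR(k) forecast from the last k values of history."""
    k = len(phi)
    lags = history[-1:-k - 1:-1]  # x_{t-1}, x_{t-2}, ..., x_{t-k}
    return c + sum(p * x for p, x in zip(phi, lags))

def dense_linear(inputs, weights, bias):
    """A single dense unit with linear activation."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

history = [1.0, 2.0, 4.0, 3.0]   # toy series, most recent value last
phi = [0.6, 0.3]                 # hypothetical AR(2) coefficients
c = 0.5                          # hypothetical intercept

ar_value = ar_forecast(history, phi, c)

# Feed the same lags to the neuron: weights play the role of phi, bias of c
inputs = [history[-1], history[-2]]
neuron_value = dense_linear(inputs, weights=phi, bias=c)

print(ar_value, neuron_value)
```

The MLP later in this post uses the same lagged inputs but adds hidden layers with ReLU activations, so it can be read as a nonlinear extension of this linear case.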

 

I also plan to try time series forecasting with CNNs and RNNs.

 


Time series forecasting with an artificial neural network

 

# -*- coding: utf-8 -*-
"""
Air pollution PRES prediction by MLP
@author: kjw
"""
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import datetime


#Read the dataset into a pandas.DataFrame
df = pd.read_csv('PRSA_data_2010.1.1-2014.12.31.csv')
print('Shape of the dataframe:', df.shape)
df.head()

# Index creation: assemble a datetime column from the year/month/day/hour columns
df['datetime'] = pd.to_datetime(df[['year', 'month', 'day', 'hour']])
# Sort chronologically and reset the index so that row labels run 0..n-1
df = df.sort_values('datetime', ascending=True).reset_index(drop=True)
df.head()

# Box plot to visualize the central tendency and dispersion of PRES
plt.figure()
g = sns.boxplot(x=df['PRES'])
g.set_title('Box plot of Air Pressure')

# Time series visualization (new figure so it is not drawn over the box plot)
plt.figure()
g = sns.lineplot(x='datetime',y='PRES',data=df)
g.set_title('Time series of Air Pressure')
g.set_xlabel('Year')
g.set_ylabel('Air Pressure readings in hPa')


# Minmax scaling PRES variable
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
df['scaled_PRES'] = scaler.fit_transform(np.array(df['PRES']).reshape(-1, 1))
df.head()

"""
Let's start by splitting the dataset into train and test. 
The dataset's time period is from
Jan 1st, 2010 to Dec 31st, 2014. 
The first four years - 2010 to 2013 is used as train and
2014 is kept for test.
"""
split_date = datetime.datetime(year=2014, month=1, day=1, hour=0)
df_train = df.loc[df['datetime']<split_date]
df_test = df.loc[df['datetime']>=split_date]
print('Shape of train:', df_train.shape)
print('Shape of test:', df_test.shape)


#First five rows of train
df_train.head()

#First five rows of test
df_test.head()

#Reset the indices of the test set so that they start from 0
df_test = df_test.reset_index(drop=True)
df_test.head()

"""
The train and test time series of min-max scaled PRES are also plotted.
"""

plt.figure()
g = sns.lineplot(x='datetime',y='scaled_PRES',data=df_train, color='b')
g.set_title('Time series of scaled Air Pressure in train set')
g.set_xlabel('Datetime')
g.set_ylabel('Scaled Air Pressure readings')

plt.figure()
g = sns.lineplot(x='datetime',y='scaled_PRES',data=df_test, color='r')
g.set_title('Time series of scaled Air Pressure in test set')
g.set_xlabel('Datetime')
g.set_ylabel('Scaled Air Pressure readings')

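# Make dataset to forecast from the past 7 observations
# (the data is hourly, so 7 timesteps cover 7 hours, not 7 days)

def makeXy(ts, nb_timesteps):
    """Return X (n_samples, nb_timesteps) of lagged values and y of targets."""
    values = ts.to_numpy()  # positional access, independent of index labels
    X = []
    y = []
    for i in range(nb_timesteps, values.shape[0]):
        X.append(values[i-nb_timesteps:i])
        y.append(values[i])
    X, y = np.array(X), np.array(y)
    return X, y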

# Make training set
X_train, y_train = makeXy(df_train['scaled_PRES'], 7)
print('Shape of train arrays:', X_train.shape, y_train.shape)
print(X_train[0])
print(y_train[0])

# Make test set
X_test, y_test = makeXy(df_test['scaled_PRES'], 7)
print('Shape of Test arrays:', X_test.shape, y_test.shape)
print(X_test[0])
print(y_test[0])

#Import Keras modules
from keras.layers import Dense, Dropout
from keras.models import Sequential

model = Sequential()
model.add(Dense(32, activation='relu', input_shape=(7,)))
model.add(Dense(16, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='linear'))

model.compile(loss='mean_squared_error', optimizer='adam')
model.summary()

history=model.fit(X_train, y_train, batch_size=16, epochs=20,
             verbose=1, validation_split=0.3,
             shuffle=True)

preds = model.predict(X_test)
pred_PRES = scaler.inverse_transform(preds)
pred_PRES = np.squeeze(pred_PRES)


# R2 Calculation
from sklearn.metrics import r2_score
r2 = r2_score(df_test['PRES'].loc[7:], pred_PRES)
print('R-squared for the test set:', round(r2,4))

#Let's plot the first 50 actual and predicted values of air pressure.
plt.figure()
plt.plot(range(50), df_test['PRES'].loc[7:56], linestyle='-', marker='*', color='r')
plt.plot(range(50), pred_PRES[:50], linestyle='-', marker='.', color='b')
plt.legend(['Actual','Predicted'], loc=2)
plt.title('Actual vs Predicted Air Pressure')
plt.ylabel('Air Pressure')
plt.xlabel('Index')

 

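The most important part of this code is the makeXy function.

It builds the training pairs with a sliding window: X holds the previous 7 observations and y holds the target value to predict, both collected in lists and returned as NumPy arrays. Because the data is hourly, the 7 timesteps cover 7 hours rather than 7 days.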
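
The windowing logic is easier to see on a toy series. The sketch below mirrors what makeXy does, but on a plain Python list with a window of 3; the name `make_windows` and the toy values are hypothetical and used only for illustration.

```python
def make_windows(values, nb_timesteps):
    """Pair each value with the nb_timesteps values that precede it."""
    X, y = [], []
    for i in range(nb_timesteps, len(values)):
        X.append(values[i - nb_timesteps:i])  # the window of past values
        y.append(values[i])                   # the value to predict
    return X, y

toy = [10, 11, 12, 13, 14, 15]
X, y = make_windows(toy, nb_timesteps=3)

for window, target in zip(X, y):
    print(window, '->', target)
# [10, 11, 12] -> 13
# [11, 12, 13] -> 14
# [12, 13, 14] -> 15
```

A series of length n yields n - nb_timesteps samples, which is why the predictions in the script are compared against df_test['PRES'] starting from row 7.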

I plan to cover CNNs and RNNs in a separate extra post once I have run them during my teaching-assistant work.

 
