# Python 用于金融数据分析第9课 Part 2-----时间序列分析模型ARIMA实现

ARIMA模型实现预测的流程

## 一、导入数据

``````import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
``````

``````df=pd.read_csv('monthly_milk_production_pounds_p.csv')
df.columns=['Month', 'Milk Production']
df.drop(168, axis=0, inplace=True)
df.set_index(pd.to_datetime(df['Month']), inplace=True)
``````

## 二、绘制观测点图

``````plt.plot_date(x=df.index, y=df['Milk Production'], xdate=True, marker=None, linestyle='solid',color='darkblue', label=df.columns[1])
df['Milk Production'].rolling(6).mean().plot(color='darkgreen', label='rolling mean')
df['Milk Production'].rolling(6).std().plot(color='yellow', label='rolling std')
plt.legend(loc='upper left')
``````

### 3.1 检测不变性的显著性

``````from statsmodels.tsa.stattools import adfuller
labels=['adf statistics', 'pvalue', 'used lag', 'Number of observations']
for label, result in zip(labels, results):
print label+': '+str(result)
if results[1]<=0.05:
print "strong evidence against the null hypothesis, reject the null hypothesis. Data has no unit root and is stationary"
else:
print "weak evidence against the null hypothesis, accept the null hypothesis. Data has unit root and is nonstationary"
``````

（这里其实并不明白unit root是什么东西）

``````def adfuller_test(df_series):
labels=['adf statistics', 'pvalue', 'used lag', 'Number of observations']
for label, result in zip(labels, results):
print label+': '+str(result)
if results[1]<=0.05:
print "strong evidence against the null hypothesis, reject the null hypothesis. Data has no unit root and is stationary"
else:
print "weak evidence against the null hypothesis, accept the null hypothesis. Data has unit root and is nonstationary"
``````

### 3.2 转换不变性

#### 3.2.1 一次差分

``````#shift方法把索引往上移动一位，相当于数据往下移动一位
``````

``````df['Milk Production one difference']=df['Milk Production']-df['Milk Production'].shift(1)
``````

``````df['Milk Production one difference'].drop(1, axis=0, inplace=True).plot(color='darkgreen',label='Milk Production one difference')
plt.legend(loc='upper left')
``````

``````adfuller_test(df['Milk Production one difference'].dropna(axis=0))
``````

#### 3.2.2 季节性差分

``````df['seasonal difference']=df['Milk Production']-df['Milk Production'].shift(12)
df['seasonal difference'].plot(color='darkred',label='seasonal difference')
plt.legend(loc='upper left')
``````

``````adfuller_test(df['seasonal difference'].dropna(axis=0))
``````

#### 3.2.3 季节性一阶差分

``````df['seasonal one difference']=df['seasonal difference']-df['seasonal difference'].shift(1)
df['seasonal one difference'].plot(color='darkred',label='seasonal one difference')
plt.legend(loc='upper left')
``````

``````adfuller_test(df['seasonal one difference'].dropna(axis=0))
``````

## 四、绘制自相关和偏相关图，确定p, d, q

### 4.2 绘制一次差分和季节性一次差分的acf和pcf图

#### 4.2.1 一次差分的acf和pcf

acf图

``````from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
plot_acf(df['Milk Production one difference'].dropna())
``````

pacf图

``````plot_pacf(df['Milk Production one difference'].dropna())
``````

#### 4.2.2 季节性差分的一次差分的acf和pcf

acf图

``````plot_acf(df['seasonal one difference'].dropna())
``````

pacf图

``````plot_pacf(df['seasonal one difference'].dropna())
``````

#### 4.2.3 对季节性差分的一次差分最终绘图

``````#设定画布的大小
fig=plt.figure(figsize=(16, 6))
#增加第一个子图

plot_acf(df['seasonal one difference'].dropna(), ax=ax1, lags=40)#在这里我选择第一个子图绘图，并且以前面的40个时序数据为滞后来绘制自相关图

#增加第二个子图
plot_pacf(df['seasonal one difference'].dropna(), ax=ax2, lags=40)
``````

## 五、构建模型

### 5.1 构建模型

``````from statsmodels.tsa.statespace.sarimax import SARIMAX
model=SARIMAX(df['Milk Production'], order=(0, 1, 0), seasonal_order=(1, 1, 1, 12))
results=model.fit()
print results.summary()
``````

### 5.2 绘制残差

``````results.resid.dropna(axis=0, inplace=True)
results.resid
``````

image.png

``````results.resid.plot()
``````

``````results.resid.plot(kind='kde')
``````

## 六、预测未来值

### 6.1 预测值

``````df['predicted_values']=results.predict(start=158, end=168, dynamic=True)
df['predicted_values']
df[['Milk Production','predicted_values']].plot()
``````

### 6.2 预测未来值

pandas的date_range可以按照自己的需要生成一系列相应的日期序列

``````future_dates=pd.date_range(start='1976-1-1', periods=24,freq='MS')
future_dates
``````

MS是频率的单位，在这里表示的是按每个月步长，返回每个月的开始第一天month start的意思。

``````future=pd.DataFrame(index=future_dates, columns=df.columns)
final=pd.concat(objs=[df, future], axis=0)
``````
``````final['forecast_values']=results.predict(start=168, end=188, dynamic=True)
final[['Milk Production','forecast_values']].plot()
``````