A time series can be thought of as a vector or matrix of numbers along with some information about what times those numbers were recorded. This information is stored in a ts object in R. In most exercises, you will use time series that are part of existing packages. However, if you want to work with your own data, you need to know how to create a ts object in R.
Let’s look at an example, usnim_2002, shown below, which contains quarterly net interest margins for US banks for the year 2002 (source: FFIEC).
usnim_2002
usnim
1 2002-01-01 4.08
2 2002-04-01 4.10
3 2002-07-01 4.06
4 2002-10-01 4.04
# ts(data, start, frequency, ...)
usnim_ts <- ts(usnim_2002[, 2], start = c(2002, 1), frequency = 4)
The function ts() takes three main arguments: the data, the start period, and the frequency (the number of observations per unit of time).
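As a small illustration with made-up numbers, the same pattern works for monthly data by setting frequency = 12:

# Six monthly values starting in January 2020 (illustrative data only)
x <- ts(c(5, 7, 6, 8, 9, 10), start = c(2020, 1), frequency = 12)
x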
In this exercise, you will read in some time series data from an xlsx file using read_excel(), a function from the readxl package, and store the data as a ts object. Both the xlsx file and package have been loaded into your workspace.
library(readxl)
# Read the data from Excel into R
mydata <- read_excel("data/exercise1.xlsx")
New names:
* `` -> ...1
# Look at the first few lines of mydata
head(mydata)
| ...1   | Sales  | AdBudget | GDP   |
|--------|--------|----------|-------|
| Mar-81 | 1020.2 | 659.2    | 251.8 |
| Jun-81 | 889.2  | 589.0    | 290.9 |
| Sep-81 | 795.0  | 512.5    | 290.8 |
| Dec-81 | 1003.9 | 614.1    | 292.4 |
| Mar-82 | 1057.7 | 647.2    | 279.1 |
| Jun-82 | 944.4  | 602.0    | 254.0 |
Take a look at the dates - there are four observations in 1981, indicating quarterly data with a frequency of four rows per year. The first observation or start date is Mar-81, the first of four rows for year 1981, indicating that March corresponds with the first period.
# Create a ts object called myts
myts <- ts(mydata[, c(2:4)], start = c(1981, 1), frequency = 4)
head(myts)
Sales AdBudget GDP
[1,] 1020.2 659.2 251.8
[2,] 889.2 589.0 290.9
[3,] 795.0 512.5 290.8
[4,] 1003.9 614.1 292.4
[5,] 1057.7 647.2 279.1
[6,] 944.4 602.0 254.0
The first step in any data analysis task is to plot the data. Graphs enable you to visualize many features of the data, including patterns, unusual observations, changes over time, and relationships between variables. Just as the type of data determines which forecasting method to use, it also determines which graphs are appropriate.
You can use the autoplot() function to produce a time plot of the data with or without facets, or panels that display different subsets of data:
autoplot(usnim_2002, facets = FALSE)
The autoplot() function is one of many used in this course that accept boolean arguments. Both T and TRUE mean “true”, and F and FALSE mean “false”; however, T and F can be overwritten in your code. Therefore, you should rely only on TRUE and FALSE to set your indicators for the remainder of the course.
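A minimal sketch of why T is risky (reassigning T below is done purely to demonstrate the problem):

# T is an ordinary binding to TRUE and can be reassigned; TRUE is a reserved word
T <- FALSE
isTRUE(T)     # FALSE -- code relying on T to mean "true" now misbehaves
isTRUE(TRUE)  # TRUE  -- the reserved word cannot be overwritten
rm(T)         # remove the masking variable to restore the default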
You will use two more functions in this exercise, which.max() and frequency(). which.max() returns the smallest index of the maximum value:
> x <- c(4, 5, 5)
> which.max(x)
[1] 2
To find the number of observations per unit time, use frequency(). Recall the usnim_2002 data from the previous exercise:
> frequency(usnim_2002)
[1] 4
This course makes use of the forecast and ggplot2 packages, so load them first:
library(forecast)
package 'forecast' was built under R version 4.0.3
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo
library(ggplot2)
The following three series are available in the package forecast:
# Plot the data with facetting
autoplot(myts, facets = TRUE)
# Plot the data without facetting
autoplot(myts, facets = FALSE)
# Plot the three series
autoplot(gold)
autoplot(woolyrnq)
autoplot(gas)
# Find the outlier in the gold series
goldoutlier <- which.max(gold)
goldoutlier
[1] 770
# Look at the seasonal frequencies of the three series
frequency(gold)
[1] 1
frequency(woolyrnq)
[1] 4
frequency(gas)
[1] 12
Along with time plots, there are other useful ways of plotting data to emphasize seasonal patterns and show changes in these patterns over time.
- A seasonal plot is similar to a time plot except that the data are plotted against the individual “seasons” in which the data were observed. You can create one using the ggseasonplot() function the same way you do with autoplot().
- An interesting variant of a season plot uses polar coordinates, where the time axis is circular rather than horizontal; to make one, simply add a polar argument and set it to TRUE.
- A subseries plot comprises mini time plots for each season. Here, the mean for each season is shown as a blue horizontal line.
One way of splitting a time series is by using the window() function, which extracts a subset from the object x observed between the times start and end.
window(x, start = NULL, end = NULL)
In this exercise, you will load the fpp2 package and use two of its datasets:
# Load the fpp2 package
library(fpp2)
package 'fpp2' was built under R version 4.0.3
-- Attaching packages ---------------------------------------------- fpp2 2.4 --
v fma        2.4     v expsmooth  2.3
package 'fma' was built under R version 4.0.3
package 'expsmooth' was built under R version 4.0.3
# Create plots of the a10 data
autoplot(a10)
ggseasonplot(a10)
# Produce a polar coordinate season plot for the a10 data
ggseasonplot(a10, polar = TRUE)
# Restrict the ausbeer data to start in 1992
beer <- window(ausbeer, start = 1992)
# Make plots of the beer data
autoplot(beer)
ggsubseriesplot(beer)
Another way to look at time series data is to plot each observation against another observation that occurred some time previously by using gglagplot(). For example, you could plot y_t against y_{t−1}. This is called a lag plot because you are plotting the time series against lags of itself.
The correlations associated with the lag plots form what is called the autocorrelation function (ACF). The ggAcf() function produces ACF plots.
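As a rough sketch of the connection, the lag-1 value shown by ggAcf() is essentially the correlation between the series and a one-step-lagged copy of itself; the built-in lynx dataset is used here purely for illustration:

# Lag-1 autocorrelation computed two ways (they agree up to the normalization used)
n <- length(lynx)
cor(lynx[-n], lynx[-1])          # correlation of each value with the previous one
acf(lynx, plot = FALSE)$acf[2]   # the lag-1 value underlying the ACF plot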
In this exercise, you will work with the pre-loaded oil data (available in the package fpp2), which contains the annual oil production in Saudi Arabia from 1965-2013 (measured in millions of tons).
# Create an autoplot of the oil data
autoplot(oil)
# Create a lag plot of the oil data
gglagplot(oil)
# Create an ACF plot of the oil data
ggAcf(oil)
When data are either seasonal or cyclic, the ACF will peak around the seasonal lags or at the average cycle length.
You will investigate this phenomenon by plotting the annual sunspot series (which follows the solar cycle of approximately 10-11 years) in sunspot.year and the daily traffic to the Hyndsight blog (which follows a 7-day weekly pattern) in hyndsight.
# Plot the annual sunspot numbers
autoplot(sunspot.year)
ggAcf(sunspot.year)
# Save the lag corresponding to maximum autocorrelation
maxlag_sunspot <- 1
# Plot the traffic on the Hyndsight blog
autoplot(hyndsight)
ggAcf(hyndsight)
White noise is a term that describes purely random data. You can conduct a Ljung-Box test using the function below to confirm the randomness of a series; a p-value greater than 0.05 suggests that the data are not significantly different from white noise.
Box.test(pigs, lag = 24, fitdf = 0, type = "Ljung")
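For comparison, here is a small sketch applying the same test to simulated white noise; a purely random series should typically give a large p-value:

# Simulate 100 white noise observations and test them
set.seed(123)
wn <- ts(rnorm(100))
Box.test(wn, lag = 10, fitdf = 0, type = "Ljung")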
There is a well-known result in economics called the “Efficient Market Hypothesis” that states that asset prices reflect all available information. A consequence of this is that the daily changes in stock prices should behave like white noise (ignoring dividends, interest rates and transaction costs). The consequence for forecasters is that the best forecast of the future price is the current price.
You can test this hypothesis by looking at the goog series, which contains the closing stock price for Google over 1000 trading days ending on February 13, 2017.
# Plot the original series
autoplot(goog)
# Plot the differenced series
autoplot(diff(goog))
# ACF of the differenced series
ggAcf(diff(goog))
# Ljung-Box test of the differenced series
Box.test(diff(goog), lag = 10, type = "Ljung")
Box-Ljung test
data: diff(goog)
X-squared = 13.123, df = 10, p-value = 0.2169
You have seen that the Ljung-Box test was not significant, so daily changes in the Google stock price look like white noise.
A forecast is the mean or median of simulated futures of a time series.
The very simplest forecasting method is to use the most recent observation; this is called a naive forecast and is implemented in the naive() function. This is the best that can be done for many time series including most stock price data, and even if it is not a good forecasting method, it provides a useful benchmark for other forecasting methods.
For seasonal data, a related idea is to use the corresponding season from the last year of data. For example, if you want to forecast the sales volume for next March, you would use the sales volume from the previous March. This is implemented in the snaive() function, short for “seasonal naive”.
For both forecasting methods, you can set the second argument h, which specifies the number of values you want to forecast; as shown in the code below, they have different default values. The resulting output is an object of class forecast. This is the core class of objects in the forecast package, and there are many functions for dealing with them, including summary() and autoplot().
naive(y, h = 10)
snaive(y, h = 2 * frequency(x))
# Use naive() to forecast the goog series
fcgoog <- naive(goog, h = 20)
# Plot and summarize the forecasts
autoplot(fcgoog)
summary(fcgoog)
Forecast method: Naive method
Model Information:
Call: naive(y = goog, h = 20)
Residual sd: 8.7343
Error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set 0.4212612 8.734286 5.829407 0.06253998 0.9741428 1 0.03871446
Forecasts:
|      | Point Forecast | Lo 80    | Hi 80    | Lo 95    | Hi 95    |
|------|----------------|----------|----------|----------|----------|
| 1001 | 813.67         | 802.4765 | 824.8634 | 796.5511 | 830.7889 |
| 1002 | 813.67         | 797.8401 | 829.4999 | 789.4602 | 837.8797 |
| 1003 | 813.67         | 794.2824 | 833.0576 | 784.0192 | 843.3208 |
| 1004 | 813.67         | 791.2831 | 836.0569 | 779.4322 | 847.9078 |
| 1005 | 813.67         | 788.6407 | 838.6993 | 775.3910 | 851.9490 |
| 1006 | 813.67         | 786.2518 | 841.0882 | 771.7374 | 855.6025 |
| 1007 | 813.67         | 784.0549 | 843.2850 | 768.3777 | 858.9623 |
| 1008 | 813.67         | 782.0102 | 845.3298 | 765.2505 | 862.0895 |
| 1009 | 813.67         | 780.0897 | 847.2503 | 762.3133 | 865.0266 |
| 1010 | 813.67         | 778.2732 | 849.0667 | 759.5353 | 867.8047 |
# Use snaive() to forecast the ausbeer series
fcbeer <- snaive(ausbeer, h = 16)
# Plot and summarize the forecasts
autoplot(fcbeer)
summary(fcbeer)
Forecast method: Seasonal naive method
Model Information:
Call: snaive(y = ausbeer, h = 16)
Residual sd: 19.3259
Error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set 3.098131 19.32591 15.50935 0.838741 3.69567 1 0.01093868
Forecasts:
|         | Point Forecast | Lo 80    | Hi 80    | Lo 95    | Hi 95    |
|---------|----------------|----------|----------|----------|----------|
| 2010 Q3 | 419            | 394.2329 | 443.7671 | 381.1219 | 456.8781 |
| 2010 Q4 | 488            | 463.2329 | 512.7671 | 450.1219 | 525.8781 |
| 2011 Q1 | 414            | 389.2329 | 438.7671 | 376.1219 | 451.8781 |
| 2011 Q2 | 374            | 349.2329 | 398.7671 | 336.1219 | 411.8781 |
| 2011 Q3 | 419            | 383.9740 | 454.0260 | 365.4323 | 472.5677 |
| 2011 Q4 | 488            | 452.9740 | 523.0260 | 434.4323 | 541.5677 |
| 2012 Q1 | 414            | 378.9740 | 449.0260 | 360.4323 | 467.5677 |
| 2012 Q2 | 374            | 338.9740 | 409.0260 | 320.4323 | 427.5677 |
| 2012 Q3 | 419            | 376.1020 | 461.8980 | 353.3932 | 484.6068 |
| 2012 Q4 | 488            | 445.1020 | 530.8980 | 422.3932 | 553.6068 |
When applying a forecasting method, it is important to always check that the residuals are well-behaved (i.e., no outliers or patterns) and resemble white noise. The prediction intervals are computed assuming that the residuals are also normally distributed. You can use the checkresiduals() function to verify these characteristics; it will give the results of a Ljung-Box test.
In this exercise, you will test the above functions on the forecasts equivalent to what you produced in the previous exercise (fcgoog obtained after applying naive() to goog, and fcbeer obtained after applying snaive() to ausbeer).
# Check the residuals from the naive forecasts applied to the goog series
goog %>% naive() %>% checkresiduals()
Ljung-Box test
data: Residuals from Naive method
Q* = 13.123, df = 10, p-value = 0.2169
Model df: 0. Total lags used: 10
# Do they look like white noise (TRUE or FALSE)
googwn <- TRUE
# Check the residuals from the seasonal naive forecasts applied to the ausbeer series
ausbeer %>% snaive() %>% checkresiduals()
Ljung-Box test
data: Residuals from Seasonal naive method
Q* = 60.535, df = 8, p-value = 3.661e-10
Model df: 0. Total lags used: 8
# Do they look like white noise (TRUE or FALSE)
beerwn <- FALSE
In data science, a training set is a data set that is used to discover possible relationships. A test set is a data set that is used to verify the strength of these potential relationships. When you separate a data set into these parts, you generally allocate more of the data for training, and less for testing.
One function that can be used to create training and test sets is subset.ts(), which returns a subset of a time series where the optional start and end arguments are specified using index values.
# x is a numerical vector or time series
# To subset observations from 101 to 500
train <- subset(x, start = 101, end = 500, ...)
# To subset the first 500 observations
train <- subset(x, end = 500, ...)
Another function, accuracy(), computes various forecast accuracy statistics given the forecasts and the corresponding actual observations. It is smart enough to find the relevant observations if you give it more than the ones you are forecasting.
> # f is an object of class "forecast"
> # x is a numerical vector or time series
> accuracy(f, x, ...)
The accuracy measures provided include root mean squared error (RMSE) which is the square root of the mean squared error (MSE). Minimizing RMSE, which corresponds with increasing accuracy, is the same as minimizing MSE.
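A tiny sketch of that relationship, using made-up forecast errors:

# RMSE is just the square root of MSE, so both are minimized by the same forecasts
errors <- c(2, -1, 3, 0.5)
mse <- mean(errors^2)
c(MSE = mse, RMSE = sqrt(mse))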
The pre-loaded time series gold comprises daily gold prices for 1108 days. Here, you’ll use the first 1000 days as a training set, and compute forecasts for the remaining 108 days. These will be compared to the actual values for these days using the simple forecasting functions naive(), which you used earlier in this chapter, and meanf(), which gives forecasts equal to the mean of all observations. You’ll have to specify the keyword h (which specifies the number of values you want to forecast) for both.
# Create the training data as train
train <- subset(gold, end = 1000)
# Compute naive forecasts and save to naive_fc
naive_fc <- naive(train, h = 108)
# Compute mean forecasts and save to mean_fc
mean_fc <- meanf(train, h = 108)
# Use accuracy() to compute RMSE statistics
accuracy(naive_fc, gold)
ME RMSE MAE MPE MAPE MASE ACF1 Theil's U
Training set 0.1079897 6.358087 3.20366 0.0201449 0.8050646 1.014334 -0.3086638 NA
Test set -6.5383495 15.842361 13.63835 -1.7462269 3.4287888 4.318139 0.9793153 5.335899
accuracy(mean_fc, gold)
ME RMSE MAE MPE MAPE MASE ACF1 Theil's U
Training set -4.239671e-15 59.17809 53.63397 -2.390227 14.230224 16.981449 0.9907254 NA
Test set 1.319363e+01 19.55255 15.66875 3.138577 3.783133 4.960998 0.9793153 6.123788
# Assign one of the two forecasts as bestforecasts
bestforecasts <- naive_fc
The window() function specifies the start and end of a time series using the relevant times rather than the index values. Either of those two arguments can be formatted as a vector like c(year, period) which you have also previously used as an argument for ts(). Again, period refers to quarter here.
Here, you will use the Melbourne quarterly visitor numbers (vn[, "Melbourne"]) to create three different training sets, omitting the last 1, 2 and 3 years, respectively. Inspect the pre-loaded vn data in your console before beginning the exercise; this will help you determine the correct value to use for the keyword h (which specifies the number of values you want to forecast) in your forecasting methods.
Then for each training set, forecast the next year of data, and finally compare the mean absolute percentage error (MAPE) of the forecasts using accuracy(). Why do you think the MAPE values vary so much?
vn <- readRDS("vn.RDS")
vn <- ts(vn[, c(2:8)], start = c(1998, 1), frequency = 4)
# Create three training series omitting the last 1, 2, and 3 years
train1 <- window(vn[, "Melbourne"], end = c(2014, 4))
train2 <- window(vn[, "Melbourne"], end = c(2013, 4))
train3 <- window(vn[, "Melbourne"], end = c(2012, 4))
# Produce forecasts using snaive()
fc1 <- snaive(train1, h = 4)
fc2 <- snaive(train2, h = 4)
fc3 <- snaive(train3, h = 4)
# Use accuracy() to compare the MAPE of each series
accuracy(fc1, vn[, "Melbourne"])["Test set", "MAPE"]
[1] 5.477917
accuracy(fc2, vn[, "Melbourne"])["Test set", "MAPE"]
[1] 12.50071
accuracy(fc3, vn[, "Melbourne"])["Test set", "MAPE"]
[1] 7.950202
A good model forecasts well (so has low RMSE on the test set) and uses all available information in the training data (so has white noise residuals).
The tsCV() function computes time series cross-validation errors. It requires you to specify the time series, the forecast method, and the forecast horizon.
e = tsCV(oil, forecastfunction = naive, h = 1)
Here, you will use tsCV() to compute and plot the MSE values for up to 8 steps ahead, using the naive() method applied to the goog data.
# Compute cross-validated errors for up to 8 steps ahead
e <- tsCV(goog, forecastfunction = naive, h = 8)
# Compute the MSE values and remove missing values
mse <- colMeans(e^2, na.rm = TRUE)
# Plot the MSE values against the forecast horizon
data.frame(h = 1:8, MSE = mse) %>%
ggplot(aes(x = h, y = MSE)) + geom_point()
data.frame(h = 1:8, MSE = mse)
|     | h | MSE       |
|-----|---|-----------|
| h=1 | 1 | 76.28775  |
| h=2 | 2 | 158.96282 |
| h=3 | 3 | 241.54818 |
| h=4 | 4 | 314.72312 |
| h=5 | 5 | 379.76003 |
| h=6 | 6 | 441.01128 |
| h=7 | 7 | 504.44510 |
| h=8 | 8 | 566.55904 |
The ses() function produces forecasts obtained using simple exponential smoothing (SES). The parameters are estimated using least squares estimation. All you need to specify is the time series and the forecast horizon; the default is h = 10 periods ahead.
> args(ses)
function (y, h = 10, ...)
> fc <- ses(oildata, h = 5)
> summary(fc)
You will also use summary() and fitted(), along with autolayer() for the first time, which is like autoplot() but it adds a “layer” to a plot rather than creating a new plot.
Here, you will apply these functions to marathon, the annual winning times in the Boston marathon from 1897-2016. The data are available in your workspace.
# Use ses() to forecast the next 10 years of winning times
fc <- ses(marathon, h = 10)
# Use summary() to see the model parameters
summary(fc)
Forecast method: Simple exponential smoothing
Model Information:
Simple exponential smoothing
Call:
ses(y = marathon, h = 10)
Smoothing parameters:
alpha = 0.3457
Initial states:
l = 167.1741
sigma: 5.519
AIC AICc BIC
988.4474 988.6543 996.8099
Error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set -0.8874349 5.472771 3.826294 -0.7097395 2.637644 0.8925685 -0.01211236
Forecasts:
|      | Point Forecast | Lo 80    | Hi 80    | Lo 95    | Hi 95    |
|------|----------------|----------|----------|----------|----------|
| 2017 | 130.3563       | 123.2835 | 137.4292 | 119.5394 | 141.1733 |
| 2018 | 130.3563       | 122.8727 | 137.8399 | 118.9111 | 141.8015 |
| 2019 | 130.3563       | 122.4833 | 138.2293 | 118.3156 | 142.3970 |
| 2020 | 130.3563       | 122.1123 | 138.6003 | 117.7482 | 142.9644 |
| 2021 | 130.3563       | 121.7573 | 138.9553 | 117.2053 | 143.5074 |
| 2022 | 130.3563       | 121.4164 | 139.2963 | 116.6839 | 144.0288 |
| 2023 | 130.3563       | 121.0880 | 139.6247 | 116.1816 | 144.5310 |
| 2024 | 130.3563       | 120.7708 | 139.9418 | 115.6966 | 145.0161 |
| 2025 | 130.3563       | 120.4639 | 140.2488 | 115.2271 | 145.4856 |
| 2026 | 130.3563       | 120.1661 | 140.5466 | 114.7717 | 145.9409 |
# Use autoplot() to plot the forecasts
autoplot(fc)
# Add the one-step forecasts for the training data to the plot
autoplot(fc) + autolayer(fitted(fc))
In this exercise, you will apply your knowledge of training and test sets, the subset() function, and the accuracy() function, all of which you learned in Chapter 2, to compare SES and naive forecasts for the marathon data.
You did something very similar to compare the naive and mean forecasts in an earlier exercise “Evaluating forecast accuracy of non-seasonal methods”.
Let’s review the process:
# Create a training set using subset()
train <- subset(marathon, end = length(marathon) - 20)
# Compute SES and naive forecasts, save to fcses and fcnaive
fcses <- ses(train, h = 20)
fcnaive <- naive(train, h = 20)
# Calculate forecast accuracy measures
accuracy(fcses, marathon)
ME RMSE MAE MPE MAPE MASE ACF1 Theil's U
Training set -1.0851741 5.863790 4.155948 -0.8603998 2.827993 0.8990906 -0.01595953 NA
Test set 0.4574579 2.493971 1.894237 0.3171919 1.463862 0.4097960 -0.12556096 0.6870735
accuracy(fcnaive, marathon)
ME RMSE MAE MPE MAPE MASE ACF1 Theil's U
Training set -0.4638047 6.904742 4.622391 -0.4086317 3.123559 1.0000000 -0.3589323 NA
Test set 0.2266667 2.462113 1.846667 0.1388780 1.429608 0.3995047 -0.1255610 0.6799062
Holt’s local trend method is implemented in the holt() function:
holt(y, h = 10, ...)
Here, you will apply it to the austa series, which contains annual counts of international visitors to Australia from 1980-2015 (in millions).
# Produce 10 year forecasts of austa using holt()
fcholt <- holt(austa, h=10)
# Look at fitted model using summary()
summary(fcholt)
Forecast method: Holt's method
Model Information:
Holt's method
Call:
holt(y = austa, h = 10)
Smoothing parameters:
alpha = 0.9999
beta = 0.0085
Initial states:
l = 0.656
b = 0.1706
sigma: 0.1952
AIC AICc BIC
17.14959 19.14959 25.06719
Error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set 0.00372838 0.1840662 0.1611085 -1.222083 5.990319 0.7907078 0.2457733
Forecasts:
|      | Point Forecast | Lo 80    | Hi 80    | Lo 95    | Hi 95    |
|------|----------------|----------|----------|----------|----------|
| 2016 | 7.030683       | 6.780483 | 7.280882 | 6.648036 | 7.413330 |
| 2017 | 7.202446       | 6.847114 | 7.557778 | 6.659013 | 7.745879 |
| 2018 | 7.374209       | 6.937169 | 7.811249 | 6.705814 | 8.042604 |
| 2019 | 7.545972       | 7.039179 | 8.052765 | 6.770899 | 8.321045 |
| 2020 | 7.717736       | 7.148723 | 8.286748 | 6.847506 | 8.587965 |
| 2021 | 7.889499       | 7.263543 | 8.515455 | 6.932181 | 8.846816 |
| 2022 | 8.061262       | 7.382302 | 8.740222 | 7.022882 | 9.099642 |
| 2023 | 8.233025       | 7.504136 | 8.961915 | 7.118285 | 9.347766 |
| 2024 | 8.404788       | 7.628444 | 9.181133 | 7.217472 | 9.592105 |
| 2025 | 8.576552       | 7.754792 | 9.398311 | 7.319779 | 9.833324 |
# Plot the forecasts
autoplot(fcholt)
# Check that the residuals look like white noise
checkresiduals(fcholt)
Ljung-Box test
data: Residuals from Holt's method
Q* = 4.8886, df = 3, p-value = 0.1801
Model df: 4. Total lags used: 7
You learned that the hw() function produces forecasts using the Holt-Winters method, with the type of seasonality controlled by the seasonal argument:
fc1 <- hw(aust, seasonal = "additive")
fc2 <- hw(aust, seasonal = "multiplicative")
Here, you will apply hw() to a10, the monthly sales of anti-diabetic drugs in Australia from 1991 to 2008.
# Plot the data
autoplot(a10)
# Produce 3 year forecasts
fc <- hw(a10, seasonal = "multiplicative", h = 36)
# Check if residuals look like white noise
checkresiduals(fc)
Ljung-Box test
data: Residuals from Holt-Winters' multiplicative method
Q* = 75.764, df = 8, p-value = 3.467e-13
Model df: 16. Total lags used: 24
whitenoise <- FALSE
# Plot forecasts
autoplot(fc)
The forecasts might still provide useful information even with residuals that fail the white noise test.
The Holt-Winters method can also be used for daily data, where the seasonal pattern has length 7 and the appropriate unit of time for h is days.
Here, you will compare an additive Holt-Winters method and a seasonal naive() method for the hyndsight data, which contains the daily pageviews on the Hyndsight blog for one year starting April 30, 2014.
# Create training data with subset()
train <- subset(hyndsight, end = length(hyndsight) - 28)
# Holt-Winters additive forecasts as fchw
fchw <- hw(train, seasonal = "additive", h = 28)
# Seasonal naive forecasts as fcsn
fcsn <- snaive(train, h = 28)
# Find better forecasts with accuracy()
accuracy(fchw, hyndsight)
ME RMSE MAE MPE MAPE MASE ACF1 Theil's U
Training set -3.976241 228.2440 165.0244 -2.407211 13.9955 0.7492131 0.1900853 NA
Test set -3.999460 201.7656 152.9584 -3.218292 10.5558 0.6944332 0.3013328 0.4868701
accuracy(fcsn, hyndsight)
ME RMSE MAE MPE MAPE MASE ACF1 Theil's U
Training set 10.50 310.3282 220.2636 -2.1239387 18.01077 1.0000000 0.4255730 NA
Test set 0.25 202.7610 160.4643 -0.6888732 10.25880 0.7285101 0.3089795 0.450266
# Plot the better forecasts
autoplot(fchw)
The ets() function, named for error, trend, and seasonality (ETS), provides a completely automatic way of producing forecasts for a wide range of time series.
# Fit ETS model to austa in fitaus
fitaus <- ets(austa)
# Check residuals
checkresiduals(fitaus)
Ljung-Box test
data: Residuals from ETS(A,A,N)
Q* = 4.8886, df = 3, p-value = 0.1801
Model df: 4. Total lags used: 7
# Plot forecasts
autoplot(forecast(fitaus))
# Repeat for hyndsight data in fiths
fiths <- ets(hyndsight)
checkresiduals(fiths)
Ljung-Box test
data: Residuals from ETS(A,N,A)
Q* = 68.616, df = 5, p-value = 1.988e-13
Model df: 9. Total lags used: 14
autoplot(forecast(fiths))
# Which model(s) fails test? (TRUE or FALSE)
fitausfail <- FALSE
fithsfail <- TRUE
Here, you will compare ETS forecasts against seasonal naive forecasts for the last 20 years of the qcement data, which contains quarterly cement production, using time series cross-validation with a forecast horizon of 4 steps ahead.
The second argument of tsCV() must be a function that returns a forecast object, so you need to write a function that fits a model and returns forecasts. Recall:
> args(tsCV)
function (y, forecastfunction, h = 1, ...)
cement_ <- subset(qcement, start = length(qcement) - 80)
# Function to return ETS forecasts
fets <- function(y, h) {
forecast(ets(y), h = h)
}
# Apply tsCV() for both methods
e1 <- tsCV(cement_, fets, h = 4)
e2 <- tsCV(cement_, snaive, h = 4)
# Compute MSE of resulting errors (watch out for missing values)
mean(e1^2, na.rm = TRUE)
[1] 0.0329896
mean(e2^2, na.rm = TRUE)
[1] 0.03046892
# Copy the best forecast MSE
bestmse <- mean(e2^2, na.rm = TRUE)
Fitting an ETS model does not work well for all series.
Here, you will observe why it does not work well for the annual Canadian lynx population available in your workspace as lynx.
# Plot the lynx series
autoplot(lynx)
# Use ets() to model the lynx series
fit <- ets(lynx)
# Use summary() to look at model and parameters
summary(fit)
ETS(M,N,N)
Call:
ets(y = lynx)
Smoothing parameters:
alpha = 0.9999
Initial states:
l = 2372.8047
sigma: 0.9594
AIC AICc BIC
2058.138 2058.356 2066.346
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set 8.975647 1198.452 842.0649 -52.12968 101.3686 1.013488 0.3677583
# Plot 20-year forecasts of the lynx series
fit %>% forecast(h = 20) %>% autoplot()
It’s important to realize that ETS doesn’t work for all cases.
Here, you will use a Box-Cox transformation to stabilize the variance of the pre-loaded a10 series, which contains monthly anti-diabetic drug sales in Australia from 1991-2008.
In this exercise, you will need to experiment to see the effect of the lambda (λ) argument on the transformation. Notice that small changes in λ make little difference to the resulting series. You want to find a value of λ that makes the seasonal fluctuations of roughly the same size across the series.
Recall from the video that the recommended range for lambda values is −1≤λ≤1.
# Plot the series
autoplot(a10)
# Try four values of lambda in Box-Cox transformations
a10 %>% BoxCox(lambda = 0.0) %>% autoplot()
a10 %>% BoxCox(lambda = 0.1) %>% autoplot()
a10 %>% BoxCox(lambda = 0.2) %>% autoplot()
a10 %>% BoxCox(lambda = 0.3) %>% autoplot()
# Compare with BoxCox.lambda()
BoxCox.lambda(a10)
[1] 0.1313326
It seems like a lambda of .13 would work well.
Differencing is a way of making a time series stationary; this means that you remove any systematic patterns such as trend and seasonality from the data. A white noise series is considered a special case of a stationary time series.
With non-seasonal data, you use lag-1 differences to model changes between observations rather than the observations directly. You have done this before by using the diff() function.
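As a quick sketch with made-up numbers, diff() simply returns the change from one observation to the next:

x <- c(10, 12, 15, 14)
diff(x)                 # 2  3 -1
x[-1] - x[-length(x)]   # the same changes computed by hand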
In this exercise, you will use the pre-loaded wmurders data, which contains the annual female murder rate in the US from 1950-2004.
# Plot the US female murder rate
autoplot(wmurders)
# Plot the differenced murder rate
autoplot(diff(wmurders))
# Plot the ACF of the differenced murder rate
ggAcf(diff(wmurders))
It seems like the data look like white noise after differencing.
With seasonal data, differences are often taken between observations in the same season of consecutive years, rather than in consecutive periods. For example, with quarterly data, one would take the difference between Q1 in one year and Q1 in the previous year. This is called seasonal differencing.
Sometimes you need to apply both seasonal differences and lag-1 differences to the same series, thus, calculating the differences in the differences.
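Here is a small sketch with made-up quarterly numbers showing a seasonal (lag-4) difference followed by a lag-1 difference:

x <- ts(c(10, 20, 30, 40, 12, 23, 33, 45), start = c(2000, 1), frequency = 4)
diff(x, lag = 4)         # each quarter minus the same quarter one year earlier: 2 3 3 5
diff(diff(x, lag = 4))   # the difference of the differences: 1 0 2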
In this exercise, you will use differencing and transformations simultaneously to make a time series look stationary. The data set here is h02, which contains 17 years of monthly corticosteroid drug sales in Australia.
# Plot the data
autoplot(h02)
# Take logs and seasonal differences of h02
difflogh02 <- diff(log(h02), lag = 12)
# Plot difflogh02
autoplot(difflogh02)
# Take another difference and plot
ddifflogh02 <- diff(difflogh02)
autoplot(ddifflogh02)
# Plot ACF of ddifflogh02
ggAcf(ddifflogh02)
The data doesn’t look like white noise after the transformation, but you could develop an ARIMA model for it.
The auto.arima() function will select an appropriate autoregressive integrated moving average (ARIMA) model given a time series, just like the ets() function does for ETS models. The summary() function can provide some additional insights:
> # p = 2, d = 1, q = 2
> summary(fit)
Series: usnetelec
ARIMA(2,1,2) with drift
...
In this exercise, you will automatically choose an ARIMA model for the pre-loaded austa series, which contains the annual number of international visitors to Australia from 1980-2015. You will then check the residuals (recall that a p-value greater than 0.05 indicates that the data resemble white noise) and produce some forecasts. Other than the modelling function, this is identical to what you did with ETS forecasting.
# Fit an automatic ARIMA model to the austa series
fit <- auto.arima(austa)
# Check that the residuals look like white noise
checkresiduals(fit)
Ljung-Box test
data: Residuals from ARIMA(0,1,1) with drift
Q* = 2.297, df = 5, p-value = 0.8067
Model df: 2. Total lags used: 7
residualsok <- TRUE
# Summarize the model
summary(fit)
Series: austa
ARIMA(0,1,1) with drift
Coefficients:
ma1 drift
0.3006 0.1735
s.e. 0.1647 0.0390
sigma^2 estimated as 0.03376: log likelihood=10.62
AIC=-15.24 AICc=-14.46 BIC=-10.57
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set 0.0008313383 0.1759116 0.1520309 -1.069983 5.513269 0.7461559 -0.000571993
# Find the AICc value and the number of differences used
AICc <- -14.46
d <- 1
# Plot forecasts of fit
fit %>% forecast(h = 10) %>% autoplot()
The automatic method in the previous exercise chose an ARIMA(0,1,1) with drift model for the austa data, that is, y_t = c + y_{t−1} + θe_{t−1} + e_t. You will now experiment with various other ARIMA models for the data to see what difference it makes to the forecasts.
The Arima() function can be used to select a specific ARIMA model. Its first argument, order, is set to a vector that specifies the values of p, d and q. The second argument, include.constant, is a boolean that determines if the constant c, or drift, should be included. Below is an example of a pipeline that would plot forecasts of usnetelec from an ARIMA(2,1,2) model with drift:
usnetelec %>%
Arima(order = c(2,1,2), include.constant = TRUE) %>%
forecast() %>%
autoplot()
In the examples here, watch for how the different models affect the forecasts and the prediction intervals.
# Plot forecasts from an ARIMA(0,1,1) model with no drift
austa %>% Arima(order = c(0, 1, 1), include.constant = FALSE) %>% forecast() %>% autoplot()
# Plot forecasts from an ARIMA(2,1,3) model with drift
austa %>% Arima(order = c(2, 1, 3), include.constant = TRUE) %>% forecast() %>% autoplot()
# Plot forecasts from an ARIMA(0,0,1) model with a constant
austa %>% Arima(order = c(0, 0, 1), include.constant = TRUE) %>% forecast() %>% autoplot()
# Plot forecasts from an ARIMA(0,2,1) model with no constant
austa %>% Arima(order = c(0, 2, 1), include.constant = FALSE) %>% forecast() %>% autoplot()
The AICc statistic is useful for selecting between models in the same class. For example, you can use it to select an ETS model or to select an ARIMA model. However, you cannot use it to compare ETS and ARIMA models because they are in different model classes.
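As a hedged sketch of a within-class comparison (the two model orders below are chosen only for illustration), you could fit two candidate ARIMA models to the same series and compare their AICc values, where smaller is better:

fit_a <- Arima(austa, order = c(0, 1, 1), include.constant = TRUE)
fit_b <- Arima(austa, order = c(1, 1, 0), include.constant = TRUE)
c(fit_a = fit_a$aicc, fit_b = fit_b$aicc)   # compare AICc within the ARIMA class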
Instead, you can use time series cross-validation to compare an ARIMA model and an ETS model on the austa data. Because tsCV() requires functions that return forecast objects, you will set up some simple functions that fit the models and return the forecasts. The arguments of tsCV() are a time series, forecast function, and forecast horizon h. Examine this code snippet from the second chapter:
e <- matrix(NA_real_, nrow = 1000, ncol = 8)
for (h in 1:8)
e[, h] <- tsCV(goog, naive, h = h)
...
Furthermore, recall that pipe operators in R take the value of whatever is on the left and pass it as an argument to whatever is on the right, step by step, from left to right. Here’s an example based on code you saw in an earlier chapter:
# Plot 20-year forecasts of the lynx series modeled by ets()
lynx %>% ets() %>% forecast(h = 20) %>% autoplot()
In this exercise, you will compare the MSE of two forecast functions applied to austa, and plot forecasts of the function that computes the best forecasts.
# Set up forecast functions for ETS and ARIMA models
fets <- function(x, h) {
forecast(ets(x), h = h)
}
farima <- function(x, h) {
forecast(auto.arima(x), h = h)
}
# Compute CV errors for ETS on austa as e1
e1 <- tsCV(austa, fets, h = 1)
# Compute CV errors for ARIMA on austa as e2
e2 <- tsCV(austa, farima, h = 1)
# Find MSE of each model class
mean(e1^2, na.rm = TRUE)
[1] 0.05623684
mean(e2^2, na.rm = TRUE)
[1] 0.04336277
# Plot 10-year forecasts using the best model class
austa %>% farima(h = 10) %>% autoplot()
The auto.arima() function also works with seasonal data. Note that setting lambda = 0 in the auto.arima() function - applying a log transformation - means that the model will be fitted to the transformed data, and that the forecasts will be back-transformed onto the original scale.
After applying summary() to this kind of fitted model, you may see something like the output below which corresponds with (p,d,q)(P,D,Q)[m]:
ARIMA(0,1,4)(0,1,1)[12]
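As a hedged sketch, this notation maps directly onto the order and seasonal arguments of Arima(), with [m] simply being the frequency of the series:

# Fit the ARIMA(0,1,4)(0,1,1)[12] shown above to the monthly h02 data (lambda = 0 takes logs)
fit_manual <- Arima(h02, order = c(0, 1, 4), seasonal = c(0, 1, 1), lambda = 0)
summary(fit_manual)   # reported with [12] because frequency(h02) is 12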
In this exercise, you will use these functions to model and forecast the pre-loaded h02 data, which contains monthly sales of corticosteroid drugs in Australia.
# Check that the logged h02 data have stable variance
h02 %>% log() %>% autoplot()
# Fit a seasonal ARIMA model to h02 with lambda = 0
fit <- auto.arima(h02, lambda = 0)
# Summarize the fitted model
summary(fit)
Series: h02
ARIMA(2,1,1)(0,1,2)[12]
Box Cox transformation: lambda= 0
Coefficients:
ar1 ar2 ma1 sma1 sma2
-1.1358 -0.5753 0.3683 -0.5318 -0.1817
s.e. 0.1608 0.0965 0.1884 0.0838 0.0881
sigma^2 estimated as 0.004278: log likelihood=248.25
AIC=-484.51 AICc=-484.05 BIC=-465
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set -0.003931805 0.0501571 0.03629816 -0.5323365 4.611253 0.5987988 -0.003740267
# Record the amount of lag-1 differencing and seasonal differencing used
d <- 1
D <- 1
# Plot 2-year forecasts
fit %>% forecast(h = 24) %>% autoplot()
auto.arima() is flexible enough to even work with seasonal time series!
The auto.arima() function needs to estimate a lot of different models, and various short-cuts are used to try to make the function as fast as possible. This can cause a model to be returned which does not actually have the smallest AICc value. To make auto.arima() work harder to find a good model, add the optional argument stepwise = FALSE to look at a much larger collection of models.
Here, you will try finding an ARIMA model for the pre-loaded euretail data, which contains quarterly retail trade in the Euro area from 1996-2011.
autoplot(euretail)
# Find an ARIMA model for euretail
fit1 <- auto.arima(euretail)
# Don't use a stepwise search
fit2 <- auto.arima(euretail, stepwise = FALSE)
summary(fit1)
Series: euretail
ARIMA(0,1,3)(0,1,1)[4]
Coefficients:
ma1 ma2 ma3 sma1
0.2630 0.3694 0.4200 -0.6636
s.e. 0.1237 0.1255 0.1294 0.1545
sigma^2 estimated as 0.156: log likelihood=-28.63
AIC=67.26 AICc=68.39 BIC=77.65
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set -0.02965298 0.3661147 0.2787802 -0.02795377 0.2885545 0.2267735 0.006455781
summary(fit2)
Series: euretail
ARIMA(0,1,3)(0,1,1)[4]
Coefficients:
ma1 ma2 ma3 sma1
0.2630 0.3694 0.4200 -0.6636
s.e. 0.1237 0.1255 0.1294 0.1545
sigma^2 estimated as 0.156: log likelihood=-28.63
AIC=67.26 AICc=68.39 BIC=77.65
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set -0.02965298 0.3661147 0.2787802 -0.02795377 0.2885545 0.2267735 0.006455781
# AICc of better model
AICc <- 68.39
# Compute 2-year forecasts from better model
fit2 %>% forecast(h = 8) %>% autoplot()
What happens when you want to create training and test sets for data that is more frequent than yearly? If needed, you can use a vector in the form c(year, period) for the start and/or end keywords in the window() function. You must also ensure that you’re using the appropriate values of h in forecasting functions. Recall that h should be equal to the length of the data that makes up your test set.
For example, if your data spans 15 years, your training set consists of the first 10 years, and you intend to forecast the last 5 years of data, you would use h = 12 * 5, not h = 5, because your test set would include 60 monthly observations. If instead your training set consists of the first 9.5 years and you want to forecast the last 5.5 years, you would use h = 66 to account for the extra 6 months.
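A hedged sketch of this arithmetic using the built-in monthly AirPassengers series (1949-1960): training on the first 10 years leaves 2 years, that is 24 monthly observations, for the test set, so h = 24.

ap_train <- window(AirPassengers, end = c(1958, 12))
ap_test  <- window(AirPassengers, start = c(1959, 1))
length(ap_test)   # 24 = 12 months x 2 years, the value to use for h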
In the final exercise for this chapter, you will compare seasonal ARIMA and ETS models applied to the quarterly cement production data qcement. Because the series is very long, you can afford to use a training and test set rather than time series cross-validation. This is much faster.
# Use 20 years of the qcement data beginning in 1988
train <- window(qcement, start = c(1988, 1), end = c(2007, 4))
# Fit an ARIMA and an ETS model to the training data
fit1 <- auto.arima(train)
fit2 <- ets(train)
# Check that both models have white noise residuals
checkresiduals(fit1)
Ljung-Box test
data: Residuals from ARIMA(1,0,1)(2,1,1)[4] with drift
Q* = 3.3058, df = 3, p-value = 0.3468
Model df: 6. Total lags used: 9
checkresiduals(fit2)
Ljung-Box test
data: Residuals from ETS(M,N,M)
Q* = 6.3457, df = 3, p-value = 0.09595
Model df: 6. Total lags used: 9
# Produce forecasts for each model
fc1 <- forecast(fit1, h = 25)
fc2 <- forecast(fit2, h = 25)
# Use accuracy() to find better model based on RMSE
accuracy(fc1, qcement)
ME RMSE MAE MPE MAPE MASE ACF1 Theil's U
Training set -0.006205705 0.1001195 0.07988903 -0.6704455 4.372443 0.5458078 -0.01133907 NA
Test set -0.158835253 0.1996098 0.16882205 -7.3332836 7.719241 1.1534049 0.29170452 0.7282225
accuracy(fc2, qcement)
ME RMSE MAE MPE MAPE MASE ACF1 Theil's U
Training set 0.01406512 0.1022079 0.07958478 0.4938163 4.371823 0.5437292 -0.03346295 NA
Test set -0.13495515 0.1838791 0.15395141 -6.2508975 6.986077 1.0518075 0.53438371 0.680556
bettermodel <- fit2
Looks like the ETS model did better here.
The auto.arima() function will fit a dynamic regression model with ARIMA errors. The only change to how you used it previously is that you will now use the xreg argument containing a matrix of regression variables.
fit <- auto.arima(uschange[, "Consumption"],
xreg = uschange[, "Income"])
# rep(x, times)
fcast <- forecast(fit, xreg = rep(0.8, 8))
You can see that the data is set to the Consumption column of uschange, and the regression variable is the Income column. Furthermore, the rep() function in this case would replicate the value 0.8 exactly eight times for the matrix argument xreg.
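For instance, rep() produces the vector of future regressor values that forecast() expects:

rep(0.8, 8)
[1] 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8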
In this exercise, you will model sales data regressed against advertising expenditure, with an ARMA error to account for any serial correlation in the regression errors. The data are available in your workspace as advsales and comprise 24 months of sales and advertising expenditure for an automotive parts company. The plot shows sales vs advertising expenditure.
ggplot(aes(x = advert, y = sales), data = as.data.frame(advsales)) +
geom_point()
# Time plot of both variables
autoplot(advsales, facets = TRUE)
Fit a regression with ARIMA errors to advsales by setting the first argument of auto.arima() to the “sales” column, the second argument xreg to the “advert” column, and the third argument stationary to TRUE.
# Fit ARIMA model
fit <- auto.arima(advsales[, "sales"], xreg = advsales[,"advert"], stationary = TRUE)
Check that the fitted model is a regression with AR(1) errors. What is the increase in sales for every unit increase in advertising? This coefficient is the third element in the coefficients() output.
summary(fit)
Series: advsales[, "sales"]
Regression with ARIMA(1,0,0) errors
Coefficients:
ar1 intercept xreg
0.6412 20.8737 0.0926
s.e. 0.1559 2.0842 0.0426
sigma^2 estimated as 18.48: log likelihood=-102.28
AIC=212.55 AICc=213.84 BIC=218.89
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set 0.249122 4.11534 3.545525 -2.679881 15.90975 0.9530981 -0.006839681
# Check model. Increase in sales for each unit increase in advertising
salesincrease <- coefficients(fit)[3]
salesincrease
xreg
0.09264183
Forecast from the fitted model as fc, specifying the next 6 months of advertising expenditure as 10 units per month. To repeat 10 six times, use the rep() function inside xreg like in the example code above.
Plot the forecasts fc and fill in the provided code to add an x label “Month” and y label “Sales”.
# Forecast fit as fc
fc <- forecast(fit, xreg = rep(10, 6))
# Plot fc with x and y labels
autoplot(fc) + xlab("Month") + ylab("Sales")
The dynamic regression allows you to include other outside information into your forecast.
You can also model daily electricity demand as a function of temperature. As you may have seen on your electric bill, more electricity is used on hot days due to air conditioning and on cold days due to heating.
In this exercise, you will fit a quadratic regression model with an ARMA error. One year of daily data are stored as elecdaily, including total daily demand, an indicator variable for workdays (a workday is represented with 1, and a non-workday is represented with 0), and daily maximum temperatures. Because there is weekly seasonality, the frequency has been set to 7.
Let’s take a look at the first three rows:
elecdaily[1:3, ]
Demand WorkDay Temperature
[1,] 174.8963 0 26.0
[2,] 188.5909 1 23.0
[3,] 188.9169 1 22.2
ggplot(aes(x = Temperature, y = Demand), data = as.data.frame(elecdaily)) +
geom_point()
# Time plots of demand and temperatures
autoplot(elecdaily[, c("Demand", "Temperature")], facets = TRUE)
Index elecdaily accordingly to set up the matrix of regressors to include MaxTemp for the maximum temperatures, MaxTempSq which represents the squared value of the maximum temperature, and Workday, in that order. Clearly, the second argument of cbind() will require a simple mathematical operator.
# Matrix of regressors
xreg <- cbind(MaxTemp = elecdaily[, "Temperature"],
MaxTempSq = elecdaily[, "Temperature"]^2,
Workday = elecdaily[, "WorkDay"])
Fit a dynamic regression model of the demand column with ARIMA errors and call this fit.
# Fit model
fit <- auto.arima(elecdaily[, "Demand"], xreg = xreg)
If the next day is a working day (indicator is 1) with maximum temperature forecast to be 20°C, what is the forecast demand? Fill out the appropriate values in cbind() for the xreg argument in forecast().
# Forecast fit one day ahead
forecast(fit, xreg = cbind(MaxTemp = 20,
MaxTempSq = 20^2,
Workday = 1))
|          | Point Forecast | Lo 80    | Hi 80    | Lo 95    | Hi 95    |
|----------|----------------|----------|----------|----------|----------|
| 53.57143 | 185.4008       | 176.9271 | 193.8745 | 172.4414 | 198.3602 |
Now you’ve seen how multiple independent variables can be included using matrices.
With weekly data, it is difficult to handle seasonality using ETS or ARIMA models as the seasonal length is too large (approximately 52). Instead, you can use harmonic regression which uses sines and cosines to model the seasonality.
The fourier() function makes it easy to generate the required harmonics. The higher the order (K), the more “wiggly” the seasonal pattern is allowed to be. With K=1, it is a simple sine curve. You can select the value of K by minimizing the AICc value. fourier() takes in a required time series, required number of Fourier terms to generate, and optional number of rows it needs to forecast:
# fourier(x, K, h = NULL)
> fit <- auto.arima(cafe, xreg = fourier(cafe, K = 6),
seasonal = FALSE, lambda = 0)
> fit %>%
forecast(xreg = fourier(cafe, K = 6, h = 24)) %>%
autoplot() + ylim(1.6, 5.1)
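A hedged sketch of selecting K by AICc for the cafe example above (this refits the model several times, so it can be slow; the loop bound of 6 matches the K used above):

best_aicc <- Inf
best_K <- 1
for (K in 1:6) {
  fit_k <- auto.arima(cafe, xreg = fourier(cafe, K = K),
                      seasonal = FALSE, lambda = 0)
  if (fit_k$aicc < best_aicc) {   # keep the fit with the smallest AICc
    best_aicc <- fit_k$aicc
    best_K <- K
  }
}
best_K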
The gasoline series comprises weekly data on US finished motor gasoline products. In this exercise, you will fit a harmonic regression to this data set and forecast the next 3 years.
# Set up harmonic regressors of order 13
harmonics <- fourier(gasoline, K = 13)
# Fit regression model with ARIMA errors
fit <- auto.arima(gasoline, xreg = harmonics, seasonal = FALSE)
# Forecasts next 3 years
newharmonics <- fourier(gasoline, K = 13, h = 52*3)
fc <- forecast(fit, xreg = newharmonics)
# Plot forecasts fc
autoplot(fc)
The point predictions look to be a bit low.
Harmonic regressions are also useful when time series have multiple seasonal patterns. For example, taylor contains half-hourly electricity demand in England and Wales over a few months in the year 2000. The seasonal periods are 48 (daily seasonality) and 7 x 48 = 336 (weekly seasonality). There is not enough data to consider annual seasonality.
auto.arima() would take a long time to fit a long time series such as this one, so instead you will fit a standard regression model with Fourier terms using the tslm() function. This is very similar to lm() but is designed to handle time series. With multiple seasonality, you need to specify the order K for each of the seasonal periods.
# The formula argument is a symbolic description
# of the model to be fitted
> args(tslm)
function (formula, ...)
tslm() is a newly introduced function, so you should be able to follow the pre-written code for the most part.
# Fit a harmonic regression using order 10 for each type of seasonality
fit <- tslm(taylor ~ fourier(taylor, K = c(10, 10)))
# Forecast 20 working days ahead
fc <- forecast(fit, newdata = data.frame(fourier(taylor, K = c(10, 10), h = 48*20)))
# Plot the forecasts
autoplot(fc)
# Check the residuals of fit
checkresiduals(fit)
Breusch-Godfrey test for serial correlation of order up to 672
data: Residuals from Linear regression model
LM test = 3938.9, df = 672, p-value < 2.2e-16
The residuals from the fitted model fail the tests badly, yet the forecasts are quite good.
Another time series with multiple seasonal periods is calls, which contains 20 consecutive days of 5-minute call volume data for a large North American bank. There are 169 5-minute periods in a working day, and so the weekly seasonal frequency is 5 x 169 = 845. The weekly seasonality is relatively weak, so here you will just model daily seasonality. calls is pre-loaded into your workspace.
The residuals in this case still fail the white noise tests, but their autocorrelations are tiny, even though they are significant. This is because the series is so long. It is often unrealistic to have residuals that pass the tests for such long series. The effect of the remaining correlations on the forecasts will be negligible.
# Plot the calls data
autoplot(calls)
Set up the xreg matrix using order 10 for daily seasonality and 0 for weekly seasonality. Note that if you incorrectly specify your vector, your session may expire!
Fit a dynamic regression model called fit using auto.arima() with seasonal = FALSE and stationary = TRUE.
# Set up the xreg matrix
xreg <- fourier(calls, K = c(10, 0))
# Fit a dynamic regression model
fit <- auto.arima(calls, xreg = xreg, seasonal = FALSE, stationary = TRUE)
# Check the residuals
checkresiduals(fit)
Ljung-Box test
data: Residuals from Regression with ARIMA(5,0,1) errors
Q* = 6846.8, df = 1663, p-value < 2.2e-16
Model df: 27. Total lags used: 1690
# Plot forecasts for 10 working days ahead
fc <- forecast(fit, xreg = fourier(calls, c(10, 0), h = 10*169))
autoplot(fc)
A TBATS model is a special kind of time series model. It can be very slow to estimate, especially with multiple seasonal time series, so in this exercise you will try it on a simpler series to save time. Let’s break down elements of a TBATS model in TBATS(1, {0,0}, -, {<51.18,14>})
| Component    | Meaning                          |
|--------------|----------------------------------|
| 1            | Box-Cox transformation parameter |
| {0,0}        | ARMA error                       |
| -            | Damping parameter                |
| {<51.18,14>} | Seasonal period, Fourier terms   |
The gas data contains Australian monthly gas production. A plot of the data shows the variance has changed a lot over time, so it needs a transformation. The seasonality has also changed shape over time, and there is a strong trend. This makes it an ideal series to test the tbats() function which is designed to handle these features.
# Plot the gas data
autoplot(gas)
# Fit a TBATS model to the gas data
fit <- tbats(gas)
# Forecast the series for the next 5 years
fc <- forecast(fit, h = 12*5)
# Plot the forecasts
autoplot(fc)
# Record the Box-Cox parameter and the order of the Fourier terms
lambda <- 0.082
K <- 5
Just remember that completely automated solutions don’t work every time.