DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks

Amazon’s DeepAR is a forecasting method based on autoregressive recurrent networks, which learns a global model from historical data of all time series in the dataset.

Keshav G
5 min read · Nov 25, 2020

Figure: Summary of the model

ABOUT THE DATASET

Here we use UCI's Electricity Dataset, which contains the electricity consumption of 370 points/clients over the period 2011 to 2014, recorded at 15-minute intervals. Apart from the electricity consumption data, we also generate some covariate series (for example, day of the week, month of the year, etc.) for each individual time series; a small sketch of this follows below.

UCI’s Electricity Dataset (Link: https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014)
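As a minimal sketch of generating these covariates (assuming the raw UCI file is loaded into a pandas DataFrame with a DatetimeIndex; the client column picked here is just for illustration):

```python
import pandas as pd

# Illustrative: load the UCI file and pick one client's consumption series.
df = pd.read_csv("LD2011_2014.txt", sep=";", decimal=",",
                 index_col=0, parse_dates=True)
series = df["MT_001"]

# Time-based covariates derived from the timestamp index.
covariates = pd.DataFrame(index=series.index)
covariates["hour_of_day"] = series.index.hour
covariates["day_of_week"] = series.index.dayofweek
covariates["month_of_year"] = series.index.month

# Standardise so all covariates are on a comparable scale for the network.
covariates = (covariates - covariates.mean()) / covariates.std()
```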

The DeepAR model learns seasonal behaviour patterns from these covariates, which strengthens its forecasting capabilities. Each of the 370 points/clients in the dataset is assigned a unique index called the "index of the series", which is passed along as a covariate to the model. This index is fed to an embedding layer, which learns a unique embedding for each individual point/client.
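A minimal PyTorch sketch of this embedding (the embedding size is an illustrative choice, not taken from the paper):

```python
import torch
import torch.nn as nn

num_series = 370      # one index per point/client
embedding_dim = 20    # illustrative embedding size

# Learns a dense vector per series; the vector is concatenated
# with the other covariates at every time step.
series_embedding = nn.Embedding(num_series, embedding_dim)

series_idx = torch.tensor([0, 5, 42])   # a batch of series indices
emb = series_embedding(series_idx)      # shape: (3, 20)
```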

DEEP DIVE INTO THE MODEL ARCHITECTURE

Figure: DeepAR architecture

1. TRAINING

DeepAR employs an LSTM-based recurrent neural network architecture for probabilistic forecasting. Some good resources for understanding LSTMs:

  1. https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21
  2. https://colah.github.io/posts/2015-08-Understanding-LSTMs/

At each time step t, the inputs to the network are the covariates x(t), the target value at the previous time step z(t−1), and the previous network output h(t−1) (the LSTM's hidden state from the previous time step). The network outputs h(t) = LSTM(h(t−1), z(t−1), x(t)). This h(t) is then used to compute the parameters of the likelihood of z(t) (μ and σ in the case of a Gaussian distribution, or μ and α in the case of a negative-binomial distribution), and this likelihood is what trains the LSTM parameters. The LSTM's initial cell state and hidden state are initialised with zeroes.
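A minimal sketch of one such unrolled step in PyTorch (an LSTMCell over the concatenated input; all sizes and names here are illustrative):

```python
import torch
import torch.nn as nn

input_dim, hidden_dim = 1 + 4, 40   # previous target + 4 covariates (illustrative)
cell = nn.LSTMCell(input_dim, hidden_dim)

batch = 8
z_prev = torch.zeros(batch, 1)            # z(t-1)
x_t = torch.randn(batch, 4)               # covariates x(t)
h_prev = torch.zeros(batch, hidden_dim)   # h(t-1), initialised with zeroes
c_prev = torch.zeros(batch, hidden_dim)   # initial cell state, also zeroes

# One unrolled step: h(t) = LSTM(h(t-1), z(t-1), x(t))
h_t, c_t = cell(torch.cat([z_prev, x_t], dim=-1), (h_prev, c_prev))
```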

In this blog, all calculations are made with respect to the Gaussian distribution, so we will only discuss the Gaussian likelihood.

Gaussian Likelihood

In statistics, the likelihood function (often simply called the likelihood) measures the goodness of fit of a statistical model to a sample of data for given values of the unknown parameters.

Put simply, the likelihood here is the probability density of a point z(t) under a Gaussian distribution whose μ (mean) and σ (standard deviation) are known.

So, the likelihood of a target point z(t) is its density under the Gaussian distribution whose parameters μ (mean) and σ (standard deviation) are predicted by the model. If z(t) is closer to the mean, it will have a higher likelihood.
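For example, the likelihood of a single observation z(t) is given by the standard Gaussian density (a textbook formula, stated here for reference):

$$\ell(z_t \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \, \exp\!\left(-\frac{(z_t - \mu)^2}{2\sigma^2}\right)$$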

Therefore, the negative of this Gaussian likelihood can be used as a loss.

DeepAR uses exactly this as its loss function; to be specific, it uses the negative log of the Gaussian likelihood.
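Written out in full (summing over time steps, and taking the log of the density above), the loss is:

$$\mathcal{L} = -\sum_{t=1}^{T} \log \ell(z_t \mid \mu_t, \sigma_t) = \sum_{t=1}^{T} \left( \frac{1}{2}\log(2\pi\sigma_t^2) + \frac{(z_t - \mu_t)^2}{2\sigma_t^2} \right)$$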

To learn more about this equation: http://jrmeyer.github.io/machinelearning/2017/08/18/mle.html

There are many choices of likelihood model, and the choice should be made based on the type of time series at hand: Gaussian likelihood for real-valued data, and negative-binomial likelihood for positive count data. Other likelihood models can also readily be used, e.g. beta likelihood for data in the unit interval, Bernoulli likelihood for binary data, or mixtures in order to handle complex marginal distributions, as long as samples from the distribution can cheaply be obtained and the log-likelihood and its gradients w.r.t. the parameters can be evaluated.
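As a small sketch, PyTorch's torch.distributions module provides such likelihoods with exactly the two operations needed, cheap sampling and a differentiable log-likelihood (the parameter values below are arbitrary):

```python
import torch
from torch.distributions import Normal, NegativeBinomial

# Real-valued data: Gaussian likelihood.
gauss = Normal(loc=torch.tensor(10.0), scale=torch.tensor(2.0))

# Positive count data: negative-binomial likelihood
# (parameterised by total_count and logits in PyTorch).
negbin = NegativeBinomial(total_count=torch.tensor(5.0),
                          logits=torch.tensor(0.3))

z = torch.tensor(12.0)
loss = -gauss.log_prob(z)   # negative log-likelihood, differentiable
sample = gauss.sample()     # cheap sampling, used at prediction time
```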

Predicting μ(mean) and σ(standard deviation)

The LSTM output at each time step is passed to two dense layers: the output of the first dense layer is the predicted μ (mean), and the output of the second dense layer is passed through a Softplus activation, which yields the predicted σ (standard deviation, which must be positive).
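A minimal sketch of these two heads in PyTorch (layer sizes are illustrative; h_t stands for the LSTM output of the current step):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_dim = 40
mu_layer = nn.Linear(hidden_dim, 1)     # first dense layer -> predicted mean
sigma_layer = nn.Linear(hidden_dim, 1)  # second dense layer -> pre-activation

h_t = torch.randn(8, hidden_dim)        # LSTM output for a batch of 8 series

mu = mu_layer(h_t)
# Softplus maps the real line to (0, inf), keeping sigma strictly positive.
sigma = F.softplus(sigma_layer(h_t))
```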

2. PREDICTION

After training the model up to time step t, we move to predicting time steps > t, i.e. t+1, t+2, etc. To predict z(t+1), first a Gaussian distribution is created using μ(t) and σ(t); from this distribution n samples are drawn, and the median of these n samples is taken as z`(t). z`(t), along with the currently known covariates x(t+1) and the previous hidden state h(t), is fed into the trained LSTM, which outputs h(t+1).

h(t+1) = LSTM(z`(t),x(t+1),h(t))

μ(t+1) and σ(t+1) are obtained from h(t+1). Using the new μ(t+1) and σ(t+1), a Gaussian distribution is created, from which n samples are drawn; their median acts as the predicted value z`(t+1) and their standard deviation gives the confidence interval ci(t+1).

z`(t+1) = median(*samples)

ci(t+1) = standard deviation(*samples) × p

  • *samples = n samples drawn from the Gaussian distribution with parameters μ(t+1) and σ(t+1)
  • p = confidence multiplier

For a 95% confidence interval, set p = 2 (more precisely, 1.96).

Upper Confidence Interval = z`(t+1) + ci(t+1)

Lower Confidence Interval = z`(t+1) - ci(t+1)
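Putting the prediction loop together as a rough sketch (reusing the illustrative cell and heads from the sketches above; n, the horizon, and the state values are arbitrary placeholders, not trained weights):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

input_dim, hidden_dim, n, horizon, p = 5, 40, 200, 24, 2

cell = nn.LSTMCell(input_dim, hidden_dim)
mu_layer, sigma_layer = nn.Linear(hidden_dim, 1), nn.Linear(hidden_dim, 1)

# State carried over from the last training step (placeholder values).
h, c = torch.zeros(1, hidden_dim), torch.zeros(1, hidden_dim)
z_prev = torch.zeros(1, 1)
future_x = torch.randn(horizon, 1, 4)   # known future covariates x(t+1), ...

forecasts, intervals = [], []
for x_t in future_x:
    # h(t+1) = LSTM(z`(t), x(t+1), h(t))
    h, c = cell(torch.cat([z_prev, x_t], dim=-1), (h, c))
    mu, sigma = mu_layer(h), F.softplus(sigma_layer(h))
    # Draw n samples; their median is fed back in at the next step.
    samples = torch.distributions.Normal(mu, sigma).sample((n,))
    z_prev = samples.median(dim=0).values   # z`(t+1)
    ci = samples.std(dim=0) * p             # ci(t+1)
    forecasts.append(z_prev)
    intervals.append((z_prev - ci, z_prev + ci))
```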

Resources:

Paper : https://arxiv.org/pdf/1704.04110.pdf

Unofficial PyTorch GitHub implementations:

1) https://github.com/zhykoties/TimeSeries (closest to the paper's description)

2) https://github.com/jingw2 (also implements the negative-binomial likelihood)

Amazon's official implementation: https://github.com/awslabs/gluon-ts/tree/master/src/gluonts/model/deepar (implemented in MXNet 😔)

About Me:

I'm an ML Engineer at a fast-growing startup, Greendeck. We help retailers with pricing intelligence and business monitoring. We have offices in both London and Indore, India. If you are passionate about deep learning, or simply want to say hi, please drop me a line at keshav.gupta@greendeck.co. Any suggestions regarding the blog are welcome as well. www.linkedin.com/in/kshavG
