DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks

Amazon’s DeepAR is a forecasting method based on autoregressive recurrent networks, which learns a global model from historical data of all time series in the dataset.

[Figure: Summary of the model]


Here we use UCI's Electricity Dataset, which contains the electricity consumption of 370 points/clients over the period 2011 to 2014 at 15-minute intervals. Apart from the electricity consumption data, we also generate some covariate series (for example, day of the week, month of the year, etc.) for each individual time series.
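Such calendar covariates can be derived directly from the timestamps. A minimal sketch in Python (the exact covariate set here is an illustrative choice, not something fixed by the paper):

```python
from datetime import datetime, timedelta

# 15-minute timestamps starting at the beginning of the dataset's range
start = datetime(2011, 1, 1)
timestamps = [start + timedelta(minutes=15 * i) for i in range(4)]

# Simple time-based covariates derived from each timestamp:
# (day of week, month of year, hour of day)
covariates = [(t.weekday(), t.month, t.hour) for t in timestamps]
print(covariates)  # [(5, 1, 0), (5, 1, 0), (5, 1, 0), (5, 1, 0)]
```

In practice these covariates are normalised and stacked alongside the target series before being fed to the model.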

[Figure: UCI's Electricity Dataset]

The DeepAR model learns seasonal behaviour patterns from these covariates, which strengthens its forecasting capabilities. There are a total of 370 points/clients in the dataset. Each point/client is allotted a unique index called the "Index of the series", which is passed along as a covariate to the model. This "Index of the series" is then fed to an embedding layer, which learns a unique embedding for each individual point/client.
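A minimal sketch of the series-index lookup, using a plain NumPy table (in the real model this table is a learned embedding layer trained jointly with the network; the embedding size here is an assumed hyperparameter):

```python
import numpy as np

num_series = 370        # one row per point/client in the dataset
embedding_dim = 20      # assumed size; a tunable hyperparameter

rng = np.random.default_rng(0)
# In DeepAR this table is learned during training;
# here it is just randomly initialised to show the lookup.
embedding_table = rng.normal(size=(num_series, embedding_dim))

series_index = np.array([0, 42, 369])      # "Index of the series" covariate
vectors = embedding_table[series_index]    # one embedding vector per client
assert vectors.shape == (3, embedding_dim)
```

The looked-up vector is concatenated with the other covariates at every time step, letting a single global model specialise its behaviour per series.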




DeepAR employs an LSTM-based recurrent neural network architecture for probabilistic forecasting.


At each time step t, the inputs to the network are the covariates x(t), the target value at the previous time step z(t−1), and the previous network output h(t−1) (the LSTM's hidden state from the previous time step). The network outputs h(t) = LSTM( h(t−1), z(t−1), x(t) ). This h(t) is then used to calculate the parameters of the likelihood of z(t) (μ and σ in the case of a Gaussian distribution, or μ and α in the case of a negative binomial distribution), which is used for training the LSTM parameters. The LSTM's initial cell state and hidden state are initialised with zeros.
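One step of this recurrence can be sketched with PyTorch's `nn.LSTMCell` (the covariate and hidden sizes below are illustrative assumptions, not the paper's hyperparameters):

```python
import torch
import torch.nn as nn

cov_dim, hidden = 4, 40          # assumed sizes for illustration
# input at each step = previous target z(t-1) concatenated with covariates x(t)
cell = nn.LSTMCell(input_size=1 + cov_dim, hidden_size=hidden)

z_prev = torch.zeros(1, 1)       # z(t-1), previous target value
x_t = torch.randn(1, cov_dim)    # x(t), covariates for this step
h_prev = torch.zeros(1, hidden)  # h(t-1), initialised with zeros
c_prev = torch.zeros(1, hidden)  # cell state, initialised with zeros

# h(t) = LSTM( h(t-1), z(t-1), x(t) )
h_t, c_t = cell(torch.cat([z_prev, x_t], dim=1), (h_prev, c_prev))
assert h_t.shape == (1, hidden)
```

The likelihood parameters (μ, σ) are then computed from `h_t`, as described in the next sections.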

In this blog, all calculations are made with respect to the Gaussian distribution, so we will only be discussing the Gaussian likelihood.

Gaussian Likelihood

In statistics, the likelihood function (often simply called the likelihood) measures the goodness of fit of a statistical model to a sample of data for given values of the unknown parameters.

Put simply, the likelihood here is the probability density of a point z(t) under a Gaussian distribution whose μ (mean) and σ (standard deviation) are known.


So, the likelihood of a target point z(t) is its probability density under the Gaussian distribution whose parameters μ (mean) and σ (standard deviation) are predicted by the model. If z(t) is closer to the mean, it will have a higher likelihood.

Therefore, the negative of this Gaussian likelihood can be used as a loss.

This is exactly what DeepAR uses as its loss function; to be specific, it uses the negative of the log of the Gaussian likelihood.
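The negative Gaussian log-likelihood is simple to write down; a minimal NumPy sketch:

```python
import numpy as np

def gaussian_nll(z, mu, sigma):
    """Negative log of the Gaussian likelihood of target z
    under a Gaussian with mean mu and standard deviation sigma."""
    return 0.5 * np.log(2 * np.pi * sigma**2) + (z - mu)**2 / (2 * sigma**2)

# A target close to the mean has a low loss; a distant one has a high loss
assert gaussian_nll(0.1, 0.0, 1.0) < gaussian_nll(3.0, 0.0, 1.0)
```

During training this loss is summed over all time steps and all series, and minimised with respect to the network parameters.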

−log L(z | μ, σ) = (1/2) log(2πσ²) + (z − μ)² / (2σ²)

There are many likelihood models, but the choice has to be made by looking at the type of time series one has: Gaussian likelihood for real-valued data, and negative binomial likelihood for positive count data. Other likelihood models can also readily be used, e.g. beta likelihood for data in the unit interval, Bernoulli likelihood for binary data, or mixtures in order to handle complex marginal distributions, as long as samples from the distribution can be obtained cheaply, and the log-likelihood and its gradients with respect to the parameters can be evaluated.

Predicting μ(mean) and σ(standard deviation)


The LSTM output at this time step is further passed to two dense layers: the output of the first dense layer is the predicted μ (mean), and the output of the second dense layer is passed through a Softplus activation, which yields the predicted σ (standard deviation). The Softplus ensures that σ is always positive.
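These two heads can be sketched in PyTorch as follows (the hidden size is an illustrative assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden = 40                        # assumed LSTM hidden size
mu_layer = nn.Linear(hidden, 1)    # dense layer producing the mean
sigma_layer = nn.Linear(hidden, 1) # dense layer feeding the softplus

h_t = torch.randn(1, hidden)       # LSTM output at this time step

mu = mu_layer(h_t)                 # μ: unconstrained real value
sigma = F.softplus(sigma_layer(h_t))  # σ: softplus keeps it strictly positive
assert sigma.item() > 0
```

Softplus, log(1 + exp(x)), maps any real input to a positive value, which is why it is used for the standard deviation rather than for the mean.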



After training the model up to time step t, we move on to predicting the time steps > t, i.e. t+1, t+2, etc. For predicting z(t+1), a Gaussian distribution is first created using μ(t) and σ(t); n samples are drawn from this distribution, and the median of these n samples is set as z`(t). z`(t), along with the currently known covariates x(t+1) and the previous hidden state h(t), is fed into the trained LSTM, which outputs h(t+1):

h(t+1) = LSTM( h(t), z`(t), x(t+1) )

μ(t+1) and σ(t+1) are obtained from h(t+1). Using the new μ(t+1) and σ(t+1), a Gaussian distribution is created, from which n samples are drawn; their median acts as the predicted value z`(t+1), and their standard deviation gives the confidence interval ci(t+1).

z`(t+1) = median(*samples)

ci(t+1) = standard deviation(*samples)

  • *samples = n samples drawn from a Gaussian distribution with parameters μ(t+1) and σ(t+1)
  • p = confidence multiplier

For a ~95% confidence interval, set p = 2 (more precisely, 1.96):

Upper Confidence Interval = z`(t+1) + p * ci(t+1)

Lower Confidence Interval = z`(t+1) − p * ci(t+1)
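The sampling-based prediction step can be sketched with NumPy (μ(t+1), σ(t+1), n and p below are assumed values for illustration; in the real model μ and σ come from the network):

```python
import numpy as np

rng = np.random.default_rng(0)

mu_next, sigma_next = 5.0, 1.5   # assumed μ(t+1), σ(t+1) from the network
n = 1000                         # number of samples drawn

samples = rng.normal(mu_next, sigma_next, size=n)

z_pred = np.median(samples)      # point forecast z`(t+1)
ci = np.std(samples)             # spread of the samples, ci(t+1)
p = 2                            # ~95% confidence multiplier
upper = z_pred + p * ci
lower = z_pred - p * ci
```

`z_pred` is then fed back into the LSTM as the input for the next prediction step, so uncertainty compounds naturally over the forecast horizon.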



Paper :

Pytorch Unofficial GitHub Implementations:

1) — closest to paper description

2) — implemented binomial likelihood as well

Amazon’s Official Implementation : — implemented in MxNet😔

About Me:

I am Head of Research at Greendeck, a fast-growing startup. We help retailers with pricing intelligence and business monitoring. We have offices in both London and Indore, India. If you are passionate about deep learning, or simply want to say hi, please drop me a line at . Any suggestions regarding the blog are welcome as well.

