DEEP PROBABILISTIC MODELS FOR
SEQUENTIAL PREDICTION
A Dissertation
Presented to the Faculty of the Graduate School
of Cornell University
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
by
Binh Van Tang
August 2021
© 2021 Binh Van Tang
ALL RIGHTS RESERVED
DEEP PROBABILISTIC MODELS FOR SEQUENTIAL PREDICTION
Binh Van Tang, Ph.D.
Cornell University 2021
Despite significant advances in deep learning, probabilistic modeling of sequen-
tial data has remained challenging due to the interplay of high-dimensional in-
puts and temporal dynamics across long-distance time steps. In this disserta-
tion, we propose deep probabilistic methods that model the temporal interac-
tions between sequential inputs while accounting for the inherent uncertainty
of future predictions. First, we study the problem of continual learning where
samples of different classes arrive sequentially and incrementally, and propose a
discriminative approach that uses random graphs to model sample similarities
and guard against catastrophic forgetting. Second, we marry state space mod-
els with recent advances in deep learning architectures for the task of time se-
ries prediction, aiming to capture non-Markovian dynamics via latent variable
models. Third, we extend such generative models to the challenging domain
of videos in which both spatial and temporal signals are key to multi-frame
video predictions. Empirical results show that our models perform compet-
itively against recent baselines, bringing us one step closer to unlocking the
underexplored potentials of sequential data.
BIOGRAPHICAL SKETCH
Binh Tang was born in Nghe An, a rural province in the north central coast of
Vietnam. In 2012, he fortuitously won a scholarship from the Vietnamese
government and began studying abroad in the United States. He attended Clark
University in Worcester, Massachusetts, and graduated with a bachelor’s degree in
mathematics and computer science with highest honors in 2015.
In 2016, Binh began pursuing a doctoral degree in statistics at Cornell University in
Ithaca, New York. He received a master’s degree in computer science in 2020.
He interned with Facebook during his graduate studies, and joined the com-
pany as a full-time research scientist in Seattle, Washington, in 2021.
This dissertation is lovingly dedicated to my mother.
ACKNOWLEDGEMENTS
I am deeply grateful to my advisor, Professor David S. Matteson, for his unwa-
vering support and mentorship throughout my journey at Cornell University.
I am also honored and grateful to have Professor Claire Cardie and Professor
James Booth on my Ph.D. committee.
I would also like to thank supportive colleagues from multiple research groups,
including Professor Peter A. Crozier’s at Arizona State University, Professor
Carlos Fernandez-Granda’s at New York University, and Professor Ying Sun’s
and especially Professor David S. Matteson’s at Cornell University.
My work was financially supported in part through National Science Founda-
tion Awards, Cornell University Graduate School Fellowship and Cornell Uni-
versity Atkinson Center for Sustainability Award, all of which are greatly ap-
preciated.
Finally, I thank my undergraduate advisors, Professor Li Han and Professor
Lawrence Morris, for introducing me to computer science and advanced math-
ematics, respectively.
CONTENTS
Biographical Sketch
Dedication
Acknowledgements
Contents
List of Tables
List of Figures
1 Introduction
2 Graph-Based Continual Learning
2.1 Problem Formulation
2.2 Graph-Based Continual Learning
2.3 Related Work
2.4 Experiment Results
2.5 Conclusion and Discussion
3 Probabilistic Transformer for Time Series Prediction
3.1 State Space Models
3.2 Transformer Architectures
3.3 Probabilistic Transformer
3.3.1 Single-Layered Probabilistic Transformer
3.3.2 Multi-Layered Extension
3.4 Related Work
3.5 Experiments
3.5.1 Time-series Forecasting
3.5.2 Human Motion Prediction
3.6 Conclusion & Discussion
4 Probabilistic Transformer for Video Prediction
4.1 Related Work
4.2 Probabilistic Transformer
4.3 Experiment Results
4.4 Conclusion
5 Additional Applications of Deep Neural Networks
5.1 Dynamic Poverty Prediction with Vegetation Index
5.1.1 Related Work
5.1.2 Datasets and Methodology
5.1.3 Experiment Results
5.2 Deep Denoising for Scientific Discovery
5.2.1 Related Work
5.2.2 Methodology
5.2.3 Datasets
5.2.4 Experiments Results
6 Final Remarks
A Graph-Based Continual Learning
B Probabilistic Transformer
LIST OF TABLES
2.1 Classification results (%) on PERMUTED MNIST, ROTATED MNIST, and SPLIT SVHN.
The means and standard deviations are computed over five runs using different
random seeds. When used, episodic memories contain 5 samples per class on
average. The symbol ↑ (↓) indicates that a higher (lower) number is better.
2.2 Classification results (%) on SPLIT CIFAR10, SPLIT CIFAR100, and SPLIT
MINIIMAGENET. The means and standard deviations are computed over five runs
using different random seeds. When used, episodic memories contain 5 samples
per class on average. The symbol ↑ (↓) indicates that a higher (lower) number
is better.
2.3 Ablation study on SPLIT CIFAR10.
3.1 Test set CRPS_sum of time series forecasting models (lower is better).
The means and standard deviations are computed over five runs using
different random seeds.
3.2 Ablation study on TRAFFIC.
3.3 Human motion prediction results.
4.1 PSNR and SSIM scores on MOVING MNIST.
5.1 Spatially cross-validated r^2 values of the predictions of NDVI models
relative to Jean et al. [101]. Separate models are fine-tuned and evaluated
for different countries and surveys. For NDVI models, the means and standard
deviations of r^2 values are reported using 5 independent trials.
5.2 Field of view of CNN architectures and performance. Mean PSNR and
SSIM (± standard deviation) of different CNN architectures on the (a)
held-out simulated test set of TEM data described in Section 5.2.4 and
(b) validation set of the DIV2K photographic image dataset [2].
5.3 Results on simulated test data. Mean PSNR and SSIM (± standard
deviation) of different denoising methods on the held-out simulated
test set described in Section 5.2.4. SBD approaches achieve the best
results. SBD combined with the proposed architecture outperforms all
other techniques by about 12 dB. The performance of SBD applied to
additional architectures is reported in Table 5.2.
A.1 Number of trainable parameters in continual learning models.
A.2 GCL results and CN-DPM results with different memory sizes.
A.3 Memory usage of ER and GCL for various datasets.
B.1 Dimension, domain, frequency, total training timesteps and prediction
length properties of the training datasets used in the experiments.
B.2 Test set NMSE_sum and ND_sum of time series models (lower is better).
The means and standard deviations are computed over 5 runs using
different seeds.
B.3 Number of parameters of Transformer-MAF [190], TimeGrad [189], and
ProTran (our model) used in the time-series forecasting experiments.
B.4 Number of parameters of DLow [257], its conditional VAE model, and
ProTran (our model) used in the human-motion prediction experiments.
LIST OF FIGURES
2.1 Illustration of Experience Replay (ER) [40] on the left and our model
(GCL) on the right. While ER independently processes context images
from the episodic memory and target images from the current task, GCL
models pairwise similarities between the images via the random graphs
G and A.
2.2 t-SNE visualization of image embeddings (small circles) from the
penultimate layers and class embeddings (large circles) from the weights of
the last layers on SPLIT SVHN. The left figure shows that Finetune, a
model naively trained on the data stream, fails to recognize the
class-based clustering structure and biases the image embeddings toward the
last task (classes 8 and 9). In contrast, the right figure shows that GCL
(our model) maintains the relational structure and is more robust to the
distributional shifts incurred by task changes.
2.3 Average accuracy as a function of the number of tasks trained.
2.4 Memory size effects.
2.5 Wall-clock running time.
2.6 Context graph G.
2.7 Graph regularization (λ_G).
3.1 Graphical model representations of linear dynamical systems (LDSs)
in (a), and our proposed models (ProTran) in (b), (c), and (d). Black
arrows denote the generative mechanism and red arrows the inference
procedure. The separation of generation and inference in (c) and (d)
is for readability. While traditional SSMs such as LDSs are limited to
Markovian dynamics and linear dependencies, our models allow for
non-Markovian and non-linear interactions between time steps via
attention mechanisms. A multi-layer extension of our models further
increases expressiveness without compromising the tractable inference
procedure.
3.2 Visualizations of attention weights in an image captioning task. The
model sequentially generates words in the shown caption by focusing
on the corresponding salient regions in the image depicted with
different colors.
3.3 Multihead Attention [226].
3.4 Prediction intervals from ProTran (our model) and test set ground truth
for the first 16 of 963 time series in the TRAFFIC dataset.
3.5 Ground-truth pose sequences (first row) and corresponding predictions
by ProTran (second row). Solid colors indicate later time-steps
and faded ones are older. The body-part movements in the predicted
and ground-truth poses follow similar patterns, while certain
variations are retained.
4.1 Graphical model representations of our probabilistic transformer
models. Black arrows denote the generative mechanism and red arrows the
inference procedure. The separation of generation and inference in (c)
and (d) is for readability. We interleave recursive layers (e.g. layer 1
and layer 3) and non-recursive layers (e.g. layer 2 and layer 4) to
increase expressiveness of the temporal dynamics and reduce running
time.
4.2 Peak signal-to-noise ratio as a function of time horizon.
4.3 Predicted video frames on odd rows and their corresponding ground
truths on even rows at two different time horizons, namely t = 2 on the left
and t = 4 on the right, on the STOCHASTIC MOVEMENT DATASET. The
shown predictions are among the closest samples to the ground truths
based on PSNR.
4.4 Predicted video frames on odd rows and their corresponding ground
truths on even rows at two different time horizons, namely t = 4 on the left
and t = 13 on the right, on the STOCHASTIC MOVING MNIST dataset.
The shown predictions are among the closest samples to the ground
truths based on PSNR.
5.1 NDVI measurements for Uganda in 2011. On the left, the background
image shows annual average NDVI with a vertical colorbar while the
foreground scatters depict log consumption expenditures with a
horizontal colorbar. On the right, the annual NDVI, spatially averaged
over all survey locations, with notable drops during the 2011-2012 East
Africa drought highlighted in gray.
5.2 Spatially cross-validated results of NDVI models relative to nightlights
and Jean et al. [101]. Nightlight-based models are random forests
trained on scalar nighttime light intensities. The top figure shows r^2
values for estimating consumption using pooled observations across
the four LSMS countries. We run separate trials for increasing
percentages of the pooled dataset (e.g., the x-axis value of 60 indicates all
surveyed communities below the 60th percentile of consumption are
included). The bottom figure shows similar r^2 values for estimating the
asset index.
5.3 Consumption predictions for LSMS communities in Uganda made by a
random forest model trained on 2011 data and tested on 2013 data. The
top figure shows the ground-truth consumption along with predictions
for LSMS communities ordered by 2011 data. The bottom figure shows
RMSE values of the predictions for increasing percentages of the LSMS
communities (e.g., the x-axis value of 60 indicates all communities
below the 60th percentile in 2011 consumption are included).
5.4 Simulation-based denoising framework. (Top) A training dataset is
generated by simulating TEM images of different structures at varying
imaging conditions. (Middle) A CNN is trained using the simulated
images, paired with noisy counterparts obtained by simulating the
relevant noise process. (Bottom) The trained CNN is applied to real data
to yield a denoised image, and a likelihood map is generated to
quantify the agreement between this structure and the noisy data.
5.5 Likelihood map. When the simulated noisy image in (a) is denoised
using the proposed framework (b), a spurious atom appears at the left
edge of the nanoparticle (see zoomed image (d)). The value of the
likelihood map (c) at that location is very low, indicating that the presence
of an atom is less consistent with the observed data than its absence.
5.6 Denoising results for real data. (a) An experimentally-acquired
atomic-resolution transmission electron microscope image of a CeO2-supported
Pt nanoparticle. The average image intensity is 0.45 electrons/pixel
(i.e., a large fraction of pixels register zero electrons), which results in
an extremely low signal-to-noise ratio. (b) Denoised image obtained
via Fourier-based filtering by a domain expert. (c) Denoised image
obtained via the wavelet-based PURE-LET method [147]. (d) Denoised
image obtained by the proposed simulation-based denoising (SBD)
framework. (e) Likelihood map quantifying to what extent the atomic
structure identified from the SBD denoised image is consistent with the
data. Regions in red are more likely to correspond to atomic columns
in the nanoparticle. Regions in blue are more likely to belong to the
vacuum.
5.7 Distribution of likelihood ratio. The figure shows the distribution of
log-likelihood ratio of over 25,000 regions of interest computed from
the surface of 1550 denoised images using the dataset. The regions
containing spurious atoms (false positives, (a)) have a much lower
log-likelihood ratio than the regions containing accurately recovered atoms
(true positives, (b)). Regions where existing atoms were not detected
(false negatives, (c)) have a higher log-likelihood ratio, comparable to
that of the regions with accurately recovered atoms. The occurrence
of missing and spurious atoms in denoised images is quite rare: out of
the 25,732 regions of interest, only 2,457 and 2,368 were false positives
and false negatives respectively.
5.8 Performance of SBD in terms of our proposed metrics. We compute all
our proposed metrics on over 7,000 denoised images corresponding to
25 unique noisy images sampled from the 308 clean images. The
empirical distribution on the surface (red) and bulk (green) is visualized
as box plots indicating the median, 25th percentile, 75th percentile,
minimum and maximum value of the distribution. SBD has a near perfect
performance in the bulk with all metric values hovering around 1. On
the surface, SBD achieves a median score of 1 for precision and recall,
and about 0.95 for F1 score and Jaccard index.
5.9 Validation on real data. The real data consist of 40 frames which are
approximately stationary and aligned. Their temporal average (left)
therefore provides a reasonable estimate for the true intensity profile.
In the image on the right, we compare the average intensity profile on
the surface atomic columns of the platinum nanoparticle for the
denoised data (middle) and the temporal average (left). The profiles are
very similar (except for some spurious fluctuations in the temporal
average), which suggests that the proposed approach achieves effective
denoising on the real data.
A.1 Average accuracy as a function of the number of tasks trained on
PERMUTED MNIST, ROTATED MNIST, SPLIT SVHN, and SPLIT CIFAR10.
A.2 Average accuracy as a function of the number of samples in the episodic
memory on SPLIT SVHN and SPLIT CIFAR10.
A.3 Average forgetting as a function of the number of samples in the episodic
memory on SPLIT SVHN and SPLIT CIFAR10.
B.1 Conditioning pose sequences in green and corresponding predictions
in red by ProTran. Solid colors indicate later time-steps and faded ones
are older.
CHAPTER 1
INTRODUCTION
Although much progress has been made recently, probabilistic modeling of
sequential data has remained challenging. While deep learning techniques have
been tremendously successful with static modalities such as images [87, 122],
sequential inputs continue to pose challenges due to the complexity of high-
dimensional inputs and the temporal dynamics across long-distance time steps.
In this dissertation, we study the problem from multiple perspectives. While
traditional paradigms assume that all training data is available at once in large
quantities, we consider the continual learning setups where images and labels
of different classes arrive sequentially and incrementally in Chapter 2. The
changes in the data distribution often lead to dramatic decreases in classifica-
tion performance, a problem known as catastrophic forgetting. As a remedy,
we propose using random graphs to connect similar samples of different arrival
times, resulting in a discriminative model more robust to distributional shifts.
In Chapter 3, we combine state space models (SSMs), a classical, rigorously mo-
tivated class of time series models, with recent advances in deep learning archi-
tectures. While traditional SSMs are often limited to linear dependencies and
Markovian dynamics between time steps, we propose non-Markovian, highly
expressive latent variable models based on transformer architectures and vari-
ational inference. Our models are capable of generating diverse long-term time
series forecasts with uncertainty estimates.
In Chapter 4, we extend such generative models to the challenging task of
conditional video prediction. More specifically, we propose applying recent
advances in attention mechanisms to high-dimensional video frames to capture
both the spatial and temporal signals as well as their interactions. Empirical
results show that our models perform promisingly on several datasets, bringing
us one step closer to unlocking the underexplored potential of unlabeled data.
We provide additional applications of deep learning models in poverty esti-
mation and image denoising in Chapter 5 and conclude with final remarks in
Chapter 6.
CHAPTER 2
GRAPH-BASED CONTINUAL LEARNING1
Recent breakthroughs in deep neural networks often hinge on the ability to
repeatedly iterate over stationary batches of training data. When exposed to
incrementally available data from non-stationary distributions, such networks
often fail to learn new information without forgetting much of their previously
acquired knowledge, a phenomenon known as catastrophic forgetting [70, 157, 191].
Despite significant advances, the limitation has remained a long-standing chal-
lenge for computational systems that aim to continually learn from dynamic
data distributions [174].
Among various proposed solutions, rehearsal approaches that store samples
from previous tasks in an episodic memory and regularly replay them are one
of the earliest and most successful strategies against catastrophic forgetting [137,
197]. An episodic memory is typically implemented as an array of independent
slots; each slot holds one example coupled with its label. During training, these
samples are interleaved with those from the new task, allowing for simultane-
ous multi-task learning as if the resulting data were independently and identi-
cally distributed (i.i.d.).
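The slot-based memory and interleaved replay described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation of any cited method: the class name `EpisodicMemory`, the reservoir-sampling write rule, and the toy batch sizes are all hypothetical choices made for brevity.

```python
import random


class EpisodicMemory:
    """Fixed-size array of independent slots, each holding one (example, label) pair."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.slots = []      # list of (example, label) pairs
        self.num_seen = 0    # number of stream samples offered so far
        self.rng = random.Random(seed)

    def write(self, example, label):
        # Reservoir sampling (an assumed write rule): every sample seen so far
        # ends up stored with equal probability capacity / num_seen.
        self.num_seen += 1
        if len(self.slots) < self.capacity:
            self.slots.append((example, label))
        else:
            j = self.rng.randrange(self.num_seen)
            if j < self.capacity:
                self.slots[j] = (example, label)

    def sample(self, k):
        # Draw up to k stored pairs without replacement for replay.
        return self.rng.sample(self.slots, min(k, len(self.slots)))


def interleave(new_batch, memory, replay_size):
    """Mix the current-task batch with replayed samples from the memory."""
    return list(new_batch) + memory.sample(replay_size)


memory = EpisodicMemory(capacity=4)
stream = [(f"x{i}", i % 2) for i in range(10)]   # toy (example, label) stream
for start in range(0, len(stream), 2):
    batch = stream[start:start + 2]
    train_batch = interleave(batch, memory, replay_size=2)
    # ... a model update on train_batch would go here ...
    for example, label in batch:
        memory.write(example, label)
```

Each slot is read and written independently of the others, which is exactly the property the next paragraph identifies as a limitation.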
While such approaches are effective in simple settings, they require sizable mem-
ory and are often impaired by memory constraints, performing rather poorly
on complex datasets. A possible explanation is that slot-based memories fail
to utilize relational structure between samples; semantically similar items are
treated independently both during training and at test time. In marked
contrast, relational memory is a prominent feature of biological systems that has
been strongly linked to successful memory retrieval and generalization [181].
Humans, for example, encode event features into cortical representations and
bind them together in the medial temporal lobe, resulting in a durable, yet
flexible form of memory [206].
1 Joint work with David S. Matteson [].
In this chapter, we introduce a novel Graph-based Continual Learning model
(GCL) that resembles some characteristics of relational memory. More specifi-
cally, we explicitly model pairwise similarities between samples, including both
those in the episodic memory and those found in the current task. These
similarities allow for representation transfer between samples and provide a
resilient means to guard against catastrophic forgetting. Our contributions are twofold:
(a) We propose the use of random graphs to represent relational structures
between samples. While similar notions of dependencies have been pro-
posed in the literature [145, 253], the application of random graphs in task-
free continual learning is novel, at least to the best of our knowledge.
(b) We introduce a new regularization objective that uses the random graphs
to alleviate catastrophic forgetting. In contrast to previous work [131, 192]
based on knowledge distillation [90], the objective penalizes the model for
forgetting learned edges between samples instead of its predictions.
Our approach performs competitively on four commonly used datasets,
improving accuracy by up to 19.7% and reducing forgetting by almost 37% in the
best case when benchmarked against multiple baselines in continual learning.
Figure 2.1: Illustration of Experience Replay (ER) [40] on the left and our model (GCL)
on the right. While ER independently processes context images from the episodic
memory and target images from the current task, GCL models pairwise similarities
between the images via the random graphs G and A.
2.1 Problem Formulation
Following the image classification protocol in [143], we consider a training set
$\mathcal{D} = \{\mathcal{D}_1, \dots, \mathcal{D}_T\}$ consisting of $T$ tasks, where the dataset for the $t$-th task,
$\mathcal{D}_t = \{(x_i^t, y_i^t)\}_{i=1}^{n_t}$, contains $n_t$ input-target pairs $(x_i^t, y_i^t) \in \mathcal{X} \times \mathcal{Y}$. While the tasks arrive
sequentially, we assume the input-target pairs $(x_i^t, y_i^t)$ in each task are i.i.d. The
goal is to learn a supervised model $f_\theta : \mathcal{X} \to \mathcal{Y}$, parametrized by $\theta$, that outputs
a class label $y \in \mathcal{Y}$ given an unseen image $x \in \mathcal{X}$.
Following prior work [40, 143, 195], we consider online streams of tasks in
which samples from different tasks arrive at different times. As an additional
constraint, we insist that the model can only revisit a small amount of data
chosen to be stored in a fixed-size episodic memory $\mathcal{M}$.
For clarity, the data in the memory are referred to as context images and context
labels, denoted by $X_C = \{x_i\}_{i \in C}$ and $Y_C = \{y_i\}_{i \in C}$, while those in the current
task are referred to as target images and target labels, denoted by $X_T = \{x_j\}_{j \in T}$
and $Y_T = \{y_j\}_{j \in T}$, respectively. The model is allowed to update the context
samples during training, but the memory is necessarily frozen at test time.
Figure 2.2: t-SNE visualization of image embeddings (small circles) from the
penultimate layers and class embeddings (large circles) from the weights of the last layers on
SPLIT SVHN. The left figure shows that Finetune, a model naively trained on the data
stream, fails to recognize the class-based clustering structure and biases the image
embeddings toward the last task (classes 8 and 9). In contrast, the right figure shows that GCL
(our model) maintains the relational structure and is more robust to the distributional
shifts incurred by task changes.
2.2 Graph-Based Continual Learning
In this section, we propose a Graph-based Continual Learning (GCL) algorithm.
While most rehearsal approaches ignore the correlations between images and
independently pass them through a network to compute predictions [8, 40, 192],
we model pairwise similarities between the images with learnable edges in ran-
dom graphs (see Figure 2.1). Intuitively, although it might be easy for the model
to forget any particular sample, the multiple connections it forms with similar
neighbors are harder to forget altogether. If trained well, the random
graphs can therefore equip the model with a plastic and durable means to fight
against catastrophic forgetting.
Graph Construction. Given a minibatch of target images $X_T$ from the current
task, our model makes predictions based on the context images $X_C$ and context
labels $Y_C$ that span several previously seen tasks, up to and including the
current one. In particular, we explicitly build two random graphs of pairwise
dependencies: an undirected graph $G$ between the context images $X_C$ and a
directed, bipartite graph $A$ from the context images $X_C$ to the target images $X_T$.
Since an undirected graph can be thought of as a directed graph between its
vertices and a copy of itself, we treat the context graph $G$ as such and build it
analogously to the context-target graph $A$. Specifically, the high-dimensional
context images $X_C$ and target images $X_T$ are first mapped to the image
embeddings $U_C$ and $U_T$, respectively, using an image encoder $f_{\theta_1} : \mathcal{X} \to \mathbb{R}^{d_1}$.
Following [145], we then represent the edges in each graph by independent Bernoulli
random variables whose means are specified by a kernel function in the
embedding space. More precisely, the distribution of the resulting Erdős-Rényi
random graphs [62] can be defined as
$$p(G \mid U_C) = \prod_{i \in C} \prod_{k \in C} \mathrm{Ber}\left(G_{ik} \mid \kappa_\tau(u_i, u_k)\right), \tag{2.1}$$
$$p(A \mid U_T, U_C) = \prod_{j \in T} \prod_{k \in C} \mathrm{Ber}\left(A_{jk} \mid \kappa_\tau(u_j, u_k)\right), \tag{2.2}$$
for all $i, k \in C$ and $j \in T$, where $\kappa_\tau : \mathbb{R}^{d_1} \times \mathbb{R}^{d_1} \to [0, 1]$ is a kernel function
that encodes similarities between image embeddings, such as the RBF kernel
$\kappa_\tau(u_i, u_j) = \exp\left(-\frac{\tau}{2} \|u_i - u_j\|_2^2\right)$. Here, with a slight abuse of notation, we also
use $G$ and $A$ to denote the corresponding adjacency matrices; $A_{jk} \in \{0, 1\}$, for
example, represents the presence or absence of a directed edge between the $j$-th
target image and the $k$-th context image.
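The graph construction in (2.1) and (2.2) can be illustrated with a small sketch. The two-dimensional embeddings, the value of `tau`, and the function names below are assumptions made for illustration; in the model the embeddings would come from the image encoder. Each ordered pair is sampled independently, mirroring the treatment of the context graph as a directed graph between its vertices and a copy of itself.

```python
import math
import random


def rbf_kernel(u, v, tau):
    """kappa_tau(u, v) = exp(-(tau / 2) * ||u - v||_2^2), a value in (0, 1]."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-0.5 * tau * sq_dist)


def edge_probabilities(rows, cols, tau):
    """Matrix of Bernoulli means, one per (row embedding, column embedding) pair."""
    return [[rbf_kernel(u, v, tau) for v in cols] for u in rows]


def sample_graph(probs, rng):
    """Draw every edge independently: entry (j, k) is 1 with probability probs[j][k]."""
    return [[1 if rng.random() < p else 0 for p in row] for row in probs]


# Toy embeddings (hypothetical): two nearby context points and one distant one.
U_C = [[0.0, 0.0], [0.1, 0.0], [3.0, 3.0]]
U_T = [[0.0, 0.1], [2.9, 3.1]]
tau = 1.0  # assumed kernel parameter

rng = random.Random(0)
P_G = edge_probabilities(U_C, U_C, tau)   # means for the context graph G
P_A = edge_probabilities(U_T, U_C, tau)   # means for the context-target graph A
G = sample_graph(P_G, rng)
A = sample_graph(P_A, rng)
```

Nearby embeddings receive edge probabilities close to 1 and distant ones close to 0, so the sampled adjacency matrices are dense within clusters and sparse across them.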
Predictive Distribution. Given a context graph $G$ and a context-target graph
$A$ that encode pairwise similarities to the context images, our next step is to
propagate information from the context images $X_C$ and context labels $Y_C$ to
make predictions. To that end, we embed $X_C$ by another image encoder $f_{\theta_2}$ with
weights partially tied to the previous one $f_{\theta_1}$, and encode $Y_C$ by a linear label
encoder before concatenating the resulting embeddings into latent
representations $V_C \in \mathbb{R}^{|C| \times d_2}$. In combination with the distributions of $G$ and $A$, we
compute context-aware representations for the context images and target images,
denoted by $\{z_i\}_{i \in C}$ and $\{z_j\}_{j \in T}$, respectively:
$$p(z_i \mid U_C, V_C) = \int_G \mathbb{I}_{\{\tilde{G}_i V_C\}}(z_i) \, dP(G \mid U_C), \tag{2.3}$$
$$p(z_j \mid U_T, U_C, V_C) = \int_A \mathbb{I}_{\{\tilde{A}_j V_C\}}(z_j) \, dP(A \mid U_T, U_C), \tag{2.4}$$
where $\tilde{G}_i$ and $\tilde{A}_j$ denote the $i$-th and $j$-th rows of $G$ and $A$, each normalized to
sum to 1, and $\mathbb{I}_S(\cdot)$ denotes the indicator function on a set $S$. Intuitively, the
representations $V_C$ are linearly weighted by each graph sample, and the
normalization step ensures proper scaling in case the numbers of edges formed with the
context images vary. Once we summarize each image by the context samples,
a final network $f_{\theta_3} : \mathbb{R}^{d_2} \to \mathcal{Y}$ takes as input the context-aware representations
and produces predictive distributions:
$$p(y_i \mid X_C) = \int_{z_i} p\left(y_i \mid f_{\theta_3}(z_i)\right) dP(z_i \mid U_C, V_C), \tag{2.5}$$
$$p(y_j \mid x_j, X_C) = \int_{z_j} p\left(y_j \mid f_{\theta_3}(z_j)\right) dP(z_j \mid U_T, U_C, V_C). \tag{2.6}$$
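For a single graph sample, the context-aware representation is the row-normalized adjacency matrix applied to the context representations. The sketch below shows this one step on toy inputs; the function names are hypothetical, and leaving a row with no edges as a zero vector is an assumption, since the text does not specify that case.

```python
def normalize_rows(adjacency):
    """Scale each row to sum to 1; rows without edges stay all-zero (an assumption)."""
    normalized = []
    for row in adjacency:
        total = sum(row)
        normalized.append([a / total if total > 0 else 0.0 for a in row])
    return normalized


def context_aware_representations(adjacency, V_C):
    """Return Z with Z[j] = sum_k A~[j][k] * V_C[k] for one graph sample."""
    A_tilde = normalize_rows(adjacency)
    d2 = len(V_C[0])
    return [
        [sum(w * v[d] for w, v in zip(row, V_C)) for d in range(d2)]
        for row in A_tilde
    ]


# Toy example: 3 context items with d2 = 2, and 2 target images.
V_C = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
A = [[1, 1, 0],   # target 0 is linked to context items 0 and 1
     [0, 0, 1]]   # target 1 is linked to context item 2 only
Z_T = context_aware_representations(A, V_C)
print(Z_T)  # [[0.5, 0.5], [2.0, 2.0]]
```

The first target averages its two neighbors, while the second copies its single neighbor, which shows why normalization keeps the scale of the representation independent of the number of edges.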
Since the number of possible binary graphs G and A is exponential in the number
of edges, we approximate the integrals in (2.3) to (2.6) by Monte Carlo samples. More specifically,
we use one sample of G and A during training to reduce training time, and 30
samples of A at test time for more accurate representations of the graph dis-
tributions. These graph samples are inherently non-differentiable, so we use
the Gumbel-Softmax relaxations of the Bernoulli random variables during train-
ing [99, 150]. The degree of approximation is controlled by temperature hyper-
parameters, which exert significant influence over the density of the graph sam-
ples. We find that a small temperature for G and a larger temperature for A
work well in practice.
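The Monte Carlo approximation of Equation 2.4 can be sketched as follows. This is a minimal illustration rather than the dissertation's implementation: it assumes that the Bernoulli edge probabilities come from an RBF kernel on the embeddings, it draws hard binary samples (as one might at test time) instead of Gumbel-Softmax relaxations, and the names `rbf_edge_probs`, `normalize_rows`, and `sample_representations` are hypothetical.

```python
import numpy as np

def rbf_edge_probs(u_rows, u_cols, length_scale=1.0):
    """Edge probabilities in (0, 1] from an RBF kernel on embeddings (assumed form)."""
    diffs = u_rows[:, None, :] - u_cols[None, :, :]
    return np.exp(-(diffs ** 2).sum(axis=-1) / (2.0 * length_scale ** 2))

def normalize_rows(graph, eps=1e-8):
    """Scale each row to sum to 1; rows without edges stay all-zero."""
    return graph / np.maximum(graph.sum(axis=1, keepdims=True), eps)

def sample_representations(edge_probs, v_context, num_samples, rng):
    """Monte Carlo draws of z = (row-normalized binary graph) @ V_C."""
    draws = []
    for _ in range(num_samples):
        # One hard Bernoulli sample of the graph per draw.
        graph = (rng.random(edge_probs.shape) < edge_probs).astype(float)
        draws.append(normalize_rows(graph) @ v_context)
    return np.stack(draws)  # shape: (num_samples, num_rows, d_2)

rng = np.random.default_rng(0)
u_context = rng.normal(size=(6, 4))  # U_C: embeddings of 6 context images
u_target = rng.normal(size=(3, 4))   # U_T: embeddings of 3 target images
v_context = rng.normal(size=(6, 5))  # V_C: image-and-label representations
probs_a = rbf_edge_probs(u_target, u_context)  # Bernoulli parameters of A
z_target = sample_representations(probs_a, v_context, num_samples=30, rng=rng)
```

Applying the output network to each draw and averaging the resulting class probabilities would then give the Monte Carlo estimate of Equation 2.6.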
There are several reasons for making the graphs G and A random. First, the
stochasticity induced by the Bernoulli random variables allows us to output
multiple predictions and average these predictions, and such ensemble tech-
niques have been quite successful in continual learning settings [50, 64]. Per-
haps more importantly, we find that the deterministic version with the Bernoulli
random variables replaced by their parameters results in very sparse graphs
where samples from the same classes are often deemed dissimilar. In a simi-
lar fashion to dropout [215], the random edges encourage the model to be less
reliant on a few particular edges and therefore promote knowledge transfer be-
tween samples. By a similar reasoning, we remove self-edges in the context
graph and also observe more connections between samples.
Graph Regularization. As training switches to new tasks, the distributional
shifts to the target images necessarily result in changes to both the context graph
G and the context-target graph A. In addition, the context images are regu-
larly updated to be representative of the data distribution up to that point, so
any well-learned connections between the context images are also susceptible
to catastrophic forgetting. As a remedy, we save the parameters of the Bernoulli
edges to the episodic memory in conjunction with the context images and con-
text labels, and introduce a regularization term that discourages the model from
forgetting previously learned edges:
    \mathcal{L}_G^{(b)}(\theta_1) \triangleq \frac{1}{|I^{(b)}|} \, \ell\big( p(G^{(b-1)}_{I^{(b)}}), \; p(G^{(b)}_{I^{(b)}}) \big).   (2.7)
Here, ℓ(·, ·) denotes the cross-entropy between two probability distributions, I^{(b)}
the index set of edges to be regularized in the b-th minibatch, and G^{(b−1)} the
adjacency matrix learned from the beginning up to the previous minibatch. The
selection strategy for I^{(b)} is discussed below under Task-Free Knowledge
Consolidation. Besides the regularization term, our training objective includes two
other cross-entropy losses, one for the context images and another for the target
images:
    \mathcal{L}(\theta_1, \theta_2, \theta_3) = \frac{\lambda_C}{|C|} \sum_{i \in C} \ell(y_i, \hat{y}_i^{(s)}) + \frac{\lambda_T}{|T|} \sum_{j \in T} \ell(y_j, \hat{y}_j^{(s)}) + \lambda_G \mathcal{L}_G^{(b)}(\theta_1),   (2.8)
where \hat{y}_i^{(s)} = f_{\theta_3}(z_i^{(s)}), \hat{y}_j^{(s)} = f_{\theta_3}(z_j^{(s)}), z_i^{(s)} ∼ p(z_i | U_C, V_C), and
z_j^{(s)} ∼ p(z_j | U_T, U_C, V_C) are from Equation 2.3 and Equation 2.4, and λ_C, λ_T, λ_G
are hyperparameters.
While the graph regularization term appears similar to knowledge distillation
[90], we emphasize that the former aims to preserve the covariance structures
between the outputs of the image encoder fθ1 rather than the outputs them-
selves. We believe that in light of new data, the image encoder should be able
to update its potentially superficial representations of previously seen samples
as long as it keeps the correlations between them unchanged. Indeed, some of
the early regularization approaches based on knowledge distillation [131, 192]
are sometimes too restrictive and underperform in certain scenarios [110].
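One way to read the regularizer in Equation 2.7 is as a mean cross-entropy between saved and current edge probabilities. The sketch below assumes that edges are independent Bernoulli variables, so that the cross-entropy decomposes over edges; the function name `graph_regularizer` and its argument layout are hypothetical.

```python
import numpy as np

def graph_regularizer(saved_probs, current_probs, edge_index, eps=1e-7):
    """Mean Bernoulli cross-entropy H(saved, current) over the selected edges."""
    rows, cols = edge_index
    p_old = saved_probs[rows, cols]
    # Clip to keep the logarithms finite when a probability hits 0 or 1.
    p_new = np.clip(current_probs[rows, cols], eps, 1.0 - eps)
    cross_entropy = -(p_old * np.log(p_new) + (1.0 - p_old) * np.log(1.0 - p_new))
    return cross_entropy.mean()

# Two regularized edges, (0, 1) and (1, 0), in a 2 x 2 context graph.
saved = np.array([[0.0, 0.9], [0.9, 0.0]])
edges = (np.array([0, 1]), np.array([1, 0]))
loss_unchanged = graph_regularizer(saved, saved, edges)
loss_drifted = graph_regularizer(saved, 1.0 - saved, edges)
```

The penalty is smallest when the current edge probabilities match the saved ones, which is the behavior the regularization term is meant to encourage.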
Task-Free Knowledge Consolidation. When task identities are not available,
we use reservoir sampling [231] to update the episodic memory as in [195]. The
sampling strategy takes as input a stream of data and randomly replaces a context
sample in the episodic memory with a target sample with probability inversely
proportional to the number of observed samples, so that every observed sample is
equally likely to be in memory. Despite its simplicity, reservoir
sampling has been shown to yield strong performance recently [40, 195, 197].
While most prior work uses task boundaries to perform knowledge consolida-
tion at the end of each task [117, 192], we update the context graph in memory
after every minibatch of training data. In addition, such updates are performed
at the sample level to maximize flexibility; we keep track of the cross entropy
loss on each context sample and only update its edges in the graph when the
model reaches a new low (denoted by I^{(b)} previously). Intuitively, the loss
measures how well the model has learned the context image through the connections
it forms with others, so meaningful relations are most likely obtained at
the bottom of the loss surface. Though samples from the same task often pro-
vide more support for each other, the task-agnostic mechanism for updating the
context graph also allows for knowledge transfer across tasks when necessary.
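The memory update can be illustrated with the classical reservoir sampling rule, in which the n-th observed sample enters a full memory of size |M| with probability |M|/n. The sketch below is generic and not specific to this model; the function name `reservoir_update` and the integer stream are illustrative assumptions.

```python
import random

def reservoir_update(memory, capacity, num_seen, sample, rng):
    """Offer `sample`, the (num_seen + 1)-th stream item, to the memory.

    Returns the slot that was written, or None if the sample was discarded.
    """
    if len(memory) < capacity:
        memory.append(sample)
        return len(memory) - 1
    slot = rng.randrange(num_seen + 1)  # uniform over 0, ..., num_seen
    if slot < capacity:                 # happens with probability capacity / (num_seen + 1)
        memory[slot] = sample
        return slot
    return None

rng = random.Random(0)
memory = []
for num_seen, sample in enumerate(range(1000)):
    reservoir_update(memory, capacity=10, num_seen=num_seen, sample=sample, rng=rng)
```

Returning the overwritten slot is a design choice made here so that a caller could reset the stored edges and the tracked loss of the replaced context sample.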
Memory and Time Complexity. The inclusion of pairwise similarities and graph
regularization results in time and memory complexities of O(|M|^2 + |M|N) and
O(|M|^2), respectively, where |M| denotes the size of the episodic memory and
N the batch size for target images. The quadratic costs in |M|, however, are not
concerning in practice, as we deliberately use a small, fixed-size episodic mem-
ory. The cost of storing G is often dwarfed by the memory required for storing
high-dimensional images, as each edge only needs one floating point number
(see Appendix A for more details on memory usage).
2.3 Related Work
Continual Learning Approaches. The existing work on continual learning
mostly falls into three categories: regularization, expansion, and rehearsal. Regu-
larization approaches alleviate catastrophic forgetting by penalizing changes in
model weights that are important for past tasks. Different measures of weight
importance are considered, including Fisher information [37, 117], synaptic rel-
evance [259], and uncertainty estimates [59]. The constraints on weight updates
can also be studied from Bayesian perspectives, where the posterior distribu-
tion of the weights is approximated and used as the prior for the next task
[167, 196, 220]. These regularization methods are efficient in memory and com-
putational usage but suffer from brittleness due to representation drift [220].
Expansion approaches dynamically allocate additional task-specific neural re-
sources as more tasks arrive. [200], for example, blocks changes to parameters
learned for previous tasks and expands sub-networks while [254] performs neu-
ron splitting or duplication upon arrival of new tasks. Recently, non-parametric
Bayesian approaches use Dirichlet process mixture models to expand a set of
neural networks in a principled way [102, 127]. By design, these dynamic archi-
tectures prevent forgetting but quickly result in considerable model complexity.
Instead of growing model capacity, rehearsal approaches maintain a small episodic
memory of previous data or, alternatively, train a generative model to produce
pseudo-data for past tasks, which are then replayed and interleaved with sam-
ples from the new task. Such generative models [1, 32, 110, 173, 207] reduce
working memory effectively, but they are also susceptible to catastrophic for-
getting and invoke the complexity of the generative task [174]. In contrast,
episodic memory approaches are simpler and remarkably effective against for-
getting [197, 247]. [143] and [39], for example, use an episodic storage of past
data to impose inequality constraints on gradient updates while [192] constructs
exemplars for knowledge distillation and nearest neighbor search. Recently, it
has been shown that simple replay techniques and optimization-based meta-
learning on the episodic memory outperform many previous approaches in on-
line settings [38, 40, 85, 195]. Our model is also based on experience replay, but
it differs from the other approaches in the way the episodic memory is handled.
Task-Free Continual Learning. In real-world scenarios, task changes are often
unknown and definitive boundaries between tasks do not always exist. How-
ever, most methods mentioned above rely on explicit task identities or task
boundaries to consolidate knowledge or select sub-modules for task adaptation.
Despite its significance, there are only a few works that address task-free
continual learning. While [7] heuristically detects peaks in the loss surface to
consolidate knowledge, [6, 8] remove the need for task boundaries by a sample
selection strategy for the episodic memory. Recently, the aforementioned non-
parametric approaches train density estimators to detect task boundaries and
perform model expansion [127, 188]. In contrast, our approach uses reservoir
sampling [231] to update the episodic memory, similar to [40, 195].
Learning with Random Graphs. Although widely studied in graph theory
[243], random graphs appear sparingly in the machine learning literature. Our
work is mostly related to previous work on functional neural process [145],
where the authors build random graphs of dependencies to represent relational
structures between context points in a stochastic process. Our approach is dif-
ferent in that (1) the random graphs are undirected and grow incrementally, (2)
no variational inference is required, and (3) it addresses catastrophic forgetting
and performs well under continual learning settings.
Attention Mechanism. While we motivate our approach from a graphical per-
spective, it can be considered as a form of attention mechanism. In particular,
the context graph G represents self-attention [226] across context images, and
the context-target graph A represents cross-attention [12] between context im-
ages and target images. Though advanced mechanisms such as multi-head at-
tention have been successful in many stationary settings [113, 214, 226, 251, 260],
we note that naive applications of such techniques in online continual learn-
ing suffer from catastrophic forgetting due to representation drift when training
switches to new tasks. In contrast, our model employs random attention, which
makes it more robust to such distributional shifts (see Figure 2.2).
2.4 Experiment Results
In this section, we evaluate our model on commonly used continual learning
benchmarks. Additional results and details about the datasets, experiment setup,
model architectures, and result analyses are available in the appendices.
Experiment Setup. We perform experiments on 6 image classification datasets:
PERMUTED MNIST, ROTATED MNIST [125], SPLIT SVHN [166], SPLIT CI-
FAR10 [121], SPLIT CIFAR100 [121], and SPLIT MINIIMAGENET [230]. For
each dataset, we follow [39, 143] and adopt the setting where the model only
has access to an online stream of data with a batch size of 10 (see Appendix A
for more details).
We consider both single-head and multiple-head settings. More specifically, we
use single-head and one-epoch settings for our model and all baselines on PER-
MUTED MNIST, ROTATED MNIST, SPLIT SVHN, and SPLIT CIFAR10. While
most previous work [40, 143, 192] assumes task identities on SPLIT CIFAR10,
we require all models to perform 10-way classification on each task with the
same output head. This variant is more practical and challenging due to the
need for incremental knowledge consolidation across tasks.
In addition, we report results for multiple-head and 10-epoch settings on
SPLIT CIFAR100 and SPLIT MINIIMAGENET, following [143]. These datasets
have more classes and fewer samples per class, rendering them too challenging
for single-head settings.
Model Architecture. Our image encoders fθ1 and fθ2 partially share weights
and are parametrized by an MLP on the MNIST variants and a simple 6-layer
convolutional network on other datasets, each followed by a ReLU activation
and a separate linear mapping. We use an RBF kernel to compute similarities
between image embeddings and find it straightforward to initialize. The
output mappings fθ3 are MLPs in all cases (see Appendix A for more details).
Baselines. We benchmark our model against multiple models, including (1)
Finetune, a popular baseline, naively trained on the data stream; (2) EWC [117],
an early regularization approach; (3) GEM [143], a rehearsal approach based on
an episodic memory of parameter gradients; (4) ER [40], a simple yet competitive
experience replay method based on reservoir sampling; (5) MER [195], a rehearsal
approach inspired by optimization-based meta-learning; and (6) ICARL [192],
another well-known rehearsal strategy. Most of these baselines share the same
model architectures: an MLP with two hidden layers on the MNIST variants,
and a ResNet-18 [87] on SPLIT SVHN and SPLIT CIFAR10, following [143] (see
Appendix A for more details).
Metrics. Following [37, 40, 143], we evaluate the models using two classifica-
tion metrics, namely, average accuracy and average forgetting:
    ACC \triangleq \frac{1}{T} \sum_{i=1}^{T} R_{T,i}, \qquad FGT \triangleq -\frac{1}{T-1} \sum_{i=1}^{T-1} (R_{T,i} - R_{i,i}),   (2.9)
where R_{i,j} denotes the test accuracy on task j after the model has finished task i.
Intuitively, the former measures the average test accuracy across all tasks while
the latter measures the average decrease between each task’s peak accuracy and
its accuracy at the end of continual learning.
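As a small worked example of Equation 2.9, the sketch below computes both metrics from a made-up accuracy matrix. The values in `R` are purely illustrative, indices are 0-based, and entries for tasks that have not been trained yet are filled with 0.0 as a placeholder because the metrics never read them.

```python
def average_accuracy(R):
    """ACC: mean accuracy over all tasks after training on the last task."""
    T = len(R)
    return sum(R[T - 1][i] for i in range(T)) / T

def average_forgetting(R):
    """FGT: mean drop from the accuracy right after learning a task to the final one."""
    T = len(R)
    return -sum(R[T - 1][i] - R[i][i] for i in range(T - 1)) / (T - 1)

# R[i][j]: test accuracy (%) on task j after the model has finished task i.
R = [
    [90.0, 0.0, 0.0],
    [80.0, 85.0, 0.0],
    [70.0, 75.0, 95.0],
]
acc = average_accuracy(R)    # (70 + 75 + 95) / 3
fgt = average_forgetting(R)  # ((90 - 70) + (85 - 75)) / 2
```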
Classification Performance. Table 2.1 and Table 2.2 show the overall experimental
results, and the evolution of performance as a function of the number of tasks is
detailed in Figure 2.3. In every setting, our model (GCL) outperforms the baselines
by significant margins, and the gains in performance are especially substantial on
complex datasets such as SPLIT CIFAR10 or SPLIT CIFAR100 (e.g., 49.62% versus 29.94%
average accuracy for ER on SPLIT CIFAR10). As noted by [39], EWC [117] performs
poorly without multiple passes over the datasets, and GEM [143] is not very
effective under the single-head variants (e.g. on SPLIT CIFAR10). Task-free
approaches such as ER perform more favorably, and such findings are consistent with
recent studies [40, 195].

Table 2.1: Classification results (%) on PERMUTED MNIST, ROTATED MNIST, and SPLIT
SVHN. The means and standard deviations are computed over five runs using different
random seeds. When used, episodic memories contain 5 samples per class on average.
The symbol ↑ (↓) indicates that a higher (lower) number is better.

DATASET   PERMUTED MNIST              ROTATED MNIST               SPLIT SVHN
Method    ACC (↑)       FGT (↓)       ACC (↑)       FGT (↓)       ACC (↑)       FGT (↓)
Finetune  60.19 ± 2.31  23.62 ± 1.98  43.80 ± 1.64  46.52 ± 1.71  18.85 ± 0.10  94.78 ± 1.24
EWC       64.94 ± 1.22  18.33 ± 1.07  44.99 ± 1.73  44.98 ± 1.95  18.76 ± 0.27  94.99 ± 1.23
GEM       79.17 ± 0.70   3.68 ± 0.68  82.60 ± 0.48   5.47 ± 0.45  33.40 ± 3.27  68.91 ± 4.06
ER        79.90 ± 0.46   3.78 ± 0.45  80.82 ± 0.68   6.78 ± 0.69  45.41 ± 3.03  62.37 ± 4.33
MER       79.68 ± 0.42   3.47 ± 0.41  83.56 ± 0.23   8.14 ± 0.46  -             -
GCL       82.36 ± 0.36   2.92 ± 0.23  86.37 ± 0.32   3.22 ± 0.50  60.68 ± 1.67  21.86 ± 2.35
The advantageous performance of GCL can be attributed to its efficient use of
the episodic memory. Figure 2.4 shows that both ER [40] and GCL benefit from
increases in memory size, but the advantage of GCL is more visible under the
low-resource regime. Sample efficiency is especially important since the memory
constraints cannot be relaxed despite the growing complexity of the data
distribution during training. It is also worth emphasizing that although our model
takes more time to train and evaluate at test time than ER, its training time and
testing time are comparable to other approaches (see Figure 2.5).

Figure 2.3: Average accuracy as a function of the number of tasks trained.
[Panels: PERMUTED MNIST and SPLIT CIFAR100; x-axis: Number of Tasks; y-axis:
Average Accuracy (%); curves: Finetune, EWC, GEM, ER, MER, GCL.]

Table 2.2: Classification results (%) on SPLIT CIFAR10, SPLIT CIFAR100, and SPLIT
MINIIMAGENET. The means and standard deviations are computed over five runs using
different random seeds. When used, episodic memories contain 5 samples per class
on average. The symbol ↑ (↓) indicates that a higher (lower) number is better.

DATASET   SPLIT CIFAR10               SPLIT CIFAR100              SPLIT MINIIMAGENET
Method    ACC (↑)       FGT (↓)       ACC (↑)       FGT (↓)       ACC (↑)       FGT (↓)
Finetune  18.46 ± 0.12  86.48 ± 1.02  55.39 ± 1.94  25.94 ± 1.89  37.84 ± 0.87  31.41 ± 1.57
EWC       18.49 ± 0.13  86.95 ± 1.15  55.60 ± 1.11  23.53 ± 1.19  36.61 ± 2.06  28.17 ± 4.49
ICARL     -             -             58.08 ± 1.44  24.22 ± 1.35  -             -
GEM       22.88 ± 3.41  76.90 ± 5.53  65.66 ± 0.70  15.52 ± 0.41  54.06 ± 0.22  13.17 ± 0.74
ER        29.94 ± 3.08  72.64 ± 4.88  69.40 ± 1.21  11.25 ± 1.24  58.74 ± 0.74   9.02 ± 2.49
GCL       49.62 ± 1.85  35.69 ± 3.33  74.51 ± 0.99   6.54 ± 1.26  61.54 ± 0.57   6.10 ± 2.73
Learned Graphs. Central to our approach are the pairwise similarities between
context images captured by the context graph G. Figure 2.6 shows a continuous
realization of the context graph at the end of continual learning on SPLIT CIFAR10,
which has been sorted according to context labels placed underneath the adjacency
matrix. Despite being trained exclusively on two classes of target images at a time
(e.g., plane & car or bird & cat), the model appears to learn the clustering
structure of images relatively well with more pronounced edges formed within
classes than across them. The edges across tasks are noisier, but some edges
indicate intuitive visual similarities such as those between images of car and
truck. We note that the 10-way classification setup in each task encourages the
model to clear inter-class edges, so the degree of knowledge transfer across tasks
is understandably more subtle.

Figure 2.4: Memory size effects. [Panel title: SPLIT CIFAR10; x-axis: Memory Size
(100, 250, 500, 1000); y-axis: Average Accuracy (%); series: ER, GCL.]

Figure 2.5: Wall-clock running time. [Panel title: SPLIT CIFAR100; x-axis: Method
(EWC, ER, ICARL, GCL, GEM); y-axes: Total Training Time (s) and Testing Time per
Sample (ms); series: Training, Testing.]

Figure 2.6: Context graph G. [Adjacency matrix sorted by class: plane, car, bird,
cat, deer, dog, frog, horse, ship, truck.]

Figure 2.7: Graph regularization (λG). [x-axis: λG (0, 10, 50, 100, 1000); y-axes:
Average Accuracy (%) and Average Forgetting (%); series: ACC, FGT.]
Ablation Study. We further investigate our model performance with an ablation
study and summarize it in Table 2.3. Without the graph regularization term in
Equation 2.7, the model performs significantly worse (44.04% versus 49.62% average
accuracy), indicating that past connections between context samples can help
alleviate catastrophic forgetting. By varying the hyper-parameter λG, we also see
from Figure 2.7 that an extreme amount of graph regularization (e.g. λG = 1000) can
have detrimental effects on the model performance as well. As alluded to earlier,
the ability to draw multiple graph samples and average their predictions at test
time brings some gains (44.04% versus 42.08%), as is often the case with ensemble
methods. Perhaps more importantly, we find that making the context graph G and the
context-target graph A deterministic results in a dramatic drop in accuracy (from
42.08% to 30.50%). The resulting model is a variant of attention mechanism, most
similar to attentive neural process [113], and as discussed in Section 2.3, such a
deterministic model often relies on a handful of edges, all of which are also prone
to distributional shifts and thus catastrophic forgetting.

Table 2.3: Ablation study on SPLIT CIFAR10.

Graph regularization      ✓      ×      ×      ×
Multiple graph samples    ✓      ✓      ×      ×
Random G & A              ✓      ✓      ✓      ×
Deterministic G & A       ×      ×      ×      ✓
Average accuracy          49.62  44.04  42.08  30.50
2.5 Conclusion and Discussion
In this chapter, we have introduced a graph-based approach to continual learn-
ing that exploits pairwise similarities between samples to support knowledge
transfer. Based on the learned graphs, we derive a regularization term to guide
the training of new tasks against catastrophic forgetting. Our model demon-
strates an efficient use of the episodic memory, and as a result, performs compet-
itively under various settings, without requiring access to task definition both
during training and at test time in some cases.
As graph-based approaches naturally describe relational inductive biases [16],
we hope future works further examine the applications of graphs under con-
tinual learning settings. If trained well, these graphs can be used not only to
share knowledge but also to minimize interference between samples and tasks.
A promising direction, for example, is to pose the problem of updating the
episodic memory as a graph search and leverage the rich literature on graph
theory to devise better strategies for sample selection. As shown in previous
works [8, 97], such selection mechanisms can be effective against catastrophic
forgetting, especially when the data distribution is not balanced across tasks.
CHAPTER 3
PROBABILISTIC TRANSFORMER FOR TIME SERIES PREDICTION¹
Generative modeling of multivariate time series is a challenging problem with
wide-ranging applications in demand forecasting [34, 202], autonomous driv-
ing [4, 35], robotics [65, 170], and health care [47, 48, 140]. Despite remarkable
progress in recent years, models that predict high-dimensional future observa-
tions from a few past examples have remained intractable, partly due to the
complex, non-deterministic temporal dynamics across long-distance time steps.
Given a sequence of human poses, for example, such models must internally
figure out the involved dynamics of various body components across space and
time while maintaining the inherent uncertainty of multiple plausible futures,
even though only one such future is observed.
Among proposed probabilistic approaches, state space models (SSMs) provide
a principled framework for learning and drawing inference from sequential in-
puts [58, 163]. While autoregressive models feed their predictions back into the
dynamics model without any compressed representation of data, SSMs model
stochastic transitions between abstract states using latent variables, allowing for
efficient state-to-state sampling without the need to render high-dimensional
observations. Gaussian linear dynamical systems (LDSs), one of the best known
SSMs [244], for example, postulate linear state transitions and enjoy exact infer-
ence via the celebrated Kalman filter algorithm.
While early extensions of LDSs focus on linearization [100] and the unscented
transform [237], recent work that marries state space models with deep neural
networks offers much more flexibility to model complex dependencies across
different time steps. Some approaches retain the Markovian dynamics of LDSs and
only replace their linear observation models with feed-forward networks [51,
66, 108, 186], whereas others favor nonlinear state transitions and parametrize
such dependencies via recurrent neural networks (RNNs) [49, 51, 67, 83, 119,
201]. Despite differences, both Markovian transitions and RNNs are often not
capable of capturing long-range dependencies in highly structured sequential
inputs [79, 265], limiting the capacity of the corresponding SSMs.
¹ Joint work with David S. Matteson.
In this chapter, we propose to combine the complementary strengths of SSMs
and transformer architectures [226], a powerful mechanism for modeling long-
term interactions that enjoys success across a variety of sequence modeling tasks
[57, 111, 261]. In contrast to most SSMs, our models make extensive use of atten-
tion mechanism [12, 226] between latent variables to model non-Markovian dy-
namics (see Figure 3.1). Compared to transformer-based methods, our models
are probabilistic, non-autoregressive, and capable of generating diverse long-
term forecasts with uncertainty estimates.
Our main contributions are threefold. First, we propose novel SSMs based on
transformer architectures for multivariate time series, which include generative
models and inference procedures based on variational inference [116, 194]. Sec-
ond, we extend our models to include several layers of stochastic latent vari-
ables organized in a hierarchy for further expressiveness. Third, we conduct
extensive experiments on time series forecasting and human motion prediction
and demonstrate that our Probabilistic Transformer (ProTran) performs remark-
ably well compared to various state-of-the-art baselines.
3.1 State Space Models
Notations & Objective. Let {x^{(i)}_{1:T}}_{i=1}^{N} be a collection of N univariate
time series of length T, where x_t^{(i)} ∈ R denotes the scalar value of the i-th
time series x^{(i)}_{1:T} = (x_1^{(i)}, x_2^{(i)}, ..., x_T^{(i)}) at time t. For
convenience, we consider the multivariate form x_{1:T} = (x_1, x_2, ..., x_T) where
x_t = (x_t^{(1)}, ..., x_t^{(N)}) ∈ R^N. Conditioning
on the multivariate time series up to time C (C < T ), we aim to produce distri-
bution forecasts into the future p(xC+1:T |x1:C).
General Formulation. State space models (SSMs) achieve such a goal by assuming
the joint distribution p_θ(x_{1:T}), parametrized by θ, can be written as
    p_\theta(x_{1:T}) = \int p_\theta(z_{1:T}) \, p_\theta(x_{1:T} | z_{1:T}) \, dz_{1:T},   (3.1)
where z_{1:T} = (z_1, z_2, ..., z_T) is a sequence of latent variables, sometimes
referred to as states. In other words, each SSM is a generative model that can be
decomposed into a transition model p_θ(z_{1:T}) between the latent variables and an
emission model p_θ(x_{1:T} | z_{1:T}) from the latent variables to the observable
outputs. As a result, the forecast distribution p(x_{C+1:T} | x_{1:C}) can be
computed by marginalizing out all the latent variables z_{1:T}:
    p_\theta(x_{C+1:T} | x_{1:C}) = \int p_\theta(z_{1:T} | x_{1:C}) \, p_\theta(x_{C+1:T} | z_{1:T}, x_{1:C}) \, dz_{1:T}.   (3.2)
Figure 3.1: Graphical model representations of linear dynamical systems (LDSs) in (a),
and our proposed models (ProTran) in (b), (c), and (d). Black arrows denote the
generative mechanism and red arrows the inference procedure. The separation of
generation and inference in (c) and (d) is for readability. While traditional SSMs
such as LDSs are limited to Markovian dynamics and linear dependencies, our models
allow for non-Markovian and non-linear interactions between time steps via attention
mechanism. A multi-layer extension of our models further increases expressiveness
without compromising the tractable inference procedure.
[Panels: (a) LDS; (b) ProTran (1 layer); (c) ProTran Generation (3 layers);
(d) ProTran Inference (3 layers).]

Linear Dynamical Systems. Linear dynamical systems (LDSs), for example,
assume that both transition models and emission models are linear-Gaussian²:
    p_\theta(z_{1:T}) = \prod_{t=1}^{T} \mathcal{N}(z_t | A_t z_{t-1}, Q_t), \qquad p_\theta(x_{1:T} | z_{1:T}) = \prod_{t=1}^{T} \mathcal{N}(x_t | C_t z_t, R_t),   (3.3)
where θ = (A_t, C_t, Q_t, R_t) are learnable parameters. The Gaussianity assumption
and the linear dependencies via the transition matrix A_t and the emission
matrix C_t enable exact inference, where we can alternately perform prediction
and update steps with closed forms of p(z_t | x_{1:t−1}) and p(z_t | x_{1:t}),
respectively [163]. Despite their simplicity and efficiency, LDSs are unsuited for
applications with complex transition dynamics or observable outputs due to such
strong assumptions on the model components.
² For notational simplicity, we assume z_0 = ∅ and p(· | ∅) = p(·).
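The alternating prediction and update steps mentioned above correspond to the standard Kalman filter recursions, which can be sketched as follows. The code is a textbook illustration under the linear-Gaussian model of Equation 3.3 and is not taken from the dissertation; the function names and the scalar example are assumptions made for illustration.

```python
import numpy as np

def kalman_predict(mean, cov, A, Q):
    """Moments of p(z_t | x_{1:t-1}) given those of p(z_{t-1} | x_{1:t-1})."""
    return A @ mean, A @ cov @ A.T + Q

def kalman_update(mean_pred, cov_pred, x, C, R):
    """Moments of p(z_t | x_{1:t}) after observing x_t."""
    S = C @ cov_pred @ C.T + R               # innovation covariance
    K = np.linalg.solve(S, C @ cov_pred).T   # Kalman gain: cov_pred C^T S^{-1}
    mean = mean_pred + K @ (x - C @ mean_pred)
    cov = (np.eye(len(mean_pred)) - K @ C) @ cov_pred
    return mean, cov

# Scalar example: a static state with a unit-variance prior and unit observation noise.
A, C = np.eye(1), np.eye(1)
Q, R = np.zeros((1, 1)), np.eye(1)
mean, cov = np.zeros(1), np.eye(1)
mean, cov = kalman_predict(mean, cov, A, Q)
mean, cov = kalman_update(mean, cov, np.array([2.0]), C, R)
```

In this example the posterior mean lands halfway between the prior mean and the observation, since the prior and the observation noise have equal variance.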
Our Model Assumptions. In contrast to LDSs, we allow the latent variables
to exhibit non-Markovian dynamics auto-regressively. The forecast distribution
can be decomposed into³
    p_\theta(z_{1:T} | x_{1:C}) = \prod_{t=1}^{T} p_\theta(z_t | z_{1:t-1}, x_{1:C}),   (3.4)
    p_\theta(x_{C+1:T} | z_{1:T}, x_{1:C}) = \prod_{t=C+1}^{T} p_\theta(x_t | z_t).   (3.5)
³ Similarly, we assume x_0 = z_{1:0} = ∅ for notational convenience.
As demonstrated in Figure 3.1(b), the latent variable zt+1 depends not only on
zt but also on all of its preceding latent variables, including zt−1. In addition,
our transition and emission models allow for non-linearity via neural network
parametrizations. These assumptions aim to maximize model capacity for real-
world applications with complex emissions or temporal dependencies.
However, we note that neither x_{1:t−1} nor z_{1:t−1} is included in the emission
model p(x_t | z_{1:T}, x_{1:C}). Such assumptions are important, as it has been argued
previously that a leakage of information from the latent space in autoregressive
models can hinder long-term predictions [51, 108]. While all ground truth ob-
servations are available during training, the entire sequence has to be generated
sequentially at test time, making the dependencies on x1:t−1 prone to accumu-
lated errors over multiple time steps. By letting the latent variable zt capture all
information needed to render xt, we also avoid the computational costs associ-
ated with repeatedly decoding and encoding xt in multi-step predictions.
3.2 Transformer Architectures
Attention Mechanism. Central to our models and other transformer-based
approaches [111, 226] is the notion of attention [12], which allows the models
to focus on important parts within a context in an analogous fashion to human
visual attention (see Figure 3.2). The concept can be broadly interpreted as a
vector of importance weights: in order to predict or infer one element, such as
a time series forecast or a word in a caption, we estimate using the attention
vector how strongly it is correlated with (or "attends to") other elements, such
as previously observed time series or image pixels, and take the sum of their
values weighted by the attention vector as the approximation of the target.
Figure 3.2: Visualizations of attention weights in an image captioning task. The model
sequentially generates words in the shown caption by focusing on the corresponding
salient regions in the image depicted with different colors.
Multi-head attention, for example, maps a sequence of queries Q ∈ R`q×d of
length `q to a sequence of outputs O = [O , . . . ,O ] ∈ R`q×d1 H of the same size by
attending over `k given key-value pairs K ∈ R`k×d, V ∈ R`k×d( ):
Q√hK
T
Oh = Attention(Qh,K ,V ) = Softmax
h
h h Vh, (3.6)
d
where Q Qh = QWh , Kh = KW
K
h , Vh = VW
V
h are projected queries, keys, and
values corresponding to head h ∈ [1,H] with learning parameters WQh ,WKh ,WVh ,
respectively (see Figure 3.3). Here, the correlations between queries and keys
Scaled Dot-Product Attention Multi-Head Attention
are computed via the matrix multiplication Q KTh h , and the Softmax operator
Figure 3.3: Multihead Attention [226].
Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several
attention layers running in parallel. 25
p
query with all keys, divide each by dk, and apply a softmax function to obtain the weights on the
values.
In practice, we compute the attention function on a set of queries simultaneously, packed together
into a matrix Q. The keys and values are also packed together into matrices K and V . We compute
the matrix of outputs as:
QKT
Attention(Q, K, V ) = softmax( p )V (1)
dk
The two most commonly used attention functions are additive attention [2], and dot-product (multi-
plicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor
of p1 . Additive attention computes the compatibility function using a feed-forward network with
dk
a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is
much faster and more space-efficient in practice, since it can be implemented using highly optimized
matrix multiplication code.
While for small values of dk the two mechanisms perform similarly, additive attention outperforms
dot product attention without scaling for larger values of dk [3]. We suspect that for large values of
dk, the dot products grow large in magnitude, pushing the softmax function into regions where it has
extremely small gradients 4. To counteract this effect, we scale the dot products by p1 .
dk
3.2.2 Multi-Head Attention
Instead of performing a single attention function with dmodel-dimensional keys, values and queries,
we found it beneficial to linearly project the queries, keys and values h times with different, learned
linear projections to dk, dk and dv dimensions, respectively. On each of these projected versions of
queries, keys and values we then perform the attention function in parallel, yielding dv-dimensional
output values. These are concatenated and once again projected, resulting in the final values, as
depicted in Figure 2.
Multi-head attention allows the model to jointly attend to information from different representation
subspaces at different positions. With a single attention head, averaging inhibits this.
Footnote 4: To illustrate why the dot products get large, assume that the components of q and k are independent random
variables with mean 0 and variance 1. Then their dot product, $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$, has mean 0 and variance $d_k$.
outputs the attention weights to be mapped with the corresponding values Vh.
In case Q = K = V, we refer to such an attention mechanism as self-attention.
Given fully observed sequences of inputs, the mapping can be computed effi-
ciently without any imposed sequential order often seen in recurrent neural net-
works [46, 92]. More importantly, the direct connections between long-distance
time steps are baked into the mechanism as information from previous time
steps is easily accessible without being compressed into a fixed representation,
which eases optimization and learning of long-term dependencies [12, 226].
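As a concrete illustration of the mechanism described above, the following is a minimal NumPy sketch of scaled dot-product attention in the self-attention case Q = K = V. It is a single-head toy without learned projections; the helper names `softmax` and `attention` and the toy dimensions are assumptions made for this example only.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the maximum before exponentiating for numerical stability.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # shape (len_q, len_k)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))        # toy sequence: 5 time steps, 8 features
out, weights = attention(X, X, X)  # self-attention: Q = K = V
```

Each row of `weights` is a distribution over all time steps, so every output position can read from any other position directly rather than through a compressed recurrent state.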
Positional Embeddings. Without recurrence, Transformer [226] encodes the
arrow of time by associating each time step t with a predefined sinusoidal posi-
tional embedding:
$$\mathrm{Position}(t) = [p_t(1), \ldots, p_t(d)] \in \mathbb{R}^d \tag{3.7}$$
where the i-th embedding $p_t(i) = \sin(t \cdot c^{i/d})$ for even i and $p_t(i) = \cos(t \cdot c^{i/d})$
for odd i and some large constant c. Empirical results show that such positional
embeddings are also important to our models.
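A small sketch of such a sinusoidal embedding is given below. It assumes the convention of the original Transformer, in which the frequency decays with the index (the argument is t · c^(−i/d) with c = 10000); this exponent convention, the default value of c, and the function name `position` are assumptions of the sketch rather than details confirmed by the text.

```python
import numpy as np

def position(t, d, c=10000.0):
    """Sinusoidal embedding of time step t in R^d, with indices i = 1, ..., d."""
    i = np.arange(1, d + 1)
    angles = t * c ** (-i / d)  # assumed convention: frequency decays with i
    # sin for even indices, cos for odd indices, as in the definition above.
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

emb = np.stack([position(t, d=8) for t in range(4)])  # shape (4, 8)
```

Because the embedding is a fixed function of t, it adds no learnable parameters and can be evaluated for time steps beyond those seen in training.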
3.3 Probabilistic Transformer
In this section, we first present our single-layered model and subsequently its
multi-layered extension for a hierarchy of stochastic latent variables. As alluded
to earlier, our model consists of a generative model and an inference model that
share information and parameters extensively.
3.3.1 Single-Layered Probabilistic Transformer
Generative Model. Given some contexts x1:C , we first apply a linear projection
and combine it with a positional embedding to obtain $h_{1:C} \in \mathbb{R}^d$, i.e.
ht = LayerNorm(MLP(xt) + Position(t)), (3.8)
where LayerNorm and MLP denote layer normalizations [10] and multi-layer per-
ceptrons, respectively. While a traditional transformer model often dedicates an
entire encoder for the same purpose [130, 190], we find such a simple mapping
works sufficiently well in conjunction with the context-attention module of the
corresponding decoder.
As implied in Equation 3.5, our latent dynamics decomposes auto-regressively.
At each time step, we parametrize the distribution pθ(zt | z1:t−1,x1:C) by a Gaus-
sian with parameters resulting from two steps of attention: a self-attention over
the previously inferred states z1:t−1 and another attention over the projected con-
texts h1:C . These operations mirror those in the decoder of Transformer [226],
with the stochastic latent variables replacing its decoder inputs.
Unfortunately, using stochastic samples of zt as attention queries is problem-
atic since purely stochastic transitions make it difficult for the model to reliably
retain information across multiple time steps [44, 67, 83]. We therefore encapsulate
the latent variables in hidden representations wt that also have a deterministic
component. Combined with the attention steps, such representations
allow us to model long-range temporal dependencies while accounting for the
stochasticity of future observations.
Starting with a learnable, context-agnostic representation w0, we recursively
compute wt using a stochastic sample from pθ(zt | z1:t−1,x1:C) and the positional
embedding for the current time step t. The generating process for the time step
t can be summarized by the following pseudocode:
w̄t = LayerNorm(wt−1 + Attention(wt−1,w1:t−1,w1:t−1)) (3.9)
ŵt = LayerNorm(w̄t + Attention(w̄t,h1:C ,h1:C)) (3.10)
zt ∼ N (MLP(ŵt), Softplus(MLP(ŵt))) (3.11)
wt = LayerNorm(ŵt + MLP(zt) + Position(t)), (3.12)
where Softplus is a smooth approximation of the rectifier (ReLU).
Each stochastic sample of w1:T is then mapped to a sequence of x1:T via a multi-
layer perceptron. We emphasize that our generation procedure in the latent
space is more efficient than others in the observation space, which require encoding
and decoding high-dimensional inputs repeatedly.
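The recursion in Equations 3.9 to 3.12 can be sketched in NumPy as follows. This is a toy under several stated assumptions, not the actual implementation: attention is single-head without learned projections, the MLPs are replaced by hypothetical random linear maps (`W_mu`, `W_sigma`, `W_z`), the second argument of the Gaussian is treated as a standard deviation, the self-attention at step t also attends over w0 so that the first step is well defined, and all sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_z, C, T = 16, 4, 6, 5  # hidden size, latent size, context length, horizon

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, K, V):
    # Single-head attention of one query vector q over the rows of K and V.
    return softmax(q @ K.T / np.sqrt(q.shape[-1])) @ V

def layer_norm(v, eps=1e-5):
    # Layer normalization without learned gain and bias.
    return (v - v.mean()) / np.sqrt(v.var() + eps)

def softplus(a):
    return np.logaddexp(0.0, a)  # log(1 + exp(a)), always positive

def position(t):
    i = np.arange(1, d + 1)
    angles = t * 10000.0 ** (-i / d)  # assumed frequency convention
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

# Hypothetical linear stand-ins for the MLPs in Equations 3.11 and 3.12.
W_mu = 0.1 * rng.normal(size=(d, d_z))
W_sigma = 0.1 * rng.normal(size=(d, d_z))
W_z = 0.1 * rng.normal(size=(d_z, d))

h = rng.normal(size=(C, d))  # projected contexts h_{1:C}
w = [rng.normal(size=d)]     # w_0 (learnable in the model, random here)
z = []
for t in range(1, T + 1):
    past = np.stack(w)  # w_{0:t-1}
    w_bar = layer_norm(w[-1] + attention(w[-1], past, past))  # Eq. 3.9
    w_hat = layer_norm(w_bar + attention(w_bar, h, h))        # Eq. 3.10
    mu, sigma = w_hat @ W_mu, softplus(w_hat @ W_sigma)
    z_t = mu + sigma * rng.normal(size=d_z)                   # Eq. 3.11
    w.append(layer_norm(w_hat + z_t @ W_z + position(t)))     # Eq. 3.12
    z.append(z_t)

w_seq = np.stack(w[1:])  # w_{1:T}, to be decoded into x_{1:T}
z_seq = np.stack(z)      # sampled latents z_{1:T}
```

Note that the loop never touches the observation space: decoding `w_seq` into observations happens once, after the latent trajectory has been generated.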
Inference Model. The inclusion of nonlinear state transitions and observation
models necessarily requires approximate inference. We follow the stochastic
variational inference framework [116, 194] and assume that the variational pos-
terior qφ(z1:T |x1:T ), parametrized by φ, can be decomposed auto-regressively in
a similar fashion to the prior in Equation 3.4:
$$q_\phi(z_{1:T} \mid x_{1:T}) = \prod_{t=1}^{T} q_\phi(z_t \mid z_{1:t-1}, x_{1:T}). \tag{3.13}$$
The approximate posterior qφ(zt | z1:t−1,x1:T ) at time step t is parametrized anal-
ogously to the prior pθ(zt | z1:t−1,x1:C). Indeed, these parametrizations share
most parameters and are done simultaneously in the same recursive loop, fol-
lowing the exact same steps in Equation 3.9 and Equation 3.10 (see Figure 3.1).
We note that similar sharing techniques between the generative and inference
processes have emerged as a common theme among recent successful VAE mod-
els [44, 149, 223].
While the prior only has access to the conditioning observations x1:C , the ap-
proximate posterior should take into account all observations during training,
including the targets xC+1:T . Due to the inherently unidirectional nature of RNNs,
previous work that uses RNNs to parametrize the approximate posterior often
disregards this property [49, 67, 119] and resorts to a filtering routine
p(zt | z1:t−1,x1:t). In contrast, our inference procedure more closely resembles
the smoothing process of LDSs and computes p(zt | z1:t−1,x1:T ) instead, factoring in
both past and future observations during training via another application of
self-attention:
kt = Attention(h1:T ,h1:T ,h1:T ) (3.14)
zt ∼ N (MLP([ŵt,kt]), Softplus(MLP([ŵt,kt]))), (3.15)
where [·, ·] denotes the concatenation operator. Here, we replace Equation 3.11
in the generative model with Equation 3.15, where the hidden representation kt
summarizing all information relevant to the current time step t has been concatenated
to the latent-and-context-aware representation ŵt preceding the Gaussian
parametrization.
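Variational Objective. The generative model and the inference model are
trained end-to-end with a variational lower bound on the log likelihood:
$$
\begin{aligned}
\log p_\theta(x_{1:T} \mid x_{1:C})
&= \log \mathbb{E}_q\left[\frac{p_\theta(z_{1:T} \mid x_{1:C})\, p_\theta(x_{1:T} \mid z_{1:T}, x_{1:C})}{q_\phi(z_{1:T} \mid x_{1:T})}\right] && \text{(3.16)} \\
&\geq \mathbb{E}_q\left[\log \frac{p_\theta(z_{1:T} \mid x_{1:C})\, p_\theta(x_{1:T} \mid z_{1:T}, x_{1:C})}{q_\phi(z_{1:T} \mid x_{1:T})}\right] && \text{(3.17)} \\
&= \mathbb{E}_q\left[\log p_\theta(x_{1:T} \mid z_{1:T}, x_{1:C}) + \log \frac{p_\theta(z_{1:T} \mid x_{1:C})}{q_\phi(z_{1:T} \mid x_{1:T})}\right] && \text{(3.18)} \\
&= \mathbb{E}_q\left[\sum_{t=1}^{T} \log p_\theta(x_t \mid z_t) + \sum_{t=1}^{T} \log \frac{p_\theta(z_t \mid z_{1:t-1}, x_{1:C})}{q_\phi(z_t \mid z_{1:t-1}, x_{1:T})}\right], && \text{(3.19)}
\end{aligned}
$$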
which is equivalent to the following objective
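$$\sum_{t=1}^{T} \Big( \mathbb{E}_q\left[\log p_\theta(x_t \mid z_t)\right] - \mathrm{KL}\big(q_\phi(z_t \mid z_{1:t-1}, x_{1:T}) \,\|\, p_\theta(z_t \mid z_{1:t-1}, x_{1:C})\big) \Big) \tag{3.20}$$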
where KL is the Kullback-Leibler divergence. Here, Equation 3.17 follows from
Jensen’s inequality, and Equation 3.19 from the factorizations of the emission
model, the prior, and the approximate posterior in Equation 3.5, Equation 3.4,
and Equation 3.13, respectively.
The objective in Equation 3.20 consists of a reconstruction loss for x1:T and a
KL term for z1:T , which include terms for the given contexts x1:C and their in-
ferred states z1:C . Alternatively, we can exclude these terms from the objective,
or equivalently start the inference process from t = C + 1 instead of t = 1.
For computational stability, we assume homoscedasticity and choose a Laplace
distribution with scale parameter β as the parametric form for pθ(xt | zt), i.e. we
optimize for L1 reconstruction loss with a cross-validated factor β for the KL
term, following similar variational autoencoder (VAE) work [52, 89, 228]. Such
an assumption does not necessarily limit the capacity of our models, as pow-
erful stochastic transitions and flexible emission models can theoretically char-
acterize arbitrary noise covariance [163]. Incorporating structured probabilistic
outputs such as Gaussian copulas [201] or normalizing flows [51] can potentially
further improve our model performance.
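To make the link between the Laplace emission model and the L1 loss explicit: the negative log density of a Laplace distribution with scale β is |x − x̂|/β + log(2β), so multiplying the negative of the objective in Equation 3.20 by β gives an L1 reconstruction term plus β times the KL term, up to an additive constant. The sketch below computes this loss for diagonal Gaussian prior and posterior; the function names, the toy inputs, and the use of standard deviations as the Gaussian parameters are assumptions of the example.

```python
import numpy as np

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL(N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)), summed over all entries."""
    return np.sum(
        np.log(sigma_p / sigma_q)
        + (sigma_q**2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p**2)
        - 0.5
    )

def training_loss(x, x_hat, mu_q, sigma_q, mu_p, sigma_p, beta=1.0):
    """L1 reconstruction loss plus beta-weighted KL, summed over time steps."""
    reconstruction = np.abs(x - x_hat).sum()
    return reconstruction + beta * gaussian_kl(mu_q, sigma_q, mu_p, sigma_p)

# Toy example: 2 time steps, 2-dimensional observations and latents.
x = np.array([[1.0, 2.0], [0.5, -1.0]])
x_hat = x + 0.1
mu = np.zeros((2, 2))
sigma = np.ones((2, 2))
loss_same = training_loss(x, x_hat, mu, sigma, mu, sigma, beta=0.5)
kl_shifted = gaussian_kl(np.array([0.0]), np.array([1.0]),
                         np.array([1.0]), np.array([1.0]))
```

When the posterior equals the prior, the KL term vanishes and only the reconstruction error remains, which is the case exercised by `loss_same`.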
Complexity. Our models incur a time complexity of O(T^2 d) and a memory cost
of O(T^2 d), where T is the total sequence length and d is the dimensionality of
the latent space. The recursive latent dynamics also do not allow us to take
full advantage of parallelizable attention. However, we find that our models
are still efficient in practice, especially for reasonably small values of T .
3.3.2 Multi-Layered Extension
Inspired by recent work on hierarchical VAEs for non-sequential inputs [44, 212,
223, 266], we extend our proposed model to include several layers of latent vari-
ables, aiming to further increase its flexibility for modelling sequential data.
Generative and Inference Models. We represent each time step t with a Markov
chain of L latent variables, denoted by $z_t^{(1:L)} = (z_t^{(1)}, \ldots, z_t^{(L)})$ (see Figure 3.1).
The generative and inference models also decompose auto-regressively across
different time steps and may exhibit non-Markovian dynamics:
$$p_\theta(x_{1:T}, z_{1:T}^{(1:L)} \mid x_{1:C}) = \left[\prod_{t=1}^{T} p_\theta(x_t \mid z_t^{(L)})\right] \left[\prod_{\ell=1}^{L} \prod_{t=1}^{T} p_\theta(z_t^{(\ell)} \mid z_{1:t-1}^{(\ell)}, z_{1:T}^{(\ell-1)}, x_{1:C})\right] \tag{3.21}$$
$$q_\phi\left(z_{1:T}^{(1:L)} \mid x_{1:T}\right) = \prod_{\ell=1}^{L} \prod_{t=1}^{T} q_\phi\left(z_t^{(\ell)} \mid z_{1:t-1}^{(\ell)}, z_{1:T}^{(\ell-1)}, x_{1:T}\right). \tag{3.22}$$
Intuitively, we generate samples x1:T conditioning on x1:C by following the la-
tent dynamics from the bottom up and using the generative process described
earlier within each layer.
More specifically, we parametrize the prior $p_\theta(z_t^{(\ell)} \mid z_{1:t-1}^{(\ell)}, z_{1:T}^{(\ell-1)}, x_{1:C})$ using (1)
self-attention over the inferred latents $w_{1:t-1}^{(\ell)}$ on the same layer and (2) another
attention over contexts h1:C . In this case, we include an additional self-attention
over all latent variables from the layer immediately below it (see Equation 3.23):
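$$
\begin{aligned}
\tilde{w}_t^{(\ell)} &= \mathrm{LayerNorm}\big(w_{t-1}^{(\ell)} + \mathrm{Attention}(w_{t-1}^{(\ell)}, w_{1:T}^{(\ell-1)}, w_{1:T}^{(\ell-1)})\big) && \text{(3.23)} \\
\bar{w}_t^{(\ell)} &= \mathrm{LayerNorm}\big(\tilde{w}_t^{(\ell)} + \mathrm{Attention}(\tilde{w}_t^{(\ell)}, w_{1:t-1}^{(\ell)}, w_{1:t-1}^{(\ell)})\big) && \text{(3.24)} \\
\hat{w}_t^{(\ell)} &= \mathrm{LayerNorm}\big(\bar{w}_t^{(\ell)} + \mathrm{Attention}(\bar{w}_t^{(\ell)}, h_{1:C}, h_{1:C})\big) && \text{(3.25)} \\
z_t^{(\ell)} &\sim \mathcal{N}\big(\mathrm{MLP}(\hat{w}_t^{(\ell)}), \mathrm{Softplus}(\mathrm{MLP}(\hat{w}_t^{(\ell)}))\big) && \text{(3.26)} \\
w_t^{(\ell)} &= \mathrm{LayerNorm}\big(\hat{w}_t^{(\ell)} + \mathrm{MLP}(z_t^{(\ell)}) + \mathrm{Position}(t)\big). && \text{(3.27)}
\end{aligned}
$$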
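Variational Objective. The multi-layered architecture results in a variational
bound similar to Equation 3.20:
$$\sum_{t=1}^{T} \left( \mathbb{E}_q\left[\log p_\theta(x_t \mid z_t^{(L)})\right] - \sum_{\ell=1}^{L} \mathrm{KL}\big(q_\phi(z_t^{(\ell)} \mid z_{1:t-1}^{(\ell)}, z_{1:T}^{(\ell-1)}, x_{1:T}) \,\|\, p_\theta(z_t^{(\ell)} \mid z_{1:t-1}^{(\ell)}, z_{1:T}^{(\ell-1)}, x_{1:C})\big) \right).$$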
Complexity. Stacking multiple layers of latent variables increases model expressiveness,
but it also results in a linear increase in running time and the number
of parameters. The time complexity for the L-layer transformer is O(LT^2 d),
while the space complexity remains O(T^2 d) due to the Markovian structure of
the chain $z_t^{(1:L)}$ at each time step t. In our experiments, we restrict the number
of layers of our hierarchical models to two or three.
3.4 Related Work
Deep State Space Models. Deep neural networks have been extensively combined
with state space models, resulting in flexible yet principled
latent variable approaches. While some work keeps the linear state transition
intact to leverage the efficient Kalman filter algorithms [51, 66, 108, 186], more
expressive, nonlinear latent dynamics parametrized by neural networks have
been proposed [119, 120]. All such models are limited to the Markovian dynam-
ics of LDSs, which hinders learning of long-range dependencies. The limitation
is often alleviated by combining the stochastic transitions with a deterministic
RNN that enables access to all past states [9, 18, 49, 67, 83, 204]. Our models
are similarly non-Markovian, but the dependencies on the past states are done
via attention, which allows for easy connections between long-distance time
steps. In addition, while most existing deep SSMs represent each time step with
a single latent variable, our models include several layers of hierarchical latent
variables with a tractable inference mechanism.
Attentive Recurrent Networks. Attention mechanisms have been widely adopted
in recent time series work using sequence-to-sequence models [3, 63] or transformer
architectures [33, 130, 135, 190, 213, 246]. While our models are equipped
with latent variables, these transformer approaches [130, 190] lack an inference
mechanism and are susceptible to feeding back observation noise into the dynamics
model at test time. Our work, however, can be considered as an extension
of the attentive state space model proposed in [3], with discrete latent states
replaced by their continuous analogs. Recent developments in natural language
processing [138, 141, 239] also combine transformers and VAEs; however, these
approaches often use a time-agnostic latent variable.
Time Series Forecasting. Traditional univariate time series models, such as
Box-Jenkins methods [28] and exponential smoothing [95], often assume inde-
pendence between any collection of time series [202]. While multivariate ex-
tensions of the classical approaches, including vector autoregression [222] and
multivariate GARCH [17], do not require such a strong assumption, they come
with many others such as stationarity and homoscedasticity, demand manual
selection of covariates and models, and do not scale well to even a moderate
number of time series [84, 176].
Deep learning methods for time series forecasting have recently emerged as an
expressive, scalable framework for industrial applications [23, 172, 211, 240].
While early work focuses on point forecasts [124, 182, 255], recent approaches
employ recurrent neural networks with probabilistic forecasts parametrized di-
rectly [202], using quantile functions [72], Gaussian copulas [201], normalizing
flows [51], or diffusion models [189]. In contrast, our models are devoid of such
architectures and rely on latent variables to output distributional forecasts.
Human Motion Prediction. Despite being almost identical in formulation,
human motion prediction has often been studied independently from time series
forecasting. While some work deterministically generates future motions or
video frames [31, 68, 73, 132], stochastic prediction has also been proposed with
deep neural networks often outperforming traditional methods such as hidden
Markov models [245] or Gaussian processes [238] on complex motion datasets
[31, 68, 98, 129, 154]. In contrast to earlier work [252, 257] that employs a global
latent variable across different time steps via conditional VAE [116], we lever-
age the principled framework of state space models for learning and inference
of hierarchical, time-dependent latent variables.
3.5 Experiments
We present our experimental results on two tasks, namely, time series forecasting
and human motion prediction. These tasks are often studied independently,
despite being almost identical as conditional prediction problems.
3.5.1 Time-series Forecasting
Datasets & Covariates. Following the experiment setup in [189, 190, 201], we
evaluate our models and multiple competitive baselines on five popular pub-
lic datasets: SOLAR, ELECTRICITY, TRAFFIC, TAXI, and WIKIPEDIA. The data is
recorded with hourly or daily frequency and shows seasonal patterns of differ-
ent frequencies (see Appendix B for more dataset details). As in [189, 190], the
covariates include lagged inputs, fixed time embeddings (e.g. day of week, hour
of day), and learnable time-series embeddings. The inputs are scaled using the
conditioning examples before being fed into the model, and the predictions are
rescaled appropriately afterward.
Metrics. Following [51, 190, 201], we evaluate our model and all baselines us-
ing continuous ranked probability score (CRPS) [156] summed across time series,
denoted by CRPSsum. Given a univariate distribution function F and an observation
x, CRPS is defined as
$$\mathrm{CRPS}(F, x) = \int_{\mathbb{R}} \left(F(z) - \mathbb{1}\{x \leq z\}\right)^2 dz,$$
where 1{x≤z} is the indicator function. As argued in [51], CRPSsum is a proper
scoring rule [75] and can be computed without analytical forecast distributions.
We compute the metrics in a rolling fashion and use 100 samples for the distri-
butional forecasts, similar to the aforementioned work.
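Since the forecast distribution is only available through samples, CRPS has to be estimated from them. One common estimator, sketched below, relies on the identity CRPS(F, x) = E|X − x| − ½ E|X − X′| for independent X, X′ drawn from F (valid when F has a finite first moment). Whether the evaluation code of the cited work uses this estimator or a quantile-based one is not stated here, so the sketch and the function name `crps_from_samples` should be read as assumptions.

```python
import numpy as np

def crps_from_samples(samples, x):
    """Estimate CRPS(F, x) from draws of F via E|X - x| - 0.5 E|X - X'|."""
    samples = np.asarray(samples, dtype=float)
    term_obs = np.abs(samples - x).mean()
    # Mean absolute difference over all pairs of samples.
    term_spread = np.abs(samples[:, None] - samples[None, :]).mean()
    return term_obs - 0.5 * term_spread

rng = np.random.default_rng(0)
forecast = rng.normal(loc=1.0, scale=0.5, size=100)  # 100 forecast samples
score = crps_from_samples(forecast, x=1.2)
point_score = crps_from_samples([1.0, 1.0, 1.0], x=3.0)
```

For a degenerate forecast that puts all mass on one value, the spread term vanishes and the score reduces to the absolute error, as `point_score` illustrates.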
Baselines. We benchmark our models against various baselines, including
(1) VES [95], an innovation state space model; (2) VAR-Lasso and VAR [148],
two multivariate linear autoregressive models; (3) GARCH [224], a multivariate
conditional heteroskedastic model; (4) DeepAR [202], an autoregressive RNN;
(5) LSTM-Copula and GP-Copula [201], two RNN-based models that use Gaussian copula
to model nonlinearity; (6) KVAE [119], a variational approach based on linear
dynamics; (7) NKF [51], a normalizing-flow model coupled with Kalman filters;
(8) Transformer-MAF [190], a transformer-based model based on masked autoregressive
flow; and (9) TimeGrad [189], a recent diffusion-based approach.
Implementations. We use 8-head attentions and 2-layer MLPs to parametrize
the generative and inference models. The stochastic latent variables zt are
16-dimensional while the hidden representations wt are in R^128. Our probabilistic
transformers for SOLAR and ELECTRICITY have one stochastic layer while those
for the other datasets of higher dimensional observations employ two layers.
We report the numbers of parameters of our models in Table B.4 in Appendix B,
which are all comparable to those of the state-of-the-art approaches.
Table 3.1: Test set CRPSsum of time series forecasting models (lower is better). The means
and standard deviations are computed over five runs using different random seeds.
DATASET SOLAR ELECTRICITY TRAFFIC TAXI WIKIPEDIA
VES [95] 0.900 ± 0.003 0.880 ± 0.004 0.350 ± 0.002 - -
VAR [148] 0.830 ± 0.006 0.039 ± 0.001 0.290 ± 0.001 - -
VAR-Lasso [148] 0.510 ± 0.006 0.025 ± 0.000 0.150 ± 0.002 - 3.100 ± 0.004
GARCH [224] 0.880 ± 0.002 0.190 ± 0.001 0.370 ± 0.001 - -
DeepAR [202] 0.336 ± 0.014 0.023 ± 0.001 0.055 ± 0.003 - 0.127 ± 0.042
LSTM-Copula [201] 0.319 ± 0.011 0.064 ± 0.008 0.103 ± 0.006 0.326 ± 0.007 0.241 ± 0.003
GP-Copula [201] 0.337 ± 0.024 0.024 ± 0.002 0.078 ± 0.002 0.208 ± 0.183 0.086 ± 0.004
KVAE [119] 0.340 ± 0.025 0.051 ± 0.019 0.100 ± 0.005 - 0.095 ± 0.012
NKF [51] 0.320 ± 0.020 0.016 ± 0.001 0.100 ± 0.002 - 0.071 ± 0.002
Transformer-MAF [190] 0.301 ± 0.014 0.021 ± 0.000 0.056 ± 0.001 0.179 ± 0.002 0.063 ± 0.003
TimeGrad [189] 0.287 ± 0.020 0.021 ± 0.001 0.044 ± 0.006 0.114 ± 0.020 0.049 ± 0.002
ProTran (Ours) 0.194 ± 0.030 0.016 ± 0.001 0.028 ± 0.001 0.084 ± 0.003 0.047 ± 0.004
Accuracy Comparison. Table 3.1 shows that our models perform competi-
tively across all five high-dimensional time series datasets, achieving CRPSsum
comparable to the best methods on ELECTRICITY and WIKIPEDIA while outper-
forming all baselines, including a transformer-based approach [190], by signifi-
cant margins on SOLAR, TRAFFIC and TAXI. Further analyses with other metrics,
including CRPS and NMSE, in Appendix B also confirm our findings.
Qualitative Results. Figure 3.4 shows that the distribution forecasts generated
by our model follow closely the ground truths, which is consistent with our
accuracy results. In addition, the model appears to capture the uncertainty of
future forecasts to some extent; observations of large magnitudes and far into
Table 3.2: Ablation study on TRAFFIC.
Two Layers ✓ × × ×
One Layer × ✓ ✓ ✓
Context Attention ✓ ✓ × ✓
Deterministic × × × ✓
CRPSsum 0.028 0.031 0.033 0.041
Figure 3.4: Prediction intervals and test set ground truth from ProTran (our model) for
the first 16 of the 963 time series in the TRAFFIC dataset.
the future seem to correctly have higher variance estimates.
Ablation Study. We include a small-scale ablation study on the TRAFFIC dataset
to investigate which components of our models are essential. Table 3.2 suggests
that removing the stochasticity from wt has the largest impact on model performance,
implying that incorporating latent variables into a transformer is indeed
useful. Other aspects such as context attention or multiple layers of stochastic
variables do not show dramatic effects in this study; however, they do con-
tribute performance gains.
3.5.2 Human Motion Prediction
Datasets. Following the experiment setup in [257], we evaluate our models
on two public motion capture datasets: Human3.6M [96] and HumanEva-I
[208]. While Human3.6M is a large-scale dataset with 3.6 million video frames
recorded at 50Hz, HumanEva-I is smaller with only 3 subjects and recorded at
60Hz. We follow the preprocessing steps of previous work [155, 257] and obtain
a 17-joint skeleton for Human3.6M and a 15-joint skeleton for HumanEva-I. As in
[257], we predict 2 seconds of future motion conditioned on 0.5 seconds of observed
motion for Human3.6M, and 1 second conditioned on 0.25 seconds for HumanEva-I.
Metrics. Following previous work on trajectory forecasting [4, 81], we adopt
two popular metrics, namely, average displacement error (ADE) and final dis-
placement error (FDE). ADE measures the average L2 distance over all time
steps between the ground truth motion and the closest sample, while FDE only
considers such distance for the final pose.
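The two metrics can be sketched as follows. The example assumes that each pose is flattened into a single vector per time step and that the closest sample is selected separately for each metric; both conventions, along with the function name `ade_fde`, are assumptions of this sketch rather than details taken from the evaluation code of [257].

```python
import numpy as np

def ade_fde(samples, truth):
    """Best-of-S ADE and FDE.

    samples: (S, T, J) array of S predicted motions over T steps,
             with each pose flattened into J coordinates.
    truth:   (T, J) array holding the ground-truth motion.
    """
    # L2 distance between each sample and the truth at every time step: (S, T).
    dist = np.linalg.norm(samples - truth[None], axis=-1)
    ade = dist.mean(axis=1).min()  # closest sample by average distance
    fde = dist[:, -1].min()        # closest sample by final-pose distance
    return ade, fde

truth = np.zeros((2, 4))
samples = np.stack([np.ones((2, 4)), 3.0 * np.ones((2, 4))])
ade, fde = ade_fde(samples, truth)
```

In this toy case the first sample is offset by 1 in each of the 4 coordinates, so its L2 distance is 2 at every step and both metrics equal 2.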
Baselines. We compare our models against 9 models, including ERD [68] and
acLSTM [132], two deterministic RNN-based approaches; MT-VAE [252] and
Pose-Knows [236], two conditional VAE models; HP-GAN [14], a conditional
GAN; Best-Many [25], GMVAE [55], DeliGAN [82], and DSP [258], four approaches
optimizing for diversity objectives. The results for these baselines are
reported as in [257].
Implementations. Similar to the previous experiments, we use 8-head attentions
and 2-layer MLPs. Since Human3.6M is significantly more complex and
multi-modal than the time series forecasting datasets, we make use of 3 stochas-
[Figure 3.5 panel titles: Smoking, Walk Together, Phoning, Walking, Discussion, Walk Dog]
Figure 3.5: Ground-truth pose sequences (first row) and corresponding predictions by
ProTran (second row). Solid colors indicate later time-steps and faded ones are older.
The body-part movements in the predicted and ground-truth poses resemble similar
patterns, while certain variations are retained.
Table 3.3: Human motion prediction results.
DATASET HUMAN3.6M HUMANEVA-I
Method ADE ↓ FDE ↓ ADE ↓ FDE ↓
ERD [68] 0.722 0.969 0.382 0.461
acLSTM [132] 0.789 1.126 0.429 0.541
MT-VAE [252] 0.457 0.595 0.345 0.403
Pose-Knows [236] 0.461 0.560 0.269 0.296
HP-GAN [14] 0.858 0.867 0.772 0.749
Best-Many [25] 0.448 0.533 0.271 0.279
GMVAE [55] 0.461 0.555 0.305 0.345
DeliGAN [82] 0.483 0.534 0.306 0.322
DSP [258] 0.493 0.592 0.273 0.290
DLow [257] 0.425 0.518 0.251 0.268
ProTran (Ours) 0.381 0.491 0.258 0.255
tic layers, as opposed to 2 layers for HumanEva-I. For Human3.6M, the context
and target observations are significantly longer and set up for long-term predic-
tions, so we only infer latent variables for target observations. Appendix B also
contains further details about our models and their number of parameters.
Quantitative Results. Table 3.3 shows that our models convincingly outper-
form all baselines based on both metrics ADE and FDE, with the gains signifi-
cantly higher for the larger dataset Human3.6M. We emphasize that our favor-
able performance is evaluated using random samples, while the closest com-
petitor, DLow [257], relies on a separate model for selecting samples to promote
diversity, which can potentially be combined with our probabilistic transformer
for further improvements.
Qualitative Results. We show in Figure 3.5 human pose predictions made by
our model that are most similar to the corresponding ground truths among a
collection of such stochastic predictions. The similarities between the body-part
movements in both sequences suggest that our model has been able to capture
the temporal dynamics quite well.
3.6 Conclusion & Discussion
In this chapter, we have introduced generative models for multivariate time
series that combine the strengths of state space models and transformer architectures.
In contrast to previous work, our models do not rely on RNNs but make
extensive use of attention mechanisms. We also extend our models to include
hierarchical latent variables, inspired by recent developments of VAEs for non-
sequential data [44, 223]. Empirical experiments show that our models perform
remarkably well on time series forecasting and human motion prediction.
Our models do not come without limitations, however. As in other transformer-
based approaches, the reliance on attention incurs a quadratic time and memory
complexity. While we do not find it problematic in our experiments, the limi-
tation necessarily hinders applications of our models in tasks characterized by
long-term dependencies such as language modelling or music generation [79].
Fortunately, recent work on sparse transformers [22, 45, 118, 130] can potentially
address the issue, and we leave such an investigation for future work.
CHAPTER 4
PROBABILISTIC TRANSFORMER FOR VIDEO PREDICTION¹
While impressive advances have been made on generative models of images
[91, 109, 223], audio [26, 41], and text [29, 183], video predictions remain excep-
tionally challenging given the interplay of high-dimensional images and com-
plex temporal dynamics. Given 5 conditioning 64× 64 video frames, for exam-
ple, a model would need to extract information across space and time from more
than 60,000 pixels in order to generate 60,000 more pixels for future frames. At
a higher level, such a model would need to continuously detect and track the
objects in the frames and consistently decode their positions and appearances
into future ones, despite possible occlusion, clutter, or object deformations.
In this chapter, we extend the probabilistic transformer model proposed in Chap-
ter 3 into the settings of conditional video prediction [187, 216, 232]. Although
it is conceivable to naively flatten the 3-dimensional tensors representing video
frames and apply the previous model without any changes, doing so would
lead to an exploding number of parameters, and more importantly, disregard
the local connectivity and translational invariance of images. Such properties
naturally lend themselves to convolutional operators, so we intuitively replace
the classical attention mechanism with a new convolution-based alternative de-
signed to work with 3-dimensional tensors. While the time series forecasting
task does not benefit significantly from our deep multi-layered formulation on
some datasets, we find that deep, hierarchical architectures are especially help-
ful for videos when combined with convolutional operations, similar to how
stacking multiple convolutional modules helps with image modeling.
¹ Joint work with David S. Matteson.
As a state space model, the resulting model inherits the principled probabilistic
framework with tractable variational inference. The transformer architectures
also allow for non-Markovian temporal dynamics, which are especially relevant
for video contents of complex modalities. Empirical results demonstrate that
our models are relatively effective on several video generation datasets.
4.1 Related Work
Deterministic Models. Earlier works on video prediction are often based on
deterministic recurrent neural networks [78], such as long short-term memory
(LSTM) networks [92], including [16, 105, 216, 227, 229] or convolutional LSTMs
[249] such as [103, 187, 250]. Several other approaches propose using specialized
computer vision techniques such as pixel-level transformations or optical flow
[65, 71, 103, 133, 142, 144, 146, 233, 234, 235]. These models, however, are limited
to deterministic predictions and fail to generate sharp long-term video frames
[11, 52].
Stochastic and Autoregressive Models. Several approaches directly optimize
exact likelihood via pixel-level autoregression [107, 171, 242] or normalizing
flows using invertible transformations [115, 123], all of which require restrictive
temporal generation designs to manipulate high-dimensional inputs. Closer
to our work are models based on variational inference [116, 194] that incorpo-
rate latent variables into convolutional LSTMs [11, 126] or LSTMs [52, 86]. In
contrast to our model, however, the latent variables in these models do not con-
tain all information required to render future frames and model predictions are
often fed back into the latent space, making them more susceptible to accumu-
lated errors in multi-frame predictions.
State Space Models. As discussed in Chapter 3, several previous works have
explored various modeling choices to learn stochastic sequential models, which
differ in the factorization of the generative and inference models, their network
architectures, and the objectives used in their training procedures [9, 56, 66,
67, 69, 80, 108, 120]. These models are often based on RNNs, which are not
equipped with a mechanism to learn long-range dependencies. In contrast, our
models are completely devoid of recurrent architectures and make extensive use
of attention for learning such dependencies.
Attention in Vision Models. Central to our model is the convolution-based
attention mechanism operating on 3-dimensional tensors such as images. While
previous works have experimented with replacing all convolution operations in
images with self attention [21, 185, 264], we find that such techniques often lead
to large model sizes and do not necessarily work well on video data and that
combining convolution with space-time attention [24] offers a relatively simple
and effective alternative.
4.2 Probabilistic Transformer
Space-Time Attention. As discussed in Chapter 3, multi-head attention maps
a sequence of queries $Q \in \mathbb{R}^{\ell_q \times d}$ of length $\ell_q$ to a sequence of outputs
$O = [O_1, \ldots, O_H] \in \mathbb{R}^{\ell_q \times dH}$ of the same length by attending over $\ell_k$ given
key-value pairs $K \in \mathbb{R}^{\ell_k \times d}$, $V \in \mathbb{R}^{\ell_k \times d}$:
$$O_h = \mathrm{Attention}(Q_h, K_h, V_h) = \mathrm{Softmax}\left(\frac{Q_h K_h^\top}{\sqrt{d}}\right) V_h, \tag{4.1}$$
where $Q_h = Q W_h^Q$, $K_h = K W_h^K$, $V_h = V W_h^V$ are projected queries, keys, and
values corresponding to head $h \in [1, H]$ with learnable parameters $W_h^Q$, $W_h^K$, $W_h^V$,
respectively (see Figure 3.3).
Here all queries, keys, and values are d-dimensional vectors, so the dot products
between queries and keys are well-defined and can be computed efficiently via
the matrix multiplications in Equation 4.1.
For video generation, all frame inputs and most intermediate outputs are 3-
dimensional tensors of varying resolutions. As a result, we first replace the
linear projections in multihead attention with 1× 1 convolutional operations to
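map the queries $Q \in \mathbb{R}^{\ell_q \times c \times h' \times w'}$, the keys and values $K, V \in \mathbb{R}^{\ell_k \times c \times h' \times w'}$ into
3-dimensional tensors of the same size
$$
\begin{aligned}
Q_h &= \mathrm{Conv1x1}_h^{(q)}(Q) \in \mathbb{R}^{\ell_q \times c' \times h' \times w'}, && \text{(4.2)} \\
K_h &= \mathrm{Conv1x1}_h^{(k)}(K) \in \mathbb{R}^{\ell_k \times c' \times h' \times w'}, && \text{(4.3)} \\
V_h &= \mathrm{Conv1x1}_h^{(v)}(V) \in \mathbb{R}^{\ell_k \times c' \times h' \times w'}, && \text{(4.4)}
\end{aligned}
$$
where $\mathrm{Conv1x1}_h^{(q)}$, $\mathrm{Conv1x1}_h^{(k)}$, and $\mathrm{Conv1x1}_h^{(v)}$ have separate learnable parameters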
for the queries, keys, and values corresponding to head h ∈ [1,H], respectively.
Intuitively, these mappings allow for the interactions between the original c
channels within each tensor.
We compute attention weights between Qh and Kh by flattening the tensors into
$Q'_h \in \mathbb{R}^{(\ell_q h' w') \times c'}$ and $K'_h, V'_h \in \mathbb{R}^{(\ell_k h' w') \times c'}$, and taking dot products to obtain a
$\mathbb{R}^{\ell_q \times \ell_k \times h' \times w'}$ tensor. The Softmax operation is then computed over the $\ell_k$ time
steps in the resulting tensor, and the attention weights are combined with the
corresponding transformed values as in Equation 4.5:
$$O_h = \mathrm{SpaceTimeAttention}(Q_h, K_h, V_h) = \mathrm{Softmax}\left(\frac{Q'_h {K'_h}^\top}{\sqrt{d}}\right) V'_h, \tag{4.5}$$
In short, the space-time attention uses convolutions to measure similarities be-
tween the queries and the keys but attends over the values at different time
steps as in multi-head attention.
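One way to read the space-time attention above is sketched below in NumPy for a single head. The sketch makes several assumptions that are not confirmed by the text: dot products are taken across channels between queries and keys at the same spatial location, the scaling uses the number of projected channels c′, and a 1×1 convolution is written as a per-pixel linear map over channels. The names `conv1x1` and `space_time_attention` and all tensor sizes are hypothetical.

```python
import numpy as np

def conv1x1(X, W):
    """1x1 convolution as a per-pixel linear map over channels.

    X: (time, c, h, w) input tensor, W: (c_out, c) weights.
    """
    return np.einsum("oc,tchw->tohw", W, X)

def space_time_attention(Qh, Kh, Vh):
    """Qh: (lq, c, h, w); Kh, Vh: (lk, c, h, w). Returns (lq, c, h, w) output."""
    c = Qh.shape[1]
    # Channel-wise dot products at each spatial location: (lq, lk, h, w).
    scores = np.einsum("qchw,kchw->qkhw", Qh, Kh) / np.sqrt(c)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # softmax over lk steps
    # Weighted combination of the values across time at each location.
    return np.einsum("qkhw,kchw->qchw", weights, Vh), weights

rng = np.random.default_rng(0)
lq, lk, c, c_out, height, width = 2, 5, 3, 6, 4, 4
Q = rng.normal(size=(lq, c, height, width))
K = rng.normal(size=(lk, c, height, width))
V = rng.normal(size=(lk, c, height, width))
Wq, Wk, Wv = (rng.normal(size=(c_out, c)) for _ in range(3))
out, weights = space_time_attention(conv1x1(Q, Wq), conv1x1(K, Wk), conv1x1(V, Wv))
```

Under this reading the output keeps the spatial layout of the query frames, and every pixel location carries its own distribution over the key time steps.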
[Figure 4.1 diagram: observations x_{t−1}, x_t, x_{t+1} and latent variables z^{(1)}, ..., z^{(4)} at time steps t−1, t, t+1; panels (a) Generation and (b) Inference]
Figure 4.1: Graphical model representations of our probabilistic transformer models.
Black arrows denote the generative mechanism and red arrows the inference procedure.
The separation of generation and inference in (a) and (b) is for readability. We interleave
recursive layers (e.g. layer 1 and layer 3) and non-recursive layers (e.g. layer 2 and layer
4) to increase expressiveness of the temporal dynamics and reduce running time.
Interleaving Recursive and Non-Recursive Layers. As in Chapter 3, we study
hierarchical models constructed by stacking multiple probabilistic transformer
layers described in Subsection 3.3.2 (see Figure 4.1). In our case, each layer
consists of intermediate outputs of all time steps, with the bottom layers
encoding low-resolution, high-level information and the top layers encoding
high-resolution, fine details about video frames.
Such layers are built recursively from left to right, allowing a current time step
to focus on all preceding ones. As a result, each of these layers has to be formed
sequentially, which significantly slows down the generation and inference
procedures, especially when there are many future steps or hierarchical layers.
Because a layer has access to all temporal information from the layer below it,
including all past, present, and future time steps, we propose removing
recursive connections in some layers to speed up training and inference.
In particular, we keep a recursive layer for each new resolution from the
bottom up and make all other layers of the same resolution non-recursive. The
resulting models are fast to train and serve but remain highly expressive.
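The interleaving rule can be sketched in a few lines of Python. The helper name and the example resolutions are hypothetical; the only assumption is that layers are listed from the bottom up together with their spatial resolution.

```python
def assign_layer_types(resolutions):
    """Mark the first layer at each new resolution as recursive.

    resolutions lists the spatial resolution of each layer from the bottom up.
    """
    types = []
    previous = None
    for resolution in resolutions:
        types.append("recursive" if resolution != previous else "non-recursive")
        previous = resolution
    return types


# Four layers, two per resolution (hypothetical values).
layer_types = assign_layer_types([8, 8, 16, 16])
```

For this configuration, layers 1 and 3 are recursive and layers 2 and 4 are non-recursive, which matches the pattern described in the caption of Figure 4.1.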
4.3 Experiment Results
Datasets. We run experiments on three video datasets:
• STOCHASTIC MOVEMENT DATASET [11]. The first frame of every video
consists of a shape placed near the center of a 64 × 64 × 3 resolution gray
background with its type, size and color randomly sampled. The shape
randomly moves in one of eight directions with constant speed, implying
the position of the shape at any time step is completely determined by that
of the shape at the previous step. We condition only on the initial frames
and predict the next four frames both during training and at test time.
• DETERMINISTIC MOVING MNIST [52]. This dataset consists of one or
two MNIST digits [125] moving linearly and deterministically bouncing
on walls with predefined direction and velocity. We condition only on the
initial frames and predict the next 15 and 100 frames during training and
at test time, respectively.
• STOCHASTIC MOVING MNIST [52]. This dataset consists of one or two
MNIST digits [125] moving linearly and randomly bouncing on walls with
new direction and velocity sampled randomly at each bounce. We condi-
tion on the first five frames of each video and predict the next 10 and 25
frames during training and at test time, respectively.
Metrics. Following previous work [69], we use peak signal-to-noise ratio
[Figure panels: STOCHASTIC MNIST and DETERMINISTIC MNIST. x-axis: Time Horizon; y-axis: PSNR. Legend entries: SVG, SRVP, SRVP-GRU, SRVP-NZ, ProTran.]
Figure 4.2: Peak signal-to-noise ratio as a function of time horizon.
(PSNR) and structural similarity index measure (SSIM) to compare predictions
of future frames and their corresponding ground truths.
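For reference, PSNR follows the standard definition 10 log10(MAX² / MSE). The sketch below assumes frames scaled to [0, 1]; the function name and the `max_value` argument are choices made for this example. SSIM is more involved and is omitted here.

```python
import numpy as np


def psnr(prediction, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_value]."""
    prediction = np.asarray(prediction, dtype=float)
    target = np.asarray(target, dtype=float)
    mse = np.mean((prediction - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


target = np.zeros((64, 64))
prediction = np.full((64, 64), 0.1)
score = psnr(prediction, target)  # MSE is 0.01, so the score is about 20 dB
```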
Quantitative Results. Table 4.1 shows that our models perform on par with
other baselines, including SVG [52] and SRVP [69], two popular latent variable
models based on variational inference. As seen in Figure 4.2, our model
performance decreases slowly as the time horizon increases, while autoregressive
models such as SVG are more prone to accumulated errors in multi-frame
prediction settings.
Qualitative Results. Figure 4.3 and Figure 4.4 show video frame predictions
and their corresponding ground truths at different time horizons. Our models
appear to learn the temporal dynamics and maintain the properties of involved
Table 4.1: PSNR and SSIM scores on MOVING MNIST.
           STOCHASTIC                           DETERMINISTIC
Method     PSNR (↑)        SSIM (↑)             PSNR (↑)        SSIM (↑)
SVG        14.50 ± 0.04    0.7090 ± 0.0015      12.85 ± 0.03    0.6185 ± 0.0011
SRVP       16.93 ± 0.07    0.7799 ± 0.0020      18.25 ± 0.06    0.8300 ± 0.0017
ProTran    17.20 ± 0.05    0.7842 ± 0.0017      18.43 ± 0.12    0.8544 ± 0.0014
Figure 4.3: Predicted video frames on odd rows and their corresponding ground truths
on even rows at two different time horizons, namely t = 2 on the left and t = 4 on the right, on
the STOCHASTIC MOVEMENT DATASET. The shown predictions are among the closest
samples to the ground truths based on PSNR.
objects relatively well as the frames mirror each other. From Figure 4.4, we also
see that the frame predictions after the collisions of digits seem consistent with
the ground truths, suggesting that the non-Markovian dynamics might indeed
have been captured in the latent space.
4.4 Conclusion
In this chapter, we have extended the probabilistic transformer proposed in
Chapter 3 to the task of video prediction. While preliminary results on rela-
tively simple datasets are encouraging, additional experiments on challenging
real-world datasets can further validate our model performance.
Figure 4.4: Predicted video frames on odd rows and their corresponding ground truths
on even rows at two different time horizons, namely t = 4 on the left and t = 13 on the right,
on the STOCHASTIC MOVING MNIST dataset. The shown predictions are among the
closest samples to the ground truths based on PSNR.
CHAPTER 5
ADDITIONAL APPLICATIONS OF DEEP NEURAL NETWORKS
5.1 Dynamic Poverty Prediction with Vegetation Index1
Despite global economic growth, 330 million people are still living in extreme
poverty in Africa [20]. The United Nations has acknowledged poverty as one
of the greatest challenges facing humanity and aims to end extreme poverty
in all forms by 2030 [164]. To achieve the goal, policy makers often rely on
complex household surveys to measure poverty and allocate resources [13].
Since the data collection process is costly and time consuming, there is a lack
of good-quality data to assess poverty regularly [101]. Inexpensive and scalable
approaches to poverty prediction are thus needed to complement household
surveys.
Recent advances in remote sensing and machine learning have opened up a new
path for poverty prediction. High resolution satellite images are rich in content
and available globally, providing an objective view on the economic conditions
of developing countries [42, 88, 128]. The increasing abundance of these images
lends itself to convolutional neural networks (CNNs), a deep learning
approach that has recently seen tremendous success in many computer vision
tasks [193, 199, 205]. In a critically acclaimed paper [101], Jean et al. applied
CNNs to daytime satellite images to measure regional poverty in Africa,
yielding results comparable to estimation based on past surveys.
Although promising, most current studies with satellite images are limited to
providing fixed estimates of poverty maps. The poverty predictions in [101], for
example, were constant scalars regardless of the time of prediction due to the
1Joint work with Ying Sun, Yanyan Liu, and David S. Matteson.
Figure 5.1: NDVI measurements for Uganda in 2011. On the left, the background im-
age shows annual average NDVI with a vertical colorbar while the foreground scatters
depict log consumption expenditures with a horizontal colorbar. On the right, the an-
nual NDVI, spatially averaged over all survey locations, with notable drops during the
2011-2012 East Africa drought highlighted in gray.
lack of continuous access to proprietary data. Given the unprecedented climate
changes anticipated in sub-Saharan countries [175], dynamic poverty mapping
is critical to timely interventions and policy evaluation.
In this section, we use the continuous streaming of the normalized difference
vegetation index (NDVI), one of the widely known satellite measurements of
Earth’s vegetation greenness, to estimate poverty indicators more frequently.
NDVI measures the difference in red and near-infrared light reflectance resulting
from photosynthesis; areas of barren rock have very low NDVI, while dense
croplands often have high NDVI values (see Figure 5.1). As ultra-poor regions
heavily depend on agriculture [53], NDVI provides an uninterrupted signal for
crop health and poverty tracking in general.
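For concreteness, NDVI is conventionally computed per pixel as (NIR − Red) / (NIR + Red). The sketch below uses hypothetical reflectance values, and the small `eps` term is an assumption added only to avoid division by zero.

```python
import numpy as np


def ndvi(nir, red, eps=1e-12):
    """Normalized difference vegetation index from reflectance arrays."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    return (nir - red) / (nir + red + eps)


# Hypothetical reflectances: dense vegetation first, sparse cover second.
values = ndvi(nir=[0.50, 0.30], red=[0.08, 0.25])
```

Dense vegetation reflects strongly in the near-infrared band and absorbs red light, so the first pixel receives a much higher index than the second.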
Our contribution is twofold: (1) we demonstrate that publicly-available, moderate-
resolution NDVI can help predict poverty in Malawi, Nigeria, Rwanda, Tanza-
nia, and Uganda as well as competitive baselines, and (2) we perform poverty
prediction for an out-of-sample period and capture changes in poverty mea-
sures for ultra-poor regions in Uganda.
5.1.1 Related Work
Recent studies on poverty prediction rely on passively collected data and sta-
tistical methods to circumvent the scarcity of household surveys. Researchers
have fit linear models to nighttime light luminosities and found they strongly
correlate with the gross domestic product of various countries [42, 88, 128]. Pro-
prietary cell phone records of millions of subscribers in Rwanda and Senegal
have also been used with tree-based classifiers and Gaussian process regressors
to provide asset wealth index estimates [27, 178]. In another line of research,
CNNs help extract predictive features from Google’s high-resolution daytime
images, providing accurate estimates of both consumption expenditure and as-
set wealth in multiple African countries [101, 248].
Although existing technologies can only measure NDVI at a lower resolution,
recent work on poverty and health in Africa has proven its significant predictive
power. Through increased crop yields, NDVI has been found to be positively
correlated with child survival, nutrition, and anthropometric variables such as
wasting [106]. Using spatial statistics techniques, Sedda et al. [203] also showed
that the intensity of poverty varies inversely with NDVI in West Africa.
5.1.2 Datasets and Methodology
Inspired by previous work, we apply CNNs to publicly available NDVI images
to learn features useful for poverty prediction. Following Jean et al. [101], we
use transfer learning and a two-step procedure to bypass the lack of labeled
            Consumption Prediction (LSMS)               Asset Index Prediction (DHS)
Country     Year   Jean et al. [101]   NDVI             Year   Jean et al. [101]   NDVI
Malawi      2013   0.37                0.341 ± 0.038    2010   0.55                0.498 ± 0.020
Nigeria     2013   0.42                0.387 ± 0.013    2013   0.68                0.738 ± 0.005
Rwanda      -      -                   -                2010   0.75                0.725 ± 0.022
Tanzania    2012   0.55                0.603 ± 0.019    2010   0.57                0.638 ± 0.012
Uganda      2011   0.41                0.490 ± 0.012    2011   0.69                0.751 ± 0.007
Table 5.1: Spatially cross-validated $r^2$ values of the predictions of NDVI models relative
to Jean et al. [101]. Separate models are fine-tuned and evaluated for different countries
and surveys. For NDVI models, the means and standard deviations of $r^2$ values are
reported using 5 independent trials.
responses: (1) fine-tune a VGG-16 network [209] on NDVI images to predict
nighttime light intensities, and (2) fit random forest regression models using
NDVI features to predict poverty indicators. The combination of NDVI images
and nighttime lights allows vegetation features indicative of economic activity
to be learned and generalized to the poverty prediction task.
In the first step of our procedure, we start with a VGG-16 network pre-trained
on ImageNet and adapt its fully connected layers to fit our input image sizes.
Our inputs are annual average NDVI images, each 64 × 64 pixels in size and
at a spatial resolution of 250 square meters per pixel. The images are sampled
from a dataset produced by NASA’s Terra satellite [54] and represent areas that
are evenly spaced at 0.025 degree intervals. The network learns to map each
NDVI image to the average value of annual nighttime light intensities that de-
scribe the same geographical region, as provided by the National Oceanic and
Atmospheric Administration [169]. In contrast to [101, 248], we take a log
transformation of nighttime light intensities but do not discretize them.
In the second step, we extract features from NDVI images and fit regression
models to predict two survey variables - logarithm of consumption expenditure
and asset index. For direct comparison, we select the same surveys and follow
the same preprocessing steps as in [101] (see Table 5.1 for the list of surveys).
The conv5-2 layer of the fine-tuned network outputs a feature map of size 512×
1 for each NDVI image, and we average feature maps of images whose centers
are within five kilometers of a surveyed community. For each survey, we then
train random forests on the 512-dimensional feature maps, using nested 5-fold
spatial cross-validation to select hyperparameters, and output predictions.
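The feature aggregation in the second step can be sketched as follows. The use of the haversine great-circle distance, the function names, and the toy coordinates are assumptions for illustration; the source only states that features of images centered within five kilometers of a community are averaged.

```python
import math

import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def community_features(community, image_centers, image_features, radius_km=5.0):
    """Average the feature vectors of images centered within radius_km."""
    nearby = [
        features
        for (lat, lon), features in zip(image_centers, image_features)
        if haversine_km(community[0], community[1], lat, lon) <= radius_km
    ]
    if not nearby:
        return None  # no image close enough to this community
    return np.mean(nearby, axis=0)


# Toy example: two nearby image centers and one far away (hypothetical values).
centers = [(0.300, 32.500), (0.325, 32.500), (1.000, 33.000)]
features = [np.ones(512), 3 * np.ones(512), 100 * np.ones(512)]
averaged = community_features((0.31, 32.50), centers, features)
```

The resulting 512-dimensional vectors, one per surveyed community, would then serve as inputs to the random forest regressors.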
In response to weather shocks, NDVI often changes over time (see Figure 5.1),
and its feature maps can potentially capture and reflect these events in poverty
predictions. Hence, we also fine-tune the network in the first step with updated
NDVI images and nighttime lights. Predictions for out-of-sample periods are
obtained by first training a random forest on the previous NDVI feature maps
and testing it on the updated ones.
5.1.3 Experiment Results
Our first experiment is to evaluate the predictive power of NDVI for poverty
estimation. Following [101], we use expenditure data from the World Bank’s
Living Standards Measurement Study (LSMS) and asset index data from the
Demographic and Health Surveys (DHS). Table 5.1 shows that our NDVI mod-
els are highly predictive of both average household consumption and average
asset wealth. Spatially cross-validated predictions explain 34 to 60% of the vari-
ation in average consumption and 50 to 75% of the variation in average asset
wealth across surveyed countries. In general, our models perform comparably
to Jean et al. [101] when fit using data from individual countries.
[Figure panels: top, consumption; bottom, asset index. x-axis: Poorest Percent of Surveyed Communities Used; y-axis: $r^2$. Curves: Nightlights, Jean et al., NDVI. Reference lines in the top panel: 1x, 2x, and 3x poverty line.]
Figure 5.2: Spatially cross-validated results of NDVI models relative to nightlights and
Jean et al. [101]. Nightlight-based models are random forests trained on scalar
nighttime light intensities. The top figure shows $r^2$ values for estimating consumption using
pooled observations across the four LSMS countries. We run separate trials for
increasing percentages of the pooled dataset (e.g., the x-axis value of 60 indicates all surveyed
communities below the 60th percentile of consumption are included). The bottom figure
shows similar $r^2$ values for estimating asset index.
When trained on pooled consumption or asset observations across all countries,
our models perform significantly and consistently better than the state-of-the-art
method (see Figure 5.2). We see an improvement of more than 100% in $r^2$ for
asset index predictions for regions below the 2x poverty line, which is set at
$1.9 per person per day by the World Bank. This observation agrees with our
intuition that extremely poor communities depend most heavily on crop pro-
duction. The modest improvements in consumption prediction can be partly
explained by the fact that consumption data is noisier [101, 217].
In the second experiment, we study whether temporal changes in NDVI are in-
dicative of poverty changes. Because surveys from different years are generally
conducted at different locations, we limit this experiment to 209 communities in
[Figure panels: top, Log Consumption ($/person/day) against Community Index for 2011 Ground Truth, 2013 Prediction, and 2013 Ground Truth; bottom, RMSE against Poorest Percent of Surveyed Communities Used for 2013 Prediction and 2011 Ground Truth.]
Figure 5.3: Consumption predictions for LSMS communities in Uganda made by a ran-
dom forest model trained on 2011 data and tested on 2013 data. The top figure shows
the ground-truth consumption along with predictions for LSMS communities ordered
by 2011 data. The bottom figure shows RMSE values of the predictions for increasing
percentages of the LSMS communities (e.g., the x-axis value of 60 indicates all commu-
nities below the 60th percentile in 2011 consumption are included).
Uganda that are part of both the 2011-2012 and 2013-2014 LSMS surveys. As the
2011-2012 East Africa drought affected a large area of Uganda, the consumption
distribution for these communities changes quite significantly between the sur-
veys. Figure 5.3 shows that our random forest model can translate the increase
in annual NDVI from 2011 to 2013 to reflect increased consumption in the poor-
est communities following the drought. In contrast, models that rely on static
inputs such as Jean et al. [101] can only perform as well as the 2011 ground truth
when tested on the 2013 data.
Conclusion. In this section, we have leveraged CNNs to extract features from
NDVI images that are highly predictive of poverty. We demonstrate that pub-
licly available, moderate-resolution NDVI can predict poverty as well as high-
resolution images constrained by Google’s licensing terms. Our model based on
NDVI can also produce dynamic poverty estimates, potentially helping policy-
makers make more informed and timely decisions.
5.2 Deep Denoising for Scientific Discovery2
Despite significant advances in imaging technology [134, 158, 268], scientific
images are still often corrupted by noise during signal generation or detection,
which requires denoising procedures to restore information for scientific discov-
ery. While deep denoising models have been tremendously successful on natu-
ral images [43, 262], the potential of these techniques has barely been explored
in the context of scientific imaging, where in contrast to traditional denoising
setups, labeled datasets are typically not available in large quantities.
To address this issue, we propose a simulation-based denoising (SBD) frame-
work in which denoising models are trained on simulated images of transmis-
sion electron microscopy (TEM), a powerful technique for probing the atomic-
level structure and composition of a wide range of materials [210, 219]. Our
contributions are threefold. First, we propose an architecture for simulation-
based data that outperforms existing techniques by a wide margin on held-out
simulated data as well as on real TEM measurements. Second, we demon-
strate that standard performance metrics for photographs often fail to produce a
scientifically-meaningful evaluation of the denoising results, and propose new
scientifically-motivated metrics. Third, we propose a likelihood-based visual-
ization of the agreement between the observed measurements and structures of
interest, such as atomic columns, in the denoised image.
2Joint work with Sreyas Mohan, Ramon Manzorro, Joshua L. Vincent, David S. Matteson,
Peter A. Crozier, and Carlos Fernandez-Granda.
5.2.1 Related Work
Denoising in Scientific Imaging. A wide variety of denoising methods have
been applied across different scientific imaging modalities, including traditional
linear filters [165], nonlinear filters [104, 160, 221], wavelet-based methods [36,
159, 179, 267], and sparsity-based approaches [19, 159]. Several works address
other scientific domains, including low-dose computed tomography [112],
positron-emission tomography [76], and scintillation-camera data [161]. While [60, 74,
225] apply CNNs to denoise simulated electron microscopy data without vali-
dating on real data, [153] trains CNNs to denoise Raman scattering microscopy
data, using measurements gathered at a higher signal-to-noise ratio (SNR) as
ground-truth images. These results showcase the potential of deep denoising
for scientific imaging, but also the challenge of gathering adequate datasets to
train the deep networks.
Deep Learning for TEM. Deep models have been applied to other image-
processing tasks in TEM beyond denoising (see [61] for a comprehensive re-
view). [218] proposes a CNN-based method for TEM image super-resolution,
wherein CNNs are trained on pairs of low-resolution and high-resolution im-
ages acquired experimentally. [93] applies CNNs to perform segmentation and
systematically studies the influence of the design of the training dataset and
network architecture on the generalization capabilities of these models. In this
work, we provide a similar analysis for denoising. [151] and [184] propose a
CNN-based method to identify structures of interest in TEM images by training
on carefully designed simulated data and show that the model generalizes to
real data. Our work provides further evidence that CNNs trained on simulated
data can generalize effectively to real measurements.
5.2.2 Methodology
Simulation-Based Denoising. Simulation-based denoising (SBD) consists of
three stages: simulation of the training set, training of the CNNs using the sim-
ulated data, and inference on the real data (see Figure 5.4). In order to generate
the training set, we simulate clean images $x_1, \dots, x_N \in \mathbb{R}^M$ (where $M$ is the
number of pixels) according to a predefined physical model. These clean im-
ages are then corrupted using a noise model, which can follow a predefined
model or be learned from the data, to generate the simulated noisy data.
Learning Objective. Let $Y(x_i)$ denote the random vector representing the
noisy image corresponding to the clean simulated image $x_i$, and let $y(x_i)$
represent a realization of $Y(x_i)$. The denoising model $f_\theta : \mathbb{R}^M \to \mathbb{R}^M$, parametrized
by the weights $\theta$, aims to minimize a loss function $L : \mathbb{R}^M \times \mathbb{R}^M \to \mathbb{R}$
which quantifies how close the estimate from the CNN $f_\theta(y(x_i))$ is
to the clean image $x_i$. In our case study, we use mean squared error, which is
a standard choice in CNN-based denoising [262]. More concretely, during the
training stage, we compute the parameters by solving
$$\hat{\theta} = \arg\min_{\theta} \mathbb{E}\left[\sum_{i=1}^{N} L(f_\theta(Y(x_i)), x_i)\right] = \arg\min_{\theta} \mathbb{E}\left[\sum_{i=1}^{N} \| f_\theta(Y(x_i)) - x_i \|_2^2\right], \tag{5.1}$$
where the expectation is taken over the noise model and approximated by draw-
ing new realizations of the noisy image Y (xi) every time we compute the gradi-
ent. Once the network is trained, it can be directly applied to new noisy images
to perform denoising.
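A minimal sketch of how the expectation in Equation 5.1 can be approximated by redrawing the noise at every evaluation is given below. Poisson noise is assumed here because the noise in our case study is approximately Poisson; the function names and the identity "denoiser" are placeholders for illustration and stand in for the network $f_\theta$.

```python
import numpy as np


def sample_noisy(clean, rng):
    """Draw one realization y(x) of the noisy image Y(x); Poisson noise assumed."""
    return rng.poisson(clean).astype(float)


def monte_carlo_loss(denoiser, clean_images, rng):
    """One-sample estimate of the objective in Equation 5.1."""
    total = 0.0
    for clean in clean_images:
        noisy = sample_noisy(clean, rng)  # fresh noise on every call
        total += np.sum((denoiser(noisy) - clean) ** 2)
    return total


def identity(noisy):
    """Placeholder for the network f_theta."""
    return noisy


rng = np.random.default_rng(0)
clean_images = [np.full((8, 8), 0.45), np.full((8, 8), 2.0)]
loss_a = monte_carlo_loss(identity, clean_images, rng)
loss_b = monte_carlo_loss(identity, clean_images, rng)  # uses new noise draws
```

In a training loop, the gradient of such a one-sample estimate with respect to the network weights would be used for each optimizer step.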
Exploiting Non-Local Signal Structure. While current state-of-the-art net-
works for denoising photographic images have very small fields of view, the
TEM images in our case study exhibit very prominent global regularities, due
to periodicity in the atomic structure of the imaged materials. In addition,
electron-microscopy images are often measured at very low SNRs. Hence, we
propose to denoise TEM data using UNet network architectures [198] with very
large fields of view: 221 × 221 pixels and 893 × 893 pixels. Table 5.2 compares
the influence of the field of view in denoising photographic and TEM images.
For photographic images the performance of the network remains almost con-
stant as we increase the field of view. In contrast, for TEM images increasing
the field of view produces a dramatic improvement in performance (6 dB and
10 dB, when the field of view is 221×221 and 893×893 respectively). Increasing
the number of parameters, while keeping the field of view constant, has a very
modest effect, which suggests that the increase in field of view is the reason for
the improvement.
(a) TEM Images
MODEL                              Parameters   Field of View   PSNR            SSIM
SBD + DnCNN [262]                  668K         41 × 41         30.47 ± 0.64    0.93 ± 0.01
SBD + Small UNet [263]             233K         45 × 45         30.87 ± 0.56    0.93 ± 0.01
SBD + UNet (32 base channels)      352K         221 × 221       36.39 ± 0.77    0.98 ± 0.01
SBD + UNet (64 base channels)      1.41M        221 × 221       37.24 ± 0.76    0.99 ± 0.01
SBD + UNet (128 base channels)     5.61M        221 × 221       38.05 ± 0.81    0.99 ± 0.01
SBD + UNet (128 base channels)     70.15M       893 × 893       42.87 ± 1.45    0.99 ± 0.01
(b) Photographic Images
                                       PSNR                            SSIM
MODEL   Parameters   Field of View     σ = 30          σ = 70          σ = 30         σ = 70
UNet    102K         49 × 49           29.67 ± 2.84    26.16 ± 2.79    0.83 ± 0.06    0.70 ± 0.09
UNet    352K         221 × 221         29.65 ± 2.76    26.08 ± 2.68    0.83 ± 0.05    0.70 ± 0.08
UNet    4.4M         893 × 893         29.54 ± 2.82    26.07 ± 2.80    0.83 ± 0.06    0.70 ± 0.09
Table 5.2: Field of view of CNN architectures and performance. Mean PSNR and SSIM
(± standard deviation) of different CNN architectures on the (a) held-out simulated test
set of TEM data described in Section 5.2.4 and (b) validation set of the DIV2K photo-
graphic image dataset [2].
[Figure diagram. Simulation: 3D atomic model (Pt, CeO2), 2D projection, simulated images, simulated noise. Training: simulated data, noisy data, convolutional neural network, denoised data. Inference: real noisy data, trained network, denoised real data, likelihood map.]
Figure 5.4: Simulation-based denoising framework. (Top) A training dataset is gener-
ated by simulating TEM images of different structures at varying imaging conditions.
(Middle) A CNN is trained using the simulated images, paired with noisy counterparts
obtained by simulating the relevant noise process. (Bottom) The trained CNN is applied
to real data to yield a denoised image, and a likelihood map is generated to quantify
the agreement between this structure and the noisy data.
Likelihood Maps. In most applied domains, the goal of denoising is to un-
cover image structure of scientific interest. In our case study, this corresponds to
the location and intensity of projected columns of atoms in a catalytic nanopar-
ticle that is surrounded by a vacuum. Quantifying to what extent such structure
is consistent with the observed measurements is therefore of great interest. We
[Figure panels: (a) Data, (b) Denoised image, (c) Likelihood map, (d) Zoomed.]
Figure 5.5: Likelihood map. When the simulated noisy image in (a) is denoised using
the proposed framework (b), a spurious atom appears at the left edge of the nanoparti-
cle (see zoomed image (d)). The value of the likelihood map (c) at that location is very
low, indicating that the presence of an atom is less consistent with the observed data
than its absence.
propose to achieve this by computing the likelihood of the data with respect to
meaningful features identified in the denoised image. The general procedure and its
implementation in our case study are as follows:
1. Identify a region of interest R. In our case, we locate atomic columns using
blob detection [139].
2. Fit a low-dimensional model to the denoised image within the region of
interest. We assume that the intensity of each atomic column and the
vacuum are constant and perform averaging over all denoised pixels in R.
3. Compute the likelihood of the noisy data in R with respect to the
estimated pixel values. In our case, the noise is approximately i.i.d. Poisson,
so the likelihood is given by
$$L(R) := \prod_{i \in R} p_{x_i}(y_i), \tag{5.2}$$
where $y_i$ denotes the noisy value in the ith pixel, and $p_{x_i}$ is a Poisson
probability mass function (pmf) with rate parameter $x_i$.
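Steps 2 and 3 above can be sketched as follows for a single region. The function names and the toy pixel values are hypothetical, and the sketch works with the logarithm of L(R), a choice made here to avoid numerical underflow when multiplying many small probabilities. It assumes the estimated rate is strictly positive.

```python
import math


def poisson_log_pmf(y, rate):
    """Log of the Poisson pmf with the given rate, evaluated at count y."""
    return y * math.log(rate) - rate - math.lgamma(y + 1)


def region_log_likelihood(noisy_counts, denoised_values):
    """Log of L(R) in Equation 5.2 under a constant-intensity model for R."""
    # Step 2: a single intensity for the region, the mean of the denoised pixels.
    rate = sum(denoised_values) / len(denoised_values)
    # Step 3: i.i.d. Poisson likelihood of the noisy counts, summed in log space.
    return sum(poisson_log_pmf(y, rate) for y in noisy_counts)


noisy_counts = [0, 1, 0, 2]             # hypothetical electron counts in R
denoised_values = [0.5, 1.0, 0.5, 1.0]  # hypothetical denoised pixels in R
log_likelihood = region_log_likelihood(noisy_counts, denoised_values)
```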
[Figure panels: (a) Data, (b) Spot Filter, (c) PURE-LET, (d) SBD, (e) Likelihood Map.]
Figure 5.6: Denoising results for real data. (a) An experimentally-acquired atomic-
resolution transmission electron microscope image of a CeO2-supported Pt nanopar-
ticle. The average image intensity is 0.45 electrons/pixel (i.e., a large fraction of pixels
register zero electrons), which results in an extremely low signal-to-noise ratio. (b) De-
noised image obtained via Fourier-based filtering by a domain expert. (c) Denoised
image obtained via the wavelet-based PURE-LET method [147]. (d) Denoised image
obtained by the proposed simulation-based denoising (SBD) framework. (e) Likelihood
map quantifying to what extent the atomic structure identified from the SBD denoised
image is consistent with the data. Regions in red are more likely to correspond to atomic
columns in the nanoparticle. Regions in blue are more likely to belong to the vacuum.
Figure 5.6 and Figure 5.5 show likelihood maps for the real data and for a
simulated example. In the simulated example (Figure 5.5), a spurious atom is
detected at the left end of the zoomed region. However, the value of the likelihood
map at that location is very low, which indicates that the presence of an atom is not
consistent with the observed data at that location. Figure 5.7 shows the
distribution of the log-likelihood ratio of over 25,000 regions of interest extracted from
the surface of over 1,550 denoised images obtained from the dataset. As shown
in Figure 5.7, the log-likelihood ratio values of spurious atoms (false positives,
[Figure panels: (a) False positives, (b) True positives, (c) False negatives. y-axis: Log Likelihood Ratio.]
Figure 5.7: Distribution of likelihood ratio. The figure shows the distribution of
log-likelihood ratio of over 25,000 regions of interest computed from the surface of 1,550
denoised images using the dataset. The regions containing spurious atoms (false
positives, (a)) have a much lower log-likelihood ratio than the regions containing accurately
recovered atoms (true positives, (b)). Regions where existing atoms were not detected
(false negatives, (c)) have a higher log-likelihood ratio, comparable to that of the
regions with accurately recovered atoms. The occurrence of missing and spurious atoms
in denoised images is quite rare: out of the 25,732 regions of interest, only 2,457 and
2,368 were false positives and false negatives respectively.
(a)) are much lower than those of correctly identified atoms (true positives, (b)).
When the network fails to detect atoms (false negatives, (c)), we observe that
the log-likelihood ratio in such regions tends to be high. It is worth noting that
the occurrence of spurious and missing atoms in the denoised images is quite
rare: out of the 25,732 regions identified, only 2,457 and 2,368 regions
correspond to spurious and missing atoms, respectively.
5.2.3 Datasets
The TEM image data used in this work correspond to images from a widely
utilized catalytic system, which consists of platinum (Pt) nanoparticles supported
on a larger cerium (IV) oxide (CeO2) nanoparticle. This bi-functional catalytic
system is ubiquitously used in clean energy conversion and environmental re-
mediation applications, in addition to a broad range of other chemical reactions
[162, 168, 256]. From a general point of view, this system can be considered as
a model for supported nanoparticle catalysts, since a large number of hetero-
geneous catalysts are based on metallic nanoparticles supported over different
oxides. Thus, results and conclusions extracted from the current work are rel-
evant to a great number of similar samples in the field of catalysis (e.g., oxide
crystals supporting metal nanoparticles).
Real Data. The real data used to test the proposed SBD framework consist
of a series of images of the Pt/CeO2 catalyst. The images were acquired in a
N2 gas atmosphere using an aberration-corrected FEI Titan transmission elec-
tron microscope (TEM), operated at 300 kV and coupled with a Gatan K2 IS
direct electron detector. The detector was operated in electron counting mode
with a time resolution of 0.025 sec/frame and an incident electron dose rate of
5,000 e−/Å2/s. The electromagnetic lens system of the microscope was tuned to
achieve a highly coherent parallel beam configuration with minimal low-order
aberrations (e.g., astigmatism, coma), and a third-order spherical aberration co-
efficient of approximately -13 µm.
Simulation Dataset. The simulated TEM image dataset was generated using
the multi-slice TEM image simulation method, as implemented in the Dr. Probe
software package [15]. Images were simulated with 1024 x 1024 pixels and then
binned to match the approximate pixel size of the experimentally acquired im-
age series. To equate the intensity range of the simulated images with those
acquired experimentally, the intensities of the simulated images were scaled by
a factor which equalized the vacuum intensity in a single simulation to the aver-
age intensity measured over a large area of the vacuum in a single 0.025 second
experimental frame (i.e., 0.45 counts per pixel in the vacuum region).
5.2.4 Experimental Results
Experiment Setups. We use CNNs with the proposed UNet architecture with
128 base channels and 6 scales in all of our experiments. The networks were
trained on 400×400 patches extracted from the training images and augmented
with horizontal flipping, vertical flipping, random rotations between −45° and
+45°, and random resizing by a factor of 0.75-0.82. The models were trained
using the Adam optimizer [114], with a default starting learning rate of $10^{-3}$,
which was reduced by a factor of 2 every time the validation PSNR plateaued.
Training was terminated via early stopping based on validation PSNR. The de-
tails of training, validation and test data for each experiment are provided in
the corresponding section. Since the models are trained on 400 × 400 patches,
when applying them to larger images we divide the images into overlapping
400 × 400 patches, denoise them, and then combine them via averaging.
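The tiling step described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the stride of 200 pixels and the names `patch_starts`, `denoise_tiled` and `denoise_fn` are assumptions, since the text only states that the patches overlap and are averaged.

```python
import numpy as np


def patch_starts(length, patch, stride):
    """Start indices of patches of size `patch` that cover [0, length)."""
    if length <= patch:
        return [0]
    starts = list(range(0, length - patch, stride))
    starts.append(length - patch)  # last patch sits flush with the border
    return starts


def denoise_tiled(image, denoise_fn, patch=400, stride=200):
    """Denoise overlapping patches and average the overlapping outputs.

    `denoise_fn` is a hypothetical callable mapping a 2-D patch to a
    denoised patch of the same shape (e.g. a trained CNN).
    `stride` should not exceed `patch`, otherwise pixels are left uncovered.
    """
    h, w = image.shape
    ph, pw = min(patch, h), min(patch, w)
    out = np.zeros((h, w), dtype=np.float64)
    weight = np.zeros((h, w), dtype=np.float64)
    for top in patch_starts(h, ph, stride):
        for left in patch_starts(w, pw, stride):
            tile = image[top:top + ph, left:left + pw]
            out[top:top + ph, left:left + pw] += denoise_fn(tile)
            weight[top:top + ph, left:left + pw] += 1.0
    return out / weight  # every pixel is covered at least once
```

With an identity `denoise_fn` the function returns the input unchanged, which is a convenient sanity check that the averaging weights are correct.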
PSNR and SSIM. The imaging parameters of the real data are well described
by the white contrast category. We therefore used the subset of the simulated dataset
corresponding to this contrast (5583 images) to compare our proposed method-
ology to other models. 90% of the data were used for training. The remaining
559 images were evenly split into validation and test sets. We compare our pro-
posed UNet architecture with two state-of-the-art architectures for photographic-
image denoising [262, 263], and with several classical denoising methods: low-
pass filtering [165], adaptive Wiener filtering [136], BM3D [152], non-local means
METHODS                          PSNR             SSIM
Raw                               3.56 ± 0.03     0.00 ± 0.00
Low Pass Filter [165]            21.59 ± 0.07     0.44 ± 0.03
Adaptive Wiener Filter [136]     22.42 ± 1.08     0.63 ± 0.02
VST + NLM [30]                   26.55 ± 0.16     0.73 ± 0.01
VST + BM3D [152]                 25.27 ± 0.15     0.80 ± 0.01
PURE-LET [147]                   28.36 ± 0.88     0.93 ± 0.01
SBD + DnCNN [262]                30.47 ± 0.64     0.93 ± 0.01
SBD + Small UNet [263]           30.87 ± 0.56     0.93 ± 0.01
SBD + Proposed Architecture      42.87 ± 1.45     0.99 ± 0.01
Table 5.3: Results on simulated test data. Mean PSNR and SSIM (± standard
deviation) of different denoising methods on the held-out simulated test set
described in Section 5.2.4. SBD approaches achieve the best results. SBD combined
with the proposed architecture outperforms all other techniques by about 12 dB.
The performance of SBD applied to additional architectures is reported in Table 5.2.
[30], and a wavelet-based method known as PURE-LET [147]. For all methods,
hyperparameters were chosen based on the validation data. Performance was
measured in terms of SSIM [241] and peak signal-to-noise ratio (PSNR).
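For reference, PSNR can be computed as in the sketch below. The choice of peak value is an assumption: here it defaults to the maximum of the clean image, and the text does not state which convention was used in the experiments.

```python
import numpy as np


def psnr(clean, estimate, max_value=None):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    clean = np.asarray(clean, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    if max_value is None:
        max_value = clean.max()  # assumed convention, not taken from the text
    mse = np.mean((clean - estimate) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)
```

Because PSNR is logarithmic in the mean squared error, a gap of 12 dB corresponds to a reduction of the MSE by a factor of 10^1.2, or roughly 16.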
The results demonstrate that SBD is an effective denoising methodology for
TEM data. Our proposed CNN outperforms all other methods by a margin
of 12 dB in PSNR on the simulated test data, as shown in Table 5.3. SBD recov-
ers the overall shape of the nanoparticle, the interface between the nanoparticle
and the support, and the different periodic patterns of the CeO2 support and Pt
nanoparticle. Contrast features, such as subtle patterns of bright, intermediate
and dark features associated with the atomic structure of the CeO2 crystal, are
well reproduced in the images denoised via SBD, but are mostly absent from
the results of the baseline approaches.
Metrics Beyond PSNR. Domain scientists denoise images in order to extract
scientifically relevant information. In our case, the atoms on the surface of
Figure 5.8: Performance of SBD in terms of our proposed metrics. We compute all our
proposed metrics on over 7,000 denoised images corresponding to 25 unique noisy
images sampled from the 308 clean images. The empirical distribution on the surface
(red) and bulk (green) is visualized as box plots indicating the median, 25th percentile,
75th percentile, minimum and maximum value of the distribution. SBD has a near perfect
performance in the bulk with all metric values hovering around 1. On the surface, SBD
achieves a median score of 1 for precision and recall, and about 0.95 for F1 score and
Jaccard index.
nanoparticles are of particular interest, because the atomic configuration at the
surface regulates the nanoparticle’s ability to catalyze chemical reactions. It is
therefore of critical importance to understand how different denoising methods
recover these atoms. We can verify visually that SBD achieves a largely success-
ful recovery in held-out simulated data, whereas the baseline methods do not.
However, visual inspection is a limited and non-quantitative evaluation tool.
Unfortunately, standard metrics like PSNR and SSIM are insensitive to changes
in the atomic structure of the nanoparticle surface, because these changes have
a small effect on the overall intensity of the images.
To define metrics that evaluate detection of surface atoms, we apply a blob de-
tection algorithm, e.g. Laplacian of Gaussian [139], to locate the centers, and
compute the α-shape of all the atom centers using Delaunay triangulation [180].
We propose the following four metrics to measure the fidelity of the recovered
[Plot: Intensity versus Pixel Coordinate; legend: Noisy Data (40-Frames Average), Denoised Data (40-Frames Average)]
Figure 5.9: Validation on real data. The real data consist of 40 frames which are ap-
proximately stationary and aligned. Their temporal average (left) therefore provides a
reasonable estimate for the true intensity profile. In the image on the right, we compare
the average intensity profile on the surface atomic columns of the platinum nanoparti-
cle for the denoised data (middle) and the temporal average (left). The profiles are very
similar (except for some spurious fluctuations in the temporal average), which suggests
that the proposed approach achieves effective denoising on the real data.
structure: precision, recall, F1 score, and Jaccard index.
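The four metrics can be illustrated with the sketch below, which compares detected atom centers against reference centers. The matching rule (greedy one-to-one matching within a distance tolerance `tol`) and the function names are assumptions made for illustration; the text does not specify how detections are paired with ground-truth atoms, nor how the α-shape is used to separate surface from bulk.

```python
import numpy as np


def match_centers(true_xy, pred_xy, tol):
    """Greedy one-to-one matching; returns (TP, FP, FN) counts."""
    true_xy = np.asarray(true_xy, dtype=float).reshape(-1, 2)
    pred_xy = np.asarray(pred_xy, dtype=float).reshape(-1, 2)
    dists = np.linalg.norm(true_xy[:, None, :] - pred_xy[None, :, :], axis=-1)
    used_true, used_pred = set(), set()
    # Visit candidate pairs from closest to farthest.
    for flat in np.argsort(dists, axis=None):
        i, j = np.unravel_index(flat, dists.shape)
        if dists[i, j] > tol:
            break
        if i not in used_true and j not in used_pred:
            used_true.add(i)
            used_pred.add(j)
    tp = len(used_true)
    return tp, len(pred_xy) - tp, len(true_xy) - tp


def detection_metrics(true_xy, pred_xy, tol=2.0):
    """Precision, recall, F1 score and Jaccard index of the detections."""
    tp, fp, fn = match_centers(true_xy, pred_xy, tol)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else 1.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "jaccard": jaccard}
```

A phantom atom counts as a false positive and lowers precision, while a missed atom counts as a false negative and lowers recall; the Jaccard index penalizes both.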
Performance on Real Data. In the experiments reported earlier in Section 5.2.4,
we used a network trained on all simulated images from the white contrast
category. However, the real data described in Section 5.2.2 correspond more
closely to a subset of white contrast images satisfying the following
conditions: structure limited to PtNP2, thickness between 40 Å and 60 Å, and
defocus between 5 nm and 10 nm. We used 236 images from this subset for training,
and another 15 such images for validation. We also trained two state-of-the-art
architectures for photographic image denoising, DnCNN [262] and DURR [263],
on these data.
Results on real experimental data obtained using SBD trained on this relevant
subset of white contrast are shown in Figure 5.6. SBD produces denoised images
that are of much higher quality than those of the baseline methods described in
Section 5.2.4, which contain obvious artefacts. Further, we validate the denois-
ing results of SBD by comparing to an estimated reference image obtained by
temporal averaging. Our real dataset consists of 40 frames that are approxi-
mately stationary and aligned. Therefore, their temporal average provides a
good estimate for the ground-truth images. As shown in Figure 5.9, the de-
noised intensity values of the atomic column approximately match those of the
estimated reference image.
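The reasoning behind the temporal-average reference can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the frames are perfectly aligned, the noise is modeled as independent Poisson counts, and the constant intensity of 0.45 counts per pixel is borrowed from the vacuum level quoted earlier purely as an example value.

```python
import numpy as np


def temporal_average(frames):
    """Average a stack of aligned frames with shape (T, H, W) over time."""
    frames = np.asarray(frames, dtype=np.float64)
    if frames.ndim != 3:
        raise ValueError("expected a stack of shape (T, H, W)")
    return frames.mean(axis=0)


def rmse(a, b):
    """Root mean squared difference between two images."""
    return float(np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2)))


# Illustrative simulation: 40 noisy frames of a constant-intensity scene.
rng = np.random.default_rng(0)
truth = np.full((32, 32), 0.45)
frames = rng.poisson(truth, size=(40, 32, 32))
reference = temporal_average(frames)
```

For independent noise, averaging T frames reduces the noise standard deviation by a factor of about sqrt(T), so the 40-frame average is a much better estimate of the underlying intensity than any single frame, although it is still noisy.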
In the rest of this section, we compare the performance of SBD and unsuper-
vised denoising techniques on the real experimental data, and analyze the effect
of the design of the training dataset on the denoised output produced by SBD.
Discussion and Conclusions. Our case study is a proof of concept that CNNs
trained on simulated data can be remarkably effective when applied to real
imaging data. It provides several insights and suggests future research direc-
tions that are relevant, beyond electron microscopy, to other domains where the
images of interest can be simulated, such as medical imaging [112, 161], other
types of microscopy [74, 225], or astronomy [177].
We show that the design of the training dataset is critical, so an important ques-
tion is how to design simulated training datasets in a principled systematic way.
Answering it will require a deeper understanding of the generalization ability
of CNNs with respect to variations in the statistics of the input images. We also
demonstrate that architectures tailored to photographic imaging can perform
poorly when applied to other data. Designing CNNs for other domains requires
an understanding of the image features that are exploited for denoising. Gra-
dient visualization is shown to be useful here, but more advanced visualization
techniques are needed. In addition, we demonstrate that standard metrics used
to quantify performance in photographs may not be sensitive to scientifically
relevant features, and propose several new metrics to address this problem. Al-
though SBD outperforms other methods by a large margin, some artefacts such
as phantom atoms still appear. Our proposed likelihood maps help to flag such
events, but may still fail to do so in regions of unusually low SNR. Developing
more sophisticated methods for uncertainty quantification is therefore a key re-
search direction. It would also be of great interest to develop unsupervised or
self-supervised denoising approaches that are effective with small amounts of
data at low SNRs. Finally, to encourage further development of deep-learning
methodologies for scientific imaging, we release a denoising benchmark dataset
of TEM images, containing 18,000 examples.
CHAPTER 6
FINAL REMARKS
In this dissertation, we have studied the problem of probabilistic modeling of
sequential data from three perspectives. In Chapter 2, we consider continual
learning settings where models are often prone to catastrophic forgetting when
exposed to incrementally available data from non-stationary distributions. By
modelling sample similarities and promoting information sharing in the form
of random graphs, we provide an effective mechanism against such a problem.
In Chapter 3 and Chapter 4, we combine state space models and transformer ar-
chitectures for time series forecasting and video prediction, respectively, aiming
to capture complex temporal dynamics and unravel high-dimensional inputs
via latent variable models.
Despite their limited scope, these works demonstrate that capturing temporal
structure within sequential data is critical to downstream applications. As
prediction and anticipation of future events are key components of intelligent
systems, we hope that future work will further explore the topic and realize the
potential of sequential data for positive impact on human society.
APPENDIX A
GRAPH-BASED CONTINUAL LEARNING
Experiment Setup. We perform experiments on six commonly used classifica-
tion datasets: PERMUTED MNIST, ROTATED MNIST [125], SPLIT SVHN [166],
SPLIT CIFAR10 [121], SPLIT CIFAR100 [121], and SPLIT MINIIMAGENET [230].
• PERMUTED MNIST [77] is a variant of the MNIST dataset of handwritten
digits [125], where each task applies a fixed random pixel permutation to
the original dataset. The benchmark dataset consists of 20 tasks, each with
1000 samples from 10 different classes.
• ROTATED MNIST [143] is another variant of the MNIST dataset of hand-
written digits [125], where each task applies a fixed random image rotation
to the original dataset. The benchmark dataset consists of 20 tasks, each
with 1000 samples from 10 different classes.
• SPLIT SVHN is a variant of the SVHN dataset [166] that consists of 5 tasks,
each with two consecutive classes. Since the benchmark dataset is much
more challenging than the MNIST variants, we use all of its 73,257 training
samples (i.e. 14,650 samples per task) to train our model and the baselines.
• SPLIT CIFAR10 is a variant of the CIFAR-10 dataset [121]. Similar to SPLIT
SVHN, the benchmark dataset consists of 5 tasks, each with two consecutive
classes. We use all of its 50,000 training samples (i.e. 10,000 samples
per task) to train our model and the baselines.
• SPLIT CIFAR100 is a variant of the CIFAR-100 dataset [121]. The bench-
mark dataset consists of 20 tasks, each with 5 consecutive classes. We use
all of its 50,000 training samples (i.e. 2,500 samples per task) to train our
model and the baselines.
• SPLIT MINIIMAGENET is a variant of the MINIIMAGENET dataset [230].
The benchmark dataset consists of 20 tasks, each with 5 consecutive classes.
We use all of its 50,000 training samples (i.e. 2,500 samples per task) to
train our model and the baselines. Each image is resized to 84 × 84 pixels.
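The two kinds of benchmark construction described in the list above (a fixed random pixel permutation per task, and tasks formed from groups of consecutive classes) can be sketched as follows. The function names and the use of flattened image arrays are assumptions made for illustration; this is not the data pipeline used in the experiments.

```python
import numpy as np


def make_permuted_tasks(images, num_tasks, seed=0):
    """Permuted-style tasks: each task applies one fixed pixel permutation.

    `images` is an array of flattened images with shape (N, D).
    """
    rng = np.random.default_rng(seed)
    tasks = []
    for _ in range(num_tasks):
        perm = rng.permutation(images.shape[1])  # fixed for the whole task
        tasks.append(images[:, perm])
    return tasks


def make_split_tasks(labels, classes_per_task):
    """Split-style tasks: sample indices grouped by consecutive classes."""
    labels = np.asarray(labels)
    classes = np.unique(labels)  # sorted unique class labels
    tasks = []
    for start in range(0, len(classes), classes_per_task):
        group = classes[start:start + classes_per_task]
        tasks.append(np.flatnonzero(np.isin(labels, group)))
    return tasks
```

For example, `make_split_tasks(labels, 2)` on a 10-class dataset yields 5 tasks of two consecutive classes each, and `make_split_tasks(labels, 5)` on a 100-class dataset yields 20 tasks.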
Model Architectures. As mentioned, while most previous work uses multi-head
architectures and assumes knowledge of task boundaries at test time, we
employ a shared classifier head for all tasks. For the MNIST datasets, the
image encoders $f_{\theta_1}$ (for graph construction) and $f_{\theta_2}$ (for
latent computation) share a multi-layered perceptron with two hidden layers of
256 ReLU neurons, followed by two separate linear mappings, one for each of the
encoders. For SPLIT SVHN, SPLIT CIFAR10, SPLIT CIFAR100, and SPLIT MINIIMAGENET,
the image encoders share a simple convolutional network with the following
structure: conv 64 → conv 64 → maxpool → conv 64 → conv 64 → maxpool → conv 64
→ conv 64 → maxpool, where conv NF is a 3 × 3 convolution with NF output
filters, followed by BatchNorm and a ReLU activation. For all datasets, another
linear mapping follows the image encoder $f_{\theta_1}$ before a Gaussian kernel
computes the similarities between image embeddings. Finally, the classifier head
consists of a ReLU activation and a single linear mapping.
Baseline Architectures. We use the same neural network architectures for all
the baselines described in this chapter: a multi-layered perceptron with two hid-
den layers of 400 ReLU neurons on PERMUTED MNIST and ROTATED MNIST,
following [94], and a ResNet-18 [87] with 20 filters across all layers on other
datasets, following [143]. For all datasets, the baselines have more parameters
than our corresponding models (see Table A.1 for more details).
We adopt the implementations of EWC [117], GEM [143], and MER [195] from
Table A.1: Number of trainable parameters in continual learning models.
Method                 Finetune   EWC      GEM      ER       MER      GCL
SPLIT MNIST            478K       478K     478K     478K     478K     406K
PERMUTED MNIST         478K       478K     478K     478K     478K     406K
ROTATED MNIST          478K       478K     478K     478K     478K     406K
SPLIT SVHN             1.09M      1.09M    1.09M    1.09M    -        326K
SPLIT CIFAR10          1.09M      1.09M    1.09M    1.09M    -        326K
SPLIT CIFAR100         1.09M      1.09M    1.09M    1.09M    -        326K
SPLIT MINIIMAGENET     1.09M      1.09M    1.09M    1.09M    -        343K
the authors’ repositories¹,².
Additional Task-Free Baselines. Despite our attempts to tune the
hyperparameters of MER [195] on SPLIT SVHN and SPLIT CIFAR10, the baseline
does not perform reasonably well. The model uses a batch size of 1 and
requires multiple passes through the episodic memory per batch, so it is much
slower than our model and all other baselines. Due to limited time and
computational resources, we do not investigate the baseline further and, for
fairness, refrain from reporting premature results.
However, we include results for CN-DPM [127], a competitive task-free model
based on Dirichlet process mixture models, in Table A.2. Our setup for SPLIT
CIFAR10 is analogous to that of [127], so we directly quote the numbers for
CN-DPM from the paper. Although CN-DPM performs favorably among task-free
approaches to continual learning, including GSS [8], our model outperforms
CN-DPM by a significant margin, even when using a smaller memory size.
¹ https://github.com/facebookresearch/GradientEpisodicMemory
² https://github.com/mattriemer/mer
Table A.2: GCL results and CN-DPM results with different memory sizes.
                   SPLIT SVHN                       SPLIT CIFAR10
Method             250             500              500             1000
ER [40]            45.51 ± 3.03    57.51 ± 2.77     36.08 ± 1.09    45.75 ± 1.82
CN-DPM [127]       −               −                43.07 ± 0.16    45.21 ± 0.18
GCL (Ours)         60.68 ± 1.67    65.79 ± 1.54     53.87 ± 0.97    57.26 ± 0.28
Memory Usage. Both GCL and ER [40] use an episodic memory to store images
and labels from past tasks. The only additional memory usage of GCL
comes from the context graph G, which is represented by a square matrix whose
entries intuitively describe pairwise similarities between such images. Given a
memory consisting of |M| images of size C × H × W, it only requires |M|^2
floating-point numbers to store the matrix.
Table A.3: Memory usage of ER and GCL for various datasets.
DATASET                |M|     Image Size      ER           GCL
PERMUTED MNIST         1000    1 × 28 × 28     3.284 MB     7.284 MB
ROTATED MNIST          1000    1 × 28 × 28     3.284 MB     7.284 MB
SPLIT CIFAR10          250     3 × 32 × 32     3.109 MB     3.359 MB
SPLIT SVHN             250     3 × 32 × 32     3.109 MB     3.359 MB
SPLIT CIFAR100         500     3 × 32 × 32     6.219 MB     7.199 MB
SPLIT MINIIMAGENET     500     3 × 84 × 84     42.408 MB    43.389 MB
As seen from Table A.3, the memory usage of GCL is very similar to that of ER,
except when both are very small, as in the case of PERMUTED MNIST and ROTATED
MNIST, because (1) continual learning algorithms are often required to use a
very small |M| and (2) the cost of storing natural images is often much higher
than that of the context graph.
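The overhead of the context graph can be estimated with a back-of-the-envelope calculation, sketched below. The assumptions here are 32-bit floats (4 bytes per value) and decimal megabytes; the helper names are hypothetical. Under these assumptions the graph overhead is 4 MB for |M| = 1000 and 0.25 MB for |M| = 250, which agrees with the gap between the ER and GCL columns for the corresponding rows of Table A.3. The image-storage estimate is rougher and is not meant to reproduce the ER column exactly.

```python
def graph_overhead_mb(memory_size, bytes_per_float=4):
    """Storage for an |M| x |M| similarity matrix, in decimal megabytes."""
    return memory_size ** 2 * bytes_per_float / 1e6


def image_storage_mb(memory_size, image_shape, bytes_per_value=4):
    """Rough storage for |M| images of shape (C, H, W), labels ignored."""
    c, h, w = image_shape
    return memory_size * c * h * w * bytes_per_value / 1e6


print(graph_overhead_mb(1000))  # MNIST variants
print(graph_overhead_mb(250))   # SPLIT CIFAR10 / SPLIT SVHN
```

The same helpers show the quadratic term overtaking the linear one for large memories: with |M| = 5000 and 3 × 32 × 32 images, the graph needs 100 MB while the images need about 61 MB.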
As the number of tasks increases, it is perhaps essential to expand the episodic
Figure A.1: Average accuracy as a function of the number of tasks trained on PERMUTED
MNIST, ROTATED MNIST, SPLIT SVHN, and SPLIT CIFAR10. [Four panels, one per dataset;
methods compared: Finetune, EWC, GEM, ER, MER, GCL.]
memory, in which case the quadratic growth of the context graph might dominate
the linear growth of the image storage (e.g. |M| = 5000 and images are of size 3×32×32).
Although we have not practically encountered such a problem with GCL, we
note that the quadratic growth of the number of entries in the context graph can
be reduced to a linear growth in memory requirements. More specifically, each
entry is the output of the kernel function $\kappa_\tau$ (see Section 3), e.g.
$\kappa_\tau(u_i, u_j) = \exp\left(-\frac{\tau}{2}\|u_i - u_j\|_2^2\right)$, so we
could easily store the $|M|$ intermediate embeddings $\{u_i\}$ at each step and
apply the kernel function on the fly, which is especially beneficial when the
$u_i$ are much lower dimensional than the original images.
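Computing the graph on the fly from stored embeddings can be sketched as below. The kernel form exp(−(τ/2)‖u_i − u_j‖²) is an assumption, read off the example kernel in the paragraph above, and the function name is hypothetical.

```python
import numpy as np


def gaussian_kernel_matrix(embeddings, tau):
    """Pairwise similarities exp(-tau / 2 * ||u_i - u_j||^2).

    `embeddings` has shape (|M|, d); only these |M| * d values need to be
    stored, and the |M| x |M| matrix is recomputed when required.
    """
    u = np.asarray(embeddings, dtype=np.float64)
    sq_norms = np.sum(u ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (u @ u.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)  # guard against round-off
    return np.exp(-0.5 * tau * sq_dists)
```

Storing the embeddings costs |M| · d values instead of |M|², so the saving is largest when the embedding dimension d is much smaller than |M|.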
Additional Experiment Results.
Figure A.2: Average accuracy as a function of numbers of samples in the episodic
memory on SPLIT SVHN and SPLIT CIFAR10. [Two panels, one per dataset; x-axis: Memory
Size (100, 250, 500, 1000); y-axis: Average Accuracy (%); methods: ER, GCL.]
Figure A.3: Average forgetting as a function of numbers of samples in the episodic
memory on SPLIT SVHN and SPLIT CIFAR10. [Two panels, one per dataset; x-axis: Memory
Size (100, 250, 500, 1000); y-axis: Average Forgetting (%); methods: ER, Ours.]
APPENDIX B
PROBABILISTIC TRANSFORMER
Time Series Datasets. Following [190, 201], we run experiments on 5 datasets
for time series forecasting, including SOLAR [124], ELECTRICITY¹, TRAFFIC²,
TAXI³, and WIKIPEDIA⁴. Table B.1 includes more details about the datasets.
For human motion prediction, we run experiments on two datasets, namely,
Human3.6M [96] and HumanEva-I [208], following [257]. As described in Section
3.5, Human3.6M is a large-scale dataset with 11 subjects performing 15 actions,
totaling 3.6 million video frames recorded at 50 Hz. To be consistent with
previous work [155, 257], we adopt a 17-joint skeleton, train on 5 subjects
(S1, S5, S6, S7, S8), and test on two subjects (S9, S11). For HumanEva-I, we
adopt a 15-joint skeleton and use the same training and test split provided in
the dataset. As in [257], we predict 2 seconds of future motion conditioned on
0.5 seconds of observed motion for Human3.6M, and 1 second of future motion
conditioned on 0.25 seconds of observed motion for HumanEva-I.
¹ https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014
² https://archive.ics.uci.edu/ml/datasets/PEMS-SF
³ https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
⁴ https://github.com/mbohlkeschneider/gluon-ts/tree/mv_release/datasets
Table B.1: Dimension, domain, frequency, total training timesteps and prediction length
properties of the training datasets used in the experiments.
DATASET        DIMENSION   DOMAIN    FREQUENCY   TIMESTEPS   PREDICTION
SOLAR          137         R+        Hour        7,009       24
ELECTRICITY    370         R+        Hour        5,790       24
TRAFFIC        963         (0, 1)    Hour        10,413      24
TAXI           1,214       N         30-Min      1,488       24
WIKIPEDIA      2,000       N         Day         792         30
Table B.2: Test set NRMSEsum and NDsum of time series models (lower is better). The
means and standard deviations are computed over 5 runs using different seeds.
                          SOLAR                            ELECTRICITY                    TRAFFIC
Method                    NRMSEsum        NDsum            NRMSEsum       NDsum           NRMSEsum       NDsum
Transformer-MAF [190]     0.634 ± 0.034   0.323 ± 0.031    0.039 ± 0.00   0.030 ± 0.00    0.363 ± 0.00   0.301 ± 0.02
TimeGrad [189]            0.715 ± 0.046   0.399 ± 0.023    0.039 ± 0.00   0.026 ± 0.00    0.073 ± 0.00   0.055 ± 0.00
ProTran (Ours)            0.579 ± 0.050   0.317 ± 0.027    0.030 ± 0.00   0.022 ± 0.00    0.046 ± 0.01   0.031 ± 0.00
Based on the dataset descriptions of previous work, we assume that they were
obtained and curated appropriately with consent from the people concerned and
that they contain no personally identifiable information or offensive content.
Time Series Forecasting Results. In addition to CRPSsum reported in Section
3.5, we also include experiment results for time series forecasting using two
other metrics, namely normalized root mean squared error (NRMSEsum) and nor-
malized deviation (NDsum), in Table B.2. As in [5], we define NRMSEsum as the
root mean squared error normalized by the absolute values of targets summed
across all time series. NDsum is defined as the mean absolute error between
predicted values and targets summed across all time series.
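The two metrics can be sketched as follows. The details are assumptions rather than the exact evaluation code: both series and predictions are first summed across the series dimension, NRMSE divides the root mean squared error by the mean absolute target, and ND divides the total absolute error by the total absolute target, which is my understanding of the convention in the evaluator of [5]. The function names are hypothetical.

```python
import numpy as np


def _sum_over_series(values):
    """Sum an array of shape (num_series, num_timesteps) over the series."""
    return np.asarray(values, dtype=np.float64).sum(axis=0)


def nrmse_sum(targets, predictions):
    """RMSE of the summed series, normalized by the mean absolute target."""
    y = _sum_over_series(targets)
    y_hat = _sum_over_series(predictions)
    rmse = np.sqrt(np.mean((y - y_hat) ** 2))
    return float(rmse / np.mean(np.abs(y)))


def nd_sum(targets, predictions):
    """Total absolute error of the summed series over total absolute target."""
    y = _sum_over_series(targets)
    y_hat = _sum_over_series(predictions)
    return float(np.sum(np.abs(y - y_hat)) / np.sum(np.abs(y)))
```

Both quantities are scale-free, which makes them comparable across datasets whose targets differ by orders of magnitude.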
Consistent with the results in Section 3.5, our models perform significantly bet-
ter than Transformer-MAF [190] and TimeGrad [189], two competitive baselines
proposed recently.
Human Motion Prediction Results. Figure B.1 shows that given the same
contexts consisting of fixed conditioning pose sequences, our model generates
diverse yet sensible pose sequences. The variations in predictions stem from the
stochasticity induced by our latent variables at different time steps.
Model Architectures. As described in Section 3.5, our models are based on
transformer architectures [226] with extensive use of attention modules. For all
[Figure panels: Context, Sample 1, Sample 2, Sample 3, Sample 4, Sample 5]
Figure B.1: Conditioning pose sequences in green and corresponding predictions in red
by ProTran. Solid colors indicate later time-steps and faded ones are older.
experiments, we use a single linear layer to map inputs into fixed-size
representations in $\mathbb{R}^{128}$ or $\mathbb{R}^{256}$ (see Equation 3.8).
We use multihead attention with 8 heads in Equation 3.9, Equation 3.10, and
Equation 3.14 to model temporal interactions between latent variables,
dependencies on conditional inputs, and interactions of all inputs in posterior
distributions. The MLPs in Equation 3.11 and Equation 3.15 as well as the final
MLP that maps latent variables to outputs consist of 2 layers each with ReLU or
Tanh activations. We use fixed positional embeddings as in [226] (see Equation
3.8 and Equation 3.12). The LayerNorms in Equation 3.8, Equation 3.9, Equation
3.10, and Equation 3.12 all have learnable parameters with $\epsilon = 10^{-5}$.
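As an illustration of fixed positional embeddings, the sketch below builds the sinusoidal table commonly used in transformer models. It rests on two assumptions: that [226] refers to the original transformer paper and that "fixed" means this sinusoidal form. The function name and the base of 10000 follow that common convention and are not taken from the text.

```python
import numpy as np


def sinusoidal_positions(num_positions, dim, base=10000.0):
    """Fixed (non-learned) positional embeddings of shape (num_positions, dim).

    Even columns hold sines and odd columns hold cosines, with
    geometrically decreasing frequencies across the embedding dimension.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    positions = np.arange(num_positions, dtype=np.float64)[:, None]
    freqs = np.exp(-np.log(base) * np.arange(0, dim, 2) / dim)
    table = np.zeros((num_positions, dim))
    table[:, 0::2] = np.sin(positions * freqs)
    table[:, 1::2] = np.cos(positions * freqs)
    return table
```

Such a table would be added to the linearly projected inputs (of dimension 128 or 256 here) so that attention layers, which are otherwise order-agnostic, can distinguish time steps.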
For time series forecasting, we also employ a learnable embedding layer, the
outputs of which are concatenated with the lagged inputs as in [190]. Our ob-
jective function (see Equation 3.20) has an L1 reconstruction loss in most cases,
except for the TRAFFIC dataset, in which case we replace it with binary cross
entropy and enforce outputs to be in the [0, 1] domain.
Table B.3: Number of parameters of Transformer-MAF [190], TimeGrad [189], and
ProTran (our model) used in the time-series forecasting experiments.
DATASET             SOLAR      ELECTRICITY   TRAFFIC      TAXI         WIKIPEDIA
Transformer-MAF     290,181    532,734       1,150,047    1,333,706    2,229,500
TimeGrad            116,959    300,216       1,010,691    1,126,974    3,099,501
ProTran             342,418    464,292       844,998      695,612      1,510,496
Table B.3 and Table B.4 detail the numbers of parameters of our models in the
time series and human motion experiments, respectively. In all cases, our model
sizes are comparable to or smaller than those of the baselines.
Table B.4: Number of parameters of DLow [257], its conditional VAE model, and
ProTran (our model) used in the human-motion prediction experiments.
DATASET     HUMAN3.6M    HUMANEVA-I
CVAE        725,292      717,174
DLow        2,763,820    2,753,398
ProTran     1,166,704    1,163,626
BIBLIOGRAPHY
[1] Alessandro Achille, Tom Eccles, Loic Matthey, Chris Burgess, Nicholas
Watters, Alexander Lerchner, and Irina Higgins. Life-long disentangled
representation learning with cross-domain latent homologies. In Advances
in Neural Information Processing Systems, pages 9873–9883, 2018.
[2] Eirikur Agustsson and Radu Timofte. Ntire 2017 challenge on single im-
age super-resolution: Dataset and study. In The IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR) Workshops, July 2017.
[3] Ahmed Alaa and Mihaela van der Schaar. Attentive state-space modeling
of disease progression. 2019.
[4] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robic-
quet, Li Fei-Fei, and Silvio Savarese. Social lstm: Human trajectory pre-
diction in crowded spaces. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 961–971, 2016.
[5] Alexander Alexandrov, Konstantinos Benidis, Michael Bohlke-Schneider,
Valentin Flunkert, Jan Gasthaus, Tim Januschowski, Danielle C Maddix,
Syama Rangapuram, David Salinas, Jasper Schulz, et al. Gluonts: Prob-
abilistic and neural time series modeling in python. Journal of Machine
Learning Research, 21(116):1–6, 2020.
[6] Rahaf Aljundi, Eugene Belilovsky, Tinne Tuytelaars, Laurent Charlin,
Massimo Caccia, Min Lin, and Lucas Page-Caccia. Online continual learn-
ing with maximal interfered retrieval. In Advances in Neural Information
Processing Systems, pages 11849–11860, 2019.
[7] Rahaf Aljundi, Klaas Kelchtermans, and Tinne Tuytelaars. Task-free con-
tinual learning. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 11254–11263, 2019.
[8] Rahaf Aljundi, Min Lin, Baptiste Goujaud, and Yoshua Bengio. Gradient
based sample selection for online continual learning. In Advances in Neural
Information Processing Systems, pages 11816–11825, 2019.
[9] Evan Archer, Il Memming Park, Lars Buesing, John Cunningham, and
Liam Paninski. Black box variational inference for state space models.
arXiv preprint arXiv:1511.07367, 2015.
[10] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normal-
ization. arXiv preprint arXiv:1607.06450, 2016.
[11] Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H Camp-
bell, and Sergey Levine. Stochastic variational video prediction. arXiv
preprint arXiv:1710.11252, 2017.
[12] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural ma-
chine translation by jointly learning to align and translate. arXiv preprint
arXiv:1409.0473, 2014.
[13] Abhijit Banerjee, Esther Duflo, Nathanael Goldberg, Dean Karlan, Robert
Osei, William Parienté, Jeremy Shapiro, Bram Thuysbaert, and Christo-
pher Udry. A multifaceted program causes lasting progress for the very
poor: Evidence from six countries. Science, 348(6236):1260799, 2015.
[14] Emad Barsoum, John Kender, and Zicheng Liu. Hp-gan: Probabilistic 3d
human motion prediction via gan. In Proceedings of the IEEE conference on
computer vision and pattern recognition workshops, pages 1418–1427, 2018.
[15] Juri Barthel. Dr. probe: A software for high-resolution stem image simu-
lation. Ultramicroscopy, 193:1–11, 2018.
[16] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-
Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti,
David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational in-
ductive biases, deep learning, and graph networks. arXiv preprint
arXiv:1806.01261, 2018.
[17] Luc Bauwens, Sébastien Laurent, and Jeroen VK Rombouts. Multivariate
garch models: a survey. Journal of applied econometrics, 21(1):79–109, 2006.
[18] Justin Bayer and Christian Osendorfer. Learning stochastic recurrent net-
works. In NIPS 2014 Workshop on Advances in Variational Inference, 2014.
[19] Simon Beckouche, Jean-Luc Starck, and Jalal Fadili. Astronomical image
denoising using dictionary learning. Astronomy & Astrophysics, 556:A132,
2013.
[20] Kathleen Beegle, Luc Christiaensen, Andrew Dabalen, and Isis Gaddis.
Poverty in a rising Africa. The World Bank, 2016.
[21] Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V
Le. Attention augmented convolutional networks. In Proceedings of
the IEEE/CVF international conference on computer vision, pages 3286–3295,
2019.
[22] Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-
document transformer. arXiv preprint arXiv:2004.05150, 2020.
[23] Konstantinos Benidis, Syama Sundar Rangapuram, Valentin Flunkert,
Bernie Wang, Danielle Maddix, Caner Turkmen, Jan Gasthaus, Michael
Bohlke-Schneider, David Salinas, Lorenzo Stella, et al. Neural forecast-
ing: Introduction and literature overview. arXiv preprint arXiv:2004.10240,
2020.
[24] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time
attention all you need for video understanding? arXiv preprint
arXiv:2102.05095, 2021.
[25] Apratim Bhattacharyya, Bernt Schiele, and Mario Fritz. Accurate and di-
verse sampling of sequences based on a “best of many” sample objective.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recog-
nition, pages 8485–8493, 2018.
[26] Mikołaj Bińkowski, Jeff Donahue, Sander Dieleman, Aidan Clark, Erich
Elsen, Norman Casagrande, Luis C Cobo, and Karen Simonyan. High
fidelity speech synthesis with adversarial networks. arXiv preprint
arXiv:1909.11646, 2019.
[27] Joshua Blumenstock, Gabriel Cadamuro, and Robert On. Predicting
poverty and wealth from mobile phone metadata. Science, 350(6264):1073–
1076, 2015.
[28] George EP Box, Gwilym M Jenkins, Gregory C Reinsel, and Greta M
Ljung. Time series analysis: forecasting and control. John Wiley & Sons,
2015.
[29] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Ka-
plan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sas-
try, Amanda Askell, et al. Language models are few-shot learners. arXiv
preprint arXiv:2005.14165, 2020.
[30] Antoni Buades, Bartomeu Coll, and J-M Morel. A non-local algorithm for
image denoising. In 2005 IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR’05), volume 2, pages 60–65. IEEE,
2005.
[31] Judith Butepage, Michael J Black, Danica Kragic, and Hedvig Kjellstrom.
Deep representation learning for human motion prediction and classifi-
cation. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 6158–6166, 2017.
[32] Lucas Caccia, Eugene Belilovsky, Massimo Caccia, and Joelle Pineau. On-
line learned continual compression with stacked quantization module.
arXiv preprint arXiv:1911.08019, 2019.
[33] Defu Cao, Yujing Wang, Juanyong Duan, Ce Zhang, Xia Zhu, Conguri
Huang, Yunhai Tong, Bixiong Xu, Jing Bai, Jie Tong, et al. Spectral tempo-
ral graph neural network for multivariate time-series forecasting. arXiv
preprint arXiv:2103.07719, 2021.
[34] Real Carbonneau, Kevin Laframboise, and Rustam Vahidov. Application
of machine learning techniques for supply chain demand forecasting. Eu-
ropean Journal of Operational Research, 184(3):1140–1154, 2008.
[35] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Sla-
womir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva
Ramanan, et al. Argoverse: 3d tracking and forecasting with rich maps.
In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), pages 8740–8749. IEEE Computer Society, 2019.
[36] S Grace Chang, Bin Yu, and Martin Vetterli. Adaptive wavelet threshold-
ing for image denoising and compression. IEEE Trans. Image Processing,
9(9):1532–1546, 2000.
[37] Arslan Chaudhry, Puneet K Dokania, Thalaiyasingam Ajanthan, and
Philip HS Torr. Riemannian walk for incremental learning: Understand-
ing forgetting and intransigence. In Proceedings of the European Conference
on Computer Vision (ECCV), pages 532–547, 2018.
[38] Arslan Chaudhry, Albert Gordo, Puneet K Dokania, Philip Torr, and
David Lopez-Paz. Using hindsight to anchor past knowledge in continual
learning. arXiv preprint arXiv:2002.08165, 2020.
[39] Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mo-
hamed Elhoseiny. Efficient lifelong learning with a-gem. arXiv preprint
arXiv:1812.00420, 2018.
[40] Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Tha-
laiyasingam Ajanthan, Puneet K Dokania, Philip HS Torr, and
Marc’Aurelio Ranzato. Continual learning with tiny episodic memories.
arXiv preprint arXiv:1902.10486, 2019.
[41] Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi,
and William Chan. Wavegrad: Estimating gradients for waveform gener-
ation. arXiv preprint arXiv:2009.00713, 2020.
[42] Xi Chen and William D Nordhaus. Using luminosity data as a proxy
for economic statistics. Proceedings of the National Academy of Sciences,
108(21):8589–8594, 2011.
[43] Yunjin Chen and Thomas Pock. Trainable nonlinear reaction diffusion: A
flexible framework for fast and effective image restoration. IEEE transac-
tions on pattern analysis and machine intelligence, 39(6):1256–1272, 2016.
[44] Rewon Child. Very deep vaes generalize autoregressive models and can
outperform them on images. arXiv preprint arXiv:2011.10650, 2020.
[45] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating
long sequences with sparse transformers. arXiv preprint arXiv:1904.10509,
2019.
[46] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau,
Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using
rnn encoder-decoder for statistical machine translation. In Conference on
Empirical Methods in Natural Language Processing (EMNLP 2014), 2014.
[47] Edward Choi, Mohammad Taha Bahadori, Joshua A Kulas, Andy
Schuetz, Walter F Stewart, and Jimeng Sun. Retain: An interpretable pre-
dictive model for healthcare using reverse time attention mechanism. Ad-
vances in Neural Information Processing Systems, pages 3512–3520, 2016.
[48] Edward Choi, Mohammad Taha Bahadori, Andy Schuetz, Walter F Stew-
art, and Jimeng Sun. Doctor ai: Predicting clinical events via recurrent
neural networks. In Machine learning for healthcare conference, pages 301–
318. PMLR, 2016.
[49] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C
Courville, and Yoshua Bengio. A recurrent latent variable model for se-
quential data. Advances in Neural Information Processing Systems, 28:2980–
2988, 2015.
[50] Robert Coop, Aaron Mishtal, and Itamar Arel. Ensemble learning in fixed
expansion layer networks for mitigating catastrophic forgetting. IEEE
transactions on neural networks and learning systems, 24(10):1623–1634, 2013.
[51] Emmanuel de Bézenac, Syama Sundar Rangapuram, Konstantinos Beni-
dis, Michael Bohlke-Schneider, Richard Kurle, Lorenzo Stella, Hilaf Has-
son, Patrick Gallinari, and Tim Januschowski. Normalizing kalman filters
for multivariate time series analysis. Advances in Neural Information Pro-
cessing Systems, 33, 2020.
[52] Emily Denton and Rob Fergus. Stochastic video generation with a learned
prior. In International Conference on Machine Learning, pages 1174–1183.
PMLR, 2018.
[53] Xinshen Diao, Peter Hazell, and James Thurlow. The role of agriculture in
african development. World development, 38(10):1375–1383, 2010.
[54] Kamel Didan. MOD13Q1 MODIS/Terra Vegetation Indices 16-Day L3
Global 250m SIN Grid V006. NASA EOSDIS LP DAAC, 2015.
[55] Nat Dilokthanakul, Pedro AM Mediano, Marta Garnelo, Matthew CH
Lee, Hugh Salimbeni, Kai Arulkumaran, and Murray Shanahan. Deep
unsupervised clustering with gaussian mixture variational autoencoders.
arXiv preprint arXiv:1611.02648, 2016.
[56] Andreas Doerr, Christian Daniel, Martin Schiegg, Duy Nguyen-Tuong,
Stefan Schaal, Marc Toussaint, and Sebastian Trimpe. Probabilistic recur-
rent state-space models. In International Conference on Machine Learning,
pages 1280–1289. PMLR, 2018.
[57] Linhao Dong, Shuang Xu, and Bo Xu. Speech-transformer: a no-
recurrence sequence-to-sequence model for speech recognition. In 2018
IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pages 5884–5888. IEEE, 2018.
[58] James Durbin and Siem Jan Koopman. Time series analysis by state space
methods. Oxford university press, 2012.
[59] Sayna Ebrahimi, Mohamed Elhoseiny, Trevor Darrell, and Marcus
Rohrbach. Uncertainty-guided continual learning with bayesian neural
networks. arXiv preprint arXiv:1906.02425, 2019.
[60] Jeffrey M Ede and Richard Beanland. Improving electron micrograph
signal-to-noise with an atrous convolutional encoder-decoder. Ultrami-
croscopy, 202:18–25, 2019.
[61] Jeffrey Mark Ede. Deep learning in electron microscopy. Machine Learning:
Science and Technology, 2020.
[62] P. Erdős and A. Rényi. On random graphs I. Publicationes Mathematicae
Debrecen, 6:290, 1959.
[63] Chenyou Fan, Yuze Zhang, Yi Pan, Xiaoyue Li, Chi Zhang, Rong Yuan,
Di Wu, Wensheng Wang, Jian Pei, and Heng Huang. Multi-horizon time
series forecasting with temporal attention learning. In Proceedings of the
25th ACM SIGKDD International Conference on Knowledge Discovery & Data
Mining, pages 2527–2535, 2019.
[64] Chrisantha Fernando, Dylan Banarse, Charles Blundell, Yori Zwols,
David Ha, Andrei A Rusu, Alexander Pritzel, and Daan Wierstra. Path-
net: Evolution channels gradient descent in super neural networks. arXiv
preprint arXiv:1701.08734, 2017.
[65] Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning
for physical interaction through video prediction. In Proceedings of the 30th
International Conference on Neural Information Processing Systems, pages 64–
72, 2016.
[66] Marco Fraccaro, Simon Kamronn, Ulrich Paquet, and Ole Winther. A dis-
entangled recognition and nonlinear dynamics model for unsupervised
learning. In Proceedings of the 31st International Conference on Neural Infor-
mation Processing Systems, pages 3604–3613, 2017.
[67] Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther.
Sequential neural models with stochastic layers. In Proceedings of the 30th
International Conference on Neural Information Processing Systems, pages
2207–2215, 2016.
[68] Katerina Fragkiadaki, Sergey Levine, Panna Felsen, and Jitendra Malik.
Recurrent network models for human dynamics. In Proceedings of the IEEE
International Conference on Computer Vision, pages 4346–4354, 2015.
[69] Jean-Yves Franceschi, Edouard Delasalles, Mickaël Chen, Sylvain Lam-
prier, and Patrick Gallinari. Stochastic latent residual video prediction.
arXiv preprint arXiv:2002.09219, 2020.
[70] Robert M French. Catastrophic forgetting in connectionist networks.
Trends in cognitive sciences, 3(4):128–135, 1999.
[71] Hang Gao, Huazhe Xu, Qi-Zhi Cai, Ruth Wang, Fisher Yu, and Trevor
Darrell. Disentangling propagation and generation for video prediction.
In Proceedings of the IEEE/CVF International Conference on Computer Vision,
pages 9006–9015, 2019.
[72] Jan Gasthaus, Konstantinos Benidis, Yuyang Wang, Syama Sundar Ran-
gapuram, David Salinas, Valentin Flunkert, and Tim Januschowski. Prob-
abilistic forecasting with spline quantile function rnns. In The 22nd in-
ternational conference on artificial intelligence and statistics, pages 1901–1910.
PMLR, 2019.
[73] Partha Ghosh, Jie Song, Emre Aksan, and Otmar Hilliges. Learning hu-
man motion models for long-term predictions. In 2017 International Con-
ference on 3D Vision (3DV), pages 458–466. IEEE, 2017.
[74] E Giannatou, G Papavieros, V Constantoudis, H Papageorgiou, and
E Gogolides. Deep learning denoising of sem images towards noise-
reduced ler measurements. Microelectronic Engineering, 216:111051, 2019.
[75] Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules,
prediction, and estimation. Journal of the American statistical Association,
102(477):359–378, 2007.
[76] Kuang Gong, Jiahui Guan, Chih-Chieh Liu, and Jinyi Qi. Pet image de-
noising using a deep neural network through fine tuning. IEEE Transac-
tions on Radiation and Plasma Medical Sciences, 3(2):153–161, 2018.
[77] Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua
Bengio. An empirical investigation of catastrophic forgetting in gradient-
based neural networks. arXiv preprint arXiv:1312.6211, 2013.
[78] Alex Graves. Generating sequences with recurrent neural networks. arXiv
preprint arXiv:1308.0850, 2013.
[79] Alexander Greaves-Tunnell and Zaid Harchaoui. A statistical investiga-
tion of long memory in language and music. In International Conference on
Machine Learning, pages 2394–2403. PMLR, 2019.
[80] Karol Gregor, George Papamakarios, Frederic Besse, Lars Buesing, and
Theophane Weber. Temporal difference variational auto-encoder. In In-
ternational Conference on Learning Representations, 2018.
[81] Agrim Gupta, Justin Johnson, Li Fei-Fei, Silvio Savarese, and Alexandre
Alahi. Social gan: Socially acceptable trajectories with generative adver-
sarial networks. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 2255–2264, 2018.
[82] Swaminathan Gurumurthy, Ravi Kiran Sarvadevabhatla, and
R Venkatesh Babu. Deligan: Generative adversarial networks for
diverse and limited data. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 166–174, 2017.
[83] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha,
Honglak Lee, and James Davidson. Learning latent dynamics for plan-
ning from pixels. In International Conference on Machine Learning, pages
2555–2565. PMLR, 2019.
[84] Andrew C Harvey. Forecasting, Structural Time Series Models and the Kalman
Filter. Cambridge University Press, 1990.
[85] Tyler L Hayes, Nathan D Cahill, and Christopher Kanan. Memory effi-
cient experience replay for streaming learning. In 2019 International Con-
ference on Robotics and Automation (ICRA), pages 9769–9776. IEEE, 2019.
[86] Jiawei He, Andreas Lehrmann, Joseph Marino, Greg Mori, and Leonid
Sigal. Probabilistic video generation using holistic attribute control. In
Proceedings of the European Conference on Computer Vision (ECCV), pages
452–467, 2018.
[87] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual
learning for image recognition. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 770–778, 2016.
[88] J Vernon Henderson, Adam Storeygard, and David N Weil. Measuring
economic growth from outer space. American economic review, 102(2):994–
1028, 2012.
[89] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glo-
rot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-
vae: Learning basic visual concepts with a constrained variational frame-
work. 2016.
[90] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge
in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[91] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion proba-
bilistic models. arXiv preprint arXiv:2006.11239, 2020.
[92] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neu-
ral computation, 9(8):1735–1780, 1997.
[93] James P Horwath, Dmitri N Zakharov, Remi Megret, and Eric A Stach.
Understanding important features of deep learning models for segmen-
tation of high-resolution transmission electron microscopy images. npj
Computational Materials, 6(1):1–9, 2020.
[94] Yen-Chang Hsu, Yen-Cheng Liu, Anita Ramasamy, and Zsolt Kira. Re-
evaluating continual learning scenarios: A categorization and case for
strong baselines. arXiv preprint arXiv:1810.12488, 2018.
[95] Rob Hyndman, Anne B Koehler, J Keith Ord, and Ralph D Snyder. Fore-
casting with exponential smoothing: the state space approach. Springer Science
& Business Media, 2008.
[96] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu.
Human3.6M: Large scale datasets and predictive methods for 3d human
sensing in natural environments. IEEE transactions on pattern analysis and
machine intelligence, 36(7):1325–1339, 2013.
[97] David Isele and Akansel Cosgun. Selective experience replay for lifelong
learning. In Thirty-second AAAI conference on artificial intelligence, 2018.
[98] Ashesh Jain, Amir R Zamir, Silvio Savarese, and Ashutosh Saxena.
Structural-rnn: Deep learning on spatio-temporal graphs. In Proceedings
of the IEEE conference on computer vision and pattern recognition, pages 5308–
5317, 2016.
[99] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization
with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
[100] Andrew H Jazwinski. Stochastic processes and filtering theory. Courier Cor-
poration, 2007.
[101] Neal Jean, Marshall Burke, Michael Xie, W Matthew Davis, David B Lo-
bell, and Stefano Ermon. Combining satellite imagery and machine learn-
ing to predict poverty. Science, 353(6301):790–794, 2016.
[102] Ghassen Jerfel, Erin Grant, Tom Griffiths, and Katherine A Heller. Rec-
onciling meta-learning and continual learning with online mixtures of
tasks. In Advances in Neural Information Processing Systems, pages 9119–
9130, 2019.
[103] Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc V Gool. Dynamic
filter networks. Advances in neural information processing systems, 29:667–
675, 2016.
[104] Wen Jiang, Matthew L Baker, Qiu Wu, Chandrajit Bajaj, and Wah Chiu.
Applications of a bilateral denoising filter in biological electron mi-
croscopy. Journal of structural biology, 144(1-2):114–122, 2003.
[105] Beibei Jin, Yu Hu, Qiankun Tang, Jingyu Niu, Zhiping Shi, Yinhe Han,
and Xiaowei Li. Exploring spatial-temporal multi-frequency analysis for
high-fidelity and temporal-consistency video prediction. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages
4554–4563, 2020.
[106] Kiersten Johnson and Molly E Brown. Environmental risk factors and
child nutritional status and survival in a context of climate variability and
change. Applied Geography, 54:209–221, 2014.
[107] Nal Kalchbrenner, Aäron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol
Vinyals, Alex Graves, and Koray Kavukcuoglu. Video pixel networks.
In International Conference on Machine Learning, pages 1771–1779. PMLR,
2017.
[108] Maximilian Karl, Maximilian Soelch, Justin Bayer, and Patrick Van der
Smagt. Deep variational bayes filters: Unsupervised learning of state
space models from raw data. arXiv preprint arXiv:1605.06432, 2016.
[109] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progres-
sive growing of gans for improved quality, stability, and variation. arXiv
preprint arXiv:1710.10196, 2017.
[110] Ronald Kemker and Christopher Kanan. Fearnet: Brain-inspired model
for incremental learning. arXiv preprint arXiv:1711.10563, 2017.
[111] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert:
Pre-training of deep bidirectional transformers for language understand-
ing. In Proceedings of NAACL-HLT, pages 4171–4186, 2019.
[112] Byeongjoon Kim, Minah Han, Hyunjung Shim, and Jongduk Baek. A
performance comparison of convolutional neural network-based image
denoising methods: The effect of loss functions on low-dose ct images.
Medical physics, 46(9):3906–3923, 2019.
[113] Hyunjik Kim, Andriy Mnih, Jonathan Schwarz, Marta Garnelo, Ali Es-
lami, Dan Rosenbaum, Oriol Vinyals, and Yee Whye Teh. Attentive neural
processes. arXiv preprint arXiv:1901.05761, 2019.
[114] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic opti-
mization. arXiv preprint arXiv:1412.6980, 2014.
[115] Diederik P Kingma and Prafulla Dhariwal. Glow: Generative flow with
invertible 1x1 convolutions. arXiv preprint arXiv:1807.03039, 2018.
[116] Diederik P Kingma and Max Welling. Auto-encoding variational bayes.
2014.
[117] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guil-
laume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ra-
malho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic
forgetting in neural networks. Proceedings of the national academy of sci-
ences, 114(13):3521–3526, 2017.
[118] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The effi-
cient transformer. arXiv preprint arXiv:2001.04451, 2020.
[119] Rahul Krishnan, Uri Shalit, and David Sontag. Structured inference net-
works for nonlinear state space models. In Proceedings of the AAAI Confer-
ence on Artificial Intelligence, volume 31, 2017.
[120] Rahul G Krishnan, Uri Shalit, and David Sontag. Deep kalman filters.
arXiv preprint arXiv:1511.05121, 2015.
[121] Alex Krizhevsky et al. Learning multiple layers of features from tiny im-
ages. 2009.
[122] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet clas-
sification with deep convolutional neural networks. Advances in neural
information processing systems, 25:1097–1105, 2012.
[123] Manoj Kumar, Mohammad Babaeizadeh, Dumitru Erhan, Chelsea Finn,
Sergey Levine, Laurent Dinh, and Durk Kingma. Videoflow: A flow-
based generative model for video.
[124] Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. Model-
ing long-and short-term temporal patterns with deep neural networks. In
The 41st International ACM SIGIR Conference on Research & Development in
Information Retrieval, pages 95–104, 2018.
[125] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-
based learning applied to document recognition. Proceedings of the IEEE,
86(11):2278–2324, 1998.
[126] Alex X Lee, Richard Zhang, Frederik Ebert, Pieter Abbeel, Chelsea Finn,
and Sergey Levine. Stochastic adversarial video prediction. arXiv preprint
arXiv:1804.01523, 2018.
[127] Soochan Lee, Junsoo Ha, Dongsu Zhang, and Gunhee Kim. A neural
dirichlet process mixture model for task-free continual learning. arXiv
preprint arXiv:2001.00689, 2020.
[128] Yong Suk Lee. International isolation and regional inequality: Evidence
from sanctions on north korea. Journal of Urban Economics, 103:34–51, 2018.
[129] Chen Li, Zhen Zhang, Wee Sun Lee, and Gim Hee Lee. Convolutional
sequence to sequence model for human dynamics. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pages 5226–
5234, 2018.
[130] Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang
Wang, and Xifeng Yan. Enhancing the locality and breaking the memory
bottleneck of transformer on time series forecasting. Advances in Neural
Information Processing Systems, 32:5243–5253, 2019.
[131] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE trans-
actions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017.
[132] Zimo Li, Yi Zhou, Shuangjiu Xiao, Chong He, Zeng Huang, and Hao Li.
Auto-conditioned recurrent networks for extended complex human mo-
tion synthesis. arXiv preprint arXiv:1707.05363, 2017.
[133] Xiaodan Liang, Lisa Lee, Wei Dai, and Eric P Xing. Dual motion gan for
future-flow embedded video prediction. In Proceedings of the IEEE Interna-
tional Conference on Computer Vision, pages 1744–1752, 2017.
[134] Jeff W Lichtman and José-Angel Conchello. Fluorescence microscopy. Na-
ture methods, 2(12):910–919, 2005.
[135] Bryan Lim, Sercan O Arik, Nicolas Loeff, and Tomas Pfister. Temporal fu-
sion transformers for interpretable multi-horizon time series forecasting.
arXiv preprint arXiv:1912.09363, 2019.
[136] Jae S Lim. Two-dimensional signal and image processing. Prentice Hall, 1990.
[137] Long-Ji Lin. Self-improving reactive agents based on reinforcement learn-
ing, planning and teaching. Machine learning, 8(3-4):293–321, 1992.
[138] Zhaojiang Lin, Genta Indra Winata, Peng Xu, Zihan Liu, and Pascale
Fung. Variational transformers for diverse response generation. arXiv
preprint arXiv:2003.12738, 2020.
[139] Tony Lindeberg. Scale selection properties of generalized scale-space in-
terest point detectors. Journal of Mathematical Imaging and vision, 46(2):177–
210, 2013.
[140] Zachary C Lipton, David C Kale, Charles Elkan, and Randall Wetzel.
Learning to diagnose with lstm recurrent neural networks. In International
Conference on Learning Representations, 2016.
[141] Danyang Liu and Gongshen Liu. A transformer-based variational au-
toencoder for sentence generation. In 2019 International Joint Conference on
Neural Networks (IJCNN), pages 1–7. IEEE, 2019.
[142] Ziwei Liu, Raymond A Yeh, Xiaoou Tang, Yiming Liu, and Aseem Agar-
wala. Video frame synthesis using deep voxel flow. In Proceedings of the
IEEE International Conference on Computer Vision, pages 4463–4471, 2017.
[143] David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory
for continual learning. In Advances in Neural Information Processing Sys-
tems, pages 6467–6476, 2017.
[144] William Lotter, Gabriel Kreiman, and David Cox. Deep predictive coding
networks for video prediction and unsupervised learning. arXiv preprint
arXiv:1605.08104, 2016.
[145] Christos Louizos, Xiahan Shi, Klamer Schutte, and Max Welling. The
functional neural process. In Advances in Neural Information Processing Sys-
tems, pages 8743–8754, 2019.
[146] Chaochao Lu, Michael Hirsch, and Bernhard Schölkopf. Flexible spatio-
temporal networks for video prediction. In Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition, pages 6523–6531, 2017.
[147] Florian Luisier, Thierry Blu, and Michael Unser. Image denoising in
mixed poisson–gaussian noise. IEEE Transactions on image processing,
20(3):696–708, 2010.
[148] Helmut Lütkepohl. New introduction to multiple time series analysis.
Springer Science & Business Media, 2005.
[149] Lars Maaløe, Marco Fraccaro, Valentin Lievin, and Ole Winther. Biva:
A very deep hierarchy of latent variables for generative modeling. In
33rd Conference on Neural Information Processing Systems, page 8882. Neural
Information Processing Systems Foundation, 2019.
[150] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete dis-
tribution: A continuous relaxation of discrete random variables. arXiv
preprint arXiv:1611.00712, 2016.
[151] Jacob Madsen, Pei Liu, Jens Kling, Jakob Birkedal Wagner,
Thomas Willum Hansen, Ole Winther, and Jakob Schiøtz. A deep
learning approach to identify local structures in atomic-resolution trans-
mission electron microscopy images. Advanced Theory and Simulations,
1(8):1800037, 2018.
[152] Markku Makitalo and Alessandro Foi. Optimal inversion of the general-
ized anscombe transformation for poisson-gaussian noise. IEEE transac-
tions on image processing, 22(1):91–103, 2012.
[153] Bryce Manifold, Elena Thomas, Andrew T Francis, Andrew H Hill, and
Dan Fu. Denoising of stimulated raman scattering microscopy images via
deep learning. Biomedical optics express, 10(8):3860–3874, 2019.
[154] Julieta Martinez, Michael J Black, and Javier Romero. On human motion
prediction using recurrent neural networks. In Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition, pages 2891–2900, 2017.
[155] Julieta Martinez, Rayat Hossain, Javier Romero, and James J Little. A
simple yet effective baseline for 3d human pose estimation. In Proceedings
of the IEEE International Conference on Computer Vision, pages 2640–2649,
2017.
[156] James E Matheson and Robert L Winkler. Scoring rules for continuous
probability distributions. Management science, 22(10):1087–1096, 1976.
[157] Michael McCloskey and Neal J Cohen. Catastrophic interference in con-
nectionist networks: The sequential learning problem. In Psychology of
learning and motivation, volume 24, pages 109–165. Elsevier, 1989.
[158] Ian S McLean. Electronic imaging in astronomy: detectors and instrumenta-
tion. Springer Science & Business Media, 2008.
[159] William Meiniel, Jean-Christophe Olivo-Marin, and Elsa D Angelini. De-
noising of microscopy images: a review of the state-of-the-art, and a new
sparsity-based method. IEEE Transactions on Image Processing, 27(8):3842–
3856, 2018.
[160] Peyman Milanfar. A tour of modern image filtering: New insights and
methods, both practical and theoretical. IEEE signal processing magazine,
30(1):106–128, 2012.
[161] David Minarik, Olof Enqvist, and Elin Trägårdh. Denoising of scintillation
camera images using a deep convolutional neural network: a monte carlo
simulation approach. Journal of Nuclear Medicine, 61(2):298–303, 2020.
[162] Tiziano Montini, Michele Melchionna, Matteo Monai, and Paolo For-
nasiero. Fundamentals and catalytic applications of ceo2-based materials.
Chemical reviews, 116(10):5987–6041, 2016.
[163] Kevin P Murphy. Machine learning: a probabilistic perspective. 2012.
[164] United Nations. Transforming our world: The 2030 agenda for sustain-
able development. Resolution adopted by the General Assembly, 2015.
[165] PD Nellist and SJ Pennycook. Accurate structure determination from im-
age reconstruction in adf stem. Journal of Microscopy, 190(1-2):159–170,
1998.
[166] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and
Andrew Y Ng. Reading digits in natural images with unsupervised fea-
ture learning. 2011.
[167] Cuong V Nguyen, Yingzhen Li, Thang D Bui, and Richard E Turner. Vari-
ational continual learning. arXiv preprint arXiv:1710.10628, 2017.
[168] Yao Nie, Li Li, and Zidong Wei. Recent advancements in pt and pt-
free catalysts for oxygen reduction reaction. Chemical Society Reviews,
44(8):2168–2201, 2015.
[169] NOAA National Centers for Environmental Information. Nighttime
Lights Time Series. 2010.
[170] Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard Lewis, and Satinder
Singh. Action-conditional video prediction using deep networks in atari
games. arXiv preprint arXiv:1507.08750, 2015.
[171] Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt,
Alex Graves, and Koray Kavukcuoglu. Conditional image generation
with pixelcnn decoders. arXiv preprint arXiv:1606.05328, 2016.
[172] Boris N Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Ben-
gio. N-beats: Neural basis expansion analysis for interpretable time series
forecasting. In International Conference on Learning Representations, 2019.
[173] Oleksiy Ostapenko, Mihai Puscas, Tassilo Klein, Patrick Jahnichen, and
Moin Nabi. Learning to remember: A synaptic plasticity driven frame-
work for continual learning. In Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition, pages 11321–11329, 2019.
[174] German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Ste-
fan Wermter. Continual lifelong learning with neural networks: A review.
Neural Networks, 2019.
[175] Martin L Parry, Osvaldo Canziani, Jean Palutikof, Paul
Van der Linden, and Clair Hanson. Climate change 2007-impacts, adapta-
tion and vulnerability: Working group II contribution to the fourth assessment
report of the IPCC, volume 4. Cambridge University Press, 2007.
[176] Andrew J Patton. A review of copula models for economic time series.
Journal of Multivariate Analysis, 110:4–18, 2012.
[177] JR Peterson, JG Jernigan, SM Kahn, AP Rasmussen, E Peng, Z Ahmad,
J Bankert, C Chang, C Claver, DK Gilmore, et al. Simulation of astro-
nomical images from optical survey telescopes using a comprehensive
photon monte carlo approach. The Astrophysical Journal Supplement Series,
218(1):14, 2015.
[178] Neeti Pokhriyal and Damien Christophe Jacques. Combining disparate
data sources for improved poverty prediction and mapping. Proceedings
of the National Academy of Sciences, 114(46):E9783–E9792, 2017.
[179] Javier Portilla, Vasily Strela, Martin J Wainwright, and Eero P Simoncelli.
Image denoising using scale mixtures of gaussians in the wavelet domain.
IEEE Trans. Image Processing, 12(11), 2003.
[180] Franco P Preparata and Michael Ian Shamos. Convex hulls: Basic algo-
rithms. In Computational geometry, pages 95–149. Springer, 1985.
[181] Steven E Prince, Sander M Daselaar, and Roberto Cabeza. Neural corre-
lates of relational memory: successful encoding and retrieval of semantic
and perceptual associations. Journal of Neuroscience, 25(5):1203–1210, 2005.
[182] Yao Qin, Dongjin Song, Haifeng Chen, Wei Cheng, Guofei Jiang, and Gar-
rison Cottrell. A dual-stage attention-based recurrent neural network for
time series prediction. arXiv preprint arXiv:1704.02971, 2017.
[183] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya
Sutskever, et al. Language models are unsupervised multitask learners.
OpenAI blog, 1(8):9, 2019.
[184] Marco Ragone, Vitaliy Yurkiv, Boao Song, Ajaykrishna Ramsubramanian,
Reza Shahbazian-Yassar, and Farzad Mashayek. Atomic column heights
detection in metallic nanoparticles using deep convolutional learning.
Computational Materials Science, 180:109722, 2020.
[185] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm
Levskaya, and Jonathon Shlens. Stand-alone self-attention in vision mod-
els. arXiv preprint arXiv:1906.05909, 2019.
[186] Syama Sundar Rangapuram, Matthias W Seeger, Jan Gasthaus, Lorenzo
Stella, Yuyang Wang, and Tim Januschowski. Deep state space models for
time series forecasting. Advances in neural information processing systems,
31:7785–7794, 2018.
[187] Marc’Aurelio Ranzato, Arthur Szlam, Joan Bruna, Michael Mathieu, Ro-
nan Collobert, and Sumit Chopra. Video (language) modeling: a baseline
for generative models of natural videos. arXiv preprint arXiv:1412.6604,
2014.
[188] Dushyant Rao, Francesco Visin, Andrei Rusu, Razvan Pascanu, Yee Whye
Teh, and Raia Hadsell. Continual unsupervised representation learn-
ing. In Advances in Neural Information Processing Systems, pages 7645–7655,
2019.
[189] Kashif Rasul, Calvin Seward, Ingmar Schuster, and Roland Vollgraf.
Autoregressive denoising diffusion models for multivariate probabilistic
time series forecasting. arXiv preprint arXiv:2101.12072, 2021.
[190] Kashif Rasul, Abdul-Saboor Sheikh, Ingmar Schuster, Urs Bergmann, and
Roland Vollgraf. Multi-variate probabilistic time series forecasting via
conditioned normalizing flows. arXiv preprint arXiv:2002.06103, 2020.
[191] Roger Ratcliff. Connectionist models of recognition memory: con-
straints imposed by learning and forgetting functions. Psychological review,
97(2):285, 1990.
[192] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and
Christoph H Lampert. icarl: Incremental classifier and representation
learning. In Proceedings of the IEEE conference on Computer Vision and Pat-
tern Recognition, pages 2001–2010, 2017.
[193] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You
only look once: Unified, real-time object detection. In Proceedings of the
IEEE conference on computer vision and pattern recognition, pages 779–788,
2016.
[194] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochas-
tic backpropagation and approximate inference in deep generative mod-
els. In International conference on machine learning, pages 1278–1286. PMLR,
2014.
[195] Matthew Riemer, Ignacio Cases, Robert Ajemian, Miao Liu, Irina Rish,
Yuhai Tu, and Gerald Tesauro. Learning to learn without forgetting
by maximizing transfer and minimizing interference. arXiv preprint
arXiv:1810.11910, 2018.
[196] Hippolyt Ritter, Aleksandar Botev, and David Barber. Online structured
laplace approximations for overcoming catastrophic forgetting. In Ad-
vances in Neural Information Processing Systems, pages 3738–3748, 2018.
[197] David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and
Gregory Wayne. Experience replay for continual learning. In Advances in
Neural Information Processing Systems, pages 348–358, 2019.
[198] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolu-
tional networks for biomedical image segmentation. In International Con-
ference on Medical image computing and computer-assisted intervention, pages
234–241. Springer, 2015.
[199] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh,
Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael
Bernstein, et al. Imagenet large scale visual recognition challenge. In-
ternational Journal of Computer Vision, 115(3):211–252, 2015.
[200] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer,
James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Had-
sell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
[201] David Salinas, Michael Bohlke-Schneider, Laurent Callot, Roberto
Medico, and Jan Gasthaus. High-dimensional multivariate forecasting
with low-rank gaussian copula processes. arXiv preprint arXiv:1910.03002,
2019.
[202] David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski.
Deepar: Probabilistic forecasting with autoregressive recurrent networks.
International Journal of Forecasting, 36(3):1181–1191, 2020.
[203] Luigi Sedda, Andrew J Tatem, David W Morley, Peter M Atkinson,
Nicola A Wardrop, Carla Pezzulo, Alessandro Sorichetta, Joanna Kuleszo,
and David J Rogers. Poverty, health and satellite-derived vegetation in-
dices: their inter-spatial relationship in west africa. International health,
7(2):99–106, 2015.
[204] Iulian Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle
Pineau, Aaron Courville, and Yoshua Bengio. A hierarchical latent vari-
able encoder-decoder model for generating dialogues. In Proceedings of the
AAAI Conference on Artificial Intelligence, volume 31, 2017.
[205] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fer-
gus, and Yann LeCun. Overfeat: Integrated recognition, localization and
detection using convolutional networks. arXiv preprint arXiv:1312.6229,
2013.
[206] Arthur P Shimamura. Episodic retrieval and the cortical binding of rela-
tional activity. Cognitive, Affective, & Behavioral Neuroscience, 11(3):277–291,
2011.
[207] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual
learning with deep generative replay. In Advances in Neural Information
Processing Systems, pages 2990–2999, 2017.
[208] Leonid Sigal and Michael J Black. Humaneva: Synchronized video
and motion capture dataset for evaluation of articulated human motion.
Brown University TR, 120(2), 2006.
[209] Karen Simonyan and Andrew Zisserman. Very deep convolutional net-
works for large-scale image recognition. arXiv preprint arXiv:1409.1556,
2014.
[210] David Smith. CHAPTER 1: Characterization of nanomaterials using transmis-
sion electron microscopy, pages 1–29. Number 37 in RSC Nanoscience and
Nanotechnology. Royal Society of Chemistry, 37 edition, January 2015.
[211] Slawek Smyl, Jai Ranganathan, and Andrea Pasqua. M4 forecasting com-
petition: Introducing a new hybrid es-rnn model. URL:
https://eng.uber.com/m4-forecasting-competition, 2018.
[212] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby,
and Ole Winther. Ladder variational autoencoders. In Proceedings of
the 30th International Conference on Neural Information Processing Systems,
pages 3745–3753, 2016.
[213] Huan Song, Deepta Rajan, Jayaraman Thiagarajan, and Andreas Spanias.
Attend and diagnose: Clinical time series analysis using attention models.
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32,
2018.
[214] Pablo Sprechmann, Siddhant M Jayakumar, Jack W Rae, Alexander
Pritzel, Adria Puigdomenech Badia, Benigno Uria, Oriol Vinyals, Demis
Hassabis, Razvan Pascanu, and Charles Blundell. Memory-based param-
eter adaptation. arXiv preprint arXiv:1802.10542, 2018.
[215] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and
Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks
from overfitting. The journal of machine learning research, 15(1):1929–1958,
2014.
[216] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsu-
pervised learning of video representations using lstms. In International
conference on machine learning, pages 843–852. PMLR, 2015.
[217] Jessica E Steele, Pål Roe Sundsøy, Carla Pezzulo, Victor A Alegana,
Tomas J Bird, Joshua Blumenstock, Johannes Bjelland, Kenth Engø-
Monsen, Yves-Alexandre de Montjoye, Asif M Iqbal, et al. Mapping
poverty using mobile phone and satellite data. Journal of The Royal Society
Interface, 14(127):20160690, 2017.
[218] Amit Suveer, Anindya Gupta, Gustaf Kylberg, and Ida-Maria Sintorn.
Super-resolution reconstruction of transmission electron microscopy im-
ages using deep learning. In 2019 IEEE 16th International Symposium on
Biomedical Imaging (ISBI 2019), pages 548–551. IEEE, 2019.
[219] Franklin Tao and Peter Crozier. Atomic-scale observations of catalyst
structures under reaction conditions and during catalysis. Chemical Re-
views, 116(6):3487–3539, March 2016.
[220] Michalis K Titsias, Jonathan Schwarz, Alexander G de G Matthews, Raz-
van Pascanu, and Yee Whye Teh. Functional regularisation for continual
learning using gaussian processes. arXiv preprint arXiv:1901.11356, 2019.
[221] Carlo Tomasi and Roberto Manduchi. Bilateral filtering for gray and color
images. In ICCV, volume 98, 1998.
[222] Ruey S Tsay. Multivariate time series analysis: with R and financial applica-
tions. John Wiley & Sons, 2013.
[223] Arash Vahdat and Jan Kautz. Nvae: A deep hierarchical variational au-
toencoder. arXiv preprint arXiv:2007.03898, 2020.
[224] Roy Van der Weide. Go-garch: a multivariate generalized orthogonal
garch model. Journal of Applied Econometrics, 17(5):549–564, 2002.
[225] Rama K Vasudevan and Stephen Jesse. Deep learning as a tool for image
denoising and drift correction. Microscopy and Microanalysis, 25(S2):190–
191, 2019.
[226] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion
Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention
is all you need. In Advances in neural information processing systems, pages
5998–6008, 2017.
[227] Ruben Villegas, Dumitru Erhan, Honglak Lee, et al. Hierarchical long-
term video prediction without supervision. In International Conference on
Machine Learning, pages 6038–6046. PMLR, 2018.
[228] Ruben Villegas, Arkanath Pathak, Harini Kannan, Dumitru Erhan,
Quoc V Le, and Honglak Lee. High fidelity video prediction with large
stochastic recurrent neural networks. In NeurIPS, 2019.
[229] Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, and Honglak
Lee. Decomposing motion and content for natural video sequence predic-
tion. arXiv preprint arXiv:1706.08033, 2017.
[230] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al.
Matching networks for one shot learning. In Advances in neural information
processing systems, pages 3630–3638, 2016.
[231] Jeffrey S Vitter. Random sampling with a reservoir. ACM Transactions on
Mathematical Software (TOMS), 11(1):37–57, 1985.
[232] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating
the future by watching unlabeled video. arXiv preprint arXiv:1504.08023,
2015.
[233] Carl Vondrick and Antonio Torralba. Generating the future with adver-
sarial transformers. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 1020–1028, 2017.
[234] Jacob Walker, Carl Doersch, Abhinav Gupta, and Martial Hebert. An
uncertain future: Forecasting from static images using variational au-
toencoders. In European Conference on Computer Vision, pages 835–851.
Springer, 2016.
[235] Jacob Walker, Abhinav Gupta, and Martial Hebert. Dense optical flow
prediction from a static image. In Proceedings of the IEEE International Con-
ference on Computer Vision, pages 2443–2451, 2015.
[236] Jacob Walker, Kenneth Marino, Abhinav Gupta, and Martial Hebert. The
pose knows: Video forecasting by generating pose futures. In Proceed-
ings of the IEEE international conference on computer vision, pages 3332–3341,
2017.
[237] Eric A Wan and Rudolph Van Der Merwe. The unscented kalman fil-
ter for nonlinear estimation. In Proceedings of the IEEE 2000 Adaptive Sys-
tems for Signal Processing, Communications, and Control Symposium (Cat. No.
00EX373), pages 153–158. IEEE, 2000.
[238] Jack M Wang, David J Fleet, and Aaron Hertzmann. Gaussian process
dynamical models for human motion. IEEE transactions on pattern analysis
and machine intelligence, 30(2):283–298, 2007.
[239] Tianming Wang and Xiaojun Wan. T-cvae: Transformer-based condi-
tioned variational autoencoder for story completion. In IJCAI, pages 5233–
5239, 2019.
[240] Yuyang Wang, Alex Smola, Danielle Maddix, Jan Gasthaus, Dean Foster,
and Tim Januschowski. Deep factors for forecasting. In International Con-
ference on Machine Learning, pages 6607–6617. PMLR, 2019.
[241] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Im-
age quality assessment: from error visibility to structural similarity. IEEE
transactions on image processing, 13(4):600–612, 2004.
[242] Dirk Weissenborn, Oscar Täckström, and Jakob Uszkoreit. Scaling autore-
gressive video models. arXiv preprint arXiv:1906.02634, 2019.
[243] Douglas Brent West et al. Introduction to graph theory, volume 2. Prentice
Hall, Upper Saddle River, 2001.
[244] Mike West and Jeff Harrison. Bayesian forecasting and dynamic models.
Springer Science & Business Media, 2006.
[245] Di Wu and Ling Shao. Leveraging hierarchical parametric networks for
skeletal joints based action segmentation and recognition. In Proceedings
of the IEEE conference on computer vision and pattern recognition, pages 724–
731, 2014.
[246] Neo Wu, Bradley Green, Xue Ben, and Shawn O’Banion. Deep trans-
former models for time series forecasting: The influenza prevalence case.
arXiv preprint arXiv:2001.08317, 2020.
[247] Yue Wu, Yinpeng Chen, Lijuan Wang, Yuancheng Ye, Zicheng Liu, Yan-
dong Guo, and Yun Fu. Large scale incremental learning. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, pages
374–382, 2019.
[248] Michael Xie, Neal Jean, Marshall Burke, David Lobell, and Stefano Ermon.
Transfer learning from deep features for remote sensing and poverty map-
ping. arXiv preprint arXiv:1510.00098, 2015.
[249] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong,
and Wang-chun Woo. Convolutional lstm network: A machine learning
approach for precipitation nowcasting. In Advances in neural information
processing systems, pages 802–810, 2015.
[250] Jingwei Xu, Bingbing Ni, Zefan Li, Shuo Cheng, and Xiaokang Yang.
Structure preserving video prediction. In Proceedings of the IEEE confer-
ence on computer vision and pattern recognition, pages 1460–1469, 2018.
[251] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Rus-
lan Salakhutdinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell:
Neural image caption generation with visual attention. In International
conference on machine learning, pages 2048–2057, 2015.
[252] Xinchen Yan, Akash Rastogi, Ruben Villegas, Kalyan Sunkavalli, Eli
Shechtman, Sunil Hadap, Ersin Yumer, and Honglak Lee. Mt-vae: Learn-
ing motion transformations to generate multimodal human dynamics. In
Proceedings of the European Conference on Computer Vision (ECCV), pages
265–281, 2018.
[253] Huaxiu Yao, Xian Wu, Zhiqiang Tao, Yaliang Li, Bolin Ding, Ruirui Li,
and Zhenhui Li. Automated relational meta-learning. arXiv preprint
arXiv:2001.00745, 2020.
[254] Jaehong Yoon, Eunho Yang, Jeongtae Lee, and Sung Ju Hwang. Life-
long learning with dynamically expandable networks. arXiv preprint
arXiv:1708.01547, 2017.
[255] Rose Yu, Stephan Zheng, Anima Anandkumar, and Yisong Yue. Long-
term forecasting using tensor-train rnns. arXiv preprint, 2017.
[256] Weiting Yu, Marc D Porosoff, and Jingguang G Chen. Review of pt-based
bimetallic catalysis: from model surfaces to supported catalysts. Chemical
reviews, 112(11):5780–5817, 2012.
[257] Ye Yuan and Kris Kitani. Dlow: Diversifying latent flows for diverse hu-
man motion prediction. In European Conference on Computer Vision, pages
346–364. Springer, 2020.
[258] Ye Yuan and Kris M Kitani. Diverse trajectory forecasting with determi-
nantal point processes. In International Conference on Learning Representa-
tions, 2019.
[259] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning
through synaptic intelligence. In Proceedings of the 34th International Con-
ference on Machine Learning-Volume 70, pages 3987–3995. JMLR.org, 2017.
[260] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus
Odena. Self-attention generative adversarial networks. arXiv preprint
arXiv:1805.08318, 2018.
[261] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena.
Self-attention generative adversarial networks. In International conference
on machine learning, pages 7354–7363. PMLR, 2019.
[262] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang.
Beyond a Gaussian denoiser: Residual learning of deep CNN for image
denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, 2017.
[263] Xiaoshuai Zhang, Yiping Lu, Jiaying Liu, and Bin Dong. Dynamically un-
folding recurrent restorer: A moving endpoint control method for image
restoration. arXiv preprint arXiv:1805.07709, 2018.
[264] Hengshuang Zhao, Jiaya Jia, and Vladlen Koltun. Exploring self-attention
for image recognition. In Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, pages 10076–10085, 2020.
[265] Jingyu Zhao, Feiqing Huang, Jia Lv, Yanjie Duan, Zhen Qin, Guodong Li,
and Guangjian Tian. Do rnn and lstm have long memory? In International
Conference on Machine Learning, pages 11365–11375. PMLR, 2020.
[266] Shengjia Zhao, Jiaming Song, and Stefano Ermon. Learning hierarchical
features from generative models. arXiv preprint arXiv:1702.08396, 2017.
[267] Hai Jing Zhu, Bo Chong Han, and Bo Qiu. Survey of astronomical im-
age processing methods. In International Conference on Image and Graphics,
pages 420–429. Springer, 2015.
[268] Jian-Min Zuo and J.C.H. Spence. Advanced Transmission Electron Mi-
croscopy, Imaging and Diffraction in Nanoscience. January 2017.