DEEP PROBABILISTIC MODELS FOR SEQUENTIAL PREDICTION

A Dissertation
Presented to the Faculty of the Graduate School
of Cornell University
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy

by
Binh Van Tang
August 2021

© 2021 Binh Van Tang
ALL RIGHTS RESERVED

DEEP PROBABILISTIC MODELS FOR SEQUENTIAL PREDICTION
Binh Van Tang, Ph.D.
Cornell University 2021

Despite significant advances in deep learning, probabilistic modeling of sequential data has remained challenging due to the interplay of high-dimensional inputs and temporal dynamics across long-distance time steps. In this dissertation, we propose deep probabilistic methods that model the temporal interactions between sequential inputs while accounting for the inherent uncertainty of future predictions. First, we study the problem of continual learning where samples of different classes arrive sequentially and incrementally, and propose a discriminative approach that uses random graphs to model sample similarities and guard against catastrophic forgetting. Second, we marry state space models with recent advances in deep learning architectures for the task of time series prediction, aiming to capture non-Markovian dynamics via latent variable models. Third, we extend such generative models to the challenging domain of videos in which both spatial and temporal signals are key to multi-frame video predictions. Empirical results show that our models perform competitively against recent baselines, bringing us one step closer to unlocking the underexplored potentials of sequential data.

BIOGRAPHICAL SKETCH

Binh Tang was born in Nghe An, a rural province in the north central coast of Vietnam. In 2012, he fortuitously won a scholarship by the Vietnamese government and began studying abroad in the United States. He attended Clark University in Worcester, Massachusetts, and graduated with a bachelor's degree in mathematics and computer science with highest honors in 2015.
In 2016, Binh pursued a doctorate degree in statistics at Cornell University in Ithaca, New York. He received a master's degree in computer science in 2020. He interned with Facebook during his graduate studies, and joined the company as a full-time research scientist in Seattle, Washington, in 2021.

This dissertation is lovingly dedicated to my mother.

ACKNOWLEDGEMENTS

I am deeply grateful to my advisor, Professor David S. Matteson, for his unwavering support and mentorship throughout my journey at Cornell University. I am also honored and grateful to have Professor Claire Cardie and Professor James Booth on my Ph.D. committee.

I would also like to thank supportive colleagues from multiple research groups, including Professor Peter A. Crozier's at Arizona State University, Professor Carlos Fernandez-Granda's at New York University, and Professor Ying Sun's and especially Professor David S. Matteson's at Cornell University.

My work was financially supported in part through National Science Foundation Awards, Cornell University Graduate School Fellowship and Cornell University Atkinson Center for Sustainability Award, all of which are greatly appreciated.

Finally, I thank my undergraduate advisors, Professor Li Han and Professor Lawrence Morris, for introducing me to computer science and advanced mathematics, respectively.

CONTENTS

Biographical Sketch
Dedication
Acknowledgements
Contents
List of Tables
List of Figures

1 Introduction
2 Graph-Based Continual Learning
  2.1 Problem Formulation
  2.2 Graph-Based Continual Learning
  2.3 Related Work
  2.4 Experiment Results
  2.5 Conclusion and Discussion
3 Probabilistic Transformer for Time Series Prediction
  3.1 State Space Models
  3.2 Transformer Architectures
  3.3 Probabilistic Transformer
    3.3.1 Single-Layered Probabilistic Transformer
    3.3.2 Multi-Layered Extension
  3.4 Related Work
  3.5 Experiments
    3.5.1 Time-series Forecasting
    3.5.2 Human Motion Prediction
  3.6 Conclusion & Discussion
4 Probabilistic Transformer for Video Prediction
  4.1 Related Work
  4.2 Probabilistic Transformer
  4.3 Experiment Results
  4.4 Conclusion
5 Additional Applications of Deep Neural Networks
  5.1 Dynamic Poverty Prediction with Vegetation Index
    5.1.1 Related Work
    5.1.2 Datasets and Methodology
    5.1.3 Experiment Results
  5.2 Deep Denoising for Scientific Discovery
    5.2.1 Related Work
    5.2.2 Methodology
    5.2.3 Datasets
    5.2.4 Experiments Results
6 Final Remarks
A Graph-Based Continual Learning
B Probabilistic Transformer

LIST OF TABLES

2.1 Classification results (%) on PERMUTED MNIST, ROTATED MNIST and SPLIT SVHN. The means and standard deviations are computed over five runs using different random seeds. When used, episodic memories contain 5 samples per class on average. The symbol ↑ (↓) indicates that a higher (lower) number is better.
2.2 Classification results (%) on SPLIT CIFAR10, SPLIT CIFAR100 and SPLIT MINIIMAGENET. The means and standard deviations are computed over five runs using different random seeds. When used, episodic memories contain 5 samples per class on average. The symbol ↑ (↓) indicates that a higher (lower) number is better.
2.3 Ablation study on SPLIT CIFAR10.
3.1 Test set CRPS$_{\text{sum}}$ of time series forecasting models (lower is better). The means and standard deviations are computed over five runs using different random seeds.
3.2 Ablation study on TRAFFIC.
3.3 Human motion prediction results.
4.1 PSNR and SSIM scores on MOVING MNIST.
5.1 Spatially cross-validated $r^2$ values of the predictions of NDVI models relative to Jean et al. [101]. Separate models are fine-tuned and evaluated for different countries and surveys. For NDVI models, the means and standard deviations of $r^2$ values are reported using 5 independent trials.
5.2 Field of view of CNN architectures and performance. Mean PSNR and SSIM (± standard deviation) of different CNN architectures on the (a) held-out simulated test set of TEM data described in Section 5.2.4 and (b) validation set of the DIV2K photographic image dataset [2].
5.3 Results on simulated test data. Mean PSNR and SSIM (± standard deviation) of different denoising methods on the held-out simulated test set described in Section 5.2.4. SBD approaches achieve the best results. SBD combined with the proposed architecture outperforms all other techniques by about 12 dB. The performance of SBD applied to additional architectures is reported in Table 5.2.
A.1 Number of trainable parameters in continual learning models.
A.2 GCL results and CN-DPM results with different memory sizes.
A.3 Memory usage of ER and GCL for various datasets.
B.1 Dimension, domain, frequency, total training timesteps and prediction length properties of the training datasets used in the experiments.
B.2 Test set NMSE$_{\text{sum}}$ and ND$_{\text{sum}}$ of time series models (lower is better). The means and standard deviations are computed over 5 runs using different seeds.
B.3 Number of parameters of Transformer-MAF [190], TimeGrad [189], and ProTran (our model) used in the time-series forecasting experiments.
B.4 Number of parameters of DLow [257], its conditional VAE model, and ProTran (our model) used in the human-motion prediction experiments.

LIST OF FIGURES

2.1 Illustration of Experience Replay (ER) [40] on the left and our model (GCL) on the right. While ER independently processes context images from the episodic memory and target images from the current task, GCL models pairwise similarities between the images via the random graphs $G$ and $A$.
2.2 t-SNE visualization of image embeddings (small circles) from the penultimate layers and class embeddings (large circles) from the weights of the last layers on SPLIT SVHN. The left figure shows that Finetune, a model naively trained on the data stream, fails to recognize the class-based clustering structure and biases the image embeddings toward the last task (classes 8 and 9). In contrast, the right figure shows that GCL (our model) maintains the relational structure and is more robust to the distributional shifts incurred by task changes.
2.3 Average accuracy as a function of the number of tasks trained.
2.4 Memory sizes effects.
2.5 Wall-clock running time.
2.6 Context graph $G$.
2.7 Graph regularization ($\lambda_G$).
3.1 Graphical model representations of linear dynamical systems (LDSs) in (a), and our proposed models (ProTran) in (b), (c), and (d). Black arrows denote the generative mechanism and red arrows the inference procedure. The separation of generation and inference in (c) and (d) is for readability. While traditional SSMs such as LDSs are limited to Markovian dynamics and linear dependencies, our models allow for non-Markovian and non-linear interactions between time steps via the attention mechanism. A multi-layer extension of our models further increases expressiveness without compromising the tractable inference procedure.
3.2 Visualizations of attention weights in an image captioning task. The model sequentially generates words in the shown caption by focusing on the corresponding salient regions in the image depicted with different colors.
3.3 Multihead Attention [226].
3.4 Prediction intervals from ProTran (our model) and test-set ground truth for the first 16 of the 963 time series in the TRAFFIC dataset.
3.5 Ground-truth pose sequences (first row) and corresponding predictions by ProTran (second row). Solid colors indicate later time-steps and faded ones are older. The body-part movements in the predicted and ground-truth poses follow similar patterns, while certain variations are retained.
4.1 Graphical model representations of our probabilistic transformer models. Black arrows denote the generative mechanism and red arrows the inference procedure. The separation of generation and inference in (c) and (d) is for readability. We interleave recursive layers (e.g. layer 1 and layer 3) and non-recursive layers (e.g. layer 2 and layer 4) to increase expressiveness of the temporal dynamics and reduce running time.
4.2 Peak signal-to-noise ratio as a function of time horizon.
4.3 Predicted video frames on odd rows and their corresponding ground truths on even rows at two different time horizons, namely t = 2 on the left and t = 4 on the right, on the STOCHASTIC MOVEMENT DATASET. The shown predictions are among the closest samples to the ground truths based on PSNR.
4.4 Predicted video frames on odd rows and their corresponding ground truths on even rows at two different time horizons, namely t = 4 on the left and t = 13 on the right, on the STOCHASTIC MOVING MNIST dataset. The shown predictions are among the closest samples to the ground truths based on PSNR.
5.1 NDVI measurements for Uganda in 2011.
On the left, the background image shows annual average NDVI with a vertical colorbar while the foreground scatter points depict log consumption expenditures with a horizontal colorbar. On the right, the annual NDVI, spatially averaged over all survey locations, is shown with notable drops during the 2011-2012 East Africa drought highlighted in gray.
5.2 Spatially cross-validated results of NDVI models relative to nightlights and Jean et al. [101]. Nightlight-based models are random forests trained on scalar nighttime light intensities. The top figure shows $r^2$ values for estimating consumption using pooled observations across the four LSMS countries. We run separate trials for increasing percentages of the pooled dataset (e.g., the x-axis value of 60 indicates all surveyed communities below the 60th percentile of consumption are included). The bottom figure shows similar $r^2$ values for estimating the asset index.
5.3 Consumption predictions for LSMS communities in Uganda made by a random forest model trained on 2011 data and tested on 2013 data. The top figure shows the ground-truth consumption along with predictions for LSMS communities ordered by 2011 data. The bottom figure shows RMSE values of the predictions for increasing percentages of the LSMS communities (e.g., the x-axis value of 60 indicates all communities below the 60th percentile in 2011 consumption are included).
5.4 Simulation-based denoising framework. (Top) A training dataset is generated by simulating TEM images of different structures at varying imaging conditions. (Middle) A CNN is trained using the simulated images, paired with noisy counterparts obtained by simulating the relevant noise process. (Bottom) The trained CNN is applied to real data to yield a denoised image, and a likelihood map is generated to quantify the agreement between this structure and the noisy data.
5.5 Likelihood map. When the simulated noisy image in (a) is denoised using the proposed framework (b), a spurious atom appears at the left edge of the nanoparticle (see zoomed image (d)). The value of the likelihood map (c) at that location is very low, indicating that the presence of an atom is less consistent with the observed data than its absence.
5.6 Denoising results for real data. (a) An experimentally-acquired atomic-resolution transmission electron microscope image of a CeO$_2$-supported Pt nanoparticle. The average image intensity is 0.45 electrons/pixel (i.e., a large fraction of pixels register zero electrons), which results in an extremely low signal-to-noise ratio. (b) Denoised image obtained via Fourier-based filtering by a domain expert. (c) Denoised image obtained via the wavelet-based PURE-LET method [147]. (d) Denoised image obtained by the proposed simulation-based denoising (SBD) framework. (e) Likelihood map quantifying to what extent the atomic structure identified from the SBD denoised image is consistent with the data. Regions in red are more likely to correspond to atomic columns in the nanoparticle. Regions in blue are more likely to belong to the vacuum.
5.7 Distribution of likelihood ratio. The figure shows the distribution of log-likelihood ratio of over 25,000 regions of interest computed from the surface of 1550 denoised images using the dataset. The regions containing spurious atoms (false positives, (a)) have a much lower log-likelihood ratio than the regions containing accurately recovered atoms (true positives, (b)). Regions where existing atoms were not detected (false negatives, (c)) have a higher log-likelihood ratio, comparable to that of the regions with accurately recovered atoms.
The occurrence of missing and spurious atoms in denoised images is quite rare: out of the 25,732 regions of interest, only 2,457 and 2,368 were false positives and false negatives respectively.
5.8 Performance of SBD in terms of our proposed metrics. We compute all our proposed metrics on over 7,000 denoised images corresponding to 25 unique noisy images sampled from the 308 clean images. The empirical distribution on the surface (red) and bulk (green) is visualized as box plots indicating the median, 25th percentile, 75th percentile, minimum and maximum value of the distribution. SBD has a near perfect performance in the bulk with all metric values hovering around 1. On the surface, SBD achieves a median score of 1 for precision and recall, and about 0.95 for F1 score and Jaccard index.
5.9 Validation on real data. The real data consist of 40 frames which are approximately stationary and aligned. Their temporal average (left) therefore provides a reasonable estimate for the true intensity profile. In the image on the right, we compare the average intensity profile on the surface atomic columns of the platinum nanoparticle for the denoised data (middle) and the temporal average (left). The profiles are very similar (except for some spurious fluctuations in the temporal average), which suggests that the proposed approach achieves effective denoising on the real data.
A.1 Average accuracy as a function of the number of tasks trained on PERMUTED MNIST, ROTATED MNIST, SPLIT SVHN, and SPLIT CIFAR10.
A.2 Average accuracy as a function of numbers of samples in the episodic memory on SPLIT SVHN and SPLIT CIFAR10.
A.3 Average forgetting as a function of numbers of samples in the episodic memory on SPLIT SVHN and SPLIT CIFAR10.
B.1 Conditioning pose sequences in green and corresponding predictions in red by ProTran. Solid colors indicate later time-steps and faded ones are older.

CHAPTER 1
INTRODUCTION

Although much progress has been made recently, probabilistic modeling of sequential data remains challenging. While deep learning techniques have been tremendously successful with static modalities such as images [87, 122], sequential inputs continue to pose challenges due to the complexity of high-dimensional inputs and the temporal dynamics across long-distance time steps. In this dissertation, we study the problem from multiple perspectives.

While traditional paradigms assume that all training data is available at once in large quantities, in Chapter 2 we consider continual learning setups where images and labels of different classes arrive sequentially and incrementally. The changes in the data distribution often lead to dramatic decreases in classification performance, a problem known as catastrophic forgetting. As a remedy, we propose using random graphs to connect similar samples of different arrival times, resulting in a discriminative model more robust to distributional shifts.

In Chapter 3, we combine state space models (SSMs), a classical, rigorously motivated class of time series models, with recent advances in deep learning architectures. While traditional SSMs are often limited to linear dependencies and Markovian dynamics between time steps, we propose non-Markovian, highly expressive latent variable models based on transformer architectures and variational inference. Our models are capable of generating diverse long-term time series forecasts with uncertainty estimates.

In Chapter 4, we extend such generative models to the challenging task of conditional video predictions.
More specifically, we propose applying recent advances in attention mechanisms to high-dimensional video frames to capture both the spatial and temporal signals as well as their interactions. Empirical results show that our models perform promisingly on several datasets, bringing us one step closer to unlocking the underexplored potential of unlabeled data.

We provide additional applications of deep learning models in poverty estimation and image denoising in Chapter 5 and conclude with final remarks in Chapter 6.

CHAPTER 2
GRAPH-BASED CONTINUAL LEARNING¹

Recent breakthroughs of deep neural networks often hinge on the ability to repeatedly iterate over stationary batches of training data. When exposed to incrementally available data from non-stationary distributions, such networks often fail to learn new information without forgetting much of their previously acquired knowledge, a phenomenon known as catastrophic forgetting [70, 157, 191]. Despite significant advances, the limitation has remained a long-standing challenge for computational systems that aim to continually learn from dynamic data distributions [174].

Among various proposed solutions, rehearsal approaches that store samples from previous tasks in an episodic memory and regularly replay them are among the earliest and most successful strategies against catastrophic forgetting [137, 197]. An episodic memory is typically implemented as an array of independent slots; each slot holds one example coupled with its label. During training, these samples are interleaved with those from the new task, allowing for simultaneous multi-task learning as if the resulting data were independently and identically distributed (i.i.d.).

While such approaches are effective in simple settings, they require sizable memory and are often impaired by memory constraints, performing rather poorly on complex datasets.
A possible explanation is that slot-based memories fail to utilize relational structure between samples; semantically similar items are treated independently both during training and at test time. In marked contrast, relational memory is a prominent feature of biological systems that has been strongly linked to successful memory retrieval and generalization [181]. Humans, for example, encode event features into cortical representations and bind them together in the medial temporal lobe, resulting in a durable, yet flexible form of memory [206].

¹ Joint work with David S. Matteson [].

In this chapter, we introduce a novel Graph-based Continual Learning model (GCL) that resembles some characteristics of relational memory. More specifically, we explicitly model pairwise similarities between samples, including both those in the episodic memory and those found in the current task. These similarities allow for representation transfer between samples and provide a resilient means to guard against catastrophic forgetting. Our contributions are twofold:

(a) We propose the use of random graphs to represent relational structures between samples. While similar notions of dependencies have been proposed in the literature [145, 253], the application of random graphs in task-free continual learning is novel, at least to the best of our knowledge.

(b) We introduce a new regularization objective that uses the random graphs to alleviate catastrophic forgetting. In contrast to previous work [131, 192] based on knowledge distillation [90], the objective penalizes the model for forgetting learned edges between samples instead of its predictions.

Our approach performs competitively on four commonly used datasets, improving accuracy by up to 19.7% and reducing forgetting by almost 37% in the best case when benchmarked against multiple baselines in continual learning.
Figure 2.1: Illustration of Experience Replay (ER) [40] on the left and our model (GCL) on the right. While ER independently processes context images from the episodic memory and target images from the current task, GCL models pairwise similarities between the images via the random graphs $G$ and $A$.

2.1 Problem Formulation

Following the image classification protocol in [143], we consider a training set $\mathcal{D} = \{\mathcal{D}_1, \dots, \mathcal{D}_T\}$ consisting of $T$ tasks, where the dataset for the $t$-th task $\mathcal{D}_t = \{(x_i^t, y_i^t)\}_{i=1}^{n_t}$ contains $n_t$ input-target pairs $(x_i^t, y_i^t) \in \mathcal{X} \times \mathcal{Y}$. While the tasks arrive sequentially, we assume the input-target pairs $(x_i^t, y_i^t)$ in each task are i.i.d. The goal is to learn a supervised model $f_\theta : \mathcal{X} \to \mathcal{Y}$, parametrized by $\theta$, that outputs a class label $y \in \mathcal{Y}$ given an unseen image $x \in \mathcal{X}$.

Following prior work [40, 143, 195], we consider online streams of tasks in which samples from different tasks arrive at different times. As an additional constraint, we insist that the model can only revisit a small amount of data chosen to be stored in a fixed-size episodic memory $\mathcal{M}$. For clarity, the data in the memory are referred to as context images and context labels, denoted by $X_{\mathcal{C}} = \{x_i\}_{i \in \mathcal{C}}$ and $Y_{\mathcal{C}} = \{y_i\}_{i \in \mathcal{C}}$, while those in the current task are referred to as target images and target labels, denoted by $X_{\mathcal{T}} = \{x_j\}_{j \in \mathcal{T}}$ and $Y_{\mathcal{T}} = \{y_j\}_{j \in \mathcal{T}}$, respectively. The model is allowed to update the context samples during training, but the memory is necessarily frozen at test time.
Figure 2.2: t-SNE visualization of image embeddings (small circles) from the penultimate layers and class embeddings (large circles) from the weights of the last layers on SPLIT SVHN; colors correspond to classes 0 through 9. The left figure shows that Finetune, a model naively trained on the data stream, fails to recognize the class-based clustering structure and biases the image embeddings toward the last task (classes 8 and 9). In contrast, the right figure shows that GCL (our model) maintains the relational structure and is more robust to the distributional shifts incurred by task changes.

2.2 Graph-Based Continual Learning

In this section, we propose a Graph-based Continual Learning (GCL) algorithm. While most rehearsal approaches ignore the correlations between images and independently pass them through a network to compute predictions [8, 40, 192], we model pairwise similarities between the images with learnable edges in random graphs (see Figure 2.1). Intuitively, although it might be easy for the model to forget any particular sample, the multiple connections it forms with similar neighbors are harder to forget altogether. If trained well, the random graphs can therefore equip the model with a plastic and durable means to fight against catastrophic forgetting.

Graph Construction. Given a minibatch of target images $X_{\mathcal{T}}$ from the current task, our model makes predictions based on the context images $X_{\mathcal{C}}$ and context labels $Y_{\mathcal{C}}$ that span several previously seen tasks, up to and including the current one. In particular, we explicitly build two random graphs of pairwise dependencies: an undirected graph $G$ between the context images $X_{\mathcal{C}}$ and a directed, bipartite graph $A$ from the context images $X_{\mathcal{C}}$ to the target images $X_{\mathcal{T}}$. Since an undirected graph can be thought of as a directed graph between its vertices and a copy of itself, we treat the context graph $G$ as such and build it analogously to the context-target graph $A$.
Specifically, the high-dimensional context images $X_{\mathcal{C}}$ and target images $X_{\mathcal{T}}$ are first mapped to the image embeddings $U_{\mathcal{C}}$ and $U_{\mathcal{T}}$, respectively, using an image encoder $f_{\theta_1} : \mathcal{X} \to \mathbb{R}^{d_1}$. Following [145], we then represent the edges in each graph by independent Bernoulli random variables whose means are specified by a kernel function in the embedding space. More precisely, the distribution of the resulting Erdős-Rényi random graphs [62] can be defined as

$$p(G \mid U_{\mathcal{C}}) = \prod_{i \in \mathcal{C}} \prod_{k \in \mathcal{C}} \mathrm{Ber}\big(G_{ik} \mid \kappa_\tau(u_i, u_k)\big), \tag{2.1}$$

$$p(A \mid U_{\mathcal{T}}, U_{\mathcal{C}}) = \prod_{j \in \mathcal{T}} \prod_{k \in \mathcal{C}} \mathrm{Ber}\big(A_{jk} \mid \kappa_\tau(u_j, u_k)\big), \tag{2.2}$$

for all $i, k \in \mathcal{C}$ and $j \in \mathcal{T}$, where $\kappa_\tau : \mathbb{R}^{d_1} \times \mathbb{R}^{d_1} \to [0, \infty)$ is a kernel function that encodes similarities between image embeddings, such as the RBF kernel $\kappa_\tau(u_i, u_j) = \exp\left(-\frac{\tau}{2} \|u_i - u_j\|_2^2\right)$. Here, with a slight abuse of notation, we also use $G$ and $A$ to denote the corresponding adjacency matrices; $A_{jk} \in \{0, 1\}$, for example, represents the presence or absence of a directed edge between the $j$-th target image and the $k$-th context image.

Predictive Distribution. Given a context graph $G$ and a context-target graph $A$ that encode pairwise similarities to the context images, our next step is to propagate information from the context images $X_{\mathcal{C}}$ and context labels $Y_{\mathcal{C}}$ to make predictions. To that end, we embed $X_{\mathcal{C}}$ by another image encoder $f_{\theta_2}$ with weights partially tied to the previous one $f_{\theta_1}$, and encode $Y_{\mathcal{C}}$ by a linear label encoder before concatenating the resulting embeddings into latent representations $V_{\mathcal{C}} \in \mathbb{R}^{|\mathcal{C}| \times d_2}$. In combination with the distributions of $G$ and $A$, we compute context-aware representations for the context images and target images, denoted by $\{z_i\}_{i \in \mathcal{C}}$ and $\{z_j\}_{j \in \mathcal{T}}$, respectively:

$$p(z_i \mid U_{\mathcal{C}}, V_{\mathcal{C}}) = \int_G \mathbb{I}_{\{\tilde{G}_i V_{\mathcal{C}}\}}(z_i) \, dP(G \mid U_{\mathcal{C}}), \tag{2.3}$$

$$p(z_j \mid U_{\mathcal{T}}, U_{\mathcal{C}}, V_{\mathcal{C}}) = \int_A \mathbb{I}_{\{\tilde{A}_j V_{\mathcal{C}}\}}(z_j) \, dP(A \mid U_{\mathcal{T}}, U_{\mathcal{C}}), \tag{2.4}$$

where $\tilde{G}_i$ and $\tilde{A}_j$ indicate the $i$-th and $j$-th rows of $G$ and $A$, each normalized to sum to 1, and $\mathbb{I}_S(\cdot)$ denotes the indicator function on a set $S$.
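The sampling and aggregation in Equations (2.1)-(2.4) can be illustrated with a minimal NumPy sketch. It is not the dissertation's implementation: the function names, the toy dimensions, and the zero-vector fallback for rows that draw no edges are assumptions made for illustration, and the sketch draws hard Bernoulli samples rather than a differentiable relaxation.

```python
import numpy as np


def rbf_edge_probs(u_rows, u_cols, tau):
    """Bernoulli means kappa_tau(u, u') = exp(-tau / 2 * ||u - u'||^2) for every pair."""
    sq_dists = ((u_rows[:, None, :] - u_cols[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * tau * sq_dists)


def sample_graph(probs, rng):
    """Draw one binary adjacency matrix with independent Bernoulli edges."""
    return (rng.random(probs.shape) < probs).astype(float)


def aggregate(edges, v_context):
    """Row-normalize the adjacency matrix and average the context representations.

    Rows without any sampled edge map to the zero vector (an assumption of this sketch).
    """
    degree = edges.sum(axis=1, keepdims=True)
    weights = np.divide(edges, degree, out=np.zeros_like(edges), where=degree > 0)
    return weights @ v_context


rng = np.random.default_rng(0)
u_context = rng.normal(size=(5, 3))  # hypothetical embeddings U_C: |C| = 5, d_1 = 3
u_target = rng.normal(size=(2, 3))   # hypothetical embeddings U_T: |T| = 2
v_context = rng.normal(size=(5, 4))  # hypothetical representations V_C: d_2 = 4

probs_a = rbf_edge_probs(u_target, u_context, tau=1.0)  # means of A, Eq. (2.2)
sample_a = sample_graph(probs_a, rng)                   # one draw of A
z_target = aggregate(sample_a, v_context)               # one draw of z_j, Eq. (2.4)
```

Averaging predictions computed from several independent draws of `sample_a` gives a Monte Carlo estimate of the integral over graphs; the context graph $G$ is handled the same way with `rbf_edge_probs(u_context, u_context, tau)`.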
Intuitively, the representations $V_{\mathcal{C}}$ are linearly weighted by each graph sample, and the normalization step ensures proper scaling in case the numbers of edges formed with the context images vary. Once we summarize each image by the context samples, a final network $f_{\theta_3} : \mathbb{R}^{d_2} \to \mathcal{Y}$ takes as input the context-aware representations and produces predictive distributions:

$$p(y_i \mid X_{\mathcal{C}}) = \int_{z_i} p\big(y_i \mid f_{\theta_3}(z_i)\big) \, dP(z_i \mid U_{\mathcal{C}}, V_{\mathcal{C}}), \tag{2.5}$$

$$p(y_j \mid x_j, X_{\mathcal{C}}) = \int_{z_j} p\big(y_j \mid f_{\theta_3}(z_j)\big) \, dP(z_j \mid U_{\mathcal{T}}, U_{\mathcal{C}}, V_{\mathcal{C}}). \tag{2.6}$$

Since the numbers of random binary graphs $G$ and $A$ are exponential, we approximate the integrals in Equations (2.3)-(2.6) by Monte Carlo samples. More specifically, we use one sample of $G$ and $A$ during training to reduce training time, and 30 samples of $A$ at test time for more accurate representations of the graph distributions. These graph samples are inherently non-differentiable, so we use the Gumbel-Softmax relaxations of the Bernoulli random variables during training [99, 150]. The degree of approximation is controlled by temperature hyperparameters, which exert significant influence over the density of the graph samples. We find that a small temperature for $G$ and a larger temperature for $A$ work well in practice.

There are several reasons for making the graphs $G$ and $A$ random. First, the stochasticity induced by the Bernoulli random variables allows us to output multiple predictions and average them, and such ensemble techniques have been quite successful in continual learning settings [50, 64]. Perhaps more importantly, we find that the deterministic version, with the Bernoulli random variables replaced by their parameters, results in very sparse graphs where samples from the same classes are often deemed dissimilar. In a similar fashion to dropout [215], the random edges encourage the model to be less reliant on a few particular edges and therefore promote knowledge transfer between samples.
By a similar reasoning, we remove self-edges in the context graph and also observe more connections between samples.

Graph Regularization. As training switches to new tasks, the distributional shifts to the target images necessarily result in changes to both the context graph $G$ and the context-target graph $A$. In addition, the context images are regularly updated to be representative of the data distribution up to that point, so any well-learned connections between the context images are also susceptible to catastrophic forgetting. As a remedy, we save the parameters of the Bernoulli edges to the episodic memory in conjunction with the context images and context labels, and introduce a regularization term that discourages the model from forgetting previously learned edges:

\begin{equation}
\mathcal{L}_G^{(b)}(\theta_1) \triangleq \frac{1}{|I^{(b)}|} \, \ell\Big( p\big(G^{(b-1)}_{I^{(b)}}\big), \, p\big(G^{(b)}_{I^{(b)}}\big) \Big). \tag{2.7}
\end{equation}

Here, $\ell(\cdot, \cdot)$ denotes the cross-entropy between two probability distributions, $I^{(b)}$ the index set of edges to be regularized in the $b$-th minibatch, and $G^{(b-1)}$ the adjacency matrix learned from the beginning up to the previous minibatch. The selection strategies $I^{(b)}$ are discussed in the next subsection. Besides the regularization term, our training objective includes two other cross-entropy losses, one for the context images and another for the target images:

\begin{equation}
\mathcal{L}(\theta_1, \theta_2, \theta_3) = \frac{\lambda_C}{|C|} \sum_{i \in C} \ell\big(y_i, \hat{y}_i^{(s)}\big) + \frac{\lambda_T}{|T|} \sum_{j \in T} \ell\big(y_j, \hat{y}_j^{(s)}\big) + \lambda_G \mathcal{L}_G^{(b)}(\theta_1), \tag{2.8}
\end{equation}

where $\hat{y}_i^{(s)} = f_{\theta_3}(z_i^{(s)})$, $\hat{y}_j^{(s)} = f_{\theta_3}(z_j^{(s)})$, the samples $z_i^{(s)} \sim p(z_i \mid U_C, V_C)$ and $z_j^{(s)} \sim p(z_j \mid U_T, U_C, V_C)$ are from Equation 2.3 and Equation 2.4, and $\lambda_C$, $\lambda_T$, $\lambda_G$ are hyperparameters. While the graph regularization term appears similar to knowledge distillation [90], we emphasize that the former aims to preserve the covariance structures between the outputs of the image encoder $f_{\theta_1}$ rather than the outputs themselves.
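As an illustration of the regularizer in Equation 2.7, the sketch below computes the cross-entropy between the edge probabilities saved in memory and the current ones, averaged over the selected edges. Treating each edge as a Bernoulli distribution, representing the index set as a boolean mask, and returning zero when no edge is selected are assumptions of this sketch.

```python
import numpy as np

def graph_regularizer(saved_probs, current_probs, edge_mask, eps=1e-8):
    """Average Bernoulli cross-entropy over the edges selected by edge_mask.

    saved_probs:   edge probabilities stored in the episodic memory
    current_probs: edge probabilities under the current encoder
    edge_mask:     boolean array marking the edges to regularize
    """
    if not edge_mask.any():
        return 0.0  # assumption: nothing to regularize in this minibatch
    p = saved_probs[edge_mask]
    q = np.clip(current_probs[edge_mask], eps, 1.0 - eps)
    cross_entropy = -(p * np.log(q) + (1.0 - p) * np.log1p(-q))
    return float(cross_entropy.mean())

saved = np.array([[0.9, 0.1], [0.2, 0.8]])
drifted = np.array([[0.5, 0.5], [0.5, 0.5]])
mask = np.ones_like(saved, dtype=bool)
penalty_if_unchanged = graph_regularizer(saved, saved, mask)
penalty_if_drifted = graph_regularizer(saved, drifted, mask)
```

The penalty is smallest when the current edge probabilities match the saved ones, so the term pulls the encoder back toward previously learned similarities without constraining the embeddings directly.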
We believe that in light of new data, the image encoder should be able to update its potentially superficial representations of previously seen samples as long as it keeps the correlations between them unchanged. Indeed, some of the early regularization approaches based on knowledge distillation [131, 192] are sometimes too restrictive and underperform in certain scenarios [110].

Task-Free Knowledge Consolidation. When task identities are not available, we use reservoir sampling [231] to update the episodic memory as in [195]. The sampling strategy takes as input a stream of data and randomly replaces a context sample in the episodic memory with a target sample with probability inversely proportional to the number of observed samples. Despite its simplicity, reservoir sampling has been shown to yield strong performance recently [40, 195, 197]. While most prior work uses task boundaries to perform knowledge consolidation at the end of each task [117, 192], we update the context graph in memory after every minibatch of training data. In addition, such updates are performed at the sample level to maximize flexibility; we keep track of the cross-entropy loss on each context sample and only update its edges in the graph when the model reaches a new low (denoted by $I^{(b)}$ previously). Intuitively, the loss measures how well the model has learned the context image through the connections it forms with others, so meaningful relations are most likely obtained at the bottom of the loss surface. Though samples from the same task often provide more support for each other, the task-agnostic mechanism for updating the context graph also allows for knowledge transfer across tasks when necessary.

Memory and Time Complexity. The inclusion of pairwise similarities and graph regularization results in time and memory complexities of $O(|M|^2 + |M|N)$ and $O(|M|^2)$, respectively, where $|M|$ denotes the size of the episodic memory and $N$ the batch size for target images.
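The reservoir sampling update used for the episodic memory can be sketched as follows. This is the standard algorithm rather than the exact code behind the experiments, and the function names are hypothetical. Once the memory is full, the n-th observed sample is stored with probability |M|/n, so every sample seen so far is equally likely to be in memory.

```python
import random

def reservoir_update(memory, sample, num_seen, capacity, rng):
    """Insert `sample` into a fixed-size memory via reservoir sampling.

    num_seen counts the samples observed so far, including `sample`.
    Returns the slot that was written, or None if the sample was discarded.
    """
    if len(memory) < capacity:
        memory.append(sample)
        return len(memory) - 1
    slot = rng.randrange(num_seen)  # uniform over 0, ..., num_seen - 1
    if slot < capacity:             # happens with probability capacity / num_seen
        memory[slot] = sample
        return slot
    return None

def fill_memory(stream, capacity, rng):
    """Run the update over a whole stream and return the final memory."""
    memory = []
    for num_seen, sample in enumerate(stream, start=1):
        reservoir_update(memory, sample, num_seen, capacity, rng)
    return memory

memory = fill_memory(range(1000), capacity=10, rng=random.Random(0))
```

The returned slot index is convenient in this setting because the stored edges of a replaced context sample have to be reset as well.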
The quadratic costs in $|M|$, however, are not concerning in practice, as we deliberately use a small, fixed-size episodic memory. The cost of storing $G$ is often dwarfed by the memory required for storing high-dimensional images, as each edge only needs one floating point number (see Appendix A for more details on memory usage).

2.3 Related Work

Continual Learning Approaches. The existing work on continual learning mostly falls into three categories: regularization, expansion, and rehearsal. Regularization approaches alleviate catastrophic forgetting by penalizing changes in model weights that are important for past tasks. Different measures of weight importance are considered, including Fisher information [37, 117], synaptic relevance [259], and uncertainty estimates [59]. The constraints on weight updates can also be studied from Bayesian perspectives, where the posterior distribution of the weights is approximated and used as the prior for the next task [167, 196, 220]. These regularization methods are efficient in memory and computational usage but suffer from brittleness due to representation drift [220].

Expansion approaches dynamically allocate additional task-specific neural resources as more tasks arrive. [200], for example, blocks changes to parameters learned for previous tasks and expands sub-networks while [254] performs neuron splitting or duplication upon arrival of new tasks. Recently, non-parametric Bayesian approaches use Dirichlet process mixture models to expand a set of neural networks in a principled way [102, 127]. By design, these dynamic architectures prevent forgetting but quickly result in considerable model complexity.

Instead of growing model capacity, rehearsal approaches maintain a small episodic memory of previous data or, alternatively, train a generative model to produce pseudo-data for past tasks, which are then replayed and interleaved with samples from the new task.
Such generative models [1, 32, 110, 173, 207] reduce working memory effectively, but they are also susceptible to catastrophic forgetting and invoke the complexity of the generative task [174]. In contrast, episodic memory approaches are simpler and remarkably effective against forgetting [197, 247]. [143] and [39], for example, use an episodic storage of past data to impose inequality constraints on gradient updates while [192] constructs exemplars for knowledge distillation and nearest neighbor search. Recently, it has been shown that simple replay techniques and optimization-based meta-learning on the episodic memory outperform many previous approaches in online settings [38, 40, 85, 195]. Our model is also based on experience replay, but it differs from the other approaches in the way the episodic memory is handled.

Task-Free Continual Learning. In real-world scenarios, task changes are often unknown and definitive boundaries between tasks do not always exist. However, most methods mentioned above rely on explicit task identities or task boundaries to consolidate knowledge or select sub-modules for task adaptation. Despite its significance, there are only a few works that address task-free continual learning. While [7] heuristically detects peaks in the loss surface to consolidate knowledge, [6, 8] remove the need for task boundaries by a sample selection strategy for the episodic memory. Recently, the aforementioned non-parametric approaches train density estimators to detect task boundaries and perform model expansion [127, 188]. In contrast, our approach uses reservoir sampling [231] to update the episodic memory, similar to [40, 195].

Learning with Random Graphs. Although widely studied in graph theory [243], random graphs appear sparingly in the machine learning literature.
Our work is most closely related to previous work on functional neural processes [145], where the authors build random graphs of dependencies to represent relational structures between context points in a stochastic process. Our approach is different in that (1) the random graphs are undirected and grow incrementally, (2) no variational inference is required, and (3) it addresses catastrophic forgetting and performs well under continual learning settings.

Attention Mechanism. While we motivate our approach from a graphical perspective, it can be considered as a form of attention mechanism. In particular, the context graph $G$ represents self-attention [226] across context images, and the context-target graph $A$ represents cross-attention [12] between context images and target images. Though advanced mechanisms such as multi-head attention have been successful in many stationary settings [113, 214, 226, 251, 260], we note that naive applications of such techniques in online continual learning suffer from catastrophic forgetting due to representation drift when training switches to new tasks. In contrast, our model employs random attention, which makes it more robust to such distributional shifts (see Figure 2.2).

2.4 Experiment Results

In this section, we evaluate our model on commonly used continual learning benchmarks. Additional results and details about the datasets, experiment setup, model architectures, and result analyses are available in the appendices.

Experiment Setup. We perform experiments on 6 image classification datasets: PERMUTED MNIST, ROTATED MNIST [125], SPLIT SVHN [166], SPLIT CIFAR10 [121], SPLIT CIFAR100 [121], and SPLIT MINIIMAGENET [230]. For each dataset, we follow [39, 143] and adopt the setting where the model only has access to an online stream of data with a batch size of 10 (see Appendix A for more details). We consider both single-head and multiple-head settings.
More specifically, we use single-head and one-epoch settings for our model and all baselines on PERMUTED MNIST, ROTATED MNIST, SPLIT SVHN, and SPLIT CIFAR10. While most previous work [40, 143, 192] assumes task identities on SPLIT CIFAR10, we require all models to perform 10-way classification on each task with the same output head. This variant is more practical and challenging due to the need for incremental knowledge consolidation across tasks. In addition, we also report results for multiple-head and 10-epoch settings on SPLIT CIFAR100 and SPLIT MINIIMAGENET, following [143]. These datasets have more classes and fewer samples per class, rendering them too challenging for single-head settings.

Model Architecture. Our image encoders $f_{\theta_1}$ and $f_{\theta_2}$ partially share weights and are parametrized by an MLP on the MNIST variants and a simple 6-layer convolutional network on other datasets, each followed by a ReLU activation and a separate linear mapping. We use an RBF kernel to compute similarities between image embeddings and find it sufficiently easy for initialization. The output mappings $f_{\theta_3}$ are MLPs in all cases (see Appendix A for more details).

Baselines. We benchmark our model against multiple models, including (1) Finetune, a popular baseline, naively trained on the data stream; (2) EWC [117], an early regularization approach; (3) GEM [143], a rehearsal approach based on an episodic memory of parameter gradients; (4) ER [40], a simple yet competitive experience replay method based on reservoir sampling; (5) MER [195], a rehearsal approach inspired by optimization-based meta-learning; and (6) ICARL [192], another well-known rehearsal strategy. Most of these baselines share the same model architectures: an MLP with two hidden layers on the MNIST variants, and a ResNet-18 [87] on SPLIT SVHN and SPLIT CIFAR10, following [143] (see Appendix A for more details).

Metrics.
Following [37, 40, 143], we evaluate the models using two classification metrics, namely, average accuracy and average forgetting:

\begin{equation}
\mathrm{ACC} \triangleq \frac{1}{T} \sum_{i=1}^{T} R_{T,i}, \qquad \mathrm{FGT} \triangleq -\frac{1}{T-1} \sum_{i=1}^{T-1} \big(R_{T,i} - R_{i,i}\big), \tag{2.9}
\end{equation}

where $R_{i,j}$ denotes the test accuracy on task $j$ after the model has finished task $i$. Intuitively, the former measures the average test accuracy across all tasks while the latter measures the average decrease between each task's peak accuracy and its accuracy at the end of continual learning.

Classification Performance. Table 2.1 and Table 2.2 show the overall experimental results, and the evolution of performance as a function of the number of tasks is detailed in Figure 2.3. In every setting, our model (GCL) outperforms the baselines by significant margins, and the gains in performance are especially substantial on complex datasets such as SPLIT CIFAR10 or SPLIT CIFAR100. As noted by [39], EWC [117] performs poorly without multiple passes over the datasets, and GEM [143] is not very effective under the single-head variants (e.g. on SPLIT CIFAR10).

Table 2.1: Classification results (%) on PERMUTED MNIST, ROTATED MNIST and SPLIT SVHN. The means and standard deviations are computed over five runs using different random seeds. When used, episodic memories contain 5 samples per class on average. The symbol ↑ (↓) indicates that a higher (lower) number is better.

            PERMUTED MNIST                ROTATED MNIST                 SPLIT SVHN
Method      ACC (↑)        FGT (↓)        ACC (↑)        FGT (↓)        ACC (↑)        FGT (↓)
Finetune    60.19 ± 2.31   23.62 ± 1.98   43.80 ± 1.64   46.52 ± 1.71   18.85 ± 0.10   94.78 ± 1.24
EWC         64.94 ± 1.22   18.33 ± 1.07   44.99 ± 1.73   44.98 ± 1.95   18.76 ± 0.27   94.99 ± 1.23
GEM         79.17 ± 0.70    3.68 ± 0.68   82.60 ± 0.48    5.47 ± 0.45   33.40 ± 3.27   68.91 ± 4.06
ER          79.90 ± 0.46    3.78 ± 0.45   80.82 ± 0.68    6.78 ± 0.69   45.41 ± 3.03   62.37 ± 4.33
MER         79.68 ± 0.42    3.47 ± 0.41   83.56 ± 0.23    8.14 ± 0.46   -              -
GCL         82.36 ± 0.36    2.92 ± 0.23   86.37 ± 0.32    3.22 ± 0.50   60.68 ± 1.67   21.86 ± 2.35
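For reference, the two metrics in Equation 2.9 can be computed from the matrix of test accuracies as in the sketch below. The accuracy values are made up for illustration; entry `[i, j]` holds the accuracy on task j after training on task i, with zero-based indices.

```python
import numpy as np

def average_accuracy(acc):
    """ACC: mean accuracy over all tasks after training on the last task."""
    return float(acc[-1].mean())

def average_forgetting(acc):
    """FGT: mean drop from the accuracy right after learning a task to the final accuracy."""
    num_tasks = acc.shape[0]
    final = acc[-1, : num_tasks - 1]
    after_learning = np.diag(acc)[: num_tasks - 1]
    return float(-(final - after_learning).mean())

# Hypothetical accuracies (%) for three tasks; entries above the diagonal are unused.
acc = np.array([
    [90.0,  0.0,  0.0],
    [80.0, 85.0,  0.0],
    [70.0, 75.0, 88.0],
])
acc_score = average_accuracy(acc)    # (70 + 75 + 88) / 3
fgt_score = average_forgetting(acc)  # ((90 - 70) + (85 - 75)) / 2 = 15.0
```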
Task-free approaches such as ER perform more favorably, and such findings are consistent with recent studies [40, 195]. The advantageous performance of GCL can be attributed to its efficient use of the episodic memory. Figure 2.4 shows that both ER [40] and GCL benefit from increases in memory size, but the advantage of GCL is more visible under the low-resource regime. Sample efficiency is especially important since the memory constraints are not relaxable despite the growing complexity of the data distribution during training. It is also worth emphasizing that although our model takes more time to train and evaluate at test time than ER, its training time and testing time are comparable to other approaches (see Figure 2.5).

Figure 2.3: Average accuracy as a function of the number of tasks trained. (Panels: PERMUTED MNIST and SPLIT CIFAR100; x-axis: Number of Tasks; y-axis: Average Accuracy (%); methods: Finetune, EWC, GEM, ER, MER, GCL.)

Table 2.2: Classification results (%) on SPLIT CIFAR10, SPLIT CIFAR100 and SPLIT MINIIMAGENET. The means and standard deviations are computed over five runs using different random seeds. When used, episodic memories contain 5 samples per class on average. The symbol ↑ (↓) indicates that a higher (lower) number is better.

            SPLIT CIFAR10                 SPLIT CIFAR100                SPLIT MINIIMAGENET
Method      ACC (↑)        FGT (↓)        ACC (↑)        FGT (↓)        ACC (↑)        FGT (↓)
Finetune    18.46 ± 0.12   86.48 ± 1.02   55.39 ± 1.94   25.94 ± 1.89   37.84 ± 0.87   31.41 ± 1.57
EWC         18.49 ± 0.13   86.95 ± 1.15   55.60 ± 1.11   23.53 ± 1.19   36.61 ± 2.06   28.17 ± 4.49
ICARL       -              -              58.08 ± 1.44   24.22 ± 1.35   -              -
GEM         22.88 ± 3.41   76.90 ± 5.53   65.66 ± 0.70   15.52 ± 0.41   54.06 ± 0.22   13.17 ± 0.74
ER          29.94 ± 3.08   72.64 ± 4.88   69.40 ± 1.21   11.25 ± 1.24   58.74 ± 0.74    9.02 ± 2.49
GCL         49.62 ± 1.85   35.69 ± 3.33   74.51 ± 0.99    6.54 ± 1.26   61.54 ± 0.57    6.10 ± 2.73

Learned Graphs.
Central to our approach are the pairwise similarities between context images captured by the context graph $G$. Figure 2.6 shows a continuous realization of the context graph at the end of continual learning on SPLIT CIFAR10, which has been sorted according to context labels placed underneath the adjacency matrix. Despite being trained exclusively on two classes of target images at a time (e.g., plane & car or bird & cat), the model appears to learn the clustering structure of images relatively well with more pronounced edges formed within classes than across them. The edges across tasks are noisier, but some edges indicate intuitive visual similarities such as those between images of car and truck. We note that the 10-way classification setup in each task encourages the model to clear inter-class edges, so the degree of knowledge transfer across tasks is understandably more subtle.

Figure 2.4: Memory size effects. (Panel: SPLIT CIFAR10; x-axis: Memory Size; y-axis: Average Accuracy (%); methods: ER, GCL.)

Figure 2.5: Wall-clock running time. (Panel: SPLIT CIFAR100; x-axis: Method; y-axes: Total Training Time (s) and Testing Time per Sample (ms).)

Figure 2.6: Context graph G. (Adjacency matrix sorted by context label: plane, car, bird, cat, deer, dog, frog, horse, ship, truck.)

Figure 2.7: Graph regularization ($\lambda_G$). (ACC and FGT for $\lambda_G \in \{0, 10, 50, 100, 1000\}$; y-axes: Average Accuracy (%) and Average Forgetting (%).)

Ablation Study. We further investigate our model performance with an ablation study and summarize it in Table 2.3. Without the graph regularization term in Equation 2.7, the model performs significantly worse, indicating that past connections between context samples can help alleviate catastrophic forgetting. By varying the hyper-parameter $\lambda_G$, we also see from Figure 2.7 that an extreme amount of graph regularization (e.g. $\lambda_G = 1000$) can have detrimental effects on the model performance as well. As alluded to earlier, the ability to draw multiple graph samples and average their predictions at test time brings some gains, as is often the case with ensemble methods. Perhaps more importantly, we find that making the context graph $G$ and the context-target graph $A$ deterministic results in a dramatic drop in accuracy. The resulting model is a variant of the attention mechanism, most similar to the attentive neural process [113], and as discussed in Section 2.3, such a deterministic model often relies on a handful of edges, all of which are also prone to distributional shifts and thus catastrophic forgetting.

Table 2.3: Ablation study on SPLIT CIFAR10.

Graph regularization      ✓       ×       ×       ×
Multiple graph samples    ✓       ✓       ×       ×
Random G & A              ✓       ✓       ✓       ×
Deterministic G & A       ×       ×       ×       ✓
Average accuracy          49.62   44.04   42.08   30.50

2.5 Conclusion and Discussion

In this chapter, we have introduced a graph-based approach to continual learning that exploits pairwise similarities between samples to support knowledge transfer. Based on the learned graphs, we derive a regularization term to guide the training of new tasks against catastrophic forgetting. Our model demonstrates an efficient use of the episodic memory, and as a result, performs competitively under various settings, in some cases without requiring access to task definitions either during training or at test time.

As graph-based approaches naturally describe relational inductive biases [16], we hope future works further examine the applications of graphs under continual learning settings. If trained well, these graphs can be used not only to share knowledge but also to minimize interference between samples and tasks. A promising direction, for example, is to pose the problem of updating the episodic memory as a graph search and leverage the rich literature on graph theory to devise better strategies for sample selection.
As shown in previous works [8, 97], such selection mechanisms can be effective against catastrophic forgetting, especially when the data distribution is not balanced across tasks.

CHAPTER 3
PROBABILISTIC TRANSFORMER FOR TIME SERIES PREDICTION¹

¹ Joint work with David S. Matteson.

Generative modeling of multivariate time series is a challenging problem with wide-ranging applications in demand forecasting [34, 202], autonomous driving [4, 35], robotics [65, 170], and health care [47, 48, 140]. Despite remarkable progress in recent years, models that predict high-dimensional future observations from a few past examples have remained intractable, partly due to the complex, non-deterministic temporal dynamics across long-distance time steps. Given a sequence of human poses, for example, such models must internally figure out the involved dynamics of various body components across space and time while maintaining the inherent uncertainty of multiple plausible futures, even though only one such future is observed.

Among proposed probabilistic approaches, state space models (SSMs) provide a principled framework for learning and drawing inference from sequential inputs [58, 163]. While autoregressive models feed their predictions back into the dynamics model without any compressed representation of data, SSMs model stochastic transitions between abstract states using latent variables, allowing for efficient state-to-state sampling without the need to render high-dimensional observations. Gaussian linear dynamical systems (LDSs), one of the best known SSMs [244], for example, postulate linear state transitions and enjoy exact inference via the celebrated Kalman filter algorithm.

While early extensions of LDSs focus on linearization [100] and unscented transform [237], recent work that marries state space models with deep neural networks offers much more flexibility to model complex dependencies across different time steps.
Some approaches retain the Markovian dynamics of LDSs and only replace their linear observation models with feed-forward networks [51, 66, 108, 186], whereas others favor nonlinear state transitions and parametrize such dependencies via recurrent neural networks (RNNs) [49, 51, 67, 83, 119, 201]. Despite differences, both Markovian transitions and RNNs are often not capable of capturing long-range dependencies in highly structured sequential inputs [79, 265], limiting the capacity of the corresponding SSMs.

In this chapter, we propose to combine the complementary strengths of SSMs and transformer architectures [226], a powerful mechanism for modeling long-term interactions that enjoys success across a variety of sequence modeling tasks [57, 111, 261]. In contrast to most SSMs, our models make extensive use of the attention mechanism [12, 226] between latent variables to model non-Markovian dynamics (see Figure 3.1). Compared to transformer-based methods, our models are probabilistic, non-autoregressive, and capable of generating diverse long-term forecasts with uncertainty estimates.

Our main contributions are threefold. First, we propose novel SSMs based on transformer architectures for multivariate time series, which include generative models and inference procedures based on variational inference [116, 194]. Second, we extend our models to include several layers of stochastic latent variables organized in a hierarchy for further expressiveness. Third, we conduct extensive experiments on time series forecasting and human motion prediction and demonstrate that our Probabilistic Transformer (ProTran) performs remarkably well compared to various state-of-the-art baselines.

3.1 State Space Models

Notations & Objective. Let $\{x^{(i)}_{1:T}\}_{i=1}^{N}$ be a collection of $N$ univariate time series of length $T$ where $x^{(i)}_t \in \mathbb{R}$ denotes the scalar value of the $i$-th time series $x^{(i)}_{1:T} = (x^{(i)}_1, x^{(i)}_2, \cdots, x^{(i)}_T)$ at time $t$.
For convenience, we consider the multivariate form $x_{1:T} = (x_1, x_2, \ldots, x_T)$ where $x_t = (x^{(1)}_t, \ldots, x^{(N)}_t) \in \mathbb{R}^N$. Conditioning on the multivariate time series up to time $C$ ($C < T$), we aim to produce distribution forecasts into the future $p(x_{C+1:T} \mid x_{1:C})$.

General Formulation. State space models (SSMs) achieve such a goal by assuming the joint distribution $p_\theta(x_{1:T})$, parametrized by $\theta$, can be written as

\begin{equation}
p_\theta(x_{1:T}) = \int p_\theta(z_{1:T}) \, p_\theta(x_{1:T} \mid z_{1:T}) \, dz_{1:T}, \tag{3.1}
\end{equation}

where $z_{1:T} = (z_1, z_2, \ldots, z_T)$ is a sequence of latent variables, sometimes referred to as states. In other words, each SSM is a generative model that can be decomposed into a transition model $p_\theta(z_{1:T})$ between the latent variables and an emission model $p_\theta(x_{1:T} \mid z_{1:T})$ from the latent variables to the observable outputs. As a result, the forecast distribution $p(x_{C+1:T} \mid x_{1:C})$ can be computed by marginalizing out all the latent variables $z_{1:T}$:

\begin{equation}
p_\theta(x_{C+1:T} \mid x_{1:C}) = \int p_\theta(z_{1:T} \mid x_{1:C}) \, p_\theta(x_{C+1:T} \mid z_{1:T}, x_{1:C}) \, dz_{1:T}. \tag{3.2}
\end{equation}

Linear Dynamical Systems. Linear dynamical systems (LDSs), for example, assume that both transition models and emission models are linear-Gaussian²:

\begin{equation}
p_\theta(z_{1:T}) = \prod_{t=1}^{T} \mathcal{N}(z_t \mid A_t z_{t-1}, Q_t), \qquad p_\theta(x_{1:T} \mid z_{1:T}) = \prod_{t=1}^{T} \mathcal{N}(x_t \mid C_t z_t, R_t), \tag{3.3}
\end{equation}

where $\theta = (A_t, C_t, Q_t, R_t)$ are learnable parameters. The Gaussianity assumption and the linear dependencies via the transition matrix $A_t$ and the emission matrix $C_t$ enable exact inference, where we can alternately perform prediction and update steps with closed forms of $p(z_t \mid x_{1:t-1})$ and $p(z_t \mid x_{1:t})$, respectively [163]. Despite their simplicity and efficiency, LDSs are unsuited for applications with complex transition dynamics or observable outputs due to such strong assumptions on the model components.

² For notational simplicity, we assume $z_0 = \emptyset$ and $p(\cdot \mid \emptyset) = p(\cdot)$.

Figure 3.1: Graphical model representations of linear dynamical systems (LDSs) in (a), and our proposed models (ProTran) in (b), (c), and (d). Black arrows denote the generative mechanism and red arrows the inference procedure. The separation of generation and inference in (c) and (d) is for readability. While traditional SSMs such as LDSs are limited to Markovian dynamics and linear dependencies, our models allow for non-Markovian and non-linear interactions between time steps via the attention mechanism. A multi-layer extension of our models further increases expressiveness without compromising the tractable inference procedure. (Panels: (a) LDS; (b) ProTran (1 layer); (c) ProTran Generation (3 layers); (d) ProTran Inference (3 layers).)

Our Model Assumptions. In contrast to LDSs, we allow the latent variables to exhibit non-Markovian dynamics auto-regressively. The forecast distribution can be decomposed into³

\begin{align}
p_\theta(z_{1:T} \mid x_{1:C}) &= \prod_{t=1}^{T} p_\theta(z_t \mid z_{1:t-1}, x_{1:C}), \tag{3.4} \\
p_\theta(x_{C+1:T} \mid z_{1:T}, x_{1:C}) &= \prod_{t=C+1}^{T} p_\theta(x_t \mid z_t). \tag{3.5}
\end{align}

³ Similarly, we assume $x_0 = z_{1:0} = \emptyset$ for notational convenience.

As illustrated in Figure 3.1(b), the latent variable $z_{t+1}$ depends not only on $z_t$ but also on all of its preceding latent variables, including $z_{t-1}$. In addition, our transition and emission models allow for non-linearity via neural network parametrizations. These assumptions aim to maximize model capacity for real-world applications with complex emissions or temporal dependencies.

However, we note that neither $x_{1:t-1}$ nor $z_{1:t-1}$ is included in the emission model $p(x_t \mid z_{1:T}, x_{1:C})$. Such assumptions are important, as it has been argued previously that a leakage of information from the latent space in autoregressive models can hinder long-term predictions [51, 108].
While all ground truth observations are available during training, the entire sequence has to be generated sequentially at test time, making the dependencies on $x_{1:t-1}$ prone to accumulated errors over multiple time steps. By letting the latent variable $z_t$ capture all information needed to render $x_t$, we also avoid the computational costs associated with repeatedly decoding and encoding $x_t$ in multi-step predictions.

3.2 Transformer Architectures

Attention Mechanism. Central to our models and other transformer-based approaches [111, 226] is the notion of attention [12], which allows the models to focus on important parts within a context in an analogous fashion to human visual attention (see Figure 3.2). The concept can be broadly interpreted as a vector of importance weights: in order to predict or infer one element, such as a time series forecast or a word in a caption, we estimate using the attention vector how strongly it is correlated with (or "attends to") other elements, such as previously observed time series or image pixels, and take the sum of their values weighted by the attention vector as the approximation of the target.

Figure 3.2: Visualizations of attention weights in an image captioning task. The model sequentially generates words in the shown caption by focusing on the corresponding salient regions in the image depicted with different colors.

Multi-head attention, for example, maps a sequence of queries $Q \in \mathbb{R}^{\ell_q \times d}$ of length $\ell_q$ to a sequence of outputs $O = [O_1, \ldots, O_H] \in \mathbb{R}^{\ell_q \times d}$ of the same size by attending over $\ell_k$ given key-value pairs $K \in \mathbb{R}^{\ell_k \times d}$, $V \in \mathbb{R}^{\ell_k \times d}$:

\begin{equation}
O_h = \mathrm{Attention}(Q_h, K_h, V_h) = \mathrm{Softmax}\left(\frac{Q_h K_h^\top}{\sqrt{d}}\right) V_h, \tag{3.6}
\end{equation}

where $Q_h = Q W^Q_h$, $K_h = K W^K_h$, $V_h = V W^V_h$ are projected queries, keys, and values corresponding to head $h \in [1, H]$ with learnable parameters $W^Q_h$, $W^K_h$, $W^V_h$, respectively (see Figure 3.3).
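A minimal NumPy sketch of Equation 3.6 is given below. It assumes that each head projects to d / H dimensions so that the concatenated output has the same width as the input, and it omits any final output projection; both choices, along with the function names and toy dimensions, are assumptions of this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def multi_head_attention(queries, keys, values, w_q, w_k, w_v):
    """Multi-head attention with per-head projections of shape (H, d, d_head).

    Returns the concatenated outputs (l_q, H * d_head) and the attention
    weights (H, l_q, l_k).
    """
    d = queries.shape[-1]
    outputs, weights = [], []
    for h in range(w_q.shape[0]):
        q_h, k_h, v_h = queries @ w_q[h], keys @ w_k[h], values @ w_v[h]
        attn = softmax(q_h @ k_h.T / np.sqrt(d))  # scaling as written in Equation 3.6
        outputs.append(attn @ v_h)
        weights.append(attn)
    return np.concatenate(outputs, axis=-1), np.stack(weights)

rng = np.random.default_rng(0)
num_heads, d, l_q, l_k = 2, 8, 4, 6
queries = rng.normal(size=(l_q, d))
keys = rng.normal(size=(l_k, d))
values = keys  # keys and values often come from the same sequence
w_q, w_k, w_v = (rng.normal(size=(num_heads, d, d // num_heads)) for _ in range(3))
outputs, attn_weights = multi_head_attention(queries, keys, values, w_q, w_k, w_v)
```

Each row of the attention weights sums to one, so every output is a convex combination of the projected values.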
Here, the correlations between queries and keys are computed via the matrix multiplication $Q_h K_h^\top$, and the Softmax operator outputs the attention weights to be mapped with the corresponding values $V_h$. In case $Q = K = V$, we refer to such an attention mechanism as self-attention.

Figure 3.3: Multi-head attention [226]. (Panels: (left) Scaled Dot-Product Attention; (right) Multi-Head Attention, which consists of several attention layers running in parallel.)

Given fully observed sequences of inputs, the mapping can be computed efficiently without any imposed sequential order often seen in recurrent neural networks [46, 92]. More importantly, the direct connections between long-distance time steps are baked into the mechanism as information from previous time steps is easily accessible without being compressed into a fixed representation, which eases optimization and learning of long-term dependencies [12, 226].

Positional Embeddings. Without recurrence, Transformer [226] encodes the arrow of time by associating each time step $t$ with a predefined sinusoidal positional embedding:

\begin{equation}
\mathrm{Position}(t) = [p_t(1), \ldots, p_t(d)] \in \mathbb{R}^d, \tag{3.7}
\end{equation}

where the $i$-th embedding $p_t(i) = \sin(t \cdot c^{i/d})$ for even $i$ and $p_t(i) = \cos(t \cdot c^{i/d})$ for odd $i$ and some large constant $c$. Empirical results show that such positional embeddings are also important to our models.

3.3 Probabilistic Transformer

In this section, we first present our single-layered model and subsequently its multi-layered extension for a hierarchy of stochastic latent variables.
As alluded to earlier, our model consists of a generative model and an inference model that share information and parameters extensively.

3.3.1 Single-Layered Probabilistic Transformer

Generative Model. Given some contexts $x_{1:C}$, we first apply a linear projection and combine it with a positional embedding to obtain $h_{1:C} \in \mathbb{R}^d$, i.e.

$$h_t = \mathrm{LayerNorm}(\mathrm{MLP}(x_t) + \mathrm{Position}(t)), \tag{3.8}$$

where LayerNorm and MLP denote layer normalizations [10] and multi-layer perceptrons, respectively. While a traditional transformer model often dedicates an entire encoder for the same purpose [130, 190], we find such a simple mapping works sufficiently well in conjunction with the context-attention module of the corresponding decoder.

As implied in Equation 3.5, our latent dynamics decomposes auto-regressively. At each time step, we parametrize the distribution $p_\theta(z_t \mid z_{1:t-1}, x_{1:C})$ by a Gaussian with parameters resulting from two steps of attention: a self-attention over the previously inferred states $z_{1:t-1}$ and another attention over the projected contexts $h_{1:C}$. These operations mirror those in the decoder of Transformer [226], with the stochastic latent variables replacing its decoder inputs.

Unfortunately, using stochastic samples of $z_t$ as attention queries is problematic since purely stochastic transitions make it difficult for the model to reliably retain information across multiple time steps [44, 67, 83]. We therefore encapsulate the latent variables in hidden representations $w_t$ that also have a deterministic component. Combined with the attention steps, such representations allow us to model long-range temporal dependencies while accounting for the stochasticity of future observations.

Starting with a learnable, context-agnostic representation $w_0$, we recursively compute $w_t$ using a stochastic sample from $p_\theta(z_t \mid z_{1:t-1}, x_{1:C})$ and the positional embedding for the current time step $t$.
The generating process for the time step $t$ can be summarized by the following pseudocode:

\begin{align}
\bar{w}_t &= \mathrm{LayerNorm}(w_{t-1} + \mathrm{Attention}(w_{t-1}, w_{1:t-1}, w_{1:t-1})) \tag{3.9} \\
\hat{w}_t &= \mathrm{LayerNorm}(\bar{w}_t + \mathrm{Attention}(\bar{w}_t, h_{1:C}, h_{1:C})) \tag{3.10} \\
z_t &\sim \mathcal{N}(\mathrm{MLP}(\hat{w}_t), \mathrm{Softplus}(\mathrm{MLP}(\hat{w}_t))) \tag{3.11} \\
w_t &= \mathrm{LayerNorm}(\hat{w}_t + \mathrm{MLP}(z_t) + \mathrm{Position}(t)), \tag{3.12}
\end{align}

where Softplus is a smooth approximation of the rectifier.

Each stochastic sample of $w_{1:T}$ is then mapped to a sequence of $x_{1:T}$ via a multi-layer perceptron. We emphasize that our generation procedure in the latent space is more efficient than others in the observation space, which require encoding and decoding high-dimensional inputs repeatedly.

Inference Model. The inclusion of nonlinear state transitions and observation models necessarily requires approximate inference. We follow the stochastic variational inference framework [116, 194] and assume that the variational posterior $q_\phi(z_{1:T} \mid x_{1:T})$, parametrized by $\phi$, can be decomposed auto-regressively in a similar fashion to the prior in Equation 3.4:

$$q_\phi(z_{1:T} \mid x_{1:T}) = \prod_{t=1}^{T} q_\phi(z_t \mid z_{1:t-1}, x_{1:T}). \tag{3.13}$$

The approximate posterior $q_\phi(z_t \mid z_{1:t-1}, x_{1:T})$ at time step $t$ is parametrized analogously to the prior $p_\theta(z_t \mid z_{1:t-1}, x_{1:C})$. Indeed, these parametrizations share most parameters and are done simultaneously in the same recursive loop, following the exact same steps in Equation 3.9 and Equation 3.10 (see Figure 3.1). We note that similar sharing techniques between the generative and inference processes have emerged as a common theme among recent successful VAE models [44, 149, 223].

While the prior only has access to the conditioning observations $x_{1:C}$, the approximate posterior should take into account all observations during training, including the targets $x_{C+1:T}$. Due to the inherent unidirectional aspect of RNNs, previous work that uses RNNs to parametrize the approximate posterior often disregards such a property [49, 67, 119] and resorts to a filtering routine $p(z_t \mid z_{1:t-1}, x_{1:t})$.
In contrast, our inference procedure more closely resembles the smoothing process of LDSs, computing $p(z_t \mid z_{1:t-1}, x_{1:T})$ instead and factoring in both past and future observations during training via another application of self-attention:

\begin{align}
k_t &= \mathrm{Attention}(h_{1:T}, h_{1:T}, h_{1:T}) \tag{3.14} \\
z_t &\sim \mathcal{N}(\mathrm{MLP}([\hat{w}_t, k_t]), \mathrm{Softplus}(\mathrm{MLP}([\hat{w}_t, k_t]))), \tag{3.15}
\end{align}

where $[\cdot, \cdot]$ denotes the concatenation operator. Here, we replace Equation 3.11 in the generative model with Equation 3.15, where the hidden representation $k_t$ summarizing all information relevant to the current time step $t$ has been concatenated to the latent-and-context-aware representation $\hat{w}_t$ preceding the Gaussian parametrization.

Variational Objective. The generative model and the inference model are trained end-to-end with a variational lower bound on the log likelihood:

\begin{align}
\log p_\theta(x_{1:T} \mid x_{1:C}) &= \log \mathbb{E}_q\left[\frac{p_\theta(z_{1:T} \mid x_{1:C})\, p_\theta(x_{1:T} \mid z_{1:T}, x_{1:C})}{q_\phi(z_{1:T} \mid x_{1:T})}\right] \tag{3.16} \\
&\geq \mathbb{E}_q\left[\log \frac{p_\theta(z_{1:T} \mid x_{1:C})\, p_\theta(x_{1:T} \mid z_{1:T}, x_{1:C})}{q_\phi(z_{1:T} \mid x_{1:T})}\right] \tag{3.17} \\
&= \mathbb{E}_q\left[\log p_\theta(x_{1:T} \mid z_{1:T}, x_{1:C}) + \log \frac{p_\theta(z_{1:T} \mid x_{1:C})}{q_\phi(z_{1:T} \mid x_{1:T})}\right] \tag{3.18} \\
&= \mathbb{E}_q\left[\sum_{t=1}^{T} \log p_\theta(x_t \mid z_t) + \sum_{t=1}^{T} \log \frac{p_\theta(z_t \mid z_{1:t-1}, x_{1:C})}{q_\phi(z_t \mid z_{1:t-1}, x_{1:T})}\right], \tag{3.19}
\end{align}

which is equivalent to the following objective

$$\sum_{t=1}^{T} \Big(\mathbb{E}_q[\log p_\theta(x_t \mid z_t)] - \mathrm{KL}\big(q_\phi(z_t \mid z_{1:t-1}, x_{1:T}) \,\|\, p_\theta(z_t \mid z_{1:t-1}, x_{1:C})\big)\Big) \tag{3.20}$$

where KL is the Kullback-Leibler divergence. Here, Equation 3.17 follows from Jensen's inequality, and Equation 3.19 from the factorizations of the emission model, the prior, and the approximate posterior in Equation 3.5, Equation 3.4, and Equation 3.13, respectively.

The objective in Equation 3.20 consists of a reconstruction loss for $x_{1:T}$ and a KL term for $z_{1:T}$, which include terms for the given contexts $x_{1:C}$ and their inferred states $z_{1:C}$. Alternatively, we can exclude these terms from the objective, or equivalently start the inference process from $t = C + 1$ instead of $t = 1$.
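To make the structure of Equation 3.20 concrete, the following NumPy sketch evaluates its negative for a single sequence. It is a minimal illustration under stated assumptions rather than the implementation behind the reported results: the second Gaussian parameter is treated as a standard deviation, the reconstruction term is taken to be an L1 loss (a Laplace likelihood with fixed scale, up to constants), the expectation is approximated with one posterior sample that has already been decoded into `x_hat`, and the names `gaussian_kl`, `negative_elbo`, `beta`, and `start` are hypothetical.

```python
import numpy as np


def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) between diagonal Gaussians, summed over the last axis.

    Assumption: sigma_* are standard deviations (strictly positive).
    """
    var_q, var_p = sigma_q ** 2, sigma_p ** 2
    kl = (np.log(sigma_p / sigma_q)
          + (var_q + (mu_q - mu_p) ** 2) / (2.0 * var_p)
          - 0.5)
    return kl.sum(axis=-1)


def negative_elbo(x, x_hat, mu_q, sigma_q, mu_p, sigma_p, beta=1.0, start=0):
    """Single-sample estimate of the negative of Equation 3.20.

    x, x_hat      : (T, D) observations and their reconstructions.
    mu_*, sigma_* : (T, d) posterior (q) and prior (p) parameters per step.
    beta          : hypothetical weight on the KL term.
    start         : first time index included in the sum; 0 keeps every
                    step, C drops the terms belonging to the contexts.
    """
    recon = np.abs(x - x_hat).sum(axis=-1)            # L1 loss per step, (T,)
    kl = gaussian_kl(mu_q, sigma_q, mu_p, sigma_p)    # KL per step, (T,)
    return float((recon[start:] + beta * kl[start:]).sum())
```

Setting `start` to the context length corresponds to the alternative mentioned above, in which the terms for the contexts and their inferred states are excluded from the objective.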
For computational stability, we assume homoscedasticity and choose a Laplace distribution with scale parameter $\beta$ as a parametric form for $p_\theta(x_t \mid z_t)$, i.e. we optimize for an L1 reconstruction loss with a cross-validated factor $\beta$ for the KL term, following similar variational autoencoder (VAE) work [52, 89, 228]. Such an assumption does not necessarily limit the capacity of our models, as powerful stochastic transitions and flexible emission models can theoretically characterize arbitrary noise covariance [163]. Incorporating structured probabilistic outputs such as Gaussian copulas [201] or normalizing flows [51] can potentially further improve our model performance.

Complexity. Our models incur a time complexity of $O(T^2 d)$ and a memory cost of $O(T^2 d)$, where $T$ is the total sequence length and $d$ is the dimensionality of the latent space. The recursive latent dynamics also does not allow us to take full advantage of parallelizable attention. However, we find that our models are still efficient in practice, especially for reasonably small values of $T$.

3.3.2 Multi-Layered Extension

Inspired by recent work on hierarchical VAEs for non-sequential inputs [44, 212, 223, 266], we extend our proposed model to include several layers of latent variables, aiming to further increase its flexibility for modelling sequential data.

Generative and Inference Models. We represent each time step $t$ with a Markov chain of $L$ latent variables, denoted by $z_t^{(1:L)} = (z_t^{(1)}, \ldots, z_t^{(L)})$ (see Figure 3.1). The generative and inference models also decompose auto-regressively across different time steps and may exhibit non-Markovian dynamics:

\begin{align}
p_\theta(x_{1:T}, z_{1:T}^{(1:L)} \mid x_{1:C}) &= \left[\prod_{t=1}^{T} p_\theta(x_t \mid z_t^{(L)})\right] \left[\prod_{\ell=1}^{L} \prod_{t=1}^{T} p_\theta(z_t^{(\ell)} \mid z_{1:t-1}^{(\ell)}, z_{1:T}^{(\ell-1)}, x_{1:C})\right] \tag{3.21} \\
q_\phi(z_{1:T}^{(1:L)} \mid x_{1:T}) &= \prod_{\ell=1}^{L} \prod_{t=1}^{T} q_\phi(z_t^{(\ell)} \mid z_{1:t-1}^{(\ell)}, z_{1:T}^{(\ell-1)}, x_{1:T}). \tag{3.22}
\end{align}

Intuitively, we generate samples $x_{1:T}$ conditioning on $x_{1:C}$ by following the latent dynamics from the bottom up and using the generative process described earlier within each layer. More specifically, we parametrize the prior $p_\theta(z_t^{(\ell)} \mid z_{1:t-1}^{(\ell)}, z_{1:T}^{(\ell-1)}, x_{1:C})$ using (1) self-attention over the inferred latents $w_{t-1}^{(\ell)}$ on the same layer and (2) another attention over contexts $h_{1:C}$. In this case, we include an additional self-attention over all latent variables from the layer immediately below it (see Equation 3.23):

\begin{align}
\tilde{w}_t^{(\ell)} &= \mathrm{LayerNorm}(w_{t-1}^{(\ell)} + \mathrm{Attention}(w_{t-1}^{(\ell)}, w_{1:T}^{(\ell-1)}, w_{1:T}^{(\ell-1)})) \tag{3.23} \\
\bar{w}_t^{(\ell)} &= \mathrm{LayerNorm}(\tilde{w}_t^{(\ell)} + \mathrm{Attention}(\tilde{w}_t^{(\ell)}, w_{1:t-1}^{(\ell)}, w_{1:t-1}^{(\ell)})) \tag{3.24} \\
\hat{w}_t^{(\ell)} &= \mathrm{LayerNorm}(\bar{w}_t^{(\ell)} + \mathrm{Attention}(\bar{w}_t^{(\ell)}, h_{1:C}, h_{1:C})) \tag{3.25} \\
z_t^{(\ell)} &\sim \mathcal{N}(\mathrm{MLP}(\hat{w}_t^{(\ell)}), \mathrm{Softplus}(\mathrm{MLP}(\hat{w}_t^{(\ell)}))) \tag{3.26} \\
w_t^{(\ell)} &= \mathrm{LayerNorm}(\hat{w}_t^{(\ell)} + \mathrm{MLP}(z_t^{(\ell)}) + \mathrm{Position}(t)). \tag{3.27}
\end{align}

Variational Objective. The multi-layered architecture results in a variational bound similar to Equation 3.20:

$$\sum_{t=1}^{T} \left[\mathbb{E}_q \log p_\theta(x_t \mid z_t^{(L)}) - \sum_{\ell=1}^{L} \mathrm{KL}\big(q_\phi(z_t^{(\ell)} \mid z_{1:t-1}^{(\ell)}, z_{1:T}^{(\ell-1)}, x_{1:T}) \,\|\, p_\theta(z_t^{(\ell)} \mid z_{1:t-1}^{(\ell)}, z_{1:T}^{(\ell-1)}, x_{1:C})\big)\right].$$

Complexity. Stacking multiple layers of latent variables increases model expressiveness, but it also results in a linear increase in running time and the number of parameters. The time complexity for the $L$-layer transformer is $O(L T^2 d)$, while the space complexity remains $O(T^2 d)$ due to the Markovian structure of the chain $z_t^{(1:L)}$ at each time step $t$. In our experiments, we restrict the number of layers of our hierarchical models to two or three.

3.4 Related Work

Deep State Space Models. Deep neural networks have been extensively combined with state space models, resulting in flexible yet principled latent variable approaches.
While some work keeps the linear state transition intact to leverage the efficient Kalman filter algorithms [51, 66, 108, 186], more expressive, nonlinear latent dynamics parametrized by neural networks have been proposed [119, 120]. All such models are limited to the Markovian dynamics of LDSs, which hinders learning of long-range dependencies. The limitation is often alleviated by combining the stochastic transitions with a deterministic RNN that enables access to all past states [9, 18, 49, 67, 83, 204]. Our models are similarly non-Markovian, but the dependencies on the past states are modeled via attention, which allows for easy connections between long-distance time steps. In addition, while most existing deep SSMs represent each time step with a single latent variable, our models include several layers of hierarchical latent variables with a tractable inference mechanism.

Attentive Recurrent Networks. Attention mechanisms have been widely adopted in recent time series work using sequence-to-sequence models [3, 63] or transformer architectures [33, 130, 135, 190, 213, 246]. While our models are equipped with latent variables, these transformer approaches [130, 190] lack an inference mechanism and are susceptible to feeding back observation noise into the dynamics model at test time. Our work, however, can be considered as an extension of the attentive state space model proposed in [3], with discrete latent states replaced by their continuous analogs. Recent developments in natural language processing [138, 141, 239] also combine transformers and VAEs; however, these approaches often use a time-agnostic latent variable.

Time Series Forecasting. Traditional univariate time series models, such as Box-Jenkins methods [28] and exponential smoothing [95], often assume independence between any collection of time series [202].
While multivariate extensions of the classical approaches, including vector autoregression [222] and multivariate GARCH [17], do not require such a strong assumption, they come with many others such as stationarity and homoscedasticity, demand manual selection of covariates and models, and do not scale well to even a moderate number of time series [84, 176].

Deep learning methods for time series forecasting have recently emerged as an expressive, scalable framework for industrial applications [23, 172, 211, 240]. While early work focuses on point forecasts [124, 182, 255], recent approaches employ recurrent neural networks with probabilistic forecasts parametrized directly [202], using quantile functions [72], Gaussian copulas [201], normalizing flows [51], or diffusion models [189]. In contrast, our models are devoid of such architectures and rely on latent variables to output distributional forecasts.

Human Motion Prediction. Despite being almost identical in formulation, human motion prediction has often been studied independently from time series forecasting. While some work deterministically generates future motions or video frames [31, 68, 73, 132], stochastic prediction has also been proposed with deep neural networks often outperforming traditional methods such as hidden Markov models [245] or Gaussian processes [238] on complex motion datasets [31, 68, 98, 129, 154]. In contrast to earlier work [252, 257] that employs a global latent variable across different time steps via conditional VAE [116], we leverage the principled framework of state space models for learning and inference of hierarchical, time-dependent latent variables.

3.5 Experiments

We present our experimental results on two tasks, namely, time series forecasting and human motion prediction. These tasks are often studied independently, despite being almost identical as conditional prediction problems.

3.5.1 Time-series Forecasting

Datasets & Covariates.
Following the experiment setup in [189, 190, 201], we evaluate our models and multiple competitive baselines on five popular public datasets: SOLAR, ELECTRICITY, TRAFFIC, TAXI, and WIKIPEDIA. The data is recorded with hourly or daily frequency and shows seasonal patterns of different frequencies (see Appendix B for more dataset details). As in [189, 190], the covariates include lagged inputs, fixed time embeddings (e.g. day of week, hour of day), and learnable time-series embeddings. The inputs are scaled using the conditioning examples before being fed into the model, and the predictions are rescaled appropriately afterward.

Metrics. Following [51, 190, 201], we evaluate our model and all baselines using continuous ranked probability score (CRPS) [156] summed across time series, denoted by $\mathrm{CRPS}_{\mathrm{sum}}$. Given a univariate distribution function $F$ and an observation $x$, CRPS is defined as

$$\mathrm{CRPS}(F, x) = \int_{\mathbb{R}} \big(F(z) - \mathbb{1}\{x \leq z\}\big)^2 \, dz,$$

where $\mathbb{1}\{x \leq z\}$ is the indicator function. As argued in [51], $\mathrm{CRPS}_{\mathrm{sum}}$ is a proper scoring rule [75] and can be computed without analytical forecast distributions. We compute the metrics in a rolling fashion and use 100 samples for the distributional forecasts, similar to the aforementioned work.

Baselines. We benchmark our models against various baselines, including (1) VES [95], an innovation state space model; (2) VAR-Lasso and VAR [148], two multivariate linear autoregressive models; (3) GARCH [224], a multivariate conditional heteroskedastic model; (4) DeepAR [202], an autoregressive RNN; (5) LSTM-Copula and GP-Copula [201], two RNN-based models that use Gaussian copula to model nonlinearity; (6) KVAE [119], a variational approach based on linear dynamics; (7) NKF [51], a normalizing-flow model coupled with Kalman filters; (8) Transformer [190], a transformer-based model based on masked autoregressive flow; and (9) TimeGrad [189], a recent diffusion-based approach.

Implementations.
We use 8-head attention and 2-layer MLPs to parametrize the generative and inference models. The stochastic latent variables $z_t$ are 16-dimensional while the hidden representations $w_t$ are in $\mathbb{R}^{128}$. Our probabilistic transformers for SOLAR and ELECTRICITY have one stochastic layer while those for the other datasets of higher dimensional observations employ two layers. We report the numbers of parameters of our models in Table B.4 in Appendix B, which are all comparable to those of the state-of-the-art approaches.

Table 3.1: Test set $\mathrm{CRPS}_{\mathrm{sum}}$ of time series forecasting models (lower is better). The means and standard deviations are computed over five runs using different random seeds.

METHOD                 SOLAR          ELECTRICITY    TRAFFIC        TAXI           WIKIPEDIA
VES [95]               0.900 ± 0.003  0.880 ± 0.004  0.350 ± 0.002  -              -
VAR [148]              0.830 ± 0.006  0.039 ± 0.001  0.290 ± 0.001  -              -
VAR-Lasso [148]        0.510 ± 0.006  0.025 ± 0.000  0.150 ± 0.002  -              3.100 ± 0.004
GARCH [224]            0.880 ± 0.002  0.190 ± 0.001  0.370 ± 0.001  -              -
DeepAR [202]           0.336 ± 0.014  0.023 ± 0.001  0.055 ± 0.003  -              0.127 ± 0.042
LSTM-Copula [201]      0.319 ± 0.011  0.064 ± 0.008  0.103 ± 0.006  0.326 ± 0.007  0.241 ± 0.003
GP-Copula [201]        0.337 ± 0.024  0.024 ± 0.002  0.078 ± 0.002  0.208 ± 0.183  0.086 ± 0.004
KVAE [119]             0.340 ± 0.025  0.051 ± 0.019  0.100 ± 0.005  -              0.095 ± 0.012
NKF [51]               0.320 ± 0.020  0.016 ± 0.001  0.100 ± 0.002  -              0.071 ± 0.002
Transformer-MAF [190]  0.301 ± 0.014  0.021 ± 0.000  0.056 ± 0.001  0.179 ± 0.002  0.063 ± 0.003
TimeGrad [189]         0.287 ± 0.020  0.021 ± 0.001  0.044 ± 0.006  0.114 ± 0.020  0.049 ± 0.002
ProTran (Ours)         0.194 ± 0.030  0.016 ± 0.001  0.028 ± 0.001  0.084 ± 0.003  0.047 ± 0.004

Accuracy Comparison. Table 3.1 shows that our models perform competitively across all five high-dimensional time series datasets, achieving $\mathrm{CRPS}_{\mathrm{sum}}$ comparable to the best methods on ELECTRICITY and WIKIPEDIA while outperforming all baselines, including a transformer-based approach [190], by significant margins on SOLAR, TRAFFIC and TAXI.
Further analyses with other metrics, including CRPS and NMSE, in Appendix B also confirm our findings.

Qualitative Results. Figure 3.4 shows that the distribution forecasts generated by our model closely follow the ground truths, which is consistent with our accuracy results. In addition, the model appears to capture the uncertainty of future forecasts to some extent; observations of large magnitudes and far into the future seem to correctly have higher variance estimates.

Table 3.2: Ablation study on TRAFFIC.

Two Layers                 ✓      ×      ×      ×
One Layer                  ×      ✓      ✓      ✓
Context Attention          ✓      ✓      ×      ✓
Deterministic              ×      ×      ×      ✓
$\mathrm{CRPS}_{\mathrm{sum}}$   0.028  0.031  0.033  0.041

[Figure 3.4: Prediction intervals and test set ground-truth from ProTran (our model) for the TRAFFIC dataset of the first 16 of 963 time series.]

Ablation Study. We include a small scale ablation study on the TRAFFIC dataset to investigate which components of our models are essential. Table 3.2 suggests that removing the stochasticity from $w_t$ has the most impact on model performance, implying that incorporating latent variables into a transformer is indeed useful. Other aspects such as context attention or multiple layers of stochastic variables do not show dramatic effects in this study; however, they do contribute performance gains.

3.5.2 Human Motion Prediction

Datasets. Following the experiment setup in [257], we evaluate our models on two public motion capture datasets: Human3.6M [96] and HumanEva-I [208]. While Human3.6M is a large-scale dataset with 3.6 million video frames recorded at 50Hz, HumanEva-I is smaller with only 3 subjects and recorded at 60Hz.
We follow the preprocessing steps of previous work [155, 257] and obtain a 17-joint skeleton for Human3.6M and a 15-joint skeleton for HumanEva-I. As in [257], we predict future motion for 2 seconds conditioning on observed motion of 0.5 seconds and 1 second conditioning on 0.25 seconds for Human3.6M and HumanEva-I, respectively.

Metrics. Following previous work on trajectory forecasting [4, 81], we adopt two popular metrics, namely, average displacement error (ADE) and final displacement error (FDE). ADE measures the average L2 distance over all time steps between the ground truth motion and the closest sample, while FDE only considers such distance for the final pose.

Baselines. We compare our models against 9 models, including ERD [68] and acLSTM [132], two deterministic RNN-based approaches; MT-VAE [252] and Pose-Knows [236], two conditional VAE models; HP-GAN [14], a conditional GAN; Best-Many [25], GMVAE [55], DeliGAN [82], and DSP [258], four approaches optimizing for diversity objectives. The results for these baselines are reported as in [257].

Implementations. Similar to the previous experiments, we use 8-head attention and 2-layer MLPs. Since Human3.6M is significantly more complex and multi-modal than the time series forecasting datasets, we make use of 3 stochastic layers, as opposed to 2 layers for HumanEva-I. For Human3.6M, the context and target observations are significantly longer and set up for long-term predictions, so we only infer latent variables for target observations. Appendix B also contains further details about our models and their number of parameters.

[Figure 3.5: Ground-truth pose sequences (first row) and corresponding predictions by ProTran (second row) for the actions Smoking, Walk Together, Phoning, Walking, Discussion, and Walk Dog. Solid colors indicate later time-steps and faded ones are older. The body-part movements in the predicted and ground-truth poses resemble similar patterns, while certain variations are retained.]

Table 3.3: Human motion prediction results.

                   HUMAN3.6M        HUMANEVA-I
Method             ADE ↓   FDE ↓    ADE ↓   FDE ↓
ERD [68]           0.722   0.969    0.382   0.461
acLSTM [132]       0.789   1.126    0.429   0.541
MT-VAE [252]       0.457   0.595    0.345   0.403
Pose-Knows [236]   0.461   0.560    0.269   0.296
HP-GAN [14]        0.858   0.867    0.772   0.749
Best-Many [25]     0.448   0.533    0.271   0.279
GMVAE [55]         0.461   0.555    0.305   0.345
DeliGAN [82]       0.483   0.534    0.306   0.322
DSP [258]          0.493   0.592    0.273   0.290
DLow [257]         0.425   0.518    0.251   0.268
ProTran (Ours)     0.381   0.491    0.258   0.255

Quantitative Results. Table 3.3 shows that our models outperform all baselines on Human3.6M under both ADE and FDE and achieve the best FDE on HumanEva-I, where DLow [257] attains a marginally lower ADE; the gains are significantly higher for the larger dataset Human3.6M. We emphasize that our favorable performance is evaluated using random samples, while the closest competitor, DLow [257], relies on a separate model for selecting samples to promote diversity, which can potentially be combined with our probabilistic transformer for further improvements.

Qualitative Results. We show in Figure 3.5 human pose predictions made by our model that are most similar to the corresponding ground truths among a collection of such stochastic predictions. The similarities between the body-part movements in both sequences suggest that our model has been able to capture the temporal dynamics quite well.

3.6 Conclusion & Discussion

In this chapter, we have introduced generative models for multivariate time series that combine strengths of state space models and transformer architectures. In contrast to previous work, our models do not rely on RNNs but make extensive use of attention mechanisms.
We also extend our models to include hierarchical latent variables, inspired by recent developments of VAEs for non-sequential data [44, 223]. Empirical experiments show that our models perform remarkably well on time series forecasting and human motion prediction.

Our models do not come without limitations, however. As in other transformer-based approaches, the reliance on attention incurs a quadratic time and memory complexity. While we do not find it problematic in our experiments, the limitation necessarily hinders applications of our models in tasks characterized by long-term dependencies such as language modelling or music generation [79]. Fortunately, recent work on sparse transformers [22, 45, 118, 130] can potentially address the issue, and we leave such an investigation for future work.

CHAPTER 4

PROBABILISTIC TRANSFORMER FOR VIDEO PREDICTION¹

While impressive advances have been made on generative models of images [91, 109, 223], audio [26, 41], and text [29, 183], video predictions remain exceptionally challenging given the interplay of high-dimensional images and complex temporal dynamics. Given 5 conditioning 64 × 64 video frames, for example, a model would need to extract information across space and time from more than 60,000 pixels in order to generate 60,000 more pixels for future frames. At a higher level, such a model would need to continuously detect and track the objects in the frames and consistently decode their positions and appearances into future ones, despite possible occlusion, clutter, or object deformations.

In this chapter, we extend the probabilistic transformer model proposed in Chapter 3 into the settings of conditional video prediction [187, 216, 232].
Although it is conceivable to naively flatten the 3-dimensional tensors representing video frames and apply the previous model without any changes, doing so would lead to an exploding number of parameters, and more importantly, disregard the local connectivity and translational invariance of images. Such properties naturally lend themselves to convolutional operators, so we replace the classical attention mechanism with a new convolution-based alternative designed to work with 3-dimensional tensors. While the time series forecasting task does not benefit significantly from our deep multi-layered formulation on some datasets, we find that deep, hierarchical architectures are especially helpful for videos when combined with convolutional operations, similar to how stacking multiple convolutional modules helps with image modeling.

As a state space model, the resulting model inherits the principled probabilistic framework with tractable variational inference. The transformer architectures also allow for non-Markovian temporal dynamics, which are especially relevant for video contents of complex modalities. Empirical results demonstrate that our models are relatively effective on several video generation datasets.

¹Joint work with David S. Matteson.

4.1 Related Work

Deterministic Models. Earlier works on video prediction are often based on deterministic recurrent neural networks [78], such as long short-term memory (LSTM) networks [92], including [16, 105, 216, 227, 229], or convolutional LSTMs [249], such as [103, 187, 250]. Several other approaches propose using specialized computer vision techniques such as pixel-level transformations or optical flow [65, 71, 103, 133, 142, 144, 146, 233, 234, 235]. These models, however, are limited to deterministic predictions and fail to generate sharp long-term video frames [11, 52].

Stochastic and Autoregressive Models.
Several approaches directly optimize exact likelihood via pixel-level autoregression [107, 171, 242] or normalizing flows using invertible transformations [115, 123], all of which require restrictive temporal generation designs to manipulate high-dimensional inputs. Closer to our work are models based on variational inference [116, 194] that incorporate latent variables into convolutional LSTMs [11, 126] or LSTMs [52, 86]. In contrast to our model, however, the latent variables in these models do not contain all information required to render future frames and model predictions are often fed back into the latent space, making them more susceptible to accumulated errors in multi-frame predictions.

State Space Models. As discussed in Chapter 3, several previous works have explored various modeling choices to learn stochastic sequential models, which differ in the factorization of the generative and inference models, their network architectures, and the objectives used in their training procedures [9, 56, 66, 67, 69, 80, 108, 120]. These models are often based on RNNs, which are not equipped with a mechanism to learn long-range dependencies. In contrast, our models are completely devoid of recurrent architectures and make extensive use of attention for learning such dependencies.

Attention in Vision Models. Central to our model is the convolution-based attention mechanism operating on 3-dimensional tensors such as images. While previous works have experimented with replacing all convolution operations in images with self-attention [21, 185, 264], we find that such techniques often lead to large model sizes and do not necessarily work well on video data, and that combining convolution with space-time attention [24] offers a relatively simple and effective alternative.

4.2 Probabilistic Transformer

Space-Time Attention. As discussed in Chapter 3, multi-head attention maps a sequence of queries $Q \in \mathbb{R}^{\ell_q \times d}$ of length $\ell_q$ to a sequence of outputs $O = [O_1, \ldots, O_H] \in \mathbb{R}^{\ell_q \times dH}$ of the same length by attending over $\ell_k$ given key-value pairs $K \in \mathbb{R}^{\ell_k \times d}$, $V \in \mathbb{R}^{\ell_k \times d}$:

$$O_h = \mathrm{Attention}(Q_h, K_h, V_h) = \mathrm{Softmax}\left(\frac{Q_h K_h^\top}{\sqrt{d}}\right) V_h, \tag{4.1}$$

where $Q_h = Q W_h^Q$, $K_h = K W_h^K$, $V_h = V W_h^V$ are projected queries, keys, and values corresponding to head $h \in [1, H]$ with learnable parameters $W_h^Q$, $W_h^K$, $W_h^V$, respectively (see Figure 3.3).

Here all queries, keys, and values are $d$-dimensional vectors, so the dot products between queries and keys are well-defined and can be computed efficiently via the matrix multiplications in Equation 4.1.

For video generation, all frame inputs and most intermediate outputs are 3-dimensional tensors of varying resolutions. As a result, we first replace the linear projections in multi-head attention with 1 × 1 convolutional operations to map the queries $Q \in \mathbb{R}^{\ell_q \times c \times h' \times w'}$, the keys and values $K, V \in \mathbb{R}^{\ell_k \times c \times h' \times w'}$ into 3-dimensional tensors of the same size

\begin{align}
Q_h &= \mathrm{Conv1x1}_h^{(q)}(Q) \in \mathbb{R}^{\ell_q \times c' \times h' \times w'}, \tag{4.2} \\
K_h &= \mathrm{Conv1x1}_h^{(k)}(K) \in \mathbb{R}^{\ell_k \times c' \times h' \times w'}, \tag{4.3} \\
V_h &= \mathrm{Conv1x1}_h^{(v)}(V) \in \mathbb{R}^{\ell_k \times c' \times h' \times w'}, \tag{4.4}
\end{align}

where $\mathrm{Conv1x1}_h^{(q)}$, $\mathrm{Conv1x1}_h^{(k)}$, and $\mathrm{Conv1x1}_h^{(v)}$ have separate learnable parameters for the queries, keys, and values corresponding to head $h \in [1, H]$, respectively. Intuitively, these mappings allow for the interactions between the original $c$ channels within each tensor.

We compute attention weights between $Q_h$ and $K_h$ by flattening the tensors into $Q'_h \in \mathbb{R}^{(\ell_q h' w') \times c'}$ and $K'_h, V'_h \in \mathbb{R}^{(\ell_k h' w') \times c'}$, and taking dot products to obtain a tensor in $\mathbb{R}^{\ell_q \times \ell_k \times h' \times w'}$. The Softmax operation is then computed over the $\ell_k$ time steps in the resulting tensor, and the attention weights are combined with the corresponding transformed values as in Equation 4.5:

$$O_h = \mathrm{SpaceTimeAttention}(Q_h, K_h, V_h) = \mathrm{Softmax}\left(\frac{Q'_h {K'_h}^\top}{\sqrt{d}}\right) V'_h. \tag{4.5}$$

In short, the space-time attention uses convolutions to measure similarities between the queries and the keys but attends over the values at different time steps as in multi-head attention.
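The space-time attention described above can be sketched for a single head in NumPy as follows. This is an illustrative reading of Equations 4.2 to 4.5, not a definitive implementation: it assumes that similarities are computed per spatial location, which is what yields a score tensor of shape $\ell_q \times \ell_k \times h' \times w'$, that the scaling factor is the square root of the projected channel dimension, and that a 1 × 1 convolution without bias is a per-pixel linear map over channels. The function names `conv1x1` and `space_time_attention` are hypothetical.

```python
import numpy as np


def conv1x1(x, weight):
    """1x1 convolution without bias: a per-pixel linear map over channels.

    x: (L, c, h, w), weight: (c_out, c)  ->  (L, c_out, h, w)
    """
    return np.einsum("oc,lchw->lohw", weight, x)


def space_time_attention(q, k, v):
    """Single-head space-time attention (sketch).

    q: (Lq, c, h, w); k, v: (Lk, c, h, w)  ->  (Lq, c, h, w)
    Assumption: scores are scaled by sqrt(c), the channel dimension.
    """
    c = q.shape[1]
    # Channel-wise dot products at every spatial location: (Lq, Lk, h, w).
    scores = np.einsum("qchw,kchw->qkhw", q, k) / np.sqrt(c)
    # Numerically stable softmax over the Lk key time steps (axis 1).
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Weighted sum of the values over time, separately at each pixel.
    return np.einsum("qkhw,kchw->qchw", weights, v)


# Usage with random inputs: 2 query steps, 5 key/value steps, 3 -> 6 channels.
rng = np.random.default_rng(0)
queries = rng.normal(size=(2, 3, 4, 4))
memory = rng.normal(size=(5, 3, 4, 4))
w_q, w_k, w_v = (rng.normal(size=(6, 3)) for _ in range(3))
output = space_time_attention(
    conv1x1(queries, w_q), conv1x1(memory, w_k), conv1x1(memory, w_v)
)
```

In this reading, each output pixel is a convex combination of the value tensors at the same pixel across the $\ell_k$ time steps, so spatial structure is preserved while temporal information is mixed.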
[Figure 4.1: Graphical model representations of our probabilistic transformer models, shown for time steps $t-1$, $t$, $t+1$ and four layers of latent variables $z^{(1)}, \ldots, z^{(4)}$: (a) Generation, (b) Inference. Black arrows denote the generative mechanism and red arrows the inference procedure. The separation of generation and inference in (a) and (b) is for readability. We interleave recursive layers (e.g. layer 1 and layer 3) and non-recursive layers (e.g. layer 2 and layer 4) to increase expressiveness of the temporal dynamics and reduce running time.]

Interleaving Recursive and Non-Recursive Layers. As in Chapter 3, we study hierarchical models constructed by stacking multiple probabilistic transformer layers described in Subsection 3.3.2 (see Figure 4.1). In our case, each layer consists of intermediate outputs of all time steps, with the bottom layers encoding low-resolution, high-level information while the top layers encode high-resolution, fine details about video frames.

Such layers are built recursively from left to right, allowing a current time step to focus on all preceding ones. As a result, each of these layers has to be formed sequentially and significantly slows down the generation and inference procedures, especially when there are many future steps or hierarchical layers. Because a layer has access to all temporal information from the layer below it, including all past, present, and future time steps, we propose removing recursive connections in some layers to speed up training and inference.

In particular, we keep recursive layers for each new resolution from the bottom up and make all other layers of the same resolution non-recursive. The resulting models are fast to train and serve but remain highly expressive.

4.3 Experiment Results

Datasets.
We run experiments on three video datasets:

• STOCHASTIC MOVEMENT DATASET [11]. The first frame of every video consists of a shape placed near the center of a 64 × 64 × 3 resolution gray background with its type, size and color randomly sampled. The shape randomly moves in one of eight directions with constant speed, implying the position of the shape at any time step is completely determined by that of the shape at the previous step. We condition only on the initial frames and predict the next four frames both during training and at test time.

• DETERMINISTIC MOVING MNIST [52]. This dataset consists of one or two MNIST digits [125] moving linearly and deterministically bouncing on walls with predefined direction and velocity. We condition only on the initial frames and predict the next 15 and 100 frames during training and at test time, respectively.

• STOCHASTIC MOVING MNIST [52]. This dataset consists of one or two MNIST digits [125] moving linearly and randomly bouncing on walls with new direction and velocity sampled randomly at each bounce. We condition on the first five frames of each video and predict the next 10 and 25 frames during training and at test time, respectively.

Metrics. Following previous works [69], we use peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) to compare predictions of future frames and their corresponding ground truths.

Figure 4.2: Peak signal-to-noise ratio as a function of time horizon. (Panels: STOCHASTIC MNIST and DETERMINISTIC MNIST; x-axis: time horizon; y-axis: PSNR; legend: SVG, SRVP, SRVP-GRU, SRVP-NZ, ProTran.)

Quantitative Results.
Table 4.1 shows that our models perform on par with other baselines, including SVG [52] and SRVP [69], two popular latent variable models based on variational inference. As seen in Figure 4.2, our model performance decreases slowly as the time horizon increases, while autoregressive models such as SVG are more prone to accumulated errors in multi-frame prediction settings.

Table 4.1: PSNR and SSIM scores on MOVING MNIST.

             STOCHASTIC                           DETERMINISTIC
Method       PSNR (↑)        SSIM (↑)             PSNR (↑)        SSIM (↑)
SVG          14.50 ± 0.04    0.7090 ± 0.0015      12.85 ± 0.03    0.6185 ± 0.0011
SRVP         16.93 ± 0.07    0.7799 ± 0.0020      18.25 ± 0.06    0.8300 ± 0.0017
ProTran      17.20 ± 0.05    0.7842 ± 0.0017      18.43 ± 0.12    0.8544 ± 0.0014

Qualitative Results. Figure 4.3 and Figure 4.4 show video frame predictions and their corresponding ground truths at different time horizons. Our models appear to learn the temporal dynamics and maintain the properties of involved objects relatively well as the frames mirror each other. From Figure 4.4, we also see that the frame predictions after the collisions of digits seem consistent with the ground truths, suggesting that the non-Markovian dynamics might indeed have been captured in the latent space.

Figure 4.3: Predicted video frames on odd rows and their corresponding ground truths on even rows at two different time horizons, namely t = 2 on the left and t = 4 on the right, on the STOCHASTIC MOVEMENT DATASET. The shown predictions are among the closest samples to the ground truths based on PSNR.

4.4 Conclusion

In this chapter, we have extended the probabilistic transformer proposed in Chapter 3 to the task of video prediction. While preliminary results on relatively simple datasets are encouraging, additional experiments on challenging real-world datasets can further validate our model performance.
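For reference, the PSNR metric reported in Table 4.1 can be sketched as below. This is the standard definition, not code from the experiments: the function name `psnr` is illustrative, and the assumption that pixel intensities lie in [0, `max_value`] with `max_value = 1.0` is ours, since the normalization used in the chapter is not stated.

```python
import numpy as np


def psnr(prediction, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between a predicted and a true frame."""
    prediction = np.asarray(prediction, dtype=float)
    target = np.asarray(target, dtype=float)
    mse = np.mean((prediction - target) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return float(10.0 * np.log10(max_value ** 2 / mse))


# Toy check: a constant error of 0.1 gives an MSE of 0.01, i.e. 20 dB.
target = np.zeros((64, 64))
prediction = np.full((64, 64), 0.1)
score = psnr(prediction, target)
```

Higher values indicate predictions closer to the ground truth, which is why Table 4.1 marks PSNR with an upward arrow.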
Figure 4.4: Predicted video frames on odd rows and their corresponding ground truths on even rows at two different time horizons, namely t = 4 on the left and t = 13 on the right, on the STOCHASTIC MOVING MNIST dataset. The shown predictions are among the closest samples to the ground truths based on PSNR.

CHAPTER 5
ADDITIONAL APPLICATIONS OF DEEP NEURAL NETWORKS

5.1 Dynamic Poverty Prediction with Vegetation Index1

Despite global economic growth, 330 million people are still living in extreme poverty in Africa [20]. The United Nations has acknowledged poverty as one of the greatest challenges facing humanity and aims to end extreme poverty in all forms by 2030 [164]. To achieve this goal, policy makers often rely on complex household surveys to measure poverty and allocate resources [13]. Since the data collection process is costly and time consuming, there is a lack of good-quality data to assess poverty regularly [101]. Inexpensive and scalable approaches to poverty prediction are thus needed to complement household surveys.

Recent advances in remote sensing and machine learning have opened up a new path for poverty prediction. High-resolution satellite images are rich in content and available globally, providing an objective view on the economic conditions of developing countries [42, 88, 128]. The increasing abundance of these images lends itself to convolutional neural networks (CNNs), a deep learning approach that has recently seen tremendous success in many computer vision tasks [193, 199, 205]. In a critically acclaimed paper [101], Jean et al. applied CNNs to daytime satellite images to measure regional poverty in Africa, yielding results comparable to estimation based on past surveys.

Although promising, most current studies with satellite images are limited to providing fixed estimates of poverty maps.
The poverty predictions in [101], for example, were constant scalars regardless of the time of prediction due to the lack of continuous access to proprietary data. Given the unprecedented climate changes anticipated in sub-Saharan countries [175], dynamic poverty mapping is critical to timely interventions and policy evaluation.

1 Joint work with Ying Sun, Yanyan Liu, and David S. Matteson.

Figure 5.1: NDVI measurements for Uganda in 2011. On the left, the background image shows annual average NDVI with a vertical colorbar while the foreground scatters depict log consumption expenditures with a horizontal colorbar. On the right, the annual NDVI, spatially averaged over all survey locations, is shown with notable drops during the 2011-2012 East Africa drought highlighted in gray.

In this section, we use the continuous streaming of the normalized difference vegetation index (NDVI), a widely known satellite measurement of Earth's vegetation greenness, to estimate poverty indicators more frequently. NDVI measures the difference in red and near-infrared light reflectance resulting from photosynthesis; areas of barren rock have very low NDVI while dense croplands often have high NDVI values (see Figure 5.1). As ultra-poor regions heavily depend on agriculture [53], NDVI provides an uninterrupted signal for crop health and poverty tracking in general.

Our contribution is twofold: (1) we demonstrate that publicly available, moderate-resolution NDVI can help predict poverty in Malawi, Nigeria, Rwanda, Tanzania, and Uganda as well as competitive baselines, and (2) we perform poverty prediction for an out-of-sample period and capture changes in poverty measures for ultra-poor regions in Uganda.

5.1.1 Related Work

Recent studies on poverty prediction rely on passively collected data and statistical methods to circumvent the scarcity of household surveys.
Researchers have fit linear models to nighttime light luminosities and found they strongly correlate with the gross domestic product of various countries [42, 88, 128]. Proprietary cell phone records of millions of subscribers in Rwanda and Senegal have also been used with tree-based classifiers and Gaussian process regressors to provide asset wealth index estimates [27, 178]. In another line of research, CNNs help extract predictive features from Google's high-resolution daytime images, providing accurate estimates of both consumption expenditure and asset wealth in multiple African countries [101, 248].

Although existing technologies can only measure NDVI at a lower resolution, recent work on poverty and health in Africa has demonstrated its significant predictive power. Through increased crop yields, NDVI has been found to be positively correlated with child survival, nutrition, and anthropometric variables such as wasting [106]. Using spatial statistics techniques, Sedda et al. [203] also showed that the intensity of poverty varies inversely with NDVI in West Africa.

5.1.2 Datasets and Methodology

Inspired by previous work, we apply CNNs to publicly available NDVI images to learn features useful for poverty prediction. Following Jean et al. [101], we use transfer learning and a two-step procedure to bypass the lack of labeled responses: (1) fine-tune a VGG-16 network [209] on NDVI images to predict nighttime light intensities, and (2) fit random forest regression models using NDVI features to predict poverty indicators. The combination of NDVI images and nighttime lights allows vegetation features indicative of economic activity to be learned and generalized to the poverty prediction task.

             Consumption Prediction (LSMS)              Asset Index Prediction (DHS)
Country      Year   Jean et al. [101]   NDVI            Year   Jean et al. [101]   NDVI
Malawi       2013   0.37                0.341 ± 0.038   2010   0.55                0.498 ± 0.020
Nigeria      2013   0.42                0.387 ± 0.013   2013   0.68                0.738 ± 0.005
Rwanda       -      -                   -               2010   0.75                0.725 ± 0.022
Tanzania     2012   0.55                0.603 ± 0.019   2010   0.57                0.638 ± 0.012
Uganda       2011   0.41                0.490 ± 0.012   2011   0.69                0.751 ± 0.007

Table 5.1: Spatially cross-validated r² values of the predictions of NDVI models relative to Jean et al. [101]. Separate models are fine-tuned and evaluated for different countries and surveys. For NDVI models, the means and standard deviations of r² values are reported using 5 independent trials.

In the first step of our procedure, we start with a VGG-16 network pre-trained on ImageNet and adapt its fully connected layers to fit our input image sizes. Our inputs are annual average NDVI images, each 64 × 64 pixels in size and at a spatial resolution of 250 square meters per pixel. The images are sampled from a dataset produced by NASA's Terra satellite [54] and represent areas that are evenly spaced at 0.025 degree intervals. The network learns to map each NDVI image to the average value of annual nighttime light intensities that describe the same geographical region, as provided by the National Oceanic and Atmospheric Administration [169]. In contrast to [101, 248], we take a log transformation, but we do not discretize nighttime light intensities.

In the second step, we extract features from NDVI images and fit regression models to predict two survey variables: the logarithm of consumption expenditure and the asset index. For direct comparison, we select the same surveys and follow the same preprocessing steps as in [101] (see Table 5.1 for the list of surveys). The conv5-2 layer of the fine-tuned network outputs a feature map of size 512 × 1 for each NDVI image, and we average feature maps of images whose centers are within five kilometers of a surveyed community. For each survey, we then train random forests on the 512-dimensional feature maps, using nested 5-fold spatial cross-validation to select hyperparameters, and output predictions.
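The feature-averaging step described above can be sketched as follows. This is a hedged illustration rather than the pipeline used for Table 5.1: the function names `haversine_km` and `community_features`, the use of great-circle distance, and the toy coordinates and feature values are all assumptions made for the example; only the 512-dimensional features and the five-kilometer radius come from the text.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def community_features(community_latlon, image_latlon, image_features,
                       radius_km=5.0):
    """Average the features of all images centered within radius_km."""
    lat, lon = community_latlon
    dist = haversine_km(lat, lon, image_latlon[:, 0], image_latlon[:, 1])
    mask = dist <= radius_km
    if not mask.any():
        raise ValueError("no image center within the given radius")
    return image_features[mask].mean(axis=0)


# Toy data: three image centers, two of them close to the community.
image_latlon = np.array([[0.0, 0.0], [0.0, 0.025], [0.0, 1.0]])
image_features = np.stack([np.full(512, 1.0),
                           np.full(512, 3.0),
                           np.full(512, 100.0)])
features = community_features((0.0, 0.01), image_latlon, image_features)
```

The resulting 512-dimensional vector per community would then serve as the input to the random forest regressors, with spatial cross-validation folds built from community locations.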
In response to weather shocks, NDVI often changes over time (see Figure 5.1), and its feature maps can potentially capture and reflect these events in poverty predictions. Hence, we also fine-tune the network in the first step with updated NDVI images and nighttime lights. Predictions for out-of-sample periods are obtained by first training a random forest on the previous NDVI feature maps and then testing it on the updated ones.

5.1.3 Experiment Results

Our first experiment is to evaluate the predictive power of NDVI for poverty estimation. Following [101], we use expenditure data from the World Bank's Living Standards Measurement Study (LSMS) and asset index data from the Demographic and Health Surveys (DHS). Table 5.1 shows that our NDVI models are highly predictive of both average household consumption and average asset wealth. Spatially cross-validated predictions explain 34 to 60% of the variation in average consumption and 50 to 75% of the variation in average asset wealth across surveyed countries. In general, our models perform comparably to Jean et al. [101] when fit using data from individual countries.

Figure 5.2: Spatially cross-validated results of NDVI models relative to nightlights and Jean et al. [101]. Nightlight-based models are random forests trained on scalar nighttime light intensities. The top figure shows r² values for estimating consumption using pooled observations across the four LSMS countries. We run separate trials for increasing percentages of the pooled dataset (e.g., the x-axis value of 60 indicates all surveyed communities below the 60th percentile of consumption are included). The bottom figure shows similar r² values for estimating asset index. (x-axis: poorest percent of surveyed communities used; y-axis: r²; legend: Nightlights, Jean et al., NDVI; vertical markers in the top figure: 1x, 2x, and 3x poverty line.)
When trained on pooled consumption or asset observations across all countries, our models perform significantly and consistently better than the state-of-the-art method (see Figure 5.2). We see an improvement of more than 100% in r² for asset index predictions for regions below the 2x poverty line, where the poverty line is set at $1.90 per person per day by the World Bank. This observation agrees with our intuition that extremely poor communities depend most heavily on crop production. The modest improvements in consumption prediction can be partly explained by the fact that consumption data is noisier [101, 217].

In the second experiment, we study whether temporal changes in NDVI are indicative of poverty changes. Because surveys from different years are generally conducted at different locations, we limit this experiment to 209 communities in Uganda that are part of both the 2011-2012 and 2013-2014 LSMS surveys. As the 2011-2012 East Africa drought affected a large area of Uganda, the consumption distribution for these communities changes quite significantly between the surveys. Figure 5.3 shows that our random forest model can translate the increase in annual NDVI from 2011 to 2013 to reflect increased consumption in the poorest communities following the drought.

Figure 5.3: Consumption predictions for LSMS communities in Uganda made by a random forest model trained on 2011 data and tested on 2013 data. The top figure shows the ground-truth consumption along with predictions for LSMS communities ordered by 2011 data. The bottom figure shows RMSE values of the predictions for increasing percentages of the LSMS communities (e.g., the x-axis value of 60 indicates all communities below the 60th percentile in 2011 consumption are included). (Top: log consumption in $/person/day against community index, with series 2011 Ground Truth, 2013 Prediction, and 2013 Ground Truth. Bottom: RMSE against poorest percent of surveyed communities used, with series 2013 Prediction and 2011 Ground Truth.)
In contrast, models that rely on static inputs such as Jean et al. [101] can only perform as well as the 2011 ground truth when tested on the 2013 data.

Conclusion. In this section, we have leveraged CNNs to extract features from NDVI images that are highly predictive of poverty. We demonstrate that publicly available, moderate-resolution NDVI can predict poverty as well as high-resolution images constrained by Google's licensing terms. Our model based on NDVI can also produce dynamic poverty estimates, potentially helping policymakers make more informed and timely decisions.

5.2 Deep Denoising for Scientific Discovery2

Despite significant advances in imaging technology [134, 158, 268], scientific images are still often corrupted by noise during signal generation or detection, which requires denoising procedures to restore information for scientific discovery. While deep denoising models have been tremendously successful on natural images [43, 262], the potential of these techniques has barely been explored in the context of scientific imaging, where, in contrast to traditional denoising setups, labeled datasets are typically not available in large quantities.

To address this issue, we propose a simulation-based denoising (SBD) framework in which denoising models are trained on simulated images of transmission electron microscopy (TEM), a powerful technique for probing the atomic-level structure and composition of a wide range of materials [210, 219]. Our contributions are threefold. First, we propose an architecture for simulation-based data that outperforms existing techniques by a wide margin on held-out simulated data as well as on real TEM measurements. Second, we demonstrate that standard performance metrics for photographs often fail to produce a scientifically meaningful evaluation of the denoising results, and propose new scientifically motivated metrics.
Third, we propose a likelihood-based visualization of the agreement between the observed measurements and structures of interest, such as atomic columns, in the denoised image.

2 Joint work with Sreyas Mohan, Ramon Manzorro, Joshua L. Vincent, David S. Matteson, Peter A. Crozier, and Carlos Fernandez-Granda.

5.2.1 Related Work

Denoising in Scientific Imaging. A wide variety of denoising methods have been applied across different scientific imaging modalities, including traditional linear filters [165], nonlinear filters [104, 160, 221], wavelet-based methods [36, 159, 179, 267], and sparsity-based approaches [19, 159]. Several works on denoising in scientific domains exist, covering low-dose computed tomography [112], positron-emission tomography [76], and scintillation-camera data [161]. While [60, 74, 225] apply CNNs to denoise simulated electron microscopy data without validating on real data, [153] trains CNNs to denoise Raman scattering microscopy data, using measurements gathered at a higher signal-to-noise ratio (SNR) as ground-truth images. These results showcase the potential of deep denoising for scientific imaging, but also the challenge of gathering adequate datasets to train the deep networks.

Deep Learning for TEM. Deep models have been applied to other image-processing tasks in TEM beyond denoising (see [61] for a comprehensive review). [218] proposes a CNN-based method for TEM image super-resolution, wherein CNNs are trained on pairs of low-resolution and high-resolution images acquired experimentally. [93] applies CNNs to perform segmentation and systematically studies the influence of the design of the training dataset and network architecture on the generalization capabilities of these models. In this work, we provide a similar analysis for denoising. [151] and [184] propose CNN-based methods to identify structures of interest in TEM images by training on carefully designed simulated data and show that the models generalize to real data.
Our work provides further evidence that CNNs trained on simulated data can generalize effectively to real measurements.

5.2.2 Methodology

Simulation-Based Denoising. Simulation-based denoising (SBD) consists of three stages: simulation of the training set, training of the CNNs using the simulated data, and inference on the real data (see Figure 5.4). In order to generate the training set, we simulate clean images $x_1, \ldots, x_N \in \mathbb{R}^M$ (where M is the number of pixels) according to a predefined physical model. These clean images are then corrupted using a noise model, which can follow a predefined model or be learned from the data, to generate the simulated noisy data.

Learning Objective. Let $Y(x_i)$ denote the random vector representing the noisy image corresponding to the clean simulated image $x_i$ and let $y(x_i)$ represent a realization of $Y(x_i)$. The denoising model $f_\theta : \mathbb{R}^M \to \mathbb{R}^M$, parametrized by the weights $\theta$, aims to minimize a loss function $L : \mathbb{R}^M \times \mathbb{R}^M \to \mathbb{R}$ which quantifies how close the estimate from the CNN $f_\theta(y(x_i))$ is to the clean image $x_i$. In our case study, we use mean squared error, which is a standard choice in CNN-based denoising [262]. More concretely, during the training stage, we compute the parameters by solving

\[ \hat{\theta} = \arg\min_\theta \, \mathbb{E}\left[ \sum_{i=1}^N L(f_\theta(Y(x_i)), x_i) \right] = \arg\min_\theta \, \mathbb{E}\left[ \sum_{i=1}^N \| f_\theta(Y(x_i)) - x_i \|_2^2 \right], \tag{5.1} \]

where the expectation is taken over the noise model and approximated by drawing new realizations of the noisy image $Y(x_i)$ every time we compute the gradient. Once the network is trained, it can be directly applied to new noisy images to perform denoising.

Exploiting Non-Local Signal Structure. While current state-of-the-art networks for denoising photographic images have very small fields of view, the TEM images in our case study exhibit very prominent global regularities, due to periodicity in the atomic structure of the imaged materials. In addition, electron-microscopy images are often measured at very low SNRs.
Hence, we propose to denoise TEM data using UNet network architectures [198] with very large fields of view: 221 × 221 pixels and 893 × 893 pixels. Table 5.2 compares the influence of the field of view in denoising photographic and TEM images. For photographic images, the performance of the network remains almost constant as we increase the field of view. In contrast, for TEM images, increasing the field of view produces a dramatic improvement in performance (roughly 6 dB and 12 dB in PSNR when the field of view is 221 × 221 and 893 × 893, respectively, according to Table 5.2). Increasing the number of parameters while keeping the field of view constant has a very modest effect, which suggests that the increase in field of view is the reason for the improvement.

(a) TEM Images

MODEL                              Parameters   Field of View   PSNR            SSIM
SBD + DnCNN [262]                  668K         41 × 41         30.47 ± 0.64    0.93 ± 0.01
SBD + Small UNet [263]             233K         45 × 45         30.87 ± 0.56    0.93 ± 0.01
SBD + UNet (32 base channels)      352K         221 × 221       36.39 ± 0.77    0.98 ± 0.01
SBD + UNet (64 base channels)      1.41M        221 × 221       37.24 ± 0.76    0.99 ± 0.01
SBD + UNet (128 base channels)     5.61M        221 × 221       38.05 ± 0.81    0.99 ± 0.01
SBD + UNet (128 base channels)     70.15M       893 × 893       42.87 ± 1.45    0.99 ± 0.01

(b) Photographic Images

MODEL   Parameters   Field of View   PSNR (σ = 30)   PSNR (σ = 70)   SSIM (σ = 30)   SSIM (σ = 70)
UNet    102K         49 × 49         29.67 ± 2.84    26.16 ± 2.79    0.83 ± 0.06     0.70 ± 0.09
UNet    352K         221 × 221       29.65 ± 2.76    26.08 ± 2.68    0.83 ± 0.05     0.70 ± 0.08
UNet    4.4M         893 × 893       29.54 ± 2.82    26.07 ± 2.80    0.83 ± 0.06     0.70 ± 0.09

Table 5.2: Field of view of CNN architectures and performance. Mean PSNR and SSIM (± standard deviation) of different CNN architectures on the (a) held-out simulated test set of TEM data described in Section 5.2.4 and (b) validation set of the DIV2K photographic image dataset [2].
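The learning objective in Equation 5.1 can be illustrated with a short NumPy sketch in which a fresh noisy realization is drawn at every evaluation. This is a toy example under stated assumptions, not the training code: the name `sbd_loss` is hypothetical, the noise model is taken to be i.i.d. Poisson with the clean pixel values as rates, and the two "denoisers" are simple stand-ins for a CNN chosen only because the toy clean images are constant.

```python
import numpy as np


def sbd_loss(denoiser, clean_images, rng):
    """One Monte Carlo estimate of the objective in Eq. 5.1.

    A new noisy image Y(x_i) is sampled on every call, mirroring the idea
    of redrawing the noise each time the gradient is computed.
    """
    total = 0.0
    for x in clean_images:
        y = rng.poisson(x).astype(float)  # assumed Poisson noise model
        total += float(np.sum((denoiser(y) - x) ** 2))
    return total


def identity(y):
    """Baseline that returns the noisy image unchanged."""
    return y


def global_mean(y):
    """Toy denoiser: replace every pixel by the image mean."""
    return np.full_like(y, y.mean())


rng = np.random.default_rng(0)
# Constant toy "clean images" with low counts per pixel.
clean = [np.full((16, 16), 0.45), np.full((16, 16), 2.0)]
loss_identity = sbd_loss(identity, clean, rng)
loss_mean = sbd_loss(global_mean, clean, rng)
```

In an actual training loop the denoiser would be the CNN $f_\theta$ and the loss would be differentiated with respect to $\theta$; the sketch only shows how the expectation over the noise is approximated by resampling.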
Figure 5.4: Simulation-based denoising framework. (Top) A training dataset is generated by simulating TEM images of different structures at varying imaging conditions. (Middle) A CNN is trained using the simulated images, paired with noisy counterparts obtained by simulating the relevant noise process. (Bottom) The trained CNN is applied to real data to yield a denoised image, and a likelihood map is generated to quantify the agreement between this structure and the noisy data.

Likelihood Maps. In most applied domains, the goal of denoising is to uncover image structure of scientific interest. In our case study, this corresponds to the location and intensity of projected columns of atoms in a catalytic nanoparticle that is surrounded by a vacuum. Quantifying to what extent such structure is consistent with the observed measurements is therefore of great interest. We propose to achieve this by computing the likelihood of the data with respect to meaningful features identified in the denoised image.

Figure 5.5: Likelihood map, with panels (a) Data, (b) Denoised image, (c) Likelihood map, and (d) Zoomed. When the simulated noisy image in (a) is denoised using the proposed framework (b), a spurious atom appears at the left edge of the nanoparticle (see zoomed image (d)). The value of the likelihood map (c) at that location is very low, indicating that the presence of an atom is less consistent with the observed data than its absence.

The general procedure, and its implementation in our case study, are as follows:

1. Identify a region of interest R. In our case, we locate atomic columns using blob detection [139].

2. Fit a low-dimensional model to the denoised image within the region of interest.
We assume that the intensity of each atomic column and the vacuum are constant and perform averaging over all denoised pixels in R.

3. Compute the likelihood of the noisy data in R with respect to the estimated pixel values. In our case, the noise is approximately i.i.d. Poisson, so the likelihood is given by

\[ L(R) := \prod_{i \in R} p_{x_i}(y_i), \tag{5.2} \]

where $y_i$ denotes the noisy value in the ith pixel, and $p_{x_i}$ is a Poisson probability mass function (pmf) with rate parameter $x_i$.

Figure 5.6: Denoising results for real data, with panels (a) Data, (b) Spot Filter, (c) PURE-LET, (d) SBD, and (e) Likelihood Map. (a) An experimentally acquired atomic-resolution transmission electron microscope image of a CeO2-supported Pt nanoparticle. The average image intensity is 0.45 electrons/pixel (i.e., a large fraction of pixels register zero electrons), which results in an extremely low signal-to-noise ratio. (b) Denoised image obtained via Fourier-based filtering by a domain expert. (c) Denoised image obtained via the wavelet-based PURE-LET method [147]. (d) Denoised image obtained by the proposed simulation-based denoising (SBD) framework. (e) Likelihood map quantifying to what extent the atomic structure identified from the SBD denoised image is consistent with the data. Regions in red are more likely to correspond to atomic columns in the nanoparticle. Regions in blue are more likely to belong to the vacuum.

Figure 5.6 and Figure 5.5 show likelihood maps for the real data and for a simulated example. In the simulated example (Figure 5.5), a spurious atom is detected at the left end of the zoomed region. However, the value of the likelihood map at that location is very low, which indicates that the presence of an atom is not consistent with the observed data at that location. Figure 5.7 shows the distribution of the log-likelihood ratio of over 25,000 regions of interest extracted from the surface of over 1,550 denoised images obtained from the dataset.
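Steps 2 and 3 of the procedure, together with Equation 5.2, can be sketched in a few lines of standard-library Python. The sketch works in the log domain for numerical stability. Everything beyond Equation 5.2 is an assumption made for illustration: the function names, the flat-list image representation, the toy pixel values, and in particular the definition of the log-likelihood ratio as the atom hypothesis (region mean of the denoised image) against a hypothetical constant vacuum rate, since the text does not spell out how the ratio is formed.

```python
import math


def poisson_log_pmf(count, rate):
    """Log of the Poisson pmf p_rate(count); rate must be positive."""
    return count * math.log(rate) - rate - math.lgamma(count + 1)


def region_log_likelihood(noisy, region, rate):
    """log L(R) from Eq. 5.2 with a constant rate inside the region."""
    return sum(poisson_log_pmf(noisy[i], rate) for i in region)


def log_likelihood_ratio(noisy, denoised, region, vacuum_rate):
    """Atom hypothesis (mean denoised intensity in R) against vacuum.

    The comparison against a fixed vacuum rate is an assumed reading.
    """
    atom_rate = sum(denoised[i] for i in region) / len(region)
    return (region_log_likelihood(noisy, region, atom_rate)
            - region_log_likelihood(noisy, region, vacuum_rate))


# Toy flattened images: pixels 2 and 3 form a detected atomic column.
noisy = [0, 1, 3, 2, 0, 0]
denoised = [0.4, 0.5, 2.4, 2.6, 0.5, 0.4]
region = [2, 3]
ratio = log_likelihood_ratio(noisy, denoised, region, vacuum_rate=0.45)
```

A positive ratio means the noisy counts in the region are better explained by the fitted atomic-column intensity than by the vacuum level; a strongly negative value would flag a likely spurious atom.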
As shown in Figure 5.7, the log-likelihood ratio values of spurious atoms (false positives, (a)) are much lower than those of correctly identified atoms (true positives, (b)). When the network fails to detect atoms (false negatives, (c)), we observe that the log-likelihood ratio in such regions tends to be high. It is worth noting that the occurrence of spurious and missing atoms in the denoised images is quite rare: out of the 25,732 regions identified, only 2,457 and 2,368 regions correspond to spurious and missing atoms, respectively.

Figure 5.7: Distribution of likelihood ratio, with panels (a) False positives, (b) True positives, and (c) False negatives (y-axis: log-likelihood ratio). The figure shows the distribution of the log-likelihood ratio of over 25,000 regions of interest computed from the surface of 1,550 denoised images using the dataset. The regions containing spurious atoms (false positives, (a)) have a much lower log-likelihood ratio than the regions containing accurately recovered atoms (true positives, (b)). Regions where existing atoms were not detected (false negatives, (c)) have a higher log-likelihood ratio, comparable to that of the regions with accurately recovered atoms. The occurrence of missing and spurious atoms in denoised images is quite rare: out of the 25,732 regions of interest, only 2,457 and 2,368 were false positives and false negatives, respectively.

5.2.3 Datasets

The TEM image data used in this work correspond to images from a widely utilized catalytic system, which consists of platinum (Pt) nanoparticles supported on a larger cerium (IV) oxide (CeO2) nanoparticle. This bi-functional catalytic system is ubiquitously used in clean energy conversion and environmental remediation applications, in addition to a broad range of other chemical reactions [162, 168, 256].
From a general point of view, this system can be considered as a model for supported nanoparticle catalysts, since a large number of heterogeneous catalysts are based on metallic nanoparticles supported over different oxides. Thus, results and conclusions extracted from the current work are relevant to a great number of similar samples in the field of catalysis (e.g., oxide crystals supporting metal nanoparticles).

Real Data. The real data used to test the proposed SBD framework consist of a series of images of the Pt/CeO2 catalyst. The images were acquired in a N2 gas atmosphere using an aberration-corrected FEI Titan transmission electron microscope (TEM), operated at 300 kV and coupled with a Gatan K2 IS direct electron detector. The detector was operated in electron counting mode with a time resolution of 0.025 sec/frame and an incident electron dose rate of 5,000 e−/Å2/s. The electromagnetic lens system of the microscope was tuned to achieve a highly coherent parallel beam configuration with minimal low-order aberrations (e.g., astigmatism, coma), and a third-order spherical aberration coefficient of approximately -13 µm.

Simulation Dataset. The simulated TEM image dataset was generated using the multi-slice TEM image simulation method, as implemented in the Dr. Probe software package [15]. Images were simulated with 1024 × 1024 pixels and then binned to match the approximate pixel size of the experimentally acquired image series. To equate the intensity range of the simulated images with those acquired experimentally, the intensities of the simulated images were scaled by a factor which equalized the vacuum intensity in a single simulation to the average intensity measured over a large area of the vacuum in a single 0.025 second experimental frame (i.e., 0.45 counts per pixel in the vacuum region).

5.2.4 Experiments Results

Experiment Setups.
We use CNNs with the proposed UNet architecture with 128 base channels and 6 scales in all of our experiments. The networks were trained on 400×400 patches extracted from the training images and augmented with horizontal flipping, vertical flipping, random rotations between −45◦ and +45◦, and random resizing by a factor of 0.75-0.82. The models were trained using the Adam optimizer [114], with a default starting learning rate of 10−3, which was reduced by a factor of 2 every time the validation PSNR plateaued. Training was terminated via early stopping based on validation PSNR. The de- tails of training, validation and test data for each experiment are provided in the corresponding section. Since the models are trained on 400 × 400 patches, when applying them to larger images we divide the images into overlapping 400× 400 patches, denoise them, and then combine them via averaging. PSNR and SSIM. The imaging parameters of the real data are well described by the white contrast category. We therefore used the subset of simulated dataset corresponding to this contrast (5583 images) to compare our proposed method- ology to other models. 90% of the data were used for training. The remaining 559 images were evenly split into validation and test sets. We compare our pro- posed UNet architecture with two state-of-the-art architectures for photographic- image denoising [262, 263], and with several classical denoising methods: low- pass filtering [165], adaptive Wiener filtering [136], BM3D [152], non-local means 67 METHODS PSNR SSIM Raw 3.56 ± 0.03 0.00 ± 0.00 Low Pass Filter [165] 21.59 ± 0.07 0.44 ± 0.03 Adaptive Wiener Filter [136] 22.42 ± 1.08 0.63 ± 0.02 VST + NLM [30] 26.55 ± 0.16 0.73 ± 0.01 VST + BM3D [152] 25.27 ± 0.15 0.80 ± 0.01 PURE-LET [147] 28.36 ± 0.88 0.93 ± 0.01 SBD + DnCNN [262] 30.47 ± 0.64 0.93 ± 0.01 SBD + Small UNet [263] 30.87 ± 0.56 0.93 ± 0.01 SBD + Proposed Architecture 42.87 ± 1.45 0.99 ± 0.01 Table 5.3: Results on simulated test data. 
Mean PSNR and SSIM (± standard deviation) of different denoising methods on the held-out simulated test set described in Section 5.2.4. SBD approaches achieve the best results. SBD combined with the proposed architecture outperforms all other techniques by about 12 dB. The performance of SBD applied to additional architectures is reported in Table 5.2.

[30] and a wavelet-based method known as PURE-LET [147]. For all methods, hyperparameters were chosen based on the validation data. Performance was measured in terms of SSIM [241] and peak signal-to-noise ratio (PSNR).

The results demonstrate that SBD is an effective denoising methodology for TEM data. Our proposed CNN outperforms all other methods by a margin of 12 dB in PSNR on the simulated test data, as shown in Table 5.3. SBD recovers the overall shape of the nanoparticle, the interface between the nanoparticle and the support, and the different periodic patterns of the CeO₂ support and Pt nanoparticle. Contrast features, such as subtle patterns of bright, intermediate, and dark features associated with the atomic structure of the CeO₂ crystal, are well reproduced in the images denoised via SBD, but are mostly absent from the results of the baseline approaches.

Metrics Beyond PSNR. Domain scientists denoise images in order to extract scientifically relevant information. In our case, the atoms on the surface of

Figure 5.8: Performance of SBD in terms of our proposed metrics. We compute all our proposed metrics on over 7,000 denoised images corresponding to 25 unique noisy images sampled from the 308 clean images. The empirical distribution on the surface (red) and bulk (green) is visualized as box plots indicating the median, 25th percentile, 75th percentile, minimum, and maximum value of the distribution. SBD has a near perfect performance in the bulk with all metric values hovering around 1.
On the surface, SBD achieves a median score of 1 for precision and recall, and about 0.95 for F1 score and Jaccard index.

nanoparticles are of particular interest, because the atomic configuration at the surface regulates the nanoparticle’s ability to catalyze chemical reactions. It is therefore of critical importance to understand how different denoising methods recover these atoms. We can verify visually that SBD achieves a largely successful recovery in held-out simulated data, whereas the baseline methods do not. However, visual inspection is a limited and non-quantitative evaluation tool. Unfortunately, standard metrics like PSNR and SSIM are insensitive to changes in the atomic structure of the nanoparticle surface, because these changes have a small effect on the overall intensity of the images.

To define metrics that evaluate detection of surface atoms, we apply a blob detection algorithm, e.g. Laplacian of Gaussian [139], to locate the atom centers, and compute the α-shape of all the atom centers using Delaunay triangulation [180]. We propose the following four metrics to measure the fidelity of the recovered structure: precision, recall, F1 score, and Jaccard index.

[Figure 5.9 plot. Legend: Noisy Data (40-Frames Average), Denoised Data (40-Frames Average). Horizontal axis: Pixel Coordinate.]
Figure 5.9: Validation on real data. The real data consist of 40 frames which are approximately stationary and aligned. Their temporal average (left) therefore provides a reasonable estimate for the true intensity profile. In the image on the right, we compare the average intensity profile on the surface atomic columns of the platinum nanoparticle for the denoised data (middle) and the temporal average (left). The profiles are very similar (except for some spurious fluctuations in the temporal average), which suggests that the proposed approach achieves effective denoising on the real data.

Performance on Real Data.
In the experiments reported in the preceding parts of Section 5.2.4, we used a network trained on all simulated images from the white contrast category. However, the real data described in Section 5.2.2 correspond more closely to a subset of white contrast images satisfying the following conditions: structure limited to PtNP2, thickness between 40 Å and 60 Å, and defocus between 5 nm and 10 nm. We used 236 images from this subset for training, and another 15 such images for validation. We also trained two state-of-the-art architectures for photographic image denoising, DnCNN [262] and DURR [263], on these data.

Results on real experimental data obtained using SBD trained on this relevant subset of white contrast images are shown in Figure 5.6. SBD produces denoised images that are of much higher quality than those of the baseline methods described in Section 5.2.4, which contain obvious artefacts. Further, we validate the denoising results of SBD by comparing them to an estimated reference image obtained by temporal averaging. Our real dataset consists of 40 frames that are approximately stationary and aligned. Therefore, their temporal average provides a good estimate for the ground-truth images. As shown in Figure 5.9, the denoised intensity values of the atomic column approximately match those of the estimated reference image.

In the rest of this section, we compare the performance of SBD and unsupervised denoising techniques on the real experimental data, and analyze the effect of the design of the training dataset on the denoised output produced by SBD.

Discussion and Conclusions. Our case study is a proof of concept that CNNs trained on simulated data can be remarkably effective when applied to real imaging data.
It provides several insights and suggests future research directions that are relevant, beyond electron microscopy, to other domains where the images of interest can be simulated, such as medical imaging [112, 161], other types of microscopy [74, 225], or astronomy [177].

We show that the design of the training dataset is critical, so an important question is how to design simulated training datasets in a principled, systematic way. Answering it will require a deeper understanding of the generalization ability of CNNs with respect to variations in the statistics of the input images. We also demonstrate that architectures tailored to photographic imaging can perform poorly when applied to other data. Designing CNNs for other domains requires an understanding of the image features that are exploited for denoising. Gradient visualization is shown to be useful here, but more advanced visualization techniques are needed. In addition, we demonstrate that standard metrics used to quantify performance in photographs may not be sensitive to scientifically relevant features, and propose several new metrics to address this problem. Although SBD outperforms other methods by a large margin, some artefacts such as phantom atoms still appear. Our proposed likelihood maps help to flag such events, but may still fail to do so in regions of unusually low SNR. Developing more sophisticated methods for uncertainty quantification is therefore a key research direction. It would also be of great interest to develop unsupervised or self-supervised denoising approaches that are effective with small amounts of data at low SNRs. Finally, to encourage further development of deep-learning methodologies for scientific imaging, we release a denoising benchmark dataset of TEM images, containing 18,000 examples.

CHAPTER 6
FINAL REMARKS

In this dissertation, we have studied the problem of probabilistic modeling of sequential data from three perspectives.
In Chapter 2, we consider continual learning settings where models are often prone to catastrophic forgetting when exposed to incrementally available data from non-stationary distributions. By modelling sample similarities and promoting information sharing in the form of random graphs, we provide an effective mechanism against such a problem. In Chapter 3 and Chapter 4, we combine state space models and transformer architectures for time series forecasting and video prediction, respectively, aiming to capture complex temporal dynamics and unravel high-dimensional inputs via latent variable models.

Despite their limited scope, these works demonstrate that capturing temporal structures within sequential data is critical to downstream applications. As prediction and anticipation of future events are key components of intelligent systems, we hope that future work will further explore the topic and realize the potential of sequential data for positive impact on human society.

APPENDIX A
GRAPH-BASED CONTINUAL LEARNING

Experiment Setup. We perform experiments on six commonly used classification datasets: PERMUTED MNIST, ROTATED MNIST [125], SPLIT SVHN [166], SPLIT CIFAR10 [121], SPLIT CIFAR100 [121], and SPLIT MINIIMAGENET [230].

• PERMUTED MNIST [77] is a variant of the MNIST dataset of handwritten digits [125], where each task applies a fixed random pixel permutation to the original dataset. The benchmark dataset consists of 20 tasks, each with 1000 samples from 10 different classes.
• ROTATED MNIST [143] is another variant of the MNIST dataset of handwritten digits [125], where each task applies a fixed random image rotation to the original dataset. The benchmark dataset consists of 20 tasks, each with 1000 samples from 10 different classes.
• SPLIT SVHN is a variant of the SVHN dataset [166] that consists of 5 tasks, each with two consecutive classes.
Since the benchmark dataset is much more challenging than the MNIST variants, we use all of its 73,257 training samples (i.e. roughly 14,650 samples per task) to train our model and the baselines.
• SPLIT CIFAR10 is a variant of the CIFAR-10 dataset [121]. Similar to SPLIT SVHN, the benchmark dataset consists of 5 tasks, each with two consecutive classes. We use all of its 50,000 training samples (i.e. 10,000 samples per task) to train our model and the baselines.
• SPLIT CIFAR100 is a variant of the CIFAR-100 dataset [121]. The benchmark dataset consists of 20 tasks, each with 5 consecutive classes. We use all of its 50,000 training samples (i.e. 2,500 samples per task) to train our model and the baselines.
• SPLIT MINIIMAGENET is a variant of the MINIIMAGENET dataset [230]. The benchmark dataset consists of 20 tasks, each with 5 consecutive classes. We use all of its 50,000 training samples (i.e. 2,500 samples per task) to train our model and the baselines. Each image is resized to 84 × 84 pixels.

Model Architectures. As mentioned, while most previous work uses multi-head architectures and assumes knowledge of task boundaries at test time, we employ a shared classifier head for all tasks. For the MNIST datasets, the image encoders fθ1 (for graph construction) and fθ2 (for latent computation) share a multi-layered perceptron with two hidden layers of 256 ReLU neurons, followed by two separate linear mappings, one for each of the encoders. For SPLIT SVHN, SPLIT CIFAR10, SPLIT CIFAR100, and SPLIT MINIIMAGENET, the image encoders share a simple convolutional network with the following structure: conv 64 → conv 64 → maxpool → conv 64 → conv 64 → maxpool → conv 64 → conv 64 → maxpool, where conv NF is a 3 × 3 convolution with NF output filters, BatchNorm, and ReLU activations. For all datasets, another linear mapping follows the image encoder fθ1 before a Gaussian kernel computes the similarities between image embeddings.
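As a rough illustration of this similarity step, the sketch below computes pairwise Gaussian-kernel similarities between embedding vectors. It assumes a kernel of the form exp(−(τ/2)‖u_i − u_j‖²) with a bandwidth parameter τ; the function names, the value of τ, and the toy embeddings are hypothetical and are not taken from our implementation.

```python
import math


def gaussian_kernel(u, v, tau=1.0):
    """Similarity exp(-(tau / 2) * ||u - v||^2) between two embedding vectors.

    The kernel form and the default tau are assumptions made for illustration.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-0.5 * tau * sq_dist)


def similarity_matrix(embeddings, tau=1.0):
    """Dense matrix of pairwise kernel similarities between all embeddings."""
    return [[gaussian_kernel(u, v, tau) for v in embeddings] for u in embeddings]


# Toy two-dimensional embeddings (hypothetical values).
embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
K = similarity_matrix(embeddings, tau=1.0)

for row in K:
    print(["%.4f" % value for value in row])
```

Each entry lies in (0, 1], equals 1 on the diagonal, and decays with the squared distance between embeddings, so nearby embeddings are treated as more similar than distant ones.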
Finally, the classifier head consists of a ReLU activation and a single linear mapping.

Baseline Architectures. We use the same neural network architectures for all the baselines described in this chapter: a multi-layered perceptron with two hidden layers of 400 ReLU neurons on PERMUTED MNIST and ROTATED MNIST, following [94], and a ResNet-18 [87] with 20 filters across all layers on other datasets, following [143]. For all datasets, the baselines have more parameters than our corresponding models (see Table A.1 for more details). We adopt the implementations of EWC [117], GEM [143], and MER [195] from the authors’ repositories.¹ ²

Table A.1: Number of trainable parameters in continual learning models.
Dataset               Finetune   EWC      GEM      ER       MER      GCL
SPLIT MNIST           478K       478K     478K     478K     478K     406K
PERMUTED MNIST        478K       478K     478K     478K     478K     406K
ROTATED MNIST         478K       478K     478K     478K     478K     406K
SPLIT SVHN            1.09M      1.09M    1.09M    1.09M    -        326K
SPLIT CIFAR10         1.09M      1.09M    1.09M    1.09M    -        326K
SPLIT CIFAR100        1.09M      1.09M    1.09M    1.09M    -        326K
SPLIT MINIIMAGENET    1.09M      1.09M    1.09M    1.09M    -        343K

Additional Task-Free Baselines. We also note that despite our attempts to tune parameters for MER [195] on SPLIT SVHN and SPLIT CIFAR10, the baseline does not perform reasonably well. The model uses a batch size of 1 and requires multiple passes through the episodic memory per batch, so it is much slower than our model and all other baselines. Due to limited time and computational resources, we do not investigate the baseline further and, in the interest of fairness, refrain from reporting premature results.

However, we include results of CN-DPM [127], a competitive task-free model based on Dirichlet process mixture models, in Table A.2. Our setup for SPLIT CIFAR10 is analogous to that of [127], so we directly quote the numbers for CN-DPM from the paper.
Although CN-DPM performs favorably among task-free approaches to continual learning, including GSS [8], our model outperforms CN-DPM by a significant margin, even when using a smaller memory size.

¹ https://github.com/facebookresearch/GradientEpisodicMemory
² https://github.com/mattriemer/mer

Table A.2: GCL results and CN-DPM results with different memory sizes.
                 SPLIT SVHN                        SPLIT CIFAR10
Method           250             500               500             1000
ER [40]          45.51 ± 3.03    57.51 ± 2.77      36.08 ± 1.09    45.75 ± 1.82
CN-DPM [127]     −               −                 43.07 ± 0.16    45.21 ± 0.18
GCL (Ours)       60.68 ± 1.67    65.79 ± 1.54      53.87 ± 0.97    57.26 ± 0.28

Memory Usage. Both GCL and ER [40] use an episodic memory to store images and labels from past tasks. The only additional memory usage of GCL comes from the context graph G, which is represented by a square matrix whose entries intuitively describe pairwise similarities between such images. Given a memory consisting of |M| images of size C × H × W, it only requires |M|² floating-point numbers to store the matrix.

Table A.3: Memory usage of ER and GCL for various datasets.
Dataset               |M|     Image Size     ER           GCL
PERMUTED MNIST        1000    1 × 28 × 28    3.284 MB     7.284 MB
ROTATED MNIST         1000    1 × 28 × 28    3.284 MB     7.284 MB
SPLIT CIFAR10         250     3 × 32 × 32    3.109 MB     3.359 MB
SPLIT SVHN            250     3 × 32 × 32    3.109 MB     3.359 MB
SPLIT CIFAR100        500     3 × 32 × 32    6.219 MB     7.199 MB
SPLIT MINIIMAGENET    500     3 × 84 × 84    42.408 MB    43.389 MB

As seen from Table A.3, the memory usage of GCL is very similar to that of ER, except when both are very small, as in the case of PERMUTED MNIST and ROTATED MNIST, because (1) continual learning algorithms are often required to use a very small |M| and (2) the cost of storing natural images is often much higher than that of the context graph.
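The graph overhead can be checked with simple arithmetic. The sketch below assumes that every pixel and every graph entry is stored as a 32-bit float and that 1 MB = 10⁶ bytes; neither assumption is stated in the text, and the function names are illustrative. Under these assumptions the context graph adds 0.25 MB for |M| = 250 and 4 MB for |M| = 1000, in line with the differences between the ER and GCL columns of Table A.3 for those rows. The image totals it produces are slightly lower than the ER column, which presumably also accounts for labels and other overhead.

```python
# Back-of-the-envelope storage costs for an episodic memory and its context graph.
# Assumptions (not stated in the text): 32-bit floats everywhere, 1 MB = 10**6 bytes.
BYTES_PER_FLOAT32 = 4


def image_bytes(num_images, channels, height, width):
    """Bytes needed to store the raw images held in the episodic memory."""
    return num_images * channels * height * width * BYTES_PER_FLOAT32


def graph_bytes(num_images):
    """Bytes needed to store a dense |M| x |M| context-graph matrix."""
    return num_images ** 2 * BYTES_PER_FLOAT32


def to_mb(num_bytes):
    """Convert bytes to decimal megabytes."""
    return num_bytes / 1e6


# (dataset, |M|, (C, H, W)) rows taken from Table A.3.
settings = [
    ("PERMUTED MNIST", 1000, (1, 28, 28)),
    ("SPLIT CIFAR10", 250, (3, 32, 32)),
]

for name, m, (c, h, w) in settings:
    images_mb = to_mb(image_bytes(m, c, h, w))
    graph_mb = to_mb(graph_bytes(m))
    print(f"{name}: images {images_mb:.3f} MB, graph {graph_mb:.3f} MB")
```

The image cost grows linearly in |M| while the graph cost grows quadratically, which is why the graph is negligible for small memories of natural images but comparable to the image storage for 1000 MNIST digits.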
As the number of tasks increases, it is perhaps essential to expand the episodic memory, in which case the quadratic growth of the context graph might dominate the linear growth of the image storage (e.g. |M| = 5000 and images are of size 3 × 32 × 32). Although we have not encountered such a problem with GCL in practice, we note that the quadratic growth of the number of entries in the context graph can be reduced to a linear growth in memory requirements. More specifically, each entry is the output of the kernel function κτ (see Section 3), e.g. $\kappa_\tau(u_i, u_j) = \exp\left(-\frac{\tau}{2}\|u_i - u_j\|_2^2\right)$, so we could easily store the |M| intermediate embeddings {u_i} at each step and apply the kernel function on the fly, which is especially beneficial when the u_i are much lower dimensional than the original images.

Additional Experiment Results.

[Figure A.1 plot. Panels: PERMUTED MNIST, ROTATED MNIST, SPLIT SVHN, SPLIT CIFAR10. Legend: Finetune, EWC, GEM, ER, MER, GCL. Vertical axis: Average Accuracy (%).]
Figure A.1: Average accuracy as a function of the number of tasks trained on PERMUTED MNIST, ROTATED MNIST, SPLIT SVHN, and SPLIT CIFAR10.

[Figure A.2 plot. Panels: SPLIT SVHN, SPLIT CIFAR10. Legend: ER, GCL. Horizontal axis: Memory Size. Vertical axis: Average Accuracy (%).]
Figure A.2: Average accuracy as a function of the number of samples in the episodic memory on SPLIT SVHN and SPLIT CIFAR10.

[Figure A.3 plot. Panels: SPLIT SVHN, SPLIT CIFAR10. Legend: ER, Ours. Horizontal axis: Memory Size. Vertical axis: Average Forgetting (%).]
Figure A.3: Average forgetting as a function of the number of samples in the episodic memory on SPLIT SVHN and SPLIT CIFAR10.

APPENDIX B
PROBABILISTIC TRANSFORMER

Time Series Datasets. Following [190, 201], we run experiments on 5 datasets for time series forecasting, including SOLAR [124], ELECTRICITY,¹ TRAFFIC,² TAXI,³ and WIKIPEDIA.⁴
Table B.1 includes more details about the datasets.

For human motion prediction, we run experiments on two datasets, namely Human3.6M [96] and HumanEva-I [208], following [257]. As described in Section 3.5, Human3.6M is a large-scale dataset with 11 subjects performing 15 actions, totaling 3.6 million video frames recorded at 50 Hz. To be consistent with previous work [155, 257], we adopt a 17-joint skeleton, train on 5 subjects (S1, S5, S6, S7, S8), and test on two subjects (S9, S11). For HumanEva-I, we adopt a 15-joint skeleton and use the same training and test split provided in the dataset. As in [257], we predict 2 seconds of future motion conditioned on 0.5 seconds of observed motion for Human3.6M, and 1 second of future motion conditioned on 0.25 seconds of observed motion for HumanEva-I.

¹ https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014
² https://archive.ics.uci.edu/ml/datasets/PEMS-SF
³ https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
⁴ https://github.com/mbohlkeschneider/gluon-ts/tree/mv_release/datasets

Table B.1: Dimension, domain, frequency, total training timesteps, and prediction length of the training datasets used in the experiments.
Dataset        Dimension   Domain    Frequency   Timesteps   Prediction
SOLAR          137         ℝ⁺        Hour        7,009       24
ELECTRICITY    370         ℝ⁺        Hour        5,790       24
TRAFFIC        963         (0, 1)    Hour        10,413      24
TAXI           1,214       ℕ         30-Min      1,488       24
WIKIPEDIA      2,000       ℕ         Day         792         30

Table B.2: Test set NRMSEsum and NDsum of time series models (lower is better). The means and standard deviations are computed over 5 runs using different seeds.
                         SOLAR                            ELECTRICITY                      TRAFFIC
Method                   NRMSEsum        NDsum            NRMSEsum        NDsum            NRMSEsum        NDsum
Transformer-MAF [190]    0.634 ± 0.034   0.323 ± 0.031    0.039 ± 0.00    0.030 ± 0.00     0.363 ± 0.00    0.301 ± 0.02
TimeGrad [189]           0.715 ± 0.046   0.399 ± 0.023    0.039 ± 0.00    0.026 ± 0.00     0.073 ± 0.00    0.055 ± 0.00
ProTran (Ours)           0.579 ± 0.050   0.317 ± 0.027    0.030 ± 0.00    0.022 ± 0.00     0.046 ± 0.01    0.031 ± 0.00

Based on the dataset descriptions of previous work, we assume that the datasets were obtained and curated appropriately, with consent from the people concerned, and that they contain no personally identifiable information or offensive content.

Time Series Forecasting Results. In addition to CRPSsum reported in Section 3.5, we also include experiment results for time series forecasting using two other metrics, namely normalized root mean squared error (NRMSEsum) and normalized deviation (NDsum), in Table B.2. As in [5], we define NRMSEsum as the root mean squared error normalized by the absolute values of targets summed across all time series. NDsum is defined as the mean absolute error between predicted values and targets summed across all time series.

Consistent with the results in Section 3.5, our models perform significantly better than Transformer-MAF [190] and TimeGrad [189], two competitive baselines proposed recently.

Human Motion Prediction Results. Figure B.1 shows that given the same contexts consisting of fixed conditioning pose sequences, our model generates diverse yet sensible pose sequences. The variations in predictions stem from the stochasticity induced by our latent variables at different time steps.

[Figure B.1 plot. Columns: Context, Sample 1, Sample 2, Sample 3, Sample 4, Sample 5.]
Figure B.1: Conditioning pose sequences in green and corresponding predictions in red by ProTran. Solid colors indicate later time steps and faded ones indicate earlier time steps.

Model Architectures. As described in Section 3.5, our models are based on transformer architectures [226] with extensive use of attention modules. For all
experiments, we use a single linear layer to map inputs into fixed-size representations in ℝ¹²⁸ or ℝ²⁵⁶ (see Equation 3.8). We use multihead attention with 8 heads in Equation 3.9, Equation 3.10, and Equation 3.14 to model temporal interactions between latent variables, dependencies on conditional inputs, and interactions of all inputs in posterior distributions. The MLPs in Equation 3.11 and Equation 3.15, as well as the final MLP that maps latent variables to outputs, consist of 2 layers each with ReLU or Tanh activations. We use fixed positional embeddings as in [226] (see Equation 3.8 and Equation 3.12). The LayerNorms in Equation 3.8, Equation 3.9, Equation 3.10, and Equation 3.12 all have learnable parameters with ε = 10⁻⁵.

For time series forecasting, we also employ a learnable embedding layer, the outputs of which are concatenated with the lagged inputs as in [190]. Our objective function (see Equation 3.20) has an L1 reconstruction loss in most cases, except for the TRAFFIC dataset, in which case we replace it with binary cross entropy and enforce outputs to be in the [0, 1] domain.

Table B.3: Number of parameters of Transformer-MAF [190], TimeGrad [189], and ProTran (our model) used in the time-series forecasting experiments.
Method             SOLAR      ELECTRICITY   TRAFFIC      TAXI         WIKIPEDIA
Transformer-MAF    290,181    532,734       1,150,047    1,333,706    2,229,500
TimeGrad           116,959    300,216       1,010,691    1,126,974    3,099,501
ProTran            342,418    464,292       844,998      695,612      1,510,496

Table B.3 and Table B.4 detail the numbers of parameters of our models in the time series and human motion experiments, respectively. In all cases, our model sizes are comparable to or smaller than those of the baselines.

Table B.4: Number of parameters of DLow [257], its conditional VAE model, and ProTran (our model) used in the human-motion prediction experiments.
Method     HUMAN3.6M    HUMANEVA-I
CVAE       725,292      717,174
DLow       2,763,820    2,753,398
ProTran    1,166,704    1,163,626

BIBLIOGRAPHY

[1] Alessandro Achille, Tom Eccles, Loic Matthey, Chris Burgess, Nicholas Watters, Alexander Lerchner, and Irina Higgins. Life-long disentangled representation learning with cross-domain latent homologies. In Advances in Neural Information Processing Systems, pages 9873–9883, 2018.
[2] Eirikur Agustsson and Radu Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017.
[3] Ahmed Alaa and Mihaela van der Schaar. Attentive state-space modeling of disease progression. 2019.
[4] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, and Silvio Savarese. Social LSTM: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–971, 2016.
[5] Alexander Alexandrov, Konstantinos Benidis, Michael Bohlke-Schneider, Valentin Flunkert, Jan Gasthaus, Tim Januschowski, Danielle C Maddix, Syama Rangapuram, David Salinas, Jasper Schulz, et al. GluonTS: Probabilistic and neural time series modeling in Python. Journal of Machine Learning Research, 21(116):1–6, 2020.
[6] Rahaf Aljundi, Eugene Belilovsky, Tinne Tuytelaars, Laurent Charlin, Massimo Caccia, Min Lin, and Lucas Page-Caccia. Online continual learning with maximal interfered retrieval. In Advances in Neural Information Processing Systems, pages 11849–11860, 2019.
[7] Rahaf Aljundi, Klaas Kelchtermans, and Tinne Tuytelaars. Task-free continual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11254–11263, 2019.
[8] Rahaf Aljundi, Min Lin, Baptiste Goujaud, and Yoshua Bengio. Gradient based sample selection for online continual learning. In Advances in Neural Information Processing Systems, pages 11816–11825, 2019.
[9] Evan Archer, Il Memming Park, Lars Buesing, John Cunningham, and Liam Paninski. Black box variational inference for state space models. arXiv preprint arXiv:1511.07367, 2015.
[10] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[11] Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H Campbell, and Sergey Levine. Stochastic variational video prediction. arXiv preprint arXiv:1710.11252, 2017.
[12] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[13] Abhijit Banerjee, Esther Duflo, Nathanael Goldberg, Dean Karlan, Robert Osei, William Parienté, Jeremy Shapiro, Bram Thuysbaert, and Christopher Udry. A multifaceted program causes lasting progress for the very poor: Evidence from six countries. Science, 348(6236):1260799, 2015.
[14] Emad Barsoum, John Kender, and Zicheng Liu. HP-GAN: Probabilistic 3D human motion prediction via GAN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1418–1427, 2018.
[15] Juri Barthel. Dr. Probe: A software for high-resolution STEM image simulation. Ultramicroscopy, 193:1–11, 2018.
[16] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
[17] Luc Bauwens, Sébastien Laurent, and Jeroen VK Rombouts. Multivariate GARCH models: a survey. Journal of Applied Econometrics, 21(1):79–109, 2006.
[18] Justin Bayer and Christian Osendorfer. Learning stochastic recurrent networks. In NIPS 2014 Workshop on Advances in Variational Inference, 2014.
[19] Simon Beckouche, Jean-Luc Starck, and Jalal Fadili. Astronomical image denoising using dictionary learning.
Astronomy & Astrophysics, 556:A132, 2013.
[20] Kathleen Beegle, Luc Christiaensen, Andrew Dabalen, and Isis Gaddis. Poverty in a rising Africa. The World Bank, 2016.
[21] Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V Le. Attention augmented convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3286–3295, 2019.
[22] Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
[23] Konstantinos Benidis, Syama Sundar Rangapuram, Valentin Flunkert, Bernie Wang, Danielle Maddix, Caner Turkmen, Jan Gasthaus, Michael Bohlke-Schneider, David Salinas, Lorenzo Stella, et al. Neural forecasting: Introduction and literature overview. arXiv preprint arXiv:2004.10240, 2020.
[24] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? arXiv preprint arXiv:2102.05095, 2021.
[25] Apratim Bhattacharyya, Bernt Schiele, and Mario Fritz. Accurate and diverse sampling of sequences based on a “best of many” sample objective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8485–8493, 2018.
[26] Mikołaj Bińkowski, Jeff Donahue, Sander Dieleman, Aidan Clark, Erich Elsen, Norman Casagrande, Luis C Cobo, and Karen Simonyan. High fidelity speech synthesis with adversarial networks. arXiv preprint arXiv:1909.11646, 2019.
[27] Joshua Blumenstock, Gabriel Cadamuro, and Robert On. Predicting poverty and wealth from mobile phone metadata. Science, 350(6264):1073–1076, 2015.
[28] George EP Box, Gwilym M Jenkins, Gregory C Reinsel, and Greta M Ljung. Time series analysis: forecasting and control. John Wiley & Sons, 2015.
[29] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.
arXiv preprint arXiv:2005.14165, 2020.
[30] Antoni Buades, Bartomeu Coll, and J-M Morel. A non-local algorithm for image denoising. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 2, pages 60–65. IEEE, 2005.
[31] Judith Butepage, Michael J Black, Danica Kragic, and Hedvig Kjellstrom. Deep representation learning for human motion prediction and classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6158–6166, 2017.
[32] Lucas Caccia, Eugene Belilovsky, Massimo Caccia, and Joelle Pineau. Online learned continual compression with stacked quantization module. arXiv preprint arXiv:1911.08019, 2019.
[33] Defu Cao, Yujing Wang, Juanyong Duan, Ce Zhang, Xia Zhu, Congrui Huang, Yunhai Tong, Bixiong Xu, Jing Bai, Jie Tong, et al. Spectral temporal graph neural network for multivariate time-series forecasting. arXiv preprint arXiv:2103.07719, 2021.
[34] Real Carbonneau, Kevin Laframboise, and Rustam Vahidov. Application of machine learning techniques for supply chain demand forecasting. European Journal of Operational Research, 184(3):1140–1154, 2008.
[35] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, et al. Argoverse: 3D tracking and forecasting with rich maps. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8740–8749. IEEE Computer Society, 2019.
[36] S Grace Chang, Bin Yu, and Martin Vetterli. Adaptive wavelet thresholding for image denoising and compression. IEEE Trans. Image Processing, 9(9):1532–1546, 2000.
[37] Arslan Chaudhry, Puneet K Dokania, Thalaiyasingam Ajanthan, and Philip HS Torr. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In Proceedings of the European Conference on Computer Vision (ECCV), pages 532–547, 2018.
[38] Arslan Chaudhry, Albert Gordo, Puneet K Dokania, Philip Torr, and David Lopez-Paz. Using hindsight to anchor past knowledge in continual learning. arXiv preprint arXiv:2002.08165, 2020.
[39] Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with A-GEM. arXiv preprint arXiv:1812.00420, 2018.
[40] Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K Dokania, Philip HS Torr, and Marc’Aurelio Ranzato. Continual learning with tiny episodic memories. arXiv preprint arXiv:1902.10486, 2019.
[41] Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. WaveGrad: Estimating gradients for waveform generation. arXiv preprint arXiv:2009.00713, 2020.
[42] Xi Chen and William D Nordhaus. Using luminosity data as a proxy for economic statistics. Proceedings of the National Academy of Sciences, 108(21):8589–8594, 2011.
[43] Yunjin Chen and Thomas Pock. Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1256–1272, 2016.
[44] Rewon Child. Very deep VAEs generalize autoregressive models and can outperform them on images. arXiv preprint arXiv:2011.10650, 2020.
[45] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
[46] Kyunghyun Cho, B van Merrienboer, Caglar Gulcehre, F Bougares, H Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), 2014.
[47] Edward Choi, Mohammad Taha Bahadori, Joshua A Kulas, Andy Schuetz, Walter F Stewart, and Jimeng Sun. RETAIN: An interpretable predictive model for healthcare using reverse time attention mechanism.
Ad- vances in Neural Information Processing Systems, pages 3512–3520, 2016. [48] Edward Choi, Mohammad Taha Bahadori, Andy Schuetz, Walter F Stew- art, and Jimeng Sun. Doctor ai: Predicting clinical events via recurrent neural networks. In Machine learning for healthcare conference, pages 301– 318. PMLR, 2016. [49] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for se- quential data. Advances in Neural Information Processing Systems, 28:2980– 2988, 2015. 90 [50] Robert Coop, Aaron Mishtal, and Itamar Arel. Ensemble learning in fixed expansion layer networks for mitigating catastrophic forgetting. IEEE transactions on neural networks and learning systems, 24(10):1623–1634, 2013. [51] Emmanuel de Bézenac, Syama Sundar Rangapuram, Konstantinos Beni- dis, Michael Bohlke-Schneider, Richard Kurle, Lorenzo Stella, Hilaf Has- son, Patrick Gallinari, and Tim Januschowski. Normalizing kalman filters for multivariate time series analysis. Advances in Neural Information Pro- cessing Systems, 33, 2020. [52] Emily Denton and Rob Fergus. Stochastic video generation with a learned prior. In International Conference on Machine Learning, pages 1174–1183. PMLR, 2018. [53] Xinshen Diao, Peter Hazell, and James Thurlow. The role of agriculture in african development. World development, 38(10):1375–1383, 2010. [54] Kamel Didan. MOD13Q1 MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V006. NASA EOSDIS LP DAAC, 2015. [55] Nat Dilokthanakul, Pedro AM Mediano, Marta Garnelo, Matthew CH Lee, Hugh Salimbeni, Kai Arulkumaran, and Murray Shanahan. Deep unsupervised clustering with gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648, 2016. [56] Andreas Doerr, Christian Daniel, Martin Schiegg, Nguyen-Tuong Duy, Stefan Schaal, Marc Toussaint, and Trimpe Sebastian. Probabilistic recur- rent state-space models. In International Conference on Machine Learning, pages 1280–1289. 
PMLR, 2018. 91 [57] Linhao Dong, Shuang Xu, and Bo Xu. Speech-transformer: a no- recurrence sequence-to-sequence model for speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5884–5888. IEEE, 2018. [58] James Durbin and Siem Jan Koopman. Time series analysis by state space methods. Oxford university press, 2012. [59] Sayna Ebrahimi, Mohamed Elhoseiny, Trevor Darrell, and Marcus Rohrbach. Uncertainty-guided continual learning with bayesian neural networks. arXiv preprint arXiv:1906.02425, 2019. [60] Jeffrey M Ede and Richard Beanland. Improving electron micrograph signal-to-noise with an atrous convolutional encoder-decoder. Ultrami- croscopy, 202:18–25, 2019. [61] Jeffrey Mark Ede. Deep learning in electron microscopy. Machine Learning: Science and Technology, 2020. [62] P. Erdös and A. Rényi. On random graphs i. Publicationes Mathematicae Debrecen, 6:290, 1959. [63] Chenyou Fan, Yuze Zhang, Yi Pan, Xiaoyue Li, Chi Zhang, Rong Yuan, Di Wu, Wensheng Wang, Jian Pei, and Heng Huang. Multi-horizon time series forecasting with temporal attention learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2527–2535, 2019. [64] Chrisantha Fernando, Dylan Banarse, Charles Blundell, Yori Zwols, David Ha, Andrei A Rusu, Alexander Pritzel, and Daan Wierstra. Path- 92 net: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017. [65] Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 64– 72, 2016. [66] Marco Fraccaro, Simon Kamronn, Ulrich Paquet, and Ole Winther. A dis- entangled recognition and nonlinear dynamics model for unsupervised learning. 
In Proceedings of the 31st International Conference on Neural Infor- mation Processing Systems, pages 3604–3613, 2017. [67] Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 2207–2215, 2016. [68] Katerina Fragkiadaki, Sergey Levine, Panna Felsen, and Jitendra Malik. Recurrent network models for human dynamics. In Proceedings of the IEEE International Conference on Computer Vision, pages 4346–4354, 2015. [69] Jean-Yves Franceschi, Edouard Delasalles, Mickaël Chen, Sylvain Lam- prier, and Patrick Gallinari. Stochastic latent residual video prediction. arXiv preprint arXiv:2002.09219, 2020. [70] Robert M French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135, 1999. [71] Hang Gao, Huazhe Xu, Qi-Zhi Cai, Ruth Wang, Fisher Yu, and Trevor Darrell. Disentangling propagation and generation for video prediction. 93 In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9006–9015, 2019. [72] Jan Gasthaus, Konstantinos Benidis, Yuyang Wang, Syama Sundar Ran- gapuram, David Salinas, Valentin Flunkert, and Tim Januschowski. Prob- abilistic forecasting with spline quantile function rnns. In The 22nd in- ternational conference on artificial intelligence and statistics, pages 1901–1910. PMLR, 2019. [73] Partha Ghosh, Jie Song, Emre Aksan, and Otmar Hilliges. Learning hu- man motion models for long-term predictions. In 2017 International Con- ference on 3D Vision (3DV), pages 458–466. IEEE, 2017. [74] E Giannatou, G Papavieros, V Constantoudis, H Papageorgiou, and E Gogolides. Deep learning denoising of sem images towards noise- reduced ler measurements. Microelectronic Engineering, 216:111051, 2019. [75] Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. 
Journal of the American statistical Association, 102(477):359–378, 2007. [76] Kuang Gong, Jiahui Guan, Chih-Chieh Liu, and Jinyi Qi. Pet image de- noising using a deep neural network through fine tuning. IEEE Transac- tions on Radiation and Plasma Medical Sciences, 3(2):153–161, 2018. [77] Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient- based neural networks. arXiv preprint arXiv:1312.6211, 2013. [78] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013. 94 [79] Alexander Greaves-Tunnell and Zaid Harchaoui. A statistical investiga- tion of long memory in language and music. In International Conference on Machine Learning, pages 2394–2403. PMLR, 2019. [80] Karol Gregor, George Papamakarios, Frederic Besse, Lars Buesing, and Theophane Weber. Temporal difference variational auto-encoder. In In- ternational Conference on Learning Representations, 2018. [81] Agrim Gupta, Justin Johnson, Li Fei-Fei, Silvio Savarese, and Alexandre Alahi. Social gan: Socially acceptable trajectories with generative adver- sarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2255–2264, 2018. [82] Swaminathan Gurumurthy, Ravi Kiran Sarvadevabhatla, and R Venkatesh Babu. Deligan: Generative adversarial networks for diverse and limited data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 166–174, 2017. [83] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for plan- ning from pixels. In International Conference on Machine Learning, pages 2555–2565. PMLR, 2019. [84] Andrew C Harvey. Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press, 1990. [85] Tyler L Hayes, Nathan D Cahill, and Christopher Kanan. 
Memory effi- cient experience replay for streaming learning. In 2019 International Con- ference on Robotics and Automation (ICRA), pages 9769–9776. IEEE, 2019. 95 [86] Jiawei He, Andreas Lehrmann, Joseph Marino, Greg Mori, and Leonid Sigal. Probabilistic video generation using holistic attribute control. In Proceedings of the European Conference on Computer Vision (ECCV), pages 452–467, 2018. [87] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. [88] J Vernon Henderson, Adam Storeygard, and David N Weil. Measuring economic growth from outer space. American economic review, 102(2):994– 1028, 2012. [89] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glo- rot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta- vae: Learning basic visual concepts with a constrained variational frame- work. 2016. [90] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015. [91] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion proba- bilistic models. arXiv preprint arXiv:2006.11239, 2020. [92] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neu- ral computation, 9(8):1735–1780, 1997. [93] James P Horwath, Dmitri N Zakharov, Remi Megret, and Eric A Stach. Understanding important features of deep learning models for segmen- tation of high-resolution transmission electron microscopy images. npj Computational Materials, 6(1):1–9, 2020. 96 [94] Yen-Chang Hsu, Yen-Cheng Liu, Anita Ramasamy, and Zsolt Kira. Re- evaluating continual learning scenarios: A categorization and case for strong baselines. arXiv preprint arXiv:1810.12488, 2018. [95] Rob Hyndman, Anne B Koehler, J Keith Ord, and Ralph D Snyder. Fore- casting with exponential smoothing: the state space approach. 
Springer Science & Business Media, 2008. [96] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3. 6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE transactions on pattern analysis and machine intelligence, 36(7):1325–1339, 2013. [97] David Isele and Akansel Cosgun. Selective experience replay for lifelong learning. In Thirty-second AAAI conference on artificial intelligence, 2018. [98] Ashesh Jain, Amir R Zamir, Silvio Savarese, and Ashutosh Saxena. Structural-rnn: Deep learning on spatio-temporal graphs. In Proceedings of the ieee conference on computer vision and pattern recognition, pages 5308– 5317, 2016. [99] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016. [100] Andrew H Jazwinski. Stochastic processes and filtering theory. Courier Cor- poration, 2007. [101] Neal Jean, Marshall Burke, Michael Xie, W Matthew Davis, David B Lo- bell, and Stefano Ermon. Combining satellite imagery and machine learn- ing to predict poverty. Science, 353(6301):790–794, 2016. 97 [102] Ghassen Jerfel, Erin Grant, Tom Griffiths, and Katherine A Heller. Rec- onciling meta-learning and continual learning with online mixtures of tasks. In Advances in Neural Information Processing Systems, pages 9119– 9130, 2019. [103] Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc V Gool. Dynamic filter networks. Advances in neural information processing systems, 29:667– 675, 2016. [104] Wen Jiang, Matthew L Baker, Qiu Wu, Chandrajit Bajaj, and Wah Chiu. Applications of a bilateral denoising filter in biological electron mi- croscopy. Journal of structural biology, 144(1-2):114–122, 2003. [105] Beibei Jin, Yu Hu, Qiankun Tang, Jingyu Niu, Zhiping Shi, Yinhe Han, and Xiaowei Li. Exploring spatial-temporal multi-frequency analysis for high-fidelity and temporal-consistency video prediction. 
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4554–4563, 2020. [106] Kiersten Johnson and Molly E Brown. Environmental risk factors and child nutritional status and survival in a context of climate variability and change. Applied Geography, 54:209–221, 2014. [107] Nal Kalchbrenner, Aäron Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Video pixel networks. In International Conference on Machine Learning, pages 1771–1779. PMLR, 2017. [108] Maximilian Karl, Maximilian Soelch, Justin Bayer, and Patrick Van der 98 Smagt. Deep variational bayes filters: Unsupervised learning of state space models from raw data. arXiv preprint arXiv:1605.06432, 2016. [109] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progres- sive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017. [110] Ronald Kemker and Christopher Kanan. Fearnet: Brain-inspired model for incremental learning. arXiv preprint arXiv:1711.10563, 2017. [111] Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understand- ing. In Proceedings of NAACL-HLT, pages 4171–4186, 2019. [112] Byeongjoon Kim, Minah Han, Hyunjung Shim, and Jongduk Baek. A performance comparison of convolutional neural network-based image denoising methods: The effect of loss functions on low-dose ct images. Medical physics, 46(9):3906–3923, 2019. [113] Hyunjik Kim, Andriy Mnih, Jonathan Schwarz, Marta Garnelo, Ali Es- lami, Dan Rosenbaum, Oriol Vinyals, and Yee Whye Teh. Attentive neural processes. arXiv preprint arXiv:1901.05761, 2019. [114] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic opti- mization. arXiv preprint arXiv:1412.6980, 2014. [115] Diederik P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. arXiv preprint arXiv:1807.03039, 2018. 
[116] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. 2014. 99 [117] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guil- laume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ra- malho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sci- ences, 114(13):3521–3526, 2017. [118] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The effi- cient transformer. arXiv preprint arXiv:2001.04451, 2020. [119] Rahul Krishnan, Uri Shalit, and David Sontag. Structured inference net- works for nonlinear state space models. In Proceedings of the AAAI Confer- ence on Artificial Intelligence, volume 31, 2017. [120] Rahul G Krishnan, Uri Shalit, and David Sontag. Deep kalman filters. arXiv preprint arXiv:1511.05121, 2015. [121] Alex Krizhevsky et al. Learning multiple layers of features from tiny im- ages. 2009. [122] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet clas- sification with deep convolutional neural networks. Advances in neural information processing systems, 25:1097–1105, 2012. [123] Manoj Kumar, Mohammad Babaeizadeh, Dumitru Erhan, Chelsea Finn, Sergey Levine, Laurent Dinh, and Durk Kingma. Videoflow: A flow- based generative model for video. [124] Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. Model- ing long-and short-term temporal patterns with deep neural networks. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 95–104, 2018. 100 [125] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient- based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. [126] Alex X Lee, Richard Zhang, Frederik Ebert, Pieter Abbeel, Chelsea Finn, and Sergey Levine. Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523, 2018. [127] Soochan Lee, Junsoo Ha, Dongsu Zhang, and Gunhee Kim. 
A neural dirichlet process mixture model for task-free continual learning. arXiv preprint arXiv:2001.00689, 2020. [128] Yong Suk Lee. International isolation and regional inequality: Evidence from sanctions on north korea. Journal of Urban Economics, 103:34–51, 2018. [129] Chen Li, Zhen Zhang, Wee Sun Lee, and Gim Hee Lee. Convolutional sequence to sequence model for human dynamics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5226– 5234, 2018. [130] Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, and Xifeng Yan. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. Advances in Neural Information Processing Systems, 32:5243–5253, 2019. [131] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE trans- actions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017. [132] Zimo Li, Yi Zhou, Shuangjiu Xiao, Chong He, Zeng Huang, and Hao Li. Auto-conditioned recurrent networks for extended complex human mo- tion synthesis. arXiv preprint arXiv:1707.05363, 2017. 101 [133] Xiaodan Liang, Lisa Lee, Wei Dai, and Eric P Xing. Dual motion gan for future-flow embedded video prediction. In proceedings of the IEEE interna- tional conference on computer vision, pages 1744–1752, 2017. [134] Jeff W Lichtman and José-Angel Conchello. Fluorescence microscopy. Na- ture methods, 2(12):910–919, 2005. [135] Bryan Lim, Sercan O Arik, Nicolas Loeff, and Tomas Pfister. Temporal fu- sion transformers for interpretable multi-horizon time series forecasting. arXiv preprint arXiv:1912.09363, 2019. [136] Jae S Lim. Two-dimensional signal and image processing. ph, 1990. [137] Long-Ji Lin. Self-improving reactive agents based on reinforcement learn- ing, planning and teaching. Machine learning, 8(3-4):293–321, 1992. [138] Zhaojiang Lin, Genta Indra Winata, Peng Xu, Zihan Liu, and Pascale Fung. Variational transformers for diverse response generation. 
arXiv preprint arXiv:2003.12738, 2020. [139] Tony Lindeberg. Scale selection properties of generalized scale-space in- terest point detectors. Journal of Mathematical Imaging and vision, 46(2):177– 210, 2013. [140] Zachary C Lipton, David C Kale, Charles Elkan, and Randall Wetzel. Learning to diagnose with lstm recurrent neural networks. In International Conference on Learning Representations, 2016. [141] Danyang Liu and Gongshen Liu. A transformer-based variational au- toencoder for sentence generation. In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1–7. IEEE, 2019. 102 [142] Ziwei Liu, Raymond A Yeh, Xiaoou Tang, Yiming Liu, and Aseem Agar- wala. Video frame synthesis using deep voxel flow. In Proceedings of the IEEE International Conference on Computer Vision, pages 4463–4471, 2017. [143] David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Sys- tems, pages 6467–6476, 2017. [144] William Lotter, Gabriel Kreiman, and David Cox. Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104, 2016. [145] Christos Louizos, Xiahan Shi, Klamer Schutte, and Max Welling. The functional neural process. In Advances in Neural Information Processing Sys- tems, pages 8743–8754, 2019. [146] Chaochao Lu, Michael Hirsch, and Bernhard Scholkopf. Flexible spatio- temporal networks for video prediction. In Proceedings of the IEEE Confer- ence on Computer Vision and Pattern Recognition, pages 6523–6531, 2017. [147] Florian Luisier, Thierry Blu, and Michael Unser. Image denoising in mixed poisson–gaussian noise. IEEE Transactions on image processing, 20(3):696–708, 2010. [148] Helmut Lütkepohl. New introduction to multiple time series analysis. Springer Science & Business Media, 2005. [149] Lars Maaløe, Marco Fraccaro, Valentin Lievin, and Ole Winther. Biva: A very deep hierarchy of latent variables for generative modeling. 
In 33rd Conference on Neural Information Processing Systems, page 8882. Neural Information Processing Systems Foundation, 2019. 103 [150] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete dis- tribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016. [151] Jacob Madsen, Pei Liu, Jens Kling, Jakob Birkedal Wagner, Thomas Willum Hansen, Ole Winther, and Jakob Schiøtz. A deep learning approach to identify local structures in atomic-resolution trans- mission electron microscopy images. Advanced Theory and Simulations, 1(8):1800037, 2018. [152] Markku Makitalo and Alessandro Foi. Optimal inversion of the general- ized anscombe transformation for poisson-gaussian noise. IEEE transac- tions on image processing, 22(1):91–103, 2012. [153] Bryce Manifold, Elena Thomas, Andrew T Francis, Andrew H Hill, and Dan Fu. Denoising of stimulated raman scattering microscopy images via deep learning. Biomedical optics express, 10(8):3860–3874, 2019. [154] Julieta Martinez, Michael J Black, and Javier Romero. On human motion prediction using recurrent neural networks. In Proceedings of the IEEE Con- ference on Computer Vision and Pattern Recognition, pages 2891–2900, 2017. [155] Julieta Martinez, Rayat Hossain, Javier Romero, and James J Little. A simple yet effective baseline for 3d human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2640–2649, 2017. [156] James E Matheson and Robert L Winkler. Scoring rules for continuous probability distributions. Management science, 22(10):1087–1096, 1976. 104 [157] Michael McCloskey and Neal J Cohen. Catastrophic interference in con- nectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Elsevier, 1989. [158] Ian S McLean. Electronic imaging in astronomy: detectors and instrumenta- tion. Springer Science & Business Media, 2008. 
[159] William Meiniel, Jean-Christophe Olivo-Marin, and Elsa D Angelini. De- noising of microscopy images: a review of the state-of-the-art, and a new sparsity-based method. IEEE Transactions on Image Processing, 27(8):3842– 3856, 2018. [160] Peyman Milanfar. A tour of modern image filtering: New insights and methods, both practical and theoretical. IEEE signal processing magazine, 30(1):106–128, 2012. [161] David Minarik, Olof Enqvist, and Elin Trägårdh. Denoising of scintillation camera images using a deep convolutional neural network: a monte carlo simulation approach. Journal of Nuclear Medicine, 61(2):298–303, 2020. [162] Tiziano Montini, Michele Melchionna, Matteo Monai, and Paolo For- nasiero. Fundamentals and catalytic applications of ceo2-based materials. Chemical reviews, 116(10):5987–6041, 2016. [163] Kevin P Murphy. Machine learning: a probabilistic perspective. 2012. [164] United Nations. Transforming our world: The 2030 agenda for sustain- able development. Resolution adopted by the General Assembly, 2015. [165] PD Nellist and SJ Pennycook. Accurate structure determination from im- age reconstruction in adf stem. Journal of Microscopy, 190(1-2):159–170, 1998. 105 [166] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised fea- ture learning. 2011. [167] Cuong V Nguyen, Yingzhen Li, Thang D Bui, and Richard E Turner. Vari- ational continual learning. arXiv preprint arXiv:1710.10628, 2017. [168] Yao Nie, Li Li, and Zidong Wei. Recent advancements in pt and pt- free catalysts for oxygen reduction reaction. Chemical Society Reviews, 44(8):2168–2201, 2015. [169] NOAA National Centers for Environmental Information. Nighttime Lights Time Series. 2010. [170] Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard Lewis, and Satinder Singh. Action-conditional video prediction using deep networks in atari games. arXiv preprint arXiv:1507.08750, 2015. 
[171] Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. Conditional image generation with pixelcnn decoders. arXiv preprint arXiv:1606.05328, 2016. [172] Boris N Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Ben- gio. N-beats: Neural basis expansion analysis for interpretable time series forecasting. In International Conference on Learning Representations, 2019. [173] Oleksiy Ostapenko, Mihai Puscas, Tassilo Klein, Patrick Jahnichen, and Moin Nabi. Learning to remember: A synaptic plasticity driven frame- work for continual learning. In Proceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition, pages 11321–11329, 2019. 106 [174] German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Ste- fan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 2019. [175] Martin Parry, Martin L Parry, Osvaldo Canziani, Jean Palutikof, Paul Van der Linden, and Clair Hanson. Climate change 2007-impacts, adapta- tion and vulnerability: Working group II contribution to the fourth assessment report of the IPCC, volume 4. Cambridge University Press, 2007. [176] Andrew J Patton. A review of copula models for economic time series. Journal of Multivariate Analysis, 110:4–18, 2012. [177] JR Peterson, JG Jernigan, SM Kahn, AP Rasmussen, E Peng, Z Ahmad, J Bankert, C Chang, C Claver, DK Gilmore, et al. Simulation of astro- nomical images from optical survey telescopes using a comprehensive photon monte carlo approach. The Astrophysical Journal Supplement Series, 218(1):14, 2015. [178] Neeti Pokhriyal and Damien Christophe Jacques. Combining disparate data sources for improved poverty prediction and mapping. Proceedings of the National Academy of Sciences, 114(46):E9783–E9792, 2017. [179] Javier Portilla, Vasily Strela, Martin J Wainwright, and Eero P Simoncelli. Image denoising using scale mixtures of gaussians in the wavelet domain. IEEE Trans. 
Image Processing, 12(11), 2003. [180] Franco P Preparata and Michael Ian Shamos. Convex hulls: Basic algo- rithms. In Computational geometry, pages 95–149. Springer, 1985. [181] Steven E Prince, Sander M Daselaar, and Roberto Cabeza. Neural corre- 107 lates of relational memory: successful encoding and retrieval of semantic and perceptual associations. Journal of Neuroscience, 25(5):1203–1210, 2005. [182] Yao Qin, Dongjin Song, Haifeng Chen, Wei Cheng, Guofei Jiang, and Gar- rison Cottrell. A dual-stage attention-based recurrent neural network for time series prediction. arXiv preprint arXiv:1704.02971, 2017. [183] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019. [184] Marco Ragone, Vitaliy Yurkiv, Boao Song, Ajaykrishna Ramsubramanian, Reza Shahbazian-Yassar, and Farzad Mashayek. Atomic column heights detection in metallic nanoparticles using deep convolutional learning. Computational Materials Science, 180:109722, 2020. [185] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. Stand-alone self-attention in vision mod- els. arXiv preprint arXiv:1906.05909, 2019. [186] Syama Sundar Rangapuram, Matthias W Seeger, Jan Gasthaus, Lorenzo Stella, Yuyang Wang, and Tim Januschowski. Deep state space models for time series forecasting. Advances in neural information processing systems, 31:7785–7794, 2018. [187] MarcAurelio Ranzato, Arthur Szlam, Joan Bruna, Michael Mathieu, Ro- nan Collobert, and Sumit Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014. 108 [188] Dushyant Rao, Francesco Visin, Andrei Rusu, Razvan Pascanu, Yee Whye Teh, and Raia Hadsell. Continual unsupervised representation learn- ing. In Advances in Neural Information Processing Systems, pages 7645–7655, 2019. 
[189] Kashif Rasul, Calvin Seward, Ingmar Schuster, and Roland Vollgraf. Autoregressive denoising diffusion models for multivariate probabilistic time series forecasting. arXiv preprint arXiv:2101.12072, 2021. [190] Kashif Rasul, Abdul-Saboor Sheikh, Ingmar Schuster, Urs Bergmann, and Roland Vollgraf. Multi-variate probabilistic time series forecasting via conditioned normalizing flows. arXiv preprint arXiv:2002.06103, 2020. [191] Roger Ratcliff. Connectionist models of recognition memory: con- straints imposed by learning and forgetting functions. Psychological review, 97(2):285, 1990. [192] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pat- tern Recognition, pages 2001–2010, 2017. [193] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016. [194] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochas- tic backpropagation and approximate inference in deep generative mod- 109 els. In International conference on machine learning, pages 1278–1286. PMLR, 2014. [195] Matthew Riemer, Ignacio Cases, Robert Ajemian, Miao Liu, Irina Rish, Yuhai Tu, and Gerald Tesauro. Learning to learn without forgetting by maximizing transfer and minimizing interference. arXiv preprint arXiv:1810.11910, 2018. [196] Hippolyt Ritter, Aleksandar Botev, and David Barber. Online structured laplace approximations for overcoming catastrophic forgetting. In Ad- vances in Neural Information Processing Systems, pages 3738–3748, 2018. [197] David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. In Advances in Neural Information Processing Systems, pages 348–358, 2019. 
[198] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolu- tional networks for biomedical image segmentation. In International Con- ference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015. [199] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. In- ternational Journal of Computer Vision, 115(3):211–252, 2015. [200] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Had- sell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016. 110 [201] David Salinas, Michael Bohlke-Schneider, Laurent Callot, Roberto Medico, and Jan Gasthaus. High-dimensional multivariate forecasting with low-rank gaussian copula processes. arXiv preprint arXiv:1910.03002, 2019. [202] David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. Deepar: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181–1191, 2020. [203] Luigi Sedda, Andrew J Tatem, David W Morley, Peter M Atkinson, Nicola A Wardrop, Carla Pezzulo, Alessandro Sorichetta, Joanna Kuleszo, and David J Rogers. Poverty, health and satellite-derived vegetation in- dices: their inter-spatial relationship in west africa. International health, 7(2):99–106, 2015. [204] Iulian Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. A hierarchical latent vari- able encoder-decoder model for generating dialogues. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017. [205] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fer- gus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013. 