BAYESIAN METHODS FOR FUNCTIONAL AND
TIME SERIES DATA
A Dissertation
Presented to the Faculty of the Graduate School
of Cornell University
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
by
Daniel R. Kowal
August 2017
© 2017 Daniel R. Kowal
ALL RIGHTS RESERVED
BAYESIAN METHODS FOR FUNCTIONAL AND TIME SERIES DATA
Daniel R. Kowal, Ph.D.
Cornell University 2017
We introduce new Bayesian methodology for modeling functional and time se-
ries data. While broadly applicable, the methodology focuses on the challeng-
ing cases in which (1) functional data exhibit additional dependence, such as
time dependence or contemporaneous dependence; (2) functional or time series
data demonstrate local features, such as jumps or rapidly-changing smoothness;
and (3) a time series of functional data is observed sparsely or irregularly with
non-negligible measurement error. A unifying characteristic of the proposed
methods is the employment of the dynamic linear model (DLM) framework in
new contexts to construct highly efficient Gibbs sampling algorithms.
To model dependent functional data, we extend DLMs for multivariate time
series data to the functional data setting, and identify a smooth, time-invariant
functional basis for the functional observations. The proposed model provides
flexible modeling of complex dependence structures among the functional ob-
servations, such as time dependence, contemporaneous dependence, stochastic
volatility, and covariates. We apply the model to multi-economy yield curve
data and local field potential brain signals in rats.
For locally adaptive Bayesian time series and regression analysis, we pro-
pose a novel class of dynamic shrinkage processes. We extend a broad class of
popular global-local shrinkage priors, such as the horseshoe prior, to the dy-
namic setting by allowing the local scale parameters to depend on the history
of the shrinkage process. We prove that the resulting processes inherit desirable
shrinkage behavior from the non-dynamic analogs, but provide additional lo-
cally adaptive shrinkage properties. We demonstrate the substantial empirical
gains from the proposed dynamic shrinkage processes using extensive simula-
tions, a Bayesian trend filtering model for irregular curve-fitting of CPU usage
data, and an adaptive time-varying parameter regression model, which we em-
ploy to study the dynamic relevance of the factors in the Fama-French asset
pricing model.
Finally, we propose a hierarchical functional autoregressive (FAR) model
with Gaussian process innovations for forecasting and inference of sparsely or
irregularly sampled functional time series data. We prove finite-sample fore-
casting and interpolation optimality properties of the proposed model, which
remain valid with the Gaussian assumption relaxed. We apply the proposed
methods to produce highly competitive forecasts of daily U.S. nominal and real
yield curves.
BIOGRAPHICAL SKETCH
Daniel Ryan Kowal was born in Albany, New York. After finishing high school
at Salesianum School in Wilmington, Delaware, Daniel attended Washington
University in St. Louis. While at Washington University, Daniel participated
in the Pathfinder Program for Environmental Sustainability, completed a se-
nior honors thesis Applications of linear mixed effects models: an analysis of Mis-
souri school data, and graduated summa cum laude in mathematics with minors
in computer science and legal studies. After graduating in 2012, Daniel en-
tered the Cornell University Ph.D. program in statistics. During his graduate
studies, Daniel co-authored publications in the Journal of the American Statistical
Association, the Journal of Business & Economic Statistics, Cellular and Molecular
Bioengineering, and the Journal of Biomechanics. He has received student paper
awards from the American Statistical Association in both the Section on Bayesian
Statistical Science and the Nonparametric Statistics Section. Following the comple-
tion of his Ph.D., Daniel will join the Rice University Department of Statistics as
an assistant professor.
To my parents and my brother.
ACKNOWLEDGEMENTS
First, I would like to express my sincere gratitude to my co-advisers, Dr. David
S. Matteson and Dr. David Ruppert, for their guidance, their time, and their
commitment to my development as an independent researcher. I would also
like to thank my undergraduate thesis advisor, Dr. Jimin Ding, for helping to
set me on this path.
Second, I would like to thank my fellow Ph.D. students and friends, espe-
cially Dr. Amy Willis, Dr. David Sinclair, and Dr. William Nicholson. Both cele-
bration and commiseration are unwritten prerequisites for graduate study, and
their vital roles in each are greatly appreciated.
Third, I would like to thank my family, especially my parents, for persis-
tently emphasizing the value of higher education and extracurricular learning,
and my brother, for teaching me mathematics from an early age, despite my
frequent protests.
And finally, I especially thank my wife, Dr. Marsha Kowal, for the encour-
aging notes, the travel packs, the early morning breakfast surprises, and most
importantly, for her unwavering support.
TABLE OF CONTENTS
Biographical Sketch
Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures
1 Introduction
2 A Bayesian Multivariate Functional Dynamic Linear Model
2.1 Introduction
2.2 A Multivariate Functional Dynamic Linear Model
2.3 Estimating the Factor Loading Curves
2.3.1 Splines
2.3.2 Bayesian Splines
2.3.3 Constrained Bayesian Splines
2.3.4 Common Factor Loading Curves for Multivariate Modeling
2.4 Data Analysis and Results
2.4.1 Multi-Economy Yield Curves
2.4.2 Multivariate Time-Frequency Analysis for Local Field Potential
2.5 Conclusions
3 Dynamic Shrinkage Processes
3.1 Introduction
3.2 Dynamic Shrinkage Processes
3.2.1 Stochastic Volatility Models for Dynamic Scale Parameters
3.2.2 Log-Scale Representations of Global-Local Priors
3.2.3 Scale Mixtures via Pólya-Gamma Processes
3.3 Bayesian Trend Filtering with Dynamic Shrinkage Processes
3.3.1 Bayesian Trend Filtering: Simulations
3.3.2 Bayesian Trend Filtering: Application to CPU Usage Data
3.4 Joint Shrinkage for Time-Varying Parameter Models
3.4.1 Time-Varying Parameter Models: Simulations
3.4.2 Time-Varying Parameter Models: The Fama-French Asset Pricing Model
3.5 MCMC Sampling Algorithm and Computational Details
3.5.1 Efficient Sampling for the Dynamic Shrinkage Process
3.6 Conclusions
4 Functional Autoregression for Sparsely Sampled Data
4.1 Introduction
4.2 Hierarchical Gaussian Processes for FAR
4.2.1 Dynamic Linear Models for FAR(p)
4.3 A Dynamic Functional Factor Model for the Innovation Process
4.4 Modeling the FAR Kernel
4.5 Finite-Dimensional Optimality
4.6 Simulations
4.6.1 Sampling Designs
4.6.2 Competing Estimators
4.6.3 Results
4.7 Forecasting Nominal and Real Yield Curves
4.8 Concluding Remarks
5 Conclusions
A A Bayesian Multivariate Functional Dynamic Linear Model
A.1 Initialization
A.1.1 Common Factor Loading Curves
A.2 Sampling
A.2.1 General Algorithm
A.2.2 Sampling the Common Trend Hidden Markov Model
A.3 Additional Figures
B Dynamic Shrinkage Processes
B.1 MCMC Sampling Algorithm and Computational Details
B.1.1 Efficient Sampling for the Dynamic Shrinkage Process
B.1.2 Efficient Sampling for the State Variables
B.2 Linear Regression for the Fama-French Asset Pricing Model
C Functional Autoregression for Sparsely Sampled Data
C.1 Priors
C.2 Proof of Theorem 4.1
C.3 Initialization and MCMC Sampling Algorithm
C.3.1 Initialization
C.3.2 Gibbs Sampling Algorithm
C.4 Additional Theoretical Results
C.4.1 Proof of Proposition 4.1
C.4.2 DLM Recursions and Special Cases of Theorem 4.1
C.4.3 Proof of Theorem 4.2
C.5 Additional Simulation Results
C.6 Additional Details for the Yield Curve Application
C.7 Additional Details on the Quadrature Approximation
LIST OF TABLES
2.1 Posterior means and 95% HPD intervals for $\gamma_k^{(c)}$, which measures
the strength of the linear relationship between $\beta_{k,t}^{(c)}$ and
$\beta_{k,t}^{(1)}$.
3.1 Special cases of the inverted-Beta prior.
4.1 h-step RMSFEs for nominal yields, grouped (left to right) by
multivariate methods, parametric yield curve models, existing
functional data methods, and proposed hierarchical FAR methods.
The minimum RMSFE in each row is italicized.
4.2 h-step RMSFEs for real yields, grouped (left to right) by
multivariate methods, parametric yield curve models, existing
functional data methods, and proposed hierarchical FAR methods.
The minimum RMSFE in each row is italicized.
B.1 Ordinary linear regression results for the weekly manufacturing
industry data in the six-factor model. Significant factors at the
5% level are italicized.
B.2 Ordinary linear regression results for the weekly healthcare
industry data in the six-factor model. Significant factors at the 5%
level are italicized.
LIST OF FIGURES
2.1 Multi-economy yield curves from July 29, 2011 (solid) and
August 5, 2011 (dashed), together with the corresponding one-week
change curves.
2.2 Posterior means of the common FLCs, $\{f_1, f_2, f_3, f_4\}$, as a
function of maturity, $\tau$.
2.3 The MCMC sample proportions of $r^2_{k,(c),t}$ and
$\sum_{k=1}^{4} r^2_{k,(c),t}$ that exceed the 95th percentile of the
assumed $\chi^2$-distributions.
2.4 The raw LFP data from a rat during an FS trial. The vertical lines
indicate the approximate time at which the rat processed the
stimuli, $t^*$.
2.5 Pointwise 95% HPD intervals and the posterior mean for $\bar{\mu}_t^{(3)}$,
which is the average difference in squared coherence between
the FC and FS trials. The black vertical lines indicate the event
time $t^*$.
3.1 Bayesian trend filtering ($D = 2$) with dynamic horseshoe process
innovations of minute-by-minute CPU usage data. (a) Observed data $y_t$
(points), posterior expectation (cyan) of $\beta_t$, and 95% pointwise
highest posterior density (HPD) credible intervals (light gray) and 95%
simultaneous credible bands (dark gray) for the posterior predictive
distribution of $y_t$. (b) Second difference of observed data $\Delta^2 y_t$
(points), posterior expectation of $\omega_t = \Delta^2 \beta_t$ (cyan), and
95% pointwise HPD intervals (light gray) and simultaneous credible bands
(dark gray) for the posterior predictive distribution of $\Delta^2 y_t$.
(c) Posterior expectation of time-dependent observation standard
deviations, $\sigma_t$. (d) Posterior expectation of time-dependent
innovation (prior) standard deviations, $\tau\lambda_t$.
3.2 Simulation-based estimate of the stationary distribution of $\kappa_t$ for
various AR(1) coefficients $\phi$. The blue line indicates the density
of $\kappa_t$ in the static ($\phi = 0$) horseshoe,
$[\kappa] \sim \mbox{Beta}(1/2, 1/2)$.
3.3 Fitted curves for simulated data with $T = 128$ and RSNR = 7.
Each panel includes the simulated observations (x-marks), the
posterior expectations of $\beta_t$ (cyan), and the 95% pointwise HPD
credible intervals (light gray) and 95% simultaneous credible
bands (dark gray) for the posterior predictive distribution of $\{y_t\}$
under BTF-DHS model (3.8) with $D = 2$. The proposed estimator,
as well as the uncertainty bands, accurately capture both
slowly- and rapidly-changing behavior in the underlying functions.
3.4 Root mean squared errors for simulated data with T = 128 and
RSNR = 7. The Bayesian trend filtering (BTF) estimators differ
in their innovation distributions, which determines the shrinkage
behavior of the second order differences (D = 2): normal-inverse-Gamma
(NIG), horseshoe (HS), and dynamic horseshoe (DHS).
3.5 Root mean squared error for out-of-sample minute-by-minute CPU
usage data. The Bayesian trend filtering (BTF) estimators differ in their
innovation distributions, which determines the shrinkage behavior of
the second order differences (D = 2): normal-inverse-Gamma (NIG),
horseshoe (HS), and dynamic horseshoe (DHS).
3.6 True regression functions $\beta^*_{j,t}$ (black line) and corresponding
posterior expectations (cyan), 95% pointwise HPD credible intervals
(light gray) and 95% simultaneous credible bands (dark gray) for
$\beta_{j,t}$ under the BTF-DHS model given by (3.9) and (3.10) for a
simulated data set.
3.7 Root mean squared errors for the regression coefficients, $\beta^*_{j,t}$
(left) and the true curves, $y_t^* = x_t' \beta_t^*$ (right) for simulated data.
3.8 Posterior expectations (cyan), 95% pointwise HPD credible intervals
(light gray) and 95% simultaneous credible bands (dark
gray) for βj,t and σt (bottom right) under the BTF-DHS model
given by (3.9) and (3.10) for value-weighted manufacturing industry
returns. The solid black line is zero, the dashed green line
is the ordinary linear regression estimate, and the solid red line
indicates periods for which the 95% simultaneous credible bands
do not contain zero.
3.9 Posterior expectations (cyan), 95% pointwise HPD credible intervals
(light gray) and 95% simultaneous credible bands (dark
gray) for βj,t and σt (bottom right) under the BTF-DHS model
given by (3.9) and (3.10) for value-weighted healthcare industry
returns. The solid black line is zero, the dashed green line is the
ordinary linear regression estimate, and the solid red line indicates
periods for which the 95% simultaneous credible bands do
not contain zero.
4.1 Sample paths of $\epsilon_t$ and $Y_t = \mu_t + \mu$ as a function of $\tau$,
where $\epsilon_t$ is a Gaussian process with the Matérn correlation function,
$\rho = (\rho_1, 0.1)$, $\sigma = 0.01$, and $Y_t$ is generated using the
Bimodal-Gaussian FAR(1) kernel, $t = 1, \ldots, T = 50$. The curves are
time-ordered by color (from red/orange to blue/violet). Left to right:
$\epsilon_t(\tau)$, $\rho_1 = 2.5$; $\epsilon_t(\tau)$, $\rho_1 = 0.5$;
$Y_t(\tau)$, $\rho_1 = 2.5$; $Y_t(\tau)$, $\rho_1 = 0.5$. Note
that we do not observe $Y_t$ directly, but rather
$y_{i,t} = Y_t(\tau_{i,t}) + \nu_{i,t}$, where $\nu_{i,t} \sim N(0, \sigma_\nu^2)$
is measurement error with $\sigma_\nu = \sigma/5 = 0.002$ and
$\mathcal{T}_t = \{\tau_{1,t}, \ldots, \tau_{m_t,t}\}$ are the observation
points at time $t$.
4.2 MSFEe under various designs. Top left: FAR(1), T = 350,
sparse-random design with the Linear-u kernel and smooth GP
innovations. Top right: FAR(1), T = 50, sparse-random design
with the Bimodal-Gaussian kernel and non-smooth GP innovations.
Bottom left: FAR(1), T = 350, sparse-fixed design with
the Bimodal-Gaussian kernel and smooth GP innovations. Bottom
right: FAR(2), T = 125, sparse-fixed design with Bimodal-Gaussian
and Linear-τ kernels and smooth GP innovations.
The proposed methods provide superior forecasts and nearly
achieve the oracle performance, despite the presence of sparsity.
4.3 MSEψ1 under various designs. Top left: FAR(1), T = 350,
sparse-random design with the Linear-u kernel and smooth GP
innovations. Top right: FAR(1), T = 50, sparse-random design
with the Bimodal-Gaussian kernel and non-smooth GP innovations.
Bottom left: FAR(1), T = 350, sparse-fixed design with
the Bimodal-Gaussian kernel and smooth GP innovations. Bottom
right: FAR(2), T = 125, sparse-fixed design with Bimodal-Gaussian
and Linear-τ kernels and smooth GP innovations. Estimates
of ψ1 are far superior for the proposed methods, including
the FAR(p) with model averaging.
4.4 One-step nominal (left) and real (right) yield curve forecasts
during 2016. Top: Time series of five (×) and ten (△) year observed
maturities with one-step forecasts. Bottom: Observed
(points) and forecast (line) curves on 8/2/16, corresponding to
the dotted vertical line in the top panels. Posterior means (blue)
and 95% pointwise and simultaneous prediction bands (light
gray and dark gray, respectively) estimated using 10,000 MCMC
simulations after a burn-in of 5,000.
A.1 Pointwise 95% HPD intervals and the posterior mean for $\bar{\mu}_t^{(1)}$,
which is the average difference in the PFC log-spectra between
the FC and FS trials. The black vertical lines indicate $t^*$.
A.2 Pointwise 95% HPD intervals and the posterior mean for $\bar{\mu}_t^{(2)}$,
which is the average difference in the PFC log-spectra between
the FC and FS trials. The black vertical lines indicate $t^*$.
A.3 The observed volatility clustering from the yield curve application.
The black lines are the posterior means of the squared residuals
from the AR(1) process on the $\omega_{k,t}^{(c)}$ in the common trend
hidden Markov model of Section 2.4.1. The red lines are the posterior
means of the corresponding volatility estimates $\sigma^2_{k,(c),t}$
discussed in Section 2.4.1.
B.1 Computation time per 1000 MCMC iterations for the Bayesian
trend filtering model with dynamic horseshoe innovations (BTF-DHS).
C.1 MSFEe (top) and corresponding MSEψ1 (bottom) under various
designs. Left: FAR(1), T = 50, dense design with
the Bimodal-Gaussian kernel and non-smooth GP innovations.
Right: FAR(1), T = 350, dense design with the Bimodal-Gaussian
kernel and smooth GP innovations. The proposed
methods provide superior forecasts and nearly achieve the oracle
performance, despite the presence of sparsity.
C.2 The Bimodal-Gaussian kernel,
$\psi(\tau, u) \propto \frac{0.75}{\pi(0.3)(0.4)} \exp\{-(\tau - 0.2)^2/(0.3)^2 - (u - 0.3)^2/(0.4)^2\} + \frac{0.45}{\pi(0.3)(0.4)} \exp\{-(\tau - 0.7)^2/(0.3)^2 - (u - 0.8)^2/(0.4)^2\}$,
normalized so that $\int\!\!\int \psi_\ell^2(\tau, u)\, d\tau\, du = 0.8$.
C.3 Traceplot for one-step forecasts for nominal yield curves at selected
maturities during 2016.
C.4 Traceplot for one-step forecasts for real yield curves at selected
maturities during 2016.
E1 Standardized squared errors and relative absolute errors for
smooth (top) and non-smooth (bottom) integrands. The errors
are small in magnitude, particularly in the smooth case, and decay
quickly for M > 20.
CHAPTER 1
INTRODUCTION
We present Bayesian methodology for modeling functional and time series
data. The methods are broadly applicable for (dependent) functional and time
series data, but we focus in particular on the following challenging cases for
which existing methods are inadequate:
1. Functional data with additional complex dependence, such as time depen-
dence, contemporaneous dependence, stochastic volatility, covariates, and
change points (Chapter 2);
2. Functional data, time series data, or regression functions with local fea-
tures, such as jumps or rapidly-changing smoothness (Chapter 3); and
3. Forecasting and inference of functional time series data with sparsely or
irregularly sampled curves and for curves sampled with non-negligible
measurement error (Chapter 4).
A unifying characteristic of the proposed methods is the employment of the
dynamic linear model (DLM) framework in new contexts to construct inter-
pretable models and computationally efficient MCMC sampling algorithms. In
particular, we develop highly efficient Gibbs sampling algorithms that build
upon existing DLM sampling components for large blocks of parameters (e.g.,
Rue, 2001; Durbin and Koopman, 2002). The novel applications of DLMs in-
clude functional dynamic factor models, Bayesian trend filtering models, dy-
namic shrinkage processes (see Chapter 3), and functional autoregressive mod-
els. Importantly, the Bayesian framework permits joint estimation of the model
parameters and provides exact inference (up to MCMC error) on specific pa-
rameters.
The proposed methodology is motivated by important applications includ-
ing multi-economy interest rate modeling, nominal and real yield curve fore-
casting, dynamic extensions of the Fama-French asset pricing model, irregular
curve-fitting of CPU usage data, and local field potential brain signals in rats.
The methods are evaluated through extensive simulations, and compared to
state-of-the-art alternative estimators, with favorable results.
In Chapter 2, we present a Bayesian model for multivariate, dependent func-
tional data, in which we extend DLMs for multivariate time series to the func-
tional data setting. We also develop Bayesian spline theory in a more general
constrained optimization framework. The proposed methods identify a time-
invariant functional basis for the functional observations, which is smooth and
interpretable. We apply the methodology to study the interactions of multi-
economy yield curves during the recent global recession, and analyze local field
potential brain signals in rats, for which we develop a multivariate functional
time series approach for multivariate time-frequency analysis.
In Chapter 3, we propose a novel class of dynamic shrinkage processes for
Bayesian time series and regression analysis. We extend a broad class of popular
global-local shrinkage priors, such as the horseshoe prior, to the dynamic setting
by allowing the local scale parameters to depend on the history of the shrinkage
process. We prove that the resulting processes inherit desirable shrinkage be-
havior from the non-dynamic analogs, but provide additional locally adaptive
shrinkage properties. The proposed dynamic shrinkage processes are widely
applicable, particularly within the family of dynamic linear models. By expressing
dynamic shrinkage processes on the log-scale, we adapt successful techniques
from stochastic volatility modeling, and propose a Pólya-Gamma scale
mixture representation to produce a highly efficient Gibbs sampling algorithm.
We use the proposed processes to produce superior Bayesian trend filtering es-
timates and posterior credible intervals for irregular curve-fitting of minute-by-
minute Twitter CPU usage data, and develop an adaptive time-varying param-
eter regression model to assess the efficacy of the Fama-French five-factor asset
pricing model with momentum added as a sixth factor.
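To make the idea of history-dependent local scales concrete, the following minimal Python sketch simulates one possible dynamic horseshoe process. It assumes, for illustration only, that the log-variance h_t follows an AR(1) with coefficient φ and that the innovations are log-squared half-Cauchy draws, so that φ = 0 recovers the static horseshoe with shrinkage weight κ_t = 1/(1 + exp(h_t)) distributed as Beta(1/2, 1/2). The function name, the parameter mu, and the initialization are hypothetical; Chapter 3 gives the actual specification, which may differ.

```python
import numpy as np

def simulate_dynamic_horseshoe(T, phi, mu=0.0, seed=0):
    """Simulate a log-variance AR(1) path with horseshoe-type innovations.

    Assumed form (illustrative, not necessarily the dissertation's model):
        h_t     = mu + phi * (h_{t-1} - mu) + eta_t,
        eta_t   = log(lambda_t^2),  lambda_t ~ half-Cauchy(0, 1),
        omega_t ~ N(0, exp(h_t)),   kappa_t = 1 / (1 + exp(h_t)).
    """
    rng = np.random.default_rng(seed)
    lam = np.abs(rng.standard_cauchy(T))      # half-Cauchy local scales
    eta = np.log(lam ** 2)                    # log-scale innovations
    h = np.empty(T)
    h[0] = mu + eta[0]                        # hypothetical initialization
    for t in range(1, T):
        h[t] = mu + phi * (h[t - 1] - mu) + eta[t]
    kappa = 1.0 / (1.0 + np.exp(h))           # shrinkage weight in (0, 1)
    omega = rng.standard_normal(T) * np.exp(h / 2.0)  # shrunk innovations
    return h, kappa, omega
```

With phi = 0 the simulated κ_t should pile up near 0 and 1, as in the static horseshoe; positive phi tends to produce runs of persistently strong or persistently weak shrinkage.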
In Chapter 4, we develop a hierarchical Gaussian process model for forecast-
ing and inference of functional time series data. Unlike existing methods, our
approach is especially suited for sparsely or irregularly sampled curves and
for curves sampled with non-negligible measurement error. The latent pro-
cess is dynamically modeled as a functional autoregression (FAR) with Gaus-
sian process innovations, with extensions for FAR(p) models with model aver-
aging over the lag p. We propose a fully nonparametric dynamic functional
factor model for the dynamic innovation process, with broader applicability
and improved computational efficiency over standard Gaussian process mod-
els. We prove finite-sample forecasting and interpolation optimality properties
of the proposed model, which remain valid with the Gaussian assumption re-
laxed. Extensive simulations demonstrate substantial improvements in fore-
casting performance and recovery of the autoregressive surface over competing
methods, especially under sparse designs. We apply the proposed methods to
forecast nominal and real yield curves using daily U.S. data. Real yields are ob-
served more sparsely than nominal yields, yet the proposed methods are highly
competitive in both settings.
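As a rough illustration of the data-generating mechanism described above, the sketch below simulates a zero-mean FAR(1) process on a grid, approximating the integral operator by trapezoidal quadrature, and then records sparse, noisy observations of each curve. The kernel, the squared-exponential innovation covariance, the grid size, and all function and argument names are assumptions made for illustration; they are not the settings used in Chapter 4.

```python
import numpy as np

def simulate_sparse_far1(T=50, m=40, n_obs=8, sigma=0.01, sigma_nu=0.002, seed=0):
    """Simulate Y_t = Psi(Y_{t-1}) + eps_t on a grid, then observe it sparsely."""
    rng = np.random.default_rng(seed)
    tau = np.linspace(0.0, 1.0, m)                  # evaluation grid on [0, 1]
    w = np.full(m, 1.0 / (m - 1))                   # trapezoid quadrature weights
    w[[0, -1]] *= 0.5
    diff2 = (tau[:, None] - tau[None, :]) ** 2
    psi = 0.6 * np.exp(-diff2 / 0.1)                # hypothetical kernel psi(tau, u)
    Psi = psi * w[None, :]                          # Psi @ y approximates the integral over u
    K = sigma ** 2 * np.exp(-diff2 / (2 * 0.2 ** 2))    # assumed GP innovation covariance
    evals, evecs = np.linalg.eigh(K)
    root = evecs * np.sqrt(np.clip(evals, 0.0, None))   # K = root @ root.T up to rounding
    Y = np.zeros((T, m))
    prev = np.zeros(m)
    for t in range(T):
        eps = root @ rng.standard_normal(m)         # GP innovation evaluated on the grid
        Y[t] = Psi @ prev + eps                     # FAR(1) recursion
        prev = Y[t]
    obs = []
    for t in range(T):
        idx = np.sort(rng.choice(m, size=n_obs, replace=False))  # sparse random design
        y = Y[t, idx] + sigma_nu * rng.standard_normal(n_obs)    # measurement error
        obs.append((tau[idx], y))
    return tau, Y, obs
```

Only the pairs in obs would be available to a forecasting method; the latent curves Y are returned so that forecasts can be compared against the truth in a simulation.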
CHAPTER 2
A BAYESIAN MULTIVARIATE FUNCTIONAL DYNAMIC LINEAR
MODEL
Portions of this chapter were published in Kowal et al. (2016).
2.1 Introduction
We consider a multivariate time series of functional data. Functional data
analysis (FDA) methods are widely applicable in diverse fields, including
economics and finance (e.g., Hays et al., 2012); brain imaging (e.g., Staicu et al.,
2012); chemometric analysis, speech recognition, and electricity consumption
(Ferraty and Vieu, 2006); and growth curves and environmental monitoring
(Ramsay and Silverman, 2005). Methodology for independent and identically
distributed (iid) functional data is well developed, but these iid methods are
not appropriate for dependent functional data. Such dependence
is common, and can arise via multiple responses, temporal and spatial effects,
repeated measurements, missing covariates, or simply because of some natural
grouping in the data (e.g., Horváth and Kokoszka, 2012). Here, we consider two
distinct sources of dependence: time dependence for time-ordered functional
observations and contemporaneous dependence for multivariate functional ob-
servations.
Suppose we observe multiple functions $Y_t^{(c)}(\tau)$, $c = 1, \ldots, C$, at time points
$t = 1, \ldots, T$. Such observations have three dominant features:
(a) For each $c$ and $t$, $Y_t^{(c)}(\tau)$ is a function of $\tau \in \mathcal{T}$;
(b) For each $c$ and $\tau$, $Y_t^{(c)}(\tau)$ is a time series for $t = 1, \ldots, T$; and
(c) For each $t$ and $\tau$, $Y_t^{(c)}(\tau)$ is a multivariate observation with outcomes
$c = 1, \ldots, C$.
We assume that $\mathcal{T} \subseteq \mathbb{R}^d$ is compact, and focus on the case $d = 1$ in which $\tau$ is a
scalar. However, our approach may be adapted to the more general setting.
We consider two diverse applications of multivariate functional time series
(MFTS).
Multi-Economy Yield Curves: Let $Y_t^{(c)}(\tau)$ denote multi-economy yield curves
observed on weeks $t = 1, \ldots, T$ for economies $c = 1, \ldots, C$, which refer to the
Federal Reserve, the Bank of England, the European Central Bank, and the Bank of
Canada. For a given currency and level of risk of a debt, the yield curve de-
scribes the interest rate as a function of the length of the borrowing period, or
time to maturity, τ . Yield curves are important in a variety of economic and
financial applications, such as evaluating economic and monetary conditions,
pricing fixed-income securities, generating forward curves, computing infla-
tion premiums, and monitoring business cycles (Bolder et al., 2004). We are
particularly interested in the relationships among yield curves for the aforemen-
tioned globally-influential economies, and in how these relationships vary over
time. However, existing FDA methods are inadequate to model the dynamic de-
pendences among and between the yield curves for different economies, such
as contemporaneous dependence, volatility clustering, covariates, and change
points. Our approach resolves these inadequacies, and provides useful insights
into the interactions among multi-economy yield curves (see Section 2.4.1).
Multivariate Time-Frequency Analysis: For multivariate time series, the periodic
behavior of the process is often the primary interest. Time-frequency analysis
is used when this periodic behavior varies over time, which requires consideration
of both the time and frequency domains (e.g., Shumway and Stoffer,
2000). Typical methods segment the multivariate time series into (overlapping)
time bins within which the periodic behavior is approximately stationary;
within each bin, standard frequency domain or spectral analysis is performed,
which uses the multivariate discrete Fourier transform of the time series to identify
dominant frequencies. Interestingly, although the raw signal in this setting
is a multivariate time series, time-frequency analysis produces an MFTS:
the multivariate discrete Fourier transform is a function of frequency τ for time
bins t = 1, . . . , T, where c = 1, . . . , C indexes the multivariate components of the
spectrum. We analyze local field potential (LFP) data collected from rats, which
measure the neural activity of local brain regions over time (Ljubojevic et al.,
2013). Our interest is in the time-dependent periodic behavior of these local
brain regions under different stimuli, and in particular the synchronization be-
tween brain regions. Our novel MFTS approach to time-frequency analysis pro-
vides the necessary multivariate structure and inference—which is unavailable
in standard time-frequency analysis—to precisely characterize brain behavior
under certain stimuli (see Section 2.4.2).
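The binning step described above can be sketched in Python as follows. This is a simplified illustration that assumes a Hann taper and computes only per-component log-periodograms; cross-spectral quantities such as squared coherence are omitted, and the function name and arguments are hypothetical rather than taken from the analysis in Section 2.4.2.

```python
import numpy as np

def binned_log_periodogram(x, bin_len, step):
    """Turn an (n, C) multivariate series into a (T, C, F) array of log-periodograms.

    T indexes (possibly overlapping) time bins, C the components, and F the
    Fourier frequencies, so each [t, c, :] slice is a curve in frequency.
    """
    n, C = x.shape
    taper = np.hanning(bin_len)[:, None]            # assumed taper; other choices are possible
    curves = []
    for start in range(0, n - bin_len + 1, step):
        seg = x[start:start + bin_len]
        seg = seg - seg.mean(axis=0)                # remove the bin mean per component
        dft = np.fft.rfft(seg * taper, axis=0)      # (F, C) discrete Fourier transform
        pgram = np.abs(dft) ** 2 / bin_len          # raw periodogram ordinates
        curves.append(np.log(pgram + 1e-12).T)      # (C, F); small offset avoids log(0)
    freqs = np.fft.rfftfreq(bin_len)                # frequencies in cycles per sample
    return np.stack(curves), freqs
```

The resulting array has one function of frequency for each time bin and each component, which is the multivariate functional time series structure that the model takes as input.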
To model MFTS, we extend the hierarchical dynamic linear model (DLM)
framework of Gamerman and Migon (1993) and West and Harrison (1997) for
multivariate time series to the functional data setting. For smooth, flexible, and
optimal function estimates, we extend Bayesian spline theory to a more general
constrained optimization framework, which we apply for parameter identifi-
ability. Our constraints are explicit in the posterior distribution via appropri-
ate conditioning of the standard Bayesian spline posterior distribution, and the
corresponding posterior mean is the solution to an appropriate optimization
problem. We implement an efficient Gibbs sampler to obtain samples from the
joint posterior distribution, which provides exact (up to MCMC error) inference
for any parameters of interest. The proposed hierarchical Bayesian Multivariate
Functional Dynamic Linear Model has greater applicability and utility than re-
lated methods. It provides flexible modeling of complex dependence structures
among the functional observations, such as time dependence, contemporane-
ous dependence, stochastic volatility, covariates, and change points, and can
incorporate application-specific prior information.
The paper proceeds as follows. In Section 2.2, we present our model in its
most general form. We develop our (factor loading) curve estimation technique
in Section 2.3. In Section 2.4, we apply our model to the two applications dis-
cussed above and interpret the results. We also provide the details of our Gibbs
sampling algorithm, present MCMC diagnostics for our applications, and in-
clude additional figures in Appendix A.
2.2 A Multivariate Functional Dynamic Linear Model
Suppose we observe functions $Y_t^{(c)} : \mathcal{T} \to \mathbb{R}$ at times t = 1, . . . , T for outcomes
c = 1, . . . , C, where $\mathcal{T} \subseteq \mathbb{R}$ is compact. We refer to the following model as the
Multivariate Functional Dynamic Linear Model (MFDLM):
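$$
\begin{aligned}
\mathbf{Y}_t(\tau) &= \mathbf{F}(\tau)\boldsymbol{\beta}_t + \boldsymbol{\epsilon}_t(\tau), & [\boldsymbol{\epsilon}_t(\tau) \mid \mathbf{E}_t] &\overset{indep}{\sim} N(\mathbf{0}, \mathbf{E}_t), \\
\boldsymbol{\beta}_t &= \mathbf{X}_t\boldsymbol{\theta}_t + \boldsymbol{\nu}_t, & [\boldsymbol{\nu}_t \mid \mathbf{V}_t] &\overset{indep}{\sim} N(\mathbf{0}, \mathbf{V}_t), \\
\boldsymbol{\theta}_t &= \mathbf{G}_t\boldsymbol{\theta}_{t-1} + \boldsymbol{\omega}_t, & [\boldsymbol{\omega}_t \mid \mathbf{W}_t] &\overset{indep}{\sim} N(\mathbf{0}, \mathbf{W}_t),
\end{aligned}
\tag{2.1}
$$
where $\mathbf{Y}_t(\tau) = \big(Y_t^{(1)}(\tau), Y_t^{(2)}(\tau), \ldots, Y_t^{(C)}(\tau)\big)'$ is the C-dimensional vector of
multivariate functional observations at time t evaluated at $\tau \in \mathcal{T}$; $\mathbf{F}(\tau)$ is the
C × KC block matrix with 1 × K diagonal blocks $\big[f_1^{(c)}(\tau), f_2^{(c)}(\tau), \ldots, f_K^{(c)}(\tau)\big]$ for
c = 1, . . . , C of factor loading curves evaluated at $\tau \in \mathcal{T}$, with K the number of
factors per outcome, and zeros elsewhere; $\boldsymbol{\beta}_t = \big(\beta_{1,t}^{(1)}, \ldots, \beta_{K,t}^{(1)}, \beta_{1,t}^{(2)}, \ldots, \beta_{K,t}^{(C)}\big)'$ is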
the KC-dimensional vector of factors that serve as the time-dependent weights
on the factor loading curves; Xt is the known KC × p matrix of covariates at
time t, where p is the total number of covariates; θt is the p-dimensional vector
of regression coefficients associated with Xt; Gt is the p × p evolution matrix
of the regression coefficients θt at time t; and $\boldsymbol{\epsilon}_t(\tau)$, νt, and ωt are mutually
independent error vectors with variance matrices Et, Vt, and Wt, respectively.
We assume conditional independence of $[\boldsymbol{\epsilon}_t(\tau) \mid \mathbf{E}_t]$ over both t = 1, . . . , T and
τ ∈ T ; however, the latter assumption of independence over τ may be relaxed.
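We can immediately obtain a useful submodel of (2.1) by excluding covariates,
$\mathbf{X}_t = \mathbf{I}_{CK \times CK}$, and removing a level of the hierarchy, $\mathbf{V}_t = \mathbf{0}_{CK \times CK}$, so that
setting $\mathbf{G}_t = \mathbf{G}$ models $\boldsymbol{\beta}_t$ (which equals $\boldsymbol{\theta}_t$ almost surely) with a vector
autoregression (VAR).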
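As an illustration of this VAR submodel, the following minimal sketch simulates factors and functional observations from (2.1) with no covariates and Vt = 0. Everything numerical here is an assumption made for illustration and is not taken from the text: the cosine loading curves `flc` (shared across outcomes), the grid `taus`, the evolution matrix `G`, the innovation covariance `W`, and the observation error scale `sigma`.

```python
import numpy as np

rng = np.random.default_rng(0)
C, K, T = 2, 2, 50                     # outcomes, factors per outcome, time points
taus = np.linspace(0.0, 1.0, 25)       # observation grid on [0, 1] (assumed)

def flc(tau):
    """Hypothetical loading curves: the first K orthonormal cosines on [0, 1]."""
    return np.array([1.0 if k == 0 else np.sqrt(2.0) * np.cos(k * np.pi * tau)
                     for k in range(K)])

def F(tau):
    """C x KC block matrix with the 1 x K row of loading curves on each diagonal block."""
    return np.kron(np.eye(C), flc(tau)[None, :])

G = 0.7 * np.eye(C * K)                # VAR(1) evolution matrix (assumed, stable)
W = 0.1 * np.eye(C * K)                # innovation covariance (assumed)
sigma = 0.05                           # observation error standard deviation (assumed)

beta = np.zeros((T, C * K))            # factors, ordered outcome by outcome
Y = np.zeros((T, taus.size, C))        # Y[t, i, c] is outcome c at time t, point taus[i]
prev = np.zeros(C * K)
for t in range(T):
    # State level with X_t = I and V_t = 0: beta_t = G beta_{t-1} + omega_t
    beta[t] = G @ prev + rng.multivariate_normal(np.zeros(C * K), W)
    # Observation level: Y_t(tau) = F(tau) beta_t + eps_t(tau)
    for i, tau in enumerate(taus):
        Y[t, i] = F(tau) @ beta[t] + sigma * rng.standard_normal(C)
    prev = beta[t]
```

The Kronecker product reproduces the block structure of F(τ): row c of `F(tau)` is zero except in the K columns that belong to outcome c.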
To understand (2.1), first note that the observation level of the model com-
bines the functional component F(τ) with the multivariate time series component
βt. In scalar notation, we can write the observation level as
$$
Y_t^{(c)}(\tau) = \sum_{k=1}^{K} f_k^{(c)}(\tau)\,\beta_{k,t}^{(c)} + \epsilon_t^{(c)}(\tau) \tag{2.2}
$$
in which $\epsilon_t^{(c)}(\tau)$ are the elements of the vector $\boldsymbol{\epsilon}_t(\tau)$. In our construction, we
can always write the observation level of (2.1) as (2.2); simplifications for the
other levels will depend on the choice of submodel. For model identifiability,
we require orthonormality of the factor loading curves:
$$
\int_{\tau \in \mathcal{T}} f_k^{(c)}(\tau)\, f_j^{(c)}(\tau)\, d\tau = 1(k = j) \tag{2.3}
$$
for k, j = 1, . . . , K and all outcomes c = 1, . . . , C, where 1(·) is the indicator
function. In addition, to ensure a unique and interpretable ordering of the factors
$\beta_{1,t}^{(c)}, \ldots, \beta_{K,t}^{(c)}$ for each outcome c = 1, . . . , C, we order the factor loading curves
$f_1^{(c)}, \ldots, f_K^{(c)}$ by decreasing smoothness. We discuss our implementation of these
constraints in Sections 2.3.2 and 2.3.3.
There are three primary interpretations of the model, which provide insight
into useful extensions and submodels.
First, we can view (2.2) as a basis expansion of the functional observations
$Y_t^{(c)}$, with a (multivariate) time series model for the basis coefficients $\beta_{k,t}^{(c)}$ to
account for the additional dependence structures, such as common trends (see
Section 2.4.1), stochastic volatility (see Section 2.4.1), and covariates. Since the
identifiability constraint in (2.3) expresses orthonormality with respect to the $L^2$
inner product, we can interpret $\{f_1^{(c)}, \ldots, f_K^{(c)}\}$ as an orthonormal basis for the
functional observations $Y_t^{(c)}$. In contrast to common basis expansion procedures
that assume the basis functions are known and only the coefficients need to be
estimated (e.g., Bowsher and Meeks, 2008), we allow our basis functions $f_k^{(c)}$ to
be estimated from the data. As a result, the $f_k^{(c)}$ will be more closely tailored
to the data, which reduces the number of functions K needed to adequately
fit the data. Conditional on the $f_k^{(c)}$, we can specify the βt- and θt-levels of
(2.1) to appropriately model the remaining dependence among the $Y_t^{(c)}$. Using
this interpretation, we also note that (2.1) may be described as a multivariate
dynamic (concurrent) functional linear model, and therefore extends a highly
useful model in FDA (Cardot et al., 1999).
Similarly, we can interpret (2.1) as a dynamic factor analysis, which is a common
approach in yield curve modeling (e.g., Hays et al., 2012; Jungbacker et al.,
2013). Under this interpretation, the $\beta_{k,t}^{(c)}$ are dynamic factors and the $f_k^{(c)}$ are
factor loading curves (FLCs); we will use this terminology for the remainder of the
paper. Compared to a standard factor analysis, (2.1) has two major modifications:
the factors $\beta_{k,t}^{(c)}$ are dynamic and therefore have an accompanying (multivariate)
time series model, and the $f_k^{(c)}$ are functions rather than vectors.
Naturally, (2.1) has strong connections to a hierarchical DLM. Stan-
dard hierarchical DLM algorithms for sampling βt and θt assume that
{F,Gt,Xt,Et,Vt,Wt} is known (e.g., Durbin and Koopman, 2002; Petris et al.,
2009). Within our Gibbs sampler, we may condition on this set of parame-
ters, and then use existing DLM algorithms to efficiently sample βt and θt
with minimal implementation effort. Unconditionally, F is unknown, but we
impose the necessary identifiability constraints; see Section 2.3 for more de-
tails. Gt may be known or unknown depending on the application, but in
general it supplies the time series structure of the model (along with the time-
dependent error variances): in Section 2.4.1, Gt = G is unknown to allow for
data-driven dependence among the multi-economy yield curves, and in Section
2.4.2, $\mathbf{G}_t = \mathbf{I}_{CK \times CK}$ is chosen to provide parsimonious time-domain smoothing.
We assume that Xt is known, and may consist of covariates relevant to each
outcome or can be chosen to provide additional shrinkage of βt through θt. Al-
though Gamerman and Migon (1993) suggest that dim(θt) < dim(βt) for strict
dimension reduction in the hierarchy, we relax this assumption to allow for co-
variate information. Finally, we treat the error variance matrices as unknown,
but typically there are simplifications available depending on the application
and model choice. We discuss some examples in Section 2.4.
We must also specify a choice for K. In the yield curve application, two
natural choices are K = 3 and K = 4 for comparison with the common para-
metric yield curve models: the Nelson-Siegel model (Nelson and Siegel, 1987)
and the Svensson model (Svensson, 1994), both of which can be expressed as
submodels of (2.1); see Diebold and Li (2006) and Laurini and Hotta (2010).
More formally, we can treat K as a parameter and estimate it using reversible
jump MCMC methods (Green, 1995), or select K using marginal likelihood. In
particular, since we employ a Gibbs sampler, the marginal likelihood estima-
tion procedure of Chib (1995) is convenient for many submodels of (2.1). For
more complex models, DIC provides a less computationally intensive approach
than either reversible jump MCMC or marginal likelihood, and is very simple
to compute. In Appendix A, we discuss a fast procedure based on the singu-
lar value decomposition from our initialization algorithm which can be used to
estimate a range of reasonable values for K.
2.3 Estimating the Factor Loading Curves
We would like to model the FLCs $f_k^{(c)}$ in a smooth, flexible, and computationally
appealing manner. Clearly, the latter two attributes are important for broader
applicability and larger data sets, including larger T, larger C, and larger $m_t^{(c)}$,
where $m_t^{(c)}$ denotes the number of observation points for outcome c at time t.
The smoothness requirement is fundamental as well: as documented in Jung-
backer et al. (2013), smoothness constraints can improve forecasting, despite
the small biases imposed by such constraints. Smooth curves also tend to be
more interpretable, since gradual trends are usually easier to explain than sharp
changes or discontinuities.
However, there are some additional complications. First, we must incorpo-
rate the identifiability constraints, preferably without severely detracting from
the smoothness and goodness-of-fit of the FLCs. We also have K curves to
estimate for each outcome (or perhaps K curves common to all outcomes; see
Section 2.3.4), similar to the varying-coefficients model of Hastie and
Tibshirani (1993), conditional on the factors $\beta_{k,t}^{(c)}$. Finally, the observation
points for the functions $Y_t^{(c)}$ are likely different for each outcome c, and may
also vary with time t.
2.3.1 Splines
A common approach in nonparametric and semiparametric regression is to
express each unknown function $f_k^{(c)}$ as a linear combination of known basis
functions, and then estimate the associated coefficients by maximizing a (penalized)
likelihood (e.g., Wahba, 1990; Eubank, 1999; Ruppert et al., 2003). We use B-
spline basis functions for their numerical properties and easy implementation,
but our methods can accommodate other bases as well. For now, we ignore
dependence on c for notational convenience; this also corresponds to either the
univariate case (C = 1) or C > 1 with Et diagonal and the FLCs assumed to be
a priori independent for c = 1, . . . , C (see Section 2.3.4 for an important
alternative). Following Wand and Ormerod (2008), we use cubic splines and the knot
sequence $a = \kappa_1 = \cdots = \kappa_4 < \kappa_5 < \cdots < \kappa_{M+4} < \kappa_{M+5} = \cdots = \kappa_{M+8} = b$,
with $\boldsymbol{\phi}_B = (\phi_1, \ldots, \phi_{M+4})$ the associated cubic B-spline basis, M the number of
interior knots, and $\mathcal{T} = [a, b]$. While we could allow each fk to have its own
B-spline basis and accompanying sequence of knots, there is no obvious reason
to do so. In our applications, we use M = 20 interior knots. For knot placement,
we prefer a quantile-based approach such as the default method described in
Ruppert et al. (2003), which is responsive to the location of observation points
in the data yet is computationally inexpensive; however, equally-spaced knots
may be preferable in some applications.
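The knot sequence and basis described above can be sketched directly with the Cox-de Boor recursion. This is an illustrative implementation under stated assumptions, not the authors' code: the function names `cubic_knot_sequence` and `bspline_basis` are hypothetical, and the interior knots use a plain quantile rule on the unique observation points, which need not coincide exactly with the default of Ruppert et al. (2003).

```python
import numpy as np

def cubic_knot_sequence(tau_obs, M):
    """Boundary knots a and b repeated four times, with M interior quantile knots."""
    pts = np.unique(tau_obs)
    a, b = pts[0], pts[-1]
    interior = np.quantile(pts, np.linspace(0.0, 1.0, M + 2)[1:-1])
    return np.concatenate([np.repeat(a, 4), interior, np.repeat(b, 4)])

def bspline_basis(x, knots, degree=3):
    """Evaluate all B-spline basis functions at x; returns shape (len(x), len(knots) - degree - 1)."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(knots, dtype=float)
    # Degree 0: indicators of the half-open knot intervals [t_i, t_{i+1})
    B = np.zeros((x.size, t.size - 1))
    for i in range(t.size - 1):
        B[:, i] = (t[i] <= x) & (x < t[i + 1])
    # Assign the right endpoint b to the last non-empty interval
    last = np.nonzero(t[:-1] < t[1:])[0].max()
    B[x == t[-1], last] = 1.0
    # Cox-de Boor recursion; terms with a zero-width denominator are dropped
    for p in range(1, degree + 1):
        B_next = np.zeros((x.size, t.size - 1 - p))
        for i in range(t.size - 1 - p):
            left = t[i + p] - t[i]
            right = t[i + p + 1] - t[i + 1]
            if left > 0:
                B_next[:, i] += (x - t[i]) / left * B[:, i]
            if right > 0:
                B_next[:, i] += (t[i + p + 1] - x) / right * B[:, i + 1]
        B = B_next
    return B

# Usage with M = 20 interior knots on hypothetical observation points
rng = np.random.default_rng(0)
tau_obs = rng.uniform(1.0, 360.0, size=200)
M = 20
knots = cubic_knot_sequence(tau_obs, M)
Phi_B = bspline_basis(tau_obs, knots)       # row i is phi_B(tau_i)'
```

With M interior knots the sequence has M + 8 entries and the basis has M + 4 functions, matching the dimension of dk.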
Explicitly, we write $f_k(\tau) = \boldsymbol{\phi}_B'(\tau)\mathbf{d}_k$, where $\mathbf{d}_k$ is the (M + 4)-dimensional
vector of unknown coefficients. Therefore, the function estimation problem is
reduced to a vector estimation problem. In classical nonparametric regression,
dk is estimated by maximizing a penalized likelihood, or equivalently solving
$$
\min_{\mathbf{d}_k} \; -2\log[\mathbf{Y} \mid \mathbf{d}_k] + \lambda_k \mathcal{P}(\mathbf{d}_k) \tag{2.4}
$$
where [Y|dk] is a likelihood, P is a convex penalty function, and λk ≥ 0. We
express (2.4) as a log-likelihood multiplied by −2 so that for a Gaussian likeli-
hood, (2.4) is simply a penalized least squares objective. For greater generality,
we leave the likelihood unspecified, but later consider the likelihood of model
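(2.2). To penalize roughness, a standard choice for P is the squared L2-norm of
the second derivative of fk, which can be written in terms of dk:
$$
\mathcal{P}(\mathbf{d}_k) = \int_{\tau \in \mathcal{T}} \big[\ddot{f}_k(\tau)\big]^2\, d\tau = \mathbf{d}_k' \boldsymbol{\Omega}_\phi \mathbf{d}_k \tag{2.5}
$$
where $\ddot{f}_k$ denotes the second derivative of fk and $\boldsymbol{\Omega}_\phi = \int_{\mathcal{T}} \ddot{\boldsymbol{\phi}}_B(\tau)\, \ddot{\boldsymbol{\phi}}_B'(\tau)\, d\tau$,
which is easily computable for B-splines. With this choice of penalty, (2.4)
balances goodness-of-fit with smoothness, where the trade-off is determined by λk.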
Since P is a quadratic in dk, for fixed λk, (2.4) is straightforward to solve for
many likelihoods, in particular a Gaussian likelihood. Letting d̄k be this solu-
tion, we can estimate fk(τ) for any τ ∈ T with f̂k(τ) = φ′B(τ)d̄k. For a general
knot sequence, the resulting estimator f̂k is an O’Sullivan spline, or O-spline, in-
troduced by O’Sullivan (1986) and explored in Wand and Ormerod (2008). In
the special case of univariate nonparametric regression in which there is a knot
at every observation point, f̂k is a natural cubic smoothing spline (e.g., Green
and Silverman, 1993). Alternatively, if we choose a sparser sequence of knots
and set λk = 0, f̂k is a regression spline (e.g., Ramsay and Silverman, 2005). O-
splines are numerically stable, possess natural boundary properties, and can be
computed efficiently (cf. Wand and Ormerod, 2008).
2.3.2 Bayesian Splines
Splines also have a convenient Bayesian interpretation (e.g., Wahba, 1978, 1983,
1990; Gu, 1992; Van der Linde, 1995; Berry et al., 2002). Returning to (2.4), we no-
tably have a likelihood term and a penalty term, where the penalty is a function
of only the vector of coefficients dk and known quantities. Therefore, condi-
tional on λk, the term λkP(dk) provides prior information about dk, for example
that $f_k = \boldsymbol{\phi}_B'\mathbf{d}_k$ is smooth. Under this general interpretation, (2.4) combines
the prior information with the likelihood to obtain an estimate of dk. A natural
Bayesian approach is therefore to construct a prior for dk based on the penalty
P , in particular so that the posterior mode of dk is the solution to (2.4). For the
most common settings in which the likelihood is Gaussian and the penalty P
is (2.5), the posterior distribution of dk will be Gaussian, so the posterior mean
will also solve (2.4).
To construct a prior from P , it is computationally and conceptually con-
venient to reparameterize dk so that the penalty matrix Ωφ is diagonal. Un-
der a Gaussian prior, this corresponds to prior independence of the compo-
nents of dk. The reparameterization will also affect the basis φB, but otherwise
will leave the likelihood in (2.4) unchanged. Following Wand and Ormerod
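(2008), let $\boldsymbol{\Omega}_\phi = \mathbf{U}_\Omega \mathbf{D}_\Omega \mathbf{U}_\Omega'$ be the singular value decomposition of $\boldsymbol{\Omega}_\phi$, where
$\mathbf{U}_\Omega'\mathbf{U}_\Omega = \mathbf{I}_{(M+4)\times(M+4)}$ and $\mathbf{D}_\Omega$ is a diagonal matrix with M + 2 positive
components. Denote the diagonal matrix of these positive entries by $\mathbf{D}_{\Omega,P}$ and let $\mathbf{U}_{\Omega,P}$
be the corresponding (M + 4) × (M + 2) submatrix of $\mathbf{U}_\Omega$. Using the reparameterized
basis $\boldsymbol{\phi}'(\tau) = \big[1, \tau, \boldsymbol{\phi}_B'(\tau)\mathbf{U}_{\Omega,P}\mathbf{D}_{\Omega,P}^{-1/2}\big]$ and penalty $\mathbf{d}_k'\boldsymbol{\Omega}_D\mathbf{d}_k$ with
$\boldsymbol{\Omega}_D = \mathrm{diag}(0, 0, \lambda_k, \ldots, \lambda_k)$, the new solution $\hat{\mathbf{d}}_k$ to (2.4) satisfies
$\hat{f}_k(\tau) = \boldsymbol{\phi}_B'(\tau)\bar{\mathbf{d}}_k = \boldsymbol{\phi}'(\tau)\hat{\mathbf{d}}_k$; see Wand and Ormerod (2008) for more details. It is
therefore natural to use the prior $\mathbf{d}_k \sim N(\mathbf{0}, \mathbf{D}_k)$, where
$\mathbf{D}_k = \mathrm{diag}\big(10^8, 10^8, \lambda_k^{-1}, \ldots, \lambda_k^{-1}\big)$ and
λk > 0, which satisfies $\mathbf{D}_k^{-1} \approx \boldsymbol{\Omega}_D$. Notably, this prior is proper, yet is diffuse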
over the space of constant and linear functions—which are unpenalized by P .
This reparameterization is a common approach for fitting splines using mixed
effects model software (e.g., Ruppert et al., 2003).
Since we assume conditional independence between levels of (2.1), our con-
ditional likelihood for the FLCs is simply that of model (2.2), but we ignore
dependence on c for now:
$$
Y_t(\tau) = \sum_{k=1}^{K} \beta_{k,t} f_k(\tau) + \epsilon_t(\tau) = \sum_{k=1}^{K} \beta_{k,t}\, \boldsymbol{\phi}'(\tau)\mathbf{d}_k + \epsilon_t(\tau) \tag{2.6}
$$
where $\epsilon_t(\tau) \overset{iid}{\sim} N(0, \sigma^2)$ for simplicity; the results are similar for more
sophisticated error variance structures. In particular, (2.6) describes the distribution of
the functional data Yt given the FLCs fk (or dk), also conditional on $\beta_{k,t}$ and $\sigma^2$.
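Under the likelihood of model (2.6) and the reparameterized (approximate)
penalty $\mathbf{d}_k'\mathbf{D}_k^{-1}\mathbf{d}_k$, the solution to (2.4) conditional on $\mathbf{d}_j$, j ≠ k is
given by $\hat{\mathbf{d}}_k = \mathbf{B}_k\mathbf{b}_k$, where
$$
\mathbf{B}_k^{-1} = \mathbf{D}_k^{-1} + \sigma^{-2}\sum_{t=1}^{T} \beta_{k,t}^2 \sum_{\tau \in \mathcal{T}_t} \boldsymbol{\phi}(\tau)\boldsymbol{\phi}'(\tau), \qquad
\mathbf{b}_k = \sigma^{-2}\sum_{t=1}^{T} \beta_{k,t} \sum_{\tau \in \mathcal{T}_t} \Big[Y_t(\tau) - \sum_{j \neq k} \beta_{j,t} f_j(\tau)\Big]\boldsymbol{\phi}(\tau),
$$
and $\mathcal{T}_t \subseteq \mathcal{T}$ denotes the discrete set of $|\mathcal{T}_t| = m_t$ observation points for Yt
at time t. Note that if $\mathcal{T}_t = \mathcal{T}_1$ for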
t = 2, . . . , T , then Bk and bk may be rewritten more conveniently in vector no-
tation. Most importantly for our purposes, under the same likelihood induced
by (2.6) and the prior dk ∼ N(0,Dk), the posterior distribution of dk is multi-
variate Gaussian with mean d̂k and variance Bk. For convenient computations,
Wand and Ormerod (2008) provide an exact construction of Ωφ and suggest effi-
cient algorithms for d̂k based on the Cholesky decomposition; we provide more
details in Appendix A.
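To make the Gaussian full conditional concrete, here is a minimal sketch of one draw of dk given the other coefficient vectors, using a Cholesky factor of the precision matrix. It assumes a common observation grid for all t and, for brevity, replaces the reparameterized B-spline basis with a stand-in basis (constant, linear, and a few cosines); the names `Phi`, `D_inv`, and `sample_d_k`, and all numerical values, are hypothetical. The Cholesky-based algorithms referenced in Appendix A may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(2)
T, m, p, K = 40, 15, 6, 2                 # p stands in for M + 4
taus = np.linspace(0.0, 1.0, m)           # common grid for every t (assumed)

# Stand-in basis: constant, linear, then p - 2 cosine terms (not the O-spline basis)
Phi = np.column_stack([np.ones(m), taus] +
                      [np.cos(j * np.pi * taus) for j in range(1, p - 1)])

lam = np.array([50.0, 5.0])               # smoothing parameters, lam[0] > lam[1]
sigma2 = 0.01                             # observation error variance (assumed known here)
# Prior precision D_k^{-1} = diag(1e-8, 1e-8, lam_k, ..., lam_k)
D_inv = [np.diag(np.r_[1e-8, 1e-8, np.full(p - 2, lam[k])]) for k in range(K)]

beta = rng.standard_normal((T, K))        # factors, treated as given
d = 0.3 * rng.standard_normal((K, p))     # coefficient vectors used to simulate data
Y = beta @ (d @ Phi.T) + np.sqrt(sigma2) * rng.standard_normal((T, m))

def sample_d_k(k, d, rng):
    """One draw from N(B_k b_k, B_k), working with the precision Q = B_k^{-1}."""
    others = [j for j in range(K) if j != k]
    # Partial residuals Y_t(tau) - sum_{j != k} beta_{j,t} f_j(tau)
    resid = Y - beta[:, others] @ (d[others] @ Phi.T)
    Q = D_inv[k] + np.sum(beta[:, k] ** 2) / sigma2 * (Phi.T @ Phi)
    b = Phi.T @ (resid.T @ beta[:, k]) / sigma2
    chol = np.linalg.cholesky(Q)          # Q = chol @ chol.T
    mean = np.linalg.solve(Q, b)
    # If z ~ N(0, I), then solve(chol.T, z) has covariance Q^{-1} = B_k
    draw = mean + np.linalg.solve(chol.T, rng.standard_normal(p))
    return mean, draw, Q, b

mean0, draw0, Q0, b0 = sample_d_k(0, d, rng)
```

Working with the precision matrix avoids forming Bk explicitly, which is the usual reason for preferring a Cholesky-based draw.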
To identify the ordering of the factors and FLCs in (2.2), we constrain the
smoothing parameters λ1 > λ2 > · · · > λK > 0. While other model constraints
are available, this ordering constraint is particularly appealing: it sorts the FLCs
fk by decreasing smoothness, as characterized by the penalty function P , and
leads to a convenient prior distribution on the smoothing parameters λk. In
the Bayesian setting, the smoothing parameters are equivalently the prior preci-
sions of the penalized (nonlinear) components of dk. Letting dk,j denote the jth
component of dk, the prior on the FLC basis coefficients is
$d_{k,j} \overset{iid}{\sim} N(0, \lambda_k^{-1})$ for
j = 3, . . . ,M + 4. This is similar to the hierarchical setting of Gelman (2006), in
which there are M + 2 groups for each λk, k = 1, . . . , K. Since M + 2 is typically
large, we follow the Gelman (2006) recommendation to place uniform priors on
the group standard deviations $\lambda_k^{-1/2}$, k = 1, . . . , K. Incorporating the ordering
constraint, the conditional priors are $\lambda_k^{-1/2} \sim \mathrm{Uniform}(\ell_k, u_k)$, where $\ell_1 = 0$,
$\ell_k = \lambda_{k-1}^{-1/2}$ for k = 2, . . . , K, $u_k = \lambda_{k+1}^{-1/2}$ for k = 1, . . . , K − 1, and $u_K = 10^4$.
The upper bound on $\lambda_K^{-1/2}$, and therefore all $\lambda_k^{-1/2}$, is chosen to equal the diffuse
prior standard deviation of $d_{k,1}$ and $d_{k,2}$. The full conditional distributions of
the smoothing parameters $\lambda_k$ are $\mathrm{Gamma}\big(\tfrac{1}{2}(M + 1), \tfrac{1}{2}\sum_{j=3}^{M+4} d_{k,j}^2\big)$ truncated to
$(u_k^{-2}, \ell_k^{-2})$ for k = 1, . . . , K, where we define $\ell_1^{-2} = \infty$. Notably, we avoid the
diffuse Gamma prior on λk, which can be undesirably informative and is strongly
discouraged by Gelman (2006). More generally, our approach provides a natu-
ral and data-driven method for estimating the smoothing parameters, yet does
not inhibit inference. Details on the sampling of λk are provided in Appendix
A.
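The truncated Gamma full conditionals can be sampled in several ways. The sketch below uses simple rejection sampling, which is a technique swapped in here for illustration and is not claimed to be the method of Appendix A; the function names `sample_lambda_k` and `update_lambdas` are hypothetical. The second argument of the Gamma distribution is treated as a rate, so NumPy's scale parameter is its reciprocal.

```python
import numpy as np

def sample_lambda_k(d_pen, lower, upper, rng, max_tries=100_000):
    """Draw lambda_k from Gamma((M + 1) / 2, rate = sum(d_pen**2) / 2) restricted to (lower, upper).

    d_pen holds the M + 2 penalized coefficients d_{k,3}, ..., d_{k,M+4}.
    """
    shape = 0.5 * (d_pen.size - 1)              # (M + 1) / 2
    rate = 0.5 * np.sum(d_pen ** 2)
    for _ in range(max_tries):
        lam = rng.gamma(shape, 1.0 / rate)      # NumPy parameterizes by scale = 1 / rate
        if lower < lam < upper:
            return lam
    raise RuntimeError("truncation region has negligible mass; use an inverse-CDF sampler")

def update_lambdas(lam, d, rng, u_K=1e4):
    """One sweep over k that preserves lambda_1 > lambda_2 > ... > lambda_K > u_K**-2."""
    lam = lam.copy()
    K = lam.size
    for k in range(K):
        lower = lam[k + 1] if k < K - 1 else u_K ** -2   # u_k^{-2}
        upper = lam[k - 1] if k > 0 else np.inf          # l_k^{-2}, infinite for k = 1
        lam[k] = sample_lambda_k(d[k, 2:], lower, upper, rng)
    return lam

# Usage with M = 20 and K = 3 on hypothetical coefficient values
rng = np.random.default_rng(3)
M, K = 20, 3
lam0 = np.array([100.0, 10.0, 1.0])
d = rng.standard_normal((K, M + 4)) / np.sqrt(lam0)[:, None]
lam1 = update_lambdas(lam0, d, rng)
```

Rejection is adequate when the truncation interval carries most of the Gamma mass; when it does not, an inverse-CDF draw on the truncated interval would be the safer choice.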
2.3.3 Constrained Bayesian Splines
We extend the Bayesian spline approach to accommodate the necessary
identifiability constraints for the MFDLM. For each k = 1, . . . , K, we impose the
orthonormality constraints $\int_{\mathcal{T}} f_k(\tau) f_j(\tau)\, d\tau = 1(k = j)$ for j = 1, . . . , K. The
unit-norm constraint preserves identifiability with respect to scaling, i.e., relative to
the factors βk,t (up to changes in sign). The orthogonality constraints distinguish
between pairs of FLCs, and in our approach identify the FLCs with distinct pos-
terior distributions.
While other identifiability constraints are available for the fk, orthonormality
is appealing for a number of reasons. As discussed in Section 2.2, the orthonor-
mality constraints suggest that we can interpret {f1, . . . , fK} as an orthonormal
basis for the functional observations Yt. As such, the orthogonality constraints
help eliminate any information overlap between FLCs, which keeps the total
number of necessary FLCs to a minimum. Furthermore, the unit norm con-
straint allows for easier comparisons among the fk. Of course, the fk will be
weighted by the factors βk,t, so they can still have varying effects on the condi-
tional mean of Yt in (2.2). Finally, we can write the constraints conveniently in
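terms of the vectors dk and dj:
$$
\int_{\tau \in \mathcal{T}} f_k(\tau) f_j(\tau)\, d\tau = \int_{\tau \in \mathcal{T}} \boldsymbol{\phi}'(\tau)\mathbf{d}_k\, \boldsymbol{\phi}'(\tau)\mathbf{d}_j\, d\tau = \mathbf{d}_k'\mathbf{J}_\phi\mathbf{d}_j = 1(k = j) \tag{2.7}
$$
for j = 1, . . . , K, where $\mathbf{J}_\phi = \int_{\tau \in \mathcal{T}} \boldsymbol{\phi}(\tau)\boldsymbol{\phi}'(\tau)\, d\tau$ is easily computed for B-splines,
and only needs to be computed once, prior to any MCMC sampling.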
The addition of an orthogonality constraint to a (penalized) least squares
problem has an intuitive regression-based interpretation, which we present in
the following theorem:
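Theorem 2.1. Consider the penalized least squares objective
$\sigma^{-2}\sum_{i=1}^{n}(y_i - \mathbf{X}_i'\mathbf{d})^2 + \lambda\mathbf{d}'\boldsymbol{\Omega}\mathbf{d}$,
where $y_i \in \mathbb{R}$, d is an unknown (M + 4)-dimensional vector, $\mathbf{X}_i$ is a known
(M + 4)-dimensional vector, Ω is a known (M + 4) × (M + 4) positive-definite matrix,
and $\sigma^2, \lambda > 0$ are known scalars. The solution is $\hat{\mathbf{d}} = \mathbf{B}\mathbf{b}$, where
$\mathbf{B}^{-1} = \lambda\boldsymbol{\Omega} + \sigma^{-2}\sum_{i=1}^{n}\mathbf{X}_i\mathbf{X}_i'$ and $\mathbf{b} = \sigma^{-2}\sum_{i=1}^{n}\mathbf{X}_i y_i$. Now consider the same objective, but subject
to the J linear constraints $\mathbf{d}'\mathbf{L} = \mathbf{0}$ for L a known (M + 4) × J matrix of rank J. The
solution is $\tilde{\mathbf{d}} = \mathbf{B}\tilde{\mathbf{b}}$, where $\tilde{\mathbf{b}}$ is the vector of residuals from the generalized least squares
regression $\mathbf{b} = \mathbf{L}\boldsymbol{\Lambda} + \boldsymbol{\delta}$ with $E(\boldsymbol{\delta}) = \mathbf{0}$ and $\mathrm{Var}(\boldsymbol{\delta}) = \mathbf{B}$.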
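Proof. The optimality of $\hat{\mathbf{d}}$ is a well-known result. For the constrained case, the
Lagrangian is $\mathcal{L}(\mathbf{d}, \boldsymbol{\Lambda}) = \sigma^{-2}\sum_{i=1}^{n}(y_i - \mathbf{X}_i'\mathbf{d})^2 + \lambda\mathbf{d}'\boldsymbol{\Omega}\mathbf{d} + \mathbf{d}'\mathbf{L}\boldsymbol{\Lambda}$, where Λ is the
J-dimensional vector of Lagrange multipliers associated with the J linear
constraints. It is straightforward to minimize $\mathcal{L}(\mathbf{d}, \boldsymbol{\Lambda})$ with respect to d and obtain
the solution $\tilde{\mathbf{d}} = \mathbf{B}\tilde{\mathbf{b}} = \mathbf{B}(\mathbf{b} - \mathbf{L}\boldsymbol{\Lambda})$. Similarly, solving $\nabla\mathcal{L}(\tilde{\mathbf{d}}, \boldsymbol{\Lambda}) = \mathbf{0}$ for Λ implies
that $\boldsymbol{\Lambda} = (\mathbf{L}'\mathbf{B}\mathbf{L})^{-1}\mathbf{L}'\mathbf{B}\mathbf{b}$, which is the solution to the generalized least squares
regression of b on L with error variance B.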
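The minimization step in the proof can be spelled out as follows. This is a sketch that uses only the definitions of B and b in Theorem 2.1; the constant factor produced by differentiation is absorbed into the multiplier, which is an assumption about the scaling convention for Λ.

```latex
\begin{align*}
\sigma^{-2}\sum_{i=1}^{n}(y_i - \mathbf{X}_i'\mathbf{d})^2 + \lambda\,\mathbf{d}'\boldsymbol{\Omega}\mathbf{d}
  &= \mathbf{d}'\mathbf{B}^{-1}\mathbf{d} - 2\,\mathbf{b}'\mathbf{d} + \text{const}, \\
\nabla_{\mathbf{d}}\,\mathcal{L}(\mathbf{d},\boldsymbol{\Lambda})
  = 2\,\mathbf{B}^{-1}\mathbf{d} - 2\,\mathbf{b} + \mathbf{L}\boldsymbol{\Lambda} = \mathbf{0}
  &\;\Longrightarrow\; \tilde{\mathbf{d}} = \mathbf{B}\bigl(\mathbf{b} - \tfrac{1}{2}\mathbf{L}\boldsymbol{\Lambda}\bigr), \\
\mathbf{L}'\tilde{\mathbf{d}} = \mathbf{0}
  &\;\Longrightarrow\; \tfrac{1}{2}\boldsymbol{\Lambda} = (\mathbf{L}'\mathbf{B}\mathbf{L})^{-1}\mathbf{L}'\mathbf{B}\mathbf{b}.
\end{align*}
```

Relabeling ½Λ as Λ recovers the expressions stated in the proof. The inverse exists because B is positive definite and L has rank J.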
The result is interpretable: to incorporate linear constraints into a penalized
least squares regression, we find b̃ nearest to b under the inner product induced
by B among vectors in the space orthogonal to Col(L). In our setting, extend-
ing (2.4) under a Gaussian likelihood to accommodate the (linear) orthogonality
constraints $\mathbf{d}_k'\mathbf{J}_\phi\mathbf{d}_j = 0$ for j ≠ k may be described via a regression of the
unconstrained solution on the constraints. However, the unit norm constraint is
nonlinear. This constraint affects the scaling but not the shape of fk. Therefore,
a reasonable approach is to construct a posterior distribution for dk that respects
the (linear) orthogonality constraints only, and then normalize the samples from
this posterior to preserve identifiability. We provide more details in Appendix
A.
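To extend the unconstrained Bayesian splines of Section 2.3.2 to incorporate
the orthogonality constraints, we write the constraints $\mathbf{d}_k'\mathbf{J}_\phi\mathbf{d}_j = 0$
for j ≠ k as the linear constraints in Theorem 2.1 with
$\mathbf{L}_{[-k]} = (\mathbf{J}_\phi\mathbf{d}_1, \ldots, \mathbf{J}_\phi\mathbf{d}_{k-1}, \mathbf{J}_\phi\mathbf{d}_{k+1}, \ldots, \mathbf{J}_\phi\mathbf{d}_K)$ and J = K − 1. Using the full
conditional posterior distribution $\mathbf{d}_k \sim N(\mathbf{B}_k\mathbf{b}_k, \mathbf{B}_k)$ from Section 2.3.2, we can
additionally condition on the linear constraints $\mathbf{d}_k'\mathbf{L}_{[-k]} = \mathbf{0}$, and obtain the
constrained full conditional distribution $\mathbf{d}_k \sim N(\tilde{\mathbf{B}}_k\mathbf{b}_k, \tilde{\mathbf{B}}_k)$, where
$\tilde{\mathbf{B}}_k = \mathbf{B}_k - \mathbf{B}_k\mathbf{L}_{[-k]}\big(\mathbf{L}_{[-k]}'\mathbf{B}_k\mathbf{L}_{[-k]}\big)^{-1}\mathbf{L}_{[-k]}'\mathbf{B}_k$. Conditioning on the orthogonality constraints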
is particularly interpretable in the Bayesian setting, and is convenient for pos-
terior sampling; see Appendix A for more details. By comparison, Theorem
2.1 implies that the solution to (2.4) under the likelihood of model (2.6), the
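penalty $\mathbf{d}_k'\mathbf{D}_k^{-1}\mathbf{d}_k$, and subject to the linear constraints $\mathbf{d}_k'\mathbf{L}_{[-k]} = \mathbf{0}$ is given by
$\tilde{\mathbf{d}}_k = \mathbf{B}_k\tilde{\mathbf{b}}_k$, where $\tilde{\mathbf{b}}_k = \mathbf{b}_k - \mathbf{L}_{[-k]}\boldsymbol{\Lambda}_{[-k]}$ and
$\boldsymbol{\Lambda}_{[-k]} = \big(\mathbf{L}_{[-k]}'\mathbf{B}_k\mathbf{L}_{[-k]}\big)^{-1}\mathbf{L}_{[-k]}'\mathbf{B}_k\mathbf{b}_k$.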
Notably, B̃kbk = Bkb̃k = d̃k, which is a useful result: by simply conditioning
on the linear orthogonality constraints in the full conditional Gaussian distribu-
tion for dk, the posterior mean of the resulting Gaussian distribution solves the
constrained regression problem of Theorem 2.1. In this sense, the identifiability
constraints on fk are enforced optimally.
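The equivalence between the conditioned posterior mean and the constrained solution of Theorem 2.1 can be checked numerically. The following sketch uses randomly generated stand-ins for Jφ, Bk, bk, and the other coefficient vectors (all hypothetical values), computes both routes, and then draws one constrained sample and rescales it to unit norm. Drawing unconstrained and then projecting is one standard way to condition a Gaussian on linear equalities; it is shown here as an assumption, not as the procedure of Appendix A.

```python
import numpy as np

rng = np.random.default_rng(1)
p, K = 8, 3                                   # p stands in for M + 4

def random_spd(n, rng):
    """Random symmetric positive-definite matrix, used only as a stand-in."""
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

J_phi = random_spd(p, rng)                    # stand-in for the Gram matrix J_phi
B_k = np.linalg.inv(random_spd(p, rng))       # stand-in for the unconstrained covariance B_k
B_k = 0.5 * (B_k + B_k.T)                     # remove round-off asymmetry
b_k = rng.standard_normal(p)
d_other = rng.standard_normal((p, K - 1))     # current d_j for j != k, stored as columns
L = J_phi @ d_other                           # L_[-k] = (J_phi d_j)_{j != k}

LBL = L.T @ B_k @ L

# Route 1 (Theorem 2.1): regress b_k on L with weight B_k, then map the residual through B_k
Lam = np.linalg.solve(LBL, L.T @ B_k @ b_k)
d_tilde = B_k @ (b_k - L @ Lam)

# Route 2 (conditioning): mean of N(B_k b_k, B_k) given d_k' L = 0
B_tilde = B_k - B_k @ L @ np.linalg.solve(LBL, L.T @ B_k)
cond_mean = B_tilde @ b_k

# One constrained draw: sample unconstrained, then remove the component violating the constraints
d_free = B_k @ b_k + np.linalg.cholesky(B_k) @ rng.standard_normal(p)
d_con = d_free - B_k @ L @ np.linalg.solve(LBL, L.T @ d_free)

# Rescale so that d' J_phi d = 1; scaling does not disturb orthogonality
d_unit = d_con / np.sqrt(d_con @ J_phi @ d_con)
```

Both routes return the same vector, and the rescaled draw satisfies the orthogonality constraints against every column of `d_other` while having unit norm under Jφ.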
2.3.4 Common Factor Loading Curves for Multivariate Modeling
Reintroducing dependence on c for the FLCs $f_k^{(c)}$, suppose that C > 1, so that
our functional time series $Y_t^{(c)}$ is truly multivariate. If we wish to estimate a
priori independent FLCs for each outcome c (with Et diagonal), then we can
sample from the relevant posterior distributions independently for c = 1, . . . , C
using the methods of Section 2.3.3. The more interesting case is the common factor
loading curves model given by $f_k^{(c)} = f_k$, so that all outcomes share a common
set of FLCs. In the basis interpretation of the MFDLM, this corresponds to the
assumption that the functional observations for all outcomes $Y_t^{(c)}$, c = 1, . . . , C,
t = 1, . . . , T share a common basis. We find this approach to be useful and
intuitive, since it pools information across outcomes and suggests a more
parsimonious model. Equally important, the common FLCs approach allows for direct
comparison between factors $\beta_{k,t}^{(c)}$ and $\beta_{k,t}^{(c')}$ for outcomes c and c′, since these
factors serve as weights on the same FLC (or basis function) fk. We use this model
in both applications in Section 2.4.
The common FLCs model implies $f_k^{(c)}(\tau) = \boldsymbol{\phi}_{(c)}'(\tau)\mathbf{d}_k^{(c)} = f_k(\tau)$. However,
since the FLCs for each outcome are identical, it is reasonable to assume that
they have the same vector of basis functions φ, so $f_k^{(c)} = f_k$ is equivalent to
$\mathbf{d}_k^{(c)} = \mathbf{d}_k$. Moreover, by writing $f_k^{(c)}(\tau) = \boldsymbol{\phi}'(\tau)\mathbf{d}_k$, we can use all of the observation
points across all outcomes c = 1, . . . , C and times t = 1, . . . , T, yet the parameter
of interest, dk, will only be (M + 4)-dimensional.
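Modifying our previous approach, we use the likelihood of model (2.2)
with the simple error distribution $\epsilon_t^{(c)}(\tau) \overset{iid}{\sim} N(0, \sigma_{(c)}^2)$. The implied full
conditional posterior distribution for dk is again $N(\tilde{\mathbf{B}}_k\mathbf{b}_k, \tilde{\mathbf{B}}_k)$, but now with
$$
\mathbf{B}_k^{-1} = \mathbf{D}_k^{-1} + \sum_{c=1}^{C} \sigma_{(c)}^{-2} \sum_{t \in T^{(c)}} \big(\beta_{k,t}^{(c)}\big)^2 \sum_{\tau \in \mathcal{T}_t^{(c)}} \boldsymbol{\phi}(\tau)\boldsymbol{\phi}'(\tau)
$$
and
$$
\mathbf{b}_k = \sum_{c=1}^{C} \sigma_{(c)}^{-2} \sum_{t \in T^{(c)}} \beta_{k,t}^{(c)} \sum_{\tau \in \mathcal{T}_t^{(c)}} \Big[Y_t^{(c)}(\tau) - \sum_{j \neq k} \beta_{j,t}^{(c)} f_j(\tau)\Big]\boldsymbol{\phi}(\tau).
$$
For full generality, we allow the (discrete) set of times $T^{(c)}$ to vary for each
outcome c and the (discrete) set of observation points $\mathcal{T}_t^{(c)}$ to vary with both
time t and outcome c, with $|\mathcal{T}_t^{(c)}| = m_t^{(c)}$. Note that we reuse the same notation from Section 2.3.3 to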
emphasize the similarity of the multivariate results to the univariate (or a priori
independent FLC) results. The common notation also allows for a more concise
description of the sampling algorithm, which we present in Appendix A.
2.4 Data Analysis and Results
2.4.1 Multi-Economy Yield Curves
We jointly analyze weekly yield curves provided by the Federal Reserve (Fed),
the Bank of England (BOE), the European Central Bank (ECB), and the Bank of
Canada (BOC; Bolder et al. 2004) from late 2004 to early 2014 (T = 490 and C =
4). These data are publicly available and published on the respective central
bank websites—and as such, we treat them as reliable estimates of the yield
curves. For each outcome, the yield curves are estimated differently: the Fed
uses quasi-cubic splines, the BOE uses cubic splines with variable smoothing
parameters (Waggoner, 1997), the ECB uses Svensson curves, and the BOC uses
exponential splines (Li et al., 2001). Therefore, the functional observations have
already been smoothed, although by different procedures. The available set of
maturities $\mathcal{T}_t^{(c)}$ is not the same across economies c, and occasionally varies with
time t. The most frequent values of $m_t^{(c)}$, t = 1, . . . , T, are 11 (Fed), 100 (BOE),
354 (ECB), and 120 (BOC), with maturities τ ranging from 1-3 months up to
300-360 months. To facilitate a simpler analysis, we let $Y_t^{(c)}(\tau)$ be the week-to-week
change in the cth central bank yield curve on week t for maturity τ . Differencing
the yield curves conveniently addresses the nonstationarity in the weekly data,
and, because the yield curves are pre-smoothed, does not introduce any notable
difficulties with time-varying observation points. We show an example of the
multi-economy yield curves observed at adjacent times on July 29, 2011 and
August 5, 2011, as well as the corresponding one-week change in Figure 2.1.
The literature on yield curve modeling is extensive. Yield curve mod-
els commonly adopt the Nelson-Siegel parameterization (Nelson and Siegel,
1987), often within a state space framework (e.g., Diebold and Li, 2006; Diebold
et al., 2006, 2008; Koopman et al., 2010). Many Bayesian models also use the
Nelson-Siegel or Svensson parameterizations (e.g., Laurini and Hotta, 2010;
Cruz-Marcelo et al., 2011). However, the Nelson-Siegel parameterization does
not extend to other applications, and often requires solving computationally
intensive nonlinear optimization problems. Alternatively, Chib and Ergashev
(2009) develop an arbitrage-free affine term structure model, which is similarly
[Figure 2.1: Multi-economy yield curves (Fed, BOE, ECB, BOC) from July 29, 2011 (solid) and August 5, 2011 (dashed), together with the corresponding one-week change curves. Left panel: "Multi-Economy Yield Curves on 2011-07-29 and 2011-08-05", Yield (%) against Maturity (months). Right panel: "Change in Multi-Economy Yield Curves", Change in Yield (%) against Maturity (months).]
cast in a Bayesian state space framework. More similar to our approach are the
Functional Dynamic Factor Model (FDFM) of Hays et al. (2012) and the Smooth
Dynamic Factor Model (SDFM) of Jungbacker et al. (2013), both of which fea-
ture nonparametric functional components within a state space framework. The
FDFM cleverly uses an EM algorithm to jointly estimate the functional and time
series components of the model. However, the EM algorithm makes more so-
phisticated (multivariate) time series models more challenging to implement,
and introduces some difficulties with generalized cross-validation (GCV) for
estimation of the nonparametric smoothing parameters. The SDFM avoids
GCV and instead relies on hypothesis tests to select the number and location
of knots—and therefore determine the smoothness of the curves. However, this
suggests that the smoothness of the curves depends on the significance levels
used for the hypothesis tests, of which there can be a substantial number as
$m_t^{(c)}$, C, or T grow large. By comparison, our smoothing parameters naturally
depend on the data through the posterior distribution, which notably does not
create any difficulties for inference.
The multi-economy yield curves application is a natural setting for the common
FLCs model of Section 2.3.4. First, since $f_k^{(c)} = f_k$ for c = 1, . . . , C, the
functional component of the MFDLM is the same for all economies, which helps
reconcile the aforementioned different central bank yield curve estimation
techniques. More specifically, the conditional expectations $\mu_t^{(c)}(\tau) \equiv \sum_{k=1}^{K} \beta_{k,t}^{(c)} f_k(\tau)$
are linear combinations of the same {f1, . . . , fK}, and therefore are more directly
comparable for c = 1, . . . , C. Second, the common FLCs model is very useful
when the set of observed maturities $\mathcal{T}_t^{(c)}$ varies with either outcome c or time
t. Since the fk are estimated using all of the observed maturities $\cup_{t,c}\mathcal{T}_t^{(c)}$, we
notably do not need a missing data model for unobserved maturities at time t
for economy c. In addition, for any $\tau \in \mathrm{int}\,\mathrm{range}\big(\cup_{t,c}\mathcal{T}_t^{(c)}\big)$, we may estimate
fk(τ) and $\mu_t^{(c)}(\tau)$ without any spline-related boundary problems, even when
$\tau \notin \mathrm{range}\big(\mathcal{T}_t^{(c)}\big)$. By comparison, non-common FLCs (or more generally, any
linear combination of outcome-specific natural cubic splines) would impose a
linear fit for $\tau \notin \mathrm{range}\big(\mathcal{T}_t^{(c)}\big)$, which may not be reasonable for some
applications.
The Common Trend Model
To investigate the similarities and relationships among the C = 4 economy yield
curves, we implement the following parsimonious model for multivariate
dependence among the factors:
$$
\begin{aligned}
\beta_{k,t}^{(1)} &= \omega_{k,t}^{(1)}, \\
\beta_{k,t}^{(c)} &= \gamma_k^{(c)}\beta_{k,t}^{(1)} + \omega_{k,t}^{(c)}, \qquad c = 2, \ldots, C,
\end{aligned}
\tag{2.8}
$$
where $\gamma_k^{(c)} \in \mathbb{R}$ is the economy-specific slope term for each factor with the
diffuse conjugate prior $\gamma_k^{(c)} \overset{iid}{\sim} N(0, 10^8)$. For the errors $\omega_{k,t}^{(c)}$, we use independent
AR(r) models with time-dependent variances, which we discuss in more detail
in Section 2.4.1. We also implement an interesting extension of (2.8) based
on the autoregressive regime switching models of Albert and Chib (1993) and
McCulloch and Tsay (1993) using the model $\beta_{k,t}^{(c)} = s_{k,t}^{(c)}\big(\gamma_k^{(c)}\beta_{k,t}^{(1)}\big) + \omega_{k,t}^{(c)}$, where
$\{s_{k,t}^{(c)} : t = 1, \ldots, T\}$ is a discrete Markov chain with states {0, 1}. While this
more complex model is not supported by DIC, it is a useful example of the flex-
ibility of the MFDLM; we provide the details in Appendix A.
Letting c = 1 correspond to the Fed yield curve, we can use (2.8) to investigate
how the factors $\beta_{k,t}^{(c)}$ for each economy c > 1 are directly related to those
of the Fed, $\beta_{k,t}^{(1)}$. Since the U.S. economy is commonly regarded as a dominant
presence in the global economy (e.g., Dées and Saint-Guilhem, 2011), the Fed
yield curve is a natural and interesting reference point. Model (2.8) relates each
economy c > 1 to the Fed using a regression framework, in which we regress
$\beta_{k,t}^{(c)}$ on $\beta_{k,t}^{(1)}$ with AR(r) errors; since the yield curves were differenced, there is
no need (or evidence) for an intercept. The slope parameters $\gamma_k^{(c)}$ measure the
strength of this relationship for each factor k and economy c. In addition, we
can investigate the residuals $\omega_{k,t}^{(c)}$ to determine times t for which $\beta_{k,t}^{(c)}$ deviated
substantially from the linear dependence on $\beta_{k,t}^{(1)}$ assumed in model (2.8). Such
periods of uncorrelatedness can offer insight into the interactions between the
U.S. and other economies.
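As a minimal sketch of the regression structure in (2.8), the following simulates one reference factor and one dependent factor with AR(1) errors of constant variance, then recovers the slope by no-intercept least squares. The function names, the settings (T = 5000, slope 0.6, AR coefficient 0.3), and the least-squares estimator are assumptions made for illustration; they do not reproduce the Gibbs sampler or the stochastic volatility component used in this chapter.

```python
import random

def simulate_ar1(T, psi, rng):
    """Return x_1..x_T with x_t = psi * x_{t-1} + z_t, z_t ~ N(0, 1)."""
    x = [rng.gauss(0.0, 1.0)]
    for _ in range(1, T):
        x.append(psi * x[-1] + rng.gauss(0.0, 1.0))
    return x

def estimate_slope(beta_ref, beta_c):
    """No-intercept least-squares slope from regressing beta_c on beta_ref."""
    num = sum(b * y for b, y in zip(beta_ref, beta_c))
    den = sum(b * b for b in beta_ref)
    return num / den

# Hypothetical settings for one factor k and one economy c > 1.
rng = random.Random(0)
T, gamma_true, psi = 5000, 0.6, 0.3
beta_ref = simulate_ar1(T, psi, rng)   # reference factor: beta^(1) = omega^(1)
omega_c = simulate_ar1(T, psi, rng)    # AR(1) error omega^(c)
beta_c = [gamma_true * b + w for b, w in zip(beta_ref, omega_c)]

gamma_hat = estimate_slope(beta_ref, beta_c)
# Residuals estimate omega^(c); large values flag departures from (2.8).
residuals = [y - gamma_hat * b for b, y in zip(beta_ref, beta_c)]
print(round(gamma_hat, 3))
```

The residual series plays the role of the errors in (2.8): time points with unusually large residuals are candidates for the periods of uncorrelatedness discussed above.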
Stochastic Volatility Models
For the errors $\omega_{k,t}^{(c)}$ in (2.8), we use independent AR(r) models with
time-dependent variances, i.e.,
$\omega_{k,t}^{(c)} = \sum_{i=1}^{r} \psi_{k,i}^{(c)} \omega_{k,t-i}^{(c)} + \sigma_{k,(c),t} z_{k,t}^{(c)}$
with $z_{k,t}^{(c)} \overset{iid}{\sim} N(0, 1)$,
c = 1, . . . , C. The AR(r) specification accounts for the time dependence of the
yield curves, while the $\sigma_{k,(c),t}^2$ model the observed volatility clustering. This
latter component is important: in applications of financial time series, it is very
common—and often necessary for proper inference—to include a model for the
volatility (e.g., Taylor, 1994; Harvey et al., 1994). It is reasonable to suppose
that applications of financial functional time series may also require volatility
modeling; the weekly yield curve data provide one such example. Notably,
our hierarchical Bayesian approach seamlessly incorporates volatility model-
ing, since, conditional on the volatilities, DLM algorithms require no additional
adjustments for posterior sampling.
Within the Bayesian framework of the MFDLM, it is most natural to use a
stochastic volatility model (e.g., Kim et al., 1998; Chib et al., 2002). Stochastic
volatility models are parsimonious, which is important in hierarchical modeling,
yet are highly competitive with more heavily parameterized GARCH models
(Daníelsson, 1998). We model the log-volatility, $\log(\sigma_{k,(c),t}^2)$, as a stationary
AR(1) process (for fixed c and k), using the priors and the efficient MCMC sampler
of Kastner and Frühwirth-Schnatter (2014). We provide a plot of the volatilities
$\sigma_{k,(c),t}^2$ and additional model details in Appendix A.
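To make the error model concrete, the sketch below simulates an AR(1) error series whose log-variance follows a stationary Gaussian AR(1) process. All names and parameter values (`psi`, `mu`, `phi`, `sigma_eta`) are hypothetical choices for illustration, and the code only simulates from the model; it is not the sampler of Kastner and Frühwirth-Schnatter (2014).

```python
import math
import random

def simulate_ar1_sv(T, psi, mu, phi, sigma_eta, rng):
    """Simulate omega_t = psi * omega_{t-1} + sigma_t * z_t, where
    h_t = log(sigma_t^2) follows h_t = mu + phi * (h_{t-1} - mu) + eta_t."""
    # Start h from its stationary distribution N(mu, sigma_eta^2 / (1 - phi^2)).
    h = [mu + sigma_eta / math.sqrt(1.0 - phi ** 2) * rng.gauss(0.0, 1.0)]
    omega = [math.exp(h[0] / 2.0) * rng.gauss(0.0, 1.0)]
    for _ in range(1, T):
        h.append(mu + phi * (h[-1] - mu) + sigma_eta * rng.gauss(0.0, 1.0))
        omega.append(psi * omega[-1] + math.exp(h[-1] / 2.0) * rng.gauss(0.0, 1.0))
    return omega, h

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation."""
    m = sum(x) / len(x)
    num = sum((a - m) * (b - m) for a, b in zip(x[:-1], x[1:]))
    den = sum((a - m) ** 2 for a in x)
    return num / den

rng = random.Random(1)
omega, h = simulate_ar1_sv(T=2000, psi=0.2, mu=-1.0, phi=0.95, sigma_eta=0.3, rng=rng)
h_persistence = lag1_autocorr(h)   # close to phi when the log-volatility is persistent
print(round(h_persistence, 2))
```

A persistent log-volatility (phi near one) produces runs of high and low variance in the simulated errors, which is the volatility clustering that the time-dependent variances are meant to capture.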
Results
We fit model (2.8) to the multi-economy yield curve data, using the Kastner
and Frühwirth-Schnatter (2014) model for the volatilities and setting r = 1,
which adequately models the time dependence of the factors, with the diffuse
stationarity prior $\psi_{k,1}^{(c)} \overset{iid}{\sim} N(0, 10^8)$ truncated to (−1, 1). We use the
common FLCs model of Section 2.3.4, and let
$E_t = \mathrm{diag}(\sigma_{(1)}^2, \ldots, \sigma_{(C)}^2)$ with
$\sigma_{(c)}^{-2} \overset{iid}{\sim} \mathrm{Gamma}(0.001, 0.001)$. We prefer the choice K = 4, which
corresponds to the number of curves in the Svensson model. However, since the observations
$Y_t^{(c)}$ and the conditional expectations $\mu_t^{(c)}(\tau)$ are both smooth by construction, the
errors $\epsilon_t^{(c)}$ are also smooth, and therefore correlated with respect to τ. To
mitigate the effects of the error correlation, we increase the number of factors to
K = 6, so that the fitted model (2.2) explains more than 99.5% of the variability
in $Y_t^{(c)}(\tau)$. Since we are primarily interested in the first four factors, we fix
$\gamma_k^{(c)} = 0$ for k > 4 in model (2.8), so the two additional factors for each outcome
are modeled as independent AR(1) processes with stochastic volatility. We ran
the MCMC sampler for 7,000 iterations and discarded the first 2,000 iterations
as a burn-in. The MCMC sampler is efficient, especially for the factors $\beta_{k,t}^{(c)}$ and
the common FLCs $f_k$; we provide the MCMC diagnostics in Appendix A.
In Figure 2.2, we plot the posterior means of the common FLCs fk for
k = 1, . . . , 4. We can interpret these fk as estimates of the time-invariant un-
derlying functional structure of the yield curves shared by the Fed, the BOE,
the ECB, and the BOC. The FLCs are very smooth, and the dominant hump-like
features occur at different maturities—following from the orthonormality con-
straints—which allows the model to fit a variety of yield curve shapes. Interest-
ingly, the estimated f1, f2, and f3 are similar to the level, slope, and curvature
functions of the Nelson-Siegel parameterization described by Diebold and Li
(2006). Since the factors $\beta_{k,t}^{(c)}$ serve as weights on the FLCs $f_k$ in (2.2), we may
interpret the factors $\beta_{k,t}^{(c)}$, and therefore the slopes $\gamma_k^{(c)}$, based on these
features of the yield curve explained by the corresponding $f_k$.
[Figure 2.2 (plot omitted). Title: Common Factor Loading Curves, with one curve for each
of k = 1, 2, 3, 4; x-axis: Maturity (months); y-axis: Yield (%).]
Figure 2.2: Posterior means of the common FLCs, $\{f_1, f_2, f_3, f_4\}$, as a function of
maturity, τ.
In Table 2.1, we compute posterior means and 95% highest posterior density
(HPD) intervals for $\gamma_k^{(c)}$, which measures the strength of the linear relationship
between $\beta_{k,t}^{(c)}$ and $\beta_{k,t}^{(1)}$. For the level and slope factors k = 1, 2, the ECB
factors are substantially less correlated with the Fed factors than are the BOE and BOC
factors. For k = 4, the BOE, ECB, and BOC factors are nearly uncorrelated with the Fed
factors.
Economy   k = 1               k = 2               k = 3               k = 4
BOE       0.62 (0.57, 0.67)   0.72 (0.56, 0.89)   0.37 (0.27, 0.46)   0.03 (-0.03, 0.09)
ECB       0.39 (0.34, 0.45)   0.27 (0.11, 0.42)   0.44 (0.35, 0.52)   0.07 (0.00, 0.15)
BOC       0.61 (0.57, 0.65)   0.56 (0.47, 0.65)   0.49 (0.41, 0.58)   0.16 (0.08, 0.25)
Table 2.1: Posterior means and 95% HPD intervals for $\gamma_k^{(c)}$, which measures the
strength of the linear relationship between $\beta_{k,t}^{(c)}$ and $\beta_{k,t}^{(1)}$.
Finally, we analyze the conditional standardized residuals from model (2.8),
$r_{k,(c),t} = (\omega_{k,t}^{(c)} - \psi_{k,1}^{(c)} \omega_{k,t-1}^{(c)}) / \sigma_{k,(c),t} \overset{iid}{\sim} N(0, 1)$,
to determine periods of time t
for which (2.8) is inadequate, which can indicate deviations from the assumed
linear relationship between the Fed factors and the other economy factors. By
computing the MCMC sample proportion of $r_{k,(c),t}^2 \sim \chi_1^2$ that exceed a critical
value of the $\chi^2$-distribution, e.g., the 95th percentile $\chi_{1,0.05}^2 \approx 3.84$, we can
obtain a simple estimate of the probability that $r_{k,(c),t}^2$ exceeds the critical value
and, by that measure, is likely an outlier. We can compute a similar quantity for
$\sum_{k=1}^{4} r_{k,(c),t}^2 \sim \chi_4^2$, which aggregates across factors k = 1, . . . , 4. In
Figure 2.3, we
plot these MCMC sample proportions, restricted to the U.S. recession of December
2007 to June 2009. Around November 2008, there were outliers for all three
economies for k = 2, 3, 4 and the aggregate, which suggests that the U.S. interest
rate market may have behaved differently from the other economies during this
time period. We are currently investigating an extension of model (2.8) to
incorporate several important financial predictors as covariates, with a particular
focus on the weeks during the recession.
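The outlier summary described above can be sketched as follows, assuming the standardized residual draws are stored as a nested list indexed by MCMC draw, factor, and time. The function names, the synthetic data, and the planted outlier at time index 3 are assumptions for illustration. To stay within the standard library, the sketch compares chi-squared tail probabilities with 0.05 rather than comparing statistics with tabulated critical values; the two checks are equivalent.

```python
import math
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def chi2_sf(x, df):
    """P(chi2_df > x), using closed forms for df = 1 and for even df."""
    if df == 1:
        return 2.0 * (1.0 - _STD_NORMAL.cdf(math.sqrt(x)))
    if df % 2 == 0:
        half = x / 2.0
        return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))
    raise ValueError("this sketch handles only df = 1 or even df")

def exceedance_proportions(r_draws, level=0.95):
    """r_draws[s][k][t] is draw s of the residual for factor k at time t.
    Returns per_factor[k][t] and aggregate[t] sample proportions."""
    S, K, T = len(r_draws), len(r_draws[0]), len(r_draws[0][0])
    alpha = 1.0 - level
    per_factor = [[sum(chi2_sf(r_draws[s][k][t] ** 2, 1) < alpha for s in range(S)) / S
                   for t in range(T)] for k in range(K)]
    aggregate = [sum(chi2_sf(sum(r_draws[s][k][t] ** 2 for k in range(K)), K) < alpha
                     for s in range(S)) / S for t in range(T)]
    return per_factor, aggregate

# Synthetic draws: N(0, 1) residuals, inflated at time index 3 to mimic an outlier.
rng = random.Random(2)
S, K, T = 2000, 4, 6
r_draws = [[[rng.gauss(0.0, 1.0) * (4.0 if t == 3 else 1.0) for t in range(T)]
            for _ in range(K)] for _ in range(S)]
per_factor, aggregate = exceedance_proportions(r_draws)
print([round(p, 2) for p in aggregate])
```

Under the assumed model the proportions should be near 0.05 at ordinary time points, so values well above that level mark the time points at which (2.8) fits poorly.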
Figure 2.3: The MCMC sample proportions of $r_{k,(c),t}^2$ and $\sum_{k=1}^{4} r_{k,(c),t}^2$ that
exceed the 95th percentile of the assumed $\chi^2$-distributions.
2.4.2 Multivariate Time-Frequency Analysis for Local Field Potential
Local field potential (LFP) data were collected on rats to study the neural
activity involved in feature binding, which describes how the brain amalgamates
distinct sensory information into a single neural representation (Botly and De
Rosa, 2009; Ljubojevic et al., 2013). LFP uses pairs of electrodes implanted
directly in local brain regions of interest to record the neural activity over time;
in this case, the brain regions of interest are the prefrontal cortex (PFC) and
the posterior parietal cortex (PPC). The rats were given two sets of tasks: one
that required the rats to synthesize multiple stimuli in order to receive a reward
(called feature conjunction, or FC), and one that only required the rats to process
a single stimulus in order to receive a reward (called feature singleton, or FS). FC
involves feature binding, while FS may serve as a baseline. The tasks were
repeated in 20 trials each for FS and FC, during which electrodes implanted in the
PFC and the PPC recorded the neural activity. Therefore, the raw data signal is
a bivariate time series with 40 replications for each rat; we show an example of
the bivariate signals for one such replication in Figure 2.4a. Each signal replicate
is 3 seconds long, and has been centered around the behavior-based laboratory
estimate of the time at which the rat processed the stimuli, which we denote
by $t^*$.
[Figure 2.4 (plots omitted). Panel (b) contains four spectrogram plots: Log-Spectrum, PFC;
Log-Cross-Spectrum; Log-Spectrum, PPC; Squared Coherence. x-axis: Time Bin; y-axis:
Frequency (Hz).]
(a) The bivariate LFP signal. (b) The associated (log-) spectra and squared coherence.
Figure 2.4: The raw LFP data from a rat during an FS trial. The vertical lines
indicate the approximate time at which the rat processed the stimuli, $t^*$.
Our interest is in the time-dependent behavior of these bivariate signals and
the interaction between them. A natural approach is to use time-frequency anal-
ysis; however, exact inference for standard time-frequency procedures is not
available. An appealing alternative is to use time-frequency methods to trans-
form the bivariate signal into a MFTS, which makes available the multivariate
modeling and inference of the MFDLM.
Since the MFDLM provides smoothing in both the frequency domain $\mathcal{T}$ and
the time domain T, we may use time-frequency preprocessing that provides
minimal smoothing. For the time domain, we segment the signal into time bins
of width one-eighth the length of the original signal, with a 50% overlap
between neighboring bins to reduce undesirable boundary effects. Within each
time bin, we compute the periodograms and cross-periodogram of the bivariate
signal. Let $q_t^{(1)}(\tau)$ and $q_t^{(2)}(\tau)$ be the discrete Fourier transforms of the PFC and
PPC signals, respectively, for time bin t evaluated at frequency τ, after removing
linear trends. The periodograms are $I_t^{(c)}(\tau) = |q_t^{(c)}|^2$ for c = 1, 2 and the
cross-periodogram is $I_t^{(3)}(\tau) = q_t^{(1)} \bar{q}_t^{(2)}$, where $\bar{q}_t^{(2)}$ is the complex
conjugate of $q_t^{(2)}$.
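The preprocessing steps above can be sketched as follows. The signal length, the synthetic sinusoid-plus-noise signals, the function names, and the naive discrete Fourier transform are assumptions made to keep the example self-contained; a fast Fourier transform from a numerical library would normally replace `dft`.

```python
import cmath
import math
import random

def overlapping_bins(n, n_widths=8, overlap=0.5):
    """(start, stop) index pairs for bins of width n // n_widths with the given overlap."""
    width = n // n_widths
    step = max(1, int(round(width * (1.0 - overlap))))
    return [(s, s + width) for s in range(0, n - width + 1, step)]

def detrend(x):
    """Remove the least-squares linear trend from x."""
    n = len(x)
    t_bar, x_bar = (n - 1) / 2.0, sum(x) / n
    slope = (sum((t - t_bar) * (v - x_bar) for t, v in enumerate(x))
             / sum((t - t_bar) ** 2 for t in range(n)))
    return [v - x_bar - slope * (t - t_bar) for t, v in enumerate(x)]

def dft(x):
    """Naive O(n^2) discrete Fourier transform (adequate for short bins)."""
    n = len(x)
    return [sum(v * cmath.exp(-2j * math.pi * j * t / n) for t, v in enumerate(x))
            for j in range(n)]

def bin_periodograms(x1, x2):
    """Periodograms I1, I2 and cross-periodogram I3 for one time bin."""
    q1, q2 = dft(detrend(x1)), dft(detrend(x2))
    I1 = [abs(a) ** 2 for a in q1]
    I2 = [abs(b) ** 2 for b in q2]
    I3 = [a * b.conjugate() for a, b in zip(q1, q2)]
    return I1, I2, I3

# Synthetic stand-ins for the PFC and PPC signals (hypothetical data).
rng = random.Random(3)
n = 240
pfc = [math.sin(2 * math.pi * 5 * t / n) + rng.gauss(0.0, 0.5) for t in range(n)]
ppc = [math.sin(2 * math.pi * 5 * t / n + 0.4) + rng.gauss(0.0, 0.5) for t in range(n)]
bins = overlapping_bins(n)
per_bin = [bin_periodograms(pfc[a:b], ppc[a:b]) for a, b in bins]
print(len(bins))
```

With bins of width n/8 and a 50% overlap, the bin starts advance by n/16, so a signal of this length yields 2(8) − 1 = 15 time bins.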
The cross-periodogram is generally complex-valued, and if the periodograms
are unsmoothed, then $|I_t^{(3)}(\tau)|^2 = I_t^{(1)}(\tau) I_t^{(2)}(\tau)$ is real-valued but clearly
fails to provide new information (Bloomfield, 2004). This does not imply that the
cross-periodogram is uninformative, but rather that some frequency domain
smoothing of the periodograms is necessary.
Following Shumway and Stoffer (2000), we use a modified Daniell kernel
to obtain the smoothed periodograms, or spectra. We subdivide each time bin
into five segments, compute $I_t^{(c)}(\tau)$, c = 1, 2, 3 within each segment, and then
average the resulting periodograms using decreasing weights determined by
the modified Daniell kernel. Denoting these spectra by $\tilde{I}_t^{(c)}(\tau)$, we let
$Y_t^{(c)}(\tau) = \log(\tilde{I}_t^{(c)}(\tau))$ for c = 1, 2, where the log-transformation is appealing
because it is the variance-stabilizing transformation for the periodogram (Shumway and
Stoffer, 2000). To account for the periodic dependence between signals, one
choice is the log-cross-spectrum, $\log(|\tilde{I}_t^{(3)}(\tau)|^2)$. An appealing alternative is
the squared coherence defined by
$\kappa_t^2(\tau) \equiv |\tilde{I}_t^{(3)}(\tau)|^2 / (\tilde{I}_t^{(1)}(\tau) \tilde{I}_t^{(2)}(\tau))$, which
satisfies the constraints $0 \le \kappa_t^2(\tau) \le 1$ and is the frequency domain analog to the
squared correlation (Bloomfield, 2004). Since (2.1) specifies that $Y_t^{(c)}(\tau) \in \mathbb{R}$,
we transform the squared coherence and let $Y_t^{(3)}(\tau) = \Phi^{-1}(\kappa_t^2(\tau)) \in \mathbb{R}$, where
$\Phi^{-1} : [0, 1] \to \mathbb{R}$ is a known monotone function; we use the Gaussian quantile
function. We have found that fitting $Y_t^{(3)}(\tau)$ produces very similar results to
fitting $\kappa_t^2(\tau)$ directly, yet in the transformed case, our estimate of the squared
coherence $\Phi(\mu_t^{(3)}(\tau))$ obeys the constraints. Because of our Bayesian approach,
this transformation does not inhibit inference.
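The role of smoothing can be checked numerically with a short sketch. It assumes the segment periodograms are averaged with the modified Daniell weights (1/8, 1/4, 1/4, 1/4, 1/8), which is one reading of the procedure above; the synthetic signals, the function names, and the small clamp inside `probit` are further assumptions for illustration.

```python
import cmath
import math
import random
from statistics import NormalDist

def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(v * cmath.exp(-2j * math.pi * j * t / n) for t, v in enumerate(x))
            for j in range(n)]

def squared_coherence(segs1, segs2, weights):
    """Weighted average of per-segment periodograms, then kappa^2 per frequency."""
    m = len(segs1[0])
    I1, I2, I3 = [0.0] * m, [0.0] * m, [0j] * m
    for w, s1, s2 in zip(weights, segs1, segs2):
        q1, q2 = dft(s1), dft(s2)
        for j in range(m):
            I1[j] += w * abs(q1[j]) ** 2
            I2[j] += w * abs(q2[j]) ** 2
            I3[j] += w * q1[j] * q2[j].conjugate()
    return [abs(c) ** 2 / (a * b) for a, b, c in zip(I1, I2, I3)]

def probit(kappa2, eps=1e-12):
    """Gaussian quantile transform; the eps clamp is a numerical safeguard (assumption)."""
    return NormalDist().inv_cdf(min(max(kappa2, eps), 1.0 - eps))

# Two synthetic signals sharing a common component (hypothetical data).
rng = random.Random(4)
n, n_seg = 100, 5
common = [rng.gauss(0.0, 1.0) for _ in range(n)]
x1 = [c + rng.gauss(0.0, 1.0) for c in common]
x2 = [c + rng.gauss(0.0, 1.0) for c in common]
size = n // n_seg
segs1 = [x1[i * size:(i + 1) * size] for i in range(n_seg)]
segs2 = [x2[i * size:(i + 1) * size] for i in range(n_seg)]

raw = squared_coherence([x1], [x2], [1.0])        # unsmoothed: one segment, weight 1
daniell = [1 / 8, 1 / 4, 1 / 4, 1 / 4, 1 / 8]     # modified Daniell weights (assumed)
smooth = squared_coherence(segs1, segs2, daniell)
y3 = [probit(k) for k in smooth]                  # real-valued response for c = 3
back = [NormalDist().cdf(v) for v in y3]          # back-transform stays in [0, 1]
print(round(min(raw), 6), round(max(raw), 6))
```

Without smoothing, the ratio is identically one at every frequency, which is the sense in which the unsmoothed cross-periodogram provides no new information. After averaging, the Cauchy-Schwarz inequality keeps the ratio in [0, 1], and the back-transformed values respect the same constraint.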
More generally, this procedure is applicable to $\ell$-dimensional time series,
which, including either the squared coherence or the cross-spectra, yields a
$C = \ell(\ell + 1)/2$-dimensional MFTS. We show an example of the resulting MFTS from
a rat during an FS trial in Figure 2.4b. For completeness, we include the
log-cross-spectrum, which is not a component of the MFTS.
MFDLM Specification
We use the common FLCs model of Section 2.3.4 accompanied by a random
walk model for the factors:
    $Y_{i,s,t}^{(c)}(\tau) = \sum_{k=1}^{K} \beta_{k,i,s,t}^{(c)} f_k(\tau) + \epsilon_{i,s,t}^{(c)}(\tau), \quad [\epsilon_{i,s,t}^{(c)}(\tau) \mid \sigma_{(c)}^2] \overset{indep}{\sim} N(0, \sigma_{(c)}^2)$
    $\beta_{k,i,s,t} = \beta_{k,i,s,t-1} + \omega_{k,i,s,t}, \quad [\omega_{k,i,s,t} \mid W_k] \overset{indep}{\sim} N(0, W_k)$    (2.9)
where $\beta_{k,i,s,t} = (\beta_{k,i,s,t}^{(1)}, \ldots, \beta_{k,i,s,t}^{(C)})'$, $Y_{i,s,t}^{(c)}$ are the
log-spectra for c = 1, 2 and the
probit-transformed squared coherences for c = 3, i = 1, . . . , 8 index the rats,
s = 1, . . . , 40 index the trials for each rat, and t = 1, . . . , 15 index the time bins
for each trial. The joint indices (i, s, t) in (2.9) correspond to the time index t in
(2.1), and are used to specify independence of the residuals $\omega_{k,i,s,t}$ between rats
and between trials. For each initial time bin t = 1, we let
$\beta_{k,i,s,1} \sim N(0, 10^4 I_{C \times C})$,
since the corresponding observations are only time-ordered within a trial. The
C × C factor covariance matrices $W_k$ do not depend on the rat or the trial, and
can help summarize the overall dependence among factors. For simplicity and
parsimonious modeling, (2.9) assumes independence between $\omega_{k,i,s,t}$ and $\omega_{j,i,s,t}$
for $j \ne k \in \{1, \ldots, K\}$, but allows for correlation between outcomes for fixed
k. The $W_k$ control the amount of time domain smoothing for the factors and
therefore for $\mu_{i,s,t}^{(c)}(\tau) \equiv \sum_{k=1}^{K} \beta_{k,i,s,t}^{(c)} f_k(\tau)$. For the error variances, we
use the conjugate priors $\sigma_{(c)}^{-2} \overset{iid}{\sim} \mathrm{Gamma}(0.001, 0.001)$ and
$W_k^{-1} \overset{iid}{\sim} \mathrm{Wishart}((\rho R)^{-1}, \rho)$,
with $R^{-1} = I_{C \times C}$, the expected prior precision, and $\rho = C \ge \mathrm{rank}(R^{-1})$. We
provide the full conditional posterior distributions in Appendix A.
To determine the effects of feature binding, we compare the values of
$\mu_{i,s,t}^{(c)}(\tau)$ between the FS and FC trials. Letting $S_{i,FC}$ (respectively, $S_{i,FS}$) be
the subset of FC (respectively, FS) trials for which rat i received the reward,
we estimate posterior distributions for the sample means
    $\bar{\mu}_t^{(c)}(\tau) \equiv \frac{1}{8} \sum_{i=1}^{8} \left[ \frac{1}{|S_{i,FC}|} \sum_{s \in S_{i,FC}} \mu_{i,s,t}^{(c)}(\tau) - \frac{1}{|S_{i,FS}|} \sum_{s' \in S_{i,FS}} \mu_{i,s',t}^{(c)}(\tau) \right]$ for c = 1, 2 and
    $\bar{\mu}_t^{(3)}(\tau) \equiv \frac{1}{8} \sum_{i=1}^{8} \left[ \frac{1}{|S_{i,FC}|} \sum_{s \in S_{i,FC}} \Phi(\mu_{i,s,t}^{(3)}(\tau)) - \frac{1}{|S_{i,FS}|} \sum_{s' \in S_{i,FS}} \Phi(\mu_{i,s',t}^{(3)}(\tau)) \right]$.
Therefore, we examine the difference in the log-spectra and the squared
coherences between the FC and the FS trials, which we average over all rats and over
all trials for which the rat responded correctly to the stimuli. This restriction is
important, since it filters out unrepresentative trials, in particular FC trials for
which feature binding may not have occurred.
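For a single posterior draw, time bin, and frequency, the contrast above reduces to a difference of trial averages within each rat, followed by an average over rats. The sketch below assumes a small nested-list layout (`mu[i][s]` for rat i and trial s) and made-up values for two rats; applying the function to every posterior draw would give draws of the contrast.

```python
from statistics import NormalDist

def mean(values):
    """Arithmetic mean of an iterable."""
    values = list(values)
    return sum(values) / len(values)

def fc_minus_fs(mu, fc_trials, fs_trials, transform=lambda v: v):
    """Average over rats of (mean over rewarded FC trials) - (mean over rewarded FS trials).
    mu[i][s] is the value for rat i and trial s; transform is applied before averaging."""
    per_rat = [
        mean(transform(mu[i][s]) for s in fc_trials[i])
        - mean(transform(mu[i][s]) for s in fs_trials[i])
        for i in range(len(mu))
    ]
    return mean(per_rat)

# Hypothetical values for two rats with four trials each.
mu = [[1.0, 3.0, 0.0, 2.0], [4.0, 1.0, 1.0, 0.0]]
fc_trials = [[0, 1], [0]]        # rewarded FC trial indices for each rat
fs_trials = [[2, 3], [1, 2]]     # rewarded FS trial indices for each rat

log_spectrum_contrast = fc_minus_fs(mu, fc_trials, fs_trials)                 # c = 1, 2
coherence_contrast = fc_minus_fs(mu, fc_trials, fs_trials, NormalDist().cdf)  # c = 3
print(log_spectrum_contrast)
```

Passing the Gaussian distribution function as `transform` maps the probit-scale values back to squared coherences before averaging, as in the definition for c = 3.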
Results
Since we observe functions in 15 time bins for 40 trials for 8 rats, the time
dimension of our 3-dimensional MFTS is T = (15)(40)(8) = 4800. We restrict the
frequencies to $\mathcal{T} = [0.1, 80]$ Hz, which is the range of interest for this application
and yields $m_t^{(c)} = 30$ for all c, t. Guided by DIC, we select K = 10. Alternatively,
we could use a smaller value of K by increasing the initial smoothing of the
log-spectra and the squared coherences, but would risk smoothing over important
features. We ran the MCMC sampler for 7,000 iterations and discarded the first
2,000 iterations as a burn-in; see Appendix A for the MCMC diagnostics.
We compute 95% pointwise HPD intervals and posterior means for $\bar{\mu}_t^{(c)}(\tau)$,
c = 1, 2, 3 and display the results as spectrogram plots; the plots for c = 1, 2 are
in Appendix A, while c = 3 is in Figure 2.5. Regions of red or orange in the
lower 95% HPD interval plots indicate a significant positive difference between
the FC and FS trials, while regions of blue in the upper 95% HPD interval plots
indicate a significant negative difference. We are particularly interested in the
time bins around $t^*$, which indicates the approximate time at which the stimuli
were processed, and frequencies up to 40-50 Hz.
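One common way to compute an HPD interval from posterior draws of a scalar quantity is to take the shortest interval that contains the desired fraction of the sorted draws. The sketch below illustrates that construction on skewed synthetic draws; it is an illustration of the general idea under a unimodality assumption, not necessarily the routine used to produce the intervals reported here.

```python
import math
import random

def hpd_interval(draws, prob=0.95):
    """Shortest interval containing at least a fraction prob of the draws."""
    x = sorted(draws)
    n = len(x)
    m = min(n, max(1, math.ceil(prob * n - 1e-9)))   # number of draws inside
    widths = [x[i + m - 1] - x[i] for i in range(n - m + 1)]
    i_best = widths.index(min(widths))
    return x[i_best], x[i_best + m - 1]

# Skewed synthetic draws as a stand-in for MCMC output (hypothetical data).
rng = random.Random(5)
draws = [rng.expovariate(1.0) for _ in range(4000)]
lo, hi = hpd_interval(draws)

# Equal-tailed interval for comparison.
s = sorted(draws)
eq_lo, eq_hi = s[int(0.025 * len(s))], s[int(0.975 * len(s))]
print(round(hi - lo, 2), round(eq_hi - eq_lo, 2))
```

For a skewed posterior the HPD interval is shorter than the equal-tailed interval with the same coverage, because it shifts toward the region of highest density.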
The averages of the differenced log-spectra, $\bar{\mu}_t^{(1)}(\tau)$ and $\bar{\mu}_t^{(2)}(\tau)$,
describe how the distinct regions of the brain (the PFC and PPC, respectively) respond
differently to stimuli that do or do not require feature binding. By comparison, the
average of the differenced squared coherences, $\bar{\mu}_t^{(3)}(\tau)$, describes how these
regions of the brain interact with each other under the different stimuli. Based on
Figure 2.5, feature binding appears to be most strongly associated with greater
squared coherence at frequencies in the Theta range (4-8 Hz), the Alpha range
(8-13 Hz), and the Beta range (13-30 Hz) around t∗. This pattern persists in
the power of both the PFC and PPC log-spectra plots, which suggests that these
ranges of frequencies are important to the process of feature binding. Therefore,
using the inference provided by the MFDLM, we conclude that during feature
binding, the Theta, Alpha, and Beta ranges are associated with increased brain
activity in both the PFC and the PPC, as well as greater synchronization between
these regions.
[Figure 2.5 (contour plots omitted). Panels: Lower 95% HPD Interval; Posterior Mean;
Upper 95% HPD Interval. x-axis: Time Bin; y-axis: Frequency (Hz).]
Figure 2.5: Pointwise 95% HPD intervals and the posterior mean for $\bar{\mu}_t^{(3)}$, which
is the average difference in squared coherence between the FC and FS trials. The
black vertical lines indicate the event time $t^*$.
2.5 Conclusions
The MFDLM provides a general framework to model complex dependence
among functional observations. Because we separate out the functional
component through appropriate conditioning and include the necessary identifiability
constraints, we can model the remaining dependence using familiar scalar and
multivariate methods. The hierarchical Bayesian approach allows us to
incorporate interesting and useful submodels seamlessly, such as the common trend
model of Section 2.4.1, the stochastic volatility model of Section 2.4.1, and the
random walk model of Section 2.4.2. We combine Bayesian spline theory and
convex optimization to model the functional component as a set of smooth and
optimal curves subject to (identifiability) constraints. Using an efficient Gibbs
sampler, we obtain posterior samples of all of the unknown parameters in (2.1),
which allows us to perform inference on any parameters of interest, such as
$\bar{\mu}_t^{(c)}$ in the LFP example.
Our two diverse applications demonstrate the flexibility and wide applica-
bility of our model. The common trend model of Section 2.4.1 provides useful
insights into the interactions among multi-economy yield curves, and our LFP
example suggests a novel approach to time-frequency analysis via MFTS. In
these applications, the MFDLM adequately models a variety of functional
dependence structures, including time dependence, (time-varying) contemporaneous
dependence, and stochastic volatility, and may readily accommodate additional
dependence structures, such as covariates, repeated measurements, and
spatial dependence. We are currently developing an R package for our methods.
CHAPTER 3
DYNAMIC SHRINKAGE PROCESSES
3.1 Introduction
The global-local class of prior distributions is a popular and successful mecha-
nism for providing shrinkage and regularization in a broad variety of models
and applications. Global-local priors use continuous scale mixtures of Gaus-
sian distributions to produce desirable shrinkage properties, such as (approxi-
mate) sparsity or smoothness, often leading to highly competitive and computa-
tionally tractable estimation procedures. For example, in the variable selection
context, exact sparsity-inducing priors such as the spike-and-slab prior become
intractable for even a moderate number of predictors. By comparison, global-
local priors that shrink toward sparsity, such as the horseshoe prior (Carvalho
et al., 2010), produce competitive estimators with greater scalability, and are val-
idated by theoretical results, simulation studies, and a variety of applications
(Carvalho et al., 2009; Datta and Ghosh, 2013; van der Pas et al., 2014). Unlike
non-Bayesian counterparts such as the lasso (Tibshirani, 1996), shrinkage pri-
ors also provide adequate uncertainty quantification for parameters of interest
(Kyung et al., 2010; van der Pas et al., 2014).
The class of global-local scale mixtures of Gaussian distributions (e.g., Car-
valho et al., 2010; Polson and Scott, 2010, 2012a) is defined as follows:
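    $[\omega_t \mid \tau, \lambda_t] \overset{indep}{\sim} N(0, \tau^2 \lambda_t^2), \quad t = 1, \ldots, T$    (3.1a)
    $[\lambda_1^2, \ldots, \lambda_T^2] = \prod_{t=1}^{T} [\lambda_t^2 \mid \{\lambda_s^2\}_{s<t}]$    (3.1b)
    $\sim \prod_{t=1}^{T} \pi(\lambda_t^2)$    (3.1c)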
where π(·) denotes a generic prior distribution and τ > 0 is either endowed with
its own prior distribution or estimated using empirical Bayes methods. Here
(3.1c) follows from (3.1b) assuming the {λt} are a priori independent and iden-
tically distributed (iid). The iid assumption is commonly made, but as we will
argue below, it can be advantageous to forego the independence assumption. In
what follows, only (3.1a) and (3.1b) will be assumed.
The prior in (3.1a)–(3.1c) is commonly paired with the likelihood
$[y_t \mid \omega_t, \sigma^2] \overset{indep}{\sim} N(\omega_t, \sigma^2)$, but we will consider dynamic
generalizations. In (3.1a), τ > 0 controls the global shrinkage for all
$\{\omega_t\}_{t=1}^{T}$, while $\lambda_t$ tunes the
local shrinkage for a particular $\omega_t$. Such a model is particularly well-suited for
sparse data: τ determines the global level of sparsity for $\{\omega_t\}_{t=1}^{T}$, while each
$\lambda_t$ allows for large absolute deviations of $\omega_t$ from its prior mean (zero). Careful
choice of priors for $\lambda_t^2$ and $\tau^2$ provides both robustness to large signals and
adequate shrinkage of noise (e.g., Carvalho et al., 2010), so the framework of (3.1) is
widely applicable.
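As an illustration of (3.1a) with independent local scales, the sketch below draws from the horseshoe prior, in which each local scale follows a standard half-Cauchy distribution. The global scale `tau = 0.1`, the sample size, and the function name are assumptions for this example.

```python
import math
import random

def sample_global_local(T, tau, rng):
    """Draw omega_t ~ N(0, tau^2 * lambda_t^2) with lambda_t ~ half-Cauchy(0, 1)."""
    # Inverse-CDF draw from a standard Cauchy, folded to the positive half-line.
    lam = [abs(math.tan(math.pi * (rng.random() - 0.5))) for _ in range(T)]
    omega = [rng.gauss(0.0, tau * l) for l in lam]
    return omega, lam

rng = random.Random(6)
omega, lam = sample_global_local(T=5000, tau=0.1, rng=rng)

frac_small = sum(l < 1.0 for l in lam) / len(lam)    # local scales below one
frac_large = sum(l > 10.0 for l in lam) / len(lam)   # heavy right tail
print(round(frac_small, 2), round(frac_large, 2))
```

About half of the local scales fall below one, so the corresponding coefficients are held near zero by the global scale, while the heavy tail of the half-Cauchy leaves a small fraction of coefficients essentially unshrunk.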
In the dynamic setting, in which the observations yt are time-ordered and t
denotes a time index, it is natural to allow the local scale parameter, λt, to de-
pend on the history of the shrinkage process {λs}s<t. As a result, the probability
of large (or small) deviations of ωt from the prior mean (zero), as determined by
λt, is informed by the previous shrinkage behavior {λs}s<t. Such model-based
dependence may improve the ability of the model to adapt dynamically, which
is important for time series estimation, forecasting, and inference. However, the
standard global-local prior independence assumption in (3.1c) precludes depen-
dence in the shrinkage process.
We propose to model the dynamic dependence of the process {λt} in (3.1b)
via a novel scale-mixture representation of stochastic volatility (SV) models. SV
models for dynamic scale parameters are highly popular and successful, partic-
ularly in finance applications (e.g., Kim et al., 1998). In the standard SV model,
$\{\lambda_t^2\}$ is modeled as an autoregressive process of order 1, or AR(1), on the
log-scale. An important contribution of this manuscript is to extend the standard
SV model to provide (1) direct extensions of popular shrinkage priors to the dy-
namic setting and (2) a highly efficient Gibbs sampling algorithm. We develop a
log-scale representation of a broad class of global-local shrinkage priors, which
provides a natural setting for modeling dynamic dependence. The proposed
dynamic shrinkage process replaces the independence assumption (3.1c) with the
dynamic evolution model
    $h_{t+1} = \mu + \phi(h_t - \mu) + \eta_t, \quad \eta_t \overset{iid}{\sim} Z(\alpha, \beta, 0, 1)$    (3.2)
where $h_t = \log(\tau^2 \lambda_t^2)$, or equivalently $\tau^2 = \exp(\mu)$ and $\lambda_t^2 = \exp(h_t - \mu)$, and
$Z(\alpha, \beta, \mu_z, \sigma_z)$ denotes the Z-distribution with density function
    $[z] = [\sigma_z B(\alpha, \beta)]^{-1} \{\exp[(z - \mu_z)/\sigma_z]\}^{\alpha} \{1 + \exp[(z - \mu_z)/\sigma_z]\}^{-(\alpha + \beta)}, \quad z \in \mathbb{R}$    (3.3)
where $B(\cdot, \cdot)$ is the Beta function. When $\phi = 0$, model (3.2) reduces to the static
setting, and implies an inverted-Beta prior for $\lambda_t^2$ (see Section 3.2.2 for more
details). Notably, the class of priors represented in (3.2) includes the important
shrinkage distributions in Table 3.1, in each case extended to the dynamic set-
ting via an autoregression akin to the standard SV model.
Parameters               Prior                               Reference
α = β = 1/2              Horseshoe Prior                     Carvalho et al. (2010)
α = 1/2, β = 1           Strawderman-Berger Prior            Strawderman (1971); Berger (1980)
α = 1, β = c−2, c > 0    Normal-Exponential-Gamma Prior      Griffin and Brown (2005)
α = β → 0                (Improper) Normal-Jeffreys' Prior   Figueiredo (2003); Bae and Mallick (2004)
Table 3.1: Special cases of the inverted-Beta prior.
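A short simulation can help fix ideas for (3.2) and (3.3). The sketch below draws Z(α, β, 0, 1) innovations as the difference of two log-Gamma variates, which has the same distribution as the logit of a Beta(α, β) variate and therefore the density in (3.3) with µ_z = 0 and σ_z = 1. The function names, the initial value of the process, and the parameter values are assumptions for illustration, and the code only simulates from the prior; it is not the Gibbs sampler developed in this chapter.

```python
import math
import random

def z_density(z, alpha, beta, mu_z=0.0, sigma_z=1.0):
    """Density (3.3) of the Z-distribution, evaluated on the log scale for stability."""
    u = (z - mu_z) / sigma_z
    log_B = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    log1pexp = u + math.log1p(math.exp(-u)) if u > 0 else math.log1p(math.exp(u))
    return math.exp(alpha * u - (alpha + beta) * log1pexp - log_B) / sigma_z

def sample_z(alpha, beta, rng):
    """Draw from Z(alpha, beta, 0, 1) as log(G1) - log(G2), G1 ~ Gamma(alpha), G2 ~ Gamma(beta)."""
    return math.log(rng.gammavariate(alpha, 1.0)) - math.log(rng.gammavariate(beta, 1.0))

def simulate_dsp(T, mu, phi, alpha, beta, rng):
    """Simulate h_t from (3.2) and omega_t ~ N(0, exp(h_t)).
    The initial value h_1 = mu + eta_0 is an assumption of this sketch."""
    h = [mu + sample_z(alpha, beta, rng)]
    for _ in range(T - 1):
        h.append(mu + phi * (h[-1] - mu) + sample_z(alpha, beta, rng))
    omega = [rng.gauss(0.0, math.exp(0.5 * ht)) for ht in h]
    return omega, h

rng = random.Random(7)
eta = [sample_z(0.5, 0.5, rng) for _ in range(4000)]          # static case, phi = 0
frac_negative = sum(e < 0.0 for e in eta) / len(eta)
frac_tail = sum(e > 2.0 * math.log(10.0) for e in eta) / len(eta)

omega, h = simulate_dsp(T=500, mu=-2.0, phi=0.8, alpha=0.5, beta=0.5, rng=rng)
print(round(frac_negative, 2), round(frac_tail, 2))
```

With α = β = 1/2 and φ = 0, exp(η/2) follows a standard half-Cauchy distribution, so the static case recovers the horseshoe prior on the local scale; with φ > 0, large or small values of h_t persist over consecutive time points.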
Despite the apparent complexity of the model, we develop a new Gibbs sam-
pling algorithm that builds upon existing efficient sampling algorithms via a
parameter expansion of model (3.2): a stochastic volatility sampler (Kim et al.,
1998) and a Pólya-Gamma sampler (Polson et al., 2013). The resulting model is
highly flexible, easy to implement, computationally efficient, and widely appli-
cable.
For a motivating example, consider the minute-by-minute Twitter CPU us-
age data in Figure 3.1a (James et al., 2016). The data show an overall smooth
trend interrupted by irregular jumps throughout the morning and early after-
noon, with increased volatility from 16:00-18:00. It is important to identify both
abrupt changes as well as slowly-varying intraday trends. To model these
features, we combine the likelihood $y_t \overset{indep}{\sim} N(\beta_t, \sigma_t^2)$ with a standard SV
model for the observation error variance, $\sigma_t^2$, and a dynamic horseshoe process as the
prior on the second differences of the conditional mean,
$\omega_t = \Delta^2 \beta_t = \Delta\beta_t - \Delta\beta_{t-1}$, given
by (3.2) with α = β = 1/2 (see Section 3.3.2 for details). The dynamic horseshoe
process either drives ωt to zero, in which case βt is locally linear, or leaves ωt ef-
fectively unpenalized, in which case large changes in slope are permissible (see
Figure 3.1b). The resulting posterior expectation of βt and credible bands for
the posterior predictive distribution of {yt} adapt to both irregular jumps and
smooth trends (see Figure 3.1a).
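The link between zero second differences and local linearity can be seen in a small example. The function name and the piecewise linear sequence below are made up for illustration.

```python
def second_differences(beta):
    """Second differences beta[t] - 2 * beta[t-1] + beta[t-2] for t = 2, ..., len(beta) - 1."""
    return [beta[t] - 2 * beta[t - 1] + beta[t - 2] for t in range(2, len(beta))]

beta = [0, 1, 2, 3, 2, 1, 0]        # piecewise linear with one change in slope
omega = second_differences(beta)
print(omega)                        # [0, 0, -2, 0, 0]
```

The second differences are zero wherever the sequence is locally linear, and the single nonzero value marks the change in slope. A prior that shrinks most second differences to zero while leaving a few unpenalized therefore favors piecewise linear trends with occasional kinks.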
[Figure 3.1 (plots omitted). Panels: (a) (Scaled) CPU Usage; (b) 2nd Difference of (Scaled)
CPU Usage; (c) (Scaled) CPU Usage: Observation Standard Deviation; (d) (Scaled) CPU
Usage: Innovation Standard Deviation. x-axis: Time of Day.]
Figure 3.1: Bayesian trend filtering (D = 2) with dynamic horseshoe process
innovations of minute-by-minute CPU usage data. (a) Observed data $y_t$ (points), posterior
expectation (cyan) of $\beta_t$, and 95% pointwise highest posterior density (HPD) credible
intervals (light gray) and 95% simultaneous credible bands (dark gray) for the posterior
predictive distribution of $y_t$. (b) Second difference of observed data $\Delta^2 y_t$ (points),
posterior expectation of $\omega_t = \Delta^2 \beta_t$ (cyan), and 95% pointwise HPD intervals (light gray)
and simultaneous credible bands (dark gray) for the posterior predictive distribution
of $\Delta^2 y_t$. (c) Posterior expectation of time-dependent observation standard deviations,
$\sigma_t$. (d) Posterior expectation of time-dependent innovation (prior) standard deviations,
$\tau \lambda_t$.
For comparison, Figure 3.1 provides the posterior expectations of both the
observation error standard deviations, $\sigma_t$ (Figure 3.1c), and the prior standard
deviations, $\tau \lambda_t = \exp(h_t/2)$ (Figure 3.1d). The horseshoe-like shrinkage
behavior of $\lambda_t$ is evident: values of $\lambda_t$ are either near zero, corresponding to
aggressive shrinkage of $\omega_t = \Delta^2 \beta_t$ to zero, or large, corresponding to large absolute
changes in the slope of $\beta_t$. Importantly, Figure 3.1 also provides motivation for a
dynamic shrinkage process: there is clear volatility clustering of $\{\lambda_t\}$, in which the
shrinkage induced by $\lambda_t$ persists for consecutive time points. The volatility
clustering reflects, and motivates, the temporally adaptive shrinkage behavior of
the dynamic shrinkage process.
Shrinkage priors and variable selection have been used successfully for time
series modeling in a broad variety of settings. Belmonte et al. (2014) propose a
Bayesian Lasso prior for shrinkage in dynamic linear models, while Korobilis
(2013a) considers several (non-dynamic) scale mixture priors for time series
regression. In both cases, the lack of a local (dynamic) scale parameter implies
a time-invariant rate of shrinkage for each variable. Frühwirth-Schnatter and
Wagner (2010) introduce indicator variables to discern between static and dy-
namic parameters, but the model cannot shrink adaptively for local time peri-
ods. Nakajima and West (2013) provide a procedure for local thresholding of
dynamic coefficients, but the computational challenges of model implementa-
tion are significant. Chan et al. (2012) propose a class of time-varying dimension
models, but due to the computational complexity of the model, only consider
inclusion or exclusion of a variable for all times, which produces non-dynamic
variable selection and a limited set of models.
Perhaps most comparable to the proposed methodology, Kalli and Griffin
(2014) propose a class of priors which exhibit dynamic shrinkage using normal-
gamma autoregressive processes. The Kalli and Griffin (2014) prior is a dynamic
extension of the normal-gamma prior of Griffin and Brown (2010), and provides
improvements in forecasting performance relative to non-dynamic shrinkage
priors. However, the Kalli and Griffin (2014) model requires careful specifi-
cation of several hyperparameters and hyperpriors, and the computation re-
quires sophisticated adaptive MCMC techniques, which results in lengthy com-
putation times. By comparison, our proposed class of dynamic shrinkage pro-
cesses is far more general, and includes the dynamic horseshoe process as a
special case—which notably does not require tuning of sensitive hyperparame-
ters. Furthermore, our proposed MCMC sampling algorithm combines existing
samplers for large blocks of parameters, which produces a straightforward yet
efficient Gibbs sampler, with computations linear in the number of time points.
We apply dynamic shrinkage processes to develop a dynamic fundamen-
tal factor model for asset pricing. We build upon the five-factor Fama-French
model (Fama and French, 2015), which extends the three-factor Fama-French
model (Fama and French, 1993) for modeling equity returns with common risk
factors. We propose a dynamic extension which allows for time-varying fac-
tor loadings, possibly with localized or irregular features, and include a sixth
factor, momentum (Carhart, 1997). Despite the popularity of the three-factor
Fama-French model, there is not yet consensus regarding the necessity of all five
factors in Fama and French (2015) or the momentum factor. Dynamic shrinkage
processes provide a mechanism for addressing this question: within a time-
varying parameter regression model, dynamic shrinkage processes provide the
necessary flexibility to adapt to rapidly-changing features, while shrinking un-
necessary factors to zero. Our dynamic analysis shows that with the exception
of the market risk factor, no other risk factors are significant except for brief
periods.
We introduce the dynamic shrinkage process in Section 3.2 and discuss rele-
vant properties, including the Pólya-Gamma parameter expansion for efficient
computations. In Section 3.3, we apply the prior to develop a more adaptive
Bayesian trend filtering model for irregular curve-fitting, and we compare the
proposed procedure with competitive alternatives through simulations and a
CPU usage application. We propose in Section 3.4 a time-varying parameter re-
gression model with dynamic shrinkage processes for adaptive regularization
and evaluate the model using simulations and an asset pricing example. In Sec-
tion 3.5, we discuss the details of the Gibbs sampling algorithm, and conclude
in Section 3.6. Proofs and additional details are in Appendix B.
3.2 Dynamic Shrinkage Processes
The proposed dynamic shrinkage process contains three prominent features:
(1) a dynamic model for the local scale parameters, λt, via an autoregression on
the log-scale; (2) a log-scale representation of a broad class of global-local pri-
ors to propagate desirable shrinkage properties to the dynamic setting; and (3)
a Gaussian scale-mixture representation of the implied log-volatility evolution
error to provide an efficient Gibbs sampling algorithm. In this section, we pro-
vide the relevant details regarding these features, and explore the properties of
the resulting process.
3.2.1 Stochastic Volatility Models for Dynamic Scale Parameters
To extend the class of global-local scale mixtures of Gaussian distributions in
(3.1) to the dynamic setting, we propose to model the local scale parameter, λt,
using a stochastic volatility (SV) model (e.g., Kim et al., 1998). The SV model,
which is the most common approach for modeling time-dependent random
scale parameters, introduces dynamic dependence via an AR(1) model for the
log-variance (or log-volatility), as in model (3.2). Unlike model (3.2), standard
SV models typically assume $\eta_t \overset{iid}{\sim} N(0, \sigma_\eta^2)$. A distinctive feature of the SV model
is that it encourages volatility clustering, in which large (or small) variance—or
shrinkage—persists for consecutive time points.
MCMC implementations of the SV model commonly represent the likeli-
hood for ht on the log-scale and approximate the resulting distribution using a
known discrete mixture of Gaussian distributions (e.g., Kim et al., 1998). Impor-
tantly, the resulting approximation provides a framework for a fast and efficient
MCMC sampler: conditional on the mixing component, the model for $\{h_t\}_{t=1}^T$ is
a Gaussian dynamic linear model, and therefore $\{h_t\}_{t=1}^T$ may be sampled jointly
in $O(T)$ computations. We provide the relevant details in Section 3.5.
3.2.2 Log-Scale Representations of Global-Local Priors
Stochastic volatility models will not automatically exhibit desirable shrinkage
behavior: we must consider appropriate distributions for $\mu$ and $\eta_t$. To illustrate
this point, consider the standard SV model assumption for the evolution error
distribution of log-volatility, $\eta_t \overset{iid}{\sim} N(0, \sigma_\eta^2)$. For the likelihood
$y_t \sim N(\omega_t, 1)$ and the prior (3.1a), the posterior expectation of $\omega_t$ is
$E[\omega_t | \{y_s\}, \tau] = (1 - E[\kappa_t | \{y_s\}, \tau])\, y_t$, where
\[ \kappa_t \equiv \frac{1}{1 + \mathrm{Var}[\omega_t | \tau, \lambda_t]} = \frac{1}{1 + \tau^2 \lambda_t^2} \tag{3.4} \]
is the shrinkage parameter. As noted by Carvalho et al. (2010), E[κt|{ys}, τ ] is
interpretable as the amount of shrinkage toward zero a posteriori: κt ≈ 0 yields
minimal shrinkage (for signals), while κt ≈ 1 yields maximal shrinkage to zero
(for noise). For the standard SV model and fixing $\phi = \mu = 0$ for simplicity,
$\lambda_t = \exp(\eta_t/2)$ is log-normally distributed, and the shrinkage parameter has density
\[ [\kappa_t] \propto \frac{1}{\kappa_t (1 - \kappa_t)} \exp\left\{ -\frac{1}{2\sigma_\eta^2} \left[ \log\left( \frac{1 - \kappa_t}{\kappa_t} \right) \right]^2 \right\}. \]
Notably, the density for $\kappa_t$ approaches zero as $\kappa_t \to 0$ and as $\kappa_t \to 1$. As a result,
direct application of the Gaussian SV model may overshrink true signals and
undershrink noise.
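The contrast between the two priors on the shrinkage parameter can be checked numerically. The following is a minimal Python sketch, assuming τ = 1 and φ = µ = 0 as in the text; the function names are illustrative and not part of the original work. It evaluates the (unnormalized) density of κt implied by Gaussian log-volatility innovations and the Beta(1/2, 1/2) density implied by the static horseshoe prior.

```python
import math

def kappa_density_gaussian_sv(kappa, sigma_eta=1.0):
    """Unnormalized density of kappa = 1/(1 + exp(eta)) when eta ~ N(0, sigma_eta^2)."""
    logit = math.log((1.0 - kappa) / kappa)    # eta written as a function of kappa
    jacobian = 1.0 / (kappa * (1.0 - kappa))   # |d eta / d kappa|
    return jacobian * math.exp(-logit ** 2 / (2.0 * sigma_eta ** 2))

def kappa_density_horseshoe(kappa):
    """Beta(1/2, 1/2) density: the static horseshoe prior on kappa with tau = 1."""
    return 1.0 / (math.pi * math.sqrt(kappa * (1.0 - kappa)))

# Compare the two densities near the endpoints and at the midpoint of (0, 1).
for k in (1e-6, 0.5, 1.0 - 1e-6):
    print(k, kappa_density_gaussian_sv(k), kappa_density_horseshoe(k))
```

Under these assumptions the Gaussian SV density vanishes at both endpoints, whereas the horseshoe density is unbounded there, which is the behavior the paragraph above describes.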
By comparison, consider the horseshoe prior of Carvalho et al. (2010). The
horseshoe prior is the special case of (3.1c) with $[\lambda_t] \overset{iid}{\sim} C^+(0, 1)$, where $C^+$
denotes the half-Cauchy distribution. For fixed $\tau = 1$, the half-Cauchy prior on
$\lambda_t$ is equivalent to $\kappa_t \overset{iid}{\sim} \mathrm{Beta}(1/2, 1/2)$, which induces a “horseshoe” shape for
the shrinkage parameter (see Figure 3.2). The horseshoe-like behavior is ideal in
sparse settings, since the prior density allocates most of its mass near zero (min-
imal shrinkage of signals) and one (maximal shrinkage of noise). Theoretical
results, simulation studies, and a variety of applications confirm the effective-
ness of the horseshoe prior (Carvalho et al., 2009, 2010; Datta and Ghosh, 2013;
van der Pas et al., 2014).
To emulate the robustness and sparsity properties of the horseshoe and other
shrinkage priors in the dynamic setting, we represent a general class of global-
local shrinkage priors on the log-scale. As a motivating example, consider the
special case of (3.1a) and (3.2) with $\phi = 0$: $\omega_t \overset{indep}{\sim} N(0, \tau^2 \lambda_t^2)$ with $\log(\lambda_t^2) = \eta_t$.
This example is illuminating: we equivalently express the (static) horseshoe
prior by letting $\eta_t \overset{D}{=} \log \lambda_t^2$, where $\overset{D}{=}$ denotes equality in distribution. In
particular,
\[ [\lambda_t^2] \propto (\lambda_t^2)^{-1/2} (1 + \lambda_t^2)^{-1} \]
implies
\[ [\eta_t] = \pi^{-1} \exp(\eta_t/2) \, [1 + \exp(\eta_t)]^{-1} \]
[Figure 3.2: four density panels for φ = 0.25, φ = 0.5, φ = 0.75, and φ = 0.99; horizontal axis κt from 0 to 1, vertical axis density.]
Figure 3.2: Simulation-based estimate of the stationary distribution of κt for
various AR(1) coefficients φ. The blue line indicates the density of κt in the
static (φ = 0) horseshoe, [κ] ∼ Beta (1/2, 1/2).
so ηt is Z-distributed with ηt ∼ Z(α = 1/2, β = 1/2, µz = 0, σz = 1). Im-
portantly, Z-distributions may be written as mean-variance scale mixtures of
Gaussian distributions (Barndorff-Nielsen et al., 1982), which produces a useful
framework for a parameter-expanded Gibbs sampler.
More generally, consider the inverted-Beta prior, denoted IB(β, α), for $\lambda^2$
with density
\[ [\lambda^2] \propto (\lambda^2)^{\alpha - 1} (1 + \lambda^2)^{-(\alpha + \beta)}, \quad \lambda > 0 \]
(e.g., Armagan et al., 2011; Polson and Scott, 2012a,b). Special cases of the
inverted-Beta distribution are provided in Table 3.1.
This broad class of priors may be equivalently constructed via the variances
λ2t , the shrinkage parameters κt, or the log-variances ηt.
Proposition 3.1. The following distributions are equivalent:
1. $\lambda^2 \sim \mathrm{IB}(\beta, \alpha)$;
2. $\kappa = 1/(1 + \lambda^2) \sim \mathrm{Beta}(\beta, \alpha)$;
3. $\eta = \log(\lambda^2) = \log(\kappa^{-1} - 1) \sim Z(\alpha, \beta, 0, 1)$.
Note that the ordering of the parameters α, β is identical for the inverted-
Beta and Beta distributions, but reversed for the Z-distribution.
Now consider the dynamic setting in which $\phi \neq 0$. Model (3.2) implies that
the conditional prior variance for $\omega_t$ in (3.1a) is
$\exp(h_t) = \exp(\mu + \phi(h_{t-1} - \mu) + \eta_t) = \tau^2 \lambda_{t-1}^{2\phi} \tilde\lambda_t^2$,
where $\tau^2 = \exp(\mu)$, $\lambda_{t-1}^2 = \exp(h_{t-1} - \mu)$, and $\tilde\lambda_t^2 = \exp(\eta_t) \overset{iid}{\sim}$
IB(β, α), as in the non-dynamic setting. This prior generalizes the IB(β, α)
prior via the local variance term, $\lambda_{t-1}^{2\phi}$, which incorporates information about the
shrinkage behavior at the previous time $t - 1$ in the prior for $\omega_t$. We formalize
the role of this local adjustment term with the following results.
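Proposition 3.2. Suppose $\eta \sim Z(\alpha, \beta, \mu_z, 1)$ for $\mu_z \in \mathbb{R}$. Then
$\kappa = 1/(1 + \exp(\eta)) \sim \mathrm{TPB}(\beta, \alpha, \exp(\mu_z))$, where $\kappa \sim \mathrm{TPB}(\beta, \alpha, \gamma)$
denotes the three-parameter Beta distribution with density
\[ [\kappa] = [B(\beta, \alpha)]^{-1} \gamma^{\beta} \kappa^{\beta - 1} (1 - \kappa)^{\alpha - 1} [1 + (\gamma - 1)\kappa]^{-(\alpha + \beta)}, \quad \kappa \in (0, 1), \; \gamma > 0. \]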
The three-parameter Beta (TPB) distribution generalizes the Beta distribu-
tion: γ = 1 produces the Beta(β, α) distribution, while γ > 1 (respectively, γ < 1)
allocates more mass near zero (respectively, one) relative to the Beta(β, α) dis-
tribution. For dynamic shrinkage processes, the TPB distribution arises as the
conditional prior distribution of κt+1 given {κs}s≤t.
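The role of γ in the TPB density can be made concrete with a short numerical sketch. This is an illustration only, written in Python with hypothetical function names; it evaluates the density stated in Proposition 3.2, confirms that γ = 1 recovers the Beta(β, α) density, and checks by midpoint quadrature that the density integrates to one for a smooth test case (α = β = 2, γ = 3, chosen here to avoid endpoint singularities).

```python
import math

def beta_fn(a, b):
    """Beta function B(a, b) via the Gamma function."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def tpb_density(kappa, beta, alpha, gam):
    """Three-parameter Beta density TPB(beta, alpha, gam) at kappa in (0, 1)."""
    return (gam ** beta * kappa ** (beta - 1.0) * (1.0 - kappa) ** (alpha - 1.0)
            * (1.0 + (gam - 1.0) * kappa) ** (-(alpha + beta)) / beta_fn(beta, alpha))

# gam = 1 recovers Beta(1/2, 1/2), whose density is 1 / (pi * sqrt(kappa (1 - kappa))).
horseshoe_at_03 = 1.0 / (math.pi * math.sqrt(0.3 * 0.7))
print(tpb_density(0.3, 0.5, 0.5, 1.0), horseshoe_at_03)

# gam > 1 moves mass toward zero; gam < 1 moves mass toward one.
ratio_near_zero = tpb_density(0.05, 0.5, 0.5, 4.0) / tpb_density(0.05, 0.5, 0.5, 1.0)
ratio_near_one = tpb_density(0.95, 0.5, 0.5, 4.0) / tpb_density(0.95, 0.5, 0.5, 1.0)
print(ratio_near_zero, ratio_near_one)

# Midpoint-rule check that the density integrates to one (smooth test case).
n = 20000
integral = sum(tpb_density((i + 0.5) / n, 2.0, 2.0, 3.0) for i in range(n)) / n
print(integral)
```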
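Theorem 3.1. For the dynamic shrinkage process (3.2), the conditional prior distribution
of the shrinkage parameter $\kappa_{t+1} = 1/(1 + \tau^2 \lambda_{t+1}^2)$ is
\[ [\kappa_{t+1} | \{\kappa_s\}_{s \le t}, \phi, \tau] \sim \mathrm{TPB}\left( \beta, \alpha, \tau^{2(1-\phi)} \left[ \frac{1 - \kappa_t}{\kappa_t} \right]^{\phi} \right) \tag{3.5} \]
or equivalently, $[\kappa_{t+1} | \{\lambda_s\}_{s \le t}, \phi, \tau] \sim \mathrm{TPB}(\beta, \alpha, \tau^2 \lambda_t^{2\phi})$.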
The proof of Theorem 3.1 is in Appendix B. Naturally, the previous value of
the shrinkage parameter, κt, together with the AR(1) coefficient φ, inform both
the magnitude and the direction of the distributional shift of κt+1.
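To see how κt and φ jointly set the third TPB parameter in (3.5), the following sketch (Python; the function name and the numerical inputs are our own illustrative choices) computes the scale from κt and checks that it agrees with the equivalent form written in terms of λt.

```python
def tpb_scale(kappa_prev, phi, tau=1.0):
    """Third TPB parameter in (3.5): tau^(2(1 - phi)) * ((1 - kappa_t) / kappa_t)^phi."""
    return tau ** (2.0 * (1.0 - phi)) * ((1.0 - kappa_prev) / kappa_prev) ** phi

# Equivalent form tau^2 * lambda_t^(2 phi), using (1 - kappa_t) / kappa_t = tau^2 lambda_t^2.
kappa_prev, phi, tau = 0.2, 0.7, 0.5
lam2 = (1.0 - kappa_prev) / (kappa_prev * tau ** 2)
scale_from_kappa = tpb_scale(kappa_prev, phi, tau)
scale_from_lambda = tau ** 2 * lam2 ** phi
print(scale_from_kappa, scale_from_lambda)

# With tau = 1 and phi = 0.8: little previous shrinkage gives a scale above one
# (mass shifts toward kappa = 0), heavy previous shrinkage gives a scale below one.
for k in (0.05, 0.5, 0.95):
    print(k, tpb_scale(k, 0.8))
```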
Theorem 3.2. For the dynamic horseshoe process of (3.2) with $\alpha = \beta = 1/2$ and fixed
$\tau = 1$, the conditional prior distribution (3.5) satisfies
$P(\kappa_{t+1} < \varepsilon | \{\kappa_s\}_{s \le t}, \phi) \to 1$ as
$\kappa_t \to 0$ for any $\varepsilon \in (0, 1)$ and fixed $\phi \neq 0$.
The proof of Theorem 3.2 is in Appendix B. Importantly, Theorem 3.2
demonstrates that the mass of the conditional prior distribution for κt+1 con-
centrates near zero—corresponding to minimal shrinkage of signals—when κt
is near zero, so the shrinkage behavior at time t informs the (prior) shrinkage
behavior at time t+ 1.
We similarly characterize the posterior distribution of κt+1 given {κs}s≤t in
the following theorem, which extends the results of Datta and Ghosh (2013) to
the dynamic setting.
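Theorem 3.3. Under the likelihood $y_t \overset{indep}{\sim} N(\omega_t, 1)$, the prior (3.1a), and the dynamic
horseshoe process (3.2) with $\alpha = \beta = 1/2$ and fixed $\phi \neq 0$, the posterior distribution of
$\kappa_{t+1}$ given the history of the shrinkage process $\{\kappa_s\}_{s \le t}$ satisfies the following properties:
(a) For any fixed $\varepsilon \in (0, 1)$, $P(\kappa_{t+1} > 1 - \varepsilon | y_{t+1}, \{\kappa_s\}_{s \le t}, \phi, \tau) \to 1$ as $\gamma_t \to 0$
uniformly in $y_{t+1} \in \mathbb{R}$, where $\gamma_t = \tau^{2(1-\phi)} [(1 - \kappa_t)/\kappa_t]^{\phi}$.
(b) For any fixed $\varepsilon \in (0, 1)$ and $\gamma_t < 1$, $P(\kappa_{t+1} < \varepsilon | y_{t+1}, \{\kappa_s\}_{s \le t}, \phi, \tau) \to 1$ as
$|y_{t+1}| \to \infty$.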
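The proof of Theorem 3.3 is in Appendix B, and uses the observation that
marginally, $[y_{t+1} | \{\kappa_s\}] \overset{indep}{\sim} N(0, \kappa_{t+1}^{-1})$, so the posterior distribution of $\kappa_{t+1}$
is
\[ [\kappa_{t+1} | y_{t+1}, \{\kappa_s\}_{s \le t}, \phi, \tau] \propto \left\{ \kappa_{t+1}^{\beta - 1} (1 - \kappa_{t+1})^{\alpha - 1} [1 + (\gamma_t - 1)\kappa_{t+1}]^{-(\alpha + \beta)} \right\} \times \left\{ \kappa_{t+1}^{1/2} \exp\left( -y_{t+1}^2 \kappa_{t+1}/2 \right) \right\} \]
\[ \propto (1 - \kappa_{t+1})^{-1/2} [1 + (\gamma_t - 1)\kappa_{t+1}]^{-1} \exp\left( -y_{t+1}^2 \kappa_{t+1}/2 \right). \]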
Theorem 3.3(a) demonstrates that the posterior mass of [κt+1|{κs}s≤t] concen-
trates near one as τ → 0, as in the non-dynamic horseshoe, but also as κt → 1.
Therefore, the dynamic horseshoe process provides an additional mechanism
for shrinkage of noise, besides the global scale parameter τ , via the previous
shrinkage parameter κt. Moreover, Theorem 3.3(b) shows that, despite the ad-
ditional shrinkage capabilities, the posterior mass of [κt+1|{κs}s≤t] concentrates
near zero for large absolute signals |yt+1|, which indicates robustness of the dy-
namic horseshoe process to large signals analogous to the static horseshoe prior.
When |φ| < 1, the log-volatility process {ht} is stationary, which implies {κt}
is stationary. In Figure 3.2, we plot a simulation-based estimate of the stationary
distribution of κt for various values of φ under the dynamic horseshoe process.
The stationary distribution of κt is similar to the static horseshoe distribution
(φ = 0) for φ < 0.5, while for large values of φ the distribution becomes more
peaked at zero (less shrinkage of ωt) and one (more shrinkage of ωt). The result
is intuitive: larger |φ| corresponds to greater persistence in shrinkage behavior,
so marginally we expect states of aggressive shrinkage or little shrinkage.
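A simulation of this kind can be sketched in a few lines. The code below is a minimal Python illustration under our own assumptions (µ = 0 so that τ = 1, an arbitrary seed, burn-in, and sample size, and hypothetical function names); it is not the code used to produce Figure 3.2. It generates the dynamic horseshoe process with ηt drawn as the log of a squared half-Cauchy variable and reports the fraction of κt values within 0.05 of zero or one.

```python
import math
import random

def shrinkage(h):
    """kappa = 1 / (1 + exp(h)), guarded against overflow for very large h."""
    return 1.0 / (1.0 + math.exp(h)) if h < 700.0 else 0.0

def simulate_kappa(phi, n, rng, mu=0.0, burn_in=1000):
    """Simulate kappa_t from h_{t+1} = mu + phi (h_t - mu) + eta_t, eta_t ~ Z(1/2, 1/2, 0, 1)."""
    h = mu
    kappas = []
    for t in range(burn_in + n):
        u = rng.random()
        while u == 0.0:
            u = rng.random()
        eta = 2.0 * math.log(math.tan(0.5 * math.pi * u))  # log of a squared half-Cauchy draw
        h = mu + phi * (h - mu) + eta
        if t >= burn_in:
            kappas.append(shrinkage(h))
    return kappas

def fraction_extreme(kappas, eps=0.05):
    """Share of draws within eps of zero or one."""
    return sum(k < eps or k > 1.0 - eps for k in kappas) / len(kappas)

rng = random.Random(2017)  # arbitrary seed
frac_static = fraction_extreme(simulate_kappa(0.0, 20000, rng))
frac_persistent = fraction_extreme(simulate_kappa(0.9, 20000, rng))
print(frac_static, frac_persistent)
```

For φ = 0 the fraction should be close to the Beta(1/2, 1/2) value 2 (2/π) arcsin(√0.05) ≈ 0.287, and it should be noticeably larger for φ = 0.9, in line with the more peaked stationary distribution described above.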
3.2.3 Scale Mixtures via Pólya-Gamma Processes
Standard SV sampling algorithms rely on a Gaussian assumption for the
log-volatility innovations, $\eta_t \overset{iid}{\sim} N(0, \sigma_\eta^2)$, to efficiently sample the log-volatilities
$\{h_t\}$ (e.g., Kim et al., 1998; Omori et al., 2007; Kastner and Frühwirth-Schnatter,
2014). To extend these techniques to the dynamic shrinkage process (3.2) in
which $\eta_t \overset{iid}{\sim} Z(\alpha, \beta, 0, 1)$, we use parameter expansion to write $\eta_t$ as a scale mixture
of Gaussian distributions. The representation of a Z-distribution as a mean-variance
scale mixture of Gaussian distributions is due to Barndorff-Nielsen
et al. (1982). For parameter expansion, we build on the framework of Polson
et al. (2013), who propose a Pólya-Gamma scale mixture of Gaussians represen-
tation for Bayesian logistic regression. Importantly, this representation allows
us to construct an efficient sampling algorithm that combines an $O(T)$ sampling
algorithm for the log-volatilities $\{h_t\}_{t=1}^T$ with a Pólya-Gamma sampler for the
mixing parameters.
A Pólya-Gamma random variable $\xi$ with parameters $b > 0$ and $c \in \mathbb{R}$, denoted
$\xi \sim \mathrm{PG}(b, c)$, is an infinite convolution of Gamma random variables:
\[ \xi \overset{D}{=} \frac{1}{2\pi^2} \sum_{k=1}^{\infty} \frac{g_k}{(k - 1/2)^2 + c^2/(4\pi^2)} \tag{3.6} \]
where $g_k \overset{iid}{\sim} \mathrm{Gamma}(b, 1)$. Properties of Pólya-Gamma random variables may
be found in Barndorff-Nielsen et al. (1982) and Polson et al. (2013). Our interest
in Pólya-Gamma random variables derives from their role in representing the
Z-distribution as a mean-variance scale mixture of Gaussians.
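Theorem 3.4. The random variable $\eta \sim Z(\alpha, \beta, 0, 1)$, or equivalently $\eta = \log(\lambda^2)$
with $\lambda^2 \sim \mathrm{IB}(\beta, \alpha)$, is a mean-variance scale mixture of Gaussian distributions with
\[ [\eta | \xi] \sim N(\xi^{-1}[\alpha - \beta]/2, \; \xi^{-1}), \qquad [\xi] \sim \mathrm{PG}(\alpha + \beta, 0). \tag{3.7} \]
Moreover, the conditional distribution of $\xi$ is $[\xi | \eta] \sim \mathrm{PG}(\alpha + \beta, \eta)$.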
The proof of Theorem 3.4 is in Appendix B. When α = β, the Z-distribution
is symmetric, and the conditional expectation in (3.7) simplifies to E[η|ξ] = 0.
Polson et al. (2013) propose a sampling algorithm for Pólya-Gamma random
variables, which is available in the R package BayesLogit, and is extremely
efficient when b = 1. In our setting, this corresponds to α+ β = 1, for which the
horseshoe prior is the prime example.
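The series representation of a Pólya-Gamma variable suggests a simple, if crude, way to generate approximate draws: truncate the infinite sum. The Python sketch below does this for illustration only. It is not the sampler of Polson et al. (2013) and does not use BayesLogit; the truncation level, seed, and function names are our own assumptions. It uses the form of the series with a plus sign in the denominator, as in Polson et al. (2013), and compares the sample mean with the known expectation E[ξ] = (b/(2c)) tanh(c/2), which equals b/4 at c = 0.

```python
import math
import random

def pg_draw_truncated(b, c, rng, n_terms=100):
    """Approximate PG(b, c) draw: the series representation truncated at n_terms."""
    total = 0.0
    for k in range(1, n_terms + 1):
        g_k = rng.gammavariate(b, 1.0)   # Gamma(shape = b, scale = 1)
        total += g_k / ((k - 0.5) ** 2 + c ** 2 / (4.0 * math.pi ** 2))
    return total / (2.0 * math.pi ** 2)

def pg_mean(b, c):
    """Exact mean of PG(b, c): b / 4 at c = 0, otherwise (b / (2c)) tanh(c / 2)."""
    return b / 4.0 if c == 0.0 else b / (2.0 * c) * math.tanh(c / 2.0)

def pg_mean_truncated(b, c, n_terms):
    """Mean of the truncated series, using E[g_k] = b."""
    return sum(b / ((k - 0.5) ** 2 + c ** 2 / (4.0 * math.pi ** 2))
               for k in range(1, n_terms + 1)) / (2.0 * math.pi ** 2)

rng = random.Random(0)   # arbitrary seed
n_draws = 2000
mean_c0 = sum(pg_draw_truncated(1.0, 0.0, rng) for _ in range(n_draws)) / n_draws
mean_c2 = sum(pg_draw_truncated(1.0, 2.0, rng) for _ in range(n_draws)) / n_draws
print(mean_c0, pg_mean(1.0, 0.0))
print(mean_c2, pg_mean(1.0, 2.0))
```

The truncated sum slightly underestimates ξ, so an exact sampler is preferable in practice; the sketch only makes the definition concrete.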
3.3 Bayesian Trend Filtering with Dynamic Shrinkage Processes
Dynamic shrinkage processes are particularly appropriate for dynamic linear
models (DLMs). DLMs combine an observation equation, which relates the ob-
served data to latent state variables, and an evolution equation, which allows
the state variables—and therefore the conditional mean of the data—to be dy-
namic. By construction, DLMs contain many parameters, and therefore may
benefit from structured regularization. The proposed dynamic shrinkage pro-
cesses offer such regularization, and unlike existing methods, do so adaptively.
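Consider the following DLM with a Dth order random walk on the state
variable, $\beta_t$:
\[ y_t = \beta_t + \epsilon_t, \quad [\epsilon_t | \sigma_\epsilon] \overset{iid}{\sim} N(0, \sigma_\epsilon^2), \quad t = 1, \ldots, T \]
\[ \Delta^D \beta_{t+1} = \omega_t, \quad [\omega_t | \tau, \{\lambda_s\}] \overset{indep}{\sim} N(0, \tau^2 \lambda_t^2), \quad t = D, \ldots, T \tag{3.8} \]
and $\beta_{t+1} = \omega_t \sim N(0, \tau^2 \lambda_t^2)$ for $t = 0, \ldots, D - 1$, where $\Delta$ is the differencing
operator and $D \in \mathbb{Z}^+$ is the degree of differencing. By imposing a shrinkage prior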
on λt, model (3.8) may be viewed as a Bayesian adaptation of the trend filtering
model of Kim et al. (2009) and Tibshirani (2014): model (3.8) features a penalty
encouraging sparsity of the Dth order differences of the conditional mean, βt.
Faulkner and Minin (2016) provide an implementation based on the (static)
horseshoe prior and the Bayesian lasso, and further allow for non-Gaussian
likelihoods. We refer to model (3.8) as a Bayesian trend filtering (BTF) model,
with various choices available for the distribution of the innovation standard
deviations, [τλt].
We propose a dynamic horseshoe process as the prior for the innovations
ωt in model (3.8). The aggressive shrinkage of the horseshoe prior forces small
values of |ωt| = |∆Dβt+1| toward zero, while the robustness of the horseshoe
prior permits large values of |∆Dβt+1|. When D = 2, model (3.8) will shrink the
conditional mean βt toward a piecewise linear function with breakpoints deter-
mined adaptively, while allowing large absolute changes in the slopes. Further,
using the dynamic horseshoe process, the shrinkage effects induced by λt are
time-dependent, which provides localized adaptability to regions with rapidly-
or slowly-changing features. Following Carvalho et al. (2010) and Polson and
Scott (2012b), we assume a half-Cauchy prior for the global scale parameter
$\tau \sim C^+(0, \sigma_\epsilon/\sqrt{T})$, in which we scale by the observation error variance and the
sample size (Piironen and Vehtari, 2016). Using Pólya-Gamma mixtures, the
implied conditional prior on $\mu = \log(\tau^2)$ is $[\mu | \sigma_\epsilon, \xi_\mu] \sim N(\log \sigma_\epsilon^2 - \log T, \xi_\mu^{-1})$
with $\xi_\mu \sim \mathrm{PG}(1, 0)$. We include the details of the Gibbs sampling algorithm
for model (3.8) in Section 3.5, which is notably linear in the number of time
points, $T$: the full conditional posterior precision matrices for $\beta = (\beta_1, \ldots, \beta_T)'$
and $h = (h_1, \ldots, h_T)'$ are D-banded and tridiagonal, respectively, which admit
highly efficient $O(T)$ back-band substitution sampling algorithms (see Appendix B
for empirical evidence).
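The linear-time claim rests on a generic fact: a Gaussian with a tridiagonal precision matrix Q and linear term ℓ, that is N(Q⁻¹ℓ, Q⁻¹), can be sampled with a banded Cholesky factorization followed by one forward and one backward substitution. The following Python sketch illustrates that idea for the tridiagonal case only. It is not the dissertation's implementation (which is in R and detailed in Section 3.5); the function names and the toy precision matrix are assumptions made for the example.

```python
import math
import random

def tridiag_matvec(diag, off, x):
    """Multiply the symmetric tridiagonal matrix (diag, off) by the vector x."""
    n = len(diag)
    out = [diag[i] * x[i] for i in range(n)]
    for i in range(n - 1):
        out[i] += off[i] * x[i + 1]
        out[i + 1] += off[i] * x[i]
    return out

def sample_tridiagonal_gaussian(diag, off, ell, z):
    """Return x = Q^{-1} ell + L^{-T} z in O(T), where Q = L L^T is tridiagonal and SPD.

    If z holds independent N(0, 1) draws, then x ~ N(Q^{-1} ell, Q^{-1}).
    """
    n = len(diag)
    chol_diag = [0.0] * n        # diagonal of the Cholesky factor L
    chol_sub = [0.0] * (n - 1)   # sub-diagonal of L
    chol_diag[0] = math.sqrt(diag[0])
    for i in range(n - 1):
        chol_sub[i] = off[i] / chol_diag[i]
        chol_diag[i + 1] = math.sqrt(diag[i + 1] - chol_sub[i] ** 2)
    # Forward substitution: solve L w = ell.
    w = [0.0] * n
    w[0] = ell[0] / chol_diag[0]
    for i in range(1, n):
        w[i] = (ell[i] - chol_sub[i - 1] * w[i - 1]) / chol_diag[i]
    # Backward substitution: solve L^T x = w + z.
    x = [0.0] * n
    x[n - 1] = (w[n - 1] + z[n - 1]) / chol_diag[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = (w[i] + z[i] - chol_sub[i] * x[i + 1]) / chol_diag[i]
    return x

# Toy example (hypothetical numbers): a diagonally dominant tridiagonal precision matrix.
T = 6
diag = [2.0] * T
off = [-0.9] * (T - 1)
ell = [1.0, 0.0, -1.0, 2.0, 0.5, 1.0]

mean = sample_tridiagonal_gaussian(diag, off, ell, [0.0] * T)   # z = 0 gives the mean
residual = max(abs(a - b) for a, b in zip(tridiag_matvec(diag, off, mean), ell))

rng = random.Random(3)   # arbitrary seed
draw = sample_tridiagonal_gaussian(diag, off, ell, [rng.gauss(0.0, 1.0) for _ in range(T)])
print(residual, draw)
```

Each loop touches every time point once, so the cost grows linearly in T; a D-banded precision matrix admits the same strategy with a banded factor of bandwidth D.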
3.3.1 Bayesian Trend Filtering: Simulations
To assess the performance of the Bayesian trend filtering (BTF) model (3.8) with
dynamic horseshoe innovations (BTF-DHS), we compared the proposed meth-
ods to several competitive alternatives using simulated data. We considered the
following variations on BTF model (3.8): normal-inverse-Gamma (BTF-NIG)
innovations via $\tau^{-2} \sim \mathrm{Gamma}(0.001, 0.001)$ with $\lambda_t = 1$; and (static) horseshoe
priors for the innovations (BTF-HS) via $\tau, \lambda_t \overset{iid}{\sim} C^+(0, 1)$. In addition, we include
the (non-Bayesian) trend filtering model of Tibshirani (2014) implemented us-
ing the R package genlasso (Arnold and Tibshirani, 2014), for which the regu-
larization tuning parameter is chosen using cross-validation (Trend Filtering).
For all trend filtering models, we select D = 2, but the relative performance is
similar for D = 1. Among non-trend filtering models, we include a smooth-
ing spline estimator implemented via smooth.spline() in R (Smoothing
Spline); the wavelet-based estimator of Abramovich et al. (1998) (BayesThresh)
implemented in the wavethresh package (Nason, 2016); and the nested Gaus-
sian Process (nGP) model of Zhu and Dunson (2013), which relies on a state
space model framework for efficient computations, comparable to—but empir-
ically less efficient than—the BTF model (3.8).
We simulated 100 data sets from the model $y_t = y_t^* + \epsilon_t$, where $y_t^*$ is the
true function and $\epsilon_t \overset{indep}{\sim} N(0, \sigma_*^2)$. We use the following true functions $y_t^*$ from
Donoho and Johnstone (1994): Doppler, Bumps, Blocks, and Heavisine, implemented
in the R package wmtsa (Constantine and Percival, 2016). The noise
variance $\sigma_*^2$ is determined by selecting a root-signal-to-noise ratio (RSNR) and
computing
\[ \sigma_* = \sqrt{\frac{\sum_{t=1}^T (y_t^* - \bar{y}^*)^2}{T - 1}} \Big/ \mathrm{RSNR}, \]
where $\bar{y}^* = \frac{1}{T} \sum_{t=1}^T y_t^*$. As in Zhu and
Dunson (2013), we select RSNR = 7 and use a moderate length time series,
T = 128.
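The noise level implied by a chosen RSNR is a one-line computation. The Python sketch below shows it under our own assumptions: the function name is hypothetical, and the sine curve is a stand-in signal rather than one of the Donoho and Johnstone test functions.

```python
import math

def noise_sd_from_rsnr(y_star, rsnr):
    """Noise standard deviation: sample standard deviation of the true curve over RSNR."""
    T = len(y_star)
    y_bar = sum(y_star) / T
    sample_sd = math.sqrt(sum((y - y_bar) ** 2 for y in y_star) / (T - 1))
    return sample_sd / rsnr

# Stand-in signal on an equally spaced grid of T = 128 points (illustrative only).
T = 128
y_star = [math.sin(2.0 * math.pi * t / T) for t in range(1, T + 1)]
sigma_star = noise_sd_from_rsnr(y_star, rsnr=7.0)
print(sigma_star)
```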
[Figure 3.3: four panels of fitted curves (Doppler, Bumps, Blocks, Heavisine); horizontal axis t from 0 to 1, vertical axis yt.]
Figure 3.3: Fitted curves for simulated data with T = 128 and RSNR = 7. Each
panel includes the simulated observations (x-marks), the posterior expectations
of βt (cyan), and the 95% pointwise HPD credible intervals (light gray) and 95%
simultaneous credible bands (dark gray) for the posterior predictive distribu-
tion of {yt} under BTF-DHS model (3.8) with D = 2. The proposed estimator,
as well as the uncertainty bands, accurately capture both slowly- and rapidly-
changing behavior in the underlying functions.
In Figure 3.3, we provide an example of each true curve y∗t , together with
the proposed BTF-DHS posterior expectations and credible bands. Notably, the
Bayesian trend filtering model (3.8) with D = 2 and dynamic horseshoe inno-
vations provides an exceptionally accurate fit to each data set. Importantly, the
posterior expectations and the posterior credible bands adapt to both slowly-
and rapidly-changing behavior in the underlying curves. The implementation
is also efficient: the computation time for 15,000 iterations of the Gibbs sam-
pling algorithm, implemented in R (on a MacBook Pro, 2.7 GHz Intel Core i5),
is about 1.15 minutes.
To compare the aforementioned procedures, we compute the root mean
squared errors
\[ \mathrm{RMSE}(\hat{y}) = \sqrt{\frac{1}{T} \sum_{t=1}^T (y_t^* - \hat{y}_t)^2} \]
for all estimators $\hat{y}$ of the true function, $y^*$. The results are displayed in
Figure 3.4. The proposed BTF-DHS implementation substantially outperforms all
competitors, especially for rapidly-changing curves (Doppler and Bumps). The
exceptional performance of BTF-DHS is paired with comparably small variability
of RMSE, especially relative
to the non-dynamic horseshoe model (BTF-HS). Interestingly, the magnitude and
variability of the RMSEs for BTF-DHS are related to the AR(1) coefficient, φ:
the 95% HPD intervals (corresponding to Figure 3.3) are (0.77, 0.97) (Doppler),
(0.81, 0.97) (Bumps), (0.76, 0.96) (Blocks), and (−0.04, 0.74) (Heavisine). For the
smoothest function, Heavisine, there is less separation among the estimators.
Nonetheless, BTF-DHS performs the best, even though the HPD interval for φ
is wider and contains zero. These results show that the Bayesian trend filtering
model (3.8) with dynamic horseshoe innovations substantially improves upon
existing curve-fitting procedures, and due to both its computational efficiency
and the availability of posterior inference, may provide a useful procedure for a
wide variety of applications.
[Figure 3.4: four panels of root mean squared errors (Doppler, Bumps, Blocks, Heavisine), each comparing Smoothing Spline, BayesThresh, nGP, Trend Filtering, BTF-NIG, BTF-HS, and BTF-DHS.]
Figure 3.4: Root mean squared errors for simulated data with T = 128 and
RSNR = 7. The Bayesian trend filtering (BTF) estimators differ in their inno-
vation distributions, which determines the shrinkage behavior of the second
order differences (D = 2): normal-inverse-Gamma (NIG), horseshoe (HS), and
dynamic horseshoe (DHS).
3.3.2 Bayesian Trend Filtering: Application to CPU Usage Data
To demonstrate the adaptability of the dynamic horseshoe process for model
(3.8), we consider the CPU usage data in Figure 3.1a. The data exhibit substan-
tial complexity: an overall smooth intraday trend but with multiple irregularly-
spaced jumps, and an increase in volatility from 16:00-18:00. Our goal is to
provide an accurate measure of the trend, including jumps, with appropriate
uncertainty quantification. For this purpose, we employ the BTF-DHS model
(3.8), which we extend to include stochastic volatility for the observation error:
$y_t \overset{indep}{\sim} N(\beta_t, \sigma_t^2)$ with an AR(1) model on $\log \sigma_t^2$ as in (3.2) with $\eta_t \overset{iid}{\sim} N(0, \sigma_\eta^2)$.
For the additional sampling step of the stochastic volatility parameters, we use
the algorithm of Kastner and Frühwirth-Schnatter (2014) implemented in the R
package stochvol (Kastner, 2016).
The resulting model fit is summarized in Figure 3.1. The posterior expec-
tation and posterior credible bands accurately model both irregular jumps and
smooth trends, and capture the increase in volatility from 16:00-18:00 (see Figure
3.1c). By examining regions of nonoverlapping simultaneous posterior credible
bands, we may assess change points in the level of the data. In particular, the
model fit suggests that the CPU usage followed a slowly increasing trend in-
terrupted by jumps of two distinct magnitudes prior to 16:00, after which the
volatility increased and the level decreased until approximately 18:00.
We augment the simulation study of Section 3.3.1 with a comparison of out-
of-sample estimation of the CPU usage data. We fit each model using 90% (T =
1296) of the data selected randomly for training and the remaining 10% (T =
144) for testing, which was repeated independently 100 times. Models were
compared using RMSE.
Unlike the simulation study in Section 3.3.1, the subsampled data are not
equally spaced. Taking advantage of the computational efficiency of the pro-
posed BTF methodology, we employ a model-based imputation scheme, which
is valid for missing observations. For unequally-spaced data $y_{t_i}$, $i = 1, \ldots, T$, we
expand the operative data set to include missing observations along an
equally-spaced grid, $t^* = 1, \ldots, T^*$, such that for each observation point $i$, $y_{t_i} = y_{t^*}$
for some $t^*$. Although $T^* \ge T$, possibly with $T^* \gg T$, all computations
within the sampling algorithm, including the imputation sampling scheme for
$\{y_{t^*} : t^* \neq t_i\}$, are linear in the number of (equally-spaced) time points, $T^*$.
Therefore, we may apply the same Gibbs sampling algorithm as before, with
the additional step of drawing $y_{t^*} \overset{indep}{\sim} N(\beta_{t^*}, \sigma_{t^*}^2)$ for each unobserved $t^* \neq t_i$.
Implicitly, this procedure assumes that the unobserved points are missing at
random, which is satisfied by the aforementioned subsampling scheme.
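The grid expansion and the extra imputation draw can be sketched as follows. This is a simplified Python illustration with hypothetical function names and toy inputs; in the actual sampler the values of βt* and σt* would be the current Gibbs draws rather than fixed numbers.

```python
import random

def expand_to_grid(obs_times, obs_values, grid_length):
    """Place observations on the grid 1, ..., grid_length; None marks a missing value."""
    y_grid = [None] * grid_length
    for t, y in zip(obs_times, obs_values):
        y_grid[t - 1] = y
    return y_grid

def impute_missing(y_grid, beta, sigma, rng):
    """Complete the series by drawing y_t ~ N(beta_t, sigma_t^2) wherever y_t is missing."""
    return [rng.gauss(b, s) if y is None else y
            for y, b, s in zip(y_grid, beta, sigma)]

# Toy example: three observations on a grid of six equally spaced time points.
y_grid = expand_to_grid([1, 3, 6], [0.5, 1.5, 2.0], grid_length=6)
beta = [0.4, 1.0, 1.4, 1.6, 1.8, 2.1]   # placeholder for the current draw of the state
sigma = [0.1] * 6                        # placeholder for the current error standard deviations
rng = random.Random(4)                   # arbitrary seed
y_complete = impute_missing(y_grid, beta, sigma, rng)
print(y_grid)
print(y_complete)
```

Observed values pass through unchanged, so the step adds one Gaussian draw per unobserved grid point and keeps the overall cost linear in the grid length.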
The results of the out-of-sample estimation study are displayed in Figure 3.5.
The BTF procedures are notably superior to the non-Bayesian trend filtering and
smoothing spline estimators, and, as with the simulations of Section 3.3.1, the
proposed BTF-DHS model substantially outperforms all competitors.
[Figure 3.5: root mean squared errors for Smoothing Spline, Trend Filtering, BTF-NIG, BTF-HS, and BTF-DHS.]
Figure 3.5: Root mean squared error for out-of-sample minute-by-minute CPU usage
data. The Bayesian trend filtering (BTF) estimators differ in their innovation distribu-
tions, which determines the shrinkage behavior of the second order differences (D = 2):
normal-inverse-Gamma (NIG), horseshoe (HS), and dynamic horseshoe (DHS).
3.4 Joint Shrinkage for Time-Varying Parameter Models
Dynamic shrinkage processes are appropriate for multivariate time series mod-
els that may benefit from locally adaptive shrinkage properties. Consider the
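following time-varying parameter regression model with multiple dynamic predictors
$x_t = (x_{1,t}, \ldots, x_{p,t})'$:
\[ y_t = x_t' \beta_t + \epsilon_t, \quad [\epsilon_t | \sigma_\epsilon] \overset{indep}{\sim} N(0, \sigma_\epsilon^2) \]
\[ \Delta^D \beta_{t+1} = \omega_t, \quad [\omega_{j,t} | \tau_0, \{\tau_k\}, \{\lambda_{k,s}\}] \overset{indep}{\sim} N(0, \tau_0^2 \tau_j^2 \lambda_{j,t}^2) \tag{3.9} \]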
where βt = (β1,t, . . . , βp,t)′ is the vector of dynamic regression coefficients and
D ∈ Z+ is the degree of differencing. The prior for the innovations ωj,t incor-
porates three levels of global-local shrinkage: a global shrinkage parameter τ0,
a predictor-specific shrinkage parameter τj , and a predictor- and time-specific
local shrinkage parameter λj,t.
To provide jointly localized shrinkage of the dynamic regression coefficients
$\{\beta_{j,t}\}$ analogous to the Bayesian trend filtering model of Section 3.3, we extend
(3.2) to allow for multivariate time dependence via a vector autoregression
(VAR) on the log-volatility:
\[ [\omega_{j,t} | \tau_0, \{\tau_k\}, \{\lambda_{k,s}\}] \overset{indep}{\sim} N(0, \tau_0^2 \tau_j^2 \lambda_{j,t}^2) \]
\[ h_{j,t} = \log(\tau_0^2 \tau_j^2 \lambda_{j,t}^2), \quad j = 1, \ldots, p, \quad t = 1, \ldots, T \tag{3.10} \]
\[ h_{t+1} = \mu + \Phi(h_t - \mu) + \eta_t, \quad \eta_{j,t} \overset{iid}{\sim} Z(\alpha, \beta, 0, 1) \]
where $h_t = (h_{1,t}, \ldots, h_{p,t})'$, $\mu = (\mu_1, \ldots, \mu_p)'$, $\eta_t = (\eta_{1,t}, \ldots, \eta_{p,t})'$, and $\Phi$ is the
p × p VAR coefficient matrix. We assume Φ = diag (φ1, . . . , φp) for simplicity,
but non-diagonal extensions are available. As in the univariate setting, we use
Pólya-Gamma mixtures (independently) for the log-volatility evolution errors,
$[\eta_{j,t} | \xi_{j,t}] \overset{indep}{\sim} N(\xi_{j,t}^{-1}[\alpha - \beta]/2, \xi_{j,t}^{-1})$ with $\xi_{j,t} \overset{iid}{\sim} \mathrm{PG}(\alpha + \beta, 0)$ and $\alpha = \beta = 1/2$.
We augment model (3.10) with half-Cauchy priors for the predictor-specific and
global parameters, $\tau_j \overset{indep}{\sim} C^+(0, 1)$ and $\tau_0 \sim C^+(0, \sigma_\epsilon/\sqrt{Tp})$, in which we scale
by the observation error variance and the number of innovations $\{\omega_{j,t}\}$ (Piironen
and Vehtari, 2016). These priors may be equivalently represented on the log-scale
using the Pólya-Gamma parameter expansion $[\mu_j | \mu_0, \xi_{\mu_j}] \sim N(\mu_0, \xi_{\mu_j}^{-1})$ and
$[\mu_0 | \sigma_\epsilon, \xi_{\mu_0}] \sim N(\log \sigma_\epsilon^2 - \log(Tp), \xi_{\mu_0}^{-1})$ with $\xi_{\mu_j}, \xi_{\mu_0} \overset{iid}{\sim} \mathrm{PG}(1, 0)$ and the identification
$\mu_j = \log(\tau_0^2 \tau_j^2)$ and $\mu_0 = \log(\tau_0^2)$.
3.4.1 Time-Varying Parameter Models: Simulations
We conducted a simulation study to evaluate competing variations of the time-
varying parameter regression model (3.9), in particular relative to the proposed
dynamic shrinkage process (BTF-DHS) in (3.10). Similar to the simulations of
Section 3.3.1, we focus on the distribution of the innovations, ωj,t, and again in-
clude the normal-inverse-Gamma (BTF-NIG) and the (static) horseshoe (BTF-
HS) as competitors, in each case selecting D = 2. Among models with non-
dynamic regression coefficients, we include a lasso regression (Tibshirani, 1996)
implemented via the R package glmnet (Friedman et al., 2010), which incorpo-
rates variable selection, and an ordinary linear regression.
We simulated 100 data sets of length $T = 500$ from the model $y_t = x_t' \beta_t^* + \epsilon_t$,
where the $p = 7$ predictors are $x_{1,t} = 1$ and $x_{j,t} \overset{iid}{\sim} N(0, 1)$ for $j > 2$, and
$\epsilon_t \overset{iid}{\sim} N(0, \sigma_*^2)$. The true regression coefficients $\beta_t^* = (\beta_{1,t}^*, \ldots, \beta_{p,t}^*)'$ are the
following: $\beta_{1,t}^* = 1$, $\beta_{2,t}^*$ and $\beta_{3,t}^*$ are the Bumps and Heavisine functions, respectively,
from Section 3.3.1 rescaled to [0, 1], and $\beta_{j,t}^* = 0$ for $j = 4, \ldots, p = 7$. The predictor
set contains a variety of functions: a constant nonzero function, a rapidly-changing
function (Bumps), a relatively smooth function (Heavisine), and three
true zeros. The noise variance $\sigma_*^2$ is determined by selecting a root-signal-to-noise
ratio (RSNR) and computing
\[ \sigma_* = \sqrt{\frac{\sum_{t=1}^T (y_t^* - \bar{y}^*)^2}{T - 1}} \Big/ \mathrm{RSNR}, \]
where $y_t^* = x_t' \beta_t^*$ and $\bar{y}^* = \frac{1}{T} \sum_{t=1}^T y_t^*$. We select RSNR = 10.
In Figure 3.6, we show the true regression functions β∗j,t, together with the
proposed BTF-DHS posterior expectations and credible bands for βj,t. De-
spite the challenge presented by the Bumps function, the proposed model (3.9)
with innovation distribution (3.10) adequately identifies the constant and zero
curves, captures the important features of the Bumps function, and accurately
estimates the smoother Heavisine function.
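We evaluate competing methods using RMSEs for both $y_t^*$ and $\beta_t^*$ defined by
\[ \mathrm{RMSE}(\hat{y}) = \sqrt{\frac{1}{T} \sum_{t=1}^T (y_t^* - \hat{y}_t)^2} \quad \text{and} \quad \mathrm{RMSE}(\hat{\beta}) = \sqrt{\frac{1}{Tp} \sum_{t=1}^T \sum_{j=1}^p (\beta_{j,t}^* - \hat{\beta}_{j,t})^2} \]
for all estimators $\hat{\beta}_t$ of the true regression functions, $\beta_t^*$, with $\hat{y}_t = x_t' \hat{\beta}_t$. The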
results are displayed in Figure 3.7. The proposed BTF-DHS model substantially
outperforms the competitors in both recovery of the true regression functions,
β∗j,t and estimation of the true curves, y∗t . Notably, the dynamic (BTF) procedures
offer massive gains over the models with static regression coefficients.
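Both error measures are straightforward to compute once the true and estimated quantities are stored as arrays. The Python sketch below is illustrative only: the function names are hypothetical, the coefficients are stored as T lists of length p, and the toy numbers are not from the study.

```python
import math

def rmse_fitted(y_star, y_hat):
    """Root mean squared error between the true curve and the fitted values."""
    T = len(y_star)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_star, y_hat)) / T)

def rmse_coefficients(beta_star, beta_hat):
    """Root mean squared error over all T * p regression coefficients."""
    T = len(beta_star)
    p = len(beta_star[0])
    total = sum((beta_star[t][j] - beta_hat[t][j]) ** 2
                for t in range(T) for j in range(p))
    return math.sqrt(total / (T * p))

# Toy inputs with T = 2 time points and p = 2 predictors.
print(rmse_fitted([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(rmse_coefficients([[1.0, 0.0], [1.0, 0.0]], [[1.0, 1.0], [0.0, 0.0]]))
```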
[Figure 3.6: six panels (Intercept, Bumps, Heavisine, Zero, Zero, Zero); horizontal axis from 0 to 1.]
Figure 3.6: True regression functions β∗j,t (black line) and corresponding poste-
rior expectations (cyan), 95% pointwise HPD credible intervals (light gray) and
95% simultaneous credible bands (dark gray) for βj,t under the BTF-DHS model
given by (3.9) and (3.10) for a simulated data set.
[Figure 3.7: two panels of root mean squared errors (regression coefficients; fitted values), each comparing Linear Regression, Lasso, BTF-NIG, BTF-HS, and BTF-DHS.]
Figure 3.7: Root mean squared errors for the regression coefficients, $\beta_{j,t}^*$ (left)
and the true curves, $y_t^* = x_t' \beta_t^*$ (right) for simulated data.
3.4.2 Time-Varying Parameter Models: The Fama-French Asset
Pricing Model
Asset pricing models commonly feature highly structured factor models to par-
simoniously model the co-movement of stock returns. Such fundamental fac-
tor models identify common risk factors among assets, which may be treated
as exogenous predictors in a time series regression. Popular approaches in-
clude the one-factor Capital Asset Pricing Model (CAPM, Sharpe, 1964) and the
three-factor Fama-French model (FF-3, Fama and French, 1993). Recently, the
five-factor Fama-French model (FF-5, Fama and French, 2015) was proposed as
an extension of FF-3 to incorporate additional common risk factors. However,
outstanding questions remain regarding which, and how many, factors are nec-
essary. Importantly, an attempt to address these questions must consider the
dynamic component: the relevance of individual factors may change over time,
particularly for different assets.
We apply model (3.9) to extend these fundamental factor models to the dy-
namic setting, in which the factor loadings are permitted to vary—perhaps
rapidly—over time. For further generality, we append the momentum factor
of Carhart (1997) to FF-5 to produce a fundamental factor model with six factors
and dynamic factor loadings. Importantly, the shrinkage towards sparsity in-
duced by the dynamic horseshoe process allows the model to effectively select
out unimportant factors, which also may change over time. As in Section 3.3.2,
we modify model (3.9) to include stochastic volatility for the observation error,
$[\epsilon_t \mid \{\sigma_s\}] \stackrel{indep}{\sim} N(0, \sigma_t^2)$.
To study various market sectors, we use weekly industry portfolio data
from the website of Kenneth R. French, which provide the value-weighted
return of stocks in the given industry. We focus on manufacturing (Manuf)
and healthcare (Hlth). For a given industry portfolio, the response variable
is the returns in excess of the risk free rate, yt = Rt − RF,t, with predictors
xt = (1, RM,t − RF,t, SMBt, HMLt, RMWt, CMAt, MOMt)′, defined as follows:
the market risk factor, RM,t − RF,t is the return on the market portfolio RM,t in
excess of the risk free rate RF,t; the size factor, SMB t (small minus big) is the dif-
ference in returns between portfolios of small and large market value stocks; the
value factor, HMLt (high minus low) is the difference in returns between portfo-
lios of high and low book-to-market value stocks; the profitability factor, RMW t
is the difference in returns between portfolios of robust and weak profitability
stocks; the investment factor, CMAt is the difference in returns between portfolios
of stocks of low and high investment firms; and the momentum factor, MOM t is
the difference in returns between portfolios of stocks with high and low prior re-
turns. These data are publicly available on Kenneth R. French’s website, which
provides additional details on the portfolios. We standardize all predictors and
the response to have unit variance.
In Figures 3.8 and 3.9, we plot the posterior expectation and credible bands
for the time-varying regression coefficients and observation error stochastic
volatility for the weekly manufacturing and healthcare industry data sets, re-
spectively, from 4/1/2007 - 4/1/2017 (T = 522). The 95% simultaneous credi-
ble bands (dark gray) indicate which coefficients are significantly different from
zero, and if so, at which times. For the manufacturing industry, the significant
factors are the market risk (RM,t − RF,t), investment (CMAt), and momentum
(MOM t), where both CMAt and MOM t are significantly time-varying (i.e., the
simultaneous credible bands contain no constant function). By comparison, an
ordinary linear regression does not find MOM t to be significant at the 5% level,
since the non-dynamic model ignores the fluctuations from 2008-2012, but does
identify the market risk, profitability (RMW t), and investment as significant
factors (see Appendix B for details).
For the healthcare industry, the significant factors are market risk, value
(HMLt), and profitability. By comparison, the ordinary linear regression identifies
these factors as well as size (SMBt) as significant at the 5% level (see
Appendix B for details). Notably, the only common factor significant in both
the manufacturing and healthcare industries under model (3.9) over this time
period is the market risk. This result suggests that the aggressive shrinkage
behavior of the dynamic shrinkage process is important in this setting, since
several factors may be effectively irrelevant for some or all time points.
[Figure 3.8 panel titles: Manuf: Intercept, Manuf: Mkt.RF, Manuf: SMB, Manuf: HML, Manuf: RMW, Manuf: CMA, Manuf: MOM, Manuf: SV; horizontal axis: 2008 to 2016]
Figure 3.8: Posterior expectations (cyan), 95% pointwise HPD credible intervals
(light gray) and 95% simultaneous credible bands (dark gray) for βj,t and σt
(bottom right) under the BTF-DHS model given by (3.9) and (3.10) for value-
weighted manufacturing industry returns. The solid black line is zero, the
dashed green line is the ordinary linear regression estimate, and the solid red
line indicates periods for which the 95% simultaneous credible bands do not
contain zero.
3.5 MCMC Sampling Algorithm and Computational Details
We design a Gibbs sampling algorithm for the dynamic shrinkage process. The
sampling algorithm is both computationally and MCMC efficient, and builds
upon two main components: (1) a stochastic volatility sampling algorithm
(Kastner and Frühwirth-Schnatter, 2014) augmented with a Pólya-Gamma sam-
pler (Polson et al., 2013); and (2) a Cholesky Factor Algorithm (CFA, Rue, 2001)
for sampling the state variables in the dynamic linear model. Importantly, both
components employ algorithms that are linear in the number of time points,
which produces a highly efficient sampling algorithm.
[Figure 3.9 panel titles: Hlth: Intercept, Hlth: Mkt.RF, Hlth: SMB, Hlth: HML, Hlth: RMW, Hlth: CMA, Hlth: MOM, Hlth: SV; horizontal axis: 2008 to 2016]
Figure 3.9: Posterior expectations (cyan), 95% pointwise HPD credible intervals
(light gray) and 95% simultaneous credible bands (dark gray) for βj,t and σt
(bottom right) under the BTF-DHS model given by (3.9) and (3.10) for value-
weighted healthcare industry returns. The solid black line is zero, the dashed
green line is the ordinary linear regression estimate, and the solid red line in-
dicates periods for which the 95% simultaneous credible bands do not contain
zero.
The general sampling algorithm is as follows: (1) sample the dynamic
shrinkage components (the log-volatilities {ht}, the Pólya-Gamma mixing pa-
rameters {ξt}, the unconditional mean of log-volatility µ, the AR(1) coefficient
of log-volatility φ, and the discrete mixture component indicators {st}); (2) sam-
ple the state variables {βt}; and (3) sample the observation error variance σ2 . We
provide details of the dynamic shrinkage process sampling algorithm in Section
3.5.1 and include the details for sampling steps (2) and (3) in Appendix B.
3.5.1 Efficient Sampling for the Dynamic Shrinkage Process
Consider the (univariate) dynamic shrinkage process in (3.2) with the Pólya-
Gamma parameter expansion of Theorem 3.4. We provide implementation de-
tails for the dynamic horseshoe process with α = β = 1/2, but extensions to
other cases are straightforward. The SV sampling framework of Kastner and
Frühwirth-Schnatter (2014) represents the likelihood for ht on the log-scale, and
approximates the ensuing $\log \chi_1^2$ distribution for the errors via a known discrete
mixture of Gaussian distributions. In particular, let $\tilde{y}_t = \log(\omega_t^2 + c)$, where $c$
is a small offset to avoid numerical issues. Conditional on the mixture component
indicators $s_t$, the likelihood is $\tilde{y}_t \stackrel{indep}{\sim} N(h_t + m_{s_t}, v_{s_t})$, where $m_i$ and
$v_i$, $i = 1, \ldots, 10$, are the pre-specified mean and variance components of the
10-component Gaussian mixture provided in Omori et al. (2007). The evolution
equation is $h_{t+1} = \mu + \phi(h_t - \mu) + \eta_t$ with initialization $h_1 = \mu + \eta_0$ and innovations
$[\eta_t \mid \xi_t] \stackrel{indep}{\sim} N(0, \xi_t^{-1})$ for $[\xi_t] \stackrel{iid}{\sim} PG(1, 0)$.
To sample $h = (h_1, \ldots, h_T)$ jointly, we directly compute the posterior distribution
of $h$ and exploit the tridiagonal structure of the resulting posterior
precision matrix. In particular, we equivalently have $\tilde{y} \sim N(m + \tilde{h} + \tilde{\mu}, \Sigma_v)$
and $D_\phi \tilde{h} \sim N(0, \Sigma_\xi)$, where $m = (m_{s_1}, \ldots, m_{s_T})'$, $\tilde{h} = (h_1 - \mu, \ldots, h_T - \mu)'$,
$\tilde{\mu} = (\mu, (1 - \phi)\mu, \ldots, (1 - \phi)\mu)'$, $\Sigma_v = \mathrm{diag}(\{v_{s_t}\}_{t=1}^{T})$, $\Sigma_\xi = \mathrm{diag}(\{\xi_t^{-1}\}_{t=1}^{T})$,
and $D_\phi$ is a lower triangular matrix with ones on the diagonal, $-\phi$ on the first
off-diagonal, and zeros elsewhere. We sample from the posterior distribution
of $h$ by sampling from the posterior distribution of $\tilde{h}$ and setting $h = \tilde{h} + \mu \mathbf{1}$
for $\mathbf{1}$ a $T$-dimensional vector of ones. The required posterior distribution is
$\tilde{h} \sim N(Q_{\tilde{h}}^{-1}\ell_{\tilde{h}}, Q_{\tilde{h}}^{-1})$, where $Q_{\tilde{h}} = \Sigma_v^{-1} + D_\phi'\Sigma_\xi^{-1}D_\phi$ is a tridiagonal symmetric
matrix with diagonal elements $d_0(Q_{\tilde{h}})$ and first off-diagonal elements $d_1(Q_{\tilde{h}})$
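defined as
$d_0(Q_{\tilde{h}}) = [(v_{s_1}^{-1} + \xi_1 + \phi^2\xi_2), (v_{s_2}^{-1} + \xi_2 + \phi^2\xi_3), \ldots, (v_{s_{T-1}}^{-1} + \xi_{T-1} + \phi^2\xi_T), (v_{s_T}^{-1} + \xi_T)]$,
$d_1(Q_{\tilde{h}}) = [(-\phi\xi_2), (-\phi\xi_3), \ldots, (-\phi\xi_T)]$, and
$\ell_{\tilde{h}} = \Sigma_v^{-1}(\tilde{y} - m - \tilde{\mu}) = \left[\frac{\tilde{y}_1 - m_{s_1} - \mu}{v_{s_1}}, \frac{\tilde{y}_2 - m_{s_2} - (1 - \phi)\mu}{v_{s_2}}, \ldots, \frac{\tilde{y}_T - m_{s_T} - (1 - \phi)\mu}{v_{s_T}}\right]'$.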
Drawing from this posterior distribution is straightforward and efficient, using
band back-substitution described in Kastner and Frühwirth-Schnatter (2014):
(1) compute the Cholesky decomposition $Q_{\tilde{h}} = LL'$, where $L$ is lower triangular;
(2) solve $La = \ell_{\tilde{h}}$ for $a$; and (3) solve $L'\tilde{h} = a + e$ for $\tilde{h}$, where $e \sim N(0, I_T)$.
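The three steps above can be sketched for a tridiagonal precision matrix as follows. This is a minimal illustration in plain Python, not the implementation used for the results in this chapter; the function names, the 0-based convention that `xi[t]` stores $\xi_{t+1}$, and the optional `noise` argument (which fixes $e$ for reproducibility) are assumptions made for illustration.

```python
import math
import random

def tridiag_precision(v, xi, phi):
    """Diagonal d0 and first off-diagonal d1 of Q = Sigma_v^{-1} + D' Sigma_xi^{-1} D."""
    T = len(v)
    d0 = [1.0 / v[t] + xi[t] + phi ** 2 * xi[t + 1] for t in range(T - 1)]
    d0.append(1.0 / v[T - 1] + xi[T - 1])
    d1 = [-phi * xi[t + 1] for t in range(T - 1)]
    return d0, d1

def sample_tridiag_gaussian(d0, d1, ell, noise=None, rng=random):
    """Draw from N(Q^{-1} ell, Q^{-1}) for tridiagonal Q in O(T) operations."""
    T = len(d0)
    if noise is None:
        noise = [rng.gauss(0.0, 1.0) for _ in range(T)]
    # Step (1): banded Cholesky Q = L L'; L has diagonal c0 and sub-diagonal c1.
    c0 = [0.0] * T
    c1 = [0.0] * (T - 1)
    c0[0] = math.sqrt(d0[0])
    for t in range(1, T):
        c1[t - 1] = d1[t - 1] / c0[t - 1]
        c0[t] = math.sqrt(d0[t] - c1[t - 1] ** 2)
    # Step (2): forward substitution, L a = ell.
    a = [0.0] * T
    a[0] = ell[0] / c0[0]
    for t in range(1, T):
        a[t] = (ell[t] - c1[t - 1] * a[t - 1]) / c0[t]
    # Step (3): back substitution, L' h = a + e.
    b = [a[t] + noise[t] for t in range(T)]
    h = [0.0] * T
    h[T - 1] = b[T - 1] / c0[T - 1]
    for t in range(T - 2, -1, -1):
        h[t] = (b[t] - c1[t] * h[t + 1]) / c0[t]
    return h
```

Setting `noise` to a vector of zeros returns the posterior mean $Q_{\tilde{h}}^{-1}\ell_{\tilde{h}}$, which gives a simple check of the solver. Every loop is a single pass over the time index, which is the source of the linear cost in T.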
Conditional on the log-volatilities {ht}, we sample the AR(1) evolution pa-
rameters: the log-innovation precisions {ξt}, the autoregressive coefficient φ,
and the unconditional mean µ. The precisions are distributed [ξt|ηt] ∼ PG(1, ηt)
for ηt = ht+1 − µ − φ(ht − µ), which we sample using the rpg() function
in the R package BayesLogit (Polson et al., 2013). The Pólya-Gamma sam-
pler is efficient: using only exponential and inverse-Gaussian draws, Polson
et al. (2013) construct an accept-reject sampler for which the probability of ac-
ceptance is uniformly bounded below at 0.99919, which does not require any
tuning. Next, we assume the prior [(φ + 1)/2] ∼ Beta(aφ, bφ), which restricts
|φ| < 1 for stationarity, and sample from the full conditional distribution of φ
using the slice sampler of Neal (2003). We select aφ = 10 and bφ = 2, which
places most of the mass for the density of φ in (0, 1) with a prior mean of 2/3
and a prior mode of 4/5 to reflect the likely presence of persistent volatility
clustering. The prior for the global scale parameter is $\tau \sim C^+(0, \sigma/\sqrt{T})$, which
implies $\mu = \log(\tau^2)$ is $[\mu \mid \sigma, \xi_\mu] \sim N(\log(\sigma^2/T), \xi_\mu^{-1})$ with $\xi_\mu \sim PG(1, 0)$. Including
the initialization $h_1 \sim N(\mu, \xi_0^{-1})$ with $\xi_0 \sim PG(1, 0)$, the posterior distribution
for $\mu$ is $\mu \sim N(Q_\mu^{-1}\ell_\mu, Q_\mu^{-1})$ with $Q_\mu = \xi_\mu + \xi_0 + (1 - \phi)^2\sum_{t=1}^{T-1}\xi_t$ and
$\ell_\mu = \xi_\mu \log(\sigma^2/T) + \xi_0 h_1 + (1 - \phi)\sum_{t=1}^{T-1}\xi_t(h_{t+1} - \phi h_t)$. Sampling $\xi_\mu$ and $\xi_0$ follows
the Pólya-Gamma sampling scheme above.
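Two quantities in this paragraph are easy to verify numerically: the prior mean and mode of φ implied by the Beta prior, and the Gaussian full conditional of µ. The sketch below is illustrative only; the function names and argument layout (`xi` holding $\xi_1, \ldots, \xi_{T-1}$) are assumptions, and the Pólya-Gamma draws themselves are not implemented here.

```python
import math

def phi_prior_mean_and_mode(a_phi, b_phi):
    """Mean and mode of phi when (phi + 1)/2 ~ Beta(a_phi, b_phi), with a_phi, b_phi > 1."""
    beta_mean = a_phi / (a_phi + b_phi)
    beta_mode = (a_phi - 1.0) / (a_phi + b_phi - 2.0)
    # phi = 2x - 1 is a linear map of the Beta variable x, so mean and mode map linearly.
    return 2.0 * beta_mean - 1.0, 2.0 * beta_mode - 1.0

def mu_full_conditional(h, xi, xi_0, xi_mu, phi, sigma2):
    """Mean and variance of the Gaussian full conditional N(Q^{-1} ell, Q^{-1}) of mu."""
    T = len(h)
    q_mu = xi_mu + xi_0 + (1.0 - phi) ** 2 * sum(xi)
    ell_mu = (xi_mu * math.log(sigma2 / T)
              + xi_0 * h[0]
              + (1.0 - phi) * sum(xi[t] * (h[t + 1] - phi * h[t])
                                  for t in range(T - 1)))
    return ell_mu / q_mu, 1.0 / q_mu
```

With `a_phi = 10` and `b_phi = 2`, the first function reproduces the prior mean of 2/3 and prior mode of 4/5 stated above.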
Finally, we sample the discrete mixture component indicators st. The dis-
crete mixture probabilities are straightforward to compute: the prior mixture
probabilities are the mixing proportions given by Omori et al. (2007) and the
likelihood is $\tilde{y}_t \stackrel{indep}{\sim} N(h_t + m_{s_t}, v_{s_t})$; see Kastner and Frühwirth-Schnatter (2014)
for details.
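The indicator update amounts to a draw from a discrete distribution with probabilities proportional to prior weight times Gaussian likelihood. The following sketch assumes generic mixture weights `q`, means `m`, and variances `v` supplied by the caller; the numerical values of the 10-component mixture of Omori et al. (2007) are deliberately not reproduced here, and the function names are hypothetical.

```python
import math
import random

def indicator_probabilities(y_tilde, h, q, m, v):
    """Posterior probabilities P(s_t = i | y_tilde, h), proportional to q_i N(y_tilde; h + m_i, v_i)."""
    log_w = [math.log(q[i]) - 0.5 * math.log(v[i])
             - 0.5 * (y_tilde - h - m[i]) ** 2 / v[i]
             for i in range(len(q))]
    shift = max(log_w)  # subtract the maximum before exponentiating, for stability
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    return [wi / total for wi in w]

def sample_indicator(probs, rng=random):
    """Draw one component index by inverting the cumulative probabilities."""
    u = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return i
    return len(probs) - 1  # guard against rounding in the cumulative sum
```

Working on the log scale avoids underflow when $\tilde{y}_t$ is far from every component mean, which can occur for very small $\omega_t^2$.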
3.6 Conclusions
Dynamic shrinkage processes provide a computationally convenient and
widely applicable mechanism for incorporating adaptive shrinkage and reg-
ularization into existing models. By extending a broad class of global-local
shrinkage priors to the dynamic setting, the resulting processes inherit the de-
sirable shrinkage behavior, but with greater time-localization. The success of
dynamic shrinkage processes suggests that other priors may benefit from log-
scale or other appropriate representations, with or without additional depen-
dence modeling.
As demonstrated in Sections 3.3 and 3.4, dynamic shrinkage processes are
particularly appropriate for dynamic linear models, including trend filtering
and time-varying parameter regression. In both settings, the dynamic linear
models with dynamic horseshoe innovations outperform all competitors in sim-
ulated data, and produce reasonable and interpretable results for real data ap-
plications. Dynamic shrinkage processes may be useful in other dynamic linear
models, such as incorporating seasonality or change points with appropriately-
defined (dynamic) shrinkage. Given the exceptional curve-fitting capabilities
of the Bayesian trend filtering model (3.8) with dynamic horseshoe innovations
(BTF-DHS), a natural extension would be to incorporate the BTF-DHS into more
general additive, functional, or longitudinal data models in order to capture ir-
regular or local curve features.
An important extension of the dynamic fundamental factor model of Section
3.4.2 is to incorporate a large number of assets, possibly with residual correla-
tion among stock returns beyond the common factors of FF-5. Building upon
Carvalho et al. (2011), a reasonable approach may be to combine a set of known
factors, such as the Fama-French factors, with a set of unknown factors to be
estimated from the data, where both sets of factor loadings are endowed with
dynamic shrinkage processes to provide greater adaptability yet sufficient ca-
pability for shrinkage of irrelevant factors.
CHAPTER 4
FUNCTIONAL AUTOREGRESSION FOR SPARSELY SAMPLED DATA
Portions of this chapter were published in Kowal et al. (2017).
4.1 Introduction
We develop a hierarchical Gaussian process model for forecasting and infer-
ence of functional time series data. A functional time series is a time-ordered
sequence of random functions, Y1, . . . , YT , on some compact index set T ⊂ RD,
typically with D = 1. Unlike existing methods, our approach is especially suited
for sparsely or irregularly sampled curves, in which the functions Yt(τ) are ob-
served at a small number of possibly unequally-spaced points τ ∈ T , and for
curves sampled with non-negligible measurement error, which occur frequently
in financial applications. Applications of functional time series are abundant,
including: daily or weekly interest rate curves as a function of time to matu-
rity, such as daily Eurodollar futures contracts (Kargin and Onatski, 2008) and
weekly yield curves (Hays et al., 2012; Kowal et al., 2016); yearly sea surface
temperature as a function of time-of-year (Besse et al., 2000); yearly mortality
and fertility rates as a function of age (Hyndman and Ullah, 2007); daily pollu-
tion curves as a function of time-of-day (Damon and Guillas, 2002; Aue et al.,
2015); and a vast collection of spatio-temporal applications in which a time-
dependent variable is measured as a function of spatial location (e.g., Cressie
and Wikle, 2011). The primary goal of functional time series analysis is usually
forecasting {Yt}, but we are also interested in performing inference and obtain-
ing an interpretable representation of the time evolution of {Yt}.
The most prevalent model for functional time series data is the functional
autoregressive model of order 1, written FAR(1):
$Y_t - \mu = \Psi(Y_{t-1} - \mu) + \epsilon_t$, (4.1)
where $Y_t \in L^2(\mathcal{T})$, $\Psi$ is a bounded linear operator on $L^2(\mathcal{T})$, $\epsilon_t \in L^2(\mathcal{T})$
is a sequence of independent mean zero random innovation functions with
$E\|\epsilon_t\|^2 < \infty$, and $\mu$ is the mean of $\{Y_t\}$ under stationarity. The FAR(1) model,
developed by Bosq (2000), is an extension of two highly successful models: the
functional linear model for function-on-function regression and the vector au-
toregressive model for multivariate time series, and has been successfully ap-
plied in a variety of applications. Importantly, the FAR(1) model provides a
mechanism for modeling the evolution of {Yt} jointly over the entirety of the
domain $\mathcal{T}$. More generally, (4.1) can be extended for multiple lags to the FAR(p)
model: $Y_t - \mu = \sum_{\ell=1}^{p} \Psi_\ell(Y_{t-\ell} - \mu) + \epsilon_t$.
Existing approaches for estimating the FAR(p) model typically use an eigen-
decomposition of the empirical (contemporaneous and lagged) covariance oper-
ators (Damon and Guillas, 2002, 2005; Horváth and Kokoszka, 2012; Kokoszka,
2012) or kernel-based procedures for modeling the conditional expectation
(Besse et al., 2000). A related approach is to estimate a multivariate time series
model for the functional principal component (FPC) scores of the observed data
(Aue et al., 2015). Extensions of the FAR(1) model for nonstationary functional
time series are available, such as the time-dependent FAR kernels proposed in
Chen and Li (2015).
In general, existing methods for FAR(p) are designed for functional data ob-
served on dense grids without measurement error, and typically require pre-
smoothing discretized functional observations. However, such procedures may
exhibit erratic behavior for sparse designs and are inappropriate in such set-
tings. More generally, under an FAR(p) model that includes measurement error
and discretization of the functional observations, we prove that the two most
common approaches for functional data analysis—estimators that are linear in
the FPC scores or the pre-smoothed observations—produce predictions that are
inadmissible (in a decision theory sense). Indeed, the presence of measurement
error fundamentally alters the behavior of the observable process: if an FAR
process is observed with measurement error, then the observable process is no
longer an FAR process, but rather a functional autoregressive moving average
process (see Proposition 4.1). Even under dense designs, existing methods pro-
duce poor estimates of the FAR operator Ψ (Didericksen et al., 2012), which
inhibits interpretability of the time evolution of {Yt}, and do not provide finite-
sample inference. We propose new methodology that simultaneously addresses
all of these challenges.
We propose a general two-level hierarchy for modeling functional time se-
ries: an observation equation addresses measurement error and discretization of
the functional data, while an evolution equation defines a process model for the
underlying functional time series. The latent process is dynamically modeled as
an FAR(p). We parsimoniously specify the FAR model with mean zero Gaussian
process innovations, which are fully specified by covariance functions with-
out parameterizing sample paths. The dynamic innovation process is further
specified by a dynamic functional factor model. In contrast with standard ap-
proaches for Gaussian processes, this avoids selecting and estimating a para-
metric covariance function, and allows greater computational stability and effi-
ciency, and broader applicability. Interpolating curves at unsampled locations
and forecasting future curves are primary objectives in functional time series
modeling; the proposed model produces optimal (best linear) predictions under
both sparse and dense designs in the presence of measurement error, even with
the Gaussian assumption relaxed. We propose an efficient Gibbs sampling algo-
rithm for estimation, inference, and forecasting. Extensive simulations demon-
strate substantial improvements in forecasting performance and recovery of the
autoregressive surface over competing methods, especially under sparse de-
signs.
We apply our methodology to model and forecast nominal and real yield
curves using daily U.S. data. For a given currency and level of risk of a debt,
the nominal yield curve, $Y_t^N(\tau)$, describes the interest rate at time t as a function
of the length of the borrowing period, or time to maturity, τ. Similarly, the
real yield curve, $Y_t^R(\tau)$, corresponds to an interest rate that is adjusted for inflation.
Both $Y_t^N$ and $Y_t^R$ may be modeled as functional time series. However,
real yields are sparsely observed for each time t, and only at longer maturities,
which is problematic for existing functional time series models. The proposed
methods provide a natural hierarchical framework for modeling both nominal
yield curves and real yield curves, and in both cases produce highly competitive
forecasts.
Bayesian methods for functional time series are limited, with the excep-
tion of Laurini (2014) and Kowal et al. (2016). The primary contributions of
this article are the following: (i) development of a hierarchical framework for
FAR(p) (Section 4.2), which produces optimal (best linear) predictions under
both sparse and dense designs in the presence of measurement error; (ii) a dy-
namic functional factor model for the innovation covariance, which is nonpara-
metric, computationally convenient, and offers useful generalizations to non-
Gaussian distributions (Section 4.3); (iii) a procedure for model averaging over
the lag, p, within a hierarchical FAR(p) model (Section 4.4); (iv) comparisons of
the proposed methods to existing methods for FAR(p) using theoretical results
(Section 4.5), an extensive simulation study (Section 4.6), and a real data appli-
cation (Section 4.7); (v) a comparative forecasting study of daily U.S. nominal
and real yield curve data (Section 4.7); and (vi) an efficient Gibbs sampling al-
gorithm, which uses common full conditional distributions and existing R soft-
ware (Appendix C). Details of our Gibbs sampling algorithm and additional
theoretical and simulation results are in Appendix C.
4.2 Hierarchical Gaussian Processes for FAR
Let Y1, . . . , YT be a time-ordered sequence of random functions in L2(T ), where
T ⊂ RD is a compact index set. We focus on D = 1 with T = [0, 1], but the
methods can be developed more generally. For interpretability and computational
convenience, we restrict our attention to the integral operators defined by
$\Psi_\ell(Y)(\tau) = \int \psi_\ell(\tau, u) Y(u)\,du$, so the FAR(p) model is
$Y_t(\tau) - \mu(\tau) = \sum_{\ell=1}^{p} \int \psi_\ell(\tau, u)\{Y_{t-\ell}(u) - \mu(u)\}\,du + \epsilon_t(\tau)$ $\forall \tau \in \mathcal{T}$. (4.2)
Using integral operators, the FAR(p) model resembles the functional linear
model, in which (Yt − µ) is regressed on (Yt−1 − µ), . . . , (Yt−p − µ). The func-
tional linear model is widely popular in functional data analysis, and has been
extensively studied (e.g., Cardot et al., 1999; Ramsay, 2006).
In practice, model (4.2) is incomplete: the functional observations {Yt} are
not observed directly, but rather via discrete samples of each curve, and typi-
cally with measurement error. Suppose that we observe yi,t ∈ R sampled with
noise νi,t from Yt ∈ L2(T ):
yi,t = Yt(τi,t) + νi,t (4.3)
for i = 1, . . . ,mt, where τ1,t, . . . , τmt,t are the observation points of Yt and νi,t is
a mean zero measurement error with finite variance. Typically for functional
data, mt will be large and Tt = {τ1,t, . . . , τmt,t} will be dense in T . However, for
our procedures, we allow mt to be small for some (or all) t, with observation
points $\mathcal{T}_o \equiv \cup_t \mathcal{T}_t$ dense or sparse in $\mathcal{T}$. Combining (4.3) with (4.2) for p = 1 and
defining $\mu_t \equiv Y_t - \mu$, we obtain the two-level hierarchical model
$y_{i,t} = \mu(\tau_{i,t}) + \mu_t(\tau_{i,t}) + \nu_{i,t}$, $i = 1, \ldots, m_t$,
$\mu_t(\tau) = \int \psi(\tau, u)\mu_{t-1}(u)\,du + \epsilon_t(\tau)$, $\forall \tau \in \mathcal{T}$ (4.4)
for $t = 2, \ldots, T$, where we assume that $\{\nu_{i,t}\}$ and $\{\epsilon_t\}$ are mutually independent
sequences.
The measurement error is a nontrivial component of model (4.4), which we
demonstrate in the following proposition:
Proposition 4.1. Let $Y_t - \mu = \sum_{\ell=1}^{p} \Psi_\ell(Y_{t-\ell} - \mu) + \epsilon_t$, and suppose that we observe
$y_t = Y_t + \nu_t$, where $\{\epsilon_t\}$ and $\{\nu_t\}$ are independent white noise processes. Then the observable
process $\{y_t\}$ follows a functional autoregressive moving average (FARMA) process
of order (p, p).
We define a FARMA process and prove Proposition 4.1 in Section C.4.1 of
Appendix C. The implication of Proposition 4.1 is that, if the true model for
Yt is FAR(p), yet Yt is observed with error, then the FAR(p) model for the ob-
servables is inappropriate. As a result, estimation of Ψ` will be inefficient and
forecasting will deteriorate, due to both increased estimation error of Ψ` and
model misspecification. By comparison, the hierarchical model decomposes the
observed data into a functional (autoregressive) process and measurement er-
ror, and in doing so circumvents the model misspecification issues implied by
Proposition 4.1.
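To make the implication of Proposition 4.1 concrete, consider the scalar analog: a mean zero AR(1) process observed with white noise. This is an illustrative simplification and not the functional argument of Appendix C; the symbols $a$, $\sigma_\epsilon^2$, $\sigma_\nu^2$, and $w_t$ are introduced here for illustration only.

```latex
\begin{align*}
Y_t &= a Y_{t-1} + \epsilon_t, \qquad y_t = Y_t + \nu_t, \\
y_t - a y_{t-1} &= (Y_t - a Y_{t-1}) + \nu_t - a \nu_{t-1}
                 = \epsilon_t + \nu_t - a \nu_{t-1} \equiv w_t, \\
\mathrm{Var}(w_t) &= \sigma_\epsilon^2 + (1 + a^2)\sigma_\nu^2, \qquad
\mathrm{Cov}(w_t, w_{t-1}) = -a\sigma_\nu^2, \qquad
\mathrm{Cov}(w_t, w_{t-k}) = 0 \ \text{ for } k \ge 2.
\end{align*}
```

Because the autocovariance of $w_t$ is nonzero at lag 1 and vanishes beyond it, $w_t$ admits an MA(1) representation, so the observed series is ARMA(1,1) rather than AR(1) whenever $a \neq 0$ and $\sigma_\nu^2 > 0$. This matches the order (p, p) in the proposition with p = 1.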
We model the random functions $\mu$, $\psi$, and $\{\epsilon_t\}$ as Gaussian processes: $\mu \sim GP(0, K_\mu)$,
$\psi \sim GP(0, K_\psi)$, and $\epsilon_t \stackrel{indep}{\sim} GP(0, K_\epsilon)$, where the notation $GP(m, K)$
denotes a Gaussian process with mean function m and covariance function
K. Gaussian processes have a long history in machine learning (Rasmussen
and Williams, 2006) and spatial statistics (Cressie and Wikle, 2011), and have
seen increased application in functional data analysis, especially for hierar-
chical modeling (Behseta et al., 2005; Kaufman and Sain, 2010; Shi and Choi,
2011; Earls and Hooker, 2014). The conditional distribution of $\mu_t = Y_t - \mu$ is
$[\mu_t \mid \mu_{t-1}, \psi, K_\epsilon] \sim GP(\int \psi(\cdot, u)\mu_{t-1}(u)\,du, K_\epsilon)$, which models the evolution of $\mu_t$
and serves as the prior distribution for the observation level of (4.4). Notably,
the model only requires conditionally Gaussian processes, and therefore may ac-
commodate more general distributional assumptions, such as scale-mixtures of
Gaussian distributions and stochastic volatility. Moreover, the posterior expec-
tations derived from the hierarchical Gaussian process model are best linear
predictors, and therefore are optimal among linear predictors for interpolation
and forecasting of Yt, even for non-Gaussian distributions (see Section 4.5). We
assume $\nu_{i,t} \stackrel{iid}{\sim} N(0, \sigma_\nu^2)$ for the measurement errors; priors for $\sigma_\nu^2$ and the parameters
associated with $K_\mu$, $K_\epsilon$, and $K_\psi$ will be discussed later.
4.2.1 Dynamic Linear Models for FAR(p)
For practical implementation of model (4.4), we must select a finite set of evaluation
points, $\mathcal{T}_e \equiv \{\tau_1, \ldots, \tau_M\} \subset \mathcal{T}$, at which we wish to estimate, forecast,
or perform inference on the random functions, in particular $\mu_t = Y_t - \mu$.
Naturally, we assume that $\mathcal{T}_t \subseteq \mathcal{T}_e$ for all t, but this assumption may be relaxed.
Notably, $\mathcal{T}_e$ provides a convenient structure for forecasting and inference
of $y_{i,t}$ and $Y_t(\tau_{i,t})$ at the observation points $\tau_{i,t} \in \mathcal{T}_t$, as well as interpolation
of $Y_t$ at any unobserved points, $\tau^* \in \mathcal{T}_e \setminus \mathcal{T}_o$. By definition, for any
Gaussian process $x \sim GP(m, K)$ defined on $\mathcal{T}$, we have $\mathbf{x} \sim N(\mathbf{m}, \mathbf{K})$, where
$\mathbf{x} = (x(\tau_1), \ldots, x(\tau_M))'$, $\mathbf{m} = (m(\tau_1), \ldots, m(\tau_M))'$, and $\mathbf{K} = \{K(\tau_i, \tau_k)\}_{i,k=1}^{M}$. This
result is particularly useful for constructing an estimation procedure and deriving
the optimality results of Section 4.5.
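By selecting M large and $\mathcal{T}_e$ dense in $\mathcal{T}$, we can accurately approximate the
integral in (4.4) using quadrature methods:
$\int \psi(\tau, u)\mu_{t-1}(u)\,du \approx (\psi(\tau, \tau_1), \ldots, \psi(\tau, \tau_M))\,\mathbf{Q}\boldsymbol{\mu}_{t-1}$, (4.5)
where $\mathbf{Q}$ is a known quadrature weight matrix and $\boldsymbol{\mu}_{t-1} = (\mu_{t-1}(\tau_1), \ldots, \mu_{t-1}(\tau_M))'$.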
The approximation in (4.5) is important for computational tractability in estima-
tion of both µt and ψ. Practical implementations of functional data methods re-
quire discretization or finite approximations; the quadrature approximation in
(4.5) is a natural approach, and does not impose restrictive assumptions on the
functional forms of ψ and µt−1. In addition, our simulation analysis suggests
that the quadrature approximation does not noticeably inhibit estimation or
forecasting, especially relative to existing FAR methods. In practice, the trape-
zoidal rule for computing Q works well, and for simulated data M = 20 is
sufficiently large. We include a sensitivity analysis in Appendix C to assess the
effects of M on the approximation error in (4.5), which supports this choice of
M .
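As an illustration of the quadrature approximation in (4.5), the sketch below builds trapezoidal weights on a possibly unequally spaced grid and applies them to a kernel. It assumes that Q is the diagonal matrix of these weights, which is one natural reading of "the trapezoidal rule for computing Q"; the function names and the callable `psi` are hypothetical.

```python
def trapezoid_weights(tau):
    """Trapezoidal quadrature weights for an increasing grid tau_1 < ... < tau_M."""
    M = len(tau)
    w = [0.0] * M
    for k in range(M - 1):
        half_width = 0.5 * (tau[k + 1] - tau[k])
        w[k] += half_width       # each interval contributes half its width
        w[k + 1] += half_width   # to both of its endpoints
    return w

def apply_kernel(psi, tau, mu_prev):
    """Approximate the integral of psi(tau_i, u) * mu_prev(u) du at every tau_i, as in (4.5)."""
    w = trapezoid_weights(tau)
    return [sum(psi(t, u) * wk * mk for u, wk, mk in zip(tau, w, mu_prev))
            for t in tau]
```

For example, with `tau = [k / 19 for k in range(20)]` (M = 20 equally spaced points on [0, 1]), the weights sum to one, and the rule is exact whenever the integrand is linear in u.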
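Assuming $\mathcal{T}_o \subseteq \mathcal{T}_e$, let $\mathbf{Z}_t$ be the $m_t \times M$ incidence matrix that identifies the
observation points observed at time t, i.e., $(\tau_{1,t}, \ldots, \tau_{m_t,t})' = \mathbf{Z}_t(\tau_1, \ldots, \tau_M)'$. We
can write the hierarchical model (4.4) as a dynamic linear model (DLM; West and
Harrison, 1997) in $\boldsymbol{\mu}_t$:
$\mathbf{y}_t = \mathbf{Z}_t\boldsymbol{\mu} + \mathbf{Z}_t\boldsymbol{\mu}_t + \boldsymbol{\nu}_t$, $[\boldsymbol{\nu}_t \mid \sigma_\nu^2] \stackrel{indep}{\sim} N(\mathbf{0}, \sigma_\nu^2\mathbf{I}_{m_t})$ for $t = 1, \ldots, T$,
$\boldsymbol{\mu}_t = \boldsymbol{\Psi}\mathbf{Q}\boldsymbol{\mu}_{t-1} + \boldsymbol{\epsilon}_t$, $[\boldsymbol{\epsilon}_t \mid \mathbf{K}_\epsilon] \stackrel{indep}{\sim} N(\mathbf{0}, \mathbf{K}_\epsilon)$ for $t = 2, \ldots, T$, (4.6)
$\boldsymbol{\mu}_1 \sim N(\mathbf{0}, \mathbf{K}_\epsilon)$,
where $\mathbf{y}_t = (y_{1,t}, \ldots, y_{m_t,t})'$, $\boldsymbol{\mu} = (\mu(\tau_1), \ldots, \mu(\tau_M))'$, $\boldsymbol{\Psi} = \{\psi(\tau_i, \tau_k)\}_{i,k=1}^{M}$, and
$\mathbf{K}_\epsilon = \{K_\epsilon(\tau_i, \tau_k)\}_{i,k=1}^{M}$. Model (4.6) can be extended for multiple lags to the
FAR(p) model by replacing the second level with $\boldsymbol{\mu}_t = \sum_{\ell=1}^{p} \boldsymbol{\Psi}_\ell\mathbf{Q}\boldsymbol{\mu}_{t-\ell} + \boldsymbol{\epsilon}_t$ for
$\boldsymbol{\Psi}_\ell = \{\psi_\ell(\tau_i, \tau_k)\}_{i,k=1}^{M}$. The DLM formulation of the FAR(p) is useful for MCMC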
sampling, since efficient samplers exist for the vector-valued state variables,
{µt} (e.g., Durbin and Koopman, 2002). The proposed Gibbs sampling algo-
rithm for model (4.6) (see Appendix C) is a moderate extension of traditional
DLM samplers, and iteratively samples the state vectors {µt}, the measurement
error variance $\sigma_\nu^2$, the innovation covariance $\mathbf{K}_\epsilon$, and the unknown evolution
matrix $\boldsymbol{\Psi}$. The DLM also facilitates non-Bayesian parameter estimation and
forecasting, such as an EM algorithm for the latent state variables $\{\boldsymbol{\mu}_t\}$ with
the parameters $\{\sigma_\nu^2, \mathbf{K}_\epsilon, \boldsymbol{\Psi}\}$ (e.g., Cressie and Wikle, 2011).
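The incidence matrix in the observation equation of (4.6) simply selects, at each time t, the evaluation points that were actually observed. A minimal sketch follows, assuming the observation points coincide exactly with entries of the evaluation grid; the function names are hypothetical.

```python
def incidence_matrix(obs_points, eval_points):
    """m_t x M matrix with a single 1 per row marking the observed evaluation point."""
    column = {tau: k for k, tau in enumerate(eval_points)}
    Z = []
    for tau in obs_points:
        row = [0] * len(eval_points)
        row[column[tau]] = 1  # raises KeyError if tau is not an evaluation point
        Z.append(row)
    return Z

def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]
```

Applying `matvec(Z, values)` to a length-M vector of function values returns the $m_t$ values at the observed points, so sparse or irregular designs only change the number of rows of $\mathbf{Z}_t$ from one time point to the next.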
The connection between the hierarchical FAR model (4.4) and the DLM (4.6)
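is further illuminated by considering the autocovariance properties of the respective
models. Recalling $\mu_t(\tau) = Y_t(\tau) - \mu(\tau)$, let $C_\ell(\tau_1, \tau_2) = E[\mu_t(\tau_1)\mu_{t-\ell}(\tau_2)]$
be the lag-$\ell$ autocovariance function of $\{Y_t\}$, which is time-invariant under stationarity
of $\{Y_t\}$. Under model (4.4) and assuming stationarity of $\{Y_t\}$, the
lag-1 autocovariance function is equivalently $C_1(\tau_1, \tau_2) = E[\mu_t(\tau_1)\mu_{t-1}(\tau_2)] =
E[\{\int \psi(\tau_1, u)\mu_{t-1}(u)\,du + \epsilon_t(\tau_1)\}\mu_{t-1}(\tau_2)] = \int \psi(\tau_1, u)C_0(u, \tau_2)\,du$. For $\ell \ge 1$,
we have the more general recursion $C_\ell(\tau_1, \tau_2) = \int \psi(\tau_1, u)C_{\ell-1}(u, \tau_2)\,du$, from
which it is clear that each $C_\ell$ is completely determined by the pair $(\psi, C_0)$.
Now let $\mathbf{C}_\ell = E[\boldsymbol{\mu}_t\boldsymbol{\mu}_{t-\ell}']$ be the lag-$\ell$ autocovariance matrix for the vector-valued
time series $\{\boldsymbol{\mu}_t\}$ in (4.6). Under stationarity of $\{\boldsymbol{\mu}_t\}$, the lag-1 autocovariance
matrix of $\boldsymbol{\mu}_t$ is $\mathbf{C}_1 = E[\boldsymbol{\mu}_t\boldsymbol{\mu}_{t-1}'] = E[\{\boldsymbol{\Psi}\mathbf{Q}\boldsymbol{\mu}_{t-1} + \boldsymbol{\epsilon}_t\}\boldsymbol{\mu}_{t-1}'] = \boldsymbol{\Psi}\mathbf{Q}\mathbf{C}_0$.
Notably, the relationship $\mathbf{C}_1 = \boldsymbol{\Psi}\mathbf{Q}\mathbf{C}_0$ is an approximation to the continuous
version, $C_1(\tau_1, \tau_2) = \int \psi(\tau_1, u)C_0(u, \tau_2)\,du$, using the same quadrature approximation
as in (4.5). More generally, the matrix recursion $\mathbf{C}_\ell = \boldsymbol{\Psi}\mathbf{Q}\mathbf{C}_{\ell-1}$
is a quadrature-based approximation to the continuous recursion, $C_\ell(\tau_1, \tau_2) =
\int \psi(\tau_1, u)C_{\ell-1}(u, \tau_2)\,du$ for $\ell \ge 1$. Therefore, the evolution matrix $\boldsymbol{\Psi}\mathbf{Q}$ in the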
DLM (4.6) induces a discrete approximation to the autocovariance structure in
the hierarchical FAR model (4.4).
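The matrix recursion can be checked numerically on a small example. The sketch below is illustrative only: `A` is a hypothetical stand-in for the evolution matrix ΨQ, `K` for the innovation covariance, and the stationary lag-0 covariance is obtained by iterating the standard VAR(1) relation $\mathbf{C}_0 = \mathbf{A}\mathbf{C}_0\mathbf{A}' + \mathbf{K}$, which is assumed here to converge (that is, the process is assumed stationary).

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*A)]

def stationary_cov(A, K, n_iter=500):
    """Iterate C0 <- A C0 A' + K; converges when the VAR(1) with matrix A is stationary."""
    C = [row[:] for row in K]
    for _ in range(n_iter):
        ACA = matmul(matmul(A, C), transpose(A))
        C = [[ACA[i][j] + K[i][j] for j in range(len(K))] for i in range(len(K))]
    return C

def autocov(A, C0, lag):
    """Lag-l autocovariance from the recursion C_l = A C_{l-1}, starting at C_0."""
    C = C0
    for _ in range(lag):
        C = matmul(A, C)
    return C
```

In the scalar case `A = [[0.5]]`, `K = [[1.0]]`, the iteration gives $C_0 = 4/3$, and the recursion then yields $C_1 = 2/3$ and $C_2 = 1/3$, so every lag is determined by the pair $(\mathbf{A}, \mathbf{C}_0)$.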
The evolution equation of (4.6) resembles a VAR(1) on $\boldsymbol{\mu}_t = (\mu_t(\tau_1), \ldots, \mu_t(\tau_M))'$,
but differs from a standard VAR on yt for a few critical reasons. First, fitting a
VAR to yt is only well-defined if both the dimension mt and the observation
points Tt are fixed over time. If this does not hold, then imputation is neces-
sary. Our procedure imputes automatically and optimally using the conditional
mean function and the conditional covariance function of the corresponding
Gaussian process. Second, the components of yt are likely highly correlated
due to the functional nature of the observations. Strong collinearity in VARs
can cause overfitting and adversely affect forecasting and inference. In our
model, the kernel function ψ is regularized using a smoothness prior (see Sec-
tion 4.4), which mitigates the adverse effects of collinearity on estimation of ψ.
The smoothness prior on ψ is a nonstandard regularization technique for VARs,
but is appropriate in this setting. Finally, the quadrature matrix, Q, is absorbed
into the VAR coefficient matrix ΨQ, and reweights the vector µt−1 using infor-
mation from the evaluation points Te. This reweighting incorporates not only
the vector values µt, but also the information that the components of µt corre-
spond to ordered elements of Te, which need not be equally spaced. The simu-
lations of Section 4.6 demonstrate the substantial improvements in forecasting
of our procedure relative to a VAR on yt.
4.3 A Dynamic Functional Factor Model for the Innovation Process
The standard approach for Gaussian process models is to select a parametric
covariance function that only depends on a few parameters, and then estimate
those parameters using either fully Bayesian methods or empirical Bayes (Ras-
mussen and Williams, 2006). The choice of the covariance function determines
the properties of the sample trajectories, such as smoothness and periodicity, but
notably does not imply a parametric form for the sample trajectories. Indeed,
the FAR(1) model (4.6) may be estimated using these standard approaches; we
provide one implementation in Section 4.6.
However, there are substantial computational limitations that accompany
standard parametric covariance functions. Even when the covariance function
is known up to some parameters ρ, in general we cannot directly sample from
the full conditional posterior distribution for ρ. As a result, posterior sampling
for ρ can be inefficient. Gaussian processes also require computation of the
M × M innovation covariance matrix K, which must be inverted—both for
evaluating the conditional likelihood of ρ and for sampling {µt} and ψ. Most
common choices for parametric covariance functions do not offer any simpli-
fying structure for computing this inverse, which may be computationally in-
efficient and unstable. In addition, extensions for time-dependent covariance
functions or non-Gaussian distributions are not readily available, and further
increase the difficulties with posterior sampling.
We propose a low-rank, fully nonparametric approach for modeling the in-
novation covariance function. Using the functional dynamic linear model (FDLM)
of Kowal et al. (2016), we estimate the unknown covariance function using a
functional factor model, which does not require specification of a parametric
form for the covariance function. This method avoids the need for inversion
of the full M ×M covariance matrix, and is more computationally stable and
efficient. The integration of the FDLM into (4.6) retains the fully Bayesian hi-
erarchical structure, and permits joint inference for all parameters via an effi-
cient MCMC sampling algorithm. A functional factor model is most appropriate
because εt is a Gaussian process with covariance function K, so K must be
well-defined on T × T . Notably, the FDLM offers convenient generalizations
for stochastic volatility models (Kim et al., 1998) and more robust models using
scale-mixtures of Gaussian distributions (Fernandez and Steel, 2000).
The FDLM decomposes the innovations εt into factor loading curves (FLCs),
φj ∈ L2(T ), and time-dependent factors, ej,t ∈ R, for j = 1, . . . , J:
$$\epsilon_t(\tau) = \sum_{j=1}^{J} e_{j,t}\,\phi_j(\tau) + \eta_t(\tau) \quad \forall \tau \in \mathcal{T}, \tag{4.7}$$
where J is the number of factors and {ηt} is the mean zero approximation error
with $\eta_t \overset{iid}{\sim} \mathrm{GP}(0, K_\eta)$, where $K_\eta(\tau, u) = \sigma_\eta^2 \mathbb{1}(\tau = u)$ and $\mathbb{1}(\cdot)$ is the indicator
function. We model each FLC φj as a smooth function admitting the basis ex-
pansion φj(τ) = b′φ(τ)ξj , where bφ is a Jφ-dimensional vector of known basis
functions and ξj is an unknown vector of coefficients. For superior MCMC per-
formance, we prefer the low-rank thin plate spline basis for bφ (e.g., Crainiceanu
et al., 2005) with knot locations selected using the quantiles of the observa-
tion points, To. We place a smoothness prior on each ξj , which is expressed
via a conditionally conjugate Gaussian distribution and is convenient for effi-
cient posterior sampling (see the Appendix). The smoothness assumption typ-
ically produces more interpretable FLCs {φj} and can improve estimation for
unobserved points τ∗ ∉ To. For the factors et = (e1,t, . . . , eJ,t)′, we assume
$[e_t \mid \Sigma_e] \overset{indep}{\sim} N(0, \Sigma_e)$, with $\Sigma_e = \mathrm{diag}(\{\sigma_j^2\}_{j=1}^J)$ for simplicity. By comparison,
the factors in Kowal et al. (2016) are time-dependent; we assume independence
to obtain a special case of the FDLM in which the implied innovation
process {εt} is an independent sequence, which also improves computational
efficiency of the FDLM sampling algorithm. Importantly, we obtain a nonpara-
metric, low-rank approximation to the innovation covariance, K, with useful
computational simplifications.
For identifiability, we order the factors according to variability of εt explained,
$\sigma_1^2 > \sigma_2^2 > \cdots > \sigma_J^2 > 0$, and require orthonormality of the FLCs. It
is computationally convenient to enforce the discrete orthonormality constraint
Φ′Φ = IJ , where Φ = BφΞ is the M × J matrix of FLCs evaluated at Te,
$B_\phi = (b_\phi(\tau_1), \ldots, b_\phi(\tau_M))'$ is the M × Jφ matrix of basis functions evaluated at
Te, and Ξ = (ξ1, . . . , ξJ ) is the Jφ × J matrix of unknown FLC basis coefficients.
The implied covariance matrix for $\epsilon_t = (\epsilon_t(\tau_1), \ldots, \epsilon_t(\tau_M))'$ under (4.7) is
$K = \Phi \Sigma_e \Phi' + \sigma_\eta^2 I_M$, conditional on $\{\phi_j, \sigma_j^2\}$ and $\sigma_\eta^2$. Importantly, the discretized
orthonormality constraint offers a substantial simplification for computing the
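inverse of K using the Woodbury identity:
$$K^{-1} = \sigma_\eta^{-2} I_M - \sigma_\eta^{-2}\, \Phi \tilde{\Sigma}_e \Phi', \tag{4.8}$$
where $\tilde{\Sigma}_e = \sigma_\eta^{-2} \left(\Sigma_e^{-1} + \sigma_\eta^{-2} \Phi'\Phi\right)^{-1} = \mathrm{diag}\left(\{\sigma_j^2/(\sigma_\eta^2 + \sigma_j^2)\}_{j=1}^J\right)$. As a result,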
K−1 may be computed without any matrix inversions. By comparison, para-
metric covariance functions not only fail to offer computational simplifications
for K−1, but also require additional computations of K−1 in the estimation of
the covariance function parameters, ρ. The FDLM sampling algorithm for the
factors {ej,t}, the FLCs {φj}, and the variances {σ2j} and σ2η is computationally
inexpensive and MCMC efficient. Note that the approximation error is a nontrivial
addition to model (4.7): ηt is necessary for nondegeneracy of K, which
is invertible only when $\sigma_\eta^2 > 0$. And while $\sigma_\eta^2 > 0$ implies that the innovations
εt, and therefore µt, are not smooth, we find that in practice, the sample paths
of εt and µt do appear smooth for sufficiently small $\sigma_\eta^2$. Generalizations to
non-nugget approximation error variance functions $K_\eta(\tau, u) = \sigma_\eta^2(\tau)\mathbb{1}(\tau = u)$ for
$\sigma_\eta^2 : \mathcal{T} \to \mathbb{R}^+$ are available, but may introduce additional model complexity and
computational costs.
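The simplification in (4.8) can be checked numerically with a small sketch. The orthonormal columns below are built from cosine vectors purely for illustration; they stand in for FLCs and are not the spline-based FLCs of the model. All function names are hypothetical.

```python
import math


def orthonormal_columns(M, J):
    """Return an M x J matrix (list of rows) with Phi'Phi = I_J.

    Illustrative stand-in for the FLC matrix: Gram-Schmidt on cosine vectors.
    """
    cols = []
    for j in range(J):
        v = [math.cos(math.pi * j * (i + 0.5) / M) for i in range(M)]
        for c in cols:
            proj = sum(a * b for a, b in zip(v, c))
            v = [a - proj * b for a, b in zip(v, c)]
        norm = math.sqrt(sum(a * a for a in v))
        cols.append([a / norm for a in v])
    return [[cols[j][i] for j in range(J)] for i in range(M)]


def fdlm_cov(Phi, sig2, sig2_eta):
    """K = Phi diag(sig2) Phi' + sig2_eta * I_M."""
    M = len(Phi)
    return [
        [
            sum(Phi[i][j] * s * Phi[k][j] for j, s in enumerate(sig2))
            + (sig2_eta if i == k else 0.0)
            for k in range(M)
        ]
        for i in range(M)
    ]


def fdlm_cov_inverse(Phi, sig2, sig2_eta):
    """K^{-1} from (4.8): only scalar divisions, no matrix inversion."""
    M = len(Phi)
    tilde = [s / (sig2_eta + s) for s in sig2]   # diagonal of Sigma-tilde_e
    return [
        [
            ((1.0 if i == k else 0.0)
             - sum(Phi[i][j] * t * Phi[k][j] for j, t in enumerate(tilde)))
            / sig2_eta
            for k in range(M)
        ]
        for i in range(M)
    ]


Phi = orthonormal_columns(6, 2)
K = fdlm_cov(Phi, [2.0, 0.5], 0.1)
K_inv = fdlm_cov_inverse(Phi, [2.0, 0.5], 0.1)
```

Multiplying `K` by `K_inv` recovers the identity up to floating-point error, and the cost of forming `K_inv` is dominated by the M × J products rather than by an M × M factorization.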
An important application of the FDLM simplification in (4.8) is given in Theorem 4.2,
in which we derive a computationally convenient form for estimating
the out-of-sample posterior distribution $[\mu_t(\tau^*) \mid \{y_r\}_{r=1}^s]$ for $\tau^* \notin \mathcal{T}_e$, which includes
as special cases the forecasting distribution (s < t), the filtering distribution
(s = t), and the smoothing distribution (s > t).
4.4 Modeling the FAR Kernel
An accurate predictor of ψ is important not only for forecasting and inference,
but also for interpreting the time evolution of {Yt}. The likelihood for ψ is speci-
fied by the evolution equation in model (4.6), which may be extended for multi-
ple lags. We select a Gaussian process prior for ψ, which encourages smoothness
of the surface and produces more interpretable results. Using the basis approximation
$\psi_\ell(\tau, u) = b_0'(\tau, u)\theta_{\psi_\ell}$, we place a Gaussian prior on $\theta_{\psi_\ell}$, which induces a
Gaussian process prior for $\psi_\ell$. A tensor product basis $b_0'(\tau, u) = (b_\psi'(u) \otimes b_\psi'(\tau))$
for bψ a Jψ-dimensional vector of B-spline basis functions is computationally
efficient in our setting, especially for large M . The details are presented in the
Appendix. Since Jψ < M , the evolution matrix ΨQ in (4.6) has $J_\psi^2 < M^2$ unknown
parameters, so the evolution equation in the DLM (4.6) has fewer parameters
than a standard VAR(1) on µt. Notably, the posterior distribution for
$\psi_\ell$ depends on $K^{-1}$, which is computationally unstable for many common parametric
covariance functions. By comparison, the nonparametric FDLM estimate
of K−1 in (4.8) is computationally stable, which further stabilizes estimates of
$\psi_\ell$.
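The parameter count for the tensor product basis can be seen from a short worked identity. The symbols $\Theta_\ell$ (a Jψ × Jψ coefficient matrix whose vectorization is the coefficient vector) and $B_\psi$ (the M × Jψ matrix of bψ evaluated at Te, by analogy with Bφ) are notation assumed here for illustration and are not taken from the source. The step uses only the standard identity vec(AXB) = (B′ ⊗ A) vec(X).

```latex
\begin{aligned}
\psi_\ell(\tau, u)
  &= \bigl(b_\psi'(u) \otimes b_\psi'(\tau)\bigr)\,\mathrm{vec}(\Theta_\ell)
   = b_\psi'(\tau)\,\Theta_\ell\, b_\psi(u), \\
\Psi_\ell
  &= \bigl[\psi_\ell(\tau_i, \tau_k)\bigr]_{i,k=1}^{M}
   = B_\psi\, \Theta_\ell\, B_\psi' .
\end{aligned}
```

Under this reading, the M × M matrix of kernel evaluations is determined by the $J_\psi^2$ entries of $\Theta_\ell$, and it can be formed from two small matrix products without building the $M^2 \times J_\psi^2$ tensor product design matrix explicitly.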
An important choice in the FAR(p) model is the maximum lag, p: a poor
choice of p can produce suboptimal forecasts and reduce MCMC efficiency. A
reasonable approach is to compare the DIC or marginal likelihoods for different
choices of p. However, this requires recomputing the model for each choice of
p, which can be computationally intensive. Similarly, Kokoszka and Reimherr
(2013) propose a multistage hypothesis testing procedure based on asymptotic
approximations and an FPC decomposition, but would require modification for
the hierarchical Bayesian implementation of (4.6).
Our approach is to select a maximum lag under consideration, pmax, and assign
each lag ℓ a state variable, sℓ ∈ {0, 1}, for ℓ = 1, . . . , pmax, to assess whether
or not ψℓ is included in the model:
$$\mu_t(\tau) = \sum_{\ell=1}^{p_{\max}} s_\ell \int \psi_\ell(\tau, u)\,\mu_{t-\ell}(u)\,du + \epsilon_t(\tau), \tag{4.9}$$
which extends Kuo and Mallick (1998) and Korobilis (2013b) to the FAR(p) setting.
By averaging over the states $\{s_\ell\}_{\ell=1}^{p_{\max}}$, the forecasts of model (4.9) are the
model-averaged forecasts over the FAR(ℓ) models for ℓ = 1, . . . , pmax. Since
we restrict sℓ ∈ {0, 1}, rather than strongly shrinking ψℓ toward zero, we
can substantially improve computational efficiency: at each MCMC iteration,
we sample {µt} jointly from the FAR(p∗) extension of the DLM (4.6), where
$p^* = \min\{\ell : s_{\ell+1} = \cdots = s_{p_{\max}} = 0\}$ is the largest lag of nonzero autocorrelation.
The joint distribution of the states is
$[s_1, s_2, \ldots, s_{p_{\max}}] = [s_1] \prod_{\ell=2}^{p_{\max}} [s_\ell \mid s_{\ell-1}, \ldots, s_1]$,
where [sℓ|sℓ−1, . . . , s1] is the probability that the lag ℓ autocorrelation term is
included in the model, given whether the autocorrelation terms of the smaller
lags ℓ − 1, . . . , 1 are included in the model. We assume that sℓ = 0 implies
that sk is likely also zero for all k > ℓ, which induces a more parsimonious
model. In particular, we use the computationally convenient Markov
assumption [sℓ|sℓ−1, . . . , s1] = [sℓ|sℓ−1] with a small transition probability for
P(sℓ = 1|sℓ−1 = 0) = q01. The reverse transition probability, P(sℓ = 0|sℓ−1 =
1) = q10, encourages smaller models when it is large. By default, we select
q01 = 0.01, q10 = 0.75, and complete the joint prior distribution of $\{s_\ell\}_{\ell=1}^{p_{\max}}$ with
P(s1 = 1) = 0.9; for simulated data, the posterior does not appear to be sensitive
to these choices.
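The Markov prior on the lag states is simple enough to enumerate directly. The sketch below uses the default values quoted above; the function names are hypothetical, and returning p∗ = 0 when every state is zero is a convention assumed here rather than one stated in the source.

```python
from itertools import product

Q01, Q10, P1 = 0.01, 0.75, 0.9   # default values quoted in the text


def state_prior(states, q01=Q01, q10=Q10, p1=P1):
    """Prior probability of (s_1, ..., s_pmax) under the first-order Markov assumption."""
    prob = p1 if states[0] == 1 else 1.0 - p1
    for prev, cur in zip(states, states[1:]):
        if prev == 0:
            prob *= q01 if cur == 1 else 1.0 - q01
        else:
            prob *= q10 if cur == 0 else 1.0 - q10
    return prob


def largest_active_lag(states):
    """p*: the largest lag with s_l = 1 (assumed to be 0 when no lag is included)."""
    active = [lag for lag, s in enumerate(states, start=1) if s == 1]
    return max(active) if active else 0


def prior_over_p_star(pmax):
    """Implied prior on p*, obtained by summing over all 2^pmax state vectors."""
    out = {p: 0.0 for p in range(pmax + 1)}
    for states in product((0, 1), repeat=pmax):
        out[largest_active_lag(states)] += state_prior(states)
    return out


implied_prior = prior_over_p_star(4)
```

With a large q10 and a small q01, most of the prior mass falls on small values of p∗, which is the parsimony the transition probabilities are meant to encode.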
4.5 Finite-Dimensional Optimality
The Gaussian assumptions in model (4.6) provide convenient posterior distri-
butions for MCMC sampling and a useful framework for inference, but are
not necessary for model (4.2). Suppose we relax the Gaussian assumption to
εt ∼ SP(0, K), where SP(m, K) denotes a second-order stochastic process with
mean function m and covariance function K. Similarly, let νi,t be a mean zero
random variable with variance $\sigma_\nu^2$ and let µ1 ≡ 1. Given a finite set of evaluation
points, Te ⊂ T , model (4.4) implies the distribution-free DLM
$$y_t = Z_t\mu + Z_t\mu_t + \nu_t, \quad E[\nu_t \mid \sigma_\nu^2] = 0, \quad \mathrm{Cov}[\nu_t \mid \sigma_\nu^2] = \sigma_\nu^2 I_{m_t},$$
$$\mu_t = \Psi Q \mu_{t-1} + \epsilon_t, \quad E[\epsilon_t \mid K] = 0, \quad \mathrm{Cov}[\epsilon_t \mid K] = K, \tag{4.10}$$
under the integral approximation (4.5), where the vectors and matrices are de-
fined as before and µ1 ≡ 1. Since this holds for any finite set of evaluation
points Te ⊂ T , we may consider the DLM (4.10) to be a collection of models
indexed by the evaluation points, Te. The error sequences, {νt} and {εt}, are assumed
to be uncorrelated, rather than independent. If we additionally assume
Gaussianity of {νt} and {εt}, then the uncorrelatedness implies independence,
and model (4.10) becomes model (4.6). Extensions for the FAR(p) models are
similar. The results below also hold for time-dependent variances for νt and εt.
Let d be an estimator of δ ∈ L2(T ), and consider the squared error loss us-
ing the Euclidean norm: Le(δ, d) = (δ − d)′(δ − d), where Le is indexed by the
set of evaluation points, Te, at which δ and d are evaluated to form the corresponding
vectors δ and d. When Te is an equally-spaced fine grid on T , the
loss function Le will approximate the usual loss function for functional data,
$L_{L^2}(\delta, d) = \int (\delta(u) - d(u))^2\,du$, for most reasonable choices of δ and d (up to
a rescaling by M = |Te|). In a standard Bayesian analysis, the goal would be
to minimize the posterior risk, E[Le(δ, d)|{yt}], for which the solution is the
posterior expectation, d = E[δ|{yt}]. Indeed, the estimators discussed below
minimize the posterior risk under the Gaussian assumptions of model (4.6).
However, by relaxing the distributional assumptions in (4.10) to increase the
generality of the model, we no longer have sufficient information to compute
posterior distributions or posterior moments. In addition, it is difficult to com-
pare Bayesian and non-Bayesian procedures under the posterior risk, and most
procedures for functional time series modeling are non-Bayesian. Therefore, we
consider the overall risk Re(δ, d) = E[Le(δ, d)], which is the expected value of the
posterior risk with respect to the sampling distribution. As with the loss function
Le, the risk function Re is indexed by the evaluation points, Te; we seek to
minimize Re for any choice of Te.
Let Dt = {yt, yt−1, . . . , y1} ∪ D0 be the information available at time t, where
D0 represents the information prior to time t = 1.
Theorem 4.1. For any finite set of evaluation points Te ⊂ T , the unique best linear
predictor of the conditional random vector δ ∼ [δ|Y ,Θ], where δ,Y ⊆ DT ∪{µt(τ) :
τ ∈ Te, t = 1, . . . , T} and Θ = {µ, σ2ν , ψ,K}, under the risk Re and conditional
on model (4.4) with the integral approximation (4.5), is the conditional expectation
δ̂(Y |Θ) ≡ E[δ|Y ,Θ] as computed under model (4.6).
The proof of Theorem 4.1 is in the Appendix, and extends fundamental results
for vector-valued DLMs. The best linear predictors of Theorem 4.1 equivalently
minimize the risk $R(\delta, d) = \sup_{\mathcal{T}_e} R_e(\delta, d)$ among all linear estimators,
where the sup is taken over all finite Te ⊂ T . The most useful examples
of [δ|Y ,Θ] in Theorem 4.1 are the forecasting distributions [yt+h|Dt,Θ] and
[µt+h|Dt,Θ] for h > 0, the smoothing distributions [µt|DT ,Θ], and the filter-
ing distributions [µt|Dt,Θ], for t = 1, . . . , T . Theorem 4.1 depends on the ob-
servation points To only via the assumption that Zt is known. In general, we
assume To ⊆ Te, so Zt is an incidence matrix and therefore known. Theorem 4.1
does not require To to become arbitrarily dense in T , and is valid for both sparse
and dense designs. For implementation, we compute the relevant expectations
within the Gibbs sampling algorithm (see Appendix C), and then average over
the Gibbs sample of Θ. Alternatively, an EM algorithm could be used to esti-
mate the relevant expectations (Cressie and Wikle, 2011).
There is no intrinsic reason to restrict the estimators to linearity. However,
several popular competing methods are linear, and therefore are dominated by
the conditional expectations computed from model (4.6) whenever the estima-
tors are distinct. More formally:
Corollary. Consider a basis expansion of the observations yt ≈ B′tθt, where Bt =
(b(τ1,t), . . . , b(τmt,t)), b is a known J-dimensional vector of basis functions, and θt is
the corresponding J-dimensional vector of unknown basis coefficients. If the estimator
θ̂t of θt is linear in yt, then estimates or forecasts of the form Hθ̂t + h, conditional on
the matrix H and the vector h, are inadmissible for all [δ|Y ] whenever Hθ̂t + h ≠
δ̂(Y |Θ).
The most important application of Corollary 4.5 is to characterize the inad-
missibility of procedures based on FPC scores. In the notation of Corollary 4.5,
let b be the FPC basis, which we assume is fixed and known. The components
of θt correspond to the FPC scores, defined by
$\theta_{j,t} = \int \{Y_t(u) - \mu(u)\}\, b_j(u)\,du = \int \mu_t(u)\, b_j(u)\,du$. There are two standard approaches for computing FPC scores:
quadrature methods for dense designs absent measurement error, and the PACE
procedure of Yao et al. (2005), which uses conditional expectations under a
Gaussian assumption and applies more generally. In both cases, the FPC scores
are linear in yt, so Corollary 4.5 applies.
Among functional time series methods, the most pertinent procedures are
Aue et al. (2015) and Hyndman and Ullah (2007). Aue et al. (2015) provide the
more general framework, in which they compute the best linear predictors for
the FPC scores, and then forecast the FPC scores using multivariate time series
methods. For time series methods that are linear in the FPC scores, such fore-
casts are inadmissible. While Aue et al. (2015) undoubtedly provide a simple
yet general framework for forecasting a functional time series, the simulations
of Section 4.6 confirm the consequence of inadmissibility on forecasting perfor-
mance.
Corollary. Consider the common functional data pre-processing procedure in which
the discrete, noisy observations, yt, are replaced by estimated functions evaluated on a
fine grid, ŷt, and then estimates and forecasts are computed using the functional “data”
ŷt. If ŷt is linear in {yt}, then any estimator or forecast linear in {ŷt} is inadmissible
for all [δ|Y ] whenever ŷt ≠ δ̂(Y |Θ).
Typically, ŷt is estimated using splines or kernel smoothers, both of which
are linear in yt. As an application of Corollary 4.5, the simple forecasting
method of fitting a VAR to ŷt evaluated on a grid of points, conditional on the
VAR coefficient matrix, is inadmissible.
Corollary. The unique best linear predictor of $[\mu_t(\tau^*) \mid D_s]$ for any times t, s and any
point τ∗ ∈ T is the corresponding expectation under model (4.6).
Model (4.6) achieves the optimality of a kriging estimator for interpolation
of any point τ ∗ ∈ T , simply by adding τ ∗ to the evaluation set Te.
In practice, we need not include all such τ∗ in Te: we can estimate the
out-of-sample posterior distribution $[\mu_t(\tau^*) \mid D_s]$ for $\tau^* \notin \mathcal{T}_e$ by sampling from the
out-of-sample full conditional distribution $[\mu_t(\tau^*) \mid \{\mu_r\}_{r=1}^T, \Theta, D_s]$ within the
Gibbs sampler, and then averaging over the Gibbs sample of $\{\mu_r\}_{r=1}^T$ and Θ.
Let $\psi'(\tau^*) \equiv (\psi(\tau^*, \tau_1), \ldots, \psi(\tau^*, \tau_M))$ and $\phi'(\tau^*) \equiv (\phi_1(\tau^*), \ldots, \phi_J(\tau^*))$. In the
special case of model (4.4) and using the FDLM (4.7), we have the following
computationally efficient alternative for state space imputation:
Theorem 4.2. Suppose τ∗ ∈ T such that τ∗ ∉ Te. Under the FDLM (4.7) and conditional
on model (4.4) with the integral approximation (4.5), the out-of-sample full
conditional distribution of µt(τ∗) is $[\mu_t(\tau^*) \mid \{\mu_r\}_{r=1}^T, \Theta, D_s] \sim N(m_t(\tau^*), K_t(\tau^*))$,
where $m_t(\tau^*) = \psi'(\tau^*) Q \mu_{t-1} + \phi'(\tau^*) \tilde{\Sigma}_e \Phi' (\mu_t - \Psi Q \mu_{t-1})$ and
$K_t(\tau^*) = \sigma_\eta^2 + \sigma_\eta^2\, \phi'(\tau^*) \tilde{\Sigma}_e \phi(\tau^*)$.
The proof of Theorem 4.2 and extensions for p > 1 are in Appendix C. Using
Theorem 4.2, we can efficiently estimate the out-of-sample posterior distribution
$[\mu_t(\tau^*) \mid D_s]$ with minimal adjustments to the Gibbs sampling algorithm (see
Appendix C). Theorem 4.2 builds upon the approximation in (4.5) and the com-
putational simplifications of the FDLM to produce simple and efficient moment
calculations for the full conditional distributions without expanding the dimen-
sion of the state vector, M . Note that for implementation purposes, the terms
µt and µt−1 appearing in $m_t(\tau^*)$ are assumed to be sampled from the full conditional
distribution $[\{\mu_r\}_{r=1}^T \mid \Theta, D_s]$.
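The moment calculations in Theorem 4.2 reduce to a few inner products, as the following sketch suggests. It assumes, for illustration, that Q is diagonal with quadrature weights `q_weights`; the function name and argument layout are hypothetical, and the inputs are taken to be one Gibbs draw of the relevant quantities.

```python
def out_of_sample_moments(psi_star, phi_star, q_weights, Phi, Psi,
                          sig2, sig2_eta, mu_t, mu_prev):
    """Mean m_t(tau*) and variance K_t(tau*) of the out-of-sample full conditional.

    psi_star : [psi(tau*, tau_k)] for k = 1..M
    phi_star : [phi_j(tau*)] for j = 1..J
    q_weights: diagonal of Q (assumption: Q is diagonal)
    Phi      : M x J matrix of FLCs at the evaluation points, with Phi'Phi = I_J
    Psi      : M x M matrix of kernel evaluations
    sig2     : factor variances sigma_j^2; sig2_eta: approximation error variance
    """
    M, J = len(Phi), len(sig2)
    tilde = [s / (sig2_eta + s) for s in sig2]            # diagonal of Sigma-tilde_e
    q_mu = [w * m for w, m in zip(q_weights, mu_prev)]    # Q mu_{t-1}
    # Innovation at the evaluation points: mu_t - Psi Q mu_{t-1}
    resid = [mu_t[i] - sum(Psi[i][k] * q_mu[k] for k in range(M)) for i in range(M)]
    # Phi' (mu_t - Psi Q mu_{t-1})
    scores = [sum(Phi[i][j] * resid[i] for i in range(M)) for j in range(J)]
    mean = (sum(p * x for p, x in zip(psi_star, q_mu))
            + sum(phi_star[j] * tilde[j] * scores[j] for j in range(J)))
    var = sig2_eta + sig2_eta * sum(phi_star[j] ** 2 * tilde[j] for j in range(J))
    return mean, var
```

The state vector keeps dimension M throughout: the new point τ∗ enters only through the length-M vector ψ′(τ∗) and the length-J vector φ′(τ∗).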
4.6 Simulations
We conducted extensive simulations to evaluate the proposed methods for
FAR(p) relative to several competitive alternatives. We are particularly inter-
ested in one-step forecasting and recovery of the FAR kernel ψ1, and in how the
associated performance varies with the sample size T , the location and number
of the observation points τ1,t, . . . , τmt,t, the kernel ψ1, and the smoothness of the
innovation process εt. We also assess the performance of the model averaging
procedure of Section 4.4 for p ∈ {1, 2}, and compare the nonparametric FDLM
approach of Section 4.3 with a more standard parametric Gaussian process im-
plementation.
4.6.1 Sampling Designs
For all simulations, the mean function is $\mu(\tau) = \frac{1}{10}\tau^3 \sin(2\pi\tau)$, which produces
the dominant shape in the rightmost panels of Figure 4.1. The measurement
errors are identically distributed for all simulations: $\nu_{i,t} \overset{iid}{\sim} N(0, \sigma_\nu^2)$ with
σν = 0.002. We vary the sample size from small (T = 50) to large (T = 350) for
the FAR(1) simulations, and use a moderate sample size (T = 125) for the FAR(2)
simulation. The FAR(1) kernel used for Figure 4.1 is the Bimodal-Gaussian kernel,
$$\psi(\tau, u) \propto \frac{0.75}{\pi(0.3)(0.4)} \exp\{-(\tau - 0.2)^2/(0.3)^2 - (u - 0.3)^2/(0.4)^2\} + \frac{0.45}{\pi(0.3)(0.4)} \exp\{-(\tau - 0.7)^2/(0.3)^2 - (u - 0.8)^2/(0.4)^2\},$$
following Wood (2003); see Appendix C for
a plot of the Bimodal-Gaussian kernel. We also present results for the Linear-τ
kernel, ψ(τ, u) ∝ τ , and the Linear-u kernel, ψ(τ, u) ∝ u. Each kernel is
rescaled according to a pre-specified squared norm, $C_{\psi_\ell} = \int\!\!\int \psi_\ell^2(\tau, u)\,d\tau\,du$,
with $\sum_{\ell=1}^{p} C_{\psi_\ell} < 1$ for stationarity. We select $C_{\psi_1} = 0.8$ for the FAR(1) simulations
and use $(C_{\psi_1}, C_{\psi_2}) = (0.4, 0.2)$ for the FAR(2) simulation; smaller values
of $C_{\psi_\ell}$ produce similar comparative results, but the forecasting performance
deteriorates for all methods. For the innovation process, εt, we consider both
smooth and non-smooth Gaussian processes. We use the covariance function
parametrization $K = \sigma^2 R_\rho$, where Rρ is the Matérn correlation function
$R_\rho(\tau, u) = \{2^{\rho_1 - 1}\Gamma(\rho_1)\}^{-1} (\|\tau - u\|/\rho_2)^{\rho_1} K_{\rho_1}(\|\tau - u\|/\rho_2)$, Γ(·) is the gamma
function, Kρ1 is the modified Bessel function of order ρ1, and ρ = (ρ1, ρ2) are
parameters (Matérn, 2013). We let σ = 0.01 and ρ = (ρ1, 0.1), with ρ1 = 2.5 for
smooth (twice-differentiable) sample paths and ρ1 = 0.5 for non-smooth
(continuous, non-differentiable) sample paths.
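For the two smoothness values used in the simulations, the Matérn correlation has elementary closed forms, so no Bessel function routine is needed. The sketch below uses the parametrization stated above, in which the distance is scaled by ρ2 only; with x = ||τ − u||/ρ2, the half-integer Bessel functions give exp(−x) for ρ1 = 0.5 and (1 + x + x²/3) exp(−x) for ρ1 = 2.5. The function names are hypothetical.

```python
import math


def matern_correlation(dist, rho1, rho2):
    """Matern correlation for rho1 in {0.5, 2.5}, with x = |dist| / rho2.

    Closed forms follow from the half-integer modified Bessel functions
    under the parametrization given in the text.
    """
    x = abs(dist) / rho2
    if rho1 == 0.5:
        return math.exp(-x)
    if rho1 == 2.5:
        return (1.0 + x + x * x / 3.0) * math.exp(-x)
    raise ValueError("this sketch covers only rho1 = 0.5 and rho1 = 2.5")


def matern_covariance_matrix(points, sigma, rho1, rho2):
    """K = sigma^2 * R_rho evaluated at all pairs of points."""
    return [
        [sigma ** 2 * matern_correlation(a - b, rho1, rho2) for b in points]
        for a in points
    ]


grid = [i / 29 for i in range(30)]                         # M = 30 points on [0, 1]
K_smooth = matern_covariance_matrix(grid, 0.01, 2.5, 0.1)
K_rough = matern_covariance_matrix(grid, 0.01, 0.5, 0.1)
```

At a fixed distance the ρ1 = 2.5 correlation is larger than the ρ1 = 0.5 correlation, consistent with the smoother sample paths in the left panel of Figure 4.1.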
[Figure 4.1 panels, left to right: Smooth Gaussian Process; Non-Smooth Gaussian Process; Smooth Gaussian Process; Non-Smooth Gaussian Process. Horizontal axis: τ ∈ [0, 1]; vertical axes: εt(τ) and Yt(τ).]
Figure 4.1: Sample paths of εt and Yt = µt + µ as a function of τ , where εt is a
Gaussian process with the Matérn correlation function, ρ = (ρ1, 0.1), σ = 0.01,
and Yt is generated using the Bimodal-Gaussian FAR(1) kernel, t = 1, . . . , T =
50. The curves are time-ordered by color (from red/orange to blue/violet). Left
to right: εt(τ), ρ1 = 2.5; εt(τ), ρ1 = 0.5; Yt(τ), ρ1 = 2.5; Yt(τ), ρ1 = 0.5. Note that
we do not observe Yt directly, but rather $y_{i,t} = Y_t(\tau_{i,t}) + \nu_{i,t}$, where $\nu_{i,t} \sim N(0, \sigma_\nu^2)$
is measurement error with σν = σ/5 = 0.002 and Tt = {τ1,t, . . . , τmt,t} are the
observation points at time t.
We consider three sampling designs for the observation points: dense, sparse-
random, and sparse-fixed. In each case, the set of evaluation points, Te, is an
equally-spaced grid of M = 30 points on T = [0, 1]. The dense design uses
mt = 25 equally-spaced observation points on [0, 1] for all t, for which the results
are representative of denser (mt ≫ 25) designs and similar to those of Didericksen
et al. (2012); see Appendix C. The sparse-random design is generated by
first sampling each mt from a zero-truncated Poisson (5) distribution, and then
sampling τ1,t, . . . , τmt,t without replacement from Te. This is a common design in
sparse functional data, in which mt may be small for some t, but To is dense in T .
The sparse-fixed design uses mt = 8 equally-spaced points in T . This is the most
challenging design, and one for which multivariate time series methods should
be most competitive with functional time series methods. Comparatively, the
sparse settings are similar to the dense setting, but with additional missing ob-
servations.
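The three designs can be sketched in a few lines of standard-library Python. The rejection step for the zero-truncated Poisson draw, the inclusion of both endpoints in the equally-spaced grids, and all function names are assumptions made for illustration; the source does not specify how the draws were implemented.

```python
import math
import random


def zero_truncated_poisson(lam, rng):
    """Draw from Poisson(lam) conditioned on being positive (rejection sampling)."""
    while True:
        # Knuth's multiplication method for a single Poisson draw
        limit, k, prod = math.exp(-lam), 0, rng.random()
        while prod > limit:
            k += 1
            prod *= rng.random()
        if k > 0:
            return k


def equally_spaced(m):
    """m equally spaced points on [0, 1], endpoints included (assumption)."""
    return [i / (m - 1) for i in range(m)]


def sparse_random_design(grid, rng, lam=5.0):
    """Observation points at one time t: m_t ~ ZTP(lam), sampled without replacement."""
    m_t = min(zero_truncated_poisson(lam, rng), len(grid))
    return sorted(rng.sample(grid, m_t))


rng = random.Random(2017)
evaluation_points = equally_spaced(30)                    # M = 30
dense_points = equally_spaced(25)                         # same points for all t
sparse_fixed_points = equally_spaced(8)                   # same points for all t
sparse_random_points = [sparse_random_design(evaluation_points, rng) for _ in range(50)]
```

In the sparse-random design each curve is observed at only a handful of points, but the union of the observation points across t quickly covers the evaluation grid.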
4.6.2 Competing Estimators
Within the proposed framework and using the FDLM of Section 4.3 for the in-
novation covariance function, we compute forecasts for p = 1 (FDLM-FAR(1)),
and in the FAR(2) simulation, for p = 2 (FDLM-FAR(2)) and p = 3 (FDLM-
FAR(3)). We also compute forecasts using the model averaging procedure with
pmax = 4 (FDLM-FAR(p)). To assess the performance of the FDLM implemen-
tation, we compute forecasts using model (4.6) with a parametric covariance
function for K = σ2Rρ (GP-FAR(1)). We use the Matérn correlation function for
Rρ, with ρ1 = 2.5 as in the smooth Gaussian process simulations, and use the
priors $\sigma^{-2} \sim \mathrm{Gamma}(10^{-3}, 10^{-3})$ and ρ2 ∼ Uniform (0, Uρ2), where Uρ2 is the
maximum value of ρ2 for which the correlation function Rρ is less than 0.99
for all pairs of evaluation points. These models are implemented using the
Gibbs sampling algorithm provided in Appendix C, and estimates are based
on 5,000 MCMC simulations after a burn-in of 5,000. For the large sample set-
ting (T = 350), the mean computation time per 1,000 MCMC simulations was
2.3 minutes for FDLM-FAR(1) and 4.4 minutes for GP-FAR(1). The computing
times are calculated on a 64-bit Windows machine with a 2.40-GHz Intel core
i7-4700MQ processor with 8 GB of RAM, and the code is written in R.
We consider several important competing methods. Let ŷt+1 denote the one-
step forecast at time t. For baseline comparisons, we use the random-walk (RW)
forecast, ŷt+1 = yt, and the mean (Mean) forecast, ŷt+1 = µ̂, where µ̂ is a smooth
estimate of the mean of $\{y_s\}_{s=1}^t$. We estimate µ̂ using a B-spline basis expansion
via the function meanfd() in the R package fda (Ramsay et al., 2014). Both es-
timators are robust against overfitting, and the mean forecast is optimal when
ψ = 0. We also compute the one-step forecast based on a VAR(1) fit to $\{y_s\}_{s=1}^t$
(VAR-Y). In the sparse-random design, the observations yt were linearly
interpolated onto Te prior to fitting the VAR. In the sparse-fixed design, the VAR
was fit to the observation points, and then forecasts for the evaluation points
were computed by fitting a spline to the VAR forecasts of the observation points.
For additional comparisons, we computed forecasts from a simple exponential
smoother (SES) applied pointwise to each component of yt, i.e., each time series
$\{y_{j,t}\}_{t=1}^T$. The SES forecasts are implemented using the ses function in the R
package forecast (Hyndman and Khandakar, 2008), with an identical impu-
tation scheme as VAR-Y. We also considered two functional data methods. First,
we used the Estimated Kernel procedure outlined in Horváth and Kokoszka
(2012), which estimates ψℓ in (4.2) using FPCs (FAR Classic); we fix p = 1 for simplicity.
This method has well-studied theoretical properties and is a useful base-
line for FAR models. Second, we implemented the method of Aue et al. (2015),
which we briefly described in Section 4.5, using a VAR(1) on the FPC scores
(VAR-FPC). We compute the FPCs using the fda package in R with B-spline ba-
sis functions. To avoid the ill-conditioned estimators discussed in Horváth and
Kokoszka (2012), we regularize via basis truncation, using 8 equally-spaced in-
terior knots. The number of components is selected to explain at least 95% of
the variability in {yt}. For the sampling designs considered here, this approach
works well. Finally, we report the oracle forecast (FAR Oracle) computed using
the true one-step forecasts $E[\mu_t(\tau) \mid \{\psi_\ell, \mu_{t-\ell}\}_{\ell=1}^p] = \sum_{\ell=1}^{p} \int \psi_\ell(\tau, u)\,\mu_{t-\ell}(u)\,du$
within the simulation, where $\{\psi_\ell\}_{\ell=1}^p$ are the FAR kernels from the simulation
specification, {µt} are the simulated values of the latent FAR process, and the
integral is approximated using the trapezoidal rule with M = 200 grid points.
The oracle forecast is not actually an estimator, and is unaffected by sparsity or
small sample sizes.
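The two simplest baselines can be written out as a short sketch. Unlike the ses function referenced above, which is applied to each pointwise series in R, this version fixes the smoothing parameter `alpha` and initializes the level at the first observation; both choices, and the function names, are assumptions made for illustration only.

```python
def random_walk_forecast(curves):
    """RW baseline: the one-step forecast is the most recent observed vector."""
    return list(curves[-1])


def ses_forecast(series, alpha=0.5):
    """Simple exponential smoothing forecast for one scalar time series.

    Assumptions: fixed alpha and level initialized at the first observation.
    """
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1.0 - alpha) * level
    return level


def pointwise_ses(curves, alpha=0.5):
    """Apply SES separately to each component j of the vectors y_1, ..., y_T."""
    M = len(curves[0])
    return [ses_forecast([y[j] for y in curves], alpha) for j in range(M)]


observed = [[0.0, 1.0, 2.0], [2.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
rw = random_walk_forecast(observed)
ses = pointwise_ses(observed, alpha=0.5)
```

Both baselines treat the components of yt as unrelated series, so they use no information about the ordering or spacing of the observation points.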
We estimate the one-step forecasts [yT+h|y1:(T+h−1)], h = 1, . . . , 25, for
all estimators under consideration, and compare them using the mean
squared forecast error $\mathrm{MSFE}_e = \frac{1}{25M} \sum_{h=1}^{25} \|Y_{T+h} - \hat{Y}_{T+h}\|^2$, where
$Y_{T+h} = (Y_{T+h}(\tau_1), \ldots, Y_{T+h}(\tau_M))'$, which measures the one-step forecasting
performance at the evaluation points, and the mean squared error
$\mathrm{MSE}_{\psi_1} = \frac{1}{M^2} \sum_{i=1}^{M} \sum_{k=1}^{M} \{\psi_1(\tau_i, \tau_k) - \hat{\psi}_1(\tau_i, \tau_k)\}^2$, which measures the recovery of the lag-1
kernel ψ1. Because Te is relatively dense in T , MSFEe and MSEψ1 approximate
the integrated squared errors $\int \{Y_{T+h}(u) - \hat{Y}_{T+h}(u)\}^2\,du$ and
$\int\!\!\int \{\psi_1(\tau, u) - \hat{\psi}_1(\tau, u)\}^2\,d\tau\,du$, respectively. Estimators ψ̂1 are available only for the proposed
methods and FAR Classic. For computational convenience in the proposed
methods, we update $\{\mu_t\}_{t=1}^{T+h-1}$ using all of the data y1:(T+h−1), but sample all
other parameters only conditional on y1:T . DLM updating algorithms provide
recursive one-step forecasts for µt, but in general there are no convenient up-
dating algorithms for the other parameters. In practice, this is not a problem,
but suggests that our simulation analysis may underestimate the performance
of the proposed model.
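The two error criteria are plain averages of squared differences over the evaluation grid, as the following sketch shows. The function names and the list-of-lists layout are illustrative choices, not taken from the source.

```python
def msfe(actual, forecast):
    """Mean squared forecast error over H horizons and M evaluation points.

    actual, forecast: lists of length H, each a list of length M holding
    Y_{T+h}(tau_i) and its one-step forecast.
    """
    H, M = len(actual), len(actual[0])
    total = sum(
        (a - f) ** 2
        for y_true, y_hat in zip(actual, forecast)
        for a, f in zip(y_true, y_hat)
    )
    return total / (H * M)


def mse_kernel(psi_true, psi_hat):
    """Mean squared error of the lag-1 kernel on the M x M evaluation grid."""
    M = len(psi_true)
    total = sum(
        (psi_true[i][k] - psi_hat[i][k]) ** 2
        for i in range(M)
        for k in range(M)
    )
    return total / M ** 2


forecast_error = msfe([[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.0], [3.0, 6.0]])
kernel_error = mse_kernel([[1.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]])
```

Dividing by M (or M²) makes each criterion an average over the grid, so on an equally-spaced grid over [0, 1] it approximates the corresponding integrated squared error.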
4.6.3 Results
We computed MSFEe and MSEψ1 under a variety of sampling designs, each for
N = 50 simulations, and present the results for a few important cases in Figures
4.2 and 4.3, respectively. The figures are color-coded: multivariate methods are
green, existing functional data methods are red, the proposed methods are blue,
and the oracle is gold.
[Figure 4.2: four panels comparing MSFEe (horizontal axis) across methods: RW, Mean, VAR-Y, SES, FAR Classic, VAR-FPC, GP-FAR(1), FDLM-FAR(1), FDLM-FAR(p), and FAR Oracle; the bottom right panel also includes FDLM-FAR(2) and FDLM-FAR(3).]
Figure 4.2: MSFEe under various designs. Top left: FAR(1), T = 350, sparse-
random design with the Linear-u kernel and smooth GP innovations. Top right:
FAR(1), T = 50, sparse-random design with the Bimodal-Gaussian kernel and
non-smooth GP innovations. Bottom left: FAR(1), T = 350, sparse-fixed design
with the Bimodal-Gaussian kernel and smooth GP innovations. Bottom right:
FAR(2), T = 125, sparse-fixed design with Bimodal-Gaussian and Linear-τ kernels
and smooth GP innovations. The proposed methods provide superior forecasts
and nearly achieve the oracle performance, despite the presence of
sparsity.
For the sparse designs in Figure 4.2, the proposed methods are all superior to
the competitors, and in some cases nearly achieve the oracle performance, even
though the oracle is unaffected by sparsity. Figure 4.3 shows that the proposed
methods also offer a substantial improvement in ψ1 estimation. Importantly, the
[Figure 4.3: four panels comparing MSEψ1 (horizontal axis) across methods: FAR Classic, GP-FAR(1), FDLM-FAR(1), and FDLM-FAR(p); the bottom right panel also includes FDLM-FAR(2) and FDLM-FAR(3).]
Figure 4.3: MSEψ1 under various designs. Top left: FAR(1), T = 350, sparse-
random design with the Linear-u kernel and smooth GP innovations. Top right:
FAR(1), T = 50, sparse-random design with the Bimodal-Gaussian kernel and
non-smooth GP innovations. Bottom left: FAR(1), T = 350, sparse-fixed design
with the Bimodal-Gaussian kernel and smooth GP innovations. Bottom right:
FAR(2), T = 125, sparse-fixed design with Bimodal-Gaussian and Linear-τ kernels
and smooth GP innovations. Estimates of ψ1 are far superior for the proposed
methods, including the FAR(p) with model averaging.
proposed model with model averaging is competitive with the known p model
for both forecasting and estimation of ψ1. The model averaging procedure of
Section 4.4 typically identifies the true p with high probability, with a mild ten-
dency to overestimate p. However, this behavior is encouraging: the bottom
right panel of Figure 4.3, in which p = 2, suggests that overestimating the lag
(FDLM-FAR(3)) is preferable to underestimating the lag (FDLM-FAR(1), GP-
FAR(1)) for ψ1 estimation. FDLM-FAR(1) is competitive with GP-FAR(1), even
when the parametric Gaussian process model assumes the correct (smooth) in-
novation distribution, which suggests that the FDLM implementation of Sec-
tion 4.3 provides an adequate approximation. Under the dense design (see Ap-
pendix C), the improvements of the proposed methods over existing functional
data methods are less substantial, and for T = 350 the functional data methods
all nearly achieve the oracle performance. The proposed methods, however,
again provide superior recovery of ψ1. In general, we find that the functional
data methods, in particular the proposed approaches, outperform the multi-
variate methods, especially in the dense design. We conclude that the proposed
methods provide highly competitive forecasts and superior FAR kernel recov-
ery in a wide variety of important settings.
4.7 Forecasting Nominal and Real Yield Curves
We apply the proposed methods to model and forecast nominal and real yield
curves. Yield curves are important in a variety of economic and financial appli-
cations, such as evaluating economic and monetary conditions, pricing fixed-
income securities, generating forward curves, computing inflation premiums,
and monitoring business cycles (Bolder et al., 2004). In practice, the U.S. real
yield curve is estimated using Treasury Inflation-Protected Securities (TIPS), for
which payments are adjusted according to the Consumer Price Index for All Ur-
ban Consumers (CPI-U) to provide investors with protection against inflation.
U.S. nominal and TIPS yield curve data are published daily by the Federal Re-
serve, which uses actively-traded securities to fit a quasi-cubic spline for each
curve. Estimates of the real and nominal yield curves are provided for maturities
$\mathcal{T}_t^R = \{60, 84, 120, 240, 360\}$ and $\mathcal{T}_t^N = \{1, 3, 6, 12, 24, 36\} \cup \mathcal{T}_t^R$ months,
respectively. Notably, the real yield is observed sparsely, and only at longer
maturities. The small number of available maturities for real yields presents
a challenge for existing functional time series models, and provides an inter-
esting comparison with the nominal yield, for which there are more observed
maturities.
To assess the performance of the proposed model, we conducted an exten-
sive forecasting study using daily nominal and real yield curve data. Beginning
in 2003, we construct nine consecutive yet non-overlapping 18-month subperi-
ods for estimation (T ≈ 375); the corresponding starting dates are given in Table
4.1. For the month following each estimation period, we compute both one- and
five-step (i.e., one business week) forecasts (≈ 20 and ≈ 15 time points, respec-
tively) for both the nominal and real yields. In all cases, the nominal and real
yields are modeled separately in order to provide additional comparisons.
We compute forecasts for the proposed methods by simulating from the fore-
casting distribution in the DLM (4.6). For computational convenience, we up-
date only the DLM state parameters {µt} during the forecast periods, and fix
the remaining parameters based on the estimation periods. We also rescale the
observation points $\mathcal{T}_t^R$ and $\mathcal{T}_t^N$ such that $\mathcal{T}_t^R, \mathcal{T}_t^N \subset \mathcal{T} = [0, 1]$. We compute
forecasts using the competing methods described in Section 4.6, which use all
available data for each forecast. For further comparisons, we include two pop-
ular parametric yield curve models based on the Nelson-Siegel parametrization
(Nelson and Siegel, 1987): Diebold and Li (2006, DL), which extends the Nelson-
Siegel model to the dynamic setting via a two-step estimation procedure, and
Diebold et al. (2006, DRA) which is similar to DL, but instead estimates parame-
ters jointly using maximum likelihood within a state space model; see Appendix
C for implementation details.
The one- and five-step root mean squared forecasting errors (RMSFEs) for
the nominal yields and real yields are in Tables 4.1 and 4.2, respectively. We omit
unstable DRA forecasts, as well as multi-step forecasts for FAR Classic, which
are unavailable. For both data sets, the proposed methods—denoted FAR(1)
and FAR(p), using the lag selection procedure with pmax = 3—are consistently
among the best forecasters for all time periods, and outperform the existing
functional data forecasts by a wide margin. For the nominal yields, the FAR(1)
provides the best one-step forecasts aggregated across all time periods. For the
real yields, the proposed methods are again among the most competitive, par-
ticularly in the periods since the financial crisis. Echoing the results in Diebold
and Li (2006), the RW forecast is a difficult benchmark to clear, and the existing
functional data models typically fail to do so. By comparison, the proposed FAR
forecasts are highly competitive across all time periods and for both the nominal
and (sparsely-observed) real yields.
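As a concrete reading of the comparison above, the following sketch computes an h-step RMSFE for the random walk (RW) benchmark, which forecasts each maturity with its most recently observed value. The function names, the array layout (rows are days, columns are maturities), and the pooling of squared errors over all forecast days and maturities are assumptions made for illustration; the text does not spell out the aggregation used in Tables 4.1 and 4.2.

```python
import numpy as np

def rmsfe(forecasts, actuals):
    """Root mean squared forecast error, pooled over all days and maturities."""
    err = np.asarray(forecasts, dtype=float) - np.asarray(actuals, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))

def random_walk_forecasts(yields, h):
    """h-step RW forecasts: the forecast for day t + h is the curve seen on day t.

    yields: array of shape (n_days, n_maturities).
    Returns (forecasts, actuals), each of shape (n_days - h, n_maturities).
    """
    yields = np.asarray(yields, dtype=float)
    return yields[:-h], yields[h:]

# Toy data (hypothetical, not from the dissertation): 4 days, 2 maturities.
toy = np.array([[1.0, 2.0],
                [1.1, 2.1],
                [1.0, 2.3],
                [1.2, 2.2]])
f1, a1 = random_walk_forecasts(toy, h=1)
rw_rmsfe_1 = rmsfe(f1, a1)
```

Any competing forecaster can be scored with the same `rmsfe` function, so that beating the RW benchmark means producing a smaller value on the same forecast days.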
An important feature of the proposed FAR model is the ability to compute
exact (up to MCMC error) credible bands for parameters of interest, including
forecasts. Such uncertainty quantification is unavailable for the RW forecast,
which is our primary competitor in this application. For illustration, we com-
pute pointwise and simultaneous credible bands for one-step forecasts during
August 2016 in Figure 4.4. For both nominal and real yields, the credible bands
are tighter for shorter maturities and widen in regions of unobserved points,
which is appropriate behavior for a nonparametric method.
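The distinction between pointwise and simultaneous bands can be sketched from a matrix of MCMC draws of a forecast curve. The construction of the simultaneous band below (scaling the pointwise posterior standard deviation by a quantile of the maximum standardized deviation) is one common choice and is an assumption here; the text does not state which construction was used for Figure 4.4. The function name and the toy draws are hypothetical.

```python
import numpy as np

def credible_bands(draws, level=0.95):
    """Pointwise and simultaneous bands from posterior draws of a curve.

    draws: array of shape (n_draws, n_points); each row is one MCMC draw.
    Returns (mean, (lower_pw, upper_pw), (lower_sim, upper_sim)).
    """
    draws = np.asarray(draws, dtype=float)
    alpha = 1.0 - level
    mean = draws.mean(axis=0)
    sd = draws.std(axis=0, ddof=1)
    # Pointwise band: equal-tailed quantiles at each evaluation point.
    lower_pw, upper_pw = np.quantile(draws, [alpha / 2, 1 - alpha / 2], axis=0)
    # Simultaneous band (assumed construction): find q such that a fraction
    # `level` of the draws stay within mean +/- q * sd at every point.
    max_dev = np.max(np.abs(draws - mean) / sd, axis=1)
    q = np.quantile(max_dev, level)
    return mean, (lower_pw, upper_pw), (mean - q * sd, mean + q * sd)

rng = np.random.default_rng(0)
toy_draws = rng.normal(size=(2000, 11))  # stand-in for forecast draws
mean, pointwise, simultaneous = credible_bands(toy_draws)
```

Because the simultaneous band must cover the whole curve jointly, it is wider than the pointwise band at every evaluation point in this toy example.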
4.8 Concluding Remarks
The proposed hierarchical FAR(p) model provides a useful framework for es-
timation, inference, and forecasting functional time series data. Our model
is especially suited for sparsely or irregularly sampled curves and for curves
[Figure 4.4 panels. Top: "Nominal Yield: 5 and 10 Year" and "Real Yield: 5 and 10 Year", plotting Yield (%) against date (July 18 to August 29, 2016). Bottom: "Nominal Yield Curve" and "Real Yield Curve", plotting Yield (%) against Maturity (months).]
Figure 4.4: One-step nominal (left) and real (right) yield curve forecasts during
2016. Top: Time series of five (×) and ten (△) year observed maturities with
one-step forecasts. Bottom: Observed (points) and forecast (line) curves on
8/2/16, corresponding to the dotted vertical line in the top panels. Posterior
means (blue) and 95% pointwise and simultaneous prediction bands (light gray
and dark gray, respectively) estimated using 10,000 MCMC simulations after a
burn-in of 5,000.
sampled with non-negligible measurement error, and produces best linear pre-
dictors in a general FAR(p) setting, thereby dominating many competing func-
tional time series models. The FDLM provides a more flexible, computationally
efficient, and stable approach for modeling (innovation) covariance functions.
Our model averaging procedure provides an effective solution to the problem
of specifying p, and produces highly competitive forecasts. The simulation anal-
ysis and yield curve application suggest that the proposed FAR(p) model may
improve forecasting and estimation in a wide range of settings, and the efficient
MCMC sampling algorithm allows us to perform exact (up to MCMC error and
prior misspecification) inference for important parameters.
While we assumed independent factors (and therefore independent innovations)
in Section 4.3, we can relax this assumption and allow $\Sigma_e$ to be a stochastic
process evolving over time. In this more general framework, the FDLM (4.7) can
accommodate stochastic volatility or heavier-tailed distributions for the factors,
yet retains the computational simplifications of (4.8) and Theorem 4.2. Letting
$\Sigma_t = \mathrm{diag}(\{\sigma_{j,t}^2\}_{j=1}^J)$, the (time-dependent) innovation covariance function is
$$
K_{\epsilon_t}(\tau, u) \equiv \mathrm{Cov}(\epsilon_t(\tau), \epsilon_t(u)) = \sum_{j=1}^J \sigma_{j,t}^2 \phi_j(\tau) \phi_j(u) + \sigma_\eta^2 1\{\tau = u\}.
$$
By modeling each $\{\sigma_{j,t}^2\}_{t=1}^T$ for $j = 1, \ldots, J$ with an independent stochastic volatility model
(e.g., Kim et al., 1998), the time-dependence of $\{\sigma_{j,t}^2\}$ will propagate to the innovation
covariance functions, $K_{\epsilon_t}$. Similar modifications can accommodate scale-mixtures
of Gaussian distributions for the factors (Fernandez and Steel, 2000)
to induce more general distributions for the innovation process, $\{\epsilon_t\}$. These
generalizations are particularly important for financial applications, for which
stochastic volatility models and heavy-tailed distributions are commonly ap-
propriate.
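The time-dependent innovation covariance function can be evaluated on a grid of distinct observation points as a low-rank term plus a diagonal term. The sketch below is only an illustration: the function name, the cosine basis standing in for the $\phi_j$, and the numerical values are all hypothetical and are not taken from the text.

```python
import numpy as np

def innovation_covariance(phi, sigma2_t, sigma2_eta):
    """Evaluate K_t(tau_i, tau_l) on a grid of distinct points.

    phi:        (m, J) array with phi[i, j] = phi_j(tau_i)
    sigma2_t:   (J,) factor innovation variances at time t
    sigma2_eta: scalar measurement-error variance
    """
    phi = np.asarray(phi, dtype=float)
    sigma2_t = np.asarray(sigma2_t, dtype=float)
    # sum_j sigma2_{j,t} phi_j(tau) phi_j(u), plus sigma2_eta where tau == u
    return (phi * sigma2_t) @ phi.T + sigma2_eta * np.eye(phi.shape[0])

# Hypothetical inputs: a cosine basis on [0, 1] stands in for the phi_j.
tau = np.linspace(0.0, 1.0, 7)
phi = np.column_stack([np.cos(np.pi * j * tau) for j in range(3)])
K_t = innovation_covariance(phi, sigma2_t=[1.0, 0.5, 0.1], sigma2_eta=0.01)
```

Only the $J$ variances change with $t$, so a stochastic volatility model for each $\sigma_{j,t}^2$ changes $K_t$ without altering the basis.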
Future work will investigate more adaptive FAR(p) models for longer, possibly
nonstationary functional time series through stochastic volatility, time-varying
$\psi_\ell$, and regime shifts. Important extensions also include modeling multiple
functional responses $Y_t(\tau) \in \mathbb{R}^d$ for $d > 1$, which requires a model for both
the auto- and cross-correlations, and incorporating exogenous predictors. In
both cases, the DLM framework of (4.6) offers a promising platform for pursu-
ing these extensions.
Nominal Yields: h-Step Root Mean Squared Forecast Errors (RMSFEs)
Start  h  RW      Mean    VAR-Y   DL      DRA     FAR Classic  VAR-FPC  FAR(1)  FAR(p)
2/03   1  0.0488  0.4554  0.0487  0.1218  0.1440  0.1641       0.1631   0.0498  0.0516
       5  0.0966  0.4369  0.0904  0.1409  0.8221  -            0.1941   0.0879  0.1002
8/04   1  0.0253  1.1079  0.0252  0.0877  -       0.1113       0.1127   0.0281  0.0281
       5  0.0525  1.1279  0.0383  0.0953  -       -            0.1435   0.0412  0.0505
2/06   1  0.1710  0.5408  0.1809  0.2206  -       0.3349       0.3334   0.1682  0.1673
       5  0.4534  0.5971  0.5885  0.4927  -       -            0.5928   0.4680  0.4627
8/07   1  0.0833  1.3125  0.0860  0.1817  0.1854  0.1168       0.1173   0.0806  0.0793
       5  0.1345  1.3146  0.1402  0.2099  0.2998  -            0.1292   0.1537  0.1233
2/09   1  0.0487  0.5268  0.0517  0.1376  0.0917  0.1406       0.1398   0.0488  0.0760
       5  0.0894  0.5560  0.1227  0.1872  0.1451  -            0.1990   0.1323  0.2608
8/10   1  0.0344  0.5063  0.0333  0.1920  0.0878  0.0551       0.0554   0.0291  0.0292
       5  0.0583  0.4999  0.0603  0.1950  0.1356  -            0.0724   0.0452  0.0495
2/12   1  0.0383  0.5329  0.0384  0.0953  0.1915  0.0464       0.0463   0.0312  0.0311
       5  0.0951  0.5522  0.0915  0.1240  0.2476  -            0.0989   0.0760  0.0734
8/13   1  0.0463  0.4169  0.0443  0.0621  0.0692  0.0634       0.0644   0.0547  0.0676
       5  0.1210  0.3842  0.1104  0.1423  0.1448  -            0.1100   0.1208  0.1100
2/15   1  0.0329  0.3085  0.0320  0.1125  0.1001  0.0594       0.0606   0.0305  0.0321
       5  0.0420  0.3080  0.0403  0.1149  0.1202  -            0.0697   0.0393  0.0441
Table 4.1: h-step RMSFEs for nominal yields, grouped (left to right) by multivariate methods, parametric yield curve
models, existing functional data methods, and proposed hierarchical FAR methods. The minimum RMSFE in each row
is italicized.
Real Yields: h-Step Root Mean Squared Forecast Errors (RMSFEs)
Start  h  RW      Mean    VAR-Y   DL      DRA     FAR Classic  VAR-FPC  FAR(1)  FAR(p)
2/03   1  0.0490  0.1629  0.0504  0.0499  0.0492  0.1366       0.1329   0.0509  0.0572
       5  0.1001  0.1585  0.1040  0.1017  0.1128  -            0.1525   0.0967  0.1110
8/04   1  0.0331  0.3827  0.0337  0.0353  0.0528  0.0431       0.0440   0.0331  0.0326
       5  0.0724  0.3924  0.0707  0.0792  0.1690  -            0.0721   0.0679  0.0651
2/06   1  0.0429  0.1089  0.0428  0.0448  0.0453  0.0529       0.0533   0.0424  0.0424
       5  0.0934  0.1082  0.0858  0.0957  0.1362  -            0.0920   0.0852  0.0835
8/07   1  0.0802  0.2150  0.0896  0.0944  0.1979  0.1212       0.1202   0.0898  0.0880
       5  0.1866  0.2309  0.2268  0.2504  1.1843  -            0.1916   0.2051  0.1980
2/09   1  0.0519  0.5162  0.0544  0.0643  0.1229  0.0736       0.0749   0.0526  0.0541
       5  0.0798  0.5262  0.1100  0.1092  0.3606  -            0.1092   0.0992  0.1046
8/10   1  0.0490  0.7836  0.0492  0.0591  0.0663  0.0800       0.0762   0.0488  0.0486
       5  0.0735  0.7845  0.0787  0.0794  0.1815  -            0.0959   0.0727  0.0744
2/12   1  0.0602  0.8838  0.0612  0.0675  0.1492  0.0906       0.0853   0.0610  0.0608
       5  0.1845  0.9250  0.1958  0.1897  1.7442  -            0.2034   0.1840  0.1846
8/13   1  0.0526  0.3242  0.0506  0.0736  -       0.0613       0.0610   0.0500  0.0492
       5  0.1551  0.2981  0.1278  0.1380  -       -            0.1246   0.1407  0.1239
2/15   1  0.0328  0.3088  0.0327  0.0439  0.1529  0.0776       0.0779   0.0325  0.0336
       5  0.0489  0.3104  0.0521  0.0562  -       -            0.0816   0.0466  0.0543
Table 4.2: h-step RMSFEs for real yields, grouped (left to right) by multivariate methods, parametric yield curve
models, existing functional data methods, and proposed hierarchical FAR methods. The minimum RMSFE in each row is
italicized.
CHAPTER 5
CONCLUSIONS
The proposed methods provide effective Bayesian approaches for model-
ing functional and time series data. While broadly applicable, the proposed
methodology directly addresses the following challenging cases for which ex-
isting methods are inadequate:
1. Functional data with additional complex dependence, such as time depen-
dence, contemporaneous dependence, stochastic volatility, covariates, and
change points (Chapter 2);
2. Functional data, time series data, or regression functions with local fea-
tures, such as jumps or rapidly-changing smoothness (Chapter 3); and
3. Forecasting and inference of functional time series data with sparsely or
irregularly sampled curves and for curves sampled with non-negligible
measurement error (Chapter 4).
Using the MFDLM of Chapter 2, we may adapt general scalar and multivari-
ate methods to the functional data setting. In particular, by separating out the
functional component through appropriate conditioning and including the
necessary identifiability constraints, the remaining dependence structures, such as
covariates, repeated measurements, and spatial dependence, may be modeled
via the factors. The hierarchical Bayesian approach allows us to incorporate
interesting and useful submodels seamlessly, with minimal adjustments to the
proposed Gibbs sampling algorithm.
An interesting extension of the MFDLM of Chapter 2 would be to in-
corporate the dynamic shrinkage processes of Chapter 3 to provide adaptive
shrinkage and regularization of the dynamic factors. Dynamic shrinkage pro-
cesses inherit the desirable shrinkage behavior of global-local priors, such as
the horseshoe prior, but with greater time-localization. By construction, the
MFDLM—and more broadly, dynamic linear models—contains many param-
eters, and therefore may benefit from structured regularization. By synthesiz-
ing the MFDLM of Chapter 2 and the dynamic shrinkage processes of Chap-
ter 3, we may model additional dependence among functional data, such as
covariates, repeated measurements, and spatial dependence, while simulta-
neously introducing temporally adaptive shrinkage behavior to guard against
overparametrization.
Important extensions of the dynamic shrinkage processes of Chapter 3 in-
clude alternative dependence models in (3.2) or multivariate shrinkage in (3.10).
For example, dynamic shrinkage processes may offer effective shrinkage behav-
ior for spatial or spatio-temporal models, in which case (3.2) may be modified to
incorporate spatial dependence. Similarly, for replicate time series or functional
data, multi-level extensions of (3.2) may provide both hierarchical and locally
adaptive shrinkage behavior.
The proposed hierarchical FAR(p) model in Chapter 4 may be extended to
incorporate exogenous predictors, stochastic volatility, time-varying autoregressive
kernels $\psi_\ell$, and regime shifts. These important generalizations may provide
broader applicability of the hierarchical FAR(p) model for longer, possibly non-
stationary functional time series data. As with the MFDLM, the dynamic linear
model framework offers a promising platform for pursuing these extensions,
and may be combined with dynamic shrinkage processes for additional locally
adaptive regularization.
APPENDIX A
A BAYESIAN MULTIVARIATE FUNCTIONAL DYNAMIC LINEAR
MODEL
To sample from the joint posterior distribution, we use a Gibbs sampler. Be-
cause the Gibbs sampler allows blocks of parameters to be conditioned on all
other blocks of parameters, it is a convenient approach for our model. First, hi-
erarchical dynamic linear model (DLM) algorithms typically require that βt and
θt be the only unknown components, which we can accommodate by conditioning
appropriately. Second, our sequential orthonormality approach for $f_k^{(c)}$ fits
nicely within a Gibbs sampler, and we can adapt the algorithms described in
Wand and Ormerod (2008). And third, the hierarchical structure of our model
imposes natural conditional independence assumptions, which allows us to eas-
ily partition the parameters into appropriate blocks.
A.1 Initialization
To initialize the factors $\beta_k^{(c)} = (\beta_{k,1}^{(c)}, \ldots, \beta_{k,T}^{(c)})'$ and the factor loading curves
(FLCs) $f_k^{(c)}$ for $k = 1, \ldots, K$ and $c = 1, \ldots, C$, we compute the singular value
decomposition (SVD) of the data matrix $Y^{(c)} = U^{(c)} \Sigma^{(c)} V^{(c)\prime}$ for $c = 1, \ldots, C$.
Note that to obtain a data matrix $Y^{(c)}$, with rows corresponding to times $t$ and
columns to observation points $\tau$, we need to estimate $Y_t^{(c)}(\tau)$ for any unobserved
$\tau$ at each time $t$, which may be computed quickly using splines. However,
these estimated data values are only used for the initialization step. Then,
letting $U_{1:K}^{(c)}$ be the first $K$ columns of $U^{(c)}$, $\Sigma_{1:K}^{(c)}$ be the upper left $K \times K$
submatrix of $\Sigma^{(c)}$, and $V_{1:K}^{(c)}$ be the first $K$ columns of $V^{(c)}$, we initialize the factors
$(\beta_1^{(c)}, \ldots, \beta_K^{(c)}) = U_{1:K}^{(c)} \Sigma_{1:K}^{(c)}$ and the FLCs $(f_1^{(c)}, \ldots, f_K^{(c)}) = V_{1:K}^{(c)}$, where $f_k^{(c)}$
is the vector of FLC $k$ evaluated at all observation points $\cup_t \mathcal{T}_t^{(c)}$ for outcome $c$.
The $f_k^{(c)}$ are orthonormal in the sense that $f_k^{(c)\prime} f_j^{(c)} = 1(k = j)$, but they are not
smooth. This approach is similar to the initializations in Matteson et al. (2011)
and Hays et al. (2012).
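The SVD initialization for a single outcome can be sketched as follows. The sketch assumes that the data matrix has already been completed on a common grid (for example by the spline imputation mentioned above, which is not shown); the function name and the toy matrix are hypothetical.

```python
import numpy as np

def svd_initialize(Y, K):
    """Initialize factors and FLCs from a complete (T, m) data matrix Y.

    Returns factors of shape (T, K), equal to U_{1:K} Sigma_{1:K}, and
    flcs of shape (m, K), equal to V_{1:K}.
    """
    U, s, Vt = np.linalg.svd(np.asarray(Y, dtype=float), full_matrices=False)
    factors = U[:, :K] * s[:K]   # scale each of the first K columns of U
    flcs = Vt[:K].T              # first K right singular vectors as columns
    return factors, flcs

rng = np.random.default_rng(1)
Y_toy = rng.normal(size=(30, 8))          # stand-in for an imputed data matrix
beta0, f0 = svd_initialize(Y_toy, K=3)
```

The columns of `f0` are orthonormal but not smooth, which matches the description of the initialized FLCs.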
Given the factors $\beta_k^{(c)}$ and the FLCs $f_k^{(c)}$, we can estimate each $\sigma_{(c)}^2$ (or more
generally, $E_t$) as a conditional maximum likelihood estimator (MLE), using the
likelihood from the observation level of model (2.1). Similarly, we can estimate
each $\lambda_k^{(c)}$ conditional on $f_k^{(c)}$ by maximizing the partially informative normal
likelihood. Then, given $\lambda_k^{(c)}$, $\sigma_{(c)}^2$, $\beta_k^{(c)}$, and $f_k^{(c)}$, we can estimate each $d_k^{(c)}$ by
normalizing the full conditional posterior expectation given in the main paper;
i.e., solving the relevant quadratic program and then normalizing the solution.
Initializations for the remaining levels proceed similarly as conditional MLEs,
but depend on the form chosen for $X_t$, $V_t$, $G_t$, and $W_t$. In our applications, this
conditional MLE approach produces reasonable starting values for all variables.
A.1.1 Common Factor Loading Curves
If we wish to implement the common FLCs model $f_k^{(c)} = f_k$ for all $k, c$, then
we instead compute the SVD of the stacked data matrices $(Y^{(1)\prime}, \ldots, Y^{(C)\prime})' = U \Sigma V'$,
where now the data matrices $Y^{(1)}, \ldots, Y^{(C)}$ are imputed using splines
for all observation points for all outcomes, $\cup_{t,c} \mathcal{T}_t^{(c)}$, and therefore have the same
number of columns. Alternatively, we may improve computational efficiency
by choosing a small yet representative subset of observation points $\mathcal{T}^* \subset \cup_{t,c} \mathcal{T}_t^{(c)}$
and then estimating each data matrix $Y^{(c)}$ for all $\tau \in \mathcal{T}^*$. Let $U_{1:K}^{(c)}$ be the first
$K$ columns of $U^{(c)}$, where the $U^{(c)}$, $c = 1, \ldots, C$, correspond to the outcome-specific
blocks of $U = (U^{(1)\prime}, \ldots, U^{(C)\prime})'$. Then, similar to before, we set
$(\beta_1^{(c)}, \ldots, \beta_K^{(c)}) = U_{1:K}^{(c)} \Sigma_{1:K}$ for $c = 1, \ldots, C$, and $(f_1, \ldots, f_K) = V_{1:K}$, where
$\Sigma_{1:K}$ is the upper left $K \times K$ submatrix of $\Sigma$ and $V_{1:K}$ is the first $K$ columns of
$V$. Again, the $f_k$ are unsmoothed with $f_k' f_j = 1(k = j)$, but now the initialized
FLCs are common for $c = 1, \ldots, C$. Initialization of the remaining parameters
proceeds as before, but now with $\lambda_k^{(c)} = \lambda_k$ and $d_k^{(c)} = d_k$, which can be obtained
by maximizing the relevant conditional likelihoods under the common FLCs
model.
A.2 Sampling
A.2.1 General Algorithm
For greater generality, we present our sampling algorithm for non-common
FLCs; i.e., we retain dependence on $c$ for $d_k^{(c)}$ and $\lambda_k^{(c)}$. When applicable, we
discuss the necessary modifications for the common FLCs model.
The algorithm proceeds in four main blocks:
1. Sample the basis coefficients $d_k^{(c)}$ and the smoothing parameters $\lambda_k^{(c)}$ for
the FLCs. For $\lambda_k^{(c)}$, we use a Gamma($\gamma_1, \gamma_2$) prior distribution, which is
conjugate to the partially informative normal likelihood and implies that
the full conditional posterior distribution is
Gamma($\gamma_1 + \mathrm{rank}(\Omega_\phi)/2$, $\gamma_2 + d_k^{(c)\prime} \Omega_\phi d_k^{(c)}/2$).
For the common FLCs model, we simply replace $d_k^{(c)}$ with
$d_k$ to obtain the full conditional posterior for $\lambda_k$. We use the hyperparameters
$\gamma_1 = \gamma_2 = 0.001$, although the effect of the hyperparameters is negligible
as long as $\gamma_1$ and $\gamma_2$ are small relative to $\mathrm{rank}(\Omega_\phi)/2$ and $d_k^{(c)\prime} \Omega_\phi d_k^{(c)}/2$,
respectively. After sampling the $\lambda_k^{(c)}$, we sample and then normalize the
$d_k^{(c)}$ with a modified version of the efficient Cholesky decomposition
approach of Wand and Ormerod (2008):
(a) Compute the (lower triangular) Cholesky decomposition $B_k^{-1} = \bar{B}_L \bar{B}_L'$;
(b) If $k = 1$, set $L_{1:(k-1)}' \Lambda_{1:(k-1)} = 0$;
If $k > 1$, use forward substitutions to obtain $\bar{x}$ and $\bar{y}$ from the equations
$\bar{B}_L \bar{x} = L_{1:(k-1)}'$ and $\bar{B}_L \bar{y} = b_k$, and let $\Lambda_{1:(k-1)}$ be the solution to
the regression of $\bar{y}$ on $\bar{x}$;
(c) Use forward substitution to obtain $\bar{b}$ as the solution to $\bar{B}_L \bar{b} = b_k$,
then use backward substitution to obtain $d_k^*$ as the solution to
$\bar{B}_L' d_k^* = \bar{b} + \bar{z}$, where $\bar{z} \sim N(0, I_{(M+4) \times (M+4)})$;
(d) Retain the vector $d_k^{(c)} = d_k^* / \sqrt{d_k^{*\prime} J_\phi d_k^*}$ and set $\beta_k^{(c)} = \sqrt{d_k^{*\prime} J_\phi d_k^*} \, \beta_k^{(c)}$.
The definitions of $B_k$ and $b_k$ depend on whether or not we use the common
FLCs model with $f_k^{(c)} = f_k$. Compared with unconstrained Bayesian
splines, the extra orthogonality step (b) uses the Cholesky decomposition,
which we must compute regardless, and adds only the computational
cost of a simple linear regression for each $k > 1$, which is perhaps
expected in light of Theorem 2.1. The scaling of $d_k^{(c)}$ and $\beta_k^{(c)}$ in (d) enforces
the unit-norm constraint on $f_k^{(c)}$ yet ensures that $f_k^{(c)}(\tau) \beta_k^{(c)}$, which
appears in the posterior distribution of $d_j^{(c)}$ for all $j \neq k$, is unaffected by
the normalization.
2. Sample the factors $\beta_t$ (and $\theta_t$, if present) conditional on all other parameters
in (2.1) using either the DLM implementation of forward filtering backward
sampling (e.g., Petris et al. (2009)) or the state space sampler of Durbin
and Koopman (2002); Koopman and Durbin (2003, 2000), the latter of
which is optimized when $E_t$ is diagonal. For general hierarchical models,
we may modify the hierarchical DLM algorithms of Gamerman and
Migon (1993).
For the prior distributions, we only need to specify the distribution of $\beta_0$
(and $\theta_0$); the remaining distributions are computed recursively using $F$,
$X_t$, $G_t$ and the error variances. For simplicity, we let $\beta_{k,0}^{(c)} \overset{iid}{\sim} N(0, 10^6)$,
which is a common choice for DLMs. Alternatively, we could use past
data not included in our analysis to estimate these initial values. However,
the resulting estimates for $t > 1$ in our applications are not noticeably
different.
3. Sample the state evolution matrix $G_t$ (if unknown). $G_t$ may have a special
form (see Section A.2.2) or provide a more common time series model such
as a VAR. In the latter case, we may choose some structure for $G_t = G$,
e.g. diagonality to allow dependence between $\beta_{k,t}^{(c)}$ and $\beta_{k,t-1}^{(c)}$, or $K$ blocks
of dimension $C \times C$ to allow dependence between $\beta_{k,t}^{(c)}$ and $\beta_{k,t-1}^{(c')}$ for
$c, c' = 1, \ldots, C$. A simple choice of prior for the nonzero entries of $G$ is
iid $N(0, 10^6)$, which is conjugate to the likelihood induced by (2.1). Under
this prior, it is straightforward to derive the posterior distribution of
$\mathrm{vec}_0(G)$, where $\mathrm{vec}_0$ stacks the nonzero entries of the matrix (by column)
into a vector.
4. Sample each of the remaining error variance parameters individually: $E_t$,
$V_t$, and $W_t$. These distributions depend on our assumptions for the
model structure, but we typically prefer conjugate priors when available.
For example, in the random walk factor model of (8), we have
$\beta_{k,i,s,t} = \beta_{k,i,s,t-1} + \omega_{k,i,s,t}$ with $\omega_{k,i,s,t} \overset{indep}{\sim} N(0, W_k)$. Using the Wishart
prior $W_k^{-1} \sim \mathrm{Wishart}((\rho R)^{-1}, \rho)$, the full conditional posterior distribution
for the precision is $W_k^{-1} \sim \mathrm{Wishart}((\rho R + \sum_{i,s,t} w_{k,i,s,t} w_{k,i,s,t}')^{-1}, \rho + T)$,
where $w_{k,i,s,t} = \beta_{k,i,s,t} - \beta_{k,i,s,t-1}$ is conditional on the factors and $T =
(15)(40)(8) = 4800$ counts the indices $(i, s, t)$. We let $R^{-1} = I_{C \times C}$, which is
the expected prior precision, and $\rho = C \geq \mathrm{rank}(R^{-1})$.
For the stochastic volatility model of Section 2.4.1, we use the distributions
given in Kim et al. (1998). In particular, letting $\sigma_{k,(c),t}^2 = \exp(h_{k,t}^{(c)})$, Kim
et al. (1998) propose the model $h_{k,t}^{(c)} = \xi_{k,0}^{(c)} + \xi_{k,1}^{(c)} (h_{k,t-1}^{(c)} - \xi_{k,0}^{(c)}) + \zeta_{k,t}^{(c)}$, where
$\zeta_{k,t}^{(c)} \overset{indep}{\sim} N(0, \sigma_{H,k,(c)}^2)$ for $t = 2, \ldots, T$ and $h_{k,1}^{(c)} \sim N(\xi_{k,0}^{(c)}, \sigma_{H,k,(c)}^2 / (1 - (\xi_{k,1}^{(c)})^2))$
with $|\xi_{k,1}^{(c)}| < 1$ for stationarity. Kim et al. (1998) also suggest priors
for $\xi_{k,0}^{(c)}$, $\xi_{k,1}^{(c)}$, and $\sigma_{H,k,(c)}^2$ and provide an efficient MCMC sampling algorithm.
For additional motivation for the stochastic volatility approach
over GARCH models, see Daníelsson (1998).
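To make the stochastic volatility specification in block 4 concrete, the following sketch simulates one log-volatility path from the stationary AR(1) model stated there. It only simulates forward from the model for given parameter values and is not the posterior sampling algorithm of Kim et al. (1998); the function name and the parameter values are hypothetical.

```python
import numpy as np

def simulate_log_volatility(T, xi0, xi1, sigma_h, rng):
    """Simulate h_1, ..., h_T from the stationary AR(1) log-volatility model.

    h_1 is drawn from the stationary distribution
    N(xi0, sigma_h^2 / (1 - xi1^2)), and for t >= 2,
    h_t = xi0 + xi1 * (h_{t-1} - xi0) + zeta_t with zeta_t ~ N(0, sigma_h^2).
    """
    if not abs(xi1) < 1:
        raise ValueError("stationarity requires |xi1| < 1")
    h = np.empty(T)
    h[0] = rng.normal(xi0, sigma_h / np.sqrt(1.0 - xi1 ** 2))
    for t in range(1, T):
        h[t] = xi0 + xi1 * (h[t - 1] - xi0) + rng.normal(0.0, sigma_h)
    return h

rng = np.random.default_rng(2)
h = simulate_log_volatility(T=500, xi0=-1.0, xi1=0.95, sigma_h=0.2, rng=rng)
sigma2 = np.exp(h)   # time-dependent variances, positive by construction
```

Values of `xi1` close to one produce the persistent swings in `sigma2` that are typically described as volatility clustering.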
Recall that we construct a posterior distribution of $d_k^{(c)}$ without the unit norm
constraint, and then normalize the samples from this distribution. As a result,
the conditions of Theorem 2.1 are satisfied and the (unnormalized) full conditional
posterior distribution of $d_k^{(c)}$ is Gaussian, both of which are convenient
results. The normalization step 1.(d) is interpretable, corresponding to the projection
of a Gaussian distribution onto the unit sphere. Note that rescaling the
factors $\beta_k^{(c)}$ in 1.(d) does not affect the remainder of the sampling algorithm
(steps 2. - 4.). The rescaled $\beta_k^{(c)}$ are from the previous MCMC iteration, which
does not affect the full conditional distributions of step 2. in the current MCMC
iteration. The subsequent steps 3., 4., and 1. are then conditional on the newly
sampled factors $\beta_k^{(c)}$ from step 2., which have not been rescaled.
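Steps (a), (c), and (d) of block 1 can be sketched for the case $k = 1$, in which the orthogonality step (b) contributes nothing. The sketch draws from a Gaussian with precision matrix $B_k^{-1}$ and mean $B_k b_k$ using the Cholesky factor of the precision, then normalizes. The inputs `B_inv`, `b`, and `J_phi` are hypothetical stand-ins for the quantities defined in the main text, and `np.linalg.solve` is used for brevity where a dedicated triangular solver would exploit the structure of the factor.

```python
import numpy as np

def sample_normalized_coefficients(B_inv, b, J_phi, rng):
    """Draw d* ~ N(B b, B) from the precision B_inv, then normalize.

    Returns (d, scale) with d = d* / scale and scale = sqrt(d*' J_phi d*).
    """
    L = np.linalg.cholesky(B_inv)             # (a) lower triangular, B_inv = L L'
    b_bar = np.linalg.solve(L, b)             # (c) forward solve: L b_bar = b
    z = rng.standard_normal(len(b))
    d_star = np.linalg.solve(L.T, b_bar + z)  # (c) backward solve: L' d* = b_bar + z
    scale = float(np.sqrt(d_star @ J_phi @ d_star))
    return d_star / scale, scale              # (d) unit norm with respect to J_phi

rng = np.random.default_rng(3)
A = rng.normal(size=(6, 6))
B_inv = A @ A.T + 6 * np.eye(6)               # hypothetical precision matrix
b = rng.normal(size=6)
J_phi = np.eye(6) + 0.1 * np.ones((6, 6))     # hypothetical positive definite matrix
d, scale = sample_normalized_coefficients(B_inv, b, J_phi, rng)
beta_old = rng.normal(size=10)                # stand-in factors for this k
beta_new = scale * beta_old                   # (d) compensating rescaling
```

Dividing the coefficients by `scale` while multiplying the factors by the same amount leaves every product of a coefficient and a factor unchanged, which is the invariance noted in step (d).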
A.2.2 Sampling the Common Trend Hidden Markov Model
Recall the common trend hidden Markov model for the factors, k = 1, . . . , K: ∑∆D (1) (1) (1) (1) (1) (1) βk,t = ωk,t , ω
r
k,t =∑i=1 ψk,iω k,t−i
+ σk,(1),tzk,t
 (A.1)(c) (c) (c) (1) (c) (c) (c) (c) (c)∆Dβ D rk,t = sk,t(γk ∆ βk,t ) + ωk,t , ωk,t = i=1 ψk,iωk,t−i + σk,(c),tzk,t
for c = 2, . . . , C, where ∆ is the differencing operator, D is the degree of
d{ifferencing, (c)γk }∈ R is the economy-specific slope term for each factor,
(c)
sk,t : t = 1 . . . , T is a discrete Markov chain with states {0, 1}, σ2k,(c),t are the
time-dependent error variances, and (c) iidzk,t ∼ N(0, 1). We specify iid N(0, 106)
priors for (c)γk , which are conjugate to the likelihood in (A.1).
We can express (A.1) as the βt = θt-level in (2.1) with Xt = ICK×CK and
Vt = 0CK×CK . Let Lβt = ICK×CK −Qt,  0K×K 0K×K · · · 0K×K  (2) St γ(2) 0K×K · · · 0K×KQt =  . . . . 
 ,.. .. . . .. 
(C)
S γ(C)t 0K×K · · · 0K×K
where (c)St = diag
(c)
({s K (c)k,t}k=1) and γ = diag {
(c)
( γ Kk }k=1). Note that L
−1
βt =
ICK×CK + Qt. To derive the state evolution matrix Gt, we can modify
the standard ARIMA(r,D,0) framework for DLMs to incorporate the (c)sk,t-
dependent common trend. For example, when D = r = 1, we have (c)∆βk,t −
(c) (c) (1) (c) (c) (c) (c) (1) (c)
sk,tγk ∆βk,t = ψk (∆βk,t−1 − sk,t−1γk ∆βk,t−1) + σk,(c),tzk,t which can be rewrit-
ten as (c)− (c) (c) (1) (c) (c) (c) (c) (c) (c) (1) (c) (c)βk,t sk,tγk βk,t = (1 +ψk )βk,t−1− (sk,t + sk,t−1ψk )γk βk,t−1−ψk βk,t−2 +
(c) (c) (c) (1) (c)
ψk sk,t−1γk βk,t−2 + σk,(c),tzk,t . The left side of this equation is given by the el-
ements of Lβtβt, while the right side may clearly be expressed using a simple
modification of the standard ARIMA DLM state evolution matrix G. In vector
116
notation, we have        Lβt 0CK×CK βt   Gt,1 Gt,2 βt−1  ω̃t = +
0CK×CK ICK×CK βt−1 0CK×CK ICK×CK βt−2 ω̃t−1
(A.2)
where  
 0K×K 0K×(C−1)K − (2) (2)(S + S Ψ(2))γ(2) t t−1 0K×(C−1)K
Gt,1 = (ICK×CK + Ψ) + ... ... 
 ,
− (C) (C)(St + S Ψ(C))γ(C)t−1 0K×(C−1)K
 
 0K×K 0K×(C−1)K (2)S (2) (2) −  t−1
Ψ γ 0K×(C−1)K
Gt,2 = Ψ + .. .. 
, and
. . 
 (C)S (C) (C)t−1Ψ γ 0K×(C−1)K
ω̃t Wt 0CK×CK
Var   =   ,
ω̃t−1 0CK×CK 0CK×CK
with diag { (c) (c)Ψ = ( ψk }k,c), Wt = diag({σ2k,(c),t}k,c), and ω̃t has elements ω̃k,t =
(c)
σk,(c),tzk,t , which are the residuals from the AR(r) process in (A.1).
Many of these matrix multiplications involve diagonal matrices, and there-
fore may be computed quickly. The error variance is not a proper variance ma-
trix, but is commonly used for sampling DLMs with multiple lags or differenc-
ing. Note that to write (2.1) in this form, we must also append CK columns of
zeros to F(τ), since Yt(τ) depends on βt but not on βt−1.
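Inverting the block diagonal matrix $\tilde{L}_{\beta_t} = \mathrm{bdiag}(L_{\beta_t}, I_{CK \times CK})$, we obtain
$\tilde{L}_{\beta_t}^{-1} = \mathrm{bdiag}(L_{\beta_t}^{-1}, I_{CK \times CK}) = \mathrm{bdiag}(I_{CK \times CK} + Q_t, I_{CK \times CK})$. Therefore, we
can rewrite (A.2) as
$$
\begin{pmatrix} \beta_t \\ \beta_{t-1} \end{pmatrix}
= \begin{pmatrix} G_{t,1} + Q_t G_{t,1} & G_{t,2} + Q_t G_{t,2} \\ 0_{CK \times CK} & I_{CK \times CK} \end{pmatrix}
\begin{pmatrix} \beta_{t-1} \\ \beta_{t-2} \end{pmatrix}
+ \tilde{L}_{\beta_t}^{-1} \begin{pmatrix} \tilde{\omega}_t \\ \tilde{\omega}_{t-1} \end{pmatrix}
\tag{A.3}
$$
where the error variance has the same block form as previously, but with
$W_t$ replaced by $L_{\beta_t}^{-1} W_t (L_{\beta_t}^{-1})' = (I_{CK \times CK} + Q_t) W_t (I_{CK \times CK} + Q_t)' = W_t +
Q_t W_t + (Q_t W_t)' + Q_t W_t Q_t'$. Letting $\sigma_{(c),t}^2 = \mathrm{diag}(\{\sigma_{k,(c),t}^2\}_{k=1}^K)$ so that $W_t =
\mathrm{bdiag}(\sigma_{(1),t}^2, \ldots, \sigma_{(C),t}^2)$, we may compute the relevant terms explicitly:
$$
Q_t W_t = \begin{pmatrix}
0_{K \times K} & 0_{K \times K} & \cdots & 0_{K \times K} \\
S_t^{(2)} \gamma^{(2)} \sigma_{(1),t}^2 & 0_{K \times K} & \cdots & 0_{K \times K} \\
\vdots & \vdots & \ddots & \vdots \\
S_t^{(C)} \gamma^{(C)} \sigma_{(1),t}^2 & 0_{K \times K} & \cdots & 0_{K \times K}
\end{pmatrix}
$$
and
$$
Q_t W_t Q_t' = \begin{pmatrix}
0_{K \times K} & 0_{K \times K} & \cdots & 0_{K \times K} \\
0_{K \times K} & S_t^{(2)} \gamma^{(2)} \sigma_{(1),t}^2 S_t^{(2)} \gamma^{(2)} & \cdots & S_t^{(2)} \gamma^{(2)} \sigma_{(1),t}^2 S_t^{(C)} \gamma^{(C)} \\
\vdots & \vdots & \ddots & \vdots \\
0_{K \times K} & S_t^{(C)} \gamma^{(C)} \sigma_{(1),t}^2 S_t^{(2)} \gamma^{(2)} & \cdots & S_t^{(C)} \gamma^{(C)} \sigma_{(1),t}^2 S_t^{(C)} \gamma^{(C)}
\end{pmatrix}
$$
where again, the component terms are all diagonal, and therefore can be reordered
for convenience. Combining terms and simplifying, the nonzero upper
left block of the error variance matrix, $L_{\beta_t}^{-1} W_t (L_{\beta_t}^{-1})'$, is
$$
\begin{pmatrix}
\sigma_{(1),t}^2 & S_t^{(2)} \gamma^{(2)} \sigma_{(1),t}^2 & \cdots & S_t^{(C)} \gamma^{(C)} \sigma_{(1),t}^2 \\
S_t^{(2)} \gamma^{(2)} \sigma_{(1),t}^2 & \sigma_{(2),t}^2 + S_t^{(2)} (\gamma^{(2)})^2 \sigma_{(1),t}^2 & \cdots & S_t^{(2)} S_t^{(C)} \gamma^{(2)} \gamma^{(C)} \sigma_{(1),t}^2 \\
\vdots & \vdots & \ddots & \vdots \\
S_t^{(C)} \gamma^{(C)} \sigma_{(1),t}^2 & S_t^{(2)} S_t^{(C)} \gamma^{(2)} \gamma^{(C)} \sigma_{(1),t}^2 & \cdots & \sigma_{(C),t}^2 + S_t^{(C)} (\gamma^{(C)})^2 \sigma_{(1),t}^2
\end{pmatrix}
$$
When $s_{k,t}^{(c)} = 1$, $c > 1$, the slope parameter $\gamma_k^{(c)}$ may increase or decrease the
error variance of the residuals $\tilde{\omega}_{k,t}^{(c)}$ at time $t$, and determines the contemporaneous
covariance between $\tilde{\omega}_{k,t}^{(c)}$ and $\tilde{\omega}_{k,t}^{(1)}$. Similarly, when $s_{k,t}^{(c)} = s_{k,t}^{(c')} = 1$, the
product $\gamma_k^{(c)} \gamma_k^{(c')} \sigma_{k,(1),t}^2$ determines the contemporaneous covariance between
$\tilde{\omega}_{k,t}^{(c)}$ and $\tilde{\omega}_{k,t}^{(c')}$ at time $t$. Note that as long as there exist distinct times $t, t'$ such that
$s_{k,t}^{(c)} \neq s_{k,t'}^{(c)}$, the slopes and volatilities are identifiable for each $k$ and $c$. Therefore,
the common trend hidden Markov model of (A.1) provides a flexible, time-dependent
contemporaneous covariance structure within a relatively simple regression
framework.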
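The identity $L_{\beta_t}^{-1} = I + Q_t$ holds because $Q_t Q_t = 0$: the nonzero blocks of $Q_t$ occupy the first block column, while its first block row is zero. The sketch below checks this numerically and confirms a few entries of the error variance block for a small example with $C = 3$ and $K = 2$. The helper `build_Q` and all numerical values are hypothetical.

```python
import numpy as np

def build_Q(S, gamma):
    """Assemble Q_t from (C, K) arrays of states S and slopes gamma.

    Block row c (for c >= 2) holds diag(S[c] * gamma[c]) in the first block
    column; the row for c = 1 (index 0) is zero and its inputs are ignored.
    """
    C, K = S.shape
    Q = np.zeros((C * K, C * K))
    for c in range(1, C):
        Q[c * K:(c + 1) * K, :K] = np.diag(S[c] * gamma[c])
    return Q

# Hypothetical values: rows index c = 1, 2, 3 and columns index k = 1, 2.
S = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
gamma = np.array([[0.0, 0.0], [0.5, -1.2], [2.0, 0.3]])
sigma2 = np.array([[1.0, 2.0], [0.5, 0.7], [0.2, 0.9]])

Q = build_Q(S, gamma)
I = np.eye(Q.shape[0])
W = np.diag(sigma2.ravel())               # W_t = bdiag(sigma2_(1), ..., sigma2_(C))
L_inv = np.linalg.inv(I - Q)              # numerical inverse of L = I - Q
V = L_inv @ W @ L_inv.T                   # transformed error variance block
```

In this example, the entry of `V` for $c = 2$, $k = 1$ equals $0.5 + 0.5^2 \cdot 1.0 = 0.75$, and the covariance between $c = 2$ and $c = 3$ for $k = 1$ equals $0.5 \cdot 2.0 \cdot 1.0 = 1.0$, matching the closed-form block above.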
A.3 Additional Figures
[Figure A.1 panels: Lower 95% HPD Interval, Posterior Mean, Upper 95% HPD Interval; each plots Frequency (Hz) against Time Bin.]
Figure A.1: Pointwise 95% HPD intervals and the posterior mean for $\bar{\mu}_t^{(1)}$, which
is the average difference in the PFC log-spectra between the FC and FS trials.
The black vertical lines indicate $t^*$.
[Figure A.2 panels: Lower 95% HPD Interval, Posterior Mean, Upper 95% HPD Interval; each plots Frequency (Hz) against Time Bin.]
Figure A.2: Pointwise 95% HPD intervals and the posterior mean for $\bar{\mu}_t^{(2)}$, which
is the average difference in the PFC log-spectra between the FC and FS trials.
The black vertical lines indicate $t^*$.
[Figure A.3 panels: rows Fed, BOE, ECB, BOC; columns k = 1, 2, 3, 4; each plots Squared Factor Residuals against Dates (2006 to 2014).]
Figure A.3: The observed volatility clustering from the yield curve application. The black lines are the posterior means
of the squared residuals from the AR(1) process on the $\omega_{k,t}^{(c)}$ in the common trend hidden Markov model of Section 2.4.1.
The red lines are the posterior means of the corresponding volatility estimates $\sigma_{k,(c),t}^2$ discussed in Section 2.4.1.
S q u a r e d  F a c t o r  R e s i d u a l s S q u a r e d  F a c t o r  R e s i d u a l s S q u a r e d  F a c t o r  R e s i d u a l s S q u a r e d  F a c t o r  R e s i d u a l s
0 2 4 0 4 0 8 0 0 3 0 6 0 0 2 0 5 0
S q u a r e d  F a c t o r  R e s i d u a l s S q u a r e d  F a c t o r  R e s i d u a l s S q u a r e d  F a c t o r  R e s i d u a l s S q u a r e d  F a c t o r  R e s i d u a l s
0 . 0 1 . 5 3 . 0 0 4 8 0 4 8 1 2 0 4 8
S q u a r e d  F a c t o r  R e s i d u a l s S q u a r e d  F a c t o r  R e s i d u a l s S q u a r e d  F a c t o r  R e s i d u a l s S q u a r e d  F a c t o r  R e s i d u a l s
0 . 0 0 0 . 2 0 0 4 8 1 2 0 . 0 1 . 0 2 . 0 0 . 0 1 . 5 3 . 0
S q u a r e d  F a c t o r  R e s i d u a l s S q u a r e d  F a c t o r  R e s i d u a l s S q u a r e d  F a c t o r  R e s i d u a l s S q u a r e d  F a c t o r  R e s i d u a l s
0 . 0 0 . 4 0 . 8 0 2 4 6 0 2 4 6 0 2 4
APPENDIX B
DYNAMIC SHRINKAGE PROCESSES
Proof. (Proposition 3.1) Proposition 3.1 follows from Proposition 3.2 with µz =
0.
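Proof. (Proposition 3.2) Let $\eta \sim Z(\alpha, \beta, \mu_z, 1)$ with density (3.3), i.e.,
\[
[z] = \left[\sigma_z B(\alpha, \beta)\right]^{-1} \left\{\exp\left[(z - \mu_z)/\sigma_z\right]\right\}^{\alpha} \left\{1 + \exp\left[(z - \mu_z)/\sigma_z\right]\right\}^{-(\alpha+\beta)}.
\]
The density of $\lambda^2 = \exp(\eta)$ is
\[
\begin{aligned}
[\lambda^2] &\propto (\lambda^2)^{-1} \left\{\exp\left[\log(\lambda^2) - \mu_z\right]\right\}^{\alpha} \left\{1 + \exp\left[\log(\lambda^2) - \mu_z\right]\right\}^{-(\alpha+\beta)} \\
&\propto (\lambda^2)^{\alpha-1} \left[1 + \lambda^2/\exp(\mu_z)\right]^{-(\alpha+\beta)}
\end{aligned}
\]
and therefore the density of $\kappa = 1/(1 + \lambda^2)$ is
\[
\begin{aligned}
[\kappa] &\propto \kappa^{-2} \left[\kappa^{-1} - 1\right]^{\alpha-1} \left\{1 + (\kappa^{-1} - 1)/\exp(\mu_z)\right\}^{-(\alpha+\beta)} \\
&\propto \kappa^{-2-(\alpha-1)} (1 - \kappa)^{\alpha-1} \left[\kappa^{-1}\right]^{-(\alpha+\beta)} \left[\kappa \exp(\mu_z) + (1 - \kappa)\right]^{-(\alpha+\beta)} \\
&\propto (1 - \kappa)^{\alpha-1} \kappa^{\beta-1} \left[\kappa \exp(\mu_z) + (1 - \kappa)\right]^{-(\alpha+\beta)},
\end{aligned}
\]
i.e., $\kappa \sim TPB(\beta, \alpha, \exp(\mu_z))$.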
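The change of variables above can be checked numerically. The following Python sketch evaluates the density of κ = 1/(1 + exp(η)) implied by the Z density (3.3) with σz = 1 and compares it, on a grid, with the unnormalized TPB kernel from the last display. The function names and parameter values are illustrative assumptions, not part of the original derivation; the check only confirms that the two expressions are proportional.

```python
import numpy as np
from math import lgamma

def log_z_density(z, alpha, beta, mu_z=0.0, sigma_z=1.0):
    """Log of the Z(alpha, beta, mu_z, sigma_z) density in (3.3)."""
    u = (z - mu_z) / sigma_z
    log_beta_fn = lgamma(alpha) + lgamma(beta) - lgamma(alpha + beta)
    # log(1 + exp(u)) is computed stably with logaddexp
    return (-np.log(sigma_z) - log_beta_fn
            + alpha * u - (alpha + beta) * np.logaddexp(0.0, u))

def log_kappa_density(kappa, alpha, beta, mu_z):
    """Log density of kappa = 1/(1 + exp(eta)) by change of variables."""
    eta = np.log1p(-kappa) - np.log(kappa)             # eta = log((1 - kappa)/kappa)
    log_jacobian = -np.log(kappa) - np.log1p(-kappa)   # |d eta/d kappa| = 1/(kappa(1 - kappa))
    return log_z_density(eta, alpha, beta, mu_z) + log_jacobian

def log_tpb_kernel(kappa, alpha, beta, mu_z):
    """Unnormalized kernel (1-k)^(alpha-1) k^(beta-1) [k e^mu + (1-k)]^-(alpha+beta)."""
    return ((alpha - 1.0) * np.log1p(-kappa) + (beta - 1.0) * np.log(kappa)
            - (alpha + beta) * np.log(kappa * np.exp(mu_z) + (1.0 - kappa)))

def ratio_is_constant(alpha, beta, mu_z):
    """True if the two log densities differ only by an additive constant."""
    kappa_grid = np.linspace(0.01, 0.99, 99)
    log_ratio = (log_kappa_density(kappa_grid, alpha, beta, mu_z)
                 - log_tpb_kernel(kappa_grid, alpha, beta, mu_z))
    return bool(np.allclose(log_ratio, log_ratio[0]))

# Hypothetical parameter values, chosen only for illustration
horseshoe_ok = ratio_is_constant(alpha=0.5, beta=0.5, mu_z=-1.3)
```

A constant log ratio across the grid is what the proportionality statement requires; the value of the constant depends on the normalization convention used for the TPB density.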
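Proof. (Theorem 3.1) Under model (3.2), i.e.,
\[
h_{t+1} = \mu + \phi(h_t - \mu) + \eta_t, \qquad \eta_t \overset{iid}{\sim} Z(\alpha, \beta, 0, 1),
\]
we have $[h_{t+1} | h_t, \phi, \mu] \sim Z(\alpha, \beta, \mu + \phi(h_t - \mu), 1)$. Using Proposition 3.2, the
conditional distribution for $\kappa_{t+1}$ is $[\kappa_{t+1} | h_t, \phi, \mu] \sim TPB(\beta, \alpha, \exp(\mu + \phi(h_t - \mu)))$.
By substituting $\tau^2 = \exp(\mu)$ and $\lambda_t^2 = \exp(h_t - \mu)$, we equivalently have
$[\kappa_{t+1} | \lambda_t, \phi, \tau] \sim TPB(\beta, \alpha, \tau^2 \lambda_t^{2\phi})$. Noting
$\tau^2 \lambda_t^{2\phi} = \tau^{2(1-\phi)} \left[(1 - \kappa_t)/\kappa_t\right]^{\phi}$ completes the
proof.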
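To make the recursion concrete, here is a minimal Python simulation of the log-volatility process and the implied shrinkage parameters. It assumes the standard fact that if X ~ Beta(α, β) then log(X/(1 − X)) ~ Z(α, β, 0, 1), and it uses the initialization h1 = µ + η0; the function name, seed, and parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def simulate_dynamic_shrinkage(T, mu, phi, alpha=0.5, beta=0.5, seed=0):
    """Simulate h_{t+1} = mu + phi (h_t - mu) + eta_t with Z-distributed innovations."""
    rng = np.random.default_rng(seed)
    # Logit of a Beta(alpha, beta) draw is Z(alpha, beta, 0, 1); clip to avoid log(0)
    x = np.clip(rng.beta(alpha, beta, size=T), 1e-12, 1.0 - 1e-12)
    eta = np.log(x) - np.log1p(-x)
    h = np.empty(T)
    h[0] = mu + eta[0]
    for t in range(T - 1):
        h[t + 1] = mu + phi * (h[t] - mu) + eta[t + 1]
    kappa = 1.0 / (1.0 + np.exp(h))   # shrinkage parameter kappa_t = 1/(1 + exp(h_t))
    return h, kappa

mu, phi = -2.0, 0.8                   # illustrative values only
h, kappa = simulate_dynamic_shrinkage(T=200, mu=mu, phi=phi)

# The two expressions for the TPB scale parameter in the proof
tau2 = np.exp(mu)
lambda2 = np.exp(h - mu)
scale_from_lambda = tau2 * lambda2 ** phi
scale_from_kappa = tau2 ** (1.0 - phi) * ((1.0 - kappa) / kappa) ** phi
```

Both `scale_from_lambda` and `scale_from_kappa` equal exp(µ + φ(ht − µ)) up to floating-point error, which is the identity used in the final step of the proof.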
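Proof. (Theorem 3.2) Let $\gamma_t = \left[(1 - \kappa_t)/\kappa_t\right]^{\phi}$ and note that $\kappa \mapsto \kappa^{-1/2}$ and
$\kappa \mapsto \left[1 + (\gamma_t - 1)\kappa\right]^{-1}$ are decreasing in $\kappa$ for $\gamma_t > 1$. It follows that, for $\gamma_t > 1$,
\[
\begin{aligned}
P\left(\kappa_{t+1} > \varepsilon \,\middle|\, \{\kappa_s\}_{s \le t}, \phi\right)
&= \int_{\varepsilon}^{1} \pi^{-1} \gamma_t^{1/2} \kappa_{t+1}^{-1/2} (1 - \kappa_{t+1})^{-1/2} \left[1 + (\gamma_t - 1)\kappa_{t+1}\right]^{-1} d\kappa_{t+1} \\
&\le \pi^{-1} \gamma_t^{1/2} \varepsilon^{-1/2} \left[1 + (\gamma_t - 1)\varepsilon\right]^{-1} \int_{\varepsilon}^{1} (1 - \kappa_{t+1})^{-1/2} d\kappa_{t+1} \\
&\le 2\pi^{-1} \varepsilon^{-1/2} (1 - \varepsilon)^{1/2} \frac{\gamma_t^{1/2}}{1 + (\gamma_t - 1)\varepsilon},
\end{aligned}
\]
which converges to zero as $\kappa_t \to 0$, since $\kappa_t \to 0$ implies $\gamma_t \to \infty$.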
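Proof. (Theorem 3.3) Marginalizing over $\omega_t$, the likelihood is
$[y_{t+1} | \{\kappa_s\}] \overset{indep}{\sim} N(0, \kappa_{t+1}^{-1})$. From Theorem 3.1, the posterior distribution of
$\kappa_{t+1}$ may be computed as
\[
\begin{aligned}
[\kappa_{t+1} | y_{t+1}, \{\kappa_s\}_{s \le t}, \phi, \tau]
&\propto \left\{\kappa_{t+1}^{\beta-1} (1 - \kappa_{t+1})^{\alpha-1} \left[1 + (\gamma_t - 1)\kappa_{t+1}\right]^{-(\alpha+\beta)}\right\}
\times \left\{\kappa_{t+1}^{1/2} \exp\left(-y_{t+1}^2 \kappa_{t+1}/2\right)\right\} \\
&\propto (1 - \kappa_{t+1})^{-1/2} \left[1 + (\gamma_t - 1)\kappa_{t+1}\right]^{-1} \exp\left(-y_{t+1}^2 \kappa_{t+1}/2\right)
\end{aligned}
\]
for $\alpha = \beta = 1/2$, where $\gamma_t = \tau^{2(1-\phi)} \left[(1 - \kappa_t)/\kappa_t\right]^{\phi}$. Defining $p_1(\kappa) = (1 - \kappa)^{-1/2}$,
$p_2(\kappa | \gamma_t) = \left[1 + (\gamma_t - 1)\kappa\right]^{-1}$, and $p_3(\kappa | y_{t+1}) = \exp\left(-y_{t+1}^2 \kappa/2\right)$ for $\kappa \in (0, 1)$,
observe that $p_1(\cdot)$ is increasing in $\kappa$, $p_2(\kappa | \gamma_t) \le p_1^2(\kappa)$ for all $\gamma_t \ge 0$, and $p_3(\cdot)$
is decreasing in $\kappa$. Similar to Datta and Ghosh (2013), the following inequalities
hold for $\varepsilon \in (0, 1)$ with $\varepsilon' = 1 - \varepsilon$:
\[
\begin{aligned}
P\left(\kappa_{t+1} < \varepsilon' \,\middle|\, y_{t+1}, \{\kappa_s\}_{s \le t}, \phi, \tau\right)
&\le \frac{P\left(\kappa_{t+1} < \varepsilon' \,\middle|\, y_{t+1}, \{\kappa_s\}_{s \le t}, \phi, \tau\right)}{P\left(\kappa_{t+1} > \varepsilon' \,\middle|\, y_{t+1}, \{\kappa_s\}_{s \le t}, \phi, \tau\right)} \\
&\le \frac{\int_0^{\varepsilon'} (1 - \kappa_{t+1})^{-3/2} \exp\left(-y_{t+1}^2 \kappa_{t+1}/2\right) d\kappa_{t+1}}{\int_{\varepsilon'}^{1} \left[1 + (\gamma_t - 1)\kappa_{t+1}\right]^{-3/2} \exp\left(-y_{t+1}^2 \kappa_{t+1}/2\right) d\kappa_{t+1}} \\
&\le \frac{\int_0^{\varepsilon'} (1 - \kappa_{t+1})^{-3/2} d\kappa_{t+1}}{\exp\left(-y_{t+1}^2/2\right) \int_{\varepsilon'}^{1} \left[1 + (\gamma_t - 1)\kappa_{t+1}\right]^{-3/2} d\kappa_{t+1}} \\
&\le \frac{2\left[(1 - \varepsilon')^{-1/2} - 1\right]}{\exp\left(-y_{t+1}^2/2\right) 2(\gamma_t - 1)^{-1} \left\{\left[1 + (\gamma_t - 1)\varepsilon'\right]^{-1/2} - \gamma_t^{-1/2}\right\}} \\
&\le \left[(1 - \varepsilon')^{-1/2} - 1\right] \exp\left(y_{t+1}^2/2\right) \gamma_t^{1/2}
\times \left\{\frac{1 - \gamma_t}{1 - \gamma_t^{1/2}/\left[1 + (\gamma_t - 1)\varepsilon'\right]^{1/2}}\right\}.
\end{aligned}
\]
Noting the final term in curly braces converges to 1 as $\gamma_t \to 0$, we obtain
$P\left(\kappa_{t+1} < 1 - \varepsilon \,\middle|\, y_{t+1}, \{\kappa_s\}_{s \le t}, \phi, \tau\right) \to 0$ as $\gamma_t \to 0$. The result for (a) follows
immediately.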
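For $\varepsilon \in (0, 1)$ and $\gamma_t < 1$, and observing that $p_2(\kappa | \gamma_t)$ is increasing in $\kappa$ for
$\gamma_t < 1$, then for any $\delta \in (0, 1)$,
\[
\begin{aligned}
P\left(\kappa_{t+1} > \varepsilon \,\middle|\, y_{t+1}, \{\kappa_s\}_{s \le t}, \phi, \tau\right)
&\le \frac{\gamma_t^{-1} \exp\left(-y_{t+1}^2 \varepsilon/2\right) \int_{\varepsilon}^{1} (1 - \kappa_{t+1})^{-1/2} d\kappa_{t+1}}{\int_0^{\delta\varepsilon} \exp\left(-y_{t+1}^2 \kappa_{t+1}/2\right) d\kappa_{t+1}} \\
&\le \frac{\gamma_t^{-1} \exp\left(-y_{t+1}^2 \varepsilon/2\right) 2(1 - \varepsilon)^{1/2}}{\exp\left(-y_{t+1}^2 \delta\varepsilon/2\right) \delta\varepsilon} \\
&= \exp\left(-y_{t+1}^2 \varepsilon[1 - \delta]/2\right) \gamma_t^{-1} 2(1 - \varepsilon)^{1/2} (\delta\varepsilon)^{-1},
\end{aligned}
\]
which converges to zero as $|y_{t+1}| \to \infty$, proving (b).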
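Proof. (Theorem 3.4) The density of $\eta \sim Z(\alpha, \beta, 0, 1)$ may be written
\[
\begin{aligned}
[\eta] &= \frac{1}{B(\alpha, \beta)} \frac{[\exp(\eta)]^{\alpha}}{[1 + \exp(\eta)]^{\alpha+\beta}} \\
&= \frac{1}{B(\alpha, \beta)} 2^{-(\alpha+\beta)} \exp\{\eta[\alpha - (\alpha + \beta)/2]\} \int_0^{\infty} \exp(-\eta^2 \xi/2)\, p_{\alpha+\beta}(\xi)\, d\xi
\end{aligned}
\]
using Theorem 1 of Polson et al. (2013), where $p_b(\xi)$ is the density of the random
variable $\xi \sim PG(b, 0)$, $b > 0$. It follows that
\[
[\eta] \propto \int_0^{\infty} \exp\left\{-\frac{1}{2}\left[\eta^2 \xi - \eta(\alpha - \beta)\right]\right\} p_{\alpha+\beta}(\xi)\, d\xi
\propto \int_0^{\infty} f_N\left(\eta;\, \xi^{-1}[\alpha - \beta]/2,\, \xi^{-1}\right) p_{\alpha+\beta}(\xi)\, d\xi,
\]
where $f_N(\eta; \mu_N, \sigma_N^2)$ is the density of the random variable $\eta \sim N(\mu_N, \sigma_N^2)$.
The conditional distribution $[\xi | \eta] \sim PG(\alpha + \beta, \eta)$ is a direct result of Polson
et al. (2013).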
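The first identity in this proof can be verified numerically without drawing Pólya-Gamma variates. The Python sketch below relies on the Laplace transform of the PG(b, 0) distribution from Polson et al. (2013), E[exp(−tξ)] = cosh(√(t/2))^(−b), evaluated at t = η²/2, so the integral equals cosh(η/2)^(−(α+β)). The function names are illustrative assumptions.

```python
import numpy as np
from math import lgamma

def z_density_direct(eta, alpha, beta):
    """Z(alpha, beta, 0, 1) density: exp(eta)^alpha / (B(alpha, beta) (1 + exp(eta))^(alpha+beta))."""
    log_beta_fn = lgamma(alpha) + lgamma(beta) - lgamma(alpha + beta)
    return np.exp(alpha * eta - (alpha + beta) * np.logaddexp(0.0, eta) - log_beta_fn)

def z_density_via_pg(eta, alpha, beta):
    """Same density written through the Polya-Gamma scale-mixture representation."""
    b = alpha + beta
    log_beta_fn = lgamma(alpha) + lgamma(beta) - lgamma(b)
    # Integral of exp(-eta^2 xi / 2) against the PG(b, 0) density, in closed form
    pg_integral = np.cosh(eta / 2.0) ** (-b)
    return np.exp(-log_beta_fn) * 2.0 ** (-b) * np.exp(eta * (alpha - b / 2.0)) * pg_integral

eta_grid = np.linspace(-10.0, 10.0, 201)
direct = z_density_direct(eta_grid, 0.5, 0.5)
mixture = z_density_via_pg(eta_grid, 0.5, 0.5)
```

The two arrays agree to floating-point precision, for the dynamic horseshoe case α = β = 1/2 and for other positive values of α and β.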
B.1 MCMC Sampling Algorithm and Computational Details
We design a Gibbs sampling algorithm for the dynamic shrinkage process. The
sampling algorithm is both computationally and MCMC efficient, and builds
upon two main components: (1) a stochastic volatility sampling algorithm
(Kastner and Frühwirth-Schnatter, 2014) augmented with a Pólya-Gamma sam-
pler (Polson et al., 2013); and (2) a Cholesky Factor Algorithm (CFA, Rue, 2001)
for sampling the state variables in the dynamic linear model. Alternative sam-
pling algorithms exist for more general DLMs, such as the simulation smooth-
ing algorithm of Durbin and Koopman (2002). However, as demonstrated by
McCausland et al. (2011) and explored in Chan and Jeliazkov (2009) and Chan
(2013), the CFA sampler is often more efficient. Importantly, both components
employ algorithms that are linear in the number of time points, which produces
a highly efficient sampling algorithm.
The general sampling algorithm is as follows, with the details provided in
the subsequent sections:
1. Sample the dynamic shrinkage components (Section 3.5.1)
(a) Log-volatilities, {ht}
(b) Pólya-Gamma mixing parameters, {ξt}
(c) Unconditional mean of log-volatility, µ
(d) AR(1) coefficient of log-volatility, φ
(e) Discrete mixture component indicators, {st}
2. Sample the state variables, {βt} (Section B.1.2)
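3. Sample the observation error variance, $\sigma_\epsilon^2$.
For the observation error variance, we follow Carvalho et al. (2010) and assume
the Jeffreys' prior $[\sigma_\epsilon^2] \propto 1/\sigma_\epsilon^2$. The full conditional distribution is
\[
[\sigma_\epsilon^2 | \{y_t\}_{t=1}^T, \{\beta_t\}_{t=1}^T, \tau] \propto \sigma_\epsilon^{-2} \times \sigma_\epsilon^{-T} \exp\left\{-\frac{1}{2\sigma_\epsilon^2} \sum_{t=1}^T (y_t - \beta_t)^2\right\} \times \frac{1}{\sigma_\epsilon \left(1 + T\tau^2/\sigma_\epsilon^2\right)},
\]
where the last term arises from $\tau \sim C^+(0, \sigma_\epsilon/\sqrt{T})$. We sample from this distribution
using the slice sampler of Neal (2003).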
If we instead use a stochastic volatility model for the observation error vari-
ance as in Sections 3.3.2 and 3.4.2, we replace this step with a stochastic volatil-
ity sampling algorithm (e.g., Kastner and Frühwirth-Schnatter, 2014), which re-
quires additional sampling steps for the corresponding log-volatility and the
unconditional mean, AR(1) coefficient, and evolution error variance of log-
volatility. An efficient implementation of such a sampler is available in the R
package stochvol (Kastner, 2016). In this setting, we do not scale $\tau$ by the
standard deviation, and instead assume $\tau \sim C^+(0, 1/\sqrt{T})$.
In Figure B.1, we provide empirical evidence for the linear time $O(T)$ computations
of the Bayesian trend filtering model with dynamic horseshoe innovations.
The runtime per 1000 MCMC iterations is less than 6 minutes (on a
MacBook Pro, 2.7 GHz Intel Core i5) for sample sizes up to $T = 10^5$, so the
Gibbs sampling algorithm is scalable.
[Figure B.1 (image omitted). Title: Computation Time for BTF-DHS (per 1000 MCMC
iterations). x-axis: Number of Time Points, T (0 to 1e+05); y-axis: Time (minutes).]
Figure B.1: Computation time per 1000 MCMC iterations for the Bayesian trend
filtering model with dynamic horseshoe innovations (BTF-DHS).
B.1.1 Efficient Sampling for the Dynamic Shrinkage Process
Consider the (univariate) dynamic shrinkage process in (3.2) with the Pólya-
Gamma parameter expansion of Theorem 3.4. We provide implementation details
for the dynamic horseshoe prior with $\alpha = \beta = 1/2$, but extensions to
other cases are straightforward. The SV sampling framework of Kastner and
Frühwirth-Schnatter (2014) represents the likelihood for $h_t$ on the log-scale, and
approximates the ensuing $\log \chi_1^2$ distribution for the errors via a known discrete
mixture of Gaussian distributions. In particular, let $\tilde{y}_t = \log(\omega_t^2 + c)$, where $c$
is a small offset to avoid numerical issues. Conditional on the mixture component
indicators $s_t$, the likelihood is $\tilde{y}_t \overset{indep}{\sim} N(h_t + m_{s_t}, v_{s_t})$ where $m_i$ and
$v_i$, $i = 1, \ldots, 10$ are the pre-specified mean and variance components of the 10-
component Gaussian mixture provided in Omori et al. (2007). The evolution
equation is $h_{t+1} = \mu + \phi(h_t - \mu) + \eta_t$ with initialization $h_1 = \mu + \eta_0$ and innovations
$[\eta_t | \xi_t] \overset{indep}{\sim} N(0, \xi_t^{-1})$ for $[\xi_t] \overset{iid}{\sim} PG(1, 0)$.
To sample $h = (h_1, \ldots, h_T)$ jointly, we directly compute the posterior distribution
of $h$ and exploit the tridiagonal structure of the resulting posterior
precision matrix. In particular, we equivalently have $\tilde{y} \sim N(m + \tilde{h} + \tilde{\mu}, \Sigma_v)$
and $D_\phi \tilde{h} \sim N(0, \Sigma_\xi)$, where $m = (m_{s_1}, \ldots, m_{s_T})'$, $\tilde{h} = (h_1 - \mu, \ldots, h_T - \mu)'$,
$\tilde{\mu} = (\mu, (1 - \phi)\mu, \ldots, (1 - \phi)\mu)'$, $\Sigma_v = \mathrm{diag}\left(\{v_{s_t}\}_{t=1}^T\right)$, $\Sigma_\xi = \mathrm{diag}\left(\{\xi_t^{-1}\}_{t=1}^T\right)$,
and $D_\phi$ is a lower triangular matrix with ones on the diagonal, $-\phi$ on the first
off-diagonal, and zeros elsewhere. We sample from the posterior distribution
of $h$ by sampling from the posterior distribution of $\tilde{h}$ and setting $h = \tilde{h} + \mu 1$
for $1$ a $T$-dimensional vector of ones. The required posterior distribution is
$\tilde{h} \sim N\left(Q_{\tilde{h}}^{-1} \ell_{\tilde{h}}, Q_{\tilde{h}}^{-1}\right)$, where $Q_{\tilde{h}} = \Sigma_v^{-1} + D_\phi' \Sigma_\xi^{-1} D_\phi$ is a tridiagonal symmetric
matrix with diagonal elements $d_0(Q_{\tilde{h}})$ and first off-diagonal elements $d_1(Q_{\tilde{h}})$
defined as
\[
\begin{aligned}
d_0(Q_{\tilde{h}}) &= \left[(v_{s_1}^{-1} + \xi_1 + \phi^2 \xi_2), (v_{s_2}^{-1} + \xi_2 + \phi^2 \xi_3), \ldots, (v_{s_{T-1}}^{-1} + \xi_{T-1} + \phi^2 \xi_T), (v_{s_T}^{-1} + \xi_T)\right], \\
d_1(Q_{\tilde{h}}) &= \left[(-\phi\xi_2), (-\phi\xi_3), \ldots, (-\phi\xi_T)\right], \text{ and} \\
\ell_{\tilde{h}} &= \Sigma_v^{-1}\left(\tilde{y} - m - \tilde{\mu}\right)
= \left[\frac{\tilde{y}_1 - m_{s_1} - \mu}{v_{s_1}}, \frac{\tilde{y}_2 - m_{s_2} - (1 - \phi)\mu}{v_{s_2}}, \ldots, \frac{\tilde{y}_T - m_{s_T} - (1 - \phi)\mu}{v_{s_T}}\right]'.
\end{aligned}
\]
Drawing from this posterior distribution is straightforward and efficient, using
band back-substitution described in Kastner and Frühwirth-Schnatter (2014):
(1) compute the Cholesky decomposition $Q_{\tilde{h}} = LL'$, where $L$ is lower triangular;
(2) solve $La = \ell_{\tilde{h}}$ for $a$; and (3) solve $L'\tilde{h} = a + e$ for $\tilde{h}$, where $e \sim N(0, I_T)$.
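The three steps above can be sketched in a few lines of Python. This is a minimal illustration that stores only the bands of the tridiagonal precision matrix and runs in O(T) time; it is not the implementation used in the dissertation, and the function names are assumptions introduced here. The noise vector e is passed in explicitly so that the draw is reproducible.

```python
import numpy as np

def log_vol_precision_bands(v, xi, phi):
    """Bands d0, d1 of Q = diag(1/v) + D_phi' diag(xi) D_phi."""
    d0 = 1.0 / v + xi
    d0[:-1] += phi ** 2 * xi[1:]
    d1 = -phi * xi[1:]
    return d0, d1

def tridiag_cholesky(d0, d1):
    """Step (1): Cholesky factor L of a symmetric tridiagonal matrix, stored by bands."""
    T = len(d0)
    l0 = np.empty(T)       # diagonal of L
    l1 = np.empty(T - 1)   # first sub-diagonal of L
    l0[0] = np.sqrt(d0[0])
    for t in range(1, T):
        l1[t - 1] = d1[t - 1] / l0[t - 1]
        l0[t] = np.sqrt(d0[t] - l1[t - 1] ** 2)
    return l0, l1

def sample_tridiag_gaussian(d0, d1, ell, e):
    """Return Q^{-1} ell + L'^{-1} e, a draw from N(Q^{-1} ell, Q^{-1}) when e ~ N(0, I)."""
    T = len(d0)
    l0, l1 = tridiag_cholesky(d0, d1)
    a = np.empty(T)                    # step (2): forward substitution, L a = ell
    a[0] = ell[0] / l0[0]
    for t in range(1, T):
        a[t] = (ell[t] - l1[t - 1] * a[t - 1]) / l0[t]
    c = a + e
    x = np.empty(T)                    # step (3): back substitution, L' x = a + e
    x[T - 1] = c[T - 1] / l0[T - 1]
    for t in range(T - 2, -1, -1):
        x[t] = (c[t] - l1[t] * x[t + 1]) / l0[t]
    return x
```

In a sampler, `e` would be `rng.standard_normal(T)` for a NumPy generator `rng`; setting `e` to zero returns the posterior mean, which is convenient for checking the code against a dense solve.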
Conditional on the log-volatilities {ht}, we sample the AR(1) evolution pa-
rameters: the log-innovation precisions {ξt}, the autoregressive coefficient φ,
and the unconditional mean µ. The precisions are distributed [ξt|ηt] ∼ PG(1, ηt)
for ηt = ht+1 − µ − φ(ht − µ), which we sample using the rpg() function
in the R package BayesLogit (Polson et al., 2013). The Pólya-Gamma sam-
pler is efficient: using only exponential and inverse-Gaussian draws, Polson
et al. (2013) construct an accept-reject sampler for which the probability of ac-
ceptance is uniformly bounded below at 0.99919, which does not require any
tuning. Next, we assume the prior [(φ + 1)/2] ∼ Beta(aφ, bφ), which restricts
|φ| < 1 for stationarity, and sample from the full conditional distribution of φ
using the slice sampler of Neal (2003). We select aφ = 10 and bφ = 2, which
places most of the mass for the density of φ in (0, 1) with a prior mean of 2/3
and a prior mode of 4/5 to reflect the likely presence of persistent volatility
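clustering. The prior for the global scale parameter is $\tau \sim C^+(0, \sigma_\epsilon/\sqrt{T})$, which
implies $\mu = \log(\tau^2)$ is $[\mu | \sigma_\epsilon, \xi_\mu] \sim N(\log(\sigma_\epsilon^2/T), \xi_\mu^{-1})$ with $\xi_\mu \sim PG(1, 0)$. Including
the initialization $h_1 \sim N(\mu, \xi_0^{-1})$ with $\xi_0 \sim PG(1, 0)$, the posterior distribution
for $\mu$ is $\mu \sim N(Q_\mu^{-1} \ell_\mu, Q_\mu^{-1})$ with $Q_\mu = \xi_\mu + \xi_0 + (1 - \phi)^2 \sum_{t=1}^{T-1} \xi_t$ and
$\ell_\mu = \xi_\mu \log(\sigma_\epsilon^2/T) + \xi_0 h_1 + (1 - \phi) \sum_{t=1}^{T-1} \xi_t (h_{t+1} - \phi h_t)$. Sampling $\xi_\mu$ and $\xi_0$ follows
the Pólya-Gamma sampling scheme above.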
Finally, we sample the discrete mixture component indicators $s_t$. The discrete
mixture probabilities are straightforward to compute: the prior mixture
probabilities are the mixing proportions given by Omori et al. (2007) and the
likelihood is $\tilde{y}_t \overset{indep}{\sim} N(h_t + m_{s_t}, v_{s_t})$; see Kastner and Frühwirth-Schnatter (2014)
for details.
In the multivariate setting $p > 1$ of (3.10) with $\Phi = \mathrm{diag}(\phi_1, \ldots, \phi_p)$, we
may modify the log-volatility sampler of $\{h_{j,t}\}$ by redefining relevant quantities
using the ordering $h = (h_{1,1}, \ldots, h_{1,T}, h_{2,1}, \ldots, h_{p,T})'$. In particular, the
posterior precision matrix is again tridiagonal, but with diagonal elements
$d_0(Q_{\tilde{h}}) = \left[d_{0,1}(Q_{\tilde{h}}), \ldots, d_{0,p}(Q_{\tilde{h}})\right]$ and first off-diagonal elements
$d_1(Q_{\tilde{h}}) = \left[d_{1,1}(Q_{\tilde{h}}), 0, d_{1,2}(Q_{\tilde{h}}), 0, \ldots, 0, d_{1,p}(Q_{\tilde{h}})\right]$, where $d_{0,j}(Q_{\tilde{h}})$ and $d_{1,j}(Q_{\tilde{h}})$ are the
diagonal elements and first off-diagonal elements, respectively, for predictor $j$
as computed in the univariate case above. Similarly, the linear term is
$\ell_{\tilde{h}} = \left[\ell_{\tilde{h},1}', \ldots, \ell_{\tilde{h},p}'\right]'$, where $\ell_{\tilde{h},j}$ is the linear term for predictor $j$ as computed in the
univariate case. The parameters $\xi_{j,t}$, $\phi_j$, and $s_{j,t}$ may be sampled independently
as in the univariate case, while samplers for $\{\mu_j\}$ and $\mu_0$ proceed as in a standard
hierarchical Gaussian model. For the more general case of non-diagonal
$\Phi$, we may use a simulation smoothing algorithm (e.g., Durbin and Koopman,
2002) for the log-volatilities $\{h_{j,t}\}$, while the sampler for $\Phi$ will depend on the
chosen prior.
B.1.2 Efficient Sampling for the State Variables
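In the univariate setting of (3.8), the sampler for $\beta = (\beta_1, \ldots, \beta_T)$ is similar to
the log-volatility sampler in Section 3.5.1. We provide the details for $D = 2$; the
$D = 1$ case is similar to Section 3.5.1 with $\phi = 1$, $\mu = 0$, and $m_{s_t} = 0$. Model (3.8)
may be written $y \sim N(\beta, \Sigma_\epsilon)$ and $D_2 \beta \sim N(0, \Sigma_\omega)$, where $y = (y_1, \ldots, y_T)'$,
$\Sigma_\epsilon = \mathrm{diag}\left(\{\sigma_t^2\}_{t=1}^T\right)$, $\Sigma_\omega = \mathrm{diag}\left(\{\sigma_{\omega_t}^2\}_{t=1}^T\right)$ for $\sigma_{\omega_t}^2 = \tau^2 \lambda_t^2$, and $D_2$ is a lower
triangular matrix with ones on the diagonal, $(0, -2, \ldots, -2)$ on the first off-
diagonal, ones on the second off-diagonal, and zeros elsewhere. Note that we
allow the observation error variance $\sigma_t^2$ to be time-dependent for full generality,
as in Section 3.4.2. The posterior for $\beta$ is $\beta \sim N\left(Q_\beta^{-1} \ell_\beta, Q_\beta^{-1}\right)$, where
$Q_\beta = \Sigma_\epsilon^{-1} + D_2' \Sigma_\omega^{-1} D_2$ is a pentadiagonal symmetric matrix with diagonal elements
$d_0(Q_\beta)$, first off-diagonal elements $d_1(Q_\beta)$, and second off-diagonal elements
$d_2(Q_\beta)$ defined as
\[
\begin{aligned}
d_0(Q_\beta) = \big[ &\left(\sigma_1^{-2} + \sigma_{\omega_1}^{-2} + \sigma_{\omega_3}^{-2}\right), \left(\sigma_2^{-2} + \sigma_{\omega_2}^{-2} + 4\sigma_{\omega_3}^{-2} + \sigma_{\omega_4}^{-2}\right), \ldots, \\
&\left(\sigma_t^{-2} + \sigma_{\omega_t}^{-2} + 4\sigma_{\omega_{t+1}}^{-2} + \sigma_{\omega_{t+2}}^{-2}\right), \ldots, \\
&\left(\sigma_{T-2}^{-2} + \sigma_{\omega_{T-2}}^{-2} + 4\sigma_{\omega_{T-1}}^{-2} + \sigma_{\omega_T}^{-2}\right), \left(\sigma_{T-1}^{-2} + \sigma_{\omega_{T-1}}^{-2} + 4\sigma_{\omega_T}^{-2}\right), \left(\sigma_T^{-2} + \sigma_{\omega_T}^{-2}\right) \big], \\
d_1(Q_\beta) = \big[ &-2\sigma_{\omega_3}^{-2}, -2\left(\sigma_{\omega_3}^{-2} + \sigma_{\omega_4}^{-2}\right), \ldots, -2\left(\sigma_{\omega_t}^{-2} + \sigma_{\omega_{t+1}}^{-2}\right), \ldots, -2\left(\sigma_{\omega_{T-1}}^{-2} + \sigma_{\omega_T}^{-2}\right), -2\sigma_{\omega_T}^{-2} \big], \\
d_2(Q_\beta) = \big[ &\sigma_{\omega_3}^{-2}, \ldots, \sigma_{\omega_t}^{-2}, \ldots, \sigma_{\omega_T}^{-2} \big],
\end{aligned}
\]
and $\ell_\beta = \Sigma_\epsilon^{-1} y = \left[y_1/\sigma_1^2, \ldots, y_t/\sigma_t^2, \ldots, y_T/\sigma_T^2\right]'$. Drawing from the posterior
distribution is straightforward and efficient using the band back-substitution
algorithm described in Section 3.5.1.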
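The band formulas for the pentadiagonal precision matrix are easy to get wrong at the boundaries, so the Python sketch below builds the three bands directly and can be compared against the dense product. It is an illustrative check under the definitions of D2, the error variances, and the innovation variances given above (with zero-based array indexing); the function names are assumptions.

```python
import numpy as np

def second_difference_matrix(T):
    """D_2: ones on the diagonal, (0, -2, ..., -2) below it, ones on the second off-diagonal."""
    D = np.eye(T)
    for t in range(2, T):
        D[t, t - 1] = -2.0
        D[t, t - 2] = 1.0
    return D

def q_beta_bands(sigma2_obs, sigma2_omega):
    """Bands (d0, d1, d2) of Q = diag(1/sigma2_obs) + D_2' diag(1/sigma2_omega) D_2."""
    w = 1.0 / sigma2_omega             # innovation precisions
    T = len(w)
    d0 = 1.0 / sigma2_obs + w
    d0[:T - 2] += w[2:]                # contribution of the second off-diagonal ones
    d0[1:T - 1] += 4.0 * w[2:]         # contribution of the -2 entries (absent in row 1)
    d1 = np.zeros(T - 1)
    d1[1:] += -2.0 * w[2:]
    d1[:T - 2] += -2.0 * w[2:]
    d2 = w[2:].copy()
    return d0, d1, d2

def q_beta_dense(sigma2_obs, sigma2_omega):
    """Dense O(T^2) construction, used only to validate the bands on small T."""
    D = second_difference_matrix(len(sigma2_obs))
    return np.diag(1.0 / sigma2_obs) + D.T @ np.diag(1.0 / sigma2_omega) @ D
```

Only the bands are needed for sampling, since a banded Cholesky factorization followed by forward and back substitution costs O(T) operations.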
In the multivariate setting of (3.9), we similarly derive the posterior distribution
for $\beta = (\beta_1', \ldots, \beta_T')' = (\beta_{1,1}, \beta_{2,1}, \ldots, \beta_{p,1}, \beta_{1,2}, \ldots, \beta_{p,T})'$. Let
$X = \mathrm{blockdiag}\left(\{x_t'\}_{t=1}^T\right)$ denote the $T \times Tp$ block-diagonal matrix of predictors and
$\Sigma_\omega = \mathrm{diag}\left(\{\sigma_{\omega_{j,t}}^2\}_{j,t}\right)$ for $\sigma_{\omega_{j,t}}^2 = \tau_0^2 \tau_j^2 \lambda_{j,t}^2$. The posterior distribution is
$\beta \sim N\left(Q_\beta^{-1} \ell_\beta, Q_\beta^{-1}\right)$, where
\[
Q_\beta = X' \Sigma_\epsilon^{-1} X + (D_2' \otimes I_p) \Sigma_\omega^{-1} (D_2 \otimes I_p)
\]
and
\[
\ell_\beta = X' \Sigma_\epsilon^{-1} y = \left[x_1' y_1/\sigma_1^2, \ldots, x_t' y_t/\sigma_t^2, \ldots, x_T' y_T/\sigma_T^2\right]'.
\]
Note that $Q_\beta$ may be constructed directly as above, but is now $2p$-banded. Alternatively,
the regression coefficients $\{\beta_{j,t}\}$ may be sampled jointly using the
simulation smoothing algorithm of Durbin and Koopman (2002).
B.2 Linear Regression for the Fama-French Asset Pricing
Model
We present the ordinary linear regression results for the six-factor model dis-
cussed in Section 4.2, in which we append the momentum factor of Carhart
(1997) to the five-factor Fama-French model (FF-5, Fama and French, 2015).
We use weekly industry portfolio data from the website of Kenneth R. French,
which provide the value-weighted return of stocks in the given industry. We focus
on manufacturing (Manuf) and healthcare (Hlth). For a given industry portfolio,
the response variable is the returns in excess of the risk free rate, $y_t = R_t - R_{F,t}$,
with predictors $x_t = (1, R_{M,t} - R_{F,t}, SMB_t, HML_t, RMW_t, CMA_t, MOM_t)'$,
defined as follows: the market risk factor, $R_{M,t} - R_{F,t}$ is the return on the market
portfolio RM,t in excess of the risk free rate RF,t; the size factor, SMB t (small mi-
nus big) is the difference in returns between portfolios of small and large market
value stocks; the value factor, HMLt (high minus low) is the difference in returns
between portfolios of high and low book-to-market value stocks; the profitability
factor, RMW t is the difference in returns between portfolios of robust and weak
profitability stocks; the investment factor, CMAt is the difference in returns be-
tween portfolios of stocks of low and high investment firms; and the momentum
factor, MOM t is the difference in returns between portfolios of stocks with high
and low prior returns. These data are publicly available on Kenneth R. French’s
website, which provides additional details on the portfolios. We standardize all
predictors and the response to have unit variance.
The results for the weekly manufacturing and healthcare industry data sets
from 4/1/2007 - 4/1/2017 (T = 522) are in Tables B.1 and B.2, respectively. For
the manufacturing industry, the significant factors are market risk (RM,t−RF,t),
profitability (RMW t), and investment (CMAt). For the healthcare industry, the
significant factors are market risk, size (SMB t), value (HMLt), and profitability.
Ordinary Linear Regression: Manufacturing Industry
             Estimate   Std. Error   t value   Pr(>|t|)
Intercept      -0.020        0.015    -1.350      0.178
Mkt.RF          1.010        0.018    55.359      0.000
SMB            -0.013        0.016    -0.780      0.436
HML            -0.028        0.022    -1.264      0.207
RMW             0.088        0.018     4.918      0.000
CMA             0.052        0.017     3.106      0.002
MOM             0.029        0.020     1.437      0.151
Table B.1: Ordinary linear regression results for the weekly manufacturing in-
dustry data in the six-factor model. Significant factors at the 5% level are itali-
cized.
Ordinary Linear Regression: Healthcare Industry
             Estimate   Std. Error   t value   Pr(>|t|)
Intercept       0.045        0.023     1.966      0.050
Mkt.RF          0.924        0.028    32.686      0.000
SMB            -0.130        0.025    -5.221      0.000
HML            -0.264        0.034    -7.703      0.000
RMW            -0.168        0.028    -6.076      0.000
CMA             0.029        0.026     1.125      0.261
MOM             0.027        0.031     0.857      0.392
Table B.2: Ordinary linear regression results for the weekly healthcare industry
data in the six-factor model. Significant factors at the 5% level are italicized.
APPENDIX C
FUNCTIONAL AUTOREGRESSION FOR SPARSELY SAMPLED DATA
C.1 Priors
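The prior for $\{\mu_t\}_{t=1}^T$ is determined by (4.6). Let $b_\psi$ be a $J_\psi$-dimensional
vector of cubic B-spline basis functions with $\min\{|\mathcal{T}_o|/2, 35\} = (J_\psi - 4)$
equally-spaced interior knots. The tensor product expansion
$\psi_\ell(\tau, u) = b_\psi'(\tau) \Theta_{\psi_\ell} b_\psi(u) = \left(b_\psi'(u) \otimes b_\psi'(\tau)\right) \theta_{\psi_\ell}$, where $\Theta_{\psi_\ell}$ is a $J_\psi \times J_\psi$ matrix of
unknown coefficients and $\theta_{\psi_\ell} = \mathrm{vec}(\Theta_{\psi_\ell})$, is computationally convenient for
the FAR surfaces $\{\psi_\ell\}_{\ell=1}^p$. The Gaussian prior $[\theta_{\psi_\ell} | \lambda_{\psi_\ell}] \sim N\left(0, \lambda_{\psi_\ell}^{-1} \Omega_{\psi_\ell}^{-1}\right)$
induces a Gaussian process prior on $\psi_\ell$, where $\Omega_{\psi_\ell}$ is a penalty matrix
and $\lambda_{\psi_\ell}$ is a smoothing parameter. The standard roughness penalty
\[
\int\!\!\int \left\{ \left[\frac{\partial^2}{\partial u_1^2} \psi_\ell(u_1, u_2)\right]^2 + 2\left[\frac{\partial^2}{\partial u_1 \partial u_2} \psi_\ell(u_1, u_2)\right]^2 + \left[\frac{\partial^2}{\partial u_2^2} \psi_\ell(u_1, u_2)\right]^2 \right\} du_1\, du_2
\]
can be expressed as $\theta_{\psi_\ell}' \Omega_2 \theta_{\psi_\ell}$ for a known singular matrix $\Omega_2$. To obtain a proper prior, which
is necessary for our model averaging procedure, we combine the roughness
penalty with a nonstationarity penalty: a sufficient condition for stationarity
of $Y_t$ in model (4.2) is $\sum_{\ell=1}^p \int\!\!\int \psi_\ell^2(\tau, u)\, d\tau\, du < 1$, which can be expressed as
$\sum_{\ell=1}^p \theta_{\psi_\ell}' \Omega_0 \theta_{\psi_\ell} < 1$ where $\Omega_0$ is a known invertible matrix. We use the prior
precision matrix $\Omega_{\psi_\ell} = \Omega_2 + \kappa_\ell \Omega_0$, which penalizes roughness of $\psi_\ell$ and provides
shrinkage toward stationarity, where the trade-off is determined by $\kappa_\ell$.
Simulations suggest that the posterior distribution is not sensitive to the choice
of $\kappa_\ell$; we fix $\kappa_\ell = 1$ for the simulations and assume $\log(\kappa_\ell) \sim N(0, 4)$ for the
application. For the smoothing parameter $\lambda_{\psi_\ell}$, we use the half-Cauchy prior of
Gelman (2006), which provides excellent mixing of the states $\{s_\ell\}$ in the model
averaging procedure. The prior may be expressed hierarchically via the auxiliary
variables $\tilde{\lambda}_{\psi_\ell} \sim \mathrm{Gamma}\left(\frac{1}{2}, \frac{1}{2}\right)$, $\tilde{\xi}_{\psi_\ell} \sim N(0, 10^6)$, and $\tilde{\theta}_{\psi_\ell} \sim N\left(0, \tilde{\lambda}_{\psi_\ell}^{-1} \Omega_{\psi_\ell}^{-1}\right)$,
with the identification $\theta_{\psi_\ell} = \tilde{\xi}_{\psi_\ell} \tilde{\theta}_{\psi_\ell}$ and $\lambda_{\psi_\ell} = \tilde{\xi}_{\psi_\ell}^{-2} \tilde{\lambda}_{\psi_\ell}$.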
We use the conditionally conjugate inverse-Gamma priors $\sigma_\nu^{-2}, \sigma_\eta^{-2} \sim \mathrm{Gamma}(10^{-3}, 10^{-3})$
for the measurement error precision and the FDLM approximation
error precision, respectively. In some cases, we may prefer
smoother sample paths of $\mu_t$, but the paths will not be smooth when $\sigma_\eta^2$ is
large. If increasing $J$ is infeasible or undesirable, fixing $\sigma_\eta^2$ at some small
value, such as $\sigma_\eta^2 = 10^{-6}$, often works well, and can be interpreted as a jitter
term for computing a valid inverse of $K_\epsilon$ (Neal, 1999). Assuming the
FDLM (4.7) for the innovation covariance $K_\epsilon$, the factors are distributed
$e_t \overset{iid}{\sim} N(0, \Sigma_e)$ with $\Sigma_e = \mathrm{diag}\left(\{\sigma_j^2\}_{j=1}^J\right)$, although many generalizations are available
(Kowal et al., 2016). To enforce the ordering constraints $\sigma_1^2 > \sigma_2^2 > \cdots > \sigma_J^2 > 0$,
recall that the joint distribution of the precisions may be written
\[
\left[\sigma_1^{-2}, \ldots, \sigma_J^{-2}\right] = \left[\sigma_J^{-2}\right] \prod_{j=1}^{J-1} \left[\sigma_j^{-2} | \sigma_{j+1}^{-2}, \ldots, \sigma_J^{-2}\right].
\]
A noninformative joint prior that respects the constraints is fully specified by
$\sigma_J^{-2} \sim \mathrm{Gamma}(10^{-3}, 10^{-3})$ and
$\left[\sigma_j^{-2} | \sigma_{j+1}^{-2}, \ldots, \sigma_J^{-2}\right] = \left[\sigma_j^{-2} | \sigma_{j+1}^{-2}\right] \sim \mathrm{Uniform}\left(0, \sigma_{j+1}^{-2}\right)$ for $j = 1, \ldots, J - 1$.
The FLCs are $\phi_j(\tau) = b_\phi'(\tau) \xi_j$, where $b_\phi$ is a low-rank thin plate spline basis
with knot locations determined by the quantiles of the observation points, $\mathcal{T}_o$,
$\xi_j \sim N(0, \Lambda_{\phi_j})$, and $\Lambda_{\phi_j}^{-1}$ is the low-rank thin plate spline penalty matrix. We
follow Wand and Ormerod (2008) in the singular value decomposition-based
diagonalization of the penalty matrix, so that $\Lambda_{\phi_j} = \mathrm{diag}\left(10^8, 10^8, \lambda_{\phi_j}^{-1}, \ldots, \lambda_{\phi_j}^{-1}\right)$,
which places a noninformative prior on the constant and linear components of
the thin plate spline basis, which are unpenalized. The prior precision $\lambda_{\phi_j}$ is
common among the nonlinear components, and corresponds to the smoothing
parameter for the regression function $\phi_j$. Following Gelman (2006), we place
uniform priors on the standard deviations $\lambda_{\phi_j}^{-1/2} \sim \mathrm{Uniform}(0, 10^4)$, which implies
the prior for the precision $[\lambda_{\phi_j}] \propto \lambda_{\phi_j}^{-3/2} 1\{\lambda_{\phi_j} > 10^{-8}\}$. The upper bound
for the prior standard deviation is selected to match the noninformative components
of $\Lambda_{\phi_j}$. The orthonormality constraint is enforced during sampling, which
we discuss in Section C.3.2. We assume the same parametrization and prior
distribution for the mean function, $\mu(\tau) = b_\phi'(\tau) \theta_\mu$.
C.2 Proof of Theorem 4.1
To prove Theorem 4.1, we use the following well-known results:
Proposition C.1. For random vectors δ and Y with known mean and covariance, the
unique best linear predictor of δ given Y is EG[δ|Y ], where EG is the expectation
computed under the assumption that (δ′,Y ′)′ is jointly Gaussian.
Proposition C.2 (West and Harrison, 1997). Under a DLM such as model (4.6), the
random vectors $y_{1:T} = (y_1', \ldots, y_T')'$ and $\mu_{1:T} = (\mu_1', \ldots, \mu_T')'$ are jointly Gaussian,
conditional on the remaining parameters. In addition, all conditionals and marginals of
the joint distribution of $(y_{1:T}', \mu_{1:T}')'$ are Gaussian.
Note that we could extend µ1:T to include µ, which is also a Gaussian ran-
dom vector. Following Propositions C.1 and C.2, the proof of Theorem 4.1 is
straightforward:
Proof. (Theorem 4.1) Let $\mathcal{T}_e$ be fixed and finite such that $\mathcal{T}_e \subset \mathcal{T}$. Given this
choice of $\mathcal{T}_e$, we can form the DLM (4.10) with the appropriately modified terms.
Similarly, we can form the Gaussian DLM (4.6). Proposition C.2 implies that
$(y_{1:T}', \mu_{1:T}')'$ under model (4.6) and conditional on $\Theta$ is jointly Gaussian. Therefore,
for any $\delta, Y \subseteq D_T \cup \{\mu_t(\tau) : \tau \in \mathcal{T}_e, t = 1, \ldots, T\}$, i.e., any subvectors
of $(y_{1:T}', \mu_{1:T}')'$, the distribution of $[\delta | Y, \Theta]$ is Gaussian. Proposition C.1 implies
that $\hat{\delta}(Y | \Theta) \equiv E[\delta | Y, \Theta]$, computed under the Gaussian DLM (4.6), is the
unique best linear predictor of $[\delta | Y, \Theta]$ under the DLM (4.10).
C.3 Initialization and MCMC Sampling Algorithm
C.3.1 Initialization
We initialize the unknown functions using splines and the remaining parame-
ters using conditional maximum likelihood estimators. We first estimate µ as a
smooth mean of $\{y_t\}_{t=1}^T$, evaluated at $\mathcal{T}_e$. Next, we estimate each $\mu_t$ by fitting a
spline to $y_t - Z_t \mu$ for $t = 1, \ldots, T$ using the R function smooth.spline. Since
sparse observation points may lead to unstable initializations of $\mu_t$, we compute
the median degrees of freedom implied by the spline fits for $t = 1, \ldots, T$, and
then recompute the splines for $t = 1, \ldots, T$ using this common degrees of freedom
parameter. Conditional on these estimates, we estimate $\sigma_\nu^2$, $\{\theta_{\psi_1}, \ldots, \theta_{\psi_p}\}$,
and $\{\lambda_{\psi_1}, \ldots, \lambda_{\psi_p}\}$ using the maximum likelihood estimators, and initialize
$\tilde{\theta}_{\psi_\ell} = \theta_{\psi_\ell}$, $\tilde{\lambda}_{\psi_\ell} = 1$, and $\tilde{\xi}_{\psi_\ell} = \lambda_{\psi_\ell}^{-1/2}$. From these estimators, we compute the
innovations $\epsilon_t$ for $t = 1, \ldots, T$. We initialize the FDLM parameters using the
initialization algorithm of Kowal et al. (2016) based on the singular value decomposition
(SVD) of $(\epsilon_1, \ldots, \epsilon_T)' = U_e D_e V_e'$. For the FLCs, we let $\Phi$ equal
the first $J$ columns of $V_e$ and then estimate $\Xi$ to minimize $||\Phi - B_\phi \Xi||^2$. For
the factors, we let $(e_1, \ldots, e_T)'$ be the first $J$ columns of $(U_e D_e)$, and then estimate
$\{\sigma_j^2\}$ and $\sigma_\eta^2$ using the conditional maximum likelihood estimators. Since
$\sum_{k=1}^j \sigma_k^2 / \sum_k \sigma_k^2$ estimates the proportion of variance of $\epsilon_t$ explained by the first $j$
factors, we set $J$ to be the smallest number of factors that explain at least 95% of
the variance of $\epsilon_t$. While more sophisticated procedures are available for selecting
$J$, such as DIC and marginal likelihood, we find that this simple approach
performs well in simulations.
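The SVD-based initialization and the 95% rule for choosing J can be sketched as follows in Python. This is an illustration only: it assumes the innovations are stored as a T x M array and that the factor variances are estimated proportionally to the squared singular values, and the function name and threshold argument are hypothetical.

```python
import numpy as np

def initialize_fdlm_factors(eps, var_threshold=0.95):
    """SVD-based initialization of loading curves and factors from a T x M innovation matrix."""
    U, d, Vt = np.linalg.svd(eps, full_matrices=False)   # eps = U diag(d) V'
    var_explained = np.cumsum(d ** 2) / np.sum(d ** 2)   # cumulative proportion of variance
    J = int(np.argmax(var_explained >= var_threshold)) + 1
    Phi = Vt[:J].T                # first J columns of V: initial FLCs on the evaluation grid
    factors = U[:, :J] * d[:J]    # first J columns of U D: initial factors e_t
    return J, Phi, factors, var_explained

# Synthetic example with known singular values (10, 3, 1), for illustration only
rng = np.random.default_rng(3)
T, M = 50, 12
U0, _ = np.linalg.qr(rng.standard_normal((T, 3)))
V0, _ = np.linalg.qr(rng.standard_normal((M, 3)))
eps_example = U0 @ np.diag([10.0, 3.0, 1.0]) @ V0.T
J, Phi, factors, var_explained = initialize_fdlm_factors(eps_example)
```

In this example the first factor explains 100/110 of the variance and the first two explain 109/110, so the rule selects J = 2. The columns of `Phi` are orthonormal by construction, which matches the constraint imposed on the loading curves during sampling.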
C.3.2 Gibbs Sampling Algorithm
We propose to sample from the joint posterior distribution using a Gibbs sam-
pler with the following steps:
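1. FAR process, $Y_t$:
(a) Centered FAR process, $\mu_t$: form the DLM (6) and sample
$[\{\mu_t\}_{t=1}^T | \cdots]$ jointly using the state space sampler of Durbin and
Koopman (2002) implemented in the R package KFAS.
(b) Mean function, $\mu(\tau) = b_\phi'(\tau) \theta_\mu$: sample $[\theta_\mu | \cdots] \sim N(A_\mu a_\mu, A_\mu)$
where
\[
\begin{aligned}
A_\mu^{-1} &= \Lambda_\mu^{-1} + \sigma_\nu^{-2} \sum_{t=1}^T B_\phi' Z_t' Z_t B_\phi, \\
a_\mu &= \sigma_\nu^{-2} \sum_{t=1}^T B_\phi' Z_t' (y_t - Z_t \mu_t),
\end{aligned}
\]
and $\Lambda_\mu = \mathrm{diag}\left(10^8, 10^8, \lambda_\mu^{-1}, \ldots, \lambda_\mu^{-1}\right)$. We sample the smoothing
parameter $[\lambda_\mu | \cdots] \sim \mathrm{Gamma}\left(\frac{1}{2}(J_\mu - 3), \frac{1}{2} \sum_{j=3}^{J_\mu} \theta_{\mu,j}^2\right)$ restricted to
$\lambda_\mu > 10^{-8}$ (see the $\sigma_j^{-2}$ sampler below), where $J_\mu$ ($= J_\phi$) is the dimension
of $\theta_\mu$ and $\theta_{\mu,j}$ is the $j$th component of $\theta_\mu$.
Set $Y_t = \mu_t + \mu$ or, in vector form, $Y_t = \mu_t + \mu$.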
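2. Measurement error precision, $\sigma_\nu^{-2}$: sample
\[
[\sigma_\nu^{-2} | \cdots] \sim \mathrm{Gamma}\left(10^{-3} + \frac{1}{2} \sum_{t=1}^T m_t,\; 10^{-3} + \frac{1}{2} \sum_{t=1}^T \sum_{i=1}^{m_t} \left(y_{i,t} - \mu(\tau_{i,t}) - \mu_t(\tau_{i,t})\right)^2\right).
\]
3. The FAR kernels, $\psi_1, \ldots, \psi_p$: using the Gelman (2006) prior and
parametrization of $\theta_{\psi_\ell} = \tilde{\xi}_{\psi_\ell} \tilde{\theta}_{\psi_\ell}$, where $\psi_\ell(\tau, u) = b_\psi'(\tau, u) \theta_{\psi_\ell}$ and
$B_\psi = (b_\psi(\tau_1), \ldots, b_\psi(\tau_M))'$, we sample
(a) $\tilde{\theta}_\psi = (\tilde{\theta}_{\psi_1}', \ldots, \tilde{\theta}_{\psi_p}')'$ jointly from $[\tilde{\theta}_\psi | \cdots] \sim N(A_\psi a_\psi, A_\psi)$, where
\[
\begin{aligned}
A_\psi^{-1}[\ell, \ell] &= \lambda_{\psi_\ell} \Omega_{\psi_\ell} + s_\ell \tilde{\xi}_{\psi_\ell}^2 \left[(B_\psi' Q) \left\{\sum_{t=p+1}^T \mu_{t-\ell} \mu_{t-\ell}'\right\} (B_\psi' Q)' \otimes B_\psi' K_\epsilon^{-1} B_\psi\right], \\
A_\psi^{-1}[\ell, k] &= s_\ell s_k \tilde{\xi}_{\psi_\ell} \tilde{\xi}_{\psi_k} \left[(B_\psi' Q) \left\{\sum_{t=p+1}^T \mu_{t-\ell} \mu_{t-k}'\right\} (B_\psi' Q)' \otimes B_\psi' K_\epsilon^{-1} B_\psi\right], \\
a_\psi[\ell] &= s_\ell \tilde{\xi}_{\psi_\ell} \mathrm{vec}\left(B_\psi' K_\epsilon^{-1} \left\{\sum_{t=p+1}^T \mu_t \mu_{t-\ell}'\right\} (B_\psi' Q)'\right),
\end{aligned}
\]
$A_\psi^{-1}[\ell, k]$ is the $(\ell, k)$th block of $A_\psi^{-1}$ of dimension $J_\psi^2 \times J_\psi^2$ and $a_\psi[\ell]$ is
the $\ell$th subvector of $a_\psi$ of length $J_\psi^2$;
(b) For $\ell = 1, \ldots, p$, sample $[\tilde{\xi}_{\psi_\ell} | \cdots] \sim N\left(A_{\tilde{\xi}_{\psi_\ell}} a_{\tilde{\xi}_{\psi_\ell}}, A_{\tilde{\xi}_{\psi_\ell}}\right)$, where
\[
\begin{aligned}
A_{\tilde{\xi}_{\psi_\ell}}^{-1} &= 10^{-6} + \tilde{\theta}_{\psi_\ell}' \left[(B_\psi' Q) \left\{\sum_{t=p+1}^T \mu_{t-\ell} \mu_{t-\ell}'\right\} (B_\psi' Q)' \otimes B_\psi' K_\epsilon^{-1} B_\psi\right] \tilde{\theta}_{\psi_\ell}, \\
a_{\tilde{\xi}_{\psi_\ell}} &= \tilde{\theta}_{\psi_\ell}' \mathrm{vec}\left(B_\psi' K_\epsilon^{-1} \left\{\sum_{t=p+1}^T \left(\mu_t - \sum_{k \ne \ell} s_k G(\psi_k) \mu_{t-k}\right) \mu_{t-\ell}'\right\} (B_\psi' Q)'\right),
\end{aligned}
\]
sample $[\tilde{\lambda}_{\psi_\ell} | \cdots] \sim \mathrm{Gamma}\left(\frac{1}{2} + J_\psi^2/2, \frac{1}{2} + \theta_{\psi_\ell}' \Omega_{\psi_\ell} \theta_{\psi_\ell}/2\right)$, and, if $\kappa_\ell$ is
unknown, sample $\kappa_\ell$ using the slice sampler (Neal, 2003). Set
$\theta_{\psi_\ell} = \tilde{\xi}_{\psi_\ell} \tilde{\theta}_{\psi_\ell}$ and update $\Omega_{\psi_\ell}$.
(c) For the model averaging procedure, sample $[s_\ell | \cdots]$ (in random order),
i.e., set $s_\ell = 1$ if $\log O_{10}^{post} > \log(1/U - 1)$ and $s_\ell = 0$ otherwise,
where $U \sim \mathrm{Uniform}(0, 1)$, $\log O_{10}^{post}$ is the log-posterior odds
\[
\log O_{10}^{post} = -\frac{1}{2} \sum_{t=p+1}^T \left[\mu_{t-\ell}' K_\epsilon^{-1} \mu_{t-\ell} - 2\left(\mu_t - \sum_{k \ne \ell} s_k G(\psi_k) \mu_{t-k}\right)' K_\epsilon^{-1} \mu_{t-\ell}\right] + \log O_{10}^{prior},
\]
and $\log O_{10}^{prior} = \log P(s_\ell = 1 | s_k, k \ne \ell) - \log P(s_\ell = 0 | s_k, k \ne \ell)$ is the
log-prior odds.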
4. The innovation covariance, $K_\epsilon$, under the FDLM:
(a) The factors, $\{e_t\}_{t=1}^T$: using the prior $e_t \overset{iid}{\sim} N(0, \Sigma_e)$ and the conditional
likelihood $\epsilon_t = \mu_t - \sum_{\ell=1}^p G(\psi_\ell) \mu_{t-\ell} = \Phi e_t + \eta_t$, sample $[e_t | \cdots] \sim N(A_e a_{e_t}, A_e)$, where
\[
\begin{aligned}
A_e^{-1} &= \sigma_\eta^{-2} \Phi' \Phi + \Sigma_e^{-1} = \mathrm{diag}\left(\{\sigma_\eta^{-2} + \sigma_j^{-2}\}_{j=1}^J\right), \\
a_{e_t} &= \sigma_\eta^{-2} \Phi' \epsilon_t.
\end{aligned}
\]
Note that $A_e$ is time-invariant and diagonal, so we can sample $\{e_t\}_{t=1}^T$
jointly and efficiently.
(b) The factor precisions, $\sigma_j^{-2}$: sample
\[
[\sigma_J^{-2} | \cdots] \sim \mathrm{Gamma}\left(10^{-3} + \frac{T}{2},\; 10^{-3} + \frac{1}{2} \sum_{t=1}^T e_{J,t}^2\right);
\]
then, for $j = J - 1, \ldots, 1$, set $\sigma_j^{-2} = F_\phi^{-1}(U; s_\phi, r_{\phi_j})$, where $F_\phi$ is
the distribution function for a Gamma random variable with shape
parameter $s_\phi = (T - 1)/2$ and rate parameter $r_{\phi_j} = \sum_{t=1}^T e_{j,t}^2/2$,
and $U \sim \mathrm{Uniform}(a_{\phi_j}, b_{\phi_j})$ where $a_{\phi_j} = F_\phi(0; s_\phi, r_{\phi_j})$ and
$b_{\phi_j} = F_\phi(\sigma_{j+1}^{-2}; s_\phi, r_{\phi_j})$.
(c) The approximation error precision, $\sigma_\eta^{-2}$: sample
\[
[\sigma_\eta^{-2} | \cdots] \sim \mathrm{Gamma}\left(10^{-3} + \frac{TM}{2},\; 10^{-3} + \frac{1}{2} \sum_{t=1}^T ||\epsilon_t - \Phi e_t||^2\right),
\]
where $||\cdot||^2$ denotes the squared Euclidean norm.
(d) The factor loading curves: for $j = 1, \ldots, J$ (in random order), sample
$\xi_j \sim N(A_{\xi_j} a_{\xi_j}, A_{\xi_j})$, where
\[
\begin{aligned}
A_{\xi_j}^{-1} &= \Lambda_{\phi_j}^{-1} + \sigma_\eta^{-2} \left(\sum_{t=1}^T e_{j,t}^2\right) B_\phi' B_\phi, \\
a_{\xi_j} &= \sigma_\eta^{-2} B_\phi' \sum_{t=1}^T e_{j,t} \left(\epsilon_t - B_\phi \sum_{k \ne j} \xi_k e_{k,t}\right).
\end{aligned}
\]
To enforce the orthogonality constraint, we condition on the linear
constraints $(B_\phi \xi_k)' B_\phi \xi_j = 0$ for $k \ne j$; since $\xi_j$ is Gaussian and $\xi_k$
is conditioned upon, the resulting distribution is Gaussian with easily
computable moments, which is also convenient for efficient sampling;
see Kowal et al. (2016) for more details. After sampling from
the conditional distribution, we normalize the sampled vector $\xi_j$, so
that the orthonormality constraint is enforced at every MCMC iteration.
We sample the corresponding smoothing parameters
$[\lambda_{\phi_j} | \cdots] \sim \mathrm{Gamma}\left(\frac{1}{2}(J_\phi - 3), \frac{1}{2} \sum_{k=3}^{J_\phi} \xi_{j,k}^2\right)$ restricted to $\lambda_{\phi_j} > 10^{-8}$, where $\xi_{j,k}$ is
the $k$th component of $\xi_j$.
Finally, we form the covariance and precision matrices $K_\epsilon$ and $K_\epsilon^{-1}$, respectively,
using the sampled components. Since the orthonormality constraint
$\Phi' \Phi = I_J$ is enforced at every MCMC iteration, we can compute
$K_\epsilon^{-1}$ directly and efficiently using (8).
When the sample size $T$ or the number of evaluation points $M$ is large (i.e.,
$T > 10{,}000$ or $M > 50$), the Durbin and Koopman (2002) joint sampler is computationally
inefficient. Instead, we may use a single-move sampler for $\{\mu_t\}_{t=1}^T$,
in which we sample from the full conditional distribution of each $[\mu_t | \mu_s, s \ne t]$
separately for $t = 1, \ldots, T$ (in random order). The single-move sampler is more
computationally efficient, but is typically less MCMC efficient. The FDLM provides
a closed form for $K_\epsilon^{-1}$, which substantially reduces computation time
when $M$ is large.
The tensor product basis for ψ` provides a computational simplification for
jointly sampling the FAR kernel basis coefficients, $\theta_\psi$. Importantly, the dimension
of the Kronecker product for computing $A_\psi^{-1}$ is determined by the number
of basis functions, $J_\psi$, which is bounded by 35 in our specification, and may be
smaller for some applications. For other bivariate bases, such as the thin plate
spline basis, such simplifications are not readily available, and the Kronecker
product scales with the number of evaluation points, M .
In the model averaging procedure, there is a nontrivial concern about the
ability of the MCMC sampler to move between states. When $s_\ell = 0$, $\psi_\ell$ does not
appear in the likelihood (9), so the Gibbs sampler will draw $\psi_\ell$ from its prior.
Therefore, the prior for $\psi_\ell$ must be proper; if it is nonetheless noninformative,
then the draws of $\psi_\ell$ from the prior distribution may not be reasonable for (9),
so the next MCMC sample of $s_\ell$ will be zero with high probability. To alleviate
this problem, we fix $s_\ell = 1$ for all $\ell$ during a short burn-in period, so that each
$\psi_\ell$ is well-estimated and therefore more likely to be included in the model if it is
relevant. In both simulations and the yield curve application, the Gelman (2006)
parametrization for $\psi_\ell$ sampling discussed in the Appendix provides excellent
mixing among the states $\{s_\ell\}_{\ell=1}^{p_{\max}}$.
C.4 Additional Theoretical Results
C.4.1 Proof of Proposition 4.1
Let $\Psi(B)$ be a polynomial in the backshift operator $B$ of order $p$, so that
$\Psi(B)Y_t = (1 - \Psi_1 B - \Psi_2 B^2 - \cdots - \Psi_p B^p)Y_t = Y_t - \sum_{\ell=1}^{p} \Psi_\ell(Y_{t-\ell})$,
where $\{\Psi_\ell\}_{\ell=1}^{p}$ are bounded linear operators on $L^2(\mathcal{T})$. Similarly, let
$\Theta(B)$ be a polynomial in the backshift operator $B$ of order $q$, where
$\{\Theta_\ell\}_{\ell=1}^{q}$ are bounded linear operators on $L^2(\mathcal{T})$. A functional
autoregressive moving average process of order $(p, q)$, written FARMA$(p, q)$, is
defined by $\Psi(B)(Y_t - \mu) = \Theta(B)\epsilon_t$, where $\{\epsilon_t\}$ is a white noise
process in $L^2(\mathcal{T})$ and $\mu$ is the unconditional mean of $Y_t$. The FAR($p$) model may
be written compactly as $\Psi(B)(Y_t - \mu) = \epsilon_t$. By assumption, we observe the
process $\{y_t\}$, where $y_t = Y_t + \nu_t$ and $\{\nu_t\}$ is a white noise process in
$L^2(\mathcal{T})$ independent of $\{\epsilon_t\}$. Rewriting the observation equation as
$y_t - \mu = Y_t - \mu + \nu_t$ and applying $\Psi(B)$, we have
$\Psi(B)(y_t - \mu) = \Psi(B)(Y_t - \mu) + \Psi(B)\nu_t = \epsilon_t + \Psi(B)\nu_t$. It remains
to show that $Z_t \equiv \epsilon_t + \Psi(B)\nu_t$ is a functional moving average process of
order $p$, or equivalently, FARMA$(0, p)$. Clearly, $X_t \equiv \Psi(B)\nu_t$ is FARMA$(0, p)$.
By Proposition 10.2 in Bosq and Blanke (2008), $C_p^X \neq 0$ and $C_\ell^X = 0$ for $\ell > p$,
where $C_\ell^X$ is the covariance operator of $X$ defined by
$C_\ell^X(x) \equiv E[\langle X_t, x \rangle X_{t+\ell}]$ for $x \in L^2(\mathcal{T})$.
Let $C_\ell^Z$ and $C_\ell^\epsilon$ denote the covariance operators for $Z_t$ and $\epsilon_t$, respectively. Then
$C_\ell^Z(x) = E[\langle Z_t, x \rangle Z_{t+\ell}] = E[\langle \epsilon_t + X_t, x \rangle (\epsilon_{t+\ell} + X_{t+\ell})] = C_\ell^\epsilon(x) + C_\ell^X(x)$,
using independence of $\{\epsilon_t\}$ and $\{\nu_t\}$. Since $\epsilon_t$ is white noise, $C_\ell^\epsilon = 0$
for $\ell > 0$, from which it follows that $C_p^Z \neq 0$ and $C_\ell^Z = 0$ for $\ell > p$.
Proposition 10.2 in Bosq and Blanke (2008) implies that $Z_t$ is FARMA$(0, p)$, so we
conclude that $y_t$ is FARMA$(p, p)$.
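As an illustration only (not part of the original proof), the argument can be traced in a scalar analogue with $p = 1$. The assumptions here are that $Y_t$, $\epsilon_t$, and $\nu_t$ are real-valued, that $\Psi_1$ reduces to a scalar coefficient $\psi \neq 0$, and that $\sigma_\epsilon^2$ and $\sigma_\nu^2$ denote the variances of the two white noise sequences; none of this notation appears in the original text.

```latex
% Scalar analogue with p = 1 (illustrative assumption; all quantities real-valued)
\begin{align*}
Z_t &= \epsilon_t + \nu_t - \psi \nu_{t-1}, \\
\operatorname{Var}(Z_t) &= \sigma_\epsilon^2 + (1 + \psi^2)\,\sigma_\nu^2, \\
\operatorname{Cov}(Z_t, Z_{t+1}) &= -\psi\,\sigma_\nu^2 \neq 0, \\
\operatorname{Cov}(Z_t, Z_{t+\ell}) &= 0 \quad \text{for } \ell > 1.
\end{align*}
```

The autocovariance of $Z_t$ is nonzero at lag 1 and vanishes beyond it, which is the MA(1) signature; the noisy observations $y_t$ then follow an ARMA(1, 1) process, mirroring the FARMA$(p, p)$ conclusion.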
C.4.2 DLM Recursions and Special Cases of Theorem 4.1
For completeness, we provide the standard DLM recursion formulas for model
(6). Let $D_t = \{y_t, y_{t-1}, \ldots, y_1\} \cup D_0$ be the information available at time $t$, where
$D_0$ represents the information prior to $t = 1$. For our purposes, and in particular
for the Gibbs sampling algorithm, we let $D_0 = \{\mu, \sigma_\nu^2, \psi, K\}$ (denoted by $\Theta$
in Theorem 4.1). We may compute full conditional posterior distributions from
model (6) using standard DLM recursions (e.g., West and Harrison, 1997). For
simplicity, let $G = G(\psi)$. Suppose that $[\mu_{t-1} \mid D_{t-1}] \sim N(m_{t-1}, C_{t-1})$. The prior
at time $t$ is $[\mu_t \mid D_{t-1}] \sim N(a_t, R_t)$, where $a_t = G m_{t-1}$ and $R_t = G C_{t-1} G' + K$.
The one-step forecast at time $t$ is $[y_t \mid D_{t-1}] \sim N(f_t, Q_t)$, where
$f_t = Z_t \mu + Z_t a_t = Z_t(\mu + G m_{t-1})$ and $Q_t = Z_t R_t Z_t' + \sigma_\nu^2 I_{m_t}$.
The posterior at time $t$ is $[\mu_t \mid D_t] \sim N(m_t, C_t)$, where
$m_t = C_t\left(R_t^{-1} a_t + \sigma_\nu^{-2} Z_t'(y_t - Z_t \mu)\right)$ and $C_t^{-1} = R_t^{-1} + \sigma_\nu^{-2} Z_t' Z_t$,
or, more commonly, $m_t = a_t + A_t r_t$, $A_t = R_t Z_t' Q_t^{-1}$, $r_t = y_t - f_t$,
and $C_t = R_t - A_t Q_t A_t'$. The $h$-step forecast of the functional observations is
$E[y_{t+h} \mid D_t] = E[Z_{t+h}\mu + Z_{t+h}\mu_{t+h} + \nu_{t+h} \mid D_t] = Z_{t+h}\mu + Z_{t+h} E[\mu_{t+h} \mid D_t]$, where
$E[\mu_{t+h} \mid D_t] = G^h m_t$, which is the $h$-step forecast of $\mu_t$.
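The recursions above can be sketched numerically as follows. This is a minimal illustration rather than the implementation used for the results: the function name `dlm_filter_step`, the toy dimensions, and the choice of $Z_t$ as a row-selection matrix are assumptions made for the example. The sketch also checks that the gain form and the precision form of the posterior agree.

```python
import numpy as np

def dlm_filter_step(m_prev, C_prev, y, Z, mu, G, K, sigma2_nu):
    """One forward-filtering step; returns (a_t, R_t, f_t, Q_t, m_t, C_t)."""
    a = G @ m_prev                                     # prior mean a_t
    R = G @ C_prev @ G.T + K                           # prior covariance R_t
    f = Z @ (mu + a)                                   # one-step forecast mean f_t
    Q = Z @ R @ Z.T + sigma2_nu * np.eye(Z.shape[0])   # forecast covariance Q_t
    A = np.linalg.solve(Q, Z @ R).T                    # gain A_t = R_t Z_t' Q_t^{-1}
    m = a + A @ (y - f)                                # posterior mean m_t
    C = R - A @ Q @ A.T                                # posterior covariance C_t
    return a, R, f, Q, m, C

# Toy inputs (all hypothetical): M = 4 evaluation points, 3 of them observed.
rng = np.random.default_rng(0)
M, sigma2_nu = 4, 0.2
G = 0.5 * np.eye(M)
K = 0.1 * np.eye(M) + 0.05            # symmetric positive definite
Z = np.eye(M)[[0, 2, 3], :]           # selects the observed evaluation points
mu = rng.normal(size=M)
y = rng.normal(size=3)
a, R, f, Q, m, C = dlm_filter_step(np.zeros(M), np.eye(M), y, Z, mu, G, K, sigma2_nu)

# Precision (information) form of the same posterior, for comparison.
C_alt = np.linalg.inv(np.linalg.inv(R) + Z.T @ Z / sigma2_nu)
m_alt = C_alt @ (np.linalg.solve(R, a) + Z.T @ (y - Z @ mu) / sigma2_nu)
print(np.allclose(m, m_alt), np.allclose(C, C_alt))
```

The gain form avoids inverting the $M \times M$ matrix $R_t$ and only requires solving a system of the size of the observation vector, which is one reason it is the more common choice.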
Some special cases of Theorem 4.1 are proved in West and Harrison (1997):
Corollary B.2.1 (Theorem 4.10, West and Harrison, 1997). The unique best linear
predictor of the filtering random variable $[\mu_t \mid D_t]$ is $m_t$.
Corollary B.2.2 (Corollary 4.7, West and Harrison, 1997). The unique best linear
predictor of the one-step forecast $[\mu_t \mid D_{t-1}]$ is $a_t$. The unique best linear predictor of the
one-step forecast $[y_t \mid D_{t-1}]$ is $f_t$.
C.4.3 Proof of Theorem 4.2
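Suppose $\tau^* \in \mathcal{T}$ such that $\tau^* \notin \mathcal{T}_e$. The full conditional distribution of $\mu_t(\tau^*)$ is
$$\left[\mu_t(\tau^*) \mid \{\mu_r\}_{r=1}^{T}, \Theta, D_s\right] \propto \left[y_1, \ldots, y_s \mid \mu_t(\tau^*), \{\mu_r\}_{r=1}^{T}, \Theta\right] \times \left[\mu_t(\tau^*) \mid \{\mu_r\}_{r=1}^{T}, \Theta\right] \propto \left[\mu_t(\tau^*) \mid \{\mu_r\}_{r=1}^{T}, \Theta\right],$$
since the likelihood term is constant with respect to $\mu_t(\tau^*)$: $\mathcal{T}_o \subseteq \mathcal{T}_e$, so $\tau^* \notin \mathcal{T}_e$
implies $\tau^* \notin \mathcal{T}_o$, and therefore $\mu_t(\tau^*)$ does not appear in the likelihood of model
(4). For $p = 1$, the conditional Gaussian process prior for $\mu_t$ implied by model
(4) under the approximation (5) is $[\mu_t \mid \mu_{t-1}, \psi, K] \sim GP(\psi'(\cdot) Q \mu_{t-1}, K)$, where
$\psi'(\tau) = (\psi(\tau, \tau_1), \ldots, \psi(\tau, \tau_M))$, $Q$ is a known quadrature weight matrix, and
$\mu_{t-1} = (\mu_{t-1}(\tau_1), \ldots, \mu_{t-1}(\tau_M))'$ is the function $\mu_{t-1}$ evaluated at each $\tau \in \mathcal{T}_e$.
Notably, $\tau^* \notin \mathcal{T}_e$ implies that $\mu_t(\tau^*)$ does not appear in the conditional mean
function for $\mu_{t+1}$, so we may further simplify the distribution of $\mu_t(\tau^*)$:
$$\left[\mu_t(\tau^*) \mid \{\mu_r\}_{r=1}^{T}, \Theta, D_s\right] \propto \left[\mu_t(\tau^*) \mid \mu_t, \mu_{t-1}, \Theta\right].$$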
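To compute this distribution, we use the definition of a Gaussian process, which
implies the following joint distribution of $\mu_t(\tau^*)$ and $\mu_t$, conditional on $\mu_{t-1}$, $\psi$,
and $K$:
$$\begin{pmatrix} \mu_t(\tau^*) \\ \mu_t \end{pmatrix} \sim N\left( \begin{pmatrix} \psi'(\tau^*) Q \mu_{t-1} \\ \Psi Q \mu_{t-1} \end{pmatrix}, \begin{pmatrix} K(\tau^*, \tau^*) & K(\tau^*) \\ K'(\tau^*) & K \end{pmatrix} \right),$$
where $\Psi = \{\psi(\tau_i, \tau_k)\}_{i,k=1}^{M}$ and $K(\tau^*) = (K(\tau^*, \tau_1), \ldots, K(\tau^*, \tau_M))$.
Conditioning on $\mu_t$ induces the desired distribution
$[\mu_t(\tau^*) \mid \mu_t, \mu_{t-1}, \psi, K] \sim N(m_t(\tau^*), K_t(\tau^*))$, where
$m_t(\tau^*) = \psi'(\tau^*) Q \mu_{t-1} + K(\tau^*) K^{-1}(\mu_t - \Psi Q \mu_{t-1})$
and $K_t(\tau^*) = K(\tau^*, \tau^*) - K(\tau^*) K^{-1} K'(\tau^*)$. Under the FDLM, the
following useful simplifications are available: $K(\tau^*, \tau^*) = \sigma_\eta^2 + \phi'(\tau^*) \Sigma_e \phi(\tau^*)$,
$K(\tau^*) = \phi'(\tau^*) \Sigma_e \Phi'$, and using (8), $K^{-1} = \sigma_\eta^{-2} I_M - \sigma_\eta^{-2} \Phi \tilde{\Sigma}_e \Phi'$, where
$\phi'(\tau^*) = (\phi_1(\tau^*), \ldots, \phi_J(\tau^*))$, $\Sigma_e = \mathrm{diag}(\{\sigma_j^2\}_{j=1}^{J})$, $\Phi = (\phi(\tau_1), \ldots, \phi(\tau_M))'$, and
$\tilde{\Sigma}_e = \mathrm{diag}(\{\sigma_j^2/(\sigma_j^2 + \sigma_\eta^2)\}_{j=1}^{J})$. By substitution, we derive
$$\begin{aligned}
m_t(\tau^*) &= \psi'(\tau^*) Q \mu_{t-1} + K(\tau^*) K^{-1}(\mu_t - \Psi Q \mu_{t-1}) \\
&= \psi'(\tau^*) Q \mu_{t-1} + \phi'(\tau^*) \Sigma_e \Phi' \left(\sigma_\eta^{-2} I_M - \sigma_\eta^{-2} \Phi \tilde{\Sigma}_e \Phi'\right)(\mu_t - \Psi Q \mu_{t-1}) \\
&= \psi'(\tau^*) Q \mu_{t-1} + \phi'(\tau^*) \tilde{\Sigma}_e \Phi' (\mu_t - \Psi Q \mu_{t-1}),
\end{aligned}$$
using the constraint $\Phi'\Phi = I_J$ and the simplification $\sigma_\eta^{-2} \Sigma_e - \sigma_\eta^{-2} \Sigma_e \tilde{\Sigma}_e = \tilde{\Sigma}_e$.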
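Similarly,
$$\begin{aligned}
K_t(\tau^*) &= K(\tau^*, \tau^*) - K(\tau^*) K^{-1} K'(\tau^*) \\
&= \sigma_\eta^2 + \phi'(\tau^*) \Sigma_e \phi(\tau^*) - \phi'(\tau^*) \Sigma_e \Phi' \left(\sigma_\eta^{-2} I_M - \sigma_\eta^{-2} \Phi \tilde{\Sigma}_e \Phi'\right) \Phi \Sigma_e \phi(\tau^*) \\
&= \sigma_\eta^2 + \sigma_\eta^2 \phi'(\tau^*) \tilde{\Sigma}_e \phi(\tau^*),
\end{aligned}$$
which is time-invariant. Extensions for $p > 1$ only require modification of the
mean function:
$$m_t(\tau^*) = \sum_{\ell=1}^{p} \psi_\ell'(\tau^*) Q \mu_{t-\ell} + \phi'(\tau^*) \tilde{\Sigma}_e \Phi' \left(\mu_t - \sum_{\ell=1}^{p} \Psi_\ell Q \mu_{t-\ell}\right),$$
where $\psi_\ell'(\tau) = (\psi_\ell(\tau, \tau_1), \ldots, \psi_\ell(\tau, \tau_M))$ and $\Psi_\ell = \{\psi_\ell(\tau_i, \tau_k)\}_{i,k=1}^{M}$.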
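The closed-form simplifications in this proof can be checked numerically. The sketch below is illustrative only: the dimensions, the variances, the random orthonormal $\Phi$ obtained from a QR decomposition, and the vector `phi_star` standing in for $\phi(\tau^*)$ are all assumptions made for the example. It compares the direct Gaussian conditioning formulas with the simplified FDLM expressions.

```python
import numpy as np

rng = np.random.default_rng(1)
M, J = 30, 3
sigma2_eta = 0.25
sigma2_j = np.array([2.0, 1.0, 0.5])

# Orthonormal loading matrix Phi (M x J) with Phi' Phi = I_J (random, for illustration).
Phi, _ = np.linalg.qr(rng.normal(size=(M, J)))
Sigma_e = np.diag(sigma2_j)
Sigma_e_tilde = np.diag(sigma2_j / (sigma2_j + sigma2_eta))

# FDLM covariance and its closed-form inverse.
K = sigma2_eta * np.eye(M) + Phi @ Sigma_e @ Phi.T
K_inv_closed = (np.eye(M) - Phi @ Sigma_e_tilde @ Phi.T) / sigma2_eta

phi_star = rng.normal(size=J)                        # hypothetical phi(tau*)
k_star = phi_star @ Sigma_e @ Phi.T                  # K(tau*)
k_ss = sigma2_eta + phi_star @ Sigma_e @ phi_star    # K(tau*, tau*)
resid = rng.normal(size=M)                           # stands in for mu_t - Psi Q mu_{t-1}

# Direct conditioning versus the simplified expressions.
mean_direct = k_star @ np.linalg.solve(K, resid)
mean_closed = phi_star @ Sigma_e_tilde @ Phi.T @ resid
var_direct = k_ss - k_star @ np.linalg.solve(K, k_star)
var_closed = sigma2_eta + sigma2_eta * phi_star @ Sigma_e_tilde @ phi_star

print(np.allclose(K_inv_closed, np.linalg.inv(K)),
      np.isclose(mean_direct, mean_closed),
      np.isclose(var_direct, var_closed))
```

The simplified forms require only $J$-dimensional products, so interpolating at a new point $\tau^*$ never needs an $M \times M$ solve.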
C.5 Additional Simulation Results
In Figure C.1, we display the results from FAR(1) simulations under the dense
design, while varying both smoothness of $\epsilon_t$ and the sample size, $T$. The
functional data methods all nearly achieve the oracle performance, and are superior
to the multivariate methods. These results confirm the findings of Didericksen
et al. (2012): when $T$ is large and the observation points are dense in $\mathcal{T}$,
existing functional data methods can nearly achieve the oracle performance,
even when $\psi_1$ is estimated poorly. The proposed methods, particularly with
the FDLM (FDLM-FAR(1) and FDLM-FAR(p)), outperform existing functional
data methods for non-smooth GP innovations, and again are far superior for
$\psi_1$ estimation. The uncertainty of $p$ incorporated into the lag selection
procedure (FDLM-FAR(p)) does not appear to inhibit forecasting or estimation of $\psi_1$
substantially.
For further clarity, we plot the Bimodal-Gaussian kernel in Figure C.2, which
is featured prominently in our simulation study.
[Figure C.1: dot plots of MSFEe (top row) and MSEψ1 (bottom row), one panel per design. Top-row methods: RW, Mean, VAR-Y, SES, FAR Classic, VAR-FPC, GP-FAR(1), FDLM-FAR(1), FDLM-FAR(p), FAR Oracle. Bottom-row methods: FAR Classic, GP-FAR(1), FDLM-FAR(1), FDLM-FAR(p). Plot not reproduced.]
Figure C.1: MSFEe (top) and corresponding MSEψ1 (bottom) under various
designs. Left: FAR(1), T = 50, dense design with the Bimodal-Gaussian kernel
and non-smooth GP innovations. Right: FAR(1), T = 350, dense design with the
Bimodal-Gaussian kernel and smooth GP innovations. The proposed methods
provide superior forecasts and nearly achieve the oracle performance, despite
the presence of sparsity.
[Figure C.2: plot of the Bimodal-Gaussian kernel, with axis labels tau and psi(tau, u). Plot not reproduced.]
Figure C.2: The Bimodal-Gaussian kernel,
$$\psi(\tau, u) \propto \frac{0.75}{\pi(0.3)(0.4)}\exp\{-(\tau-0.2)^2/(0.3)^2 - (u-0.3)^2/(0.4)^2\} + \frac{0.45}{\pi(0.3)(0.4)}\exp\{-(\tau-0.7)^2/(0.3)^2 - (u-0.8)^2/(0.4)^2\},$$
normalized so that $\int\!\!\int \psi_\ell^2(\tau, u)\, d\tau\, du = 0.8$.
C.6 Additional Details for the Yield Curve Application
We include MCMC diagnostics for the yield curve application. All diagnostics
were computed using the R package coda (Plummer et al., 2006). In Figures
C.3 and C.4, we provide trace plots for the one-step forecast distributions for
the nominal and real yield curves, respectively, on a single day in 2016 across
selected maturities. The mixing is very efficient, as confirmed by effective
sample sizes that exceed 5,000 in all cases.
In our yield curve forecasting study of Section 4.7, we included two popular
parametric yield curve models based on the Nelson-Siegel parametrization
(Nelson and Siegel, 1987): Diebold and Li (2006, DL) and Diebold et al. (2006,
DRA). The Nelson-Siegel basis is defined by
$$f_1(\tau) = 1, \quad f_2(\tau \mid \lambda_{NS}) = \frac{1-\exp(-\tau\lambda_{NS})}{\tau\lambda_{NS}}, \quad f_3(\tau \mid \lambda_{NS}) = \frac{1-\exp(-\tau\lambda_{NS})}{\tau\lambda_{NS}} - \exp(-\tau\lambda_{NS}),$$
where $\lambda_{NS}$ is an unknown parameter. For both DL and DRA, the yield curve
$Y_t(\tau)$ for time $t$ and time to maturity $\tau$
is written as a linear combination of the Nelson-Siegel basis functions, for which
the corresponding weights are dynamic:
$$Y_t(\tau) = f'(\tau \mid \lambda_{NS})\beta_t + \epsilon_t(\tau), \qquad \text{(C.1)}$$
$$(\beta_t - \mu_\beta) = A(\beta_{t-1} - \mu_\beta) + \eta_t, \qquad \text{(C.2)}$$
where $f'(\tau \mid \lambda_{NS}) = (f_1(\tau), f_2(\tau \mid \lambda_{NS}), f_3(\tau \mid \lambda_{NS}))$, $\beta_t$ is the corresponding
3-dimensional vector of dynamic weights with unconditional mean $\mu_\beta$, and $A$ is
the $3 \times 3$ evolution matrix. For implementation purposes, assume that the yield
curve is observed at a fixed set of maturities $\tau_1, \ldots, \tau_M$, so that (C.1) becomes
$$y_t = F_{NS}\beta_t + \epsilon_t, \qquad \text{(C.3)}$$
where $y_t = (Y_t(\tau_1), \ldots, Y_t(\tau_M))'$, $F_{NS} = (f(\tau_1 \mid \lambda_{NS}), \ldots, f(\tau_M \mid \lambda_{NS}))'$, and
$\epsilon_t = (\epsilon_t(\tau_1), \ldots, \epsilon_t(\tau_M))'$.
The DL approach fixes λNS = 0.0609 and then estimates the parameters us-
ing a multi-step procedure. First, the weights {βt} are estimated using ordinary
least squares from (C.3). Next, the evolution matrix A in (C.2) is estimated as
a VAR coefficient matrix, conditional on {βt}. Diebold and Li (2006) note that
constraining A to be diagonal may improve forecasting in some cases. Finally,
h-step forecasts ŷT+h are computed via ŷT+h = FNSβ̂T+h, where β̂T+h is the
h-step forecast computed from the VAR in (C.2).
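The multi-step DL procedure described above can be sketched as follows. This is an illustrative reading of the steps, not the code used in the forecasting study: the function names, the synthetic data, the maturities in months, and the use of an unrestricted VAR(1) with intercept fit by least squares are assumptions. In (C.2), the intercept corresponds to $(I - A)\mu_\beta$.

```python
import numpy as np

def nelson_siegel_basis(tau, lam):
    """Rows are f(tau_i | lam)' = (f1, f2, f3) for each maturity tau_i > 0."""
    x = lam * np.asarray(tau, dtype=float)
    f2 = (1.0 - np.exp(-x)) / x
    return np.column_stack([np.ones_like(x), f2, f2 - np.exp(-x)])

def diebold_li_forecast(Y, tau, lam=0.0609, h=1):
    """Y is T x M with rows y_t'. Returns the h-step forecast of y_{T+h}."""
    F = nelson_siegel_basis(tau, lam)                    # M x 3 basis matrix F_NS
    # Step 1: OLS of each y_t on F_NS gives the weights beta_t (rows of beta).
    beta = np.linalg.lstsq(F, Y.T, rcond=None)[0].T      # T x 3
    # Step 2: VAR(1) with intercept, beta_t = c + A beta_{t-1} + eta_t.
    X = np.column_stack([np.ones(len(beta) - 1), beta[:-1]])
    coef = np.linalg.lstsq(X, beta[1:], rcond=None)[0]   # 4 x 3
    c, A = coef[0], coef[1:].T
    # Step 3: iterate the VAR h steps ahead and map back to yields.
    b = beta[-1]
    for _ in range(h):
        b = c + A @ b
    return F @ b

# Synthetic example (hypothetical data): weights follow a stationary VAR(1).
rng = np.random.default_rng(0)
tau = np.array([3, 6, 12, 24, 36, 60, 84, 120], dtype=float)  # maturities in months (assumed)
F = nelson_siegel_basis(tau, 0.0609)
beta = np.zeros((100, 3))
beta[0] = [5.0, -1.0, 0.5]
for t in range(1, 100):
    beta[t] = beta[0] + 0.9 * (beta[t - 1] - beta[0]) + 0.1 * rng.standard_normal(3)
Y = beta @ F.T + 0.01 * rng.standard_normal((100, len(tau)))
y_hat = diebold_li_forecast(Y, tau, h=1)
print(y_hat.shape)
```

Constraining $A$ to be diagonal, as Diebold and Li (2006) suggest may help in some cases, would amount to fitting three separate AR(1) regressions in Step 2 instead of the joint least squares fit.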
Alternatively, the DRA approach combines (C.3) and (C.2) into a state space
model, with error distributions $\epsilon_t \overset{iid}{\sim} N(0, H)$ independent of
$\eta_t \overset{iid}{\sim} N(0, Q)$.
DRA assume that $H$ is diagonal; we further assume that $Q$ is diagonal, which
helps stabilize computations. The unknown parameters $\{\lambda_{NS}, A, H, Q\}$ are
then estimated jointly using maximum likelihood based on the Kalman filter.
Following DRA, we model $\lambda_{NS}$ and the diagonal elements of $H$ and $Q$ on the
log-scale to ensure positivity in the optimization routine. Conditional on the
maximum likelihood estimates for these parameters, DRA use standard state
space computations to construct forecasts for the response vector, $y_t$.
[Figure C.3: trace plots of one-step forecasts over 5,000 iterations; panel titles list maturities of 1, 3, 6, 12, 24, and 36 months. Plots not reproduced.]
Figure C.3: Traceplot for one-step forecasts for nominal yield curves at selected
maturities during 2016.
[Figure C.4: trace plots of one-step forecasts over 5,000 iterations; panel titles list maturities of 60, 84, 120, 240, and 360 months. Plots not reproduced.]
Figure C.4: Traceplot for one-step forecasts for real yield curves at selected
maturities during 2016.
C.7 Additional Details on the Quadrature Approximation
Consider the integral in the FAR(1) evolution equation,
$I(\tau) \equiv \int \psi(\tau, u)\mu_{t-1}(u)\, du$,
where we omit dependence of $I$ on $t$ for notational simplicity. In the
proposed methodology, we approximate this integral using quadrature:
$I(\tau) \approx I_M(\tau) \equiv (\psi(\tau, \tau_1), \ldots, \psi(\tau, \tau_M)) Q \mu_{t-1}$, where
$\{\tau_1, \ldots, \tau_M\} = \mathcal{T}_e \subset \mathcal{T}$ is the
set of unique evaluation points, $Q$ is a known $M \times M$ quadrature matrix, and
$\mu_{t-1} = (\mu_{t-1}(\tau_1), \ldots, \mu_{t-1}(\tau_M))'$ is the function $\mu_{t-1}$ evaluated at the evaluation
points. It is important to assess how the accuracy of the approximation of $I$ by
$I_M$ depends on $M$, and in particular to determine a value of $M$ sufficiently large
to produce reasonable approximations in practice. However, there is a tradeoff:
the state vector in the dynamic linear model is $M$-dimensional, so increasing $M$
indiscriminately may unnecessarily increase computation time.
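The quadrature approximation $I_M(\tau)$ can be sketched as below. The sketch assumes a trapezoidal rule on an equally spaced grid over $[0, 1]$, so that $Q$ is diagonal; the text only states that $Q$ is a known quadrature matrix, so this choice, the function names, and the toy kernel and function are all illustrative assumptions.

```python
import numpy as np

def trapezoid_weights(tau):
    """Diagonal quadrature matrix Q for the trapezoidal rule on the grid tau."""
    tau = np.asarray(tau, dtype=float)
    w = np.zeros_like(tau)
    d = np.diff(tau)
    w[:-1] += d / 2.0
    w[1:] += d / 2.0
    return np.diag(w)

def quad_integral(psi, mu_prev, tau_eval, tau_out):
    """I_M(tau) = (psi(tau, tau_1), ..., psi(tau, tau_M)) Q mu_{t-1} for tau in tau_out."""
    Psi = psi(tau_out[:, None], tau_eval[None, :])   # len(tau_out) x M kernel evaluations
    return Psi @ trapezoid_weights(tau_eval) @ mu_prev(tau_eval)

def psi(t, u):
    """Toy smooth kernel (hypothetical, not the Bimodal-Gaussian kernel)."""
    return np.exp(-((t - u) ** 2) / 0.1)

def mu_prev(u):
    """Toy smooth function standing in for mu_{t-1}."""
    return np.sin(2.0 * np.pi * u)

tau_out = np.linspace(0.0, 1.0, 11)
I_ref = quad_integral(psi, mu_prev, np.linspace(0.0, 1.0, 200), tau_out)
err = {
    M: float(np.max(np.abs(quad_integral(psi, mu_prev, np.linspace(0.0, 1.0, M), tau_out) - I_ref)))
    for M in (5, 20, 50)
}
print(err)
```

In this toy setting the maximum error shrinks as $M$ grows, while the cost of the matrix products grows with $M$, which is the tradeoff discussed above.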
We conducted a sensitivity analysis based on the simulations from Section
4.6 of the main paper. In particular, we use the Bimodal-Gaussian kernel,
$$\psi(\tau, u) \propto \frac{0.75}{\pi(0.3)(0.4)}\exp\{-(\tau-0.2)^2/(0.3)^2 - (u-0.3)^2/(0.4)^2\} + \frac{0.45}{\pi(0.3)(0.4)}\exp\{-(\tau-0.7)^2/(0.3)^2 - (u-0.8)^2/(0.4)^2\},$$
normalized so that $\int\!\!\int \psi_\ell^2(\tau, u)\, d\tau\, du = 0.8$.
The Bimodal-Gaussian kernel is nonlinear, and therefore is inherently
more difficult to approximate using linear quadrature methods, such
as the trapezoidal rule. For the other component of the integrand, $\mu_{t-1}$,
we simulate $\mu_{t-1} \sim GP(0, K)$ using the covariance function parameterization
$K = \sigma^2 R_\rho$, where $R_\rho$ is the Matérn correlation function
$R_\rho(\tau, u) = \{2^{\rho_1 - 1}\Gamma(\rho_1)\}^{-1} (\|\tau - u\|/\rho_2)^{\rho_1} K_{\rho_1}(\|\tau - u\|/\rho_2)$,
$\Gamma(\cdot)$ is the gamma function, $K_{\rho_1}$
is the modified Bessel function of the second kind of order $\rho_1$, and $\rho = (\rho_1, \rho_2)$ are parameters
(Matérn, 2013). We let $\sigma = 0.01$ and $\rho = (\rho_1, 0.1)$, with $\rho_1 = 2.5$ for smooth
(twice-differentiable) sample paths and $\rho_1 = 0.5$ for non-smooth (continuous,
non-differentiable) sample paths. Comparisons between these cases are important:
the non-smooth setting is substantially more challenging for approximations.
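A minimal sketch of simulating $\mu_{t-1} \sim GP(0, \sigma^2 R_\rho)$ for the two smoothness settings follows. To avoid a Bessel function routine, it uses the closed forms that the Matérn correlation above takes at half-integer orders, namely $R_\rho = e^{-x}$ for $\rho_1 = 0.5$ and $R_\rho = e^{-x}(1 + x + x^2/3)$ for $\rho_1 = 2.5$, with $x = \|\tau - u\|/\rho_2$. The grid size, the jitter term, and the function names are assumptions made for the example.

```python
import numpy as np

def matern_corr(d, rho1, rho2):
    """Matern correlation in the parameterization above, for rho1 in {0.5, 2.5}."""
    x = np.asarray(d, dtype=float) / rho2
    if rho1 == 0.5:
        return np.exp(-x)
    if rho1 == 2.5:
        return np.exp(-x) * (1.0 + x + x ** 2 / 3.0)
    raise ValueError("closed form implemented only for rho1 in {0.5, 2.5}")

def simulate_gp(tau, sigma, rho1, rho2, rng):
    """Draw one sample path of GP(0, sigma^2 R_rho) on the grid tau."""
    D = np.abs(tau[:, None] - tau[None, :])
    K = sigma ** 2 * matern_corr(D, rho1, rho2)
    # A small jitter (an implementation convenience) keeps the Cholesky factorization stable.
    L = np.linalg.cholesky(K + 1e-8 * sigma ** 2 * np.eye(len(tau)))
    return L @ rng.standard_normal(len(tau))

rng = np.random.default_rng(0)
tau = np.linspace(0.0, 1.0, 50)
mu_smooth = simulate_gp(tau, sigma=0.01, rho1=2.5, rho2=0.1, rng=rng)
mu_rough = simulate_gp(tau, sigma=0.01, rho1=0.5, rho2=0.1, rng=rng)
print(mu_smooth.shape, mu_rough.shape)
```

For other values of $\rho_1$, the general formula would need a modified Bessel function of the second kind, which is not implemented in this sketch.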
For each simulated value of $\mu_{t-1} \sim GP(0, K)$, we compute $I_{200}(\tau)$, which
we use as a proxy for the true (but unknown) integral value $I(\tau)$, and compare
it to $I_M(\tau)$ for $M \in \{5, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90, 100\}$. Note that the
approximation induced by $I_{200}(\tau)$ is also used to generate the simulations of
Section 4.6 in the main paper. We measure accuracy using the relative absolute
error (RAE) and the standardized squared error (SSE), defined respectively by
$$R_M = \int \left|\frac{I_{200}(\tau) - I_M(\tau)}{I_{200}(\tau)}\right| d\tau, \qquad S_M = \int \frac{(I_{200}(\tau) - I_M(\tau))^2}{\sigma^2}\, d\tau, \qquad \text{(C.4)}$$
which we compute for each simulation. We report the pointwise medians for
each $R_M$ and $S_M$ as a function of $M$ in Figure E1. As expected, for fixed $M$,
the integral approximation is more accurate when $\mu_{t-1}$, and therefore the
integrand, is smooth. Nonetheless, the relative gains of increasing $M$ decline
quickly for $M > 20$ in both cases.
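The two error measures in (C.4) can be computed on a grid as sketched below. The outer integrals over $\tau$ are approximated here with a trapezoidal rule on an equally spaced grid, which is an assumption of this example, as are the function names and the toy inputs. Note that the RAE is undefined wherever the reference value $I_{200}(\tau)$ is zero.

```python
import numpy as np

def trapz(y, x):
    """Trapezoidal rule for integrating samples y over the grid x."""
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))

def rae(I_ref, I_M, tau):
    """Relative absolute error R_M, with I_ref playing the role of I_200."""
    return trapz(np.abs((I_ref - I_M) / I_ref), tau)

def sse(I_ref, I_M, tau, sigma):
    """Standardized squared error S_M."""
    return trapz((I_ref - I_M) ** 2, tau) / sigma ** 2

# Toy check (hypothetical values): I_M overshoots the reference by exactly 1 percent.
tau = np.linspace(0.0, 1.0, 101)
I_ref = 1.0 + tau
I_M = 1.01 * I_ref
r = rae(I_ref, I_M, tau)
s = sse(I_ref, I_M, tau, sigma=0.01)
print(r, s)
```

In the toy check the RAE equals the 1 percent relative error integrated over the unit interval, and the SSE is close to $7/3$, since $\int_0^1 (1+\tau)^2 d\tau = 7/3$ and the factor $0.01^2$ cancels with $\sigma^2$.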
[Figure E1: panels of standardized squared errors and relative absolute errors plotted against M (horizontal axis from 20 to 100). Plots not reproduced.]
Figure E1: Standardized squared errors and relative absolute errors for smooth
(top) and non-smooth (bottom) integrands. The errors are small in magnitude,
particularly in the smooth case, and decay quickly for M > 20.
BIBLIOGRAPHY
Abramovich, F., Sapatinas, T., and Silverman, B. W. (1998). Wavelet threshold-
ing via a Bayesian approach. Journal of the Royal Statistical Society: Series B
(Statistical Methodology), 60(4):725–749.
Albert, J. H. and Chib, S. (1993). Bayes inference via Gibbs sampling of autore-
gressive time series subject to Markov mean and variance shifts. Journal of
Business & Economic Statistics, 11(1):1–15.
Armagan, A., Clyde, M., and Dunson, D. B. (2011). Generalized Beta mixtures of
Gaussians. In Advances in neural information processing systems, pages 523–531.
Arnold, T. B. and Tibshirani, R. J. (2014). genlasso: Path algorithm for generalized
lasso problems. R package version 1.3.
Aue, A., Norinho, D. D., and Hörmann, S. (2015). On the prediction of sta-
tionary functional time series. Journal of the American Statistical Association,
110(509):378–392.
Bae, K. and Mallick, B. K. (2004). Gene selection using a two-level hierarchical
Bayesian model. Bioinformatics, 20(18):3423–3430.
Barndorff-Nielsen, O., Kent, J., and Sørensen, M. (1982). Normal variance-mean
mixtures and z distributions. International Statistical Review/Revue Interna-
tionale de Statistique, pages 145–159.
Behseta, S., Kass, R. E., and Wallstrom, G. L. (2005). Hierarchical models for
assessing variability among functions. Biometrika, 92(2):419–434.
Belmonte, M. A., Koop, G., and Korobilis, D. (2014). Hierarchical shrinkage in
time-varying parameter models. Journal of Forecasting, 33(1):80–94.
Berger, J. (1980). A robust generalized Bayes estimator and confidence region
for a multivariate normal mean. The Annals of Statistics, pages 716–761.
Berry, S. M., Carroll, R. J., and Ruppert, D. (2002). Bayesian smoothing and
regression splines for measurement error problems. Journal of the American
Statistical Association, 97(457):160–169.
Besse, P. C., Cardot, H., and Stephenson, D. B. (2000). Autoregressive forecasting
of some functional climatic variations. Scandinavian Journal of Statistics, pages
673–687.
Bloomfield, P. (2004). Fourier analysis of time series. John Wiley & Sons.
Bolder, D., Johnson, G., and Metzler, A. (2004). An empirical analysis of the Cana-
dian term structure of zero-coupon interest rates. Bank of Canada.
Bosq, D. (2000). Linear processes in function spaces: theory and applications, volume
149. Springer Science & Business Media.
Bosq, D. and Blanke, D. (2008). Inference and prediction in large dimensions, volume
754. John Wiley & Sons.
Botly, L. C. and De Rosa, E. (2009). Cholinergic deafferentation of the neocortex
using 192 IgG-saporin impairs feature binding in rats. The Journal of Neuro-
science, 29(13):4120–4130.
Bowsher, C. G. and Meeks, R. (2008). The dynamics of economic functions:
modeling and forecasting the yield curve. Journal of the American Statistical
Association, 103(484).
Cardot, H., Ferraty, F., and Sarda, P. (1999). Functional linear model. Statistics &
Probability Letters, 45(1):11–22.
Carhart, M. M. (1997). On persistence in mutual fund performance. The Journal
of finance, 52(1):57–82.
Carvalho, C. M., Lopes, H. F., and Aguilar, O. (2011). Dynamic stock selection
strategies: A structured factor model framework. Bayesian Statistics, 9:1–21.
Carvalho, C. M., Polson, N. G., and Scott, J. G. (2009). Handling sparsity via the
horseshoe. In AISTATS, volume 5, pages 73–80.
Carvalho, C. M., Polson, N. G., and Scott, J. G. (2010). The horseshoe estimator
for sparse signals. Biometrika, pages 465–480.
Chan, J. C. (2013). Moving average stochastic volatility models with application
to inflation forecast. Journal of Econometrics, 176(2):162–172.
Chan, J. C. and Jeliazkov, I. (2009). Efficient simulation and integrated likeli-
hood estimation in state space models. International Journal of Mathematical
Modelling and Numerical Optimisation, 1(1-2):101–120.
Chan, J. C., Koop, G., Leon-Gonzalez, R., and Strachan, R. W. (2012). Time
varying dimension models. Journal of Business & Economic Statistics, 30(3):358–
367.
Chen, Y. and Li, B. (2015). An adaptive functional autoregressive forecast model
to predict electricity price curves. Journal of Business & Economic Statistics,
(just-accepted):1–56.
Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the Ameri-
can Statistical Association, 90(432):1313–1321.
Chib, S. and Ergashev, B. (2009). Analysis of multifactor affine yield curve mod-
els. Journal of the American Statistical Association, 104(488):1324–1337.
Chib, S., Nardari, F., and Shephard, N. (2002). Markov chain Monte Carlo meth-
ods for stochastic volatility models. Journal of Econometrics, 108(2):281–316.
Constantine, W. and Percival, D. (2016). wmtsa: Wavelet Methods for Time Series
Analysis. R package version 2.0-2.
Crainiceanu, C., Ruppert, D., and Wand, M. P. (2005). Bayesian analysis for
penalized spline regression using WinBUGS. Journal of Statistical Software,
14(14):1–24.
Cressie, N. and Wikle, C. K. (2011). Statistics for spatio-temporal data. John Wiley
& Sons.
Cruz-Marcelo, A., Ensor, K. B., and Rosner, G. L. (2011). Estimating the term
structure with a semiparametric Bayesian hierarchical model: an application
to corporate bonds. Journal of the American Statistical Association, 106(494).
Damon, J. and Guillas, S. (2002). The inclusion of exogenous variables in func-
tional autoregressive ozone forecasting. Environmetrics, 13:759–774.
Damon, J. and Guillas, S. (2005). Estimation and simulation of autoregressive
Hilbertian processes with exogenous variables. Statistical Inference for Stochastic
Processes, 8(2):185–204.
Daníelsson, J. (1998). Multivariate stochastic volatility models: estimation and a
comparison with VGARCH models. Journal of Empirical Finance, 5(2):155–173.
Datta, J. and Ghosh, J. K. (2013). Asymptotic properties of Bayes risk for the
horseshoe prior. Bayesian Analysis, 8(1):111–132.
Dées, S. and Saint-Guilhem, A. (2011). The role of the United States in the global
economy and its evolution over time. Empirical Economics, 41(3):573–591.
Didericksen, D., Kokoszka, P., and Zhang, X. (2012). Empirical properties of
forecasts with the functional autoregressive model. Computational Statistics,
27(2):285–298.
Diebold, F. X. and Li, C. (2006). Forecasting the term structure of government
bond yields. Journal of Econometrics, 130(2):337–364.
Diebold, F. X., Li, C., and Yue, V. Z. (2008). Global yield curve dynamics
and interactions: a dynamic Nelson–Siegel approach. Journal of Econometrics,
146(2):351–363.
Diebold, F. X., Rudebusch, G. D., and Aruoba, B. S. (2006). The macroeconomy
and the yield curve: a dynamic latent factor approach. Journal of Econometrics,
131(1):309–338.
Donoho, D. L. and Johnstone, J. M. (1994). Ideal spatial adaptation by wavelet
shrinkage. Biometrika, 81(3):425–455.
Durbin, J. and Koopman, S. J. (2002). A simple and efficient simulation smoother
for state space time series analysis. Biometrika, 89(3):603–616.
Earls, C. and Hooker, G. (2014). Bayesian covariance estimation and inference
in latent Gaussian process models. Statistical Methodology, 18:79–100.
Eubank, R. L. (1999). Nonparametric regression and spline smoothing. CRC Press.
Fama, E. F. and French, K. R. (1993). Common risk factors in the returns on
stocks and bonds. Journal of Financial Economics, 33(1):3–56.
Fama, E. F. and French, K. R. (2015). A five-factor asset pricing model. Journal of
Financial Economics, 116(1):1–22.
Faulkner, J. R. and Minin, V. N. (2016). Locally adaptive smoothing with Markov
random fields and shrinkage priors. Bayesian Analysis.
Fernandez, C. and Steel, M. F. (2000). Bayesian regression analysis with scale
mixtures of normals. Econometric Theory, 16(01):80–101.
Ferraty, F. and Vieu, P. (2006). Nonparametric functional data analysis: theory and
practice. Springer.
Figueiredo, M. A. (2003). Adaptive sparseness for supervised learning. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 25(9):1150–1159.
Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for gen-
eralized linear models via coordinate descent. Journal of Statistical Software,
33(1):1–22.
Frühwirth-Schnatter, S. and Wagner, H. (2010). Stochastic model specification
search for Gaussian and partial non-Gaussian state space models. Journal of
Econometrics, 154(1):85–100.
Gamerman, D. and Migon, H. S. (1993). Dynamic hierarchical models. Journal
of the Royal Statistical Society. Series B (Methodological), pages 629–642.
Gelman, A. (2006). Prior distributions for variance parameters in hierarchi-
cal models (comment on article by Browne and Draper). Bayesian Analysis,
1(3):515–534.
Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation
and Bayesian model determination. Biometrika, 82(4):711–732.
Green, P. J. and Silverman, B. W. (1993). Nonparametric regression and generalized
linear models: a roughness penalty approach. CRC Press.
Griffin, J. E. and Brown, P. J. (2005). Alternative prior distributions for variable
selection with very many more variables than observations. Technical report,
University of Warwick, Centre for Research in Statistical Methodology.
Griffin, J. E. and Brown, P. J. (2010). Inference with normal-gamma prior distri-
butions in regression problems. Bayesian Analysis, 5(1):171–188.
Gu, C. (1992). Penalized likelihood regression: a Bayesian analysis. Statistica
Sinica, 2(1):255–264.
Harvey, A., Ruiz, E., and Shephard, N. (1994). Multivariate stochastic variance
models. The Review of Economic Studies, 61(2):247–264.
Hastie, T. and Tibshirani, R. (1993). Varying-coefficient models. Journal of the
Royal Statistical Society. Series B (Methodological), pages 757–796.
Hays, S., Shen, H., and Huang, J. Z. (2012). Functional dynamic factor models
with application to yield curve forecasting. The Annals of Applied Statistics,
6(3):870–894.
Horváth, L. and Kokoszka, P. (2012). Inference for functional data with applications,
volume 200. Springer Science & Business Media.
Hyndman, R. and Khandakar, Y. (2008). Automatic time series forecasting: The
forecast package for R. Journal of Statistical Software, 27(1):1–22.
Hyndman, R. J. and Ullah, M. S. (2007). Robust forecasting of mortality and fer-
tility rates: a functional data approach. Computational Statistics & Data Analy-
sis, 51(10):4942–4956.
James, N. A., Kejariwal, A., and Matteson, D. S. (2016). Leveraging cloud data
to mitigate user experience from ‘Breaking Bad’. In 2016 IEEE International
Conference on Big Data, pages 3499–3508. IEEE.
Jungbacker, B., Koopman, S. J., and van der Wel, M. (2013). Smooth dynamic
factor analysis with application to the US term structure of interest rates. Jour-
nal of Applied Econometrics.
Kalli, M. and Griffin, J. E. (2014). Time-varying sparsity in dynamic regression
models. Journal of Econometrics, 178(2):779–793.
Kargin, V. and Onatski, A. (2008). Curve forecasting by functional autoregres-
sion. Journal of Multivariate Analysis, 99(10):2508–2526.
Kastner, G. (2016). Dealing with stochastic volatility in time series using the R
package stochvol. Journal of Statistical Software, 69(5):1–30.
Kastner, G. and Frühwirth-Schnatter, S. (2014). Ancillarity-sufficiency inter-
weaving strategy (ASIS) for boosting MCMC estimation of stochastic volatil-
ity models. Computational Statistics & Data Analysis, 76:408–423.
Kaufman, C. G. and Sain, S. R. (2010). Bayesian functional ANOVA modeling
using Gaussian process prior distributions. Bayesian Analysis, 5(1):123–149.
Kim, S., Shephard, N., and Chib, S. (1998). Stochastic volatility: likelihood in-
ference and comparison with ARCH models. The Review of Economic Studies,
65(3):361–393.
Kim, S.-J., Koh, K., Boyd, S., and Gorinevsky, D. (2009). $\ell_1$ trend filtering. SIAM
Review, 51(2):339–360.
Kokoszka, P. (2012). Dependent functional data. ISRN Probability and Statistics,
2012.
Kokoszka, P. and Reimherr, M. (2013). Determining the order of the functional
autoregressive model. Journal of Time Series Analysis, 34(1):116–129.
Koopman, S. J. and Durbin, J. (2000). Fast filtering and smoothing for multivari-
ate state space models. Journal of Time Series Analysis, 21(3):281–296.
Koopman, S. J. and Durbin, J. (2003). Filtering and smoothing of state vector for
diffuse state-space models. Journal of Time Series Analysis, 24(1):85–98.
Koopman, S. J., Mallee, M. I., and Van der Wel, M. (2010). Analyzing the term
structure of interest rates using the dynamic Nelson–Siegel model with time-
varying parameters. Journal of Business & Economic Statistics, 28(3):329–343.
Korobilis, D. (2013a). Hierarchical shrinkage priors for dynamic regressions
with many predictors. International Journal of Forecasting, 29(1):43–59.
Korobilis, D. (2013b). VAR forecasting using Bayesian variable selection. Journal
of Applied Econometrics, 28(2):204–230.
Kowal, D. R., Matteson, D. S., and Ruppert, D. (2016). A Bayesian multivariate
functional dynamic linear model. Journal of the American Statistical Association.
(in press).
Kowal, D. R., Matteson, D. S., and Ruppert, D. (2017). Functional autoregression
for sparsely sampled data. Journal of Business & Economic Statistics, pages 1–13.
Kuo, L. and Mallick, B. (1998). Variable selection for regression models. Sankhyā:
The Indian Journal of Statistics, Series B, pages 65–81.
Kyung, M., Gill, J., Ghosh, M., and Casella, G. (2010). Penalized regression,
standard errors, and Bayesian lassos. Bayesian Analysis, 5(2):369–411.
Laurini, M. P. (2014). Dynamic functional data analysis with non-parametric
state space models. Journal of Applied Statistics, 41(1):142–163.
Laurini, M. P. and Hotta, L. K. (2010). Bayesian extensions to Diebold-Li term
structure model. International Review of Financial Analysis, 19(5):342–350.
Li, B., DeWetering, E., Lucas, G., Brenner, R., and Shapiro, A. (2001). Merrill
Lynch exponential spline model. Technical report, Merrill Lynch working
paper.
Ljubojevic, V., Bennett, L.-A., Gill, P. R., Luu, P., Takehara-Nishiuchi, K., and
De Rosa, E. (2013). Cholinergic modulation of attention-driven oscillations
during feature binding in rats. In Society for Neuroscience.
Matérn, B. (2013). Spatial variation, volume 36. Springer Science & Business
Media.
Matteson, D. S., McLean, M. W., Woodard, D. B., and Henderson, S. G. (2011).
Forecasting emergency medical service call arrival rates. The Annals of Applied
Statistics, 5(2B):1379–1406.
McCausland, W. J., Miller, S., and Pelletier, D. (2011). Simulation smoothing
for state–space models: A computational efficiency analysis. Computational
Statistics & Data Analysis, 55(1):199–212.
McCulloch, R. E. and Tsay, R. S. (1993). Bayesian inference and prediction for
mean and variance shifts in autoregressive time series. Journal of the American
Statistical Association, 88(423):968–978.
Nakajima, J. and West, M. (2013). Bayesian analysis of latent threshold dynamic
models. Journal of Business & Economic Statistics, 31(2):151–164.
Nason, G. (2016). wavethresh: Wavelets Statistics and Transforms. R package
version 4.6.8.
Neal, R. M. (1999). Regression and classification using Gaussian process priors.
Bayesian Statistics, 6:475–501.
Neal, R. M. (2003). Slice sampling. Annals of Statistics, pages 705–741.
Nelson, C. R. and Siegel, A. F. (1987). Parsimonious modeling of yield curves.
Journal of Business, 60(4):473.
Omori, Y., Chib, S., Shephard, N., and Nakajima, J. (2007). Stochastic volatility
with leverage: Fast and efficient likelihood inference. Journal of Econometrics,
140(2):425–449.
O’Sullivan, F. (1986). A statistical perspective on ill-posed inverse problems.
Statistical Science, pages 502–518.
Petris, G., Petrone, S., and Campagnoli, P. (2009). Dynamic linear models with R.
Springer.
Piironen, J. and Vehtari, A. (2016). On the hyperprior choice for the global
shrinkage parameter in the horseshoe prior. arXiv preprint arXiv:1610.05559.
Plummer, M., Best, N., Cowles, K., and Vines, K. (2006). CODA: Convergence
diagnosis and output analysis for MCMC. R News, 6(1):7–11.
Polson, N. G. and Scott, J. G. (2010). Shrink globally, act locally: sparse Bayesian
regularization and prediction. Bayesian Statistics, 9:501–538.
Polson, N. G. and Scott, J. G. (2012a). Local shrinkage rules, Lévy processes and
regularized regression. Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 74(2):287–311.
Polson, N. G. and Scott, J. G. (2012b). On the half-Cauchy prior for a global scale
parameter. Bayesian Analysis, 7(4):887–902.
Polson, N. G., Scott, J. G., and Windle, J. (2013). Bayesian inference for logistic
models using Pólya–Gamma latent variables. Journal of the American Statistical
Association, 108(504):1339–1349.
Ramsay, J. and Silverman, B. (2005). Functional Data Analysis. Springer.
Ramsay, J. O. (2006). Functional data analysis. Wiley Online Library.
Ramsay, J. O., Wickham, H., Graves, S., and Hooker, G. (2014). fda: Functional
Data Analysis. R package version 2.4.4.
Rasmussen, C. E. and Williams, C. K. (2006). Gaussian processes for machine
learning. The MIT Press.
Rue, H. (2001). Fast sampling of Gaussian Markov random fields. Journal of the
Royal Statistical Society: Series B (Statistical Methodology), 63(2):325–338.
Ruppert, D., Wand, M. P., and Carroll, R. J. (2003). Semiparametric regression.
Number 12. Cambridge University Press.
Sharpe, W. F. (1964). Capital asset prices: A theory of market equilibrium under
conditions of risk. The Journal of Finance, 19(3):425–442.
Shi, J. Q. and Choi, T. (2011). Gaussian process regression analysis for functional
data. CRC Press.
Shumway, R. H. and Stoffer, D. S. (2000). Time series analysis and its applications,
volume 3. Springer New York.
Staicu, A.-M., Crainiceanu, C. M., Reich, D. S., and Ruppert, D. (2012). Modeling
functional data with spatially heterogeneous shape characteristics. Biometrics,
68(2):331–343.
Strawderman, W. E. (1971). Proper Bayes minimax estimators of the multivariate
normal mean. The Annals of Mathematical Statistics, 42(1):385–388.
Svensson, L. E. (1994). Estimating and interpreting forward interest rates:
Sweden 1992–1994. Technical report, National Bureau of Economic Research.
Taylor, S. J. (1994). Modeling stochastic volatility: A review and comparative
study. Mathematical Finance, 4(2):183–204.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of
the Royal Statistical Society. Series B (Methodological), pages 267–288.
Tibshirani, R. J. (2014). Adaptive piecewise polynomial estimation via trend
filtering. The Annals of Statistics, 42(1):285–323.
Van der Linde, A. (1995). Splines from a Bayesian point of view. Test, 4(1):63–81.
van der Pas, S., Kleijn, B., and van der Vaart, A. (2014). The horseshoe
estimator: Posterior concentration around nearly black vectors. Electronic
Journal of Statistics, 8(2):2585–2618.
Waggoner, D. F. (1997). Spline methods for extracting interest rate curves from coupon
bond prices, volume 97. Federal Reserve Bank of Atlanta USA.
Wahba, G. (1978). Improper priors, spline smoothing and the problem of
guarding against model errors in regression. Journal of the Royal Statistical
Society. Series B (Methodological), pages 364–372.
Wahba, G. (1983). Bayesian “confidence intervals” for the cross-validated
smoothing spline. Journal of the Royal Statistical Society. Series B
(Methodological), pages 133–150.
Wahba, G. (1990). Spline models for observational data, volume 59. SIAM.
Wand, M. and Ormerod, J. (2008). On semiparametric regression with
O’Sullivan penalized splines. Australian & New Zealand Journal of Statistics,
50(2):179–198.
West, M. and Harrison, J. (1997). Bayesian Forecasting and Dynamic Models.
Springer.
Wood, S. N. (2003). Thin plate regression splines. Journal of the Royal Statistical
Society: Series B (Statistical Methodology), 65(1):95–114.
Yao, F., Müller, H.-G., and Wang, J.-L. (2005). Functional data analysis for sparse
longitudinal data. Journal of the American Statistical Association,
100(470):577–590.
Zhu, B. and Dunson, D. B. (2013). Locally adaptive Bayes nonparametric
regression via nested Gaussian processes. Journal of the American Statistical
Association, 108(504):1445–1456.