BAYESIAN METHODS FOR FUNCTIONAL AND TIME SERIES DATA

A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

by Daniel R. Kowal
August 2017

© 2017 Daniel R. Kowal
ALL RIGHTS RESERVED

Daniel R. Kowal, Ph.D.
Cornell University 2017

We introduce new Bayesian methodology for modeling functional and time series data. While broadly applicable, the methodology focuses on the challenging cases in which (1) functional data exhibit additional dependence, such as time dependence or contemporaneous dependence; (2) functional or time series data demonstrate local features, such as jumps or rapidly-changing smoothness; and (3) a time series of functional data is observed sparsely or irregularly with non-negligible measurement error. A unifying characteristic of the proposed methods is the employment of the dynamic linear model (DLM) framework in new contexts to construct highly efficient Gibbs sampling algorithms.

To model dependent functional data, we extend DLMs for multivariate time series data to the functional data setting, and identify a smooth, time-invariant functional basis for the functional observations. The proposed model provides flexible modeling of complex dependence structures among the functional observations, such as time dependence, contemporaneous dependence, stochastic volatility, and covariates. We apply the model to multi-economy yield curve data and local field potential brain signals in rats.

For locally adaptive Bayesian time series and regression analysis, we propose a novel class of dynamic shrinkage processes. We extend a broad class of popular global-local shrinkage priors, such as the horseshoe prior, to the dynamic setting by allowing the local scale parameters to depend on the history of the shrinkage process.
We prove that the resulting processes inherit desirable shrinkage behavior from the non-dynamic analogs, but provide additional locally adaptive shrinkage properties. We demonstrate the substantial empirical gains from the proposed dynamic shrinkage processes using extensive simulations, a Bayesian trend filtering model for irregular curve-fitting of CPU usage data, and an adaptive time-varying parameter regression model, which we employ to study the dynamic relevance of the factors in the Fama-French asset pricing model.

Finally, we propose a hierarchical functional autoregressive (FAR) model with Gaussian process innovations for forecasting and inference of sparsely or irregularly sampled functional time series data. We prove finite-sample forecasting and interpolation optimality properties of the proposed model, which remain valid with the Gaussian assumption relaxed. We apply the proposed methods to produce highly competitive forecasts of daily U.S. nominal and real yield curves.

BIOGRAPHICAL SKETCH

Daniel Ryan Kowal was born in Albany, New York. After finishing high school at Salesianum School in Wilmington, Delaware, Daniel attended Washington University in St. Louis. While at Washington University, Daniel participated in the Pathfinder Program for Environmental Sustainability, completed a senior honors thesis, "Applications of linear mixed effects models: an analysis of Missouri school data", and graduated summa cum laude in mathematics with minors in computer science and legal studies. After graduating in 2012, Daniel entered the Cornell University Ph.D. program in statistics. During his graduate studies, Daniel co-authored publications in the Journal of the American Statistical Association, the Journal of Business & Economic Statistics, Cellular and Molecular Bioengineering, and the Journal of Biomechanics.
He has received student paper awards from the American Statistical Association in both the Section on Bayesian Statistical Science and the Nonparametric Statistics Section. Following the completion of his Ph.D., Daniel will join the Rice University Department of Statistics as an assistant professor.

To my parents and my brother.

ACKNOWLEDGEMENTS

First, I would like to express my sincere gratitude to my co-advisers, Dr. David S. Matteson and Dr. David Ruppert, for their guidance, their time, and their commitment to my development as an independent researcher. I would also like to thank my undergraduate thesis advisor, Dr. Jimin Ding, for helping to set me on this path.

Second, I would like to thank my fellow Ph.D. students and friends, especially Dr. Amy Willis, Dr. David Sinclair, and Dr. William Nicholson. Both celebration and commiseration are unwritten prerequisites for graduate study, and their vital roles in each are greatly appreciated.

Third, I would like to thank my family, especially my parents, for persistently emphasizing the value of higher education and extracurricular learning, and my brother, for teaching me mathematics from an early age, despite my frequent protests.

And finally, I especially thank my wife, Dr. Marsha Kowal, for the encouraging notes, the travel packs, the early morning breakfast surprises, and most importantly, for her unwavering support.

TABLE OF CONTENTS

Biographical Sketch
Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures
1 Introduction
2 A Bayesian Multivariate Functional Dynamic Linear Model
    2.1 Introduction
    2.2 A Multivariate Functional Dynamic Linear Model
    2.3 Estimating the Factor Loading Curves
        2.3.1 Splines
        2.3.2 Bayesian Splines
        2.3.3 Constrained Bayesian Splines
        2.3.4 Common Factor Loading Curves for Multivariate Modeling
    2.4 Data Analysis and Results
        2.4.1 Multi-Economy Yield Curves
        2.4.2 Multivariate Time-Frequency Analysis for Local Field Potential
    2.5 Conclusions
3 Dynamic Shrinkage Processes
    3.1 Introduction
    3.2 Dynamic Shrinkage Processes
        3.2.1 Stochastic Volatility Models for Dynamic Scale Parameters
        3.2.2 Log-Scale Representations of Global-Local Priors
        3.2.3 Scale Mixtures via Pólya-Gamma Processes
    3.3 Bayesian Trend Filtering with Dynamic Shrinkage Processes
        3.3.1 Bayesian Trend Filtering: Simulations
        3.3.2 Bayesian Trend Filtering: Application to CPU Usage Data
    3.4 Joint Shrinkage for Time-Varying Parameter Models
        3.4.1 Time-Varying Parameter Models: Simulations
        3.4.2 Time-Varying Parameter Models: The Fama-French Asset Pricing Model
    3.5 MCMC Sampling Algorithm and Computational Details
        3.5.1 Efficient Sampling for the Dynamic Shrinkage Process
    3.6 Conclusions
4 Functional Autoregression for Sparsely Sampled Data
    4.1 Introduction
    4.2 Hierarchical Gaussian Processes for FAR
        4.2.1 Dynamic Linear Models for FAR(p)
    4.3 A Dynamic Functional Factor Model for the Innovation Process
    4.4 Modeling the FAR Kernel
    4.5 Finite-Dimensional Optimality
    4.6 Simulations
        4.6.1 Sampling Designs
        4.6.2 Competing Estimators
        4.6.3 Results
    4.7 Forecasting Nominal and Real Yield Curves
    4.8 Concluding Remarks
5 Conclusions
A A Bayesian Multivariate Functional Dynamic Linear Model
    A.1 Initialization
        A.1.1 Common Factor Loading Curves
    A.2 Sampling
        A.2.1 General Algorithm
        A.2.2 Sampling the Common Trend Hidden Markov Model
    A.3 Additional Figures
B Dynamic Shrinkage Processes
    B.1 MCMC Sampling Algorithm and Computational Details
        B.1.1 Efficient Sampling for the Dynamic Shrinkage Process
        B.1.2 Efficient Sampling for the State Variables
    B.2 Linear Regression for the Fama-French Asset Pricing Model
C Functional Autoregression for Sparsely Sampled Data
    C.1 Priors
    C.2 Proof of Theorem 4.1
    C.3 Initialization and MCMC Sampling Algorithm
        C.3.1 Initialization
        C.3.2 Gibbs Sampling Algorithm
    C.4 Additional Theoretical Results
        C.4.1 Proof of Proposition 4.1
        C.4.2 DLM Recursions and Special Cases of Theorem 4.1
        C.4.3 Proof of Theorem 4.2
    C.5 Additional Simulation Results
    C.6 Additional Details for the Yield Curve Application
    C.7 Additional Details on the Quadrature Approximation

LIST OF TABLES

2.1 Posterior means and 95% HPD intervals for $\gamma_k^{(c)}$, which measures the strength of the linear relationship between $\beta_{k,t}^{(c)}$ and $\beta_{k,t}^{(1)}$.

3.1 Special cases of the inverted-Beta prior.

4.1 h-step RMSFEs for nominal yields, grouped (left to right) by multivariate methods, parametric yield curve models, existing functional data methods, and proposed hierarchical FAR methods. The minimum RMSFE in each row is italicized.

4.2 h-step RMSFEs for real yields, grouped (left to right) by multivariate methods, parametric yield curve models, existing functional data methods, and proposed hierarchical FAR methods. The minimum RMSFE in each row is italicized.

B.1 Ordinary linear regression results for the weekly manufacturing industry data in the six-factor model. Significant factors at the 5% level are italicized.

B.2 Ordinary linear regression results for the weekly healthcare industry data in the six-factor model. Significant factors at the 5% level are italicized.
LIST OF FIGURES

2.1 Multi-economy yield curves from July 29, 2011 (solid) and August 5, 2011 (dashed), together with the corresponding one-week change curves.

2.2 Posterior means of the common FLCs, $\{f_1, f_2, f_3, f_4\}$, as a function of maturity, $\tau$.

2.3 The MCMC sample proportions of $r_{k,(c),t}^2$ and $\sum_{k=1}^{4} r_{k,(c),t}^2$ that exceed the 95th percentile of the assumed $\chi^2$-distributions.

2.4 The raw LFP data from a rat during an FS trial. The vertical lines indicate the approximate time at which the rat processed the stimuli, $t^*$.

2.5 Pointwise 95% HPD intervals and the posterior mean for $\bar{\mu}_t^{(3)}$, which is the average difference in squared coherence between the FC and FS trials. The black vertical lines indicate the event time $t^*$.

3.1 Bayesian trend filtering (D = 2) with dynamic horseshoe process innovations of minute-by-minute CPU usage data. (a) Observed data $y_t$ (points), posterior expectation (cyan) of $\beta_t$, and 95% pointwise highest posterior density (HPD) credible intervals (light gray) and 95% simultaneous credible bands (dark gray) for the posterior predictive distribution of $y_t$. (b) Second difference of observed data $\Delta^2 y_t$ (points), posterior expectation of $\omega_t = \Delta^2 \beta_t$ (cyan), and 95% pointwise HPD intervals (light gray) and simultaneous credible bands (dark gray) for the posterior predictive distribution of $\Delta^2 y_t$. (c) Posterior expectation of time-dependent observation standard deviations, $\sigma_t$. (d) Posterior expectation of time-dependent innovation (prior) standard deviations, $\tau\lambda_t$.

3.2 Simulation-based estimate of the stationary distribution of $\kappa_t$ for various AR(1) coefficients $\phi$.
The blue line indicates the density of $\kappa_t$ in the static ($\phi = 0$) horseshoe, $[\kappa] \sim \mathrm{Beta}(1/2, 1/2)$.

3.3 Fitted curves for simulated data with T = 128 and RSNR = 7. Each panel includes the simulated observations (x-marks), the posterior expectations of $\beta_t$ (cyan), and the 95% pointwise HPD credible intervals (light gray) and 95% simultaneous credible bands (dark gray) for the posterior predictive distribution of $\{y_t\}$ under BTF-DHS model (3.8) with D = 2. The proposed estimator, as well as the uncertainty bands, accurately capture both slowly- and rapidly-changing behavior in the underlying functions.

3.4 Root mean squared errors for simulated data with T = 128 and RSNR = 7. The Bayesian trend filtering (BTF) estimators differ in their innovation distributions, which determines the shrinkage behavior of the second order differences (D = 2): normal-inverse-Gamma (NIG), horseshoe (HS), and dynamic horseshoe (DHS).

3.5 Root mean squared error for out-of-sample minute-by-minute CPU usage data. The Bayesian trend filtering (BTF) estimators differ in their innovation distributions, which determines the shrinkage behavior of the second order differences (D = 2): normal-inverse-Gamma (NIG), horseshoe (HS), and dynamic horseshoe (DHS).

3.6 True regression functions $\beta_{j,t}^*$ (black line) and corresponding posterior expectations (cyan), 95% pointwise HPD credible intervals (light gray) and 95% simultaneous credible bands (dark gray) for $\beta_{j,t}$ under the BTF-DHS model given by (3.9) and (3.10) for a simulated data set.

3.7 Root mean squared errors for the regression coefficients, $\beta_{j,t}^*$ (left) and the true curves, $y_t^* = x_t'\beta_t^*$ (right) for simulated data.
3.8 Posterior expectations (cyan), 95% pointwise HPD credible intervals (light gray) and 95% simultaneous credible bands (dark gray) for $\beta_{j,t}$ and $\sigma_t$ (bottom right) under the BTF-DHS model given by (3.9) and (3.10) for value-weighted manufacturing industry returns. The solid black line is zero, the dashed green line is the ordinary linear regression estimate, and the solid red line indicates periods for which the 95% simultaneous credible bands do not contain zero.

3.9 Posterior expectations (cyan), 95% pointwise HPD credible intervals (light gray) and 95% simultaneous credible bands (dark gray) for $\beta_{j,t}$ and $\sigma_t$ (bottom right) under the BTF-DHS model given by (3.9) and (3.10) for value-weighted healthcare industry returns. The solid black line is zero, the dashed green line is the ordinary linear regression estimate, and the solid red line indicates periods for which the 95% simultaneous credible bands do not contain zero.

4.1 Sample paths of $\epsilon_t$ and $Y_t = \mu_t + \mu$ as a function of $\tau$, where $\epsilon_t$ is a Gaussian process with the Matérn correlation function, $\rho = (\rho_1, 0.1)$, $\sigma = 0.01$, and $Y_t$ is generated using the Bimodal-Gaussian FAR(1) kernel, $t = 1, \ldots, T = 50$. The curves are time-ordered by color (from red/orange to blue/violet). Left to right: $\epsilon_t(\tau)$, $\rho_1 = 2.5$; $\epsilon_t(\tau)$, $\rho_1 = 0.5$; $Y_t(\tau)$, $\rho_1 = 2.5$; $Y_t(\tau)$, $\rho_1 = 0.5$. Note that we do not observe $Y_t$ directly, but rather $y_{i,t} = Y_t(\tau_{i,t}) + \nu_{i,t}$, where $\nu_{i,t} \sim N(0, \sigma_\nu^2)$ is measurement error with $\sigma_\nu = \sigma/5 = 0.002$ and $\mathcal{T}_t = \{\tau_{1,t}, \ldots, \tau_{m_t,t}\}$ are the observation points at time $t$.

4.2 $\mathrm{MSFE}_e$ under various designs. Top left: FAR(1), T = 350, sparse-random design with the Linear-u kernel and smooth GP innovations. Top right: FAR(1), T = 50, sparse-random design with the Bimodal-Gaussian kernel and non-smooth GP innovations.
Bottom left: FAR(1), T = 350, sparse-fixed design with the Bimodal-Gaussian kernel and smooth GP innovations. Bottom right: FAR(2), T = 125, sparse-fixed design with Bimodal-Gaussian and Linear-τ kernels and smooth GP innovations. The proposed methods provide superior forecasts and nearly achieve the oracle performance, despite the presence of sparsity.

4.3 $\mathrm{MSE}_{\psi_1}$ under various designs. Top left: FAR(1), T = 350, sparse-random design with the Linear-u kernel and smooth GP innovations. Top right: FAR(1), T = 50, sparse-random design with the Bimodal-Gaussian kernel and non-smooth GP innovations. Bottom left: FAR(1), T = 350, sparse-fixed design with the Bimodal-Gaussian kernel and smooth GP innovations. Bottom right: FAR(2), T = 125, sparse-fixed design with Bimodal-Gaussian and Linear-τ kernels and smooth GP innovations. Estimates of $\psi_1$ are far superior for the proposed methods, including the FAR(p) with model averaging.

4.4 One-step nominal (left) and real (right) yield curve forecasts during 2016. Top: Time series of five (×) and ten (△) year observed maturities with one-step forecasts. Bottom: Observed (points) and forecast (line) curves on 8/2/16, corresponding to the dotted vertical line in the top panels. Posterior means (blue) and 95% pointwise and simultaneous prediction bands (light gray and dark gray, respectively) estimated using 10,000 MCMC simulations after a burn-in of 5,000.

A.1 Pointwise 95% HPD intervals and the posterior mean for $\bar{\mu}_t^{(1)}$, which is the average difference in the PFC log-spectra between the FC and FS trials. The black vertical lines indicate $t^*$.

A.2 Pointwise 95% HPD intervals and the posterior mean for $\bar{\mu}_t^{(2)}$, which is the average difference in the PFC log-spectra between the FC and FS trials. The black vertical lines indicate $t^*$.

A.3 The observed volatility clustering from the yield curve application.
The black lines are the posterior means of the squared residuals from the AR(1) process on the $\omega_{k,t}^{(c)}$ in the common trend hidden Markov model of Section 2.4.1. The red lines are the posterior means of the corresponding volatility estimates $\sigma_{k,(c),t}^2$ discussed in Section 2.4.1.

B.1 Computation time per 1000 MCMC iterations for the Bayesian trend filtering model with dynamic horseshoe innovations (BTF-DHS).

C.1 $\mathrm{MSFE}_e$ (top) and corresponding $\mathrm{MSE}_{\psi_1}$ (bottom) under various designs. Left: FAR(1), T = 50, dense design with the Bimodal-Gaussian kernel and non-smooth GP innovations. Right: FAR(1), T = 350, dense design with the Bimodal-Gaussian kernel and smooth GP innovations. The proposed methods provide superior forecasts and nearly achieve the oracle performance, despite the presence of sparsity.

C.2 The Bimodal-Gaussian kernel, $\psi(\tau, u) \propto \frac{0.75}{\pi(0.3)(0.4)} \exp\{-(\tau - 0.2)^2/(0.3)^2 - (u - 0.3)^2/(0.4)^2\} + \frac{0.45}{\pi(0.3)(0.4)} \exp\{-(\tau - 0.7)^2/(0.3)^2 - (u - 0.8)^2/(0.4)^2\}$, normalized so that $\iint \psi_\ell^2(\tau, u)\, d\tau\, du = 0.8$.

C.3 Traceplot for one-step forecasts for nominal yield curves at selected maturities during 2016.

C.4 Traceplot for one-step forecasts for real yield curves at selected maturities during 2016.

E1 Standardized squared errors and relative absolute errors for smooth (top) and non-smooth (bottom) integrands. The errors are small in magnitude, particularly in the smooth case, and decay quickly for M > 20.

CHAPTER 1
INTRODUCTION

We present Bayesian methodology for modeling functional and time series data.
The methods are broadly applicable for (dependent) functional and time series data, but we focus in particular on the following challenging cases for which existing methods are inadequate:

1. Functional data with additional complex dependence, such as time dependence, contemporaneous dependence, stochastic volatility, covariates, and change points (Chapter 2);
2. Functional data, time series data, or regression functions with local features, such as jumps or rapidly-changing smoothness (Chapter 3); and
3. Forecasting and inference of functional time series data with sparsely or irregularly sampled curves and for curves sampled with non-negligible measurement error (Chapter 4).

A unifying characteristic of the proposed methods is the employment of the dynamic linear model (DLM) framework in new contexts to construct interpretable models and computationally efficient MCMC sampling algorithms. In particular, we develop highly efficient Gibbs sampling algorithms that build upon existing DLM sampling components for large blocks of parameters (e.g., Rue, 2001; Durbin and Koopman, 2002). The novel applications of DLMs include functional dynamic factor models, Bayesian trend filtering models, dynamic shrinkage processes (see Chapter 3), and functional autoregressive models. Importantly, the Bayesian framework permits joint estimation of the model parameters and provides exact inference (up to MCMC error) on specific parameters.

The proposed methodology is motivated by important applications including multi-economy interest rate modeling, nominal and real yield curve forecasting, dynamic extensions of the Fama-French asset pricing model, irregular curve-fitting of CPU usage data, and local field potential brain signals in rats. The methods are evaluated through extensive simulations, and compared to state-of-the-art alternative estimators, with favorable results.

In Chapter 2, we present a Bayesian model for multivariate, dependent functional data, in which we extend DLMs for multivariate time series to the functional data setting. We also develop Bayesian spline theory in a more general constrained optimization framework. The proposed methods identify a time-invariant functional basis for the functional observations, which is smooth and interpretable. We apply the methodology to study the interactions of multi-economy yield curves during the recent global recession, and analyze local field potential brain signals in rats, for which we develop a multivariate functional time series approach for multivariate time-frequency analysis.

In Chapter 3, we propose a novel class of dynamic shrinkage processes for Bayesian time series and regression analysis. We extend a broad class of popular global-local shrinkage priors, such as the horseshoe prior, to the dynamic setting by allowing the local scale parameters to depend on the history of the shrinkage process. We prove that the resulting processes inherit desirable shrinkage behavior from the non-dynamic analogs, but provide additional locally adaptive shrinkage properties. The proposed dynamic shrinkage processes are widely applicable, particularly within the family of dynamic linear models. By expressing dynamic shrinkage processes on the log-scale, we adapt successful techniques from stochastic volatility modeling, and propose a Pólya-Gamma scale mixture representation to produce a highly efficient Gibbs sampling algorithm. We use the proposed processes to produce superior Bayesian trend filtering estimates and posterior credible intervals for irregular curve-fitting of minute-by-minute Twitter CPU usage data, and develop an adaptive time-varying parameter regression model to assess the efficacy of the Fama-French five-factor asset pricing model with momentum added as a sixth factor.
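
To make the idea of a dynamic shrinkage process more concrete, the following minimal Python sketch simulates a dynamic horseshoe process. It assumes a specific form that this introduction does not spell out: the log-variance $h_t = \log(\tau^2 \lambda_t^2)$ follows the AR(1) recursion $h_{t+1} = \mu + \phi(h_t - \mu) + \eta_t$, where $\eta_t$ is the log of a squared standard half-Cauchy variable. Under that assumption, $\phi = 0$ and $\mu = 0$ recover the static horseshoe, whose shrinkage proportion $\kappa_t = 1/(1 + e^{h_t})$ is Beta(1/2, 1/2). The function name and arguments are hypothetical, NumPy is assumed, and Chapter 3 gives the actual construction.

```python
import numpy as np


def simulate_dynamic_horseshoe(T, phi, mu=0.0, seed=0):
    """Simulate h_t and the shrinkage proportions kappa_t = 1 / (1 + exp(h_t)).

    Assumed (illustrative) model: h_{t+1} = mu + phi * (h_t - mu) + eta_t,
    with eta_t = log(lambda^2) and lambda ~ half-Cauchy(0, 1).
    """
    rng = np.random.default_rng(seed)
    # |Cauchy| is half-Cauchy, so 2 * log|Cauchy| is log(lambda^2).
    eta = 2.0 * np.log(np.abs(rng.standard_cauchy(T)))
    h = np.empty(T)
    h[0] = mu + eta[0]
    for t in range(1, T):
        h[t] = mu + phi * (h[t - 1] - mu) + eta[t]
    # Numerically stable evaluation of 1 / (1 + exp(h)).
    kappa = np.exp(-np.logaddexp(0.0, h))
    return h, kappa
```

With `phi=0.0` the draws of `kappa` should look like samples from the U-shaped Beta(1/2, 1/2) density; increasing `phi` toward one makes heavy shrinkage (κ near 1) and light shrinkage (κ near 0) persist over consecutive time points, which is the locally adaptive behavior described above.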

In Chapter 4, we develop a hierarchical Gaussian process model for forecasting and inference of functional time series data. Unlike existing methods, our approach is especially suited for sparsely or irregularly sampled curves and for curves sampled with non-negligible measurement error. The latent process is dynamically modeled as a functional autoregression (FAR) with Gaussian process innovations, with extensions for FAR(p) models with model averaging over the lag p. We propose a fully nonparametric dynamic functional factor model for the dynamic innovation process, with broader applicability and improved computational efficiency over standard Gaussian process models. We prove finite-sample forecasting and interpolation optimality properties of the proposed model, which remain valid with the Gaussian assumption relaxed. Extensive simulations demonstrate substantial improvements in forecasting performance and recovery of the autoregressive surface over competing methods, especially under sparse designs. We apply the proposed methods to forecast nominal and real yield curves using daily U.S. data. Real yields are observed more sparsely than nominal yields, yet the proposed methods are highly competitive in both settings.

CHAPTER 2
A BAYESIAN MULTIVARIATE FUNCTIONAL DYNAMIC LINEAR MODEL

Portions of this chapter were published in Kowal et al. (2016).

2.1 Introduction

We consider a multivariate time series of functional data. Functional data analysis (FDA) methods are widely applicable, including diverse fields such as economics and finance (e.g., Hays et al., 2012); brain imaging (e.g., Staicu et al., 2012); chemometric analysis, speech recognition, and electricity consumption (Ferraty and Vieu, 2006); and growth curves and environmental monitoring (Ramsay and Silverman, 2005).
Methodology for independent and identically distributed (iid) functional data has been well-developed, but in the case of dependent functional data, the iid methods are not appropriate. Such dependence is common, and can arise via multiple responses, temporal and spatial effects, repeated measurements, missing covariates, or simply because of some natural grouping in the data (e.g., Horváth and Kokoszka, 2012). Here, we consider two distinct sources of dependence: time dependence for time-ordered functional observations and contemporaneous dependence for multivariate functional observations.

Suppose we observe multiple functions $Y_t^{(c)}(\tau)$, $c = 1, \ldots, C$, at time points $t = 1, \ldots, T$. Such observations have three dominant features:

(a) For each $c$ and $t$, $Y_t^{(c)}(\tau)$ is a function of $\tau \in \mathcal{T}$;
(b) For each $c$ and $\tau$, $Y_t^{(c)}(\tau)$ is a time series for $t = 1, \ldots, T$; and
(c) For each $t$ and $\tau$, $Y_t^{(c)}(\tau)$ is a multivariate observation with outcomes $c = 1, \ldots, C$.

We assume that $\mathcal{T} \subseteq \mathbb{R}^d$ is compact, and focus on the case $d = 1$ in which $\tau$ is a scalar. However, our approach may be adapted to the more general setting. We consider two diverse applications of multivariate functional time series (MFTS).

Multi-Economy Yield Curves: Let $Y_t^{(c)}(\tau)$ denote multi-economy yield curves observed on weeks $t = 1, \ldots, T$ for economies $c = 1, \ldots, C$, which refer to the Federal Reserve, the Bank of England, the European Central Bank, and the Bank of Canada. For a given currency and level of risk of a debt, the yield curve describes the interest rate as a function of the length of the borrowing period, or time to maturity, $\tau$. Yield curves are important in a variety of economic and financial applications, such as evaluating economic and monetary conditions, pricing fixed-income securities, generating forward curves, computing inflation premiums, and monitoring business cycles (Bolder et al., 2004).
We are particularly interested in the relationships among yield curves for the aforementioned globally-influential economies, and in how these relationships vary over time. However, existing FDA methods are inadequate to model the dynamic dependences among and between the yield curves for different economies, such as contemporaneous dependence, volatility clustering, covariates, and change points. Our approach resolves these inadequacies, and provides useful insights into the interactions among multi-economy yield curves (see Section 2.4.1).

Multivariate Time-Frequency Analysis: For multivariate time series, the periodic behavior of the process is often the primary interest. Time-frequency analysis is used when this periodic behavior varies over time, which requires consideration of both the time and frequency domains (e.g., Shumway and Stoffer, 2000). Typical methods segment the multivariate time series into (overlapping) time bins within which the periodic behavior is approximately stationary; within each bin, standard frequency domain or spectral analysis is performed, which uses the multivariate discrete Fourier transform of the time series to identify dominant frequencies. Interestingly, although the raw signal in this setting is a multivariate time series, time-frequency analysis produces a MFTS: the multivariate discrete Fourier transform is a function of frequency $\tau$ for time bins $t = 1, \ldots, T$, where $c = 1, \ldots, C$ index the multivariate components of the spectrum. We analyze local field potential (LFP) data collected on rats, which measures the neural activity of local brain regions over time (Ljubojevic et al., 2013). Our interest is in the time-dependent periodic behavior of these local brain regions under different stimuli, and in particular the synchronization between brain regions.
Our novel MFTS approach to time-frequency analysis provides the necessary multivariate structure and inference (which is unavailable in standard time-frequency analysis) to precisely characterize brain behavior under certain stimuli (see Section 2.4.2).

To model MFTS, we extend the hierarchical dynamic linear model (DLM) framework of Gamerman and Migon (1993) and West and Harrison (1997) for multivariate time series to the functional data setting. For smooth, flexible, and optimal function estimates, we extend Bayesian spline theory to a more general constrained optimization framework, which we apply for parameter identifiability. Our constraints are explicit in the posterior distribution via appropriate conditioning of the standard Bayesian spline posterior distribution, and the corresponding posterior mean is the solution to an appropriate optimization problem. We implement an efficient Gibbs sampler to obtain samples from the joint posterior distribution, which provides exact (up to MCMC error) inference for any parameters of interest. The proposed hierarchical Bayesian Multivariate Functional Dynamic Linear Model has greater applicability and utility than related methods. It provides flexible modeling of complex dependence structures among the functional observations, such as time dependence, contemporaneous dependence, stochastic volatility, covariates, and change points, and can incorporate application-specific prior information.

This chapter proceeds as follows. In Section 2.2, we present our model in its most general form. We develop our (factor loading) curve estimation technique in Section 2.3. In Section 2.4, we apply our model to the two applications discussed above and interpret the results. We also provide the details of our Gibbs sampling algorithm, present MCMC diagnostics for our applications, and include additional figures in Appendix A.
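
The binning step described above for multivariate time-frequency analysis can be sketched as follows. This is a minimal illustration only: it assumes NumPy, a Hann taper, and a raw periodogram as the spectral estimate, none of which is specified in the text, and the function name `binned_periodograms` is hypothetical. The point is simply that a C-variate time series becomes an array indexed by outcome c, time bin t, and frequency τ, which is the structure of an MFTS.

```python
import numpy as np


def binned_periodograms(x, bin_length, step):
    """Turn a multivariate time series into binned periodograms.

    x          : array of shape (n_samples, C)
    bin_length : number of samples per time bin
    step       : shift between consecutive bins (step < bin_length overlaps)

    Returns (freqs, spectra), where spectra has shape (C, T, n_freq):
    outcome c, time bin t, frequency tau.
    """
    x = np.asarray(x, dtype=float)
    n, C = x.shape
    starts = np.arange(0, n - bin_length + 1, step)
    taper = np.hanning(bin_length)           # illustrative choice of taper
    freqs = np.fft.rfftfreq(bin_length)      # cycles per sample
    spectra = np.empty((C, len(starts), len(freqs)))
    for t, s in enumerate(starts):
        seg = x[s:s + bin_length]
        seg = (seg - seg.mean(axis=0)) * taper[:, None]  # demean, then taper
        dft = np.fft.rfft(seg, axis=0)                   # shape (n_freq, C)
        spectra[:, t, :] = (np.abs(dft) ** 2 / bin_length).T
    return freqs, spectra
```

For example, `binned_periodograms(x, bin_length=100, step=50)` on a series with 1000 samples and two channels returns a `spectra` array of shape `(2, 19, 51)`. Each slice `spectra[c, t, :]` is one discretely observed curve over frequency, so the collection over t and c can be treated as functional observations.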
2.2 A Multivariate Functional Dynamic Linear Model

Suppose we observe functions $Y_t^{(c)} : \mathcal{T} \to \mathbb{R}$ at times $t = 1, \ldots, T$ for outcomes $c = 1, \ldots, C$, where $\mathcal{T} \subseteq \mathbb{R}$ is compact. We refer to the following model as the Multivariate Functional Dynamic Linear Model (MFDLM):

\[
\begin{aligned}
Y_t(\tau) &= F(\tau)\beta_t + \epsilon_t(\tau), & [\epsilon_t(\tau) \mid E_t] &\overset{indep}{\sim} N(0, E_t), \\
\beta_t &= X_t\theta_t + \nu_t, & [\nu_t \mid V_t] &\overset{indep}{\sim} N(0, V_t), \\
\theta_t &= G_t\theta_{t-1} + \omega_t, & [\omega_t \mid W_t] &\overset{indep}{\sim} N(0, W_t),
\end{aligned}
\tag{2.1}
\]

where $Y_t(\tau) = \big(Y_t^{(1)}(\tau), Y_t^{(2)}(\tau), \ldots, Y_t^{(C)}(\tau)\big)'$ is the $C$-dimensional vector of multivariate functional observations at time $t$ evaluated at $\tau \in \mathcal{T}$; $F(\tau)$ is the $C \times KC$ block matrix with $1 \times K$ diagonal blocks $\big[f_1^{(c)}(\tau), f_2^{(c)}(\tau), \ldots, f_K^{(c)}(\tau)\big]$ for $c = 1, \ldots, C$ of factor loading curves evaluated at $\tau \in \mathcal{T}$, with $K$ the number of factors per outcome, and zeros elsewhere; $\beta_t = \big(\beta_{1,t}^{(1)}, \ldots, \beta_{K,t}^{(1)}, \beta_{1,t}^{(2)}, \ldots, \beta_{K,t}^{(C)}\big)'$ is the $KC$-dimensional vector of factors that serve as the time-dependent weights on the factor loading curves; $X_t$ is the known $KC \times p$ matrix of covariates at time $t$, where $p$ is the total number of covariates; $\theta_t$ is the $p$-dimensional vector of regression coefficients associated with $X_t$; $G_t$ is the $p \times p$ evolution matrix of the regression coefficients $\theta_t$ at time $t$; and $\epsilon_t(\tau)$, $\nu_t$, and $\omega_t$ are mutually independent error vectors with variance matrices $E_t$, $V_t$, and $W_t$, respectively. We assume conditional independence of $[\epsilon_t(\tau) \mid E_t]$ over both $t = 1, \ldots, T$ and $\tau \in \mathcal{T}$; however, the latter assumption of independence over $\tau$ may be relaxed. We can immediately obtain a useful submodel of (2.1) by excluding covariates, $X_t = I_{CK \times CK}$, and removing a level of the hierarchy, $V_t = 0_{CK \times CK}$, so that setting $G_t = G$ models $\beta_t$ ($= \theta_t$, almost surely) with a vector autoregression (VAR).

To understand (2.1), first note that the observation level of the model combines the functional component $F(\tau)$ with the multivariate time series component $\beta_t$.
In scalar notation, we can write the observation level as

\[
Y_t^{(c)}(\tau) = \sum_{k=1}^{K} f_k^{(c)}(\tau)\,\beta_{k,t}^{(c)} + \epsilon_t^{(c)}(\tau)
\tag{2.2}
\]

in which $\epsilon_t^{(c)}(\tau)$ are the elements of the vector $\epsilon_t(\tau)$. In our construction, we can always write the observation level of (2.1) as (2.2); simplifications for the other levels will depend on the choice of submodel.

For model identifiability, we require orthonormality of the factor loading curves:

\[
\int_{\tau \in \mathcal{T}} f_k^{(c)}(\tau)\, f_j^{(c)}(\tau)\, d\tau = 1(k = j)
\tag{2.3}
\]

for $k, j = 1, \ldots, K$ and all outcomes $c = 1, \ldots, C$, where $1(\cdot)$ is the indicator function. In addition, to ensure a unique and interpretable ordering of the factors $\beta_{1,t}^{(c)}, \ldots, \beta_{K,t}^{(c)}$ for each outcome $c = 1, \ldots, C$, we order the factor loading curves $f_1^{(c)}, \ldots, f_K^{(c)}$ by decreasing smoothness. We discuss our implementation of these constraints in Sections 2.3.2 and 2.3.3.

There are three primary interpretations of the model, which provide insight into useful extensions and submodels. First, we can view (2.2) as a basis expansion of the functional observations $Y_t^{(c)}$, with a (multivariate) time series model for the basis coefficients $\beta_{k,t}^{(c)}$ to account for the additional dependence structures, such as common trends (see Section 2.4.1), stochastic volatility (see Section 2.4.1), and covariates. Since the identifiability constraint in (2.3) expresses orthonormality with respect to the $L^2$ inner product, we can interpret $\{f_1^{(c)}, \ldots, f_K^{(c)}\}$ as an orthonormal basis for the functional observations $Y_t^{(c)}$. In contrast to common basis expansion procedures that assume the basis functions are known and only the coefficients need to be estimated (e.g., Bowsher and Meeks, 2008), we allow our basis functions $f_k^{(c)}$ to be estimated from the data. As a result, the $f_k^{(c)}$ will be more closely tailored to the data, which reduces the number of functions $K$ needed to adequately fit the data.
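The basis-expansion reading of (2.2) together with the constraint (2.3) can be illustrated numerically. The sketch below is only an illustration under stated assumptions: a single outcome, fixed polynomial curves standing in for the estimated factor loading curves, trapezoid quadrature for the integral in (2.3), and independent standard normal factors in place of any time series model. All variable names are introduced here.

```python
import numpy as np

rng = np.random.default_rng(1)
tau = np.linspace(0.0, 1.0, 201)            # evaluation grid on [0, 1]
w = np.full(tau.size, tau[1] - tau[0])      # trapezoid quadrature weights
w[[0, -1]] *= 0.5

# Orthonormalize three smooth curves with respect to the quadrature
# approximation of the L2 inner product: F' diag(w) F = I.
raw = np.column_stack([np.ones_like(tau), tau, tau ** 2])
Q, _ = np.linalg.qr(np.sqrt(w)[:, None] * raw)
F = Q / np.sqrt(w)[:, None]                 # column k holds f_k on the grid
gram = F.T @ (w[:, None] * F)               # approximates the integrals in (2.3)

# Simulate the observation level (2.2) for one outcome.
T, K = 50, 3
beta = rng.standard_normal((T, K))          # placeholder factors
Y = beta @ F.T + 0.01 * rng.standard_normal((T, tau.size))

# Under orthonormality, the factors are recovered by projecting Y_t onto f_k.
beta_hat = Y @ (w[:, None] * F)
```

Because the curves are orthonormal, each factor is simply the inner product of the observed curve with the corresponding loading curve, which is why (2.3) removes the scale ambiguity between $f_k$ and $\beta_{k,t}$.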
Conditional on the $f_k^{(c)}$, we can specify the $\beta_t$- and $\theta_t$-levels of (2.1) to appropriately model the remaining dependence among the $Y_t^{(c)}$. Using this interpretation, we also note that (2.1) may be described as a multivariate dynamic (concurrent) functional linear model, and therefore extends a highly useful model in FDA (Cardot et al., 1999).

Similarly, we can interpret (2.1) as a dynamic factor analysis, which is a common approach in yield curve modeling (e.g., Hays et al., 2012; Jungbacker et al., 2013). Under this interpretation, the $\beta_{k,t}^{(c)}$ are dynamic factors and the $f_k^{(c)}$ are factor loading curves (FLCs); we will use this terminology for the remainder of the paper. Compared to a standard factor analysis, (2.1) has two major modifications: the factors $\beta_{k,t}^{(c)}$ are dynamic and therefore have an accompanying (multivariate) time series model, and the $f_k^{(c)}$ are functions rather than vectors.

Naturally, (2.1) has strong connections to a hierarchical DLM. Standard hierarchical DLM algorithms for sampling $\beta_t$ and $\theta_t$ assume that $\{F, G_t, X_t, E_t, V_t, W_t\}$ is known (e.g., Durbin and Koopman, 2002; Petris et al., 2009). Within our Gibbs sampler, we may condition on this set of parameters, and then use existing DLM algorithms to efficiently sample $\beta_t$ and $\theta_t$ with minimal implementation effort. Unconditionally, $F$ is unknown, but we impose the necessary identifiability constraints; see Section 2.3 for more details. $G_t$ may be known or unknown depending on the application, but in general it supplies the time series structure of the model (along with the time-dependent error variances): in Section 2.4.1, $G_t = G$ is unknown to allow for data-driven dependence among the multi-economy yield curves, and in Section 2.4.2, $G_t = I_{CK \times CK}$ is chosen to provide parsimonious time-domain smoothing. We assume that $X_t$ is known, and may consist of covariates relevant to each outcome or can be chosen to provide additional shrinkage of $\beta_t$ through $\theta_t$.
Although Gamerman and Migon (1993) suggest that $\dim(\theta_t) < \dim(\beta_t)$ for strict dimension reduction in the hierarchy, we relax this assumption to allow for covariate information. Finally, we treat the error variance matrices as unknown, but typically there are simplifications available depending on the application and model choice. We discuss some examples in Section 2.4.

We must also specify a choice for $K$. In the yield curve application, two natural choices are $K = 3$ and $K = 4$ for comparison with the common parametric yield curve models: the Nelson-Siegel model (Nelson and Siegel, 1987) and the Svensson model (Svensson, 1994), both of which can be expressed as submodels of (2.1); see Diebold and Li (2006) and Laurini and Hotta (2010). More formally, we can treat $K$ as a parameter and estimate it using reversible jump MCMC methods (Green, 1995), or select $K$ using marginal likelihood. In particular, since we employ a Gibbs sampler, the marginal likelihood estimation procedure of Chib (1995) is convenient for many submodels of (2.1). For more complex models, DIC provides a less computationally intensive approach than either reversible jump MCMC or marginal likelihood, and is very simple to compute. In Appendix A, we discuss a fast procedure based on the singular value decomposition from our initialization algorithm which can be used to estimate a range of reasonable values for $K$.

2.3 Estimating the Factor Loading Curves

We would like to model the FLCs $f_k^{(c)}$ in a smooth, flexible, and computationally appealing manner. Clearly, the latter two attributes are important for broader applicability and larger data sets, including larger $T$, larger $C$, and larger $m_t^{(c)}$, where $m_t^{(c)}$ denotes the number of observation points for outcome $c$ at time $t$. The smoothness requirement is fundamental as well: as documented in Jungbacker et al. (2013), smoothness constraints can improve forecasting, despite the small biases imposed by such constraints.
Smooth curves also tend to be more interpretable, since gradual trends are usually easier to explain than sharp changes or discontinuities.

However, there are some additional complications. First, we must incorporate the identifiability constraints, preferably without severely detracting from the smoothness and goodness-of-fit of the FLCs. We also have $K$ curves to estimate for each outcome (or perhaps $K$ curves common to all outcomes; see Section 2.3.4), similar to the varying-coefficients model of Hastie and Tibshirani (1993), conditional on the factors $\beta_{k,t}^{(c)}$. Finally, the observation points for the functions $Y_t^{(c)}$ are likely different for each outcome $c$, and may also vary with time $t$.

2.3.1 Splines

A common approach in nonparametric and semiparametric regression is to express each unknown function $f_k^{(c)}$ as a linear combination of known basis functions, and then estimate the associated coefficients by maximizing a (penalized) likelihood (e.g., Wahba, 1990; Eubank, 1999; Ruppert et al., 2003). We use B-spline basis functions for their numerical properties and easy implementation, but our methods can accommodate other bases as well. For now, we ignore dependence on $c$ for notational convenience; this also corresponds to either the univariate case ($C = 1$) or $C > 1$ with $E_t$ diagonal and the FLCs assumed to be a priori independent for $c = 1, \ldots, C$ (see Section 2.3.4 for an important alternative). Following Wand and Ormerod (2008), we use cubic splines and the knot sequence $a = \kappa_1 = \cdots = \kappa_4 < \kappa_5 < \cdots < \kappa_{M+4} < \kappa_{M+5} = \cdots = \kappa_{M+8} = b$, with $\phi_B = (\phi_1, \ldots, \phi_{M+4})'$ the associated cubic B-spline basis, $M$ the number of interior knots, and $\mathcal{T} = [a, b]$. While we could allow each $f_k$ to have its own B-spline basis and accompanying sequence of knots, there is no obvious reason to do so. In our applications, we use $M = 20$ interior knots. For knot placement, we prefer a quantile-based approach such as the default method described in Ruppert et al.
(2003), which is responsive to the location of observation points in the data yet is computationally inexpensive; however, equally-spaced knots may be preferable in some applications.

Explicitly, we write $f_k(\tau) = \phi_B'(\tau) d_k$, where $d_k$ is the $(M + 4)$-dimensional vector of unknown coefficients. Therefore, the function estimation problem is reduced to a vector estimation problem. In classical nonparametric regression, $d_k$ is estimated by maximizing a penalized likelihood, or equivalently solving

\[
\min_{d_k} \; -2 \log[Y \mid d_k] + \lambda_k P(d_k)
\tag{2.4}
\]

where $[Y \mid d_k]$ is a likelihood, $P$ is a convex penalty function, and $\lambda_k \ge 0$. We express (2.4) as a log-likelihood multiplied by $-2$ so that for a Gaussian likelihood, (2.4) is simply a penalized least squares objective. For greater generality, we leave the likelihood unspecified, but later consider the likelihood of model (2.2). To penalize roughness, a standard choice for $P$ is the squared $L^2$-norm of the second derivative of $f_k$, which can be written in terms of $d_k$:

\[
P(d_k) = \int_{\tau \in \mathcal{T}} \big[\ddot{f}_k(\tau)\big]^2 \, d\tau = d_k' \Omega_\phi d_k
\tag{2.5}
\]

where $\ddot{f}_k$ denotes the second derivative of $f_k$ and $\Omega_\phi = \int_{\mathcal{T}} \ddot{\phi}_B(\tau) \ddot{\phi}_B'(\tau) \, d\tau$, which is easily computable for B-splines. With this choice of penalty, (2.4) balances goodness-of-fit with smoothness, where the trade-off is determined by $\lambda_k$. Since $P$ is a quadratic in $d_k$, for fixed $\lambda_k$, (2.4) is straightforward to solve for many likelihoods, in particular a Gaussian likelihood. Letting $\bar{d}_k$ be this solution, we can estimate $f_k(\tau)$ for any $\tau \in \mathcal{T}$ with $\hat{f}_k(\tau) = \phi_B'(\tau) \bar{d}_k$. For a general knot sequence, the resulting estimator $\hat{f}_k$ is an O'Sullivan spline, or O-spline, introduced by O'Sullivan (1986) and explored in Wand and Ormerod (2008). In the special case of univariate nonparametric regression in which there is a knot at every observation point, $\hat{f}_k$ is a natural cubic smoothing spline (e.g., Green and Silverman, 1993). Alternatively, if we choose a sparser sequence of knots and set $\lambda_k = 0$, $\hat{f}_k$ is a regression spline (e.g., Ramsay and Silverman, 2005).
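To make the Gaussian case of (2.4) concrete, consider the simplest setting of a single curve observed at points $\tau_1, \ldots, \tau_n$ with independent Gaussian errors of known variance $\sigma^2$. The symbols $y$, $\Phi$, and $n$ below are notation introduced here for illustration and are not used elsewhere in the text; the closed form assumes the matrix being inverted is nonsingular.

```latex
% Assumed setup: y = (Y(\tau_1), \ldots, Y(\tau_n))' and \Phi is the
% n x (M+4) matrix whose i-th row is \phi_B'(\tau_i).
\begin{align*}
-2 \log[Y \mid d_k] + \lambda_k P(d_k)
  &= \sigma^{-2} \lVert y - \Phi d_k \rVert^2
     + \lambda_k \, d_k' \Omega_\phi d_k + \text{const}, \\
\bar{d}_k
  &= \big( \sigma^{-2} \Phi'\Phi + \lambda_k \Omega_\phi \big)^{-1}
     \sigma^{-2} \Phi' y
   = \big( \Phi'\Phi + \sigma^2 \lambda_k \Omega_\phi \big)^{-1} \Phi' y .
\end{align*}
```

Setting $\lambda_k = 0$ recovers the ordinary least squares fit of the regression spline, while letting $\lambda_k$ grow shrinks $\hat{f}_k$ toward the functions left unpenalized by (2.5), namely those with zero second derivative.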
O-splines are numerically stable, possess natural boundary properties, and can be computed efficiently (cf. Wand and Ormerod, 2008).

2.3.2 Bayesian Splines

Splines also have a convenient Bayesian interpretation (e.g., Wahba, 1978, 1983, 1990; Gu, 1992; Van der Linde, 1995; Berry et al., 2002). Returning to (2.4), we notably have a likelihood term and a penalty term, where the penalty is a function of only the vector of coefficients $d_k$ and known quantities. Therefore, conditional on $\lambda_k$, the term $\lambda_k P(d_k)$ provides prior information about $d_k$, for example that $f_k = \phi_B' d_k$ is smooth. Under this general interpretation, (2.4) combines the prior information with the likelihood to obtain an estimate of $d_k$. A natural Bayesian approach is therefore to construct a prior for $d_k$ based on the penalty $P$, in particular so that the posterior mode of $d_k$ is the solution to (2.4). For the most common settings in which the likelihood is Gaussian and the penalty $P$ is (2.5), the posterior distribution of $d_k$ will be Gaussian, so the posterior mean will also solve (2.4).

To construct a prior from $P$, it is computationally and conceptually convenient to reparameterize $d_k$ so that the penalty matrix $\Omega_\phi$ is diagonal. Under a Gaussian prior, this corresponds to prior independence of the components of $d_k$. The reparameterization will also affect the basis $\phi_B$, but otherwise will leave the likelihood in (2.4) unchanged. Following Wand and Ormerod (2008), let $\Omega_\phi = U_\Omega D_\Omega U_\Omega'$ be the singular value decomposition of $\Omega_\phi$, where $U_\Omega' U_\Omega = I_{(M+4) \times (M+4)}$ and $D_\Omega$ is a diagonal matrix with $M + 2$ positive components. Denote the diagonal matrix of these positive entries by $D_{\Omega,P}$ and let $U_{\Omega,P}$ be the corresponding $(M + 4) \times (M + 2)$ submatrix of $U_\Omega$. Using the reparameterized basis $\phi'(\tau) = \big[1, \tau, \phi_B'(\tau) U_{\Omega,P} D_{\Omega,P}^{-1/2}\big]$ and penalty $d_k' \Omega_D d_k$ with $\Omega_D = \mathrm{diag}(0, 0, \lambda_k, \ldots, \lambda_k)$, the new solution $\hat{d}_k$ to (2.4) satisfies $\hat{f}_k(\tau) = \phi_B'(\tau) \bar{d}_k = \phi'(\tau) \hat{d}_k$; see Wand and Ormerod (2008) for more details.
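The effect of the reparameterization on the penalty can be checked numerically. The sketch below is an illustration under a stated substitution: instead of the exact B-spline matrix $\Omega_\phi$, it uses a discrete second-difference penalty, which shares the relevant structure (positive semidefinite, with a two-dimensional null space of constant and linear sequences). The dimension `n` and all variable names are assumptions introduced here.

```python
import numpy as np

n = 12                                    # plays the role of M + 4
D2 = np.diff(np.eye(n), n=2, axis=0)      # (n - 2) x n second-difference operator
Omega = D2.T @ D2                         # stand-in penalty matrix, rank n - 2

# For a symmetric positive semidefinite matrix, the eigendecomposition
# coincides with the singular value decomposition U D U'.
evals, U = np.linalg.eigh(Omega)
pos = evals > 1e-10 * evals.max()         # the n - 2 positive components
U_P, D_P = U[:, pos], evals[pos]

Z = U_P / np.sqrt(D_P)                    # U_P D_P^{-1/2}, rescales penalized directions
penalized_block = Z.T @ Omega @ Z         # identity: penalty is diagonal with unit weights
null_block = U[:, ~pos].T @ Omega @ U[:, ~pos]   # zero: unpenalized directions
```

After this change of basis the roughness penalty reduces to a sum of squares of the penalized coefficients, so multiplying by $\lambda_k$ gives exactly the diagonal form $\mathrm{diag}(0, 0, \lambda_k, \ldots, \lambda_k)$ used in the text, with the two zeros corresponding to the unpenalized directions.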
It is therefore natural to use the prior $d_k \sim N(0, D_k)$, where $D_k = \mathrm{diag}\big(10^8, 10^8, \lambda_k^{-1}, \ldots, \lambda_k^{-1}\big)$ and $\lambda_k > 0$, which satisfies $D_k^{-1} \approx \Omega_D$. Notably, this prior is proper, yet is diffuse over the space of constant and linear functions, which are unpenalized by $P$. This reparameterization is a common approach for fitting splines using mixed effects model software (e.g., Ruppert et al., 2003).

Since we assume conditional independence between levels of (2.1), our conditional likelihood for the FLCs is simply that of model (2.2), but we ignore dependence on $c$ for now:

\[
Y_t(\tau) = \sum_{k=1}^{K} \beta_{k,t} f_k(\tau) + \epsilon_t(\tau) = \sum_{k=1}^{K} \beta_{k,t}\, \phi'(\tau) d_k + \epsilon_t(\tau)
\tag{2.6}
\]

where $\epsilon_t(\tau) \overset{iid}{\sim} N(0, \sigma^2)$ for simplicity; the results are similar for more sophisticated error variance structures. In particular, (2.6) describes the distribution of the functional data $Y_t$ given the FLCs $f_k$ (or $d_k$), also conditional on $\beta_{k,t}$ and $\sigma^2$.

Under the likelihood of model (2.6) and the reparameterized (approximate) penalty $d_k' D_k^{-1} d_k$, the solution to (2.4) conditional on $d_j$, $j \ne k$, is given by $\hat{d}_k = B_k b_k$, where

\[
B_k^{-1} = D_k^{-1} + \sigma^{-2} \sum_{t=1}^{T} \beta_{k,t}^2 \sum_{\tau \in \mathcal{T}_t} \phi(\tau)\phi'(\tau),
\qquad
b_k = \sigma^{-2} \sum_{t=1}^{T} \beta_{k,t} \sum_{\tau \in \mathcal{T}_t} \Big[ Y_t(\tau) - \sum_{j \ne k} \beta_{j,t} f_j(\tau) \Big] \phi(\tau),
\]

and $\mathcal{T}_t \subseteq \mathcal{T}$ denotes the discrete set of $|\mathcal{T}_t| = m_t$ observation points for $Y_t$ at time $t$. Note that if $\mathcal{T}_t = \mathcal{T}_1$ for $t = 2, \ldots, T$, then $B_k$ and $b_k$ may be rewritten more conveniently in vector notation. Most importantly for our purposes, under the same likelihood induced by (2.6) and the prior $d_k \sim N(0, D_k)$, the posterior distribution of $d_k$ is multivariate Gaussian with mean $\hat{d}_k$ and variance $B_k$. For convenient computations, Wand and Ormerod (2008) provide an exact construction of $\Omega_\phi$ and suggest efficient algorithms for $\hat{d}_k$ based on the Cholesky decomposition; we provide more details in Appendix A.

To identify the ordering of the factors and FLCs in (2.2), we constrain the smoothing parameters $\lambda_1 > \lambda_2 > \cdots > \lambda_K > 0$.
While other model constraints are available, this ordering constraint is particularly appealing: it sorts the FLCs $f_k$ by decreasing smoothness, as characterized by the penalty function $P$, and leads to a convenient prior distribution on the smoothing parameters $\lambda_k$. In the Bayesian setting, the smoothing parameters are equivalently the prior precisions of the penalized (nonlinear) components of $d_k$. Letting $d_{k,j}$ denote the $j$th component of $d_k$, the prior on the FLC basis coefficients is $d_{k,j} \overset{iid}{\sim} N(0, \lambda_k^{-1})$ for $j = 3, \ldots, M + 4$. This is similar to the hierarchical setting of Gelman (2006), in which there are $M + 2$ groups for each $\lambda_k$, $k = 1, \ldots, K$. Since $M + 2$ is typically large, we follow the Gelman (2006) recommendation to place uniform priors on the group standard deviations $\lambda_k^{-1/2}$, $k = 1, \ldots, K$. Incorporating the ordering constraint, the conditional priors are $\lambda_k^{-1/2} \sim \mathrm{Uniform}(\ell_k, u_k)$, where $\ell_1 = 0$, $\ell_k = \lambda_{k-1}^{-1/2}$ for $k = 2, \ldots, K$, $u_k = \lambda_{k+1}^{-1/2}$ for $k = 1, \ldots, K - 1$, and $u_K = 10^4$. The upper bound on $\lambda_K^{-1/2}$, and therefore all $\lambda_k^{-1/2}$, is chosen to equal the diffuse prior standard deviation of $d_{k,1}$ and $d_{k,2}$. The full conditional distributions of the smoothing parameters $\lambda_k$ are $\mathrm{Gamma}\big(\tfrac{1}{2}(M + 1), \tfrac{1}{2}\sum_{j=3}^{M+4} d_{k,j}^2\big)$ truncated to $(u_k^{-2}, \ell_k^{-2})$ for $k = 1, \ldots, K$, where we define $\ell_1^{-2} = \infty$. Notably, we avoid the diffuse Gamma prior on $\lambda_k$, which can be undesirably informative and is strongly discouraged by Gelman (2006). More generally, our approach provides a natural and data-driven method for estimating the smoothing parameters, yet does not inhibit inference. Details on the sampling of $\lambda_k$ are provided in Appendix A.

2.3.3 Constrained Bayesian Splines

We extend the Bayesian spline approach to accommodate the necessary identifiability constraints for the MFDLM. For each $k = 1, \ldots, K$, we impose the orthonormality constraints $\int_{\mathcal{T}} f_k(\tau) f_j(\tau)\, d\tau = 1(k = j)$ for $j = 1, \ldots, K$.
The unit-norm constraint preserves identifiability with respect to scaling, i.e., relative to the factors $\beta_{k,t}$ (up to changes in sign). The orthogonality constraints distinguish between pairs of FLCs, and in our approach identify the FLCs with distinct posterior distributions.

While other identifiability constraints are available for the $f_k$, orthonormality is appealing for a number of reasons. As discussed in Section 2.2, the orthonormality constraints suggest that we can interpret $\{f_1, \ldots, f_K\}$ as an orthonormal basis for the functional observations $Y_t$. As such, the orthogonality constraints help eliminate any information overlap between FLCs, which keeps the total number of necessary FLCs to a minimum. Furthermore, the unit norm constraint allows for easier comparisons among the $f_k$. Of course, the $f_k$ will be weighted by the factors $\beta_{k,t}$, so they can still have varying effects on the conditional mean of $Y_t$ in (2.2). Finally, we can write the constraints conveniently in terms of the vectors $d_k$ and $d_j$:

\[
\int_{\tau \in \mathcal{T}} f_k(\tau) f_j(\tau)\, d\tau = \int_{\tau \in \mathcal{T}} \phi'(\tau) d_k \, \phi'(\tau) d_j \, d\tau = d_k' J_\phi d_j = 1(k = j)
\tag{2.7}
\]

for $j = 1, \ldots, K$, where $J_\phi = \int_{\tau \in \mathcal{T}} \phi(\tau)\phi'(\tau)\, d\tau$ is easily computed for B-splines, and only needs to be computed once, prior to any MCMC sampling.

The addition of an orthogonality constraint to a (penalized) least squares problem has an intuitive regression-based interpretation, which we present in the following theorem:

Theorem 2.1. Consider the penalized least squares objective $\sigma^{-2} \sum_{i=1}^{n} (y_i - X_i' d)^2 + \lambda d' \Omega d$, where $y_i \in \mathbb{R}$, $d$ is an unknown $(M + 4)$-dimensional vector, $X_i$ is a known $(M + 4)$-dimensional vector, $\Omega$ is a known $(M + 4) \times (M + 4)$ positive-definite matrix, and $\sigma^2, \lambda > 0$ are known scalars. The solution is $\hat{d} = B b$, where $B^{-1} = \lambda \Omega + \sigma^{-2} \sum_{i=1}^{n} X_i X_i'$ and $b = \sigma^{-2} \sum_{i=1}^{n} X_i y_i$. Now consider the same objective, but subject to the $J$ linear constraints $d' L = 0$ for $L$ a known $(M + 4) \times J$ matrix of rank $J$.
The solution is $\tilde{d} = B \tilde{b}$, where $\tilde{b}$ is the vector of residuals from the generalized least squares regression $b = L\Lambda + \delta$ with $E(\delta) = 0$ and $\mathrm{Var}(\delta) = B^{-1}$ (equivalently, with weight matrix $B$).

Proof. The optimality of $\hat{d}$ is a well-known result. For the constrained case, the Lagrangian is $\mathcal{L}(d, \Lambda) = \sigma^{-2} \sum_{i=1}^{n} (y_i - X_i' d)^2 + \lambda d' \Omega d + d' L \Lambda$, where $\Lambda$ is the $J$-dimensional vector of Lagrange multipliers associated with the $J$ linear constraints. It is straightforward to minimize $\mathcal{L}(d, \Lambda)$ with respect to $d$ and obtain the solution $\tilde{d} = B \tilde{b} = B(b - L\Lambda)$, where a constant factor has been absorbed into $\Lambda$. Similarly, solving $\nabla \mathcal{L}(\tilde{d}, \Lambda) = 0$ for $\Lambda$ implies that $\Lambda = (L' B L)^{-1} L' B b$, which is the solution to the generalized least squares regression of $b$ on $L$ with weight matrix $B$.

The result is interpretable: to incorporate linear constraints into a penalized least squares regression, we find $\tilde{b}$ nearest to $b$ under the inner product induced by $B$ among vectors in the space orthogonal to $\mathrm{Col}(L)$. In our setting, extending (2.4) under a Gaussian likelihood to accommodate the (linear) orthogonality constraints $d_k' J_\phi d_j = 0$ for $j \ne k$ may be described via a regression of the unconstrained solution on the constraints. However, the unit norm constraint is nonlinear. This constraint affects the scaling but not the shape of $f_k$. Therefore, a reasonable approach is to construct a posterior distribution for $d_k$ that respects the (linear) orthogonality constraints only, and then normalize the samples from this posterior to preserve identifiability. We provide more details in Appendix A.

To extend the unconstrained Bayesian splines of Section 2.3.2 to incorporate the orthogonality constraints, we write the constraints $d_k' J_\phi d_j = 0$ for $j \ne k$ as the linear constraints in Theorem 2.1 with $L_{[-k]} = (J_\phi d_1, \ldots, J_\phi d_{k-1}, J_\phi d_{k+1}, \ldots, J_\phi d_K)$ and $J = K - 1$.
Using the full conditional posterior distribution $d_k \sim N(B_k b_k, B_k)$ from Section 2.3.2, we can additionally condition on the linear constraints $d_k' L_{[-k]} = 0$, and obtain the constrained full conditional distribution $d_k \sim N(\tilde{B}_k b_k, \tilde{B}_k)$, where $\tilde{B}_k = B_k - B_k L_{[-k]} \big(L_{[-k]}' B_k L_{[-k]}\big)^{-1} L_{[-k]}' B_k$. Conditioning on the orthogonality constraints is particularly interpretable in the Bayesian setting, and is convenient for posterior sampling; see Appendix A for more details. By comparison, Theorem 2.1 implies that the solution to (2.4) under the likelihood of model (2.6), the penalty $d_k' D_k^{-1} d_k$, and subject to the linear constraints $d_k' L_{[-k]} = 0$ is given by $\tilde{d}_k = B_k \tilde{b}_k$, where $\tilde{b}_k = b_k - L_{[-k]} \Lambda_{[-k]}$ and $\Lambda_{[-k]} = \big(L_{[-k]}' B_k L_{[-k]}\big)^{-1} L_{[-k]}' B_k b_k$. Notably, $\tilde{B}_k b_k = B_k \tilde{b}_k = \tilde{d}_k$, which is a useful result: by simply conditioning on the linear orthogonality constraints in the full conditional Gaussian distribution for $d_k$, the posterior mean of the resulting Gaussian distribution solves the constrained regression problem of Theorem 2.1. In this sense, the identifiability constraints on $f_k$ are enforced optimally.

2.3.4 Common Factor Loading Curves for Multivariate Modeling

Reintroducing dependence on $c$ for the FLCs $f_k^{(c)}$, suppose that $C > 1$, so that our functional time series $Y_t^{(c)}$ is truly multivariate. If we wish to estimate a priori independent FLCs for each outcome $c$ (with $E_t$ diagonal), then we can sample from the relevant posterior distributions independently for $c = 1, \ldots, C$ using the methods of Section 2.3.3. The more interesting case is the common factor loading curves model given by $f_k^{(c)} = f_k$, so that all outcomes share a common set of FLCs. In the basis interpretation of the MFDLM, this corresponds to the assumption that the functional observations for all outcomes $Y_t^{(c)}$, $c = 1, \ldots, C$, $t = 1, \ldots, T$ share a common basis.
We find this approach to be useful and intuitive, since it pools information across outcomes and suggests a more parsimonious model. Equally important, the common FLCs approach allows for direct comparison between factors $\beta_{k,t}^{(c)}$ and $\beta_{k,t}^{(c')}$ for outcomes $c$ and $c'$, since these factors serve as weights on the same FLC (or basis function) $f_k$. We use this model in both applications in Section 2.4.

The common FLCs model implies $f_k^{(c)}(\tau) = \phi^{(c)\prime}(\tau) d_k^{(c)} = f_k(\tau)$. However, since the FLCs for each outcome are identical, it is reasonable to assume that they have the same vector of basis functions $\phi$, so $f_k^{(c)} = f_k$ is equivalent to $d_k^{(c)} = d_k$. Moreover, by writing $f_k^{(c)}(\tau) = \phi'(\tau) d_k$, we can use all of the observation points across all outcomes $c = 1, \ldots, C$ and times $t = 1, \ldots, T$, yet the parameter of interest, $d_k$, will only be $(M + 4)$-dimensional.

Modifying our previous approach, we use the likelihood of model (2.2) with the simple error distribution $\epsilon_t^{(c)}(\tau) \overset{iid}{\sim} N(0, \sigma_{(c)}^2)$. The implied full conditional posterior distribution for $d_k$ is again $N(\tilde{B}_k b_k, \tilde{B}_k)$, but now with

\[
B_k^{-1} = D_k^{-1} + \sum_{c=1}^{C} \sigma_{(c)}^{-2} \sum_{t \in T^{(c)}} \big(\beta_{k,t}^{(c)}\big)^2 \sum_{\tau \in \mathcal{T}_t^{(c)}} \phi(\tau)\phi'(\tau)
\]

and

\[
b_k = \sum_{c=1}^{C} \sigma_{(c)}^{-2} \sum_{t \in T^{(c)}} \beta_{k,t}^{(c)} \sum_{\tau \in \mathcal{T}_t^{(c)}} \Big[ Y_t^{(c)}(\tau) - \sum_{j \ne k} \beta_{j,t}^{(c)} f_j(\tau) \Big] \phi(\tau).
\]

For full generality, we allow the (discrete) set of times $T^{(c)}$ to vary for each outcome $c$ and the (discrete) set of observation points $\mathcal{T}_t^{(c)}$ to vary with both time $t$ and outcome $c$, with $|\mathcal{T}_t^{(c)}| = m_t^{(c)}$. Note that we reuse the same notation from Section 2.3.3 to emphasize the similarity of the multivariate results to the univariate (or a priori independent FLC) results. The common notation also allows for a more concise description of the sampling algorithm, which we present in Appendix A.
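The identity $\tilde{B}_k b_k = B_k \tilde{b}_k$ from Section 2.3.3, which also underlies the common FLCs update above, can be verified numerically. The sketch below is a check under stated assumptions, not the sampler of Appendix A: the design, penalty, and constraint matrices are random placeholders, the dimensions are arbitrary, and the projection used to draw from the constrained Gaussian is one standard construction that may differ from the dissertation's own algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)
p, J, n = 8, 2, 40                       # p plays the role of M + 4, J of K - 1

X = rng.standard_normal((n, p))          # rows are the vectors X_i'
y = rng.standard_normal(n)
sigma2, lam = 0.5, 2.0
A = rng.standard_normal((p, p))
Omega = A @ A.T + np.eye(p)              # positive-definite penalty matrix
L = rng.standard_normal((p, J))          # columns stand in for J_phi d_j, j != k

B = np.linalg.inv(lam * Omega + X.T @ X / sigma2)
B = (B + B.T) / 2                        # symmetrize against round-off
b = X.T @ y / sigma2

# Theorem 2.1: regress b on L with weight matrix B and keep the residuals.
Lam = np.linalg.solve(L.T @ B @ L, L.T @ B @ b)
d_tilde = B @ (b - L @ Lam)

# Bayesian route: condition N(Bb, B) on the linear constraints L'd = 0.
B_tilde = B - B @ L @ np.linalg.solve(L.T @ B @ L, L.T @ B)
post_mean = B_tilde @ b

# One draw from the constrained Gaussian, obtained by projecting an
# unconstrained draw onto the constraint set.
draw = rng.multivariate_normal(B @ b, B)
draw_c = draw - B @ L @ np.linalg.solve(L.T @ B @ L, L.T @ draw)
```

Both routes give the same vector, and every projected draw satisfies the orthogonality constraints exactly (up to round-off), so only the unit-norm rescaling remains to be applied to each sample.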
2.4 Data Analysis and Results

2.4.1 Multi-Economy Yield Curves

We jointly analyze weekly yield curves provided by the Federal Reserve (Fed), the Bank of England (BOE), the European Central Bank (ECB), and the Bank of Canada (BOC; Bolder et al., 2004) from late 2004 to early 2014 ($T = 490$ and $C = 4$). These data are publicly available and published on the respective central bank websites, and as such, we treat them as reliable estimates of the yield curves. For each outcome, the yield curves are estimated differently: the Fed uses quasi-cubic splines, the BOE uses cubic splines with variable smoothing parameters (Waggoner, 1997), the ECB uses Svensson curves, and the BOC uses exponential splines (Li et al., 2001). Therefore, the functional observations have already been smoothed, although by different procedures. The available set of maturities $\mathcal{T}_t^{(c)}$ is not the same across economies $c$, and occasionally varies with time $t$. The most frequent values of $m_t^{(c)}$, $t = 1, \ldots, T$, are 11 (Fed), 100 (BOE), 354 (ECB), and 120 (BOC), with maturities $\tau$ ranging from 1-3 months up to 300-360 months. To facilitate a simpler analysis, we let $Y_t^{(c)}(\tau)$ be the week-to-week change in the $c$th central bank yield curve on week $t$ for maturity $\tau$. Differencing the yield curves conveniently addresses the nonstationarity in the weekly data, and, because the yield curves are pre-smoothed, does not introduce any notable difficulties with time-varying observation points. We show an example of the multi-economy yield curves observed at adjacent times on July 29, 2011 and August 5, 2011, as well as the corresponding one-week change in Figure 2.1.

The literature on yield curve modeling is extensive. Yield curve models commonly adopt the Nelson-Siegel parameterization (Nelson and Siegel, 1987), often within a state space framework (e.g., Diebold and Li, 2006; Diebold et al., 2006, 2008; Koopman et al., 2010).
Many Bayesian models also use the Nelson-Siegel or Svensson parameterizations (e.g., Laurini and Hotta, 2010; Cruz-Marcelo et al., 2011). However, the Nelson-Siegel parameterization does not extend to other applications, and often requires solving computationally intensive nonlinear optimization problems. Alternatively, Chib and Ergashev (2009) develop an arbitrage-free affine term structure model, which is similarly cast in a Bayesian state space framework.

[Figure 2.1 image: two panels, "Multi-Economy Yield Curves on 2011-07-29 and 2011-08-05" (y-axis: Yield (%)) and "Change in Multi-Economy Yield Curves" (y-axis: Change in Yield (%)); x-axis: Maturity (months); legend: Fed, BOE, ECB, BOC.]

Figure 2.1: Multi-economy yield curves from July 29, 2011 (solid) and August 5, 2011 (dashed), together with the corresponding one-week change curves.

More similar to our approach are the Functional Dynamic Factor Model (FDFM) of Hays et al. (2012) and the Smooth Dynamic Factor Model (SDFM) of Jungbacker et al. (2013), both of which feature nonparametric functional components within a state space framework. The FDFM cleverly uses an EM algorithm to jointly estimate the functional and time series components of the model. However, the EM algorithm makes more sophisticated (multivariate) time series models more challenging to implement, and introduces some difficulties with generalized cross-validation (GCV) for estimation of the nonparametric smoothing parameters. The SDFM avoids GCV and instead relies on hypothesis tests to select the number and location of knots, and therefore determine the smoothness of the curves. However, this suggests that the smoothness of the curves depends on the significance levels used for the hypothesis tests, of which there can be a substantial number as $m_t^{(c)}$, $C$, or $T$ grow large.
By comparison, our smoothing parameters naturally depend on the data through the posterior distribution, which notably does not create any difficulties for inference.

The multi-economy yield curves application is a natural setting for the common FLCs model of Section 2.3.4. First, since $f_k^{(c)} = f_k$ for $c = 1, \ldots, C$, the functional component of the MFDLM is the same for all economies, which helps reconcile the aforementioned different central bank yield curve estimation techniques. More specifically, the conditional expectations $\mu_t^{(c)}(\tau) \equiv \sum_{k=1}^{K} \beta_{k,t}^{(c)} f_k(\tau)$ are linear combinations of the same $\{f_1, \ldots, f_K\}$, and therefore are more directly comparable for $c = 1, \ldots, C$. Second, the common FLCs model is very useful when the set of observed maturities $\mathcal{T}_t^{(c)}$ varies with either outcome $c$ or time $t$. Since the $f_k$ are estimated using all of the observed maturities $\cup_{t,c} \mathcal{T}_t^{(c)}$, we notably do not need a missing data model for unobserved maturities at time $t$ for economy $c$. In addition, for any $\tau \in \operatorname{int}\, \operatorname{range}\big(\cup_{t,c} \mathcal{T}_t^{(c)}\big)$, we may estimate $f_k(\tau)$ and $\mu_t^{(c)}(\tau)$ without any spline-related boundary problems, even when $\tau \notin \operatorname{range}\big(\mathcal{T}_t^{(c)}\big)$. By comparison, non-common FLCs (or more generally, any linear combination of outcome-specific natural cubic splines) would impose a linear fit for $\tau \notin \operatorname{range}\big(\mathcal{T}_t^{(c)}\big)$, which may not be reasonable for some applications.

The Common Trend Model

To investigate the similarities and relationships among the $C = 4$ economy yield curves, we implement the following parsimonious model for multivariate dependence among the factors:

\[
\begin{aligned}
\beta_{k,t}^{(1)} &= \omega_{k,t}^{(1)} \\
\beta_{k,t}^{(c)} &= \gamma_k^{(c)} \beta_{k,t}^{(1)} + \omega_{k,t}^{(c)}, \qquad c = 2, \ldots, C
\end{aligned}
\tag{2.8}
\]

where $\gamma_k^{(c)} \in \mathbb{R}$ is the economy-specific slope term for each factor with the diffuse conjugate prior $\gamma_k^{(c)} \overset{iid}{\sim} N(0, 10^8)$.
For the errors $\omega_{k,t}^{(c)}$, we use independent AR($r$) models with time-dependent variances, which we discuss in more detail in Section 2.4.1. We also implement an interesting extension of (2.8) based on the autoregressive regime switching models of Albert and Chib (1993) and McCulloch and Tsay (1993) using the model $\beta_{k,t}^{(c)} = s_{k,t}^{(c)} \big(\gamma_k^{(c)} \beta_{k,t}^{(1)}\big) + \omega_{k,t}^{(c)}$, where $\{s_{k,t}^{(c)} : t = 1, \ldots, T\}$ is a discrete Markov chain with states $\{0, 1\}$. While this more complex model is not supported by DIC, it is a useful example of the flexibility of the MFDLM; we provide the details in Appendix A.

Letting $c = 1$ correspond to the Fed yield curve, we can use (2.8) to investigate how the factors $\beta_{k,t}^{(c)}$ for each economy $c > 1$ are directly related to those of the Fed, $\beta_{k,t}^{(1)}$. Since the U.S. economy is commonly regarded as a dominant presence in the global economy (e.g., Dées and Saint-Guilhem, 2011), the Fed yield curve is a natural and interesting reference point. Model (2.8) relates each economy $c > 1$ to the Fed using a regression framework, in which we regress $\beta_{k,t}^{(c)}$ on $\beta_{k,t}^{(1)}$ with AR($r$) errors; since the yield curves were differenced, there is no need (or evidence) for an intercept. The slope parameters $\gamma_k^{(c)}$ measure the strength of this relationship for each factor $k$ and economy $c$. In addition, we can investigate the residuals $\omega_{k,t}^{(c)}$ to determine times $t$ for which $\beta_{k,t}^{(c)}$ deviated substantially from the linear dependence on $\beta_{k,t}^{(1)}$ assumed in model (2.8). Such periods of uncorrelatedness can offer insight into the interactions between the U.S. and other economies.

Stochastic Volatility Models

For the errors $\omega_{k,t}^{(c)}$ in (2.8), we use independent AR($r$) models with time-dependent variances, i.e., $\omega_{k,t}^{(c)} = \sum_{i=1}^{r} \psi_{k,i}^{(c)} \omega_{k,t-i}^{(c)} + \sigma_{k,(c),t}\, z_{k,t}^{(c)}$ with $z_{k,t}^{(c)} \overset{iid}{\sim} N(0, 1)$, $c = 1, \ldots, C$. The AR($r$) specification accounts for the time dependence of the yield curves, while the $\sigma_{k,(c),t}^2$ model the observed volatility clustering.
The volatility component is important: in applications of financial time series, it is very common, and often necessary for proper inference, to include a model for the volatility (e.g., Taylor, 1994; Harvey et al., 1994). It is reasonable to suppose that applications of financial functional time series may also require volatility modeling; the weekly yield curve data provide one such example. Notably, our hierarchical Bayesian approach seamlessly incorporates volatility modeling, since, conditional on the volatilities, DLM algorithms require no additional adjustments for posterior sampling.

Within the Bayesian framework of the MFDLM, it is most natural to use a stochastic volatility model (e.g., Kim et al., 1998; Chib et al., 2002). Stochastic volatility models are parsimonious, which is important in hierarchical modeling, yet are highly competitive with more heavily parameterized GARCH models (Daníelsson, 1998). We model the log-volatility, $\log(\sigma^2_{k,(c),t})$, as a stationary AR(1) process (for fixed $c$ and $k$), using the priors and the efficient MCMC sampler of Kastner and Frühwirth-Schnatter (2014). We provide a plot of the volatilities $\sigma^2_{k,(c),t}$ and additional model details in Appendix A.

Results

We fit model (2.8) to the multi-economy yield curve data, using the Kastner and Frühwirth-Schnatter (2014) model for the volatilities and setting $r = 1$, which adequately models the time dependence of the factors, with the diffuse stationarity prior $\psi_{k,1}^{(c)} \overset{iid}{\sim} N(0, 10^8)$ truncated to $(-1, 1)$. We use the common FLCs model of Section 2.3.4, and let $E_t = \mathrm{diag}(\sigma^2_{(1)}, \ldots, \sigma^2_{(C)})$ with $\sigma^{-2}_{(c)} \overset{iid}{\sim} \mathrm{Gamma}(0.001, 0.001)$. We prefer the choice $K = 4$, which corresponds to the number of curves in the Svensson model. However, since the observations $Y_t^{(c)}$ and the conditional expectations $\mu_t^{(c)}(\tau)$ are both smooth by construction, the errors $\epsilon_t^{(c)}$ are also smooth, and therefore correlated with respect to $\tau$.
To mitigate the effects of the error correlation, we increase the number of factors to $K = 6$, so that the fitted model (2.2) explains more than 99.5% of the variability in $Y_t^{(c)}(\tau)$. Since we are primarily interested in the first four factors, we fix $\gamma_k^{(c)} = 0$ for $k > 4$ in model (2.8), so the two additional factors for each outcome are modeled as independent AR(1) processes with stochastic volatility. We ran the MCMC sampler for 7,000 iterations and discarded the first 2,000 iterations as a burn-in. The MCMC sampler is efficient, especially for the factors $\beta_{k,t}^{(c)}$ and the common FLCs $f_k$; we provide the MCMC diagnostics in Appendix A.

In Figure 2.2, we plot the posterior means of the common FLCs $f_k$ for $k = 1, \ldots, 4$. We can interpret these $f_k$ as estimates of the time-invariant underlying functional structure of the yield curves shared by the Fed, the BOE, the ECB, and the BOC. The FLCs are very smooth, and the dominant hump-like features occur at different maturities (a consequence of the orthonormality constraints), which allows the model to fit a variety of yield curve shapes. Interestingly, the estimated $f_1$, $f_2$, and $f_3$ are similar to the level, slope, and curvature functions of the Nelson-Siegel parameterization described by Diebold and Li (2006). Since the factors $\beta_{k,t}^{(c)}$ serve as weights on the FLCs $f_k$ in (2.2), we may interpret the factors $\beta_{k,t}^{(c)}$, and therefore the slopes $\gamma_k^{(c)}$, based on these features of the yield curve explained by the corresponding $f_k$.

Figure 2.2: Posterior means of the common FLCs, $\{f_1, f_2, f_3, f_4\}$, as a function of maturity, $\tau$ (in months).

In Table 2.1, we compute posterior means and 95% highest posterior density (HPD) intervals for $\gamma_k^{(c)}$, which measures the strength of the linear relationship between $\beta_{k,t}^{(c)}$ and $\beta_{k,t}^{(1)}$.
For the level and slope factors $k = 1, 2$, the ECB factors are substantially less correlated with the Fed factors than are the BOE and BOC factors. For $k = 4$, the BOE, ECB, and BOC factors are nearly uncorrelated with the Fed factors.

Economy   k = 1                k = 2                k = 3                k = 4
BOE       0.62 (0.57, 0.67)    0.72 (0.56, 0.89)    0.37 (0.27, 0.46)    0.03 (-0.03, 0.09)
ECB       0.39 (0.34, 0.45)    0.27 (0.11, 0.42)    0.44 (0.35, 0.52)    0.07 (0.00, 0.15)
BOC       0.61 (0.57, 0.65)    0.56 (0.47, 0.65)    0.49 (0.41, 0.58)    0.16 (0.08, 0.25)

Table 2.1: Posterior means and 95% HPD intervals for $\gamma_k^{(c)}$, which measures the strength of the linear relationship between $\beta_{k,t}^{(c)}$ and $\beta_{k,t}^{(1)}$.

Finally, we analyze the conditional standardized residuals from model (2.8), $r_{k,(c),t} = (\omega_{k,t}^{(c)} - \psi_{k,1}^{(c)} \omega_{k,t-1}^{(c)}) / \sigma_{k,(c),t} \overset{iid}{\sim} N(0, 1)$, to determine periods of time $t$ for which (2.8) is inadequate, which can indicate deviations from the assumed linear relationship between the Fed factors and the other economy factors. By computing the MCMC sample proportion of $r^2_{k,(c),t} \sim \chi^2_1$ that exceed a critical value of the $\chi^2$-distribution, e.g., the 95th percentile $\chi^2_{1,0.05} \approx 3.84$, we can obtain a simple estimate of the probability that $r^2_{k,(c),t}$ exceeds the critical value and, by that measure, is likely an outlier. We can compute a similar quantity for $\sum_{k=1}^4 r^2_{k,(c),t} \sim \chi^2_4$, which aggregates across factors $k = 1, \ldots, 4$. In Figure 2.3, we plot these MCMC sample proportions, restricted to the U.S. recession of December 2007 to June 2009. Around November 2008, there were outliers for all three economies for $k = 2, 3, 4$ and the aggregate, which suggests that the U.S. interest rate market may have behaved differently from the other economies during this time period. We are currently investigating an extension of model (2.8) to incorporate several important financial predictors as covariates, with a particular focus on the weeks during the recession.
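The outlier measure described above reduces to counting MCMC draws. The following minimal sketch assumes the draws of the standardized residuals for one economy and one time point are already available as a list of length-4 lists; the function names and that data layout are hypothetical, and the critical values are the standard 95th percentiles of the $\chi^2_1$ and $\chi^2_4$ distributions.

```python
CHI2_1_95 = 3.841  # 95th percentile of the chi-square distribution with 1 df
CHI2_4_95 = 9.488  # 95th percentile of the chi-square distribution with 4 df


def exceedance_proportion(values, critical_value):
    """Proportion of values that are strictly greater than critical_value."""
    return sum(1 for v in values if v > critical_value) / len(values)


def outlier_probabilities(resid_draws):
    """Estimate outlier probabilities at one time point for one economy.

    resid_draws[s][k] is the standardized residual for factor k in MCMC draw s
    (hypothetical layout). The aggregate uses the 4-df critical value, so it
    assumes exactly four factors per draw.
    Returns (per_factor, aggregate).
    """
    n_factors = len(resid_draws[0])
    per_factor = [
        exceedance_proportion([draw[k] ** 2 for draw in resid_draws], CHI2_1_95)
        for k in range(n_factors)
    ]
    aggregate = exceedance_proportion(
        [sum(r * r for r in draw) for draw in resid_draws], CHI2_4_95
    )
    return per_factor, aggregate
```

Applying `outlier_probabilities` at each time point would give per-factor and aggregate curves of the kind plotted in Figure 2.3.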
Figure 2.3: The MCMC sample proportions of $r^2_{k,(c),t}$ and $\sum_{k=1}^4 r^2_{k,(c),t}$ that exceed the 95th percentile of the assumed $\chi^2$-distributions.

2.4.2 Multivariate Time-Frequency Analysis for Local Field Potential

Local field potential (LFP) data were collected on rats to study the neural activity involved in feature binding, which describes how the brain amalgamates distinct sensory information into a single neural representation (Botly and De Rosa, 2009; Ljubojevic et al., 2013). LFP uses pairs of electrodes implanted directly in local brain regions of interest to record the neural activity over time; in this case, the brain regions of interest are the prefrontal cortex (PFC) and the posterior parietal cortex (PPC). The rats were given two sets of tasks: one that required the rats to synthesize multiple stimuli in order to receive a reward (called feature conjunction, or FC), and one that only required the rats to process a single stimulus in order to receive a reward (called feature singleton, or FS). FC involves feature binding, while FS may serve as a baseline. The tasks were repeated in 20 trials each for FS and FC, during which electrodes implanted in the PFC and the PPC recorded the neural activity. Therefore, the raw data signal is a bivariate time series with 40 replications for each rat; we show an example of the bivariate signals for one such replication in Figure 2.4a. Each signal replicate is 3 seconds long, and has been centered around the behavior-based laboratory estimate of the time at which the rat processed the stimuli, which we denote by $t^*$.

Figure 2.4: The raw LFP data from a rat during an FS trial. (a) The bivariate LFP signal. (b) The associated (log-) spectra and squared coherence, with panels for the PFC log-spectrum, the PPC log-spectrum, the log-cross-spectrum, and the squared coherence, each plotted by time bin and frequency (Hz). The vertical lines indicate the approximate time at which the rat processed the stimuli, $t^*$.

Our interest is in the time-dependent behavior of these bivariate signals and the interaction between them. A natural approach is to use time-frequency analysis; however, exact inference for standard time-frequency procedures is not available. An appealing alternative is to use time-frequency methods to transform the bivariate signal into a MFTS, which makes available the multivariate modeling and inference of the MFDLM.

Since the MFDLM provides smoothing in both the frequency domain $\mathcal{T}$ and the time domain $T$, we may use time-frequency preprocessing that provides minimal smoothing. For the time domain, we segment the signal into time bins of width one-eighth the length of the original signal, with a 50% overlap between neighboring bins to reduce undesirable boundary effects. Within each time bin, we compute the periodograms and cross-periodogram of the bivariate signal. Let $q_t^{(1)}(\tau)$ and $q_t^{(2)}(\tau)$ be the discrete Fourier transforms of the PFC and PPC signals, respectively, for time bin $t$ evaluated at frequency $\tau$, after removing linear trends. The periodograms are $I_t^{(c)}(\tau) = |q_t^{(c)}(\tau)|^2$ for $c = 1, 2$ and the cross-periodogram is $I_t^{(3)}(\tau) = q_t^{(1)}(\tau) \bar{q}_t^{(2)}(\tau)$, where $\bar{q}_t^{(2)}$ is the complex conjugate of $q_t^{(2)}$. The cross-periodogram is generally complex-valued, and if the periodograms are unsmoothed, then $|I_t^{(3)}(\tau)|^2 = I_t^{(1)}(\tau) I_t^{(2)}(\tau)$ is real-valued but clearly fails to provide new information (Bloomfield, 2004). This does not imply that the cross-periodogram is uninformative, but rather that some frequency domain smoothing of the periodograms is necessary. Following Shumway and Stoffer (2000), we use a modified Daniell kernel to obtain the smoothed periodograms, or spectra.
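The binning scheme and the unsmoothed identity $|I_t^{(3)}|^2 = I_t^{(1)} I_t^{(2)}$ can be checked with a short sketch. It uses a naive DFT from the standard library and omits the linear detrending step described above; all function names are hypothetical.

```python
import cmath
import math


def overlapping_bins(n, n_widths=8, overlap=0.5):
    """(start, stop) index pairs for bins of width n // n_widths with the given overlap."""
    width = n // n_widths
    step = int(width * (1.0 - overlap))
    return [(start, start + width) for start in range(0, n - width + 1, step)]


def dft(x):
    """Naive discrete Fourier transform, O(m^2); adequate for a short illustration."""
    m = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * math.pi * f * j / m) for j in range(m))
        for f in range(m)
    ]


def raw_periodograms(x1, x2):
    """Unsmoothed periodograms I1, I2 and cross-periodogram I3 of two equal-length signals."""
    q1, q2 = dft(x1), dft(x2)
    I1 = [abs(a) ** 2 for a in q1]
    I2 = [abs(b) ** 2 for b in q2]
    I3 = [a * b.conjugate() for a, b in zip(q1, q2)]
    return I1, I2, I3
```

With a width of one-eighth of the signal and 50% overlap, `overlapping_bins` yields 15 bins, matching the index $t = 1, \ldots, 15$ used later in the section. Within any bin, `abs(I3[f]) ** 2` equals `I1[f] * I2[f]` up to rounding, which is why the raw cross-periodogram adds no information without smoothing.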
We subdivide each time bin into five segments, compute $I_t^{(c)}(\tau)$, $c = 1, 2, 3$ within each segment, and then average the resulting periodograms using decreasing weights determined by the modified Daniell kernel. Denoting these spectra by $\tilde{I}_t^{(c)}(\tau)$, we let $Y_t^{(c)}(\tau) = \log(\tilde{I}_t^{(c)}(\tau))$ for $c = 1, 2$, where the log-transformation is appealing because it is the variance-stabilizing transformation for the periodogram (Shumway and Stoffer, 2000). To account for the periodic dependence between signals, one choice is the log-cross-spectrum, $\log(|\tilde{I}_t^{(3)}(\tau)|^2)$. An appealing alternative is the squared coherence defined by $\kappa_t^2(\tau) \equiv |\tilde{I}_t^{(3)}(\tau)|^2 / (\tilde{I}_t^{(1)}(\tau) \tilde{I}_t^{(2)}(\tau))$, which satisfies the constraints $0 \le \kappa_t^2(\tau) \le 1$ and is the frequency domain analog to the squared correlation (Bloomfield, 2004). Since (2.1) specifies that $Y_t^{(c)}(\tau) \in \mathbb{R}$, we transform the squared coherence and let $Y_t^{(3)}(\tau) = \Phi^{-1}(\kappa_t^2(\tau)) \in \mathbb{R}$, where $\Phi^{-1} : [0, 1] \to \mathbb{R}$ is a known monotone function; we use the Gaussian quantile function. We have found that fitting $Y_t^{(3)}(\tau)$ produces very similar results to fitting $\kappa_t^2(\tau)$ directly, yet in the transformed case, our estimate of the squared coherence $\Phi(\mu_t^{(3)}(\tau))$ obeys the constraints. Because of our Bayesian approach, this transformation does not inhibit inference.

More generally, this procedure is applicable to $\ell$-dimensional time series, which, including either the squared coherence or the cross-spectra, yields a $C = \ell(\ell + 1)/2$-dimensional MFTS. We show an example of the resulting MFTS from a rat during an FS trial in Figure 2.4b. For completeness, we include the log-cross-spectrum, which is not a component of the MFTS.
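To make the squared coherence and its probit-type transformation concrete, the sketch below works at a single frequency. It assumes the per-segment DFT values are given and uses placeholder smoothing weights rather than the actual modified Daniell weights; the clipping constant `eps` is likewise an assumption added to keep the quantile function finite.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard Gaussian, used for Phi and its inverse


def smoothed_spectra(q1_segments, q2_segments, weights):
    """Weighted averages of per-segment (cross-)periodograms at one frequency.

    q1_segments, q2_segments: complex DFT values of the two signals, one per segment.
    weights: nonnegative weights summing to one (placeholders, not Daniell weights).
    Returns (s1, s2, s3): the two smoothed spectra and the smoothed cross-spectrum.
    """
    s1 = sum(w * abs(a) ** 2 for w, a in zip(weights, q1_segments))
    s2 = sum(w * abs(b) ** 2 for w, b in zip(weights, q2_segments))
    s3 = sum(w * a * b.conjugate() for w, a, b in zip(weights, q1_segments, q2_segments))
    return s1, s2, s3


def squared_coherence(s1, s2, s3):
    """kappa^2 = |s3|^2 / (s1 * s2), which lies in [0, 1] by Cauchy-Schwarz."""
    return abs(s3) ** 2 / (s1 * s2)


def probit(kappa2, eps=1e-10):
    """Gaussian quantile transform; clipping to [eps, 1 - eps] avoids infinite values."""
    p = min(max(kappa2, eps), 1.0 - eps)
    return _STD_NORMAL.inv_cdf(p)


def inverse_probit(y):
    """Map a real-valued fit back to (0, 1), so the constraint on kappa^2 holds."""
    return _STD_NORMAL.cdf(y)
```

With a single segment the squared coherence is identically one, which is the unsmoothed degeneracy noted earlier; averaging over several segments gives values strictly inside the unit interval in general.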
MFDLM Specification

We use the common FLCs model of Section 2.3.4 accompanied by a random walk model for the factors:

$$Y_{i,s,t}^{(c)}(\tau) = \sum_{k=1}^K \beta_{k,i,s,t}^{(c)} f_k(\tau) + \epsilon_{i,s,t}^{(c)}(\tau), \qquad [\epsilon_{i,s,t}^{(c)}(\tau) \mid \sigma^2_{(c)}] \overset{indep}{\sim} N(0, \sigma^2_{(c)})$$
$$\beta_{k,i,s,t} = \beta_{k,i,s,t-1} + \omega_{k,i,s,t}, \qquad [\omega_{k,i,s,t} \mid W_k] \overset{indep}{\sim} N(0, W_k) \qquad (2.9)$$

where $\beta_{k,i,s,t} = (\beta_{k,i,s,t}^{(1)}, \ldots, \beta_{k,i,s,t}^{(C)})'$, $Y_{i,s,t}^{(c)}$ are the log-spectra for $c = 1, 2$ and the probit-transformed squared coherences for $c = 3$, $i = 1, \ldots, 8$ index the rats, $s = 1, \ldots, 40$ index the trials for each rat, and $t = 1, \ldots, 15$ index the time bins for each trial. The joint indices $(i, s, t)$ in (2.9) correspond to the time index $t$ in (2.1), and are used to specify independence of the residuals $\omega_{k,i,s,t}$ between rats and between trials. For each initial time bin $t = 1$, we let $\beta_{k,i,s,1} \sim N(0, 10^4 I_{C \times C})$, since the corresponding observations are only time-ordered within a trial. The $C \times C$ factor covariance matrices $W_k$ do not depend on the rat or the trial, and can help summarize the overall dependence among factors. For simplicity and parsimonious modeling, (2.9) assumes independence between $\omega_{k,i,s,t}$ and $\omega_{j,i,s,t}$ for $j \neq k \in \{1, \ldots, K\}$, but allows for correlation between outcomes for fixed $k$. The $W_k$ control the amount of time domain smoothing for the factors and therefore for $\mu_{i,s,t}^{(c)}(\tau) \equiv \sum_{k=1}^K \beta_{k,i,s,t}^{(c)} f_k(\tau)$. For the error variances, we use the conjugate priors $\sigma^{-2}_{(c)} \overset{iid}{\sim} \mathrm{Gamma}(0.001, 0.001)$ and $W_k^{-1} \overset{iid}{\sim} \mathrm{Wishart}((\rho R)^{-1}, \rho)$, with $R^{-1} = I_{C \times C}$, the expected prior precision, and $\rho = C \ge \mathrm{rank}(R^{-1})$. We provide the full conditional posterior distributions in Appendix A.

To determine the effects of feature binding, we compare the values of $\mu_{i,s,t}^{(c)}(\tau)$ between the FS and FC trials.
Letting $S_{i,FC}$ (respectively, $S_{i,FS}$) be the subset of FC (respectively, FS) trials for which rat $i$ received the reward, we estimate posterior distributions for the sample means

$$\bar{\mu}_t^{(c)}(\tau) \equiv \frac{1}{8} \sum_{i=1}^8 \left[ \frac{1}{|S_{i,FC}|} \sum_{s \in S_{i,FC}} \mu_{i,s,t}^{(c)}(\tau) - \frac{1}{|S_{i,FS}|} \sum_{s' \in S_{i,FS}} \mu_{i,s',t}^{(c)}(\tau) \right]$$

for $c = 1, 2$ and

$$\bar{\mu}_t^{(3)}(\tau) \equiv \frac{1}{8} \sum_{i=1}^8 \left[ \frac{1}{|S_{i,FC}|} \sum_{s \in S_{i,FC}} \Phi\left(\mu_{i,s,t}^{(3)}(\tau)\right) - \frac{1}{|S_{i,FS}|} \sum_{s' \in S_{i,FS}} \Phi\left(\mu_{i,s',t}^{(3)}(\tau)\right) \right].$$

Therefore, we examine the difference in the log-spectra and the squared coherences between the FC and the FS trials, which we average over all rats and over all trials for which the rat responded correctly to the stimuli. This restriction is important, since it filters out unrepresentative trials, in particular FC trials for which feature binding may not have occurred.

Results

Since we observe functions in 15 time bins for 40 trials for 8 rats, the time-dimension of our 3-dimensional MFTS is $T = (15)(40)(8) = 4800$. We restrict the frequencies to $\mathcal{T} = [0.1, 80]$ Hz, which is the range of interest for this application and yields $m_t^{(c)} = 30$ for all $c, t$. Guided by DIC, we select $K = 10$. Alternatively, we could use a smaller value of $K$ by increasing the initial smoothing of the log-spectra and the squared coherences, but would risk smoothing over important features. We ran the MCMC sampler for 7,000 iterations and discarded the first 2,000 iterations as a burn-in; see Appendix A for the MCMC diagnostics.

We compute 95% pointwise HPD intervals and posterior means for $\bar{\mu}_t^{(c)}(\tau)$, $c = 1, 2, 3$ and display the results as spectrogram plots; the plots for $c = 1, 2$ are in Appendix A, while $c = 3$ is in Figure 2.5. Regions of red or orange in the lower 95% HPD interval plots indicate a significant positive difference between the FC and FS trials, while regions of blue in the upper 95% HPD interval plots indicate a significant negative difference.
We are particularly interested in the time bins around $t^*$, which indicates the approximate time at which the stimuli were processed, and frequencies up to 40-50 Hz.

The averages of the differenced log-spectra, $\bar{\mu}_t^{(1)}(\tau)$ and $\bar{\mu}_t^{(2)}(\tau)$, describe how the distinct regions of the brain (the PFC and PPC, respectively) respond differently to stimuli that do or do not require feature binding. By comparison, the average of the differenced squared coherences, $\bar{\mu}_t^{(3)}(\tau)$, describes how these regions of the brain interact with each other under the different stimuli. Based on Figure 2.5, feature binding appears to be most strongly associated with greater squared coherence at frequencies in the Theta range (4-8 Hz), the Alpha range (8-13 Hz), and the Beta range (13-30 Hz) around $t^*$. This pattern persists in the power of both the PFC and PPC log-spectra plots, which suggests that these ranges of frequencies are important to the process of feature binding. Therefore, using the inference provided by the MFDLM, we conclude that during feature binding, the Theta, Alpha, and Beta ranges are associated with increased brain activity in both the PFC and the PPC, as well as greater synchronization between these regions.

Figure 2.5: Pointwise 95% HPD intervals and the posterior mean for $\bar{\mu}_t^{(3)}$, which is the average difference in squared coherence between the FC and FS trials. The panels show the lower 95% HPD interval, the posterior mean, and the upper 95% HPD interval, each plotted by time bin and frequency (Hz). The black vertical lines indicate the event time $t^*$.

2.5 Conclusions

The MFDLM provides a general framework to model complex dependence among functional observations. Because we separate out the functional component through appropriate conditioning and include the necessary identifiability constraints, we can model the remaining dependence using familiar scalar and multivariate methods. The hierarchical Bayesian approach allows us to incorporate interesting and useful submodels seamlessly, such as the common trend model of Section 2.4.1, the stochastic volatility model of Section 2.4.1, and the random walk model of Section 2.4.2. We combine Bayesian spline theory and convex optimization to model the functional component as a set of smooth and optimal curves subject to (identifiability) constraints. Using an efficient Gibbs sampler, we obtain posterior samples of all of the unknown parameters in (2.1), which allows us to perform inference on any parameters of interest, such as $\bar{\mu}_t^{(c)}$ in the LFP example.

Our two diverse applications demonstrate the flexibility and wide applicability of our model. The common trend model of Section 2.4.1 provides useful insights into the interactions among multi-economy yield curves, and our LFP example suggests a novel approach to time-frequency analysis via MFTS. In these applications, the MFDLM adequately models a variety of functional dependence structures, including time dependence, (time-varying) contemporaneous dependence, and stochastic volatility, and may readily accommodate additional dependence structures, such as covariates, repeated measurements, and spatial dependence. We are currently developing an R package for our methods.

CHAPTER 3
DYNAMIC SHRINKAGE PROCESSES

3.1 Introduction

The global-local class of prior distributions is a popular and successful mechanism for providing shrinkage and regularization in a broad variety of models and applications. Global-local priors use continuous scale mixtures of Gaussian distributions to produce desirable shrinkage properties, such as (approximate) sparsity or smoothness, often leading to highly competitive and computationally tractable estimation procedures. For example, in the variable selection context, exact sparsity-inducing priors such as the spike-and-slab prior become intractable for even a moderate number of predictors. By comparison, global-local priors that shrink toward sparsity, such as the horseshoe prior (Carvalho et al., 2010), produce competitive estimators with greater scalability, and are validated by theoretical results, simulation studies, and a variety of applications (Carvalho et al., 2009; Datta and Ghosh, 2013; van der Pas et al., 2014). Unlike non-Bayesian counterparts such as the lasso (Tibshirani, 1996), shrinkage priors also provide adequate uncertainty quantification for parameters of interest (Kyung et al., 2010; van der Pas et al., 2014).

The class of global-local scale mixtures of Gaussian distributions (e.g., Carvalho et al., 2010; Polson and Scott, 2010, 2012a) is defined as follows:

$$[\omega_t \mid \tau, \lambda_t] \overset{indep}{\sim} N(0, \tau^2 \lambda_t^2), \quad t = 1, \ldots, T \qquad (3.1a)$$
$$[\lambda_1^2, \ldots, \lambda_T^2] = \prod_{t=1}^T [\lambda_t^2 \mid \{\lambda_s^2\}_{s<t}] \qquad (3.1b)$$
$$= \prod_{t=1}^T [\lambda_t^2] \qquad (3.1c)$$

where $\tau > 0$ is either endowed with its own prior distribution or estimated using empirical Bayes methods.
Here (3.1c) follows from (3.1b) assuming the $\{\lambda_t\}$ are a priori independent and identically distributed (iid). The iid assumption is commonly made, but as we will argue below, it can be advantageous to forego the independence assumption. In what follows, only (3.1a) and (3.1b) will be assumed. The prior in (3.1a)–(3.1c) is commonly paired with the likelihood $[y_t \mid \omega_t, \sigma^2] \overset{indep}{\sim} N(\omega_t, \sigma^2)$, but we will consider dynamic generalizations.

In (3.1a), $\tau > 0$ controls the global shrinkage for all $\{\omega_t\}_{t=1}^T$, while $\lambda_t$ tunes the local shrinkage for a particular $\omega_t$. Such a model is particularly well-suited for sparse data: $\tau$ determines the global level of sparsity for $\{\omega_t\}_{t=1}^T$, while each $\lambda_t$ allows for large absolute deviations of $\omega_t$ from its prior mean (zero). Careful choice of priors for $\lambda_t^2$ and $\tau^2$ provides both robustness to large signals and adequate shrinkage of noise (e.g., Carvalho et al., 2010), so the framework of (3.1) is widely applicable.

In the dynamic setting, in which the observations $y_t$ are time-ordered and $t$ denotes a time index, it is natural to allow the local scale parameter, $\lambda_t$, to depend on the history of the shrinkage process $\{\lambda_s\}_{s<t}$ [...]

[...] > 0        Normal-Exponential-Gamma Prior       Griffin and Brown (2005)
α = β → 0        (Improper) Normal-Jeffreys' Prior    Figueiredo (2003); Bae and Mallick (2004)

Table 3.1: Special cases of the inverted-Beta prior.

Despite the apparent complexity of the model, we develop a new Gibbs sampling algorithm that builds upon existing efficient sampling algorithms via a parameter expansion of model (3.2): a stochastic volatility sampler (Kim et al., 1998) and a Pólya-Gamma sampler (Polson et al., 2013). The resulting model is highly flexible, easy to implement, computationally efficient, and widely applicable.

For a motivating example, consider the minute-by-minute Twitter CPU usage data in Figure 3.1a (James et al., 2016).
The data show an overall smooth trend interrupted by irregular jumps throughout the morning and early afternoon, with increased volatility from 16:00-18:00. It is important to identify both abrupt changes and slowly-varying intraday trends. To model these features, we combine the likelihood $y_t \overset{indep}{\sim} N(\beta_t, \sigma_t^2)$ with a standard SV model for the observation error variance, $\sigma_t^2$, and a dynamic horseshoe process as the prior on the second differences of the conditional mean, $\omega_t = \Delta^2 \beta_t = \Delta \beta_t - \Delta \beta_{t-1}$, given by (3.2) with $\alpha = \beta = 1/2$ (see Section 3.3.2 for details). The dynamic horseshoe process either drives $\omega_t$ to zero, in which case $\beta_t$ is locally linear, or leaves $\omega_t$ effectively unpenalized, in which case large changes in slope are permissible (see Figure 3.1b). The resulting posterior expectation of $\beta_t$ and credible bands for the posterior predictive distribution of $\{y_t\}$ adapt to both irregular jumps and smooth trends (see Figure 3.1a).

Figure 3.1: Bayesian trend filtering ($D = 2$) with dynamic horseshoe process innovations of minute-by-minute CPU usage data, plotted against time of day. (a) Observed data $y_t$ (points), posterior expectation (cyan) of $\beta_t$, and 95% pointwise highest posterior density (HPD) credible intervals (light gray) and 95% simultaneous credible bands (dark gray) for the posterior predictive distribution of $y_t$. (b) Second difference of observed data $\Delta^2 y_t$ (points), posterior expectation of $\omega_t = \Delta^2 \beta_t$ (cyan), and 95% pointwise HPD intervals (light gray) and simultaneous credible bands (dark gray) for the posterior predictive distribution of $\Delta^2 y_t$. (c) Posterior expectation of time-dependent observation standard deviations, $\sigma_t$. (d) Posterior expectation of time-dependent innovation (prior) standard deviations, $\tau \lambda_t$.

For comparison, Figure 3.1 provides the posterior expectations of both the observation error standard deviations, $\sigma_t$ (Figure 3.1c), and the prior standard deviations, $\tau \lambda_t = \exp(h_t/2)$ (Figure 3.1d). The horseshoe-like shrinkage behavior of $\lambda_t$ is evident: values of $\lambda_t$ are either near zero, corresponding to aggressive shrinkage of $\omega_t = \Delta^2 \beta_t$ to zero, or large, corresponding to large absolute changes in the slope of $\beta_t$. Importantly, Figure 3.1 also provides motivation for a dynamic shrinkage process: there is clear volatility clustering of $\{\lambda_t\}$, in which the shrinkage induced by $\lambda_t$ persists for consecutive time points. The volatility clustering reflects, and motivates, the temporally adaptive shrinkage behavior of the dynamic shrinkage process.

Shrinkage priors and variable selection have been used successfully for time series modeling in a broad variety of settings. Belmonte et al. (2014) propose a Bayesian Lasso prior for shrinkage in dynamic linear models, while Korobilis (2013a) considers several (non-dynamic) scale mixture priors for time series regression. In both cases, the lack of a local (dynamic) scale parameter implies a time-invariant rate of shrinkage for each variable. Frühwirth-Schnatter and Wagner (2010) introduce indicator variables to discern between static and dynamic parameters, but the model cannot shrink adaptively for local time periods. Nakajima and West (2013) provide a procedure for local thresholding of dynamic coefficients, but the computational challenges of model implementation are significant. Chan et al.
(2012) propose a class of time-varying dimension models, but due to the computational complexity of the model, only consider inclusion or exclusion of a variable for all times, which produces non-dynamic variable selection and a limited set of models.

Perhaps most comparable to the proposed methodology, Kalli and Griffin (2014) propose a class of priors which exhibit dynamic shrinkage using normal-gamma autoregressive processes. The Kalli and Griffin (2014) prior is a dynamic extension of the normal-gamma prior of Griffin and Brown (2010), and provides improvements in forecasting performance relative to non-dynamic shrinkage priors. However, the Kalli and Griffin (2014) model requires careful specification of several hyperparameters and hyperpriors, and the computation requires sophisticated adaptive MCMC techniques, which results in lengthy computation times. By comparison, our proposed class of dynamic shrinkage processes is far more general, and includes the dynamic horseshoe process as a special case, which notably does not require tuning of sensitive hyperparameters. Furthermore, our proposed MCMC sampling algorithm combines existing samplers for large blocks of parameters, which produces a straightforward yet efficient Gibbs sampler, with computations linear in the number of time points.

We apply dynamic shrinkage processes to develop a dynamic fundamental factor model for asset pricing. We build upon the five-factor Fama-French model (Fama and French, 2015), which extends the three-factor Fama-French model (Fama and French, 1993) for modeling equity returns with common risk factors. We propose a dynamic extension which allows for time-varying factor loadings, possibly with localized or irregular features, and include a sixth factor, momentum (Carhart, 1997). Despite the popularity of the three-factor Fama-French model, there is not yet consensus regarding the necessity of all five factors in Fama and French (2015) or the momentum factor.
Dynamic shrinkage processes provide a mechanism for addressing this question: within a time-varying parameter regression model, dynamic shrinkage processes provide the necessary flexibility to adapt to rapidly-changing features, while shrinking unnecessary factors to zero. Our dynamic analysis shows that, apart from the market risk factor, no risk factor is significant except for brief periods.

We introduce the dynamic shrinkage process in Section 3.2 and discuss relevant properties, including the Pólya-Gamma parameter expansion for efficient computations. In Section 3.3, we apply the prior to develop a more adaptive Bayesian trend filtering model for irregular curve-fitting, and we compare the proposed procedure with competitive alternatives through simulations and a CPU usage application. We propose in Section 3.4 a time-varying parameter regression model with dynamic shrinkage processes for adaptive regularization and evaluate the model using simulations and an asset pricing example. In Section 3.5, we discuss the details of the Gibbs sampling algorithm, and conclude in Section 3.6. Proofs and additional details are in Appendix B.

3.2 Dynamic Shrinkage Processes

The proposed dynamic shrinkage process contains three prominent features: (1) a dynamic model for the local scale parameters, $\lambda_t$, via an autoregression on the log-scale; (2) a log-scale representation of a broad class of global-local priors to propagate desirable shrinkage properties to the dynamic setting; and (3) a Gaussian scale-mixture representation of the implied log-volatility evolution error to provide an efficient Gibbs sampling algorithm. In this section, we provide the relevant details regarding these features, and explore the properties of the resulting process.
3.2.1 Stochastic Volatility Models for Dynamic Scale Parameters

To extend the class of global-local scale mixtures of Gaussian distributions in (3.1) to the dynamic setting, we propose to model the local scale parameter, $\lambda_t$, using a stochastic volatility (SV) model (e.g., Kim et al., 1998). The SV model, which is the most common approach for modeling time-dependent random scale parameters, introduces dynamic dependence via an AR(1) model for the log-variance (or log-volatility), as in model (3.2). Unlike model (3.2), standard SV models typically assume $\eta_t \overset{iid}{\sim} N(0, \sigma_\eta^2)$. A distinctive feature of the SV model is that it encourages volatility clustering, in which large (or small) variance, or shrinkage, persists for consecutive time points.

MCMC implementations of the SV model commonly represent the likelihood for $h_t$ on the log-scale and approximate the resulting distribution using a known discrete mixture of Gaussian distributions (e.g., Kim et al., 1998). Importantly, the resulting approximation provides a framework for a fast and efficient MCMC sampler: conditional on the mixing component, the model for $\{h_t\}_{t=1}^T$ is a Gaussian dynamic linear model, and therefore $\{h_t\}_{t=1}^T$ may be sampled jointly in $O(T)$ computations. We provide the relevant details in Section 3.5.

3.2.2 Log-Scale Representations of Global-Local Priors

Stochastic volatility models will not automatically exhibit desirable shrinkage behavior: we must consider appropriate distributions for $\mu$ and $\eta_t$. To illustrate this point, consider the standard SV model assumption for the evolution error distribution of log-volatility, $\eta_t \overset{iid}{\sim} N(0, \sigma_\eta^2)$. For the likelihood $y_t \sim N(\omega_t, 1)$ and the prior (3.1a), the posterior expectation of $\omega_t$ is $E[\omega_t \mid \{y_s\}, \tau] = (1 - E[\kappa_t \mid \{y_s\}, \tau]) y_t$, where

$$\kappa_t \equiv \frac{1}{1 + \mathrm{Var}[\omega_t \mid \tau, \lambda_t]} = \frac{1}{1 + \tau^2 \lambda_t^2} \qquad (3.4)$$

is the shrinkage parameter. As noted by Carvalho et al.
(2010), $E[\kappa_t \mid \{y_s\}, \tau]$ is interpretable as the amount of shrinkage toward zero a posteriori: $\kappa_t \approx 0$ yields minimal shrinkage (for signals), while $\kappa_t \approx 1$ yields maximal shrinkage to zero (for noise). For the standard SV model and fixing $\phi = \mu = 0$ for simplicity, $\lambda_t = \exp(\eta_t/2)$ is log-normally distributed, and the shrinkage parameter has density

$$[\kappa_t] \propto \frac{1}{\kappa_t (1 - \kappa_t)} \exp\left\{ -\frac{1}{2\sigma_\eta^2} \left[ \log\left( \frac{1 - \kappa_t}{\kappa_t} \right) \right]^2 \right\}.$$

Notably, the density for $\kappa_t$ approaches zero as $\kappa_t \to 0$ and as $\kappa_t \to 1$. As a result, direct application of the Gaussian SV model may overshrink true signals and undershrink noise.

By comparison, consider the horseshoe prior of Carvalho et al. (2010). The horseshoe prior is the special case of (3.1c) with $[\lambda_t] \overset{iid}{\sim} C^+(0, 1)$, where $C^+$ denotes the half-Cauchy distribution. For fixed $\tau = 1$, the half-Cauchy prior on $\lambda_t$ is equivalent to $\kappa_t \overset{iid}{\sim} \mathrm{Beta}(1/2, 1/2)$, which induces a "horseshoe" shape for the shrinkage parameter (see Figure 3.2). The horseshoe-like behavior is ideal in sparse settings, since the prior density allocates most of its mass near zero (minimal shrinkage of signals) and one (maximal shrinkage of noise). Theoretical results, simulation studies, and a variety of applications confirm the effectiveness of the horseshoe prior (Carvalho et al., 2009, 2010; Datta and Ghosh, 2013; van der Pas et al., 2014).

To emulate the robustness and sparsity properties of the horseshoe and other shrinkage priors in the dynamic setting, we represent a general class of global-local shrinkage priors on the log-scale. As a motivating example, consider the special case of (3.1a) and (3.2) with $\phi = 0$: $\omega_t \overset{indep}{\sim} N(0, \tau^2 \lambda_t^2)$ with $\log(\lambda_t^2) = \eta_t$. This example is illuminating: we equivalently express the (static) horseshoe prior by letting $\eta_t \overset{D}{=} \log \lambda_t^2$, where $\overset{D}{=}$ denotes equality in distribution.
In par- ticular, [ ] ( ) ( ) λ2 ∝ λ2 −1/2 1 + λ2 −1t t t implies [ηt] = π −1 exp(ηt/2) [1 + exp(ηt)] −1 47 φ = 0.25 φ = 0.5 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 κt κt φ = 0.75 φ = 0.99 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 κt κt Figure 3.2: Simulation-based estimate of the stationary distribution of κt for various AR(1) coefficients φ. The blue line indicates the density of κt in the static (φ = 0) horseshoe, [κ] ∼ Beta (1/2, 1/2). so ηt is Z-distributed with ηt ∼ Z(α = 1/2, β = 1/2, µz = 0, σz = 1). Im- portantly, Z-distributions may be written as mean-variance scale mixtures of Gaussian distributions (Barndorff-Nielsen et al., 1982), which produces a useful framework for a parameter-expanded Gibbs sampler. More generally, consider the inverted-Beta prior, denoted IB(β, α), for λ2 with density ( )α−1( )2 ∝ 2 2 −(α+β)[λ ] λ 1 + λ , λ > 0 (e.g., Armagan et al., 2011; Polson and Scott, 2012a,b). Special cases of the inverted-Beta distribution are provided in Table 3.1. This broad class of priors may be equivalently constructed via the variances λ2t , the shrinkage parameters κt, or the log-variances ηt. Proposition 3.1. The following distributions are equivalent: 1. λ2 ∼ IB(β, α); 48 Density Density 0 2 4 0.0 1.5 3.0 Density Density 0 2 4 6 8 0 1 2 3 ( ) 2. κ = 1/ 1 + λ2 ∼ Beta(β, α); 3. η = log(λ2) = log(κ−1 − 1) ∼ Z(α, β, 0, 1). Note that the ordering of the parameters α, β is identical for the inverted- Beta and Beta distributions, but reversed for the Z-distribution. Now consider the dynamic setting in which φ 6= 0. Model (3.2) implies that the conditional prior variance for ωt in (3.1a) is exp(ht) = exp(µ+ φ(ht−1 − µ) + 2 2φ 2 2 iidηt) = τ λt−1λ̃t , where τ = exp(µ), λ2t−1 = exp(ht−1 − µ), and λ̃2t = exp(ηt) ∼ IB(β, α), as in the non-dynamic setting. This prior generalizes the IB(β, α) prior via the local variance term, λ2φt−1, which incorporates information about the shrinkage behavior at the previous time t − 1 in the prior for ωt. 
We formalize the role of this local adjustment term with the following results.

Proposition 3.2. Suppose $\eta \sim Z(\alpha, \beta, \mu_z, 1)$ for $\mu_z \in \mathbb{R}$. Then $\kappa = 1/(1 + \exp(\eta)) \sim TPB(\beta, \alpha, \exp(\mu_z))$, where $\kappa \sim TPB(\beta, \alpha, \gamma)$ denotes the three-parameter Beta distribution with density

\[ [\kappa] = [B(\beta, \alpha)]^{-1} \gamma^{\beta} \kappa^{\beta - 1} (1 - \kappa)^{\alpha - 1} \left[ 1 + (\gamma - 1)\kappa \right]^{-(\alpha + \beta)}, \quad \kappa \in (0, 1), \ \gamma > 0. \]

The three-parameter Beta (TPB) distribution generalizes the Beta distribution: $\gamma = 1$ produces the $\mathrm{Beta}(\beta, \alpha)$ distribution, while $\gamma > 1$ (respectively, $\gamma < 1$) allocates more mass near zero (respectively, one) relative to the $\mathrm{Beta}(\beta, \alpha)$ distribution. For dynamic shrinkage processes, the TPB distribution arises as the conditional prior distribution of $\kappa_{t+1}$ given $\{\kappa_s\}_{s \le t}$.

Theorem 3.1. For the dynamic shrinkage process (3.2), the conditional prior distribution of the shrinkage parameter $\kappa_{t+1} = 1/(1 + \tau^2 \lambda_{t+1}^2)$ is

\[ [\kappa_{t+1} | \{\kappa_s\}_{s \le t}, \phi, \tau] \sim TPB\left( \beta, \alpha, \tau^{2(1 - \phi)} \left[ \frac{1 - \kappa_t}{\kappa_t} \right]^{\phi} \right) \tag{3.5} \]

or equivalently, $[\kappa_{t+1} | \{\lambda_s\}_{s \le t}, \phi, \tau] \sim TPB(\beta, \alpha, \tau^2 \lambda_t^{2\phi})$.

The proof of Theorem 3.1 is in Appendix B. Naturally, the previous value of the shrinkage parameter, $\kappa_t$, together with the AR(1) coefficient $\phi$, inform both the magnitude and the direction of the distributional shift of $\kappa_{t+1}$.

Theorem 3.2. For the dynamic horseshoe process of (3.2) with $\alpha = \beta = 1/2$ and fixed $\tau = 1$, the conditional prior distribution (3.5) satisfies $P(\kappa_{t+1} < \varepsilon | \{\kappa_s\}_{s \le t}, \phi) \to 1$ as $\kappa_t \to 0$ for any $\varepsilon \in (0, 1)$ and fixed $\phi \neq 0$.

The proof of Theorem 3.2 is in Appendix B. Importantly, Theorem 3.2 demonstrates that the mass of the conditional prior distribution for $\kappa_{t+1}$ concentrates near zero (corresponding to minimal shrinkage of signals) when $\kappa_t$ is near zero, so the shrinkage behavior at time $t$ informs the (prior) shrinkage behavior at time $t + 1$. We similarly characterize the posterior distribution of $\kappa_{t+1}$ given $\{\kappa_s\}_{s \le t}$ in the following theorem, which extends the results of Datta and Ghosh (2013) to the dynamic setting.

Theorem 3.3.
Under the likelihood $y_t \overset{indep}{\sim} N(\omega_t, 1)$, the prior (3.1a), and the dynamic horseshoe process (3.2) with $\alpha = \beta = 1/2$ and fixed $\phi \neq 0$, the posterior distribution of $\kappa_{t+1}$ given the history of the shrinkage process $\{\kappa_s\}_{s \le t}$ satisfies the following properties:
(a) For any fixed $\varepsilon \in (0, 1)$, $P(\kappa_{t+1} > 1 - \varepsilon | y_{t+1}, \{\kappa_s\}_{s \le t}, \phi, \tau) \to 1$ as $\gamma_t \to 0$ uniformly in $y_{t+1} \in \mathbb{R}$, where $\gamma_t = \tau^{2(1 - \phi)} [(1 - \kappa_t)/\kappa_t]^{\phi}$.
(b) For any fixed $\varepsilon \in (0, 1)$ and $\gamma_t < 1$, $P(\kappa_{t+1} < \varepsilon | y_{t+1}, \{\kappa_s\}_{s \le t}, \phi, \tau) \to 1$ as $|y_{t+1}| \to \infty$.

The proof of Theorem 3.3 is in Appendix B, and uses the observation that marginally, $[y_{t+1} | \{\kappa_s\}] \overset{indep}{\sim} N(0, \kappa_{t+1}^{-1})$, so the posterior distribution of $\kappa_{t+1}$ is

\[
\begin{aligned}
[\kappa_{t+1} | y_{t+1}, \{\kappa_s\}_{s \le t}, \phi, \tau] &\propto \left\{ \kappa_{t+1}^{\beta - 1} (1 - \kappa_{t+1})^{\alpha - 1} \left[ 1 + (\gamma_t - 1)\kappa_{t+1} \right]^{-(\alpha + \beta)} \right\} \times \left\{ \kappa_{t+1}^{1/2} \exp\left( -y_{t+1}^2 \kappa_{t+1}/2 \right) \right\} \\
&\propto (1 - \kappa_{t+1})^{-1/2} \left[ 1 + (\gamma_t - 1)\kappa_{t+1} \right]^{-1} \exp\left( -y_{t+1}^2 \kappa_{t+1}/2 \right).
\end{aligned}
\]

Theorem 3.3(a) demonstrates that the posterior mass of $[\kappa_{t+1} | \{\kappa_s\}_{s \le t}]$ concentrates near one as $\tau \to 0$, as in the non-dynamic horseshoe, but also as $\kappa_t \to 1$. Therefore, the dynamic horseshoe process provides an additional mechanism for shrinkage of noise, besides the global scale parameter $\tau$, via the previous shrinkage parameter $\kappa_t$. Moreover, Theorem 3.3(b) shows that, despite the additional shrinkage capabilities, the posterior mass of $[\kappa_{t+1} | \{\kappa_s\}_{s \le t}]$ concentrates near zero for large absolute signals $|y_{t+1}|$, which indicates robustness of the dynamic horseshoe process to large signals analogous to the static horseshoe prior.

When $|\phi| < 1$, the log-volatility process $\{h_t\}$ is stationary, which implies $\{\kappa_t\}$ is stationary. In Figure 3.2, we plot a simulation-based estimate of the stationary distribution of $\kappa_t$ for various values of $\phi$ under the dynamic horseshoe process. The stationary distribution of $\kappa_t$ is similar to the static horseshoe distribution ($\phi = 0$) for $\phi < 0.5$, while for large values of $\phi$ the distribution becomes more peaked at zero (less shrinkage of $\omega_t$) and one (more shrinkage of $\omega_t$).
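A simulation-based estimate of this kind can be sketched as follows. This is an illustrative reconstruction, not the code used for Figure 3.2: it assumes $\mu = 0$ (so $\tau = 1$), draws $\eta_t$ as the log of a squared half-Cauchy variable (Proposition 3.1 with $\alpha = \beta = 1/2$), and summarizes the stationary distribution by the hypothetical quantity `edge_mass`, the share of $\kappa_t$ values within 0.1 of zero or one. For $\phi = 0$ this share is about 0.41 under Beta(1/2, 1/2).

```python
import math
import random


def shrinkage_from_log_variance(h):
    """kappa = 1 / (1 + exp(h)), computed without overflow for large |h|."""
    if h > 0:
        e = math.exp(-h)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(h))


def simulate_kappa(phi, n=20000, burn_in=1000, mu=0.0, seed=1):
    """Simulate kappa_t from h_{t+1} = mu + phi * (h_t - mu) + eta_t."""
    rng = random.Random(seed)
    h = mu
    kappas = []
    for step in range(n + burn_in):
        # eta_t ~ Z(1/2, 1/2, 0, 1): log of a squared half-Cauchy draw.
        lam = abs(math.tan(math.pi * (rng.random() - 0.5)))
        eta = math.log(max(lam * lam, 1e-300))
        h = mu + phi * (h - mu) + eta
        if step >= burn_in:  # discard the initial transient
            kappas.append(shrinkage_from_log_variance(h))
    return kappas


def edge_mass(kappas, width=0.1):
    """Share of draws within `width` of zero or one."""
    return sum(k < width or k > 1.0 - width for k in kappas) / len(kappas)


if __name__ == "__main__":
    for phi in (0.0, 0.25, 0.5, 0.75, 0.99):
        print(phi, round(edge_mass(simulate_kappa(phi)), 3))
```

Larger values of `edge_mass` indicate a distribution more sharply peaked at zero and one.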
The result is intuitive: larger $|\phi|$ corresponds to greater persistence in shrinkage behavior, so marginally we expect states of aggressive shrinkage or little shrinkage.

3.2.3 Scale Mixtures via Pólya-Gamma Processes

Standard SV sampling algorithms rely on a Gaussian assumption for the log-volatility innovations, $\eta_t \overset{iid}{\sim} N(0, \sigma_\eta^2)$, to efficiently sample the log-volatilities $\{h_t\}$ (e.g., Kim et al., 1998; Omori et al., 2007; Kastner and Frühwirth-Schnatter, 2014). To extend these techniques to the dynamic shrinkage process (3.2) in which $\eta_t \overset{iid}{\sim} Z(\alpha, \beta, 0, 1)$, we use parameter expansion to write $\eta_t$ as a scale mixture of Gaussian distributions. The representation of a Z-distribution as a mean-variance scale mixture of Gaussian distributions is due to Barndorff-Nielsen et al. (1982). For parameter expansion, we build on the framework of Polson et al. (2013), who propose a Pólya-Gamma scale mixture of Gaussians representation for Bayesian logistic regression. Importantly, this representation allows us to construct an efficient sampling algorithm that combines an $O(T)$ sampling algorithm for the log-volatilities $\{h_t\}_{t=1}^T$ with a Pólya-Gamma sampler for the mixing parameters.

A Pólya-Gamma random variable $\xi$ with parameters $b > 0$ and $c \in \mathbb{R}$, denoted $\xi \sim PG(b, c)$, is an infinite convolution of Gamma random variables:

\[ \xi \overset{D}{=} \frac{1}{2\pi^2} \sum_{k=1}^{\infty} \frac{g_k}{(k - 1/2)^2 + c^2/(4\pi^2)} \tag{3.6} \]

where $g_k \overset{iid}{\sim} \mathrm{Gamma}(b, 1)$. Properties of Pólya-Gamma random variables may be found in Barndorff-Nielsen et al. (1982) and Polson et al. (2013). Our interest in Pólya-Gamma random variables derives from their role in representing the Z-distribution as a mean-variance scale mixture of Gaussians.

Theorem 3.4. The random variable $\eta \sim Z(\alpha, \beta, 0, 1)$, or equivalently $\eta = \log(\lambda^2)$ with $\lambda^2 \sim IB(\beta, \alpha)$, is a mean-variance scale mixture of Gaussian distributions with

\[
\begin{aligned}
[\eta | \xi] &\sim N\left( \xi^{-1}[\alpha - \beta]/2, \ \xi^{-1} \right) \\
[\xi] &\sim PG(\alpha + \beta, 0).
\end{aligned}
\tag{3.7}
\]

Moreover, the conditional distribution of $\xi$ is $[\xi | \eta] \sim PG(\alpha + \beta, \eta)$.
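The series (3.6) and the mixture (3.7) can be illustrated with a truncated-sum sampler. This is a rough sketch for intuition only, not the exact sampler of Polson et al. (2013): the truncation level `n_terms` and both function names are assumptions made here, and truncation biases the draws slightly downward. The denominator uses the form $(k - 1/2)^2 + c^2/(4\pi^2)$ from Polson et al. (2013), under which $E[\xi] = b/4$ when $c = 0$.

```python
import math
import random


def rpg_truncated(b, c, rng, n_terms=100):
    """Approximate PG(b, c) draw using the first n_terms of the series (3.6)."""
    total = 0.0
    for k in range(1, n_terms + 1):
        g = rng.gammavariate(b, 1.0)  # g_k ~ Gamma(shape = b, scale = 1)
        total += g / ((k - 0.5) ** 2 + c * c / (4.0 * math.pi ** 2))
    return total / (2.0 * math.pi ** 2)


def draw_eta(alpha, beta, rng):
    """Draw (eta, xi) from the mean-variance scale mixture (3.7)."""
    xi = rpg_truncated(alpha + beta, 0.0, rng)
    mean = (alpha - beta) / (2.0 * xi)
    eta = rng.gauss(mean, math.sqrt(1.0 / xi))
    return eta, xi


if __name__ == "__main__":
    rng = random.Random(1)
    xis = [rpg_truncated(1.0, 0.0, rng) for _ in range(2000)]
    print(sum(xis) / len(xis))  # compare with E[xi] = b / 4 for c = 0
```

With $\alpha = \beta = 1/2$, the draws of `eta` approximate the symmetric Z(1/2, 1/2, 0, 1) distribution of the horseshoe log-variance.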
The proof of Theorem 3.4 is in Appendix B. When $\alpha = \beta$, the Z-distribution is symmetric, and the conditional expectation in (3.7) simplifies to $E[\eta | \xi] = 0$. Polson et al. (2013) propose a sampling algorithm for Pólya-Gamma random variables, which is available in the R package BayesLogit, and is extremely efficient when $b = 1$. In our setting, this corresponds to $\alpha + \beta = 1$, for which the horseshoe prior is the prime example.

3.3 Bayesian Trend Filtering with Dynamic Shrinkage Processes

Dynamic shrinkage processes are particularly appropriate for dynamic linear models (DLMs). DLMs combine an observation equation, which relates the observed data to latent state variables, and an evolution equation, which allows the state variables, and therefore the conditional mean of the data, to be dynamic. By construction, DLMs contain many parameters, and therefore may benefit from structured regularization. The proposed dynamic shrinkage processes offer such regularization, and unlike existing methods, do so adaptively.

Consider the following DLM with a $D$th order random walk on the state variable, $\beta_t$:

\[
\begin{aligned}
y_t &= \beta_t + \epsilon_t, & [\epsilon_t | \sigma] &\overset{iid}{\sim} N(0, \sigma^2), & t &= 1, \ldots, T \\
\Delta^D \beta_{t+1} &= \omega_t, & [\omega_t | \tau, \{\lambda_s\}] &\overset{indep}{\sim} N(0, \tau^2 \lambda_t^2), & t &= D, \ldots, T
\end{aligned}
\tag{3.8}
\]

and $\beta_{t+1} = \omega_t \sim N(0, \tau^2 \lambda_t^2)$ for $t = 0, \ldots, D - 1$, where $\Delta$ is the differencing operator and $D \in \mathbb{Z}^+$ is the degree of differencing. By imposing a shrinkage prior on $\lambda_t$, model (3.8) may be viewed as a Bayesian adaptation of the trend filtering model of Kim et al. (2009) and Tibshirani (2014): model (3.8) features a penalty encouraging sparsity of the $D$th order differences of the conditional mean, $\beta_t$. Faulkner and Minin (2016) provide an implementation based on the (static) horseshoe prior and the Bayesian lasso, and further allow for non-Gaussian likelihoods. We refer to model (3.8) as a Bayesian trend filtering (BTF) model, with various choices available for the distribution of the innovation standard deviations, $[\tau \lambda_t]$.
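To make the state evolution in model (3.8) concrete, the sketch below simulates from it by solving $\Delta^D \beta_{t+1} = \omega_t$ for $\beta_{t+1}$. It is illustrative only, and several choices are assumptions made here rather than part of the model statement: the local scales are drawn from a static half-Cauchy distribution, the values of `tau` and `sigma` are arbitrary, and the recursion is run so that exactly $T$ states $\beta_1, \ldots, \beta_T$ are produced from $\omega_0, \ldots, \omega_{T-1}$.

```python
import math
import random


def difference(x, D):
    """Apply the first-difference operator D times to the list x."""
    for _ in range(D):
        x = [b - a for a, b in zip(x, x[1:])]
    return x


def simulate_btf(T, D=2, tau=0.1, sigma=0.5, seed=1):
    """Simulate (y, beta, omega); index i holds beta_{i+1}, y_{i+1}, omega_i."""
    rng = random.Random(seed)
    omega = []
    for _ in range(T):
        lam = abs(math.tan(math.pi * (rng.random() - 0.5)))  # lambda_t ~ C+(0, 1)
        omega.append(rng.gauss(0.0, tau * lam))
    beta = omega[:D]  # initial states: beta_{t+1} = omega_t for t = 0, ..., D - 1
    for i in range(D, T):
        # Binomial expansion of Delta^D, solved for the newest state.
        past = sum((-1) ** k * math.comb(D, k) * beta[i - k] for k in range(1, D + 1))
        beta.append(omega[i] - past)
    y = [b + rng.gauss(0.0, sigma) for b in beta]
    return y, beta, omega


if __name__ == "__main__":
    y, beta, omega = simulate_btf(T=10)
    print(difference(beta, 2))  # equals omega[2:] up to rounding
    print(omega[2:])
```

With $D = 2$, most innovations are near zero and a few are large, so the simulated $\beta_t$ is close to piecewise linear with occasional changes in slope.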
We propose a dynamic horseshoe process as the prior for the innovations $\omega_t$ in model (3.8). The aggressive shrinkage of the horseshoe prior forces small values of $|\omega_t| = |\Delta^D \beta_{t+1}|$ toward zero, while the robustness of the horseshoe prior permits large values of $|\Delta^D \beta_{t+1}|$. When $D = 2$, model (3.8) will shrink the conditional mean $\beta_t$ toward a piecewise linear function with breakpoints determined adaptively, while allowing large absolute changes in the slopes. Further, using the dynamic horseshoe process, the shrinkage effects induced by $\lambda_t$ are time-dependent, which provides localized adaptability to regions with rapidly- or slowly-changing features. Following Carvalho et al. (2010) and Polson and Scott (2012b), we assume a half-Cauchy prior for the global scale parameter $\tau \sim C^+(0, \sigma/\sqrt{T})$, in which we scale by the observation error variance and the sample size (Piironen and Vehtari, 2016). Using Pólya-Gamma mixtures, the implied conditional prior on $\mu = \log(\tau^2)$ is $[\mu | \sigma, \xi_\mu] \sim N(\log \sigma^2 - \log T, \xi_\mu^{-1})$ with $\xi_\mu \sim PG(1, 0)$. We include the details of the Gibbs sampling algorithm for model (3.8) in Section 3.5. The algorithm is notably linear in the number of time points, $T$: the full conditional posterior precision matrices for $\beta = (\beta_1, \ldots, \beta_T)'$ and $h = (h_1, \ldots, h_T)'$ are $D$-banded and tridiagonal, respectively, which admit highly efficient $O(T)$ band back-substitution sampling algorithms (see Appendix B for empirical evidence).

3.3.1 Bayesian Trend Filtering: Simulations

To assess the performance of the Bayesian trend filtering (BTF) model (3.8) with dynamic horseshoe innovations (BTF-DHS), we compared the proposed methods to several competitive alternatives using simulated data. We considered the following variations on BTF model (3.8): normal-inverse-Gamma (BTF-NIG) innovations via $\tau^{-2} \sim \mathrm{Gamma}(0.001, 0.001)$ with $\lambda_t = 1$; and (static) horseshoe priors for the innovations (BTF-HS) via $\tau, \lambda_t \overset{iid}{\sim} C^+(0, 1)$.
In addition, we include the (non-Bayesian) trend filtering model of Tibshirani (2014) implemented using the R package genlasso (Arnold and Tibshirani, 2014), for which the regularization tuning parameter is chosen using cross-validation (Trend Filtering). For all trend filtering models, we select $D = 2$, but the relative performance is similar for $D = 1$. Among non-trend filtering models, we include a smoothing spline estimator implemented via smooth.spline() in R (Smoothing Spline); the wavelet-based estimator of Abramovich et al. (1998) (BayesThresh) implemented in the wavethresh package (Nason, 2016); and the nested Gaussian Process (nGP) model of Zhu and Dunson (2013), which relies on a state space model framework for efficient computations, comparable to, but empirically less efficient than, the BTF model (3.8).

We simulated 100 data sets from the model $y_t = y_t^* + \epsilon_t$, where $y_t^*$ is the true function and $\epsilon_t \overset{indep}{\sim} N(0, \sigma_*^2)$. We use the following true functions $y_t^*$ from Donoho and Johnstone (1994): Doppler, Bumps, Blocks, and Heavisine, implemented in the R package wmtsa (Constantine and Percival, 2016). The noise variance $\sigma_*^2$ is determined by selecting a root-signal-to-noise ratio (RSNR) and computing

\[ \sigma_* = \sqrt{ \frac{\sum_{t=1}^T (y_t^* - \bar{y}^*)^2}{T - 1} } \Big/ \mathrm{RSNR}, \quad \text{where } \bar{y}^* = \frac{1}{T} \sum_{t=1}^T y_t^*. \]

As in Zhu and Dunson (2013), we select RSNR = 7 and use a moderate length time series, $T = 128$.

[Figure 3.3 panels: Doppler, Bumps, Blocks, and Heavisine fitted curves; horizontal axis $t$.]
Figure 3.3: Fitted curves for simulated data with $T = 128$ and RSNR = 7. Each panel includes the simulated observations (x-marks), the posterior expectations of $\beta_t$ (cyan), and the 95% pointwise HPD credible intervals (light gray) and 95% simultaneous credible bands (dark gray) for the posterior predictive distribution of $\{y_t\}$ under BTF-DHS model (3.8) with $D = 2$.
The proposed estimator, as well as the uncertainty bands, accurately capture both slowly- and rapidly-changing behavior in the underlying functions.

In Figure 3.3, we provide an example of each true curve $y_t^*$, together with the proposed BTF-DHS posterior expectations and credible bands. Notably, the Bayesian trend filtering model (3.8) with $D = 2$ and dynamic horseshoe innovations provides an exceptionally accurate fit to each data set. Importantly, the posterior expectations and the posterior credible bands adapt to both slowly- and rapidly-changing behavior in the underlying curves. The implementation is also efficient: the computation time for 15,000 iterations of the Gibbs sampling algorithm, implemented in R (on a MacBook Pro, 2.7 GHz Intel Core i5), is about 1.15 minutes.

To compare the aforementioned procedures, we compute the root mean squared errors

\[ \mathrm{RMSE}(\hat{y}) = \sqrt{ \frac{1}{T} \sum_{t=1}^T (y_t^* - \hat{y}_t)^2 } \]

for all estimators $\hat{y}$ of the true function, $y^*$. The results are displayed in Figure 3.4. The proposed BTF-DHS implementation substantially outperforms all competitors, especially for rapidly-changing curves (Doppler and Bumps). The exceptional performance of BTF-DHS is paired with comparably small variability of RMSE, especially relative to the non-dynamic horseshoe model (BTF-HS). Interestingly, the magnitude and variability of the RMSEs for BTF-DHS are related to the AR(1) coefficient, $\phi$: the 95% HPD intervals (corresponding to Figure 3.3) are (0.77, 0.97) (Doppler), (0.81, 0.97) (Bumps), (0.76, 0.96) (Blocks), and (−0.04, 0.74) (Heavisine). For the smoothest function, Heavisine, there is less separation among the estimators. Nonetheless, BTF-DHS performs the best, even though the HPD interval for $\phi$ is wider and contains zero.
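The two quantities that define the simulation design, the noise standard deviation implied by a chosen RSNR and the RMSE of an estimator, are simple to compute. The sketch below is an illustration under stated assumptions: the true curve is a hypothetical sine wave standing in for the Donoho and Johnstone (1994) test functions, and all function names are chosen here for exposition.

```python
import math
import random


def noise_sd(y_true, rsnr):
    """Sample standard deviation of the true curve divided by the RSNR."""
    n = len(y_true)
    mean = sum(y_true) / n
    sample_sd = math.sqrt(sum((v - mean) ** 2 for v in y_true) / (n - 1))
    return sample_sd / rsnr


def rmse(y_true, y_hat):
    """Root mean squared error of y_hat as an estimate of y_true."""
    n = len(y_true)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_hat)) / n)


def simulate_noisy(y_true, rsnr, seed=1):
    """Add independent Gaussian noise with the RSNR-implied standard deviation."""
    rng = random.Random(seed)
    sd = noise_sd(y_true, rsnr)
    return [v + rng.gauss(0.0, sd) for v in y_true]


if __name__ == "__main__":
    T = 128
    # Hypothetical smooth true curve on an equally-spaced grid in [0, 1].
    y_true = [math.sin(4.0 * math.pi * i / (T - 1)) for i in range(T)]
    y_obs = simulate_noisy(y_true, rsnr=7)
    # The raw data are themselves a (poor) estimator; their RMSE is near the noise level.
    print(noise_sd(y_true, 7), rmse(y_true, y_obs))
```

An estimator is useful in this design only if its RMSE falls well below the noise standard deviation, which is the RMSE obtained by using the raw observations.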
These results show that the Bayesian trend filtering model (3.8) with dynamic horseshoe innovations substantially improves upon existing curve-fitting procedures, and due to both its computational efficiency and the availability of posterior inference, may provide a useful procedure for a wide variety of applications.

[Figure 3.4 panels: Doppler, Bumps, Blocks, and Heavisine root mean squared errors for Smoothing Spline, BayesThresh, nGP, Trend Filtering, BTF-NIG, BTF-HS, and BTF-DHS.]
Figure 3.4: Root mean squared errors for simulated data with $T = 128$ and RSNR = 7. The Bayesian trend filtering (BTF) estimators differ in their innovation distributions, which determine the shrinkage behavior of the second order differences ($D = 2$): normal-inverse-Gamma (NIG), horseshoe (HS), and dynamic horseshoe (DHS).

3.3.2 Bayesian Trend Filtering: Application to CPU Usage Data

To demonstrate the adaptability of the dynamic horseshoe process for model (3.8), we consider the CPU usage data in Figure 3.1a. The data exhibit substantial complexity: an overall smooth intraday trend but with multiple irregularly-spaced jumps, and an increase in volatility from 16:00-18:00. Our goal is to provide an accurate measure of the trend, including jumps, with appropriate uncertainty quantification. For this purpose, we employ the BTF-DHS model (3.8), which we extend to include stochastic volatility for the observation error: $y_t \overset{indep}{\sim} N(\beta_t, \sigma_t^2)$ with an AR(1) model on $\log \sigma_t^2$ as in (3.2) with $\eta_t \overset{iid}{\sim} N(0, \sigma_\eta^2)$.
For the additional sampling step of the stochastic volatility parameters, we use the algorithm of Kastner and Frühwirth-Schnatter (2014) implemented in the R package stochvol (Kastner, 2016).

The resulting model fit is summarized in Figure 3.1. The posterior expectation and posterior credible bands accurately model both irregular jumps and smooth trends, and capture the increase in volatility from 16:00-18:00 (see Figure 3.1c). By examining regions of nonoverlapping simultaneous posterior credible bands, we may assess change points in the level of the data. In particular, the model fit suggests that the CPU usage followed a slowly increasing trend interrupted by jumps of two distinct magnitudes prior to 16:00, after which the volatility increased and the level decreased until approximately 18:00.

We augment the simulation study of Section 3.3.1 with a comparison of out-of-sample estimation of the CPU usage data. We fit each model using 90% ($T = 1296$) of the data selected randomly for training and the remaining 10% ($T = 144$) for testing, and repeated this procedure independently 100 times. Models were compared using RMSE.

Unlike the simulation study in Section 3.3.1, the subsampled data are not equally spaced. Taking advantage of the computational efficiency of the proposed BTF methodology, we employ a model-based imputation scheme, which is valid for missing observations. For unequally-spaced data $y_{t_i}$, $i = 1, \ldots, T$, we expand the operative data set to include missing observations along an equally-spaced grid, $t^* = 1, \ldots, T^*$, such that for each observation point $i$, $y_{t_i} = y_{t^*}$ for some $t^*$. Although $T^* \ge T$, possibly with $T^* \gg T$, all computations within the sampling algorithm, including the imputation sampling scheme for $\{y_{t^*} : t^* \neq t_i\}$, are linear in the number of (equally-spaced) time points, $T^*$.
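The grid expansion and a single imputation draw can be sketched as follows. This is a schematic illustration, not the sampler itself: within the Gibbs algorithm the state values and error standard deviations would be the current MCMC draws, whereas here `beta` and `sigma` are hypothetical placeholder inputs, observation times are assumed to be integers on the grid $1, \ldots, T^*$, and missing values are represented by `None`.

```python
import random


def expand_to_grid(times, values, T_star):
    """Place observations on the grid 1, ..., T_star; None marks a missing point."""
    grid = [None] * T_star
    for t, v in zip(times, values):
        grid[t - 1] = v  # grid positions are 1-based
    return grid


def impute_missing(grid, beta, sigma, rng):
    """Fill each missing point with a draw from N(beta_t, sigma_t^2)."""
    return [
        v if v is not None else rng.gauss(b, s)
        for v, b, s in zip(grid, beta, sigma)
    ]


if __name__ == "__main__":
    times = [1, 2, 5, 6, 9]            # irregularly spaced observation times
    values = [0.1, 0.2, 0.5, 0.6, 0.9]
    grid = expand_to_grid(times, values, T_star=10)
    beta = [0.1 * t for t in range(1, 11)]  # placeholder for the current state draw
    sigma = [0.05] * 10                     # placeholder for current error SDs
    print(impute_missing(grid, beta, sigma, random.Random(0)))
```

Both steps are a single pass over the grid, so the cost of imputation is linear in $T^*$ and the observed values are never altered.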
Therefore, we may apply the same Gibbs sampling algorithm as before, with the additional step of drawing $y_{t^*} \overset{indep}{\sim} N(\beta_{t^*}, \sigma_{t^*}^2)$ for each unobserved $t^* \neq t_i$. Implicitly, this procedure assumes that the unobserved points are missing at random, which is satisfied by the aforementioned subsampling scheme.

The results of the out-of-sample estimation study are displayed in Figure 3.5. The BTF procedures are notably superior to the non-Bayesian trend filtering and smoothing spline estimators, and, as with the simulations of Section 3.3.1, the proposed BTF-DHS model substantially outperforms all competitors.

[Figure 3.5 panel: root mean squared errors for Smoothing Spline, Trend Filtering, BTF-NIG, BTF-HS, and BTF-DHS.]
Figure 3.5: Root mean squared error for out-of-sample minute-by-minute CPU usage data. The Bayesian trend filtering (BTF) estimators differ in their innovation distributions, which determine the shrinkage behavior of the second order differences ($D = 2$): normal-inverse-Gamma (NIG), horseshoe (HS), and dynamic horseshoe (DHS).

3.4 Joint Shrinkage for Time-Varying Parameter Models

Dynamic shrinkage processes are appropriate for multivariate time series models that may benefit from locally adaptive shrinkage properties. Consider the following time-varying parameter regression model with multiple dynamic predictors $x_t = (x_{1,t}, \ldots, x_{p,t})'$:

\[
\begin{aligned}
y_t &= x_t' \beta_t + \epsilon_t, & [\epsilon_t | \sigma] &\overset{indep}{\sim} N(0, \sigma^2) \\
\Delta^D \beta_{t+1} &= \omega_t, & [\omega_{j,t} | \tau_0, \{\tau_k\}, \{\lambda_{k,s}\}] &\overset{indep}{\sim} N(0, \tau_0^2 \tau_j^2 \lambda_{j,t}^2)
\end{aligned}
\tag{3.9}
\]

where $\beta_t = (\beta_{1,t}, \ldots, \beta_{p,t})'$ is the vector of dynamic regression coefficients and $D \in \mathbb{Z}^+$ is the degree of differencing. The prior for the innovations $\omega_{j,t}$ incorporates three levels of global-local shrinkage: a global shrinkage parameter $\tau_0$, a predictor-specific shrinkage parameter $\tau_j$, and a predictor- and time-specific local shrinkage parameter $\lambda_{j,t}$.
To provide jointly localized shrinkage of the dynamic regression coefficients $\{\beta_{j,t}\}$ analogous to the Bayesian trend filtering model of Section 3.3, we extend (3.2) to allow for multivariate time dependence via a vector autoregression (VAR) on the log-volatility:

\[
\begin{aligned}
&[\omega_{j,t} | \tau_0, \{\tau_k\}, \{\lambda_{k,s}\}] \overset{indep}{\sim} N(0, \tau_0^2 \tau_j^2 \lambda_{j,t}^2) \\
&h_{j,t} = \log(\tau_0^2 \tau_j^2 \lambda_{j,t}^2), \quad j = 1, \ldots, p, \ t = 1, \ldots, T \\
&h_{t+1} = \mu + \Phi(h_t - \mu) + \eta_t, \quad \eta_{j,t} \overset{iid}{\sim} Z(\alpha, \beta, 0, 1)
\end{aligned}
\tag{3.10}
\]

where $h_t = (h_{1,t}, \ldots, h_{p,t})'$, $\mu = (\mu_1, \ldots, \mu_p)'$, $\eta_t = (\eta_{1,t}, \ldots, \eta_{p,t})'$, and $\Phi$ is the $p \times p$ VAR coefficient matrix. We assume $\Phi = \mathrm{diag}(\phi_1, \ldots, \phi_p)$ for simplicity, but non-diagonal extensions are available. As in the univariate setting, we use Pólya-Gamma mixtures (independently) for the log-volatility evolution errors, $[\eta_{j,t} | \xi_{j,t}] \overset{indep}{\sim} N(\xi_{j,t}^{-1}[\alpha - \beta]/2, \xi_{j,t}^{-1})$ with $\xi_{j,t} \overset{iid}{\sim} PG(\alpha + \beta, 0)$ and $\alpha = \beta = 1/2$. We augment model (3.10) with half-Cauchy priors for the predictor-specific and global parameters, $\tau_j \overset{indep}{\sim} C^+(0, 1)$ and $\tau_0 \sim C^+(0, \sigma/\sqrt{Tp})$, in which we scale by the observation error variance and the number of innovations $\{\omega_{j,t}\}$ (Piironen and Vehtari, 2016). These priors may be equivalently represented on the log-scale using the Pólya-Gamma parameter expansion $[\mu_j | \mu_0, \xi_{\mu_j}] \sim N(\mu_0, \xi_{\mu_j}^{-1})$ and $[\mu_0 | \sigma, \xi_{\mu_0}] \sim N(\log \sigma^2 - \log Tp, \xi_{\mu_0}^{-1})$ with $\xi_{\mu_j}, \xi_{\mu_0} \overset{iid}{\sim} PG(1, 0)$ and the identification $\mu_j = \log(\tau_0^2 \tau_j^2)$ and $\mu_0 = \log(\tau_0^2)$.

3.4.1 Time-Varying Parameter Models: Simulations

We conducted a simulation study to evaluate competing variations of the time-varying parameter regression model (3.9), in particular relative to the proposed dynamic shrinkage process (BTF-DHS) in (3.10). Similar to the simulations of Section 3.3.1, we focus on the distribution of the innovations, $\omega_{j,t}$, and again include the normal-inverse-Gamma (BTF-NIG) and the (static) horseshoe (BTF-HS) as competitors, in each case selecting $D = 2$.
Among models with non-dynamic regression coefficients, we include a lasso regression (Tibshirani, 1996) implemented via the R package glmnet (Friedman et al., 2010), which incorporates variable selection, and an ordinary linear regression.

We simulated 100 data sets of length $T = 500$ from the model $y_t = x_t' \beta_t^* + \epsilon_t$, where the $p = 7$ predictors are $x_{1,t} = 1$ and $x_{j,t} \overset{iid}{\sim} N(0, 1)$ for $j \ge 2$, and $\epsilon_t \overset{iid}{\sim} N(0, \sigma_*^2)$. The true regression coefficients $\beta_t^* = (\beta_{1,t}^*, \ldots, \beta_{p,t}^*)'$ are the following: $\beta_{1,t}^* = 1$, $\beta_{2,t}^*$ and $\beta_{3,t}^*$ are the Bumps and Heavisine functions, respectively, from Section 3.3.1 rescaled to $[0, 1]$, and $\beta_{j,t}^* = 0$ for $j = 4, \ldots, p = 7$. The predictor set contains a variety of functions: a constant nonzero function, a rapidly-changing function (Bumps), a relatively smooth function (Heavisine), and three true zeros. The noise variance $\sigma_*^2$ is determined by selecting a root-signal-to-noise ratio (RSNR) and computing

\[ \sigma_* = \sqrt{ \frac{\sum_{t=1}^T (y_t^* - \bar{y}^*)^2}{T - 1} } \Big/ \mathrm{RSNR}, \]

where $y_t^* = x_t' \beta_t^*$ and $\bar{y}^* = \frac{1}{T} \sum_{t=1}^T y_t^*$. We select RSNR = 10.

In Figure 3.6, we show the true regression functions $\beta_{j,t}^*$, together with the proposed BTF-DHS posterior expectations and credible bands for $\beta_{j,t}$. Despite the challenge presented by the Bumps function, the proposed model (3.9) with innovation distribution (3.10) adequately identifies the constant and zero curves, captures the important features of the Bumps function, and accurately estimates the smoother Heavisine function.

We evaluate competing methods using RMSEs for both $y_t^*$ and $\beta_t^*$ defined by

\[ \mathrm{RMSE}(\hat{y}) = \sqrt{ \frac{1}{T} \sum_{t=1}^T (y_t^* - \hat{y}_t)^2 } \quad \text{and} \quad \mathrm{RMSE}(\hat{\beta}) = \sqrt{ \frac{1}{Tp} \sum_{t=1}^T \sum_{j=1}^p \left( \beta_{j,t}^* - \hat{\beta}_{j,t} \right)^2 } \]

for all estimators $\hat{\beta}_t$ of the true regression functions, $\beta_t^*$, with $\hat{y}_t = x_t' \hat{\beta}_t$. The results are displayed in Figure 3.7. The proposed BTF-DHS model substantially outperforms the competitors in both recovery of the true regression functions, $\beta_{j,t}^*$, and estimation of the true curves, $y_t^*$.
Notably, the dynamic (BTF) procedures offer massive gains over the models with static regression coefficients.

[Figure 3.6 panels: Intercept, Bumps, Heavisine, Zero, Zero, Zero.]
Figure 3.6: True regression functions $\beta_{j,t}^*$ (black line) and corresponding posterior expectations (cyan), 95% pointwise HPD credible intervals (light gray) and 95% simultaneous credible bands (dark gray) for $\beta_{j,t}$ under the BTF-DHS model given by (3.9) and (3.10) for a simulated data set.

[Figure 3.7 panels: Regression Coefficients: Root Mean Squared Error; Fitted Values: Root Mean Squared Error; each for Linear Regression, Lasso, BTF-NIG, BTF-HS, and BTF-DHS.]
Figure 3.7: Root mean squared errors for the regression coefficients, $\beta_{j,t}^*$ (left) and the true curves, $y_t^* = x_t' \beta_t^*$ (right) for simulated data.

3.4.2 Time-Varying Parameter Models: The Fama-French Asset Pricing Model

Asset pricing models commonly feature highly structured factor models to parsimoniously model the co-movement of stock returns. Such fundamental factor models identify common risk factors among assets, which may be treated as exogenous predictors in a time series regression. Popular approaches include the one-factor Capital Asset Pricing Model (CAPM, Sharpe, 1964) and the three-factor Fama-French model (FF-3, Fama and French, 1993). Recently, the five-factor Fama-French model (FF-5, Fama and French, 2015) was proposed as an extension of FF-3 to incorporate additional common risk factors. However, outstanding questions remain regarding which, and how many, factors are necessary. Importantly, an attempt to address these questions must consider the dynamic component: the relevance of individual factors may change over time, particularly for different assets.
We apply model (3.9) to extend these fundamental factor models to the dynamic setting, in which the factor loadings are permitted to vary (perhaps rapidly) over time. For further generality, we append the momentum factor of Carhart (1997) to FF-5 to produce a fundamental factor model with six factors and dynamic factor loadings. Importantly, the shrinkage towards sparsity induced by the dynamic horseshoe process allows the model to effectively select out unimportant factors, which also may change over time. As in Section 3.3.2, we modify model (3.9) to include stochastic volatility for the observation error, $[\epsilon_t | \{\sigma_s\}] \overset{indep}{\sim} N(0, \sigma_t^2)$.

To study various market sectors, we use weekly industry portfolio data from the website of Kenneth R. French, which provide the value-weighted return of stocks in the given industry. We focus on manufacturing (Manuf) and healthcare (Hlth). For a given industry portfolio, the response variable is the returns in excess of the risk free rate, $y_t = R_t - R_{F,t}$, with predictors $x_t = (1, R_{M,t} - R_{F,t}, \mathrm{SMB}_t, \mathrm{HML}_t, \mathrm{RMW}_t, \mathrm{CMA}_t, \mathrm{MOM}_t)'$, defined as follows: the market risk factor, $R_{M,t} - R_{F,t}$, is the return on the market portfolio $R_{M,t}$ in excess of the risk free rate $R_{F,t}$; the size factor, $\mathrm{SMB}_t$ (small minus big), is the difference in returns between portfolios of small and large market value stocks; the value factor, $\mathrm{HML}_t$ (high minus low), is the difference in returns between portfolios of high and low book-to-market value stocks; the profitability factor, $\mathrm{RMW}_t$, is the difference in returns between portfolios of robust and weak profitability stocks; the investment factor, $\mathrm{CMA}_t$, is the difference in returns between portfolios of stocks of low and high investment firms; and the momentum factor, $\mathrm{MOM}_t$, is the difference in returns between portfolios of stocks with high and low prior returns. These data are publicly available on Kenneth R. French's website, which provides additional details on the portfolios.
We standardize all predictors and the response to have unit variance.

In Figures 3.8 and 3.9, we plot the posterior expectation and credible bands for the time-varying regression coefficients and observation error stochastic volatility for the weekly manufacturing and healthcare industry data sets, respectively, from 4/1/2007 - 4/1/2017 ($T = 522$). The 95% simultaneous credible bands (dark gray) indicate which coefficients are significantly different from zero, and if so, at which times. For the manufacturing industry, the significant factors are the market risk ($R_{M,t} - R_{F,t}$), investment ($\mathrm{CMA}_t$), and momentum ($\mathrm{MOM}_t$), where both $\mathrm{CMA}_t$ and $\mathrm{MOM}_t$ are significantly time-varying (i.e., the simultaneous credible bands contain no constant function). By comparison, an ordinary linear regression does not find $\mathrm{MOM}_t$ to be significant at the 5% level, since the non-dynamic model ignores the fluctuations from 2008-2012, but does identify the market risk, profitability ($\mathrm{RMW}_t$), and investment as significant factors (see Appendix B for details).

For the healthcare industry, the significant factors are market risk, value ($\mathrm{HML}_t$), and profitability. By comparison, the ordinary linear regression identifies these factors as well as size ($\mathrm{SMB}_t$) as significant at the 5% level (see Appendix B for details). Notably, the only common factor significant in both the manufacturing and healthcare industries under model (3.9) over this time period is the market risk. This result suggests that the aggressive shrinkage behavior of the dynamic shrinkage process is important in this setting, since several factors may be effectively irrelevant for some or all time points.
[Figure 3.8 panels: Manuf: Intercept, Mkt.RF, SMB, HML, RMW, CMA, MOM, SV; horizontal axis 2008 to 2016.]
Figure 3.8: Posterior expectations (cyan), 95% pointwise HPD credible intervals (light gray) and 95% simultaneous credible bands (dark gray) for $\beta_{j,t}$ and $\sigma_t$ (bottom right) under the BTF-DHS model given by (3.9) and (3.10) for value-weighted manufacturing industry returns. The solid black line is zero, the dashed green line is the ordinary linear regression estimate, and the solid red line indicates periods for which the 95% simultaneous credible bands do not contain zero.

3.5 MCMC Sampling Algorithm and Computational Details

We design a Gibbs sampling algorithm for the dynamic shrinkage process. The sampling algorithm is both computationally and MCMC efficient, and builds upon two main components: (1) a stochastic volatility sampling algorithm (Kastner and Frühwirth-Schnatter, 2014) augmented with a Pólya-Gamma sampler (Polson et al., 2013); and (2) a Cholesky Factor Algorithm (CFA, Rue, 2001) for sampling the state variables in the dynamic linear model. Importantly, both components employ algorithms that are linear in the number of time points, which produces a highly efficient sampling algorithm.

[Figure 3.9 panels: Hlth: Intercept, Mkt.RF, SMB, HML, RMW, CMA, MOM, SV; horizontal axis 2008 to 2016.]
Figure 3.9: Posterior expectations (cyan), 95% pointwise HPD credible intervals (light gray) and 95% simultaneous credible bands (dark gray) for $\beta_{j,t}$ and $\sigma_t$ (bottom right) under the BTF-DHS model given by (3.9) and (3.10) for value-weighted healthcare industry returns.
The solid black line is zero, the dashed green line is the ordinary linear regression estimate, and the solid red line indicates periods for which the 95% simultaneous credible bands do not contain zero.

The general sampling algorithm is as follows: (1) sample the dynamic shrinkage components (the log-volatilities $\{h_t\}$, the Pólya-Gamma mixing parameters $\{\xi_t\}$, the unconditional mean of log-volatility $\mu$, the AR(1) coefficient of log-volatility $\phi$, and the discrete mixture component indicators $\{s_t\}$); (2) sample the state variables $\{\beta_t\}$; and (3) sample the observation error variance $\sigma^2$. We provide details of the dynamic shrinkage process sampling algorithm in Section 3.5.1 and include the details for sampling steps (2) and (3) in Appendix B.

3.5.1 Efficient Sampling for the Dynamic Shrinkage Process

Consider the (univariate) dynamic shrinkage process in (3.2) with the Pólya-Gamma parameter expansion of Theorem 3.4. We provide implementation details for the dynamic horseshoe process with $\alpha = \beta = 1/2$, but extensions to other cases are straightforward. The SV sampling framework of Kastner and Frühwirth-Schnatter (2014) represents the likelihood for $h_t$ on the log-scale, and approximates the ensuing $\log \chi^2_1$ distribution for the errors via a known discrete mixture of Gaussian distributions. In particular, let $\tilde{y}_t = \log(\omega_t^2 + c)$, where $c$ is a small offset to avoid numerical issues. Conditional on the mixture component indicators $s_t$, the likelihood is $\tilde{y}_t \overset{indep}{\sim} N(h_t + m_{s_t}, v_{s_t})$, where $m_i$ and $v_i$, $i = 1, \ldots, 10$, are the pre-specified mean and variance components of the 10-component Gaussian mixture provided in Omori et al. (2007). The evolution equation is $h_{t+1} = \mu + \phi(h_t - \mu) + \eta_t$ with initialization $h_1 = \mu + \eta_0$ and innovations $[\eta_t | \xi_t] \overset{indep}{\sim} N(0, \xi_t^{-1})$ for $[\xi_t] \overset{iid}{\sim} \mathrm{PG}(1, 0)$.

To sample $h = (h_1, \ldots, h_T)$ jointly, we directly compute the posterior distribution of $h$ and exploit the tridiagonal structure of the resulting posterior precision matrix. In particular, we equivalently have $\tilde{y} \sim N(m + \tilde{h} + \tilde{\mu}, \Sigma_v)$ and $D_\phi \tilde{h} \sim N(0, \Sigma_\xi)$, where $m = (m_{s_1}, \ldots, m_{s_T})'$, $\tilde{h} = (h_1 - \mu, \ldots, h_T - \mu)'$, $\tilde{\mu} = (\mu, (1 - \phi)\mu, \ldots, (1 - \phi)\mu)'$, $\Sigma_v = \mathrm{diag}(\{v_{s_t}\}_{t=1}^T)$, $\Sigma_\xi = \mathrm{diag}(\{\xi_t^{-1}\}_{t=1}^T)$, and $D_\phi$ is a lower triangular matrix with ones on the diagonal, $-\phi$ on the first off-diagonal, and zeros elsewhere. We sample from the posterior distribution of $h$ by sampling from the posterior distribution of $\tilde{h}$ and setting $h = \tilde{h} + \mu 1$ for $1$ a $T$-dimensional vector of ones. The required posterior distribution is $\tilde{h} \sim N(Q_{\tilde{h}}^{-1} \ell_{\tilde{h}}, Q_{\tilde{h}}^{-1})$, where $Q_{\tilde{h}} = \Sigma_v^{-1} + D_\phi' \Sigma_\xi^{-1} D_\phi$ is a tridiagonal symmetric matrix with diagonal elements $d_0(Q_{\tilde{h}})$ and first off-diagonal elements $d_1(Q_{\tilde{h}})$ defined as

$$d_0(Q_{\tilde{h}}) = \left[ (v_{s_1}^{-1} + \xi_1 + \phi^2 \xi_2),\ (v_{s_2}^{-1} + \xi_2 + \phi^2 \xi_3),\ \ldots,\ (v_{s_{T-1}}^{-1} + \xi_{T-1} + \phi^2 \xi_T),\ (v_{s_T}^{-1} + \xi_T) \right],$$

$$d_1(Q_{\tilde{h}}) = \left[ (-\phi \xi_2),\ (-\phi \xi_3),\ \ldots,\ (-\phi \xi_T) \right],$$

and

$$\ell_{\tilde{h}} = \Sigma_v^{-1} (\tilde{y} - m - \tilde{\mu}) = \left[ \frac{\tilde{y}_1 - m_{s_1} - \mu}{v_{s_1}},\ \frac{\tilde{y}_2 - m_{s_2} - (1 - \phi)\mu}{v_{s_2}},\ \ldots,\ \frac{\tilde{y}_T - m_{s_T} - (1 - \phi)\mu}{v_{s_T}} \right]'.$$

Here $d_1(Q_{\tilde{h}})$ has $T - 1$ entries, one for each adjacent pair of time points. Drawing from this posterior distribution is straightforward and efficient, using band back-substitution described in Kastner and Frühwirth-Schnatter (2014): (1) compute the Cholesky decomposition $Q_{\tilde{h}} = L L'$, where $L$ is lower triangular; (2) solve $L a = \ell_{\tilde{h}}$ for $a$; and (3) solve $L' \tilde{h} = a + e$ for $\tilde{h}$, where $e \sim N(0, I_T)$.

Conditional on the log-volatilities $\{h_t\}$, we sample the AR(1) evolution parameters: the log-innovation precisions $\{\xi_t\}$, the autoregressive coefficient $\phi$, and the unconditional mean $\mu$. The precisions are distributed $[\xi_t | \eta_t] \sim \mathrm{PG}(1, \eta_t)$ for $\eta_t = h_{t+1} - \mu - \phi(h_t - \mu)$, which we sample using the rpg() function in the R package BayesLogit (Polson et al., 2013). The Pólya-Gamma sampler is efficient: using only exponential and inverse-Gaussian draws, Polson et al.
(2013) construct an accept-reject sampler for which the probability of acceptance is uniformly bounded below at 0.99919, which does not require any tuning. Next, we assume the prior $[(\phi + 1)/2] \sim \mathrm{Beta}(a_\phi, b_\phi)$, which restricts $|\phi| < 1$ for stationarity, and sample from the full conditional distribution of $\phi$ using the slice sampler of Neal (2003). We select $a_\phi = 10$ and $b_\phi = 2$, which places most of the mass for the density of $\phi$ in $(0, 1)$ with a prior mean of 2/3 and a prior mode of 4/5 to reflect the likely presence of persistent volatility clustering. The prior for the global scale parameter is $\tau \sim C^+(0, \sigma/\sqrt{T})$, which implies $\mu = \log(\tau^2)$ is $[\mu | \sigma, \xi_\mu] \sim N(\log(\sigma^2/T), \xi_\mu^{-1})$ with $\xi_\mu \sim \mathrm{PG}(1, 0)$. Including the initialization $h_1 \sim N(\mu, \xi_0^{-1})$ with $\xi_0 \sim \mathrm{PG}(1, 0)$, the posterior distribution for $\mu$ is $\mu \sim N(Q_\mu^{-1} \ell_\mu, Q_\mu^{-1})$ with

$$Q_\mu = \xi_\mu + \xi_0 + (1 - \phi)^2 \sum_{t=1}^{T-1} \xi_t \quad \text{and} \quad \ell_\mu = \xi_\mu \log(\sigma^2/T) + \xi_0 h_1 + (1 - \phi) \sum_{t=1}^{T-1} \xi_t (h_{t+1} - \phi h_t).$$

Sampling $\xi_\mu$ and $\xi_0$ follows the Pólya-Gamma sampling scheme above.

Finally, we sample the discrete mixture component indicators $s_t$. The discrete mixture probabilities are straightforward to compute: the prior mixture probabilities are the mixing proportions given by Omori et al. (2007) and the likelihood is $\tilde{y}_t \overset{indep}{\sim} N(h_t + m_{s_t}, v_{s_t})$; see Kastner and Frühwirth-Schnatter (2014) for details.

3.6 Conclusions

Dynamic shrinkage processes provide a computationally convenient and widely applicable mechanism for incorporating adaptive shrinkage and regularization into existing models. By extending a broad class of global-local shrinkage priors to the dynamic setting, the resulting processes inherit the desirable shrinkage behavior, but with greater time-localization. The success of dynamic shrinkage processes suggests that other priors may benefit from log-scale or other appropriate representations, with or without additional dependence modeling.
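The computational convenience noted above rests in part on the linear-time joint draw of the log-volatilities in Section 3.5.1: a Cholesky factorization of the tridiagonal precision matrix, followed by one forward and one backward substitution. The following Python sketch illustrates that three-step draw under the assumption that the precision matrix is supplied through its main diagonal and first off-diagonal. The function names are hypothetical, and this is an illustration only, not the implementation used for the reported results.

```python
import math
import random


def sample_tridiagonal_gaussian(d0, d1, ell, noise=None, rng=random):
    """Draw h ~ N(Q^{-1} ell, Q^{-1}) for a symmetric tridiagonal precision Q.

    d0: main diagonal of Q (length T); d1: first off-diagonal (length T - 1).
    noise: optional standard normal draws (length T); drawn from rng if None.
    """
    T = len(d0)
    if noise is None:
        noise = [rng.gauss(0.0, 1.0) for _ in range(T)]

    # (1) Cholesky factor Q = L L', with L lower bidiagonal:
    #     diag[i] = L[i][i] and sub[i] = L[i + 1][i].
    diag = [0.0] * T
    sub = [0.0] * (T - 1)
    diag[0] = math.sqrt(d0[0])
    for i in range(1, T):
        sub[i - 1] = d1[i - 1] / diag[i - 1]
        diag[i] = math.sqrt(d0[i] - sub[i - 1] ** 2)

    # (2) Forward substitution: solve L a = ell.
    a = [0.0] * T
    a[0] = ell[0] / diag[0]
    for i in range(1, T):
        a[i] = (ell[i] - sub[i - 1] * a[i - 1]) / diag[i]

    # (3) Back substitution: solve L' h = a + e.
    h = [0.0] * T
    h[T - 1] = (a[T - 1] + noise[T - 1]) / diag[T - 1]
    for i in range(T - 2, -1, -1):
        h[i] = (a[i] + noise[i] - sub[i] * h[i + 1]) / diag[i]
    return h


def tridiagonal_matvec(d0, d1, x):
    """Compute Q x for the symmetric tridiagonal Q defined by (d0, d1)."""
    T = len(d0)
    out = [d0[i] * x[i] for i in range(T)]
    for i in range(T - 1):
        out[i] += d1[i] * x[i + 1]
        out[i + 1] += d1[i] * x[i]
    return out
```

Every loop runs once over the time index, so the cost of a draw grows linearly in $T$. Passing `noise=[0.0] * T` returns the posterior mean $Q^{-1}\ell$, which gives a simple way to check the solver against `tridiagonal_matvec`.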
As demonstrated in Sections 3.3 and 3.4, dynamic shrinkage processes are particularly appropriate for dynamic linear models, including trend filtering and time-varying parameter regression. In both settings, the dynamic linear models with dynamic horseshoe innovations outperform all competitors in simulated data, and produce reasonable and interpretable results for real data applications. Dynamic shrinkage processes may be useful in other dynamic linear models, such as incorporating seasonality or change points with appropriately-defined (dynamic) shrinkage. Given the exceptional curve-fitting capabilities of the Bayesian trend filtering model (3.8) with dynamic horseshoe innovations (BTF-DHS), a natural extension would be to incorporate the BTF-DHS into more general additive, functional, or longitudinal data models in order to capture irregular or local curve features.

An important extension of the dynamic fundamental factor model of Section 3.4.2 is to incorporate a large number of assets, possibly with residual correlation among stock returns beyond the common factors of FF-5. Building upon Carvalho et al. (2011), a reasonable approach may be to combine a set of known factors, such as the Fama-French factors, with a set of unknown factors to be estimated from the data, where both sets of factor loadings are endowed with dynamic shrinkage processes to provide greater adaptability yet sufficient capability for shrinkage of irrelevant factors.

CHAPTER 4

FUNCTIONAL AUTOREGRESSION FOR SPARSELY SAMPLED DATA

Portions of this chapter were published in Kowal et al. (2017).

4.1 Introduction

We develop a hierarchical Gaussian process model for forecasting and inference of functional time series data. A functional time series is a time-ordered sequence of random functions, $Y_1, \ldots, Y_T$, on some compact index set $\mathcal{T} \subset \mathbb{R}^D$, typically with $D = 1$.
Unlike existing methods, our approach is especially suited for sparsely or irregularly sampled curves, in which the functions $Y_t(\tau)$ are observed at a small number of possibly unequally-spaced points $\tau \in \mathcal{T}$, and for curves sampled with non-negligible measurement error, which occur frequently in financial applications. Applications of functional time series are abundant, including: daily or weekly interest rate curves as a function of time to maturity, such as daily Eurodollar futures contracts (Kargin and Onatski, 2008) and weekly yield curves (Hays et al., 2012; Kowal et al., 2016); yearly sea surface temperature as a function of time-of-year (Besse et al., 2000); yearly mortality and fertility rates as a function of age (Hyndman and Ullah, 2007); daily pollution curves as a function of time-of-day (Damon and Guillas, 2002; Aue et al., 2015); and a vast collection of spatio-temporal applications in which a time-dependent variable is measured as a function of spatial location (e.g., Cressie and Wikle, 2011). The primary goal of functional time series analysis is usually forecasting $\{Y_t\}$, but we are also interested in performing inference and obtaining an interpretable representation of the time evolution of $\{Y_t\}$.

The most prevalent model for functional time series data is the functional autoregressive model of order 1, written FAR(1):

$$Y_t - \mu = \Psi(Y_{t-1} - \mu) + \epsilon_t, \tag{4.1}$$

where $Y_t \in L^2(\mathcal{T})$, $\Psi$ is a bounded linear operator on $L^2(\mathcal{T})$, $\epsilon_t \in L^2(\mathcal{T})$ is a sequence of independent mean zero random innovation functions with $E\|\epsilon_t\|^2 < \infty$, and $\mu$ is the mean of $\{Y_t\}$ under stationarity. The FAR(1) model, developed by Bosq (2000), is an extension of two highly successful models: the functional linear model for function-on-function regression and the vector autoregressive model for multivariate time series, and has been successfully applied in a variety of applications.
Importantly, the FAR(1) model provides a mechanism for modeling the evolution of $\{Y_t\}$ jointly over the entirety of the domain $\mathcal{T}$. More generally, (4.1) can be extended for multiple lags to the FAR($p$) model: $Y_t - \mu = \sum_{\ell=1}^p \Psi_\ell(Y_{t-\ell} - \mu) + \epsilon_t$.

Existing approaches for estimating the FAR($p$) model typically use an eigendecomposition of the empirical (contemporaneous and lagged) covariance operators (Damon and Guillas, 2002, 2005; Horváth and Kokoszka, 2012; Kokoszka, 2012) or kernel-based procedures for modeling the conditional expectation (Besse et al., 2000). A related approach is to estimate a multivariate time series model for the functional principal component (FPC) scores of the observed data (Aue et al., 2015). Extensions of the FAR(1) model for nonstationary functional time series are available, such as the time-dependent FAR kernels proposed in Chen and Li (2015).

In general, existing methods for FAR($p$) are designed for functional data observed on dense grids without measurement error, and typically require pre-smoothing discretized functional observations. However, such procedures may exhibit erratic behavior for sparse designs and are inappropriate in such settings. More generally, under an FAR($p$) model that includes measurement error and discretization of the functional observations, we prove that the two most common approaches for functional data analysis (estimators that are linear in the FPC scores or the pre-smoothed observations) produce predictions that are inadmissible (in a decision theory sense). Indeed, the presence of measurement error fundamentally alters the behavior of the observable process: if an FAR process is observed with measurement error, then the observable process is no longer an FAR process, but rather a functional autoregressive moving average process (see Proposition 4.1).
Even under dense designs, existing methods produce poor estimates of the FAR operator $\Psi$ (Didericksen et al., 2012), which inhibits interpretability of the time evolution of $\{Y_t\}$, and do not provide finite-sample inference. We propose new methodology that simultaneously addresses all of these challenges.

We propose a general two-level hierarchy for modeling functional time series: an observation equation addresses measurement error and discretization of the functional data, while an evolution equation defines a process model for the underlying functional time series. The latent process is dynamically modeled as an FAR($p$). We parsimoniously specify the FAR model with mean zero Gaussian process innovations, which are fully specified by covariance functions without parameterizing sample paths. The dynamic innovation process is further specified by a dynamic functional factor model. In contrast with standard approaches for Gaussian processes, this avoids selecting and estimating a parametric covariance function, and allows greater computational stability and efficiency, and broader applicability. Interpolating curves at unsampled locations and forecasting future curves are primary objectives in functional time series modeling; the proposed model produces optimal (best linear) predictions under both sparse and dense designs in the presence of measurement error, even with the Gaussian assumption relaxed. We propose an efficient Gibbs sampling algorithm for estimation, inference, and forecasting. Extensive simulations demonstrate substantial improvements in forecasting performance and recovery of the autoregressive surface over competing methods, especially under sparse designs.

We apply our methodology to model and forecast nominal and real yield curves using daily U.S. data.
For a given currency and level of risk of a debt, the nominal yield curve, $Y_t^N(\tau)$, describes the interest rate at time $t$ as a function of the length of the borrowing period, or time to maturity, $\tau$. Similarly, the real yield curve, $Y_t^R(\tau)$, corresponds to an interest rate that is adjusted for inflation. Both $Y_t^N$ and $Y_t^R$ may be modeled as functional time series. However, real yields are sparsely observed for each time $t$, and only at longer maturities, which is problematic for existing functional time series models. The proposed methods provide a natural hierarchical framework for modeling both nominal yield curves and real yield curves, and in both cases produce highly competitive forecasts.

Bayesian methods for functional time series are limited, with the exception of Laurini (2014) and Kowal et al. (2016). The primary contributions of this article are the following: (i) development of a hierarchical framework for FAR($p$) (Section 4.2), which produces optimal (best linear) predictions under both sparse and dense designs in the presence of measurement error; (ii) a dynamic functional factor model for the innovation covariance, which is nonparametric, computationally convenient, and offers useful generalizations to non-Gaussian distributions (Section 4.3); (iii) a procedure for model averaging over the lag, $p$, within a hierarchical FAR($p$) model (Section 4.4); (iv) comparisons of the proposed methods to existing methods for FAR($p$) using theoretical results (Section 4.5), an extensive simulation study (Section 4.6), and a real data application (Section 4.7); (v) a comparative forecasting study of daily U.S. nominal and real yield curve data (Section 4.7); and (vi) an efficient Gibbs sampling algorithm, which uses common full conditional distributions and existing R software (Appendix C). Details of our Gibbs sampling algorithm and additional theoretical and simulation results are in Appendix C.
4.2 Hierarchical Gaussian Processes for FAR

Let $Y_1, \ldots, Y_T$ be a time-ordered sequence of random functions in $L^2(\mathcal{T})$, where $\mathcal{T} \subset \mathbb{R}^D$ is a compact index set. We focus on $D = 1$ with $\mathcal{T} = [0, 1]$, but the methods can be developed more generally. For interpretability and computational convenience, we restrict our attention to the integral operators defined by $\Psi_\ell(Y)(\tau) = \int \psi_\ell(\tau, u) Y(u)\, du$, so the FAR($p$) model is

$$Y_t(\tau) - \mu(\tau) = \sum_{\ell=1}^p \int \psi_\ell(\tau, u) \{Y_{t-\ell}(u) - \mu(u)\}\, du + \epsilon_t(\tau) \quad \forall \tau \in \mathcal{T}. \tag{4.2}$$

Using integral operators, the FAR($p$) model resembles the functional linear model, in which $(Y_t - \mu)$ is regressed on $(Y_{t-1} - \mu), \ldots, (Y_{t-p} - \mu)$. The functional linear model is widely popular in functional data analysis, and has been extensively studied (e.g., Cardot et al., 1999; Ramsay, 2006).

In practice, model (4.2) is incomplete: the functional observations $\{Y_t\}$ are not observed directly, but rather via discrete samples of each curve, and typically with measurement error. Suppose that we observe $y_{i,t} \in \mathbb{R}$ sampled with noise $\nu_{i,t}$ from $Y_t \in L^2(\mathcal{T})$:

$$y_{i,t} = Y_t(\tau_{i,t}) + \nu_{i,t} \tag{4.3}$$

for $i = 1, \ldots, m_t$, where $\tau_{1,t}, \ldots, \tau_{m_t,t}$ are the observation points of $Y_t$ and $\nu_{i,t}$ is a mean zero measurement error with finite variance. Typically for functional data, $m_t$ will be large and $\mathcal{T}_t = \{\tau_{1,t}, \ldots, \tau_{m_t,t}\}$ will be dense in $\mathcal{T}$. However, for our procedures, we allow $m_t$ to be small for some (or all) $t$, with observation points $\mathcal{T}_o \equiv \cup_t \mathcal{T}_t$ dense or sparse in $\mathcal{T}$.

Combining (4.3) with (4.2) for $p = 1$ and defining $\mu_t \equiv Y_t - \mu$, we obtain the two-level hierarchical model

$$\begin{cases} y_{i,t} = \mu(\tau_{i,t}) + \mu_t(\tau_{i,t}) + \nu_{i,t}, & i = 1, \ldots, m_t, \\ \mu_t(\tau) = \int \psi(\tau, u) \mu_{t-1}(u)\, du + \epsilon_t(\tau), & \forall \tau \in \mathcal{T} \end{cases} \tag{4.4}$$

for $t = 2, \ldots, T$, where we assume that $\{\nu_{i,t}\}$ and $\{\epsilon_t\}$ are mutually independent sequences. The measurement error is a nontrivial component of model (4.4), which we demonstrate in the following proposition:

Proposition 4.1.
Let $Y_t - \mu = \sum_{\ell=1}^p \Psi_\ell(Y_{t-\ell} - \mu) + \epsilon_t$, and suppose that we observe $y_t = Y_t + \nu_t$, where $\{\epsilon_t\}$ and $\{\nu_t\}$ are independent white noise processes. Then the observable process $\{y_t\}$ follows a functional autoregressive moving average (FARMA) process of order $(p, p)$.

We define a FARMA process and prove Proposition 4.1 in Section C.4.1 of Appendix C. The implication of Proposition 4.1 is that, if the true model for $Y_t$ is FAR($p$), yet $Y_t$ is observed with error, then the FAR($p$) model for the observables is inappropriate. As a result, estimation of $\Psi_\ell$ will be inefficient and forecasting will deteriorate, due to both increased estimation error of $\Psi_\ell$ and model misspecification. By comparison, the hierarchical model decomposes the observed data into a functional (autoregressive) process and measurement error, and in doing so circumvents the model misspecification issues implied by Proposition 4.1.

We model the random functions $\mu$, $\psi$, and $\{\epsilon_t\}$ as Gaussian processes: $\mu \sim \mathcal{GP}(0, K_\mu)$, $\psi \sim \mathcal{GP}(0, K_\psi)$, and $\epsilon_t \overset{indep}{\sim} \mathcal{GP}(0, K_\epsilon)$, where the notation $\mathcal{GP}(m, K)$ denotes a Gaussian process with mean function $m$ and covariance function $K$. Gaussian processes have a long history in machine learning (Rasmussen and Williams, 2006) and spatial statistics (Cressie and Wikle, 2011), and have seen increased application in functional data analysis, especially for hierarchical modeling (Behseta et al., 2005; Kaufman and Sain, 2010; Shi and Choi, 2011; Earls and Hooker, 2014). The conditional distribution of $\mu_t = Y_t - \mu$ is $[\mu_t | \mu_{t-1}, \psi, K_\epsilon] \sim \mathcal{GP}(\int \psi(\cdot, u) \mu_{t-1}(u)\, du, K_\epsilon)$, which models the evolution of $\mu_t$ and serves as the prior distribution for the observation level of (4.4). Notably, the model only requires conditionally Gaussian processes, and therefore may accommodate more general distributional assumptions, such as scale-mixtures of Gaussian distributions and stochastic volatility.
Moreover, the posterior expectations derived from the hierarchical Gaussian process model are best linear predictors, and therefore are optimal among linear predictors for interpolation and forecasting of $Y_t$, even for non-Gaussian distributions (see Section 4.5). We assume $\nu_{i,t} \overset{iid}{\sim} N(0, \sigma_\nu^2)$ for the measurement errors; priors for $\sigma_\nu^2$ and the parameters associated with $K_\mu$, $K_\epsilon$, and $K_\psi$ will be discussed later.

4.2.1 Dynamic Linear Models for FAR(p)

For practical implementation of model (4.4), we must select a finite set of evaluation points, $\mathcal{T}_e \equiv \{\tau_1, \ldots, \tau_M\} \subset \mathcal{T}$, at which we wish to estimate, forecast, or perform inference on the random functions, in particular $\mu_t = Y_t - \mu$. Naturally, we assume that $\mathcal{T}_t \subseteq \mathcal{T}_e$ for all $t$, but this assumption may be relaxed. Notably, $\mathcal{T}_e$ provides a convenient structure for forecasting and inference of $y_{i,t}$ and $Y_t(\tau_{i,t})$ at the observation points $\tau_{i,t} \in \mathcal{T}_t$, as well as interpolation of $Y_t$ at any unobserved points, $\tau^* \in \mathcal{T}_e \setminus \mathcal{T}_o$. By definition, for any Gaussian process $x \sim \mathcal{GP}(m, K)$ defined on $\mathcal{T}$, we have $\boldsymbol{x} \sim N(\boldsymbol{m}, \boldsymbol{K})$, where $\boldsymbol{x} = (x(\tau_1), \ldots, x(\tau_M))'$, $\boldsymbol{m} = (m(\tau_1), \ldots, m(\tau_M))'$, and $\boldsymbol{K} = \{K(\tau_i, \tau_k)\}_{i,k=1}^M$. This result is particularly useful for constructing an estimation procedure and deriving the optimality results of Section 4.5.

By selecting $M$ large and $\mathcal{T}_e$ dense in $\mathcal{T}$, we can accurately approximate the integral in (4.4) using quadrature methods:

$$\int \psi(\tau, u) \mu_{t-1}(u)\, du \approx (\psi(\tau, \tau_1), \ldots, \psi(\tau, \tau_M))\, \boldsymbol{Q} \boldsymbol{\mu}_{t-1}, \tag{4.5}$$

where $\boldsymbol{Q}$ is a known quadrature weight matrix and $\boldsymbol{\mu}_{t-1} = (\mu_{t-1}(\tau_1), \ldots, \mu_{t-1}(\tau_M))'$. The approximation in (4.5) is important for computational tractability in estimation of both $\mu_t$ and $\psi$. Practical implementations of functional data methods require discretization or finite approximations; the quadrature approximation in (4.5) is a natural approach, and does not impose restrictive assumptions on the functional forms of $\psi$ and $\mu_{t-1}$.
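In addition, our simulation analysis suggests that the quadrature approximation does not noticeably inhibit estimation or forecasting, especially relative to existing FAR methods. In practice, the trapezoidal rule for computing $\boldsymbol{Q}$ works well, and for simulated data $M = 20$ is sufficiently large. We include a sensitivity analysis in Appendix C to assess the effects of $M$ on the approximation error in (4.5), which supports this choice of $M$.

Assuming $\mathcal{T}_o \subseteq \mathcal{T}_e$, let $\boldsymbol{Z}_t$ be the $m_t \times M$ incidence matrix that identifies the observation points observed at time $t$, i.e., $(\tau_{1,t}, \ldots, \tau_{m_t,t})' = \boldsymbol{Z}_t (\tau_1, \ldots, \tau_M)'$. We can write the hierarchical model (4.4) as a dynamic linear model (DLM; West and Harrison, 1997) in $\boldsymbol{\mu}_t$:

$$\begin{cases} \boldsymbol{y}_t = \boldsymbol{Z}_t \boldsymbol{\mu} + \boldsymbol{Z}_t \boldsymbol{\mu}_t + \boldsymbol{\nu}_t, & [\boldsymbol{\nu}_t | \sigma_\nu^2] \overset{indep}{\sim} N(\boldsymbol{0}, \sigma_\nu^2 \boldsymbol{I}_{m_t}) \text{ for } t = 1, \ldots, T, \\ \boldsymbol{\mu}_t = \boldsymbol{\Psi} \boldsymbol{Q} \boldsymbol{\mu}_{t-1} + \boldsymbol{\epsilon}_t, & [\boldsymbol{\epsilon}_t | \boldsymbol{K}_\epsilon] \overset{indep}{\sim} N(\boldsymbol{0}, \boldsymbol{K}_\epsilon) \text{ for } t = 2, \ldots, T, \\ \boldsymbol{\mu}_1 \sim N(\boldsymbol{0}, \boldsymbol{K}_\epsilon), \end{cases} \tag{4.6}$$

where $\boldsymbol{y}_t = (y_{1,t}, \ldots, y_{m_t,t})'$, $\boldsymbol{\mu} = (\mu(\tau_1), \ldots, \mu(\tau_M))'$, $\boldsymbol{\Psi} = \{\psi(\tau_i, \tau_k)\}_{i,k=1}^M$, and $\boldsymbol{K}_\epsilon = \{K_\epsilon(\tau_i, \tau_k)\}_{i,k=1}^M$. Model (4.6) can be extended for multiple lags to the FAR($p$) model by replacing the second level with $\boldsymbol{\mu}_t = \sum_{\ell=1}^p \boldsymbol{\Psi}_\ell \boldsymbol{Q} \boldsymbol{\mu}_{t-\ell} + \boldsymbol{\epsilon}_t$ for $\boldsymbol{\Psi}_\ell = \{\psi_\ell(\tau_i, \tau_k)\}_{i,k=1}^M$. The DLM formulation of the FAR($p$) is useful for MCMC sampling, since efficient samplers exist for the vector-valued state variables, $\{\boldsymbol{\mu}_t\}$ (e.g., Durbin and Koopman, 2002). The proposed Gibbs sampling algorithm for model (4.6) (see Appendix C) is a moderate extension of traditional DLM samplers, and iteratively samples the state vectors $\{\boldsymbol{\mu}_t\}$, the measurement error variance $\sigma_\nu^2$, the innovation covariance $\boldsymbol{K}_\epsilon$, and the unknown evolution matrix $\boldsymbol{\Psi}$. The DLM also facilitates non-Bayesian parameter estimation and forecasting, such as an EM algorithm for the latent state variables $\{\boldsymbol{\mu}_t\}$ with the parameters $\{\sigma_\nu^2, \boldsymbol{K}_\epsilon, \boldsymbol{\Psi}\}$ (e.g., Cressie and Wikle, 2011).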
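As a rough illustration of the quadrature step in (4.5) and the evolution level of (4.6), the sketch below builds trapezoidal weights on a possibly unequally spaced grid and applies one noiseless evolution step. It assumes that the quadrature weight matrix is the diagonal matrix of trapezoid weights, which is one common choice and is not confirmed by the text. The kernel $\psi(\tau, u) = \tau u$, the input curve $\mu_{t-1}(u) = u$, and the function names are hypothetical and chosen only so that the exact integral, $\tau/3$, is known.

```python
def trapezoid_weights(tau):
    """Trapezoid-rule weights for an ordered, possibly unequally spaced grid."""
    M = len(tau)
    w = [0.0] * M
    for k in range(M - 1):
        half = 0.5 * (tau[k + 1] - tau[k])
        w[k] += half        # each interval gives half its length
        w[k + 1] += half    # to both of its endpoints
    return w


def evolve(psi, tau, mu_prev):
    """One noiseless step of the discretized evolution, Psi Q mu_{t-1}.

    psi: kernel function psi(tau, u); tau: evaluation points;
    mu_prev: values of mu_{t-1} at the evaluation points.
    Q is taken to be diag(trapezoid weights), an assumption of this sketch.
    """
    w = trapezoid_weights(tau)
    return [sum(psi(t, u) * wk * m for u, wk, m in zip(tau, w, mu_prev))
            for t in tau]


# Hypothetical example with M = 20 equally spaced points on [0, 1].
tau = [k / 19 for k in range(20)]
step = evolve(lambda t, u: t * u, tau, tau)   # mu_{t-1}(u) = u
exact = [t / 3 for t in tau]                  # integral of (t * u) * u over [0, 1]
max_abs_error = max(abs(s - e) for s, e in zip(step, exact))
```

For this smooth example, `max_abs_error` is below $10^{-3}$ with $M = 20$, which is consistent with the observation above that a moderate $M$ can suffice. The same weights apply unchanged to unequally spaced evaluation points.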
The connection between the hierarchical FAR model (4.4) and the DLM (4.6) is further illuminated by considering the autocovariance properties of the respective models. Recalling $\mu_t(\tau) = Y_t(\tau) - \mu(\tau)$, let $C_\ell(\tau_1, \tau_2) = E[\mu_t(\tau_1) \mu_{t-\ell}(\tau_2)]$ be the lag-$\ell$ autocovariance function of $\{Y_t\}$, which is time-invariant under stationarity of $\{Y_t\}$. Under model (4.4) and assuming stationarity of $\{Y_t\}$, the lag-1 autocovariance function is equivalently

$$C_1(\tau_1, \tau_2) = E[\mu_t(\tau_1) \mu_{t-1}(\tau_2)] = E\left[\left\{\int \psi(\tau_1, u) \mu_{t-1}(u)\, du + \epsilon_t(\tau_1)\right\} \mu_{t-1}(\tau_2)\right] = \int \psi(\tau_1, u) C_0(u, \tau_2)\, du.$$

For $\ell \ge 1$, we have the more general recursion $C_\ell(\tau_1, \tau_2) = \int \psi(\tau_1, u) C_{\ell-1}(u, \tau_2)\, du$, from which it is clear that each $C_\ell$ is completely determined by the pair $(\psi, C_0)$.

Now let $\boldsymbol{C}_\ell = E[\boldsymbol{\mu}_t \boldsymbol{\mu}_{t-\ell}']$ be the lag-$\ell$ autocovariance matrix for the vector-valued time series $\{\boldsymbol{\mu}_t\}$ in (4.6). Under stationarity of $\{\boldsymbol{\mu}_t\}$, the lag-1 autocovariance matrix of $\boldsymbol{\mu}_t$ is $\boldsymbol{C}_1 = E[\boldsymbol{\mu}_t \boldsymbol{\mu}_{t-1}'] = E[\{\boldsymbol{\Psi} \boldsymbol{Q} \boldsymbol{\mu}_{t-1} + \boldsymbol{\epsilon}_t\} \boldsymbol{\mu}_{t-1}'] = \boldsymbol{\Psi} \boldsymbol{Q} \boldsymbol{C}_0$. Notably, the relationship $\boldsymbol{C}_1 = \boldsymbol{\Psi} \boldsymbol{Q} \boldsymbol{C}_0$ is an approximation to the continuous version, $C_1(\tau_1, \tau_2) = \int \psi(\tau_1, u) C_0(u, \tau_2)\, du$, using the same quadrature approximation as in (4.5). More generally, the matrix recursion $\boldsymbol{C}_\ell = \boldsymbol{\Psi} \boldsymbol{Q} \boldsymbol{C}_{\ell-1}$ is a quadrature-based approximation to the continuous recursion, $C_\ell(\tau_1, \tau_2) = \int \psi(\tau_1, u) C_{\ell-1}(u, \tau_2)\, du$ for $\ell \ge 1$. Therefore, the evolution matrix $\boldsymbol{\Psi} \boldsymbol{Q}$ in the DLM (4.6) induces a discrete approximation to the autocovariance structure in the hierarchical FAR model (4.4).

The evolution equation of (4.6) resembles a VAR(1) on $\boldsymbol{\mu}_t = (\mu_t(\tau_1), \ldots, \mu_t(\tau_M))'$, but differs from a standard VAR on $\boldsymbol{y}_t$ for a few critical reasons. First, fitting a VAR to $\boldsymbol{y}_t$ is only well-defined if both the dimension $m_t$ and the observation points $\mathcal{T}_t$ are fixed over time. If this does not hold, then imputation is necessary. Our procedure imputes automatically and optimally using the conditional mean function and the conditional covariance function of the corresponding Gaussian process.
Second, the components of $\boldsymbol{y}_t$ are likely highly correlated due to the functional nature of the observations. Strong collinearity in VARs can cause overfitting and adversely affect forecasting and inference. In our model, the kernel function $\psi$ is regularized using a smoothness prior (see Section 4.4), which mitigates the adverse effects of collinearity on estimation of $\psi$. The smoothness prior on $\psi$ is a nonstandard regularization technique for VARs, but is appropriate in this setting. Finally, the quadrature matrix, $\boldsymbol{Q}$, is absorbed into the VAR coefficient matrix $\boldsymbol{\Psi} \boldsymbol{Q}$, and reweights the vector $\boldsymbol{\mu}_{t-1}$ using information from the evaluation points $\mathcal{T}_e$. This reweighting incorporates not only the vector values $\boldsymbol{\mu}_t$, but also the information that the components of $\boldsymbol{\mu}_t$ correspond to ordered elements of $\mathcal{T}_e$, which need not be equally spaced. The simulations of Section 4.6 demonstrate the substantial improvements in forecasting of our procedure relative to a VAR on $\boldsymbol{y}_t$.

4.3 A Dynamic Functional Factor Model for the Innovation Process

The standard approach for Gaussian process models is to select a parametric covariance function that only depends on a few parameters, and then estimate those parameters using either fully Bayesian methods or empirical Bayes (Rasmussen and Williams, 2006). The choice of the covariance function determines the properties of the sample trajectories, such as smoothness and periodicity, but notably does not imply a parametric form for the sample trajectories. Indeed, the FAR(1) model (4.6) may be estimated using these standard approaches; we provide one implementation in Section 4.6.

However, there are substantial computational limitations that accompany standard parametric covariance functions. Even when the covariance function is known up to some parameters $\rho$, in general we cannot directly sample from the full conditional posterior distribution for $\rho$. As a result, posterior sampling for $\rho$ can be inefficient.
Gaussian processes also require computation of the $M \times M$ innovation covariance matrix $\boldsymbol{K}_\epsilon$, which must be inverted, both for evaluating the conditional likelihood of $\rho$ and for sampling $\{\boldsymbol{\mu}_t\}$ and $\psi$. Most common choices for parametric covariance functions do not offer any simplifying structure for computing this inverse, which may be computationally inefficient and unstable. In addition, extensions for time-dependent covariance functions or non-Gaussian distributions are not readily available, and further increase the difficulties with posterior sampling.

We propose a low-rank, fully nonparametric approach for modeling the innovation covariance function. Using the functional dynamic linear model (FDLM) of Kowal et al. (2016), we estimate the unknown covariance function using a functional factor model, which does not require specification of a parametric form for the covariance function. This method avoids the need for inversion of the full $M \times M$ covariance matrix, and is more computationally stable and efficient. The integration of the FDLM into (4.6) retains the fully Bayesian hierarchical structure, and permits joint inference for all parameters via an efficient MCMC sampling algorithm. A functional factor model is most appropriate because $\epsilon_t$ is a Gaussian process with covariance function $K_\epsilon$, so $K_\epsilon$ must be well-defined on $\mathcal{T} \times \mathcal{T}$. Notably, the FDLM offers convenient generalizations for stochastic volatility models (Kim et al., 1998) and more robust models using scale-mixtures of Gaussian distributions (Fernandez and Steel, 2000).

The FDLM decomposes the innovations $\epsilon_t$ into factor loading curves (FLCs), $\phi_j \in L^2(\mathcal{T})$, and time-dependent factors, $e_{j,t} \in \mathbb{R}$, for $j = 1, \ldots, J$:

$$\epsilon_t(\tau) = \sum_{j=1}^J e_{j,t} \phi_j(\tau) + \eta_t(\tau) \quad \forall \tau \in \mathcal{T}, \tag{4.7}$$

where $J$ is the number of factors and $\{\eta_t\}$ is the mean zero approximation error with $\eta_t \overset{iid}{\sim} \mathcal{GP}(0, K_\eta)$, where $K_\eta(\tau, u) = \sigma_\eta^2 1(\tau = u)$ and $1(\cdot)$ is the indicator function.
We model each FLC $\phi_j$ as a smooth function admitting the basis expansion $\phi_j(\tau) = \boldsymbol{b}_\phi'(\tau) \boldsymbol{\xi}_j$, where $\boldsymbol{b}_\phi$ is a $J_\phi$-dimensional vector of known basis functions and $\boldsymbol{\xi}_j$ is an unknown vector of coefficients. For superior MCMC performance, we prefer the low-rank thin plate spline basis for $\boldsymbol{b}_\phi$ (e.g., Crainiceanu et al., 2005) with knot locations selected using the quantiles of the observation points, $\mathcal{T}_o$. We place a smoothness prior on each $\boldsymbol{\xi}_j$, which is expressed via a conditionally conjugate Gaussian distribution and is convenient for efficient posterior sampling (see the Appendix). The smoothness assumption typically produces more interpretable FLCs $\{\phi_j\}$ and can improve estimation for unobserved points $\tau^* \notin \mathcal{T}_o$. For the factors $\boldsymbol{e}_t = (e_{1,t}, \ldots, e_{J,t})'$, we assume $[\boldsymbol{e}_t | \boldsymbol{\Sigma}_e] \overset{indep}{\sim} N(\boldsymbol{0}, \boldsymbol{\Sigma}_e)$, with $\boldsymbol{\Sigma}_e = \mathrm{diag}(\{\sigma_j^2\}_{j=1}^J)$ for simplicity. By comparison, the factors in Kowal et al. (2016) are time-dependent; we assume independence to obtain a special case of the FDLM in which the implied innovation process $\{\epsilon_t\}$ is an independent sequence, which also improves computational efficiency of the FDLM sampling algorithm. Importantly, we obtain a nonparametric, low-rank approximation to the innovation covariance, $K_\epsilon$, with useful computational simplifications.

For identifiability, we order the factors according to variability of $\epsilon_t$ explained, $\sigma_1^2 > \sigma_2^2 > \cdots > \sigma_J^2 > 0$, and require orthonormality of the FLCs. It is computationally convenient to enforce the discrete orthonormality constraint $\boldsymbol{\Phi}' \boldsymbol{\Phi} = \boldsymbol{I}_J$, where $\boldsymbol{\Phi} = \boldsymbol{B}_\phi \boldsymbol{\Xi}$ is the $M \times J$ matrix of FLCs evaluated at $\mathcal{T}_e$, $\boldsymbol{B}_\phi = (\boldsymbol{b}_\phi(\tau_1), \ldots, \boldsymbol{b}_\phi(\tau_M))'$ is the $M \times J_\phi$ matrix of basis functions evaluated at $\mathcal{T}_e$, and $\boldsymbol{\Xi} = (\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_J)$ is the $J_\phi \times J$ matrix of unknown FLC basis coefficients. The implied covariance matrix for $\boldsymbol{\epsilon}_t = (\epsilon_t(\tau_1), \ldots, \epsilon_t(\tau_M))'$ under (4.7) is $\boldsymbol{K}_\epsilon = \boldsymbol{\Phi} \boldsymbol{\Sigma}_e \boldsymbol{\Phi}' + \sigma_\eta^2 \boldsymbol{I}_M$, conditional on $\{\phi_j, \sigma_j^2\}$ and $\sigma_\eta^2$.
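Because the columns of $\boldsymbol{\Phi}$ are orthonormal, a covariance of this low-rank-plus-diagonal form can be inverted in closed form by the Woodbury identity, using only $J$ scalar ratios. The following sketch checks this numerically on a small hypothetical example. The function names, the loading matrix, and the variance values are illustrative assumptions and are not taken from the text or from any fitted model.

```python
def factor_covariance(Phi, sigma2, sigma2_eta):
    """K = Phi diag(sigma2) Phi' + sigma2_eta * I for an M x J matrix Phi."""
    M, J = len(Phi), len(sigma2)
    return [[sum(Phi[i][j] * sigma2[j] * Phi[k][j] for j in range(J))
             + (sigma2_eta if i == k else 0.0)
             for k in range(M)]
            for i in range(M)]


def factor_precision(Phi, sigma2, sigma2_eta):
    """Closed-form inverse of K; valid only if the columns of Phi are orthonormal."""
    M, J = len(Phi), len(sigma2)
    # Diagonal of the J x J shrinkage matrix: sigma_j^2 / (sigma_eta^2 + sigma_j^2).
    shrink = [s / (sigma2_eta + s) for s in sigma2]
    return [[((1.0 if i == k else 0.0)
              - sum(Phi[i][j] * shrink[j] * Phi[k][j] for j in range(J)))
             / sigma2_eta
             for k in range(M)]
            for i in range(M)]


def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


# Hypothetical example: M = 4 evaluation points, J = 2 orthonormal loading vectors.
Phi = [[0.5, 0.5], [0.5, -0.5], [0.5, 0.5], [0.5, -0.5]]
sigma2 = [2.0, 0.5]     # factor variances, ordered sigma_1^2 > sigma_2^2 > 0
sigma2_eta = 0.1        # approximation error variance

K = factor_covariance(Phi, sigma2, sigma2_eta)
K_inv = factor_precision(Phi, sigma2, sigma2_eta)
product = matmul(K, K_inv)
max_error = max(abs(product[i][k] - (1.0 if i == k else 0.0))
                for i in range(4) for k in range(4))
```

Here `max_error` measures how far the product of the covariance and its closed-form inverse is from the identity matrix. The check relies on the orthonormality of the columns of `Phi`; without it, the $J \times J$ inner matrix in the Woodbury identity is no longer diagonal.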
Importantly, the discretized orthonormality constraint offers a substantial simplification for computing the inverse of $\boldsymbol{K}_\epsilon$ using the Woodbury identity:

$$\boldsymbol{K}_\epsilon^{-1} = \sigma_\eta^{-2} \boldsymbol{I}_M - \sigma_\eta^{-2} \boldsymbol{\Phi} \tilde{\boldsymbol{\Sigma}}_e \boldsymbol{\Phi}', \tag{4.8}$$

where $\tilde{\boldsymbol{\Sigma}}_e = \sigma_\eta^{-2} \left(\boldsymbol{\Sigma}_e^{-1} + \sigma_\eta^{-2} \boldsymbol{\Phi}' \boldsymbol{\Phi}\right)^{-1} = \mathrm{diag}\left(\{\sigma_j^2 / (\sigma_\eta^2 + \sigma_j^2)\}_{j=1}^J\right)$. As a result, $\boldsymbol{K}_\epsilon^{-1}$ may be computed without any matrix inversions. By comparison, parametric covariance functions not only fail to offer computational simplifications for $\boldsymbol{K}_\epsilon^{-1}$, but also require additional computations of $\boldsymbol{K}_\epsilon^{-1}$ in the estimation of the covariance function parameters, $\rho$. The FDLM sampling algorithm for the factors $\{e_{j,t}\}$, the FLCs $\{\phi_j\}$, and the variances $\{\sigma_j^2\}$ and $\sigma_\eta^2$ is computationally inexpensive and MCMC efficient. Note that the approximation error is a nontrivial addition to model (4.7): $\eta_t$ is necessary for nondegeneracy of $\boldsymbol{K}_\epsilon$, which is invertible only when $\sigma_\eta^2 > 0$. And while $\sigma_\eta^2 > 0$ implies that the innovations $\epsilon_t$, and therefore $\mu_t$, are not smooth, we find that in practice, the sample paths of $\epsilon_t$ and $\mu_t$ do appear smooth for sufficiently small $\sigma_\eta^2$. Generalizations to non-nugget approximation error variance functions $K_\eta(\tau, u) = \sigma_\eta^2(\tau) 1(\tau = u)$ for $\sigma_\eta^2 : \mathcal{T} \to \mathbb{R}^+$ are available, but may introduce additional model complexity and computational costs.

An important application of the FDLM simplification in (4.8) is given in Theorem 4.2, in which we derive a computationally convenient form for estimating the out-of-sample posterior distribution $[\mu_t(\tau^*) | \{\boldsymbol{y}_r\}_{r=1}^s]$ for $\tau^* \notin \mathcal{T}_e$, which includes as special cases the forecasting distribution ($s < t$), the filtering distribution ($s = t$), and the smoothing distribution ($s > t$).

4.4 Modeling the FAR Kernel

An accurate predictor of $\psi$ is important not only for forecasting and inference, but also for interpreting the time evolution of $\{Y_t\}$. The likelihood for $\psi$ is specified by the evolution equation in model (4.6), which may be extended for multiple lags.
We select a Gaussian process prior for $\psi$, which encourages smoothness of the surface and produces more interpretable results. Using the basis approximation $\psi_\ell(\tau, u) = b_0'(\tau, u)\theta_{\psi_\ell}$, we place a Gaussian prior on $\theta_{\psi_\ell}$, which induces a Gaussian process prior for $\psi_\ell$. A tensor product basis $b_0'(\tau, u) = (b_\psi'(u) \otimes b_\psi'(\tau))$ for $b_\psi$ a $J_\psi$-dimensional vector of B-spline basis functions is computationally efficient in our setting, especially for large $M$. The details are presented in the Appendix. Since $J_\psi < M$, the evolution matrix $\Psi Q$ in (4.6) has $J_\psi^2 < M^2$ unknown parameters, so the evolution equation in the DLM (4.6) has fewer parameters than a standard VAR(1) on $\mu_t$. Notably, the posterior distribution for $\psi_\ell$ depends on $K_\epsilon^{-1}$, which is computationally unstable for many common parametric covariance functions. By comparison, the nonparametric FDLM estimate of $K_\epsilon^{-1}$ in (4.8) is computationally stable, which further stabilizes estimates of $\psi_\ell$.

An important choice in the FAR($p$) model is the maximum lag, $p$: a poor choice of $p$ can produce suboptimal forecasts and reduce MCMC efficiency. A reasonable approach is to compare the DIC or marginal likelihoods for different choices of $p$. However, this requires refitting the model for each choice of $p$, which can be computationally intensive. Similarly, Kokoszka and Reimherr (2013) propose a multistage hypothesis testing procedure based on asymptotic approximations and an FPC decomposition, but this procedure would require modification for the hierarchical Bayesian implementation of (4.6).

Our approach is to select a maximum lag under consideration, $p_{\max}$, and assign each lag $\ell$ a state variable, $s_\ell \in \{0, 1\}$, for $\ell = 1, \ldots, p_{\max}$, to assess whether or not $\psi_\ell$ is included in the model:

\[ \mu_t(\tau) = \sum_{\ell=1}^{p_{\max}} s_\ell \int \psi_\ell(\tau, u)\mu_{t-\ell}(u)\,du + \epsilon_t(\tau), \tag{4.9} \]

which extends Kuo and Mallick (1998) and Korobilis (2013b) to the FAR($p$) setting.
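On a grid of evaluation points, the conditional mean in (4.9) reduces to a sum of weighted matrix-vector products over the included lags. The Python sketch below is illustrative rather than the dissertation's implementation: it assumes trapezoidal quadrature weights for the integral (the approximation (4.5) may use a different rule), and the function names and toy kernels are hypothetical.

```python
def trapezoid_weights(grid):
    """Weights q_k so that sum_k q_k f(grid[k]) approximates the integral of f."""
    M = len(grid)
    q = [0.0] * M
    for k in range(M - 1):
        half = 0.5 * (grid[k + 1] - grid[k])
        q[k] += half
        q[k + 1] += half
    return q

def far_conditional_mean(kernels, states, lagged_mu, grid):
    """Mean of mu_t on the grid:
    sum over lags l of s_l * integral of psi_l(tau, u) mu_{t-l}(u) du.

    kernels[l-1]   : function psi_l(tau, u)
    states[l-1]    : inclusion indicator s_l in {0, 1}
    lagged_mu[l-1] : values of mu_{t-l} on the grid
    """
    q = trapezoid_weights(grid)
    mean = [0.0] * len(grid)
    for psi, s, mu_lag in zip(kernels, states, lagged_mu):
        if s == 0:
            continue  # an excluded lag contributes nothing
        for i, tau in enumerate(grid):
            mean[i] += sum(psi(tau, u) * qk * m
                           for u, qk, m in zip(grid, q, mu_lag))
    return mean

M = 30
grid = [i / (M - 1) for i in range(M)]
linear_tau = lambda tau, u: 0.5 * tau   # toy lag-1 kernel
linear_u = lambda tau, u: 0.5 * u       # toy lag-2 kernel
ones = [1.0] * M                        # toy lagged curves, identically one

mean_lag1_only = far_conditional_mean(
    [linear_tau, linear_u], [1, 0], [ones, ones], grid)
mean_both_lags = far_conditional_mean(
    [linear_tau, linear_u], [1, 1], [ones, ones], grid)
```

With the lagged curves identically one, the lag-1 term integrates to $0.5\tau$ and the lag-2 term to $0.25$, so switching $s_2$ from 0 to 1 shifts the mean curve by a constant.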
By averaging over the states $\{s_\ell\}_{\ell=1}^{p_{\max}}$, the forecasts of model (4.9) are the model-averaged forecasts over the FAR($\ell$) models for $\ell = 1, \ldots, p_{\max}$. Since we restrict $s_\ell \in \{0, 1\}$, rather than strongly shrinking $\psi_\ell$ toward zero, we can substantially improve computational efficiency: at each MCMC iteration, we sample $\{\mu_t\}$ jointly from the FAR($p^*$) extension of the DLM (4.6), where $p^* = \min\{\ell : s_{\ell+1} = \cdots = s_{p_{\max}} = 0\}$ is the largest lag of nonzero autocorrelation.

The joint distribution of the states is $[s_1, s_2, \ldots, s_{p_{\max}}] = [s_1]\prod_{\ell=2}^{p_{\max}}[s_\ell \mid s_{\ell-1}, \ldots, s_1]$, where $[s_\ell \mid s_{\ell-1}, \ldots, s_1]$ is the probability that the lag $\ell$ autocorrelation term is included in the model, given whether the autocorrelation terms of the smaller lags $\ell - 1, \ldots, 1$ are included in the model. We assume that $s_\ell = 0$ implies that $s_k$ is likely also zero for all $k > \ell$, which induces a more parsimonious model. In particular, we use the computationally convenient Markov assumption $[s_\ell \mid s_{\ell-1}, \ldots, s_1] = [s_\ell \mid s_{\ell-1}]$ with a small transition probability $P(s_\ell = 1 \mid s_{\ell-1} = 0) = q_{01}$. The reverse transition probability, $P(s_\ell = 0 \mid s_{\ell-1} = 1) = q_{10}$, encourages smaller models when it is large. By default, we select $q_{01} = 0.01$, $q_{10} = 0.75$, and complete the joint prior distribution of $\{s_\ell\}_{\ell=1}^{p_{\max}}$ with $P(s_1 = 1) = 0.9$; for simulated data, the posterior does not appear to be sensitive to these choices.

4.5 Finite-Dimensional Optimality

The Gaussian assumptions in model (4.6) provide convenient posterior distributions for MCMC sampling and a useful framework for inference, but are not necessary for model (4.2). Suppose we relax the Gaussian assumption to $\epsilon_t \sim SP(0, K_\epsilon)$, where $SP(m, K)$ denotes a second-order stochastic process with mean function $m$ and covariance function $K$. Similarly, let $\nu_{i,t}$ be a mean zero random variable with variance $\sigma_\nu^2$ and let $\mu_1 \equiv 1$.
Given a finite set of evaluation points, $\mathcal{T}_e \subset \mathcal{T}$, model (4.4) implies the distribution-free DLM

\[ \begin{aligned} y_t &= Z_t\mu + Z_t\mu_t + \nu_t, & E[\nu_t \mid \sigma_\nu^2] &= 0, & \mathrm{Cov}[\nu_t \mid \sigma_\nu^2] &= \sigma_\nu^2 I_{m_t}, \\ \mu_t &= \Psi Q\mu_{t-1} + \epsilon_t, & E[\epsilon_t \mid K_\epsilon] &= 0, & \mathrm{Cov}[\epsilon_t \mid K_\epsilon] &= K_\epsilon, \end{aligned} \tag{4.10} \]

under the integral approximation (4.5), where the vectors and matrices are defined as before and $\mu_1 \equiv 1$. Since this holds for any finite set of evaluation points $\mathcal{T}_e \subset \mathcal{T}$, we may consider the DLM (4.10) to be a collection of models indexed by the evaluation points, $\mathcal{T}_e$. The error sequences, $\{\nu_t\}$ and $\{\epsilon_t\}$, are assumed to be uncorrelated, rather than independent. If we additionally assume Gaussianity of $\{\nu_t\}$ and $\{\epsilon_t\}$, then the uncorrelatedness implies independence, and model (4.10) becomes model (4.6). Extensions for the FAR($p$) models are similar. The results below also hold for time-dependent variances for $\nu_t$ and $\epsilon_t$.

Let $d$ be an estimator of $\delta \in L^2(\mathcal{T})$, and consider the squared error loss using the Euclidean norm: $L_e(\delta, d) = (\boldsymbol{\delta} - \boldsymbol{d})'(\boldsymbol{\delta} - \boldsymbol{d})$, where $L_e$ is indexed by the set of evaluation points, $\mathcal{T}_e$, at which $\delta$ and $d$ are evaluated to form the corresponding vectors $\boldsymbol{\delta}$ and $\boldsymbol{d}$. When $\mathcal{T}_e$ is an equally-spaced fine grid on $\mathcal{T}$, the loss function $L_e$ will approximate the usual loss function for functional data, $L_{L^2}(\delta, d) = \int (\delta(u) - d(u))^2\,du$, for most reasonable choices of $\delta$ and $d$ (up to a rescaling by $M = |\mathcal{T}_e|$). In a standard Bayesian analysis, the goal would be to minimize the posterior risk, $E[L_e(\delta, d) \mid \{y_t\}]$, for which the solution is the posterior expectation, $d = E[\delta \mid \{y_t\}]$. Indeed, the estimators discussed below minimize the posterior risk under the Gaussian assumptions of model (4.6). However, by relaxing the distributional assumptions in (4.10) to increase the generality of the model, we no longer have sufficient information to compute posterior distributions or posterior moments. In addition, it is difficult to compare Bayesian and non-Bayesian procedures under the posterior risk, and most procedures for functional time series modeling are non-Bayesian.
Therefore, we consider the overall risk $R_e(\delta, d) = E[L_e(\delta, d)]$, which is the expected value of the posterior risk with respect to the sampling distribution. As with the loss function $L_e$, the risk function $R_e$ is indexed by the evaluation points, $\mathcal{T}_e$; we seek to minimize $R_e$ for any choice of $\mathcal{T}_e$.

Let $D_t = \{y_t, y_{t-1}, \ldots, y_1\} \cup D_0$ be the information available at time $t$, where $D_0$ represents the information prior to time $t = 1$.

Theorem 4.1. For any finite set of evaluation points $\mathcal{T}_e \subset \mathcal{T}$, the unique best linear predictor of the conditional random vector $\delta \sim [\delta \mid Y, \Theta]$, where $\delta, Y \subseteq D_T \cup \{\mu_t(\tau) : \tau \in \mathcal{T}_e, t = 1, \ldots, T\}$ and $\Theta = \{\mu, \sigma_\nu^2, \psi, K_\epsilon\}$, under the risk $R_e$ and conditional on model (4.4) with the integral approximation (4.5), is the conditional expectation $\hat\delta(Y \mid \Theta) \equiv E[\delta \mid Y, \Theta]$ as computed under model (4.6).

The proof of Theorem 4.1 is in the Appendix, and extends fundamental results for vector-valued DLMs. The best linear predictors of Theorem 4.1 equivalently minimize the risk $R(\delta, d) = \sup_{\mathcal{T}_e} R_e(\delta, d)$ among all linear estimators, where the sup is taken over all finite $\mathcal{T}_e \subset \mathcal{T}$. The most useful examples of $[\delta \mid Y, \Theta]$ in Theorem 4.1 are the forecasting distributions $[y_{t+h} \mid D_t, \Theta]$ and $[\mu_{t+h} \mid D_t, \Theta]$ for $h > 0$, the smoothing distributions $[\mu_t \mid D_T, \Theta]$, and the filtering distributions $[\mu_t \mid D_t, \Theta]$, for $t = 1, \ldots, T$. Theorem 4.1 depends on the observation points $\mathcal{T}_o$ only via the assumption that $Z_t$ is known. In general, we assume $\mathcal{T}_o \subseteq \mathcal{T}_e$, so $Z_t$ is an incidence matrix and therefore known. Theorem 4.1 does not require $\mathcal{T}_o$ to become arbitrarily dense in $\mathcal{T}$, and is valid for both sparse and dense designs. For implementation, we compute the relevant expectations within the Gibbs sampling algorithm (see Appendix C), and then average over the Gibbs sample of $\Theta$. Alternatively, an EM algorithm could be used to estimate the relevant expectations (Cressie and Wikle, 2011).

There is no intrinsic reason to restrict the estimators to linearity.
However, several popular competing methods are linear, and therefore are dominated by the conditional expectations computed from model (4.6) whenever the estimators are distinct. More formally:

Corollary. Consider a basis expansion of the observations $y_t \approx B_t'\theta_t$, where $B_t = (b(\tau_{1,t}), \ldots, b(\tau_{m_t,t}))$, $b$ is a known $J$-dimensional vector of basis functions, and $\theta_t$ is the corresponding $J$-dimensional vector of unknown basis coefficients. If the estimator $\hat\theta_t$ of $\theta_t$ is linear in $y_t$, then estimates or forecasts of the form $H\hat\theta_t + h$, conditional on the matrix $H$ and the vector $h$, are inadmissible for all $[\delta \mid Y]$ whenever $H\hat\theta_t + h \neq \hat\delta(Y \mid \Theta)$.

The most important application of Corollary 4.5 is to characterize the inadmissibility of procedures based on FPC scores. In the notation of Corollary 4.5, let $b$ be the FPC basis, which we assume is fixed and known. The components of $\theta_t$ correspond to the FPC scores, defined by $\theta_{j,t} = \int \{Y_t(u) - \mu(u)\}b_j(u)\,du = \int \mu_t(u)b_j(u)\,du$. There are two standard approaches for computing FPC scores: quadrature methods for dense designs absent measurement error, and the PACE procedure of Yao et al. (2005), which uses conditional expectations under a Gaussian assumption and applies more generally. In both cases, the FPC scores are linear in $y_t$, so Corollary 4.5 applies.

Among functional time series methods, the most pertinent procedures are those of Aue et al. (2015) and Hyndman and Ullah (2007). Aue et al. (2015) provide the more general framework, in which they compute the best linear predictors for the FPC scores, and then forecast the FPC scores using multivariate time series methods. For time series methods that are linear in the FPC scores, such forecasts are inadmissible. While Aue et al. (2015) undoubtedly provide a simple yet general framework for forecasting a functional time series, the simulations of Section 4.6 confirm the consequence of inadmissibility on forecasting performance.

Corollary.
Consider the common functional data pre-processing procedure in which the discrete, noisy observations, $y_t$, are replaced by estimated functions evaluated on a fine grid, $\hat y_t$, and then estimates and forecasts are computed using the functional "data" $\hat y_t$. If $\hat y_t$ is linear in $\{y_t\}$, then any estimator or forecast linear in $\{\hat y_t\}$ is inadmissible for all $[\delta \mid Y]$ whenever $\hat y_t \neq \hat\delta(Y \mid \Theta)$.

Typically, $\hat y_t$ is estimated using splines or kernel smoothers, both of which are linear in $y_t$. As an application of Corollary 4.5, the simple forecasting method of fitting a VAR to $\hat y_t$ evaluated on a grid of points, conditional on the VAR coefficient matrix, is inadmissible.

Corollary. The unique best linear predictor of $[\mu_t(\tau^*) \mid D_s]$ for any times $t, s$ and any point $\tau^* \in \mathcal{T}$ is the corresponding expectation under model (4.6).

Model (4.6) achieves the optimality of a kriging estimator for interpolation of any point $\tau^* \in \mathcal{T}$, simply by adding $\tau^*$ to the evaluation set $\mathcal{T}_e$.

In practice, we need not include all such $\tau^*$ in $\mathcal{T}_e$: we can estimate the out-of-sample posterior distribution $[\mu_t(\tau^*) \mid D_s]$ for $\tau^* \notin \mathcal{T}_e$ by sampling from the out-of-sample full conditional distribution $[\mu_t(\tau^*) \mid \{\mu_r\}_{r=1}^T, \Theta, D_s]$ within the Gibbs sampler, and then averaging over the Gibbs sample of $\{\mu_r\}_{r=1}^T$ and $\Theta$. Let $\psi'(\tau^*) \equiv (\psi(\tau^*, \tau_1), \ldots, \psi(\tau^*, \tau_M))$ and $\phi'(\tau^*) \equiv (\phi_1(\tau^*), \ldots, \phi_J(\tau^*))$. In the special case of model (4.4) and using the FDLM (4.7), we have the following computationally efficient alternative for state space imputation:

Theorem 4.2. Suppose $\tau^* \in \mathcal{T}$ such that $\tau^* \notin \mathcal{T}_e$. Under the FDLM (4.7) and conditional on model (4.4) with the integral approximation (4.5), the out-of-sample full conditional distribution of $\mu_t(\tau^*)$ is $[\mu_t(\tau^*) \mid \{\mu_r\}_{r=1}^T, \Theta, D_s] \sim N(m_t(\tau^*), K_t(\tau^*))$, where $m_t(\tau^*) = \psi'(\tau^*)Q\mu_{t-1} + \phi'(\tau^*)\tilde\Sigma_e\Phi'(\mu_t - \Psi Q\mu_{t-1})$ and $K_t(\tau^*) = \sigma_\eta^2 + \sigma_\eta^2\phi'(\tau^*)\tilde\Sigma_e\phi(\tau^*)$.

The proof of Theorem 4.2 and extensions for $p > 1$ are in Appendix C.
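The moments in Theorem 4.2 involve only matrix-vector products. As a rough Python sketch (not the dissertation's R code), the function below evaluates $m_t(\tau^*)$ and $K_t(\tau^*)$ for a single out-of-sample point. It assumes that $Q$ is diagonal with quadrature weights `q`, and all function and argument names are hypothetical.

```python
def out_of_sample_moments(psi_star, phi_star, Phi, Psi, q, mu_t, mu_prev,
                          sigma2, sigma2_eta):
    """Mean m_t(tau*) and variance K_t(tau*) of the out-of-sample full conditional.

    psi_star : psi(tau*, tau_k) for k = 1..M
    phi_star : phi_j(tau*) for j = 1..J
    Phi      : M x J matrix of FLCs on the evaluation grid (Phi'Phi = I_J)
    Psi      : M x M matrix of psi(tau_i, tau_k)
    q        : M quadrature weights (diagonal of Q; an assumption of this sketch)
    """
    M, J = len(Phi), len(sigma2)
    w = [s / (sigma2_eta + s) for s in sigma2]        # diagonal of Sigma-tilde_e
    q_mu = [qk * m for qk, m in zip(q, mu_prev)]      # Q mu_{t-1}
    resid = [mu_t[i] - sum(Psi[i][k] * q_mu[k] for k in range(M))
             for i in range(M)]                       # mu_t - Psi Q mu_{t-1}
    scores = [sum(Phi[i][j] * resid[i] for i in range(M)) for j in range(J)]
    mean = (sum(p * x for p, x in zip(psi_star, q_mu))
            + sum(phi_star[j] * w[j] * scores[j] for j in range(J)))
    var = sigma2_eta + sigma2_eta * sum(phi_star[j] ** 2 * w[j] for j in range(J))
    return mean, var

# Tiny example: M = 2 grid points, J = 1 factor, and a zero FAR kernel.
mean, var = out_of_sample_moments(
    psi_star=[0.0, 0.0], phi_star=[0.5], Phi=[[1.0], [0.0]],
    Psi=[[0.0, 0.0], [0.0, 0.0]], q=[0.5, 0.5],
    mu_t=[2.0, 5.0], mu_prev=[1.0, 1.0], sigma2=[3.0], sigma2_eta=1.0)
```

Because the state vector is never enlarged, the cost per out-of-sample point is linear in $M$ once the residual $\mu_t - \Psi Q\mu_{t-1}$ is available.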
Using Theorem 4.2, we can efficiently estimate the out-of-sample posterior distribution $[\mu_t(\tau^*) \mid D_s]$ with minimal adjustments to the Gibbs sampling algorithm (see Appendix C). Theorem 4.2 builds upon the approximation in (4.5) and the computational simplifications of the FDLM to produce simple and efficient moment calculations for the full conditional distributions without expanding the dimension of the state vector, $M$. Note that for implementation purposes, the terms $\mu_t$ and $\mu_{t-1}$ appearing in $m_t(\tau^*)$ are assumed to be sampled from the full conditional distribution $[\{\mu_r\}_{r=1}^T \mid \Theta, D_s]$.

4.6 Simulations

We conducted extensive simulations to evaluate the proposed methods for FAR($p$) relative to several competitive alternatives. We are particularly interested in one-step forecasting and recovery of the FAR kernel $\psi_1$, and in how the associated performance varies with the sample size $T$, the location and number of the observation points $\tau_{1,t}, \ldots, \tau_{m_t,t}$, the kernel $\psi_1$, and the smoothness of the innovation process $\epsilon_t$. We also assess the performance of the model averaging procedure of Section 4.4 for $p \in \{1, 2\}$, and compare the nonparametric FDLM approach of Section 4.3 with a more standard parametric Gaussian process implementation.

4.6.1 Sampling Designs

For all simulations, the mean function is $\mu(\tau) = \frac{1}{10}\tau^3\sin(2\pi\tau)$, which produces the dominant shape in the rightmost panels of Figure 4.1. The measurement errors are identically distributed for all simulations: $\nu_{i,t} \overset{iid}{\sim} N(0, \sigma_\nu^2)$ with $\sigma_\nu = 0.002$. We vary the sample size from small ($T = 50$) to large ($T = 350$) for the FAR(1) simulations, and use a moderate sample size ($T = 125$) for the FAR(2) simulation. The FAR(1) kernel used for Figure 4.1 is the Bimodal-Gaussian kernel, $\psi(\tau, u) \propto \frac{0.75}{\pi(0.3)(0.4)}\exp\{-(\tau - 0.2)^2/(0.3)^2 - (u - 0.3)^2/(0.4)^2\} + \frac{0.45}{\pi(0.3)(0.4)}\exp\{-(\tau - 0.7)^2/(0.3)^2 - (u - 0.8)^2/(0.4)^2\}$, following Wood (2003); see Appendix C for a plot of the Bimodal-Gaussian kernel.
We also present results for the Linear-$\tau$ kernel, $\psi(\tau, u) \propto \tau$, and the Linear-$u$ kernel, $\psi(\tau, u) \propto u$. Each kernel is rescaled according to a pre-specified squared norm, $C_{\psi_\ell} = \int\!\!\int \psi_\ell^2(\tau, u)\,d\tau\,du$, with $\sum_{\ell=1}^p C_{\psi_\ell} < 1$ for stationarity. We select $C_{\psi_1} = 0.8$ for the FAR(1) simulations and use $(C_{\psi_1}, C_{\psi_2}) = (0.4, 0.2)$ for the FAR(2) simulation; smaller values of $C_{\psi_\ell}$ produce similar comparative results, but the forecasting performance deteriorates for all methods. For the innovation process, $\epsilon_t$, we consider both smooth and non-smooth Gaussian processes. We use the covariance function parametrization $K_\epsilon = \sigma^2 R_\rho$, where $R_\rho$ is the Matérn correlation function $R_\rho(\tau, u) = \{2^{\rho_1 - 1}\Gamma(\rho_1)\}^{-1}(\|\tau - u\|/\rho_2)^{\rho_1}K_{\rho_1}(\|\tau - u\|/\rho_2)$, $\Gamma(\cdot)$ is the gamma function, $K_{\rho_1}$ is the modified Bessel function of order $\rho_1$, and $\rho = (\rho_1, \rho_2)$ are parameters (Matérn, 2013). We let $\sigma = 0.01$ and $\rho = (\rho_1, 0.1)$, with $\rho_1 = 2.5$ for smooth (twice-differentiable) sample paths and $\rho_1 = 0.5$ for non-smooth (continuous, non-differentiable) sample paths.

[Figure 4.1 panels, left to right: Smooth Gaussian Process; Non-Smooth Gaussian Process; Smooth Gaussian Process; Non-Smooth Gaussian Process. Horizontal axis: $\tau$ on [0, 1].]

Figure 4.1: Sample paths of $\epsilon_t$ and $Y_t = \mu_t + \mu$ as a function of $\tau$, where $\epsilon_t$ is a Gaussian process with the Matérn correlation function, $\rho = (\rho_1, 0.1)$, $\sigma = 0.01$, and $Y_t$ is generated using the Bimodal-Gaussian FAR(1) kernel, $t = 1, \ldots, T = 50$. The curves are time-ordered by color (from red/orange to blue/violet). Left to right: $\epsilon_t(\tau)$, $\rho_1 = 2.5$; $\epsilon_t(\tau)$, $\rho_1 = 0.5$; $Y_t(\tau)$, $\rho_1 = 2.5$; $Y_t(\tau)$, $\rho_1 = 0.5$. Note that we do not observe $Y_t$ directly, but rather $y_{i,t} = Y_t(\tau_{i,t}) + \nu_{i,t}$, where $\nu_{i,t} \sim N(0, \sigma_\nu^2)$ is measurement error with $\sigma_\nu = \sigma/5 = 0.002$ and $\mathcal{T}_t = \{\tau_{1,t}, \ldots, \tau_{m_t,t}\}$ are the observation points at time $t$.

We consider three sampling designs for the observation points: dense, sparse-random, and sparse-fixed.
In each case, the set of evaluation points, $\mathcal{T}_e$, is an equally-spaced grid of $M = 30$ points on $\mathcal{T} = [0, 1]$. The dense design uses $m_t = 25$ equally-spaced observation points on $[0, 1]$ for all $t$, for which the results are representative of denser ($m_t \gg 25$) designs and similar to those of Didericksen et al. (2012); see Appendix C. The sparse-random design is generated by first sampling each $m_t$ from a zero-truncated Poisson(5) distribution, and then sampling $\tau_{1,t}, \ldots, \tau_{m_t,t}$ without replacement from $\mathcal{T}_e$. This is a common design in sparse functional data, in which $m_t$ may be small for some $t$, but $\mathcal{T}_o$ is dense in $\mathcal{T}$. The sparse-fixed design uses $m_t = 8$ equally-spaced points in $\mathcal{T}$. This is the most challenging design, and one for which multivariate time series methods should be most competitive with functional time series methods. Comparatively, the sparse settings are similar to the dense setting, but with additional missing observations.

4.6.2 Competing Estimators

Within the proposed framework and using the FDLM of Section 4.3 for the innovation covariance function, we compute forecasts for $p = 1$ (FDLM-FAR(1)), and in the FAR(2) simulation, for $p = 2$ (FDLM-FAR(2)) and $p = 3$ (FDLM-FAR(3)). We also compute forecasts using the model averaging procedure with $p_{\max} = 4$ (FDLM-FAR($p$)). To assess the performance of the FDLM implementation, we compute forecasts using model (4.6) with a parametric covariance function for $K_\epsilon = \sigma^2 R_\rho$ (GP-FAR(1)). We use the Matérn correlation function for $R_\rho$, with $\rho_1 = 2.5$ as in the smooth Gaussian process simulations, and use the priors $\sigma^{-2} \sim \mathrm{Gamma}(10^{-3}, 10^{-3})$ and $\rho_2 \sim \mathrm{Uniform}(0, U_{\rho_2})$, where $U_{\rho_2}$ is the maximum value of $\rho_2$ for which the correlation function $R_\rho$ is less than 0.99 for all pairs of evaluation points.
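The upper bound $U_{\rho_2}$ can be found numerically. For $\rho_1 = 2.5$, the Matérn correlation in the parametrization of Section 4.6.1 has the closed form $R(d) = (1 + x + x^2/3)e^{-x}$ with $x = d/\rho_2$, which increases in $\rho_2$, so the largest correlation among distinct evaluation points occurs at the smallest spacing. The Python sketch below is an illustration and not the dissertation's R code: it locates the bound by bisection, and the function names, bracketing interval, and tolerance are assumptions.

```python
import math

def matern_25(d, rho2):
    """Matern correlation with rho1 = 2.5 at distance d:
    (1 + x + x^2 / 3) * exp(-x), where x = d / rho2."""
    x = d / rho2
    return (1.0 + x + x * x / 3.0) * math.exp(-x)

def max_range_parameter(grid, threshold=0.99, lo=1e-6, hi=10.0, tol=1e-10):
    """Bisection for the rho2 at which the correlation between the two closest
    grid points equals the threshold; any smaller rho2 keeps every pair below it."""
    d_min = min(b - a for a, b in zip(grid, grid[1:]))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if matern_25(d_min, mid) < threshold:
            lo = mid   # still below the threshold, so rho2 can be larger
        else:
            hi = mid
    return 0.5 * (lo + hi), d_min

M = 30
grid = [i / (M - 1) for i in range(M)]   # equally-spaced evaluation points on [0, 1]
U_rho2, d_min = max_range_parameter(grid)
```

Under these assumptions the bound depends on the grid only through its smallest spacing, so it needs to be computed once per set of evaluation points.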
These models are implemented using the Gibbs sampling algorithm provided in Appendix C, and estimates are based on 5,000 MCMC simulations after a burn-in of 5,000. For the large sample setting ($T = 350$), the mean computation time per 1,000 MCMC simulations was 2.3 minutes for FDLM-FAR(1) and 4.4 minutes for GP-FAR(1). The computing times are calculated on a 64-bit Windows machine with a 2.40-GHz Intel Core i7-4700MQ processor with 8 GB of RAM, and the code is written in R.

We consider several important competing methods. Let $\hat y_{t+1}$ denote the one-step forecast at time $t$. For baseline comparisons, we use the random-walk (RW) forecast, $\hat y_{t+1} = y_t$, and the mean (Mean) forecast, $\hat y_{t+1} = \hat\mu$, where $\hat\mu$ is a smooth estimate of the mean of $\{y_s\}_{s=1}^t$. We estimate $\hat\mu$ using a B-spline basis expansion via the function mean.fd() in the R package fda (Ramsay et al., 2014). Both estimators are robust against overfitting, and the mean forecast is optimal when $\psi = 0$. We also compute the one-step forecast based on a VAR(1) fit to $\{y_s\}_{s=1}^t$ (VAR-Y). In the sparse-random design, the observations $y_t$ were linearly interpolated onto $\mathcal{T}_e$ prior to fitting the VAR. In the sparse-fixed design, the VAR was fit to the observation points, and then forecasts for the evaluation points were computed by fitting a spline to the VAR forecasts of the observation points. For additional comparisons, we computed forecasts from a simple exponential smoother (SES) applied pointwise to each component of $y_t$, i.e., each time series $\{y_{j,t}\}_{t=1}^T$. The SES forecasts are implemented using the ses function in the R package forecast (Hyndman and Khandakar, 2008), with the same imputation scheme as VAR-Y.

We also considered two functional data methods. First, we used the Estimated Kernel procedure outlined in Horváth and Kokoszka (2012), which estimates $\psi_\ell$ in (4.2) using FPCs (FAR Classic); we fix $p = 1$ for simplicity.
This method has well-studied theoretical properties and is a useful baseline for FAR models. Second, we implemented the method of Aue et al. (2015), which we briefly described in Section 4.5, using a VAR(1) on the FPC scores (VAR-FPC). We compute the FPCs using the fda package in R with B-spline basis functions. To avoid the ill-conditioned estimators discussed in Horváth and Kokoszka (2012), we regularize via basis truncation, using 8 equally-spaced interior knots. The number of components is selected to explain at least 95% of the variability in $\{y_t\}$. For the sampling designs considered here, this approach works well. Finally, we report the oracle forecast (FAR Oracle) computed using the true one-step forecasts $E[\mu_t(\tau) \mid \{\psi_\ell, \mu_{t-\ell}\}_{\ell=1}^p] = \sum_{\ell=1}^p \int \psi_\ell(\tau, u)\mu_{t-\ell}(u)\,du$ within the simulation, where $\{\psi_\ell\}_{\ell=1}^p$ are the FAR kernels from the simulation specification, $\{\mu_t\}$ are the simulated values of the latent FAR process, and the integral is approximated using the trapezoidal rule with $M = 200$ grid points. The oracle forecast is not actually an estimator, and is unaffected by sparsity or small sample sizes.

We estimate the one-step forecasts $[y_{T+h} \mid y_{1:(T+h-1)}]$, $h = 1, \ldots, 25$, for all estimators under consideration, and compare them using the mean squared forecast error $MSFE_e = \frac{1}{25M}\sum_{h=1}^{25}\|Y_{T+h} - \hat Y_{T+h}\|^2$, where $Y_{T+h} = (Y_{T+h}(\tau_1), \ldots, Y_{T+h}(\tau_M))'$, which measures the one-step forecasting performance at the evaluation points, and the mean squared error $MSE_{\psi_1} = \frac{1}{M^2}\sum_{i=1}^M\sum_{k=1}^M\{\psi_1(\tau_i, \tau_k) - \hat\psi_1(\tau_i, \tau_k)\}^2$, which measures the recovery of the lag-1 kernel $\psi_1$. Because $\mathcal{T}_e$ is relatively dense in $\mathcal{T}$, $MSFE_e$ and $MSE_{\psi_1}$ approximate the integrated squared errors $\int \{Y_{T+h}(u) - \hat Y_{T+h}(u)\}^2\,du$ and $\int\!\!\int \{\psi_1(\tau, u) - \hat\psi_1(\tau, u)\}^2\,d\tau\,du$, respectively. Estimators $\hat\psi_1$ are available only for the proposed methods and FAR Classic.
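Both error measures are plain averages of squared differences on the evaluation grid. A minimal Python sketch follows; the function names and toy inputs are hypothetical, and the number of forecast periods is left general rather than fixed at 25.

```python
def msfe(Y_true, Y_hat):
    """Mean squared forecast error over H forecast periods and M evaluation points.
    Y_true and Y_hat are lists of H curves, each a list of M values."""
    H, M = len(Y_true), len(Y_true[0])
    total = sum((y - f) ** 2
                for curve, fcast in zip(Y_true, Y_hat)
                for y, f in zip(curve, fcast))
    return total / (H * M)

def mse_kernel(psi_true, psi_hat):
    """Mean squared error of a kernel estimate on the M x M evaluation grid."""
    M = len(psi_true)
    total = sum((psi_true[i][k] - psi_hat[i][k]) ** 2
                for i in range(M) for k in range(M))
    return total / (M * M)

# Toy check with H = 2 forecast periods and M = 2 evaluation points.
Y_true = [[1.0, 2.0], [3.0, 4.0]]
Y_hat = [[1.0, 1.0], [3.0, 2.0]]
toy_msfe = msfe(Y_true, Y_hat)                # (0 + 1 + 0 + 4) / 4

psi_true = [[0.0, 1.0], [1.0, 0.0]]
psi_hat = [[0.0, 0.0], [0.0, 0.0]]
toy_mse_psi = mse_kernel(psi_true, psi_hat)   # (0 + 1 + 1 + 0) / 4
```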
For computational convenience in the proposed methods, we update $\{\mu_t\}_{t=1}^{T+h-1}$ using all of the data $y_{1:(T+h-1)}$, but sample all other parameters only conditional on $y_{1:T}$. DLM updating algorithms provide recursive one-step forecasts for $\mu_t$, but in general there are no convenient updating algorithms for the other parameters. In practice, this is not a problem, but suggests that our simulation analysis may underestimate the performance of the proposed model.

4.6.3 Results

We computed $MSFE_e$ and $MSE_{\psi_1}$ under a variety of sampling designs, each for $N = 50$ simulations, and present the results for a few important cases in Figures 4.2 and 4.3, respectively. The figures are color-coded: multivariate methods are green, existing functional data methods are red, the proposed methods are blue, and the oracle is gold.

[Figure 4.2: four panels showing $MSFE_e$ by method. Methods shown: RW, Mean, VAR-Y, SES, FAR Classic, VAR-FPC, GP-FAR(1), FDLM-FAR(1), FDLM-FAR(2), FDLM-FAR(3), FDLM-FAR($p$), FAR Oracle.]

Figure 4.2: $MSFE_e$ under various designs. Top left: FAR(1), $T = 350$, sparse-random design with the Linear-$u$ kernel and smooth GP innovations. Top right: FAR(1), $T = 50$, sparse-random design with the Bimodal-Gaussian kernel and non-smooth GP innovations. Bottom left: FAR(1), $T = 350$, sparse-fixed design with the Bimodal-Gaussian kernel and smooth GP innovations. Bottom right: FAR(2), $T = 125$, sparse-fixed design with Bimodal-Gaussian and Linear-$\tau$ kernels and smooth GP innovations.
The proposed methods provide superior forecasts and nearly achieve the oracle performance, despite the presence of sparsity.

[Figure 4.3: four panels showing $MSE_{\psi_1}$ by method. Methods shown: FAR Classic, GP-FAR(1), FDLM-FAR(1), FDLM-FAR(2), FDLM-FAR(3), FDLM-FAR($p$).]

Figure 4.3: $MSE_{\psi_1}$ under various designs. Top left: FAR(1), $T = 350$, sparse-random design with the Linear-$u$ kernel and smooth GP innovations. Top right: FAR(1), $T = 50$, sparse-random design with the Bimodal-Gaussian kernel and non-smooth GP innovations. Bottom left: FAR(1), $T = 350$, sparse-fixed design with the Bimodal-Gaussian kernel and smooth GP innovations. Bottom right: FAR(2), $T = 125$, sparse-fixed design with Bimodal-Gaussian and Linear-$\tau$ kernels and smooth GP innovations. Estimates of $\psi_1$ are far superior for the proposed methods, including the FAR($p$) with model averaging.

For the sparse designs in Figure 4.2, the proposed methods are all superior to the competitors, and in some cases nearly achieve the oracle performance, even though the oracle is unaffected by sparsity. Figure 4.3 shows that the proposed methods also offer a substantial improvement in $\psi_1$ estimation. Importantly, the proposed model with model averaging is competitive with the known $p$ model for both forecasting and estimation of $\psi_1$. The model averaging procedure of Section 4.4 typically identifies the true $p$ with high probability, with a mild tendency to overestimate $p$. However, this behavior is encouraging: the bottom right panel of Figure 4.3, in which $p = 2$, suggests that overestimating the lag (FDLM-FAR(3)) is preferable to underestimating the lag (FDLM-FAR(1), GP-FAR(1)) for $\psi_1$ estimation.
FDLM-FAR(1) is competitive with GP-FAR(1), even when the parametric Gaussian process model assumes the correct (smooth) innovation distribution, which suggests that the FDLM implementation of Section 4.3 provides an adequate approximation. Under the dense design (see Appendix C), the improvements of the proposed methods over existing functional data methods are less substantial, and for $T = 350$ the functional data methods all nearly achieve the oracle performance. The proposed methods, however, again provide superior recovery of $\psi_1$. In general, we find that the functional data methods, in particular the proposed approaches, outperform the multivariate methods, especially in the dense design. We conclude that the proposed methods provide highly competitive forecasts and superior FAR kernel recovery in a wide variety of important settings.

4.7 Forecasting Nominal and Real Yield Curves

We apply the proposed methods to model and forecast nominal and real yield curves. Yield curves are important in a variety of economic and financial applications, such as evaluating economic and monetary conditions, pricing fixed-income securities, generating forward curves, computing inflation premiums, and monitoring business cycles (Bolder et al., 2004). In practice, the U.S. real yield curve is estimated using Treasury Inflation-Protected Securities (TIPS), for which payments are adjusted according to the Consumer Price Index for All Urban Consumers (CPI-U) to provide investors with protection against inflation. U.S. nominal and TIPS yield curve data are published daily by the Federal Reserve, which uses actively-traded securities to fit a quasi-cubic spline for each curve. Estimates of the real and nominal yield curves are provided for maturities $\mathcal{T}_t^R = \{60, 84, 120, 240, 360\}$ and $\mathcal{T}_t^N = \{1, 3, 6, 12, 24, 36\} \cup \mathcal{T}_t^R$ months, respectively. Notably, the real yield is observed sparsely, and only at longer maturities.
The small number of available maturities for real yields presents a challenge for existing functional time series models, and provides an interesting comparison with the nominal yield, for which there are more observed maturities.

To assess the performance of the proposed model, we conducted an extensive forecasting study using daily nominal and real yield curve data. Beginning in 2003, we construct nine consecutive yet non-overlapping 18-month subperiods for estimation ($T \approx 375$); the corresponding starting dates are given in Table 4.1. For the month following each estimation period, we compute both one- and five-step (i.e., one business week) forecasts ($\approx 20$ and $\approx 15$ time points, respectively) for both the nominal and real yields. In all cases, the nominal and real yields are modeled separately in order to provide additional comparisons.

We compute forecasts for the proposed methods by simulating from the forecasting distribution in the DLM (4.6). For computational convenience, we update only the DLM state parameters $\{\mu_t\}$ during the forecast periods, and fix the remaining parameters based on the estimation periods. We also rescale the observation points $\mathcal{T}_t^R$ and $\mathcal{T}_t^N$ such that $\mathcal{T}_t^R, \mathcal{T}_t^N \subset \mathcal{T} = [0, 1]$. We compute forecasts using the competing methods described in Section 4.6, which use all available data for each forecast. For further comparisons, we include two popular parametric yield curve models based on the Nelson-Siegel parametrization (Nelson and Siegel, 1987): Diebold and Li (2006, DL), which extends the Nelson-Siegel model to the dynamic setting via a two-step estimation procedure, and Diebold et al. (2006, DRA), which is similar to DL, but instead estimates parameters jointly using maximum likelihood within a state space model; see Appendix C for implementation details.

The one- and five-step root mean squared forecasting errors (RMSFEs) for the nominal yields and real yields are in Tables 4.1 and 4.2, respectively.
We omit unstable DRA forecasts, as well as multi-step forecasts for FAR Classic, which are unavailable. For both data sets, the proposed methods (denoted FAR(1) and FAR($p$), using the lag selection procedure with $p_{\max} = 3$) are consistently among the best forecasters for all time periods, and outperform the existing functional data forecasts by a wide margin. For the nominal yields, the FAR(1) provides the best one-step forecasts aggregated across all time periods. For the real yields, the proposed methods are again among the most competitive, particularly in the periods since the financial crisis. Echoing the results in Diebold and Li (2006), the RW forecast is a difficult benchmark to clear, and the existing functional data models typically fail to do so. By comparison, the proposed FAR forecasts are highly competitive across all time periods and for both the nominal and (sparsely-observed) real yields.

An important feature of the proposed FAR model is the ability to compute exact (up to MCMC error) credible bands for parameters of interest, including forecasts. Such uncertainty quantification is unavailable for the RW forecast, which is our primary competitor in this application. For illustration, we compute pointwise and simultaneous credible bands for one-step forecasts during August 2016 in Figure 4.4. For both nominal and real yields, the credible bands are tighter for shorter maturities and widen in regions of unobserved points, which is appropriate behavior for a nonparametric method.

4.8 Concluding Remarks

The proposed hierarchical FAR($p$) model provides a useful framework for estimation, inference, and forecasting of functional time series data.
Our model is especially suited for sparsely or irregularly sampled curves and for curves sampled with non-negligible measurement error, and produces best linear predictors in a general FAR($p$) setting, thereby dominating many competing functional time series models. The FDLM provides a more flexible, computationally efficient, and stable approach for modeling (innovation) covariance functions. Our model averaging procedure provides an effective solution to the problem of specifying $p$, and produces highly competitive forecasts. The simulation analysis and yield curve application suggest that the proposed FAR($p$) model may improve forecasting and estimation in a wide range of settings, and the efficient MCMC sampling algorithm allows us to perform exact (up to MCMC error and prior misspecification) inference for important parameters.

[Figure 4.4 panels. Top: Nominal Yield: 5 and 10 Year; Real Yield: 5 and 10 Year (horizontal axis: dates from Jul 18 to Aug 29, 2016; vertical axis: Yield (%)). Bottom: Nominal Yield Curve; Real Yield Curve (horizontal axis: Maturity (months); vertical axis: Yield (%)).]

Figure 4.4: One-step nominal (left) and real (right) yield curve forecasts during 2016. Top: Time series of five (×) and ten (△) year observed maturities with one-step forecasts. Bottom: Observed (points) and forecast (line) curves on 8/2/16, corresponding to the dotted vertical line in the top panels. Posterior means (blue) and 95% pointwise and simultaneous prediction bands (light gray and dark gray, respectively) estimated using 10,000 MCMC simulations after a burn-in of 5,000.

While we assumed independent factors (and therefore independent innovations) in Section 4.3, we can relax this assumption and allow $\Sigma_e$ to be a stochastic process evolving over time.
In this more general framework, the FDLM (4.7) can accommodate stochastic volatility or heavier-tailed distributions for the factors, yet retains the computational simplifications of (4.8) and Theorem 4.2. Letting $\Sigma_t = \mathrm{diag}(\{\sigma_{j,t}^2\}_{j=1}^J)$, the (time-dependent) innovation covariance function is
\[
K_{\epsilon_t}(\tau, u) \equiv \mathrm{Cov}(\epsilon_t(\tau), \epsilon_t(u)) = \sum_{j=1}^J \sigma_{j,t}^2 \phi_j(\tau)\phi_j(u) + \sigma_\eta^2 1\{\tau = u\}.
\]
By modeling each $\{\sigma_{j,t}^2\}_{t=1}^T$ for $j = 1, \ldots, J$ with an independent stochastic volatility model (e.g., Kim et al., 1998), the time-dependence of $\{\sigma_{j,t}^2\}$ will propagate to the innovation covariance functions, $K_{\epsilon_t}$. Similar modifications can accommodate scale-mixtures of Gaussian distributions for the factors (Fernandez and Steel, 2000) to induce more general distributions for the innovation process, $\{\epsilon_t\}$. These generalizations are particularly important for financial applications, for which stochastic volatility models and heavy-tailed distributions are commonly appropriate.

Future work will investigate more adaptive FAR(p) models for longer, possibly nonstationary functional time series through stochastic volatility, time-varying $\psi_\ell$, and regime shifts. Important extensions also include modeling multiple functional responses $Y_t(\tau) \in \mathbb{R}^d$ for $d > 1$, which requires a model for both the auto- and cross-correlations, and incorporating exogenous predictors. In both cases, the DLM framework of (4.6) offers a promising platform for pursuing these extensions.
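To make the propagation of time-varying factor variances concrete, the following minimal Python sketch evaluates the innovation covariance function on a common grid of observation points for two hypothetical time points. The function name `innovation_covariance`, the basis values `phi`, and all numerical values are illustrative assumptions, not quantities taken from the yield curve application.

```python
def innovation_covariance(phi, sigma2_t, sigma2_eta):
    """Evaluate K(tau_a, tau_b) = sum_j sigma2_t[j] * phi[j][a] * phi[j][b]
    + sigma2_eta * 1{a == b} on a common grid of m observation points.

    phi        : J lists of length m (basis function j evaluated on the grid)
    sigma2_t   : J factor innovation variances at time t
    sigma2_eta : measurement-error variance (added on the diagonal only)
    """
    J, m = len(phi), len(phi[0])
    K = [[0.0] * m for _ in range(m)]
    for a in range(m):
        for b in range(m):
            K[a][b] = sum(sigma2_t[j] * phi[j][a] * phi[j][b] for j in range(J))
            if a == b:
                K[a][b] += sigma2_eta
    return K


# Hypothetical example: J = 2 basis functions on a grid of m = 3 points.
phi = [[1.0, 1.0, 1.0], [-1.0, 0.0, 1.0]]
K_calm = innovation_covariance(phi, sigma2_t=[1.0, 0.5], sigma2_eta=0.25)
K_volatile = innovation_covariance(phi, sigma2_t=[4.0, 2.0], sigma2_eta=0.25)
```

In this toy setting, scaling the factor variances by four scales the off-diagonal covariances by four, while the measurement-error term on the diagonal is unchanged.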
Nominal Yields: h-Step Root Mean Squared Forecast Errors (RMSFEs)

Period  h  RW      Mean    VAR-Y   DL      DRA     FAR Classic  VAR-FPC  FAR(1)  FAR(p)
2/03    1  0.0488  0.4554  0.0487  0.1218  0.1440  0.1641       0.1631   0.0498  0.0516
2/03    5  0.0966  0.4369  0.0904  0.1409  0.8221  -            0.1941   0.0879  0.1002
8/04    1  0.0253  1.1079  0.0252  0.0877  -       0.1113       0.1127   0.0281  0.0281
8/04    5  0.0525  1.1279  0.0383  0.0953  -       -            0.1435   0.0412  0.0505
2/06    1  0.1710  0.5408  0.1809  0.2206  -       0.3349       0.3334   0.1682  0.1673
2/06    5  0.4534  0.5971  0.5885  0.4927  -       -            0.5928   0.4680  0.4627
8/07    1  0.0833  1.3125  0.0860  0.1817  0.1854  0.1168       0.1173   0.0806  0.0793
8/07    5  0.1345  1.3146  0.1402  0.2099  0.2998  -            0.1292   0.1537  0.1233
2/09    1  0.0487  0.5268  0.0517  0.1376  0.0917  0.1406       0.1398   0.0488  0.0760
2/09    5  0.0894  0.5560  0.1227  0.1872  0.1451  -            0.1990   0.1323  0.2608
8/10    1  0.0344  0.5063  0.0333  0.1920  0.0878  0.0551       0.0554   0.0291  0.0292
8/10    5  0.0583  0.4999  0.0603  0.1950  0.1356  -            0.0724   0.0452  0.0495
2/12    1  0.0383  0.5329  0.0384  0.0953  0.1915  0.0464       0.0463   0.0312  0.0311
2/12    5  0.0951  0.5522  0.0915  0.1240  0.2476  -            0.0989   0.0760  0.0734
8/13    1  0.0463  0.4169  0.0443  0.0621  0.0692  0.0634       0.0644   0.0547  0.0676
8/13    5  0.1210  0.3842  0.1104  0.1423  0.1448  -            0.1100   0.1208  0.1100
2/15    1  0.0329  0.3085  0.0320  0.1125  0.1001  0.0594       0.0606   0.0305  0.0321
2/15    5  0.0420  0.3080  0.0403  0.1149  0.1202  -            0.0697   0.0393  0.0441

Table 4.1: h-step RMSFEs for nominal yields, grouped (left to right) by multivariate methods, parametric yield curve models, existing functional data methods, and proposed hierarchical FAR methods. The minimum RMSFE in each row is italicized.
Real Yields: h-Step Root Mean Squared Forecast Errors (RMSFEs)

Period  h  RW      Mean    VAR-Y   DL      DRA     FAR Classic  VAR-FPC  FAR(1)  FAR(p)
2/03    1  0.0490  0.1629  0.0504  0.0499  0.0492  0.1366       0.1329   0.0509  0.0572
2/03    5  0.1001  0.1585  0.1040  0.1017  0.1128  -            0.1525   0.0967  0.1110
8/04    1  0.0331  0.3827  0.0337  0.0353  0.0528  0.0431       0.0440   0.0331  0.0326
8/04    5  0.0724  0.3924  0.0707  0.0792  0.1690  -            0.0721   0.0679  0.0651
2/06    1  0.0429  0.1089  0.0428  0.0448  0.0453  0.0529       0.0533   0.0424  0.0424
2/06    5  0.0934  0.1082  0.0858  0.0957  0.1362  -            0.0920   0.0852  0.0835
8/07    1  0.0802  0.2150  0.0896  0.0944  0.1979  0.1212       0.1202   0.0898  0.0880
8/07    5  0.1866  0.2309  0.2268  0.2504  1.1843  -            0.1916   0.2051  0.1980
2/09    1  0.0519  0.5162  0.0544  0.0643  0.1229  0.0736       0.0749   0.0526  0.0541
2/09    5  0.0798  0.5262  0.1100  0.1092  0.3606  -            0.1092   0.0992  0.1046
8/10    1  0.0490  0.7836  0.0492  0.0591  0.0663  0.0800       0.0762   0.0488  0.0486
8/10    5  0.0735  0.7845  0.0787  0.0794  0.1815  -            0.0959   0.0727  0.0744
2/12    1  0.0602  0.8838  0.0612  0.0675  0.1492  0.0906       0.0853   0.0610  0.0608
2/12    5  0.1845  0.9250  0.1958  0.1897  1.7442  -            0.2034   0.1840  0.1846
8/13    1  0.0526  0.3242  0.0506  0.0736  -       0.0613       0.0610   0.0500  0.0492
8/13    5  0.1551  0.2981  0.1278  0.1380  -       -            0.1246   0.1407  0.1239
2/15    1  0.0328  0.3088  0.0327  0.0439  0.1529  0.0776       0.0779   0.0325  0.0336
2/15    5  0.0489  0.3104  0.0521  0.0562  -       -            0.0816   0.0466  0.0543

Table 4.2: h-step RMSFEs for real yields, grouped (left to right) by multivariate methods, parametric yield curve models, existing functional data methods, and proposed hierarchical FAR methods. The minimum RMSFE in each row is italicized.

CHAPTER 5

CONCLUSIONS

The proposed methods provide effective Bayesian approaches for modeling functional and time series data. While broadly applicable, the proposed methodology directly addresses the following challenging cases for which existing methods are inadequate:

1. Functional data with additional complex dependence, such as time dependence, contemporaneous dependence, stochastic volatility, covariates, and change points (Chapter 2);

2. Functional data, time series data, or regression functions with local features, such as jumps or rapidly-changing smoothness (Chapter 3); and

3. Forecasting and inference of functional time series data with sparsely or irregularly sampled curves and for curves sampled with non-negligible measurement error (Chapter 4).

Using the MFDLM of Chapter 2, we may adapt general scalar and multivariate methods to the functional data setting. In particular, by separating out the functional component through appropriate conditioning and including the necessary identifiability constraints, the remaining dependence structures, such as covariates, repeated measurements, and spatial dependence, may be modeled via the factors. The hierarchical Bayesian approach allows us to incorporate interesting and useful submodels seamlessly, with minimal adjustments to the proposed Gibbs sampling algorithm.

An interesting extension of the MFDLM of Chapter 2 would be to incorporate the dynamic shrinkage processes of Chapter 3 to provide adaptive shrinkage and regularization of the dynamic factors. Dynamic shrinkage processes inherit the desirable shrinkage behavior of global-local priors, such as the horseshoe prior, but with greater time-localization. By construction, the MFDLM (and more broadly, any dynamic linear model) contains many parameters, and therefore may benefit from structured regularization. By synthesizing the MFDLM of Chapter 2 and the dynamic shrinkage processes of Chapter 3, we may model additional dependence among functional data, such as covariates, repeated measurements, and spatial dependence, while simultaneously introducing temporally adaptive shrinkage behavior to guard against overparametrization.

Important extensions of the dynamic shrinkage processes of Chapter 3 include alternative dependence models in (3.2) or multivariate shrinkage in (3.10).
For example, dynamic shrinkage processes may offer effective shrinkage behavior for spatial or spatio-temporal models, in which case (3.2) may be modified to incorporate spatial dependence. Similarly, for replicate time series or functional data, multi-level extensions of (3.2) may provide both hierarchical and locally adaptive shrinkage behavior.

The proposed hierarchical FAR(p) model in Chapter 4 may be extended to incorporate exogenous predictors, stochastic volatility, time-varying autoregressive kernels $\psi_\ell$, and regime shifts. These important generalizations may provide broader applicability of the hierarchical FAR(p) model for longer, possibly nonstationary functional time series data. As with the MFDLM, the dynamic linear model framework offers a promising platform for pursuing these extensions, and may be combined with dynamic shrinkage processes for additional locally adaptive regularization.

APPENDIX A

A BAYESIAN MULTIVARIATE FUNCTIONAL DYNAMIC LINEAR MODEL

To sample from the joint posterior distribution, we use a Gibbs sampler. Because the Gibbs sampler allows blocks of parameters to be conditioned on all other blocks of parameters, it is a convenient approach for our model. First, hierarchical dynamic linear model (DLM) algorithms typically require that $\beta_t$ and $\theta_t$ be the only unknown components, which we can accommodate by conditioning appropriately. Second, our sequential orthonormality approach for $f_k^{(c)}$ fits nicely within a Gibbs sampler, and we can adapt the algorithms described in Wand and Ormerod (2008). And third, the hierarchical structure of our model imposes natural conditional independence assumptions, which allows us to easily partition the parameters into appropriate blocks.

A.1 Initialization

To initialize the factors $\beta_k^{(c)} = (\beta_{k,1}^{(c)}, \ldots, \beta_{k,T}^{(c)})'$ and the factor loading curves (FLCs) $f_k^{(c)}$ for $k = 1, \ldots, K$ and $c = 1, \ldots, C$, we compute the singular value decomposition (SVD) of the data matrix $Y^{(c)} = U^{(c)}\Sigma^{(c)}V^{(c)\prime}$ for $c = 1, \ldots, C$. Note that to obtain a data matrix $Y^{(c)}$, with rows corresponding to times $t$ and columns to observation points $\tau$, we need to estimate $Y_t^{(c)}(\tau)$ for any unobserved $\tau$ at each time $t$, which may be computed quickly using splines. However, these estimated data values are only used for the initialization step. Then, letting $U_{1:K}^{(c)}$ be the first $K$ columns of $U^{(c)}$, $\Sigma_{1:K}^{(c)}$ be the upper left $K \times K$ submatrix of $\Sigma^{(c)}$, and $V_{1:K}^{(c)}$ be the first $K$ columns of $V^{(c)}$, we initialize the factors $(\beta_1^{(c)}, \ldots, \beta_K^{(c)}) = U_{1:K}^{(c)}\Sigma_{1:K}^{(c)}$ and the FLCs $(f_1^{(c)}, \ldots, f_K^{(c)}) = V_{1:K}^{(c)}$, where $f_k^{(c)}$ is the vector of FLC $k$ evaluated at all observation points $\cup_t \mathcal{T}_t^{(c)}$ for outcome $c$. The $f_k^{(c)}$ are orthonormal in the sense that $f_k^{(c)\prime} f_j^{(c)} = 1(k = j)$, but they are not smooth. This approach is similar to the initializations in Matteson et al. (2011) and Hays et al. (2012).

Given the factors $\beta_k^{(c)}$ and the FLCs $f_k^{(c)}$, we can estimate each $\sigma_{(c)}^2$ (or more generally, $E_t$) as a conditional maximum likelihood estimator (MLE), using the likelihood from the observation level of model (2.1). Similarly, we can estimate each $\lambda_k^{(c)}$ conditional on $f_k^{(c)}$ by maximizing the partially informative normal likelihood. Then, given $\lambda_k^{(c)}$, $\sigma_{(c)}^2$, $\beta_k^{(c)}$, and $f_k^{(c)}$, we can estimate each $d_k^{(c)}$ by normalizing the full conditional posterior expectation given in the main paper; i.e., solving the relevant quadratic program and then normalizing the solution. Initializations for the remaining levels proceed similarly as conditional MLEs, but depend on the form chosen for $X_t$, $V_t$, $G_t$, and $W_t$. In our applications, this conditional MLE approach produces reasonable starting values for all variables.

A.1.1 Common Factor Loading Curves

If we wish to implement the common FLCs model $f_k^{(c)} = f_k$ for all $k, c$, then we instead compute the SVD of the stacked data matrices $(Y^{(1)\prime}, \ldots, Y^{(C)\prime})' = U\Sigma V'$, where now the data matrices $Y^{(1)}, \ldots, Y^{(C)}$ are imputed using splines for all observation points for all outcomes, $\cup_{t,c}\mathcal{T}_t^{(c)}$, and therefore have the same number of columns. Alternatively, we may improve computational efficiency by choosing a small yet representative subset of observation points $\mathcal{T}^* \subset \cup_{t,c}\mathcal{T}_t^{(c)}$ and then estimating each data matrix $Y^{(c)}$ for all $\tau \in \mathcal{T}^*$. Let $U_{1:K}^{(c)}$ be the first $K$ columns of $U^{(c)}$, where the $U^{(c)}$, $c = 1, \ldots, C$, correspond to the outcome-specific blocks of $U = (U^{(1)\prime}, \ldots, U^{(C)\prime})'$. Then, similar to before, we set $(\beta_1^{(c)}, \ldots, \beta_K^{(c)}) = U_{1:K}^{(c)}\Sigma_{1:K}$ for $c = 1, \ldots, C$, and $(f_1, \ldots, f_K) = V_{1:K}$, where $\Sigma_{1:K}$ is the upper left $K \times K$ submatrix of $\Sigma$ and $V_{1:K}$ is the first $K$ columns of $V$. Again, the $f_k$ are unsmoothed with $f_k' f_j = 1(k = j)$, but now the initialized FLCs are common for $c = 1, \ldots, C$. Initialization of the remaining parameters proceeds as before, but now with $\lambda_k^{(c)} = \lambda_k$ and $d_k^{(c)} = d_k$, which can be obtained by maximizing the relevant conditional likelihoods under the common FLCs model.

A.2 Sampling

A.2.1 General Algorithm

For greater generality, we present our sampling algorithm for non-common FLCs; i.e., we retain dependence on $c$ for $d_k^{(c)}$ and $\lambda_k^{(c)}$. When applicable, we discuss the necessary modifications for the common FLCs model. The algorithm proceeds in four main blocks:

1. Sample the basis coefficients $d_k^{(c)}$ and the smoothing parameters $\lambda_k^{(c)}$ for the FLCs. For $\lambda_k^{(c)}$, we use a Gamma($\gamma_1, \gamma_2$) prior distribution, which is conjugate to the partially informative normal likelihood and implies that the full conditional posterior distribution is Gamma($\gamma_1 + \mathrm{rank}(\Omega_\phi)/2,\ \gamma_2 + d_k^{(c)\prime}\Omega_\phi d_k^{(c)}/2$). For the common FLCs model, we simply replace $d_k^{(c)}$ with $d_k$ to obtain the full conditional posterior for $\lambda_k$.
We use the hyperparameters $\gamma_1 = \gamma_2 = 0.001$, although the effect of the hyperparameters is negligible as long as $\gamma_1$ and $\gamma_2$ are small relative to $\mathrm{rank}(\Omega_\phi)/2$ and $d_k^{(c)\prime}\Omega_\phi d_k^{(c)}/2$, respectively. After sampling the $\lambda_k^{(c)}$, we sample and then normalize the $d_k^{(c)}$ with a modified version of the efficient Cholesky decomposition approach of Wand and Ormerod (2008):

(a) Compute the (lower triangular) Cholesky decomposition $B_k^{-1} = \bar{B}_L\bar{B}_L'$;

(b) If $k = 1$, set $L_{1:(k-1)}'\Lambda_{1:(k-1)} = 0$; if $k > 1$, use forward substitution to obtain $\bar{x}$ and $\bar{y}$ from the equations $\bar{B}_L\bar{x} = L_{1:(k-1)}'$ and $\bar{B}_L\bar{y} = b_k$, and let $\Lambda_{1:(k-1)}$ be the solution to the regression of $\bar{y}$ on $\bar{x}$;

(c) Use forward substitution to obtain $\bar{b}$ as the solution to $\bar{B}_L\bar{b} = b_k$, then use backward substitution to obtain $d_k^*$ as the solution to $\bar{B}_L' d_k^* = \bar{b} + \bar{z}$, where $\bar{z} \sim N(0, I_{(M+4)\times(M+4)})$;

(d) Retain the vector $d_k^{(c)} = d_k^* / \sqrt{d_k^{*\prime} J_\phi d_k^*}$ and set $\beta_k^{(c)} = \sqrt{d_k^{*\prime} J_\phi d_k^*}\,\beta_k^{(c)}$.

The definitions of $B_k$ and $b_k$ depend on whether or not we use the common FLCs model with $f_k^{(c)} = f_k$. Compared with unconstrained Bayesian splines, the extra orthogonality step (b) uses the Cholesky decomposition (which we must compute regardless) and adds only the computational cost of a simple linear regression for each $k > 1$, which is perhaps expected in light of Theorem 2.1. The scaling of $d_k^{(c)}$ and $\beta_k^{(c)}$ in (d) enforces the unit-norm constraint on $f_k^{(c)}$ yet ensures that $f_k^{(c)}(\tau)\beta_k^{(c)}$, which appears in the posterior distribution of $d_j^{(c)}$ for all $j \neq k$, is unaffected by the normalization.

2. Sample the factors $\beta_t$ (and $\theta_t$, if present) conditional on all other parameters in (2.1) using either the DLM implementation of forward filtering backward sampling (e.g., Petris et al. (2009)) or the state space sampler of Durbin and Koopman (2002); Koopman and Durbin (2003, 2000), the latter of which is optimized when $E_t$ is diagonal.
For general hierarchical models, we may modify the hierarchical DLM algorithms of Gamerman and Migon (1993). For the prior distributions, we only need to specify the distribution of $\beta_0$ (and $\theta_0$); the remaining distributions are computed recursively using $F$, $X_t$, $G_t$ and the error variances. For simplicity, we let $\beta_{k,0}^{(c)} \overset{iid}{\sim} N(0, 10^6)$, which is a common choice for DLMs. Alternatively, we could use past data not included in our analysis to estimate these initial values. However, the resulting estimates for $t > 1$ in our applications are not noticeably different.

3. Sample the state evolution matrix $G_t$ (if unknown). $G_t$ may have a special form (see Section A.2.2) or provide a more common time series model such as a VAR. In the latter case, we may choose some structure for $G_t = G$, e.g., diagonality to allow dependence between $\beta_{k,t}^{(c)}$ and $\beta_{k,t-1}^{(c)}$, or $K$ blocks of dimension $C \times C$ to allow dependence between $\beta_{k,t}^{(c)}$ and $\beta_{k,t-1}^{(c')}$ for $c, c' = 1, \ldots, C$. A simple choice of prior for the nonzero entries of $G$ is iid $N(0, 10^6)$, which is conjugate to the likelihood induced by (2.1). Under this prior, it is straightforward to derive the posterior distribution of $\mathrm{vec}_0(G)$, where $\mathrm{vec}_0$ stacks the nonzero entries of the matrix (by column) into a vector.

4. Sample each of the remaining error variance parameters individually: $E_t$, $V_t$, and $W_t$. These distributions depend on our assumptions for the model structure, but we typically prefer conjugate priors when available. For example, in the random walk factor model of (8), we have $\beta_{k,i,s,t} = \beta_{k,i,s,t-1} + \omega_{k,i,s,t}$ with $\omega_{k,i,s,t} \overset{indep}{\sim} N(0, W_k)$. Using the Wishart prior $W_k^{-1} \sim \mathrm{Wishart}((\rho R)^{-1}, \rho)$, the full conditional posterior distribution for the precision is $W_k^{-1} \sim \mathrm{Wishart}((\rho R + \sum_{i,s,t}\omega_{k,i,s,t}\omega_{k,i,s,t}')^{-1}, \rho + T)$, where $\omega_{k,i,s,t} = \beta_{k,i,s,t} - \beta_{k,i,s,t-1}$ is conditional on the factors and $T = (15)(40)(8) = 4800$ counts the indices $(i, s, t)$.
We let $R^{-1} = I_{C\times C}$, which is the expected prior precision, and $\rho = C \geq \mathrm{rank}(R^{-1})$. For the stochastic volatility model of Section 2.4.1, we use the distributions given in Kim et al. (1998). In particular, letting $\sigma_{k,(c),t}^2 = \exp(h_{k,t}^{(c)})$, Kim et al. (1998) propose the model $h_{k,t}^{(c)} = \xi_{k,0}^{(c)} + \xi_{k,1}^{(c)}(h_{k,t-1}^{(c)} - \xi_{k,0}^{(c)}) + \zeta_{k,t}^{(c)}$, where $\zeta_{k,t}^{(c)} \overset{indep}{\sim} N(0, \sigma_{H,k,(c)}^2)$ for $t = 2, \ldots, T$ and $h_{k,1}^{(c)} \sim N(\xi_{k,0}^{(c)}, \sigma_{H,k,(c)}^2/(1 - (\xi_{k,1}^{(c)})^2))$ with $|\xi_{k,1}^{(c)}| < 1$ for stationarity. Kim et al. (1998) also suggest priors for $\xi_{k,0}^{(c)}$, $\xi_{k,1}^{(c)}$, and $\sigma_{H,k,(c)}^2$ and provide an efficient MCMC sampling algorithm. For additional motivation for the stochastic volatility approach over GARCH models, see Daníelsson (1998).

Recall that we construct a posterior distribution of $d_k^{(c)}$ without the unit norm constraint, and then normalize the samples from this distribution. As a result, the conditions of Theorem 2.1 are satisfied and the (unnormalized) full conditional posterior distribution of $d_k^{(c)}$ is Gaussian, both of which are convenient results. The normalization step 1.(d) is interpretable, corresponding to the projection of a Gaussian distribution onto the unit sphere. Note that rescaling the factors $\beta_k^{(c)}$ in 1.(d) does not affect the remainder of the sampling algorithm (steps 2. - 4.). The rescaled $\beta_k^{(c)}$ are from the previous MCMC iteration, which does not affect the full conditional distributions of step 2. in the current MCMC iteration. The subsequent steps 3., 4., and 1. are then conditional on the newly sampled factors $\beta_k^{(c)}$ from step 2., which have not been rescaled.

A.2.2 Sampling the Common Trend Hidden Markov Model

Recall the common trend hidden Markov model for the factors, $k = 1, \ldots, K$:
\[
\begin{aligned}
\Delta^D\beta_{k,t}^{(1)} &= \omega_{k,t}^{(1)}, & \omega_{k,t}^{(1)} &= \sum_{i=1}^r \psi_{k,i}^{(1)}\omega_{k,t-i}^{(1)} + \sigma_{k,(1),t}z_{k,t}^{(1)} \\
\Delta^D\beta_{k,t}^{(c)} &= s_{k,t}^{(c)}\left(\gamma_k^{(c)}\Delta^D\beta_{k,t}^{(1)}\right) + \omega_{k,t}^{(c)}, & \omega_{k,t}^{(c)} &= \sum_{i=1}^r \psi_{k,i}^{(c)}\omega_{k,t-i}^{(c)} + \sigma_{k,(c),t}z_{k,t}^{(c)}
\end{aligned}
\tag{A.1}
\]
for $c = 2, \ldots, C$, where $\Delta$ is the differencing operator, $D$ is the degree of differencing, $\gamma_k^{(c)} \in \mathbb{R}$ is the economy-specific slope term for each factor, $\{s_{k,t}^{(c)} : t = 1, \ldots, T\}$ is a discrete Markov chain with states $\{0, 1\}$, $\sigma_{k,(c),t}^2$ are the time-dependent error variances, and $z_{k,t}^{(c)} \overset{iid}{\sim} N(0, 1)$. We specify iid $N(0, 10^6)$ priors for $\gamma_k^{(c)}$, which are conjugate to the likelihood in (A.1).

We can express (A.1) as the $\beta_t = \theta_t$-level in (2.1) with $X_t = I_{CK\times CK}$ and $V_t = 0_{CK\times CK}$. Let $L_{\beta_t} = I_{CK\times CK} - Q_t$,
\[
Q_t = \begin{pmatrix}
0_{K\times K} & 0_{K\times K} & \cdots & 0_{K\times K} \\
S_t^{(2)}\gamma^{(2)} & 0_{K\times K} & \cdots & 0_{K\times K} \\
\vdots & \vdots & \ddots & \vdots \\
S_t^{(C)}\gamma^{(C)} & 0_{K\times K} & \cdots & 0_{K\times K}
\end{pmatrix},
\]
where $S_t^{(c)} = \mathrm{diag}(\{s_{k,t}^{(c)}\}_{k=1}^K)$ and $\gamma^{(c)} = \mathrm{diag}(\{\gamma_k^{(c)}\}_{k=1}^K)$. Note that $L_{\beta_t}^{-1} = I_{CK\times CK} + Q_t$. To derive the state evolution matrix $G_t$, we can modify the standard ARIMA($r$, $D$, 0) framework for DLMs to incorporate the $s_{k,t}^{(c)}$-dependent common trend. For example, when $D = r = 1$, we have
\[
\Delta\beta_{k,t}^{(c)} - s_{k,t}^{(c)}\gamma_k^{(c)}\Delta\beta_{k,t}^{(1)} = \psi_k^{(c)}\left(\Delta\beta_{k,t-1}^{(c)} - s_{k,t-1}^{(c)}\gamma_k^{(c)}\Delta\beta_{k,t-1}^{(1)}\right) + \sigma_{k,(c),t}z_{k,t}^{(c)},
\]
which can be rewritten as
\[
\beta_{k,t}^{(c)} - s_{k,t}^{(c)}\gamma_k^{(c)}\beta_{k,t}^{(1)} = (1 + \psi_k^{(c)})\beta_{k,t-1}^{(c)} - (s_{k,t}^{(c)} + s_{k,t-1}^{(c)}\psi_k^{(c)})\gamma_k^{(c)}\beta_{k,t-1}^{(1)} - \psi_k^{(c)}\beta_{k,t-2}^{(c)} + \psi_k^{(c)}s_{k,t-1}^{(c)}\gamma_k^{(c)}\beta_{k,t-2}^{(1)} + \sigma_{k,(c),t}z_{k,t}^{(c)}.
\]
The left side of this equation is given by the elements of $L_{\beta_t}\beta_t$, while the right side may clearly be expressed using a simple modification of the standard ARIMA DLM state evolution matrix $G$. In vector notation, we have
\[
\begin{pmatrix} L_{\beta_t} & 0_{CK\times CK} \\ 0_{CK\times CK} & I_{CK\times CK} \end{pmatrix}
\begin{pmatrix} \beta_t \\ \beta_{t-1} \end{pmatrix}
=
\begin{pmatrix} G_{t,1} & G_{t,2} \\ 0_{CK\times CK} & I_{CK\times CK} \end{pmatrix}
\begin{pmatrix} \beta_{t-1} \\ \beta_{t-2} \end{pmatrix}
+
\begin{pmatrix} \tilde{\omega}_t \\ \tilde{\omega}_{t-1} \end{pmatrix}
\tag{A.2}
\]
where
\[
G_{t,1} = (I_{CK\times CK} + \Psi) + \begin{pmatrix}
0_{K\times K} & 0_{K\times(C-1)K} \\
-(S_t^{(2)} + S_{t-1}^{(2)}\Psi^{(2)})\gamma^{(2)} & 0_{K\times(C-1)K} \\
\vdots & \vdots \\
-(S_t^{(C)} + S_{t-1}^{(C)}\Psi^{(C)})\gamma^{(C)} & 0_{K\times(C-1)K}
\end{pmatrix},
\]
\[
G_{t,2} = -\Psi + \begin{pmatrix}
0_{K\times K} & 0_{K\times(C-1)K} \\
S_{t-1}^{(2)}\Psi^{(2)}\gamma^{(2)} & 0_{K\times(C-1)K} \\
\vdots & \vdots \\
S_{t-1}^{(C)}\Psi^{(C)}\gamma^{(C)} & 0_{K\times(C-1)K}
\end{pmatrix},
\quad\text{and}\quad
\mathrm{Var}\begin{pmatrix} \tilde{\omega}_t \\ \tilde{\omega}_{t-1} \end{pmatrix} = \begin{pmatrix} W_t & 0_{CK\times CK} \\ 0_{CK\times CK} & 0_{CK\times CK} \end{pmatrix},
\]
with $\Psi = \mathrm{diag}(\{\psi_k^{(c)}\}_{k,c})$, $W_t = \mathrm{diag}(\{\sigma_{k,(c),t}^2\}_{k,c})$, and $\tilde{\omega}_t$ has elements $\tilde{\omega}_{k,t}^{(c)} = \sigma_{k,(c),t}z_{k,t}^{(c)}$, which are the residuals from the AR($r$) process in (A.1).

Many of these matrix multiplications involve diagonal matrices, and therefore may be computed quickly. The error variance is not a proper variance matrix, but is commonly used for sampling DLMs with multiple lags or differencing. Note that to write (2.1) in this form, we must also append $CK$ columns of zeros to $F(\tau)$, since $Y_t(\tau)$ depends on $\beta_t$ but not on $\beta_{t-1}$.

Inverting the block diagonal matrix $\tilde{L}_{\beta_t} = \mathrm{bdiag}(L_{\beta_t}, I_{CK\times CK})$, we obtain $\tilde{L}_{\beta_t}^{-1} = \mathrm{bdiag}(L_{\beta_t}^{-1}, I_{CK\times CK}) = \mathrm{bdiag}(I_{CK\times CK} + Q_t, I_{CK\times CK})$. Therefore, we can rewrite (A.2) as
\[
\begin{pmatrix} \beta_t \\ \beta_{t-1} \end{pmatrix}
=
\begin{pmatrix} G_{t,1} + Q_tG_{t,1} & G_{t,2} + Q_tG_{t,2} \\ 0_{CK\times CK} & I_{CK\times CK} \end{pmatrix}
\begin{pmatrix} \beta_{t-1} \\ \beta_{t-2} \end{pmatrix}
+ \tilde{L}_{\beta_t}^{-1}\begin{pmatrix} \tilde{\omega}_t \\ \tilde{\omega}_{t-1} \end{pmatrix}
\tag{A.3}
\]
where the error variance has the same block form as previously, but with $W_t$ replaced by $L_{\beta_t}^{-1}W_t(L_{\beta_t}^{-1})' = (I_{CK\times CK} + Q_t)W_t(I_{CK\times CK} + Q_t)' = W_t + Q_tW_t + (Q_tW_t)' + Q_tW_tQ_t'$. Letting $\sigma_{(c),t}^2 = \mathrm{diag}(\{\sigma_{k,(c),t}^2\}_{k=1}^K)$ so that $W_t = \mathrm{bdiag}(\sigma_{(1),t}^2, \ldots, \sigma_{(C),t}^2)$, we may compute the relevant terms explicitly:
\[
Q_tW_t = \begin{pmatrix}
0_{K\times K} & 0_{K\times K} & \cdots & 0_{K\times K} \\
S_t^{(2)}\gamma^{(2)}\sigma_{(1),t}^2 & 0_{K\times K} & \cdots & 0_{K\times K} \\
\vdots & \vdots & \ddots & \vdots \\
S_t^{(C)}\gamma^{(C)}\sigma_{(1),t}^2 & 0_{K\times K} & \cdots & 0_{K\times K}
\end{pmatrix}
\]
and
\[
Q_tW_tQ_t' = \begin{pmatrix}
0_{K\times K} & 0_{K\times K} & \cdots & 0_{K\times K} \\
0_{K\times K} & S_t^{(2)}\gamma^{(2)}\sigma_{(1),t}^2S_t^{(2)}\gamma^{(2)} & \cdots & S_t^{(2)}\gamma^{(2)}\sigma_{(1),t}^2S_t^{(C)}\gamma^{(C)} \\
\vdots & \vdots & \ddots & \vdots \\
0_{K\times K} & S_t^{(C)}\gamma^{(C)}\sigma_{(1),t}^2S_t^{(2)}\gamma^{(2)} & \cdots & S_t^{(C)}\gamma^{(C)}\sigma_{(1),t}^2S_t^{(C)}\gamma^{(C)}
\end{pmatrix}
\]
where again, the component terms are all diagonal, and therefore can be reordered for convenience.
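The two matrix identities used above, $L_{\beta_t}^{-1} = I + Q_t$ and $(I + Q_t)W_t(I + Q_t)' = W_t + Q_tW_t + (Q_tW_t)' + Q_tW_tQ_t'$, can be checked numerically. The sketch below is only an illustration in plain Python: the helper names (`build_Q`, `matmul`, and so on) and the values of $s$, $\gamma$, and $\sigma^2$ are hypothetical, with $C = 3$ economies and $K = 2$ factors.

```python
def matmul(A, B):
    """Dense matrix product for small list-of-lists matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def transpose(A):
    return [list(row) for row in zip(*A)]


def add(*Ms):
    """Entrywise sum of matrices with identical dimensions."""
    return [[sum(M[i][j] for M in Ms) for j in range(len(Ms[0][0]))]
            for i in range(len(Ms[0]))]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def build_Q(s, gamma):
    """Build Q_t: block (c, 1) is the diagonal matrix S_t^(c) gamma^(c) for c = 2..C.

    s[c - 2][k] and gamma[c - 2][k] hold the state and slope for economy c, factor k.
    """
    C, K = len(s) + 1, len(s[0])
    Q = [[0.0] * (C * K) for _ in range(C * K)]
    for c in range(1, C):
        for k in range(K):
            Q[c * K + k][k] = s[c - 1][k] * gamma[c - 1][k]
    return Q


# Hypothetical values: C = 3 economies, K = 2 factors.
s = [[1, 0], [1, 1]]
gamma = [[0.5, 2.0], [-1.5, 0.25]]
sigma2 = [1.0, 4.0, 2.0, 3.0, 0.5, 8.0]  # diagonal of W_t, ordered economy by economy

n = len(sigma2)
I = identity(n)
Q = build_Q(s, gamma)
W = [[sigma2[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

L = add(I, [[-q for q in row] for row in Q])  # L = I - Q
L_inv = add(I, Q)                             # claimed inverse, since Q Q = 0
V = matmul(matmul(L_inv, W), transpose(L_inv))
QW = matmul(Q, W)
V_expanded = add(W, QW, transpose(QW), matmul(QW, transpose(Q)))
```

The inverse takes this simple form because the first block row of $Q_t$ is zero, so $Q_tQ_t = 0$ and $(I - Q_t)(I + Q_t) = I$.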
Combining terms and simplifying, the nonzero upper left block of the error variance matrix, $L_{\beta_t}^{-1}W_t(L_{\beta_t}^{-1})'$, is
\[
\begin{pmatrix}
\sigma_{(1),t}^2 & S_t^{(2)}\gamma^{(2)}\sigma_{(1),t}^2 & \cdots & S_t^{(C)}\gamma^{(C)}\sigma_{(1),t}^2 \\
S_t^{(2)}\gamma^{(2)}\sigma_{(1),t}^2 & \sigma_{(2),t}^2 + S_t^{(2)}(\gamma^{(2)})^2\sigma_{(1),t}^2 & \cdots & S_t^{(2)}S_t^{(C)}\gamma^{(2)}\gamma^{(C)}\sigma_{(1),t}^2 \\
\vdots & \vdots & \ddots & \vdots \\
S_t^{(C)}\gamma^{(C)}\sigma_{(1),t}^2 & S_t^{(2)}S_t^{(C)}\gamma^{(2)}\gamma^{(C)}\sigma_{(1),t}^2 & \cdots & \sigma_{(C),t}^2 + S_t^{(C)}(\gamma^{(C)})^2\sigma_{(1),t}^2
\end{pmatrix}
\]
When $s_{k,t}^{(c)} = 1$, $c > 1$, the slope parameter $\gamma_k^{(c)}$ may increase or decrease the error variance of the residuals $\tilde{\omega}_{k,t}^{(c)}$ at time $t$, and determines the contemporaneous covariance between $\tilde{\omega}_{k,t}^{(c)}$ and $\tilde{\omega}_{k,t}^{(1)}$. Similarly, when $s_{k,t}^{(c)} = s_{k,t}^{(c')} = 1$, the product $\gamma_k^{(c)}\gamma_k^{(c')}\sigma_{k,(1),t}^2$ determines the contemporaneous covariance between $\tilde{\omega}_{k,t}^{(c)}$ and $\tilde{\omega}_{k,t}^{(c')}$ at time $t$. Note that as long as there exist distinct times $t, t'$ such that $s_{k,t}^{(c)} \neq s_{k,t'}^{(c)}$, the slopes and volatilities are identifiable for each $k$ and $c$. Therefore, the common trend hidden Markov model of (A.1) provides a flexible, time-dependent contemporaneous covariance structure within a relatively simple regression framework.

A.3 Additional Figures

[Figure A.1 panels: Lower 95% HPD Interval; Posterior Mean; Upper 95% HPD Interval. Axes: Time Bin (horizontal); Frequency (Hz) (vertical).]

Figure A.1: Pointwise 95% HPD intervals and the posterior mean for $\bar{\mu}_t^{(1)}$, which is the average difference in the PFC log-spectra between the FC and FS trials. The black vertical lines indicate $t^*$.

[Figure A.2 panels: Lower 95% HPD Interval; Posterior Mean; Upper 95% HPD Interval. Axes: Time Bin (horizontal); Frequency (Hz) (vertical).]

Figure A.2: Pointwise 95% HPD intervals and the posterior mean for $\bar{\mu}_t^{(2)}$, which is the average difference in the PFC log-spectra between the FC and FS trials. The black vertical lines indicate $t^*$.

[Figure A.3 panels: Fed, BOE, ECB, and BOC (rows) for k = 1, 2, 3, 4 (columns). Axes: Dates, 2006 to 2014 (horizontal); Squared Factor Residuals (vertical).]

Figure A.3: The observed volatility clustering from the yield curve application. The black lines are the posterior means of the squared residuals from the AR(1) process on the $\omega_{k,t}^{(c)}$ in the common trend hidden Markov model of Section 2.4.1. The red lines are the posterior means of the corresponding volatility estimates $\sigma_{k,(c),t}^2$ discussed in Section 2.4.1.
APPENDIX B

DYNAMIC SHRINKAGE PROCESSES

Proof. (Proposition 3.1) Proposition 3.1 follows from Proposition 3.2 with $\mu_z = 0$.

Proof. (Proposition 3.2) Let $\eta \sim Z(\alpha, \beta, \mu_z, 1)$ with density (3.3), i.e.,
\[
[z] = [\sigma_z B(\alpha, \beta)]^{-1}\{\exp[(z - \mu_z)/\sigma_z]\}^{\alpha}\{1 + \exp[(z - \mu_z)/\sigma_z]\}^{-(\alpha+\beta)}.
\]
The density of $\lambda^2 = \exp(\eta)$ is
\[
[\lambda^2] \propto (\lambda^2)^{-1}\{\exp[\log(\lambda^2) - \mu_z]\}^{\alpha}\{1 + \exp[\log(\lambda^2) - \mu_z]\}^{-(\alpha+\beta)} \propto (\lambda^2)^{\alpha-1}[1 + \lambda^2/\exp(\mu_z)]^{-(\alpha+\beta)}
\]
and therefore the density of $\kappa = 1/(1 + \lambda^2)$ is
\[
\begin{aligned}
[\kappa] &\propto \kappa^{-2}[\kappa^{-1} - 1]^{\alpha-1}\{1 + (\kappa^{-1} - 1)/\exp(\mu_z)\}^{-(\alpha+\beta)} \\
&\propto \kappa^{-2-(\alpha-1)}(1 - \kappa)^{\alpha-1}\{\kappa^{-1}[\kappa\exp(\mu_z) + (1 - \kappa)]\}^{-(\alpha+\beta)} \\
&\propto (1 - \kappa)^{\alpha-1}\kappa^{\beta-1}[\kappa\exp(\mu_z) + (1 - \kappa)]^{-(\alpha+\beta)}
\end{aligned}
\]
i.e., $\kappa \sim \mathrm{TPB}(\beta, \alpha, \exp(\mu_z))$.

Proof. (Theorem 3.1) Under model (3.2), i.e., $h_{t+1} = \mu + \phi(h_t - \mu) + \eta_t$, $\eta_t \overset{iid}{\sim} Z(\alpha, \beta, 0, 1)$, we have $[h_{t+1}|h_t, \phi, \mu] \sim Z(\alpha, \beta, \mu + \phi(h_t - \mu), 1)$. Using Proposition 3.2, the conditional distribution for $\kappa_{t+1}$ is $[\kappa_{t+1}|h_t, \phi, \mu] \sim \mathrm{TPB}(\beta, \alpha, \exp(\mu + \phi(h_t - \mu)))$. By substituting $\tau^2 = \exp(\mu)$ and $\lambda_t^2 = \exp(h_t - \mu)$, we equivalently have $[\kappa_{t+1}|\lambda_t, \phi, \tau] \sim \mathrm{TPB}(\beta, \alpha, \tau^2\lambda_t^{2\phi})$. Noting $\tau^2\lambda_t^{2\phi} = \tau^{2(1-\phi)}\left[\frac{1 - \kappa_t}{\kappa_t}\right]^{\phi}$ completes the proof.

Proof. (Theorem 3.2) Let $\gamma_t = [(1 - \kappa_t)/\kappa_t]^{\phi}$ and note that $\kappa \mapsto \kappa^{-1/2}$ and $\kappa \mapsto [1 + (\gamma_t - 1)\kappa]^{-1}$ are decreasing in $\kappa$ for $\gamma_t > 1$. It follows that, for $\gamma_t > 1$,
\[
\begin{aligned}
P\left(\kappa_{t+1} > \varepsilon \,\middle|\, \{\kappa_s\}_{s\leq t}, \phi\right) &= \int_{\varepsilon}^{1} \pi^{-1}\gamma_t^{1/2}\kappa_{t+1}^{-1/2}(1 - \kappa_{t+1})^{-1/2}[1 + (\gamma_t - 1)\kappa_{t+1}]^{-1}\, d\kappa_{t+1} \\
&\leq \pi^{-1}\gamma_t^{1/2}\varepsilon^{-1/2}[1 + (\gamma_t - 1)\varepsilon]^{-1}\int_{\varepsilon}^{1}(1 - \kappa_{t+1})^{-1/2}\, d\kappa_{t+1} \\
&\leq 2\pi^{-1}\varepsilon^{-1/2}(1 - \varepsilon)^{1/2}\frac{\gamma_t^{1/2}}{1 + (\gamma_t - 1)\varepsilon}
\end{aligned}
\]
which converges to zero as $\kappa_t \to 0$, since $\kappa_t \to 0$ implies $\gamma_t \to \infty$.

Proof. (Theorem 3.3) Marginalizing over $\omega_t$, the likelihood is $[y_{t+1}|\{\kappa_s\}] \overset{indep}{\sim} N(0, \kappa_{t+1}^{-1})$. From Theorem 3.1, the posterior distribution of $\kappa_{t+1}$ may be computed as
\[
\begin{aligned}
[\kappa_{t+1}|y_{t+1}, \{\kappa_s\}_{s\leq t}, \phi, \tau] &\propto \left\{\kappa_{t+1}^{\beta-1}(1 - \kappa_{t+1})^{\alpha-1}[1 + (\gamma_t - 1)\kappa_{t+1}]^{-(\alpha+\beta)}\right\} \times \left\{\kappa_{t+1}^{1/2}\exp(-y_{t+1}^2\kappa_{t+1}/2)\right\} \\
&\propto (1 - \kappa_{t+1})^{-1/2}[1 + (\gamma_t - 1)\kappa_{t+1}]^{-1}\exp(-y_{t+1}^2\kappa_{t+1}/2)
\end{aligned}
\]
for $\alpha = \beta = 1/2$, where $\gamma_t = \tau^{2(1-\phi)}[(1 - \kappa_t)/\kappa_t]^{\phi}$. Defining $p_1(\kappa) = (1 - \kappa)^{-1/2}$, $p_2(\kappa|\gamma_t) = [1 + (\gamma_t - 1)\kappa]^{-1}$, and $p_3(\kappa|y_{t+1}) = \exp(-y_{t+1}^2\kappa/2)$ for $\kappa \in (0, 1)$, observe that $p_1(\cdot)$ is increasing in $\kappa$, $p_2(\kappa|\gamma_t) \leq [p_1(\kappa)]^2$ for all $\gamma_t \geq 0$, and $p_3(\cdot)$ is decreasing in $\kappa$. Similar to Datta and Ghosh (2013), the following inequalities hold for $\varepsilon \in (0, 1)$ with $\varepsilon' = 1 - \varepsilon$:
\[
\begin{aligned}
P\left(\kappa_{t+1} < \varepsilon' \,\middle|\, y_{t+1}, \{\kappa_s\}_{s\leq t}, \phi, \tau\right)
&\leq \frac{P\left(\kappa_{t+1} < \varepsilon' \,\middle|\, y_{t+1}, \{\kappa_s\}_{s\leq t}, \phi, \tau\right)}{P\left(\kappa_{t+1} > \varepsilon' \,\middle|\, y_{t+1}, \{\kappa_s\}_{s\leq t}, \phi, \tau\right)} \\
&\leq \frac{\int_0^{\varepsilon'}(1 - \kappa_{t+1})^{-3/2}\exp(-y_{t+1}^2\kappa_{t+1}/2)\, d\kappa_{t+1}}{\int_{\varepsilon'}^{1}[1 + (\gamma_t - 1)\kappa_{t+1}]^{-3/2}\exp(-y_{t+1}^2\kappa_{t+1}/2)\, d\kappa_{t+1}} \\
&\leq \frac{\int_0^{\varepsilon'}(1 - \kappa_{t+1})^{-3/2}\, d\kappa_{t+1}}{\exp(-y_{t+1}^2/2)\int_{\varepsilon'}^{1}[1 + (\gamma_t - 1)\kappa_{t+1}]^{-3/2}\, d\kappa_{t+1}} \\
&\leq \frac{2[(1 - \varepsilon')^{-1/2} - 1]}{\exp(-y_{t+1}^2/2)\, 2(\gamma_t - 1)^{-1}\left\{[1 + (\gamma_t - 1)\varepsilon']^{-1/2} - \gamma_t^{-1/2}\right\}} \\
&\leq \left\{(1 - \varepsilon')^{-1/2} - 1\right\}\exp(y_{t+1}^2/2)\,\gamma_t^{1/2} \times \left\{\frac{1 - \gamma_t}{1 - \gamma_t^{1/2}/[1 + (\gamma_t - 1)\varepsilon']^{1/2}}\right\}.
\end{aligned}
\]
Noting the final term in curly braces converges to 1 as $\gamma_t \to 0$, we obtain $P\left(\kappa_{t+1} < 1 - \varepsilon \,\middle|\, y_{t+1}, \{\kappa_s\}_{s\leq t}, \phi, \tau\right) \to 0$ as $\gamma_t \to 0$. The result for (a) follows immediately.
For $\varepsilon \in (0, 1)$ and $\gamma_t < 1$, and observing that $p_2(\kappa|\gamma_t)$ is increasing in $\kappa$ for $\gamma_t < 1$, then for any $\delta \in (0, 1)$,
\[
\begin{aligned}
P\left(\kappa_{t+1} > \varepsilon \,\middle|\, y_{t+1}, \{\kappa_s\}_{s\leq t}, \phi, \tau\right)
&\leq \frac{\gamma_t^{-1}\exp(-y_{t+1}^2\varepsilon/2)\int_{\varepsilon}^{1}(1 - \kappa_{t+1})^{-1/2}\, d\kappa_{t+1}}{\int_0^{\delta\varepsilon}\exp(-y_{t+1}^2\kappa_{t+1}/2)\, d\kappa_{t+1}} \\
&\leq \frac{\gamma_t^{-1}\exp(-y_{t+1}^2\varepsilon/2)\, 2(1 - \varepsilon)^{1/2}}{\exp(-y_{t+1}^2\delta\varepsilon/2)\,\delta\varepsilon} \\
&= \exp\left(-y_{t+1}^2\varepsilon[1 - \delta]/2\right)\gamma_t^{-1}\, 2(1 - \varepsilon)^{1/2}(\delta\varepsilon)^{-1}
\end{aligned}
\]
which converges to zero as $|y_{t+1}| \to \infty$, proving (b).

Proof. (Theorem 3.4) The density of $\eta \sim Z(\alpha, \beta, 0, 1)$ may be written
\[
\begin{aligned}
[\eta] &= \frac{1}{B(\alpha, \beta)}\frac{[\exp(\eta)]^{\alpha}}{[1 + \exp(\eta)]^{\alpha+\beta}} \\
&= \frac{1}{B(\alpha, \beta)}\, 2^{-(\alpha+\beta)}\exp\{\eta[\alpha - (\alpha + \beta)/2]\}\int_0^{\infty}\exp(-\eta^2\xi/2)\, p_{\alpha+\beta}(\xi)\, d\xi
\end{aligned}
\]
using Theorem 1 of Polson et al. (2013), where $p_b(\xi)$ is the density of the random variable $\xi \sim \mathrm{PG}(b, 0)$, $b > 0$. It follows that
\[
[\eta] \propto \int_0^{\infty}\exp\left\{-\frac{1}{2}\left[\eta^2\xi - \eta(\alpha - \beta)\right]\right\}p_{\alpha+\beta}(\xi)\, d\xi \propto \int_0^{\infty} f_N\left(\eta;\ \xi^{-1}[\alpha - \beta]/2,\ \xi^{-1}\right)p_{\alpha+\beta}(\xi)\, d\xi
\]
where $f_N(\eta; \mu_N, \sigma_N^2)$ is the density of the random variable $\eta \sim N(\mu_N, \sigma_N^2)$. The conditional distribution $[\xi|\eta] \sim \mathrm{PG}(\alpha + \beta, \eta)$ is a direct result of Polson et al. (2013).

B.1 MCMC Sampling Algorithm and Computational Details

We design a Gibbs sampling algorithm for the dynamic shrinkage process. The sampling algorithm is both computationally and MCMC efficient, and builds upon two main components: (1) a stochastic volatility sampling algorithm (Kastner and Frühwirth-Schnatter, 2014) augmented with a Pólya-Gamma sampler (Polson et al., 2013); and (2) a Cholesky Factor Algorithm (CFA, Rue, 2001) for sampling the state variables in the dynamic linear model. Alternative sampling algorithms exist for more general DLMs, such as the simulation smoothing algorithm of Durbin and Koopman (2002). However, as demonstrated by McCausland et al. (2011) and explored in Chan and Jeliazkov (2009) and Chan (2013), the CFA sampler is often more efficient. Importantly, both components employ algorithms that are linear in the number of time points, which produces a highly efficient sampling algorithm.

The general sampling algorithm is as follows, with the details provided in the subsequent sections:

1. Sample the dynamic shrinkage components (Section 3.5.1)
   (a) Log-volatilities, $\{h_t\}$
   (b) Pólya-Gamma mixing parameters, $\{\xi_t\}$
   (c) Unconditional mean of log-volatility, $\mu$
   (d) AR(1) coefficient of log-volatility, $\phi$
   (e) Discrete mixture component indicators, $\{s_t\}$

2. Sample the state variables, $\{\beta_t\}$ (Section B.1.2)

3. Sample the observation error variance, $\sigma^2$.

For the observation error variance, we follow Carvalho et al. (2010) and assume the Jeffreys' prior $[\sigma^2] \propto 1/\sigma^2$. The full conditional distribution is
\[
[\sigma^2|\{y_t\}_{t=1}^T, \{\beta_t\}_{t=1}^T, \tau] \propto \sigma^{-2} \times \sigma^{-T}\exp\left\{-\frac{1}{2\sigma^2}\sum_{t=1}^T(y_t - \beta_t)^2\right\} \times \frac{1}{\sigma(1 + T\tau^2/\sigma^2)},
\]
where the last term arises from $\tau \sim C^+(0, \sigma/\sqrt{T})$. We sample from this distribution using the slice sampler of Neal (2003).

If we instead use a stochastic volatility model for the observation error variance as in Sections 3.3.2 and 3.4.2, we replace this step with a stochastic volatility sampling algorithm (e.g., Kastner and Frühwirth-Schnatter, 2014), which requires additional sampling steps for the corresponding log-volatility and the unconditional mean, AR(1) coefficient, and evolution error variance of log-volatility. An efficient implementation of such a sampler is available in the R package stochvol (Kastner, 2016). In this setting, we do not scale $\tau$ by the standard deviation, and instead assume $\tau \sim C^+(0, 1/\sqrt{T})$.

In Figure B.1, we provide empirical evidence for the linear time $O(T)$ computations of the Bayesian trend filtering model with dynamic horseshoe innovations. The runtime per 1000 MCMC iterations is less than 6 minutes (on a MacBook Pro, 2.7 GHz Intel Core i5) for sample sizes up to $T = 10^5$, so the Gibbs sampling algorithm is scalable.

[Figure B.1 plot: Computation Time for BTF-DHS (per 1000 MCMC iterations). Axes: Number of Time Points, T, from 0 to 1e+05 (horizontal); Time (minutes) (vertical).]

Figure B.1: Computation time per 1000 MCMC iterations for the Bayesian trend filtering model with dynamic horseshoe innovations (BTF-DHS).
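For reference, the unnormalized full conditional of $\sigma^2$ stated earlier in this section is easy to evaluate on the log scale, which is the only ingredient a generic slice sampler needs. The following Python sketch is an illustration under stated assumptions: the function name `log_full_conditional_sigma2` and the toy inputs are hypothetical, and the slice sampling step itself is not implemented here.

```python
import math


def log_full_conditional_sigma2(sigma2, y, beta, tau):
    """Unnormalized log density of sigma^2 given {y_t}, {beta_t}, and tau.

    Combines three terms:
      - Jeffreys prior:  1 / sigma^2
      - Gaussian likelihood:  sigma^{-T} exp{-SSE / (2 sigma^2)}
      - half-Cauchy term from tau ~ C+(0, sigma / sqrt(T)):
            1 / (sigma * (1 + T tau^2 / sigma^2))
    """
    T = len(y)
    sse = sum((yt - bt) ** 2 for yt, bt in zip(y, beta))
    log_prior = -math.log(sigma2)
    log_lik = -0.5 * T * math.log(sigma2) - sse / (2.0 * sigma2)
    log_tau = -0.5 * math.log(sigma2) - math.log1p(T * tau ** 2 / sigma2)
    return log_prior + log_lik + log_tau


# Toy inputs (hypothetical): two observations with state values fixed at zero.
value = log_full_conditional_sigma2(1.0, y=[1.0, 2.0], beta=[0.0, 0.0], tau=1.0)
```

Working on the log scale avoids underflow in the likelihood term when $T$ is large.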
B.1.1 Efficient Sampling for the Dynamic Shrinkage Process

Consider the (univariate) dynamic shrinkage process in (3.2) with the Pólya-Gamma parameter expansion of Theorem 3.4. We provide implementation details for the dynamic horseshoe prior with $\alpha = \beta = 1/2$, but extensions to other cases are straightforward. The SV sampling framework of Kastner and Frühwirth-Schnatter (2014) represents the likelihood for $h_t$ on the log-scale, and approximates the ensuing $\log \chi_1^2$ distribution for the errors via a known discrete mixture of Gaussian distributions. In particular, let $\tilde{y}_t = \log(\omega_t^2 + c)$, where $c$ is a small offset to avoid numerical issues. Conditional on the mixture component indicators $s_t$, the likelihood is $\tilde{y}_t \stackrel{indep}{\sim} N(h_t + m_{s_t}, v_{s_t})$, where $m_i$ and $v_i$, $i = 1, \ldots, 10$, are the pre-specified mean and variance components of the 10-component Gaussian mixture provided in Omori et al. (2007). The evolution equation is $h_{t+1} = \mu + \phi(h_t - \mu) + \eta_t$ with initialization $h_1 = \mu + \eta_0$ and innovations $[\eta_t \mid \xi_t] \stackrel{indep}{\sim} N(0, \xi_t^{-1})$ for $[\xi_t] \stackrel{iid}{\sim} \mathrm{PG}(1, 0)$.

To sample $h = (h_1, \ldots, h_T)'$ jointly, we directly compute the posterior distribution of $h$ and exploit the tridiagonal structure of the resulting posterior precision matrix. In particular, we equivalently have $\tilde{y} \sim N(m + \tilde{h} + \tilde{\mu}, \Sigma_v)$ and $D_\phi \tilde{h} \sim N(0, \Sigma_\xi)$, where $m = (m_{s_1}, \ldots, m_{s_T})'$, $\tilde{h} = (h_1 - \mu, \ldots, h_T - \mu)'$, $\tilde{\mu} = (\mu, (1 - \phi)\mu, \ldots, (1 - \phi)\mu)'$, $\Sigma_v = \mathrm{diag}\left(\{v_{s_t}\}_{t=1}^T\right)$, $\Sigma_\xi = \mathrm{diag}\left(\{\xi_t^{-1}\}_{t=1}^T\right)$, and $D_\phi$ is a lower triangular matrix with ones on the diagonal, $-\phi$ on the first off-diagonal, and zeros elsewhere. We sample from the posterior distribution of $h$ by sampling from the posterior distribution of $\tilde{h}$ and setting $h = \tilde{h} + \mu 1$ for $1$ a $T$-dimensional vector of ones.
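
As a small numerical check on the matrix form of the evolution equation, the sketch below simulates a log-volatility path from the AR(1) recursion and confirms that applying $D_\phi$ to the centered path returns the innovations $(\eta_0, \eta_1, \ldots, \eta_{T-1})$. It is an illustration only: the function names are hypothetical, and a dense matrix is used for readability even though only the two nonzero bands are needed in practice.

```python
def build_D_phi(T, phi):
    """Lower triangular matrix with ones on the diagonal and -phi on the first sub-diagonal."""
    D = [[0.0] * T for _ in range(T)]
    for t in range(T):
        D[t][t] = 1.0
        if t > 0:
            D[t][t - 1] = -phi
    return D


def simulate_log_vol(mu, phi, eta):
    """h_1 = mu + eta_0 and h_{t+1} = mu + phi * (h_t - mu) + eta_t."""
    h = [mu + eta[0]]
    for t in range(1, len(eta)):
        h.append(mu + phi * (h[-1] - mu) + eta[t])
    return h


def innovations_from_path(h, mu, phi):
    """Apply D_phi to the centered path h - mu; this should return the innovations."""
    T = len(h)
    D = build_D_phi(T, phi)
    h_tilde = [ht - mu for ht in h]
    return [sum(D[r][c] * h_tilde[c] for c in range(T)) for r in range(T)]
```

Because $D_\phi$ is bidiagonal, the product $D_\phi' \Sigma_\xi^{-1} D_\phi$ is tridiagonal, which is the structure exploited by the sampler.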
The required posterior distribution is $\tilde{h} \sim N\left(Q_{\tilde{h}}^{-1} \ell_{\tilde{h}}, Q_{\tilde{h}}^{-1}\right)$, where $Q_{\tilde{h}} = \Sigma_v^{-1} + D_\phi' \Sigma_\xi^{-1} D_\phi$ is a tridiagonal symmetric matrix with diagonal elements $d_0(Q_{\tilde{h}})$ and first off-diagonal elements $d_1(Q_{\tilde{h}})$ defined as
\[
\begin{aligned}
d_0(Q_{\tilde{h}}) &= \left[(v_{s_1}^{-1} + \xi_1 + \phi^2 \xi_2), (v_{s_2}^{-1} + \xi_2 + \phi^2 \xi_3), \ldots, (v_{s_{T-1}}^{-1} + \xi_{T-1} + \phi^2 \xi_T), (v_{s_T}^{-1} + \xi_T)\right], \\
d_1(Q_{\tilde{h}}) &= \left[(-\phi \xi_2), (-\phi \xi_3), \ldots, (-\phi \xi_T)\right],
\end{aligned}
\]
and
\[
\ell_{\tilde{h}} = \Sigma_v^{-1}\left(\tilde{y} - m - \tilde{\mu}\right) = \left[\frac{\tilde{y}_1 - m_{s_1} - \mu}{v_{s_1}}, \frac{\tilde{y}_2 - m_{s_2} - (1 - \phi)\mu}{v_{s_2}}, \ldots, \frac{\tilde{y}_T - m_{s_T} - (1 - \phi)\mu}{v_{s_T}}\right]'.
\]
Drawing from this posterior distribution is straightforward and efficient, using the band back-substitution described in Kastner and Frühwirth-Schnatter (2014): (1) compute the Cholesky decomposition $Q_{\tilde{h}} = LL'$, where $L$ is lower triangular; (2) solve $La = \ell_{\tilde{h}}$ for $a$; and (3) solve $L'\tilde{h} = a + e$ for $\tilde{h}$, where $e \sim N(0, I_T)$.

Conditional on the log-volatilities $\{h_t\}$, we sample the AR(1) evolution parameters: the log-innovation precisions $\{\xi_t\}$, the autoregressive coefficient $\phi$, and the unconditional mean $\mu$. The precisions are distributed $[\xi_t \mid \eta_t] \sim \mathrm{PG}(1, \eta_t)$ for $\eta_t = h_{t+1} - \mu - \phi(h_t - \mu)$, which we sample using the rpg() function in the R package BayesLogit (Polson et al., 2013). The Pólya-Gamma sampler is efficient: using only exponential and inverse-Gaussian draws, Polson et al. (2013) construct an accept-reject sampler for which the probability of acceptance is uniformly bounded below at 0.99919, and which does not require any tuning. Next, we assume the prior $[(\phi + 1)/2] \sim \mathrm{Beta}(a_\phi, b_\phi)$, which restricts $|\phi| < 1$ for stationarity, and sample from the full conditional distribution of $\phi$ using the slice sampler of Neal (2003). We select $a_\phi = 10$ and $b_\phi = 2$, which places most of the mass for the density of $\phi$ in $(0, 1)$ with a prior mean of 2/3 and a prior mode of 4/5 to reflect the likely presence of persistent volatility clustering. The prior for the global scale parameter is $\tau \sim C^+(0, \sigma_\epsilon/\sqrt{T})$, which implies $\mu = \log(\tau^2)$ is $[\mu \mid \sigma_\epsilon, \xi_\mu] \sim N(\log(\sigma_\epsilon^2/T), \xi_\mu^{-1})$ with $\xi_\mu \sim \mathrm{PG}(1, 0)$.
Including the initialization $h_1 \sim N(\mu, \xi_0^{-1})$ with $\xi_0 \sim \mathrm{PG}(1, 0)$, the posterior distribution for $\mu$ is $\mu \sim N(Q_\mu^{-1} \ell_\mu, Q_\mu^{-1})$ with
\[
Q_\mu = \xi_\mu + \xi_0 + (1 - \phi)^2 \sum_{t=1}^{T-1} \xi_t
\quad \text{and} \quad
\ell_\mu = \xi_\mu \log(\sigma_\epsilon^2/T) + \xi_0 h_1 + (1 - \phi) \sum_{t=1}^{T-1} \xi_t (h_{t+1} - \phi h_t).
\]
Sampling $\xi_\mu$ and $\xi_0$ follows the Pólya-Gamma sampling scheme above.

Finally, we sample the discrete mixture component indicators $s_t$. The discrete mixture probabilities are straightforward to compute: the prior mixture probabilities are the mixing proportions given by Omori et al. (2007) and the likelihood is $\tilde{y}_t \stackrel{indep}{\sim} N(h_t + m_{s_t}, v_{s_t})$; see Kastner and Frühwirth-Schnatter (2014) for details.

In the multivariate setting $p > 1$ of (3.10) with $\Phi = \mathrm{diag}(\phi_1, \ldots, \phi_p)$, we may modify the log-volatility sampler of $\{h_{j,t}\}$ by redefining relevant quantities using the ordering $h = (h_{1,1}, \ldots, h_{1,T}, h_{2,1}, \ldots, h_{p,T})'$. In particular, the posterior precision matrix is again tridiagonal, but with diagonal elements $d_0(Q_{\tilde{h}}) = \left[d_{0,1}(Q_{\tilde{h}}), \ldots, d_{0,p}(Q_{\tilde{h}})\right]$ and first off-diagonal elements $d_1(Q_{\tilde{h}}) = \left[d_{1,1}(Q_{\tilde{h}}), 0, d_{1,2}(Q_{\tilde{h}}), 0, \ldots, 0, d_{1,p}(Q_{\tilde{h}})\right]$, where $d_{0,j}(Q_{\tilde{h}})$ and $d_{1,j}(Q_{\tilde{h}})$ are the diagonal elements and first off-diagonal elements, respectively, for predictor $j$ as computed in the univariate case above. Similarly, the linear term is $\ell_{\tilde{h}} = \left[\ell_{\tilde{h},1}', \ldots, \ell_{\tilde{h},p}'\right]'$, where $\ell_{\tilde{h},j}$ is the linear term for predictor $j$ as computed in the univariate case. The parameters $\xi_{j,t}$, $\phi_j$, and $s_{j,t}$ may be sampled independently as in the univariate case, while samplers for $\{\mu_j\}$ and $\mu_0$ proceed as in a standard hierarchical Gaussian model. For the more general case of non-diagonal $\Phi$, we may use a simulation smoothing algorithm (e.g., Durbin and Koopman, 2002) for the log-volatilities $\{h_{j,t}\}$, while the sampler for $\Phi$ will depend on the chosen prior.

B.1.2 Efficient Sampling for the State Variables

In the univariate setting of (3.8), the sampler for $\beta = (\beta_1, \ldots, \beta_T)'$ is similar to the log-volatility sampler in Section 3.5.1.
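
Since both samplers rest on the same banded Cholesky draw, the tridiagonal case for the log-volatilities can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of Kastner and Frühwirth-Schnatter (2014): all function names are hypothetical, the bands are built from $Q_{\tilde{h}} = \Sigma_v^{-1} + D_\phi' \Sigma_\xi^{-1} D_\phi$ with zero-based indexing (so `xi[0]` plays the role of $\xi_1$), and the optional `noise` argument exists only so that the posterior mean can be checked deterministically.

```python
import math
import random


def log_vol_precision(v, xi, phi):
    """Diagonal (d0) and first off-diagonal (d1) of Sigma_v^{-1} + D_phi' Sigma_xi^{-1} D_phi."""
    T = len(v)
    d0 = [1.0 / v[t] + xi[t] + (phi * phi * xi[t + 1] if t + 1 < T else 0.0)
          for t in range(T)]
    d1 = [-phi * xi[t + 1] for t in range(T - 1)]
    return d0, d1


def tridiag_cholesky(d0, d1):
    """Cholesky factor of a positive definite tridiagonal matrix, Q = L L'.
    Returns the diagonal (l0) and sub-diagonal (l1) of L."""
    T = len(d0)
    l0 = [0.0] * T
    l1 = [0.0] * (T - 1)
    l0[0] = math.sqrt(d0[0])
    for t in range(1, T):
        l1[t - 1] = d1[t - 1] / l0[t - 1]
        l0[t] = math.sqrt(d0[t] - l1[t - 1] ** 2)
    return l0, l1


def solve_lower(l0, l1, b):
    """Forward substitution for L a = b."""
    a = [0.0] * len(b)
    a[0] = b[0] / l0[0]
    for t in range(1, len(b)):
        a[t] = (b[t] - l1[t - 1] * a[t - 1]) / l0[t]
    return a


def solve_upper(l0, l1, b):
    """Back substitution for L' x = b."""
    T = len(b)
    x = [0.0] * T
    x[T - 1] = b[T - 1] / l0[T - 1]
    for t in range(T - 2, -1, -1):
        x[t] = (b[t] - l1[t] * x[t + 1]) / l0[t]
    return x


def draw_gaussian_tridiag(d0, d1, ell, rng=None, noise=None):
    """Draw from N(Q^{-1} ell, Q^{-1}) in O(T) operations via steps (1)-(3)."""
    l0, l1 = tridiag_cholesky(d0, d1)           # (1) Q = L L'
    a = solve_lower(l0, l1, ell)                # (2) L a = ell
    if noise is None:
        rng = rng if rng is not None else random.Random()
        noise = [rng.gauss(0.0, 1.0) for _ in ell]
    shifted = [ai + ei for ai, ei in zip(a, noise)]
    return solve_upper(l0, l1, shifted)         # (3) L' h = a + e


def tridiag_matvec(d0, d1, x):
    """Multiply the symmetric tridiagonal matrix with bands (d0, d1) by x."""
    T = len(x)
    out = []
    for t in range(T):
        val = d0[t] * x[t]
        if t > 0:
            val += d1[t - 1] * x[t - 1]
        if t < T - 1:
            val += d1[t] * x[t + 1]
        out.append(val)
    return out
```

Every step touches each time point a fixed number of times, which is the source of the linear-time cost. Setting the noise vector to zero returns the posterior mean $Q_{\tilde{h}}^{-1} \ell_{\tilde{h}}$, a convenient way to validate the solver against a dense computation on a small problem.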
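We provide the details for $D = 2$; the $D = 1$ case is similar to Section 3.5.1 with $\phi = 1$, $\mu = 0$, and $m_{s_t} = 0$. Model (3.8) may be written $y \sim N(\beta, \Sigma_\epsilon)$ and $D_2 \beta \sim N(0, \Sigma_\omega)$, where $y = (y_1, \ldots, y_T)'$, $\Sigma_\epsilon = \mathrm{diag}\left(\{\sigma_t^2\}_{t=1}^T\right)$, $\Sigma_\omega = \mathrm{diag}\left(\{\sigma_{\omega_t}^2\}_{t=1}^T\right)$ for $\sigma_{\omega_t}^2 = \tau^2 \lambda_t^2$, and $D_2$ is a lower triangular matrix with ones on the diagonal, $(0, -2, \ldots, -2)$ on the first off-diagonal, ones on the second off-diagonal, and zeros elsewhere. Note that we allow the observation error variance $\sigma_t^2$ to be time-dependent for full generality, as in Section 3.4.2. The posterior for $\beta$ is $\beta \sim N\left(Q_\beta^{-1} \ell_\beta, Q_\beta^{-1}\right)$, where $Q_\beta = \Sigma_\epsilon^{-1} + D_2' \Sigma_\omega^{-1} D_2$ is a pentadiagonal symmetric matrix with diagonal elements $d_0(Q_\beta)$, first off-diagonal elements $d_1(Q_\beta)$, and second off-diagonal elements $d_2(Q_\beta)$ defined as
\[
\begin{aligned}
d_0(Q_\beta) &= \Big[\left(\sigma_1^{-2} + \sigma_{\omega_1}^{-2} + \sigma_{\omega_3}^{-2}\right), \left(\sigma_2^{-2} + \sigma_{\omega_2}^{-2} + 4\sigma_{\omega_3}^{-2} + \sigma_{\omega_4}^{-2}\right), \ldots, \left(\sigma_t^{-2} + \sigma_{\omega_t}^{-2} + 4\sigma_{\omega_{t+1}}^{-2} + \sigma_{\omega_{t+2}}^{-2}\right), \ldots, \\
&\qquad \left(\sigma_{T-2}^{-2} + \sigma_{\omega_{T-2}}^{-2} + 4\sigma_{\omega_{T-1}}^{-2} + \sigma_{\omega_T}^{-2}\right), \left(\sigma_{T-1}^{-2} + \sigma_{\omega_{T-1}}^{-2} + 4\sigma_{\omega_T}^{-2}\right), \left(\sigma_T^{-2} + \sigma_{\omega_T}^{-2}\right)\Big], \\
d_1(Q_\beta) &= \left[-2\sigma_{\omega_3}^{-2}, -2\left(\sigma_{\omega_3}^{-2} + \sigma_{\omega_4}^{-2}\right), \ldots, -2\left(\sigma_{\omega_t}^{-2} + \sigma_{\omega_{t+1}}^{-2}\right), \ldots, -2\left(\sigma_{\omega_{T-1}}^{-2} + \sigma_{\omega_T}^{-2}\right), -2\sigma_{\omega_T}^{-2}\right], \\
d_2(Q_\beta) &= \left[\sigma_{\omega_3}^{-2}, \ldots, \sigma_{\omega_t}^{-2}, \ldots, \sigma_{\omega_T}^{-2}\right],
\end{aligned}
\]
and
\[
\ell_\beta = \Sigma_\epsilon^{-1} y = \left[y_1/\sigma_1^2, \ldots, y_t/\sigma_t^2, \ldots, y_T/\sigma_T^2\right]'.
\]
Drawing from the posterior distribution is straightforward and efficient using the band back-substitution algorithm described in Section 3.5.1.

In the multivariate setting of (3.9), we similarly derive the posterior distribution for $\beta = (\beta_1', \ldots, \beta_T')' = (\beta_{1,1}, \beta_{2,1}, \ldots, \beta_{p,1}, \beta_{1,2}, \ldots, \beta_{p,T})'$. Let $X = \mathrm{blockdiag}\left(\{x_t'\}_{t=1}^T\right)$ denote the $T \times Tp$ block-diagonal matrix of predictors and $\Sigma_\omega = \mathrm{diag}\left(\{\sigma_{\omega_{j,t}}^2\}_{j,t}\right)$ for $\sigma_{\omega_{j,t}}^2 = \tau_0^2 \tau_j^2 \lambda_{j,t}^2$. The posterior distribution is $\beta \sim N\left(Q_\beta^{-1} \ell_\beta, Q_\beta^{-1}\right)$, where $Q_\beta = X' \Sigma_\epsilon^{-1} X + (D_2' \otimes I_p) \Sigma_\omega^{-1} (D_2 \otimes I_p)$ and
\[
\ell_\beta = X' \Sigma_\epsilon^{-1} y = \left[x_1' y_1/\sigma_1^2, \ldots, x_t' y_t/\sigma_t^2, \ldots, x_T' y_T/\sigma_T^2\right]'.
\]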
Note that $Q_\beta$ may be constructed directly as above, but is now $2p$-banded. Alternatively, the regression coefficients $\{\beta_{j,t}\}$ may be sampled jointly using the simulation smoothing algorithm of Durbin and Koopman (2002).

B.2 Linear Regression for the Fama-French Asset Pricing Model

We present the ordinary linear regression results for the six-factor model discussed in Section 4.2, in which we append the momentum factor of Carhart (1997) to the five-factor Fama-French model (FF-5, Fama and French, 2015). We use weekly industry portfolio data from the website of Kenneth R. French, which provide the value-weighted return of stocks in the given industry. We focus on manufacturing (Manuf) and healthcare (Hlth). For a given industry portfolio, the response variable is the return in excess of the risk-free rate, $y_t = R_t - R_{F,t}$, with predictors $x_t = (1, R_{M,t} - R_{F,t}, SMB_t, HML_t, RMW_t, CMA_t, MOM_t)'$, defined as follows:

- the market risk factor, $R_{M,t} - R_{F,t}$, is the return on the market portfolio $R_{M,t}$ in excess of the risk-free rate $R_{F,t}$;
- the size factor, $SMB_t$ (small minus big), is the difference in returns between portfolios of small and large market value stocks;
- the value factor, $HML_t$ (high minus low), is the difference in returns between portfolios of high and low book-to-market value stocks;
- the profitability factor, $RMW_t$, is the difference in returns between portfolios of robust and weak profitability stocks;
- the investment factor, $CMA_t$, is the difference in returns between portfolios of stocks of low and high investment firms; and
- the momentum factor, $MOM_t$, is the difference in returns between portfolios of stocks with high and low prior returns.

These data are publicly available on Kenneth R. French's website, which provides additional details on the portfolios. We standardize all predictors and the response to have unit variance.
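
The data preparation described above can be sketched as follows. This is only an illustration with hypothetical function names; it assumes that "unit variance" refers to dividing each series by its sample standard deviation (with the $n - 1$ divisor) and that no centering is applied, since the regression retains an intercept.

```python
import math


def excess_returns(returns, risk_free):
    """Response y_t = R_t - R_{F,t}."""
    return [r - rf for r, rf in zip(returns, risk_free)]


def scale_to_unit_variance(x):
    """Divide a series by its sample standard deviation (n - 1 divisor)."""
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((xi - mean) ** 2 for xi in x) / (n - 1))
    return [xi / sd for xi in x]
```

Placing all factors on a common scale makes the magnitudes of the estimated coefficients directly comparable across factors.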
The results for the weekly manufacturing and healthcare industry data sets from 4/1/2007 - 4/1/2017 ($T = 522$) are in Tables B.1 and B.2, respectively. For the manufacturing industry, the significant factors are market risk ($R_{M,t} - R_{F,t}$), profitability ($RMW_t$), and investment ($CMA_t$). For the healthcare industry, the significant factors are market risk, size ($SMB_t$), value ($HML_t$), and profitability.

Ordinary Linear Regression: Manufacturing Industry

             Estimate   Std. Error   t value   Pr(>|t|)
Intercept      -0.020        0.015    -1.350      0.178
Mkt.RF          1.010        0.018    55.359      0.000
SMB            -0.013        0.016    -0.780      0.436
HML            -0.028        0.022    -1.264      0.207
RMW             0.088        0.018     4.918      0.000
CMA             0.052        0.017     3.106      0.002
MOM             0.029        0.020     1.437      0.151

Table B.1: Ordinary linear regression results for the weekly manufacturing industry data in the six-factor model. Significant factors at the 5% level are italicized.

Ordinary Linear Regression: Healthcare Industry

             Estimate   Std. Error   t value   Pr(>|t|)
Intercept       0.045        0.023     1.966      0.050
Mkt.RF          0.924        0.028    32.686      0.000
SMB            -0.130        0.025    -5.221      0.000
HML            -0.264        0.034    -7.703      0.000
RMW            -0.168        0.028    -6.076      0.000
CMA             0.029        0.026     1.125      0.261
MOM             0.027        0.031     0.857      0.392

Table B.2: Ordinary linear regression results for the weekly healthcare industry data in the six-factor model. Significant factors at the 5% level are italicized.

APPENDIX C

FUNCTIONAL AUTOREGRESSION FOR SPARSELY SAMPLED DATA

C.1 Priors

The prior for $\{\mu_t\}_{t=1}^T$ is determined by (4.6). Let $b_\psi$ be a $J_\psi$-dimensional vector of cubic B-spline basis functions with $\min\{|\mathcal{T}_o|/2, 35\} = (J_\psi - 4)$ equally-spaced interior knots. The tensor product expansion $\psi_\ell(\tau, u) = b_\psi'(\tau) \Theta_{\psi_\ell} b_\psi(u) = \left(b_\psi'(u) \otimes b_\psi'(\tau)\right) \theta_{\psi_\ell}$, where $\Theta_{\psi_\ell}$ is a $J_\psi \times J_\psi$ matrix of unknown coefficients and $\theta_{\psi_\ell} = \mathrm{vec}(\Theta_{\psi_\ell})$, is computationally convenient for the FAR surfaces $\{\psi_\ell\}_{\ell=1}^p$. The Gaussian prior $[\theta_{\psi_\ell} \mid \lambda_{\psi_\ell}] \sim N\left(0, \lambda_{\psi_\ell}^{-1} \Omega_{\psi_\ell}^{-1}\right)$ induces a Gaussian process prior on $\psi_\ell$, where $\Omega_{\psi_\ell}$ is a penalty matrix and $\lambda_{\psi_\ell}$ is a smoothing parameter. The standard roughness penalty
\[
\int\!\!\int \left\{ \left[\frac{\partial^2}{\partial u_1^2} \psi_\ell(u_1, u_2)\right]^2 + 2 \left[\frac{\partial^2}{\partial u_1 \partial u_2} \psi_\ell(u_1, u_2)\right]^2 + \left[\frac{\partial^2}{\partial u_2^2} \psi_\ell(u_1, u_2)\right]^2 \right\} du_1 \, du_2
\]
can be expressed as $\theta_{\psi_\ell}' \Omega_2 \theta_{\psi_\ell}$ for a known singular matrix $\Omega_2$. To obtain a proper prior, which is necessary for our model averaging procedure, we combine the roughness penalty with a nonstationarity penalty: a sufficient condition for stationarity of $Y_t$ in model (4.2) is $\sum_{\ell=1}^p \int\!\!\int \psi_\ell^2(\tau, u) \, d\tau \, du < 1$, which can be expressed as $\sum_{\ell=1}^p \theta_{\psi_\ell}' \Omega_0 \theta_{\psi_\ell} < 1$, where $\Omega_0$ is a known invertible matrix. We use the prior precision matrix $\Omega_{\psi_\ell} = \Omega_2 + \kappa_\ell \Omega_0$, which penalizes roughness of $\psi_\ell$ and provides shrinkage toward stationarity, where the trade-off is determined by $\kappa_\ell$. Simulations suggest that the posterior distribution is not sensitive to the choice of $\kappa_\ell$; we fix $\kappa_\ell = 1$ for the simulations and assume $\log(\kappa_\ell) \sim N(0, 4)$ for the application. For the smoothing parameter $\lambda_{\psi_\ell}$, we use the half-Cauchy prior of Gelman (2006), which provides excellent mixing of the states $\{s_\ell\}$ in the model averaging procedure. The prior may be expressed hierarchically via the auxiliary variables $\tilde{\lambda}_{\psi_\ell} \sim \mathrm{Gamma}\left(\tfrac{1}{2}, \tfrac{1}{2}\right)$, $\tilde{\xi}_{\psi_\ell} \sim N(0, 10^6)$, and $\tilde{\theta}_{\psi_\ell} \sim N\left(0, \tilde{\lambda}_{\psi_\ell}^{-1} \Omega_{\psi_\ell}^{-1}\right)$, with the identification $\theta_{\psi_\ell} = \tilde{\xi}_{\psi_\ell} \tilde{\theta}_{\psi_\ell}$ and $\lambda_{\psi_\ell} = \tilde{\xi}_{\psi_\ell}^{-2} \tilde{\lambda}_{\psi_\ell}$.

We use the conditionally conjugate inverse-Gamma priors $\sigma_\nu^{-2}, \sigma_\eta^{-2} \sim \mathrm{Gamma}(10^{-3}, 10^{-3})$ for the measurement error precision and the FDLM approximation error precision, respectively. In some cases, we may prefer smoother sample paths of $\mu_t$, but the paths will not be smooth when $\sigma_\eta^2$ is large. If increasing $J$ is infeasible or undesirable, fixing $\sigma_\eta^2$ at some small value, such as $\sigma_\eta^2 = 10^{-6}$, often works well, and can be interpreted as a jitter term for computing a valid inverse of $K_\epsilon$ (Neal, 1999).

Assuming the FDLM (4.7) for the innovation covariance $K_\epsilon$, the factors are distributed $e_t \stackrel{iid}{\sim} N(0, \Sigma_e)$ with $\Sigma_e = \mathrm{diag}\left(\{\sigma_j^2\}_{j=1}^J\right)$, although many generalizations are available (Kowal et al., 2016).
To enforce the ordering constraints $\sigma_1^2 > \sigma_2^2 > \cdots > \sigma_J^2 > 0$, recall that the joint distribution of the precisions may be written
\[
\left[\sigma_1^{-2}, \ldots, \sigma_J^{-2}\right] = \left[\sigma_J^{-2}\right] \prod_{j=1}^{J-1} \left[\sigma_j^{-2} \mid \sigma_{j+1}^{-2}, \ldots, \sigma_J^{-2}\right].
\]
A noninformative joint prior that respects the constraints is fully specified by $\sigma_J^{-2} \sim \mathrm{Gamma}(10^{-3}, 10^{-3})$ and $\left[\sigma_j^{-2} \mid \sigma_{j+1}^{-2}, \ldots, \sigma_J^{-2}\right] = \left[\sigma_j^{-2} \mid \sigma_{j+1}^{-2}\right] \sim \mathrm{Uniform}\left(0, \sigma_{j+1}^{-2}\right)$ for $j = 1, \ldots, J - 1$.

The FLCs are $\phi_j(\tau) = b_\phi'(\tau) \xi_j$, where $b_\phi$ is a low-rank thin plate spline basis with knot locations determined by the quantiles of the observation points, $\mathcal{T}_o$, $\xi_j \sim N(0, \Lambda_{\phi_j})$, and $\Lambda_{\phi_j}^{-1}$ is the low-rank thin plate spline penalty matrix. We follow Wand and Ormerod (2008) in the singular value decomposition-based diagonalization of the penalty matrix, so that $\Lambda_{\phi_j} = \mathrm{diag}\left(10^8, 10^8, \lambda_{\phi_j}^{-1}, \ldots, \lambda_{\phi_j}^{-1}\right)$, which places a noninformative prior on the constant and linear components of the thin plate spline basis, which are unpenalized. The prior precision $\lambda_{\phi_j}$ is common among the nonlinear components, and corresponds to the smoothing parameter for the regression function $\phi_j$. Following Gelman (2006), we place uniform priors on the standard deviations, $\lambda_{\phi_j}^{-1/2} \sim \mathrm{Uniform}(0, 10^4)$, which implies the prior for the precision $[\lambda_{\phi_j}] \propto \lambda_{\phi_j}^{-3/2} 1\{\lambda_{\phi_j} > 10^{-8}\}$. The upper bound for the prior standard deviation is selected to match the noninformative components of $\Lambda_{\phi_j}$. The orthonormality constraint is enforced during sampling, which we discuss in Appendix C. We assume the same parametrization and prior distribution for the mean function, $\mu(\tau) = b_\phi'(\tau) \theta_\mu$.

C.2 Proof of Theorem 4.1

To prove Theorem 4.1, we use the following well-known results:

Proposition C.1. For random vectors $\delta$ and $Y$ with known mean and covariance, the unique best linear predictor of $\delta$ given $Y$ is $E_G[\delta \mid Y]$, where $E_G$ is the expectation computed under the assumption that $(\delta', Y')'$ is jointly Gaussian.

Proposition C.2 (West and Harrison, 1997).
Under a DLM such as model (4.6), the random vectors $y_{1:T} = (y_1', \ldots, y_T')'$ and $\mu_{1:T} = (\mu_1', \ldots, \mu_T')'$ are jointly Gaussian, conditional on the remaining parameters. In addition, all conditionals and marginals of the joint distribution of $(y_{1:T}', \mu_{1:T}')'$ are Gaussian.

Note that we could extend $\mu_{1:T}$ to include $\mu$, which is also a Gaussian random vector. Following Propositions C.1 and C.2, the proof of Theorem 4.1 is straightforward:

Proof. (Theorem 4.1) Let $\mathcal{T}_e$ be fixed and finite such that $\mathcal{T}_e \subset \mathcal{T}$. Given this choice of $\mathcal{T}_e$, we can form the DLM (4.10) with the appropriately modified terms. Similarly, we can form the Gaussian DLM (4.6). Proposition C.2 implies that $(y_{1:T}', \mu_{1:T}')'$ under model (4.6) and conditional on $\Theta$ is jointly Gaussian. Therefore, for any $\delta, Y \subseteq \mathcal{D}_T \cup \{\mu_t(\tau) : \tau \in \mathcal{T}_e, t = 1, \ldots, T\}$, i.e., any subvectors of $(y_{1:T}', \mu_{1:T}')'$, the distribution of $[\delta \mid Y, \Theta]$ is Gaussian. Proposition C.1 implies that $\hat{\delta}(Y \mid \Theta) \equiv E[\delta \mid Y, \Theta]$, computed under the Gaussian DLM (4.6), is the unique best linear predictor of $[\delta \mid Y, \Theta]$ under the DLM (4.10).

C.3 Initialization and MCMC Sampling Algorithm

C.3.1 Initialization

We initialize the unknown functions using splines and the remaining parameters using conditional maximum likelihood estimators. We first estimate $\mu$ as a smooth mean of $\{y_t\}_{t=1}^T$, evaluated at $\mathcal{T}_e$. Next, we estimate each $\mu_t$ by fitting a spline to $y_t - Z_t \mu$ for $t = 1, \ldots, T$ using the R function smooth.spline. Since sparse observation points may lead to unstable initializations of $\mu_t$, we compute the median degrees of freedom implied by the spline fits for $t = 1, \ldots, T$, and then recompute the splines for $t = 1, \ldots, T$ using this common degrees of freedom parameter. Conditional on these estimates, we estimate $\sigma_\nu^2$, $\{\theta_{\psi_1}, \ldots, \theta_{\psi_p}\}$, and $\{\lambda_{\psi_1}, \ldots, \lambda_{\psi_p}\}$ using the maximum likelihood estimators, and initialize $\tilde{\theta}_{\psi_\ell} = \theta_{\psi_\ell}$, $\tilde{\lambda}_{\psi_\ell} = 1$, and $\tilde{\xi}_{\psi_\ell} = \lambda_{\psi_\ell}^{-1/2}$. From these estimators, we compute the innovations $\epsilon_t$ for t = 1, . . .
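, T. We initialize the FDLM parameters using the initialization algorithm of Kowal et al. (2016) based on the singular value decomposition (SVD) of $(\epsilon_1, \ldots, \epsilon_T)' = U_e D_e V_e'$. For the FLCs, we let $\Phi$ equal the first $J$ columns of $V_e$ and then estimate $\Xi$ to minimize $\|\Phi - B_\phi \Xi\|^2$. For the factors, we let $(e_1, \ldots, e_T)'$ be the first $J$ columns of $(U_e D_e)$, and then estimate $\{\sigma_j^2\}$ and $\sigma_\eta^2$ using the conditional maximum likelihood estimators. Since $\sum_{k=1}^j \sigma_k^2 / \sum_k \sigma_k^2$ estimates the proportion of variance of $\epsilon_t$ explained by the first $j$ factors, we set $J$ to be the smallest number of factors that explain at least 95% of the variance of $\epsilon_t$. While more sophisticated procedures are available for selecting $J$, such as DIC and marginal likelihood, we find that this simple approach performs well in simulations.

C.3.2 Gibbs Sampling Algorithm

We propose to sample from the joint posterior distribution using a Gibbs sampler with the following steps:

1. FAR process, $Y_t$:

   (a) Centered FAR process, $\mu_t$: form the DLM (6) and sample $\left[\{\mu_t\}_{t=1}^T \mid \cdots\right]$ jointly using the state space sampler of Durbin and Koopman (2002) implemented in the R package KFAS.

   (b) Mean function, $\mu(\tau) = b_\phi'(\tau) \theta_\mu$: sample $[\theta_\mu \mid \cdots] \sim N(A_\mu a_\mu, A_\mu)$, where
   \[
   A_\mu^{-1} = \Lambda_\mu^{-1} + \sigma_\nu^{-2} \sum_{t=1}^T B_\phi' Z_t' Z_t B_\phi, \qquad
   a_\mu = \sigma_\nu^{-2} \sum_{t=1}^T B_\phi' Z_t' (y_t - Z_t \mu_t),
   \]
   and $\Lambda_\mu = \mathrm{diag}\left(10^8, 10^8, \lambda_\mu^{-1}, \ldots, \lambda_\mu^{-1}\right)$. We sample the smoothing parameter $[\lambda_\mu \mid \cdots] \sim \mathrm{Gamma}\left(\tfrac{1}{2}(J_\mu - 3), \tfrac{1}{2} \sum_{j=3}^{J_\mu} \theta_{\mu,j}^2\right)$ restricted to $\lambda_\mu > 10^{-8}$ (see the $\sigma_j^{-2}$ sampler below), where $J_\mu$ $(= J_\phi)$ is the dimension of $\theta_\mu$ and $\theta_{\mu,j}$ is the $j$th component of $\theta_\mu$.

   Set $Y_t = \mu_t + \mu$ or, in vector form, $\boldsymbol{Y}_t = \boldsymbol{\mu}_t + \boldsymbol{\mu}$.

2. Measurement error precision, $\sigma_\nu^{-2}$: sample
   \[
   [\sigma_\nu^{-2} \mid \cdots] \sim \mathrm{Gamma}\left(10^{-3} + \frac{1}{2} \sum_{t=1}^T m_t, \; 10^{-3} + \frac{1}{2} \sum_{t=1}^T \sum_{i=1}^{m_t} \left(y_{i,t} - \mu(\tau_{i,t}) - \mu_t(\tau_{i,t})\right)^2\right).
   \]

3. The FAR kernels, $\psi_1, \ldots, \psi_p$: using the Gelman (2006) prior and parametrization of $\theta_{\psi_\ell} = \tilde{\xi}_{\psi_\ell} \tilde{\theta}_{\psi_\ell}$, where $\psi_\ell(\tau, u) = b_\psi'(\tau, u) \theta_{\psi_\ell}$ and $B_\psi = (b_\psi(\tau_1), \ldots, b_\psi(\tau_M))'$, we sample

   (a) $\tilde{\theta}_\psi = (\tilde{\theta}_{\psi_1}', \ldots, \tilde{\theta}_{\psi_p}')'$ jointly from $[\tilde{\theta}_\psi \mid \cdots] \sim N(A_\psi a_\psi, A_\psi)$, where
   \[
   \begin{aligned}
   A_\psi^{-1}[\ell, \ell] &= \lambda_{\psi_\ell} \Omega_{\psi_\ell} + s_\ell \tilde{\xi}_{\psi_\ell}^2 \left[(B_\psi' Q) \left\{\sum_{t=p+1}^T \mu_{t-\ell} \mu_{t-\ell}'\right\} (B_\psi' Q)'\right] \otimes \left[B_\psi' K_\epsilon^{-1} B_\psi\right], \\
   A_\psi^{-1}[\ell, k] &= s_\ell s_k \tilde{\xi}_{\psi_\ell} \tilde{\xi}_{\psi_k} \left[(B_\psi' Q) \left\{\sum_{t=p+1}^T \mu_{t-\ell} \mu_{t-k}'\right\} (B_\psi' Q)'\right] \otimes \left[B_\psi' K_\epsilon^{-1} B_\psi\right], \\
   a_\psi[\ell] &= s_\ell \tilde{\xi}_{\psi_\ell} \, \mathrm{vec}\left(B_\psi' K_\epsilon^{-1} \sum_{t=p+1}^T \mu_t \mu_{t-\ell}' (B_\psi' Q)'\right),
   \end{aligned}
   \]
   $A_\psi^{-1}[\ell, k]$ is the $(\ell, k)$th block of $A_\psi^{-1}$ of dimension $J_\psi^2 \times J_\psi^2$, and $a_\psi[\ell]$ is the $\ell$th subvector of $a_\psi$ of length $J_\psi^2$;

   (b) For $\ell = 1, \ldots, p$, sample $[\tilde{\xi}_{\psi_\ell} \mid \cdots] \sim N\left(A_{\tilde{\xi}_{\psi_\ell}} a_{\tilde{\xi}_{\psi_\ell}}, A_{\tilde{\xi}_{\psi_\ell}}\right)$, where
   \[
   \begin{aligned}
   A_{\tilde{\xi}_{\psi_\ell}}^{-1} &= 10^{-6} + \tilde{\theta}_{\psi_\ell}' \left(\left[(B_\psi' Q) \left\{\sum_{t=p+1}^T \mu_{t-\ell} \mu_{t-\ell}'\right\} (B_\psi' Q)'\right] \otimes \left[B_\psi' K_\epsilon^{-1} B_\psi\right]\right) \tilde{\theta}_{\psi_\ell}, \\
   a_{\tilde{\xi}_{\psi_\ell}} &= \tilde{\theta}_{\psi_\ell}' \, \mathrm{vec}\left(B_\psi' K_\epsilon^{-1} \sum_{t=p+1}^T \left(\mu_t - \sum_{k \ne \ell} s_k G(\psi_k) \mu_{t-k}\right) \mu_{t-\ell}' (B_\psi' Q)'\right),
   \end{aligned}
   \]
   sample $[\tilde{\lambda}_{\psi_\ell} \mid \cdots] \sim \mathrm{Gamma}\left(\tfrac{1}{2} + J_\psi^2/2, \; \tfrac{1}{2} + \theta_{\psi_\ell}' \Omega_{\psi_\ell} \theta_{\psi_\ell}/2\right)$, and, if $\kappa_\ell$ is unknown, sample $\kappa_\ell$ using the slice sampler (Neal, 2003). Set $\theta_{\psi_\ell} = \tilde{\xi}_{\psi_\ell} \tilde{\theta}_{\psi_\ell}$ and update $\Omega_{\psi_\ell}$.

   (c) For the model averaging procedure, sample $[s_\ell \mid \cdots]$ (in random order), i.e., set $s_\ell = 1$ if $\log O_{10}^{post} > \log(1/U - 1)$ and $s_\ell = 0$ otherwise, where $U \sim \mathrm{Uniform}(0, 1)$, $\log O_{10}^{post}$ is the log-posterior odds
   \[
   \log O_{10}^{post} = -\frac{1}{2} \sum_{t=p+1}^T \left[\mu_{t-\ell}' K_\epsilon^{-1} \mu_{t-\ell} - 2 \left(\mu_t - \sum_{k \ne \ell} s_k G(\psi_k) \mu_{t-k}\right)' K_\epsilon^{-1} \mu_{t-\ell}\right] + \log O_{10}^{prior},
   \]
   and $\log O_{10}^{prior} = \log P(s_\ell = 1 \mid s_k, k \ne \ell) - \log P(s_\ell = 0 \mid s_k, k \ne \ell)$ is the log-prior odds.

4. The innovation covariance, $K_\epsilon$, under the FDLM:

   (a) The factors, $\{e_t\}_{t=1}^T$: using the prior $e_t \stackrel{iid}{\sim} N(0, \Sigma_e)$ and the conditional likelihood $\epsilon_t = \mu_t - \sum_{\ell=1}^p G(\psi_\ell) \mu_{t-\ell} = \Phi e_t + \eta_t$, sample $[e_t \mid \cdots] \sim N(A_e a_{e_t}, A_e)$, where
   \[
   A_e^{-1} = \sigma_\eta^{-2} \Phi' \Phi + \Sigma_e^{-1} = \mathrm{diag}\left(\{\sigma_\eta^{-2} + \sigma_j^{-2}\}_{j=1}^J\right), \qquad a_{e_t} = \sigma_\eta^{-2} \Phi' \epsilon_t.
   \]
   Note that $A_e$ is time-invariant and diagonal, so we can sample $\{e_t\}_{t=1}^T$ jointly and efficiently.

   (b) The factor precisions, $\sigma_j^{-2}$: sample
   \[
   [\sigma_J^{-2} \mid \cdots] \sim \mathrm{Gamma}\left(10^{-3} + \frac{T}{2}, \; 10^{-3} + \frac{1}{2} \sum_{t=1}^T e_{J,t}^2\right);
   \]
   then, for $j = J - 1, \ldots, 1$, set $\sigma_j^{-2} = F_\phi^{-1}(U; s_\phi, r_{\phi_j})$, where $F_\phi$ is the distribution function for a Gamma random variable with shape parameter $s_\phi = (T - 1)/2$ and rate parameter $r_{\phi_j} = \sum_{t=1}^T e_{j,t}^2/2$, and $U \sim \mathrm{Uniform}\left(a_{\phi_j}, b_{\phi_j}\right)$, where $a_{\phi_j} = F_\phi(0; s_\phi, r_{\phi_j})$ and $b_{\phi_j} = F_\phi(\sigma_{j+1}^{-2}; s_\phi, r_{\phi_j})$.

   (c) The approximation error precision, $\sigma_\eta^{-2}$: sample
   \[
   [\sigma_\eta^{-2} \mid \cdots] \sim \mathrm{Gamma}\left(10^{-3} + \frac{TM}{2}, \; 10^{-3} + \frac{1}{2} \sum_{t=1}^T \|\epsilon_t - \Phi e_t\|^2\right),
   \]
   where $\|\cdot\|^2$ denotes the squared Euclidean norm.

   (d) The factor loading curves: for $j = 1, \ldots, J$ (in random order), sample $\xi_j \sim N(A_{\xi_j} a_{\xi_j}, A_{\xi_j})$, where
   \[
   A_{\xi_j}^{-1} = \Lambda_{\phi_j}^{-1} + \sigma_\eta^{-2} \left(\sum_{t=1}^T e_{j,t}^2\right) B_\phi' B_\phi, \qquad
   a_{\xi_j} = \sigma_\eta^{-2} B_\phi' \sum_{t=1}^T e_{j,t} \left(\epsilon_t - B_\phi \sum_{k \ne j} \xi_k e_{k,t}\right).
   \]
   To enforce the orthogonality constraint, we condition on the linear constraints $(B_\phi \xi_k)' B_\phi \xi_j = 0$ for $k \ne j$; since $\xi_j$ is Gaussian and $\xi_k$ is conditioned upon, the resulting distribution is Gaussian with easily computable moments, which is also convenient for efficient sampling; see Kowal et al. (2016) for more details. After sampling from the conditional distribution, we normalize the sampled vector $\xi_j$, so that the orthonormality constraint is enforced at every MCMC iteration. We sample the corresponding smoothing parameters $[\lambda_{\phi_j} \mid \cdots] \sim \mathrm{Gamma}\left(\tfrac{1}{2}(J_\phi - 3), \tfrac{1}{2} \sum_{k=3}^{J_\phi} \xi_{j,k}^2\right)$ restricted to $\lambda_{\phi_j} > 10^{-8}$, where $\xi_{j,k}$ is the $k$th component of $\xi_j$.

   Finally, we form the covariance and precision matrices $K_\epsilon$ and $K_\epsilon^{-1}$, respectively, using the sampled components. Since the orthonormality constraint $\Phi' \Phi = I_J$ is enforced at every MCMC iteration, we can compute $K_\epsilon^{-1}$ directly and efficiently using (8).

When the sample size $T$ or the number of evaluation points $M$ is large (i.e., $T > 10{,}000$ or $M > 50$), the Durbin and Koopman (2002) joint sampler is computationally inefficient. Instead, we may use a single-move sampler for $\{\mu_t\}_{t=1}^T$, in which we sample from the full conditional distribution of each $[\mu_t \mid \mu_s, s \ne t]$ separately for $t = 1, \ldots, T$ (in random order).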
The single-move sampler is more computationally efficient, but is typically less MCMC efficient. The FDLM provides a closed form for $K_\epsilon^{-1}$, which substantially reduces computation time when $M$ is large.

The tensor product basis for $\psi_\ell$ provides a computational simplification for jointly sampling the FAR kernel basis coefficients, $\theta_\psi$. Importantly, the dimension of the Kronecker product for computing $A_\psi^{-1}$ is determined by the number of basis functions, $J_\psi$, which is bounded by 35 in our specification, and may be smaller for some applications. For other bivariate bases, such as the thin plate spline basis, such simplifications are not readily available, and the Kronecker product scales with the number of evaluation points, $M$.

In the model averaging procedure, there is a nontrivial concern about the ability of the MCMC sampler to move between states. When $s_\ell = 0$, $\psi_\ell$ does not appear in the likelihood (9), so the Gibbs sampler will draw $\psi_\ell$ from its prior. Therefore, the prior for $\psi_\ell$ must be proper; if it is nonetheless noninformative, then the draws of $\psi_\ell$ from the prior distribution may not be reasonable for (9), so the next MCMC sample of $s_\ell$ will be zero with high probability. To alleviate this problem, we fix $s_\ell = 1$ for all $\ell$ during a short burn-in period, so that each $\psi_\ell$ is well-estimated and therefore more likely to be included in the model if it is relevant. In both simulations and the yield curve application, the Gelman (2006) parametrization for $\psi_\ell$ sampling discussed in the Appendix provides excellent mixing among the states $\{s_\ell\}_{\ell=1}^{p_{max}}$.

C.4 Additional Theoretical Results

C.4.1 Proof of Proposition 4.1

Let $\Psi(B)$ be a polynomial in the backshift operator $B$ of order $p$, so that $\Psi(B) Y_t = (1 - \Psi_1 B - \Psi_2 B^2 - \cdots - \Psi_p B^p) Y_t = Y_t - \sum_{\ell=1}^p \Psi_\ell(Y_{t-\ell})$, where $\{\Psi_\ell\}_{\ell=1}^p$ are bounded linear operators on $L^2(\mathcal{T})$. Similarly, let $\Theta(B)$ be a polynomial in the backshift operator $B$ of order $q$, where $\{\Theta_\ell\}_{\ell=1}^q$ are bounded linear operators on $L^2(\mathcal{T})$.
A functional autoregressive moving average process of order $(p, q)$, written FARMA$(p, q)$, is defined by $\Psi(B)(Y_t - \mu) = \Theta(B)\epsilon_t$, where $\{\epsilon_t\}$ is a white noise process in $L^2(\mathcal{T})$ and $\mu$ is the unconditional mean of $Y_t$. The FAR($p$) model may be written compactly as $\Psi(B)(Y_t - \mu) = \epsilon_t$. By assumption, we observe the process $\{y_t\}$, where $y_t = Y_t + \nu_t$ and $\{\nu_t\}$ is a white noise process in $L^2(\mathcal{T})$ independent of $\{\epsilon_t\}$. Rewriting the observation equation as $y_t - \mu = Y_t - \mu + \nu_t$ and applying $\Psi(B)$, we have $\Psi(B)(y_t - \mu) = \Psi(B)(Y_t - \mu) + \Psi(B)\nu_t = \epsilon_t + \Psi(B)\nu_t$. It remains to show that $Z_t \equiv \epsilon_t + \Psi(B)\nu_t$ is a functional moving average process of order $p$, or equivalently, FARMA$(0, p)$. Clearly, $X_t \equiv \Psi(B)\nu_t$ is FARMA$(0, p)$. By Proposition 10.2 in Bosq and Blanke (2008), $C_p^X \ne 0$ and $C_\ell^X = 0$ for $\ell > p$, where $C_\ell^X$ is the covariance operator of $X_t$ defined by $C_\ell^X(x) \equiv E[\langle X_t, x \rangle X_{t+\ell}]$ for $x \in L^2(\mathcal{T})$. Let $C_\ell^Z$ and $C_\ell^\epsilon$ denote the covariance operators for $Z_t$ and $\epsilon_t$, respectively. Then $C_\ell^Z(x) = E[\langle Z_t, x \rangle Z_{t+\ell}] = E[\langle \epsilon_t + X_t, x \rangle (\epsilon_{t+\ell} + X_{t+\ell})] = C_\ell^\epsilon(x) + C_\ell^X(x)$, using independence of $\{\epsilon_t\}$ and $\{\nu_t\}$. Since $\epsilon_t$ is white noise, $C_\ell^\epsilon = 0$ for $\ell > 0$, from which it follows that $C_p^Z \ne 0$ and $C_\ell^Z = 0$ for $\ell > p$. Proposition 10.2 in Bosq and Blanke (2008) implies that $Z_t$ is FARMA$(0, p)$, so we conclude that $y_t$ is FARMA$(p, p)$.

C.4.2 DLM Recursions and Special Cases of Theorem 4.1

For completeness, we provide the standard DLM recursion formulas for model (6). Let $\mathcal{D}_t = \{y_t, y_{t-1}, \ldots, y_1\} \cup \mathcal{D}_0$ be the information available at time $t$, where $\mathcal{D}_0$ represents the information prior to $t = 1$. For our purposes, and in particular for the Gibbs sampling algorithm, we let $\mathcal{D}_0 = \{\mu, \sigma_\nu^2, \psi, K_\epsilon\}$ (denoted by $\Theta$ in Theorem 4.1). We may compute full conditional posterior distributions from model (6) using standard DLM recursions (e.g., West and Harrison, 1997). For simplicity, let $G = G(\psi)$. Suppose that $[\mu_{t-1} \mid \mathcal{D}_{t-1}] \sim N(m_{t-1}, C_{t-1})$. The prior at time $t$ is $[\mu_t \mid \mathcal{D}_{t-1}] \sim N(a_t, R_t)$, where $a_t = G m_{t-1}$ and $R_t = G C_{t-1} G' + K_\epsilon$.
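The one-step forecast at time $t$ is $[y_t \mid \mathcal{D}_{t-1}] \sim N(f_t, Q_t)$, where $f_t = Z_t \mu + Z_t a_t = Z_t(\mu + G m_{t-1})$ and $Q_t = Z_t R_t Z_t' + \sigma_\nu^2 I_{m_t}$. The posterior at time $t$ is $[\mu_t \mid \mathcal{D}_t] \sim N(m_t, C_t)$, where $m_t = C_t\left(R_t^{-1} a_t + \sigma_\nu^{-2} Z_t'(y_t - Z_t \mu)\right)$ and $C_t^{-1} = R_t^{-1} + \sigma_\nu^{-2} Z_t' Z_t$, or, more commonly, $m_t = a_t + A_t r_t$, $A_t = R_t Z_t' Q_t^{-1}$, $r_t = y_t - f_t$, and $C_t = R_t - A_t Q_t A_t'$. The $h$-step forecast of the functional observations is $E[y_{t+h} \mid \mathcal{D}_t] = E[Z_{t+h} \mu + Z_{t+h} \mu_{t+h} + \nu_{t+h} \mid \mathcal{D}_t] = Z_{t+h} \mu + Z_{t+h} E[\mu_{t+h} \mid \mathcal{D}_t]$, where $E[\mu_{t+h} \mid \mathcal{D}_t] = G^h m_t$, which is the $h$-step forecast of $\mu_t$.

Some special cases of Theorem 4.1 are proved in West and Harrison (1997):

Corollary B.2.1 (Theorem 4.10, West and Harrison, 1997). The unique best linear predictor of the filtering random variable $[\mu_t \mid \mathcal{D}_t]$ is $m_t$.

Corollary B.2.2 (Corollary 4.7, West and Harrison, 1997). The unique best linear predictor of the one-step forecast $[\mu_t \mid \mathcal{D}_{t-1}]$ is $a_t$. The unique best linear predictor of the one-step forecast $[y_t \mid \mathcal{D}_{t-1}]$ is $f_t$.

C.4.3 Proof of Theorem 4.2

Suppose $\tau^* \in \mathcal{T}$ such that $\tau^* \notin \mathcal{T}_e$. The full conditional distribution of $\mu_t(\tau^*)$ is
\[
\begin{aligned}
\left[\mu_t(\tau^*) \mid \{\mu_r\}_{r=1}^T, \Theta, \mathcal{D}_s\right]
&\propto \left[y_1, \ldots, y_s \mid \mu_t(\tau^*), \{\mu_r\}_{r=1}^T, \Theta\right] \times \left[\mu_t(\tau^*) \mid \{\mu_r\}_{r=1}^T, \Theta\right] \\
&\propto \left[\mu_t(\tau^*) \mid \{\mu_r\}_{r=1}^T, \Theta\right],
\end{aligned}
\]
since the likelihood term is constant with respect to $\mu_t(\tau^*)$: $\mathcal{T}_o \subseteq \mathcal{T}_e$, so $\tau^* \notin \mathcal{T}_e$ implies $\tau^* \notin \mathcal{T}_o$, and therefore $\mu_t(\tau^*)$ does not appear in the likelihood of model (4). For $p = 1$, the conditional Gaussian process prior for $\mu_t$ implied by model (4) under the approximation (5) is $[\mu_t \mid \mu_{t-1}, \psi, K_\epsilon] \sim \mathcal{GP}\left(\psi'(\cdot) Q \mu_{t-1}, K_\epsilon\right)$, where $\psi'(\tau) = (\psi(\tau, \tau_1), \ldots, \psi(\tau, \tau_M))$, $Q$ is a known quadrature weight matrix, and $\mu_{t-1} = (\mu_{t-1}(\tau_1), \ldots, \mu_{t-1}(\tau_M))'$ is the function $\mu_{t-1}$ evaluated at each $\tau \in \mathcal{T}_e$. Notably, $\tau^* \notin \mathcal{T}_e$ implies that $\mu_t(\tau^*)$ does not appear in the conditional mean function for $\mu_{t+1}$, so we may further simplify the distribution of $\mu_t(\tau^*)$:
\[
\left[\mu_t(\tau^*) \mid \{\mu_r\}_{r=1}^T, \Theta, \mathcal{D}_s\right] \propto \left[\mu_t(\tau^*) \mid \mu_t, \mu_{t-1}, \Theta\right].
\]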
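To compute this distribution, we use the definition of a Gaussian process, which implies the following joint distribution of $\mu_t(\tau^*)$ and $\mu_t$, conditional on $\mu_{t-1}$, $\psi$, and $K_\epsilon$:
\[
\begin{pmatrix} \mu_t(\tau^*) \\ \mu_t \end{pmatrix} \sim N\left( \begin{pmatrix} \psi'(\tau^*) Q \mu_{t-1} \\ \Psi Q \mu_{t-1} \end{pmatrix}, \begin{pmatrix} K_\epsilon(\tau^*, \tau^*) & K_\epsilon(\tau^*) \\ K_\epsilon'(\tau^*) & K_\epsilon \end{pmatrix} \right),
\]
where $\Psi = \{\psi(\tau_i, \tau_k)\}_{i,k=1}^M$ and $K_\epsilon(\tau^*) = (K_\epsilon(\tau^*, \tau_1), \ldots, K_\epsilon(\tau^*, \tau_M))$. Conditioning on $\mu_t$ induces the desired distribution $[\mu_t(\tau^*) \mid \mu_t, \mu_{t-1}, \psi, K_\epsilon] \sim N(m_t(\tau^*), K_t(\tau^*))$, where $m_t(\tau^*) = \psi'(\tau^*) Q \mu_{t-1} + K_\epsilon(\tau^*) K_\epsilon^{-1} \left(\mu_t - \Psi Q \mu_{t-1}\right)$ and $K_t(\tau^*) = K_\epsilon(\tau^*, \tau^*) - K_\epsilon(\tau^*) K_\epsilon^{-1} K_\epsilon'(\tau^*)$. Under the FDLM, the following useful simplifications are available: $K_\epsilon(\tau^*, \tau^*) = \sigma_\eta^2 + \phi'(\tau^*) \Sigma_e \phi(\tau^*)$, $K_\epsilon(\tau^*) = \phi'(\tau^*) \Sigma_e \Phi'$, and using (8), $K_\epsilon^{-1} = \sigma_\eta^{-2} I_M - \sigma_\eta^{-2} \Phi \tilde{\Sigma}_e \Phi'$, where $\phi'(\tau^*) = (\phi_1(\tau^*), \ldots, \phi_J(\tau^*))$, $\Sigma_e = \mathrm{diag}\left(\{\sigma_j^2\}_{j=1}^J\right)$, $\Phi = (\phi(\tau_1), \ldots, \phi(\tau_M))'$, and $\tilde{\Sigma}_e = \mathrm{diag}\left(\{\sigma_j^2/(\sigma_\eta^2 + \sigma_j^2)\}_{j=1}^J\right)$. By substitution, we derive
\[
\begin{aligned}
m_t(\tau^*) &= \psi'(\tau^*) Q \mu_{t-1} + K_\epsilon(\tau^*) K_\epsilon^{-1} \left(\mu_t - \Psi Q \mu_{t-1}\right) \\
&= \psi'(\tau^*) Q \mu_{t-1} + \phi'(\tau^*) \Sigma_e \Phi' \left(\sigma_\eta^{-2} I_M - \sigma_\eta^{-2} \Phi \tilde{\Sigma}_e \Phi'\right) \left(\mu_t - \Psi Q \mu_{t-1}\right) \\
&= \psi'(\tau^*) Q \mu_{t-1} + \phi'(\tau^*) \tilde{\Sigma}_e \Phi' \left(\mu_t - \Psi Q \mu_{t-1}\right),
\end{aligned}
\]
using the constraint $\Phi' \Phi = I_J$ and the simplification $\sigma_\eta^{-2} \Sigma_e - \sigma_\eta^{-2} \Sigma_e \tilde{\Sigma}_e = \tilde{\Sigma}_e$. Similarly,
\[
\begin{aligned}
K_t(\tau^*) &= K_\epsilon(\tau^*, \tau^*) - K_\epsilon(\tau^*) K_\epsilon^{-1} K_\epsilon'(\tau^*) \\
&= \sigma_\eta^2 + \phi'(\tau^*) \Sigma_e \phi(\tau^*) - \phi'(\tau^*) \Sigma_e \Phi' \left(\sigma_\eta^{-2} I_M - \sigma_\eta^{-2} \Phi \tilde{\Sigma}_e \Phi'\right) \Phi \Sigma_e \phi(\tau^*) \\
&= \sigma_\eta^2 + \sigma_\eta^2 \phi'(\tau^*) \tilde{\Sigma}_e \phi(\tau^*),
\end{aligned}
\]
which is time-invariant.

Extensions for $p > 1$ only require modification of the mean function: $m_t(\tau^*) = \sum_{\ell=1}^p \psi_\ell'(\tau^*) Q \mu_{t-\ell} + \phi'(\tau^*) \tilde{\Sigma}_e \Phi' \left(\mu_t - \sum_{\ell=1}^p \Psi_\ell Q \mu_{t-\ell}\right)$, where $\psi_\ell'(\tau) = (\psi_\ell(\tau, \tau_1), \ldots, \psi_\ell(\tau, \tau_M))$ and $\Psi_\ell = \{\psi_\ell(\tau_i, \tau_k)\}_{i,k=1}^M$.

C.5 Additional Simulation Results

In Figure C.1, we display the results from FAR(1) simulations under the dense design, while varying both smoothness of $\epsilon_t$ and the sample size, $T$. The functional data methods all nearly achieve the oracle performance, and are superior to the multivariate methods. These results confirm the findings of Didericksen et al.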
(2012): when $T$ is large and the observation points are dense in $\mathcal{T}$, existing functional data methods can nearly achieve the oracle performance, even when $\psi_1$ is estimated poorly. The proposed methods, particularly with the FDLM (FDLM-FAR(1) and FDLM-FAR(p)), outperform existing functional data methods for non-smooth GP innovations, and again are far superior for $\psi_1$ estimation. The uncertainty of $p$ incorporated into the lag selection procedure (FDLM-FAR(p)) does not appear to inhibit forecasting or estimation of $\psi_1$ substantially. For further clarity, we plot the Bimodal-Gaussian kernel in Figure C.2, which is featured prominently in our simulation study.

[Figure C.1 panels: MSFE$_e$ (top row) and MSE$_{\psi_1}$ (bottom row), left and right designs. Methods compared for MSFE$_e$: RW, Mean, VAR-Y, SES, FAR Classic, VAR-FPC, GP-FAR(1), FDLM-FAR(1), FDLM-FAR(p), FAR Oracle. Methods compared for MSE$_{\psi_1}$: FAR Classic, GP-FAR(1), FDLM-FAR(1), FDLM-FAR(p).]

Figure C.1: MSFE$_e$ (top) and corresponding MSE$_{\psi_1}$ (bottom) under various designs. Left: FAR(1), $T = 50$, dense design with the Bimodal-Gaussian kernel and non-smooth GP innovations. Right: FAR(1), $T = 350$, dense design with the Bimodal-Gaussian kernel and smooth GP innovations. The proposed methods provide superior forecasts and nearly achieve the oracle performance, despite the presence of sparsity.

C.6 Additional Details for the Yield Curve Application

We include MCMC diagnostics for the yield curve application. All diagnostics were computed using the R package coda (Plummer et al., 2006).
In Figures C.3 and C.4, we provide trace plots for the one-step forecast distributions for the nominal and real yield curves, respectively, on a single day in 2016 across selected maturities. The mixing is very efficient, as confirmed by effective sample sizes that exceed 5,000 in all cases.

[Figure C.2 (plot omitted). Title: Bimodal-Gaussian Kernel; axes labeled tau and psi(tau, u).]

Figure C.2: The Bimodal-Gaussian kernel, $\psi(\tau, u) \propto \frac{0.75}{\pi (0.3)(0.4)} \exp\{-(\tau - 0.2)^2/(0.3)^2 - (u - 0.3)^2/(0.4)^2\} + \frac{0.45}{\pi (0.3)(0.4)} \exp\{-(\tau - 0.7)^2/(0.3)^2 - (u - 0.8)^2/(0.4)^2\}$, normalized so that $\iint \psi_\ell^2(\tau, u) \, d\tau \, du = 0.8$.

In our yield curve forecasting study of Section 4.7, we included two popular parametric yield curve models based on the Nelson-Siegel parametrization (Nelson and Siegel, 1987): Diebold and Li (2006, DL) and Diebold et al. (2006, DRA). The Nelson-Siegel basis is defined by

$$f_1(\tau) = 1, \qquad f_2(\tau \mid \lambda_{NS}) = \frac{1 - \exp(-\tau \lambda_{NS})}{\tau \lambda_{NS}}, \qquad f_3(\tau \mid \lambda_{NS}) = \frac{1 - \exp(-\tau \lambda_{NS})}{\tau \lambda_{NS}} - \exp(-\tau \lambda_{NS}),$$

where $\lambda_{NS}$ is an unknown parameter. For both DL and DRA, the yield curve $Y_t(\tau)$ for time $t$ and time to maturity $\tau$ is written as a linear combination of the Nelson-Siegel basis functions, for which the corresponding weights are dynamic:

$$Y_t(\tau) = \boldsymbol{f}'(\tau \mid \lambda_{NS}) \boldsymbol{\beta}_t + \epsilon_t(\tau), \tag{C.1}$$

$$(\boldsymbol{\beta}_t - \boldsymbol{\mu}_\beta) = A (\boldsymbol{\beta}_{t-1} - \boldsymbol{\mu}_\beta) + \boldsymbol{\eta}_t, \tag{C.2}$$

where $\boldsymbol{f}'(\tau \mid \lambda_{NS}) = (f_1(\tau), f_2(\tau \mid \lambda_{NS}), f_3(\tau \mid \lambda_{NS}))$, $\boldsymbol{\beta}_t$ is the corresponding 3-dimensional vector of dynamic weights with unconditional mean $\boldsymbol{\mu}_\beta$, and $A$ is the $3 \times 3$ evolution matrix. For implementation purposes, assume that the yield curve is observed at a fixed set of maturities $\tau_1, \ldots, \tau_M$, so that (C.1) becomes

$$\boldsymbol{y}_t = F_{NS} \boldsymbol{\beta}_t + \boldsymbol{\epsilon}_t, \tag{C.3}$$

where $\boldsymbol{y}_t = (Y_t(\tau_1), \ldots, Y_t(\tau_M))'$, $F_{NS} = (\boldsymbol{f}(\tau_1 \mid \lambda_{NS}), \ldots, \boldsymbol{f}(\tau_M \mid \lambda_{NS}))'$, and $\boldsymbol{\epsilon}_t = (\epsilon_t(\tau_1), \ldots, \epsilon_t(\tau_M))'$.

The DL approach fixes $\lambda_{NS} = 0.0609$ and then estimates the parameters using a multi-step procedure. First, the weights $\{\boldsymbol{\beta}_t\}$ are estimated using ordinary least squares from (C.3).
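As a rough illustration of the Nelson-Siegel basis and of this first least-squares step, the following Python sketch builds $F_{NS}$ and recovers the weights from simulated curves. The function names, the maturities (assumed here to be measured in months), and the simulated weights are illustrative assumptions and are not taken from the original study.

```python
import numpy as np

def nelson_siegel_basis(tau, lam):
    """Return the M x 3 matrix F_NS whose rows are (f1, f2, f3) at each maturity."""
    tau = np.asarray(tau, dtype=float)
    x = tau * lam
    f2 = (1.0 - np.exp(-x)) / x
    f3 = f2 - np.exp(-x)
    return np.column_stack([np.ones_like(tau), f2, f3])

def ols_weights(Y, F):
    """Least-squares estimate of beta_t for every row y_t of Y (T x M)."""
    B, *_ = np.linalg.lstsq(F, Y.T, rcond=None)  # solves F B = Y' for B (3 x T)
    return B.T                                   # one row of weights per time point

# Hypothetical maturities in months (an assumption); lambda fixed at 0.0609 as in DL.
tau = np.array([1, 3, 6, 12, 24, 36, 60, 84, 120, 240, 360], dtype=float)
F_ns = nelson_siegel_basis(tau, lam=0.0609)

rng = np.random.default_rng(0)
beta_true = rng.normal(size=(5, 3))  # made-up weights for 5 time points
Y = beta_true @ F_ns.T               # noise-free curves, so least squares is exact
beta_hat = ols_weights(Y, F_ns)
```

Because the simulated curves contain no observation error, `beta_hat` reproduces `beta_true` up to floating-point error; with real yields the same call returns the ordinary least squares fit for each day.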
Next, the evolution matrix $A$ in (C.2) is estimated as a VAR coefficient matrix, conditional on $\{\boldsymbol{\beta}_t\}$. Diebold and Li (2006) note that constraining $A$ to be diagonal may improve forecasting in some cases. Finally, $h$-step forecasts $\hat{\boldsymbol{y}}_{T+h}$ are computed via $\hat{\boldsymbol{y}}_{T+h} = F_{NS} \hat{\boldsymbol{\beta}}_{T+h}$, where $\hat{\boldsymbol{\beta}}_{T+h}$ is the $h$-step forecast computed from the VAR in (C.2).

Alternatively, the DRA approach combines (C.3) and (C.2) into a state space model, with error distributions $\boldsymbol{\epsilon}_t \stackrel{iid}{\sim} N(\boldsymbol{0}, H)$ independent of $\boldsymbol{\eta}_t \stackrel{iid}{\sim} N(\boldsymbol{0}, Q)$. DRA assume that $H$ is diagonal; we further assume that $Q$ is diagonal, which helps stabilize computations. The unknown parameters $\{\lambda_{NS}, A, H, Q\}$ are then estimated jointly using maximum likelihood based on the Kalman filter. Following DRA, we model $\lambda_{NS}$ and the diagonal elements of $H$ and $Q$ on the log-scale to ensure positivity in the optimization routine. Conditional on the maximum likelihood estimates for these parameters, DRA use standard state space computations to construct forecasts for the response vector, $\boldsymbol{y}_t$.

[Figure C.3 (plots omitted). Traceplots of one-step forecasts for maturities of 1, 3, 6, 12, 24, and 36 months; horizontal axis: Iterations (0 to 5000).]

Figure C.3: Traceplot for one-step forecasts for nominal yield curves at selected maturities during 2016.

[Figure C.4 (plots omitted). Traceplots of one-step forecasts for maturities of 60, 84, 120, 240, and 360 months; horizontal axis: Iterations (0 to 5000).]

Figure C.4: Traceplot for one-step forecasts for real yield curves at selected maturities during 2016.

C.7 Additional Details on the Quadrature Approximation

Consider the integral in the FAR(1) evolution equation, $I(\tau) \equiv \int \psi(\tau, u) \mu_{t-1}(u) \, du$, where we omit dependence of $I$ on $t$ for notational simplicity. In the proposed methodology, we approximate this integral using quadrature: $I(\tau) \approx I_M(\tau) \equiv (\psi(\tau, \tau_1), \ldots, \psi(\tau, \tau_M)) Q \boldsymbol{\mu}_{t-1}$, where $\{\tau_1, \ldots, \tau_M\} = \mathcal{T}_e \subset \mathcal{T}$ is the set of unique evaluation points, $Q$ is a known $M \times M$ quadrature matrix, and $\boldsymbol{\mu}_{t-1} = (\mu_{t-1}(\tau_1), \ldots, \mu_{t-1}(\tau_M))'$ is the function $\mu_{t-1}$ evaluated at the evaluation points. It is important to assess how the accuracy of the approximation of $I$ by $I_M$ depends on $M$, and in particular to determine a value of $M$ sufficiently large to produce reasonable approximations in practice. However, there is a tradeoff: the state vector in the dynamic linear model is $M$-dimensional, so increasing $M$ indiscriminately may unnecessarily increase computation time.

We conducted a sensitivity analysis based on the simulations from Section 4.6 of the main paper. In particular, we use the Bimodal-Gaussian kernel, $\psi(\tau, u) \propto \frac{0.75}{\pi (0.3)(0.4)} \exp\{-(\tau - 0.2)^2/(0.3)^2 - (u - 0.3)^2/(0.4)^2\} + \frac{0.45}{\pi (0.3)(0.4)} \exp\{-(\tau - 0.7)^2/(0.3)^2 - (u - 0.8)^2/(0.4)^2\}$, normalized so that $\iint \psi_\ell^2(\tau, u) \, d\tau \, du = 0.8$. The Bimodal-Gaussian kernel is nonlinear, and therefore is inherently more difficult to approximate using linear quadrature methods, such as the trapezoidal rule.
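The quadrature approximation $I_M(\tau)$ can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than the implementation used in the study: it assumes the domain is $[0, 1]$, an equally spaced grid, a diagonal trapezoidal-rule matrix for $Q$, and a sine curve as a stand-in for $\mu_{t-1}$; the kernel is left unnormalized, and all function names are hypothetical.

```python
import numpy as np

def trapezoid_matrix(grid):
    """Diagonal quadrature matrix Q holding trapezoidal-rule weights."""
    d = np.diff(grid)
    w = np.zeros_like(grid)
    w[:-1] += d / 2.0
    w[1:] += d / 2.0
    return np.diag(w)

def bimodal_gaussian(tau, u):
    """Bimodal-Gaussian kernel, before the normalization described in the text."""
    c = np.pi * 0.3 * 0.4
    a = np.exp(-((tau - 0.2) / 0.3) ** 2 - ((u - 0.3) / 0.4) ** 2)
    b = np.exp(-((tau - 0.7) / 0.3) ** 2 - ((u - 0.8) / 0.4) ** 2)
    return (0.75 * a + 0.45 * b) / c

def I_M(tau_eval, mu, M):
    """Approximate I(tau) = int psi(tau, u) mu(u) du on an M-point grid."""
    grid = np.linspace(0.0, 1.0, M)  # assumed domain [0, 1], equally spaced
    psi = bimodal_gaussian(tau_eval[:, None], grid[None, :])
    return psi @ trapezoid_matrix(grid) @ mu(grid)

mu = lambda u: np.sin(2.0 * np.pi * u)  # stand-in for a simulated mu_{t-1}
tau_eval = np.linspace(0.0, 1.0, 21)
proxy = I_M(tau_eval, mu, 200)          # fine-grid proxy for the true integral
errors = {M: np.max(np.abs(proxy - I_M(tau_eval, mu, M))) for M in (5, 20, 50)}
```

The dictionary `errors` holds the largest absolute discrepancy between the fine-grid proxy and each coarser approximation, which gives a quick sense of how the accuracy changes with $M$ for one particular integrand.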
For the other component of the integrand, $\mu_{t-1}$, we simulate $\mu_{t-1} \sim GP(0, K)$ using the covariance function parameterization $K = \sigma^2 R_\rho$, where $R_\rho$ is the Matérn correlation function

$$R_\rho(\tau, u) = \left\{ 2^{\rho_1 - 1} \Gamma(\rho_1) \right\}^{-1} \left( \|\tau - u\| / \rho_2 \right)^{\rho_1} K_{\rho_1}\left( \|\tau - u\| / \rho_2 \right),$$

$\Gamma(\cdot)$ is the gamma function, $K_{\rho_1}$ is the modified Bessel function of order $\rho_1$, and $\rho = (\rho_1, \rho_2)$ are parameters (Matérn, 2013). We let $\sigma = 0.01$ and $\rho = (\rho_1, 0.1)$, with $\rho_1 = 2.5$ for smooth (twice-differentiable) sample paths and $\rho_1 = 0.5$ for non-smooth (continuous, non-differentiable) sample paths. Comparisons between these cases are important: the non-smooth setting is substantially more challenging for approximations.

For each simulated value of $\mu_{t-1} \sim GP(0, K)$, we compute $I_{200}(\tau)$, which we use as a proxy for the true (but unknown) integral value $I(\tau)$, and compare it to $I_M(\tau)$ for $M \in \{5, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90, 100\}$. Note that the approximation induced by $I_{200}(\tau)$ is also used to generate the simulations of Section 4.6 in the main paper. We measure accuracy using the relative absolute error (RAE) and the standardized squared error (SSE), defined respectively by

$$R_M = \int \left| \frac{I_{200}(\tau) - I_M(\tau)}{I_{200}(\tau)} \right| d\tau, \qquad S_M = \int \frac{\left( I_{200}(\tau) - I_M(\tau) \right)^2}{\sigma^2} \, d\tau, \tag{C.4}$$

which we compute for each simulation. We report the pointwise medians for each $R_M$ and $S_M$ as a function of $M$ in Figure E1. As expected, for fixed $M$, the integral approximation is more accurate when $\mu_{t-1}$, and therefore the integrand, is smooth. Nonetheless, the relative gains of increasing $M$ decline quickly for $M > 20$ in both cases.

[Figure E1 (plots omitted). Panels: Standardized Squared Errors (vertical axis SSE/$\sigma^2$) and Relative Absolute Error (vertical axis RAE), each plotted against $M$ from 20 to 100, for the smooth and non-smooth cases.]

Figure E1: Standardized squared errors and relative absolute errors for smooth (top) and non-smooth (bottom) integrands.
The errors are small in magnitude, particularly in the smooth case, and decay quickly for $M > 20$.

BIBLIOGRAPHY

Abramovich, F., Sapatinas, T., and Silverman, B. W. (1998). Wavelet thresholding via a Bayesian approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(4):725–749.

Albert, J. H. and Chib, S. (1993). Bayes inference via Gibbs sampling of autoregressive time series subject to Markov mean and variance shifts. Journal of Business & Economic Statistics, 11(1):1–15.

Armagan, A., Clyde, M., and Dunson, D. B. (2011). Generalized Beta mixtures of Gaussians. In Advances in neural information processing systems, pages 523–531.

Arnold, T. B. and Tibshirani, R. J. (2014). genlasso: Path algorithm for generalized lasso problems. R package version 1.3.

Aue, A., Norinho, D. D., and Hörmann, S. (2015). On the prediction of stationary functional time series. Journal of the American Statistical Association, 110(509):378–392.

Bae, K. and Mallick, B. K. (2004). Gene selection using a two-level hierarchical Bayesian model. Bioinformatics, 20(18):3423–3430.

Barndorff-Nielsen, O., Kent, J., and Sørensen, M. (1982). Normal variance-mean mixtures and z distributions. International Statistical Review/Revue Internationale de Statistique, pages 145–159.

Behseta, S., Kass, R. E., and Wallstrom, G. L. (2005). Hierarchical models for assessing variability among functions. Biometrika, 92(2):419–434.

Belmonte, M. A., Koop, G., and Korobilis, D. (2014). Hierarchical shrinkage in time-varying parameter models. Journal of Forecasting, 33(1):80–94.

Berger, J. (1980). A robust generalized Bayes estimator and confidence region for a multivariate normal mean. The Annals of Statistics, pages 716–761.

Berry, S. M., Carroll, R. J., and Ruppert, D. (2002).
Bayesian smoothing and regression splines for measurement error problems. Journal of the American Statistical Association, 97(457):160–169.

Besse, P. C., Cardot, H., and Stephenson, D. B. (2000). Autoregressive forecasting of some functional climatic variations. Scandinavian Journal of Statistics, pages 673–687.

Bloomfield, P. (2004). Fourier analysis of time series. John Wiley & Sons.

Bolder, D., Johnson, G., and Metzler, A. (2004). An empirical analysis of the Canadian term structure of zero-coupon interest rates. Bank of Canada.

Bosq, D. (2000). Linear processes in function spaces: theory and applications, volume 149. Springer Science & Business Media.

Bosq, D. and Blanke, D. (2008). Inference and prediction in large dimensions, volume 754. John Wiley & Sons.

Botly, L. C. and De Rosa, E. (2009). Cholinergic deafferentation of the neocortex using 192 IgG-saporin impairs feature binding in rats. The Journal of Neuroscience, 29(13):4120–4130.

Bowsher, C. G. and Meeks, R. (2008). The dynamics of economic functions: modeling and forecasting the yield curve. Journal of the American Statistical Association, 103(484).

Cardot, H., Ferraty, F., and Sarda, P. (1999). Functional linear model. Statistics & Probability Letters, 45(1):11–22.

Carhart, M. M. (1997). On persistence in mutual fund performance. The Journal of Finance, 52(1):57–82.

Carvalho, C. M., Lopes, H. F., and Aguilar, O. (2011). Dynamic stock selection strategies: A structured factor model framework. Bayesian Statistics, 9:1–21.

Carvalho, C. M., Polson, N. G., and Scott, J. G. (2009). Handling sparsity via the horseshoe. In AISTATS, volume 5, pages 73–80.

Carvalho, C. M., Polson, N. G., and Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika, pages 465–480.

Chan, J. C. (2013). Moving average stochastic volatility models with application to inflation forecast. Journal of Econometrics, 176(2):162–172.

Chan, J. C. and Jeliazkov, I. (2009).
Efficient simulation and integrated likelihood estimation in state space models. International Journal of Mathematical Modelling and Numerical Optimisation, 1(1-2):101–120.

Chan, J. C., Koop, G., Leon-Gonzalez, R., and Strachan, R. W. (2012). Time varying dimension models. Journal of Business & Economic Statistics, 30(3):358–367.

Chen, Y. and Li, B. (2015). An adaptive functional autoregressive forecast model to predict electricity price curves. Journal of Business & Economic Statistics, (just-accepted):1–56.

Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90(432):1313–1321.

Chib, S. and Ergashev, B. (2009). Analysis of multifactor affine yield curve models. Journal of the American Statistical Association, 104(488):1324–1337.

Chib, S., Nardari, F., and Shephard, N. (2002). Markov chain Monte Carlo methods for stochastic volatility models. Journal of Econometrics, 108(2):281–316.

Constantine, W. and Percival, D. (2016). wmtsa: Wavelet Methods for Time Series Analysis. R package version 2.0-2.

Crainiceanu, C., Ruppert, D., and Wand, M. P. (2005). Bayesian analysis for penalized spline regression using WinBUGS. Journal of Statistical Software, 14(14):1–24.

Cressie, N. and Wikle, C. K. (2011). Statistics for spatio-temporal data. John Wiley & Sons.

Cruz-Marcelo, A., Ensor, K. B., and Rosner, G. L. (2011). Estimating the term structure with a semiparametric Bayesian hierarchical model: an application to corporate bonds. Journal of the American Statistical Association, 106(494).

Damon, J. and Guillas, S. (2002). The inclusion of exogenous variables in functional autoregressive ozone forecasting. Environmetrics, 13:759–774.

Damon, J. and Guillas, S. (2005). Estimation and simulation of autoregressive Hilbertian processes with exogenous variables. Statistical Inference for Stochastic Processes, 8(2):185–204.

Daníelsson, J. (1998).
Multivariate stochastic volatility models: estimation and a comparison with VGARCH models. Journal of Empirical Finance, 5(2):155–173.

Datta, J. and Ghosh, J. K. (2013). Asymptotic properties of Bayes risk for the horseshoe prior. Bayesian Analysis, 8(1):111–132.

Dées, S. and Saint-Guilhem, A. (2011). The role of the United States in the global economy and its evolution over time. Empirical Economics, 41(3):573–591.

Didericksen, D., Kokoszka, P., and Zhang, X. (2012). Empirical properties of forecasts with the functional autoregressive model. Computational Statistics, 27(2):285–298.

Diebold, F. X. and Li, C. (2006). Forecasting the term structure of government bond yields. Journal of Econometrics, 130(2):337–364.

Diebold, F. X., Li, C., and Yue, V. Z. (2008). Global yield curve dynamics and interactions: a dynamic Nelson–Siegel approach. Journal of Econometrics, 146(2):351–363.

Diebold, F. X., Rudebusch, G. D., and Aruoba, B. S. (2006). The macroeconomy and the yield curve: a dynamic latent factor approach. Journal of Econometrics, 131(1):309–338.

Donoho, D. L. and Johnstone, J. M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425–455.

Durbin, J. and Koopman, S. J. (2002). A simple and efficient simulation smoother for state space time series analysis. Biometrika, 89(3):603–616.

Earls, C. and Hooker, G. (2014). Bayesian covariance estimation and inference in latent Gaussian process models. Statistical Methodology, 18:79–100.

Eubank, R. L. (1999). Nonparametric regression and spline smoothing. CRC Press.

Fama, E. F. and French, K. R. (1993). Common risk factors in the returns on stocks and bonds. Journal of Financial Economics, 33(1):3–56.

Fama, E. F. and French, K. R. (2015). A five-factor asset pricing model. Journal of Financial Economics, 116(1):1–22.

Faulkner, J. R. and Minin, V. N. (2016). Locally adaptive smoothing with Markov random fields and shrinkage priors. Bayesian Analysis.

Fernandez, C. and Steel, M. F. (2000).
Bayesian regression analysis with scale mixtures of normals. Econometric Theory, 16(01):80–101.

Ferraty, F. and Vieu, P. (2006). Nonparametric functional data analysis: theory and practice. Springer.

Figueiredo, M. A. (2003). Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9):1150–1159.

Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22.

Frühwirth-Schnatter, S. and Wagner, H. (2010). Stochastic model specification search for Gaussian and partial non-Gaussian state space models. Journal of Econometrics, 154(1):85–100.

Gamerman, D. and Migon, H. S. (1993). Dynamic hierarchical models. Journal of the Royal Statistical Society. Series B (Methodological), pages 629–642.

Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Analysis, 1(3):515–534.

Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711–732.

Green, P. J. and Silverman, B. W. (1993). Nonparametric regression and generalized linear models: a roughness penalty approach. CRC Press.

Griffin, J. E. and Brown, P. J. (2005). Alternative prior distributions for variable selection with very many more variables than observations. Technical report, University of Warwick, Centre for Research in Statistical Methodology.

Griffin, J. E. and Brown, P. J. (2010). Inference with normal-gamma prior distributions in regression problems. Bayesian Analysis, 5(1):171–188.

Gu, C. (1992). Penalized likelihood regression: a Bayesian analysis. Statistica Sinica, 2(1):255–264.

Harvey, A., Ruiz, E., and Shephard, N. (1994). Multivariate stochastic variance models. The Review of Economic Studies, 61(2):247–264.

Hastie, T. and Tibshirani, R. (1993). Varying-coefficient models.
Journal of the Royal Statistical Society. Series B (Methodological), pages 757–796.

Hays, S., Shen, H., and Huang, J. Z. (2012). Functional dynamic factor models with application to yield curve forecasting. The Annals of Applied Statistics, 6(3):870–894.

Horváth, L. and Kokoszka, P. (2012). Inference for functional data with applications, volume 200. Springer Science & Business Media.

Hyndman, R. and Khandakar, Y. (2008). Automatic time series forecasting: The forecast package for R. Journal of Statistical Software, 27(1):1–22.

Hyndman, R. J. and Ullah, M. S. (2007). Robust forecasting of mortality and fertility rates: a functional data approach. Computational Statistics & Data Analysis, 51(10):4942–4956.

James, N. A., Kejariwal, A., and Matteson, D. S. (2016). Leveraging cloud data to mitigate user experience from ‘Breaking Bad’. In 2016 IEEE International Conference on Big Data, pages 3499–3508. IEEE.

Jungbacker, B., Koopman, S. J., and van der Wel, M. (2013). Smooth dynamic factor analysis with application to the US term structure of interest rates. Journal of Applied Econometrics.

Kalli, M. and Griffin, J. E. (2014). Time-varying sparsity in dynamic regression models. Journal of Econometrics, 178(2):779–793.

Kargin, V. and Onatski, A. (2008). Curve forecasting by functional autoregression. Journal of Multivariate Analysis, 99(10):2508–2526.

Kastner, G. (2016). Dealing with stochastic volatility in time series using the R package stochvol. Journal of Statistical Software, 69(5):1–30.

Kastner, G. and Frühwirth-Schnatter, S. (2014). Ancillarity-sufficiency interweaving strategy (ASIS) for boosting MCMC estimation of stochastic volatility models. Computational Statistics & Data Analysis, 76:408–423.

Kaufman, C. G. and Sain, S. R. (2010). Bayesian functional ANOVA modeling using Gaussian process prior distributions. Bayesian Analysis, 5(1):123–149.

Kim, S., Shephard, N., and Chib, S. (1998).
Stochastic volatility: likelihood inference and comparison with ARCH models. The Review of Economic Studies, 65(3):361–393.

Kim, S.-J., Koh, K., Boyd, S., and Gorinevsky, D. (2009). $\ell_1$ Trend Filtering. SIAM Review, 51(2):339–360.

Kokoszka, P. (2012). Dependent functional data. ISRN Probability and Statistics, 2012.

Kokoszka, P. and Reimherr, M. (2013). Determining the order of the functional autoregressive model. Journal of Time Series Analysis, 34(1):116–129.

Koopman, S. J. and Durbin, J. (2000). Fast filtering and smoothing for multivariate state space models. Journal of Time Series Analysis, 21(3):281–296.

Koopman, S. J. and Durbin, J. (2003). Filtering and smoothing of state vector for diffuse state-space models. Journal of Time Series Analysis, 24(1):85–98.

Koopman, S. J., Mallee, M. I., and Van der Wel, M. (2010). Analyzing the term structure of interest rates using the dynamic Nelson–Siegel model with time-varying parameters. Journal of Business & Economic Statistics, 28(3):329–343.

Korobilis, D. (2013a). Hierarchical shrinkage priors for dynamic regressions with many predictors. International Journal of Forecasting, 29(1):43–59.

Korobilis, D. (2013b). VAR forecasting using Bayesian variable selection. Journal of Applied Econometrics, 28(2):204–230.

Kowal, D. R., Matteson, D. S., and Ruppert, D. (2016). A Bayesian multivariate functional dynamic linear model. Journal of the American Statistical Association. (in press).

Kowal, D. R., Matteson, D. S., and Ruppert, D. (2017). Functional autoregression for sparsely sampled data. Journal of Business & Economic Statistics, pages 1–13.

Kuo, L. and Mallick, B. (1998). Variable selection for regression models. Sankhyā: The Indian Journal of Statistics, Series B, pages 65–81.

Kyung, M., Gill, J., Ghosh, M., and Casella, G. (2010). Penalized regression, standard errors, and Bayesian lassos. Bayesian Analysis, 5(2):369–411.

Laurini, M. P. (2014).
Dynamic functional data analysis with non-parametric state space models. Journal of Applied Statistics, 41(1):142–163.

Laurini, M. P. and Hotta, L. K. (2010). Bayesian extensions to Diebold-Li term structure model. International Review of Financial Analysis, 19(5):342–350.

Li, B., DeWetering, E., Lucas, G., Brenner, R., and Shapiro, A. (2001). Merrill Lynch exponential spline model. Technical report, Merrill Lynch working paper.

Ljubojevic, V., Bennett, L.-A., Gill, P. R., Luu, P., Takehara-Nishiuchi, K., and De Rosa, E. (2013). Cholinergic modulation of attention-driven oscillations during feature binding in rats. In Society for Neuroscience.

Matérn, B. (2013). Spatial variation, volume 36. Springer Science & Business Media.

Matteson, D. S., McLean, M. W., Woodard, D. B., and Henderson, S. G. (2011). Forecasting emergency medical service call arrival rates. The Annals of Applied Statistics, 5(2B):1379–1406.

McCausland, W. J., Miller, S., and Pelletier, D. (2011). Simulation smoothing for state–space models: A computational efficiency analysis. Computational Statistics & Data Analysis, 55(1):199–212.

McCulloch, R. E. and Tsay, R. S. (1993). Bayesian inference and prediction for mean and variance shifts in autoregressive time series. Journal of the American Statistical Association, 88(423):968–978.

Nakajima, J. and West, M. (2013). Bayesian analysis of latent threshold dynamic models. Journal of Business & Economic Statistics, 31(2):151–164.

Nason, G. (2016). wavethresh: Wavelets Statistics and Transforms. R package version 4.6.8.

Neal, R. M. (1999). Regression and classification using Gaussian process priors. Bayesian Statistics, 6:475–501.

Neal, R. M. (2003). Slice sampling. Annals of Statistics, pages 705–741.

Nelson, C. R. and Siegel, A. F. (1987). Parsimonious modeling of yield curves. Journal of Business, 60(4):473.

Omori, Y., Chib, S., Shephard, N., and Nakajima, J. (2007). Stochastic volatility with leverage: Fast and efficient likelihood inference.
Journal of Econometrics, 140(2):425–449.

O’Sullivan, F. (1986). A statistical perspective on ill-posed inverse problems. Statistical Science, pages 502–518.

Petris, G., Petrone, S., and Campagnoli, P. (2009). Dynamic linear models with R. Springer.

Piironen, J. and Vehtari, A. (2016). On the hyperprior choice for the global shrinkage parameter in the horseshoe prior. arXiv preprint arXiv:1610.05559.

Plummer, M., Best, N., Cowles, K., and Vines, K. (2006). CODA: Convergence diagnosis and output analysis for MCMC. R News, 6(1):7–11.

Polson, N. G. and Scott, J. G. (2010). Shrink globally, act locally: sparse Bayesian regularization and prediction. Bayesian Statistics, 9:501–538.

Polson, N. G. and Scott, J. G. (2012a). Local shrinkage rules, Lévy processes and regularized regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(2):287–311.

Polson, N. G. and Scott, J. G. (2012b). On the half-Cauchy prior for a global scale parameter. Bayesian Analysis, 7(4):887–902.

Polson, N. G., Scott, J. G., and Windle, J. (2013). Bayesian inference for logistic models using Pólya–Gamma latent variables. Journal of the American Statistical Association, 108(504):1339–1349.

Ramsay, J. and Silverman, B. (2005). Functional Data Analysis. Springer.

Ramsay, J. O. (2006). Functional data analysis. Wiley Online Library.

Ramsay, J. O., Wickham, H., Graves, S., and Hooker, G. (2014). fda: Functional Data Analysis. R package version 2.4.4.

Rasmussen, C. E. and Williams, C. K. (2006). Gaussian processes for machine learning. The MIT Press.

Rue, H. (2001). Fast sampling of Gaussian Markov random fields. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):325–338.

Ruppert, D., Wand, M. P., and Carroll, R. J. (2003). Semiparametric regression. Number 12. Cambridge University Press.

Sharpe, W. F. (1964). Capital asset prices: A theory of market equilibrium under conditions of risk. The Journal of Finance, 19(3):425–442.

Shi, J.
Q. and Choi, T. (2011). Gaussian process regression analysis for functional data. CRC Press.

Shumway, R. H. and Stoffer, D. S. (2000). Time series analysis and its applications, volume 3. Springer New York.

Staicu, A.-M., Crainiceanu, C. M., Reich, D. S., and Ruppert, D. (2012). Modeling functional data with spatially heterogeneous shape characteristics. Biometrics, 68(2):331–343.

Strawderman, W. E. (1971). Proper Bayes minimax estimators of the multivariate normal mean. The Annals of Mathematical Statistics, 42(1):385–388.

Svensson, L. E. (1994). Estimating and interpreting forward interest rates: Sweden 1992-1994. Technical report, National Bureau of Economic Research.

Taylor, S. J. (1994). Modeling stochastic volatility: A review and comparative study. Mathematical Finance, 4(2):183–204.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288.

Tibshirani, R. J. (2014). Adaptive piecewise polynomial estimation via trend filtering. The Annals of Statistics, 42(1):285–323.

Van der Linde, A. (1995). Splines from a Bayesian point of view. Test, 4(1):63–81.

van der Pas, S., Kleijn, B., and van der Vaart, A. (2014). The horseshoe estimator: Posterior concentration around nearly black vectors. Electronic Journal of Statistics, 8(2):2585–2618.

Waggoner, D. F. (1997). Spline methods for extracting interest rate curves from coupon bond prices, volume 97. Federal Reserve Bank of Atlanta USA.

Wahba, G. (1978). Improper priors, spline smoothing and the problem of guarding against model errors in regression. Journal of the Royal Statistical Society. Series B (Methodological), pages 364–372.

Wahba, G. (1983). Bayesian “confidence intervals” for the cross-validated smoothing spline. Journal of the Royal Statistical Society. Series B (Methodological), pages 133–150.

Wahba, G. (1990). Spline models for observational data, volume 59. SIAM.

Wand, M. and Ormerod, J.
(2008). On semiparametric regression with O’Sullivan penalized splines. Australian & New Zealand Journal of Statistics, 50(2):179–198.

West, M. and Harrison, J. (1997). Bayesian Forecasting and Dynamic Models. Springer.

Wood, S. N. (2003). Thin plate regression splines. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(1):95–114.

Yao, F., Müller, H.-G., and Wang, J.-L. (2005). Functional data analysis for sparse longitudinal data. Journal of the American Statistical Association, 100(470):577–590.

Zhu, B. and Dunson, D. B. (2013). Locally adaptive Bayes nonparametric regression via nested Gaussian processes. Journal of the American Statistical Association, 108(504):1445–1456.