MACHINE LEARNING IN RESTING-STATE AND NATURALISTIC fMRI ANALYSIS

A Dissertation
Presented to the Faculty of the Graduate School
of Cornell University
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy

by
Meenakshi Khosla
August 2021

© 2021 Meenakshi Khosla
ALL RIGHTS RESERVED

MACHINE LEARNING IN RESTING-STATE AND NATURALISTIC fMRI ANALYSIS
Meenakshi Khosla, Ph.D.
Cornell University 2021

Two brain activity recording paradigms in humans have emerged as increasingly popular tools for studying brain function in health and in disease, namely resting-state and naturalistic stimulation. These two techniques attempt to capture brain activity ‘in the wild’ when it is unconstrained by any specific task and thus reflect more naturalistic modes of operation of the brain. The complexity and very high dimensionality of these data, the range of potential applications, and the lack of standard, straightforward analysis tools make machine learning very attractive for this kind of data. In this thesis, we draw upon recent advances in machine learning, fueled by the success of deep learning, to develop models that can capture the full richness of this data.

Resting-state fMRI (rs-fMRI) has enormous potential to advance our understanding of the brain’s functional organization and how it is altered by damage or disease. Over the last decade, substantial effort has been devoted to using rs-fMRI for classification of a wide range of neuropsychiatric conditions, such as Alzheimer’s disease, schizophrenia, and autism spectrum disorder. While a growing number of studies have demonstrated the promise of machine learning algorithms for rs-fMRI-based clinical or behavioral prediction, most prior models have been limited in their capacity to exploit the richness of the data. The first part of this thesis describes our work on developing novel machine learning approaches for deriving subject-level predictions from resting-state fMRI scans.
We propose a novel volumetric Convolutional Neural Network (CNN) framework that takes advantage of the full-resolution 3D spatial structure of rs-fMRI data and fits non-linear predictive models. We showcase our approach on a challenging large-scale dataset and report state-of-the-art accuracy results on rs-fMRI-based discrimination of autism patients and healthy controls.

The second part of this thesis is aimed at developing predictive models that can capture information processing within the brain under naturalistic stimulation more stringently than existing approaches. Brain activity recordings of healthy subjects during “free viewing” of movies present a powerful opportunity to build ecologically sound and generalizable models of sensory systems, also known as encoding models. Deep neural networks trained on image or sound recognition tasks have emerged as powerful models of computations underlying sensory processing in the brain, surpassing traditional models of image or sound representation based on Gabor filters and spectro-temporal filters, respectively. While this success is promising, existing encoding models based on deep neural networks have focused on limited portions of the sensory space under naturalistic stimulation, ignoring the complex and dynamic interactions of modalities (audio and vision) in this inherently context-rich paradigm. In the second part of this thesis, we introduce our research on predictive models of cortical responses that aim to capture several critical inductive biases about information processing in the brain: namely, hierarchical processing, assimilation over longer timescales, attentional modulation and multi-sensory auditory-visual interactions. We describe our efforts in capturing these phenomena in models of the brain and share our latest findings from this novel computational approach.
Finally, we describe our ongoing efforts to systematically characterize neural response properties in the visual cortex under ‘ecological’ conditions, in an entirely data-driven fashion, using computational models. Together, our findings illustrate how computational models overcome the tradition of excessive reductionism in cognitive neuroimaging by providing a general-purpose framework that abstracts away from the particulars of the experimental approach and can be used to describe multiple experiments at the same time.

BIOGRAPHICAL SKETCH

Meenakshi is interested in using computational models to understand how the brain processes the natural world. She uses techniques from artificial intelligence and functional imaging in her research to understand the nature of representations and computations in the brain. By modelling the computations underlying sensory processing ‘in the wild’, a critical focus of her research is to understand the function of different brain areas and how they collectively support complex human behavior. She also has more general interests across machine learning and neuroimaging, particularly in the use of predictive models to understand the distinctive characteristics of the brains of people affected by different mental disorders. Beyond satiating scientific curiosity, she hopes her research can be used to develop novel diagnostics of neuropsychiatric diseases, and to inform treatment and the design of personalized therapeutics. Prior to completing her PhD at Cornell, she obtained a B.Tech-M.Tech dual degree in Electrical Engineering at the Indian Institute of Technology, Kanpur and worked as a postgraduate research associate at the Yale School of Medicine.

ACKNOWLEDGEMENTS

This thesis and the ideas it is based on would not have been possible without the enthusiasm, kindness, support and insight of my advisor, Mert Sabuncu.
I am very thankful for the freedom that Mert gave me to explore the directions I was interested in and for his belief in them. Mert has been a constant source of encouragement and guidance through every step of graduate school, and for that I’m extremely grateful.

I am very grateful for the opportunity to work with Amy Kuceyeski. Her enthusiasm for working towards understanding the brain has been both contagious and very inspiring. I would also like to thank Keith Jamison for great discussions and tremendous help.

I am grateful for the wonderful company of my amazing labmates: Evan M. Yu, Zhilu Zhang, Gia H. Ngo, Zijin Gu, Cagla Bahadir, Carmen Khoo, Sijia Gao, Amaya Murguia, Tianyu Ma, Matthew Pool, Alan Wang and Victor Butoi. I am particularly thankful to Evan and Zhilu for teaching me so many things (academic or otherwise) and being my late-night lab companions and to Gia for always checking in and being a helpful and caring friend and amazing collaborator. I’d like to thank Scott Coldren for administrative help throughout graduate school. I’m also very grateful to Leila Wehbe and Andreas Tolias for being wonderful mentors during my PhD journey.

I would also like to thank these fantastic people outside the lab who made Ithaca a home away from home: Drishti Wali, Arpita Sharma, Arzoo Katiyar, Ashudeep Singh, Prateek Sehgal, Utkarsh Mall and Saksham Agarwal. I also thank my oldest group of friends for being in my life and always encouraging me: Akansha Srivastava, Sakshi Sinha, Silky Gupta, Manvi Gupta, Unnat Jain, Varsha Lalwani and Mohit Sharma.

Lastly, I’d like to thank my mother Deepika Khosla, my sister Pooja Khosla and my partner Rishabh Gupta for their love and support through this long endeavor, for keeping me sane and cheering me up when things got tough. I am extremely grateful to my entire extended family, including my grandparents, uncles, aunts and cousins whose love has always kept me going.
Special thanks and love to the newest member of our family, Cocoa, who filled the last months of my PhD with so much joy. I dedicate this thesis to my father Rajiv Khosla, who always encouraged me to pursue my dreams and has been my biggest advocate and greatest source of strength. Every ounce of confidence I have, I owe it to my father. v TABLE OF CONTENTS Biographical Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi 1 Introduction 1 2 Related Work 7 2.1 Machine learning in resting-state fMRI analysis . . . . . . . . . . . 7 2.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.1.2 Unsupervised learning methods . . . . . . . . . . . . . . . . 14 2.1.3 Applications of unsupervised learning in rs-fMRI . . . . . . 22 2.1.4 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . 41 2.1.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 2.1.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 3 Linking resting-state brain activity and mental disorders with ma- chine learning 71 3.1 Ensemble learning with 3D convolutional neural networks for func- tional connectome-based prediction . . . . . . . . . . . . . . . . . . 71 3.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 3.1.2 Materials and Methods . . . . . . . . . . . . . . . . . . . . . 75 3.1.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 3.1.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 3.1.5 Limitations and future work . . . . . . . . . . . . . . . . . . 99 3.1.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . 101 3.2 Detecting abnormalities in resting-state dynamics: An unsupervised learning approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 3.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 3.2.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . 104 3.2.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 3.2.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 3.2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 4 Towards holistic encoding models for predicting fMRI responses to multimodal naturalistic stimuli 114 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 4.2 Endowing neural encoding models with both audition and vision and and stimulus history . . . . . . . . . . . . . . . . . . . . . . . . 116 4.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 4.2.2 Materials and Methods . . . . . . . . . . . . . . . . . . . . . 121 4.2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 vi 4.2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 4.3 Neural encoding with visual attention . . . . . . . . . . . . . . . . . 152 4.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 4.3.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 4.3.3 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 4.3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162 4.3.5 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . 168 4.3.6 Broader Impact . . . . . . . . . . . . . . . . . . . . . . . . . 170 5 A shared encoding model for subject-specific response prediction172 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172 5.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176 5.2.1 Implementation details . . . . . . . . . . . 
. . . . . . . . . . 177 5.2.2 Data and Preprocessing . . . . . . . . . . . . . . . . . . . . 179 5.2.3 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 5.2.4 Performance evaluation . . . . . . . . . . . . . . . . . . . . . 180 5.2.5 Demonstration of application: personalized brain mapping . 181 5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182 5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183 6 A computational strategy to richly characterize the human visual cortex under naturalistic conditions 186 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186 6.2 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . 190 6.2.1 Natural Scenes Dataset . . . . . . . . . . . . . . . . . . . . . 190 6.2.2 Response-optimized encoding model architecture . . . . . . . 191 6.2.3 Training and testing models . . . . . . . . . . . . . . . . . . 192 6.2.4 Comparison against retinotopic measurements from pRF- localizer scan . . . . . . . . . . . . . . . . . . . . . . . . . . 194 6.2.5 Representational Similarity Analysis . . . . . . . . . . . . . 195 6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196 6.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204 7 Looking ahead 208 A Supplementary Information and Additional Results for Section 3.1 257 A.1 Atlas Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257 A.2 Poisson Disk Sampling . . . . . . . . . . . . . . . . . . . . . . . . . 257 A.3 Linear Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258 A.3.1 Ridge Classifier . . . . . . . . . . . . . . . . . . . . . . . . . 258 A.3.2 Support Vector Machines . . . . . . . . . . . . . . . . . . . . 259 A.4 Neural network hyperparameter settings . . . . . . . . . . . . . . . 260 A.5 ABIDE-I cross-validation results . . . . . . . . . . . . . . . . . . . . 
262 vii A.6 Saliency maps for individual parcellations . . . . . . . . . . . . . . . 262 A.7 Comparison of different preprocessing strategies . . . . . . . . . . . 263 B Supplementary Information and Additional Results for Section 4.2 271 B.1 HCP Movies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271 B.2 Region of Interest (ROI) selection . . . . . . . . . . . . . . . . . . . 271 B.3 Estimating BOLD response delay . . . . . . . . . . . . . . . . . . . 273 B.4 Defining the stimulus-driven or “synchronous” cortex . . . . . . . . 274 B.5 Model architectures and implementation . . . . . . . . . . . . . . . 275 B.6 Regularized linear regression: deep convolutional features . . . . . . 277 B.7 Regularized linear regression: WordNet features . . . . . . . . . . . 280 B.8 Ablation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282 B.9 Computing significance estimates . . . . . . . . . . . . . . . . . . . 283 B.10 Sensory-sensitivity index . . . . . . . . . . . . . . . . . . . . . . . . 284 B.11 Stimuli for synthetic contrasts . . . . . . . . . . . . . . . . . . . . . 284 B.12 Perturbation analysis with 20-sec models . . . . . . . . . . . . . . . 285 B.13 Performance improvement and autocorrelation decay . . . . . . . . 286 B.14 Surface visualization . . . . . . . . . . . . . . . . . . . . . . . . . . 289 B.15 Qualitative analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 289 B.16 Group-level prediction accuracy: held-out set . . . . . . . . . . . . . 290 B.17 Subject-level prediction accuracy: held-out set . . . . . . . . . . . . 293 B.18 Correcting with inter-group synchrony . . . . . . . . . . . . . . . . 295 B.19 Influence of motion . . . . . . . . . . . . . . . . . . . . . . . . . . . 295 C Supplementary Information and Additional Results for Section 4.3 298 C.1 Model comparison across randomly selected layers . . . . . . . . . . 298 C.2 Representational similarity analysis . . . . . . . . . . . . . . . . . . 
298 C.3 Regions of interest (ROI) . . . . . . . . . . . . . . . . . . . . . . . . 300 C.4 Center-weighted attention . . . . . . . . . . . . . . . . . . . . . . . 301 C.5 Voxel-wise prediction accuracy (R) of linear models . . . . . . . . . 302 C.6 Estimating hemodynamic (BOLD) response delay . . . . . . . . . . 302 viii LIST OF TABLES 3.1 Composition of Cohorts . . . . . . . . . . . . . . . . . . . . . . . . 77 3.2 Classification accuracy for ASD vs. Control: Independent test on ABIDE-II of baseline models and proposed CNN approach. For each row, best results are bolded. For each column, best results are italicized. Green indicates better performance, whereas orange/red highlights worse performance. . . . . . . . . . . . . . . . . . . . . . 85 3.3 Root mean squared error (RMSE in years) for age prediction: In- dependent test on ABIDE-II for benchmark models and proposed CNN approach. For each row, best results are bolded. For each column, best results are italicized. . . . . . . . . . . . . . . . . . . 86 3.4 Classification/regression performance of FCN with a high-resolution parcellation ( ∼ 1024 ROIs) [216] . . . . . . . . . . . . . . . . . . . 95 3.5 Next frame prediction performance on healthy test subjects for different models. *Interpolation model had access to the frame after the predicted frame. . . . . . . . . . . . . . . . . . . . . . . . . . 110 3.6 Reconstruction performance of the proposed recurrent autoencoder on healthy test subjects for different input sequence lengths. . . . 110 3.7 Area under the ROC curve for discriminating ASD vs Controls. P-values of the unpaired t-test comparing means of the two clinical groups are shown in brackets. . . . . . . . . . . . . . . . . . . . . 111 4.1 Evaluation against saliency prediction models. Mean and standard errors for each metric are reported. Best results are bolded. . . . . 168 A.1 Summary descriptors of ROIs in individual atlases. All volumes are in cm3. . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . 257 A.2 Classification accuracy for ASD vs. Control: 10-fold cross-validation on ABIDE-I for benchmark models and proposed CNN approach. For each row, best results are bolded. For each column, best results are italicized. Green indicates better performance, whereas orange/red highlights worse performance. . . . . . . . . . . . . . . 261 A.3 Root mean squared error (RMSE in years) for age prediction: 10-fold cross-validation on ABIDE-I for benchmark models and proposed CNN approach. For each row, best results are bolded. For each col- umn, best results are italicized. Green indicates better performance, whereas orange/red highlights worse performance. . . . . . . . . . 261 A.4 Classification accuracy for ASD vs. Control: Independent results on ABIDE-II for benchmark models and proposed CNN approach. Green indicates better performance, whereas orange/red highlights worse performance. . . . . . . . . . . . . . . . . . . . . . . . . . . 264 ix A.5 Root mean squared error (RMSE in years) for age prediction: Inde- pendent results on ABIDE-II for benchmark models and proposed CNN approach. Green indicates better performance, whereas or- ange/red highlights worse performance. . . . . . . . . . . . . . . . 268 A.6 Mean absolute error (MAE in years) for age prediction: Independent testing on ABIDE-II for benchmark models and proposed CNN approach. For each row, best results are bolded. For each column, best results are italicized. Green indicates better performance, whereas orange/red highlights worse performance. . . . . . . . . . 270 B.1 HCP dataset split . . . . . . . . . . . . . . . . . . . . . . . . . . . 271 B.2 ROI categorization . . . . . . . . . . . . . . . . . . . . . . . . . . . 272 C.1 ROI categorization . . . . . . . . . . . . . . . . . . . . . . . . . . . 301 x LIST OF FIGURES 2.1 Traditional seed based analysis approach . . . . . . . . . . . . . . . 12 2.2 Applications of machine learning methods in resting-state fMRI . 
15 2.3 A taxonomy of unsupervised learning methods used for rs-fMRI analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.4 Illustrations of popular clustering algorithms: K-means clustering partitions the data space into Voronoi cells, where each observation is assigned to the cluster with the nearest centroid (marked red in the figure). GMMs assume that each cluster is sampled from a multivariate Gaussian distribution and estimates these probability densities to generate probabilistic assignment of observations to different clusters. Hierarchical (agglomerative) clustering generates nested partitions, where partitions are merged iteratively based on a linkage criteria. Graph-based clustering partitions the graph repre- sentation of data so that, for example, number of edges connecting distinct clusters are minimal. . . . . . . . . . . . . . . . . . . . . . 67 2.5 Schematic of application 2.1.3: In decomposition, the original fMRI data is expressed as a linear combination of spatial patterns and their associated time series - in ICA, the independence of spatial maps is optimized whereas in sparse dictionary learning, the sparsity of maps is encouraged. In clustering, time series or connectivity fingerprints of voxels are clustered to assign voxels to distinct functional networks. 68 2.6 Schematic of application 2.1.3. Three connectivity states are as- sumed in the data for illustration purposes . . . . . . . . . . . . . . 68 2.7 Schematic of application 2.1.3. Dimensionality reduction of high- dimensional connectomes into 3 latent components is shown for illustration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 2.8 A common classification/regression pipeline for connectomes . . . . 69 2.9 A summary of design choices for supervised learning with rs-fMRI . 69 2.10 A taxonomy of supervised learning methods used for rs-fMRI analysis 70 3.1 A general illustration of the proposed approach . . . . . . . . . . . 
73 3.2 ROI masks for example SPs and atlas at each of the four spatial scales considered in this study. . . . . . . . . . . . . . . . . . . . . 78 3.3 Proposed CNN approach. All operations are in 3D volume. 2D correlation maps are shown for illustration only. For the age predic- tion task, an additional Max-Pooling and Batch-Normalization[208] operation followed the first and second convolutional layer. . . . . . 82 3.4 ASD-HC Classification: Receiver Operating Curves for independent validation on ABIDE-2 . . . . . . . . . . . . . . . . . . . . . . . . 87 xi 3.5 Violin plots showing the spread of prediction accuracies/errors for stochastic parcellations at multiple network scales for different classification models. Mean accuracy/error of individual violins is denoted by ’Mean SPs’. Performance of individual atlases is compared with SPs with the closest # of ROIs and is denoted as ’Single Atlas’. Results are computed by training models on entire ABIDE-1 cohort and testing on the independent ABIDE-2 cohort. . 89 3.6 Distribution of Ridge models’ performance for stochastic parcella- tions created using the same gray-matter mask as the corresponding atlas. Red denotes the atlas model’s accuracy and black indicates the SP-Ensemble accuracy. . . . . . . . . . . . . . . . . . . . . . . 90 3.7 Mean saliency maps of trained 3D-CNN models for SP-Ensemble . 91 3.8 Motion correlations . . . . . . . . . . . . . . . . . . . . . . . . . . 97 3.9 Kernel density estimates of the probability distributions for the performance difference between models, computed based on 10000 bootstrap samples from ABIDE-II. Values to the left of the black vertical line indicate bootstrap samples where the proposed ap- proach (3D CNN or SP-Ensemble) under-performed compared to the competing method. . . . . . . . . . . . . . . . . . . . . . . . . 100 3.10 Next frame prediction model. Each cuboid represents a 3D (2 spatial dimensions + time) feature map with number of features indicated on top. 
Flat boxes represent 2D feature maps, with number of channels on top. Input is an axial fMRI slice with T sequential frames. Conv-LSTM cell returns the last output of the output sequence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 3.11 Whisker plots showing reconstruction and prediction errors (mean squared error) for ASD patients and controls, with proposed re- current models trained on T=20 consecutive frames. Points are individual subjects. The ends of the box are upper and lower quartiles, the median is marked by a horizontal line inside the box. 111 3.12 Statistical significance of the difference in regional reconstruction error of the recurrent autoencoder between controls and ASD pa- tients. FDR with q = 0.05 was implemented for multiple testing correction. − log10 p values are shown. . . . . . . . . . . . . . . . . 112 xii 4.1 Schematic of the proposed models. (A) The short-duration (1-sec) auditory and visual models take a single image or spectrogram as input, extract multi-scale hierarchical features and feed them into a Convolutional Neural Network (CNN)-based response model to predict the whole-brain response. (B) The long-duration (20-sec) uni- modal models take a sequence of images or spectrograms as input, feed their hierarchical features into a recurrent pathway and extract the last hidden state representation for the response model. (C) The short-duration multi-modal model combines uni-modal features and passes them into the response model. (D) The long-duration multi- modal model combines auditory and visual representations from the recurrent pathways for whole-brain prediction. Architectural details, including the feature extractor and convolutional response model are provided in Supplementary Information. . . . . . . . . . 123 4.2 Regional predictive accuracy for the test movie. 
(A),(C)-(F) depict quantitative evaluation metrics for all the proposed models across major groups of regions as identified in the HCP MMP parcellation (B). Predictive accuracy of all models is summarized across (A) auditory, (C) visual, (D) multi-sensory, (E) language and (F) frontal areas. Box plots depict quartiles and swarmplots depict mean prediction accuracy of every ROI in the group. For language areas (Group 4), left and right hemisphere ROIs are shown as separate points in the swarmplot because of marked differences in prediction accuracy. Statistical significance tests are performed to compare 1-sec and 20-sec models of the same modality (3 comparisons, results indicated with horizontal bars below the box plots) or uni-modal against multi-modal models of the same duration (4 comparisons, results indicated with horizontal bars above the box plots) using the paired t-test (p-value < 0.05, Bonferroni-corrected) on mean prediction accuracy within ROIs of each group. . . . . . . . . . . . 128 4.3 Model prediction accuracy in standard brain space. Left panel depicts the predictive accuracy of uni-modal (A,B) and multi-modal (C) models over the whole brain in the test movie. Colors on the brain surface indicate the Pearson correlation coefficient between the predicted timeseries at each voxel and the true voxel’s timeseries normalized by the noise ceiling (D) computed on repeated validation clips. Only significantly predicted voxels (p-value < 0.05, FDR- corrected) are colored. ROI box plots depict the un-normalized correlation coefficients between the predicted and measured response of voxels in each ROI and the respective noise ceiling for the mean. (E) shows the percentage of voxels in stimulus-driven cortex that are significantly predicted by each model and mean prediction accuracy across the stimulus-driven cortex. . . . . . . . . . . . . . . . . . . 131 xiii 4.4 Influence of temporal history on encoding performance. 
(A) Mean predictive performance of Audio-1sec and Audio-20sec models in early auditory and association auditory cortex ROIs. A major boost in encoding performance is seen across auditory association regions with the 20-sec model. (B) Mean predictive performance of Visual- 1sec and Visual-20sec models across ROIs in the dorsal, ventral and MT+ regions. Dorsal stream and MT+ ROIs exhibit a significant improvement with Visual-20sec model but no effect is observed for the ventral stream. Box plots are overlaid on top of the beeswarm plot to depict quartiles. Horizontal bars indicate significant differ- ences between models in the mean prediction accuracy within ROIs of each stream using the paired t-test (p-value < 0.05). . . . . . . . 132 4.5 Sensitivity of ROIs to different sensory inputs. (A) Predictive accuracy (R) of audiovisual encoding model with and without input distortions, (B) Sensory sensitivity index of different brain regions as determined using performance metrics under input distortion (see Supplementary Information for details). Regions dominated by a single modality are shown in darker colors, whereas light-colored regions are better predicted by a combination of auditory and visual information. Red indicates auditory-dominant regions whereas blue indicates visual dominance. . . . . . . . . . . . . . . . . . . . . . . 134 4.6 Encoding models as virtual brain activity synthesizers. (A) Syn- thetic contrasts are generated from trained encoding models by contrasting their “synthesized” (i.e., predicted) response to different stimulus types. (B) Comparison of the synthesized contrast for ‘speech’ against the speech association template on neurosynth, both thresholded to keep the top 5%, 10% or 15% most activated vertices. (C-D) compare the synthesized contrasts for ‘faces’ and ’places’ against the corresponding contrasts derived from HCP tfMRI ex- periments, both thresholded to keep the top 5%, 10% or 15% most activated vertices. 
Vertices activated in only synthetic or predicted contrast maps are shown in red and blue colors respectively whereas yellow indicates the overlap. Corresponding dice scores are displayed alongside the surface maps. Distributions of dice overlap scores between the synthetic map and all 86 HCP tfMRI contrast maps are shown as histograms at each threshold level. Red arrow points to the dice overlap between the synthetic contrast and HCP tfMRI contrast for the same condition. In all cases, the synthetic contrast exhibits the highest agreement with the tfMRI contrast that it was generated to predict. . . . . . . . . . . . . . . . . . . . . . . . . . . 137 xiv 4.7 Proposed method. A trainable soft-attention module is imple- mented on top of a pre-trained representation network to rescale features based on their salience. The rescaled features are spatially pooled and fed into a convolutional response model to predict whole- brain neural response. We assess the value of the trained attention network by comparing it with neural encoding methods employing (i) stimulus-dependent attention maps derived from human fixations (AG), (ii) stimulus-independent attention map derived from all fix- ations in the training set that reflects the center-weighted bias of our dataset (AC) as well as a (iii) no attention model that spatially pools the features directly with no scaling. . . . . . . . . . . . . . . 156 4.8 Quantitative evaluation of all models. (A) depicts mean cor- relation values across the synchronous, (i.e., stimulus-driven) cortex defined at a range of synchrony thresholds ([0.15,0.75]). Each point thus reflects the mean prediction accuracy for a model across all voxels within synchronous cortex defined by a threshold value (x- axis). (B) depicts the inter-group correlation (synchrony) values across the entire human cerebral cortex. . . . . . . . . . . . . . . . 
163 4.9 Top: ROI-level analysis Mean correlation values across interme- diate (V4), higher visual areas in the inferotemporal cortex and its neighborhood and other higher higher-level visual regions (Dorsal, MT+) as described in the HCP MMP parcellation [264]. Error bars represent 95% confidence intervals around mean estimates com- puted using bootstrap sampling. (A)-(E) Prediction accuracy across the cortical surface for all deep CNN-based models. Statistical significance of individual voxel predictions is computed as the p-value of the obtained sample correlation coefficient for the null hypothesis of uncorrelatedness (i.e., true correlation coefficient is zero) under the assumptions of a bivariate normal distribution. Only significantly predicted voxels (p<0.05, FDR corrected) for each method are colored on the surface. Prediction accuracy maps for encoding methods with linear response models are provided in the Appendix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164 4.10 Qualitative assessment of saliency (log-density) maps. Top row shows sampled frames from the test movie, middle row shows human fixation maps overlaid on the corresponding frame, bottom row shows saliency maps predicted by the attention network of the proposed neural encoding model. Blue indicates high saliency values whereas red indicates low saliency. . . . . . . . . . . . . . . . . . . 167 xv 5.1 Proposed approach: Feature pyramid networks are used to ex- tract hierarchical features from pre-trained image/sound recognition networks. Dense features are reshaped into coarse 3D feature maps, which are mapped into increasingly fine-grained maps using con- volutions. Coarse feature transformation layers are shared across subjects while deeper convolutional layers close to predicted response are subject-specific. . . . . . . . . . . . . . . . . . . . . . . . . . . 
5.2 Quantitative evaluation: Bar charts illustrate subject-wise prediction accuracy of all models, box plots depict the distribution over subjects for % of synchronous voxels significantly predicted (p<0.05, FDR corrected). N × N correlation matrices depict the (normalized) correlation coefficient between predicted and measured responses.
5.3 (A), (B) Correlations between predicted response of the proposed model and true time series of each voxel averaged across subjects. Only significantly predicted voxels are shown (p<0.05, FDR corrected). Dice matrices of predicted versus true contrasts for (C) faces and (D) scenes stimuli. (E) & (F) depict contrasts of two randomly selected subjects. ROIs are labelled from the HCP MMP parcellation [264].
6.1 Schematic and Quantitative Results. A shows the convolutional neural network model with factorized readout and B depicts the 5 visual areas considered in this study. Quantitative assessment of different models is shown as a boxplot in C. D shows the count of voxels that are better predicted by each model along with the difference in prediction accuracies (R). E shows the raw prediction accuracy, as estimated by the Pearson’s correlation coefficient (R), across the cortical flat map for all 4 subjects.
6.2 Schematic of retinotopic parameters
6.3 Quantifying the agreement between the measured prf eccentricities and the prf eccentricities estimated from predictive computational models across different voxels. A Subject and ROI-specific scatter plots depict predicted eccentricities against measured eccentricities. Pearson’s correlation coefficient between the two quantities is displayed in blue in each scatter plot.
B Predicted and measured eccentricities across all early visual ROI voxels displayed on the cortical surface for each subject.
6.4 Quantifying the agreement between the measured prf polar angles and the prf polar angles estimated from predictive computational models across different voxels. A Subject and ROI-specific scatter plots depict predicted polar angles against measured polar angles. Pearson’s correlation coefficient between the two quantities is displayed in blue in each scatter plot. B Predicted and measured polar angles across all early visual ROI voxels displayed on the cortical surface for each subject.
6.5 Quantifying the agreement between the measured prf sizes and the prf sizes estimated from predictive computational models across different voxels. A Subject and ROI-specific scatter plots depict predicted sizes against measured sizes. Pearson’s correlation coefficient between the two quantities is displayed in blue in each scatter plot. B Predicted and measured prf sizes across all early visual ROI voxels displayed on the cortical surface for each subject.
6.6 Quantifying the generalization ability across subjects. (Left) Prediction performance and (Right) Agreement between estimated and measured retinotopic maps as a function of training examples (stimulus-response pairs) from novel subjects.
6.7 Spatial generalization matrices. Predicted response for a voxel in one ROI is correlated with the measured response of every other voxel within the same ROI (both within and across participants) to obtain a spatial generalization matrix for every ROI. Blue lines mark subject boundaries. Strong diagonal structure indicates that the predicted response for a voxel best matches the measured response of the same voxel, indicating the ability of the models to capture voxel-level idiosyncrasies.
6.8 Separability of category information across the ventral visual stream. A Matrices of all pairwise similarities between the representational geometries in different visual ROIs. B Results of the Representational Similarity Analysis (RSA) framework applied to several visual datasets (THINGS, CIFAR100 and CIFAR10) containing different categories of objects.
6.9 A low-dimensional space characterizes the functional organization of the ventral-occipital region. A Most and least activating images for the first two PCs. C Total explained variance as a function of the number of PCs. D Pearson’s correlation coefficient between the domain-selectivity of individual voxels against their projections onto the two PCs. E All images from the THINGS dataset projected onto the first two principal dimensions of the response. F and G Scatter plots depicting the domain-selectivity against the corresponding PC projection for all VO voxels.
A.1 Violin plots showing the spread of prediction accuracies/errors for stochastic parcellations at multiple network scales for different classification models. Mean accuracy/error of individual violins is denoted by ‘Mean SPs’. Performance of individual atlases is compared with SPs with the closest # of ROIs and is denoted as ‘Single Atlas’. Results are computed by 10-fold cross-validation on the entire ABIDE-1 cohort.
A.2 Saliency maps of trained CNN models for 2 randomly chosen stochastic parcellations at each scale for ASD-HC classification.
A.3 Saliency maps for atlas-based ASD-HC classification models.
A.4 ROC Curves for individual atlas based ASD-HC classification models.
B.1 Group segregation from the HCP MMP parcellation.
B.2 ROI-based encoding performance for estimating delay.
(A) depicts the estimated mean and standard error of the prediction accuracy (R) across various delays (1-7s) within the early auditory and association auditory group (blue) as well as across all ROIs (red), as obtained using the single epoch (1s) auditory model. (B) depicts the estimated mean and standard error of the prediction accuracy (R) for various delays (1-7s) within the primary and dorsal visual streams (blue) as well as across all ROIs (red), as obtained using the single frame visual model. Shaded regions depict the standard error in estimating mean across ROIs within each group. ROI categorization is described in the sub-section on ROI selection.
B.3 Implementation details for the audio (top left) and visual (top right) feature extraction networks as well as the convolutional response model (bottom). All layers and blocks outside the yellow rectangle (bottom-up pathway) are trained from scratch. The blocks inside the yellow rectangular window are initialized with networks pre-trained on image or sound recognition. Further, ResNet-50 is frozen during the training of all encoding models, whereas VGG is fine-tuned. The sequence of operations within each block is defined from top to bottom, while the number of repetitions for each sequence within the block is indicated with the multiplicative symbol on the right.
B.4 Performance of linear response models and baselines. (A) shows the region-averaged prediction accuracy of linear response models using deep convolutional features. (B) shows results of the ablation study and highlights the importance of different components of the proposed model architecture. (C) shows the region-averaged prediction accuracy of linear response models using semantically rich WordNet features and (D) shows the cortical map of the prediction accuracy (R) for the best WordNet model.
The x-axis in (A) and (C) depicts the length of the windows (in seconds) over which the stimulus features are concatenated and y-axis shows the mean Pearson correlation coefficient between the predicted and measured responses across the stimulus-driven voxels.
B.5 Perturbation analysis with Audio-20sec (A) and Visual-20sec (B) models. ROI box plots depict the un-normalized correlation coefficients between the predicted and measured response of voxels in each ROI using original or distorted 20-sec input clips at inference time.
B.6 Performance boost of the 20-sec model over 1-sec model is higher in voxels with longer autocorrelation decay times. (A) & (B) depict the performance improvement (∆R) against decay time constants for voxels associated with auditory and visual regions, respectively (Table B.2). The r value indicates the Pearson correlation coefficient between the two quantities. Each dot in the scatterplot represents an individual voxel. Bivariate kernel density estimates are overlaid on top of the scatterplot as contours to depict the probability distribution of observations.
B.7 Predicted and measured response time-series of the ‘median’ predictive accuracy (R) voxel across ROIs of different functional groups. Vertical dashed lines mark the boundary of clip segments in the held-out movie.
B.8 Model performance on held-out group of subjects. (A) Pearson correlation coefficient (R) between the model predictions and group-averaged response of an independent subject group comprising 20 subjects, on the held-out test movie, normalized by the voxel-specific noise ceiling. (B) Predictivity against the noise ceiling for all voxels with high “synchrony” across training movies (>0.5) (see Supplementary Information for details).
This gives a total of 52,954 highly “synchronous” voxels that are colored based on their association with auditory and visual groups. This hue assignment of each voxel was derived from the coloration of the corresponding ROI in the multi-modal HCP parcellation. Each dot in the scatterplot represents an individual voxel. Bivariate kernel density estimates are overlaid on top of the scatterplot as contours to depict the probability distribution of observations (prediction accuracy/noise ceiling pair at every voxel).
B.9 Quantitative evaluation metrics for all the proposed models on the independent held-out population comprising 20 novel subjects. (A),(C)-(F) depict prediction accuracy (R) for all the proposed models across major groups of regions as identified in the HCP MMP parcellation (B). Predictive accuracy of all models is summarized across (A) auditory, (C) visual, (D) multi-sensory, (E) language and (F) frontal areas. Box plots depict quartiles and swarmplots depict mean prediction accuracy of every ROI in the group. For language areas (Group 4), left and right hemisphere ROIs are shown as separate points in the swarmplot because of marked differences in the prediction accuracy. Statistical significance tests (results indicated with horizontal bars) are performed to compare 1-sec and 20-sec models of the same modality (3 comparisons) or uni-modal against multi-modal models of the same duration (4 comparisons) using paired t-test (p-value < 0.05, Bonferroni-corrected) on mean prediction accuracy within ROIs of each group.
B.10 Comparison of voxel-level prediction accuracies (R) against subject-specific noise ceiling for 5 representative subjects from the held-out set. The subjects were chosen such that their mean prediction accuracy (un-normalized) within the stimulus-driven cortex lay in the i-th percentile with i ∈ {0.01, 25, 50, 75, 99.9}.
Surface maps with white background in (A)-(E) depict raw correlation coefficients between model (Audiovisual-20sec) predictions and subject-specific response on the held-out movie whereas maps on gray background indicate the respective subject-specific noise ceiling. Only significantly correlated voxels (p<0.05, FDR corrected) are colored on the surface.
B.11 Synchrony-normalized prediction accuracy (R) of the Audiovisual-20sec model
B.12 Addressing the influence of motion on measured and predicted responses. (A) and (B) depict the distribution of the Pearson correlation coefficient of FD with the predicted responses of the Audiovisual-20sec model and measured responses across the whole brain respectively. Surface maps in (C) depict the raw correlation coefficients between FD and the measured responses. Only statistically significant voxels (p<0.05, FDR corrected) are colored on the surface.
C.1 Quantitative evaluation. Mean correlation values across the synchronous (i.e., stimulus-driven) cortex defined at a range of synchrony thresholds ([0.15,0.75]). Each point thus reflects the mean prediction accuracy for a model across all voxels within synchronous cortex defined by a threshold value (x-axis).
C.2 Representational similarity analysis (RSA). y-axis measures the agreement between ‘model’ RDMs and ‘neural’ RDMs based on their rank correlation measure. x-axis is used to index the layer (index 1 refers to the earliest layer of the architecture) and the saliency method used for attention masking of the features before pooling.
C.3 A. Center-weighted saliency map and B. Eye tracking statistics
C.4 Prediction accuracy across the cortical surface for all methods using linear response models. Statistical significance of individual voxel predictions is computed as the p-value of the obtained sample correlation coefficient for the null hypothesis of uncorrelatedness (i.e., true correlation coefficient is zero) under the assumptions of a bivariate normal distribution. Only significantly predicted voxels (p<0.05, FDR corrected) for each method are colored on the surface.
C.5 Hemodynamic response delay. 5-fold cross-validated prediction accuracy (R) of the simple (‘No attention’) model on the training dataset. Error margins are computed from the standard deviation of prediction accuracy across the 5 folds.

CHAPTER 1
INTRODUCTION

Modern neurotechnologies enable us to measure human brain activity in unprecedentedly rich ways. As we enter this new data revolution in neuroscience with large-scale compilation of neural data and dissemination through open-source initiatives, we also need concomitant methodological advances to efficiently derive conceptual understanding from this rich, high-dimensional data. This thesis is a methodological effort in this direction, accompanied by novel neuroscientific findings.

fMRI is a widely used non-invasive neuroimaging tool for measuring changes in local oxygenation of blood, also known as the BOLD signal, which is a proxy for neural activity. While the blurred spatio-temporal activity measured by fMRI may not be a direct window into neural coding like single-cell recordings in other animal models, this non-invasive imaging modality nonetheless serves as a very powerful and sensitive technique to study functional organization and feature representations in the human brain.
Two brain activity recording paradigms in humans have emerged as increasingly popular tools for studying brain function in health and in disease, namely resting-state and naturalistic stimulation. These two techniques attempt to capture brain activity ‘in the wild’, when it is unconstrained by any specific task, and thus reflect more naturalistic modes of operation of the brain. Resting-state activity refers to brain activation that arises in the absence of any task. It is measured in awake subjects when the only instruction they are given is to close their eyes and do nothing in particular. Naturalistic stimulation refers to brain activity that is recorded in more natural, complex settings to learn about information processing in the brain under ecological conditions. An example would be volunteers viewing movies or listening to stories while their brain is being imaged. The complexity and very high-dimensional nature of this data, its suite of potential applications and the lack of standard, straightforward analysis tools make machine learning very attractive for this kind of data. In this thesis, we draw upon recent advances in machine learning, fueled by the success of deep learning, to develop models that can capture the full richness of this data.

Resting-state fMRI (rs-fMRI) has enormous potential to advance our understanding of the brain’s functional organization and how it is altered by damage or disease. A major emphasis in the field is on the analysis of resting-state functional connectivity (RSFC) that measures statistical dependence in BOLD fluctuations among spatially distributed brain regions. Recently, there has been a surge of studies harnessing resting-state functional connectivity for a wide range of supervised prediction tasks.
For instance, the application of machine learning methods to rs-fMRI data has shown great promise in investigations of the developing connectome, as well as in predicting individual differences in cognition and behavior. Over the last decade, substantial effort has been devoted to using rs-fMRI for classification of a wide range of neuropsychiatric conditions, such as Alzheimer’s disease, schizophrenia and autism spectrum disorder. Predictive approaches can also be used to address research questions of interest in neuroscience. For example, to what extent is resting-state functional connectivity heritable, or how does it vary across different vigilance states? Given this broad range of applications and clinical potential, there is a need to develop better methods for making veridical predictions from such data. The first part of this thesis describes our work on developing novel machine learning approaches for deriving subject-level predictions from neuroimaging data, in particular, resting-state fMRI scans.

Computational neuroscience has also witnessed a minor revolution of its own, driven largely by advances in deep learning: we now have computational models that can explain high-level sensory representations. Comparison of different representational models is now beginning to reveal the neuronal code in parts of the cortex that had remained elusive so far, providing a new test bench for computational theories of the mind. In the past, sensory systems have been studied extensively using task-based paradigms where the brain activity is recorded upon stimulation with hand-crafted stimuli. In the domain of vision, this paradigm has been very successful, for example, in identifying orientation sensitivity of neurons in early visual cortex or in discovering scene-selective or face-selective regions in higher-order visual regions within the brain.
While successful for testing specific hypotheses, this reductionist approach is limited in the sense that no single task-based experiment can help in developing broad theories of sensory processing that generalize outside the experimental circumstance they were based on. Predictive models, on the other hand, are evaluated on out-of-sample prediction: they are built to generalize to arbitrary new stimuli and can thus offer more holistic descriptions of sensory processing. The biggest advantage is that once we have such a general model, we can use it to formulate novel hypotheses about information processing in the brain that can then be tested under more rigorous and controlled conditions. Embedded knowledge within these models of the brain could also be harnessed in other applications, such as independent neural population control by optimally synthesizing stimuli to elicit a desired neural activation pattern.

Further, predictive models can also be useful in hypothesis testing. In this case, encoding models encapsulating competing hypotheses about neural information processing can be pitted against each other and their empirical plausibility can be directly examined by comparing their predictions on held-out data against corresponding measurements. Such an approach can shed new light on how information is represented in different parts of the brain. Given the usefulness of predictive models in hypothesis formation, in non-invasive brain-machine interfaces, as well as in answering important research questions relating to feature representations in the brain, the second part of this thesis is aimed at developing predictive models that can capture information processing within the brain more stringently than existing approaches.
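The idea of pitting competing encoding models against each other on held-out data can be illustrated with a minimal sketch. It assumes a linear ridge-regression response model as a simple stand-in (it is not the deep network-based models developed in this thesis), and all data, function names (`fit_ridge`, `heldout_accuracy`) and parameter values below are hypothetical and synthetic.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression weights mapping stimulus features X to responses Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)

def heldout_accuracy(X_train, Y_train, X_test, Y_test, alpha=1.0):
    """Per-voxel Pearson correlation between predicted and measured held-out responses."""
    weights = fit_ridge(X_train, Y_train, alpha)
    pred = X_test @ weights
    pred_c = pred - pred.mean(axis=0)
    true_c = Y_test - Y_test.mean(axis=0)
    return (pred_c * true_c).sum(axis=0) / (
        np.linalg.norm(pred_c, axis=0) * np.linalg.norm(true_c, axis=0)
    )

# Synthetic example: 300 time points, 20 voxels, two candidate feature spaces.
# By construction, the responses are driven by feature space A and not by B.
rng = np.random.default_rng(0)
feats_a = rng.standard_normal((300, 8))
feats_b = rng.standard_normal((300, 8))
responses = feats_a @ rng.standard_normal((8, 20)) + 0.5 * rng.standard_normal((300, 20))

# Fit on the first 200 time points, evaluate on the remaining 100
train, test = slice(0, 200), slice(200, 300)
r_a = heldout_accuracy(feats_a[train], responses[train], feats_a[test], responses[test])
r_b = heldout_accuracy(feats_b[train], responses[train], feats_b[test], responses[test])
print(f"mean held-out R, model A: {r_a.mean():.2f}; model B: {r_b.mean():.2f}")
```

In this toy setting the feature space that actually generated the responses predicts held-out data far better, which is the sense in which the hypothesis it encapsulates would be judged more plausible.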
We pursue this goal by developing deep neural network-based encoding models that capture three critical inductive biases about information processing in the brain, namely, hierarchical processing, assimilation over longer timescales and multi-sensory auditory-visual interactions. By developing and evaluating these models on a large-scale movie-watching dataset, we demonstrate how incorporating this joint information leads to remarkable prediction performance across large areas of the cortex, well beyond primary sensory regions into higher-order regions that had not been characterized by predictive models previously. Taken together, our findings underscore how encoding models can shed new light on the functional architecture of the human brain and provide a basis for novel hypotheses about sensory processing.

This thesis is therefore largely a methodological effort that is accompanied by novel results. A central theme of our research is to use machine learning or predictive modelling techniques to convert neural data into understanding and fundamental knowledge about the brain. This research spans multiple disciplines:
1. Medical Image Analysis: We propose a novel volumetric Convolutional Neural Network (CNN) framework that takes advantage of the full-resolution 3D spatial structure of rs-fMRI data and fits non-linear predictive models. We showcase our approach on a challenging large-scale dataset (ABIDE, with N > 2,000) and report state-of-the-art accuracy results on rs-fMRI based discrimination of autism patients and healthy controls.
2. Computational Neuroscience: We use naturalistic stimuli and employ state-of-the-art deep learning models on a large-scale dataset to show that the shared response across subjects is largely predictable across the cortex.
Furthermore, we demonstrate how these models allow us to interrogate the temporal and sensory sensitivity of different brain regions in an entirely data-driven manner from ecologically valid fMRI paradigms, underscoring the potential of neural encoding models as a powerful tool for studying brain function in ecologically valid conditions.
3. Computer Vision: We employ a data-driven strategy to further improve DNNs as models of the human visual cortex, and demonstrate how findings about the brain can potentially improve computer vision models, instantiating a synergy between computational neuroscience and AI.

Outline

This thesis is organized as follows. In Chapter 2, we present the relevant literature on machine learning in resting-state¹ and naturalistic fMRI analysis. In Chapter 3, we describe our novel methodological approaches for deriving individual-level clinical or behavioral measures from rs-fMRI that take advantage of the full-resolution 3D spatial structure of rs-fMRI data²,³,⁴. Chapter 4 describes our work on predictive models of cortical responses to naturalistic stimuli that capture three critical influences on neural computations, namely that of multi-sensory interactions, stimulus history and visual attention⁵,⁶. Chapter 5 presents our methodological contribution towards using multi-subject fMRI data for improving subject-specific response predictions⁷. Chapter 6 describes our ongoing investigations into data-driven models of the human visual cortex, where we go beyond response prediction and provide a systematic framework for inferring neuronal tuning properties from these computational models. Finally, we conclude this thesis by summarizing our core contributions and giving an outlook to future experiments.

¹ Khosla, M., Jamison, K., Ngo, G. H., Kuceyeski, A., & Sabuncu, M. R. (2019). Machine learning in resting-state fMRI analysis. Magnetic Resonance Imaging, 64, 101-121.
² Khosla, M., Jamison, K., Kuceyeski, A., & Sabuncu, M. R. (2018).
3D convolutional neural networks for classification of functional connectomes. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support (pp. 137-145). Springer, Cham.
³ Khosla, M., Jamison, K., Kuceyeski, A., & Sabuncu, M. R. (2019). Ensemble learning with 3D convolutional neural networks for functional connectome-based prediction. NeuroImage, 199, 651-662.
⁴ Khosla, M., Jamison, K., Kuceyeski, A., & Sabuncu, M. R. (2019, October). Detecting abnormalities in resting-state dynamics: An unsupervised learning approach. In International Workshop on Machine Learning in Medical Imaging (pp. 301-309). Springer, Cham.
⁵ Khosla, M., Ngo, G. H., Jamison, K., Kuceyeski, A., & Sabuncu, M. R. (2021). Cortical response to naturalistic stimuli is largely predictable with deep neural networks. Science Advances, 7(22), eabe7547.
⁶ Khosla, M., Ngo, G., Jamison, K., Kuceyeski, A., & Sabuncu, M. (2020). Neural encoding with visual attention. Advances in Neural Information Processing Systems, 33.
⁷ Khosla, M., Ngo, G. H., Jamison, K., Kuceyeski, A., & Sabuncu, M. R. (2020, October). A shared neural encoding model for the prediction of subject-specific fMRI response. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 539-548). Springer, Cham.

CHAPTER 2
RELATED WORK

2.1 Machine learning in resting-state fMRI analysis

Machine learning techniques have gained prominence for the analysis of resting-state functional Magnetic Resonance Imaging (rs-fMRI) data. Here, we present an overview of various unsupervised and supervised machine learning applications to rs-fMRI. We offer a methodical taxonomy of machine learning methods in resting-state fMRI. We identify three major divisions of unsupervised learning methods with regard to their applications to rs-fMRI, based on whether they discover principal modes of variation across space, time or population.
Next, we survey the algorithms and rs-fMRI feature representations that have driven the success of supervised subject-level predictions. The goal is to provide a high-level overview of the burgeoning field of rs-fMRI from the perspective of machine learning applications.

2.1.1 Introduction

Resting-state fMRI (rs-fMRI) is a widely used neuroimaging tool that measures spontaneous fluctuations in the blood oxygen-level dependent (BOLD) signal across the whole brain, in the absence of any controlled experimental paradigm. In their seminal work, Biswal et al. [1] demonstrated temporal coherence of low-frequency spontaneous fluctuations between long-range functionally related regions of the primary sensorimotor cortices even in the absence of an explicit task, suggesting a neurological significance of resting-state activity. Several subsequent studies similarly reported other collections of regions co-activated by a task (such as language, motor, attention, audio or visual processing) that show correlated fluctuations at rest [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]. These spontaneously co-fluctuating regions came to be known as the resting state networks (RSNs) or intrinsic brain networks. The term RSN henceforth denotes brain networks subserving shared functionality as discovered using rs-fMRI.

Rs-fMRI has enormous potential to advance our understanding of the brain’s functional organization and how it is altered by damage or disease. A major emphasis in the field is on the analysis of resting-state functional connectivity (RSFC) that measures statistical dependence in BOLD fluctuations among spatially distributed brain regions. Disruptions in RSFC have been identified in several neurological and psychiatric disorders, such as Alzheimer’s [12, 13, 14], autism [15, 16, 17], depression [18, 19, 20], schizophrenia [21, 22], etc.
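The notion of RSFC as statistical dependence between regional BOLD fluctuations can be made concrete with a minimal sketch. It assumes Pearson correlation as the dependence measure (one common choice; the text above does not commit to a specific measure) and uses synthetic ROI-averaged time series; the function names are hypothetical.

```python
import numpy as np

def functional_connectivity(roi_timeseries):
    """Pearson correlation between every pair of ROI time series.

    roi_timeseries: array of shape (n_timepoints, n_rois).
    Returns a symmetric (n_rois, n_rois) matrix with ones on the diagonal.
    """
    return np.corrcoef(roi_timeseries, rowvar=False)

def connectivity_features(fc):
    """Vectorize the strictly upper triangle of a connectivity matrix.

    The matrix is symmetric, so the upper triangle holds every unique ROI pair once.
    """
    rows, cols = np.triu_indices_from(fc, k=1)
    return fc[rows, cols]

# Synthetic example: 150 time points for 6 ROIs
rng = np.random.default_rng(0)
timeseries = rng.standard_normal((150, 6))

fc = functional_connectivity(timeseries)
features = connectivity_features(fc)
print(fc.shape, features.shape)
```

A feature vector of this kind, one per subject, is a typical input to the supervised subject-level prediction models discussed in this chapter.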
Dynamics of RSFC have also garnered considerable attention in the last few years, and a crucial challenge in rs-fMRI is the development of appropriate tools to capture the full extent of this resting-state activity. Rs-fMRI captures a rich repertoire of intrinsic mental states or spontaneous thoughts and, given the necessary tools, has the potential to generate novel neuroscientific insights about the nature of brain disorders [23, 24, 25, 26, 27, 28].

The study of rs-fMRI data is highly interdisciplinary, heavily influenced by fields such as machine learning, signal processing and graph theory. Machine learning methods provide a rich characterization of rs-fMRI, often in a data-driven manner. Unsupervised learning methods in rs-fMRI are focused primarily on understanding the functional organization of the healthy brain and its dynamics. For instance, methods such as matrix decomposition or clustering can simultaneously expose multiple functional networks within the brain and also reveal the latent structure of dynamic functional connectivity.

Supervised learning techniques, on the other hand, can harness RSFC to make individual-level predictions. Substantial effort has been devoted to using rs-fMRI for classification of patients versus controls, or to predict disease prognosis and guide treatments. Another class of studies explores the extent to which individual differences in cognitive traits may be predicted by differences in RSFC, yielding promising results. Predictive approaches can also be used to address research questions of interest in neuroscience. For example, is RSFC heritable? Such questions can be formulated within a prediction framework to test novel hypotheses.

From mapping functional networks to making individual-level predictions, the applications of machine learning in rs-fMRI are far-reaching.
The goal of this review is to present in a concise manner the role machine learning has played in generating pioneering insights from rs-fMRI data, and describe the evolution of machine learning applications in rs-fMRI. We will present a review of the key ideas and application areas for machine learning in rs-fMRI rather than delving into the precise technical nuances of the machine learning algorithms themselves. In light of the recent developments and burgeoning potential of the field, we discuss current challenges and promising directions for future work.

Resting-state fMRI: A Historical Perspective

Until the 2000s, task-fMRI was the predominant neuroimaging tool to explore the functions of different brain regions and how they coordinate to create diverse mental representations of cognitive functions. The discovery of correlated spontaneous fluctuations within known cortical networks by Biswal et al. [1] and a plethora of follow-up studies have established rs-fMRI as a useful tool to explore the brain’s functional architecture. Studies adopting the resting-state paradigm have grown at an unprecedented scale over the last decade. These are much simpler protocols than alternative task-based experiments, capable of providing critical insights into functional connectivity of the healthy brain as well as its disruptions in disease. Resting-state is also attractive as it allows multi-site collaborations, unlike task-fMRI that is prone to confounds induced by local experimental settings. This has enabled network analysis at an unparalleled scale.

Traditionally, rs-fMRI studies have focused on identifying spatially-distinct yet functionally associated brain regions through seed-based analysis (SBA). In this approach, seed voxels or regions of interest are selected a priori and the time series from each seed is correlated with the time series from all brain voxels to generate a series of correlation maps.
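The seed-based procedure described above can be sketched in a few lines. This is a minimal illustration on synthetic data, assuming the BOLD data have already been preprocessed and flattened to a time-by-voxel array; the function name `seed_correlation_map` and all array sizes are hypothetical.

```python
import numpy as np

def seed_correlation_map(data, seed_index):
    """Correlate one seed time series with the time series of every voxel.

    data: array of shape (n_timepoints, n_voxels).
    seed_index: column index of the seed voxel (or of an ROI-averaged series).
    Returns an array of shape (n_voxels,) holding Pearson correlations.
    """
    # After z-scoring over time, Pearson correlation is the mean of products
    z = (data - data.mean(axis=0)) / data.std(axis=0)
    return z.T @ z[:, seed_index] / data.shape[0]

# Synthetic example: 200 time points, 50 voxels; voxels 0-9 share a common signal
rng = np.random.default_rng(0)
shared = rng.standard_normal(200)
data = rng.standard_normal((200, 50))
data[:, :10] += 2.0 * shared[:, None]

corr_map = seed_correlation_map(data, seed_index=0)
print(corr_map[:10].round(2))   # voxels sharing the seed's signal
print(corr_map[10:15].round(2)) # unrelated voxels
```

Each choice of seed yields one such map, so several seeds would be needed to cover several functional systems.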
SBA, while simple and easily interpretable, is limited since it is heavily dictated by manual seed selection and, in its simplest form, can only reveal one specific functional system at a time. Decomposition methods like Independent Component Analysis (ICA) emerged as a highly promising alternative to seed-based correlation analysis in the early 2000s [2, 29, 30]. This was followed by other unsupervised learning techniques such as clustering. In contrast to seed-based methods that explore networks associated with a seed voxel (such as motor or visual functional connectivity maps), this new class of model-free methods based on decomposition or clustering explored RSNs simultaneously across the whole brain for individual or group-level analysis. Regardless of the analysis tool, all studies largely converged in reporting multiple robust resting-state networks across the brain, such as the primary sensorimotor network, the primary visual network, fronto-parietal attention networks and the well-studied default mode network. Regions in the default mode network, such as the posterior cingulate cortex, precuneus, ventral and dorsal medial prefrontal cortex, show increased levels of activity during resting-state, suggesting that this network represents the baseline or default functioning of the human brain. The default mode network (DMN) has sparked a lot of interest in the rs-fMRI community [31], and several studies have consequently explored disruptions in DMN resting-state connectivity in various neurological and psychiatric disorders, including autism, schizophrenia and Alzheimer’s disease [32, 33, 34].

Despite the widespread success and popularity of rs-fMRI, the causal origins of ongoing spontaneous fluctuations in the resting brain remain largely unknown. Several studies explored whether resting-state coherent fluctuations have a neuronal origin, or are just manifestations of aliasing or physiological artifacts introduced by the cardiac or respiratory cycle.
Over time, evidence in support of a neuronal basis of BOLD-based resting state functional connectivity has accumulated from multiple complementary sources. This includes (a) observed reproducibility of RSFC patterns across independent subject cohorts [4, 5], (b) its persistence in the absence of aliasing and distinct separability from noise components [5, 35], (c) its similarity to known functional networks [1, 2, 11], (d) its consistency with anatomy [36, 37], (e) its correlation with cortical activity studied using other modalities [38, 39, 40] and, finally, (f) its systematic alterations in disease [23, 24, 25].

Figure 2.1: Traditional seed based analysis approach

Application of Machine Learning in rs-fMRI

A vast majority of literature on machine learning for rs-fMRI is devoted to unsupervised learning approaches. Unlike task-driven studies, modelling resting-state activity is not straightforward since there are no controlled stimuli driving these fluctuations. Hence, analysis methods used for characterizing the spatio-temporal patterns observed in task-based fMRI are generally not suited for rs-fMRI. Given the high dimensional nature of fMRI data, it is unsurprising that early analytic approaches focused on decomposition or clustering techniques to gain a better characterization of data in spatial and temporal domains. Unsupervised learning approaches like ICA catalyzed the discovery of the so-called resting-state networks or RSNs. Subsequently, the field of resting-state brain mapping expanded with the primary goal of creating brain parcellations, i.e., optimal groupings of voxels (or vertices in the case of surface representation) that describe functionally coherent spatial compartments within the brain. These parcellations aid in the understanding of human functional organization by providing a reference map of areas for exploring the brain’s connectivity and function.
Additionally, they serve as a popular data reduction technique for statistical analysis or supervised machine learning.

More recently, departing from the stationary representation of brain networks, studies have shown that RSFC exhibits meaningful variations during the course of a typical rs-fMRI scan [41, 42]. Since brain activity during resting-state is largely uncontrolled, this makes network dynamics even more interesting. Using unsupervised pattern discovery methods, resting-state patterns have been shown to transition between discrete recurring functional connectivity "states", representing diverse mental processes [42, 43, 44]. In the simplest and most common scenario, dynamic functional connectivity is expressed using sliding-window correlations. In this approach, functional connectivity is estimated in a temporal window of fixed length, which is subsequently shifted by different time steps to yield a sequence of correlation matrices. Recurring correlation patterns can then be identified from this sequence through decomposition or clustering. This dynamic nature of functional connectivity opens new avenues for understanding the flexibility of different connections within the brain as they relate to behavioral dynamics, with potential clinical utility [45].

Another, perhaps clinically more promising application of machine learning in rs-fMRI expanded in the late 2000s. This new class of applications leveraged supervised machine learning for individual level predictions. The covariance structure of resting-state activity, more popularly known as the "connectome", has garnered significant interest in the field of neuroscience as a sensitive biomarker of disease. Studies have further shown that an individual’s connectome is unique and reliable, akin to a fingerprint [46]. Machine learning can exploit these neuroimaging based biomarkers to build diagnostic or prognostic tools.
Visualization and interpretation of these models can complement statistical analysis to provide novel insights into the dysfunction of resting-state patterns in brain disorders. Given the prominence of deep learning in today’s era, several novel neural-network based approaches have also emerged for the analysis of rs-fMRI data. A majority of these approaches target connectomic feature extraction for single-subject level predictions.

In order to organize the work in this rapidly growing field, we sub-divide the machine learning approaches into different classes by methods and application focus. We first differentiate among unsupervised learning approaches based on whether their main focus is to discover (a) the underlying spatial organization that is reflected in coherent fluctuations, (b) the structure in temporal dynamics of resting-state connectivity, or (c) population-level structure for inter-subject comparisons. Next, we move on to discuss supervised learning. We organize this section by discussing the relevant rs-fMRI features employed in these models, followed by commonly used training algorithms, and finally the various application areas where rs-fMRI has shown promise in performing predictions.

2.1.2 Unsupervised learning methods

The primary objective of unsupervised learning is to discover latent representations and disentangle the explanatory factors for variation in rich, unlabelled data. These learning methods do not receive any kind of supervision in the form of target outputs (or labels) to guide the learning process. Instead, they focus on learning structure in the data in order to extract relevant signal from noise. Below, we review some important unsupervised learning methods that have advanced rs-fMRI analysis.

Clustering

Given data points $\{X_1, \ldots, X_n\}$, the goal of clustering is to partition the data into $K$ disjoint groups $\{C_1, \ldots, C_K\}$.
Different clustering algorithms differ in terms of their clustering objective, which is to maximize some notion of within-cluster similarity and/or between-cluster dissimilarity.

Figure 2.2: Applications of machine learning methods in resting-state fMRI

Figure 2.3: A taxonomy of unsupervised learning methods used for rs-fMRI analysis

K-means

K-means clustering is by far the most popular learning algorithm for partitioning data. The algorithm aims at minimizing the within-cluster variance. Formally, this corresponds to the following clustering objective,

$$\min \sum_{j=1}^{K} \sum_{i \in C_j} \left\| X_i - \frac{1}{n_j} \sum_{t \in C_j} X_t \right\|^2$$

where $n_j$ denotes the cardinality of set $C_j$. This optimization problem is solved using an iterative algorithm, known as Lloyd’s algorithm. The algorithm begins with initial estimates of cluster centroids and iteratively refines them by (a) assigning each datum to its closest cluster, and (b) updating cluster centroids based on these new assignments.

Gaussian mixture models

Mixture models are often used to represent probability densities of complex multimodal data with hidden components. These models are constructed as mixtures of arbitrary unimodal distributions, each representing a distinct cluster. In the case of Gaussian mixture models (GMMs), each $X_i$ is assumed to be generated by a two-step process: (a) first, a latent component $z_i \in \{1, \ldots, K\}$ is sampled, $z_i \sim \mathrm{Multinomial}(\phi)$ where $\phi_k = P(z_i = k)$; then (b) a random sample is drawn from one of $K$ multivariate Gaussians conditional on $z_i$, i.e., $X_i \mid z_i = k \sim \mathcal{N}(\mu_k, \Sigma_k)$, where $\mu_k$ and $\Sigma_k$ denote the mean and covariance of the $k$-th Gaussian respectively. Each Gaussian distribution thus denotes a unique cluster. The model parameters $\{\phi, \mu, \Sigma\}$ are obtained by maximizing the complete data likelihood,

$$\{\phi_{opt}, \mu_{opt}, \Sigma_{opt}\} = \arg\max_{\phi, \mu, \Sigma} \sum_{i=1}^{n} \log P(X_i \mid \phi, \mu, \Sigma) = \arg\max_{\phi, \mu, \Sigma} \sum_{i=1}^{n} \log P(X_i \mid z_i, \mu, \Sigma) P(z_i \mid \phi)$$

Maximum likelihood estimates of GMMs are usually obtained using the Expectation-Maximization (EM) algorithm.
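The two alternating steps of Lloyd’s algorithm for K-means described above can be sketched in a few lines. This is an illustrative sketch only: the function names, the choice of initial centroids (the first and last data points) and the two-blob toy data are assumptions made for the example, and practical implementations add better initialisation and multiple restarts.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def lloyd_kmeans(data, centroids, n_iter=100):
    """Alternate assignment and centroid updates until assignments stop changing."""
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(n_iter):
        # (a) assign each datum to its closest centroid
        new_labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(x, centroids[k]))
            for x in data
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # (b) update each centroid as the mean of its assigned points
        for k in range(len(centroids)):
            members = [x for x, lab in zip(data, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = mean_point(members)
    return labels, centroids

# Toy data (assumed): two well-separated 2D blobs of 20 points each.
rng = random.Random(0)
data = ([[rng.gauss(0, 0.3), rng.gauss(0, 0.3)] for _ in range(20)]
        + [[rng.gauss(5, 0.3), rng.gauss(5, 0.3)] for _ in range(20)])
labels, centroids = lloyd_kmeans(data, centroids=[data[0], data[-1]])
```

Because each step can only decrease the within-cluster variance, the iteration converges, though only to a local optimum that depends on the initial centroids.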
Hierarchical clustering

Hierarchical clustering methods group the data into a set of nested partitions. This multi-resolution structure is often represented with a cluster tree, or dendrogram. Hierarchical clustering is divided into agglomerative or divisive methods, based on whether the clusters are identified in a bottom-up or top-down fashion respectively. Hierarchical agglomerative clustering (HAC), the more dominant approach, initially treats each data point as a singleton cluster and then successively merges clusters according to a pre-specified distance metric until a single cluster containing all observations is formed. Many distance metrics, referred to as linkage criteria, have been proposed in the literature that optimize different goals of hierarchical clustering. These include: (a) single linkage, where the distance between clusters $C_1$ and $C_2$ is defined as the distance between their closest points, i.e., $d(C_1, C_2) = \min_{x_i \in C_1, x_j \in C_2} d(x_i, x_j)$; (b) complete linkage, where this distance is measured between the farthest points, $d(C_1, C_2) = \max_{x_i \in C_1, x_j \in C_2} d(x_i, x_j)$; and (c) average linkage, which measures the average distance between members, $d(C_1, C_2) = \frac{1}{|C_1||C_2|} \sum_{x_i \in C_1} \sum_{x_j \in C_2} d(x_i, x_j)$, among others. Here, $d$ represents dissimilarity between observations. Alternate methods for merging have also been proposed, the most popular being Ward’s criterion. Ward’s method measures how much the within-cluster variance will increase when merging two partitions and minimizes this merging cost. A major drawback is computational complexity, which renders HAC methods impractical in applications with large observational data.

Graph-based clustering

Graph based clustering forms another class of similarity-based partitioning methods for data that can be represented using a graph. Given a weighted undirected graph $G = \{V, E\}$ with vertex set $V$ and edge set $E$, most graph-partitioning methods optimize a dissociation measure, such as the normalized cut (Ncut).
The edge weights $w(i, j)$ represent a function of similarity between vertices $i$ and $j$. Ncut computes the total edge weights connecting two partitions and normalizes this by their weighted connections to all nodes within the graph. A two-way normalized-cut criterion divides $G$ into disjoint partitions $A$ and $B$ ($A \cup B = V$, $A \cap B = \emptyset$) by simultaneously minimizing between-cluster similarity while maximizing within-cluster similarity. This objective criterion is expressed as,

$$\mathrm{Ncut}(A, B) = \frac{\sum_{i \in A, j \in B} w(i, j)}{\sum_{i \in A, j \in V} w(i, j)} + \frac{\sum_{i \in A, j \in B} w(i, j)}{\sum_{i \in V, j \in B} w(i, j)}$$

However, minimizing this objective directly is an NP-hard problem. Spectral clustering algorithms typically solve a relaxation of this problem. This approach can be further extended to obtain a $K$-way partitioning of the graph. Graph-based clustering approaches are often more resilient to outliers, compared to k-means or hierarchical clustering.

Latent variable models

Decomposition

Decomposition or factorization based approaches assume that the observed data can be decomposed as a product of simpler matrices, often imposing a specific structure and/or sparsity on these individual matrices. Formally, given data points $X = [x_1, \ldots, x_n]$ with $x_i \in \mathbb{R}^D$, linear decomposition techniques seek a basis set $W = [w_1, \ldots, w_K]$ such that the linear space spanned by $W$ closely reconstructs $X$,

$$x_i = \sum_{k=1}^{K} w_k z_i(k)$$

Here, each data point $x_i$ is characterized by unique coefficients $z_i \in \mathbb{R}^K$ for the basis set $W$. Typically, $K < D$ so that decomposition amounts to a dimensionality reduction. In matrix notation, the goal is to find $W$ and $Z$ such that $X \approx WZ$, where $Z = [z_1, \ldots, z_n]$. This ill-posed problem is generally solved by constraining the structure of $W$ and/or $Z$.

Principal component analysis (PCA)

PCA is a linear projection based technique widely used for dimensionality reduction. The goal of PCA is to find an orthonormal basis $W$ that maximizes the variance captured by the projected data $Z = W^T X$.
This is equivalent to minimizing the reconstruction error of the data points based on the low-dimensional representation $Z$. Mathematically, this amounts to solving the following optimization problem,

$$W_{opt} = \arg\min_{W} \left\| X - W W^T X \right\|_F^2 \quad \text{subject to } W \in \mathcal{O}^{D \times K}$$

where $\|\cdot\|_F$ denotes the Frobenius norm and $\mathcal{O}^{D \times K}$ denotes the set of $D \times K$ dimensional orthonormal matrices.

Independent component analysis (ICA)

Independent Component Analysis (ICA) is a popular method for decomposing data as a linear combination of statistically independent components. In the ICA terminology, $W$ is often known as the mixing matrix whereas $Z$ comprises the source signals. In the above formalism, ICA assumes that the sources, i.e., the rows of $Z$, are statistically independent. The source signals are recovered using a "whitening" or "unmixing" matrix $U$, where $U = W^{-1}$. Since $X = WZ$, we obtain $Z = UX$. Popular algorithms thus recover the sources by estimating $U$ such that the components of $UX$ are statistically independent. Common ICA algorithms emulate independence by either minimizing the mutual information between sources (InfoMax) or by maximizing their non-Gaussianity (FastICA). ICA usually employs a full-rank matrix factorization and is often preceded by PCA for dimensionality reduction.

Sparse dictionary learning

Sparse dictionary learning is formulated as a linear decomposition problem, similar to ICA/PCA, but with sparsity constraints on the components $Z$. This results in a non-convex optimization problem of the following form:

$$\{W_{opt}, Z_{opt}\} = \arg\min_{W, Z} \left\| X - WZ \right\|_F^2 + C \left\| Z \right\|_0$$

In most practical applications, this optimization problem is relaxed by replacing the L0-norm with the L1-norm.

Non-negative matrix factorization (NMF)

NMF is another dimensionality reduction technique that seeks a low-rank decomposition of the data matrix $X$ with non-negativity constraints on the components $W$ and $Z$.
Typically, this corresponds to solving the following optimization,

$$\{W_{opt}, Z_{opt}\} = \arg\min_{W, Z} \left\| X - WZ \right\|_F^2 \quad \text{subject to } W \geq 0, \; Z \geq 0$$

Hidden Markov Models

Hidden Markov Models (HMMs) are a class of unsupervised learning methods for sequential data. They are used to model a Markov process where the sequence of observations $\{x_1, \ldots, x_T\}$ is assumed to be generated from a sequence of underlying hidden states $\{s_1, \ldots, s_T\}$, which can be discrete. In an HMM with $K$ states, it is assumed that $s_t$ can take discrete values in $\{1, \ldots, K\}$. The parameters of the HMM are learned by maximizing the complete data likelihood,

$$\theta_{opt} = \arg\max_{\theta} P(x_1, \ldots, x_T, s_1, \ldots, s_T \mid \theta) = \arg\max_{\theta} \prod_{t=1}^{T} P(s_t \mid s_{t-1}, \theta) P(x_t \mid s_t, \theta)$$

Here, $P(s_1 \mid s_0)$ denotes the initial state distribution $\pi$. The state transition probabilities are defined by a transition matrix $T$ with elements $T_{i,j} = P(s_t = j \mid s_{t-1} = i)$. The conditionals $P(x_t \mid s_t = k, \theta)$ are captured by an emission probability table $E[k, x_t]$. The parameters $\theta$ of this probabilistic model are thus $\{\pi, T, E\}$. This maximum likelihood estimation problem is efficiently solved using a special case of the Expectation-Maximization algorithm, known as the Baum-Welch algorithm.

Non-linear embeddings

Locally linear embeddings

Locally linear embedding (LLE) projects data to a reduced dimensional space while preserving local distances between data points and their neighborhood. The LLE algorithm proceeds in two steps. First, each input $X_i$, $i \in \{1, \ldots, n\}$, is approximated as a linear combination of its $K$ closest neighbors. The linear subspace $W$ is obtained by minimizing the reconstruction error, i.e.,

$$W_{opt} = \arg\min_{W} \sum_{i} \Big| X_i - \sum_{j} W_{ij} X_j \Big|^2 \quad \text{subject to } \sum_{j} W_{ij} = 1$$

Here, $W_{ij} = 0$ if $X_j$ is not one of the $K$-nearest neighbors of $X_i$. In the second step, the low-dimensional embeddings $Y_i$ are obtained by minimizing the embedding cost function,

$$Y_{opt} = \arg\min_{Y} \sum_{i} \Big| Y_i - \sum_{j} W_{ij} Y_j \Big|^2$$

In the latter optimization, $W$ is kept fixed at $W_{opt}$, while the $Y_i$’s are optimized.
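Returning briefly to the HMM factorization given earlier in this section, the complete-data likelihood for a discrete-emission HMM with parameters {π, T, E} can be evaluated directly from its definition. The sketch below is only an illustration under assumed toy parameters (two states, two observation symbols, zero-based indices); it evaluates the likelihood of one known state sequence and does not implement Baum-Welch.

```python
import math

def complete_data_log_likelihood(obs, states, pi, T, E):
    """log P(x_1..x_T, s_1..s_T | theta) for a discrete HMM, theta = {pi, T, E}."""
    # Initial term: P(s_1 | s_0) is the initial distribution pi, times the first emission.
    log_p = math.log(pi[states[0]]) + math.log(E[states[0]][obs[0]])
    for t in range(1, len(obs)):
        log_p += math.log(T[states[t - 1]][states[t]])  # transition P(s_t | s_{t-1})
        log_p += math.log(E[states[t]][obs[t]])         # emission P(x_t | s_t)
    return log_p

# Toy parameters (assumed for illustration only).
pi = [0.5, 0.5]                     # initial state distribution
T = [[0.9, 0.1], [0.2, 0.8]]        # T[i][j] = P(s_t = j | s_{t-1} = i)
E = [[0.7, 0.3], [0.1, 0.9]]        # E[k][x] = P(x_t = x | s_t = k)
obs = [0, 0, 1]
states = [0, 0, 1]
log_lik = complete_data_log_likelihood(obs, states, pi, T, E)
```

For this toy sequence the product is 0.5 · 0.7 · 0.9 · 0.7 · 0.1 · 0.9 = 0.019845. In practice the hidden states are not observed, which is why an EM procedure is needed to fit the parameters.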
Autoencoders

The autoencoder is an unsupervised neural-network based approach for learning latent representations of high-dimensional data. It encodes the input $X$ into a lower dimensional representation $Z = f_\theta(X)$, known as the bottleneck, which is then decoded to reconstruct the input $\hat{X} = g_\phi(Z)$. Both the encoder $f_\theta$ and decoder $g_\phi$ are neural networks. The autoencoder is trained to minimize the reconstruction error on a set of examples, often measured with an L2 loss, i.e., $\| X - \hat{X} \|^2$. The autoencoder can thus be seen as a non-linear extension of PCA since $f_\theta$ and $g_\phi$ are in general non-linear functions.

2.1.3 Applications of unsupervised learning in rs-fMRI

Unsupervised machine learning methods have proven promising for the analysis of high-dimensional data with complex structures, making them ever more relevant to rs-fMRI. Many unsupervised learning approaches in rs-fMRI aim to parcellate the brain into discrete functional sub-units, akin to atlases. These segmentations are driven by functional data, unlike those approaches that use cytoarchitecture as in the Brodmann atlas, or macroscopic anatomical features, as in the Automated Anatomical Labelling (AAL) atlas [47]. A second class of applications delves into the exploration of brain network dynamics. Unsupervised learning has recently been applied to interrogate the dynamic functional connectome with promising results [42, 43, 44, 48, 49]. Finally, the third application of unsupervised learning focuses on learning latent low-dimensional representations of RSFC to conduct analyses across a population of subjects. We discuss the methods under each of these challenging application areas below.

Discovering spatial patterns with coherent fluctuations

Mapping the boundaries of functionally distinct neuroanatomical structures, or identifying clusters of functionally coupled regions in the brain is a major objective in neuroscience.
Rs-fMRI and machine learning methods provide a promising combination with which to achieve this lofty goal. In the case of rs-fMRI, the typical approach is to decompose the 4D fMRI data into a linear superposition of distinct spatial modes that show coherent temporal dynamics using techniques like ICA.

Clustering is an alternative unsupervised learning approach for analysis of rs-fMRI data. Unlike ICA or dictionary learning, clustering is used to partition the brain surface (or volume) into disjoint functional networks. It is important to draw a distinction at this stage between two slightly different applications of clustering since they sometimes warrant different constraints; one direction is focused on identifying functional networks which are often spatially distributed, whereas the other is used to parcellate brain regions. The latter application aims to construct atlases that reflect local areas that constitute the functional neuroanatomy, much like how standard atlases such as the Automated Anatomical Labelling (AAL) [47] delineate macroscopic anatomical regions.

One important design decision in the application of clustering is the distance function used to measure dissimilarity between different voxels (or vertices). In the case of rs-fMRI, this distance function is computed either on raw time-series at voxels or between their connectivity profiles. While these two distances are motivated by the same idea of functional coherence, certain differences have been found in parcellations optimized using either criterion [50].

An important requirement for almost all of these methods is the a priori selection of the number of clusters/components. These are often determined through cross-validation or through statistics that reflect the quality, stability or reproducibility of decomposition/partitions at different scales.
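To make the two choices of distance function mentioned above concrete, the sketch below computes a dissimilarity between two voxels either directly from their raw time series or from their connectivity profiles. The names, the use of one minus the Pearson correlation as the dissimilarity, and the definition of a connectivity profile as correlations with a small set of target regions are all assumptions chosen for illustration; individual studies define these quantities differently.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def time_series_distance(voxel_ts, i, j):
    """Dissimilarity computed directly on the raw time series of voxels i and j."""
    return 1.0 - pearson(voxel_ts[i], voxel_ts[j])

def connectivity_profile(voxel_ts, target_ts, i):
    """Correlation of voxel i with every target region (its connectivity profile)."""
    return [pearson(voxel_ts[i], t) for t in target_ts]

def profile_distance(voxel_ts, target_ts, i, j):
    """Dissimilarity computed between the connectivity profiles of voxels i and j."""
    return 1.0 - pearson(connectivity_profile(voxel_ts, target_ts, i),
                         connectivity_profile(voxel_ts, target_ts, j))

def noisy_copy(ts, scale, rng):
    """Return ts plus independent Gaussian noise of the given scale."""
    return [v + scale * rng.gauss(0, 1) for v in ts]

# Toy data (assumed): 5 target regions; voxels 0 and 1 follow target 0, voxel 2 follows target 3.
rng = random.Random(1)
target_ts = [[rng.gauss(0, 1) for _ in range(80)] for _ in range(5)]
voxel_ts = [noisy_copy(target_ts[0], 0.5, rng),
            noisy_copy(target_ts[0], 0.5, rng),
            noisy_copy(target_ts[3], 0.5, rng)]

d_ts_01 = time_series_distance(voxel_ts, 0, 1)
d_fc_01 = profile_distance(voxel_ts, target_ts, 0, 1)
d_fc_02 = profile_distance(voxel_ts, target_ts, 0, 2)
```

Either dissimilarity can be passed to a clustering algorithm; the profile-based variant compares how voxels relate to the rest of the brain rather than how their own signals co-vary.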
ICA

ICA has been one of the earliest and most widely used analytic tools for rs-fMRI, driving several pivotal neuroscientific insights into intrinsic brain networks. When applied to rs-fMRI, brain activity is expressed as a linear superposition of distinct spatial patterns or maps, with each map following its own characteristic time course. These spatial maps can reflect a coherent functional system or noise, and several criteria can be used to automatically differentiate them. This capability to isolate noise sources makes ICA particularly attractive. In the early days of rs-fMRI, several studies demonstrated marked resemblance between the ICA spatial maps and cortical functional networks known from task-activation studies [2, 4].

While typical ICA models are noise-free and assume that the only stochasticity is in the sources themselves, several variants of ICA have been proposed to model additive noise in the observed signals. Beckmann et al. [2] introduced a probabilistic ICA (PICA) model to extract the connectivity structure of rs-fMRI data. PICA models a linear instantaneous mixing process under additive noise corruption and statistical independence between sources. De Luca et al. [5] showed that PICA can reliably distinguish RSNs from artifactual patterns. Both these works showed high consistency in resting-state patterns across multiple subjects. While there is no standard criterion for validating the ICA patterns, or the output of any clustering algorithm for that matter, reproducibility or reliability is often used for quantitative assessment.

More recently, Khorshidi et al. proposed an automated denoising strategy for fMRI based on ICA, known as FIX ("FMRIB’s ICA-based X-noiseifier"). The authors trained a classifier using manual annotations to label artefactual components based on distinct spatial/temporal features.
These components could represent a variety of structured noise sources and once identified, they can be either subtracted or regressed out of the data to yield clean signals.

ICA can also be extended to make group inferences in population studies. Group ICA is by far the most widely used strategy, where multi-subject fMRI data are concatenated along the temporal dimension before implementing ICA [51]. Individual-level ICA maps can then be obtained from this group decomposition by back-projecting the group mixing matrix [51], or using a dual regression approach [52]. More recently, Du et al. [53] introduced a group information guided ICA to preserve statistical independence of individual ICs, where group ICs are used to constrain the corresponding subject-level ICs. Varoquaux et al. [54] proposed a robust group-level ICA model to facilitate between-group comparisons of ICs. They introduce a generative framework to model two levels of variance in the ICA patterns, at the group level and at the subject level, akin to a multivariate version of mixed-effect models. The IC estimation procedure, termed Canonical ICA, employs Canonical Correlation Analysis to identify a joint subspace of common IC patterns across subjects and yields ICs that are well representative of the group. Alternatively, it is also possible to compute individual-specific ICA maps and then establish correspondences across them [53] for generating group inferences; however, this approach has been limited because source separations can be very different across subjects, for example, due to fragmentation.

While ICA and its extensions have been used broadly by the rs-fMRI community, it is important to acknowledge its limitations. ICA models linear representations of non-Gaussian data. Whether a linear transformation can adequately capture the relationship between independent latent sources and the observed high-dimensional fMRI data is uncertain and likely unrealistic.
Unlike the popular Principal Component Analysis (PCA), ICA does not provide the ordering or the energies of its components, which makes it impossible to distinguish strong and weak sources. This also complicates replicability analysis since known sources, i.e., spatial maps, could be expressed in any arbitrary order. Extracting meaningful ICs also sometimes necessitates manual selection procedures, which can be inefficient or subjective. In the ideal scenario, each individual component represents either a physiologically meaningful activation pattern or noise. However, this might be an unrealistic assumption for rs-fMRI. Additionally, since ICA assumes non-Gaussianity of sources, Gaussian physiological noise can contaminate the extracted components. Further, due to the high-dimensionality of fMRI, analysis often proceeds with PCA based dimensionality reduction before application of ICA. PCA computes uncorrelated linear transformations of highest variance (thus explaining greatest variability within the data) from the top eigenvectors of the data covariance matrix. While this step is useful to remove observation noise, it could also result in loss of signal information that might be crucial for subsequent analysis. Although ICA optimizes for independence, it does not guarantee independence. Based on studies of functional integration within the brain, assumptions of independence between functional units could themselves be questioned from a neuroscientific point of view. Several papers have suggested that ICA is especially effective when spatial patterns are sparse, with negligible or little overlap. This hints at the possibility that the success of ICA is driven by sparsity of the components rather than their independence. Along these lines, Daubechies and colleagues claim that fMRI representations that optimize for sparsity in spatial patterns are more effective than fMRI representations that optimize independence [55].
Learning sparse spatial maps

Sparse dictionary learning is another popular framework for constructing succinct representations of observed data. Varoquaux et al. [56] adopt a dictionary learning framework for segmenting functional regions from resting-state fMRI time series. Their approach accounts for inter-subject variability in functional boundaries by allowing the subject-specific spatial maps to differ from the population-level atlas. Concretely, they optimize a loss function comprising a residual term that measures the approximation error between data and its factorization, a cost term penalizing large deviations of individual subject spatial maps from group level latent maps, and a regularization term promoting sparsity. In addition to sparsity, they also impose a smoothness constraint so that the dominant patterns in each dictionary are spatially contiguous to construct a well-defined parcellation. In order to prevent blurred edges caused by the smoothness constraint, Abraham et al. [57] propose a total variation regularization within this multi-subject dictionary learning framework. This approach is shown to yield more structured parcellations that outperform competing methods like ICA and clustering in explaining test data. Similarly, Lv et al. [58] propose a strategy to learn sparse representations of whole-brain fMRI signals in individual subjects by factorizing the time-series into a basis dictionary and its corresponding sparse coefficients. Here, dictionaries represent the co-activation patterns of functional networks and coefficients represent the associated spatial maps. Experiments revealed a high degree of spatial overlap in the extracted functional networks, in contrast to ICA, which is known to yield spatially non-overlapping components in practice.

K-means clustering and mixture models

K-means clustering or mixture models are frequently used for spatial segmentation of fMRI data [37, 59, 60, 61].
Similarity between voxels can be defined by correlating their raw time-series [59] or connectivity profiles [61]. Euclidean distance metrics have also been used on spectral features of time series [37].

K-means clustering has provided several novel insights into functional organization of the human brain. It has revealed the natural division of cortex into two complementary systems, the internally-driven "intrinsic" system and the stimuli-driven "extrinsic" system [59, 60]; provided evidence for a hierarchical organization of RSNs [60]; and exposed the anatomical contributions to co-varying resting-state fluctuations [37].

Golland et al. [62] proposed a Gaussian mixture model for clustering fMRI signals. Here, the signal at each voxel is modelled as a weighted sum of N Gaussian densities, with N determining the number of hypothesized functional networks and weights reflecting the probability of assignment to different networks. Large-scale systems were explored at several resolutions, revealing an intrinsic hierarchy in functional organization. Yeo et al. [63] used rs-fMRI measurements on 1000 subjects to estimate the organization of large-scale distributed cortical networks. They employed a mixture model to identify clusters of voxels with similar corticocortical connectivity profiles. The number of clusters was chosen from stability analysis, and parcellations at both a coarse resolution of 7 networks and a finer scale of 17 networks were identified. A high degree of replicability was attained across data samples, suggesting that these networks can serve as reliable reference maps for functional characterization.

Identifying hierarchical spatial organization

Several studies have provided evidence for a hierarchical organization of functional networks in the brain [60, 62]. Hierarchical agglomerative clustering (HAC) thus provides a natural tool to partition rs-fMRI data and explore this latent hierarchical structure.
The earliest applications of clustering to rs-fMRI were based on HAC [36, 64]. This technique thus largely demonstrated the feasibility of clustering for extracting RSNs from rs-fMRI data. Recent applications of HAC have focused on defining whole-brain parcellations for downstream analysis [65, 66, 67]. Spatial continuity can be enforced in parcels, for example, by considering only local neighborhoods as potential candidates for merging [65].

An advantage of hierarchical clustering is that, unlike k-means clustering, it does not require knowledge of the number of clusters and is completely deterministic. However, once the cluster tree is formed, the dendrogram must be split at a level that best characterizes the "natural" clusters. This can be determined based on a linkage inconsistency criterion [64], consistency across subjects [36], or advance empirical knowledge [68].

While a promising approach for rs-fMRI analysis, hierarchical clustering has some inherent limitations. It often relies on prior dimensionality reduction, for example by using an anatomical template [36], which can bias the resulting parcellation. It is a greedy strategy and erroneous partitions at an early stage cannot be rectified in subsequent iterations. The single-linkage criterion may not work well in practice since it merges partitions based on the nearest neighbor distance, and hence is not inherently robust to noisy resting-state signals. Further, different metrics usually optimize divergent attributes of clusters. For example, single-link clustering encourages extended clusters whereas complete-link clustering promotes compactness. This makes the a priori choice of distance metric somewhat arbitrary.

Graph based clustering

Functional MRI data can be naturally represented in the form of graphs. Here, nodes represent voxels and edges represent connection strength, typically measured by a correlation coefficient between voxel time series or between connectivity maps [50, 69].
Often, thresholding is applied on edges to limit graph complexity. Graph segmentation approaches, such as those based on the normalized cut (Ncut) criterion, have been widely used to derive whole-brain parcellations [50, 70, 71]. Population-level parcellations are usually derived in a two-stage procedure: first, individual graphs are clustered to extract functionally-linked regions; second, a group-level graph characterizing the consistency of individual cluster maps is clustered [50, 69]. Spatial contiguity can be easily enforced by constraining the connectivity graph to local neighborhoods [50], or through the use of shape priors [71]. Departing from this protocol, Shen et al. [70] proposed a groupwise clustering approach that jointly optimizes individual and group parcellations in a single stage and yields spatially smooth group parcellations in the absence of any explicit constraints.

A disadvantage of the Ncut criterion for fMRI is its bias towards creating uniformly sized clusters, whereas in reality functional regions show large size variations. Graph construction itself involves arbitrary decisions that can affect clustering performance [72], e.g., selecting a threshold to limit graph edges, or choosing the neighborhood to enforce spatial connectedness.
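The graph construction and the Ncut criterion discussed above can be illustrated with a minimal sketch. It is not the pipeline of [50] or [70]: the four toy "voxel" time series, the function names and the two threshold values are assumptions made for illustration, and the code only evaluates the two-way normalized cut of a given partition rather than optimizing it through the usual spectral relaxation.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_graph(series, threshold):
    """Weighted adjacency matrix; edges with correlation below `threshold` are dropped."""
    n = len(series)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = pearson(series[i], series[j])
            if r >= threshold:
                W[i][j] = W[j][i] = r
    return W

def ncut(W, part_a):
    """Normalized cut of the bipartition (part_a, remaining nodes)."""
    nodes = range(len(W))
    a = set(part_a)
    b = set(nodes) - a
    cut = sum(W[i][j] for i in a for j in b)
    assoc_a = sum(W[i][j] for i in a for j in nodes)
    assoc_b = sum(W[i][j] for i in b for j in nodes)
    return cut / assoc_a + cut / assoc_b

# Toy data (hypothetical): voxels 0-1 co-fluctuate, as do voxels 2-3.
series = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [1.1, 2.0, 2.9, 4.2, 5.1, 5.9],
    [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],
    [3.2, 0.9, 4.1, 1.2, 4.8, 2.1],
]
W_sparse = correlation_graph(series, threshold=0.5)  # weak edges removed
W_dense = correlation_graph(series, threshold=0.0)   # only negative edges removed
```

In this toy setting the partition {0, 1} versus {2, 3} has a lower normalized cut than {0, 2} versus {1, 3}, and the value of the cut depends on the chosen edge threshold, which is one of the arbitrary graph-construction decisions noted above.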
Table 1: Key papers for application: Discovering spatial patterns with coherent resting-state fluctuations (RSNs)

Approach a: Decomposition

  Investigations into resting-state connectivity using independent component analysis (Beckmann et al., 2005) [2]
  Consistent resting-state networks across healthy subjects (Damoiseaux et al., 2006) [4]
    Method: ICA
    Contribution: Early works demonstrating the striking similarity between ICA spatial maps and cortical functional networks

  Group comparison of resting-state fMRI using multi-subject ICA and dual regression (Beckmann et al., 2009) [52]
  A group model for stable multi-subject ICA on fMRI datasets (Varoquaux et al., 2010) [54]
  Group information guided ICA for fMRI data analysis (Du et al., 2013) [53]
    Method: ICA (group-level)
    Contribution: Influential works discussing analytical approaches for multi-subject ICA in resting-state

  Multi-subject dictionary learning to segment an atlas of brain spontaneous activity (Varoquaux et al., 2011) [56]
    Method: Sparse dictionary learning
    Contribution: A multi-subject dictionary learning framework for learning sparse spatial maps

Approach b: Clustering

  Hierarchical clustering to measure connectivity in fMRI resting-state data (Cordes et al., 2002) [64]
  Neurophysiological Architecture of Functional Magnetic Resonance Images of Human Brain (Salvador et al., 2005) [36]
    Method: Hierarchical clustering
    Contribution: Earliest applications of clustering to rs-fMRI; highlighted hierarchical organization of functional networks

  The organization of the human cerebral cortex estimated by intrinsic functional connectivity (Yeo et al., 2011) [63]
    Method: Mixture models
    Contribution: Influential large-scale study investigating brain's functional organization

  A whole brain fMRI atlas generated via spatially constrained spectral clustering (Craddock et al., 2012) [50]
  Groupwise whole-brain parcellation from resting-state fMRI data for network node identification (Shen et al., 2013) [70]
    Method: Graph based clustering
    Contribution: Released consistent whole-brain functional atlas for fMRI at varying spatial resolutions based on rs-fMRI data

Comments

I. A comment on alternate connectivity-based parcellations

Several papers make a distinction between clustering / decomposition and boundary detection based approaches for network segmentation. In the rs-fMRI literature, several non-learning based parcellations have been proposed that exploit traditional image segmentation algorithms to identify functional areas based on abrupt RSFC transitions [73, 74]. Clustering algorithms do not mandate spatial contiguity, whereas boundary based methods implicitly do. On the other hand, boundary based approaches fail to represent long-range functional associations, and may not yield parcels that are as connectionally homogeneous as those from unsupervised learning approaches. A hybrid of these approaches can yield better models of brain network organization. This direction was recently explored by Schaefer et al. [75] with a Markov Random Field model. The resulting parcels showed superior homogeneity compared with several alternate gradient and learning-based schemes.

Further, complementing RSFC with other modalities can yield corroborative and perhaps complementary information for delineating areal boundaries. Recently, Glasser et al. addressed this problem by developing a multi-modal approach for generating brain parcellations [74]. The authors propose a semi-automated approach that combines supervised machine learning with manual annotations for parcellating regions based on their multi-modal fingerprints (architecture, function, connectivity and topography). Such an approach can be instrumental towards the goal of precise human brain functional mapping.

II. Subject versus population level parcellations

Significant effort in the rs-fMRI literature is dedicated to identifying population-average parcellations.
The underlying assumption is that functional connectivity graphs exhibit similar patterns across subjects, and these global parcellations reflect common organizational principles. Yet, individual-level parcellations can potentially yield more sensitive connectivity features for investigating networks in health and disease. A central challenge in this effort is to match the individual-level spatial maps to a population template in order to establish correspondences across subjects. Common approaches to obtain subject-specific networks with group correspondence often incorporate back-projection and dual regression [51, 52], or hierarchical priors within unsupervised learning [56, 76].

While a number of studies have developed subject-specific parcellations, the significance of this inter-subject variability for network analysis has only recently been discussed. Kong et al. [76] developed high quality subject-specific parcellations using a multi-session hierarchical Bayesian model, and showed that subject-specific variability in functional topography can predict behavioral measures. Recently, using a novel parcellation scheme based on K-medoids clustering, Salehi et al. [77] showed that individual-level parcellation alone can predict the sex of the individual. These studies suggest the intriguing idea that subject-level network organization, i.e. voxel-to-network assignments, can capture concepts intrinsic to individuals, just like connectivity strength.

III. Is there a universal 'gold-standard' atlas?

Across the family of different methods, algorithms and modalities, there exists a plethora of diverse brain parcellations at varying levels of granularity. Thus far, there is no unified framework for reasoning about these brain parcellations. Several taxonomic classifications can be used to describe the generation of these parcellations, such as machine learning or boundary detection, decomposition or clustering, multi-modal or unimodal.
Even within the large class of clustering approaches, it is impossible to find a single algorithm that is consistently superior for a collection of simple, desired properties of partitioning [78]. Several evaluation criteria have emerged for comparing different parcellations, exposing the inherent trade-offs at work. Arslan et al. [79] performed an extensive comparison of several parcellations across diverse methods on resting-state data from the Human Connectome Project (HCP). Through independent evaluations, they concluded that no single parcellation is consistently superior across all evaluation metrics. Recently, Salehi et al. [80] showed that different functional conditions, such as task or rest, generate reproducibly distinct parcellations, thus questioning the very existence of an optimal parcellation, even at an individual level.

These studies call for rethinking the final goals of brain mapping. Several studies have reflected the view that there is no optimal functional division of the brain, rather just an array of meaningful brain parcellations [65]. Perhaps brain mapping should not aim to identify functional sub-units in a universal sense, like Brodmann areas. Rather, the goal of human brain mapping should be reformulated as revealing consistent functional delineations that enable reliable and meaningful investigations into brain networks.

IV. A comparison between decomposition and clustering

A high degree of convergence has been observed in the functionally coherent patterns extracted using decomposition and clustering. Decomposition techniques allow soft partitioning of the data, and can thus yield spatially overlapping networks. These models may be more natural representations of brain networks where, for example, highly integrated regions such as network 'hubs' can simultaneously subserve multiple functional systems.
Although it is possible to threshold and relabel the generated maps to produce spatially contiguous brain parcellations, these techniques are not naturally designed to generate disjoint partitions. In contrast, clustering techniques automatically yield hard assignments of voxels to different brain networks. Spatial constraints can be easily incorporated within different clustering algorithms to yield contiguous parcels. Decomposition models can adapt to varying data distributions, whereas clustering solutions allow much less flexibility owing to rigid clustering objectives. For example, the k-means objective favors spherical clusters.

While a thorough comparison between these approaches is still lacking, some studies have identified the trade-offs between choosing either technique for parcellation. Abraham et al. [57] compared clustering approaches with group-ICA and dictionary learning on two evaluation metrics: stability, as reflected by reproducibility in voxel assignments on independent data, and data fidelity, captured by the explained variance on independent data. They observed a stability-fidelity trade-off: clustering models yield stable regions but do not explain test data as well, whereas linear decomposition models explain the test data reasonably well but at the expense of reduced stability.

Discovering patterns of dynamic functional connectivity

Unsupervised learning has also been applied to study patterns of temporal organization or dynamic reconfigurations in resting-state networks. These studies are often based on two alternate hypotheses: (a) dynamic (windowed) functional connectivity cycles between discrete "connectivity states", or (b) functional connectivity at any time can be expressed as a combination of latent "connectivity states". The first hypothesis is examined using clustering-based approaches or generative models like HMMs, while the second is modelled using decomposition techniques.
Once stable states are determined across the population, the former approach allows us to estimate the fraction of time spent in each state by all subjects. This quantity, known as dwell time or occupancy of the state, shows meaningful variation across individuals [42, 43, 81, 82]. It is important to note that in all these approaches, the RSNs or the spatial patterns are assumed to be stationary over time and it is the temporal coherence that changes with time.

Clustering

Several studies have discovered recurring dynamic functional connectivity patterns, known as "states", through k-means clustering of windowed correlation matrices [42, 81, 82, 83, 84]. FC associated with these repeating states shows marked departure from static FC, suggesting that network dynamics provide novel signatures of the resting brain [42]. Notable differences have been observed in the dwell times of multiple states between healthy controls and patient populations across schizophrenia, bipolar disorder and psychotic-like experience domains [81, 82, 83].

Abrol et al. [84] performed a large-scale study to characterize the replicability of brain states using standard k-means as well as a more flexible, soft k-means algorithm for state estimation. Experiments indicated reproducibility of most states, as well as of their summary measures, such as mean dwell times and transition probabilities, across independent population samples.

While these studies establish the existence of recurring FC states, the behavioral associations of these states are still unknown. In an interesting piece of work, Wang et al. [85] identified two stable dynamic FC states using k-means clustering that showed correspondence with internal states of high and low arousal, respectively. This suggests that RSFC fluctuations are behavioral state-dependent, and presents one explanation to account for the heterogeneity and dynamic nature of RSFC.
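The state summary measures mentioned above can be sketched as follows, assuming that each sliding window has already been assigned a state label (for example by k-means on windowed correlation matrices). The label sequence, the window parameters and all function names are hypothetical. The sketch follows one common convention in which fractional occupancy is the share of windows spent in a state and mean dwell time is the average length of an uninterrupted visit.

```python
from collections import Counter
from itertools import groupby

def sliding_windows(n_timepoints, width, step):
    """(start, stop) index pairs of the windows used to compute dynamic FC."""
    return [(s, s + width) for s in range(0, n_timepoints - width + 1, step)]

def fractional_occupancy(states):
    """Fraction of windows assigned to each state."""
    counts = Counter(states)
    return {s: c / len(states) for s, c in counts.items()}

def mean_dwell_time(states):
    """Average length, in windows, of uninterrupted visits to each state."""
    runs = {}
    for state, run in groupby(states):
        runs.setdefault(state, []).append(len(list(run)))
    return {s: sum(r) / len(r) for s, r in runs.items()}

def transition_counts(states):
    """Counts of (state at window t, state at window t+1) pairs."""
    return Counter(zip(states[:-1], states[1:]))

# Hypothetical window-wise state labels for a single subject.
labels = [0, 0, 0, 1, 1, 0, 0, 2, 2, 2]
occupancy = fractional_occupancy(labels)
dwell = mean_dwell_time(labels)
transitions = transition_counts(labels)
```

Normalizing each row of the transition counts gives an empirical transition probability matrix, which is the quantity compared across population samples in replicability analyses of this kind.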
Markov modelling of state transition dynamics

HMMs are another valuable tool to interrogate recurring functional connectivity patterns [43, 44, 86]. The notion of states remains similar to the "FC states" described above for clustering; however, the characterization and estimation is drastically different. Unlike clustering, where sliding windows are used to compute dynamic FC patterns, HMMs model the rs-fMRI time-series directly. Hence, they offer a promising alternative to overcome statistical limitations of sliding windows in characterizing FC changes.

Several interesting results have emerged through the adoption of HMMs. Vidaurre et al. [43] find that relative occupancy of different states is a subject-specific measure linked with behavioral traits and heredity. Through Markov modelling, transitions between states have been revealed to occur as a non-random sequence [42, 43] that is itself hierarchically organized [43]. Recently, network dynamics modelled using HMMs were shown to distinguish MCI patients from controls [86], thereby indicating their utility in clinical domains.

Finding latent connectivity patterns across time-points

Decomposition techniques for understanding RSFC dynamics have the same flavor as the ones described in section 2.1.2, namely explaining data through latent factors; however, the variation of interest is across time in this case. Adoption of matrix decomposition techniques exposes a basis set of FC patterns from windowed correlation matrices. Dynamic FC has been characterized using varied decomposition approaches, including PCA [48], Singular Value Decomposition (SVD) [49], non-negative matrix factorization [87] and sparse dictionary learning [88].

Decomposition approaches here diverge from clustering or HMMs as they associate each dFC matrix with multiple latent factors instead of a single component. To compare these alternate approaches, Leonardi et al. [49] implemented a generalized matrix decomposition, termed k-SVD.
This factorization generalizes both k-means clustering and PCA subject to variable constraints. Reproducibility analysis in this study indicated that dFC is better characterized by multiple overlapping FC patterns.

Decomposition of dFC has revealed novel alterations in network dynamics between healthy controls and patients suffering from PTSD [88] or multiple sclerosis [48], as well as between childhood and young adulthood [87].

Table 2: Key papers for application: Discovering reproducible patterns of dynamic functional connectivity

Approach a: Decomposition

  Principal components of functional connectivity: a new approach to study dynamic brain connectivity during rest (Leonardi et al., 2013) [48]
    Method: PCA
    Contribution: Early work characterizing dFC using latent connectivity patterns and suggesting altered connectivity dynamics in disease

Approach b: Clustering

  Tracking whole-brain connectivity dynamics in the resting state (Allen et al., 2014) [42]
    Method: K-means
    Contribution: Provided evidence for recurring FC states and suggested marked departure of dynamic connectivity patterns from static FC

  Dynamic functional connectivity analysis reveals transient states of dysconnectivity in schizophrenia (Damaraju et al., 2014) [81]
    Method: K-means
    Contribution: Revealed strong statistical differences in dwell times of multiple FC states between controls and a disease group

Approach c: Markov models

  Unsupervised learning of functional network dynamics in resting state fMRI (Eavani et al., 2013) [44]
    Method: HMM
    Contribution: Earliest application of HMMs to study resting-state functional network dynamics

  Brain network dynamics are hierarchically organized in time (Vidaurre et al., 2017) [43]
    Method: HMM
    Contribution: Demonstrated that transitions between FC states occur in a non-random hierarchically organized fashion and revealed that dwell times of FC states are linked with behavioral traits and heredity
Disentangling latent factors of inter-subject FC variation

Unsupervised learning can also disentangle latent explanatory factors for FC variation across the population. We find two applications here: (i) learning low dimensional embeddings of FC matrices for subsequent supervised learning and (ii) learning population groupings to differentiate phenotypes based solely on FC.

Dimensionality reduction

Rs-fMRI analysis is plagued by the curse of dimensionality, i.e., the phenomenon of increasing data sparsity in higher dimensions. The number of commonly used features, such as FC between pairs of regions, grows as $O(n^2)$ with the number n of parcellated regions. Further, the sample size in fMRI studies is typically of the order of tens or hundreds, making it harder to learn generalizable patterns from the original high dimensional data. To overcome this, linear decomposition methods like PCA or sparse dictionary learning have been widely used for dimensionality reduction of functional connectivity data [89, 90, 91, 92].

Several non-linear embedding methods like locally linear embedding (LLE) or autoencoders (AEs) have also garnered attention. LLE embeddings have been employed in rs-fMRI studies, for example, to improve predictions in supervised age regression [93], or for low-dimensional clustering to distinguish schizophrenia patients from controls [94]. AEs are a neural network based alternative for generating reduced feature sets through nonlinear input transformations. They have been used for feature reduction of RSFC in several studies [86, 95]. AEs can also be used in a pre-training stage for supervised neural network training, in order to direct the learning towards parameter spaces that support generalization [96]. This technique was shown, for example, to improve classification performance for autism and schizophrenia using RSFC [97, 98].

Clustering heterogeneous diseases

Clustering can expose sub-groups within a population that show similar FC.
Using unsupervised maximum margin clustering [99], Zeng et al. [100] demonstrated that clusters can be associated with disease category (depressed versus control) to yield high classification accuracy. Recently, Drysdale et al. [101] discovered novel neurophysiological subtypes of depression based on RSFC. Using an agglomerative hierarchical procedure, they identified clustered patterns of dysfunctional connectivity, where clusters showed associations with distinct clinical symptom profiles despite no external supervision. Several psychiatric disorders, like depression, schizophrenia, and autism spectrum disorder, are believed to be highly heterogeneous with widely varying clinical presentations. Instead of labelling each of them as a unitary syndrome, differential characterization based on disease subtypes can help build better diagnostic, prognostic or therapy selection systems. Unsupervised clustering could aid in the identification of these disease subtypes based on their rs-fMRI manifestations.
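As a rough illustration of how an agglomerative procedure groups subjects by the similarity of their connectivity features, the following naive sketch merges the closest pair of clusters until a requested number remains. It is not the pipeline of [101]: the two-dimensional toy feature vectors, the Euclidean distance and the function names are assumptions made for illustration, and the `linkage` argument only switches between the single, complete and average criteria.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative(points, n_clusters, linkage="average"):
    """Naive agglomerative clustering; returns clusters as sorted index lists."""
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(c1, c2):
        d = [euclidean(points[i], points[j]) for i in c1 for j in c2]
        if linkage == "single":
            return min(d)      # nearest-neighbor distance
        if linkage == "complete":
            return max(d)      # farthest-neighbor distance
        return sum(d) / len(d)  # average linkage

    while len(clusters) > n_clusters:
        # Greedy step: merge the closest pair; earlier merges are never undone.
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        a, b = min(pairs, key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical low-dimensional connectivity features for five subjects.
subjects = [[0.10, 0.20], [0.15, 0.22], [0.90, 0.80], [0.95, 0.85], [0.12, 0.18]]
subgroups = agglomerative(subjects, n_clusters=2)
```

On well-separated toy data such as this, all three linkage criteria recover the same two subgroups; on noisy data they can disagree, which is why the choice of linkage matters in practice.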
Table 3: Key papers for application: Disentangling latent factors of inter-subject RSFC variation

  Identifying Sparse Connectivity Patterns in the brain using resting-state fMRI (Eavani et al., 2015) [91]
    Method: Sparse dictionary learning
    Contribution: One of the early works explaining inter-subject RSFC variability in terms of sparse connectivity patterns

Approach b: Non-linear embeddings

  Discriminative analysis of resting-state functional connectivity patterns of schizophrenia using low dimensional embedding of fMRI (Shen et al., 2010) [94]
    Method: LLE
    Contribution: Proposed an unsupervised learning approach for discriminating Schizophrenia patients from controls with impressive accuracy

  Identification of autism spectrum disorder using deep learning and the ABIDE dataset (Heinsfeld et al., 2018) [97]
  Deep neural network with weight sparsity control and pre-training extracts hierarchical features and enhances classification performance: Evidence from whole-brain resting-state functional connectivity patterns of schizophrenia (Kim et al., 2016) [98]
    Method: Autoencoders
    Contribution: More recent works demonstrating the advantages of autoencoder based dimensionality reduction/pre-training for downstream classification

Approach c: Clustering

  Unsupervised classification of major depression using functional connectivity MRI (Zeng et al., 2014) [100]
  Resting-state connectivity biomarkers define neurophysiological subtypes of depression (Drysdale et al., 2017) [101]
    Method: Maximum margin clustering/HAC
    Contribution: Demonstrated the power of clustering approaches for diagnosing depression and identifying its subtypes based on rs-fMRI manifestations

2.1.4 Supervised Learning

Supervised learning denotes the class of problems where the learning system is provided input features of the data and corresponding target predictions (or labels).
The goal is to learn the mapping between input and label, so that the system can compute predictions for previously unseen input data points. Prediction of autism from rs-fMRI correlations is an example problem. Since intrinsic FC reflects interactions between cognitively associated functional networks, it is hypothesized that systematic alterations in resting-state patterns can be associated with pathology or cognitive traits. The promising diagnostic accuracy attained by supervised algorithms using rs-fMRI constitutes strong evidence for this hypothesis. In this section, we separate the discussion of rs-fMRI feature extraction from the classification algorithms and application domains.

Deriving connectomic features

To render supervised learning effective, the most critical factor is feature extraction. Capturing relevant neurophenotypes from rs-fMRI depends on various design choices. Almost all supervised prediction models use brain networks or "connectomes" extracted from rs-fMRI time-series as input features for the learning algorithm. The prototypical prediction pipeline is shown in Figure 2.8. Here, we discuss critical aspects of common choices for brain network representations in supervised learning.

The first step in the prototypical pipeline is region definition and corresponding time-series extraction. Dense connectomes derived from voxel-level correlations are rarely used in practice for supervised prediction due to their high dimensionality. Both functional and anatomical atlases have been extensively used for this dimensionality reduction. Atlases delineate ROIs within the brain that are often used to study RSFC at a supervoxel scale. Each ROI is represented with a distinct time-course, often computed as the average signal from all voxels within the ROI. Consequently, the data is represented as an N × T matrix, where N denotes the number of ROIs and T represents the number of time-points in the signal.
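The region-averaging step described above can be sketched as follows, assuming the voxel time courses and the atlas label of each voxel are already available as plain Python lists. The function names and the toy values are hypothetical; the second helper simply counts the unique ROI pairs, N(N-1)/2, in a symmetric N × N connectivity matrix.

```python
def roi_time_series(voxel_series, voxel_labels):
    """Average voxel time courses within each ROI.

    voxel_series: list of V time courses, each of length T.
    voxel_labels: list of V ROI labels (atlas assignment of each voxel).
    Returns (roi_ids, matrix), where matrix has N rows of length T.
    """
    roi_ids = sorted(set(voxel_labels))
    matrix = []
    for roi in roi_ids:
        members = [ts for ts, lab in zip(voxel_series, voxel_labels) if lab == roi]
        n_timepoints = len(members[0])
        matrix.append([sum(ts[t] for ts in members) / len(members)
                       for t in range(n_timepoints)])
    return roi_ids, matrix

def n_connectivity_features(n_rois):
    """Number of unique ROI pairs in a symmetric connectivity matrix."""
    return n_rois * (n_rois - 1) // 2

# Toy example (hypothetical): three voxels, two ROIs, three time-points.
voxels = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [10.0, 10.0, 10.0]]
atlas = [1, 1, 2]
roi_ids, roi_matrix = roi_time_series(voxels, atlas)
```

For instance, an atlas with 400 ROIs gives 79,800 unique pairwise connectivity features, which illustrates why the granularity of the parcellation directly controls the dimensionality of the learning problem.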
A drawback of using pre-defined atlases is that they may not explain the rs-fMRI dataset very well since they are not optimized for the data at hand. Several studies employ data-driven techniques to define regions within the brain, using unsupervised models such as K-means clustering, Ward clustering, ICA or dictionary learning [66, 102]. It is important to note that since we use pairs of ROIs to define whole-brain RSFC, the number of features grows as $O(N^2)$ with the number of ROIs. Therefore, in most studies, the network granularity is limited to the range of 10-400 ROIs.

The second step in this pipeline involves defining connectivity strength for extracting the connectome matrix. Functional connectivity between pairs of ROIs is the most common feature representation of rs-fMRI in supervised learning. To extract the connectivity matrix, the covariance matrix must first be estimated. Sample covariance matrices are subject to a significant amount of estimation error due to the limited number of time-points. This ill-posed problem can be partially resolved through the use of shrinkage transformations [103]. Connectivity strength can then be estimated from the covariance matrix in multiple ways. Pearson's correlation coefficient is a commonly used metric for estimating functional connectivity. Partial correlation is another metric that has been shown to yield better estimates of network connections in simulated rs-fMRI data [104]. It measures the normalized correlation between two time-series, after removing the effect of all other time-series in the data. Alternatively, one can use a tangent-based reparametrization of the covariance matrix to obtain functional connectivity matrices that respect the Riemannian manifold of covariance matrices [105]. These connectivity coefficients can boost the sensitivity for comparing patient and control populations [66, 105].
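The difference between full and partial correlation can be made concrete with a minimal sketch. It is restricted, as a simplifying assumption, to three time series, where the partial correlation between two of them given the third has a closed form; with more regions, partial correlations are usually obtained from the inverse covariance (precision) matrix instead. The toy signals and function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Toy signals (hypothetical): x and y are both driven by z plus small,
# mutually orthogonal perturbations.
z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
e1 = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5]
e2 = [0.5, 0.5, -0.5, -0.5, 0.5, 0.5, -0.5, -0.5]
x = [a + b for a, b in zip(z, e1)]
y = [a + b for a, b in zip(z, e2)]

full = pearson(x, y)                    # high: both series follow z
partial = partial_correlation(x, y, z)  # close to zero once z is accounted for
```

Here the strong full correlation between x and y is almost entirely explained by the shared driver z, which is the kind of indirect connection that partial correlation is meant to discount.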
It is also possible to define frequency-specific connectivity strength by decomposing the original time-series into multiple frequency sub-bands and correlating signals separately within these sub-bands [106].

A few studies depart from this routine. In graph-theoretic analysis, it is common to represent parcellated brain regions as graph nodes and functional connectivity between nodes as edge weights. This graph based representation of functional connectivity, the human "connectome", has been used to infer various topological characteristics of brain networks, such as modularity, clustering and small-worldness. Some discriminative models have exploited these graph-based measures for individual-level predictions [13, 107, 108], although they are more commonly used for comparing groups.

While limited in number, a few studies have also explored rs-fMRI features beyond RSFC. Amplitude of low-frequency fluctuations (ALFF) and local synchronization of rs-fMRI signals, or Regional Homogeneity (ReHo), are two alternate measures for studying spontaneous brain activity that have shown discriminative ability [109, 110]. More recently, several studies have also begun to explore the predictive capacity of dynamic FC in supervised models [111, 112].

Feature selection

The goal of feature selection is to remove noisy, redundant or irrelevant features from the data while minimizing the information loss. Feature selection can often be an advantageous pre-processing step for training supervised learning algorithms, especially in the low sample size regime. In the absence of adequate regularization, a large number of features can result in a loss of generalization power. Selecting a subset of features with highest relevance can thus help in building models that generalize better, while reducing computational complexity. Feature selection can be performed in a supervised or unsupervised fashion.
Supervised or semi-supervised feature selection techniques choose a subset of features based on their ability to distinguish samples from different classes. These methods thus rely on class labels and can be further classified into filter, wrapper or embedded type models. Filter models first rank features by their importance/relevance for the classification task based on a statistical measure (e.g. t-test) and then select the top-ranked features. Wrapper models select feature subsets based on their predictive accuracy and thus need a pre-determined classification algorithm. Wrapper models often perform better as they take into account the prediction accuracy estimates during feature selection. Due to the repeated learning and cross-validation, however, these models can be computationally prohibitive. Embedded models combine the advantages of the two by integrating feature selection into the learning algorithm. Regression models such as LASSO belong to this category as they implicitly select features by encouraging sparsity. These feature selection methods are discussed in depth in a detailed review by Tang et al. [113].

An alternative to feature selection is input dimensionality reduction. Methods like PCA or LLE belong to the category of unsupervised feature selection techniques and have been used to reduce the feature set to a manageable size in several studies. However, as pointed out in [114], these are not guaranteed to improve classification performance since they are oblivious to class labels.

Further, whether or not feature selection is necessary also depends on the downstream learning algorithm. Support vector machines, in general, deal well with high-dimensional data because of an implicit regularization. In the context of SVMs, Vapnik et al. [115] have shown that an upper bound on generalization error is independent of the number of features. Regularized models, in general, are capable of handling large feature sets.
A drawback is that these models necessitate cross-validation to tune hyper-parameters such as the weight of the regularization penalty. This can reduce the effective sample size available for training and/or independent testing.

In some situations, it might be beneficial to exploit domain knowledge to guide feature selection. For example, if certain anatomical regions are known to have altered functional connectivity in disease based on prior studies, it might be advantageous to use this prior knowledge for constructing a focused feature set.

Methods

The majority of supervised learning methods applied to rs-fMRI are discriminant-based, i.e., they discriminate between classes without any prior assumptions about the generative process. The focus is on correctly estimating the boundaries between classes of interest. Learning algorithms for the same discriminant function (e.g., linear) can be based on different objective functions, giving rise to distinct models. We describe common models below.

Regularized linear models

A large class of supervised learning algorithms are based on regularized linear models. The goal is to predict a target variable Y given input features X. Without loss of generality and for notational convenience, let us assume that the feature vector contains a single constant entry equal to 1, which allows us to account for a bias term. These algorithms differ in the choice of their likelihood model $P(Y|X,w)$ and/or prior $P(w)$, where w denotes the parameters of the model. These methods yield optimization problems that are based on a conditional likelihood estimation or a maximum a posteriori (MAP) estimation framework:

\[ w_{opt} = \arg\max_{w} P(Y|X,w) \quad \text{(conditional likelihood)} \]
\[ w_{opt} = \arg\max_{w} P(w|X,Y) \quad \text{(MAP)} \]

Ridge regression

Ridge regression is a widely used supervised learning algorithm belonging to the class of regularized linear models. The goal is to predict a real-valued output Y given input features X.
The conditional likelihood in this algorithm is specified as a multivariate normal distribution where the mean parameter is modelled as a linear combination of input features, i.e., $Y \mid X \sim \mathcal{N}(w^T x, \sigma^2 I)$. The prior on weight parameters is often modelled as a zero-mean Gaussian with a diagonal covariance matrix, i.e., $w \sim \mathcal{N}(0, \tau^2 I)$. The optimal weight parameters $w$ are thus obtained within a maximum a posteriori (MAP) estimation framework according to

$$w_{\mathrm{opt}} = \arg\min_{w} \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - w^T x_i)^2 + \frac{1}{2\tau^2} \|w\|_2^2 .$$

The MAP estimation problem above is convex and admits an elegant analytical solution.

Logistic Regression

Logistic regression employs a Bernoulli distribution to model the conditional probability of an output class $Y$ given the input features $X$, i.e., $Y \mid X \sim \mathrm{Bernoulli}(\mu)$. The mean parameter, $\mu$, is specified with a logistic link function $\sigma(\cdot)$ using a linear combination of input features, i.e., $\mu = \sigma(w^T x)$. Given data $\{(x_i, y_i), i = 1, \ldots, n\}$, with class labels coded as $y_i \in \{-1, +1\}$, the model parameters $w$ are optimized within a conditional maximum likelihood framework by solving the following convex optimization problem:

$$w_{\mathrm{opt}} = \arg\min_{w} \sum_{i=1}^{n} \log\left(1 + \exp(-y_i w^T x_i)\right).$$

The training objective is optimized using iterative methods such as gradient descent or Newton’s method. Regularized variants of logistic regression incorporate priors on the weight parameters (e.g., multivariate Gaussian) and optimize the MAP estimates instead of the conditional likelihood estimates.

Support Vector Machines (SVMs)

The SVM is the most widely used classification/regression algorithm in rs-fMRI studies. SVMs search for an optimal separating hyperplane between classes that maximizes the margin, i.e., the distance from the hyperplane to the points closest to it on either side. This results in a classifier of the form $f(x) = \mathrm{sign}(w^T x)$. The model parameters are obtained by solving the following convex optimization problem:

$$w_{\mathrm{opt}} = \arg\min_{w} C \sum_{i=1}^{n} \max\left(0, 1 - y_i (w^T x_i)\right) + \|w_{-b}\|_2^2 .$$

Here $\|w_{-b}\|_2$ is the L2 norm of the weight vector excluding the bias term. $C$ controls the capacity of the model and determines the margin of the classifier. Tuning $C$ can control overfitting and reduce the generalization error of the model. The resulting classification model is determined by only a subset of training instances that are closest to the boundary, known as the support vectors. SVMs can be extended to seek non-linear separating boundaries by adopting a so-called kernel function. The kernel function, which quantifies the similarity between pairs of points, implicitly maps the input data to higher dimensions. Conceptually, the use of kernel functions allows incorporation of domain-specific measures of similarity. For example, graph-based kernels, such as the Weisfeiler-Lehman subtree kernel, can define a similarity measure on the graphical representation of functional connectivity data for classification directly in the graph space.

Decision trees and random forests

Decision trees predict the output $Y$ based on a sequence of splits in the input feature space $X$. The tree is a directed acyclic graph whose nodes represent decision points and whose edges represent their outcomes. Traversing the tree from the root, following the outcome of each decision, yields a prediction once a node with no children (a leaf node) is reached. Decision trees are often constructed in a top-down greedy fashion where nodes are split at each step by optimizing a metric that quantifies the consistency between predictions and ground truth. For example, in classification, an often-used information-theoretic metric for quantifying this consistency is information gain, i.e., the reduction in entropy of $Y$ after knowing $X$. Mathematically, this is expressed as $IG(Y, X) = H(Y) - H(Y \mid X)$, where $H$ denotes the Shannon entropy. Based on this metric, the first split will use the attribute of $X$ that gives the maximum information gain. Decision trees can offer interpretability, often at the cost of reduced accuracy.
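As an illustration of the information-gain criterion described above, the following minimal Python sketch computes IG(Y, X) = H(Y) − H(Y|X) for discrete attributes and selects the attribute with the largest gain for the first split. The toy data and the function names (`entropy`, `information_gain`, `best_split`) are hypothetical and serve only the illustration; practical tree learners also handle continuous features and recursive splitting, which are omitted here.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of discrete labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(y, x):
    """IG(Y, X) = H(Y) - H(Y | X) for one discrete attribute x."""
    h_cond = 0.0
    for value in np.unique(x):
        mask = x == value
        # Weight the entropy of each branch by the fraction of samples in it.
        h_cond += mask.mean() * entropy(y[mask])
    return entropy(y) - h_cond

def best_split(X, y):
    """Index of the attribute (column of X) with maximum information gain."""
    gains = [information_gain(y, X[:, j]) for j in range(X.shape[1])]
    return int(np.argmax(gains)), gains

# Hypothetical toy data: attribute 0 determines the label, attribute 1 does not.
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_toy = np.array([0, 0, 1, 1])
best_attr, gains = best_split(X_toy, y_toy)
```

In this toy example the first attribute removes all uncertainty about the label (a gain of 1 bit) while the second carries no information, so a greedy tree would split on the first attribute.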
Ensembles of decision trees, such as random forests or boosted trees, are thus a more popular choice in most applications since they yield much better prediction performance.

Deep neural networks

An ideal machine learning system should be highly automated, with limited hand-crafting in feature extraction as well as minimal assumptions about the nature of the mapping between data and labels. The system should be able to mechanistically learn patterns useful for prediction from observed labelled data. Neural networks are highly promising methods for automated learning. This stems from their capability to approximate arbitrarily complex functions given sufficient labelled data [116]. Deep learning based models or neural networks define a mapping $Y = f(X; \theta)$ and optimize for parameters $\theta$ that yield the best functional approximation. The function $f(\cdot)$ is typically a composition of simple nonlinear functions, often referred to as layers. A widely used layer is the fully-connected layer, which linearly combines the input variables and applies a simple elementwise non-linear function such as a sigmoid. The number of layers determines the depth of the network and controls the complexity of the model. The weights and biases of the layers are optimized via gradient descent based methods to minimize an objective function that quantifies the empirical risk.

Traditionally, the use of neural network algorithms has been limited since neuroimaging is a data-scarce domain, making it difficult to learn a reliable mapping between input and prediction variables. However, with data sharing and open release of large-scale neuroimaging data repositories, neural networks have recently gained adoption in the rs-fMRI community for supervised prediction tasks. Neural networks with fully connected dense layers have been adopted to learn arbitrary mappings from connectivity features to disease labels [97, 98].
Recently, more advanced neural network models with local receptive fields, like convolutional neural networks (CNNs), have shown promising classification accuracy using rs-fMRI data [117]. CNNs replace the fully-connected operations by convolutions with a set of learnable filters. The success of this approach stems from its ability to exploit the full-resolution 3D spatial structure of rs-fMRI without having to learn too many model parameters, thanks to the weight sharing in CNNs.

Comments

I. Strengths/weaknesses of diverse approaches

All algorithms have their own strengths and weaknesses and the choice of approach should be driven by several factors such as the prediction task, sample size, and nature of the input features. The training objective in common supervised learning algorithms used for neuroimaging applications, such as regularized linear models or SVMs, is often a combination of two terms: a data loss term that measures the empirical risk or training error, and a regularization penalty, reflecting the prior, that helps combat over-fitting during learning and thereby reduces the generalization error. The penalty norm can be critical and is often constrained by our prior knowledge about the data. L1 penalties encourage sparsity in weights whereas L2 penalties can allow kernelization and thus enable non-linear decision functions. L2 penalties lead to dense priors and are useful in learning problems where all features are expected to contribute to the predictive model. L1 penalties are useful when prior belief suggests that only a subset of features will contribute to predictions. Some regression models, e.g., Elastic-Net, employ a linear combination of both these penalties at the expense of an additional hyperparameter for tuning the trade-off between the two. The algorithmic choice is also affected by the end-goal.
Models like decision trees or LASSO are often preferred when interpretability is desired over optimal performance, whereas high-complexity models like SVMs, random forests or neural networks are typically preferred if the goal is to maximize performance.

II. Comments on sample sizes

An important question arises: What is an appropriate sample size for training supervised learning models? Unsurprisingly, research has shown that the sample size needed for learning is dependent on the complexity of the model. Powerful non-linear algorithms typically require more training examples to be effective. In general, one would also expect that the more features in the data, the more training examples would be required to characterize their distribution. Hence, the minimum sample size for training an ML algorithm is in general a complex function of input dimensionality, complexity of the chosen model, quality of data, data heterogeneity, separability of classes etc.

Given the significant impact of sample size on classification performance, it is imperative to understand the nature of this relationship. There is significant ongoing research in answering this question using learning curves. These curves model the relationship between sample size and generalization error and can be used to predict the sample size required to train a particular classifier. Several studies have shown that learning curves can be well-characterized with an inverse power-law functional form, $E(n) \propto n^{-\beta}$, where $E$ denotes the error and $n$ denotes the sample size [118, 119]. Besides empirical justification, many studies have also provided theoretical motivations for the inverse power-law model. The parameters of the learning curve are fitted empirically for a given application domain based on prior classification studies. For traditional algorithms, learning curves are known to plateau, i.e., the performance gains are insignificant beyond a certain sample size.
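The inverse power-law form of the learning curve above can be fitted with ordinary least squares in log-log space, since log E(n) = log a − β log n. The sketch below is a minimal illustration under stated assumptions: the proportionality constant `a`, the function names, and the synthetic noise-free error values are all hypothetical, and the model assumes the error decays to zero (no irreducible error term), which real learning curves need not satisfy.

```python
import numpy as np

def fit_power_law(sample_sizes, errors):
    """Fit E(n) = a * n**(-beta) by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(sample_sizes), np.log(errors), deg=1)
    return float(np.exp(intercept)), float(-slope)

def predict_error(n, a, beta):
    """Error predicted by the fitted learning curve at sample size n."""
    return a * np.asarray(n, dtype=float) ** (-beta)

def required_sample_size(target_error, a, beta):
    """Invert E(n) = a * n**(-beta) to get the n that reaches target_error."""
    return (a / target_error) ** (1.0 / beta)

# Hypothetical, noise-free errors generated with a = 2.0 and beta = 0.5.
n_obs = np.array([50.0, 100.0, 200.0, 400.0, 800.0])
err_obs = 2.0 * n_obs ** (-0.5)

a_hat, beta_hat = fit_power_law(n_obs, err_obs)
n_needed = required_sample_size(0.05, a_hat, beta_hat)
```

With real data the errors would come from repeated train/test evaluations at increasing training-set sizes, and the fitted curve would be extrapolated only cautiously beyond the observed range.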
One significant advantage of deep learning methods is that given sufficient capacity, they scale remarkably well with more data. Given the recent surge of interest in single-subject predictions using rs-fMRI, estimating the learning curve for classification of rs-fMRI data could be invaluable for understanding sample size requirements in this domain.

Another critical issue relates to the robustness of the estimated prediction scores. Empirical studies have shown that small sample sizes, typical in neuroimaging studies, result in large error bars on the prediction accuracy. For instance, with a sample size of 100, Varoquaux et al. estimate the error in the estimated prediction accuracy of binary classification tasks to be close to 10%. With 1000 samples, this error reduces to 3%. Large confidence bounds can potentially invalidate the conclusions of studies based on a small number of samples.

One possible strategy to overcome the limitations of insufficient sample sizes is to exploit unlabelled data in a semi-supervised fashion in order to increase the effectiveness of supervised learning algorithms. Transfer learning techniques are another promising alternative for enhancing classification performance in the low-data regime. These methods exploit neural networks trained on large datasets or auxiliary tasks by fine-tuning them to a target dataset or classification task. These are relatively unexplored directions in the field of rs-fMRI analysis that hold significant potential to alleviate the sample size limitations.

III. Comments on model evaluation

Cross-validation is a model evaluation technique used to estimate the generalization error of a predictive model. A naive cross-validation strategy is holdout, wherein the data is randomly split into a training and a test set and the test score in this single run is used as an estimate of out-of-sample accuracy.
Given the limited sample sizes in most neuroimaging studies, K-fold is the dominant cross-validation choice as it utilizes all data points for both training and validation through repeated holdout, yielding error estimates with much less variance than classic holdout. It first partitions the data into $K$ non-overlapping subsets, $D = \{S_1, \ldots, S_K\}$. For each fold $i$ in $\{1, \ldots, K\}$, the model is trained on $D \setminus S_i$ and evaluated on $S_i$. The mean accuracy across all folds is then used to estimate the model performance. While $K$ can be anything, common choices include 5 or 10. When $K$ equals the number of samples in the dataset, the resampling procedure is known as leave-one-out cross-validation. This can be used with computationally inexpensive models when sample sizes are low, typically less than a hundred.

Applications of supervised learning in rs-fMRI

Studies harnessing resting-state correlations for supervised prediction tasks are evolving at an unprecedented scale. We describe some interesting applications of supervised machine learning in rs-fMRI below.

Brain development and aging

Machine learning methods have shown promise in investigating the developing connectome. In an early influential work, Dosenbach et al. [120] demonstrated the feasibility of using RSFC to predict brain maturation, as measured by chronological age, in adolescents and young adults. Using SVM, they developed a functional maturation index based on predicted brain ages. Later studies showed that brain maturity can be reasonably predicted even in diverse cohorts distributed across the human lifespan [121, 122]. These works posited rs-fMRI as a valuable tool to predict healthy neurodevelopment and exposed novel age-related dynamics of RSFC, such as major changes in FC of sensorimotor regions [122], or an increasingly distributed functional architecture with age [120].
In addition to characterizing RSFC changes accompanying natural aging, machine learning has also been used to identify atypical neurodevelopment [123].

Neurological and Psychiatric Disorders

Machine learning has been extensively deployed to investigate the diagnostic value of rs-fMRI data in various neurological and psychiatric conditions. Neurodegenerative diseases like Alzheimer’s disease [24, 107, 124], its prodromal state mild cognitive impairment [125, 126, 127, 128], Parkinson’s [129], and Amyotrophic Lateral Sclerosis (ALS) [130] have been classified by ML models with promising accuracy using functional connectivity-based biomarkers. Brain atrophy patterns in neurological disorders like Alzheimer’s or Multiple Sclerosis appear well before behavioral symptoms emerge. Thus, neuroimaging-based biomarkers derived from structural or functional abnormalities are favorable for early diagnosis and subsequent intervention to slow down the degenerative process.

The biological basis of psychiatric disorders has been elusive and the diagnosis of these disorders is currently completely driven by behavioral assessments. rs-fMRI has emerged as a powerful modality to derive imaging-based biomarkers for making diagnostic predictions of psychiatric disorders. Supervised learning algorithms using RSFC have shown promising results for classifying or predicting symptom severity in a variety of psychiatric disorders, including schizophrenia [98, 131, 132, 133], depression [23, 108, 134], autism spectrum disorder [25, 66, 111, 117], attention-deficit hyperactivity disorder [135, 136], social anxiety disorder [137], post-traumatic stress disorder [138] and obsessive compulsive disorder [139]. Several novel network disruption hypotheses have emerged for these disorders as a consequence of these studies. Most of these prediction models are based on standard kernel-based SVMs, and rely on FC between ROI pairs as discriminative features.

Cognitive abilities and personality traits

Functional connectivity can also be used to predict individual differences in cognition and behavior [140]. In comparison to task-fMRI studies which capture a single cognitive dimension, the resting state encompasses a wide repertoire of cognitive states due to its uncontrolled nature. This makes it a rich modality to capture inter-individual variability across multiple behavioral domains. ML models have been shown to predict fluid intelligence [46], sustained attention [141], memory performance [142, 143, 144], and language scores [142] from RSFC-based biomarkers in healthy and pathological populations. Recently, the utility of these models was also shown to extend to personality traits such as neuroticism, extraversion, agreeableness and openness [145, 146].

Prediction of behavioral performance is useful in a clinical context to understand how RSFC disruptions in pathology relate to impaired cognitive functioning. Meskaldji et al. [143] used regression models to predict memory impairment in MCI patients from different connectivity measures. Siegel et al. [142] assessed the behavioral significance of network disruptions in stroke patients by training ridge regression models to relate RSFC and structure with performance in multiple domains (memory, language, attention, visual and motor tasks). Among them, memory deficits were better predicted by RSFC, whereas structure was more important for predicting visual and motor impairments. This study highlights how rs-fMRI can complement structural information in studying brain-behavior relationships.

Vigilance fluctuations and sleep studies

A handful of studies have employed machine learning to predict vigilance levels during rs-fMRI scans. Since resting-state studies demand no task-processing, subjects are prone to drifting between wakefulness and sleep. Classification of vigilance states during rs-fMRI is important to remove vigilance confounds and contamination.
SVM classifiers trained on cortico-cortical RSFC have been shown to reliably detect periods of sleep within the scan [147, 148]. Tagliazucchi et al. [148] revealed loss of wakefulness in one-third of the subjects in the experimental cohort, as early as 3 minutes into the scan. This finding is notable: while resting state is assumed to capture wakefulness, this may not be entirely true even for very short scan durations. The utility of these studies should not remain limited to classification alone. Through appropriate interpretation and visualization techniques, machine learning can shed new light on the reconfiguration of functional organization as people drift into sleep.

Predicting individual differences in cognitive response after different sleep conditions (e.g., sleep deprivation) using machine learning analysis of rs-fMRI is another interesting research direction. There is significant interest in examining RSFC alterations following sleep deprivation [149, 150]. While statistical analysis has elucidated the functional reorganization characteristic of sleep deprivation, much remains to be understood about the FC patterns associated with inter-individual differences in vulnerability to sleep deprivation. Yeo et al. [151] trained an SVM classifier on functional connectivity data in the well-rested state to distinguish subjects vulnerable to vigilance decline following sleep deprivation from more resilient subjects, and revealed important network differences between the groups.

Heritability

Understanding the genetic influence on brain structure and function has been a long-standing goal in neuroscience. In a recent study, Ge et al. employed a traditional statistical framework to quantify heritability of whole-brain FC estimates [152]. Investigations into the genetic and environmental underpinnings of RSFC were also pursued within a machine learning framework. Miranda-Dominguez et al.
[153] trained an SVM classifier on individual FC signatures to distinguish sibling and twin pairs from unrelated subject pairs. The study unveiled several interesting findings. The ability to successfully predict familial relationships from resting-state fMRI indicates that aspects of functional connectivity are shaped by genetic or unique environmental factors. The fact that predictions remained accurate in young adult pairs suggests that these influences are sustained through development. Further, a higher accuracy of predicting twins compared to non-twin siblings implied that genetics (rather than environment) is likely the stronger predictive force.

Other neuroimaging modalities

Machine learning can also be used to interrogate the correspondence between rs-fMRI and other modalities. The most closely related modality is task-fMRI. Tavor et al. [154] trained multiple regression models to show that resting-state connectivity can predict task-evoked responses in the brain across several behavioral domains. The ability of rs-fMRI, which is a task-free regime, to predict the activation pattern evoked by multiple tasks suggests that resting-state can capture the rich repertoire of cognitive states that is reflected during task-based fMRI. The performance of these regression models was shown to generalize to pathological populations [155], suggesting the clinical utility of this approach to map functional regions in populations incapable of performing certain tasks.

Investigating how structural connections shape functional associations between different brain regions has been the focus of a large number of studies [156]. While neuro-computational models have shown promise toward this goal, machine learning models are particularly well-equipped to capture inter-individual differences in the structure-function relationship. Deligianni et al.
[157] proposed a structured-output multivariate regression model to predict resting-state functional connectivity from DWI-derived structural connectivity, and demonstrated the efficiency of this technique through cross-validation. Venkataraman et al. [158] introduced a novel probabilistic model to examine the relationships between anatomical connectivity measured using DWI tractography and RSFC. Their formulation assumes that the two modalities are generated from a common connectivity template. Estimated latent connectivity was shown to discriminate between control and schizophrenic populations, thereby indicating that joint modelling can also be useful in a clinical context.

Table 4: Key papers for various supervised learning application domains

Brain development and aging

Prediction of individual brain maturity using fMRI (Dosenbach et al., 2010) [120]
  Method: SVM
  Target: Age
  Contribution: Early influential work demonstrating the feasibility of using RSFC features for predicting brain maturation.

Neurological and Psychiatric Disorders

Classification of Alzheimer disease, mild cognitive impairment, and normal cognitive status with large-scale network analysis based on resting-state functional MR imaging (Chen et al., 2011) [24]
  Method: Fisher LDA
  Target: Alzheimer/MCI/controls
  Contribution: Early work highlighting the potential of RSFC to diagnose neurological disorders.

Deriving reproducible biomarkers from multi-site resting-state data: An Autism-based example (Abraham et al., 2017) [66]
  Method: Multiple
  Target: ASD/controls
  Contribution: Extensively evaluated the impact of ROI choice, connectivity metric and classifier on prediction performance in intra-site and inter-site settings.

Altered resting state complexity in schizophrenia (Bassett et al., 2012) [132]
  Method: SVM
  Target: Schizophrenia/controls
  Contribution: Demonstrated the utility of resting-state network complexity measures in distinguishing patients with schizophrenia.

Cognitive abilities and personality traits

Functional connectome fingerprinting: Identifying individuals using patterns of brain connectivity (Finn et al., 2015) [46]
  Method: Linear regression
  Target: Fluid intelligence
  Contribution: Demonstrated that RSFC can uniquely identify individuals and reliably predict fluid intelligence.

Disruptions of network connectivity predict impairment in multiple behavioral domains after stroke (Siegel et al., 2016) [142]
  Method: Ridge regression
  Target: Multiple cognitive measures
  Contribution: Demonstrated the ability of ML coupled with RSFC to predict cognitive deficits in clinical populations.

Vigilance fluctuations and sleep studies

Automatic sleep staging using fMRI functional connectivity data (Tagliazucchi et al., 2012) [147]
Decoding wakefulness levels from typical fMRI resting-state data reveals reliable drifts between wakefulness and sleep (Tagliazucchi et al., 2014) [148]
  Method: SVM
  Target: NREM sleep stages/wakefulness
  Contribution: Demonstrated the ability of ML to detect sleep stages in resting-state.

Heritability

Heritability of the human connectome: A connectotyping study (Miranda-Dominguez et al., 2018) [153]
  Method: SVM
  Target: Twins/sibling/unrelated
  Contribution: Provided evidence for relationship between genetics and RSFC through predictive modelling.

Other neuroimaging modalities

Task-free MRI predicts individual differences in brain activity during task performance (Tavor et al., 2016) [154]
  Method: Multiple regression models
  Target: Task-activation map
  Contribution: Demonstrated that resting-state can capture the rich repertoire of cognitive states expressed during different behavioral tasks.

2.1.5 Discussion

Practical advice for machine learning practitioners

Any machine learning application requires the following: (a) a model that reflects assumed relationships between measurements and other inductive biases, (b) a cost function to quantify how well the model captures our data and finally, (c) an appropriate optimization algorithm to minimize the cost. Successful application of machine learning to rs-fMRI requires a holistic perspective of how these algorithms work, what it means when they fail and most importantly, how to choose an algorithm for a given task or hypothesis. There are three crucial factors that could dictate this choice:

1. What is the research question? What is our prior belief? Unsupervised learning tackles questions about the data-generating process. For example, clustering and decomposition approaches have both been widely used for disentangling the underlying causal sources of rs-fMRI data. However, they represent different prior beliefs and often answer distinct research questions. For example, in the context of discovering RSNs, ICA assumes that the latent components are independent and seeks to recover spatial loci of sources of activation. This decomposition further enables separation of functional activity from noise sources. On the other hand, clustering generally assumes that the activation of each spatial location/region can be explained by exactly one underlying component from a set of clusters.
Because this approach results in disjoint functional networks, clustering is the dominant approach for learning spatially contiguous whole-brain parcellations. When the goal is to make predictions, supervised learning algorithms are the usual choice. The choice of a supervised model again depends on the research question: Is the goal to understand the relationship between labels and features or to build a diagnostic tool? Interpretability is key for the former application whereas highest accuracy can be construed as the primary goal for the latter. Model complexity must thus be chosen in accordance with this end-goal. We recommend that these goals be well-defined before model development.

2. How much data is needed? It is important to assess the quantity of data and whether or not it is feasible to acquire more data. Sample sizes can constrain model complexity. More training examples are required to capture a non-linear relationship between features and labels than a linear relationship. Data fidelity and regularization must also be weighed in accordance with the sample size. With small sample sizes, regularization becomes even more critical as the model is more likely to overfit on training samples.

3. What is the computational budget? Sometimes, the computational budget can be restrictive. For example, certain algorithms, like deep neural networks, have a high computational demand that may not be sustained by available resources. Further, if the number of features is very large, training even low-complexity models can be time consuming. In such cases, models with lower run-time complexity can take precedence, especially for early investigations. Time, computational budget or space constraints thus must be identified while choosing an appropriate model.

Limitations and opportunities

Many state-of-the-art techniques for rs-fMRI analysis are rooted in machine learning.
Both unsupervised and supervised learning methods have substantially expanded the application domains of rs-fMRI. With large-scale compilation of neuroimaging data and progress in learning algorithms, an even greater influence is expected in the future. Despite the practical successes of machine learning, it is important to understand the challenges encountered in its current application to rs-fMRI. We outline some important limitations and unexplored opportunities below.

One of the biggest challenges associated with unsupervised learning methods is that there is no ground truth for evaluation. There is no a priori universal functional map of the brain on which to base comparisons between parcellation schemes. Further, whole-brain parcellations are often defined at different scales of functional organization, ranging from a few large-scale parcels to several hundreds of regions, making comparisons even more challenging. Although several evaluation criteria have been developed that account for this variability, no single learning algorithm has emerged as consistently superior on all of them. Due to the trade-offs among diverse approaches, the choice of which parcellation to use as reference for network analysis is thus largely subjective.

Unsupervised learning approaches for exploring network dynamics are similarly prone to subjectivity. Characterizing dynamic functional connectivity through discrete mental states is difficult, primarily because the repertoire of mental states is possibly infinite. While dFC states are thought to reflect different cognitive processes, it is challenging to obtain a behavioral correspondence for distinct states since resting-state is not externally probed. This again makes interpretations hard and prone to subjective bias. Machine learning approaches in this direction have thus far relied on cluster statistics to fix the number of FC states. Non-parametric models (e.g.
infinite HMMs) provide an unexplored, attractive framework as they adaptively determine the number of states based on the underlying data complexity.

A significant challenge in single-subject prediction using rs-fMRI is posed by the fact that rs-fMRI features can be described in multiple ways. There is no recognized gold-standard atlas for time-series extraction, nor is there a consensus on the optimal connectivity metric. Further, even the fMRI preprocessing strategies can vary considerably. Exploration across this space is cumbersome, especially for advanced machine learning models like neural networks that are slow to train. An ideal system should be invariant to these choices. However, this is hardly the case for rs-fMRI, where large deviations in prediction performance have been reported in relation to these factors [66].

Another challenge in training robust prediction systems on large populations stems from the heterogeneity of multi-site rs-fMRI data. Resting-state is easier to standardize across sites compared to task-based protocols since it does not rely on external stimuli. However, differences in acquisition protocols and scanner characteristics across sites still constitute a significant source of heterogeneity. Multi-site studies have shown little to no improvement in prediction accuracy compared to single-site studies, despite the larger sample sizes [25, 159]. While it is possible to normalize out site effects from data, more advanced tools are needed in practice to mitigate this bias.

High diagnostic accuracies achieved by supervised learning methods should be interpreted with caution. Several confounding variables can induce systematic biases in estimates of functional connectivity. For example, head motion is known to affect connectivity patterns in the default mode network and frontoparietal control network [160].
Further, motion profiles also vary systematically between subgroups of interest, e.g., patients often move more than healthy controls. Apart from generating spurious associations, this could affect the interpretability of supervised prediction studies. Independent statistical analysis is critical to rule out the effect of confounding variables on predictions, especially when these variables differ across the groups being explored.

Methodological innovations are needed to improve prediction accuracy to levels suitable for clinical translation. Several factors make comparison of methods across studies tedious. Cross-validation is the most commonly employed strategy for reporting performance of ML models. However, small sample sizes (common in rs-fMRI studies) have been shown to yield large error bars [161], indicating that data-splits can significantly impact performance.

Generalizability and interpretability should remain the key focus while developing predictive models on rs-fMRI data. These are critical attributes to achieve clinical translation of machine learning models. Uncertainty estimation is another challenge in any application of supervised learning; ideally, class assignments by any classification algorithm should be accompanied by an additional measure that reflects the uncertainty in predictions. This is especially important for clinical diagnosis, where it is important to know a reliability measure for individual predictions.

Most existing studies focus on classifying a single disease versus controls. The ability of a diagnostic system to discriminate between multiple psychiatric disorders is much more useful in a clinical setting [162]. Hence, there is a need to assess the efficacy of ML models for differential diagnosis. Integrating rs-fMRI with complementary modalities like diffusion-weighted MRI can possibly yield even better neurophenotypes of disease, and is another challenging yet promising research proposition.
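To make the earlier remark about large error bars at small sample sizes concrete, the following sketch computes the half-width of an approximate 95% confidence interval on a measured accuracy. It assumes a simple binomial model with a normal approximation, where each held-out prediction is an independent trial; this model and the function name `accuracy_ci_half_width` are assumptions made here for illustration, not the analysis used in the cited studies.

```python
import math

def accuracy_ci_half_width(accuracy, n_test, z=1.96):
    """Half-width of a normal-approximation binomial confidence interval.

    accuracy: observed fraction of correct predictions (between 0 and 1)
    n_test:   number of independent held-out samples
    z:        normal quantile (1.96 corresponds to roughly 95% coverage)
    """
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_test)

# Worst case for a binary task: accuracy near chance level (0.5).
hw_100 = accuracy_ci_half_width(0.5, 100)    # about 0.098
hw_1000 = accuracy_ci_half_width(0.5, 1000)  # about 0.031
```

Under these assumptions, an accuracy measured on 100 samples carries an uncertainty of roughly ±10 percentage points, shrinking to about ±3 points with 1000 samples. Cross-validated scores are not strictly independent across folds, so this calculation is better read as an order-of-magnitude guide than as an exact interval.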
2.1.6 Conclusions

We have presented a comprehensive overview of the current state-of-the-art of machine learning in rs-fMRI analysis. We have organized the vast literature on this topic based upon applications and techniques separately to enable researchers from both neuroimaging and machine learning communities to identify gaps in current practice.

Table 5: Key related review papers in the field

Multi-subject Independent Component Analysis of fMRI: A Decade of Intrinsic Networks, Default Mode, and Neurodiagnostic Discovery (Calhoun et al., 2009) [163]
    A focused review of group ICA discussing methodologies, discovery of RSNs and their diagnostic potential

Imaging-based parcellations of the human brain (Eickhoff et al., 2018) [164]
    A detailed exploration into approaches for deriving imaging-based parcellations and lurking challenges in the field

Dynamic functional connectivity: Promise, issues, and interpretations (Hutchison et al., 2013) [165]
    An early review on findings, methods and interpretations of dynamic functional connectivity

The Chronnectome: Time-Varying Connectivity Networks as the Next Frontier in fMRI Data Discovery (Calhoun et al., 2014) [166]
    A detailed review of methods for dynamic functional connectivity analysis with a focus on decomposition techniques

The dynamic functional connectome: State-of-the-art and perspectives (Preti et al., 2017) [167]
    A comprehensive review of analytical approaches for dynamic functional connectivity analysis and future perspectives

On the nature of resting fMRI and time-varying functional connectivity (Lurie et al., 2018) [168]
    A discussion of diverse perspectives on time-varying connectivity in rs-fMRI

Clinical Applications of Resting State Functional Connectivity (Fox et al., 2010) [169]
    An early short review focused on clinical applications of rs-fMRI

Single Subject Prediction of Brain Disorders in Neuroimaging: Promises and Pitfalls (Arbabshirani et al., 2017) [170]
    Extensive survey of studies on single subject prediction of brain disorders, including opinions on promises/limitations

Figure 2.4: Illustrations of popular clustering algorithms: K-means clustering partitions the data space into Voronoi cells, where each observation is assigned to the cluster with the nearest centroid (marked red in the figure). GMMs assume that each cluster is sampled from a multivariate Gaussian distribution and estimate these probability densities to generate probabilistic assignments of observations to different clusters. Hierarchical (agglomerative) clustering generates nested partitions, where partitions are merged iteratively based on a linkage criterion. Graph-based clustering partitions the graph representation of data so that, for example, the number of edges connecting distinct clusters is minimal.

Figure 2.5: Schematic of application 2.1.3: In decomposition, the original fMRI data is expressed as a linear combination of spatial patterns and their associated time series - in ICA, the independence of spatial maps is optimized whereas in sparse dictionary learning, the sparsity of maps is encouraged. In clustering, time series or connectivity fingerprints of voxels are clustered to assign voxels to distinct functional networks.

Figure 2.6: Schematic of application 2.1.3. Three connectivity states are assumed in the data for illustration purposes.

Figure 2.7: Schematic of application 2.1.3. Dimensionality reduction of high-dimensional connectomes into 3 latent components is shown for illustration.
Figure 2.8: A common classification/regression pipeline for connectomes

Figure 2.9: A summary of design choices for supervised learning with rs-fMRI

Figure 2.10: A taxonomy of supervised learning methods used for rs-fMRI analysis

CHAPTER 3
LINKING RESTING-STATE BRAIN ACTIVITY AND MENTAL DISORDERS WITH MACHINE LEARNING

3.1 Ensemble learning with 3D convolutional neural networks for functional connectome-based prediction

Abstract

The specificity and sensitivity of resting state functional MRI (rs-fMRI) measurements depend on preprocessing choices, such as the parcellation scheme used to define regions of interest (ROIs). In this study, we critically evaluate the effect of brain parcellations on machine learning models applied to rs-fMRI data. Our experiments reveal an intriguing trend: On average, models with stochastic parcellations consistently perform as well as models with widely used atlases at the same spatial scale. We thus propose an ensemble learning strategy to combine the predictions from models trained on connectivity data extracted using different (e.g., stochastic) parcellations. We further present an implementation of our ensemble learning strategy with a novel 3D Convolutional Neural Network (CNN) approach. The proposed CNN approach takes advantage of the full-resolution 3D spatial structure of rs-fMRI data and fits non-linear predictive models. Our ensemble CNN framework overcomes the limitations of traditional machine learning models for connectomes that often rely on region-based summary statistics and/or linear models. We showcase our approach on a classification (autism patients versus healthy controls) and a regression problem (prediction of subject’s age), and report promising results.

3.1.1 Introduction

Functional connectivity, as often captured by correlations in resting state functional MRI (rs-fMRI) data, has produced novel insights linking differences in brain organization to individual or group-level characteristics.
Recently, machine learning models have been increasingly applied to study and exploit individual variation in functional connectivity data [171, 172, 173]. These models often employ hand-engineered features, such as pairwise correlations between regions of interest (ROIs) and network topological measures of clustering, modularity, small-worldness, integration, or segregation [174, 175, 176]. The ROIs are usually computed based on a pre-defined atlas or a parcellation scheme. The choice of the ROIs can have a significant impact on downstream analyses [102, 104, 177].

Brain ROIs can be defined based on macro-anatomical features, cytoarchitecture, functional activations, and/or connectivity patterns [178, 179, 180, 181]. A common approach is to derive the ROIs either based on input from experts and/or using a data-driven strategy on a small number of subjects. Expert-defined ROIs are challenging to standardize across studies [182] and often rely on arbitrary decisions. Data-driven ROIs, on the other hand, can be biased by the selection of the subjects, especially for regions that exhibit large variability across the population. Popular data-driven techniques include clustering, dictionary learning and Independent Component Analysis (ICA) [102, 183, 184]. Such methods can be sensitive to confounds such as motion, while initialization, optimization, and other algorithmic choices can also significantly influence the results [185]. A parcellation scheme not only defines the boundaries of ROIs, but also restricts the analysis to a certain spatial scale. Abraham et al. [186] showed that among various preprocessing decisions, the choice of region definition has the greatest impact on predictive accuracy, with data-driven extraction based on dictionary learning outperforming ICA/clustering and other reference atlases.

Figure 3.1: A general illustration of the proposed approach
Given the arbitrary nature of a chosen parcellation scheme and its impact on predictive models, we hypothesized that machine learning models can benefit markedly from an ensemble strategy that integrates across different scales and ROI definitions. Figure 3.1 shows a general schematic of our proposed framework. In this work, we conducted a thorough empirical evaluation of different choices for brain parcellations.

Another important factor in connectome-based machine learning pertains to the choice of the classification algorithm. A large body of related work in the literature has focused on simple linear predictive models using vectorized connectivity data. A relatively recent trend is to exploit neural networks for graph-structured data, such as Graph Convolution Networks or BrainNetCNN, to make individual-level predictions on connectomes. Ktena et al. [187] applied spectral graph convolutions in a distance-metric learning framework to train a k-nearest neighbor classifier on connectivity data. In a similar vein, Kawahara et al. [188] proposed the BrainNetCNN architecture that extends convolutional neural networks (CNNs) to handle graph-structured data. CNNs are motivated via the translation-invariance property of image-based classification problems and can exploit voxel/pixel resolution data. On the other hand, BrainNetCNN works directly with an adjacency matrix derived from the connectome data, while disregarding spatial information. The model parameter count would scale according to the number of ROIs, making the utilization of voxel-level connectivity infeasible with this approach. As we discuss below, we propose an alternative representation of connectivity data, which allows us to leverage modern deep learning architectures, like CNNs, to build a prediction model that exploits the full-resolution 3D spatial structure of rs-fMRI without having to learn too many model parameters.
In this work, we consider two applications: discrimination of autism patients and healthy controls; and regression of age. The first problem is a particularly challenging one. Several previous studies have reported altered functional connectivity patterns in Autism Spectrum Disorder (ASD) patients [189, 190, 191, 192]. While studies using small samples have reported classification accuracies over 75% [193], application of similar models on large heterogeneous datasets, such as ABIDE [194], has shown more modest performance levels over a wide range of connectome preprocessing schemes (accuracies ranging from 60% to 67%) [186].

Our main contributions in this paper are:

• An extensive evaluation of the influence of brain parcellations on functional connectome-based machine learning models
• An ensemble learning strategy for combining predictions from multiple classifiers corresponding to different brain parcellations
• An easy-to-implement 3D CNN framework for connectome-based classification

3.1.2 Materials and Methods

Dataset

The Autism Brain Imaging Data Exchange (ABIDE) is a multi-site consortium aggregating and openly sharing anatomical, functional MRI and phenotypic datasets of individuals diagnosed with ASD, as well as healthy controls (HC) [194]. The first phase of ABIDE (ABIDE-I) collected data from 1,112 individuals, comprising 539 individuals diagnosed with ASD and 573 typical controls across 17 sites. The second phase (ABIDE-II) aggregated 1,114 additional datasets, comprising 521 individuals with ASD and 593 healthy controls across 19 sites.

Preprocessing of fMRI Data

The Preprocessed Connectomes Project (PCP) released preprocessed versions of ABIDE-I using several pipelines [195]. We used the data processed through the Configurable Pipeline for the Analysis of Connectomes (CPAC).
This pipeline performs motion correction, global mean intensity normalization and standardization of functional data to MNI space (3x3x3 mm resolution) before the extraction of ROI time series. Among the different strategies in the release, our analysis used data de-noised by regression of nuisance signals including motion parameters, CompCor WM+CSF components, and global signal, followed by band-pass filtering (0.01-0.1 Hz). We note that we have experimented with alternate preprocessing strategies that include/exclude the global signal regression and CompCor steps. These results are presented in the Supplementary Section 7.6.

We preprocessed the ABIDE-II dataset following the same sequence of steps listed for ABIDE-I in CPAC (using the version v1.0.2a). Since manual quality control (QC) was not yet available for ABIDE-II, we performed an automatic QC by selecting those subjects that retained at least 100 frames or 4 minutes of fMRI scans after motion scrubbing [196]. Motion scrubbing was performed based on Framewise Displacement (FD), discarding one volume before and two volumes after the frame with FD exceeding 0.5 mm [197].

Cohort selection

In our experiments, we used ABIDE-I subject data that passed manual QC by all the functional raters. This yielded a final sample size of 774 ABIDE-I subjects, comprising 379 subjects with ASD and 395 typical controls. As an independent test dataset, we employed ABIDE-II subjects from sites that participated in ABIDE-I and used the same MRI sequence parameters for data collection. After automatic QC, we ended up with a final ABIDE-II sample size of 163 individuals with ASD and 230 healthy controls. For age prediction, we only considered healthy controls. Furthermore, subjects whose age was more than 3.5 standard deviations away from the median were excluded from the task of age prediction. Table 3.1 summarizes the dataset characteristics for the two prediction tasks considered in this study.
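The scrubbing rule and the automatic QC criterion described above can be illustrated with a short sketch. This is not the CPAC implementation: the function names are hypothetical, the sketch assumes that the high-motion frame itself is discarded along with its neighbours, and it reads "at least 100 frames or 4 minutes" as passing when either condition holds.

```python
import numpy as np

def scrub_mask(fd, fd_thresh=0.5, n_before=1, n_after=2):
    """Return a boolean mask of the frames to keep after motion scrubbing.

    fd is a 1D array of framewise displacement values (mm), one per volume.
    Every frame with FD > fd_thresh is dropped (assumption), together with
    n_before preceding and n_after following frames.
    """
    fd = np.asarray(fd, dtype=float)
    keep = np.ones(fd.shape[0], dtype=bool)
    for t in np.flatnonzero(fd > fd_thresh):
        lo = max(t - n_before, 0)
        hi = min(t + n_after, fd.shape[0] - 1)
        keep[lo:hi + 1] = False
    return keep

def passes_qc(fd, tr_sec, min_frames=100, min_minutes=4.0):
    """Automatic QC: enough frames OR enough scan time left after scrubbing."""
    n_kept = int(scrub_mask(fd).sum())
    return n_kept >= min_frames or n_kept * tr_sec / 60.0 >= min_minutes
```

The repetition time `tr_sec` is needed only for the duration criterion and has to be supplied per site, since acquisition parameters differ across ABIDE sites.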
Dataset    Prediction   Sample Size   Median Age (Range) in yrs
ABIDE-I    Age          387           13.8 (6.5-29.1)
ABIDE-I    ASD/HC       379/395       13.9 (6.5-56.2)
ABIDE-II   Age          213           10.6 (5.8-18.8)
ABIDE-II   ASD/HC       163/230       11.0 (5.2-38.9)

Table 3.1: Composition of Cohorts

Extracting ROI time series from atlases

In our experiments, we considered all atlases that were used for ROI time series extraction in PCP. These include the following seven atlases: Talairach and Tournoux (TT, R=97), Harvard-Oxford (HO, R=111), Automated Anatomical Labelling (AAL, R=116), Eickhoff-Zilles (EZ, R=116), Dosenbach 160 (DOS160, R=161), Craddock 200 (CC200, R=200), and Craddock 400 (CC400, R=392), where R is the number of ROIs [198, 199, 200, 201, 202, 203, 204, 205, 206]. For our 3D CNN model, described below, the parcellated regions were used as target ROIs to derive the input connectivity features at the voxel level. For the non-CNN benchmark models, also described below, each atlas was used to define a corresponding connectivity matrix which was fed as input to each model after collapsing into a vector. We report results for ensemble learning strategies as well, where we combined the predictions of models corresponding to individual atlases.

Creating stochastic parcellations

Stochastic parcellations were created by Poisson Disk Sampling using the method described in [207]. Given a number of ROIs, this approach divides the gray matter voxels (as defined by a given mask) into roughly equal-sized parcels while ensuring that the parcels do not cross hemisphere boundaries. Stochasticity is introduced in the ROI center locations, and all the remaining voxels are assigned to the closest region center. These centers are kept a minimum distance apart based on the desired number of regions in the parcellation. Further details about the sampling approach are provided in Supplementary Section A.2.
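To make the two steps above concrete (random centers kept a minimum distance apart, then nearest-center assignment within each hemisphere), here is a minimal sketch. It is not the method of [207]: it uses simple dart throwing over a random voxel ordering, takes the minimum distance `min_dist` directly as an input rather than deriving it from a target number of ROIs, and enforces the distance constraint only within a hemisphere. The function name and arguments are hypothetical.

```python
import numpy as np

def stochastic_parcellation(coords, hemi, min_dist, seed=0):
    """Sketch of a random parcellation of gray matter voxels.

    coords: (V, 3) voxel coordinates in mm; hemi: (V,) hemisphere label per voxel.
    Returns (labels, centers): a parcel index per voxel and the voxel index
    of each parcel center.
    """
    coords = np.asarray(coords, dtype=float)
    hemi = np.asarray(hemi)
    rng = np.random.default_rng(seed)
    labels = np.full(coords.shape[0], -1)
    centers = []
    for h in np.unique(hemi):
        idx = np.flatnonzero(hemi == h)
        chosen = []
        # Dart throwing: visit voxels in random order and accept a voxel as a
        # center only if it is at least min_dist from all accepted centers.
        for v in rng.permutation(idx):
            if all(np.linalg.norm(coords[v] - coords[c]) >= min_dist for c in chosen):
                chosen.append(v)
        # Assign every voxel of this hemisphere to its closest center, so
        # parcels never cross the hemisphere boundary.
        d = np.linalg.norm(coords[idx][:, None, :] - coords[chosen][None, :, :], axis=2)
        labels[idx] = len(centers) + d.argmin(axis=1)
        centers.extend(chosen)
    return labels, np.array(centers)
```

Calling the function with different seeds yields different parcellations at roughly the same spatial scale, which is the property the ensemble strategy relies on.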
All parcellations were created in the MNI152 template at a 3 mm resolution, the same as the resolution of the preprocessed functional data. For creating these parcellations, we relied on a whole brain gray matter mask including sub-cortical structures. To create the mask, we took the union of the gray matter tissue prior provided in the standard MNI152 template and the cortical mantle mask used in [184]. Some example stochastic parcellations are shown in Figure 3.2 against atlases at similar resolutions.

Figure 3.2: ROI masks for example SPs and atlases at each of the four spatial scales considered in this study.

3D Convolutional Neural Network Approach

Here, we present our novel strategy to adopt a 3D CNN architecture for use with connectomic data. Loosely reminiscent of the biological visual system, CNNs use spatially localized filters to detect local image features. Unlike fully connected layers where every unit is connected to all other units of the previous layer, convolutional layers employ a structured arrangement where each unit is connected to only a small subset of spatially connected units in the input image channels. Further, the weights of these connections are shared between the units of the convolutional layer so that the same feature can be detected regardless of its spatial location.

Mathematically, a convolutional layer of the form $Y = O_W(X)$ operates on an M-dimensional input $X(v) = (X_1(v), \ldots, X_M(v))$ by applying a set of filters $W = \{w_{m,n}\}$, $m = 1, \ldots, M$; $n = 1, \ldots, N$. Here, $v$ is used to index the pixel or voxel (in case of 3D convolution). After applying an elementwise non-linearity $\varphi$ (such as a logistic function), this produces an N-dimensional output $Y(v) = (Y_1(v), \ldots, Y_N(v))$. Each element $Y_n(v)$, known as a feature map, is thus given as,

$$Y_n(v) = \varphi\left( \sum_{m=1}^{M} (X_m * w_{m,n})(v) \right), \qquad (3.1)$$

where $*$ denotes the standard spatial convolution operation. The convolutional layers in CNNs are often interspersed with pooling layers that reduce the size of feature maps and offer translation invariance. Max-pooling is the most popular pooling operation. It down-samples each input feature map (commonly referred to as a channel) separately by selecting the maximum feature response in pre-fixed local neighborhoods. A max-pooling operation $Y_i = P(X_i)$ on channel $i$ is thus defined as $Y_i(v) = \max\{X_i(\bar{v}) : \bar{v} \text{ in the neighborhood of } v\}$. In 3D, for example, the neighborhood can be a 3 x 3 x 3 cube around each voxel.

The convolutional and max-pooling layers form the backbone of a CNN. A CNN architecture is constructed by combining multiple layers that successively learn more complex features from the input images. For example, with L layers the output can be mathematically expressed as $(O_{W^{(L)}} \circ \cdots \circ P \circ O_{W^{(1)}})(X)$. Since we are considering an image classification problem, we add fully connected layers to the flattened output at the end of a CNN. Research in visual recognition has shown that fully connected feedforward architectures do not scale well to full images. Instead, neural network architectures with local connectivity, such as CNNs, are much more suitable when dealing with high-dimensional images. The shared weights of the CNN architecture facilitate learning with fewer parameters.

3D convolutional layers thus transform an input 4D (3D multi-channel) volume to an output 4D volume. Each layer learns a set of spatial filters that activate in response to distinct visual patterns. Replicating or convolving each filter across the volume allows the corresponding pattern to be detected irrespective of its spatial location. Finally, the outputs from all filters are stacked along the 4th dimension to create a 4D feature map.
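As an illustration of Equation 3.1 and the max-pooling operator, the following NumPy sketch evaluates one convolutional layer and one pooling step on a small multi-channel volume. It is meant only to make the notation concrete, not to reproduce the trained architecture: it assumes "valid" convolution without padding or bias, cubic filters, and non-overlapping pooling windows, and the function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def logistic(z):
    """Elementwise logistic non-linearity (one possible choice of phi)."""
    return 1.0 / (1.0 + np.exp(-z))

def conv3d_layer(x, w, phi=logistic):
    """Evaluate Y_n = phi(sum_m X_m * w_{m,n}) for all output channels n.

    x: input of shape (M, D, H, W); w: filters of shape (M, N, k, k, k).
    Returns an array of shape (N, D-k+1, H-k+1, W-k+1).
    """
    k = w.shape[2:]
    # patches[m, d, h, w, i, j, l] = x[m, d+i, h+j, w+l]
    patches = sliding_window_view(x, k, axis=(1, 2, 3))
    # Spatial convolution (as opposed to cross-correlation) flips the kernel.
    w_flipped = w[:, :, ::-1, ::-1, ::-1]
    z = np.einsum("mdhwijk,mnijk->ndhw", patches, w_flipped)
    return phi(z)

def max_pool3d(x, p=2):
    """Max-pooling over non-overlapping p x p x p blocks, per channel."""
    c, d, h, wd = x.shape
    x = x[:, : d - d % p, : h - h % p, : wd - wd % p]  # drop incomplete blocks
    x = x.reshape(c, d // p, p, h // p, p, wd // p, p)
    return x.max(axis=(2, 4, 6))
```

Note that the number of weights in `w` depends on M, N and the filter size but not on the volume dimensions, which is the weight-sharing property discussed above.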
Multiple convolutional layers coupled with pooling operations create global representations from local patterns. Stacking fully connected layers at the end after convolutional and down-sampling operations dramatically reduces the model parameter count for classification.

In our proposed approach, the input to the CNN is formed by concatenating voxel-level maps of “connectivity fingerprints”, which are represented as a multi-channel 3D volume. Each channel is a connectivity feature, such as the Pearson correlation between each voxel’s time series and the average signal within a target ROI. In our implementation, we use both atlas-based and stochastic brain parcellation schemes to define target ROIs. The total number of input channels thus represents the number of ROIs used for creating voxel-level fingerprints. For each parcellation scheme (atlas-based or stochastic), we trained a separate model.

In our experiments, we employed a simple CNN architecture, illustrated in Figure 3.3. Our architecture has several convolutional layers, interspersed with max-pooling based down-sampling layers, followed by a couple of densely connected layers. The models were trained with a mini-batch size of 64, until convergence of validation loss. For classification, we used binary cross-entropy, whereas for regression we adopted mean squared difference as the loss function. The neural network weights were optimized via stochastic gradient descent (SGD) for classification and Adam for regression. The learning rate and momentum for SGD were set to 0.001 and 0.9 respectively. The learning rate for Adam was set to 0.0005. For age regression, we employed a stochastic weight averaging strategy where we averaged the neural network weights over the last 20 epochs. The same architecture and settings were used for all atlases and stochastic parcellations. We note that each atlas is defined on a unique gray matter mask.
To ensure that all prediction models (benchmark and proposed) relied on information from the same voxels, the atlas-specific gray matter mask was applied to the voxel-level connectivity fingerprint data before feeding into the proposed convolutional architecture. For stochastic parcellations, the custom gray matter mask as described above was used for masking the fingerprints. The code and stochastic parcellations have been made available at: https://github.com/mk2299/Ensemble3DCNN_connectomes.

Figure 3.3: Proposed CNN approach. All operations are in 3D volume. 2D correlation maps are shown for illustration only. For the age prediction task, an additional Max-Pooling and Batch-Normalization [208] operation followed the first and second convolutional layers.

Benchmark Methods

In our experiments, we implemented the following benchmark methods.

Ridge Regression A linear regression model was trained with squared loss and α times the squared norm of the weight vector (See Appendix). For classification, the ground truth labels were encoded as ±1 for the two output categories. We tested 10 linearly spaced values for the hyper-parameter α in the range [0.1, 10] and report results for the value with the highest cross-validation accuracy.

Support Vector Machine We implemented a standard SVM as a benchmark (See Appendix). We found that a radial basis function (RBF) kernel performed better than a linear model. Thus we report results for the RBF-kernel SVM. The two hyper-parameters (RBF kernel width γ and misclassification cost weight C) were fine-tuned by maximizing cross-validation accuracy via a grid search. For regression, we implemented the standard SVR scheme with an ε-insensitive loss function, optimizing for the ε-tube and penalty parameter of the error term via grid search.
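The ridge benchmark described above, with ±1 label encoding and a grid of 10 linearly spaced α values in [0.1, 10], can be sketched as follows. This is an illustrative reconstruction, not the code used for the reported results: it assumes a closed-form solution without an intercept term, a fixed random 10-fold split, and hypothetical function names. For vectorized connectomes the number of features is large, so a dual-form or library solver would be preferable in practice.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Minimize ||Xw - y||^2 + alpha * ||w||^2 (no intercept in this sketch)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def cv_accuracy(X, y, alpha, n_folds=10, seed=0):
    """k-fold cross-validation accuracy for labels encoded as +1 / -1."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    correct = 0
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        w = ridge_fit(X[train], y[train], alpha)
        pred = np.where(X[test_idx] @ w >= 0, 1, -1)  # threshold at zero
        correct += int((pred == y[test_idx]).sum())
    return correct / len(y)

# Grid from the text: 10 linearly spaced values in [0.1, 10].
alphas = np.linspace(0.1, 10, 10)

def select_alpha(X, y):
    """Return the alpha with the highest cross-validation accuracy and its score."""
    scores = [cv_accuracy(X, y, a) for a in alphas]
    best = int(np.argmax(scores))
    return alphas[best], scores[best]
```

Here `X` would hold one vectorized connectivity matrix per subject and `y` the corresponding ±1 diagnostic labels.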
Fully Connected Architecture The fully-connected neural network (FCN) architecture takes as input functional connectivity estimates between pairs of ROIs, which are vectorized and processed by a feed-forward network. We implemented the following architecture, which performed best on ABIDE-I cross-validation: 4 fully connected hidden layers, with 800, 500, 100 and 20 features respectively, and each linear layer followed by an elementwise Exponential Linear Unit (ELU) activation. The dropout rate was set to 0.2 and dropout was applied to each layer during training. For classification, the output node was a sigmoid, and cross-entropy loss was used. For age prediction, the sigmoidal output was replaced with a linear activation and mean squared difference was used as the loss function. The models were trained with a mini-batch size of 64, until convergence of validation loss. SGD was used as the optimizer with learning rate and momentum set to 0.01 and 0.9 respectively for classification. For age prediction, a smaller learning rate of 0.001 was used.

BrainNet Convolutional Neural Networks BrainNetCNN, originally proposed in [188], utilizes specialized kernels to handle connectomic data. Their work described novel edge-to-edge, edge-to-node and node-to-graph convolutional layers that can potentially capture topological relationships between network edges. For BrainNetCNN, we implemented the following architecture that worked best on ABIDE-I cross-validation: 1 edge-to-node layer with 256 filters, followed by a node-to-graph layer with 128 output nodes and finally a dense layer with a single output. A leaky ReLU non-linearity with alpha equal to 0.33 was applied to the output of each layer except the last layer. The activation of the last layer was set to linear and sigmoid for the regression and classification tasks, respectively. Dropout regularization with rate 0.2 was used for the edge-to-node layer.
Similar to [188], Euclidean loss was minimized for age regression, whereas cross-entropy loss was used to optimize the classification models. The models were trained for 1000 iterations using SGD with momentum equal to 0.9. The learning rate was set to 0.0005 for age prediction and 0.008 for ASD/Healthy classification. The training curves were monitored for atlases to ensure convergence.

Ensemble Learning

In our experiments, we explored two ensemble learning strategies. The first one is what we call multi-atlas ensemble (or MA-Ensemble). MA-Ensemble averages the predictions of the models of a specific method (e.g., BrainNetCNN) computed using each one of the seven atlases. For classification, the final prediction is computed as the majority vote of the individual binary class predictions. For regression, the ensemble prediction is simply the mean. The second ensemble strategy (SP-Ensemble) averages across the models of a specific method computed using stochastic parcellations. In our experiments, unless stated otherwise, we used 30 stochastic parcellations at each of the following four spatial scales: 110, 160, 200 and 400 ROIs. These scales were chosen in accordance with existing atlases. Thus the SP-Ensemble’s prediction was computed based on fusing 120 (30 × 4 scales) models. We also implemented single-scale SP-Ensemble models, which averaged over the 30 parcellations at the same spatial scale.

Visualizing the CNN model

In order to understand the connectivity features captured by the CNN model, we employed the saliency map approach of [209].
This visualization technique computes the gradient of the output prediction with respect to the input image voxel values, i.e., the 3D volume, using a single backward pass through the trained neural network. We then computed voxel-level saliency as the maximum absolute gradient value across all input channels corresponding to different target ROIs. More formally, consider an input image $I$, representing the connectivity fingerprints of $V$ voxels with $R$ ROI signals. The saliency weights $w \in \mathbb{R}^{V \times R}$ are computed by taking the absolute value of the gradient of neural network output $O$ with respect to the input image, i.e., $w = \left| \partial O / \partial I \right|$. In order to obtain the saliency at the voxel level $S \in \mathbb{R}^{V}$, we take the maximum across all the ROIs, i.e., $S_i = \max_{1 \le j \le R} w_{ij}$. Finally, to visualize an ensemble model, we averaged the individual saliency maps that made up the ensemble.

ASD/HC Classification Accuracy (ABIDE-II)

Parcellation   Ridge   SVM    FCN    BrainNet   3D-CNN
HO             63.3    68.7   67.7   66.1       67.7
CC200          67.4    70.7   71.5   70.2       72.8
EZ             63.3    66.1   63.8   64.4       66.4
TT             66.1    67.4   65.9   67.4       70.0
CC400          69.4    68.2   69.9   71.5       70.5
AAL            63.3    65.9   65.4   64.6       69.5
DOS160         66.7    63.6   66.1   64.6       67.0
MA-Ensemble    69.7    70.0   69.9   70.7       71.7
SP-Ensemble    71.7    71.2   71.2   70.5       72.3

Table 3.2: Classification accuracy for ASD vs. Control: Independent test on ABIDE-II of baseline models and proposed CNN approach. For each row, best results are bolded. For each column, best results are italicized. Green indicates better performance, whereas orange/red highlights worse performance.

3.1.3 Results

Experiments

In our experiments, we considered two tasks: i) binary classification of autism vs healthy, and ii) age prediction.
For each task, we implemented two evaluation schemes. First, we conducted 10-fold cross-validation on the ABIDE-I dataset, so that we could present results that were comparable to previously reported classification results such as [171, 186]. Second, we trained each model on the entire ABIDE-I dataset and computed test performance on the independent ABIDE-II set. We report classification accuracy and the receiver operating characteristic (ROC) curves, along with the corresponding area under the curve (AUC), for each of these scenarios under various combinations of parcellation schemes and prediction algorithms. For age prediction, we report the root mean squared error (RMSE).

Age RMSE (ABIDE-II)

Parcellation   Ridge   SVM    FCN    BrainNet   3D-CNN
HO             3.05    2.86   2.79   2.82       2.48
CC200          2.74    2.71   2.47   2.62       2.31
EZ             2.98    2.72   2.71   2.96       2.23
TT             3.10    2.83   2.87   3.02       2.24
CC400          2.76    2.83   2.41   2.55       2.27
AAL            2.84    2.74   2.69   2.75       2.33
DOS160         3.48    3.34   3.22   3.32       2.31
MA-Ensemble    2.72    2.81   2.47   2.55       2.15
SP-Ensemble    2.68    2.69   2.38   2.55       2.15

Table 3.3: Root mean squared error (RMSE in years) for age prediction: Independent test on ABIDE-II for benchmark models and proposed CNN approach. For each row, best results are bolded. For each column, best results are italicized.

Figure 3.4: ASD-HC Classification: Receiver Operating Characteristic curves for independent validation on ABIDE-II

Evaluation of Prediction Performance

Table 3.2 shows the independent test performance for different models on the classification problem. The proposed 3D CNN approach performs at least as well as, and often better than, the benchmark methods, including the fully-connected deep neural network (FCN) and BrainNetCNN. In particular, the 3D CNN approach performs favorably against other algorithms for all but two parcellation schemes, including the ensembles. Similarly, the SP-Ensemble achieves the best ABIDE-I cross-validation for most algorithms, including the 3D CNN.
The ABIDE-I cross-validation results, reported in Table A.2, are in general compatible with the independent test results, where the 3D CNN and SP-Ensemble techniques mostly outperform the competition. Figure 3.4 shows the Receiver Operating Characteristic (ROC) curves for SP-Ensemble models for the different algorithms on the independent ABIDE-II test dataset. We observe that the 3D-CNN SP-Ensemble achieves an AUC of ∼77% and an accuracy of ∼72% on independent ABIDE-II data, slightly better than the state-of-the-art cross-validation on ABIDE-I for ASD/HC classification [192], with FCN and BrainNetCNN ensembles yielding a similar performance. ROC curves for individual atlases are shown in Figure A.4.

Table 3.3 lists independent test results for the age prediction task on ABIDE-II, and Table A.3 reports the 10-fold cross-validation error on ABIDE-I. The 3D CNN approach consistently shows superior performance, yielding the best results for all parcellation schemes. Similar to the classification scenario, SP-Ensemble or MA-Ensemble also yield the best cross-validation and independent test performance values for the majority of the algorithms, including 3D CNN. Overall, the best accuracy is achieved by SP-Ensemble 3D CNN, which yields a root mean squared error of 3.28 years on ABIDE-I cross-validation and 2.15 years on the independent ABIDE-II dataset. We also estimated mean absolute error (MAE) of all models on ABIDE-II and observed a similar trend, as reported in Table A.6.

Comparison of stochastic parcellations and atlases

Here, our objective is to conduct a detailed investigation of how the choice of ROIs affects prediction performance for different machine learning (ML) algorithms. For each ML algorithm and each parcellation we have a model trained on the ABIDE-I data, which we then used on the independent ABIDE-II data to quantify prediction accuracy.
Figure 3.5 shows the distribution of accuracy values (estimated with a kernel density model) obtained using stochastic parcellations, while also illustrating the results for each of the atlases and the scale-specific SP-ensembles. The scale-specific SP-Ensemble strategy, as the name implies, averaged the models corresponding to the 30 stochastic parcellations in each scale. We observe that the atlas-based models performed no better than typical stochastic parcellation models, independent of scale and algorithm. This result offers an intriguing possibility: perhaps we do not need anatomically or functionally derived brain parcellations to train machine learning models since stochastic parcellations perform equally well or no worse in practice. Our proposed SP-Ensemble CNN strategy yielded accuracy results that were about as good as the best scale-specific SP-Ensemble model. Finally, the ensemble models were almost always better than the atlas-based models and they compared favorably against the individual stochastic parcellation models. The same observations can be made for ABIDE-I cross-validation (see Supplementary Figure A.1).

Figure 3.5: Violin plots showing the spread of prediction accuracies/errors for stochastic parcellations at multiple network scales for different classification models. Mean accuracy/error of individual violins is denoted by ’Mean SPs’. Performance of individual atlases is compared with SPs with the closest number of ROIs and is denoted as ’Single Atlas’. Results are computed by training models on the entire ABIDE-I cohort and testing on the independent ABIDE-II cohort.

Figure 3.6: Distribution of Ridge models’ performance for stochastic parcellations created using the same gray-matter mask as the corresponding atlas. Red denotes the atlas model’s accuracy and black indicates the SP-Ensemble accuracy.

In the above analysis, one potential confound was the different gray matter masks of atlases and stochastic parcellations (SPs).
In order to account for this confound, we conducted the following analysis. For each of the atlases, we generated 100 SPs using the same gray matter mask as the atlas. We excluded DOS160 because it does not rely on a well-defined gray matter mask and places discontiguous 4.5 mm spherical regions over fixed coordinates in the brain (sampling only 5% of brain voxels). We then trained on each of these SPs using the same hyper-parameters that were found to be optimal for the corresponding atlas. Here, we show the results for ridge regression (the model that was fastest to train), but we obtained similar results for all other algorithms as well. As can be seen from Figure 3.6, for most atlases and corresponding gray matter masks, the model trained on the atlas ROIs performed no better than an average SP model. Furthermore, and importantly, the SP-Ensemble (computed by averaging across SPs on the atlas-specific mask) yielded better performance than the atlas models for all atlases.

Figure 3.7: Mean saliency maps of trained 3D-CNN models for SP-Ensemble. (a) ASD/Healthy classification. (b) Age prediction.

Visualization

An important goal of machine-learning tools in neuroimaging is to generate novel insights linking imaging biomarkers with disease or phenotypic traits. Visualization techniques for CNNs can help reveal important features used by the model for discriminating between output classes. Figure 3.7 shows the saliency maps computed for the SP-Ensemble CNN ASD classification and age prediction models. As can be seen from these maps, the precuneus, often considered a core node of the default mode network [210], seems to play a significant role for both prediction problems. However, there are also salient regions that are unique to each problem. For example, the anterior cingulate/ventromedial prefrontal cortex, a region that has been linked to autism [211], was distinctly highlighted for the ASD classification problem.
The left parietal cortex was also emphasized for ASD prediction, which is consistent with the lateralized activation observed in this region in autism patients [212]. On the other hand, for age prediction, the left dorsolateral prefrontal cortex (dlPFC) is a uniquely salient region. The dlPFC is associated with executive functions, such as working memory and abstract reasoning. For working memory, dlPFC’s function seems to be age-associated and more lateralized in younger adults [213].

3.1.4 Discussion

In this study, we presented a detailed empirical analysis of how the choice of ROIs can impact the performance of machine learning models trained on functional connectomes. We considered several machine learning algorithms, together with a range of spatial scales and parcellation schemes, including the popular atlas-based techniques and a stochastic approach. Our analysis suggests that using a single atlas for summarizing the connectome data is often sub-optimal for training machine learning models, and significantly more accurate predictions can be achieved with an ensemble approach that averages across models trained with different parcellation schemes. Furthermore, we demonstrated that averaging across stochastic parcellations can achieve very high accuracy values, often surpassing atlas-based models.

Our findings resonate with several other studies that compare stochastic parcellations and atlases, although in different contexts. Craddock et al. [214] compared spatially constrained functional parcellations obtained from spectral clustering with anatomically constrained parcellations produced from random clustering. Random parcellations performed as well as functional parcellations and better than anatomical atlases on metrics of cluster homogeneity and representation accuracy. Based on this, the study suggested that sufficiently small ROIs perform well for functional network analysis regardless of their spatial position. Fornito et al.
[215] generated stochastic parcellations by randomly sub-dividing the AAL atlas and showed that functional organizational properties are independent of the parcellation template at the same network resolution, although significant variability is observed across scales. Studies on diffusion-MRI based anatomical networks have similarly shown that topological attributes and network organizational parameters are consistent across different parcellation schemes, including random parcellations [207, 216].

Another main contribution of this study is a novel approach to employ a 3D CNN architecture on functional connectivity data. Convolutional neural networks achieve state-of-the-art performance on many image-based prediction tasks, as they take advantage of the full spatial resolution of the data and the translation invariance property of the problem. Our proposed approach treats voxel-level connectivity fingerprints as input channels to a conventional 3D CNN framework. Spatial convolutions can capture local structural or topographic patterns in the data, such as connectivity gradients. Successively stacking convolutional layers in our architecture would hierarchically yield higher-order features that can capture information relevant for classification. Studies have shown that individual-level network topography serves as a fingerprint of human behavior [217]. Our multi-channel input image comprising connectivity fingerprints, coupled with CNNs, provides a natural framework to capture individual-level differences in topography as they relate to behavior or disease. This strategy contrasts with current practice, where the input to machine learning models is a set of pairwise ROI functional correlations, which makes the model more susceptible to uncertainty caused by parcellation choice. This can be seen in our experiments, where there is relatively larger variance in prediction performance across atlases for the fully-connected neural network.
Thus, CNNs with connectivity map inputs can offer a more robust alternative to classification approaches that only rely on ROI-level connectivity information, such as the BrainNet-CNN. Our results demonstrate that when tailored for connectomes, CNNs offer a promising opportunity to probe brain networks in disease.

Machine learning practitioners have to make a number of preprocessing choices in extracting connectomic features to analyze. While there is no one-size-fits-all solution across different tasks, in the context of machine learning models of functional connectivity, we present some interesting empirical observations below.

Ensemble learning

The motivation behind using multiple stochastic parcellations for prediction is grounded in the concept of ensemble learning. The core idea is to integrate out a latent variable (i.e., parcels or ROI definitions) from the learning problem [218]. This approach also makes the predictions more robust to the precise parcellation scheme. As shown above, the performance of atlas-based models can vary significantly (∼5-10% for parcellations at the same scale). In such a scenario, ensemble learning over multiple stochastic parcellations can be a robust strategy that yields reliable predictions.

Table 3.4: Classification/regression performance of FCN with a high-resolution parcellation (∼1024 ROIs) [216]

Network granularity

We explored the impact of network granularity on prediction performance of machine learning algorithms for connectomes. Our analysis suggests that better prediction performance can be expected with parcellations at higher granularity up to ∼400 ROIs. To further investigate this trend on ROI-level models, we trained the fully-connected network (FCN), which is generally the best performing baseline algorithm, on both the prediction tasks for the 1024 node parcellation proposed in [216].
As can be seen from Table 3.4, an atlas with 1024 regions is comparable to the CC200 atlas for ASD/HC classification in ABIDE-II. However, the performance actually degrades significantly (in comparison to CC200 or CC400) for the age prediction task. Our evaluations contradict a previously reported result that a coarser network scale (∼100-150 ROIs) is more suitable for autism classification [186]. In that paper, these conclusions were drawn by comparing the performances achieved with a few atlases. However, inferring trends from a small number of atlases can be misleading, since factors like the boundary definitions of structures (cortical/subcortical) or the particular gray matter mask used will affect results. Stochastic parcellations can control for these confounds and depict unbiased trends across network scales.

Number of gray matter voxels

Our empirical study suggests that there is no direct correlation between the number of voxels in the gray matter mask and a model’s prediction performance. However, we do observe that the choice of gray matter mask can impact results. For example, the DOS160 atlas with as few as ∼3,039 voxels shows performance no worse than other atlases at the same resolution (HO, EZ, TT and AAL) with ∼20x more voxels.

Visualization

Saliency maps provide a valuable visualization strategy to probe deep neural network models. We visualized the saliency maps from 3D CNN models trained on ROIs extracted using both atlases and stochastic parcellations. As shown in Figure 3.7 and Supplementary Figures A.2 and A.3, these maps are remarkably consistent. These maps reveal that the precuneus, which is a hub of the default mode network and associated with ASD and age, plays an important role for both prediction problems. There were also uniquely highlighted regions, such as the anterior cingulate/ventromedial prefrontal cortex for ASD classification and the left dorsolateral prefrontal cortex (dlPFC) for age prediction.
Several studies have suggested the potential of DMN connectivity as a neurophenotype of autism. Chen et al. [219] trained a random forest classifier that distinguished ASD subjects from healthy controls with high accuracy, and showed that default mode and somatosensory regions contribute significantly to diagnostic accuracy. Similarly, Abraham et al. [186] revealed discriminative connections in the DMN for ASD/HC classification within a larger heterogeneous cohort of the ABIDE dataset. Furthermore, it has been shown that the connectivity of the posterior cingulate cortex (PCC) and aberrations in the medial prefrontal cortex node of the DMN can predict social deficits in children with ASD [220]. Our results corroborate the findings of these studies, and suggest a crucial involvement of the DMN in autism.

Figure 3.8: Motion correlations

Influence of motion

Several studies have shown differences in head motion parameters during fMRI between healthy controls and diseased populations, or between subjects from different age groups [221, 222]. This, in turn, can manifest as artifacts in the derived resting-state connectivity [223]. Although our independent test data was motion scrubbed, we performed additional analyses to rule out the confounding effect of motion in classifier decisions. We selected a cohort of 151 ASD subjects with motion-matched healthy controls from our independent dataset and analyzed the correlation of 4 motion parameters with classifier predictions. These include the root-mean-square framewise displacement, mean relative displacement, maximum absolute displacement and the number of micro-movements greater than 0.5 mm. These summary statistics were chosen in accordance with previous reports of motion artifacts in rs-fMRI [196]. As shown in Figure 3.8, no significant correlations were observed between motion variables and the predictions of SP-Ensemble (model average over all atlases).
In this motion-matched cohort, classification accuracy of 71.8% was obtained using 3D-CNN. For our regression task, there was no significant correlation between a subject’s age and any of these motion parameters in our cohorts.

Recommendations

Based on our experiments, we make two claims in this study: (a) 3D-CNN performs favorably compared to alternative baseline algorithms, and (b) Ensemble models that average across parcellation schemes consistently perform better than individual atlas-based models and are thus a safer choice for supervised machine learning on connectomes. This is because individual atlases can show significant variability in classification/regression performance and finding the optimal atlas for a prediction task among the wide range of available atlases might not be feasible. Figure 3.9 shows the probability density estimates for the difference in performance between (a) 3D-CNN versus baseline algorithms as evaluated with the SP-Ensemble strategy, and (b) SP-Ensemble versus single atlas implemented with the 3D-CNN model. These estimates are presented for both our prediction tasks. For this experiment, we estimate the evaluation metrics (AUC-ROC for ASD/HC classification and RMSE for age regression) on 10,000 bootstrapped samples from ABIDE-II. These results demonstrate that the SP-Ensemble approach consistently achieves an accuracy as good as the best performing single-atlas model. Further, the 3D-CNN model consistently outperforms the baseline algorithms for the age prediction task, with more prominent improvements for individual atlas models. This can be seen from Tables 3.2 and 3.3. We note that when using the ensemble strategy, the differences between models are marginal and might be irrelevant in some practical applications. For instance, the SP-Ensemble performance on ASD/HC classification task is comparable among 3D-CNN, FCN or BrainNet-CNN, with slight improvements over linear models.
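The bootstrap comparison described above can be sketched as follows for the regression case. This is an illustrative reconstruction under stated assumptions, not the code used in the study: test subjects are resampled with replacement, both models are scored on each resample, and the per-resample RMSE difference is collected. The function name, the sign convention and the toy predictions are hypothetical.

```python
import math
import random


def rmse(pred, target):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target))


def bootstrap_rmse_difference(pred_a, pred_b, target, n_boot=1000, seed=0):
    """Return n_boot values of RMSE(b) - RMSE(a) on resampled test sets.

    Positive values mean model `a` had the lower error on that resample.
    """
    rng = random.Random(seed)
    n = len(target)
    diffs = []
    for _ in range(n_boot):
        # Draw subject indices with replacement; both models share the resample.
        idx = [rng.randrange(n) for _ in range(n)]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        t = [target[i] for i in idx]
        diffs.append(rmse(b, t) - rmse(a, t))
    return diffs


# Toy data: model_a is off by at most 1 year, model_b by at least 2 years.
target = [8.0, 10.0, 12.0, 15.0, 18.0, 21.0, 25.0, 30.0]
errors_a = [0.5, -0.5, 1.0, -1.0, 0.5, -0.5, 1.0, -1.0]
errors_b = [2.0, -2.0, 3.0, -3.0, 2.0, -2.0, 3.0, -3.0]
model_a = [t + e for t, e in zip(target, errors_a)]
model_b = [t + e for t, e in zip(target, errors_b)]

diffs = bootstrap_rmse_difference(model_a, model_b, target, n_boot=1000, seed=0)
frac_a_better = sum(d > 0 for d in diffs) / len(diffs)
```

A kernel density estimate of `diffs` would give a curve of the kind shown in Figure 3.9. Resampling the same subjects for both models keeps the comparison paired.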
Thus, if time and/or computational resources impose constraints, it might be more suitable to prefer simpler models like FCN or SVM over 3D-CNN, especially with the ensemble approach.

3.1.5 Limitations and future work

Throughout our analysis, Pearson’s correlation was chosen to measure functional connectivity strength between different brain regions. Several other correlation metrics, including tangent-based and partial correlation, have been shown to yield superior classification performance in prior studies [102, 186]. While we do not expect this to affect the general conclusions and findings of our study, the choice of the correlation metric still remains an arbitrary decision in any machine learning pipeline for connectomes.

Due to the heavy computational burden required for training multiple deep learning models, we only considered one particular scheme for creating stochastic parcellations, i.e., Poisson Disk Sampling. Alternative strategies for creating random parcellations have also been proposed, for instance, through stochastic sub-division of anatomically derived ROIs into smaller parcels [224]. It is also possible to randomize several other more popular schemes for parcellating the brain, such as using Ward’s clustering on functional data from sub-samples of the population [218] or creating geometric parcellations with different initializations [181].

Figure 3.9: Kernel density estimates of the probability distributions for the performance difference between models, computed based on 10,000 bootstrap samples from ABIDE-II. Values to the left of the black vertical line indicate bootstrap samples where the proposed approach (3D CNN or SP-Ensemble) under-performed compared to the competing method.

While the proposed CNN approach achieves promising accuracy on autism detection and age prediction, there is room for further improvement. We have not yet conducted a comprehensive optimization of the convolutional architecture.
Furthermore, there are likely more optimal choices than the target ROI-based correlations that are used as input to the model. An interesting alternative would be to select random gray matter vertices for connectivity profiling, as proposed in [184]. We envision an end-to-end learning strategy that can enable the optimization of these connectomic features.

Saliency maps provide an appealing visualization technique by mapping the neural network activations back to input voxel space. Several modifications to gradient-based back-propagation have been reported in the literature that can potentially highlight more informative features learnt by the model [225, 226]. Further, the use of saliency maps need not be restricted to depicting group-averaged discriminative features. Unsupervised learning on saliency maps can provide novel insights into clinical subtypes of disease. It is also important to note that machine learning techniques do not unequivocally provide evidence for the salient features being directly associated with the disease or other target variables. However, when combined with detailed future investigations, they can spur clinical discoveries.

3.1.6 Conclusion

The results presented in our paper showcase the utility of ensemble learning for connectomes. Functional network based prediction models are impacted by several a priori choices, the most pivotal of which is the ROI definition. We demonstrate that ensembles of stochastic parcellations yield predictions that are significantly more robust and accurate compared to single atlas-based approaches. Further, our experiments highlight the potential of convolutional neural network models for connectome-based classification.

3.2 Detecting abnormalities in resting-state dynamics: An unsupervised learning approach

Abstract

Resting-state functional MRI (rs-fMRI) is a rich imaging modality that captures spontaneous brain activity patterns, revealing clues about the connectomic organization of the human brain.
While many rs-fMRI studies have focused on static measures of functional connectivity, there has been a recent surge in examining the temporal patterns in these data. In this paper, we explore two strategies for capturing the normal variability in resting-state activity across a healthy population: (a) an autoencoder approach on the rs-fMRI sequence, and (b) a next frame prediction strategy. We show that both approaches can learn useful representations of rs-fMRI data and demonstrate their novel application for abnormality detection in the context of discriminating autism patients from healthy controls.

3.2.1 Introduction

Resting-state fMRI captures intrinsic neural activity, in the absence of external stimuli and task requirements. Much of the research in this direction has aimed at identifying connectivity based biomarkers, restricting the analysis to so-called “static” functional connectivity measures that quantify the average degree of synchrony between brain regions. For example, machine learning based strategies have been used with static connectivity measures to parcellate the brain into functional networks, and extract individual-level predictions about cognitive state or clinical condition [227]. In recent years, there has been a surge in the study of the temporal dynamics of rs-fMRI data, offering a complementary perspective on the functional connectome and how it is altered in disease, development, and aging [228]. However, to our knowledge, there has been a dearth of machine learning applications to dynamic rs-fMRI analysis.

Thanks to large-scale datasets, modern machine learning methods have fueled significant progress in computer vision. Compared to natural vision applications, however, medical imaging poses a unique set of challenges. Data, particularly labeled data, are often scarce in medical imaging applications. This makes data-hungry methods such as supervised CNNs possibly less useful.
One potential approach to tackle the limited sample size issue is to exploit unsupervised or semi-supervised learning strategies that don’t depend on large amounts of labeled training data. In this paper, we explore the use of unsupervised end-to-end learning for capturing rs-fMRI dynamics and demonstrate that the representations our models learn can be useful for detecting abnormal patterns in data.

Related Work: Machine learning methods are increasingly used to compute individual-level predictions from rs-fMRI data, e.g. about disease [227]. The conventional approach of supervised learning relies on labeled training data and uses hand-crafted features such as the static correlation between pairs of regions. Such features fail to capture the dynamics of resting-state activity as it relates to behavior or disease. Moreover, emerging data suggest that learning models that exploit the full-resolution 4-dimensional fMRI data can potentially reveal more discriminative resting-state biomarkers [229]. In this work, we are motivated by this observation and our goal is to move away from hand-crafted features and take full advantage of the spatio-temporal structure of rs-fMRI.

Unsupervised approaches such as clustering of static connectivity measures have been previously used for disease classification and discovery of novel disease sub-types [230]. Similarly, autoencoders have been used in pre-training to improve generalization capabilities of supervised learning algorithms, as in [231]. An alternative application of unsupervised learning is outlier detection. Here, the goal is to identify data points that deviate markedly from normal samples. For example, autoencoder models have been popular for outlier detection in video [232]. In recent years, predictive modeling has also been shown to be a powerful framework in unsupervised feature learning of video representations [233]. In this approach, a model is trained to predict future frames of a video sequence.
These models learn useful internal representations of the data that can in turn be used for anomaly detection or downstream object recognition or classification tasks [234].

In the present paper, we propose a novel unsupervised approach that learns rs-fMRI representations on voxel-level time-course data captured via a convolutional RNN model, in an end-to-end learning fashion. Models are trained to predict the next frame in an rs-fMRI sequence or to reconstruct the entire sequence. We apply our approach to the novel problem of outlier detection in rs-fMRI, and demonstrate its utility in discriminating autism patients from healthy controls.

3.2.2 Methodology

In this section, we describe the autoencoder and prediction models considered in the study. As we demonstrate empirically, the models learn to accurately reconstruct or predict “normal” resting-state activity in healthy subjects, but yield higher reconstruction/prediction errors in patients.

Network building blocks

Convolutional networks: CNNs have achieved unprecedented levels of performance across many vision tasks [235]. The main ingredients of CNNs include convolutional layers that serve as feature extractors, and pooling/un-pooling layers that perform down/up-sampling in resolution. In this paper, we employ encoder-decoder style networks since we are reconstructing/predicting structured image data, i.e., rs-fMRI frames. Encoder-decoder networks are widely deployed in image segmentation and generation tasks, as in [236]. The encoding part computes a cascade of increasingly high-level representations from the images, whereas the decoding part reconstructs pixel-level features from these representations.

Convolutional-LSTM networks: Recurrent neural networks (RNNs), e.g., LSTMs [237], offer state-of-the-art results in many domains with sequential data, such as speech or natural language processing.
Conv-LSTM cells, an extension of LSTM units, integrate convolutional layers with LSTM modules and allow the temporal propagation of high-level spatial features captured by convolutional layers. Conv-LSTM cells have shown remarkable performance in sequence forecasting problems [238]. This stems from their ability to simultaneously capture rich spatial and temporal structures in the data.

Next frame prediction model

Given a sequence of rs-fMRI frames, we trained a model to predict the next frame in the sequence. To improve the localization accuracy of predicted frames and capture spatio-temporal correlations at multiple resolutions, we incorporate skip connections with Conv-LSTM modules in our architecture. This U-Net style architecture [236] is shown in Figure 3.10. The input to the model is a 2D rs-fMRI sequence of T axial slices. In the encoding layers, we used 3D convolutions and max pooling, where the first two dimensions are the spatial coordinates on the axial cross-section and the third dimension is time.

Figure 3.10: Next frame prediction model. Each cuboid represents a 3D (2 spatial dimensions + time) feature map with number of features indicated on top. Flat boxes represent 2D feature maps, with number of channels on top. Input is an axial fMRI slice with T sequential frames. Conv-LSTM cell returns the last output of the output sequence.

We compared our prediction model with several baselines, including: (a) simply using the last frame of the input sequence as a prediction of the next frame; (b) a non-learning based extrapolation model that fits separate cubic splines at each pixel on the input sequence; and (c) a non-recurrent 2-D U-Net model that excludes the Conv-LSTM modules from the proposed architecture and treats the temporal component of the input as T channels.
We also considered (d) an interpolation scheme that interpolated with cubic splines between the T frames of the input sequence that precede the predicted frame and the frame after the predicted frame. This interpolation method is different from the other methods as it is not a forecasting model, yet we found it useful to assess the performance of the other methods.

Autoencoder model

The autoencoder is an unsupervised learning approach that encodes the input into a lower dimensional representation, which is then decoded into a reconstruction of the input. The model is trained to minimize a distance function between the reconstruction and input, such as the squared L2 distance. The architecture of our reconstruction model is the same as the prediction model above, with two important differences. First, there are no skip connections (indicated as a “concatenate with crop” operation in the prediction model), to avoid the trivial solution of copying input to the output. The second difference is that, in the decoder layers and the output, we have T frames instead of a single frame. So in the visualization of this architecture, those would be represented with cuboids and 3D convolution/up-sampling operations. Further, we retained the Conv-LSTM unit in the bottleneck to capture temporal dependencies between the frames of an rs-fMRI sequence.

3.2.3 Experiments

Data

We conducted our experiments on data from the Autism Brain Imaging Data Exchange (ABIDE) study [239]. Because of differences in TRs and other imaging parameters across sites, we restricted our experiments to the acquisition site with the largest sample size, namely NYU. We only used data that passed quality assessments by all functional raters and retained enough time-points after motion scrubbing for band-pass filtering. We randomly selected two thirds of the healthy group (54 subjects) for training/validating the reconstruction and imputation models.
A validation split of 10% was used during training to monitor convergence of these models. The remaining one-third group comprising 28 healthy controls was used as test data to evaluate predictions/reconstruction performance for comparison against ASD patients (N=67). Rs-fMRI preprocessing included slice timing correction, motion correction, global mean intensity normalization, standardization of functional data to MNI space, global signal regression, motion scrubbing (volume censoring) and band-pass filtering. We note that band-pass filtering was performed after motion scrubbing to avoid any motion contamination. Individual rs-fMRI scans were normalized to the range 0 to 1 by min-max scaling each individual voxel’s time series. Finally, we applied a binary gray matter mask to all 3D volumes [203].

Implementation Details

During training, we identified non-overlapping contiguous segments of (T + 1) frames for each subject in the training set. For each such segment, we extracted all axial slices and trained a unified model to predict the next frame, i.e., for a given architecture a single model was trained for all subjects and axial slices, comprising 16,560 training instances. Squared loss was optimized with Adam and a learning rate of 1e-4. We implemented our code using Keras, with a TensorFlow back-end. The network was trained for 150 epochs with a batch size of 32. Validation curves were monitored to ensure convergence. We used the same training paradigm for the non-recurrent baseline U-Net model. In our experiments, we tried different values for T and observed diminishing returns beyond T = 20 in the performance of the next frame prediction models. The overall pattern in comparing the accuracy of different models was the same. Thus, in the remainder we fix T = 20. We note that, while not necessary, we fixed T = 20 for the autoencoder models too, which ensured training was done on identical datasets for these different approaches.
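The two data-preparation steps described above, per-voxel min-max scaling and cutting each scan into non-overlapping segments of T + 1 frames, can be sketched as follows. This is a simplified illustration rather than the original pipeline: frames are plain Python lists, the handling of constant time series and of leftover frames is an assumption, and both helper names are hypothetical.

```python
def min_max_scale(series):
    """Scale one voxel's time series to [0, 1].

    Assumption: a constant series (no dynamic range) is mapped to all zeros.
    """
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series]
    return [(x - lo) / (hi - lo) for x in series]


def split_segments(frames, T):
    """Cut a frame sequence into non-overlapping runs of T + 1 frames.

    Returns (inputs, target) pairs: the first T frames of each run and the
    frame that follows them. Trailing frames that do not fill a whole run
    are dropped (an assumption of this sketch).
    """
    step = T + 1
    pairs = []
    for start in range(0, len(frames) - step + 1, step):
        segment = frames[start:start + step]
        pairs.append((segment[:T], segment[T]))
    return pairs


# Toy example: ten "frames" labelled 0..9 and T = 3.
scaled = min_max_scale([2.0, 4.0, 6.0])
pairs = split_segments(list(range(10)), T=3)
```

With T = 3, the ten toy frames yield two training pairs and the last two frames are discarded. For the autoencoder the same segments could be reused with the first T frames serving as both input and target.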
Once the models were trained, we used them to compute predictions or reconstructions on independent data, which included both controls and ASD patients. For each test subject, we computed the mean squared error (between reconstruction/prediction and ground truth frames) as a single metric. Note that we averaged over all frames and pixels in an rs-fMRI scan. We hypothesized that this metric would be different between patients and controls, demonstrating that it could be used as an outlier detector. We also analyzed the voxel-level squared errors and conducted a statistical comparison between patients and controls to reveal the anatomical distribution of the differences.

3.2.4 Results

Next Frame prediction and reconstruction errors

We first demonstrate that the next frame in rs-fMRI sequence can be accurately predicted. Table 3.5 shows the performance of the different methods we implemented. We list both MSE and the mean Pearson’s correlation between predicted and ground truth frames, computed within the gray matter mask on healthy test subjects. We observe that the proposed recurrent U-Net architecture achieves the best prediction performance, even exceeding the cubic-spline based interpolator, which was given both the preceding 20 frames and the frame after the predicted frame. The recurrent LSTM modules that capture the temporal dynamics also enabled a significant boost in quality, as can be noted by comparing the performance of the U-Net and proposed architecture. Finally, the U-Net models outperformed the non-learning based methods of extrapolation, suggesting that accounting for both the spatial and temporal structure in the data yielded better results.

Imputation models           Mean Squared Error    Pearson’s Correlation
Last observation copy       0.01969               0.7558
Extrapolation               0.01203               0.8938
Interpolation*              0.00065               0.9939
Non-recurrent U-Net         0.00026               0.9967
Proposed recurrent U-Net    0.00007               0.9990

Table 3.5: Next frame prediction performance on healthy test subjects for different models. *Interpolation model had access to the frame after the predicted frame.

Table 3.6 shows the mean reconstruction errors of the autoencoder on healthy test subjects for various input sequence lengths at test time. We note that the performance is worse than next-frame prediction because of the absence of skip connections. Reconstruction quality degraded with fewer frames, suggesting that the autoencoder is not reconstructing frames independently and is indeed exploiting the long-term temporal dependencies between frames. For outlier detection, we thus used the temporal window T=20 as it gives the best reconstruction performance and captures longer dynamics.

Recurrent autoencoder: sequence length    Mean squared error    Pearson’s correlation
T=10 frames                               0.0625                0.354
T=15 frames                               0.0475                0.503
T=20 frames                               0.0437                0.550

Table 3.6: Reconstruction performance of the proposed recurrent autoencoder on healthy test subjects for different input sequence lengths.

Figure 3.11: Whisker plots showing reconstruction and prediction errors (mean squared error) for ASD patients and controls, with proposed recurrent models trained on T=20 consecutive frames. Points are individual subjects. The ends of the box are upper and lower quartiles, the median is marked by a horizontal line inside the box.

Model                         AUC (p-value)
Recurrent autoencoder         69.6 (0.00466)
U-Net imputation              62.5 (0.00293)
Recurrent U-Net imputation    65.9 (0.00151)

Table 3.7: Area under the ROC curve for discriminating ASD vs Controls. P-values of the unpaired t-test comparing means of the two clinical groups are shown in brackets.

Outlier Detection: Discriminating Patients and Controls

We were interested in examining whether the next frame prediction and reconstruction models can be used to detect outlier subjects. To test this, we computed mean squared error on all test subjects, including healthy controls and ASD patients.
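The error and correlation metrics reported in Tables 3.5 and 3.6, together with the simplest baseline (copying the last observed frame), can be sketched as below. This is a toy illustration in plain Python, assuming frames are flattened to lists of pixel values and the gray matter mask is a matching list of 0/1 flags; all helper names and values are hypothetical.

```python
import math


def mse(pred, truth):
    """Mean squared error between two equal-length pixel lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def apply_mask(frame, mask):
    """Keep only the pixels flagged as gray matter."""
    return [v for v, m in zip(frame, mask) if m]


def last_observation_copy(input_frames):
    """Baseline: predict the next frame as a copy of the last input frame."""
    return input_frames[-1]


# Toy frames with four pixels; the last pixel lies outside the mask.
mask = [1, 1, 1, 0]
input_frames = [[0.1, 0.2, 0.3, 0.0], [0.2, 0.3, 0.4, 0.0]]
true_next = [0.3, 0.4, 0.5, 0.0]

pred = apply_mask(last_observation_copy(input_frames), mask)
truth = apply_mask(true_next, mask)
frame_mse = mse(pred, truth)
frame_corr = pearson(pred, truth)
```

Averaging `frame_mse` over all frames and slices of one scan would give a per-subject score of the kind compared between patients and controls.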
Figure 3.11 shows these error values for the proposed next frame prediction and autoencoder models. Both models yield error values that are statistically significantly different between the two clinical groups. Further, AUC values obtained with the autoencoder and imputation models, as shown in Table 3.7, are on par with recent supervised ASD vs. control classification results [240]. We also note that the non-recurrent U-Net benchmark achieves a weaker separation between the two clinical groups. This indicates that the conv-LSTM layers enhance diagnostic sensitivity, presumably because they are better equipped to exploit spatiotemporal structure when extracting representations.

Importantly, we observed no appreciable correlation between frame-wise displacement values (a widely used metric to quantify subject motion) and the prediction/reconstruction errors, neither at the frame level (Pearson's correlation -0.0161/0.0218, p = 0.0739/0.0251, computed on non-motion scrubbed frames only) nor at the individual level (Pearson's correlation 0.0033/0.1730, p = 0.9744/0.0936).

Finally, we were interested in exploring the anatomical differences in errors between the two clinical groups. We thus conducted a t-test of the regional reconstruction error (averaged within the boundaries of the widely used AAL atlas [203]) on the model with the best AUC, i.e., the autoencoder. As can be seen from Figure 3.12, significant differences were mainly constrained to the left hemisphere, particularly localizing within the language network, involving the temporal and frontal cortices, consistent with prior literature [241].

Figure 3.12: Statistical significance of the difference in regional reconstruction error of the recurrent autoencoder between controls and ASD patients. FDR with q = 0.05 was implemented for multiple testing correction. −log10 p values are shown.
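The FDR correction across atlas regions can be illustrated with the sketch below. The text does not state which FDR procedure was used; the Benjamini-Hochberg step-up rule is assumed here, and the function name benjamini_hochberg is hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per test: True if it is significant at FDR level q.

    Assumed procedure (Benjamini-Hochberg): sort the m p-values, find the
    largest rank k with p_(k) <= (k / m) * q, and reject the k smallest.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * q:
            largest_rank = rank
    significant = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= largest_rank:
            significant[index] = True
    return significant
```

In this setting there would be one p-value per AAL region, obtained from the patient vs. control t-test on the regional error. Because the rule is step-up, a region can be declared significant even when its own p-value exceeds its rank threshold, as long as a larger p-value meets its threshold.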
3.2.5 Discussion

We considered a novel unsupervised learning strategy to analyze rs-fMRI data, where we train recurrent models to reconstruct rs-fMRI clips or to predict the next frame in a sequence. Results indicate that the proposed recurrent U-Net architecture produces very accurate predictions that yield a correlation greater than 0.99 with ground truth. Furthermore, this performance is better than an interpolation approach that had access to the frame after the predicted frame. Next, we demonstrated the utility of the proposed models in detecting outliers in rs-fMRI. Our results indicate that next frame prediction error or reconstruction error can be used to discriminate patients from controls, achieving a classification performance close to state-of-the-art results obtained with supervised methods.

There are several directions we will be exploring with this technique. For example, we are interested in using the next frame prediction model to assess the quality of individual frames, particularly in the context of motion and other artifacts. Another possible application could be to use this model to impute frames that have been discarded for motion scrubbing. Finally, we believe unsupervised models can offer novel insights into the dynamics of resting state fluctuations.

CHAPTER 4
TOWARDS HOLISTIC ENCODING MODELS FOR PREDICTING FMRI RESPONSES TO MULTIMODAL NATURALISTIC STIMULI

4.1 Introduction

Understanding the neural basis of sensory perception has been a long-standing goal of neuroscience. Brain activity recordings of healthy subjects during “free viewing” of movies present a powerful opportunity to build ecologically-sound and generalizable models of sensory systems, also known as encoding models. In neuroscience, stimulus-response relationships can be systematically understood from two complementary standpoints. Encoding models map stimuli to fine-grained neural activity via complex feature transformations.
Conversely, decoding models aim to predict stimulus attributes directly from neural recordings. In this thesis, we explore the former (encoding) approach as a means of understanding how sensory information is represented in the activity of different brain regions. Modeling neural responses to naturalistic stimuli, in particular stimuli that reflect the complexity of real-world scenes (e.g., movies), offers significant promise to aid in understanding the human brain as it functions in everyday life; a central theme of this research is to use predictive modeling techniques to convert neural data into understanding and fundamental knowledge about the brain. Beyond satiating the spirit of scientific curiosity, understanding the link between neural activity and complex thought can potentially improve our understanding of neuropsychiatric disorders and create novel opportunities for neural prosthetics.

Deep neural networks trained on image or sound recognition tasks have emerged as powerful models of computations underlying sensory processing, surpassing traditional models of image or sound representation based on Gabor filters and spectrotemporal filters, respectively, in mid-level and higher-order visual and auditory regions. While this success is promising, existing encoding models based on deep neural networks have focused on limited portions of the sensory space under naturalistic stimulation, ignoring the complex and dynamic interactions of modalities (audio and vision) in this inherently context-rich paradigm. This reductionism leads to sub-optimality in predictive models of cortical responses, as neural patterns evoked by movies are not simply a conjunction of activations in modality-specific cortices by their respective uni-sensory inputs; rather, there are known cross-modal influences as well as regions that receive afferents from multiple senses.
Longer narratives or movies further have an inherent temporal structure; much of the meaning we infer is from stimulation sequences rather than from instantaneous visual or auditory stimuli alone. To address this limitation, we recently proposed a Deep Neural Network (DNN)-based encoding model that captures three critical inductive biases about information processing in the brain: namely, hierarchical processing, assimilation over longer timescales and multi-sensory auditory-visual interactions. By developing and evaluating this model on a large-scale movie-watching dataset, we demonstrated how incorporating this joint information leads to remarkable prediction performance across large areas of the cortex, well beyond the visual and auditory cortices into multi-sensory sites and frontal cortex. Further, we demonstrated how these neural encoding models trained solely on naturalistic data can allow us to interrogate the temporal and sensory sensitivity of different brain regions.

4.2 Endowing neural encoding models with both audition and vision and stimulus history

Abstract

Naturalistic stimuli, such as movies, activate a substantial portion of the human brain, invoking a response shared across individuals. Encoding models that predict neural responses to arbitrary stimuli can be very useful for studying brain function. However, existing models focus on limited aspects of naturalistic stimuli, ignoring the dynamic interactions of modalities in this inherently context-rich paradigm. Using movie-watching data from the Human Connectome Project, we build group-level models of neural activity that incorporate several inductive biases about neural information processing, including hierarchical processing, temporal assimilation and auditory-visual interactions. We demonstrate how incorporating these biases leads to remarkable prediction performance across large areas of the cortex, beyond the sensory-specific cortices into multi-sensory sites and frontal cortex.
Furthermore, we illustrate that encoding models learn high-level concepts that generalize to task-bound paradigms. Together, our findings underscore the potential of encoding models as powerful tools for studying brain function in ecologically valid conditions.

4.2.1 Introduction

How are dynamic signals from multiple senses integrated in our minds to generate a coherent percept of the world? Understanding the neural basis of perception has been a longstanding goal of neuroscience. Previously, sensory perception in humans has been dominantly studied via controlled task-based paradigms that reduce computations underlying brain function into simpler, isolated components, preventing broad generalizations to new environments or tasks [242]. Alternatively, fMRI recordings from healthy subjects during free-viewing of movies present a powerful opportunity to build ecologically sound and generalizable models of sensory systems, known as encoding models [243, 244, 245, 246, 247, 248].

To date, however, existing works on encoding models study sensory systems individually, and often ignore the temporal context of the sensory input. In reality, the different senses are not perceived in isolation; rather, they are closely entwined through a phenomenon now well-known as multi-sensory integration [249, 250]. For example, specific visual scenes and auditory signals occur in conjunction and this synergy in auditory-visual information can enhance perception in animals, improving object recognition and event detection as well as markedly reducing reaction times [251]. Furthermore, our cognitive experiences unfold over time; much of the meaning we infer is from stimulation sequences rather than from instantaneous visual or auditory stimuli. This integration of information from multiple natural sensory signals over time is crucial to our cognitive experience.
Yet, previous encoding methodologies have precluded the joint encoding of this rich information into a mental representation of the world.

Accurate group-level predictive models of whole-brain neural activity can be invaluable to the field of sensory neuroscience. These models learn to disregard the idiosyncratic signals and/or noise within each individual, while capturing only the shared response relevant to the stimuli. Naturalistic viewing engages multiple brain systems and involves several cognitive processes simultaneously, including auditory and visual processing, memory encoding and many other functions [252]. Group-level analysis in this paradigm is enabled by the synchrony of neuronal fluctuations in large areas of the cortex across subjects [253]. Thus far, inter-subject correlation (ISC) analysis [253] has been a cornerstone tool for naturalistic paradigms because of its ability to characterize the shared response across individuals. Group-level encoding models adopt an alternative approach for capturing shared response, one grounded in out-of-sample prediction and generalization [242]. This allows them to model neural activity beyond a constrained stimulus set. However, there is a clear gap between the two mediums of analysis. While ISC analysis suggests that large areas of the cortex exhibit fluctuations that are consistent across subjects, existing neural encoding models have largely focused on predicting activity within pre-defined functional areas of the brain such as visual and auditory cortices. It is unclear how they may be scaled to develop a single predictive model for whole-brain neural responses, given that naturalistic scenes produce wide-spread cortical activations. In this paper, we aim to fill this gap: provided adequate characterization of stimuli, we hypothesize that the stable component of neural activity across a subject population, i.e., the stimulus-related activity, should be predictable.
In the present study, we aim to quantify and improve the encoding of this wide-spread stimulus-driven cortical activity using rich stimulus descriptions.

Brain responses in real-world conditions are highly complex and variable. Owing to their high expressive capacity, deep neural networks (DNNs) are well-suited to model the complex high-dimensional nature of neural activity in response to the multitude of signals encountered during movie-watching. Recently, DNNs optimized for image or sound recognition have emerged as powerful models of computations underlying sensory processing [243, 245, 246, 248], surpassing traditional models of image or sound representation based on Gabor filters [244] and spectrotemporal filters [254], respectively, in higher-order processing regions. In this approach, the stimuli presented during brain activity recordings are fed as input to pre-trained neural networks and activations of individual layers are linearly transformed into predictions of neural responses in different regions of the brain. This approach affords a useful interpretation of these feature spaces as outcomes of a task-constrained optimization, shedding light on how high-level behavioral goals, such as recognition, may constrain representations in neural systems [243]. While useful, task-driven features may diverge from optimal neural representations and tuning these features to better match the latter may be both feasible and beneficial [255]. This approach can help bridge the quantitative gap in explaining neural responses under realistic conditions while improving our understanding of the nature of information processing in the brain.

From a purely modeling standpoint, our methodological innovations are threefold. First, we propose an end-to-end deep learning-based encoding model that extracts semantic feature maps from audio and visual recognition networks and refines them jointly to predict the evoked brain response.
To this effect, we demonstrate that using different modalities concurrently leads to improvements in brain encoding. Second, we note that cognitive perception during movie-watching involves maintaining memory over time and demonstrate the suitability of recurrent neural networks (RNNs) to capture these temporal dynamics. Finally, based on existing evidence of hierarchical information processing in visual and auditory cortices [246, 248], we adopt features at multiple levels of abstraction rather than low level or high level stimulus characteristics alone. We embed these inductive biases about hierarchy, long-term memory and multi-modal integration into our neural architecture and demonstrate that this comprehensive deep learning framework generalizes remarkably well to unseen data. Specifically, using fMRI recordings from a large cohort of subjects in the HCP, we build group-level encoding models that reliably predict stimuli-induced neuronal fluctuations across large parts of the cortex. As a demonstration of application, we employ these encoding models to predict neural activity in response to other task-based stimuli and report excellent transferability of these models to artificial stimuli from constrained cognitive paradigms. This further suggests that these encoding models are able to capture high-level mechanisms of sensory processing.

Approaching multi-sensory perception through the predictive lens of encoding models has several advantages. Because of their unconstrained nature, encoding models can enable data-driven exploration and catalyze new discoveries. Using six neural encoding models with different temporal scales and/or sensory inputs, trained only on ∼36 minutes of naturalistic data per subject, we can replicate findings from a large number of prior studies on sensory processing.
First, by prominently highlighting the transition from short to long temporal receptive windows as we move progressively from early to high-level auditory areas, we can distinguish the cortical temporal hierarchy. Next, by differentiating uni-sensory cortices from multi-sensory regions such as the superior temporal sulcus and angular gyrus, we can reproduce the multi-modal architecture of the brain. Finally, by synthesizing neural responses to arbitrary stimuli such as faces, scenes or speech, we can demonstrate the functional specialization of known brain regions for processing of these distinct categories. Altogether, our results highlight the advantages and ubiquitous applications of DNN encoding models of naturalistic stimuli.

4.2.2 Materials and Methods

Dataset

We study high-resolution 7T fMRI data of 158 individuals from the Human Connectome Project movie-watching protocol comprising 4 audio-visual movie scans [256, 257]. The movies represent a diverse collection, ranging from short snippets of Hollywood movies to independent Vimeo clips. All fMRI data were preprocessed following the HCP pipeline, which includes motion and distortion correction, high-pass filtering, head motion effect regression using the Friston 24-parameter model, automatic removal of artifactual timeseries identified with Independent Component Analysis (ICA) as well as nonlinear registration to the MNI template space [257]. Complete data acquisition and preprocessing details are described elsewhere [256, 257]. Finally, whole-brain fMRI volumes of size 113x136x113 are used as the prediction target of all proposed encoding models. Rest periods as well as the first 20 seconds of every movie segment were discarded from all analyses, leaving ∼12 minutes of audio-visual stimulation data per movie paired with the corresponding fMRI response.
We estimated a hemodynamic delay of 4 sec using ROI-based encoding models, as the response latency that yields highest encoding performance (Figure S2, see Supplementary Information for details). Thus, all proposed models are trained to use the above stimuli to predict the fMRI response 4 seconds after the corresponding stimulus presentation. We train and validate our models on 3 audio-visual movies with a 9:1 split and evaluate our models on the first three clips of the held-out test movie. Since the last clip in the held-out movie is repeated within the training movies, we excluded it from our analysis.

Methodology

We train six encoding models employing different facets of the complex, dynamic movie stimulus. These include: (1) Audio-1sec and (2) Audio-20sec models, which are trained on single audio spectrograms extracted over 1-second epochs and contiguous sequences of 20 spectrograms spanning 20 seconds respectively; (3) Visual-1sec and (4) Visual-20sec models, trained with last frames of 1-second epochs and sequences of 20 evenly spaced frames within 20-second clips respectively; (5) Audiovisual-1sec and (6) Audiovisual-20sec models, which employ audio and visual input as described above, jointly. All models are trained to minimize the mean squared error between the predicted and measured whole-brain response. Figure 4.1 depicts the overall methodology for training different encoding models.

Stimuli

Audio

We extract mel-spectrograms over 64 frequency bands between 125-7500 Hz from sound waveforms to represent auditory stimuli in ∼1 second epochs, following [258]. The audio spectrogram is treated as a single grayscale 96x64 image, denoted by $x^a_t$, for the short duration model. For the longer-duration model, the input is simply a contiguous sequence of 20 of these grayscale images, represented as $s^a_t = \{x^a_i\}_{i=t-19}^{t}$.
This representation of auditory input is also supported by strong evidence that suggests the cochlea may be providing a spectrogram-like input to the brain for information processing [259].

Visual

All videos were collected at 24 fps. We extract the last frame of every second of the video as a 720x1280x3 RGB input, denoted by $x^v_t$, for the 1-sec models. We emphasize that the input here is a single RGB frame and we are using the 1-sec terminology only to be consistent with the nomenclature for audio models. We further arrange the last frame of every second in a 20-second clip into a sequence of 20 images, denoted by $s^v_t = \{x^v_i\}_{i=t-19}^{t}$, to represent the continuous stream of visual stimuli. These are presented to the longer-duration Visual-20sec and Audiovisual-20sec models. The inputs to the Audio-1sec, Visual-1sec, Audio-20sec, Visual-20sec, Audiovisual-1sec and Audiovisual-20sec models are thus given as $x^a_t$, $x^v_t$, $s^a_t$, $s^v_t$, $\{x^a_t, x^v_t\}$ and $\{s^a_t, s^v_t\}$ respectively.

Figure 4.1: Schematic of the proposed models. (A) The short-duration (1-sec) auditory and visual models take a single image or spectrogram as input, extract multi-scale hierarchical features and feed them into a Convolutional Neural Network (CNN)-based response model to predict the whole-brain response. (B) The long-duration (20-sec) uni-modal models take a sequence of images or spectrograms as input, feed their hierarchical features into a recurrent pathway and extract the last hidden state representation for the response model. (C) The short-duration multi-modal model combines uni-modal features and passes them into the response model. (D) The long-duration multi-modal model combines auditory and visual representations from the recurrent pathways for whole-brain prediction. Architectural details, including the feature extractor and convolutional response model, are provided in Supplementary Information.
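The pairing of stimulus windows with delayed responses described in this section can be sketched as follows. This is an illustration only: it assumes one stimulus item (frame or spectrogram) and one fMRI sample per second, which the text does not state explicitly, and the function name build_training_pairs is hypothetical.

```python
def build_training_pairs(stimuli, responses, window=20, delay=4):
    """Pair the stimulus window ending at second t with the response at t + delay.

    stimuli[i]   : stimulus item for second i (a frame or a spectrogram)
    responses[i] : fMRI sample for second i (assumed 1 sample per second)
    window=20 gives the 20-sec sequence inputs; window=1 gives the 1-sec inputs.
    """
    pairs = []
    for t in range(window - 1, len(stimuli)):
        target = t + delay
        if target >= len(responses):
            break  # no response is available this far ahead
        sequence = stimuli[t - window + 1 : t + 1]  # items t-19 .. t when window=20
        pairs.append((sequence, responses[target]))
    return pairs
```

With window=20 each pair holds the sequence of items from t-19 to t, matching the definition of the 20-sec inputs above, and the target is the response 4 seconds after the last item in the window.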
Audio-1sec and Visual-1sec models

Neural encoding models comprise two components: a feature extractor, which pulls out relevant features, s, from raw images or audio waveforms, and a response model, which maps these stimulus features onto brain responses. In contrast to existing works that employ a linear response model [245, 248], we propose a Convolutional Neural Network (CNN)-based response model where stimulus features are mapped onto neural data using non-linear transformations. Previous studies have reported a cortical processing hierarchy where low-level features from early layers of a CNN-based feature extractor best predict responses in early sensory areas while semantically rich deeper layers best predict higher sensory regions [246, 248]. To account for this effect, we employ a hierarchical feature extractor based on feature pyramid networks [260] that combines features from early, intermediate and later layers simultaneously. The detailed architectures of both components, including the feature extractor and convolutional response model, are described in Figure S3. We employ state-of-the-art pre-trained ResNet-50 [261] and VGG-ish [258] architectures in the pyramid network to extract multi-scale features from images and audio spectrograms, respectively. The base architectures were selected because pre-trained weights of these networks optimized for behaviorally relevant tasks (recognition) on large datasets, namely ImageNet [262] and YouTube-8M [263], were publicly available. ResNet-50 was trained on image classification with 1000 classes, while the VGG-ish network was pre-trained on audio event recognition with ∼30K categories. Further, owing to computational and memory constraints, the ResNet-50 was frozen during training across all models. On the other hand, we were able to fine-tune the VGG-ish network in both the Audio and Audiovisual encoding models.
We note that in contrast to images, there is a clear asymmetry in the axes of a spectrogram, where the distinct meanings of time and frequency might warrant 1D convolutions over time instead of 2D convolutions over both frequency and temporal axes. However, we found the benefits of a pre-trained network to be substantial in training convergence time and hence did not explore more appropriate architectures.

Audio-20sec and Visual-20sec models

Audio-20sec and Visual-20sec models employ the same feature extractor and CNN response model as their 1-second counterparts. However, here, the feature extraction step is applied on each image in a sequence of 20 frames, followed by a long short-term memory (LSTM) module to model the temporal propagation of these features. The output dimensions of the LSTM unit are set to 1024 and 512 for the visual and auditory models respectively, to ensure an equitable comparison with the corresponding 1-sec models. The last hidden state output of this LSTM unit is fed into the CNN response model with the same architecture as the 1-sec models.

Audiovisual-1sec and Audiovisual-20sec models

Meaningful comparison across different models requires the control of as many design choices as possible. To ensure fair comparisons, the Audiovisual-1sec model employs the same feature extractors as the Visual-1sec and Audio-1sec models. The only difference here is that the corresponding 1024-D and 512-D feature representations are concatenated before presentation to the CNN response model. The concatenated features are passed into a bottleneck layer that reduces the final feature dimensionality to the maximum among the audio and visual feature dimensions, i.e., 1024, so that the multi-modal model is not equipped with a higher-dimensional feature space than the largest uni-modal model. We note that the response model has the same architecture across all 6 proposed models.
Similarly, the Audiovisual-20sec model employs the same feature extraction scheme as the Visual-20sec and Audio-20sec models, but fuses the last hidden state output of the respective LSTM units by simple concatenation followed by a dense layer to reduce feature dimensionality to 1024 before feeding it into the response model.

Evaluation

We first evaluated the prediction accuracy of all models on the independent held-out movie by computing Pearson correlation coefficient (R) between the measured and predicted response at every voxel. Here, the ‘measured’ response refers to the group-averaged response across the same group of 158 subjects on which the models were trained. Comparison among these models enables us to tease apart the sensitivity of individual voxels to input timescales and different sensory stimuli. Voxel-level correlation coefficients between the predicted and measured responses were averaged to summarize the prediction accuracy of each model in relevant cortical areas (Figure 4.2B-F). For this region-level analysis, ROIs were derived with a comprehensive multi-modal parcellation of the human cortex [264], which was mapped onto the MNI-1.6 mm resolution template. We note that ROIs were employed only to interpret the results of the study and relate them to existing literature. We emphasize that all performance metrics reported henceforth are based on voxel-level correlations. It is important to note that prediction accuracy at every voxel is bounded by the proportion of non-stimulus related variance that reflects measurement noise or other factors. We thus also show the regional level performance of all models against the reliability (“noise ceiling”) of measured responses within those regions (Figure 4.3).

Noise ceiling estimation: The reliability of the group-averaged response at each voxel is estimated from a short 84-second clip that was repeatedly presented at the end of all movie sessions.
We compute an effective upper bound on our performance metric, i.e., the correlation coefficient, as the correlation between the measured fMRI response (group-mean) during different runs. We repeat this process 6 times (choosing pairs from 4 repeat measurements) to get a mean noise ceiling estimate per voxel, as shown in Figure 4.3D. We divide the voxel-level prediction accuracy (R) by this noise ceiling to get noise-normalized prediction accuracy of all models in left panels of Figure 4.3A-C. We note that this noise ceiling is computed on the repeated video clip, which is distinct from the test movie on which the model performance metrics are computed. Direct comparison against this noise ceiling can be sub-optimal, especially if the properties of the group-averaged response vary drastically across the two stimulus conditions. We address this limitation during model evaluation against data from a held-out independent group of subjects by computing a more suitable upper bound, which is achievable by a group-level encoding model (Figure S8, see Supplementary Information for more details). As we demonstrate in the results (Figure S8, S9), the trend and spatial distribution of model performance against noise ceiling remain unchanged across the model evaluation and noise ceiling estimation methods.

4.2.3 Results

Multi-sensory inputs and longer timescales lead to the best encoding performance with significant correlations across a large proportion of the stimulus-driven cortex

To gain quantitative insight into the influence of temporal history and multi-sensory inputs on encoding performance across the brain, we computed the mean prediction accuracy in five groups of regions defined as per the HCP MMP parcellation [264], namely, (1) auditory regions comprising both early and association areas, (2) early visual and visual association regions, (3) known multi-sensory sites and regions forming a bridge between higher auditory and higher visual areas, (4) language-associated regions, and (5) frontal cortical areas. As our research concerns stimulus-driven processing, only ROIs belonging to the “stimulus-driven” cortex were included in the above groups (Table S2, see Supplementary Information for the definition of “stimulus-driven” cortex). Groups 1 and 2, which are associated with a single modality (auditory or visual), do not show any marked improvement from audio-visual multi-sensory inputs and are best predicted by features of their respective sensory stimulus (Figure 4.2A,C). The performance boost with multi-sensory inputs is more pronounced in groups 3, 4 and 5 which are not preferentially associated with a single modality, but are involved in higher-order processing of sensory stimuli (Figure 4.2D-F).

Figure 4.2: Regional predictive accuracy for the test movie. (A),(C)-(F) depict quantitative evaluation metrics for all the proposed models across major groups of regions as identified in the HCP MMP parcellation (B). Predictive accuracy of all models is summarized across (A) auditory, (C) visual, (D) multi-sensory, (E) language and (F) frontal areas. Box plots depict quartiles and swarmplots depict mean prediction accuracy of every ROI in the group. For language areas (Group 4), left and right hemisphere ROIs are shown as separate points in the swarmplot because of marked differences in prediction accuracy. Statistical significance tests are performed to compare 1-sec and 20-sec models of the same modality (3 comparisons, results indicated with horizontal bars below the box plots) or uni-modal against multi-modal models of the same duration (4 comparisons, results indicated with horizontal bars above the box plots) using the paired t-test (p-value < 0.05, Bonferroni-corrected) on mean prediction accuracy within ROIs of each group.
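The evaluation quantities used throughout these results (voxel-wise Pearson correlation, the pairwise noise ceiling and region-level averages) can be sketched in plain Python as below. The function names and the dictionary-based layout of voxel time series are assumptions made for illustration and do not reflect the actual implementation.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length time series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def noise_ceiling(repeats):
    """Mean correlation over all pairs of repeated group-mean responses at one voxel.

    Four repeats give the six pairs mentioned in the text.
    """
    pair_corrs = [pearson(a, b) for a, b in combinations(repeats, 2)]
    return sum(pair_corrs) / len(pair_corrs)


def roi_mean_accuracy(predicted, measured, roi_voxels):
    """Average voxel-wise correlation over the voxels of one ROI.

    predicted and measured map a voxel identifier to its time series.
    """
    corrs = [pearson(predicted[v], measured[v]) for v in roi_voxels]
    return sum(corrs) / len(corrs)
```

Under these definitions, the noise-normalized accuracy of a voxel is its pearson value on the test movie divided by its noise_ceiling value on the repeated clip.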
Further, temporal history of the stimulus yields consistent improvement in prediction performance in almost all groups of regions, albeit to different extents. Improvements in groups 3, 4 and 5 agree well with the idea that higher-order sensory processing as well as cognitive and perceptual processes, such as attention and working memory, hinge upon the history of sensory stimuli; therefore, accumulated information benefits response prediction in regions recruited for these functions. Further, both auditory and visual association cortices are known to contain regions that are responsive to sensory information accumulated over the order of seconds [265]. This potentially explains the significant improvement observed for long-timescale encoding models compared to their short-timescale counterparts in these sensory cortices (Figure 4.4). Together, the Audiovisual-20sec model integrating audio-visual multi-sensory information over longer timescales yields the maximum prediction accuracy (R) and the highest percentage (∼83%) of significantly predicted voxels across the stimulus-driven cortex (Figure 4.3E), suggesting that the Audiovisual-20sec model can adequately capture complementary features of each additional facet (multi-sensory stimuli / temporal information) of the sensory environment.

Longer timescales improve encoding performance, particularly in higher order auditory areas

As a movie unfolds over time, the dynamic stream of multi-modal stimuli continuously updates our neural codes. Evidence from neuroimaging experiments suggests that different brain regions integrate information at different timescales; a cortical temporal hierarchy is reported for auditory perception where early auditory areas encode short timescale events while higher association areas process information over longer spans [266]. This temporal gradient of auditory processing is well-replicated within our study.
Comparison of 1-sec and 20-sec models allows us to distinguish brain regions that process information at shorter timescales from those that rely on longer dynamics. There is a small, albeit significant, contribution of longer timescale inputs to prediction correlations in regions within early auditory cortex, such as A1, LBelt, PBelt, MBelt and Retro-insular cortex (RI) (Figures 4.3A, 4.4A), in line with previous reports suggesting short temporal receptive windows (TRWs) of early sensory regions [266]. Shorter integration windows are in agreement with the notion that these regions facilitate rapid processing of the instantaneous incoming auditory input. In contrast, response in voxels within auditory association ROIs lying mainly in the superior temporal sulcus or along the temporal gyrus (A4, A5, STSda, STSva, STSdp, STSvp, STGa, TA2) is seen to be much better predicted with longer timescales (Figures 4.3A, 4.4A). Cumulatively across association ROIs, the Audio-20sec model yields a highly significant improvement in prediction accuracy (∼50%) over the Audio-1sec model, in comparison to a smaller improvement (∼5%) across early auditory ROIs.

Figure 4.3: Model prediction accuracy in standard brain space. Left panel depicts the predictive accuracy of uni-modal (A,B) and multi-modal (C) models over the whole brain in the test movie. Colors on the brain surface indicate the Pearson correlation coefficient between the predicted timeseries at each voxel and the voxel’s true timeseries normalized by the noise ceiling (D) computed on repeated validation clips. Only significantly predicted voxels (p-value < 0.05, FDR-corrected) are colored. ROI box plots depict the un-normalized correlation coefficients between the predicted and measured response of voxels in each ROI and the respective noise ceiling for the mean.
(E) shows the percentage of voxels in stimulus-driven cortex that are significantly predicted by each model and mean prediction accuracy across the stimulus-driven cortex.

Figure 4.4: Influence of temporal history on encoding performance. (A) Mean predictive performance of Audio-1sec and Audio-20sec models in early auditory and association auditory cortex ROIs. A major boost in encoding performance is seen across auditory association regions with the 20-sec model. (B) Mean predictive performance of Visual-1sec and Visual-20sec models across ROIs in the dorsal, ventral and MT+ regions. Dorsal stream and MT+ ROIs exhibit a significant improvement with the Visual-20sec model but no effect is observed for the ventral stream. Box plots are overlaid on top of the beeswarm plot to depict quartiles. Horizontal bars indicate significant differences between models in the mean prediction accuracy within ROIs of each stream using the paired t-test (p-value < 0.05).

Longer timescales lead to significantly better predictions in the dorsal visual stream and MT+ complex

The distinct association of the dorsal visual stream with spatial localization and action-oriented behaviors and of the ventral visual stream with object identification is well documented in the literature [267]. Another specialized visual area is the middle temporal complex (MT+), which has been shown to play a central role in motion processing. The functional division between these streams thus suggests a stronger influence of temporal dynamics on responses along the dorsal pathway and MT+ regions. To test this hypothesis, we contrast the encoding performance of Visual-1sec and Visual-20sec models across the three groups by averaging voxel-wise correlations in their constituent ROIs.
In accordance with the dorsal/ventral/MT+ stream definition in the HCP MMP parcellation, we use the following ROIs for analysis: (a) dorsal: V3A, V3B, V6, V6A, V7, IPS1 (b) ventral: V8, Ventral Visual Complex (VVC), PIT complex, Fusiform Face Complex (FFC) and Ventro-medial Visual areas 1,2 and 3 (c) MT+: MT, MST, V4t, FST. Figure 4.4B demonstrates the distribution of mean correlations over these ROIs for different models and streams. Our findings suggest that temporal history, as captured by the Visual-20sec model, can be remarkably beneficial to response prediction across the dorsal visual stream (∼30% improvement over Visual-1sec model) and the MT+ complex (∼62% improvement over Visual-1sec model), in agreement with our a priori hypothesis. Further, in our experiments, no marked improvement was observed for the ventral visual stream, indicating a non-significant influence of temporal dynamics on these regions.

Auditory and visual stimuli features jointly approach the noise ceiling in multi-sensory areas

Examining prediction accuracy against response reliability allows us to quantify how far we are from explaining predictable neural activity. A high fraction of the stimulus-driven cortex (∼83%) is predictable with a longer timescale input and joint audiovisual features. Notably, areas extending anteriorly and posteriorly from the primary auditory cortex such as the posterior STS, STGa and TA2 achieve prediction correlations close to the noise ceiling with the Audiovisual-20sec model (Figure 4.3C), suggesting that DNN representations are remarkably suited to encode their response. Interestingly, performance in auditory regions is much closer to the noise ceiling than visual regions. Understanding audition and vision in the same space further allows us to appreciate the differences between these modalities. While this may suggest that audition is perhaps a simpler modality to model, the differences could also result from a bias of the dataset. A more diverse sampling of acoustic stimuli in the training set could allow the model to generalize better in auditory regions. Furthermore, in contrast to auditory stimulation where all subjects hear the same sounds, visual stimulation can elicit highly varied responses dependent on gaze location. This variability could plausibly make group-level visual encoding a more difficult task.

Figure 4.5: Sensitivity of ROIs to different sensory inputs. (A) Predictive accuracy (R) of audiovisual encoding model with and without input distortions, (B) Sensory sensitivity index of different brain regions as determined using performance metrics under input distortion (see Supplementary Information for details). Regions dominated by a single modality are shown in darker colors, whereas light-colored regions are better predicted by a combination of auditory and visual information. Red indicates auditory-dominant regions whereas blue indicates visual dominance.

Joint encoding models tease apart the modal sensitivity of voxels throughout the sensory cortex

Neural patterns evoked by movies are not simply a conjunction of activations in modality-specific cortices by their respective uni-sensory inputs; rather, there are known cross-modal influences as well as regions that receive afferents from multiple senses [268]. Can we interrogate a joint encoding model to reveal the individual contribution of auditory and visual features in encoding response across different brain regions? To address this question, we shuffled inputs of either modality along the temporal axis during inference. We measured test performance of the trained audio-visual model on predictions generated by shuffling inputs of one modality while keeping the other one intact.
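The shuffling procedure just described can be illustrated with a minimal sketch. It is not the implementation used in this work: `predict` is a hypothetical stand-in for the trained audio-visual model, the helper names are ours, and accuracy is computed for a single voxel with a plain Pearson correlation. The sensory-sensitivity index itself is defined in the Supplementary Information and is not reproduced here.

```python
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0


def shuffle_in_time(features, seed=0):
    """Return a copy of a per-time-step feature list with time steps permuted."""
    shuffled = list(features)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def perturbation_scores(predict, audio, visual, measured, seed=0):
    """Accuracy for one voxel with intact inputs and with one modality shuffled.

    `predict(audio, visual)` is a hypothetical callable standing in for the
    trained audio-visual model; it returns one predicted value per time step.
    """
    return {
        "intact": pearson(predict(audio, visual), measured),
        "audio_shuffled": pearson(
            predict(shuffle_in_time(audio, seed), visual), measured),
        "visual_shuffled": pearson(
            predict(audio, shuffle_in_time(visual, seed)), measured),
    }
```

Under this sketch, a voxel whose accuracy drops only when the audio stream is shuffled would be read as auditory-dominant, and one that drops under either distortion as sensitive to both modalities.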
This distortion at test time allows us to identify areas that are preferentially associated with either visual or auditory modality. We hypothesized that regions encoding multi-sensory information will incur loss in prediction accuracy upon distortion of both auditory and visual information. Further, uni-sensory regions will likely be adversely affected by distortion of either auditory or visual information but not both. To test this hypothesis, we further developed a sensory-sensitivity index that directly reflects the sensitivity of individual brain regions to information about auditory or visual stimuli (see Supplementary Information for details). For this examination, we utilized the Audiovisual-1sec model to avoid potential confounds associated with temporal history, although analysis of the Audiovisual-20sec model showed similar results. Figure 4.5 demonstrates the result of this analysis on sensory-specific regions as well as regions known for their involvement in multi-sensory integration. The benefit from (non-distorted) multi-sensory inputs to the prediction correlations of the Audio-visual model is most remarkably seen in posterior STS, STGa and sensory-bridge regions such as the temporal-parietal-occipital junction (TPOJ1-3) and superior temporal visual (STV) area. Another region that seems to be employing features of both modalities, albeit to a lesser extent, is the frontal eye field (FEF), whose recruitment in audiovisual attention is well studied [269].

Classically, multi-sensory integration hubs are identified as regions that show enhanced activity in response to multi-sensory stimulation as opposed to presentation of either uni-sensory stimulus based on some statistical criteria [270]. Accordingly, the posterior STS is consistently described as a multi-sensory convergence site for audio-visual stimuli [250, 268, 270, 271]. Its role in audiovisual linguistic integration has also been well-studied in the literature [269].
Other multi-sensory integration sites reported extensively in prior literature include the temporoparietal junction [250, 268, 269] and superior temporal angular gyrus [272]. Our findings above lend strong support for the multi-sensory nature of all these regions.

Encoding models as virtual neural activity synthesizers

Next, we sought to characterize whether encoding models can generalize to novel task paradigms. By predicting neural activity for different visual categories from the category-specific representation task within the HCP Working Memory (WM) paradigm, we generated synthetic functional localizers for the two most common visual classes: faces and places. Specifically, we predict brain response to visual stimuli, comprising faces, places, tools and body parts, from the HCP task battery. We use the predicted response to synthesize contrasts (FACES-AVG and PLACES-AVG) by computing the difference between mean activations predicted for the category of interest (faces or places respectively) and the average mean activations of all categories at each voxel (Figure 4.6). The predicted and measured contrasts are thresholded to keep the top 5%, 10% or 15% most activated voxels. We report the Dice overlap between the predicted and measured contrasts for each of these threshold values to quantify the agreement between these cortical maps. We also computed the Dice overlap of the predicted contrast for each experiment against all 86 measured tfMRI contrasts provided as part of the HCP task battery in order to assess the identifiability of the synthetic contrast.

Figure 4.6: Encoding models as virtual brain activity synthesizers. (A) Synthetic contrasts are generated from trained encoding models by contrasting their “synthesized” (i.e., predicted) response to different stimulus types. (B) Comparison of the synthesized contrast for ‘speech’ against the speech association template on Neurosynth, both thresholded to keep the top 5%, 10% or 15% most activated vertices. (C-D) compare the synthesized contrasts for ‘faces’ and ‘places’ against the corresponding contrasts derived from HCP tfMRI experiments, both thresholded to keep the top 5%, 10% or 15% most activated vertices. Vertices activated in only the synthetic or only the measured contrast maps are shown in red and blue colors respectively whereas yellow indicates the overlap. Corresponding Dice scores are displayed alongside the surface maps. Distributions of Dice overlap scores between the synthetic map and all 86 HCP tfMRI contrast maps are shown as histograms at each threshold level. Red arrow points to the Dice overlap between the synthetic contrast and HCP tfMRI contrast for the same condition. In all cases, the synthetic contrast exhibits the highest agreement with the tfMRI contrast that it was generated to predict.

We observe a notable overlap between the synthetic and measured group-level contrasts. Importantly, we find that the synthetic contrasts for ‘FACES-AVG’ and ‘PLACES-AVG’ are identifiable in that the synthetic contrast exhibits the highest agreement with the measured contrast of the same condition. Further, our findings are consistent with the well-known cortical specificity of neuronal activations for processing of faces and places. Both the synthetic and measured faces contrasts are consistent with previously identified regions for face-specific processing, including the fusiform face area (corresponds to fusiform face complex (FFC) in Figure 4.6), the occipital face area in lateral occipital cortex (overlaps with the PIT complex in HCP MMP parcellation), and regions within temporo-parieto-occipital junction and STS [273, 274]. Among these, the selective role of the Fusiform Face Area in face processing has been most consistently and robustly established.
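The thresholding and Dice comparison used above to assess agreement and identifiability of the synthetic contrasts can be sketched as follows. This is an illustrative sketch only: the function names are ours, maps are represented as flat lists of per-vertex values, and ties at the threshold are broken by sort order, which may differ from the procedure actually used.

```python
def top_fraction(values, fraction):
    """Indices of the `fraction` most activated vertices (0.05 for the top 5%)."""
    k = max(1, round(len(values) * fraction))
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return set(order[:k])


def dice(a, b):
    """Dice overlap 2|A ∩ B| / (|A| + |B|) between two sets of vertex indices."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def identify(synthetic, measured_contrasts, fraction):
    """Find the measured contrast that best matches a synthetic contrast map.

    `measured_contrasts` maps a contrast name to its per-vertex values.
    Returns the best-matching name and the Dice score for every contrast.
    """
    syn = top_fraction(synthetic, fraction)
    scores = {name: dice(syn, top_fraction(values, fraction))
              for name, values in measured_contrasts.items()}
    return max(scores, key=scores.get), scores
```

A synthetic contrast is identifiable in the sense used here when the name returned by `identify` is the condition the contrast was generated to predict.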
Another region known to respond more strongly to faces than other object categories, namely posterior STS, has been previously implicated in processing of facial emotions [273]. Similarly, both synthetic and measured places contrasts highlight cortical regions thought to be prominent in selective processing of visual scenes. These include the parahippocampal areas (PHA1-3), retrosplenial cortex (POS1 in HCP MMP parcellation) and the transverse occipital sulcus (TOS), which comprises the occipital place area (OPA) [275].

Cortical areas related to speech processing are similarly discovered using our models by contrasting activations predicted for speech stimuli against non-speech stimuli such as environmental sounds (Figure 4.6B, see Supplementary Information for more details). The synthetic contrast shows increased activation in language-related areas of the HCP MMP parcellation such as 55b, 44 and the superior frontal language (SFL) area with left-lateralization, in accordance with previous language fMRI studies [276]. In addition, areas tuned for voice processing in STS [277] are also highlighted. The synthetic map also shows highest correlation with ‘speech’ on Neurosynth term-based meta-analysis [278] and overlaps considerably with the speech association template on the platform. Taken together, these experiments illustrate the potential of encoding models to simulate contrasts and reconcile contrast-based studies with naturalistic experiments.

Additional analyses

In prior studies, neural response prediction is done via regularized regression, where the signal at each voxel is modeled as a weighted sum of stimulus features with appropriate regularization on the regression weights.
Following earlier works, we also train l2-regularized regression models using features derived from hierarchical convolutional networks trained on image or sound recognition such as those used in the proposed models, as well as semantic category features labelled using the WordNet semantic taxonomy similar to [279]. The latter are typically used for mapping the semantic tuning of individual voxels across the cortex. Our models consistently outperform the baselines, further illustrating the benefits of the proposed methodology (Figure S4(A)-(C), see Supplementary Information for more details). Additionally, we performed ablation studies to understand the influence of different network components, namely the “non-linear” response model as well as the “hierarchical” feature extractor, on model prediction performance and found that both components improve performance, although their relative contribution is stronger in visual encoding models than auditory models (Figure S4(D), see Supplementary Information for more details). The superior predictive performance of our models in comparison to the classical approach along with our ablation studies suggest that an interplay of end-to-end optimization with a non-linear response model can jointly afford improved generalization performance.

To test the generalizability of the models beyond the subject population they were trained on, we further compared the predictions of all models against the group-averaged response of a held-out group within HCP comprising 20 novel subjects distinct from the 158 individuals used in the training set, on the same independent held-out movie. The noise ceiling for this group was computed as the correlation coefficient between the mean measured response for the independent test movie across all 158 subjects in the training set and the group-averaged response computed over the 20 new subjects.
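The held-out noise ceiling defined above reduces, for a single voxel, to a correlation between two group-averaged time series. The sketch below illustrates that computation under simplifying assumptions: each subject's response at one voxel is a list of values over the test movie, and the function names are ours rather than taken from the analysis code.

```python
def group_mean(responses):
    """Average per-subject time series (one equal-length list per subject)."""
    n = len(responses)
    return [sum(values) / n for values in zip(*responses)]


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0


def heldout_noise_ceiling(train_responses, heldout_responses):
    """Correlation between the training-group mean and the held-out-group mean.

    Both arguments hold one voxel's response to the same test movie, one
    time series per subject (158 and 20 subjects in the setting above).
    """
    return pearson(group_mean(train_responses), group_mean(heldout_responses))
```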
This metric captures the response component shared across independent groups of subjects and thus reflects the true upper bound achievable by a group-level encoding model. As shown in Figure S8 (see Supplementary Information for more details), the models can accurately predict neural responses as measured with respect to the group mean of the held-out subjects, with the Audiovisual-20sec model performance even approaching noise ceiling in some regions, particularly the higher-order auditory association regions and multi-sensory sites such as the posterior STS. Importantly, the predictivities across the cortical surface are consistent with the performance metrics reported for the training subject population in Figure 4.3. Finally, by comparing model predictions against neural responses at the single subject level for subjects from the held-out group, we further demonstrate that the Audiovisual-20sec model can also successfully capture the response component that individual subjects share with the population (Figure S10, see Supplementary Information for details).

4.2.4 Discussion

Free viewing of dynamic audio-visual movies enables an ecologically valid analysis of a collective set of functional processes at once, including temporal assimilation and audio-visual integration in addition to momentary sensory-specific processing. Perception, under such stimulation, thus recruits sensory systems as well as areas subserving more sophisticated cognitive processing. Building quantitatively accurate models of neural response across widespread cortical regions to such real-life, continuous stimuli thus requires an integrated modelling of these disparate computations on sensory inputs. In this paper, we have presented six deep neural network-based encoding models with varying sensory and temporal information about the audio-visual stimulus.
Subsequently, we queried the role of input history and different sensory information on prediction performance across individual regions of the cortex. We have shown that exploiting the richness of the stimulus along the time axis and sensory modalities substantially increases the predictive accuracy of neural responses throughout the cortex, going so far as to approach the noise ceiling for voxels in some known multi-sensory sites, such as the posterior STS [250, 268, 270, 271].

Auditory and visual scenes are the principal input modalities to the brain during naturalistic viewing. Yet, existing encoding models ignore their interactions. We employ a common strategy in multi-modal machine learning settings, namely feature fusion, to jointly model auditory and visual signals from the environment. We find that minimizing the prediction error is an effective guiding principle for learning useful joint representations from an audio-visual stimulation sequence and demonstrate that models that consume multi-modal signals concurrently, namely, Audiovisual-1sec and Audiovisual-20sec, can not only predict the respective uni-modal cortices slightly better but also lead to remarkable improvements in predicting response of multi-sensory and frontal brain regions (Figure 4.2). Further, we show that multi-modal neural encoding models not only boost performance in large areas of the cortex relative to their uni-modal counterparts (Figures 4.2, 4.3E), but also shed light on how neural resources are spatially distributed across the cortex for dynamic multi-sensory perception (Figure 4.5). The predictivity of different sensory inputs for neural response, as evaluated on independent held-out data, can facilitate reverse inference by identifying the sensory associations of different brain regions, providing clues into the multi-sensory architecture of the cortex.
By comparative analysis of predictive performance in different regions across models (Figure 4.2) as well as perturbation analysis within the multi-modal model (Figure 4.5), we identify a number of regions that are consistently sensitive to both auditory and visual information, most notably the superior temporal sulcus and some frontal regions. Regions within inferior frontal cortex have been implicated in the processing of visual speech, guiding sensory inferences about the likely common cause of multi-modal auditory and visual signals, as well as resolving sensory conflicts [280]. Prior research has also implicated an extensive network of inferior frontal and premotor regions in comprehending audiovisual speech, suggesting that they bind information from both modalities [281]. While unveiling the causal sequence of events for a mechanistic understanding of multi-sensory perception is not possible with the proposed approach, our findings align well with commonly held theories of sensory fusion which suggest that uni-sensory signals are initially processed in segregated regions and eventually fused in regions within superior temporal lobe, occipital-temporal junction and frontal areas [268]. This proposition is corroborated by our experiments as response prediction in these regions is best achieved by a combination of both sensory inputs (Figures 4.3, 4.5).

A linear response model with pre-trained and non-trainable feature extractors, while simple and interpretable, imposes a strong constraint on the feature-response relationship. The underlying assumption is that neural networks optimized for performance on behaviorally relevant tasks are mappable to neural data with a linear transform. We designed a flexible model, capable of capturing complex non-linear transformations from stimulus feature space to neural space, leading to more quantitatively accurate models that are better aligned with sensory systems.
Even better accounts of cortical responses are then obtained by interlacing dynamic, multi-modal representation learning with whole-brain activation regression in an end-to-end fashion. Using these rich stimulus descriptions, we demonstrated a widespread predictability map across the cortex that covers a large portion (∼83%) of the stimulus-driven cortex (Figure 4.3C,E), including association and some frontal regions. While inter-subject correlations in these regions are frequently reported [253, 282], suggesting their involvement in stimulus-driven processing, response predictability in these areas had remained elusive so far. Further, the cortical predictivity is maintained even as we compare model predictions against neural responses of held-out subjects (Figure S8 and S10), suggesting that the proposed models are capable of successfully capturing the “shared” or stimulus-driven response component. These results provide compelling evidence that deep neural networks trained end-to-end can learn to capture the complex computations underlying sensory perception of real-life, continuous stimuli.

We further demonstrated that encoding models can form an alternative framework for probing the timescales of different brain regions. While primary auditory and auditory belt cortex (comprising A1, PBelt, LBelt, MBelt) as well as the ventral visual stream benefit only marginally from temporal information, there is a remarkable improvement in prediction performance in auditory and visual association and pre-frontal cortices, most notably in superior temporal lobe, visuomotor regions within the dorsal stream such as V6A, temporal parietal occipital junction and inferior frontal regions. The improvement in prediction performance with the 20-second input is consistently seen for both uni-modal and multi-modal models.
It is important to acknowledge that directly comparing the prediction accuracies of static (1-sec) and recurrent (20-sec) models to infer processing timescales of different brain regions has its limitations. First, this analysis can be confounded by the slow hemodynamic response as performance improvement may be driven in part by the slow and/or spatially varying dynamics. Based on our analysis with ROI-level encoding models, the latter seems like a less plausible explanation (Figure S2, see Supplementary Information for details). Further, we performed additional analyses to understand the relationship between performance improvement in individual voxels and their autocorrelation properties and found a strong correspondence between the two, suggesting that the distribution of performance improvement across the cortex broadly agrees with processing timescales (Figure S6, see Supplementary Information for details).

Predictions from long-timescale models are based on temporal history as provided in stimulus sequences, and not just the instantaneous input. Modeling dynamics within these sequences appropriately is crucial to probe effects of temporal accumulation. RNNs have internal memories that capture long-term temporal dependencies relevant for the prediction task, in this case, encoding brain response, while discarding task-irrelevant content. We compare this modeling choice against a regularized regression approach on stimulus features concatenated within T-second clips, with T ranging between 1 and 20 (Figure S4, see Supplementary Information for details).
The inferior performance compared to our proposed models as well as a non-increasing performance trend against T for these linear models indicate that accumulation of temporal information by simply concatenating stimulus features over longer temporal windows is insufficient; rather, models that can efficiently store and access information over longer spans, such as RNNs with sophisticated gating mechanisms, are much more suitable for modeling neural computations that unfold over time. Since activations of units within RNNs depend not only on the incoming stimulus, but also on the “current” state of the network as influenced by past stimuli, they are capable of holding short-term events in memory. Adding the RNN module can thus be viewed as augmenting the encoding models with working memory. Investigating timescales of representations across brain regions by understanding the influence of contextual representations on language processing in the brain, as captured by LSTM language models for instance, has become a major research focus recently [283]. In these language encoding models for fMRI, past context has been shown to be beneficial in neural response prediction, surpassing word embedding models. However, models that explain neural responses under dynamic natural vision while exploiting the rich temporal context have not yet been rigorously explored with human fMRI datasets. In a previous study with awake mice, recurrent processing was shown to be useful in modelling the spiking activity of V1 neurons in response to natural videos [284]. In dynamic continuous visual stimulation fMRI paradigms, a common practice is to concatenate multiple delayed copies of the stimulus to model the hemodynamic response function as a linear finite impulse response (FIR) function [279].
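The delayed-copy construction mentioned above can be made concrete with a small sketch. It assumes one feature vector per time step and zero-padding before stimulus onset; the function name and the specific set of delays are illustrative choices, not details taken from [279].

```python
def delayed_design(features, delays):
    """Concatenate delayed copies of per-time-step feature vectors.

    features: list of T feature vectors, each a list of length D.
    delays:   non-negative lags, in time steps (an illustrative choice).
    Returns T rows of length D * len(delays); samples that would fall
    before the start of the stimulus are zero-padded.
    """
    delays = list(delays)
    dim = len(features[0])
    zero = [0.0] * dim
    rows = []
    for t in range(len(features)):
        row = []
        for d in delays:
            row.extend(features[t - d] if t - d >= 0 else zero)
        rows.append(row)
    return rows
```

Each row has D × (number of delays) entries, so a linear model fitted on this design needs one weight per feature per delay for every voxel, and the number of weights grows in proportion to the length of the temporal window.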
However, since the feature dimensionality scales linearly with time-steps, this approach is limited to HRF modeling and is not feasible to capture longer dynamics of the order of tens of seconds. Another approach is to employ features from neural networks trained on video tasks, such as action recognition [247]. However, these encoding models are constrained to capture one aspect of dynamic visual scenes and are likely useful to predict neural responses in highly localized brain regions. Most studies in visual encoding remain limited to static stimuli and evoked responses in relatively small cortical populations.

Our brain has evolved to process ‘natural’ images and sounds. In fact, recent evidence has shown that sensory systems are intrinsically more attuned to features of naturalistic stimuli and such stimuli can induce stronger neural responses than task-based stimuli [285]. Here, we demonstrate that encoding models trained with naturalistic data are not limited to modeling responses of their constrained stimulus set. Instead, by learning high-level concepts of sensory processing, these models can also generalize to out-of-domain data and replicate results of alternate task-bound paradigms. While our models were trained on complex and cluttered movie scenes, we tested their ability to predict response to relatively simple stimuli from the HCP Task battery, such as faces and scenes (Figure 4.6). The remarkable similarity between the predicted and measured contrasts in all cases suggests that ‘synthetic’ brain voxels, predicted by the trained DNNs, correspond well with the target voxels they were trained to model. We thus provide evidence that these encoding models encapsulate stimulus-to-brain relationships extending beyond the experimental world they were trained in. On the other hand, classical fMRI experiments, for instance task contrasts, do not generalize outside the experimental circumstance they were based on.
This preliminary evidence suggests that encoding models can serve as promising alternatives for circumventing the use of contrast conditions to study hypotheses regarding the functional specialization of different brain regions. Embedded knowledge within these descriptive models of the brain could also be harnessed in other applications, such as independent neural population control by optimally synthesizing stimuli to elicit a desired neural activation pattern [286].

With purely data-driven exploration of fMRI recordings under a hypothesis-free naturalistic experiment, our models replicate the results of previous neuroimaging studies operating under controlled task-based regimes. Our analysis lends support to existing theories of perception which suggest that primary sensory cortices build representations at short timescales and lead up to multi-modal representations in posterior portions of STS [266]. Encoding performance in these regions is consistently improved with longer timescales as well as multi-sensory information. We reasoned that regions that are sensitive to multi-modal signals and/or longer stimulus dynamics could be distinguished by interrogating the performance of these models on unseen data. To date, encoding models have been rarely used in this manner to assess integration timescales or sensory-sensitivity of different brain regions. Classically, processing timescales have been probed using various empirical strategies, for example, by observing activity decay over brief stimulus presentations or by comparing auto-correlation characteristics of resting-state and stimulus-evoked activity [287]. Further, multi-sensory regions are identified via carefully constructed experiments with uni-modal and multi-modal stimulus presentations, followed by analysis of interaction effects using statistical approaches [268].
Here, we suggest that encoding models can form an alternate framework to reveal clues into these functional properties that can be rigorously validated with future investigation.

As with interpreting the results of any predictive model, one should, however, proceed with caution. Sounds are generated by events; this implies that sound representations implicitly convey information about actions that generated them. Similarly, visual imagery provides clues into auditory characteristics, such as the presence or absence of speech. Thus, it is difficult to completely disentangle the individual contributions of auditory and visual features to prediction performance across cortical regions. Similarly, longer timescale inputs can lead to a more robust estimate of the momentary sensory signal, potentially confounding the interpretations of TRWs. Further, scanner noise can induce changes in BOLD signals across the auditory cortex and several studies have reported that the phenomenon is exacerbated at high field strengths [288]. Importantly, brain function requires additional attentional resources and increased listening effort in the presence of scanner noise and this may impact the processing of visual input as well, for example, by affecting fixation locations and/or prioritizing attentional deployment for auditory stimuli. Scanner noise can also reduce the sensitivity to stimuli of interest (“movies”) by causing non-stimulus-associated activations across the auditory cortex which may interfere in non-trivial ways with stimulus-induced activations. We do not expect this to impact the prediction correlations since the influence of scanner noise is expected to be independent of the stimulus characteristics; nonetheless, this is an important caveat of the proposed audio-based encoding models that hinders their ability to explain neural responses outside the scanner.
Notwithstanding these limitations, we contend that these models can serve as powerful hypothesis generation tools.

The methodological innovations in this study must also be considered in light of their limitations. Due to the high dimensionality of features in early layers of the ResNet architecture for high-dimensional visual inputs, we employ pooling operations on these feature maps. Thus, low-level visual features, such as orientations, are compromised. An unfavorable consequence is low predictive performance in V1. Due to a limited computational and memory budget, we could not experiment with fine-tuning the visual sub-network in this study; in the future, with large-scale collection of naturalistic fMRI datasets that represent a more extensive sampling of the stimulus space, we anticipate that data-fitted or fine-tuned models may surpass the baseline established by pre-trained goal-driven networks and may enable us to inch closer to a complete model of the human visual cortex. These models might even provide inspiration in the form of inductive biases or regularization for representation learning on diverse perceptual tasks [289]. Further, since different subjects can focus on different parts of the stimulus, group-level models can also blur out the precise object orientation information. This is particularly relevant for complex naturalistic stimuli such as movies. In the future, incorporating eye gaze data into these models can be an interesting exploration. Furthermore, due to computational constraints, the proposed model is only able to examine the effects of stimuli up to 20 seconds in the past. However, previous research with naturalistic stimuli has shown that some brain regions maintain memory on the order of minutes during naturalistic viewing [290]. Existing evidence also suggests that neural activity is structured into semantically meaningful and coherent events [266].
Capturing long-range context in encoding models can be a challenging, yet fruitful endeavour yielding potentially novel insights into memory formation.

There are also inherent differences between the proposed neural network models and biological networks. DNNs fail to capture known properties of biological networks such as local recurrence; however, they have been found to be useful for modelling neural activity across different sensory systems. At present, feed-forward DNNs trained on recognition tasks constitute the best predictors of sensory cortical activations in both humans and non-human primates [243]. In light of this observation, a recent study proposed that very deep feed-forward only CNNs (for example, ResNet-50 as employed in this study for visual feature extraction) might implicitly be approximating ‘unrolled’ versions of recurrent computations of the ventral visual stream [291]. Object recognition studies on non-human primates have also hinted at a functional correspondence between recurrence and deep non-linear transformations [292]. Although the functional significance of intra-regional recurrent circuits in core object recognition is still under debate, mounting evidence suggests they may be subserving recognition under challenging conditions [292, 293]. Thus, more neurobiologically plausible models of the cortex that innately model intra-regional recurrent computations should be explored in the future, especially in relation to their role in visual recognition.

While the present study focuses on shared stimulus-driven brain signals across a subject population, the quest to understand inter-individual variability in neural responses remains an important direction forward that promises exciting scientific discoveries linking brain activity to behavior and novel clinical applications [294].
Over the last decade, these inter-individual differences have been shown to result from differences in attentional control and engagement [295], and perhaps more intriguingly, from differences in interpretation [296], emotional valence/arousal [297] as well as intrinsic individual traits and behavior [298]. The utility of studying inter-individual variability under naturalistic stimulation from a clinical perspective has already been highlighted in several prior studies using variants of the ISC framework that attempt to extricate the stable and idiosyncratic stimulus-evoked response component in subjects with neuropsychiatric disorders from the shared stimulus-driven response that is consistent across a control subject population [299]. In the future, we expect that this direction of using naturalistic paradigms to study inter-individual differences will complement the approach of encoding models (at single-subject or group-level) in understanding how the brain processes sensory signals from its complex environment and how individual differences in this processing are linked to individual behaviors.

Comprehensive descriptive models of the brain need comprehensive accounts of the stimulus. In this study, using a novel group-level encoding framework, we showed that ‘reliable’ cortical responses to naturalistic stimuli can be accurately predicted across large areas of the cortex using multi-sensory information over longer timescales. Since our models were trained on a large-scale, multi-subject and open-source dataset, we believe these results could provide an important point of reference against which encoding models for naturalistic stimuli can be assayed in the future. The continued interplay of artificial neural networks and neuroscience can pave the way for several exciting discoveries, bringing us one step closer to understanding the neural code of perception under realistic conditions.
Data and Software availability

All data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials. All experiments in this study are based on the open-access Human Connectome Project movie-watching database. The dataset is publicly available for download through the HCP website (https://www.humanconnectome.org/). Throughout this study, we utilized 7T fMRI data from the ‘Movie Task fMRI 1.6mm/59k FIX-Denoised’ package within the HCP dataset. The network implementation and analysis codes are available at https://github.com/mk2299/MultimodalEncoding/.

4.3 Neural encoding with visual attention

Abstract

Visual perception is critically influenced by the focus of attention. Due to limited resources, it is well known that neural representations are biased in favor of attended locations. Using concurrent eye-tracking and functional Magnetic Resonance Imaging (fMRI) recordings from a large cohort of human subjects watching movies, we first demonstrate that leveraging gaze information, in the form of attentional masking, can significantly improve brain response prediction accuracy in a neural encoding model. Next, we propose a novel approach to neural encoding by including a trainable soft-attention module. Using our new approach, we demonstrate that it is possible to learn visual attention policies by end-to-end learning merely on fMRI response data, and without relying on any eye-tracking. Interestingly, we find that attention locations estimated by the model on independent data agree well with the corresponding eye fixation patterns, despite no explicit supervision to do so. Together, these findings suggest that attention modules can be instrumental in neural encoding models of visual stimuli.¹

4.3.1 Introduction

Developing accurate population-wide neural encoding models that predict the evoked brain response directly from sensory stimuli has been an important goal in computational neuroscience.
Modeling neural responses to naturalistic stimuli, in particular stimuli that reflect the complexity of real-world scenes (e.g., movies), offers significant promise to aid in understanding the human brain as it functions in everyday life [304]. Much of the recent success in predictive modeling of neural responses is driven by deep neural networks trained on tasks of behavioral relevance. For example, features extracted from deep neural networks trained on image or auditory recognition tasks are currently the best predictors of neural responses across visual and auditory brain regions, respectively [243, 246, 248]. While this success is promising, the unexplained variance is still large enough to prompt novel efforts in model development for this task.

¹ Our code is available at https://github.com/mk2299/encoding_attention.

One aspect that is often overlooked in existing neural encoding models in vision is visual attention. Natural scenes are highly complex and cluttered, typically containing a myriad of objects. What we perceive upon viewing complex, naturalistic stimuli depends significantly on where we direct our attention. It is well known that multiple objects in natural scenes compete for neural resources and attentional guidance helps to resolve the ensuing competition [305]. Due to the limited information processing capacity of the visual system, neural activity is biased in favor of the attended location [306, 307]. Hence, more salient objects tend to be more strongly and robustly represented in our brains. Further, several theories have postulated that higher regions of the visual stream encode increasingly shift- and scale-invariant representations of attended objects after filtering out interference from surrounding clutter [308, 309].
These studies suggest that deployment of attention results in an information bottleneck, permitting only the most salient objects to be represented in the inferotemporal (IT) cortex, particularly the ventral visual stream which encodes object identity. These findings together indicate that visual attention mechanisms can be crucial for modeling neural responses of the higher visual system.

Visual attention and eye movements are tightly interlinked. Where we direct our gaze often quite accurately signals the focus of our attention [310]. This form of attention, known as overt spatial attention, can be directly measured by eye-tracking. Recent work has shown that fMRI activity can be used to directly predict fixation maps or eye movement patterns under free-viewing of natural scenes, suggesting a strong link between neural representations and eye movements [311]. In a similar vein, Sinz et al. [312] demonstrated that gaze shifts as estimated from pupil locations and behavioral states can be very useful in modeling spiking activity of mouse V1 neurons. More recent large-scale efforts in such concurrent data collection on humans, such as the Human Connectome Project (HCP) [313], which simultaneously record fMRI and eye-tracking measurements on a large population under free-viewing of movies, present a novel opportunity to probe the potential role of attention in neural encoding models of ecological stimuli.

Our contributions in this study are as follows:

• We demonstrate that leveraging information about attended locations in an input image can be helpful in predicting the evoked neural response. In particular, we show that attentional masking of high-level stimulus representations based on human fixation maps can dramatically improve neural response prediction accuracy for naturalistic stimuli across large parts of the cortex.
• We show that it is possible to use supervision solely from neural response prediction to co-train a visual attention network.
  This training strategy thus encourages only those salient parts of the image to dominate the prediction of the neural response. We find that the neural encoding model with this trained attention module outperforms encoding models with no or fixed attention.
• Interestingly, we find that despite not being explicitly trained to predict fixations, the attention network within the neural encoding model compares favorably against saliency prediction models that aim to directly predict likely human fixation locations given an input image. This suggests that neural response prediction can be a powerful supervision signal for learning where humans attend in cluttered scenes with multiple objects. This signals a novel opportunity for utilizing functional brain recordings during free-viewing to understand visual attention.

4.3.2 Methods

Neural encoding models comprise two major components: a representation (feature extraction) module that extracts relevant representations from raw stimuli and a response model that predicts neural activation patterns from the feature space. We propose to integrate a trainable soft-attention module on top of the representation network to learn attention schemes that guide the prediction of whole-brain neural response. Our proposed methodology is illustrated in Figure 4.7.

Feature extraction network. We employ the state-of-the-art ResNet-50 [261] architecture pre-trained for object recognition on ImageNet [262] as the representation network to extract semantically rich features from raw input images. In this study, we focus on improving neural response prediction in higher-order regions of the visual pathway where receptive fields are larger and not limited to a single hemi-field. Prior evidence suggests that these regions are likely best modelled by deeper layers of object recognition networks [245, 246].
Thus, we extract the output of the last “residual block”, namely res5 (after addition), before the global pooling operation to encode all images into a 2048-channel high-level feature representation image (of size 23 × 32, in our experiments), denoted as $F_{rep}$. All pre-trained weights are kept frozen during training of the neural encoding models.

Figure 4.7: Proposed method. A trainable soft-attention module is implemented on top of a pre-trained representation network to rescale features based on their salience. The rescaled features are spatially pooled and fed into a convolutional response model to predict whole-brain neural response. We assess the value of the trained attention network by comparing it with neural encoding methods employing (i) stimulus-dependent attention maps derived from human fixations ($A_G$), (ii) a stimulus-independent attention map derived from all fixations in the training set that reflects the center-weighted bias of our dataset ($A_C$), as well as (iii) a no-attention model that spatially pools the features directly with no scaling.

Attention network. The attention network operates on the 2048-channel feature representation image $F_{rep}$. For simplicity, we employed a single convolutional layer that constructs the saliency map with a trainable 5 × 5 filter $V_{att} \in \mathbb{R}^{5 \times 5 \times 2048 \times 1}$ as $S = G_\sigma * [V_{att} * F_{rep}]_+$. Here, $[\cdot]_+$ denotes the ReLU operation and $G_\sigma *$ indicates blurring using a 5 × 5 Gaussian kernel with $\sigma = 1$. The attention scores for each pixel are finally computed from saliency maps by normalizing with the spatial softmax operation,

$$A^{(i)} = \frac{\exp\left(S^{(i)}\right)}{\sum_{j=1}^{n} \exp\left(S^{(j)}\right)}, \quad i \in \{1, \dots, n\}. \qquad (4.1)$$

Here, superscript $i$ is used to index the 23 × 32 spatial locations in the feature map $F_{rep}$ (so that $n = 736$). We note that existing literature on selective visual attention suggests a hierarchical winner-take-all mechanism for saliency computation, where only the particular subset of the input image that is attended is consequently represented in higher visual systems [307].
The softmax operation can be construed as approximating this winner-take-all mechanism. The attention is consequently applied as element-wise scaling to $F_{rep}$ to yield an attention modulated representation $F^{a}_{rep} = F_{rep} \odot A$.

Convolutional response model. The convolutional response model maps the spatially pooled attention modulated features $f_g = \sum_{i=1}^{n} F^{a(i)}_{rep}$ to the neural representation space, reshapes them into coarse 3D feature maps and transforms them into an increasingly fine-grained volumetric neural activation pattern using trainable convolutions. This dramatically reduces the parameter count in comparison to linear response models with dense connections. Additionally, it captures spatial context and allows end-to-end optimization of the neural encoding model to predict high-resolution neural response, thereby alleviating the need for voxel sub-sampling or selection. The full sequence of feedforward computations in the convolutional response model is shown in the inset of Figure 4.7. The architecture of the convolutional response model is kept consistent across all CNN-based models to ensure a fair comparison.

Baselines and upper bounds

No attention. We compared the performance of all attention-based models against a model with no attention modulation that spatially pools the feature representation as $f_g = \sum_{i=1}^{n} F^{(i)}_{rep}$ (denoted as ‘No attention’). We implemented another baseline that uses the full feature map directly (instead of spatial pooling) as a flattened input to the convolutional response model. Due to computational/memory constraints, we had to reduce the dimensionality of the fully connected layer (to 256 units instead of 1024) in the convolutional response model for this encoding method. This model is henceforth denoted as ‘No pooling’.
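The attention-weighted pooling and the unweighted ‘No attention’ pooling described above can be contrasted in a short sketch. The NumPy code below is illustrative only and is not the released implementation: it takes the saliency map S as given (the trainable 5 × 5 convolution, ReLU and Gaussian blur are omitted), uses a toy 8-channel feature map in place of the 2048-channel one, and the function names `spatial_softmax`, `attention_pool` and `sum_pool` are hypothetical.

```python
import numpy as np

def spatial_softmax(saliency):
    """Normalize an (H, W) saliency map into attention scores that sum to 1."""
    flat = saliency.reshape(-1)
    flat = flat - flat.max()              # subtract the max for numerical stability
    weights = np.exp(flat)
    return (weights / weights.sum()).reshape(saliency.shape)

def attention_pool(features, saliency):
    """Attention-weighted spatial pooling of an (H, W, C) feature map."""
    attention = spatial_softmax(saliency)          # (H, W) attention scores
    modulated = features * attention[..., None]    # element-wise scaling per location
    return modulated.sum(axis=(0, 1))              # pooled vector of shape (C,)

def sum_pool(features):
    """'No attention' baseline: unweighted sum over spatial locations."""
    return features.sum(axis=(0, 1))

# Toy stand-ins (assumptions): 8 channels instead of 2048, random saliency.
rng = np.random.default_rng(0)
features = rng.random((23, 32, 8))
saliency = np.maximum(rng.standard_normal((23, 32)), 0.0)  # non-negative scores
pooled = attention_pool(features, saliency)
```

With a constant saliency map the softmax assigns every location the weight 1/n, so the attention-pooled features equal the ‘No attention’ sum divided by n; any non-uniform map shifts the pooled representation toward the high-saliency locations.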
Center-weighted attention. To further assess the usefulness of a learned attention network, we derive a stimulus-independent attention map ($A_C$) based on averaging across all eye gaze data in the training set, using Gaussian kernel density estimation. This essentially amounts to center-weighted attention (see Appendix) since fixation locations on average are biased towards the center of an image [314]. The standard deviation of the Gaussian kernel is chosen to maximize log-likelihood on the validation set and is consequently set to 20.

Gaze-weighted attention. We derive attention maps for every input frame from the eye gaze coordinates observed for the respective frame across different subjects. The human fixation maps are converted into attention maps $A_G$ by blurring with a Gaussian kernel of the same standard deviation as the center-weighted attention model. The resulting attention maps in the original input image space are subsequently resized to the spatial dimensions of $F_{rep}$ and renormalized. Since these stimulus-specific attention maps are derived from actual human gaze information, they likely represent an upper bound in neural encoding performance among all attention-based models.

Linear models. To date, neural encoding models in all prior work employ a linear response model with appropriate regularization on the regression weights. To compare against this dominant approach, we extract global average pooled (no-attention) features as well as pooled attention modulated features for both non-trainable attention schemes (center-weighted and gaze-weighted attention) as described above, to present to the linear regressor. We apply $\ell_2$ regularization on the regression coefficients and adjust the optimal strength of this penalty $\lambda$ through cross-validation using 10 log-spaced values in the range [1e−5, 1e5].
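The linear baseline with an ℓ2 penalty and a log-spaced grid for λ can be illustrated as follows. This is a minimal sketch under stated assumptions, not the code used in the study: it uses the closed-form ridge solution, a single held-out split in place of full cross-validation, synthetic data, and hypothetical helper names (`ridge_fit`, `select_lambda`). The grid is assumed to include both endpoints, 1e−5 and 1e5.

```python
import numpy as np

def ridge_fit(X, Y, lam):
    """Closed-form ridge solution W = (X^T X + lam * I)^{-1} X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def select_lambda(X_tr, Y_tr, X_val, Y_val, lambdas):
    """Return the penalty with the lowest mean squared error on held-out data."""
    errors = [np.mean((X_val @ ridge_fit(X_tr, Y_tr, lam) - Y_val) ** 2)
              for lam in lambdas]
    return lambdas[int(np.argmin(errors))]

# 10 log-spaced candidate penalties between 1e-5 and 1e5 (endpoints included).
lambdas = np.logspace(-5, 5, 10)

# Synthetic stand-in data (assumption): 16 pooled features, 4 voxels.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16))
W_true = rng.standard_normal((16, 4))
Y = X @ W_true + 0.1 * rng.standard_normal((200, 4))
best_lam = select_lambda(X[:160], Y[:160], X[160:], Y[160:], lambdas)
```

In practice the same penalty search would be repeated across folds; the single split here only keeps the example short.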
In later sections, we denote the performance of the above models as ‘No attention (linear)’, ‘Center-weighted attention (linear)’ and ‘Gaze-weighted attention (linear)’ respectively.

Training procedure

All parameters were optimized to minimize the mean squared error between the predicted and target fMRI response using Adam [315] for 25 epochs with a learning rate of 1e-4. Validation curves were monitored to ensure convergence and hyperparameters were optimized on the validation set.

Evaluation

Neural encoding. We evaluated the performance of all encoding models on the test movie by computing the Pearson’s correlation coefficient (R) between the predicted and measured fMRI response at each voxel. Since we are only interested in the stimulus-driven response, we isolate voxels that exhibit high inter-group correlations over all training movies. Inter-group correlation (“synchrony”) values were computed by splitting the population in half and computing correlations between the mean response time-course of each group (comprising 79 subjects) at every voxel. We chose a data-driven metric (synchrony) to isolate the stimulus-driven cortex in order to avoid reliance on pre-defined atlases or functional localizers in identifying the voxels of interest. However, since choosing an arbitrary synchrony threshold may introduce a bias in the reported metrics, we employed a range of threshold values, from very loose (0.15) to very strict (0.75) for the correlation value to consider a voxel as “synchronous” [316]. Finally, to summarize the prediction accuracy across the stimulus-driven cortex, we compute the mean correlation coefficient across the synchronous cortex voxels by varying the “synchrony” thresholds from 0.15 (resulting in 160,900 voxels) to 0.75 (8,804 voxels). The spatial distribution of synchronous voxels across the brain as we vary the synchrony thresholds is illustrated in Figure 4.8(B).
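The evaluation described above (voxel-wise Pearson correlation, split-half synchrony, and averaging accuracy over voxels above a synchrony threshold) can be sketched in NumPy. This is a hedged illustration on synthetic data rather than the analysis code of the study: the random assignment of subjects to halves, the array shapes and the function names are assumptions.

```python
import numpy as np

def voxelwise_pearson(a, b):
    """Pearson correlation between matching columns of two (T, V) arrays."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    denom = np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0))
    return (a * b).sum(axis=0) / denom   # undefined for constant time courses

def split_half_synchrony(responses, rng):
    """Correlate the mean time courses of two halves of the subjects.

    responses has shape (subjects, T, V); the random split is an assumption.
    """
    order = rng.permutation(responses.shape[0])
    half = responses.shape[0] // 2
    mean_a = responses[order[:half]].mean(axis=0)
    mean_b = responses[order[half:]].mean(axis=0)
    return voxelwise_pearson(mean_a, mean_b)

def mean_accuracy_by_threshold(accuracy, synchrony, thresholds):
    """Mean prediction correlation over voxels with synchrony above each threshold.

    Assumes at least one voxel passes every threshold.
    """
    return {float(t): float(accuracy[synchrony > t].mean()) for t in thresholds}

# Synthetic data (assumption): 20 subjects, 100 time points, 50 voxels.
rng = np.random.default_rng(0)
n_subjects, n_timepoints, n_voxels = 20, 100, 50
shared = rng.standard_normal((n_timepoints, n_voxels))
shared[:, 25:] = 0.0    # the last 25 voxels carry no stimulus-driven signal
responses = shared + rng.standard_normal((n_subjects, n_timepoints, n_voxels))
synchrony = split_half_synchrony(responses, rng)

accuracy = rng.uniform(0.0, 0.5, n_voxels)   # placeholder prediction correlations
summary = mean_accuracy_by_threshold(accuracy, synchrony, [0.15, 0.45, 0.75])
```

In this toy setting only the voxels that share a stimulus-driven component across subjects survive the strict threshold, which mirrors why the number of synchronous voxels shrinks as the threshold increases.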
For region-level analysis, ROIs were extracted using a population-wide multi-modal parcellation of the human cerebral cortex, namely the HCP MMP parcellation [264].

Saliency prediction. Next, we wanted to assess if the learned attention model was indeed looking at meaningful locations in input images while predicting neural responses. To address this question and put the learned attention schemes in perspective, we assessed the agreement of predicted saliency maps with human fixation maps for every frame in the test movie. Besides a qualitative evaluation, we computed quantitative metrics for comparing the predicted saliency maps against popular fixation (or saliency) prediction approaches. These include: (i) Itti-Koch [317]: a biologically plausible model of saliency computation that assigns pixel-level conspicuity values based on multi-scale low-level feature maps (intensity, color, orientation) computed via center-surround like operations similar to visual receptive fields, (ii) Deepgaze-II model [318]: a deep neural network based approach that extracts high-level features from a pre-trained image recognition architecture (VGG19) as input to a readout network that is subsequently trained to predict fixations using supervision from gaze data, and (iii) Intensity contrast features (ICF) model [318]: a low-level saliency computation model that uses the same readout architecture as the Deepgaze-II model, but on low-level intensity and intensity contrast feature maps as opposed to high-level features. Additionally, we also report evaluation metrics for the center-weighted saliency map. We note that the Deepgaze-II and ICF models were trained with eye-tracking supervision on the MIT1003 saliency dataset [319]. Developing metrics for saliency evaluation is an active area of research and several different metrics have been proposed that often exhibit discrepant behavior [320].
We report the most commonly used metrics in saliency evaluation [320], including (i) Similarity or histogram intersection (SIM), (ii) Pearson’s correlation coefficient (CC), (iii) Normalized scanpath saliency (NSS), (iv) Area under the ROC curve (AUC) and (v) Shuffled AUC (sAUC). Following [321], we used log-density predictions as saliency maps to compute all evaluation metrics.

4.3.3 Dataset

We study high-resolution 7T fMRI (TR = 1s, voxel size = 1.6 mm isotropic) recordings of 158 participants from the Human Connectome Project (HCP) movie-watching database while they viewed 4 audio-visual movies in separate runs [313, 322]. Each movie scan was about 15 minutes long, comprising multiple short clips from popular Hollywood movies and/or Vimeo. Eye gaze locations of subjects were also recorded simultaneously at 1000Hz and resampled to 24Hz to match the video frame acquisition rate. All fMRI data was preprocessed using the HCP FIX denoising procedures, which include motion and distortion correction, high-pass filtering (2000 sec cut-off), head motion effect regression using the Friston 24-parameter model (i.e., 6 rigid-body motion parameters, their backward temporal derivatives and the squares of those time series), automatic removal of artifactual timeseries by applying regression based on Independent Component Analysis (ICA) [323] as well as nonlinear registration to the MNI template space [257, 322]. Since the present study focuses on the development of population-wide predictive models, we averaged the response for each frame across subjects to obtain a single fMRI volume that represents the population average brain activation in response to that frame. After discarding rest periods as well as the first 10 seconds of every movie segment, we used about 12 minutes of audio-visual stimulation data per movie paired with the corresponding fMRI response and fixation data for analysis.
We extract the last frame of every second of the video as a 720 × 1280 × 3 RGB input to present as stimulus to the neural encoding models. The output is the predicted response across the entire brain, represented as a volumetric image of dimensions 113 × 136 × 113. Using regression-based encoding models (see Appendix), we estimate a hemodynamic delay of 4 sec as the response latency that yields the highest encoding performance. Thus, all proposed and baseline models are trained to use the above stimuli to predict the fMRI response 4 seconds after the corresponding stimulus presentation. We train and validate our models on three movies using a 9:1 train-val split and leave the fourth movie for independent testing. This yields 2000 training, 265 validation and 699 test stimulus-response pairs.

4.3.4 Results

Figure 4.8: Quantitative evaluation of all models. (A) depicts mean correlation values across the synchronous (i.e., stimulus-driven) cortex defined at a range of synchrony thresholds ([0.15, 0.75]). Each point thus reflects the mean prediction accuracy for a model across all voxels within synchronous cortex defined by a threshold value (x-axis). (B) depicts the inter-group correlation (synchrony) values across the entire human cerebral cortex.

Incorporating gaze-weighted attention significantly improves neural response prediction. We first examined whether attention weighted pooling helps to improve response predictions. Figure 4.8 shows the mean prediction accuracy across the entire synchronous cortex for all models considered in this study. We find that the ‘gaze-weighted attention’ model significantly outperforms the ‘no attention’ model for both the linear response model (∼40% improvement among all voxels with synchrony > 0.15) and the convolutional response model (∼47% improvement among all voxels with synchrony > 0.15). The attention maps result in amplification of features of attended locations along with suppression of other irrelevant information.
This re-scaling of features before pooling using fixation patterns obtained from eye-tracking data remarkably improves neural encoding performance across large areas of the cortex, suggesting that neural responses are indeed dominated by sensory signals at attended locations. Although we employed a convolutional response model primarily for computational efficiency in predicting a high-resolution (113 × 136 × 113) whole-brain neural response, we also observed a small improvement in neural encoding with this response model in comparison to a linear response model.

Figure 4.9: Top: ROI-level analysis. Mean correlation values across intermediate visual areas (V4), higher visual areas in the inferotemporal cortex and its neighborhood, and other higher-level visual regions (Dorsal, MT+) as described in the HCP MMP parcellation [264]. Error bars represent 95% confidence intervals around mean estimates computed using bootstrap sampling. (A)-(E) Prediction accuracy across the cortical surface for all deep CNN-based models. Statistical significance of individual voxel predictions is computed as the p-value of the obtained sample correlation coefficient for the null hypothesis of uncorrelatedness (i.e., true correlation coefficient is zero) under the assumptions of a bivariate normal distribution. Only significantly predicted voxels (p<0.05, FDR corrected) for each method are colored on the surface. Prediction accuracy maps for encoding methods with linear response models are provided in the Appendix.

Trainable attention model outperforms models with no attention or center-weighted attention. In addition to improving neural response prediction, the convolutional response model renders end-to-end training of encoding models on whole-brain neural data feasible by dramatically reducing the number of free parameters in comparison to linear response models.
In this study, we exploited this increased parameter efficiency to co-train an attention network on top of a pre-trained representation network (while freezing the representation network) for the goal of neural response prediction. As shown in Figure 4.8, the encoding model with learned attention surpasses models with no pooling, no attention or center-weighted attention in mean prediction accuracy across the synchronous cortex as well as across most ROIs involved in object processing. This suggests that even with no eye-tracking data, as is the case with the majority of movie-watching fMRI datasets, modelling visual attention as re-weighting of stimulus representations based on spatial attention masks can still be beneficial in response prediction. The improvements are most apparent in ventral stream regions such as the Fusiform Face Complex (FFC) and PIT Complex, as well as object-selective parts of the lateral occipital complex (LO1, LO2, LO3) (Figure 4.8). Studies in visual perception have shown that these lateral occipital areas respond more strongly to intact objects than scrambled objects or textures, providing strong evidence for their role in object recognition as well as object shape perception [324, 325, 326]. Accuracy in another group of areas within the temporo-parieto-occipital junction, which is known to be involved in visual object recognition as well as representation of facial attributes such as the intensity of facial expressions [327], is similarly improved with the trained attention network. In addition to these areas, we also observe some improvement in neural encoding performance in other higher order processing regions across the dorsal visual stream, motion-sensitive visual regions (MT+ complex) and their neighboring visual areas (Figure 4.9).
We also trained the proposed and baseline models on representations of other randomly selected deep layers within the ResNet-50 architecture and observed a similar benefit of attention modulation across different layers (see Appendix). Further, a representational similarity analysis comparing non-modulated and attention modulated representations of different layers across popular architectures showed that models that explain stimulus-dependent human fixation patterns are able to better account for the representational geometry of neural responses across intermediate and higher visual object processing areas (see Appendix). Taken together, these findings provide further support for the utility of attention modelling in neural encoding approaches. In addition to improving accuracy, the attention model further affords interpretability by highlighting salient locations within the input image that are being employed to make response predictions.

Learned attention policies correspond remarkably well with human fixation maps. Figure 4.10 depicts saliency maps predicted by the trained attention network on sampled frames from the test movie. This qualitative assessment indicates that the proposed neural encoding model learns attention policies that are consistent with human fixation maps. Since attention is learned on top of high-level features, the model learns to focus on high-level stimulus features such as the presence of faces, hands and more conspicuous objects likely to direct attention in natural scenes. A closer look at incongruent cases indicates that images where the model fails to track human fixations are often highly complex scenes, where fixations may be driven by contextual knowledge of previous movie frames (Figure 4.10, top-right) or auditory signals, e.g., who the speaker is, etc. (Figure 4.10, bottom-right).
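The quantitative comparison between predicted saliency maps and human fixations relies on the metrics listed under Evaluation. Three of them (SIM, CC and NSS) can be written compactly using their standard definitions from the saliency literature. The sketch below is illustrative: it assumes non-negative maps for SIM, omits AUC, sAUC and the log-density conversion, and runs on synthetic maps with hypothetical variable names.

```python
import numpy as np

def sim(saliency, fixation_density):
    """Histogram intersection of two maps, each normalized to sum to 1."""
    p = saliency / saliency.sum()
    q = fixation_density / fixation_density.sum()
    return float(np.minimum(p, q).sum())

def cc(saliency, fixation_density):
    """Pearson correlation between a saliency map and a fixation density map."""
    return float(np.corrcoef(saliency.ravel(), fixation_density.ravel())[0, 1])

def nss(saliency, fixation_mask):
    """Mean z-scored saliency at fixated locations (boolean mask)."""
    z = (saliency - saliency.mean()) / saliency.std()
    return float(z[fixation_mask].mean())

# Synthetic maps (assumption): a noisy prediction of a random fixation density.
rng = np.random.default_rng(0)
fixation_density = rng.random((23, 32))
fixation_mask = fixation_density > 0.95          # pretend these cells were fixated
predicted = fixation_density + 0.1 * rng.random((23, 32))

scores = {
    "SIM": sim(predicted, fixation_density),
    "CC": cc(predicted, fixation_density),
    "NSS": nss(predicted, fixation_mask),
}
```

SIM and CC compare a prediction against a continuous fixation density, whereas NSS only looks at the discrete fixated locations, which is one reason the metrics can rank models differently.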
Table 4.1 shows quantitative metrics that compare the quality of saliency maps computed by benchmark models trained to predict gaze on our data. We also list the performance of the attention network that was trained only on fMRI data, and not on eye gaze data. We note that our attention network performs on par with popular fixation prediction models that are trained directly on the task of saliency prediction in a supervised manner (ICF and Deepgaze-II). This trend holds for almost all saliency evaluation metrics, as shown in Table 4.1. This observation is particularly interesting given that the attention network is trained using supervision from neural response prediction only, without any information about gaze coordinates.

Figure 4.10: Qualitative assessment of saliency (log-density) maps. Top row shows sampled frames from the test movie, middle row shows human fixation maps overlaid on the corresponding frame, bottom row shows saliency maps predicted by the attention network of the proposed neural encoding model. Blue indicates high saliency values whereas red indicates low saliency.

Table 4.1: Evaluation against saliency prediction models. Mean and standard errors for each metric are reported. The best result for each metric is marked with *.

Model             SIM ↑            CC ↑             NSS ↑            AUC ↑            sAUC ↑
Itti-Koch         0.318 ± 0.002    0.325 ± 0.004    1.010 ± 0.014    0.795 ± 0.004    0.537 ± 0.006
ICF               0.291 ± 0.002    0.190 ± 0.007    0.646 ± 0.023    0.665 ± 0.006    0.647 ± 0.005
Center-weighted   0.327 ± 0.002    0.350 ± 0.004    1.074 ± 0.013    0.803 ± 0.003    0.496 ± 0.006
Deepgaze-II       0.359 ± 0.003    0.420 ± 0.005*   1.425 ± 0.025*   0.808 ± 0.004*   0.713 ± 0.004*
Ours              0.392 ± 0.004*   0.403 ± 0.010    1.375 ± 0.041    0.754 ± 0.006    0.645 ± 0.006

4.3.5 Discussion and Conclusion

In the present study, we demonstrate that encoding models with visual attention, whether explicitly estimated from human fixation maps or modelled using a trainable soft-attention scheme, yield significant improvements in neural response prediction accuracy over non-attention based counterparts.
We observe consistent improvements across most high-level visual processing regions, suggesting that unattended portions of an input image likely have little effect on neural representations in these regions. Loosely, this aligns well with Treisman’s feature integration theory [328], which proposes that integrated object representations are only formed for attended locations. In addition to improving response prediction accuracy, inclusion of visual attention within neural encoding models promises a better understanding of spatial selection and its influence on neural representations and perceptual processing. Further, while our study integrates a spatial attention module within a neural encoding model, the proposed approach is not restricted to this particular form of attention. For example, spatially global feature-based attention can also be studied within the context of neural encoding models as “channel-wise” attention weights instead of spatial attention masks. We believe the observation that neural response prediction may be a useful supervision goal to study attentional deployment is particularly exciting and can be extended in novel ways.

The saliency of a stimulus often depends on the context within which it is presented, and attentional selection strategies can be modulated by task demands [308]. Importantly, attention selects across space and time; here, we focus on spatial selection of stimuli, but it is likely that modeling temporal context can lead to substantive improvements. Context can also help in highlighting attentional targets that may be driven by “surprise”. Thus, in movie watching, future neural encoding models should also capture the sequence of frames, rather than isolated frames, and the audio track in modeling attention. Our study provides a first attempt at capturing visual attention within neural encoding models. We see several opportunities for extending this work.
In the present framework, we employed attention as a masking strategy to filter out clutter and retain information from only the most relevant (i.e., attended) parts of an image. It would be interesting to study how and where the features of ignored stimuli (i.e., the stimuli that do not get past the attentional bottleneck) are encoded. Further, here, we modeled attention on top of high-level semantic features. In principle, the attention network can be implemented on top of any level within the representation network hierarchy, including lower stages, and understanding where attention computations lead to the best neural prediction accuracy and/or agreement with human fixation maps could be a worthwhile exploration. A straightforward extension in this direction would be to add the attention module on top of both low-level cues and high-level representations, or to combine feature maps across layers before presenting them to the attention network. In the future, we aim to further explore novel ways of incorporating attention within neural encoding models.

Beyond advancing our understanding of sensory perception, neural encoding models have potential for real-world applications, most obviously for brain-machine interfaces. Additionally, an improved understanding of the link between sensory stimuli and evoked neural activation patterns can provide opportunities for neural population control, e.g., by synthetically designing stimuli to elicit a specific neural activation pattern [286].

4.3.6 Broader Impact

Understanding the link between sensory stimulation and evoked neural activity in humans, as revealed with encoding models, can provide foundations for developing novel therapies. Viewed in this regard, an improved understanding of information processing in the brain has tremendous potential. However, encoding models can be very sensitive to biases in the training set. Our models were trained using data from the Human Connectome Project database.
While this large-scale project has made a lot of valuable data publicly available to the scientific community for studying brain structure and function, it is important to consider the representational bias in the dataset. For instance, the data we analyzed is exclusively limited to a young adult population. Such biases can possibly lead to poorer generalization of models trained with these large-scale datasets on other population groups that are inadequately represented. Once these encoding models are ripe for therapeutic applications, this dataset bias could prevent under-represented groups from deriving the benefits of a useful technology, resulting in uneven access across populations. Given these considerations, it is important to address potential representation biases in fMRI datasets and develop solutions for improving diversity and inclusion. More generally, fMRI studies involving human subjects can raise a wide range of other ethical issues as well, including data privacy issues and informed consent.

Further, one should be cautious about the deployment of attention or gaze prediction models in applications such as advertising. Given the value of eye-tracking based attention in marketing spaces, public policy notices or political campaigns, it is important to be wary of malicious use of these attention prediction methods for profit-seeking or by ill-intentioned parties seeking to further their own agendas. These applications regard attention as a commodity to be captured, and the adopted technologies can be used to manipulate users in subtle ways. An improved understanding of the link between stimuli and perceptual processing in the brain, as provided with encoding models, can also be exploited to further design or identify stimuli likely to elicit a specific emotional or cognitive response.
The fact that these technologies can be deployed without the targeted individual’s knowledge or consent indicates it is important to protect users from the vulnerabilities exploited by these agents.

CHAPTER 5
A SHARED ENCODING MODEL FOR SUBJECT-SPECIFIC RESPONSE PREDICTION

Abstract

The increasing popularity of naturalistic paradigms in fMRI (such as movie watching) demands novel strategies for multi-subject data analysis, such as use of neural encoding models. In the present study, we propose a shared convolutional neural encoding method that accounts for individual-level differences. Our method leverages multi-subject data to improve the prediction of subject-specific responses evoked by visual or auditory stimuli. We showcase our approach on high-resolution 7T fMRI data from the Human Connectome Project movie-watching protocol and demonstrate significant improvement over single-subject encoding models. We further demonstrate the ability of the shared encoding model to successfully capture meaningful individual differences in response to traditional task-based face and scene stimuli. Taken together, our findings suggest that inter-subject knowledge transfer can be beneficial to subject-specific predictive models.¹

¹ Our code is available at https://github.com/mk2299/SharedEncoding_MICCAI.

5.1 Introduction

Naturalistic imaging paradigms, such as movies and stories, emulate the diversity and complexity of real-life sensory experiences, thereby opening a novel window into the brain. The last decade has seen an increased foothold of naturalistic paradigms in cognitive neuroimaging, fueled by the remarkable discovery of inter-subject synchrony during naturalistic viewing [316]. Naturalistic stimuli also demonstrate increased test-retest reliability and more active subject engagement in comparison to alternate paradigms such as resting-state fMRI [304].
Furthermore, experiments have shown that naturalistic stimuli can induce stronger neural responses than task-based stimuli [285], suggesting that the brain is intrinsically more attuned to the former. Taken together, these benefits suggest an exciting future for naturalistic stimulation protocols in fMRI.

With large-scale compilation of multi-subject neural data through open-source initiatives such as the Human Connectome Project (HCP) [313], the development of approaches that can handle this enormous volume of data is becoming imperative. Two approaches, namely inter-subject correlation (ISC) analysis [316, 329] and the shared response model (SRM) [330], have dominated the analysis of multi-subject fMRI data under naturalistic conditions. The former approach exploits similarity in activation patterns across subjects to isolate stimulus-induced processing. The latter technique, SRM, decomposes neural activity into a shared response component and subject-specific spatial bases, and has been used for inter-subject knowledge transfer through functional alignment. While simple and efficient, both these approaches rely on a common time-locked stimulus across subjects and cannot, by design, model responses to completely unseen stimuli. On the other hand, predictive modelling of neural activity through encoding models is based upon generalization to arbitrary stimuli and can thus offer more holistic descriptions of sensory processing in an individual [242].

Neural encoding models map stimuli to fine-grained voxel-level response patterns via complex feature transformations. Previously, neural encoding models have yielded several novel insights into the functional organization of auditory and visual cortices [243, 245, 246, 248]. Encoding models encapsulating different hypotheses about neural information processing can be pitted against each other to shed new light on how information is represented in the brain.
In this manner, neural encoding models have been largely used for making group-level inferences. The potential to extract meaningful individual differences from naturalistic paradigms remains largely untapped. Understanding inter-subject variability in behavior-to-brain representations is of key interest to neuroscience and can potentially even help identify atypical response patterns [331]. Modelling individual brain function in response to naturalistic stimuli is one step in this direction; however, building accurate individual-level models of brain function often requires large amounts of data per subject for good generalization. The problem is further exacerbated by the variability in anatomy and functional topographies across individuals, making inter-subject knowledge transfer difficult. There is limited work in leveraging multi-subject data for more robust and accurate individualized neural encoding. To our knowledge, this problem has been studied only in the context of natural vision with a handful of subjects using a Bayesian framework [332]. Further, the proposed method in [332] transfers knowledge from one subject’s encoding model into another through a two-stage procedure and does not allow simultaneous optimization of encoding models across multiple subjects.

In this paper, we attempt to fill this gap; to this effect, we propose a deep-learning based framework to build more powerful individual-level encoding models by leveraging multi-subject data. Recent studies have revealed that coarse-grained response topographies are highly similar across subjects, suggesting that individual idiosyncrasies manifest in more fine-grained response patterns [247, 330]. This hints at the idea that encoding models could share representational spaces across subjects to overcome the challenges imposed by a limited quantity of per-subject data. We exploit this intuition to develop a neural encoding model with a common backbone architecture for capturing the shared response and subject-specific projections that account for individual response biases, as shown in Figure 5.1. Our proposed approach has several merits: (i) It allows us to combine data from multiple subjects watching the same or different movies to build a global model of the brain. At the same time, it can capture meaningful individual-level deviations from the global model which can potentially be related to individual-specific traits. (ii) It is amenable to incremental learning with diverse, varying stimuli across seen or novel subjects with fewer constraints on data collection from single subjects. (iii) It poses minimal memory overhead with additional subjects and can thus handle fMRI datasets with a large number of subjects.

Figure 5.1: Proposed approach: Feature pyramid networks are used to extract hierarchical features from pre-trained image/sound recognition networks. Dense features are reshaped into coarse 3D feature maps, which are mapped into increasingly fine-grained maps using convolutions. Coarse feature transformation layers are shared across subjects while deeper convolutional layers close to the predicted response are subject-specific.

5.2 Methodology

Our proposed methodology is illustrated in Figure 5.1. Neural encoding models comprise two components: (a) a feature extractor, which pulls out relevant features from raw images or audio waveforms, and (b) a response model, which maps these stimulus features into brain responses. In contrast to existing works that employ a linear response model [245, 246], we propose a CNN-based response model where the coarse 3D feature maps are shared across subjects and fine-grained feature maps are individual-specific.
Previous studies have reported a cortical processing hierarchy where low-level features from early layers of a CNN-based feature extractor best predict responses in early sensory areas while semantically-rich deeper layers best predict higher sensory regions [246, 248]. To account for this effect, we employ a hierarchical feature extractor based on feature pyramid networks [260] that combines features from early, intermediate and later layers simultaneously. The output of the feature extractor is fed into the convolutional response model to predict the evoked fMRI activation. This enables us to train both components of the network simultaneously in an end-to-end fashion.

Formally, let $D = \{(X_i, Y_i)\}_{i=1}^{N}$ denote the training data pairs for $N$ subjects, where $X_i$ denotes the stimuli presented to subject $i$ and $Y_i$ denotes the corresponding fMRI measurements. We represent $X_i$ as RGB images or grayscale spectrograms for the visual and auditory models, respectively. The feature model maps the 2D input into a vector representation $s$ and is parameterized using a deep neural network $F(X_i;\phi)$ that is common across subjects. In our experiments, this model is a feature pyramid network built upon pre-trained recognition networks, as DNNs optimized for image or sound recognition tasks have proven to provide powerful feature representations for encoding brain response. We define a differentiable function $G(s;\theta)$ that maps the features into a shared latent volumetric space $z$, whose first 3 axes represent the 3D voxel space and whose last axis captures the latent dimensionality. The predicted response for each subject is then defined using subject-specific differentiable functions $H_i(z;\psi_i)$ that project the coarse feature maps $z$ into an individualized brain response. We represent $G$ and the $H_i$ using convolutional neural networks to have a sufficiently expressive model. Thus, $\theta$ and $\{\psi_i\}$ represent a mix of convolutional kernels or dense weight matrices.
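The composition of the three components can be written compactly. The following is a sketch in the notation of this section; the per-subject sample index $t$, the sample count $T_i$ and the averaging convention of the loss are assumptions introduced here for illustration, since the text only states that a mean squared error between predicted and true responses is minimized jointly over all parameters.

```latex
s = F(X_i;\phi), \qquad z = G(s;\theta), \qquad \hat{Y}_i = H_i(z;\psi_i)

\min_{\phi,\,\theta,\,\{\psi_i\}} \;
  \sum_{i=1}^{N} \frac{1}{T_i} \sum_{t=1}^{T_i}
  \big\lVert H_i\big(G(F(x_{i,t};\phi);\theta);\psi_i\big) - y_{i,t} \big\rVert_2^2
```

Here $x_{i,t}$ and $y_{i,t}$ denote the $t$-th stimulus in $X_i$ and its measured response in $Y_i$. Because each term of the outer sum involves only one subject's head $H_i$, the shared parameters $\phi$ and $\theta$ receive gradients from every subject, whether or not the subjects saw the same stimuli.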
The number of shared parameters, $|\theta| + |\phi|$, is kept much greater than the cardinality of subject-specific parameters $|\psi_i|$ to accurately estimate the shared latent space. All parameters $\{\phi, \theta, \psi_i\}$ are trained jointly to minimize the mean squared error between the predicted and true response. The proposed method allows us to propagate errors through the shared network even if the subjects are not exposed to common stimuli, since we can always backpropagate errors for subjects independently within each batch. Furthermore, using individualized layers to account for subject-specific biases enables the model to weigh gradients coming from the losses of each subject differently according to their signal-to-noise ratio. This makes the model less susceptible to noisy measurements when responses for the same stimuli are available from multiple subjects.

5.2.1 Implementation details

We employ pre-trained ResNet-50 [261] and VGG-ish [258] architectures in the bottom-up path of Figure 5.1 to extract multi-scale features from images and audio spectrograms, respectively. The base architectures were selected because pre-trained weights of these networks optimized for classification on large datasets, namely ImageNet [262] and YouTube-8M [263], were publicly available. For ResNet-50, we use activations of the last residual block of each stage, namely, res2, res3, res4 and res5 (notation from [333]) to construct our stimulus descriptions $s$. From the VGG network, we use the activations of each convolutional block, namely, conv2, conv3, conv4 and the penultimate dense layer fc2 [334]. The first three sets of activations are refined through a top-down path to enhance their semantic content, while the last activation is concatenated into $s$ directly (res4 activations are vectorized using global average pool). The top-down path comprises three feature maps at different resolutions with an up-sampling factor of 2 successively from the deepest layer of the bottom-up path.
Each such feature map comprising 256 channels is merged with the corresponding feature map in the bottom-up path (reduced to 256 channels by 1x1 convolutions) by element-wise addition. Subsequently, the feature map at each resolution is collapsed into a 256-dimensional feature vector through a global average pool operation and concatenated into $s$. The aggregated features are then passed onto a shared CNN (denoted $G$ above) comprising the following feedforward computation: a fully connected layer maps the features into a vector space, which is reshaped into a 1024-channel cuboid of size 6x7x6, followed by two 3x3x3 transposed convolutions (conv.T) with a stride of 2 that up-sample the cuboid to obtain $z$. Each convolution reduces the channel count by half, thereby resulting in a shared latent $z$ that is a 256-channel cuboid of size 27x31x27. Subject-specific functions $H_i$ are parameterized as a cascade of two 3x3x3 conv.T operations (stride 2) with output dimensions 128 and 1, respectively. It is important to emphasize that these operations comprise far fewer parameters, thereby favoring the estimation of a shared truth. As we demonstrate empirically, a shared space allows much better generalization. At the same time, we find that even the limited subject-specific parameters can adequately capture meaningful individual differences. All parameters were optimized using Adam [315] with a learning rate of 1e-4. Auditory and visual models were trained for 25 and 50 epochs, respectively, with unit batch size. Validation curves were monitored to ensure convergence.

5.2.2 Data and Preprocessing

We study 7T fMRI data (TR = 1s) from a randomly selected sample of N=10 subjects from the HCP movie-watching protocol [313, 322]. The dataset comprises 4 audiovisual movies, each ∼15 mins long. Preprocessing protocols are described in detail in [257, 322]. For our experiments, we utilize the 1.6mm MNI-registered volumetric images of size 113 x 136 x 113 per TR.
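As a check on the layer arithmetic of the shared CNN $G$ described in Section 5.2.1, the sketch below traces the cuboid size and channel count through the two transposed convolutions. It assumes zero padding and no output padding, which the text does not state; the helper names are hypothetical. Under that assumption the standard transposed-convolution size formula reproduces the 27x31x27, 256-channel latent quoted above.

```python
def conv_transpose_out(size, kernel=3, stride=2, padding=0, output_padding=0):
    """Output length of a transposed convolution along one axis
    (standard formula with unit dilation)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

def upsample(shape, channels, n_layers):
    """Apply n_layers of 3x3x3, stride-2 transposed convolutions, halving
    the channel count at each layer as described for the shared CNN."""
    for _ in range(n_layers):
        shape = tuple(conv_transpose_out(n) for n in shape)
        channels //= 2
    return shape, channels

# Start from the 1024-channel cuboid of size 6x7x6 produced by the dense layer.
shared_shape, shared_channels = upsample((6, 7, 6), 1024, n_layers=2)
```

The sketch covers only the shared path. How the subject-specific layers are aligned to the 113 x 136 x 113 volume (for example through padding or cropping) is not detailed in the text and is therefore not modelled here.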
We compute log-mel spectrograms using the same parameters as [258] over every 1 second of audio waveform to obtain a 2D image-like input for the VGG audio feature extractor. We extract the last frame of every second of the video to present to the image recognition network for visual features. We estimate a hemodynamic delay of 4 sec using regression-based encoding models, as the response latency that yields the highest encoding performance. Thus, all proposed and baseline models are trained to use the above stimuli to predict the fMRI response 4 seconds after the corresponding stimulus presentation. We train and validate our models on three movies using a 9:1 train-val split and leave the fourth movie for independent testing. This yields 2000 training, 265 validation and 699 test stimulus-response pairs per subject.

5.2.3 Baselines

• Linear response model (individual subject): Here, we train independent models for each subject using linear response models. We note that, thus far, this is the dominant approach to neural encoding. To enable a fair comparison, we extract hierarchical features of the same dimensionality as the proposed model to present to the linear regressor. The only difference here is the lack of a top-down pathway (since it is not pre-trained), which prevents the refinement of coarse feature maps before aggregation. We apply l2 regularization on the regression coefficients and select the strength of this penalty through cross-validation over log-spaced values between 1e−10 and 1e10. We report the performance of the best model as ‘Individual (Linear)’.

• CNN response model (individual subject): Here, we employ the same architecture as the proposed model but with only one branch of subject-specific layers. We train this network independently for each subject without weight sharing and denote its performance as ‘Individual model (CNN)’.
• Shared model (mean): Here, we employ the proposed model after training, but instead of computing predictions using the same subject’s learned weights, we compute N predictions from all subject-specific branches. We compute the mean performance obtained by correlating each of these predictions with the ground truth response of a subject and denote this as ‘Shared (mean)’.

5.2.4 Performance evaluation

We measure performance on the test movie by computing Pearson’s correlation coefficient between the predicted and measured fMRI response at each voxel. Since different subjects have a different signal-to-noise ratio, we normalize each voxel’s correlation by the subject’s noise ceiling for that voxel. We compute the subject-specific noise ceiling by correlating their repeated measurements on a validation clip. Further, since we are only interested in the stimulus-driven response, we measure performance in voxels that exhibit high inter-subject correlations. We randomly split the 10 subjects into groups of 5, and correlate the mean activity of the two groups. We repeat this process 5 times, and voxels that exhibit a mean correlation greater than 0.1 are identified as synchronous voxels. We compute the mean normalized correlations across all synchronous voxels to obtain a single metric per subject, denoted as ‘Prediction accuracy’. We also correlate the predicted response of each subject against the predicted and true response of every other subject to obtain an N × N correlation matrix for shared models. To account for higher variability in measured versus predicted response, we normalize the rows and columns of this correlation matrix following [335].

5.2.5 Demonstration of application: personalized brain mapping

To investigate if the proposed model is indeed capturing meaningful individual differences, we use the trained encoding model to predict fMRI activations for distinct visual object categories from the HCP task battery.
Specifically, we predict brain response to visual stimuli (comprising faces, places, tools and body parts) from the HCP Working Memory (WM) task and use the predicted response to synthesize face and scene contrasts (FACES-AVG and PLACES-AVG, respectively) for each individual. The predicted and true contrasts are thresholded to keep the top 5% of the voxels. We compute the Dice overlap between the predicted contrast for each subject against the true contrast of every subject (including self) to produce an N × N matrix for each contrast.

5.3 Results

Figure 5.2: Quantitative evaluation: Bar charts illustrate subject-wise prediction accuracy of all models; box plots depict the distribution over subjects for % of synchronous voxels significantly predicted (p<0.05, FDR corrected). N × N correlation matrices depict the (normalized) correlation coefficient between predicted and measured responses.

Figure 5.2 shows the prediction accuracy of the proposed (‘Shared’) and baseline methods for each subject. The performance improvement of the proposed model over individual subject models is striking, suggesting that a shared backbone architecture can significantly boost generalization. Comparative boxplots further show that the proposed method predicts a much higher percentage of the synchronous cortex than individual subject models. Further, the difference between ‘Shared’ and ‘Shared (mean)’ as well as the dominant diagonal structure in the correlation matrices suggest that the proposed method is indeed capturing subject idiosyncrasies rather than predicting a group-averaged response. Further, while the CNN response model performs slightly better in visual encoding, it incurs a performance drop compared to linear regression in auditory encoding. This perhaps suggests that the boost in accuracy seen for shared models is largely due to inter-subject knowledge transfer rather than the convolutional response model itself.
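For concreteness, the two quantities behind these results, the noise-ceiling-normalized correlation of Section 5.2.4 and the Dice overlap of thresholded contrasts from Section 5.2.5, can be sketched as follows. This is an illustrative sketch and not the evaluation code used here: the noise ceilings and the list of synchronous voxels are taken as given inputs, thresholding is assumed to keep the voxels with the largest contrast values, and all names and toy numbers are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length time series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def prediction_accuracy(pred, meas, noise_ceiling, synchronous):
    """Mean noise-ceiling-normalized correlation over synchronous voxels.

    pred, meas:    per-voxel lists of predicted / measured time series
    noise_ceiling: per-voxel noise ceilings for this subject
    synchronous:   indices of voxels with high inter-subject correlation
    """
    scores = [pearson(pred[v], meas[v]) / noise_ceiling[v] for v in synchronous]
    return sum(scores) / len(scores)

def top_fraction(values, fraction=0.05):
    """Indices of the top `fraction` of voxels by contrast value."""
    k = max(1, round(fraction * len(values)))
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return set(order[:k])

def dice(a, b):
    """Dice overlap between two sets of voxel indices."""
    return 2 * len(a & b) / (len(a) + len(b))

# Toy example: two voxels, only voxel 0 counted as synchronous.
pred = [[1.0, -1.0, 0.0, 0.0], [1.0, 2.0, 3.0, 4.0]]
meas = [[1.0, 0.0, -1.0, 0.0], [2.0, 4.0, 6.0, 8.0]]
accuracy = prediction_accuracy(pred, meas, noise_ceiling=[0.8, 0.9], synchronous=[0])

# Toy contrasts over 40 voxels: the top 5% keeps 2 voxels in each map.
predicted_contrast = [float(i) for i in range(40)]
true_contrast = [float(i) for i in range(40)]
true_contrast[37], true_contrast[38] = true_contrast[38], true_contrast[37]
overlap = dice(top_fraction(predicted_contrast), top_fraction(true_contrast))
```

Filling an N × N matrix with `dice` values for every (predicted subject, true subject) pair gives the kind of matrix whose diagonal dominance is examined for the face and scene contrasts.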
In Figure 5.3(A) & 5.3(B), we visualize the un-normalized correlations between the predicted and measured fMRI response for the proposed models, averaged across subjects. For the auditory model, we see significant correlations in the parabelt auditory cortex, extending into the superior temporal sulcus and some other language areas (55b) as well. For the visual model, while we see significant correlations across the entire visual cortex (V1-V8), the performance is much better in higher-order visual regions, presumably because of the semantically rich features. The lower performance in early visual regions could also result from the dynamic nature of visual stimulation in movies.

Figure 5.3(C) & 5.3(D) illustrate the ability of our proposed model to characterize individual differences even beyond the experimental paradigm it was trained on. The diagonal dominance in the Dice matrix for both contrasts suggests that predicted contrasts are most similar to the same subject’s true contrast. No prominent diagonal structure was observed for individual subject models, presumably because of their poor generalization to out-of-domain stimuli from the HCP task battery. Further, predicted contrasts consistently highlight known areas for face and scene processing, namely the fusiform face area [336] and parahippocampal areas [337], respectively.

Figure 5.3: (A), (B) Correlations between predicted response of the proposed model and true time series of each voxel averaged across subjects. Only significantly predicted voxels are shown (p<0.05, FDR corrected). Dice matrices of predicted versus true contrasts for (C) faces and (D) scenes stimuli. (E) & (F) depict contrasts of two randomly selected subjects. ROIs are labelled from the HCP MMP parcellation [264].

5.4 Discussion

In this paper, we presented a framework for utilizing multi-subject fMRI data to improve individual-level neural encoding. We showcased our approach on both auditory and visual stimuli and demonstrated consistent improvement over competing approaches. Our experiments further suggest that a single experiment (free-viewing of movies) can characterize a multitude of brain processes at once. This has important implications for brain mapping, which traditionally relies on a battery of carefully constructed stimuli administered within block designs. Inter-subject variability in response patterns induced by the complexity of naturalistic viewing can facilitate the development of novel imaging-based biomarkers. Neural encoding models are not constrained to modeling the response to a limited set of experimental stimuli; their good generalization performance suggests that they can capture broad theories of cognitive processing. Accurate, individualized neural encoding models can thus bring us one step closer to achieving the goal of biomarker discovery.

CHAPTER 6
A COMPUTATIONAL STRATEGY TO RICHLY CHARACTERIZE THE HUMAN VISUAL CORTEX UNDER NATURALISTIC CONDITIONS

6.1 Introduction

How does the brain transform the raw, pixel-like input to the eyes into meaningful features of our external environment and use it to guide perception and visually-guided behaviors? Understanding the nature of representations and computations in the visual system has been a longstanding goal in neuroscience. Large strides have been made in this quest in early stages of the visual cortex by presenting model organisms with simple, abstract stimuli like noise, oriented edges or sine-wave gratings and studying the properties of evoked responses, such as the peak of their tuning curves. These carefully crafted experiments, for instance, revealed the presence of edge detectors in the primary visual cortex [338] and sparked the search for the preferred stimuli or the ‘optimal input’ beyond early visual areas across the visual cortical hierarchy.
Over the last two decades, through numerous experiments employing different visual stimulus sets, recording and analysis techniques, a conceptual understanding has emerged that early visual areas extract low-level features like edges or curves, mid-level regions extract complex local shapes and high-level regions (at the furthest end of the visual hierarchy) encode and represent different semantic categories, like faces or scenes [339, 340, 341, 342, 343]. Further, while the underlying representational transformations that enable rapid, flexible and efficient visual perception remain largely unknown, there is accumulating evidence that visual categorization occurs in the ventral temporal cortex in humans [344]. This understanding of the visual system was enabled by synthesizing findings across several different hypothesis-driven experimental studies, most of them employing artificial stimuli and simple tasks atypical of real-world vision, thus lacking ecological validity.

Here, we propose a principled framework based on computational models and out-of-sample generalization to bring together findings from decades’ worth of study in human visual neuroscience and characterize neural response properties under ‘ecological’ conditions systematically in an entirely data-driven fashion. Rather than studying different components of visual processing in isolation with simple, abstract stimuli, we focus on a sufficiently complex stimulus ensemble to study all components at once. Throughout, we advocate an increased use of naturalistic stimuli to develop rich models of visual processing and an integrated approach to study visual processing in a natural context in the brain.

To better understand the neural processes underlying visual perception in rich, naturalistic conditions, we require simplified versions of the system under investigation that abstract away from the exact biological implementation details, while keeping other functionalities of interest largely intact.
Deep convolutional neural networks (DCNNs) are a promising choice for modelling these brain regions because the principles or elements with which they are built, namely hierarchical structure and convolutions, resemble the spatially local, tiled processing happening in the visual cortex. Deep neural networks trained on image categorization have already set new standards for predicting neural responses along the ventral visual pathway in humans and non-human primates [243, 332]. High neural predictivity of task-optimized systems offers important clues into information processing within the brain; the observation that simply optimizing for goals like object recognition can lead to the emergence of representations highly predictive of neural responses across the ventral visual pathway suggests that activity in artificial neural networks and biological networks could be aligned by shared computational goals and offers an intriguing way to test computational theories of the mind [345]. However, the use of these deep neural network models beyond response prediction and comparison of different representational models has been viewed with much skepticism and promoted with reserve. Deep neural networks have been critiqued as being ‘black boxes’ and attempts to understand the brain using such models have been widely likened to ‘replacing one uninterpretable complex network with another’. Indeed, such models are notoriously difficult to understand as they often contain millions of parameters as connection weights. Here, we posit that better predictivity need not come at the cost of explanation and that, with powerful techniques, it may be possible both to probe the learned structure within these deep learning models of visual cortex and to uncover the response selectivity of model neurons or voxels.
There are several advantages of training a computational model directly on the task of reproducing neural data rather than employing a task-optimized network for the goal of understanding neuronal tuning properties. Encoding models fitted directly to neural data with minimal a priori constraints, e.g., generic architecture and random initialization, lend themselves naturally to constraint-based inferences, since any set of features that emerges in these trained networks is optimized to explain representations in the brain. In this study, we leverage recent advances in large-scale fMRI data collection [346] to train computational models directly on neural data with novel predictive precision. Importantly, the stimulus set employed in this study contains crowded images of multiple common objects in their natural contexts at varied viewpoints, thus being more typical of everyday scenes and allowing us to characterize neural representations and computations in rich, naturalistic conditions. As we move towards a more naturalistic neuroscience amassing datasets with complex stimuli unamenable to parametrization, we also need concomitant methodological advances to efficiently derive conceptual understanding from this rich, high-dimensional data. Here, we further develop a systematic methodological framework for using computational models to decipher neuronal tuning properties. We illustrate how computational models can provide a powerful, unifying framework for building broad, generalizable theories about neural information processing under ecologically valid paradigms, focusing on the particular domain of vision. Specifically, we interpret these encoding models to reveal the selectivity of individual voxels along the ‘where’ and ‘what’ dimensions to answer the following important questions about their response properties: What portion of the input space is the neuronal population most sensitive to (the ‘where’ dimension)?
What features of stimuli do these voxels care about (the ‘what’ dimension)? Finally, we also go beyond single-voxel analysis to analyzing the population code and employ the learned distributed representations to characterize how visual information representation evolves along the ventral visual pathway. This allows us to probe the information encoded in fine-scale patterns of activity in populations of neurons in different visual cortical regions. One major advantage of computational models over direct examination of neural responses to pre-selected stimuli is that once these models are trained, we have complete access to their connection weights and predicted response for arbitrarily large stimulus sets. In this way, models allow us to ‘play out’ experiments currently infeasible due to time, budget or other constraints. Here, we capitalize on the high predictive accuracy of our response-optimized models to use them as in-silico substitutes of fMRI experiments, characterize the representational geometry on structured datasets and further investigate the functional organization of the human ventral-occipital cortex.

6.2 Materials and Methods

6.2.1 Natural Scenes Dataset

A detailed description of the Natural Scenes Dataset (NSD¹) is provided elsewhere [346]. Here, we just briefly summarize the data acquisition and preprocessing steps. The NSD dataset contains measurements of fMRI responses from 8 participants who each viewed 9,000–10,000 distinct color natural scenes (22,000–30,000 trials) over the course of 30–40 scan sessions. Scanning was conducted at 7T using whole-brain gradient-echo EPI at 1.8-mm resolution and 1.6-s repetition time. Images were taken from the Microsoft Common Objects in Context (COCO) database (Lin et al., 2014), square cropped, and presented at a size of 8.4° × 8.4°. A special set of 1,000 images was shared across subjects; the remaining images were mutually exclusive across subjects.
Images were presented for 3 s with 1-s gaps in between images. Subjects fixated centrally and performed a long-term continuous recognition task on the images. The fMRI data were pre-processed by performing one temporal interpolation (to correct for slice time differences) and one spatial interpolation (to correct for head motion). A general linear model was then used to estimate single-trial beta weights. Cortical surface reconstructions were generated using FreeSurfer, and both volume- and surface-based versions of the beta weights were created. We focused on 5 visual cortical ROIs in the study. Three ROIs belonging to the retinotopic early visual cortex, namely V1, V2 and hV4, were defined using a pRF localizer scan session for each subject. Two higher-order ROIs, namely ventral-occipital areas (VO1-2) and lateral-occipital areas (LO1-2), were delineated using a popular visual probabilistic atlas [347].

¹ http://naturalscenesdataset.org

6.2.2 Response-optimized encoding model architecture

We trained separate voxel-level predictive models for each of the above regions with the same backbone architecture. The predictive model comprises a shared convolutional neural network core common across all subjects that represents the feature space unique for specific visual areas. We employ a linear readout model on top of the feature space to predict the responses of individual voxels in a specific region of interest under the assumption that the feature space likely represents the input received by these areas and these regions perform close-to-linear transformations on this input. A linear readout on a shared feature space is further based upon the often-made assumption that the activity across a set of neurons or voxels in one individual can be related to the activity of a second individual in the homologous functional region by a linear transform [348].
Further, the linear readout is also factorized into spatial and feature dimensions following popular methods for neural system identification [349]. This allows us to separate spatial tuning or receptive field locations (i.e., what portion of the sensory space is the neuron population most sensitive to?) from feature tuning (i.e., what features of the visual input is the population sensitive to?). The base feature extraction network or the core thus performs all nonlinear transformations to convert the raw sensory stimuli (i.e., pixels) into a representation characteristic of a particular visual area, whereas the readout linearly maps this extracted representation into voxel responses.

The core consists of four sequential convolutional blocks, with each block comprising the following feedforward computations: two convolutional layers, each followed by an inner batch norm and a nonlinear activation (ReLU), and an anti-aliased AvgPool operation at the end. Instead of regular convolutions, we employ E(2)-steerable convolutions in all our models to extract features independent of orientation [350]. This modeling choice is also inspired by neural computations in early visual areas where it is hypothesized that groups of neurons perform similar computations at different orientations, e.g., edge or curve detection at different orientations.

The readout contains all voxel-specific parameters and maps the extracted representation to individual voxel responses. Weights of the readout are a sum of outer products between a spatial filter and a feature vector. The spatial filter further had a positivity constraint (enforced using rectification) and was normalized independently for each voxel by dividing each spatial weight by the square root of the sum of squared spatial weights.
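As a rough illustration of the factorized readout described above, the following NumPy sketch rectifies and normalizes a per-voxel spatial mask and combines it with a per-voxel feature vector. It is not the implementation used in this work: the function name `factorized_readout`, the array shapes, the small `eps` guard against empty masks and the restriction to a single (rank-1) outer product per voxel are assumptions made for illustration; a higher-rank readout would sum several such terms.

```python
import numpy as np


def factorized_readout(features, spatial_raw, feature_weights, eps=1e-8):
    """Map core features to voxel responses with a rank-1 factorized readout.

    features:        (batch, channels, height, width) output of the core
    spatial_raw:     (n_voxels, height, width) unconstrained spatial parameters
    feature_weights: (n_voxels, channels) per-voxel feature vector
    Returns (responses, mask) with shapes (batch, n_voxels) and
    (n_voxels, height, width).
    """
    # Positivity constraint enforced by rectification.
    mask = np.maximum(spatial_raw, 0.0)
    # Normalize each voxel's mask by the square root of its summed squares.
    norm = np.sqrt((mask ** 2).sum(axis=(1, 2), keepdims=True))
    mask = mask / np.maximum(norm, eps)
    # "Where": pool the feature map under each voxel's spatial mask.
    pooled = np.einsum("bchw,vhw->bvc", features, mask)
    # "What": weight the pooled channels by each voxel's feature vector.
    responses = np.einsum("bvc,vc->bv", pooled, feature_weights)
    return responses, mask


rng = np.random.default_rng(0)
features = rng.normal(size=(2, 3, 4, 4))    # toy core output
spatial_raw = rng.normal(size=(5, 4, 4))    # 5 hypothetical voxels
feature_weights = rng.normal(size=(5, 3))
responses, mask = factorized_readout(features, spatial_raw, feature_weights)
print(responses.shape)  # (2, 5)
```

Keeping the mask and the feature vector as separate factors is what allows the spatial selectivity of a voxel to be inspected independently of its feature selectivity.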
6.2.3 Training and testing models

Combined across all 4 subjects, the dataset comprises 37,000 natural scene images, among which 1,000 images are shared across all subjects and the rest are mutually exclusive. We used the 1,000 shared images for testing our models and split the remaining stimulus set into 35,000 training and 2,000 validation images. All parameters of the response-optimized model were optimized jointly to minimize the mean squared error between the predicted and measured response. Models were trained for a maximum of 100 epochs using Adam with a learning rate of 1e-4, a batch size of 16 and early stopping (patience = 20) based on the Pearson’s correlation coefficient between the predicted and measured responses on the validation set; validation curves were monitored to ensure convergence. The proposed method allows us to propagate errors through the shared network even if the subjects are not exposed to common stimuli, since we can always exclude the subjects/voxels for which the response is not present from the mean error calculation within each batch. The shared network thus benefits from diverse, varying stimuli across subjects with less extensive constraints on data collection from single subjects. We measure performance (‘predictive accuracy’) on the test images by computing the Pearson’s correlation coefficient between the predicted and measured fMRI response at each voxel.

Figure 6.1: Schematic and Quantitative Results. A shows the convolutional neural network model with factorized readout and B depicts the 5 visual areas considered in this study. Quantitative assessment of different models is shown as a boxplot in C. D shows the count of voxels that are better predicted by each model along with the difference in prediction accuracies (R). E shows the raw prediction accuracy, as estimated by the Pearson’s correlation coefficient (R), across the cortical flat map for all 4 subjects.
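The two quantities used in training and evaluation can be sketched as follows: a mean squared error that skips voxels with no measured response for a given stimulus, and a per-voxel Pearson correlation on held-out data. This is a minimal NumPy sketch under stated assumptions, not the training code used here: the names `masked_mse` and `voxelwise_pearson` and the boolean `present` array are hypothetical, and missing responses are assumed to be stored as NaN.

```python
import numpy as np


def masked_mse(pred, target, present):
    """Mean squared error over entries with a measured response.

    pred, target: (batch, n_voxels); target may hold NaN where nothing was measured
    present:      boolean (batch, n_voxels), True where a measurement exists
    """
    squared_error = np.where(present, (pred - target) ** 2, 0.0)
    n_present = present.sum()
    return squared_error.sum() / max(n_present, 1)


def voxelwise_pearson(pred, target):
    """Pearson's correlation across stimuli, computed separately per voxel.

    pred, target: (n_stimuli, n_voxels) -> (n_voxels,)
    """
    pred_c = pred - pred.mean(axis=0)
    target_c = target - target.mean(axis=0)
    denom = np.sqrt((pred_c ** 2).sum(axis=0) * (target_c ** 2).sum(axis=0))
    return (pred_c * target_c).sum(axis=0) / denom


# Toy batch: the second voxel has no measurement for the first stimulus.
pred = np.array([[1.0, 2.0], [3.0, 4.0]])
target = np.array([[1.0, np.nan], [5.0, 4.0]])
present = ~np.isnan(target)
loss = masked_mse(pred, target, present)  # (0 + 4 + 0) / 3

test_pred = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
test_meas = np.array([[2.0, 3.0], [4.0, 2.0], [6.0, 1.0]])
accuracy = voxelwise_pearson(test_pred, test_meas)
```

Because masked entries contribute zero to the sum, they would also contribute zero gradient in an automatic-differentiation framework, which is the property the shared core relies on when subjects see different images.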
Figure 6.2: Schematic of retinotopic parameters

6.2.4 Comparison against retinotopic measurements from pRF-localizer scan

Since our models allow us to separate spatial selectivity from feature selectivity, we first wanted to assess whether the spatial receptive field learned by these encoding models on natural stimuli agrees with the spatial receptive field measured by carefully controlled experiments that use artificial, spatially modulated stimuli. The population receptive field (pRF) is defined as the region of the visual field within which a stimulus results in an increased aggregated activity across populations of neurons, as reflected in fMRI measurements. Here, we detail the procedure we followed for quantifying the agreement between the pRF parameters learned by the encoding model and the parameters of the pRF model estimated from an independent retinotopy experiment. Assuming that the receptive field is a 2D Gaussian, retinotopic organization in the brain is typically defined using three parameters that describe the pRF location in polar coordinates (polar angle, eccentricity) and pRF size, as shown in Figure 6.2. We computed the pRF location as the average of the coordinate mesh weighted by the learned spatial mask value at each grid location. To compare the estimated polar angle values against the corresponding measurements from pRF localizer scans, we adopted measures from circular statistics. We note that we cannot employ the usual summary statistics or correlation measures of linear statistics due to the circular nature of angular data.
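The two steps just described, a mask-weighted pRF location and a correlation measure suited to angles, can be sketched in NumPy as follows. This is an illustrative sketch rather than the analysis code used here. The function names, the coordinate convention (image centre at the origin, upward as positive y) and the use of the 8.4° stimulus extent as the grid size are assumptions; the circular correlation is written in its standard sine-based form.

```python
import numpy as np


def prf_center(mask, extent_deg=8.4):
    """Mask-weighted average of a coordinate mesh, in degrees of visual angle."""
    height, width = mask.shape
    half = extent_deg / 2.0
    ys = np.linspace(half, -half, height)  # assumed: row 0 is the top of the image
    xs = np.linspace(-half, half, width)
    grid_x, grid_y = np.meshgrid(xs, ys)   # both of shape (height, width)
    total = mask.sum()
    return (grid_x * mask).sum() / total, (grid_y * mask).sum() / total


def to_polar(x0, y0):
    """Convert a pRF centre to (polar angle in radians, eccentricity in degrees)."""
    return np.arctan2(y0, x0), np.hypot(x0, y0)


def circular_mean(angles):
    return np.arctan2(np.sin(angles).mean(), np.cos(angles).mean())


def circular_corrcoef(measured, predicted):
    """Circular correlation between two arrays of angles (radians)."""
    s_m = np.sin(measured - circular_mean(measured))
    s_p = np.sin(predicted - circular_mean(predicted))
    return (s_m * s_p).sum() / np.sqrt((s_m ** 2).sum() * (s_p ** 2).sum())


# A toy 5x5 mask with all weight in the top-right corner.
mask = np.zeros((5, 5))
mask[0, 4] = 1.0
angle, eccentricity = to_polar(*prf_center(mask))

angles = np.array([0.1, 0.5, 1.0, 2.0])
r_same = circular_corrcoef(angles, angles + 2 * np.pi)  # wrap-around is ignored
r_flip = circular_corrcoef(angles, -angles)
```

Unlike a linear correlation, the circular version treats an angle and the same angle plus a full turn as identical, which is why it is preferred for polar-angle maps.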
The circular correlation coefficient between the measured and predicted polar angle arrays in an ROI with n voxels, respectively denoted by $\{a_m^1, \ldots, a_m^n\}$ and $\{a_p^1, \ldots, a_p^n\}$, is then calculated as

\[
r = \frac{\sum_{i=1}^{n} \sin(a_m^i - T_m)\,\sin(a_p^i - T_p)}{\sqrt{\sum_{i=1}^{n} \sin^2(a_m^i - T_m)\;\sum_{i=1}^{n} \sin^2(a_p^i - T_p)}}
\]

where $T_m$ and $T_p$ are the circular mean angles of the measured and predicted polar angle vectors, respectively,

\[
T_m = \operatorname{atan2}\!\left(\frac{1}{n}\sum_{i=1}^{n} \sin a_m^i,\; \frac{1}{n}\sum_{i=1}^{n} \cos a_m^i\right), \qquad
T_p = \operatorname{atan2}\!\left(\frac{1}{n}\sum_{i=1}^{n} \sin a_p^i,\; \frac{1}{n}\sum_{i=1}^{n} \cos a_p^i\right)
\]

6.2.5 Representational Similarity Analysis

A complementary perspective to single neuron/voxel tuning is that ensembles of neurons serve as functional units of the brain and that meaningful aspects of our external world are not encoded in single neurons but in populations of neurons. So, here we attempted to characterize what kind of information is encoded in learned patterns of activity in different visual areas through a popular analysis framework called representational similarity analysis (RSA) [351]. Here, we correlated the patterns of activity between multiple exemplars from different categories of stimuli to obtain a representational similarity matrix (RSM) for each visual area, which highlights its representational geometry. Further, we quantified the separability of categorical information within these similarity matrices by computing a correlation coefficient (Kendall’s τ coefficient) between model RSMs and a ground truth adjacency matrix defined by object category labels. Specifically, the elements of this matrix are 0 if the corresponding two images belong to different categories and 1 if the images belong to the same category. This analysis was first performed using a subset of images from the THINGS database [352] that belong to a pre-defined set of categories. These categories were chosen in accordance with previous fMRI studies employing RSA [353, 354] and include the following: {face, hand, elephant, cat, plant, fruit, car, tool}.
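The RSA procedure above can be sketched in a few lines of NumPy: build an RSM from predicted activity patterns, build the 0/1 category adjacency matrix, and rank-correlate their off-diagonal entries. This is only a sketch under assumptions. The function names are hypothetical, the text does not state which variant of Kendall's τ was used, so the tie-corrected τ-b is assumed here, and the brute-force pairwise implementation is only suitable for small examples.

```python
import numpy as np


def rsm(responses):
    """Pearson correlation between activity patterns; responses: (n_images, n_voxels)."""
    return np.corrcoef(responses)


def category_adjacency(labels):
    """1 where two images share a category label, 0 otherwise."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)


def kendall_tau_b(x, y):
    """Tie-corrected Kendall's tau, O(n^2) in the number of entries."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    i, j = np.triu_indices(len(x), k=1)
    dx = np.sign(x[i] - x[j])
    dy = np.sign(y[i] - y[j])
    concordant_minus_discordant = (dx * dy).sum()
    untied_x = np.count_nonzero(dx)
    untied_y = np.count_nonzero(dy)
    return concordant_minus_discordant / np.sqrt(untied_x * untied_y)


def category_separability(responses, labels):
    """Kendall's tau between the RSM and the category adjacency (upper triangle only)."""
    upper = np.triu_indices(len(labels), k=1)
    return kendall_tau_b(rsm(responses)[upper], category_adjacency(labels)[upper])


# Toy predicted responses: two images per category, four hypothetical voxels.
responses = np.array([
    [1.0, 0.0, 0.0, 1.0],
    [0.9, 0.1, 0.0, 1.1],
    [0.0, 1.0, 1.0, 0.0],
    [0.1, 1.1, 0.9, 0.0],
])
labels = ["face", "face", "car", "car"]
tau = category_separability(responses, labels)
```

A value of τ near 1 indicates that same-category image pairs are consistently more similar than different-category pairs, that is, a block-diagonal RSM.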
Subsequently, we also validated the obtained results using other popular vision datasets like CIFAR10 and CIFAR100 [355].

6.3 Results

Response-optimized models achieve high neural predictivity across large swathes of the visual cortex

First, we wanted to perform a quantitative assessment of our models. We observe high prediction accuracy (Pearson’s R > 0.6) across large swathes of the visual ROIs, including the higher-order ROIs (LO1-2 and VO1-2) (Figure 6.1E). Next, we investigated the possible quantitative advantages of learning a rotationally-symmetric feature space against the representations from a standard convolutional core with no weight sharing across orientations. Importantly, we observe that the rotationally-symmetric convolutional core performs on par with the standard convolutional core (Figure 6.1C,D), even outperforming the latter in certain visual ROIs, suggesting that sharing weights across filter orientations potentially provides a strong inductive bias, allowing us to fit more expressive models efficiently.

Figure 6.3: Quantifying the agreement between the measured pRF eccentricities and the pRF eccentricities estimated from predictive computational models across different voxels. A Subject and ROI-specific scatter plots depict predicted eccentricities against measured eccentricities. Pearson’s correlation coefficient between the two quantities is displayed in blue in each scatter plot. B Predicted and measured eccentricities across all early visual ROI voxels displayed on the cortical surface for each subject.

Figure 6.4: Quantifying the agreement between the measured pRF polar angles and the pRF polar angles estimated from predictive computational models across different voxels. A Subject and ROI-specific scatter plots depict predicted polar angles against measured polar angles. Pearson’s correlation coefficient between the two quantities is displayed in blue in each scatter plot.
B Predicted and measured polar angles across all early visual ROI voxels displayed on the cortical surface for each subject.

Figure 6.5: Quantifying the agreement between the measured pRF sizes and the pRF sizes estimated from predictive computational models across different voxels. A Subject and ROI-specific scatter plots depict predicted sizes against measured sizes. Pearson’s correlation coefficient between the two quantities is displayed in blue in each scatter plot. B Predicted and measured pRF sizes across all early visual ROI voxels displayed on the cortical surface for each subject.

Learned spatial masks reproduce the fine-grained retinotopic organization of the early visual cortex

Next, we estimate the population receptive field from the learned spatial masks and compare the estimated parameters against the measurements from the pRF localizer scan session. It is important to note that the encoding models are entirely unconstrained by anatomy. In fact, these models were trained with no knowledge about the spatial proximity of different voxels, since the responses of voxels were modeled as independent linearly weighted (and factorized) sums of the representation extracted by the shared core network. Despite this, the estimated spatial topographic organization from the encoding models exhibits remarkable levels of agreement with the fine-scale retinotopic organization of the cortex (Figures 6.3, 6.4 and 6.5) without any explicit supervision to do so. Previously, retinotopic organization of the human visual cortex has predominantly been studied with spatially modulated stimuli. Here, we present an alternate approach based on naturalistic stimulation and predictive models, without any artificially crafted stimuli, to delineate subject-specific retinotopic maps.
Response-optimized models generalize remarkably well to new subjects with low sample complexity

Next, we wanted to assess how each of these predictive models generalizes to the remaining set of 4 subjects that were not used to train the model. For this analysis, we train only the linear predictor while keeping the representation/core network fixed and vary the number of stimulus-response pairs from the new subjects used to train the readouts. As shown in Figure 6.6, the models can achieve high accuracy even when the number of stimulus-response pairs is 500, i.e., only about 5% of the training examples used from the original subjects to train the entire network. This remarkable generalization of response-optimized networks to novel subjects suggests that they contain strong “inductive biases” and are able to sufficiently constrain the space of possible solutions in the right manner so that the model can generalize to a novel subject with few samples. Furthermore, even with these small training sets from novel subjects, we can characterize their retinotopic organization remarkably well (Figure 6.6).

Figure 6.6: Quantifying the generalization ability across subjects. (Left) Prediction performance and (Right) Agreement between estimated and measured retinotopic maps as a function of training examples (stimulus-response pairs) from novel subjects.

Data-driven models reveal the monotonically increasing separability of category information along the ventral stream hierarchy

Task-optimized computational models have suggested that the ventral stream hierarchy ‘untangles’ representations through a sequence of processing steps, so that representations of objects and categories that are inseparable at the initial processing stage (e.g., V1) become untangled at the last stage of the hierarchy [345, 356]. Here, we present even stronger evidence for this increased separability through hypothesis-free computational models.
Previous studies have shown that the representational geometry for the same set of stimuli varies systematically across different cortical areas, providing a useful signature for how information is transformed across different cortical processing streams [356]. Here, we employ the representational similarity analysis (RSA) framework to relate emergent representations in different computational models of the brain against a ground truth adjacency matrix defined by object category labels. We note that all models retain signatures of individual voxel-level idiosyncrasies, as indicated by the prominent diagonal structure of the spatial generalization matrices (Figure 6.7); this enables us to perform population-level analysis with these predictive models. Keeping the model architecture the same, simply by changing the response targets for an encoding model from voxels in visual areas V1 through LO, we observe a drastic change in the geometry of the extracted representation. The increasingly prominent block-diagonal structure, consistent with categorical distinctions, in the RSMs along the ventral visual stream highlights that the distributed patterns of activity to exemplars of the same category become increasingly more similar and distributed patterns to exemplars of different categories become progressively dissimilar along the processing stream, strongly supporting the computational goal of the ventral visual stream as a visual categorization system. We can also assess the separability of categorical information in different visual areas more rigorously using quantitative metrics (Figure 6.8B). Here again, we see that this finding regarding increasing separability of categorical information holds across different datasets containing different visual categories.

Figure 6.7: Spatial generalization matrices. Predicted response for a voxel in one ROI is correlated with the measured response of every other voxel within the same ROI (both within and across participants) to obtain a spatial generalization matrix for every ROI. Blue lines mark subject boundaries. Strong diagonal structure indicates that the predicted response for a voxel best matches the measured response of the same voxel, indicating the ability of the models to capture voxel-level idiosyncrasies.

Figure 6.8: Separability of category information across the ventral visual stream. A Matrices of all pairwise similarities between the representational geometries in different visual ROIs. B Results of the Representational Similarity Analysis (RSA) framework applied to several visual datasets (THINGS, CIFAR100 and CIFAR10) containing different categories of objects.

Computational models reveal the functional organization within ventral-occipital cortex

Next, we wanted to discover whether there exists a systematic structure in the neural representations in the ventral-occipital region. Specifically, we built a low-dimensional neural representational space by passing a large stimulus set (∼27,000 images from the THINGS database [352]) through the predictive models and then performing Principal Components Analysis (PCA) on the predicted responses of this network. Visualizing the images that elicit the highest and lowest activation of individual principal components (PCs) reveals a striking functional organization: the first principal component, which explained ∼49% of the variance in response, strongly reflects the gradient for representing animate versus inanimate categories, despite vast differences in the visual appearance of different animate categories (or inanimate categories). The second principal component, accounting for ∼16% of the variance, corresponded roughly to the curvature versus rectilinearity distinction.
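The low-dimensional analysis described above can be sketched as a PCA over a matrix of predicted responses, followed by a ranking of images along each component. The NumPy sketch below is illustrative only: the function names `pca` and `extreme_images` are hypothetical, the toy data stand in for model predictions, and PCA is computed via a singular value decomposition of the mean-centred responses, which is one standard way to obtain it.

```python
import numpy as np


def pca(responses, n_components=2):
    """PCA of predicted responses; responses: (n_images, n_voxels).

    Returns (scores, components, explained_ratio) for the leading components.
    """
    centered = responses - responses.mean(axis=0, keepdims=True)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    variance = s ** 2
    explained_ratio = variance / variance.sum()
    scores = u[:, :n_components] * s[:n_components]  # image projections onto PCs
    return scores, vt[:n_components], explained_ratio[:n_components]


def extreme_images(scores, component, k=5):
    """Indices of the k most and k least activating images for one component."""
    order = np.argsort(scores[:, component])
    return order[-k:][::-1], order[:k]


# Toy stand-in for predicted responses: 5 images, 2 voxels, one latent axis.
latent = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
direction = np.array([0.6, 0.8])
toy_responses = np.outer(latent, direction)

scores, components, explained = pca(toy_responses, n_components=2)
top, bottom = extreme_images(scores, component=0, k=1)
```

Note that the sign of each principal component is arbitrary, so which end of an axis is labelled "most activating" carries no meaning by itself; only the contrast between the two ends does.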
These results provide a unified explanation for many previous findings regarding the functional organization of the ventral visual stream, including the existence of animate versus inanimate distinctions, and an axis for curved versus rectilinear shapes [357].

6.4 Discussion

The field of cognitive neuroscience has been largely limited to testing one hypothesis about cognition at a time, with task-based experimental designs coupled with statistical inference techniques being the standard workhorses of the cognitive neuroimaging paradigm. This traditional approach has been instrumental in revealing how different brain regions respond to particular manipulations of mental or perceptual functions. However, being overly restricted to studying components of mental processing in isolation leads to paradigm-bound theories that often fail to generalize outside the experimental circumstance they were based on. Hypothesis-free computational models overcome this tradition of excessive reductionism by providing a general-purpose framework that abstracts away from the particulars of the experimental approach and can be used to describe multiple experiments at the same time.

Figure 6.9: A low-dimensional space characterizes the functional organization of the ventral-occipital region. A Most and least activating images for the first two PCs. C Total explained variance as a function of the number of PCs. D Pearson’s correlation coefficient between the domain-selectivity of individual voxels against their projections onto the two PCs. E All images from the THINGS dataset projected onto the first two principal dimensions of the response. F and G Scatter plots depicting the domain-selectivity against the corresponding PC projection for all VO voxels.
In this study, we asked: can a single model, trained solely on complex naturalistic images that mimic everyday experience, simultaneously reproduce multiple neural phenomena characterized over the last two decades with controlled fMRI studies? Our experiments gave an affirmative answer to this question, and this study yielded hypothesis-free computational models that mimic several known properties of the ventral visual cortex. Importantly, we now have models that can simultaneously account for many experiments. We have models that can replicate the increasing separability of information, and understanding the mechanisms by which they achieve it can provide inspiration for computer vision models. Further, we have complete access to the connection weights of these models and predicted responses for arbitrarily large stimulus sets. These models can thus serve as a source of mechanistic hypotheses about neural information processing. Beyond facilitating an improved understanding of the ventral visual pathway, these simplified versions of brain areas can also be used as exploratory hypothesis-generation tools in subsequent studies. Importantly, these networks are trained solely with supervision from brain response prediction without any category labels, yet they can successfully capture the linear separability of categorical information from human analogues of IT. One potential reason why these response-optimized DNNs have not previously been explored in the context of higher-order brain areas is that the highly complex and invariant tuning of the IT features had been difficult to characterize directly (task-driven networks being an exception) with smaller stimulus-response datasets. In the present study, we capitalize on the natural variation in the rich Natural Scenes Dataset to train these models directly on neural data and study how different visual areas respond to different stimulus dimensions.
With creative model interpretability methods applied to deep neural network models of human vision, we hope to realize their full potential in characterizing the precise function of different brain areas, their input-output relationships when exposed to arbitrary stimuli and the underlying computational mechanisms. The goal is to move beyond the tradition of experimental reductionist approaches in cognitive neuroscience to using a combination of computational models, particularly deep learning models, and ethologically relevant naturalistic stimuli in order to understand the neural encoding of sensory information. And so, while cognitive neuroimaging has been limited to testing one hypothesis at a time with controlled studies and artificial stimuli, what we argue for here is that one could instead fit complex neural network models to a rich set of ethologically relevant natural stimuli to understand how different parts of a single model can simultaneously account for responses to artificial stimuli across many experiments. This approach would enable us to stitch together disparate findings and understand how they may arise from a unified process.

Probing computational models of the brain poses several challenges that may lead to confounded conclusions and hypotheses about neural computations. The most important confound is that model predictions can deputize for experimental data only to the extent that they are ‘accurate’; ultimately, all interpretative analyses and conclusions rest on the predictive accuracy of the model. It is further easy for models to learn trivial input-output dependencies without remaining faithful to the mechanism. Despite these limitations, computational models can nonetheless provide novel testable hypotheses, which can be accepted or refuted with future experimentation. Even if a hypothesis is refuted, model failures can further drive model development and lead to the generation of improved hypotheses.
In this manner, a tight loop between modeling and experiments, beyond simple offline analyses of neural data, can expedite neuroscientific discovery.

CHAPTER 7
LOOKING AHEAD

My research thus far has been focused on the broad themes of utilizing brain imaging for understanding cognition and making individual-level predictions of clinical phenotypes. Broadly, to summarize my vision for future work, I hope my research can answer questions of the kind that David Marr elegantly and inspiringly formulated in his influential book Vision: What kind of an information processing device is the brain? How is information from our environment robustly transformed into a coherent percept of the world? What are the fundamental principles underlying neural computations? While we have made significant strides over the last couple of decades, to date we have few satisfying answers for the questions above and there is much to be learned about the function of different brain areas, the means by which the function is achieved, how it comes into being given the constraints faced by the system (over evolution or development) and how disparate brain networks/regions collectively support complex human behavior. At the end of the day, I hope my research can make significant contributions towards understanding the broad general principles that can explain biological intelligence and consequently inform artificial intelligence. Towards the fulfilment of this dream, I envision the new data revolution in neuroscience, with large-scale compilation of neural data and dissemination through open-source initiatives, to play a crucial role in narrowing down the numerous (often incompatible) theories and hypotheses about how the mind works.

BIBLIOGRAPHY

[1] B. Biswal et al. “Functional connectivity in the motor cortex of resting human brain using echo-planar MRI”. In: Magn Reson Med 34.4 (Oct. 1995), pages 537–541 (cited on pages 7, 9, 11).
[2] C. F. Beckmann et al.
“Investigations into resting-state connectivity using independent component analysis”. In: Philosophical Transactions of the Royal Society B: Biological Sciences 360.1457 (May 2005), pages 1001–1013. issn: 0962-8436. doi: 10.1098/rstb.2005.1634. url: http://rstb.royalsocietypublishing.org/cgi/doi/10.1098/rstb.2005.1634 (cited on pages 8, 10, 11, 24, 31).
[3] D. Cordes et al. “Mapping functionally related regions of brain with functional connectivity MR imaging”. In: AJNR Am J Neuroradiol 21.9 (Oct. 2000), pages 1636–1644 (cited on page 8).
[4] J. S. Damoiseaux et al. “Consistent resting-state networks across healthy subjects”. In: Proc. Natl. Acad. Sci. U.S.A. 103.37 (Sept. 2006), pages 13848–13853 (cited on pages 8, 11, 24, 31).
[5] M. De Luca et al. “fMRI resting state networks define distinct modes of long-distance interactions in the human brain”. In: Neuroimage 29.4 (Feb. 2006), pages 1359–1367 (cited on pages 8, 11, 24).
[6] Nico UF Dosenbach et al. “Distinct brain networks for adaptive and stable task control in humans”. In: Proceedings of the National Academy of Sciences 104.26 (2007), pages 11073–11078 (cited on page 8).
[7] Michael D Fox et al. “Spontaneous neuronal activity distinguishes human dorsal and ventral attention systems”. In: Proceedings of the National Academy of Sciences 103.26 (2006), pages 10046–10051 (cited on page 8).
[8] Michelle Hampson et al. “Detection of functional connectivity using temporal correlations in MR images”. In: Human brain mapping 15.4 (2002), pages 247–262 (cited on page 8).
[9] Daniel S Margulies et al. “Mapping the functional connectivity of anterior cingulate cortex”. In: Neuroimage 37.2 (2007), pages 579–588 (cited on page 8).
[10] William W Seeley et al. “Dissociable intrinsic connectivity networks for salience processing and executive control”. In: Journal of Neuroscience 27.9 (2007), pages 2349–2356 (cited on page 8).
[11] S. M. Smith et al. “Correspondence of the brain’s functional architecture during activation and rest”. In: Proc. Natl. Acad. Sci. U.S.A. 106.31 (Aug. 2009), pages 13040–13045 (cited on pages 8, 11).
[12] Michael D Greicius et al. “Default-mode network activity distinguishes Alzheimer’s disease from healthy aging: evidence from functional MRI”. In: Proceedings of the National Academy of Sciences 101.13 (2004), pages 4637–4642 (cited on page 8).
[13] K. Supekar et al. “Network analysis of intrinsic functional brain connectivity in Alzheimer’s disease”. In: PLoS Comput. Biol. 4.6 (June 2008), e1000100 (cited on pages 8, 44).
[14] Yvette I Sheline and Marcus E Raichle. “Resting state functional connectivity in preclinical Alzheimer’s disease”. In: Biological psychiatry 74.5 (2013), pages 340–347 (cited on page 8).
[15] Daniel P Kennedy and Eric Courchesne. “The intrinsic functional organization of the brain is altered in autism”. In: Neuroimage 39.4 (2008), pages 1877–1885 (cited on page 8).
[16] Christopher S Monk et al. “Abnormalities of intrinsic functional connectivity in autism spectrum disorders”. In: Neuroimage 47.2 (2009), pages 764–772 (cited on page 8).
[17] Jocelyn V Hull et al. “Resting-state functional connectivity in autism spectrum disorders: A review”. In: Frontiers in psychiatry 7 (2017), page 205 (cited on page 8).
[18] Amit Anand et al. “Activity and connectivity of brain mood regulating circuit in depression: a functional magnetic resonance study”. In: Biological psychiatry 57.10 (2005), pages 1079–1088 (cited on page 8).
[19] Michael D Greicius et al. “Resting-state functional connectivity in major depression: abnormally increased contributions from subgenual cingulate cortex and thalamus”. In: Biological psychiatry 62.5 (2007), pages 429–437 (cited on page 8).
[20] Peter C Mulders et al. “Resting-state functional connectivity in major depressive disorder: a review”. In: Neuroscience & Biobehavioral Reviews 56 (2015), pages 330–344 (cited on page 8).
[21] Meng Liang et al. “Widespread functional disconnectivity in schizophrenia with resting-state functional magnetic resonance imaging”. In: Neuroreport 17.2 (2006), pages 209–213 (cited on page 8).
[22] Julia M Sheffield and Deanna M Barch. “Cognition and resting-state functional connectivity in schizophrenia”. In: Neuroscience & Biobehavioral Reviews 61 (2016), pages 108–120 (cited on page 8).
[23] R. C. Craddock et al. “Disease state prediction from resting state functional connectivity”. In: Magn Reson Med 62.6 (Dec. 2009), pages 1619–1628 (cited on pages 8, 11, 55).
[24] G. Chen et al. “Classification of Alzheimer disease, mild cognitive impairment, and normal cognitive status with large-scale network analysis based on resting-state functional MR imaging”. In: Radiology 259.1 (Apr. 2011), pages 213–221 (cited on pages 8, 11, 54, 59).
[25] J. A. Nielsen et al. “Multisite functional connectivity MRI classification of autism: ABIDE results”. In: Front Hum Neurosci 7 (2013), page 599 (cited on pages 8, 11, 55, 63).
[26] Michael D Fox and Michael Greicius. “Clinical applications of resting state functional connectivity”. In: Frontiers in systems neuroscience 4 (2010), page 19 (cited on page 8).
[27] Michael Greicius. “Resting-state functional connectivity in neuropsychiatric disorders”. In: Current opinion in neurology 21.4 (2008), pages 424–430 (cited on page 8).
[28] Dongyang Zhang and Marcus E Raichle. “Disease and the brain’s dark energy”. In: Nature Reviews Neurology 6.1 (2010), page 15 (cited on page 8).
[29] D Cordes et al. “Resting-State Functional Connectivity Study using Independent Component Analysis”. In: Proceedings ISMRM 1706 (1999). url: https://cds.ismrm.org/ismrm-1999/PDF6/1706.pdf (cited on page 10).
[30] C. F. Beckmann and S. M. Smith. “Tensorial extensions of independent component analysis for multisubject FMRI analysis”. In: Neuroimage 25.1 (Mar. 2005), pages 294–311 (cited on page 10).
[31] M. D. Greicius et al.
“Functional connectivity in the resting brain: a network analysis of the default mode hypothesis”. In: Proc. Natl. Acad. Sci. U.S.A. 100.1 (Jan. 2003), pages 253–258 (cited on page 11). 212 [32] M. Jung et al. “Default mode network in young male adults with autism spectrum disorder: relationship with autism spectrum traits”. In: Mol Autism 5 (2014), page 35 (cited on page 11). [33] D. Ongur et al. “Default mode network abnormalities in bipolar disorder and schizophrenia”. In: Psychiatry Res 183.1 (July 2010), pages 59–68 (cited on page 11). [34] W. Koch et al. “Diagnostic power of default mode network resting state fMRI in the detection of Alzheimer’s disease”. In: Neurobiol. Aging 33.3 (Mar. 2012), pages 466–478 (cited on page 11). [35] D. Cordes et al. “Frequencies contributing to functional connectivity in the cerebral cortex in ”resting-state” data”. In: AJNR Am J Neuroradiol 22.7 (Aug. 2001), pages 1326–1333 (cited on page 11). [36] Raymond Salvador et al. “Neurophysiological architecture of functional magnetic resonance images of human brain”. In: Cerebral Cortex 15.9 (2005), pages 1332–2342. issn: 10473211. doi: 10.1093/cercor/bhi016 (cited on pages 11, 29, 31). [37] Aviv Mezer et al. “Cluster analysis of resting-state fMRI time series”. In: NeuroImage 45.4 (May 2009), pages 1117–1125. issn: 10538119. doi: 10. 1016/j.neuroimage.2008.12.015. url: http://linkinghub.elsevier. com/retrieve/pii/S1053811908012706 (cited on pages 11, 27, 28). [38] Helmut Laufs et al. “EEG-correlated fMRI of human alpha activity.” In: NeuroImage 19 4 (2003), pages 1463–76 (cited on page 11). [39] Jessica S. Damoiseaux and Michael D. Greicius. “Greater than the sum of its parts: a review of studies combining structural connectivity and resting-state functional connectivity”. In: Brain Structure and Function 213.6 (Oct. 2009), 213 pages 525–533. issn: 1863-2661. doi: 10.1007/s00429-009-0208-6. url: https://doi.org/10.1007/s00429-009-0208-6 (cited on page 11). [40] Y. Nir et al. 
“Interhemispheric correlations of slow spontaneous neuronal fluctuations revealed in human sensory cortex”. In: Nat. Neurosci. 11.9 (Sept. 2008), pages 1100–1108 (cited on page 11). [41] C. Chang and G. H. Glover. “Time-frequency dynamics of resting-state brain connectivity measured with fMRI”. In: Neuroimage 50.1 (Mar. 2010), pages 81–98 (cited on page 12). [42] Elena A. Allen et al. “Tracking whole-brain connectivity dynamics in the resting state”. In: Cerebral Cortex 24.3 (2014), pages 663–676. issn: 10473211. doi: 10.1093/cercor/bhs352 (cited on pages 12, 13, 22, 35–38). [43] Diego Vidaurre, Stephen M. Smith, and Mark W. Woolrich. “Brain net- work dynamics are hierarchically organized in time”. In: Proceedings of the National Academy of Sciences 114.48 (2017), page 201705120. issn: 0027- 8424. doi: 10.1073/pnas.1705120114. arXiv: arXiv:1408.1149. url: http://www.pnas.org/lookup/doi/10.1073/pnas.1705120114 (cited on pages 13, 22, 35–38). [44] H. Eavani et al. “Unsupervised learning of functional network dynamics in resting state fMRI”. In: Inf Process Med Imaging 23 (2013), pages 426–437 (cited on pages 13, 22, 36, 38). [45] J. M. Reinen et al. “The human cortex possesses a reconfigurable dynamic network architecture that is disrupted in psychosis”. In: Nat Commun 9.1 (Mar. 2018), page 1157 (cited on page 13). [46] Emily S. Finn et al. “Functional connectome fingerprinting: Identifying individuals using patterns of brain connectivity”. In: Nature Neuroscience 214 18.11 (2015), pages 1664–1671. issn: 15461726. doi: 10.1038/nn.4135. arXiv: 15334406. url: http://dx.doi.org/10.1038/nn.4135 (cited on pages 13, 55, 59). [47] N. Tzourio-Mazoyer et al. “Automated anatomical labeling of activations in SPM using a macroscopic anatomical parcellation of the MNI MRI single- subject brain”. In: Neuroimage 15.1 (Jan. 2002), pages 273–289 (cited on pages 22, 23). [48] Nora Leonardi et al. 
“Principal components of functional connectivity: A new approach to study dynamic brain connectivity during rest”. In: NeuroImage 83 (2013), pages 937–950. issn: 10538119. doi: 10.1016/j.neuroimage. 2013.07.019. url: http://dx.doi.org/10.1016/j.neuroimage.2013. 07.019 (cited on pages 22, 37, 38). [49] Nora Leonardi et al. “Disentangling dynamic networks: Separated and joint expressions of functional connectivity patterns in time”. In: Human Brain Mapping 35.12 (2014), pages 5984–5995. issn: 10970193. doi: 10.1002/hbm. 22599 (cited on pages 22, 37). [50] R. Cameron Craddock et al. “A whole brain fMRI atlas generated via spatially constrained spectral clustering”. In: Human Brain Mapping 33.8 (2012), pages 1914–1928. issn: 10659471. doi: 10.1002/hbm.21333. arXiv: NIHMS150003 (cited on pages 23, 30, 31). [51] V. D. Calhoun et al. “A method for making group inferences from functional MRI data using independent component analysis”. In: Hum Brain Mapp 14.3 (Nov. 2001), pages 140–151 (cited on pages 25, 33). 215 [52] Beckmann CF et al. “Group Comparison of Resting-State FMRI Data Using Multi-Subject ICA and Dual Regression”. In: Neuroimage 47 (July 2009). doi: 10.1016/S1053-8119(09)71511-3 (cited on pages 25, 31, 33). [53] Yuhui Du and Yong Fan. “Group information guided ICA for fMRI data analysis”. In: NeuroImage 69 (2013), pages 157–197. doi: 10 . 1016 / j . neuroimage . 2012 . 11 . 008. url: http : / / dx . doi . org / 10 . 1016 / j . neuroimage.2012.11.008 (cited on pages 25, 31). [54] G. Varoquaux et al. “A group model for stable multi-subject ICA on fMRI datasets”. In: NeuroImage 51.1 (2010), pages 288–299. issn: 10538119. doi: 10.1016/j.neuroimage.2010.02.010. arXiv: 1006.2300. url: http: //dx.doi.org/10.1016/j.neuroimage.2010.02.010 (cited on pages 25, 31). [55] I. Daubechies et al. “Independent component analysis for brain fMRI does not select for independence”. In: Proceedings of the National Academy of Sciences 106.26 (2009), pages 10415–10422. issn: 0027-8424. 
doi: 10.1073/ pnas.0903525106. eprint: http://www.pnas.org/content/106/26/ 10415.full.pdf. url: http://www.pnas.org/content/106/26/10415 (cited on page 26). [56] G. Varoquaux et al. “Multi-subject dictionary learning to segment an atlas of brain spontaneous activity”. In: Inf Process Med Imaging 22 (2011), pages 562–573 (cited on pages 27, 31, 33). [57] A. Abraham et al. “Extracting brain regions from rest fMRI with total- variation constrained dictionary learning”. In: Med Image Comput Comput Assist Interv 16.Pt 2 (2013), pages 607–615 (cited on pages 27, 35). 216 [58] J. Lv et al. “Holistic atlases of functional networks and interactions reveal reciprocal organizational architecture of cortical function”. In: IEEE Trans Biomed Eng 62.4 (Apr. 2015), pages 1120–1131 (cited on page 27). [59] Yulia Golland et al. “Data-driven clustering reveals a fundamental subdivi- sion of the human cortex into two global systems”. In: Neuropsychologia 46.2 (2008), pages 540–553. issn: 00283932. doi: 10.1016/j.neuropsychologia. 2007.10.003 (cited on pages 27, 28). [60] M. H. Lee et al. “Clustering of resting state networks”. In: PLoS ONE 7.7 (2012), e40370 (cited on pages 27, 28). [61] J. H. Kim et al. “Defining functional SMA and pre-SMA subregions in human MFC using resting state fMRI: functional connectivity-based parcellation method”. In: Neuroimage 49.3 (Feb. 2010), pages 2375–2386 (cited on pages 27, 28). [62] Polina Golland, Yulia Golland, and Rafael Malach. “Detection of Spatial Ac- tivation Patterns as Unsupervised Segmentation of fMRI Data”. In: MICCAI 10 Pt 1 (2007), pages 110–8 (cited on pages 28, 29). [63] B. T. Thomas Yeo et al. “The organization of the human cerebral cor- tex estimated by intrinsic functional connectivity”. In: Journal of Neu- rophysiology 106.3 (Sept. 2011), pages 1125–1165. issn: 0022-3077. doi: 10.1152/jn.00338.2011. url: http://www.physiology.org/doi/10. 1152/jn.00338.2011 (cited on pages 28, 31). [64] Dietmar Cordes et al. 
“Hierarchical clustering to measure connectivity in fMRI resting-state data”. In: Magnetic Resonance Imaging 20.4 (2002), pages 305–317. issn: 0730725X. doi: 10.1016/S0730-725X(02)00503-9 (cited on pages 29, 31). 217 [65] T. Blumensath et al. “Spatially constrained hierarchical parcellation of the brain with resting-state fMRI”. In: Neuroimage 76 (Aug. 2013), pages 313– 324 (cited on pages 29, 34). [66] A. Abraham et al. “Deriving reproducible biomarkers from multi-site resting- state data: An Autism-based example”. In: Neuroimage 147 (Feb. 2017), pages 736–745 (cited on pages 29, 43, 55, 59, 63). [67] B. Thirion et al. “Which fMRI clustering gives good brain parcellations?” In: Front Neurosci 8 (2014), page 167 (cited on page 29). [68] Yanlu Wang and Tie-Qiang Li. “Analysis of Whole-Brain Resting-State fMRI Data Using Hierarchical Clustering Approach”. In: PLOS ONE 8.10 (Oct. 2013), pages 1–9. doi: 10.1371/journal.pone.0076315. url: https: //doi.org/10.1371/journal.pone.0076315 (cited on page 29). [69] Martijn van den Heuvel, Rene Mandl, and Hilleke Hulshoff Pol. “Normalized cut group clustering of resting-state fMRI data”. In: PLoS ONE 3.4 (2008). issn: 19326203. doi: 10.1371/journal.pone.0002001 (cited on page 30). [70] X. Shen et al. “Groupwise whole-brain parcellation from resting-state fMRI data for network node identification”. In: NeuroImage 82 (2013), pages 403– 415. issn: 10538119. doi: 10.1016/j.neuroimage.2013.05.081. arXiv: NIHMS150003. url: http://dx.doi.org/10.1016/j.neuroimage.2013. 05.081 (cited on pages 30, 31). [71] N. Honnorat et al. “GraSP: Geodesic Graph-based Segmentation with Shape Priors for the functional parcellation of the cortex”. In: NeuroImage 106 (2015), pages 207–221. issn: 10959572. doi: 10.1016/j.neuroimage.2014. 11.008. arXiv: NIHMS150003. url: http://dx.doi.org/10.1016/j. neuroimage.2014.11.008 (cited on page 30). 218 [72] M. Maier, U. von Luxburg, and M. Hein. 
“How the result of graph clustering methods depends on the construction of the graph”. In: ArXiv e-prints (Feb. 2011). arXiv: 1102.2075 [stat.ML] (cited on page 30). [73] Evan M. Gordon et al. “Generation and Evaluation of a Cortical Area Parcellation from Resting-State Correlations”. In: Cerebral Cortex 26.1 (2016), pages 288–303. issn: 14602199. doi: 10.1093/cercor/bhu239 (cited on page 32). [74] Matthew F Glasser et al. “A multi-modal parcellation of human cerebral cor- tex”. In: Nature Publishing Group 536 (2016). doi: 10.1038/nature18933. url: http://balsa.wustl.edu/WN56. (cited on page 32). [75] Alexander Schaefer et al. “Local-Global Parcellation of the Human Cerebral Cortex from Intrinsic Functional Connectivity MRI”. In: Cerebral Cortex (2017), pages 1–20. issn: 1047-3211. doi: 10.1093/cercor/bhx179. url: https://academic.oup.com/cercor/article-lookup/doi/10.1093/ cercor/bhx179 (cited on page 32). [76] R. Kong et al. “Spatial Topography of Individual-Specific Cortical Networks Predicts Human Cognition, Personality, and Emotion”. In: Cereb. Cortex (June 2018) (cited on page 33). [77] Mehraveh Salehi et al. “An exemplar-based approach to individualized parcel- lation reveals the need for sex specific functional networks”. In: NeuroImage 170 (2018), pages 54–67. issn: 10959572. doi: 10.1016/j.neuroimage.2017. 08.068. url: http://dx.doi.org/10.1016/j.neuroimage.2017.08.068 (cited on page 33). [78] Jon Kleinberg. “An Impossibility Theorem for Clustering”. In: Proceedings of the 15th International Conference on Neural Information Processing 219 Systems. NIPS’02. Cambridge, MA, USA: MIT Press, 2002, pages 463–470. url: http://dl.acm.org/citation.cfm?id=2968618.2968676 (cited on page 33). [79] S. Arslan et al. “Human brain mapping: A systematic comparison of parcel- lation methods for the human cerebral cortex”. In: Neuroimage 170 (Apr. 2018), pages 5–30 (cited on page 34). [80] Mehraveh Salehi et al. 
“There is no single functional atlas even for a sin- gle individual: Parcellation of the human brain is state dependent”. In: bioRxiv (2018). doi: 10.1101/431833. eprint: https://www.biorxiv. org / content / early / 2018 / 10 / 01 / 431833 . full . pdf. url: https : //www.biorxiv.org/content/early/2018/10/01/431833 (cited on page 34). [81] E. Damaraju et al. “Dynamic functional connectivity analysis reveals tran- sient states of dysconnectivity in schizophrenia”. In: Neuroimage Clin 5 (2014), pages 298–308 (cited on pages 35, 36, 38). [82] B. Rashid et al. “Dynamic connectivity states estimated from resting fMRI Identify differences among Schizophrenia, bipolar disorder, and healthy control subjects”. In: Front Hum Neurosci 8 (2014), page 897 (cited on pages 35, 36). [83] A. D. Barber et al. “Dynamic Functional Connectivity States Reflecting Psychotic-like Experiences”. In: Biol Psychiatry Cogn Neurosci Neuroimaging 3.5 (May 2018), pages 443–453 (cited on page 36). [84] A. Abrol et al. “Replicability of time-varying connectivity patterns in large resting state fMRI samples”. In: Neuroimage 163 (Dec. 2017), pages 160–176 (cited on page 36). 220 [85] C. Wang et al. “Spontaneous eyelid closures link vigilance fluctuation with fMRI dynamic connectivity states”. In: Proc. Natl. Acad. Sci. U.S.A. 113.34 (Aug. 2016), pages 9653–9658 (cited on page 36). [86] H. I. Suk et al. “State-space model with deep learning for functional dy- namics estimation in resting-state fMRI”. In: Neuroimage 129 (Apr. 2016), pages 292–307 (cited on pages 36, 37, 39). [87] Lucy R. Chai et al. “Evolution of brain network dynamics in neurodevelop- ment”. In: Network Neuroscience 1.1 (2017), pages 14–30 (cited on pages 37, 38). [88] X. Li et al. “Dynamic functional connectomics signatures for characterization and differentiation of PTSD patients”. In: Hum Brain Mapp 35.4 (Apr. 2014), pages 1761–1778 (cited on pages 37, 38). [89] V. D. Calhoun et al. 
“Exploring the psychosis functional connectome: aber- rant intrinsic networks in schizophrenia and bipolar disorder”. In: Front Psychiatry 2 (2011), page 75 (cited on page 39). [90] E. Amico and J. Goni. “The quest for identifiability in human functional connectomes”. In: Sci Rep 8.1 (May 2018), page 8254 (cited on page 39). [91] H. Eavani et al. “Identifying Sparse Connectivity Patterns in the brain using resting-state fMRI”. In: Neuroimage 105 (Jan. 2015), pages 286–299 (cited on pages 39, 41). [92] H. Eavani et al. “Discriminative sparse connectivity patterns for classification of fMRI Data”. In: Med Image Comput Comput Assist Interv 17.Pt 3 (2014), pages 193–200 (cited on page 39). 221 [93] Anqi Qiu et al. “Manifold learning on brain functional networks in aging”. In: Medical Image Analysis 20.1 (2015), pages 52–60. issn: 13618423. doi: 10.1016/j.media.2014.10.006. url: http://dx.doi.org/10.1016/j. media.2014.10.006 (cited on page 39). [94] H. Shen et al. “Discriminative analysis of resting-state functional connectivity patterns of schizophrenia using low dimensional embedding of fMRI”. In: Neuroimage 49.4 (Feb. 2010), pages 3110–3121 (cited on pages 39, 41). [95] X. Guo et al. “Diagnosing Autism Spectrum Disorder from Brain Resting- State Functional Connectivity Patterns Using a Deep Neural Network with a Novel Feature Selection Method”. In: Front Neurosci 11 (2017), page 460 (cited on page 39). [96] Dumitru Erhan et al. “Why Does Unsupervised Pre-training Help Deep Learning?” In: J. Mach. Learn. Res. 11 (Mar. 2010), pages 625–660. issn: 1532-4435. url: http : / / dl . acm . org / citation . cfm ? id = 1756006 . 1756025 (cited on page 39). [97] A. S. Heinsfeld et al. “Identification of autism spectrum disorder using deep learning and the ABIDE dataset”. In: Neuroimage Clin 17 (2018), pages 16–23 (cited on pages 39, 41, 50). [98] J. Kim et al. 
“Deep neural network with weight sparsity control and pre- training extracts hierarchical features and enhances classification perfor- mance: Evidence from whole-brain resting-state functional connectivity patterns of schizophrenia”. In: Neuroimage 124.Pt A (Jan. 2016), pages 127– 146 (cited on pages 39, 41, 50, 55). [99] Linli Xu et al. “Maximum Margin Clustering”. In: Advances in Neural Information Processing Systems 17. Edited by L. K. Saul, Y. Weiss, and 222 L. Bottou. MIT Press, 2005, pages 1537–1544. url: http://papers.nips. cc/paper/2602-maximum-margin-clustering.pdf (cited on page 40). [100] Ling Li Zeng et al. “Unsupervised classification of major depression us- ing functional connectivity MRI”. In: Human Brain Mapping 35.4 (2014), pages 1630–1641. issn: 10970193. doi: 10 . 1002 / hbm . 22278 (cited on pages 40, 41). [101] A. T. Drysdale et al. “Resting-state connectivity biomarkers define neu- rophysiological subtypes of depression”. In: Nat. Med. 23.1 (Jan. 2017), pages 28–38 (cited on pages 40, 41). [102] Kamalaker Dadi et al. “Benchmarking functional connectome-based predic- tive models for resting-state fMRI”. working paper or preprint. June 2018. url: https://hal.inria.fr/hal-01824205 (cited on pages 43, 72, 99). [103] Gaël Varoquaux et al. “Brain Covariance Selection: Better Individual Func- tional Connectivity Models Using Population Prior”. In: Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 2. NIPS’10. Vancouver, British Columbia, Canada: Curran Asso- ciates Inc., 2010, pages 2334–2342. url: http://dl.acm.org/citation. cfm?id=2997046.2997156 (cited on page 43). [104] S. M. Smith et al. “Network modelling methods for FMRI”. In: Neuroimage 54.2 (Jan. 2011), pages 875–891 (cited on pages 43, 72). [105] G. Varoquaux et al. “Detection of brain functional-connectivity difference in post-stroke patients using group-level covariance modeling”. 
In: Med Image Comput Comput Assist Interv 13.Pt 1 (2010), pages 200–208 (cited on page 43). 223 [106] J. Richiardi et al. “Decoding brain states from fMRI connectivity graphs”. In: Neuroimage 56.2 (May 2011), pages 616–626 (cited on page 43). [107] A. Khazaee, A. Ebrahimzadeh, and A. Babajani-Feremi. “Identifying patients with Alzheimer’s disease using resting-state fMRI and graph theory”. In: Clin Neurophysiol 126.11 (Nov. 2015), pages 2132–2141 (cited on pages 44, 54). [108] A. Lord et al. “Changes in community structure of resting state functional connectivity in unipolar depression”. In: PLoS ONE 7.8 (2012), e41282 (cited on pages 44, 55). [109] C. Z. Zhu et al. “Discriminative analysis of brain function at resting-state for attention-deficit/hyperactivity disorder”. In: Med Image Comput Comput Assist Interv 8.Pt 2 (2005), pages 468–475 (cited on page 44). [110] M. Mennes et al. “Linking inter-individual differences in neural activation and behavior to intrinsic brain dynamics”. In: Neuroimage 54.4 (Feb. 2011), pages 2950–2959 (cited on page 44). [111] T. Price et al. “Multiple-network classification of childhood autism using functional connectivity dynamics”. In: Med Image Comput Comput Assist Interv 17.Pt 3 (2014), pages 177–184 (cited on pages 44, 55). [112] T. M. Madhyastha et al. “Dynamic connectivity at rest predicts attention task performance”. In: Brain Connect 5.1 (Feb. 2015), pages 45–59 (cited on page 44). [113] Jiliang Tang, Salem Alelyani, and Huan Liu. “Feature Selection for Classi- fication: A Review”. In: Data Classification: Algorithms and Applications. 2014 (cited on page 45). 224 [114] Francisco Jairo Soares Pereira, Tom Michael Mitchell, and Matthew Botvinick. “Machine learning classifiers and fMRI: A tutorial overview”. In: NeuroImage 45 (2009), s199–s209 (cited on page 45). [115] Vladimir Vapnik and Olivier Chapelle. “Bounds on Error Expectation for Support Vector Machines”. 
In: Neural Computation 12 (2000), pages 2013– 2036 (cited on page 45). [116] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. “Multilayer feedfor- ward networks are universal approximators”. In: Neural Networks 2.5 (1989), pages 359–366. issn: 0893-6080. doi: https://doi.org/10.1016/0893- 6080(89)90020- 8. url: http://www.sciencedirect.com/science/ article/pii/0893608089900208 (cited on page 49). [117] M. Khosla et al. “Ensemble learning with 3D convolutional neural networks for connectome-based prediction”. In: ArXiv e-prints (Sept. 2018). arXiv: 1809.06219 [cs.CV] (cited on pages 50, 55). [118] Joel Hestness et al. “Deep Learning Scaling is Predictable, Empirically”. In: CoRR abs/1712.00409 (2017) (cited on page 52). [119] Sayan Mukherjee et al. “Estimating Dataset Size Requirements for Classify- ing DNA Microarray Data”. In: Journal of computational biology : a journal of computational molecular cell biology 10 2 (2003), pages 119–42 (cited on page 52). [120] N. U. Dosenbach et al. “Prediction of individual brain maturity using fMRI”. In: Science 329.5997 (Sept. 2010), pages 1358–1361 (cited on pages 54, 59). [121] L. Wang et al. “Decoding lifespan changes of the human brain using resting- state functional connectivity MRI”. In: PLoS ONE 7.8 (2012), e44530 (cited on page 54). 225 [122] T. B. Meier et al. “Support vector machine classification and characterization of age-related reorganization of functional brain networks”. In: Neuroimage 60.1 (Mar. 2012), pages 601–613 (cited on page 54). [123] G. Ball et al. “Machine-learning to characterise neonatal functional connec- tivity in the preterm brain”. In: Neuroimage 124.Pt A (Jan. 2016), pages 267– 275 (cited on page 54). [124] E. Challis et al. “Gaussian process classification of Alzheimer’s disease and mild cognitive impairment from resting-state fMRI”. In: Neuroimage 112 (May 2015), pages 232–243 (cited on page 54). [125] C. Y. Wee et al. 
“Group-constrained sparse fMRI connectivity modeling for mild cognitive impairment identification”. In: Brain Struct Funct 219.2 (Mar. 2014), pages 641–656 (cited on page 54). [126] X. Chen et al. “High-order resting-state functional connectivity network for MCI classification”. In: Hum Brain Mapp 37.9 (Sept. 2016), pages 3282–3296 (cited on page 54). [127] B. Jie et al. “Topological graph kernel on multiple thresholded functional connectivity networks for mild cognitive impairment classification”. In: Hum Brain Mapp 35.7 (July 2014), pages 2876–2897 (cited on page 54). [128] C. Y. Wee et al. “Sparse temporally dynamic resting-state functional con- nectivity networks for early MCI identification”. In: Brain Imaging Behav 10.2 (June 2016), pages 342–356 (cited on page 54). [129] D. Long et al. “Automatic classification of early Parkinson’s disease with multi-modal MR imaging”. In: PLoS ONE 7.11 (2012), e47714 (cited on page 54). 226 [130] R. C. Welsh, L. M. Jelsone-Swain, and B. R. Foerster. “The utility of independent component analysis and machine learning in the identification of the amyotrophic lateral sclerosis diseased brain”. In: Front Hum Neurosci 7 (2013), page 251 (cited on page 54). [131] A. Venkataraman et al. “Whole brain resting state functional connectivity abnormalities in schizophrenia”. In: Schizophr. Res. 139.1-3 (Aug. 2012), pages 7–12 (cited on page 55). [132] D. S. Bassett et al. “Altered resting state complexity in schizophrenia”. In: Neuroimage 59.3 (Feb. 2012), pages 2196–2207 (cited on pages 55, 59). [133] Y. Fan et al. “Discriminant analysis of functional connectivity patterns on Grassmann manifold”. In: Neuroimage 56.4 (June 2011), pages 2058–2067 (cited on page 55). [134] L. L. Zeng et al. “Identifying major depression using whole-brain functional connectivity: a multivariate pattern analysis”. In: Brain 135.Pt 5 (May 2012), pages 1498–1507 (cited on page 55). [135] A. Eloyan et al. 
“Automated diagnoses of attention deficit hyperactive disorder using magnetic resonance imaging”. In: Front Syst Neurosci 6 (2012), page 61 (cited on page 55). [136] D. A. Fair et al. “Distinct neural signatures detected for ADHD subtypes after controlling for micro-movements in resting state functional connectivity MRI data”. In: Front Syst Neurosci 6 (2012), page 80 (cited on page 55). [137] F. Liu et al. “Multivariate classification of social anxiety disorder using whole brain functional connectivity”. In: Brain Struct Funct 220.1 (Jan. 2015), pages 101–115 (cited on page 55). 227 [138] Q. Gong et al. “Quantitative prediction of individual psychopathology in trauma survivors using resting-state FMRI”. In: Neuropsychopharmacology 39.3 (Feb. 2014), pages 681–687 (cited on page 55). [139] B. J. Harrison et al. “Altered corticostriatal functional connectivity in obsessive-compulsive disorder”. In: Arch. Gen. Psychiatry 66.11 (Nov. 2009), pages 1189–1200 (cited on page 55). [140] S. Mueller et al. “Individual variability in functional connectivity architecture of the human brain”. In: Neuron 77.3 (Feb. 2013), pages 586–595 (cited on page 55). [141] M. D. Rosenberg et al. “A neuromarker of sustained attention from whole- brain functional connectivity”. In: Nat. Neurosci. 19.1 (Jan. 2016), pages 165– 171 (cited on page 55). [142] J. S. Siegel et al. “Disruptions of network connectivity predict impairment in multiple behavioral domains after stroke”. In: Proc. Natl. Acad. Sci. U.S.A. 113.30 (July 2016), E4367–4376 (cited on pages 55, 56, 59). [143] D. E. Meskaldji et al. “Prediction of long-term memory scores in MCI based on resting-state fMRI”. In: Neuroimage Clin 12 (2016), pages 785–795 (cited on pages 55, 56). [144] D. C. Jangraw et al. “A functional connectivity-based neuromarker of sus- tained attention generalizes to predict recall in a reading task”. In: Neu- roimage 166 (Feb. 2018), pages 99–109 (cited on page 55). [145] W. T. Hsu et al. 
“Resting-state functional connectivity predicts neuroticism and extraversion in novel individuals”. In: Soc Cogn Affect Neurosci 13.2 (Feb. 2018), pages 224–232 (cited on page 56). 228 [146] A. D. Nostro et al. “Predicting personality from network-based resting- state functional connectivity”. In: Brain Struct Funct 223.6 (July 2018), pages 2699–2719 (cited on page 56). [147] E. Tagliazucchi et al. “Automatic sleep staging using fMRI functional con- nectivity data”. In: Neuroimage 63.1 (Oct. 2012), pages 63–72 (cited on pages 56, 59). [148] E. Tagliazucchi and H. Laufs. “Decoding wakefulness levels from typical fMRI resting-state data reveals reliable drifts between wakefulness and sleep”. In: Neuron 82.3 (May 2014), pages 695–708 (cited on pages 56, 59). [149] X. J. Dai et al. “Long-term total sleep deprivation decreases the default spontaneous activity and connectivity pattern in healthy male subjects: a resting-state fMRI study”. In: Neuropsychiatr Dis Treat 11 (2015), pages 761– 772 (cited on page 57). [150] Y. Zhu et al. “Increased interhemispheric resting-state functional connectivity after sleep deprivation: a resting-state fMRI study”. In: Brain Imaging Behav 10.3 (Sept. 2016), pages 911–919 (cited on page 57). [151] B. T. Yeo, J. Tandi, and M. W. Chee. “Functional connectivity during rested wakefulness predicts vulnerability to sleep deprivation”. In: Neuroimage 111 (May 2015), pages 147–158 (cited on page 57). [152] T. Ge et al. “Heritability analysis with repeat measurements and its appli- cation to resting-state functional connectivity”. In: Proc. Natl. Acad. Sci. U.S.A. 114.21 (May 2017), pages 5521–5526 (cited on page 57). [153] O. Miranda-Dominguez et al. “Heritability of the human connectome: A connectotyping study”. In: Netw Neurosci 2.2 (2018), pages 175–199 (cited on pages 57, 59). 229 [154] I. Tavor et al. “Task-free MRI predicts individual differences in brain activity during task performance”. In: Science 352.6282 (Apr. 
2016), pages 216–220 (cited on pages 58, 59). [155] O. Parker Jones et al. “Resting connectivity predicts task activation in pre-surgical populations”. In: Neuroimage Clin 13 (2017), pages 378–385 (cited on page 58). [156] F. Abdelnour, H. U. Voss, and A. Raj. “Network diffusion accurately mod- els the relationship between structural and functional brain connectivity networks”. In: Neuroimage 90 (Apr. 2014), pages 335–347 (cited on page 58). [157] F. Deligianni et al. “A probabilistic framework to infer brain functional connectivity from anatomical connections”. In: Inf Process Med Imaging 22 (2011), pages 296–307 (cited on page 58). [158] A. Venkataraman et al. “Joint modeling of anatomical and functional con- nectivity for population studies”. In: IEEE Trans Med Imaging 31.2 (Feb. 2012), pages 164–182 (cited on page 58). [159] C. Dansereau et al. “Statistical power and prediction accuracy in multisite resting-state fMRI connectivity”. In: Neuroimage 149 (Apr. 2017), pages 220– 232 (cited on page 63). [160] K. R. Van Dijk, M. R. Sabuncu, and R. L. Buckner. “The influence of head motion on intrinsic functional connectivity MRI”. In: Neuroimage 59.1 (Jan. 2012), pages 431–438 (cited on page 64). [161] G. Varoquaux. “Cross-validation failure: Small sample sizes lead to large error bars”. In: Neuroimage 180.Pt A (Oct. 2018), pages 68–77 (cited on page 64). 230 [162] T. Wolfers et al. “From estimating activation locality to predicting disorder: A review of pattern recognition for neuroimaging-based psychiatric diagnos- tics”. In: Neurosci Biobehav Rev 57 (Oct. 2015), pages 328–349 (cited on page 64). [163] Vince D. Calhoun and Tülay Adali. “Multisubject Independent Component Analysis of fMRI: A Decade of Intrinsic Networks, Default Mode, and Neurodiagnostic Discovery”. In: IEEE Reviews in Biomedical Engineering 5 (2012), pages 60–73 (cited on page 66). [164] Simon B. Eickhoff, B. T. Thomas Yeo, and Sarah Genon. “Imaging-based parcellations of the human brain”. 
In: Nature Reviews Neuroscience 19 (2018), pages 672–686 (cited on page 66). [165] R. Matthew Hutchison et al. “Dynamic functional connectivity: Promise, issues, and interpretations”. In: NeuroImage 80 (2013), pages 360–378 (cited on page 66). [166] Vince D. Calhoun et al. “The Chronnectome: Time-Varying Connectivity Networks as the Next Frontier in fMRI Data Discovery”. In: Neuron 84 (2014), pages 262–274 (cited on page 66). [167] Maria Giulia Preti, Thomas A. W. Bolton, and Dimitri Van De Ville. “The dynamic functional connectome: State-of-the-art and perspectives”. In: NeuroImage 160 (2017), pages 41–54 (cited on page 66). [168] Daniel J. Lurie et al. “On the Nature of Resting Fmri and Time-varying Functional Connectivity.” In: PsyArXiv (2018). doi: 10.31234/osf.io/ xtzre (cited on page 66). 231 [169] Michael D. Fox and Michael D. Greicius. “Clinical Applications of Resting State Functional Connectivity”. In: Front. Syst. Neurosci. 2010 (cited on page 66). [170] Mohammad Arbabshirani et al. “Single subject prediction of brain disor- ders in neuroimaging: Promises and pitfalls”. In: NeuroImage 145 (2017), pages 137–165 (cited on page 66). [171] M. Plitt, K. A. Barnes, and A. Martin. “Functional connectivity classifica- tion of autism identifies highly predictive brain features but falls short of biomarker standards”. In: Neuroimage Clin 7 (2015), pages 359–366 (cited on pages 72, 86). [172] M. Mennes et al. “Resting state functional connectivity correlates of in- hibitory control in children with attention-deficit/hyperactivity disorder”. In: Front Psychiatry 2 (2011), page 83 (cited on page 72). [173] G. Varoquaux et al. “Detection of brain functional-connectivity difference in post-stroke patients using group-level covariance modeling”. In: Med Image Comput Comput Assist Interv 13.Pt 1 (2010), pages 200–208 (cited on page 72). [174] Colin J. Brown and Ghassan Hamarneh. “Machine Learning on Human Connectome Data from MRI”. In: CoRR 1611.08699 (2016). 
arXiv: 1611. 08699. url: http://arxiv.org/abs/1611.08699 (cited on page 72). [175] M. Kaiser. “A Tutorial in Connectome Analysis: Topological and Spatial Features of Brain Networks”. In: ArXiv e-prints (May 2011). arXiv: 1105. 4705 [q-bio.NC] (cited on page 72). 232 [176] Aaron Alexander-Bloch et al. “The anatomical distance of functional con- nections predicts brain network topology in health and schizophrenia.” In: Cerebral cortex 23 1 (2013), pages 127–38 (cited on page 72). [177] Zhijun Yao et al. “A review of structural and functional brain networks: small world and atlas”. In: Brain Informatics 2.1 (Mar. 2015), pages 45– 52. issn: 2198-4018. doi: 10 . 1007 / s40708 - 015 - 0009 - z. url: http : //link.springer.com/10.1007/s40708-015-0009-z (cited on page 72). [178] Bruce Fischl et al. “Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain”. In: Neuron 33.3 (2002), pages 341–355 (cited on page 72). [179] Matthew F Glasser et al. “A multi-modal parcellation of human cerebral cortex”. In: Nature 536.7615 (2016), pages 171–178 (cited on page 72). [180] Simon B Eickhoff et al. “Connectivity-based parcellation: Critique and implications”. In: Human brain mapping 36.12 (2015), pages 4771–4792 (cited on page 72). [181] Salim Arslan et al. “Human brain mapping: A systematic comparison of parcellation methods for the human cerebral cortex”. In: NeuroImage 170 (2018). Segmenting the Brain, pages 5–30. issn: 1053-8119. doi: https: //doi.org/10.1016/j.neuroimage.2017.04.014. url: http://www. sciencedirect.com/science/article/pii/S1053811917303026 (cited on pages 72, 100). [182] Paul A Yushkevich et al. “Quantitative comparison of 21 protocols for labeling hippocampal subfields and parahippocampal subregions in in vivo MRI: towards a harmonized segmentation protocol”. In: Neuroimage 111 (2015), pages 526–541 (cited on page 72). 233 [183] Gaël Varoquaux et al. 
“Multi-subject dictionary learning to segment an atlas of brain spontaneous activity”. In: Biennial International Conference on Information Processing in Medical Imaging. Springer. 2011, pages 562–573 (cited on page 72). [184] B. T. Thomas Yeo et al. “The organization of the human cerebral cortex estimated by intrinsic functional connectivity”. In: Journal of Neurophysi- ology 106.3 (2011). PMID: 21653723, pages 1125–1165. doi: 10.1152/jn. 00338.2011. eprint: https://doi.org/10.1152/jn.00338.2011. url: https://doi.org/10.1152/jn.00338.2011 (cited on pages 72, 78, 101). [185] Bertrand Thirion et al. “Which fMRI clustering gives good brain parcella- tions?” In: Frontiers in neuroscience 8 (2014), page 167 (cited on page 72). [186] Alexandre Abraham et al. “Deriving reproducible biomarkers from multi- site resting-state data: An Autism-based example”. In: NeuroImage 147 (2017), pages 736–745. issn: 1053-8119. doi: https://doi.org/10.1016/ j.neuroimage.2016.10.045. url: http://www.sciencedirect.com/ science/article/pii/S1053811916305924 (cited on pages 73, 74, 86, 95, 96, 99). [187] Sofia Ira Ktena et al. “Metric learning with spectral graph convolutions on brain connectivity networks”. In: NeuroImage 169 (2018), pages 431–442. issn: 1053-8119. doi: https://doi.org/10.1016/j.neuroimage.2017. 12.052. url: http://www.sciencedirect.com/science/article/pii/ S1053811917310765 (cited on page 74). [188] Jeremy Kawahara et al. “BrainNetCNN: Convolutional Neural Networks for Brain Networks; Towards Predicting Neurodevelopment”. In: 146 (Sept. 2016) (cited on pages 74, 83, 84). 234 [189] Vladimir L. Cherkassky et al. “Functional connectivity in a baseline resting- state network in autism.” In: Neuroreport 17 16 (2006), pages 1687–90 (cited on page 74). [190] Michal Assaf et al. “Abnormal functional connectivity of default mode sub-networks in autism spectrum disorder patients”. In: 53 (Oct. 2010), pages 247–56 (cited on page 74). [191] Christopher S. Monk et al. 
“Abnormalities of intrinsic functional connectivity in autism spectrum disorders,” in: NeuroImage 47.2 (2009), pages 764–772. issn: 1053-8119. doi: https://doi.org/10.1016/j.neuroimage.2009. 04.069. url: http://www.sciencedirect.com/science/article/pii/ S1053811909004327 (cited on page 74). [192] Anibal Sólon Heinsfeld et al. “Identification of autism spectrum disorder using deep learning and the ABIDE dataset”. In: NeuroImage: Clinical. 2018 (cited on pages 74, 87, 260). [193] N Yahata et al. “A small number of abnormal brain connections predicts adult autism spectrum disorder”. In: Nature Communications 7 (2016). doi: http://dx.doi.org/10.1038/ncomms11254 (cited on page 74). [194] Adriana Di Martino et al. “Enhancing studies of the connectome in autism using the autism brain imaging data exchange II”. In: Scientific data 4 (Mar. 2017), page 170010. issn: 2052-4463. doi: 10.1038/sdata.2017.10. url: http://europepmc.org/articles/PMC5349246 (cited on pages 74, 75). [195] Cameron Craddock et al. “The Neuro Bureau Preprocessing Initiative: open sharing of preprocessed neuroimaging data and derivative”. In: Frontiers in Neuroinformatics 41 (2013). issn: 1662-5196. doi: 10.3389/conf.fninf. 235 2013.09.00041. url: http://www.frontiersin.org/neuroinformatics/ 10.3389/conf.fninf.2013.09.00041/full (cited on page 75). [196] Jonathan D. Power et al. “Methods to detect, characterize, and remove motion artifact in resting state fMRI”. In: NeuroImage 84 (Jan. 2014), pages 320–341. issn: 10538119. doi: 10 . 1016 / j . neuroimage . 2013 . 08 . 048. url: http : / / www . ncbi . nlm . nih . gov / pubmed / 23994314 % 20http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid= PMC3849338 % 20http : / / linkinghub . elsevier . com / retrieve / pii / S1053811913009117 (cited on pages 76, 98). [197] John Muschelli et al. “Reduction of motion-related artifacts in resting state fMRI using aCompCor”. In: 96 (Mar. 2014) (cited on page 76). [198] Jean A. Frazier et al. 
“Structural brain magnetic resonance imaging of limbic and thalamic volumes in pediatric bipolar disorder.” In: The American journal of psychiatry 162 7 (2005) (cited on page 77). [199] J. M. Goldstein et al. “Hypothalamic abnormalities in schizophrenia: sex effects and genetic vulnerability”. In: Biol. Psychiatry 61.8 (Apr. 2007), pages 935–945 (cited on page 77). [200] N. Makris et al. “Decreased volume of left and total anterior insular lobule in schizophrenia”. In: Schizophr. Res. 83.2-3 (Apr. 2006), pages 155–171 (cited on page 77). [201] C. D. Smyser et al. “Prediction of brain maturity in infants using machine- learning algorithms”. In: Neuroimage 136 (Aug. 2016), pages 1–9 (cited on page 77). 236 [202] R. S. Desikan et al. “An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest”. In: Neuroimage 31.3 (July 2006), pages 968–980 (cited on page 77). [203] N. Tzourio-Mazoyer et al. “Automated anatomical labeling of activations in SPM using a macroscopic anatomical parcellation of the MNI MRI single- subject brain”. In: Neuroimage 15.1 (Jan. 2002), pages 273–289 (cited on pages 77, 108, 112). [204] Craddock Cameron et al. “A whole brain fMRI atlas generated via spa- tially constrained spectral clustering”. In: Human Brain Mapping 33.8 (2011), pages 1914–1928. doi: 10 . 1002 / hbm . 21333. eprint: https : / / onlinelibrary.wiley.com/doi/pdf/10.1002/hbm.21333. url: https: //onlinelibrary.wiley.com/doi/abs/10.1002/hbm.21333 (cited on page 77). [205] Jack L. Lancaster et al. “Automated Talairach Atlas labels for functional brain mapping”. In: Human Brain Mapping 10.3 (2000), pages 120–131 (cited on page 77). [206] S. B. Eickhoff et al. “A new SPM toolbox for combining probabilistic cytoarchitectonic maps and functional imaging data”. In: Neuroimage 25.4 (May 2005), pages 1325–1335 (cited on page 77). [207] Markus D Schirmer. 
“Developing Brain Connectivity - Effects of Parcellation Scale on Network Analysis in Neonates (Doctoral dissertation, King’s College London)”. In: (2015). url: https://kclpure.kcl.ac.uk/portal/ (cited on pages 77, 93). [208] Sergey Ioffe and Christian Szegedy. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”. In: CoRR 237 abs/1502.03167 (2015). arXiv: 1502.03167. url: http://arxiv.org/abs/ 1502.03167 (cited on page 82). [209] K. Simonyan, A. Vedaldi, and A. Zisserman. “Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps”. In: ArXiv e-prints (Dec. 2013). arXiv: 1312.6034 [cs.CV] (cited on page 84). [210] Amanda V Utevsky, David V Smith, and Scott A Huettel. “Precuneus is a functional core of the default-mode network”. In: Journal of Neuroscience 34.3 (2014), pages 932–940 (cited on page 92). [211] Takamitsu Watanabe et al. “Diminished medial prefrontal activity behind autistic social judgments of incongruent information”. In: PloS one 7.6 (2012), e39561 (cited on page 92). [212] Hideya Koshino et al. “Functional connectivity in an fMRI working memory task in high-functioning autism”. In: Neuroimage 24.3 (2005), pages 810–821 (cited on page 92). [213] Patricia A Reuter-Lorenz et al. “Age differences in the frontal lateralization of verbal and spatial working memory revealed by PET”. In: Journal of cognitive neuroscience 12.1 (2000), pages 174–187 (cited on page 92). [214] R. Cameron Craddock et al. “A whole brain fMRI atlas generated via spatially constrained spectral clustering”. In: Human Brain Mapping 33.8 (Aug. 2012), pages 1914–1928. issn: 10659471. doi: 10.1002/hbm.21333. url: http://doi.wiley.com/10.1002/hbm.21333 (cited on page 93). [215] A. Fornito, A. Zalesky, and E. T. Bullmore. “Network scaling effects in graph analytic studies of human resting-state FMRI data”. In: Front Syst Neurosci 4 (2010), page 22 (cited on page 93). 238 [216] A. Zalesky et al. 
“Whole-brain anatomical networks: does the choice of nodes matter?” In: Neuroimage 50.3 (Apr. 2010), pages 970–983 (cited on pages 93, 95). [217] R. Kong et al. “Spatial Topography of Individual-Specific Cortical Networks Predicts Human Cognition, Personality, and Emotion”. In: Cereb. Cortex (June 2018) (cited on page 93). [218] B. Da Mota et al. “Enhancing the Reproducibility of Group Analysis with Randomized Brain Parcellations”. In: Medical Image Computing and Computer-Assisted Intervention - MICCAI 2013. Lecture Notes in Computer Science, vol 8150. Springer, Berlin, Heidelberg (cited on pages 94, 100). [219] C. P. Chen et al. “Diagnostic classification of intrinsic functional connectivity highlights somatosensory, default mode, and visual regions in autism”. In: Neuroimage Clin 8 (2015), pages 238–245 (cited on page 96). [220] V. Menon. “Developmental pathways to functional brain networks: emerging principles”. In: Trends Cogn. Sci. (Regul. Ed.) 17.12 (Dec. 2013), pages 627– 640 (cited on page 97). [221] Theodore D. Satterthwaite et al. “Impact of in-scanner head motion on multi- ple measures of functional connectivity: Relevance for studies of neurodevelop- ment in youth”. In: NeuroImage 60.1 (2012), pages 623–632. issn: 1053-8119. doi: https://doi.org/10.1016/j.neuroimage.2011.12.063. url: http: //www.sciencedirect.com/science/article/pii/S1053811911014650 (cited on page 97). [222] Damien Fair et al. “Distinct neural signatures detected for ADHD sub- types after controlling for micro-movements in resting state functional connectivity MRI data”. In: Frontiers in Systems Neuroscience 6 (2013), 239 page 80. issn: 1662-5137. doi: 10.3389/fnsys.2012.00080. url: https: //www.frontiersin.org/article/10.3389/fnsys.2012.00080 (cited on page 97). [223] Koene RA Van Dijk, Mert R Sabuncu, and Randy L Buckner. “The influence of head motion on intrinsic functional connectivity MRI”. In: Neuroimage 59.1 (2012), pages 431–438 (cited on page 97). [224] Patric Hagmann et al. 
“Mapping the Structural Core of Human Cerebral Cortex”. In: PLOS Biology 6.7 (July 2008), pages 1–15. doi: 10.1371/ journal.pbio.0060159. url: https://doi.org/10.1371/journal.pbio. 0060159 (cited on page 99). [225] Luisa M. Zintgraf et al. “Visualizing Deep Neural Network Decisions: Pre- diction Difference Analysis”. In: CoRR abs/1702.04595 (2017). arXiv: 1702. 04595. url: http://arxiv.org/abs/1702.04595 (cited on page 101). [226] Ramprasaath R. Selvaraju et al. “Grad-CAM: Why did you say that? Visual Explanations from Deep Networks via Gradient-based Localization”. In: CoRR abs/1610.02391 (2016). arXiv: 1610.02391. url: http://arxiv. org/abs/1610.02391 (cited on page 101). [227] Khosla et al. “Machine learning in resting-state fMRI analysis”. In: arXiv preprint arXiv:1812.11477 (2018) (cited on page 103). [228] Lixia Tian et al. “Changes in dynamic functional connections with aging”. In: Neuroimage 172 (2018), pages 31–39 (cited on page 103). [229] Liu et al. “Chronnectome fingerprinting:identifying individuals & predicting higher cognitive function using dynamic brain connectivity patterns”. In: Hum Brain Mapp () (cited on page 103). 240 [230] L. L. Zeng et al. “Unsupervised classification of major depression using func- tional connectivity MRI”. In: Hum Brain Mapp 35.4 (Apr. 2014), pages 1630– 1641 (cited on page 104). [231] Heung-Il Suk et al. “A hybrid of deep network and hidden Markov model for MCI identification with resting-state fMRI”. In: MICCAI. 2015 (cited on page 104). [232] Mahmudul Hasan et al. “Learning Temporal Regularity in Video Sequences”. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016) (cited on page 104). [233] Nitish Srivastava, Elman Mansimov, and Ruslan R. Salakhutdinov. “Unsu- pervised Learning of Video Representations using LSTMs”. In: ICML. 2015 (cited on page 104). [234] Wen Liu et al. “Future Frame Prediction for Anomaly Detection - A New Baseline”. 
In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018) (cited on page 104). [235] Alex Krizhevsky et al. “Imagenet classification with deep convolutional neural networks”. In: Advances in neural information processing systems. 2012 (cited on page 105). [236] Olaf Ronneberger et al. “U-Net: Convolutional Networks for Biomedical Image Segmentation”. In: MICCAI. 2015 (cited on page 105). [237] Sepp Hochreiter and Jürgen Schmidhuber. “Long short-term memory”. In: Neural computation 9.8 (1997), pages 1735–1780 (cited on page 105). [238] Xingjian Shi et al. “Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting”. In: NIPS. 2015 (cited on page 105). 241 [239] Di Martino et al. “The autism brain imaging data exchange:towards a large- scale evaluation of intrinsic brain architecture in autism”. In: Molecular psychiatry (2014) (cited on page 107). [240] Abraham et al. “Deriving reproducible biomarkers from multi-site resting- state data: an autism-based example”. In: NeuroImage 147 (2017), pages 736– 745 (cited on page 111). [241] Lisa T Eyler et al. “A failure of left temporal cortex to specialize for language is an early emerging and fundamental property of autism”. In: Brain 135.3 (2012), pages 949–960 (cited on page 112). [242] G. Varoquaux and R. A. Poldrack. “Predictive models avoid excessive reductionism in cognitive neuroimaging”. In: Curr. Opin. Neurobiol. 55 (Apr. 2019), pages 1–6 (cited on pages 117, 118, 173). [243] D. L. Yamins et al. “Performance-optimized hierarchical models predict neural responses in higher visual cortex”. In: Proc. Natl. Acad. Sci. U.S.A. 111.23 (June 2014), pages 8619–8624 (cited on pages 117–119, 149, 153, 174, 187). [244] K. N. Kay et al. “Identifying natural images from human brain activity”. In: Nature 452.7185 (Mar. 2008), pages 352–355 (cited on pages 117, 118). [245] Haiguang Wen et al. “Neural encoding and decoding with deep learning for dynamic natural vision”. 
In: Cerebral Cortex 28.12 (Dec. 2018), pages 4136– 4160. issn: 14602199. doi: 10.1093/cercor/bhx268. arXiv: 1608.03425 (cited on pages 117, 118, 124, 155, 174, 176, 277). [246] U. Guclu and M. A. van Gerven. “Deep Neural Networks Reveal a Gradient in the Complexity of Neural Representations across the Ventral Stream”. In: 242 J. Neurosci. 35.27 (July 2015), pages 10005–10014 (cited on pages 117–119, 124, 153, 155, 174, 176, 277). [247] Umut Güçlü and Marcel A.J. van Gerven. “Increasingly complex repre- sentations of natural movies across the dorsal stream are shared between subjects”. In: NeuroImage 145 (Jan. 2017), pages 329–336. issn: 10959572. doi: 10.1016/j.neuroimage.2015.12.036 (cited on pages 117, 145, 174). [248] Alexander J.E. Kell et al. “A Task-Optimized Neural Network Replicates Human Auditory Behavior, Predicts Brain Responses, and Reveals a Cor- tical Processing Hierarchy”. In: Neuron 98.3 (May 2018), 630–644.e16. issn: 08966273. doi: 10.1016/j.neuron.2018.03.044. url: https: //linkinghub.elsevier.com/retrieve/pii/S0896627318302502 (cited on pages 117–119, 124, 153, 174, 176, 277). [249] A. J. King and G. A. Calvert. “Multisensory integration: perceptual grouping by eye and ear”. In: Curr. Biol. 11.8 (Apr. 2001), R322–325 (cited on page 117). [250] J. Driver and T. Noesselt. “Multisensory interplay reveals crossmodal influ- ences on ‘sensory-specific’ brain regions, neural responses, and judgments”. In: Neuron 57.1 (Jan. 2008), pages 11–23 (cited on pages 117, 136, 141). [251] J. Miller. “Divided attention: evidence for coactivation with redundant sig- nals”. In: Cogn Psychol 14.2 (Apr. 1982), pages 247–279 (cited on page 117). [252] S. Sonkusare, M. Breakspear, and C. Guo. “Naturalistic Stimuli in Neuro- science: Critically Acclaimed”. In: Trends Cogn. Sci. (Regul. Ed.) 23.8 (Aug. 2019), pages 699–714 (cited on page 117). 243 [253] U. Hasson et al. “Intersubject synchronization of cortical activity during natural vision”. In: Science 303.5664 (Mar. 
2004), pages 1634–1640 (cited on pages 118, 143). [254] Marc Schönwiesner and Robert J. Zatorre. “Spectro-temporal modulation transfer function of single voxels in the human auditory cortex measured with high-resolution fMRI.” In: Proceedings of the National Academy of Sciences 106 34 (2009), pages 14611–6 (cited on page 118). [255] Daniel Schwartz, Mariya Toneva, and Leila Wehbe. “Inducing brain-relevant bias in natural language processing models”. In: NeurIPS (2019) (cited on page 119). [256] M. F. Glasser et al. “The minimal preprocessing pipelines for the Human Connectome Project”. In: Neuroimage 80 (Oct. 2013), pages 105–124 (cited on page 121). [257] A. T Vu et al. “Tradeoffs in pushing the spatial resolution of fMRI for the 7T Human Connectome Project”. In: Neuroimage 154 (July 2017), pages 23–32 (cited on pages 121, 162, 179). [258] Shawn Hershey et al. “CNN architectures for large-scale audio classification”. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2016), pages 131–135 (cited on pages 122, 124, 177, 179). [259] Albert S. Bregman. “Auditory Scene Analysis”. In: MIT press (2001) (cited on page 122). [260] Tsung-Yi Lin et al. “Feature Pyramid Networks for Object Detection”. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), pages 936–944 (cited on pages 124, 176). 244 [261] Kaiming He et al. “Deep Residual Learning for Image Recognition”. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015), pages 770–778 (cited on pages 124, 155, 177). [262] Jia Deng et al. “ImageNet: A large-scale hierarchical image database”. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition (2009), pages 248–255 (cited on pages 124, 155, 177). [263] Sami Abu-El-Haija et al. “YouTube-8M: A Large-Scale Video Classification Benchmark”. In: ArXiv abs/1609.08675 (2016) (cited on pages 124, 177). [264] M. F. Glasser et al. 
“A multi-modal parcellation of human cerebral cortex”. In: Nature 536.7615 (Aug. 2016), pages 171–178 (cited on pages 126, 128, 160, 164, 184, 271, 272, 300). [265] U. Hasson et al. “A hierarchy of temporal receptive windows in human cortex”. In: J. Neurosci. 28.10 (Mar. 2008), pages 2539–2550 (cited on page 129). [266] C. Baldassano et al. “Discovering Event Structure in Continuous Narrative Perception and Memory”. In: Neuron 95.3 (Aug. 2017), pages 709–721 (cited on pages 130, 147, 149). [267] M. A. Goodale and A. D. Milner. “Separate visual pathways for perception and action”. In: Trends Neurosci. 15.1 (Jan. 1992), pages 20–25 (cited on page 132). [268] G. A. Calvert. “Crossmodal processing in the human brain: insights from functional neuroimaging studies”. In: Cereb. Cortex 11.12 (Dec. 2001), pages 1110–1123 (cited on pages 134, 136, 141, 142, 147). 245 [269] T. Raij, K. Uutela, and R. Hari. “Audiovisual integration of letters in the human brain”. In: Neuron 28.2 (Nov. 2000), pages 617–625 (cited on pages 135, 136). [270] M. S. Beauchamp. “Statistical criteria in FMRI studies of multisensory inte- gration”. In: Neuroinformatics 3.2 (2005), pages 93–113 (cited on pages 135, 136, 141). [271] M. S. Beauchamp et al. “Unraveling multisensory integration: patchy orga- nization within human STS multisensory cortex”. In: Nat. Neurosci. 7.11 (Nov. 2004), pages 1190–1192 (cited on pages 136, 141). [272] G. A. Calvert et al. “Activation of auditory cortex during silent lipreading”. In: Science 276.5312 (Apr. 1997), pages 593–596 (cited on page 136). [273] N. Kanwisher and G. Yovel. “The fusiform face area: a cortical region specialized for the perception of faces”. In: Philos. Trans. R. Soc. Lond., B, Biol. Sci. 361.1476 (Dec. 2006), pages 2109–2128 (cited on page 138). [274] I. Tavor et al. “Separate parts of occipito-temporal white matter fibers are associated with recognition of faces and places”. In: Neuroimage 86 (Feb. 2014), pages 123–130 (cited on page 138). 
[275] S. Nasr et al. “Scene-selective cortical regions in human and nonhuman primates”. In: J. Neurosci. 31.39 (Sept. 2011), pages 13771–13785 (cited on page 138). [276] J. A. Frost et al. “Language processing is strongly left lateralized in both sexes. Evidence from functional MRI”. In: Brain 122 ( Pt 2) (Feb. 1999), pages 199–208 (cited on page 139). [277] P. Belin et al. “Voice-selective areas in human auditory cortex”. In: Nature 403.6767 (Jan. 2000), pages 309–312 (cited on page 139). 246 [278] T. Yarkoni et al. “Large-scale automated synthesis of human functional neuroimaging data”. In: Nat. Methods 8.8 (June 2011), pages 665–670 (cited on page 139). [279] Alexander G. Huth et al. “A Continuous Semantic Space Describes the Representation of Thousands of Object and Action Categories across the Human Brain”. In: Neuron 76 (2012), pages 1210–1224 (cited on pages 139, 145, 280). [280] Y. Cao et al. “Causal Inference in the Multisensory Brain”. In: Neuron 102.5 (June 2019), pages 1076–1087 (cited on page 142). [281] S. M. Wilson, I. Molnar-Szakacs, and M. Iacoboni. “Beyond superior tem- poral cortex: intersubject correlations in narrative speech comprehension”. In: Cereb. Cortex 18.1 (Jan. 2008), pages 230–242 (cited on page 142). [282] Iiro P. Jääskeläinen et al. “Inter-Subject Synchronization of Prefrontal Cortex Hemodynamic Activity During Natural Viewing”. In: The Open Neuroimaging Journal 2 (2008), pages 14–19 (cited on page 143). [283] Shailee Jain and Alexander Huth. “Incorporating Context into Language Encoding Models for fMRI”. In: NIPS (2018) (cited on page 145). [284] Fabian H Sinz et al. “Stimulus domain transfer in recurrent models for large scale cortical population prediction on video”. In: bioRxiv (2018) (cited on page 145). [285] J. Schultz and K. S. Pilz. “Natural facial motion enhances cortical responses to faces”. In: Exp Brain Res 194.3 (Apr. 2009), pages 465–475 (cited on pages 146, 173). [286] Pouya Bashivan, Kohitij Kar, and J. DiCarlo. 
“Neural population control via deep image synthesis”. In: Science 364 (2019) (cited on pages 146, 170). 247 [287] J. Chen, U. Hasson, and C. J. Honey. “Processing Timescales as an Organiz- ing Principle for Primate Cortex”. In: Neuron 88.2 (Oct. 2015), pages 244– 246 (cited on pages 147, 286). [288] Jonathan E Peelle. “Methodological challenges and solutions in auditory functional magnetic resonance imaging”. In: Frontiers in neuroscience 8 (2014), page 253 (cited on page 148). [289] Fabian H Sinz et al. “Engineering a less artificial intelligence”. In: Neuron 103.6 (2019), pages 967–979 (cited on page 149). [290] U. Hasson, J. Chen, and C. J. Honey. “Hierarchical process memory: memory as an integral component of information processing”. In: Trends Cogn. Sci. (Regul. Ed.) 19.6 (June 2015), pages 304–313 (cited on page 149). [291] Qianli Liao and Tomaso A. Poggio. “Bridging the Gaps Between Resid- ual Learning, Recurrent Neural Networks and Visual Cortex”. In: ArXiv abs/1604.03640 (2016) (cited on page 149). [292] K. Kar et al. “Evidence that recurrent circuits are critical to the ventral stream’s execution of core object recognition behavior”. In: Nat. Neurosci. 22.6 (June 2019), pages 974–983 (cited on pages 149, 150). [293] D. Wyatte, D. J. Jilk, and R. C. O’Reilly. “Early recurrent feedback facilitates visual object recognition under challenging conditions”. In: Front Psychol 5 (2014), page 674 (cited on page 150). [294] Emily S Finn et al. “Idiosynchrony: From shared responses to individual differences during naturalistic neuroimaging”. In: NeuroImage 215 (2020), page 116828 (cited on page 150). 248 [295] Jason J Ki, Simon P Kelly, and Lucas C Parra. “Attention strongly modulates reliability of neural responses to naturalistic narrative stimuli”. In: Journal of Neuroscience 36.10 (2016), pages 3092–3101 (cited on page 150). [296] Mai Nguyen, Tamara Vanderwal, and Uri Hasson. “Shared understanding of narratives is correlated with shared neural responses”. 
In: NeuroImage 184 (2019), pages 161–170 (cited on page 150). [297] Lauri Nummenmaa et al. “Emotions promote social interaction by synchro- nizing brain activity across individuals”. In: Proceedings of the National Academy of Sciences 109.24 (2012), pages 9599–9604 (cited on page 150). [298] Emily S Finn et al. “Trait paranoia shapes inter-subject synchrony in brain activity during an ambiguous social narrative”. In: Nature Communications 9.1 (2018), pages 1–13 (cited on page 150). [299] Zhi Yang et al. “Individualized psychiatric imaging based on inter-subject neural synchronization in movie watching”. In: NeuroImage 216 (2020), page 116227 (cited on page 150). [300] Yoav Benjamini and Yosef Hochberg. “Controlling the False Discovery Rate: a Practical and Powerful Approach to Multiple Testing”. In: J. R. Stat. Soc. B. 57 (1995), pages 289–300 (cited on page 283). [301] Arsha Nagrani et al. “Voxceleb: Large-scale speaker verification in the wild”. In: Comput. Speech Lang. 60 (2020) (cited on page 285). [302] Karol J. Piczak. “ESC: Dataset for Environmental Sound Classification”. In: MM (2015) (cited on page 285). [303] Jonathan D Power et al. “Methods to detect, characterize, and remove motion artifact in resting state fMRI”. In: Neuroimage 84 (2014), pages 320– 341 (cited on page 296). 249 [304] S. Sonkusare, M. Breakspear, and C. Guo. “Naturalistic Stimuli in Neuro- science: Critically Acclaimed”. In: Trends Cogn. Sci. (Regul. Ed.) 23.8 (Aug. 2019), pages 699–714 (cited on pages 153, 173). [305] John T. Serences and Steven Yantis. “Selective visual attention and percep- tual coherence”. In: Trends in Cognitive Sciences 10 (2006), pages 38–45 (cited on page 153). [306] Sabine Kastner and Leslie G. Ungerleider. “Mechanisms of visual attention in the human cortex.” In: Annual review of neuroscience 23 (2000), pages 315– 41 (cited on page 153). [307] Jochen Braun, Christof Koch, and Joel L. Davis. “Visual attention and cortical circuits”. 
In: Visual attention and cortical circuits. 2001 (cited on pages 153, 157). [308] Laurent Itti and Christof Koch. “Computational modelling of visual atten- tion”. In: Nature Reviews Neuroscience 2 (2001), pages 194–203 (cited on pages 153, 169). [309] Tomaso Poggio and Fabio Anselmi. “Visual Cortex and Deep Networks: Learning Invariant Representations”. In: Visual Cortex and Deep Networks: Learning Invariant Representations. 2016 (cited on page 153). [310] James E. Hoffman and Baskaran Subramaniam. “The role of visual attention in saccadic eye movements”. In: Perception & Psychophysics 57 (1995), pages 787–795 (cited on page 154). [311] Thomas P O’Connell and Marvin M. Chun. “Predicting eye movement patterns from fMRI responses to natural scenes”. In: Nature Communications 9 (2018) (cited on page 154). 250 [312] Fabian Sinz et al. “Stimulus domain transfer in recurrent models for large scale cortical population prediction on video”. In: Advances in neural infor- mation processing systems. 2018, pages 7199–7210 (cited on page 154). [313] M. F. Glasser et al. “The minimal preprocessing pipelines for the Human Connectome Project”. In: Neuroimage 80 (Oct. 2013), pages 105–124 (cited on pages 154, 161, 173, 179). [314] Po-He Tseng et al. “Quantifying center bias of observers in free viewing of dynamic natural scenes.” In: Journal of vision 9 7 (2009), page 4 (cited on page 158). [315] Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimization”. In: CoRR abs/1412.6980 (2014) (cited on pages 159, 178). [316] U. Hasson et al. “Intersubject synchronization of cortical activity during natural vision”. In: Science 303.5664 (Mar. 2004), pages 1634–1640 (cited on pages 160, 173). [317] Laurent Itti, Christof Koch, and Ernst Niebur. “A Model of Saliency-Based Visual Attention for Rapid Scene Analysis”. In: IEEE Trans. Pattern Anal. Mach. Intell. 20 (2009), pages 1254–1259 (cited on page 160). [318] Matthias Kümmerer et al. 
“Understanding Low- and High-Level Contribu- tions to Fixation Prediction”. In: 2017 IEEE International Conference on Computer Vision (ICCV) (2017), pages 4799–4808 (cited on pages 160, 161). [319] Zoya Bylinskii et al. MIT Saliency Benchmark. http://saliency.mit.edu/ (cited on page 161). [320] Zoya Bylinskii et al. “What Do Different Evaluation Metrics Tell Us About Saliency Models?” In: IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2016), pages 740–757 (cited on page 161). 251 [321] Matthias Kümmerer, Thomas S. A. Wallis, and Matthias Bethge. “Information-theoretic model comparison unifies saliency metrics.” In: Pro- ceedings of the National Academy of Sciences of the United States of America 112 52 (2015), pages 16054–9 (cited on page 161). [322] D. C. Van Essen et al. “The Human Connectome Project: a data acquisition perspective”. In: Neuroimage 62.4 (Oct. 2012), pages 2222–2231 (cited on pages 161, 162, 179). [323] G. S. Khorshidi et al. “Automatic denoising of functional MRI data: Com- bining independent component analysis and hierarchical fusion of classifiers”. In: NeuroImage 90 (2014), pages 449–468 (cited on page 162). [324] K. Grill-Spector, Z. Kourtzi, and N. Kanwisher. “The lateral occipital complex and its role in object recognition”. In: Vision Res. 41.10-11 (2001), pages 1409–1422 (cited on page 165). [325] Zoe Kourtzi and Nancy Kanwisher. “Cortical regions involved in perceiving object shape.” In: The Journal of neuroscience : the official journal of the Society for Neuroscience 20 9 (2000), pages 3310–8 (cited on page 165). [326] Jonas Larsson and David J. Heeger. “Two retinotopic visual areas in human lateral occipital cortex.” In: The Journal of neuroscience : the official journal of the Society for Neuroscience 26 51 (2006), pages 13128–42 (cited on page 165). [327] Alessandro De Benedictis et al. 
“Anatomo-functional study of the temporo-parieto-occipital region: dissection, tractographic and brain mapping evidence from a neurosurgical perspective.” In: Journal of anatomy 225 2 (2014), pages 132–51 (cited on page 165).
[328] Anne Treisman and G. A. Gelade. “A feature-integration theory of attention”. In: Cognitive Psychology 12 (1980), pages 97–136 (cited on page 168).
[329] U. Hasson, R. Malach, and D. J. Heeger. “Reliability of cortical activity during natural stimulation”. In: Trends Cogn. Sci. (Regul. Ed.) 14.1 (Jan. 2010), pages 40–48 (cited on page 173).
[330] Po-Hsuan Cameron Chen et al. “A Reduced-Dimension fMRI Shared Response Model”. In: NIPS. 2015 (cited on pages 173, 174).
[331] J. Dubois and R. Adolphs. “Building a Science of Individual Differences from fMRI”. In: Trends Cogn. Sci. (Regul. Ed.) 20.6 (June 2016), pages 425–443 (cited on page 174).
[332] Haiguang Wen et al. “Transferring and generalizing deep-learning-based neural encoding models across subjects”. In: NeuroImage 176 (Aug. 2018), pages 152–163. issn: 10959572. doi: 10.1016/j.neuroimage.2018.04.053 (cited on pages 174, 187).
[333] Ross Girshick et al. Detectron. https://github.com/facebookresearch/detectron. 2018 (cited on page 178).
[334] S. Hershey et al. Models for AudioSet: A Large Scale Dataset of Audio Events. https://github.com/tensorflow/models/tree/master/research/audioset/vggish. 2016 (cited on page 178).
[335] I. Tavor et al. “Task-free MRI predicts individual differences in brain activity during task performance”. In: Science 352.6282 (Apr. 2016), pages 216–220 (cited on page 181).
[336] N. Kanwisher, J. McDermott, and M. M. Chun. “The fusiform face area: a module in human extrastriate cortex specialized for face perception”. In: J. Neurosci. 17.11 (June 1997), pages 4302–4311 (cited on page 183).
[337] S. Nasr et al. “Scene-selective cortical regions in human and nonhuman primates”. In: J. Neurosci. 31.39 (Sept.
2011), pages 13771–13785 (cited on page 183).
[338] David H Hubel and Torsten N Wiesel. “Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex”. In: The Journal of physiology 160.1 (1962), pages 106–154 (cited on page 186).
[339] Anitha Pasupathy and Charles E Connor. “Population coding of shape in area V4”. In: Nature neuroscience 5.12 (2002), pages 1332–1338 (cited on page 186).
[340] Xiaomin Yue, Sophia Robert, and Leslie G Ungerleider. “Curvature processing in human visual cortical areas”. In: NeuroImage 222 (2020), page 117295 (cited on page 186).
[341] Scott L Brincat and Charles E Connor. “Underlying principles of visual shape selectivity in posterior inferotemporal cortex”. In: Nature neuroscience 7.8 (2004), pages 880–886 (cited on page 186).
[342] Nicole C Rust and James J DiCarlo. “Selectivity and tolerance (“invariance”) both increase as visual information propagates from cortical area V4 to IT”. In: Journal of Neuroscience 30.39 (2010), pages 12978–12995 (cited on page 186).
[343] Paul E Downing et al. “Domain specificity in visual cortex”. In: Cerebral cortex 16.10 (2006), pages 1453–1461 (cited on page 186).
[344] Kalanit Grill-Spector and Kevin S Weiner. “The functional architecture of the ventral temporal cortex and its role in categorization”. In: Nature Reviews Neuroscience 15.8 (2014), pages 536–548 (cited on page 187).
[345] D. L. Yamins and J. J. DiCarlo. “Using goal-driven deep learning models to understand sensory cortex”. In: Nat. Neurosci. 19.3 (Mar. 2016), pages 356–365 (cited on pages 188, 201).
[346] Emily Jean Allen et al. “A massive 7T fMRI dataset to bridge cognitive and computational neuroscience”. In: bioRxiv (2021) (cited on pages 188, 190).
[347] Liang Wang et al. “Probabilistic maps of visual topography in human cortex”. In: Cerebral cortex 25.10 (2015), pages 3911–3931 (cited on page 190).
[348] J Swaroop Guntupalli et al. “A model of representational spaces in human cortex”.
In: Cerebral cortex 26.6 (2016), pages 2919–2934 (cited on page 191).
[349] David A Klindt et al. “Neural system identification for large populations separating what and where”. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017, pages 3509–3519 (cited on page 191).
[350] Maurice Weiler and Gabriele Cesa. “General E(2)-Equivariant Steerable CNNs”. In: arXiv preprint arXiv:1911.08251 (2019) (cited on page 191).
[351] Nikolaus Kriegeskorte, Marieke Mur, and Peter A Bandettini. “Representational similarity analysis-connecting the branches of systems neuroscience”. In: Frontiers in systems neuroscience 2 (2008), page 4 (cited on page 195).
[352] Martin N Hebart et al. “THINGS: A database of 1,854 object concepts and more than 26,000 naturalistic object images”. In: PloS one 14.10 (2019), e0223792 (cited on pages 196, 204).
[353] Nikolaus Kriegeskorte. “Relating population-code representations between man, monkey, and computational models”. In: Frontiers in Neuroscience 3 (2009), page 35 (cited on page 196).
[354] Yaoda Xu and Maryam Vaziri-Pashkam. “Limits to visual representational correspondence between convolutional neural networks and the human brain”. In: Nature communications 12.1 (2021), pages 1–16 (cited on page 196).
[355] Alex Krizhevsky, Geoffrey Hinton, et al. “Learning multiple layers of features from tiny images”. In: (2009) (cited on page 196).
[356] Seyed-Mahdi Khaligh-Razavi and Nikolaus Kriegeskorte. “Deep Supervised, but Not Unsupervised, Models May Explain IT Cortical Representation”. In: PLoS Computational Biology 10 (2014) (cited on pages 201, 202, 298).
[357] Xiaomin Yue et al. “Curvature-processing network in macaque visual cortex”. In: Proceedings of the National Academy of Sciences 111.33 (2014), E3467–E3475 (cited on page 204).
[358] Robert Bridson. “Fast Poisson Disk Sampling in Arbitrary Dimensions”. In: ACM SIGGRAPH 2007 Sketches. SIGGRAPH ’07. San Diego, California: ACM, 2007.
isbn: 978-1-4503-4726-6. doi: 10.1145/1278780.1278807. url: http://doi.acm.org/10.1145/1278780.1278807 (cited on page 257).
[359] Nikolaus Kriegeskorte, Marieke Mur, and Peter A. Bandettini. “Representational Similarity Analysis – Connecting the Branches of Systems Neuroscience”. In: Frontiers in Systems Neuroscience 2 (2008) (cited on page 298).

APPENDIX A
SUPPLEMENTARY INFORMATION AND ADDITIONAL RESULTS FOR SECTION 3.1

A.1 Atlas Summary

Atlas     # of ROIs   Total Vol.   Median Vol. (± std)   Min Vol.   Max Vol.
TT        97          1656.34      12.5 (±16.02)         0.03       69.71
HO        111         1611.39      10.04 (±15.26)        0.05       97.33
EZ        116         1941.65      14.11 (±11.96)        0.97       56.35
AAL       116         1843.10      13.78 (±11.05)        1.35       53.33
DOS160    160         82.05        0.51 (±0.04)          0.03       0.51
CC200     200         1172.15      5.83 (±1.26)          1.81       9.96
CC400     400         1172.15      2.97 (±0.68)          0.76       5.35

Table A.1: Summary descriptors of ROIs in individual atlases. All volumes are in cm³.

A.2 Poisson Disk Sampling

Poisson disk sampling is a stochastic sampling procedure in which drawn samples are required to be at least a distance d apart, for some user-specified distance metric and density parameter. Since we use this sampling procedure to draw parcel centers, d is estimated a priori based on the desired number of parcels, and spatial proximity is used to compute the distance. We use the fast Poisson disk sampling algorithm proposed in [358]. This is an efficient sampling procedure that generalizes to arbitrary dimensions and allows volumetric sampling. The algorithm is outlined below:

• Step 1: A parcel center is arbitrarily chosen from all gray matter voxels and stored in an initially empty ‘active’ list.
• Step 2: A sample c is drawn from this active list of voxels. The next candidate parcel center is randomly selected from the list of all voxels within a spherical annulus between radius d and 2d around c. A candidate is accepted and added to the active list if it is at least a distance d apart from all the existing parcel centers; otherwise another candidate is chosen.
If no candidate is accepted from the annulus, c is removed from the active list and another sample c is drawn. This procedure is repeated until the active list is empty.
• Step 3: Once the centers are sampled, every gray matter voxel is assigned to its closest parcel center.

Sampling is performed for the left and right hemispheres separately to avoid parcels that cross hemispheric boundaries.

A.3 Linear Classifiers

A.3.1 Ridge Classifier

Given feature vectors $x_i$ for $n$ subjects and the corresponding prediction variables $y_i$, we approximate the fit using a linear regression model. An L2 regularization term on the weights ($w$) is added to the mean squared error to yield the following loss function of ridge regression:

\[ L_R = \|Xw - y\|_2^2 + \alpha \|w\|_2^2 \tag{A.1} \]

For classification, the output labels $y$ are encoded as $\pm 1$ for the two output categories and the above loss is minimized.

A.3.2 Support Vector Machines

(a) Classification

Support Vector Machine classifiers optimize for a hyperplane with maximum margin between the output classes. This results in a decision function of the form $f(x) = \mathrm{sign}(w^T x + b)$. The weights $\{w, b\}$ are obtained by minimizing the following convex loss function, which consists of a data loss component ($L_D$) and a regularization loss for the weights ($L_W$):

\[ L_{SVC} = C L_D + L_W \tag{A.2} \]

$L_D$ is modeled using a hinge loss function, $\sum_{i=1}^{n} \max(0, 1 - y_i (w^T x_i + b))$, over all $n$ training samples $\{(x_1, y_1), \ldots, (x_n, y_n)\}$. $L_W$ is modeled using a squared Euclidean norm, i.e., $\|w\|_2^2$. Here, $C$ is a tuning parameter that controls the trade-off between regularization and data loss.

(b) Regression

The $\epsilon$-Support Vector Regression ($\epsilon$-SVR) scheme optimizes for a decision function of the form $f(x) = w^T x + b$ that has at most $\epsilon$ deviation from the true prediction variables $y$ (allowing for errors when the problem is infeasible).
The loss function ($L_{SVR}$) can be formulated as

\[ L_{SVR} = C L_\epsilon + L_W \tag{A.3} \]

$L_\epsilon$ is traditionally referred to as the $\epsilon$-insensitive loss function, and is formulated as $\sum_{i=1}^{n} \max(0, |w^T x_i + b - y_i| - \epsilon)$ over all $n$ training samples $\{(x_1, y_1), \ldots, (x_n, y_n)\}$. The regularization term ($L_W$) is modeled using a squared Euclidean norm, i.e., $\|w\|_2^2$. The tuning parameter $C$ controls the trade-off between the regularization (i.e., the flatness of the decision function) and the amount up to which deviations beyond $\epsilon$ are tolerated.

Both the classification and regression problems yield weights $w$ that can be represented completely as a linear combination of the training inputs $x_i$. Thus, $w$ is represented as $\sum_{i=1}^{n} \alpha_i x_i$, and the decision function becomes $f(x) = \sum_{i=1}^{n} \alpha_i x_i^T x + b$. This makes it easier to extend SVMs to non-linear decision functions using the kernel technique, i.e., by applying transformations $\phi(x)$ that map $x$ to a high-dimensional space and replacing the inner product $\langle x_i, x \rangle$ with the kernel $K(x_i, x) = \langle \phi(x_i), \phi(x) \rangle$. In our experiments, we observed that the radial basis function kernel,

\[ K(x_i, x) = \exp\left( -\frac{\|x_i - x\|^2}{2\sigma^2} \right), \]

yields better results than the linear, sigmoid and polynomial (up to degree 4) kernels.

A.4 Neural network hyperparameter settings

We note that since it is more expensive to train a 3D-CNN, we could experiment with only a limited set of hyperparameter configurations during cross-validation on ABIDE-I data, compared to the FCN and BrainNet CNN, which work with vectorized connectivity matrices and are thus faster to train. For all three neural network models, we relied on a random search over the learning rate, number of layers, number of units or feature maps in each layer and the choice of non-linearity. For this search we employed the HO atlas. The hyper-parameter configuration that yielded the best ABIDE-I cross-validation accuracy was subsequently used for all other parcellation schemes.
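As an illustration of a random search of this kind, the following minimal sketch samples hyperparameter configurations and keeps the one with the best cross-validation score. The search ranges, the `evaluate_cv` callback and all function names are hypothetical placeholders introduced for this example; the ranges actually explored are not listed in this appendix.

```python
import math
import random

# Hypothetical search space (assumption): the text names the searched
# quantities but not their ranges.
SEARCH_SPACE = {
    "log10_lr": (-5.0, -2.0),          # learning rate, sampled log-uniformly
    "num_layers": [2, 3, 4, 5],
    "num_units": [16, 32, 64, 128],    # units or feature maps per layer
    "nonlinearity": ["relu", "elu", "tanh"],
}

def sample_config(rng):
    """Draw one configuration at random from SEARCH_SPACE."""
    lo, hi = SEARCH_SPACE["log10_lr"]
    return {
        "learning_rate": 10 ** rng.uniform(lo, hi),
        "num_layers": rng.choice(SEARCH_SPACE["num_layers"]),
        "num_units": rng.choice(SEARCH_SPACE["num_units"]),
        "nonlinearity": rng.choice(SEARCH_SPACE["nonlinearity"]),
    }

def random_search(evaluate_cv, n_trials=20, seed=0):
    """Return (best_config, best_score) over n_trials random draws.

    evaluate_cv(config) -> float is a user-supplied callback returning the
    cross-validation accuracy of a model trained with `config`.
    """
    rng = random.Random(seed)
    best_config, best_score = None, -math.inf
    for _ in range(n_trials):
        config = sample_config(rng)
        score = evaluate_cv(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

In this sketch the callback would wrap 10-fold cross-validation on ABIDE-I with the HO atlas, and the returned configuration would then be reused unchanged for the other parcellations.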
Note that except for some minor changes, the models for age prediction and ASD/HC classification are almost identical. Furthermore, in our primary analyses we compare models based on ABIDE-II performance, which was not used for hyper-parameter tuning.

ASD/HC Classification accuracy (ABIDE-I)
Parcellation   Ridge   SVM    FCN    BrainNet   3D-CNN
HO             66.7    69.4   69.4   67.8       70.5
CC200          69.7    69.1   70.5   68.6       71.2
EZ             66.4    69.0   68.6   66.0       69.3
TT             64.4    68.6   67.1   66.0       69.4
CC400          70.2    69.4   71.0   71.3       71.7
AAL            65.4    69.1   66.7   66.5       71.4
DOS160         66.2    68.4   67.2   67.0       68.6
MA-Ensemble    69.8    70.5   71.5   69.7       73.3
SP-Ensemble    70.7    71.0   72.0   71.5       73.5

Table A.2: Classification accuracy for ASD vs. Control: 10-fold cross-validation on ABIDE-I for benchmark models and proposed CNN approach. For each row, best results are bolded. For each column, best results are italicized. Green indicates better performance, whereas orange/red highlights worse performance.

Age RMSE (ABIDE-I)
Parcellation   Ridge   SVM    FCN    BrainNet   3D-CNN
HO             3.51    3.64   3.57   3.55       3.37
CC200          3.40    3.66   3.44   3.46       3.35
EZ             3.53    3.70   3.60   3.55       3.40
TT             3.58    3.77   3.53   3.57       3.41
CC400          3.41    3.71   3.41   3.46       3.39
AAL            3.60    3.74   3.66   3.60       3.31
DOS160         3.62    4.01   3.76   3.67       3.48
MA-Ensemble    3.30    3.67   3.28   3.31       3.28
SP-Ensemble    3.38    3.67   3.35   3.39       3.28

Table A.3: Root mean squared error (RMSE in years) for age prediction: 10-fold cross-validation on ABIDE-I for benchmark models and proposed CNN approach. For each row, best results are bolded. For each column, best results are italicized. Green indicates better performance, whereas orange/red highlights worse performance.

For the FCN, we initially started with the architecture proposed by Heinsfeld et al. [192] for ASD/HC classification. We increased the number of layers before the softmax output until the ABIDE-I cross-validation accuracy stopped improving. Also, we noticed that adding batch-normalization after each layer had no noticeable impact on classification performance.
Hence, we did not include this layer in our FCN architecture.

A.5 ABIDE-I cross-validation results

In order to ensure a fair comparison with other studies that report 10-fold cross-validation performance on ABIDE-I, we report the performance obtained using our benchmark and proposed models (along with the ensemble learning strategy) for both stochastic parcellations and atlases in the form of kernel density plots (Figure A.1 and Tables A.2, A.3). The results and conclusions on ABIDE-I remain consistent with ABIDE-II, with the 3D-CNN ensemble strategy outperforming all the baseline methods.

A.6 Saliency maps for individual parcellations

Visualizing the saliency maps for models trained on different brain parcellations can reveal interesting differences in the features captured by these models. We visualized the saliency maps of the 3D-CNN model for individual stochastic parcellations at multiple scales for the task of ASD/HC classification. As shown in Figure A.2, models trained using distinct parcellation schemes rely on the same basic underlying connectivity patterns for prediction, with small differences in their information content that can be utilized efficiently by the ensemble learning scheme. Further, the saliency maps of atlas-based (see Figure A.3) and stochastic parcellation-based models are remarkably similar, suggesting that the connectivity patterns of the same set of voxels guide the classifier predictions, irrespective of the precise scheme of ROI extraction.

A.7 Comparison of different preprocessing strategies

Since preprocessing options such as nuisance regression have been a point of contention in several studies, we conducted another set of experiments with standard atlas masks in three preprocessing scenarios: (a) without global signal regression (GSR) + with CompCor, (b) without GSR + without CompCor and (c) with GSR + without CompCor.
Below, we include the results obtained with models trained on the ABIDE-I data using the hyperparameters optimized in our original experiments presented in the paper. The accuracy values were computed based on test predictions on the independent ABIDE-II dataset, following our original evaluation protocol. As can be seen from Tables A.4 and A.5, when neither GSR nor CompCor is employed during preprocessing, the prediction performance on both tasks, i.e., ASD/HC classification and age prediction, drops significantly. However, similar performance is obtained when using either or both of CompCor and GSR in preprocessing. Importantly, the trend remains the same: the 3D-CNN fares favorably against the baseline algorithms and the MA-Ensemble models generally perform better than or similar to the best-performing atlas.

Preprocessing scheme: Without CompCor, With GSR
ASD/HC Classification accuracy (ABIDE-II)
Parcellation   Ridge   SVM    FCN    BrainNet   3D-CNN
HO             64.7    66.2   62.9   64.4       68.8
CC200          67.8    66.0   67.8   68.2       67.8
EZ             64.7    62.9   64.7   64.4       66.2
TT             62.9    63.9   64.4   64.9       65.7
CC400          68.5    65.7   66.7   67.5       68.8
AAL            63.7    65.2   64.4   62.1       68.8
DOS160         61.6    61.9   62.6   62.4       64.5
MA-Ensemble    69.3    66.5   69.0   69.7       68.8

Preprocessing scheme: With CompCor, Without GSR
ASD/HC Classification accuracy (ABIDE-II)
Parcellation   Ridge   SVM    FCN    BrainNet   3D-CNN
HO             58.5    64.1   61.0   63.6       64.0
CC200          68.4    67.2   68.0   68.2       68.7
EZ             62.6    65.1   62.6   66.4       67.4
TT             63.6    66.4   64.1   63.1       65.4
CC400          67.9    65.9   66.7   65.4       69.5
AAL            63.8    64.3   61.6   62.6       66.4
DOS160         65.9    65.1   66.2   63.3       68.2
MA-Ensemble    69.9    67.4   68.9   68.9       69.5

Preprocessing scheme: Without CompCor, Without GSR
ASD/HC Classification accuracy (ABIDE-II)
Parcellation   Ridge   SVM    FCN    BrainNet   3D-CNN
HO             60.1    58.8   62.1   62.6       63.2
CC200          67.7    66.2   62.4   64.7       67.0
EZ             59.6    59.8   62.9   60.1       63.5
TT             61.6    61.6   61.3   61.3       61.6
CC400          65.2    64.4   63.4   64.9       66.5
AAL            61.8    62.4   61.6   63.1       63.4
DOS160         57.0    58.0   62.1   61.3       62.7
MA-Ensemble    66.7    66.2   62.9   67.5       67.5

Table A.4: Classification
accuracy for ASD vs. Control: Independent results on ABIDE-II for benchmark models and proposed CNN approach. Green indicates better performance, whereas orange/red highlights worse performance.

Figure A.1: Violin plots showing the spread of prediction accuracies/errors for stochastic parcellations at multiple network scales for different classification models. Mean accuracy/error of individual violins is denoted by ‘Mean SPs’. Performance of individual atlases is compared with SPs with the closest # of ROIs and is denoted as ‘Single Atlas’. Results are computed by 10-fold cross-validation on the entire ABIDE-I cohort.

Figure A.2: Saliency maps of trained CNN models for 2 randomly chosen stochastic parcellations at each scale for ASD-HC classification.

Figure A.3: Saliency maps for atlas-based ASD-HC classification models.

Preprocessing scheme: Without CompCor, With GSR
Age RMSE (ABIDE-II)
Parcellation   Ridge   SVM    FCN    BrainNet   3D-CNN
HO             2.85    2.75   2.61   2.50       2.31
CC200          2.54    2.75   2.35   2.48       2.26
EZ             2.87    2.64   2.56   2.61       2.06
TT             3.09    2.80   2.55   2.87       2.19
CC400          2.61    2.86   2.32   2.32       2.28
AAL            2.82    2.63   2.68   2.61       2.14
DOS160         3.24    3.37   2.82   3.03       2.32
MA-Ensemble    2.56    2.67   2.23   2.29       2.07

Preprocessing scheme: With CompCor, Without GSR
Age RMSE (ABIDE-II)
Parcellation   Ridge   SVM    FCN    BrainNet   3D-CNN
HO             3.37    2.90   2.83   2.83       2.48
CC200          2.91    2.74   2.84   2.75       2.41
EZ             3.24    2.75   2.60   2.74       2.61
TT             3.14    2.88   3.14   2.74       2.34
CC400          2.88    2.86   2.74   2.63       2.47
AAL            3.18    2.73   2.78   2.79       2.49
DOS160         3.17    3.29   2.59   2.82       2.52
MA-Ensemble    2.73    2.75   2.37   2.41       2.16

Preprocessing scheme: Without CompCor, Without GSR
Age RMSE (ABIDE-II)
Parcellation   Ridge   SVM    FCN    BrainNet   3D-CNN
HO             3.14    3.08   3.14   2.60       2.60
CC200          3.15    2.89   3.33   2.60       2.70
EZ             3.41    2.81   3.35   2.98       2.47
TT             3.22    3.05   3.28   2.85       2.80
CC400          3.06    2.94   2.90   2.54       2.48
AAL            3.36    2.81   2.73   3.08       2.54
DOS160         3.45    3.54   3.48   2.99       2.63
MA-Ensemble    2.76    2.91   2.83   2.40       2.30

Table A.5: Root mean squared error (RMSE in years) for age prediction: Independent results on ABIDE-II for benchmark models and proposed CNN approach. Green indicates better performance, whereas orange/red highlights worse performance.

Figure A.4: ROC Curves for individual atlas based ASD-HC classification models.

Table A.6: Mean absolute error (MAE in years) for age prediction: Independent testing on ABIDE-II for benchmark models and proposed CNN approach. For each row, best results are bolded. For each column, best results are italicized. Green indicates better performance, whereas orange/red highlights worse performance.

APPENDIX B
SUPPLEMENTARY INFORMATION AND ADDITIONAL RESULTS FOR SECTION 4.2

B.1 HCP Movies

Table B.1 summarizes the HCP movie-watching dataset split used for training and evaluating all models.

Table B.1: HCP dataset split
Movie               Split                 Stimulus-response pairs per subject
7T_MOVIE1_CC1 v2    Training/Validation   652
7T_MOVIE2_HO1 v2    Training/Validation   716
7T_MOVIE3_CC2 v2    Training/Validation   669
7T_MOVIE4_HO2 v2    Testing               699

B.2 Region of Interest (ROI) selection

ROIs were selected for each analysis based on the descriptions provided in the neuroanatomical supplementary results of the HCP MMP parcellation [264] and an extensive literature review. For Figure 4.2 in the main text and Figure B.9, ROIs were thus assigned to groups 1-5 according to Table B.2.

Table B.2: ROI categorization
Group                                ROIs
1. Auditory                          A1, LBelt, PBelt, MBelt, RI, STSda, STSva, A4, A5, TA2
2. Visual                            V1, V2, V3, V3A, V3B, V3CD, V4, V4t, V6, V6A, V7, V8, DVT, LO1-3, PIT, FFC, VMV1-3, IPS1, MT, VVC
3. Multi-sensory + sensory bridges   STSdp, STSvp, STGa, STV, TPOJ1-3
4. Language                          55b, SFL, PSL, 44, 45
5. Frontal                           IFSa, IFSp, IFJa, IFJp, FEF

Dorsal and ventral visual stream ROIs as well as early and association auditory cortex ROIs in Figure 4.4 (main text) were derived from the explicit stream segregation and categorization described in the HCP MMP parcellation [264] and are defined here for quick reference.
• Dorsal: V3A, V3B, V6, V6A, V7, IPS1
• Ventral: V8, VVC, PIT, FFC, VMV1-3
• MT+: MT, MST, V4t, FST
• Early auditory: A1, PBelt, MBelt, RBelt, RI
• Association auditory: A4, A5, TA2, STGa, STSdp, STSda, STSvp, STSva

All ROIs are shown in Figure B.1.

Figure B.1: Group segregation from the HCP MMP parcellation.

B.3 Estimating BOLD response delay

BOLD response delay was estimated using ROI-level encoding models due to their faster iteration times in comparison to voxel-wise encoding. The input to these models was the preprocessed stimuli as described for voxel-wise encoding with the same train-validation-test split, and the output was the evoked ROI-level fMRI response at different lags (1-7 seconds) from the stimulus. Thus, the output is a 360-D vector corresponding to the mean fMRI response in each ROI of the HCP MMP parcellation. The feature extractors were identical to those in the proposed voxel-wise auditory and visual models. However, instead of a convolutional response model, here the response model comprised two fully connected layers with output dimensions of 512 and 360, with an exponential linear unit and a linear activation respectively. All models were trained for 20 epochs with a batch size of 4 and a learning rate of 1e-4. Validation curves were monitored to ensure convergence. Prediction accuracy of each model was computed as the mean Pearson correlation coefficient between the predicted and measured response across all ROIs, in the held-out movie dataset.

Figure B.2: ROI-based encoding performance for estimating delay. (A) depicts the estimated mean and standard error of the prediction accuracy (R) across various delays (1-7s) within the early auditory and association auditory group (blue) as well as across all ROIs (red), as obtained using the single epoch (1s) auditory model. (B) depicts the estimated mean and standard error of the prediction accuracy (R) for various delays (1-7s) within the primary and dorsal visual streams (blue) as well as across all ROIs (red), as obtained using the single frame visual model. Shaded regions depict the standard error in estimating mean across ROIs within each group. ROI categorization is described in the sub-section on ROI selection.

Based on Figure B.2, we estimated a response delay of 4 seconds, as this lag yielded the maximum prediction accuracy across all ROIs for both auditory and visual ROI-level models. Further, even when restricting the prediction accuracy (R) to ROIs within different cortical areas (such as the early/association auditory areas or the dorsal/ventral visual stream), the optimal lag was consistently 4 seconds, suggesting that the difference in performance of 1-sec and 20-sec models in these regions (Figure 4.4) is not largely driven by differences in the hemodynamic response function (HRF).

B.4 Defining the stimulus-driven or “synchronous” cortex

We isolated voxels involved in stimulus-driven processing, termed “synchronous” or “stimulus-driven” voxels, by computing mean inter-group correlations over all training movies. Inter-group correlations were computed by splitting the entire group of subjects into two halves and computing correlations between the mean response time-course of each half (comprising 79 subjects) at every voxel. We employed a liberal threshold of 0.15 for this correlation value. Thus, the mask of “stimulus-driven” voxels included those voxels that achieved an inter-group correlation of 0.15 or above. We computed mean quantitative metrics over this mask in Figure 4.3E (main text) to compare different models.

B.5 Model architectures and implementation

The base feature extraction networks and convolutional response model in Figure 4.1 had the architecture detailed in Figure B.3. The feature extraction networks are reminiscent of the feature pyramid network, which has shown significant improvements as a generic feature extractor across various applications.
These networks comprise a parallel top-down pathway with lateral connections, which grants them the ability to characterize both “what” and “where” in cluttered scenes, thereby enhancing object detection. We note that similar models with top-down and skip connections have been popular in vision research, since they can enrich low-level features with high-level semantics. The output of the feature extractor is fed into the convolutional response model to predict the evoked fMRI activation. This enables us to train both components of the network simultaneously in an end-to-end manner. Since the output response is differentiable with respect to network weights, the weights are adjusted via a first-order gradient-based optimization method to minimize the mean squared error between the predicted and target activation values across the entire brain.

For ResNet-50, we use activations of the last residual block of each stage, namely, res2, res3, res4 and res5 to construct our stimulus descriptions s. From the VGG-ish network, we use the activations of each convolutional block, namely, conv2, conv3, conv4 and the penultimate dense layer fc2 (pre-trained TensorFlow/Keras models for the visual and auditory backbone were available at https://keras.io/applications and https://github.com/tensorflow/models/tree/master/research/audioset/vggish respectively). The first three sets of activations are refined through a top-down path to enhance their semantic content, while the last activation is concatenated into s directly (res4 activations are vectorized using global average pool). The top-down path comprises three feature maps at different resolutions with an up-sampling factor of 2 successively from the deepest layer of the bottom-up path. Each such feature map comprising 256/128 channels (in visual/auditory models respectively) is merged with the corresponding feature map in the bottom-up path (reduced to 256/128 channels by 1x1 convolutions) by element-wise addition.
Subsequently, the feature map at each resolution is collapsed into a 256/128-dimensional feature vector through a global average pool operation and concatenated into s, leading to a 1024-D and 512-D feature representation for the visual and auditory stimuli respectively. The aggregated features are then passed onto a CNN comprising the following feedforward computations: a fully connected layer to map the features into a vector space which is reshaped into a 1024-channel cuboid of size 6x7x6, followed by four 3x3x3 transposed convolutions (conv.T) with a stride of 2 and exponential linear unit activation function to up-sample the latter. Each convolution reduces the channel count by half, with the exception of the last convolution, which outputs the single-channel predicted fMRI response.

The 20-second models additionally comprised an LSTM layer to model the temporal propagation of features across the contiguous sequence of input frames and/or spectrograms. The LSTM module has driven success across varied sequence modeling tasks due to its ability to efficiently regulate the flow of information across cells through gating. The memory cell in LSTM is modulated by three gates, namely, the input, forget and output gates. We note that the LSTM layer did not change the dimensionality of the input features so that equitable comparisons can be made against 1-sec models. The Audiovisual-1sec model concatenated features obtained from the base visual (1024-D) and audio (512-D) feature extraction networks, reduced their combined dimensionality to the higher value among the two (1024-D) by passing through a bottleneck dense layer followed by the same convolutional response model. The Audiovisual-20sec model additionally incorporated modality-specific LSTM networks prior to feature concatenation.

Implementation: We note that all 6 models have roughly the same order of trainable parameters, in the range of 242M-362M.
All parameters were optimized using Adam with a learning rate of 1e-4. Auditory and visual models were trained for 50 epochs with unit batch size. The stimulus and the subject whose fMRI response is used as the target in the loss (“mean squared error”) are randomly sampled at each step of the training, but the sampling is kept consistent across models. We found this method to work better than using the group-averaged response as target, presumably because this sampling provides information about both the cross-subject mean and the variance of response. Given the noise characteristics at each voxel, we hypothesize that this enables the model to focus on regions that can be well predicted with the given stimulus. Validation curves were monitored for all models to ensure convergence.

B.6 Regularized linear regression: deep convolutional features

We also trained group-level encoding models using a linear response model since this constitutes the dominant state-of-the-art approach to neural encoding [245, 246, 248]. To enable a fair comparison against the proposed 1-sec uni-modal models, we extract hierarchical features from the same layers of the ResNet-50 and VGG-ish architectures as employed by the proposed models. The only difference here is the lack of a top-down pathway (since it is not a part of the pre-trained network but is trained with random initialization on the neural response prediction task), which prevents the refinement of coarse feature maps before aggregation. Pooling the outputs of different layers channel-wise using the global average pooling operation (namely {a1, a2, a3, a4} for the audio model and {v1, v2, v3, v4} for the visual model in Figure B.3) leaves us with 1024 and 3840 features to present to the auditory and visual models, respectively.

Figure B.3: Implementation details for the audio (top left) and visual (top right) feature extraction networks as well as the convolutional response model (bottom). All layers and blocks outside the yellow rectangle (bottom-up pathway) are trained from scratch. The blocks inside the yellow rectangular window are initialized with networks pre-trained on image or sound recognition. Further, ResNet-50 is frozen during the training of all encoding models, whereas VGG is fine-tuned. The sequence of operations within each block is defined from top to bottom, while the number of repetitions for each sequence within the block is indicated with the multiplicative symbol on the right.

Further, to compare against the longer-duration 20-sec models, we adopted two approaches: (1) we simply concatenated the stimulus features extracted for each second (as described above) over T-second windows, with T ranging from 1 to 20 seconds, and presented these aggregated features to the linear response model; alternatively, (2) we reduced the dimensionality of the features aggregated as in (1) to a fixed length (set to 128) using principal component analysis run on the training data. We added this comparison to rule out the possibility that the temporal trend in performance of linear models is simply driven by a higher-dimensional feature space. We note that even after dimensionality reduction, the retained components explained at least 80% of the variance in all cases. Audio-visual encodings with linear response models were obtained similarly by simply fusing the respective audio and visual hierarchical features through concatenation before linear regression. We apply l2 regularization on the regression coefficients and select the strength of this penalty through cross-validation on the training data using log-spaced values in the range [1e−14, 1e14] for each model. We report performance of the best models in Figure B.4(A).
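The linear baseline described above can be sketched roughly as follows: per-second stimulus features are concatenated over a T-second window, optionally projected onto principal components fit on training data, and mapped to all voxels by ridge regression with one penalty shared across voxels. This is a minimal NumPy sketch under stated assumptions, not the exact implementation: all function names are hypothetical, the zero-padding at the start of the sequence is an illustrative choice, and a single held-out split stands in for the cross-validation used to select α.

```python
import numpy as np

def window_features(feats, T):
    """Concatenate the T most recent per-second feature vectors.

    feats: (n_seconds, d) array. Returns (n_seconds, T * d); time steps before
    the start of the sequence are zero-padded (an assumption of this sketch).
    """
    n, d = feats.shape
    padded = np.vstack([np.zeros((T - 1, d)), feats])
    return np.hstack([padded[T - 1 - k : T - 1 - k + n] for k in range(T)])

def pca_fit(X_train, n_components):
    """Fit PCA on training data only; return the mean and projection matrix."""
    mean = X_train.mean(axis=0)
    _, _, vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, vt[:n_components].T

def ridge_fit(X, Y, alpha):
    """Closed-form ridge weights for all voxels with one shared penalty."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

def mean_pearson(Y_true, Y_pred):
    """Mean Pearson correlation across voxels (columns)."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0) + 1e-12
    return float(np.mean((yt * yp).sum(axis=0) / denom))

def select_alpha(X_tr, Y_tr, X_val, Y_val, alphas):
    """Pick the single alpha that maximizes mean validation correlation."""
    scores = [mean_pearson(Y_val, X_val @ ridge_fit(X_tr, Y_tr, a))
              for a in alphas]
    return alphas[int(np.argmax(scores))], scores
```

A candidate grid matching the range in the text could be built with `np.logspace(-14, 14, 29)`; the number of grid points is an assumption, since only the range is reported.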
Note that unlike the WordNet models, we found that optimizing a single regularization penalty α common across all voxels outperformed independent voxel-wise fitting with bootstrap in this case. Thus, we only present the results for the former. We note here that the convolutional response model in our proposed approach (instead of a fully-connected approach) allowed us to keep the number of learnable parameters manageable, facilitating joint optimization/fine-tuning of the feature extractor and response models. The consistently superior performance of the proposed models against linear regression-based approaches strongly suggests that there is merit in end-to-end learning for encoding responses to dynamic, multi-sensory stimuli.

B.7 Regularized linear regression: WordNet features

Another popular approach in voxel-wise forward encoding beyond primary sensory cortices is the semantic category encoding model that is based on high-level semantic features [279]. This approach relies on labels that indicate the presence of semantic object and action categories in each movie frame. In this analysis, we employed WordNet labels that were provided as part of the HCP movie-watching data pipeline. The semantic labels were manually assigned by the Gallant lab team using the WordNet semantic taxonomy and subsequently converted to WordNet synsets to build an 859-D semantic representational space (corresponding to 859 WordNet synset names). Following [279], we fitted l2-regularized linear regression models (ridge regression) to find weights corresponding to different input features for every voxel. The regularization parameter α was optimized independently for each voxel by testing among 10 log-spaced values in [1, 1000]. The optimal α is obtained by averaging across 15 bootstrapped held-out sets.
In addition to fitting models with WordNet features extracted 4s prior to the measured neural response, we developed longer-timescale linear models by concatenating the WordNet features extracted for each second (as described above) over T-second windows with T ranging from 1 to 20 seconds and presented these aggregated features to the bootstrapped regularized regression model. Figure B.4(C) shows the performance of WordNet models across different groups of regions as a function of T, and Figure B.4(D) depicts the voxel-level prediction accuracy (R) of the best performing WordNet model, which stacks features from 4-12s (at an interval of 1s) prior to the encoded cortical response. While simple and interpretable, the WordNet models clearly under-perform in terms of prediction accuracy (R) in comparison to the models proposed in the present study.

Figure B.4: Performance of linear response models and baselines. (A) shows the region-averaged prediction accuracy of linear response models using deep convolutional features. (B) shows results of the ablation study and highlights the importance of different components of the proposed model architecture. (C) shows the region-averaged prediction accuracy of linear response models using semantically rich WordNet features and (D) shows the cortical map of the prediction accuracy (R) for the best WordNet model. The x-axis in (A) and (C) depicts the length of the windows (in seconds) over which the stimulus features are concatenated and the y-axis shows the mean Pearson correlation coefficient between the predicted and measured responses across the stimulus-driven voxels.

B.8 Ablation study

To determine the influence of different architectural components on prediction performance of the proposed models, we performed an ablation study to investigate the individual contributions of (i) non-linearities in the response model, (ii) hierarchical (multi-scale) feature maps, (iii) fine-tuning the audio sub-network (VGG) and (iv) LSTM.
We selectively removed each of the components from the respective model and compared the resulting performance against the proposed 1-sec and 20-sec models that employ all (i)-(iii) and (i)-(iv) components respectively. We note that the model without LSTM (iv) uses concatenated features instead of employing recurrence. Due to computational constraints, we could not train a model that feeds 20-sec concatenated features directly to the convolutional response model since this raises the number of parameters substantially. Instead, we map the concatenated feature input to a 1024-D and 512-D feature space for visual and audio models respectively using a fully connected layer. We note that this also ensures a more equitable comparison against the proposed 20-sec models that use LSTMs by enforcing that the representations fed into the response models in both cases are of the same dimension. We follow the same protocol for training these models as used for training the proposed models.

There are several interesting observations to make from this ablation analysis (Figure B.4B). (i) First, we find that encoding models with a frozen VGG network that is not updated during training incur a loss in performance compared to the proposed model where VGG layers are trainable during neural response prediction. This clearly demonstrates the advantages of altering these pre-trained models and suggests that fine-tuning is both feasible and beneficial in improving neural response prediction. (ii) Next, we find that prediction performance deteriorates after removing the non-linearities in both the Audio-1sec and Visual-1sec models. In the context of the Visual-1sec model with a frozen pre-trained backbone (ResNet-50) and coupled with (i), this observation further highlights that it is possible to develop models of human sensory processing that are quantitatively more precise in matching brain activity than task-driven neural networks.
(iii) Next, we assessed the benefit of using hierarchical feature maps over selecting the single best-performing layer for each model (audio or visual) based on cross-validation. For both audio and visual models, we find that features from the last layer (i.e., a4 and v4, respectively) yield the highest mean prediction accuracy (R) across the synchronous cortex. However, although the convolutional response model architecture is common across these encoding models, it is important to note that this analysis is still subject to confounds such as the different dimensionality of feature spaces across different layers that feed into the response model. The best performing single-layer encoding model, however, still performs worse than the hierarchical approach. (iv) Finally, while the encoding models with concatenated features outperform the 1-sec models, the performance still falls short of the accuracy obtained by the proposed 20-sec models employing LSTM. We believe this noticeable difference arises from the ability of LSTMs to efficiently capture long-term dependencies and reconcile the recent input history (‘memory’) with the immediate context (current frame).

B.9 Computing significance estimates

The statistical significance of individual voxel predictions (Figure 4.3) was computed as the p-value of the obtained sample correlation coefficient for the null hypothesis of uncorrelatedness (i.e., true correlation coefficient is zero) under the assumptions of a bivariate normal distribution. We employed the false-discovery rate procedure of Benjamini & Hochberg (1995) [300] to control for multiple comparisons under assumptions of dependence. For statistical comparison of model performance within each group of regions in Figure 4.2 (main text), we performed the paired t-test on ROI-level average performance metrics and corrected for multiple comparisons among models (Bonferroni).
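The multiple-comparison step above can be sketched as follows. This is a minimal illustration of the step-up false-discovery rate procedure applied to a list of voxel-wise p-values, not the code used in this study. The function name and the `dependence` flag are hypothetical; the flag applies the harmonic-sum correction of Benjamini and Yekutieli, which is one common way to handle arbitrary dependence and is included here as an assumption, since the text does not specify how dependence was treated.

```python
def fdr_reject(p_values, q=0.05, dependence=False):
    """Step-up FDR control: return a list of booleans (True = rejected).

    With dependence=False this is the standard Benjamini-Hochberg rule:
    reject the k smallest p-values, where k is the largest rank with
    p_(k) <= (k / m) * q. With dependence=True, q is divided by
    sum_{i=1..m} 1/i (Benjamini-Yekutieli correction; an assumption here).
    """
    m = len(p_values)
    if dependence:
        q = q / sum(1.0 / i for i in range(1, m + 1))
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank  # keep the largest rank that passes its threshold
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected
```

In a voxel-wise setting, `p_values` would hold one p-value per voxel and the returned mask would select the voxels to color on the surface maps.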
B.10 Sensory-sensitivity index

Distorting the input to the audio-visual model at test time allows us to interrogate the sensory-sensitivity of different brain regions. We developed a sensory-sensitivity index of each ROI based upon predictive performance of the model with distorted inputs, as shown in Figure 4.5. Let $S_V^r$ and $S_A^r$ denote the mean prediction accuracy of the model in region r after temporally shuffling the input order of the visual and auditory stimuli, respectively. The sensory-sensitivity index for region r is then defined as

\[ s_r = \frac{S_A^r - S_V^r}{S_A^r + S_V^r}. \]

Note that positive values of this index indicate that region r incurs a greater loss in predictivity upon distortion of visual information than auditory information, suggesting a higher visual sensitivity for this region. Similarly, negative values indicate a higher auditory sensitivity.

B.11 Stimuli for synthetic contrasts

Synthetic contrasts were generated to study the generalization of our models to new experimental paradigms (Figure 4.6). We focus on predicting task-based contrasts for three semantic categories, namely, faces, places and speech, since these are the most well-studied categories in the context of their distinct functional signatures. The stimuli for visual contrasts were derived from the HCP Working Memory paradigm, which combines category-specific representation tasks (including faces and places) and working memory tasks. After excluding grayscale images, we were left with 102, 77, 97 and 103 images for the categories of faces, places, body parts and tools, respectively. Since these are static images without any dynamic content, we employed the Visual-1sec model to derive the visual contrasts (Figure 4.6(C),(D)). Stimuli for the speech and non-speech contrast were extracted from large popular datasets for these categories. Speech stimuli were extracted from a human speech-utterance dataset comprising short audio clips of interviews recorded on YouTube [301].
Non-speech stimuli were extracted from another large dataset comprising short clips of environmental sounds [302]. We randomly extracted ∼100 minutes of audio waveforms from these datasets for both categories. The stimuli were processed for mel-spectrogram extraction in the same manner as the HCP audio-visual movies. Since the non-speech stimuli only comprised contiguous clips of roughly 3-5 second duration, we employed the Audio-1sec model to obtain the speech contrast (Figure 4.6(B)).

B.12 Perturbation analysis with 20-sec models

To address the influence of temporal continuity and short-term memory (past inputs) on the predictions of 20-sec models, we conducted a perturbation analysis by distorting the input context seen by these models at inference time using two shuffling experiments:

• Shuffled (different segment): In this experiment, we keep the last frame of every 20-second input segment and replace the preceding 19 frames with contiguous frames of a randomly selected 19-sec input clip within the test movie. This input perturbation thus largely maintains temporal continuity and highlights the influence of past inputs or short-term memory on response predictions.

• Shuffled (same segment): Under this experimental set-up, we randomly shuffle the first 19 frames of the same input clip at inference time while keeping the last frame the same. This obliterates the temporal continuity of the input clip without changing the overall content that is fed into the encoding model.

We repeated both shuffling experiments 10 times and report the average performance of each model under these two perturbation methods across different ROIs in Figure B.5. As can be seen from the figure, both input perturbations cause a drop in model performance, albeit to different degrees.
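The two perturbations described above can be sketched as simple list operations on a clip of per-second frames. This is an illustrative sketch only: the function names are hypothetical, frames are represented by arbitrary Python objects, and the replacement segment is drawn uniformly from all valid start positions in the test movie (which may overlap the original clip), an assumption the text does not settle.

```python
import random

def shuffle_different_segment(clip, movie, rng):
    """Keep the last frame of `clip`; replace the preceding frames with a
    contiguous run of frames taken from a random position in `movie`."""
    n_context = len(clip) - 1
    start = rng.randrange(0, len(movie) - n_context + 1)
    return movie[start:start + n_context] + [clip[-1]]

def shuffle_same_segment(clip, rng):
    """Keep the last frame of `clip`; randomly permute the preceding frames.
    The input clip itself is left unmodified."""
    context = clip[:-1]  # slicing copies, so the caller's list is untouched
    rng.shuffle(context)
    return context + [clip[-1]]
```

For a 20-second clip, `len(clip)` would be 20, so 19 context frames are replaced or permuted. Repeating either function with differently seeded `random.Random` instances and averaging the resulting prediction accuracies would mirror the 10 repetitions reported above.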
Interestingly, the Audio-20sec model seems to rely on the temporal continuity of the input more heavily than the Visual-20sec model, as evidenced by the much sharper drop in performance for the former model under same-segment re-shuffling. The consistent deterioration of model performance under these control experiments is thus another indication that the 20-sec models exploit recent input history (‘memory’) while computing response predictions.

Figure B.5: Perturbation analysis with Audio-20sec (A) and Visual-20sec (B) models. ROI box plots depict the un-normalized correlation coefficients between the predicted and measured response of voxels in each ROI using original or distorted 20-sec input clips at inference time.

B.13 Performance improvement and autocorrelation decay

In the past, processing timescales in the brain have been probed using several different means [287]. In one of the proposed approaches, the decay time of temporal autocorrelation is used as a proxy measure to understand the variation in processing timescales across different brain regions. With this approach, it was shown that decay times increased progressively along the temporal hierarchy. Following this line of work, we estimated the autocorrelation decay time constant (π) for each voxel by fitting an exponential, A exp{−t/π}, to the autocorrelation function (autocorrelation computed at different lags). The exponential model was first independently fit for each movie run and each voxel and the estimated π were subsequently averaged across runs to obtain one decay time constant per voxel.

Here, we were primarily interested in understanding whether there is any relationship between the performance improvement of the 20-sec model over the 1-sec model, ∆R, computed as the difference between the prediction accuracies of the Audiovisual-20sec and Audiovisual-1sec models at every voxel, and the temporal autocorrelation properties of that voxel. We hypothesized that in voxels with longer processing timescales, the autocorrelation would persist for longer durations (resulting in larger π) and the longer timescale model (20-sec) would yield more substantive improvement over the 1-sec model. As shown in Figure B.6, we observed a significantly positive correlation between performance improvement and the autocorrelation decay time constant (r = 0.49 and 0.50 across voxels in auditory and visual regions, respectively, as defined in Table B.2), in line with our hypothesis. This suggests that the benefit of employing the 20-sec model, as quantified in terms of performance improvement, is indeed more pronounced in regions with longer processing timescales.

Figure B.6: Performance boost of the 20-sec model over the 1-sec model is higher in voxels with longer autocorrelation decay times. (A) & (B) depict the performance improvement (∆R) against decay time constants for voxels associated with auditory and visual regions, respectively (Table B.2). The r value indicates the Pearson correlation coefficient between the two quantities. Each dot in the scatterplot represents an individual voxel. Bivariate kernel density estimates are overlaid on top of the scatterplot as contours to depict the probability distribution of observations.

B.14 Surface visualization

All input fMRI data, as well as response predictions in this study, are volume-based. In order to be consistent with prior research on encoding models that employ surface visualizations, we created surface versions of volumetric predictability and synthetic contrast maps, as shown in Figures 4.3, 4.5 and 4.6. We employed the 3D trilinear mapping method from Connectome Workbench, which computes the result on each vertex based on linear interpolation from voxels on each side of the vertex (https://www.humanconnectome.org/software/workbench-command). However, since volume-to-surface mappings are an approximation, we only employ this conversion for visualizations.
All reported metrics are computed on volumes only on a per-voxel basis.

B.15 Qualitative analysis

To gain qualitative insights into the predictions of the most accurate model (Audiovisual-20sec) on the held-out movie, we plot the predicted as well as measured response time-series of the voxel with ‘median’ prediction accuracy (R) in the best performing ROI of each group (Figure B.7). The latter corresponds to A4, V3CD, STSdp, IFSp and Area 45 for the auditory, visual, multi-sensory, frontal and language groups respectively.

Figure B.7: Predicted and measured response time-series of the ‘median’ predictive accuracy (R) voxel across ROIs of different functional groups. Vertical dashed lines mark the boundary of clip segments in the held-out movie.

B.16 Group-level prediction accuracy: held-out set

To test the generalizability of the models, we further compared model predictions against the group-averaged response of a held-out group within the HCP dataset comprising 20 novel subjects distinct from the 158 individuals used in the training set, on the same independent held-out movie.

Noise ceiling estimation: For the held-out group, we obtain the noise ceiling by considering variability across subjects. Here, the noise ceiling was computed as the correlation coefficient between the mean measured response for the independent test movie across all 158 subjects in the training set and the group-averaged response computed over the 20 new subjects. This metric captures the response component shared across independent groups of subjects and thus reflects the upper bound achievable by a group-level encoding model. We employ this noise ceiling for comparison against the prediction accuracy of the model on the held-out group of subjects (Figure B.8).
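For a single voxel, the noise ceiling described above and the corresponding normalized prediction accuracy can be sketched as below. This is a minimal illustration under stated assumptions rather than the code used in this study: the function names are hypothetical, and each subject's response is assumed to be a plain list of values sampled at the same time points of the test movie.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0

def group_mean(responses):
    """Average a list of per-subject time series into one group time series."""
    n = len(responses)
    return [sum(ts[t] for ts in responses) / n for t in range(len(responses[0]))]

def noise_ceiling(train_responses, heldout_responses):
    """Correlation between the training-group mean and the held-out-group mean."""
    return pearson(group_mean(train_responses), group_mean(heldout_responses))

def normalized_accuracy(prediction, heldout_responses, ceiling):
    """Prediction accuracy on the held-out group mean, divided by the noise ceiling."""
    return pearson(prediction, group_mean(heldout_responses)) / ceiling
```

In the setting described here, `train_responses` would hold the 158 training subjects and `heldout_responses` the 20 novel subjects, evaluated separately at each voxel.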
The models accurately predicted cortical responses evoked by the independent test movie as measured in the independent subject population (Figures B.8 and B.9), with the best performing model (Audiovisual-20sec) even achieving close to perfect predictivity relative to the “noise ceiling” in certain multi-sensory sites such as the posterior STS (Figure B.8(A), (G)). These results clearly indicate that inclusion of temporal history and multi-sensory information pushes the prediction accuracies closer to their upper bound, as also evidenced by a higher slope of the linear model fit on their corresponding data points. Further, voxels that truly approach the noise ceiling are predominantly associated with the auditory group of regions as broadly characterized within the HCP MMP parcellation. Interestingly, we find that this regional distribution of predictivity against noise ceiling holds even for subject-specific responses and not just the group-averaged responses, as described in the next section and shown in Figure B.10.

Figure B.8: Model performance on held-out group of subjects. (A) Pearson correlation coefficient (R) between the model predictions and group-averaged response of an independent subject group comprising 20 subjects, on the held-out test movie, normalized by the voxel-specific noise ceiling. (B) Predictivity against the noise ceiling for all voxels with high “synchrony” across training movies (>0.5) (see Supplementary Information for details).
This gives a total of 52,954 highly “synchronous” voxels that are colored based on their association with auditory and visual groups. This hue assignment of each voxel was derived from the coloration of the corresponding ROI in the multi-modal HCP parcellation. Each dot in the scatterplot represents an individual voxel. Bivariate kernel density estimates are overlaid on top of the scatterplot as contours to depict the probability distribution of observations (prediction accuracy/noise ceiling pair at every voxel).

Figure B.9: Quantitative evaluation metrics for all the proposed models on the independent held-out population comprising 20 novel subjects. (A),(C)-(F) depict prediction accuracy (R) for all the proposed models across major groups of regions as identified in the HCP MMP parcellation (B). Predictive accuracy of all models is summarized across (A) auditory, (C) visual, (D) multi-sensory, (E) language and (F) frontal areas. Box plots depict quartiles and swarmplots depict mean prediction accuracy of every ROI in the group. For language areas (Group 4), left and right hemisphere ROIs are shown as separate points in the swarmplot because of marked differences in the prediction accuracy. Statistical significance tests (results indicated with horizontal bars) are performed to compare 1-sec and 20-sec models of the same modality (3 comparisons) or uni-modal against multi-modal models of the same duration (4 comparisons) using paired t-test (p-value < 0.05, Bonferroni-corrected) on mean prediction accuracy within ROIs of each group.

B.17 Subject-level prediction accuracy: held-out set

For each participant in our independent subject group (N = 20), we computed the correlation coefficient (R) between the predictions of the best performing model (Audiovisual-20sec) and the subject-specific fMRI response corresponding to the independent movie.
We further contrast this cortical map of prediction performance against another map computed as the voxel-wise correlation coefficient between the mean neural response across all 158 training subjects and the respective subject-specific response on the independent movie. The latter places an upper bound on the predictivity of each voxel as achievable by any group-level model. Here, we present the results for 5 subjects with mean prediction accuracy (un-normalized) within the stimulus-driven cortex in the ith percentile with i ∈ {0.01, 25, 50, 75, 99.9}. The results (Figure B.10) suggest that the model can successfully capture the response component that individual subjects share with the population.

Figure B.10: Comparison of voxel-level prediction accuracies (R) against subject-specific noise ceiling for 5 representative subjects from the held-out set. The subjects were chosen such that their mean prediction accuracy (un-normalized) within the stimulus-driven cortex lay in the ith percentile with i ∈ {0.01, 25, 50, 75, 99.9}. Surface maps with white background in (A)-(E) depict raw correlation coefficients between model (Audiovisual-20sec) predictions and subject-specific response on the held-out movie whereas maps on gray background indicate the respective subject-specific noise ceiling. Only significantly correlated voxels (p<0.05, FDR corrected) are colored on the surface.

B.18 Correcting with inter-group synchrony

Since the present study focuses on population-wide predictive models, another upper bound on performance estimates that naturally comes to mind is one based on inter-subject or inter-group synchrony in cortical activity on the independent test movie. We computed split-half correlations between the mean response time-course of each group on the test movie.
To compare the prediction accuracy against ISC, we divided the prediction accuracy of the best predictive model, i.e., the Audiovisual-20sec model, by this synchrony-based noise ceiling to get the synchrony-normalized prediction accuracy, shown in Figure B.11. A stronger shift towards values approaching unity indicates that the model is able to capture stimulus-driven activity highly accurately across large regions of the cortex.

Figure B.11: Synchrony-normalized prediction accuracy (R) of the Audiovisual-20sec model

B.19 Influence of motion

fMRI measurements are prone to various sources of noise, including spurious head motion and physiological artifacts, which may vary in systematic ways with the variables of interest in any study. While the fMRI data was pre-processed with motion correction, the effects of motion cannot be fully eliminated and need to be further accounted for. Motion confounds have been reported in prior studies that use neuroimaging data as a “predictor” for different behavioral states or as clinical biomarkers. In our study, the inputs are natural images and the “predicted” variable (fMRI response) is the one prone to motion artifacts. In this study, we developed group-level predictive models of whole-brain cortical activity. One could expect to see the influence of motion in predictions if there was a systematic correlation between motion signals across subjects (so that the signal could persist post averaging), which would suggest that average subject motion tracks the stimulus characteristics. To address this issue, we examined the Pearson correlation coefficients between the predicted/measured response of each voxel and the framewise displacement across the independent test movie clips. The framewise displacement was computed as described in Power et al. [303] from the averaged motion estimates across subjects on the independent test movie.
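As a rough sketch of this computation, formalized in Eq. (B.1), framewise displacement sums the absolute frame-to-frame changes in the three translation parameters and the three rotation parameters, with rotations converted to arc length on a sphere. The function and parameter names below are hypothetical, and the units are assumptions of the sketch: translations in millimetres, rotations in degrees, and a sphere radius of 50 mm, matching the factor 50·π/180 in the equation.

```python
import math

def framewise_displacement(translations, rotations_deg, radius_mm=50.0):
    """Framewise displacement for frames 1..T-1.

    translations:  list of (x, y, z) per frame, assumed in mm
    rotations_deg: list of (pitch, yaw, roll) per frame, assumed in degrees
    radius_mm:     sphere radius used to convert rotations to displacement
    """
    fd = []
    for t in range(1, len(translations)):
        # Sum of absolute changes in the three translation parameters
        trans = sum(abs(a - b) for a, b in zip(translations[t - 1], translations[t]))
        # Sum of absolute changes in the three rotation parameters (degrees)
        rot = sum(abs(a - b) for a, b in zip(rotations_deg[t - 1], rotations_deg[t]))
        fd.append(trans + radius_mm * math.pi / 180.0 * rot)
    return fd
```

The resulting FD time series can then be correlated with the predicted or measured response of each voxel.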
\[ FD(t) = \sum_{d} |d(t-1) - d(t)| + 50 \cdot \frac{\pi}{180} \sum_{r} |r(t-1) - r(t)| \tag{B.1} \]

where d denotes translation distances {x, y, z} and r denotes rotation angles {pitch, yaw, roll}. As shown in Figure B.12, the correlation coefficients are centered around zero with a very small standard deviation (∼0.05). Importantly, upon computing the p-value of the obtained sample correlation coefficients for the null hypothesis of uncorrelatedness (under the assumptions of a bivariate normal distribution), we observed that none of these correlations were significant for the predicted responses and only very few voxels (shown on the cortical surface below) were found to exhibit statistically significant correlations between measured responses and FD (p<0.05, FDR corrected).

Figure B.12: Addressing the influence of motion on measured and predicted responses. (A) and (B) depict the distribution of the Pearson correlation coefficient of FD with the predicted responses of the Audiovisual-20sec model and measured responses across the whole brain respectively. Surface maps in (C) depict the raw correlation coefficients between FD and the measured responses. Only statistically significant voxels (p<0.05, FDR corrected) are colored on the surface.

APPENDIX C
SUPPLEMENTARY INFORMATION AND ADDITIONAL RESULTS FOR SECTION 4.3

C.1 Model comparison across randomly selected layers

Here, we wanted to examine if the learned attention model would lead to performance improvements in neural response prediction across other deep layers as well. We trained all 8 models using stimuli representations Frep from 2 randomly selected layers in the res5 block of the pre-trained ResNet-50 architecture, namely ‘add 14’ and ‘res5c branch2b’ (see footnote 1), henceforth denoted as ‘Random ResNet-50 layer 1’ and ‘Random ResNet-50 layer 2’ respectively. Figure C.1 shows the prediction accuracy across the synchronous cortex on the held-out movie for all models.
We again observe that the learned attention model performs favorably against models with no attention, no pooling or center-weighted attention. Further, the gaze-weighted attention method outperforms all other methods employing the same response model (linear or convolutional), consistent with our previous findings.

Footnote 1: Notation from pre-trained ResNet-50 model: https://keras.io/api/applications/resnet/

C.2 Representational similarity analysis

Representational similarity analysis (RSA) is a popular framework to compare representations of a computational model against cortical representations [356, 359]. It can be used to directly measure a computational model’s ability to explain the representational geometry in neuronal responses. Here, we wanted to assess the impact of attention modulation on a computational model’s alignment to brain responses for a wider range of model layers and architectures. Given stimuli from the held-out movie (699 frames) and the corresponding response (after hemodynamic lag), we implemented the following procedure for time-continuous RSA:

(i) We computed Pearson’s correlation distance (1-R) between the response vectors for every pair of test frames to obtain the representational dissimilarity matrix (RDM) of neural responses. The dissimilarity matrices are averaged across subjects to yield a population-averaged ‘neural’ RDM. The region of interest (ROI) mask for extracting response vectors to estimate neural RDMs was derived from all voxels in intermediate (V4), ventral visual stream and lateral occipital ROIs. Responses of all voxels were normalized using z-scores before computing the dissimilarity matrix.

(ii) We extracted model representations from intermediate layers of 3 pre-trained (ImageNet) architectures, namely ResNet-50 (res2, res3, res4, res5), VGG-16 (maxpool1, maxpool2, maxpool3, maxpool4, maxpool5) and AlexNet (conv1, conv2, conv3, conv4, conv5).
For each of these representations, we further computed attention-modulated representations using attention maps computed with each saliency prediction method as described above. For the Itti-Koch model, we used normalized saliency as the attention map. For all remaining saliency models, we used probabilistic density predictions as attention maps. All attention maps were resized to the spatial dimensions of the respective layer for this computation. Representational vectors were compared pair-wise in terms of their Pearson correlation distance (1-R) to obtain the ‘model’ RDM.

(iii) Finally, we compared the neural and model RDMs using a rank correlation measure (Kendall’s τA).

As shown in Figure C.2, prioritized selection of stimulus features based on saliency significantly improves the correlation of model RDMs with neural RDMs. This trend holds for most models and layers, suggesting that the benefits of attentional masking are not restricted to forward encoding models alone, but may be more universal. Further, we find that models that better explain stimulus-dependent human fixation patterns (such as Deepgaze-II or the learned attention model) are able to better account for the representational geometry of neural responses across higher visual object processing areas.

Figure C.1: Quantitative evaluation. Mean correlation values across the synchronous (i.e., stimulus-driven) cortex defined at a range of synchrony thresholds ([0.15, 0.75]). Each point thus reflects the mean prediction accuracy for a model across all voxels within synchronous cortex defined by a threshold value (x-axis).

C.3 Regions of interest (ROI)

We employed the HCP MMP parcellation for all ROI-level analysis. Dorsal and ventral visual stream ROIs as well as MT+ ROIs in Figure 3 (main text) were derived from the explicit stream segregation and categorization described in the HCP MMP parcellation [264] and are defined here in Table C.1 for quick reference.
Figure C.2: Representational similarity analysis (RSA). The y-axis measures the agreement between ‘model’ RDMs and ‘neural’ RDMs based on their rank correlation measure. The x-axis is used to index the layer (index 1 refers to the earliest layer of the architecture) and the saliency method used for attention masking of the features before pooling.

Table C.1: ROI categorization

Group              ROIs
Dorsal             V3A, V3B, V6, V6A, V7, IPS1
Ventral            V8, VVC, PIT, FFC, VMV1-3
MT+                MT, MST, V4t, FST
Lateral occipital  LO1, LO2, LO3

C.4 Center-weighted attention

Figure C.3 depicts the center-weighted saliency map used in all center-weighted attention models. We also report per-movie eye tracking statistics therein from all frames used for training or testing the models. We note that not all subjects had eye tracking measurements for every frame in the movies. Figure C.3B shows the number of subjects for which eye tracking data was available per movie (distribution across frames). This suggests that despite the missing data, most frames among all training and testing movies (MOVIE 4) had recorded gaze coordinate measurements from ∼110-130 subjects.

Figure C.3: A. Center-weighted saliency map and B. Eye tracking statistics

C.5 Voxel-wise prediction accuracy (R) of linear models

Figure C.4 depicts the prediction accuracy across the cortical surface for all methods employing linear response models that were considered in this study. As can be seen clearly, just as in methods with CNN response models, gaze-weighted attention significantly improves prediction accuracy across most higher order visual areas over models with no attention or center-weighted attention.

C.6 Estimating hemodynamic (BOLD) response delay

fMRI BOLD response delay was estimated using the baseline ‘No attention (Linear)’ encoding model due to its computational efficiency in comparison to encoding models employing convolutional response models.
The input to these models was the 2048-dimensional (average-pooled) representation of the stimuli, and the output was the evoked fMRI response across the synchronous cortex (i.e., voxels with synchrony > 0.15) at different lags (1-7 seconds) from the stimulus. Thus, the output is a 160900-D vector corresponding to the fMRI response. All models were trained with 5-fold cross-validation using the stimulus-response pairs from the training dataset only. Based on Figure C.5, we estimated a response delay of 4 seconds, as this lag consistently yielded the maximum prediction accuracy across 5-fold cross-validation. Thus, all encoding models described in the main text were trained to predict the fMRI response 4 seconds after stimulus presentation.

Figure C.4: Prediction accuracy across the cortical surface for all methods using linear response models. Statistical significance of individual voxel predictions is computed as the p-value of the obtained sample correlation coefficient for the null hypothesis of uncorrelatedness (i.e., true correlation coefficient is zero) under the assumptions of a bivariate normal distribution. Only significantly predicted voxels (p<0.05, FDR corrected) for each method are colored on the surface.

Figure C.5: Hemodynamic response delay. 5-fold cross-validated prediction accuracy (R) of the simple (‘No attention’) model on the training dataset. Error margins are computed from the standard deviation of prediction accuracy across the 5 folds.
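The delay estimation in C.6 amounts to pairing each stimulus second with the response a fixed number of seconds later, scoring each candidate lag by cross-validation, and keeping the lag with the highest mean accuracy. The sketch below illustrates only the pairing and the selection step. The function names are hypothetical, the per-fold accuracies would come from whichever encoding model is being evaluated, and the use of the sample standard deviation for the error margin is an assumption.

```python
import statistics

def align_with_lag(stimulus, response, lag):
    """Pair the stimulus at second t with the response at second t + lag."""
    return list(zip(stimulus[:len(stimulus) - lag], response[lag:]))

def select_lag(cv_scores):
    """cv_scores maps each candidate lag (seconds) to its per-fold accuracies.
    Returns the lag with the highest mean accuracy and a summary
    {lag: (mean, standard deviation)} for plotting error margins."""
    summary = {lag: (statistics.mean(s), statistics.stdev(s))
               for lag, s in cv_scores.items()}
    best = max(summary, key=lambda lag: summary[lag][0])
    return best, summary
```

With lags from 1 to 7 seconds, `cv_scores` would hold seven entries of five fold-wise accuracies each.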