MACHINE LEARNING IN RESTING-STATE AND
NATURALISTIC fMRI ANALYSIS
A Dissertation
Presented to the Faculty of the Graduate School
of Cornell University
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
by
Meenakshi Khosla
August 2021
© 2021 Meenakshi Khosla
ALL RIGHTS RESERVED
MACHINE LEARNING IN RESTING-STATE AND NATURALISTIC fMRI
ANALYSIS
Meenakshi Khosla, Ph.D.
Cornell University 2021
Two brain activity recording paradigms in humans have particularly emerged as
increasingly more popular tools for studying brain function in health and in disease,
namely resting-state and naturalistic stimulation. These two techniques attempt to
capture brain activity ‘in the wild’ when it is unconstrained by any specific task
and thus reflect more naturalistic modes of operation of the brain. The complexity,
very high-dimensional nature, a suite of potential applications and lack of standard,
straightforward analysis tools make machine learning very attractive for this kind
of data. In this thesis, we draw upon recent advances in machine learning, fueled
by the success of deep learning, to develop models that can capture the full richness
of this data.
Resting-state fMRI (rs-fMRI) has enormous potential to advance our under-
standing of the brain’s functional organization and how it is altered by damage or
disease. Over the last decade, substantial effort has been devoted to using rs-fMRI
for classification of a wide range of neuropsychiatric conditions, such as Alzheimer’s
disease, schizophrenia, autism spectrum disorder etc. While a growing number of
studies have demonstrated the promise of machine learning algorithms for rs-fMRI
based clinical or behavioral prediction, most prior models have been limited in their
capacity to exploit the richness of the data. The first part of this thesis describes
our work on developing novel machine learning approaches for deriving subject
level predictions from resting-state fMRI scans. We propose a novel volumetric
Convolutional Neural Network (CNN) framework that takes advantage of the
full-resolution 3D spatial structure of rs-fMRI data and fits non-linear predictive
models. We showcase our approach on a challenging large-scale dataset and report
state-of-the-art accuracy results on rs-fMRI-based discrimination of autism patients
and healthy controls.
The second part of this thesis is aimed at developing predictive models that can
capture information processing within the brain under naturalistic stimulation more
stringently than existing approaches. Brain activity recordings of healthy subjects
during “free viewing” of movies present a powerful opportunity to build ecologically
sound and generalizable models of sensory systems, also known as encoding models.
Deep neural networks trained on image or sound recognition tasks have emerged
as powerful models of computations underlying sensory processing in the brain,
surpassing traditional models of image or sound representation based on Gabor
filters and spectro-temporal filters, respectively. While this success is promising,
existing encoding models based on deep neural networks have been limited in
their focus on limited portions of the sensory space under naturalistic stimulation,
ignoring the complex and dynamic interactions of modalities (audio and vision) in
this inherently context-rich paradigm. In the second part of this thesis, we will
introduce our research with predictive models of cortical responses that aims to
capture several critical inductive biases about information processing in the brain:
namely, hierarchical processing, assimilation over longer timescales, attentional
modulation and multi-sensory auditory-visual interactions. We will describe our
efforts in capturing these phenomena in models of the brain and will share our
latest findings from this novel computational approach. Finally, we describe our
ongoing efforts to characterize neural response properties in the visual cortex
under ‘ecological’ conditions systematically in an entirely data-driven fashion using
computational models. Together, our findings illustrate how computational models
overcome the tradition of excessive reductionism in cognitive neuroimaging by
providing a general-purpose framework that abstracts away from the particulars of
the experimental approach and can be used to describe multiple experiments at
the same time.
BIOGRAPHICAL SKETCH
Meenakshi is interested in using computational models to understand how the
brain processes the natural world. She uses techniques from artificial intelligence
and functional imaging in her research to understand the nature of representa-
tions and computations in the brain. By modelling the computations underlying
sensory processing ‘in the wild’, a critical focus of her research is to understand
the function of different brain areas and how they collectively support complex
human behavior. She also has more general interests across machine learning
and neuroimaging, particularly in the use of predictive models to understand the
distinctive characteristics of the brains of people affected with different mental
disorders. Beyond satiating scientific curiosity, she hopes her research can be used
to develop novel diagnostics of neuropsychiatric diseases, and inform treatment
and design personalized therapeutics. Prior to completing her PhD at Cornell, she
obtained a B. Tech-M. Tech dual degree in Electrical Engineering at the Indian
Institute of Technology, Kanpur and worked as a postgraduate research associate
at the Yale School of Medicine.
iii
ACKNOWLEDGEMENTS
This thesis and the ideas it is based on could not have been possible without
the enthusiasm, kindness, support and insight of my advisor, Mert Sabuncu. I
am very thankful for the freedom that Mert gave me to explore the directions I
was interested in and for his belief in them. Mert has been a constant source of
encouragement and guidance through every step of graduate school, and for that
I’m extremely grateful.
I am very grateful for the opportunity to work with Amy Kuceyeski. Her
enthusiasm for working towards understanding the brain has been both contagious
and very inspiring. I would also like to thank Keith Jamison for great discussions
and tremendous help.
I am grateful for the wonderful company of my amazing labmates: Evan M.
Yu, Zhilu Zhang, Gia H. Ngo, Zijin Gu, Cagla Bahadir, Carmen Khoo, Sijia Gao,
Amaya Murguia, Tianyu Ma, Matthew Pool, Alan Wang and Victor Butoi. I am
particularly thankful to Evan and Zhilu for teaching me so many things (academic or
otherwise) and being my late-night lab companions and to Gia for always checking
in and being a helpful and caring friend and amazing collaborator. I’d like to thank
Scott Coldren for administrative help throughout graduate school. I’m also very
grateful to Leila Wehbe and Andreas Tolias for being wonderful mentors during my
PhD journey.
I would also like to thank these fantastic people outside the lab who made
Ithaca a home away from home: Drishti Wali, Arpita Sharma, Arzoo Katiyar,
Ashudeep Singh, Prateek Sehgal, Utkarsh Mall and Saksham Agarwal. I also thank
my oldest group of friends for being in my life and always encouraging me: Akansha
Srivastava, Sakshi Sinha, Silky Gupta, Manvi Gupta, Unnat Jain, Varsha Lalwani
and Mohit Sharma.
iv
Lastly, I’d like to thank my mother Deepika Khosla, my sister Pooja Khosla and
my partner Rishabh Gupta for their love and support through this long endeavor,
for keeping me sane and cheering me up when things got tough. I am extremely
grateful to my entire extended family, including my grandparents, uncles, aunts
and cousins whose love has always kept me going. Special thanks and love to the
newest member of our family, Cocoa, who filled the last months of my PhD with so
much joy.
I dedicate this thesis to my father Rajiv Khosla, who always encouraged me
to pursue my dreams and has been my biggest advocate and greatest source of
strength. Every ounce of confidence I have, I owe it to my father.
v
TABLE OF CONTENTS
Biographical Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
1 Introduction 1
2 Related Work 7
2.1 Machine learning in resting-state fMRI analysis . . . . . . . . . . . 7
2.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.2 Unsupervised learning methods . . . . . . . . . . . . . . . . 14
2.1.3 Applications of unsupervised learning in rs-fMRI . . . . . . 22
2.1.4 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . 41
2.1.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.1.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3 Linking resting-state brain activity and mental disorders with ma-
chine learning 71
3.1 Ensemble learning with 3D convolutional neural networks for func-
tional connectome-based prediction . . . . . . . . . . . . . . . . . . 71
3.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.1.2 Materials and Methods . . . . . . . . . . . . . . . . . . . . . 75
3.1.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.1.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.1.5 Limitations and future work . . . . . . . . . . . . . . . . . . 99
3.1.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
3.2 Detecting abnormalities in resting-state dynamics: An unsupervised
learning approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
3.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
3.2.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . 104
3.2.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
3.2.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
3.2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
4 Towards holistic encoding models for predicting fMRI responses
to multimodal naturalistic stimuli 114
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
4.2 Endowing neural encoding models with both audition and vision
and and stimulus history . . . . . . . . . . . . . . . . . . . . . . . . 116
4.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
4.2.2 Materials and Methods . . . . . . . . . . . . . . . . . . . . . 121
4.2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
vi
4.2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
4.3 Neural encoding with visual attention . . . . . . . . . . . . . . . . . 152
4.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
4.3.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
4.3.3 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
4.3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
4.3.5 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . 168
4.3.6 Broader Impact . . . . . . . . . . . . . . . . . . . . . . . . . 170
5 A shared encoding model for subject-specific response prediction172
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
5.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
5.2.1 Implementation details . . . . . . . . . . . . . . . . . . . . . 177
5.2.2 Data and Preprocessing . . . . . . . . . . . . . . . . . . . . 179
5.2.3 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
5.2.4 Performance evaluation . . . . . . . . . . . . . . . . . . . . . 180
5.2.5 Demonstration of application: personalized brain mapping . 181
5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
6 A computational strategy to richly characterize the human visual
cortex under naturalistic conditions 186
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
6.2 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . 190
6.2.1 Natural Scenes Dataset . . . . . . . . . . . . . . . . . . . . . 190
6.2.2 Response-optimized encoding model architecture . . . . . . . 191
6.2.3 Training and testing models . . . . . . . . . . . . . . . . . . 192
6.2.4 Comparison against retinotopic measurements from pRF-
localizer scan . . . . . . . . . . . . . . . . . . . . . . . . . . 194
6.2.5 Representational Similarity Analysis . . . . . . . . . . . . . 195
6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
6.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
7 Looking ahead 208
A Supplementary Information and Additional Results for Section
3.1 257
A.1 Atlas Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
A.2 Poisson Disk Sampling . . . . . . . . . . . . . . . . . . . . . . . . . 257
A.3 Linear Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
A.3.1 Ridge Classifier . . . . . . . . . . . . . . . . . . . . . . . . . 258
A.3.2 Support Vector Machines . . . . . . . . . . . . . . . . . . . . 259
A.4 Neural network hyperparameter settings . . . . . . . . . . . . . . . 260
A.5 ABIDE-I cross-validation results . . . . . . . . . . . . . . . . . . . . 262
vii
A.6 Saliency maps for individual parcellations . . . . . . . . . . . . . . . 262
A.7 Comparison of different preprocessing strategies . . . . . . . . . . . 263
B Supplementary Information and Additional Results for Section
4.2 271
B.1 HCP Movies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
B.2 Region of Interest (ROI) selection . . . . . . . . . . . . . . . . . . . 271
B.3 Estimating BOLD response delay . . . . . . . . . . . . . . . . . . . 273
B.4 Defining the stimulus-driven or “synchronous” cortex . . . . . . . . 274
B.5 Model architectures and implementation . . . . . . . . . . . . . . . 275
B.6 Regularized linear regression: deep convolutional features . . . . . . 277
B.7 Regularized linear regression: WordNet features . . . . . . . . . . . 280
B.8 Ablation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
B.9 Computing significance estimates . . . . . . . . . . . . . . . . . . . 283
B.10 Sensory-sensitivity index . . . . . . . . . . . . . . . . . . . . . . . . 284
B.11 Stimuli for synthetic contrasts . . . . . . . . . . . . . . . . . . . . . 284
B.12 Perturbation analysis with 20-sec models . . . . . . . . . . . . . . . 285
B.13 Performance improvement and autocorrelation decay . . . . . . . . 286
B.14 Surface visualization . . . . . . . . . . . . . . . . . . . . . . . . . . 289
B.15 Qualitative analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
B.16 Group-level prediction accuracy: held-out set . . . . . . . . . . . . . 290
B.17 Subject-level prediction accuracy: held-out set . . . . . . . . . . . . 293
B.18 Correcting with inter-group synchrony . . . . . . . . . . . . . . . . 295
B.19 Influence of motion . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
C Supplementary Information and Additional Results for Section
4.3 298
C.1 Model comparison across randomly selected layers . . . . . . . . . . 298
C.2 Representational similarity analysis . . . . . . . . . . . . . . . . . . 298
C.3 Regions of interest (ROI) . . . . . . . . . . . . . . . . . . . . . . . . 300
C.4 Center-weighted attention . . . . . . . . . . . . . . . . . . . . . . . 301
C.5 Voxel-wise prediction accuracy (R) of linear models . . . . . . . . . 302
C.6 Estimating hemodynamic (BOLD) response delay . . . . . . . . . . 302
viii
LIST OF TABLES
3.1 Composition of Cohorts . . . . . . . . . . . . . . . . . . . . . . . . 77
3.2 Classification accuracy for ASD vs. Control: Independent test on
ABIDE-II of baseline models and proposed CNN approach. For
each row, best results are bolded. For each column, best results are
italicized. Green indicates better performance, whereas orange/red
highlights worse performance. . . . . . . . . . . . . . . . . . . . . . 85
3.3 Root mean squared error (RMSE in years) for age prediction: In-
dependent test on ABIDE-II for benchmark models and proposed
CNN approach. For each row, best results are bolded. For each
column, best results are italicized. . . . . . . . . . . . . . . . . . . 86
3.4 Classification/regression performance of FCN with a high-resolution
parcellation ( ∼ 1024 ROIs) [216] . . . . . . . . . . . . . . . . . . . 95
3.5 Next frame prediction performance on healthy test subjects for
different models. *Interpolation model had access to the frame after
the predicted frame. . . . . . . . . . . . . . . . . . . . . . . . . . 110
3.6 Reconstruction performance of the proposed recurrent autoencoder
on healthy test subjects for different input sequence lengths. . . . 110
3.7 Area under the ROC curve for discriminating ASD vs Controls.
P-values of the unpaired t-test comparing means of the two clinical
groups are shown in brackets. . . . . . . . . . . . . . . . . . . . . 111
4.1 Evaluation against saliency prediction models. Mean and standard
errors for each metric are reported. Best results are bolded. . . . . 168
A.1 Summary descriptors of ROIs in individual atlases. All volumes are
in cm3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
A.2 Classification accuracy for ASD vs. Control: 10-fold cross-validation
on ABIDE-I for benchmark models and proposed CNN approach.
For each row, best results are bolded. For each column, best
results are italicized. Green indicates better performance, whereas
orange/red highlights worse performance. . . . . . . . . . . . . . . 261
A.3 Root mean squared error (RMSE in years) for age prediction: 10-fold
cross-validation on ABIDE-I for benchmark models and proposed
CNN approach. For each row, best results are bolded. For each col-
umn, best results are italicized. Green indicates better performance,
whereas orange/red highlights worse performance. . . . . . . . . . 261
A.4 Classification accuracy for ASD vs. Control: Independent results
on ABIDE-II for benchmark models and proposed CNN approach.
Green indicates better performance, whereas orange/red highlights
worse performance. . . . . . . . . . . . . . . . . . . . . . . . . . . 264
ix
A.5 Root mean squared error (RMSE in years) for age prediction: Inde-
pendent results on ABIDE-II for benchmark models and proposed
CNN approach. Green indicates better performance, whereas or-
ange/red highlights worse performance. . . . . . . . . . . . . . . . 268
A.6 Mean absolute error (MAE in years) for age prediction: Independent
testing on ABIDE-II for benchmark models and proposed CNN
approach. For each row, best results are bolded. For each column,
best results are italicized. Green indicates better performance,
whereas orange/red highlights worse performance. . . . . . . . . . 270
B.1 HCP dataset split . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
B.2 ROI categorization . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
C.1 ROI categorization . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
x
LIST OF FIGURES
2.1 Traditional seed based analysis approach . . . . . . . . . . . . . . . 12
2.2 Applications of machine learning methods in resting-state fMRI . 15
2.3 A taxonomy of unsupervised learning methods used for rs-fMRI
analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Illustrations of popular clustering algorithms: K-means clustering
partitions the data space into Voronoi cells, where each observation
is assigned to the cluster with the nearest centroid (marked red
in the figure). GMMs assume that each cluster is sampled from a
multivariate Gaussian distribution and estimates these probability
densities to generate probabilistic assignment of observations to
different clusters. Hierarchical (agglomerative) clustering generates
nested partitions, where partitions are merged iteratively based on a
linkage criteria. Graph-based clustering partitions the graph repre-
sentation of data so that, for example, number of edges connecting
distinct clusters are minimal. . . . . . . . . . . . . . . . . . . . . . 67
2.5 Schematic of application 2.1.3: In decomposition, the original fMRI
data is expressed as a linear combination of spatial patterns and their
associated time series - in ICA, the independence of spatial maps is
optimized whereas in sparse dictionary learning, the sparsity of maps
is encouraged. In clustering, time series or connectivity fingerprints
of voxels are clustered to assign voxels to distinct functional networks. 68
2.6 Schematic of application 2.1.3. Three connectivity states are as-
sumed in the data for illustration purposes . . . . . . . . . . . . . . 68
2.7 Schematic of application 2.1.3. Dimensionality reduction of high-
dimensional connectomes into 3 latent components is shown for
illustration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
2.8 A common classification/regression pipeline for connectomes . . . . 69
2.9 A summary of design choices for supervised learning with rs-fMRI . 69
2.10 A taxonomy of supervised learning methods used for rs-fMRI analysis 70
3.1 A general illustration of the proposed approach . . . . . . . . . . . 73
3.2 ROI masks for example SPs and atlas at each of the four spatial
scales considered in this study. . . . . . . . . . . . . . . . . . . . . 78
3.3 Proposed CNN approach. All operations are in 3D volume. 2D
correlation maps are shown for illustration only. For the age predic-
tion task, an additional Max-Pooling and Batch-Normalization[208]
operation followed the first and second convolutional layer. . . . . . 82
3.4 ASD-HC Classification: Receiver Operating Curves for independent
validation on ABIDE-2 . . . . . . . . . . . . . . . . . . . . . . . . 87
xi
3.5 Violin plots showing the spread of prediction accuracies/errors
for stochastic parcellations at multiple network scales for different
classification models. Mean accuracy/error of individual violins
is denoted by ’Mean SPs’. Performance of individual atlases is
compared with SPs with the closest # of ROIs and is denoted as
’Single Atlas’. Results are computed by training models on entire
ABIDE-1 cohort and testing on the independent ABIDE-2 cohort. . 89
3.6 Distribution of Ridge models’ performance for stochastic parcella-
tions created using the same gray-matter mask as the corresponding
atlas. Red denotes the atlas model’s accuracy and black indicates
the SP-Ensemble accuracy. . . . . . . . . . . . . . . . . . . . . . . 90
3.7 Mean saliency maps of trained 3D-CNN models for SP-Ensemble . 91
3.8 Motion correlations . . . . . . . . . . . . . . . . . . . . . . . . . . 97
3.9 Kernel density estimates of the probability distributions for the
performance difference between models, computed based on 10000
bootstrap samples from ABIDE-II. Values to the left of the black
vertical line indicate bootstrap samples where the proposed ap-
proach (3D CNN or SP-Ensemble) under-performed compared to
the competing method. . . . . . . . . . . . . . . . . . . . . . . . . 100
3.10 Next frame prediction model. Each cuboid represents a 3D (2 spatial
dimensions + time) feature map with number of features indicated
on top. Flat boxes represent 2D feature maps, with number of
channels on top. Input is an axial fMRI slice with T sequential
frames. Conv-LSTM cell returns the last output of the output
sequence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
3.11 Whisker plots showing reconstruction and prediction errors (mean
squared error) for ASD patients and controls, with proposed re-
current models trained on T=20 consecutive frames. Points are
individual subjects. The ends of the box are upper and lower
quartiles, the median is marked by a horizontal line inside the box. 111
3.12 Statistical significance of the difference in regional reconstruction
error of the recurrent autoencoder between controls and ASD pa-
tients. FDR with q = 0.05 was implemented for multiple testing
correction. − log10 p values are shown. . . . . . . . . . . . . . . . . 112
xii
4.1 Schematic of the proposed models. (A) The short-duration (1-sec)
auditory and visual models take a single image or spectrogram as
input, extract multi-scale hierarchical features and feed them into
a Convolutional Neural Network (CNN)-based response model to
predict the whole-brain response. (B) The long-duration (20-sec) uni-
modal models take a sequence of images or spectrograms as input,
feed their hierarchical features into a recurrent pathway and extract
the last hidden state representation for the response model. (C) The
short-duration multi-modal model combines uni-modal features and
passes them into the response model. (D) The long-duration multi-
modal model combines auditory and visual representations from
the recurrent pathways for whole-brain prediction. Architectural
details, including the feature extractor and convolutional response
model are provided in Supplementary Information. . . . . . . . . . 123
4.2 Regional predictive accuracy for the test movie. (A),(C)-(F) depict
quantitative evaluation metrics for all the proposed models across
major groups of regions as identified in the HCP MMP parcellation
(B). Predictive accuracy of all models is summarized across (A)
auditory, (C) visual, (D) multi-sensory, (E) language and (F) frontal
areas. Box plots depict quartiles and swarmplots depict mean
prediction accuracy of every ROI in the group. For language areas
(Group 4), left and right hemisphere ROIs are shown as separate
points in the swarmplot because of marked differences in prediction
accuracy. Statistical significance tests are performed to compare
1-sec and 20-sec models of the same modality (3 comparisons, results
indicated with horizontal bars below the box plots) or uni-modal
against multi-modal models of the same duration (4 comparisons,
results indicated with horizontal bars above the box plots) using
the paired t-test (p-value < 0.05, Bonferroni-corrected) on mean
prediction accuracy within ROIs of each group. . . . . . . . . . . . 128
4.3 Model prediction accuracy in standard brain space. Left panel
depicts the predictive accuracy of uni-modal (A,B) and multi-modal
(C) models over the whole brain in the test movie. Colors on the
brain surface indicate the Pearson correlation coefficient between
the predicted timeseries at each voxel and the true voxel’s timeseries
normalized by the noise ceiling (D) computed on repeated validation
clips. Only significantly predicted voxels (p-value < 0.05, FDR-
corrected) are colored. ROI box plots depict the un-normalized
correlation coefficients between the predicted and measured response
of voxels in each ROI and the respective noise ceiling for the mean.
(E) shows the percentage of voxels in stimulus-driven cortex that are
significantly predicted by each model and mean prediction accuracy
across the stimulus-driven cortex. . . . . . . . . . . . . . . . . . . 131
xiii
4.4 Influence of temporal history on encoding performance. (A) Mean
predictive performance of Audio-1sec and Audio-20sec models in
early auditory and association auditory cortex ROIs. A major boost
in encoding performance is seen across auditory association regions
with the 20-sec model. (B) Mean predictive performance of Visual-
1sec and Visual-20sec models across ROIs in the dorsal, ventral and
MT+ regions. Dorsal stream and MT+ ROIs exhibit a significant
improvement with Visual-20sec model but no effect is observed for
the ventral stream. Box plots are overlaid on top of the beeswarm
plot to depict quartiles. Horizontal bars indicate significant differ-
ences between models in the mean prediction accuracy within ROIs
of each stream using the paired t-test (p-value < 0.05). . . . . . . . 132
4.5 Sensitivity of ROIs to different sensory inputs. (A) Predictive
accuracy (R) of audiovisual encoding model with and without input
distortions, (B) Sensory sensitivity index of different brain regions
as determined using performance metrics under input distortion
(see Supplementary Information for details). Regions dominated by
a single modality are shown in darker colors, whereas light-colored
regions are better predicted by a combination of auditory and visual
information. Red indicates auditory-dominant regions whereas blue
indicates visual dominance. . . . . . . . . . . . . . . . . . . . . . . 134
4.6 Encoding models as virtual brain activity synthesizers. (A) Syn-
thetic contrasts are generated from trained encoding models by
contrasting their “synthesized” (i.e., predicted) response to different
stimulus types. (B) Comparison of the synthesized contrast for
‘speech’ against the speech association template on neurosynth, both
thresholded to keep the top 5%, 10% or 15% most activated vertices.
(C-D) compare the synthesized contrasts for ‘faces’ and ’places’
against the corresponding contrasts derived from HCP tfMRI ex-
periments, both thresholded to keep the top 5%, 10% or 15% most
activated vertices. Vertices activated in only synthetic or predicted
contrast maps are shown in red and blue colors respectively whereas
yellow indicates the overlap. Corresponding dice scores are displayed
alongside the surface maps. Distributions of dice overlap scores
between the synthetic map and all 86 HCP tfMRI contrast maps
are shown as histograms at each threshold level. Red arrow points
to the dice overlap between the synthetic contrast and HCP tfMRI
contrast for the same condition. In all cases, the synthetic contrast
exhibits the highest agreement with the tfMRI contrast that it was
generated to predict. . . . . . . . . . . . . . . . . . . . . . . . . . . 137
xiv
4.7 Proposed method. A trainable soft-attention module is imple-
mented on top of a pre-trained representation network to rescale
features based on their salience. The rescaled features are spatially
pooled and fed into a convolutional response model to predict whole-
brain neural response. We assess the value of the trained attention
network by comparing it with neural encoding methods employing
(i) stimulus-dependent attention maps derived from human fixations
(AG), (ii) stimulus-independent attention map derived from all fix-
ations in the training set that reflects the center-weighted bias of
our dataset (AC) as well as a (iii) no attention model that spatially
pools the features directly with no scaling. . . . . . . . . . . . . . . 156
4.8 Quantitative evaluation of all models. (A) depicts mean cor-
relation values across the synchronous, (i.e., stimulus-driven) cortex
defined at a range of synchrony thresholds ([0.15,0.75]). Each point
thus reflects the mean prediction accuracy for a model across all
voxels within synchronous cortex defined by a threshold value (x-
axis). (B) depicts the inter-group correlation (synchrony) values
across the entire human cerebral cortex. . . . . . . . . . . . . . . . 163
4.9 Top: ROI-level analysis Mean correlation values across interme-
diate (V4), higher visual areas in the inferotemporal cortex and its
neighborhood and other higher higher-level visual regions (Dorsal,
MT+) as described in the HCP MMP parcellation [264]. Error
bars represent 95% confidence intervals around mean estimates com-
puted using bootstrap sampling. (A)-(E) Prediction accuracy
across the cortical surface for all deep CNN-based models.
Statistical significance of individual voxel predictions is computed
as the p-value of the obtained sample correlation coefficient for the
null hypothesis of uncorrelatedness (i.e., true correlation coefficient
is zero) under the assumptions of a bivariate normal distribution.
Only significantly predicted voxels (p<0.05, FDR corrected) for
each method are colored on the surface. Prediction accuracy maps
for encoding methods with linear response models are provided in
the Appendix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
4.10 Qualitative assessment of saliency (log-density) maps. Top
row shows sampled frames from the test movie, middle row shows
human fixation maps overlaid on the corresponding frame, bottom
row shows saliency maps predicted by the attention network of the
proposed neural encoding model. Blue indicates high saliency values
whereas red indicates low saliency. . . . . . . . . . . . . . . . . . . 167
xv
5.1 Proposed approach: Feature pyramid networks are used to ex-
tract hierarchical features from pre-trained image/sound recognition
networks. Dense features are reshaped into coarse 3D feature maps,
which are mapped into increasingly fine-grained maps using con-
volutions. Coarse feature transformation layers are shared across
subjects while deeper convolutional layers close to predicted response
are subject-specific. . . . . . . . . . . . . . . . . . . . . . . . . . . 175
5.2 Quantitative evaluation: Bar charts illustrate subject-wise pre-
diction accuracy of all models, box plots depict the distribution
over subjects for % of synchronous voxels significantly predicted
(p<0.05, FDR corrected). N ×N correlation matrices depict the
(normalized) correlation coefficient between predicted and measured
responses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
5.3 (A), (B) Correlations between predicted response of the proposed
model and true time series of each voxel averaged across subjects.
Only significantly predicted voxels are shown (p<0.05, FDR cor-
rected). Dice matrices of predicted versus true contrasts for (C)
faces and (D) scenes stimuli. (E) & (F) depict contrasts of two
randomly selected subjects. ROIs are labelled from the HCP MMP
parcellation [264]. . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
6.1 Schematic and Quantitative Results. A shows the convolu-
tional neural network model with factorized readout and B depicts
the 5 visual areas considered in this study. Quantitative assessment
of different models is shown as a boxplot in C. D shows the count
of voxels that are better predicted by each model along with the
difference in prediction accuracies (R). E shows the raw prediction
accuracy, as estimated by the Pearson’s correlation coefficient (R),
across the cortical flat map for all 4 subjects. . . . . . . . . . . . . 193
6.2 Schematic of retinotopic parameters . . . . . . . . . . . . . . 194
6.3 Quantifying the agreement between the measured prf ec-
centricities and the prf eccentricities estimated from pre-
dictive computational models across different voxels. A
Subject and ROI-specific scatter plots depict predicted eccentricities
against measured eccentricities. Pearson’s correlation coefficient
between the two quantities is displayed in blue in each scatter plot.
B Predicted and measured eccentricities across all early visual ROI
voxels displayed on the cortical surface for each subject. . . . . . . 197
xvi
6.4 Quantifying the agreement between the measured prf polar
angles and the prf polar angles estimated from predictive
computational models across different voxels. A Subject
and ROI-specific scatter plots depict predicted polar angles against
measured polar angles. Pearson’s correlation coefficient between
the two quantities is displayed in blue in each scatter plot. B
Predicted and measured polar angles across all early visual ROI
voxels displayed on the cortical surface for each subject. . . . . . . 198
6.5 Quantifying the agreement between the measured prf sizes
and the prf sizes estimated from predictive computational
models across different voxels. A Subject and ROI-specific
scatter plots depict predicted sizes against measured sizes. Pearson’s
correlation coefficient between the two quantities is displayed in
blue in each scatter plot. B Predicted and measured prf sizes across
all early visual ROI voxels displayed on the cortical surface for each
subject. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
6.6 Quantifying the generalization ability across subjects.
(Left) Prediction performance and (Right) Agreement between es-
timated and measured retinotopic maps as a function of training
examples (stimulus-response pairs) from novel subjects. . . . . . . 201
6.7 Spatial generalization matrices. Predicted response for a voxel
in one ROI is correlated with the measured response of every other
voxel within the same ROI (both within and across participants) to
obtain a spatial generalization matrix for every ROI. Blue lines mark
subject boundaries. Strong diagonal structure indicates that the
predicted response for a voxel best matches the measured response
of the same voxel, indicating the ability of the models to capture
voxel-level idiosyncracies. . . . . . . . . . . . . . . . . . . . . . . . 202
6.8 Separability of category information across the ventral vi-
sual stream. A Matrices of all pairwise similarities between the
representational geometries in different visual ROIs. B Results
of the Representational Similarity Analysis (RSA) framework ap-
plied to several visual datasets (THINGS, CIFAR100 and CIFAR10)
containing different categories of objects. . . . . . . . . . . . . . . . 203
6.9 A low-dimensional space characterizes the functional orga-
nization of the ventral-occipital region. A Most and least
activating images for the first two PCs. C Total explained variance
as a function of the number of PCs. D Pearson’s correlation coef-
ficient between the domain-selectivity of individual voxels against
their projections onto the two PCs. E All images from the THINGS
dataset projected onto the first two principal dimensions of the
response. F and G Scatter plots depicting the domain-selectivity
against the corresponding PC projection for all VO voxels. . . . . . 205
xvii
A.1 Violin plots showing the spread of prediction accuracies/errors
for stochastic parcellations at multiple network scales for different
classification models. Mean accuracy/error of individual violins
is denoted by ’Mean SPs’. Performance of individual atlases is
compared with SPs with the closest # of ROIs and is denoted as
’Single Atlas’. Results are computed by 10-fold cross-validation on
the entire ABIDE-1 cohort. . . . . . . . . . . . . . . . . . . . . . . 265
A.2 Saliency maps of trained CNN models for 2 randomly chosen stochas-
tic parcellations at each scale for ASD-HC classification. . . . . . . 266
A.3 Saliency maps for atlas-based ASD-HC classification models. . . . . 267
A.4 ROC Curves for individual atlas based ASD-HC classification models.269
B.1 Group segregation from the HCP MMP parcellation. . . . . . . . . 272
B.2 ROI-based encoding performance for estimating delay. (A) depicts
the estimated mean and standard error of the prediction accuracy (R)
across various delays (1-7s) within the early auditory and association
auditory group (blue) as well as across all ROIs (red), as obtained
using the single epoch (1s) auditory model. (B) depicts the estimated
mean and standard error of the prediction accuracy (R) for various
delays (1-7s) within the primary and dorsal visual streams (blue) as
well as across all ROIs (red), as obtained using the single frame visual
model. Shaded regions depict the standard error in estimating mean
across ROIs within each group. ROI categorization is described in
the sub-section on ROI selection. . . . . . . . . . . . . . . . . . . . 273
B.3 Implementation details for the audio (top left) and visual (top right)
feature extraction networks as well as the convolutional response
model (bottom). All layers and blocks outside the yellow rectangle
(bottom-up pathway) are trained from scratch. The blocks inside the
yellow rectangular window are initialized with networks pre-trained
on image or sound recognition. Further, ResNet-50 is frozen during
the training of all encoding models, whereas VGG is fine-tuned.
The sequence of operations within each block are defined from top
to bottom, while the number of repetitions for each sequence within
the block are indicated with the multiplicative symbol on the right. 278
xviii
B.4 Performance of linear response models and baselines. (A) shows
the region-averaged prediction accuracy of linear response models
using deep convolutional features. (B) shows results of the ablation
study and highlights the importance of different components of
the proposed model architecture. (C) shows the region-averaged
prediction accuracy of linear response models using semantically rich
WordNet features and (D) shows the cortical map of the prediction
accuracy (R) for the best WordNet model. The x-axis in (A) and
(C) depicts the length of the windows (in seconds) over which
the stimulus features are concatenated and y-axis shows the mean
Pearson correlation coefficient between the predicted and measured
responses across the stimulus-driven voxels. . . . . . . . . . . . . . 281
B.5 Perturbation analysis with Audio-20sec (A) and Visual-20sec (B)
models. ROI box plots depict the un-normalized correlation coef-
ficients between the predicted and measured response of voxels in
each ROI using original or distorted 20-sec input clips at inference
time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
B.6 Performance boost of the 20-sec model over 1-sec model is higher
in voxels with longer autocorrelation decay times. (A) & (B) depict
the performance improvement (∆R) against decay time constants
for voxels associated with auditory and visual regions, respectively
(Table B.2). The r value indicates the Pearson correlation coefficient
between the two quantities. Each dot in the scatterplot represents
an individual voxel. Bivariate kernel density estimates are overlaid
on top of the scatterplot as contours to depict the probability
distribution of observations. . . . . . . . . . . . . . . . . . . . . . . 288
B.7 Predicted and measured response time-series of the ‘median’ predic-
tive accuracy (R) voxel across ROIs of different functional groups.
Vertical dashed lines mark the boundary of clip segments in the
held-out movie. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
xix
B.8 Model performance on held-out group of subjects. (A) Pearson
correlation coefficient (R) between the model predictions and group-
averaged response of an independent subject group comprising
20 subjects, on the held-out test movie, normalized by the voxel-
specific noise ceiling. (B) Predictivity against the noise ceiling
for all voxels with high “synchrony” across training movies (>0.5)
(see Supplementary Information for details). This gives a total of
52,954 highly “synchronous” voxels that are colored based on their
association with auditory and visual groups. This hue assignment of
each voxel was derived from the coloration of the corresponding ROI
in the multi-modal HCP parcellation. Each dot in the scatterplot
represents an individual voxel. Bivariate kernel density estimates
are overlaid on top of the scatterplot as contours to depict the
probability distribution of observations (prediction accuracy/noise
ceiling pair at every voxel). . . . . . . . . . . . . . . . . . . . . . . 292
B.9 Quantitative evaluation metrics for all the proposed models on
the independent held-out population comprising 20 novel subjects.
(A),(C)-(F) depict prediction accuracy (R) for all the proposed
models across major groups of regions as identified in the HCP MMP
parcellation (B). Predictive accuracy of all models is summarized
across (A) auditory, (C) visual, (D) multi-sensory, (E) language
and (F) frontal areas. Box plots depict quartiles and swarmplots
depict mean prediction accuracy of every ROI in the group. For
language areas (Group 4), left and right hemisphere ROIs are shown
as separate points in the swarmplot because of marked differences
in the prediction accuracy. Statistical significance tests (results
indicated with horizontal bars) are performed to compare 1-sec and
20-sec models of the same modality (3 comparisons) or uni-modal
against multi-modal models of the same duration (4 comparisons)
using paired t-test (p-value < 0.05, Bonferroni-corrected) on mean
prediction accuracy within ROIs of each group. . . . . . . . . . . . 293
B.10 Comparison of voxel-level prediction accuracies (R) against subject-
specific noise ceiling for 5 representative subjects from the held-out
set. The subjects were chosen such that their mean prediction
accuracy (un-normalized) within the stimulus-driven cortex lied in
the ith percentile with i ∈ {0.01, 25, 50, 75, 99.9}. Surface maps
with white background in (A)-(E) depict raw correlation coefficients
between model (Audiovisual-20sec) predictions and subject-specific
response on the held-out movie whereas maps on gray background
indicate the respective subject-specific noise ceiling. Only signifi-
cantly correlated voxels (p<0.05, FDR corrected) are colored on the
surface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
B.11 Synchrony-normalized prediction accuracy (R) of the Audiovisual-
20sec model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
xx
B.12 Addressing the influence of motion on measured and predicted
responses. (A) and (B) depict the distribution of the Pearson
correlation coefficient of FD with the predicted responses of the
Audiovisual-20sec model and measured responses across the whole
brain respectively. Surface maps in (C) depict the raw correlation
coefficients between FD and the measured responses. Only statisti-
cally significant voxels (p< 0.05, FDR corrected) are colored on the
surface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
C.1 Quantitative evaluation. Mean correlation values across the
synchronous, (i.e., stimulus-driven) cortex defined at a range of
synchrony thresholds ([0.15,0.75]). Each point thus reflects the mean
prediction accuracy for a model across all voxels within synchronous
cortex defined by a threshold value (x-axis). . . . . . . . . . . . . . 300
C.2 Representational similarity analysis(RSA). y-axis measures
the agreement between ‘model’ RDMs and ‘neural’ RDMs based
on their rank correlation measure. x-axis is use to index the layer
(index 1 refers to the earliest layer of the architecture) and the
saliency method used for attention masking of the features before
pooling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
C.3 A. Center-weighted saliency map and B. Eye tracking statis-
tics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
C.4 Prediction accuracy across the cortical surface for all meth-
ods using linear response models. Statistical significance of in-
dividual voxel predictions is computed as the p-value of the obtained
sample correlation coefficient for the null hypothesis of uncorrelated-
ness (i.e., true correlation coefficient is zero) under the assumptions
of a bivariate normal distribution. Only significantly predicted
voxels (p<0.05, FDR corrected) for each method are colored on the
surface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
C.5 Hemodynamic response delay. 5-fold cross-validated prediction
accuracy (R) of the simple (‘No attention’) model on the training
dataset. Error margins are computed from the standard deviation
of prediction accuracy across the 5 folds. . . . . . . . . . . . . . . . 304
xxi
CHAPTER 1
INTRODUCTION
Modern neurotechnologies enable us to measure human brain activity in unprece-
dentedly rich ways. As we enter this new data revolution in neuroscience with
large-scale compilation of neural data and dissemination through open-source ini-
tiatives, we also need concomitant methodological advances to efficiently derive
conceptual understanding from this rich, high-dimensional data. This thesis is
a methodological effort in this direction, accompanied by novel neuroscientific
findings.
fMRI is a widely used non-invasive neuroimaging tool for measuring changes
in local oxygenation of blood, also known as the BOLD signal, which is a proxy
for neural activity. While the blurred spatio-temporal activity measured by fMRI
may not be a direct window into neural coding like single-cell recordings in other
animal models, this non-invasive imaging modality nonetheless serves as a very
powerful and sensitive technique to study functional organization and feature
representations in the human brain. Two brain activity recording paradigms in
humans have particularly emerged as increasingly more popular tools for studying
brain function in health and in disease, namely resting-state and naturalistic
stimulation. These two techniques attempt to capture brain activity ‘in the wild’
when it is unconstrained by any specific task and thus reflect more naturalistic
modes of operation of the brain. Resting-state activity refers to brain activation
that arises in the absence of any task. It is measured in awake subjects when the
only instructions they are given is to close their eyes and do nothing in particular.
Naturalistic stimulation refers to brain activity that is recorded in more natural,
complex settings to learn about information processing in the brain under ecological
1
conditions. An example would be volunteers viewing movies or listening to stories
while their brain is being imaged. The complexity, very high-dimensional nature, a
suite of potential applications and lack of standard, straightforward analysis tools
make machine learning very attractive for this kind of data. In this thesis, we draw
upon recent advances in machine learning, fueled by the success of deep learning,
to develop models that can capture the full richness of this data.
Resting-state fMRI (rs-fMRI) has enormous potential to advance our under-
standing of the brain’s functional organization and how it is altered by damage or
disease. A major emphasis in the field is on the analysis of resting-state functional
connectivity (RSFC) that measures statistical dependence in BOLD fluctuations
among spatially distributed brain regions. Recently, there has been a surge of
studies harnessing resting-state functional connectivity for a wide range of super-
vised prediction tasks. For instance, the application of machine learning methods
to rs-fMRI data has shown great promise in investigations of the developing con-
nectome, as well as in predicting individual differences in cognition and behavior.
Over the last decade, substantial effort has been devoted to using rs-fMRI for
classification of a wide range of neuropsychiatric conditions, such as Alzheimer’s
disease, schizophrenia, autism spectrum disorder etc. Predictive approaches can
also be used to address research questions of interest in neuroscience. For example,
to what extent is resting-state functional connectivity heritable or how does it
vary across different vigilance states? Given this broad range of applications and
clinical potential, there is a need to develop better methods for making veridical
predictions from such data. The first part of this thesis describes our work on
developing novel machine learning approaches for deriving subject level predictions
from neuroimaging data, in particular, resting-state fMRI scans.
2
Computational neuroscience has also witnessed a minor revolution of its own,
driven largely by advances in deep learning; We now have computational mod-
els that can explain high-level sensory representations. Comparison of different
representational models is now beginning to reveal the neuronal code in parts
of the cortex that had remained elusive so far, providing a new test bench for
computational theories of the mind. In the past, sensory systems have been studied
extensively using task-based paradigms where the brain activity is recorded upon
stimulation with hand-crafted stimuli. In the domain of vision, this paradigm has
been very successful, for example, in identifying orientation sensitivity of neurons
in early visual cortex or in discovering scene-selective or face-selective regions in
higher-order visual regions within the brain. While successful for testing specific
hypotheses, this reductionist approach is limited in the sense that no single task-
based experiment can help in developing broad theories of sensory processing that
generalize outside the experimental circumstance they were based on.
Predictive models, on the other hand, are based on out-of-sample prediction and
they generalize to arbitrary new stimuli and can thus offer more holistic descriptions
of sensory processing. The biggest advantage is that once we have such a general
model, we can use it to formulate novel hypotheses about information processing
in the brain that can then be tested under more rigorous and controlled conditions.
Embedded knowledge within these models of the brain could also be harnessed
in other applications, such as independent neural population control by optimally
synthesizing stimuli to elicit a desired neural activation pattern.
Further, predictive models can also be useful in hypothesis testing. In this case,
encoding models encapsulating competing hypotheses about neural information
processing can be pitted against each other and their empirical plausibility can be
3
directly examined by comparing their predictions on held-out data against corre-
sponding measurements. Such an approach can shed new light on how information
is represented in different parts of the brain.
Given the usefulness of predictive models in hypothesis formation, in non-invasive
brain-machine interfaces, as well as in answering important research questions
relating to feature representations in the brain, the second part of my thesis is
aimed at developing predictive models that can capture information processing
within the brain more stringently than existing approaches. We pursue this goal by
developing deep neural network-based encoding models that capture three critical
inductive biases about information processing in the brain, namely, hierarchical
processing, assimilation over longer timescales and multi-sensory auditory-visual
interactions. By developing and evaluating these models on a large-scale movie-
watching dataset, we demonstrated how incorporating this joint information leads
to remarkable prediction performance across large areas of the cortex, well beyond
primary sensory regions into higher-order regions that had not been characterized
by predictive models previously. Taken together, our findings underscore how
encoding models can shed new light on the functional architecture of the human
brain and provide a basis for novel hypotheses about sensory processing.
This thesis is therefore largely a methodological effort that is accompanied
by novel results. A central theme of our research is to use machine learning or
predictive modelling techniques to convert neural data into understanding and
fundamental knowledge about the brain. This research spans multiple disciplines:
1. Medical Image Analysis: We propose a novel volumetric Convolutional Neu-
ral Network (CNN) framework that takes advantage of the full-resolution 3D spatial
structure of rs-fMRI data and fits non-linear predictive models. We showcase our
4
approach on a challenging large-scale dataset (ABIDE, with N >2,000) and report
state-of-the-art accuracy results on rs-fMRI based discrimination of autism patients
and healthy controls.
2. Computational Neuroscience: We use naturalistic stimuli and employ state-
of-the-art deep learning models on a large-scale dataset to show that the shared
response across subjects is largely predictable across the cortex. Furthermore, we
demonstrate how these models allow us to interrogate the temporal and sensory
sensitivity of different brain regions in an entirely data-driven manner from ecologi-
cally valid fMRI paradigms, underscoring the potential of neural encoding models
as a powerful tool for studying brain function in ecologically valid conditions.
3. Computer Vision: We employ a data-driven strategy to further improve DNNs
as models of the human visual cortex, and demonstrate how findings about the
brain can potentially improve computer vision models, instantiating a synergy
between computational neuroscience and AI.
Outline
This thesis is organized as follows. In chapter 2, we present the relevant literature
on machine learning in resting-state1 and naturalistic fMRI analysis. In chapter
3, we describe our novel methodological approaches for deriving individual-level
clinical or behavioral measures from rs-fMRI that take advantage of the full-
1Khosla, M., Jamison, K., Ngo, G. H., Kuceyeski, A., & Sabuncu, M. R. (2019). Machine
learning in resting-state fMRI analysis. Magnetic resonance imaging, 64, 101-121.
5
resolution 3D spatial structure of rs-fMRI data2,3,4. Chapter 4 describes our work
on predictive models of cortical responses to naturalistic stimuli that capture three
critical influences on neural computations, namely that of multi-sensory interactions,
stimulus history and visual attention5,6. Chapter 5 presents our methodological
contribution towards using multi-subject fMRI data for improving subject-specific
responses predictions7. Chapter 6 describes our ongoing investigations into data-
driven models of the human visual cortex, where we go beyond response prediction
and provide a systematic framework for inferring neuronal tuning properties from
these computational models. Finally, we conclude this thesis by summarizing our
core contributions and giving an outlook to future experiments.
2Khosla, M., Jamison, K., Kuceyeski, A., & Sabuncu, M. R. (2018). 3D convolutional neural
networks for classification of functional connectomes. In Deep Learning in Medical Image Analysis
and Multimodal Learning for Clinical Decision Support (pp. 137-145). Springer, Cham.
3Khosla, M., Jamison, K., Kuceyeski, A., & Sabuncu, M. R. (2019). Ensemble learning with
3D convolutional neural networks for functional connectome-based prediction. NeuroImage, 199,
651-662.
4Khosla, M., Jamison, K., Kuceyeski, A., & Sabuncu, M. R. (2019, October). Detecting
abnormalities in resting-state dynamics: An unsupervised learning approach. In International
Workshop on Machine Learning in Medical Imaging (pp. 301-309). Springer, Cham.
5Khosla, M., Ngo, G. H., Jamison, K., Kuceyeski, A., & Sabuncu, M. R. (2021). Cortical
response to naturalistic stimuli is largely predictable with deep neural networks. Science Advances,
7(22), eabe7547.
6Khosla, M., Ngo, G., Jamison, K., Kuceyeski, A., & Sabuncu, M. (2020). Neural encoding
with visual attention. Advances in Neural Information Processing Systems, 33.
7Khosla, M., Ngo, G. H., Jamison, K., Kuceyeski, A., Sabuncu, M. R. (2020, October). A
shared neural encoding model for the prediction of subject-specific fMRI response. In International
Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 539-548).
Springer, Cham.
6
CHAPTER 2
RELATED WORK
2.1 Machine learning in resting-state fMRI analysis
Machine learning techniques have gained prominence for the analysis of resting-
state functional Magnetic Resonance Imaging (rs-fMRI) data. Here, we present
an overview of various unsupervised and supervised machine learning applications
to rs-fMRI. We offer a methodical taxonomy of machine learning methods in
resting-state fMRI. We identify three major divisions of unsupervised learning
methods with regard to their applications to rs-fMRI, based on whether they
discover principal modes of variation across space, time or population. Next, we
survey the algorithms and rs-fMRI feature representations that have driven the
success of supervised subject-level predictions. The goal is to provide a high-level
overview of the burgeoning field of rs-fMRI from the perspective of machine learning
applications.
2.1.1 Introduction
Resting-state fMRI (rs-fMRI) is a widely used neuroimaging tool that measures
spontaneous fluctuations in neural blood oxygen-level dependent (BOLD) signal
across the whole brain, in the absence of any controlled experimental paradigm.
In their seminal work, Biswal et al. [1] demonstrated temporal coherence of low-
frequency spontaneous fluctuations between long-range functionally related regions
of the primary sensory motor cortices even in the absence of an explicit task,
suggesting a neurological significance of resting-state activity. Several subsequent
7
studies similarly reported other collections of regions co-activated by a task (such
as language, motor, attention, audio or visual processing etc.) that show correlated
fluctuations at rest [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]. These spontaneously co-fluctuating
regions came to be known as the resting state networks (RSNs) or intrinsic brain
networks. The term RSN henceforth denotes brain networks subserving shared
functionality as discovered using rs-fMRI.
Rs-fMRI has enormous potential to advance our understanding of the brain’s
functional organization and how it is altered by damage or disease. A major
emphasis in the field is on the analysis of resting-state functional connectivity
(RSFC) that measures statistical dependence in BOLD fluctuations among spatially
distributed brain regions. Disruptions in RSFC have been identified in several
neurological and psychiatric disorders, such as Alzheimer’s [12, 13, 14], autism [15,
16, 17], depression [18, 19, 20], schizophrenia [21, 22], etc. Dynamics of RSFC have
also garnered considerable attention in the last few years, and a crucial challenge
in rs-fMRI is the development of appropriate tools to capture the full extent of
this RS activity. rs-fMRI captures a rich repertoire of intrinsic mental states or
spontaneous thoughts and, given the necessary tools, has the potential to generate
novel neuroscientific insights about the nature of brain disorders [23, 24, 25, 26, 27,
28].
The study of rs-fMRI data is highly interdisciplinary, majorly influenced by fields
such as machine learning, signal processing and graph theory. Machine learning
methods provide a rich characterization of rs-fMRI, often in a data-driven manner.
Unsupervised learning methods in rs-fMRI are focused primarily on understanding
the functional organization of the healthy brain and its dynamics. For instance,
methods such as matrix decomposition or clustering can simultaneously expose
8
multiple functional networks within the brain and also reveal the latent structure
of dynamic functional connectivity.
Supervised learning techniques, on the other hand, can harness RSFC to make
individual-level predictions. Substantial effort has been devoted to using rs-fMRI
for classification of patients versus controls, or to predict disease prognosis and
guide treatments. Another class of studies explores the extent to which individual
differences in cognitive traits may be predicted by differences in RSFC, yielding
promising results. Predictive approaches can also be used to address research
questions of interest in neuroscience. For example, is RSFC heritable? Such
questions can be formulated within a prediction framework to test novel hypotheses.
From mapping functional networks to making individual-level predictions, the
applications of machine learning in rs-fMRI are far-reaching. The goal of this
review is to present in a concise manner the role machine learning has played in
generating pioneering insights from rs-fMRI data, and describe the evolution of
machine learning applications in rs-fMRI. We will present a review of the key ideas
and application areas for machine learning in rs-fMRI rather than delving into the
precise technical nuances of the machine learning algorithms themselves. In light of
the recent developments and burgeoning potential of the field, we discuss current
challenges and promising directions for future work.
Resting-state fMRI: A Historical Perspective
Until the 2000s, task-fMRI was the predominant neuroimaging tool to explore the
functions of different brain regions and how they coordinate to create diverse mental
representations of cognitive functions. The discovery of correlated spontaneous
fluctuations within known cortical networks by Biswal et al. [1] and a plethora of
9
follow-up studies have established rs-fMRI as a useful tool to explore the brain’s
functional architecture. Studies adopting the resting-state paradigm have grown
at an unprecedented scale over the last decade. These are much simpler protocols
than alternate task-based experiments, capable of providing critical insights into
functional connectivity of the healthy brain as well as its disruptions in disease.
Resting-state is also attractive as it allows multi-site collaborations, unlike task-
fMRI that is prone to confounds induced by local experimental settings. This has
enabled network analysis at an unparalleled scale.
Traditionally, rs-fMRI studies have focused on identifying spatially-distinct yet
functionally associated brain regions through seed-based analysis (SBA). In this
approach, seed voxels or regions of interest are selected a priori and the time series
from each seed is correlated with the time series from all brain voxels to generate a
series of correlation maps. SBA, while simple and easily interpretable, is limited
since it is heavily dictated by manual seed selection and, in its simplest form, can
only reveal one specific functional system at a time.
Decomposition methods like Independent Component Analysis (ICA) emerged
as a highly promising alternative to seed-based correlation analysis in the early
2000s [2, 29, 30]. This was followed by other unsupervised learning techniques such
as clustering. In contrast to seed-based methods that explore networks associated
with a seed voxel (such as motor or visual functional connectivity maps), these
new class of model-free methods based on decomposition or clustering explored
RSNs simultaneously across the whole brain for individual or group-level analysis.
Regardless of the analysis tool, all studies largely converged in reporting multiple
robust resting-state networks across the brain, such as the primary sensorimotor
network, the primary visual network, fronto-parietal attention networks and the
10
well-studied default mode network. Regions in the default mode network, such
as the posterior cingulate cortex, precuneus, ventral and dorsal medial prefrontal
cortex, show increased levels of activity during resting-state suggesting that this
network represents the baseline or default functioning of the human brain. The
default mode network has sparked a lot of interest in the rs-fMRI community [31],
and several studies have consequently explored disruptions in DMN resting-state
connectivity in various neurological and psychiatric disorders, including autism,
schizophrenia and Alzheimer’s. [32, 33, 34]
Despite the widespread success and popularity of rs-fMRI, the causal origins
of ongoing spontaneous fluctuations in the resting brain remain largely unknown.
Several studies explored whether resting-state coherent fluctuations have a neuronal
origin, or are just manifestations of aliasing or physiological artifacts introduced
by the cardiac or respiratory cycle. Over time, evidence in support for a neuronal
basis of BOLD-based resting state functional connectivity has accumulated from
multiple complementary sources. This includes (a) observed reproducibility of
RSFC patterns across independent subject cohorts [4, 5], (b) its persistence in the
absence of aliasing and distinct separability from noise components [5, 35], (c) its
similarity to known functional networks [1, 2, 11] and (d) consistency with anatomy
[36, 37], (e) its correlation with cortical activity studied using other modalities [38,
39, 40] and finally, (f) its systematic alterations in disease [23, 24, 25].
Application of Machine Learning in rs-fMRI
A vast majority of literature on machine learning for rs-fMRI is devoted to unsu-
pervised learning approaches. Unlike task-driven studies, modelling resting-state
activity is not straightforward since there is no controlled stimuli driving these
11
Figure 2.1: Traditional seed based analysis approach
fluctuations. Hence, analysis methods used for characterizing the spatio-temporal
patterns observed in task-based fMRI are generally not suited for rs-fMRI. Given
the high dimensional nature of fMRI data, it is unsurprising that early analytic
approaches focused on decomposition or clustering techniques to gain a better
characterization of data in spatial and temporal domains. Unsupervised learning
approaches like ICA catalyzed the discovery of the so-called resting-state networks
or RSNs. Subsequently, the field of resting-state brain mapping expanded with
the primary goal of creating brain parcellations, i.e., optimal groupings of vox-
els (or vertices in the case of surface representation) that describe functionally
coherent spatial compartments within the brain. These parcellations aid in the
understanding of human functional organization by providing a reference map of
areas for exploring the brain’s connectivity and function. Additionally, they serve
as a popular data reduction technique for statistical analysis or supervised machine
learning.
More recently, departing from the stationary representation of brain networks,
studies have shown that RSFC exhibits meaningful variations during the course
of a typical rs-fMRI scan [41, 42]. Since brain activity during resting-state is
largely uncontrolled, this makes network dynamics even more interesting. Using
12
unsupervised pattern discovery methods, resting-state patterns have been shown to
transition between discrete recurring functional connectivity ”states”, representing
diverse mental processes [42, 43, 44]. In the simplest and most common scenario,
dynamic functional connectivity is expressed using sliding-window correlations. In
this approach, functional connectivity is estimated in a temporal window of fixed
length, which is subsequently shifted by different time steps to yield a sequence
of correlation matrices. Recurring correlation patterns can then be identified
from this sequence through decomposition or clustering. This dynamic nature
of functional connectivity opens new avenues for understanding the flexibility of
different connections within the brain as they relate to behavioral dynamics, with
potential clinical utility [45].
Another, perhaps clinically more promising application of machine learning in rs-
fMRI expanded in the late 2000s. This new class of applications leveraged supervised
machine learning for individual level predictions. The covariance structure of resting-
state activity, more popularly known as the ”connectome”, has garnered significant
interest in the field of neuroscience as a sensitive biomarker of disease. Studies
have further shown that an individual’s connectome is unique and reliable, akin
to a fingerprint [46]. Machine learning can exploit these neuroimaging based
biomarkers to build diagnostic or prognostic tools. Visualization and interpretation
of these models can complement statistical analysis to provide novel insights into
the dysfunction of resting-state patterns in brain disorders. Given the prominence
of deep learning in today’s era, several novel neural-network based approaches have
also emerged for the analysis of rs-fMRI data. A majority of these approaches
target connectomic feature extraction for single-subject level predictions.
In order to organize the work in this rapidly growing field, we sub-divide the
13
machine learning approaches into different classes by methods and application focus.
We first differentiate among unsupervised learning approaches based on whether
their main focus is to discover (a) the underlying spatial organization that is reflected
in coherent fluctuations, (b) the structure in temporal dynamics of resting-state
connectivity, or (c) population-level structure for inter-subject comparisons. Next,
we move on to discuss supervised learning. We organize this section by discussing
the relevant rs-fMRI features employed in these models, followed by commonly
used training algorithms, and finally the various application areas where rs-fMRI
has shown promise in performing predictions.
2.1.2 Unsupervised learning methods
The primary objective of unsupervised learning is to discover latent representations
and disentangle the explanatory factors for variation in rich, unlabelled data. These
learning methods do not receive any kind of supervision in the form of target
outputs (or labels) to guide the learning process. Instead, they focus on learning
structure in the data in order to extract relevant signal from noise. Below, we
review some important unsupervised learning methods that have advanced rs-fMRI
analysis.
Clustering
Given data points {X1, .., Xn}, the goal of clustering is to partition the data into K
disjoint groups {C1, .., CK}. Different clustering algorithms differ in terms of their
clustering objective, which is to maximize some notion of within-cluster similarity
and/or between-cluster dissimilarity.
14
Figure 2.2: Applications of machine learning methods in resting-state fMRI
Figure 2.3: A taxonomy of unsupervised learning methods used for rs-fMRI analysis
15
K-means K-means clustering is thus far the most popular learning algorithm for
partitioning data. The algorithm aims at minimizing the within-cluster variance.
Formally, this corresponds to the follo∥wing clustering ∥objective,∑K ∑ ∥∥∥∥ 1 ∑ ∥∥∥∥
2
min Xi − Xt
j=1 i∈C ∥ njj t∈C ∥j
where nj denotes the cardinality of set Cj . This optimization problem is solved
using an iterative algorithm, known as the Lloyd’s algorithm. The algorithm
begins with initial estimates of cluster centroids and iteratively refines them by
(a) assigning each datum to its closest cluster, and (b) updating cluster centroids
based on these new assignments.
Gaussian mixture models Mixture models are often used to represent prob-
ability densities of complex multimodal data with hidden components. These
models are constructed as mixtures of arbitrary unimodal distributions, each rep-
resenting a distinct cluster. In the case of Gaussian mixture models, each Xi is
assumed to be generated by a two-step process: (a) First, a latent component
zi ∈ {1, .., K} is sampled, zi ∼Multinomial(φ) where φk = P (zi = k); then (b) a
random sample is drawn from one of K multivariate gaussians conditional on zi,
i.e. Xi|zi = k ∼ N (µk,Σk) where µk and Σk denote the mean and covariance of
the k-th gaussian respectively. Each gaussian distribution thus denotes a unique
cluster. The model parameters {φ, µ,Σ} are obtained by maximizing the complete
data likelihood,
∑n
{φopt, µopt,Σopt} = arg max logP (Xi|φ, µ,Σ)
φ,µ,Σ ∑i=1n
= arg max logP (Xi|zi, µ,Σ)P (zi|φ)
φ,µ,Σ i=1
16
Maximum likelihood estimates of GMMs are usually obtained using the
Expectation-Maximization (EM) algorithm.
Hierarchical clustering Hierarchical clustering methods group the data into a
set of nested partitions. This multi-resolution structure is often represented with a
cluster tree, or dendrogram. Hierarchical clustering is divided into agglomerative
or divisive methods, based on whether the clusters are identified in a bottom-up
or top-down fashion respectively. Hierachical agglmomerative clustering (HAC),
the more dominant approach, initially treats each data point as a singleton cluster
and then successively merges them according to pre-specific distance metric until
a single cluster containing all observations is formed. Many distance metrics,
referred to as linkage criterion, have been proposed in literature that optimize
different goals of hierarchical clustering. These include: (a) single-link, where
distance between clusters C1 and C2 is defined as the distance between their
closest points, i.e., d(C1, C2) = min d(x , x ), (b) Complete linkage, where this
xi∈C1,xj∈
i j
C2
distance is measured between the farthest points, C(d1, d2) = max d(xi, xj), (c)
xi∈C1,xj∈C2
Average∑linkag1 ∑
e which measures the average distance between members d(C1, C2) =
| || | d(xi, xj) etc. Here, d represents dissimilarity between observations.C1 C2 xi∈C1 xj∈C2
Alternate methods for merging have also been proposed, the most popular being
Ward’s criterion. Ward’s method measures how much the within-cluster variance
will increase when merging two partitions and minimizes this merging cost. A major
drawback is computational complexity, which render HAC methods impractical in
applications with large observational data.
Graph-based clustering Graph based clustering forms another class of
similarity-based partitioning methods for data that can be represented using a
17
graph. Given a weighted undirected graph G = {V,E} with vertex set V and edge
set E, most graph-partitioning methods optimize a dissociation measure, such as the
normalized cut (Ncut). The edge weights w(i, j) represent a function of similarity
between vertices i and j. Ncut computes the total edge weights connecting two
partitions and normalizes this by their weighted connections to all nodes within the
graph. A two-way normalized-cut criteria divides G into disjoint partitions A and
B (A∪B = V,A∩B = φ) by simultaneously minimizing between-cluster similarity
while maximizing within-cluster similarity. This objective criterion is expressed as,
∑ ∑
Ncut(A,B) = ∑i∈A,j∈B w(i, j) w(i, j)( ) + ∑i∈A,j∈Bi∈A,j∈V w i, j i∈V,j∈B w(i, j)
However, minimizing this objective directly is an NP-hard problem. Spectral
clustering algorithms typically solve a relaxation of this problem. This approach
can be further extended to obtain a K-way partitioning of the graph. Graph-based
clustering approach is often more resilient to outliers, compared to k-means or
hierarchical clustering.
Latent variable models
Decomposition Decomposition or factorization based approaches assume that
the observed data can be decomposed as a product of simpler matrices, often
imposing a specific structure and/or sparsity on these individual matrices. Formally,
given data points X = [x1, .., xn] with x ∈ RDi , linear decomposition techniques
seek a basis set W = [w1, .., wK ] such that the linear space spanned by W closely
reconstructs X.
∑K
xi = wkzi(k)
k=1
18
Here, each data point xi is characterized by unique coefficients zi ∈ RK for the
basis set W . Typically, K < D so that decomposition amounts to a dimensionality
reduction. In matrix notation, the goal is to find W and Z such that X ≈ WZ,
where Z = [z1, .., zn]. This ill-posed problem is generally solved by constraining the
structure of W and/or Z.
Principal component analysis (PCA): PCA is a linear projection based
technique widely used for dimensionality reduction. The goal of PCA is to find
an orthonormal basis W that maximizes the variance captured by projected data
Z = W TX. This is equivalent to minimizing the reconstruction error of the
data points based on the low-dimensional representation Z. Mathematically, this
amounts to solving the following optimization problem,
∥ ∥2
Wopt = arg min ∥∥ − T ∥X WW X∥ subject to W ∈ OD×K
W F
where F denotes the Frobenius norm andOD×K denotes the set ofD×K dimensional
orthonormal matrices.
Independent component analysis (ICA) Independent Component Analy-
sis (ICA) is a popular method for decomposing data as a linear combination of
statistically independent components. In the ICA terminology, W is often known as
the mixing matrix whereas Z comprises the source signals. In the above formalism,
ICA assumes that the sources, i.e., the rows of Z, are statistically independent. The
source signals are recovered using a ”whitening” or ”unmixing” matrix U , where
U = W−1. Since X = WZ, we obtain Z = UX Popular algorithms thus recover
the sources by estimating U such that the components of UX are statistically
independent. Common ICA algorithms emulate independence by either minimizing
the mutual information between sources (InfoMax) or by maximizing their non-
gaussianity (FastICA). ICA usually employs a full-rank matrix factorization and is
19
often preceded with PCA for dimensionality reduction.
Sparse dictionary learning Sparse dictionary learning is formulated as a
linear decomposition problem, similar to ICA/PCA, but with sparsity constraints
on the components Z. This results in a non-convex optimization problem of the
following form:
{Wopt, Z 2opt} = arg min ‖X −WZ‖F + C ‖Z‖0
W,Z
In most practical applications, this optimization problem is relaxed by replacing
the L0-norm with L1-norm.
Non-negative matrix factorization (NMF) NMF is another dimensionality
reduction technique that seeks a low-rank decomposition of the data matrix X with
non-negativity constraints on the components W and Z. Typically, this corresponds
to solving the following optimization,
{Wopt, Zopt} = arg min ‖X −WZ‖2F subject to W ≥ 0, Z ≥ 0
W,Z
Hidden Markov Models Hidden Markov Models (HMMs) are a class of unsu-
pervised learning methods for sequential data. They are used to model a Markov
process where the sequence of observations {x1, .., xT} are assumed to be generated
from a sequence of underlying hidden states {s1, .., sT}, which can be discrete. In a
HMM with K states, it is assumed that si can take discrete values in {1, .., K}. The
parameters of the HMM are learned by maximizing the complete data likelihood,
θopt = arg maxP (x1, .., xT , s1, .., sT |θ)
θ ∏T
= arg max P (st|st−1, θ)P (xt|st, θ)
θ t=1
Here, P (s1|s0) denotes the initial state distribution π. The state transition proba-
bilities are defined by a transition matrix T with elements Ti,j = P (st = j|st−1 = i).
20
The conditionals P (xt|st = k, θ) are captured by an emission probability table
E[k, xt]. The parameters θ of this probabilistic model are thus {π, T, E}. This
maximum likelihood estimation problem is efficiently solved using a special case of
the Expectation-Maximization algorithm, known as the Baum-Welch algorithm.
Non-linear embeddings
Locally linear embeddings LLE projects data to a reduced dimensional space
while preserving local distances between data points and their neighborhood. LLE
algorithm proceeds in two steps. First, each input Xi, i ∈ {1, , ., n} is approximated
as a linear combination of its K closest neighbors. The linear subspace W is
obtaining by minimizing the reconstruction error,i.e.,
∑ ∑
W 2opt = arg min |Xi − WijXj|
W ∑i j
subject to Wij = 1
j
Here, Wij = 0 if Xj is not one of the K-nearest neighbors of Xi. In the second step,
the low-dimensional embeddings Yi are obtained by minimizing the embedding cost
function,
∑
Yopt = arg min |Yi − WijY 2j|
Y j
In the latter optimization, W is kept fixed at Wopt, while Yi’s are optimized.
Autoencoders The autoencoder is an unsupervised neural-network based ap-
proach for learning latent representations of high-dimensional data. It encodes
the input X into a lower dimensional representation Z = fθ(X), known as the
21
bottleneck, which is then decoded to reconstruct the input X̂ = gφ(Z). Both the
encoder fθ and decoder gφ are neural networks. The autoencoder is trained to
minimize ∥∥the reco∥∥nstruction error on a set of examples, often measured with an L2∥ ∥2loss, i.e., X − X̂ . The autoencoder can thus be seen as a non-linear extension
of PCA since fθ and gφ are in general non-linear functions.
2.1.3 Applications of unsupervised learning in rs-fMRI
Unsupervised machine learning methods have proven promising for the analysis of
high-dimensional data with complex structures, making it ever more relevant to
rs-fMRI. Many unsupervised learning approaches in rs-fMRI aim to parcellate the
brain into discrete functional sub-units, akin to atlases. These segmentations are
driven by functional data, unlike those approaches that use cytoarchitecture as in
the Broadmann atlas, or macroscopic anatomical features, as in the Automated
Anatomical Labelling (AAL) atlas [47]. A second class of applications delve into
the exploration of brain network dynamics. Unsupervised learning has recently
been applied to interrogate the dynamic functional connectome with promising
results[42, 43, 44, 48, 49]. Finally, the third application of unsupervised learning
focuses on learning latent low-dimensional representations of RSFC to conduct
analyses across a population of subjects. We discuss the methods under each of
these challenging application areas below.
22
Discovering spatial patterns with coherent fluctuations
Mapping the boundaries of functionally distinct neuroanatomical structures, or
identifying clusters of functionally coupled regions in the brain is a major objective
in neuroscience. Rs-fMRI and machine learning methods provide a promising
combination with which to achieve this lofty goal.
In the case of rs-fMRI, the typical approach is to decompose the 4D fMRI data
into a linear superposition of distinct spatial modes that show coherent temporal
dynamics using techniques like ICA. Clustering is an alternative unsupervised
learning approach for analysis of rs- fMRI data. Unlike ICA or dictionary learning,
clustering is used to partition the brain surface (or volume) into disjoint functional
networks. It is important to draw a distinction at this stage between two slightly
different applications of clustering since they sometimes warrant different constraints;
one direction is focused on identifying functional networks which are often spatially
distributed, whereas the other is used to parcellate brain regions. The latter
application aims to construct atlases that reflect local areas that constitute the
functional neuroanatomy, much like how standard atlases such as the Automated
Anatomical Labelling (AAL) [47] delineate macroscopic anatomical regions. One
important design decision in the application of clustering is the distance function
used to measure dissimilarity between different voxels (or vertices). In the case
of rs-fMRI, this distance function is either computed on raw time-series at voxels
or between their connectivity profiles. While these two distances are motivated
by the same idea of functional coherence, certain differences have been found in
parcellations optimized using either criteria [50].
An important requirement for almost all of these methods is the a priori selection
of the number of clusters/components. These are often determined through cross-
23
validation or through statistics that reflect the quality, stability or reproducibility
of decomposition/partitions at different scales.
ICA ICA has been one of the earliest and most widely used analytic tools for
rs-fMRI, driving several pivotal neuroscientific insights into intrinsic brain networks.
When applied to rs-fMRI, brain activity is expressed as a linear superposition of
distinct spatial patterns or maps, with each map following its own characteristic
time course. These spatial maps can reflect a coherent functional system or noise,
and several criteria can be used to automatically differentiate them. This capability
to isolate noise sources makes ICA particularly attractive. In the early days of
rs-fMRI, several studies demonstrated marked resemblance between the ICA spatial
maps and cortical functional networks known from task-activation studies [2, 4].
While typical ICA models are noise-free and assume that the only stochasticity is
in the sources themselves, several variants of ICA have been proposed to model
additive noise in the observed signals. Beckmann et al. [2] introduced a probabilistic
ICA (PICA) model to extract the connectivity structure of rs-fMRI data. PICA
models a linear instantaneous mixing process under additive noise corruption and
statistical independence between sources. De Luca et al. [5] showed that PICA can
reliably distinguish RSNs from artifactual patterns. Both these works showed high
consistency in resting-state patterns across multiple subjects. While there is no
standard criteria for validating the ICA patterns, or any clustering algorithm for
that matter, reproducibility or reliability is often used for quantitative assessment.
More recently, Khorshidi et al. proposed an automated denoising strategy for fMRI
based on ICA, known as FIX ”FMRIB’s ICA-based-X-noiseifier”. The authors
trained a classifier using manual annotations to label artefactual components based
on distinct spatial/ temporal features. These components could represent a variety
24
of structured noise sources and once identified, they can be either subtracted or
regressed out of the data to yield clean signals.
ICA can also be extended to make group inferences in population studies. Group
ICA is thus far the most widely used strategy, where multi-subject fMRI data
are concatenated along the temporal dimension before implementing ICA [51].
Individual-level ICA maps can then be obtained from this group decomposition by
back-projecting the group mixing matrix [51], or using a dual regression approach
[52]. More recently, Du et al.[53] introduced a group information guided ICA to
preserve statistical independence of individual ICs, where group ICs are used to
constrain the corresponding subject-level ICs. Varoquaux et al. [54] proposed
a robust group-level ICA model to facilitate between-group comparisons of ICs.
They introduce a generative framework to model two levels of variance in the ICA
patterns, at the group level and at a subject-level, akin to a multivariate version of
mixed-effect models. The IC estimation procedure, termed Canonical ICA, employs
Canonical Correlation Analysis to identify a joint subspace of common IC patterns
across subjects and yields ICs that are well representative of the group.
Alternatively, it is also possible to compute individual-specific ICA maps and
then establish correspondences across them [53] for generating group inferences;
however, this approach has been limited because source separations can be very
different across subjects, for example, due to fragmentation.
While ICA and its extensions have been used broadly by the rs-fMRI community,
it is important to acknowledge its limitations. ICA models linear representations
of non-Gaussian data. Whether a linear transformation can adequately capture the
relationship between independent latent sources and the observed high-dimensional
fMRI data is uncertain and likely unrealistic. Unlike the popular Principal Com-
25
ponent Analysis (PCA), ICA does not provide the ordering or the energies of its
components, which makes it impossible to distinguish strong and weak sources. This
also complicates replicability analysis since known sources i.e., spatial maps could
be expressed in any arbitrary order. Extracting meaningful ICs also sometimes
necessitates manual selection procedures, which can be inefficient or subjective. In
the ideal scenario, each individual component represents either a physiologically
meaningful activation pattern or noise. However, this might be an unrealistic as-
sumption for rs-fMRI. Additionally, since ICA assumes non-Gaussianity of sources,
Gaussian physiological noise can contaminate the extracted components. Further,
due to the high-dimensionality of fMRI, analysis often proceeds with PCA based
dimensionality reduction before application of ICA. PCA computes uncorrelated lin-
ear transformations of highest variance (thus explaining greatest variability within
the data) from the top eigenvectors of the data covariance matrix. While this step
is useful to remove observation noise, it could also result in loss of signal informa-
tion that might be crucial for subsequent analysis. Although ICA optimizes for
independence, it does not guarantee independence. Based on studies of functional
integration within the brain, assumptions of independence between functional units
could themselves be questioned from a neuroscientific point of view. Several papers
have suggested that ICA is especially effective when spatial patterns are sparse,
with negligible or little overlap. This hints to the possibility that success of ICA is
driven by sparsity of the components rather than their independence. Along these
lines, Daubechies and colleagues claim that fMRI representations that optimize
for sparsity in spatial patterns are more effective than fMRI representations that
optimize independence [55].
26
Learning sparse spatial maps Sparse dictionary learning is another popular
framework for constructing succinct representations of observed data. Varoquaux
et al. [56] adopt a dictionary learning framework for segmenting functional regions
from resting-state fMRI time series. Their approach accounts for inter-subject
variability in functional boundaries by allowing the subject-specific spatial maps
to differ from the population-level atlas. Concretely, they optimize a loss function
comprising a residual term that measures the approximation error between data
and its factorization, a cost term penalizing large deviations of individual subject
spatial maps from group level latent maps, and a regularization term promoting
sparsity. In addition to sparsity, they also impose a smoothness constraint so that
the dominant patterns in each dictionary are spatially contiguous to construct
a well-defined parcellation. In order to prevent blurred edges caused due to the
smoothness constraint, Abraham et al. [57] propose a total variation regularization
within this multi-subject dictionary learning framework. This approach is shown to
yield more structured parcellations that outperform competing methods like ICA
and clustering in explaining test data. Similarly, Lv et al. [58] propose a strategy
to learn sparse representations of whole-brain fMRI signals in individual subjects
by factorizing the time-series into a basis dictionary and its corresponding sparse
coefficients. Here, dictionaries represent the co-activation patterns of functional
networks and coefficients represent the associated spatial maps. Experiments
revealed a high degree of spatial overlap in the extracted functional networks in
contrast to ICA that is known to yield spatially non-overlapping components in
practice.
K-means clustering and mixture models K-means clustering or mixture
models are frequently used for spatial segmentation of fMRI data [37, 59, 60, 61].
27
Similarity between voxels can be defined by correlating their raw time-series [59]
or connectivity profiles [61]. Euclidean distance metrics have also been used on
spectral features of time series [37].
K-means clustering has provided several novel insights into functional organiza-
tion of the human brain. It has revealed the natural division of cortex into two
complementary systems, the internally-driven ”intrinsic” system and the stimuli-
driven ”extrinsic” system [59, 60]; provided evidence for a hierarchical organization
of RSNs [60]; and exposed the anatomical contributions to co-varying resting-state
fluctuations [37].
Golland et al. [62] proposed a Gaussian mixture model for clustering fMRI
signals. Here, the signal at each voxel is modelled as a weighted sum of N Gaussian
densities, with N determining the number of hypothesized functional networks and
weights reflecting the probability of assignment to different networks. Large-scale
systems were explored at several resolutions, revealing an intrinsic hierarchy in
functional organization. Yeo et al. [63] used rs-fMRI measurements on 1000 subjects
to estimate the organization of large-scale distributed cortical networks. They
employed a mixture model to identify clusters of voxels with similar corticocortical
connectivity profiles. Number of clusters were chosen from stability analysis and
parcellations at both a coarse resolution of 7 networks and a finer scale of 17
networks were identified. A high degree of replicability was attained across data
samples, suggesting that these networks can serve as reliable reference maps for
functional characterization.
Identifying hierarchical spatial organization Several studies have provided
evidence for a hierarchical organization of functional networks in the brain[60,
28
62]. Hierachical agglmomerative clustering (HAC) thus provides a natural tool
to partition rs-fMRI data and explore this latent hierarchical structure. Earliest
applications of clustering to rs-fMRI were based on HAC [36, 64]. This technique
thus largely demonstrated the feasibility of clustering for extracting RSNs from
rs-fMRI data. Recent applications of HAC have focused on defining whole-brain
parcellations for downstream analysis [65, 66, 67]. Spatial continuity can be enforced
in parcels, for example, by considering only local neighborhoods as potential
candidates for merging [65].
An advantage of hierarchical clustering is that unlike k-means clustering, it does
not require knowledge of the number of clusters and is completely deterministic.
However, once the cluster tree is formed, the dendrogram must be split at a level
that best characterizes the ”natural” clusters. This can be determined based on
a linkage inconsistency criterion [64], consistency across subjects [36], or advance
empirical knowledge [68].
While a promising approach for rs-fMRI analysis, hierarchical clustering has some
inherent limitations. It often relies on prior dimensionality reduction, for example
by using an anatomical template [36], which can bias the resulting parcellation. It
is a greedy strategy and erroneous partitions at an early stage cannot be rectified in
subsequent iterations. Single-linkage criterion may not work well in practice since it
merges partitions based on the nearest neighbor distance, and hence is not inherently
robust to noisy resting-state signals. Further, different metrics usually optimize
divergent attributes of clusters. For example, single-link clustering encourages
extended clusters whereas complete-link clustering promotes compactness. This
makes the a priori choice of distance metric somewhat arbitrary.
29
Graph based clustering Functional MRI data can be naturally represented in
the form of graphs. Here, nodes represent voxels and edges represent connection
strength, typically measured by a correlation coefficient between voxel time series
or between connectivity maps [50, 69]. Often, thresholding is applied on edges to
limit graph complexity. Graph segmentation approaches, such as those based on
Ncut criteria, have been widely used to derive whole-brain parcellations [50, 70,
71]. Population-level parcellations are usually derived in a two stage procedure:
First, individual graphs are clustered to extract functionally-linked regions, followed
by a second stage where a group-level graph characterizing the consistency of
individual cluster maps is clustered [50, 69]. Spatial contiguity can be easily
enforced by constraining the connectivity graph to local neighborhoods [50], or
through the use of shape priors [71]. Departing from this protocol, Shen et al.
[70] propose a groupwise clustering approach that jointly optimizes individual and
group parcellations in a single stage and yields spatially smooth group parcellations
in the absence of any explicit constraints.
A disadvantage of the Ncut criteria for fMRI is its bias towards creating
uniformly sized clusters, whereas in reality functional regions show large size
variations. Graph construction itself involves arbitrary decisions which can affect
clustering performance [72] e.g., selecting a threshold to limit graph edges, or
choosing the neighborhood to enforce spatial connectedness.
30
Table 1: Key papers for application:
Discovering spatial patterns with coherent resting-state fluctuations (RSNs)
Approach a: Decomposition
Investigations into resting-state connectivity using independent component analysis (Beckmann et al., 2005)[2]
Consistent resting-state networks across healthy subjects (Damoiseaux et al.,2006)[4]
Method: ICA, Contribution: Early works demonstrating the striking similarity between ICA spatial maps and cortical
functional networks
Group comparison of resting-state fMRI using multi-subject ICA and dual regression(Beckmann et al.,2009) [52]
A group model for stable multi-subject ICA on fMRI datasets (Varoquaux et al., 2010)[54]
Group information guided ICA for fMRI data analysis (Du et al., 2013) [53]
Method: ICA (group-level), Contribution: Influential works discussing analytical approaches for multi-subject ICA in
resting-state
Multi-subject dictionary learning to segment an atlas of brain spontaneous activity (Varoquaux et al., 2011) [56]
Method: Sparse dictionary learning, Contribution: A multi-subject dictionary learning framework for learning sparse spatial
maps
Approach b: Clustering
Hierarchical clustering to measure connectivity in fMRI resting-state data, (Cordes et al.,2002)[64]
Neurophysiological Architecture of Functional Magnetic Resonance Images of Human Brain (Salvador et al.,2005)[36]
Method: Hierarchical clustering, Contribution: Earliest applications of clustering to rs-fMRI; highlighted hierarchical
organization of functional networks
The organization of the human cerebral cortex estimated by intrinsic functional connectivity, (Yeo et al.,2011)[63]
Method: Mixture models, Contribution: Influential large-scale study investigating brain’s functional organization
A whole brain fMRI atlas generated via spatially constrained spectral clustering, (Craddock et al.,2012)[50]
Groupwise whole-brain parcellation from resting-state fMRI data for network node identification, (Shen et al.,2013)[70]
Method: Graph based clustering, Contribution: Released consistent whole-brain functional atlas for fMRI at varying spatial
resolutions based on rs-fMRI data
31
Comments I. A comment on alternate connectivity-based parcellations
Several papers make a distinction between clustering / decomposition and boundary
detection based approaches for network segmentation. In the rs-fMRI literature,
several non-learning based parcellations have been proposed, that exploit tradi-
tional image segmentation algorithms to identify functional areas based on abrupt
RSFC transitions [73, 74]. Clustering algorithms do not mandate spatial con-
tiguity, whereas boundary based methods implicitly do. In contrast, boundary
based approaches fail to represent long-range functional associations, and may
not yield parcels that are as connectionally homogeneous as unsupervised learning
approaches. A hybrid of these approaches can yield better models of brain network
organization. This direction was recently explored by Schaefer et al. [75] with a
Markov Random Field model. The resulting parcels showed superior homogeneity
compared with several alternate gradient and learning-based schemes. Further,
complementing RSFC with other modalities can yield corroborative and perhaps
complementary information for delineating areal boundaries. Recently, Glasser et
al. approached this problem by developing a multi-modal approach for generating
brain parcellations[74]. The authors propose a semi-automated approach that
combines supervised machine learning with manual annotations for parcellating
regions based on their multi-modal fingerprints (architecture, function, connectivity
and topography). Such an approach can be instrumental towards the goal of precise
human brain functional mapping.
II. Subject versus population level parcellations
Significant effort in rs-fMRI literature is dedicated to identifying population-
average parcellations. The underlying assumption is that functional connectivity
graphs exhibit similar patterns across subjects, and these global parcellations reflect
32
common organizational principles. Yet, individual-level parcellations can potentially
yield more sensitive connectivity features for investigating networks in health and
disease. A central challenge in this effort is to match the individual-level spatial
maps to a population template in order to establish correspondences across subjects.
Common approaches to obtain subject-specific networks with group correspondence
often incorporate back-projection and dual regression [51, 52], or hierarchical priors
within unsupervised learning [56, 76]. While a number of studies have developed
subject-specific parcellations, the significance of this inter-subject variability for
network analysis has only recently been discussed. Kong et al. [76] developed high
quality subject-specific parcellations using a multi-session hierarchical Bayesian
model, and showed that subject-specific variability in functional topography can
predict behavioral measures. Recently, using a novel parcellation scheme based on
K-medoids clustering, Salehi et al. [77] showed that individual-level parcellation
alone can predict the sex of the individual. These studies suggest the intriguing
idea that subject-level network organization, i.e. voxel-to-network assignments, can
capture concepts intrinsic to individuals, just like connectivity strength.
III. Is there a universal ’gold-standard’ atlas?
When considering the family of different methods, algorithms or modalities ,
there exist a plethora of diverse brain parcellations at varying levels of granularity.
Thus far, there is no unified framework for reasoning about these brain parcellations.
Several taxonomic classifications can be used to describe the generation of these
parcellations, such as machine learning or boundary detection, decomposition or
clustering, multi-modal or unimodal. Even within the large class of clustering
approaches, it is impossible to find a single algorithm that is consistently superior
for a collection of simple, desired properties of partitioning [78]. Several evaluation
33
criteria have emerged for comparing different parcellations, exposing the inherent
trade-offs at work. Arslan et al. [79] performed an extensive comparison of several
parcellations across diverse methods on resting-data from the Human Connectome
Project (HCP). Through independent evaluations, they concluded that no single
parcellation is consistently superior across all evaluation metrics. Recently, Salehi
et al. [80] showed that different functional conditions, such as task or rest, generate
reproducibly distinct parcellations thus questioning the very existence of an optimal
parcellation, even at an individual-level. These novel studies necessitate rethinking
about the final goals of brain mapping. Several studies have reflected the view
that there is no optimal functional division of the brain, rather just an array of
meaningful brain parcellations [65]. Perhaps, brain mapping should not aim to
identify functional sub-units in a universal sense, like Broadmann areas. Rather,
the goal of human brain mapping should be reformulated as revealing consistent
functional delineations that enable reliable and meaningful investigations into brain
networks.
IV. A comparison between decomposition and clustering
A high degree of convergence has been observed in the functionally coherent
patterns extracted using decomposition and clustering. Decomposition techniques
allow soft partitioning of the data, and can thus yield spatially overlapping networks.
These models may be more natural representations of brain networks where, for
example, highly integrated regions such as network ’hubs’ can simultaneously
subserve multiple functional systems. Although it is possible to threshold and
relabel the generated maps to produce spatially contiguous brain parcellations,
these techniques are not naturally designed to generate disjoint partitions. In
contrast, clustering techniques automatically yield hard assignments of voxels to
34
different brain networks. Spatial constraints can be easily incorporated within
different clustering algorithms to yield contiguous parcels. Decomposition models
can adapt to varying data distributions, whereas clustering solutions allow much
less flexibility owing to rigid clustering objectives. For example, k-means clustering
function looks to capture spherical clusters. While a thorough comparison between
these approaches is still lacking, some studies have identified the trade-offs between
choosing either technique for parcellation. Abraham et al. [57] compared clustering
approaches with group-ICA and dictionary learning on two evaluation metrics:
stability as reflected by reproducibility in voxel assignments on independent data,
and data fidelity captured by the explained variance on independent data. They
observed a stability-fidelity trade-off: while clustering models yield stable regions
but do not explain test data as well, linear decomposition models explain the test
data reasonably well but at the expense of reduced stability.
Discovering patterns of dynamic functional connectivity
Unsupervised learning has also been applied to study patterns of temporal orga-
nization or dynamic reconfigurations in resting-state networks. These studies are
often based on two alternate hypothesis that (a) dynamic (windowed) functional
connectivity cycles between discrete ”connectivity states”, or (b) functional connec-
tivity at any time can be expressed as a combination of latent ”connectivity states”.
The first hypothesis is examined using clustering-based approaches or generative
models like HMMs, while the second is modelled using decomposition techniques.
Once stable states are determined across population, the former approach allows us
to estimate the fraction of time spent in each state by all subjects. This quantity,
known as dwell time or occupancy of the state, shows meaningful variation across
individuals [42, 43, 81, 82]. It is important to note than in all these approaches,
35
the RSNs or the spatial patterns are assumed to be stationary over time and it is
the temporal coherence that changes with time.
Clustering Several studies have discovered recurring dynamic functional con-
nectivity patterns, known as ”states”, through k-means clustering of windowed
correlation matrices [42, 81, 82, 83, 84]. FC associated with these repeating states
shows marked departure from static FC, suggesting that network dynamics provide
novel signatures of the resting brain [42]. Notable differences have been observed in
the dwell times of multiple states between healthy controls and patient populations
across schizophrenia, bipolar disorder and psychotic-like experience domains [81,
82, 83].
Abrol et al. [84] performed a large-scale study to characterize the replicability
of brain states using standard k-means as well as a more flexible, soft k-means
algorithm for state estimation. Experiments indicated reproducibility of most
states, as well as their summary measures, such as mean dwell times and transition
probabilities etc. across independent population samples. While these studies
establish the existence of recurring FC states, behavioral associations of these states
is still unknown. In an interesting piece of work, Wang et al. [85] identified two
stable dynamic FC states using k-means clustering that showed correspondence
with internal states of high- and low-arousal respectively. This suggests that RSFC
fluctuations are behavioral state-dependent, and presents one explanation to account
for the heterogeneity and dynamic nature of RSFC.
Markov modelling of state transition dynamics HMMs are another valuable
tool to interrogate recurring functional connectivity patterns [43, 44, 86]. The notion
of states remains similar to the ”FC states” described above for clustering; however,
36
the characterization and estimation is drastically different. Unlike clustering where
sliding windows are used to compute dynamic FC patterns, HMMs model the
rs-fMRI time-series directly. Hence, they offer a promising alternative to overcome
statistical limitations of sliding-windows in characterizing FC changes.
Several interesting results have emerged through the adoption of HMMs. Vidau-
rre et al. [43] find that relative occupancy of different states is a subject-specific
measure linked with behavioral traits and heredity. Through Markov modelling,
transitions between states have been revealed to occur as a non-random sequence
[42, 43], that is itself hierarchically organized [43]. Recently, network dynamics
modelled using HMMs were shown to distinguish MCI patients from controls [86],
thereby indicating their utility in clinical domains.
Finding latent connectivity patterns across time-points Decomposition
techniques for understanding RSFC dynamics have the same flavor as the ones
described in section 2.1.2: of explaining data through latent factors; however, the
variation of interest is across time in this case. Adoption of matrix decomposition
techniques exposes a basis set of FC patterns from windowed correlation matrices.
Dynamic FC has been characterized using varied decomposition approaches, in-
cluding PCA[48], Singular Value Decomposition (SVD)[49], non-negative matrix
factorization[87] and sparse dictionary learning[88].
Decomposition approaches, here, diverge from clustering or HMMs as they asso-
ciate each dFC matrix with multiple latent factors instead of a single component. To
compare these alternate approaches, Leonardi et al. [49] implemented a generalized
matrix decomposition, termed k-SVD. This factorization generalizes both k-means
clustering and PCA subject to variable constraints. Reproducibility analysis in
37
this study indicated that dFC is better characterized by multiple overlapping FC
patterns.
Decomposition of dFC has revealed novel alterations in network dynamics
between healthy controls and patients suffering from PTSD [88] or multiple sclerosis
[48], as well as between childhood and young adulthood [87].
Table 2: Key papers for application:
Discovering reproducible patterns of dynamic functional connectivity
Approach a: Decomposition
Principal components of functional connectivity: a new approach to study dynamic brain connectivity during rest (Leonardi
et al.,2013)[48]
Method:PCA , Contribution: Early work characterizing dFC using latent connectivity patterns and suggesting altered
connectivity dynamics in disease
Approach b: Clustering
Tracking whole-brain connectivity dynamics in the resting state, (Allen et al.,2014)[42]
Method: K-means, Contribution: Provided evidence for recurring FC states and suggested marked departure of dynamic
connectivity patterns from static FC
Dynamic functional connectivity analysis reveals transient states of dysconnectivity in schizophrenia (Damaraju 2014)[81]
Method: K-means, Contribution: Revealed strong statistical differences in dwell times of multiple FC states between controls
and a disease group
Approach c: Markov models
Unsupervised learning of functional network dynamics in resting state fMRI (Eavani 2013)[44]
Method: HMM, Contribution: Earliest application of HMMs to study resting-state functional network dynamics
Brain network dynamics are hierarchically organized in time, (Vidaurre et al.,2017)[43]
Method: HMM, Contribution: Demonstrated that transitions between FC states occur in a non-random hierarchically
organized fashion and revealed that dwell times of FC states are linked with behavioral traits and heredity.
38
Disentangling latent factors of inter-subject FC variation
Unsupervised learning can also disentangle latent explanatory factors for FC
variation across population. We find two applications here: (i) learning low
dimensional embeddings of FC matrices for subsequent supervised learning and (ii)
learning population groupings to differentiate phenotypes based solely on FC.
Dimensionality reduction Rs-fMRI analysis is plagued by the curse of dimen-
sionality, i.e., the phenomenon of increasing data sparsity in higher dimensions.
Commonly used data features such as FC between pairs of regions, increase as
O(n2) with the number of parcellated regions. Further, sample size in typical fMRI
studies is typically of the order of tens or hundreds, making it harder to learn
generalizable patterns from original high dimensional data. To overcome this, linear
decomposition methods like PCA or sparse dictionary learning have been widely
used for dimensionality reduction of functional connectivity data [89, 90, 91, 92].
Several non-linear embedding methods like Locally linear embedding (LLE) or
Autoencoders (AEs) have also garnered attention. LLE embeddings have been
employed in rs-fMRI studies, for example, to improve predictions in supervised
age regression [93], or for low-dimensional clustering to distinguish Schizophrenia
patients from controls [94]. AEs are a neural network based alternative for generating
reduced feature sets through nonlinear input transformations. They have been
used for feature reduction of RSFC in several studies [86, 95]. AEs can also be
used in a pre-training stage for supervised neural network training, in order to
direct the learning towards parameter spaces that support generalization [96]. This
technique was shown, for example, to improve classification performance of Autism
and Schizophrenia using RSFC [97, 98].
39
Clustering heterogeneous diseases Clustering can expose sub-groups within
a population that show similar FC. Using unsupervised maximum margin cluster-
ing [99], Zeng et al. [100] demonstrated that clusters can be associated with disease
category (depressed v/s control) to yield high classification accuracy. Recently,
Drysdale et al. [101] discovered novel neurophysiological subtypes of depression
based on RSFC. Using an agglomerative hierarchical procedure, they identified
clustered patterns of dysfunctional connectivity, where clusters showed associations
with distinct clinical symptom profiles despite no external supervision. Several
psychiatric disorders, like depression, schizophrenia, and autism spectrum disorder,
are believed to be highly heterogeneous with widely varying clinical presentations.
Instead of labelling them as a unitary syndrome, differential characterization based
on disease sub-types can build better diagnostic, prognostic or therapy selection
systems. Unsupervised clustering could aid in the identification of these disease
subtypes based on their rs-fMRI manifestations.
40
Table 3: Key papers for application:
Disentangling latent factors of inter-subject RSFC variation
Identifying Sparse Connectivity Patterns in the brain using resting-state fMRI (Eavani et al.,2015)[91]
Method: Sparse dictionary learning, Contribution: One of the early works explaining inter-subject RSFC variability in terms
of sparse connectivity patterns
Approach b: Non-linear embeddings
Discriminative analysis of resting-state functional connectivity patterns of schizophrenia using low dimensional embedding
of fMRI (Shen et al.,2010)[94]
Method: LLE, Contribution: Proposed an unsupervised learning approach for discriminating Schizophrenia patients from
controls with impressive accuracy
Identification of autism spectrum disorder using deep learning and the ABIDE dataset (Heinsfeld et al.,2018)[97]
Deep neural network with weight sparsity control and pre-training extracts hierarchical features and enhances classification
performance: Evidence from whole-brain resting-state functional connectivity patterns of schizophrenia (Kim et al., 2016)[98]
Method: Autoencoders, Contribution: More recent works demonstrating the advantages of autoencoder based dimensionality
reduction/pre-training for downstream classification
Approach c: Clustering
Unsupervised classification of major depression using functional connectivity MRI (Zeng et al., 2014)[100]
Resting-state connectivity biomarkers define neurophysiological subtypes of depression (Drysdale et al., 2017) [101]
Method: Maximum margin clustering/HAC, Contribution: Demonstrated the power of clustering approaches for diagnosing
depression and identifying its subtypes based on rs-fMRI manifestations
2.1.4 Supervised Learning
Supervised learning denotes the class of problems where the learning system is
provided input features of the data and corresponding target predictions (or labels).
The goal is to learn the mapping between input and label, so that the system
41
can compute predictions for previously unseen input data points. Prediction of
autism from rs-fMRI correlations is an example problem. Since intrinsic FC reflects
interactions between cognitively associated functional networks, it is hypothesized
that systematic alterations in resting-state patterns can be associated with pathology
or cognitive traits. Promising diagnostic accuracy attained by supervised algorithms
using rs-fMRI constitute strong evidence for this hypothesis.
In this section, we separate the discussion of rs-fMRI feature extraction from
the classification algorithms and application domains.
Deriving connectomic features
To render supervised learning effective, the most critical factor is feature extraction.
Capturing relevant neurophenotypes from rs-fMRI depends on various design choices.
Almost all supervised prediction models use brain networks or ”connectomes”
extracted from rs-fMRI time-series as input features for the learning algorithm. The
prototypical prediction pipeline is shown in Figure 2.8. Here, we discuss critical
aspects of common choices for brain network representations in supervised learning.
The first step in the prototypical pipeline is region definition and corresponding
time-series extraction. Dense connectomes derived from voxel-level correlations
are rarely used in practice for supervised prediction due to their high dimension-
ality. Both functional and anatomical atlases have been extensively used for this
dimensionality reduction. Atlases delineate ROIs within the brain that are often
used to study RSFC at a supervoxel scale. Each ROI is represented with a distinct
time-course, often computed as the average signal from all voxels within the ROI.
Consequently, the data is represented as an N × T matrix, where N denotes the
number of ROIs and T represents the time-points in the signal. A drawback of using
42
pre-defined atlases is that they may not explain the rs-fMRI dataset very well since
they are not optimized for the data at hand. Several studies employ data-driven
techniques to define regions within the brain, using unsupervised models such as
K-means clustering, Ward clustering, ICA or dictionary learning etc [66, 102]. It is
important to note that since we use pairs of ROIs to define whole-brain RSFC, the
features grow as O(N2) with the number of ROIs. Therefore, in most studies, the
network granularity is often limited to the range of 10-400 ROIs.
The second step in this pipeline involves defining connectivity strength for
extracting the connectome matrix. Functional connectivity between pairs of ROIs is
the most common feature representation of rs-fMRI in supervised learning. In order
to extract connectivity matrix, first the covariance matrix needs to be estimated.
Sample covariance matrices are subject to a significant amount of estimation error
due to the limited number of time-points. This ill-posed problem can be partially
resolved through the use of shrinkage transformations [103]. Connectivity strength
can then be estimated from the covariance matrix in multiple ways. Pearson’s cor-
relation coefficient is a commonly used metric for estimating functional connectivity.
Partial correlation is another metric that has been shown to yield better estimates
of network connections in simulated rs-fMRI data [104]. It measures the normalized
correlation between two time-series, after removing the effect of all other time-series
in the data. Alternatively, one can use a tangent-based reparametrization of the
covariance matrix to obtain functional connectivity matrices that respect the Rie-
mannian manifold of covariance matrices [105]. These connectivity coefficients can
boost the sensitivity for comparing diseased versus patient populations [66, 105]. It
is also possible to define frequency-specific connectivity strength by decomposing
the original time-series into multiple frequency sub-bands and correlating signals
separately within these sub-bands [106].
43
A few studies depart from this routine. In graph-theoretic analysis, it is common
to represent parcellated brain regions as graph nodes and functional connectivity
between nodes as edge weights. This graph based representation of functional
connectivity, the human ”connectome”, has been used to infer various topological
characteristics of brain networks, such as modularity, clustering, small-worldedness
etc. Some discriminative models have exploited these graph-based measures for
individual-level predictions [13, 107, 108], although they are more commonly used
for comparing groups. While limited in number, a few studies have also explored
rs-fMRI features beyond RSFC. Amplitude of low-frequency fluctuations (ALFF)
and local synchronization of rs-fMRI signals or Regional Homeogeneity (ReHo) are
two alternate measures for studying spontaneous brain activity that have shown
discriminative ability [109, 110]. More recently, several studies have also begun to
explore the predictive capacity of dynamic FC in supervised models [111, 112].
Feature selection
The goal of feature selection is to remove noisy, redundant or irrelevant features
from the data while minimizing the information loss. Feature selection can often
be an advantageous pre-processing step for training supervised learning algorithms,
especially in the low sample size regime. In the absence of adequate regularization,
large number of features can result in a loss of generalization power. Selecting a sub-
set of features with highest relevance can thus help in building better generalizable
models while reducing computational complexity.
Feature selection can be performed in a supervised or unsupervised fashion.
Supervised or semi-supervised feature selection techniques choose a subset of features
based on their ability to distinguish samples from different classes. These methods
44
thus rely on class labels and can be further classified into filter, wrapper or embedded
type models. Filter models first rank features by their importance/relevance for the
classification task based on a statistical measure (e.g. t-test) and then select the
top-ranked features. Wrapper models select feature subsets based on their predictive
accuracy and thus need a pre-determined classification algorithm. Wrapper models
thus perform better as they take into account the prediction accuracy estimates
during feature selection. Due to the repeated learning and cross-validation, however,
these models are computationally prohibitive. Embedded models combine the
advantages of the two by integrating feature selection into the learning algorithm.
Regression models such as LASSO belong to this category as they implicitly select
features by encouraging sparsity. These feature selection methods are discussed in
depth in a detailed review by Tang et al. [113].
An alternative for feature selection is input dimensionality reduction. Methods
like PCA or LLE belong to the category of unsupervised feature selection techniques
and have been used to reduce the feature set to a manageable size in several studies.
However, as pointed out in [114], these are not at all guaranteed to improve
classification performance since they are oblivious to class labels.
Further, whether or not feature selection is necessary also depends on the
downstream learning algorithm. Support vector machines, in general, deal well
with high-dimensional data because of an implicit regularization. In the context
of SVMs, Vapnik et al. [115] have shown that an upper bound on generalization
error is independent of the number of features. Regularized models, in general, are
capable of handling large feature sets. A drawback is that these models necessitate
cross-validation to tune hyper-parameters such as the weight of the regularization
penalty. This can reduce the effective sample size available for training and/or
45
independent testing.
In some situations, it might be beneficial to exploit domain knowledge to
guide feature selection. For example, if certain anatomical regions are known to
have altered functional connectivity in disease based on prior studies, it might be
advantageous to use this prior knowledge for constructing a focused feature set.
Methods
The majority of supervised learning methods applied to rs-fMRI are discriminant-
based, i.e., they discriminate between classes without any prior assumptions about
the generative process. The focus is on correctly estimating the boundaries between
classes of interest. Learning algorithms for the same discriminant function (e.g.,
linear) can be based on different objective functions, giving rise to distinct models.
We describe common models below.
Regularized linear models A large class of supervised learning algorithms are
based on regularized linear models. The goal is to predict a target variable Y given
input features X. Without loss of generality and for notational convenience, let us
assume that the feature vector contains a single constant entry equal to 1, which
allows us to account for a bias term. These algorithms differ in the choice of their
likelihood model, P (Y |X,w) and/or prior P (w), where w denotes the parameters
of the model. These methods yield optimization problems that are based on a
conditional likelihood estimation or a maximum a posteriori estimation (MAP)
46
framework.
wopt = arg maxP (Y |X,w) : Conditional likelihood
w
wopt = arg maxP (w|X, Y ) : MAP
w
Ridge regression Ridge regression is another widely used supervised learning
algorithm belonging to the class of regularized linear models. The goal is to predict
a real-valued output Y given input features X. The conditional likelihood in
this algorithm is specified as a multivariate normal distribution where the mean
parameter is modelled as a linear combination of input features, i.e., Y |X ∼
N (wTx, σ2I). The prior on weight parameters is often modelled a zero-mean
gaussian with a diagonal covariance matrix,i.e., w ∼ N (0, τ 2I). The optimal weight
paramaters w are thus optimized within a maximum a posteriori estimation (MAP)
framework according to,
∑n
wopt = arg min
1
2 2 (yi −
T 1w xi)2 + 2
w i=1 σ 2
‖w‖
τ 2 2
The MAP estimation problem above is convex and admits an elegant analytical
solution.
Logistic Regression Logistic regression employs a Bernoulli distribution to
model the conditional probability of an output class Y given the input features
X, i.e. Y |X ∼ Bernoulli(µ). The mean parameter, µ, is specified with a logistic
link function σ(·) using a linear combination of input features, i.e., µ = σ(wTx).
Given data {(xi, yi), i = 1, .., n}, the model parameters w are optimized within
a conditional maximum likelihood framework by solving the following convex
optimization problem,
∑n
w Topt = arg min log(1 + exp(−yiw xi))
w i=1
47
The training objective is optimized using iterative methods such as gradient descent
or Newton’s method. Regularized variants of logistic regression incorporate priors
on the weight parameters (e.g., multivariate gaussian) and optimize the MAP
estimates instead of the conditional likelihood estimates.
Support Vector Machines (SVMs) The SVM is the most widely used clas-
sification/regression algorithm in rs-fMRI studies. SVMs search for an optimal
separating hyperplane between classes that maximizes the margin, i.e., the distance
from hyperplane to points closest to it on either side. This results in a classifier of
the form f(x) = sign(wTx). The model parameters are obtained by solving the
following convex optimization problem:
∑n
w T 2opt = arg minC max(0, 1− yi(w xi)) + ‖w−b‖ .
w i=1
‖w 2−b‖ is the L2 norm of the weight vector excluding the bias term. C controls
the capacity of the model and determines the margin of the classifier. Tuning
C can control overfitting and reduce the generalization error of the model. The
resulting classification model is determined by only a subset of training instances
that are closest to the boundary, known as the support vectors. SVMs can be
extended to seek non-linear separating boundaries via adopting a so-called kernel
function. The kernel function, which quantifies the similarity between pairs of
points, implicitly maps the input data to higher dimensions. Conceptually, the use
of kernel functions allows incorporation of domain-specific measures of similarity.
For example, graph-based kernels, such as Weisfeiler-Lehman subtree kernel, can
define a distance metric on the graphical representation of functional connectivity
data for classification directly in the graph space.
48
Decision trees and random forests Decision trees predict the output Y based
on a sequence of splits in the input feature space X. The tree is a directed acyclic
graph whose nodes represent decision points and edges represent their outcomes.
The traversal of this tree in conjunction leads up to a target outcome prediction
when a node with no children (leaf node) has been reached. Decision trees are often
constructed in a top-down greedy fashion where nodes are split at each step by
optimizing a metric that quantifies the consistency between predictions and ground
truth. For example, in classification, an often-used information-theoretic metric for
quantifying this consistency is Information-Gain, i.e., the reduction in entropy of
Y after knowing X. Mathematically, this is expressed as
IG(Y,X) = H(Y )−H(Y |X)
where H denotes the Shannon entropy. Based on this metric, the first split will
use the attribute of X that gives the maximum information gain. Decision trees can
offer interpretability, often at the cost of reduced accuracy. Ensembles of decision
trees, such as random forests or boosted trees, are thus a more popular choice in
most applications since they yield much better prediction performance.
Deep neural networks An ideal machine learning system should be highly
automated, with limited hand-crafting in feature extraction as well as minimal
assumptions about the nature of mapping between data and labels. The system
should be able to mechanistically learn patterns useful for prediction from observed
labelled data. Neural networks are highly promising methods for automated learning.
This stems from their capability to approximate arbitrarily complex functions given
sufficient labelled data [116].
Deep learning based models or neural networks define a mapping Y = f(X; θ)
49
and optimize for parameters θ that yield the best functional approximation. The
function f(·) is typically composed as a concatenation of simple nonlinear functions,
often referred to as layers. A widely-used layer is a fully-connected layer that
linearly combines the input variables, and applies a simple elementwise non-linear
functions such as a sigmoid. The number of layers determines the depth of the
network and controls the complexity of the model. The weights and biases of the
layers are optimized via gradient descent based methods to minimize an objective
function that quantifies the empirical risk. Traditionally, the use of neural network
algorithms has been limited since neuroimaging is a data-scarce domain, making
it difficult to learn a reliable mapping between input and prediction variables.
However, with data sharing and open release of large-scale neuroimaging data
repositories, neural networks have recently gained adoption in the the rs-fMRI
community for supervised prediction tasks. Neural networks with fully connected
dense layers have been adopted to learn arbitrary mappings from connectivity
features to disease labels [97, 98]. Recently, more advanced neural networks models
with local receptive fields, like convolutional neural networks (CNNs), have shown
promising classification accuracy using rs-fMRI data [117]. CNNs replace the
fully-connected operations by convolutions with a set of learnable filters. Success
of this approach stems from its ability to exploit the full-resolution 3D spatial
structure of rs-fMRI without having to learn too many model parameters, thanks
to the weight sharing in CNNs.
Comments I. Strengths/weaknesses of diverse approaches
All algorithms have their own strengths and weaknesses and the choice of
approach should be driven by several factors such as the prediction task, sample
size, and nature of the input features. The training objective in common supervised
50
learning algorithms used for neuroimaging applications, such as regularized linear
models or SVMs, is often a combination of two terms: a data loss term that is a
measure of the empirical risk or training error and a regularization penalty for the
prior that helps combat over-fitting during learning (generalization error). The
penalty norm can be critical and is often constrained by our prior knowledge about
the data. L1 penalties encourage sparsity in weights whereas L2 penalties can
allow kernelization and thus enable non-linear decision functions. L2 penalties
lead to dense priors and are useful in learning problems where all features are
expected to contribute to the predictive model. L1 penalties are useful when prior
belief suggests that only a subset of features will contribute to predictions. Some
regression models, e.g., Elastic-Net, employ a linear combination of both these
penalties at the expense of an additional hyperparameter for tuning the trade-off
between the two. The algorithmic choice is also affected by the end-goal. Models
like decision trees or LASSO are often preferred when interpretability is desired
over optimal performance whereas high-complexity models like SVMs, Random
Forests or Neural Networks are imperative if the goal is to maximize performance.
II. Comments on sample sizes
An important question arises: What is an appropriate sample size for training
supervised learning models? Unsurprisingly, research has shown that the sample
size needed for learning is dependent on the complexity of the model. Powerful
non-linear algorithms typically require more training examples to be effective. In
general, one would also expect that the more features in the data, the more training
examples would be required to characterize their distribution. Hence, the minimum
training size for training a ML algorithm is in general a complex function of input
dimensionality, complexity of the chosen model, quality of data, data heterogeneity,
51
separability of classes etc.
Given the significant impact of sample size on classification performance, it
is imperative to understand the nature of this relationship. There is significant
ongoing research in answering this question using learning curves. These curves
model the relationship between sample size and generalization error and can be used
to predict the sample size required to train a particular classifier. Several studies
have shown that learning curves can be well-characterized with an inverse power-law
functional form, with E(n)αn−β, where E denotes the error and n denotes the
sample size [118, 119]. Besides empirical justification, many studies have also
provided theoretical motivations for the inverse power-law model. The parameters
of the learning curve are fitted empirically for a given application domain based on
prior classification studies. For traditional algorithms, learning curves are known
to plateau, i.e., the performance gains are insignificant beyond a certain sample
size. One significant advantage of deep learning methods is that given sufficient
capacity, they scale remarkably well with more data. Given the recent surge of
interest in single-subject predictions using rs-fMRI, estimating the learning curve
for classification of rs-fMRI data could be invaluable for understanding sample size
requirements in this domain.
Another critical issue relates to the robustness of the estimated prediction scores.
Empirical studies have shown that small sample sizes, typical in neuroimaging
studies, result in large error bars on the prediction accuracy. For instance, with
a sample size of 100, Varoquaux et al., ballpark the error in estimated prediction
accuracy of binary classification tasks to be close to 10%. With 1000 samples, this
error reduces down to 3%. Large confidence bounds can potentially invalidate the
conclusions of studies based on a small number of samples.
52
One possible strategy to overcome the limitations of insufficient sample sizes
is to exploit unlabelled data in a semi-supervised fashion in order to increase
the effectiveness of supervised learning algorithms. Transfer learning techniques
are another promising alternative for enhancing classification performance in the
low-data regime. These methods exploit neural networks trained on large datasets
or auxiliary tasks by fine-tuning them to a target dataset or classification task.
These are relatively unexplored directions in the field of rs-fMRI analysis that hold
significant potential to alleviate the sample size limitations.
III. Comments on model evaluation Cross-validation is a model evaluation
technique used to estimate the generalization error of a predictive model. A naive
cross-validation strategy is holdout, wherein the data is randomly split into a
training and test set and the test score in this single-run is used as an estimate
of out-of-sample accuracy. Given the limited sample sizes in most neuroimaging
studies, K-fold is the dominant cross-validation choice as it utilizes all data points
for both training and validation through repeated holdout, yielding error estimates
with much less variance than classic holdout. It first partitions the data into K
non-overlapping subsets, D = {S1, .., SK}. For each fold i in {1, .., K}, the model
is trained on D\ Si and evaluated on Si. The mean accuracy across all folds is
then used to estimate the model performance. While K can be anything, common
choices include 5 or 10. When K equals the number of samples in the training set,
the resampling procedure is known as leave one-out cross-validation. This can be
used with computationally inexpensive models when sample sizes are low, typically
less than a hundred.
53
Applications of supervised learning in rs-fMRI
Studies harnessing resting-state correlations for supervised prediction tasks are
evolving at an unprecedented scale. We describe some interesting applications of
supervised machine learning in rs-fMRI below.
Brain development and aging Machine learning methods have shown promise
in investigating the developing connectome. In an early influential work, Dosenbach
et al. [120] demonstrated the feasibility of using RSFC to predict brain maturation
as measured by chronological age, in adolescents and young adults. Using SVM,
they developed a functional maturation index based on predicted brain ages. Later
studies showed that brain maturity can be reasonably predicted even in diverse
cohorts distributed across the human lifespan [121, 122]. These works posited
rs-fMRI as a valuable tool to predict healthy neurodevelopment and exposed novel
age-related dynamics of RSFC, such as major changes in FC of sensorimotor
regions [122], or an increasingly distributed functional architecture with age [120].
In addition to characterizing RSFC changes accompanying natural aging, machine
learning has also been used to identify atypical neurodevelopment [123].
Neurological and Psychiatric Disorders Machine learning has been exten-
sively deployed to investigate the diagnostic value of rs-fMRI data in various neu-
rological and psychiatric conditions. Neurodegenerative diseases like Alzheimer’s
disease [24, 107, 124], its prodromal state Mild cognitive impairment [125, 126, 127,
128], Parkinson’s [129], and Amyotrophic Lateral Sclerosis (ALS) [130] have been
classified by ML models with promising accuracy using functional connectivity-
based biomarkers. Brain atrophy patterns in neurological disorders like Alzheimer’s
54
or Multiple Sclerosis appear well before before behavioral symptoms emerge. Thus,
neuroimaging-based biomarkers derived from structural or functional abnormalities
are favorable for early diagnosis and subsequent intervention to slow down the
degenerative process.
The biological basis of psychiatric disorders has been elusive and the diagnosis
of these disorders is currently completely driven by behavioral assessments. rs-fMRI
has emerged as a powerful modality to derive imaging-based biomarkers for making
diagnostic predictions of psychiatric disorders. Supervised learning algorithms using
RSFC have shown promising results for classifying or predicting symptom severity
in a variety of psychiatric disorders, including schizophrenia [98, 131, 132, 133],
depression [23, 108, 134], autism spectrum disorder [25, 66, 111, 117], attention-
deficit hyperactivity disorder [135, 136], social anxiety disorder [137], post-traumatic
stress disorder [138] and obsessive compulsive disorder [139]. Several novel network
disruption hypotheses have emerged for these disorders as a consequence of these
studies. Most of these prediction models are based on standard kernel-based SVMs,
and rely on FC between ROI pairs as discriminative features.
Cognitive abilities and personality traits Functional connectivity can also be
used to predict individual differences in cognition and behavior [140]. In comparison
to task-fMRI studies which capture a single cognitive dimension, the resting state
encompasses a wide repertoire of cognitive states due to its uncontrolled nature.
This makes it a rich modality to capture inter-individual variability across multiple
behavioral domains. ML models have been shown to predict fluid intelligence [46],
sustained attention [141], memory performance [142, 143, 144], language scores [142]
from RSFC-based biomarkers in healthy and pathological populations. Recently,
the utility of these models was also shown to extend to personality traits such as
55
neuroticism, extraversion, agreeableness and openness [145, 146].
Prediction of behavioral performance is useful in a clinical context to under-
stand how RSFC disruptions in pathology relate to impaired cognitive functioning.
Meskaldji et al. [143] used regression models to predict memory impairment in
MCI patients from different connectivity measures. Siegel et al. [142] assessed the
behavioral significance of network disruptions in stroke patients by training ridge re-
gression models to relate RSFC and structure with performance in multiple domains
(memory, language, attention, visual and motor tasks). Among them, memory
deficits were better predicted by RSFC, whereas structure was more important for
predicting visual and motor impairments. This study highlights how rs-fMRI can
complement structural information in studying brain-behavior relationships.
Vigilance fluctuations and sleep studies A handful of studies have employed
machine learning to predict vigilance levels during rs-fMRI scans. Since resting-
state studies demand no task-processing, subjects are prone to drifting between
wakefulness and sleep. Classification of vigilance states during rs-fMRI is important
to remove vigilance confounds and contamination. SVM classifiers trained on
cortico-cortical RSFC have been shown to reliably detect periods of sleep within
the sca [147, 148]. Tagliazucchi et al. [148] revealed loss of wakefulness in one-third
subjects of the experimental cohort, as early as 3 minutes into the scanner. The
findings are interesting: While resting state is assumed to capture wakefulness, this
may not be entirely true even for very short scan durations. The utility of these
studies should not remain limited to classification alone. Through appropriate
interpretation and visualization techniques, machine learning can shed new light
on the reconfiguration of functional organization as people drift into sleep.
56
Predicting individual differences in cognitive response after different sleep
conditions (e.g. sleep deprivation) using machine learning analysis of rs-fMRI is
another interesting research direction. There is significant interest in examining
RSFC alterations following sleep deprivation [149, 150]. While statistical analysis
has elucidated the functional reorganization characteristic of sleep deprivation, much
remains to be understood about the FC patterns associated with inter-individual
differences in vulnerability to sleep deprivation. Yeo et al. [151] trained an SVM
classifier on functional connectivity data in the well-rested state to distinguish
subjects vulnerable to vigilance decline following sleep deprivation from more
resilient subjects, and revealed important network differences between the groups.
Heritability Understanding the genetic influence on brain structure and function
has been a long-standing goal in neuroscience. In a recent study, Ge et al. em-
ployed a traditional statistical framework to quantify heritability of whole-brain FC
estimates [152]. Investigations into the genetic and environmental underpinnings of
RSFC were also pursued within a machine learning framework. Miranda-Dominguez
et al. [153] trained an SVM classifier on individual FC signatures to distinguish
sibling and twin pairs from unrelated subject pairs. The study unveiled several
interesting findings. The ability to successfully predict familial relationships from
resting-state fMRI indicates that aspects of functional connectivity are shaped
by genetic or unique environmental factors. The fact that predictions remained
accurate in young adult pairs suggests that these influences are sustained through
development. Further, a higher accuracy of predicting twins compared to non-twin
siblings implied that genetics (rather than environment) is likely the stronger
predictive force.
57
Other neuroimaging modalities Machine learning can also be used to inter-
rogate the correspondence between rs-fMRI and other modalities. The most closely
related modality is task-fMRI. Tavor et al. [154] trained multiple regression models
to show that resting-state connectivity can predict task-evoked responses in the
brain across several behavioral domains. The ability of rs-fMRI, that is a task-free
regime, to predict the activation pattern evoked by multiple tasks suggests that
resting-state can capture the rich repertoire of cognitive states that is reflected
during task-based fMRI. The performance of these regression models was shown to
generalize to pathological populations [155], suggesting the clinical utility of this
approach to map functional regions in populations incapable of performing certain
tasks.
Investigating how structural connections shape functional associations between
different brain regions has been the focus of a large number of studies [156]. While
neuro-computational models have been promising to achieve this goal, machine
learning models are particularly well-equipped to capture inter-individual differences
in the structure-function relationship. Deligianni et al. [157] proposed a structured-
output multivariate regression model to predict resting-state functional connectivity
from DWI-derived structural connectivity, and demonstrated the efficiency of this
technique through cross-validation. Venkataraman et al. [158] introduced a novel
probabilistic model to examine the relationships between anatomical connectivity
measured using DWI tractography and RSFC. Their formulation assumes that the
two modalities are generated from a common connectivity template. Estimated
latent connectivity estimates were shown to discriminate between control and
schizophrenic populations, thereby indicating that joint modelling can also be
useful in a clinical context.
58
Table 4: Key papers for various supervised learning application domains
Brain development and aging
Prediction of individual brain maturity using fMRI (Dosenbach et al.,2010)[120]
Method: SVM, Target: Age, Contribution: Early influential work demonstrating the feasibility of using RSFC features for predicting brain maturation.
Neurological and Psychiatric Disorders
Classification of Alzheimer disease, mild cognitive impairment, and normal cognitive status with large-scale network analysis based on resting-state functional MR imaging.
(Chen et al.,2011)[24]
Method: Fisher LDA, Target: Alzheimer/MCI/controls, Contribution: Early work highlighting the potential of RSFC to diagnose neurological disorders
Deriving reproducible biomarkers from multi-site resting-state data: An Autism-based example(Abraham et al.,2017)[66]
Method: Multiple, Target: ASD/controls, Contribution: Extensively evaluated the impact of ROI choice, connectivity metric and classifier on prediction performance
in intra-site and inter-site settings
Altered resting state complexity in schizophrenia (Bassett et al.,2012)[132]
Method: SVM, Target: Schizophrenia/controls, Contribution: Demonstrated the utility of resting-state network complexity measures in distinguishing patients with
schizophrenia
Cognitive abilities and personality traits
Functional connectome fingerprinting: Identifying individuals using patterns of brain connectivity (Finn et al.,2015)[46]
Method: Linear regression, Target: Fluid intelligence, Contribution: Demonstrated that RSFC can uniquely identify individuals and reliably predict fluid intelligence
Disruptions of network connectivity predict impairment in multiple behavioral domains after stroke (Siegel et al., 2016) [142]
Method: Ridge regression, Target: Multiple cognitive measures , Contribution: Demonstrated the ability of ML coupled with RSFC to predict cognitive deficits in
clinical populations
Vigilance fluctuations and sleep studies
Automatic sleep staging using fMRI functional connectivity data (Tagliazucchi et al.,2012) [147]
Decoding wakefulness levels from typical fMRI resting-state data reveals reliable drifts between wakefulness and sleep (Tagliazucchi et al.,2014) [148]
Method: SVM, Target: NREM sleep stages/wakefulness, Contribution: Demonstrated the ability of ML to detect sleep stages in resting-state
Heritability
Heritability of the human connectome: A connectotyping study (Miranda-Dominguez et al.,2018) [153]
Method: SVM, Target: Twins/sibling/unrelated, Contribution: Provided evidence for relationship between genetics and RSFC through predictive modelling
Other neuroimaging modalities
Task-free MRI predicts individual differences in brain activity during task performance (Tavor et al.,2016) [154]
Method: Multiple regression models, Target: Task-activation map, Contribution: 5De9monstrated that resting-state can capture the rich repertoire of cognitive states
expressed during different behavioral tasks
2.1.5 Discussion
Practical advice for machine learning practitioners
Any machine learning application requires the following: (a) a model that reflects
assumed relationships between measurements and other inductive biases, (b) a
cost function to quantify how well the model captures our data and finally, (c) an
appropriate optimization algorithm to minimize the cost. Successful application of
machine learning to rs-fMRI requires a holistic perspective of how these algorithms
work, what it means when they fail and most importantly, how to choose an
algorithm for a given task or hypothesis. There are are three crucial factors that
could dictate this choice:
1. What is the research question? What is our prior belief? Unsupervised learn-
ing tackles questions about the data-generating process. For example, clustering
and decomposition approaches have both been widely used for disentangling the
underlying causal sources of rs-fMRI data. However, they represent different prior
beliefs and often answer distinct research questions. For example, in the context of
discovering RSNs, ICA assumes that the latent components are independent and
seeks to recover spatial loci of sources of activation. This decomposition further
enables separation of functional activity from noise sources. On the other hand,
clustering generally assumes that the activation of each spatial location/region can
be explained by exactly one underlying component from a set of clusters. Because
this approach results in disjoint functional networks, clustering is the dominant
approach for learning spatially contiguous whole-brain parcellations.
When the goal is to make predictions, supervised learning algorithms are the
60
usual choice. The choice of a supervised model again depends on the research
question: Is the goal to understand the relationship between labels and features
or to build a diagnostic tool? Interpretability is key for the former application
whereas highest accuracy can be construed as the primary goal for the latter. Model
complexity must thus be chosen in accordance with this end-goal. We recommend
that these goals be well-defined before model development.
2. How much data is needed? It is important to assess the quantity of data
and whether or not it is feasible to acquire more data. Sample sizes can constrain
model complexity. More training examples are required to capture a non-linear
relationship between features and labels, than a linear relationship. Data fidelity
and regularization must also be weighed in accordance with the sample size. With
small sample sizes, regularization becomes even more critical as the model is more
likely to overfit on training samples.
3. What is the computational budget? Sometimes, the computational budget
can be restrictive. For example, certain algorithms like deep neural networks, have
a high computational demand that may not be sustained by available resources.
Further, if the number of features is very large, training even low-complexity models
can be time consuming. In such cases, models with lower run-timing complexity
can take precedence, especially for early investigations. Time, computational bud-
get or space constraints thus must be identified while choosing an appropriate model.
61
Limitations and opportunities
Many state-of-the-art techniques for rs-fMRI analysis are rooted in machine learning.
Both unsupervised and supervised learning methods have substantially expanded
the application domains of rs-fMRI. With large-scale compilation of neuroimaging
data and progresses in learning algorithms, an even greater influence is expected
in future. Despite the practical successes of machine learning, it is important to
understand the challenges encountered in its current application to rs-fMRI. We
outline some important limitations and unexplored opportunities below.
One of the biggest challenges associated with unsupervised learning methods
is that there is no ground truth for evaluation. There is no a priori universal
functional map of the brain to base comparisons between parcellation schemes.
Further, whole-brain parcellations are often defined at different scales of functional
organization, ranging from a few large-scale parcels to several hundreds of regions,
making comparisons even more challenging. Although several evaluation criteria
have been developed that account for this variability, no single learning algorithm
has emerged to be consistently superior in all. Due to the trade-offs among diverse
approaches, the choice of which parcellation to use as reference for network analysis
is thus largely subjective.
Unsupervised learning approaches for exploring network dynamics are similarly
prone to subjectivity. Characterizing dynamic functional connectivity through
discrete mental states is difficult, primarily because the repertoire of mental states
is possibly infinite. While dFC states are thought to reflect different cognitive
processes, it is challenging to obtain a behavioral correspondence for distinct states
since resting-state is not externally probed. This again makes interpretations hard
and prone to subjective bias. Machine learning approaches in this direction have
62
thus far relied on cluster statistics to fix the number of FC states. Non-parametric
models (e.g. infinite HMMs) provide an unexplored, attractive framework as they
adaptively determine the number of states based on the underlying data complexity.
A significant challenge in single-subject prediction using rs-fMRI is posed by
the fact that rs-fMRI features can be described in multiple ways. There is no
recognized gold-standard atlas for time-series extraction, nor is there a consensus
on the optimal connectivity metric. Further, even the fMRI preprocessing strategies
can vary considerably. Exploration across this space is cumbersome, especially for
advanced machine learning models like neural networks that are slow to train. An
ideal system should be invariant to these choices. However, this is hardly the case
for rs-fMRI where large deviations have been reported in prediction performance in
relationship to these factors [66].
Another challenge in training robust prediction systems on large populations
stems from the heterogeneity of multi-site rs-fMRI data. Resting-state is easier
to standardize across sites compared to task-based protocols since it does not
rely on external stimuli. However, differences in acquisition protocols and scanner
characteristics across sites still constitute a significant source of heterogeneity.
Multi-site studies have shown little to no improvement in prediction accuracy
compared to single-site studies, despite the larger sample sizes [25, 159]. While it
is possible to normalize out site effects from data, more advanced tools are needed
in practice to mitigate this bias.
High diagnostic accuracies achieved by supervised learning methods should
be interpreted with caution. Several confounding variables can induce systematic
biases in estimates of functional connectivity. For example, head motion is known
to affect connectivity patterns in the default mode network and frontoparietal
63
control network [160]. Further, motion profiles also vary systematically between
subgroups of interest, e.g., diseased patients often move more than healthy controls.
Apart from generating spurious associations, this could affect the interpretability of
supervised prediction studies. Independent statistical analysis is critical to rule out
the effect of confounding variables on predictions, especially when these variables
differ across the groups being explored.
Methodological innovations are needed to improve prediction accuracy to levels
suitable for clinical translation. Several factors make comparison of methods across
studies tedious. Cross-validation is the most commonly employed strategy for
reporting performance of ML models. However, small sizes (common in rs-fMRI
studies) are shown to yield large error bars [161], indicating that data-splits can
significantly impact performance. Generalizability and interpretability should
remain the key focus while developing predictive models on rs-fMRI data. These
are critical attributes to achieve clinical translation of machine learning models.
Uncertainty estimation is another challenge in any application of supervised learning;
ideally, class assignments by any classification algorithm should be accompanied by
an additional measure that reflects the uncertainty in predictions. This is especially
important for clinical diagnosis, where it is important to know a reliability measure
for individual predictions.
Most existing studies focus on classifying a single disease versus controls. The
ability of a diagnostic system to discriminate between multiple psychiatric disorders
is much more useful in a clinical setting [162]. Hence, there is a need to assess
the efficacy of ML models for differential diagnosis. Integrating rs-fMRI with
complementary modalities like diffusion-weighted MRI can possibly yield even
better neurophenotypes of disease, and is another challenging yet promising research
64
proposition.
2.1.6 Conclusions
We have presented a comprehensive overview of the current state-of-the-art of
machine learning in rs-fMRI analysis. We have organized the vast literature on this
topic based upon applications and techniques separately to enable researchers from
both neuroimaging and machine learning communities to identify gaps in current
practice.
65
Table 5: Key related review papers in the field
Multi-subject Independent Component Analysis of fMRI: A Decade of Intrinsic Networks, Default Mode, and Neurodiagnostic
Discovery (Calhoun et al. 2009)[163]
A focused review of group ICA discussing methodologies, discovery of RSNs and their diagnostic potential
Imaging-based parcellations of the human brain (Eickhoff et al.,2018)[164]
A detailed exploration into approaches for deriving imaging based parcellations and lurking challenges in the field
Dynamic functional connectivity: Promise, issues, and interpretations (Hutchison et al.,2013)[165]
An early review on findings, methods and interpretations of dynamical fuctional connectivity
The Chronnectome: Time-Varying Connectivity Networks as the Next Frontier in fMRI Data Discovery (Calhoun et al.,2014)
[166]
A detailed review of methods for dynamic functional connectivity analysis with a focus on decomposition techniques
The dynamic functional connectome: State-of-the-art and perspectives (Preti et al.,2017)[167]
A comprehensive review of analytical approaches for dynamic functional connectivity analysis and future perspectives
On the nature of resting fMRI and time-varying functional connectivity (Lurie et al.,2018)[168]
A discussion of diverse perspectives on time-varying connectivity in rs-fMRI
Clinical Applications of Resting State Functional Connectivity (Fox et al.,2010)[169]
An early short review focused on clinical applications of rs-fMRI
Single Subject Prediction of Brain Disorders in Neuroimaging: Promises and Pitfalls (Arbabshirani et al. 2017)[170]
Extensive survey of studies on single subject prediction of brain disorders, including opinions on promises/limitations
66
Figure 2.4: Illustrations of popular clustering algorithms: K-means clustering
partitions the data space into Voronoi cells, where each observation is assigned to
the cluster with the nearest centroid (marked red in the figure). GMMs assume that
each cluster is sampled from a multivariate Gaussian distribution and estimates these
probability densities to generate probabilistic assignment of observations to different
clusters. Hierarchical (agglomerative) clustering generates nested partitions, where
partitions are merged iteratively based on a linkage criteria. Graph-based clustering
partitions the graph representation of data so that, for example, number of edges
connecting distinct clusters are minimal.
67
Figure 2.5: Schematic of application 2.1.3: In decomposition, the original fMRI
data is expressed as a linear combination of spatial patterns and their associated
time series - in ICA, the independence of spatial maps is optimized whereas in
sparse dictionary learning, the sparsity of maps is encouraged. In clustering, time
series or connectivity fingerprints of voxels are clustered to assign voxels to distinct
functional networks.
Figure 2.6: Schematic of application 2.1.3. Three connectivity states are assumed
in the data for illustration purposes
68
Figure 2.7: Schematic of application 2.1.3. Dimensionality reduction of high-
dimensional connectomes into 3 latent components is shown for illustration.
Figure 2.8: A common classification/regression pipeline for connectomes
Figure 2.9: A summary of design choices for supervised learning with rs-fMRI
69
Figure 2.10: A taxonomy of supervised learning methods used for rs-fMRI analysis
70
CHAPTER 3
LINKING RESTING-STATE BRAIN ACTIVITY AND MENTAL
DISORDERS WITH MACHINE LEARNING
3.1 Ensemble learning with 3D convolutional neural net-
works for functional connectome-based prediction
Abstract
The specificity and sensitivity of resting state functional MRI (rs-fMRI) measure-
ments depend on preprocessing choices, such as the parcellation scheme used to
define regions of interest (ROIs). In this study, we critically evaluate the effect
of brain parcellations on machine learning models applied to rs-fMRI data. Our
experiments reveal an intriguing trend: On average, models with stochastic par-
cellations consistently perform as well as models with widely used atlases at the
same spatial scale. We thus propose an ensemble learning strategy to combine the
predictions from models trained on connectivity data extracted using different (e.g.,
stochastic) parcellations. We further present an implementation of our ensemble
learning strategy with a novel 3D Convolutional Neural Network (CNN) approach.
The proposed CNN approach takes advantage of the full-resolution 3D spatial
structure of rs-fMRI data and fits non-linear predictive models. Our ensemble
CNN framework overcomes the limitations of traditional machine learning models
for connectomes that often rely on region-based summary statistics and/or linear
models. We showcase our approach on a classification (autism patients versus
healthy controls) and a regression problem (prediction of subject’s age), and report
71
promising results.
3.1.1 Introduction
Functional connectivity, as often captured by correlations in resting state func-
tional MRI (rs-fMRI) data, has produced novel insights linking differences in
brain organization to individual or group-level characteristics. Recently, machine
learning models are being increasingly applied to study and exploit individual
variation in functional connectivity data [171, 172, 173]. These models often employ
hand-engineered features, such as pairwise correlations between regions of interest
(ROIs) and network topological measures of clustering, modularity, small-worldness,
integration, or segregation [174, 175, 176]. The ROIs are usually computed based
on a pre-defined atlas or a parcellation scheme. The choice of the ROIs can have a
significant impact on downstream analyses [102, 104, 177].
Brain ROIs can be defined based on macro-anatomical features, cytoarchitecture,
functional activations, and/or connectivity patterns [178, 179, 180, 181]. A common
approach is to derive the ROIs either based on input from experts and/or using
a data-driven strategy on a small number of subjects. Expert-defined ROIs are
challenging to standardize across studies [182] and often rely on arbitrary decisions.
Data-driven ROIs, on the other hand, can be biased by the selection of the subjects,
especially for regions that exhibit large variability across the population. Popular
data-driven techniques include clustering, dictionary learning and Independent
Component Analysis (ICA) [102, 183, 184]. Such methods can be sensitive to
confounds such as motion, while initialization, optimization, and other algorithmic
choices can also significantly influence the results [185]. A parcellation scheme not
only defines the boundaries of ROIs, but also restricts the analysis to a certain
72
Figure 3.1: A general illustration of the proposed approach
spatial scale. Abraham et al. [186] showed that among various preprocessing
decisions, the choice of region definition has the greatest impact on predictive
accuracy with data-driven extraction based on dictionary learning outperforming
ICA/clustering and other reference atlases.
Given the arbitrary nature of a chosen parcellation scheme and its impact
on predictive models, we hypothesized that machine learning models can benefit
markedly from an ensemble strategy that integrates across different scales and ROI
definitions. Figure 3.1 shows a general schematic of our proposed framework. In
this work, we conducted a thorough empirical evaluation of different choices for
brain parcellations.
Another important factor in connectome-based machine learning pertains to the
choice of the classification algorithm. A large body of related work in the literature
has focused on simple linear predictive models using vectorized connectivity data.
73
A relatively recent trend is to exploit neural networks for graph-structured data,
such as Graph Convolution Networks or BrainNet-CNN, to make individual-level
predictions on connectomes. Ktena et al. [187] applied spectral graph convolutions
in a distance-metric learning framework to train a k-nearest neighbor classifier on
connectivity data. In a similar vein, Kawahara et al. [188] proposed the Brain-
NetCNN architecture that extends convolutional neural networks (CNNs) to handle
graph-structured data. CNNs are motivated via the translation-invariance prop-
erty of image-based classification problems and can exploit voxel/pixel resolution
data. On the other hand, BrainNetCNN works directly with an adjacency matrix
derived from the connectome data, while disregarding spatial information. The
model parameter count would scale according to the number of ROIs, making the
utilization of voxel-level connectivity infeasible with this approach. As we discuss
below, we propose an alternative representation of connectivity data, which allows
us to leverage modern deep learning architectures, like CNNs, to build a prediction
model that exploits the full-resolution 3D spatial structure of rs-fMRI without
having to learn too many model parameters.
In this work, we consider two applications: discrimination of autism patients
and healthy controls; and regression of age. The first problem is a particularly
challenging one. Several previous studies have reported altered functional con-
nectivity patterns in Autism Spectrum Disorder (ASD) patients [189, 190, 191,
192]. While studies using small samples have reported classification accuracies over
75% [193], application of similar models on large heterogeneous datasets, such as
ABIDE [194], have shown more modest performance levels over a wide range of
connectome preprocessing schemes (accuracies that range 60-67%) [186].
Our main contributions in this paper are:
74
• An extensive evaluation of the influence of brain parcellations on functional
connectome-based machine learning models
• An ensemble learning strategy for combining predictions from multiple classi-
fiers corresponding to different brain parcellations
• An easy-to-implement 3D CNN framework for connectome-based classification
3.1.2 Materials and Methods
Dataset
The Autism Brain Imaging Data Exchange (ABIDE) is a multi-site consortium
aggregating and openly sharing anatomical, functional MRI and phenotypic datasets
of individuals diagnosed with ASD, as well as healthy controls (HC) [194]. The
first phase of ABIDE (ABIDE-I) collected data from 1,112 individuals, comprising
539 individuals diagnosed with ASD and 573 typical controls across 17 sites. The
second phase (ABIDE-II) aggregated 1,114 additional datasets, comprising 521
individuals with ASD and 593 healthy controls across 19 sites.
Preprocessing of fMRI Data
The Preprocessed Connectomes Project (PCP) released preprocessed versions of
ABIDE-I using several pipelines [195]. We used the data processed through the
Configurable Pipeline for the Analysis of Connectomes (CPAC). This pipeline
performs motion correction, global mean intensity normalization and standardiza-
tion of functional data to MNI space (3x3x3 mm resolution) before the extraction
of ROI time series. Among the different strategies in the release, our analysis
75
used data de-noised by regression of nuisance signals including motion parameters,
CompCor WM+CSF components, and global signal, followed by band-pass filtering
(0.01-0.1Hz). We note that we have experimented with alternate preprocessing
strategies that include/exclude the global signal regression and CompCor steps.
These results are presented in the Supplementary Section 7.6.
We preprocessed the ABIDE-II dataset following the same sequence of steps
listed for ABIDE-I in CPAC (using the version v1.0.2a). Since manual quality
control (QC) was not yet available for ABIDE-II, we performed an automatic QC
by selecting those subjects that retained at least 100 frames or 4 minutes of fMRI
scans after motion scrubbing [196]. Motion scrubbing was performed based on
Framewise Displacement (FD), discarding one volume before and two volumes after
the frame with FD exceeding 0.5mm [197].
Cohort selection
In our experiments, we used ABIDE-I subject data that passed manual QC by all
the functional raters. This yielded a final sample size of 774 ABIDE-I subjects,
comprising 379 subjects with ASD and 395 typical controls. As an independent test
dataset, we employed ABIDE-II subjects from sites that participated in ABIDE-I
and used the same MRI sequence parameters for data collection. After automatic
QC, we ended up with a final ABIDE-II sample size of 163 individuals with ASD
and 230 healthy controls. For age prediction, we only considered healthy controls.
Furthermore, subjects whose age were more than 3.5 standard deviations away from
the median were excluded from the task of age prediction. Table 3.1 summarizes
the dataset characteristics for the two prediction tasks considered in this study.
76
Dataset Prediction Sample Size Median Age (Range) in yrs
ABIDE-I Age 387 13.8 (6.5-29.1)
ABIDE-I ASD/HC 379/395 13.9 (6.5-56.2)
ABIDE-II Age 213 10.6 (5.8-18.8)
ABIDE-II ASD/HC 163/230 11.0 (5.2-38.9)
Table 3.1: Composition of Cohorts
Extracting ROI time series from atlases
In our experiments, we considered all atlases that were used for ROI time series ex-
traction in PCP. These include the following seven atlases: Talaraich and Tournoux
(TT, R=97), Harvard-Oxford (HO, R=111), Automated Anatomical Labelling
(AAL, R=116), Eickhoff-Zilles (EZ, R=116), Dosenbach 160 (DOS160, R=161),
Craddock 200 (CC200, R=200), and Craddock 400 (CC400, R=392), where R is
the number of ROIs [198, 199, 200, 201, 202, 203, 204, 205, 206].
For our 3D CNN model, described below, the parcellated regions were used as
target ROIs to derive the input connectivity features at the voxel level. For the
non-CNN benchmark models, also described below, each atlas was used to define
a corresponding connectivity matrix which was fed as input to each model after
collapsing into a vector. We report results for ensemble learning strategies as well,
where we combined the predictions of models corresponding to individual atlases.
Creating stochastic parcellations
Stochastic parcellations were created by Poisson Disk Sampling using the method
described in [207]. Given a number of ROIs, this approach divides the gray matter
voxels (as defined by a given mask) into roughly equal-sized parcels while ensuring
77
that the parcels do not cross hemisphere boundaries. Stochasticity is introduced in
the ROI center locations, and all the remaining voxels are assigned to the closest
region center. These centers are kept a minimum distance apart based on the desired
number of regions in the parcellation. Further details about the sampling approach
are provided in Supplementary Section A.2. All parcellations were created in the
MNI152 template at a 3mm resolution, same as the resolution of the preprocessed
functional data. For creating these parcellations, we relied on a whole brain gray
matter mask including sub-cortical structures. To create the mask, we took the
union of the gray matter tissue prior provided in the standard MNI152 template
and the cortical mantle mask used in [184]. Some example stochastic parcellations
are shown in Figure 3.2 against atlases at similar resolutions.
Figure 3.2: ROI masks for example SPs and atlas at each of the four spatial scales
considered in this study.
78
3D Convolutional Neural Network Approach
Here, we present our novel strategy to adopt a 3D CNN architecture for use with
connectomic data.
Loosely reminiscent of the biological visual system, CNNs use spatially localized
filters to detect local image features. Unlike fully connected layers where every unit
is connected to all other units of the previous layer, convolutional layers employ
a structured arrangement where each unit is connected to only a small subset of
spatially connected units in the input image channels. Further, the weights of these
connections are shared between the units of the convolutional layer so that the
same feature can be detected regardless of its spatial location. Mathematically,
a convolutional layer of the form Y=Ow(X) operates on an M-dimensional input
X(v)=(X1(v),....,XM(v)) by applying a set of filters {W={wm,n}, m=1,...,M;
n=1,...,N}. Here, v is used to index the pixel or voxel (in case of 3D convolution).
After applying an elementwise non-linearity φ (such as a logistic function) , this
produces an N-dimensional output Y(v)=(Y1(v),....,YN(v)). Each element Yn(v),
known as a feature map, is thus given as,
∑M
Yn(v) = φ( (Xm ∗ wm,n)(v)), (3.1)
m=1
where * denotes the standard spatial convolution operation.The convolutional
layers in CNNs are often interspersed with pooling layers that reduce the size of
feature maps and offer translation invariance. Max-pooling is the most popular
pooling operation. It down-samples each input feature map (commonly referred to
as a channel) separately by selecting the maximum feature response in pre-fixed
local neighborhoods. A max-pooling Yi = P (Xi) operation on channel i is thus
defined as, Yi(v)=Max(Xi(v̄): v̄ in neighborhood of v). In 3D, for example, the
79
neighborhood can be a 3 x 3 x 3 cube around each voxel. The convolutional and
max-pooling layers form the backbone of a CNN. A CNN architecture is constructed
by combining multiple layers that successively learn more complex features from
the input images. For example, with L layers the output can be mathematically
expressed as (Ow(L),...P ◦ Ow(1))(X). Since we are considering an image classification
problem, we add fully connected layers to the flattened output at the end of a CNN.
Research in visual recognition has shown that fully connected feedforward
architectures don’t scale well to full images. Instead, neural network architectures
with local connectivity, such as CNNs, are much more suitable when dealing with
high-dimensional images. The shared weights of the CNN architecture facilitate
learning with fewer parameters. 3D Convolutional layers thus transform an input
4D (3D multi-channel) volume to an output 4D volume. Each layer learns a set
of spatial filters that activate in response to distinct visual patterns. Replicating
or convolving each filter across the volume allows the corresponding pattern to
be detected irrespective of its spatial location. Finally, the outputs from all
filters are stacked along the 4th dimension to create a 4D feature map. Multiple
convolutional layers coupled with pooling operations create global representations
from local patterns. Stacking fully connected layers at the end after convolutional
and down-sampling operations dramatically reduces the model parameter count for
classification.
In our proposed approach, the input to the CNN is formed by concatenat-
ing voxel-level maps of “connectivity fingerprints”, which are represented as a
multi-channel 3D volume. Each channel is a connectivity feature, such as the
Pearson correlation between each voxel’s time series and the average signal within
a target ROI. In our implementation, we use both atlas-based and stochastic brain
80
parcellation schemes to define target ROIs. The total number of input channels
thus represents the number of ROIs used for creating voxel-level fingerprints. For
each parcellation scheme (atlas-based or stochastic), we trained a separate model.
In our experiments, we employed a simple CNN architecture, illustrated in
Fig. 3.3. Our architecture has several convolutional layers, interspersed with max-
pooling based down-sampling layers, followed by a couple of densely connected layers.
The models were trained with a mini-batch size of 64, until convergence of validation
loss. For classification, we used binary cross-entropy, whereas for regression we
adopted mean squared difference as the loss function. The neural network weights
were optimized via stochastic gradient descent (SGD) for classification and Adam
for regression. The learning rate and momentum for SGD were set to 0.001 and
0.9 respectively. Learning rate of Adam was set to 0.0005. For age regression, we
employ a stochastic weight averaging strategy where we average the neural network
weights over last 20 epochs. The same architecture and settings were used for all
atlases and stochastic parcellations. We note that each atlas is defined on a unique
gray matter mask. To ensure that all prediction models (benchmark and proposed)
relied on information from the same voxels, the atlas-specific gray matter mask
was applied to the voxel-level connectivity fingerprint data before feeding into the
proposed convolutional architecture. For stochastic parcellations, the custom gray
matter mask as described above was used for masking the fingerprints. The code
and stochastic parcellations have been made available at: https://github.com/
mk2299/Ensemble3DCNN_connectomes.
Benchmark Methods
In our experiments, we implemented following benchmark methods.
81
Figure 3.3: Proposed CNN approach. All operations are in 3D volume. 2D
correlation maps are shown for illustration only. For the age prediction task, an
additional Max-Pooling and Batch-Normalization[208] operation followed the first
and second convolutional layer.
Ridge Regression A linear regression model was trained with squared loss and
α times the squared norm of the weight vector (See Appendix). For classification,
the ground truth labels were encoded as ± 1 for the two output categories. We
tested 10 linearly spaced values for the hyper-parameter α in the range [0.1,10] and
report for the value with the highest cross-validation accuracy.
Support Vector Machine We implemented a standard SVM as a benchmark
(See Appendix). We found that a radial basis function (RBF) kernel performed
better than a linear model. Thus we report results for the RBF-kernel SVM. The
two hyper-parameters (RBF kernel width γ and and misclassification cost weight
C) were fine-tuned by maximizing cross-validation accuracy via a grid search. For
regression, we implemented the standard SVR scheme with an - insensitive loss
function, optimizing for the -tube and penalty parameter of the error term via
grid search.
82
Fully Connected Architecture The fully-connected neural network (FCN)
architecture takes as input functional connectivity estimates between pairs of ROIs,
which is vectorized and processed by a feed-forward network. We implemented
following architecture, which performed best on ABIDE-I cross-validation: 4 fully
connected hidden layers, with 800, 500, 100 and 20 numbers of features and each
linear layer followed by an elementwise Exponential Linear Unit (ELU) activation.
Dropout regularization parameter was set to 0.2 and applied to each layer during
training. For classification, the output node was a sigmoid, and cross-entropy loss
was used. For age prediction, the sigmoidal output was replaced with a linear
activation and mean squared difference was used as the loss function. The models
were trained with a mini-batch size of 64, until convergence of validation loss.
SGD was used as the optimizer with learning rate and momentum set to 0.01
and 0.9 respectively for classification. For age prediction, a smaller learning rate of
0.001 was used.
BrainNet Convolutional Neural Networks BrainNet CNN, originally pro-
posed in [188], utilizes specialized kernels to handle connectomic data. Their work
described novel edge-to-edge, edge-to-node and node-to-graph convolutional layers
that can potentially capture topological relationships between network edges. For
BrainNet CNN, we implemented the following architecture that worked best on
ABIDE-I cross-validation: 1 edge-to-node layer with 256 filters, followed by a
node-to-graph layer with 128 output nodes and finally a dense layer with single
output. A leaky ReLU non-linearity with alpha equal to 0.33 was applied to the
output of each layer except the last layer. The activation of the last layer was
set to linear and sigmoid for the regression and classification tasks, respectively.
Dropout regularization with rate 0.2 was used for the edge-to-node layer. Similar
83
to [188], Euclidean loss was minimized for age regression, whereas cross-entropy
loss was used to optimize the classification models. The models were trained for
1000 iterations using SGD with momentum equal to 0.9. The learning rate was set
to 0.0005 for age prediction and 0.008 for ASD/Healthy classification. The training
curves were monitored for atlases to ensure convergence.
Ensemble Learning
In our experiments, we explored two ensemble learning strategies. The first one
is what we call multi-atlas ensemble (or MA-Ensemble). MA-Ensemble averages
the predictions of the models of a specific method (e.g., BrainNet CNN) computed
using each one of the seven atlases. For classification, the final prediction is
computed as the majority vote of the individual binary class predictions. For
regression, the ensemble prediction is simply the mean. The second ensemble
strategy (SP-Ensemble) averages across the models of a specific method computed
using stochastic parcellations. In our experiments, unless stated otherwise, we used
30 stochastic parcellations at each of the following four spatial scales: 110, 160, 200
and 400 ROIs. These scales were chosen in accordance with existing atlases. Thus
the SP-Ensemble’s prediction was computed based on fusing 120 (30 × 4 scales)
models. We also implemented single-scale SP-Ensemble models, which averaged
over the 30 parcellations at the same spatial scale.
Visualizing the CNN model
In order to understand the connectivity features captured by the CNN model,
we employed the saliency map approach of [209]. This visualization technique
computes the gradient of the output prediction with respect to the input image
84
ASD/HC Classification Accuracy (ABIDE-II)
Parcellation  Ridge  SVM  FCN  BrainNet  3D-CNN
HO 63.3 68.7 67.7 66.1 67.7
CC200 67.4 70.7 71.5 70.2 72.8
EZ 63.3 66.1 63.8 64.4 66.4
TT 66.1 67.4 65.9 67.4 70.0
CC400 69.4 68.2 69.9 71.5 70.5
AAL 63.3 65.9 65.4 64.6 69.5
DOS160 66.7 63.6 66.1 64.6 67.0
MA-Ensemble 69.7 70.0 69.9 70.7 71.7
SP-Ensemble 71.7 71.2 71.2 70.5 72.3
Table 3.2: Classification accuracy for ASD vs. Control: Independent test on
ABIDE-II of baseline models and proposed CNN approach. For each row, best
results are bolded. For each column, best results are italicized. Green indicates
better performance, whereas orange/red highlights worse performance.
voxel values, i.e., the 3D volume, using a single backward pass through the trained
neural network. We then computed voxel-level saliency as the maximum absolute
gradient value across all input channels corresponding to different target ROIs.
More formally, consider an input image I, representing the connectivity fingerprints
of V voxels with R ROI signals. The saliency weights w RV×R are computed by
taking the absolute value of the gradient of neural network output O with respect
to the input image, i.e., w = |∂O |. In order to obtain the saliency at the voxel
∂I
level S  RV , we take the maximum across all the ROIs, i.e., Si = max1≤j≤R wij.
Finally, to visualize an ensemble model, we averaged the individual saliency maps
that made up the ensemble.
3.1.3 Results
Experiments
In our experiments, we considered two tasks: i) binary classification of autism vs
healthy, and ii) age prediction. For each task, we implemented two evaluation
85
Age RMSE (ABIDE-II)
Parcellation Ridge SVM FCN BrainNet 3D-CNN
HO 3.05 2.86 2.79 2.82 2.48
CC200 2.74 2.71 2.47 2.62 2.31
EZ 2.98 2.72 2.71 2.96 2.23
TT 3.10 2.83 2.87 3.02 2.24
CC400 2.76 2.83 2.41 2.55 2.27
AAL 2.84 2.74 2.69 2.75 2.33
DOS160 3.48 3.34 3.22 3.32 2.31
MA-Ensemble 2.72 2.81 2.47 2.55 2.15
SP-Ensemble 2.68 2.69 2.38 2.55 2.15
Table 3.3: Root mean squared error (RMSE in years) for age prediction: Independent
test on ABIDE-II for benchmark models and proposed CNN approach. For each
row, best results are bolded. For each column, best results are italicized.
schemes. First, we conducted 10-fold cross-validation on the ABIDE-I dataset,
so that we could present results that were comparable to previously reported
classification results such as [171, 186]. Second, we trained each model on the
entire ABIDE-I dataset and computed test performance on the independent ABIDE-
II set. We report classification accuracy and the receiver operating curves (ROC),
along with corresponding area under the curves (AUC) for each of these scenarios
under various combinations of parcellation schemes and prediction algorithms. For
age prediction, we report the root mean squared error (RMSE).
Evaluation of Prediction Performance
Table 3.2 shows the independent test performance for different models on the
classification problem. The proposed 3D CNN approach performs at least as good
as, and often better than, the benchmark methods, including the fully-connected
deep neural network (FCN) and BrainNetCNN. In particular, the 3D CNN approach
performs favorably against other algorithms for all but two parcellation schemes,
including the ensembles. Similarly, the SP-Ensemble achieves the best ABIDE-
86
Figure 3.4: ASD-HC Classification: Receiver Operating Curves for independent
validation on ABIDE-2
I cross-validation for most algorithms, including the 3D CNN. The ABIDE-I
cross-validation results, reported in Table A.2, are in general compatible with
the independent test results, where the 3D CNN and SP-Ensemble techniques
mostly outperform the competition. Figure 3.4 shows the Receiver Operating
Characteristic (ROC) curves for SP-Ensemble models for the different algorithms
on the independent ABIDE-II test dataset. We observe that the 3D-CNN SP-
Ensemble achieves an AUC of ∼ 77% and an accuracy of ∼ 72% on independent
ABIDE-II data, slightly better than the state-of-the-art cross-validation on ABIDE-I
for ASD/HC classification [192], with FCN and Brain-Net CNN ensembles yielding
a similar performance. ROC Curves for individual atlases are shown in Figure A.4.
Table 3.3 lists independent test results for the age prediction task on ABIDE-II,
and Table A.3 reports the 10-fold cross-validation error on ABIDE-I. The 3D CNN
approach consistently shows superior performance, yielding the best results for
87
all parcellation schemes. Similar to the classification scenario, SP-Ensemble or
MA-Ensemble also yield the best cross-validation and independent test performance
values for the majority of the algorithms, including 3D CNN. Overall, the best
accuracy is achieved by SP-Ensemble 3D CNN, which yields a root mean squared
error of 3.28 years on ABIDE-I cross-validation and 2.15 years on the independent
ABIDE-II dataset. We also estimated mean absolute error (MAE) of all models on
ABIDE-II and observed a similar trend, as reported in Table A.6.
Comparison of stochastic parcellations and atlases
Here, our objective is to conduct a detailed investigation of how the choice of
ROIs affects prediction performance for different machine learning (ML) algorithms.
For each ML algorithm and each parcellation we have a model trained on the
ABIDE-I data, which we then used on the independent ABIDE-II data to quantify
prediction accuracy. Figure 3.5 shows the distribution of accuracy values (estimated
with a kernel density model) obtained using stochastic parcellations , while also
illustrating the results for each of the atlases and the scale-specific SP-ensembles.
The scale-specific SP-Ensemble strategy, as the name implies, averaged the models
corresponding to the 30 stochastic parcellations in each scale. We observe that the
atlas-based models performed no better than typical stochastic parcellation models,
independent of scale and algorithm. This result offers an intriguing possibility:
perhaps we do not need anatomically or functionally derived brain parcellations to
train machine learning models since stochastic parcellations perform equally well
or no worse in practice.
Our proposed SP-Ensemble CNN strategy yielded accuracy results that were
about as good as the best scale-specific SP-Ensemble model. Finally, the ensemble
88
Figure 3.5: Violin plots showing the spread of prediction accuracies/errors for
stochastic parcellations at multiple network scales for different classification models.
Mean accuracy/error of individual violins is denoted by ’Mean SPs’. Performance of
individual atlases is compared with SPs with the closest # of ROIs and is denoted
as ’Single Atlas’. Results are computed by training models on entire ABIDE-1
cohort and testing on the independent ABIDE-2 cohort.
89
Figure 3.6: Distribution of Ridge models’ performance for stochastic parcellations
created using the same gray-matter mask as the corresponding atlas. Red denotes
the atlas model’s accuracy and black indicates the SP-Ensemble accuracy.
models were almost always better than the atlas-based models and they compared
favorably against the individual stochastic parcellation models. The same ob-
servations can be made for ABIDE-I cross-validation (see Supplementary Figure
A.1).
In above analysis, one potential confound was the different gray matter masks
of atlases and stochastic parcellations (SPs). In order to account for this confound,
we conducted following analysis. For each of the atlases, we generated 100 SPs
using the same gray matter mask as the atlas. We excluded DOS160 because it
does not rely on a well-defined gray matter mask and places discontiguous 4.5 mm
spherical regions over fixed coordinates in the brain (sampling only 5% of brain
voxels). We then trained on each of these SPs using the same hyper-parameters that
were found to be optimal for the corresponding atlas. Here, we show the results
for ridge regression (the model that was fastest to train), but we obtained similar
results for all other algorithms as well. As can be seen from Figure 3.6, for most
atlases and corresponding gray matter masks, the model trained on the atlas ROIs
90
(a) ASD/Healthy Classification
(b) Age prediction
Figure 3.7: Mean saliency maps of trained 3D-CNN models for SP-Ensemble
performed no better than an average SP model. Furthermore, and importantly,
the SP-Ensemble (computed by averaging across SPs on the atlas-specific mask)
yielded better performance than the atlas models for all atlases.
Visualization
An important goal of machine-learning tools in neuroimaging is to generate novel
insights linking imaging biomarkers with disease or phenotypic traits. Visualization
techniques for CNNs can help reveal important features used by the model for
discriminating between output classes. Figure 3.7 shows the saliency maps computed
91
for the SP-Ensemble CNN ASD classification and age prediction models. As can
be seen from these maps, the precuneus, often considered a core node of the default
mode network [210], seems to play a significant role for both prediction problems.
However, there are also salient regions that are unique to each problem. For example,
the anterior cingulate/ventromedial prefrontal cotex, a region that has been linked
to autism [211], was distinctly highlighted for the ASD classification problem. The
left parietal cortex was also emphasized for ASD prediction, which is consistent
with the laterilized activation observed in this region in Autism patients [212]. On
the other hand, for age prediction, the left dorsolateral prefrontal cortex (dlPFC) is
a uniquely salient region. The dlPFC is associated with executive functions, such
as working memory and abstract reasoning. For working memory, dlPFC’s function
seems to be age-associated and more lateralized in younger adults [213].
3.1.4 Discussion
In this study, we presented a detailed empirical analysis of how the choice of ROIs
can impact the performance of machine learning models trained on functional
connectomes. We considered several machine learning algorithms, together with a
range of spatial scales and parcellation schemes, including the popular atlas-based
techniques and a stochastic approach. Our analysis suggests that using a single atlas
for summarizing the connectome data is often sub-optimal for training machine
learning models, and significantly more accurate predictions can be achieved with
an ensemble approach that averages across models trained with different parcel-
lation schemes. Furthermore, we demonstrated that averaging across stochastic
parcellations can achieve very high accuracy values, often surpassing atlas-based
models. Our findings resonate with several other studies that compare stochastic
92
parcellations and atlases, although in different contexts. Craddock et al. [214]
compared spatially constrained functional parcellations obtained from spectral
clustering with anatomically constrained parcellations produced from random clus-
tering. Random parcellations performed as well as functional parcellations and
better than anatomical atlases on metrics of cluster homogeneity and representation
accuracy. Based on this, the study reflected that sufficiently small ROIs perform
well for functional network analysis regardless of their spatial position. Fornito
et al. [215] generated stochastic parcellations by randomly sub-dividing the AAL
atlas and showed that functional organizational properties are independent of the
parcellation template at the same network resolution, although significant variability
is observed across scales. Studies on diffusion-MRI based anatomical networks have
similarly shown that topological attributes and network organizational parameters
are consistent across different parcellation schemes, including random parcellations
[207, 216].
Another main contribution of this study is a novel approach to employ a 3D
CNN architecture on functional connectivity data. Convolutional neural networks
achieve state-of-the-art performance on many image-based prediction tasks, as
they take advantage of the full spatial resolution of the data and the translation
invariance property of the problem. Our proposed approach treats voxel-level
connectivity fingerprints as input channels to a conventional 3D CNN framework.
Spatial convolutions can capture local structural or topographic patterns in the
data, such as connectivity gradients. Successively stacking convolutional layers in
our architecture would hierarchically yield higher-order features that can capture
information relevant for classification. Studies have shown that individual-level
network topography serves as a fingerprint of human behavior [217]. Our multi-
channel input image comprising connectivity fingerprints, coupled with CNNs,
93
provides a natural framework to capture individual-level differences in topography
as they relate to behavior or disease. This strategy contrasts with current practice
where the input to machine learning models are pairwise ROI functional correlations.
This makes the model more susceptible to uncertainty caused by parcellation choice.
This can be seen in our experiments where there is relatively larger variance
in prediction performance across atlases for the fully-connected neural network.
Thus, CNNs with connectivity map inputs can offer a more robust alternative to
classification approaches that only rely on ROI-level connectivity information, such
as the BrainNet-CNN. Our results demonstrate that when tailored for connectomes,
CNNs offer a promising opportunity to probe brain networks in disease.
Machine learning practitioners have to make a number of preprocessing choices
in extracting connectomic features to analyze. While there is no one-size-fits-
all solution across different tasks, in the context of machine learning models of
functional connectivity, we present some interesting empirical observations below.
Ensemble learning
The motivation behind using multiple stochastic parcellations for prediction is
grounded in the concept of ensemble learning. The core idea is to integrate out a
latent variable (i.e., parcels or ROI definitions) from the learning problem [218]. This
approach also makes the predictions more robust to the precise parcellation scheme.
As shown above, the performance of atlas-based models can vary significantly
(∼5-10% for parcellations at the same scale). In such a scenario, ensemble learning
over multiple stochastic parcellations can be a robust strategy that yields reliable
predictions.
94
Table 3.4: Classification/regression performance of FCN with a high-resolution
parcellation ( ∼ 1024 ROIs) [216]
Network granularity
We explored the impact of network granularity on prediction performance of machine
learning algorithms for connectomes. Our analysis suggests that better prediction
performance can be expected with parcellations at higher granularity upto ∼ 400
ROIs. To further investigate this trend on ROI-level models, we trained the fully-
connected network (FCN), that is generally the best performing baseline algorithm,
on both the prediction tasks for the 1024 node parcellation proposed in [216]. As
can be seen from Table 3.4, an atlas with 1024 regions is comparable to the CC200
atlas for ASD/HC classification in ABIDE-II. However, the performance actually
degrades significantly (in comparison to CC200 or CC400) for the age prediction
task.
Our evaluations contradict with a previously reported result that a coarser
network scale (∼ 100-150 ROIs) is more suitable for autism classification [186].
In their paper, these conclusions were drawn by comparing the performances
achieved with a few atlases. However, inferring trends from a small number of
atlases can be misleading, since factors like the boundary definitions of structures
(cortical/subcortical) or the particular gray matter mask used, will effect results.
Stochastic parcellations can control for these confounds and depict unbiased trends
across network scales.
95
Number of gray matter voxels
Our empirical study suggests that there is no direct correlation between the number
of voxels in the gray matter mask and a model’s prediction performance. However,
we do observe that the choice of gray matter mask can impact results. For example,
the DOS160 atlas with as few as ∼ 3,039 voxels shows performance no worse than
other atlases at the same resolution (HO, EZ, TT and AAL) with ∼ 20x more
voxels.
Visualization
Saliency maps provide a valuable visualization strategy to probe deep neural network
models. We visualized the saliency maps from 3D CNN models trained on ROIs
extracted using both atlases and stochastic parcellations. As shown in Figure 3.7
and Supplementary Figures A.2 and A.3, these maps are remarkably consistent.
These maps reveal that the precuneus, which is a hub of the default mode net-
work and associated with ASD and age, plays an important role for both prediction
problems. There were also uniquely highlighted regions, such as the anterior cingu-
late/ventromedial prefrontal cortex for ASD classification and the left dorsolateral
prefrontal cortex (dlPFC) for age prediction. Several studies have suggested the
potential of DMN connectivity as a neurophenotype of autism. Chen at el. [219]
trained a random forest classifier that distinguished ASD subjects from healthy
controls with high accuracy, and showed that default mode and somatosensory
regions contribute significantly to diagnostic accuracy. Similarly, Abraham et al.
[186] revealed discriminative connections in the DMN for ASD/HC classification
within a larger heterogeneous cohort of the ABIDE dataset. Furthermore, it has
96
Figure 3.8: Motion correlations
been shown that the connectivity of posterior cingulate cortex (PCC) and aberra-
tions in the medial prefrontal cortex node of the DMN can predict social deficits in
children with ASD [220]. Our results corroborate the findings of these studies, and
suggest a crucial involvement of DMN in autism.
Influence of motion
Several studies have shown differences in head motion parameters during fMRI
between healthy controls and diseased populations, or between subjects from
different age groups [221, 222]. This, in turn, can manifest as artifacts in the
derived resting-state connectivity [223]. Although our independent test data was
motion scrubbed, we performed additional analyses to rule out the confounding
effect of motion in classifier decisions. We selected a cohort of 151 ASD subjects
with motion-matched healthy controls from our independent dataset and analyzed
the correlation of 4 motion parameters with classifier predictions. These include the
97
root-mean-square framewise displacement, mean relative displacement, maximum
absolute displacement and the number of micro-movements greater than 0.5mm.
These summary statistics were chosen in accordance with previous reports of motion
artifacts in rs-fMRI[196]. As shown in Figure 3.8, no significant correlations were
observed between motion variables and the predictions of SP-Ensemble (model
average over all atlases). In this motion-matched cohort, classification accuracy of
71.8% was obtained using 3D-CNN.
For our regression task, there was no significant correlation between a subject’s
age and any of these motion parameters in our cohorts.
Recommendations
Based on our experiments, we make two claims in this study: (a) 3D-CNN performs
favorably compared to alternative baseline algorithms, and (b) Ensemble models
that average across parcellation schemes consistently perform better than individual
atlas-based models and are thus a safer choice for supervised machine learning on
connectomes. This is because individual atlases can show significant variability in
classification/regression performance and finding the optimal atlas for a prediction
task among the wide range of available atlases might not be feasible. Figure 3.9
shows the probability density estimates for the difference in performance between
(a) 3D-CNN versus baseline algorithms as evaluated with the SP-Ensemble strategy,
and (b) SP-Ensemble versus single atlas implemented with the 3D-CNN model.
These estimates are presented for both our prediction tasks. For this experiment, we
estimate the evaluation metrics (AUC-ROC for ASD/HC classification and RMSE
for age regression) on 10,000 bootstrapped samples from ABIDE-II. These results
demonstrate that the SP-Ensemble approach consistently achieves an accuracy
98
as good as the best performing single-atlas model. Further, the 3D-CNN model
consistently outperforms the baseline algorithms for the age prediction task, with
more prominent improvements for individual atlas models. This can be seen from
Tables 3.2 and 3.3. We note that when using the ensemble strategy, the differences
between models are marginal and might be irrelevant in some practical applications.
For instance, the SP-Ensemble performance on ASD/HC classification task is
comparable among 3D-CNN, FCN or BrainNet-CNN, with slight improvements
over linear models. Thus, if time and/or computational resources impose constraints,
it might be more suitable to prefer simpler models like FCN or SVM over 3D-CNN
for example, especially with the ensemble approach.
3.1.5 Limitations and future work
Throughout our analysis, Pearson’s correlation was chosen to measure functional
connectivity strength between different brain regions. Several other correlation
metrics, including tangent-based and partial correlation have been shown to yield
superior classification performance in prior studies [102, 186]. While we do not
expect this to affect the general conclusions and findings of our study, the choice of
the correlation metric still remains an arbitrary decision in any machine learning
pipeline for connectomes.
Due to the heavy computational burden required for training multiple deep
learning models, we only considered one particular scheme for creating stochastic
parcellations, i.e., Poisson Disk Sampling. Alternative strategies for creating random
parcellations have also been proposed, for instance, through stochastic sub-division
of anatomically derived ROIs into smaller parcels [224]. It is also possible to
randomize several other more popular schemes for parcellating the brain, such as,
99
Figure 3.9: Kernel density estimates of the probability distributions for the per-
formance difference between models, computed based on 10000 bootstrap samples
from ABIDE-II. Values to the left of the black vertical line indicate bootstrap
samples where the proposed approach (3D CNN or SP-Ensemble) under-performed
compared to the competing method.
using Ward’s clustering on functional data from sub-samples of the population
[218] or creating Geometric parcellations with different initializations [181].
While the proposed CNN approach achieves promising accuracy on autism
detection and age prediction, there is room for further improvement. We have not
yet conducted a comprehensive optimization of the convolutional architecture. Fur-
thermore, there are likely more optimal choices than target ROI-based correlations
that are used as input to the model. An interesting alternative would be select
100
random gray matter vertices for connectivity profiling, as proposed in [184]. We
envision an end-to-end learning strategy that can enable the optimization of these
connectomic features.
Saliency maps provide an appealing visualization technique by mapping the
neural network activations back to input voxel space. Several modifications to
gradient-based back-propagation have been reported in literature that can poten-
tially highlight more informative features learnt by the model [225, 226]. Further,
the use of saliency maps need not be restricted to depicting group-averaged dis-
criminative features. Unsupervised learning on saliency maps can provide novel
insights into clinical subtypes of disease. It is also important to note that machine
learning techniques do not unequivocally provide evidence for the salient features
being directly associated with the disease or other target variables. However, when
combined with detailed future investigations, they can spur clinical discoveries.
3.1.6 Conclusion
The results presented in our paper showcase the utility of ensemble learning for
connectomes. Functional network based prediction models are impacted by several
a priori choices, the most pivotal of which is the ROI definition. We demonstrate
that ensembles of stochastic parcellations yield predictions that are significantly
more robust and accurate compared to single atlas-based approaches. Further, our
experiments highlight the potential of convolutional neural network models for
connectome-based classification.
101
3.2 Detecting abnormalities in resting-state dynamics: An
unsupervised learning approach
Abstract
Resting-state functional MRI (rs-fMRI) is a rich imaging modality that captures
spontaneous brain activity patterns, revealing clues about the connectomic orga-
nization of the human brain. While many rs-fMRI studies have focused on static
measures of functional connectivity, there has been a recent surge in examining
the temporal patterns in these data. In this paper, we explore two strategies for
capturing the normal variability in resting-state activity across a healthy popula-
tion: (a) an autoencoder approach on the rs-fMRI sequence, and (b) a next frame
prediction strategy. We show that both approaches can learn useful representations
of rs-fMRI data and demonstrate their novel application for abnormality detection
in the context of discriminating autism patients from healthy controls.
3.2.1 Introduction
Resting-state fMRI captures intrinsic neural activity, in the absence of external
stimuli and task requirements. Much of the research in this direction has aimed at
identifying connectivity based biomarkers, restricting the analysis to so-called “static”
functional connectivity measures that quantify the average degree of synchrony
between brain regions. For e.g., machine learning based strategies have been
used with static connectivity measures to parcellate the brain into functional
networks, and extract individual-level predictions about cognitive state or clinical
102
condition [227]. In recent years, there has been a surge in the study of the temporal
dynamics of rs-fMRI data, offering a complementary perspective on the functional
connectome and how it is altered in disease, development, and aging [228]. However,
to our knowledge, there has been a dearth of machine learning applications to
dynamic rs-fMRI analysis.
Thanks to large-scale datasets, modern machine learning methods have fueled
significant progress in computer vision. Compared to natural vision applications,
however, medical imaging poses a unique set of challenges. Data, particularly labeled
data, are often scarce in medical imaging applications. This makes data-hungry
methods such as supervised CNNs possibly less useful. One potential approach to
tackle the limited sample size issue is to exploit unsupervised or semi-supervised
learning strategies that don’t depend on large amounts of labeled training data. In
this paper, we explore the use of unsupervised end-to-end learning for capturing
rs-fMRI dynamics and demonstrate that the representations our models learn can
be useful for detecting abnormal patterns in data.
Related Work: Machine learning methods are increasingly used to compute
individual-level predictions from rs-fMRI data, e.g. about disease [227]. The
conventional approach of supervised learning relies on labeled training data and
uses hand-crafted features such as the static correlation between pairs of regions.
Such features fail to capture the dynamics of resting-state activity as it relates to
behavior or disease. Moreover, emerging data suggest that learning models that
exploit the full-resolution 4-dimensional fMRI data can potentially reveal more
discriminative resting-state biomarkers [229]. In this work, we are motivated by
this observation and our goal is to move away from hand-crafted features and take
full advantage of the spatio-temporal structure of rs-fMRI.
103
Unsupervised approaches such as clustering of static connectivity measures
have been previously used for disease classification and discovery of novel disease
sub-types [230]. Similarly, autoencoders have been used in pre-training to improve
generalization capabilities of supervised learning algorithms, as in [231]. An
alternative application of unsupervised learning is outlier detection. Here, the
goal is to identify data points that deviate markedly from normal samples. For
example, autoencoder models have been popular for outlier detection in video [232].
In recent years, predictive modeling has also been shown to be a powerful framework
in unsupervised feature learning of video representations [233]. In this approach, a
model is trained to predict future frames of a video sequence. These models learn
useful internal representations of the data that can in turn be used for anomaly
detection or downstream object recognition or classification tasks [234].
In the present paper, we propose a novel unsupervised approach that learns
rs-fMRI representations on voxel-level time-course data captured via a convolutional
RNN model, in an end-to-end learning fashion. Models are trained to predict the
next frame in an rs-fMRI sequence or to reconstruct the entire sequence. We apply
our approach to the novel problem of outlier detection in rs-fMRI, and demonstrate
its utility in discriminating autism patients from healthy controls.
3.2.2 Methodology
In this section, we describe the autoencoder and prediction models considered in the
study. As we demonstrate empirically, the models learn to accurately reconstruct
or predict “normal” resting-state activity in healthy subjects, but yield higher
reconstruction/prediction errors in patients.
104
Network building blocks
Convolutional networks: CNNs have achieved unprecedented levels of perfor-
mance across many vision tasks [235]. The main ingredients of CNNs include
convolutional layers that serve as feature extractors, and pooling/un-pooling layers
that perform down/up-sampling in resolution. In this paper, we employ encoder-
decoder style networks since we are reconstructing/predicting structured image
data, i.e., rs-fMRI frames. Encoder-decoder networks are widely deployed in image
segmentation and generation tasks, as in [236]. The encoding part computes a
cascade of increasingly high-level representations from the images, whereas the
decoding part reconstructs pixel-level features from these representations.
Convolutional-LSTM networks: Recurrent neural networks (RNNs), e.g.,
LSTMs [237], offer state-of-the-art results in many domains with sequential data,
such as speech or natural language processing. Conv-LSTM cells, an extension
of LSTM units, integrate convolutional layers with LSTM modules and allow the
temporal propagation of high-level spatial features captured by convolutional layers.
Conv-LSTM cells have shown remarkable performance in sequence forecasting
problems [238]. This stems from their ability to simultaneously capture rich spatial
and temporal structures in the data.
Next frame prediction model
Given a sequence of rs-fMRI frames, we trained a model to predict the next frame in
the sequence. To improve the localization accuracy of predicted frames and capture
spatio-temporal correlations at multiple resolutions, we incorporate skip connections
with Conv-LSTM modules in our architecture. This U-Net style architecture [236]
105
Figure 3.10: Next frame prediction model. Each cuboid represents a 3D (2 spatial
dimensions + time) feature map with number of features indicated on top. Flat
boxes represent 2D feature maps, with number of channels on top. Input is an
axial fMRI slice with T sequential frames. Conv-LSTM cell returns the last output
of the output sequence.
is shown in Figure 3.10. The input to the model is a 2D rs-fMRI sequence of T axial
slices. In the encoding layers, we used 3D convolutions and max pooling, where the
first two dimensions are the spatial coordinates on the axial cross-section and the
third dimension is time. We compared our prediction model with several baselines,
including: (a) simply using the last frame of the input sequence as a prediction
of the next frame; (b) a non-learning based extrapolation model that fits separate
cubic splines at each pixel on the input sequence; and (c) a non-recurrent 2-D U-Net
model that excludes the Conv-LSTM modules from the proposed architecture and
treats the temporal component of the input as T channels. We also considered (d)
an interpolation scheme that interpolated with cubic splines between the T frames
of the input sequence that precede the predicted frame and the frame after the
predicted frame. This interpolation method is different than the other methods as
it is not a forecasting model, yet we found it useful to assess the performance of
the other methods.
106
Autoencoder model
The autoencoder is an unsupervised learning approach that encodes the input into
a lower dimensional representation, which is then decoded into a reconstruction
of the input. The model is trained to minimize a distance function between the
reconstruction and input, such as the squared L2 distance. The architecture of
our reconstruction model is the same as the prediction model above, with two
important differences. First, there are no skip connections, which are indicated
as a “concatenate with crop” operation, to avoid the trivial solution of copying
input to the output. The second difference is that, in the decoder layers and the
output we have T frames, instead of a single frame. So in the visualization of
this architecture, those would be represented with cuboids and 3D convolution/up-
sampling operations. Further, we retained Conv-LSTM unit in the bottleneck to
capture temporal dependencies between the frames of a rs-fMRI sequence.
3.2.3 Experiments
Data
We conducted our experiments on data from the Autism Brain Imaging Data
Exchange (ABIDE) study [239]. Because of difference in TRs and other imaging
parameters across sites, we restricted our experiments to the acquisition site with
the largest sample size, namely NYU. We only used data that passed quality
assessments by all functional raters and retained enough time-points after motion
scrubbing for band-pass filtering. We randomly selected two thirds of the healthy
group (54 subjects) for training/validating the reconstruction & imputation models.
107
A validation split of 10% was used during training to monitor convergence of these
models. The remaining one-third group comprising 28 healthy controls was used
as test data to evaluate predictions/reconstruction performance for comparison
against ASD patients (N=67).
Rs-fMRI preprocessing included slice timing correction, motion correction,
global mean intensity normalization, standardization of functional data to MNI
space, global signal regression, motion scrubbing (volume censoring) and band-pass
filtering. We note that band-pass filtering was performed after motion scrubbing
to avoid any motion contamination. Individual rs-fMRI scans were normalized
between 0 to 1 by min-max scaling each-individual voxel’s time series. Finally, we
applied a binary gray matter mask to all 3D volumes [203].
Implementation Details
During training, we identified non-overlapping contiguous segments of (T + 1)
frames for each subject in the training set. For each such segment, we extracted all
axial slices and trained a unified model to predict the next frame, i.e, for a given
architecture a single model was trained for all subjects and axial slices, comprising
16,560 training instances. Squared loss was optimized with Adam and a learning
rate 1e-4. We implemented our code using Keras, with a TensorFlow back-end.
The network was trained for 150 epochs with a batch size of 32. Validation curves
were monitored to ensure convergence. We used same training paradigm for the
non-recurrent baseline U-Net model. In our experiments, we tried different values
for T and observed diminishing returns beyond T = 20 in the performance of the
next frame prediction models. The overall pattern in comparing the accuracy of
different models was the same. Thus, in the remainder we fix T = 20. We note that,
108
while not necessary, we fixed T = 20 for the autoencoder models too, which ensured
training was done on identical datasets for these different approaches. Once the
models were trained, we used them to compute predictions or reconstructions on
independent data, which included both controls and ASD patients. For each test
subject, we computed the mean squared error (between reconstruction/prediction
and ground truth frames) as a single metric. Note that we averaged over all frames
and pixels in an rs-fMRI scan. We hypothesized that this metric would be different
between patients and controls, demonstrating that it could be used as an outlier
detector. We also analyzed the voxel-level squared errors and conducted a statistical
comparison between patients and controls to reveal the anatomical distribution of
the differences.
3.2.4 Results
Next Frame prediction and reconstruction errors
We first demonstrate that the next frame in rs-fMRI sequence can be accurately
predicted. Table 3.5 shows the performance of the different methods we implemented.
We list both MSE and the mean Pearson’s correlation between predicted and ground
truth frames, computed within the gray matter mask on healthy test subjects. We
observe that the proposed recurrent U-Net architecture achieves the best prediction
performance, even exceeding the cubic-spline based interpolator, which was given
both the preceding 20 frames and the frame after the predicted frame. The
recurrent LSTM modules that capture the temporal dynamics also enabled a
significant boost in quality, as can be noted by comparing the performance of the
U-Net and proposed architecture. Finally, the U-Net models outperformed the
109
Imputation models Mean Squared Error Pearson’s Correlation
Last observation copy 0.01969 0.7558
Extrapolation 0.01203 0.8938
Interpolation* 0.00065 0.9939
Non-recurrent U-Net 0.00026 0.9967
Proposed recurrent U-Net 0.00007 0.9990
Table 3.5: Next frame prediction performance on healthy test subjects for different
models. *Interpolation model had access to the frame after the predicted frame.
Recurrent autoencoder: sequence length Mean squared error Pearson’s correlation
T=10 frames 0.0625 0.354
T=15 frames 0.0475 0.503
T=20 frames 0.0437 0.550
Table 3.6: Reconstruction performance of the proposed recurrent autoencoder on
healthy test subjects for different input sequence lengths.
non-learning based methods of extrapolation, suggesting that accounting for both
the spatial and temporal structure in the data yielded better results.
Table 3.6 shows the mean reconstruction errors of the autoencoder on healthy
test subjects for various input sequence lengths at test time. We note that the
performance is worse than next-frame prediction because of the absence of skip
connections. Reconstruction quality degraded with fewer frames suggesting that the
autoencoder is not reconstructing frames independently and is indeed exploiting the
long-term temporal dependencies between frames. For outlier detection, we thus
used the temporal window T=20 as it gives the best reconstruction performance
and captures longer dynamics.
110
Figure 3.11: Whisker plots showing reconstruction and prediction errors (mean
squared error) for ASD patients and controls, with proposed recurrent models
trained on T=20 consecutive frames. Points are individual subjects. The ends of
the box are upper and lower quartiles, the median is marked by a horizontal line
inside the box.
Model AUC (p-value)
Recurrent autoencoder 69.6 (0.00466)
U-Net imputation 62.5 (0.00293)
Recurrent U-Net imputation 65.9 (0.00151)
Table 3.7: Area under the ROC curve for discriminating ASD vs Controls. P-values
of the unpaired t-test comparing means of the two clinical groups are shown in
brackets.
Outlier Detection: Discriminating Patients and Controls
We were interested in examining whether the next frame prediction and reconstruc-
tion models can be used to detect outlier subjects. To test this, we computed mean
squared error on all test subjects, including healthy controls and ASD patients.
Figure 3.11 shows these error values for the proposed next frame prediction and
autoencoder models. Both models yield error values that are statistically signif-
icantly different between the two clinical groups. Further, AUC values obtained
with autoencoder and imputation models, as shown in Table 3.7, are on par with
recent supervised ASD v/s control classification results [240].
111
Figure 3.12: Statistical significance of the difference in regional reconstruction
error of the recurrent autoencoder between controls and ASD patients. FDR with
q = 0.05 was implemented for multiple testing correction. − log10 p values are
shown.
We also note that the non-recurrent U-Net benchmark achieves a weaker sepa-
ration between the two clinical groups. This indicates that the conv-LSTM layers
enhance diagnostic sensitivity presumably because they are more equipped to exploit
spatiotemporal structure in extracting representations. Importantly, we observed
no correlation between frame-wise displacement values (a widely used metric to
quantify subject motion) and the prediction/reconstruction errors- neither at the
frame-level (Pearson’s correlation -0.0161/0.0218, p = 0.0739/0.0251, computed on
non-motion scrubbed frames only) nor at the individual level (Pearson’s correlation
0.0033/0.1730, p = 0.9744/0.0936).
Finally, we were interested in exploring the anatomical differences in errors
between the two clinical groups. We thus conducted a t-test of of the regional
prediction error (averaged within the boundaries of the widely used AAL atlas
[203]) on the model with best AUC, i.e. the autoencoder. As can be seen from
Fig 3.12, significant differences were mainly constrained to the left hemisphere,
particularly localizing within the language network, involving the temporal and
frontal cortices, consistent with prior literature [241].
112
3.2.5 Discussion
We considered a novel unsupervised learning strategy to analyze rs-fMRI data,
where we train recurrent models to reconstruct rs-fMRI clips or to predict the
next frame in a sequence. Results indicate that the proposed recurrent U-Net
architecture produces very accurate predictions that yield a correlation greater
than 0.99 with ground truth. Furthermore, this performance is better than an
interpolation approach that had access to the frame after the predicted frame.
Next, we demonstrated the utility of the proposed models in detecting outliers in
rs-fMRI. Our results indicate that next frame prediction error or reconstruction
error can be used to discriminate patients from controls, achieving a classification
performance close to state-of-the-art results obtained with supervised methods.
There are several directions we will be exploring with this technique. For example,
we are interested in using the next frame prediction model to assess the quality of
individual frames, particularly in the context of motion and other artifacts. Another
possible application could be to use this model to impute frames that have been
discarded for motion scrubbing. Finally, we believe unsupervised models can offer
novel insights into the dynamics of resting state fluctuations.
113
CHAPTER 4
TOWARDS HOLISTIC ENCODING MODELS FOR PREDICTING
FMRI RESPONSES TO MULTIMODAL NATURALISTIC STIMULI
4.1 Introduction
Understanding the neural basis of sensory perception has been a long-standing
goal of neuroscience. Brain activity recordings of healthy subjects during “free
viewing” of movies present a powerful opportunity to build ecologically-sound
and generalizable models of sensory systems, also known as encoding models. In
neuroscience, stimulus-response relationships can be systematically understood
from two complementary standpoints. Encoding models map stimuli to fine-grained
neural activity via complex feature transformations. Conversely, decoding models
aim to predict stimulus attributes directly from neural recordings. In this thesis, we
explore the former (encoding) approach as a means of understanding how sensory
information is represented in the activity of different brain regions. Modeling neural
responses to naturalistic stimuli, in particular stimuli that reflect the complexity of
real-world scenes (e.g., movies), offers significant promise to aid in understanding
the human brain as it functions in everyday life; a central theme of this research is
to use predictive modelling techniques to convert neural data into understanding
and fundamental knowledge about the brain. Beyond satiating the spirit of scientific
curiosity, understanding the link between neural activity and complex thought can
potentially improve our understanding of neuropsychiatric disorders, creating novel
opportunities for neural prosthetics.
Deep neural networks trained on image or sound recognition tasks have emerged
as powerful models of computations underlying sensory processing, surpassing
114
traditional models of image or sound representation based on Gabor filters and
spectrotemporal filters, respectively, in mid-level and higher-order visual and
auditory regions. While this success is promising, existing encoding models based
on deep neural networks have been limited in their focus on limited portions
of the sensory space under naturalistic stimulation, ignoring the complex and
dynamic interactions of modalities (audio and vision) in this inherently context-
rich paradigm. This reductionism leads to sub-optimality in predictive models of
cortical responses as neural patterns evoked by movies are not simply a conjunction
of activations in modality-specific cortices by their respective uni-sensory inputs;
rather, there are known cross-modal influences as well as regions that receive
afferents from multiple senses. Longer narratives or movies further have an inherent
temporal structure; much of the meaning we infer is from stimulation sequences
rather than from instantaneous visual or auditory stimuli alone. To address this
limitation, we recently proposed a Deep Neural Network (DNN)-based encoding
model that captures three critical inductive biases about information processing
in the brain: namely, hierarchical processing, assimilation over longer timescales
and multi-sensory auditory- visual interactions. By developing and evaluating this
model on a large-scale movie-watching dataset, we demonstrated how incorporating
this joint information leads to remarkable prediction performance across large areas
of the cortex, well beyond the visual and auditory cortices into multi-sensory sites
and frontal cortex. Further, we demonstrated how these neural encoding models
trained solely on naturalistic data can allow us to interrogate the temporal and
sensory sensitivity of different brain regions.
115
4.2 Endowing neural encoding models with both audition
and vision and and stimulus history
Abstract
Naturalistic stimuli, such as movies, activate a substantial portion of the human
brain, invoking a response shared across individuals. Encoding models that predict
neural responses to arbitrary stimuli can be very useful for studying brain function.
However, existing models focus on limited aspects of naturalistic stimuli, ignoring
the dynamic interactions of modalities in this inherently context-rich paradigm.
Using movie-watching data from the Human Connectome Project, we build group-
level models of neural activity that incorporate several inductive biases about neural
information processing, including hierarchical processing, temporal assimilation and
auditory-visual interactions. We demonstrate how incorporating these biases leads
to remarkable prediction performance across large areas of the cortex, beyond the
sensory-specific cortices into multi-sensory sites and frontal cortex. Furthermore,
we illustrate that encoding models learn high-level concepts that generalize to
task-bound paradigms. Together, our findings underscore the potential of encoding
models as powerful tools for studying brain function in ecologically valid conditions.
4.2.1 Introduction
How are dynamic signals from multiple senses integrated in our minds to generate
a coherent percept of the world? Understanding the neural basis of perception
has been a longstanding goal of neuroscience. Previously, sensory perception in
116
humans has been dominantly studied via controlled task-based paradigms that
reduce computations underlying brain function into simpler, isolated components,
preventing broad generalizations to new environments or tasks [242]. Alternatively,
fMRI recordings from healthy subjects during free-viewing of movies present a
powerful opportunity to build ecologically sound and generalizable models of sensory
systems, known as encoding models [243, 244, 245, 246, 247, 248].
To date, however, existing works on encoding models study sensory systems
individually, and often ignore the temporal context of the sensory input. In reality,
the different senses are not perceived in isolation; rather, they are closely entwined
through a phenomenon now well-known as multi-sensory integration [249, 250].
For example, specific visual scenes and auditory signals occur in conjunction and
this synergy in auditory-visual information can enhance perception in animals,
improving object recognition and event detection as well as markedly reducing
reaction times [251]. Furthermore, our cognitive experiences unfold over time;
much of the meaning we infer is from stimulation sequences rather than from
instantaneous visual or auditory stimuli. This integration of information from
multiple natural sensory signals over time is crucial to our cognitive experience.
Yet, previous encoding methodologies have precluded the joint encoding of this rich
information into a mental representation of the world.
Accurate group-level predictive models of whole-brain neural activity can be
invaluable to the field of sensory neuroscience. These models learn to disregard the
idiosyncratic signals and/or noise within each individual, while capturing only the
shared response relevant to the stimuli. Naturalistic viewing engages multiple brain
systems and involves several cognitive processes simultaneously, including auditory
and visual processing, memory encoding and many other functions [252]. Group-
117
level analysis in this paradigm is enabled by the synchrony of neuronal fluctuations
in large areas of the cortex across subjects [253]. Thus far, inter-subject correlation
(ISC) analysis [253] has been a cornerstone tool for naturalistic paradigms because
of its ability to characterize the shared response across individuals. Group-level
encoding models adopt an alternative approach for capturing shared response, one
grounded in out-of-sample prediction and generalization [242]. This allows them
to model neural activity beyond a constrained stimulus set. However, there is a
clear gap between the two mediums of analysis. While ISC analysis suggests that
large areas of the cortex exhibit fluctuations that are consistent across subjects,
existing neural encoding models have largely focused on predicting activity within
pre-defined functional areas of the brain such as visual and auditory cortices.
It is unclear how they may be scaled to develop a single predictive model for
whole-brain neural responses, given that naturalistic scenes produce wide-spread
cortical activations. In this paper, we aim to fill this gap: provided adequate
characterization of stimuli, we hypothesize that the stable component of neural
activity across a subject population, i.e., the stimulus-related activity, should be
predictable. In the present study, we aim to quantify and improve the encoding of
this wide-spread stimulus-driven cortical activity using rich stimulus descriptions.
Brain responses in real-world conditions are highly complex and variable. Owing
to their high expressive capacity, deep neural networks (DNNs) are well-suited
to model the complex high-dimensional nature of neural activity in response to
the multitude of signals encountered during movie-watching. Recently, DNNs
optimized for image or sound recognition have emerged as powerful models of
computations underlying sensory processing [243, 245, 246, 248], surpassing tradi-
tional models of image or sound representation based on Gabor filters [244] and
spectrotemporal filters [254], respectively, in higher-order processing regions. In
118
this approach, the stimuli presented during brain activity recordings are fed as
input to pre-trained neural networks and activations of individual layers are linearly
transformed into predictions of neural responses in different regions of the brain.
This approach affords a useful interpretation of these feature spaces as outcomes
of a task-constrained optimization, shedding light on how high-level behavioral
goals, such as recognition, may constrain representations in neural systems [243].
While useful, task-driven features may diverge from optimal neural representations
and tuning these features to better match the latter may be both feasible and
beneficial [255]. This approach can help bridge the quantitative gap in explaining
neural responses under realistic conditions while improving our understanding of the
nature of information processing in the brain. From a purely modeling standpoint,
our methodological innovations are threefold. First, we propose an end-to-end deep
learning-based encoding model that extracts semantic feature maps from audio and
visual recognition networks and refines them jointly to predict the evoked brain
response. To this effect, we demonstrate that using different modalities concur-
rently leads to improvements in brain encoding. Second, we note that cognitive
perception during movie-watching involves maintaining memory over time and
demonstrate the suitability of recurrent neural networks (RNNs) to capture these
temporal dynamics. Finally, based on existing evidence of hierarchical information
processing in visual and auditory cortices [246, 248], we adopt features at multiple
levels of abstraction rather than low level or high level stimulus characteristics
alone. We embed these inductive biases about hierarchy, long-term memory and
multi-modal integration into our neural architecture and demonstrate that this
comprehensive deep learning framework generalizes remarkably well to unseen data.
Specifically, using fMRI recordings from a large cohort of subjects in the HCP, we
build group-level encoding models that reliably predict stimuli-induced neuronal
119
fluctuations across large parts of the cortex. As a demonstration of application,
we employ these encoding models to predict neural activity in response to other
task-based stimuli and report excellent transferability of these models to artificial
stimuli from constrained cognitive paradigms. This further suggests that these
encoding models are able to capture high-level mechanisms of sensory processing.
Approaching multi-sensory perception through the predictive lens of encoding
models has several advantages. Because of their unconstrained nature, encoding
models can enable data-driven exploration and catalyze new discoveries. Using
six neural encoding models with different temporal scales and/or sensory inputs,
trained only on ∼36 minutes of naturalistic data per subject, we can replicate
findings from a large number of prior studies on sensory processing. First, by
prominently highlighting the transition from short to long temporal receptive
windows as we move progressively from early to high-level auditory areas, we can
distinguish the cortical temporal hierarchy. Next, by differentiating uni-sensory
cortices from multi-sensory regions such as the superior temporal sulcus and angular
gyrus, we can reproduce the multi-modal architecture of the brain. Finally, by
synthesizing neural responses to arbitrary stimuli such as faces, scenes or speech, we
can demonstrate the functional specialization of known brain regions for processing
of these distinct categories. Altogether, our results highlight the advantages and
ubiquitous applications of DNN encoding models of naturalistic stimuli.
120
4.2.2 Materials and Methods
Dataset
We study high-resolution 7T fMRI data of 158 individuals from the Human Connec-
tome Project movie-watching protocol comprising 4 audio-visual movie scans [256,
257]. The movies represent a diverse collection, ranging from short snippets of
Hollywood movies to independent vimeo clips. All fMRI data was preprocessed
following the HCP pipeline, which includes motion and distortion correction, high-
pass filtering, head motion effect regression using Friston 24-parameter model,
automatic removal of artifactual timeseries identified with Independent Component
Analysis (ICA) as well as nonlinear registration to the MNI template space [257].
Complete data acquisition and preprocessing details are described elsewhere [256,
257]. Finally, whole-brain fMRI volumes of size 113x136x113 are used as the
prediction target of all proposed encoding models. Rest periods as well as the first
20 seconds of every movie segment were discarded from all analyses, leaving ∼12
minutes of audio-visual stimulation data per movie paired with the corresponding
fMRI response. We estimated a hemodynamic delay of 4 sec using ROI-based
encoding models, as the response latency that yields highest encoding performance
(Figure S2, see Supplementary Information for details). Thus, all proposed models
are trained to use the above stimuli to predict the fMRI response 4 seconds after
the corresponding stimulus presentation. We train and validate our models on 3
audio-visual movies with a 9:1 split and evaluate our models on the first three clips
of the held-out test movie. Since the last clip in the held-out movie is repeated
within the training movies, we excluded it from our analysis.
121
Methodology
We train six encoding models employing different facets of the complex, dynamic
movie stimulus. These include: (1) Audio-1sec and (2) Audio-20sec models, which
are trained on single audio spectrograms extracted over 1-second epochs and
contiguous sequences of 20 spectrograms spanning 20 seconds respectively; (3)
Visual-1sec and (4) Visual-20sec models, trained with last frames of 1-second
epochs and sequences of 20 evenly spaced frames within 20-second clips respectively;
(5) Audiovisual-1sec and (6) Audiovisual-20sec models, which employ audio and
visual input as described above, jointly. All models are trained to minimize the
mean squared error between the predicted and measured whole-brain response.
Figure 4.1 depicts the overall methodology for training different encoding models.
Stimuli
Audio We extract mel-spectrograms over 64 frequency bands between 125-7500
Hz from sound waveforms to represent auditory stimuli in ∼1 second epochs,
following [258]. The audio spectrogram is treated as a single grayscale 96x64 image,
denoted by xat , for the short duration model. For the longer-duration model, the
input is simply a contiguous sequence of 20 of these grayscale images, represented as
sat = {xa}ti i=t−19. This representation of auditory input is also supported by strong
evidence that suggests the cochlea may be providing a spectrogram-like input to
the brain for information processing [259].
Visual All videos were collected at 24 fps. We extract the last frame of every
second of the video as a 720x1280x3 RGB input, denoted by xvt , for the 1-sec
122
Figure 4.1: Schematic of the proposed models. (A) The short-duration (1-sec)
auditory and visual models take a single image or spectrogram as input, extract
multi-scale hierarchical features and feed them into a Convolutional Neural Net-
work (CNN)-based response model to predict the whole-brain response. (B) The
long-duration (20-sec) uni-modal models take a sequence of images or spectrograms
as input, feed their hierarchical features into a recurrent pathway and extract the
last hidden state representation for the response model. (C) The short-duration
multi-modal model combines uni-modal features and passes them into the response
model. (D) The long-duration multi-modal model combines auditory and visual
representations from the recurrent pathways for whole-brain prediction. Architec-
tural details, including the feature extractor and convolutional response model are
provided in Supplementary Information.
models. We emphasize that the input here is a single RGB frame and we are
using the 1-sec terminology only to be consistent with the nomenclature for audio
models. We further arrange the last frame of every second in a 20-second clip into
a sequence of 20 images, denoted by sv = {xv}tt i i=t−19, to represent the continuous
stream of visual stimuli. These are presented to the longer-duration Visual-20sec
and Audiovisual-20sec models.
The inputs to the Audio-1sec, Visual-1sec, Audio-20sec, Visual-20sec,
Audiovisual-1sec and Audiovisual-20sec models are thus given as xa v at , xt , st , svt ,
{xat , xv at } and {st , svt } respectively.
123
Audio-1sec and Visual-1sec models Neural encoding models comprise two
components: a feature extractor, which pulls out relevant features, s, from raw
images or audio waveforms and a response model, which maps these stimuli features
onto brain responses. In contrast to existing works that employ a linear response
model [245, 248], we propose a Convolutional Neural Network (CNN)-based response
model where stimulus features are mapped onto neural data using non-linear
transformations. Previous studies have reported a cortical processing hierarchy
where low-level features from early layers of a CNN-based feature extractor best
predict responses in early sensory areas while semantically rich deeper layers best
predict higher sensory regions [246, 248]. To account for this effect, we employ a
hierarchical feature extractor based on feature pyramid networks [260] that combines
features from early, intermediate and later layers simultaneously. The detailed
architectures of both components, including the feature extractor and convolutional
response model, are described in Figure S3. We employ state-of-the-art pre-trained
ResNet-50 [261] and VGG-ish [258] architectures in the pyramid network to extract
multi-scale features from images and audio spectrograms, respectively. The base
architectures were selected because pre-trained weights of these networks optimized
for behaviorally relevant tasks (recognition) on large datasets, namely Imagenet[262]
and Youtube-8M[263], were publicly available. Resnet-50 was trained on image
classification with 1000 classes, while the VGG-ish network was pre-trained on
audio event recognition with ∼30K categories. Further, due to computational and
memory budget, the Resnet-50 was frozen during training across all models. On the
other hand, we were able to fine-tune the VGG-ish network in both the Audio and
Audiovisual encoding models. We note that in contrast to images, there is a clear
asymmetry in the axes of a spectrogram, where the distinct meanings of time and
frequency might warrant 1D convolutions over time instead of 2D convolutions over
124
both frequency and temporal axes. However, we found the benefits of a pre-trained
network to be substantial in training convergence time and hence did not explore
more appropriate architectures.
Audio-20sec and Visual-20sec models Audio-20sec and Visual-20sec models
employ the same feature extractor and CNN response model as their 1-second
counterparts. However, here, the feature extraction step is applied on each image
in a sequence of 20 frames, followed by a long short-term memory (LSTM) module
to model the temporal propagation of these features. The output dimensions of the
LSTM unit are set to 1024 and 512 for the visual and auditory models respectively,
to ensure an equitable comparison with the corresponding 1-sec models. The last
hidden state output of this LSTM unit is fed into the CNN response model with
the same architecture as the 1-sec models.
Audiovisual-1sec and Audiovisual-20sec models Meaningful comparison
across different models requires the control of as many design choices as possible.
To ensure fair comparisons, the Audiovisual-1sec model employs the same feature
extractors as the Visual-1sec and Audio-1sec models. The only difference, here, is
that the corresponding 1024-D and 512-D feature representations are concatenated
before presentation to the CNN response model and the concatenated features
are passed into a bottleneck layer to reduce the final feature dimensionality to
the maximum among audio and visual feature dimensions, i.e., 1024, so that the
multi-modal model is not equipped with a higher-dimensional feature space than the
maximum among uni-modal models. We note that the response model has the same
architecture across all 6 proposed models. Similarly, the Audiovisual-20sec model
employs the same feature extraction scheme as the Visual-20sec and Audio-20sec
125
models, but fuses the last hidden state output of the respective LSTM units by
simple concatenation followed by a dense layer to reduce feature dimensionality to
1024 before feeding it into the response model.
Evaluation
We first evaluated the prediction accuracy of all models on the independent held-
out movie by computing Pearson correlation coefficient (R) between the measured
and predicted response at every voxel. Here, the ‘measured’ response refers to
the group-averaged response across the same group of 158 subjects on which the
models were trained. Comparison among these models enables us to tease apart
the sensitivity of individual voxels to input timescales and different sensory stimuli.
Voxel-level correlation coefficients between the predicted and measured responses
were averaged to summarize the prediction accuracy of each model in relevant
cortical areas (Figure 4.2B-F). For this region-level analysis, ROIs were derived
with a comprehensive multi-modal parcellation of the human cortex [264], which
was mapped onto the MNI-1.6 mm resolution template. We note that ROIs were
employed only to interpret the results of the study and relate them to existing
literature. We emphasize that all performance metrics reported henceforth are
based on voxel-level correlations. It is important to note that prediction accuracy
at every voxel is bounded by the proportion of non-stimulus related variance that
reflects measurement noise or other factors. We thus also show the regional level
performance of all models against the reliability (“noise ceiling”) of measured
responses within those regions (Figure 4.3).
Noise ceiling estimation:
The reliability of the group-averaged response at each voxel is estimated from a short
126
84-second clip that was repeatedly presented at the end of all movie sessions. We
compute an effective upper bound on our performance metric, i.e., the correlation
coefficient, as the correlation between the measured fMRI response (group-mean)
during different runs. We repeat this process 6 times (choosing pairs from 4 repeat
measurements) to get a mean noise ceiling estimate per voxel, as shown in Figure
4.3D. We divide the voxel-level prediction accuracy (R) by this noise ceiling to get
noise-normalized prediction accuracy of all models in left panels of Figure 4.3A-C.
We note that this noise ceiling is computed on the repeated video clip, which is
distinct from the test movie on which the model performance metrics are computed.
Direct comparison against this noise ceiling can be sub-optimal, especially if the
properties of the group-averaged response vary drastically across the two stimulus
conditions. We address this limitation during model evaluation against data from a
held-out independent group of subjects by computing a more suitable upper bound,
which is achievable by a group-level encoding model (Figure S8, see Supplementary
Information for more details). As we demonstrate in the results (Figure S8, S9), the
trend and spatial distribution of model performance against noise ceiling remain
unchanged across the model evaluation and noise ceiling estimation methods.
4.2.3 Results
Multi-sensory inputs and longer timescales lead to the best encoding
performance with significant correlations across a large proportion of
the stimulus-driven cortex
To gain quantitative insight into the influence of temporal history and multi-sensory
inputs on encoding performance across the brain, we computed the mean prediction
127
Figure 4.2: Regional predictive accuracy for the test movie. (A),(C)-(F) depict
quantitative evaluation metrics for all the proposed models across major groups
of regions as identified in the HCP MMP parcellation (B). Predictive accuracy of
all models is summarized across (A) auditory, (C) visual, (D) multi-sensory, (E)
language and (F) frontal areas. Box plots depict quartiles and swarmplots depict
mean prediction accuracy of every ROI in the group. For language areas (Group
4), left and right hemisphere ROIs are shown as separate points in the swarmplot
because of marked differences in prediction accuracy. Statistical significance tests are
performed to compare 1-sec and 20-sec models of the same modality (3 comparisons,
results indicated with horizontal bars below the box plots) or uni-modal against
multi-modal models of the same duration (4 comparisons, results indicated with
horizontal bars above the box plots) using the paired t-test (p-value < 0.05,
Bonferroni-corrected) on mean prediction accuracy within ROIs of each group.
accuracy in five groups of regions defined as per the HCP MMP parcellation [264],
namely, (1) auditory regions comprising both early and association areas, (2)
early visual and visual association regions, (3) known multi-sensory sites and
regions forming a bridge between higher auditory and higher visual areas, (4)
language-associated regions, and (5) frontal cortical areas. As our research concerns
stimulus-driven processing, only ROIs belonging to the “stimulus-driven” cortex
128
were included in the above groups (Table S2, see Supplementary Information for
the definition of “stimulus-driven” cortex). Groups 1 and 2, which are associated
with a single modality (auditory or visual), do not show any marked improvement
from audio-visual multi-sensory inputs and are best predicted by features of their
respective sensory stimulus (Figure 4.2A,C). The performance boost with multi-
sensory inputs is more pronounced in groups 3, 4 and 5 which are not preferentially
associated with a single modality, but are involved in higher-order processing
of sensory stimuli (Figure 4.2D-F). Further, temporal history of the stimulus
yields consistent improvement in prediction performance in almost all groups of
regions, albeit to different extents. Improvements in groups 3, 4 and 5 agree
well with the idea that higher-order sensory processing as well as cognitive and
perceptual processes, such as attention and working memory, are hinged upon the
history of sensory stimuli; therefore, accumulated information benefits response
prediction in regions recruited for these functions. Further, both auditory and visual
association cortices are known to contain regions that are responsive to sensory
information accumulated over the order of seconds [265]. This potentially explains
the significant improvement observed for long-timescale encoding models compared
to their short-timescale counterparts in these sensory cortices (Figure 4.4). Together,
the Audiovisual-20sec model integrating audio-visual multi-sensory information over
longer timescales yields maximum prediction accuracy (R) and highest percentage
(∼ 83 percent) of significantly predicted voxels across the stimulus-driven cortex
(Figure 4.3E), suggesting that the Audiovisual-20sec model can adequately capture
complementary features of each additional facet (multi-sensory stimuli / temporal
information) of the sensory environment.
129
Longer timescales improve encoding performance, particularly in higher
order auditory areas
As a movie unfolds over time, the dynamic stream of multi-modal stimuli continu-
ously updates our neural codes. Evidence from neuroimaging experiments suggests
that different brain regions integrate information at different timescales; a cortical
temporal hierarchy is reported for auditory perception where early auditory areas
encode short timescale events while higher association areas process information over
longer spans [266]. This temporal gradient of auditory processing is well-replicated
within our study. Comparison of 1-sec and 20-sec models allows us to distinguish
brain regions that process information at shorter timescales from those that rely
on longer dynamics. There is a small, albeit significant, contribution of longer
timescale inputs on prediction correlations in regions within early auditory cortex,
such as A1, LBelt, PBelt, MBelt and Restro-insular cortex (RI) (Figure 4.3A,
4.4A), in line with previous reports suggesting short temporal receptive windows
(TRWs) of early sensory regions [266]. Shorter integration windows are in agreement
with the notion that these regions facilitate rapid processing of the instantaneous
incoming auditory input. In contrast, response in voxels within auditory associ-
ation ROIs lying mainly in the superior temporal sulcus or along the temporal
gyrus (A4, A5, STSda, STSva, STSdp, STSvp, STGa, TA2) is seen to be much
better predicted with longer timescales (Figure 4.3A, 4.4A). Cumulatively across
association ROIs, the Audio-20sec model yields a highly significant improvement in
prediction accuracy (∼50%) over the Audio-1sec model, in comparison to a smaller
improvement (∼5%) across early auditory ROIs.
130
Figure 4.3: Model prediction accuracy in standard brain space. Left panel depicts
the predictive accuracy of uni-modal (A,B) and multi-modal (C) models over the
whole brain in the test movie. Colors on the brain surface indicate the Pearson
correlation coefficient between the predicted timeseries at each voxel and the
true voxel’s timeseries normalized by the noise ceiling (D) computed on repeated
validation clips. Only significantly predicted voxels (p-value < 0.05, FDR-corrected)
are colored. ROI box plots depict the un-normalized correlation coefficients between
the predicted and measured response of voxels in each ROI and the respective noise
ceiling for the mean. (E) shows the percentage of voxels in stimulus-driven cortex
that are significantly predicted by each model and mean prediction accuracy across
the stimulus-driven cortex.
131
Figure 4.4: Influence of temporal history on encoding performance. (A) Mean
predictive performance of Audio-1sec and Audio-20sec models in early auditory and
association auditory cortex ROIs. A major boost in encoding performance is seen
across auditory association regions with the 20-sec model. (B) Mean predictive
performance of Visual-1sec and Visual-20sec models across ROIs in the dorsal,
ventral and MT+ regions. Dorsal stream and MT+ ROIs exhibit a significant
improvement with Visual-20sec model but no effect is observed for the ventral stream.
Box plots are overlaid on top of the beeswarm plot to depict quartiles. Horizontal
bars indicate significant differences between models in the mean prediction accuracy
within ROIs of each stream using the paired t-test (p-value < 0.05).
Longer timescales lead to significantly better predictions in the dorsal
visual stream and MT+ complex
The distinct association of dorsal visual stream with spatial localization and action-
oriented behaviors and ventral visual stream with object identification is well
documented in the literature [267]. Another specialized visual area is the medial
temporal complex (MT+), which has been shown to play a central role in motion
processing. The functional division between these streams thus suggests a stronger
influence of temporal dynamics on responses along the dorsal pathway and MT+
regions. To test this hypothesis, we contrast the encoding performance of Visual-
1sec and Visual-20sec models across the three groups by averaging voxel-wise
correlations in their constituent ROIs. In accordance with the dorsal/ventral/MT+
132
stream definition in the HCP MMP parcellation, we use the following ROIs for
analysis: (a) dorsal: V3A, V3B, V6, V6A, V7, IPS1 (b) ventral: V8, Ventral Visual
Complex (VVC), PIT complex, Fusiform Face Complex (FFC) and Ventro-medial
Visual areas 1,2 and 3 (c) MT+: MT, MST, V4t, FST. Figure 4.4B demonstrates
the distribution of mean correlations over these ROIs for different models and
streams. Our findings suggest that temporal history, as captured by the Visual-
20sec model, can be remarkably beneficial to response prediction across the dorsal
visual stream (∼30% improvement over Visual-1sec model) and the MT+ complex
(∼62% improvement over Visual-1sec model), in agreement with our a priori
hypothesis . Further, in our experiments, no marked improvement was observed
for the ventral visual stream, indicating a non-significant influence of temporal
dynamics on these regions.
Auditory and visual stimuli features jointly approach the noise ceiling
in multi-sensory areas
Examining prediction accuracy against response reliability allows us to quantify
how far we are from explaining predictable neural activity. A high fraction of the
stimulus-driven cortex (∼ 83%) is predictable with a longer timescale input and
joint audiovisual features. Notably, areas extending anteriorly and posteriorly from
the primary auditory cortex such as the posterior STS, STGa and TA2 achieve
prediction correlations close to the noise ceiling with the Audiovisual-20 sec model
(Figure 4.3C), suggesting that DNN representations are remarkably suited to encode
their response.
Interestingly, performance in auditory regions is much closer to the noise ceiling
than visual regions. Understanding audition and vision in the same space further
133
Figure 4.5: Sensitivity of ROIs to different sensory inputs. (A) Predictive accuracy
(R) of audiovisual encoding model with and without input distortions, (B) Sensory
sensitivity index of different brain regions as determined using performance met-
rics under input distortion (see Supplementary Information for details). Regions
dominated by a single modality are shown in darker colors, whereas light-colored
regions are better predicted by a combination of auditory and visual information.
Red indicates auditory-dominant regions whereas blue indicates visual dominance.
allows us to appreciate the differences between these modalities. While this may
suggest that audition is perhaps a simpler modality to model, the differences could
also result from a bias of the dataset. A more diverse sampling of acoustic stimuli
in the training set could allow the model to generalize better in auditory regions.
Furthermore, in contrast to auditory stimulation where all subjects hear the same
sounds, visual stimulation can elicit highly varied responses dependent on gaze
location. This variability could plausibly make group-level visual encoding a more
difficult task.
Joint encoding models tease apart the modal sensitivity of voxels
throughout the sensory cortex
Neural patterns evoked by movies are not simply a conjunction of activations in
modality-specific cortices by their respective uni-sensory inputs; rather, there are
known cross-modal influences as well as regions that receive afferents from multiple
senses [268]. Can we interrogate a joint encoding model to reveal the individual
134
contribution of auditory and visual features in encoding response across different
brain regions? To address this question, we shuffled inputs of either modality
along the temporal axis during inference. We measured test performance of the
trained audio-visual model on predictions generated by shuffling inputs of one
modality while keeping the other one intact. This distortion at test time allows us
to identify areas that are preferentially associated with either visual or auditory
modality. We hypothesized that regions encoding multi-sensory information will
incur loss in prediction accuracy upon distortion of both auditory and visual
information. Further, uni-sensory regions will likely be adversely affected by
distortion of either auditory or visual information but not both. To test this
hypothesis, we further developed a sensory-sensitivity index that directly reflects
the sensitivity of individual brain regions to information about auditory or visual
stimuli (see Supplementary Information for details). For this examination, we
utilized the Audiovisual-1sec model to avoid potential confounds associated with
temporal history, although analysis of the Audiovisual-20sec model showed similar
results. Figure 4.5 demonstrates the result of this analysis on sensory-specific
regions as well as regions known for their involvement in multi-sensory integration.
The benefit from (non-distorted) multi-sensory inputs to the prediction correlations
of the Audio-visual model is most remarkably seen in posterior STS, STGa and
sensory-bridge regions such as the temporal-parietal-occipital junction (TPOJ1-
3) and superior temporal visual (STV) area. Another region that seems to be
employing features of both modalities, albeit to a lesser extent, is the frontal eye
field (FEF), whose recruitment in audiovisual attention is well studied [269].
Classically, multi-sensory integration hubs are identified as regions that show en-
hanced activity in response to multi-sensory stimulation as opposed to presentation
of either uni-sensory stimuli based on some statistical criteria [270]. Accordingly,
135
the posterior STS is consistently described as a multi-sensory convergence site
for audio-visual stimuli [250, 268, 270, 271]. Its role in audiovisual linguistic in-
tegration has also been well-studied in the literature [269]. Other multi-sensory
integration sites reported extensively in prior literature include the temperoparietal
junction [250, 268, 269] and superior temporal angular gyrus [272]. Our findings
above lend strong support for the multi-sensory nature of all these regions.
Encoding models as virtual neural activity synthesizers
Next, we sought to characterize whether encoding models can generalize to novel
task paradigms. By predicting neural activity for different visual categories from
the category-specific representation task within the HCP Working Memory (WM)
paradigm, we generated synthetic functional localizers for the two most common
visual classes: faces and places. Specifically, we predict brain response to visual
stimuli, comprising faces, places, tools and body parts, from the HCP task battery.
We use the predicted response to synthesize contrasts (FACES-AVG and PLACES-
AVG) by computing the difference between mean activations predicted for the
category of interest (faces or places respectively) and the average mean activations
of all categories at each voxel (Figure 4.6). The predicted and measured contrasts
are thresholded to keep the top 5%, 10% or 15% most activated voxels. We report
the Dice overlap between the predicted and measured contrasts for each of these
threshold values to quantify the agreement between these cortical maps. We also
computed the Dice overlap of the predicted contrast for each experiment against
all 86 measured tfMRI contrasts provided as part of the HCP task battery in order
to assess the identifiability of the synthetic contrast.
We observe a notable overlap between the synthetic and measured group-level
136
Figure 4.6: Encoding models as virtual brain activity synthesizers. (A) Synthetic
contrasts are generated from trained encoding models by contrasting their “syn-
thesized” (i.e., predicted) response to different stimulus types. (B) Comparison
of the synthesized contrast for ‘speech’ against the speech association template
on neurosynth, both thresholded to keep the top 5%, 10% or 15% most activated
vertices. (C-D) compare the synthesized contrasts for ‘faces’ and ’places’ against the
corresponding contrasts derived from HCP tfMRI experiments, both thresholded
to keep the top 5%, 10% or 15% most activated vertices. Vertices activated in only
synthetic or predicted contrast maps are shown in red and blue colors respectively
whereas yellow indicates the overlap. Corresponding dice scores are displayed
alongside the surface maps. Distributions of dice overlap scores between the syn-
thetic map and all 86 HCP tfMRI contrast maps are shown as histograms at each
threshold level. Red arrow points to the dice overlap between the synthetic contrast
and HCP tfMRI contrast for the same condition. In all cases, the synthetic contrast
exhibits the highest agreement with the tfMRI contrast that it was generated to
predict.
137
contrasts. Importantly, we find that the synthetic contrasts for ‘FACES-AVG’
and ‘PLACES-AVG’ are identifiable in that the synthetic contrast exhibits the
highest agreement with the measured contrast of the same contrast condition.
Further, our findings are consistent with the well-known cortical specificity of
neuronal activations for processing of faces and places. Both the synthetic and
measured faces contrasts are consistent with previously identified regions for face-
specific processing, including the fusiform face area (corresponds to fusiform face
complex (FFC) in Figure 4.6), the occipital face area in lateral occipital cortex
(overlaps with the PIT complex in HCP MMP parcellation), and regions within
temporo-parieto-occipital junction and STS [273, 274]. Among these, the selective
role of the Fusiform Face Area in face processing has been most consistently and
robustly established. Another region known to respond more strongly to faces than
other object categories, namely posterior STS, has been previously implicated in
processing of facial emotions [273].
Similarly, both synthetic and measured places contrasts highlight cortical regions
thought to be prominent in selective processing of visual scenes. These include the
parahippocampal areas (PHA1-3), retrosplenial cortex (POS1 in HCP MMP par-
cellation) and the transverse occipital sulcus (TOS), which comprises the occipital
place area (OPA) [275].
Cortical areas related to speech processing are similarly discovered using our
models by contrasting activations predicted for speech stimuli against non-speech
stimuli such as environmental sounds (Figure 4.6B, see Supplementary Information
for more details). The synthetic contrast shows increased activation in language-
related areas of the HCP MMP parcellation such as 55b, 44 and the superior frontal
language (SFL) area with left-lateralization, in accordance with previous language
138
fMRI studies [276]. In addition, areas tuned for voice processing in STS [277] are
also highlighted. The synthetic map also shows highest correlation with ‘speech’
on neurosynth term-based meta-analysis [278] and overlaps considerably with the
speech association template on the platform. Taken together, these experiments
illustrate the potential of encoding models to simulate contrasts and reconcile
contrast-based studies with naturalistic experiments.
Additional analyses
In prior studies, neural response prediction is done via regularized regression, where
the signal at each voxel is modeled as a weighted sum of stimulus features with
appropriate regularization on the regression weights. Following earlier works, we
also train l2-regularized regression models using features derived from hierarchical
convolutional networks trained on image or sound recognition such as those used
in the proposed models, as well as semantic categories features labelled using
the WordNet semantic taxonomy similar to [279]. The latter are typically used
for mapping the semantic tuning of individual voxels across the cortex. Our
models consistently outperform the baselines, further illustrating the benefits of
the proposed methodology (Figure S4(A)-(C), see Supplementary Information for
more details). Additionally, we also performed ablation studies to understand the
influence of different network components, namely the “non-linear” response model
as well as the “hierarchical” feature extractor on model prediction performance
and found that both components improve performance, although their relative
contribution is stronger in visual encoding models than auditory models (Figure
S4(D), see Supplementary Information for more details). The superior predictive
performance of our models in comparison to the classical approach along with
our ablation studies suggest that an interplay of end-to-end optimization with a
139
non-linear response model can jointly afford improved generalization performance.
To test the generalizability of the models beyond the subject population they
were trained on, we further compared the predictions of all models against the
group-averaged response of a held-out group within HCP comprising 20 novel
subjects distinct from the 158 individuals used in the training set, on the same
independent held-out movie. The noise ceiling for this group was computed as the
correlation coefficient between the mean measured response for the independent test
movie across all 158 subjects in the training set and the group-averaged response
computed over the 20 new subjects. This metric captures the response component
shared across independent groups of subjects and thus reflects the true upper
bound achievable by a group-level encoding model. As shown in Figure S8 (see
Supplementary Information for more details), the models can accurately predict
neural responses as measured with respect to the group mean of the held-out
subjects, with the Audiovisual-20sec model performance even approaching noise
ceiling in some regions, particularly the higher-order auditory association regions
and multi-sensory sites such as the posterior STS. Importantly, the predictivities
across the cortical surface are consistent with the performance metrics reported
for the training subject population in Figure 4.3. Finally, by comparing model
predictions against neural responses at the single subject level for subjects from
the held-out group, we further demonstrate that the Audiovisual-20sec model can
also successfully capture the response component that individual subjects share
with the population (Figure S10, see Supplementary Information for details).
140
4.2.4 Discussion
Free viewing of dynamic audio-visual movies enables an ecologically valid analysis
of a collective set of functional processes at once, including temporal assimilation
and audio-visual integration in addition to momentary sensory-specific process-
ing. Perception, under such stimulation, thus recruits sensory systems as well
as areas subserving more sophisticated cognitive processing. Building quantita-
tively accurate models of neural response across widespread cortical regions to
such real-life, continuous stimuli thus requires an integrated modelling of these
disparate computations on sensory inputs. In this paper, we have presented six
deep neural network-based encoding models with varying sensory and temporal
information about the audio-visual stimulus. Subsequently, we queried the role of
input history and different sensory information on prediction performance across
individual regions of the cortex. We have shown that exploiting the richness of
the stimulus along the time axis and sensory modalities substantially increases the
predictive accuracy of neural responses throughout the cortex, so far as approaching
the noise ceiling for voxels in some known multi-sensory sites, such as the posterior
STS [250, 268, 270, 271].
Auditory and visual scenes are the principal input modalities to the brain during
naturalistic viewing. Yet, existing encoding models ignore their interactions. We
employ a common strategy in multi-modal machine learning settings, namely feature
fusion, to jointly model auditory and visual signals from the environment. We find
that minimizing the prediction error is a useful guiding principle to learn useful
joint representations from an audio-visual stimulation sequence and demonstrate
that models that consume multi-modal signals concurrently, namely, Audiovisual-
1sec and Audiovisual-20sec, can not only predict the respective uni-modal cortices
141
slightly better but also lead to remarkable improvements in predicting response
of multi-sensory and frontal brain regions (Figure 4.2). Further, we show that
multi-modal neural encoding models not only boost performance in large areas
of the cortex relative to their uni-modal counterparts (Figure 4.2,4.3E), but also
shed light on how neural resources are spatially distributed across the cortex for
dynamic multi-sensory perception (Figure 4.5). The predictivity of different sensory
inputs for neural response, as evaluated on independent held-out data, can facilitate
reverse inference by identifying the sensory associations of different brain regions,
providing clues into the multi-sensory architecture of the cortex. By comparative
analysis of predictive performance in different regions across models (Figure 4.2) as
well as perturbation analysis within the multi-modal model (Figure 4.5), we identify
a number of regions that are consistently sensitive to both auditory and visual
information, most notably the superior temporal sulcus and some frontal regions.
Regions within inferior frontal cortex, have been implicated in the processing
of visual speech, guiding sensory inferences about the likely common cause of
multi-modal auditory and visual signals, as well as resolving sensory conflicts [280].
Prior research has also implicated an extensive network of inferior frontal and
premotor regions in comprehending audiovisual speech, suggesting that they bind
information from both modalities [281]. While unveiling the causal sequence of
events for a mechanistic understanding of multi-sensory perception is not possible
with the proposed approach, our findings align well with commonly held theories
of sensory fusion which suggest that uni-sensory signals are initially processed in
segregated regions and eventually fused in regions within superior temporal lobe,
occipital-temporal junction and frontal areas [268]. This proposition is corroborated
by our experiments as response prediction in these regions is best achieved by a
combination of both sensory inputs (Figure 4.3,4.5).
142
A linear response model with pre-trained and non-trainable feature extractors,
while simple and interpretable, imposes a strong constraint on the feature-response
relationship. The underlying assumption is that neural networks optimized for
performance on behaviorally relevant tasks, are mappable to neural data with
a linear transform. We designed a flexible model, capable of capturing complex
non-linear transformations from stimulus feature space to neural space, leading to
more quantitatively accurate models that are better aligned with sensory systems.
Even better accounts of cortical responses are then obtained by interlacing dynamic,
multi-modal representation learning with whole-brain activation regression in an
end-to-end fashion. Using these rich stimulus descriptions, we demonstrated a
widespread predictability map across the cortex, that covers a large portion (∼83%)
of the stimulus-driven cortex (Figure 4.3C,E), including association and some
frontal regions. While inter-subject correlations in these regions are frequently
reported [253, 282], suggesting their involvement in stimulus-driven processing,
response predictability in these areas had remained elusive so far. Further, the
cortical predictivity is maintained even as we compare model predictions against
neural responses of held-out subjects (Figure S8 and S10), suggesting that the
proposed models are capable of successfully capturing the “shared” or stimulus-
driven response component. These results provide compelling evidence that deep
neural networks trained end-to-end can learn to capture the complex computations
underlying sensory perception of real-life, continuous stimuli.
We further demonstrated that encoding models can form an alternative frame-
work for probing the timescales of different brain regions. While primary auditory
and auditory belt cortex (comprising A1, PBelt, LBelt, Mbelt) as well as the
ventral visual stream benefit only marginally from temporal information, there is a
remarkable improvement in prediction performance in auditory and visual associa-
143
tion and pre-frontal cortices, most notably in superior temporal lobe, visuomotor
regions within the dorsal stream such as V6A, temporal parietal occipital junction
and inferior frontal regions. The improvement in prediction performance with the
20-second input is consistently seen for both uni-modal and multi-modal models.
It is important to acknowledge that directly comparing the prediction accuracies of
static (1-sec) and recurrent (20-sec) models to infer processing timescales of different
brain regions has its limitations. First, this analysis can be confounded by the
slow hemodynamic response as performance improvement may be driven in part by
the slow and/or spatially varying dynamics. Based on our analysis with ROI-level
encoding models, the latter seems like a less plausible explanation (Figure S2, see
Supplementary Information for details). Further, we performed additional analyses
to understand the relationship between performance improvement in individual
voxels and their autocorrelation properties and found a strong correspondence
between the two, suggesting that the distribution of performance improvement
across the cortex broadly agrees well with processing timescales (Figure S6, see
Supplementary Information for details).
Predictions from long-timescale models are based on temporal history as pro-
vided in stimulus sequences, and not just the instantaneous input. Modeling
dynamics within these sequences appropriately is crucial to probe effects of tempo-
ral accumulation. RNNs have internal memories that capture long-term temporal
dependencies relevant for the prediction task, in this case, encoding brain response,
while discarding task-irrelevant content. We compare this modeling choice against a
regularized regression approach on stimulus features concatenated within T-second
clips, with T ranging between 1 and 20 (Figure S4, see Supplementary Information
for details). The inferior performance compared to our proposed models as well as
a non-increasing performance trend against T for these linear models indicate that
144
accumulation of temporal information by simply concatenating stimulus features
over longer temporal windows is insufficient; rather, models that can efficiently
store and access information over longer spans, such as RNNs with sophisticated
gating mechanisms, are much more suitable for modeling neural computations that
unfold over time. Since activations of units within RNNs depend not only on the
incoming stimulus, but also on the “current” state of the network as influenced by
past stimuli, they are capable of holding short-term events into memory. Adding
the RNN module can thus be viewed as augmenting the encoding models with
working memory.
Investigating timescales of representations across brain regions by understanding
the influence of contextual representations on language processing in the brain, as
captured by LSTM language models for instance, has become a major research
focus recently [283]. In these language encoding models for fMRI, past context
has been shown to be beneficial in neural response prediction, surpassing word
embedding models. However, models that explain neural responses under dynamic
natural vision while exploiting the rich temporal context have not yet been rigorously
explored with human fMRI datasets. In a previous study with awake mice, recurrent
processing was shown to be useful in modelling the spiking activity of V1 neurons
in response to natural videos [284]. In dynamic continuous visual stimulation
fMRI paradigms, a common practice is to concatenate multiple delayed copies
of the stimulus to model the hemodynamic response function as a linear finite
impulse response (FIR) function [279]. However, since the feature dimensionality
scales linearly with time-steps, this approach is limited to HRF modeling and is
not feasible to capture longer dynamics of the order of tens of seconds. Another
approach is to employ features from neural networks trained on video tasks, such as
action recognition [247]. However, these encoding models are constrained to capture
145
one aspect of dynamic visual scenes and are likely useful to predict neural responses
in highly localized brain regions. Most studies in visual encoding remain limited to
static stimuli and evoked responses in relatively small cortical populations.
Our brain has evolved to process ‘natural’ images and sounds. In fact, recent
evidence has shown that sensory systems are intrinsically more attuned to features
of naturalistic stimuli and such stimuli can induce stronger neural responses than
task-based stimuli [285]. Here, we demonstrate that encoding models trained with
naturalistic data are not limited to modeling responses of their constrained stimuli
set. Instead, by learning high-level concepts of sensory processing, these models can
also generalize to out-of-domain data and replicate results of alternate task-bound
paradigms. While our models were trained on complex and cluttered movie scenes,
we tested their ability to predict response to relatively simple stimuli from the HCP
Task battery, such as faces and scenes (Figure 4.6). The remarkable similarity
between the predicted and measured contrasts in all cases suggests that ‘synthetic’
brain voxels, predicted by the trained DNNs, correspond well with the target voxels
they were trained to model. We thus provide evidence that these encoding models
are capsulizing stimulus-to-brain relationships extending beyond the experimental
world they were trained in. On the other hand, classical fMRI experiments, for
instance task contrasts, do not generalize outside the experimental circumstance
they were based on. This preliminary evidence suggests that encoding models can
serve as promising alternatives for circumventing the use of contrast conditions to
study hypotheses regarding the functional specialization of different brain regions.
Embedded knowledge within these descriptive models of the brain, could also be
harnessed in other applications, such as independent neural population control by
optimally synthesizing stimuli to elicit a desired neural activation pattern [286].
146
With purely data-driven exploration of fMRI recordings under a hypothesis-free
naturalistic experiment, our models replicate the results of previous neuroimaging
studies operating under controlled task-based regimes. Our analysis lends support
to existing theories of perception which suggest that primary sensory cortices build
representations at short timescales and lead up to multi-modal representations
in posterior portions of STS [266]. Encoding performance in these regions is
consistently improved with longer timescales as well as multi-sensory information.
We reasoned that regions that are sensitive to multi-modal signals and/or longer
stimulus dynamics could be distinguished by interrogating the performance of these
models on unseen data. To date, encoding models have been rarely used in this
manner to assess integration timescales or sensory-sensitivity of different brain
regions. Classically, processing timescales have been probed using various empirical
strategies, for example, by observing activity decay over brief stimulus presentations
or by comparing auto-correlation characteristics of resting-state and stimulus-evoked
activity [287]. Further, multi-sensory regions are identified via carefully constructed
experiments with uni-modal and multi-modal stimulus presentations, followed by
analysis of interaction effects using statistical approaches [268]. Here, we suggest
that encoding models can form an alternate framework to reveal clues into these
functional properties that can be rigorously validated with future investigation.
As with interpreting the results of any predictive model, one should, however,
proceed with caution. Sounds are generated by events; this implies that sound
representations implicitly convey information about actions that generated them.
Similarly, visual imagery provides clues into auditory characteristics, such as the
presence or absence of speech. Thus, it is difficult to completely disentangle the
individual contributions of auditory and visual features to prediction performance
across cortical regions. Similarly, longer timescale inputs can lead to a more
147
robust estimate of the momentary sensory signal, potentially confounding the
interpretations of TRWs. Further, scanner noise can affect changes in BOLD signals
across the auditory cortex and several studies have reported that the phenomenon
is exacerbated at high field strengths [288]. Importantly, brain function requires
additional attentional resources and increased listening effort under the presence
of scanner noise and this may impact the processing of visual input as well, for
example, by affecting fixation locations and/or prioritizing attentional deployment
for auditory stimuli. Scanner noise can also reduce the sensitivity to stimuli of
interest (“movies”) by causing non-stimuli associated activations across the auditory
cortex which may interfere in non-trivial ways with stimuli induced activations. We
do not expect this to impact the prediction correlations since the influence of scanner
noise is expected to be independent of the stimuli characteristics, nonetheless this is
an important caveat of the proposed audio-based encoding models that hinders their
ability in explaining neural responses outside the scanner. Here, notwithstanding
the limitations, we contend that these models can, nonetheless, serve as powerful
hypothesis generation tools.
The methodological innovations in this study must also be considered in light of
their limitations. Due to high dimensionality of features in early layers of the ResNet
architecture for high-dimensional visual inputs, we employ pooling operations on
these feature maps. Thus, low-level visual features, such as orientations, are
compromised. The consequent unfavorable outcome is a low predictive performance
in V1. Due to a limited computational and memory budget, we could not experiment
with fine-tuning the visual sub-network in this study; in the future, with large-scale
collection of naturalistic fMRI datasets that represent a more extensive sampling of
the stimulus space, we anticipate that data-fitted or fine-tuned models may surpass
the baseline established by pre-trained goal-driven networks and may enable us
148
to inch closer to a complete model of the human visual cortex. These models
might even provide inspiration in the form of inductive biases or regularization for
representation learning on diverse perceptual tasks [289]. Further, since different
subjects can focus on different parts of the stimulus, group-level models can also
blur out the precise object orientation information. This is particularly relevant
for complex naturalistic stimuli such as movies. In the future, incorporating eye
gaze data into these models can be an interesting exploration. Furthermore, due
to computational constraints, the proposed model is only able to examine the
effects of stimuli up to 20 seconds in the past. However, previous research with
naturalistic stimuli has shown that some brain regions maintain memory of the order
of minutes during naturalistic viewing [290]. Existing evidence also suggests that
neural activity is structured into semantically meaningful and coherent events [266].
Capturing long-range context in encoding models can be a challenging, yet fruitful
endeavour yielding potentially novel insights into memory formation.
There are also inherent differences between proposed neural network models
and biological networks. DNNs fail to capture known properties of biological
networks such as local recurrence, however, they have been found to be useful
for modelling neural activity across different sensory systems. At present, feed-
forward DNNs trained on recognition tasks constitute the best predictors of sensory
cortical activations in both humans and non-human primates [243]. In light of this
observation, a recent study proposed that very deep feed-forward only CNNs (for
example, ResNet-50 as employed in this study for visual feature extraction) might
implicitly be approximating ‘unrolled’ versions of recurrent computations of the
ventral visual stream [291]. Object recognition studies on non-human primates
have also hinted at a functional correspondence between recurrence and deep non-
linear transformations [292]. Although the functional significance of intra-regional
149
recurrent circuits in core object recognition is still under debate, mounting evidence
suggests they may be subserving recognition under challenging conditions [292,
293]. Thus, investigation of more neurobiologically plausible models of the cortex
that innately model intra-regional recurrent computations should be explored in
the future, especially in relation to their role in visual recognition.
While the present study focuses on shared stimulus-driven brain signals across
a subject population, the quest to understand inter-individual variability in neural
responses remains an important direction forward that promises exciting scientific
discoveries linking brain activity to behavior and novel clinical applications [294].
Over the last decade, these inter-individual differences have been shown to result
from differences in attentional control and engagement [295], and perhaps more
intriguingly, from differences in interpretation [296], emotional valence/arousal [297]
as well as intrinsic individual traits and behavior [298]. The utility of studying inter-
individual variability under naturalistic stimulation from a clinical perspective has
already been highlighted in several prior studies using variants of the ISC framework
that attempt to extricate the stable and idiosyncratic stimulus-evoked response
component in subjects with neuropsychiatric disorders from the shared stimulus-
driven response that is consistent across a control subject population [299]. In the
future, we expect that this direction of using naturalistic paradigms to study inter-
individual differences will complement the approach of encoding models (at single-
subject or group-level) in understanding how the brain processes sensory signals
from its complex environment and how individual differences in this processing are
linked to individual behaviors.
Comprehensive descriptive models of the brain need comprehensive accounts
of the stimulus. In this study, using a novel group-level encoding framework, we
150
showed that ‘reliable’ cortical responses to naturalistic stimuli can be accurately
predicted across large areas of the cortex using multi-sensory information over
longer timescales. Since our models were trained on a large-scale, multi-subject and
open-source dataset, we believe these results could provide an important point of
reference against which encoding models for naturalistic stimuli can be assayed in
the future. The continued interplay of artificial neural networks and neuroscience
can pave the way for several exciting discoveries, bringing us one step closer to
understanding the neural code of perception under realistic conditions.
Data and Software availability
All data needed to evaluate the conclusions in the paper are present in the
paper and/or the Supplementary Materials. All experiments in this study
are based on the open-access Human Connectome Project movie-watching
database. The dataset is publicly available for download through the HCP website
(https://www.humanconnectome.org/). Throughout this study, we utilized 7T
fMRI data from the ‘Movie Task fMRI 1.6mm/59k FIX-Denoised’ package within
the HCP dataset. The network implementation and analysis codes are available at
https://github.com/mk2299/MultimodalEncoding/.
151
4.3 Neural encoding with visual attention
Abstract
Visual perception is critically influenced by the focus of attention. Due to limited
resources, it is well known that neural representations are biased in favor of attended
locations. Using concurrent eye-tracking and functional Magnetic Resonance
Imaging (fMRI) recordings from a large cohort of human subjects watching movies,
we first demonstrate that leveraging gaze information, in the form of attentional
masking, can significantly improve brain response prediction accuracy in a neural
encoding model. Next, we propose a novel approach to neural encoding by including
a trainable soft-attention module. Using our new approach, we demonstrate that
it is possible to learn visual attention policies by end-to-end learning merely on
fMRI response data, and without relying on any eye-tracking. Interestingly, we
find that attention locations estimated by the model on independent data agree
well with the corresponding eye fixation patterns, despite no explicit supervision to
do so. Together, these findings suggest that attention modules can be instrumental
in neural encoding models of visual stimuli. 1
4.3.1 Introduction
Developing accurate population-wide neural encoding models that predict the
evoked brain response directly from sensory stimuli has been an important goal
in computational neuroscience. Modeling neural responses to naturalistic stimuli,
in particular stimuli that reflect the complexity of real-world scenes (e.g., movies),
1Our code is available at https://github.com/mk2299/encoding_attention.
152
offers significant promise to aid in understanding the human brain as it functions
in everyday life [304]. Much of the recent success in predictive modeling of neural
responses is driven by deep neural networks trained on tasks of behavioral relevance.
For example, features extracted from deep neural networks trained on image or
auditory recognition tasks are currently the best predictors of neural responses
across visual and auditory brain regions, respectively [243, 246, 248]. While this
success is promising, the unexplained variance is still large enough to prompt novel
efforts in model development for this task. One aspect that is often overlooked in
existing neural encoding models in vision is visual attention.
Natural scenes are highly complex and cluttered, typically containing a myriad
of objects. What we perceive upon viewing complex, naturalistic stimuli depends
significantly on where we direct our attention. It is well known that multiple objects
in natural scenes compete for neural resources and attentional guidance helps to
resolve the ensuing competition [305]. Due to the limited information processing
capacity of the visual system, neural activity is biased in favor of the attended
location [306, 307]. Hence, more salient objects tend to be more strongly and
robustly represented in our brains. Further, several theories have postulated that
higher regions of the visual stream encode increasingly shift- and scale-invariant
representations of attended objects after filtering out interference from surrounding
clutter [308, 309]. These studies suggest that deployment of attention results in an
information bottleneck, permitting only the most salient objects to be represented
in the inferotemporal (IT) cortex, particularly the ventral visual stream which
encodes object identity. These findings together indicate that visual attention
mechanisms can be crucial to model neural responses of the higher visual system.
Visual attention and eye movements are tightly interlinked. Where we direct
153
our gaze often quite accurately signals the focus of our attention [310]. This
form of attention, known as overt spatial attention, can be directly measured by
eye-tracking. Recent work has shown that fMRI activity can be used to directly
predict fixation maps or eye movement patterns under free-viewing of natural scenes,
suggesting a strong link between neural representations and eye movements [311].
In a similar vein, Sinz et al. [312] demonstrated that gaze shifts as estimated from
pupil locations and behavioral states can be very useful in modeling spiking activity
of mouse V1 neurons. More recent large-scale efforts in such concurrent data
collection on humans, such as the Human Connectome Project (HCP) [313], that
simultaneously record fMRI and eye-tracking measurements on a large population
under free-viewing of movies, present a novel opportunity to probe the potential
role of attention in neural encoding models of ecological stimuli.
Our contributions in this study are as follows:
• We demonstrate that leveraging information about attended locations in an
input image can be helpful in predicting the evoked neural response. Particu-
larly, we show that attentional masking of high-level stimulus representations
based on human fixation maps can dramatically improve neural response
prediction accuracy for naturalistic stimuli across large parts of the cortex.
• We show that it is possible to use supervision from neural response prediction
solely to co-train a visual attention network. This training strategy thus
encourages only those salient parts of the image to dominate the prediction of
the neural response. We find that the neural encoding model with this trained
attention module outperforms encoding models with no or fixed attention.
• Interestingly, we find that despite not being explicitly trained to predict
fixations, the attention network within the neural encoding model compares
154
favorably against saliency prediction models that aim to directly predict likely
human fixation locations given an input image. This suggests that neural
response prediction can be a powerful supervision signal for learning where
humans attend in cluttered scenes with multiple objects. This signals a novel
opportunity for utilizing functional brain recordings during free-viewing to
understand visual attention.
4.3.2 Methods
Neural encoding models comprise two major components: a representation (feature
extraction) module that extracts relevant representations from raw stimuli and a
response model that predicts neural activation patterns from the feature space. We
propose to integrate a trainable soft-attention module on top of the representation
network to learn attention schemes that guide the prediction of whole-brain neural
response. Our proposed methodology is illustrated in Figure 4.7.
Feature extraction network We employ the state-of-the-art ResNet-50 [261]
architecture pre-trained for object recognition on ImageNet [262] as the representa-
tion network to extract semantically rich features from raw input images. In this
study, we focus on improving neural response prediction in higher-order regions
of the visual pathway where receptive fields are larger and not limited to a single
hemi-field. Prior evidence suggests that these regions are likely best modelled by
deeper layers of object recognition networks [245, 246]. Thus, we extract the output
of the last ”residual block”, namely res5 (after addition) before the global pooling
operation to encode all images into a 2048-channel high-level feature representation
image (of size 23× 32, in our experiments), denoted as Frep. All pre-trained weights
155
Figure 4.7: Proposed method. A trainable soft-attention module is implemented
on top of a pre-trained representation network to rescale features based on their
salience. The rescaled features are spatially pooled and fed into a convolutional
response model to predict whole-brain neural response. We assess the value of the
trained attention network by comparing it with neural encoding methods employing
(i) stimulus-dependent attention maps derived from human fixations (AG), (ii)
stimulus-independent attention map derived from all fixations in the training set
that reflects the center-weighted bias of our dataset (AC) as well as a (iii) no
attention model that spatially pools the features directly with no scaling.
are kept frozen during training of the neural encoding models.
Attention network The attention network operates on the 2048-channel feature
representation image Frep. For simplicity, we employed a single convolutional layer
that constructs the saliency map with a trainable 5× 5 filter V ∈ R5×5×2048×1att as,
S = Gσ ∗ [Vatt ∗ Frep]+. Here, | · |+ denotes the ReLU operation and Gσ∗ indicates
blurring using a 5× 5 gaussian kernel with σ = 1. The attention scores for each
pixel are finally computed from saliency maps by normalizing with the spatial
softmax operation,
∑ expS(i)(i)Al = n exp ( ) , i ∈ {1, .., n}. (4.1)jj=1 S
156
Here, superscript i is used to index the 23 × 32 spatial locations in the feature
map Frep. We note that existing literature on selective visual attention suggests a
hierarchical winner-take-all mechanism for saliency computation, where only the
particular subset of the input image that is attended is consequently represented in
higher visual systems [307]. The softmax operation can be construed as approxi-
mating this winner-take-all mechanism. The attention is consequently applied as
element-wise scaling to Frep to yield an attention modulated representation Farep =
FrepA.
Convolutional response model The convolut∑ional response model maps thespatially pooled attention modulated features f g = ni=1 Fa(i)rep to the neural represen-
tation space, reshapes them into coarse 3D feature maps and transforms them into
an increasingly fine-grained volumetric neural activation pattern using trainable
convolutions. This dramatically reduces the parameter count in comparison to
linear response models with dense connections. Additionally, it captures spatial
context and allows end-to-end optimization of the neural encoding model to predict
high-resolution neural response, thereby alleviating the need for voxel sub-sampling
or selection. The full sequence of feedforward computations in the convolutional
response model are shown in the inset of Figure 4.7. The architecture of the
convolutional response model is kept consistent across all CNN-based models to
ensure a fair comparison.
Baselines and upper bounds
No attention We compared the performance of all attention-based models against
a model with no attention modulation that spatially pools the feature representation
157
as, f g =
∑n
i=1 F
(i)
rep (denoted as ‘No attention’). We implemented another baseline
that uses the full feature map directly (instead of spatial pooling) as a flattened input
to the convolutional response model. Due to computational/memory constraints,
we had to reduce the dimensionality of the fully connected layer (to 256 units
instead of 1024) in the convolutional response model for this encoding method.
This model is henceforth denoted as ‘No pooling’.
Center-weighted attention To further assess the usefulness of a learned at-
tention network, we derive a stimulus-independent attention map (AC) based on
averaging across all eye gaze data in the training set, using Gaussian kernel density
estimation. This essentially amounts to center-weighted attention (see Appendix)
since fixation locations on average are biased towards the center of an image [314].
The standard deviation of the Gaussian kernel is chosen to maximize log-likelihood
on the validation set and is consequently set to 20.
Gaze-weighted attention We derive attention maps for every input frame from
the eye gaze coordinates observed for the respective frame across different subjects.
The human fixation maps are converted into attention maps AG by blurring with a
Gaussian kernel of same standard deviation as the center-weighted attention model.
The resulting attention maps in the original input image space are subsequently
resized to the spatial dimensions of Frep and renormalized. Since these stimulus-
specific attention maps are derived from actual human gaze information, they likely
represent an upper bound in neural encoding performance among all attention-based
models.
158
Linear models To date, neural encoding models in all prior work employ a
linear response model with appropriate regularization on the regression weights.
To compare against this dominant approach, we extract global average pooled
(no-attention) features as well as pooled attention modulated features for both
non-trainable attention schemes (center-weighted and gaze-weighted attention) as
described above, to present to the linear regressor. We apply l2 regularization on
the regression coefficients and adjust the optimal strength of this penalty λ through
cross-validation using 10 log-spaced values in {1e−5, 1e5}. In later sections, we
denote the performance of the above models as ‘No attention (linear)’, ‘Center-
weighted attention (linear)’ and ‘Gaze-weighted attention (linear)’ respectively.
Training procedure
All parameters were optimized to minimize the mean squared error between the
predicted and target fMRI response using Adam [315] for 25 epochs with a learn-
ing rate of 1e-4. Validation curves were monitored to ensure convergence and
hyperparameters were optimized on the validation set.
Evaluation
Neural encoding We evaluated the performance of all encoding models on the
test movie by computing the Pearson’s correlation coefficient (R) between the
predicted and measured fMRI response at each voxel. Since we are only interested
in the stimulus-driven response, we isolate voxels that exhibit high inter-group cor-
relations over all training movies. Inter-group correlation (“synchrony”) values were
computed by splitting the population into half and computing correlations between
the mean response time-course of each group (comprising 79 subjects) at every voxel.
159
We chose a data-driven metric (synchrony) to isolate the stimulus-driven cortex in
order to avoid reliance on pre-defined atlases or functional localizers in identifying
the voxels of interest. However, since choosing an arbitrary synchrony threshold
may introduce a bias in the reported metrics, we employed a range of threshold
values, from very loose (0.15) to very strict (0.75) for the correlation value to con-
sider a voxel as “synchronous” [316]. Finally, to summarize the prediction accuracy
across the stimulus-driven cortex, we compute the mean correlation coefficient
across the synchronous cortex voxels by varying the “synchrony” thresholds from
0.15 (resulting in 160,900 voxels) to 0.75 (8,804 voxels). The spatial distribution
of synchronous voxels across the brain as we vary the synchrony thresholds is
illustrated in Figure 4.8(B). For region level analysis, ROIs were extracted using a
population-wide multi-modal parcellation of the human cerebral cortex, namely
the HCP MMP parcellation [264].
Saliency prediction Next, we wanted to assess if the learned attention model
was indeed looking at meaningful locations in input images while predicting neural
responses. To address this question and put the learned attention schemes in
perspective, we assessed the agreement of predicted saliency maps with human
fixation maps for every frame in the test movie. Besides a qualitative evaluation,
we computed quantitative metrics for comparing the predicted saliency maps
against popular fixation (or saliency) prediction approaches. These include: (i)
Itti-Koch [317]: a biologically plausible model of saliency computation that assigns
pixel-level conspicuity values based on multi-scale low-level feature maps (intensity,
color, orientation) computed via center-surround like operations similar to visual
receptive fields, (ii) Deepgaze-II model [318]: a deep neural network based approach
that extracts high-level features from a pre-trained image recognition architecture
160
(VGG19) as input to a readout network that is subsequently trained to predict
fixations using supervision from gaze data, and (iii) Intensity contrast features
(ICF) model [318]: a low-level saliency computation model that uses the same
readout architecture as the Deepgaze-II model, but on low-level intensity and
intensity contrast feature maps as opposed to high-level features. Additionally, we
also report evaluation metrics for the center-weighted saliency map. We note that
the Deepgaze-II and ICF models were trained with eye-tracking supervision on the
MIT1003 saliency dataset [319].
Developing metrics for saliency evaluation is an active area of research and several
different metrics have been proposed that often exhibit discrepant behavior [320].
We report the most commonly used metrics in saliency evaluation [320], including, (i)
Similarity or histogram intersection (SIM), (ii) Pearson’s correlation coefficient (CC),
(iii) Normalized scanpath saliency (NSS), (iv) Area under the ROC curve (AUC)
and (v) Shuffled AUC (sAUC). Following [321], we used log-density predictions as
saliency maps to compute all evaluation metrics.
4.3.3 Dataset
We study high-resolution 7T fMRI (TR = 1s, voxel size = 1.6 mm isotropic)
recordings of 158 participants from the Human Connectome Project (HCP) movie-
watching database while they viewed 4 audio-visual movies in separate runs [313,
322]. Each movie scan was about 15 minutes long, comprising multiple short clips
from popular Hollywood movies and/or vimeo. Eye gaze locations of subjects
were also recorded simultaneously at 1000Hz and resampled to 24Hz to match the
video frame acquisition rate. All fMRI data was preprocessed using the HCP FIX
denoising procedures, which include motion and distortion correction, high-pass
161
filtering (2000 sec cut-off), head motion effect regression using Friston 24-parameter
model (i.e., 6 rigid body motion parameters, their backward temporal derivatives
and squares of those time series), automatic removal of artifactual timeseries by
applying regression based on Independent Component Analysis (ICA) [323] as
well as nonlinear registration to the MNI template space [257, 322]. Since the
present study focuses on the development of population-wide predictive models, we
averaged the response for each frame across subjects to obtain a single fMRI volume
that represents the population average brain activation in response to that frame.
After discarding rest periods as well as the first 10 seconds of every movie segment,
we used about 12 minutes of audio-visual stimulation data per movie paired with
the corresponding fMRI response and fixation data for analysis. We extract the
last frame of every second of the video as a 720× 1280× 3 RGB input to present as
stimulus to the neural encoding models. The output is the predicted response across
the entire brain, represented as a volumetric image of dimensions 113× 136× 113.
We estimate a hemodynamic delay of 4 sec using regression based encoding models
(see Appendix), as the response latency that yields highest encoding performance.
Thus, all proposed and baseline models are trained to use the above stimuli to
predict the fMRI response 4 seconds after the corresponding stimulus presentation.
We train and validate our models on three movies using a 9:1 train-val split and
leave the fourth movie for independent testing. This yields 2000 training, 265
validation and 699 test stimulus-response pairs.
4.3.4 Results
Incorporating gaze-weighted attention significantly improves neural re-
sponse prediction. We first examined whether attention weighted pooling helps
162
Figure 4.8: Quantitative evaluation of all models. (A) depicts mean corre-
lation values across the synchronous, (i.e., stimulus-driven) cortex defined at a
range of synchrony thresholds ([0.15,0.75]). Each point thus reflects the mean
prediction accuracy for a model across all voxels within synchronous cortex defined
by a threshold value (x-axis). (B) depicts the inter-group correlation (synchrony)
values across the entire human cerebral cortex.
to improve response predictions. Figure 4.8 shows the mean prediction accuracy
across the entire synchronous cortex for all models considered in this study. We
find that the ‘gaze-weighted attention’ model significantly outperforms the ‘no
attention’ model for both linear (∼ 40 % improvement among all voxels with
synchrony>0.15), as well as convolutional response model (∼ 47 % improvement
among all voxels with synchrony>0.15). The attention maps result in amplification
of features of attended locations along with suppression of other irrelevant infor-
mation. This re-scaling of features before pooling using fixation patterns obtained
from eye-tracking data remarkably improves neural encoding performance across
large areas of the cortex, suggesting that neural responses are indeed dominated
by sensory signals at attended locations. Although we employed a convolutional
response model primarily for computational efficiency in predicting a high-resolution
(113x136x113) whole-brain neural response, we also observed a small improvement
in neural encoding with this response model in comparison to a linear response
model.
163
Figure 4.9: Top: ROI-level analysis Mean correlation values across intermediate
(V4), higher visual areas in the inferotemporal cortex and its neighborhood and
other higher higher-level visual regions (Dorsal, MT+) as described in the HCP
MMP parcellation [264]. Error bars represent 95% confidence intervals around mean
estimates computed using bootstrap sampling. (A)-(E) Prediction accuracy
across the cortical surface for all deep CNN-based models. Statistical
significance of individual voxel predictions is computed as the p-value of the
obtained sample correlation coefficient for the null hypothesis of uncorrelatedness
(i.e., true correlation coefficient is zero) under the assumptions of a bivariate normal
distribution. Only significantly predicted voxels (p<0.05, FDR corrected) for each
method are colored on the surface. Prediction accuracy maps for encoding methods
with linear response models are provided in the Appendix.
164
Trainable attention model outperforms models with no attention or
center-weighted attention In addition to improving neural response prediction,
the convolutional response model renders end-to-end training of encoding models
on whole-brain neural data feasible by dramatically reducing the number of free
parameters in comparison to linear response models. In this study, we exploited
this increased parameter efficiency to co-train an attention network on top of a
pre-trained representation network (while freezing the representation network) for
the goal of neural response prediction. As shown in Figure 4.8, the encoding model
with learned attention surpasses models with no pooling, no attention or center-
weighted attention in mean prediction accuracy across the sychronous cortex as well
across most ROIs involved in object processing. This suggests that even with no
eye-tracking data, as is the case with majority movie-watching fMRI datasets, mod-
elling visual attention as re-weighting of stimulus representations based on spatial
attention masks can still be beneficial in response prediction. The improvements are
most apparent in ventral stream regions such as the Fusiform Face Complex (FFC)
and PIT Complex, as well as object-selective parts of the lateral occipital complex
(LO1, LO2, LO3) (Figure 4.8). Studies in visual perception have shown that these
lateral occipital areas respond more strongly to intact objects than scrambled
objects or textures, providing strong evidence for their role in object recognition
as well as object shape perception [324, 325, 326]. Accuracy in another group of
areas within the temporo-parieto-occipital junction, which is known to be involved
in visual object recognition as well as representation of facial attributes such as the
intensity of facial expressions [327], is similarly improved with the trained attention
network. In addition to these areas, we also observe some improvement in neural
encoding performance in other higher order processing regions across the dorsal
visual stream, motion-sensitive visual regions (MT+ complex) and their neighbor-
165
ing visual areas (Figure 4.9). We also trained the proposed and baseline models
on representations of other randomly selected deep layers within the ResNet-50
architecture and observed a similar benefit of attention modulation across different
layers (see Appendix). Further, a representational similarity analysis comparing
non-modulated and attention modulated representations of different layers across
popular architectures showed that models that explain stimulus-dependent human
fixation patterns are able to better account for the representational geometry of
neural responses across intermediate and higher visual object processing areas
(see Appendix). Taken together, these findings provide further support for the
utility of attention modelling in neural encoding approaches. In addition to improv-
ing accuracy, the attention model further affords interpretability by highlighting
salient locations within the input image that are being employed to make response
predictions.
Learned attention policies correspond remarkably well with human fixa-
tion maps. Figure 4.10 depicts saliency maps predicted by the trained attention
network on sampled frames from the test movie. This qualitative assessment indi-
cates that the proposed neural encoding model learns attention policies that are
consistent with human fixation maps. Since attention is learned on top of high-level
features, the model learns to focus on high-level stimuli features such as the presence
of faces, hands and more conspicuous objects likely to direct attention in natural
scenes. A closer look at incongruent cases indicates that images where the model
fails to track human fixations are often highly complex scenes, where fixations may
be driven by contextual knowledge of previous movie frames (Figure 4.10, top-right)
or auditory signals, e.g., who the speaker is, etc. (Figure 4.10, bottom-right).
Table 4.1 shows quantitative metrics that compare the quality of saliency maps
166
Figure 4.10: Qualitative assessment of saliency (log-density) maps. Top row
shows sampled frames from the test movie, middle row shows human fixation maps
overlaid on the corresponding frame, bottom row shows saliency maps predicted by
the attention network of the proposed neural encoding model. Blue indicates high
saliency values whereas red indicates low saliency.
computed by benchmark models trained to predict gaze on our data. We also
listed the performance of the attention network that was merely trained on fMRI
data, and not eye gaze data. We note that our attention network performs on
par with popular fixation prediction models that are trained directly on the task
of saliency prediction in a supervised manner (ICF and Deepgaze-II). This trend
holds for almost all saliency evaluation metrics, as shown in Table 4.1. This
observation is particularly interesting given that the attention network is trained
using supervision from neural response prediction only, without any information
about gaze coordinates.
167
Table 4.1: Evaluation against saliency prediction models. Mean and standard errors
for each metric are reported. Best results are bolded.
Model SIM ↑ CC ↑ NSS ↑ AUC ↑ sAUC ↑
Itti-Koch 0.318± 0.002 0.325± 0.004 1.010± 0.014 0.795± 0.004 0.537± 0.006
ICF 0.291± 0.002 0.190± 0.007 0.646± 0.023 0.665± 0.006 0.647± 0.005
Center-weighted 0.327± 0.002 0.350± 0.004 1.074± 0.013 0.803± 0.003 0.496± 0.006
Deepgaze-II 0.359± 0.003 0.420± 0.005 1.425± 0.025 0.808± 0.004 0.713± 0.004
Ours 0.392± 0.004 0.403± 0.010 1.375± 0.041 0.754± 0.006 0.645± 0.006
4.3.5 Discussion and Conclusion
In the present study, we demonstrate that encoding models with visual atten-
tion, whether explicitly estimated from human fixation maps or modelled using a
trainable soft-attention scheme, yield significant improvements in neural response
prediction accuracy over non-attention based counterparts. We observe consistent
improvements across most high-level visual processing regions, suggesting that
unattended portions of an input image may likely have little effect on neural rep-
resentations in these regions. Loosely, this aligns well with Treisman’s feature
integration theory [328], which proposes that integrated object representations are
only formed for attended locations. In addition to improving response prediction
accuracy, inclusion of visual attention within neural encoding models promises a
better understanding of spatial selection and its influence on neural representations
and perceptual processing. Further, while our study integrates a spatial attention
module within a neural encoding model, the proposed approach is not restricted
to this particular form of attention. For example, spatially global feature-based
attention can also be studied within the context of neural encoding models as
”channel-wise” attention-weights instead of spatial attention masks. We believe the
observation that neural response prediction may be a useful supervision goal to
168
study attentional deployment is particularly exciting and can be extended in novel
ways.
The saliency of a stimulus often depends on the context within which it is pre-
sented and attentional selection strategies can be modulated by task demands [308].
Importantly, attention selects across space and time; here, we focus on spatial
selection of stimuli but it is likely that modeling temporal context can lead to
substantive improvements. Context can also help in highlighting attentional targets
that may be driven by “surprise”. Thus, in movie watching, future neural encoding
models should also capture the sequence of frames, rather than isolated frames,
and the audio track in modeling attention.
Our study provides a first attempt in capturing visual attention within neural
encoding models. We see several opportunities for extending this work. In the
present framework, we employed attention as a masking strategy to filter out clutter
and retain information from only the most relevant (i.e. attended) parts of an image.
It would be interesting to study how and where the features of ignored stimuli
(i.e. the stimuli that doesn’t get past the attentional bottleneck) are encoded.
Further, here, we modeled attention on top of high-level semantic features. In
principle, the attention network can be implemented on top of any level within the
representation network hierarchy, including lower stages and understanding where
attention computations leads to best neural prediction accuracy and/or agreement
with human fixation maps could be a worthwhile exploration. A straightforward
extension in this direction would be to add the attention module on top of both
low-level cues and high-level representations or to combine feature maps across
layers before presenting to the attention network. In the future, we aim to further
explore novel ways of incorporating attention within neural encoding models.
169
Beyond advancing our understanding of sensory perception, neural encoding
models have potential for real-world applications, most obviously for brain-machine
interface. Additionally, an improved understanding of the link between sensory
stimuli and evoked neural activation patterns can provide opportunities for neural
population control, for e.g., by synthetically designing stimuli to elicit a specific
neural activation pattern [286].
4.3.6 Broader Impact
Understanding the link between sensory stimulation and evoked neural activity in
humans as revealed with encoding models, can provide foundations for developing
novel therapies. Viewed in this regard, an improved understanding of information
processing in the brain has tremendous potential. However, encoding models can be
very sensitive to biases in the training set. Our models were trained using data from
the Human Connectome Project database. While this large-scale project has made
a lot of valuable data publicly available to the scientific community for studying
brain structure and function, it is important to consider the representational bias
in the dataset. For instance, the data we analyzed is exclusively limited to a
young adult population. Such biases can possibly lead to poorer generalization
of models trained with these large-scale datasets on other population groups that
are inadequately represented. Once these encoding models are ripe for therapeutic
applications, this dataset bias could prevent under-represented groups from deriving
the benefits of a useful technology, resulting in uneven access across populations.
Given these considerations, it is important to address potential representation
biases in fMRI datasets and develop solutions for improving diversity and inclusion.
More generally, fMRI studies involving human subjects can raise a wide range of
170
other ethical issues as well, including data privacy issues and informed consent.
Further, one should be cautious about the deployment of attention or gaze
prediction models in applications such as advertising. Given the value of eye
tracking based attention in marketing spaces, public policy notices or political
campaigns, it is important to be wary of a malicious use of these attention prediction
methods for profit-seeking or by ill-intentioned parties seeking to further their own
agendas. These applications regard attention as a commodity to be captured and the
adopted technologies can be used to manipulate users in subtle ways. An improved
understanding about the link between stimuli and perceptual processing in the brain,
as provided with encoding models, can also be exploited to further design or identify
stimuli likely to elicit a specific emotional or cognitive response. The fact that
these technologies can be deployed without the targeted individual’s knowledge or
consent indicates it is important to protect users from the vulnerabilities exploited
by these agents.
171
CHAPTER 5
A SHARED ENCODING MODEL FOR SUBJECT-SPECIFIC
RESPONSE PREDICTION
Abstract
The increasing popularity of naturalistic paradigms in fMRI (such as movie watch-
ing) demands novel strategies for multi-subject data analysis, such as use of neural
encoding models. In the present study, we propose a shared convolutional neu-
ral encoding method that accounts for individual-level differences. Our method
leverages multi-subject data to improve the prediction of subject-specific responses
evoked by visual or auditory stimuli. We showcase our approach on high-resolution
7T fMRI data from the Human Connectome Project movie-watching protocol
and demonstrate significant improvement over single-subject encoding models. We
further demonstrate the ability of the shared encoding model to successfully capture
meaningful individual differences in response to traditional task-based facial and
scenes stimuli. Taken together, our findings suggest that inter-subject knowledge
transfer can be beneficial to subject-specific predictive models.1
5.1 Introduction
Naturalistic imaging paradigms, such as movies and stories, emulate the diversity
and complexity of real-life sensory experiences, thereby opening a novel window into
the brain. The last decade has seen an increased foothold of naturalistic paradigms
in cognitive neuroimaging, fueled by the remarkable discovery of inter-subject
1Our code is available at https://github.com/mk2299/SharedEncoding_MICCAI.
172
synchrony during naturalistic viewing [316]. Naturalistic stimuli also demonstrate
increased test-retest reliability and more active subject engagement in comparison
to alternate paradigms such as resting-state fMRI [304]. Furthermore, experiments
have shown that naturalistic stimuli can induce stronger neural response than
task-based stimuli [285], suggesting that the brain is intrinsically more attuned to
the former. Taken together, these benefits suggest an exciting future for naturalistic
stimulation protocols in fMRI.
With large-scale compilation of multi-subject neural data through open-source
initiatives such as the Human Connectome Project (HCP) [313], the development
of approaches that can handle this enormous data is becoming imperative. Two
approaches, namely inter-subject correlation (ISC) analysis [316, 329] and shared
response model (SRM) [330], have dominated the analysis of multi-subject fMRI
data under naturalistic conditions. The former approach exploits similarity in
activation patterns across subjects to isolate stimulus-induced processing. The
latter technique, SRM, decomposes neural activity into a shared response component
and subject-specific spatial bases, and has been used for inter-subject knowledge
transfer through functional alignment. While simple and efficient, both these
approaches rely on a common time-locked stimulus across subjects and cannot,
by design, model responses to completely unseen stimuli. On the other hand,
predictive modelling of neural activity through encoding models is based upon
generalization to arbitrary stimuli and can thus offer more holistic descriptions of
sensory processing in an individual [242].
Neural encoding models map stimuli to fine-grained voxel-level response patterns
via complex feature transformations. Previously, neural encoding models have
yielded several novel insights into the functional organization of auditory and visual
173
cortices [243, 245, 246, 248]. Encoding models encapsulating different hypothesis
about neural information processing can be pitted against each other to shed
new light on how information is represented in the brain. In this manner, neural
encoding models have been largely used for making group-level inferences. The
potential to extract meaningful individual differences from naturalistic paradigms
remains largely untapped. Understanding inter-subject variability in behavior-to-
brain representations is of key interest to neuroscience and can potentially even
help identify atypical response patterns [331]. Modelling individual brain function
in response to naturalistic stimuli is one step in this direction; however, building
accurate individual-level models of brain function often requires large amounts of
data per subject for good generalization. The problem is further exacerbated by
the variability in anatomy and functional topographies across individuals, making
inter-subject knowledge transfer difficult. There is limited work in leveraging multi-
subject data for more robust and accurate individualized neural encoding. To our
knowledge, this problem has been studied only in the context of natural vision
with a handful subjects using a Bayesian framework [332]. Further, the proposed
method in [332] transfers knowledge from one subject’s encoding model into another
through a two-stage procedure and does not allow simultaneous optimization of
encoding models across multiple subjects.
In this paper, we attempt to fill this gap; to this effect, we propose a deep-
learning based framework to build more powerful individual-level encoding models
by leveraging multi-subject data. Recent studies have revealed that coarse-grained
response topographies are highly similar across subjects, suggesting that individual
idiosyncrasies manifest in more fine-grained response patterns [247, 330]. This hints
to the idea that encoding models could share representational spaces across subjects
to overcome the challenges imposed by a limited quantity of per-subject data. We
174
Figure 5.1: Proposed approach: Feature pyramid networks are used to extract
hierarchical features from pre-trained image/sound recognition networks. Dense fea-
tures are reshaped into coarse 3D feature maps, which are mapped into increasingly
fine-grained maps using convolutions. Coarse feature transformation layers are
shared across subjects while deeper convolutional layers close to predicted response
are subject-specific.
exploit this intuition to develop a neural encoding model with a common backbone
architecture for capturing shared response and subject-specific projections that
account for individual response biases, as demonstrated in Figure 5.1. Our proposed
approach has several merits: (i) It allows us to combine data from multiple subjects
watching same or different movies to build a global model of the brain. At the same
time, it can capture meaningful individual-level deviations from the global model
which can potentially be related to individual-specific traits. (ii) It is amenable
to incremental learning with diverse, varying stimuli across seen or novel subjects
with less constraints on data collection from single subjects. (iii) It poses minimal
memory overhead with additional subjects and can thus handle fMRI datasets with
a large number of subjects.
175
5.2 Methodology
Our proposed methodology is illustrated in Figure 5.1. Neural encoding models
comprise two components: (a) a feature extractor, which pulls out relevant features
from raw images or audio waveforms and (b) a response model, which maps these
stimuli features into brain responses. In contrast to existing works that employ a
linear response model [245, 246], we propose a CNN-based response model where the
coarse 3D feature maps are shared across subjects and fine-grained feature maps are
individual-specific. Previous studies have reported a cortical processing hierarchy
where low-level features from early layers of a CNN-based feature extractor best
predict responses in early sensory areas while semantically-rich deeper layers best
predict higher sensory regions [246, 248]. To account for this effect, we employ a
hierarchical feature extractor based on feature pyramid networks [260] that combines
features from early, intermediate and later layers simultaneously. The output of
the feature extractor is fed into the convolutional response model to predict the
evoked fMRI activation. This enables us to train both components of the network
simultaneously in an end-to-end fashion.
Formally, let D = {Xi,Y Ni}i=1 denote the training data pairs for N subjects,
where Xi denotes the stimuli presented to subject i and Yi denotes the corresponding
fMRI measurements. We represent Xi as RGB images or grayscale spectrograms
for the visual and auditory models, respectively. The feature model maps the
2D input into a vector representation s and is parameterized using a deep neural
network F(Xi;φ) that is common across subjects. In our experiments, this model
is a feature pyramid network built upon pre-trained recognition networks as DNNs
optimized for image or sound recognition tasks have proven to provide powerful
feature representations for encoding brain response. We define a differentiable
176
function G(s; θ) that maps the features into a shared latent volumetric space z,
whose first 3 axes represent the 3D voxel space and the last axis captures the
latent dimensionality. The predicted response for each subject is then defined using
subject-specific differentiable functions Hi(z;ψi) that project the coarse feature
maps z into an individualized brain response. We represent G and Hi’s using
convolutional neural networks to have a sufficiently expressive model. Thus, θ
and {ψi} represent a mix of convolutional kernels or dense weight matrices. The
number of shared parameters, |θ|+ |φ| is kept much greater than the cardinality of
subject-specific parameters |ψi| to accurately estimate the shared latent space. All
parameters {φ, θ, ψi} are trained jointly to minimize the mean squared error between
the predicted and true response. The proposed method allows us to propagate
errors through the shared network even if the subjects are not exposed to common
stimuli since we can always backpropagate errors for subjects independently within
each batch. Furthermore, using individualized layers to account for subject-specific
biases enables the model to weigh gradients coming from losses of each subject
differently according to their signal-to-noise ratio. This makes the model less
susceptible to noisy measurements when responses for the same stimuli are available
from multiple subjects.
5.2.1 Implementation details
We employ pre-trained Resnet-50 [261] and VGG-ish [258] architectures in the
bottom-up path of Figure 5.1 to extract multi-scale features from images and audio
spectrograms, respectively. The base architectures were selected because pre-trained
weights of these networks optimized for classification on large datasets, namely
Imagenet[262] and Youtube-8M[263], were publically available. For Resnet-50, we
177
use activations of the last residual block of each stage, namely, res2, res3, res4 and
res5 (notation from [333]) to construct our stimulus descriptions s. From the VGG
network, we use the activations of each convolutional block, namely, conv2, conv3,
conv4 and the penultimate dense layer fc2[334]. The first three set of activations
are refined through a top-down path to enhance their semantic content, while
the last activation is concatenated into s directly (res4 activations are vectorized
using global average pool). The top-down path comprises three feature maps at
different resolutions with an up-sampling factor of 2 successively from the deepest
layer of the bottom-up path. Each such feature map comprising 256 channels is
merged with the corresponding feature map in the bottom-up path (reduced to
256 channels by 1x1 convolutions) by element-wise addition. Subsequently, the
feature map at each resolution is collapsed into a 256 dimensional feature vector
through a global average pool operation and concatenated into s. The aggregated
features are then passed onto a shared CNN (denoted G above) comprising the
following feedforward computation: a fully connected layer to map the features into
a vector space which is reshaped into a 1024-channel cuboid of size 6x7x6 followed
by two 3x3x3 transposed convolutions (conv.T) with a stride of 2 to up-sample the
latter and obtain z. Each convolution reduces the channel count by half, thereby,
resulting in a shared latent z that is a 256 channel cuboid of size 27x31x27x256.
Subject-specific functions Hi’s are parameterized as a cascade of two 3x3x3 conv.T
operations (stride 2) with output dimensions 128 and 1 respectively. It is important
to emphasize that these operations constitute much fewer parameters, thereby
favoring the estimation of a shared truth. As we demonstrate empirically, a shared
space allows much better generalization. At the same time, we find that even the
limited subject-specific parameters can adequately capture meaningful individual
differences. All parameters were optimized using Adam[315] with a learning rate of
178
1e-4. Auditory and visual models were trained for 25 and 50 epochs respectively
with unit batch size. Validation curves were monitored to ensure convergence.
5.2.2 Data and Preprocessing
We study 7T fMRI data (TR = 1s) from a randomly selected sample of N=10
subjects from HCP movie-watching protocol [313, 322]. The dataset comprises 4
audiovisual movies, each ∼15 mins long. Preprocessing protocols are described in
detail in [257, 322]. For our experiments, we utilize the 1.6mm MNI-registered volu-
metric images of size 113 x 136 x 113 per TR. We compute log-mel spectrograms
using same parameters as [258] over every 1 second of audio waveform to obtain
a 2D image-like input for the VGG audio feature extractor. We extract the last
frame of every second of the video to present to the image recognition network for
visual features. We estimate a hemodynamic delay of 4 sec using regression based
encoding models, as the response latency that yields highest encoding performance.
Thus, all proposed and baseline models are trained to use the above stimuli to
predict the fMRI response 4 seconds after the corresponding stimulus presentation.
We train and validate our models on three movies using a 9:1 train-val split and
leave the fourth movie for independent testing. This yields 2000 training, 265
validation and 699 test stimulus-response pairs per subject.
5.2.3 Baselines
• Linear response model (individual subject): Here, we train independent
models for each subject using linear response models. We note that, thus
far, this is the dominant approach to neural encoding. To enable a fair
179
comparison, we extract hierarchical features of the same dimensionality as
the proposed model to present to the linear regressor. The only difference
here is the lack of a top-down pathway (since it is not pre-trained), which
prevents the refinement of coarse feature maps before aggregation. We apply l2
regularization on the regression coefficients and adjust the optimal strength of
this penalty through cross-validation using log-spaced values in {1e−10, 1e10}.
We report the performance of the best model as ‘Individual (Linear)’.
• CNN response model (individual subject): Here, we employ the same archi-
tecture as the proposed model but with only one branch of subject-specific
layers. We train this network independently for each subject without weight
sharing and denote its performance as ‘Individual model (CNN)’.
• Shared model (mean): Here, we employ the proposed model after training but
instead of computing predictions using the same subject’s learned weights,
we compute N predictions from all subject-specific branches. We compute
the mean performance obtained by correlating each of these predictions with
the ground truth response of a subject and denote this as ‘Shared (mean)’.
5.2.4 Performance evaluation
We measure performance on the test movie by computing the Pearson’s correlation
coefficient between the predicted and measured fMRI response at each voxel. Since
different subjects have a different signal-to-noise ratio, we normalize each voxel’s
correlation by the subject’s noise ceiling for that voxel. We compute the subject-
specific noise ceiling by correlating their repeated measurements on a validation
clip. Further, since we are only interested in the stimulus-driven response, we
measure performance in voxels that exhibit high inter-subject correlations. We
180
randomly split the 10 subjects into groups of 5, and correlate the mean activity
of the two groups. We repeat this process 5 times and voxels that exhibit a mean
correlation greater than 0.1 are identified as synchronous voxels. We compute
the mean normalized correlations across all synchronous voxels to achieve a single
metric per subject, denoted as ‘Prediction accuracy’. We also correlate the predicted
response of each subject against the predicted and true response of every other
subject to obtain an N ×N correlation matrix for shared models. To account for
higher variability in measured versus predicted response, we normalize the rows
and columns of this correlation matrix following [335].
5.2.5 Demonstration of application: personalized brain
mapping
To investigate if the proposed model is indeed capturing meaningful individual
differences, we use the trained encoding model to predict fMRI activations for
distinct visual object categories from the HCP task battery. Specifically, we predict
brain response to visual stimuli (comprising faces, places, tools and body parts)
from the HCP Working Memory (WM) task and use the predicted response to
synthesize face and scene contrasts (FACES-AVG and PLACES-AVG respectively)
for each individual. The predicted and true contrasts are thresholded to keep top
5% of the voxels. We compute the Dice overlap between the predicted contrast for
each subject against the true contrast of every subject (including self) to produce
an N ×N matrix for each contrast.
181
5.3 Results
Figure 5.2: Quantitative evaluation: Bar charts illustrate subject-wise prediction
accuracy of all models, box plots depict the distribution over subjects for %
of synchronous voxels significantly predicted (p<0.05, FDR corrected). N × N
correlation matrices depict the (normalized) correlation coefficient between predicted
and measured responses.
Figure 5.2 shows prediction accuracy of the proposed (‘Shared’) and baseline
methods for each subject. The performance improvement is striking between pro-
posed and individual subject models, suggesting that a shared backbone architecture
can significantly boost generalization. Comparative boxplots further show that the
proposed method predicts a much higher percentage of the synchronous cortex than
individual subject models. Further, the difference between ‘Shared’ and ‘Shared
(mean)’ as well as the dominant diagonal structure in correlation matrices suggest
that the proposed method is indeed capturing subject idiosyncrasies rather than
predicting a group-averaged response. Further, while the CNN response model
performs slightly better in visual encoding, it incurs a performance drop compared
to linear regression in auditory encoding. This perhaps suggests that the boost in
accuracy seen for shared models is largely due to inter-subject knowledge transfer
rather than the convolutional response model itself.
182
In Figure 5.3(A) & 5.3(B), we visualize the un-normalized correlations between
the predicted and measured fMRI response for the proposed models, averaged
across subjects. For the auditory model, we see significant correlations in the
parabelt auditory cortex, extending into the superior temporal sulcus and some
other language areas (55b) as well. For the visual model, while we see significant
correlations across the entire visual cortex (V1-V8), the performance is much better
in higher-order visual regions, presumably because of the semantically rich features.
The lower performance in early visual regions could also result from the dynamic
nature of visual stimulation in movies.
Figure 5.3(C) & 5.3(D) illustrate the ability of our proposed model to characterize
individual differences even beyond the experimental paradigm it was trained on. The
diagonal dominance in the dice matrix for both contrasts suggests that predicted
contrasts are most similar to the same subject’s true contrast. No prominent
diagonal structure was observed for individual subject models, presumably because
of their poor generalization to out-of-domain stimuli from the HCP task battery.
Further, predicted contrasts consistently highlight known areas for face and scene
processing, namely the fusiform face area[336] and parahippocampal areas[337]
respectively.
5.4 Discussion
In this paper, we presented a framework for utilizing multi-subject fMRI data
to improve individual-level neural encoding. We showcased our approach on
both auditory and visual stimuli and demonstrated consistent improvement over
competing approaches. Our experiments further suggest that a single experiment
183
Figure 5.3: (A), (B) Correlations between predicted response of the proposed model
and true time series of each voxel averaged across subjects. Only significantly
predicted voxels are shown (p<0.05, FDR corrected). Dice matrices of predicted
versus true contrasts for (C) faces and (D) scenes stimuli. (E) & (F) depict
contrasts of two randomly selected subjects. ROIs are labelled from the HCP MMP
parcellation [264].
(free-viewing of movies) can characterize a multitude of brain processes at once.
This has important implications for brain mapping which traditionally relies on a
battery of carefully-constructed stimuli administered within block-designs. Inter-
subject variability in response patterns induced by the complexity of naturalistic
viewing can facilitate the development of novel imaging-based biomarkers. Neural
encoding models are not constrained to modeling the response to a limited set of
experimental stimuli; their good generalization performance suggests that they can
capture broad theories of cognitive processing. Accurate, individualized neural
184
encoding models can thus bring us one step closer to achieving the goal of biomarker
discovery.
185
CHAPTER 6
A COMPUTATIONAL STRATEGY TO RICHLY CHARACTERIZE
THE HUMAN VISUAL CORTEX UNDER NATURALISTIC
CONDITIONS
6.1 Introduction
How does the brain transform the raw, pixel-like input to the eyes to meaningful
features of our external environment and use it to guide perception and visually-
guided behaviors? Understanding the nature of representations and computations
in the visual system has been a longstanding goal in neuroscience. Large strides
have been made in this quest in early stages of the visual cortex by presenting
model organisms with simple, abstract stimuli like noise, oriented edges or sine-
wave gratings and studying the properties of evoked responses, such as the peak
of their tuning curves. These carefully crafted experiments, for instance, revealed
the presence of edge detectors in the primary visual cortex [338] and sparked
the search for the preferred stimuli or the ‘optimal input’ beyond early visual
areas across the visual cortical hierarchy. Over the last two decades, through
numerous experiments employing different visual stimulus sets, recording and
analysis techniques, a conceptual understanding has emerged that early visual areas
extract low-level features like edges or curves, mid-level regions extract complex
local shapes and high-level regions (at the furthest end of the visual hierarchy)
encode and represent different semantic categories, like faces or scenes [339, 340,
341, 342, 343]. Further, while the underlying representational transformations that
enable rapid, flexible and efficient visual perception remain largely unknown, there
is accumulating evidence that visual categorization occurs in the ventral temporal
186
cortex in humans [344]. This understanding of the visual system was enabled by
synthesizing findings across several different hypothesis-driven experimental studies,
most of them employing artificial stimuli and simple tasks atypical of real-word
vision, thus lacking ecological validity. Here, we propose a principled framework
based on computational models and out-of-sample generalization to bring together
findings from decades worth of study in human visual neuroscience and characterize
neural response properties under ‘ecological’ conditions systematically in an entirely
data-driven fashion. Rather than studying different components of visual processing
in isolation with simple, abstract stimuli, we focus on a sufficiently complex stimulus
ensemble to study all components at once. Throughout, we advocate an increased
use of naturalistic stimuli to develop rich models of visual processing and an
integrated approach to study visual processing in a natural context in the brain.
To better understand the neural processes underlying visual perception in
rich, naturalistic conditions, we require simplified versions of the system under
investigation that abstract away from the exact biological implementation details,
while keeping other functionalities of interest largely intact. Deep convolutional
neural networks (DCNNs) are a promising choice for modelling these brain regions
because the principles or elements with which they are built, namely hierarchical
structure and convolutions resembles the spatially local, tiled processing happening
in the visual cortex. Deep neural networks trained on image categorization have
already set new standards for predicting neural responses along the ventral visual
pathway in humans and non-human primates [243, 332]. High neural predictivity
of task-optimized systems offers important clues into information processing within
the brain; the observation that simply optimizing for goals like object recognition
can lead to the emergence of representations highly predictive of neural responses
across the ventral visual pathway suggests that activity in artificial neural networks
187
and biological networks could be aligned by shared computational goals and offers
an intriguing way to test computational theories of the mind [345]. However,
the use of these deep neural network models beyond response prediction and
comparing different representational models has been viewed with much skepticism
and promoted with reserve. Deep neural networks have been critiqued as being
‘black boxes’ and attempts to understand the brain using such models have been
widely likened to ‘replacing one uninterpretable complex network with another’.
Indeed, such models are notoriously difficult to understand as they often contain
millions of parameters as connection weights. Here, we posit that better predictivity
may not come at the cost of explanations and with powerful techniques, it may be
possible to both probe the learned structure within these deep learning models of
visual cortex and further also uncover the response selectivity of model neurons or
voxels.
There are several advantages of training a computational model directly on the
task of reproducing neural data rather than employing a task-optimized network
for the goal of understanding neuronal tuning properties. Encoding models fitted
directly to neural data with minimal apriori constraints, e.g., generic architecture
and random initialization, naturally extend themselves well to constraint-based
inferences since any set of features that emerge in these trained networks are
optimized to explain representations in the brain. In this study, we leverage recent
advances in large-scale fMRI data collection [346] to train computational models
directly on neural data with novel predictive precision. Importantly, the stimulus
set employed in this study contains crowded images of multiple common objects in
their natural contexts at varied viewpoints, thus being more typical of everyday
scenes and allowing us to characterize neural representations and computations
in rich, naturalistic conditions. As we move towards a more naturalistic neuro-
188
science amassing datasets with complex stimuli unamenable to parametrization,
we also need concomitant methodological advances to efficiently derive conceptual
understanding from this rich, high-dimensional data. Here, we further develop a
systematic methodological framework for using computational models to decipher
neuronal tuning properties. We illustrate how computational models can provide
a powerful, unifying framework for building broad, generalizable theories about
neural information processing under ecologically valid paradigms, focusing on the
particular domain of vision. Specifically, we interpret these encoding models to
reveal the selectivity of individual voxels along the ‘where’ and ‘what’ dimensions
to answer the following important questions about their response properties: What
portion of the input space is the neuronal population most sensitive to (the ’where’
dimension)? What features of stimuli do these voxels care about (the ’what’ dimen-
sion)? Finally, we also go beyond single-voxel analysis to analyzing the population
code and employ the learned distributed representations to characterize how visual
information representation evolves along the ventral visual pathway. This allows us
to probe the information encoded in fine-scaled patterns of activity in populations
of neurons in different visual cortical regions. One major advantage of computa-
tional models over direct examination of neural responses to pre-selected stimuli is
that once these models are trained, we have complete access to their connection
weights and predicted response for arbitrarily large stimulus sets. In this way,
models allow us to ‘play out’ experiments currently infeasible due to time, budget
or other constraints. Here, we capitalize on the high predictive accuracy of our
response-optimized models to use them as in-silico substitutes of fMRI experiments
and characterize the representational geometry on structured datasets and further
investigate the functional organization of the human ventral-occiptal cortex.
189
6.2 Materials and Methods
6.2.1 Natural Scenes Dataset
A detailed description of the Natural Scenes Dataset (NSD 1) is provided else-
where [346]. Here, we just briefly summarize the data acquisition and preprocessing
steps. The NSD dataset contains measurements of fMRI responses from 8 partici-
pants who each viewed 9,000–10,000 distinct color natural scenes (22,000–30,000
trials) over the course of 30–40 scan sessions. Scanning was conducted at 7T
using whole-brain gradient-echo EPI at 1.8-mm resolution and 1.6-s repetition
time. Images were taken from the Microsoft Common Objects in Context (COCO)
database cite Lin 2014, square cropped, and presented at a size of 8.4° x 8.4°.
A special set of 1,000 images were shared across subjects; the remaining images
were mutually exclusive across subjects. Images were presented for 3 s with 1-s
gaps in between images. Subjects fixated centrally and performed a long-term
continuous recognition task on the images. The fMRI data were pre-processed by
performing one temporal interpolation (to correct for slice time differences) and
one spatial interpolation (to correct for head motion). A general linear model was
then used to estimate single-trial beta weights. Cortical surface reconstructions
were generated using FreeSurfer, and both volume- and surface-based versions of
the beta weights were created. We focused on 5 visual cortical ROIs in the study.
Three ROIs belonging to the retinoptic early visual cortex, namely, V1, V2 and hV4
were defined using a prf localizer scan session for each subject. Two higher order
ROIs, namely ventral-occipital areas (VO1-2) and lateral-occipital areas (LO1-2)
were delineated using a popular visual probabilistic atlas [347].
1http://naturalscenesdataset.org
190
6.2.2 Response-optimized encoding model architecture
We trained separate voxel-level predictive models for each of the above regions
with the same backbone architecture. The predictive model comprises a shared
convolutional neural network core common across all subjects that represents
the feature space unique for specific visual areas. We employ a linear readout
model on top of the feature space to predict the responses of individual voxels
in a specific region of interest under the assumption that the feature space likely
represents the input received by these areas and these regions perform close-to-linear
transformations on this input. A linear readout on a shared feature space is further
based upon the often made assumption that the activity across a set of neurons or
voxels in one individual can be related to the activity of the second individual in
the homologous functional region by a linear transform [348]. Further, the linear
readout is also factorized into spatial and feature dimensions following popular
methods for neural system identification [349]. This allows us to separate spatial
tuning or receptive field locations (i.e., what portion of the sensory space is the
neuron population most sensitive to?) from feature tuning (i.e. what features
of the visual input is the population sensitive to?). The base feature extraction
network or the core thus performs all nonlinear transformations to convert the
raw sensory stimuli (i.e., pixels) into a representation characteristic of a particular
visual area, whereas the readout linearly maps this extracted representation into
voxel responses. The core consists of four sequential convolutional blocks, with
each block comprising the following feedforward computations: two convolutional
layers each followed by an inner batch norm and nonlinear activation (ReLU)
operations and an anti-aliased AvgPool operation at the end. Instead of regular
convolutions, we employ E(2)-steerable convolutions in all our models to extract
features independent of orientation [350]. This modeling choice is also inspired
191
by neural computations in early visual areas where it is hypothesized that groups
of neurons perform similar computations at different orientations, e.g., edge or
curve detection at different orientations. The readout contains all voxel-specific
parameters and maps the extracted representation to individual voxel responses.
Weights of the readout are a sum of outer products between a spatial filter and a
feature vector. The spatial filter further had a positivity constraint (enforced using
rectification) and was normalized independently for each voxel by dividing each
spatial weight by the square-root of the sum of squared spatial weights.
6.2.3 Training and testing models
Combined across all 4 subjects, the dataset comprises 37,000 natural scene images,
among which 1,000 images are shared across all subjects and the rest are mutually
exclusive. We used the 1,000 shared images for testing our models and split
the remaining stimulus set into 35,000 training and 2,000 validation images. All
parameters of the response-optimized model were optimized jointly to minimize the
mean squared error between the predicted and measured response. Models were
trained for a maximum of 100 epochs using Adam with a learning rate of 1e-4, a
batch size of 16 and early stopping (patience = 20) based on the Pearson’s correlation
coefficient between the predicted and measures responses on the validation set;
validation curves were monitored to ensure convergence. The proposed method
allows us to propagate errors through the shared network even if the subjects are
not exposed to common stimuli since we can always exclude the subjects/voxels for
which the response is not present from mean error calculation within each batch.
The shared network thus benefits from diverse, varying stimuli across subjects
with less extensive constraints on data collection from single subjects. We measure
192
Figure 6.1: Schematic and Quantitative Results. A shows the convolutional
neural network model with factorized readout and B depicts the 5 visual areas
considered in this study. Quantitative assessment of different models is shown as
a boxplot in C. D shows the count of voxels that are better predicted by each
model along with the difference in prediction accuracies (R). E shows the raw
prediction accuracy, as estimated by the Pearson’s correlation coefficient (R), across
the cortical flat map for all 4 subjects.
performance (‘predictive accuracy’) on the test images by computing the Pearson’s
correlation coefficient between the predicted and measured fMRI response at each
voxel.
193
Figure 6.2: Schematic of retinotopic parameters
6.2.4 Comparison against retinotopic measurements from
pRF-localizer scan
Since our models allows us to separate spatial selectivity from feature selectivity, we
first wanted to assess whether the spatial receptive field learned by these encoding
models on natural stimuli agrees with the spatial receptive field measured by
carefully controlled experiments that use artificial spatially modulated stimuli. The
population receptive field (pRF) is defined as the region of the visual field within
which a stimulus results in an increased aggregated activity across populations
of neurons, as reflected in fMRI measurements. Here, we detail the procedure
we followed for quantifying the agreement between the pRF parameters learned
by the encoding model against the parameters of the pRF model estimated from
an independent retinotopy experiment. Assuming that the receptive field is a
2D gaussian, retinotopic organization in the brain is typically defined using three
parameters that describe the pRF location in polar coordinates (polar angle,
eccentricity) and pRF size, as shown in Figure 6.2. We computed the pRF location
as the average of the coordinate mesh weighted by the learned spatial mask value
at each grid location. To compare the estimated polar angle values against the
corresponding measurements from pRF localizer scans, we adopted measures from
circular statistics. We note that we cannot employ the usual summary statistics or
194
correlation measures of linear statistics due to the circular nature of angular data.
The circular correlation coefficient between the measured and predicted polar angle
arrays in an ROI with n voxels, respectively denoted by {a1 , .., anm m} and {a1p, .., anp},
is then calculated as,
√ ∑∑ ni=1 sin(aim − Tm∑) sin(ai − Tp)r = pn
i=1 sin2(ai n 2 im − Tm) i=1 sin (ap − Tp)
where Tm and Tp are the circular mean angles of the measured and predicted polar
angle vectors,(respectively,∑ ∑ ) (1 n n ∑n ∑ )n
T = atan2 sin ai 1, cos ai , T = atan2 1 sin 1ai , cos aim m pn i=1 n
m p p
i=1 n i=1 n i=1
6.2.5 Representational Similarity Analysis
A complementary perspective to single neuron/voxel tuning is that ensembles of
neurons serve as functional units of the brain and that meaningful aspects of our
external world are not encoded in single neurons but in populations of neurons. So,
here we attempted to characterize what kinds of information is encoded in learned
patterns of activity in different visual areas through a popular analysis framework
called representational similarity analysis (RSA) [351]. Here, we correlated the
patterns of activity between multiple exemplars from different categories of stimuli
to obtain a representational similarity matrix (RSM) for each visual area which
highlights its representational geometry. Further, we quantified the separability of
categorical information within these similarity matrices by computing a correlation
coefficient (Kendall’s τ coefficient) between model RSMs against a ground truth
adjacency matrix defined by object category labels. Specifically, the elements of
this matrix are 0 if the corresponding two images belong to different categories and
1 if the images belong to the same category. This analysis was first performed using
195
a subset of images from the THINGs database [352] that belong to a pre-defined
set of categories. These categories were chosen in accordance with previous fMRI
studies employing RSA [353, 354] and include the following: {face, hand, elephant,
cat, plant, fruit, car, tool }. Subsequently, we also validated the obtained results
using other popular vision datasets like CIFAR10 and CIFAR100 [355].
6.3 Results
Response-optimized models achieve high neural predictivity across large
swathes of the visual cortex
First, we wanted to perform a quantitative assessement of our models. We observe
high prediction accuracy (Pearson’s R>0.6) across large swathes of the visual ROIs,
including the higher-order ROIs (LO1-2 and VO1-2) 6.1E. Next, we investigated
the possible quantitative advantages of learning a rotationally-symmetric feature
space against the representations from a standard convolutional core with no
weight sharing across orientations. Importantly, we observe that the rotationally-
symmetric convolutional core performs on par with the standard convolutional
core 6.1C,D, even outperforming the latter in certain visual ROIs, suggesting that
sharing weights across filter orientations potentially provides a strong inductive
bias, allowing us to fit more expressive models efficiently.
196
Figure 6.3: Quantifying the agreement between the measured prf eccen-
tricities and the prf eccentricities estimated from predictive computa-
tional models across different voxels. A Subject and ROI-specific scatter
plots depict predicted eccentricities against measured eccentricities. Pearson’s
correlation coefficient between the two quantities is displayed in blue in each scatter
plot. B Predicted and measured eccentricities across all early visual ROI voxels
displayed on the cortical surface for each subject.
197
Figure 6.4: Quantifying the agreement between the measured prf polar
angles and the prf polar angles estimated from predictive computational
models across different voxels. A Subject and ROI-specific scatter plots
depict predicted polar angles against measured polar angles. Pearson’s correlation
coefficient between the two quantities is displayed in blue in each scatter plot. B
Predicted and measured polar angles across all early visual ROI voxels displayed
on the cortical surface for each subject.
198
Figure 6.5: Quantifying the agreement between the measured prf sizes
and the prf sizes estimated from predictive computational models across
different voxels. A Subject and ROI-specific scatter plots depict predicted sizes
against measured sizes. Pearson’s correlation coefficient between the two quantities
is displayed in blue in each scatter plot. B Predicted and measured prf sizes across
all early visual ROI voxels displayed on the cortical surface for each subject.
199
Learned spatial masks reproduce the fine-grained retinotopic organiza-
tion of the early visual cortex
Next, we estimate the population receptive field from the learned spatial masks and
compare the estimated parameters against the measurements from prf localizer scan
session. It is important to note that the encoding models are entirely unconstrained
by anatomy. In fact, these models were trained with no knowledge about the
spatial proximity of different voxels, since the responses of voxels were modeled as
independent linearly weighted (and factorized) sums of the representation extracted
by the shared core network. Despite this, the estimated spatial topographic
organization from the encoding models exhibits remarkable levels of agreement
with fine-scale retinotopic organization of the cortex 6.36.46.5 without any explicit
supervision to do so. Previously, retinotopic organization of the human visual cortex
has been dominantly studied with spatially modulated stimuli. Here, we present an
alternate approach based on naturalistic stimulation and predictive models, without
any artificially crafted stimuli, to delineate subject-specific retinotopic maps.
Response-optimized models generalize remarkably well to new subjects
with low sample complexity
Next, we wanted to assess how each of these predictive models generalizes to the
remaining set of 4 subjects that were not used to train the model. For this analysis,
we train only the linear predictor while keeping the representation/core network
fixed and vary the amount of stimulus-response pairs from the new subjects to train
the readouts. As shown in Figure 6.6, the models can achieve a significantly high
accuracy even when the number of stimulus-response pairs is 500, i.e. nearly just
5% of the training examples used from original subjects to train the entire network.
200
Figure 6.6: Quantifying the generalization ability across subjects. (Left)
Prediction performance and (Right) Agreement between estimated and measured
retinotopic maps as a function of training examples (stimulus-response pairs) from
novel subjects.
This remarkable generalization of response-optimized networks to novel subjects
suggests that they contain strong “inductive biases” and are able to sufficiently
constrain the space of possible solutions in the right manner so that the model can
generalize to a novel subject with few samples. Furthermore, even with these small
training sets from novel subjects, we can characterize their retinotopic organization
remarkably well 6.6.
Data-driven models reveal the monotonically increasing separability of
category information along the ventral stream hierarchy
Task-optimized computational models have suggested that the ventral stream
hierarchy ‘untangles’ representations through a sequence of processing steps, so
that representations of objects and categories that are inseparable at the initial
processing stage (e.g., V1) become untangled at the last stage of the hierarchy [345,
356]. Here, we present an even stronger evidence for this increased separability
through hypothesis-free computational models. Previous studies have shown that
the representational geometry for the same set of stimuli varies systematically
across different cortical areas, providing a useful signature for how information
201
Figure 6.7: Spatial generalization matrices. Predicted response for a voxel in
one ROI is correlated with the measured response of every other voxel within the
same ROI (both within and across participants) to obtain a spatial generalization
matrix for every ROI. Blue lines mark subject boundaries. Strong diagonal structure
indicates that the predicted response for a voxel best matches the measured response
of the same voxel, indicating the ability of the models to capture voxel-level
idiosyncracies.
is transformed across different cortical processing streams [356]. Here, we em-
ploy the representational similarity analysis (RSA) framework to relate emergent
representations in different computational models of the brain against a ground
truth adjacency matrix defined by object category labels. We note that all models
retain signatures of individual voxel-level idiosyncracies as indicated by prominent
diagonal nature of the spatial generalization matrices 6.7; this enables us to perform
population-level analysis with these predictive models. Keeping the model architec-
ture the same, simply by changing the response targets for an encoding model from
voxels in visual areas V1 through LO, we observe a drastic change in the geometry
of the extracted representation. The increasingly prominent block-diagonal struc-
202
Figure 6.8: Separability of category information across the ventral visual
stream. A Matrices of all pairwise similarities between the representational
geometries in different visual ROIs. B Results of the Representational Similarity
Analysis (RSA) framework applied to several visual datasets (THINGS, CIFAR100
and CIFAR10) containing different categories of objects.
ture, consistent with categorical distinctions, in the RSMs along the ventral visual
stream highlights that the distributed patterns of activity to exemplars of the same
category become increasingly more similar and distributed patterns to exemplars
of different categories become progressively dissimilar along the processing stream,
strongly supporting the computational goal of the ventral visual stream as a visual
categorization system. We can also assess the separability of categorical information
in different visual areas more rigorously using quantitative metrics 6.8B. Here again,
we see that this finding regarding increasing separability of categorical information
holds across different datasets containing different visual categories.
Computational models reveal the functional organization within ventral-
occipital cortex
Next, we wanted to discover whether there exists a systematic structure in the
neural representations in the ventral-occipital region. Specifically, we built a low-
203
dimensional neural representational space by passing a large stimulus set (∼27,000
images from the THINGS database [352]) through the predictive models and then
performing Principal Components Analysis (PCA) on the predicted responses of
this network. Visualizing the images that elicit highest and lowest activation of
individual principal components (PCs) reveals a striking functional organization:
The first principal component, which explained ∼ 49% of the variance in response
strongly reflects the gradient for representing animate versus inanimate categories,
despite vast difference in the visual appearance of different animate categories (or
inanimate categories). The second principal component, accounting for ∼ 16%
variance, corresponded roughly to the curvature versus rectilinearity distinction.
These results provides a unified explanation for many previous findings regarding
the functional organization of the ventral visual stream, including the existence
of animate versus inanimate distinctions, and an axis for curved versus rectilinear
shapes [357].
6.4 Discussion
The field of cognitive neuroscience has been largely limited to testing one hypothesis
about cognition at a time, with task-based experimental designs coupled with
statistical inference techniques being the standard workhorses of the cognitive
neuroimaging paradigm. This traditional approach has been instrumental in
revealing how different brain regions respond to particular manipulations of mental
or perceptual functions. However, being overly restricted to studying components
of mental processing in isolation leads to paradigm-bound theories that often fail to
generalize outside the experimental circumstance they were based on. Hypothesis-
free computational models overcome this tradition of excessive reductionism by
204
Figure 6.9: A low-dimensional space characterizes the functional organi-
zation of the ventral-occipital region. A Most and least activating images for
the first two PCs. C Total explained variance as a function of the number of PCs.
D Pearson’s correlation coefficient between the domain-selectivity of individual
voxels against their projections onto the two PCs. E All images from the THINGS
dataset projected onto the first two principal dimensions of the response. F and
G Scatter plots depicting the domain-selectivity against the corresponding PC
projection for all VO voxels.
providing a general-purpose framework that abstracts away from the particulars
of the experimental approach and can be used to describe multiple experiments
at the same time. In this study, we asked: can a single model trained solely
on complex naturalistic images that mimic everyday experience, simultaneously
reproduce multiple neural phenomena characterized over the last two decades with
controlled fMRI studies? Our experiments gave an affirmative answer to this
question and this study yielded hypothesis-free computational models that mimic
several known properties of the ventral visual cortex. Importantly, we now have
models that can simultaneously account for many experiments. We have models
that can replicate the increasing separability of information and understanding
the mechanisms by which they achieve it can provide inspiration for computer
vision models. Further, we have complete access to the connection weights of these
205
models and predicted responses for arbitrarily large stimulus sets. These models
can thus serve as a source of mechanistic hypothesis about neural information
processing. Beyond facilitating an improved understanding of the ventral visual
pathway, these simplified versions of brain areas can also be used as exploratory
hypothesis-generation tools in subsequent studies.
Importantly, these networks are trained solely with supervision from brain
response prediction without any category labels, yet they can successfully capture
the linear separability of categorical information from human analogues of IT. One
potential reason why these response-optimized DNNs have not been previously
explored in the context of higher-order brain areas is because the highly complex
and invariant tuning of the IT features had been difficult to characterize directly
(task-driven networks being an exception) with smaller stimulus-response datasets.
In the present study, we capitalize on the natural variation in the rich Natural
Scenes Dataset to train these models directly on neural data and study how
different visual areas respond to different stimulus dimensions. With creative model
interpretability methods applied to deep neural network models of human vision,
we hope to realize their full potential in characterizing the precise function of
different brain areas, their input-output relationships when exposed to arbitrary
stimuli and the underlying computational mechanisms. The goal is to move beyond
the tradition of experimental reductionist approaches in cognitive neuroscience to
using a combination of computational models, particularly deep learning models,
and ethologically relevant naturalistic stimuli in order to understand the neural
encoding of sensory information. And so while cognitive neuroimaging has been
limited to testing one hypothesis at a time with controlled studies and artificial
stimuli, what we argue for here is that one could instead fit complex neural network
206
models to a rich set of ethologically relevant natural stimuli to understand how
different parts of a single model can simultaneously account for response to artificial
stimuli across many experiments. This approach would enable us to stitch together
disparate findings and understand how they may arise from a unified process.
Probing computational models of the brain imposes several challenges that may
lead to confounded conclusions and hypotheses about neural computations. The
most important confound is that model predictions can deputize for experimental
data only to the extent that they are ‘accurate’ and ultimately, all interpretative
analyses and conclusions rest on the predictive accuracy of the model. It is further
easy for models to learn trivial input-output dependencies without remaining
faithful to the the mechanism. Despite these limitations, computational models can
nonetheless provide novel testable hypotheses, which can be accepted or refuted
with future experimentation. Even if the hypothesis is refuted, model failures can
further drive model development and lead to the generation of improved hypothesis.
In this manner, a tight loop between modeling and experiments, beyond simple
offline analyses of neural data, can expedite neuroscientific discovery.
207
CHAPTER 7
LOOKING AHEAD
My research thus far has been focused on the broad themes of utilizing brain
imaging for under- standing cognition and making individual-level predictions of
clinical phenotypes. Broadly, to summarize my vision for future work, I hope
my research can answer questions of the kind that David Marr elegantly and
inspiringly formulated in his influential book Vision: What kind of an information
processing device is the brain? How is information from our environment robustly
transformed into a coherent percept of the world? What are the fundamental
principles underlying neural computations? While we have made significant strides
over the last couple of decades; to date, we have few satisfying answers for the
questions above and there is much to be learned about the function of different
brain areas, the means by which the function is achieved, how it comes into being
given the constraints faced by the system (over evolution or development) and how
disparate brain networks/regions collectively support complex human behavior. At
the end of the day, I hope my research can make significant contributions towards
understanding the broad general principles that can explain biological intelligence
and consequently inform artificial intelligence. Towards the fulfilment of this dream,
I envision the new data revolution in neuroscience, with large-scale compilation of
neural data and dissemination through open-source initiatives, to play a crucial
role in narrowing down the numerous (often incompatible) theories and hypotheses
about how the mind works.
208
BIBLIOGRAPHY
[1] B. Biswal et al. “Functional connectivity in the motor cortex of resting
human brain using echo-planar MRI”. In: Magn Reson Med 34.4 (Oct. 1995),
pages 537–541 (cited on pages 7, 9, 11).
[2] C. F. Beckmann et al. “Investigations into resting-state connectivity using
independent component analysis”. In: Philosophical Transactions of the
Royal Society B: Biological Sciences 360.1457 (May 2005), pages 1001–
1013. issn: 0962-8436. doi: 10.1098/rstb.2005.1634. url: http://
rstb.royalsocietypublishing.org/cgi/doi/10.1098/rstb.2005.1634
(cited on pages 8, 10, 11, 24, 31).
[3] D. Cordes et al. “Mapping functionally related regions of brain with func-
tional connectivity MR imaging”. In: AJNR Am J Neuroradiol 21.9 (Oct.
2000), pages 1636–1644 (cited on page 8).
[4] J. S. Damoiseaux et al. “Consistent resting-state networks across healthy
subjects”. In: Proc. Natl. Acad. Sci. U.S.A. 103.37 (Sept. 2006), pages 13848–
13853 (cited on pages 8, 11, 24, 31).
[5] M. De Luca et al. “fMRI resting state networks define distinct modes of
long-distance interactions in the human brain”. In: Neuroimage 29.4 (Feb.
2006), pages 1359–1367 (cited on pages 8, 11, 24).
[6] Nico UF Dosenbach et al. “Distinct brain networks for adaptive and stable
task control in humans”. In: Proceedings of the National Academy of Sciences
104.26 (2007), pages 11073–11078 (cited on page 8).
[7] Michael D Fox et al. “Spontaneous neuronal activity distinguishes human dor-
sal and ventral attention systems”. In: Proceedings of the National Academy
of Sciences 103.26 (2006), pages 10046–10051 (cited on page 8).
209
[8] Michelle Hampson et al. “Detection of functional connectivity using temporal
correlations in MR images”. In: Human brain mapping 15.4 (2002), pages 247–
262 (cited on page 8).
[9] Daniel S Margulies et al. “Mapping the functional connectivity of anterior
cingulate cortex”. In: Neuroimage 37.2 (2007), pages 579–588 (cited on
page 8).
[10] William W Seeley et al. “Dissociable intrinsic connectivity networks for
salience processing and executive control”. In: Journal of Neuroscience 27.9
(2007), pages 2349–2356 (cited on page 8).
[11] S. M. Smith et al. “Correspondence of the brain’s functional architecture
during activation and rest”. In: Proc. Natl. Acad. Sci. U.S.A. 106.31 (Aug.
2009), pages 13040–13045 (cited on pages 8, 11).
[12] Michael D Greicius et al. “Default-mode network activity distinguishes
Alzheimer’s disease from healthy aging: evidence from functional MRI”. In:
Proceedings of the National Academy of Sciences 101.13 (2004), pages 4637–
4642 (cited on page 8).
[13] K. Supekar et al. “Network analysis of intrinsic functional brain connectivity
in Alzheimer’s disease”. In: PLoS Comput. Biol. 4.6 (June 2008), e1000100
(cited on pages 8, 44).
[14] Yvette I Sheline and Marcus E Raichle. “Resting state functional connectivity
in preclinical Alzheimer’s disease”. In: Biological psychiatry 74.5 (2013),
pages 340–347 (cited on page 8).
[15] Daniel P Kennedy and Eric Courchesne. “The intrinsic functional orga-
nization of the brain is altered in autism”. In: Neuroimage 39.4 (2008),
pages 1877–1885 (cited on page 8).
210
[16] Christopher S Monk et al. “Abnormalities of intrinsic functional connectivity
in autism spectrum disorders”. In: Neuroimage 47.2 (2009), pages 764–772
(cited on page 8).
[17] Jocelyn V Hull et al. “Resting-state functional connectivity in autism spec-
trum disorders: A review”. In: Frontiers in psychiatry 7 (2017), page 205
(cited on page 8).
[18] Amit Anand et al. “Activity and connectivity of brain mood regulating
circuit in depression: a functional magnetic resonance study”. In: Biological
psychiatry 57.10 (2005), pages 1079–1088 (cited on page 8).
[19] Michael D Greicius et al. “Resting-state functional connectivity in major
depression: abnormally increased contributions from subgenual cingulate
cortex and thalamus”. In: Biological psychiatry 62.5 (2007), pages 429–437
(cited on page 8).
[20] Peter C Mulders et al. “Resting-state functional connectivity in major
depressive disorder: a review”. In: Neuroscience & Biobehavioral Reviews 56
(2015), pages 330–344 (cited on page 8).
[21] Meng Liang et al. “Widespread functional disconnectivity in schizophrenia
with resting-state functional magnetic resonance imaging”. In: Neuroreport
17.2 (2006), pages 209–213 (cited on page 8).
[22] Julia M Sheffield and Deanna M Barch. “Cognition and resting-state func-
tional connectivity in schizophrenia”. In: Neuroscience & Biobehavioral
Reviews 61 (2016), pages 108–120 (cited on page 8).
[23] R. C. Craddock et al. “Disease state prediction from resting state functional
connectivity”. In: Magn Reson Med 62.6 (Dec. 2009), pages 1619–1628 (cited
on pages 8, 11, 55).
211
[24] G. Chen et al. “Classification of Alzheimer disease, mild cognitive impairment,
and normal cognitive status with large-scale network analysis based on
resting-state functional MR imaging”. In: Radiology 259.1 (Apr. 2011),
pages 213–221 (cited on pages 8, 11, 54, 59).
[25] J. A. Nielsen et al. “Multisite functional connectivity MRI classification of
autism: ABIDE results”. In: Front Hum Neurosci 7 (2013), page 599 (cited
on pages 8, 11, 55, 63).
[26] Michael D Fox and Michael Greicius. “Clinical applications of resting state
functional connectivity”. In: Frontiers in systems neuroscience 4 (2010),
page 19 (cited on page 8).
[27] Michael Greicius. “Resting-state functional connectivity in neuropsychiatric
disorders”. In: Current opinion in neurology 21.4 (2008), pages 424–430
(cited on page 8).
[28] Dongyang Zhang and Marcus E Raichle. “Disease and the brain’s dark
energy”. In: Nature Reviews Neurology 6.1 (2010), page 15 (cited on page 8).
[29] D Cordes et al. “Resting-State Functional Connectivity Study using Inde-
pendent Component Analysis”. In: Proceedings ISMRM 1706 (1999). url:
https://cds.ismrm.org/ismrm-1999/PDF6/1706.pdf (cited on page 10).
[30] C. F. Beckmann and S. M. Smith. “Tensorial extensions of independent
component analysis for multisubject FMRI analysis”. In: Neuroimage 25.1
(Mar. 2005), pages 294–311 (cited on page 10).
[31] M. D. Greicius et al. “Functional connectivity in the resting brain: a network
analysis of the default mode hypothesis”. In: Proc. Natl. Acad. Sci. U.S.A.
100.1 (Jan. 2003), pages 253–258 (cited on page 11).
212
[32] M. Jung et al. “Default mode network in young male adults with autism
spectrum disorder: relationship with autism spectrum traits”. In: Mol Autism
5 (2014), page 35 (cited on page 11).
[33] D. Ongur et al. “Default mode network abnormalities in bipolar disorder
and schizophrenia”. In: Psychiatry Res 183.1 (July 2010), pages 59–68 (cited
on page 11).
[34] W. Koch et al. “Diagnostic power of default mode network resting state
fMRI in the detection of Alzheimer’s disease”. In: Neurobiol. Aging 33.3
(Mar. 2012), pages 466–478 (cited on page 11).
[35] D. Cordes et al. “Frequencies contributing to functional connectivity in the
cerebral cortex in ”resting-state” data”. In: AJNR Am J Neuroradiol 22.7
(Aug. 2001), pages 1326–1333 (cited on page 11).
[36] Raymond Salvador et al. “Neurophysiological architecture of functional
magnetic resonance images of human brain”. In: Cerebral Cortex 15.9 (2005),
pages 1332–2342. issn: 10473211. doi: 10.1093/cercor/bhi016 (cited on
pages 11, 29, 31).
[37] Aviv Mezer et al. “Cluster analysis of resting-state fMRI time series”. In:
NeuroImage 45.4 (May 2009), pages 1117–1125. issn: 10538119. doi: 10.
1016/j.neuroimage.2008.12.015. url: http://linkinghub.elsevier.
com/retrieve/pii/S1053811908012706 (cited on pages 11, 27, 28).
[38] Helmut Laufs et al. “EEG-correlated fMRI of human alpha activity.” In:
NeuroImage 19 4 (2003), pages 1463–76 (cited on page 11).
[39] Jessica S. Damoiseaux and Michael D. Greicius. “Greater than the sum of its
parts: a review of studies combining structural connectivity and resting-state
functional connectivity”. In: Brain Structure and Function 213.6 (Oct. 2009),
213
pages 525–533. issn: 1863-2661. doi: 10.1007/s00429-009-0208-6. url:
https://doi.org/10.1007/s00429-009-0208-6 (cited on page 11).
[40] Y. Nir et al. “Interhemispheric correlations of slow spontaneous neuronal
fluctuations revealed in human sensory cortex”. In: Nat. Neurosci. 11.9 (Sept.
2008), pages 1100–1108 (cited on page 11).
[41] C. Chang and G. H. Glover. “Time-frequency dynamics of resting-state
brain connectivity measured with fMRI”. In: Neuroimage 50.1 (Mar. 2010),
pages 81–98 (cited on page 12).
[42] Elena A. Allen et al. “Tracking whole-brain connectivity dynamics in the
resting state”. In: Cerebral Cortex 24.3 (2014), pages 663–676. issn: 10473211.
doi: 10.1093/cercor/bhs352 (cited on pages 12, 13, 22, 35–38).
[43] Diego Vidaurre, Stephen M. Smith, and Mark W. Woolrich. “Brain net-
work dynamics are hierarchically organized in time”. In: Proceedings of the
National Academy of Sciences 114.48 (2017), page 201705120. issn: 0027-
8424. doi: 10.1073/pnas.1705120114. arXiv: arXiv:1408.1149. url:
http://www.pnas.org/lookup/doi/10.1073/pnas.1705120114 (cited on
pages 13, 22, 35–38).
[44] H. Eavani et al. “Unsupervised learning of functional network dynamics in
resting state fMRI”. In: Inf Process Med Imaging 23 (2013), pages 426–437
(cited on pages 13, 22, 36, 38).
[45] J. M. Reinen et al. “The human cortex possesses a reconfigurable dynamic
network architecture that is disrupted in psychosis”. In: Nat Commun 9.1
(Mar. 2018), page 1157 (cited on page 13).
[46] Emily S. Finn et al. “Functional connectome fingerprinting: Identifying
individuals using patterns of brain connectivity”. In: Nature Neuroscience
214
18.11 (2015), pages 1664–1671. issn: 15461726. doi: 10.1038/nn.4135.
arXiv: 15334406. url: http://dx.doi.org/10.1038/nn.4135 (cited on
pages 13, 55, 59).
[47] N. Tzourio-Mazoyer et al. “Automated anatomical labeling of activations in
SPM using a macroscopic anatomical parcellation of the MNI MRI single-
subject brain”. In: Neuroimage 15.1 (Jan. 2002), pages 273–289 (cited on
pages 22, 23).
[48] Nora Leonardi et al. “Principal components of functional connectivity: A new
approach to study dynamic brain connectivity during rest”. In: NeuroImage
83 (2013), pages 937–950. issn: 10538119. doi: 10.1016/j.neuroimage.
2013.07.019. url: http://dx.doi.org/10.1016/j.neuroimage.2013.
07.019 (cited on pages 22, 37, 38).
[49] Nora Leonardi et al. “Disentangling dynamic networks: Separated and joint
expressions of functional connectivity patterns in time”. In: Human Brain
Mapping 35.12 (2014), pages 5984–5995. issn: 10970193. doi: 10.1002/hbm.
22599 (cited on pages 22, 37).
[50] R. Cameron Craddock et al. “A whole brain fMRI atlas generated via
spatially constrained spectral clustering”. In: Human Brain Mapping 33.8
(2012), pages 1914–1928. issn: 10659471. doi: 10.1002/hbm.21333. arXiv:
NIHMS150003 (cited on pages 23, 30, 31).
[51] V. D. Calhoun et al. “A method for making group inferences from functional
MRI data using independent component analysis”. In: Hum Brain Mapp
14.3 (Nov. 2001), pages 140–151 (cited on pages 25, 33).
215
[52] Beckmann CF et al. “Group Comparison of Resting-State FMRI Data Using
Multi-Subject ICA and Dual Regression”. In: Neuroimage 47 (July 2009).
doi: 10.1016/S1053-8119(09)71511-3 (cited on pages 25, 31, 33).
[53] Yuhui Du and Yong Fan. “Group information guided ICA for fMRI data
analysis”. In: NeuroImage 69 (2013), pages 157–197. doi: 10 . 1016 / j .
neuroimage . 2012 . 11 . 008. url: http : / / dx . doi . org / 10 . 1016 / j .
neuroimage.2012.11.008 (cited on pages 25, 31).
[54] G. Varoquaux et al. “A group model for stable multi-subject ICA on fMRI
datasets”. In: NeuroImage 51.1 (2010), pages 288–299. issn: 10538119. doi:
10.1016/j.neuroimage.2010.02.010. arXiv: 1006.2300. url: http:
//dx.doi.org/10.1016/j.neuroimage.2010.02.010 (cited on pages 25,
31).
[55] I. Daubechies et al. “Independent component analysis for brain fMRI does
not select for independence”. In: Proceedings of the National Academy of
Sciences 106.26 (2009), pages 10415–10422. issn: 0027-8424. doi: 10.1073/
pnas.0903525106. eprint: http://www.pnas.org/content/106/26/
10415.full.pdf. url: http://www.pnas.org/content/106/26/10415
(cited on page 26).
[56] G. Varoquaux et al. “Multi-subject dictionary learning to segment an atlas
of brain spontaneous activity”. In: Inf Process Med Imaging 22 (2011),
pages 562–573 (cited on pages 27, 31, 33).
[57] A. Abraham et al. “Extracting brain regions from rest fMRI with total-
variation constrained dictionary learning”. In: Med Image Comput Comput
Assist Interv 16.Pt 2 (2013), pages 607–615 (cited on pages 27, 35).
216
[58] J. Lv et al. “Holistic atlases of functional networks and interactions reveal
reciprocal organizational architecture of cortical function”. In: IEEE Trans
Biomed Eng 62.4 (Apr. 2015), pages 1120–1131 (cited on page 27).
[59] Yulia Golland et al. “Data-driven clustering reveals a fundamental subdivi-
sion of the human cortex into two global systems”. In: Neuropsychologia 46.2
(2008), pages 540–553. issn: 00283932. doi: 10.1016/j.neuropsychologia.
2007.10.003 (cited on pages 27, 28).
[60] M. H. Lee et al. “Clustering of resting state networks”. In: PLoS ONE 7.7
(2012), e40370 (cited on pages 27, 28).
[61] J. H. Kim et al. “Defining functional SMA and pre-SMA subregions in human
MFC using resting state fMRI: functional connectivity-based parcellation
method”. In: Neuroimage 49.3 (Feb. 2010), pages 2375–2386 (cited on
pages 27, 28).
[62] Polina Golland, Yulia Golland, and Rafael Malach. “Detection of Spatial Ac-
tivation Patterns as Unsupervised Segmentation of fMRI Data”. In: MICCAI
10 Pt 1 (2007), pages 110–8 (cited on pages 28, 29).
[63] B. T. Thomas Yeo et al. “The organization of the human cerebral cor-
tex estimated by intrinsic functional connectivity”. In: Journal of Neu-
rophysiology 106.3 (Sept. 2011), pages 1125–1165. issn: 0022-3077. doi:
10.1152/jn.00338.2011. url: http://www.physiology.org/doi/10.
1152/jn.00338.2011 (cited on pages 28, 31).
[64] Dietmar Cordes et al. “Hierarchical clustering to measure connectivity in
fMRI resting-state data”. In: Magnetic Resonance Imaging 20.4 (2002),
pages 305–317. issn: 0730725X. doi: 10.1016/S0730-725X(02)00503-9
(cited on pages 29, 31).
217
[65] T. Blumensath et al. “Spatially constrained hierarchical parcellation of the
brain with resting-state fMRI”. In: Neuroimage 76 (Aug. 2013), pages 313–
324 (cited on pages 29, 34).
[66] A. Abraham et al. “Deriving reproducible biomarkers from multi-site resting-
state data: An Autism-based example”. In: Neuroimage 147 (Feb. 2017),
pages 736–745 (cited on pages 29, 43, 55, 59, 63).
[67] B. Thirion et al. “Which fMRI clustering gives good brain parcellations?”
In: Front Neurosci 8 (2014), page 167 (cited on page 29).
[68] Yanlu Wang and Tie-Qiang Li. “Analysis of Whole-Brain Resting-State
fMRI Data Using Hierarchical Clustering Approach”. In: PLOS ONE 8.10
(Oct. 2013), pages 1–9. doi: 10.1371/journal.pone.0076315. url: https:
//doi.org/10.1371/journal.pone.0076315 (cited on page 29).
[69] Martijn van den Heuvel, Rene Mandl, and Hilleke Hulshoff Pol. “Normalized
cut group clustering of resting-state fMRI data”. In: PLoS ONE 3.4 (2008).
issn: 19326203. doi: 10.1371/journal.pone.0002001 (cited on page 30).
[70] X. Shen et al. “Groupwise whole-brain parcellation from resting-state fMRI
data for network node identification”. In: NeuroImage 82 (2013), pages 403–
415. issn: 10538119. doi: 10.1016/j.neuroimage.2013.05.081. arXiv:
NIHMS150003. url: http://dx.doi.org/10.1016/j.neuroimage.2013.
05.081 (cited on pages 30, 31).
[71] N. Honnorat et al. “GraSP: Geodesic Graph-based Segmentation with Shape
Priors for the functional parcellation of the cortex”. In: NeuroImage 106
(2015), pages 207–221. issn: 10959572. doi: 10.1016/j.neuroimage.2014.
11.008. arXiv: NIHMS150003. url: http://dx.doi.org/10.1016/j.
neuroimage.2014.11.008 (cited on page 30).
218
[72] M. Maier, U. von Luxburg, and M. Hein. “How the result of graph clustering
methods depends on the construction of the graph”. In: ArXiv e-prints (Feb.
2011). arXiv: 1102.2075 [stat.ML] (cited on page 30).
[73] Evan M. Gordon et al. “Generation and Evaluation of a Cortical Area
Parcellation from Resting-State Correlations”. In: Cerebral Cortex 26.1
(2016), pages 288–303. issn: 14602199. doi: 10.1093/cercor/bhu239 (cited
on page 32).
[74] Matthew F Glasser et al. “A multi-modal parcellation of human cerebral cor-
tex”. In: Nature Publishing Group 536 (2016). doi: 10.1038/nature18933.
url: http://balsa.wustl.edu/WN56. (cited on page 32).
[75] Alexander Schaefer et al. “Local-Global Parcellation of the Human Cerebral
Cortex from Intrinsic Functional Connectivity MRI”. In: Cerebral Cortex
(2017), pages 1–20. issn: 1047-3211. doi: 10.1093/cercor/bhx179. url:
https://academic.oup.com/cercor/article-lookup/doi/10.1093/
cercor/bhx179 (cited on page 32).
[76] R. Kong et al. “Spatial Topography of Individual-Specific Cortical Networks
Predicts Human Cognition, Personality, and Emotion”. In: Cereb. Cortex
(June 2018) (cited on page 33).
[77] Mehraveh Salehi et al. “An exemplar-based approach to individualized parcel-
lation reveals the need for sex specific functional networks”. In: NeuroImage
170 (2018), pages 54–67. issn: 10959572. doi: 10.1016/j.neuroimage.2017.
08.068. url: http://dx.doi.org/10.1016/j.neuroimage.2017.08.068
(cited on page 33).
[78] Jon Kleinberg. “An Impossibility Theorem for Clustering”. In: Proceedings
of the 15th International Conference on Neural Information Processing
219
Systems. NIPS’02. Cambridge, MA, USA: MIT Press, 2002, pages 463–470.
url: http://dl.acm.org/citation.cfm?id=2968618.2968676 (cited on
page 33).
[79] S. Arslan et al. “Human brain mapping: A systematic comparison of parcel-
lation methods for the human cerebral cortex”. In: Neuroimage 170 (Apr.
2018), pages 5–30 (cited on page 34).
[80] Mehraveh Salehi et al. “There is no single functional atlas even for a sin-
gle individual: Parcellation of the human brain is state dependent”. In:
bioRxiv (2018). doi: 10.1101/431833. eprint: https://www.biorxiv.
org / content / early / 2018 / 10 / 01 / 431833 . full . pdf. url: https :
//www.biorxiv.org/content/early/2018/10/01/431833 (cited on
page 34).
[81] E. Damaraju et al. “Dynamic functional connectivity analysis reveals tran-
sient states of dysconnectivity in schizophrenia”. In: Neuroimage Clin 5
(2014), pages 298–308 (cited on pages 35, 36, 38).
[82] B. Rashid et al. “Dynamic connectivity states estimated from resting fMRI
Identify differences among Schizophrenia, bipolar disorder, and healthy
control subjects”. In: Front Hum Neurosci 8 (2014), page 897 (cited on
pages 35, 36).
[83] A. D. Barber et al. “Dynamic Functional Connectivity States Reflecting
Psychotic-like Experiences”. In: Biol Psychiatry Cogn Neurosci Neuroimaging
3.5 (May 2018), pages 443–453 (cited on page 36).
[84] A. Abrol et al. “Replicability of time-varying connectivity patterns in large
resting state fMRI samples”. In: Neuroimage 163 (Dec. 2017), pages 160–176
(cited on page 36).
220
[85] C. Wang et al. “Spontaneous eyelid closures link vigilance fluctuation with
fMRI dynamic connectivity states”. In: Proc. Natl. Acad. Sci. U.S.A. 113.34
(Aug. 2016), pages 9653–9658 (cited on page 36).
[86] H. I. Suk et al. “State-space model with deep learning for functional dy-
namics estimation in resting-state fMRI”. In: Neuroimage 129 (Apr. 2016),
pages 292–307 (cited on pages 36, 37, 39).
[87] Lucy R. Chai et al. “Evolution of brain network dynamics in neurodevelop-
ment”. In: Network Neuroscience 1.1 (2017), pages 14–30 (cited on pages 37,
38).
[88] X. Li et al. “Dynamic functional connectomics signatures for characterization
and differentiation of PTSD patients”. In: Hum Brain Mapp 35.4 (Apr. 2014),
pages 1761–1778 (cited on pages 37, 38).
[89] V. D. Calhoun et al. “Exploring the psychosis functional connectome: aber-
rant intrinsic networks in schizophrenia and bipolar disorder”. In: Front
Psychiatry 2 (2011), page 75 (cited on page 39).
[90] E. Amico and J. Goni. “The quest for identifiability in human functional
connectomes”. In: Sci Rep 8.1 (May 2018), page 8254 (cited on page 39).
[91] H. Eavani et al. “Identifying Sparse Connectivity Patterns in the brain using
resting-state fMRI”. In: Neuroimage 105 (Jan. 2015), pages 286–299 (cited
on pages 39, 41).
[92] H. Eavani et al. “Discriminative sparse connectivity patterns for classification
of fMRI Data”. In: Med Image Comput Comput Assist Interv 17.Pt 3 (2014),
pages 193–200 (cited on page 39).
221
[93] Anqi Qiu et al. “Manifold learning on brain functional networks in aging”.
In: Medical Image Analysis 20.1 (2015), pages 52–60. issn: 13618423. doi:
10.1016/j.media.2014.10.006. url: http://dx.doi.org/10.1016/j.
media.2014.10.006 (cited on page 39).
[94] H. Shen et al. “Discriminative analysis of resting-state functional connectivity
patterns of schizophrenia using low dimensional embedding of fMRI”. In:
Neuroimage 49.4 (Feb. 2010), pages 3110–3121 (cited on pages 39, 41).
[95] X. Guo et al. “Diagnosing Autism Spectrum Disorder from Brain Resting-
State Functional Connectivity Patterns Using a Deep Neural Network with
a Novel Feature Selection Method”. In: Front Neurosci 11 (2017), page 460
(cited on page 39).
[96] Dumitru Erhan et al. “Why Does Unsupervised Pre-training Help Deep
Learning?” In: J. Mach. Learn. Res. 11 (Mar. 2010), pages 625–660. issn:
1532-4435. url: http : / / dl . acm . org / citation . cfm ? id = 1756006 .
1756025 (cited on page 39).
[97] A. S. Heinsfeld et al. “Identification of autism spectrum disorder using
deep learning and the ABIDE dataset”. In: Neuroimage Clin 17 (2018),
pages 16–23 (cited on pages 39, 41, 50).
[98] J. Kim et al. “Deep neural network with weight sparsity control and pre-
training extracts hierarchical features and enhances classification perfor-
mance: Evidence from whole-brain resting-state functional connectivity
patterns of schizophrenia”. In: Neuroimage 124.Pt A (Jan. 2016), pages 127–
146 (cited on pages 39, 41, 50, 55).
[99] Linli Xu et al. “Maximum Margin Clustering”. In: Advances in Neural
Information Processing Systems 17. Edited by L. K. Saul, Y. Weiss, and
222
L. Bottou. MIT Press, 2005, pages 1537–1544. url: http://papers.nips.
cc/paper/2602-maximum-margin-clustering.pdf (cited on page 40).
[100] Ling Li Zeng et al. “Unsupervised classification of major depression us-
ing functional connectivity MRI”. In: Human Brain Mapping 35.4 (2014),
pages 1630–1641. issn: 10970193. doi: 10 . 1002 / hbm . 22278 (cited on
pages 40, 41).
[101] A. T. Drysdale et al. “Resting-state connectivity biomarkers define neu-
rophysiological subtypes of depression”. In: Nat. Med. 23.1 (Jan. 2017),
pages 28–38 (cited on pages 40, 41).
[102] Kamalaker Dadi et al. “Benchmarking functional connectome-based predic-
tive models for resting-state fMRI”. working paper or preprint. June 2018.
url: https://hal.inria.fr/hal-01824205 (cited on pages 43, 72, 99).
[103] Gaël Varoquaux et al. “Brain Covariance Selection: Better Individual Func-
tional Connectivity Models Using Population Prior”. In: Proceedings of the
23rd International Conference on Neural Information Processing Systems
- Volume 2. NIPS’10. Vancouver, British Columbia, Canada: Curran Asso-
ciates Inc., 2010, pages 2334–2342. url: http://dl.acm.org/citation.
cfm?id=2997046.2997156 (cited on page 43).
[104] S. M. Smith et al. “Network modelling methods for FMRI”. In: Neuroimage
54.2 (Jan. 2011), pages 875–891 (cited on pages 43, 72).
[105] G. Varoquaux et al. “Detection of brain functional-connectivity difference
in post-stroke patients using group-level covariance modeling”. In: Med
Image Comput Comput Assist Interv 13.Pt 1 (2010), pages 200–208 (cited
on page 43).
223
[106] J. Richiardi et al. “Decoding brain states from fMRI connectivity graphs”.
In: Neuroimage 56.2 (May 2011), pages 616–626 (cited on page 43).
[107] A. Khazaee, A. Ebrahimzadeh, and A. Babajani-Feremi. “Identifying patients
with Alzheimer’s disease using resting-state fMRI and graph theory”. In:
Clin Neurophysiol 126.11 (Nov. 2015), pages 2132–2141 (cited on pages 44,
54).
[108] A. Lord et al. “Changes in community structure of resting state functional
connectivity in unipolar depression”. In: PLoS ONE 7.8 (2012), e41282
(cited on pages 44, 55).
[109] C. Z. Zhu et al. “Discriminative analysis of brain function at resting-state for
attention-deficit/hyperactivity disorder”. In: Med Image Comput Comput
Assist Interv 8.Pt 2 (2005), pages 468–475 (cited on page 44).
[110] M. Mennes et al. “Linking inter-individual differences in neural activation
and behavior to intrinsic brain dynamics”. In: Neuroimage 54.4 (Feb. 2011),
pages 2950–2959 (cited on page 44).
[111] T. Price et al. “Multiple-network classification of childhood autism using
functional connectivity dynamics”. In: Med Image Comput Comput Assist
Interv 17.Pt 3 (2014), pages 177–184 (cited on pages 44, 55).
[112] T. M. Madhyastha et al. “Dynamic connectivity at rest predicts attention
task performance”. In: Brain Connect 5.1 (Feb. 2015), pages 45–59 (cited
on page 44).
[113] Jiliang Tang, Salem Alelyani, and Huan Liu. “Feature Selection for Classi-
fication: A Review”. In: Data Classification: Algorithms and Applications.
2014 (cited on page 45).
224
[114] Francisco Jairo Soares Pereira, Tom Michael Mitchell, and Matthew
Botvinick. “Machine learning classifiers and fMRI: A tutorial overview”. In:
NeuroImage 45 (2009), s199–s209 (cited on page 45).
[115] Vladimir Vapnik and Olivier Chapelle. “Bounds on Error Expectation for
Support Vector Machines”. In: Neural Computation 12 (2000), pages 2013–
2036 (cited on page 45).
[116] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. “Multilayer feedfor-
ward networks are universal approximators”. In: Neural Networks 2.5 (1989),
pages 359–366. issn: 0893-6080. doi: https://doi.org/10.1016/0893-
6080(89)90020- 8. url: http://www.sciencedirect.com/science/
article/pii/0893608089900208 (cited on page 49).
[117] M. Khosla et al. “Ensemble learning with 3D convolutional neural networks
for connectome-based prediction”. In: ArXiv e-prints (Sept. 2018). arXiv:
1809.06219 [cs.CV] (cited on pages 50, 55).
[118] Joel Hestness et al. “Deep Learning Scaling is Predictable, Empirically”. In:
CoRR abs/1712.00409 (2017) (cited on page 52).
[119] Sayan Mukherjee et al. “Estimating Dataset Size Requirements for Classify-
ing DNA Microarray Data”. In: Journal of computational biology : a journal
of computational molecular cell biology 10 2 (2003), pages 119–42 (cited on
page 52).
[120] N. U. Dosenbach et al. “Prediction of individual brain maturity using fMRI”.
In: Science 329.5997 (Sept. 2010), pages 1358–1361 (cited on pages 54, 59).
[121] L. Wang et al. “Decoding lifespan changes of the human brain using resting-
state functional connectivity MRI”. In: PLoS ONE 7.8 (2012), e44530 (cited
on page 54).
225
[122] T. B. Meier et al. “Support vector machine classification and characterization
of age-related reorganization of functional brain networks”. In: Neuroimage
60.1 (Mar. 2012), pages 601–613 (cited on page 54).
[123] G. Ball et al. “Machine-learning to characterise neonatal functional connec-
tivity in the preterm brain”. In: Neuroimage 124.Pt A (Jan. 2016), pages 267–
275 (cited on page 54).
[124] E. Challis et al. “Gaussian process classification of Alzheimer’s disease and
mild cognitive impairment from resting-state fMRI”. In: Neuroimage 112
(May 2015), pages 232–243 (cited on page 54).
[125] C. Y. Wee et al. “Group-constrained sparse fMRI connectivity modeling
for mild cognitive impairment identification”. In: Brain Struct Funct 219.2
(Mar. 2014), pages 641–656 (cited on page 54).
[126] X. Chen et al. “High-order resting-state functional connectivity network for
MCI classification”. In: Hum Brain Mapp 37.9 (Sept. 2016), pages 3282–3296
(cited on page 54).
[127] B. Jie et al. “Topological graph kernel on multiple thresholded functional
connectivity networks for mild cognitive impairment classification”. In: Hum
Brain Mapp 35.7 (July 2014), pages 2876–2897 (cited on page 54).
[128] C. Y. Wee et al. “Sparse temporally dynamic resting-state functional con-
nectivity networks for early MCI identification”. In: Brain Imaging Behav
10.2 (June 2016), pages 342–356 (cited on page 54).
[129] D. Long et al. “Automatic classification of early Parkinson’s disease with
multi-modal MR imaging”. In: PLoS ONE 7.11 (2012), e47714 (cited on
page 54).
226
[130] R. C. Welsh, L. M. Jelsone-Swain, and B. R. Foerster. “The utility of
independent component analysis and machine learning in the identification
of the amyotrophic lateral sclerosis diseased brain”. In: Front Hum Neurosci
7 (2013), page 251 (cited on page 54).
[131] A. Venkataraman et al. “Whole brain resting state functional connectivity
abnormalities in schizophrenia”. In: Schizophr. Res. 139.1-3 (Aug. 2012),
pages 7–12 (cited on page 55).
[132] D. S. Bassett et al. “Altered resting state complexity in schizophrenia”. In:
Neuroimage 59.3 (Feb. 2012), pages 2196–2207 (cited on pages 55, 59).
[133] Y. Fan et al. “Discriminant analysis of functional connectivity patterns on
Grassmann manifold”. In: Neuroimage 56.4 (June 2011), pages 2058–2067
(cited on page 55).
[134] L. L. Zeng et al. “Identifying major depression using whole-brain functional
connectivity: a multivariate pattern analysis”. In: Brain 135.Pt 5 (May
2012), pages 1498–1507 (cited on page 55).
[135] A. Eloyan et al. “Automated diagnoses of attention deficit hyperactive
disorder using magnetic resonance imaging”. In: Front Syst Neurosci 6
(2012), page 61 (cited on page 55).
[136] D. A. Fair et al. “Distinct neural signatures detected for ADHD subtypes
after controlling for micro-movements in resting state functional connectivity
MRI data”. In: Front Syst Neurosci 6 (2012), page 80 (cited on page 55).
[137] F. Liu et al. “Multivariate classification of social anxiety disorder using
whole brain functional connectivity”. In: Brain Struct Funct 220.1 (Jan.
2015), pages 101–115 (cited on page 55).
227
[138] Q. Gong et al. “Quantitative prediction of individual psychopathology in
trauma survivors using resting-state FMRI”. In: Neuropsychopharmacology
39.3 (Feb. 2014), pages 681–687 (cited on page 55).
[139] B. J. Harrison et al. “Altered corticostriatal functional connectivity in
obsessive-compulsive disorder”. In: Arch. Gen. Psychiatry 66.11 (Nov. 2009),
pages 1189–1200 (cited on page 55).
[140] S. Mueller et al. “Individual variability in functional connectivity architecture
of the human brain”. In: Neuron 77.3 (Feb. 2013), pages 586–595 (cited on
page 55).
[141] M. D. Rosenberg et al. “A neuromarker of sustained attention from whole-
brain functional connectivity”. In: Nat. Neurosci. 19.1 (Jan. 2016), pages 165–
171 (cited on page 55).
[142] J. S. Siegel et al. “Disruptions of network connectivity predict impairment in
multiple behavioral domains after stroke”. In: Proc. Natl. Acad. Sci. U.S.A.
113.30 (July 2016), E4367–4376 (cited on pages 55, 56, 59).
[143] D. E. Meskaldji et al. “Prediction of long-term memory scores in MCI based
on resting-state fMRI”. In: Neuroimage Clin 12 (2016), pages 785–795 (cited
on pages 55, 56).
[144] D. C. Jangraw et al. “A functional connectivity-based neuromarker of sus-
tained attention generalizes to predict recall in a reading task”. In: Neu-
roimage 166 (Feb. 2018), pages 99–109 (cited on page 55).
[145] W. T. Hsu et al. “Resting-state functional connectivity predicts neuroticism
and extraversion in novel individuals”. In: Soc Cogn Affect Neurosci 13.2
(Feb. 2018), pages 224–232 (cited on page 56).
228
[146] A. D. Nostro et al. “Predicting personality from network-based resting-
state functional connectivity”. In: Brain Struct Funct 223.6 (July 2018),
pages 2699–2719 (cited on page 56).
[147] E. Tagliazucchi et al. “Automatic sleep staging using fMRI functional con-
nectivity data”. In: Neuroimage 63.1 (Oct. 2012), pages 63–72 (cited on
pages 56, 59).
[148] E. Tagliazucchi and H. Laufs. “Decoding wakefulness levels from typical
fMRI resting-state data reveals reliable drifts between wakefulness and sleep”.
In: Neuron 82.3 (May 2014), pages 695–708 (cited on pages 56, 59).
[149] X. J. Dai et al. “Long-term total sleep deprivation decreases the default
spontaneous activity and connectivity pattern in healthy male subjects: a
resting-state fMRI study”. In: Neuropsychiatr Dis Treat 11 (2015), pages 761–
772 (cited on page 57).
[150] Y. Zhu et al. “Increased interhemispheric resting-state functional connectivity
after sleep deprivation: a resting-state fMRI study”. In: Brain Imaging Behav
10.3 (Sept. 2016), pages 911–919 (cited on page 57).
[151] B. T. Yeo, J. Tandi, and M. W. Chee. “Functional connectivity during rested
wakefulness predicts vulnerability to sleep deprivation”. In: Neuroimage 111
(May 2015), pages 147–158 (cited on page 57).
[152] T. Ge et al. “Heritability analysis with repeat measurements and its appli-
cation to resting-state functional connectivity”. In: Proc. Natl. Acad. Sci.
U.S.A. 114.21 (May 2017), pages 5521–5526 (cited on page 57).
[153] O. Miranda-Dominguez et al. “Heritability of the human connectome: A
connectotyping study”. In: Netw Neurosci 2.2 (2018), pages 175–199 (cited
on pages 57, 59).
229
[154] I. Tavor et al. “Task-free MRI predicts individual differences in brain activity
during task performance”. In: Science 352.6282 (Apr. 2016), pages 216–220
(cited on pages 58, 59).
[155] O. Parker Jones et al. “Resting connectivity predicts task activation in
pre-surgical populations”. In: Neuroimage Clin 13 (2017), pages 378–385
(cited on page 58).
[156] F. Abdelnour, H. U. Voss, and A. Raj. “Network diffusion accurately mod-
els the relationship between structural and functional brain connectivity
networks”. In: Neuroimage 90 (Apr. 2014), pages 335–347 (cited on page 58).
[157] F. Deligianni et al. “A probabilistic framework to infer brain functional
connectivity from anatomical connections”. In: Inf Process Med Imaging 22
(2011), pages 296–307 (cited on page 58).
[158] A. Venkataraman et al. “Joint modeling of anatomical and functional con-
nectivity for population studies”. In: IEEE Trans Med Imaging 31.2 (Feb.
2012), pages 164–182 (cited on page 58).
[159] C. Dansereau et al. “Statistical power and prediction accuracy in multisite
resting-state fMRI connectivity”. In: Neuroimage 149 (Apr. 2017), pages 220–
232 (cited on page 63).
[160] K. R. Van Dijk, M. R. Sabuncu, and R. L. Buckner. “The influence of head
motion on intrinsic functional connectivity MRI”. In: Neuroimage 59.1 (Jan.
2012), pages 431–438 (cited on page 64).
[161] G. Varoquaux. “Cross-validation failure: Small sample sizes lead to large
error bars”. In: Neuroimage 180.Pt A (Oct. 2018), pages 68–77 (cited on
page 64).
230
[162] T. Wolfers et al. “From estimating activation locality to predicting disorder:
A review of pattern recognition for neuroimaging-based psychiatric diagnos-
tics”. In: Neurosci Biobehav Rev 57 (Oct. 2015), pages 328–349 (cited on
page 64).
[163] Vince D. Calhoun and Tülay Adali. “Multisubject Independent Component
Analysis of fMRI: A Decade of Intrinsic Networks, Default Mode, and
Neurodiagnostic Discovery”. In: IEEE Reviews in Biomedical Engineering 5
(2012), pages 60–73 (cited on page 66).
[164] Simon B. Eickhoff, B. T. Thomas Yeo, and Sarah Genon. “Imaging-based
parcellations of the human brain”. In: Nature Reviews Neuroscience 19
(2018), pages 672–686 (cited on page 66).
[165] R. Matthew Hutchison et al. “Dynamic functional connectivity: Promise,
issues, and interpretations”. In: NeuroImage 80 (2013), pages 360–378 (cited
on page 66).
[166] Vince D. Calhoun et al. “The Chronnectome: Time-Varying Connectivity
Networks as the Next Frontier in fMRI Data Discovery”. In: Neuron 84
(2014), pages 262–274 (cited on page 66).
[167] Maria Giulia Preti, Thomas A. W. Bolton, and Dimitri Van De Ville.
“The dynamic functional connectome: State-of-the-art and perspectives”. In:
NeuroImage 160 (2017), pages 41–54 (cited on page 66).
[168] Daniel J. Lurie et al. “On the Nature of Resting Fmri and Time-varying
Functional Connectivity.” In: PsyArXiv (2018). doi: 10.31234/osf.io/
xtzre (cited on page 66).
231
[169] Michael D. Fox and Michael D. Greicius. “Clinical Applications of Resting
State Functional Connectivity”. In: Front. Syst. Neurosci. 2010 (cited on
page 66).
[170] Mohammad Arbabshirani et al. “Single subject prediction of brain disor-
ders in neuroimaging: Promises and pitfalls”. In: NeuroImage 145 (2017),
pages 137–165 (cited on page 66).
[171] M. Plitt, K. A. Barnes, and A. Martin. “Functional connectivity classifica-
tion of autism identifies highly predictive brain features but falls short of
biomarker standards”. In: Neuroimage Clin 7 (2015), pages 359–366 (cited
on pages 72, 86).
[172] M. Mennes et al. “Resting state functional connectivity correlates of in-
hibitory control in children with attention-deficit/hyperactivity disorder”.
In: Front Psychiatry 2 (2011), page 83 (cited on page 72).
[173] G. Varoquaux et al. “Detection of brain functional-connectivity difference
in post-stroke patients using group-level covariance modeling”. In: Med
Image Comput Comput Assist Interv 13.Pt 1 (2010), pages 200–208 (cited
on page 72).
[174] Colin J. Brown and Ghassan Hamarneh. “Machine Learning on Human
Connectome Data from MRI”. In: CoRR 1611.08699 (2016). arXiv: 1611.
08699. url: http://arxiv.org/abs/1611.08699 (cited on page 72).
[175] M. Kaiser. “A Tutorial in Connectome Analysis: Topological and Spatial
Features of Brain Networks”. In: ArXiv e-prints (May 2011). arXiv: 1105.
4705 [q-bio.NC] (cited on page 72).
232
[176] Aaron Alexander-Bloch et al. “The anatomical distance of functional con-
nections predicts brain network topology in health and schizophrenia.” In:
Cerebral cortex 23 1 (2013), pages 127–38 (cited on page 72).
[177] Zhijun Yao et al. “A review of structural and functional brain networks:
small world and atlas”. In: Brain Informatics 2.1 (Mar. 2015), pages 45–
52. issn: 2198-4018. doi: 10 . 1007 / s40708 - 015 - 0009 - z. url: http :
//link.springer.com/10.1007/s40708-015-0009-z (cited on page 72).
[178] Bruce Fischl et al. “Whole brain segmentation: automated labeling of
neuroanatomical structures in the human brain”. In: Neuron 33.3 (2002),
pages 341–355 (cited on page 72).
[179] Matthew F Glasser et al. “A multi-modal parcellation of human cerebral
cortex”. In: Nature 536.7615 (2016), pages 171–178 (cited on page 72).
[180] Simon B Eickhoff et al. “Connectivity-based parcellation: Critique and
implications”. In: Human brain mapping 36.12 (2015), pages 4771–4792
(cited on page 72).
[181] Salim Arslan et al. “Human brain mapping: A systematic comparison of
parcellation methods for the human cerebral cortex”. In: NeuroImage 170
(2018). Segmenting the Brain, pages 5–30. issn: 1053-8119. doi: https:
//doi.org/10.1016/j.neuroimage.2017.04.014. url: http://www.
sciencedirect.com/science/article/pii/S1053811917303026 (cited
on pages 72, 100).
[182] Paul A Yushkevich et al. “Quantitative comparison of 21 protocols for
labeling hippocampal subfields and parahippocampal subregions in in vivo
MRI: towards a harmonized segmentation protocol”. In: Neuroimage 111
(2015), pages 526–541 (cited on page 72).
233
[183] Gaël Varoquaux et al. “Multi-subject dictionary learning to segment an atlas
of brain spontaneous activity”. In: Biennial International Conference on
Information Processing in Medical Imaging. Springer. 2011, pages 562–573
(cited on page 72).
[184] B. T. Thomas Yeo et al. “The organization of the human cerebral cortex
estimated by intrinsic functional connectivity”. In: Journal of Neurophysi-
ology 106.3 (2011). PMID: 21653723, pages 1125–1165. doi: 10.1152/jn.
00338.2011. eprint: https://doi.org/10.1152/jn.00338.2011. url:
https://doi.org/10.1152/jn.00338.2011 (cited on pages 72, 78, 101).
[185] Bertrand Thirion et al. “Which fMRI clustering gives good brain parcella-
tions?” In: Frontiers in neuroscience 8 (2014), page 167 (cited on page 72).
[186] Alexandre Abraham et al. “Deriving reproducible biomarkers from multi-
site resting-state data: An Autism-based example”. In: NeuroImage 147
(2017), pages 736–745. issn: 1053-8119. doi: https://doi.org/10.1016/
j.neuroimage.2016.10.045. url: http://www.sciencedirect.com/
science/article/pii/S1053811916305924 (cited on pages 73, 74, 86, 95,
96, 99).
[187] Sofia Ira Ktena et al. “Metric learning with spectral graph convolutions on
brain connectivity networks”. In: NeuroImage 169 (2018), pages 431–442.
issn: 1053-8119. doi: https://doi.org/10.1016/j.neuroimage.2017.
12.052. url: http://www.sciencedirect.com/science/article/pii/
S1053811917310765 (cited on page 74).
[188] Jeremy Kawahara et al. “BrainNetCNN: Convolutional Neural Networks
for Brain Networks; Towards Predicting Neurodevelopment”. In: 146 (Sept.
2016) (cited on pages 74, 83, 84).
234
[189] Vladimir L. Cherkassky et al. “Functional connectivity in a baseline resting-
state network in autism.” In: Neuroreport 17 16 (2006), pages 1687–90 (cited
on page 74).
[190] Michal Assaf et al. “Abnormal functional connectivity of default mode
sub-networks in autism spectrum disorder patients”. In: 53 (Oct. 2010),
pages 247–56 (cited on page 74).
[191] Christopher S. Monk et al. “Abnormalities of intrinsic functional connectivity
in autism spectrum disorders,” in: NeuroImage 47.2 (2009), pages 764–772.
issn: 1053-8119. doi: https://doi.org/10.1016/j.neuroimage.2009.
04.069. url: http://www.sciencedirect.com/science/article/pii/
S1053811909004327 (cited on page 74).
[192] Anibal Sólon Heinsfeld et al. “Identification of autism spectrum disorder
using deep learning and the ABIDE dataset”. In: NeuroImage: Clinical. 2018
(cited on pages 74, 87, 260).
[193] N Yahata et al. “A small number of abnormal brain connections predicts
adult autism spectrum disorder”. In: Nature Communications 7 (2016). doi:
http://dx.doi.org/10.1038/ncomms11254 (cited on page 74).
[194] Adriana Di Martino et al. “Enhancing studies of the connectome in autism
using the autism brain imaging data exchange II”. In: Scientific data 4 (Mar.
2017), page 170010. issn: 2052-4463. doi: 10.1038/sdata.2017.10. url:
http://europepmc.org/articles/PMC5349246 (cited on pages 74, 75).
[195] Cameron Craddock et al. “The Neuro Bureau Preprocessing Initiative: open
sharing of preprocessed neuroimaging data and derivative”. In: Frontiers in
Neuroinformatics 41 (2013). issn: 1662-5196. doi: 10.3389/conf.fninf.
235
2013.09.00041. url: http://www.frontiersin.org/neuroinformatics/
10.3389/conf.fninf.2013.09.00041/full (cited on page 75).
[196] Jonathan D. Power et al. “Methods to detect, characterize, and remove
motion artifact in resting state fMRI”. In: NeuroImage 84 (Jan. 2014),
pages 320–341. issn: 10538119. doi: 10 . 1016 / j . neuroimage . 2013 .
08 . 048. url: http : / / www . ncbi . nlm . nih . gov / pubmed / 23994314 %
20http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=
PMC3849338 % 20http : / / linkinghub . elsevier . com / retrieve / pii /
S1053811913009117 (cited on pages 76, 98).
[197] John Muschelli et al. “Reduction of motion-related artifacts in resting state
fMRI using aCompCor”. In: 96 (Mar. 2014) (cited on page 76).
[198] Jean A. Frazier et al. “Structural brain magnetic resonance imaging of limbic
and thalamic volumes in pediatric bipolar disorder.” In: The American
journal of psychiatry 162 7 (2005) (cited on page 77).
[199] J. M. Goldstein et al. “Hypothalamic abnormalities in schizophrenia: sex
effects and genetic vulnerability”. In: Biol. Psychiatry 61.8 (Apr. 2007),
pages 935–945 (cited on page 77).
[200] N. Makris et al. “Decreased volume of left and total anterior insular lobule
in schizophrenia”. In: Schizophr. Res. 83.2-3 (Apr. 2006), pages 155–171
(cited on page 77).
[201] C. D. Smyser et al. “Prediction of brain maturity in infants using machine-
learning algorithms”. In: Neuroimage 136 (Aug. 2016), pages 1–9 (cited on
page 77).
236
[202] R. S. Desikan et al. “An automated labeling system for subdividing the
human cerebral cortex on MRI scans into gyral based regions of interest”.
In: Neuroimage 31.3 (July 2006), pages 968–980 (cited on page 77).
[203] N. Tzourio-Mazoyer et al. “Automated anatomical labeling of activations in
SPM using a macroscopic anatomical parcellation of the MNI MRI single-
subject brain”. In: Neuroimage 15.1 (Jan. 2002), pages 273–289 (cited on
pages 77, 108, 112).
[204] Craddock Cameron et al. “A whole brain fMRI atlas generated via spa-
tially constrained spectral clustering”. In: Human Brain Mapping 33.8
(2011), pages 1914–1928. doi: 10 . 1002 / hbm . 21333. eprint: https : / /
onlinelibrary.wiley.com/doi/pdf/10.1002/hbm.21333. url: https:
//onlinelibrary.wiley.com/doi/abs/10.1002/hbm.21333 (cited on
page 77).
[205] Jack L. Lancaster et al. “Automated Talairach Atlas labels for functional
brain mapping”. In: Human Brain Mapping 10.3 (2000), pages 120–131
(cited on page 77).
[206] S. B. Eickhoff et al. “A new SPM toolbox for combining probabilistic
cytoarchitectonic maps and functional imaging data”. In: Neuroimage 25.4
(May 2005), pages 1325–1335 (cited on page 77).
[207] Markus D Schirmer. “Developing Brain Connectivity - Effects of Parcellation
Scale on Network Analysis in Neonates (Doctoral dissertation, King’s College
London)”. In: (2015). url: https://kclpure.kcl.ac.uk/portal/ (cited
on pages 77, 93).
[208] Sergey Ioffe and Christian Szegedy. “Batch Normalization: Accelerating
Deep Network Training by Reducing Internal Covariate Shift”. In: CoRR
237
abs/1502.03167 (2015). arXiv: 1502.03167. url: http://arxiv.org/abs/
1502.03167 (cited on page 82).
[209] K. Simonyan, A. Vedaldi, and A. Zisserman. “Deep Inside Convolutional
Networks: Visualising Image Classification Models and Saliency Maps”. In:
ArXiv e-prints (Dec. 2013). arXiv: 1312.6034 [cs.CV] (cited on page 84).
[210] Amanda V Utevsky, David V Smith, and Scott A Huettel. “Precuneus is a
functional core of the default-mode network”. In: Journal of Neuroscience
34.3 (2014), pages 932–940 (cited on page 92).
[211] Takamitsu Watanabe et al. “Diminished medial prefrontal activity behind
autistic social judgments of incongruent information”. In: PloS one 7.6
(2012), e39561 (cited on page 92).
[212] Hideya Koshino et al. “Functional connectivity in an fMRI working memory
task in high-functioning autism”. In: Neuroimage 24.3 (2005), pages 810–821
(cited on page 92).
[213] Patricia A Reuter-Lorenz et al. “Age differences in the frontal lateralization
of verbal and spatial working memory revealed by PET”. In: Journal of
cognitive neuroscience 12.1 (2000), pages 174–187 (cited on page 92).
[214] R. Cameron Craddock et al. “A whole brain fMRI atlas generated via
spatially constrained spectral clustering”. In: Human Brain Mapping 33.8
(Aug. 2012), pages 1914–1928. issn: 10659471. doi: 10.1002/hbm.21333.
url: http://doi.wiley.com/10.1002/hbm.21333 (cited on page 93).
[215] A. Fornito, A. Zalesky, and E. T. Bullmore. “Network scaling effects in
graph analytic studies of human resting-state FMRI data”. In: Front Syst
Neurosci 4 (2010), page 22 (cited on page 93).
238
[216] A. Zalesky et al. “Whole-brain anatomical networks: does the choice of
nodes matter?” In: Neuroimage 50.3 (Apr. 2010), pages 970–983 (cited on
pages 93, 95).
[217] R. Kong et al. “Spatial Topography of Individual-Specific Cortical Networks
Predicts Human Cognition, Personality, and Emotion”. In: Cereb. Cortex
(June 2018) (cited on page 93).
[218] B. Da Mota et al. “Enhancing the Reproducibility of Group Analysis
with Randomized Brain Parcellations”. In: Medical Image Computing and
Computer-Assisted Intervention - MICCAI 2013. Lecture Notes in Computer
Science, vol 8150. Springer, Berlin, Heidelberg (cited on pages 94, 100).
[219] C. P. Chen et al. “Diagnostic classification of intrinsic functional connectivity
highlights somatosensory, default mode, and visual regions in autism”. In:
Neuroimage Clin 8 (2015), pages 238–245 (cited on page 96).
[220] V. Menon. “Developmental pathways to functional brain networks: emerging
principles”. In: Trends Cogn. Sci. (Regul. Ed.) 17.12 (Dec. 2013), pages 627–
640 (cited on page 97).
[221] Theodore D. Satterthwaite et al. “Impact of in-scanner head motion on multi-
ple measures of functional connectivity: Relevance for studies of neurodevelop-
ment in youth”. In: NeuroImage 60.1 (2012), pages 623–632. issn: 1053-8119.
doi: https://doi.org/10.1016/j.neuroimage.2011.12.063. url: http:
//www.sciencedirect.com/science/article/pii/S1053811911014650
(cited on page 97).
[222] Damien Fair et al. “Distinct neural signatures detected for ADHD sub-
types after controlling for micro-movements in resting state functional
connectivity MRI data”. In: Frontiers in Systems Neuroscience 6 (2013),
239
page 80. issn: 1662-5137. doi: 10.3389/fnsys.2012.00080. url: https:
//www.frontiersin.org/article/10.3389/fnsys.2012.00080 (cited on
page 97).
[223] Koene RA Van Dijk, Mert R Sabuncu, and Randy L Buckner. “The influence
of head motion on intrinsic functional connectivity MRI”. In: Neuroimage
59.1 (2012), pages 431–438 (cited on page 97).
[224] Patric Hagmann et al. “Mapping the Structural Core of Human Cerebral
Cortex”. In: PLOS Biology 6.7 (July 2008), pages 1–15. doi: 10.1371/
journal.pbio.0060159. url: https://doi.org/10.1371/journal.pbio.
0060159 (cited on page 99).
[225] Luisa M. Zintgraf et al. “Visualizing Deep Neural Network Decisions: Pre-
diction Difference Analysis”. In: CoRR abs/1702.04595 (2017). arXiv: 1702.
04595. url: http://arxiv.org/abs/1702.04595 (cited on page 101).
[226] Ramprasaath R. Selvaraju et al. “Grad-CAM: Why did you say that? Visual
Explanations from Deep Networks via Gradient-based Localization”. In:
CoRR abs/1610.02391 (2016). arXiv: 1610.02391. url: http://arxiv.
org/abs/1610.02391 (cited on page 101).
[227] Khosla et al. “Machine learning in resting-state fMRI analysis”. In: arXiv
preprint arXiv:1812.11477 (2018) (cited on page 103).
[228] Lixia Tian et al. “Changes in dynamic functional connections with aging”.
In: Neuroimage 172 (2018), pages 31–39 (cited on page 103).
[229] Liu et al. “Chronnectome fingerprinting:identifying individuals & predicting
higher cognitive function using dynamic brain connectivity patterns”. In:
Hum Brain Mapp () (cited on page 103).
240
[230] L. L. Zeng et al. “Unsupervised classification of major depression using func-
tional connectivity MRI”. In: Hum Brain Mapp 35.4 (Apr. 2014), pages 1630–
1641 (cited on page 104).
[231] Heung-Il Suk et al. “A hybrid of deep network and hidden Markov model
for MCI identification with resting-state fMRI”. In: MICCAI. 2015 (cited
on page 104).
[232] Mahmudul Hasan et al. “Learning Temporal Regularity in Video Sequences”.
In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
(2016) (cited on page 104).
[233] Nitish Srivastava, Elman Mansimov, and Ruslan R. Salakhutdinov. “Unsu-
pervised Learning of Video Representations using LSTMs”. In: ICML. 2015
(cited on page 104).
[234] Wen Liu et al. “Future Frame Prediction for Anomaly Detection - A New
Baseline”. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern
Recognition (2018) (cited on page 104).
[235] Alex Krizhevsky et al. “Imagenet classification with deep convolutional
neural networks”. In: Advances in neural information processing systems.
2012 (cited on page 105).
[236] Olaf Ronneberger et al. “U-Net: Convolutional Networks for Biomedical
Image Segmentation”. In: MICCAI. 2015 (cited on page 105).
[237] Sepp Hochreiter and Jürgen Schmidhuber. “Long short-term memory”. In:
Neural computation 9.8 (1997), pages 1735–1780 (cited on page 105).
[238] Xingjian Shi et al. “Convolutional LSTM Network: A Machine Learning
Approach for Precipitation Nowcasting”. In: NIPS. 2015 (cited on page 105).
241
[239] Di Martino et al. “The autism brain imaging data exchange:towards a large-
scale evaluation of intrinsic brain architecture in autism”. In: Molecular
psychiatry (2014) (cited on page 107).
[240] Abraham et al. “Deriving reproducible biomarkers from multi-site resting-
state data: an autism-based example”. In: NeuroImage 147 (2017), pages 736–
745 (cited on page 111).
[241] Lisa T Eyler et al. “A failure of left temporal cortex to specialize for language
is an early emerging and fundamental property of autism”. In: Brain 135.3
(2012), pages 949–960 (cited on page 112).
[242] G. Varoquaux and R. A. Poldrack. “Predictive models avoid excessive
reductionism in cognitive neuroimaging”. In: Curr. Opin. Neurobiol. 55 (Apr.
2019), pages 1–6 (cited on pages 117, 118, 173).
[243] D. L. Yamins et al. “Performance-optimized hierarchical models predict
neural responses in higher visual cortex”. In: Proc. Natl. Acad. Sci. U.S.A.
111.23 (June 2014), pages 8619–8624 (cited on pages 117–119, 149, 153, 174,
187).
[244] K. N. Kay et al. “Identifying natural images from human brain activity”.
In: Nature 452.7185 (Mar. 2008), pages 352–355 (cited on pages 117, 118).
[245] Haiguang Wen et al. “Neural encoding and decoding with deep learning for
dynamic natural vision”. In: Cerebral Cortex 28.12 (Dec. 2018), pages 4136–
4160. issn: 14602199. doi: 10.1093/cercor/bhx268. arXiv: 1608.03425
(cited on pages 117, 118, 124, 155, 174, 176, 277).
[246] U. Guclu and M. A. van Gerven. “Deep Neural Networks Reveal a Gradient
in the Complexity of Neural Representations across the Ventral Stream”. In:
242
J. Neurosci. 35.27 (July 2015), pages 10005–10014 (cited on pages 117–119,
124, 153, 155, 174, 176, 277).
[247] Umut Güçlü and Marcel A.J. van Gerven. “Increasingly complex repre-
sentations of natural movies across the dorsal stream are shared between
subjects”. In: NeuroImage 145 (Jan. 2017), pages 329–336. issn: 10959572.
doi: 10.1016/j.neuroimage.2015.12.036 (cited on pages 117, 145, 174).
[248] Alexander J.E. Kell et al. “A Task-Optimized Neural Network Replicates
Human Auditory Behavior, Predicts Brain Responses, and Reveals a Cor-
tical Processing Hierarchy”. In: Neuron 98.3 (May 2018), 630–644.e16.
issn: 08966273. doi: 10.1016/j.neuron.2018.03.044. url: https:
//linkinghub.elsevier.com/retrieve/pii/S0896627318302502 (cited
on pages 117–119, 124, 153, 174, 176, 277).
[249] A. J. King and G. A. Calvert. “Multisensory integration: perceptual grouping
by eye and ear”. In: Curr. Biol. 11.8 (Apr. 2001), R322–325 (cited on
page 117).
[250] J. Driver and T. Noesselt. “Multisensory interplay reveals crossmodal influ-
ences on ‘sensory-specific’ brain regions, neural responses, and judgments”.
In: Neuron 57.1 (Jan. 2008), pages 11–23 (cited on pages 117, 136, 141).
[251] J. Miller. “Divided attention: evidence for coactivation with redundant sig-
nals”. In: Cogn Psychol 14.2 (Apr. 1982), pages 247–279 (cited on page 117).
[252] S. Sonkusare, M. Breakspear, and C. Guo. “Naturalistic Stimuli in Neuro-
science: Critically Acclaimed”. In: Trends Cogn. Sci. (Regul. Ed.) 23.8 (Aug.
2019), pages 699–714 (cited on page 117).
243
[253] U. Hasson et al. “Intersubject synchronization of cortical activity during
natural vision”. In: Science 303.5664 (Mar. 2004), pages 1634–1640 (cited
on pages 118, 143).
[254] Marc Schönwiesner and Robert J. Zatorre. “Spectro-temporal modulation
transfer function of single voxels in the human auditory cortex measured
with high-resolution fMRI.” In: Proceedings of the National Academy of
Sciences 106 34 (2009), pages 14611–6 (cited on page 118).
[255] Daniel Schwartz, Mariya Toneva, and Leila Wehbe. “Inducing brain-relevant
bias in natural language processing models”. In: NeurIPS (2019) (cited on
page 119).
[256] M. F. Glasser et al. “The minimal preprocessing pipelines for the Human
Connectome Project”. In: Neuroimage 80 (Oct. 2013), pages 105–124 (cited
on page 121).
[257] A. T Vu et al. “Tradeoffs in pushing the spatial resolution of fMRI for the 7T
Human Connectome Project”. In: Neuroimage 154 (July 2017), pages 23–32
(cited on pages 121, 162, 179).
[258] Shawn Hershey et al. “CNN architectures for large-scale audio classification”.
In: 2017 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP) (2016), pages 131–135 (cited on pages 122, 124, 177,
179).
[259] Albert S. Bregman. “Auditory Scene Analysis”. In: MIT press (2001) (cited
on page 122).
[260] Tsung-Yi Lin et al. “Feature Pyramid Networks for Object Detection”.
In: 2017 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR) (2016), pages 936–944 (cited on pages 124, 176).
244
[261] Kaiming He et al. “Deep Residual Learning for Image Recognition”. In: 2016
IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
(2015), pages 770–778 (cited on pages 124, 155, 177).
[262] Jia Deng et al. “ImageNet: A large-scale hierarchical image database”. In:
2009 IEEE Conference on Computer Vision and Pattern Recognition (2009),
pages 248–255 (cited on pages 124, 155, 177).
[263] Sami Abu-El-Haija et al. “YouTube-8M: A Large-Scale Video Classification
Benchmark”. In: ArXiv abs/1609.08675 (2016) (cited on pages 124, 177).
[264] M. F. Glasser et al. “A multi-modal parcellation of human cerebral cortex”.
In: Nature 536.7615 (Aug. 2016), pages 171–178 (cited on pages 126, 128,
160, 164, 184, 271, 272, 300).
[265] U. Hasson et al. “A hierarchy of temporal receptive windows in human
cortex”. In: J. Neurosci. 28.10 (Mar. 2008), pages 2539–2550 (cited on
page 129).
[266] C. Baldassano et al. “Discovering Event Structure in Continuous Narrative
Perception and Memory”. In: Neuron 95.3 (Aug. 2017), pages 709–721 (cited
on pages 130, 147, 149).
[267] M. A. Goodale and A. D. Milner. “Separate visual pathways for perception
and action”. In: Trends Neurosci. 15.1 (Jan. 1992), pages 20–25 (cited on
page 132).
[268] G. A. Calvert. “Crossmodal processing in the human brain: insights from
functional neuroimaging studies”. In: Cereb. Cortex 11.12 (Dec. 2001),
pages 1110–1123 (cited on pages 134, 136, 141, 142, 147).
245
[269] T. Raij, K. Uutela, and R. Hari. “Audiovisual integration of letters in
the human brain”. In: Neuron 28.2 (Nov. 2000), pages 617–625 (cited on
pages 135, 136).
[270] M. S. Beauchamp. “Statistical criteria in FMRI studies of multisensory inte-
gration”. In: Neuroinformatics 3.2 (2005), pages 93–113 (cited on pages 135,
136, 141).
[271] M. S. Beauchamp et al. “Unraveling multisensory integration: patchy orga-
nization within human STS multisensory cortex”. In: Nat. Neurosci. 7.11
(Nov. 2004), pages 1190–1192 (cited on pages 136, 141).
[272] G. A. Calvert et al. “Activation of auditory cortex during silent lipreading”.
In: Science 276.5312 (Apr. 1997), pages 593–596 (cited on page 136).
[273] N. Kanwisher and G. Yovel. “The fusiform face area: a cortical region
specialized for the perception of faces”. In: Philos. Trans. R. Soc. Lond., B,
Biol. Sci. 361.1476 (Dec. 2006), pages 2109–2128 (cited on page 138).
[274] I. Tavor et al. “Separate parts of occipito-temporal white matter fibers are
associated with recognition of faces and places”. In: Neuroimage 86 (Feb.
2014), pages 123–130 (cited on page 138).
[275] S. Nasr et al. “Scene-selective cortical regions in human and nonhuman
primates”. In: J. Neurosci. 31.39 (Sept. 2011), pages 13771–13785 (cited on
page 138).
[276] J. A. Frost et al. “Language processing is strongly left lateralized in both
sexes. Evidence from functional MRI”. In: Brain 122 ( Pt 2) (Feb. 1999),
pages 199–208 (cited on page 139).
[277] P. Belin et al. “Voice-selective areas in human auditory cortex”. In: Nature
403.6767 (Jan. 2000), pages 309–312 (cited on page 139).
246
[278] T. Yarkoni et al. “Large-scale automated synthesis of human functional
neuroimaging data”. In: Nat. Methods 8.8 (June 2011), pages 665–670 (cited
on page 139).
[279] Alexander G. Huth et al. “A Continuous Semantic Space Describes the
Representation of Thousands of Object and Action Categories across the
Human Brain”. In: Neuron 76 (2012), pages 1210–1224 (cited on pages 139,
145, 280).
[280] Y. Cao et al. “Causal Inference in the Multisensory Brain”. In: Neuron 102.5
(June 2019), pages 1076–1087 (cited on page 142).
[281] S. M. Wilson, I. Molnar-Szakacs, and M. Iacoboni. “Beyond superior tem-
poral cortex: intersubject correlations in narrative speech comprehension”.
In: Cereb. Cortex 18.1 (Jan. 2008), pages 230–242 (cited on page 142).
[282] Iiro P. Jääskeläinen et al. “Inter-Subject Synchronization of Prefrontal
Cortex Hemodynamic Activity During Natural Viewing”. In: The Open
Neuroimaging Journal 2 (2008), pages 14–19 (cited on page 143).
[283] Shailee Jain and Alexander Huth. “Incorporating Context into Language
Encoding Models for fMRI”. In: NIPS (2018) (cited on page 145).
[284] Fabian H Sinz et al. “Stimulus domain transfer in recurrent models for large
scale cortical population prediction on video”. In: bioRxiv (2018) (cited on
page 145).
[285] J. Schultz and K. S. Pilz. “Natural facial motion enhances cortical responses
to faces”. In: Exp Brain Res 194.3 (Apr. 2009), pages 465–475 (cited on
pages 146, 173).
[286] Pouya Bashivan, Kohitij Kar, and J. DiCarlo. “Neural population control
via deep image synthesis”. In: Science 364 (2019) (cited on pages 146, 170).
247
[287] J. Chen, U. Hasson, and C. J. Honey. “Processing Timescales as an Organiz-
ing Principle for Primate Cortex”. In: Neuron 88.2 (Oct. 2015), pages 244–
246 (cited on pages 147, 286).
[288] Jonathan E Peelle. “Methodological challenges and solutions in auditory
functional magnetic resonance imaging”. In: Frontiers in neuroscience 8
(2014), page 253 (cited on page 148).
[289] Fabian H Sinz et al. “Engineering a less artificial intelligence”. In: Neuron
103.6 (2019), pages 967–979 (cited on page 149).
[290] U. Hasson, J. Chen, and C. J. Honey. “Hierarchical process memory: memory
as an integral component of information processing”. In: Trends Cogn. Sci.
(Regul. Ed.) 19.6 (June 2015), pages 304–313 (cited on page 149).
[291] Qianli Liao and Tomaso A. Poggio. “Bridging the Gaps Between Resid-
ual Learning, Recurrent Neural Networks and Visual Cortex”. In: ArXiv
abs/1604.03640 (2016) (cited on page 149).
[292] K. Kar et al. “Evidence that recurrent circuits are critical to the ventral
stream’s execution of core object recognition behavior”. In: Nat. Neurosci.
22.6 (June 2019), pages 974–983 (cited on pages 149, 150).
[293] D. Wyatte, D. J. Jilk, and R. C. O’Reilly. “Early recurrent feedback facilitates
visual object recognition under challenging conditions”. In: Front Psychol 5
(2014), page 674 (cited on page 150).
[294] Emily S Finn et al. “Idiosynchrony: From shared responses to individual
differences during naturalistic neuroimaging”. In: NeuroImage 215 (2020),
page 116828 (cited on page 150).
248
[295] Jason J Ki, Simon P Kelly, and Lucas C Parra. “Attention strongly modulates
reliability of neural responses to naturalistic narrative stimuli”. In: Journal
of Neuroscience 36.10 (2016), pages 3092–3101 (cited on page 150).
[296] Mai Nguyen, Tamara Vanderwal, and Uri Hasson. “Shared understanding
of narratives is correlated with shared neural responses”. In: NeuroImage
184 (2019), pages 161–170 (cited on page 150).
[297] Lauri Nummenmaa et al. “Emotions promote social interaction by synchro-
nizing brain activity across individuals”. In: Proceedings of the National
Academy of Sciences 109.24 (2012), pages 9599–9604 (cited on page 150).
[298] Emily S Finn et al. “Trait paranoia shapes inter-subject synchrony in brain
activity during an ambiguous social narrative”. In: Nature Communications
9.1 (2018), pages 1–13 (cited on page 150).
[299] Zhi Yang et al. “Individualized psychiatric imaging based on inter-subject
neural synchronization in movie watching”. In: NeuroImage 216 (2020),
page 116227 (cited on page 150).
[300] Yoav Benjamini and Yosef Hochberg. “Controlling the False Discovery Rate:
a Practical and Powerful Approach to Multiple Testing”. In: J. R. Stat. Soc.
B. 57 (1995), pages 289–300 (cited on page 283).
[301] Arsha Nagrani et al. “Voxceleb: Large-scale speaker verification in the wild”.
In: Comput. Speech Lang. 60 (2020) (cited on page 285).
[302] Karol J. Piczak. “ESC: Dataset for Environmental Sound Classification”.
In: MM (2015) (cited on page 285).
[303] Jonathan D Power et al. “Methods to detect, characterize, and remove
motion artifact in resting state fMRI”. In: Neuroimage 84 (2014), pages 320–
341 (cited on page 296).
249
[304] S. Sonkusare, M. Breakspear, and C. Guo. “Naturalistic Stimuli in Neuro-
science: Critically Acclaimed”. In: Trends Cogn. Sci. (Regul. Ed.) 23.8 (Aug.
2019), pages 699–714 (cited on pages 153, 173).
[305] John T. Serences and Steven Yantis. “Selective visual attention and percep-
tual coherence”. In: Trends in Cognitive Sciences 10 (2006), pages 38–45
(cited on page 153).
[306] Sabine Kastner and Leslie G. Ungerleider. “Mechanisms of visual attention in
the human cortex.” In: Annual review of neuroscience 23 (2000), pages 315–
41 (cited on page 153).
[307] Jochen Braun, Christof Koch, and Joel L. Davis. “Visual attention and
cortical circuits”. In: Visual attention and cortical circuits. 2001 (cited on
pages 153, 157).
[308] Laurent Itti and Christof Koch. “Computational modelling of visual atten-
tion”. In: Nature Reviews Neuroscience 2 (2001), pages 194–203 (cited on
pages 153, 169).
[309] Tomaso Poggio and Fabio Anselmi. “Visual Cortex and Deep Networks:
Learning Invariant Representations”. In: Visual Cortex and Deep Networks:
Learning Invariant Representations. 2016 (cited on page 153).
[310] James E. Hoffman and Baskaran Subramaniam. “The role of visual attention
in saccadic eye movements”. In: Perception & Psychophysics 57 (1995),
pages 787–795 (cited on page 154).
[311] Thomas P O’Connell and Marvin M. Chun. “Predicting eye movement
patterns from fMRI responses to natural scenes”. In: Nature Communications
9 (2018) (cited on page 154).
250
[312] Fabian Sinz et al. “Stimulus domain transfer in recurrent models for large
scale cortical population prediction on video”. In: Advances in neural infor-
mation processing systems. 2018, pages 7199–7210 (cited on page 154).
[313] M. F. Glasser et al. “The minimal preprocessing pipelines for the Human
Connectome Project”. In: Neuroimage 80 (Oct. 2013), pages 105–124 (cited
on pages 154, 161, 173, 179).
[314] Po-He Tseng et al. “Quantifying center bias of observers in free viewing of
dynamic natural scenes.” In: Journal of vision 9 7 (2009), page 4 (cited on
page 158).
[315] Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic
Optimization”. In: CoRR abs/1412.6980 (2014) (cited on pages 159, 178).
[316] U. Hasson et al. “Intersubject synchronization of cortical activity during
natural vision”. In: Science 303.5664 (Mar. 2004), pages 1634–1640 (cited
on pages 160, 173).
[317] Laurent Itti, Christof Koch, and Ernst Niebur. “A Model of Saliency-Based
Visual Attention for Rapid Scene Analysis”. In: IEEE Trans. Pattern Anal.
Mach. Intell. 20 (2009), pages 1254–1259 (cited on page 160).
[318] Matthias Kümmerer et al. “Understanding Low- and High-Level Contribu-
tions to Fixation Prediction”. In: 2017 IEEE International Conference on
Computer Vision (ICCV) (2017), pages 4799–4808 (cited on pages 160, 161).
[319] Zoya Bylinskii et al. MIT Saliency Benchmark. http://saliency.mit.edu/
(cited on page 161).
[320] Zoya Bylinskii et al. “What Do Different Evaluation Metrics Tell Us About
Saliency Models?” In: IEEE Transactions on Pattern Analysis and Machine
Intelligence 41 (2016), pages 740–757 (cited on page 161).
251
[321] Matthias Kümmerer, Thomas S. A. Wallis, and Matthias Bethge.
“Information-theoretic model comparison unifies saliency metrics.” In: Pro-
ceedings of the National Academy of Sciences of the United States of America
112 52 (2015), pages 16054–9 (cited on page 161).
[322] D. C. Van Essen et al. “The Human Connectome Project: a data acquisition
perspective”. In: Neuroimage 62.4 (Oct. 2012), pages 2222–2231 (cited on
pages 161, 162, 179).
[323] G. S. Khorshidi et al. “Automatic denoising of functional MRI data: Com-
bining independent component analysis and hierarchical fusion of classifiers”.
In: NeuroImage 90 (2014), pages 449–468 (cited on page 162).
[324] K. Grill-Spector, Z. Kourtzi, and N. Kanwisher. “The lateral occipital
complex and its role in object recognition”. In: Vision Res. 41.10-11 (2001),
pages 1409–1422 (cited on page 165).
[325] Zoe Kourtzi and Nancy Kanwisher. “Cortical regions involved in perceiving
object shape.” In: The Journal of neuroscience : the official journal of the
Society for Neuroscience 20 9 (2000), pages 3310–8 (cited on page 165).
[326] Jonas Larsson and David J. Heeger. “Two retinotopic visual areas in human
lateral occipital cortex.” In: The Journal of neuroscience : the official journal
of the Society for Neuroscience 26 51 (2006), pages 13128–42 (cited on
page 165).
[327] Alessandro De Benedictis et al. “Anatomo-functional study of the temporo-
parieto-occipital region: dissection, tractographic and brain mapping ev-
idence from a neurosurgical perspective.” In: Journal of anatomy 225 2
(2014), pages 132–51 (cited on page 165).
252
[328] Anne Treisman and G. A. Gelade. “A feature-integration theory of attention”.
In: Cognitive Psychology 12 (1980), pages 97–136 (cited on page 168).
[329] U. Hasson, R. Malach, and D. J. Heeger. “Reliability of cortical activity
during natural stimulation”. In: Trends Cogn. Sci. (Regul. Ed.) 14.1 (Jan.
2010), pages 40–48 (cited on page 173).
[330] Po-Hsuan Cameron Chen et al. “A Reduced-Dimension fMRI Shared Re-
sponse Model”. In: NIPS. 2015 (cited on pages 173, 174).
[331] J. Dubois and R. Adolphs. “Building a Science of Individual Differences from
fMRI”. In: Trends Cogn. Sci. (Regul. Ed.) 20.6 (June 2016), pages 425–443
(cited on page 174).
[332] Haiguang Wen et al. “Transferring and generalizing deep-learning-based
neural encoding models across subjects”. In: NeuroImage 176 (Aug. 2018),
pages 152–163. issn: 10959572. doi: 10.1016/j.neuroimage.2018.04.053
(cited on pages 174, 187).
[333] Ross Girshick et al. Detectron. https://github.com/facebookresearch/
detectron. 2018 (cited on page 178).
[334] S. Hershley and et. al. et. Models for AudioSet: A Large Scale Dataset of
Audio Events. https://github.com/tensorflow/models/tree/master/
research/audioset/vggish. 2016 (cited on page 178).
[335] I. Tavor et al. “Task-free MRI predicts individual differences in brain activity
during task performance”. In: Science 352.6282 (Apr. 2016), pages 216–220
(cited on page 181).
[336] N. Kanwisher, J. McDermott, and M. M. Chun. “The fusiform face area: a
module in human extrastriate cortex specialized for face perception”. In: J.
Neurosci. 17.11 (June 1997), pages 4302–4311 (cited on page 183).
253
[337] S. Nasr et al. “Scene-selective cortical regions in human and nonhuman
primates”. In: J. Neurosci. 31.39 (Sept. 2011), pages 13771–13785 (cited on
page 183).
[338] David H Hubel and Torsten N Wiesel. “Receptive fields, binocular interaction
and functional architecture in the cat’s visual cortex”. In: The Journal of
physiology 160.1 (1962), pages 106–154 (cited on page 186).
[339] Anitha Pasupathy and Charles E Connor. “Population coding of shape in
area V4”. In: Nature neuroscience 5.12 (2002), pages 1332–1338 (cited on
page 186).
[340] Xiaomin Yue, Sophia Robert, and Leslie G Ungerleider. “Curvature process-
ing in human visual cortical areas”. In: NeuroImage 222 (2020), page 117295
(cited on page 186).
[341] Scott L Brincat and Charles E Connor. “Underlying principles of visual
shape selectivity in posterior inferotemporal cortex”. In: Nature neuroscience
7.8 (2004), pages 880–886 (cited on page 186).
[342] Nicole C Rust and James J DiCarlo. “Selectivity and tolerance (“invariance”)
both increase as visual information propagates from cortical area V4 to
IT”. In: Journal of Neuroscience 30.39 (2010), pages 12978–12995 (cited on
page 186).
[343] Paul E Downing et al. “Domain specificity in visual cortex”. In: Cerebral
cortex 16.10 (2006), pages 1453–1461 (cited on page 186).
[344] Kalanit Grill-Spector and Kevin S Weiner. “The functional architecture
of the ventral temporal cortex and its role in categorization”. In: Nature
Reviews Neuroscience 15.8 (2014), pages 536–548 (cited on page 187).
254
[345] D. L. Yamins and J. J. DiCarlo. “Using goal-driven deep learning models to
understand sensory cortex”. In: Nat. Neurosci. 19.3 (Mar. 2016), pages 356–
365 (cited on pages 188, 201).
[346] Emily Jean Allen et al. “A massive 7T fMRI dataset to bridge cognitive and
computational neuroscience”. In: bioRxiv (2021) (cited on pages 188, 190).
[347] Liang Wang et al. “Probabilistic maps of visual topography in human cortex”.
In: Cerebral cortex 25.10 (2015), pages 3911–3931 (cited on page 190).
[348] J Swaroop Guntupalli et al. “A model of representational spaces in human
cortex”. In: Cerebral cortex 26.6 (2016), pages 2919–2934 (cited on page 191).
[349] David A Klindt et al. “Neural system identification for large populations
separating what and where”. In: Proceedings of the 31st International Con-
ference on Neural Information Processing Systems. 2017, pages 3509–3519
(cited on page 191).
[350] Maurice Weiler and Gabriele Cesa. “General E(2)-Equivariant Steerable
CNNs”. In: arXiv preprint arXiv:1911.08251 (2019) (cited on page 191).
[351] Nikolaus Kriegeskorte, Marieke Mur, and Peter A Bandettini. “Representa-
tional similarity analysis-connecting the branches of systems neuroscience”.
In: Frontiers in systems neuroscience 2 (2008), page 4 (cited on page 195).
[352] Martin N Hebart et al. “THINGS: A database of 1,854 object concepts and
more than 26,000 naturalistic object images”. In: PloS one 14.10 (2019),
e0223792 (cited on pages 196, 204).
[353] Nikolaus Kriegeskorte. “Relating population-code representations between
man, monkey, and computational models”. In: Frontiers in Neuroscience 3
(2009), page 35 (cited on page 196).
255
[354] Yaoda Xu and Maryam Vaziri-Pashkam. “Limits to visual representational
correspondence between convolutional neural networks and the human brain”.
In: Nature communications 12.1 (2021), pages 1–16 (cited on page 196).
[355] Alex Krizhevsky, Geoffrey Hinton, et al. “Learning multiple layers of features
from tiny images”. In: (2009) (cited on page 196).
[356] Seyed-Mahdi Khaligh-Razavi and Nikolaus Kriegeskorte. “Deep Supervised,
but Not Unsupervised, Models May Explain IT Cortical Representation”.
In: PLoS Computational Biology 10 (2014) (cited on pages 201, 202, 298).
[357] Xiaomin Yue et al. “Curvature-processing network in macaque visual cortex”.
In: Proceedings of the National Academy of Sciences 111.33 (2014), E3467–
E3475 (cited on page 204).
[358] Robert Bridson. “Fast Poisson Disk Sampling in Arbitrary Dimensions”. In:
ACM SIGGRAPH 2007 Sketches. SIGGRAPH ’07. San Diego, California:
ACM, 2007. isbn: 978-1-4503-4726-6. doi: 10.1145/1278780.1278807. url:
http://doi.acm.org/10.1145/1278780.1278807 (cited on page 257).
[359] Nikolaus Kriegeskorte, Marieke Mur, and Peter A. Bandettini. “Represen-
tational Similarity Analysis – Connecting the Branches of Systems Neuro-
science”. In: Frontiers in Systems Neuroscience 2 (2008) (cited on page 298).
256
APPENDIX A
SUPPLEMENTARY INFORMATION AND ADDITIONAL
RESULTS FOR SECTION 3.1
A.1 Atlas Summary
Atlas # of ROIs Total Vol. Median Vol.(± std) Min Vol. Max Vol.
TT 97 1656.34 12.5 (±16.02) 0.03 69.71
HO 111 1611.39 10.04 (±15.26) 0.05 97.33
EZ 116 1941.65 14.11 (±11.96) 0.97 56.35
AAL 116 1843.10 13.78 (±11.05) 1.35 53.33
DOS160 160 82.05 0.51 (±0.04) 0.03 0.51
CC200 200 1172.15 5.83 (±1.26) 1.81 9.96
CC400 400 1172.15 2.97 (±0.68) 0.76 5.35
Table A.1: Summary descriptors of ROIs in individual atlases. All volumes are in
cm3.
A.2 Poisson Disk Sampling
Poisson disk sampling is a stochastic sampling procedure where drawn samples are
required to be at least a distance d apart for some user-specific distance metric and
density parameter. Since we use this sampling procedure to draw parcel centers, d is
estimated a priori based upon the desired number of parcels, and spatial proximity
is used to compute the distance. We use the fast poisson disk sampling algorithm
as proposed in [358]. This is an efficient sampling procedure that generalizes to
arbitrary dimensions and allows volumetric sampling. The algorithm is outlined
below:
257
• Step1: A parcel center is arbitrarily chosen from all gray matter voxels and
stored in an initially empty ‘active’ list.
• Step 2: A sample c is drawn from this active list of voxels. The next
candidate parcel center is randomly selected from the list of all voxels within
a spherical annulus between radius d and 2d around c. A candidate is accepted
and added to the active list if it is atleast a distance d apart from all the
existing parcel centers; otherwise another candidate is chosen. If no candidate
is accepted from the annulus, c is removed from the active list and another
sample c is drawn. This procedure is repeated until the active list is empty.
• Step 3: Once the centers are sampled, every gray matter voxel is assigned
to its closest parcel center. Sampling is performed for the left and right
hemispheres separately to avoid parcels that cross hemispheric boundaries.
A.3 Linear Classifiers
A.3.1 Ridge Classifier
Given feature vectors xi for n subjects and the corresponding prediction variables
denoted by yi, we approximate the fit using a linear regression model. An L2
regularization for the weights (w) is added to the mean squared error to yield the
following loss function of ridge regression:
LR = ‖Xw − y‖2 + α ‖w‖2 (A.1)
During classification, the output labels y are encoded as ± 1 for the two output
categories to minimize the above loss.
258
A.3.2 Support Vector Machines
(a) Classification
Support Vector Machine Classifiers optimize for a hyperplane with maximum
margin between the output classes. This results in a decision function of the form,
f (x)= sign(wTx+b). The weights {w, b} are obtained by minimizing the following
convex loss function consisting of a data loss component (LD) and a regularization
loss for the weights (LW),
LSVC = CLD + LW (A.2)
LD is modeled using a hinge loss function,
∑n
i=1max(0, 1− y Ti(w xi + b)) over all
n training samples {(x1,y1),...,(xn,yn)}. LW is modeled using a Euclidean norm,
i.e., ‖w‖2. Here, C is a tuning parameter that controls the trade-off between
regularization and data loss.
(b) Regression
The -Support Vector Regression (SVR) scheme optimizes for a decision function
of the form, f (x)=wTx+ b, that has at most  deviation from the true prediction
variables y (allowing for errors when the problem is infeasible). The loss function
(LSVR) can be formulated as,
LSVR = CL + LW (A.3)
L∑is traditionally referred to as the -insensitive loss function, and is formulatedas ni=1max(0, |wTxi + b− yi| − ) over all n training samples {(x1,y1),...,(xn,yn)}.
The regularization term (LW) is modeled using a Euclidean norm, i.e., ‖w‖2. The
tuning parameter C controls the trade-off between the regularization (i.e., the
flatness of the decision function) and the amount up to which deviations beyond 
are tolerated.
Both the classification and regression problems yield weights w that can be rep-
259
resented compl∑etely as a linear combination of the training inputs xi. Thus, w isrepresented as ni=1 αixi, and the decision function becomes f (x)=∑ni=1 αixTi x+ b.
This makes it easier to extend SVMs for non-linear decision functions using
the kernel technique, i.e., by applying transformations φ(x) that map x to a
high-dimensional space and replacing the inner product 〈xi, x〉 with the kernel
K(xi, x)=〈φ(xi), φ(x)〉. For our experiments, we observed that the radial basis func-
2
tion kernel, K(x , x) =exp(-‖xi−x‖i 2 2 ), yields the best results among linear, sigmoidσ
and polynomial kernels up to degree 4.
A.4 Neural network hyperparameter settings
We note that since it is more expensive to train a 3D-CNN, we could experiment
with only a limited configuration of hyperparameters during cross-validation on
ABIDE-I data, compared to FCN and Brain-Net CNN, which work with vectorized
connectivity matrices and are thus faster to train. For all three neural network
models, we relied on a random search over the learning rate, number of layers,
number of units or feature maps in each layer and the choice of non-linearity. For
this search we employed the HO atlas. The hyper-parameter configuration that
yielded the best ABIDE-I cross-validation accuracy was subsequently used for all
other parcellation schemes. Note that except for some minor changes, the models
for age prediction and ASD/HC classification are almost identical. Furthermore, in
our primary analyses we compare models based on ABIDE-II performance, which
was not used for hyper-parameter tuning.
For the FCN, we initially started with the architecture proposed by Heinsfeld
et al.[192] for ASD/HC classification. We increased the number of layers before the
260
ASD/HC Classification accuracy (ABIDE-I)
Parcellation  Ridge  SVM  FCN  BrainNet  3D-CNN
HO 66.7 69.4 69.4 67.8 70.5
CC200 69.7 69.1 70.5 68.6 71.2
EZ 66.4 69.0 68.6 66.0 69.3
TT 64.4 68.6 67.1 66.0 69.4
CC400 70.2 69.4 71.0 71.3 71.7
AAL 65.4 69.1 66.7 66.5 71.4
DOS160 66.2 68.4 67.2 67.0 68.6
MA-Ensemble 69.8 70.5 71.5 69.7 73.3
SP-Ensemble 70.7 71.0 72.0 71.5 73.5
Table A.2: Classification accuracy for ASD vs. Control: 10-fold cross-validation on
ABIDE-I for benchmark models and proposed CNN approach. For each row, best
results are bolded. For each column, best results are italicized. Green indicates
better performance, whereas orange/red highlights worse performance.
Age RMSE (ABIDE-I)
Parcellation Ridge SVM FCN BrainNet 3D-CNN
HO 3.51 3.64 3.57 3.55 3.37
CC200 3.40 3.66 3.44 3.46 3.35
EZ 3.53 3.70 3.60 3.55 3.40
TT 3.58 3.77 3.53 3.57 3.41
CC400 3.41 3.71 3.41 3.46 3.39
AAL 3.60 3.74 3.66 3.60 3.31
DOS160 3.62 4.01 3.76 3.67 3.48
MA-Ensemble 3.30 3.67 3.28 3.31 3.28
SP-Ensemble 3.38 3.67 3.35 3.39 3.28
Table A.3: Root mean squared error (RMSE in years) for age prediction: 10-fold
cross-validation on ABIDE-I for benchmark models and proposed CNN approach.
For each row, best results are bolded. For each column, best results are itali-
cized. Green indicates better performance, whereas orange/red highlights worse
performance.
softmax output until the ABIDE-I cross-validation accuracy stopped improving.
Also, we noticed that adding batch-normalization after each layer had no noticeable
impact on classification performance. Hence, we didn’t include this layer in our
FCN architecture.
261
A.5 ABIDE-I cross-validation results
In order to ensure a fair comparison with other studies that report 10-fold cross-
validation performance on ABIDE-I, we report the performance obtained using
our benchmark and proposed models (along with the ensemble learning strat-
egy) for both stochastic parcellations and atlases in the form of kernel density
plots (Figure A.1 and Tables A.2, A.3). Clearly, the results and conclusions on
ABIDE-I remain consistent with ABIDE-II, with the 3D-CNN ensemble strategy
outperforming all the baseline methods.
A.6 Saliency maps for individual parcellations
Visualizing the saliency maps for models trained on different brain parcellations can
reveal interesting differences in the features captured by these models. We visualized
the saliency maps of the 3D-CNN model for individual stochastic parcellations at
multiple scales for the task of ASD/HC Classification. As shown in Figure A.2,
models trained using distinct parcellation schemes are relying on the same basic
underlying connectivity patterns for prediction, with small differences in their
information content, that can be utilized efficiently by the ensemble learning
scheme. Further, the saliency maps of atlas-based (see Figure A.3) and stochastic
parcellation-based models are remarkably similar, suggesting that the connectivity
patterns of the same set of voxels are guiding the classifier predictions, irrespective
of the precise scheme of ROI extraction.
262
A.7 Comparison of different preprocessing strategies
Since preprocessing options such as nuisance regression have been a point of con-
tention in several studies, we conducted another set of experiments with standard at-
las masks in three preprocessing scenarios (a) without global signal regression(GSR)
+ with CompCor (b) without GSR + without CompCor and (c) with GSR +
without CompCor. Below, we include the results obtained with models trained
on the ABIDE-1 data using the hyperparameters optimized in our original experi-
ments presented in the paper. The accuracy values were computed based on test
predictions on the independent ABIDE-2 dataset, following our original evaluation
protocol. As can be seen from Tables A.4 and A.5, when neither GSR nor Com-
pCor is employed during preprocessing, the prediction performance on both the
tasks, i.e., ASD/HC Classification and age prediction, drops significantly. However,
similar performance is obtained when using either or both of CompCor and GSR
in preprocessing. Importantly, the trend remains the same: the 3D-CNN fares
favorably against the baseline algorithms and the MA-Ensemble models generally
perform better than or similar to the best-performing atlas.
263
Preprocessing scheme: Without CompCor, With GSR
ASD/HC Classification accuracy (ABIDE-II)
Parcellation Ridge SVM FCN BrainNet 3D-CNN
HO 64.7 66.2 62.9 64.4 68.8
CC200 67.8 66.0 67.8 68.2 67.8
EZ 64.7 62.9 64.7 64.4 66.2
TT 62.9 63.9 64.4 64.9 65.7
CC400 68.5 65.7 66.7 67.5 68.8
AAL 63.7 65.2 64.4 62.1 68.8
DOS160 61.6 61.9 62.6 62.4 64.5
MA-Ensemble 69.3 66.5 69.0 69.7 68.8
Preprocessing scheme: With CompCor, Without GSR
ASD/HC Classification accuracy (ABIDE-II)
Parcellation Ridge SVM FCN BrainNet 3D-CNN
HO 58.5 64.1 61.0 63.6 64.0
CC200 68.4 67.2 68.0 68.2 68.7
EZ 62.6 65.1 62.6 66.4 67.4
TT 63.6 66.4 64.1 63.1 65.4
CC400 67.9 65.9 66.7 65.4 69.5
AAL 63.8 64.3 61.6 62.6 66.4
DOS160 65.9 65.1 66.2 63.3 68.2
MA-Ensemble 69.9 67.4 68.9 68.9 69.5
Preprocessing scheme: Without CompCor, Without GSR
ASD/HC Classification accuracy (ABIDE-II)
Parcellation Ridge SVM FCN BrainNet 3D-CNN
HO 60.1 58.8 62.1 62.6 63.2
CC200 67.7 66.2 62.4 64.7 67.0
EZ 59.6 59.8 62.9 60.1 63.5
TT 61.6 61.6 61.3 61.3 61.6
CC400 65.2 64.4 63.4 64.9 66.5
AAL 61.8 62.4 61.6 63.1 63.4
DOS160 57.0 58.0 62.1 61.3 62.7
MA-Ensemble 66.7 66.2 62.9 67.5 67.5
Table A.4: Classification accuracy for ASD vs. Control: Independent results on
ABIDE-II for benchmark models and proposed CNN approach. Green indicates
better performance, whereas orange/red highlights worse performance.
264
Figure A.1: Violin plots showing the spread of prediction accuracies/errors for
stochastic parcellations at multiple network scales for different classification models.
Mean accuracy/error of individual violins is denoted by ’Mean SPs’. Performance of
individual atlases is compared with SPs2w65ith the closest # of ROIs and is denoted
as ’Single Atlas’. Results are computed by 10-fold cross-validation on the entire
ABIDE-1 cohort.
Figure A.2: Saliency maps of trained CNN models for 2 randomly chosen stochastic
parcellations at each scale for ASD-HC classification.
266
Figure A.3: Saliency maps for atlas-based ASD-HC classification models.
267
Preprocessing scheme: Without CompCor, With GSR
Age RMSE (ABIDE-II)
Parcellation Ridge SVM FCN BrainNet 3D-CNN
HO 2.85 2.75 2.61 2.50 2.31
CC200 2.54 2.75 2.35 2.48 2.26
EZ 2.87 2.64 2.56 2.61 2.06
TT 3.09 2.80 2.55 2.87 2.19
CC400 2.61 2.86 2.32 2.32 2.28
AAL 2.82 2.63 2.68 2.61 2.14
DOS160 3.24 3.37 2.82 3.03 2.32
MA-Ensemble 2.56 2.67 2.23 2.29 2.07
Preprocessing scheme: With CompCor, Without GSR
Age RMSE (ABIDE-II)
Parcellation Ridge SVM FCN BrainNet 3D-CNN
HO 3.37 2.90 2.83 2.83 2.48
CC200 2.91 2.74 2.84 2.75 2.41
EZ 3.24 2.75 2.60 2.74 2.61
TT 3.14 2.88 3.14 2.74 2.34
CC400 2.88 2.86 2.74 2.63 2.47
AAL 3.18 2.73 2.78 2.79 2.49
DOS160 3.17 3.29 2.59 2.82 2.52
MA-Ensemble 2.73 2.75 2.37 2.41 2.16
Preprocessing scheme: Without CompCor, Without GSR
Age RMSE (ABIDE-II)
Parcellation Ridge SVM FCN BrainNet 3D-CNN
HO 3.14 3.08 3.14 2.60 2.60
CC200 3.15 2.89 3.33 2.60 2.70
EZ 3.41 2.81 3.35 2.98 2.47
TT 3.22 3.05 3.28 2.85 2.80
CC400 3.06 2.94 2.90 2.54 2.48
AAL 3.36 2.81 2.73 3.08 2.54
DOS160 3.45 3.54 3.48 2.99 2.63
MA-Ensemble 2.76 2.91 2.83 2.40 2.30
Table A.5: Root mean squared error (RMSE in years) for age prediction: Inde-
pendent results on ABIDE-II for benchmark models and proposed CNN approach.
Green indicates better performance, whereas orange/red highlights worse perfor-
mance.
268
Figure A.4: ROC Curves for individual atlas based ASD-HC classification models.
269
Table A.6: Mean absolute error (MAE in years) for age prediction: Independent
testing on ABIDE-II for benchmark models and proposed CNN approach. For each
row, best results are bolded. For each column, best results are italicized. Green
indicates better performance, whereas orange/red highlights worse performance.
270
APPENDIX B
SUPPLEMENTARY INFORMATION AND ADDITIONAL
RESULTS FOR SECTION 4.2
B.1 HCP Movies
Table B.1 summarizes the HCP movie-watching dataset split used for training and
evaluating all models.
Table B.1: HCP dataset split
Movie Split Stimulus-response pairs per subject
7T_MOVIE1_CC1 v2 Training/Validation 652
7T_MOVIE2_HO1 v2 Training/Validation 716
7T_MOVIE3_CC2 v2 Training/Validation 669
7T_MOVIE4_HO2 v2 Testing 699
B.2 Region of Interest (ROI) selection
ROIs were selected for each analysis based on the descriptions provided in the
neuroanatomical supplementary results of the HCP MMP parcellation [264] and an
extensive literature review. For Figure 4.2 in the main text and Figure B.9, ROIs
were thus assigned to groups 1-5 according to Table B.2.
Dorsal and ventral visual stream ROIs as well as early and association auditory
cortex ROIs in Figure 4.4 (main text) were derived from the explicit stream
271
Table B.2: ROI categorization
Group ROIs
A1, LBelt, PBelt, MBelt, RI,
1. Auditory STSda, STSva, A4, A5, TA2
V1, V2, V3, V3A, V3B, V3CD, V4, V4t,
V6, V6A, V7, V8, DVT, LO1-3, PIT,
2. Visual FFC, VMV1-3, IPS1, MT, VVC
3. Multi-sensory + sensory bridges STSdp, STSvp, STGa, STV, TPOJ1-3
4. Language 55b, SFL, PSL, 44, 45
5. Frontal IFSa, IFSp, IFJa, IFJp, FEF
segregation and categorization described in the HCP MMP parcellation [264] and
are defined here for quick reference.
• Dorsal: V3A, V3B, V6, V6A, V7, IPS1
• Ventral: V8, VVC, PIT, FFC, VMV1-3
• MT+: MT, MST, V4t, FST
• Early auditory: A1, PBelt, MBelt, RBelt, RI
• Association auditory: A4, A5, TA2, STGa, STSdp, STSda, STSvp, STSva
All ROIs are shown in Figure B.1.
Figure B.1: Group segregation from the HCP MMP parcellation.
272
B.3 Estimating BOLD response delay
BOLD response delay was estimated using ROI-level encoding models due to their
faster iteration times in comparison to voxel-wise encoding. The input to these
models was the preprocessed stimuli as described for voxel-wise encoding with the
same train-validation-test split, and the output was the evoked ROI-level fMRI
response at different lags (1-7 seconds) from the stimulus. Thus, the output is a
360-D vector corresponding to the mean fMRI response in each ROI of the HCP
MMP parcellation. The feature extractors were identical to those in the proposed
voxel-wise auditory and visual models. However, instead of a convolutional response
model, here, the response model comprised two fully connected layers with output
dimensions of 512 and 360 with an exponential linear unit and linear activation
respectively. All models were trained for 20 epochs with a batch size of 4 and
a learning rate of 1e-4. Validation curves were monitored to ensure convergence.
Prediction accuracy of each model was computed as the mean Pearson correlation
Figure B.2: ROI-based encoding performance for estimating delay. (A) depicts the
estimated mean and standard error of the prediction accuracy (R) across various
delays (1-7s) within the early auditory and association auditory group (blue) as
well as across all ROIs (red), as obtained using the single epoch (1s) auditory model.
(B) depicts the estimated mean and standard error of the prediction accuracy (R)
for various delays (1-7s) within the primary and dorsal visual streams (blue) as well
as across all ROIs (red), as obtained using the single frame visual model. Shaded
regions depict the standard error in estimating mean across ROIs within each group.
ROI categorization is described in the sub-section on ROI selection.
273
coefficient between the predicted and measured response across all ROIs, in the
held-out movie dataset. Based on Figure B.2, we estimated a response delay of
4 seconds, as this lag yielded the maximum prediction accuracy across all ROIs
for both auditory and visual ROI-level models. Further, even while restricting
the prediction accuracy (R) to ROIs within different cortical areas (such as the
early/association auditory areas or the dorsal/ventral visual stream), the optimal
lag was consistently 4 seconds, suggesting that the difference in performance of 1-sec
and 20-sec models in these regions (Figure 4.4) is not largely driven by differences
in the hemodynamic response function (HRF).
B.4 Defining the stimulus-driven or “synchronous” cortex
We isolated voxels involved in stimulus-driven processing, termed “synchronous”
or “stimulus-driven” voxels, by computing mean inter-group correlations over all
training movies. Inter-group correlations were computed by splitting the entire
group of subjects into two halves and computing correlations between the mean
response time-course of each half (comprising 79 subjects) at every voxel. We
employed a liberal threshold of 0.15 for this correlation value. Thus, the mask
of “stimulus-driven” voxels included those voxels that achieved an inter-group
correlation of 0.15 or above. We computed mean quantitative metrics over this
mask in Figure 4.3E (main text) to compare different models.
274
B.5 Model architectures and implementation
The base feature extraction networks and convolutional response model in Figure 4.1
had the architecture as detailed in Figure B.3. The feature extraction networks are
reminiscent of the feature pyramid network, which has shown significant improve-
ments as a generic feature extractor across various applications. These networks
comprise a parallel top-down pathway with lateral connections which grants them
the ability to characterize both “what” and “where” in cluttered scenes, thereby
enhancing object detection. We note that similar models with top-down and skip
connections have been popular in vision research, since they can enrich low-level
features with high-level semantics. The output of the feature extractor is fed into
the convolutional response model to predict the evoked fMRI activation. This
enables us to train both components of the network simultaneously in an end-to-end
manner. Since the output response is differentiable with respect to network weights,
the weights are adjusted via a first-order gradient-based optimization method to
minimize the mean squared error between the predicted and target activation values
across the entire brain.
For ResNet-50, we use activations of the last residual block of each stage, namely,
res2, res3, res4 and res5 to construct our stimulus descriptions s. From the VGG-
ish network, we use the activations of each convolutional block, namely, conv2,
conv3, conv4 and the penultimate dense layer fc2 (Pre-trained tensorflow/keras
models for the visual and auditory backbone were available at https://keras.
io/applications and https://github.com/tensorflow/models/tree/master/
research/audioset/vggish respectively). The first three sets of activations are
refined through a top-down path to enhance their semantic content, while the last
activation is concatenated into s directly (res4 activations are vectorized using
275
global average pool). The top-down path comprises three feature maps at different
resolutions with an up-sampling factor of 2 successively from the deepest layer
of the bottom-up path. Each such feature map comprising 256/128 channels (in
visual/auditory models respectively) is merged with the corresponding feature
map in the bottom-up path (reduced to 256/128 channels by 1x1 convolutions) by
element-wise addition. Subsequently, the feature map at each resolution is collapsed
into a 256/128-dimensional feature vector through a global average pool operation
and concatenated into s, leading to a 1024-D and 512-D feature representation
for the visual and auditory stimuli respectively. The aggregated features are then
passed onto a CNN comprising the following feedforward computations: a fully
connected layer to map the features into a vector space which is reshaped into a
1024-channel cuboid of size 6x7x6 followed by four 3x3x3 transposed convolutions
(conv.T) with a stride of 2 and exponential linear unit activation function to up-
sample the latter. Each convolution reduces the channel count by half with the
exception of the last convolution which outputs the single-channel predicted fMRI
response.
The 20-second models additionally comprised an LSTM layer to model the
temporal propagation of features across the contiguous sequence of input frames
and/or spectrograms. The LSTM module has driven success across varied sequence
modeling tasks due to its ability to efficiently regulate the flow of information across
cells through gating. The memory cell in LSTM is modulated by three gates, namely,
the input, forget and output gates. We note that the LSTM layer did not change
the dimensionality of the input features so that equitable comparisons can be made
against 1-sec models. The Audiovisual-1sec model concatenated features obtained
from the base visual (1024-D) and audio (512-D) feature extraction networks,
reduced their combined dimensionality to the higher value among the two (1024-D)
276
by passing through a bottleneck dense layer followed by the same convolutional
response model. The Audiovisual-20sec model additionally incorporated modality-
specific LSTM networks prior to feature concatenation.
Implementation:
We note that all 6 models have roughly the same order of trainable parameters
in the range of 242M-362M. All parameters were optimized using Adam with a
learning rate of 1e-4. Auditory and visual models were trained for 50 epochs with
unit batch size. The stimulus as well as subject whose fMRI response is used as
the target in the loss (“mean squared error”) are randomly sampled over each step
of the training but kept consistent across models. We found this method to work
better than using the group-averaged response as target, presumably because this
sampling provides information about both the cross-subject mean and the variance
of response. Given the noise characteristics at each voxel, we hypothesize that this
enables the model to focus on regions that can be well predicted with the given
stimulus. Validation curves were monitored for all models to ensure convergence.
B.6 Regularized linear regression: deep convolutional fea-
tures
We also trained group-level encoding models using a linear response model since
this constitutes the dominant state-of-the-art approach to neural encoding [245,
246, 248]. To enable a fair comparison against the proposed 1-sec uni-modal
models, we extract hierarchical features from the same layers of the ResNet-50 and
VGG-ish architectures as employed by the proposed models. The only difference
here is the lack of a top-down pathway (since it is not a part of the pre-trained
277
Figure B.3: Implementation details for the audio (top left) and visual (top right)
feature extraction networks as well as the convolutional response model (bottom).
All layers and blocks outside the yellow rectangle (bottom-up pathway) are trained
from scratch. The blocks inside the yellow rectangular window are initialized
with networks pre-trained on image or sound recognition. Further, ResNet-50 is
frozen during the training of all encoding models, whereas VGG is fine-tuned. The
sequence of operations within each block are defined from top to bottom, while the
number of repetitions for each sequence within the block are indicated with the
multiplicative symbol on the right.
278
network but is trained with random initialization on the neural response prediction
task), which prevents the refinement of coarse feature maps before aggregation.
Pooling the outputs of different layers channel-wise using the global average pooling
operation (namely {v1, v2, v3, v4} for the visual model and {a1, a2, a3, a4} for the
audio model in Figure B.3) leaves us with and 1024 and 3840 features to present
to the auditory and visual models, respectively. Further, to compare against
the longer-duration 20-sec models, we adopted two approaches: (1) we simply
concatenated the stimulus features extracted for each second (as described above)
over T-second windows with T ranging from 1 to 20 seconds and presented these
aggregated features to the linear response model; alternatively, (2) we reduced
the dimensionality of the aggregated features to a fixed length (set to 128) as
in (1) using principal component analysis run on the training data. We added
this comparison to rule out the fact that the temporal trend in performance of
linear models is simply driven by a higher-dimensional feature space. We note that
even after dimensionality reduction, the components retained at least 80% of the
explained variance in all cases. Audio-visual encodings with linear response models
were obtained similarly by simply fusing the respective audio and visual hierarchical
features through concatenation before linear regression. We apply l2 regularization
on the regression coefficients and adjust the optimal strength of this penalty through
cross-validation on the training data using log-spaced values in {1e−14, 1e14} for
each model. We report performance of the best models in Figure B.4(A). Note
that unlike the WordNet models, we found that optimizing a single regularization
penalty α common across all voxels outperformed independent voxel-wise fitting
with bootstrap in this case. Thus, we only present the results for the former.
We note here that the convolutional response model in our proposed approach
(instead of a fully-connected approach) allowed us to keep the learnable parameters
279
manageable, facilitating joint optimization/fine-tuning of the feature extractor and
response models. The consistently superior performance of the proposed models
against linear regression-based approaches strongly suggests that there is merit in
end-to-end learning for encoding responses to dynamic, multi-sensory stimuli.
B.7 Regularized linear regression: WordNet features
Another popular approach in voxel-wise forward encoding beyond primary sensory
cortices is the semantic category encoding model that is based on high-level semantic
features [279]. This approach relies on labels that indicate the presence of semantic
object and action categories in each movie frame. In this analysis, we employed
WordNet labels that were provided as part of the HCP movie-watching data pipeline.
The semantic labels were manually assigned by the Gallant lab team using the
WordNet semantic taxonomy and subsequently converted to WordNet synsets to
build an 859-D semantic representational space (corresponding to 859 WordNet
synset names). Following [279], we fitted l2 regularized linear regression models
(known as ridge regression) to find weights corresponding to different input features
for every voxel. The regularization parameter, α was optimized independently
for each voxel by testing among 10 log-space values in [1, 1000]. The optimal
alpha is obtained by averaging across 15 bootstrapped held-out sets. In addition to
fitting models with WordNet features extracted 4s prior to the measured neural
response, we developed longer timescale linear models by concatenating the WordNet
features extracted for each second (as described above) over T-second windows
with T ranging from 1 to 20 seconds and presented these aggregated features to
the bootstrapped regularized regression model. Figure B.4 (C) demonstrates the
performance of WordNet models across different groups of regions as a function of
280
T, and (C) depicts the voxel-level prediction accuracy (R) of the best performing
WordNet model that stacks features from 4-12s (at an interval of 1s) prior to the
encoded cortical response. While simple and interpretable, the WordNet models
clearly under-perform in terms of prediction accuracy (R) in comparison to the
models proposed in the present study.
Figure B.4: Performance of linear response models and baselines. (A) shows
the region-averaged prediction accuracy of linear response models using deep
convolutional features. (B) shows results of the ablation study and highlights the
importance of different components of the proposed model architecture. (C) shows
the region-averaged prediction accuracy of linear response models using semantically
rich WordNet features and (D) shows the cortical map of the prediction accuracy
(R) for the best WordNet model. The x-axis in (A) and (C) depicts the length of
the windows (in seconds) over which the stimulus features are concatenated and
y-axis shows the mean Pearson correlation coefficient between the predicted and
measured responses across the stimulus-driven voxels.
281
B.8 Ablation study
To determine the influence of different architectural components on prediction
performance of the proposed models, we performed an ablation study to investigate
the individual contributions of (i) non-linearities in the response model, (ii) hierar-
chical (multi-scale) feature maps, (iii) fine-tuning audio sub-network (VGG) and
(iv) LSTM. We selectively removed each of the components from the respective
model and compared the resulting performance against the proposed 1-sec and
20-sec models that employ all (i)-(iii) and (i)-(iv) components respectively. We note
that the model without LSTM (iv) uses concatenated features instead of employing
recurrence. Due to computational constraints, we could not train a model that
feeds 20-sec concatenated features directly to the convolutional response model
since this raises the number of parameters substantially. Instead, we map the
concatenated feature input to a 1024-D and 512-D feature space for visual and
audio models respectively using a fully connected layer. We note that this also
ensures a more equitable comparison against the proposed 20-sec models that use
LSTMs by enforcing that the representations fed into the response models in both
cases are of the same dimension. We follow the same protocol for training these
models as used for training the proposed models. There are several interesting
observations to make from this ablation analysis (Figure B.4B). (i) First, we find
that encoding models with a frozen VGG network that is not updated during
training incur a loss in performance compared to the proposed model where VGG
layers are trainable during neural response prediction. This clearly demonstrates
the advantages of altering these pre-trained models and suggests that fine-tuning
is both feasible and beneficial in improving neural response prediction. (ii) Next,
we find that prediction performance deteriorates after removing the non-linearities
in both the Audio-1sec and Visual-1sec models. In the context of the Visual-1sec
282
model with a frozen pre-trained backbone (ResNet-50) and coupled with (i), this
observation further highlights that it is possible to develop models of human sensory
processing that are quantitatively more precise in matching brain activity than
task-driven neural networks. (iii) Next, we assessed the benefit of using hierarchical
feature maps over selecting the single best-performing layer for each model (audio
or visual) based on cross-validation. For both audio and visual models, we find
that features from the last layer (i.e., a4 and v4, respectively) yield the highest
mean prediction accuracy (R) across the synchronous cortex. However, although
the convolutional response model architecture is common across these encoding
models, it is important to note that this analysis is still plagued by confounds
such as the different dimensionality of feature spaces across different layers that
feed into the response model. The best performing single-layer encoding model,
however, still performs worse than the hierarchical approach. (iv) Finally, while
the encoding models with concatenated features outperform the 1-sec models, the
performance still falls short against the accuracy obtained by the proposed 20-sec
models employing LSTM. We believe this noticeable difference arises from the
ability of LSTMs to efficiently capture long-term dependencies and reconcile the
recent input history (‘memory’) with the immediate context (current frame).
B.9 Computing significance estimates
The statistical significance of individual voxel predictions (Figure 4.3) was computed
as the p-value of the obtained sample correlation coefficient for the null hypothesis
of uncorrelation (i.e., true correlation coefficient is zero) under the assumptions
of a bivariate normal distribution. We employed the false-discovery procedure
of Benjamini & Hochberg (1995) [300] to control for multiple comparisons under
283
assumptions of dependence. For statistical comparison of model performance within
each group of regions in Figure 4.2 (main text), we performed the paired t-test
on ROI-level average performance metrics and corrected for multiple comparisons
among models (Bonferroni).
B.10 Sensory-sensitivity index
Distorting the input to the audio-visual model at test time allows us to interrogate
the sensory-sensitivity of different brain regions. We developed a sensory-sensitivity
index of each ROI based upon predictive performance of the model with distorted
inputs, as shown in Figure 4.5. Let SVr and SAr denote the mean prediction
accuracy of the model in region r after shuffling (temporally) the input order of
the visual and auditory stimuli, respectively. The sensory-sensitivity index for
region r is then defined as s = SAr−SVrr + . Note that positive values of this indexSAr SVr
indicate that region r incurs a greater loss in predictivity upon distortion of visual
information than auditory information, suggesting a higher visual sensitivity for
this voxel. Similarly, negative values signal towards a higher auditory sensitivity.
B.11 Stimuli for synthetic contrasts
Synthetic contrasts were generated to study the generalization of our models to new
experimental paradigms (Figure 4.6). We focus on predicting task-based contrasts
for three semantic categories, namely, faces, places and speech, since these are the
most well-studied categories in the context of their distinct functional signatures.
The stimuli for visual contrasts were derived from the HCP Working Memory
284
paradigm, which combines category specific representation tasks (including faces
and places) and working memory tasks. After excluding grayscale images, we were
left with 102, 77, 97 and 103 images for the categories of faces, places, body parts and
tools, respectively. Since these are static image without any dynamic content, we
employed the Visual-1sec model to derive the visual contrasts (Figure 4.6(C),(D)).
Stimuli for the speech and non-speech contrast were extracted from large
popular datasets for these categories. Speech stimuli were extracted from a human
speech-utterance dataset comprising short audio clips of interviews recorded on
YouTube [301]. Non-speech stimuli were extracted from another large dataset
comprising short clips of environmental sounds [302]. We randomly extracted ∼ 100
minutes of audio waveforms from these datasets for both categories. The stimuli
were processed for mel-spectrogram extraction in the same manner as the HCP
audio-visual movies. Since the non-speech stimuli only comprised contiguous clips
of roughly 3− 5 second duration, we employed the Audio-1sec model to obtain
the speech contrast (Figure 4.6(B)).
B.12 Perturbation analysis with 20-sec models
To address the influence of temporal continuity and short-term memory (past
inputs) on the predictions of 20-sec models, we conducted a perturbation analysis
by distorting the input context seen by these models at inference time using two
shuffling experiments:
• Shuffled (different segment): In this experiment, we keep the last frame of
every 20-second input segment and replace the preceding 19 frames with
contiguous frames of a randomly selected 19-sec input clip within the test
285
movie. This input perturbation thus largely maintains a temporal continuity
and highlights the influence of past inputs or short-term memory on response
predictions.
• Shuffled (same segment): Under this experimental set-up, we randomly shuffle
the first 19 frames of the same input clip at inference time while keeping the
last frame the same. This obliterates the temporal continuity of the input
clip without changing the overall content that is fed into the encoding model.
We repeated both shuffling experiments 10 times and report the average performance
of each model under these two perturbation methods across different ROIs in
Figure B.5. As can be seen from the figure, both input perturbations cause a drop
in model performance, albeit to different degrees. Interestingly, the Audio-20sec
model seems to rely on the temporal continuity of the input more heavily than
the Visual-20sec model, as evidenced by the much sharper drop in performance for
the former model under same segment re-shuffling. The consistent deterioration
of model performance under these control experiments is thus another indication
that the 20-sec models exploit recent input history (‘memory’) while computing
response predictions.
B.13 Performance improvement and autocorrelation decay
In the past, processing timescales in the brain have been probed using several
different means [287]. In one of the proposed approaches, the decay time of
temporal autocorrelation is used as a proxy measure to understand the variation
in processing timescales across different brain regions. With this approach, it
was shown that decay times increased progressively along the temporal hierarchy.
286
Figure B.5: Perturbation analysis with Audio-20sec (A) and Visual-20sec (B)
models. ROI box plots depict the un-normalized correlation coefficients between
the predicted and measured response of voxels in each ROI using original or distorted
20-sec input clips at inference time.
Following this line of work, we estimated the autocorrelation decay time constant
(π) for each voxel by fitting an exponential, A exp{−t/π}, to the autocorrelation
function (autocorrelation computed at different lags). The exponential model was
first independently fit for each movie run and each voxel and the estimated π
were subsequently averaged across runs to obtain one decay time constant per
voxel. Here, we were primarily interested in understanding whether there is any
relationship between the performance improvement of the 20-sec model over 1-
sec model, ∆R, computed as the difference between the prediction accuracies
287
Figure B.6: Performance boost of the 20-sec model over 1-sec model is higher in
voxels with longer autocorrelation decay times. (A) & (B) depict the performance
improvement (∆R) against decay time constants for voxels associated with auditory
and visual regions, respectively (Table B.2). The r value indicates the Pearson
correlation coefficient between the two quantities. Each dot in the scatterplot
represents an individual voxel. Bivariate kernel density estimates are overlaid
on top of the scatterplot as contours to depict the probability distribution of
observations.
of the Audiovisual-20sec and Audiovisual-1sec at every voxel, and the temporal
autocorrelation properties of that voxel. We hypothesized that in voxels with
longer processing timescales, the autocorrelation would persist for longer durations
(resulting in larger π) and the longer timescale model (20-sec) would yield more
substantive improvement over the 1-sec model. As shown in Figure B.6, we
observed a significantly positive correlation between performance improvement
and the autocorrelation decay time constant (r = 0.49 and 0.50 across voxels in
auditory and visual regions as defined in Table B.2), in line with our hypothesis.
This suggests that the benefit of employing the 20-sec model, as quantified in terms
of performance improvement, is indeed more remarkable in regions with longer
processing timescales.
288
B.14 Surface visualization
All input fMRI data, as well as response predictions in this study are volume-based.
In order to be consistent with prior research on encoding models that employ
surface visualizations, we created surface versions of volumetric predictability and
synthetic contrast maps, as shown in Figures 4.3, 4.5 and 4.6. We employed the 3D
trilinear mapping method from connectome workbench that computes the result
on each vertex based on linear interpolation from voxels on each side of the vertex
(https://www.humanconnectome.org/software/workbench-command). However,
since volume to surface mappings are an approximation, we only employ this
conversion for visualizations. All reported metrics are computed on volumes only
on a per-voxel basis.
B.15 Qualitative analysis
To gain qualitative insights into the predictions of the most accurate model
(Audiovisual-20sec) on the held-out movie, we plot the predicted as well as mea-
sured response time-series of the voxel with ‘median’ prediction accuracy (R) in
the best performing ROI of each group (Figure B.7). The latter corresponds to A4,
V3CD, STSdp, IFSp and Area 45 for the auditory, visual, multi-sensory, frontal
and language groups respectively.
289
Figure B.7: Predicted and measured response time-series of the ‘median’ predictive
accuracy (R) voxel across ROIs of different functional groups. Vertical dashed lines
mark the boundary of clip segments in the held-out movie.
B.16 Group-level prediction accuracy: held-out set
To test the generalizability of the models, we further compared model predictions
against the group-averaged response of a held-out group within the HCP dataset
comprising 20 novel subjects distinct from the 158 individuals used in the training
set, on the same independent held-out movie.
Noise ceiling estimation: For the held-out group, we obtain the noise ceiling by
considering variability across subjects. Here, the noise ceiling was computed as the
290
correlation coefficient between the mean measured response for the independent test
movie across all 158 subjects in the training set and the group-averaged response
computed over the 20 new subjects. This metric captures the response component
shared across independent groups of subjects and thus reflects the upper bound
achievable by a group-level encoding model. We employ this noise ceiling for
comparison against the prediction accuracy of the model on the held-out group of
subjects (Figure B.8).
The models accurately predicted cortical responses evoked by the independent
test movie as measured in the independent subject population (Figure B.8, B.9),
with the best performing model (Audiovisual-20sec) even achieving close to perfect
predictivity relative to the “noise ceiling” in certain multi-sensory sites such as the
posterior STS (Figure B.8(A), (G)). Here, the noise ceiling was computed as the
correlation coefficient between the mean neural response in the independent test
movie, across all 158 subjects in the training set and the group-averaged response
computed over the 20 new subjects. This metric captures the response component
shared across independent subject populations and thus reflects the upper bound
achievable by a group-level encoding model. These results clearly indicate that
inclusion of temporal history and multi-sensory information pushes the prediction
accuracies closer to their upper bound, as also evidenced by a higher slope of the
linear model fit on their corresponding data points. Further, voxels that truly
approach the noise ceiling are predominantly associated with the auditory group of
regions as broadly characterized within the HCP MMP parcellation. Interestingly,
we find that this regional distribution of predictivity against noise ceiling holds
even for subject-specific responses and not just the group-averaged responses, as
described in the next section and shown in Figure B.10.
291
Figure B.8: Model performance on held-out group of subjects. (A) Pearson
correlation coefficient (R) between the model predictions and group-averaged
response of an independent subject group comprising 20 subjects, on the held-out
test movie, normalized by the voxel-specific noise ceiling. (B) Predictivity against
the noise ceiling for all voxels with high “synchrony” across training movies (>0.5)
(see Supplementary Information for details). This gives a total of 52,954 highly
“synchronous” voxels that are colored based on their association with auditory and
visual groups. This hue assignment of each voxel was derived from the coloration
of the corresponding ROI in the multi-modal HCP parcellation. Each dot in the
scatterplot represents an individual voxel. Bivariate kernel density estimates are
overlaid on top of the scatterplot as contours to depict the probability distribution
of observations (prediction accuracy/noise ceiling pair at every voxel).
292
Figure B.9: Quantitative evaluation metrics for all the proposed models on the
independent held-out population comprising 20 novel subjects. (A),(C)-(F) depict
prediction accuracy (R) for all the proposed models across major groups of regions
as identified in the HCP MMP parcellation (B). Predictive accuracy of all models
is summarized across (A) auditory, (C) visual, (D) multi-sensory, (E) language and
(F) frontal areas. Box plots depict quartiles and swarmplots depict mean prediction
accuracy of every ROI in the group. For language areas (Group 4), left and right
hemisphere ROIs are shown as separate points in the swarmplot because of marked
differences in the prediction accuracy. Statistical significance tests (results indicated
with horizontal bars) are performed to compare 1-sec and 20-sec models of the same
modality (3 comparisons) or uni-modal against multi-modal models of the same
duration (4 comparisons) using paired t-test (p-value < 0.05, Bonferroni-corrected)
on mean prediction accuracy within ROIs of each group.
B.17 Subject-level prediction accuracy: held-out set
For each participant in our independent subject group (N = 20), we computed the
correlation coefficient (R) between the predictions of the best performing model
(Audiovisual-20sec) and the subject-specific fMRI response corresponding to the
independent movie. We further contrast this cortical map of prediction performance
against another map computed as the voxel-wise correlation coefficient between the
293
Figure B.10: Comparison of voxel-level prediction accuracies (R) against subject-
specific noise ceiling for 5 representative subjects from the held-out set. The subjects
were chosen such that their mean prediction accuracy (un-normalized) within the
stimulus-driven cortex lied in the ith percentile with i ∈ {0.01, 25, 50, 75, 99.9}.
Surface maps with white background in (A)-(E) depict raw correlation coefficients
between model (Audiovisual-20sec) predictions and subject-specific response on the
held-out movie whereas maps on gray background indicate the respective subject-
specific noise ceiling. Only significantly correlated voxels (p<0.05, FDR corrected)
are colored on the surface.
mean neural response across all 158 training subjects and the respective subject-
specific response on the independent movie. The latter places an upper bound on the
predictivity of each voxel as achievable by any group-level model. Here, we present
the results for 5 subjects with mean prediction accuracy (un-normalized) within the
stimulus-driven cortex in the ith percentile with i ∈ {0.01, 25, 50, 75, 99.9}. The
results (Figure B.10) suggest that the model can successfully capture the response
component that individual subjects share with the population.
294
B.18 Correcting with inter-group synchrony
Since the present study focuses on population-wide predictive models, another
upper bound on performance estimates that naturally comes to mind is one based
on inter-subject or inter-group synchrony in cortical activity on the independent test
movie. We computed split-half correlations between the mean response time-course
of each group on the test movie. To compare the prediction accuracy against
ISC, we divided the prediction accuracy of the best predictive model, i.e., the
Audiovisual-20sec model by this synchrony-based noise-ceiling to get the synchrony-
normalized prediction accuracy, shown in Figure B.11. A stronger shift towards
values approaching unity indicates that the model is able to capture stimulus-driven
activity highly accurately across large regions of the cortex.
Figure B.11: Synchrony-normalized prediction accuracy (R) of the Audiovisual-
20sec model
B.19 Influence of motion
fMRI measurements are prone to various sources of noise, including spurious
head motion and physiological artifacts, which may vary in systematic ways with
the variables of interest in any study. While the fMRI data was pre-processed
with motion correction, the effects of motion cannot be fully eliminated and
need to be further accounted for. Motion confounds have been reported in prior
295
studies that use neuroimaging data as a “predictor” for different behavioral states
or as clinical biomarkers. In our study, the inputs are natural images and the
“predicted” variable (fMRI response) is the one prone to motion artifacts. In this
study, we developed group-level predictive models of whole-brain cortical activity.
One could expect to see the influence of motion in predictions if there was a
systematic correlation between motion signals across subjects (so that the signal
could persist post averaging), which would suggest that average subject motion
tracks the stimulus characteristics. To address this issue, we examined the Pearson
correlation coefficients between the predicted/measured response of each voxel and
the framewise displacement across the independent test movie clips. The framewise
displacement was computed as described in Power et al., [303] from the averaged
motion estimates across subjects on the independent test movie.
∑ ∑
FD(t) = |d(t− 1)− d(t)|+ 50 π180 |r(t− 1)− r(t)| (B.1)
where d denotes translation distances {x, y, z} and r denotes rotation angles
{pitch, yaw, roll}. As shown in Figure B.12, the correlation coefficients are centered
around zero with a very small standard deviation (∼0.05). Importantly, upon
computing the p-value of the obtained sample correlation coefficients for the
null hypothesis of uncorrelation (under the assumptions of a bivariate normal
distribution), we observed that none of these correlations were significant for
the predicted responses and only very few voxels (shown on the cortical surface
below) were found to exhibit statistically significant correlations between measured
responses and FD (p¡0.05, FDR corrected).
296
Figure B.12: Addressing the influence of motion on measured and predicted re-
sponses. (A) and (B) depict the distribution of the Pearson correlation coefficient
of FD with the predicted responses of the Audiovisual-20sec model and measured
responses across the whole brain respectively. Surface maps in (C) depict the raw
correlation coefficients between FD and the measured responses. Only statistically
significant voxels (p< 0.05, FDR corrected) are colored on the surface.
297
APPENDIX C
SUPPLEMENTARY INFORMATION AND ADDITIONAL
RESULTS FOR SECTION 4.3
C.1 Model comparison across randomly selected layers
Here, we wanted to examine if the learned attention model would lead to performance
improvements in neural response prediction across other deep layers as well. We
trained all 8 models using stimuli representations Frep from 2 randomly selected
layers in the res5 block of the pre-trained ResNet-50 architecture, namely ‘add 14’
and ‘res5c branch2b’1, henceforth denoted as ‘Random ResNet-50 layer 1’ and
‘Random ResNet-50 layer 2’ respectively. Figure C.1 shows the prediction accuracy
across the synchronous cortex on the held-out movie for all models. We again
observe that the learned attention model performs favorably against models with
no attention, no pooling or center-weighted attention. Further, the gaze-weighted
attention method outperforms all other methods employing the same response
model (linear or convolutional), consistent with our previous findings.
C.2 Representational similarity analysis
Representational similarity analysis (RSA) is a popular framework to compare
representations of a computational model against cortical representations [356,
359]. It can be used to directly measure a computational model’s ability to
explain the representational geometry in neuronal responses. Here, we wanted to
assess the impact of attention modulation on a computational model’s alignment
1Notation from pre-trained ResNet-50 model: https://keras.io/api/applications/resnet/
298
to brain responses for a wider range of model layers and architectures. Given
stimuli from the held-out movie (699 frames) and the corresponding response (after
hemodynamic lag), we implemented the following procedure for time-continuous
RSA: (i) We computed Pearson’s correlation distance (1-R) between the response
vectors for every pair of test frames to obtain the representational dissimilarity
matrix (RDM) of neural responses. The dissimilarity matrices are averaged across
subjects to yield a population-averaged ‘neural’ RDM. The region of interest
(ROI) mask for extracting response vectors to estimate neural RDMs was derived
from all voxels in intermediate (V4), ventral visual stream and lateral occipital
ROIs. Responses of all voxels were normalized using z-scores before computing the
dissimilarity matrix. (ii) We extracted model representations from intermediate
layers of 3 pre-trained (ImageNet) architectures, namely ResNet-50 (res2, res3,
res4, res5), VGG-16 (maxpool1, maxpool2, maxpool3, maxpool4, maxpool5) and
AlexNet (conv1, conv2, conv3, conv4, conv5). For each of these representations,
we further computed attention modulated representations using attention maps
computed with each saliency prediction method as described above. For the Itti-
Koch model, we used normalized saliency as the attention map. For all remaining
saliency models, we used probabilistic density predictions as attention maps. All
attention maps were resized to the spatial dimensions of the respective layer for
this computation. Representational vectors were compared pair-wise in terms of
their Pearson correlation distance (1-R) to obtain the ‘model’ RDM. (iii) Finally,
we compared the compatibility of the neural and model RDMs by using a rank
correlation measure (Kendall’s τA).
As shown in Figure C.2, prioritized selection of stimulus features based on
saliency significantly improves the correlation of model RDMs with neural RDMs.
This trend holds for most models and layers, suggesting that the benefits of
299
Figure C.1: Quantitative evaluation. Mean correlation values across the syn-
chronous, (i.e., stimulus-driven) cortex defined at a range of synchrony thresholds
([0.15,0.75]). Each point thus reflects the mean prediction accuracy for a model
across all voxels within synchronous cortex defined by a threshold value (x-axis).
attentional masking are not restricted to forward encoding models alone, but may
be more universal. Further, we find that models that better explain stimulus-
dependent human fixation patterns (such as Deepgaze-II or the learned attention
model) are able to better account for the representational geometry of neural
responses across higher visual object processing areas.
C.3 Regions of interest (ROI)
We employed the HCP MMP parcellation for all ROI-level analysis. Dorsal and
ventral visual stream ROIs as well as MT+ ROIs in Figure 3 (main text) were
derived from the explicit stream segregation and categorization described in the
HCP MMP parcellation [264] and are defined here in Table C.1 for quick reference.
300
Figure C.2: Representational similarity analysis(RSA). y-axis measures the
agreement between ‘model’ RDMs and ‘neural’ RDMs based on their rank correlation
measure. x-axis is use to index the layer (index 1 refers to the earliest layer of the
architecture) and the saliency method used for attention masking of the features
before pooling.
Table C.1: ROI categorization
Group ROIs
Dorsal V3A, V3B, V6, V6A, V7, IPS1
Ventral V8, VVC, PIT, FFC, VMV1-3
MT+ MT, MST, V4t, FST
Lateral occipital LO1, LO2, LO3
C.4 Center-weighted attention
Figure C.3 depicts the center-weighted saliency map used in all center-weighted
attention models. We also report per-movie eye tracking statistics therein from all
frames used for training or testing the models. We note that not all subjects had
eye tracking measurements for every frame in the movies. Figure C.3B shows the
number of subjects for which eyetracking data was available per movie (distribution
across frames). This suggests that despite the missing data, most frames among all
training and testing movies (MOVIE 4) had recorded gaze coordinate measurements
301
Figure C.3: A. Center-weighted saliency map and B. Eye tracking statistics
from ∼110-130 subjects.
C.5 Voxel-wise prediction accuracy (R) of linear models
Figure C.4 depicts the prediction accuracy across the cortical surface for all methods
employing linear response models that were considered in this study. As can be
seen clearly, just as in methods with CNN response models, gaze-weighted attention
significantly improves prediction accuracy across most higher order visual areas
over models with no attention or center-weighted attention.
C.6 Estimating hemodynamic (BOLD) response delay
fMRI BOLD response delay was estimated using the baseline ‘No attention (Linear)’
encoding model due to its computational efficiency in comparison to encoding
models employing convolutional response models. The input to these models was
the 2048 dimensional (average pooled) representation of the stimuli, and the output
was the evoked fMRI response across the synchronous cortex (i.e., voxels with
synchrony¿0.15) at different lags (1-7 seconds) from the stimulus. Thus, the output
302
Figure C.4: Prediction accuracy across the cortical surface for all meth-
ods using linear response models. Statistical significance of individual voxel
predictions is computed as the p-value of the obtained sample correlation coeffi-
cient for the null hypothesis of uncorrelatedness (i.e., true correlation coefficient is
zero) under the assumptions of a bivariate normal distribution. Only significantly
predicted voxels (p<0.05, FDR corrected) for each method are colored on the
surface.
is a 160900-D vector corresponding to the fMRI response. All models were trained
with 5-fold cross-validation using the stimulus-response pairs from the training
dataset only.
Based on Figure C.5, we estimated a response delay of 4 seconds, as this lag
consistently yielded the maximum prediction accuracy across 5-fold cross validation.
Thus, all encoding models described in the main text were trained to predict fMRI
response after 4 seconds of stimulus presentation.
303
Figure C.5: Hemodynamic response delay. 5-fold cross-validated prediction
accuracy (R) of the simple (‘No attention’) model on the training dataset. Error
margins are computed from the standard deviation of prediction accuracy across
the 5 folds.
304