Abstract
There is significant interest in the development and application of deep neural networks (DNNs) to neuroimaging data. A growing literature suggests that DNNs outperform their classical counterparts in a variety of neuroimaging applications, yet there are few direct comparisons of relative utility. Here, we compared the performance of three DNN architectures and a classical machine learning algorithm (kernel regression) in predicting individual phenotypes from whole-brain resting-state functional connectivity (RSFC) patterns. One of the DNNs was a generic fully-connected feedforward neural network, while the other two DNNs were recently published approaches specifically designed to exploit the structure of connectome data. By using a combined sample of almost 10,000 participants from the Human Connectome Project (HCP) and UK Biobank, we showed that the three DNNs do not outperform kernel regression across a wide range of behavioral and demographic measures. Furthermore, the generic feedforward neural network exhibited similar performance to the two state-of-the-art connectome-specific DNNs. We conclude with suggestions on future neuroimaging DNN research, including comparisons with stronger baseline algorithms, minimum sample sizes, transparency of hyperparameter tuning and code availability. Critically, we believe that deep learning remains a promising tool for analyzing neuroimaging data. However, researchers should carefully consider whether and how their applications might benefit from DNNs’ advantages over classical alternatives, rather than treat deep learning as a panacea.
Introduction
Deep neural networks (DNNs) have enjoyed tremendous success in machine learning (Lecun et al., 2015). As such, there has been significant interest in the application of DNNs to neuroscience research. DNNs have been applied to neuroscience in at least two main ways. First, deep learning models have been used to simulate actual brain mechanisms, such as in vision (Khaligh-Razavi and Kriegeskorte, 2014; Yamins et al., 2014; Eickenberg et al., 2017) and auditory perception (Kell et al., 2018). Second, DNNs have been applied as tools to analyze neuroscience data, including lesion and tumor segmentation (Pinto et al., 2016; Havaei et al., 2017; Kamnitsas et al., 2017b; G. Zhao et al., 2018), anatomical segmentation (Wachinger et al., 2018; X. Zhao et al., 2018), image modality/quality transfer (Bahrami et al., 2016; Nie et al., 2017; Blumberg et al., 2018), image registration (Yang et al., 2017; Dalca et al., 2018), as well as behavioral and disease prediction (Plis et al., 2014; van der Burgh et al., 2017; Vieira et al., 2017; Nguyen et al., 2018).
Deep neural networks can perform well in certain scenarios where large quantities of data are unavailable, for example, winning multiple MICCAI predictive modeling challenges (Choi et al., 2016; Kamnitsas et al., 2017a; Hongwei Li et al., 2018). Yet, the conventional wisdom is that DNNs perform especially well when applied to well-powered samples, for instance, the 14 million images in ImageNet (Russakovsky et al., 2015) and Google 1 Billion Word Corpus (Chelba et al., 2014). However, in many neuroimaging applications, the available data often only involve hundreds or thousands of participants, while the associated feature dimensions can be significantly larger, such as entries of connectivity matrices with upwards of 100,000 edges. Consequently, we hypothesize that in certain neuroimaging applications, DNNs might not be the optimal choice for a machine learning problem (Bzdok and Yeo, 2017). Here, we investigated whether DNNs can outperform classical machine learning for behavioral prediction using resting-state functional connectivity (RSFC).
RSFC measures the synchrony of resting-state functional magnetic resonance imaging (rs-fMRI) signals between brain regions (Biswal et al., 1995; Fox and Raichle, 2007; Buckner et al., 2013), while participants are lying at rest without any explicit task. RSFC has been widely used for exploring human brain organization and mental disorders (Smith et al., 2009; Assaf et al., 2010; Power et al., 2011; Yeo et al., 2011; Bertolero et al., 2015). For a given brain parcellation scheme (e.g., Shen et al., 2013; Gordon et al., 2016; Glasser et al., 2017; Eickhoff et al., 2018), the parcels can be used as regions of interest (ROIs), such that a whole brain (or cortical) RSFC matrix can be computed for each participant. Each entry of the RSFC matrix corresponds to the strength of functional connectivity between two brain regions. The entries of the RSFC matrices can then be used as features for predicting behavioral measures (e.g., fluid intelligence) in individual participants (Finn et al., 2015; Smith et al., 2015; Dubois and Adolphs, 2016; Rosenberg et al., 2016; Reinen et al., 2018).
In this work, we compared kernel regression with three DNN architectures in RSFC-based behavioral prediction. Kernel regression is a non-parametric classical machine learning algorithm (Murphy, 2012) that has previously been utilized in various neuroimaging prediction problems, including RSFC-based behavioral prediction (Raz et al., 2017; Zhu et al., 2017; Li et al., 2018; Kong et al., 2018). Our three DNN implementations included a generic, fully-connected feedforward neural network, and two state-of-the-art DNNs specifically developed for RSFC-based prediction (Kawahara et al., 2017; Parisot et al., 2017, 2018). An initial version of this study utilizing only the fluid intelligence measure in the HCP dataset has been previously presented at a workshop (He et al., 2018). By using RSFC data from nearly 10,000 participants and a broad range of behavioral (and demographic) measures from the HCP (Smith et al., 2013; Van Essen et al., 2013) and UK Biobank (Sudlow et al., 2015; Miller et al., 2016), this current extended study represents one of the largest empirical evaluations of DNNs’ utility in RSFC-based fingerprinting.
Methods
Datasets
Two datasets were considered: the Human Connectome Project (HCP) S1200 release (Van Essen et al., 2013) and the UK Biobank (Sudlow et al., 2015; Miller et al., 2016). Both datasets contained multiple types of neuroimaging data, including structural MRI, rs-fMRI, and multiple behavioral and demographic measures for each subject.
The HCP S1200 release comprised 1,206 healthy young adults (age 22-35). There were 1,094 subjects with both structural MRI and rs-fMRI. Both structural MRI and rs-fMRI were acquired on a customized Siemens 3T “Connectome Skyra” scanner at Washington University in St. Louis. The structural MRI was 0.7mm isotropic. The rs-fMRI was 2mm isotropic with TR of 0.72s and 1200 frames per run (14.4 minutes). Each subject had two sessions of rs-fMRI, and each session contained two rs-fMRI runs. A number of behavioral measures were also collected by the HCP. More details can be found elsewhere (Van Essen et al., 2012; Barch et al., 2013; Smith et al., 2013).
The UK Biobank is a prospective epidemiological study that recruited 500,000 adults (age 40-69) between 2006 and 2010 (Sudlow et al., 2015). 100,000 of these 500,000 participants will be brought back for multimodal imaging by 2022 (Miller et al., 2016). Here we considered an initial release of 10,065 subjects with both structural MRI and rs-fMRI data. Both structural MRI and rs-fMRI were acquired on harmonized Siemens 3T Skyra scanners at three UK Biobank imaging centres (Cheadle Manchester, Newcastle, and Reading). The structural MRI was 1.0mm isotropic. The rs-fMRI was 2.4mm isotropic with TR of 0.735s and 490 frames per run (6 minutes). Each subject had one rs-fMRI run. A number of behavioral measures were also collected by the UK Biobank. More details can be found elsewhere (Elliott and Peakman, 2008; Sudlow et al., 2015; Miller et al., 2016; Alfaro-Almagro et al., 2018).
Preprocessing and RSFC
We utilized ICA-FIX MSM-All grayordinate rs-fMRI data provided by the HCP S1200 release (HCP S1200 manual; Van Essen et al., 2012, 2013; Glasser et al., 2013; Smith et al., 2013; Griffanti et al., 2014; Salimi-Khorshidi et al., 2014). To eliminate residual motion and respiratory-related artifacts (Burgess et al., 2016), we performed further censoring and nuisance regression (Li et al., 2018; Kong et al., 2018). Runs with more than 50% censored frames were discarded. We considered 400 cortical (Schaefer et al., 2018) and 19 sub-cortical (Fischl et al., 2002) ROIs. The preprocessed rs-fMRI time courses were averaged across all grayordinate locations within each ROI. RSFC was then computed using Pearson’s correlation of the averaged time courses for each run of each subject (with the censored frames excluded for the computation). The RSFC was averaged across all runs, resulting in one 419 × 419 RSFC matrix for each subject.
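The RSFC computation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' pipeline: the function names, the toy array sizes and the random data are assumptions, and the sketch averages raw correlation values across runs because the text does not mention any further transformation.

```python
import numpy as np

def run_rsfc(timecourses, keep):
    """Pearson correlation between ROI time courses for one run.

    timecourses: (n_frames, n_rois) array of ROI-averaged signals.
    keep: boolean (n_frames,) array, False for censored frames.
    """
    # Censored frames are excluded before computing the correlation.
    return np.corrcoef(timecourses[keep], rowvar=False)

def subject_rsfc(runs, masks, max_censored=0.5):
    """Average run-level RSFC matrices, skipping heavily censored runs."""
    mats = [
        run_rsfc(tc, keep)
        for tc, keep in zip(runs, masks)
        if (1.0 - keep.mean()) <= max_censored  # drop runs with >50% censored frames
    ]
    if not mats:
        raise ValueError("no run survived censoring")
    return np.mean(mats, axis=0)

# Toy example (hypothetical sizes): 2 runs, 200 frames, 5 ROIs.
rng = np.random.default_rng(0)
runs = [rng.standard_normal((200, 5)) for _ in range(2)]
masks = [rng.random(200) > 0.1, rng.random(200) > 0.7]  # second run is mostly censored
fc = subject_rsfc(runs, masks)
```

In the toy example the second run loses roughly 70% of its frames, so only the first run contributes to the subject-level matrix.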
In the case of the UK Biobank, we utilized the 55 × 55 RSFC (Pearson’s correlation) matrices provided by the Biobank (Miller et al., 2016; Alfaro-Almagro et al., 2018). The 55 ROIs were obtained from a 100-component whole-brain spatial-ICA (Beckmann and Smith, 2004), of which 45 components were considered to be artifactual (Miller et al., 2016). The use of a different parcellation scheme in the UK Biobank (compared with the HCP dataset) helps to ensure that our results are not specific to a particular choice of ROIs.
FC-based prediction setup
We considered 58 behavioral measures across cognition, emotion and personality from the HCP (Table S1; Kong et al., 2018). By restricting the dataset to participants with at least one run (that survived censoring) and all 58 behavioral measures, we were left with 953 subjects. Of these, 23, 67, 62 and 801 subjects had 1, 2, 3 and 4 runs, respectively.
In the case of the UK Biobank, we considered four behavioral and demographic measures: age, sex, fluid intelligence and pairs matching (number of incorrect matches). By restricting the dataset to participants with 55 × 55 RSFC matrices and all four measures, we were left with 8868 subjects.
For both datasets, kernel regression and three DNNs were applied to predict the behavioral and demographic measures of individual subjects based on individuals’ RSFC matrices. More specifically, the RSFC data of each participant was summarized as an N × N matrix, where N is the number of brain ROIs. Each entry in the RSFC matrix represented the strength of functional connectivity between two ROIs. The entries of the RSFC matrix were then used as features to predict behavioral and demographic measures in individual participants.
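To make the feature construction concrete, the sketch below vectorizes an N × N RSFC matrix into its lower triangular entries. The function name is hypothetical, and excluding the diagonal is an assumption (the diagonal of a correlation matrix is constant and carries no information).

```python
import numpy as np

def vectorize_rsfc(fc):
    """Return the strictly lower-triangular entries of an N x N matrix."""
    rows, cols = np.tril_indices(fc.shape[0], k=-1)  # k=-1 skips the diagonal
    return fc[rows, cols]

# Number of features per subject under this assumption.
n_features_hcp = 419 * 418 // 2  # 419 x 419 HCP matrices
n_features_ukb = 55 * 54 // 2    # 55 x 55 UK Biobank matrices
```

Under this assumption each HCP subject has 87,571 features and each UK Biobank subject has 1,485 features.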
Kernel ridge regression
Kernel regression (Murphy, 2012) is a non-parametric classical machine learning algorithm. Let y be the behavioral measure (e.g., fluid intelligence) and c be the RSFC matrix of a test subject. Let y_i be the behavioral measure (e.g., fluid intelligence) and c_i be the RSFC matrix of the i-th training subject. Roughly speaking, kernel regression will predict the test subject’s behavioral measure to be the weighted average of the behavioral measures of all training subjects: $y \approx \sum_{i \in \text{training set}} \text{similarity}(c_i, c)\, y_i$, where similarity(c_i, c) is the similarity between the RSFC matrices of the test subject and the i-th training subject. Here, we simply set similarity(c_i, c) to be the Pearson’s correlation between the lower triangular entries of matrices c_i and c. In practice, an l2 regularization term is needed to avoid overfitting (i.e., kernel ridge regression). The level of l2 regularization is controlled by the hyperparameter λ. More details are found in Appendix A1.
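The idea can be sketched with the standard dual form of kernel ridge regression and a correlation kernel. The exact formulation used in this study is given in Appendix A1; the code below is only an illustrative assumption (standard closed-form solution, targets centered by the training mean), with hypothetical function names and toy data.

```python
import numpy as np

def correlation_kernel(a, b):
    """Pearson correlation between every row of a and every row of b."""
    a = (a - a.mean(axis=1, keepdims=True)) / a.std(axis=1, keepdims=True)
    b = (b - b.mean(axis=1, keepdims=True)) / b.std(axis=1, keepdims=True)
    return a @ b.T / a.shape[1]

def krr_fit_predict(x_train, y_train, x_test, lam):
    """Closed-form kernel ridge regression with l2 penalty lam."""
    k_train = correlation_kernel(x_train, x_train)
    k_test = correlation_kernel(x_test, x_train)
    y_mean = y_train.mean()  # assumption: center targets on the training mean
    alpha = np.linalg.solve(k_train + lam * np.eye(len(y_train)), y_train - y_mean)
    # Prediction is a similarity-weighted sum over training subjects.
    return k_test @ alpha + y_mean

# Toy data: 20 training and 5 test subjects with 50 features each.
rng = np.random.default_rng(0)
x_train = rng.standard_normal((20, 50))
y_train = rng.standard_normal(20)
x_test = rng.standard_normal((5, 50))
pred = krr_fit_predict(x_train, y_train, x_test, lam=1.0)
```

Because the solution is a single linear solve, a grid search over λ only requires repeating that solve for each candidate value.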
Fully-connected neural network (FNN)
Fully-connected neural networks (FNNs) belong to a generic class of feedforward neural networks (Lecun et al., 2015) illustrated in Figure 1. An FNN takes in vector data as an input and outputs a vector. An FNN consists of several fully connected layers. Each fully connected layer consists of multiple nodes. Data enters the FNN via the input layer nodes. Each node (except input layer nodes) is connected to all nodes in the previous layer. The value at each node is the weighted sum of node values from the previous layer. The weights are the trainable parameters in the FNN. The outputs of the hidden layer nodes typically go through a nonlinear activation function, e.g., Rectified Linear Units (ReLU; f(x) = max(0, x)), while the output layer tends to be linear. The value at each output layer node typically represents a predicted quantity. Thus, FNNs (and neural networks in general) allow the prediction of multiple quantities simultaneously. In this work, the inputs to the FNN are the vectorized RSFC (i.e., lower triangular entries of the RSFC matrices) and the outputs are the behavioral or demographic variables we seek to predict.
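The forward pass of such a network can be written compactly. The sketch below is illustrative only: the layer sizes are hypothetical, the bias terms are an assumption (the description above mentions only weighted sums), and training is omitted.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def fnn_forward(x, weights, biases):
    """x: (n_subjects, n_features); one weight matrix and bias vector per layer."""
    h = x
    for w, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ w + b)               # hidden layer: weighted sum, then ReLU
    return h @ weights[-1] + biases[-1]   # linear output layer

# Hypothetical sizes: 10 input features, 8 hidden nodes, 3 predicted measures.
rng = np.random.default_rng(0)
sizes = [10, 8, 3]
weights = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
out = fnn_forward(rng.standard_normal((4, 10)), weights, biases)
```

Each column of the output corresponds to one predicted quantity, which is how a single network can predict several measures at once.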
BrainNetCNN
One potential weakness of the FNN is that it does not exploit the (mathematical and neurobiological) structure of the RSFC matrix, e.g., the RSFC matrix is symmetric, positive definite and represents a network. On the other hand, BrainNetCNN (Kawahara et al., 2017) is a specially designed DNN for connectivity data, illustrated in Figure 2. BrainNetCNN allows the application of convolution to connectivity data, resulting in significantly fewer trainable parameters than the FNN, which should theoretically improve the ease of training and reduce overfitting. In this work, the input to the BrainNetCNN is the N × N RSFC matrix and the outputs are the behavioral or demographic variables we seek to predict.
The BrainNetCNN takes in any connectivity matrix directly as an input and outputs behavioral or demographic predictions. Kawahara et al. (2017) used this model for predicting age and neurodevelopmental outcomes from structural connectivity data. BrainNetCNN consists of four types of layers: Edge-to-Edge (E2E) layer, Edge-to-Node (E2N) layer, Node-to-Graph (N2G) layer and a final fully connected (linear) layer. The first three types of layers are specially designed layers introduced in the BrainNetCNN. The final fully connected layer is the same as that used in FNNs.
The Edge-to-Edge (E2E) layer is a convolution layer using cross-shaped filters (Figure 2). The cross-shaped filter is applied to each element of the input matrix. Thus, for each filter, the E2E layer takes in an N × N matrix and outputs an N × N matrix. The number of E2E layers is arbitrary and is a tunable hyperparameter. The outputs of the final E2E layer are inputs to the E2N layer. The E2N layer is similar to the E2E layer, except that the cross-shaped filter is applied to only the diagonal entries of the input matrix. Thus, for each filter, the E2N layer takes in an N × N matrix and outputs an N × 1 vector. There is one E2N layer for BrainNetCNN. The outputs of the E2N layer are the inputs to the Node-to-Graph (N2G) layer. The N2G layer is simply a fully connected hidden layer similar to an FNN’s hidden layer. Finally, the outputs of the N2G layer are linearly summed by the final fully connected layer to provide a final set of prediction values.
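A single-filter sketch of the three layer types may help clarify the cross-shaped filter. It follows our reading of the description above: the filter at entry (i, j) combines a weighted sum over row i with a weighted sum over column j. Biases, nonlinearities and multiple channels are omitted, and all names and sizes are hypothetical.

```python
import numpy as np

def e2e(a, r, c):
    """Edge-to-Edge: out[i, j] = sum_k r[k] * a[i, k] + sum_k c[k] * a[k, j]."""
    row_part = a @ r  # (N,), one value per row i
    col_part = c @ a  # (N,), one value per column j
    return row_part[:, None] + col_part[None, :]

def e2n(a, r, c):
    """Edge-to-Node: the same cross-shaped filter, evaluated on the diagonal only."""
    return a @ r + c @ a

def n2g(v, w):
    """Node-to-Graph: weighted sum over the N node values."""
    return v @ w

rng = np.random.default_rng(0)
n = 6
a = rng.standard_normal((n, n))
a = (a + a.T) / 2  # symmetric toy connectivity matrix
r, c, w = rng.standard_normal(n), rng.standard_normal(n), rng.standard_normal(n)
edge_map = e2e(a, r, c)            # N x N
node_vec = e2n(edge_map, r, c)     # N values
graph_feature = n2g(node_vec, w)   # single graph-level feature
```

Each filter has only 2N weights (one row vector and one column vector), which is the source of the reduced parameter count relative to a fully connected layer on the same input.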
Graph convolutional neural network (GCNN)
Standard convolution applies to data that lies on a Euclidean grid (e.g., images). Graph convolution exploits the graph Laplacian in order to generalize the concept of standard convolution to data lying on nodes connected together into a graph. This allows the extension of the standard CNN to graph convolutional neural networks (GCNNs; Defferrard et al., 2016; Bronstein et al., 2017; Kipf and Welling, 2017). There are many different ways that GCNN can be applied to neuroimaging data (Kipf and Welling, 2017; Ktena et al., 2018; Zhang et al., 2018). Here we considered the innovative GCNN developed by Kipf and Welling (2017) and extended to neuroimaging data by Parisot and colleagues (Parisot et al., 2017, 2018). Figure 3 illustrates this approach.
The input to an FNN (Figure 1) or a BrainNetCNN (Figure 2) is the RSFC data of a single subject. By contrast, the GCNN takes in data (e.g., vectorized RSFC) of all subjects as input and outputs behavioral (or demographic) predictions of all subjects (Parisot et al., 2017, 2018). In other words, data from the training, validation, and testing sets are all input into the GCNN at the same time. To avoid leakage of information across training, validation and test sets, masking of data is applied during the calculation of the loss function and gradient descent.
More importantly, the graph in GCNN does not represent connectivity matrices (like in BrainNetCNN). Instead, each node represents a subject and edges are determined by the similarity between subjects. This similarity is problem dependent. For example, in the case of autism spectrum disorder (ASD) classification, similarity between two subjects is defined based on sex, sites and RSFC, i.e., two subjects are more similar if they have the same sex, are from the same site and have similar RSFC patterns (Parisot et al., 2017, 2018). The use of sex and sites in the graph definition was particularly important for this specific application, since ASD is characterized by strong sex-specific effects and the database included data from multiple unharmonized sites (Di Martino et al., 2014).
Similar to the original studies (Parisot et al., 2017, 2018), we utilized vectorized RSFC (lower triangular entries of the RSFC matrix) of all subjects as inputs to the GCNN. Edges between subjects were defined based on Pearson’s correlation between lower triangular portions of RSFC matrices.
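The population-graph setup can be illustrated as follows. This is a sketch under several assumptions, not the implementation used in this study: edges are kept only where the between-subject correlation exceeds a hypothetical threshold, the layer uses the renormalized propagation rule of Kipf and Welling (2017), and the loss is a masked mean squared error so that only training subjects contribute.

```python
import numpy as np

def subject_graph(features, threshold=0.0):
    """Weighted adjacency from Pearson correlation between subjects' features."""
    sim = np.corrcoef(features)      # (n_subjects, n_subjects)
    np.fill_diagonal(sim, 0.0)       # self-loops are added during normalization
    return np.where(sim > threshold, sim, 0.0)

def normalized_adjacency(adj):
    """D^{-1/2} (A + I) D^{-1/2}, as in Kipf and Welling (2017)."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(a_norm, h, w):
    """One graph convolution: mix features across neighboring subjects, then ReLU."""
    return np.maximum(0.0, a_norm @ h @ w)

def masked_mse(pred, target, mask):
    """Loss restricted to the subjects selected by the boolean mask."""
    return np.mean((pred[mask] - target[mask]) ** 2)

rng = np.random.default_rng(0)
features = rng.standard_normal((12, 30))  # 12 subjects, 30 vectorized RSFC features
a_norm = normalized_adjacency(subject_graph(features))
hidden = gcn_layer(a_norm, features, rng.standard_normal((30, 4)) * 0.1)
train_mask = np.arange(12) < 8            # first 8 subjects form the training set
target = rng.standard_normal(12)
loss = masked_mse(hidden[:, 0], target, train_mask)
```

All subjects pass through the network together, but changing the targets of validation or test subjects leaves the training loss unchanged.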
HCP training, validation and testing
For the HCP dataset, 20-fold cross-validation was performed. The 953 subjects were divided into 20 folds, such that family members were not split across folds. Inner-loop cross-validation was performed for hyperparameter tuning. More specifically, for a given test fold, cross-validation was performed on the remaining 19 folds with different hyperparameters. The best hyperparameters were then used to train on the 19 folds. The trained model was then applied to the test fold. This was repeated for all 20 test folds.
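The nested procedure can be sketched as below. The fold assignment strategy, the plain ridge regression stand-in, the toy data and all names are assumptions made for illustration; only the overall structure (family-aware folds, inner-loop tuning, retraining on the remaining folds, evaluation on the held-out fold) follows the text.

```python
import numpy as np

def family_folds(family_ids, n_folds, seed=0):
    """Assign each subject a fold so that a family never spans two folds."""
    rng = np.random.default_rng(seed)
    families = rng.permutation(np.unique(family_ids))
    fold_of_family = {fam: i % n_folds for i, fam in enumerate(families)}
    return np.array([fold_of_family[f] for f in family_ids])

def ridge_fit_predict(x_tr, y_tr, x_te, lam):
    """Stand-in model (plain ridge regression) used only for this sketch."""
    w = np.linalg.solve(x_tr.T @ x_tr + lam * np.eye(x_tr.shape[1]), x_tr.T @ y_tr)
    return x_te @ w

def correlation(y_true, y_pred):
    return float(np.corrcoef(y_true, y_pred)[0, 1])

def nested_cv(x, y, folds, lambdas, fit_predict, score):
    results = []
    for test_fold in np.unique(folds):
        test = folds == test_fold
        x_tr, y_tr, inner_folds = x[~test], y[~test], folds[~test]
        # Inner loop: cross-validate each candidate on the remaining folds.
        mean_scores = []
        for lam in lambdas:
            inner = []
            for val_fold in np.unique(inner_folds):
                val = inner_folds == val_fold
                pred = fit_predict(x_tr[~val], y_tr[~val], x_tr[val], lam)
                inner.append(score(y_tr[val], pred))
            mean_scores.append(np.mean(inner))
        best = lambdas[int(np.argmax(mean_scores))]
        # Retrain on all remaining folds with the best value, then test once.
        pred = fit_predict(x_tr, y_tr, x[test], best)
        results.append((best, score(y[test], pred)))
    return results

rng = np.random.default_rng(0)
family_ids = np.repeat(np.arange(30), 2)  # 30 toy families of 2 subjects each
x = rng.standard_normal((60, 4))
y = x @ np.array([1.0, -0.5, 0.0, 2.0]) + 0.5 * rng.standard_normal(60)
folds = family_folds(family_ids, n_folds=5)
lambdas = [0.01, 1.0, 100.0]
results = nested_cv(x, y, folds, lambdas, ridge_fit_predict, correlation)
```

The held-out fold is never seen during hyperparameter selection, which is what keeps the outer-loop accuracy estimate unbiased.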
In the case of kernel regression, there was only a single hyperparameter λ (that controls the l2 regularization; see Appendix A1). This hyperparameter was tuned separately for each test fold and each behavioral measure using a grid search.
In the case of the DNNs, there was a large number of hyperparameters, e.g., number of layers, number of nodes, number of training epochs, dropout rate, optimizer (e.g., stochastic gradient or ADAM), weight initialization, activation functions, regularization, etc. The GCNN had additional hyperparameters to tune, e.g., definition of the graph and graph Laplacian estimation.
If we trained a different DNN for each of the 58 behavioral measures, a proper hyperparameter tuning would not be computationally feasible. Thus, a single FNN (or BrainNetCNN or GCNN) was trained for all 58 behavioral measures. We note that the joint prediction of multiple behavioral measures might not be a disadvantage for the DNNs and might potentially even improve prediction accuracies (Rahim et al., 2017). Furthermore, we tried to tune each DNN (FNN, BrainNetCNN or GCNN) for only fluid intelligence, but the performance for fluid intelligence prediction was not better than predicting all 58 behavioral measures simultaneously.
Furthermore, a proper inner-loop 20-fold cross-validation would involve tuning the hyperparameters for each DNN 20 times (once for each split of the data into training-test folds), which was computationally prohibitive. Thus, for each DNN (FNN, BrainNetCNN and GCNN), we tuned the hyperparameters once, using the first split of the data into training-test folds, and simply re-used the optimal hyperparameters for the remaining training-test splits of the data. Such a procedure biases the prediction performance in favor of the DNNs (relative to kernel regression), so the results should be interpreted accordingly (see Discussion). Such a bias is avoided in the UK Biobank dataset (see below). Further details about DNN hyperparameters are found in Appendix A2.
As is common in the FC-based prediction literature (Finn et al., 2015), model performance was evaluated based on the correlation between predicted and actual behavioral measures across subjects within each test fold. Furthermore, since certain behavioral measures were correlated with motion (Siegel et al., 2017), age, sex, and motion (FD) were regressed from the behavioral measures from the training and test folds (Li et al., 2018; Kong et al., 2018). Regression coefficients were estimated from the training folds and applied to the test folds.
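The confound regression step can be sketched as an ordinary least squares fit on the training folds whose coefficients are then reused on the test fold. The inclusion of an intercept, the function names and the toy data are assumptions for illustration.

```python
import numpy as np

def fit_confounds(confounds, y):
    """Least-squares coefficients (with intercept) estimated on training data."""
    design = np.column_stack([np.ones(len(y)), confounds])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

def remove_confounds(confounds, y, beta):
    """Residualize y using coefficients estimated elsewhere."""
    design = np.column_stack([np.ones(len(y)), confounds])
    return y - design @ beta

rng = np.random.default_rng(0)
conf_train = rng.standard_normal((80, 3))  # toy columns: age, sex, motion (FD)
conf_test = rng.standard_normal((20, 3))
y_train = conf_train @ np.array([0.5, -1.0, 0.2]) + rng.standard_normal(80)
y_test = conf_test @ np.array([0.5, -1.0, 0.2]) + rng.standard_normal(20)

beta = fit_confounds(conf_train, y_train)
y_train_clean = remove_confounds(conf_train, y_train, beta)
y_test_clean = remove_confounds(conf_test, y_test, beta)  # reuse training coefficients
```

Estimating the coefficients on the training folds only avoids leaking information from the test fold into the model.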
UK Biobank training, validation and testing
The large UK Biobank dataset allowed us the luxury of splitting the 8868 subjects into training (N = 6868), validation (N = 1000) and test (N = 1000) sets, instead of employing an inner-loop cross-validation procedure like in the HCP dataset. Care was taken so that the distributions of various attributes (sex, age, fluid intelligence and pairs matching) were similar across training, validation and test sets.
Hyperparameters were tuned using the training and validation sets. The test set was only utilized to evaluate the final prediction performance. A separate DNN was trained for each of the four behavioral and demographic measures. Thus, the hyperparameters were tuned independently for each behavioral/demographic measure. Further details about DNN hyperparameters are found in Appendix A2. Initial experiments using a single neural network to predict all four measures simultaneously (like in the HCP dataset) did not appear to improve performance and so were not pursued further.
Like before, prediction accuracies for age, fluid intelligence and pairs matching were evaluated based on the correlation between predicted and actual measures across subjects within the test set. Since the age prediction literature often used mean absolute error (MAE) as an evaluation metric (Liem et al., 2017; Cole et al., 2018; Varikuti et al., 2018), we also included MAE as an evaluation metric. In the case of sex, accuracy was defined as the fraction of participants whose sex was correctly predicted. Like before, we regressed age, sex and motion from fluid intelligence and pairs matching measures in the training set and applied the regression coefficients to the validation and test sets. When predicting age and sex, no regression was performed.
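A three-way split of this size can be sketched as follows. The text does not state how the split was balanced, so the purely random permutation below is an assumption; in practice one could compare summary statistics of each attribute across the three sets and redraw the split if they differ noticeably.

```python
import numpy as np

def split_indices(n_subjects, n_val, n_test, seed=0):
    """Randomly partition subject indices into training, validation and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_subjects)
    test = order[:n_test]
    val = order[n_test:n_test + n_val]
    train = order[n_test + n_val:]
    return train, val, test

train, val, test = split_indices(8868, n_val=1000, n_test=1000)
```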
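The three evaluation metrics described above are simple to state in code. The function names and the toy values are ours.

```python
import numpy as np

def pearson_r(y_true, y_pred):
    """Correlation between predicted and actual measures across subjects."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference, e.g., in years for age prediction."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def accuracy(labels_true, labels_pred):
    """Fraction of participants whose label (e.g., sex) was correctly predicted."""
    return float(np.mean(np.asarray(labels_true) == np.asarray(labels_pred)))

age_true = np.array([50.0, 60.0, 70.0])
age_pred = np.array([52.0, 60.0, 66.0])
mae = mean_absolute_error(age_true, age_pred)   # (2 + 0 + 4) / 3 = 2.0
acc = accuracy([0, 1, 1, 0], [0, 1, 0, 0])      # 3 of 4 correct = 0.75
```

Correlation ignores any constant offset or scaling in the predictions, whereas MAE penalizes both, which is one reason to report the two together for age.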
Deep neural network implementation
The DNNs were implemented using Keras (Chollet, 2015) or PyTorch (Paszke et al., 2017) and run on an NVIDIA Titan Xp GPU using CUDA. Our implementations of BrainNetCNN and GCNN were based on GitHub code from the original papers (Kawahara et al., 2017; Kipf and Welling, 2017). Our implementations achieved similar results for the experiments provided in the original GitHub implementations. More details can be found in Appendix A2.
Statistical tests
For the HCP dataset, we performed 20-fold cross-validation, yielding a prediction accuracy for each test fold. To compare two algorithms, the corrected resampled t-test was performed (Nadeau and Bengio, 2003; Bouckaert and Frank, 2004). The corrected resampled t-test corrects for the fact that the accuracies across test folds were not independent. In the case of the UK Biobank, there was only a single test fold, so the corrected resampled t-test could not be applied. Instead, when comparing correlations from two algorithms, Steiger’s Z-test was utilized (Steiger, 1980). When comparing prediction errors for age (MAE; mean absolute error), a two-tailed paired sample t-test was performed. When comparing prediction accuracies for sex, McNemar’s test was utilized (McNemar, 1947).
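Two of these tests can be sketched briefly. The corrected resampled t-statistic below follows the form given by Nadeau and Bengio (2003), in which the variance of the fold-wise accuracy differences is inflated by the ratio of test to training set sizes. The exact (binomial) variant of McNemar's test is an assumption, since the text does not state which variant was used; function names and numbers are illustrative.

```python
import math
import numpy as np

def corrected_resampled_t(acc_a, acc_b, n_train, n_test):
    """Return the corrected t-statistic and its degrees of freedom (k - 1)."""
    d = np.asarray(acc_a, dtype=float) - np.asarray(acc_b, dtype=float)
    k = len(d)
    var = d.var(ddof=1)  # sample variance of the fold-wise differences
    # The n_test / n_train term accounts for overlapping training sets.
    t = d.mean() / math.sqrt((1.0 / k + n_test / n_train) * var)
    return t, k - 1

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    b: subjects classified correctly by model A only.
    c: subjects classified correctly by model B only.
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

t_stat, dof = corrected_resampled_t([0.2, 0.3, 0.4], [0.1, 0.1, 0.1], n_train=90, n_test=10)
p_sex = mcnemar_exact(0, 5)
```

The p-value for the t-statistic would then be read from a Student's t distribution with k − 1 degrees of freedom.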
Data and code availability
This study utilized publicly available data from the HCP (https://www.humanconnectome.org/) and UK Biobank (https://www.ukbiobank.ac.uk/). The 400 cortical ROIs (Schaefer et al., 2018) can be found here (https://github.com/ThomasYeoLab/CBIG/tree/master/stable_projects/brain_parcellation/Schaefer2018_LocalGlobal). The code utilized in this study can be found here: https://www.dropbox.com/sh/iq2d4gttxe3qvct/AAAVw7YJnVSwtOjouZDhhyPGa?dl=0 (note to readers/reviewers: we are in the midst of pushing our code to GitHub. The Dropbox link will be replaced by a GitHub link).
Results
Three DNNs, namely the fully-connected neural network (FNN), BrainNetCNN and the graph convolutional neural network (GCNN), were compared with kernel regression in FC-based behavioral prediction using the HCP and UK Biobank datasets.
HCP behavioral prediction
Figure 4 shows the prediction accuracy (correlation) averaged across 58 HCP behavioral measures and 20 test folds. FNN achieved the highest average prediction accuracy of r = 0.121 ± 0.063 (mean ± std). On the other hand, kernel regression achieved an average prediction accuracy of r = 0.115 ± 0.036 (mean ± std). However, there was no statistical difference between FNN and kernel regression (p = 0.60; see Methods).
Interestingly, BrainNetCNN (r = 0.110 ± 0.043) and GCNN (r = 0.072 ± 0.034) did not outperform FNN, even though the two DNNs were designed for neuroimaging data. For completeness, Figures 5, S1, and S2 show the behavioral prediction accuracies for all 58 behavioral measures.
UK Biobank behavioral and demographics prediction
Table 1 and Figure 6 show the prediction performances of sex, age, pairs matching and fluid intelligence. Kernel regression performed the best for age and fluid intelligence. BrainNetCNN performed the best for sex and pairs matching.
Statistical tests were performed between kernel regression and the three DNNs (see Methods). False discovery rate (FDR) correction (q < 0.05) was applied to account for multiple comparisons. For age (MAE), kernel regression was statistically better than GCNN (p = 1.8e-6). For fluid intelligence, kernel regression was statistically better than GCNN (p = 5.5e-5).
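For readers unfamiliar with FDR control, the sketch below implements the Benjamini-Hochberg step-up procedure. The text does not state which FDR procedure was used, so the choice of Benjamini-Hochberg is an assumption, and the p-values in the example are made up.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Boolean array marking which hypotheses are rejected at FDR level q."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m  # k/m * q for the k-th smallest p-value
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])     # largest rank meeting its threshold
        reject[order[:k + 1]] = True          # reject everything up to that rank
    return reject

significant = benjamini_hochberg([0.001, 0.02, 0.04, 0.5], q=0.05)
```

In this example the first two tests survive correction, while p = 0.04 does not because its threshold at rank 3 of 4 is 0.0375.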
On the other hand, there was no statistical difference between kernel regression and BrainNetCNN in the case of sex and pairs matching, even though BrainNetCNN achieved a nominally higher accuracy.
Interestingly, the FNN achieved poor performance in the case of pairs matching (r = −0.0006). Upon further investigation, we found that FNN achieved an accuracy of r = 0.079 in the UK Biobank validation set. Without any hyperparameter tuning (i.e., using the default set of hyperparameters), FNN achieved accuracies of r = 0.046 and r = 0.031 in the validation and test sets respectively. Overall, this suggests that the hyperparameter tuning overfitted the validation set, despite the rather large sample size.
Computational costs
Kernel regression has a closed-form solution (Appendix A1) and only one hyperparameter, so the computational cost is extremely low. For example, kernel regression training and grid search of 32 hyperparameter values in the UK Biobank validation set took about 20 minutes (single CPU core) for one behavioral measure. This is one reason why we considered kernel regression instead of other slower classical approaches (e.g., support vector regression or elastic net) requiring iterative optimization. On the other hand, FNN training and tuning of hyperparameters in the UK Biobank validation set took around 80 hours (single GPU) for one behavioral measure, excluding the person-hours necessary for the manual tuning.
Discussion
In this study, we showed that DNNs did not outperform kernel regression in RSFC-based prediction of a wide range of behavioral and demographic measures across two large-scale datasets totaling almost 10,000 participants. Furthermore, FNN performed as well as the two DNNs that were specifically designed for connectome data. Given comparable performance between kernel regression and the DNNs and the significantly greater computational costs associated with DNNs, our results suggest that DNNs should be more critically evaluated in the neuroimaging literature despite their promise.
Potential reasons why DNNs did not outperform kernel regression for RSFC-based prediction
There are several potential reasons why DNNs did not outperform kernel regression in our experiments on RSFC-based behavioral prediction. First, given the much larger datasets used in computer vision and natural language processing (Chelba et al., 2014; Russakovsky et al., 2015), it is possible that there was not enough neuroimaging data (even in the UK Biobank) to fully exploit DNNs.
Second, while the human brain is nonlinear and hierarchically organized (Deco et al., 2011; Breakspear, 2017), such a structure might not be reflected in the RSFC matrix in a way that was exploitable by the DNNs we considered. This could be due to the measurements themselves (Pearson’s correlations of rs-fMRI timeseries), the particular representation (N × N connectivity matrices) or particular choices of DNNs, although we again note that BrainNetCNN and GCNN were specifically developed for connectome data.
Third, it is well-known that hyper-parameter settings and architectural details can impact the performance of DNNs. Thus, it is possible that the benchmark DNNs we implemented in this work can be further optimized. However, we do not believe this would alter our conclusions for two reasons. First, for some measures (e.g., sex classification in the UK Biobank), we were achieving performance at or near the state-of-the-art. Second, experiments with an automatic algorithm for tuning DNN hyperparameters (Ilievski et al., 2017) did not yield better performance than our hand-tuned hyperparameters (results not shown).
Improving future DNN research in neuroimaging
Given the exciting DNN results published in the top neuroimaging journals, we started this project with the expectation that DNNs would significantly outperform kernel regression. However, the results of this study suggest potential lessons for future DNN research in neuroimaging.
First, many DNN papers in neuroimaging do not utilize strong baseline algorithms for comparisons. In the case of RSFC-based behavioral prediction, our results suggest that kernel regression is a good baseline to be considered in future studies. Furthermore, in many (if not all) applications, a simple, but powerful baseline would be to replace the nonlinear activation functions (used in the DNN) with linear ones (Huang et al., 2018; Nguyen et al., 2018).
Second, the sample sizes of many DNN neuroimaging studies are often too small. In the case of behavioral prediction or disease classification, where the sample size is equal to the number of participants, we recommend at least a minimum of several hundred participants, since our results suggest that DNNs can achieve comparable performance with kernel regression. Thousands of participants would be better. Yet, given the results of this study, studies should perhaps aspire to even more participants. It is worth noting that what constitutes sample size depends on the problem. In the case of dense anatomical segmentation, the training data might involve manual segmentation of millions of voxels in a relatively small number of participants. In this scenario, the effective sample size might be closer to the number of labeled voxels than the number of labeled subjects. Consequently, this might explain the success of DNNs in segmentation challenges (Kamnitsas et al., 2017a; Hongwei Li et al., 2018).
Third, there are significantly more hyperparameters in DNNs compared with classical machine learning approaches. For example, for a fixed kernel (e.g., correlation metric in our study), kernel regression has one single regularization parameter. Even with a nonlinear kernel (e.g. radial basis function), there would only be two hyperparameters. This is in contrast to DNNs, where there can easily be more than ten hyperparameters. As such, it is important that studies spelled out clearly how those hyperparameters are tuned. In our experience, tuning large number of hyperparameters within a k-fold inner-loop (nested) cross-validation framework is difficult for two reasons. First, tuning so many hyperparameters k times (once for each fold) is prohibitively expensive. Second, if manual tuning is performed, information from tuning one fold will inevitably leak to another fold (via the person tuning the hyperparameters). Consequently, if the dataset is sufficiently large (e.g., UK Biobank), we recommend the data be divided into training, validation and test sets, just like in our experiments. Hyperparameter tuning should be performed only using the training and validation sets, with the test set only be utilized in the final evaluation. In smaller datasets (e.g., HCP), an inner-loop k-fold cross-validation might unfortunately be necessary to ensure stability of results (Varoquaux, 2018).
Finally, we encourage studies to make their code publicly available. Publicly available code makes it significantly easier for other researchers to perform comparisons with the published algorithms. The current evaluation study was made possible by generous code sharing by various authors (Kawahara et al., 2017; Parisot et al., 2017, 2018). Furthermore, there are simply too many DNN hyperparameters (and design choices) to be listed in a paper. In fact, there were hyperparameters too complex to completely specify in this paper. However, we have made our code publicly available, so researchers can refer to the code for the exact hyperparameters.
Limitations and caveats
Although the current study suggests that DNNs do not outperform kernel regression in RSFC-based behavioral prediction, it is possible that other DNNs (that we have not considered) might outperform kernel regression. Furthermore, our study focused on the use of N × N RSFC matrices for behavioral prediction. Other RSFC features in combination with DNNs might potentially yield better performance (Hongming Li et al., 2018; Khosla et al., 2018). Finally, the final UK Biobank dataset will include 100,000 participants with neuroimaging data, which is ten times the number of participants used in the current study. The larger quantity of data might strongly benefit deep learning approaches.
Given the success of DNNs in many fields and at various MICCAI predictive modeling challenges, we strongly believe that DNNs remain a promising tool for neuroimaging. However, researchers should carefully consider whether and how their applications would benefit from DNNs’ advantages over classical alternatives, rather than simply assume that deep learning is a panacea for all problems.
Conclusion
By using a combined sample of nearly 10,000 participants, we showed that three DNNs did not outperform kernel regression in RSFC-based prediction of a wide range of behavioral and demographic measures. Although we believe that deep learning remains a promising tool for neuroimaging data analysis, this suggests that DNNs should be more critically evaluated in the neuroimaging literature. Deep learning research in neuroimaging applications would benefit from comparisons with stronger baseline algorithms, large sample sizes, transparency in hyperparameter tuning and code availability.
Acknowledgment
This work was supported by Singapore MOE Tier 2 (MOE2014-T2-2-016), NUS Strategic Research (DPRT/944/09/14), NUS SOM Aspiration Fund (R185000271720), Singapore NMRC (CBRG/0088/2015), NUS YIA and the Singapore National Research Foundation (NRF) Fellowship (Class of 2017). Our research also utilized resources provided by the Center for Functional Neuroimaging Technologies, P41EB015896 and instruments supported by 1S10RR023401, 1S10RR019307, and 1S10RR023043 from the Athinoula A. Martinos Center for Biomedical Imaging at the Massachusetts General Hospital. Our computational work was partially performed on resources of the National Supercomputing Centre, Singapore (https://www.nscc.sg). The Titan Xp GPUs used for this research were donated by the NVIDIA Corporation. This research has been conducted using the UK Biobank resource under application 25163 and Human Connectome Project, WU-Minn Consortium (Principal Investigators: David Van Essen and Kamil Ugurbil; 1U54MH091657) funded by the 16 NIH Institutes and Centers that support the NIH Blueprint for Neuroscience Research; and by the McDonnell Center for Systems Neuroscience at Washington University.
Appendix
A1. Kernel Regression
In this section, we describe kernel regression in detail (Liu et al., 2007; Murphy, 2012). The kernel matrix $\mathbf{K}$ encodes the similarity between pairs of subjects. Motivated by Finn and colleagues (2015), the element at the i-th row and j-th column of the kernel matrix is defined as the Pearson’s correlation between the i-th subject’s vectorized RSFC and the j-th subject’s vectorized RSFC (considering only the lower triangular portions of the RSFC matrices). The behavioral measure $y_i$ of subject $i$ can be written as:

$$y_i = \sum_{j=1}^{M} \alpha_j K(c_i, c_j) + e_i, \qquad (1)$$

where $c_i$ is the vectorized RSFC of the i-th subject, $K(c_i, c_j)$ is the element at the i-th row and j-th column of the kernel matrix, $M$ is the total number of training subjects, $e_i$ is the noise term and $\alpha_j$ is the trainable weight. The goal of kernel regression is to find an optimal set of $\alpha$. To achieve this goal, we maximize the likelihood function:

$$\ell(\alpha) = -\frac{1}{2} \sum_{i=1}^{M} \Big( y_i - \sum_{j=1}^{M} \alpha_j K(c_i, c_j) \Big)^2 \qquad (2)$$

with respect to $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_M]^T$. To avoid overfitting, an $l_2$ regularization (i.e., kernel ridge regression) can be added, so the resulting optimization problem becomes:

$$\hat{\alpha} = \arg\min_{\alpha} \; \frac{1}{2} (y - \mathbf{K}\alpha)^T (y - \mathbf{K}\alpha) + \frac{\lambda}{2} \alpha^T \mathbf{K} \alpha, \qquad (3)$$

where $\mathbf{K}$ is the $M \times M$ kernel matrix, $y = [y_1, y_2, \ldots, y_M]^T$ and $\lambda$ is a hyperparameter that controls the $l_2$ regularization. By solving equation (3) with respect to $\alpha$, we can predict a test subject’s behavioral measure $y_s$:

$$\hat{y}_s = K_s \hat{\alpha} = K_s (\mathbf{K} + \lambda I)^{-1} y, \qquad (4)$$

where $K_s = [K(c_s, c_1), K(c_s, c_2), \ldots, K(c_s, c_M)]$.
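The procedure described in this section can be illustrated with a short NumPy sketch. It is not our released implementation: the function names are hypothetical, the data are random, and the weights are obtained from the standard kernel ridge closed form α = (K + λI)⁻¹y, which is assumed here to be the solution of the regularized problem. No intercept term is modeled.

```python
import numpy as np

def vectorize_lower_triangle(fc):
    """Return the strictly lower-triangular entries of an N x N RSFC matrix."""
    rows, cols = np.tril_indices(fc.shape[0], k=-1)
    return fc[rows, cols]

def correlation_kernel(A, B):
    """Pearson correlation between every row of A and every row of B."""
    A = A - A.mean(axis=1, keepdims=True)
    B = B - B.mean(axis=1, keepdims=True)
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

def fit_kernel_ridge(K_train, y_train, lam):
    """Solve (K + lam * I) alpha = y for the weight vector alpha."""
    M = K_train.shape[0]
    return np.linalg.solve(K_train + lam * np.eye(M), y_train)

def predict(K_test_train, alpha):
    """Each row of K_test_train holds one test subject's similarity to all training subjects."""
    return K_test_train @ alpha

# Toy example with random "RSFC" matrices (illustrative sizes only).
rng = np.random.default_rng(0)
n_train, n_test, n_rois = 20, 5, 10
fc = rng.standard_normal((n_train + n_test, n_rois, n_rois))
features = np.stack([vectorize_lower_triangle(m) for m in fc])
y_train = rng.standard_normal(n_train)

K_train = correlation_kernel(features[:n_train], features[:n_train])
K_test = correlation_kernel(features[n_train:], features[:n_train])
alpha = fit_kernel_ridge(K_train, y_train, lam=1.0)
y_pred = predict(K_test, alpha)
```

Using `np.linalg.solve` rather than forming the matrix inverse explicitly is a numerical design choice of this sketch; both give the same α in exact arithmetic.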
In the case of the HCP, λ was selected via inner-loop cross-validation. In the case of the UK Biobank, λ was tuned on the validation set.
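Tuning λ on a validation set can be sketched as a simple grid search. The sketch below is illustrative only: the candidate grid, the use of Pearson correlation between predicted and observed values as the selection criterion, and the function name `select_lambda` are assumptions, and the data are synthetic. The inner-loop cross-validation variant would repeat the same search within each training fold.

```python
import numpy as np

def select_lambda(K_train, y_train, K_val_train, y_val, lambdas):
    """Return the candidate lambda with the highest validation correlation."""
    best_lam, best_r = None, -np.inf
    identity = np.eye(K_train.shape[0])
    for lam in lambdas:
        # Fit kernel ridge regression on the training subjects only.
        alpha = np.linalg.solve(K_train + lam * identity, y_train)
        pred = K_val_train @ alpha
        # Score on the validation subjects (assumed criterion: Pearson r).
        r = np.corrcoef(pred, y_val)[0, 1]
        if r > best_r:
            best_lam, best_r = lam, r
    return best_lam, best_r

# Synthetic data: 40 training and 20 validation subjects.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 30))
y = X @ rng.standard_normal(30) + 0.1 * rng.standard_normal(60)
C = np.corrcoef(X)                      # subject-by-subject correlation kernel
K_train, K_val_train = C[:40, :40], C[40:, :40]

candidate_lambdas = [0.001, 0.01, 0.1, 1.0, 10.0]   # hypothetical grid
best_lam, best_r = select_lambda(K_train, y[:40], K_val_train, y[40:],
                                 candidate_lambdas)
```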
A2. More details of deep neural networks
In this section, we describe further details of our DNN implementation. In the case of the HCP dataset:
For all three DNNs, all behavioral measures were z-normalized based on training data. The loss function was mean squared error (MSE). The optimizer was stochastic gradient descent (SGD). With the MSE loss, the output layer has 58 nodes (FNN and BrainNetCNN) or filters (GCNN).
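Z-normalization based on training data means that the mean and standard deviation are estimated from the training subjects only and then reused for all other subjects. A minimal sketch with hypothetical function names and random data is given below; the number of subjects is arbitrary, while 58 matches the number of output nodes above.

```python
import numpy as np

def fit_zscore(y_train):
    """Estimate per-measure mean and standard deviation from training data."""
    return y_train.mean(axis=0), y_train.std(axis=0)

def apply_zscore(y, mean, std):
    """Normalize any split with statistics estimated on the training split."""
    return (y - mean) / std

rng = np.random.default_rng(0)
y_train = rng.normal(loc=5.0, scale=2.0, size=(200, 58))
y_test = rng.normal(loc=5.0, scale=2.0, size=(50, 58))

mean, std = fit_zscore(y_train)
z_train = apply_zscore(y_train, mean, std)
z_test = apply_zscore(y_test, mean, std)   # test statistics are never used
```

Estimating the statistics on the training split alone avoids leaking information from the test subjects into the model.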
The final FNN structure is shown in Table 2. Dropout of 0.6 was added before each fully-connected layer. L2 regularization of 0.02 was added for layer 2.
The final BrainNetCNN structure is shown in Table 3. Dropout of 0.5 was added after the E2N layer. LeakyReLU (Maas et al., 2013) with alpha of 0.1 was used as the activation function for the first three layers.
The final GCNN structure is shown in Table 4. Dropout of 0.3 was added for each layer. L2 regularization of 8e-4 was added for layer 1. The nodes of the graph corresponded to subjects. Edges were constructed based on Pearson’s correlation between subjects’ vectorized RSFC. The graph was thresholded by only retaining edges with top 5% correlation (across the entire graph). However, this might result in a disconnected graph. Therefore, the top five correlated edges of each node were also retained (even if these edges were not among the top 5% correlated edges). The graph convolution filters were estimated using a 5-degree Chebyshev polynomial (Defferrard et al., 2016).
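The graph construction described for the GCNN can be sketched as follows. This is an illustration under stated assumptions, not our released code: the function name is hypothetical, edges are treated as binary, and the adjacency matrix is symmetrized after the per-node rule is applied.

```python
import numpy as np

def build_subject_graph(features, top_frac=0.05, k=5):
    """Binary subject-by-subject adjacency from similarity of vectorized RSFC."""
    n = features.shape[0]
    sim = np.corrcoef(features)            # Pearson correlation between subjects
    np.fill_diagonal(sim, -np.inf)         # exclude self-loops from both rules

    # Global rule: keep edges whose correlation is in the top `top_frac`
    # of all subject pairs.
    pair_values = sim[np.triu_indices(n, k=1)]
    cutoff = np.quantile(pair_values, 1.0 - top_frac)
    adj = sim >= cutoff

    # Per-node rule: also keep each node's k most correlated neighbours,
    # so that no subject is left without edges.
    nearest = np.argsort(sim, axis=1)[:, -k:]
    rows = np.repeat(np.arange(n), k)
    adj[rows, nearest.ravel()] = True

    return adj | adj.T                     # assumed: undirected graph

rng = np.random.default_rng(0)
features = rng.standard_normal((50, 200))  # 50 subjects, random features
adj = build_subject_graph(features)
```

Setting `top_frac=0.0` would not reproduce the nearest-neighbour-only variant exactly, because the single most correlated pair would still pass the cutoff; that variant is better obtained by skipping the global rule altogether.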
In the case of the UK Biobank:
For all three DNNs, a model ensemble was used to improve the final test result: for each DNN and each behavior, five models were trained separately. The prediction results were then averaged across the five models. All four behavioral measures were z-normalized based on training data. The loss function for sex prediction was cross entropy, i.e., the output layer for sex prediction has 2 nodes (FNN and BrainNetCNN) or filters (GCNN). The loss function was MSE for the other three measures. The output layer for these three measures has 1 node (FNN and BrainNetCNN) or filter (GCNN). Adam (Kingma and Ba, 2015) or SGD was used. See details in Tables 2, 3 and 4.
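Averaging predictions across separately trained models can be sketched in a few lines. The five "models" below are hypothetical stand-ins (simple linear functions with different random weights) rather than trained DNNs, and the sketch covers the regression case only.

```python
import numpy as np

def ensemble_predict(models, x):
    """Average the predictions of several models on the same input."""
    preds = np.stack([model(x) for model in models], axis=0)
    return preds.mean(axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 4))                  # 10 subjects, 4 features
weights = [rng.standard_normal(4) for _ in range(5)]

# Each stand-in model maps features to one prediction per subject.
models = [lambda inputs, w=w: inputs @ w for w in weights]
y_ensemble = ensemble_predict(models, x)
```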
For all DNNs, the model was tuned for each behavior separately. Tables 2, 3 and 4 show the final DNN structures.
The final FNN structure is shown in Table 2. For FNN, dropout of 0.2/0.3/0.4/0.4 (for sex/age/pairs matching/fluid intelligence respectively) was added before each fully-connected layer. L2 regularization of 0.02 was added for layer 2. Weight decay of 0.01/0.01/0.001/0.016 (for sex/age/pairs matching/fluid intelligence respectively) was applied to the weights of all fully connected layers.
The final BrainNetCNN structure is shown in Table 3. For BrainNetCNN, dropout of 0.21/0.6/0.25/0.54 (for sex/age/pairs matching/fluid intelligence respectively) was added after the E2E, E2N, and N2G layers. LeakyReLU was replaced by linear activation for all four models.
The final GCNN structure is shown in Table 4. Dropout of 0.3/0.6/0.6/0.7 (for sex/age/pairs matching/fluid intelligence respectively) was added for each layer. L2 regularization of 2e-5/2e-4/2e-4/2e-6 (for sex/age/pairs matching/fluid intelligence respectively) was added for layer 1. The nodes of the graph corresponded to subjects. Edges were constructed based on Pearson’s correlation between subjects’ vectorized RSFC. Thresholding of the graph was tuned separately for each behavior or demographic measure. For sex prediction, only the top five correlated edges of each node were retained. For age, pairs matching and fluid intelligence prediction, the graph was thresholded by only retaining edges with top 5% correlation (across the entire graph). Furthermore, the top five correlated edges of each node were also retained (even if these edges were not among the top 5% correlated edges). The graph convolution filters for all four GCNNs were estimated by a 1-degree Chebyshev polynomial (Defferrard et al., 2016).
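For readers unfamiliar with Chebyshev graph filters, the following sketch shows how a polynomial filter in the rescaled graph Laplacian can be applied to a signal on the subject graph, following the recurrence described by Defferrard et al. (2016). It is a simplified illustration and not the GCNN implementation used here: it uses a single scalar coefficient per polynomial order (one input and one output channel), takes `len(theta) - 1` as the polynomial degree, and all function names are hypothetical.

```python
import numpy as np

def scaled_laplacian(adj):
    """Normalized graph Laplacian rescaled so its eigenvalues lie in [-1, 1]."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    connected = deg > 0
    d_inv_sqrt[connected] = deg[connected] ** -0.5
    n = adj.shape[0]
    lap = np.eye(n) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    lam_max = np.linalg.eigvalsh(lap).max()
    return 2.0 * lap / lam_max - np.eye(n)

def chebyshev_filter(adj, x, theta):
    """Compute sum_k theta[k] * T_k(L) x using the Chebyshev recurrence."""
    L = scaled_laplacian(adj)
    t_prev, t_curr = x, L @ x              # T_0(L) x and T_1(L) x
    out = theta[0] * t_prev
    if len(theta) > 1:
        out = out + theta[1] * t_curr
    for k in range(2, len(theta)):
        # T_k = 2 L T_{k-1} - T_{k-2}
        t_prev, t_curr = t_curr, 2.0 * L @ t_curr - t_prev
        out = out + theta[k] * t_curr
    return out

# Toy path graph with four nodes and one value per node.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
x = np.array([1.0, 2.0, 3.0, 4.0])
filtered = chebyshev_filter(adj, x, theta=[0.5, 0.5])   # degree-1 filter
```

A degree-1 filter mixes each node's value only with those of its direct neighbours, whereas higher degrees reach nodes that are more hops away on the graph.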
Footnotes
1 The pairs matching task requires participants to memorize the positions of matching pairs of cards.
2 FNN did seem to perform the worst for pairs matching in the UK Biobank, but the difference was not statistically significant. Furthermore, no approach seems to be able to predict pairs matching well.