The effect of machine learning regression algorithms and sample size

NeuroImage 178 (2018) 622–637


The effect of machine learning regression algorithms and sample size on individualized behavioral prediction with functional connectivity features

Zaixu Cui a,b, Gaolang Gong a,c,*

a State Key Laboratory of Cognitive Neuroscience and Learning & IDG/McGovern Institute for Brain Research, Beijing Normal University, Beijing, 100875, China
b Department of Psychiatry, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
c Beijing Key Laboratory of Brain Imaging and Connectomics, Beijing Normal University, Beijing, 100875, China

Keywords: Individualized prediction; Machine learning; Regression algorithm; Sample size; Functional magnetic resonance imaging (MRI); Resting-state functional connectivity

Abstract

Individualized behavioral/cognitive prediction using machine learning (ML) regression approaches is increasingly applied. The specific ML regression algorithm and the sample size are two key factors that non-trivially influence prediction accuracy. However, the effects of the ML regression algorithm and sample size on individualized behavioral/cognitive prediction performance have not been comprehensively assessed. To address this issue, the present study included six commonly used ML regression algorithms, namely ordinary least squares (OLS) regression, least absolute shrinkage and selection operator (LASSO) regression, ridge regression, elastic-net regression, linear support vector regression (LSVR), and relevance vector regression (RVR), to perform specific behavioral/cognitive predictions based on different sample sizes. Specifically, the publicly available resting-state functional MRI (rs-fMRI) dataset from the Human Connectome Project (HCP) was used, and whole-brain resting-state functional connectivity (rsFC) or rsFC strength (rsFCS) was extracted as prediction features. Twenty-five sample sizes (ranging from 20 to 700) were studied by sub-sampling from the entire HCP cohort. The analyses showed that rsFC-based LASSO regression performed remarkably worse than the other algorithms, and rsFCS-based OLS regression performed markedly worse than the other algorithms. Regardless of the algorithm and feature type, both the prediction accuracy and its stability exponentially increased with increasing sample size. The specific patterns of the observed algorithm and sample size effects were well replicated in the prediction using re-testing fMRI data, data processed by different imaging preprocessing schemes, and different behavioral/cognitive scores, thus indicating excellent robustness/generalization of the effects. The current findings provide critical insight into how the selected ML regression algorithm and sample size influence individualized predictions of behavior/cognition and offer important guidance for choosing the ML regression algorithm or sample size in relevant investigations.

* Corresponding author. State Key Laboratory of Cognitive Neuroscience and Learning, Beijing Normal University, Beijing, 100875, China. E-mail address: [email protected] (G. Gong).
https://doi.org/10.1016/j.neuroimage.2018.06.001
Received 28 February 2018; Received in revised form 31 May 2018; Accepted 1 June 2018; Available online 2 June 2018
1053-8119/© 2018 Elsevier Inc. All rights reserved.

Introduction

Multimodal neuroimaging features combined with machine learning (ML) classification algorithms have been widely applied to discriminate patients with various brain diseases from healthy controls at the individual level (Arbabshirani et al., 2017; Orrù et al., 2012). This line of research is valuable for facilitating automated diagnosis of various psychiatric diseases. In parallel, neuroimaging-based ML regression approaches have been used to predict behavioral/cognitive abilities (continuous variables) of individuals within healthy and patient populations (Gabrieli et al., 2015). Apart from achieving individualized prediction per se, ML regression approaches have also been regarded as a form of multivariate pattern analysis for exploring the relationship between behavior/cognition and complex patterns of multiple neuroimaging features, which is complementary to conventional brain-behavior correlational analysis (Dosenbach et al., 2010; Haynes, 2015; Norman et al., 2006). Given these merits, a growing number of neuroimaging studies have focused on individualized prediction of behavior/cognition based on ML regression approaches (Dosenbach et al., 2010; Gabrieli et al., 2015; Siegel et al., 2016).

Resting-state functional connectivity (rsFC) represents the coupling of spontaneous low-frequency fluctuations in the blood oxygen level-dependent (BOLD) signal during the resting state, and has been extensively studied in recent years (Biswal et al., 1995; Fox et al., 2005; Fransson, 2005; Friston, 1994). In particular, significant associations between this measure and behavior/cognition have been repeatedly observed (Dubois and Adolphs, 2016; Liu et al., 2017; Smith et al., 2015). Importantly, rsFC has been demonstrated to be an effective feature in ML regression algorithms for predicting the characteristics of individuals, such as biological age (Dosenbach et al., 2010), visual/verbal memory ability (Siegel et al., 2016), attention ability (Rosenberg et al., 2015), and intelligence quotient (Finn et al., 2015).

To date, neuroimaging studies have employed a series of different regression algorithms. The most frequently used algorithms include ordinary least squares (OLS) regression (Rosenberg et al., 2015; Shen et al., 2017), least absolute shrinkage and selection operator (LASSO) regression (Wager et al., 2013), ridge regression (Siegel et al., 2016), elastic-net regression (Cui et al., 2018), linear support vector regression (LSVR) (Ullman et al., 2014), and relevance vector regression (RVR) (Gong et al., 2014). These algorithms differ in how parameters are optimized, such as the form of the loss function and the regularization technique (Hastie et al., 2001; Schölkopf and Smola, 2002). Unsurprisingly, the prediction performances of these algorithms also differ in practice, depending on how well the implicit assumptions of each algorithm hold true in the data. This raises an open question about which algorithms are favorable for an rsFC-feature based prediction. To answer this, a systematic comparison between these algorithms is necessary, which can provide instructive information for the choice of regression algorithm for specific behavioral/cognitive prediction.

Along this line, there have been a few attempts to look at differences in prediction performance across ML regression algorithms. Two separate studies reported that RVR consistently outperforms SVR when predicting individual age (Franke et al., 2010) or clinical scores (Wang et al., 2010) with brain tissue volume as features. Another study by Chu and colleagues found that ridge regression using functional activation features performed slightly better than RVR on average for predicting various types of feature ratings during a virtual reality task (Chu et al., 2011). To date, however, a comprehensive assessment of performance differences between all these frequently used regression algorithms remains scarce, particularly a comparison of their ability to predict individual behavioral/cognitive abilities using rsFC-related features.

In addition to the choice of ML regression algorithm, another open question is what sample size is large enough for a robust behavioral/cognitive prediction. In the context of MRI-based subject classification/discrimination, a few studies found increasing classification accuracies with increasing sample size (Chu et al., 2012; Klöppel et al., 2008). A recent thorough review indicated that studies with small sample sizes tend to report a relatively high discriminative accuracy when classifying brain disease patients from controls, which is likely due to the overfitting issue related to the small sample size (Arbabshirani et al., 2017). However, these assessments of the sample size effect are confined to classification algorithms, and it remains unexplored whether and how the prediction accuracies of ML regression algorithms are affected by sample size.

The present study aims to comprehensively compare rsFC feature-based prediction among 6 ML regression algorithms (i.e., OLS regression, LASSO regression, ridge regression, elastic-net regression, LSVR, and RVR) and further evaluate the effect of sample size on prediction accuracies, which should inform future investigations applying rsFC features to specific behavioral/cognitive predictions at the individual level. We confined our analysis to these six regression algorithms because they are the most commonly used ones to date in the neuroimaging field (see Supplementary Table 1 for a summary). To evaluate the influence of feature dimension, we also included rsFC strength (rsFCS) as an rsFC-extracted lower dimensional feature, which is simply defined as the sum of all linked rsFC values for each brain region and putatively can capture the global communication ability of brain regions (Beucke et al., 2013; Buckner et al., 2009; Zuo et al., 2012). Specifically, a resting-state functional MRI (rs-fMRI) dataset from the Human Connectome Project (HCP) was used to compare the algorithms. Using whole-brain rsFC and rsFC strength (rsFCS) as features, the 6 regression algorithms were implemented to make specific behavioral predictions. A range of sample sizes from 20 to 700 was utilized, and the included subjects for each given sample size were randomly selected from the full HCP sample set. Using the HCP test-retest rs-fMRI dataset, we thoroughly assessed the test-retest reproducibility of how the algorithm and sample size influence behavioral predictions.

Materials and methods

Participants

The publicly available dataset from the HCP S900 release was used in the present study (Van Essen et al., 2012, 2013). Please refer to the study by Van Essen et al. (2013) for subject inclusion/exclusion criteria. Two rs-fMRI sessions were acquired over two days for each subject, denoted REST1 and REST2. As in previous studies (Liao et al., 2017; Zalesky et al., 2014), fMRI acquisition with left-to-right phase-encoding was used. In the originally released dataset, REST1 and REST2 data were available for a total of 873 and 838 subjects, respectively. Two subjects were excluded because they had a large posterior cranial fossa arachnoid cyst, and 3 and 7 subjects were excluded from the REST1 and REST2 sessions, respectively, due to incomplete data acquisition (fewer than 1200 time points). In addition, 74 and 51 subjects were excluded from the REST1 and REST2 sessions, respectively, due to severe head motion (displacement > 3 mm, rotation > 3°). Finally, a total of 794 subjects (345 males; 22–35 years; Table 1) for the REST1 session and 778 subjects (343 males; 22–35 years) for the REST2 session were used in our predictive analyses, and their HCP IDs are provided in Supplementary Tables 2 and 3.

Behavioral/cognitive scores

The HCP dataset includes a battery of behavioral/cognitive tests (Barch et al., 2013). In the present study, the scores of 4 behavioral/cognitive tests were used as prediction targets for individuals. The tests included one motor-related test (Grip Strength Dynamometry Test [GSDT]), two language-related tests (Oral Reading Recognition Test [ORRT] and Picture Vocabulary Test [PVT]), and one spatial orientation-related test (Variable Short Penn Line Orientation Test [VSPLOT]). The details of each test are described in Supplementary Table 4. All the tests were administered using the NIH Cognition Battery toolbox, and the raw scores for each test were further transformed into age-adjusted scores with a mean of 100 and a standard deviation (SD) of 15 using the NIH National Norms toolbox. Please see Slotkin et al. (2012) for testing and scoring details. Among the subjects whose imaging data qualified for inclusion, one subject lacked the GSDT score, and 5 subjects lacked the VSPLOT score. In the present study, we chose the GSDT score as the main prediction score, as it provides the highest overall prediction accuracy relative to the other three behavioral/cognitive scores. The predictions of the other three scores were included in the validation results to evaluate the generalizability of our observed algorithm and sample size effects.

Table 1. The characteristics of HCP S900 sample subjects in our study.

Characteristic                          S900 REST1 (N = 794)   S900 REST2 (N = 778)
Age (y, mean (SD))                      28.79 (3.67)           28.76 (3.69)
Gender (male, %)                        43.45                  44.09
Race (White, %) a                       75.57                  74.42
Ethnicity (Not Hispanic/Latino, %) b    90.81                  91.00

a Race was coded as White, Black or African American, American Indian/Alaskan Native, Asian/Native Hawaiian/Other Pacific Islander, More than one.
b Ethnicity was coded as Hispanic/Latino, Not Hispanic/Latino.


MRI acquisition and preprocessing

In the HCP, high-resolution (2-mm isotropic voxels) resting-state fMRI images were acquired using a customized Siemens Skyra 3-T scanner with a 32-channel head coil. The functional images were first preprocessed by the fMRIVolume pipeline, which included gradient distortion correction, motion correction, echo-planar imaging (EPI) distortion correction, registration to the Montreal Neurological Institute (MNI) space, intensity normalization to a global mean, and masking out non-brain voxels. For details on data acquisition and preprocessing, see the study by Glasser and colleagues (Glasser et al., 2013).

After the above preprocessing, DPARSFA (part of DPABI) (Yan and Zang, 2010; Yan et al., 2016) was applied to remove the linear trend and several nuisance signals, including Friston's 24 head motion parameters (Friston et al., 1996), the global signal, and the average white matter (WM) and cerebrospinal fluid (CSF) signals. Finally, temporal bandpass filtering (0.01–0.1 Hz) was performed voxel-by-voxel.

Whole-brain rsFC and rsFCS feature extraction

The human Brainnetome atlas (http://atlas.brainnetome.org/) was applied to parcellate the entire gray matter into 246 regions (123 in each hemisphere) consisting of 210 cortical and 36 subcortical regions (Fan et al., 2016). This atlas is connectivity-based and, therefore, is recommended for regional functional connectivity and brain network analyses (Dresler et al., 2017). For each subject, a regional mean time series was calculated by averaging the time series over all voxels within the region, yielding a total of 246 regional mean time series. The rsFC between each pair of regions (30,135 pairs in total) was computed using the Pearson correlation to yield a whole-brain rsFC feature vector of 30,135 features for each subject (Fig. 1). For each region, the rsFCS was calculated, which corresponds to the centrality measure in graph theory and is simply defined as the sum of the rsFC values between that region and all the other regions (245 in total) (Buckner et al., 2009; Liu et al., 2017). A whole-brain rsFCS feature vector was then extracted for each subject, which can be taken as a lower dimensional feature of rsFC. The whole-brain rsFC and rsFCS feature vectors were independently used in the prediction analysis (Fig. 1). The motivation for including both rsFC and rsFCS features in the present study was to see whether and how feature dimensionality influences our results.

Fig. 1. Schematic overview of the analysis framework. The human Brainnetome atlas (http://atlas.brainnetome.org/) was applied to parcellate the entire gray matter into 246 regions. Using a resting-state fMRI dataset, a 246 × 246 symmetric rsFC matrix was first obtained, and all lower triangle elements of the matrix (30,135 in total) were extracted as the whole-brain rsFC feature vector for each subject. For each region, the rsFC strength (rsFCS) was calculated as the sum of the rsFC values of that region with all other regions. These rsFCS values (246 in total) were then combined as a whole-brain rsFCS feature vector for each subject. Both whole-brain rsFC and rsFCS features were applied to separately predict individual behavioral/cognitive scores by six regression algorithms.

ML regression algorithms

The present study included six ML linear regression algorithms commonly used in the neuroimaging field: OLS regression, LASSO regression, ridge regression, elastic-net regression, LSVR, and RVR. We confined our analysis to linear models because of their interpretability and resilience to overfitting in high-dimensional datasets (Kragel et al., 2012). Theoretically, SVR/RVR can be a non-linear regression model by applying a non-linear kernel and mapping the inputs into high-dimensional feature spaces (Tipping, 2001; Vapnik, 2000). Here, we selected a linear kernel for both SVR and RVR, so both are linear models in the present study. These linear regression models can be formulated as follows:

$$\hat{Y} = \sum_{j=1}^{p} \hat{\beta}_j X_j + \hat{\beta}_0$$

where $\hat{Y} = (\hat{y}_1, \ldots, \hat{y}_n)^T$, $\hat{y}_i$ $(i = 1, \ldots, n)$ is the predicted value of the behavioral score for the ith subject, $X_j = (x_{1,j}, \ldots, x_{n,j})^T$, $x_{i,j}$ is the value of the jth feature for the ith subject, and $\hat{\beta}_j$ is the regression coefficient of the jth feature. Suppose we are given training data $((x_1, y_1), \ldots, (x_N, y_N))$, where N is the number of training samples, $x_i$ is a high-dimensional feature vector $(x_{i,1}, \ldots, x_{i,p})$, p is the number of features, and $y_i$ is the actual behavioral score. The six linear regression algorithms share the common goal of finding a function $f(x_i) = \sum_{j=1}^{p} \beta_j x_{i,j} + \beta_0$ that best predicts the actual behavioral score $y_i$, but they differ in how they fit the regression coefficients using the training data.

OLS regression

The OLS regression algorithm fits a linear model by minimizing the residual sum of squares between the observed $y_i$ in the training dataset and the values $f(x_i)$ predicted by the linear model. The objective function takes the following form:

$$\min_{\beta} \sum_{i=1}^{N} \left( f(x_i) - y_i \right)^2$$

where $y_i$ is the actual value of the behavioral score. The Moore-Penrose pseudo-inverse approach was used to solve the minimization problem of this objective function, and the singular value decomposition (SVD) was used to find the pseudo-inverse (Casanova et al., 2012; MacAusland, 2014; Mosic and Djordjevic, 2009). If X has full column rank, β has the following analytical solution:

$$\hat{\beta} = X^{+} y = (X^T X)^{-1} X^T y$$

where $X^{+} = (X^T X)^{-1} X^T$ denotes the Moore-Penrose pseudo-inverse, and X is an N × p matrix in which each row is the feature vector of one subject. However, OLS regression tends to over-fit when the data are noisy; that is, the acquired model performs well when predicting the training samples but fails when predicting a new/unseen sample. In contrast, ridge regression, LASSO regression, elastic-net regression, LSVR, and RVR apply various regularization techniques to maximize the generalizability of predicting unseen samples in noisy data (Smola and Scholkopf, 2004; Tipping, 2001; Zou and Hastie, 2005).

Ridge regression

Ridge regression develops a model that minimizes the sum of the squared prediction error in the training data and an L2-norm regularization, i.e., the sum of the squares of the regression coefficients (Hoerl and Kennard, 1970). The objective function is as follows:

$$\min_{\beta} \sum_{i=1}^{N} \left( f(x_i) - y_i \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

This technique can shrink the regression coefficients, resulting in better generalizability for predicting unseen samples. In this algorithm, a regularization parameter λ is used to control the trade-off between the prediction error of the training data and the L2-norm regularization, i.e., a trade-off of penalties between the bias and variance. A large λ corresponds to more penalties on variance, and a small λ corresponds to more penalties on bias (Zou and Hastie, 2005). Compared with OLS, ridge regression can better deal with the problem of multicollinearity (Vinod, 1978) and avoid overfitting through this bias-variance trade-off.

LASSO regression

LASSO regression applies L1-norm regularization to the OLS loss function, aiming to minimize the sum of the absolute values of the regression coefficients (Tibshirani, 1996). The objective function takes the following form:

$$\min_{\beta} \sum_{i=1}^{N} \left( f(x_i) - y_i \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$

This L1-norm regularization typically sets most coefficients to zero and retains one random feature among the correlated ones (Zou and Hastie, 2005). Thus, LASSO regression results in a very sparse predictive model, which facilitates optimization of the predictors and reduces the model complexity. Notably, LASSO can only select a maximum of N-1 features in the final model, where N is the sample size (Efron et al., 2004; Ryali et al., 2012). This can be problematic for a regression with few samples but a large number of features. As in ridge regression, an algorithm parameter λ is used to control the trade-off between the prediction error on the training data and the L1-norm regularization, i.e., the trade-off of penalties between the bias and variance.

Elastic-net regression

Elastic-net regression aims to overcome the limitations of the LASSO method by combining L1-norm and L2-norm regularizations in the OLS loss function (Zou and Hastie, 2005). The objective function takes the following form:

$$\min_{\beta} \sum_{i=1}^{N} \left( f(x_i) - y_i \right)^2 + \lambda \sum_{j=1}^{p} \left( \alpha |\beta_j| + \frac{1}{2} (1 - \alpha) \beta_j^2 \right)$$

Therefore, elastic-net regression is essentially a combination of LASSO regression and ridge regression, which allows the number of selected features to be larger than the sample size while achieving a sparse model (Carroll et al., 2009; Zou and Hastie, 2005). Again, a regularization parameter λ is used to control the trade-off between the prediction error on the training data and the regularization, i.e., the trade-off of penalties between the bias and variance. In addition, a mixing parameter α is used to control the relative weighting of the L1-norm and L2-norm contributions.
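To make the roles of λ and α concrete, the following is a minimal sketch of a coordinate-descent solver for the elastic-net objective. It is not the authors' implementation (the study used scikit-learn). It assumes the loss is scaled by 1/(2N), as in scikit-learn's documented elastic-net objective, and that features and scores are centered so no intercept is needed. The function names `soft_threshold` and `elastic_net_cd` and the synthetic data are illustrative only. Setting α = 1 gives LASSO and α = 0 gives ridge regression.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator that produces exact zeros under the L1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def elastic_net_cd(X, y, lam, alpha, n_iter=500):
    """Cyclic coordinate descent for (assumed scaling)
    1/(2N) * ||y - X b||^2 + lam * (alpha * ||b||_1 + 0.5 * (1 - alpha) * ||b||_2^2)."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.copy()                       # equals y - X @ beta while beta is zero
    col_sq = (X ** 2).sum(axis=0) / n      # per-feature curvature of the loss
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that excludes feature j
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1.0 - alpha))
            resid += X[:, j] * (beta[j] - new)   # keep the residual in sync
            beta[j] = new
    return beta

# Synthetic example with more features than samples (not HCP data)
rng = np.random.default_rng(0)
n, p, lam = 30, 50, 0.5
X = rng.normal(size=(n, p))
y = X[:, :3] @ np.array([2.0, -1.5, 1.0]) + 0.1 * rng.normal(size=n)
X, y = X - X.mean(axis=0), y - y.mean()

beta_lasso = elastic_net_cd(X, y, lam, alpha=1.0)   # L1 only
beta_enet = elastic_net_cd(X, y, lam, alpha=0.5)    # mixture of L1 and L2
beta_ridge = elastic_net_cd(X, y, lam, alpha=0.0)   # L2 only

# Closed-form ridge solution under the same scaling, for comparison
ridge_closed = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

for name, b in [("LASSO", beta_lasso), ("elastic-net", beta_enet), ("ridge", beta_ridge)]:
    print(name, "non-zero coefficients:", np.count_nonzero(b))
```

With α = 0 the soft-threshold never zeroes a coefficient, so all features stay in the model, whereas the L1 term in LASSO and elastic-net drives some coefficients exactly to zero.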

LSVR

In contrast to the squared loss function in the above methods, LSVR applies Vapnik's ε-insensitive loss function to fit the linear model (Smola and Scholkopf, 2004; Vapnik, 2000). Specifically, it aims to find a function $f(x_i)$ whose predicted value deviates by no more than ε from the actual $y_i$ for all the training data while maximizing the flatness of the function. Flatness maximization is implemented via an L2-norm regularization by minimizing the squared sum of the regression coefficients. The objective function takes the following form:

$$\min_{\beta} \frac{1}{2} \sum_{j=1}^{p} \beta_j^2 + C \sum_{i=1}^{l} \left( \xi_i + \xi_i^* \right)$$

$$\text{subject to} \quad \begin{cases} y_i - f(x_i) \le \varepsilon + \xi_i \\ f(x_i) - y_i \le \varepsilon + \xi_i^* \\ \xi_i, \xi_i^* \ge 0 \end{cases}$$

where l is the number of 'support vectors', which are the samples that deviate by more than ε from the actual $y_i$ used to fit the model. A weight (i.e., $\alpha_s$) is generated for each of these 'support vectors' by the algorithm, and the regression coefficients of all features are calculated as the weighted sum of the feature vectors of these samples. Specifically,

$$f(x_i) = \sum_{j=1}^{p} \beta_j x_{i,j} + \beta_0 = \sum_{j=1}^{p} \left( \sum_{s=1}^{l} \alpha_s x_{s,j} \right) x_{i,j} + \beta_0 = \sum_{s=1}^{l} \alpha_s (x_i \cdot x_s) + \beta_0$$

where $x_s \cdot x_i$ is called the linear kernel. A parameter C controls the trade-off between how strongly the samples that deviate by more than ε are tolerated and the flatness of the regression line, i.e., the trade-off of penalties between the bias and variance. A large C corresponds to more penalties on bias, and a small C corresponds to more penalties on variance.

Relevance vector regression (RVR)

RVR is formulated in a Bayesian framework and has an identical functional form to SVR (Tipping, 2001). The function also takes the following form:

$$f(x_i) = \sum_{s=1}^{l} \beta_s (x_i \cdot x_s) + \beta_0$$

Like LSVR, only some samples (l < N), termed the 'relevance vectors', are used to fit the model in RVR. Here, the predicted target value $t = \{t_i\}_{i=1}^{N}$ is modeled as

$$t_i = f(x_i) + \varepsilon_n$$

where $\varepsilon_n$ is the measurement noise. Specifically, an explicit zero-mean Gaussian prior was applied on the parameter β,

$$p(\beta \mid \alpha) = \prod_{i=0}^{N} \mathcal{N}\left( \beta_i \mid 0, \alpha_i^{-1} \right)$$

and therefore most weights were set to zero, which resulted in fewer 'relevance vector' samples than 'support vector' samples in LSVR. Maximum likelihood estimation was then used to find the weights of these samples:

$$p(t \mid \beta, \sigma^2) = (2\pi\sigma^2)^{-N/2} \exp\left\{ -\frac{1}{2\sigma^2} \left\| t - \Phi\beta \right\|^2 \right\}$$

where $\Phi_s(x) = x \cdot x_s$. Similarly, the regression coefficients of all features are determined as the weighted sum of the feature vectors of all 'relevance vector' samples. Notably, this algorithm has no algorithm-specific parameter and, therefore, does not require extra computational resources to estimate optimal algorithm-specific parameters.

The scikit-learn library (version 0.16.1) was used to implement OLS regression, LASSO regression, ridge regression and elastic-net regression (http://scikit-learn.org/) (Pedregosa et al., 2011), the LIBSVM function in MATLAB was used to implement LSVR (https://www.csie.ntu.edu.tw/~cjlin/libsvm/) (Chang and Lin, 2011), and the PRoNTo toolbox (http://www.mlnl.cs.ucl.ac.uk/pronto/) was used to implement RVR (Schrouff et al., 2013).

Individualized prediction framework

A schematic overview of our prediction framework is shown in Fig. 1 and Supplementary Fig. 1. The 6 regression algorithms were applied separately to the whole-brain rsFC and rsFCS features. To quantify the prediction accuracy, we applied 5-fold cross-validation (5F-CV) in all algorithms. For LASSO regression, ridge regression, elastic-net regression, and LSVR, a nested 5F-CV was applied, with the outer 5F-CV loop estimating the generalizability of the model and the inner 5F-CV loop determining the optimal parameters (e.g., λ, α, or C) for these algorithms. This nested 5F-CV procedure is elaborated below.

Outer 5F-CV

In the outer 5F-CV, all subjects were divided into 5 subsets. Here, we sorted the subjects according to their behavioral scores and then assigned individuals with a rank of (1st, 6th, 11th, …) to the first subset, (2nd, 7th, 12th, …) to the second subset, (3rd, 8th, 13th, …) to the third subset, (4th, 9th, 14th, …) to the fourth subset, and (5th, 10th, 15th, …) to the fifth subset. This splitting approach prevented random bias between subsets and, more importantly, avoided the overwhelmingly intensive computation caused by the multiple repetitions required in a random splitting scheme (Cui et al., 2018). Of the five subsets, four were combined as the training set, and the remaining subset was used as the testing set. To avoid features in greater numeric ranges dominating those in smaller numeric ranges, each feature was linearly scaled to the range of 0–1 across the training dataset, and the scaling parameters were also applied to scale the testing dataset (Cui et al., 2018; Erus et al., 2015; Hsu et al., 2003). A prediction model was constructed using all the training samples and then used to predict the scores of the testing samples. The Pearson correlation coefficient and mean absolute error (MAE) between the actual scores and the predicted scores were computed to quantify the accuracy of the prediction (Cui et al., 2018; Erus et al., 2015; Siegel et al., 2016). The training and testing procedures were repeated 5 times so that each of the 5 subsets was used once as the testing set. To yield the final accuracies, we averaged the correlations and the MAEs across the five iterations, respectively.

Inner 5F-CV and parameter tuning

Within each loop of the outer 5F-CV, we applied inner 5F-CVs to determine the optimal parameters (i.e., λ, α, or C) for the relevant regression algorithms. Specifically, the C parameter is the coefficient of the training error in LSVR, and the λ parameter is the coefficient of the regularization term in LASSO/ridge/elastic-net regression. Since C and λ act in opposite directions, we chose C from among 16 values $[2^{-5}, 2^{-4}, \ldots, 2^{9}, 2^{10}]$ (Hsu et al., 2003) and, accordingly, λ from among 16 values $[2^{-10}, 2^{-9}, \ldots, 2^{4}, 2^{5}]$. For elastic-net regression, there is another mixing parameter α that was chosen from among 11 values, i.e., [0, 0.1, …, 0.9, 1]. Given the two parameters, a grid search was applied for elastic-net regression, resulting in 176 (λ, α) parameter sets (16 × 11) in total. For each algorithm with one parameter (i.e., LSVR, LASSO, and ridge regression) or two parameters (i.e., elastic-net regression), the training set for each loop of the outer 5F-CV was further partitioned into 5 subsets according to their behavioral score rank, as for the outer loop. Four subsets were selected to train the model under a given parameter C (LSVR) or λ (LASSO and ridge regression), or a given parameter set (λ, α) (elastic-net regression), and the remaining subset was used to test the model. This procedure was repeated 5 times such that each subset was used once as the testing dataset, resulting in 5 inner 5F-CV loops in total. For each inner 5F-CV loop, one correlation r and one mean absolute error (MAE) were generated for each parameter or parameter set, and a mean value across the 5 inner loops was then obtained for the MAE and correlation r, respectively. The sum of the mean correlation r and the reciprocal of the mean MAE was defined as the inner prediction accuracy, and the parameter or parameter set with the highest inner prediction accuracy was chosen as the optimal parameter or parameter set (Cui et al., 2018). Of note, the mean correlation r and the reciprocal of the mean MAE cannot be summed directly, because the scales of the raw values of these two measures are quite different. Therefore, we normalized the mean correlation r and the reciprocal of the mean MAE across all values (i.e., 16 values in LASSO regression, ridge regression, and LSVR, and 176 values in elastic-net regression) and then summed the resultant normalized values. Accordingly, each loop of the outer 5F-CV yielded a specific optimal parameter or parameter set. The optimal parameter λ or C, or parameter set (λ, α), was then used to estimate the final predictive model with the training set of the outer 5F-CV loop. Notably, OLS regression and RVR do not have algorithm-specific parameters, and therefore, the above inner 5F-CVs were not applied in these two algorithms.

for 722 subjects. Each of the subjects can have two predicted scores, one from the REST1 data and the other from the REST2 data. To cross-validate the test-retest predictions between the REST1 and REST2 datasets, we correlated the two predicted scores across all 722 subjects using Pearson correlation. In addition, the intra-class correlation coefficient (ICC) of the test-retest predicted scores was calculated for all algorithms (Braun et al., 2012; Shrout and Fleiss, 1979; Zhao et al., 2015).

Algorithm similarity in individualized predictions

To explore the individual prediction similarity among the six algorithms, we correlated the predicted scores across all individuals between each pair of algorithms, resulting in an algorithm-by-algorithm similarity matrix. This 6 × 6 similarity matrix was converted into a distance matrix (i.e., 1 − similarity matrix), and a hierarchical clustering method (i.e., the average linkage agglomerative algorithm) was then applied (Legendre and Legendre, 2012; Zhong et al., 2015).

Algorithm similarity in the spatial pattern of feature importance

In a linear prediction model, the absolute value of the weight/regression coefficient represents the importance of the corresponding feature in a prediction (Erus et al., 2015; Mourao-Miranda et al., 2005; Siegel et al., 2016). For each algorithm, we trained a prediction model using all the samples (i.e., 794 subjects for REST1 and 778 for REST2). The resultant absolute weights were then correlated across all features between each pair of algorithms. As above, the average linkage agglomerative algorithm was then applied to the distance matrix. It should be noted that some weighted combination of all features is the driving source for regression predictions, and complex interactions exist among these features. These interactions make it very difficult to mathematically quantify such weighted patterns/combinations of all features. The comparative strategy adopted here (i.e., the linear correlation across feature weight vectors between algorithms) is therefore sub-optimal, because it simply treats each feature in isolation.
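The similarity-to-clustering step described above can be sketched as follows. This is an illustrative NumPy-only version, not the authors' code. The helper `average_linkage` is a hypothetical name, and the six "algorithms" are synthetic prediction vectors deliberately built as two groups of three, so the grouping carries no empirical meaning.

```python
import numpy as np

def average_linkage(dist):
    """Agglomerative clustering with average linkage on a square distance matrix.
    Returns the merge history as (members_a, members_b, distance) tuples."""
    clusters = [[i] for i in range(dist.shape[0])]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean pairwise distance between the two clusters
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], float(d)))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Synthetic predicted scores: rows = 6 placeholder algorithms, columns = subjects
rng = np.random.default_rng(0)
n_subjects = 200
base_1, base_2 = rng.normal(size=(2, n_subjects))
preds = np.vstack(
    [base_1 + 0.1 * rng.normal(size=n_subjects) for _ in range(3)]
    + [base_2 + 0.1 * rng.normal(size=n_subjects) for _ in range(3)]
)

similarity = np.corrcoef(preds)   # 6 x 6 algorithm-by-algorithm Pearson correlations
distance = 1.0 - similarity       # distance matrix, i.e., 1 - similarity
merges = average_linkage(distance)

for left, right, d in merges:
    print(left, "+", right, "at distance", round(d, 3))
```

The same two steps apply to the feature-importance comparison if the rows of `preds` are replaced by the absolute weight vectors of each algorithm.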

Sample size and sub-sampling

To explore the effect of sample size on prediction accuracies, we sampled subsets with different sample sizes from the full cohort. Putatively, ML prediction performance should be less sensitive to sample size differences as the sample size increases. To reduce the computational burden, subset sample sizes were therefore chosen in increments of 20 from 20 to 300 and then in increments of 40 from 300 to 700. This procedure resulted in 25 sample sizes. For each sample size, we carried out random sampling 50 times, each time sampling without replacement. The mean and SD of the resulting 50 prediction accuracies (i.e., correlation r) were then calculated. It is important to ascertain the approximate functional form by which the mean or SD of prediction accuracies depends on the sample size, i.e., which model/function fits our observed data well. Here, we applied two candidate model/function forms to fit the mean or SD of the prediction accuracies: a linear function, f(x) = a * x + b, and an exponential function, f(x) = a * exp(x/b) + c. These two candidate models/functions are widely used, and visual inspection strongly suggested an exponential model. To evaluate the goodness-of-fit, r² values were used (a value closer to 1 indicates a better fit).
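The sub-sampling grid and the comparison of the two candidate functions can be sketched as follows. This is a minimal sketch, not the authors' code: the helper names, the starting values `p0_exp`, and the use of the coefficient of determination (1 - SS_res/SS_tot) as r² are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit


def sample_sizes():
    """20 to 300 in steps of 20, then 340 to 700 in steps of 40 (25 sizes)."""
    return list(range(20, 301, 20)) + list(range(340, 701, 40))


def draw_subset(n_subjects, size, rng):
    """Indices of one random subset, sampled without replacement."""
    return rng.choice(n_subjects, size=size, replace=False)


def linear(x, a, b):
    return a * x + b


def exponential(x, a, b, c):
    return a * np.exp(x / b) + c


def r_squared(y, y_fit):
    # ASSUMPTION: r^2 taken as the coefficient of determination.
    ss_res = np.sum((y - y_fit) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot


def compare_fits(x, y, p0_exp=(-0.5, -100.0, 0.5)):
    """Fit both candidate functions and return (r2_linear, r2_exponential).

    p0_exp is a hypothetical starting guess suited to a saturating curve
    (negative a and b); other data may need different starting values.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    lin_params, _ = curve_fit(linear, x, y)
    exp_params, _ = curve_fit(exponential, x, y, p0=p0_exp, maxfev=10000)
    return (r_squared(y, linear(x, *lin_params)),
            r_squared(y, exponential(x, *exp_params)))
```

In use, `x` would be the 25 sample sizes and `y` the mean (or SD) of the 50 prediction accuracies obtained at each size.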

Computational cost

Given the huge number of features and samples, computational cost is becoming an increasingly important concern when selecting an appropriate ML regression algorithm. To quantify the computational cost, we recorded the running time of the six algorithms by running each algorithm with the above-specified procedure on a single core of the same server. To evaluate the effect of the parameter optimization procedure, we tested the running time of the algorithms with parameters under two conditions: 1) when the parameter or parameter set was predetermined; and 2) when the optimal parameter or parameter set was determined using the inner 5F-CV. The Python/Matlab functions/scripts for the six regression algorithms have been made available online: https://github.com/ZaixuCui/Pattern_Regression.

Generalizability of the algorithm and sample size effect

Notably, the rsFC and rsFCS features in our behavioral/cognitive predictions were extracted after applying global signal regression (GSR), which is a controversial processing step in resting-state functional connectivity analyses (Fox et al., 2009; Murphy et al., 2009; Murphy and Fox, 2017). To explore how GSR influenced our main results, we computed the rsFC and rsFCS features without applying GSR and reran the same procedure to predict individual GSDT scores. To evaluate whether the patterns of how the ML regression algorithm and sample size influence the GSDT prediction can be generalized to other behavioral/cognitive predictions, we applied the same procedure to predict ORRT, PVT, and VSPLOT scores using either whole-brain rsFC or rsFCS features (with GSR applied).

Results

Algorithm effect

In Fig. 2 and Supplementary Fig. 2, the correlation r and MAE of each algorithm are plotted as a function of sample size. Here, the correlation r was taken as the main metric of prediction accuracy. Using the whole-brain rsFC features (30,135 in total) of the REST1 dataset, LASSO regression yielded markedly lower GSDT prediction accuracies for sample sizes greater than 60 (Fig. 2A). The other five regression algorithms performed similarly, exhibiting very large ranges of overlap in GSDT prediction accuracy regardless of the sample size. In contrast, using the whole-brain rsFCS features (246 in total), OLS regression performed much worse than the other algorithms when the sample size was greater

Test-retest validation of individualized predictions In our dataset, both quantified REST1 and REST2 data were available


Fig. 2. The mean and SD of GSDT prediction accuracies (correlation r) of the 50 sample subsets. The six regression algorithms are marked with different colors, and 25 sample sizes were selected from 20 to 700. (A) rsFC-based prediction using the REST1 dataset; (B) rsFC-based prediction using the REST2 dataset; (C) rsFCS-based prediction using the REST1 dataset; (D) rsFCS-based prediction using the REST2 dataset. See Supplementary Fig. 2 for the mean absolute error (MAE) changes.

than approximately 160 (Fig. 2C). Among the other five algorithms, LASSO regression and LSVR corresponded to slightly lower prediction accuracies than ridge regression, RVR, and elastic-net regression as the sample size increased. It should be noted that the whole-brain rsFC features yielded relatively higher prediction accuracies overall than the whole-brain rsFCS features. The mean prediction accuracy (mean r of the 50 sampling subsets) reached a maximum greater than 0.5 for the rsFC feature but a maximum lower than 0.3 for the rsFCS feature. As shown, these observed differences in patterns among the six algorithms and between feature types were highly repeatable for the GSDT predictions using the REST2 dataset (Fig. 2B and D, Supplementary Fig. 2B and 2D). In addition, to evaluate the reproducibility of these results in an independent dataset, we re-ran the analyses above using the rs-fMRI data of new HCP subjects, who were included in the latest HCP S1200 dataset but not in the S900 dataset (236 individuals in total; see more details in Supplementary Tables 5-7). As shown in Supplementary Fig. 3, the resultant patterns across the same range of sample sizes are highly compatible with the main results of HCP S900: LASSO regression performed worse than the other algorithms for the rsFC feature, and OLS regression performed worse than the other algorithms for the rsFCS feature.

relationship between the prediction accuracy and sample size was captured well. As for OLS regression, there was a replicable dip in accuracy for the rsFCS-based but not the rsFC-based predictions, most pronounced at a sample size of around 300, which likely reflects overfitting of the rsFCS-based OLS regression to noise in the data.

Generalizability of the algorithm and sample size effects

As illustrated in Supplementary Fig. 4, the GSDT prediction results remained quite similar when we applied random divisions for the five folds during the cross-validation. In addition, the individual GSDT scores were re-predicted using rsFC or rsFCS features derived without applying the GSR preprocessing step. The resultant patterns of the algorithm and sample size effects were well preserved (Fig. 4 and Supplementary Fig. 5). This indicated a limited effect of the GSR step on our results, supporting the generalization of our observed algorithm and sample size effects. In addition to GSDT prediction, ORRT, PVT, and VSPLOT scores were also predicted using either the whole-brain rsFC or rsFCS features (REST1). As illustrated in Fig. 5 and Supplementary Fig. 6, for these three behavioral/cognitive scores, the pattern of performance difference among the algorithms was quite similar to that of the GSDT prediction: LASSO regression corresponded to significantly lower accuracies for the rsFC feature, and OLS regression performed significantly worse than the other algorithms for the rsFCS features. Similarly, the mean and SD of the prediction accuracies of the 50 sampling subsets increased and decreased, respectively, as the sample size increased, regardless of the algorithm. These results indicated that the above observed pattern of how the algorithm and sample size influence predictions was not confined to the GSDT prediction but can be generalized to predictions of other behavioral/cognitive scores.

Sample size effect

As clearly shown (Fig. 2 and Supplementary Fig. 3), the mean GSDT prediction accuracies of the 50 sampling subsets increased with the sample size, regardless of the algorithm. Conversely, the SD decreased as the sample size increased. Except for OLS regression, the exponential form (i.e., f(x) = a * exp(x/b) + c) better fitted the mean and SD of prediction accuracies using either the REST1 or REST2 dataset, regardless of the algorithm and feature type (Fig. 3). In particular, the exponential model explained more than 87% of the variance of the mean and SD (r² > 0.87), indicating excellent fitting performance and that the


Fig. 3. Model fitting for the mean and SD of GSDT prediction accuracies (correlation r) of the 50 sample subsets. (A) Scatter plots for the REST1 dataset; (B) Scatter plots for the REST2 dataset. Two candidate functions were fitted: a linear function, f(x) = a * x + b, and an exponential function, f(x) = a * exp(x/b) + c. The r² was calculated as the goodness-of-fit. Except for OLS regression using the rsFCS feature, the exponential function fitted the mean and SD significantly better.

robustness of the individual prediction. Given the better prediction performance of the rsFC feature, the correlation of test-retest predicted scores based on the rsFC feature is expected to be higher than that based on the rsFCS features.

Test-retest validation of individualized prediction

To cross-validate individual predictions between the REST1 and REST2 datasets, 722 subjects with available quantified REST1 and REST2 data were used. Fig. 6A and B shows scatter plots between the actual GSDT scores and the GSDT scores predicted by each algorithm. Regardless of the algorithm, the whole-brain rsFC feature-based prediction outperformed the rsFCS feature-based prediction using either the REST1 or REST2 dataset. Furthermore, the predicted GSDT scores using the REST1 dataset were significantly correlated with the predicted scores using the REST2 dataset (Fig. 6C). The ICC values for the test-retest predicted scores were 0.65, 0.66, 0.59, 0.67, 0.67, and 0.67 for the rsFC-based OLS regression, ridge regression, LASSO regression, elastic-net regression, LSVR, and RVR, respectively. For the rsFCS-based prediction, the ICC values were 0.27, 0.38, 0.32, 0.36, 0.40, and 0.35, respectively. These results together suggested a decent test-retest

Algorithm similarity in individualized predictions

The hierarchical clustering of the six algorithms in terms of individual prediction similarity yielded the same clusters for both the REST1 and REST2 datasets (Fig. 6D and E). Specifically, for the rsFC-based prediction, LSVR and RVR together formed one cluster (i.e., showing high similarity among within-cluster algorithms but relatively low similarity among algorithms outside their cluster) and elastic-net regression, ridge regression, and OLS regression formed another cluster. In contrast, LASSO regression showed a very low degree of similarity with all other algorithms, which is compatible with its significantly lower prediction


Fig. 4. The accuracies of GSDT prediction using the FC features that were derived without applying GSR during the processing procedure. (A) rsFC-based prediction using the REST1 dataset; (B) rsFC-based prediction using the REST2 dataset; (C) rsFCS-based prediction using the REST1 dataset; (D) rsFCS-based prediction using the REST2 dataset. See Supplementary Fig. 5 for the mean absolute error (MAE) changes.

accuracies relative to the other algorithms (Fig. 2A). The rsFCS feature-specific algorithm prediction similarity exhibited a slightly different pattern (Fig. 6F and G): while LSVR and RVR also belonged to one cluster, LASSO regression was clustered with elastic-net and ridge regression. In particular, OLS regression was isolated from all the other algorithms, which corresponded to its relatively lower rsFCS-based prediction performance (Fig. 2C).

Algorithm similarity in the spatial pattern of feature importance

The 6 * 6 algorithm similarity matrices of the whole-brain rsFC and rsFCS features are illustrated in Fig. 7 and Fig. 8, respectively. For the rsFC-based GSDT prediction, the spatial patterns of the importance of the rsFC features exhibited relatively high between-algorithm correlations (r > 0.95) among OLS regression, ridge regression, elastic-net regression, and LSVR across the similarity matrix. While RVR exhibited smaller correlations with these four algorithms than they exhibited among themselves, the absolute correlation value was still relatively high (r > 0.85). In contrast, LASSO regression had a relatively low correlation (r ≈ 0.20) with the other 5 algorithms, which is in line with the significantly lower prediction accuracies of LASSO regression relative to the prediction accuracies of the other algorithms. The spatial distribution of the top 100 rsFC features with the highest absolute weights is illustrated in Fig. 7. As expected, OLS regression, ridge regression, elastic-net regression, and LSVR showed similar patterns. Specifically, these rsFC features with the greatest contributions were largely connected with the superior parietal area, primary motor area, middle/inferior temporal gyrus, superior/middle/inferior frontal gyrus, middle occipital gyrus, basal ganglia, and thalamus. Regarding the similarities of the rsFCS feature importance pattern among algorithms (Fig. 8), the correlations of OLS regression and RVR with the other algorithms were relatively low (mostly r < 0.70) across the similarity matrix. In contrast, LASSO regression, ridge regression, elastic-net regression, and LSVR corresponded to relatively higher correlations among themselves (mostly r > 0.75). Furthermore, the top 10 rsFCS features (regions) with the highest absolute weights were similar among LASSO regression, ridge regression, elastic-net regression, and LSVR. These regions with the greatest contributions primarily involved the inferior frontal gyrus, the supplementary motor area, the insula, the inferior temporal gyrus, and the superior parietal gyrus. Again, the observed results were highly consistent for both the REST1 and REST2 datasets.

Computational cost

The running times of the six algorithms are listed in Table 2. As expected, the whole-brain rsFCS feature-based predictions were faster than the rsFC feature-based predictions, given the markedly fewer rsFCS features. Optimizing the parameter or parameter set with inner 5F-CVs dramatically increased the running time for ridge regression, elastic-net regression, LASSO regression, and LSVR. Overall, OLS regression and RVR had the shortest running times, as they did not include the parameter optimization step. Notably, some software-related bias might have been introduced into our results, given that the six algorithms were implemented in different software packages.

Discussion

Using the large HCP dataset, the present study compared rsFC/rsFCS-based predictions among 6 commonly used ML regression algorithms and evaluated the effects of sample size on prediction performance. The results showed that ridge regression, elastic-net regression, LSVR, and RVR performed quite similarly for both rsFC- and rsFCS-based predictions. However, LASSO regression performed remarkably worse than the other algorithms based on rsFC features, while OLS regression performed markedly worse than the other algorithms based on rsFCS features. In particular, the prediction performance of all algorithms became increasingly stable and better on average as the sample size increased.


Fig. 5. The mean and SD of prediction accuracies of the 50 sample subsets for the other three behavioral scores: ORRT, PVT, and VSPLOT. (A) rsFC-based ORRT prediction; (B) rsFCS-based ORRT prediction; (C) rsFC-based PVT prediction; (D) rsFCS-based PVT prediction; (E) rsFC-based VSPLOT prediction; (F) rsFCS-based VSPLOT prediction. See Supplementary Fig. 6 for the mean absolute error (MAE) changes.

associated matrices (Casanova et al., 2012). Often, ill-conditioning is associated with sensitivity of the prediction to data noise, which can be effectively overcome by using regularization. Therefore, the impact of regularization on the prediction performance of a regression algorithm will depend on the degree of ill-conditioning and the noise level of a given problem. OLS has been reported previously to produce performance comparable to that of regularized algorithms in the context of some classification problems (Casanova et al., 2012; Raizada et al., 2010). In our work, we observed that rsFC-based prediction using OLS regression performed comparably with the other algorithms, but the rsFCS-based prediction using OLS regression performed worse than the other algorithms, possibly due to its sensitivity to data noise. In addition, the observed dip in OLS accuracy around 300 samples, which is specific to the rsFCS-based prediction, may also be attributable to its sensitivity to data noise. OLS regression should therefore be applied with extreme caution in the case of a high-dimensional feature space and a small sample size, because of its unstable performance. In addition to OLS regression, the prediction performance of LASSO regression also exhibited a dependence on the feature type: it performed markedly worse than the other algorithms when the rsFC features were applied but similarly to the others when the rsFCS features were applied.

The specific patterns of the observed algorithm and sample size effects were replicated using re-test fMRI data and different imaging preprocessing methods (i.e., without applying the GSR). These patterns were also replicated in the predictions of different behavioral/cognitive scores, which strongly supports the robustness/generalizability of the observed effects. These findings provide important reference information that can be used in choosing an appropriate ML regression algorithm or sample size in individualized behavioral/cognitive prediction studies.

Regression algorithm differences

First, the relative performances of rsFC- and rsFCS-based predictions using OLS regression significantly differed. Notably, OLS regression does not apply any regularization techniques within the algorithm. In contrast, ridge regression and LSVR both apply L2-norm regularization, LASSO regression includes L1-norm regularization, elastic-net regression includes both L1-norm and L2-norm regularization, and RVR applies regularization through a Gaussian prior. The lack of regularization in the OLS regression model may account for its unstable prediction performance relative to the other algorithms. Notably, the need for regularization in high-dimensional linear regression problems may depend on the


Fig. 6. The test-retest cross-validation and algorithm-to-algorithm similarity of individualized GSDT predictions. (A) Scatter plots between the actual and predicted scores using the REST1 dataset; (B) Scatter plots between the actual and predicted scores using the REST2 dataset; (C) Scatter plots of predicted scores between the REST1 and REST2 datasets; (D) The 6 * 6 matrix representing the algorithm-to-algorithm similarity of individual rsFC-based predicted GSDT scores using the REST1 dataset. The hierarchical clustering dendrogram of the algorithms is illustrated on the right. (E) The algorithm-to-algorithm similarity matrix of individual rsFC-based predicted GSDT scores using the REST2 dataset. (F) The algorithm-to-algorithm similarity matrix of individual rsFCS-based predicted GSDT scores using the REST1 dataset. (G) The algorithm-to-algorithm similarity matrix of individual rsFCS-based predicted GSDT scores using the REST2 dataset.

should be noted that a larger number of features does not necessarily lead to poorer performance of LASSO regression relative to the other algorithms. For example, the rsFC-based LASSO regression without GSR exhibited prediction accuracies similar to those of the other algorithms (Fig. 4) and performed slightly better than the rsFC-based LASSO regression with GSR. This may be attributable to a higher degree of between-feature correlation for the rsFC without GSR, which reduced the chance of discarding useful features in LASSO's sparse model and ultimately led to better performance. Future work is needed to address this GSR-relevant issue. To overcome the limitation of LASSO regression, a reduction in feature dimensionality (e.g., principal component analysis) can be applied before using the LASSO algorithm (Wager et al., 2013). Elastic-net regression, which includes both L1-norm and L2-norm

Due to the nature of L1-norm regularization, LASSO generally selects only one random feature from among the correlated features and achieves a final sparse model, which is easy to optimize and provides better generalization. In practice, LASSO can only select a maximum of N-1 features in the final model, where N is the sample size (Efron et al., 2004; Ryali et al., 2012). In the present study, the number of whole-brain rsFC features (i.e., 30,135) was much larger than the entire sample size (