Abstract
Associations between high-dimensional datasets, each comprising many features, can be discovered through multivariate statistical methods like Canonical Correlation Analysis (CCA) or Partial Least Squares (PLS). CCA and PLS are widely used methods that reveal which features carry the association. Despite the longevity and popularity of CCA/PLS approaches, their application to high-dimensional datasets raises critical questions about the reliability of CCA/PLS solutions. In particular, overfitting can produce solutions that are not stable across datasets, which severely hinders their interpretability and generalizability. To study these issues, we developed a generative model to simulate synthetic datasets with multivariate associations, parameterized by feature dimensionality, data variance structure, and assumed latent association strength. We found that resulting CCA/PLS associations could be highly inaccurate when the number of samples per feature is relatively small. For PLS, the profiles of feature weights exhibit detrimental bias toward leading principal component axes. We confirmed these model trends in state-of-the-art datasets containing neuroimaging and behavioral measurements in large numbers of subjects, namely the Human Connectome Project (n ≈ 1000) and UK Biobank (n = 20000), where we found that only the latter comprised enough samples to obtain stable estimates. Analysis of the neuroimaging literature using CCA to map brain-behavior relationships revealed that the commonly employed sample sizes yield unstable CCA solutions. Our generative modeling framework provides a calculator of dataset properties required for stable estimates. Collectively, our study characterizes dataset properties needed to limit the potentially detrimental effects of overfitting on the stability of CCA/PLS solutions, and provides practical recommendations for future studies.
Significance Statement Scientific studies often begin with an observed association between different types of measures. When datasets comprise large numbers of features, multivariate approaches such as canonical correlation analysis (CCA) and partial least squares (PLS) are often used. These methods can reveal the profiles of features that carry the optimal association. We developed a generative model to simulate data, and characterized how obtained feature profiles can be unstable, which hinders interpretability and generalizability, unless a sufficient number of samples is available to estimate them. We determine sufficient sample sizes, depending on properties of datasets. We also show that these issues arise in neuroimaging studies of brain-behavior relationships. We provide practical guidelines and computational tools for future CCA and PLS studies.
Discovery of associations between datasets is a topic of growing importance across scientific disciplines in analysis of data comprising a large number of samples across high-dimensional sets of features. For instance, large initiatives in human neuroimaging collect, across thousands of subjects, rich multivariate neural measures as one dataset and psychometric and demographic measures as another linked dataset (1–3). A major goal is to determine, in a data-driven way, the dominant latent patterns of association linking individual variation in behavioral features to variation in neural features (4–6).
A widely employed approach to map such multivariate associations is to define linearly weighted composites of features in both datasets (e.g., neural and psychometric) and to choose the sets of weights—which correspond to axes of variation—to maximize the association strength (Fig. 1A). The resulting profiles of weights for each dataset can be examined for how the features form the association. If the association strength is measured by the correlation coefficient, the method is called canonical correlation analysis (CCA) (7), whereas if covariance is used the method is called partial least squares (PLS) (5, 8, 9). CCA and PLS are commonly employed across scientific fields, including behavioral sciences (10), biology (11, 12), biomedical engineering (13), chemistry (14), environmental sciences (15), genomics (16), and neuroimaging (4, 17–19).
Although the utility of CCA and PLS is well established, a number of open challenges exist regarding their stability in characteristic regimes of dataset properties. Stability implies that elements of CCA/PLS solutions, such as association strength and weight profiles, are reliably estimated across different independent sample sets from the same population, despite inherent variability in the data. Instability or overfitting can occur if an insufficient amount of data is available to properly constrain the model. Manifestations of instability and overfitting in CCA/PLS include inflated association strengths (20–22), cross-validated association strengths that are markedly lower than in-sample estimates (23), or feature profiles that vary from study to study (20, 23–26). Stability of models is essential for their replicability, generalizability, and interpretability. Therefore, it is important to assess how the stability of CCA/PLS solutions depends on dataset properties.
Instability of CCA/PLS solutions is in principle a known issue (6, 24). Prior studies using a small number of specific datasets or Monte-Carlo simulations have suggested using between 10 and 70 samples per feature in order to obtain stable models (21, 25, 27). However, it remains unclear how the various elements of CCA/PLS solutions (including association strengths, weights, and statistical power) differentially depend on dataset properties and sampling error, or how CCA and PLS, as distinct methods, may exhibit differential robustness across data regimes. To our knowledge, no framework exists to systematically quantify errors in CCA/PLS results, depending on the numbers of samples and features, the assumed latent correlation, and the variance structure in the data, for both CCA and PLS.
To investigate these issues, we developed a generative statistical model to simulate synthetic datasets with known latent axes of association. Sampling from the generative model allowed quantification of deviations between estimated and true CCA or PLS solutions. We found that stability of CCA/PLS solutions requires more samples than are commonly used in published neuroimaging studies. With too few samples, estimated association strengths were too high, and estimated weights could be unreliable for interpretation. CCA and PLS differed in their dependencies and robustness, in part due to PLS exhibiting a detrimental bias of weights toward principal axes. We analyzed two large state-of-the-art neuroimaging-psychometric datasets, the Human Connectome Project (2) and the UK Biobank (3), which followed trends similar to those of our model. These model and empirical findings, in conjunction with a meta-analysis of estimated stability in the brain-behavior CCA literature, suggest that typical CCA/PLS studies in neuroimaging are prone to instability. Finally, we applied the generative model to develop algorithms and a software package for calculation of estimation errors and required sample sizes for CCA/PLS. We end with 10 practical recommendations for application and interpretation of CCA and PLS in future studies (see also Tab. S1).
Results
A generative model for cross-dataset multivariate associations
To analyze sampling properties of CCA and PLS, we need to generate synthetic datasets of stochastic samples with known properties and with known correlation structure across two multivariate datasets. We therefore developed a generative statistical modeling framework that satisfies these requirements, which we refer to as GEMMR (Generative Modeling of Multivariate Relationships). GEMMR is central to all that follows, as it allows us to design and generate synthetic datasets, investigate the dependence of CCA/PLS sampling errors on dataset size and assumed covariances, estimate weight errors in CCAs reported in the literature, and calculate sample sizes required to bound estimation errors.
To describe GEMMR, first note that data for CCA and PLS consist of two datasets, given as data matrices X and Y, with respectively px and py features (columns) and an equal number n of samples (rows). We assume a principal component analysis (PCA) has been applied separately to each dataset so that, without loss of information, the columns of X and Y are principal component (PC) scores. The PC scores' variances, which are also the eigenvalues of the within-set covariance matrices SXX and SYY, are modeled to decay with a power-law dependence (Fig. 1B) for PLS, as empirical variance spectra often follow approximate power-laws (for examples, see Fig. S1). For CCA, which optimizes correlations instead of covariances, the two datasets are effectively whitened during the analysis (see Methods) and we can therefore assume that all scores' variances are 1.
Between-set associations between X and Y (Fig. 1C) are summarized in the cross-covariance matrix SXY. By performing a singular value decomposition of SXY a solution for CCA and PLS can be obtained (after whitening for CCA, see Methods) with the singular values giving the association strengths and the singular vectors encoding the weight vectors for the latent between-set association modes. Conversely, given association strengths and weight vectors for between-set association modes (i.e., the solution to CCA or PLS), the corresponding cross-covariance matrix can be assembled making use of the same singular value decomposition (see Methods and Fig. S2). The joint covariance matrix for X and Y is then composed from the within- and between-set covariances (Fig. 1D) and the normal distribution associated with this joint covariance matrix constitutes our generative model for CCA and PLS.
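For concreteness, the covariance assembly just described can be sketched in a few lines of NumPy. This is a simplified illustration rather than the GEMMR implementation: it assumes a single association mode, draws the singular vectors purely at random (omitting the explained-variance constraint described below), and all function names are ours.

```python
import numpy as np

def make_joint_cov(px, py, r_true, ax=-1.0, ay=-1.0, seed=0):
    """Assemble a joint covariance matrix with one between-set mode.

    Within-set PC variances decay as power laws (component index to
    the power ax / ay); the cross-covariance encodes a single
    association mode of strength r_true via random unit singular
    vectors. Writing Sxy = Sxx^(1/2) (r_true ux uy^T) Syy^(1/2) with
    |r_true| <= 1 guarantees the joint matrix is positive semidefinite.
    """
    rng = np.random.default_rng(seed)
    vx = np.arange(1, px + 1, dtype=float) ** ax   # within-set variances
    vy = np.arange(1, py + 1, dtype=float) ** ay
    ux = rng.standard_normal(px); ux /= np.linalg.norm(ux)
    uy = rng.standard_normal(py); uy /= np.linalg.norm(uy)
    Sxy = np.sqrt(vx)[:, None] * (r_true * np.outer(ux, uy)) * np.sqrt(vy)[None, :]
    return np.block([[np.diag(vx), Sxy], [Sxy.T, np.diag(vy)]])

def draw_dataset(Sigma, n, px, seed=0):
    """Draw n samples from the normal generative model; returns (X, Y)."""
    rng = np.random.default_rng(seed)
    Z = rng.multivariate_normal(np.zeros(len(Sigma)), Sigma, size=n)
    return Z[:, :px], Z[:, px:]
```

Setting ax = ay = 0 recovers the unit-variance case assumed for CCA.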
In the following we systematically vary the parameters on which the generative model depends and investigate their downstream effects on the stability of CCA and PLS solutions. Specifically, we vary the number of features (keeping the same number of features in both datasets for simplicity), the assumed between-set correlation, the power-laws describing the within-set variances (for PLS), and the number of samples drawn. Weight vectors are chosen randomly, constrained such that the ensuing X and Y scores explain at least half as much variance as an average principal component in their respective sets. For simplicity, we restrict our present analyses to a single between-set association mode. Of note, throughout the manuscript, "number of features" denotes the total number across both X and Y, i.e., px + py.
Sample size dependence of estimation error
Using randomly sampled surrogate datasets from our generative model, we characterized the estimation error in multiple elements of CCA/PLS solutions. First, we asked whether a significant association can robustly be detected, quantified by statistical power. To that end we calculate the association strength in each synthetic dataset as well as in 1000 permutations of sample labels, and calculate the probability that association strengths are stronger in permuted datasets, giving a p-value. We repeat this process, and estimate statistical power as the probability that the p-value is below α = 0.05 across 100 synthetic datasets drawn from the same normal distribution with given covariance matrix. For a sufficient number of samples, which depends on the other parameter values, statistical power eventually reaches 1 (Fig. 2A-B). Note that here we use "samples per feature" as an effective sample-size measure to account for the fact that datasets in practice can have widely varying dimensionalities (Figs. S3-S4). A typical value in the brain-behavior CCA/PLS literature is about 5 samples per feature (Fig. S5A), which is also marked in Fig. 2.
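The permutation scheme can be sketched as follows; this is a minimal illustration assuming plain (non-regularized) CCA, with the first canonical correlation computed via the standard QR-based formulation (function names are ours):

```python
import numpy as np

def first_cancorr(X, Y):
    """First canonical correlation: whiten each centered set via QR,
    then take the top singular value of the cross-product."""
    Qx, _ = np.linalg.qr(X - X.mean(0))
    Qy, _ = np.linalg.qr(Y - Y.mean(0))
    return np.linalg.svd(Qx.T @ Qy, compute_uv=False)[0]

def perm_pvalue(X, Y, n_perm=1000, seed=0):
    """p-value: fraction of row-permuted datasets whose first canonical
    correlation is at least as large as the observed one. Permuting
    one set's rows breaks the between-set association while leaving
    each set's internal structure intact."""
    rng = np.random.default_rng(seed)
    observed = first_cancorr(X, Y)
    null = np.array([first_cancorr(X, Y[rng.permutation(len(Y))])
                     for _ in range(n_perm)])
    return (1 + np.sum(null >= observed)) / (1 + n_perm)
```

Statistical power would then be estimated as the fraction of repeatedly drawn synthetic datasets for which `perm_pvalue` falls below 0.05.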
Second, we evaluated the association strength (Fig. 2C-D). The observed association strength converges to its true value for sufficiently large sample sizes, but consistently overestimates it, with the overestimation shrinking monotonically as sample size grows. Moreover, for very small sample sizes, observed association strengths are similarly high regardless of the true correlation (Fig. S6). Thus, as above, a sufficient sample size, depending on other parameters of the covariance matrix, is needed to bound the error in the association strength. We also compared in-sample estimates of the association strength to cross-validated estimates. We found that cross-validated estimates underestimate the true value (Fig. S7A-B) to a similar degree as in-sample estimates overestimate it (Fig. S7C-D). Interestingly, the average of the in-sample and cross-validated association strengths was a better estimator than either of the two alone in our simulations (Fig. S7E-F). Finally, bootstrapped association strengths overestimated, on average, slightly more than in-sample estimates (Fig. S8A-B).
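The comparison between in-sample and cross-validated estimates can be illustrated with a small helper; a sketch assuming a plain QR-based CCA fit and simple k-fold splitting (names and fold handling are ours, not GEMMR's):

```python
import numpy as np

def cca_fit(X, Y):
    """Weights and correlation for the first CCA mode (QR-based)."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    Qx, Rx = np.linalg.qr(Xc)
    Qy, Ry = np.linalg.qr(Yc)
    U, s, Vt = np.linalg.svd(Qx.T @ Qy)
    wx = np.linalg.solve(Rx, U[:, 0])   # map whitened weights back
    wy = np.linalg.solve(Ry, Vt[0])
    return wx, wy, s[0]

def combined_estimate(X, Y, n_folds=5, seed=0):
    """Average of the in-sample and cross-validated first canonical
    correlations; in the simulations reported here this average
    tracked the true association strength better than either alone."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    _, _, r_in = cca_fit(X, Y)
    r_cv = []
    for test in folds:
        train = np.setdiff1d(np.arange(len(X)), test)
        wx, wy, _ = cca_fit(X[train], Y[train])
        # correlate held-out scores under the training weights
        r_cv.append(np.corrcoef(X[test] @ wx, Y[test] @ wy)[0, 1])
    return 0.5 * (r_in + np.mean(r_cv))
```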
Third, CCA and PLS solutions provide weights that encode the nature of the association in each dataset. We quantify the corresponding estimation error as the cosine distance between the true and estimated weights, computed separately for X and Y and taking the greater of the two. As the sign of weights is ambiguous in CCA and PLS, it is chosen to obtain a positive correlation between observed and true weights. We found that weight error decreases monotonically with sample size (Fig. 2E-F). Bootstrapped weight errors were again, on average, slightly larger than in-sample estimates (Fig. S8C-F), while the variability of individual weight elements across repeated datasets can be well approximated through bootstrapping (Fig. S8G-H).
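As a minimal sketch, the weight-error metric just described amounts to:

```python
import numpy as np

def weight_error(w_true, w_est):
    """Cosine distance between true and estimated weight vectors,
    after resolving the sign ambiguity (absolute cosine)."""
    cos = w_true @ w_est / (np.linalg.norm(w_true) * np.linalg.norm(w_est))
    return 1.0 - abs(cos)

def pair_weight_error(wx_true, wx_est, wy_true, wy_est):
    """Greater of the X-set and Y-set errors, as reported in the text."""
    return max(weight_error(wx_true, wx_est),
               weight_error(wy_true, wy_est))
```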
Fourth, CCA or PLS solutions provide scores which represent a latent value assigned to each sample (e.g., subject). Applying true and estimated weights to common test data to obtain test scores, score error is quantified as 1 − Spearman correlation between true and estimated scores. It also decreased with sample size (Fig. 2G-H).
Finally, some studies report loadings, i.e., the correlations between original data features and CCA/PLS scores (Fig. S9). In practice, original data features are generally different from principal component scores, but as the relation between these two data representations cannot be constrained, we calculate all loadings here with respect to principal component scores. Moreover, to compare loadings across repeated datasets, we calculate loadings for a common test set, as for CCA/PLS scores. The loading error is then obtained as 1 − Pearson correlation between test loadings and true loadings. Like the other error metrics, it decayed with sample size (Fig. 2I-J). Interestingly, convergence for PLS is somewhat worse than for CCA across all metrics assessed in Fig. 2.
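Loadings and the associated error metric reduce to feature-score correlations; a sketch (the same helper also yields cross-loadings, discussed later in the text, when given the other set's scores):

```python
import numpy as np

def loadings(features, scores):
    """Correlation of each feature (column) with a score vector."""
    F = features - features.mean(0)
    s = scores - scores.mean()
    return (F.T @ s) / (np.linalg.norm(F, axis=0) * np.linalg.norm(s))

def loading_error(true_loadings, test_loadings):
    """1 - Pearson correlation between true and test-set loadings."""
    return 1.0 - np.corrcoef(true_loadings, test_loadings)[0, 1]
```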
Weight error and stability
Fig. 2 quantifies the effect of sampling error on various aspects of the model in terms of summary statistics. We next focus on the error and stability of the weights, due to their centrality in CCA/PLS analyses for describing how features carry the between-set association. First, we illustrate how weight vectors are affected at typically used sample-to-feature ratios. For this illustration we set up a joint covariance matrix with a true between-set correlation of 0.3, assuming 100 features per dataset, and then generated synthetic datasets with either 5 or 50 samples per feature. With 5 samples per feature, estimated CCA weights varied so strongly that the true weights were not discernible within the confidence intervals (Fig. 3A). In contrast, with 50 samples per feature the true weights became more resolved. For PLS, the confidence interval for weights estimated with 5 or 50 samples per feature did not even align with the true weights (Fig. 3B), indicating that even more samples than for CCA should be used.
We next assessed weight stability, i.e., the consistency of estimated weights across independent sample datasets. We quantified weight stability as the cosine similarity between weights obtained from two independently drawn datasets, averaged across pairs of datasets. When the datasets consisted of only a few samples, the average weight stability was close to 0 for CCA and eventually converged to 1 (i.e., perfect similarity) with more samples (Fig. 3C). PLS exhibited striking differences from CCA: mean weight stability had a relatively high value even at low sample sizes where weight error is very high (Figs. 3D, 2F), with high variability across datasets.
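Weight stability as used here can be sketched as the average pairwise similarity across repeated estimates:

```python
import numpy as np
from itertools import combinations

def weight_stability(W):
    """Mean pairwise |cosine similarity| across weight vectors (rows
    of W) estimated on independently drawn datasets; absolute values
    account for the sign ambiguity of CCA/PLS weights."""
    W = np.asarray(W, dtype=float)
    sims = [abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
            for a, b in combinations(W, 2)]
    return float(np.mean(sims))
```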
Finally, to show the dependence of weight error on the assumed true between-set correlation and the number of features we estimated the number of samples required to obtain less than 10% weight error (Fig. 3E-F). The required sample size is higher for increasing number of features, and lower for increasing true between-set correlation. More samples were required for PLS than for CCA. We also observe that, by this metric, required sample sizes can be much larger than typically used sample sizes in CCA/PLS studies.
Weight PC1 bias in PLS
Figs. 3 and 2E-F show that at low sample sizes, PLS weights exhibit, on average, high error but also reasonably high stability. This combination suggests a systematic bias in PLS weights toward a different axis than the true latent axis of association. To gain further intuition into this phenomenon, we first consider the case of both datasets comprising 2 features each, so that weight vectors are 2-dimensional unit vectors lying on a circle. Setting rtrue = 0.3, we drew synthetic datasets from the normal distribution and performed CCA or PLS on them. When 50 samples per feature were used, all resulting weight vectors scattered tightly around the true weight vectors (Fig. 4A-B). With only 5 samples per feature, which is typical in CCA/PLS studies (Fig. S5A), the distribution was much wider. For CCA, the circular histogram peaked around the true value. In contrast, for PLS the peak was shifted toward the first principal component axis.
Next, we investigated how this weight bias toward the first principal component in PLS manifests more generally. We first considered an illustrative data regime (64 features/dataset, rtrue = 0.3). We quantified the PC bias as the cosine similarity between estimated weight vectors and a principal component axis. Compared to CCA, for PLS there was a strong bias toward the dominant PCs, even with a large number of samples (Fig. 4C,D). Note also that the average PC bias in permuted datasets was similar to that in unpermuted datasets, for both CCA and PLS. Finally, these observations also held for datasets with differing numbers of features and true correlations. For PLS, the weight vectors are biased toward the first principal component axis, compared to CCA, and more strongly than random weight vectors, particularly when few samples per feature were used to estimate them (Fig. 4F).
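The PC1 bias can be demonstrated directly on null data. The sketch below uses made-up parameters (10 features per set, 50 samples, a k^-1 variance spectrum) and a plain SVD-based PLS; with no true association present, estimated PLS weights still align with the first principal axis substantially more than random directions would:

```python
import numpy as np

def pls_weights(X, Y):
    """First PLS mode: leading singular vectors of the sample
    cross-covariance matrix."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    U, _, Vt = np.linalg.svd(Xc.T @ Yc / (len(X) - 1))
    return U[:, 0], Vt[0]

rng = np.random.default_rng(0)
p, n = 10, 50
var = np.arange(1, p + 1, dtype=float) ** -1.0   # power-law PC variances

# Null data: X and Y are independent, so any alignment of the
# estimated weights with the PC1 axis reflects pure bias.
pc1_bias = []
for _ in range(100):
    X = rng.standard_normal((n, p)) * np.sqrt(var)
    Y = rng.standard_normal((n, p)) * np.sqrt(var)
    wx, _ = pls_weights(X, Y)
    pc1_bias.append(abs(wx[0]))   # |cosine| with the PC1 axis
mean_bias = float(np.mean(pc1_bias))
```

For comparison, a uniformly random 10-dimensional unit vector has a mean |cosine| with any fixed axis of roughly 0.25.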
Empirical brain-behavior data
Do these phenomena observed in synthetic data from our generative modeling framework also hold in empirical data? We focused on two state-of-the-art population neuroimaging datasets: the Human Connectome Project (HCP) (2) and UK Biobank (UKBB) (3). Both datasets provide multi-modal neuroimaging data along with a wide range of behavioral and demographic measures, and both have been used in prior studies using CCA to map brain-behavior relationships (3, 4, 28–32). HCP, comprising around 1200 subjects, is one of the largest neuroimaging datasets available and is of exceptional quality. We analyzed two neuroimaging modalities in the HCP dataset, resting-state functional MRI (fMRI) (in 948 subjects) and diffusion MRI (dMRI) (in 1020 subjects). UKBB is a population-level study and, to our knowledge, the largest available neuroimaging dataset. We analyzed fMRI features from 20000 UKBB subjects. HCP and UKBB thereby provide two independent testbeds, across neuroimaging modalities and with large numbers of subjects, to investigate error and stability of CCA/PLS in brain-behavior data.
After modality-specific preprocessing (see Methods), both datasets in each of the three analyses were deconfounded and reduced to 100 principal components (see Methods and Fig. S10), in agreement with prior CCA studies of HCP data (4, 28–32) (see Fig. S11 for a re-analysis of HCP functional connectivity vs. behavior in which a smaller number of principal components was selected according to an optimization procedure (33)). Functional connectivity features were extracted from fMRI data and structural connectivity features were extracted from dMRI. Note that, as only a limited number of samples were available in these empirical datasets, we cannot use increasingly more samples to determine how CCA or PLS converge with sample size (as we did with synthetic data above). Instead, we repeatedly subsampled the available data to varying sizes from 202 up to 80 % of the available number of samples.
We found that the first mode of association was statistically significant for all three sets of data and for both CCA and PLS. Association strengths decreased with increasing size of the subsamples, but clearly converged only for the UKBB data. Cross-validated association strength estimates increased with subsample size and, for UKBB, converged to the same value as the in-sample estimates. Fig. 5A overlays reported CCA results from other publications that used 100 features per set in HCP data, which further confirms the decreasing trend of association strength as a function of sample size. Weight stabilities (i.e., the cosine similarities between weights estimated for different subsamples of the same size) increased with sample size but reached values close to 1 (perfect similarity) only for UKBB data. Moreover, PC1 bias was close to 0 for CCA but markedly larger for PLS weights. All these results were in agreement with the analyses of synthetic data discussed above (Figs. 2-4). Altogether, we emphasize the overall similarity between CCA analyses of different data modalities and features (first and second rows in Fig. 5) and of data of similar nature from different sources (first and third rows in Fig. 5). This suggests that sampling error is a major determinant of CCA and PLS outcomes, across imaging modalities and for independent data sources. Note also that stable CCA and PLS results with a large number of considered features can be obtained with sample sizes that become available with UKBB-level datasets.
Samples per feature alone predicts published CCA strengths
We next examined stability and association strengths in CCA analyses of empirical datasets more generally. To that end we performed an analysis of the published literature using CCA with neuroimaging data to map brain-behavior relationships. From 100 CCAs reported in 31 publications (see Methods), we extracted the number of samples, the number of features, and the association strengths. As the within-set variance spectrum is not typically reported, but would be required to assess PLS results (as described above), we did not perform such an analysis for PLS.
Most studies used less than 10 samples per feature (Fig. 6A and S5A). Overlaying reported canonical correlations as a function of samples per feature on top of predictions from our generative model shows that most published CCAs we compiled are compatible with a range of true correlations, from about 0.5 down to 0 (Fig. 6A). Interestingly, despite the fact that these studies investigated different questions using different datasets and modalities, the reported canonical correlation could be well predicted simply by the number of samples per feature alone (R2 = 0.83).
We next asked whether weight errors can be estimated for published CCAs. As true weights are in principle unknown, we estimated weight errors using our generative modeling framework. We did this by (i) generating synthetic datasets of the same size as a given empirical dataset, sweeping through assumed true correlations between 0 and 1, (ii) selecting those synthetic datasets for which the estimated canonical correlation matches the empirically observed one, and (iii) using the weight errors in these matched synthetic datasets as a proxy for the weight error in the empirical dataset (Fig. S12). This resulted in a distribution of weight errors across the matching synthetic datasets for each published CCA study that we considered. The means of these distributions are overlaid in color in Fig. 6A and the range of the distributions is shown in Fig. 6B. The mean weight error falls off roughly with the distance to the correlation-vs-samples/feature curve for permuted data (see also Fig. S5B). Altogether, these analyses suggest that many published CCA studies might have unstable feature weights due to an insufficient sample size.
Benefit of cross-loadings in PLS
Given the instability associated with estimated weight vectors, we investigated whether other measures provide better feature profiles. Specifically, we compared loadings and cross-loadings. Cross-loadings are the correlations across samples between the CCA/PLS scores of one dataset and the original data features of the other dataset (unlike loadings, which are the correlations between CCA/PLS scores and original features of the same dataset). In CCA, loadings and cross-loadings are collinear (see Methods and Fig. S13A), and obtaining estimates with at most 10 % loading or cross-loading error required about the same number of samples for either (Fig. 7A). For PLS, on the other hand, true loadings and cross-loadings were, albeit not collinear, still very similar (Fig. S13B), but cross-loadings could be estimated to within 10 % error with about 20 % to 50 % fewer samples than loadings in our simulations (Fig. 7B).
Calculator for required sample size
In both synthetic and empirical datasets we have seen that sample size plays a critical role in guaranteeing the stability and interpretability of CCA and PLS, and that many existing applications may suffer from a lack of samples. How many samples are required, given particular dataset properties? We answer this question with the help of GEMMR, our generative modeling framework described above. Specifically, we suggest basing the decision on a combination of criteria, bounding statistical power as well as the relative error in association strength, weight error, score error, and loading error at the same time. Requiring at least 90 % power and admitting at most 10 % error for the other metrics, we determined the corresponding sample sizes in synthetic datasets by interpolating the curves in Fig. 2 (see Fig. S14 and Methods). The results are shown in Fig. 8 (see also Figs. S15-S16). Assuming, for example, that the decay constants of the variance spectra satisfy ax + ay = − 2 for PLS, several hundreds to thousands of samples are necessary to achieve the indicated power and error bounds when the true correlation is 0.3 (Fig. 8A). More generally, the required number of samples per feature as a function of the true correlation roughly follows a power-law dependence, with a strong increase in required sample size when the true correlation is low (Fig. 8B). Interestingly, PLS generally needs more samples than CCA. As mentioned above, accurate estimates of the association strength alone (as opposed to power, association strength, weight, score, and loading error at the same time) could be obtained in our simulations with fewer samples, by averaging the in-sample with a cross-validated estimate (Fig. S7E-F). Moreover, accurate estimates of a PLS feature profile required fewer samples when assessed as cross-loadings (Fig. 7B).
Given the complexity and computational expense of generating and analyzing enough synthetic datasets to obtain sample size estimates in the way described above, we finally asked whether we could formulate a concise, easy-to-use description of the relationship between model parameters and required sample size. To that end, we fitted a linear model to the logarithm of the required sample size, using the logarithms of the total number of features and of the true correlation as predictors (Fig. S17). For PLS, we additionally included a predictor for the decay constant of the within-set variance spectrum, |ax + ay|. Using split-half predictions to validate the model, we find very good predictive power for CCA (Fig. S17B), while it is somewhat worse for PLS (Fig. S17C). When we added an additional predictor to the PLS model, measuring the fraction of explained variance along the weight vectors in both datasets, predictions improved notably (Fig. S17D), showing that the linear model approach is in principle also suitable for PLS. As the explained variance along the true weight vectors is unobservable in practice, though, we propose using the linear model without the explained-variance predictor.
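The fitting procedure itself is ordinary least squares on logarithms. The sketch below illustrates only the functional form: the "required sample sizes" are hypothetical values generated from an assumed power law, and the recovered coefficients are placeholders, not the published fit.

```python
import numpy as np

# Hypothetical design: required sample sizes generated exactly from an
# assumed power law n = exp(3) * p_total^1 * r_true^-2 (placeholder
# values for illustration, NOT the fitted GEMMR coefficients).
p_total = np.array([8, 16, 32, 64, 128] * 2, dtype=float)
r_true = np.array([0.1] * 5 + [0.3] * 5)
n_req = np.exp(3.0 + 1.0 * np.log(p_total) - 2.0 * np.log(r_true))

# Linear model on logarithms: log n = b0 + b1 log p + b2 log r
A = np.column_stack([np.ones_like(p_total), np.log(p_total), np.log(r_true)])
coef, *_ = np.linalg.lstsq(A, np.log(n_req), rcond=None)

def predict_n(p, r):
    """Predicted required sample size for p total features and an
    assumed true correlation r."""
    return float(np.exp(coef @ [1.0, np.log(p), np.log(r)]))
```

Because the synthetic targets follow the assumed law exactly, the least-squares fit recovers the generating coefficients; with real GEMMR output the fit would instead approximate the simulated sample-size surface.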
Discussion
We characterized CCA and PLS through a parameterized generative modeling framework. CCA and PLS require a sufficient number of samples to work as intended and the required sample size depends on the number of features in the data, the assumed true correlation, and (for PLS) the principal component variance spectrum for each dataset.
Generative model for CCA and PLS
At least for CCA, the distribution of sample canonical correlations has been reported to be intractable, even for normally distributed data (34). Thus, a generative model is an attractive alternative for investigating sampling properties. Our generative model for CCA and PLS made it possible to investigate all aspects of a solution, beyond just the canonical correlations, at the cost of higher computational expense. For example, the generative model can be used to systematically explore parameter dependencies, to assess stability, to calculate required sample sizes in new studies, and to estimate weight stability in previously published studies. While this generative model was developed for CCA and PLS, it can also be used to investigate related methods like sparse variants (35, 36).
Pitfalls in CCA and PLS
Association strengths can be overestimated and, at least for CCA when the number of samples per feature as well as the true correlation are low, observed canonical correlations can be compatible with a wide range of true correlations, down to zero (Fig. S6). Estimated weight vectors need not resemble the true weights when the number of samples is low and can overfit, i.e., vary strongly between sampled datasets (Fig. 3), significantly affecting their interpretability and generalizability. Furthermore, PLS weights are also biased away from the true value toward the first principal component axis (Fig. 4). As a consequence, similarity of weights from two different samples of the population is necessary but not sufficient to infer replicability. The PC1 bias also existed for null data. Therefore, estimated weights that strongly resemble the first principal component axis need not indicate an association, but could instead indicate the absence of an association, or an insufficient sample size. Importantly, we have shown that the same pitfalls also appear in empirical data.
Differences between CCA and PLS
First and foremost, CCA and PLS have different objectives: while CCA finds weighted composites with the strongest possible correlation between datasets, PLS maximizes their covariance. When features do not have a natural commensurate scale, CCA can be attractive due to its scale invariance (see Fig. 1 and Methods). In situations where both analyses make sense, PLS comes with the additional complication that estimated weights are biased towards the first principal component axis. Moreover, our analyses suggest that the required number of samples for PLS depends on the within-set principal component variance spectrum and is generally higher than for CCA. Based on these arguments, CCA might often be preferable to PLS.
Sample size calculator for CCA and PLS
Previous literature, based on small numbers of specific datasets or Monte-Carlo simulations, has suggested using between 10 and 70 samples per feature for CCA (21, 25, 27). Beyond that, our calculator is able to suggest sample sizes for the given characteristics of a dataset, and can do so for both CCA and PLS. As an example, consider the UKBB data in Fig. 5. Both in-sample and cross-validated CCA association strengths converge to about 0.5. Fig. 8B then suggests using about 20 samples per feature, i.e., 4000 samples, to obtain at least 90 % power and at most 10 % error in the other metrics. This is compatible with Fig. 5J: at 4000 subjects weight stability is about 0.8 (note that weight stability measures the similarity of weights between different repetitions of the dataset; we expect the similarity of a weight vector to the true weight vector, which is the measure going into the sample size calculation, to be slightly higher on average). Our calculator is made available as an open-source Python package named GEMMR (Generative Modeling of Multivariate Relationships).
Brain-behavior associations
CCA and PLS have become popular methods to reveal associations between neuroimaging and behavioral measures (3, 4, 17, 18, 23, 29–32, 37). The main interest in these applications lies in interpreting weights or loadings to understand the profiles of neural and behavioral features carrying the brain-behavior relationship. We have shown, however, that stability and interpretability of weights or loadings are contingent on a sufficient sample size which, in turn, depends on the true between-set correlation.
How strong are true between-set correlations? While this depends on the data at hand, and is in principle unknown a priori, our analyses provide estimates in the case of brain-behavior associations. First, we saw in UKBB data that both in-sample and cross-validated canonical correlations converged to a value of around 0.5. As the included behavioral measures comprised a wide range of categories (cognitive, physical, lifestyle measures and early life factors), this canonical correlation is probably at the upper end, such that brain-behavior associations probing more specialized modes are likely lower. Second, we saw in a literature analysis of brain-behavior CCAs that reported canonical correlations as a function of sample-to-feature ratios largely follow the trends predicted by our generative model, despite different datasets being investigated in each study. We also saw that a few studies with 10-20 samples per feature reported canonical correlations around 0.5-0.7, while most studies with substantially more than 10 samples per feature appeared to be compatible only with values ≤ 0.3. We therefore conclude that true canonical correlations in brain-behavior applications are probably not greater than 0.3 in many cases.
Assuming a true between-set correlation of 0.3, our generative model implies that at least about 50 samples per feature are required to obtain stability in CCA results. We have shown that many published brain-behavior CCAs do not meet this criterion. Moreover, in HCP data we saw clear signs that the available sample size was too small to obtain stable solutions—even though the HCP data comprise around 1000 subjects and constitute one of the largest and highest-quality neuroimaging datasets available to date. On the other hand, with UKBB data, where we used 20000 subjects, CCA and PLS results appeared to have converged. As the resources required to collect samples of this size go well beyond what is available to typical research groups, this observation supports the accrual of widely shared datasets (38, 39).
Generalizability
Small sample and effect sizes have been identified as challenges for neuroimaging that impact replicability and generalizability (40, 41). Here, we have considered the stability of CCA/PLS analyses and found that observed association strengths decrease with increasing samples-per-feature ratio. Similarly, a decrease in reported effect size with increasing sample size has been reported in meta-analyses of various classification tasks of neuroimaging measures (42). These sample-size dependences of the observed effect sizes are an indication of instability.
A judicious choice of sample size, together with an estimate of the effect size, is thus advisable at the planning stage of an experiment or CCA/PLS analysis. Our generative modeling framework provides estimates for both. Beyond that, non-biological factors—such as batch or site effects (43–46), scan duration (47), or flexibility in the data processing pipeline (48, 49)—certainly contribute to unstable outcomes and could be addressed in extensions of the generative model. External validation with separate datasets is also necessary to establish generalizability of findings beyond the dataset under investigation.
Limitations and future directions
For tractability it was necessary to make a number of assumptions in our study. Except for Fig. 6 we assumed that both datasets had an equal number of features (but see Fig. S4, where we used different numbers of features for the two datasets). We also assumed that data were normally distributed, which is often not true in practice. For example, cognitive scores are commonly recorded on an ordinal scale. To address this, we used empirical datasets and found similar sample size dependencies as in synthetic datasets. In an investigation of the stability of CCA for non-normal data, varying kurtosis had minimal effects (27). We further assumed the existence of a single cross-modality axis of association, but in practice several may be present. In that case, theoretical considerations suggest that even larger sample sizes are needed (50, 51). Moreover, we assumed that data are described in a principal component (PC) basis. In practice, however, the PCs and their number need to be estimated, too. This introduces additional uncertainty, although presumably of lesser influence than the inherent sampling error in CCA and PLS. Furthermore, we used “samples per feature” as an effective sample size parameter to account for the fact that datasets in practice have very different dimensionalities. Figs. S3-S4 show that power and error metrics for CCA are parameterized well in terms of “samples per feature”, whereas for PLS this parameterization is only approximate. Nonetheless, as “samples per feature” is arguably the most straightforward parameter to interpret, we presented results in terms of it for both CCA and PLS.
Several related methods have been proposed to potentially circumvent shortcomings of standard CCA and PLS (see (19) for a recent review). Regularized or sparse CCA and PLS (35, 36) have been designed to mitigate the problem of small sample sizes. They modify the modeling objective by introducing a penalty on the elements of the weight vectors, encouraging them to “shrink” to smaller values. This modification aims to obtain more accurate predictions, but will also bias the solutions away from their true values. (We assume that, in general, the true weight vectors are non-sparse.) Conceptually, these variants thus follow a “predictive” rather than an “inferential” modeling goal (52, 53). Our analysis pipeline, evaluated with a commonly used sparse CCA method (35), suggested that in some situations (namely, high dimensionalities and low true correlations) fewer samples were required than for CCA to obtain the same bounds on evaluation metrics (Fig. S18). Nonetheless, although sparse CCA can in principle be used with fewer samples than features, the required sample sizes for sparse CCA were still many times the number of features: when rtrue = 0.3, for example, 35–50 samples per feature (depending on the number of features) were required. We note, however, that a complete characterization of sparse CCA or PLS methods was beyond the scope of this manuscript. PLS has been compared to sparse CCA in a setting with more features than samples, with the conclusion that PLS performs better with fewer than about 500 features per sample, and sparse CCA with more (54). We note that sparse methods are also often used in classification tasks, where they have been observed to provide better prediction but less stable weights (55, 56), which indicates a trade-off between prediction and inference (55). Correspondingly, it has been suggested to consider weight stability as a criterion in sparsity parameter selection (55, 57, 58).
Moreover, whereas CCA and PLS are restricted to discovering linear relationships between two datasets, there exist non-linear extensions, such as kernel (59, 60), deep (61) or non-parametric (62) CCA, as well as extensions to multiple datasets (63). Due to their increased expressivity, and therefore capacity to overfit, we expect them to require even larger sample sizes. For classification, kernel and deep-learning methods have been compared to linear methods, using neuroimaging-derived features as input (64). Accuracy was found to be similar for kernel, deep-learning and linear methods, with a similar dependence on sample size, using up to 8000 subjects. Finally, we note that a relative of PLS, PLS regression, treats the two datasets asymmetrically, deriving scores from one dataset to predict the other (5, 8, 9).
The number of features in the datasets was an important determinant for stability. Thus, methods for dimensionality reduction hold great promise. On the one hand, there are data-driven methods that, for example, select the number of principal components in a way that takes the between-set correlation into account (33). Applying this method to HCP data we saw that the reduced number of features the method suggests leads to slightly better convergence (Fig. S11). On the other hand, previous knowledge could be used to preselect the features hypothesized to be most relevant for the question at hand (65–67).
Recommendations
We end with 10 recommendations for using CCA or PLS in practice (summarized in Tab. S1).
Sample size and the number of features in the datasets are crucial determinants for stability. Therefore, any form of dimensionality reduction as a preprocessing step can be useful, as long as it preserves the features that carry the between-set association. PCA is a popular choice and can be combined with a consideration for the between-set correlation (33).
Significance tests used with CCA and PLS usually test the null hypothesis that the between-set association strength is 0. This is a different problem than estimating the strength or the nature of the association (68, 69). For CCA we find that the number of samples required to obtain 90% power at significance level α = 0.05 is lower than that required to obtain stable association strengths or weights, whereas for PLS the numbers are about commensurate with required sample sizes for other metrics (Fig. S15C-D). As significant results can be obtained even when power is low, detecting a significant mode of association with either CCA or PLS does not in general indicate that association strengths or weights are stable.
CCA and PLS overestimate the association strength for small sample sizes, and we found that cross-validated estimators underestimate it. Interestingly, the average of the in-sample and the cross-validated association strength was a much better estimator in our simulations.
The main interest of CCA/PLS studies is often the nature of the between-set association, which is encoded in the weight vectors, loadings and cross-loadings. Every CCA and PLS analysis will provide weights, loadings and cross-loadings, but they may be inaccurate or unstable if an insufficient number of samples was used for estimation. In our PLS simulations, cross-loadings required fewer samples than weights and loadings to obtain an error of at most 10%.
PLS weights that strongly resemble the first principal component axis can indicate either that no association exists or that an insufficient number of samples was used.
As a side effect of this bias of PLS weights towards the first principal component axis, PLS weights can appear stable across different sample sets, although they are inaccurate.
Performing CCA or PLS on subsamples of the data can indicate stability if very similar results are obtained across varying numbers of samples and when compared to using all of the data.
Bootstrapped estimates were useful in our simulations for assessing the variability or precision of elements of the weight vectors. Estimates were, however, not accurate: they were as biased as in-sample estimates, i.e. they overestimated association strengths, and both association strength and weight error showed a similar sample size dependence as in-sample estimates.
For CCA and PLS analyses in the literature it can be difficult to deduce precisely what data were used. We recommend always explicitly stating the sample size, the number of features in both datasets, and the obtained association strength. Moreover, as we have argued above, assessing a PLS analysis requires the within-set principal component variances, which are thus useful to report.
CCA or PLS requires a sufficient number of samples for reliability. Sample sizes can be calculated using GEMMR, the accompanying software package. An assumed but unknown value for the true between-set correlation is needed for the calculation. Our literature survey suggests that between-set correlations are probably not greater than 0.3 in many cases. Assuming a true correlation of 0.3 results in a rule of thumb that CCA requires about 50 samples per feature. The number for PLS is higher and also depends on the within-set variance spectrum.
Conclusion
We have presented a parameterized generative modeling framework for CCA and PLS. It allows analysis of the stability of CCA and PLS estimates, prospectively and retrospectively. Exploiting this generative model, we have seen that a number of pitfalls exist for using CCA and PLS. In particular, we caution against interpreting CCA and PLS models when the available sample size is low. We have also shown that CCA and PLS in empirical data behave similarly to the predictions of the generative model. Sufficient sample sizes depending on characteristics of the data are suggested and can be calculated with the accompanying software package. Altogether, our analyses provide guidelines for using CCA and PLS in practice.
Materials and Methods
Materials and methods are summarized in the SI appendix.
Supporting Information Text
1. CCA and PLS
We assume that we have two datasets in the form of data matrices X and Y, both of which have n rows representing samples, and, respectively, pX and pY columns representing measured features (or variables). Throughout we also assume that all columns of X and Y have mean 0. If both datasets consisted of only a single variable, we could measure their association by calculating their covariance or correlation. On the other hand, if one or both consist of more than one variable, pairwise between-set associations can be computed, but the possibly huge number of pairs results in a loss of statistical sensitivity and makes it difficult to concisely interpret a potentially large number of significant associations (1). To circumvent these problems, canonical correlation analysis (CCA) and partial least squares (PLS) estimate associations between weighted composites of the original data variables and find those weights that maximize the association strength.
A. Terminology
Given a data matrix, e.g. X, composite variables or scores (a vector with as many entries as samples, n) are formed by projection of X onto a weight vector wX (with as many entries as variables in X, pX), see Fig. 1A: t = X wX. Loadings (again of size pX) characterize these composite variables by measuring their similarities with each of the original data variables in X (Fig. S9): the j-th loading is corr((Xj)z, tz), where corr means Pearson correlation, Xj is the j-th column of X, and the subscript z represents z-scoring across samples (i.e. subtraction of the mean and subsequent division by the standard deviation across samples). The complete loading vector is then diag(SXX)^(−1/2) SXX wX / (wX⊤ SXX wX)^(1/2), where SXX is the sample covariance matrix for X. Similarly, cross-loadings can be defined as diag(SXX)^(−1/2) SXY wY / (wY⊤ SYY wY)^(1/2), where SYY and SXY are, respectively, the sample covariance matrix for Y and the sample cross-covariance matrix between X and Y.
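These definitions can be illustrated with a minimal numerical sketch (our own illustration; the data and weight values are arbitrary and for demonstration only):

```python
import numpy as np

rng = np.random.default_rng(0)
n, pX, pY = 500, 3, 2
X = rng.standard_normal((n, pX))
Y = rng.standard_normal((n, pY))
X -= X.mean(0); Y -= Y.mean(0)  # all columns have mean 0

# scores: projection of X onto a (unit-norm) weight vector
w_x = np.array([1.0, 0.5, -0.5]); w_x /= np.linalg.norm(w_x)
t = X @ w_x

# loadings: Pearson correlation of each X column with the X scores
loadings = np.array([np.corrcoef(X[:, j], t)[0, 1] for j in range(pX)])

# cross-loadings: correlation of each X column with the Y scores
w_y = np.array([1.0, -1.0]); w_y /= np.linalg.norm(w_y)
u = Y @ w_y
cross_loadings = np.array([np.corrcoef(X[:, j], u)[0, 1] for j in range(pX)])
```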
B. Partial Least Squares
Partial Least Squares (PLS) finds the maximal covariance achievable between weighted linear combinations of features from two data matrices X and Y (2): cov(X wX, Y wY) → max over unit-norm wX, wY. The solution is based on the between-set covariance matrix ΣXY, which can be estimated from data via its sampled version SXY = X⊤Y/(n − 1). Performing a singular value decomposition yields SXY = U diag(σXY,1, σXY,2, …) V⊤, such that the optimal weights are given by the first columns of U and V, and the maximal covariance by the first singular value σXY,1 (3, 4).
Multiple modes of association can be estimated in this way: beyond the first column, every pair of corresponding columns in U and V provides another mode such that cov(X ui, Y vi) (for 1 ≤ i ≤ min(pX, pY)) is maximal, given that the covariance of lower modes (those with indices < i) has already been accounted for. There are a number of different algorithms for PLS that differ conceptually in how these higher modes are estimated (2, 3). The one presented above (sometimes called “partial least squares correlation” or PLS-SVD) was chosen for its similarity to canonical correlation analysis (see below). Another notable PLS algorithm is “PLS regression” which, in contrast to the above flavor, is asymmetrical in its handling of X and Y in that it estimates weighted composites (scores) for X and re-uses these as predictors for Y (2).
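As a minimal sketch (our own illustration, not the authors' implementation), the SVD-based PLS computation described above can be written as:

```python
import numpy as np

def pls_svd(X, Y):
    """PLS-SVD sketch: weight pairs maximizing score covariance."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    S_xy = Xc.T @ Yc / (len(X) - 1)      # sample cross-covariance matrix
    U, s, Vt = np.linalg.svd(S_xy, full_matrices=False)
    return U, Vt.T, s                    # weights (columns) and covariances

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 4))
Y = X[:, :2] @ rng.standard_normal((2, 3)) + rng.standard_normal((1000, 3))
U, V, s = pls_svd(X, Y)
# the covariance of the first score pair equals the first singular value
cov_scores = np.cov(X @ U[:, 0], Y @ V[:, 0], ddof=1)[0, 1]
```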
C. Canonical Correlation Analysis
Canonical Correlation Analysis (CCA) (5), as a multivariate extension of Pearson’s correlation, finds maximal correlations between weighted linear combinations of variables from X and Y: corr(X wX, Y wY) → max (Eq. (8)). Note that corr(X wX, Y wY) is independent of the scaling of wX and wY, i.e. if wX and wY are solutions of Eq. (8), then cX wX and cY wY, where cX ∈ ℝ and cY ∈ ℝ are non-zero, are also solutions.
Also note that, as for PLS, several modes of association can be obtained with this framework by successively discounting the variance that has been explained by lower-order modes.
The maximal correlation in Eq. (8) is often called the canonical correlation.
The further analysis is then based on the “whitened” between-set covariance matrix K = SXX^(−1/2) SXY SYY^(−1/2) (6, 7). A singular value decomposition of K is performed, yielding K = U diag(r1, r2, …) V⊤, and the singular values ri turn out to be the canonical correlations from Eq. (8), i.e. the maximal achievable correlations between a weighted linear combination of variables in X on the one hand, and a weighted linear combination of variables in Y on the other hand. The corresponding weights are given by WX = SXX^(−1/2) U (Eq. (11)) and WY = SYY^(−1/2) V (Eq. (12)). The use of the “whitened” between-set covariance matrix in CCA leads to an invariance property between datasets that we will exploit later. To see this, let Xw = X SXX^(−1/2) and Yw = Y SYY^(−1/2) be whitened data matrices, such that SXwXw = 𝟙 and SYwYw = 𝟙. Then, SXwYw = SXX^(−1/2) SXY SYY^(−1/2) = K, which is the same matrix as for the original (non-whitened) data. Consequently, canonical correlations for the original and whitened data are the same, given by the singular values of K; canonical weights for the whitened data are directly its singular vectors; and canonical weights for the original (non-whitened) data differ only by a matrix factor SXX^(−1/2) for X and SYY^(−1/2) for Y, respectively (Eq. (11)-Eq. (12)).
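A compact numerical sketch of this whitening-plus-SVD procedure (our own illustration; function and variable names are ours):

```python
import numpy as np

def _inv_sqrt(S):
    """Inverse matrix square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def cca(X, Y):
    """CCA sketch via SVD of the 'whitened' cross-covariance matrix."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n = len(X)
    Sxx, Syy = Xc.T @ Xc / (n - 1), Yc.T @ Yc / (n - 1)
    Sxy = Xc.T @ Yc / (n - 1)
    K = _inv_sqrt(Sxx) @ Sxy @ _inv_sqrt(Syy)   # whitened cross-covariance
    U, r, Vt = np.linalg.svd(K, full_matrices=False)
    # weights for the original data differ from the singular vectors
    # only by the inverse-square-root covariance factors
    return _inv_sqrt(Sxx) @ U, _inv_sqrt(Syy) @ Vt.T, r

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
Y = X @ rng.standard_normal((3, 2)) + rng.standard_normal((500, 2))
Wx, Wy, r = cca(X, Y)
# scale invariance: rescaling/shifting X leaves canonical correlations unchanged
_, _, r_scaled = cca(5 * X + 2, Y)
```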
It can be shown that the invariance property is even more general (6). Let MX ∈ ℝ^(pX×pX) and MY ∈ ℝ^(pY×pY) be non-singular matrices and aX ∈ ℝ^(pX) and aY ∈ ℝ^(pY) be arbitrary vectors. Then X MX + 1 aX⊤ and Y MY + 1 aY⊤ have the same canonical correlations as X and Y, and the canonical vectors are related by wX = MX wX′ (and analogously for Y), where wX′ denotes a canonical vector of the transformed data. Thus, in particular, z-scored data Xz = X diag(SXX)^(−1/2) and Yz = Y diag(SYY)^(−1/2), as well as whitened data Xw and Yw, have the same canonical correlations as the original data X and Y.
In CCA, X- and Y-weights are related by wX ∝ SXX^(−1) SXY wY (8). Replacing sample with population covariance matrices in Eq. (3) and Eq. (4), we thus also see that loadings and cross-loadings are collinear.
D. Overestimation of association strength
Let ΣXY be a population cross-covariance matrix with singular value decomposition ΣXY = U diag(σ1, σ2, …) V⊤, and let u1, v1 and σ1 be, respectively, the first columns of U and V and the first singular value. In PLS, u1 and v1 are the weight vectors of the first mode and σ1 is the corresponding association strength. For CCA, as noted above, whitening the data leaves canonical correlations unchanged, so that we now assume the data are white when performing CCA. Then ΣXY = K, and the weights and association strength of the first mode are also given by u1, v1 and σ1. In both cases, we have σ1 = u1⊤ ΣXY v1.
The sample covariance matrix SXY is an unbiased estimator for ΣXY, i.e. E[SXY] = ΣXY. Therefore, E[u1⊤ SXY v1] = u1⊤ ΣXY v1 = σ1, i.e. if the true (but unknown) weights were applied to a given dataset (between-set covariance matrix), the association strength of the resulting scores would, on average, match the true association strength. However, by definition, CCA and PLS select those weight vectors that maximize the association strength between resulting scores. If û1 and v̂1 are those optimal weights for a given dataset, then û1⊤ SXY v̂1 ≥ u1⊤ SXY v1, and consequently also E[û1⊤ SXY v̂1] ≥ E[u1⊤ SXY v1] = σ1, i.e. the association strength is overestimated.
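This overestimation is easy to demonstrate numerically. The following sketch (our own illustration) uses null data, where the true association strength is 0: applying fixed weights gives an unbiased estimate, while optimizing the weights (here via the first singular value of SXY) always overestimates.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 5  # few samples per feature

ests_fixed, ests_opt = [], []
for _ in range(200):
    # independent X and Y: the true association strength is 0
    X = rng.standard_normal((n, p))
    Y = rng.standard_normal((n, p))
    Sxy = (X - X.mean(0)).T @ (Y - Y.mean(0)) / (n - 1)

    u = v = np.ones(p) / np.sqrt(p)     # fixed, a priori weights
    ests_fixed.append(u @ Sxy @ v)      # unbiased: averages to ~0

    s1 = np.linalg.svd(Sxy, compute_uv=False)[0]
    ests_opt.append(s1)                 # maximized: always > 0

# np.mean(ests_fixed) is near the true value 0,
# whereas np.mean(ests_opt) is substantially above it
```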
E. Sparse CCA
Multiple sparse CCA and PLS methods exist (9–12). Here, we use penalized matrix decomposition (PMD) (11), which has found widespread application, see e.g. (13–18). Briefly, the PMD algorithm repeats the following steps until convergence (11)
u ← argmax_u u⊤ X⊤Y v subject to ‖u‖₂² ≤ 1 and ‖u‖₁ ≤ cX,
v ← argmax_v u⊤ X⊤Y v subject to ‖v‖₂² ≤ 1 and ‖v‖₁ ≤ cY,
to maximize u⊤ X⊤Y v. If X⊤X ≈ 𝟙 and Y⊤Y ≈ 𝟙, then var(Xu) = u⊤ X⊤X u ≈ ‖u‖₂², and analogously for Y. Consequently, u⊤ X⊤Y v ≈ corr(Xu, Yv) ‖u‖₂ ‖v‖₂. Note that the approximation X⊤X ≈ 𝟙 (together with Y⊤Y ≈ 𝟙) makes this sparse “CCA” variant identical to sparse PLS (18, 19).
E.1. Implementation and sparsity parameter selection
We implemented a Python wrapper for the R-package PMA (20), which we used with default parameters. Sparsity parameters were estimated separately for each dataset subjected to sparse CCA via 5-fold cross-validation (11, 21): for X and Y we each used 5 candidate sparsity parameters (0.2, 0.4, 0.6, 0.8 and 1, where smaller values mean more sparsity and 1 corresponds to no sparsity), for a total of 25 parameter pairs. For each candidate parameter pair, sparse CCA was estimated with 80% of the data, the resulting weights were applied to the remaining 20% of the data to obtain test scores, and the Pearson correlation between the test scores was calculated and averaged across the 5 folds. The pair of sparsity parameters for which this average test correlation was maximal was then selected, and sparse CCA was re-estimated on the whole data with these parameters.
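The selection procedure can be sketched as the following skeleton (illustrative only: the `fit` callable stands in for the actual PMD solver, which we do not reimplement here; all names are ours):

```python
import numpy as np
from itertools import product

def select_sparsity(X, Y, fit, candidates=(0.2, 0.4, 0.6, 0.8, 1.0), k=5, seed=0):
    """5-fold CV over pairs of sparsity parameters; pick the pair with the
    highest mean out-of-fold test-score correlation.
    `fit` is any function (X, Y, cx, cy) -> (w_x, w_y)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    best_pair, best_score = None, -np.inf
    for cx, cy in product(candidates, repeat=2):
        scores = []
        for fold in folds:
            train = np.setdiff1d(idx, fold)
            w_x, w_y = fit(X[train], Y[train], cx, cy)
            scores.append(np.corrcoef(X[fold] @ w_x, Y[fold] @ w_y)[0, 1])
        if np.mean(scores) > best_score:
            best_pair, best_score = (cx, cy), np.mean(scores)
    return best_pair

# illustrative check with a stand-in `fit` that ignores the sparsity parameters
def _svd_fit(Xt, Yt, cx, cy):
    Sxy = (Xt - Xt.mean(0)).T @ (Yt - Yt.mean(0))
    U, _, Vt = np.linalg.svd(Sxy)
    return U[:, 0], Vt[0]

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
Y = rng.standard_normal((100, 2))
best_pair = select_sparsity(X, Y, _svd_fit)
```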
2. Generating synthetic data for CCA and PLS
We will analyze properties of CCA and PLS with the help of simulated datasets. These datasets will be drawn from a normal distribution with mean 0 and a covariance matrix Σ that encodes assumed relationships in the data. To specify Σ we need to specify relationships of features within X, i.e. the covariance matrix ΣXX ∈ ℝ^(pX×pX), relationships of features within Y, i.e. the covariance matrix ΣYY ∈ ℝ^(pY×pY), and relationships between features in X on the one side and Y on the other side, i.e. the matrix ΣXY ∈ ℝ^(pX×pY). Together, these three covariance matrices form the joint covariance matrix (Fig. 1D) Σ = [[ΣXX, ΣXY], [ΣXY⊤, ΣYY]] for X and Y, and this allows us to generate synthetic datasets by sampling from the associated normal distribution 𝒩(0, Σ).
A. The covariance matrices ΣXX and ΣYY
Given a data matrix X, the features can be re-expressed in a different coordinate system through multiplication by an orthogonal matrix U: X′ = XU. No information is lost in this process, as it can be reversed: X = X′U⊤. Therefore, we are free to make a convenient choice. We select the principal component coordinate system, as in this case the covariance matrix becomes diagonal, i.e. ΣXX = diag(λX,1, …, λX,pX). Analogously, for Y we choose the principal component coordinate system such that ΣYY = diag(λY,1, …, λY,pY).
For modeling, to obtain a concise description of ΣXX and ΣYY, we assume a power-law such that λX,i = cXX i^(aXX) and λY,i = cYY i^(aYY) with decay constants aXX and aYY (Fig. 1B). Unless a match to a specific dataset is sought, the scaling factors cXX and cYY can be set to 1, as they would only rescale all results without affecting conclusions.
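A minimal sketch of this spectrum specification (variable and function names are ours; the decay constant a ≤ 0 gives a decaying spectrum, consistent with the simulation settings below):

```python
import numpy as np

def powerlaw_variances(p, a, c=1.0):
    """Principal component variances lambda_i = c * i**a."""
    return c * np.arange(1, p + 1) ** float(a)

# within-set covariance matrix in the principal component basis
lam = powerlaw_variances(8, -1.0)
Sigma_xx = np.diag(lam)
```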
B. The cross-covariance matrix ΣXY
The between-set covariance matrix ΣXY encodes relationships between the datasets X and Y. One such relationship is completely specified if we are given the weights of the variables in each dataset, wX and wY, and the association strength of the resulting weighted composite scores.
For PLS, the relation between the between-set covariance matrix, the weight vectors and association strengths is given by ΣXY = Σᵢ σXY,i wX,i wY,i⊤, where the σXY,i are the covariances of the composite scores. Arguably, correlations are more accessible to intuition, and we therefore re-express the covariances in terms of the assumed true (canonical) correlations. For each mode with weights wX and wY and covariance σXY we have σXY = rtrue (wX⊤ ΣXX wX)^(1/2) (wY⊤ ΣYY wY)^(1/2), where wX⊤ ΣXX wX and wY⊤ ΣYY wY are, respectively, the variances of the X and Y composite scores.
For CCA, on the other hand, we have to consider the singular value decomposition of K = ΣXX^(−1/2) ΣXY ΣYY^(−1/2) = UK diag(r1, …, rm) VK⊤, where we have used Eq. (11) and Eq. (12). Here, the ri are directly the assumed true correlations and, by construction, the weights WX = ΣXX^(−1/2) UK are constrained to satisfy WX⊤ ΣXX WX = 𝟙, and analogously for WY. As mentioned above, we can exploit the property that pre-whitened data result in the same matrix K. In the following, thus, assume that we have data X and Y with ΣXX = 𝟙 and ΣYY = 𝟙. But then WX = UK and WY = VK, as well as ΣXY = WX diag(r1, …, rm) WY⊤. This is identical to the result for PLS, except that for CCA the assumption that the data are white is implicit.
Thus, in summary, to specify ΣXY we select the number m of between-set association modes, for each of them the association strength in the form of the assumed true correlation, and a set of weight vectors wX,i and wY,i (for 1 ≤ i ≤ m). The weight vectors for each set need to be orthonormal and, for CCA, both X and Y need to be white, i.e. ΣXX = 𝟙pX and ΣYY = 𝟙pY.
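Assembling the joint covariance matrix and drawing a synthetic dataset can then be sketched as follows (our own illustration: a single association mode with white within-set covariances, i.e. the implicit CCA setting; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
pX = pY = 4
r_true = 0.3  # assumed true correlation of the single mode

# white within-set covariances (the implicit CCA setting)
Sxx = np.eye(pX)
Syy = np.eye(pY)

# one association mode with unit-norm weight vectors
w_x = rng.standard_normal(pX); w_x /= np.linalg.norm(w_x)
w_y = rng.standard_normal(pY); w_y /= np.linalg.norm(w_y)
Sxy = r_true * np.outer(w_x, w_y)

# joint covariance matrix and a synthetic dataset drawn from N(0, Sigma)
Sigma = np.block([[Sxx, Sxy], [Sxy.T, Syy]])
assert np.all(np.linalg.eigvalsh(Sigma) > 0)  # positive definite
data = rng.multivariate_normal(np.zeros(pX + pY), Sigma, size=5000)
X, Y = data[:, :pX], data[:, pX:]
# corr(X @ w_x, Y @ w_y) approaches r_true as the sample size grows
```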
C. Choice of weight vectors
We impose two constraints on possible weight vectors:
We aim to obtain association modes that explain a “large” amount of variance in the data, as otherwise the resulting scores could be strongly affected by noise. The decision is based on the explained variance of only the first mode and we require that it is greater than 1/2 of the average explained variance of a principal component in the dataset, i.e. we require that wX,1⊤ ΣXX wX,1 > (1/(2 pX)) Σᵢ λX,i (Eq. (32)), and analogously for Y.
The weight vectors impact the joint covariance matrix Σ (via Eq. (27), Eq. (28) and Eq. (31)). Therefore, we require that the chosen weights result in a proper, i.e. positive definite, covariance matrix Σ.
To increase the chances of finding weights that satisfy the first constraint, we compose them as a linear combination of a high-variance subspace element and another component from the low-variance subspace. The high-variance subspace is defined as the vector space spanned by the first qX and qY (for datasets X and Y, respectively) components, where qX and qY are chosen to explain 90% of their respective within-set variances. Having chosen (see below) unit vectors whi and wlo from the high- and low-variance subspaces, they are combined as w = c whi + (1 − c²)^(1/2) wlo, so that ‖w‖₂ = 1. Here, c is a uniform random number between 0 and 1 (but see also below). If the resulting weight vectors do not satisfy the imposed constraints, new values for whi, wlo and c are drawn. Note that, in case the number of between-set association modes m is greater than 1, only the first one is used to test the constraint Eq. (32), but weight vectors for the remaining modes are composed in the same way as just described.
Weight vector components of the low-variance subspace are found by multiplication of its basis vectors Ulo ∈ ℝ^(p×(p−q)) with a rotation matrix Rlo, i.e. Wlo = Ulo Rlo, where the first m columns of Wlo are used as the low-variance subspace components of the m between-set association modes. If qX ≥ m > pX − qX (and analogously for Y), the dimensionality of the low-variance subspace is not large enough to obtain a component for all m modes in this way, so that a low-variance subspace component will be used only for the first pX − qX modes.
The rotation matrix Rlo is found as the Q-factor of a QR-decomposition of a (pX − qX) × (pX − qX) (analogously for Y) matrix with elements drawn from a standard normal distribution.
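The QR-based construction of such a random rotation can be sketched as follows (our own illustration; the optional sign fix is a common convention, not stated in the text):

```python
import numpy as np

def random_rotation(d, rng):
    """Q-factor of the QR decomposition of a standard normal matrix:
    a random orthogonal matrix."""
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    return Q * np.sign(np.diag(R))  # fix column signs for a unique convention

rng = np.random.default_rng(3)
R_lo = random_rotation(6, rng)
# the columns of R_lo are orthonormal
```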
Weight vector components of the high-variance subspace are selected in the following way (see Fig. S2). First, 10000 attempts are made to find them in the same way as the low-variance component, i.e. as the first m columns of Whi = Uhi Rhi, where the columns of Uhi are the basis vectors of the high-variance subspace, and Rhi is found as the Q-factor of a QR-decomposition of a qX × qX (analogously for Y) matrix with elements drawn from a standard normal distribution. In case this fails (i.e. if one of the two constraints is not satisfied in any of the 10000 attempts), another 10000 attempts are made in which the coefficient c is not chosen randomly between 0 and 1, but its lower bound is increased stepwise from 0.5 to 1 to make it more likely that the first constraint is satisfied.
If this also fails (which tends to happen for large ground truth correlations rtrue and large dimensionalities pX and pY), and if m = 1, a differential evolution algorithm (22) is used to maximize the minimum eigenvalue of Σ, in order to encourage the second constraint to be satisfied. Specifically, qX coefficients and qY coefficients are optimized such that the resulting weights wX and wY satisfy the constraints. As soon as the minimum eigenvalue of a resulting Σ matrix is above 10⁻⁵, the optimization is stopped. 10000 attempts are made to add a low-variance component to the optimized high-variance component in this way, and if unsuccessful, another 10000 attempts are made in which the coefficient c is not chosen randomly between 0 and 1, but its lower bound is increased stepwise from 0.5 to 1.
If this also fails, and if m = 1, the high-variance components of the weight vectors are chosen as the first principal component axes as a fallback approach. To see why this works, recall that we have assumed to work in the principal component coordinate system, so that ΣXX as well as ΣYY are diagonal. In addition, we assume that the principal component variances are normalized such that the highest (i.e. the top-left entry in ΣXX and ΣYY) is 1. We are seeking weight vectors that result in a positive definite covariance matrix Σ, and Σ is positive definite if and only if both ΣYY and the Schur complement of Σ, i.e. ΣXX − ΣXY ΣYY^(−1) ΣXY⊤, are positive definite. ΣYY is positive definite by construction. The between-set covariance matrix here is ΣXY = σXY,1 e1 e1⊤, where e1 denotes the first standard basis vector. For CCA, σXY,1 is the canonical correlation rtrue < 1. For PLS, σXY,1 = rtrue (λX,1 λY,1)^(1/2), which, with the specific choices of ΣXX and ΣYY just described, also simplifies to σXY,1 = rtrue. Thus, σXY,1 < 1, and consequently the Schur complement ΣXX − σXY,1² e1 e1⊤ is diagonal with all diagonal entries greater than 0. That shows that Σ is positive definite if the weights are chosen as the first principal component axes. To not end up with the pure principal component axes in all cases, we add a low-variance subspace component as before, i.e. we make 10000 attempts to add a low-variance component with coefficient c chosen uniformly at random between 0 and 1, and, if unsuccessful, another 10000 attempts in which the lower bound for c is increased stepwise from 0.5 to 1.
D. Summary
Thus, to generate simulated data for CCA and PLS, we vary the assumed between-set correlation strengths rtrue,i, setting them to select levels, while choosing random weights WX and WY. For CCA, as outlined in the previous section, we can use pre-whitened data for which ΣXX = 𝟙 and ΣYY = 𝟙, and as a result the cross-covariance matrix ΣXY has the same form as for PLS. The columns of the weight matrices WX and WY must be mutually orthonormal and, in addition, we assume that they are contained within a subspace of, respectively, qX and qY dominant principal components, that is WX = UXX^(qX) RXX and WY = UYY^(qY) RYY, where UXX^(qX) ∈ ℝ^(pX×qX) is the matrix of the first qX columns of UXX, RXX ∈ ℝ^(qX×qX) is unitary, and analogously for UYY^(qY) and RYY.
E. Performed simulations
For Figs. 2, 3E-F, the colored curves in Fig. 6A, Figs. 7, 8, S15, the CCA results in Fig. S17 and Fig. S3, we ran simulations for m = 1 between-set association mode assuming true correlations of 0.1, 0.3, 0.5, 0.7 and 0.9, used dimensionalities pX = pY of 2, 4, 8, 16, 32, 64, 128 as well as 25 different covariance matrices. aX + aY was fixed at 0 for CCA and −2 for PLS. 100 synthetic datasets were drawn from each instantiated normal distribution. Where not specified otherwise, null distributions were computed with 1000 permutations. Due to computational expense, some simulations did not finish and are reported as blank spaces in heatmaps.
Similar parameters were used for other figures, except for the following deviations.
For Fig. 3A-B pX was 100, rtrue = 0.3 and we used 1 covariance matrix for CCA and PLS.
For Fig. 3C-D pX was 100, rtrue = 0.3 and we used 10 and 100 different covariance matrices for CCA and PLS, respectively. For Fig. 4A-B, pX was 2, rtrue = 0.3 and we used 10000 different covariance matrices for CCA and PLS.
For Fig. 4C-D and 4G-H, we used 2, 4, 8, 16, 32 and 64 for pX, 0.1, 0.3 and 0.5 for rtrue, 10 different covariance matrices for CCA and PLS, and 10 permutations. A subset of these, namely pX = 64 and rtrue = 0.3 was used for Fig. 4E-F.
For Fig. 6, we varied rtrue from 0 to 0.99 in steps of 0.01 for each combination of pX and pY for which we have a study in our database of reported CCAs, and 1 covariance matrix for each rtrue.
For Fig. S4 pX + pY was fixed at 64 and for pX we used 2, 4, 8, 16, 32.
In Fig. S6, for pX we used 4, 8, 16, 32, 64 and, only for CCA, also 128; we used 10 different covariance matrices for both CCA and PLS and varied rtrue from 0 to 0.99 in steps of 0.01.
For Fig. S7 we used 2, 4, 8, 16 and 32 for pX, and 10 different covariance matrices for both CCA and PLS.
For Fig. S8 we used 2, 4, 8, 16, 32 and 64 for pX, 5 different covariance matrices for both CCA and PLS, 100 bootstrap iterations and did not run simulations for rtrue = 0.1.
For the PLS results in Fig. S16 and Fig. S17 we used 50 different covariance matrices for rtrue = 0.1, 0.9, as well as for rtrue = 0.7 in combination with pX = 128, 25 for rtrue = 0.5 in combination with pX = 64, and 75 for all other combinations of pX and rtrue for which the computational expense was not too high. For each instantiated joint covariance matrix, aX + aY was chosen uniformly at random between −3 and 0 and aX was set to a random fraction of the sum, drawn uniformly between 0 and 1.
In Fig. S18 we used 0.3, 0.5, 0.7 and 0.9 for rtrue, 4, 8, 16, 32 and 64 for pX, 6 different covariance matrices and 100 permutations.
3. Evaluation of sampling error
We use five metrics to evaluate the effects of sampling error on CCA and PLS analyses.
Statistical power
Power measures the capability to detect an existing association. When the true correlation is greater than 0, power is calculated as the probability, across 100 repeated draws of synthetic datasets from the same normal distribution, that the observed association strength (i. e. correlation for CCA, covariance for PLS) of a dataset is statistically significant. Significance is declared if the p-value is below α = 0.05. The p-value is evaluated as the probability that association strengths in the null distribution are at least as large as the observed one. The corresponding null distribution is obtained by performing CCA or PLS on 1000 datasets in which the rows of Y were permuted randomly. Power is bounded between 0 and 1 and, unlike for the other metrics (see below), higher values are better.
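For illustration, a single permutation test of this kind can be sketched in Python (the toy association measure and variable names here are illustrative, not the gemmr implementation):

```python
import numpy as np

def permutation_significant(assoc_fn, X, Y, n_perm=1000, alpha=0.05, seed=0):
    """One permutation test: the p-value is the fraction of permuted
    association strengths at least as large as the observed one."""
    rng = np.random.default_rng(seed)
    observed = assoc_fn(X, Y)
    null = np.array([assoc_fn(X, Y[rng.permutation(len(Y))])
                     for _ in range(n_perm)])
    return np.mean(null >= observed) < alpha

# Toy 1-D "association strength": absolute Pearson correlation
rng = np.random.default_rng(1)
x = rng.standard_normal((200, 1))
y = 0.5 * x + rng.standard_normal((200, 1))
assoc = lambda X, Y: abs(np.corrcoef(X[:, 0], Y[:, 0])[0, 1])
print(permutation_significant(assoc, x, y, n_perm=200))
# Power is then the fraction of significant results across repeated draws.
```
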
Relative error in between-set covariance
The relative error of the between-set association strength is calculated as Δr = (r̂ − r) / r, where r is the true between-set association strength and r̂ is its estimate in a given sample.
Weight error
Weight error Δw is calculated as 1 − |cosine similarity| between observed and true weights, separately for datasets X and Y, and the greater of the two errors is taken:

Δw = max( 1 − |cos(ŵX, wX)|, 1 − |cos(ŵY, wY)| ),

where cos(ŵ, w) = ŵᵀw / (‖ŵ‖ ‖w‖). The absolute value of the cosine similarity is used due to the sign ambiguity of CCA and PLS.
This error metric is bounded between 0 and 1 and is based on the cosine of the angle between the two unit vectors ŵ/‖ŵ‖ and w/‖w‖.
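For illustration, the weight-error metric can be transcribed directly from its definition (function and variable names hypothetical); note that a pure sign flip of a weight vector yields zero error:

```python
import numpy as np

def weight_error(w_est_x, w_true_x, w_est_y, w_true_y):
    """1 - |cosine similarity| between estimated and true weights,
    maximized over the X and Y datasets."""
    def err(w_est, w_true):
        cos = np.dot(w_est, w_true) / (np.linalg.norm(w_est) * np.linalg.norm(w_true))
        return 1.0 - abs(cos)
    return max(err(w_est_x, w_true_x), err(w_est_y, w_true_y))

w_true = np.array([1.0, 0.0])
w_flip = np.array([-1.0, 0.0])   # sign-flipped estimate
print(weight_error(w_flip, w_true, w_true, w_true))  # 0.0 (sign ambiguity)
```
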
Score error
Score error Δt is calculated as 1 − |Spearman correlation| between observed and true scores. The absolute value of the correlation is used due to the sign ambiguity of CCA and PLS. As for weights, the maximum over datasets X and Y is selected:

Δt = max( 1 − |rSpearman(t̂X, tX)|, 1 − |rSpearman(t̂Y, tY)| ).

Each element of a score vector represents a sample (subject). Thus, to compute the correlation between estimated and true score vectors, corresponding elements must represent the same sample, despite the fact that in each repetition new data matrices are drawn in which the samples have completely different identities. To overcome this problem and to obtain scores that are comparable across repetitions (denoted t̂(test) and t(test)), each time a set of data matrices is drawn from a given distribution 𝒩(0, Σ) and a CCA or PLS model is estimated, the resulting model (i. e. the resulting weight vectors) is also applied to a "test" set of data matrices, X(test) and Y(test) (of the same size as X and Y), obtained from 𝒩(0, Σ) and common across repeated dataset draws.
The score error metric Δt is bounded between 0 and 1 and reflects the idea that samples (subjects) might be selected on the basis of how extreme they score and that the ordering of samples (subjects) is more important than the somewhat abstract value of their scores.
Loading error
Loading error Δ𝓁 is calculated as 1 − |Pearson correlation| between observed and true loadings. The absolute value of the correlation is used due to the sign ambiguity of CCA and PLS. As for weights, the maximum over datasets X and Y is selected:

Δ𝓁 = max( 1 − |rPearson(𝓁̂X, 𝓁X)|, 1 − |rPearson(𝓁̂Y, 𝓁Y)| ).

True loadings are calculated with Eq. (3) (replacing the sample covariance matrix in the formula with its population value). Estimated loadings are obtained by correlating data matrices with score vectors (Eq. (2)). Thus, the same problem as for scores occurs: estimated and true loadings must be computed from corresponding samples. Therefore, we calculate loading errors with loadings obtained from the test data (X(test) and Y(test)) and the test scores that were also used to calculate score errors.
The loading error metric Δ𝓁 is bounded between 0 and 1 and reflects the idea that loadings measure the contribution of original data variables to the between-set association mode uncovered by CCA and PLS.
Loadings are calculated by correlating scores with data matrices. Of note, all synthetic data matrices in this study are expressed in the principal component coordinate system. In practice, however, this is not generally the case. Nonetheless, as the transformation between the principal component and the original coordinate system cannot be constrained, we do not consider this effect here.
4. Weight similarity to principal component axes
The directional means μ in Figs. 4A-B are obtained via the circular mean of the doubled angles, R = ⟨exp(2iθ)⟩, as μ = arg(R)/2.
To interpret the distribution of cosine similarities between weights and the first principal component axis, we compare this distribution to a reference, namely the distribution of cosine similarities between a random n-dimensional unit vector u and an arbitrary other unit vector v. This distribution f is given by (23)

f(x) = dP(x)/dx,

where P denotes the cumulative distribution function for the probability that a random unit vector has cosine similarity with (or projection onto) v of at most x. For −1 ≤ x ≤ 0, P can be expressed in terms of the surface area An(h) of the n-dimensional hyperspherical cap of radius 1 and height h (i. e. x − h = −1):

P(x) = An(h) / An(2),

where An(2) is the complete surface area of the hypersphere and

An(h) = (1/2) An(2) I2h−h²((n−1)/2, 1/2),

where I is the regularized incomplete beta function. Thus,

f(x) = (1 − x²)^((n−3)/2) / B((n−1)/2, 1/2) = (1/2) fβ((x+1)/2; (n−1)/2, (n−1)/2),

where B is the beta function and fβ is the probability density function of the beta distribution. Hence, (X+1)/2 ∼ Beta((n−1)/2, (n−1)/2), where X is a random variable representing the cosine similarity between two random unit vectors (or the projection of a random unit vector onto another).
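This law can be checked empirically; the following sketch (not part of the analysis pipeline) draws random unit vectors and compares the variance of their projections against the beta-distribution prediction, which implies Var[X] = 1/n:

```python
import numpy as np
from scipy import stats

# Draw uniform random unit vectors in n dimensions; their projection onto a
# fixed unit vector (here e1) is the cosine similarity X.
n = 8
rng = np.random.default_rng(0)
v = rng.standard_normal((100_000, n))
v /= np.linalg.norm(v, axis=1, keepdims=True)
cos = v[:, 0]

# Under the stated law, (X + 1)/2 ~ Beta((n-1)/2, (n-1)/2), so Var[X] = 1/n
beta = stats.beta((n - 1) / 2, (n - 1) / 2)
print(abs(cos.var() - 4 * beta.var()) < 0.01)   # 4 Var[Beta] = 1/n
```
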
5. Analysis of empirical data
We demonstrate CCA and PLS analyses using empirical data from the Human Connectome Project (HCP) (24) and UK Biobank (25).
A. Human Connectome Project data
A.1. fMRI data
We used resting-state fMRI (rs-fMRI) from 951 subjects from the Human Connectome Project (HCP) 1200-subject data release (03/01/2017) (24). The rs-fMRI data were preprocessed in accordance with the HCP Minimal Preprocessing Pipeline (MPP). The details of the HCP preprocessing can be found elsewhere (26, 27). Following the HCP MPP, BOLD time-series were denoised using ICA-FIX (28, 29) and registered across subjects using surface-based multimodal inter-subject registration (MSMAll) (30). Additionally, global signal, ventricle signal, white matter signal, and subject motion and their first-order temporal derivatives were regressed out (31).
The rs-fMRI time-series of each subject comprised 2 (69 subjects), 3 (12 subjects), or 4 (870 subjects) sessions. Each rest session was recorded for 15 minutes with a repetition time (TR) of 0.72 s. We removed the first 100 time points from each of the BOLD sessions to mitigate any baseline offsets or signal intensity variation. We subtracted the mean from each session and then concatenated all rest sessions for each subject into a single time-series.
Voxel-wise time series were parcellated to obtain region-wise time series using the “RelatedValidation210” atlas from the S1200 release of the Human Connectome Project (32). Functional connectivity was then computed as the Fisher-z-transformed Pearson correlation between all pairs of parcels.
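The connectivity computation can be sketched as follows (toy data; dimensions chosen only to mimic one subject's parcellated time series):

```python
import numpy as np

def functional_connectivity(ts):
    """ts: time points x parcels array of region-wise time series.
    Returns the vectorized upper triangle of the Fisher-z-transformed
    Pearson correlation matrix (one feature per parcel pair)."""
    r = np.corrcoef(ts, rowvar=False)     # parcels x parcels correlations
    iu = np.triu_indices_from(r, k=1)     # upper triangle, excluding diagonal
    return np.arctanh(r[iu])              # Fisher z-transform

rng = np.random.default_rng(0)
fc = functional_connectivity(rng.standard_normal((1200, 100)))
print(fc.shape)  # 100 parcels -> 100*99/2 = 4950 pairwise features
```
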
3 subjects were excluded (see section D below), resulting in a total of 948 subjects with 100 connectivity features each.
A.2. dMRI data
Diffusion MRI (dMRI) data and structural connectivity patterns were obtained as described in (33, 34). In brief, 41 major white matter (WM) bundles were reconstructed from preprocessed HCP diffusion MRI data (35) using FSL’s XTRACT toolbox (34). The resultant tracts were vectorised and concatenated, giving a WM voxels by tracts matrix. Further, a structural connectivity matrix was computed using FSL’s probtrackx (36, 37), by seeding cortex/white-grey matter boundary (WGB) vertices and counting visitations to the whole white matter, resulting in a WGB × WM matrix. Connectivity “blueprints” were then obtained by multiplying the latter with the former matrix. This matrix was parcellated (along rows) into 68 regions with the Desikan-Killiany atlas (38), giving a final set of 68 × 41 = 2788 connectivity features for each of the 1020 HCP subjects.
A.3. Behavioral measures
The same list of 158 behavioral and demographic data items as in (39) was used.
A.4. Confounders
We used the following items as confounds: Weight, Height, BPSystolic, BPDiastolic, HbA1C, the cube root of FS_BrainSeg_Vol, the cube root of FS_IntraCranial_Vol, the average of the absolute as well as the relative value of the root mean square of the head motion, the squares of all of the above, and an indicator variable for whether an earlier or later software version was used for MRI preprocessing. Head motion and software version were only included in the analysis of fMRI vs behavioral data, not in the analysis of dMRI vs behavioral data. Missing values were set to 0. All resulting confounds were z-scored across subjects.
B. UK Biobank data
B.1. fMRI data
We utilised pre-processed resting-state fMRI data (40) from 20,000 subjects, available from the UK Biobank Imaging study (25).
In brief, EPI unwarping, distortion and motion correction, intensity normalisation and highpass temporal filtering were applied to each subject’s functional data using FSL’s Melodic (41), data were registered to standard (MNI) space, and structured artefacts were removed using ICA and FSL’s FIX (28, 29, 41).
A set of resting-state networks common across the cohort was identified using a subset of subjects (≈ 4000) (40). This was achieved by extracting the top 1200 components from a group-PCA (42) followed by a spatial ICA with 100 resting-state networks (41, 43). Visual inspection revealed 55 non-artefactual ICA components. Next, these 55 group-ICA networks were dual-regressed onto each subject’s data to define grey matter nodes. The average time series of each node was used to compute partial-correlation parcellated connectomes with a dimensionality of 55 × 55. The connectomes were z-score transformed and the upper triangle vectorised to give 1485 functional connectivity features per subject, for each of the 20,000 subjects.
B.2. Behavioural measures
The UK Biobank contains a wide range of subject measures (44), including physical measures (e.g. weight, height), food and drink, cognitive phenotypes, lifestyle, early life factors and sociodemographics.
We hand-picked a subset of 3895 cognitive, lifestyle and physical measures, as well as early life factors. For categorical items, we replaced negative values with 0, as in (25); such negative values mostly encode “Do not know”/“Prefer not to answer”. We then removed measures that had missing values in more than 50% of subjects (for instance, measures reflecting subsequent visits, which were not available for the many subjects with only one visit). We also removed measures that had identical values in at least 90% of subjects, leaving 633 non-imaging measures. We then performed a redundancy check: if the correlation between any two measures was > 0.98, one of the two was randomly chosen and dropped. This procedure removed a further 62 measures (mostly physical measures, as well as some less informative sections of tests), resulting in a final set of 571 behavioural measures, available for each of the 20,000 subjects.
B.3. Confounds
We used the following items as confounds: acquisition protocol phase (due to slight changes in acquisition protocols over time), scaling of T1 image to MNI atlas, brain volume normalized for head size (sum of grey matter and white matter), fMRI head motion, fMRI signal-to-noise ratio, age, sex. In addition, similar to (25) we used the squares of all non-categorical items (i. e. T1 to MNI scaling, brain volume, fMRI head motion, fMRI signal-to-noise ratio and age), as well as age × sex and age2 × sex. Altogether these were 14 confounds.
Finally, we imputed 0 for missing values and z-scored all items.
C. Preprocessing for CCA and PLS
We prepared data for CCA following, for the most part, the pipeline in (39).
Deconfounding
Deconfounding of a matrix X with a matrix of confounds C was performed by subtracting linear predictions, i.e.

Xdeconf = X − Cβ, where β = (CᵀC)⁻¹CᵀX.

The confounds used were specific to each dataset and are listed in the previous section.
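A minimal sketch of this deconfounding step (the pseudoinverse generalizes (CᵀC)⁻¹Cᵀ to rank-deficient confound matrices; the toy data here are illustrative):

```python
import numpy as np

def deconfound(X, C):
    """Subtract linear confound predictions: X - C @ beta, where beta is the
    least-squares fit of C to X."""
    return X - C @ (np.linalg.pinv(C) @ X)

# Toy data in which X is largely driven by the confounds
rng = np.random.default_rng(0)
C = rng.standard_normal((500, 3))
X = C @ rng.standard_normal((3, 10)) + 0.1 * rng.standard_normal((500, 10))
Xd = deconfound(X, C)
print(np.allclose(C.T @ Xd, 0, atol=1e-8))  # residuals orthogonal to confounds
```
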
Neuroimaging data
Neuroimaging measures were, on the one hand, z-scored. On the other hand, normalized values were used as additional features: for each feature, the absolute value of its mean across subjects was calculated and, if this value was above 0.1 (otherwise the feature was not used in normalized form), the original values of the feature were divided by this mean and the result was z-scored across subjects.
The resulting data matrix was deconfounded (as described above), decomposed into principal components via a singular value decomposition, and the left singular vectors, multiplied by their respective singular values, were used as data matrix X in the subsequent CCA or PLS analysis.
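The PCA step can be sketched as follows (a minimal illustration with toy data; in the actual pipeline this operates on the deconfounded matrix):

```python
import numpy as np

def pca_scores(X):
    """Principal component scores via SVD: for column-centered X = U S Vt,
    the columns of U * S are the PC scores used as the CCA/PLS data matrix."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U * S

rng = np.random.default_rng(0)
scores = pca_scores(rng.standard_normal((100, 5)))
# PC scores are the data in rotated coordinates: columns are uncorrelated
cov = scores.T @ scores
print(np.allclose(cov - np.diag(np.diag(cov)), 0, atol=1e-8))  # True
```
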
Behavioral and demographic data
The lists of behavioral items used were specific to each dataset and are given in the previous sections. Given such a list, separately for each item, a rank-based inverse normal transformation (45) was applied and the result z-scored; for both of these steps, subjects with missing values were disregarded. Next, a subjects × subjects covariance matrix across variables was computed, considering for each pair of subjects only those variables that were present for both subjects. The nearest positive definite matrix of this covariance matrix was computed using the function cov_nearest from the Python statsmodels package (46). This procedure has the advantage that subjects can be used without the need to impute missing values. An eigenvalue decomposition of the resulting covariance matrix was then performed; the eigenvectors, scaled to have standard deviation 1, are principal component scores. These were then scaled by the square roots of their respective eigenvalues (so that their variances correspond to the eigenvalues) and used as matrix Y in the subsequent CCA or PLS analysis.
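The pairwise-complete covariance and nearest-positive-definite projection can be sketched as follows (toy data and variable names hypothetical; pandas' pairwise .cov() mirrors the per-pair handling of missing values described above):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.correlation_tools import cov_nearest

rng = np.random.default_rng(0)
Y = rng.standard_normal((50, 20))          # subjects x behavioral items
Y[rng.random(Y.shape) < 0.1] = np.nan      # sprinkle missing values

# subjects x subjects covariance across items, using for each pair of
# subjects only the items present for both, then projected to the
# nearest positive definite matrix
S = pd.DataFrame(Y.T).cov(min_periods=1).to_numpy()
S_pd = cov_nearest(S, threshold=1e-6)

evals, evecs = np.linalg.eigh(S_pd)
scores = evecs * np.sqrt(evals)            # columns scaled by sqrt-eigenvalues
print(evals.min() > 0)                     # positive definite after projection
```
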
D. CCA/PLS analysis
Permutation-based p-values in Fig. 5 and S11 were calculated as the probability that the CCA or PLS association strength of permuted datasets was at least as high as in the original, unpermuted data. Specifically, to obtain the p-value, rows of the behavioral data matrix were permuted and each resulting permuted data matrix together with the unpermuted neuroimaging data matrix were subjected to the same analysis as the original, unpermuted data, in order to obtain a null-distribution of between-set associations. 1000 permutations were used.
Due to familial relationships between HCP subjects, they are not exchangeable, so that not all possible permutations of subjects are appropriate (47). To account for this, in the analysis of HCP fMRI vs behavioral data, we calculated the permutation-based p-value as well as the confidence interval for the whole-data (but not the subsampled-data) analysis using only permutations that respect familial relationships. Allowed permutations were calculated using the functions hcp2blocks and palm_quickperms with default options as described in https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/PALM/ExchangeabilityBlocks (accessed May 18, 2020). No permutation indices were returned for 3 subjects, which were therefore excluded from the functional connectivity vs behavior analysis.
Subsampled analyses (Fig. 5A-D) were performed for 5 logarithmically spaced subsample-sizes between 202 and 80% of the total subject number. For each subsample size 100 subsampled data matrices were used.
Cross-validated analyses were performed with 5-fold cross-validation.
E. Principal component spectrum decay constants
The decay constant of a principal component spectrum (Fig. S1) was estimated as the slope of a linear regression (including an intercept term) of log(explained variance of a principal component) on log(principal component number). For each dataset in Fig. S1, we included as many principal components in the linear regression as necessary to explain either 30% or 90% of the variance.
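As an illustration, this estimate can be sketched as a log-log regression; the power-law spectrum used here is synthetic, and for an exact power law the true exponent is recovered:

```python
import numpy as np

def decay_constant(explained_var, n_components):
    """Slope of log(explained variance) vs log(component number),
    via least squares with an intercept term."""
    k = np.arange(1, n_components + 1)
    A = np.column_stack([np.log(k), np.ones(n_components)])
    slope, _ = np.linalg.lstsq(A, np.log(explained_var[:n_components]),
                               rcond=None)[0]
    return slope

# Power-law spectrum with decay constant -1: recovered exactly
spectrum = 1.0 / np.arange(1, 51)
print(round(decay_constant(spectrum, 30), 6))  # -1.0
```
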
6. Meta-analysis of prior literature
A PubMed search was conducted on December 23, 2019 using the query (“Journal Article”[Publication Type]) AND (fmri[MeSH Terms] AND brain[MeSH Terms]) AND (“canonical correlation analysis”) with filters requiring full text availability and studies in humans. In addition, studies known to the authors were considered. CCA results were included in the meta-analysis if they related neuroimaging-derived measures (e. g. structural or functional MRI, …) to behavioral or demographic measures (e. g. questionnaires, clinical assessments, …) across subjects, if they reported the number of subjects and the number of features of the data entering the CCA analysis, and if they reported the observed canonical correlation. This resulted in 100 CCA analyses reported in 31 publications (39, 48–77), which are summarized in SI Dataset 1.
7. Determination of required sample size
As all evaluation metrics change approximately monotonically with the number of samples per feature, we fit cubic splines to interpolate and to determine the number of samples per feature that approximately results in a given target level for the evaluation metric. For power (higher values are better) we target 0.9; for all other metrics (lower values are better) we target 0.1. Before fitting the splines, all samples-per-feature values are log-transformed and metrics are averaged across repeated datasets from the same covariance matrix. Sometimes the evaluation metrics show non-monotonic behavior (e. g. due to numerical errors). In case the cubic spline has multiple roots, we filter out those around which the spline fluctuates strongly (suggesting noise), and select the smallest remaining root ñ for which the interpolated metric remains within the allowed error margin for all simulated n > ñ; we discard the synthetic dataset if all roots are filtered out. In case a metric falls within the allowed error margin for all simulated n (i. e. even the smallest simulated n0), we pick n0.
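The interpolation step can be sketched as follows; the metric values here are hypothetical, and the root of the spline (after subtracting the target level) gives the required samples per feature:

```python
import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline

# Hypothetical metric values (e.g. weight error) on a log-spaced
# samples-per-feature grid; lower is better, target level is 0.1.
n_per_ftr = np.array([4, 8, 16, 32, 64, 128, 256])
metric = np.array([0.8, 0.55, 0.35, 0.2, 0.11, 0.06, 0.03])

# Cubic spline of (metric - target) over log(samples per feature);
# its roots are the crossings of the target level.
spline = InterpolatedUnivariateSpline(np.log(n_per_ftr), metric - 0.1, k=3)
roots = spline.roots()
required = np.exp(roots.min())   # smallest samples-per-feature meeting target
print(f"required samples per feature ~ {required:.0f}")
```
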
We suggest, in particular, a combined criterion to determine an appropriate sample size. This is obtained by first calculating sample-per-feature sizes with the interpolation procedure just described separately for the metrics power, relative error of association strength, weight error, score error and loading error. Then, for each parameter set, the maximum is taken across these five metrics.
8. Sample size calculator for CCA and PLS
Estimating an appropriate sample size via the approach described in the previous section is computationally expensive, as multiple potentially large datasets have to be generated and analyzed. To abbreviate this process (see also Fig. S14), we use the approach from the previous section to obtain sample size estimates for rtrue ∈ {0.1, 0.3, 0.5, 0.7, 0.9}, px ∈ {2, 4, 8, 16, 32, 64, 128}, py = px, and (for PLS) ax + ay ∼ 𝒰(−3, 0), ax = c(ax + ay), and c ∼ 𝒰(0, 1), where 𝒰 denotes a uniform distribution. We then fit a linear model to the logarithms of the sample size, with predictors log(rtrue), log(px + py), (for PLS) |ax + ay|, and an intercept term.
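The regression underlying this calculator can be illustrated as follows (CCA-style predictors only; for PLS, |ax + ay| would be added). The data-generating law n = 10·p/r² used here is hypothetical and serves only to show that the log-linear fit recovers its exponents:

```python
import numpy as np

def fit_log_linear(r_true, p_total, n_required):
    """Fit log(n_required) on log(r_true) and log(px + py) with intercept."""
    A = np.column_stack([np.log(r_true), np.log(p_total),
                         np.ones(len(r_true))])
    coef, *_ = np.linalg.lstsq(A, np.log(n_required), rcond=None)
    return coef

r = np.array([0.1, 0.3, 0.5, 0.1, 0.3, 0.5])
p = np.array([4.0, 4.0, 4.0, 64.0, 64.0, 64.0])
n = 10 * p / r**2                      # hypothetical exact log-linear law
coef = fit_log_linear(r, p, n)
print(np.round(coef, 3))               # exponents for log(r), log(p), intercept
```
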
We tested the predictions of the linear model using a split-half approach (Fig. S17): we refitted the model using either only sample size estimates for rtrue ∈ {0.1, 0.3} and half the values for rtrue = 0.5, or the other half of the data, and tested the resulting refitted model on the remaining data in each case.
As for PLS (unlike CCA) the direction of the weight vectors relative to the principal component axes results in varying amounts of explained variance, we also tested an alternative linear model for PLS in which log(vxvy) was included as an additional predictor, where vx and vy denote, respectively, the explained variance ratio for the X and Y weight vectors, i. e. the variance of the scores divided by the trace of the corresponding within-set covariance matrix. Note that, as the true weights are unknown in practice, this additional predictor is inaccessible, and the alternative linear model only serves to gauge how much of the uncertainty in the linear model is due to this unobservable component.
9. The gemmr software package
We provide an open-source Python package, called gemmr, that implements the generative modeling framework presented in this paper. Among other functionality, it provides estimators for CCA, PLS and sparse CCA; it can generate synthetic datasets for use with CCA and PLS using the algorithm laid out above; it provides convenience functions to perform sweeps of the parameters on which the generative model depends; it calculates required sample sizes to bound power and other error metrics as described above. For a full description, we refer to the package’s documentation.
10. Code and data availability
Our open-source Python software package, gemmr, is freely available at https://github.com/murraylab/gemmr. It has dependencies on scikit-learn (78), statsmodels (46), xarray (79), pandas (80), scipy (81) and numpy (82) among others. Jupyter notebooks detailing the analyses and generation of figures presented in the manuscript are made available as part of the package documentation. The outcomes of synthetic datasets that were analyzed with CCA or PLS are available from https://osf.io/8expj/.
ACKNOWLEDGMENTS
This research was supported by NIH grants R01MH112746 (J.D.M.), R01MH108590 (A.A.), R01MH112189 (A.A.), U01MH121766 (A.A.), and P50AA012870 (A.A.); Wellcome Trust grant 217266/Z/19/Z (S.S.); a SFARI Pilot Award (J.D.M., A.A.); DFG research fellowship HE 8166/1-1 (M.H.), Medical Research Council PhD Studentship UK MR/N013913/1 (S.W.), NIHR Nottingham Biomedical Research Centre (A.M.). Data were provided by the Human Connectome Project, WU-Minn Consortium (Principal Investigators: David Van Essen and Kamil Ugurbil; 1U54MH091657) funded by the 16 NIH Institutes and Centers that support the NIH Blueprint for Neuroscience Research; and by the McDonnell Center for Systems Neuroscience at Washington University. Data were also provided by the UK Biobank under Project 43822. In part, computations were performed using the University of Nottingham’s Augusta HPC service and the Precision Imaging Beacon Cluster.
Footnotes
Conceptualization: MH, SW, AA, SNS, JDM. Methodology: MH, JDM. Software: MH. Formal analysis: MH, SW, AM, BR. Resources: AA, SNS, JDM. Data Curation: AM, JLJ, AH. Writing - Original Draft: MH, JDM. Writing - Review & Editing: All authors. Visualization: MH. Supervision: JDM. Project administration: JDM. Funding acquisition: AA, SNS, JDM.
J.L.J, A.A. and J.D.M. have received consulting fees from BlackThorn Therapeutics. A.A. has served on the Advisory Board of BlackThorn Therapeutics.