Abstract
Biological aging is revealed by physical measures, e.g., DNA probes or brain scans. Individual differences in personal functioning are instead explained by psychological constructs. Constructs such as intelligence or neuroticism are typically assessed by a specialized workforce through tailored questionnaires and tests. Similar to how brain age captures biological aging, intelligence and neuroticism may provide empirical proxies for mental health. Could the combination of brain imaging and sociodemographic information yield measures for these constructs that do not rely on human judgment? Here, we built proxy measures by applying machine learning to multimodal MR images and rich sociodemographic information from the largest brain-imaging cohort to date: the UK Biobank. Objective comparisons revealed that all proxies captured the target constructs and related to health-contributing habits beyond the measures they were derived from. Our results demonstrate that proxies targeting classical psychological constructs reveal facets of mental health complementary to the information conveyed by brain age.
Introduction
Individual assessments in psychology and psychiatry rely on observing behavior. Using biological insight to diagnose and treat mental disorders remains a hard problem despite substantial research efforts (Kapur et al., 2012). The field of psychiatry has struggled with purely descriptive and unstable diagnostic systems (Insel et al., 2010), small sample sizes (Szucs and Ioannidis, 2017), and reliance on dichotomized groups, i.e., patients vs controls (Hozer and Houenou, 2016). Compared to somatic medicine, mental-health research faces the additional roadblock that mental pathologies cannot be measured the way diabetes can be assessed through plasma levels of insulin or glucose. Psychological constructs, e.g., depressiveness or anxiety, can only be probed indirectly through expert-built procedures such as specially crafted questionnaires and structured interviews. Reliably measuring a given construct is difficult, and questionnaires often remain the best option (Enkavi et al., 2019). While the field of psychometrics has thoroughly studied the validity of psychological constructs and their measures (Borsboom et al., 2004; Cronbach and Meehl, 1955; Eisenberg et al., 2019), the advent of new biophysical measurements of the brain brings new promise (Engemann et al., 2020; Kievit et al., 2018b; Nave et al., 2018). In particular, the growth of biobanks as well as advances in statistical-learning techniques opens the door to large-scale validation of psychological constructs and measures for neuropsychiatric research (Collins, 2012).
In clinical neuroscience, machine learning is increasingly popular, driven by the hope to develop more generalizable models (Woo et al., 2017). Yet, to be reliable, machine learning needs large labeled datasets (Varoquaux, 2018). Its application to learning imaging biomarkers of neuropsychiatric disorders is limited by the availability of large cohorts with high-quality neuropsychiatric assessments (Bzdok and Meyer-Lindenberg, 2018). However, data on populations without diagnosed neuropsychiatric conditions is easier to collect. Such data has driven successes in developing brain-derived aging measures that capture proxy information on mental health (Cole et al., 2015, 2018; Dosenbach et al., 2010; Engemann et al., 2020; He et al., 2020; Koutsouleris et al., 2014; Liem et al., 2017; Smith et al., 2020). Extrapolating from these successes, we propose to learn more such proxy measures of health-related individual traits in large datasets. These could then enhance an analysis in a small dataset via links between the proxy measures and the actual clinical endpoint of interest, e.g., diagnosis or drug response. Emerging results validate the usefulness of age as one such proxy measure, leading to the so-called brain-age delta: the difference between predicted and actual age (Cole et al., 2015; Dosenbach et al., 2010; Smith et al., 2019a). The delta has been shown to reflect physical and cognitive impairment in adults and provides an index of neurodegenerative processes (Gonneaud et al., 2020; Liem et al., 2017). Can this strategy of biomarker-like proxy measures be extended beyond the construct of aging? Can measures derived from targets other than age serve as proxies for latent constructs?
Beyond aging, one high-stakes target is intelligence, which is measured through socially administered tests and is one of the most extensively studied constructs in psychology. Fluid intelligence refers to the putatively culture-free, heritable and physiological component of intelligence (Cattell, 1963; Cattell and Scheier, 1961). Fluid intelligence is a latent construct designed to capture individual differences in cognitive capacity. It has been robustly associated with neuronal maturation and is typically reflected in cognitive-processing speed and working-memory capacity (Shelton et al., 2010). Compared to brain age, fluid intelligence may yield a proxy measure more specifically indexing cognitive function. It has been associated with psychiatric disorders such as psychosis, bipolar disorder and substance abuse (Keyes et al., 2017; Khandaker et al., 2018).
Neuroticism is a second promising target. As a key representative of the extensively studied Big Five personality inventory, neuroticism has a long-standing tradition in the psychology of individual differences (Costa and McCrae, 1992; Eysenck et al., 1985). Neuroticism is typically measured using self-assessment questionnaires and conceptualized as capturing dispositional negative emotionality, including anxiety and depressiveness (Shackman et al., 2016). It has been inter-culturally validated (Cattell and Scheier, 1961; Lynn and Martin, 1997) and population-genetics studies have repeatedly linked variance in neuroticism to shared genes (Birley et al., 2006; Hettema et al., 2006; Pedersen et al., 1988). Neuroticism has proven useful in psychometric screening and supports the prediction of real-world behavior (Lahey, 2009; Tyrer et al., 2015). However, despite strong heritability at the population level (Power and Pluess, 2015; Vukasović and Bratko, 2015), its link with brain function, whether at the level of large-scale network dynamics or of molecular mechanisms, is still being actively researched (Shackman et al., 2016; Yarkoni, 2015).
The advent of large MRI datasets has revealed the complexity of predicting personality traits from brain signals. Current attempts to predict fluid intelligence or neuroticism from thousands of MRI scans argue in favor of overwhelming heterogeneity and rather subtle effects that do not generalize strongly to unseen data (Dubois et al., 2018a,b). This stands in contrast to the remarkable performance obtained when predicting intelligence or neuroticism from other psychometric measures or semantic data qualitatively similar to psychometric questionnaires, e.g., Twitter and Facebook posts (Quercia et al., 2011; Youyou et al., 2015). As MRI acquisitions can be expensive and difficult in clinical settings or populations, the promise of social-media data is appealing. However, in clinical practice or research, such data can lead to measurement and selection biases that are difficult to control. On the other hand, background sociodemographic characteristics of individuals can be easily accessible and may inform in similar ways on the heterogeneity of psychological traits, for instance capturing that fluid intelligence decreases with age (Horn et al., 1981). An important question is then whether such data can reveal non-redundant information on the constructs of interest.
Another challenge of quantifying psychological traits is the diversity of measurement scales, often categorical or expressed in arbitrary non-physical units, e.g., educational degree or monthly income. In fact, society treats individual differences as categorical or continuous, depending on the practical context. Personality has been proposed to span a continuum (Eysenck, 1958). Nevertheless, psychiatrists treat certain people as patients and not others (Perlis, 2011). The utility of any mental-health measure therefore depends on its practical context: when learning boundaries between qualitatively distinct groups, a measure that performs poorly overall as a continuous scale can nevertheless be sufficient to distinguish subgroups. In fact, a measure may be informative solely around the boundary region between certain classes, e.g., pilots who should fly and those who should not. Importantly, the utility of any measure ultimately depends on its signal-to-noise ratio, which may be driven by measurement noise, heterogeneity, as well as the interesting variability of the particular construct measured, e.g., the type of test used to assess intelligence.
Confronting the promises of population brain imaging with the challenges of measuring psychological traits raises the following questions. 1) How well can various health-related latent constructs be approximated from general-purpose inputs not designed to measure specific latent constructs? 2) What is the relative merit of brain imaging and sociodemographics for probing various latent constructs? 3) Can the success of brain age be extended to other proxy measures that capture complementary facets of health-contributing behavior? In this study, we tackled these questions by using machine learning to build proxy measures, crafted to approximate well-characterized target measures from brain-imaging and sociodemographic data. As target measures, we studied age, fluid intelligence, and neuroticism. Figure 1 summarizes our approach. We first assessed how well the proxy measures approximated the target measures, isolating the contributions of the different data types. Second, to assess the intrinsic value of the proxy measures, we studied their associations with health-related habits (alcohol consumption, cumulative tobacco consumption, sleep duration, physical activity). Results suggest that, as with brain age, proxy measures can bring value for the study of mental health that goes beyond approximating an available measure.
Results
Traditional measures of mental health can be empirically approximated
We first performed model comparisons to evaluate the relative performance of proxy measures built from brain signals and distinct groups of sociodemographic variables. Figure 2 summarizes these model comparisons for approximating three targets: age, fluid intelligence and neuroticism. For the sociodemographic variables (Figure 2, dotted outlines), the analysis revealed that, for each target, there was one principal block of variables explaining most of the prediction performance. Combining all sociodemographic variables did not lead to obvious enhancements (Figure 2 – Figure supplement 2). For age prediction, variables related to current lifestyle showed by far the highest performance. For fluid intelligence, education performed best by far. Finally, for neuroticism, mood & sentiment clearly showed the strongest performance.
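The logic of such a block-wise comparison can be sketched on simulated data. This is a minimal illustration assuming scikit-learn; the random-forest model, the block names, and the simulated variables are placeholders, not the study's actual pipeline or UK Biobank variables:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
n = 300
life_style = rng.randn(n, 5)   # toy stand-in for a "life-style" variable block
education = rng.randn(n, 3)    # toy stand-in for an "education" variable block
# toy target: driven by one block only, so that block should predict best
age = 2.0 * life_style[:, 0] + rng.randn(n)

# fit one model per block and compare cross-validated R^2 scores
scores = {}
for name, block in {"life_style": life_style, "education": education}.items():
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    scores[name] = cross_val_score(model, block, age, cv=5, scoring="r2").mean()
```

In this toy setup, the block that actually carries the signal obtains a clearly higher cross-validated score, mirroring how one principal block of variables can dominate the comparison.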
When combining MRI and sociodemographics (Figure 2, solid outlines), age prediction was enhanced in a systematic and visible way for all four blocks of variables (Table 1), suggesting that the observed differences should reproduce on future data and are unlikely to be due to chance. The benefit of including brain-imaging features, however, was less marked for prediction of fluid intelligence and neuroticism. With fluid intelligence, brain-imaging data improved the performance statistically significantly for all models, yet with small effect sizes on the scale of a few percent or even lower (Table 1). Further, for neuroticism, no systematic advantage of including brain images alongside sociodemographics emerged. Instead, including brain images seemed to reduce generalization performance when predicting from mood & sentiment variables (Table 1, bottom row). Nevertheless, using only brain data was sufficient for statistically significant approximation of the target measures not only for age but also for fluid intelligence and neuroticism (Table S1), suggesting that lifestyle and mood & sentiment explain at least some of the neurobiological variance. For neuroticism, variables on current mood & sentiment were strongly informative for prediction, reflecting that mood & sentiment is strongly related to neuroticism. Overall, predicting fluid intelligence or neuroticism was clearly more successful when sociodemographic data was included (Table 1). For subsequent analyses we included all sociodemographic variables (Figure 2 – Figure supplement 2).
One important challenge with evaluating approximations of psychological measures is that such measures often come without physical scales and units (Stevens et al., 1946). In practice, clinicians and educators use them with specific thresholds for decision making. How useful are proxy measures built with predictive models for separating out discrete extreme groups? To address this question, we performed binary classification of extreme groups obtained by discretizing the targets at the 33rd and 66th percentiles. Moreover, we focused on the area under the curve (AUC) as a performance metric, which is only sensitive to ranking while ignoring the scale of the error. The results are comparable to the previous regression analysis. Classification performance for extreme groups visibly exceeded the chance level of an AUC of 0.5 for all models (Figure 3). Across proxy measures, models including sociodemographics performed best, but the difference between purely sociodemographic and brain-based models was comparably weak, on the order of 0.01-0.02 AUC points (Table 2). Using only brain data resulted in proxy measures that performed less well, yet still better than chance, as revealed by permutation testing (Table S2). It is noteworthy that for both types of models discrimination performance reached levels above 0.8, which is considered clinically useful for biomarkers (Perlis, 2011). Overall, the results suggest that, when moving from the more difficult full-scale regression problem to an extreme-group classification problem with a purely ranking-based loss, the relative differences between brain-based and sociodemographics-based prediction gradually faded away.
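The extreme-group construction can be sketched as follows. This is a minimal illustration on simulated data, assuming scikit-learn; the logistic-regression classifier and the toy features are illustrative placeholders for the study's actual inputs and models:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(42)
n = 600
X = rng.randn(n, 10)                           # toy input features
target = X[:, :3].sum(axis=1) + rng.randn(n)   # toy continuous target

# discretize at the 33rd and 66th percentiles and drop the middle group
low, high = np.percentile(target, [33, 66])
extreme = (target <= low) | (target >= high)
y = (target[extreme] >= high).astype(int)      # 1 = high group, 0 = low group

# AUC is ranking-based: it ignores the scale of the prediction error
auc = cross_val_score(LogisticRegression(), X[extreme], y,
                      cv=5, scoring="roc_auc").mean()
```

Because only the extreme terciles are kept and AUC depends only on ranking, this evaluation is easier than full-scale regression, which is consistent with the narrowing gap between input modalities reported above.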
External validity: proxy measures capture ecological health-related factors
Results so far have shown that psychological constructs can be approximated from general-purpose inputs such as brain images and sociodemographic variables that are not tailored to measure these latent constructs. Beyond approximating target measures, which are themselves imperfect, can our empirically derived proxy measures capture complementary facets of real-world behavior? To address this question we studied the link between the three proxy measures (built for brain age, fluid intelligence and neuroticism) and various health behaviors (sleep, physical exercise, alcohol and tobacco consumption). These behaviors are more ecological probes of mental health than questionnaires or lab-based measures and are potentially linked in multiple ways to our proxy measures. We therefore modeled them as weighted sums of the predicted brain-age delta, fluid intelligence and neuroticism using multiple linear regression. To avoid any form of circularity, we used the out-of-sample predictions for all three proxy measures, applied to the generalization dataset that was not used for building the machine-learning models. We derived the brain-age delta by subtracting the actual age from the predicted age. To mitigate brain-age bias (Le et al., 2018), we deconfounded the health-related habits for their association with actual age (Engemann et al., 2020; Smith et al., 2019a).
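The delta computation and the age-deconfounding step can be written out compactly. This is a minimal sketch on simulated data; the variable names and the simple linear residualization are illustrative assumptions, not the study's exact deconfounding procedure:

```python
import numpy as np

rng = np.random.RandomState(0)
n = 500
age = rng.uniform(40, 70, n)               # actual age
predicted_age = age + 4.0 * rng.randn(n)   # stand-in for out-of-sample predictions
brain_age_delta = predicted_age - age      # the brain-age delta

habit = 0.1 * age + rng.randn(n)           # toy habit confounded by actual age
# deconfound: remove the linear age effect from the habit via least squares
design = np.c_[np.ones(n), age]            # intercept + actual age
beta, *_ = np.linalg.lstsq(design, habit, rcond=None)
habit_deconfounded = habit - design @ beta # residualized habit, orthogonal to age
```

After residualization, the habit no longer correlates with actual age, so any remaining association with the delta cannot be a trivial age effect.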
The estimated regression coefficients, capturing partial correlations, revealed specific as well as complementary associations between the proxy measures and health-related behavior (Figure 4). A marginal association analysis showed similar patterns, indicating that the relationships hold also when considering the proxy measures in isolation (Figure 4 – Figure supplement 1). An elevated brain-age delta was consistently associated with an increased number of alcoholic beverages. The other two proxy measures showed no consistent association with alcohol consumption (Figure 4, first row). Level of physical exercise (measured through the number of minutes spent weekly on metabolic-equivalent tasks) was consistently associated with the scores on all three predicted targets, suggesting independent associations (Figure 4, second row). This may seem counter-intuitive but could simply point to the possibility that people with higher test scores tend to have a more sedentary lifestyle. Sleep duration was independently associated with the brain-age delta and predicted neuroticism, but in opposite directions (Figure 4, third row): increased sleep duration consistently went along with elevated brain age but lower levels of predicted neuroticism. No consistent effect emerged for fluid intelligence. The cumulative number of cigarettes smoked was independently and consistently associated with all predicted targets (Figure 4, last row): intensified smoking went along with an elevated brain-age delta and elevated neuroticism but lower fluid intelligence.
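The association model above, one health habit regressed jointly on the three proxy measures so that each coefficient captures a partial association, can be sketched on simulated data. The effect sizes and signs below are illustrative assumptions chosen to mimic the smoking pattern described, not the study's estimates:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(1)
n = 400
# columns: brain-age delta, predicted fluid intelligence, predicted neuroticism
proxies = rng.randn(n, 3)
proxies -= proxies.mean(axis=0)   # standardize so coefficients are comparable
proxies /= proxies.std(axis=0)
# toy habit: smoking rises with the delta and neuroticism, falls with intelligence
habit = (0.5 * proxies[:, 0] - 0.3 * proxies[:, 1]
         + 0.3 * proxies[:, 2] + rng.randn(n))

# joint fit: each coefficient is a partial association, adjusted for the others
coefs = LinearRegression().fit(proxies, habit).coef_
```

With the three proxies entered jointly, each coefficient reflects what that proxy adds beyond the other two, which is what allows the complementary patterns in Figure 4 to be read off directly.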
The question remains whether the proxy measures bring any additional value compared to the original target measures that they were derived from. Studying the association of these original target measures with the health-related habits shows similar trends: associations with the same signs as with the proxy measures (Figure 4, B). However, these associations were noisier or less marked than those seen with the proxy measures.
These results demonstrate that the proxy measures capture health-related habits well, potentially better than the original target measures, and in a complementary way across the three measures.
Discussion
In this study, we have extended the brain-age approach for neuroimaging to the wider notion of empirical proxy measures. Guided by machine learning, we have derived empirical approximations of traditional, extensively validated target measures from psychology. Beyond biological age, we focused on cognitive capacity (assessed via the fluid-intelligence test) and negative emotionality (assessed via the neuroticism questionnaire). Our proxy measures were derived from data not explicitly designed to assess specific latent constructs: brain-imaging data and heterogeneous sociodemographic descriptors. We observed that the combination of brain imaging and target-specific sociodemographic inputs often improved approximation performance. On the held-out data that was not used for model construction, we found important associations between all proxy measures and ecological health indicators. These associations were often complementary and useful beyond the information conveyed by the approximated targets.
Constructs of mental-health can be accessed from general-purpose data
Brain age has served as a landmark in this study, both conceptually and empirically. It has been arguably the most discussed candidate for a surrogate biomarker in the neuroimaging literature so far (Cole et al., 2015; Dosenbach et al., 2010; Smith et al., 2019a). With mean absolute errors around 4 years, up to 67% variance explained, and AUC scores up to 0.93 in the classification setting, our results compare favorably to the recent brain-age literature within the UK Biobank (Cole et al., 2017; Smith et al., 2020) and in other datasets (Engemann et al., 2020; Liem et al., 2017), even though we relied on standard, non-optimized inputs and algorithms rather than deep learning (He et al., 2018). Applying the same approach to other behavioral outcomes that probe psychological constructs, namely fluid intelligence and neuroticism, we found that these were considerably harder to approximate from general brain-imaging data or sociodemographic descriptors.
It is important to recapitulate that approximation quality on the three targets investigated has a different meaning for each, as they are measured differently. On the one hand, age is a physical variable measured with meaningful units (years) on a ratio scale (Stevens et al., 1946) (Selma is twice as old as Bob). On the other hand, psychometric scores such as fluid intelligence (measured via socially administered performance tests) and neuroticism (measured by self-assessment via questionnaires) are unit-free scores resulting from operationalized counting, which provokes ambiguity regarding the level of measurement (Borsboom, 2005). Their implied scales may be considered as interval (the difference between Bob's and Selma's intelligence is -0.1 standard deviations) if not rather ordinal (Bob's intelligence was ranked below Selma's) (Stevens et al., 1946). In day-to-day psychological practice, these scores are often used via practically defined thresholds, e.g., school admission or pilot candidate selection in aviation (Carretta, 2011; Carretta and Ree, 1994). Approximations of these measures via empirically defined proxies should thus be subjected to different standards: brain-age prediction should be gauged according to its natural continuous scale; we observed more than 50% of the variance explained. Instead, approximation of psychometric scores might be more appropriately gauged via implicit thresholds, hence, discrimination tasks. With the corresponding metrics, the receiver-operating characteristic (ROC) curve and its AUC score, all proxy measures approached or exceeded a performance of 0.80 deemed relevant in biomarker development (Perlis, 2011), though, to be fair, they approximated established psychometric targets (proxy measures themselves) and not a medical condition.
Nevertheless, the out-of-sample associations of the approximated constructs (the proxy measures) with health-related habits (Figure 4) paint a more complete picture of their value. Sleep duration, minutes spent exercising, and the amount of alcoholic drinks or cigarettes consumed were specifically and complementarily associated with all proxy measures in more than 4000 held-out individuals. In other words, we found multiple statistically important associations with the fluid-intelligence and neuroticism proxy measures that were not accounted for by brain age. Compared to the traditional measures (Figure 4 B), the associations between these proxy measures and ecological behavioral traits were less noisy, hence more consistent, regardless of their approximation quality (Figure 4 A). This may seem surprising at first, but the target measures are themselves noisy and of imperfect ecological validity. Conversely, the proxy measures are assembled via richer phenotyping than the target measures, drawing from both fine-grained sociodemographics and brain signals, which can help refine them.
The benefits offered by brain data depend on the approximated construct
All brain-derived approximations were statistically meaningful. Yet only for age prediction did imaging data by itself reach convincing performance levels. Combining brain-imaging data with sociodemographics led to systematically enhanced performance for predicting age and, less strongly, fluid intelligence (Table 1). On the other hand, for neuroticism, including brain imaging never substantially improved the approximation. Does this mean that brain imaging could be avoided in practice when approximating latent constructs? Such a view is probably misleading, as the numerical quality of the approximation is not the only thing that matters in a proxy measure. The interest in building a proxy measure of age from brain imaging is justified by its interpretation as an index of precocious or accelerated biological aging (Cole et al., 2015, 2017; Smith et al., 2020). In contrast, it is not yet clear that an age delta built from sociodemographic inputs, along the lines of a "social age", supports such an interpretation. From this point of view, one may even prefer purely brain-based assessment of individual aging, though sociodemographics probably provide important context to the brain images.
For fluid intelligence and neuroticism the situation seemed more complex. For both targets, the best-performing sociodemographic model was based on inputs semantically close to the construct of interest, i.e., education details for fluid intelligence and mood & sentiment for neuroticism. While those results reinforce the construct validity of the measures, they also come with a certain risk of circularity. In particular, the causal role of those predictors is not necessarily clear, as better educational attainment is itself heritable (Krapohl et al., 2014) and may reinforce existing cognitive abilities rather than simply result from them. Similarly, prolonged emotional stress due to life events may exacerbate existing dispositions to experience negative emotions captured by neuroticism (Colodro-Conde et al., 2018), traits which in turn commonly contribute to accumulating stressful life events (Lahey, 2009). Nevertheless, for fluid intelligence but not neuroticism, brain imaging added incremental value when combined with various sociodemographic predictors. This may suggest that the cues for neuroticism conveyed by brain imaging were already present in various sociodemographic predictors, potentially hinting at common causes.
It may be worthwhile to revisit the frequently reported difficulty of predicting complex traits from brain imaging, especially fMRI (Dubois et al., 2018a,b; Liem et al., 2017; Maglanoc et al., 2020). This may not be entirely surprising at a theoretical level, as it has even been argued that psychometric measures of complex traits may not map to biological mechanisms in simple ways (Yarkoni, 2015). Of course, this does not preclude the investigation of their brain correlates and mechanisms (Cole et al., 2015; Cox et al., 2019a; Kievit et al., 2018a; Shackman et al., 2016). It rather emphasizes the importance of searching for appropriate signals and representations supporting the given modeling goals (Bzdok and Ioannidis, 2019). As a speculation, some traits could be tightly linked to the current predominant behavior, which may be poorly reflected by resting-state recordings. To consider an extreme counter-example, disorders of consciousness (a stable trait induced by severe brain injuries) manifest themselves in systematically and intensely altered brain activity, hence, can be robustly detected from fMRI and EEG signals regardless of the stimulation presented (Demertzi et al., 2019; Engemann et al., 2018). In this context, the recent turn towards naturalistic stimuli and movies (Hasson et al., 2010; Jääskeläinen et al., 2016; Nummenmaa et al., 2012; Sonkusare et al., 2019; Venkatesh et al., 2020) may be promising, as trait-level differences in emotion and cognition may need to be systematically provoked by potent stimuli, e.g., emotionally charging or cognitively demanding cinematic content.
Empirically-derived proxy measures: From validity to practical utility
The validity of constructs and their measures remains a challenging question (Borsboom, 2005; Borsboom et al., 2004; Cronbach and Meehl, 1955). Here, we have demonstrated reasonable out-of-sample generalization for our proxy measures. Yet, generalization performance in itself, arguably, only yields an upper bound for the validity of the measure for a target construct, comparable to internal-consistency checks and retest reliability in classical psychometrics. Even a perfect approximation may be limited by the quality of the target measure, as fluid intelligence and neuroticism are notoriously difficult to measure without noise. In our study, the construct validity of the corresponding proxy measures is supported by the substantial gain in prediction performance brought by related information, namely education history and mental-health variables respectively (Figure 2). Moreover, the association with health-relevant habits brings external validity to the proxies (Figure 4). For example, the complementary patterns that emerged can be related to traditional construct semantics: high consumption of cigarettes is typically associated with neuroticism (Terracciano and Costa Jr, 2004) and excessive drinking may lead to brain atrophy and cognitive decline (Topiwala et al., 2017), both common correlates of brain age (Liem et al., 2017; Wang et al., 2019).
This raises the question of the practical utility of such empirically derived proxy measures: can they substitute specific psychometric instruments? The present study does not claim to give an unequivocal answer to this question, as the utility of proxy measures will depend on the practical context. A specialized mental-health professional may prefer an established routine for clinical assessment, relying on scores such as intelligence tests and personality scales like neuroticism, and potentially applying implicit, experience-based thresholds. Based on our findings, inclusion of brain imaging may even seem to yield diminishing returns when approximating high-level psychological traits. Yet it may simply be a matter of time until more effective acquisition protocols are discovered alongside signal representations supporting predictive modeling. While the cost of including brain imaging may seem exorbitant, whenever available, its inclusion seems to be a "safe bet", as machine learning is capable of selecting relevant inputs (Engemann et al., 2020) and the costs of MRI acquisition can be amortized by baseline clinical usage. Moreover, our study shows that the associations of the proxy measures with health habits compare favorably to the original target measures. As such, the proxy measures may open new doors when tailored assessment of latent constructs is not applicable due to a lack of specialized mental-health workforce or sheer cost. For instance, they may bring mental-health assessment into research endeavors on large populations, e.g., for etiology, nosology, or typical epidemiology questions such as risk factors or treatment evaluation. In addition, results derived on large populations can be transferred to clinical data with finer mental-health assessment, e.g., smaller cohorts, possibly leveraging dedicated methods (He et al., 2020; Pan and Yang, 2009).
Relying on three proxy measures rather than the brain age alone promises a wider array of applications.
Limitations
This study has validated proxy measures of three target constructs. The selection of these targets was guided by a literature review as well as the goal of finding representative health-related measures with complementary semantics. Additional constructs and psychometric tools could have been considered. Intelligence can be characterized by multiple facets. The broader construct of intelligence as a general factor (g-factor) is often estimated using latent factor models on multiple correlated tests. While g-factor modeling can be interesting in its own right, we are less interested in normative assessment of intelligence than in capturing inter-individual variance related to cognitive capacity as a situational fitness signal. Such variations have been repeatedly linked to mental-health conditions (Khandaker et al., 2018). Likewise, there is a wealth of questionnaires designed to measure negative emotionality and neuroticism specifically. Yet we could only study the one available in the UK Biobank data, the EPQ neuroticism scale. A complementary approach, leading to different scientific questions, would be to estimate latent factors by pooling all non-imaging data semantically related to neuroticism (Maglanoc et al., 2020). Instead, we chose to consider established target measures "as is" rather than derivatives, to avoid bringing in additional measure-validity considerations. Nevertheless, our framework encourages future studies targeting more sophisticated representations of latent constructs.
Second, while the study was clinically motivated, it falls short of directly testing the clinical relevance of the estimated proxy measures. Indeed, even in a very large general-population cohort such as the UK Biobank, there are only a few hundred diagnosed cases of mental disorders (ICD-10 mental-health diagnoses from the F chapter) with brain-imaging data available. This challenge highlights the practical importance of studying mental health as a continuum, in addition to diagnosed conditions. In this direction, our analysis of health-related habits does provide some clinical relevance.
Finally, our study falls short of presenting a fine-grained spatial analysis of the imaging data. This work has focused on the approximation quality of proxy measures, relying on methods that are not designed for fine-grained inference on predictors (Bzdok et al., 2018), though future work could explore post-hoc explanations (Biecek, 2018). Our analysis comparing the quality of models helps isolate major explanatory factors, yet does not provide brain mapping (Cole, 2020; Cox et al., 2019a; Kievit et al., 2018a).
Conclusion
Empirically-derived proxy measures targeting age, fluid intelligence and neuroticism reveal complementary facets of real-world behavior that contribute to maintaining mental health. As the relative importance of brain imaging and sociodemographics varies with the approximated target, we recommend generously including all available data and approximating as many targets as possible while letting machine learning perform the labor of integration. We believe that further developing and using proxy measures for constructs that are difficult to assess is a promising agenda for mental-health research. Therefore, we have made all data analysis and visualization source code available on Github: https://github.com/KamalakerDadi/proxy_measures_2020.
Materials and Methods
Dataset
The United Kingdom Biobank (UKBB) database is to date the most extensive large-scale cohort aimed at studying the determinants of health outcomes in the general adult population. The UKBB is openly accessible and contains extensive data acquired on 500 000 individuals aged 40-70 years, covering rich phenotypes, health-related information, brain-imaging and genetic data (Collins, 2012). Participants were invited for repeated assessments, some of which included MR imaging. For instance, cognitive tests that were administered during an initial assessment were also administered during the follow-up visits. This made it possible to find, for many subjects, at least one visit containing all the heterogeneous input data needed to develop the proposed proxy measures. The study was conducted using the UKBB Resource Application 23827.
Participants
All participants gave informed consent. The UKBB study was examined and approved by the North West Multi-centre Research Ethics Committee. We considered participants who had responded to cognitive tests and questionnaires, and for whom primary demographics and brain images were available (Sudlow et al., 2015). Out of the total UKBB population, we found 11 175 participants who had repeated assessments overlapping with the first brain-imaging release (Miller et al., 2016). The sample comprised 51.6% females (5 572) and 48.3% males (5 403), with ages ranging between 40 and 70 years (mean 55 years, standard deviation 7.5 years). Out of the complete analysis set, 5 587 individuals were used to train the models and the remaining subjects were set aside as a held-out set for generalization testing (see section Model development and generalization testing).
To establish specific comparisons between models based on sociodemographics, brain data or their combinations we exclusively considered the cases for which MRI scans were available. The final sample sizes used for model construction and generalization testing then depended on the availability of MRI: For age and fluid intelligence, our random splitting procedure (Model development and generalization testing) yielded 4203 cases for model building and 4157 for generalization. For cases with valid neuroticism assessment, fewer brain images were available, which yielded 3550 cases for model building and 3509 for generalization.
Data acquisition
Sociodemographic data (non-imaging) were collected with self-report measures administered through touchscreen questionnaires, complemented by verbal interviews, physical measures, biological sampling and imaging data. MRI data were acquired with a Siemens Skyra 3T scanner using a standard Siemens 32-channel RF receive head coil (Alfaro-Almagro et al., 2018). We considered three MR imaging modalities, as each of them potentially captures unique neurobiological details: structural MRI (sMRI/T1), resting-state functional MRI (rs-fMRI) and diffusion MRI (dMRI). For technical details about the MR acquisition parameters, please refer to Miller et al. (2016). We used image-derived phenotypes (IDPs) of those distinct brain-imaging modalities, as they provide actionable summaries of the brain measurements and encourage comparability across studies.
Target measures
As our target measure for brain-age modeling, we used an individual’s age at baseline recruitment (UKBB code “21022-0.0”). Fluid intelligence was assessed using a cognitive battery designed to measure an individual’s capacity to solve novel problems that require logic and abstract reasoning. In the UK Biobank, the fluid intelligence test (UKBB code “20016-2.0”) comprises thirteen logic and reasoning questions that were administered via the touchscreen, with two minutes to record a response for each question. Each correct answer is scored as one point, yielding a maximum of 13 points in total1. Neuroticism (UKBB code “20127-0.0”) was measured using a shortened version of the revised Eysenck Personality Questionnaire (EPQ-N) comprising 12 items (Eysenck et al., 1985). Neuroticism was assessed during the UK Biobank’s baseline visit. The summary score ranges from 0 to 12 and assesses the dispositional tendency to experience negative emotions2.
Sociodemographic data
In this work, we refer to non-imaging variables broadly as sociodemographics, excluding the candidate targets fluid intelligence and neuroticism. To approximate latent constructs from sociodemographics, we included 86 non-imaging inputs (Table S3), a collection of variables reflecting each participant’s demographic and social factors: sex, age, date and month of birth, body mass index, ethnicity, early-life exposures –e.g., breastfeeding, maternal smoking around birth, having been adopted as a child–, education, lifestyle-related variables –e.g., occupation, household income, number of people living in the household, smoking habits–, and mental-health variables. All these data were self-reported. Based on our conceptual understanding of the variables, we assigned the 86 variables to one of five groups: 1) mood & sentiment, 2) primary demographics such as age and sex, 3) lifestyle, 4) education, 5) early life. We then investigated the intercorrelations between all 86 variables to ensure that the proposed grouping is compatible with their empirical correlation structure (Figure S1).
The sociodemographic groups had varying amounts of missing data. For example, missingness often concerned participants’ lifestyle habits such as smoking and mental-health issues (Fry et al., 2017). To handle this missingness through imputation (Little and Rubin, 1986), we used column-wise replacement of missing information with the median value calculated from the known part of each variable. We additionally included an indicator for the presence of imputed values for down-stream analysis. Such imputation is well suited to predictive models (Josse et al., 2019).
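As a concrete illustration, median imputation combined with missingness indicators can be sketched with scikit-learn’s SimpleImputer; the toy matrix below is hypothetical and merely stands in for the sociodemographic inputs:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy sociodemographic matrix; np.nan marks missing entries.
X = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [7.0, 8.0, 9.0]])

# Column-wise median imputation; add_indicator=True appends one binary
# column per feature with missing values, flagging imputed entries.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_imputed = imputer.fit_transform(X)

# Original 3 columns plus 2 indicator columns (two columns had missing data).
print(X_imputed.shape)  # (3, 5)
```

The indicator columns let down-stream models exploit missingness itself as a signal, which matters when data are not missing at random.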
Image processing to derive phenotypes for machine learning
MRI data preprocessing was carried out by the UKBB imaging team. The full technical details are described elsewhere (Alfaro-Almagro et al., 2018; Miller et al., 2016). Below, we briefly describe the custom processing steps that we applied on top of the already preprocessed inputs.
Structural MRI
Analysis of T1-weighted brain images is concerned with the morphometry of gray-matter areas, i.e., the quantification of the size and volume of brain structures and tissue types and of their variations under neuropathologies or behavior (Lerch et al., 2017). For example, volume changes in gray-matter areas over the lifespan are associated with brain aging (Ritchie et al., 2015), general intelligence (Cox et al., 2019b) and brain disease (Thompson et al., 2007). Such volumes are calculated within pre-defined ROIs composed of cortical and subcortical structures (Desikan et al., 2006) and cerebellar regions (Diedrichsen et al., 2009). We included 157 sMRI features consisting of the volumes of the total brain and gray matter along with subcortical structures3. All these features were pre-extracted by the UKBB brain-imaging team (Miller et al., 2016) and are part of the data download. We concatenated all inputs alongside custom-built fMRI features for predictive analysis (feature union).
Diffusion weighted MRI
Diffusion MRI enables the identification of white-matter tracts along the principal diffusion directions of water molecules, as well as of the connections between different gray-matter areas (Behrens et al., 2003; Conturo et al., 1999). The study of these local anatomical connections through white matter is relevant to the understanding of neuropathologies and functional organization (Saygin et al., 2016). We included 432 dMRI skeleton features of FA (fractional anisotropy), MO (tensor mode), MD (mean diffusivity), ICVF (intra-cellular volume fraction), ISOVF (isotropic volume fraction) and OD (orientation dispersion index), modeled on white-matter structures defined by neuroanatomy4. For extensive technical details, please refer to de Groot et al. (2013). The skeleton features we included were from category 134 shipped by the UKBB brain-imaging team, and we used them without modification.
Functional MRI
Resting-state functional MR images capture low-frequency fluctuations in blood oxygenation that can reveal ongoing neuronal interactions in time, forming distinct brain networks (Biswal et al., 1995). Functional connectivity within these brain networks can be linked to clinical status (Greicius et al., 2004), to behavior (Miller et al., 2016), or to psychological traits (Dubois et al., 2018b). We also included resting-state connectivity features based on the time-series extracted from Independent Component Analysis (ICA) with 55 components representing various brain networks estimated on UKBB rfMRI data (Miller et al., 2016). These included the default mode network, extended default mode network and cingulo-opercular network, executive control and attention network, visual network, and sensorimotor network. We measured functional connectivity in terms of the between-network covariance. We estimated the covariance matrices using Ledoit-Wolf shrinkage (Ledoit and Wolf, 2004). To account for the fact that covariance matrices live on a particular manifold, i.e., a curved non-Euclidean space, we used the tangent-space embedding to transform the matrices into a Euclidean space (Sabbagh et al., 2019; Varoquaux et al., 2010) following recent recommendations (Dadi et al., 2019; Pervaiz et al., 2020). For predictive modeling, we then vectorized the covariance matrices to 1 485 features by taking the lower triangular part. These steps were performed with NiLearn (Abraham et al., 2014).
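The connectivity pipeline can be sketched on synthetic time-series as follows. Note that, for brevity, the tangent space here is anchored at the Euclidean mean of the covariances, a simplified stand-in for the geometric-mean reference used by the NiLearn implementation; the data and dimensions are toy values:

```python
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(0)

def spd_inv_sqrt(C):
    # Inverse matrix square root of a symmetric positive-definite matrix.
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def spd_log(C):
    # Matrix logarithm of a symmetric positive-definite matrix.
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(np.log(vals)) @ vecs.T

# Toy data: 3 "subjects", 200 time points over 5 network time-series each.
covs = []
for _ in range(3):
    ts = rng.standard_normal((200, 5))
    covs.append(LedoitWolf().fit(ts).covariance_)  # shrunk covariance

# Reference point for the tangent space (Euclidean mean for simplicity).
W = spd_inv_sqrt(np.mean(covs, axis=0))

# Whiten each covariance by the reference, take the matrix log, and keep
# the strictly lower triangle (55 networks -> 1 485 features in the paper).
X = np.array([spd_log(W @ C @ W)[np.tril_indices(5, k=-1)] for C in covs])
print(X.shape)  # (3, 10)
```

With 55 ICA networks, the strictly lower triangle yields 55 × 54 / 2 = 1 485 features per subject, matching the feature count reported above.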
Comparing predictive models to approximate target measures
Imaging-based models
First, we focused on purely imaging-based models based on exhaustive combinations of the three types of MRI modalities (see Table 3 for an overview). This allowed us to study potential overlap and complementarity between the MRI-modalities. Preliminary analyses revealed that combining all MRI data gave reasonable results with no evident disadvantage over particular combinations of MRI modalities (Figure 2 – Figure supplement 1), hence, for simplicity, we only focused on the full MRI model in subsequent analyses.
Sociodemographic models
We composed predictive models based on non-exhaustive combinations of different types of sociodemographic variables. To investigate the relative importance of each class of sociodemographic inputs, we performed systematic model comparisons. We were particularly interested in studying the relative contributions of early-life factors as compared to factors related to more recent life events such as education as well as factors related to current circumstances such as mood & sentiment and life-style. The resulting models based on distinct groups of predictors are listed in Table 4 (for additional details see Table S3 and Figure S1).
Combined imaging and sociodemographic models
In the next step, we were interested in how brain-related information would interact within each of these sociodemographic models. For example, information such as the age of an individual, or the level of education, may add important contextual information to brain images. We therefore considered an alternative variant for each of the models in Table 4 that included all MRI-related features (2 074 additional features), as described in section Image processing to derive phenotypes for machine learning.
Predictive model
Linear models are recommended as a default choice in neuroimaging research (Dadi et al., 2019; Poldrack et al., 2020), especially when datasets include fewer than 1 000 data points. In this study, we approximated targets generated by distinct underlying mechanisms, based on multiple classes of heterogeneous input data with several thousands of data points. We hence chose the non-parametric random forest algorithm, which can be readily applied to data of different units for non-linear regression and classification (Breiman, 2001), with mean squared error as the impurity criterion. To limit computation time, we fixed the number of trees to 250, a hyper-parameter that is usually not tuned but set to a generous number, as performance plateaus beyond a certain number of trees (Hastie et al., 2005, ch. 15). Preliminary analyses suggested that additional trees would not have led to substantial improvements in performance. We used nested cross-validation (5-fold grid search) to tune the depth of the trees as well as the number of variables considered for splitting (see Table 5 for a full list of hyper-parameters considered).
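A minimal scikit-learn sketch of this setup on synthetic data; the grid values are illustrative placeholders (the actual grid is in Table 5), and the tree count is reduced from the paper’s 250 to keep the example fast:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the heterogeneous inputs and a continuous target.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Inner 5-fold grid search tunes tree depth and the number of candidate
# split variables; the number of trees stays fixed (250 in the paper).
grid = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid={"max_depth": [5, None], "max_features": ["sqrt", 1.0]},
    cv=5,
)

# Outer 5-fold loop yields a nested cross-validation performance estimate.
scores = cross_val_score(grid, X, y, cv=5, scoring="r2")
print(scores.mean())
```

Nesting the grid search inside the outer loop keeps hyper-parameter tuning from leaking into the performance estimate.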
Classification analysis
We also performed classification analyses on the continuous targets. For this purpose, we discretized the targets into extreme groups based on the 33rd and 66th percentiles (see Table 6 for the number of classification samples per group). We were particularly interested in understanding whether model performance would increase when moving toward classifying extreme groups. For this analysis, we considered all three types of models: full MRI (2 074 features; see section Imaging-based models), all sociodemographic variables (86 variables; see section Sociodemographic models), and the combination of full MRI and all sociodemographics (2 160 variables in total; see section Combined imaging and sociodemographic models). When predicting age, we excluded the age & sex sociodemographic block from the sociodemographic variables, which yielded a total of 81 variables. To assess classification performance, we used the area under the curve (AUC) of the receiver operating characteristic (ROC) as the evaluation metric (Poldrack et al., 2020).
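A sketch of the extreme-group construction and AUC evaluation on synthetic data; the target and predictions below are simulated, not outputs of any of the actual models:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.normal(size=1000)            # continuous target, e.g. fluid intelligence
scores = y + rng.normal(size=1000)   # noisy model predictions

# Keep only the extreme groups: below the 33rd or above the 66th percentile.
lo, hi = np.percentile(y, [33, 66])
mask = (y <= lo) | (y >= hi)
labels = (y[mask] >= hi).astype(int)  # 1 = high group, 0 = low group

# AUC of the ROC curve for separating the two extreme groups.
auc = roc_auc_score(labels, scores[mask])
print(round(auc, 2))
```

Discarding the middle tertile removes the hardest, most ambiguous cases, which is why extreme-group classification typically scores higher than regression on the full range.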
Model development and generalization testing
Before any empirical work, we generated two random partitions of the data, one validation dataset for model construction and one held-out generalization dataset for studying out-of-sample associations using classical statistical analyses.
For cross-validation, we then subdivided the validation set into 100 training and testing splits following the Monte Carlo resampling scheme (also referred to as shuffle-split), with 10% of the data used for testing. To compare model performances based on paired tests, we used the same splits across all models. Split-wise testing performance was extracted and carried forward for informal inference using violin plots (Figure 2, Figure 3). For generalization testing, predictions on the held-out data were generated from all 100 models from each cross-validation split.
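The resampling scheme can be sketched with scikit-learn’s ShuffleSplit; the data below are placeholders, and fixing the random seed is what reproduces identical splits across models for paired comparison:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(50).reshape(-1, 1)  # stands in for the validation set

# 100 Monte Carlo ("shuffle-split") training/testing splits, 10% for testing.
cv = ShuffleSplit(n_splits=100, test_size=0.1, random_state=42)
splits = list(cv.split(X))
print(len(splits), len(splits[0][1]))  # 100 splits, 5 test samples each

# The same random_state reproduces identical splits for every model,
# enabling paired, split-wise performance comparisons.
again = list(ShuffleSplit(n_splits=100, test_size=0.1, random_state=42).split(X))
assert all(np.array_equal(a[1], b[1]) for a, b in zip(splits, again))
```

Unlike k-fold, shuffle-split test sets may overlap across iterations, which is what later produces multiple predictions per subject for CV-bagging.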
On the held-out set, unique subject-wise predictions were obtained by averaging across folds, including occasional duplicate predictions that arise because Monte Carlo sampling can produce multiple predictions per subject5. Such a strategy is known as CV-bagging (Varoquaux et al., 2017b) and can improve both the performance and the stability of results6. The resulting averages were reported as point estimates in Figure 2, Figure 3, and Figure 2 – Figure supplement 1 and used as proxy measures in the analysis of health-related behaviors (Figure 4).
Statistical analysis
Resampling statistics for model comparisons on the held-out data
To assess the statistical significance of the observed model performance and the differences in performance between the models, we computed resampling statistics of the performance metrics on the held-out generalization data not used for model construction (Gemein et al., 2020). Once unique subject-wise predictions were obtained on the held-out generalization data by averaging the predictions emanating from each fold of the validation set (cv-bagging), we computed null- and bootstrap-distributions of the observed test statistic on the held-out data, i.e., R2 score for regression and AUC score for classification.
Baseline comparisons
To obtain a p-value for baseline comparisons (could the prediction performance of a given model be explained by chance?) on the held-out data, we permuted the targets 10 000 times and recomputed the test statistic in each iteration. P-values were then defined as the probability of the test statistic under the null distribution being larger than the observed test statistic. To compute uncertainty intervals, we used the bootstrap, recomputing the test statistic after resampling 10 000 times with replacement and reporting the 2.5 and 97.5 percentiles of the resulting distribution.
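A minimal sketch of both resampling procedures on synthetic held-out predictions, using the R2 score as the test statistic; the data are simulated, not the paper’s:

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
y = rng.normal(size=200)                       # held-out targets
pred = y + rng.normal(scale=0.8, size=200)     # held-out predictions
obs = r2_score(y, pred)

# Null distribution: permute targets 10 000 times, recompute the statistic.
null = np.array([r2_score(rng.permutation(y), pred) for _ in range(10_000)])
p_value = np.mean(null >= obs)

# Uncertainty: percentile bootstrap, resampling (target, prediction) pairs.
idx = rng.integers(0, len(y), size=(10_000, len(y)))
boot = np.array([r2_score(y[i], pred[i]) for i in idx])
ci = np.percentile(boot, [2.5, 97.5])
print(p_value, ci)
```

Permuting the targets breaks any real association while preserving their marginal distribution, so the null distribution reflects chance-level performance only.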
Pairwise comparisons between models
For model comparisons, we considered the out-of-sample difference in R2 or AUC between any two models. To obtain a p-value for model comparisons (could the difference in prediction performance between two given models be explained by chance?) on the held-out data, we permuted the scores predicted by model A and model B for every single prediction 10 000 times and recomputed the test statistic in each iteration. We omitted all cases for which predictions from only one of the models under comparison were present. P-values were then defined as the probability of the absolute value of the test statistic under the null distribution being larger than the absolute value of the observed test statistic. The absolute value was considered to account for differences in both directions. Uncertainty intervals were obtained by computing the 2.5 and 97.5 percentiles of the bootstrap distribution based on 10 000 iterations. Here, predictions from model A and model B were resampled using identical resampling indices to ensure a meaningful paired difference.
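The paired permutation scheme can be sketched as follows on synthetic predictions from two hypothetical models; randomly swapping the two models’ predictions per observation simulates the null hypothesis that their performance is exchangeable:

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
y = rng.normal(size=300)
pred_a = y + rng.normal(scale=0.5, size=300)   # hypothetical model A (better)
pred_b = y + rng.normal(scale=1.0, size=300)   # hypothetical model B

obs = r2_score(y, pred_a) - r2_score(y, pred_b)

# Null: swap A/B predictions per observation, recompute the difference.
null = np.empty(10_000)
for k in range(10_000):
    swap = rng.random(300) < 0.5
    a = np.where(swap, pred_b, pred_a)
    b = np.where(swap, pred_a, pred_b)
    null[k] = r2_score(y, a) - r2_score(y, b)

# Two-sided p-value on the absolute difference.
p_value = np.mean(np.abs(null) >= np.abs(obs))
print(p_value)
```

Because each observation’s pair of predictions is swapped together, the test respects the paired structure of the comparison rather than treating the two models’ scores as independent samples.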
Out-of-sample association between proxy measures and health-related habits
Computation of brain age delta and de-confounding
For the association with health-contributing habits (Table 7), we computed the brain age delta as the difference between predicted age and actual age:

Δ_age = age_predicted − age. (Equation 1)

As age prediction is rarely perfect, the residuals will still contain age-related variance, which commonly leads to brain age bias when relating the brain age to an outcome of interest, e.g., sleep duration (Le et al., 2018). To mitigate leakage of age-related information into the statistical models, we employed a de-confounding procedure in line with Smith et al. (2019b) and Engemann et al. (2020, eqs. 6-8), consisting in residualizing a measure of interest (e.g., sleep duration) with regard to age through multiple regression with quadratic terms for age. To minimize computation on the held-out data, we first trained a model relating the score of interest to age on the validation set, to then derive a de-confounding predictor for the held-out generalization data. The resulting de-confounding procedure for variables in the held-out data amounts to computing an age-residualized predictor measure_resid from the measure of interest (e.g., sleep duration) by applying the following quadratic fit on the validation data:

measure = β_0^val + β_1^val · age + β_2^val · age² + ε. (Equation 2)

The de-confounding predictor was then obtained by evaluating the weights β_1^val and β_2^val obtained from Equation 2 on the generalization data:

measure_resid = measure − (β_0^val + β_1^val · age + β_2^val · age²). (Equation 3)

We performed this procedure for all target measures, to study associations not driven by the effect of age.
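The de-confounding step can be sketched as follows; the age trend, coefficients and sample sizes are synthetic, and an ordinary least-squares fit stands in for the multiple regression:

```python
import numpy as np

rng = np.random.default_rng(0)

def quad_design(age):
    # Design matrix with intercept, linear and quadratic age terms.
    return np.column_stack([np.ones_like(age), age, age ** 2])

# Validation data: a measure (e.g. sleep duration) with a synthetic age trend.
age_val = rng.uniform(40, 70, size=500)
measure_val = (7 - 0.02 * age_val + 0.0005 * age_val ** 2
               + rng.normal(scale=0.5, size=500))

# Fit the quadratic age model on the validation set only.
beta, *_ = np.linalg.lstsq(quad_design(age_val), measure_val, rcond=None)

# Held-out generalization data: residualize using the validation weights.
age_gen = rng.uniform(40, 70, size=500)
measure_gen = (7 - 0.02 * age_gen + 0.0005 * age_gen ** 2
               + rng.normal(scale=0.5, size=500))
measure_resid = measure_gen - quad_design(age_gen) @ beta

# The residualized measure carries essentially no linear age trend.
print(abs(np.corrcoef(age_gen, measure_resid)[0, 1]))
```

Fitting the weights on the validation set and only evaluating them on the held-out data keeps the de-confounding itself free of leakage from the generalization set.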
Health-related habits regression
We then investigated the joint association between the proxy measures of interest and health-related habits (Table 7) using multiple linear regression. For simplicity, we combined all brain-imaging and all sociodemographic variables (Figure 2, Figure 2 – Figure supplement 1, Figure 2 – Figure supplement 2). The ensuing model regresses the age-residualized outcome jointly on the three proxy measures,

outcome_resid = β_0 + β_1 · Δ_age + β_2 · proxy_intelligence + β_3 · proxy_neuroticism + ε,

where outcome_resid is the de-confounded habit measure obtained from the procedure above. Prior to model fitting, rows with missing inputs were omitted. For comparability, we then applied standard scaling to all outcomes and all predictors.
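A sketch of such a habit regression on simulated proxy measures; the predictor names and effect sizes are hypothetical, and NumPy least squares stands in for the R model fit:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Hypothetical proxies: brain age delta, predicted intelligence,
# predicted neuroticism; plus an age-residualized habit outcome.
proxies = rng.normal(size=(n, 3))
habit_resid = proxies @ np.array([0.3, 0.2, -0.1]) + rng.normal(size=n)

# Standard-scale outcome and predictors, then fit multiple linear regression.
Z = (proxies - proxies.mean(0)) / proxies.std(0)
y = (habit_resid - habit_resid.mean()) / habit_resid.std()
X = np.column_stack([np.ones(n), Z])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coefs[1:].round(2))
```

Standard scaling puts the coefficients on a common scale, so their magnitudes are directly comparable across proxies and across habit outcomes.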
The parametric bootstrap was a natural choice for uncertainty estimation, as we used standard multiple linear regression, which provides a well-defined procedure for mathematically quantifying its implied probabilistic model. Computation was carried out using the sim function from the arm package, as described in Gelman and Hill (2006, Ch. 7, pp. 142-143). This procedure can be intuitively regarded as yielding draws from the posterior distribution of the multiple linear regression model under the assumption of a uniform prior. For consistency with previous analyses, we computed 10 000 draws.
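A rough Python analogue of this parametric bootstrap on simulated data; unlike arm’s sim, which additionally draws the residual variance from its posterior, this sketch conditions on the point estimate for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

# Ordinary least squares and its implied probabilistic model.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)

# Parametric bootstrap: draw coefficient vectors from N(beta_hat, cov_beta).
draws = rng.multivariate_normal(beta_hat, cov_beta, size=10_000)
ci = np.percentile(draws[:, 1], [2.5, 97.5])
print(ci.round(2))
```

Percentiles of the draws then serve as uncertainty intervals for each coefficient, mirroring the posterior-draw interpretation described above.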
Software
Preprocessing and model building were carried out using Python 3.7. The NiLearn library was used for processing MRI inputs (Abraham et al., 2014). We used the scikit-learn library for machine learning (Pedregosa et al., 2011). For statistical modeling and visualization we used the R-language (R Core Team, 2019) (version 3.5.3) and its ecosystem: data.table for high-performance manipulation of tabular data, ggplot (Clarke and Sherrill-Mix, 2017; Wickham, 2016) for visualization and the arm package for parametric bootstrapping (Gelman and Su, 2020). All data analysis code is shared on GitHub: https://github.com/KamalakerDadi/proxy_measures_2020.
Supporting Information
Acknowledgments
We would like to thank Dr. Stefania de Vito and Dr. Benjamin de Haas for the critical review and helpful discussion of previous versions of the manuscript. We would like to thank Dr. Julien Dubois and Prof. Ralph Adolphs for helpful discussions in the course of this research project.
Appendix 1: Additional results
Appendix 2: Sociodemographic variables
Footnotes
↵1 A complete overview of the 13 individual fluid intelligence items is available in this manual: https://biobank.ctsu.ox.ac.uk/crystal/crystal/docs/Fluidintelligence.pdf
↵2 A complete list of the neuroticism questionnaire items is available in this manual: https://biobank.ctsu.ox.ac.uk/crystal/crystal/docs/MentalStatesDerivation.pdf
↵3 Regional grey matter volumes http://biobank.ctsu.ox.ac.uk/crystal/label.cgi?id=1101 Subcortical volumes http://biobank.ctsu.ox.ac.uk/crystal/label.cgi?id=1102
↵4 Diffusion-MRI skeleton measurements http://biobank.ctsu.ox.ac.uk/crystal/label.cgi?id=134
↵5 We ensured prior to computation that with 100 CV-splits, predictions were available for all subjects.
↵6 The use of CV-bagging can explain why, in Figures 2, 3, and 2 – Figure supplement 1, the performance was sometimes slightly better on the held-out set than in the cross-validation on the validation set.