Abstract
Obesity is highly heritable, yet only a small fraction of its heritability has been attributed to specific genetic variants. Missing heritability is particularly pronounced for childhood obesity. Here we studied 226 children for whom we typed almost one million single-nucleotide polymorphisms (SNPs), and collected weight and length or height at eight time points between birth and the age of three years. Leveraging longitudinal weight gain trajectory information and novel functional data analysis (FDA) techniques, we constructed a polygenic risk score (PRS) comprised of 24 SNPs. This PRS explains 56% of the variability in weight gain trajectories among the studied children. Moreover, it is significantly higher in children with (vs. without) rapid infant weight gain—a predictor of obesity later in life. We validated the constructed PRS in populations of adolescents and adults—suggesting that some genetic variants predispose to obesity at both childhood and later life stages. In contrast, PRSs from genome-wide association studies (GWAS) of adult obesity were not predictive of weight gain in our cohort of children, and did not share SNPs with our PRS. Our research provides a strong example of a successful application of FDA to a GWAS. We demonstrate that a sophisticated characterization of a longitudinal phenotype can provide increased statistical power to studies with smaller sample sizes. This has the potential of shifting the existing paradigm in GWAS.
Introduction
Obesity is a rising epidemic, and one that is increasingly affecting children. In 2018, 18% of children in the United States were obese and approximately 6% were severely obese1—a substantial increase from previous years2. Given the strong association between weight gain during childhood and obesity across the lifecourse3, the search for early life risk factors has become a research and public health priority.
Obesity is a complex disease with an etiology influenced by environmental, behavioral, and genetic factors, which likely interact with each other4. For childhood obesity, dietary composition and sedentary lifestyle have often been cited as main contributors5. Evidence also exists for a significant role of parents’ socioeconomic status6 and maternal prenatal health factors including gestational diabetes7 and smoking8. Obesity risk in children has also been associated with appetite9 which has been shown to be partially influenced by genetics10.
The heritability of obesity has been estimated to be between 50% and 90% (with the highest values reported for monozygotic twins and the lowest for non-twin siblings and parent-child pairs, reviewed in 11). This is a much higher percentage than that accounted for by the genetic variants found so far12,13. Therefore, obesity suffers from “missing heritability”—a broad discrepancy between the estimated heritability of the phenotype and the variability explained by genetic variants discovered to date. Indeed, the search for specific genetic variants that increase the risk of obesity, in adulthood as well as in childhood, is still ongoing. Using whole-genome sequencing, researchers have found variants in individual genes that contribute to severe, early-onset obesity14. Moreover, genome-wide association studies (GWAS) have identified single nucleotide polymorphisms (SNPs) that are significantly associated with obesity phenotypes such as increased body mass index (BMI), high waist-to-hip ratio, etc.15–21. Albeit successful, these studies have some shortcomings; the individual contributions of the identified SNPs tend to be very small12, and the prevalent focus is still on adult cohorts—with only one childhood obesity study for every 10 adult obesity studies22.
One way to utilize the information gained from GWAS is to summarize the risk from multiple disease-causing alleles in polygenic risk scores (PRSs) that can be computed for each individual23. These scores are either simple counts (unweighted) or weighted sums of disease-causing alleles identified by GWAS. Notably, while several studies have constructed PRSs for childhood obesity24–27, most have done so relying on SNPs identified by GWAS on adult BMI. Since SNPs affecting obesity risk in adults and children may differ28,29,30, this may explain the limited12, 31 and age-dependent32 explanatory power of such scores for children’s weight gain status.
In this study, we attempted to bridge this gap by focusing specifically on SNPs affecting obesity risk in children and by using novel, highly effective Functional Data Analysis (FDA) statistical methods developed by our group. Based on data from a deeply characterized pediatric cohort33–35, we constructed children’s growth curves and treated them as a longitudinal phenotype. FDA fully leverages this longitudinal information, extracting complex signals that can be lost in standard analyses of cross-sectional or summary measurements. This increases power and specificity for assessing potentially complex and combinatorial genetic contributions. Moreover, FDA models genetic effects on the entire growth curve non-parametrically. This characterizes changes in effect size over time in a more flexible and effective manner than other statistical methods for longitudinal data. With our analyses, we identified genetic variants significantly associated with children’s growth curves and combined them in a novel PRS that is strongly predictive of growth patterns and rapid infant weight gain36,37, which is associated with obesity later in life. We also investigated how environmental and behavioral covariates compound with our novel score in affecting growth curves, and provided biological and statistical validations of our findings.
Results
Participants and DNA typing
Our study utilized 226 first-born children (out of a total of 279) enrolled in the Intervention Nurses Start Infants Growing on Healthy Trajectories (INSIGHT) study33. For these children, weight and length were measured at birth, 4 weeks, 16 weeks, 28 weeks, 40 weeks, and one year, and weight and height were measured at two and three years. Using the ratios of weight for length or height (henceforth referred to as weight-for-length/height) at these eight time points, we constructed growth curves for all children (Fig. 1a; see Methods). We used the weight-for-length/height ratio because it is the measurement recommended by the American Academy of Pediatrics for identifying children under the age of two years who are at risk for obesity (BMI is recommended afterwards)38. Six of the eight time points in our study fall into this category; therefore, for consistency, we utilized the weight-for-length/height ratio for all eight time points analyzed.
In addition to growth curves, we computed conditional weight gain for each child (change in weight between birth and 6 months, correcting for length, see Methods). Conditional weight gain was shown to be an effective indicator of risk for developing obesity later in life in a previous study39. Also in our study, children who experienced rapid infant weight gain, i.e. those with a positive conditional weight gain, had a significantly greater weight at one (p<2.2×10−16), two (p=9.1×10−14), and three (p=6.2×10−13) years of age than children who did not (one-tailed t-tests, Fig. S1).
We isolated genomic DNA from blood samples from the 226 children and genotyped it on the Affymetrix Precision Medicine Research Array containing 920,744 SNPs across the genome. SNPs that had missing information, a minor allele frequency below 0.05, or were in the mitochondrial DNA were removed from the dataset—leaving a total of 79,498 SNPs for subsequent analyses (Fig. S2).
FDA-based Polygenic Risk Score predicts growth curves and rapid infant weight gain
The sample size of our study (n=226) is small for a traditional GWAS. However, our FDA approach allowed us to leverage the longitudinal information in growth curves to identify significant SNPs and combine them into a polygenic risk score (PRS). More specifically, we used FDA screening40 to first reduce the analysis from 79,498 to 10,000 potentially relevant SNPs. Next, we used Functional Linear Adaptive Mixed Estimation (FLAME)41 to identify 24 SNPs as significant predictors of children’s growth curves (Table 1). Finally, we constructed our novel FDA PRS as a weighted sum of allele counts across the 24 selected SNPs, with weights determined with additional FDA techniques (see Methods). We found that FDA PRS is indeed a strong predictor for growth curves with a significant positive effect on weight-for-length/height ratios across time (R2=0.56, p<1×10−15, function-on-scalar regression, see Methods), and especially between ~10 and ~30 months of age (Fig. 2a). This can also be observed noting that growth curves of children with high PRS values are concentrated above the mean curve (Fig. 1b). Moreover, FDA PRS is significantly larger for children with rapid infant weight gain compared to those without (one-tailed t-test, p=4.2×10−10; Fig. 2b), and is positively correlated with conditional weight gain (R2=0.19, p<1×10−05; Fig. 2c) as well as with weight-for-length/height ratio at one (R2=0.50, p<1×10−5), two (R2=0.53, p<1×10−5), and three (R2=0.46, p<1×10−5) years of age (Fig. S3).
These results are in sharp contrast with those we obtained for our cohort using a PRS based on adult obesity SNPs from another study. For each child, we calculated Belsky PRS—a weighted PRS based on 29 SNPs identified through adult obesity GWAS as described by Belsky and colleagues26. This PRS was used because it was correlated with BMI outcomes from age three to 38, so we hypothesized that it would be a good predictor of weight outcomes across the lifecourse. However, Belsky PRS is not a significant predictor of our children’s growth curves from birth through age three (R2=0.0032, p=0.35, function-on-scalar regression, Fig. 2d). Furthermore, Belsky PRS is not significantly larger for children with rapid infant weight gain compared to those without (one-tailed t-test, p=0.22; Fig. 2e) and does not display significant correlations with conditional weight gain (R2=0.0009, p=0.66; Fig. 2f) and weight-for-length/height ratio at one (R2=0.0064, p=0.25), two (R2=0.0036, p=0.37), and three (R2=0.0009, p=0.71) years of age (Fig. S4). Additionally, we calculated three other previously published childhood obesity PRSs for our cohort—Elks PRS25, den Hoed PRS24, and Li PRS27. Similar to the Belsky PRS, these scores were not correlated with conditional weight gain (Fig. S5a-c) and there was not a significant difference in PRS values between children with vs. without rapid infant weight gain (Fig. S5d-f).
The 24 SNPs included in the FDA PRS (Table 1) do not appear in prior PRSs for either childhood or adult obesity, and, interestingly, are not located in genes commonly associated with obesity (e.g., FTO32,42 and MC4R15,42). However, nine of the 24 SNPs in our FDA PRS can be linked directly or indirectly to obesity-related traits. In particular, using the NHGRI-EBI GWAS catalog (https://www.ebi.ac.uk/gwas/), we found that some of the SNPs are located in genes associated with BMI (rs4915535, rs10227226, rs471670), cholesterol levels (rs12039940, rs9837708, rs17626544), Type 2 diabetes (rs638348), and hypertension (rs1539759). Some SNPs we discovered are located in the vicinity of obesity-related genes. In addition to being located in ZNF648, a gene important for determining HDL cholesterol levels, rs12039940 is located downstream of CACNA1E, a gene associated with BMI change over time43. Another instance is rs72679478, a SNP with a high weight in the FDA PRS (Table 1). It is located within DNAJC6, a gene associated with Parkinson’s disease, but it is also just upstream of the leptin receptor gene (LEPR) which has been associated with early-onset adult obesity44. Potential relationships between the remaining 15 SNPs and obesity should be investigated in future studies.
Contributions of environmental and behavioral covariates
Children’s weight gain patterns can be affected by a variety of environmental and behavioral factors, which add to genetic effects. To evaluate their potential effects on our results, we considered a functional regression (see Methods) of the growth curves on FDA PRS plus 11 potential confounding covariates, namely: maternal pre-pregnancy BMI, paternal BMI, child’s birthweight, maternal gestational weight gain, maternal gestational diabetes, maternal smoking during pregnancy, mode of delivery, the child’s sex, mother-reported child’s appetite score, INSIGHT intervention group, and family socioeconomic status (Table 2). FLAME41 applied to this regression identified FDA PRS, birthweight and appetite as significant predictors. However, the variability explained by these three predictors (R2=0.57; Table S1) is very similar to that explained by the FDA PRS alone (R2=0.56). Thus, genetic effects captured by our FDA PRS remain significant, and in fact strongly dominant, even when accounting for the environmental and behavioral covariates at our disposal.
To confirm these results we also considered the regression of conditional weight gain39 on the same 12 predictors as above. Best subset selection applied to this regression identified FDA PRS (p=8.65×10−09) and appetite (p=2.80×10−05), but not birthweight, as significant positive predictors. The variability explained (R2) produced by these two predictors is 0.24 (Table S1), only five percentage points higher than the one produced by FDA PRS alone (R2=0.19). Thus, again, the majority of the explanatory power remains attributable to the FDA PRS. Group LASSO45 applied to this regression identified FDA PRS and birthweight, but not appetite, as relevant predictors. It also selected maternal pre-pregnancy BMI and paternal BMI, but led to a lower R2 of 0.21 (Table S1).
Notably, and not unexpectedly given its lack of association with children’s growth patterns, when we reran the analyses presented above using the Belsky PRS (instead of the FDA PRS), we did not identify it as a significant predictor. For instance, best subset selection for the regression of conditional weight gain on the Belsky PRS plus the 11 environmental and behavioral covariates at our disposal retained only appetite as a positive and significant predictor (p=5.53×10−05); all other predictors, including the Belsky PRS itself, were eliminated.
Validation of the FDA-based Polygenic Risk Score
Biological validation of the FDA PRS
The analyses presented above assess the predictive power of our FDA PRS “in-sample”—that is, on the same data on which we selected SNPs and estimated the scoring weights. To validate the FDA PRS, we considered two independent datasets from dbGaP. It is important to note that we could not identify publicly available data from an independent cohort that matches our study design (i.e. with genome-wide SNP data and longitudinal weight and length or height measurements for children under the age of three). We thus used dbGaP data from older individuals, fully aware of the fact that these are not ideal for our purposes.
Remarkably, we were able to validate FDA PRS in two independent cohorts consisting of much older individuals (adolescents and adults) than our study population of children up to three years of age. The first dataset consists of 525 adolescents between the ages of 12 and 15 from the Philadelphia Neurodevelopmental Cohort (dbGaP study phs000607.v3.p2)46–48. Individuals are classified based on BMI-for-age percentiles as underweight (<5th percentile), normal (5th to <85th percentile), overweight (85th to <95th percentile), and obese (≥95th percentile). The distributions of our FDA PRS in these classes shift towards larger values as BMI increases from underweight to obese (Fig. 3a, upper and lower panels). While this does not translate into significant differences between all pairs of classes, the FDA PRS of obese adolescents is significantly higher than that of underweight adolescents (p=0.012, one-tailed t-test).
The second dataset consists of 3,486 adults (≥18 years of age) from the eMERGE study (dbGaP study phs000888.v1.p1) who are classified as extremely obese (BMI ≥ 40 kg/m2) or non-obese (20 kg/m2 ≤ BMI < 30 kg/m2). Extremely obese individuals have significantly higher FDA PRS than non-obese individuals (one-tailed t-test, p=3.2×10−3, Fig. 3b). Thus, FDA PRS based on children’s weight gain patterns is predictive of extreme obesity later in life.
Finally, as an additional form of biological validation of our FDA PRS constructed using weight-for-length/height ratio growth curves, we considered growth curves constructed using BMI. As mentioned above, weight-for-length/height ratio is recommended for children under two years of age by the American Academy of Pediatrics38. However, our cohort is also observed at ages two and three, when BMI is recommended as the most meaningful measurement38. Thus, we also considered growth curves constructed using BMI measurements at all eight time points of the INSIGHT study. Notably, our weight-for-length/height FDA PRS is also a strong predictor of the BMI growth curves (R2=0.43, p<1×10−15, function-on-scalar regression), suggesting a reasonable consistency between the information conveyed by the two measurements, at least up to this age.
Statistical validation of the selected SNPs
We also assessed the robustness of our FDA-based SNP selection with a sub-sampling scheme akin to a 20-fold cross-validation on our original dataset. Specifically, we randomly split the data (i.e. the participants) into 20 parts of approximately equal size and applied FLAME41 to perform SNP selection 20 times, using a different 19/20 of the data each time. We next counted how many times (out of 20) each SNP was selected. Notably, for the 24 SNPs included in our FDA PRS, the weights computed to construct the PRS correlate with the number of times the SNPs are selected in this sub-sampling scheme (Fig. 4). The frequency of selection captures, in a way, how stable the effect of a genetic variant is amid the complex and combinatorial signals in this type of data. Moreover, SNPs that have both the largest weights and the highest selection frequency may be the most important to interpret and validate in future studies.
Discussion
A novel, highly predictive FDA-based PRS for childhood obesity
In this study, we used FDA techniques to construct a novel polygenic risk score which includes 24 SNPs selected based on children’s longitudinal weight gain patterns. Among our study participants, this score explains approximately 56% of the variability in growth curves from birth to the age of three years, and approximately 19% of the variability in conditional weight gain. Moreover, our score validates on two independent datasets comprising adolescent and adult individuals, and our SNP selection shows statistical robustness.
While the 24 SNPs identified by our study do not appear in prior polygenic risk scores for either childhood or adult obesity, some are located in genes linked to obesity-related phenotypes in previous GWAS. Among the others, three SNPs are within genes previously associated with child development (puberty timing) and four within genes linked to periodontitis. Connections between obesity and puberty timing49, as well as between obesity and periodontitis in adults50, have been suggested, yet the functional mechanisms remain unclear. As with all GWAS-type studies, it is important to note that some of the identified SNPs may not be truly “causal”, but may be in linkage disequilibrium with causal SNPs, and the genes in the immediate vicinity of such SNPs may not be those through which the phenotype is influenced (e.g., rs72679478 located upstream of the leptin receptor gene). We have also identified several SNPs with no prior associations. Some of these SNPs have high statistical robustness and high weight in the PRS, and thus need to be investigated in future functional experiments.
The power of FDA-based GWAS
Our results demonstrate a key advantage of FDA-based GWAS over traditional GWAS. Our study was set up as an ultra-high dimensional problem—with many more predictors (i.e. SNPs) than observations (i.e. individuals). By integrating FDA techniques into every step of the analysis, from the screening and selection of SNPs through the construction of the polygenic risk score, we were able to utilize a more dynamic and information-rich phenotype than the ones used in traditional cross-sectional analyses. In turn, this allowed us to unveil subtler, more complex effects with limited information. This is a valuable contribution as it expands the scope of GWAS to studies that do not comprise tens of thousands of individuals—but instead a few hundred deeply characterized participants51.
Genetics of childhood and adult obesity
Previous studies supported a relationship between polygenic scores including adult BMI SNPs and childhood weight gain status24–27. However, this relationship was generally weak—and weaker the younger the age of the children13,24,26. In fact, Belsky and colleagues26 themselves found no relationship between their PRS (i.e. Belsky PRS) and BMI at birth (R2=0.00, p>0.9) and a very weak relationship at three years of age (R2=0.0064, p<0.01). This is confirmed by our inability to detect a relationship between the Belsky PRS and growth curves or conditional weight gain measurements in our cohort. In contrast, our FDA-based PRS comprising SNPs identified from childhood growth curves was able to distinguish extreme obesity-related phenotypes in adolescents and adults from two independent validation cohorts. Thus, while SNPs with strong effects on adult obesity are minor or insignificant contributors to weight gain in childhood, the SNPs with strong effects on such gain in childhood do predict obesity later in life. This is consistent with the notion that early life weight gain, and hence its genetic underpinning, predispose to obesity across the lifecourse3.
Other contributing factors and perspectives
Behavioral and environmental factors are important variables to consider when investigating the etiology of complex diseases. In our study we considered 11 such factors that could influence child weight gain trajectories and found that, while the FDA PRS is by far the dominant predictor, an appetite score computed on our cohort (see Methods) and birthweight have significant effects. We also found some evidence for an effect of parental BMI measurements. It has been shown that a child’s appetite behavior impacts early weight gain and may have a strong genetic basis31,10. In agreement with this, a recent study found a positive relationship between a childhood obesity PRS and appetite52. In our study, the child’s appetite behavior was reported by his/her mother, which could have introduced some biases. Because appetite is emerging as an interesting predictor of child weight gain status, it should be explored in more detail in future studies. Birthweight has also been associated with the genetic risk for obesity, although the strength of the association between weight and genetics seems to increase with age25,27,53. Finally, parental BMI has been associated with children being overweight or obese54–56, which could be explained by shared environment, shared genetics or their interaction.
In addition to the types of environmental and behavioral factors considered in our study, other factors may add to and interact with genetics in shaping obesity risks. These include the microbiome, the metabolome, and the epigenome. We found previously that children’s oral microbiota composition is associated with growth curves57. Moreover, we are collecting data on the metabolomes and epigenomes of the children in our study cohort. Our overarching goal is to develop a multi-omic model to comprehensively understand the development of childhood obesity and identify a combination of risk factors that can be used for accurate identification of children who would benefit most from early life intervention programs.
Our FDA-based polygenic risk score was computed considering the longitudinal change in weight-for-length/height ratio from birth through three years of age. An ongoing follow-up of our study participants, with weight and height collected at later time points, will allow us to further evaluate the predictive power of the FDA PRS as age progresses. Finally, we note that our children cohort (Table 2), as well as the cohorts of adolescents and adults used for validation, consisted predominantly of individuals of European ancestry. It will be of great interest to conduct similar analyses on individuals of non-European ancestries, and identify differences and commonalities in the genetic factors contributing to obesity risks among different ethnicities.
Methods
Study sample, growth curves, and conditional weight gain
We collected genetic information from 226 children recruited from the 279 families involved in the INSIGHT study33. These children are full-term singletons born to primiparous mothers in Central Pennsylvania. The INSIGHT study is a randomized, responsive-parenting behavioral intervention aimed at the primary prevention of childhood obesity against a home safety control. INSIGHT collected clinical, anthropometric, demographic, and behavioral variables on the children between birth and the age of three years (Table 2). In this study we utilized 11 of these variables including maternal pre-pregnancy BMI, paternal BMI, maternal pregnancy health variables (gestational weight gain, gestational diabetes, and smoking during pregnancy), family income (as a proxy for socioeconomic status), mode of delivery, child’s sex, child’s birth weight, INSIGHT intervention group (intervention or control), and mother-reported child’s appetite at 44 weeks. The appetite score is an ordinal variable on a scale from 1-5 which summarizes the Child Eating Behavior Questionnaire (CEBQ)58. Domains on the CEBQ include food responsiveness, emotional over-eating, food enjoyment, desire to drink, satiety responsiveness, slowness in eating, emotional under-eating, and food fussiness. Length was measured using a recumbent length board (Shorr Productions) for visits before two years (birth, 3-4 weeks, 16 weeks, 28 weeks, 40 weeks, and one year). Standing height was measured with a stadiometer (Seca 216) at two and three years.
To construct growth curves, we used the anthropometric data described above to calculate the weight-for-length/height ratio at each time point. We used FDA to analyze these measurements longitudinally as individual functions through the fdapace package in R. This package implements the Principal Analysis by Conditional Estimation (PACE) algorithm59, which pools information across subjects for more accurate curve construction. We used the default settings and represented the resulting curves in Fig. 1a using 51 cubic spline functions with evenly spaced knots.
Conditional weight gain z-scores were calculated as the standardized residuals from a regression of age- and sex-specific weight-for-age z-score at 6-months on the weight-for-age z-score at birth (determined using the World Health Organization sex-specific child growth standards)34. Length-for-age z-score at 6-months, length at birth, and precise age at the 28-week visit were considered as cofactors in this regression and thus only the change in weight between birth and 6-months was captured34,36. These scores are approximately normally distributed and have, by construction, a mean of 0 and a standard deviation of 1. Positive conditional weight gain z-scores correspond to a greater than average weight gain and are used to define rapid infant weight gain, which is a risk factor for developing obesity later in life37,60,61.
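The construction of standardized residuals described above can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions: it regresses the 6-month weight-for-age z-score on the birth z-score only (the length and age cofactors used in the study are omitted), it scales by the population standard deviation, and all input values and names (`waz_birth`, `waz_6mo`, `standardized_residuals`) are hypothetical.

```python
from statistics import mean, pstdev


def standardized_residuals(y, x):
    """Residuals of a simple least-squares fit of y on x, scaled to SD 1."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sd = pstdev(residuals)  # population SD (an assumption of this sketch)
    return [r / sd for r in residuals]


# Hypothetical weight-for-age z-scores at birth and at 6 months
waz_birth = [-0.5, 0.1, 0.8, -1.2, 0.3, 1.1]
waz_6mo = [0.2, -0.3, 1.5, -1.0, 0.9, 0.7]

# Conditional weight gain z-scores; positive values flag rapid infant weight gain
cwg = standardized_residuals(waz_6mo, waz_birth)
rapid_gain = [z > 0 for z in cwg]
```

Because the fit includes an intercept, the residuals average to zero, and dividing by their standard deviation gives the unit scale mentioned in the text.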
Genotyping
Blood from a fingerstick was collected at the child’s one-year clinical research visit. Genomic DNA was isolated (Qiagen DNeasy Blood and Tissue Kit) and genotyped on the Affymetrix Precision Medicine Research Array (PMRA). Initial quality filtering was performed using the following criteria: we removed SNPs with minor allele frequency <0.05 and/or present in less than 5% of individuals. All quality filtering steps were performed in PLINK (v1.9)62,63, with 79,498 SNPs remaining after quality filtering.
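As a rough illustration of this kind of quality filter (the study itself used PLINK), the following Python sketch drops SNPs by minor allele frequency and by missingness. The 0/1/2 genotype coding, the use of `None` for missing calls, the call-rate threshold of 1.0, and the SNP names are assumptions made for the example.

```python
def minor_allele_frequency(genotypes):
    """MAF from 0/1/2 alternate-allele counts; None marks a missing call."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def call_rate(genotypes):
    """Fraction of individuals with a non-missing genotype call."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def passes_qc(genotypes, min_maf=0.05, min_call_rate=1.0):
    """Keep a SNP only if it is common enough and sufficiently complete."""
    return (minor_allele_frequency(genotypes) >= min_maf
            and call_rate(genotypes) >= min_call_rate)


# Hypothetical genotypes for eight individuals
snps = {
    "snp_common": [0, 1, 2, 1, 0, 1, 1, 0],
    "snp_rare": [0, 0, 0, 0, 0, 0, 0, 0],
    "snp_missing": [0, 1, None, 1, 2, 1, 0, 0],
}
kept = [name for name, g in snps.items() if passes_qc(g)]
```

Only `snp_common` survives here: `snp_rare` is monomorphic and `snp_missing` has an incomplete call.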
To obtain missing genotype calls and genotypes not included on the PMRA, we performed imputation. Individuals’ genotypes were first phased leveraging pedigree information (genotypes were also collected for the mother and father in most cases, and for some younger siblings) using SHAPEIT264,65. The phased haplotypes were then used for imputation using the 1000 Genomes Project phase 3 data66 as a reference panel in IMPUTE267. SNPs with imputation probability <90% were removed. Following imputation, we had 12,479,343 SNPs.
Functional Data Analysis techniques
First, we used an FDA feature screening method40, which is an effective and fast procedure to filter out SNPs that are clearly unimportant, yielding a substantially smaller subset of SNPs that can then be used in a more advanced joint model68. This method is specifically designed for longitudinal GWAS and can handle up to millions of SNPs. The method evaluates each SNP individually fitting a simplified model comprising only that SNP (with no other SNPs involved) and calculating a weighted mean squared error. This is then used to rank the SNPs. In our study, the top 10,000 SNPs were selected with this feature screening step. See Supplementary Note S1 for additional details about FDA feature selection.
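The marginal screening idea can be sketched in Python as follows. This is not the published screening method: it assumes each growth curve is observed on a common grid of time points, fits a separate simple regression at each time point, and averages the squared errors with uniform weights, whereas the actual weighting scheme is described in the cited work. All data and names are hypothetical.

```python
from statistics import mean


def pointwise_mse(curves, counts):
    """Average squared error of per-time-point regressions on one SNP."""
    c_bar = mean(counts)
    sxx = sum((c - c_bar) ** 2 for c in counts)
    errors = []
    for t in range(len(curves[0])):
        y = [curve[t] for curve in curves]
        y_bar = mean(y)
        sxy = sum((c - c_bar) * (yi - y_bar) for c, yi in zip(counts, y))
        slope = 0.0 if sxx == 0 else sxy / sxx
        errors.append(mean((yi - y_bar - slope * (c - c_bar)) ** 2
                           for c, yi in zip(counts, y)))
    return mean(errors)  # uniform weights over time points (assumption)


def screen(curves, snp_counts, keep):
    """Rank SNPs by marginal fit and keep the best `keep` of them."""
    ranked = sorted(snp_counts,
                    key=lambda name: pointwise_mse(curves, snp_counts[name]))
    return ranked[:keep]


# Hypothetical curves (four children, three time points) and allele counts
curves = [[1.0, 2.0, 3.0], [1.5, 2.9, 4.1], [2.0, 4.0, 5.0], [1.1, 2.1, 3.2]]
snp_counts = {
    "snp_a": [0, 1, 2, 0],
    "snp_b": [1, 1, 0, 1],
    "snp_c": [0, 2, 0, 2],
}
top = screen(curves, snp_counts, keep=2)
```

Each SNP is scored on its own, so the cost grows linearly with the number of SNPs, which is what makes this step practical before a joint model is fitted.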
After feature screening, we used FLAME (Functional Linear Adaptive Mixed Estimation)41, a method that simultaneously selects important predictors and produces smooth estimates for function-on-scalar linear models. This method further downselects from the pool of the top 10,000 SNPs as ranked within our screening step. In addition, it provides smooth estimates of the effects of the selected SNPs on the growth curves. To tune the penalty involved in FLAME, we split our observations into training (75%) and test (25%) sets. This procedure resulted in 24 SNPs and their corresponding estimated effect curves. FLAME was also used to assess the statistical robustness of SNP selection in a 20-fold sub-sampling scheme (selection was repeated 20 times, each time on 19/20 of the data); for this exercise we fixed the penalty level, to be consistent across folds. See Supplementary Note S2 for additional details about FLAME.
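The 20-fold sub-sampling scheme can be sketched in Python as below. The selector is passed in as a function, and `toy_select` is a purely hypothetical stand-in for FLAME (which is not reimplemented here); the fold construction and the seed are likewise assumptions of this sketch.

```python
import random
from collections import Counter


def selection_frequency(n_subjects, select, n_folds=20, seed=0):
    """Count how often each SNP is selected when one fold is left out."""
    indices = list(range(n_subjects))
    random.Random(seed).shuffle(indices)
    # Interleaved slices give folds whose sizes differ by at most one
    folds = [indices[i::n_folds] for i in range(n_folds)]
    counts = Counter()
    for held_out in folds:
        excluded = set(held_out)
        kept = [i for i in indices if i not in excluded]
        counts.update(select(kept))  # one count per selected SNP
    return counts


def toy_select(kept_subjects):
    """Hypothetical selector: one SNP depends on a single subject."""
    selected = {"snp_stable"}
    if 0 in kept_subjects:
        selected.add("snp_fragile")
    return selected


frequency = selection_frequency(226, toy_select)
```

In this toy setting `snp_stable` is selected in all 20 runs, while `snp_fragile` is lost in the one run where subject 0 is held out, which mirrors how selection frequency reflects the stability of a variant's effect.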
Next, we used the estimated effect curves produced by FLAME for each of the 24 selected SNPs to construct our FDA-based polygenic risk score. This was done by choosing SNP-specific weights that maximize the covariance between weighted SNP counts and growth curves fitted through the FLAME41 estimates, thus incorporating both the dynamic nature of the SNP effects and linkage disequilibrium between the SNPs themselves. We applied the weights to the allele counts of each child, and computed his/her FDA PRS as the weighted sum of counts across the selected SNPs. Thus, FDA allows us to exploit the longitudinal structure of our data to not only screen and select SNPs, but also weight them using estimates of how their effects change over time.
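The final scoring step, a weighted sum of allele counts, can be sketched in Python as follows. The SNP names and weights are hypothetical placeholders, not the values in Table 1, and the covariance-maximizing derivation of the weights is not shown.

```python
def polygenic_risk_score(allele_counts, weights):
    """Weighted sum of allele counts over the SNPs that carry a weight."""
    return sum(weights[snp] * allele_counts[snp] for snp in weights)


# Hypothetical weights and one child's allele counts (0, 1, or 2 per SNP)
weights = {"snp_1": 0.5, "snp_2": -0.25, "snp_3": 1.0}
child_counts = {"snp_1": 2, "snp_2": 1, "snp_3": 0}

score = polygenic_risk_score(child_counts, weights)  # 0.5*2 - 0.25*1 + 1.0*0
```

An unweighted score of the kind mentioned in the Introduction corresponds to setting every weight to 1.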
We assessed the association between growth curves and the FDA PRS by fitting function-on-scalar linear models69. The significance of the FDA PRS was determined based on three tests70 employing different types of weighted quadratic forms. One employs a simple L2 norm of the parameter estimate (L2), another uses principal components to reduce dimension prior to a Wald-type test (PCA), and the last blends the two through the addition of a weighting scheme in the PCA (Choi). We reported the most conservative of the three p-values.
Polygenic Risk Scores constructed by other studies
To calculate Belsky PRS26, Elks PRS25, den Hoed PRS24, and Li PRS27, we employed the allelic scoring function in PLINK (v1.9)62,63. In some cases proxy alleles had to be used in place of SNPs that were not assessed on the PMRA. Such proxies were determined using linkage disequilibrium with LDlink72. Tables describing the composition of each PRS can be found in the Supplementary Materials (Tables S2-S5).
Validation datasets
We used two validation datasets downloaded from dbGaP. The first dataset was obtained from Neurodevelopmental Genomics: Trajectories of Complex Phenotypes (dbGaP dataset accession number phs000607.v3.p2; refs. 46–48). We considered 525 adolescents between the ages of 12 and 15 years who self-reported as being of European descent. Using their height and weight measurements, we calculated BMI and then categorized it according to the Centers for Disease Control and Prevention BMI-for-age (and gender) recommendations. The second dataset was obtained from the eMERGE Network Imputed for 41 Phenotypes (dbGaP dataset accession number phs000888.v1.p1, variable number phv00225989.v1.p1). We considered 3,486 adults with either a case or control diagnosis of extreme obesity. For both datasets, the FDA PRS was calculated using the score function in PLINK v1.9 (refs. 62, 63). Proxies for SNPs were determined using LDlink72 and are summarized in Table S6.
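The BMI calculation and categorization for the adolescent dataset can be sketched as below. The percentile cutoffs follow the standard CDC BMI-for-age categories (below the 5th, 5th to below the 85th, 85th to below the 95th, 95th and above). The lookup of a percentile from age, sex, and BMI requires the CDC growth charts and is assumed to happen elsewhere; the function names are hypothetical.

```python
def bmi(weight_kg, height_m):
    """Body mass index in kg/m^2."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2

def cdc_bmi_category(percentile):
    """Map a BMI-for-age-and-sex percentile (0-100) to a CDC category.

    The percentile itself is assumed to come from the CDC growth charts.
    """
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must lie between 0 and 100")
    if percentile < 5:
        return "underweight"
    if percentile < 85:
        return "healthy weight"
    if percentile < 95:
        return "overweight"
    return "obesity"

example_bmi = bmi(50.0, 1.60)          # 50 kg at 1.60 m
example_category = cdc_bmi_category(96)
```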
Analysis of environmental and behavioral covariates
Using the Bayesian Information Criterion option of the bestglm function in R73, we applied best subset selection to the regression of conditional weight gain scores34 on 11 potentially confounding covariates (described in the Results section). We did not consider interaction terms for this analysis, but we included (separately) Belsky PRS or FDA PRS as a 12th predictor in the regression. Once the best subset of predictors was selected, we fitted a linear model on it using the lm function in the R stats package.
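For readers unfamiliar with best subset selection under BIC, the following sketch enumerates every subset of predictors and keeps the one with the lowest BIC. It is a plain NumPy illustration of the idea, not the bestglm implementation; the toy data and function names are assumptions. With 12 predictors the same loop would visit 4,096 models.

```python
from itertools import combinations
import numpy as np

def bic_linear(X, y, cols):
    """BIC (up to an additive constant) of a Gaussian linear model
    with an intercept and the predictors listed in `cols`."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    return n * np.log(rss / n) + design.shape[1] * np.log(n)

def best_subset_bic(X, y):
    """Exhaustively search all predictor subsets; return (columns, BIC)."""
    p = X.shape[1]
    best_cols, best_score = (), bic_linear(X, y, ())  # intercept-only model
    for size in range(1, p + 1):
        for cols in combinations(range(p), size):
            score = bic_linear(X, y, cols)
            if score < best_score:
                best_cols, best_score = cols, score
    return best_cols, best_score

# Toy data: 6 candidate covariates, only columns 0 and 2 affect the outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 2.0 * X[:, 0] - 3.0 * X[:, 2] + rng.normal(size=200)
selected, score = best_subset_bic(X, y)
```

Once a subset is chosen, refitting an ordinary least-squares model on those columns mirrors the final lm step described above.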
We also applied LASSO and Group LASSO procedures to the same regressions for conditional weight gain, using the glmnet74 and gglasso75 packages, respectively. We again considered the 11 potentially confounding covariates along with a PRS, and all possible two-way interactions. The LASSO methods were tuned via 10-fold cross-validation, choosing the largest penalty parameter whose cross-validation error was within 1 standard error of the overall minimum. This is considered the more parsimonious approach, usually favoring a sparser final model71.
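The one-standard-error rule used for tuning can be written in a few lines. The sketch assumes that the cross-validation means and standard errors have already been computed for a grid of penalties; the numbers below are invented for illustration, and the function name is hypothetical.

```python
import numpy as np

def one_standard_error_rule(lambdas, cv_mean, cv_se):
    """Largest penalty whose mean CV error is within one standard error
    of the minimum mean CV error."""
    lambdas = np.asarray(lambdas, dtype=float)
    cv_mean = np.asarray(cv_mean, dtype=float)
    cv_se = np.asarray(cv_se, dtype=float)
    i_min = int(np.argmin(cv_mean))
    threshold = cv_mean[i_min] + cv_se[i_min]
    eligible = lambdas[cv_mean <= threshold]
    return float(eligible.max())

# Invented cross-validation summary over five penalty values.
lambdas = [0.01, 0.05, 0.10, 0.50, 1.00]
cv_mean = [1.20, 1.00, 1.05, 1.08, 1.50]
cv_se = [0.10, 0.10, 0.10, 0.10, 0.10]
chosen = one_standard_error_rule(lambdas, cv_mean, cv_se)
```

Here the minimum error occurs at a penalty of 0.05, but the rule selects 0.50, the strongest penalty whose error is still within one standard error of that minimum, which generally yields a sparser model.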
Data Availability
Phenotypic and genetic data are/will be available under dbGaP study number phs001498.v2.p1. Code for carrying out the statistical methods (screening, applying FLAME, PRS construction and evaluation) can be found at https://github.com/makovalab-psu/InsightPRSConstruction.
Ethics Statement
This project has been approved by Penn State IRB (PRAMS 34493).
Funding
This project was supported by grants R01DK88244 and R01DK099354 from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. Funding was also provided by the Penn State Institute of CyberScience, the Penn State Eberly College of Science, and the Huck Institutes of the Life Sciences at Penn State. Additionally, this project was funded, in part, under a grant with the Pennsylvania Department of Health using Tobacco Settlement and CURE funds. The Department specifically disclaims responsibility for any analyses, interpretations, or conclusions. Additional funding was provided by NSF DMS 1712826. AK was supported by the NIH 5T32LM012415-03 predoctoral training grant.
Author contributions
SJCC, KDM, IMP, LLB, FC, and MR conceived the project and devised the project study design. AK, SJCC, JL, MR, FC, and KDM were involved in the data analysis. SJCC, AK, KDM, FC, and MR contributed to the writing of the manuscript with comments from co-authors. MP, LLB, JS, and MM provided resources such as access to the study population and the associated data.
Competing Interests
The authors declare no competing interests.
Acknowledgements
We are grateful to the INSIGHT study participants and nurses for their participation in this project. We would also like to thank B. Higgins, C. Reimer, R. Bruhans, A. Shelly, P. Carper, J. Beiler, J. Stokes, N. Verdiglione, and L. Hess for their assistance.
The Philadelphia Neurodevelopment Cohort: Support for the collection of the data for the Philadelphia Neurodevelopment Cohort (PNC) was provided by grant RC2MH089983 awarded to Raquel Gur and grant RC2MH089924 awarded to Hakon Hakonarson. Subjects were recruited and genotyped through the Center for Applied Genomics (CAG) at The Children's Hospital in Philadelphia (CHOP). Phenotypic data collection occurred at the CAG/CHOP and at the Brain Behavior Laboratory, University of Pennsylvania.
eMERGE: The eMERGE Network was initiated and funded by NHGRI through the following grants: U01HG006828 (Cincinnati Children’s Hospital Medical Center/Boston Children’s Hospital); U01HG006830 (Children’s Hospital of Philadelphia); U01HG006389 (Essentia Institute of Rural Health, Marshfield Clinic Research Foundation and Pennsylvania State University); U01HG006382 (Geisinger Clinic); U01HG006375 (Group Health Cooperative); U01HG006379 (Mayo Clinic); U01HG006380 (Icahn School of Medicine at Mount Sinai); U01HG006388 (Northwestern University); U01HG006378 (Vanderbilt University Medical Center); and U01HG006385 (Vanderbilt University Medical Center serving as the Coordinating Center). Samples and data in this obesity study were provided by the non-alcoholic steatohepatitis (NASH) project. Funding for the NASH project was provided by a grant from the Clinic Research Fund of Geisinger Clinic. Funding support for the genotyping of the NASH cohort was provided by Geisinger Clinic operating funds and an award from the Clinic Research Fund. The datasets used for the analyses described in this manuscript were obtained from dbGaP at http://www.ncbi.nlm.nih.gov/gap through dbGaP accession number phs000380.v1.p1.