Abstract
The epigenetic architecture in humans is influenced by genetic factors, exposure histories and biological factors such as age, but little is known about their relative contribution or their longitudinal dynamics. Here, we studied DNA methylation levels at over 750,000 CpG sites in mononuclear blood cells collected at birth and age 7 from 196 children of primarily self-reported Black and Hispanic ethnicities to study age- and ancestry-related patterns in DNA methylation. We developed a novel Bayesian inference method for longitudinal data and showed that even though average methylation levels changed from birth to age 7, the vast majority of the ancestry-associated methylation patterns present at birth are also present at age 7. A large proportion of ancestry-associated CpGs (59%) had a nearby methylation quantitative trait locus (meQTL) and we show that at least 13% of the ancestry-associated methylation patterns were mediated through local genotype. These combined results indicate that ancestry-associated methylation patterns in blood are in large part genetically determined. Our results further suggest that DNA methylation patterns in blood cells are robust to many environmental exposures, at least during the first 7 years of life.
Introduction
Epigenetic patterning in human genomes reflects the contributions of genetic variation [1, 2], exposure histories [3-8], and biological factors, such as age [9-18], ethnicity [19-24] and disease status [25-28], among others. However, little work has been done to elucidate the relative contributions or longitudinal dynamics of each of these factors.
To directly examine the relationships between an epigenetic mark and age, ethnicity, genetic variation, early life exposures and allergic phenotypes, we studied global DNA methylation patterns at over 750,000 CpG sites on the EPIC array in cord blood mononuclear cells (CBMCs) collected at birth and in peripheral blood mononuclear cells (PBMCs) collected at 7 years of age from 196 children participating in the Urban Environment and Childhood Asthma (URECA) birth cohort study [29, 30]. This cohort is part of the NIAID-funded Inner-City Asthma Consortium and comprises children, primarily of Black and Hispanic self-reported ethnicity, who live in poor urban areas and have a mother and/or father with a history of at least one allergic disease (see Gern et al. [29] for details of enrollment criteria). Mothers of children in the URECA study were enrolled during pregnancy and children were followed from birth through at least 7 years of age.
The longitudinal design of the URECA study provided us with the resolution to partition genetic from non-genetic effects on ancestry-associated DNA methylation patterns, and yielded new insight into the factors affecting DNA methylation patterns at CpG sites in mononuclear (immune) cells during early life in ethnically admixed children. Using a novel statistical inference method that provides a general framework for analyzing longitudinal genetic and epigenetic data, we show that ancestry-dependent methylation patterns are conserved over the first 7 years of life and that these patterns are strongly influenced, and often mediated, by local genotype. Further, chronological age, but not measured exposures during pre- or post-natal periods or disease status by age 7, was associated with methylation patterns in this sample of children. Considering the results of our study and those of a recently published comprehensive review on environmental epigenetics research [31], we suggest that methylation levels in blood are not as responsive to environmental exposures as previously suggested [20], at least during the first 7 years of life.
Results
Our study included 196 child participants in the Urban Environment and Childhood Asthma (URECA) cohort who had stored cord blood mononuclear cells (CBMCs) and peripheral blood mononuclear cells (PBMCs) collected at birth and age 7, respectively [29], and passed quality control (QC) checks as described in Methods. The URECA children were classified by parent- or guardian-reported race into one of the following categories: Black, n = 147; Hispanic, n = 39; White, n = 1; Mixed race, n = 7; and Other, n = 2. A description of the study population is shown in Table 1 and in Supplementary Materials. Ancestry, assessed using ancestral PCs, revealed varying proportions of African and European ancestry along PC1. Because there is little separation along PC2 (Figure 1) and no genome-wide significant correlation between PC2 through PC10 and methylation levels at either age, we defined PC1 as inferred genetic ancestry (IGA). The reported races (RR) of the children are also shown in Figure 1. The means and ranges of gestational age stratified by reported race are shown in Table 1; the distribution of gestational age at birth is shown in Figure S1 in the Supplement.
Inferred genetic ancestry effects on DNA methylation patterns are conserved in magnitude and direction between birth and age 7
Previous cross-sectional studies have revealed associations between ancestry and DNA methylation at birth [19, 23] and later in life [20-22, 24, 25]. These correlations were generally attributed to the combined effects of genetic variation and environmental exposures [20-23]. However, because of the cross-sectional nature of these studies, it is not known whether the associations between ancestry and methylation patterns present at birth persist (or change) in childhood. Moreover, because ancestry is typically confounded with environmental exposures [41], it has been proposed that ancestry effects on methylation levels may reflect the effects of exposure histories, which also may vary by race or ethnicity [20]. Alternatively, ancestry effects on DNA methylation patterns could be due to genetic differences. In this case, we would expect ancestry-associated methylation patterns to be conserved from birth to later childhood. Using the longitudinal data from the URECA cohort, we tested this hypothesis by addressing three questions. What is the correlation between ancestry and methylation levels at individual CpG sites at birth and age 7? Are the direction and magnitude of the correlation between ancestry and methylation levels conserved between birth and age 7? Are there any CpGs for which the correlation between methylation and ancestry at birth is significantly different from the correlation between methylation and ancestry at age 7?
Standard hypothesis testing can be used to answer the first question but is not appropriate for answering the second or third because failure to reject the null hypothesis that the effects are equal at birth and age 7 does not imply the null hypothesis is true. Additionally, because our studies were conducted in cord blood cells at birth and peripheral blood cells at age 7, effects at birth and age 7 may differ slightly due to differences in cell composition [42]. To circumvent these issues, we built a Bayesian model (see model (S3) in the Supplement) and let the data determine both the strength of the correlation between inferred genetic ancestry or reported race and methylation, and how similar the correlations were at birth and age 7. We then answered the first, second and third questions by defining and estimating the correct (cor), conserved (con) and discordant (dis) sign rates for each CpG g = 1,…,784,484:
corg(0) = Posterior probability that the estimate for the direction of CpG g’s ancestry effect at birth was correct.
cong(0, 7) = Posterior probability that the directions of CpG g’s ancestry effects at birth and age 7 were the same AND the directions were estimated correctly.
disg(0) = Posterior probability that the ancestry effect for CpG g at birth was non-zero and was 0 or in the opposite direction at age 7 AND both directions were estimated correctly.
The correct and discordant sign rates at age 7 (corg(7), disg(7)) were defined analogously. Because the correct sign rate at birth and age 7 is always at least as large as the conserved sign rate, we say that the birth and age 7 effects for CpG g overlap if its conserved sign rate (cong(0,7)) is above a designated threshold. Supplemental Figures S2 and S3 provide insight into how the correct and conserved sign rates compare with standard univariate P values. We refer the reader to “Joint modeling of methylation at birth and seven” in the Supplement for a detailed description of our model and estimation procedure.
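To make the three definitions concrete, the following is a minimal sketch, assuming that posterior draws of the pair (effect at birth, effect at age 7) are available for a single CpG. It is not the estimation procedure described in the Supplement; the function name `sign_rates` and the use of Monte Carlo draws are illustrative assumptions, and the estimated direction is taken to be the most probable sign.

```python
def sign(x):
    """Return -1, 0 or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def sign_rates(draws):
    """Approximate sign rates for one CpG from posterior draws.

    draws: list of (effect_at_birth, effect_at_age_7) pairs.
    Assumption: the estimated direction is the sign with the largest
    posterior mass, so each rate is a maximum over the two directions.
    """
    n = len(draws)
    signs = [(sign(b0), sign(b7)) for b0, b7 in draws]

    def frac(pred):
        # Fraction of draws satisfying the predicate.
        return sum(1 for s in signs if pred(s)) / n

    # Correct sign rates: mass on the most probable direction at each age.
    cor0 = max(frac(lambda s: s[0] == d) for d in (1, -1))
    cor7 = max(frac(lambda s: s[1] == d) for d in (1, -1))
    # Conserved sign rate: both effects share the same non-zero direction.
    con = max(frac(lambda s: s == (d, d)) for d in (1, -1))
    # Discordant sign rate at birth: non-zero at birth, zero or opposite at 7.
    dis0 = max(frac(lambda s: s[0] == d and s[1] != d) for d in (1, -1))
    return {"cor0": cor0, "cor7": cor7, "con": con, "dis0": dis0}


example = [(0.5, 0.4), (0.6, 0.5), (0.4, -0.1), (-0.1, 0.2)]
rates = sign_rates(example)
```

Under this reading, the conserved sign rate can never exceed either correct sign rate, which is the property used above to define overlapping effects.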
After fitting the relevant parameters in the model to the data, we were able to estimate the fraction of CpGs with non-zero effects at both ages that fell into one of four possible bins: the two effects were completely unrelated (ρ = 0), moderately similar (ρ = 1/3), very similar (ρ = 2/3), or identical (ρ = 1). Note that if a non-trivial fraction of CpG sites had effects that were negatively related, they would be assigned to the first bin (ρ = 0). In fact, less than 1% of the CpGs with non-zero inferred genetic ancestry effects at both ages had unrelated or moderately similar IGA effects (ρ = 0 or 1/3), whereas 34% fell in the very similar bin and 66% had identical inferred genetic ancestry effects at birth and age 7 (Supplemental Figure S4). This indicates that when inferred genetic ancestry effects on methylation are non-zero at both birth and age 7, they tend to be very similar or exactly the same with respect to both direction and magnitude.
We then estimated the correct and conserved sign rates for all 784,484 probes, and identified 2,873 inferred genetic ancestry-associated CpGs (IGA-CpGs) in CBMCs at birth (corg(0) ≥ 0.95), 3,834 in PBMCs at age 7 (corg(7) ≥ 0.95), and 2,659 whose effects were conserved in sign (cong(0,7) ≥ 0.95). Methylation tended to increase with increasing African ancestry at 1,494 of the 2,659 conserved CpGs (56%), suggesting that individuals with more African ancestry tend to have more methylation (P < 10^-10). This is consistent with the study of Moen et al. [22], which used the Illumina 450K array to quantify the differences in methylation between European and African populations. Supplemental Figure S5 shows the IGA-CpG locations in the genome and Figure 2a illustrates the overlap between IGA-CpGs at birth and age 7. This strong overlap corroborates our above observations and answers the second question in the affirmative: if methylation is strongly correlated with inferred genetic ancestry at birth, the magnitude and direction of the correlation are conserved at age 7.
Inferred genetic ancestry is more correlated with methylation than is self-reported race
The observed correlations between ancestry and methylation levels may reflect differences in environmental exposures [20, 22], due to associations between race or ethnicity and socio-cultural, nutritional, and geographic exposures, among others [41]. In fact, Galanter et al. [20] showed in a cross-sectional study that self-reported ethnicity explained a substantial portion of the variability in whole blood DNA methylation patterns from Latino children of diverse ethnicities, even more so than inferred genetic ancestry. They concluded that ethnicity captures genetic, as well as socio-cultural and environmental, differences that influence methylation levels. If this were the case in the URECA children, substituting reported race for inferred genetic ancestry in the analyses presented in the previous section should yield effects at least as large as those we observed for inferred genetic ancestry. However, using reported race, we identified only 457 CpGs at birth and 709 CpGs at age 7 whose correct sign rate was at least 0.95, and 424 whose reported race effects were conserved from birth to age 7 (cong(0,7) ≥ 0.95), far fewer than the 3,991 CpGs whose methylation was significantly correlated with inferred genetic ancestry at birth or at age 7.
To explore this further, we examined the overlap between IGA-CpGs and reported race-associated CpGs (RR-CpGs) in CBMCs at birth and in PBMCs at age 7 (Figures 2c-d). Because reported race is an estimate of inferred genetic ancestry, there is still a substantial overlap between IGA-CpGs and RR-CpGs. In fact, almost all of the RR-CpGs are among the IGA-CpGs, but the opposite is not true. This indicates that while IGA-CpGs include most RR-CpGs, reported race does not capture most of the variation in methylation attributable to inferred genetic ancestry in these children.
The observed correlations between DNA methylation and ancestry are primarily genetic
To further address the question of whether ancestry effects on methylation at either birth or age 7 were due to genetic variation or to environmental exposures, we used local genetic variation (within 5kb of a CpG site) and DNA methylation data at birth and age 7 in the 145 genotyped, self-reported Black children in our study (see Methods). Of the 519,622 CpGs within 5kb of a SNP, 65,068 and 70,898 had at least one meQTL in CBMCs at birth and in PBMCs at age 7, respectively, at an FDR of 5%. In addition, 59% of IGA-CpGs with at least one SNP in the ±5kb window had at least one meQTL at birth or age 7 at an FDR of 5%, indicating that IGA-CpGs were enriched for CpGs with meQTLs (Figure 3a-b).
To provide additional evidence that local genotype mediates the effect of inferred genetic ancestry on methylation, we used logistic regression to regress the genotype of each of the 269,622 SNPs in our study set onto inferred genetic ancestry. The goal was to determine the fraction of IGA-CpGs that were mediated through local genotype, i.e. IGA-CpGs with both edges a and c in Figure 3a. Our analysis first revealed that the genotypes at meQTLs whose target CpGs were IGA-CpGs in either CBMCs at birth or PBMCs at age 7 (IGA-meQTLs) were significantly more correlated with inferred genetic ancestry than the genotypes of non-IGA-meQTLs (Figure 3c). Moreover, approximately 13% of the IGA-CpGs with at least one SNP in their ±5kb windows had an inferred genetic ancestry effect that was mediated through local genotype (i.e. had edges a and c, see Supplement for calculation details), which is likely an underestimate of the true number of IGA-CpGs mediated through local genotype because our sample size was relatively small [43]. Nonetheless, this is striking compared to the 0.1% of non-IGA-CpGs whose corresponding SNP had edge c at a 20% FDR.
Lastly, we used DNA methylation data on 573 ethnically diverse U.S. Latino children ages 9 to 16 years old from the Galanter et al. study [20] to further explore the effect of ancestry on DNA methylation in whole blood. Children and teenagers in the Galanter study were classified as Mexican, Puerto Rican, mixed Latino, or other Latino. They reported 916 CpG sites whose methylation was significantly associated with reported race, 773 of which were also significantly associated with estimated percent European, Native American and African ancestry at a Bonferroni P value threshold of 1.6 × 10^-7. A total of 726 of their 916 ethnic-associated CpGs were also in the set of 784,484 CpG probes in our study. Our set of IGA-CpGs at birth or age 7 contained a significant fraction of the 726 ethnic-associated CpGs from the cross-sectional study, but there was considerably less overlap with our RR-CpGs (Figure 4). If the correlation we observed between ancestry and methylation was largely due to responses to environmental exposures, as suggested in the Galanter study, then the overlap with reported race should have been at least as large as the overlap with inferred genetic ancestry. That was not the case, further suggesting that the inferred genetic ancestry effects on methylation in the URECA cohort are primarily genetic in origin.
Non-genetic factors influence the observed ancestry-methylation correlation more at age 7 than at birth
Although most of the variation in methylation levels at ancestry-associated CpGs can be attributed to genetic variation in the URECA children, a small proportion may be due to non-genetic (environmental) factors. To explore this possibility, we further hypothesized that non-genetic effects on methylation levels at ancestry-associated CpGs would be greater at age 7 than at birth, due to accumulated exposures over the first 7 years of life. We note that none of the direct or indirect measures of exposures that were available in this cohort were associated with methylation levels at either age, including maternal asthma, maternal infections during pregnancy, pet ownership, bedroom allergens, maternal stress, anxiety and depression metrics, maternal cotinine levels during pregnancy, number of smokers in the household, number of siblings, number of previous live births, daycare attendance, number of colds at age 2 or 3, and allergic sensitization or asthma in the child (see Supplement for details). We did, however, identify 16,172 age-related CpGs (CpGs whose methylation changed from birth to age 7). These CpGs were strongly enriched for CpGs used to predict gestational age in Knight et al. [13] and chronological age in Horvath [10] (see Figure S7 in the Supplement). In addition, the estimated age effects showed the same direction of change as the corresponding estimated gestational age effects at birth in 97% of the 16,172 CpGs, which included 14,186 gestational age-associated effects that were not significant at the 5% FDR threshold. This concordance in direction of effect is unlikely to occur by chance (P < 10^-10, see Supplement for calculation), and indicates that the majority of the change in mean methylation levels from birth to age 7 was due to aging-related mechanisms rather than age-dependent environmental exposures.
To directly test the hypothesis that non-genetic factors tend to have larger effects on methylation levels at age 7 than at birth, we used our Bayesian model to estimate the proportion of CpGs in our study whose methylation was not associated with ancestry at birth but was associated at age 7, and the proportion that were associated with ancestry at birth but not at age 7. The former was greater than 14% while the latter was less than 1.5% using either inferred genetic ancestry or reported race as a measure of ancestry. Even though over 14% of all CpGs in our study had ancestry effects present at age 7 but not at birth, we were only able to identify 18 discordant IGA-CpGs and 4 discordant RR-CpGs at age 7 using a liberal threshold (disg(7) ≥ 0.80). That is, for almost all CpGs that are associated with ancestry at age 7 but not at birth, the expected ancestry effect sizes were quite small relative to the statistical error, making it impossible to assign the direction of effect on methylation changes with confidence (Figure S6). Therefore, while some CpG sites may be influenced by exposures that are correlated with ancestry at age 7 but not at birth, their effects were far too small to estimate with this sample size.
Discussion
The relationships between DNA methylation, chronological age, and ancestry have the potential to shed light on disease etiology and may help determine the relative genetic and environmental contributions to the observed inter-individual variability of the epigenome [9-15, 19-24]. While it has previously been shown that ancestry is related to DNA methylation in cross-sectional studies [19-24], and that statistically significant meQTLs are conserved as one ages [44], it has yet to be shown whether or not ancestry-dependent methylation marks are conserved as children age.
Even though there was substantial change in blood methylation levels over time among children in this cohort, inferred genetic ancestry and self-reported race effects on methylation were overwhelmingly conserved in both direction and magnitude from birth to age 7. This result is interesting in and of itself because it provides an example of perinatal epigenetic variation that persists later in life, and more generally an example of a persistent effect on DNA methylation levels, which has been cited as a critical area of future epigenetic research [31, 45]. The consistency of our estimates for the effect due to ancestry also demonstrates the fidelity of our processing step to account for unobserved factors like cell composition, since failure to account for latent covariates often leads to biased and irreproducible estimates [46, 47]. Furthermore, the novel statistical framework we used to infer effects that are conserved versus those that vary over time can be easily applied to other longitudinal DNA methylation data, as a way to avoid the spurious logic often used in applications of frequentist hypothesis testing that failing to reject the null hypothesis implies the null is true.
While the observation that inferred genetic ancestry and reported race effects are conserved from birth to age 7 gives credence to the hypothesis that the effects are genetic in nature, it does not rule out the possibility of environmental components or gene-environment interactions that could determine ancestry-related methylation prior to birth and persist as the child ages. To further explore this, we showed that the IGA-CpGs were enriched among CpGs with meQTLs, and that methylation levels at many of the IGA-CpGs are mediated by local genotype, indicating that much of the ancestry-methylation correlation could be attributed to genetic variation. Moreover, the RR-CpGs were only a small subset of IGA-CpGs in our study. This is opposite to the findings of Galanter et al. [20], who argued that ancestry-dependent methylation patterns in admixed populations are in large part determined by differences in exposure histories. Because their data were cross-sectional, they could not evaluate whether the observed patterns arose during childhood or were also present at birth. Our results provide evidence that genetics accounts for most of the correlation between methylation and ancestry, and imply that the genetic contribution to variability in blood methylation is substantial.
Our observations in support of strong genetically determined, and weak environmentally determined, ancestry-associated methylation patterns in blood may seem at odds with the plethora of studies showing that DNA methylation levels in blood cells are associated with environmental exposures, such as cadmium, arsenic and smoking, to name a few [5-8, 20, 48-52]. Whereas the estimated genetic effect sizes in our study are substantially larger than many of the environmentally-associated effects on methylation patterns previously reported, the effects of environmental exposures on methylation in blood are probably too small to estimate with even moderate to large sample sizes [31]. For example, it was only by performing a meta-analysis in 6,685 individuals that Joubert et al. [5] were able to identify 6,000 CpGs whose DNA methylation levels in blood from infants and adolescents were associated with maternal smoking exposure. In one sense, we were able to corroborate previous observations of small non-genetic effects on methylation in blood by showing that while methylation patterns at an estimated 14% of all CpGs in our study were not correlated with ancestry at birth but correlated with ancestry at age 7, the correlation at individual CpGs at age 7 was too small to be identified as statistically significant. We were also not able to find any statistically significant correlations between methylation at birth or at age 7 and any of the environmental exposure variables measured in this cohort. We note that cord blood cotinine levels, a measure of in utero tobacco smoke exposure, were above the level of detection in only 34 of the 196 mothers in our study.
An unsurprising feature of these longitudinal data is that average methylation levels of over 16,000 CpGs changed significantly from birth to age 7. However, what was quite remarkable was that the direction of the change in 97% of those CpGs matched the direction of their corresponding estimated gestational age effect at birth, which included over 14,000 gestational age-associated effects that were not genome-wide significant. Not only does this fit with the above narrative and suggest that methylation levels of the vast majority of the 16,172 age-related CpGs were in fact changing due to age-related mechanisms and not because of differences in environmental exposures at birth and age 7, it also indicates that the “epigenetic clock” present at birth may be the same as that present later in life. While we do not have the data to explore this further, this remains an important avenue of future research.
The results of our study suggest that DNA methylation levels in blood cells are fairly robust to environmental exposures, including those that are correlated with self-reported race. A better understanding of tissue-specific methylation responses to environmental exposures could inform the design of future studies and provide insights into the mechanisms through which exposures and gene-environment interactions influence health and disease.
Materials and methods
Sample composition
URECA is a birth cohort study initiated in 2005 in Baltimore, Boston, New York City and St. Louis under the NIAID-funded Inner-City Asthma Consortium [29]. Pregnant women were recruited if either they or the father of their unborn child had a history of asthma, allergic rhinitis, or eczema; deliveries prior to 34 weeks gestation were excluded (see Gern et al. [29] for full entry criteria). Informed consent was obtained from the women at enrollment and from the parent or legal guardian of the infant after birth.
Maternal questionnaires were administered prenatally and child health questionnaires were administered to a parent or caregiver every 3 months through age 7 years. Gestational age at birth and obstetric history were obtained from medical records. Additional details on study design are described in Gern et al. [29] and in the Supplement. Frozen paired cord blood mononuclear cells (CBMCs) at birth and peripheral blood mononuclear cells (PBMCs) at age 7 were available for 196 of the 560 URECA children after completing other studies. After QC (see below), DNA methylation data were available for 194 children at birth, 195 children at age 7, and 193 children at both time points; genotype data were available in 193 children (194 at birth; 195 at age 7) (Supplementary Table 1).
DNA Methylation
DNA for methylation studies was extracted from thawed CBMCs and PBMCs using the Qiagen AllPrep kit (QIAGEN, Valencia, CA). Genome-wide DNA methylation was assessed using the Illumina Infinium MethylationEPIC BeadChip (Illumina, San Diego, CA) at the University of Chicago Functional Genomics Facility (UC-FGF). Birth and 7-year samples from the same child were assayed on the same chip and the data were processed using Minfi [32]; Infinium type I and type II probe bias were corrected using SWAN [33]. Raw probe values were corrected for color imbalance and background by control normalization. Three out of the 392 samples (two at birth and one at age 7) were removed as outliers following normalization. We removed 82,352 probes that mapped either to the sex chromosomes or to more than one location in a bisulfite-converted genome, had detection P values greater than 0.01% in 25% or more of the samples, or overlapped with known SNPs with minor allele frequency of at least 5% in African, American or European populations. After processing, 784,484 probes were retained. M-values, computed as log2 (methylated intensity + 100) - log2 (unmethylated intensity + 100), were used for all downstream analyses. The offset of 100 was recommended in Du et al. [34].
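The M-value formula above can be sketched in a few lines. This is an illustrative stand-alone function, not the Minfi-based pipeline used in the study; the function name `m_value` and the example intensities are hypothetical.

```python
import math


def m_value(methylated, unmethylated, offset=100.0):
    """M-value as defined in the text:
    log2(methylated + offset) - log2(unmethylated + offset).
    The default offset of 100 follows the value quoted above."""
    return math.log2(methylated + offset) - math.log2(unmethylated + offset)


# Equal intensities give M = 0; a mostly methylated probe gives M > 0.
balanced = m_value(500.0, 500.0)
mostly_methylated = m_value(900.0, 100.0)  # log2(1000) - log2(200) = log2(5)
```

The offset keeps the ratio stable when both intensities are close to zero, where a plain log ratio would be dominated by noise.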
Genotyping
DNA from the 196 URECA children was genotyped with the Illumina Infinium CoreExome+Custom array. Of the 532,992 autosomal SNPs on the array, 531,755 passed quality control (QC) (excluding SNPs with call rate <95%, Hardy-Weinberg P values <10^-5, and heterozygosity outliers). We conducted all analyses using the 293,696 autosomal SNPs with a minor allele frequency ≥5%. Genotypes for three children failed QC and were excluded from all subsequent analyses that involved genotypes, including methylation quantitative trait locus (meQTL) mapping and analyses that estimated inferred genetic ancestry or used genetic ancestry PC1 as a covariate (see below). These three children were included in all other analyses.
Estimating inferred genetic ancestry
Ancestral principal component analysis (PCA) was performed using a set of 801 ancestry informative markers (AIMs) from Tandon et al. [35] that were genotyped in both the URECA children and in HapMap [36] release 23. Because PC1 captured the majority of variation in genetic ancestry (Figure 1), we refer to PC1 as inferred genetic ancestry and consider it as a surrogate measure for percent African ancestry.
Statistical analysis
To determine the effect of gestational age on methylation in CBMCs, we used standard linear regression models with the child’s gender, sample collection site, inferred genetic ancestry and methylation plate number as covariates in our model. We also estimated cell composition and other unobserved confounding factors using a method described in McKennan et al. [37]. We then computed a gestational age P value for each CpG site and used q-values [38] to control the false discovery rate at a nominal level. We took the same approach to determine CpGs whose methylation changed from birth to age 7, except the response was measured as the difference in methylation at birth and age 7. In this analysis, we included the child’s gender, gestational age at birth, inferred genetic ancestry and sample collection site as covariates. Because all paired samples were on the same plate, we did not include plate number as a covariate in this analysis. We also estimated unobserved factors that influence differences in methylation at birth and age 7 using McKennan et al. [37] and included these latent factors in our linear model. See models (S1) and (S2) in the Supplement for more detail.
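As an illustration of the false discovery rate step, the sketch below computes Benjamini-Hochberg adjusted P values. This is a deliberate simplification and not the q-value method of [38]: the two coincide only when the estimated proportion of true nulls is set to 1, so the adjusted values here are conservative relative to q-values. The function name `bh_adjust` is hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in input order.

    Assumption: stands in for q-values with the null proportion fixed at 1.
    """
    m = len(pvalues)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.5]
adj = bh_adjust(pvals)
# CpGs with an adjusted value at or below 0.05 would be reported at a 5% FDR.
significant = [i for i, a in enumerate(adj) if a <= 0.05]
```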
We used data from the self-reported Hispanic and Black individuals with methylation measured at both time points to analyze the effect of ancestry (either inferred genetic ancestry or self-reported race) on methylation at birth and age 7 jointly using a Bayesian model. We did not include the 10 individuals of other reported races in this analysis because we did not want our estimates to be influenced by groups with small sample sizes. We included age (birth or age 7), sample collection site, gestational age at birth, gender and methylation plate number as covariates in our model, and estimated additional unobserved covariates (including cell composition) using a method specifically designed for correlated data [39]. Once we estimated the relevant hyper-parameters, we extended the sign rate paradigm developed in Stephens [40] to perform inference in longitudinal data. This is discussed in more detail in the context of the specific questions we present in the Results section. We encourage the reader to review the Supplement for a more detailed presentation of this model and previously discussed models.
We performed meQTL mapping in the 145 genotyped, self-reported Black children using the set of 269,622 SNPs with 100% genotype call rate in this subset. We restricted ourselves to this subset of samples to minimize heterogeneity in effect sizes. To identify CpG-SNP pairs, we considered SNPs within 5kb of each CpG, as this region has been previously shown to contain the majority of genetic variability in DNA methylation [1] and is small enough to mitigate the multiple testing burden, and computed a P value for the effect of the genotype at a single SNP on methylation at the corresponding CpG with ordinary least squares. We then defined the meQTL for each CpG site as the SNP with the lowest P value. In addition to genotype, we included inferred genetic ancestry (i.e., ancestry PC1), gestational age at birth, gender, sample collection site and methylation plate number in the linear model, along with the first nine principal components of the residual methylation data matrix after regressing out the intercept and the five additional covariates. We then tested the null hypothesis that a CpG did not have an meQTL in the 10kb region by using the minimum marginal P value in the region as the test statistic and computed its significance via bootstrap. Lastly, we used q-values to control the false discovery rate.
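The selection of a lead SNP for each CpG can be sketched as follows, assuming that per-SNP P values from the regression have already been computed and that all SNPs lie on the same chromosome as the CpG. The function name `lead_snp`, the tuple layout and the example identifiers are hypothetical; the bootstrap used to assess the significance of the minimum P value is not reproduced here.

```python
def lead_snp(cpg_pos, snps, window=5000):
    """Return the (snp_id, position, p_value) tuple with the smallest
    P value among SNPs within `window` base pairs of the CpG, or None
    if no SNP falls inside the window.

    snps: iterable of (snp_id, position, p_value) tuples, assumed to be
    on the same chromosome as the CpG.
    """
    candidates = [s for s in snps if abs(s[1] - cpg_pos) <= window]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s[2])


snps = [
    ("snp_a", 1000, 0.2),
    ("snp_b", 4000, 0.001),
    ("snp_c", 12000, 1e-9),  # 7 kb away, so outside the 5 kb window
]
best = lead_snp(5000, snps)
```

Because the minimum is taken over a different number of SNPs for each CpG, its raw P value is not directly comparable across CpGs, which is why a resampling step is needed before applying q-values.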
Acknowledgements
This work was supported in part by NIH grants U19 AI106683, R01 HL129735, R01 HL122712, and P01 HL070831. The URECA study has been funded by the National Institute of Allergy and Infectious Diseases, National Institutes of Health, under Contract numbers NO1-AI-25496, NO1-AI-25482, HHSN272200900052C, HHSN272201000052I, 1UM1AI114271-01 and UM2AI117870. Additional support was provided by the National Center for Research Resources, National Institutes of Health, under grants RR00052, M01RR00533, 1UL1RR025771, M01RR00071, 1UL1RR024156, UL1TR000040, UL1TR001079 and 5UL1RR024992-02.
Footnotes
¶ Equal contributions