ABSTRACT
MicroRNA target sites are often conserved among related species. On the other hand, purifying selection against novel microRNA target sites can also exist. However, the relative impact of conservation versus selection against target sites is still unknown. We investigated these processes in human populations by focusing on polymorphic sites in which one allele is a target site and the other is not. Target allele frequencies were significantly lower than expected at these sites. The analysis of derived allele frequencies revealed that, when the non-target allele is ancestral, the proportion of non-target sites is higher than expected by chance. Conversely, when the target allele is ancestral, the proportion of non-target alleles is also significantly higher than expected. These analyses reveal a selective pressure against microRNA target alleles, which is more effective than selection to conserve target sites. Additionally, microRNA target sites show relatively low levels of population differentiation (Fst). However, when we analyse separately target sites in which the target allele is ancestral in the population, the proportion of SNPs with high Fst significantly increases. These findings support a scenario in which population differentiation (and possible local adaptation) is much more likely in target sites that are lost than in the gain of new target sites. Taking all the results together, we conclude that there is evidence of pervasive selection against microRNA target sites in human populations. The overall impact across untranslated regions is not negligible and should be taken into account when studying the evolution of genomic sequences.
INTRODUCTION
MicroRNAs are small endogenous RNAs that can regulate virtually any type of biological process. In humans they were not discovered until this century [1] and there are now about 1900 human microRNA precursors annotated in miRBase [2], although less than 700 are classified with high confidence. Soon after microRNAs were found in multiple animal species [3–5], the first target prediction tools became available [6–8]. Only in the last few years have these developments permitted the evolutionary analysis of target sites [9–13] revealing that many microRNA target sites are highly conserved among species. On the other hand, whilst some microRNA families have been conserved for millions of years, their targets appear to differ between species (see for instance [14]). Indeed, evidence from vertebrates suggests that gains and losses of target sites may be more important than changes in the microRNAs themselves during the evolution of microRNA-based gene regulation, as microRNAs are usually highly conserved (see Discussion in [15]).
Several studies have found that gene transcripts are depleted of target sites for co-expressed microRNAs [10–12]. In particular, long 3’ UTRs might accumulate microRNA target sites by random mutation, yet they actually have a lower frequency of target sites than expected by chance, suggesting that there has been selection against these sequences [10]. These missing sites have been called ‘anti-targets’ [16,10]. Interestingly, target sites for the same microRNAs tend to be conserved in transcripts expressed in neighboring tissues [11]. These studies have shown that selection against microRNA target sites can be inferred from comparisons among distantly related species. However, the relative impact of selection against microRNA target sites in human populations is not known.
Analysis of human populations has suggested that purifying selection is particularly strong at microRNA target sites, even in non-conserved sites [17,18]. Negative selection against gaining microRNA target sites was also detected in Yoruban populations, but the pattern was not detected in other populations [17]. In a study of Drosophila populations we found evidence of selection against microRNA target sites [19]. Specifically, we found selection against target sites of maternal microRNAs in maternally deposited transcripts. More recently it has been shown that this effect is particularly strong for the mir-309 cluster, whose microRNAs are abundant in the egg and almost absent in the zygote [20]. Characterising this type of selection in humans would reveal to what extent it shapes our genomes. However, the strength and prevalence of selection against target sites in human populations are still unknown. Here we investigate polymorphisms at human microRNA target sites and quantify the impact of selection.
RESULTS
Bias towards microRNA non-target alleles in human populations
To investigate the selective pressures on microRNA target sites in human populations, we first mapped human single-nucleotide polymorphisms (SNPs) to putative canonical microRNA target sites such that one allele is a target site and the alternative allele is not a target site (see Methods for details). The non-target allele in this pair is called a ‘near-target’ [21]. We compiled 709,854 polymorphic target sites in 42,221 gene transcripts (15,026 genes) for 2,584 microRNAs. We first compared the allele frequency distribution of target sites for broadly expressed microRNAs with a background distribution obtained by conducting the same analysis on the reverse complement sequences of 3’UTRs (see Methods). The distribution is biased to the near-target allele: there are more non-target sequences than background at low population frequencies, and fewer at high frequencies (one-tailed Mann-Whitney test, p<<0.001; Figure 1A). Selection against the target sites might be expected in cases where the microRNA/transcript pairs are co-expressed and therefore more likely to interact. We therefore compared co-expressed pairs with those that are not co-expressed and found a significant difference (Figure 1B; p=0.029).
To further investigate the selective forces operating at microRNA target sites we computed the derived allele frequency (DAF, also known as the unfolded site frequency spectrum). We initially considered those SNPs for which the derived allele (the non-conserved allele) is the target allele (Figure 1C). As expected, the overall DAF distribution is L-shaped. However, the distribution of target sites for broadly expressed microRNAs is significantly skewed to the near-target allele compared to the background distribution, indicating selective pressure against the newly arisen target allele (p=0.007). Strikingly, when we explored SNPs for which the derived allele is the near-target allele, the distribution is also skewed towards the near-target allele (Figure 1D; p=0.022). In summary, the distribution of allele frequencies shows evidence of selection against microRNA target sites in human populations.
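The polarisation step described above can be sketched as follows. This is a minimal Python illustration, not the authors' pipeline (which was run in R): the function name `polarise`, its arguments and the labels "gain" and "loss" are hypothetical, and the toy records are made up.

```python
def polarise(target_allele, near_target_allele, ancestral_allele, target_freq):
    """Classify a polymorphic target site and return its derived allele frequency.

    Returns ("gain", DAF) when the derived allele is the target allele,
    ("loss", DAF) when the derived allele is the near-target allele,
    or None when the ancestral state matches neither allele (unknown).
    All names here are illustrative assumptions.
    """
    if ancestral_allele == near_target_allele:
        # The target allele is new: its frequency is the derived frequency.
        return "gain", target_freq
    if ancestral_allele == target_allele:
        # The near-target allele is new: the derived frequency is the complement.
        return "loss", 1.0 - target_freq
    return None


# Toy records: (target allele, near-target allele, ancestral allele, target frequency)
toy_snps = [
    ("A", "G", "G", 0.10),  # target site gained
    ("C", "T", "C", 0.90),  # target site being lost
    ("A", "G", "N", 0.50),  # ancestral state unknown, skipped
]

spectra = {"gain": [], "loss": []}
for snp in toy_snps:
    result = polarise(*snp)
    if result is not None:
        category, daf = result
        spectra[category].append(daf)
```

Keeping the two categories in separate spectra mirrors the split between Figure 1C (derived target allele) and Figure 1D (derived near-target allele).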
The effect of microRNA expression levels and evolutionary conservation
We next considered the potential impact of microRNA conservation. On the one hand, evolutionarily conserved microRNAs may have a weaker effect on selection against microRNA targets, as partly deleterious target alleles may have been cleared from the population. On the other hand, evolutionarily conserved microRNAs tend to be highly expressed, and therefore it is expected that the selective pressure to avoid targets for such microRNAs should be stronger. In other words, expression and conservation are not independent of each other. We therefore included both factors in the analysis: the level of expression of the microRNA and a measure of the phylogenetic conservation of the microRNA sequence.
We analysed cases where the derived allele is a target site (as in Figure 1C), testing whether frequency spectra were different across different levels of microRNA conservation (human-primate specific, conserved in mammals, and conserved in animals) and different levels of expression (low, mid and high as described in Methods). For all 81 possible pairwise comparisons we performed a one-tailed Mann-Whitney test and displayed the resulting p-values in a heatmap in Figure 2. The light grey tones indicate lower frequencies of the target allele, suggesting selection against the derived microRNA target sites. This graph allows a quick visual exploration of the results from many tests. First, it can be seen that lower-left colors are lighter than top-right. That is, for conserved microRNAs, the target allele frequencies are smaller than for less conserved microRNAs. At the same time, within each conservation level (squares in the diagonal in Figure 2) the colors are lighter in the top-right sections, indicating that for more highly expressed microRNAs the target allele frequencies are lower than for less expressed microRNAs. In other words, selective avoidance of target sites is stronger for conserved microRNAs and for highly expressed microRNAs.
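As an illustration of how such a grid of tests can be assembled, the sketch below computes a one-tailed Mann-Whitney p-value for every ordered pair of the nine conservation-by-expression groups (9 × 9 = 81 comparisons). It is a pure-Python approximation under stated assumptions: the normal approximation with tie correction and no continuity correction is used, the group labels and toy frequencies (in percent) are invented, and the original analysis was done in R.

```python
import math
from itertools import product


def midranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def mann_whitney_less(x, y):
    """One-tailed p-value for H1: values in x tend to be smaller than in y.

    Normal approximation; undefined if every pooled value is identical.
    """
    pooled = list(x) + list(y)
    n1, n2, n = len(x), len(y), len(pooled)
    ranks = midranks(pooled)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    tie_sizes = {}
    for v in pooled:
        tie_sizes[v] = tie_sizes.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_sizes.values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u - n1 * n2 / 2) / math.sqrt(var_u)
    return 0.5 * math.erfc(-z / math.sqrt(2))  # P(Z <= z)


# Hypothetical toy data: target allele frequencies (percent) per group.
conservation = ["primate", "mammal", "animal"]
expression = ["low", "mid", "high"]
groups = {
    (c, e): [30 - 2 * (ci + ei) + k for k in range(6)]
    for ci, c in enumerate(conservation)
    for ei, e in enumerate(expression)
}

# One p-value per ordered pair of groups, as in a 9 x 9 heatmap.
pvalues = {
    (g1, g2): mann_whitney_less(groups[g1], groups[g2])
    for g1, g2 in product(groups, repeat=2)
}
```

Each entry asks whether the first group has lower target allele frequencies than the second, so the matrix is not symmetric and the diagonal (a group against itself) sits at p = 0.5.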
To evaluate the joint effect of conservation and expression, we performed a two-way non-parametric Scheirer-Ray-Hare test ([22], p. 445). The effect of microRNA expression level on target allele frequencies was significant (p=0.00001) whilst the effect of conservation and the interaction between conservation and expression were not significant (p=0.20100 and p=0.79595 respectively). This analysis therefore supports one of the two trends suggested by inspection of Figure 2: the depressed frequency of target sites is significantly stronger for more highly expressed microRNAs, but there is insufficient evidence to establish a difference between microRNAs that have been strongly conserved among species and the others.
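For readers unfamiliar with this test, the sketch below shows its general logic: observations are replaced by their ranks, the usual two-way sums of squares are computed on the ranks, and each sum of squares divided by the total mean square of the ranks gives a statistic H that is compared with a chi-square distribution. This is a simplified Python illustration that assumes a balanced design (equal numbers of observations per cell) and even degrees of freedom, as in a 3 × 3 layout; the real data are unlikely to be balanced, the toy values are invented, and the authors' analysis was done in R.

```python
import math


def midranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def chi2_sf_even_df(x, df):
    """Chi-square upper tail probability, closed form valid for even df only."""
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))


def scheirer_ray_hare(cells):
    """cells maps (conservation, expression) -> observations; balanced design assumed.

    Returns {effect: (H, df, p)}.
    """
    keys = list(cells)
    ranks = midranks([v for key in keys for v in cells[key]])
    n = len(ranks)
    grand = (n + 1) / 2
    ms_total = sum((r - grand) ** 2 for r in ranks) / (n - 1)
    sums = ({}, {})    # rank sums per level of each factor
    counts = ({}, {})  # observation counts per level of each factor
    ss_cells, pos = 0.0, 0
    for key in keys:
        m = len(cells[key])
        s = sum(ranks[pos:pos + m])
        pos += m
        ss_cells += m * (s / m - grand) ** 2
        for f in (0, 1):
            sums[f][key[f]] = sums[f].get(key[f], 0.0) + s
            counts[f][key[f]] = counts[f].get(key[f], 0) + m
    ss = [sum(counts[f][lv] * (sums[f][lv] / counts[f][lv] - grand) ** 2
              for lv in counts[f]) for f in (0, 1)]
    ss_inter = max(0.0, ss_cells - ss[0] - ss[1])  # only valid when balanced
    df = [len(counts[0]) - 1, len(counts[1]) - 1]
    out = {}
    for name, s, d in (("conservation", ss[0], df[0]),
                       ("expression", ss[1], df[1]),
                       ("interaction", ss_inter, df[0] * df[1])):
        h = s / ms_total
        out[name] = (h, d, chi2_sf_even_df(h, d))
    return out


# Toy 3 x 3 design, two observations per cell, purely additive effects.
toy = {(c, e): [6 * e + 2 * c + 1, 6 * e + 2 * c + 2]
       for c in range(3) for e in range(3)}
result = scheirer_ray_hare(toy)
```

In the toy data the expression factor explains most of the rank variation and the interaction term is zero by construction, so only the expression effect has a small p-value.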
Population differentiation at target sites
If a microRNA target site is under selective constraints, we should expect differentiation among populations at these sites to be relatively low. To investigate this prediction, we grouped SNPs at target sites for broadly expressed microRNAs according to their Fst and compared their relative frequency with that of the background (see Methods). In Figure 3A we observed that, overall, SNPs at microRNA target sites are depleted for high Fst values. This depression of Fst at microRNA target sites was not detected in the analysis of another group [23] (see Discussion). We further explored the distribution of Fst separately depending on the inferred state of the ancestral allele (target versus near-target). The most striking observation was that, in contrast to the general trend, when the conserved microRNA target site is segregating in human populations (i.e. the target allele was ancestral), there can be considerably more variation in frequency among human populations than at the background loci: in some populations the non-target allele is more frequent than the ancestral allele. The same result appears when we expanded the dataset to microRNAs that are expressed in specific tissues (Figure 3C, red line). In Table 1 we show SNPs in microRNA target sites with an Fst greater than 0.6. In most cases, the ancestral target is lost in populations outside Africa. Interestingly, for HM13 (Minor histocompatibility antigen H13) two target sites for miR-150-5p are lost simultaneously (these sites are about 1 kb away from each other).
On the other hand, in cases where the ancestral allele was the near-target, there is a deficit of high Fst SNPs compared to the background sites (Figure 3B-C, blue lines). In these data, we did not find cases where the derived target allele varied atypically among populations, nor cases where the derived allele reached a very high frequency in a subset of human populations.
Co-evolution of microRNA and target sites: a potential case study
We found one case in which a microRNA has a SNP within its seed sequence (the region that determines the targeting property of the microRNA), which shows some evidence of population differentiation (rs7210937; Fst = 0.3314). In this case, the Fst value between African and European populations is remarkably high (Fst = 0.6129; Nei’s estimate [24,25]). In European populations 92.5% of sequenced individuals present the ancestral form of miR-1269b, whilst in African populations the derived version is more frequent (59.8%). As a shift in the seed sequence may have an impact on the evolution of 3’UTRs, we further studied target sites whose ancestral form is a target for the derived miR-1269 microRNA (786 in total). We then compared the frequency of the target site allele between European and African populations with the Wilcoxon non-parametric test for paired samples. We found that in African populations the frequency of target alleles is lower than in European populations at these sites (p<<0.001) whilst for the ancestral miR-1269b we did not find any significant difference. These results suggest that a shift in the allele frequencies affecting the seed sequence of a microRNA can have an effect on the allele frequencies at the novel target sites, specifically toward the non-target allele.
DISCUSSION
The study of allele frequencies has been extensively used to detect selective pressures in human populations [17,26,27]. Here we show that the patterns of allele frequencies at 3’UTRs show evidence of selection against most microRNA target sites. First, the allele frequencies at target sites are biased towards the non-target allele. Second, this trend is also seen in the subset of cases where the derived allele is a target sequence. These effects are strongest in the cases where the corresponding microRNAs are highly expressed, suggesting that interaction between the microRNA and the target is the key source of selection against target sequences. The microRNAs that have been conserved over longer periods of vertebrate evolution did not impose detectably greater selection against their target sequences, once the effect of expression levels had been taken into account.
The most popular microRNA target prediction programs rely on target site conservation to reduce the number of false positives [28] and/or do not provide a stand-alone version to run on custom datasets [29]. Therefore, we used a naive microRNA target prediction method that reports canonical targets and near-target sites [21]. That allowed us to study pairs of alleles segregating at target sites without any other constraint. On the other hand, we would expect a high number of false positives in target predictions (reviewed in [30]). Remarkably, we still found a significant pattern of selection against microRNA target sites. This reinforces our initial hypothesis and suggests that, if we were able to restrict the analysis to bona fide target sites, the signal might be stronger. One possibility is to evaluate experimentally validated target sites. However, these experiments are based on reference genomes, so segregating target sites whose target allele is not in the reference genome will be lost from the analysis. The way forward may be to perform high-throughput microRNA target experiments, like HITS-CLIP [31], in cells derived from different populations. The continuous drop in the costs of sequencing and high-throughput experiments may allow this in the near future. Indeed, high-throughput experimental evaluation of segregating alleles at regulatory motifs (transcription factor binding sites, RNA binding sites, etcetera) is a promising area of research which will help us to move from a typological (reference genome) to a population view of gene regulation.
Another way to study the effect of selection in populations is to evaluate the population differentiation [23,32]. We found that, at microRNA target sites in general, populations tend to be less differentiated than at control background sites, suggesting that selection acts against an increase in their frequencies. This result differs from that reported by Li et al. [23]. The difference may be explained by our finding that a subset of target sites do have relatively high Fst values. We found this trend at those sites at which the ancestral allele was a microRNA target site and the derived allele is a non-target. The loss of a microRNA target (as in the examples reported in a previous work [23]) may be relatively frequent. It follows that the non-target allele might be neutral or even advantageous in some of these cases. It is noticeable that in the examples in which the derived allele reaches a high frequency, that occurs in the non-African populations, which is the pattern that would be expected if a neutral derived allele spread by genetic drift during founder events. Loss of target sites could also have advantageous effects through the complex interactions that occur in regulatory networks. For example, it has been proposed that in the human lineage the loss of microRNA target sites contributed to an increase in the expression levels of some genes [33]. Our work suggests that this loss of targets may be continuing now in human populations. Selection in favour of new target sites appears to be rarer: we found a strong signal of purifying selection against novel microRNA target sites.
Selection against deleterious mutations has been extensively studied in population genetics (reviewed in [34]). For instance, strong purifying selection produces a phenomenon called background selection, in which loci linked to the selected site experience a reduction in their effective population size [35]. That is, purifying selection reduces the influence of selection at linked sites. For weakly selected sites, a similar process has been described: weak selection Hill-Robertson interference (wsHR [36,37]). Under wsHR, multiple alleles are under a weak selective force, very close together so that recombination is small or negligible between sites, interfering with other selective pressures in the area. We believe that this is the case for the selection against microRNA target sites here described: weak selection against multiple target/near-target sites will shape the evolutionary landscape of the entire untranslated region.
Most analyses of genomic sequence assume that the protein coding sequences are the most important sites of selection. Our results suggest that 3’ UTRs should be included as well. New target sites will emerge at a significant rate because many mutations can potentially introduce a new site for one of the many microRNAs. More specifically, there are about 2,000 microRNA families described in TargetScan (see Methods), defined by 7-nucleotide seed sequences. Assuming that 3’UTRs are composed of non-overlapping 7-mers (a simplifying yet conservative assumption) the expected number of near-target sites per kilobase (kb) is about 75. With a mutation rate of 2.5 × 10^-5 per kb [38] and a total length of the genome that encodes 3’UTRs of 34 Mb, it can be shown that there will be on average one novel microRNA target site on a 3’UTR per genome per generation. That is, one potentially deleterious microRNA target site per person per generation.
It is expected that other regulatory motifs influence the evolution of 3’UTRs. For instance, Savisaar et al. [39] have described selection against RNA-binding motifs. The selective avoidance of transcription factor binding sites [40,41] and of mRNA/ncRNA regulatory interactions in bacteria [42] have also been described. It is likely that on top of all the selective forces that are usually taken into account, there is a layer of selection against weakly deleterious regulatory motifs that will be influencing the evolution of the genome. In conclusion, selection against microRNA target sites is prevalent in human populations, and it may constrain other selective forces in post-transcriptional regulatory regions.
MATERIALS AND METHODS
MicroRNA target and near-target sites were predicted with seedVicious (v.1.1 [21]) against 3’UTRs as annotated in Ensembl version 91 [43] for the human genome assembly hg38. Single nucleotide polymorphisms for the 1000 Genomes project [44] were retrieved from dbSNP (build 137) [45] and mapped to our target predictions. Ensembl sequences and polymorphism data were downloaded using the biomaRt R package [46]. When plotting allele frequencies (Figure 1) we only considered segregating alleles in which the minor allele is present in at least 1% of the sampled population, but when using all segregating alleles we obtained comparable results. The ancestral allele status was obtained from dbSNP. To compute the background (randomly expected) allele distributions we repeated the process but searched for targets in the reverse complement strand of the 3’UTR, to control for sequence length and composition. Expression information for microRNAs was obtained from Meunier et al. [47] and from miRMine [48], and for gene transcripts from the Bgee database (version 13.2, [49]), considering the following tissues: lung, blood, placenta, liver, heart, brain, kidney and testis. MicroRNAs with more than 50 RPM (reads per million) across all tissues were considered broadly expressed. In Figure 2, ‘high’, ‘mid’, and ‘low’ refer to microRNAs with more than 500 RPM, between 500 and 50 RPM, and less than 50 RPM in any tissue respectively. In Figure 3C, ‘broadly and non-broadly expressed’ microRNAs were defined as microRNAs with at least 50 RPKM in any tissue. MicroRNAs were grouped into evolutionary conservation categories depending on the species spread of the seed family in TargetScan 7.2 [28]. Fst values were retrieved from the 1000 Genomes Selection Browser 1.0 [50]. In Figure 3 the fold enrichment was computed as the logarithm of the ratio between the proportion of SNPs at target sites and the proportion of SNPs at a background site for each Fst bin.
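The fold enrichment per Fst bin described above can be sketched as follows. This is an illustrative Python version, not the authors' R code: the base of the logarithm is not stated in the text, so base 2 is an assumption here, and the bin edges, function name and toy Fst values are hypothetical.

```python
import math


def fold_enrichment(target_fst, background_fst, edges):
    """Log2 ratio of the proportion of target-site SNPs to the proportion of
    background SNPs in each Fst bin.

    Bins are [edges[i], edges[i+1]); the last bin also includes its upper edge.
    Returns None for a bin where either proportion is zero (ratio undefined).
    Log base 2 is an assumption; the source only says "logarithm".
    """
    def proportions(values):
        counts = [0] * (len(edges) - 1)
        for v in values:
            for i in range(len(counts)):
                last = i == len(counts) - 1
                if edges[i] <= v < edges[i + 1] or (last and v == edges[-1]):
                    counts[i] += 1
                    break
        return [c / len(values) for c in counts]

    p_target = proportions(target_fst)
    p_background = proportions(background_fst)
    return [
        math.log2(t / b) if t > 0 and b > 0 else None
        for t, b in zip(p_target, p_background)
    ]


# Toy Fst values: target sites concentrated at low Fst, background at high Fst.
toy_target = [0.01, 0.02, 0.03, 0.20]
toy_background = [0.01, 0.20, 0.30, 0.40]
enrichment = fold_enrichment(toy_target, toy_background, edges=[0.0, 0.1, 1.0])
```

Negative values in the high-Fst bins correspond to the depletion of highly differentiated SNPs at target sites, and positive values to an excess.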
All statistical analyses were done with R (v. 3.4.3, [51]). All processed datasets and online tools to compute the tests here reported are available at our dedicated web server PopTargs (Hatlen and Marco, under review).
ACKNOWLEDGEMENTS
We thank Richard Nichols for critical reading of the manuscript and useful comments. This work was supported by the Wellcome Trust [grant number 200585/Z/16/Z].