Abstract
Human genetics has informed the clinical development of new drugs, and is beginning to influence the selection of new drug targets. Large-scale DNA sequencing studies have created a catalogue of naturally occurring genetic variants predicted to cause loss of function in human genes, which in principle should provide powerful in vivo models of human genetic “knockouts” to complement model organism knockout studies and inform drug development. Here, we consider the use of predicted loss-of-function (pLoF) variation catalogued in the Genome Aggregation Database (gnomAD) for the evaluation of genes as potential drug targets. Many drug targets, including the targets of highly successful inhibitors such as aspirin and statins, are under natural selection at least as extreme as known haploinsufficient genes, with pLoF variants almost completely depleted from the population. Thus, metrics of gene essentiality should not be used to eliminate genes from consideration as potential targets. The identification of individual humans harboring “knockouts” (biallelic gene inactivation), followed by individual recall and deep phenotyping, is highly valuable to study gene function. In most genes, pLoF alleles are sufficiently rare that ascertainment will be largely limited to heterozygous individuals in outbred populations. Sampling of diverse bottlenecked populations and consanguineous individuals will aid in identification of total “knockouts”. Careful filtering and curation of pLoF variants in a gene of interest is necessary in order to identify true LoF individuals for follow-up, and the positional distribution or frequency of true LoF variants may reveal important disease biology. Our analysis suggests that the value of pLoF variant data for drug discovery lies in deep curation informed by the nature of the drug and its indication, as well as the biology of the gene, followed by recall-by-genotype studies in targeted populations.
Main Text
Human genetics in drug discovery
Human genetics has inspired clinical development pathways for new drugs and shows promise in guiding the selection of new targets for drug discovery1,2. The majority of drug candidates that enter clinical trials eventually fail for lack of efficacy3, and while in vitro, cell culture, and animal model systems can provide preclinical evidence that the compound engages its target, too often the target itself is not causally related to human disease1. Candidates that target genes with human genetic evidence for causality in disease are more likely to become approved drugs4,5.
An oft-cited example is the development of monoclonal antibodies to PCSK9. PCSK9 binds and causes degradation of the low-density lipoprotein (LDL) receptor, thus raising serum LDL and cardiovascular disease (CVD) risk6. Naturally occurring genetic variation in the PCSK9 gene provided a full allelic series that correctly7,8 predicted that pharmacological inhibition of PCSK9 would lower LDL and be protective against CVD. Gain-of-function variants in PCSK9 raise LDL and CVD risk9, whereas variants that reduce functionality lower LDL and CVD risk10, and variants that result in a total loss of function lower LDL and CVD risk more strongly11,12. A human lacking any PCSK9 due to compound heterozygous inactivating mutations has very low LDL and no discernible adverse phenotype13.
This story illustrates the potential for human genetics to inform on the phenotypic impact — both efficacy and tolerability — of a target’s modulation and inactivation, thus providing dose-response and safety information even before any drug candidate has been identified1. This provides a powerful motivation to study human genetics when evaluating a potential drug target. At the same time, however, we will show in this article that the characteristics of PCSK9 cannot be taken as a one-size-fits-all standard for what criteria a gene must meet in order to be a promising drug target. In contrast to PCSK9, many highly successful drugs target genes where pLoF variants are depleted by intense natural selection and are extremely rare in the general population. For many of these genes, pLoF variants are too rare for ascertainment of multiple double-null human “knockouts” to be a realistic goal. Nevertheless, the study of pLoF variants and the individuals harboring them, even if limited to heterozygotes, can be deeply informative for drug discovery, but only in the context of deep curation undertaken with awareness of gene and disease biology and of the potential drug and its indication.
Rationale and caveats for studying loss-of-function variants
Variants annotated as nonsense, frameshift, or essential splice site-disrupting are categorized as protein-truncating variants. Provided that there is rigorous filtering of false positives14, such variants are generally expected to reduce gene function, and are referred to here as predicted loss-of-function (pLoF) variants. In the simplest case, a germline heterozygous loss-of-function allele may correspond to a 50% reduction in gene dosage compared to a wild-type individual, and germline double null (homozygous or compound heterozygous) genotypes may correspond to 0% of normal gene dosage, in all tissues, throughout life — though of course the reality may be more complex. While full dosage compensation appears to be rare, at least at the RNA level15, a variety of mechanisms may cause heterozygous or even homozygous LoF to be phenotypically muted: for instance, factors other than gene dosage may be rate-limiting for the protein’s function16, or paralogs may compensate14. In some cases, however, pLoF variants can phenocopy long-term pharmacological inhibition, and may be useful for predicting the effects of drugs that negatively impact their target’s function, such as inhibitors, antagonists, and suppressors. Such drugs comprise a significant fraction of approved medicines (see below). While the effects of genetic inactivation of potential drug targets have been studied for decades using knockout mice17, public databases of genetic variation such as the Genome Aggregation Database (gnomAD)18, containing a total of 141,456 human genomes and exomes, now provide an opportunity to study the effects of gene knockout in the organism of most direct interest: humans.
As with any biological data, information from pLoF variants must be interpreted within a broader therapeutic context. Many drugs are not inhibitors, but rather confer neomorphic or hypermorphic gains of function on their targets19, and are thus not well-modeled by pLoF variants at all. Even for drugs that antagonize their target’s function, it is important to recognize several ways in which genetic knockout and pharmacological inhibition may have divergent effects. For example, a chemical probe may inhibit only one of a protein’s two or more functional domains20, may inhibit proteins encoded by two or more paralogous genes21, or may inhibit a target only when a particular complex is formed22 or, alternatively, when a particular protein-protein interaction is absent23. Gene knockouts normally affect every tissue in which a gene is expressed, although in some cases variants may occur on tissue-specific isoforms24–27, and meanwhile many drugs have tissue-restricted distribution. Genetic knockout is also lifelong, including embryonic phases, whereas pharmacological inhibition is generally temporary and age-restricted; this is important because dozens of approved drugs are known or suspected to cause fetal harm but are tolerated in adults28. Finally, as noted above, genetic knockout has a specific dose, whereas the dosing of pharmacological inhibition can be titrated as needed.
While these caveats are important, the PCSK9 example illustrates that pLoF variants can nonetheless be predictive of the phenotypic effects of drugging a target, and other examples of protective LoF variants modeling therapeutic intervention have subsequently arisen29–31. In addition, some drug adverse events may have been predictable in light of human genetic data32; for instance, inhibition of DGAT1 resulted in gastrointestinal side effects which may phenocopy biallelic DGAT1 loss-of-function mutations33,34. Currently, however, a systematic framework for applying human genetic data to the selection of drug targets and to the prediction of drug safety is lacking. In this article we lay the groundwork for such a framework, by analyzing the frequency, distribution, and signals of natural selection against pLoF variants in gnomAD, particularly in the targets of approved drugs.
Measuring natural selection in human genes by pLoF constraint
One natural question to ask of a gene is whether disruptive variants that arise in it are severely deleterious — that is, result in a severe disease state that would typically result in carriers having a high risk of being removed from the population by natural selection. In some cases such information is available directly from the observation of severe disease patients where pLoF mutations in that gene have been shown to be causal; however, for a substantial majority of human genes, no severe pLoF phenotype has yet been determined35. The lack of a known pLoF phenotype may arise for multiple reasons: (1) disruption of the gene may cause no discernible phenotype at all; (2) disruption may cause a phenotype that is evolutionarily deleterious but clinically mild, has effects only on reproductive fitness rather than individual health, or manifests only upon a certain environmental exposure; (3) the corresponding disease families may not yet have been sequenced or adequately analyzed, at least in sufficient numbers to convincingly demonstrate causation; or (4) pLoF variants may cause a phenotype so severe that human carriers are never observed (e.g. early embryonic lethality). Genes that fall in the latter three categories can be detected even if patients with the corresponding pLoF mutations have not yet been observed, by identifying a depletion of unique pLoF variants in the general population – a state known as constraint36.
Identifying constraint requires comparing the number of pLoF variants observed in a gene in a large population with the number expected in the absence of natural selection. Determining the number of expected pLoF variants in the population relies on a mutation rate model, many of which are based on the rate at which mutations spontaneously arise. The model used here determines the rate of mutation by incorporating the exact nucleotide change (e.g. C to T) and the immediate sequence context, among other factors18. Thus, in any given reference population, such as the 125,748 human exomes in gnomAD18, the expected number of unique genetic variants seen in at least one individual in a gene of interest, absent natural selection, is predicted based on mutation rates36–38. This expected number of variants can then be compared to the actual observed number of variants in the database in order to quantify the strength of purifying natural selection acting on that gene, for variants of each functional class — synonymous, missense, and pLoF36. Because true pLoF variants are very rare, annotation errors can account for a large fraction of apparent pLoF variants14, and pLoF constraint is best assessed using rigorous filtering for known error modes and with transcript expression-aware annotation18,24.
Constraint differs from evolutionary conservation in that constraint (1) informs on selection in humans, not other species; (2) primarily reflects selection against variants in a heterozygous state; and (3) can more finely discriminate strong versus weak selection signals39,40. In general, the degree of constraint observed across the genome varies dramatically between synonymous variants, which appear to be under almost no natural selection, missense variants, which show some weak selection, and pLoF variants, which show a strong signal of depletion genome-wide but with marked variation between genes (Figure 1). Various metrics have been developed to quantify constraint39; here, we focus on the ratio of observed to expected pLoF variants (obs/exp).
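As a minimal sketch of the obs/exp metric described above: the function below simply divides an observed count of unique pLoF variants by the count expected under a mutation rate model. The function name and the example counts are hypothetical, chosen for illustration only; the expected count is taken as given here, whereas in practice it comes from the mutation rate model and curated annotation described in the text.

```python
def obs_exp_ratio(observed: int, expected: float) -> float:
    """Ratio of observed to expected unique pLoF variants for one gene.

    `expected` is assumed to come from a mutation rate model evaluated
    for the reference population; this sketch does not compute it.
    """
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return observed / expected


# Hypothetical counts for illustration only (not taken from gnomAD).
ratio = obs_exp_ratio(observed=11, expected=25.0)
print(f"obs/exp = {ratio:.0%}")  # obs/exp = 44%
```

A value near 100% would indicate no apparent selection against pLoF variants, and a value near 0% would indicate near-complete depletion.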
Comparison of pLoF constraint in drug targets versus other gene sets
As explained above, constraint allows us to quantify the degree of natural selection against loss-of-function variants in each gene in the human genome. One might expect that drug targets should be less constrained than other genes, since targeting genes that do not tolerate inactivation might result in more adverse events. Alternatively, however, one might expect that drug targets should be more constrained than other genes, since constraint partly reflects a gene’s dosage sensitivity, and effective drugs should target genes where a change in gene dosage affects phenotype. We used the obs/exp constraint metric described above to assess the degree of natural selection against loss-of-function variants in the targets of approved drugs (extracted from DrugBank41, N=383). The overall distribution of pLoF obs/exp values for drug targets was similar to that for all genes (Figure 2A). Drug targets include genes under no apparent natural selection against loss-of-function (obs/exp 100%) as well as genes under intense purifying selection (obs/exp 0%).
We compared the mean obs/exp value for drug targets to that of other gene lists (Figure 2B). As previously reported18,42, the ranking of various gene lists aligns with expectation. Olfactory receptors, which are often dispensable in humans43, have nearly 100% of their expected pLoF variation, and genes that tolerate homozygous inactivation in humans also have a higher proportion of their expected pLoF variants than the average gene. Recessive disease genes are close to the genome-wide average, possessing 59% of the expected number of pLoF variants, likely reflecting weak selection against heterozygous carriers of inactivating mutations in these genes. Dominant disease genes are more depleted for pLoF, and genes known to be essential in cell culture or associated with diseases of haploinsufficiency are even more severely depleted. Targets of approved drugs are on average more depleted for pLoF variation than the average gene (P = 0.0003), with only 44% of the expected amount of pLoF variation, versus 52% for all genes.
We then stratified by the drug’s effect on its target: negative (inhibitors, antagonists, suppressors, etc., N=240), positive (activators, agonists, inducers, etc., N=142), and other/unknown (N=94). Although one might expect that targets of negative drugs would need to be more tolerant of pLoF variation, pLoF constraint did not differ significantly between these sets, and if anything, targets of negative drugs were more constrained than those of positive drugs (mean obs/exp 42% vs. 48%, P=0.31, Kolmogorov-Smirnov test, Figure 2B). We note that many drug targets are included in more than one of these three sets, and 50 genes are the targets of both positive and negative drugs.
Overall, 19% of drug targets (N=73), including 53 targets of inhibitors or other negative drugs, have a pLoF obs/exp value less than the average (12.8%) for genes known to cause severe diseases of haploinsufficiency44 (ClinGen Level 3). To determine whether this finding could be explained by a particular class or subset of drugs, we examined constraint in several well-known example drug targets (Table 1). A few of the most heavily constrained drug targets are targets of cytotoxic chemotherapy agents such as topoisomerase inhibitors or cytoskeleton disruptors, a set of drugs intuitively expected to target essential genes. However, several genes with apparently complete or near-complete selection against pLoF variants are targets of highly successful, chronically used inhibitors including statins and aspirin.
These examples demonstrate that even strong pLoF constraint does not preclude a gene from being a viable drug target. This mirrors the lesson from animal models that a lethal mouse knockout phenotype, such as that reported for Hmgcr or Ptgs2, does not rule out successful drug targeting45–47. The fact that pharmacological inhibition is apparently well-tolerated even in some genes where loss-of-function appears to be evolutionarily deleterious might reflect any of the issues raised above, including differences in effective “dosage”, tissue distribution, or the importance of the gene in embryonic versus adult life stages.
Potential confounding variables in the composition of drug targets
As noted above, drug targets are on average more depleted for pLoF variation than other genes, possessing on average just 44% as much pLoF variation as expected, compared to 52% for all genes (Figure 2), and the effect is similar or stronger when the analysis is limited to drugs with a negative effect on their target’s function. From an efficacy perspective, one could argue that this makes sense: constrained genes should be enriched for dosage-sensitive genes, such that a pharmacological agent with less than 100% target engagement can still bring about a change in phenotype. But from a safety perspective, this result is counterintuitive: one would instead have expected that agents targeting more strongly constrained genes are more likely to cause adverse events and so less likely to become approved drugs. Before drawing any conclusions about whether pLoF constraint is predictive of drug success, we sought to identify potential confounding variables that could impact this analysis.
Drug targets are dominated by a few families or classes of proteins, including rhodopsin-like G-protein coupled receptors (GPCRs), nuclear receptors, voltage- and ligand-gated ion channels, and enzymes48,49. We asked whether controlling for these classes might affect the results shown in Figure 2. These four classes of genes are collectively enriched by 9.5-fold (95% CI: 7.7-11.7, P < 1 × 10^-50, Fisher exact test) among approved drug targets and, in total, account for 54% (207/386) of targets in our dataset. Each class has a mean pLoF obs/exp value significantly different from the set of all genes, with rhodopsin-like GPCRs being less constrained and the other three classes being more constrained (Figure 3). After controlling for membership in these four target classes as well as an “other” category, approved drug targets are still more constrained than other genes, with a mean pLoF obs/exp ratio lower by 10.0% (P = 6 × 10^-6, linear regression), mirroring the result in Figure 2.
A second potential confounder is that many genes were chosen as drug targets because of their presumed relevance to human disease: targets with human genetic validation are reported to be four-fold enriched among approved drugs versus clinical candidates4. Genes adjacent to genome-wide association study (GWAS) hits, a proxy for involvement in human disease, are more constrained than the average gene42 and are 2.2-fold enriched among drug targets (P = 2 × 10^-14, Fisher exact test), collectively accounting for 52% (200/386) of drug targets. However, even after controlling for GWAS hit adjacency, drug targets were still more constrained than other genes, with a pLoF obs/exp ratio on average 6% lower (P = 0.003, linear regression).
A third confounder is the number of adult human tissues in which each gene is expressed (thresholding at a median of 1 transcript per million in GTEx v7)50. Broader expression across tissues is associated with more severe constraint, that is, with lower obs/exp (Spearman’s correlation r = −0.31, P < 1 × 10^-50), and drug targets are on average expressed in fewer tissues than all genes (mean 32/53 vs. 37/53 tissues, P = 1 × 10^-12, Kolmogorov-Smirnov test). After controlling for this effect, however, drug targets are still more constrained than the average gene, with pLoF obs/exp 11% lower (P = 2 × 10^-8, linear regression).
All three observed variables considered above (protein family, disease association, and tissue expression) are confounded with a gene’s status as a drug target. This suggests that many unobserved variables are likely to differ between drug target and non-drug target genes as well. Thus, although drug targets are more constrained than the average gene even after controlling for the variables considered here, it would not necessarily be appropriate to conclude that stronger pLoF constraint is associated with increased likelihood of drug target success. Instead, given the wide spectrum of constraint values observed in drug targets (Figure 2A) and the diverse examples from that spectrum (Table 1), the salient conclusion is that genes from the strongly constrained to the not at all constrained can make viable drug targets.
This analysis is limited in crucial ways by available annotations and gene lists. For instance, we only compared targets of successful drug candidates (those that reach approval) to all genes, whereas to gain insight into safety signals it might be more instructive to compare to the targets of drug candidates that failed early in development due to on-target toxicity; however, to our knowledge, no sufficiently large dataset of such targets currently exists. It is also possible that different trends would emerge if analysis could be limited to drugs taken chronically and systemically (as opposed to transiently and/or locally) and/or stratified by the severity of the indicated condition, as a proxy for the severity of side effects that can be tolerated. Such annotations exist but would require extensive manual curation, a direction for future research. Finally, as noted above, expression patterns during embryonic development may explain some differences between the phenotypic effects of genetic disruption and pharmacological inhibition, but data on human embryonic gene expression are lacking.
Prospects for ascertainment of heterozygous or homozygous “knockout” humans for target validation
The analyses above suggest that a simple statistical approach based solely on quantifying a gene’s constraint will not be sufficient to nominate or exclude good drug targets. Where humans with loss-of-function variants in a potential target can be ascertained and studied, however, their phenotypes are expected to be extremely valuable for predicting the phenotypic effects — both desired and undesired — of a drug against that target. The PCSK9 example illustrates the potential value of such “genotype-first” ascertainment, and has inspired many efforts to do the same for other potential targets of interest51–54. To date, however, it has generally been unclear, for any particular gene of interest, how best to go about finding null individuals. Likewise, in genes for which double null humans have not yet been identified, it is often unclear whether this is due to chance, or due to lethality of this genotype.
To explore these questions, we computed the cumulative allele frequency18 (CAF, or p) of pLoF variants in each gene in gnomAD in order to assess how often heterozygous or homozygous null individuals might be identified for any given gene of interest. We first considered a random mating model, under which the expected frequency of pLoF heterozygotes is 2p(1-p) and the expected frequency of double null or total “knockout” individuals is p^2. Whereas gnomAD is now large enough to include at least one pLoF heterozygote for the majority of genes, ascertainment of total “knockout” individuals in outbred populations will require multiple orders of magnitude larger sample sizes for most genes (Figure 4A). For instance, consider a sample size of 14 million individuals from outbred populations, 100 times larger than gnomAD today. In this sample size, 75% of genes (N=14,340) would still be expected to have <1 double null individual, and 91% of genes (N=17,546), including 92% of existing approved drug targets (N=357), would have sufficiently few expected “knockouts” that observing zero of them would not represent a statistically significant departure from expectation. Indeed, for 38% of genes (N=7,546), even if all humans on Earth were sequenced, observing zero “knockouts” would still not be a statistically significant anomaly. Thus, for the vast majority of genes for the foreseeable future, examining outbred populations alone will not provide statistical evidence that a double null genotype is not tolerated in humans.
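The random mating calculation above can be sketched in a few lines. This is an illustration under stated assumptions rather than the analysis actually performed: the function names are hypothetical, the example frequency p = 1/3,000 is arbitrary, and the significance check uses a Poisson approximation with a 0.05 threshold, which may differ from the test used for Figure 4A.

```python
import math


def expected_genotype_counts(p: float, n: int) -> tuple:
    """Expected heterozygotes and double-null individuals among n people
    under random mating, for a cumulative pLoF allele frequency p."""
    het = 2 * p * (1 - p) * n
    hom = p * p * n
    return het, hom


def prob_zero_knockouts(p: float, n: int) -> float:
    """Chance of seeing no double-null individuals among n people,
    using a Poisson approximation (an assumption of this sketch)."""
    return math.exp(-p * p * n)


def min_sample_for_significance(p: float, alpha: float = 0.05) -> float:
    """Sample size at which observing zero knockouts would fall below
    alpha under the same Poisson approximation."""
    return -math.log(alpha) / (p * p)


# Hypothetical gene with cumulative pLoF allele frequency 1 in 3,000,
# evaluated at the 14 million sample size discussed in the text.
p, n = 1 / 3000, 14_000_000
het, hom = expected_genotype_counts(p, n)
print(f"expected heterozygotes: {het:,.0f}")
print(f"expected knockouts:     {hom:.2f}")
print(f"P(zero knockouts):      {prob_zero_knockouts(p, n):.2f}")
print(f"n needed for P < 0.05:  {min_sample_for_significance(p):,.0f}")
```

Because the expected knockout count scales with p^2, halving the allele frequency quadruples the sample size needed before an absence of knockouts becomes informative.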
Some human populations, however, have demographic properties that render them substantially more likely to produce “knockout” individuals. One such category is bottlenecked populations, such as Finnish or Ashkenazi Jewish individuals, which are descended relatively recently (<100 generations) from a small number of historical founders that subsequently expanded rapidly in size to create large modern populations. This demographic history means that very rare LoF variants present in a founder (including both neutral and relatively deleterious variants) can rise to an unusually high allele frequency in the resulting population51. Thus, it is possible for a gene where LoF variants are ultra-rare in an outbred population, either due to chance or due to natural selection, to harbor relatively common LoF variants in a bottlenecked population. Examination of the full distribution of cumulative LoF allele frequencies for different genes (Figure 4B), however, reveals the double-edged sword of pLoF analysis in bottlenecked populations. In any given bottlenecked population, only a handful of genes have common pLoF variants, and meanwhile, rare pLoF variants that did not pass through the bottleneck have been effectively removed from these populations, resulting in a greater proportion of genes with few or no pLoF variants. If pLoF variants in one’s gene of interest happen to be present at high frequency in such a population, then a large number of pLoF individuals can be ascertained, enabling association studies that would have been difficult or underpowered in a larger, outbred population51,55. But if one begins with a pre-determined gene or list of genes to study, any specific bottlenecked population is less likely to reveal interesting pLoF variants than are outbred populations.
As such, any effort to use such populations for genome-wide target validation studies would be well-advised to draw samples from as diverse a set of bottlenecked populations as possible to maximize the probability of any specific gene being knocked out in at least one group.
The study of consanguineous individuals, by contrast, is much more likely to identify homozygous pLoF genotypes for a pre-determined gene of interest. The East London Genes & Health (ELGH) initiative56 has recruited ∼35,000 British Pakistani and Bangladeshi individuals, about 20% of whom report that their parents are related. On average, the N=2,912 individuals who reported that their parents were second cousins or closer had 5.8% (about 1/17th) of their genome in runs of autozygosity, meaning that both chromosomes are identical, inherited from the same recent ancestor. Consider, for example, a gene with pLoF allele frequency 1 in 3,000. This gene would be expected to have homozygous or compound heterozygous pLoF variants in (1/3,000)^2 = 1 in 9 million individuals in an outbred population, but 0.058 × 1/3,000 = 1 in 52,000 consanguineous individuals. Unlike in bottlenecked populations, where certain pLoF variants can be very common, the allele frequency of pLoF variants is not shifted in populations with elevated rates of consanguinity; only the homozygote frequency is dramatically shifted to the right (Figure 4C). These properties explain why the study of these populations has been highly fruitful to date53,57,58 and justify ambitious plans to expand these cohorts in the coming years56,59. However, it is worth emphasizing that because the underlying variants are still rare, studying these populations may only identify a handful of individuals with a homozygous pLoF genotype in a specific gene of interest; such data may be adequate to address safety questions and identify stark phenotypic effects53, but will often be highly underpowered for the study of subtle clinical phenotypes or the direct validation of disease-protective effects.
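The worked example in the preceding paragraph can be reproduced with the short sketch below. The function names are hypothetical, and the consanguineous estimate follows the text's simplification of multiplying the autozygous fraction by the allele frequency, ignoring the much smaller contribution from the non-autozygous remainder of the genome.

```python
def knockout_rate_outbred(p: float) -> float:
    """Expected double-null frequency under random mating."""
    return p ** 2


def knockout_rate_consanguineous(p: float, autozygous_fraction: float) -> float:
    """Approximate double-null frequency when a given fraction of the
    genome is autozygous; within such a segment a single pLoF allele is
    homozygous. The non-autozygous contribution (on the order of p^2)
    is ignored, matching the simplification in the text."""
    return autozygous_fraction * p


p = 1 / 3000  # pLoF allele frequency from the example in the text
print(round(1 / knockout_rate_outbred(p)))                # 9000000
print(round(1 / knockout_rate_consanguineous(p, 0.058)))  # 51724, roughly 1 in 52,000
```

The roughly 170-fold difference between the two rates comes entirely from replacing one factor of p with the autozygous fraction; the allele frequency itself is unchanged.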
Ascertainment of double null “knockout” humans remains a desirable goal for establishing the phenotype associated with complete loss of the target gene. However, the data above demonstrate that discovery of substantial numbers of such individuals may be infeasible for many genes of interest in outbred populations. Even in consanguineous cohorts, for most genes, observing homozygous individuals will require orders of magnitude larger sample sizes than are available today (Figure 4C). At present, for most genes, we believe that well-powered studies of the phenotypic impact of human LoF alleles will be limited to heterozygous individuals, which often will provide valuable models of partial gene inhibition, as we describe in an accompanying manuscript exploring individuals heterozygous for LRRK2 LoF54.
Regardless of the study design, moving from pLoF genotypes to information about specific clinical outcomes will depend critically on the accuracy of pLoF identification. We thus next turn our attention to the careful curation required to filter for true LoF variants before embarking upon any genotype-based ascertainment effort.
Curation of pLoF variants in six neurodegenerative disease genes
To illustrate both the opportunities and the challenges associated with identifying true LoF individuals for further study, we manually curated the data from gnomAD as well as the scientific literature for six genes associated with gain-of-function (GoF) neurodegenerative diseases, for which inhibitors or suppressors are presently under development60–68: HTT (Huntington disease), MAPT (tauopathies), PRNP (prion disease), SOD1 (amyotrophic lateral sclerosis), and LRRK2 and SNCA (Parkinson disease). The results (Table 2 and Figure 5) illustrate four points about pLoF variant curation.
First, other things being equal, genes with longer coding sequences have more opportunity for LoF variants to arise, and so are likely to have a higher cumulative frequency of LoF variants, unless they are heavily constrained. Thus, shorter and/or more constrained genes are more difficult targets for the follow-up of LoF individuals, even though constraint in and of itself does not rule out a gene being a good drug target (Table 1).
Second, many variants annotated as pLoF are in fact false positives, and this is particularly true of pLoF variants with higher allele frequencies, such that the true cumulative allele frequency of LoF is often much lower after manual curation than before. As such, studies of human pLoF variants that do not apply extremely stringent curation to their candidate variants can easily dilute their clinical studies with large numbers of false pLoF carriers or homozygotes, rendering the resulting data challenging or impossible to interpret. In the long term, we anticipate that high-throughput direct functional validation of candidate pLoF variants will become the standard for such studies in humans.
Third, even after careful curation, the cumulative frequency of LoF variants is sometimes sufficiently high to place certain bounds on what heterozygote phenotype might exist. For example, in HTT, LRRK2, PRNP, and SOD1, individuals with high-confidence heterozygous LoF variants are equally or more common in the population than people with gain-of-function variants that cause neurodegenerative disease. In each case, the gain-of-function disease has been well-characterized for decades. Thus, it seems unlikely that a comparably severe and penetrant heterozygous loss-of-function syndrome associated with the same gene could have gone unnoticed to the present day. Of course, this does not rule out the possibility that heterozygous loss-of-function could be associated with a less severe or less penetrant phenotype.
Finally, the positional distribution of pLoF variants often appears non-random, and careful curation of variants in such genes can often reveal a reason for the observed distribution, with resulting dramatic changes in the gene’s constraint and/or cumulative LoF allele frequency. Three genes in our curation set — HTT, MAPT, and PRNP — are good examples of how different non-random positional distributions of pLoF variants in a gene’s coding sequence can correspond to different error modes or disease biology (Figure 5).
HTT, the gene encoding huntingtin, the cause of Huntington disease, appears at first glance to harbor several common LoF variants, with a cumulative allele frequency of 6%. This is surprising in view of this gene’s strong constraint in humans (Table 2) and the known embryonic lethal phenotype of homozygous knockout in mice84. Inspection (Figure 5A) reveals that all of the common pLoF variants in HTT are sequencing read alignment artifacts within the polyglutamine and polyproline tracts of exon 1, some of which are removed by the automated annotation tool LOFTEE18, and the rest of which can be identified quickly by visual inspection. True LoF variants in HTT are in fact rare, consisting mostly of singletons (variants seen only once in gnomAD’s database of 141,456 individuals). Nonetheless, a total of 37 apparently real LoF alleles are observed in HTT, and these variants are positionally random and include nonsense, splice, and frameshift mutations. This suggests that ∼1 in 3,800 people in the general population are heterozygous for genuine LoF of HTT, making this genotype about as common as the HTT CAG repeat expansion that causes Huntington’s disease. While heterozygous HTT LoF variants do appear to be under negative selective pressure given the clear depletion of such variants in the population, the prevalence of this genotype makes it unlikely that such variants result in a penetrant, severe, syndromic illness. This conclusion is consistent with the lack of reported phenotype in a human with a heterozygous translocation disrupting HTT85 and the heterozygous parents of children with a neurodevelopmental disorder due to compound heterozygous hypomorphic mutations in HTT86,87. Heterozygous knockout mice are likewise reported to have no obvious abnormality84, although reduced body weight has been noted88. 
Functional studies to confirm that the observed variants in HTT are true LoF, and recall-by-genotype efforts to identify any phenotype in these individuals remain important future research directions. At present, the balance of evidence suggests that heterozygous HTT loss-of-function does not cause a severe, penetrant disease in humans.
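The heterozygote frequency quoted above for HTT can be retraced with a few lines of arithmetic. The following is a minimal sketch, not the authors' code; it assumes that each of the 37 curated LoF alleles is carried by a different individual, and the function name is introduced here for illustration only.

```python
# Figures taken from the text: 37 curated LoF alleles observed
# among the 141,456 individuals in gnomAD.
N_INDIVIDUALS = 141_456
N_LOF_ALLELES = 37


def carrier_frequency(n_alleles, n_individuals):
    """Approximate heterozygote frequency.

    Assumption (ours): every allele sits in a different person, so the
    number of carriers equals the number of alleles.
    """
    return n_alleles / n_individuals


freq = carrier_frequency(N_LOF_ALLELES, N_INDIVIDUALS)
one_in = 1 / freq
print(f"HTT LoF heterozygotes: about 1 in {one_in:,.0f}")
```

Rounded to two significant figures, this corresponds to the 1 in 3,800 given in the text.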
MAPT, the gene encoding tau, the cause of tauopathies and an important protein in Alzheimer disease, appears at first glance to harbor a large number of LoF variants, some of which are common, leading to a cumulative LoF allele frequency of 14%. The positional distribution of variants is suspiciously non-random, however, with LoFs concentrated in a few exons. Plotting the variant data against brain RNA expression data24 reveals the reason for this pattern (Figure 5B): almost all of the pLoF variants in MAPT, including all those with appreciable allele frequency, fall in exons that are not expressed in the brain. The few remaining pLoF variants that do fall in brain-expressed exons were all determined to be sequencing or annotation errors upon closer inspection, meaning that no true LoF variants are observed in MAPT. Heterozygous MAPT deletions in humans have been reported: a partial deletion of exons 6-9 is believed to result in pathogenic gain of function89, while the 17q21.31 microdeletion syndrome90 spanning MAPT and four other genes is associated with a neurodevelopmental disorder that has since been causally attributed to the loss of KANSL191. Homozygous Mapt knockout mice are grossly normal92,93. Our data would be consistent with MAPT loss-of-function having some fitness effect in humans, but our sample size is insufficient to prove that MAPT loss-of-function is not tolerated (see Supplement). Even if heterozygous MAPT loss-of-function is pathogenic, this does not imply that MAPT is not a viable drug target, for the reasons explained above. However, this would mean that ascertaining and studying MAPT LoF individuals in order to determine whether reduced gene dosage is protective against tauopathies may prove difficult or impossible.
PRNP, the gene encoding prion protein, the cause of prion disease, is a single-exon gene, so truncating variants do not trigger nonsense-mediated decay and instead result in shortened proteins. PRNP appears at first glance to be modestly depleted for LoF variants, particularly in its C terminus. As previously reported52, comparing gnomAD data to reported pathogenic variants in the literature (Figure 5C) reveals that truncating variants at codon 145 or higher are associated with a pathogenic gain-of-function leading to prion disease, apparently through removal of the protein’s GPI anchor. All of the variants seen in non-dementia cohorts in gnomAD occur prior to codon 145 and appear to correspond to true LoF. An individual with a G131X mutation was found to be neurologically healthy at age 77 with no family history of neurodegeneration (see Supplement), suggesting that stop codons up through at least codon 131 are benign. The sole C-terminal truncating variant observed in gnomAD, a frameshift at codon 234, at the beginning of the GPI signal, turns out to be an individual with dementia diagnosed clinically as Alzheimer disease (see Supplement). This is consistent with the slowly progressive dementia reported for some PRNP late truncating mutations94, although we cannot exclude the possibility that this variant is benign and that the Alzheimer diagnosis is a coincidence. When only codons 1-144 are considered, PRNP is not constrained at all (Table 2). Because the gene is short, the cumulative frequency of LoF variants is still low: ∼1 in 18,000 individuals are heterozygous for PRNP LoF, a frequency that has enabled phenotypic characterization of a small number of individuals (Supplement), although ascertainment of homozygotes will likely only ever be possible in consanguineous individuals.
The above examples illustrate only a few of the types of positional patterns and error modes that may appear upon manual curation. Additional examples have been reported previously95,96, and a companion paper further illustrates the importance of transcript expression-aware annotation24. For anyone considering developing a drug against a target, the types of analyses described above are only a first step. Variants that appear to be true LoF after filtering and curation still occasionally turn out not to disrupt gene function, so RNA and/or protein studies are essential. Once true pLoF variants are identified, recontact efforts can be initiated where consents allow, and even when deep phenotype information is not available, examining the age distribution, study cohorts, and case/control status of pLoF individuals can be highly valuable. For an example of such a deeper analysis of one gene of interest, see our companion paper on pLoF variants in LRRK254.
Suggestions for assessing pLoF variation in potential drug targets
While there are many caveats, and pLoF variants in a gene will never be a perfect model of pharmacological inhibition of that gene’s product, there are now many examples to illustrate that pLoF variants can have enormous predictive value for the phenotypic impact of drugging a target1,2. We therefore expect that many more sequencing studies, functional studies, recontact efforts, and association studies will be undertaken with the intent of characterizing the impact of pLoF variants on genes under consideration as potential drug targets. In view of the above analyses and findings, we suggest guidelines for how such approaches can be undertaken (Box 1).
Box 1. Suggested guidelines for studying pLoF variation in a candidate drug target
Carefully filter and curate pLoF variants. False positive pLoF variants abound, and are particularly enriched among common pLoF variants. Filtering using annotation tools such as LOFTEE18, RNA expression data24, and deep manual curation are critical before interpreting variants or initiating expensive downstream recontact or phenotyping efforts.
Consider the positional distribution of pLoF variants. A non-random distribution of pLoF variants throughout a gene’s coding sequence can reflect sequencing or annotation pitfalls, or can point to disease biology. Interpreting such patterns often requires careful analysis both of error modes and of gene-specific biology including transcript structure and expression.
Calculate cumulative allele frequency. The sum of the frequency of all pLoF variants in a gene will predict how realistic it is to identify a sufficient number of heterozygous and double null individuals for follow-up studies, and can often be informative in itself. Identify any populations with higher pLoF frequencies, as these may be the most fruitful for follow-up studies. If ascertainment of homozygotes is desired, sequencing of populations with higher rates of consanguinity will often be the most realistic route.
Where possible, experimentally validate loss of function. Even after careful filtering and curation, RNA or protein studies will sometimes reveal that a pLoF variant does not in fact disrupt gene function. For high-value target genes, developing high-throughput functional assays and using these to test all candidate pLoF variants will often be worthwhile before embarking on clinical follow-up studies.
Do not eliminate genes from consideration based solely on a lack of pLoF individuals. Some genes, whether because they are short, and thus have few mutations expected a priori, or because they are under intense natural selection, have very few pLoF variants. Many successful approved drugs target such genes. Even when pLoF heterozygotes can be observed, double null individuals should not be expected for most genes at present sample sizes. While pLoF variation is valuable, lack thereof should not preclude a target from consideration.
Above all, we suggest that the study of pLoF variation should be informed by a full view of the biology of the gene, drug, and indication. Nothing about developing a drug is trivial, and that includes applying lessons from human genetics. But given the scale and expense of drug development, it is worth the effort to carefully read out, through human genetics, the valuable data from experiments that nature has already done.
Methods
Data sources
pLoF analyses used the gnomAD dataset of 141,456 individuals18. For data consistency, all genome-wide constraint and CAF analyses (Figures 1-4) used only the 125,748 gnomAD exomes. Curated analyses of individual genes used all 141,456 individuals including 15,708 whole genomes.
Gene lists used in this study were extracted from public data sources between September and December 2018 as shown in Table 3.
Calculation of pLoF constraint
The calculation of constraint values for genes has been described in general elsewhere36,42 and for this dataset specifically by Karczewski et al18. Constraint calculations were limited to single-nucleotide variants (which for pLoF means nonsense and essential splice site mutations) found in gnomAD exomes with minor allele frequency < 0.1% and categorized as high-confidence LoF by LOFTEE. Only unique canonical transcripts for protein-coding genes were considered, yielding 17,604 genes with available constraint values. For curated genes (Table 2), the number of observed variants passing curation was divided by the expected number of variants to yield a curated constraint value. For PRNP, the expected number of variants was adjusted by multiplying by the ratio of the sum of mutation frequencies for all possible pLoF variants in codons 1-144 to the sum of mutation frequencies for all possible pLoF variants in the entire transcript, yielding 6 observed out of 6.06 expected. For MAPT, the expected number of variants was taken from Ensembl transcript ENST00000334239, which includes only the exons identified as constitutively brain-expressed in Figure 5B.
Calculation of pLoF heterozygote and homozygote/compound heterozygote frequencies
Cumulative pLoF allele frequency (CAF) was calculated as reported18. Briefly, LOFTEE-filtered high-confidence pLoF variants with minor allele frequency <5% in 125,748 gnomAD exomes were used to compute the proportion of individuals without a loss-of-function variant (q); the CAF was computed as p = 1-sqrt(q). This approach conservatively assumes that, if an individual has two different pLoF variants, they are in cis to each other and count as only one pLoF allele.
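The relationship p = 1-sqrt(q) can be illustrated as follows. This is a minimal sketch rather than the code used for the paper; the function name and the example count of 100 carriers are hypothetical, and only the formula and the exome sample size come from the text.

```python
import math


def cumulative_allele_frequency(n_without_plof, n_total):
    """Cumulative pLoF allele frequency from the non-carrier proportion.

    q is the proportion of individuals with no pLoF variant. If every
    carrier is counted as carrying one pLoF allele, q = (1 - p)^2 under
    Hardy-Weinberg proportions, hence p = 1 - sqrt(q).
    """
    q = n_without_plof / n_total
    return 1 - math.sqrt(q)


# Hypothetical gene: 100 of the 125,748 exome individuals carry a pLoF variant.
n_total = 125_748
n_carriers = 100
p = cumulative_allele_frequency(n_total - n_carriers, n_total)
print(f"CAF p = {p:.3e}")
```

For rare variants the result is very close to the carrier count divided by twice the sample size, which is why individuals with two different pLoF variants contribute only one allele under this scheme.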
For outbred populations (Figure 4A), we used the value of p from all 125,748 gnomAD exomes, as this allows the largest possible sample size. This includes some individuals from bottlenecked populations, for which the distribution of p does differ from outbred populations, but these individuals are a small proportion of gnomAD exomes (12.6%). This also includes some consanguineous individuals, but these are an even smaller proportion of gnomAD exomes (2.3%), and any difference in the value of p between consanguineous and outbred populations is expected to be very small. Heterozygote frequency was calculated as 2p(1-p) and homozygote and compound heterozygote frequency was calculated as p^2. Lines indicate the size of gnomAD (141,456 individuals) and the world population (6.69 billion).
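As a rough illustration of the outbred calculation, the sketch below converts a cumulative allele frequency into expected genotype frequencies and expected numbers of individuals. The function names and the example value of p are hypothetical; the two population sizes are those stated above.

```python
def genotype_frequencies(p):
    """Return (heterozygote, biallelic) frequencies for allele frequency p.

    Assumes Hardy-Weinberg proportions in an outbred population.
    """
    het = 2 * p * (1 - p)
    hom = p ** 2  # homozygotes plus compound heterozygotes
    return het, hom


def expected_individuals(frequency, population_size):
    """Expected number of individuals with a genotype of given frequency."""
    return frequency * population_size


GNOMAD_SIZE = 141_456   # individuals in gnomAD, from the text
WORLD_SIZE = 6.69e9     # world population figure used in the text

p_example = 0.001       # illustrative value, not taken from any gene
het, hom = genotype_frequencies(p_example)
for label, size in [("gnomAD", GNOMAD_SIZE), ("world", WORLD_SIZE)]:
    print(label,
          f"heterozygotes ~{expected_individuals(het, size):,.1f},",
          f"biallelic ~{expected_individuals(hom, size):,.2f}")
```

With this illustrative p, fewer than one biallelic individual is expected at gnomAD's sample size even though hundreds of heterozygotes are expected.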
For bottlenecked populations (Figure 4B), we used the value of p from the 10,824 Finnish exomes only. Lines indicate the number of Finns in gnomAD (12,526) and the population of Finland (5.5 million).
For consanguineous individuals (Figure 4C), we again used the value of p from all gnomAD exomes, because p is not expected to differ greatly in consanguineous versus outbred populations. We used the mean proportion of the genome in runs of autozygosity (a) from individuals self-reporting second cousin or closer parents in East London Genes & Health, a = 0.05766 (rounded to 5.8%). Heterozygote frequency was calculated as 2p(1-p) and homozygote and compound heterozygote frequency was calculated as (1-a)p^2 + ap. Lines indicate the number of consanguineous South Asian individuals in gnomAD (N=2,912, by coincidence the same number as report second cousin or closer parents in ELGH) based on F > 0.05 (a conservative estimate, since second cousin parents are expected to yield F = 0.015625), and the estimated number of individuals in the world with second cousin or closer parents (10.4% of the world population)103.
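The effect of autozygosity on the expected frequency of biallelic individuals can be sketched as below. This is an illustration under the formula stated above, not the authors' implementation; the function name and the example allele frequency are hypothetical, while a = 0.05766 is the ELGH value from the text.

```python
A_ELGH = 0.05766  # mean autozygous proportion of the genome, from the text


def biallelic_frequency(p, a=0.0):
    """Expected frequency of homozygotes and compound heterozygotes.

    With probability a the locus is autozygous, so a single pLoF allele
    (frequency p) suffices; otherwise the outbred expectation p^2 applies.
    """
    return (1 - a) * p ** 2 + a * p


p_example = 0.001  # illustrative value, not taken from any gene
outbred = biallelic_frequency(p_example)
consanguineous = biallelic_frequency(p_example, A_ELGH)
print(f"outbred: {outbred:.3e}, consanguineous: {consanguineous:.3e}")
print(f"enrichment: about {consanguineous / outbred:.0f}-fold")
```

Because the ap term is linear in p, the relative enrichment grows as p shrinks, which is why consanguineous cohorts matter most for genes with rare pLoF alleles.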
Several caveats apply to our CAF analysis. Our approach naively treats genes with no pLoFs observed as having p=0, even though pLoFs might be discovered at a larger sample size. It also naively treats genes with one pLoF allele observed as having p=1/(2*125748), even though on average singleton variants have a true allele frequency lower than their nominal allele frequency42. We naively group all populations together, even though the distribution of populations sampled in gnomAD does not reflect the world population18; we believe this is reasonable because CAF for many genes is driven by singletons and other ultra-rare variants for which frequency is not expected to differ appreciably by continental population42. It is important to note that the histograms shown in Figure 4 reflect the expected frequency of heterozygotes and homozygotes/compound heterozygotes, based on gnomAD allele frequency, rather than the actual observed frequency of individuals with these genotypes in gnomAD. Finally, the sample size for all gnomAD exomes (Figures 4A and 4C) is larger than for only Finnish exomes (Figure 4B). For a version of Figure 4 with the global gnomAD population downsampled to the same sample size as the gnomAD Finnish population, see Figure S1.
Genetic prevalence estimation
Here, we define “genetic prevalence” for a given gene as the proportion of individuals in the general population at birth who harbor a pathogenic variant in that gene that will cause them to later develop disease. Genetic prevalence has not been well-studied or estimated for most disease genes.
In principle, it should be possible to estimate genetic prevalence simply by examining the allele frequency of reported pathogenic variants in gnomAD. In practice, three considerations usually preclude this approach. First, the present gnomAD sample size of 141,456 exomes and genomes is still too small to permit accurate estimates for very rare diseases. Second, the mean age of gnomAD individuals is ∼55, above the age of onset for many rare genetic diseases, and individuals with known Mendelian disease are deliberately excluded, so pathogenic variants will be depleted in this sample relative to the whole birth population. Third and most importantly, a large fraction of reported pathogenic variants lack strong evidence for pathogenicity and are either benign or low penetrance42,52, so without careful curation of pathogenicity assertions, summing the frequency of reported pathogenic variants in gnomAD will in most cases vastly overestimate the true genetic prevalence of a disease.
Instead, we searched the literature and very roughly estimated genetic prevalence based on available data. In most cases, we took disease incidence (new cases per year per population), multiplied by proportion of cases due to variants in a gene of interest, multiplied by average age at death in cases. In some cases, estimates of at-risk population or direct measures of genetic prevalence were available. Details of the calculations undertaken for each gene are provided in the Supplement.
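The incidence-based estimate described above amounts to a single product. The sketch below shows the general form; the function name and the input values are hypothetical and are not taken from any gene in this study.

```python
def genetic_prevalence(incidence_per_100k_per_year,
                       fraction_due_to_gene,
                       mean_age_at_death):
    """Rough proportion of the birth population carrying a causal variant.

    Multiplies annual incidence by the fraction of cases attributable to
    the gene and by the average age at death in cases, as described in
    the text. This is a crude approximation, not a demographic model.
    """
    incidence = incidence_per_100k_per_year / 100_000
    return incidence * fraction_due_to_gene * mean_age_at_death


# Hypothetical disease: 2 new cases per 100,000 per year, 10% of cases
# due to the gene of interest, average age at death of 50 in cases.
prev = genetic_prevalence(2.0, 0.10, 50)
print(f"genetic prevalence ~{prev:.2%}, or about 1 in {1 / prev:,.0f}")
```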
Data and source code availability
Analyses utilized Python 2.7.10 and R 3.5.1. Data and code sufficient to produce the plots and analyses in this paper are available at https://github.com/ericminikel/drug_target_lof
Supplement
Downsampling of cumulative allele frequency analysis
HTT
We considered several approaches to estimating the genetic prevalence of Huntington’s disease (HD). A reported HD incidence of 0.38 cases per 100,000 per year based on meta-analysis69 multiplied by an average age at death of ∼60 for the most common CAG lengths104 gives a genetic prevalence of 1 in 4,386. One exhaustively ascertained study of HD70 found a prevalence of 13.7 per 100,000 symptomatic plus 81.6 per 100,000 at 25-50% risk. Assuming there are twice as many individuals at 25% risk as at 50% risk, then on average 33.3% of the 81.6, or 27.1 per 100,000 have the mutation. Thus, 13.7 + 27.1 = 40.8 per 100,000 individuals have an HTT CAG expansion, equal to 1 in 2,451. Finally, a genetic screen of a general population sample71 found ≥40 CAG repeat alleles, which are presumed to be fully penetrant, in 3 individuals out of 7,315, for a genetic prevalence of 1 in 2,438.
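The three HD estimates above can be retraced as follows. This is a sketch using only the figures quoted in the text; variable names are ours, and the second estimate differs slightly from the quoted value because intermediate figures are not rounded here.

```python
# Approach 1: incidence multiplied by average age at death.
incidence = 0.38 / 100_000          # cases per person per year
age_at_death = 60
prev_incidence = incidence * age_at_death

# Approach 2: symptomatic individuals plus at-risk individuals weighted
# by their probability of carrying the expansion.
symptomatic_per_100k = 13.7
at_risk_per_100k = 81.6
# Twice as many people at 25% risk as at 50% risk gives a mean risk of 1/3.
mean_risk = (2 * 0.25 + 1 * 0.50) / 3
carriers_per_100k = symptomatic_per_100k + at_risk_per_100k * mean_risk
prev_at_risk = carriers_per_100k / 100_000

# Approach 3: direct genetic screen of a general population sample.
prev_screen = 3 / 7_315

for name, prev in [("incidence", prev_incidence),
                   ("at-risk", prev_at_risk),
                   ("screen", prev_screen)]:
    print(f"{name}: about 1 in {1 / prev:,.0f}")
```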
LRRK2
Based on meta-analysis72, Parkinson’s disease (PD) has an estimated prevalence of 1,903 per 100,000 at age ≥80, meaning the general population’s lifetime risk of PD is ∼1.9%. It is generally stated that about 10% of PD cases are “familial” and the remainder sporadic; in a diverse worldwide case series, LRRK2 mutations were found in 179/14,253 (1.3%) sporadic cases and 201/5,123 (3.9%) familial cases73, implying that LRRK2 mutations are present in ∼1.6% of all PD cases. Thus, LRRK2 mutations account for a 1.6% * 1.9% = ∼0.030% lifetime risk of PD in the general population, or 1 in 3,300.
It is important to consider for a moment how this figure relates to the penetrance of LRRK2 mutations, as LRRK2 variants appear to occupy a spectrum of penetrance105: some variants exhibit Mendelian segregation with disease106,107, implying high risk; the G2019S variant is estimated to have ∼32% penetrance108; and other common variants are risk factors with odds ratios of only ∼1.2 estimated through genome-wide association studies (GWAS)109. The GWAS-implicated common variants were not included in the case series on which our estimate is based73, but G2019S does account for the majority of cases in that series. Because the 0.03% estimate here is based on counting symptomatic cases rather than asymptomatic individuals, it will appropriately underestimate the number of G2019S carriers. In essence, in this calculation each G2019S carrier in the population only counts as 1/3 of a person, because they have only a 1/3 probability of developing a disease. It is therefore appropriate that our estimate of genetic prevalence (0.03%) is actually lower than double the allele frequency of G2019S in gnomAD (0.1%).
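The LRRK2 estimate above is a weighted average over sporadic and familial cases multiplied by lifetime risk. The sketch below retraces it from the raw counts quoted in the text; variable names are ours, and because the text rounds intermediate percentages, the result here is close to but not identical with the quoted 1 in 3,300.

```python
# Lifetime risk of PD, taken as prevalence at age >= 80 (from the text).
lifetime_risk_pd = 1_903 / 100_000

# Share of PD cases that are familial (from the text).
familial_share = 0.10

# Proportion of cases with LRRK2 mutations in the worldwide case series.
lrrk2_in_sporadic = 179 / 14_253
lrrk2_in_familial = 201 / 5_123

# Weighted proportion of all PD cases carrying a LRRK2 mutation.
lrrk2_fraction = ((1 - familial_share) * lrrk2_in_sporadic
                  + familial_share * lrrk2_in_familial)

# Lifetime risk of LRRK2-associated PD in the general population.
prevalence = lrrk2_fraction * lifetime_risk_pd

print(f"LRRK2 share of PD cases: {lrrk2_fraction:.1%}")
print(f"genetic prevalence: {prevalence:.3%}, about 1 in {1 / prevalence:,.0f}")
```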
MAPT
Estimation of the genetic prevalence of MAPT gain-of-function mutations is difficult because pathogenic MAPT mutations can present with a variety of clinical phenotypes, and common MAPT haplotypes are associated with risk for a variety of different neurodegenerative disorders. We were unable to identify any studies of genetic prevalence nor any large case series for any MAPT-associated phenotype. As a crude estimate, we considered that frontotemporal dementia has a reported incidence of 2.7-4.1 per 100,000 per year74 with typical age at death of perhaps 60, and MAPT mutations accounting for 5-20% of familial cases, and familial cases accounting for 40% of all cases75. Multiplying all these figures results in a range of 0.0032% to 0.020%, or 1 in 5,000 – 31,000.
As noted in the main text, our sample size is not sufficient to prove that MAPT loss-of-function is not tolerated. When we restrict to constitutive, brain-expressed exons (Ensembl transcript ENST00000334239), we expect 12.6 pLoF variants and observe 0. The 95% confidence interval on MAPT constraint is thus (0%, 23.7%). The upper bound of 23.7% implies that our data do not rule out a true pLoF obs/exp value of up to 3.0/12.6, or in other words, we cannot rule out that another population sample as large as gnomAD might yield up to 3 genuine pLoF variants.
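One simple way to see where an upper bound of this size comes from is an exact one-sided Poisson bound for zero observed events. This is our assumption for illustration; the interval reported above may have been computed by a different procedure, and the sketch gives a value that agrees with it only to within rounding.

```python
import math


def poisson_upper_bound_zero_observed(alpha=0.05):
    """Largest Poisson mean still compatible with observing zero events.

    Solves exp(-lam) = alpha, i.e. lam = -ln(alpha). With alpha = 0.05
    this is roughly 3 events.
    """
    return -math.log(alpha)


expected_plof = 12.6   # expected pLoF variants in brain-expressed exons (from the text)
max_true_count = poisson_upper_bound_zero_observed()
upper_obs_exp = max_true_count / expected_plof

print(f"upper bound on true count: {max_true_count:.2f}")
print(f"upper bound on obs/exp: {upper_obs_exp:.1%}")
```

The bound of about 3 variants out of 12.6 expected matches the interpretation given in the text.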
PRNP
We have recently considered the lifetime risk of genetic prion disease in detail76. All forms of prion disease (sporadic, genetic, and acquired) appear to be the cause of death of ∼1 in 5,000 people based on either death certificate analysis or division of disease incidence by the overall death rate. ∼10% of cases are attributable to PRNP variants with evidence for Mendelian segregation (although additional cases harbor lower-penetrance variants). Thus, we expect a genetic prevalence of 1 in 50,000. On the order of ∼1 in 100,000 people in gnomAD and 23andMe harbor high-penetrance PRNP variants52,76, although as noted above, we expect these datasets to be depleted compared to the population at birth, because prion disease is rapidly fatal and many individuals in these databases are above the typical age of onset.
Figure 5C displays variants from gnomAD plus the literature, including those previously reported52, and Table S1 shows details for each variant. Allele count for variants from the literature in Figure 5C is the total number of definite or probable cases with sequencing performed in the studies cited in Table S1. The L234Pfs7X variant changes PrP’s C-terminal GPI signal from SMVLFSSPPVILLISFLIFLIVGX to SMVPSPLHLX. This novel sequence does not adhere to the known rules of GPI anchor attachment110: GPI signals must contain a 5-10 polar residue spacer followed by 15-20 hydrophobic residues. Thus, this frameshifted PrP would be predicted to be secreted and thus may be pathogenic, explaining the Alzheimer disease diagnosis in this individual. However, it is also possible that the novel C-terminal sequence found here interferes with prion formation, and/or that this variant is incompletely penetrant, and that the diagnosis of Alzheimer’s disease in this individual is merely a coincidence.
SNCA
As explained above for LRRK2, we assumed a 1.9% lifetime risk of Parkinson’s disease (PD) in the general population, with 10% of cases being familial. SNCA point mutations, duplications, and triplications all appear to be highly penetrant, and in a familial PD case series these accounted for 103/709 = 15% of individuals77. Thus, we estimate that SNCA mutations account for a 1.9% * 10% * 15% = ∼0.029% risk of PD in the general population, or 1 in 3,500.
SOD1
SOD1 mutations are believed to account for ∼12% to 24% of familial ALS78,79 and 1% of sporadic ALS78,118. One meta-analysis found that ∼4.6% of ALS is familial80, although a figure of 10% is also often used119. These figures imply that ∼1.5 – 3.3% of all ALS is attributable to SOD1. The overall incidence of ALS is reported at ∼1.6 – 2.2 per 100,000 per year120,121, so the incidence of SOD1 ALS might be estimated at ∼0.024 – 0.073 per 100,000 per year. Age at death of ∼50 is around average for many SOD1 mutations79, implying a 1.2 – 3.7 per 100,000 population prevalence of pathogenic SOD1 mutations, or a range of 1 in 27,000-83,000.
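The SOD1 range above can be retraced by combining the low ends and the high ends of each input. The pairing of extremes used here is our inference, chosen because it reproduces the quoted 1.5% and 3.3%; the function and variable names are hypothetical.

```python
def sod1_fraction(familial_share, sod1_in_familial, sod1_in_sporadic=0.01):
    """Proportion of all ALS attributable to SOD1.

    Weighted average of the SOD1 share among familial and sporadic cases.
    """
    return (familial_share * sod1_in_familial
            + (1 - familial_share) * sod1_in_sporadic)


AGE_AT_DEATH = 50  # approximate average for many SOD1 mutations (from the text)

# Low end: 4.6% familial, 12% of familial due to SOD1, incidence 1.6 per 100,000.
low_fraction = sod1_fraction(0.046, 0.12)
low_prev = 1.6 / 100_000 * low_fraction * AGE_AT_DEATH

# High end: 10% familial, 24% of familial due to SOD1, incidence 2.2 per 100,000.
high_fraction = sod1_fraction(0.10, 0.24)
high_prev = 2.2 / 100_000 * high_fraction * AGE_AT_DEATH

print(f"SOD1 share of ALS: {low_fraction:.1%} to {high_fraction:.1%}")
print(f"genetic prevalence: 1 in {1 / high_prev:,.0f} to 1 in {1 / low_prev:,.0f}")
```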
We note that frameshift mutations in SOD1 at codons 126 or 127 have been reported to cause a pathogenic gain-of-function leading to ALS122,123. Both of these codons occur in the gene’s fifth and final exon; all of the variants curated as leading to loss-of-function here are in exons 1-4.
Acknowledgments
This study was performed under ethical approval from the Partners Healthcare Institutional Research Board (2013P001339/MGH) and the Broad Institute Office of Research Subjects Protection (ORSP-3862). We thank all of the research participants for contributing their data. EVM is supported by the National Institutes of Health (F31 AI22592) and by an anonymous organization. gnomAD data aggregation was supported primarily by the Broad Institute, gnomAD analysis was funded by NIDDK U54 DK105566, and development of LOFTEE by NIGMS R01 GM104371. ELGH is funded by the Wellcome Trust (102627, 210561), the Medical Research Council (M009017), Higher Education Funding Council for England Catalyst, Barts Charity (845/1796), Health Data Research UK (for London substantive site), and research delivery support from the NHS National Institute for Health Research Clinical Research Network (North Thames). NW is supported by a Rosetrees and Stoneygate Imperial College Research Fellowship. The results published here are in part based upon data: 1) generated by The Cancer Genome Atlas managed by the NCI and NHGRI (accession: phs000178.v10.p8). Information about TCGA can be found at http://cancergenome.nih.gov, 2) generated by the Genotype-Tissue Expression Project (GTEx) managed by the NIH Common Fund and NHGRI (accession: phs000424.v7.p2), 3) generated by the Exome Sequencing Project, managed by NHLBI, 4) generated by the Alzheimer’s Disease Sequencing Project (ADSP), managed by the NIA and NHGRI (accession: phs000572.v7.p4). We thank Jaakko Kaprio and Mitja Kurki (Finnish Twins AD cohort) and Academy of Finland grant 312073, and Ruth McPherson (Ottawa Genomics Heart Study) for providing information on individuals with PRNP truncating variants. We thank Jeffrey B. Carroll, Karl Heilbron, J. Fah Sathirapongsasuti, Daniel Rhodes, and Laurent C. Francioli for comments and suggestions. 
A subset of the analyses reported here originally appeared as a blog post on CureFFI.org (http://www.cureffi.org/2018/09/12/lof-and-drug-safety/).