Abstract
Myosin motors are the fundamental force-generating element of muscle contraction. Variation in the human β-cardiac myosin gene (MYH7) can lead to hypertrophic cardiomyopathy (HCM), a heritable disease characterized by cardiac hypertrophy, heart failure, and sudden cardiac death. How specific myosin variants alter motor function or clinical expression of disease remains incompletely understood. Here, we combine structural models of myosin from multiple stages of its chemomechanical cycle, exome sequencing data from population cohorts of 60,706 and 42,930 individuals, and genetic and phenotypic data from 2,913 HCM patients to elucidate novel structure-function relationships within β-cardiac myosin. We first developed computational models of the human β-cardiac myosin protein before and after the myosin power stroke. Then, using a spatial scan statistic modified to analyze genetic variation in protein three-dimensional space, we found significant enrichment of disease-associated variants in the converter, a kinetic domain that transduces force from the catalytic domain to the lever arm to accomplish the power stroke. Focusing our analysis on surface-exposed residues, we identified another region enriched for disease-associated variants that contains both the converter domain and residues on a single flat surface on the myosin head described as a myosin mesa. This surface is prominent in the pre-stroke model, but substantially reduced in size following the power stroke. Notably, HCM patients with variants in the enriched regions have earlier presentation and worse outcome than those with variants in other regions. In summary, this study provides a model for the combination of protein structure, large-scale genetic sequencing and detailed phenotypic data to reveal insight into time-shifted protein structures and genetic disease.
Myosin motors are molecular machines responsible for converting chemical energy into the mechanical force necessary for cell division, directed cell migration, vesicle trafficking and muscle contraction1. Efforts to understand structure/function relationships within myosin have been ongoing for more than fifty years, and incorporate structural biology, in vitro biophysical and biochemical analyses, and studies in model systems from Dictyostelium to mouse2,3. Variants in myosin genes cause several skeletal and cardiac myopathies4 including hypertrophic cardiomyopathy (HCM), a genetic disease of the heart muscle characterized by an asymmetric thickening of the ventricular walls and a decrease in the ventricular chamber size. Clinically, HCM can be associated with arrhythmia, heart failure or sudden death5. Except in cases with large kindreds, relationships between genotype and disease expression have been challenging to establish due to the absence of large scale genetic population data and lack of multi-center sharing of patient genetic and clinical data6,7. Further, limited understanding of three dimensional protein structural dynamics has prevented the extension of inference from genetic variation beyond the single linear dimension of genomic DNA sequence.
Recently, advances in next-generation sequencing technology have enabled the assembly of large datasets of human genetic variation in both unselected and disease-affected populations. Comparative analysis of these cohorts enables within-gene inference of constraint, a measure of population tolerance to variation that can reveal insight into critical functional residues. The MYH7 gene, encoding the β-cardiac myosin implicated in hypertrophic cardiomyopathy, is highly constrained for missense and loss-of-function variants8,9. Studies of regional tolerance within MYH7 report conflicting results and suffer from small sample sizes or the lack of a control cohort6,10,11. We hypothesized that assessing regional genetic tolerance in the context of time-shifted three-dimensional structures would reveal novel insights into structure-function relationships in MYH7.
The Sarcomeric Human Cardiomyopathy Registry (SHaRe) was established as an international consortium of HCM investigators and currently contains detailed longitudinal clinical data on 2,913 HCM patients who have undergone clinical genetic testing12. The Exome Aggregation Consortium (ExAC) is a publicly available database of exome sequences from 60,706 unselected individuals13. We compared the prevalence of missense variants in MYH7 within these cohorts. We found 193 unique missense variants (in 476 patients with MYH7 variants) in the HCM cohort and 454 unique missense variants in the ExAC database. In both cases, observed missense variants were very rare (Supplementary Figure S1), consistent with previous reports of constraint within the MYH7 gene8,14. Although both disease and population variants are non-uniformly distributed throughout the gene, there is a significant difference in the linear distribution of rare variants between these cohorts (KS p = 5.0 × 10^−11, Supplementary Figure S1). Disease-associated missense variants are concentrated in the catalytic globular domain and the coiled coil S2, consistent with previous results6. Even within these domains, however, distributions of disease and population variants are not the same (KS p = 0.003). These results suggest that the likelihood of MYH7 variants causing disease is in part due to their location within the gene.
Since molecular motors act in three-dimensional space, we sought a method to investigate patterns of genetic tolerance in the folded structure of human β-cardiac myosin protein. We used multi-template homology modeling of other myosin proteins in the pre- and post-stroke states to build three-dimensional models of human β-cardiac myosin containing the human ventricular light chains (Fig. 1 and Methods). These models represent two distinct phases of the actin-activated myosin chemomechanical cycle. Four fundamental regions of the myosin motor domain are included: the actin-binding site (Fig. 1, green residues), the ATP-binding pocket (red), the converter domain (blue), and the light chain binding region or lever arm. In the pre-stroke state, the converter aligns with a relatively flat surface of the myosin head described as the myosin mesa. Based on its size (>20 nm²), flat topology and high degree of evolutionary conservation, this feature has been proposed as an interaction site for intra- or intermolecular binding15. Following the force-producing lever arm stroke of a myosin head, the motor is in its post-stroke state (Fig. 1a) and the mesa falls out of alignment with the converter domain.
To prioritize three-dimensional structural regions of interest, we applied a modified version of the spatial scan statistic16,17 to the pre-stroke and post-stroke models of β-cardiac myosin S1. This statistic searches for spherical regions with an increased proportion of genetic variants in disease compared to control cohorts. In the myosin pre-stroke model, we find a striking increase in the proportion of disease-associated missense variation in a 15 Å-sphere centered on residue 736 (p = 0.001) (Fig. 2a). This region, covering much of the converter domain, contains 17 missense variants observed in disease and no missense variants observed in control data (Fig. 2c). Using the post-stroke model of β-cardiac myosin, we again observed enrichment of disease-associated variants in the converter domain (p < 0.001) centered on residue 733 (Supplementary Figure S2). Enrichment of disease-associated variants in both the pre- and post-stroke states persists when including only variants formally classified as pathogenic or likely pathogenic, and when limited only to individuals of European descent (Supplementary Text 3). During systolic contraction of the heart, the converter domain serves the critical function of transducing force by swinging about 70° from its pre-stroke position (Fig. 1b, lever arm projecting outward). Variants in the converter domain have been shown to alter muscle power output and kinetics18,19 and have been associated with worse outcomes in HCM20–22. Our data provide a complementary line of evidence that variants in the β-cardiac myosin converter domain are poorly tolerated and prone to development of HCM.
To replicate these results, we sought an independent source of disease-associated and population genetic variation. We curated publications from medical centers not yet affiliated with the SHaRe registry (Supplementary Text 1). We compared this set of 231 missense variants with 430 missense variants found in 42,930 exomes from unselected individuals in the Geisinger Health System sequenced by the Regeneron Genetics Center (DiscovEHR cohort). The converter domain regions identified in both the pre-stroke and post-stroke states showed enrichment of disease-associated variants in the replication dataset (pre-stroke p = 0.0019, post-stroke p = 1.7 × 10^−4).
We extended our analysis of three-dimensional protein space to examine surface regions of β-cardiac myosin. To find such domains, we first defined surface-exposed amino acids by their accessibility to a spherical probe with a radius of 2.5 Å (the approximate size of an amino acid side chain) and approximated the surface distance between any two residues23 (see Methods and Supplementary Figure S3 and Fig. 4). The surface of β-cardiac myosin contains 568 of the 765 residues in the S1 domain (74%). Of these, 71 are associated with HCM (72% of all HCM variants) and 79 are found in a reference population (71% of all reference variants), suggesting that variants in both cohorts are evenly distributed between the surface and core of the protein (chi-square p = 0.51). We then applied our spatial scan statistic to the surface of β-cardiac myosin. Using the myosin pre-stroke model, we identified a region of the surface covering 277 of the 568 surface amino acid residues (p = 0.002, Fig. 3a,b), including the converter domain and the mesa, that is highly enriched for disease-associated variation. Strikingly, the region contains 52 of the 71 surface HCM-associated missense variants and only 27 of the 79 surface non-disease-associated missense variants (Fig. 3d), whereas the remainder of the myosin surface (Fig. 3c) covers 291 residues and contains only 19 disease-associated variants compared with 52 non-disease-associated variants (Fig. 3d). The identified converter/mesa region was also enriched in the replication data set (p = 2.5 × 10^−5).
Using the same procedure to search the surface regions of the myosin post-stroke structure, we detected a smaller enriched region of 122 amino acid residues again covering the converter but with a reduced portion of the myosin mesa (Supplementary Figure S5). During the myosin power stroke, the converter moves away from the mesa (compare Fig. 1a and Fig. 1b, Fig. 3 and Supplementary Figure S5), so the enriched converter/mesa region is no longer contiguous and available for intra- or intermolecular interactions in the post-stroke state (Fig. 1b, Fig. 3a,b). Indeed, the enriched amino acid residues in the post-stroke model move significantly more in three-dimensional space between the pre- and post-stroke models (Wilcoxon p < 2 × 10^−16) than other amino acid residues on the surface of the protein. These data suggest that myosin conformational changes during the actin-activated chemomechanical cycle may be important not only for transducing force, but also for modulating the size and shape of this surface region and altering its availability for binding. Indeed, the functional importance of the converter/mesa region is further supported by the presence of the binding site for omecamtiv mecarbil, a recently described small molecule modulator of cardiac myosin currently in clinical trials for the treatment of heart failure24,25.
Next, we tested the myosin S2 fragment for regions enriched for disease-associated variation. The spatial scan analysis revealed that the first half of the S2 fragment is enriched for disease variants (p = 0.003, Supplementary Figure S6). This proximal part of S2 has been shown to bind to the amino-terminal part of myosin binding protein C (MyBP-C)26, the sarcomere protein most frequently mutated in patients with HCM14. The enrichment of disease-associated variants in this region suggests that binding between myosin S2 and MyBP-C (and potentially other partners) is important for development of HCM.
To further investigate the contribution of the genetically-constrained regions to disease, we compared the clinical features of patients with variants in these regions to patients with variants elsewhere in MYH7. The clinical profile of HCM is highly variable, with some patients living a normal lifespan with minimal symptoms and others dying suddenly or requiring cardiac transplantation at a young age27. Similarly, age at presentation of HCM varies widely between patients and earlier onset is correlated with a more severe phenotype28. We find that HCM patients with a variant inside the spherical enriched region are 11.2 years younger at diagnosis (24.9 vs. 36.1 years old, Wilcoxon p = 6.7 × 10^−5) (Fig. 4a-b) than patients harboring other variants in the myosin head. The presence of a variant in the HCM-enriched surface region is associated with a 10.0-year earlier age at diagnosis (age 31.5 vs. age 41.5, Wilcoxon p = 1.6 × 10^−4) than those with other surface variants (Fig. 4c-d). In addition, we find an increased hazard for clinical outcomes in the surface enriched region (HR = 1.918, p = 0.023), though not in the spherical enriched region (Supplementary Figure S7-8, Supplementary Text 2). These findings demonstrate that analysis of genetic constraint in protein space can reveal domains with both increased disease burden and pathogenicity.
Our study demonstrates the power of integrating detailed structural information with large clinical and genetic databases to identify regions associated with functional importance and disease severity in Mendelian diseases. We find that variants associated with HCM are enriched in the β-cardiac myosin converter domain, where they lead to more severe outcomes. We provide the first evidence that similar clustering and pathogenicity are present in a surface spanning the converter domain and the recently-described mesa. Because amino acid residues forming the mesa come from disparate locations in the nucleotide sequence, discovery of this region depends on integration of protein structural information. The pronounced shift of the converter/mesa surface during the power stroke raises the mechanistic hypothesis that these variants exert their deleterious effect selectively in the pre-stroke state, perhaps by disrupting dynamic binding interactions. In summary, these findings highlight the importance of considering data from human genetics in the context of the dynamic, 3-dimensional protein structure, and illustrate a new approach to structure-function analysis in genetic diseases.
Methods
SHaRe Database
The Sarcomeric Human Cardiomyopathy Registry (SHaRe) is a multicenter database that pools de-identified patient-level data from established institutional datasets at participating sites. At the time of analysis, the registry contained clinical and genetic testing data on 2,913 patients with HCM. Over 1,000 of these patients have pathogenic variants in MYH7 and MYBPC3. This database contains individuals from 9 inherited disease centers throughout the world, including Brigham and Women’s Hospital, Children’s Hospital Boston, Erasmus Medical Center, Careggi University in Florence, Stanford Center for Inherited Cardiovascular Disease, University College London, University of Michigan, the Laboratory of Genetics and Molecular Cardiology in São Paulo, and Akureyri Hospital Iceland. The database includes demographic data, medical history, echocardiogram data, genetic testing results, and many other data relating to cardiac health and clinical outcomes.
ExAC
The Exome Aggregation Consortium13 released data from 60,706 exomes from multiple sequenced cohorts that are not enriched for rare diseases such as HCM. We downloaded ExAC data for the canonical MYH7 transcript ENST00000355349 on August 27, 2015.
Variant Filtering and Inclusion Criteria
Variants in MYH7 from the SHaRe database were filtered for quality purposes. Only exonic variants were included in the analysis. We included all exonic missense variants seen in HCM patients in clinical genetic testing. For comparison, we downloaded data from the MYH7 gene from the Exome Aggregation Consortium (ExAC) on August 27, 2015. We excluded variants with a population specific frequency of greater than 1/2000 in the ExAC data, as these common variants are unlikely to be causal for HCM based upon the population prevalence of the disease. For the spatial scan and clinical outcomes analysis, we analyzed only missense variants.
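The frequency filter described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Variant` record, its field names, and the example population labels and frequencies are all hypothetical. Only the 1/2000 threshold and the restriction to missense variants come from the text.

```python
from dataclasses import dataclass

# Threshold from the text: exclude population-specific frequency > 1/2000.
MAX_POP_FREQ = 1 / 2000

@dataclass
class Variant:
    name: str          # hypothetical label
    consequence: str   # e.g. "missense" or "synonymous"
    pop_freqs: dict    # hypothetical: population label -> allele frequency

def passes_filter(v: Variant) -> bool:
    """Keep missense variants whose highest population-specific
    allele frequency does not exceed 1/2000."""
    if v.consequence != "missense":
        return False
    return max(v.pop_freqs.values(), default=0.0) <= MAX_POP_FREQ

# Made-up records for illustration only.
variants = [
    Variant("var_rare", "missense", {"popA": 0.0001, "popB": 0.0}),
    Variant("var_common", "missense", {"popA": 0.0001, "popB": 0.002}),
    Variant("var_syn", "synonymous", {"popA": 0.0}),
]
kept = [v.name for v in variants if passes_filter(v)]
print(kept)
```

Taking the maximum across populations means a variant is dropped as soon as it is common in any single population, which is one reading of "population specific frequency".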
For validation purposes, we also performed a subset of analyses including variants expertly classified as “pathogenic” or “likely pathogenic” while excluding variants of unknown significance. This is to ensure that variants of unknown significance (VUS) are not driving the results of the analyses. In addition, we also performed a validation using variants found in individuals of European ancestry from ExAC and individuals with a reported race of white, to ensure that global population structure was not confounding our analysis (Supplementary Text 3).
We generated an independent validation data set combining previously published analyses of HCM variants from other medical centers with 42,930 exomes from the DiscovEHR sequencing project involving the Regeneron Genetics Center and Geisinger Health System. Once again, we included only missense variants and removed variants with an allele frequency greater than 1 in 2000 in the DiscovEHR exomes (Supplementary Text 1).
Development of human β-cardiac myosin protein models
We developed human β-cardiac myosin S1 models based on human motor domain structural data to best represent the human form of the cardiac myosin. We retrieved the protein sequence of human β-cardiac myosin and the human cardiac light chains from the UniProt database29: myosin heavy chain motor domain (MYH7) - P12883, myosin essential light chain (MLC1) - P08590, and myosin regulatory light chain (MLC2) - P10916. We used a multi-template homology modeling approach to build the structural coordinates of MYH7 (residues 1-840), MLC1 (residues 1-195), MLC2 (residues 1-166) and S2 (residues 841-1280). We obtained the three-dimensional structural model of S1 in the pre- and post-stroke states by integrating the known structural data from solved crystal structures, as described below.
Homology modeling of the pre-stroke structure was performed with template structures of the smooth muscle myosin motor domain30 (PDB id: 1BR1) and the scallop smooth muscle myosin light chain domain31 (PDB id: 3TS5). The templates used for the modeling of the post-stroke structure were obtained from the human β-cardiac myosin motor domain32 (PDB id: 4P7H) and the rigor structure from the squid myosin motor domain33 (PDB id: 3I5G). Missing regions in the myosin motor domain (loop1, loop2) were each built separately using the ModLoop program34, and regions in the regulatory light chains that were not solved in the crystal structures were independently modelled using the I-TASSER prediction35 method; these were then used as individual templates. Sequence alignments of MYH7, MLC1, and MLC2 with their respective structural templates were obtained. The models of pre- and post-stroke structures were built using the MODELLER package36. We used a multi-template modeling method: 100 models were obtained and the best model was selected based on the DOPE score. The structural models were energy minimized using SYBYL7.2 (Tripos Inc.) to remove potential short contacts. The final three-dimensional models of the pre- and post-stroke structures were validated using RAMPAGE37, which provides a detailed check on the stereochemistry of the protein structure using the Ramachandran map.
The S2 region is a long coiled-coil structure; hence we used the template from the Myosinome database38. Modeling was done using the MODELLER package; selection of the best model and validation of the model were done as described above. Visualizations were performed using PyMOL version 1.7.4 (http://www.pymol.org).
Statistical Methods
Comparisons between the ExAC and SHaRe variant locations in MYH7 were performed using the Kolmogorov-Smirnov test statistic. All statistical analyses were performed in R version 3.1.2 39 and many graphs were prepared using ggplot240.
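As an illustration of the comparison above, the two-sample Kolmogorov-Smirnov statistic is the largest gap between the empirical cumulative distributions of variant positions in the two cohorts. The sketch below computes only that statistic, in plain Python rather than R; the function name and the toy residue positions are assumptions for illustration, and the p-value calculation performed by a full KS test is omitted.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: maximum absolute difference between
    the empirical CDFs of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # The ECDFs only change at observed values, so checking those suffices.
    for x in sorted(set(a) | set(b)):
        fa = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        fb = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        d = max(d, abs(fa - fb))
    return d

# Toy residue positions (hypothetical, not taken from SHaRe or ExAC).
disease_positions = [100, 200, 300, 400]
population_positions = [300, 400, 500, 600]
print(ks_statistic(disease_positions, population_positions))
```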
Spatial Scan Statistic
For the spatial scan analysis, we compared the locations of unique variants observed in HCM patients in SHaRe with the locations of variants observed in ExAC. The Spatial Scan Statistic exhaustively searches three-dimensional windows of a predefined set of sizes and shapes throughout the human β-cardiac myosin molecule for regions with an increased proportion of HCM-associated (SHaRe) variants. Let $p_w$ be the proportion of variation within a window that is HCM-associated, and let $q_w$ be the proportion of variation outside the window that is HCM-associated. For each window, we calculate the binomial likelihood ratio statistic comparing the null model where $p_w$ and $q_w$ both equal the overall rate $r$ against the model where $p_w$ is not equal to $q_w$. The likelihood ratio statistic for each window is as follows:

$$\Lambda_w = \frac{p_w^{\,y_w}\,(1-p_w)^{\,z_w}\;q_w^{\,y_g}\,(1-q_w)^{\,z_g}}{r^{\,y_w+y_g}\,(1-r)^{\,z_w+z_g}}$$

where $y_w$ is the number of HCM-associated variants within the window, $z_w$ is the number of other variants in the window, $y_g$ is the number of HCM-associated variants outside the window, $z_g$ is the number of other variants outside the window, and $r$ is the overall proportion of HCM-associated variants, so that $p_w = y_w/(y_w+z_w)$, $q_w = y_g/(y_g+z_g)$ and $r = (y_w+y_g)/(y_w+z_w+y_g+z_g)$. The test statistic is the maximum of the observed likelihood ratio statistics for all windows. Significance is assessed through permutation analysis by permuting the variant labels.
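The procedure in this paragraph can be sketched in a few lines of Python. This is an illustrative reading, not the authors' R implementation: the function names and the toy windows are hypothetical, the statistic is computed on the log scale for numerical stability, and scoring only windows with a higher HCM proportion inside than outside is an assumption based on the stated goal of finding regions with an increased proportion of HCM-associated variants.

```python
import math
import random

def _xlogy(x, y):
    # x * log(y), using the convention 0 * log(0) = 0
    return 0.0 if x == 0 else x * math.log(y)

def log_lr(y_w, z_w, y_g, z_g):
    """Log binomial likelihood ratio for one window (counts as in the text)."""
    n_w, n_g = y_w + z_w, y_g + z_g
    if n_w == 0 or n_g == 0:
        return 0.0
    p_w, q_w = y_w / n_w, y_g / n_g
    r = (y_w + y_g) / (n_w + n_g)
    if p_w <= q_w:  # assumption: score only windows enriched for HCM variants
        return 0.0
    alt = (_xlogy(y_w, p_w) + _xlogy(z_w, 1 - p_w)
           + _xlogy(y_g, q_w) + _xlogy(z_g, 1 - q_w))
    null = _xlogy(y_w + y_g, r) + _xlogy(z_w + z_g, 1 - r)
    return alt - null

def max_log_lr(labels, windows):
    """labels[i] is 1 for an HCM-associated variant and 0 otherwise;
    windows is a list of sets of variant indices."""
    total_y = sum(labels)
    total_z = len(labels) - total_y
    best = 0.0
    for w in windows:
        y_w = sum(labels[i] for i in w)
        z_w = len(w) - y_w
        best = max(best, log_lr(y_w, z_w, total_y - y_w, total_z - z_w))
    return best

def permutation_p(labels, windows, n_perm=1000, seed=0):
    """Permute variant labels and compare the maximum statistic each time."""
    rng = random.Random(seed)
    observed = max_log_lr(labels, windows)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if max_log_lr(shuffled, windows) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Toy data (hypothetical): 12 variants and two candidate windows.
toy_labels = [1] * 6 + [0] * 6
toy_windows = [set(range(6)), set(range(3, 9))]
observed, p_value = permutation_p(toy_labels, toy_windows)
print(observed, p_value)
```

Because the maximum is taken over all windows within each permutation, the resulting p-value accounts for the many overlapping windows tested. As a usage example, `log_lr(52, 27, 19, 52)` scores the pre-stroke surface region using the variant counts reported in the Results.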
For the first analysis, we used spherical windows based upon the three-dimensional locations of amino acids in the human β-cardiac myosin molecule. The myosin S1 models include residues 1 to 841. We excluded the disordered loop regions between residues 205 and 211 and residues 627 and 640, as the positions of these amino acid residues are not well defined. We compared the observed missense variants in human β-cardiac myosin in SHaRe and in ExAC. We generated lists of unique missense variants from the SHaRe and ExAC datasets. A variant was considered HCM-associated if it was observed in a SHaRe patient diagnosed with HCM and not HCM associated if the variant was only observed in the ExAC data. Variants were assigned to three-dimensional locations in the human β-cardiac myosin model based upon the position of the alpha carbon atom of their corresponding amino acid residue. For each amino acid residue in the myosin S1, we tested spherical windows with radii of 10, 12.5, 15, 17.5, 20, 22.5, and 25 Angstroms centered on the alpha carbon for enrichment of HCM associated variants. The maximum test statistic for the entire set of windows in the model was calculated and significance was assessed through permutation of variant labels using 1000 permutations. For validation, we performed the analysis above removing all missense variants classified as Variants of Unknown Significance (Supplementary Text 3). We also performed the analysis above using only missense variants observed in the European population in ExAC or in European ancestry HCM cases.
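A minimal sketch of how the spherical windows could be enumerated from alpha-carbon coordinates is given below. The radii and the excluded loop residues come from the text. The function names, the treatment of the loop boundaries as inclusive, and the toy coordinates are assumptions for illustration.

```python
import math

RADII = (10, 12.5, 15, 17.5, 20, 22.5, 25)  # Angstroms, as listed in the text
# Disordered loops excluded in the text; inclusive bounds are an assumption.
EXCLUDED = set(range(205, 212)) | set(range(627, 641))

def spherical_windows(ca_coords, radii=RADII):
    """Yield (center, radius, members) for every residue and radius.

    ca_coords maps residue number -> (x, y, z) of its alpha carbon.
    """
    usable = {r: xyz for r, xyz in ca_coords.items() if r not in EXCLUDED}
    for center, c in usable.items():
        for radius in radii:
            members = {r for r, xyz in usable.items()
                       if math.dist(c, xyz) <= radius}
            yield center, radius, members

def window_counts(members, variants):
    """Return (y_w, z_w, y_g, z_g) for one window.

    variants is a list of (residue, is_hcm) pairs, one per unique variant;
    variants at excluded residues should be removed beforehand.
    """
    y_w = sum(1 for res, hcm in variants if res in members and hcm)
    z_w = sum(1 for res, hcm in variants if res in members and not hcm)
    y_g = sum(1 for res, hcm in variants if res not in members and hcm)
    z_g = sum(1 for res, hcm in variants if res not in members and not hcm)
    return y_w, z_w, y_g, z_g

# Toy coordinates and variants (hypothetical, not from the myosin models).
toy_ca = {1: (0, 0, 0), 2: (8, 0, 0), 3: (16, 0, 0), 4: (24, 0, 0), 5: (60, 0, 0)}
toy_variants = [(1, True), (2, True), (3, False), (5, False)]
windows = {(c, r): m for c, r, m in spherical_windows(toy_ca)}
print(windows[(2, 10)], window_counts(windows[(2, 10)], toy_variants))
```

Each tuple returned by `window_counts` supplies the four counts that enter the likelihood ratio statistic for that window.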
Surface Analysis
In addition to the spherical windows defined above, we define windows based upon the exposed surface of the human β-cardiac myosin molecule. For this analysis, we estimate the surface distance between any two amino acids. We developed the following procedure to perform this estimation. First, we calculated the solvent excluded surface for the human β-cardiac myosin models using the MSMS program23 with a 2.5 Å radius sphere as a probe, which approximates the size of amino acid side chain interactions. For the calculated surface, the MSMS program returns a large net of vertices of the surface, each connected to multiple other points on the surface. We use these vertices and connections to build a weighted graph of the surface of the molecule, with the vertices as nodes and the edges as connections weighted by the Euclidean distance between the two connected vertices. Then, for each amino acid, we assign it to the point on the surface that is the closest point, on average, to all of the non-hydrogen atoms of the amino acid. This average distance is used as an estimate of the depth of the amino acid. Amino acids with average depths of greater than 4 Angstroms for the surface model were considered to be not on the surface. We used the A-star algorithm to calculate the distance on the surface graph between two amino acid residues. In this analysis we included only amino acids in the ‘head’ of the human β-cardiac myosin molecule, defined as amino acids 1 to 784. The lever arm was excluded. We excluded the disordered loop regions between residues 205 and 211 and residues 627 and 640.
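The surface-distance estimate described above can be illustrated with a small weighted graph. The sketch below does not call MSMS; it assumes the vertices and their connections have already been extracted into plain Python lists, and all function names and the toy mesh are hypothetical. It uses A* with straight-line distance as the heuristic, which never overestimates the remaining path because every edge weight is itself a Euclidean distance.

```python
import heapq
import math

def build_graph(vertices, edges):
    """vertices: list of (x, y, z); edges: iterable of (i, j) index pairs.
    Returns adjacency lists weighted by Euclidean edge length."""
    graph = {i: [] for i in range(len(vertices))}
    for i, j in edges:
        w = math.dist(vertices[i], vertices[j])
        graph[i].append((j, w))
        graph[j].append((i, w))
    return graph

def astar_distance(vertices, graph, start, goal):
    """Shortest path length along the surface graph from start to goal."""
    def h(n):
        return math.dist(vertices[n], vertices[goal])
    best = {start: 0.0}
    heap = [(h(start), start)]
    while heap:
        f, node = heapq.heappop(heap)
        if node == goal:
            return best[node]
        if f > best[node] + h(node) + 1e-12:
            continue  # stale queue entry; a shorter route was found already
        for nxt, w in graph[node]:
            g = best[node] + w
            if g < best.get(nxt, math.inf):
                best[nxt] = g
                heapq.heappush(heap, (g + h(nxt), nxt))
    return math.inf  # goal not reachable from start

def assign_to_surface(heavy_atoms, vertices):
    """Return (vertex index, depth): the vertex with the smallest mean
    distance to the residue's non-hydrogen atoms, and that mean distance."""
    def mean_dist(v):
        return sum(math.dist(v, a) for a in heavy_atoms) / len(heavy_atoms)
    idx = min(range(len(vertices)), key=lambda i: mean_dist(vertices[i]))
    return idx, mean_dist(vertices[idx])

# Toy mesh (hypothetical): a ridge between vertices 0 and 2, a longer detour
# through vertex 3, and an isolated vertex 4.
toy_vertices = [(0, 0, 0), (1, 0, 1), (2, 0, 0), (1, 3, 0), (9, 9, 9)]
toy_graph = build_graph(toy_vertices, [(0, 1), (1, 2), (0, 3), (3, 2)])
surface_d = astar_distance(toy_vertices, toy_graph, 0, 2)
vertex, depth = assign_to_surface([(0, 0, 0.5)], toy_vertices)
print(surface_d, vertex, depth)
```

In the toy mesh the surface path over the ridge is longer than the straight-line distance between its endpoints, which is the distinction the surface analysis relies on. Under the 4 Å cutoff in the text, a residue whose `depth` exceeds 4 would be treated as buried.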
Based upon this set of pairwise calculated distances, we define planar surface regions of the MYH7 molecule for analysis as all the amino acids within a certain distance of any given ‘center’ amino acid. We exclude non-surface amino acids, based upon their estimated depth. We once again perform the spatial scan statistic as described above to identify surface regions of increased genetic burden.
S2 Fragment Analysis
We performed the spherical spatial scan statistic as described above to test for enrichment in the S2 fragment of myosin. We used spherical window sizes of 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100 Angstroms centered on the alpha carbon of each amino acid residue in the S2 model. For this analysis, we compared the region between amino acid residue 838 and amino acid residue 1112. The analysis was truncated at amino acid residue 1112 due to low sequencing coverage following residue 1112 in the ExAC data.
Clinical phenotype analysis
We performed outcomes analysis using statistical methods implemented in R version 3.1.2. For the analysis of age at diagnosis, only known probands were included. We compared the primary diagnosis ages using a Wilcoxon test. For the age at event analyses (Supplementary Text 2), the composite outcomes were defined as follows: The arrhythmic outcome consisted of cardiac arrest, ICD firing, and sudden cardiac death. The heart failure outcome combined the events of end-stage HCM (defined as the left ventricular ejection fraction falling below 55%), New York Heart Association Class III or IV status, transplant operation, and left ventricular assist device implantation. The overall composite outcome combined the arrhythmic and heart failure outcomes and also included atrial fibrillation, stroke, and death (all causes). Individuals were considered to enter the study at their diagnosis age and were censored at their last known age. We compared hazard ratios for each region using the Cox proportional hazards model adjusting for gender.
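The comparison of diagnosis ages can be illustrated with a two-sample Wilcoxon rank-sum test. The sketch below is a plain-Python stand-in for the R routine, under stated assumptions: it uses the normal approximation with a tie correction and no continuity correction, so its p-values may differ slightly from those of an exact test, and the ages are invented for illustration. The Cox proportional hazards model is not covered here.

```python
import math

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum (Mann-Whitney) test.

    Returns (U statistic for x, p-value from the normal approximation).
    """
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                  # size of this group of tied values
        midrank = (i + 1 + j) / 2  # average of ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t ** 3 - t
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0              # all values tied; no evidence of a shift
    z = (u - mean_u) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2))

# Invented diagnosis ages, for illustration only.
ages_in_region = [18, 22, 25, 30]
ages_elsewhere = [28, 35, 40, 44, 51]
u_stat, p_value = rank_sum_test(ages_in_region, ages_elsewhere)
print(u_stat, p_value)
```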
Competing interests
J.A.S. is a founder of and owns shares in Cytokinetics, Inc. and MyoKardia, Inc., biotech companies that are developing therapeutics that target the sarcomere. E.M.G. is an employee of and owns shares in MyoKardia, Inc. E.A.A. is a founder of Personalis, Inc. C.D.B. is on the Scientific Advisory Boards of http://Ancestry.com, Personalis, Liberty Biosecurity, and Etalon DX. C.D.B. is also a founder and chair of the SAB of IdentifyGenomics.
Acknowledgements
The authors would like to thank Jonathan Fox for guidance during the early stages of the SHaRe registry and Aleks Pavlovic for support in obtaining and curating the clinical data.