Abstract
Motivation: Predict whether a mutation is deleterious based on the custom 3D model of a protein.
Methods: We have developed MODICT, a mutation prediction tool based on per-residue RMSD (root mean square deviation) values of superimposed 3D protein models. Our mathematical algorithm was tested on 42 described mutations in multiple genes, including renin, beta-tubulin, biotinidase, sphingomyelin phosphodiesterase-1, phenylalanine hydroxylase and medium chain acyl-CoA dehydrogenase. Moreover, MODICT scores corresponded to experimentally verified residual enzyme activities in mutated biotinidase, phenylalanine hydroxylase and medium chain acyl-CoA dehydrogenase. Several commercially available prediction algorithms were tested and the results were compared. The MODICT Perl package and the manual can be downloaded from https://github.com/MODICT/MODICT.
Conclusion: We show here that MODICT is a capable tool for mutation effect prediction at the protein level, using superimposed 3D protein models instead of the sequence-based algorithms used by PolyPhen and SIFT.
1 Introduction
1.1 State of the art
As next generation sequencing (NGS) advances the field of molecular biology, more human protein variants are being identified than ever before. One of the greatest challenges in this field is predicting whether the detected variants are real disease-causing changes underlying the patient's condition.
The current concept of mutation effect prediction depends heavily on composite algorithms that mainly implement a sequence-based BLAST search to identify a number of similar protein sequences above a preset threshold, and then relate and combine several other parameters such as PSIC (Position-Specific Independent Counts), known three-dimensional (3D) structures of similar proteins, surface area, B-factor and atomic contacts. Some available algorithms (e.g. PolyPhen-2, http://genetics.bwh.harvard.edu/pph2/, [1]) use all of the above, whereas others use either a portion or a more diverse set of parameters (e.g. SIFT (http://sift.jcvi.org/, [2]), MutationTaster (http://www.mutationtaster.org/, [3]), PROVEAN (http://provean.jcvi.org/index, [4])). Nonetheless, because these algorithms take into account non-mutually exclusive (non-orthogonal) features, the method to correctly combine the results into a conclusive output remains ambiguous. One recently described method uses weighted means obtained from the false positive and false negative rates of each distinct algorithm to approach a consensus score (Condel: http://bg.upf.edu/condel/home [5]). Even after utilizing cancer-trained methods, such integration of scores was not able to correctly classify all variants [6].
1.2 Hypothesis and problem definition
A high percentage of genomic variants in protein-coding genes were shown to modify the tertiary structure of the encoded protein. These structural modifications can be predicted by comparing the 3D structures of the wildtype and mutant protein (.pdb files). The 3D structures are generated by commercial or academic-only servers and software (I-TASSER, http://zhanglab.ccmb.med.umich.edu/I-TASSER/ [7, 8], SWISS-MODEL http://swissmodel.expasy.org/ [9], MODELLER http://salilab.org/modeller/ [10], YASARA http://www.yasara.org/) from raw amino acid sequences supplied in FASTA format. The generated results have to be interpreted carefully to find the structural changes in the mutant protein. However, such interpretation and analysis of the molecular dynamics is not straightforward.
We have derived a simple algorithm called MODICT to predict the effect of mutations on the structure of the protein. It is complementary to the protein modeling tools mentioned above, as it requires the 3D protein structures predicted by these tools. The algorithm takes into account the global structural changes in the 3D protein model. These structural changes are measured in terms of the change in root mean square deviation (ΔRMSD) and the corresponding residue number in the protein sequence.
2 Methods
2.1 Algorithm
Let A_i denote the RMSD value of a given amino acid at the i-th position resulting from comparison of two models in a Cartesian space defined by V(i, A_i). Assuming the entire length of a protein with N residues is 1 unit, then the unit area of the rectangle enclosed by two consecutive amino acids can be approximated by:
If a given domain is enclosed by the i-th and j-th amino acid residues, then the area spanned by the domain can be expressed as: where W_i and C_i denote optional weight and conservation scores respectively, which are usually provided by the training and iteration modules (users can assign them as well). Of course, the aforementioned area does not solely result from the mutation. An error value can be expressed in terms of the overall RMSD (generated by SWISS-MODEL):
A total area can be defined from equations 2 and 3 (AD = area domain, AE = area error): The above formula is a generalization for multiple domains. In case there is only one domain between residues i and j, then the total area is simply AD_{i,j} + AE_{i,j}. A raw score (Γ) can be expressed in terms of:
It is noteworthy that for a given interval, AD and AE are not guaranteed to be equal, even if the regions taken into consideration span the entire protein. While AD is obtained from the per-residue RMSD, AE is obtained from the overall RMSD. AD/TOTAL and AE/TOTAL should be considered as 2 orthogonal vectors. MODICT is designed to work with specific protein domains where i and j designate the start and end of a domain. For MODICT to perform optimally, it is important that the domains which are most critical for the functionality of the protein are chosen. These can be taken from literature findings or can be predicted by the iteration script which is included in the software package (see section 2.3).
The difference (δ) between equations 2 and 3 is important to discern background signal from actual effect:
The significance (γ) of the difference depends on the length of the domain and the standard deviation of the individual RMSD values: where Z_x denotes the Z score of the (100 · x)th percentile and σ denotes the standard deviation. Assuming that the RMSD values follow a Gaussian distribution, the Z-score derived significance score gives an idea of how many of the domain residues account for the large RMSD values. From equations 6 and 7, a coefficient of significance (κ) can be defined:
In equation 8 above, Σδ or Σγ denotes the total sum of δ or γ over all specified domain intervals, such as δ_{i,j} + δ_{m,n} + δ_{u,w} + …. Equations 5 and 8 can be combined to express a final score:
The score can be evaluated via 2 different approaches, as outlined in sections 2.2 and S1.2. In a fraction of cases, comparison of MODICT scores requires calculating thresholds, and these thresholds are calculated via a K parameter. Beware that this is not the same coefficient as κ in equation 8. This parameter is a measure of the highest p-value attainable with a given accuracy. The K parameter is calculated from the known mutations listed in table S1. For more information on the usage of this parameter, refer to section S1.2.
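The Z score Z_x that enters the significance term can be obtained from the inverse cumulative distribution function of the standard normal distribution. The following is a minimal Python sketch and not part of the MODICT Perl package: the function name z_score is hypothetical, and the sketch only illustrates the Gaussian assumption stated above, not the significance formula itself.

```python
from statistics import NormalDist


def z_score(x: float) -> float:
    """Return Z_x, the standard-normal Z score of the (100 * x)th percentile.

    Hypothetical helper for illustration; x must lie strictly between 0 and 1.
    """
    if not 0.0 < x < 1.0:
        raise ValueError("x must lie strictly between 0 and 1")
    # inv_cdf is the quantile function of the standard normal distribution.
    return NormalDist().inv_cdf(x)
```

For instance, z_score(0.975) is approximately 1.96 and z_score(0.5) is 0.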
2.2 MODICT methodology
The algorithm of MODICT is based on RMSD values of superimposed wildtype and mutant proteins. For calculating RMSD values, a 3D protein model is required of both the wildtype and the mutant case, which is calculated by using the I-TASSER and Phyre2 servers. After construction of the 3D models, the generated .pdb files are used as input for a script included in MODICT, which extracts the necessary RMSD values. For the purpose of testing MODICT, amino acid sequences of wildtype and mutant renin, Tubb2b, Btd and Smpd1 proteins (UniProt ID: P00797, Q9BVA1, P43251, P17405) were submitted to the automated I-TASSER and Phyre2 servers. PAH and ACADM (tables 1, 2) were submitted to the automated Phyre2 server. For further details on specific settings, see section S1.1. MODICT can be supplied with optional weight (min: 0, default: 10) and conservation (min: 0, max: 11, default: 1) scores, which are both array vectors (a single number per line in a text file). Multiplying all entries of the weight and conservation file by a constant does not change the result. Both files are optional and not mandatory for MODICT to work. However, they can be used to give higher priority to certain regions. The default setup assigns 1 to both conservation and weight scores.
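The one-number-per-line layout of the weight and conservation files can be read as sketched below. This is a minimal Python illustration under stated assumptions, not the parser shipped with MODICT: the function name parse_scores, its parameters, and the length and range checks are hypothetical. The default value is left as a parameter rather than hard-coded.

```python
from typing import Iterable, List, Optional


def parse_scores(lines: Optional[Iterable[str]], n_residues: int,
                 default: float, minimum: float = 0.0,
                 maximum: Optional[float] = None) -> List[float]:
    """Parse one score per line; use a constant vector when no file is given.

    Hypothetical helper: the checks below are illustrative assumptions.
    """
    if lines is None:
        # No file supplied: every residue receives the same default score.
        return [default] * n_residues
    # Blank lines are skipped; every other line must hold a single number.
    scores = [float(line) for line in lines if line.strip()]
    if len(scores) != n_residues:
        raise ValueError("expected one score per residue")
    for value in scores:
        if value < minimum or (maximum is not None and value > maximum):
            raise ValueError("score %r is out of range" % value)
    return scores
```

A file handle can be passed directly, for example `with open("weights.txt") as fh: weights = parse_scores(fh, 406, default=10.0)`, where the file name and residue count are placeholders. Conservation scores would be parsed with `maximum=11.0`.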
Conservation scores are generated by aligning reviewed sequences of the protein of interest from different species in UniProt (http://www.uniprot.org/). The result is a simple text file with one conservation score per line, generated using the Jalview utility.
MODICT requires a user-generated per-residue RMSD file as well. We have developed a script which can be supplied to Swiss-PDB. This script extracts the RMSD values from superimposed WT (wildtype) and MT (mutated) .pdb files and writes them to a file.
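A per-residue deviation profile of this kind can be sketched in Python as follows. The sketch assumes that the two models have already been superimposed, that only alpha carbons are compared (so the per-residue value reduces to the distance between matching CA atoms), and that residue numbering is identical in both files. It is not the Swiss-PDB script distributed with the package, and the function names are hypothetical.

```python
import math
from typing import Dict, Iterable, List, Tuple

Coord = Tuple[float, float, float]


def ca_coordinates(pdb_lines: Iterable[str]) -> Dict[int, Coord]:
    """Collect alpha-carbon coordinates keyed by residue number.

    Uses the fixed-column layout of PDB ATOM records; only the first CA
    seen for each residue number is kept (single-chain assumption).
    """
    coords: Dict[int, Coord] = {}
    for line in pdb_lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            residue = int(line[22:26])
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            coords.setdefault(residue, xyz)
    return coords


def per_residue_deviation(wt_lines: Iterable[str],
                          mt_lines: Iterable[str]) -> List[Tuple[int, float]]:
    """Return (residue number, CA-CA distance) for residues in both models."""
    wt = ca_coordinates(wt_lines)
    mt = ca_coordinates(mt_lines)
    shared = sorted(set(wt) & set(mt))
    return [(res, math.dist(wt[res], mt[res])) for res in shared]
```

Writing the second element of each tuple to a text file, one value per line, gives a profile in the same one-number-per-line spirit as the other inputs; whether this matches the exact format expected by the package should be checked against the manual.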
MODICT score interpretation makes use of a negative and a positive control. As negative control, a superimposition between the wildtype protein and a refined model of the same wildtype protein is used (in some cases, a known benign mutation can also be used instead of the refined wildtype, see sections 2.4 and S1.2). For the positive control, a superimposition between the wildtype protein and a known pathogenic variant can be used. The scores for the negative and positive control can then be used as a scale for the MODICT result of the protein variant of interest. A more mathematical approach to MODICT score interpretation is given in sections S1.2, 3.2, S1.3 and figure 7.
2.3 Training and Iteration
As will be described throughout section 3, MODICT is designed to work with distinct domains which are critical for protein functionality. Often, however, this information is not readily available. To meet this need, MODICT comes with a training and iteration module where a random number approach is used to approximate a good candidate weight score combination, as in figures 2, 4, 6, 8 and 9.
The training module accepts a list of paired MODICT scores and enzymatic activities (or any measure of residual protein function that is determined experimentally). It tries to find an optimal weight score combination for each residue that yields the strongest possible Pearson's correlation (one would expect enzymatic activity and MODICT scores to be negatively correlated). The user has control over the iteration process by regulating several parameters, such as the number of rounds to iterate. Even then, the improvement over the initial correlation varies from protein to protein and depends on the number of mutations used for training.
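The random number approach behind the training step can be sketched as follows. This is a minimal Python illustration under loudly stated assumptions and not the training module of the package: the stand-in score (a weighted mean of per-residue RMSD values) is NOT the MODICT score and is used only to keep the example self-contained, and the function names, the uniform sampling of weights between 0 and 10, and the default number of rounds are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson's correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def weighted_score(rmsd: Sequence[float], weights: Sequence[float]) -> float:
    """Stand-in score (assumption): weighted mean of per-residue RMSD values."""
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return sum(w * a for w, a in zip(weights, rmsd)) / total


def train_weights(profiles: Sequence[Sequence[float]],
                  activities: Sequence[float],
                  rounds: int = 500, seed: int = 0) -> Tuple[List[float], float]:
    """Random search for weights giving the most negative score/activity correlation."""
    rng = random.Random(seed)
    n = len(profiles[0])
    best_w = [10.0] * n  # start from uniform weights
    best_r = pearson_r([weighted_score(p, best_w) for p in profiles], activities)
    for _ in range(rounds):
        candidate = [rng.uniform(0.0, 10.0) for _ in range(n)]
        r = pearson_r([weighted_score(p, candidate) for p in profiles], activities)
        if r < best_r:  # more negative means closer to the expected trend
            best_w, best_r = candidate, r
    return best_w, best_r
```

Because the search only keeps a candidate when it improves on the current best, the returned correlation is never weaker than the one obtained with uniform weights. With few training mutations such a search can overfit, which is consistent with the dependence on the number of mutations noted above.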
The MODICT package also comes with an iterator module to identify regions of a protein that contribute the most to the overall MODICT score (figures 2, 4 and 6). The iteration algorithm automatically assigns weight scores between 0 and 10 to residues: the higher the weight score, the greater the contribution of that residue pair to the overall MODICT score. MODICT uses a random number approach to approximate a significant combination. Although the computation process can be cumbersome under certain conditions, the current approach performs well when many models are compared simultaneously. Such an example is given in figure 10, where mutations that preserve 50 percent or more of residual activity are compared to two relatively more severe mutations.
When the iteration algorithm of MODICT is used, it generates an automatic and interactive output as shown in figure 11. The user can choose to display amino acids with certain properties or just visualize the change in regions that correspond to a domain. The user may wish to know whether residues with a high MODICT score are also conserved, which can be seen from the color coding. For a more comprehensive explanation of how to interpret iterator results, please refer to the MODICT documentation.
2.4 ROC curve generation
One of the challenges in constructing a receiver operating characteristic (ROC) curve for an algorithm that generates a continuous range of output rather than a qualitative output (deleterious or benign) is to build a parametric classification system. This can be achieved by recalculating thresholds for a given set of mutations with known outcome while varying the levels of stringency (a measure of how rigorously the thresholds are constructed). Subsequently, this can be plotted against the p-value (a measure of how correctly the mutations are classified). In principle, mutations are not only completely benign or deleterious but spread through a range of variable residual protein activity/function. In addition to a negative control, which is usually the ΔRMSD between wildtype and a refined wildtype model or between wildtype and a benign model, another score from the ΔRMSD between wildtype and a given benign/deleterious/partial model should be used. This allows the user to construct a hypothetical distribution of scores and thus determine the likelihood of a test score being benign, deleterious or partial. Such a script is included in the MODICT package. Users can import their calculated scores from new models and update the current ROC plot shown in figure 12. The data used to generate the plot are listed in table S1.
2.5 Output
MODICT, supplied with the RMSD file, outputs an algorithm score, which is a unitless float value.
3 Results
We have derived a simple algorithm, MODICT, to predict whether a mutation is deleterious or not based on the RMSD obtained from superimposed mutated and wildtype 3D structures. The 3D protein structures in this study were modeled by I-TASSER and Phyre2; however, other modeling algorithms can be used as well. The mathematical model underlying MODICT can also incorporate the information from conservation and weight scores. An iteration algorithm to determine the regions that account the most for the calculated score is also available with MODICT. MODICT is not only a prediction tool, but also a tool to scrutinize changes in the protein structure independent of the score.
The algorithm was tested on 6 different proteins which belong to different protein families. The chosen mutations were of different nature in order to minimize bias. MODICT scores were interpreted by two methods: either correlating them with experimental metrics like enzymatic activities, or using the scores for ordinal classification (deleterious, benign, partially deleterious etc.). The first method requires MODICT scores for at least 3 mutations with experimentally verified enzyme activities in order to predict the effect of an unknown mutation. The MODICT scores and the enzymatic activities of the known mutations are plotted in a scatter plot and a trend-line is fitted by the least squares method. By following the trend-line, the enzymatic activity of the mutation of interest can be estimated. The advantage of this approach is the ability to use the training module of MODICT on a subset (or the entire set) of mutations to improve the initial Pearson's r correlation coefficient. This method was applied to Btd, Pah and Acadm mutations (see tables 1, 2 and figure 3.3).
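The first method can be illustrated with an ordinary least squares trend-line. The following Python sketch is an assumption-laden illustration rather than code from the package: the function names are hypothetical, and any numbers passed to it in examples are placeholders, not measurements from this study.

```python
from typing import Sequence, Tuple


def fit_trendline(scores: Sequence[float],
                  activities: Sequence[float]) -> Tuple[float, float]:
    """Least-squares line activity = slope * score + intercept.

    Requires at least 3 known mutations, as stated for the first method.
    """
    n = len(scores)
    if n < 3 or n != len(activities):
        raise ValueError("need at least 3 paired scores and activities")
    mx = sum(scores) / n
    my = sum(activities) / n
    sxx = sum((x - mx) ** 2 for x in scores)
    if sxx == 0.0:
        raise ValueError("scores must not all be identical")
    sxy = sum((x - mx) * (y - my) for x, y in zip(scores, activities))
    slope = sxy / sxx
    return slope, my - slope * mx


def predict_activity(score: float, slope: float, intercept: float) -> float:
    """Read the estimated residual activity of a new mutation off the trend-line."""
    return slope * score + intercept
```

With placeholder scores 1, 2, 3 and activities 90, 60, 30, the fitted line has slope -30 and intercept 120, so a new score of 2.5 maps to an estimated activity of 45. A negative slope is what one expects when higher scores denote more deleterious mutations.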
The second method is used when two or fewer mutations with experimental data are available. However, a negative control MODICT score is required for comparison. This method was applied to Renin, Tubb2b and Smpd1 mutations (see sections 3.1, 3.2 and 3.4). Regardless of the method, higher MODICT scores mean more deleterious.
Throughout this paper, MODICT scores have been used both as ordinal classifiers (benign, partially deleterious, deleterious etc.) and as continuous variables to measure correlation. For each of the tested cases in this study, it is indicated whether conservation scores and/or weight scores were used. For the examples given in this article, MODICT performs better without conservation scores.
Throughout the results section, the output of the iteration algorithm (residues that contribute the most to a MODICT score) was represented using I-PV as shown in figs 2, 4, 6 and 10 [11].
3.1 Renin p.R33W
Renin is one of the main components that regulates the main arterial blood pressure via the renin-angiotensin system and is initially secreted as a propeptide with a 67 amino acid long signal sequence [12]. Mature renin does not have this signal sequence and has a molecular mass of 37 kDa [13]. A novel heterozygous mutation c.58T>C (p.C20R) was found in all affected members of a family with autosomal dominant inheritance of anemia, polyuria, hyperuricemia and chronic kidney disease [14].
Another variant, p.R33W, suspected to be benign, resides within the same signal sequence (http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=11571098; http://web.expasy.org/variant_pages/VAR_020375.html). Several prediction algorithms were tested on this variant previously [15]. In this example, conservation scores generated by multiple sequence alignment of reviewed Ren (renin) sequences were also used by the algorithm as an additional factor (section S1.3). Based on domain annotations, residues that are involved in various interactions were also given a weight score of 20 instead of the default value (10, section S1.3). Figure 1C and figure 2 show the algorithm results associated with these mutations.
We also provided wildtype and mutated Renin FASTA files to the automated Phyre2 server and received models for the same variants. The wildtype Renin score was 0.328, whereas the p.R33W and p.C20R scores were 3.816 and 4.128 respectively. Based on these scores, the p.R33W variant should be classified as deleterious. As mentioned previously, p.R33W is of unknown significance due to its low frequency (dbSNP, <1%). Although a study has claimed that it significantly reduces Renin biosynthesis (http://www.ashg.org/2014meeting/abstracts/fulltext/f140120880.htm), to our knowledge it has not yet been published. The Renin example demonstrates that MODICT scores are not totally independent of the models provided to it. For a more detailed explanation of using MODICT scores as an ordinal classifier, please refer to the manual and section S1.3.
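One simple way to read the Phyre2-based Renin scores against their controls is to place the test score on a linear scale between the negative control (wildtype, 0.328) and the positive control (p.C20R, 4.128). The Python sketch below is only an illustration: the linear rescaling and the function name relative_position are assumptions, and this is not the threshold procedure of section S1.3.

```python
def relative_position(score: float, negative_control: float,
                      positive_control: float) -> float:
    """Place a score on a 0-to-1 scale between the two controls.

    Hypothetical helper: 0 corresponds to the negative control and
    1 to the positive control (linear rescaling is an assumption).
    """
    span = positive_control - negative_control
    if span == 0.0:
        raise ValueError("controls must have different scores")
    return (score - negative_control) / span


# Renin scores reported above (Phyre2 models).
r33w_position = relative_position(3.816, 0.328, 4.128)
```

Here r33w_position is roughly 0.92, meaning that p.R33W lies much closer to the known pathogenic p.C20R than to the wildtype control on this scale.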
3.2 Tubb2b p.A248V and p.R380L
Tubulins are the main components of microtubules, to which dynein and kinesin motor proteins bind. Together with intermediate filaments and microfilaments, they form the cytoskeleton, which plays a major role in intercellular trafficking, cell-cell interactions, junctions and cellular migration [16]. Tubulins are ubiquitously expressed in all human tissues. However, mutations in these proteins mostly affect tissue types that rely on their functionality the most during development, such as cells of neuronal or glial origin [17, 18]. Almost all mutations in tubulins result in Malformations of Cortical Development (MCD) [19]. Mutations in TUBB2B result in the polymicrogyria spectrum of malformations [20–26]. Two de novo mutations in Tubb2b, namely p.A248V and p.R380L, in 2 unrelated patients of Turkish and Belgian origin and 1 patient of French-Canadian origin respectively, were identified and tested for their MODICT scores [21].
Figure 3 (C) and figure 4 show the algorithm results associated with these mutations. Scores without weight and conservation parameters (section S1.4) for wildtype, Tubb2b p.A248V and Tubb2b p.R380L were 1.843, 1.984 and 2.003 respectively. Choosing the wildtype as control (S_C) and Tubb2b p.R380L as the known deleterious mutation (S_K), the threshold T1 was calculated (section S1.2). The value for T1 was 1.945, which was lower than the Tubb2b p.A248V score (σ = standard deviation, κ = 55). This means that the Tubb2b p.A248V mutation is indeed deleterious.
Wildtype and mutated FASTA files were also provided to the automated Phyre2 server. MODICT scores in the absence of weight and conservation parameters for wildtype, Tubb2b p.A248V and Tubb2b p.R380L were 1.448, 4.203 and 3.459 respectively. Choosing Tubb2b p.A248V as the known deleterious variant, the T1 threshold is 3.200, which is lower than the Tubb2b p.R380L score. As a result, MODICT scores generated from both I-TASSER and Phyre2 models agree on the nature of the variants.
3.3 Btd p.H447R and p.R209C
Biotinidase is an enzyme that is encoded by the BTD gene. Low enzyme activity interferes with the cycling of biotin and, if left untreated, may lead to neurological and cutaneous issues [27]. In this example, a case with experimentally verified results from 2 patients is used and compared with MODICT scores [28]. The genotypes of the patients in the aforementioned study were c.1330G>C (p.D444H)/c.1340A>G (p.H447R) [patient 1] and c.557G>A (p.C186Y)/c.625C>T (p.R209C) [patient 2]. Both former mutations (c.1330G>C in patient 1 and c.557G>A in patient 2) were null mutations, meaning that the experimentally measured residual enzyme activity belongs to the latter mutations [27, 28]. The residual enzyme activities in the patients were 61 eu (enzyme units) and 91 eu respectively (population mean 263 eu). MODICT scores were generated using 2 different modeling algorithms (I-TASSER, Phyre2) and the results were compared with residual enzyme activity as shown in figure 5 [8, 29]. Conservation scores were generated by aligning reviewed biotinidase sequences from UniProt (Homo sapiens, Rattus norvegicus, Mus musculus, Bos taurus, Takifugu rubripes) using Clustal Omega (http://www.ebi.ac.uk/Tools/msa/clustalo/), and the resulting scores (min, 0; max, 11) corresponding to residues 1-543 of Btd were given to MODICT [30]. Supplying or not supplying the conservation scores does not significantly alter the MODICT score/enzymatic activity ratios, as can be seen from table S1.
The MODICT scores were generated by taking into account functionally important regions (residues 57-363, 402-403 and 489-490; UniProt, P43251). Such functionally important regions can generally be found in UniProt. As seen in figure 5, both Phyre2 and I-TASSER scores are proportional to the corresponding enzymatic activities. Although there are only 2 mutations, taken together with the negative control score, raw MODICT scores without any conservation or weight files correlate strongly with enzymatic activity (Phyre2: r = -0.805; I-TASSER: r = -0.838).
3.4 Mutations in Sphingomyelin phosphodiesterase-1
Sphingomyelin phosphodiesterase-1 is an enzyme (UniProt ID: ASM_HUMAN) located in lysosomes and responsible for the conversion of sphingomyelin to ceramide. Deficits in enzyme activity or a reduction in the enzyme concentration result in an inborn error of metabolism grouped under the name Niemann-Pick disease (type A and B) [31]. Several polymorphisms exist that are frequent amongst control populations. One example of such a variant is p.V36A, located in the signal sequence. Another variant that is often mistaken as deleterious is p.G506R [32]. Figure 7 demonstrates the procedure of classifying the p.G506R mutation, using Phyre2 to model the wildtype. Since the known p.V36A variant is benign (with a score of S_K), the S_I score is substituted directly by S_K. Based on the calculated thresholds, the p.G506R mutation was correctly classified as "partially deleterious or benign". The procedure to use MODICT as an ordinal classifier using thresholds is further elaborated in the manual and in the discussion section.
3.5 Mutations in Medium Chain Acyl-CoA Dehydrogenase
Medium chain acyl-CoA dehydrogenase (MCAD, UniProt ID: P11310, NP_000007.1) is an enzyme encoded by the ACADM gene. MCAD deficiency is one of the most common deficits in mitochondrial β-oxidation. MCAD is the enzyme responsible for breaking down medium-chain fatty acids. Deleterious mutations that reduce the enzyme activity result in clinical symptoms such as hypoglycemia and hepatic and neuronal dysfunction [33]. Enzymatic activity data of homozygous/compound heterozygous patients carrying 2 deleterious mutations have been adapted from Sturm et al. as shown in table 2 [33]. Mutated proteins were modeled using Phyre2 and superimposed on wildtype MCAD, which was generated by submitting the wildtype FASTA file to the Phyre2 server. For each mutation pair, the MODICT score was the average of the MODICT scores of the individual mutations (direct summation without averaging only expands the graph on one axis). Rather than using MODICT as a classifier, the main goal was to see whether the MODICT scores correlate with the real experimental measurements. MODICT scores correlated negatively with the enzymatic activities, as shown in figure 8.
Because higher MODICT scores denote a more deleterious effect, MODICT scores are expected to go down as the residual activity increases, which results in a negative correlation. As shown in figure 8, the initial Pearson's correlation coefficient was -0.488. Although this is not very strong, it is important to underscore that MODICT is the first attempt to achieve such a degree of correlation between prediction and experimental outcome from user-generated 3D protein models. Figure 8 also compares the correlation of PolyPhen-2 scores with enzymatic activity, which did not yield significant concordance with experimental results.
Figure 8 also depicts the use of the training module of MODICT. Table 2 lists the compound heterozygous mutations used for the correlations in figure 8. Eight of the mutation pairs in table 2 share a near-null deleterious p.K329E mutation, where homozygotes for this variant have five percent residual activity. Thus, we trained MODICT with these eight mutations and then used the trend-line (calculated by the least squares method) to estimate the enzymatic activity of the remaining mutation pairs in table 2. As shown in figure 8 (lower right), MODICT was able to achieve 91 percent accuracy. The MCAD example demonstrates the possibility of developing an enzyme-specific panel without the need for very large datasets for training MODICT.
3.6 Mutations in PAH
The last example is about phenylketonuria (PKU), an enzymatic defect that manifests itself as a deficiency in phenylalanine hydroxylase (PAH), which converts phenylalanine to tyrosine with the aid of tetrahydrobiopterin (BH4). It is an autosomal recessive disease in which both copies of PAH carry deleterious mutations. The ample decrease in PAH activity results in an elevated phenylalanine blood concentration. If the elevated phenylalanine concentration is left untreated, it can lead to mental retardation with structural brain changes visible on an MRI. Deleterious mutations in PAH variably affect the level of enzymatic activity. Data regarding such mutations can be found in several studies [34, 35]. Comparison of the generated MODICT scores after excluding outliers shows that the scores of individual mutations were negatively correlated with residual enzyme activities, as shown in figure 9 (Pearson's r = -0.494). Similarly, PolyPhen-2 scores correlated negatively with experimental measurements but to a lesser degree (Pearson's r = -0.417). Using the training module for the 14 mutations in figure 9 further improved the initial correlation coefficient from -0.494 to -0.722.
4 Availability and Future Directions
Discussion
MODICT is an algorithm which predicts whether a mutation is deleterious or not. This is based on the RMSD obtained from superimposing mutated and wildtype 3D protein structures. Modeling was done here by using I-TASSER and Phyre2, although alternatives can be used as well. The mathematical model underlying MODICT can also incorporate the information from conservation and weight scores. An iteration algorithm to determine the regions that account the most for the calculated score is also available with the package.
There are two ways to make use of MODICT scores. The first way is to convert the scores into an ordinal classification system, which requires a negative control. The second way is to correlate experimental results with MODICT scores, as shown in the BTD, MCAD and PAH examples. The bottleneck in this approach is finding several known mutations in the protein of interest with available enzymatic activities or an equivalent measurement. However, this method allows an extrapolation between MODICT scores and residual protein activity. By using the MODICT training module, one can further optimize the linear relationship between MODICT scores and residual enzyme activities. Although overall RMSD values and significance are taken into account by the algorithm, MODICT's accuracy still depends on the models generated by the user. Unlike PolyPhen-2 and SIFT, MODICT scores are not normalized and vary depending on the length of the protein, the RMSD values between residues, the overall RMSD, the regions that are taken into account etc. Therefore, individual MODICT scores should not be seen as values indicative of a deleterious or benign nature, but should always be interpreted in relation to their negative/positive controls or in relation to known enzyme activities.
Reporting results with MODICT
When reporting results obtained with MODICT, users should provide the parameters they used together with the tool. Several of these parameters are key factors in the reproducibility of the results. One of these parameters is the modeling algorithm used (Phyre2, I-TASSER etc.) together with the sequence of the protein submitted to the server. Another is the set of regions that are taken into account (residue numbers, domains etc.) when calculating the MODICT score. The user should also indicate the conservation and weight scores used, if any. If the training algorithm is used, then the mutations used for training and the output weight score combination should be reported as well. Users who have followed the ordinal classification method should also indicate how the negative control score was generated. Lastly, users should indicate the superimposition method used for generating the RMSD values. For example, superimposition based on alpha carbons has been used throughout this article.
Limitations
MODICT is a tool that is not independent of the models generated by the modeling algorithm of choice. The Renin case is a good example of this, where models generated by Phyre2 and I-TASSER gave different MODICT scores. Moreover, consistency in the superimposition techniques used between models and the portion of the protein that is actually modeled (full-length protein modeling is usually more reliable than partial modeling of distinct domains) significantly affect the outcome. Many modeling servers also include a confidence key together with the results, which is useful for judging the quality of the starting models. In general, since the wildtype model will be the main model on which test and known mutated models are superimposed, a low-quality model will make it harder to discern between scores. Another issue is that many modeling servers have amino acid limits on submitted FASTA files, which are generally below 2000. This might make the evaluation of large proteins harder. As modeling algorithms advance, several of these issues will be resolved. Another drawback is that all structural deviations from a given wildtype model are perceived as lying towards the deleterious end of the spectrum, whereas in reality there are also gain-of-function mutations. In that case, it is possible to modify the range of weight scores to include negative values as well.
Future directions
It is important to underline that MODICT has no universal training dataset. This means that the algorithm itself (without any weight or conservation parameters) is able to reflect and capture a portion of the physicochemical interactions that determine the outcome of pathogenicity, at least for the proteins demonstrated in this article. In later stages, the conservation scores or, more importantly, the weight scores can be used to train MODICT on a per-protein basis. For instance, certain combinations of weight scores that yield a higher correlation coefficient for a given enzyme panel can be generated. We are planning to train MODICT on a variety of proteins and upload the trend-lines for each modeling algorithm, so that end users would only have to upload the MODICT score of their mutation without having to train the algorithm manually.
A systematic database of MODICT scores could be very beneficial for additional variant filtering in Next Generation Sequencing analysis, as the utilization of protein structure files is not adequately implemented. We are planning to store user-submitted MODICT scores for this purpose. MODICT is a fully automated algorithm that comes with a variety of scripts to analyze the effects of mutations on protein structure. Unlike most other mutation predictors, MODICT uses .pdb files and can simultaneously compare multiple models for differences in topology. All the models used for this article can be downloaded together with the MODICT package from https://github.com/MODICT/MODICT.
Competing interests
The authors declare that they have no competing interests.
Acknowledgments
Ibrahim Tanyalcin received funding from Scientific Fund Willy Gepts and the Foundation Marguerite Delacroix. AJ received funding from the Research Foundation Flanders.