TY - JOUR
T1 - A variant by any name: quantifying annotation discordance across tools and clinical databases
JF - bioRxiv
DO - 10.1101/054023
SP - 054023
AU - Jennifer Yen
AU - Sarah Garcia
AU - Aldrin Montana
AU - Jason Harris
AU - Steven Chervitz
AU - John West
AU - Richard Chen
AU - Deanna M. Church
Y1 - 2016/01/01
UR - http://biorxiv.org/content/early/2016/05/19/054023.abstract
N2 - Background: Clinical genomic testing is dependent on the robust identification and reporting of variant-level information in relation to disease. With the shift to high-throughput sequencing, a major challenge for clinical diagnostics is the cross-identification of variants called on their genomic position to resources that rely on transcript- or protein-based descriptions. Methods: We evaluated the accuracy of three tools (SnpEff, Variant Effect Predictor and Variation Reporter) that generate transcript- and protein-based variant nomenclature from genomic coordinates according to guidelines by the Human Genome Variation Society (HGVS). Our evaluation was based on comparisons to a manually curated list of 127 test variants of various types drawn from data sources, each with HGVS-compliant transcript and protein descriptors. We further evaluated the concordance between annotations generated by SnpEff and Variant Effect Predictor with those in major germline and cancer databases: ClinVar and COSMIC, respectively. Results: We find that there is substantial discordance between the annotation tools and databases in the description of insertions and/or deletions. Accuracy based on our ground truth set was between 80-90% for coding and 50-70% for protein variants, numbers that are not adequate for clinical reporting. Exact concordance for SNV syntax was over 99.5% between ClinVar and both Variant Effect Predictor (VEP) and SnpEff, but less than 90% for non-SNV variants. For COSMIC, exact concordance for coding and protein SNVs was between 65 and 88%, and less than 15% for insertions. Across the tools and datasets, there was a wide range of equivalent expressions describing protein variants. Conclusion: Our results reveal significant inconsistency in variant representation across tools and databases. These results highlight the urgent need for the adoption of and adherence to uniform standards in variant annotation, with consistent reporting on the genomic reference, to enable accurate and efficient data-driven clinical care.
ER -