Abstract
As whole-genome sequencing technologies improve and accurate maps of the entire genome are assembled, short open-reading frames (sORFs) are garnering interest as functionally important regions that were previously overlooked. However, few tools are available for investigating variants in sORF regions of the genome. Here we investigate the performance of commonly used tools for variant calling and variant prioritisation in these regions, and present a framework for optimising these processes. First, the performance of four widely used germline variant calling algorithms is systematically compared. HaplotypeCaller is found to perform best across the whole genome, but FreeBayes is shown to produce the most accurate variant set in sORF regions. An accurate set of variants is obtained by taking the intersection of the called variant sets. The potential deleteriousness of each variant is then predicted using a pathogenicity scoring algorithm developed here, called sORF-c. This algorithm uses supervised machine learning to predict the pathogenicity of each variant from a broad range of functional, conservation-based and region-based scores defined for that variant. Trained on a dataset of over 130,000 variants, sORF-c outperforms other comparable pathogenicity scoring algorithms on a test set of variants in sORF regions of the human genome.
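The intersection step mentioned above can be illustrated with a minimal sketch. It assumes that each caller's output is available as VCF-formatted text lines and that two calls count as the same variant when chromosome, position, reference allele and alternate allele all match. The function names `variant_keys` and `consensus_variants` are hypothetical, and the sketch does not normalise indel representations, which a real comparison would probably need.

```python
def variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from VCF-formatted text lines.

    Assumption: tab-separated VCF records; header lines start with '#'.
    Multi-allelic records are split into one key per alternate allele.
    """
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF headers
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):
            keys.add((chrom, int(pos), ref, alt))
    return keys


def consensus_variants(call_sets):
    """Return the variants present in every caller's key set."""
    call_sets = list(call_sets)
    if not call_sets:
        return set()
    return set.intersection(*call_sets)


# Toy records for two hypothetical callers (not real data).
caller_a = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tA\tG",
    "chr1\t200\t.\tC\tT,G",
    "chr2\t50\t.\tG\tA",
]
caller_b = [
    "chr1\t100\t.\tA\tG",
    "chr1\t200\t.\tC\tG",
]

consensus = consensus_variants([variant_keys(caller_a), variant_keys(caller_b)])
print(sorted(consensus))
```

Keying on the full (chromosome, position, reference, alternate) tuple rather than on position alone means that callers reporting different alleles at the same site are not counted as agreeing.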
Abbreviations
- AUPRC: Area under the precision-recall curve
- BED: Browser Extensible Data
- CADD: Combined annotation-dependent depletion
- DANN: Deleterious annotation of genetic variants using neural networks
- EPO: Enredo, Pecan, Ortheus pipeline
- GATK: Genome analysis toolkit
- GIAB: Genome in a bottle
- HGMD: Human gene mutation database
- Indels: Insertions and deletions
- MS: Mass spectrometry
- ORF: Open reading frame
- RF: Random Forests
- ROC: Receiver Operating Characteristics
- SEP: sORF encoded peptide
- sklearn: Scikit-learn package
- SNVs: Single nucleotide variants
- sORF: Short open-reading frame
- TF: Transcription factor
- TSS: Transcription start site
- VCF: Variant Call Format file