ConnectedReads: machine-learning optimized long-range genome analysis workflow for next-generation sequencing

Chung-Tsai Su; Sid Weng; Yun-Lung Li; Ming-Tai Chang

doi:10.1101/776807

Abstract

Current human genome sequencing assays in both clinical and research settings primarily use short-read sequencing and apply resequencing pipelines to detect genetic variants. However, structural variant (SV) discovery remains a considerable challenge because of an incomplete reference genome, mapping errors and high sequence divergence. To overcome this challenge, we propose ConnectedReads, an efficient and effective whole-read assembly workflow that applies unsupervised graph mining algorithms on Apache Spark, a large-scale data processing platform. By fully utilizing the information in short-read data, ConnectedReads generates haplotype-resolved contigs and streamlines downstream pipelines, providing higher-resolution SV discovery than other methods, especially in N-gap regions. Furthermore, we demonstrate a cost-effective approach that leverages ConnectedReads to investigate the full spectrum of genetic changes in population-scale studies.

Footnotes

https://github.com/atgenomix/connectedreads

Abbreviations

CNV: Copy number variant
HDFS: Hadoop distributed file system
HSA: Haplotype-sensitive assembly
LRS: Long-read sequencing
NGS: Next-generation sequencing
NRNR: Non-reference, non-repetitive
QD: Quality depth
RDD: Resilient distributed dataset
SRS: Short-read sequencing
SV: Structural variant
UNI: Unique non-reference insertion
WGS: Whole-genome sequencing

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.