TY - JOUR
T1 - SQANTI: extensive characterization of long read transcript sequences for quality control in full-length transcriptome identification and quantification
JF - bioRxiv
DO - 10.1101/118083
SP - 118083
AU - Manuel Tardaguila
AU - Lorena de la Fuente
AU - Cristina Marti
AU - Cécile Pereira
AU - Hector del Risco
AU - Marc Ferrell
AU - Maravillas Mellado
AU - Marissa Macchietto
AU - Kenneth Verheggen
AU - Mariola Edelmann
AU - Iakes Ezkurdia
AU - Jesus Vazquez
AU - Michael Tress
AU - Ali Mortazavi
AU - Lennart Martens
AU - Susana Rodriguez-Navarro
AU - Victoria Moreno
AU - Ana Conesa
Y1 - 2017/01/01
UR - http://biorxiv.org/content/early/2017/03/18/118083.abstract
N2 - High-throughput sequencing of full-length transcripts using long reads has paved the way for the discovery of thousands of novel transcripts, even in very well annotated organisms such as mice and humans. Nonetheless, there is a need for studies and tools that characterize these novel isoforms. Here we present SQANTI, an automated pipeline for the classification of long-read transcripts that computes over 30 descriptors, which can be used to assess the quality of the data and of the preprocessing pipelines. We applied SQANTI to a neuronal mouse transcriptome using PacBio long reads and illustrate how the tool is effective in readily describing the composition of, and characterizing, the full-length transcriptome. We perform an extensive evaluation of ToFU PacBio transcripts by PCR to reveal that a substantial number of the novel transcripts are technical artifacts of the sequencing approach, and that SQANTI quality descriptors can be used to engineer a filtering strategy to remove them. By comparing our iso-transcriptome with public proteomics databases we find that alternative isoforms are elusive to proteogenomics detection and are abundant in major protein changes with respect to the principal isoform of their genes. A comparison of Iso-Seq with classical RNA-seq approaches based solely on short reads demonstrates that the PacBio transcriptome not only succeeds in capturing the most robustly expressed fraction of transcripts, but also avoids quantification errors caused by unaccounted 3’ end variability in the reference. SQANTI allows the user to maximize the analytical outcome of long-read technologies by providing the tools to deliver quality-evaluated and curated full-length transcriptomes. SQANTI is available at https://bitbucket.org/ConesaLab/sqanti.
ER -