RT Journal Article
SR Electronic
T1 SQANTI: extensive characterization of long read transcript sequences for quality control in full-length transcriptome identification and quantification
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 118083
DO 10.1101/118083
A1 Manuel Tardaguila
A1 Lorena de la Fuente
A1 Cristina Marti
A1 Cécile Pereira
A1 Hector del Risco
A1 Marc Ferrell
A1 Maravillas Mellado
A1 Marissa Macchietto
A1 Kenneth Verheggen
A1 Mariola Edelmann
A1 Iakes Ezkurdia
A1 Jesus Vazquez
A1 Michael Tress
A1 Ali Mortazavi
A1 Lennart Martens
A1 Susana Rodriguez-Navarro
A1 Victoria Moreno
A1 Ana Conesa
YR 2017
UL http://biorxiv.org/content/early/2017/03/18/118083.abstract
AB High-throughput sequencing of full-length transcripts using long reads has paved the way for the discovery of thousands of novel transcripts, even in very well annotated organisms such as mice and humans. Nonetheless, there is a need for studies and tools that characterize these novel isoforms. Here we present SQANTI, an automated pipeline for the classification of long-read transcripts that computes over 30 descriptors, which can be used to assess the quality of the data and of the preprocessing pipelines. We applied SQANTI to a neuronal mouse transcriptome using PacBio long reads and illustrate how the tool is effective in readily describing and characterizing the composition of the full-length transcriptome. We perform an extensive evaluation of ToFU PacBio transcripts by PCR to reveal that a substantial number of the novel transcripts are technical artifacts of the sequencing approach, and that SQANTI quality descriptors can be used to engineer a filtering strategy to remove them. By comparing our iso-transcriptome with public proteomics databases we find that alternative isoforms are elusive to proteogenomics detection and frequently involve major protein changes with respect to the principal isoform of their genes.
A comparison of Iso-Seq with classical RNA-seq approaches based solely on short reads demonstrates that the PacBio transcriptome not only succeeds in capturing the most robustly expressed fraction of transcripts, but also avoids quantification errors caused by unaccounted 3’ end variability in the reference. SQANTI allows the user to maximize the analytical outcome of long-read technologies by providing the tools to deliver quality-evaluated and curated full-length transcriptomes. SQANTI is available at https://bitbucket.org/ConesaLab/sqanti.