Abstract
Transcript levels are a critical determinant of the proteome and hence of cellular function. Because the transcriptome is an outcome of the interactions between genes and their products, we reasoned that it might be accurately represented by a subset of transcript abundances. We develop a method, Tradict (transcriptome predict), capable of learning from, and then using, the expression measurements of a small subset of 100 marker genes to reconstruct entire transcriptomes. By analyzing over 23,000 publicly available RNA-Seq datasets, we show that Tradict is accurate and robust to noise, especially when predicting the expression of a comprehensive but interpretable list of transcriptional programs that represent the major biological processes and cellular pathways. Coupled with targeted RNA sequencing, Tradict may therefore enable simultaneous transcriptome-wide screening and mechanistic investigation at large scales. Thus, whether for performing forward genetic, chemogenomic, or agricultural screens or for profiling single cells, Tradict promises to help accelerate genetic dissection and drug discovery.