ABSTRACT
Introduction An estimated 17% of cancers worldwide are associated with infectious causes. The extent and biological significance of viral presence/infection in actual tumor samples are generally unknown but could be measured using human transcriptome (RNA-seq) data from tumor samples.
We present viGEN, an open source bioinformatics pipeline that combines well-known existing RNA-seq tools with novel ones, not only to detect and quantify viral RNA but also to call variants in the viral transcripts.
Methods The pipeline includes four major modules. The first module aligns and filters out human RNA sequences; the second maps the remaining unaligned reads against reference genomes of all known and sequenced human viruses and counts them; the third quantifies read counts at the level of individual viral genes, allowing downstream differential expression analysis of viral genes between experimental and control groups; the fourth calls variants in these viruses. To the best of our knowledge, no other publicly available pipeline or package provides this type of complete analysis in a single open source package.
Results In this paper, we use this pipeline in a case study to examine viruses present in RNA-seq data from 75 TCGA liver cancer patients. We were able to quantify viral transcriptomes at the viral-gene/CDS level, find differentially expressed viral transcripts between groups of patients, extract variants, and connect them to clinical outcome. The results agree with published literature in terms of rate of detection, viral gene expression patterns and the impact of several known variants of the HBV genome. The results also reveal novel information about distinct patterns of expression and co-expression in Hepatitis B, Hepatitis C and Human Endogenous Retrovirus (HERV) K113 viruses.
Conclusion This pipeline is generalizable and can be used to provide novel biological insights into the significance of viral and other microbial infections in complex diseases, tumorigenesis and cancer immunology. The source code, with example data and a tutorial, is available at: https://github.com/ICBI/viGEN/.
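The gene-level quantification step described under Methods (the third module) can be illustrated with a minimal sketch. This is not the viGEN implementation; it only shows the general idea of counting aligned reads that overlap each annotated viral CDS. The `Feature` type, the function name `count_reads_per_feature`, and the toy coordinates are hypothetical, and a simple "any overlap counts" rule is assumed.

```python
from collections import Counter
from typing import Dict, List, NamedTuple, Tuple


class Feature(NamedTuple):
    """One CDS record, using GFF-style 1-based inclusive coordinates."""
    name: str
    start: int
    end: int


def count_reads_per_feature(
    features: List[Feature], reads: List[Tuple[int, int]]
) -> Dict[str, int]:
    """Count reads overlapping each feature (assumed rule: any overlap counts)."""
    # Start every feature at zero so unexpressed genes still appear in the output.
    counts = Counter({f.name: 0 for f in features})
    for read_start, read_end in reads:
        for f in features:
            # Two closed intervals overlap when each starts before the other ends.
            if read_start <= f.end and read_end >= f.start:
                counts[f.name] += 1
    return dict(counts)


# Hypothetical toy annotation and alignments, not real viral coordinates.
features = [Feature("geneA", 1, 100), Feature("geneB", 90, 200)]
reads = [(10, 59), (95, 144), (150, 199), (300, 349)]
counts = count_reads_per_feature(features, reads)
print(counts)
```

A read that spans two overlapping features is counted once for each of them in this sketch; production counting tools typically offer stricter rules for such ambiguous reads.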
LIST OF ABBREVIATIONS
- HBV: Hepatitis B virus
- HCV: Hepatitis C virus
- HERV K113: Human Endogenous Retrovirus K113
- TCGA: The Cancer Genome Atlas
- HCC: Hepatocellular carcinoma
- NAFLD: Nonalcoholic fatty liver disease
- Hep B: Hepatitis B
- Hep C: Hepatitis C
- HepB + HepC: Coinfected with both Hepatitis B and C viruses
- HBsAg: Hepatitis B surface antigen
- HBeAg: Hepatitis B type e antigen
- NGS: Next-generation sequencing
- RNA-seq: Whole transcriptome sequencing
- BAM: Binary version of Sequence Alignment/Map format
- CDS: Coding sequence
- Cox PH: Cox Proportional Hazard
- HBx: Viral gene X
- STS: Sequence-tagged sites
- NCBI: National Center for Biotechnology Information
- GFF: General feature format