Abstract
The ability of Epstein Barr Virus (EBV) to transform resting B-cells into immortalized lymphoblastoid cell lines (LCL) provides a continuous source of peripheral blood lymphocytes that are used to model conditions in which these lymphocytes play a key role. Here, the PacBio-generated transcriptomes of three LCLs from a parent-daughter trio (SRAid:SRP036136) provided by a previous study [1] were analyzed using a kmer-based version of YeATS (KEATS). The set of over-expressed genes in these cell lines was determined by comparison with the PacBio transcriptome of twenty tissues provided by another study (hOPTRS) [2]. MIR155 long non-coding RNA (MIR155HG), Fc fragment of IgE receptor II (FCER2), T-cell leukemia/lymphoma 1A (TCL1A), and germinal center associated signaling and motility (GCSAM) were the genes with the highest expression counts in the three LCLs and no expression in hOPTRS. Other over-expressed genes, having low expression in hOPTRS, were membrane spanning 4-domains A1 (MS4A1) and ribosomal protein S2 pseudogene 55 (RPS2P55). While some of these genes are known to be over-expressed in LCLs, this study provides a comprehensive catalogue of such genes. A recent report described a patient with EBV-positive large B-cell lymphoma that was ‘unusually lacking various B-cell markers’ but over-expressed CD30 [3], a gene ranked 79 among the uniquely expressed genes here. Hypomethylation of chromosome 1 observed in EBV immortalized LCLs [4, 5] is also corroborated here by mapping the genes to chromosomes. Extending previous work identifying un-annotated genes [6], 80 genes were identified that are expressed in the three LCLs, not expressed in hOPTRS, and missing from the GENCODE, RefSeq and RefSeqGene databases. KEATS introduces a method of determining expression counts based on a partitioning of the known annotated genes, runs in a few hours on a personal workstation, and provides detailed reports that facilitate debugging.
Introduction
Epstein Barr Virus (EBV) transforms resting B-cells into immortalized lymphoblastoid cell lines (LCL) [7], providing a continuous source of peripheral blood lymphocytes [8] to help model conditions in which these lymphocytes play a key role [9–11]. LCLs show high expression of several B-cell activation markers (FCER2, CD70, CD30, etc.) [12], and are extensively used to predict clinical response to anticancer drugs [13].
Pacific Biosciences (PacBio) sequencing [14] generates much longer reads than second-generation sequencing technologies [15, 16], at the cost of lower throughput, a higher error rate and a higher cost per base [17, 18]. These longer reads alleviate the assembly issues associated with short-read methods [19, 20]. The unprecedented volumes of data generated by fast-evolving sequencing technologies necessitate the development of different pipelines to process and analyze these data. Transcriptomes are under-utilized when annotating genomes [21–23], as demonstrated on the walnut genome [24]. Previously, the MCF-7 transcriptome (2013 version, provided by PacBio) was used to find transcripts that have no annotation in the current RefSeq and GENCODE databases, and are predominantly absent in the heart, liver and brain transcriptomes also provided by PacBio [6]. Shorter fragments of some of these transcripts were also found in seven tissues analyzed in a recent RACE-seq study (Accid:ERP012249) [25].
In the current work, three transcriptomes of LCLs from a parent-daughter trio (GM128LCLs) [1] were used to generate a consensus-based catalogue of genes over-expressed in these cell lines compared to the transcriptome of twenty different normal tissues (hOPTRS) [2]. This analysis required the development of a kmer-based assembly program within YeATS, named KEATS. KEATS identified 765 genes that are expressed in GM128LCLs but not found in hOPTRS. A recent report described a patient with EBV-positive large B-cell lymphoma that was ‘unusually lacking various B-cell markers’ but over-expressed CD30 [3], a gene ranked 79 among the uniquely expressed genes here. Furthermore, 1361 other genes were identified that had basal expression in hOPTRS but higher expression in GM128LCLs. Hypomethylation of chromosome 1 observed in EBV immortalized LCLs [4, 5] is also corroborated here by mapping the genes to chromosomes. Extending previous work identifying un-annotated genes [6], 80 genes were identified that are expressed in the three LCLs, not expressed in hOPTRS, and missing from the GENCODE, RefSeq and RefSeqGene databases. Thus, a catalogue of genes that characterize LCLs, a model for studying many kinds of cancer, is generated.
Results and discussion
Tilgner et al. [1] provided the PacBio transcriptomes (SRAid:SRP036136) of three LCLs (GM128LCLs) from a parent-daughter trio (GM12878:n=715902, GM12891:n=586527 and GM12892:n=573590), while another study provided the PacBio transcriptome of a diverse pool of RNA samples representing 20 human tissues (hOPTRS) [2].
Over-expressed genes in GM128LCLs with no expression in hOPTRS
Table 1 enumerates the first twenty genes with no corresponding transcripts in hOPTRS (see FILE:overExpressedCutoff10.txt for the complete list, n=765). The MIR155 gene, which encodes the MiR-155 microRNA and is the most highly over-expressed gene in the GM128LCLs, is a widely studied gene known to promote the development and aggressiveness of B cell malignancies [26–29]. Another study using Northern blotting demonstrated that MIR155 has a 10- to 30-fold higher copy number in LCLs than in normal circulating B cells [30].
Over-expressed genes in GM128LCLs with basal expression in hOPTRS
Table 2 shows transcripts whose counts are <10 in hOPTRS and >10 in GM128LCLs (see FILE:overExpressedCutoff10.txt for the complete list, n=1361). The value 10 is an empirical cutoff. Ideally, a statistical test should be used to identify over-expressed genes, but this is not expected to significantly alter the rankings of the top-ranked genes presented here. MS4A1, the most over-expressed gene, encodes the B-lymphocyte antigen CD20, which is expressed ubiquitously on the surface of B-cells at almost all stages. Anti-CD20 monoclonal antibodies are used for the treatment of patients with B-cell malignancies [31], although CD20 was shown to have no prognostic value in acute lymphoblastic leukemia [32].
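The two categories of over-expression described above can be sketched as a simple filter. This is a minimal illustration in Python and not the KEATS implementation: the function name, the input layout, and the rule that all three LCLs must exceed the cutoff are assumptions, and the counts are made up.

```python
CUTOFF = 10  # empirical cutoff stated in the text


def classify(lcl_counts, hoptrs_count, cutoff=CUTOFF):
    """Return 'unique', 'basal', or None for one gene.

    lcl_counts:   counts in the three LCLs (hypothetical layout)
    hoptrs_count: count in the pooled tissue transcriptome
    Assumption: every LCL must exceed the cutoff (a consensus rule).
    """
    if min(lcl_counts) <= cutoff:
        return None
    if hoptrs_count == 0:
        return "unique"  # no expression in hOPTRS
    if hoptrs_count < cutoff:
        return "basal"   # low expression in hOPTRS
    return None


# Made-up counts, for illustration only.
examples = {
    "geneA": ([120, 95, 110], 0),
    "geneB": ([40, 35, 50], 3),
    "geneC": ([40, 35, 50], 25),
    "geneD": ([4, 35, 50], 0),
}
labels = {gene: classify(lcl, hop) for gene, (lcl, hop) in examples.items()}
print(labels)
```

A statistical test, as the text suggests, would replace the fixed threshold in `classify` while leaving the two-category structure unchanged.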
Assignment of genes to chromosomes corroborates the hypomethylation of chromosome 1
Table 3 shows that chromosome 1, known to be hypomethylated in EBV immortalized LCLs [4, 5], has the largest number of over-expressed genes. It has been shown that demethylation of satellite 3 DNA in chromosome 1 leads to increased transcription in senescent cells and in A431 epithelial carcinoma cells [33]. Hypomethylation of chromosomes 1 and 16 has also been linked to Wilms tumors [34]. Also, ‘chromosome 1 is involved in quantitative anomalies in 50-60% of breast tumours’ [35], and three common genes (MLLT11, MTX1 and HIV-1) from chromosome 1 are reported here as overexpressed (see FILE:overExpressedCutoff10.txt).
Issues with RefSeq
In Table 2, Accid:NG_011221.1, marked with a single asterisk, is annotated in RefSeq under a different ‘facet’, and was thus not downloaded automatically. Ideally, this should have been part of the ‘mRNA’ facet. This can be an issue when benchmarking RefSeq against GENCODE [36]. Accid:AL121985.13, marked with a double asterisk, is not annotated in RefSeq, but is annotated in GENCODE (Id:OTTHUMT00000479908). This gene is antisense to the CD48 antigen, a protein found on the surface of immune cells [37].
The utility of a comprehensive catalogue
A recent case of EBV-positive diffuse large B-cell lymphoma (DLBCL) was found to lack various standard B-cell markers while over-expressing CD30/TNFRSF8 [3]. That study reported that the DLBCL case was ‘positive for CD30 and MUM-1, not defining the lineage of tumor cells’ [3]. However, a previous study had reported that ‘CD30 was expressed in 14% of DLBCL patients. Patients with CD30+ DLBCL had superior 5-year overall survival’ [38]. Regardless, the current study identifies CD30/TNFRSF8 as a gene uniquely expressed in LCLs, and its ranking (79) shows that there are at least 78 other possible biomarkers (although many of them, like MIR155, are known and well established).
Expression counts - detailed reporting in KEATS
KEATS provides a detailed reporting system to help debug results. Take NM_001243.4 (length=3706) in Table 1, which has a raw count of 24 in GM12878. The matching transcripts are reported in a file that includes the lengths of the individual transcripts (Table 4). Normalization is achieved by dividing the sum of these lengths by the length of NM_001243.4, leading to a normalized count of 11 in GM12878 (Table 1).
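Written out, the normalization described above takes the following form. The symbols are introduced here for illustration only: c_g is the normalized count of annotated sequence g, T_g is the set of transcripts matched to it, and \ell denotes sequence length.

```latex
c_g = \frac{\sum_{t \in T_g} \ell(t)}{\ell(g)}
```

For NM_001243.4, \ell(g) = 3706. If the count of 24 is read as the number of matching transcripts, a normalized count of 11 would correspond to a summed transcript length of roughly 11 × 3706 = 40766 bases. The rounding rule is not stated in the text, so this figure is only approximate.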
Unannotated genes
Extending previous work identifying un-annotated genes [6], 80 genes were identified that are expressed in GM128LCLs, not expressed in hOPTRS, and missing from the GENCODE, RefSeq and RefSeqGene databases (FILE:notannotated.fa). Table 5 shows the annotation of these transcripts obtained from a BLAST search against the complete ‘nt’ database.
Materials and methods
GENCODE dataset
GENCODE release 25 was obtained from https://www.gencodegenes.org/ (release date 07/2016). Two files - gencode.v25.transcripts.fa (n=200k) and gencode.v25.lncRNA_transcripts.fa (n=27k) - were combined to create a single database (FILE:gencode.v25.ALL.fa.list, n=225785).
RefSeq dataset
The RefSeq database was created from https://www.ncbi.nlm.nih.gov/nuccore. The current RefSeq database has about 200K sequences; the ‘genomic DNA/RNA’ facet (about 20K) was ignored, leaving the mRNA, rRNA, cRNA, tRNA and ncRNA ‘facets’ (FILE:mrna.refseq.180k.fa.list, n=180k). Another set (RefSeqGene) was obtained from ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/RefSeqGene/ (FILE:RefSeqGene.ALL.fa.list, n=6569).
PacBio transcriptomes
The PacBio transcriptomes from the parent-daughter trio (GM12878, GM12891 and GM12892) were obtained from SRAid:SRP036136 [1]. The PacBio transcriptome from twenty tissues has been provided at http://www.stanford.edu/~htilgner/2013_NBT_paper/data/hOP.all.input.ccs.fa.gz [2].
kmer-based partitioning of ANNODB
GENCODE, RefSeq and RefSeqGene were combined to form a single database (ANNODB), which has redundancies. A kmer-based partitioning algorithm groups the 400K sequences of ANNODB into ∼100k clusters. The clustering algorithm first identifies pairs of sequences that have a kmer of length 100 in common. A partition is then created such that each sequence in a multi-sequence cluster shares a kmer of length 100 with at least one other sequence in the same cluster that maps to the same chromosome. The longest sequence in a cluster is chosen as the representative of that cluster. In the case of completely annotated datasets, like RefSeq, this partitioning can alternatively be done using the gene id.
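The partitioning described above amounts to finding connected components among sequences linked by a shared kmer on the same chromosome. The following Python sketch illustrates that idea with a union-find structure. It is not the KEATS code: the function name, the input dictionaries and the tie-breaking rule are assumptions, reverse complements are ignored, and the toy example uses k=4 rather than 100 so that the sequences stay readable.

```python
from collections import defaultdict


def partition(seqs, chrom, k=100):
    """Group sequences that share at least one k-mer and a chromosome.

    seqs:  dict of sequence id -> nucleotide string
    chrom: dict of sequence id -> chromosome label
    Returns dict of representative id -> sorted list of member ids.
    """
    parent = {sid: sid for sid in seqs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Remember the first sequence seen for each (chromosome, k-mer) key;
    # any later sequence with the same key joins that sequence's cluster.
    first_seen = {}
    for sid, s in seqs.items():
        for i in range(len(s) - k + 1):
            key = (chrom[sid], s[i:i + k])
            other = first_seen.setdefault(key, sid)
            if other != sid:
                parent[find(sid)] = find(other)

    clusters = defaultdict(list)
    for sid in seqs:
        clusters[find(sid)].append(sid)

    # The longest member represents its cluster (ties broken by id,
    # which is an assumption of this sketch).
    result = {}
    for members in clusters.values():
        rep = max(members, key=lambda m: (len(seqs[m]), m))
        result[rep] = sorted(members)
    return result


# Toy data: tx1 and tx2 overlap on chr1; tx4 shares k-mers with tx1
# but maps to another chromosome, so it stays separate.
toy = {
    "tx1": "ACGTACGGTTCA",
    "tx2": "GGTTCAAT",
    "tx3": "TTTTGGGGCCCC",
    "tx4": "ACGTACGG",
}
toy_chrom = {"tx1": "chr1", "tx2": "chr1", "tx3": "chr2", "tx4": "chr5"}
toy_partition = partition(toy, toy_chrom, k=4)
print(toy_partition)
```

Storing every kmer as a string, as done here, would be memory-hungry for 400K sequences at k=100; a practical implementation would likely hash or sample the kmers.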
kmer-based counts in the transcriptome
Sequences in the transcriptome are matched, using kmers of length 100, to the non-partitioned ANNODB. Based on the partitioned ANNODB, counts are generated for the representative sequence of each cluster. The counts are normalized by summing the lengths of the matching transcripts and dividing this sum by the length of the representative sequence.
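The counting step can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the KEATS implementation: the function names and input dictionaries are hypothetical, a read is credited once to every representative that any of its kmers hits, and the toy data use k=4 instead of 100.

```python
from collections import defaultdict


def build_index(annodb, k=100):
    """Map each k-mer to the set of annotated sequence ids containing it."""
    index = defaultdict(set)
    for sid, s in annodb.items():
        for i in range(len(s) - k + 1):
            index[s[i:i + k]].add(sid)
    return index


def normalized_counts(reads, annodb, rep_of, k=100):
    """Length-normalized count per representative sequence.

    reads:  dict of read id -> sequence (the transcriptome)
    annodb: dict of annotated id -> sequence (non-partitioned)
    rep_of: dict of annotated id -> representative id (from the partition)
    """
    index = build_index(annodb, k)
    total_len = defaultdict(int)
    for s in reads.values():
        reps = set()
        for i in range(len(s) - k + 1):
            for sid in index.get(s[i:i + k], ()):
                reps.add(rep_of[sid])
        for rep in reps:
            total_len[rep] += len(s)  # sum of matching read lengths
    # Divide by the length of the representative sequence.
    return {rep: total / len(annodb[rep]) for rep, total in total_len.items()}


# Toy data: refA2 belongs to the cluster represented by refA.
annodb = {"refA": "ACGTACGGTTCA", "refA2": "GGTTCAAT", "refB": "TTTTGGGGCCCC"}
rep_of = {"refA": "refA", "refA2": "refA", "refB": "refB"}
reads = {"r1": "CGGTTC", "r2": "TCAATG", "r3": "GGGGCC", "r4": "AAAAAA"}
counts = normalized_counts(reads, annodb, rep_of, k=4)
print(counts)
```

In the toy data, read r2 matches only refA2, yet its length is credited to refA, the representative of that cluster. This mirrors the matching against the non-partitioned database followed by counting on the partitioned one.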