ABSTRACT
Despite the long-held assumption that transposons are normally only expressed in the germ-line, recent evidence shows that transcripts of LINE sequences are frequently found in the somatic cells. However, the extent of variation in LINE transcript levels across different tissues and different individuals, and the genes and pathways that are co-expressed with LINEs are unknown. Here we report the variation in LINE transcript levels across tissues and between individuals observed in the normal tissues collected for The Cancer Genome Atlas. Mitochondrial genes and ribosomal protein genes were enriched among the genes that showed negative correlation with L1HS in transcript level. We hypothesize that oxidative stress is the factor that leads to both repressed mitochondrial transcription and LINE over-expression. KRAB zinc finger proteins (KZFPs) were enriched among the transcripts positively correlated with older LINE families. The correlation between transcripts of individual LINE loci and individual KZFPs showed highly tissue-specific patterns. There was also a significant enrichment of the corresponding KZFP’s binding motif in the sequences of the correlated LINE loci, among KZFP-LINE locus pairs that showed co-expression. These results support the KZFP-LINE interactions previously identified through ChIP-seq, and provide information on the in vivo tissue context of the interaction.
INTRODUCTION
LINE transposable elements (TEs) comprise more than 17% of the human genome. Most LINE elements are incapable of retrotransposition, except for the few elements among the youngest families including L1HS. In addition to generating insertional mutations and causing disease through its retrotransposition activities (reviewed in (1)), L1HS also causes damage to the cell through aberrant expression of its sequence (Reviewed in (2)). Aberrant expression of full-length or even partial L1HS sequence that contains the ORF2 is known to cause double-strand DNA breaks (DSBs) (3, 4). This excess damage in DNA can lead to cell cycle arrest and apoptosis (3, 5), or a senescence-like state (4).
LINE elements have long been thought to be expressed only in the germline cells (6–8), but recently we have learned that both full-length and partial transcripts of LINEs are frequently found in the somatic cells (8–10) with large variation in expression levels across tissue types. A large number of different LINE element sites are expressed in human somatic tissues, and this expression varies among different individuals (11). The level of LINE expression is especially pronounced in cancer cells. Endogenous expression of LINEs have been observed early on in germ cell tumors (GCTs) (12), and GCT-derived embryonal carcinoma cell (ECC) lines (13), and in numerous other types of cancer (14) more recently. There are hypotheses considering LINE activity as a driver of tumorigenesis or LINE expression as one of the hallmarks of cancer (15).
Although there are many reports of LINE expression in the somatic cells, how LINE expression is repressed and de-repressed in human somatic cells is still largely unknown. Based on what we have learned so far, LINE expression is regulated through multiple layers, consisting of transcription factors, DNA methylation, PIWI-interacting RNAs (piRNAs), RNA interference (RNAi), and posttranscriptional host factors. Full-length human LINE elements contain a 5’ Untranslated Region (UTR) that includes an internal RNA polymerase II promoter (16), as well as binding sites for RUNX3 (17), SRY (18) and YY1 (19). LINE expression is also regulated through cell-specific and stage-specific epigenetic mechanisms (2). Expression of LINE full-length transcripts is correlated with differential methylation of the promoter region in various cell lines (45), and in fetal germ cell development in male mice (20). Loss-of-function studies in mice suggest that piRNAs guide de novo methylation to retrotransposons during germline development (21, 22), and an intact pi-RNA pathways is required to prevent L1 mobilization in male meiosis (23). A large number of proteins associated with LINE in humans at the post-transcription stage has been identified through proteomic studies of the LINE ORFs and the ribonucleoprotein (RNP) complex. LINE RNP comprises of its two ORF proteins bound preferentially to their encoding (cis) LINE RNA. Other proteins modulating retrotransposition are also associated with the RNP complex. So far, four studies have utilized immunoprecipitation and proteomics and identified altogether about a hundred proteins (24–27). Recently, CRISPR–Cas9 screen was used to identify proteins that restrict LINE activity (28). The protein MORC2 and the human silencing hub (HUSH) complex was shown to selectively bind evolutionarily young, full-length LINEs located within euchromatic environments, and promote deposition of histone H3 Lys9 trimethylation (H3K9me3) for transcriptional silencing (28).
Among the proteins participating in regulation of TEs, Krüppel associated box-Zinc Finger Proteins (KRAB-ZFPs) are an interesting gene family. The Krüppel associated box (KRAB) domain is a well-known repressor domain and together with the best-known co-factor KAP1 (TRIM28), the KZFP-KAP1 complex has been shown to silence both exogenous retroviruses and EREs during embryonic development (29, 30). Based on this observation, and the pattern of co-evolution of retroviral LTRs and the C2H2-Zinc Finger gene family, it has been hypothesized that the KRAB-ZFPs function in transposable element suppression (31). But except for a few KRAB-ZFPs, most members do not have a characterized function. In an alternative hypothesis, instead of its original role in silencing, it was proposed that KRAB-ZFPs may control the domesticated transposable elements that contribute to the host transcription regulation network (32). Several studies have now reported observations that suggest that older LINE elements are functional members of the host transcriptome, possibly acting as alternative promoters, enhancers or precursors of long non-coding RNAs. Faulkner et al. in 2009, was the first study to provide a global picture of the significant contribution of retrotransposons to human transcriptome in multiple tissue types (33). This report showed that 6-30% of transcripts had transcription start sites located within transposons, and these transposons were tissue-specifically expressed and influenced the transcription of nearby genes. The results were extended by Djebali et al. in 2012 showing again the tissue-specificity of transposon expression, and that most of these transcripts are found in the nuclear part of the cell (34).
Recently, high-throughput RNA-seq data of various types of cancer samples and their normal counterparts have become available in The Cancer Genome Atlas (TCGA) (35–37). By focusing on the normal tissue samples from TCGA, we can access thousands of natural experiments across various types of tissues that show normal variation in LINE transcript levels, and obtain a global picture of LINE expression and regulation in humans. Since the samples are collected from fresh-frozen tissues, LINE transcript levels are observed in vivo, complementing the studies that focus on retrotransposition assays or transposon expression in human cell lines. An important strength of this approach is the large number of samples collected for each tissue type. The variation in LINE element transcripts observed in multiple samples within each tissue, allowed us to analyze the co-expression patterns between genes and LINEs. We hypothesized that genes that regulate the transcription level of LINEs would show correlation in expression levels with the LINE transcripts.
We first summarize the survey of LINE expression variation found in the RNA-seq data from 660 samples of normal tissue. We confirm the earlier findings that LINE expression varies across tissue types, with the tissues esophagus, stomach and head and neck showing the highest level of L1HS transcripts. Transcript levels of individual LINE loci are highly tissue specific and within each family only a few individual loci are highly expressed, contributing to the bulk of the transposon transcripts at the family level. We also find large variation in LINE expression across individual samples within each tissue type. The expression of the L1HS family varies by as low as 2-fold in tissues like kidney, to greater than 8-fold in breast, thyroid, and many other tissues.
By analyzing the correlation between individual genes and L1HS transcripts, controlling for the older LINE transcript levels, we found a significant excess of mitochondrial genes and ribosomal protein genes negatively correlated with L1HS in transcript level. Several of the genes that recurrently show strong negative correlation across tissues, have functions related to oxidative stress. We also found an enrichment of KRAB-ZFPs that were positively correlated with older LINE expression, recurrently in multiple tissue types. Binding motifs of the KRAB-ZFP proteins are enriched in the sequences of LINE loci for co-expressed KZFP-LINE pairs, when compared to random KZFP-LINE loci combinations that are not co-expressed.
MATERIAL AND METHODS
RNA-Seq and gene expression quantification in the normal tissues
We used the gene level quantification provided by The Cancer Genome Atlas (TCGA) for the gene expressions. We collected gene level quantifications for 660 samples from The Cancer Genome Atlas (TCGA). We focused on cancer types that had at least 6 normal samples of RNA-seq data, collected from normal tissue adjacent to the cancer tissue. As a result, 17 different tissue types were included in our analysis: BLCA (Bladder urothelial carcinoma), BRCA (Breast carcinoma), CHOL (Cholangiocarcinoma), COAD (Colon adenocarcinoma), ESCA (Esophageal adenocarcinoma), HNSC (Head and neck squamous cell carcinoma), KICH (kidney chromophobe), KIRC (kidney renal clear cell carcinoma), KIRP (Kidney renal papillary cell carcinoma), LIHC (Liver hepatocellular carcinoma), LUAD (Lung adenocarcinoma), LUSC (Lung squamous cell carcinoma), PRAD (Prostate adenocarcinoma), READ (Rectum adenocarcinoma), STAD (Stomach adenocarcinoma), THCA (Thyroid carcinoma) and UCEC (Uterine Corpus Endometrial Carcinoma). Number of samples for each tissue is described in Table 1. Although we will use the acronym for the cancer type to describe these tissues, we emphasize again that all our samples come from the normal tissues collected from the same organ of the same patient with the tumour. The tumour tissue samples were not included in our analysis.
Methods for sequencing and data processing of RNA using the RNA-seq protocol have been previously described for The Cancer Genome Atlas (TCGA) (35–37). Briefly, RNA was extracted, prepared into poly(A) enriched Illumina TruSeq mRNA libraries, sequenced by Illumina HiSeq2000 (resulting in paired 48-nt reads), and subjected to quality control. RNA reads were aligned to the hg19 genome assembly using Mapsplice (38). Gene expression was quantified for the transcript models corresponding to the TCGA GAF2.1 using RSEM (39). We used the raw_count values in the .rsem.genes.results files, rounded to an integer, as the gene level quantification.
Quantifying TE derived transcripts at the locus and family level
We collected RNA-seq level 1 binary alignment files (.bam files) for 660 samples (Table 1) from The Cancer Genome Atlas. These bam files are results of the RNA-seq and MapSplice protocol described in the previous section. We used a modified version of the software TEtranscripts (40) for quantifying the reads mapping to annotated transposons. TEtranscripts is a software that can quantify both gene and TE transcript levels from RNAseq experiments. It takes into account the ambiguously mapped TE-associated reads by proportionally assigning read counts to the corresponding TE families using an Expectation-Maximization algorithm. We implemented two modification to the original TEtranscripts software. 1) We modified it to report read counts for each individual TE locus in the reference genome in addition to the family level counts. 2) We developed a function to discount the read counts by removing read counts that correspond to transcripts containing TE sequences that originate from pre-mrna or retained introns in the mature RNA (41). If the reads mapping to TEs are part of the pre-mrna or retained introns, we should see continuous mapping of reads that span the introns flanking the TE of interest. We utilized the read depths in the flanking introns to proportionally reduce the number of total reads mapped to the TE using the following formula:
IL : intron left to TE. IR : intron right to TE.
RIL : read depth of the intron left to the TE
countIL : read counts mapped to the intron left to the TE (includes multi-mapped reads)
lenIL : length of the left intron
read_len : length of the sequencing read
countTE′: count of reads mapped to TE after the correction.
In addition to the discounted quantification, we also used the option in the TEtranscripts software to quantify the TE transcripts using only the uniquely mapped reads. Downstream analyses were done using both the discounted quantification based on multi-mapped reads and the uniquely mapped quantification, to assess the impact of uncertainty in multi-mapped reads.
The retrotransposon annotations used were generated from the RepeatMasker tables, obtained from the UCSC genome database and provided by TEtranscripts. For quantifying reads mapping to the TE flanking introns we generated gtf files containing 1) the TE flanking intron positions, 2) the intergenic TE positions, 3) the exonic TE positions (TEs that fall within an exon, including non-coding RNA genes). In case of intronic TEs, we use the algorithm described above to discount the transcripts from pre-mrna or retained introns. In case of intergenic TEs, we count all EM estimated reads mapped to TEs without any discount. In case of exonic TEs, we ignore those counts altogether, and the exonic TEs do not contribute to the locus count nor the family level count. The modified version of the TEtranscripts software and the required gtf files can be found at https://github.com/HanLabUNLV.
Normalization and transformation of read counts
After quantifying the reads mapping to annotated genes and TEs, both the gene level counts, and the TE counts were normalized between samples across all tissue types with DEseq2. We used the default “median ratio method” for normalization in DESeq2 (42). Briefly, the scaling factor for each sample is calculated as median of the ratio, for each gene, of its read count to its geometric mean across all samples. The assumption of the median ratio method is that most genes are not consistently differentially expressed between tissues. If there is systematic difference in ratio between samples, the median ratio will capture the size relationship. But, this assumption may be violated when we are comparing large number of tissues types at the same time, since a large proportion of the genes may be differentially expressed in at least one tissue type, or one of the tissues may be extremely biased in their number of differentially expressed genes. In order to achieve more robust normalization, we used a two-step normalization method called the differentially expressed genes elimination strategy (DEGES) (43). We performed preliminary normalization using the “median ratio method”, filtered out potential differentially expressed genes in the data, found a subset of robust non-differentially expressed genes, and used the subset to perform the second round of “median ratio normalization”. The resulting pairwise MA plot between tissues after normalization showed better normalization compared to the regular one-step normalization (Supplementary Figure 1). The size factors for each sample obtained from the two-step normalization on gene counts were then used to normalize the TE quantifications of the same sample. The normalized counts were log2 transformed using the variance stabilizing transformation function in DESeq2 (42, 44) for downstream analysis.
Clustering of samples by expression pattern
We visualized the expression patterns in the data using the default complete linkage clustering in hclust, with the pheatmap function in R (45). The top 1000 genes or TEs, with the largest variances were used for the clustering and visualization. For the locus level TE expression, we filtered out all loci that had less than 5 read counts for every sample. To compare the clustering of samples based on gene expression vs. TE expression, we used the normalized Mutual Information measure (46). The hierarchical clusters were cut off at k = 17, the number of different tissue types. Because the resulting clusters were not accurate enough to distinguish between similar tissues, we used a broader tissue grouping to compare with the clusters. The tissues were grouped to 10 broader types based on preliminary clustering: bladder/endometrium (BLCA, UCEC), breast (BRCA), liver/bile duct (CHOL, LIHC), colon/rectum (COAD, READ), esophagus/stomach (ESCA, STAD), head and neck — the squamous epithelium in the mucosal surfaces inside the mouth, nose, and throat (HNSC), kidney (KICH, KIRC, KIRP), lung (LUAD, LUSC), prostate (PRAD), and thyroid (THCA). The broader tissue type of each sample was used as the ground truth. Each resulting cluster was then assigned a group label based on the majority tissue type. Normalized mutual information was calculated by comparing the labels from the clustering to the true class labels.
Correlated expression between genes and LINEs
Before examining genes that are associated with L1HS expression, we first examined the clinical variables to check any confounding variables associated with the L1HS levels in the normal tissue. We tested the variables age, days to death, pathological stage, T staging, N staging, M staging, gender, radiation and race for each tissue type. Radiation therapy was the only clinical variable associated with L1HS expression in the normal tissue of thyroid, and thus was the only clinical variable included as a covariate in our linear models and only for the thyroid tissue.
Correlation between gene and L1HS transcripts were tested in each tissue groups separately, in bladder, breast, liver/bile duct, colon/rectum, stomach/esophagus, head and neck, kidney, lung, prostate and thyroid. We tested 20532 genes for each tissue group using a linear model with log2 L1HS expression as the dependent variable, and log2 gene expression as the independent variable. For a gene to be included in our test, it had to be present in at least eight individual patients. We also required that the gene be expressed with a minimum RPM of 2 in 75% of the samples to be included in the dataset. In addition to the radiation therapy for thyroid tissue, we considered effective library size (sum of all normalized counts) and the batch ID provided by the TCGA project as additional covariates. The linear model we used is described below.
We tested all combination of linear models that can be created by including or excluding these variables. Second-order Akaike Information Criterion (AICc) was used to select the best linear model. The coefficient and p-value used are the ones from the best model.
After the preliminary analysis, we found there was significant correlation between the level of L1HS and the level of the older LINE elements. To tease apart the effects, we separated the total LINE quantification into L1HS and older LINEs. L1HS was the sum of read counts mapping to all the individual L1HS loci. An additional variable that we called “oldLINEs” was quantified as the sum of read counts mapping to all the LINE elements except for L1HS, L1PA2 and L1PA3. We estimated the linear models above for L1HS, this time including the covariate oldLINEs. To identify genes correlated with older LINE elements, we also estimated the linear models with oldLINEs as the dependent variable, using L1HS as the covariate.
Genes showing recurrent correlation with L1HS and older LINEs in multiple tissues
For the gene-L1HS correlation analysis, we decided to focus on global correlation signals that appear in multiple tissue types. In order to identify these global recurrent correlations, we used two different methods. First, we collected genes that showed significant gene coefficients from the linear model (q-value < 10e-4) in more than one tissue type. But, because this method relies on the p-value of the coefficient, it is biased towards picking up genes in tissues with large sample sizes. So, we also utilized a second measure, called REC score (47), that utilizes the ranks of correlations instead of the absolute p-values to find recurrent correlation across tissues. This rank-based approach ensures that individual tissue types are weighted equally, and limits bias from tissues with large sample sizes or from strong associations measured in only a single tissue type. We modified the REC score slightly to test all gene relationship against a single dependent variable (L1HS or oldLINEs).
Let Lk be the ordered vector of negative and positive associations involving mRNA j in cancer type k. The strength of the association between mRNA j and L1HS in cancer type k is defined by the relative rank: where, rrj,k is the rank of the association between mRNA j in the vector Lk. The rest of the formulas for calculating RECj, the REC score for mRNA j follow the methods described in (47).
Once genes with recurrent correlations were identified, we used the gene enrichment analysis software DAVID (48) with the positively and negatively correlated gene sets to identify functional clusters that are enriched among the correlated genes. We also derived a ranked list of these genes based on their sign of the gene coefficient and the maximum partial eta-squared values for the gene coefficient from the linear model. We used this pre-ranked list of genes with recurrent correlations to run the Gene Set Enrichment Analysis (GSEA) software (49) against the curated pathways gene set and the GO terms gene set of the MSigDB collections (50).
Correlation between individual LINE loci and KRAB-ZFPs
To understand the positive correlation between LINEs and KRAB-ZFPs, we looked at the correlation between each KZFPs and individual LINE loci in different tissue types. We tested the correlation for 366 KRAB Zinc Finger Proteins that were identified in Imbeault et al. (51) and also found in our gene expression data. Because the search space of pairwise combinations of KZFP and individual LINE loci was too large, we examined the relationship in a step-wise approach. In the first step, we tested the correlation between all pairwise combinations of 366 KZFPs and 146 LINE subfamilies using the TE quantification at the family level in each tissue type. Then, in the second step, once the significantly correlated KZFP and LINE family was identified, we focused on those pairs. We tested the correlation between the expression of the significant KZFP and the expression of each individual locus of the significant LINE family in the tissue where the initial co-expression was found to identify individual LINE loci that are co-expressed with the KZFP.
KRAB-ZFP motif search in the sequences of co-expressed LINE loci
To test the presence of zinc finger motifs in sequences of co-expressed LINE loci, we searched for known KZFP motifs using the software Find Individual Motif Occurrences (FIMO) (52). We used the motifs of the 39 KZFPs identified by Barazandeh, et al. (53) through a comparison of ChIP-Seq results. Taking only the zinc finger genes from our dataset that had a zinc finger motif available from Barazandeh, et al. (53), we had a total of 10,782 pairwise combinations of zinc finger motifs and individual L1 locus that showed significant co-expression. We searched the corresponding motifs in the respective L1 sequence using the default p-value cutoff from the FIMO program, p-value < 0.0001. Afterwards, we summed up the total number of motifs identified in the L1 sequences as well as the total number of unique zinc finger motifs identified in each of the L1 sequences, as some motifs are identified multiple times in an L1 sequence.
To test the significance of our results, we generated two different null models for comparison. For the first null model, we randomly sampled 10,782 pairs of KZFP motifs and L1 loci out of all available KZFPs and L1 loci in our gene expression dataset, regardless of whether they are co-expressed or not. For the second null model, we aimed to construct a null model that reflects the distribution of the L1 families in our significantly co-expressed pairs. We thus started with the 10,782 pairs of zinc finger motifs and L1 sequences that were identified as significantly co-expressed, but we created new pairwise relationships by randomly permuting the LINEs against the KZFPs so that the co-expression relationships are disrupted. We generated 1000 sets of each null model, and tested for the existence of the KZFP motif in the paired L1 sequence. We then summed up the total number of zinc finger motifs identified in the L1 sequences as well as the total number of unique motifs in each of the L1 sequences. We fit a normal distribution to the total number of motifs identified for the 1000 set of null models, and calculated the p-value of our observed count of motifs.
Validation data set from the GTEx project
In addition to the normal samples from TCGA, we also obtained normal lung tissue RNA-seq files from the Genotype-Tissue Expression (GTEx) project (54) as an independent set of samples for validation. GTEx donors are identified through low-post-mortem-interval (PMI) autopsy or organ and tissue transplant settings, in which biospecimens collection can start within 24 hours of death or surgery. It is expected that the majority of the tissue samples in this dataset is free of major disease. The RNA-seq protocol for the GTEx project has been previously described in (54). Briefly, RNA was extracted to a non-strand specific poly(A) selected Illumina TruSeq library, sequenced by HiSeq 2000 into 76bp paired-end reads, with median coverage of 82M total reads. The reads were mapped to the GENCODE 19 annotation using STAR (55). We downloaded 210 aligned bam files from dbGaP for the lung tissue, chosen because of its large sample size compared to other tissues. The analyses described above were repeated identically for these 210 samples.
RESULTS
TE transcripts are quantified with correction to discount TE reads originating from pre-mRNAs or retained introns
In general, we find that there is good correlation between the total raw counts, and the reduced counts of the TE family after our correction (Figure 1). We also visually confirmed several examples of clear L1HS expression that are not part of the surrounding intron (supplementary Figure 2). But there were a few samples with large differences before and after correction, such that the overall distribution of the corrected difference was positively skewed (Figure 1, Supplementary Figure 3). On average, the total read count for L1HS was reduced by around 426 reads for each sample. There were 10 patients with large reduction in L1HS read counts, greater than twice the standard deviation above the mean, and these outliers with large corrections came from two tissue types, the stomach and the esophagus (Table 2). When looked in detail these extremely large corrections came from very few specific L1HS loci. The most extreme case was due to a single L1HS locus, L1HS_dup39 on chromosome 1, 91853148-91853486 (Supplementary Figure 4a). It lies next to an rDNA that is highly expressed in stomach and esophagus. There was transcription read-through of the rDNA that extended beyond the rDNA boundary and continued to transcribe part of the L1HS sequence, beyond the 5-base threshold that we required for being counted as L1HS transcription. Because of this, L1HS_dup39 had a read count of 11473 reads in this sample (patient ID 6852). But after applying our correction method based on the flanking read depths, all the reads mapping to this particular L1HS locus was removed, and the quantification for the particular locus went down to zero as it should. Another example is L1HS_dup898, which was the single locus most frequently identified as the largest correction within each sample. Among the 674 samples that we quantified, L1HS_dup898 showed the largest correction in 310, almost half of the samples (Table 3a). L1HS_dup898 is found on chr 9, 85664455-85670486 in the middle of the first intron of the gene RASEF. When we looked into the read alignment in detail in one of the samples with the highest expression (patient ID A4OM), we found that there were large numbers of reads uniquely mapping to L1HS_dup898, but we also found equally large numbers of reads mapping to the surrounding intron on both the left and right side of the TE, indicating intron retention (Supplementary Figure 4b). The estimated read counts for L1HS_dup898 were reduced to zero after the correction.
Although, we focused on L1HS elements as example cases, L1HS is actually not the TE family that showed the largest corrections compared to other TE families. AluSx, AluSx1, AluJb, AluY, MIR3, MIRb, etc. showed the largest corrections, several orders of magnitude larger reduction in read counts compared to L1HS (Table 3b). As we have seen in the case of L1HS, a few specific loci were responsible for a major proportion of the correction for a TE family, depending on the tissue type. This was due to specific transcription read-throughs or intron retentions that happen more frequently in certain tissue types.
We also found cases where the method corrected for erroneous TE quantifications due to TEs embedded within long non-coding RNAs (lncRNAs). One TE locus that was most frequently identified with the largest correction in the sample was AluSx1_dup59209. AluSx1_dup59209 had very high transcript levels with an average read count of 18863 in thyroid and head and neck tissue. But when we looked into the alignment in detail, we found that the Alu element was embedded in a lncRNA gene called TTTY14. The reads mapping here were counted as AluSx1 transcripts in the original quantification. In the alignment, we see that AluSx1_dup59209 has accumulated enough mutations, such that almost all the reads mapped to it are uniquely mapped reads. It looks to be a novel case of an Alu domestication, where an Alu insertion or a secondary duplication of an original Alu insertion became part of a testis specific RNA gene (56). Since we added a step to remove all read counts mapping to TEs if the TE falls within exons or non-coding RNAs, the read counts for AluSx1_dup59209 were reduced to zero in all samples, removing the errors from the original quantification (Supplementary Figure 4c).
LINE expression shows tissue-specific expression patterns among the normal somatic tissues
There have been multiple reports of tissue specific expression of TEs in the human genome, starting from Faulkner et al. in 2009 (33) to Philippe et al. 2017 (57) more recently. We were still surprised at the extent of tissue specific information contained in the expression pattern of individual LINE loci across the genome. During our preliminary analysis examining the data quality and exploring the data, we unexpectedly found that we could cluster each sample into their broader tissue groupings, based on LINE expression patterns alone without relying on any genes at all. We checked the quality of the data by clustering the samples based on the 1000 genes with the largest variance in expression level, and confirmed that the samples were correctly grouped into the tissues based on gene expressions (Supplementary Figure 5a). Then we explored the expression of each LINE family and found that the family level quantification of LINE expression did not contain enough information to correctly cluster the samples into their corresponding tissue (Supplementary Figure 5b). But when we used the expression levels of 1000 individual TE loci with the largest variance, the samples were again correctly classified into their corresponding tissue groups, even without any information from the genes (Figure 2). We used normalized mutual information between the different clustering results and the ground truth (the true tissue group) to evaluate the quality of clustering. Normalized mutual information was compared for clustering results based on gene expression, family level LINE expression, locus level LINE expression and random assignments. We found that the locus level LINE expression was as predictive of tissue groups as the genes (Table 4).
Compared to other older LINE elements, the tissue specificity was less pronounced for L1HS expression. Still we found large variation in the amount of L1HS expression in different somatic tissues across normal tissue samples (Figure 3), consistent with the previous observation in adult human tissues (10) and in human cell lines (57). There were at least a moderate level of L1HS expression across all the tissue types we examined. Tissue with the lowest L1HS level were the liver and the bile ducts, with the median value at about 300 reads per million. Head and neck tissue, esophageal tissue and stomach tissue showed the highest level of L1HS expression in the normal tissue samples, with a median of around 1500 reads per million. This was even after large amount of corrections in stomach and esophagus samples seen in Figure 1. This is consistent with the observation by Belancio et al. showing high levels of full length LINE expressed in the adult esophagus and stomach tissue, at about 80% and 150% relative to the levels in HeLa cells (10).
When we looked at expression of each individual L1HS loci, we found that certain L1HS loci and certain individual samples were over-represented in the list of the top 100 L1HS loci with the largest read counts (Table 5a, Supplementary Table 1). For example, the locus with the highest expression level across all L1HS loci, L1HS_dup241 shows up 23 times in the top 100 list in 23 different individual samples. L1HS_dup241 is a full length L1HS element on chromosome 3 (chr3:26439509-26445536), and it is most frequently the most highly expressed L1HS locus in the genome. It is possible that the few L1HS loci that we identified to be highly expressed are also the handful of source L1s (hot L1’s) that are responsible for the majority of retrotransposition events (58–60). We tested the overlap between the 23 L1HS loci in our top 100 highly expressed list, and the 28 L1HS elements in Table S3 of Tubio et al. (60) We found 6 elements overlapping between the two lists. Considering there are a total of 863 annotated L1HS loci in the human genome (hg19), 6 overlaps are more than what we would expect if the lists were independent. Similar to the over-represented loci, certain tissue samples were also over-represented in the top 100 list. For example, a head and neck tissue sample from patient ID 8663 shows up 9 times in the top 100 list with 9 different L1HS loci (Table 5b, Supplementary Table 1). This suggests that multiple L1HS loci are upregulated by a common regulatory mechanism across different chromosomes in certain individuals. There were six stomach tissue samples, five head and neck, and five esophagus tissue represented in the top 100 list, again showing that head and neck, esophagus and stomach showed higher L1HS expression not only at the family level but also at the individual locus level (Table 5b).
Specific sub-families of SINEs, ERVs and MERs show recurrent positive correlation with LINE expression in multiple normal tissues
Since TEtranscripts quantifies the reads mapped to different transposon families (40), we were able to study the co-expression of different transposons with L1HS. Based on the estimates of TEtranscripts, we found multiple LINE subfamilies closely related to L1HS showing up with the strongest correlation in expression levels with L1HS (Supplementary Table 2a). In fact, the correlations between L1HS and many repeat families were much stronger than the relationship between L1HS and any of the genes. The correlation between closely related LINE subfamilies is expected because reads from transposons that map to sequences that are indistinguishable between subfamilies are assigned to multiple subfamilies with proportional weight by TEtranscripts using an Expectation-Maximization algorithm. What was not expected, was that we also found several DNA transposons and Endogenous retroviruses (ERVs) that are highly and recurrently correlated with L1HS expression in multiple tissues (Supplementary Table 2b). Since there was wide-spread correlation between different LINE subfamilies and multiple transposons outside LINEs, we decided to tease apart the correlation using a multivariate model including separate quantification for the L1HS element and the older LINEs. When we controlled for the RNA level of older LINE elements as a covariate, most of the correlation seen between non-LINE transposons and L1HS disappeared, and the elements correlated with L1HS was limited to the most closely related L1 elements (L1PAs), but no other transposon families (Supplementary Table 2c). On the other hand, when we looked at elements that are correlated to older LINEs, controlling for correlation with L1HS, we found several Alu elements recurrently correlated with older LINEs, (Supplementary Table 2d). Considering there is no sequence similarity between the SINEs, MERs, ERVs and the LINEs, the correlation among these diverse transposons is probably due to a common regulatory mechanism that is de-repressing these transposons and LINEs at the same time. There have been reports of such co-expression of ERVs and LINEs in cancerous tissues (61, 62), possibly through concordant hypomethylation (63).
Expression of mitochondrial proteins are negatively correlated with L1HS expression
Based on our results showing widespread correlation between L1HS and older LINEs, we decided to examine genes co-expressed with L1HS and older LINEs separately, controlling for the other variable as a covariate in the linear model. We found 3650 genes that showed negative correlation in expression with L1HS (q-value < 10e-4) in at least one tissue (Supplementary Table 3). There were 1056 genes that were negatively correlated with L1HS in more than one tissue, i.e. the correlation was replicated in at least two tissues. Figure 4 shows the top 12 genes with the largest partial eta-squared estimated across all tissues.
We did a gene set enrichment analysis (GSEA) using DAVID for the 1056 genes. The top three most enriched clusters were mitochondrial transit peptide / mitochondrial inner membrane, oxidative phosphorylation and ribosome and ribonucleoprotein (including mitochondrial ribosome proteins) (Table 6, Supplementary Table 4). This result was robust to the threshold we used for significance. When we used a more stringent cutoff of q-value < 10e-5, we found 326 genes negatively correlated with L1HS in more than one tissue, but the top three most enriched annotation clusters were still the same for the 326 genes. The result was also robust to the software we used for the gene set enrichment analysis (DAVID or GSEA)(48, 49), and to the annotation database we used for the analysis (curated pathways or GO terms). With GSEA on curated pathways, the pathways KEGG: OXIDATIVE_PHOSPHORYLATION, REACTOME: RESPIRATORY_ELECTRON_TRANSPORT were on the top of the list. With GSEA on GO terms, the terms MITOCHONDRIAL_PART, MITOCHONDRION, MITOCHONDRIAL_ENVELOPE were on the top of the list (Supplementary Table 5, Supplementary Figure 6). When we look at the list of genes significant in these categories, we find that there is almost an across-the-board reduction in transcripts for mitochondria and the ribosome (Figure 5) in the cells that have high level of L1HS transcript levels. Nuclear encoded mitochondrial genes are known to be co-regulated with mtDNA encoded genes, and ribosomal protein genes are co-regulated with rDNAs. We did not include mtDNA encoded genes, nor rDNAs in our quantification, so we could not test whether they are also negatively correlated with L1HS expression. The enrichment of mitochondrial genes among the genes negatively correlated with L1HS was significant in models both including and excluding expression of older LINE elements as a covariate. In contrast, there was no enrichment of mitochondrial genes when we looked at the genes correlated with the expression of older LINE elements adjusting for L1HS level as a covariate (Figure 6). This shows that the negative correlation between the expression of mitochondrial genes and LINE expression is either a common relationship among all LINE families or stronger for L1HS compared to older LINE elements.
To understand how the uncertainty in reads mapping to TEs affect our results, we decided to examine the correlation only using the uniquely and unambiguously mapped reads. We used the option in the TEtranscripts software to quantify TEs only based on the uniquely mapped reads, and replicated the gene-L1HS co-expression analysis. Overall, the significance of the correlations dropped, due to the lower power coming from lower number of reads. We found 6 genes negatively correlated with L1HS with q-value < 10-e4 and 96 genes using a less stringent cutoff (q-value < 10-e3). The six genes identified are CA4, CCDC151, PARD6A, RASSF4, SLC29A2, UBTD1, and there are no common functions across these genes, and no obvious link to mitochondria was found. But with the 96 genes, we found that mitochondrial inner membrane and mitochondrial transit peptide were still the most enriched category among the genes negatively correlated with L1HS measured by uniquely mapped reads (Supplementary Table 6).
We also looked at the recurrent correlations using REC score (Jacobsen et al. NSMB 2013). A strong negative REC score reflects that the mRNA-L1HS relationship generally show correlation in the same direction across the studied tissue types. Compared to the list of correlated genes we examined above, that were identified by a significant coefficient in the linear model in at least two or more tissue groups, this list is based on the rank of correlations and doesn’t rely on the significance of the p-value in the linear model. Table 7 shows the top 20 genes that have the best REC scores. None of the genes had a REC score that is less that 6.2, which is considered as the threshold for significance, showing that no gene was consistently co-expressed with L1HS across all tissues. But, when we used the top 100 genes with the most negative REC scores and examined the enriched functions, we again found that the most enriched cluster was mitochondrial respiratory chain and oxidative phosphorylation.
When examining the top 20 genes that show recurrent negative correlation with L1HS across tissues (Table 7), we found interesting interaction and overlap in function among the genes. DHPS encodes the Deoxyhypusine Synthase, the enzyme that cleaves and transfers spermidine that is necessary for the hypusination of eIF5A. Depletion of eIF5A leads to endoplasmic reticulum stress and unfolded protein response (64). The function of AIP is unknown but it is reported to interact with the aryl hydrocarbon receptor (AhR). A tight coupling of AhR and Nrf2-dependent regulation in the prevention of quinone-induced oxidative stress and ER stress has been reviewed in (65). CREB3 is a transcription factor that activates unfolded protein response (UPR) target genes during endoplasmic reticulum (ER) stress response (66). PARK7 and ECSIT are mitochondrial genes. PARK7 has a role in protection against oxidative damage, and PARK7 deficient mice under oxidative stress show impaired mitochondrial biogenesis and respiratory chain deficiency (Billia et al. PNAS 2013). ECSIT is a mitochondrial complex I associated gene that has been shown to regulate the production of mitochondrial reactive oxygen species (mROS) following engagement of Toll-like receptors (TLRs) (67). ECSIT-deleted cells show complete loss of mitochondrial complex I function, increase in mROS production, and accumulate damaged mitochondria because of defective mitophagy (68). PARK7 and ECSIT both interact with the mitophagy regulator PINK1 (69). PARK7 also regulates lipid rafts mediated endocytosis, and FLOT1 is one of the main components of lipid rafts (70). Knockout of PARK7 (DJ-1) have been reported to decrease the expression of FLOT1 (71). Knock down of FLOT1 results in increased PERK and eIF2α phosphorylation indicating endoplasmic reticulum (ER) stress (72). PARD6A is a cell polarity gene that is required for endocytic trafficking (73).
In addition to the genes DHPS, FLOT1, AIP and CREB3, the overall transcriptional repression of ribosomal proteins seen in Table 6 and Figure 6 also led us to consider the possibility that the cells that have high level of L1HS transcripts may be exhibiting activation of the unfolded protein response (UPR) due to ER stress. Although the UPR pathway is mainly regulated at the protein level through phosphorylation and cleavage, we examined whether any genes in the UPR pathway showed correlation with L1HS at the transcript level that was replicated in at least two tissue types. We found that L1HS transcript level was positively correlated with EIF2AK2(PKR), ERN1(IRE1α), and NFE2L2 (NRF2), but negatively correlated with ATF4(CREB2) and CHOP(DDIT3) (Figure 7). We did not observe correlation between L1HS and the other transducers of UPR, PERK or ATF6, nor did we see any correlation between KEAP1 and L1HS. We did see that the gene SQSTM1, encoding p62 that binds to KEAP1, was negatively correlated with L1HS. We also saw several chaperone proteins, ubiquitination factors and other regulators of ER stress correlated with L1HS. DNAJC5, HSPA4L, UBE4B, MBTPS2, ITPR1, ITPR2, and ITPR3 was positively correlated with L1HS and DNAJC4, DNAJC17, DNAJC19, UBE2G2, SSR2 (TRAPB), BNIP1 was negatively correlated with L1HS in at least two tissues. Excluding the annotations related to mitochondria, the functional annotations “proteasome”, “autophagy”, “endosome”, and “antioxidant” were the top annotations enriched in the negatively correlated gene set (Table 6).
Since XBP1 is one of the genes in the UPR pathway regulated through splicing, we examined if the XBP1s isoform correlates with the L1HS level in our samples. We tested the log2 transformed raw counts of the XBP1s isoform against the log2 transformed L1HS level, and we also tested the ratio of the spliced isoform over all transcripts XBP1s/(XBP1u+XBP1s) against the log2 transformed L1HS level. But neither quantification of the spliced version of XBP1 correlated with L1HS consistently across tissues. We found that the ratio of the spliced form positively correlated with L1HS levels in the colorectal samples, but it was not replicated in any other tissue.
To find whether there are common transcription factors upstream of the genes correlated with L1HS, we took our list of 1056 genes negatively correlated with L1HS and looked for common regulatory elements using the web server DiRE (https://dire.dcode.org) (74). We found that the transcription factor binding sites (TFBS) of NRF2 and ELK1 are the most frequently identified motifs for our list of genes. 476 genes out of 1056 had at least one TFBS identified, and of those 99 (20.69%) included at least one NRF2 binding site and 105 (22.52%) included at least one ELK1 binding site. Unlike NRF2, we did not find any significant correlation between ELK1 and L1HS. But, as we describe in the next section, several genes in the MAPK signaling pathway were positively correlated with L1HS in multiple tissues.
Expression of chromatin modifiers are positively correlated with L1HS expression
Compared to the genes in negative correlation with L1HS, there was a larger number of genes in positive correlation with L1HS in expression. We found 4338 genes that showed positive correlation in expression with L1HS (q-value < 10e-4) in at least one tissue out of the ten tissue groups (Supplementary Table 2). There were 1318 genes that were positively correlated with L1HS in more than one tissue. We did a gene set enrichment analysis using DAVID for the 1318 genes that are positively correlated with L1HS in more than one tissue. After excluding the very general categories of “kinase” and “transcription” that consisted of 293 and 768 genes respectively, we saw SH3 domains, DNA damage and repair, bromodomains, PHD fingers, and chromodomain helicases as most enriched functional categories for genes positively correlated with L1HS expression (Table 8, Supplementary Table 4, Figure 8).
Bromodomains are acetyllysine binding domains and PHD domains are methyllysine binding domains. They are both chromatin readers that recognize modified chromatin and function in epigenetic control of gene transcription. PHD and bromodomains are found in tandem in the C-terminal of several chromatin-associated proteins, including the transcriptional intermediary factor 1 (TIF1) family. The best-known member of the TIF1 family with PHD-bromo domains is the co-repressor TRIM28 (KAP1) that binds to multiple KRAB C2H2 zinc-finger family of proteins and recruits associated mediators of histone and DNA methylation to silence endogenous retroviruses (75). KAP1 did not show significant correlation with L1HS in our data in any tissue. One other member of the TIF1 family, TRIM33(TIF1γ) showed positive correlation with L1HS in four tissues, esophagus/stomach, head and neck, kidney and thyroid gland. There was no direct relationship between whether the protein had an activating or repressing role, and whether the protein was positively or negatively correlated in expression with L1HS. For example, KDM5A, KDM5B (Figure 4), and KDM5C, KDM5 family of H3K4 lysine demethylases that function as transcriptional corepressors were all positively correlated with L1HS in expression. On the other hand, TET2 and TET3 zinc finger proteins that can actively de-methylate DNA through oxidization of 5-methylcytosine were also positively correlated in expression with L1HS.
Because the list of genes was too long, we didn’t list all the genes for the kinase category or the transcription category. But some notable genes in those categories include genes in the MAPK signaling pathway. IGF1R, SOS1, NRAS, BRAF, MAPK1(ERK1), and RPS6KA6(RSK4) were all positively correlated with L1HS in multiple tissues. GSK3B and PI3K(PIK3CA) also showed positive correlation with L1HS levels. EP300 (p300) and CREBBP(CBP), histone acetyltransferases that act as coactivators were positively correlated with L1HS, and CARD14(CARMA2), EIF2AK2(PKR), TRAF6 and NF-κB were positively correlated with L1HS in transcript levels.
Table 9 shows the genes with the best REC scores that have recurrent positive correlation across tissues. We removed genes RASEF, MAST4, and LPP from the list, because they have L1HS sequences in their introns, that can lead to spurious positive correlation. Although we correct for the TEs embedded in introns, the correction is not complete, as we discuss in the Discussion section below.
Expression of KRAB-ZFPs are positively correlated with expression of older LINE elements in the normal tissues
As can be seen in Figure 9, various zinc finger proteins are both positively and negatively correlated with L1HS and older LINEs. But, overwhelmingly more zinc fingers are positively correlated with LINE expression, rather than negatively. Also, if we focus on the Krüppel associated box (KRAB) domain, zinc fingers with KRAB domains were significantly enriched among genes positively correlated with older LINE elements but less among genes correlated with the young and active L1HS (Supplementary Table 4). There are three ZFPs that show recurrent negative correlation with L1HS in multiple tissues, ZNF511, ZNF32, ZNF174 (Table 7), but none of these were KRAB-ZFPs, since they were missing the KRAB domain. We initially predicted a negative relationship given the known repressive function of the KRAB domain, and the well characterized role of KAP1 in TE silencing. But again, as seen with the chromatin modifiers, the functional role of the protein and the co-expression relationship is not simplistic, and one cannot predict whether the protein is a repressor or an activator from the direction of co-expression. To understand this relationship between KZFPs and LINE elements, we decided to examine the co-expression in more detail by looking at all the subfamilies within LINEs and each LINE locus within the subfamilies.
Among the ∼400 KZFPs that were identified by Imbeault et al. (51), we found 366 KZFPs in our expression data. We first tested the correlation between those 366 KZFPs and 294 LINE subfamilies using the family level expression quantification. We found 31,187 pairs with significant correlations (qvalue < 0.0001), amounting to about 2.90% out of a total of 1,076,040 pairwise combinations (366 KZFPs x 294 LINE subfamilies x 10 tissue groups). There were 29,484 pairs with positive correlation and 1,703 with negative correlation at the LINE family level, Next, focusing on those 31,187 KZFP, LINE family pairs with significant correlation, we then tested the focal KZFP against the expression level of all individual LINE loci belonging to that family of LINE in the tissue where the significant relationship was observed. There were 49,313 KZFP and LINE locus pairs that were significantly correlated (qvalue < 0.0001 and partial-eta2 > 0.4), 41,067 pairs with positive correlation and 8,246 with negative correlation at the individual LINE locus level. This is about 0.05 % of all 90,696,187 pairwise combinations that we tested. Bearing in mind that we were only testing ZFP-LINE locus pairs that we already knew had a correlation at the family level, this tells us that most of the correlation we observed at the LINE family level were due to only a small number of individual LINE loci contributing to the correlation.
The correlation between KZFP and each individual LINE locus was very tissue-specific, as 37,994 out of 41,067 KZFP-LINE locus pairs that were positively co-expressed were significant in only one tissue (92.5%). Still, there were some ZFPs that correlated with multiple loci within a family, or even multiple families. For example, ZNF780B, ZNF587, ZNF793 had very promiscuous correlations across the board with multiple LINE families, multiple loci within a LINE family, and in multiple tissues. These zinc finger proteins and the LINE families were all correlated among each other, creating a network cluster of co-expression. But, for the most part, most correlations appeared between specific zinc finger proteins and specific LINE families, in a limited number of LINE loci, and in specific tissues. For example, ZIM3 showed positive correlation with L1 family L1PBb. When correlation was tested between ZIM3 and all L1PBb loci, only 221 out of 3415 (6.4%) of the L1PBb loci showed positive correlation with ZIM3, and of those 209 (94.5%) showed positive correlation in the tissue group esophagus and stomach. Figure 10 shows an example of correlation between KZFPs and individual L1M2b loci in the breast tissue.
Again, confirming what we found in our initial analysis, the families that were found as significantly co-expressed with KZFPs were mostly older LINE families. The total count of significant loci were largest for L2 families, L2c, L2a, L2b and L2 in that order, but, that is mainly because L2 elements have the largest number of loci in the human genome. If we adjust for the total number of loci available per family, then the families that are most frequently co-expressed with the ZFPs were elements that are shared in the primate lineages, such as L1PA12, L1PBb, L1P2, L1PA6, L1PA7. The list did not include the youngest element L1HS or L1PA2, L1PA3. There are five KRAB-ZFPs that have been shown to bind to the youngest L1HS sequence in earlier Chip-Seq studies, ZNF425, ZNF382, ZNF84, ZNF141 and ZNF649 (51, 76). We looked at whether any of these proteins showed correlation with L1HS in the expression level, but we did not find any KRAB-ZFPs that showed significant correlation with L1HS.
Binding motifs of KZFPs are enriched in the sequence of the co-expressed LINE locus among KZFP-LINE locus pairs that show significant co-expression
We searched for the known motifs of KZFPs identified in Imbeault et al. (51) in the sequences of the L1 loci found to be significantly correlated with said KZFP. We found a total of 4222 identified motifs in our total set of co-expressed KZFP-L1 pairs, 2256 of them being unique. From the 1000 simulations of our first null model, where we sampled randomly from all zinc fingers and L1 instances available in our dataset for each run, the number of motifs we found ranged from 3127 to 3627, with a mean of 3351.466, median of 3355, and a standard deviation of 78.999. In the second null model, we utilized the same dataset of co-expressed zinc fingers and L1 sequences as our experimental model but randomly permuted the L1 sequences, so the pairwise relationship do not reflect the co-expression any more. With 1000 simulations we observed the number of motifs ranging from 2935 to 3519, with a mean of 3255.095, a median of 3254, and a standard deviation of 81.481. Our observation of 4222 identified motifs in the co-expressed pairs showed a p-value of zero under both null model (Figure 11), indicating a significant excess of the corresponding motifs in the KZFP-L1 locus pairs that show significant co-expression.
Results are replicated in an independent dataset of normal tissue samples of the lung
All our samples come from supposedly normal tissues that were selected by pathologists based on histology, although they are adjacent to the cancer cells. But, since suppression of mitochondria and oxidative phosphorylation, replaced by enhanced aerobic glycolysis is one of the hallmarks of cancer, referred to as the Warburg effect, there is a possibility that our results reflect the cancerous environment that these “normal tissue” are experiencing, including hypoxic conditions. To find out if we could replicate the results in a true normal tissue, we used an independent dataset from the GTEX project to test the correlation with the same analysis.
First, we confirmed that the lung tissues clustered well with the lung tissues from the TCGA dataset based on the gene expression patterns (Supplementary Figure 7a). When examining the LINE expression, there were a few outliers in the GTEX lung samples with unusually high LINE expression that formed a cluster of their own separate from the other lungs, and instead grouping with two other stomach samples. These samples were interesting, because they showed LINE expressions that were significantly higher than any of the samples seen in the TCGA dataset, yet the gene expressions were not especially different from other lungs in GTEX nor in TCGA (Supplementary Figure 7 a vs. b).
The negative relationship between mitochondrial genes, ribosomal protein genes and L1HS expression was replicated in the lung transcriptome data from GTEX in the linear model without controlling for older LINEs as the covariate, but was not replicated when we controlled for the older LINEs. We are not sure why there is this discrepancy. But, the TCGA data is a more robust result because we only included genes that showed correlation in multiple tissues. With the GTEX data, we only had one tissue type, the lung, so we did not require the additional filtering of recurrent correlation. When examining the linear model without controlling for older LINEs as a covariate, we confirmed that the main results were replicated. There was enrichment of terms mitochondrial genes, ribosomal protein genes, response to reactive oxygen species, autophagy, proteasome, among the genes negatively correlated with L1HS. There was enrichment of KRAB-ZFPs, among genes positively correlated with both L1HS and older LINE elements.
DISCUSSION
Limitations to the quantification and correction
Quantifying transposon transcripts is a difficult problem, due to their ambiguity in short read mapping because of repeated content in the reference genome. Current state of the art methods rely on Expectation-Maximization to account for the uncertainty in multi-mapped reads (40). Because the estimation of multi-mapped reads start from values initialized based on the uniquely mapped reads, the final estimates tend to correlate well with the uniquely mapped read counts. This works well with gene transcripts where more than 90% of the reads are estimated to map uniquely across all genes (77), and the variance in the mappability across genes is not large. But with transposons, not only are the mappability lower than the genes, the variance of the mappability across different loci are huge due to the large number of similar sequences. This can lead to biases in the quantification if one locus has high mappability due to more unique mutations, while another locus has lower mappability due to smaller number of unique mutations by chance. Focusing only on uniquely mapped reads doesn’t really solve this problem, and will lead to equally biased quantification, as evidenced by the high correlation observed between uniquely mapped reads and the total reads including multi-mapped reads.
In addition, the reference genome that are used for mapping the transposon transcripts do not include all the polymorphic TE insertions in the human population. If the transcript originated from a polymorphic TE that is not present in the human reference genome, it will be redirected to the most similar locus that is present in the reference genome. This will not be a serious problem when quantification is done at the family level, but if one is interested in locus level quantification it becomes a serious problem, especially because the polymorphic loci could potentially be the more active loci that are expressed at higher levels compared to other loci of the same family.
Another complication in TE transcript quantification is that TEs are frequently embedded within introns that are transcribed before they are processed, or sometimes fail to be spliced out, or embedded within exons or non-coding RNAs that are expressed in different conditions (41). To account for this source of error, we introduced a method to correct for TE reads coming from retained introns or pre-mRNA. Although we observed large corrections for specific TE elements embedded within introns, the correction is not complete. The main limitation to our approach stems from the fact that the correction is done after the EM algorithm, after the multi-mapped counts are probabilistically assigned to multiple TE loci in the genome. The correction only removes the equivalent of read counts assigned to the problematic locus, and the uncertainty in the assignment through EM means there may be remaining reads assigned to alternative locations, that actually originate from retained introns as well. A more accurate approach would be to correct for the read counts from retained introns before the EM algorithm based on the uniquely mapped reads, and then run EM based on the corrected counts. But, estimating the depth of the repeat region using uniquely mapped reads is a difficult problem. The effective length of the uniquely mapped region is difficult to estimate, because again mappability varies from locus to locus for any TE, depending on the unique mutations it has accumulated. So, for this study, we decided to use the easier approach to run EM first, and probabilistically assign the TE reads, and then correct based on the expected read depth across the length of the TE locus.
Although we partially dealt with the third problem described above, we have ignored the first and second problems, and we recognize the limitations of the current methods. An important future study would be to study the mappability of individual TE loci carefully, including the known polymorphic sites, and to design a software for TE quantification that can take into account the mappability of each locus in its EM algorithm, as well as correct for the retained introns while considering the effective length of the uniquely mappable region within the TE. Despite these limits, we believe the main results of gene correlations found with L1HS and older LINEs are not affected by the quantification. The uncertainty lies in the assignment of reads to each LINE families, but the total reads coming from LINEs that are found in the RNAseq are valid. Our correlation analysis was done at the family level for each tissue separately. We may have under- or over-estimated the L1HS read counts, but as long as the errors are similar in distribution within each tissue type, and the errors are not systematically biased to be associated with a certain set of genes, it should not be a serious problem. We tried to make the results robust by dropping the reads mapping to LINEs most closely related to L1HS, i.e.L1PA2 and L1PA3, and requiring the correlation be replicated in at least two tissue types. We also confirmed that the results were replicated using uniquely mapped reads only.
Stress and TE expression
Initially, when we started the project, our goal was to identify candidate genes involved in transposon control, based on the co-expression analysis. But, once the analysis was done, the results were pointing to what induces LINE expression, rather than what suppresses LINE expression. Among the genes known to function in transposon control, RNaseH2C (Figure 4), and RNaseH2A showed negative correlation with L1HS, MPHOSPH8 encoding the protein MPP8 and C3orf63 encoding TASOR, part of the HUSH complex identified in Liu et al. (28), showed positive correlation with L1HS in multiple tissues in our results. But several well-known genes with functions in transposon control, e.g. MORC2, SIRT6, KAP1, SAMHD1, MOV10, ZAP, C12orf35 (human ortholog of RESF1), etc. are missing in our list of significantly correlated transcripts. Occasionally they would show up as positively correlated in one of the tissues, but not replicated in more than one tissue, which was what we required to be called as correlated genes. Instead, the major theme that emerged from our results is stress response. Genes involved in oxidative stress such as ISCU, PARK7, or ECSIT and genes such as DHPS, FLOT1 and CREB3 in the UPR pathway, showed negative correlation across multiple tissues. That leads to the hypothesis that oxidative stress or ER stress may be inducing the high expression of LINEs. Mitochondrial damage, oxidative stress and proteotoxic stress are tightly linked, and causal relationship in all directions have been observed. Misfolded protein accumulation can induce reactive oxygen species (ROS), and constitutively activated UPR has been shown to induce expression of Ty2 transposon in yeast (78, 79). Defective mitochondria result in elevated mROS production, and mROS can lead to more damage in the mitochondrial DNA. It is also possible that L1HS expression is the cause, not the result, of the stress response. Observation of co-localization between LINE elements and stress granules (80) supports the co-occurrence of LINE expression and UPR, but how expression of ORF proteins and the damage to DNA are linked with UPR is not well understood (81, 82). We found that transcripts of PKR and NRF2 are positively correlated with L1HS, and the binding motif of NRF2 are enriched among the genes negatively correlated with L1HS. But, we did not find higher level or higher proportion of the active isoform XBP1(S) in samples with higher L1HS levels. It is possible that other transducers are involved, but further experiments would be needed to evaluate the relationship between oxidative stress, ER stress and the expression of L1HS.
Ever since the initial discovery of transposons, stress has been associated with transposon activity (83). Numerous reports have documented transposon expression and transposon activity induced by various biotic and abiotic stresses (84–87). In humans, various stresses have been shown to induce LINE1 transcription or activation including chemical compounds (88–90), radiation (91, 92), oxidative stress (93) and aging (94). Most of these studies have observed L1 activity in vitro, by exposing cultured cells to stress factors and assaying the retrotransposition activity (but see (95) in mouse). Our study is the first that reports observation on L1 expression in vivo in normal human tissue samples, and shows widespread association between L1HS expression and genes modulating stress response. In this light, it is interesting that we were not able to observe correlation between age and L1 transcript levels in our data. It may be because, although our samples are all normal tissue, they come from cancer patients and thus the age distribution is skewed towards older age. The range is from 15 to 90, but the median age across all samples is 62. We did find association between past radiation therapy and L1HS expression level but only in the thyroid tissue. The exposure and damage to normal tissues resulting from radiation therapy has been long recognized (96), and the sensitivity to radiation is known to vary depending on the tissue or organ, as well as the genetics of the patient (97). Our results show that one of the responses to radiation damage in the adjacent thyroid tissue may be a higher level of L1HS expression (Supplementary Figure 8) and it may have implications for future rate of recurrence.
TE expression in disease
The genes and pathways that we find to be correlated with L1HS expression are interesting when we think about the accumulated evidence of transposon activity in cancer, and more recent reports of transposon dysregulation in neurodegenerative diseases. Suppressed mitochondrial function and altered metabolism, are one of the hallmarks of cancer, and abnormal MAPK signaling is implicated in a wide range of cancers. Oxidative stress has also been linked to cancer through increased DNA damage, and genome instability. In our preliminary analysis, we observed that the same correlations are found in the cancer tissue samples as well, but we focused on the normal tissues for the scope of this study, to avoid the confounding factors of malignant transformation. The fact that we observe these correlations in normal tissue, albeit obtained adjacent to the cancer tissue, could mean that LINE dysregulation is one of the earlier steps in the initiation of cancer and could potentially contribute to the progression of cancer through the damage that they cause to the DNA. Similarly, mitochondrial dysfunction, oxidative stress and proteotoxic stress, and more recently transposon expression are some of the common features that are observed in the tissue samples or animal models of neurodegenerative diseases, such as amyotrophic lateral sclerosis (ALS) (98–100) and Alzheimer (101). Our observation that the link between mitochondria, ER stress, and LINE expression is observed across multiple tissues other than brain that are considered normal and healthy, indicate that cellular stress, transcriptional response and the induction of L1 expression may be a fundamental process that happens quite frequently across different organs. Whether such expression becomes a burden to the cell and contributes to disease, and if so, why the process leads to more serious consequences in certain tissues such as brain are questions that will need to be explored.
TEs and KRAB-ZFPs
Although LINE elements have been studied for a long time, their ubiquitous and highly tissue-specific expression patterns are starting to be appreciated only recently. The fact that LINEs compose close to 19% of the human genome is frequently emphasized, but the fact that there is observable amount of LINE transcripts in human RNA-seq data has mostly been ignored or regarded as a nuisance without any functional relevance. But, recently, important regulatory roles for LINEs are emerging with observations of contribution to transcription start sites (33), active transcription during early development (102), and even critical function similar to long non-coding RNAs that guide chromatin-remodeling complexes to specific loci in the genome (103). ChIP-Seq studies on KRAB-ZFPs have identified extensive binding between this family of proteins and transposable elements including LINEs (51, 76), implying a role for regulating TE expression. Our results add complementary evidence at the RNA level, by revealing widespread correlation in transcripts across multiple KZFPs and LINE elements that are generally tissue specific and LINE locus specific. We also found that despite the overall sequence similarity among the LINE elements, the transcriptionally correlated pairs showed enrichment of the specific ZFP motifs in the LINE sequence, compared to randomly generated pairs. Given the correspondence between the presence of the binding motif and the co-expression relationship between KZFP and LINEs, it is possible that the co-expression we are observing is leaky transcription accompanying KZFP binding at the LINE locus, similar to the observations of transcription in enhancer RNAs. The tissue-specificity of the correlated expression hints that the in vivo binding patterns or the effect of the binding may be different depending on the tissues. Our results may be informative in designing future experiments on the types of cell lines or tissues and the ZFP proteins to explore with ChIP-seq assays. We could also potentially utilize the transcriptional correlation to predict motifs for about one third of the KZFPs, of which we have no information on their binding patterns yet.
AVAILABILITY
The pipeline and data can be found at https://github.com/HanLabUNLV/LINEexpression. The modified version of the TEtranscripts (40) software can be found at https://github.com/HanLabUNLV/tetoolkit.
FUNDING
This work was supported by the National Institutes of Health [GM116108 to M.V.H.]. Funding for open access charge: National Institutes of Health.
ACKNOWLEDGEMENT
Footnotes
The authors wish it to be known that, in their opinion, the first four authors should be regarded as joint First Authors
REFERENCES
- 1.↵
- 2.↵
- 3.↵
- 4.↵
- 5.↵
- 6.↵
- 7.
- 8.↵
- 9.
- 10.↵
- 11.↵
- 12.↵
- 13.↵
- 14.↵
- 15.↵
- 16.↵
- 17.↵
- 18.↵
- 19.↵
- 20.↵
- 21.↵
- 22.↵
- 23.↵
- 24.↵
- 25.
- 26.
- 27.↵
- 28.↵
- 29.↵
- 30.↵
- 31.↵
- 32.↵
- 33.↵
- 34.↵
- 35.↵
- 36.
- 37.↵
- 38.↵
- 39.↵
- 40.↵
- 41.↵
- 42.↵
- 43.↵
- 44.↵
- 45.↵
- 46.↵
- 47.↵
- 48.↵
- 49.↵
- 50.↵
- 51.↵
- 52.↵
- 53.↵
- 54.↵
- 55.↵
- 56.↵
- 57.↵
- 58.↵
- 59.
- 60.↵
- 61.↵
- 62.↵
- 63.↵
- 64.↵
- 65.↵
- 66.↵
- 67.↵
- 68.↵
- 69.↵
- 70.↵
- 71.↵
- 72.↵
- 73.↵
- 74.↵
- 75.↵
- 76.↵
- 77.↵
- 78.↵
- 79.↵
- 80.↵
- 81.↵
- 82.↵
- 83.↵
- 84.↵
- 85.
- 86.
- 87.↵
- 88.↵
- 89.
- 90.↵
- 91.↵
- 92.↵
- 93.↵
- 94.↵
- 95.↵
- 96.↵
- 97.↵
- 98.↵
- 99.
- 100.↵
- 101.↵
- 102.↵
- 103.↵