Abstract
Experiments designed to assess differential gene expression represent a rich resource for discovering how DNA regulatory sequences influence transcription. Results derived from such experiments are usually quantified as continuous scores, such as fold changes, test statistics and p-values. We present a de novo motif discovery algorithm, SArKS, which uses a nonparametric kernel smoothing approach to identify promoter motifs correlated with elevated differential expression scores. SArKS has the capability to smooth over both motif sequence similarity and, in a second pass, over spatial proximity of multiple motifs to identify longer regions enriched in correlative motifs. We applied SArKS to simulated data, illustrating how SArKS can be used to find motifs embedded in random background sequences, and to two published RNA-seq expression data sets, one probing S. cerevisiae transcriptional response to anti-fungal agents and the other comparing gene expression profiles among cortical neuron subtypes in M. musculus. For both RNA-seq sets we successfully identified motifs whose kernel-smoothed scores were significantly elevated compared to the permutation-estimated background distributions. We found strong similarities between these identified motifs and known, biologically meaningful sequence elements which may help to provide additional context for the results previously published regarding these data sets. Finally, because eukaryotic transcription regulation is highly combinatorial, we also outline how SArKS methods might be extended to discover synergistic motifs.
Introduction
Discrete sequences—of tones, of symbols, or of molecular building blocks—can provide clues to other characteristics of the entities from which they are derived: a phrase in a bird’s song can reveal which species it belongs to, the use of an idiomatic expression can pinpoint a speaker’s geographic origin, and a specific short string of nucleotide residues can illuminate the function of a DNA domain. In these examples, insights are gleaned from informative motifs—short subsequences that match some frequently recurring discernible pattern.
Of particular interest to us are motifs in DNA sequences which are informative with regard to patterns of differential gene expression. The identification of such motifs can help to elucidate the manner in which structure (patterns in DNA sequence) mediates function (regulation of gene expression). Because DNA is largely invariant, individual cell properties tend to be determined by their complement of resident proteins. Tight control over protein expression is, therefore, essential for cellular differentiation, identity, and function. While prior efforts have identified sequences that participate in regulating eukaryotic gene expression, the details regarding how and which specific motifs contribute to specific expression profiles are poorly understood. Here we present an analytical approach toward deciphering this fundamental biological puzzle.
Regulation of gene expression is achieved via a number of complementary processes. First, non-coding DNA is replete with short sequences that can bind transcription factors (TFs), proteins whose own expression varies from cell to cell and over the course of development. Second, DNA can be methylated, epigenetically altering the accessibility of regulatory and coding regions to transcriptional machinery. DNA methylation in turn recruits proteins which modify histones and thereby chromatin structure, further impacting accessibility. In this report, we take the latter regulatory strategies into consideration but focus primarily on accessible regions containing TF binding sites.
In the present study, we present a broadly-applicable algorithm for identifying DNA regulatory domains that support differential gene expression. Our strategy is predicated on the following suppositions: (a) gene expression regulatory regimes involve the binding of TFs to their respective sites on non-coding DNA found near, within, or some distance from a gene; (b) TFs act combinatorially to attract and repel transcription machinery; (c) the same TF binding site may appear multiple times within a stretch of DNA, interspersed with other binding sites; (d) the orientation of a TF binding site gains importance closer to the transcription start site (TSS) of the gene; and (e) there is more than one solution: different genes, even those co-expressed within a single cell, may rely on different regulatory mechanisms. As a practical matter, and in accord with these suppositions, we aim to identify TF binding sites in the vicinity of co-expressed genes and scrutinize their arrangement for significant patterns that can then be evaluated experimentally.
Many different methods for the identification of TF binding motifs have been described. Consensus-based methods such as Weeder [1, 2] focus on motifs of length k that occur repeatedly (allowing for small numbers of mismatches) in sequences of interest. Such methods can be efficiently implemented using suffix trees: Weeder in particular follows a suffix tree-based approach originally described in [3] and [4] with an added heuristic restriction on the pattern of allowed mismatches to maintain the efficiency of the recursive search method utilized [1].
Alternately, profile-based methods such as MEME [5–7] (Multiple Expectation-Maximization for Motif Elicitation) fit a profile model (i.e., a matrix composed of the modeled probabilities of each base occurring at each position of a fixed width motif) of a motif to be compared to a background model in order to classify subsequences as either matching the motif or not. MEME fits these profile models using an expectation-maximization (EM) approach, repeatedly computing the degree to which each subsequence fits the profile (E-step) and then recalculating the profile by realigning subsequences based on these fits (M-step).
Chromatin immunoprecipitation (ChIP)-based techniques (e.g. ChIP-seq) for identifying protein-interacting DNA sequences have led to the application of motif-finding algorithms to larger sequence data sets than was typical during previous decades [8]. Methods like MDScan [9] can take advantage of the ranking of sequences based on ChIP enrichment to first generate candidate motifs using only the most enriched DNA sequences and then progressively refine these motifs using the full set of detected DNA sequences.
While MDScan uses functional ranking to separate sequences into sets of higher and lower priority to better focus limited analytical resources for motif discovery, it does not attempt to directly compare one set of sequences to the other. In contrast, discriminative motif analysis [10] seeks to identify motifs specifically differentiating one set of sequences (e.g., promoter regions for a set of genes with a given expression pattern) from another (e.g., a set of reference promoter regions). A number of approaches have been applied to this problem, including [11–18]. A popular recent example, DREME [19] (Discriminative Regular Expression Motif Elicitation), employs Fisher’s exact test to assess the significance of motif matches in sequences of one set compared to the other, with further refinement of motif profile conducted for satisfactory candidate motifs.
Discriminative approaches incorporate gene-specific information into the motif discovery process (by, e.g., comparing sequences associated with genes with elevated expression in an experimental condition of interest to sequences associated with genes whose expression shows less evidence of elevation), but these methods implicitly assume that genes may be adequately characterized in a binary manner (e.g., elevated vs. not elevated). Given that the information used to establish the contrasting gene sets is often obtained in the form of continuous expression measurements (and derived measures of differential expression such as t-statistics, F-statistics, etc.), with some genes exhibiting extremely divergent expression patterns across conditions while (usually many) others show more modest differences, it may be more useful to develop methods for what might be called “correlative motif discovery”, seeking motifs whose presence signals a trend towards higher or lower values of such a continuous measure.
Correlating motifs from sequences (e.g., promoter regions) wb with associated continuous score values yb (e.g., measures of differential expression for the genes associated with the promoter regions) would be straightforward if we had some way of quantifying potential motif patterns present within the wb. The algorithm we propose here (illustrated in Fig 1) builds on this idea by:
concatenating all of the sequences wb into one supersequence x (detailed in Eq (1) below);
constructing the suffix array [si] of this supersequence (Eq (4)), where i indexes all suffixes of x sorted into lexicographic order;
mapping the suffix positions i back to the sequences wbi from which the beginnings of the associated suffixes are derived (Eq (5)); and finally
for each suffix array index i, applying kernel smoothing to locally regress ybj on suffix position j (Eq (6)): the resulting smoothed scores are then proportional to the correlations of the scores ybj with the local kernel Kij centered at i.
We are thus using the suffix array index i as the aforementioned quantification of the motif pattern corresponding to the first few characters of the suffix of x beginning at character si. Because i gives the position of a suffix in the lexicographically sorted list of suffixes of the concatenated supersequence x, multiple occurrences of a highly conserved motif—even if they derive from different sequences w—will be consolidated into a run i, i + 1, …, j of consecutive index values. Kernel smoothing using a kernel of width on the order j - i thus offers a way to compare the scores ybi, ybi+1, …, ybj to the overall score distribution. In this way, Suffix Array Kernel Smoothing (or SArKS) provides an efficient method for de novo discovery of conserved motifs which tend to be found selectively in high-scoring sequences.
We also describe an extension of this method for identification of longer motifs by adding a second round of kernel smoothing applied over the spatial extent of the sequences in order to detect longer regions containing clustered motifs. The use of a nonparametric permutation testing method for computing significance thresholds is then illustrated through the application of SArKS methods to both simulated and real data sets, thus demonstrating (a) the manner in which idealized versions of the motif detection problem may be solved for simulated data and (b) that the algorithm finds plausible candidate patterns with interesting relationships to sequence elements known to have potential regulatory activity when applied to two real gene expression data sets. By implementing a correlational approach to motif discovery, SArKS thus provides a step forward in taking full advantage of the differential expression information offered by RNA-sequencing experiments in the context of motif discovery.
Methods
Motif selection
Given n sequences wb (also referred to as words) with associated scores yb, the basic motif selection algorithm defining SArKS consists of:
Concatenation
Concatenate all words wb (each assumed to end in the line-terminator character $, which is lexically prior to all other characters) to form the word x of length ln = |x| = ∑b|wb|:
\[ x = w_0 w_1 \cdots w_{n-1} \tag{1} \]
Define also
\[ l_b = \sum_{a < b} |w_a| \]
Thus x[lb, lb+1) = wb; that is, the substring of the concatenated string starting at position lb (inclusive) and ending immediately before position lb+1 (exclusive) is the sequence wb (in this paper the first character of a string w is denoted w[0], the second w[1], etc.).
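The concatenation step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the SArKS implementation: the function name `concatenate` and the toy words are hypothetical, and the offsets list plays the role of the positions lb.

```python
from itertools import accumulate

def concatenate(words):
    """Return the supersequence x and the offsets l_0, ..., l_n.

    Each word gets a trailing '$' terminator if it lacks one, so that
    x[offsets[b]:offsets[b + 1]] is word b including its terminator.
    """
    terminated = [w if w.endswith("$") else w + "$" for w in words]
    x = "".join(terminated)
    offsets = [0] + list(accumulate(len(w) for w in terminated))
    return x, offsets

# Toy input; these three words are illustrative only.
x, offsets = concatenate(["ACGT", "GGA", "TTAC"])
for b in range(3):
    print(b, offsets[b], x[offsets[b]:offsets[b + 1]])
```

The last offset equals |x|, matching the identity ln = |x| stated above.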
Suffix sorting
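Lexically sort the suffixes of x into the ordered set
\[ S = \{\, x[s, |x|) : 0 \le s < |x| \,\} \]
thereby defining the suffix array [si], which maps the index i of a suffix in S to its starting position s in x:
\[ x[s_0, |x|) < x[s_1, |x|) < \cdots < x[s_{|x|-1}, |x|) \tag{4} \]
(in our software we rely on the Skew algorithm [20], modified to use a difference cover of 7 and implemented in SeqAn [21], to efficiently compute the suffix array).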
Block marking
Define the block array [bi] by mapping the index i of a suffix in S to the block b containing suffix position si:
\[ b_i = \max \{\, b : l_b \le s_i \,\} \tag{5} \]
The block array then tells us that the character x[si] at position si in the concatenated string x is derived from the sequence $w_{b_i}$.
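Suffix sorting and block marking can be illustrated with a deliberately naive sketch. It sorts suffixes by direct string comparison, which is far slower than the Skew algorithm cited above and is meant only for small examples. The function names and the toy supersequence are assumptions made for illustration.

```python
from bisect import bisect_right

def suffix_array(x):
    """Naive suffix array: start positions of the suffixes of x in lexicographic order."""
    return sorted(range(len(x)), key=lambda s: x[s:])

def block_array(sa, offsets):
    """For each sorted suffix, the index b of the word containing its start position."""
    return [bisect_right(offsets, s) - 1 for s in sa]

# Toy supersequence of three '$'-terminated words and their start offsets
# (the final offset is the total length).
x = "ACGT$GGA$TTAC$"
offsets = [0, 5, 9, 14]
sa = suffix_array(x)
blocks = block_array(sa, offsets)
for i, (s, b) in enumerate(zip(sa, blocks)):
    print(i, s, b, x[s:s + 6])
```

Because `$` precedes the letters A, C, G and T in ASCII order, Python's default string ordering agrees with the convention that the terminator is lexically prior to all other characters.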
Kernel smoothing
Calculate locally weighted averages
\[ \hat{y}_i = \frac{\sum_j K_{ij}\, y_{b_j}}{\sum_j K_{ij}} \tag{6} \]
where the kernel Kij acts as a weighting factor for the contribution of the score ybj to the smoothing window centered at sorted suffix index i. Loosely speaking, Kij measures how similar (the beginning of) the suffix x[sj, |x|) is considered to be to (the beginning of) the suffix x[si, |x|) in the calculation of a representative score averaged over suffixes similar to x[si, |x|). As the suffixes have been sorted into lexicographic order, the magnitude of the difference i − j provides some information regarding this similarity: the key idea of the kernel smoothing approach described here is that Eq (6), with Kij defined to be a function of i − j, may therefore offer a computationally tractable approach for identifying similar substrings (prefixes of suffixes) which tend to occur preferentially in high-scoring words wb.
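In this work we use a uniform kernel
\[ K_{ij} = \begin{cases} 1 & \text{if } |i - j| \le \kappa \\ 0 & \text{otherwise} \end{cases} \tag{7} \]
which allows Eq (6) to be computed easily in terms of cumulative sums:
\[ \hat{y}_i = \frac{1}{2\kappa + 1} \sum_{j = i - \kappa}^{i + \kappa} y_{b_j} = \frac{C_{i+\kappa} - C_{i-\kappa-1}}{2\kappa + 1}, \qquad C_m = \sum_{j \le m} y_{b_j} \]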
The kernel half-width κ appearing in Eq (7) is an important adjustable parameter in the SArKS methodology controlling the degree of smoothing applied. Increasing κ smooths over more, and hence generally more diverse, suffixes, potentially increasing statistical power at the expense of the resolution of the detected motifs. Recommended guidelines for selecting this parameter are discussed further in Results and discussion.
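A minimal sketch of uniform-kernel smoothing via cumulative sums follows. The input is the list of word scores arranged in suffix array order (one score per sorted suffix). How windows are treated near the ends of the suffix array is not specified in the text, so truncating them there is an assumption, as is the function name.

```python
from itertools import accumulate

def smooth_uniform(values, kappa):
    """Average values[j] over |i - j| <= kappa using cumulative sums.

    Assumption: windows are truncated at the ends of the array.
    """
    n = len(values)
    csum = [0] + list(accumulate(values))  # csum[m] = sum of values[:m]
    out = []
    for i in range(n):
        lo = max(0, i - kappa)
        hi = min(n, i + kappa + 1)
        out.append((csum[hi] - csum[lo]) / (hi - lo))
    return out

# Toy scores in suffix array order; a run of high scores yields a peak.
scores_in_suffix_order = [0, 0, 1, 1, 1, 0]
print(smooth_uniform(scores_in_suffix_order, kappa=1))
```

Each smoothed value costs constant time once the cumulative sums are available, so the whole pass is linear in the length of the suffix array regardless of κ.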
k selection
Set the length $\hat{k}_i$ of the k-mer associated with suffix array index i by locally averaging the length of suffix sequence identity:
\[ \hat{k}_i = \frac{\sum_{j \ne i} K_{ij} \min\!\left( \max\{\, k : x[s_i, s_i + k) = x[s_j, s_j + k) \,\},\; k_{max} \right)}{\sum_{j \ne i} K_{ij}} \tag{9} \]
where kmax functions both to increase computational efficiency and to make $\hat{k}_i$ more robust in the presence of a small number of long identical substrings (all results presented here are based on kmax = 12). Eq (9) is similar to Eq (6) except that: (a) Eq (9) smooths the length (capped at kmax) of the longest prefix on which the suffixes x[si, |x|) and x[sj, |x|) agree instead of smoothing the score ybj as in Eq (6); and (b) Eq (9) omits the central term i = j, as it trivially compares the suffix beginning at si to itself and is thus uninformative.
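The k selection step can be sketched directly from its verbal description: for each suffix array index, average the capped length of the common prefix shared with each neighbouring suffix in the window, skipping the index itself. The function names and the truncation of windows at the array ends are assumptions made for illustration.

```python
def common_prefix_len(x, s, t, kmax):
    """Length of the longest common prefix of x[s:] and x[t:], capped at kmax."""
    k = 0
    while k < kmax and s + k < len(x) and t + k < len(x) and x[s + k] == x[t + k]:
        k += 1
    return k

def smoothed_k(x, sa, kappa, kmax=12):
    """Average capped common-prefix length over the window around each index i (j != i)."""
    n = len(sa)
    out = []
    for i in range(n):
        neighbours = [j for j in range(max(0, i - kappa), min(n, i + kappa + 1)) if j != i]
        total = sum(common_prefix_len(x, sa[i], sa[j], kmax) for j in neighbours)
        out.append(total / len(neighbours) if neighbours else 0.0)
    return out

# Toy example: the suffixes of "AAAA$" in sorted order start at 4, 3, 2, 1, 0.
print(smoothed_k("AAAA$", [4, 3, 2, 1, 0], kappa=1))
```

With a small cap such as kmax = 2, the long identical run in the toy string can no longer inflate the averaged length, which is the robustness role attributed to kmax above.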
Motif selection
Choose score threshold θ and minimum k-mer size kmin, thereby defining k-mer set M by
\[ M = \{\, x[s_i, s_i + \lfloor \hat{k}_i \rceil) : \hat{y}_i \ge \theta,\; \hat{k}_i \ge k_{min} \,\} \]
where $\lfloor \hat{k}_i \rceil$ is the nearest integer to $\hat{k}_i$. Strategies for setting the filtering threshold θ based on the permutation testing method described in Permutation testing (and for choosing a reasonable kmin) are discussed in Results and discussion.
Limit intra-sequence repeats
One complicating factor in the strategy described in Motif selection is the presence of highly repetitive sequences (common in eukaryotic DNA [22]): if the substring x[si, si + rm) (assumed to derive wholly from the single word wbi) consists of r ≫ 1 repeats of the same m-mer, then it is likely that the sorted suffix array index positions j and k implicitly defined by sj = si + am and sk = si + bm for small a, b ≥ 0 will be close by, since, assuming without loss of generality that a < b,
\[ x[s_i + am,\; s_i + (r - b + a)m) = x[s_i + bm,\; s_i + rm) \]
showing that the suffixes of x beginning at positions (si + am) and (si + bm) agree on their first (r − b)m characters. Since all of the positions si + am for small a must come from the same word block bi, they must have the same associated score ybi. If this score ybi is particularly high, this phenomenon may lead to windows of high $\hat{y}_j$ values centered on j satisfying sj = si + am which result from a very small number of different repeat-containing words (perhaps as few as one if the number of repeats is high enough within a single high-scoring word). An example of a repetitive substring receiving a high smoothed score in such manner is discussed in DNA motifs associated with anti-fungal response.
The distribution of weighted word frequencies
\[ f_{ib} = \frac{\sum_j K_{ij}\, \delta_{b\, b_j}}{\sum_j K_{ij}} \tag{13} \]
(with δ the Kronecker delta) contributing to the window centered at position i of the suffix array table across the full word set W may for these purposes be reasonably summarized by the associated Gini impurity (often used in fitting classification and regression trees [23]):
\[ g_i = 1 - \sum_{b} f_{ib}^2 \tag{14} \]
which provides a measure (ranging from 0 to at most 1 − 1/(2κ + 1) for the uniform kernel of Eq (7)) of the degree of uniqueness of the words contributing to the calculation of $\hat{y}_i$. Requiring gi ≥ gmin can thus be used to screen out positions i for which the repeated occurrence of a few high-scoring words in the window centered at i leads to $\hat{y}_i \ge \theta$. Permutation testing further demonstrates that gi is directly linked to the variation of the smoothed scores which would be expected if there were no association between the sequences wb and the scores yb, thereby providing the motivation for the use of this particular measure for filtration.
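The Gini impurity filter can be sketched for the uniform kernel as follows. The impurity is taken here as one minus the sum of squared word frequencies within the window, which is the standard definition and is assumed to match Eq (14). The function name and the truncation of windows at the array ends are likewise assumptions.

```python
from collections import Counter

def window_gini(blocks, kappa):
    """Gini impurity of the word (block) labels in the window |i - j| <= kappa."""
    n = len(blocks)
    out = []
    for i in range(n):
        window = blocks[max(0, i - kappa):min(n, i + kappa + 1)]
        total = len(window)
        counts = Counter(window)
        out.append(1.0 - sum((c / total) ** 2 for c in counts.values()))
    return out

# With kappa = 4 the full window holds 9 suffixes.
all_distinct = window_gini(list(range(9)), kappa=4)[4]           # 1 - 9 * (1/9)**2
one_repeat = window_gini([0, 0, 2, 3, 4, 5, 6, 7, 8], kappa=4)[4]  # 1 - (4 + 7) / 81
print(all_distinct, one_repeat)
```

For κ = 4, a window of nine distinct words gives 8/9 ≈ 0.889 while a single repeated word gives 70/81 ≈ 0.864, so a cutoff placed between these two values retains only windows whose words are all distinct.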
Pruning and extending k-mers
The presence of a k-mer x[si, si + k) associated with a high smoothed score may also result in high smoothed scores when sj = si + m if the substring (k-m)-mers x[si + m, si + k) also differentiate higher and lower scoring sequences (if perhaps not as well as the superstring k-mer). The following two steps may be added to the algorithm described in Motif selection in order to reduce the reporting of such substring results in cases where they are present only as part of the full k-mer:
Prune nested k-mers
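Cases in which both a k-mer x[si, si + k) and its sub-(k − m1 − m2)-mer x[si + m1, si + k − m2) (with m1 > 0, m2 ≥ 0) are individually identified can be resolved to report only the longer k-mer: denoting by I the set of suffix array indices i selected in Motif selection, and by $\lfloor \hat{k}_i \rceil$ the nearest integer to $\hat{k}_i$, remove any index i ∈ I if there exists j ∈ I such that the $\lfloor \hat{k}_j \rceil$-mer interval starting at sj includes all of the $\lfloor \hat{k}_i \rceil$-mer interval starting at si, thus retaining only:
\[ I' = \{\, i \in I : \not\exists\, j \in I \setminus \{i\} \text{ with } s_j \le s_i \text{ and } s_i + \lfloor \hat{k}_i \rceil \le s_j + \lfloor \hat{k}_j \rceil \,\} \tag{16} \]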
This can be done efficiently using an interval tree.
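For small candidate sets, nested intervals can also be removed with a sort-and-sweep pass instead of an interval tree. The sketch below is that simpler substitute, not the interval tree approach mentioned above. It assumes each candidate is a half-open interval (start, end) and that all start positions are distinct, as they are when each interval begins at a different suffix position.

```python
def prune_nested(intervals):
    """Drop half-open intervals (start, end) contained in another interval.

    Assumes distinct start positions. After sorting by start, an interval is
    nested exactly when its end does not exceed the largest end seen so far.
    """
    kept = []
    max_end = float("-inf")
    for start, end in sorted(intervals):
        if end > max_end:
            kept.append((start, end))
            max_end = end
    return kept

# Hypothetical positions: a 10-mer at 100 contains the 8-mers at 101 and 102.
print(prune_nested([(100, 110), (101, 109), (102, 110), (200, 206)]))
```

Overlapping intervals that are not nested, such as (5, 10) and (7, 12), are both retained.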
Extend k-mers
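For each i ∈ I′, define the duplet
\[ (z_0^{(i)}, z_1^{(i)}) = \arg\max_{(z_0, z_1)\,:\; x[s_i - z_0,\; s_i + \lfloor \hat{k}_i \rceil + z_1) \in M} (z_0 + z_1) \tag{17} \]
resolving any ties in the arg max in favor of maximal z0 (here $\lfloor \hat{k}_i \rceil$ is the nearest integer to $\hat{k}_i$). Eq (17) picks out the largest super-interval $[s_i - z_0,\; s_i + \lfloor \hat{k}_i \rceil + z_1)$ containing the interval $[s_i,\; s_i + \lfloor \hat{k}_i \rceil)$ such that the extended $(z_0 + \lfloor \hat{k}_i \rceil + z_1)$-mer is equal to one of the already identified k-mers in M. Then
\[ M' = \{\, x[s_i - z_0^{(i)},\; s_i + \lfloor \hat{k}_i \rceil + z_1^{(i)}) : i \in I' \,\} \tag{18} \]
defines our pruned motif set.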
Spatial smoothing
Existing motif discovery approaches often take into account the tendency of some sequence motifs to exhibit local spatial clustering (thought in some cases to facilitate the cooperative interactions between TFs required for appropriate gene regulation) [24]. Our algorithm can also take advantage of this observation, extending candidate regulatory regions through the application of a second round of kernel smoothing over the positions within words:
\[ \hat{\hat{y}}_{s_i} = \frac{\sum_j L_{s_i s_j}\, \hat{y}_j}{\sum_j L_{s_i s_j}} \tag{19} \]
where we here use uniform kernels of the form
\[ L_{st} = \begin{cases} 1 & \text{if } 0 \le t - s < \lambda \\ 0 & \text{otherwise} \end{cases} \]
(generally with width λ ≠ κ) to search for regions of length λ with elevated densities of high-scoring motifs. Note that $\hat{\hat{y}}_{s_i}$ defined by Eq (19) is indexed not by suffix array index i but by suffix array value si giving the spatial position si in the concatenated word x.
To use such spatial smoothing as an additional basis for motif selection/filtering, it is generally necessary to introduce a second threshold θspatial, as the doubly-smoothed scores $\hat{\hat{y}}_{s_i}$ will generally be somewhat less dispersed than the singly-smoothed $\hat{y}_i$. In this case, formula (21) for the starting motif set M is modified to incorporate the condition on θspatial, with a similar modification to formula (18) for M′ then required as well.
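The second, spatial round of smoothing can be sketched as below. Several details are assumptions rather than statements from the text: the window for position s is taken to be the λ positions starting at s, windows are truncated at the end of the concatenated string, and word boundaries are ignored for brevity. The function name is hypothetical.

```python
def spatial_smooth(sa, yhat, lam):
    """Average suffix-smoothed scores over positions s, s + 1, ..., s + lam - 1.

    sa[i] is the position of the i-th sorted suffix and yhat[i] its smoothed
    score; the result is indexed by position in the concatenated string.
    """
    n = len(sa)
    by_position = [0.0] * n
    for i, s in enumerate(sa):
        by_position[s] = yhat[i]  # invert the suffix array
    csum = [0.0]
    for v in by_position:
        csum.append(csum[-1] + v)
    out = []
    for s in range(n):
        end = min(n, s + lam)
        out.append((csum[end] - csum[s]) / (end - s))
    return out

# Toy example: positions 0, 1, 2 carry smoothed scores 1, 2, 3.
print(spatial_smooth([2, 0, 1], [3.0, 1.0, 2.0], lam=2))
```

The result is indexed by spatial position rather than by suffix array index, in line with the remark above about how the doubly-smoothed scores are indexed.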
Gapped motif detection
While lexical sorting of suffixes assembles occurrences of the same k-mer together into a block of adjacent index positions i, gapped motifs such as u = u0 * ugap * u1, in which there is significant variability in the characters appearing within the internal substring ugap, will be scattered into distinct subblocks dispersed within the larger superblock corresponding to their common prefix u0. By mixing less relevant suffixes in with those corresponding to u within the range of the smoothing kernel, this dispersion can dilute the apparent correlation between motif and score.
While the technique described in Spatial smoothing ameliorates this problem to some extent, it does not specifically focus on the important situation where a head motif u0 is always followed (after the variable ugap) by the same tail motif u1. We describe here a method for discovering just such gapped motifs by applying first a relatively relaxed threshold θ (which may on its own admit many false positives) and then examining the tail sequences ugap * u1 * … following it for evidence of an enriched sequence u1, pruning away candidate head sequences for which no such corresponding tails can be found.
Defining for any string u: and noting that , we can look for the presence of a particularly common substring u1 such that the number of occurrences u1 exactly j positions downstream of an occurrence of u0 in a sufficiently high-scoring word is significantly higher than expected. In order to quantify the significance of cj(u0, u1; θ) some sort of background null model is required; for simplicity we assume homogeneity and independence at different positions i in our examples here, so that according to the null model, where is the number of occurrences of u0 in high scoring words (i.e., where ) and px[a] is the null probability of character x[a]. The method here is not constrained to the use of such a naive null model, however; a higher-order Markov null model (as has been demonstrated to improve other motif discovery algorithms [6, 25]) could easily be used instead.
Cluster k-mers by sequence similarity
There are many cases of interest where motifs are not defined by an exact match to a specific k-mer but instead may allow for some variation away from an idealized pattern. Thus the set M defined by Eq (21) is likely to contain many related k-mers which may be more usefully clustered into a few higher-level motif patterns.
Here we adopt a simple edit distance-based criterion to perform this clustering. First we define a diameter d ≥ 0 controlling how similar motifs must be to cluster together and initialize the (ordered) set of clusters C = ∅. We then consider the k-mers $x[s_i, s_i + \lfloor \hat{k}_i \rceil)$ in the reverse order of their smoothed scores $\hat{y}_i$ for all suffix indices i surviving all imposed filters, initializing a new cluster “centered” at the k-mer in C if it is not within d edits of the centers of any existing clusters (e.g., ACGT would initialize a new cluster if d = 1 and the only existing cluster was centered on AAG, but would not if a cluster already existed centered at either ACG or AAGT). If, on the other hand, the k-mer is within d edits of the center of one or more existing clusters, it is added to the first such cluster. This clustering strategy has previously been efficiently implemented in the software package starcode [26], on which we rely here.
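The greedy clustering rule can be illustrated with a short pure-Python sketch. This is not starcode and makes no attempt at its efficiency; it assumes the k-mers have already been ordered as described, uses the Levenshtein distance as the edit distance, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def greedy_cluster(ordered_kmers, d):
    """Assign each k-mer to the first cluster whose center is within d edits."""
    clusters = []  # list of (center, members), in order of creation
    for kmer in ordered_kmers:
        for center, members in clusters:
            if edit_distance(kmer, center) <= d:
                members.append(kmer)
                break
        else:
            clusters.append((kmer, [kmer]))
    return clusters

print(greedy_cluster(["AAG", "ACGT"], d=1))  # ACGT is 2 edits from AAG
print(greedy_cluster(["ACG", "ACGT"], d=1))  # ACGT is 1 edit from ACG
```

The two calls reproduce the example in the text: with d = 1, ACGT starts its own cluster when the only center is AAG, but joins an existing cluster centered at ACG.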
Permutation testing
In order to decide whether the observed correlation between the occurrence of the motifs uncovered by the approach described above and the sequence scores yb is meaningful, it is useful to have a method for examining results that might be obtained if the sequences wb and the scores yb were independent of each other. To this end, the word scores yb are subjected to a permutation π to define
\[ \hat{y}_i^{(\pi)} = \frac{\sum_j K_{ij}\, y_{\pi(b_j)}}{\sum_j K_{ij}} \]
If the permutation π is randomly selected independently of both the sequences wb and the scores yb, any true relationships between sequences and scores should be disrupted. This suggests a simple method for assessing the significance of motifs discovered using a given set of parameters (θ, kmin, gmin, kernel half-width κ, etc.): generate R random permutations πr and for each permutation select positions i satisfying $\hat{y}_i^{(\pi_r)} \ge \theta$ (where $\hat{y}_i^{(\pi_r)}$ is the smoothed score computed from the permuted word scores), gi ≥ gmin and any other desired criteria (e.g., presence of highly significant tail sequences when searching for gapped motifs as described in Gapped motif detection, or observation of high spatially-smoothed scores when the method of Spatial smoothing is employed). In this manner one can estimate the distribution of the number of motifs which would be chosen under a null model in which there is no association between the sequences of the various words wb and the scores yb.
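The permutation procedure can be sketched as follows for the simplest case in which only the score threshold θ is applied; the kmin, gmin and other filters mentioned above are omitted for brevity. The function names, the seed handling, and the truncation of smoothing windows at the array ends are assumptions made for this illustration.

```python
import random
from itertools import accumulate

def smooth_uniform(values, kappa):
    """Uniform-kernel average over |i - j| <= kappa (windows truncated at the ends)."""
    n = len(values)
    csum = [0] + list(accumulate(values))
    return [(csum[min(n, i + kappa + 1)] - csum[max(0, i - kappa)])
            / (min(n, i + kappa + 1) - max(0, i - kappa)) for i in range(n)]

def count_null_hits(blocks, scores, kappa, theta, n_perm=1000, seed=0):
    """Count permutations of the word scores yielding any smoothed score >= theta."""
    rng = random.Random(seed)
    perm = list(range(len(scores)))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(perm)
        # Word b receives the score of word perm[b]; sequences stay fixed.
        permuted = [scores[perm[b]] for b in blocks]
        if max(smooth_uniform(permuted, kappa)) >= theta:
            hits += 1
    return hits

# Toy block array (word index of each sorted suffix) for three words.
blocks = [0, 1, 2, 0, 1, 2, 1, 0]
print(count_null_hits(blocks, [1, 0, 0], kappa=1, theta=1, n_perm=50))
```

The fraction of permutations producing at least one hit estimates the family-wise error rate associated with the chosen parameter set.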
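This method of significance testing also provides the motivation for the form of Eq (14) in Limit intra-sequence repeats. To demonstrate this, let Π be a random variable representing a random permutation and note that the random variables yΠ(b) satisfy
\[ \mathrm{E}\left[ y_{\Pi(b)} \right] = \bar{y} = \frac{1}{n} \sum_{b'} y_{b'} \]
while, assuming that the number of words n = |W| is large enough that we may approximate yΠ(b) ⫫ yΠ(b′) for b ≠ b′,
\[ \mathrm{Var}\left[ \hat{y}_i^{(\Pi)} \right] = \mathrm{Var}\left[ \sum_b f_{ib}\, y_{\Pi(b)} \right] \approx \sum_b f_{ib}^2\, \mathrm{Var}\left[ y_{\Pi(b)} \right] \tag{30} \]
where $f_{ib}$ (the weighted frequency of word b in the window centered at i) is defined by Eq (13) and, for all b,
\[ \mathrm{Var}\left[ y_{\Pi(b)} \right] = \frac{1}{n} \sum_{b'} (y_{b'} - \bar{y})^2 = \sigma_y^2 \]
Eq (30) then tells us that
\[ \mathrm{Var}\left[ \hat{y}_i^{(\Pi)} \right] \approx (1 - g_i)\, \sigma_y^2 \]
where the Gini impurity gi is defined by Eq (14). Thus smaller values of gi imply higher variance of the window-smoothed scores obtained under random permutation Π (with mean unchanged). This increased variance will lead to the requirement of larger cutoff values θ for reporting motifs discovered in the unpermuted data with a given degree of confidence unless positions i with gi < gmin are filtered out as described in Limit intra-sequence repeats.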
RNA-seq expression analysis
In order to test SArKS, we selected two RNA-seq data sets from the Gene Expression Omnibus database [27] (https://www.ncbi.nlm.nih.gov/geo/): GSE80357, from Saccharomyces cerevisiae (strain 288c), and GSE63137, from Mus musculus neocortical neurons [28]. The strain 288c data were obtained following exposure of yeast cells to two different anti-fungal agents. The GSE63137 data set contains detailed transcriptomic and epigenetic information from three distinct non-overlapping classes of pooled neocortical neurons: principal excitatory neurons, parvalbumin (PV)-positive GABAergic neurons, and vasoactive intestinal peptide (VIP)-positive GABAergic neurons.
For the yeast data set GSE80357, we based the sequence scores yb on the provided gene-level edgeR differential expression results:
(where Λb is the edgeR likelihood ratio statistic for gene b provided in the analysis results for GSE80357).
Because the position of the first used exon often provides information on which TSS is used (and hence on what DNA region defines the applicable promoter) in multicellular eukaryotes, we reanalyzed the GSE63137 RNA-seq data at the transcript level, using kallisto [29] to quantify and normalize transcript level expression against Ensembl mouse cDNA reference GRCm38 [30]. Both mean and variance filters were applied (retaining only transcripts for which at least 100 pseudocounts were obtained when summed across all samples, whose mean normalized expression met or exceeded the median of the transcript mean normalized expression levels, and whose normalized expression variance across the full sample set similarly met or exceeded the median such value) to winnow the set of transcripts analyzed [31]. In order to simplify downstream analysis, only the isoform with the highest mean expression level across all samples was retained for each detected gene. Finally, as previously analyzed epigenetic information on chromatin accessibility was available from the same study [28], only transcripts for which the transcription start sites were located within ATAC-seq peaks (i.e., were accessible) for all examined neuron classes were retained for analysis. Imposing this condition minimizes the likelihood that epigenetic factors, rather than regulatory sequence characteristics, underlie the variations in gene expression across cell classes.
Differential gene expression was then assessed on normalized expression values via a standard Student’s t-test comparing PV neuron data to excitatory and VIP neuron data, with the resulting t-statistic tb providing a rough estimate of the gene’s enrichment in PV neurons to be used as score yb for transcript b. To prevent the few very large magnitude t-statistics from unduly influencing motif discovery, we enforced a ceiling of 10 on the magnitude of yb, so that
\[ y_b = \operatorname{sign}(t_b)\, \min(|t_b|,\, 10) \]
Results and discussion
Illustration using simulated data
To illustrate the method, we first applied it to a simple simulated data set in which 30 random sequences wb were generated with each letter wb[s] drawn independently from a Unif {A,C,G,T} distribution; into the last 10 sequences (i.e., those wb with b ≥ 20) we then embedded the k-mer motif CATACTGAGA (k = 10) by choosing a position sb (independently for each sequence wb) from Unif {0, …, |wb| − k} and replacing wb[sb, sb + k) by the desired k-mer sequence. Scores were assigned to the sequences according to whether the motif had been embedded:
\[ y_b = \begin{cases} 1 & \text{if } b \ge 20 \\ 0 & \text{otherwise} \end{cases} \tag{35} \]
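The simulated data set can be generated along the following lines. This is a sketch of the described procedure rather than the code used for the reported results: the function name, the seed, and the default sequence length of 250 (the length quoted for the simulations reported with Table 2) are assumptions.

```python
import random

def simulate(n_seq=30, length=250, n_motif=10, motif="CATACTGAGA", seed=0):
    """Random ACGT sequences; the last n_motif of them receive an embedded motif."""
    rng = random.Random(seed)
    words, scores = [], []
    for b in range(n_seq):
        letters = [rng.choice("ACGT") for _ in range(length)]
        has_motif = b >= n_seq - n_motif
        if has_motif:
            # Start position drawn uniformly from {0, ..., length - k}.
            pos = rng.randrange(0, length - len(motif) + 1)
            letters[pos:pos + len(motif)] = motif
        words.append("".join(letters))
        scores.append(1 if has_motif else 0)
    return words, scores

words, scores = simulate()
print(sum(scores), len(words), len(words[0]))
```

Overwriting a slice of equal length keeps every sequence at its original length, matching the replacement of wb[sb, sb + k) described above.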
The kernel half-width κ = 4 was chosen for this simulation in order to obtain smoothing windows of approximately the same size as the number of motif-positive sequences, 2κ + 1 ≈ |{b | yb = 1}|. In cases where one might expect that most high-scoring sequences exhibit a single conserved copy of a motif while few low-scoring sequences contain the motif, this may be generalized to provide a reasonable starting point for selection of window size: choose
\[ \kappa \approx \frac{|\{\, b : y_b \ge \varphi \,\}| - 1}{2} \]
where φ divides “high-scoring” sequences from “low-scoring” ones.
Fig 2 plots $\hat{y}_i$ as obtained from Eq (6) when the method of Motif selection is followed using a uniform kernel with κ = 4. The highest peaks in the plot correspond to the positions of various substrings of the embedded motif, and lead to the set M of k-mers defined by the column of table 1.
Pruning table 1 as described in Pruning and extending k-mers, Eq (16) leaves only the rows for i ∈ {2257, 2258, 2256, 1462, 1458, 1463}. Applying Eq (17) then extends the 8-mer ATACTGAG of the rows i ∈ {1462, 1458, 1463} to the full 10-mer, so that, following Eq (18), the final k-mer set M′ = {CATACTGAGA}.
Permutation testing illustrates the utility of setting a minimum k-mer length kmin and/or a minimum block Gini impurity gmin during motif selection: 190 out of 1000 random permutations generated at least one position i(π) for which $\hat{y}^{(\pi)}_{i(\pi)} \ge \theta$ (where θ was taken to have the maximum possible value of 1), but none of these permutations yield any results if kmin = 6 is applied to restrict attention to hexamer or longer motifs. Alternatively, if a relatively stringent minimum Gini impurity gmin = 0.878 (selected so that only those i for which bi−κ, bi−κ+1, …, bi+κ are all distinct are retained) is enforced, only 2 of 1000 permutations yield positive results, yielding a 95% CI of (0.024%, 0.72%) for the family-wise error rate (FWER).
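The interval quoted for 2 positive results in 1000 permutations appears consistent with an exact (Clopper-Pearson) binomial confidence interval. The text does not name the method used, so that identification is an assumption. The sketch below computes such an interval with the standard library only, solving for the bounds by bisection on the binomial distribution function.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided confidence interval for a binomial proportion."""
    def solve(cdf_at, target):
        # cdf_at(p) decreases as p grows; bisect for cdf_at(p) == target.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if cdf_at(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper

lower, upper = clopper_pearson(2, 1000)
print(f"{100 * lower:.3f}% to {100 * upper:.2f}%")
```

The lower bound solves P(X ≥ k) = α/2 and the upper bound solves P(X ≤ k) = α/2, which is the defining property of the exact interval.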
We repeated the process of generating 30 random sequences, embedding the motif CATACTGAGA into the last 10 of them, and then applying suffix array kernel smoothing to the sequence scores 1000 times. In 999 out of these 1000 iterations, the maximum value $\max_i \hat{y}_i$ (with gmin = 0.878) calculated using the unpermuted sequence scores exceeded the maximum value $\max_i \hat{y}_i^{(\pi)}$ obtained using one set of randomly permuted sequence scores per iteration. The full distribution of the differences $\max_i \hat{y}_i - \max_i \hat{y}_i^{(\pi)}$ is given in the motif (+) column of table 2. Table 2 also contains (motif (-) column) the distribution of values for 1000 repetitions of an amended version of this process in which the sole modification was to omit the motif embeddings: in this case, $\max_i \hat{y}_i$ exceeded $\max_i \hat{y}_i^{(\pi)}$ in only 171 of the simulations, while $\max_i \hat{y}_i^{(\pi)}$ exceeded $\max_i \hat{y}_i$ in 179 simulations (with equality between the two holding in the remaining 650 iterations). The symmetry of the distribution of $\max_i \hat{y}_i - \max_i \hat{y}_i^{(\pi)}$ around 0 in the motif (-) case is to be expected, since the scores yb are independent of the sequences wb whether permuted or not if no motifs are included.
Illustration of motif selection process from Motif selection applied to simulated data set with window half-width κ = 4 and score threshold θ = 1 (here kmin = gmin = 0).
Distribution of differences between $\max_i \hat{y}_i$ obtained by suffix array kernel smoothing using unpermuted sequence scores yb and $\max_i \hat{y}_i^{(\pi)}$ obtained using permuted sequence scores yπ(b) over 1000 simulations (30 random sequences of 250 characters each) run either with (motif (+) column) or without (motif (-) column) inclusion of motif CATACTGAGA in final 10 sequences.
Gapped motif detection
Following a similar strategy to that laid out in the Illustration using simulated data above, we generated a second simulated data set containing 30 random 250 character control sequences and then embedded a specific motif into the last 10 of them (again defining yb by Eq (35)) in order to test the gapped-motif detection strategy of Gapped motif detection. In this case, however, the motif was specified as CATA..CTGA, where the periods between CATA and CTGA represent different pairs of bases randomly assigned to each sequence: 1 AG, 1 CA, 3 CG, 1 GA, 1 GT, and 3 TG (the high frequency of G immediately prior to CTGA, in 7 out of 10 embeddings, resulted purely from random chance).
This is a more challenging motif discovery problem than the one discussed above. We therefore asked whether the approach of Gapped motif detection enables SArKS to find correlative motifs in small data sets even when there is no single long conserved section.
The assumptions underlying the selection of κ in the initial simulation study (Illustration using simulated data) are not satisfied in the gapped motif simulations, as the head motif 4-mer u0 = CATA is not sufficiently long to guarantee that it will not be present by random chance. Indeed, given independent equiprobable characters in the concatenated sequence x of length l = 30 * 250 = 7, 500, we would expect any individual k-mer to appear approximately 4−kl times on average; for k = 4 and l = 7, 500 this yields an expectation of 29 random occurrences (in addition to embedded occurrences), or about one per sequence—including the first 20 sequences into which it was not embedded—in our example.
This introduces a new scale to consider: the expected number of total occurrences of the head motif u0 (here CATA), which we could approximate in this case by the expected number of random occurrences (29) plus the expected number of embedded occurrences (10) at about 39 (actual number of u0 = CATA occurrences in simulated data set was 32). However, while this gives us a rough sense of the ceiling on the window size to which the head u0 contributes, there will generally be subwindows of the window containing all occurrences of u0 that are particularly enriched in suffixes of higher-scoring sequences w20, w21,…, w29. Thus we targeted window sizes slightly below this expected count (39): the value of κ = 12, corresponding to a full window size of 2 * κ + 1 = 25 was chosen as the mid-point between κ = 4 appropriate for motifs very unlikely to occur by chance and the value of κ ≈ 20 corresponding to the expected half-width of the window of all suffixes beginning with u0.
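The arithmetic behind this choice of κ can be sketched as follows. This is only an illustration of the back-of-the-envelope reasoning above; the function name `expected_random_occurrences` is hypothetical, and the approximation ignores edge effects at sequence boundaries, as the text's "approximately" already implies.

```python
def expected_random_occurrences(k, total_length, alphabet_size=4):
    """Approximate expected count of one specific k-mer in a text of
    independent, equiprobable characters (edge effects ignored)."""
    return total_length * alphabet_size ** (-k)


n_sequences, seq_length, n_embedded = 30, 250, 10
l = n_sequences * seq_length                      # concatenated length, 7,500

random_hits = expected_random_occurrences(4, l)   # 7500 / 256 = 29.296875
total_hits = random_hits + n_embedded             # roughly 39 occurrences of CATA

# Half-width of a window holding every suffix that starts with CATA;
# the text rounds this to about 20.
kappa_all = (total_hits - 1) / 2

# Mid-point between kappa = 4 (motifs unlikely to occur by chance)
# and the rounded ceiling of 20 quoted in the text.
kappa_mid = (4 + 20) // 2

print(random_hits, total_hits, kappa_all, kappa_mid)
```

The resulting full window size is 2 * 12 + 1 = 25, matching the value used in the text.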
The spatial length scale λ = 10 over which to smooth for this application is based on the length of the target motif u0 * ugap * u1 = CATA..CTGA. It could be argued that a slightly lower value of λ would be more appropriate, since there is little reason to expect that suffixes beginning with the last few characters of u1 will generate high scores; results similar to those presented for λ = 10 were obtained using λ as low as 5 (though θspatial must be set higher for smaller values of λ).
Fig 3 plots the joint distributions of the smoothed and spatially smoothed scores for both permuted and unpermuted sequence scores, further split into low- and high-Gini impurity gi indices, with thresholds θ = 0.6 and θspatial = 0.44 indicated by dotted lines. This plot suggests a useful method for selecting θ and θspatial: repeatedly permute the sequence scores yb to estimate the permuted score distributions; thresholds should then be selected high enough that permuted scores rarely exceed them (after filtering out low Gini impurity suffix indices i; note the tighter distribution of values for the permuted, high Gini impurity panel as compared to the corresponding low Gini impurity panel, consistent with Eq (32)). Upon further inspection of the unpermuted, high Gini impurity panel of Fig 3, a few disconnected islands containing some of the high-scoring indices i corresponding to occurrences of CATA..CTGA may be observed; note that no such islands appear in the permuted distributions.
Thus having set κ = 12, λ = 10, θ = 0.6, θspatial = 0.44, and having used the median of the Gini impurities gi to define gmin = 0.931, we followed the methods of Motif selection–Pruning and extending k-mers, modified to incorporate spatial smoothing as described in Spatial smoothing, thereby obtaining M′ = {CATA} containing only the embedded head sequence u0.
Table 3 shows the results of searching for common k-mers (3 ≤ k ≤ 6) u1 occurring within 10 positions downstream of u0 = CATA occurrences ranked using the simple binomial null model described in Gapped motif detection. The most significant hit found is for the correct motif tail sequence u1 = CTGA, while the remainder of the table contains various substrings of either CTGA or GCTGA, reflecting the randomly occurring bias favoring G immediately preceding the tail sequence in the simulated data.
For permutation testing of this gapped motif detection problem, a threshold p-value of 10−10 was applied to determine if any meaningful downstream hits u1 were detected: at this threshold, while 202 of 1000 permutations resulted in positive detection of a head motif u0 only 1 of these yielded a positive u1 motif hit (corresponding to 95% CI (0.0025%, 0.56%) for FWER). These results thus demonstrate that statistical analysis of the composition of trailing tail sequences can complement the basic SArKS approach to facilitate the detection of gapped motifs that might otherwise be missed.
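The confidence intervals quoted for the family-wise error rate (FWER) here and in later sections appear consistent with an exact (Clopper-Pearson) binomial interval on the number of permutations yielding a positive hit. The text does not name the interval method, so this is an assumption, and the function names below are hypothetical. The sketch uses only the standard library and bisection, which is adequate for the small counts involved.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); returns 0 when k < 0."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for a binomial proportion,
    given k positive outcomes in n trials."""

    def solve(max_count, target):
        # binom_cdf(max_count, n, p) decreases as p grows, so bisect on p.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if binom_cdf(max_count, n, mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    lower = 0.0 if k == 0 else solve(k - 1, 1 - alpha / 2)
    upper = 1.0 if k == n else solve(k, alpha / 2)
    return lower, upper


# 1 positive tail-motif hit among 1000 permutations (simulated gapped motif)
print(clopper_pearson(1, 1000))
# 0 positive hits among 250 permutations
print(clopper_pearson(0, 250))
```

A library routine for exact binomial intervals would normally be preferable; the hand-rolled version is shown only to keep the example self-contained.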
DNA motifs associated with anti-fungal response
We used the methods of Results and discussion to examine potential sequence motifs related to gene expression differences between yeast samples treated with a pair of synergistic anti-fungal agents and a set of matched control specimens as measured in the RNA-seq data set, GEO accession number GSE80357 [32]. The scores yb for the genes in this data set were derived from the analysis provided in the data set submission as described in RNA-seq expression analysis (Eq (33)).
The sequences wb for this application were defined to be the 500 bases immediately upstream (5’) of the transcription start site (TSS) of each of 5,436 genes for which edgeR [33] analysis results were included in the GSE80357 submission. Using the genome annotations collected in version R64-2-1 of the S. cerevisiae gff created by the Saccharomyces Genome Database we calculated that 71.1% of the annotated genes had TSSs at least 500 bases downstream of the next TSS upstream (median separation between consecutive TSSs calculated to be 888 bases, while mean separation was 1336 bases).
Fig 4 shows the joint distributions of gi and the smoothed scores obtained from either the true scores yb or permuted scores, as indicated. The propensity for increased variance in the smoothed scores at lower values of gi underlying Eq (32) can be clearly observed. One consequence of this phenomenon for this data set is that a suffix beginning with a block of 23 consecutive thymine residues simultaneously yields both the highest (unpermuted) smoothed score and the lowest Gini impurity gi (corresponding to the far-left end of the uppermost red tendril in the right panel of Fig 4); only 53 distinct promoter region sequences wb contribute to the 251 positions composing the smoothing window centered on this suffix, with 23 of the 251 suffixes derived from a single promoter sequence.
We chose the Gini impurity filter value gmin = 0.9950 to satisfy Eq (38) with γ = 0.2, thus removing suffix indices i for which the variance of the permuted smoothed scores would be more than approximately 120% of the median value (see discussion leading to Eq (32)); this filter removes only 0.74% of all suffixes from consideration. The threshold θ = 1.5 was determined by examining the distribution of maximum smoothed scores generated using randomly permuted scores: only 5 of 250 such permutations generated any scores exceeding 1.5 for κ = 125 and gmin = 0.9950 (95% CI (0.65%, 4.6%) for FWER). Using these parameter values and following Motif selection–Pruning and extending k-mers we obtained M′ = {TGACTCA, GACTCA, TGACTC, GACTCAT, TGACTAT, ATGACTAA, ATGACTC, TTAGTCA, CCGTACA, AGATAAG, AGATAAGA, GATAAGC, TATATAAG, TATATAAAG} clustering (setting maximum edit distance d = 3) into 3 clusters centered at TGACTCA, CCGTACA, and AGATAAG.
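To make the role of the Gini impurity concrete, the sketch below computes it for the sequence-of-origin labels in a smoothing window. It assumes the standard definition (one minus the sum of squared label fractions) with uniform weights across the window; the definition given earlier in the paper should be taken as authoritative. The window compositions are toy examples, not the actual window described above.

```python
from collections import Counter


def gini_impurity(labels):
    """Standard Gini impurity 1 - sum(p_b^2) over the labels in a window,
    where p_b is the fraction of window positions drawn from sequence b."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


window_size = 251  # 2 * kappa + 1 with kappa = 125

# Toy window A: every suffix comes from a different promoter sequence.
diverse = list(range(window_size))

# Toy window B (hypothetical composition): 23 suffixes from one promoter,
# the remaining 228 each from a distinct promoter.
concentrated = [0] * 23 + list(range(1, window_size - 23 + 1))

print(gini_impurity(diverse))       # upper limit, 1 - 1/251
print(gini_impurity(concentrated))  # lower, since one sequence dominates
```

Under this definition a window of 251 positions can never exceed 1 − 1/251, and windows dominated by a few sequences fall below that ceiling, which is why low-impurity windows are the ones most prone to extreme smoothed scores.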
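The clustering step relies on an edit distance between k-mers. The full procedure is described in Cluster k-mers by sequence similarity and is not reproduced here. As a minimal sketch, assuming the edit distance is the usual Levenshtein distance (unit-cost insertions, deletions and substitutions), pairwise distances from a cluster center can be computed as follows.

```python
def edit_distance(a, b):
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,               # delete ca
                    curr[j - 1] + 1,           # insert cb
                    prev[j - 1] + (ca != cb),  # substitute (free if equal)
                )
            )
        prev = curr
    return prev[-1]


center = "TGACTCA"
for kmer in ["GACTCA", "TGACTC", "GACTCAT", "TTAGTCA"]:
    # k-mers within d = 3 of a center are candidates for the same cluster
    print(kmer, edit_distance(center, kmer))
```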
Assessing the similarities of the centers of these 3 high-scoring k-mer clusters to known biological motifs using tomtom [34], we found:
TGACTCA similar to binding motif for Yap1p (E-value 0.024)
CCGTACA similar to binding motif for Rap1p (E-value 0.093)
AGATAAG similar to binding motif for Gzf3p (E-value 0.080).
While there is little evidence of relevant differential expression for the gene Yap1 (likelihood-ratio (LR)=0.185, p = 0.67, yYap1 = 0, log2-fold-change (logFC)=0.05), the genes for both Rap1 (LR=8.65, p = 0.0033, yRap1 = 2.16, logFC=0.31) and Gzf3 (LR=50.5, p = 1.2e-12, yGzf3 = 3.92, logFC=0.79) appear to have elevated expression levels in the simultaneous amphotericin B (AMB) and lactoferrin (LF) treatment group relative to control. Pang et al. have suggested that the synergistic anti-fungal activity of AMB and LF may involve disruption of oxidative stress response: Yap1 is an essential TF in the normal oxidative stress response [35]. Pang et al. also discuss the involvement of iron and zinc homeostasis in the synergistic response; Gzf3 has been computationally annotated to Gene Ontology (GO) terms for zinc ion binding and metal ion binding [36, 37]. Furthermore, there is also evidence that the TFs identified with binding sites similar to SArKS-identified motifs may regulate or be regulated by TFs previously studied by Pang et al.: Fig 5 depicts putative regulatory relationships (as found in the YEASTRACT [38] database of documented associations) between these TFs and the two TFs Aft1p and Zap1p previously suggested by [32] as critical actors in the synergistic response of S. cerevisiae to the combination of AMB and LF. The distillation of these motifs demonstrates the power of our methodology to uncover candidate sequences that may support differential gene expression.
DNA motifs associated with neuron subtype-specific expression
Finally we applied the SArKS motif discovery methodology to an RNA-seq data set comprising gene expression data for different mouse neocortical neuron subtypes [28]. These authors developed an approach for the purification of genetically defined cell types in mammals and applied it in conjunction with a variety of next-generation sequencing methods—including ATAC-seq and MethylC-seq as well as the aforementioned RNA-seq—to investigate epigenetic variation between three different subtypes of neocortical neurons. This data set has many useful features for correlative motif analysis using SArKS beyond the quantification of differential expression, including especially information regarding which regions of the genome are accessible to transcriptional machinery via ATAC-seq. This information is useful not only for filtering the set of genes included in SArKS analysis (as discussed in), but also through the application of ATAC-seq footprint analysis in conjunction with differential methylation analysis that was performed in [28] to infer TF binding at cell-type specific regulatory regions, yielding a set of independently identified TF-binding motifs to which we may compare our own results.
Our initial goal was to identify potential regulatory motifs associated with transcripts enriched in parvalbumin (PV) GABAergic neurons. Because we focused on putative regulatory regions in the vicinity of actively used gene transcription start sites, we quantified expression at the transcript level, filtered transcripts, and determined differential expression as described in RNA-seq expression analysis; table 4 indicates the results of the various transcript filters: 6,326 distinct transcripts, each representing a unique gene, were retained for analysis. For this data set, we conducted three separate SArKS analyses, two focusing upstream (5’) of the TSSs for the transcripts of interest and the other downstream (3’).
For the selected transcript set, we selected the Gini impurity cutoffs for the upstream and downstream analyses, both again using Eq (38) but with the lower value γ = 0.1 (thus filtering out 9.8% of suffix indices upstream and 5.9% of suffix indices downstream). The lower value of γ here relative to that used for the GSE80357 yeast data set was motivated by the use of longer sequences wb (appropriate for the less compact mouse genome), which increases the potential for false positive motif signals and thus requires more stringent thresholds to maintain a high rate of negative results in permutation testing.
Upstream promoter analysis
We first examined upstream sequences wb for each of the 6,326 remaining transcript species from 3 kb 5’ of the TSS to the TSS (the TSSs of 85.5% of mouse genes annotated in Ensembl GRCm38 are separated by greater than 3 kb from the nearest upstream TSS; median separation 23 kb, mean separation 54 kb). Regarding the 979 transcript species whose t-statistic scores yb ≥ ϕ = 2 for the PV versus other neuron subtypes comparison (see RNA-seq expression analysis for details) as high-scoring, we began with half-window size set at a high-end estimate of κ = 500 (corresponding to full window size of 2κ + 1 = 1001). In order to select a windowed score threshold θ for the upstream sequence analysis, we generated 250 random permuted score sets and calculated maximum scores (defined as in Eq (37)) for each: these ranged from 0.424 to 0.560. Based on this distribution, we selected θ = 0.55 as our threshold for this application; 249 out of 250 of these maxima were less than this threshold (95% CI (0.010%, 2.2%) for FWER). Applying the methods of Motif selection–Pruning and extending k-mers using these sequences as the various wb and the associated transcript t-statistics as the scores yb (Eq (34)) with parameters κ = 500, θ = 0.55, and the Gini impurity cutoff selected above resulted in a motif set which clusters into a single motif centered on CCACCTGC for any d > 0.
The identified upstream motif sequences CCACCTGC and CCACCTGCC both contain the canonical core recognition E-box sequence CANNTG (more specifically, the E12-box variant CACCTG [39]). Comparison of CCACCTGC with known motifs from the JASPAR database [40] using tomtom finds some similarity to TF-binding motifs for SNAI2 (E-value 0.20), MAX (E-value 0.27), SCRT2 (E-value 0.30), SCRT1 (E-value 0.36), and TCF3 (E-value 0.38). Three of these TFs (SCRT2, SCRT1, and TCF3) were included in the set of genes whose measured expression levels met the minimum mean and variance filters for analysis described in RNA-seq expression analysis; the remaining TFs, SNAI2 and MAX, both met the mean expression criteria but had low expression variance across the 6 analyzed samples. Normalized expression levels of SCRT2 and SCRT1 were elevated in PV neurons relative to excitatory and VIP neurons (t-statistic scores ySCRT2 = 5.40 (p = 0.0057) and ySCRT1 = 8.87 (p = 0.00089)), while TCF3 shows little evidence of differential expression between any of the classes of neurons (ANOVA FTCF3 = 0.59). Interestingly, the motifs for both SNAI2 and TCF3 were also included in the list of TFs identified as possibly regulating cell-type specific expression (for at least one of the 3 cell types studied) using a combination of ATAC-seq footprint analysis and differential methylation analysis in the original study associated with this data set [28].
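The E-box containment noted above is easy to confirm with a regular expression. The sketch below is purely illustrative; the helper name `find_ebox` is hypothetical and the pattern encodes only the generic CANNTG consensus.

```python
import re

# CANNTG with N = any of the four DNA bases
E_BOX = re.compile(r"CA[ACGT]{2}TG")


def find_ebox(seq):
    """Return (start, match) pairs for non-overlapping E-box occurrences."""
    return [(m.start(), m.group()) for m in E_BOX.finditer(seq)]


for kmer in ["CCACCTGC", "CCACCTGCC"]:
    print(kmer, find_ebox(kmer))  # both contain CACCTG starting at index 1
```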
Use of a smaller smoothing window defined by κ = 250 for the upstream promoter sequence analysis generated very similar results (M′ = CCACCTGG) to those obtained with κ = 500 but with slightly degraded performance under permutation testing: greater than only 243 out of 250 permuted values for κ = 250 compared to 249 out of 250 permuted scores for κ = 500. We thus retained the larger κ = 500 smoothing window here.
Upstream promoter analysis with spatial smoothing
To detect longer regulatory sequences within 3 kb upstream regions of the 6,326 analyzed genes we applied the spatial smoothing method of Spatial smoothing. We retained the same kernel half-window size κ = 500 and Gini impurity cutoff and selected the spatial length scale λ = 100 to target the low end of the enhancer length distribution [41]. Permutation testing then led us to select the combination of θ = 0.5 and θspatial = 0.25 (for which there were no positive hits in 250 random permutations (95% CI (0%, 1.5%) for FWER)). This resulted in positive hits both for the previously found sequence CCACCTGCC and for 4 closely spaced positions in the versus-i plot (i ∈ {8919530, 8919531, 8919548, 891958}) corresponding to variations on a lengthy sequence beginning CTGGAACTCACTCTG …; the suffixes corresponding to these 4 peaks were identical in the first 46 nucleotide positions and exhibited substantial similarity beyond that.
Because of the high degree of similarity over the longer length of these sequences we bypassed the calculations of Eq (9) and instead compared the common first 46 bases of each to known databases, finding an indel-free alignment with 45 of 46 bases perfectly matched with the B1 rodents/Mammalia short interspersed element (SINE) sequence from SINEBase [42]; looking at longer surrounding regions, for each of the 4 peaks we found an alignment to B1 covering at least 132 of the 145 nucleotides in B1 with at least 94% sequence identity. The B1 SINE family consists of retrotransposon-derived sequences which appear repeatedly throughout the mouse genome; recently there have been suggestions that there may be positive selection for the presence of these sequences upstream and in introns of genes with specific functions [43] and that they might also function as enhancers [44].
Downstream promoter analysis
Finally we conducted a third analysis focusing on sequences wb extending 1 kb downstream (3’) of the TSS. Here significant results were obtained using either κ = 500 or the smaller window κ = 250; we chose to focus on the κ = 250 results as they generated longer, potentially more specific, motifs. We chose θ = 0.8 to be higher than all of the resulting from 250 random permutations πr (the maximum observed was 0.760; 95% CI for FWER again (0%, 1.5%)) and again applied Motif selection–Pruning and extending k-mers, here setting the Gini impurity cutoff to . The resulting motif set (Eq (18)) contained 7 distinct k-mers: AAGGTCA, ACCTTGG, GACCTTG, GACCTTGG, TGACCTT, TGACCTTG, and TGTCCTTG (with the last of these corresponding to the maximal value of ). Clustering according to Cluster k-mers by sequence similarity (d = 3) divides these sequences into two clusters centered at TGACCTTG and AAGGTCA which are clearly reverse complements of the same motif.
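The reverse-complement relationship between the two cluster centers can be checked directly. This is a minimal sketch assuming plain uppercase A/C/G/T strings; the helper name is hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


rc = reverse_complement("TGACCTTG")
print(rc)               # CAAGGTCA
print("AAGGTCA" in rc)  # True: the 7-mer center sits inside the reverse complement
```

Because SArKS scans one strand of each sequence, a motif that acts in either orientation can surface as two k-mer clusters of this kind.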
Fig 6 shows the distributions of t-statistics for transcript species whose downstream sequences either do or do not contain the highest-scoring octamer TGACCTTG. Comparison of the k-mer sequences with known motifs from the JASPAR database [40] using tomtom shows that these k-mers are very similar to the ESRRA/ESRRB/ESRRG binding motifs (e.g., E-value 0.00079 for TGACCTTG match to ESRRA and ESRRB motifs from JASPAR CORE, E-value 0.0046 for match to ESRRG motif). Notably, ESRRA, ESRRB, and ESRRG were all among the previously identified motif set described in [28]. The genes for ESRRA and ESRRG both passed the mean and variance filters employed in RNA-seq expression analysis and both exhibited significantly elevated expression in PV neurons relative to both excitatory and VIP neurons (yESRRA = 3.63 (p = 0.022), yESRRG = 3.34 (p = 0.029)), while ESRRB did not meet the applied mean expression filter (though the low expression levels observed do also indicate elevated expression in PV neurons, tESRRB = 10.2 (p = 0.00052)). TGACCTTG also matched 2 other JASPAR motifs at E-values below 0.1: RORA (E-value 0.0097) and NR5A2 (E-value 0.02). The TF-binding motif for RORA was also in the set of motifs flagged in [28]; neither RORA nor NR5A2 showed much evidence of elevated expression in PV neurons (yRORA = 0.556, yNR5A2 = 0.825).
Combining motifs
Fig 7 presents the fractions of analyzed transcript species matching each of the 3 motifs here identified (CCACCTGC and B1 SINE upstream of the TSS, and the cluster centered on TGACCTTG downstream of the TSS) in a mosaic plot, with the area of each tile proportional to the corresponding fraction. The boxes in the mosaic plot are colored according to whether the observed fraction of sequences containing the motif is above or below the fraction predicted by a null model in which all indicated factors (presence of motif or elevated t-statistic) occur independently: gold=fraction greater than expected under independence, blue=less. Also encoded is the fraction of distinct transcripts with PV-versus-other-subtype t-statistic scores yb ≥ ϕ = 2. It is apparent that the downstream ESRRA/ESRRB/ESRRG related motif TGACCTTG… has the strongest association with specificity of expression in PV cells, and also that there is a large degree of overlap between the transcript species whose 3 kb upstream regions contain either the E-box CCACCTGC pattern or the B1 sequence pattern (in fact we noted that for some of the highest-scoring B1 matches, a single adenine residue insertion relative to the consensus B1 sequence created a CCACCTGCC match within the B1 region). It is less obvious that the upstream motifs (CCACCTGC or B1) contribute to much increased specificity for those transcript species for which the downstream TGACCTTG… motif is present. These results suggest that if motifs with more complex combinatorial patterns of association with differential expression are sought it may be useful to take this into account explicitly within the SArKS framework.
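The independence null model used to color the mosaic tiles can be illustrated for the simplest case of two binary factors. The data below are a made-up toy example, not values from Fig 7, and the function name `mosaic_tile` is hypothetical; ties are arbitrarily colored blue in this sketch.

```python
def mosaic_tile(flags_a, flags_b):
    """Observed joint fraction, the fraction expected if the two binary
    factors were independent, and the resulting tile color."""
    n = len(flags_a)
    p_a = sum(flags_a) / n
    p_b = sum(flags_b) / n
    observed = sum(a and b for a, b in zip(flags_a, flags_b)) / n
    expected = p_a * p_b  # product of marginals under independence
    color = "gold" if observed > expected else "blue"
    return observed, expected, color


# Toy data for 10 transcripts: motif present in 4, score >= 2 in 3,
# and all 3 high-scoring transcripts also carry the motif.
has_motif = [True] * 4 + [False] * 6
high_score = [True] * 3 + [False] * 7

print(mosaic_tile(has_motif, high_score))
```

With three motifs plus the score indicator, the same comparison is made for every combination of factor levels, with the expected fraction being the product of all four marginal fractions.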
Future Directions
Because the regulation of eukaryotic gene expression likely involves interactions among multiple short sequence motifs [45], it is of interest to discover motifs that work together synergistically to confer cell-type specific gene expression profiles. To achieve this objective we need to extend the methods associated with continuous sequence scores introduced in the present study by, e.g., utilizing multivariate kernel regression models such as where the 4-index kernel Kijkl might be chosen to satisfy constraints along the lines of
That is, i and k must correspond to suffixes with sufficiently similar prefixes (as must j and l), while i and j must come from the same word (as must k and l); see Fig 8.
As discussed in the Introduction, we have made a number of suppositions here regarding the mechanisms by which eukaryotic transcription is regulated, including but not limited to the combinatorial mode of TF action just discussed. A future challenge is to optimally choose the stretch of DNA to be examined relative to the nearby genes: while we have investigated sequences defined solely by proximity to the TSS, it is well known that regulatory elements may lie quite far from their target genes [46]. It would be advantageous to develop more sophisticated approaches to both (a) the identification of genomic regions most likely to contain regulatory elements and (b) the linkage of potential regulatory element-dense regions to governed genes. One place to start may be with information on evolutionary conservation [47] and epigenetic modification [48] near genes of interest.
Finally, while we have tested SArKS on biological sequences, we anticipate uses far afield from this example, including motif discovery in time series data [49], or, by considering node or edge sequences produced by random walks, analysis of complex network structure [50].
Conclusions
We here introduce SArKS as a method for de novo correlative motif discovery in order to more fully exploit the results of modern quantitative methods (such as RNA-seq) by avoiding the dichotomization of sequence scores into discrete groups, and the consequent loss of information [51], required by standard discriminative motif discovery algorithms. SArKS has also been designed to minimize reliance on specific background sequence models, instead using nonparametric permutation methods [52] to set significance thresholds for motif identification. Following the initial smoothing by lexicographic sequence similarity, SArKS is also capable of a second smoothing pass over the spatial location of motifs within the sequences in which they are found in order to identify longer, potentially interrupted, motifs. Finally, we provide several examples of the usage of SArKS along with detailed analysis of the results thus obtained.
Acknowledgments
This work was supported by BRAIN initiative grant 1U01NS094330 from NINDS and has benefited from discussions with Becca Young, Eric Brenner, and Preeti Mehta.