Abstract
Motivation We set out to develop an algorithm that can mine differential gene expression data to identify candidate cell type-specific DNA regulatory sequences. Differential expression is usually quantified as a continuous score—fold-change, test-statistic, p-value—comparing biological classes. Unlike existing approaches, our de novo strategy, termed SArKS, applies nonparametric kernel smoothing to uncover promoter motifs that correlate with elevated differential expression scores. SArKS detects motifs by smoothing sequence scores over sequence similarity. A second round of smoothing over spatial proximity reveals multi-motif domains (MMDs). Discovered motifs can then be merged or extended based on adjacency within MMDs. False positive rates are estimated and controlled by permutation testing.
Results We applied SArKS to published gene expression data representing distinct neocortical neuron classes in M. musculus and interneuron developmental states in H. sapiens. When benchmarked against several existing algorithms for correlative motif discovery using a cross-validation procedure, SArKS identified larger motif sets that formed the basis for regression models with higher correlative power.
Availability https://github.com/denniscwylie/sarks.
Contact denniswylie{at}austin.utexas.edu.
1 Introduction
Discrete sequences—of tones, of symbols, or of molecular building blocks—can provide clues to other characteristics of the entities from which they are derived: a phrase in a bird’s song can reveal which species it belongs to, the use of an idiomatic expression can pinpoint a speaker’s geographic origin, and a specific short string of nucleotide residues can illuminate the function of a DNA domain. In these examples, insights are gleaned from informative motifs—short subsequences that match some frequently recurring discernible pattern.
Of particular interest are DNA regions modulating differential gene expression. The regions contain motifs that produce defined patterns of gene expression, however the details of how and which motifs are needed for expression specificity remain poorly understood.
We present a broadly-applicable algorithm for identifying DNA regulatory regions that support differential gene expression. Our strategy is predicated on the following suppositions: (a) gene expression regulatory regimes involve the binding of transcription factors (TFs) to sites on non-coding DNA in the vicinity of a transcription start site (TSS) (Maston et al. (2006); Nguyen and D’haeseleer (2006)); (b) TFs act combinatorially to attract and repel transcription machinery (Walhout (2006)); (c) the same TF binding site may appear multiple times within a stretch of DNA, interspersed with other binding sites (Gotea et al. (2010)); and (d) there is more than one solution: different genes, even those co-expressed within a single cell, may rely on different regulatory mechanisms (Badis et al. (2009)). In accord with these suppositions, we aim to identify TF binding sites associated with enriched transcripts and scrutinize their arrangement for significant patterns that can then be evaluated experimentally.
Many motif identification methods have been described. Consensus-based methods such as Weeder (Pavesi et al. (2001, 2004)) focus on fixed-length motifs that repeatedly occur (with few mismatches) in sequences of interest, and can be efficiently implemented using suffix trees (Sagot (1998); Marsan and Sagot (2000); Pavesi et al. (2001)). Alternately, profile-based methods such as MEME (Bailey and Elkan (1995); Bailey et al. (2006, 2009)) build a probabilistic motif profile to be compared to a background model in order to classify subsequences as either matching the motif or not.
In contrast, discriminative methods (Sinha (2003)) identify motifs that differentiate one set of sequences (e.g., promoter regions for genes with a given expression pattern) from another (e.g., reference promoter regions). Many approaches have been applied to this differentiation problem (e.g., Segal et al. (2002); Segal and Sharan (2005); Redhead and Bailey (2007); Fauteux et al. (2008); Valen et al. (2009); Huggins et al. (2011); Yao et al. (2014)). One popular example, DREME (Bailey (2011)), employs Fisher’s exact test to compare counts of motif matches in the target/positive sequences with counts in the background/negative sequences. HOMER (Heinz et al. (2010)) uses similar hypergeometric enrichment calculations, but couples them to a zero-or-one-occurrence-per-sequence (ZOOPS) scoring approach. The recent motif finder STEME (Reid and Wernisch (2014)) extends a suffix tree-based approximate expectation-maximization approach (Reid and Wernisch (2011)) into a practical tool capable of discriminative motif discovery.
When discriminative methods are applied to differential gene expression, they impose a binary representation (such as elevated or not elevated expression). However, differential gene expression is generally described using a continuous measure (t-statistics, f-statistics, etc.), with some genes more affected than others by a difference in state. It is more useful, therefore, to use “correlative motif discovery,” which seeks motifs whose presence signals a trend towards higher or lower values of the continuous measure. A few such correlative algorithms have been described, including MOTIF REGRESSOR (Conlon et al. (2003)), which first applies the (non-correlative) MDScan (Liu et al. (2002)) algorithm to identify motifs in a subset of high-scoring sequences, then filters the motif set based on the predictive value of regression models based on the selected motifs. Another correlative algorithm, FIRE (Elemento et al. (2007)), iteratively optimizes the mutual information between sequence scores and occurrences of candidate motifs, starting from a set of most informative “seed” motifs.
Both of these algorithms may be seen as applying correlational information (regression or mutual information, respectively) as a filter to select and refine a set of candidate motifs generated in a non-correlative manner.
The generation of a seed motif set paves the way for sequence ranking by counting occurrences of the uncovered motifs within each sequence ωb. However, as the number of possible motifs of length k grows exponentially with k, given a fixed set of sequences {wb} and a suitably large k, only a fraction of possible length-k motifs will be observed in any sequence ωb. For example, in 1000 sequences ω1,…, wiooo each of length |ωb| = 1000, at most one million k-mers of any length k can be found.
In contrast, we aimed to develop SArKS as an algorithm for correlative motif discovery that does not require seed motifs to minimize the possibility of missing informative motifs due to suboptimal seeding. Our solution was to focus on observed substrings of the sequences ωb, not all possible k-mer patterns that could be present in the ωb.
Specifically, SArKS relies on suffixes of ωb—substrings formed by deleting the beginning of a string. As there are only |ωb| non-empty suffixes of ωb, SArKS is able to process all suffixes of its input sequences even when when they are long and/or numerous. SArKS then assesses suffix similarity by lexicographic sorting: just as words sharing a common prefix are found close together in a dictionary, suffixes starting with a shared k-mer are assigned similar numeric positions in the sorted list of all suffixes (Figure 1). By correlating sorted suffix position with suffix sequence score using kernel smoothing, SArKS develops this idea into a de novo motif discovery algorithm, with a natural extension for identification of longer multi-motif domains (MMDs) spanning tens to hundreds of bases (Section 2).
We applied SArKS to two RNA-seq data sets using nonparametric permutation testing to compute significance thresholds for correlative motif discovery and to estimate false positive rates. We demonstrate that SArKS outperforms existing algorithms at identifying correlative motifs in cross-validation testing scenarios. The top motif patterns and MMDs identified by SArKS include known regulatory elements (Mathelier et al. (2015); Elbarbary et al. (2016)). Thus, the correlational motif discovery approach used by SArKS takes full advantage of differential expression RNA-seq data to illuminate prospective sequence-dependent mechanisms of gene expression regulation.
2 Methods
Symbolic notation is described both when introduced and systematically in Section S2.1.
Given n sequences ωb (also referred to as words) with associated scores yb, the essential steps of the algorithm (illustrated in Figure 1 and described in Section 2.1) consist of:
concatenating all the sequences ωb into one supersequence x;
constructing the suffix array [si] of this supersequence (Equation (2)), where i indexes all suffixes of x sorted into lexicographic order;
mapping suffix positions i back to the sequences ωbi from which the beginnings of the associated suffixes are derived (Equation (3)); and
for each i, applying kernel smoothing to locally regress the sequence scores ybi on suffix positions j lexically near i (Equation (4)).
We thus encode the motif pattern corresponding to the first few characters of the suffix of x beginning at character si with the numerical suffix array index value i. Because i gives the position of a suffix in the lexicographically sorted list of suffixes of the concatenated supersequence x, multiple occurrences of a highly conserved motif—even if they derive from different sequences ωb—will be consolidated into a run i, i + 1, …, j of consecutive index values. Averaging together runs of j — i consecutive scores by kernel smoothing using a kernel of width j — i thus offers a way to compare the scores ybi, ybi+1, …, yb to the overall score distribution (1).
2.1 Motif selection
Concatenate all words ωb (each assumed to end in the line-terminator character $ lexically prior to all other characters) to form word of length . Define also . Then x[lb,lb+1) = ωb; that is, the substring of the concatenated string starting at position lb (inclusive) and ending immediately before position lb+1 (exclusive) is the sequence ωb (we denote the first character of a string ω by ω[0], the second ω[1], etc.).
Lexically sort suffixes xs = x[s, |x|) into ordered set thereby defining suffix array [si] mapping index i of suffix in S to suffix position s in x (in our software we rely on the Skew algorithm (Kärkkäinen and Sanders (2003)) modified to use a difference cover of 7 and implemented in SeqAn (Döring et al. (2008)) to efficiently compute the suffix array).
Define block array [bi] by mapping index i of suffix in S to block b containing suffix position si. The block array then tells us that the character x[si] at position si in the concatenated string x is derived from ωbi[si – lbi] in the sequence ωbi.
Calculate smoothed scores as locally weighted averages where the kernel Kij acts as a weighting factor for the contribution of the score ybj to the smoothing window centered at sorted suffix index i. Kij is used to measure how similar (the beginning of) the suffix x[sj, |x|) is to be considered to (the beginning of) the suffix x[si, |x|) in the calculation of a representative score ŷi averaged over suffixes similar to x[si, |x|). As the suffixes have been sorted into lexicographic order, the magnitude of the difference i – j reflects this similarity: the key idea of the kernel smoothing approach described here is that Equation (4) with Kij defined to be a function of |i-j| may therefore offer a computationally tractable approach for identifying similar substrings which tend to occur preferentially in high scoring words ωb.
In this work we use a uniform kernel which allows Equation (4) to be computed in terms of cumulative sums:
The kernel half-width κ appearing in Equation (5) is an important adjustable parameter controlling the degree of smoothing. Increasing κ smooths over more diverse suffixes, potentially increasing statistical power at the expense of the resolution of the detected motifs (i.e., length of k-mer prefix common to suffixes in the smoothing window). We recommend investigating a range of values of this parameter as is illustrated in Section 3.
Set length for k-mer associated with suffix array index i by averaging locally the length of suffix sequence identity: where kmax functions both to increase computational efficiency and to make ki more robust in the presence of a small number of long identical substrings. All results presented here based on kmax = 12: This value was selected as kmax ~ log4|x| where x is the longest concatenated sequence string considered in Section 3.2.1, so that for k > kmax there are more distinct k-mers than there are positions for such k-mers to occur in all of the sequences ωb composing x.
Equation (7) is similar to Equation (4) except that: (a) Equation (7) smooths the length of the longest prefix on which the suffixes |) and |) agree instead of smoothing the score yb as in Equation (4); and (b) Equation (7) omits the central term i = j as it trivially compares the suffix beginning at si to itself and is thus uninformative.
A straightforward approach to identifying correlative motifs using SArKS would then be to set a score threshold θ and take motifs to be k-mers prefixing the suffixes starting at the positions si in the concatenated string x. This is the essence of our method, though below we add two filters designed to pinpoint the optimal locations si at which to initiate motifs and, in Section S2.2, to remove likely false positive positions.
Defining the negative spatial shift operator η(i) which yields the unique suffix array index corresponding to the spatial position immediately prior to si, so that sv(i) = si — 1, as well as the positive shift operator ρ(i) similarly defined by the condition sp(i) = si + 1, we start with a preliminary filtered suffix array index set I consisting of those i for which (1) the smoothed score ŷi > θ and (2) ŷi is not less than the smoothed scores of the spatial positions in x immediately adjacent to si (i.e., si must be the loci of a peak in plot of ŷi versus spatial position si): from which we obtain the associated set M of k-mers beginning at the positions si in x by where [] is the nearest integer to . Strategies for setting the filtering threshold θ based on the permutation testing method are described in Section S2.5. In the next section we recommend one additional filter—designed to limit the impact of intra-sequence tandem repeats on reported motifs—to be incorporated into the definition of the index set I, replacing Equation (8) by Equation (10), for use in Equation (9).
The k-mers composing the set M (Equation (9)) constitute the SArKS motif set when spatial smoothing is not employed. When spatial smoothing is employed to detect multi-motif domains (Sections 2.3 and S2.4), a modified procedure for merging spatially contiguous motifs within such domains leads to Equation (S9) for the final k-mer motif set Mspatial.
2.2 Limiting the impact of intra-sequence repeats
The frequent occurrence of short tandem repeats in the genome (Ellegren (2004)) can cause smoothing windows to be skewed towards a relatively small number of distinct sequences (discussed in Section S2.2). As a result, the smoothed motif scores may reflect fewer input sequences, reducing precision and increasing false positive rates among the high scoring motifs. To filter out such false positives, section S2.2 introduces the Gini impurity score gi measuring the “effective sequence count” contributing to the smoothing window centered at i, while Section S2.5 demonstrates that gi predicts the variance of the smoothed score for suffix array index i under the null hypothesis of independence between sequence and score. We can thus modify equation (8) to remove potential false positive i values characterized by low Gini impurities gi: screening out positions i for which the repeated occurrence of a few high-scoring words in the window centered at i leads to ŷi ≥ θ.
2.3 Spatial smoothing to identify multi-motif domains (MMDs)
Existing motif discovery approaches recognize the tendency of regulatory motifs to cluster into domains (Wasserman and Sandelin (2004)). Our algorithm exploits this feature, identifying candidate regulatory regions through the application of a second round of kernel-smoothing over suffix positions si within words: where we use uniform kernels of the form (generally with width λ ≠ κ) to search for regions of length λ with elevated densities of high-scoring motifs. Note that defined by Equation (11) is indexed not by suffix array index i but by suffix array value si giving the spatial position si in the concatenated word x. Spatial smoothing requires a threshold θspatial ≠ θ, as the doubly-smoothed scores tend to be less dispersed compared to the singly-smoothed ŷi. The threshold θspatial can be used to define an index set Ispatial in a manner similar to how I is defined by Equation (10). This procedure is detailed in Section S2.4, which additionally defines the set Jspatial of suffix array indices i corresponding to the starting positions of MMDs. It then details the procedure adopted by SArKS to merge spatially contiguous motifs within the same MMD, yielding the set Ispatial of suffix array indices i and merged motif lengths required to obtain the merged motif set Mspatial analogous to equation (9).
2.4 Permutation testing to establish significance of motif set
The significance of the correlation between the motifs uncovered by SArKS and the sequence scores yb can be evaluated by examining results obtained when the sequences ωb and the scores yb are independent of each other. To this end, the word scores yb are subjected to permutation π to define yb(π) = yπ(b). If the permutation π is randomly selected independently of both the sequences ωb and the scores yb, any true relationships between sequences and scores will be disrupted. Section S2.5 and Section S2.6 develop the strategy used by SArKS to set thresholds θ (and/or θspatial) for each combination of parameters κ, λ,gmin to control the overall false positive rate.
3 Results and discussion
3.1 Illustration of SArKS using simulated data
To illustrate SArKS, we first applied it to a simple simulated toy data set in which 30 random sequences ωb were generated with each letter ωb[s] drawn independently from a Unif {A,C,G,T} distribution. We then embedded the k-mer motif CATACTGAGA (k = 10) in the last 10 sequences (i.e., those ωb with b > 20) by choosing a position sb independently for each sequence ωb from Unif {0,…, |ωb| — k} and replacing ωb[sb, sb + k) with the motif. Scores were assigned to the sequences according to whether the motif had been embedded: yb = 0 if b ∊ [0,20), yb = 1 if b > 20.
The kernel half-width κ = 4 was used to obtain smoothing windows of size similar to the number of motif-positive sequences, 2κ + 1 ~ |{b | yb = 1}|. As this number cannot be known in advance when applying SArKS to real data, in practice we recommend testing a range of κ values as done in Section 3.2.1 below.
Figure 2 plots ŷi as obtained from Equation (4) applying the method of Section 2.1 to search for motifs. The highest peaks in the plot correspond to the positions of various substrings of the embedded motif, and correspond to the set M of k-mers defined by the x[si, si + ||) column of Table 1.
Removing nested motifs from Table 1 as described in Section S2.3, Equation (S7) leaves only the rows for i ∊ {2257,2258,2256,1462,1458,1463}. Applying Equation (S8) then extends the 8-mer ATACTGAG of the rows i ∊ {1462,1458,1463} to the full 10-mer, so that, following Equation (S9), the final k-mer set M’ = {CATACTGAGA} is recovered (Table 1).
Section S2.5 illustrates the utility of setting a minimum Gini impurity gmin during motif selection to reduce the false positive rate: 190 out of 1000 random permutations generated at least one position i(π) for which (for this toy model θ was taken to have the maximum possible value of 1), but only 20 of these permutations yield any results if gmin = 0.8506 (following Equation (S5) with γ = 0.1) is applied. Based on these results, we can derive a 95% confidence interval of (1.2%, 3.1%) for the family-wise error rate (FWER, a type of false positive rate; see Section S2.6).
3.2 Uncovering promoter motifs associated with differential gene expres sion
We set out to analyze two published RNA-seq data sets (Mo et al. (2015); Close et al. (2017)) using SArKS. The first study presented transcriptome data for adult mouse neurons sorted according to cell class (Mo et al. (2015)). In particular, this study was among the first to profile parvalbumin-expressing (PV) interneurons, a major inhibitory subclass in the mammalian neocortex. PV basket and chandelier neurons are intimately involved in the microcircuitry of sensory processing, memory formation and critical period plasticity (Cobb et al. (1995); Klausberger and Somogyi (2008)). Dysfunction of PV interneurons has been linked to autism and schizophrenia (Lewis et al. (2005)), and the ability to access these neurons using a cell-type specific promoter has been a priority for brain scientists. The second study examined transcriptomes of differentiating interneurons at several developmental time points (Close et al. (2017)). In the sections below, we describe the parameters and results of SArKS analyses for both data sets. In Section S3.2.2, we inspect the SArKS-elicited motifs associated with PV neurons and demonstrate how to extend the application of SArKS to identify promoter multi-motif domains (MMDs).
3.2.1 Data set 1: cell class-specific transcriptome analysis
We analyzed RNA-seq gene expression data from mouse neocortical neurons pooled based on genetically defined cell classes (Mo et al. (2015)) to identify regulatory motifs associated with parvalbumin (PV) neuron-specific gene expression.
After accounting for differential expression and chromatin accessibility (Section S2.7.1), we examined two sets of sequences for 6,326 unique transcripts. The first set covered upstream regions −3000 base pairs (bp) to the transcription start site (TSS), the second set extended from the TSS to +1000 bp.
We tested a range of half-window sizes κ ∈ {250, 500,1000, 2500} with the maximum value of 2500 selected to produce a smoothed window of size 2κ + 1 = 5001 similar to the number of input sequences (6326). Note that smaller windows are less likely to contain sample multiple suffixes from the same promoter sequence: in particular, windows of width greater than the number of distinct sequences must contain multiple suffixes from at least one promoter sequence.
For each half-window size κ, we applied two minimum Gini impurity values gmin set according to Equation (S5) with first 7 = 0.1 and then 7 = 0.2. Also for each value of n, we examined three separate spatial window sizes λ ∈ {0,10,100}. These values were selected to investigate the performance of SArKS using no spatial smoothing (λ = 0), using a window λ = 10 of the typical length scale of eukaryotic transcription factor binding sites (Stewart et al. (2012)), and using a window λ = 100 to target the low end of the enhancer length distribution (Loots (2008)). Thresholds θ (for analyses with no spatial smoothing) or θspatial (for analyses with λ ∈ {10,100}) were set according to the permutation testing strategy detailed in Section S2.5 using R = 100 permutations.
3.2.2 Data set 2: differentiating interneuron transcriptome analysis
We examined RNA-seq data for differentiating human interneurons (Close et al. (2017)), applying SArKS to identify promoter motifs associated with elevated gene expression in doublecortin-positive (DCX+) GABAergic neurons compared to DCX- cells. Differential expression was assessed for 6,939 genes as detailed in Section S2.7.2 and we analyzed upstream sequences (from −3000 bp to the TSS) and downstream sequences (from the TSS to +1,000 bp) as described in Section 3.2.1.
SArKS analysis was conducted using all combinations of half-window size κ ∈ {250,500,1000,2500} and spatial smoothing window λ ∈ {0,10,100} for the reasons described in 3.2.1. However, for this data set, the minimum Gini impurity thresholds were computed using only γ = 0.1— we had seen little benefit from including the higher value γ = 0.2 in our experience with the Mo 2015 data set (see Section 3.2.4). Thresholds θ or θspatial were set according to the permutation testing strategy detailed in Section S2.5 using R = 100 permutations.
3.2.3 Benchmark comparisons for correlative motif discovery
We conducted a cross-validation benchmarking study to compare SArKS correlative motif discovery performance to that of five motif search algorithms. Two of these methods, FIRE (Elemento et al. (2007)) and MOTIF REGRESSOR (Conlon et al. (2003)), as they rely on alternative approaches to correlative motif discovery. The remaining algorithms, DREME (Bailey (2011)), HOMER (Heinz et al. (2010)), and STEME (Reid and Wernisch (2014)) are popular discriminative methods which we have run by discretizing our score data with promoter sequences b considered ‘positive’ sequences if the score yb ≥ 2, ‘negative’ otherwise. While there is a definite loss of information in this discretization—the avoidance of which is one of the primary motivations for the introduction of SArKS, as well as other correlative motif algorithms—we were interested in direct comparison of correlative and discriminative algorithms to assess the degree to which correlative algorithms actually benefit from avoiding discretization.
We split the 6,326 transcripts selected from the Mo 2015 data set into five disjoint subsets V1, V2, …, V5 (the name Vf intended to suggest the fth validation set) of approximately equal size (| V1| = 1,266, while |Vf| = 1, 265 for f > 1). The set of 6,939 genes selected from the Close 2017 data set was similarly partitioned into disjoint cross-validation folds.
For both data sets, for both promoter sequence ranges investigated, and for each of the algorithms evaluated, five separate motif identification analyses were conducted corresponding to the five cross-validation folds Vf. For each analysis f ∈ {1,…, 5}, motif discovery was performed using the sequences and scores from all folds except Vf: the genes assigned to Vf were instead held out for validation of the discovered motif (so that the set of genes used to learn the motif sets was in each case disjoint from the set of genes used for validation). Existing algorithms were run at their default parameter settings where possible; exact specifications are given in Section S2.8, while SArKS parameters were set as described in Sections 3.2.1-3.2.2.
We used tomtom (Gupta et al. (2007)) to compare the pooled motif sets identified by each algorithm. Figure S3 shows the resulting overlap between motifs sets by algorithm: for each of the benchmarked algorithms, the majority of identified motifs had a SArKS-identified counterpart. SArKS also identified many additional motifs.
The Pearson correlation between the count of occurrences of a given motif in sequence ωb with the score yb across the sequence-score pairs (ωb, yb) provides a natural metric for assessing correlative motif discovery performance. Figure 3 plots the estimated Pearson correlation values for each motif identified (by each algorithm) evaluated using the held-out validation set {(ωb, yb) | b ∈ Vf} appropriate for the fold f in which the motif was discovered (with 3A and 3B presenting results for the Mo 2015 and Close 2017 data sets, respectively).
As Figure 3A-B and Figure S3 demonstrate, the number of motifs identified by different algorithms can be highly variable: DREME, FIRE <C HOMER, MOTIF REGRESSOR ¡ SArKS (the motif count for STEME is a fixed input parameter). The interpretation of the number of motifs is, however, complicated by two factors: (1) the occurrence rate of individual motifs in the relevant biological sequences (promoters, etc.) may differ substantially (e.g., longer motifs may occur less frequently) and (2) some motifs may be very similar in sequence.
The first of these complications is illustrated in Figure 3A-B by faceting horizontally on motif occurrence rate (count per sequence): one visible trend here is that the DREME-, FIRE-, and HOMER-identified motifs tend to occur less frequently than do the MOTIF REGRESSOR and STEME motifs, indicating that DREME, FIRE, and HOMER tend to define motifs more granularly than do MOTIF REGRESSOR or STEME. SArKS-identified motifs are spread across a wide range of per-sequence occurrence rates in this plot, as SArKS identifies both more and less granular motifs as the size of the smoothing window κ is varied through the ranges specified in Sections 3.2.1-3.2.2.
The second complication—the similarities among identified motifs—may be addressed by noting that correlative motif discovery can also be viewed as a form of feature extraction. In this vein, we can assess the performance of such algorithms by using the selected motifs as predictors to build regression models for associated sequence score yb based on the motif counts in the sequence ωb. Figure 3C plots validation set-estimated Pearson correlations of the predictions made by building a linear ridge regression model (using generalized cross-validation (Golub et al. (1979)) to select the L2 regularization parameter) with the sequence scores for each cross-validation fold by algorithm. Motifs were counted only within the sequence range in which they were identified, with these counts then merged into a single feature vector per gene to allow the regression models to consider both up- and downstream motifs simultaneously. This approach collapses the variation in quantity and quality of individual motifs down to variation of a single quantity—the regression model predictions—thereby facilitating a head-to-head comparison of motif discovery algorithms bypassing both of the complications discussed above. As the similarity of some identified motifs manifests as collinearity of regression predictors, regularization is a key component of this modeling approach.
SArKS yields better results than the other algorithms for both validation data sets (Figure 3C); aside from SArKS, the other two correlational motif discovery algorithms (FIRE and MOTIF REGRESSOR) do not appear to show a consistent advantage in performance relative to the discriminative algorithms.
If, instead of using the merged motif feature set, the regression models are built using only upstream or downstream motif counts, the results shown in Figure S2 are obtained, making clear that all six algorithms generally perform better when searching the downstream regions (for which SArKS shows a particularly strong advantage in both data sets).
Considering the downstream motif results, we noted that for every algorithm applied to the Mo 2015 data set, the motif with the highest Pearson correlation coefficient between occurrence count and parvalbumin specificity score in the held-out cross-validation fold exhibited significant tomtom similarity (q ≤ 0.1) to the ESRRA/ESRRB/ESRRG trio of TF-binding motifs documented in the JASPAR database (Mathelier et al. (2015)). Looking at the Close 2017 downstream motif results, we observed that the most highly cross-validation-correlated motifs for three of the algorithms—FIRE, HOMER, and SArKS—were significantly similar to all of the JASPAR motifs TGIF1/TGIF2/MEIS2/MEIS3/PKNOX1/PKNOX2.
In contrast, the top upstream motif results showed no such convergence on common JASPAR profiles: applied to the Mo 2015 data set, only one pairwise combination of two algorithms—FIRE and STEME—produced top upstream motifs (ranked by cross-validated Pearson correlation) that showed significant tomtom similarity (q ≤ 0.1) to a common JAS-PAR profile (NR5A2, whose binding motif closely resembles the ESRRA/ESRRB/ESRRG pattern mentioned above). Applied to the Close 2017 data set, no two algorithms produce motifs similar to the same JASPAR profile.
We see that in those cases where all of the algorithms performed better in the cross-validation testing (downstream), the top motifs were more likely to converge on known TF-binding motifs. Interestingly, SArKS outperformed the other algorithms to a greater degree in the analyses of downstream regions than of upstream regions.
Comparison of all of the motifs discovered by the various algorithms with known TF-binding motifs is further explored in Section S3.2.1.
Finally, we compared the average run times for each of the benchmarked algorithms applied to the upstream and downstream cross-validation analyses. As is shown in Figure 4, SArKS took longer than most of the other algorithms with the exception of STEME; FIRE and HOMER are quite fast relative to the others. Further discussion of the computational complexity of SArKS is provided in Section S3.3.
3.2.4 Permutational analysis of SArKS results
The permutation testing procedure used to set SArKS score thresholds can be used for directly assessing the statistical significance of the motif set SArKS reports as well. This is done by (1) following the procedure laid out in Sections S2.5 and S2.6 using a set of R randomly drawn permutations of the input sequence scores to determine threshold values for motif selection and (2) independently drawing a second set of R2 permutations from which the false positive rate corresponding to these thresholds can be estimated according to (S26).
To demonstrate this procedure, we re-applied SArKS to both the Mo 2015 and Close 2017 data sets here including all 6,326 or 6,939 selected genes (respectively) without cross-validation subsetting. We again investigated all combinations (κ, λ) ∊ {250,500,1000,2500} x {0,10,100} for the smoothing half-width κ and spatial length λ, computing gmin for each value of κ following Equation (S5) using the γ values indicated in Sections 3.2.1-3.2.2, and determining significance thresholds using R = 100 randomly generated permutations.
For the Mo 2015 data set, the analyses performed using the stricter gmin values obtained using γ = 0.1 yielded larger k-mer motif sets: 3,393 total distinct k-mers versus only 1,232 using γ = 0.2 for the upstream sequence set; 380 distinct k-mers using γ = 0.1 versus just 180 using γ = 0.2 for the downstream sequence set. More than 98% of the k-mers discovered using 7 = 0.2 were also identified using 7 = 0.1 (for both sequence ranges: 1,208 of the 1,232 upstream; 179 of the 180 downstream). Based on these results for the Mo 2015 analysis, we focused exclusively on 7 = 0.1 for the Close 2017 analysis, as described in Section 3.2.2.
The results above demonstrate that restrictive values of 7 can yield larger motif sets that include almost all of the motifs obtained using more permissive 7 values. This highlights the importance of the Gini impurity filter in focusing SArKS on potential motifs that appear within sufficiently many distinct sequences WB to achieve reasonable statistical confidence.
We assessed the statistical significance of these SArKS results following the method of Section S2.6 with thresholds 9 and Spatial set by Equation (S24) and Equation (S25) using z = 4. Upstream sequence analysis of the Mo 2015 set considering R2 = 250 independent random permutations resulted in 12 (4.8%) for which any of the parameter sets (ft, gmin, A) yielded a nonempty set of identified motifs; for the Close 2017 set, the same procedure resulted in 8 (3.2%) nonempty motif sets. These upstream sequence results correspond to a 95% family-wise error rate confidence interval (FWER CI) of (2.5%, 8.2%) in the Mo 2015 analysis and (1.4%, 6.2%) in the Close 2017 analysis.
For the downstream sequence analysis, R2 = 250 independent permutations yielded 8 (3.2%) instances of nonempty motif sets for Mo 2015 and 1 (0.4%) nonempty motif set for Close 2017, from which we estimate 95% FWER CIs of (1.4%, 6.2%) for Mo 2015 and (0.01%, 2.2%) for Close 2017.
The role of the parameter z in Eqs (S24)-(S25) in balancing FWER against sensitivity can be seen in the analyses presented here by considering the consequence of increasing z: at z = 5 for the same 250 permutations, the permutation analysis using upstream regions resulted in nonempty motif sets in only 2 permutations for Mo 2015 or 1 permutation for Close 2017. Similarly, for the downstream regions, permutation analysis with z = 5 resulted in 4 or 1 permutation(s) respectively. The cost of these decreased false positive rates to sensitivity is apparent in that at most half of the motif fc-mers identified using z = 4 were still discovered using z = 5 in each of the analyses; for the Close 2017 upstream analysis conducted with z = 5, SArKS returned no significant motif results at all. Here we were willing to accept the FWER values associated with z = 4 (point estimates ranging from 0.4% to 4.8% in these analyses) in order to maintain a higher sensitivity.
Selection of the parameter z to appropriately balance sensitivity against false positive rate will generally depend on the range of k, A, and gmin values investigated. When SArKS analyses are conducted for many combinations of these parameters there will be correspondingly more possible opportunities for false positives, requiring a higher value of z to maintain confidence in the results. In cases where the size of the returned motif set may be large, there is an additional factor to consider: the smaller motif sets associated larger values of z benefit not only from greater statistical confidence but also from a reduction in the computational effort required to refine and process the motif set (Section S3.3).
4 Conclusions
We introduce SArKS as a method for de novo correlative motif discovery. SArKS avoids the dichotomization—and consequent loss of information (Fedorov et al. (2009))—of sequence scores into discrete groups as required by discriminative motif discovery algorithms. SArKS does not require specification of parametric background sequence models, instead using nonparametric permutation methods (Ernst (2004)) to set thresholds for motif identification and to estimate false positive rates. SArKS smooths over spatial motif location to identify multi-motif domains (MMDs), which can in turn help refine the identified motifs. We have benchmarked SArKS against several existing discriminative and correlative algorithms using previously published RNA-seq data: SArKS uncovered particularly large motif sets and SArKS motif sets functioned as more predictive feature sets in a cross-validated regression modeling approach than did motif sets generated by existing algorithms. SArKS thus offers an approach to motif discovery capable of fulling exploit differential gene expression data.
S2 Methods
S2.1 Notation glossary
S2.2 Limiting the impact of intra-sequence repeats
One complicating factor in the strategy described in Section 2.1 is the presence of tandem repeats (common in eukaryotic DNA (Ellegren, 2004)): if the substring x[si, si + rm) (assumed to derive wholly from the single word ωb) consists of r 3> 1 repeats of the same m-mer, then it is likely that the sorted suffix array index positions j and k implicitly defined by sj = si + am and sk = si + bm for small a, b > 0 will be close by, since, assuming without loss of generality that a < b, showing that the suffixes of x beginning at positions (si + am) and (si + bm) agree on their first (r — b)m characters. Since all of the positions si + am for small a must come from the same word block bi they must have the same associated score yb. If this score yb is particularly high, this phenomenon may lead to windows of high ŷj values centered on j satisfying sj = si + am which result from a very small number of different repeat-containing words (perhaps as few as one if the number of repeats is high enough within a single high-scoring word). We thus here develop a natural method for filtering the peak index set I to selectively remove suffix array index values i where the smoothing window is dominated by a few heavily repeated words ωb. The distribution of weighted word frequencies contributing to the window centered at position i of the suffix array table across the full word set ω may for these purposes be summarized by the associated Gini impurity (often used in fitting classification and regression trees (Breiman et al., 1984)): which provides a measure ranging from 0 to of the degree of uniqueness of the words contributing to the calculation of yi.
F (0
As a concrete example, i all of the weighted frequencies word frequencies fb = 1 are the same for a set of exactly q words ωb appearing in the smoothing window centered on i, = 1 — 1. This gi q suggests an intuitive interpretation of (1 — gi) as the multiplicative inverse of the “effective word count” contributing to the smoothing window around i.
Section S2.5 further demonstrates that (1 — gi) is also approximately proportional to the variation of the smoothed scores yi that would be expected if there were no association between the sequences ωb and the scores yb (see Equation (S21) below). This proportionality suggests a simple method for selection of a gmin value at which most suffix array indices i will be retained while filtering out only those most likely to yield false positive results under permutation testing:
As shown in Equation (S21), setting gmin to satisfy Equation (S5) removes suffix indices i for which the variance of the permuted smoothed scores is greater than (1+γ) times the median value. Thus any value γ > 0 will retain the majority of positions i for further consideration. We have used γ = 0.1 or γ = 0.2 for all of the examples in the present work, retaining positions for which the permuted score variance is less than 110% or 120%, respectively, of the median.
Requiring gi_gmin results in rede_ning the peak index set I to screening out positions i for which the repeated occurrence of a few high-scoring words in the window centered at i leads to ŷi > θ.
S2.3 Reducing redundancy in reported motif set
The presence of a k-mer x[si, si + k) associated with a high smoothed score ŷi can also result in high smoothed scores ŷj when sj = si + m if the substring (k — m)-mers x[si + m, si + k) also also preferentially found in higher-scoring sequences, as pictured in Figure S1. The following two steps may be added to the algorithm described in Section 2.1 in order to reduce the reporting of such substrings when they are present only as part of the full k-mer.
S2.3.1 Removing shorter k-mers nested inside longer peak motifs
Cases in which both k-mer x[si, si + k) (e.g., CATACTGAGA in Figure S1) and its sub-(k —m1 —m2)-mer x[si+m1, si+k — m2) (with m1 > 0, m2 > 0; e.g., TACTGAGA in Figure S1 with m1 = 2, m2 = 0) are individually identified can be resolved to report only the longer k-mer by removing any index i <E I (defined by Equation (S6)) if there exists j <E I such that the [kj]-mer interval starting at sj includes all of the [ki]-mer interval starting at si, thus retaining only: This can be done efficiently using an interval tree.
S2.3.2 Extending substring k-mers to match longer motifs from distinct peaks
Besides two cases of nested k-mers which may be removed from the reported motif set by the method of Section S2.3.1 (ATACTGAGA and TACTGAGA), Figure S1 also depicts a shorter k-mer ATACTG derived from a distinct occurrence of the same longer k-mer (CATACTGAGA). Because this distinct occurrence of the longer k-mer was not itself initially identified, the method of Section S2.3.1 does not remove the shorter substring k-mer from the motif set. However, such substring k-mers may be extended to the longer k-mer occurrence by the following method: for each i G I’, define the duplet resolving any ties in the arg max in favor of maximal z0. Equation (S8) picks out the largest super-interval si — z0, si + [ki] + z1 J containing the interval si, si + [ki] J such that the extended is equal to one of the already identified k-mers < x sj, sj + [kj~| J j G I’ >. (In the example of Figure S1, and , corresponding to the light gray highlighted characters surrounding the substring k-mer). Then defines our motif set after removal of nested motifs.
S2.4 Spatial smoothing to identify multi-motif domains
SArKS identifies candidate multi-motif domains (MMDs) through the application of a second round of kernel-smoothing over suffix positions si within words: where we here use uniform kernels of the form (generally with width λ = κ) to search for regions of length λ with elevated densities of high-scoring motifs. Note that defined by Equation (S10) is indexed not by suffix array index i but by suffix array value si giving the spatial position si in the concatenated word x.
To use such spatial smoothing for motif selection/filtering, it is necessary to introduce a second threshold θspatial, as the doubly-smoothed scores will generally be somewhat less dispersed than will be the singly-smoothed ŷi. The threshold θspatial can be used to define an index set Ispatial similar to the manner in which I is defined by Equation (S6), but the task is more complex when we replace the single spatial position si by a spatial window [si, si + λ).
Recalling the definitions of the negative/positive spatial shift operators η(i)/ρ(i) which yield the unique suffix array indexes corresponding to the spatial position immediately before/after si, so that sη(i) = si — 1 and sρ(i) = si + 1, first define: Jspatial represents the set of suffix array indices i corresponding to the left endpoints si of spatial windows [si, si + λ) passing the filters for score threshold θspatial and minimum Gini impurity gmin, and for which the sequence-smoothed score yi is at least as high as the spatially adjacent scores ŷη(i) to the left and ŷρ(i) to the right.
Defining the left-directed distance δj from the suffix with sorted suffix array index j to the set Jspatial by we define in turn the set of sorted suffix array indices Ispatial marking the starting positions of selected k-mers by:
Equation (S14) identifies suffix array indices i: (1) whose spatial positions si fall within a spatial window [sj, sj + λ) for some j G Jspatial, (2) whose sequence-smoothed score ŷi > θspatial, and (3) for which the position si — 1 spatially to the left is either (3A) not in one of the spatial windows specified by Jspatial or (3B) has associated sequence-smoothed score ŷη(i) < θspatial. This final criterion is included because we want to merge adjacent k-mers whose leftmost positions fall within the same selected spatial window.
This merging process is implemented by calculating for each index i G Ispatial the length of the merged k-mer starting at i. Equation (S15) sets by selecting the right endpoint + sj of the k-mer beginning at sj to maximize the merged length + sj — si over all choices of sj for which every position sm between si and sj has an acceptable sequence-smoothed score ym > θspatial. It is then straightforward to obtain the motif set
S2.5 Permutation testing to establish significance of motif set
The significance of the observed correlation between the occurrence of the motifs uncovered by SArKS and the sequence scores yb can be evaluated by examining results obtained when the sequences ωb and the scores yb are independent of each other. To this end, the word scores yb are subjected to permutation π to define If the permutation π is randomly selected independently of both the sequences ωb and the scores yb, any true relationships between sequences and scores will be disrupted. This suggests a simple method for assessing the significance of motifs discovered using a given set of parameters (kernel half-width κ, θ, gmin, etc.): generate R random permutations πr and for each permutation calculate scores using Equation (4) (and also using Equation (S10) if spatial smoothing is employed) with yb replaced by yπb. In this manner one can estimate the distribution of scores under a null model in which there is no association between the sequences of the various words ωb and the scores yπb.
This method of significance testing also provides the motivation for the form of Equation (S4) in Section S2.2. Let Π be a random variable representing a random permutation and note that the random variables yΠ(b) satisfy while, assuming that the number of words n = |W| is large enough that we may approximate yΠ(b) J- yΠ(b’) for b = b’, where fb(i is defined by Equation (S3) and for all b Equation (S19) then tells us that where the Gini impurity gi is defined by Equation (S4). Thus smaller values of gi imply higher variance of the window-smoothed scores obtained under random permutation (with mean unchanged). This increased variance will lead to the requirement of larger cutoff values θ for reporting motifs discovered in the unpermuted data with a given degree of confidence unless positions i with gi < gmin are filtered out as described in Section S2.2.
S2.6 Permutation testing to set thresholds for multiple parameter combinations
Multiple combinations of the values of SArKS parameters may be explored (with α indexing the set of desired combinations); for example, Sections 3.2.1-3.2.2 discuss the parameters used in the benchmarking examples herein and the rationales for their selection. For any permutation π, let (i.e., is the highest ltered sequence-smoothed score obtained ater permuting by π, while is similarly the highest ltered spatially-smoothed score). hen we suggest a simple method for setting thresholds θ(α) and based on a set of randomly generated permutations {πr}: with higher values of z trading reduced sensitivity for lower false positive rates (in the examples analyzed in Section 3.2.1 we take z = 4). For the sake of simplicity we have generally used only one of these two thresholds for any particular combination of parameters α, setting θ(α) = —oo if κ(α) > 1 or if κ — 1 (i.e., if spatial smoothing is not employed).
In order to characterize the false positive rate associated with the entire set of analyses across all of the parameter settings employed while controlling for multiple hypothesis testing, a family-wise error rate (FWER) e resulting from these thresholds can then be estimated by generating an independent set of R’ permutations and counting the number of permutations for which a nonempty set of k-mer motifs is identified using any of the parameter sets . That is, writing (where and are defined respectively by Equation (S6) and Equation (S14) using the thresholds θ(α) and determined using the original permutation set {πr}) we can infer confidence intervals by noting that the random variable E instantiated in e satisfies E ~ Binom(R’, e) under the permutation test null hypothesis. We can thus derive confidence intervals (CIs) for the FWER (in the weak sense, as the permutation test represents a complete null hypothesis with no true positives (Farcomeni, 2008)) by applying the Clopper-Pearson method for estimation of binomial CIs.
S2.7 RNA-seq expression analysis
S2.7.1 Assigning PV differential expression scores for Mo 2015 data set
In order to test SArKS, we selected the M. musculus neocortical neuron RNA-seq data set GSE63137
(Mo et al., 2015) from Gene Expression Omnibus (GEO) database (Barrett et al., 2013) (https://www.ncbi.nlm.nih.gov/geo/
This data set contains detailed transcriptomic and epigenetic information from three functionally and neurochemically distinct classes of pooled neocortical neurons: principal excitatory neurons, parvalbumin-positive (PV) GABAergic neurons, and vasoactive intestinal peptide-positive (VIP)
GABAergic neurons.
Because the position of the first exon can help pinpoint the TSS—and hence the DNA region containing the putative promoter—we reanalyzed the GSE63137 RNA-seq data using kallisto (Bray et al., 2015) to quantify and normalize transcript level expression against Ensembl mouse cDNA reference GRCm38. Transcript species were filtered by mean expression to focus on those for which reliable expression estimates could be made, retaining only transcripts for which at least 100 pseudocounts were obtained when summed across all samples and whose mean normalized expression met or exceeded the median of the transcript mean normalized expression levels. We also filtered out transcripts that showed low variance across the full sample set, retaining only those for which the estimated variance of normalized expression values met or exceeded median across all transcript species (Bourgon et al, 2010). In order to simplify downstream analysis, only the isoform with highest mean expression level across all samples was retained per gene. Finally, as based on chromatin accessibility data (Mo et al, 2015), only transcripts for which the transcription start sites were located within ATAC-seq peaks (i.e., were accessible) for all examined neuron classes were analyzed. This accessibility-based filter reduced the likelihood that epigenetic features, rather than regulatory sequences, determine the variations in gene expression between cell classes.
Differential gene expression was assessed using normalized expression values via standard Student’s t-test (comparing data for PV neurons to data for excitatory and VIP neurons), with the resulting t-statistic providing an estimate of a gene’s enrichment in PV neurons (score yb for transcript b). One potential issue with the use of such t-statistics with small sample numbers—here, two samples associated with each neuron type—is that especially low within-group standard deviation estimates can result in very large magnitude t-statistics for a few genes. For example, the average estimated within-group standard deviation of the 76 genes with |tb| > 10 (with |tb| ranging up to a maximum value of 49.6) was less than 30% of the average within-group standard deviation of the full set of 6,326 analyzed genes (Table S1). Every one of the 76 genes with such high magnitude t-statistics had a within-group standard deviation estimate below the median value for the full gene set.
The phenomenon of low within-group variance estimates leading to inflated test statistics has previously led to the application of empirical Bayes methods (Smyth, 2004) using moderated t-statistics in place of standard t-statistics for calculating differential expression p-values. As we are here instead interested in using the t-statistics to derive word scores yb, for which no particular distributional assumptions are required, we have adopted a simpler approach to prevent the few very large magnitude t-statistics from unduly influencing motif discovery by applying a ceiling of 10 on the magnitude of yb:
S2.7.2 Assigning DCX differential expression scores for Close 2017 data set
We also examined an RNA-seq data set comparing transcriptomes of in vitro-induced human embryonic stem cells and the resulting cultured interneurons (Close et al., 2017). We applied SArKS to identify promoter motifs associated with elevated gene expression in doublecortin-positive (DCX+) interneurons. We restricted our analysis to the post-induction day 54 (D54) timepoint, where most of the DCX+ neurons were post-mitotic and GABAergic, and for which the largest total number of cells had been profiled, minimizing the within-group expression variations.
We used the normalized gene expression levels from GEO (Barrett et al., 2013) records for this data set (accession GSE93593). We chose not to reanalyze the sequencing data in this case because we did not want to split the read counts per cell—which, given the large numbers of cells observed, tend to be much lower than read counts per sample in bulk RNA-seq—across multiple distinct transcripts for each gene. We found 15,490 genes for which (1) nonzero aligned read counts were detected in at least one (out of 585) analyzed cells and (2) a unique entry was found in the GRCh38 annotation of the human genome. We retained the 6,939 genes from this set for which the average aligned read count per cell was ≥ 25 for further analysis (Table S1). As the variance of the log2-transformed transcripts-per-million (TPM) normalized expression levels was quite high (≥ 1 for 6,852 of the 6,939 genes, ≥ 0.5 for all 6,939 genes), we did not apply any variance filter for this data set. As no epigenetic information was available, no accessibility filtering could be conducted.
For the filtered high-read-count gene set, differential expression was assessed via a simple two group t-test comparing the DCX+ cells to the DCX- cells and SArKS scores were assigned according to Equation (S27), just as was done for the Mo 2015 data set.
S2.8 Specifications for running existing motif discovery algorithms
Existing algorithms were run at their default parameter settings (defined either within the source code or in associated documentation), with two exceptions: (1) MOTIF REGRESSOR was run searching only for motifs positively correlating with score to enable more direct comparison of its output with that of the other algorithms (none of which look for anticorrelated motifs by default). (2) STEME was run in discriminative mode using a high order Markov model on the negative sequences exactly as suggested in the online documentation (https://pythonhosted.org/STEME/using.html); however, STEME’s implementation requires pre-specification of the number of motifs to report, defaulting to a single motif if unspecified. Given that (Reid and Wernisch (2014)) extensively compared STEME to DREME with the finding that the two were generally comparable in performance, we took the upper bound on the observed DREME motif set size (10 motifs) as the number of motifs for STEME to report. ## ## DREME options: dreme \ -p $pos_seq_fasta \ -n $neg_seq_fasta \ -oc $out \ -png ## ## FIRE options: perl fire.pl \ --expfiles=$scores \ --exptype=continuous \ --fastafile_dna=$seq_fasta \ --seqlen_dna=$seq_len \ --nodups=1 ## ## HOMER options: homer2 denovo \ -i $pos_seq_fasta \ -b $neg_seq_fasta ## ## MOTIFREGRESSOR options: MotifRegressor.pl \ $scores \ $seq_fasta \ null 1 1 2 1 0 50 250 50 250 5 15 50 30 \ $out ## _interpretation of MOTIFREGRESSOR parameters above_ ## null : background sequence distribution to be computed based on input sequences ## 1 : use column 1 from $scores to rank sequences ## 1 : use column 1 from $scores to perform regression ## 2 : data does need to be further log-transformed ## 1 : look for motifs in high-scoring (as opposed to low-scoring) sequences ## 0 : select fixed count of top motifs (as opposed to setting fixed score threshold) ## 50* : number of initial top motifs ## 250* : number of sequences with high values for confirmation ## 50* : (ignored since we are only interested in high-scoring motifs) ## 250* : (ignored since we are only interested in high-scoring motifs) ## 5* : minimum motif width ## 15* : maximum motif width ## 50* : number of seed candidate motifs ## 30* : number of motifs reported before regression ## _all parameters marked with * were set at the example (e.g.) values given in ## MOTIFREGRESSOR’s README.MR file_ ## ## STEME options: steme \ --output-dir=$out \ --num-motifs=10 \ --bg-model-order=5 \ --bg-fastafile=$neg_seq_fasta \ $pos_seq_fasta ## _see https://pythonhosted.org/STEME/using.html_
S3 Results and Discussion
S3.1 Illustration of SArKS using simulated data
To demonstrate that the results of Section 3.1 are not a quirk of a single simulation, we repeated the process of (1) generating 30 random sequences, embedding the motif CATACTGAGA into the last 10 sequences, and (2) applying SArKS to the sequences and sequence scores 1000 times. In 971 iterations, the maximum value calculated using the unpermuted sequence scores exceeded the maximum value obtained using one set of randomly permuted sequence scores per iteration. The full distribution of the differences is shown in the motif (+) column of Table S2.
We also examined the results of SArKS applied to simulated data in which no motif was present to find; for this purpose, we repeated an amended version of the simulation process 1000 times, omitting the motif embeddings. The distribution of for these no-motif simulations is presented in the motif (-) column of Table S2. In this case, ŷmax exceeded in only 239 of the simulations, while exceeded ŷmax in 282 simulations, with equality between the two holding in the remaining 479 iterations. The symmetry of the distribution of around 0 in the motif (-) case is to be expected since the scores yb are independent of the sequences ωb whether permuted or not if no motifs are included. By contrast, the strong asymmetry of the distribution of when the motif is present demonstrates the ability of the permutation approach to differentiate a true signal from background noise.
S3.2 Uncovering promoter motifs associated with differential gene expression
S3.2.1 Benchmark comparisons to existing algorithms
The cross-validated regression modeling strategy described in Section 3.2.3 builds a single regression model based on the concatenated upstream and downstream motif count vectors. We also built two more separate regression models—one using as feature set only the upstream motif counts, the other only the downstream motif counts—for each of the two data sets, obtaining the results presented in Figure S2. SArKS generally outperformed the other algorithms in these comparisons, though for the upstream analysis of the Close 2017 data, DREME and HOMER offer similar performance; all of the algorithms have their poorest performance in this particular analysis. For both data sets the regression model predictions on the held-out validation folds are noticeably better in the downstream analyses than the upstream analyses, as discussed in Section 3.2.3.
We used tomtom (Gupta et al., 2007) to compare the pooled motif sets identified by each algorithm and detected overlap between motifs sets by algorithm (S3). There exists a significantly similar (q ≤ 0.1) SArKS-identified motif for the majority of motifs identified by any of the existing algorithms in the Mo 2015 data set. For the Close 2017 data set, at least 50% of the motifs identified by DREME, FIRE, MOTIF REGRESSOR, and STEME can be paired with a significantly similar SArKS motif, though this is true for only 39% of HOMER-identified motifs.
An alternative benchmarking approach is to compare the motifs identified algorithmically to databases of known TF-binding motifs, such as JASPAR (Mathelier et al., 2015). For the presence of a TF-binding site to be biologically relevant in a cell, it is necessary for the TF itself to be present as well. In the context of our analysis of the two RNA-seq data sets, we checked whether or not mRNA encoding TFs whose binding sites were similar to discovered motifs are enriched among either the PV neuron (Mo 2015) or DCX+ cell (Close 2017) transcripts. We classified a TF gene as enriched if there was at least one distinct mRNA transcript for the gene with (1) at least 100 reads (or pseudocounts for the Mo 2015 set) and (2) for which the mean TPM-normalized estimated expression level in either the PV samples (Mo 2015) or DCX+ cells (Close 2017) ≥ the median of the genewise means for all measured transcripts/genes in the relevant data set. Figure S4A is similar to a receiver-operating characteristic plot in which the motif discovery algorithms are regarded as classifying JASPAR motifs as positive when they show sufficient similarity to any of the discovered motifs; the distance of a point above the diagonal indicates the degree to which an algorithm preferentially identifies binding motifs for TFs showing high RNA-seq expression levels in the target cell population. SArKS identifies motifs similar to a larger fraction of JASPAR than do the other algorithms while maintaining a preference for motifs for highly expressed TFs.
Figure S4B illustrates the overlaps between the sets of JASPAR motifs with similarities among the motifs identified by the motif discovery algorithms: For all algorithms applied to the Close 2017 data set and all but HOMER in the Mo 2015 data set, the set of JASPAR motifs with significant similarity (q ≤ 0.1) to one of the algorithm-identified motifs overlaps by more than 50% with the set of JASPAR motifs significantly similar to a SArKS motif. The degree of overlap between the JASPAR matches among the various algorithm motif sets tends to be higher than the degree of overlap directly between the motif sets themselves. This suggests that the presence of a similar JASPAR motif may provide supporting evidence that a given detected motif is a not a false positive.
S3.2.2 Case study: analysis of SArKS results for Mo 2015 data set
The values of gmin obtained for the analysis of the Mo 2015 gene set (6,326 genes remaining after application of filters described in Section S2.7.1), along with the fraction of suffix array index values i for which gi > gmin, are listed in Table S3.
S3.2.2.1 Top motif identified in sequences downstream of TSS
The highest ŷi value obtained—detected in the downstream sequence analysis using n = 250, λ = 0, and 7 = 1.1 in the downstream region analysis—corresponded to the fc-mer TGACCTTG. This fc-mer is very similar to a number of JASPAR TF-binding motifs. The strongest matches are to the binding motifs of ESRRB (q = 0.00078), ESRRA (q = 0.00078), and ESRRG (q = 0.00301). In fact a large fraction of the motifs associated with identified peaks in ŷi identified in the downstream analysis exhibit significant similarity to one of the JASPAR motifs ESRRB, ESRRA, or ESRRG, as is illustrated in Figure S5B. The ESRR(A/B/G) TFs are all members of the estrogen-related receptor family; there is evidence that these receptors are involved in brain functions including synaptic transmission, neuronal firing, and mitochondrial biogenesis (Saito and Cui, 2018). This particular set of motifs may also help to explain the overall stronger performance of all of the motif discovery algorithms using the downstream sequences relative to the upstream sequences (Figure 3A), as we noted that all of these algorithms identified motifs similar to each of these JASPAR motifs (Section 3.2.3).
S3.2.2.2 Top motifs identified in sequences upstream of TSS
ESRRB/ESRRA/ESRRG binding motifs were also identified by SArKS analysis of the upstream sequences, but they did not account for either the highest scores ŷi nor did they correspond to a large fraction of the overall k-mer motif sets discovered (Figure S5A). Figure S6 sheds some light on this: the distribution of sequence scores yb for downstream sequences containing one or more copies of the top SArKS octamer TGACCTTG is shifted upward to a much higher degree than is the the distribution of yb values for upstream sequences containing TGACCTTG.
Instead, For five of the 12 distinct combinations of smoothing half-window κ and spatial window λ investigated using SArKS, the k-mer CCACCTGC was identified at the positions si with maximal values of ŷi (the k-mers GCACACCTT, TGGAACTCACT, CCTGGAAC, and CAGCCTGG (identified using two distinct parameter combinations at the same suffix index i) were associated the highest ŷi values using the remaining seven parameter combinations). The octamer CCACCTGC contains the canonical core recognition E-box sequence CANNTG (specifically, the E12-box variant CACCTG (Bouard et al.,, 2016); we note that the significant SArKS peak set contains many peaks corresponding to the 7-mer CACCTGC as well as the longer octamer adding the extra initial C). Comparison of CCACCTGC with known motifs from the JASPAR database using tomtom finds some evidence of similarity to 10
TF-binding motifs (SNAI2, MAX, SCRT2, SCRT1, TCF3, MNT, Id2, MAX::MYC, TCF4, and FIGLA; q-values of 0.14 for each), though no similarities significant at q ≤ 0.1. Unlike the case for the ESRR(A/B/G) motifs discovered in the downstream analyses, for which all of the benchmarked algorithms detected a matching motif, only one of the other algorithms (HOMER) detected a motif similar to either CACCTGC or CCACCTGC (tomtom q ≤ 0.1; no other algorithm produced any motifs matching even at q ≤ 0.5).
S3.2.2.3 B1 SINE sequence identified through MMD analysis
As the octamer CCACCTGC was identified in analyses with spatial window length λ ranging up to 100, we performed a multiple sequence alignment using muscle (Edgar, 2004) of the 100-mers x[si — 50, si + 50) for these positions si (Figure S7A); three of the five 100-mers thus aligned were very similar (Levenshtein distance < 7) to the 99-mer consensus sequence constructed. Furthermore, the consensus sequence also contains CCTGGAAC and CCAGGCTG (reverse complement of CAGCCTGG).
A blast screen of known repeated elements in the mouse genome for a consensus sequence uncovered a 93.9% identical base pair stretch of the B1 short interspersed element (SINE) sequence (SINEBase (Vassetzky and Kramerov, 2013)). The B1 SINE family consists of retrotransposon-derived sequences appearing throughout the mouse genome, especially upstream and within introns of genes implicated in DNA remodeling and expression regulation (Tsirigos and Rigoutsos, 2009). Additional observations have further suggested that SINEs function as transcriptional enhancers (Ichiyanagi, 2013; Elbarbary et al, 2016; Ge, 2017).
Figure S5A indicates the number of SArKS-identified peaks that fall within blast hits between the upstream sequences ωb and the B1 SINE consensus sequence as well as the numbers of peaks corresponding to the top motifs discussed above. The upstream SArKS peaks derived from analyses involving spatial-smoothing (λ ∊ {10,100}) are dominated by B1 sequences, many including the CCACCTGC motif or its reverse-complement.
S3.2.2.4 SArKS motifs correspond to variations on B1 sequence
Figure S7B provides a more detailed view of these peak counts by splitting them out by position to which the corresponding k-mers align to the B1 consensus and by whether they are matched or mismatched to the B1 consensus at each position. The k-mer CCACCTGC itself is not quite a perfect match to the canonical B1, containing a single base substitution away from the octamer CCGCCTGC whose reverse complement GCAGGCGG is found at positions 49-56 of the SINEBase B1 sequence. This substitution is responsible for the peak at position 54 in the mismatch counts in Figure S7B—one of the few positions at which there are more mismatches than matches. This G to A substitution creates the above noted E-box sequence CANNTG, while the unmodified octamer CCGCCTGC does not match any JASPAR motifs at q ≤ 0.5. This highlights the ability of SArKS to discover potentially functionally significant variations within a recurring sequence.
One of the remaining top upstream k-mer motifs mentioned above, GCACACCTT, similarly matches the nonamer GCACGCCTT spanning positions 15-23 of the SINEBase B1 sequence, but with a single G to A substitution. The modified nonamer GCACACCTT identified by SArKS shows significant similarity (tied q values of 0.038) to several JASPAR motifs (TBX21, EOMES, TBX15, TBX1, and TBX2), while the unmodified B1 nonamer GCACGCCTT again shows no similarity to any JASPAR motifs at q ≤ 0.5, again suggestive that specific SINE variants may promote differential gene expression.
S3.2.2.5 SArKS-based candidate promoter selection
Finally, to illustrate how SArKS can be used to help select candidate regulatory regions for promoting specific expression patterns, we again constructed a ridge regression model based on the counts of SArKS-identified k-mer motifs. We applied the same modeling strategy as described in Section 3.2.3 to the promoter regions defined by the 3,000 base pairs immediately preceding the TSSs of each of the 6,326 distinct analyzed transcripts. Each distinct transcript was then assigned a score by resubstitution into the resulting regression fit. Table S4 shows a sequence of filters in which these regression scores were applied alongside other relevant criteria to select candidate PV-specific promoter regions. The promoter regions associated with the genes ATP5SL, GPRC5B, IFT27, KCNH2, MAFB, PAQR4, SLC29A2, SYT2, TBC1D2B, TMEM186, and TTC39A comprise the 11 candidates (from the final row of Table S4) selected for further experimental validation. Table S5 shows which of the top motifs discussed above are present in each of the candidate promoter regions: all regions except those for GPRC5B and MAFB contain at least one match for the ESRRB motif, while several also contain one or more copies of the E-box sequence and/or a match to the B1 SINE sequence. The promoter for IFT27 contains a match to a variant the B1 sequence with the substitution creating the E-box sequence CACCTGC. It is worth noting that there are many other SArKS motifs contributing to the promoter ranking model used here. Indeed, in accord with the principle that there is likely to be more than one way for combinations of motifs to achieve expression specificity, the candidate promoters for the genes GPRC5B and MAFB are ranked highly based exclusively on motifs other than the highest scoring ones.
S3.3 Computational complexity and scalability of SArKS
One of the major motivations behind SArKS’ method of discovering motifs—searching for blocks of lexicographically similar suffixes derived predominantly from high-scoring sequences—lies in the scalability of suffix-based methods. The number of suffixes of a string (or set of strings) scales linearly with the length of the string(s) involved: as a result, the steps involved in the SArKS algorithm for identifying significant peaks scale linearly in both runtime and memory space with the combined size of the set of input sequences. We discuss this in more detail below. We then discuss the complexity of the later steps involved in extracting information regarding specific motif k-mers from the significant SArKS peak set.
The existing implementation of SArKS generates and then stores in memory the full suffix array of the concatenated sequence x = ω0 * …ωn-1: this step is asymptotically linear in the length of the concatenated sequence both in terms of runtime and memory (Kärkkäinen and Sanders, 2003). There is one caveat regarding the memory requirement here: the suffix array for a sequence of length l contains a permutation of the first l integers; while the length of this array is linear in l, the number of digits required to specify each integer grows logarithmically with l as well. An uncompressed suffix array (as used here) thus technically requires memory specified in bits scaling with llogl. Assuming the default use of 64-bit integers (as is done in the numpy-based python implementation we have used), however, memory will scale linearly for sequences of length up to ~ 1018 characters, far beyond current practical limits.
Given the inverted suffix array is yielding the value of the suffix array index i corresponding to the suffix array value si, the block array (Equation (3)) can be constructed in linear time and space (again in terms of the length of the concatenated sequence x) by (1) looping through the positions s in the concatenated string x, (2) checking whether the active block b needs to be incremented according to whether s > lb+1 (Section 2.1), and (3) filling in the position is of the block array with the active block value b.
Kernel smoothing using a uniform kernel may be implemented in linear time by computing differences of cumulative sums (Equation (6)). The array of Gini impurity values (Equation (S4)) can be computed in linear time by successively computing the difference in consecutive values resulting from shifting the smoothing window by one position and updating the associated block frequencies Equation (S3). Identification of peaks (by comparing the score of each position to the scores of the two spatially adjacent positions) in the array of smoothed suffix scores ŷi, along with the filtering of the resulting peak set based on score threshold θ and Gini impurity threshold min, again requires time linear in the length of the concatenated sequence x. Similar remarks hold for the analogous spatial smoothing operations.
Permutation testing requires repetition of the above steps R times, where R is the number of permutations, and is hence still asymptotically linear in the length of the concatenated sequence x. While in principle parallelizable, each permutation will require its own smoothed (and, if desired, spatially smoothed) score array, so that parallelization requires memory linear in R * |x|.
Motif length selection according to Equation (7) could be naively implemented in 0(kmax* k) time per peak by directly comparing each suffix in the smoothing window to the suffix corresponding to the suffix array index around which the window is centered. In fact it is generally faster to use the suffix array to compute the suffix array index bounds for which the k-mer prefixing the central suffix is conserved (this may be done quite efficiently using the Burrows-Wheeler transform (Ferragina and Manzini, 2000); in our implementation of SArKS we have generally avoided this in order to reduce the memory requirements of the algorithm, favoring instead a slightly less efficient binary search approach). Either way, motif length selection generally requires time linear in the size of the peak set; in practice, when the peak set is large, this step can be relatively time consuming.
Merging of spatially adjacent k-mers originating within the same spatial smoothing window (Equation (S15)) may be computed in time linear in the size of the peak set times the length of the spatial smoothing window A. In the case of large peak sets, this step can be time consuming as well.
S4 Future directions
S4.1 Gapped motif detection
While lexical sorting of suffixes assembles occurrences of the same k-mer into a block of adjacent index positions i, gapped motifs such as in which there is significant variability in the characters appearing within the internal substring ugap will be scattered into distinct subblocks dispersed within the larger superblock corresponding to their common prefix u0. This dispersion can dilute the apparent correlation ŷi between motif and score by mixing non-matching suffixes in with those corresponding to u within the range of the smoothing kernel.
While the technique described in Section S2.4 ameliorates this problem, it does not specifically focus on the important situation where a head motif u0 is always followed by the same tail motif u1 after the variable region ugap. Such gapped motifs might be discovered using SArKS by first applying a relatively relaxed threshold θ (which may on its own admit many false positives) and then examining the tail sequences ugap * u1 * • • • following it for evidence of an enriched sequence u1, removing candidate head sequences for which no such corresponding tails can be found. In this way, the ability of SArKS to detect motifs with particularly variable internal positions may be improved.
S4.2 Other applications of SArKS
While we have tested SArKS as a method for identifying candidate cell type-specific regulatory motifs, it could also be applied to sequence motifs associated with state dependent changes in activated neurons of a single class as well as to differential gene expression in cancer and in specimens that have been exposed to varying physical or chemical stimuli. We also anticipate uses far afield from analysis of biological sequences, including motif discovery in time series data (Fu, 2011), or, by considering node or edge sequences produced by random walks, analysis of complex network structure (Masoudi-Nejad et al, 2012).
Acknowledgements:
The authors thank Dr. Becca Young, Eric Brenner, Brian Gereke and Dr. Preeti Mehta for helpful discussions and Dr. Ila Fiete for a critical reading of the manuscript. This project has been supported by NIH BRAIN Initiative award U01NS094330 to BVZ.