Abstract
Transcriptional regulatory networks (TRNs) are enriched for certain network motifs. This could either be the result of natural selection for particular hypothesized functions of those motifs, or it could be a byproduct of mutation (e.g. of the prevalence of gene duplication) and of less specific forms of selection. We have developed a powerful new method for distinguishing between adaptive vs. non-adaptive causes, by simulating TRN evolution under different conditions. We simulate mutations to transcription factor binding sites in enough mechanistic detail to capture the high prevalence of weak-affinity binding sites, which can complicate the scoring of motifs. Our simulation of gene expression is also highly mechanistic, capturing stochasticity and delays in gene expression that distort external signals and intrinsically generate noise. We use the model to study a well-known motif, the type 1 coherent feed-forward loop (C1-FFL), which is hypothesized to filter out short spurious signals. We found that functional C1-FFLs evolve readily in TRNs under selection for this function, but not in a variety of negative controls. Interestingly, a new “diamond” motif also emerged as a short spurious signal filter. Like the C1-FFL, the diamond integrates information from a fast pathway and a slow pathway, but their speeds are based on gene expression dynamics rather than topology. When there is no external spurious signal to filter out, but only internally generated noise, only the diamond and not the C1-FFL evolves.
Author Summary
Frequently occurring motifs are thought to be fundamental building blocks of biological networks, conducting specific functions. However, we still lack definitive evidence that these motifs have evolved “adaptively” (to perform the particular function proposed for them), rather than “non-adaptively” (as byproducts of some other function, or as an artifact of patterns of mutations). Here we develop a powerful null model that captures important non-adaptive factors that can shape the evolution of transcriptional regulatory networks, and use it to provide the missing piece of evidence of adaptive origin in the case of the most studied motif, a feed-forward loop that is hypothesized to filter out short spurious signals. We also find evidence for an alternative solution to this problem, where the functionality of the feed-forward loop is encoded not in network topology, but in the dynamics of gene expression. Our model is suitable for studying whether other network features have evolved adaptively vs. non-adaptively.
Introduction
Transcriptional regulatory networks (TRNs) are integral to development and physiology, and underlie all complex traits. An intriguing finding about TRNs is that certain “motifs” of interconnected transcription factors (TFs) are over-represented relative to random re-wirings that preserve the frequency distribution of connections [1, 2]. The significance of this finding remains open to debate.
The canonical example is the feed-forward loop (FFL), in which TF A regulates a target C both directly, and indirectly via TF B, and no regulatory connections exist in the opposite direction [1-3]. Each of the three regulatory interactions in a FFL can be either activating or repressing, so there are eight distinct kinds of FFLs [4; Fig 1]. Given the eight frequencies expected from the ratio of activators to repressors, two of these kinds of FFLs are significantly over-represented [4]. In this paper, we focus on one of these two over-represented types, namely the type 1 coherent FFL (C1-FFL), in which all three links are activating rather than repressing (Fig 1, top left). C1-FFL motifs are an active part of systems biology research today, e.g. they are used to infer the function of specific regulatory pathways [5, 6].
The over-representation of FFLs in observed TRNs is normally explained in terms of selection favoring a function of FFLs. Specifically, the most common adaptive hypothesis for the over-representation of C1-FFLs is that cells often benefit from ignoring short-lived signals and responding only to durable signals [3, 4, 7]. Evidence that C1-FFLs can perform this function comes from the behavior both of theoretical models [4] and of in vivo gene circuits [7]. A C1-FFL can achieve this function when its regulatory logic is that of an “AND” gate, i.e. both the direct path from A to C and the indirect path from A to B to C must be activated before the response is triggered. In this case, the response will only be triggered if, by the time the signal trickles through the longer path, it is still active on the shorter path as well. This yields a response to long-lived signals but not short-lived signals.
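The filtering argument can be illustrated with a deliberately simple deterministic toy, which is not the stochastic model developed in this paper. In this sketch, the signal A drives production of the intermediate TF B, and the effector C is "on" only while A is present and B exceeds a threshold. The function name, the rate constants, and the threshold are all hypothetical values chosen for illustration.

```python
def c1_ffl_time_on(signal_duration, total_time=90.0, dt=0.1,
                   k_prod=1.0, k_deg=0.1, threshold=8.0):
    """Minutes for which the effector C is 'on' in a toy AND-gated C1-FFL.

    All parameter values are illustrative assumptions, not taken from the paper.
    A (the signal) is on for the first `signal_duration` minutes.
    B is produced at rate k_prod while A is on and decays at rate k_deg.
    C is on only when A is on AND B >= threshold (the AND gate).
    """
    b = 0.0
    time_on = 0.0
    steps = int(round(total_time / dt))
    for i in range(steps):
        t = i * dt
        a_on = t < signal_duration
        # Forward-Euler update of the slow intermediate TF B
        b += dt * ((k_prod if a_on else 0.0) - k_deg * b)
        # AND gate: the short path (A) and the long path (B) must both be active
        if a_on and b >= threshold:
            time_on += dt
    return time_on


short_response = c1_ffl_time_on(10.0)   # brief, spurious signal
long_response = c1_ffl_time_on(90.0)    # durable signal
```

With these assumed parameters, B has not yet reached the threshold when a 10-minute signal ends, so the effector never turns on, whereas a sustained signal eventually triggers the response after a delay.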
However, just because a behavior is observed, we cannot conclude that the behavior is a historical consequence of past selection favoring that behavior [8, 9]. The explanatory power of this adaptive hypothesis of filtering out short-lived and spurious signals needs to be compared to that of alternative, non-adaptive hypotheses [10]. The over-representation of C1-FFLs might be a byproduct of some other behavior that was the true target of selection [11]. Alternatively, it might be an intrinsic property of TRNs generated by mutational processes – gene duplication patterns have been found to enrich for FFLs in general [12], although not yet C1-FFLs in particular. Adaptationist claims about TRN organization have been accused of being just-so stories, with adaptive hypotheses still in need of testing against an appropriate null model of network evolution [13-23].
Here we develop such a computational null model of TRN evolution, and apply it to the case of C1-FFL over-representation. We simulate gene duplication and deletion, and include sufficient realism in our model of cis-regulatory evolution to capture the non-adaptive effects of mutation in shaping TRNs. In particular, we consider “weak” TF binding sites (TFBSs) that can easily appear de novo by chance alone, and from there be selected to bind a TF more strongly.
It is also important to capture the stochasticity of gene expression, which causes the number of mRNAs and hence proteins to fluctuate [24, 25]. This is because demand for spurious signal filtering and hence C1-FFL function may arise not just from external signals, but also from internal fluctuations. Stochasticity in gene expression also shapes how external spurious signals are propagated. Stochasticity is a constraint on what TRNs can achieve, but can be adaptively co-opted in evolution [26]; either way, it might underlie the evolution of certain motifs. Most computational models of TRN evolution that consider gene expression as the major phenotype do not simulate stochasticity in gene expression (see [27-29] for three notable exceptions). The genotype to phenotype map we develop here does include intrinsic stochasticity in gene expression.
Here we use this model to ask whether AND-gated C1-FFLs evolve as a response to selection for filtering out short and spurious external signals, compared to conditions that control for both mutational biases and for less specific forms of selection. We find that they evolve far more often under these specific selection conditions than under control conditions, providing long-awaited support for the adaptive hypothesis. We also ask whether there are alternative motifs that evolve to solve the same selective challenge. We find that a “diamond” [30] is such a motif, filtering out short spurious signals by requiring them to arrive not through both a long and a short path, but through both a fast and a slow path of equal topological lengths. We also compare motifs that evolve to filter out external spurious signals to those that evolve in response to intrinsic stochastic noise in gene expression. We find that while both diamonds and C1-FFLs evolve in response to the former, only diamonds evolve in response to the latter.
Models
Overview of the model
We simulate the dynamics of TRNs as the TFs activate and repress one another’s transcription. For each moment in developmental time (i.e. on the timescale of one cell responding to stimuli), we simulate the numbers of nuclear and cytoplasmic mRNAs in a cell, the protein concentrations, and the chromatin state of each transcription start site. Transitions between three possible chromatin states -- Repressed, Intermediate, and Active -- are a stochastic function of TF binding, and transcription initiation from the Active state is also stochastic. An overview of the model is shown in Fig 2. The pattern of TF binding affects chromatin, which affects transcription rates, eventually affecting the concentration of TFs and so completing regulatory feedback loops. The genotype is specified by a set of cis-regulatory sequences that contain TFBSs to which TFs may bind (which, as nucleotide sequences, are subject to realistic mutational parameters), by which consensus sequence each TF recognizes and with what affinity, and by 5 gene-specific parameters that control gene expression as a function of TF binding: mean duration of transcriptional bursts, mRNA degradation, protein production, and protein degradation rates, and gene length which affects delays in transcription and translation. An external signal is treated like another TF, and the concentration of an effector gene in response is a primary determinant of fitness, combined with a cost associated with gene expression (Fig 2). Mutants replace resident genotypes as a function of the difference in estimated fitness. Parameter values, taken as far as possible from Saccharomyces cerevisiae, are summarized in Table 1. Source code in C is available at https://github.com/MaselLab/network-evolution-simulator.
Transcription factor binding
Transcription of each gene is controlled by TFBSs present within a 150-bp cis-regulatory region, corresponding to a typical yeast nucleosome-free region within a promoter [31]. The perfect TFBS for a typical yeast TF has information content equivalent to 13.8 bits [32]; this means that in a simplified model of binding where only one of the four nucleotides is a good match at each site, ∼7 bp are recognized as an optimal consensus binding site. Maerkl & Quake [33] reported that the TFBSs of two yeast TFs, Pho4p and Cbf1p, can have up to 2 mismatched sites within their 6 bp consensus binding sequence, while still binding the TF above background levels. Our model therefore tracks TFBSs with up to 2 mismatches. This low information content implies a higher density of TFBSs within our cis-regulatory regions than our algorithm was able to handle, so we instead assigned each TF an 8-bp consensus sequence. Two TFs cannot simultaneously occupy overlapping stretches, which we assume extend beyond the recognition sequence to occupy a total of 14 bp [34]; this captures competitive binding. Hindrance between TFBSs is shown in Fig 3A; TFs are assumed to work in both orientations [35].
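As a minimal sketch of how TFBSs with up to 2 mismatches might be located, the following scans a cis-regulatory sequence on both strands for windows within a given Hamming distance of an 8-bp consensus. The function names are hypothetical, and the sketch ignores the 14-bp steric footprint and any edge effects; it is not the algorithm used in the published C source.

```python
# Translation table for complementing a DNA string
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_binding_sites(cis_reg, consensus, max_mismatches=2):
    """List (start, strand, mismatches) for every window that matches
    the consensus, in either orientation, with <= max_mismatches.

    Assumption: a mismatch is a simple per-position difference
    (Hamming distance); hindrance between sites is not modeled here.
    """
    k = len(consensus)
    sites = []
    for strand, target in (("+", consensus),
                           ("-", reverse_complement(consensus))):
        for start in range(len(cis_reg) - k + 1):
            window = cis_reg[start:start + k]
            mismatches = sum(a != b for a, b in zip(window, target))
            if mismatches <= max_mismatches:
                sites.append((start, strand, mismatches))
    return sites


sites = find_binding_sites("TTTTACGTACGATTTT", "ACGTACGA")
```

In this small example the perfect site at position 4 is reported on the forward strand; because the consensus is nearly palindromic, the same position can also be reported as a weaker site in the opposite orientation.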
Sites with m>3 mismatches are assumed to still bind at a background rate equal to m=3 mismatches, with dissociation constant Kd(3) = 10^-5 M [33] for all TFs. We assume that each of the last three bp makes an equal and independent additive contribution ΔGbp < 0 to the binding energy [36]: although not always true, this approximates average behavior well [33]. We ignore cooperativity in binding. Dissociation constants of eukaryotic TFs for perfect TFBSs can range from 10^-5 M [37] to 10^-11 M [38]. We initialize each TF with its own value of log10(Kd(0)) sampled from a uniform distribution between −6 and −9, with mutation capable of further expanding this range, subject to Kd(0) < 10^-5 M. For m ≤ 3 mismatches, the dissociation constant is
$$K_d(m) = \exp\left(\frac{\Delta G_0 + (3 - m)\,\Delta G_{bp}}{RT}\right),$$
where ΔG0 is the binding energy of a site with three mismatches. Substituting m=0 and m=3 into this expression, we can solve for ΔGbp and ΔG0, and thus obtain Kd(1) and Kd(2).
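Because each matching base pair is assumed to contribute equally and additively to the binding energy, log Kd changes linearly with the number of mismatches between m=0 and m=3. A minimal sketch of that interpolation follows; the function name is hypothetical, and the example value Kd(0) = 10^-8 M is simply one point within the initialization range given above.

```python
import math

KD_BACKGROUND_M = 1e-5  # Kd(3), the background dissociation constant from the text


def kd_for_mismatches(kd_perfect, m, kd_background=KD_BACKGROUND_M):
    """Dissociation constant (M) of a site with m mismatches.

    Assumes equal, additive energy contributions, so that log10(Kd)
    is linear in m between m = 0 (perfect site) and m = 3 (background).
    Sites with m >= 3 bind at the background level.
    """
    if m < 0:
        raise ValueError("m must be non-negative")
    if m >= 3:
        return kd_background
    log_perfect = math.log10(kd_perfect)
    log_background = math.log10(kd_background)
    return 10 ** (log_perfect + (m / 3) * (log_background - log_perfect))


# Example: a TF whose perfect site has Kd(0) = 1e-8 M
kd_values = [kd_for_mismatches(1e-8, m) for m in range(4)]
```

For this example, each additional mismatch weakens binding by one order of magnitude, since three mismatches span the three orders of magnitude between Kd(0) and Kd(3).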
Because TFs bind non-specifically to DNA at a high background rate, each nucleosome-free stretch of 14 bp can be considered to be a non-specific binding site (NSBS). A haploid S. cerevisiae genome is 12 Mb, 80% of which is wrapped in nucleosomes [39], yielding approximately 10^6 potential NSBSs. In a yeast nucleus of volume 3×10^-15 liters, the NSBS concentration is of order 10^-4 M. To find the concentration of free TF [TF] in the nucleus given a total TF concentration of CTF, we consider the dissociation equilibrium Kd(3) = [TF][NSBS] / [TF·NSBS] in the context of NSBSs, substitute [TF·NSBS] with CTF - [TF], and solve for
$$\frac{[TF]}{C_{TF}} = \frac{K_d(3)}{K_d(3) + [NSBS]} \approx 0.1.$$
Thus, about 90% of total TFs are bound non-specifically, leaving about 10% free. The relatively small number of specific TFBSs is not enough to significantly perturb the proportion of free TFs, and so for the specific TFBSs with m<3 that are of interest in our model, we simply use Kd*(m) = 10Kd(m) to account for the reduction in the amount of available TF due to non-specific binding. We also rescale Kd* from moles/liter to the more convenient number of molecules per cell by multiplying by 3×10^-15 liter × 6.02×10^23 molecules/mole = 1.8×10^9 molecules cell^-1 M^-1, for a total multiplication factor of 1.8×10^10 molecules cell^-1 M^-1. If there were only one binding site, it would be bound for a fraction of time Ni / (Ni + Kd*(m)), where Ni is the per-cell number of molecules of TF i and Kd*(m) is expressed in molecules per cell; note that we assume all TF molecules are located in the nucleus.
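The arithmetic in this section can be checked with a short sketch. The constants are those quoted in the text; the function names are hypothetical, and the single-site occupancy uses the standard equilibrium form N / (N + Kd*), which is an assumption about the expression intended here.

```python
AVOGADRO = 6.02e23          # molecules per mole
NUCLEUS_VOLUME_L = 3e-15    # yeast nucleus volume in liters (from the text)
KD_NONSPECIFIC_M = 1e-5     # Kd(3), background binding (from the text)
NSBS_CONC_M = 1e-4          # order-of-magnitude NSBS concentration (from the text)


def free_tf_fraction(kd=KD_NONSPECIFIC_M, nsbs=NSBS_CONC_M):
    """Fraction of TF molecules not bound to non-specific sites."""
    return kd / (kd + nsbs)


def kd_star_molecules(kd_molar, nonspecific_factor=10.0):
    """Rescale Kd from M to molecules per cell, including the ~10x
    correction for TF sequestered on non-specific sites."""
    return kd_molar * nonspecific_factor * NUCLEUS_VOLUME_L * AVOGADRO


def single_site_occupancy(n_tf, kd_molar):
    """Assumed equilibrium occupancy of one isolated binding site."""
    kd_star = kd_star_molecules(kd_molar)
    return n_tf / (n_tf + kd_star)


free = free_tf_fraction()                      # roughly 0.09
occ = single_site_occupancy(1000, 1e-8)        # 1,000 TF molecules per cell
```

For example, a perfect site with Kd(0) = 10^-8 M corresponds to an effective Kd* of about 180 molecules per cell, so a TF present at 1,000 molecules per cell would occupy an isolated site most of the time.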
The transition rates between chromatin states (see section below) are a function of the numbers of activators A and repressors R bound to a cis-regulatory region. Note that in our model, each TF is either always an activator, or always a repressor, independently of binding context. The joint probability distribution of A and R is derived in S1 Text section 1.
Transcriptional regulation
Activation of the effector gene requires at least two TFBSs to be occupied by activators – not necessarily different activators. The requirement for two activators makes the effector gene capable of evolving an AND-gate via a configuration of TFBSs in which the only way to have two TFs bound is for them to be different TFs (Fig 3B). All other genes are AND-gate-incapable, meaning that their activation requires only one TFBS to be occupied by an activator. PA denotes the probability of having at least one activator bound for an AND-gate-incapable gene, or two for an AND-gate-capable gene. PR denotes the probability of having at least one repressor bound.
Noise in yeast gene expression is well described by a two-step process of transcriptional activation [40, 41], e.g. nucleosome disassembly followed by transcription machinery assembly. We denote the three possible states of the transcription start site as Repressed, Intermediate, and Active (Fig 2). Transitions between the states depend on the numbers of activator and repressor TFs bound (e.g. via recruitment of histone-modifying enzymes [42, 43]). We make the rate of conversion from Repressed to Intermediate range, as a function of PA, from the background rate 0.15 min^-1 of histone acetylation [44; presumed to be followed by nucleosome disassembly], to the rate of nucleosome disassembly 0.92 min^-1 for the constitutively active PHO5 promoter [40]:
$$r_{\mathrm{Rep\_to\_Int}} = 0.92\,P_A + 0.15\,(1 - P_A)$$
We make the rate of conversion from Intermediate to Repressed a function of PR, ranging from a background histone de-acetylation rate of 0.67 min^-1 [44], up to 4.11 min^-1, with that maximum chosen so as to keep a similar maximum:basal rate ratio as that of rRep_to_Int:
$$r_{\mathrm{Int\_to\_Rep}} = 4.11\,P_R + 0.67\,(1 - P_R)$$
We assume that repressors disrupt the assembly of transcription machinery [45] to such a degree that conversion from Intermediate to Active does not occur if even a single repressor is bound. In the absence of repressors, activators facilitate the assembly of transcription machinery [46]. Brown et al. [40] reported that the rate of transcription machinery assembly is 3.3 min^-1 for a constitutively active PHO5 promoter, and 0.025 min^-1 when the Pho4 activator of the PHO5 promoter is knocked out. We use this range to set
$$r_{\mathrm{Int\_to\_Act}} = 3.3\,P_{A\_no\_R} + 0.025\,P_{notA\_no\_R},$$
where PA_no_R is the probability of having no repressors and at least one (for an AND-gate-incapable gene) or at least two (for an AND-gate-capable gene) activators bound, and PnotA_no_R is the probability of having no TFs bound (for AND-gate-incapable genes) or having no repressors and not more than one activator bound (for AND-gate-capable genes).
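The three TF-dependent chromatin transition rates can be sketched as follows, assuming that each rate interpolates linearly between the basal and maximal values quoted above as a function of the relevant binding probability. The linear form and the function names are assumptions made for illustration; the binding probabilities themselves would come from the joint distribution derived in S1 Text section 1.

```python
def rate_rep_to_int(p_a):
    """Repressed -> Intermediate rate (per min), from 0.15 (basal) to 0.92."""
    return 0.92 * p_a + 0.15 * (1.0 - p_a)


def rate_int_to_rep(p_r):
    """Intermediate -> Repressed rate (per min), from 0.67 (basal) to 4.11."""
    return 4.11 * p_r + 0.67 * (1.0 - p_r)


def rate_int_to_act(p_a_no_r, p_not_a_no_r):
    """Intermediate -> Active rate (per min).

    p_a_no_r:     probability of enough activators and no repressor bound
    p_not_a_no_r: probability of too few activators and no repressor bound
    Any configuration with a repressor bound contributes zero.
    """
    return 3.3 * p_a_no_r + 0.025 * p_not_a_no_r


# Example binding probabilities (hypothetical values)
rates = {
    "rep_to_int": rate_rep_to_int(0.5),
    "int_to_rep": rate_int_to_rep(0.1),
    "int_to_act": rate_int_to_act(0.45, 0.45),
}
```

Note that the maximum:basal ratios of the first two rates are nearly equal (0.92/0.15 ≈ 4.11/0.67 ≈ 6.1), as the text describes.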
The promoter sequence not only determines which specific TFBSs are present, but also influences non-specific components of the transcriptional machinery [47, 48]. We capture this via gene-specific but TF-binding-independent rates rAct_to_Int with which the machinery disassembles and a burst of transcription ends. In other words, we let TF binding regulate the frequency of “bursts” of transcription, while other properties of the cis-regulatory region regulate their duration. E.g., yeast transcription factor Pho4 regulates the frequency but not duration of bursts of PHO5 expression, by regulating the rates of nucleosome removal and of transition to but not from a transcriptionally active state [40]. We estimate the distribution of rAct_to_Int from the observed rates of mRNA production of 255 yeast genes [49] that are likely to have similarly low nucleosome occupancy [50] and thus are constitutively open to expression (see S1 Text section 2 for details and also for the bounds of rAct_to_Int). For modeling simplicity, we assume that the core promoter sequence responsible for the value of rAct_to_Int is distinct from the 150-bp sequences in which our TFBSs are found.
mRNA and protein dynamics
Once in the Active state, a gene initiates new transcripts stochastically at rate rmax_transc_init = 6.75 mRNA/min [40]. There is a delay before transcription is completed, of duration 1 + L / 600 minutes, where L is the length of the ORF in codons (see S1 Text section 3).
We model a second delay between the completion of a transcript and the production of the first protein from it. The delay comes from a combination of translation initiation and elongation; it ends when the mRNA is fully loaded with ribosomes all the way through to the stop codon and the first protein is produced. We ignore the time required for mRNA splicing; introns are rare in yeast [51]. mRNA transportation from nucleus to cytosol, which is likely diffusion-limited [52, 53], is fast even in mammalian cells [54] let alone much smaller yeast cells, and the time it takes is also ignored. The median time in yeast for initiating translation is 0.5 minute [Table 1 in 55], and the genomic average peptide elongation rate is 330 codon/min [55]. After an mRNA is produced, we therefore wait for 0.5 + L / 330 minutes, and then model protein production as continuous at a gene-specific rate rprotein_syn (see S1 Text section 4 for details of rprotein_syn).
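As a worked example of the two fixed delays, the following sketch computes the time from transcription initiation to the first protein for a gene of a given ORF length. The function names and the example length of 330 codons are hypothetical.

```python
def transcription_delay_min(orf_codons):
    """Delay (min) between transcription initiation and a completed mRNA."""
    return 1.0 + orf_codons / 600.0


def translation_delay_min(orf_codons):
    """Delay (min) between a completed mRNA and the first protein:
    0.5 min to initiate translation plus elongation at 330 codons/min."""
    return 0.5 + orf_codons / 330.0


def total_expression_delay_min(orf_codons):
    """Total fixed delay from initiation to first protein."""
    return transcription_delay_min(orf_codons) + translation_delay_min(orf_codons)


delay = total_expression_delay_min(330)  # 1.55 + 1.5 = 3.05 minutes
```

Longer genes therefore respond more slowly, which is one way that gene length can tune the speed of a regulatory pathway.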
Protein transport into the nucleus is rapid [56] and is approximated as instantaneous and complete, so that the newly produced protein molecules immediately increase the probability of TF binding. Each gene has its own mRNA and protein decay rates, initialized from distributions taken from data (see S1 Text section 5).
All the rates regarding transcription and translation are listed in Table 1, including distributions estimated from data, and hard bounds imposed to prevent unrealistic values arising during evolution.
Developmental simulation
Our algorithm is part-stochastic, part-deterministic. We use a Gillespie algorithm [57] to simulate stochastic transitions between Repressed, Intermediate, and Active chromatin states, and to simulate transcription initiation and mRNA decay events. Fixed (i.e. deterministic) delay times are simulated between transcription initiation and completion, and between transcript completion and the production of the first protein. Protein production and degradation are described deterministically with ODEs, and updated frequently in order to recalculate TF concentrations and hence chromatin transition rates. We initialize developmental simulations with no mRNA or protein (except for the signal), and all genes in the Repressed state. Details of our simulation algorithm are given in S1 Text section 6.
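The stochastic part of the algorithm can be sketched for a single gene as follows. This is a simplified illustration, not the published implementation: it holds the transition rates constant, whereas the full model recalculates them as TF concentrations change, and it omits mRNA decay, delays, and protein dynamics. The function name and the example value of rAct_to_Int are hypothetical.

```python
import random


def simulate_promoter(rates, init_rate, t_end, seed=0):
    """Gillespie simulation of one promoter with three chromatin states.

    rates: dict with keys 'rep_to_int', 'int_to_rep', 'int_to_act',
           'act_to_int' (all per minute, assumed constant in this sketch)
    init_rate: transcription initiation rate in the Active state (per min)
    Returns the list of transcription initiation times (minutes).
    """
    rng = random.Random(seed)
    state = "Repressed"   # genes start in the Repressed state
    t = 0.0
    initiations = []
    while True:
        # Enumerate the events possible from the current state
        if state == "Repressed":
            events = [("Intermediate", rates["rep_to_int"])]
        elif state == "Intermediate":
            events = [("Repressed", rates["int_to_rep"]),
                      ("Active", rates["int_to_act"])]
        else:  # Active
            events = [("Intermediate", rates["act_to_int"]),
                      ("initiate", init_rate)]
        total = sum(rate for _, rate in events)
        if total <= 0.0:
            break  # no event can ever occur
        # Waiting time to the next event is exponential with the total rate
        t += rng.expovariate(total)
        if t >= t_end:
            break
        # Choose which event fires, in proportion to its rate
        pick = rng.uniform(0.0, total)
        cumulative = 0.0
        for name, rate in events:
            cumulative += rate
            if pick <= cumulative:
                break
        if name == "initiate":
            initiations.append(t)
        else:
            state = name
    return initiations


example_rates = {"rep_to_int": 0.92, "int_to_rep": 0.67,
                 "int_to_act": 3.3, "act_to_int": 1.0}  # act_to_int is assumed
times = simulate_promoter(example_rates, init_rate=6.75, t_end=90.0)
```

Because initiation events occur only while the promoter is Active, the resulting initiation times cluster into bursts separated by periods in the Repressed and Intermediate states.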
Selection conditions
Filtering out short spurious signals is a special case of signal recognition more generally. In environment 1, expressing the effector is beneficial, and in environment 2 it is deleterious. We select for TRNs that take information from the signal and correctly decide whether to express the effector. In our control condition, the signal is “on” at a constant level when the effector is beneficial in environment 1, and off in environment 2. Fitness is a weighted average across these two environments. In our test condition (Fig 4), the signal is constantly on in environment 1 and briefly on (for the first 10 minutes) in environment 2 – selection is to ignore this short spurious signal. The signal is treated as though it were an activating TF whose concentration is controlled externally, with an “off” concentration of zero and an “on” concentration of 1,000 molecules per cell, which is the typical per-cell number of a yeast TF [58].
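We make fitness quantitative in terms of a “benefit” B(t) as a function of the amount of effector protein Ne(t) at developmental time t. Our motivation is the scenario in which the effector protein directs resources from metabolic program I to II. When program II produces benefits,
$$B(t) = \begin{cases} b_{max}\,\dfrac{N_e(t)}{N_{e\_sat}}, & N_e(t) < N_{e\_sat} \\ b_{max}, & N_e(t) \ge N_{e\_sat} \end{cases}$$
where bmax is the maximum benefit if all resources were redirected to program II, and Ne_sat is the minimum amount of effector protein needed to achieve this. Similarly, when program I is beneficial,
$$B(t) = \begin{cases} b_{max} - b_{max}\,\dfrac{N_e(t)}{N_{e\_sat}}, & N_e(t) < N_{e\_sat} \\ 0, & N_e(t) \ge N_{e\_sat} \end{cases}$$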
We set Ne_sat to 10,000 molecules, which is about the average molecule number of a metabolism-associated protein per cell in yeast [58]. Without loss of generality given that fitness is relative, we set bmax to 1.
A second contribution to fitness comes from the cost of gene expression C(t) (Fig 2, bottom center). We make this cost proportional to the total protein production rate. We estimate a fitness cost of gene expression of 2×10^-6 per protein molecule translated per minute, based on the cost of expressing a non-toxic protein in yeast [59; see S1 Text section 7 for details].
We simulate gene expression for 90 minutes of developmental time (Fig 4), and calculate “cellular fitness” in a given environment as the average instantaneous fitness (B(t)-C(t)) over these 90 minutes. We consider environment 2 to be twice as common as environment 1 (a “signal” should be for an uncommon event rather than the default), and take the appropriate weighted average.
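The fitness calculation described in this section can be sketched as follows. The benefit is assumed to rise linearly with effector amount until it saturates at Ne_sat, which is one reading of the definitions of bmax and Ne_sat above; the function names and the sampled traces in the example are hypothetical.

```python
N_E_SAT = 10_000       # effector molecules needed for the full benefit (from the text)
B_MAX = 1.0            # maximum benefit (from the text)
COST_PER_PROTEIN = 2e-6  # fitness cost per protein molecule translated per minute


def benefit(n_effector, effector_beneficial):
    """Instantaneous benefit B(t); assumed linear up to saturation."""
    fraction = min(n_effector / N_E_SAT, 1.0)
    if effector_beneficial:
        return B_MAX * fraction
    return B_MAX * (1.0 - fraction)


def cellular_fitness(effector_trace, production_trace, effector_beneficial):
    """Average of B(t) - C(t) over equally spaced time points.

    effector_trace:   effector molecule numbers over developmental time
    production_trace: total protein production rate (molecules/min)
    """
    instantaneous = [
        benefit(n, effector_beneficial) - COST_PER_PROTEIN * rate
        for n, rate in zip(effector_trace, production_trace)
    ]
    return sum(instantaneous) / len(instantaneous)


def overall_fitness(fitness_env1, fitness_env2):
    """Weighted average with environment 2 twice as common as environment 1."""
    return (fitness_env1 + 2.0 * fitness_env2) / 3.0


f1 = cellular_fitness([10_000] * 3, [1_000] * 3, effector_beneficial=True)
f2 = cellular_fitness([0] * 3, [0] * 3, effector_beneficial=False)
fitness = overall_fitness(f1, f2)
```

In this toy example, a genotype that fully expresses the effector in environment 1 and keeps it off in environment 2 approaches the maximum fitness of 1, reduced only by the cost of expression.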
Evolutionary simulation
We simulate a novel version of origin-fixation (weak-mutation-strong-selection) evolutionary dynamics, i.e. the population contains only one resident genotype at any time, and mutant genotypes are either rejected or chosen to be the next resident. Estimators of genotype fitness are averaged over 200 developmental replicates per environment in the case of the mutant, plus an additional 800 should it be chosen to be the next resident. The mutant replaces the resident if
This differs from Kimura’s [60] equation for fixation probability, but captures the same flavor; due to stochasticity in the fitness estimates, fixation probability is a monotonic function of the true difference in fitness. Note that it is possible, especially at the beginning of an evolutionary simulation, for relative fitness to be paradoxically negative. In this rare case, for simplicity, we use the absolute value of the resident's estimated fitness in the denominator.
If 2000 successive mutants are all rejected, the simulation is terminated; upon inspection, we found that these resident genotypes had evolved to not express the effector in either environment. We refer to each change in resident genotype as an evolutionary step. We stop the simulation after 50,000 evolutionary steps; at this time, most replicate simulations seem to have reached a fitness plateau (S2 Fig); we use all replicates except those terminated early. To reduce the frequency of early termination in the case where the signal was not allowed to directly regulate the effector, we used a burn-in phase selecting on a more accessible intermediate phenotype (see S1 Text section 9). In this case, burn-in occurred for 1000 evolutionary steps, followed by the usual 50,000 evolutionary steps with selection for the phenotype of interest (S2 Fig).
Genotype Initialization
We initialize genotypes with 3 activator genes, 3 repressor genes, and 1 effector gene. Cis-regulatory sequences and consensus binding sequences contain As, Cs, Gs, and Ts sampled with equal probability. Rate constants associated with the expression of each gene are sampled from the distributions described above and summarized in Table 1.
Mutation
A genotype is subjected to 5 broad classes of mutation, at rates summarized in Table 2 and justified in S1 Text section 8. First are single nucleotide substitutions in the cis-regulatory sequence; the resident nucleotide mutates into one of the other three types of nucleotides with equal probability. Second are single nucleotide changes to the consensus binding sequence of a TF, with the resident nucleotide mutated into one of the other three types at equal probability. Both of these can affect the number and strength of TFBSs. Third are gene duplications and deletions.
Fourth are mutations to gene-specific expression parameters. Most of these (L, rAct_to_Int, rprotein_syn, rmRNA_deg, and rprotein_deg) apply to both TFs and effector genes, while mutations to the gene-specific values of Kd(0) apply only to TFs. Each mutation to L increases or decreases it by 1 codon, with equal probability unless L is at the upper or lower bound. Effect sizes of mutations to the other five parameters are modeled in such a way that mutation would maintain log-normal stationary distributions for these values, in the absence of selection or arbitrary bounds (see S1 Text section 8 for details). Upper and lower bounds (S1 Text section 8) are used to ensure that selection never drives these parameters to unrealistic values.
Fifth is conversion of a TF from being an activator to being a repressor, and vice versa. The signal is always an activator, and does not evolve.
Importantly, this scheme allows for divergence following gene duplication. When duplicates differ due only to mutations of class 4, i.e. protein function is unchanged, we refer to them as “copies” of the same gene, encoding “protein variants”. Mutations in classes 2 and 5 can create a new protein.
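The single nucleotide substitutions of mutation classes 1 and 2 can be sketched as below, with the resident nucleotide replaced by one of the other three with equal probability. The function name is hypothetical, and the choice of which position mutates (governed by the rates in Table 2) is left to the caller.

```python
import random

NUCLEOTIDES = "ACGT"


def substitute_nucleotide(sequence, position, rng):
    """Return a copy of `sequence` in which the nucleotide at `position`
    is replaced by one of the other three, chosen uniformly at random."""
    resident = sequence[position]
    alternatives = [n for n in NUCLEOTIDES if n != resident]
    return sequence[:position] + rng.choice(alternatives) + sequence[position + 1:]


rng = random.Random(42)
consensus = "ACGTACGT"
mutant = substitute_nucleotide(consensus, 3, rng)
```

Applied to a cis-regulatory sequence, such a substitution can create, strengthen, weaken, or destroy a TFBS; applied to a consensus sequence, it changes every site that the TF recognizes.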
Results
Functional AND-gated C1-FFLs evolve readily under selection for filtering out a short spurious signal
We begin by simulating the easiest case we can devise to allow the evolution of C1-FFLs for their purported function of filtering out short spurious signals. The signal is allowed to act directly on the AND-gate-capable effector, so all that needs to evolve is a single activating TF between the two, as well as AND-logic for the effector. We score motifs at the end of a set number of generations (see Methods). Evolved C1-FFLs are scored and classified into subtypes based on the presence of non-overlapping TFBSs (Fig 3B). The important subtype comparison for our purposes is between the AND-gated C1-FFL and the next three non-AND-gated C1-FFL types combined (OR-gated, signal-controlled, and slow-TF-controlled); the remaining three logic subtypes are vanishingly rare. The adaptive hypothesis predicts the evolution of the subtype with AND-regulatory logic, which requires the effector to be stimulated both by the signal and by the slow TF. While all replicates show large increases in fitness, a multimodal distribution of final fitness states is observed, indicating whether or not the replicate was successful at evolving the phenotype of interest rather than becoming stuck at an alternative locally optimal phenotype (Fig 5A). AND-gated C1-FFLs frequently evolve in the high fitness outcomes, but not the low fitness outcomes (Fig 5B).
We also see C1-FFLs that, contrary to expectations, are not AND-gated; while found primarily in the low fitness replicates, some are also in the high fitness genotypes (Fig 5B). However, this is based on scoring motifs and their logic gates on the basis of all TFBSs, even those with two mismatches and hence low binding affinity. Unless these weak TFBSs are deleterious, they will appear quite often by chance alone. A random 8-bp sequence has probability $\binom{8}{2} \times 0.25^{6} \times 0.75^{2} \approx 0.0038$ of being a two-mismatch binding site for a given TF. In our model, a TF has the potential to recognize 137 different sites in a 150-bp cis-regulatory sequence (taking into account steric hindrance at the edges), each with 2 orientations. Thus, by chance alone a given TF will have 0.0038 × 137 × 2 ≈ 1 two-mismatch binding sites in a given cis-regulatory sequence (ignoring palindromes for simplicity), compared to only ∼0.1 one-mismatch TFBSs. Excluding two-mismatch TFBSs when scoring motifs significantly reduces the non-AND-gated C1-FFLs, while only modestly reducing the observed frequency of adaptively evolved AND-gated C1-FFLs in the high fitness mode (Fig 5C).
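The expected numbers of chance TFBSs quoted above follow from a binomial calculation, sketched here. The function names are hypothetical, and the count of 137 candidate positions is obtained by assuming that the full 14-bp footprint must fit within the 150-bp region, which is our reading of the edge effect mentioned in the text.

```python
from math import comb


def p_site_with_mismatches(m, consensus_length=8):
    """Probability that a random sequence matches a consensus with exactly
    m mismatches, assuming the four nucleotides are equally likely."""
    matches = consensus_length - m
    return comb(consensus_length, m) * 0.25 ** matches * 0.75 ** m


def expected_chance_sites(m, region_length=150, footprint=14, orientations=2):
    """Expected number of m-mismatch sites for one TF in one
    cis-regulatory region, ignoring palindromes."""
    positions = region_length - footprint + 1  # 137 for the values in the text
    return p_site_with_mismatches(m) * positions * orientations


two_mismatch = expected_chance_sites(2)  # close to 1
one_mismatch = expected_chance_sites(1)  # close to 0.1
```

Two-mismatch sites are thus roughly ten times more likely to arise by chance than one-mismatch sites, which is why including them when scoring motifs inflates the apparent frequency of non-AND-gated C1-FFLs.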
To confirm the functionality of these AND-gated C1-FFLs, we mutated the evolved genotype in two different ways (Fig 6A) to remove the AND regulatory logic. As expected, this lowers fitness in the presence of the short spurious signal but increases fitness in the presence of constant signal, with a net reduction in fitness (Fig 6B). This is consistent with AND-gated C1-FFLs representing a tradeoff, by which a more rapid response to a true signal is sacrificed in favor of the greater reliability of filtering out short spurious signals.
To test the extent to which C1-FFLs can evolve non-adaptively, we simulated evolution under three negative control conditions: 1) neutrality, i.e. all mutations are accepted to become the new resident genotype; 2) no spurious signal, i.e. the effector should be expressed under a constant “ON” signal and not under a constant “OFF” signal; 3) harmless spurious signal, i.e. the effector should be expressed under a constant “ON” environment whereas effector expression in the “OFF” environment with short spurious signals is neither punished nor rewarded beyond the cost of unnecessary gene expression. AND-gated C1-FFLs evolve much less often under all three negative control conditions (Fig 7). Non-AND-gated C1-FFLs do evolve under the negative control conditions (Fig 7A), but disappear when weak TFBSs are excluded during motif scoring (Fig 7B).
Diamond motifs are an alternative adaptation in more complex networks
Sometimes the source signal will not be able to directly regulate an effector, and must instead operate via a longer regulatory pathway involving intermediate TFs [61]. In this case, even if the signal itself takes the idealized form shown in Fig 4, its shape after propagation may become distorted by the intrinsic processes of transcription. Motifs are under selection to handle this distortion.
To enforce indirect regulation, we ran simulations in which the signal was not allowed to bind to the cis-regulatory sequence of effector genes. The fitness distribution of the evolutionary replicates has only one mode (S4 Fig), so we compared the highest fitness, lowest fitness, and median fitness replicates. In agreement with results when direct regulation is allowed, genotypes of low and medium fitness contain few AND-gated C1-FFLs, while high fitness genotypes contain many (Fig 8A, left).
While visually examining the network context of these C1-FFLs, we discovered that many were embedded within AND-gated “diamonds” to form “FFL-in-diamonds” (Fig 8A right). This led us to discover that AND-gated diamonds also occurred frequently without AND-gated C1-FFLs to form “isolated diamonds” (Fig 8A middle). Note that it is in theory possible, but in practice uncommon, for diamonds to be part of more complex conjugates. Systematically scoring the AND-gated isolated diamond motif confirmed its high occurrence (Fig 8B middle).
An AND-gated C1-FFL integrates information from a short/fast regulatory pathway with information from a long/slow pathway, in order to filter out short spurious signals. A diamond achieves the same end of integrating fast and slow transmitted information via differences in the gene expression dynamics of the two regulatory pathways, rather than via topological length (Fig 9).
Note that a simple transcriptional cascade, signal -> TF -> effector, has also been found experimentally to filter out short spurious signals, e.g. when the intermediate TF is rapidly degraded, dampening the effect of a brief signal [62]. Two such transcriptional cascades involving different intermediate TFs form a diamond, so the utility of a single cascade is a potential explanation for the high prevalence of double-cascade diamonds. However, in this case we would have no reason to expect marked differences in expression dynamics between the two TFs, as illustrated in Fig 9. We will also see below that AND-gates evolve between the two cascades.
Weak TFBSs make motif scoring more difficult
Results depend on whether we include weak TFBSs when scoring motifs. Weak TFBSs can either be in the effector’s cis-regulatory region, affecting how the regulatory logic is scored, or upstream, affecting only the presence or absence of motifs. If a motif is scored as AND-gated only when two-mismatch TFBSs in the effector are excluded, we call it a “near-AND-gated” motif. Recall from Fig 3B that effector expression requires two TFs to be bound, with only one TFBS of each type creating an AND-gate. When a second, two-mismatch TFBS of the same type is present, we have a near-AND-gate. TFs may bind so rarely to this weak-affinity TFBS that its presence changes little, making the regulatory logic still effectively AND-gated. A near-AND-gated motif may therefore evolve for the same adaptive reasons as an AND-gated one. Fig 8B and 8C show that both AND-gated and near-AND-gated motifs are enriched in the high-fitness genotypes.
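The distinction between AND-gated and near-AND-gated logic can be sketched as a small scoring function. This is an illustrative reading of the definitions above, not the scoring code used in this work; the function name `score_gate`, the `(tf_name, mismatches)` representation, and the restriction to exactly two regulating TFs are assumptions made for the example.

```python
from collections import Counter


def score_gate(effector_tfbs):
    """Classify the effector's regulatory logic (hypothetical helper).

    effector_tfbs: list of (tf_name, mismatches) pairs, one per TFBS found
    in the effector's cis-regulatory region.
    """
    def is_and(sites):
        # AND-gate: exactly two TFs, each with exactly one TFBS.
        counts = Counter(tf for tf, _ in sites)
        return len(counts) == 2 and all(n == 1 for n in counts.values())

    if is_and(effector_tfbs):
        return "AND"
    # Re-score after dropping two-mismatch (weak) TFBSs.
    strong_only = [(tf, mm) for tf, mm in effector_tfbs if mm < 2]
    if is_and(strong_only):
        return "near-AND"
    return "other"


print(score_gate([("fast", 0), ("slow", 1)]))               # AND
print(score_gate([("fast", 0), ("fast", 2), ("slow", 0)]))  # near-AND
print(score_gate([("fast", 0), ("fast", 1), ("slow", 0)]))  # other
```

The second call shows the case described in the text: an extra two-mismatch TFBS for one TF prevents the motif from being scored as AND-gated until weak sites are excluded.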
When we exclude upstream weak TFBSs while scoring motifs, FFL-in-diamonds are no longer found, while the occurrence of isolated C1-FFLs and diamonds increases (Fig 8C). This makes sense, because adding one weak TFBS, which can easily happen by chance alone, can convert an isolated diamond or C1-FFL into a FFL-in-diamond (added between intermediate TFs, or from signal to slow TF, respectively).
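How a single weak upstream TFBS changes the motif call can be sketched as follows. The node labels (`S` for the signal, `X` for the fast TF, `Y` for the slow TF, `E` for the effector) and the topology definitions are assumptions based on the parenthetical description above, and `classify_motif` is a hypothetical helper rather than the scoring algorithm used in this work.

```python
def classify_motif(links, include_weak_upstream=True):
    """Classify a four-node network under indirect regulation (toy sketch).

    links: dict mapping (regulator, target) -> mismatch count of the TFBS.
    Weak (two-mismatch) links are dropped only when they are upstream,
    i.e. when the target is not the effector "E".
    """
    present = {
        edge for edge, mm in links.items()
        if include_weak_upstream or mm < 2 or edge[1] == "E"
    }
    core = {("S", "X"), ("X", "E"), ("Y", "E")}
    if not core <= present:
        return "none"
    signal_to_slow = ("S", "Y") in present
    fast_to_slow = ("X", "Y") in present
    if signal_to_slow and fast_to_slow:
        return "FFL-in-diamond"
    if signal_to_slow:
        return "isolated diamond"
    if fast_to_slow:
        return "isolated C1-FFL"
    return "none"


# A diamond plus one weak link between the intermediate TFs.
diamond_plus_weak = {
    ("S", "X"): 0, ("S", "Y"): 0, ("X", "E"): 0, ("Y", "E"): 1,
    ("X", "Y"): 2,
}
print(classify_motif(diamond_plus_weak, include_weak_upstream=True))   # FFL-in-diamond
print(classify_motif(diamond_plus_weak, include_weak_upstream=False))  # isolated diamond
```

The same network is scored as an FFL-in-diamond or as an isolated diamond depending only on whether the two-mismatch link is counted, which mirrors the shift between Fig 8B and Fig 8C described above.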
AND-gated isolated C1-FFLs appear mainly in the highest fitness outcomes, while AND-gated isolated diamonds appear in all fitness groups (Fig 8C), suggesting that diamonds are easier to evolve. 18 out of 30 high-fitness evolutionary replicates are scored as having a putatively adaptive AND-gated or near-AND-gated motif in at least 50% of their evolutionary steps when upstream weak TFBSs are ignored (close to addition of bars in Fig 8C, because these two AND-gated motifs rarely coexist in a high-fitness genotype). The remaining 12 have more complex arrangements of weak TFBSs that mimic a single strong one.
Just as for the AND-gated C1-FFLs evolved under direct regulation and analyzed in Fig 6, perturbation analysis supports an adaptive function for AND-gated C1-FFLs and diamonds evolved under indirect regulation (Fig 10A.i, 10B.i). Breaking the AND-gate logic of these motifs by adding a (strong) TFBS to the effector cis-regulatory region reduces the fitness under the spurious signal but increases it under the constant “ON” beneficial signal, resulting in a net decrease in the overall fitness.
If we add a two-mismatch TFBS instead, this converts an AND-gated motif to a near-AND-gated motif. This lowers fitness only when the extra link is from the slow TF to the effector, and not when the extra link is from the fast TF to the effector (Fig 10B.ii, 10C.ii). Indeed, these extra links are tolerated during evolution too: if we take the 7 high-fitness replicates that contain a near-AND-gated C1-FFL in at least 5% of the evolutionary steps, in all 7 cases this motif is near-AND-gated rather than AND-gated because of an extra weak TFBS for the fast TF, while this is never due to a weak TFBS for the slow TF in C1-FFLs. Similarly, out of the 20 high-fitness replicates that contain a near-AND-gated diamond, 11 cases are primarily because of an extra weak TFBS of the fast TF, 9 cases (all of them OR-gated) are because of weak TFBSs for both TFs, and no cases are primarily due to an extra TFBS for the slow TF. By chance alone, fast and slow TF should be equally likely to contribute the weak TFBS that makes a motif near-AND-gated rather than AND-gated. This non-random occurrence of weak TFBSs creating near-AND-gates illustrates how even weak TFBSs can be shaped by selection against some (but not all) motif-breaking links.
AND-gated isolated diamonds also evolve in the absence of external spurious signals
We simulated evolution under the same three control conditions as before, this time without allowing the signal to directly regulate the effector. In the “no spurious signal” and “harmless spurious signal” control conditions, motif frequencies are similar between low and high fitness genotypes (S5 Fig, S6 Fig), and so our analysis includes all evolutionary replicates. When weak (two-mismatch) TFBSs are excluded, AND-gated isolated C1-FFLs are seen only after selection for filtering out a spurious signal, and not under other selection conditions (Fig 11A). However, AND-gated isolated diamonds also evolve in the absence of spurious signals, indeed at even higher frequency (Fig 11B). Results including weak TFBSs are similar (S7 Fig).
Perturbing the AND-gate logic in these isolated diamonds reduces fitness via effects in the environment where expressing the effector is deleterious (Fig 10B.iii). Even in the absence of external short spurious signals, the stochastic expression of intermediate TFs might effectively create short spurious signals when the external signal is set to “OFF”. It seems that AND-gated diamonds evolve to mitigate this risk, but that AND-gated C1-FFLs do not. The duration of internally generated spurious signals has an exponential distribution, which means that the optimal filter would be one that does not delay gene expression [63]. The two TFs in an AND-gated diamond can be activated simultaneously, but they must be activated sequentially in an AND-gated C1-FFL; the shorter delays possible with AND-gated diamonds might explain why only diamonds and not FFLs evolve to filter out intrinsic noise in gene expression.
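One way to see why the exponential distribution matters is its memoryless property. The notation below (duration T of an internally generated spurious signal, rate λ, elapsed time s) is introduced here for illustration and is not part of the model; the statement is offered as intuition for the cited result [63], not as a derivation of it.

```latex
P(T > s + t \mid T > s)
  = \frac{e^{-\lambda (s + t)}}{e^{-\lambda s}}
  = e^{-\lambda t}
  = P(T > t)
```

Under this reading, a signal that has already persisted for time s is expected to last exactly as much longer as one that has only just begun, so waiting out a delay does not by itself reveal that a signal is long-lived.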
Discussion
There has never been sufficient evidence to satisfy evolutionary biologists that motifs in TRNs represent adaptations for particular functions. Critiques by evolutionary biologists to this effect [13-23] have been neglected, rather than answered, until now. While C1-FFLs can be conserved across different species [64-67], this does not imply that specific “just-so” stories about their function are correct. In this work, we study the evolution of AND-gated C1-FFLs, which are hypothesized to be adaptations for filtering out short spurious signals [3]. Using a novel and more mechanistic computational model to simulate TRN evolution, we found that AND-gated C1-FFLs evolve readily under selection for filtering out a short spurious signal, and not under control conditions. Our results support the adaptive hypothesis about C1-FFLs.
Previous studies have also attempted to evolve adaptive motifs in a computational TRN, successfully under selection for circadian rhythm and for multiple steady states [68], and unsuccessfully under selection to produce a sine wave in response to a periodic pulse [23]. Our successful simulation might offer some methodological lessons, especially a focus on high-fitness evolutionary replicates, which was done by us and by Burda et al. [68] but not by Knabe et al. [23]. Knabe et al. [23] suggested that including a cost for gene expression may suppress unnecessary links and promote motifs. However, we found AND-gated C1-FFLs still evolve in the high-fitness genotypes under selection for filtering out a spurious signal, even when there is no cost of gene expression (S8 Fig).
AND-gated C1-FFLs express an effector after a noise-filtering delay when the signal is turned on, but shut down expression immediately when the signal is turned off, giving rise to a “sign-sensitive delay” [3, 7]. Rapidly switching off has been hypothesized to be part of their selective advantage, above and beyond the function of filtering out short spurious signals [63]. We selected only for filtering out a short spurious signal, and not for fast turn-off, and found that this was sufficient for the adaptive evolution of AND-gated C1-FFLs.
Most previous research on C1-FFLs has used an idealized implementation (e.g. a square wave) of what a short spurious signal entails [4, 63, 69]. In real networks, noise arises intrinsically in a greater diversity of forms, which our model does more to capture. Even when a “clean” form of noise enters a TRN, it subsequently gets distorted with the addition of intrinsic noise [70]. Intrinsic noise is ubiquitous and dealing with it is an omnipresent challenge for selection. Indeed, we see adaptive diamonds evolve to suppress intrinsic noise, even when we select in the complete absence of extrinsic spurious signals.
Our model, while complex for a model and hence capable of capturing intrinsic noise, is inevitably less complex than the biological reality. However, we hope to have captured key phenomena, albeit in simplified form. E.g., a key phenomenon is that TFBSs are not simply present vs. absent but can be strong or weak, i.e. the TRN is not just a directed graph, but its connections vary in strength. Our model, like that of Burda et al. [68] in the context of circadian rhythms, captures this fact by basing TF binding affinity on the number of mismatch deviations from a consensus TFBS sequence. While in reality, the strength of TF binding is determined by additional factors, such as broader nucleic context and cooperative behavior between TFs (reviewed in Inukai et al. [71]), these complications are unlikely to change the basic dynamics of frequent appearance of weak TFBSs and enhanced mutational accessibility of strong TFBSs from weak ones. Similarly, AND-gating can be quantitative rather than qualitative [72], a phenomenon that weak TFBSs in our model provide a simplified version of. Note that our model, while powerful in some ways, is computationally limited to small TRNs.
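The idea of grading TFBS strength by mismatches from a consensus can be sketched in a few lines. This is not the binding model used in this work: the 8-bp consensus, the example sequences, the single-strand scan, and the helper names `mismatches` and `find_tfbs` are all assumptions chosen for illustration.

```python
def mismatches(site, consensus):
    """Number of positions at which a candidate site differs from the consensus."""
    return sum(a != b for a, b in zip(site, consensus))


def find_tfbs(cis_seq, consensus, max_mismatch=2):
    """Scan one strand of a cis-regulatory sequence for candidate TFBSs.

    Returns (position, mismatch_count) for every window with at most
    max_mismatch deviations from the consensus.
    """
    k = len(consensus)
    hits = []
    for i in range(len(cis_seq) - k + 1):
        mm = mismatches(cis_seq[i:i + k], consensus)
        if mm <= max_mismatch:
            hits.append((i, mm))
    return hits


consensus = "ACGTACGT"  # hypothetical consensus sequence
print(find_tfbs("TTACGTACCAGG", consensus))  # [(2, 2)]  weak, two-mismatch site
print(find_tfbs("TTACGTACGAGG", consensus))  # [(2, 1)]  after one point mutation
```

The two calls differ by a single nucleotide, which illustrates the point about mutational accessibility: a weak two-mismatch site is one point mutation away from a stronger one-mismatch site.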
Core links in adaptive motifs involve strong not weak TFBSs. However, weak (two-mismatch) TFBSs can create additional links that prevent an adaptive motif from being scored as such. Some potential additional links are neutral while others are deleterious; the observed links are thus shaped by this selective filter, without being adaptive. Note that there have been experimental reports that even weak TFBSs can be functionally important [73, 74]; these might, however, better correspond to 1-mismatch TFBSs in our model than two-mismatch TFBSs. Ramos et al. [74] and Crocker et al. [73] identified their “weak” TFBSs in comparison to the strongest possible TFBS, not in comparison to the weakest still showing affinity above baseline.
A striking and unexpected finding of our study was that AND-gated diamonds evolved as an alternative motif for filtering out short spurious external signals, and that these, unlike FFLs, were also effective at filtering out intrinsic noise. Diamonds are not overrepresented in the TRNs of bacteria [2] or yeast [75], but are overrepresented in signaling networks (in which post-translational modification plays a larger role) [76], and in neuronal networks [1]. In our model, we treated the external signal as though it were a transcription factor, simply as a matter of modeling convenience. In reality, signals external to a TRN are by definition not TFs (although they might be modifiers of TFs). This means that our indirect regulation case, in which the signal is not allowed to directly turn on the effector, is the most appropriate one to analyze if our interest is in TRN motifs that mediate contact between the two. Note that if we were to score the signal as not itself a TF, we would observe adaptive C1-FFLs but not diamonds in this case, in agreement with the TRN data. However, these TRN data might miss functional diamond motifs that span levels of regulatory organization, i.e. that include both transcriptional and other forms of regulation. The greatest chance of finding diamonds within TRNs alone comes from complex and multi-layered developmental cascades, rather than from bacteria or yeast [77]. Multiple interwoven diamonds are hypothesized to be embedded within multi-layer perceptrons that are adaptations for complex computation in signaling networks [30].
The function of a motif relies ultimately on its dynamic behavior, with topology merely a means to that end. The C1-FFL motif is based on two pathways between signal and effector, one much faster than the other, which is achieved by making them different lengths. This same function was achieved non-topologically in our adaptively evolved diamond motifs. Multiple motifs have previously been found capable of generating the same steady state expression pattern [21]; here we find multiple motifs for a much more complex function.
It is difficult to distinguish adaptations from “spandrels” [8]. Standard procedure is to look for motifs that are more frequent than expected from some randomized version of a TRN [2, 78]. For this method to work, this randomization must control for all confounding factors that are non-adaptive with respect to the function in question, from patterns of mutation to a general tendency to hierarchy – a near-impossible task. Our approach to a null model is not to randomize, but to evolve with and without selection for the specific function of interest. This meets the standard of evolutionary biology for inferring the adaptive nature of a motif [13-23].
Supporting information
S1 Fig. Examples of evolved phenotypes under selection for filtering out a short spurious signal. The figure shows the average expression of the effector protein over 200 replicate developmental simulations in each of the two environments. A high-fitness phenotype and a low-fitness phenotype, as defined in Fig 5, are shown for comparison. The signal is allowed to directly regulate the effector in these simulations.
S2 Fig. Representative fitness trajectories under selection to filter out short spurious signals. (A) The signal is allowed to directly regulate the effector genes. (B) The signal cannot directly regulate the effector genes. Note the average is weighted, with environment 2 being considered twice as common as environment 1.
S3 Fig. Genotypes evolved under control selective conditions: (A) “harmless spurious signal”, and (B) “no spurious signal”. There is no clear evidence of a multimodal distribution of fitness outcomes among replicates (left), and C1-FFLs occur equally in the 10 genotypes of the highest fitness vs. the 10 genotypes of the lowest fitness (right), and so the entire distribution (left) was used to produce Fig 7. Data are shown as mean±SE over evolutionary replicates.
S4 Fig. Fitness distribution of 115 evolutionary replicates under selection for filtering out short spurious signals, when the signal cannot directly regulate the effector. The fitness of a replicate is the average genotype fitness over the last 10,000 evolutionary steps. Colors indicate replicates analyzed elsewhere.
S5 Fig. Evolution when responding to a spurious signal is harmless, when the signal is not allowed to directly regulate the effector. (A) Fitness distribution of 60 replicate simulations. The occurrence of both (B) FFL-in-diamonds and (C) isolated diamonds was similar in the 10 genotypes with the highest fitness vs. in the 10 genotypes with the lowest fitness. Weak (two-mismatch) TFBSs are included when scoring motifs. Data are shown as mean±SE over replicates. Isolated C1-FFLs rarely evolve under this condition, therefore their occurrence is not plotted.
S6 Fig. Evolution when there is no spurious signal, when the signal is not allowed to directly regulate the effector. (A) Fitness distribution of 50 replicate simulations. The occurrence of both (B) FFL-in-diamonds and (C) isolated diamonds was similar in the 10 genotypes with the highest fitness vs. in the 10 genotypes with the lowest fitness. Weak (two-mismatch) TFBSs are included when scoring motifs. Data are shown as mean±SE over replicates. Isolated C1-FFLs rarely evolve under this condition, therefore their occurrence is not plotted.
S7 Fig. Selection for filtering out a short spurious signal is the primary way to evolve AND-gated C1-FFLs (A), but AND-gated isolated diamonds also evolve in the absence of spurious signals (B). The signal is not allowed to directly regulate the effector, and the right hand sides of (A) and (B) are identical to Fig 11. When scoring motifs, we either include (left) or exclude (right) all two-mismatch TFBSs in the cis-regulatory sequences of intermediate TF genes and effector genes. See S1 Text section 10 for the calculation of y-axis. Data are shown as mean±SE over evolutionary replicates.
S8 Fig. After removing the cost of gene expression, AND-gated C1-FFLs are still associated with a successful response to selection for filtering out a short spurious signal. The signal can directly regulate the effector genes. (A) Distribution of fitness outcomes across 46 replicate simulations. (B) 10 out of 13 replicates with the highest fitness [the 13 replicates are in red in (A)] still evolve AND-gated C1-FFLs. Replicates with the 4th, 6th, and 8th highest fitness evolve the motif shown in (C) rather than AND-gated C1-FFLs. The “high-fitness” group therefore replaces these three replicates with the replicates with the 11th to 13th highest fitness. Bars are mean±SE of the occurrence over replicate evolutionary simulations. 5 replicates [blue in (A)] with the lowest fitness do not contain AND-gated C1-FFLs or the motif in (C). (C) AND-gated C1-FFLs with a long arm. Note that both S and B need to be present to induce the expression of E, therefore this motif can also act as a spurious signal filter.
S1 Text. Additional details of the model and algorithms
Acknowledgements
Work was supported by the University of Arizona and by a Pew Scholarship to JM, John Templeton Foundation grant 39667 to JM and KX, and by National Institutes of Health grants R35GM118170 to MLS and R01GM076041 to JM and AKL. We thank Hinrich Boeger for helpful discussions and careful reading of the manuscript, Jasmin Uribe for early work on this project, and the high-performance computing center at the University of Arizona for generous allocations.