Abstract
Gene expression in all organisms is controlled by cooperative interactions between DNA-bound transcription factors (TFs). However, measuring TF-TF interactions that occur at individual cis-regulatory sequences remains difficult. Here we introduce a strategy for precisely measuring the Gibbs free energy of such interactions in living cells. Our strategy uses reporter assays performed on strategically designed cis-regulatory sequences, together with a biophysical modeling approach we call “expression manifolds”. We applied this strategy in Escherichia coli to interactions between two paradigmatic TFs: CRP and RNA polymerase (RNAP). Doing so, we consistently obtain measurements precise to ~ 0.1 kcal/mol. Unexpectedly, CRP-RNAP interactions are seen to deviate in multiple ways from the prior literature. Moreover, the well-known RNAP binding motif is found to be a surprisingly unreliable predictor of RNAP-DNA binding energy. Our strategy is compatible with massively parallel reporter assays in both prokaryotes and eukaryotes, and should thus be highly scalable and broadly applicable.
Introduction
Cells regulate the expression of their genes in response to biological and environmental cues. A major mechanism of gene regulation in all organisms is the binding of transcription factor (TF) proteins to cis-regulatory elements encoded within genomic DNA. DNA-bound TFs interact with one another, either directly or indirectly, forming cis-regulatory complexes that modulate the rate at which nearby genes are transcribed (Ptashne and Gann, 2002; Courey, 2008). Different arrangements of TF binding sites within cis-regulatory sequences can lead to different regulatory programs, but the rules that govern which arrangements lead to which regulatory programs remain largely unknown. Understanding these rules, which are collectively called “cis-regulatory grammar” (Weingarten-Gabbay and Segal, 2014), is a major challenge in modern biology.
A diverse array of high-throughput technologies has revolutionized our understanding of transcriptional regulation in recent years. It is now possible to map the genome-wide binding sites of transcription factors in vivo (Ren et al., 2000; Johnson et al., 2007), sometimes to nucleotide resolution (Rhee and Pugh, 2011). Large collaborative efforts using such methods have been carried out to comprehensively annotate cis-regulatory elements in model organisms (modENCODE Consortium et al., 2010; Gerstein et al., 2010) and in humans (ENCODE Project Consortium, 2012). Complementing such techniques are high-throughput in vitro methods for characterizing TF binding specificity (Mukherjee et al., 2004; Meng et al., 2005; Berger et al., 2006; Zhao et al., 2009; Jolma et al., 2010; Slattery et al., 2011). These methods have been applied to a large fraction of the TFs in select model organisms (Noyes et al., 2008; Badis et al., 2009) as well as in humans (Jolma et al., 2013). However, neither class of method addresses the critical question of what TFs do once bound to DNA. In particular, there are no systematic methods, either high-throughput or low-throughput, for characterizing the TF-TF interactions that occur within cis-regulatory complexes in living cells.
Measuring the quantitative strength of interactions between DNA-bound TFs is critical for elucidating cis-regulatory grammar. In particular, knowing the Gibbs free energy of TF-TF interactions is essential for building biophysical models (Bintu et al., 2005; Sherman and Cohen, 2012) that can quantitatively explain gene regulation in terms of simple protein-DNA and protein-protein interactions. Biophysical models have proven remarkably successful at quantitatively explaining regulation by a small number of well-studied cis-regulatory sequences. Arguably, the biggest successes have been achieved in the bacterium E. coli, particularly in the context of the lac promoter (Vilar and Leibler, 2003; Kuhlman et al., 2007; Kinney et al., 2010; Garcia and Phillips, 2011; Brewster et al., 2014) and the OR/OL control region of the λ phage lysogen (Ackers et al., 1982; Shea and Ackers, 1985; Cui et al., 2013). But in both cases, the biophysical level of understanding that has been achieved required decades of focused study. New approaches for dissecting cis-regulatory energetics, approaches that are both general and systematic, will be needed before this quantitative level of understanding can be obtained for any cis-regulatory sequence having any arrangement of TF binding sites.
Here we address this need by describing a systematic experimental/modeling strategy for dissecting the biophysical mechanisms of transcriptional regulation in living cells. Our strategy is based on reporter assays and is not a new experimental method per se. Rather, it shows how key biophysical quantities in transcriptional regulation can be measured to high precision by performing relatively simple experiments on strategically chosen cis-regulatory sequences, then analyzing the resulting data appropriately. Our rationale for introducing this strategy is that reporter assays can be readily performed in a wide variety of systems, making this strategy highly flexible and broadly applicable. Moreover, massively parallel reporter assays should allow this strategy to be dramatically scaled up.
Our strategy centers on the measurement and modeling of mathematical objects that we call “expression manifolds.” The underlying idea is to perform multidimensional measurements. If a hypothesized biophysical model is true, these measurements will collapse to a lower-dimensional manifold embedded in this measurement space. If such data collapse is observed, specific values for the parameters of the hypothesized biophysical model can be inferred. On the other hand, if such collapse is not observed, the hypothesized biophysical model can be rejected, indicating that a different biophysical model is needed.
To demonstrate its utility, we applied this strategy to a regulatory paradigm in E. coli: activation of the σ70 RNA polymerase holoenzyme (RNAP) by the cAMP receptor protein (CRP). RNAP is arguably the best understood RNA polymerase in biology (Ruff et al., 2015), and CRP is arguably the best understood transcriptional activator (Busby and Ebright, 1999). CRP activates transcription when bound to DNA at various positions upstream of RNAP by forming favorable interactions with the RNAP α subunit. Such regulation is often described as “class I” or “class II”, depending on the spacing between the RNAP and CRP binding sites. Both classes of interaction are known to depend strongly on the spacing between binding sites, but the in vivo Gibbs free energies of these interactions have been reported for only one such spacing: when the CRP site is centered -61.5 bp relative to the transcription start site (TSS), as occurs at the E. coli lac promoter.
By measuring and modeling expression manifolds, we systematically determined the in vivo Gibbs free energy (ΔG) of CRP-RNAP interactions that occur at a variety of different binding site spacings. These ΔG values were consistently measured to a precision of ~ 0.1 kcal/mol, roughly 2% of the strength of a hydrogen bond. Although our results broadly agree with the prior literature, there are key divergences. We find that class I CRP-RNAP interactions, which occur when CRP is centered upstream of ~ -60.5 bp, are generally much stronger than have been suggested. Moreover, we find that the class II CRP-RNAP interaction that occurs when CRP is centered at -40.5 bp can either activate or repress transcription depending on features of the RNAP binding site that have yet to be understood.
In the course of these experiments we obtained other key biophysical information. First, we were able to distinguish between two qualitatively different mechanisms of transcriptional activation: “stabilization” of RNAP-DNA binding (also called “recruitment” (Ptashne, 2003)) versus “acceleration” of the transcript initiation rate by DNA-bound RNAP. Contrary to prior in vitro studies, we find that in vivo class II activation by CRP at -41.5 bp occurs exclusively through stabilization, not acceleration. Second, we were able to measure the strength with which both CRP and RNAP bind their respective sites. This strength is quantified by the grand canonical potential (denoted here by ΔΨ), which accounts for the ΔG of binding as well as the in vivo concentration of each protein. Importantly, we find that the actual in vivo ΔΨ of RNAP-DNA binding deviates substantially from the predictions of the established RNAP binding motif. This result highlights the perils of assuming simple models for protein-DNA binding energy when modeling the biophysics of transcriptional regulation.
In what follows, we first illustrate this expression manifold strategy in the context of simple repression, which provides a general way to measure the ΔΨ of TF-DNA binding. This strategy is then used to measure the ΔΨ of CRP binding to a near-consensus DNA site that we use in subsequent experiments. Next we show how expression manifolds, inferred from measurements of simple activation, can be used to determine the ΔG of TF-RNAP interactions. This strategy is used to measure CRP-RNAP interactions at a variety of class I and class II positions, and the deviations of these measurements from the prior literature are discussed. Finally, we compare the values of ΔΨ for RNAP-DNA binding, obtained in the course of the above analyses, to the predictions of the RNAP-DNA binding motif from Kinney et al. (2010).
Results
Strategy for measuring TF-DNA interactions in vivo
We begin by showing how expression manifolds can be used to measure the in vivo strength of TF binding to a specific DNA binding site. This measurement is accomplished by using the TF of interest as a transcriptional repressor. We place the TF binding site directly downstream of the RNAP binding site so that the TF, when bound to DNA, sterically occludes the binding of RNAP. We then measure the rate of transcription from a few dozen variant RNAP binding sites. Transcription from each variant site is assayed in both the presence and the absence of the TF.
Figure 1A illustrates a thermodynamic model (Bintu et al., 2005; Sherman and Cohen, 2012) for this type of simple repression. In this model, promoter DNA can be in one of three states: unbound, bound by the TF, or bound by RNAP. These three states are assumed to occur with relative frequencies consistent with thermal equilibrium, i.e., each with a probability proportional to its Boltzmann weight.
The energetics of protein-DNA binding determine the Boltzmann weight for each state. By convention we set the weight of the unbound state equal to 1. The weight of the TF-bound state is then given by F = [TF]KF where [TF] is the concentration of the TF and KF is the affinity constant in inverse molar units. Similarly, the weight of the RNAP-bound state is P = [RNAP]KP. In what follows we refer to F and P as the “binding factors” for the TF-DNA and RNAP-DNA interactions, respectively. We note that these can also be written as $F = e^{-\Delta\Psi_F / k_B T}$ and $P = e^{-\Delta\Psi_P / k_B T}$, where kB is Boltzmann’s constant, T is temperature, and ΔΨF and ΔΨP respectively denote the grand canonical potential of binding for the TF and RNAP. Note that the grand canonical potential is equal to the Gibbs free energy of binding plus a term that accounts for the entropic cost of pulling each protein out of solution. For reference, 1 kBT = 0.62 kcal/mol at 37 °C.
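The relation between a binding factor and its grand canonical potential can be checked numerically. The following is a minimal sketch, assuming the standard Boltzmann relation ΔΨ = −kBT log(binding factor) and a molar gas constant of 0.0019872 kcal/(mol·K); the function names and the example binding factor of 30 are illustrative and are not taken from the original analysis code.

```python
import math

# Molar gas constant in kcal/(mol*K); per mole, k_B*T equals R*T.
R_KCAL_PER_MOL_K = 0.0019872

def kbt_kcal_per_mol(temp_celsius=37.0):
    """Thermal energy k_B*T expressed in kcal/mol."""
    return R_KCAL_PER_MOL_K * (temp_celsius + 273.15)

def potential_from_binding_factor(factor, temp_celsius=37.0):
    """Grand canonical potential (kcal/mol) for a binding factor F or P."""
    return -kbt_kcal_per_mol(temp_celsius) * math.log(factor)

def binding_factor_from_potential(potential, temp_celsius=37.0):
    """Inverse map: binding factor from a potential given in kcal/mol."""
    return math.exp(-potential / kbt_kcal_per_mol(temp_celsius))

kbt = kbt_kcal_per_mol()                       # about 0.62 kcal/mol at 37 C
dpsi = potential_from_binding_factor(30.0)     # illustrative factor of 30
roundtrip = binding_factor_from_potential(dpsi)
print(f"kBT = {kbt:.3f} kcal/mol, potential = {dpsi:.2f} kcal/mol")
```

A binding factor greater than 1 corresponds to a negative potential, meaning the bound state is more probable than the unbound state.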
The overall rate of transcription is computed by summing the amount of transcription produced by each state, weighting each state by the probability with which it occurs. In this case we assume the RNAP-bound state initiates at a rate of tsat, and that the other states produce no transcripts. We also add a term, tbg, to account for background transcription (e.g., from an unidentified promoter further upstream). The rate of transcription in the presence of the TF is thus given by
$$t_+ = t_{sat}\,\frac{P}{1 + F + P} + t_{bg}. \tag{1}$$
In the absence of the TF (F = 0), the rate of transcription becomes
$$t_- = t_{sat}\,\frac{P}{1 + P} + t_{bg}. \tag{2}$$
Our goal is to measure the TF-DNA binding factor F. To do this, we create a set of promoter sequences where the RNAP binding site is varied but the TF binding site is kept fixed. We then measure transcription from these promoters in both the presence and absence of the TF, respectively denoting the resulting quantities by t+ and t− (Figure 1B). Our rationale for doing this is that changing the RNAP binding site sequence should, according to our model, affect only the RNAP-DNA binding affinity KP. All of our measurements should therefore lie along a one-dimensional “expression manifold” residing within the two-dimensional space of (t−, t+) values. Moreover, this expression manifold should follow the specific mathematical form implied by Equations 1 and 2 when P is varied and the other parameters (tsat, tbg, F) are held fixed. See Figure 1C.
The geometry of this expression manifold is nontrivial. In particular, when F ≫ 1 and tbg/tsat ≪ 1, there are five different regimes corresponding to different values of the RNAP binding factor P for which the expressions for t+ and t− approximately simplify. These regimes are listed in Figure 1D. In regime 1, P is so small that both t+ and t− are dominated by background transcription, i.e., t+ ≈ t− ≈ tbg. P is somewhat larger in regime 2, causing t− to be proportional to P while t+ remains dominated by background. In regime 3, both t+ and t− are proportional to P, with t+/t− ≈ 1/(1 + F). In regime 4, t− saturates at tsat while t+ remains proportional to P. Regime 5 occurs when both t+ and t− are saturated, i.e., t+ ≈ t− ≈ tsat.
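The shape of the simple-repression manifold can be traced numerically. The sketch below assumes the transcription rates implied by the three-state model described above, namely t+ = tsat·P/(1 + F + P) + tbg and t− = tsat·P/(1 + P) + tbg. The parameter values are hypothetical and chosen only so that several regimes are visible.

```python
def t_minus(P, t_sat, t_bg):
    """Transcription without the TF: only the RNAP-bound state transcribes."""
    return t_sat * P / (1.0 + P) + t_bg

def t_plus_repression(P, F, t_sat, t_bg):
    """Transcription with the TF: the TF-bound state competes with RNAP."""
    return t_sat * P / (1.0 + F + P) + t_bg

# Hypothetical parameter values, for illustration only.
T_SAT, T_BG, F = 1.0, 1e-6, 30.0

# Sweep the RNAP binding factor P over many decades to trace the manifold.
manifold = []
for exponent in range(-9, 7):
    P = 10.0 ** exponent
    manifold.append((P, t_minus(P, T_SAT, T_BG),
                     t_plus_repression(P, F, T_SAT, T_BG)))

for P, tm, tp in manifold:
    print(f"P={P:8.0e}  t-={tm:.3e}  t+={tp:.3e}  t+/t-={tp / tm:.4f}")
```

In the printed table, both rates sit at background for the smallest P, the ratio t+/t− approaches 1/(1 + F) at intermediate P, and both rates approach tsat for the largest P.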
Precision measurement of in vivo CRP-DNA binding
The placement of CRP downstream of the RNAP binding site is known to repress transcription (Morita et al., 1988). We therefore reasoned that placing a DNA binding site for CRP downstream of RNAP would allow us to measure the binding factor of that site. Figure 2 illustrates measurements of the expression manifold used to characterize the strength of CRP binding to the 22 bp site GAATGTGACCTAGATCACATTT. This site contains the well-known consensus site, which comprises two dyadic pentamers (TGTGA and TCACA) separated by a 6 bp spacer (Gunasekera et al., 1992). We performed measurements using this CRP site centered at two different locations relative to the TSS: +0.5 bp and +4.5 bp. To avoid influencing CRP binding strength, the -10 region of the RNAP site was kept fixed in the promoters we assayed while the -35 region of the RNAP binding site was varied (Figure 2A). Promoter DNA sequences are shown in Appendix 1 Figure 1.
We obtained t- and t+ measurements for these constructs using a modified version of the β-galactosidase assay of Miller (1972); see Materials and Methods for details. Our measurements are largely consistent with an expression manifold having the expected mathematical form (Figure 2B). Moreover, the measurements for CRP at the two different spacings (+0.5 bp and +4.5 bp) appear consistent with each other, although the measurements at +4.5 bp have consistently lower values for P. A small number of data points do deviate substantially from this manifold, but the presence of such outliers is not surprising from a biological perspective: introducing mutations into the RNAP binding site has the potential to create a new binding site, either for RNAP itself or for other TFs. Fortunately, outliers appear at a rate small enough for us to identify and exclude them by inspection.
We quantitatively modeled the expression manifold in Figure 2B by fitting n + 3 parameters to our 2n measurements, where n = 42 is the number of non-outlier data points, each point corresponding to an assayed promoter. The n + 3 parameters were tsat, tbg, F, and P1, P2, …, Pn, where each Pi is the RNAP binding factor of promoter i. Nonlinear least squares optimization was then used to infer values for these parameters. Uncertainties in tsat, tbg, and F were quantified by repeating this procedure on bootstrap-resampled data points.
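The logic of inferring F with bootstrap uncertainties can be illustrated with a deliberately simplified stand-in for the joint nonlinear least squares fit. The sketch below assumes that tsat and tbg are already known, inverts the simple-repression expressions t− = tsat·P/(1 + P) + tbg and t+ = tsat·P/(1 + F + P) + tbg promoter by promoter, and takes the median. This is not the procedure used for Figure 2B, and the synthetic data, noise level, and function names are all hypothetical.

```python
import random
import statistics

def infer_P(t_minus, t_sat, t_bg):
    """Invert t- = t_sat*P/(1+P) + t_bg for the RNAP binding factor P."""
    occupancy = (t_minus - t_bg) / t_sat
    return occupancy / (1.0 - occupancy)

def infer_F(t_minus, t_plus, t_sat, t_bg):
    """Invert t+ = t_sat*P/(1+F+P) + t_bg for F, using P inferred from t-."""
    P = infer_P(t_minus, t_sat, t_bg)
    occupancy_plus = (t_plus - t_bg) / t_sat
    return P / occupancy_plus - 1.0 - P

def estimate_F(points, t_sat, t_bg):
    """Median of per-promoter F estimates (tolerates a few outliers)."""
    return statistics.median([infer_F(tm, tp, t_sat, t_bg) for tm, tp in points])

def bootstrap_interval(points, t_sat, t_bg, n_boot=200, seed=0):
    """Approximate 95% interval from promoters resampled with replacement."""
    rng = random.Random(seed)
    estimates = sorted(
        estimate_F(rng.choices(points, k=len(points)), t_sat, t_bg)
        for _ in range(n_boot)
    )
    return estimates[int(0.025 * n_boot)], estimates[int(0.975 * n_boot) - 1]

# Synthetic (t-, t+) data: hypothetical parameters with 2% multiplicative noise.
T_SAT, T_BG, F_TRUE = 1.0, 1e-4, 30.0
noise = random.Random(1)
points = []
for i in range(40):
    P = 10.0 ** (-2.0 + 2.0 * i / 39)  # P spans 0.01 to 1
    tm = (T_SAT * P / (1.0 + P) + T_BG) * (1.0 + noise.gauss(0.0, 0.02))
    tp = (T_SAT * P / (1.0 + F_TRUE + P) + T_BG) * (1.0 + noise.gauss(0.0, 0.02))
    points.append((tm, tp))

F_hat = estimate_F(points, T_SAT, T_BG)
F_lo, F_hi = bootstrap_interval(points, T_SAT, T_BG)
print(f"F estimate: {F_hat:.1f} (bootstrap interval {F_lo:.1f} to {F_hi:.1f})")
```

A joint fit has the advantage of inferring tsat and tbg from the same data rather than assuming them, which matters for promoters whose transcription is close to background or to saturation.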
These results yielded highly uncertain values for tsat because none of our measurements appear to fall within regime 4 or 5 of the expression manifold. A reasonably precise value for tbg was obtained, but substantial scatter about our model predictions in regimes 1 and 2 remains. This scatter likely reflects some variation in tbg from promoter to promoter, variation that is to be expected since the source of background transcription is not known and the appearance of even very weak promoters could lead to such fluctuations.
These data do, however, determine a highly precise value for the strength of CRP-DNA binding: F ≈ 30 or, equivalently, ΔΨF = -2.10 ± 0.10 kcal/mol. This expression manifold approach is thus able to measure TF-DNA binding energies to a precision of ~ 0.1 kcal/mol, about 2% of the strength of a hydroxyl-oxygen hydrogen bond (5.0 kcal/mol), the kind routinely found in liquid water. We note that CRP forms ~ 38 hydrogen bonds with DNA when it binds to a consensus DNA site (Parkinson et al., 1996), and that previous in vitro measurements of the Gibbs free energy of CRP-DNA binding to its consensus site have yielded 15 kcal/mol (Ebright et al., 1989; Gunasekera et al., 1992). Our result indicates that, in living cells, this Gibbs free energy is almost entirely canceled by the entropic cost of removing a CRP molecule from the cytoplasmic environment.
Strategy for measuring TF-RNAP interactions in vivo
Next we discuss how to measure activating interactions between TFs and RNAP. A common mechanism of transcriptional activation is stabilization (also called recruitment (Ptashne, 2003)). This occurs when a DNA-bound TF stabilizes the RNAP-DNA closed complex. Stabilization effectively increases the RNAP affinity KP, and thus the binding factor P, while not affecting the rate of transcript initiation from the RNAP-DNA closed complexes.
A thermodynamic model for activation by stabilization is illustrated in Figure 3A. Here promoter DNA can be in four states: unbound, TF-bound, RNAP-bound, or doubly bound. In the doubly bound state, a “cooperativity factor” α is included in the Boltzmann weight. This cooperativity factor is related to the TF-RNAP Gibbs free energy of interaction, ΔGα, via $\alpha = e^{-\Delta G_\alpha / k_B T}$. Activation occurs when α > 1 (ΔGα < 0). The resulting activated transcription rate is given by
$$t_+ = t_{sat}\,\frac{P + \alpha F P}{1 + F + P + \alpha F P} + t_{bg}. \tag{3}$$
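This can be rewritten as
$$t_+ = t_{sat}\,\frac{\alpha' P}{1 + \alpha' P} + t_{bg}, \tag{4}$$
where
$$\alpha' = \frac{1 + \alpha F}{1 + F} \tag{5}$$
is a renormalized cooperativity that accounts for the strength of TF-DNA binding. As before, t− is given by Equation 2. Note that α′ < α whenever α > 1, and that α′ ≈ α when F ≫ 1 and αF ≫ 1.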
As before, we measure both t+ and t− for RNAP binding sites of varying strength (Figure 3B). These measurements will, according to our model, lie along an expression manifold resembling the one shown in Figure 3C. This expression manifold exhibits five distinct regimes when $1 \ll \alpha' \ll t_{sat}/t_{bg}$.
These regimes are listed in Figure 3D.
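The equivalence between the four-state model and its compact, renormalized form can be verified numerically. The sketch below assumes the rate implied by the four states described above, t+ = tsat·(P + αFP)/(1 + F + P + αFP) + tbg, together with a renormalized cooperativity α′ = (1 + αF)/(1 + F). Parameter values and function names are hypothetical.

```python
def t_plus_four_state(P, F, alpha, t_sat, t_bg):
    """Four-state model: RNAP-bound and doubly bound states both transcribe."""
    active_weight = P + alpha * F * P
    partition = 1.0 + F + P + alpha * F * P
    return t_sat * active_weight / partition + t_bg

def renormalized_cooperativity(F, alpha):
    """Effective cooperativity after accounting for partial TF occupancy."""
    return (1.0 + alpha * F) / (1.0 + F)

def t_plus_compact(P, alpha_prime, t_sat, t_bg):
    """Same rate, written as RNAP binding with an effective factor alpha'*P."""
    return t_sat * alpha_prime * P / (1.0 + alpha_prime * P) + t_bg

# Hypothetical parameter values, for illustration only.
T_SAT, T_BG, F, ALPHA = 1.0, 1e-4, 30.0, 600.0
A_PRIME = renormalized_cooperativity(F, ALPHA)

# Compare the two forms across many decades of the RNAP binding factor P.
max_rel_diff = max(
    abs(t_plus_four_state(10.0 ** e, F, ALPHA, T_SAT, T_BG)
        / t_plus_compact(10.0 ** e, A_PRIME, T_SAT, T_BG) - 1.0)
    for e in range(-9, 7)
)
print(f"alpha' = {A_PRIME:.1f}, max relative difference = {max_rel_diff:.1e}")
```

The compact form shows that, for stabilization, the TF simply rescales the RNAP binding factor from P to α′P, which is why t+ and t− saturate at the same level.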
Precision measurement of class I CRP-RNAP interactions
CRP activates transcription at the lac promoter and other promoters by binding to a 22 bp site centered at -61.5 bp relative to the TSS. This is an example of class I activation, which is mediated by an interaction between CRP and the RNAP α C-terminal domain (αCTD) (Busby and Ebright, 1999). In vitro experiments have shown this class I CRP-RNAP interaction to activate transcription by stabilizing the RNAP-DNA complex.
We measured t+ and t− for 47 variants of the lac* promoter (see Materials and Methods, as well as Appendix 1 Figure 1). These promoters have the same CRP binding site assayed for Figure 2, but positioned at -61.5 bp, upstream of RNAP (Figure 4A). They differ from one another in the -10 or -35 regions of their respective RNAP binding sites. Figure 4B shows the resulting measurements. With the exception of 3 outlier points, these measurements appear consistent with stabilizing activation via a Gibbs free energy of ΔGα = −3.96 ± 0.09 kcal/mol, corresponding to a cooperativity of α′ ~ 600. We note that, with F ≈ 30 determined in Figure 2, α′ = α to 3% accuracy.
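The arithmetic linking α′, α, and ΔGα can be worked through directly. The sketch below assumes the relations α′ = (1 + αF)/(1 + F) and ΔGα = −kBT log α, with kBT evaluated at 37 °C from a gas constant of 0.0019872 kcal/(mol·K). The rounded inputs α′ = 600 and F = 30 are taken from the text; the function names are illustrative.

```python
import math

KBT = 0.0019872 * 310.15  # k_B*T in kcal/mol at 37 C (about 0.62)

def alpha_from_alpha_prime(alpha_prime, F):
    """Solve alpha' = (1 + alpha*F)/(1 + F) for the bare cooperativity alpha."""
    return (alpha_prime * (1.0 + F) - 1.0) / F

def interaction_energy(alpha):
    """Gibbs free energy of interaction (kcal/mol) for a cooperativity alpha."""
    return -KBT * math.log(alpha)

alpha = alpha_from_alpha_prime(600.0, 30.0)
delta_g = interaction_energy(alpha)
relative_gap = 1.0 - 600.0 / alpha  # fractional difference between alpha' and alpha

print(f"alpha = {alpha:.0f}, dG = {delta_g:.2f} kcal/mol, "
      f"alpha' differs from alpha by {100 * relative_gap:.1f}%")
```

With these rounded inputs the energy comes out within about 0.01 kcal/mol of the reported value, well inside the stated uncertainty.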
This observed cooperativity is substantially stronger than suggested by previous work. Early in vivo experiments suggested a much lower cooperativity value, e.g. 50-fold (Beckwith et al., 1972), 20-fold (Ushida and Aiba, 1990), or even 10-fold (Gaston et al., 1990). These previous studies, however, only measured the ratio t+/t− for a specific choice of RNAP binding site. This ratio is (by Equation 4) always less than α and the differences between these quantities can be substantial.
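The reason a single fold-change measurement underestimates the cooperativity can be seen with a short calculation. The sketch below assumes the stabilization model in its compact form, t+ = tsat·α′P/(1 + α′P) + tbg and t− = tsat·P/(1 + P) + tbg. The values of α′, tsat, and tbg are hypothetical.

```python
def fold_change(P, alpha_prime, t_sat, t_bg):
    """Ratio t+/t- for one promoter with RNAP binding factor P."""
    t_plus = t_sat * alpha_prime * P / (1.0 + alpha_prime * P) + t_bg
    t_minus = t_sat * P / (1.0 + P) + t_bg
    return t_plus / t_minus

# Hypothetical parameter values, for illustration only.
ALPHA_PRIME, T_SAT, T_BG = 600.0, 1.0, 1e-4

# Fold change for RNAP sites ranging from very weak to strong.
ratios = {P: fold_change(P, ALPHA_PRIME, T_SAT, T_BG)
          for P in (1e-6, 1e-4, 1e-3, 1e-2, 1e-1, 1.0)}

for P, ratio in ratios.items():
    print(f"P={P:7.0e}  t+/t- = {ratio:7.1f}")
```

Weak sites lose fold change to background transcription and strong sites lose it to saturation of t+, so a promoter chosen without regard to P can report a ratio of order 10 even when α′ is several hundred.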
However, even studies that have used explicit biophysical modeling have determined lower cooperativity values: Kuhlman et al. (2007) reported a cooperativity of α ≈ 240 (ΔGα ≈ −3.4 kcal/mol), while Kinney et al. (2010) reported α ≈ 220 (ΔGα ≈ −3.3 kcal/mol). Both of these studies, however, relied on the inference of complex biophysical models with many parameters. The expression manifold in Figure 3, by contrast, is characterized by only three parameters (tsat, tbg, α′), all of which can be approximately determined by visual inspection. In fact, while measuring this expression manifold we isolated multiple specific promoters exhibiting t+/t− ≈ 400, directly showing that α > 400.
To test the generality of this approach, we measured expression manifolds for 11 other potential class I activation positions. At every one of these positions we clearly observed the collapse of data to a 1D expression manifold of the expected shape (Figure 4C). By quantitatively modeling these manifolds, we determined the cooperativity α and the Gibbs free energy ΔGα at each position. Uncertainties in these quantities were determined by the modeling of bootstrap-resampled data points (Materials and Methods). The resulting values for both α and ΔGα are shown in Figure 4D. As first shown by Gaston et al. (1990) and Ushida and Aiba (1990), α depends strongly on the spacing between the CRP and RNAP binding sites, exhibiting a strong ~ 10.5 bp periodicity reflecting the helical twist of DNA. However, as with the measurement in Figure 4B, the α values we measure are far stronger than the t+/t− ratios previously reported by Gaston et al. (1990) and Ushida and Aiba (1990); see Table 1.
Acceleration vs. stabilization
E. coli TFs can regulate multiple different steps in the transcript initiation pathway (Lee et al., 2012; Browning and Busby, 2016). For example, instead of stabilizing RNAP binding to DNA, TFs can activate transcription by increasing the rate at which DNA-bound RNAP initiates transcription, a process we refer to as “acceleration”. CRP, in particular, has previously been reported to activate transcription in part by acceleration when positioned appropriately with respect to RNAP (Niu et al., 1996; Rhodius et al., 1997).
We investigated whether expression manifolds might be used to distinguish activation by acceleration from activation by stabilization. First we generalized the thermodynamic model in Figure 3A to accommodate both α-fold stabilization and β-fold acceleration (Figure 5A). This is accomplished by using the same set of states and Boltzmann weights as in the model for stabilization, but assigning a transcription rate βtsat (rather than just tsat) to the TF-RNAP-DNA ternary complex. The resulting activated rate of transcription is given by
$$t_+ = t_{sat}\,\frac{P + \alpha \beta F P}{1 + F + P + \alpha F P} + t_{bg}. \tag{6}$$
This simplifies to
$$t_+ = \beta' t_{sat}\,\frac{\alpha' P}{1 + \alpha' P} + t_{bg}, \tag{7}$$
where α′ is the same as in Equation 5 and
$$\beta' = \frac{1 + \alpha \beta F}{1 + \alpha F} \tag{8}$$
is a renormalized version of the acceleration rate β. The resulting expression manifold is illustrated in Figure 5C. Like the expression manifold for stabilization, this manifold has up to five distinct regimes corresponding to different values of P (Figure 5D). Unlike the stabilization manifold, however, t+ ≠ t− in the strong RNAP binding regime (regime 5): t+ ≈ β′tsat while t− ≈ tsat.
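The effect of acceleration on the manifold can be checked numerically. The sketch below assumes the generalized four-state rate t+ = tsat·(P + αβFP)/(1 + F + P + αFP) + tbg, with renormalized parameters α′ = (1 + αF)/(1 + F) and β′ = (1 + αβF)/(1 + αF). Parameter values and function names are hypothetical.

```python
def t_plus_full(P, F, alpha, beta, t_sat, t_bg):
    """Four-state model where the ternary complex initiates at beta*t_sat."""
    transcribing = P + alpha * beta * F * P
    partition = 1.0 + F + P + alpha * F * P
    return t_sat * transcribing / partition + t_bg

def alpha_prime(F, alpha):
    """Renormalized stabilization factor."""
    return (1.0 + alpha * F) / (1.0 + F)

def beta_prime(F, alpha, beta):
    """Renormalized acceleration factor."""
    return (1.0 + alpha * beta * F) / (1.0 + alpha * F)

def t_plus_compact(P, a_prime, b_prime, t_sat, t_bg):
    """Compact form: stabilization rescales P, acceleration rescales t_sat."""
    return b_prime * t_sat * a_prime * P / (1.0 + a_prime * P) + t_bg

# Hypothetical parameter values, for illustration only.
T_SAT, T_BG, F, ALPHA, BETA = 1.0, 1e-4, 30.0, 50.0, 3.0
A_PRIME = alpha_prime(F, ALPHA)
B_PRIME = beta_prime(F, ALPHA, BETA)

max_rel_diff = max(
    abs(t_plus_full(10.0 ** e, F, ALPHA, BETA, T_SAT, T_BG)
        / t_plus_compact(10.0 ** e, A_PRIME, B_PRIME, T_SAT, T_BG) - 1.0)
    for e in range(-9, 10)
)
t_plus_saturated = t_plus_full(1e9, F, ALPHA, BETA, T_SAT, T_BG)
print(f"beta' = {B_PRIME:.3f}, saturated t+ = {t_plus_saturated:.3f}, "
      f"max relative difference = {max_rel_diff:.1e}")
```

Because the saturated value of t+ is β′tsat rather than tsat, promoters with strong RNAP sites are the informative ones for detecting acceleration: they fall off the t+ = t− diagonal only if β′ differs from 1.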
We next asked whether class I activation by CRP has an acceleration component. Previous in vitro work had suggested that the answer is ‘no’ (Malan et al., 1984; Busby and Ebright, 1999), but our expression manifold approach allows us to address this question in vivo. We proceeded by assaying promoters containing variants of the consensus RNAP binding site, TTGACAn(17)TATAAT, that contain SNPs in their -10 or -35 regions (Figure 6A and Appendix 1 Figure 1). Note that, because the consensus RNAP binding site is 1 bp shorter than in the constructs measured for Figure 4, the CRP site at -60.5 bp in this construct corresponds to the -61.5 bp location in the constructs assayed for Figure 4B.
The resulting data (Figure 6B) are seen to largely fall along the previously measured all-stabilization expression manifold in Figure 4B. In particular, many of these data points lie at the intersection of this manifold with the t+ = t- diagonal. We thus find that, for CRP at -61.5 bp, β = 1 to the precision of our experiments. We also identify an unambiguous value of $t_{sat} = 16.0^{+0.8}_{-1.0}$ a.u. for the transcription initiation rate of an RNAP-saturated promoter. Single-cell measurements suggest that this tsat value corresponds to ~ 0.23 ± 0.11 transcripts per second per promoter (So et al., 2011). Comparing this value of tsat to the tsat obtained for the other manifolds in Figure 4C, we were able to estimate β for these other positions. Figure 6C shows the results: we find that β ≈ 1 at all of the other class I positions for which reasonably precise estimates of β could be obtained. These results confirm that class I transcriptional activation by CRP occurs in vivo almost entirely through stabilization and not through acceleration.
Surprises in class II regulation
Many E. coli TFs participate in what is referred to as class II activation (Browning and Busby, 2016). This type of activation occurs when the TF binds to a site that overlaps the -35 element (often completely replacing it) and interacts directly with the main body of RNAP. CRP is known to participate in class II activation at many promoters (Keseler et al., 2011; Salgado et al., 2013), including the galP1 promoter, where it binds to a site centered at position -41.5 bp (Adhya, 1996). In vitro studies have shown CRP to activate transcription at -41.5 bp relative to the TSS through a combination of stabilization and acceleration (Niu et al., 1996; Rhodius et al., 1997).
We sought to reproduce this finding in vivo by measuring expression manifolds. We therefore placed a consensus CRP site at -41.5 bp, replacing much of the -35 element in the process, then varied the -10 element of the RNAP binding site (Figure 7A). Surprisingly, we observed that the resulting expression manifold saturates at the same tsat value shared by all class I promoters. Thus, CRP appears to activate transcription in vivo solely through stabilization, and not at all through acceleration, when located at -41.5 bp relative to the TSS (Figure 7B).
The genome-wide distribution of CRP binding sites suggests that CRP also participates in class II activation at position -40.5 bp (Keseler et al., 2011; Salgado et al., 2013). When measuring an expression manifold at this position, however, we obtained a scatter of 2D points that did not collapse to any discernible 1D expression manifold (Figure 7D). Some of these promoters exhibit activation, some exhibit repression, and some exhibit no regulation by CRP.
Our observations complicate the current understanding of class II regulation by CRP. Our in vivo measurements of CRP at -41.5 bp call into question the mechanism of activation previously discerned using in vitro techniques. The scatter observed when CRP is positioned at -40.5 bp suggests that, at this position, the -10 region of the RNAP binding site influences the values of at least two relevant biophysical parameters (not just P, as our model predicts). A potential explanation for both observations is that, because CRP and RNAP are so intimately positioned at class II promoters, even minor changes in their relative orientation caused by differences between in vivo and in vitro conditions or by changes in RNAP site sequence could have a major effect on CRP-RNAP interactions. Such sensitivity would not be expected to occur in class I activation, due to the flexibility with which the RNAP αCTDs are tethered to the main complex.
Avoiding parametric models of protein-DNA binding energy
The measurement and modeling of expression manifolds has another important advantage over previous approaches for dissecting cis-regulatory sequences using massively parallel reporter assays (Kinney et al., 2010; Belliveau et al., 2018): it sidesteps the need to parametrically model how protein-DNA binding affinity depends on DNA sequence. In modeling the expression manifolds for class I activation by CRP (Figure 4C) we obtained values for the RNAP binding factor, P = [RNAP]KP, for each of the variant RNAP binding sites we measured. Specifically, each inferred value for P was determined by the position of the corresponding measurement along the length of the manifold.
RNAP has a very well established sequence motif (McClure et al., 1983). Indeed, its DNA binding requirements were among the first characterized for any DNA-binding protein (Pribnow, 1975). More recently, a high-resolution model for RNAP-DNA binding energy was determined using data from a massively parallel reporter assay called Sort-Seq (Kinney et al., 2010). This “energy matrix model” assumes that the base pair at each position contributes additively to the overall binding energy. This model is largely consistent with previously described RNAP binding motifs but, unlike those motifs, it can predict binding energy in physically meaningful energy units (i.e., kcal/mol). In what follows we denote these binding energies as ΔΔGP, because they describe differences in the Gibbs free energy of binding between two DNA sites.
There is good reason to believe this matrix model to be the most accurate current model of RNAP-DNA binding. However, subsequent work has suggested that the predictions of this model might still have substantial inaccuracies (Brewster et al., 2012). To investigate this possibility, we compared our measured values for the grand canonical potential of RNAP-DNA binding (ΔΨP = −kBT log P) to binding energies predicted from this matrix model from Kinney et al. (2010), which is illustrated in Figure 8A. These values are plotted against one another in Figure 8B. Although there is a strong correlation between the predictions of the model and our measurements, deviations of 1 kcal/mol or larger (corresponding to variations in P of 5-fold or greater) are not uncommon. There also appear to be systematic deviations of this model from the diagonal.
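The additive assumption behind an energy matrix, and the correspondence between energy deviations and fold changes in P, can be sketched as follows. The three-position matrix below is a toy example with made-up values; it is not the RNAP matrix of Kinney et al. (2010), and the function names are illustrative. kBT is evaluated at 37 °C from a gas constant of 0.0019872 kcal/(mol·K).

```python
import math

KBT = 0.0019872 * 310.15  # k_B*T in kcal/mol at 37 C (about 0.62)

# Toy additive energy matrix (kcal/mol): one dict per position of a 3 bp site.
# These numbers are invented purely to illustrate the additive assumption.
TOY_MATRIX = [
    {"A": 0.0, "C": 0.4, "G": 0.6, "T": 0.2},
    {"A": 0.3, "C": 0.0, "G": 0.5, "T": 0.7},
    {"A": 0.9, "C": 0.1, "G": 0.0, "T": 0.4},
]

def predicted_energy(site, matrix):
    """Additive prediction: sum the contribution of the base at each position."""
    if len(site) != len(matrix):
        raise ValueError("site length must match matrix length")
    return sum(column[base] for column, base in zip(matrix, site))

def fold_change_in_P(energy_difference):
    """Factor by which P changes for a given binding energy difference."""
    return math.exp(-energy_difference / KBT)

best = predicted_energy("ACG", TOY_MATRIX)    # lowest-energy toy site
worse = predicted_energy("TGA", TOY_MATRIX)   # a weaker toy site
fold_per_kcal = 1.0 / fold_change_in_P(1.0)   # reduction in P per +1 kcal/mol

print(f"toy energies: {best:.1f} and {worse:.1f} kcal/mol")
print(f"a 1 kcal/mol error changes P by about {fold_per_kcal:.1f}-fold")
```

Under the additive assumption, any interaction between positions is invisible to the model, which is one way that predictions for individual sites can be off by an energy of this size.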
This finding is sobering: even for one of the best understood DNA-binding proteins in biology, predictions of in vivo protein-DNA binding energy are still quite crude. When used in conjunction with thermodynamic models, as in Kinney et al. (2010), the inaccuracies of these models can have major effects on predicted transcription rates. Expression manifolds sidestep the need to parametrically model such binding energies, enabling the direct inference of grand canonical potential values for each RNAP binding site assayed.
Discussion
Expression manifolds provide a new strategy for dissecting the biophysics of transcriptional regulation in living cells. The key idea is to measure the activity of a set of promoters in a multidimensional space. These promoters are chosen so that, if a hypothesized biophysical model is correct, measurements will collapse to a lower-dimensional manifold embedded within this space. If the data collapse as expected, one can infer the parameters of the hypothesized biophysical model. If the data do not collapse, one learns that a different biophysical model is needed.
Here, we measured expression manifolds characterizing both simple repression and simple activation by CRP. Two expression measurements were made for each assayed promoter, one in the presence of cAMP (t+) and one in the absence of cAMP (t−). Each promoter thus corresponded to a point (t−, t+) in 2D. For each CRP-RNAP spacing, we assayed promoters that differed only in the DNA sequence of the RNAP binding site. Our biophysical models assumed that this site controls only one relevant biophysical quantity: the affinity of RNAP for DNA. Thus, we expected that these 2D measurements would collapse to a 1D expression manifold, with different positions along the manifold corresponding to different values of RNAP-DNA binding affinity.
Robust data collapse was observed for CRP binding sites located at all except one of the positions we assayed. In these cases, we were able to infer precise values for the energetic parameters of our models. Inferring a model for simple repression allowed us to determine the strength of CRP-DNA binding (ΔΨF = −2.10 ± 0.10 kcal/mol). Inference of models for simple activation then allowed us to determine values for the CRP-RNAP interaction, as quantified by the Gibbs free energy ΔGα; these interaction energies were consistently determined to a precision of ~ 0.1 kcal/mol.
Expression manifolds for different biophysical models often have different shapes. Measuring and modeling expression manifolds can thus allow one to distinguish between qualitatively different mechanisms of transcriptional activation. In our experiments, all transcriptional activation was seen to occur through CRP-mediated stabilization of RNAP-DNA binding, as opposed to CRP-mediated acceleration of transcript initiation. This was true even for class II activation by CRP centered at -41.5 bp, a position for which previous in vitro experiments had suggested a substantial acceleration component.
Expression manifolds also allow the measurement of protein-DNA binding energy without the need for parametric models of how this binding energy depends on DNA sequence. In the experiments described here, we obtained measurements for RNAP-DNA binding energy, as quantified by ΔΨP, for each of the assayed promoters. These measurements deviate substantially from the predictions of the established RNAP-DNA binding motif (Kinney et al., 2010). This is a cautionary tale: even for very well studied TFs, one cannot assume that published motifs accurately predict the affinity of individual DNA binding sites.
Unexpectedly, our data did not collapse to an expression manifold when CRP was centered at -40.5 bp. This result allowed us to reject our hypothesized biophysical model. We thus learned that the DNA sequence of the core RNAP binding site somehow controls how RNAP interacts with CRP in this class II configuration. Additional work will be required to understand this sequence-dependence, which to our knowledge has not been previously reported.
Our strategy has been designed to be compatible with massively parallel reporter assays (MPRAs), which use ultra-high-throughput DNA sequencing to measure the activities of thousands of transcriptional regulatory sequences simultaneously. We expect that MPRAs, performed on microarray-synthesized promoter libraries, should allow hundreds of expression manifolds to be measured in a single experiment. MPRAs will also facilitate the study of TFs that cannot be controlled by a small molecule: one can measure t+ and t− by assaying promoters that either do or do not have a functional TF binding site but are otherwise identical. The ease with which MPRAs can assay promoters with different combinations of sites turned “on” and “off” should enable the study of more complex regulatory architectures, beyond just simple repression and simple activation.
Based on these results, we advocate a very different approach to dissecting transcriptional regulatory grammar than has been pursued by other groups. Instead of assaying and modeling many different arrangements of transcription factor binding sites (Gertz et al., 2009; Sharon et al., 2012; Mogno et al., 2013; Smith et al., 2013; Levo and Segal, 2014; White et al., 2016) or the activity of completely random DNA (de Boer et al., 2017), we suggest that more attention be paid to the interactions that occur within specific binding site configurations. Expression manifolds provide a useful way of interrogating individual protein-DNA and protein-protein interactions that occur in a specific promoter architecture without requiring a holistic model that aims to describe arbitrary binding site arrangements. Using MPRAs to simultaneously assay hundreds of systematically varied architectures, we expect that it should be possible to build biophysical models of transcriptional regulatory grammar from the ground up.
What would high-precision knowledge of transcriptional regulatory grammar in bacteria do for us? For one thing, it would greatly facilitate the interpretation of bacterial genome sequences. Currently, it is difficult to predict the functional consequences of TF binding sites just from their locations relative to annotated TSSs. Knowing the distance-dependent interactions between RNAP and common E. coli TFs would greatly illuminate how previously annotated binding sites for these TFs actually affect expression. Such knowledge would also facilitate MPRA-based efforts to dissect previously unannotated regulatory sequences across the genome (Belliveau et al., 2018).
Precise knowledge of transcriptional regulatory grammar in bacteria would also have important implications for synthetic biology. Currently, complex biological computations are performed in synthetic systems by stringing simple promoter “parts” together into complex regulatory networks. By contrast, naturally occurring promoters can often perform quite complex computations themselves via the multi-protein-DNA complexes that they scaffold (Kuhlman et al., 2007; Cui et al., 2013). Such computational mechanisms have many potential advantages, including faster response times and increased robustness to stochastic fluctuations. These advantages could be particularly useful in metabolic engineering, which requires rapid and reliable control over the expression of multiple genes in a pathway (Smanski et al., 2016; Nielsen and Keasling, 2016; Zhao et al., 2018). But although the potential capabilities of complex promoters have been explored both theoretically (Buchler et al., 2003; Bintu et al., 2005) and experimentally (Setty et al., 2003; Mayo et al., 2006; Segall-Shapiro et al., 2018), there remains little capability in synthetic biology to design complex promoters with predictable quantitative behavior. High-precision knowledge of the energetics underlying transcriptional regulatory grammar could enable this capability.
Will expression manifolds be useful for understanding transcriptional regulation in eukaryotes? Both FACS-based MPRAs (Sharon et al., 2012; Weingarten-Gabbay et al., 2017) and RNA-Seq-based MPRAs (Melnikov et al., 2012; Kwasnieski et al., 2012; Patwardhan et al., 2012) are well established in eukaryotes so, on a technical level, experiments analogous to those described here should be feasible. The bigger question, we believe, is whether the results of such experiments would be interpretable. Eukaryotic transcriptional regulation is far more complex than transcriptional regulation in bacteria. In fact, it is not even clear what mutations to the basal promoter in eukaryotes might correspond to the mutations in the RNAP site that we relied upon here. Still, we believe that pursuing this strategy in eukaryotes is worthwhile. Despite the underlying complexities, simple “effective” models of regulatory biophysics might work surprisingly well.
Materials and Methods
Media
Expression measurements were performed on cells grown in rich defined media (RDM; purchased from Teknova) (Neidhardt et al., 1974) supplemented with 10 mM NaHCO3, 1 mM IPTG (Sigma), and 0.2% glucose. In what follows we refer to this medium as RDM’. RDM’ was further supplemented with 50 μg/ml kanamycin (Sigma) when growing cells, as well as 250 μM cAMP (Sigma) when measuring t+.
Strains
Expression measurements were performed in E. coli strain JK10, which has genotype ΔcyaA ΔcpdA ΔlacY ΔlacZ ΔdksA. JK10 is derived from strain TK310 (Kuhlman et al., 2007), which is ΔcyaA ΔcpdA ΔlacY. The ΔcyaA ΔcpdA mutations prevent TK310 from synthesizing or degrading cAMP, thus allowing in vivo cAMP concentrations to be quantitatively controlled by adding cAMP to the growth media. Into TK310 we introduced the ΔlacZ mutation, yielding strain DJ33; this mutation allows Miller assays to be used in conjunction with plasmid-based reporters driving lacZ expression. In our initial experiments, we found that the growth rate of DJ33 in RDM’ varies strongly with the amount of cAMP added to the media. Fortunately, we isolated a spontaneous knock-out mutation in dksA (thus yielding JK10), which caused the growth rate (~ 30 min doubling time) in RDM’ to be independent of cAMP concentrations below ~ 500 μM (see footnote 3). The JK10 genotype was confirmed by whole genome sequencing.
Reporter constructs
Expression of the lacZ gene was driven from variants of a plasmid we call pJK48. These reporter constructs were cloned as follows. We started with the vector pJK14 from Kinney et al. (2010). pJK14 contains a pSC101 origin of replication (~ 5-10 copies per cell), a kanamycin resistance gene, and a ccdB cloning cassette positioned immediately upstream of a gfpmut2 reporter gene and flanked by outward-facing BsmBI restriction sites. First, the gfpmut2 gene in this vector was replaced with lacZ, yielding pJK47. Next, the ribosome binding site in the 5’ UTR of lacZ was weakened, yielding pJK47.419; this weakening prevents lacZ expression from a maximally active promoter from substantially slowing cell growth in RDM’. pJK47.419 was propagated in DB3.1 E. coli (Invitrogen), which is resistant to the CcdB toxin.
The promoters we assayed were variants of what we call the lac* promoter. The lac* promoter is similar to the endogenous lac promoter of E. coli MG1655 except that (i) it contains a CRP binding site with a consensus right pentamer and (ii) it contains mutations that were introduced in an effort to remove previously reported cryptic promoters (Reznikoff, 1992). Promoter-containing insertion cassettes were created through overlap-extension PCR and flanked by outward-facing BsaI restriction sites. All primers were ordered from Integrated DNA Technologies. Note that some of the primers used to create these inserts were synthesized using pre-mixed phosphoramidites at specified positions; this is how a 24% mutation rate in the -10 or -35 regions of the RNAP binding site was achieved. The resulting promoter sequences are illustrated in Appendix 1 Figure 1.
To clone variants of pJK48, we separately digested the pJK47.419 vector with BsmBI (NEB) and the appropriate insert with BsaI (NEB). Digests were then cleaned up (Qiagen PCR purification kit) and ligated together at a 1:1 molar ratio for 1 hour using T4 DNA ligase (Invitrogen). After 90 min dialysis, plasmids were transformed into electrocompetent JK10 cells. Individual clones were plated on LB supplemented with kanamycin (50 μg/ml), while libraries were grown in 50 ml LB supplemented with kanamycin. After initial cloning, each clone was re-streaked, grown in LB+kan, and stored as a catalogued glycerol stock. The promoter region of each clone was sequenced in both directions. Only plasmids with validated promoter sequences were used for the measurements presented in this paper. The promoter sequences of all constructs used in this study, as well as their measured t+ and t− values, are provided at https://github.com/jbkinney/18_expressionmanifolds.
Miller assays
Expression was quantified using ONPG-based β-galactosidase activity measurements adapted from the method of Miller (1972). Specifically, we obtained t+ and t− measurements for each clone as follows.
First, each clone was streaked out on LB+kan agar and grown overnight. A colony was then picked and used to inoculate a 1.5 ml overnight LB+kan liquid culture. Either 8 μl, 6 μl, or 4 μl of the overnight culture were then diluted into 200 μl RDM’+kan. 25 μl of each dilution was then added to 175 μl RDM’+kan in a 96-well optical bottom plate and supplemented with either 0 μM cAMP (for t−) or 250 μM cAMP (for t+). The plate was then covered with Breathe-Easier film (USA Scientific) and cells were cultured for ~ 3 hr at 37 °C, shaking at 900 RPM in a microplate shaker. During this time, 5.5 ml of lysis buffer was freshly prepared using 1.5 ml RDM’, 4.0 ml PopCulture reagent (Millipore), 114 μl of 35 mg/ml chloramphenicol (Sigma), and 44 μl of 40 U/μl rLysozyme (Sigma).
Microplate film was removed and cell density (quantified by A600) was measured using an Epoch 2 Microplate Spectrophotometer (BioTek). Cells were then lysed by adding 25 μl lysis buffer to each microplate well, incubating the microplate at room temperature for 10 minutes without shaking, then cooling the microplate at 4°C for a minimum of 15 minutes. In each well of a 96-well optical bottom plate, 50 μl of lysate was then added to 50 μl of pre-chilled Z-buffer (Miller, 1972) containing 1 mg/ml ONPG (Sigma). Samples were sealed with optical film and both A420 and A550 were periodically measured in the plate reader over an extended period of time (every 1.5 min for 1 hour or every 15 min for 10 hours, depending on the level of expression expected).
The expression levels t+ and t− were quantified from these absorbance data using the formula given as Equation 9, in which V = 50 is the volume of lysate in μl added to the ONPG reaction, ΔT is the change in time from the beginning of the measurement, and ΔAX indicates a change in absorbance at X nm over this time interval. Only data from wells with A600 ≲ 0.5 were analyzed. Note that the A550 term in Equation 9 is not multiplied by 1.75 as it is in Miller (1972). This is because our A550 measurements are used to compensate for condensation on the microplate film, not for cellular debris as in Miller (1972); our lysis procedure produces no detectable cellular debris. In practice, Equation 9 was not evaluated using individual measurements, but was rather computed from the slope of a line fit to non-saturated absorbance measurements using custom Python scripts. Raw A420, A550, and A600 values, as well as our analysis scripts, are available at https://github.com/jbkinney/18_expressionmanifolds. In all the figures, median values from at least 3 independent Miller measurements were used to define each measured t+ and t− data point.
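The slope-based evaluation described above might look roughly like the following sketch. This is not the authors' analysis script (which is available at the repository cited above). It assumes a Miller-like normalization, t = scale × slope of (A420 − A550) versus time / (V × A600), and the scale factor of 1000, the saturation cutoff, and all function names are hypothetical.

```python
def fit_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def expression_level(times_min, a420, a550, a600, lysate_ul=50.0,
                     saturation=2.0, scale=1000.0):
    """Estimate t from the slope of (A420 - A550) versus time.

    Assumptions (not from the paper): the saturation cutoff, the scale
    factor, and the exact normalization by lysate volume and A600.
    """
    # Keep only time points whose A420 reading is below the cutoff.
    kept = [(t, y420 - y550)
            for t, y420, y550 in zip(times_min, a420, a550)
            if y420 < saturation]
    if len(kept) < 2:
        raise ValueError("need at least two non-saturated time points")
    ts, ys = zip(*kept)
    slope = fit_slope(ts, ys)  # absorbance units per minute
    return scale * slope / (lysate_ul * a600)


# Synthetic example: A420 rises linearly at 0.02 per minute, A550 is flat.
times = [0.0, 1.5, 3.0, 4.5, 6.0]
a420 = [0.1 + 0.02 * t for t in times]
a550 = [0.05] * len(times)
t_example = expression_level(times, a420, a550, a600=0.4)
```

Fitting a slope uses every non-saturated time point rather than a single pair of readings, which should make the estimate less sensitive to noise in any individual absorbance measurement.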
Parameter inference
Expression manifold parameters were fit to measured t+ and t− values as follows. First, outlier data points were called by eye and excluded from the parameter fitting procedure. We denote the remaining measurements by ti+ and ti−, where i = 1, 2, …, n indexes the non-outlier data points. These 2n measurements were used to fit n + 3 parameters: the saturated transcription rate (tsat), the background transcription rate (tbg), the renormalized cooperativity (α’; see footnote 4), and the RNAP binding factors for each assayed RNAP site (P1, P2, …, Pn). This was accomplished using nonlinear least squares. Specifically, we minimized a loss function over the model parameters θ = {tsat, tbg, α’, P1, P2, …, Pn}.
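A loss function of the kind described above can be sketched as follows. The specific form used here is an assumption, not a statement of the authors' method: residuals are taken in log space, and the model predictions are assumed to be tsat·P/(1 + P) + tbg for t− and tsat·α’P/(1 + α’P) + tbg for t+. All names and numerical values are hypothetical.

```python
import math


def predicted(P, alpha_prime, t_sat, t_bg):
    """Assumed model prediction for one promoter in one condition."""
    x = alpha_prime * P
    return t_sat * x / (1.0 + x) + t_bg


def loss(theta, t_minus_obs, t_plus_obs):
    """Sum of squared log-residuals over all 2n measurements.

    theta = (t_sat, t_bg, alpha_prime, P_1, ..., P_n), i.e. n + 3 values.
    """
    t_sat, t_bg, alpha_prime = theta[:3]
    Ps = theta[3:]
    if not (len(Ps) == len(t_minus_obs) == len(t_plus_obs)):
        raise ValueError("need one P value per measured promoter")
    total = 0.0
    for P, tm, tp in zip(Ps, t_minus_obs, t_plus_obs):
        total += (math.log(tm)
                  - math.log(predicted(P, 1.0, t_sat, t_bg))) ** 2
        total += (math.log(tp)
                  - math.log(predicted(P, alpha_prime, t_sat, t_bg))) ** 2
    return total


# Synthetic noise-free data generated from hypothetical parameters.
true_theta = (1.0, 1e-3, 8.0, 0.01, 0.1, 1.0)
t_minus = [predicted(P, 1.0, 1.0, 1e-3) for P in true_theta[3:]]
t_plus = [predicted(P, 8.0, 1.0, 1e-3) for P in true_theta[3:]]
loss_at_truth = loss(true_theta, t_minus, t_plus)
loss_perturbed = loss((1.0, 1e-3, 4.0, 0.01, 0.1, 1.0), t_minus, t_plus)
```

In practice a general-purpose optimizer would be used to minimize such a function; `scipy.optimize.least_squares` is one option, though whether the authors used it is not stated here. Fitting in log space, if that is indeed what was done, weights relative rather than absolute errors, which suits expression data spanning several orders of magnitude.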
The solid black lines in Figure 2B and Figures 4B,C show the expression manifolds fit to all n data points. The gray lines in Figure 2B and Figure 4B represent parameters fit to bootstrap-resampled data points.
The values reported for F and α, as well as for ΔGF and ΔGα, were computed using parameters fit to bootstrap-resampled data. For the occlusion data in Figure 2B, the reported values and their uncertainties were computed from F84, F50, and F16, which respectively denote the 84th, 50th, and 16th percentiles of F values obtained from bootstrap resampling; energies were converted using 1 kcal/mol = 1.62 kBT (corresponding to 37 °C). For the activation data in Figures 4B and 4C, we computed α from α’ via α = α’ + (α’ − 1)/F50. The reported values and their uncertainties were then computed from α84, α50, and α16, which respectively denote the 84th, 50th, and 16th percentiles of α values obtained from bootstrap resampling.
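The percentile-based summary of bootstrap results can be sketched as below. This is an illustration under stated assumptions: energies are computed as −kBT log of the bootstrapped quantity, the conversion uses roughly 1.62 kBT per kcal/mol at 37 °C, and percentiles use linear interpolation. The function names and the example samples are hypothetical.

```python
import math

# Assumption: about 1.62 kBT per kcal/mol at 37 °C.
KCAL_PER_KBT = 1.0 / 1.62


def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac


def summarize_energy(samples):
    """Return (16th, 50th, 84th) percentiles of -kBT log(sample), kcal/mol.

    Because energy decreases as the sample value grows, the 16th
    percentile of energy corresponds to the 84th percentile of the
    underlying quantity, and vice versa.
    """
    energies = [-KCAL_PER_KBT * math.log(s) for s in samples]
    return (percentile(energies, 16),
            percentile(energies, 50),
            percentile(energies, 84))


# Hypothetical bootstrap samples of F scattered around exp(3.4).
f_samples = [math.exp(3.4 + d) for d in (-0.2, -0.1, 0.0, 0.1, 0.2)]
low, median, high = summarize_energy(f_samples)
```

Reporting the 16th and 84th percentiles gives an interval comparable to plus or minus one standard deviation for approximately Gaussian bootstrap distributions, without assuming that the distribution is symmetric.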
By visual inspection of Figure 6B, we determined that β ≈ 1 at -61.5 bp. In Figure 6C, we therefore report, for each position X, an acceleration βX given by βX = tsat(X) / tsat(-61.5), where tsat(-61.5) is the saturated rate of transcription inferred for -61.5 bp in Figure 4B and, similarly, tsat(X) denotes the saturated rate of transcription inferred for position X in Figure 4C. Plotted points show the median values, while error bars show the [16%, 84%] quantile interval.
Figure 8 shows Pi,50 values with error bars extending from Pi,16 to Pi,84. These values were computed from the values of P inferred when the individual replicates for each promoter were bootstrap resampled, but all promoters were retained in the inference procedure.
Author contributions
JBK conceived of this study. TF and JBK designed this study. JBK, TF, AA, MSG, and DJ carried out the experiments. TF and JBK carried out the computational analysis. JBK wrote the manuscript with input from MSG, RP, DJ, TF, and AA. JBK funded this study.
Acknowledgments
We thank Bryce Nickels and Stirling Churchman for helpful feedback. This work was supported by a CSHL/Northwell Health Alliance grant to JBK and by NIH Cancer Center Support Grant 5P30CA045508.
Appendix 1
Footnotes
1. The first transcribed base is, in this paper, assigned position 0 instead of the more conventional +1. Half-integer positions indicate centering between neighboring nucleotides.
2. See Materials and Methods for a discussion of how uncertainties in these values are computed and reported.
3. Note, however, that JK10 will not grow in minimal media in the absence of cAMP.
4. Note that α’ = 1/(1 + F) in the case of simple repression, as in Figure 2.