Abstract
How do new promoters evolve? The current notion is that new promoters emerge from duplication of existing promoters. To test whether promoters can instead evolve de novo, we replaced the lac promoter of Escherichia coli with various random sequences and evolved the cells in the presence of lactose. We found that a typical random sequence of ∼100 bases can mimic the canonical promoter and enable growth on lactose by acquiring only one mutation. We further found that ∼10% of random sequences could serve as active promoters even without any period of evolutionary adaptation. Such a short distance from a random sequence to an active promoter may improve evolvability, yet it may also lead to undesirable accidental expression. Nevertheless, across the E. coli genome accidental expression is largely avoided by disfavoring codon combinations that resemble canonical promoter motifs. Our results suggest that the promoter recognition machinery has been tuned to allow high accessibility to new promoters, and similar findings might also be observed in higher organisms or in other motif recognition machineries, like transcription factor binding sites or protein-protein interactions.
Introduction
Promoters control the transcription of genes and therefore play a major role in evolutionary adaptation1. The extensive study of promoters by genomic analysis2–4, experimental protein-DNA interactions5–7 and promoter libraries8–11 has mostly revolved around highly refined promoters, i.e. long-standing wild-type promoters and their derivatives. However, the emergence of new promoters, for example when cells need to activate horizontally transferred genes12,13, is less understood. The current notion is that new promoters largely emerge from duplication of existing promoters via genomic rearrangements14,15, transposable elements16,17, or inter-species mobile elements18. Pooled promoter libraries allow activity measurements of a large number of starting sequences, yet pool competition is not suited to following an evolutionary process that requires mutational steps, because pooled libraries under selection are often dominated by a few sequences with high activity. Therefore, in order to resemble evolution in natural ecologies, we applied lab evolution methods to parallel evolving populations, each starting with a different random sequence. Following these evolving populations highlighted an unacknowledged way for new promoters to emerge: by mutations rather than by copying an existing promoter. Promoter activity was typically achieved by a single mutation and could be further increased in a stepwise manner by additional mutations that increased the similarity of the random sequence to the canonical promoter (TATAAT and TTGACA motifs).
Main Text
To create an ecological scenario that can test how bacteria evolve de novo promoters, we sought a beneficial gene in the genome but not yet expressed, similarly to what might occur during horizontal gene transfer with a non-functional promoter. To this end, we modified the lac operon of E. coli: the lac metabolic genes (LacZYA) remain intact (including their 5’UTR), yet we deleted their promoter and replaced it by a variety of non-functional sequences. To broadly represent the non-functional sequence-space, we used random sequences of 103 bases (same length as the deleted WT lac promoter), which were computer-generated with the typical GC content of the E. coli genome (∼50.8% GC, see Methods). In addition, the lac repressor (LacI) was deleted, and the lactose permease (LacY) was fluorescently labeled with YFP19 for future quantification of expression. To avoid possible artifacts associated with plasmids, all modifications were made on the E. coli chromosome20, so the engineered strains had a single copy of the metabolic genes needed for lactose utilization, yet without a functional promoter (Figure 1A). We began building such strains with random sequences as “promoters”, and already observed for the first strains obtained that they could not express the lac genes and thus they could not utilize or grow on lactose. This experimental observation was therefore consistent with the expectation that a random sequence is unlikely to be a functional promoter.
To select for de novo lactose utilization we started evolution by serial dilution with the obtained strains, each carrying a different random sequence instead of the WT lac-operon promoter. We first focused on three such strains (termed RandSequence1, 2 and 3) and tested their ability to evolve expression of the lac operon, each in four replicates. As controls, we also evolved a strain carrying the WT lac promoter (termed WTpromoter), and another strain in which the entire lac operon was deleted (termed ΔLacOperon). Before the evolution experiment, only the WTpromoter strain could utilize lactose (Supp. Figure 1). Therefore, to facilitate growth to low population sizes the evolution medium contained glycerol (0.05%) that the cells can utilize and lactose (0.2%) that the cells can only exploit if they express the lac operon. To isolate lactose-utilizing mutants, we routinely plated samples from the evolving populations on plates with lactose as the sole carbon source (M9+Lac) (Figure 1B). Remarkably, within 1-2 weeks of evolution (less than 100 generations), all of these populations exhibited lactose-utilizing abilities, except for the ΔLacOperon population (Supplementary Information). These laboratory evolution results therefore argue that the populations carrying random sequences instead of a promoter can rapidly evolve expression. Next, we addressed the question of whether the solutions found during evolution were mutations in the random sequences or simply copying and pasting of existing promoters from elsewhere in the genome.
To determine the molecular nature of the evolutionary adaptation, we sequenced the region upstream to the lac operon (from the beginning of the lac operon through the random sequence that replaced the WT lac promoter and up to the neighboring gene upstream). Strikingly, within each of the different random sequences a single mutation occurred, and continued evolution yielded additional mutations within the random sequences that further increased expression from the emerging promoters. All replicates showed the same mutations, yet sometimes in different order (Supp. Table 1). In order to confirm that the evolved ability to utilize lactose is because of the observed mutations, each mutation was inserted back into its relevant ancestral strain. Then, we assessed the lac-operon expression by YFP measurements (thanks to the LacY-YFP labeling) (Figure 2A). This experimental evolution demonstrates how non-functional sequences can rapidly become active promoters, in a stepwise manner, by acquiring successive mutations that gradually increase expression. Next, we aimed to determine the mechanism by which these mutations induced de novo expression from a random sequence.
Looking at the context of the emerging mutations clearly showed that expression was achieved by mimicking the canonical promoter motifs of E. coli21, which are responsible for the transcription of the majority of the genes in growing E. coli (i.e. the ‘minus 10’ TATAAT and the ‘minus 35’ TTGACA, separated by a spacer of 17±2 bases). Each of the five mutations found during evolution of the three random sequences contributed to better capturing of the canonical promoter motifs (Figure 2B). The emerging promoters seem to comply with the higher importance of the TATAAT motif to promoter strength. RandSeq1 and RandSeq2 both captured 5 out of 6 bases, and RandSeq3 captured the full 6 bases, while for the TTGACA motif they all captured 3 out of the 6 motif bases. Interestingly, although before evolution RandSeq3 already captured 3/6 bases of the TTGACA motif plus 5/6 of the TATAAT motif, this was not sufficient to induce expression. Presumably, RandSeq3 was not an active promoter before evolution due to a short spacer (14 bases, compared with the ideal 17-base spacer), which creates significant torsion of the DNA22 and thus reduces attachment of the transcription machinery. Nevertheless, a single mutation in RandSeq3 allowed perfect capturing of the TATAAT motif and, as a result, expression despite the short spacer. Therefore, de novo promoters are highly accessible because the different features that make a promoter, like sequence motifs and spacer size, can be compromised and still function.
The most surprising aspect of random sequences evolving into functional promoters was the fact that a single mutation was sufficient for turning on expression. Therefore, we predicted that if indeed a single mutation in a 103-base random sequence is often sufficient to generate an active promoter, there might also be a small portion of random sequences that are already active without the need of any mutation. Indeed, when testing all 40 strains (RandSeq1 to 40) for growth on M9+Lac plates before evolution, we observed that four of the strains (10%) formed colonies without acquiring any mutation in their random sequences. We scanned the random sequences of these already-active strains (RandSeq7, 12, 30, 34) and found regions with high similarity to the canonical σ70 promoter, equivalent to the similarities caused by the mutations mentioned earlier (Supp. Figure 2). Given that a single mutation might be sufficient to turn expression on, we proceeded with the strains that did not exhibit lac-operon activity, by putting them under selection for lactose utilization both by the abovementioned daily-dilution routine (in M9+GlyLac) and by directly screening for mutants that can form colonies on M9+Lac plates (Methods).
Overall, we observed expression activity in all but 5% of the random-sequence strains (38/40). Analysis of all forty strains and their lac-operon activating mutations showed that 10±4.7% were already active without any mutation (4/40), 57.5±7.8% found mutations within the 103 bases of the random sequence (23/40), 12.5±5.2% found mutations in the intergenic region just upstream of the random sequence (5/40), and 15±5.7% utilized genomic rearrangements that relocated an existing promoter of genes found upstream of the lac operon (6/40) (Figure 3A). YFP measurements indicate that all strains displayed substantial expression of the lac operon after acquiring the activating mutations (Figure 3B). To confirm that transcriptional read-through from the upstream selection gene did not facilitate the emergence of de novo promoters, we made six strains in a marker-free manner (Methods) and showed that their ability to evolve de novo promoters is similar to that of the rest of the strains. A typical random sequence of ∼100 bases is therefore not an active promoter but is frequently only one point mutation away from being an active promoter (for details on all mutations, their verifications and different outcomes between replicates, see Supp. Table 1).
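The ± values attached to the fractions above are not defined in the text. As a hedged check, they appear to be close to the binomial standard error of a proportion, sqrt(p(1-p)/n) with n = 40. This interpretation is our assumption, and the helper name `proportion_with_se` is hypothetical:

```python
import math

def proportion_with_se(k, n):
    """Return (percentage, standard error in percentage points) for k successes out of n.

    Assumes a simple binomial model; the source does not state how its
    uncertainties were computed.
    """
    p = k / n
    se = math.sqrt(p * (1 - p) / n)
    return 100 * p, 100 * se

# Counts reported in the text, out of 40 random-sequence strains.
categories = {
    "active without mutation": 4,
    "mutation within random sequence": 23,
    "mutation in upstream intergenic region": 5,
    "genomic rearrangement": 6,
}

if __name__ == "__main__":
    for label, k in categories.items():
        pct, se = proportion_with_se(k, 40)
        print(f"{label}: {pct:.1f} +/- {se:.1f} %")
```

Under this assumption the first three categories reproduce the reported uncertainties after rounding, and the fourth differs only in the last digit.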
We performed lab evolution for de novo expression by selecting for a functional readout: the ability to grow on lactose. This means that the expression threshold of the lac operon, above which cells can grow on lactose, was often passed by a single mutation. To verify these surprising findings using a method that is not bound to a specific threshold, we calculated the mutational distance of random sequences from the canonical promoter of E. coli. We computationally scanned 10 million random sequences (of 103 bases) against the canonical promoter motifs and observed that a typical random sequence is likely to match 8 out of the 12 possible positions (of the two six-mers TTGACA and TATAAT, with spacing of 17±2). Interestingly, a similar analysis performed on E. coli’s constitutive promoters showed that the majority of them have 9 out of 12 matches, only one more than random sequences of ∼100 bases. Our claim is therefore strengthened, as a random sequence typically requires only one mutation in order to reach the number of matches that characterizes naturally occurring constitutive promoters. Furthermore, it implies that some portion of random sequences may be active already, as ∼10% of random sequences have 9 or more matches (Figure 4).
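The scan described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that only the given strand is scanned, that every position and every spacer length from 15 to 19 is tried, and that bases are drawn independently with 50.8% GC. The function names are hypothetical.

```python
import random
from collections import Counter

MINUS35 = "TTGACA"
MINUS10 = "TATAAT"
SPACERS = range(15, 20)  # 17 +/- 2 bases between the two six-mers

def matches(window, motif):
    """Number of positions at which window and motif agree."""
    return sum(a == b for a, b in zip(window, motif))

def best_promoter_match(seq):
    """Best score (0..12) over all placements of TTGACA-spacer-TATAAT in seq."""
    n = len(MINUS35)
    best = 0
    for i in range(len(seq) - n + 1):
        m35 = matches(seq[i:i + n], MINUS35)
        for spacer in SPACERS:
            j = i + n + spacer          # start of the candidate minus-10 site
            if j + n > len(seq):
                break                   # longer spacers would also run off the end
            best = max(best, m35 + matches(seq[j:j + n], MINUS10))
    return best

def random_sequence(length=103, gc=0.508, rng=None):
    """Random DNA with independent bases (assumed composition model)."""
    rng = rng or random
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # A, C, G, T
    return "".join(rng.choices("ACGT", weights=weights, k=length))

def match_distribution(n_sequences, seed=0):
    """Histogram of best match counts over n_sequences random sequences."""
    rng = random.Random(seed)
    return Counter(best_promoter_match(random_sequence(rng=rng))
                   for _ in range(n_sequences))
```

Calling `match_distribution(10000)` gives a histogram that can be compared with the distribution summarized in the text; a much larger sample, such as the 10 million sequences used in the study, would call for a vectorized implementation.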
The short mutational distance from random sequences to active promoters may act as a double-edged sword. On the one hand, the ability to rapidly “turn on” expression may provide plasticity and high evolvability to the transcriptional network. On the other hand, this ability may also impose substantial costs, as such a promiscuous transcription machinery is prone to expression of unnecessary gene fragments23. Such accidental expression is not only wasteful but can also be harmful as it may interfere with the normal expression of the genes within which it occurs24,25. Our data suggest that ∼10% of 100-base sequences are an active promoter, meaning that a typical ∼1kb gene might naturally contain an accidental promoter inside its coding sequence. Therefore, we looked for strategies that E. coli might have taken to minimize accidental expression. Normal promoters typically occur in the intergenic region between genes and not within the coding region. We assessed the occurrence of accidental promoters in the middle of E. coli genes (i.e. between the start codon of each gene till its stop codon). This coding region composes 88% of the E. coli genome. Since each amino acid can be encoded by multiple synonymous codons, every gene in the genome can be encoded in many alternative ways. We hypothesized that the E. coli genome avoids codon combinations that create promoter motifs in the middle of genes. Using promoter prediction software26,27, we found that the WT E. coli genome has much less accidental expression than what would be expected based on a random choice of codons to encode the same amino acids (while preserving the overall codon bias28, Figure 5A). The E. coli genome has therefore likely been under selection to avoid this accidental expression within the coding region of genes.
To assess the optimization level of each gene separately, we compared the accidental expression score of each WT gene to the scores of a thousand alternative recoded versions. Remarkably, we found that ∼40% of WT genes had accidental expression as low as the lowest decile of their recoded versions. Our data indicated that some E. coli genes minimize accidental expression more than others. Essential genes, for example, exhibit an even stronger signal of optimization compared to the general signal obtained for all genes together (Figure 5B). Essential genes are under stronger selective pressure to mitigate interference29,30 and therefore they better avoid accidental expression presumably because it leads to collisions with RNA polymerases that transcribe them31–33. We observed similar results when we used a recoding method in which we just shuffled the codons of each gene, again indicating that the E. coli genome has been under selection to minimize accidental expression (Supp. Figure 3, Methods). To further validate that the WT E. coli has depleted promoter motifs within its coding region, we performed a straightforward analysis by unbiased counting of motif occurrences across the genome. The analysis showed that promoter motifs are depleted from the middle of genes, especially the TATAAT motif (Methods, Supp. Table 2). Reassuringly, among this group of depleted motifs we also found the Shine-Dalgarno sequence (ribosome binding site)34. Therefore, evolution may have acted to minimize accidental expression by avoiding codon combinations with similarity to promoter motifs, thereby allowing E. coli to benefit from flexible transcription machinery while counteracting its detrimental consequences.
Discussion
Overall, our study suggests that the sequence recognition of the transcription machinery is rather permissive and not restrictive35 to the extent that the majority of non-specific sequences are on the verge of operating as active promoters. We found that the typical ∼100-base sequence requires only a single mutation to become an active promoter. Consequently, some small portion of non-specific sequences can function as active promoters even without any mutation. This low sequence specificity of the transcription machinery may explain part of the pervasive transcription seen in unexpected locations in bacterial genomes23 as well as the expression detected in large pools of plasmids that harbor degenerate sequences upstream to a reporter gene36. Despite the ability to avoid accidental expression by histone-like proteins37,38 and by depletion of promoter-like motifs, accidental expression might not always be detrimental and may sometimes be selected for. When we analyzed accidental expression in toxin/antitoxin gene couples39, we observed higher accidental expression in toxin genes compared with their antitoxin counterparts (Supp. Figure 4, Supplementary Information). Interestingly, when we split the accidental expression score into its ‘sense’ (same strand as the gene) and ‘antisense’ (opposite strand) components, we observed that toxins had a much stronger accidental expression in their antisense direction compared to the sense direction. However, in the antitoxins, sense and antisense scores correlated, as largely seen genome-wide (Supp. Figure 5). This leads us to speculate that E. coli might have utilized accidental expression as a means to restrain gene expression40,41 of specific genes, presumably by causing head-to-head collisions of RNA polymerases31–33.
Our main findings may be relevant to other organisms and to other DNA/RNA binding proteins like transcription factors. The number of mutations separating random sequences from any sequence feature should be considered when assessing possible “accidental recognition” and the ability of non-functional sequences to mutate into functional ones. We demonstrated that a random sequence is most likely to capture 8 out of 12 motif bases, while functional constitutive promoters usually capture 9/12. Furthermore, we experimentally demonstrated the implications of this numerical analysis by an evolutionary process that repeatedly found this “missing” mutation in order to exploit unutilized lactose. The implications of this study may therefore also prove useful to synthetic biology designs, as one needs to be aware that spacer sequences might not always be non-functional as assumed. Moreover, spacer sequences can be deliberately designed to have a lower probability of accidental functionality, for example a spacer that has particularly low chances of acting as a promoter (or RBS, or any other sequence motif).
Tuning a recognition system to be in a metastable state so that a minimal step can cause significant changes might serve as a mechanism by which cells improve their adaptability. If two or more mutations were needed in order to create a promoter from a non-functional sequence, cells would face a much greater fitness-landscape barrier that would drastically reduce the ability to evolve de novo promoters. The rate at which new adaptive traits appear in nature is remarkable, yet the mechanisms underlying this rapid pace are not always understood. As part of the effort to reveal such mechanisms42, our study suggests that the transcription machinery was tuned to be “probably approximately correct”43 as a means to rapidly evolve de novo promoters. Further work will be necessary to determine whether this flexibility in transcription is also present in higher organisms and in other recognition processes.
Methods
Strains
Strains were constructed using the Lambda-Red system20, including integration of random sequences as promoters using a chloramphenicol resistance gene for selection. For the strains with RandSeq9, 12, 15, 17, 18 and 23, however, integration was done by the Lambda-Red-CRISPR/Cas9 system without introducing a selection marker, in order to exclude transcriptional read-through due to the expression of an upstream selection gene. The ancestral strain for all 40 random-sequence strains, as well as for the control strains (WTpromoter and ΔLacOperon), was SX70019, in which lacY was tagged with YFP. In addition, the mutS gene was deleted (replaced by a gentamicin resistance gene) to achieve a higher yield of chromosomal integration using the Lambda-Red system44 and as a potential accelerator of evolution due to the increased mutation rate. For RandSeq1, 2 and 40 we created additional strains from an ancestor in which mutS was not deleted, and after similar evolution the exact same mutations arose. In all strains, lacI was deleted (for all but the CRISPR/Cas9 strains, by a spectinomycin resistance gene) and replaced by an extra double terminator (BioBricks BBa_B0015) to prevent transcriptional read-through from upstream genes.
Random sequences
Random sequences were generated computationally, each 103 bases long (the same length as the WT lac promoter they replaced). To prevent deviation from the overall GC content of E. coli (50.8%), sequences with GC content lower than 45.6% or higher than 56.0% were excluded. In addition, to avoid sequencing issues, sequences with homo-nucleotide stretches longer than five were excluded.
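A minimal sketch of such a generator is shown below, using rejection sampling. How the bases were originally drawn is not stated, so uniform independent sampling is an assumption here, and the function names are hypothetical.

```python
import random
import re

def gc_fraction(seq):
    """Fraction of G and C bases in seq."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def generate_random_sequence(length=103, gc_min=0.456, gc_max=0.560,
                             max_run=5, rng=None):
    """Draw random sequences until one passes both filters described in the text.

    Bases are drawn uniformly and independently (an assumption); sequences
    outside the GC window or with a homo-nucleotide run longer than max_run
    are rejected.
    """
    rng = rng or random.Random()
    # One base followed by max_run or more copies of itself: a run > max_run.
    run_pattern = re.compile(r"(.)\1{%d,}" % max_run)
    while True:
        seq = "".join(rng.choice("ACGT") for _ in range(length))
        if not gc_min <= gc_fraction(seq) <= gc_max:
            continue
        if run_pattern.search(seq):
            continue
        return seq

if __name__ == "__main__":
    rng = random.Random(42)
    for i in range(3):
        print(f"example {i + 1}:", generate_random_sequence(rng=rng))
```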
Selection for lactose utilization
Lab evolution was performed on liquid cultures grown on M9+GlyLac by daily dilution of 1:100 into fresh medium. M9 base medium for 1 L included 100 µL of 1 M CaCl2, 2 mL of 1 M MgSO4, 10 mL of 2 M NH4Cl, and 200 mL of 5x M9 salts solution (Sigma Aldrich). Concentrations of carbon source were 0.05% glycerol and 0.2% lactose for M9+GlyLac, 0.2% lactose for M9+Lac, and 0.4% glycerol for M9+Gly (all w/v). Cultures were routinely checked for increased yield at saturation, and samples were plated on M9+Lac plates for isolation of colonies that can utilize lactose as a sole carbon source. In parallel to our liquid M9+GlyLac selection for lactose utilization, we also performed agar-plate selection by growing random-sequence strains on non-selective medium (M9+Gly) and then plating them, while in late logarithmic phase, on M9+Lac plates to select for lactose-utilizing colonies. All populations were evolved in parallel duplicates, except RandSeq1, 2 and 3, which had four replicates.
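As a rough consistency check on the serial-dilution regime, the number of generations per daily transfer can be estimated as log2 of the dilution factor. The sketch below assumes that each culture regrows to its previous density every day; the function names are hypothetical.

```python
import math

def generations_per_transfer(dilution_factor):
    """Doublings needed to regrow after a 1:dilution_factor dilution."""
    return math.log2(dilution_factor)

def total_generations(days, dilution_factor=100):
    """Generations accumulated over a number of daily transfers."""
    return days * generations_per_transfer(dilution_factor)

if __name__ == "__main__":
    for days in (7, 14):
        print(f"{days} days: about {total_generations(days):.0f} generations")
```

With a 1:100 dilution this gives about 6.6 generations per day, so one to two weeks correspond to roughly 47 to 93 generations, in line with the figure of fewer than 100 generations given in the main text.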
Quantifying growth and expression
Growth curves were obtained by measuring OD600 every 10 min for 24 h. Lac operon expression was quantified by YFP fluorescence measurements. Both measurements were performed with a Tecan M200 plate reader.
E. coli genomic data
Lists of essential genes and prophage genes were downloaded from EcoGene45, and a list of toxin-antitoxin gene couples was obtained from EcoCyc39. Coding sequences of genes were downloaded from GenBank (K-12 substr. MG1655, U00096).
Recoding the coding sequence of E. coli genes
To create alternative versions of the coding region we recoded all translated genes in E. coli (n=4261) 1000 different times while preserving the amino acid sequence and codon bias. As another null model we also shuffled the codons of each gene in 1000 permutations. Although a shuffled version of a gene does not preserve the amino acid sequence, it exactly preserves the GC content of each gene, and thus it controls for another aspect that may result in accidental expression.
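The two null models can be sketched as below. This is an illustration under stated assumptions rather than the procedure actually used: synonymous codons are sampled in proportion to usage counts pooled over the supplied genes, all codons (including start and stop) are treated alike, and the standard genetic code is used. The function names are hypothetical.

```python
import random
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

SYNONYMS = defaultdict(list)
for _codon, _aa in CODON_TABLE.items():
    SYNONYMS[_aa].append(_codon)

def split_codons(seq):
    """Split a coding sequence into complete codons."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

def translate(seq):
    return "".join(CODON_TABLE[c] for c in split_codons(seq))

def codon_usage(genes):
    """Pooled codon counts, used here as the codon-bias model (assumption)."""
    counts = Counter()
    for gene in genes:
        counts.update(split_codons(gene))
    return counts

def recode(gene, usage, rng):
    """Replace each codon by a synonymous one drawn according to usage."""
    out = []
    for codon in split_codons(gene):
        synonyms = SYNONYMS[CODON_TABLE[codon]]
        weights = [usage.get(c, 0) for c in synonyms]
        if not any(weights):
            weights = None  # no usage data for this amino acid: draw uniformly
        out.append(rng.choices(synonyms, weights=weights)[0])
    return "".join(out)

def shuffle_codons(gene, rng):
    """Permute the codons of a gene; keeps codon and GC content, not the protein."""
    codons = split_codons(gene)
    rng.shuffle(codons)
    return "".join(codons)
```

Calling `recode` 1000 times per gene with a seeded `random.Random` would yield one set of alternative versions; `shuffle_codons` gives the second null model.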
Promoter prediction
Using the output from BPROM26,27 we obtained predicted expression scores by combining the scores of the minus-10 site and the minus-35 site and factoring in the prediction score (LDF) from the output by multiplying. In addition, we scanned sequences for promoters by running a sliding window with the canonical motif and identified regions with maximal agreement.
Six-mer analysis
To look for depleted and over-represented motifs, we counted the occurrences of all six-mers within the coding region of E. coli. We compiled a list of all 4096 possible six-mers and counted how many times each six-mer occurs in the entire WT coding region compared with the 1000 recoded versions. Then, we focused on six-mers that are significantly rare or abundant in the WT version compared with their counts in the recoded versions.
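The counting step can be sketched as follows. The text does not specify the significance test, so the empirical fraction used here is only one possible choice and is an assumption, as are the function names.

```python
from collections import Counter
from itertools import product

def count_sixmers(sequences, k=6):
    """Count every overlapping k-mer across sequences; all 4**k k-mers are keys."""
    counts = Counter({"".join(p): 0 for p in product("ACGT", repeat=k)})
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in counts:      # skips windows containing N or other symbols
                counts[kmer] += 1
    return counts

def fraction_at_or_below(wt_count, recoded_counts):
    """Fraction of recoded versions whose count is <= the WT count.

    Values near 0 suggest depletion in the WT, values near 1 suggest
    enrichment (an illustrative statistic, not the one from the source).
    """
    return sum(c <= wt_count for c in recoded_counts) / len(recoded_counts)

if __name__ == "__main__":
    wt = count_sixmers(["ATGTATAATGCGTTGACATAA"])  # toy sequence, not real data
    print(wt["TATAAT"], wt["TTGACA"])
```

In a full analysis, `count_sixmers` would be run once on the WT coding sequences and once on each recoded genome, and `fraction_at_or_below` would then be evaluated per six-mer.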
Acknowledgments
We thank the Human Frontier Science Program for supporting A.H.Y. Special thanks to Idan Frumkin, Rebecca Herbst and members of the Gorelab and the Almlab for fruitful discussions. We thank the Xie lab for providing strains and Gene-Wei Li, Jean-Benoit Lalanne and Tami Lieberman for their helpful comments on the manuscript.