Summary
The gene regulatory architecture associated with photosynthesis is poorly understood. Most plants use the ancestral C3 pathway, but our most productive cereal crops use C4 photosynthesis. In these C4 cereals, large-scale alterations to gene expression allow photosynthesis to be partitioned between cell types of the leaf. Here we provide a genome-wide transcription factor binding atlas for grasses that operate either C3 or C4 photosynthesis. Most of the >950,000 sites bound by transcription factors are preferentially located in genic sequence rather than promoter regions, and specific families of transcription factors preferentially bind coding sequence. Cell specific patterning of gene expression in C4 leaves is associated with combinatorial modifications to transcription factor binding despite broadly similar patterns of DNA accessibility between cell types. A small number of DNA motifs bound by transcription factors are conserved across 60 million years of grass evolution, and C4 evolution has repeatedly co-opted at least one of these hyper-conserved cis-elements. The grass cistrome is highly divergent from that of the model plant Arabidopsis thaliana.
Introduction
Photosynthesis sets maximum crop yield, but despite millions of years of natural selection is not optimised for either current atmospheric conditions or agricultural practices (Long et al., 2015; Ort et al., 2015). The majority of photosynthetic organisms, including crops of global importance such as wheat, rice and potato, use the C3 photosynthesis pathway in which Ribulose-Bisphosphate Carboxylase Oxygenase (RuBisCO) catalyses the primary fixation of CO2. However, carboxylation by RuBisCO is competitively inhibited by oxygen binding the active site (Bowes et al., 1971). This oxygenation reaction generates toxic waste-products that are recycled by an energy-demanding series of metabolic reactions known as photorespiration (Bauwe et al., 2010; Tolbert, 1971). The ratio of oxygenation to carboxylation increases with temperature (Jordan and Ogren, 1984; Sharwood et al., 2016) and so losses from photorespiration are particularly high in the tropics. When oxygenation is reduced through CO2 enrichment, crops show increased photosynthetic efficiency and higher yields (Leakey et al., 2012). In addition to the inefficiency associated with oxygenation by RuBisCO, due to the rapid rise in atmospheric CO2 concentrations from ~220 to 400ppm, the stoichiometry and kinetics of other photosynthesis enzymes are considered sub-optimal. For example, increased activity of Sedoheptulose 1,7-bisphosphatase improves photosynthesis and yield (Lefebvre et al., 2005; Miyagawa et al., 2001). Furthermore, the ability of leaves to harness light and generate chemical energy is optimised neither for current crop canopy structures (Zhu et al., 2010b) nor for rapid fluctuations in light (Kromdijk et al., 2016). Thus, although it is clear that improving C3 photosynthesis would drive increased crop yields, we have an incomplete understanding of the genes that underpin this fundamental process.
The inefficiencies associated with C3 photosynthesis in the tropics have led to multiple plant lineages evolving mechanisms that suppress oxygenation by concentrating CO2 around RuBisCO. One such evolutionary strategy is known as C4 photosynthesis. Species that use the C4 pathway include maize, sorghum and sugarcane, and they represent the most productive crops on the planet (Sage and Zhu, 2011). In C4 leaves, additional expenditure of ATP, alterations to leaf anatomy and cellular ultrastructure, as well as spatial separation of photosynthesis between compartments (Hatch, 1987) allow CO2 concentration to be increased around tenfold compared with that in the atmosphere (Furbank, 2011). Despite the complexity of C4 photosynthesis, it is found in over 60 independent plant lineages and so represents one of the most remarkable examples of convergent evolution known to biology (Sage et al., 2011). In most C4 plants the initial RuBisCO-independent fixation of CO2 and the subsequent RuBisCO-dependent reactions take place in distinct cell-types known as mesophyll and bundle sheath cells, and so are associated with strict patterning of gene expression between these cell-types. Although the spatial patterning of gene expression is fundamental to C4 photosynthesis, very few examples of cis-elements or trans-factors that generate cell-preferential expression required for C4 photosynthesis have been identified (Brown et al., 2011; Gowik et al., 2004; Williams et al., 2016). In summary, in both C3 and C4 species, work has focussed on analysis of mechanisms controlling the expression of individual genes, and so our understanding of the overall landscape associated with photosynthesis gene expression is poor.
In yeast and animal systems, the high sensitivity of open chromatin to DNase-I (Zentner and Henikoff, 2014) has allowed comprehensive, genome-wide characterisation of transcription factor binding sites at single nucleotide resolution (Hesselberth et al., 2009; Neph et al., 2012; Thurman et al., 2012). Despite the power of this approach to define regulatory DNA and the likely transcription factors binding these sequences, it has not yet been used to provide insight into the regulatory architecture of photosynthetic leaves of major crops except rice (Zhang et al., 2012b). Moreover, although leaves are composed of multiple distinct cell-types, differences in transcription factor binding between cells have not yet been assessed in plants. By carrying out DNase I-SEQ on grasses that use either C3 or C4 photosynthesis, we provide comprehensive insight into the transcription factor binding repertoire associated with each form of photosynthesis. The data indicate that specific cell types from leaf tissue make use of a markedly distinct cis-regulatory code, and that transcription factor binding is more frequent within genes than promoter regions. Despite significant conservation in the transcription factor families binding DNA in grasses, it is also apparent that the binding sites they recognise are subject to high rates of mutation. Comparison of sites bound by transcription factors in both C3 and C4 leaves demonstrates that the repeated evolution of C4 photosynthesis is built on a combination of de novo gain of cis-elements and exaptation of highly conserved regulatory elements found in the ancestral C3 system.
Results
A cis-regulatory atlas for monocotyledons
Four grass species were selected to provide insight into the regulatory architecture associated with C3 and C4 photosynthesis. Brachypodium distachyon uses the ancestral C3 pathway (Figure 1A). Sorghum bicolor, Zea mays and Setaria italica all use C4 photosynthesis although phylogenetic reconstructions indicate that S. italica represents an independent evolutionary origin of the C4 pathway (Figure 1A). Nuclei from leaves of S. italica (C4), S. bicolor (C4), Z. mays (C4) and B. distachyon (C3) were treated with DNase I (Figure S1) and subjected to deep sequencing. A total of 799,135,794 reads could be mapped to the respective genome sequences of these species (Table S1). Across the four genomes, 159,396 DNase-hypersensitive sites (DHS) of between 150 and 15,060 base pairs, representing broad regulatory regions accessible to transcription factor binding, were identified (Figure 1B). Between 20,817 and 27,746 genes were annotated as containing at least one DHS (Table S2). Within these DHS, 533,409 digital genomic footprints (DGF) corresponding to individual transcription factor binding sites of between 11 and 25 base pairs were identified through differential accumulation of reads mapping to positive or negative strands around transcription factor binding sites (Figure 1B&C). At least one transcription factor footprint was identified in >75% of the broader regions defined by DHS (Table S2). In contrast to the preferential binding of transcription factors to AT-rich DNA observed in A. thaliana, in all four grasses DGF had a greater GC content compared with the genome average (Table S3).
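The comparison of GC content between footprints and the genome average can be illustrated with a minimal sketch. It assumes footprint and genome sequences are available as plain strings; the function names and the toy sequences are hypothetical and are not taken from the analysis pipeline used in this study.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G or C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def pooled_gc(seqs) -> float:
    """Length-weighted GC fraction over a collection of sequences."""
    return gc_fraction("".join(seqs))


# Toy sequences invented for illustration only.
genome = "ATATATGCGCATATATATGCATATATAT"
footprints = ["GCGCATGCGCC", "ATGCGCGCATG"]

print(f"genome GC: {gc_fraction(genome):.2f}")
print(f"footprint GC: {pooled_gc(footprints):.2f}")
```

Pooling the footprints before computing the fraction weights each footprint by its length, which is one reasonable choice when footprints range from 11 to 25 base pairs; averaging per-footprint fractions would be an alternative.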
DHS and DGF were primarily located in gene-rich regions, and depleted around centromeres (Figure 1D). Individual transcription factor binding sequences were resolved in all chromosomes from each species (Figure 1D). Many genes contained DGF that have previously been associated with specific classes of transcription factors. For example, the SbPPC gene (Sobic.010G160700) encoding phosphoenolpyruvate carboxylase that catalyses the first committed step of C4 photosynthesis, contained sixteen DGF of which six are associated with known transcription factor families (Figure 1D). On a genome-wide basis, the distribution of DGF was similar between species, with the highest proportion of such sites located in promoter, coding sequence (CDS) and intergenic regions (Figure 1E). However, when normalised to the length of such regions, the density of transcription factor recognition sites was highest in 5’ untranslated regions (UTRs), coding sequences (CDS) and 3’ UTRs (Figure 1F). In all four species promoter regions contained fewer DGF than genic sequence (Figure 1F) and distribution plots showed that the density of DGF was highest in exonic sequences downstream of the annotated transcriptional start sites (Figure S2).
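The distinction between the raw proportion of DGF in each genomic feature and the density normalised to feature length can be sketched as follows. The counts and lengths are invented for illustration and the function name is hypothetical; the sketch only shows why a feature class with many footprints (such as intergenic sequence) can still have a low density.

```python
def footprint_density(counts, lengths_bp, per_bp=1000):
    """Return footprints per `per_bp` base pairs for each feature class."""
    density = {}
    for feature, n in counts.items():
        length = lengths_bp.get(feature, 0)
        if length > 0:  # skip features with no annotated sequence
            density[feature] = n * per_bp / length
    return density


# Invented footprint counts and total feature lengths (bp), illustration only.
counts = {"promoter": 3000, "5' UTR": 900, "CDS": 3500, "intergenic": 4000}
lengths_bp = {
    "promoter": 2_000_000,
    "5' UTR": 200_000,
    "CDS": 1_000_000,
    "intergenic": 20_000_000,
}

density = footprint_density(counts, lengths_bp)
for feature in sorted(density, key=density.get, reverse=True):
    print(f"{feature}: {density[feature]:.2f} DGF per kb")
```

In this toy example intergenic sequence holds the largest number of footprints but the lowest density, whereas the 5' UTR holds the fewest footprints but the highest density.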
A distinct cis-regulatory lexicon for specific cells within the leaf
The above analysis provides a genome-wide overview of the cis-regulatory architecture associated with photosynthesis in leaves of grasses. However, as with other complex multicellular systems, leaves are composed of many specialised cell types. Because each DGF is defined by fewer sequencing reads mapping to the genome compared with the larger DHS region, depleted signals derived from low abundance cell-types cannot be detected from such tissue-level analysis (Figure 2A). Since bundle sheath strands can be separated easily (Covshoff et al., 2013) leaves of C4 species provide a simple system to study transcription factor binding in specific cells (Figure 2B). After bundle sheath isolation from S. bicolor, S. italica and Z. mays a total of 129,137 DHS were identified (Figure 2B; Table S4) containing 452,263 DGF (Figure 2B; Table S4; FDR<0.01). Of the 452,263 DGF identified in bundle sheath strands, 170,114 were statistically enriched in the bundle sheath samples compared with whole leaves (Figure 2B; Table S4). The number of these statistically enriched DGF in bundle sheath strands of C4 species was large and ranged from 15,880 to 85,256 in maize and S. italica respectively (Figure S3). Genome-wide, the number of broad regulatory regions defined by DHS in the bundle sheath that overlapped with those present in whole leaves ranged from 71 to 84% in S. italica and S. bicolor respectively (Table S5). However, only 8-23% of the narrower DGF found in the bundle sheath were also identified in whole leaves (Table S6). Taken together, these findings indicate that specific cell types of leaves share considerable similarity in the broad regions of DNA that are accessible to transcription factors, but that the short sequences actually bound by transcription factors vary dramatically.
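The overlap statistics described above (the fraction of bundle sheath regions that are also present in whole leaves) can be sketched as an interval comparison. This is a minimal illustration under stated assumptions: regions are represented as (chromosome, start, end) tuples with half-open coordinates, any overlap of at least one base pair counts as shared, and all names are hypothetical. The criteria actually used to call shared DHS or DGF in this study may differ.

```python
from bisect import bisect_left
from collections import defaultdict


def build_index(intervals):
    """Per chromosome: sorted start positions and running maximum of ends."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, ivs in by_chrom.items():
        ivs.sort()
        starts = [s for s, _ in ivs]
        max_end, running = [], 0
        for _, e in ivs:
            running = max(running, e)
            max_end.append(running)
        index[chrom] = (starts, max_end)
    return index


def overlaps_any(index, chrom, start, end):
    """True if [start, end) overlaps at least one indexed interval."""
    if chrom not in index:
        return False
    starts, max_end = index[chrom]
    i = bisect_left(starts, end)  # number of intervals starting before `end`
    return i > 0 and max_end[i - 1] > start


def fraction_shared(query, reference):
    """Fraction of query intervals overlapping any reference interval."""
    if not query:
        return 0.0
    index = build_index(reference)
    hits = sum(overlaps_any(index, *iv) for iv in query)
    return hits / len(query)


# Invented coordinates, illustration only.
whole_leaf = [("chr1", 100, 200), ("chr1", 500, 600)]
bundle_sheath = [("chr1", 150, 250), ("chr1", 200, 300),
                 ("chr1", 590, 700), ("chr2", 100, 200)]
print(fraction_shared(bundle_sheath, whole_leaf))
```

Applying the same function to broad DHS and to narrow DGF would give the two contrasting percentages discussed in the text, since short footprints are far less likely to coincide between samples than the broad regions that contain them.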
To provide evidence that DGF predicted after analysis of separated bundle sheath strands are of functional importance, they were compared with previously validated cis-elements. In C4 species, there are few such examples, but in the maize RbcS gene, which is preferentially expressed in bundle sheath cells, an I-box (GATAAG) is essential for light-mediated activation (Giuliano et al., 1988) and a HOMO motif (CCTTTTTTCCTT) is important in driving bundle sheath expression (Xu et al., 2001) (Figure 2C). Both elements were detected in our pipeline. Interestingly, the HOMO motif was only bound in the bundle sheath strands (Figure 2C), and whilst the I-box was detected in both bundle sheath strands and whole leaves, the position of the DGF covering it was slightly shifted between the two samples (Figure 2C). Thus, orthogonal evidence for transcription factor binding in maize supports a functional role for DGF identified by DNaseI-SEQ in this study.
To investigate the relationship between cell specific gene expression and the position of DHS and DGF, the DNase-I data were compared with RNA-seq datasets from mesophyll and bundle sheath cells of C4 leaves (Chang et al., 2012; Emms et al., 2016; John et al., 2014). At least three mechanisms associated with cell specific gene expression operating around individual genes were identified, and can be exemplified using three co-linear genes found on chromosome seven of S. bicolor. First, in the NADP-malate dehydrogenase (MDH) gene, which is highly expressed in mesophyll cells and encodes a protein of the core C4 cycle (Figure S4) a broad DHS site was present in whole leaves, but not in bundle sheath strands (Figure 2D). Whilst presence of this site indicates accessibility of DNA to transcription factors that could activate expression in mesophyll cells, global analysis of all genes strongly and preferentially expressed in bundle sheath strands indicates that presence/absence of a DHS in the bundle sheath compared with the whole leaf is not sufficient to generate cell specificity (Figure S5, S6). Second, in the next contiguous gene that encodes an additional isoform of MDH that is also preferentially expressed in mesophyll cells (Figure S4) a DHS was found in both whole leaf and bundle sheath strands but DGF occupancy within this region differed between cell types (Figure 2D). Thus, despite similarity in DNA accessibility, the binding of particular transcription factors was different between cell types. However, once again, genome-wide analysis indicated that alterations to individual DGF were not sufficient to explain cell specific gene expression. For example, fewer than 30% of all enriched DGF in the bundle sheath were associated with differentially expressed genes (Table S7). 
Lastly, in the third gene in this region, which encodes a NAC domain transcription factor preferentially expressed in bundle sheath strands, differentially enriched DGF were associated both with regions of the gene that have similar DHS in each cell type, but also a region lacking a DHS in whole leaves compared with bundle sheath strands (Figure 2D). These three classes of alteration to transcription factor accessibility and binding were detectable in genes encoding core components of the C4 cycle (Figure 2E, Figure S7) implying that a complex mosaic of altered transcription factor binding mediates the cell specific expression found in the C4 leaf.
Overall, we conclude that differences in transcription factor binding between cells are associated with both DNA accessibility defined by broad DHS, as well as fine-scale alterations to transcription factor binding defined by DGF. In addition, although bundle sheath strands possessed a distinct regulatory landscape compared with the whole leaf, we were unable to identify examples of C4 genes in which individual transcription factor binding sites differed between bundle sheath and whole leaf samples. This finding implies that cell specific gene expression in C4 leaves is mediated by a complex mixture of combinatorial effects mediated by alterations to gene accessibility as defined by DHS, but also changes to binding of multiple transcription factors to each C4 gene.
Transcription factor families associated with cell specific expression
The cistrome, or set of transcription factor binding sites found in a genome, has been determined for A. thaliana and, to date, consists of 872 experimentally verified motifs linked to 529 transcription factors (O’Malley et al., 2016). Of these 872 motifs from A. thaliana, 525 could be identified in the Z. mays, S. bicolor, S. italica and B. distachyon datasets (Figure 3A). However, within individual species fewer motifs were detected and so de novo prediction was used to identify sequences over-represented in DGF compared with those across the whole genome. This resulted in an additional 524 novel motifs being annotated (Figure 3A). Inspection of these motifs predicted de novo demonstrated clear strand bias in DNase-I cuts (Figure 3B) as would be expected from bona fide transcription factor binding. By combining known and de novo motifs, the percentage of DGF that could be annotated in each species increased to more than 41% (Figure 3C). The relatively high number of motifs defined by transcription factor binding sites predicted de novo is presumably due to the significant evolutionary time since grasses diverged from A. thaliana.
To define the most common motifs actually bound by transcription factors in mature leaves undertaking C3 and C4 photosynthesis, the frequency of individual motifs was determined and ranked from most to least common in each species. The relative ranking of motifs in the four grasses was similar (Figure 3D). Visualisation of transcription factor families associated with these DGF in word clouds showed that the most prevalent motifs are associated with the AP2-EREBP and C2H2 transcription factor families (Figure 3D). These findings indicate that across these four grasses the most commonly bound transcription factor motifs are highly conserved. There was much less conservation between transcription factor binding sites in photosynthetic leaves of these monocotyledons compared with the dicotyledon A. thaliana (Figure 3D). This finding, combined with the large number of motifs from A. thaliana not detected in the grasses (Figure 3A), argues for significant divergence in the cistromes of monocotyledons and dicotyledons.
To investigate whether particular classes of transcription factor binding motifs are associated with specific genomic features, the proportion of each motif found in promoter elements, 5’ UTRs, coding regions, introns and 3’ UTR sequences was defined (Figure S8). In most cases, the distribution of individual motifs was similar in all genomic features; however, it was noticeable that a set of motifs was particularly common in coding sequence (Figure S8). Clustering analysis indicated that a set of 96 transcription factor motifs were strongly associated with coding sequences in all four grass species (Figure S9B, S10). The clear strand-bias indicates strong protein-DNA interaction centred on these motifs within coding sequences (Figure S9C). Sequences that carry out a dual role in both coding for amino acids and in transcription factor binding have been termed duons. Thus, in grasses it appears that duons are recognised by a specific set of transcription factors.
To identify regulatory factors associated with gene expression in the C4 bundle sheath, transcription factor motifs located in DGF enriched in either the bundle sheath or in whole leaf samples of S. bicolor were identified (Figure 3E). There was little difference in the ranking of the most prevalent motifs between these cell types (Figure 3E&F). For example, the AP2-EREBP and C2H2 families were dominant in both bundle sheath and whole leaf samples, indicating that cell-specificity is not associated with large-scale changes in the relative importance of transcription factor binding. However, in terms of prevalence, a small number of transcription factor binding motifs were ranked in the top 50% in whole leaves but the bottom 50% in bundle sheath strands (Figure 3F). This finding implies that quantitative modifications to the use of particular transcription factor families are associated with the spatial patterning of gene expression that is a hallmark of C4 photosynthesis.
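The rank-based comparison above (motifs in the top 50% by prevalence in whole leaves but in the bottom 50% in bundle sheath strands) can be sketched as follows. The motif names and counts are invented, the function names are hypothetical, and ties are broken alphabetically purely so that the example is deterministic; this is an illustration of the idea and not the procedure used to generate Figure 3F.

```python
def rank_percentiles(counts):
    """Map motif -> rank percentile in [0, 1); 0 is the most frequent motif."""
    ordered = sorted(counts, key=lambda m: (-counts[m], m))
    n = len(ordered)
    return {motif: i / n for i, motif in enumerate(ordered)}


def switched_motifs(whole_leaf, bundle_sheath):
    """Motifs in the top half in whole leaf but the bottom half in bundle sheath."""
    wl = rank_percentiles(whole_leaf)
    bs = rank_percentiles(bundle_sheath)
    shared = wl.keys() & bs.keys()
    return sorted(m for m in shared if wl[m] < 0.5 <= bs[m])


# Invented motif counts, illustration only.
whole_leaf = {"motifA": 100, "motifB": 80, "motifC": 10, "motifD": 5}
bundle_sheath = {"motifA": 90, "motifB": 2, "motifC": 50, "motifD": 1}

print(switched_motifs(whole_leaf, bundle_sheath))
```

Here motifA stays at the top of both rankings, while motifB falls from the top half in the whole leaf to the bottom half in the bundle sheath and is therefore reported.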
Further analysis revealed that in all three C4 species, motifs recognised by C2C2GATA, bZIP, bHLH, BZR and TCP transcription factors were enriched in whole leaf samples, whereas those bound by ARID transcription factors were enriched in the bundle sheath (Figure 3G and Table S9). Moreover, analysis of the cell-specific transcript accumulation of members of the C2C2-GATA family, revealed one orthologue which was consistently mesophyll enriched in all three C4 species (GRMZM2G379005, Seita.1G358400, Sobic.004G337500; Figure 3H). Thus, these data implicate these transcription factor families in controlling cell-specific gene expression in C4 leaves, and indicate that in some cases, separate C4 lineages appear to be using orthologous transcription factors to drive cell specific expression.
Transcription factor binding sites that are conserved but mobile
As B. distachyon, S. bicolor, Z. mays and S. italica are thought to have diverged from a common ancestor around 60 million years ago (The International Brachypodium Initiative, 2010) they provide an opportunity to examine the extent to which the cis-regulatory code has diverged since that point. Furthermore, whilst the last common ancestor of Z. mays and S. bicolor was thought to use C4 photosynthesis, S. italica belongs to a separate C4 lineage (Zhang et al., 2012a). Thus, comparative analysis of these species should provide insight into the extent to which the cis-regulatory architecture is conserved in the grasses, and how it has been modified during the evolution of C4 photosynthesis.
In pairwise comparisons of the four species, DGF fell into three categories: those for which homologous sequences were both present and bound by a transcription factor (conserved and occupied), those for which homologous sequences were present but were only bound by a transcription factor in one species (conserved but not occupied) and those for which no sequence homology could be found (not conserved) (Figure 4A). Only a small percentage of DGF were both conserved in sequence and bound by transcription factors (Figure 4B, Table S8). DGF that were conserved but unoccupied were the next most abundant group (Figure 4B) but the majority of DGF were not conserved (Figure 4B, Table S8). These data indicate substantial turnover in the cis-code associated with the transcription factor binding repertoire of monocotyledons.
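The three categories used in these pairwise comparisons amount to a simple decision rule, sketched below. The sketch assumes that each footprint from one species has already been mapped to a homologous region in the other species (or to none) and that the set of regions occupied in the other species is known. The identifiers and data structures are hypothetical and stand in for the outputs of sequence alignment and footprint calling.

```python
from collections import Counter


def classify_dgf(homolog, occupied_in_other):
    """Assign a footprint to one of the three pairwise categories."""
    if homolog is None:
        return "not conserved"
    if homolog in occupied_in_other:
        return "conserved and occupied"
    return "conserved but not occupied"


# Hypothetical footprints from species A mapped to homologous regions in
# species B (None where no sequence homology was found).
homologs = {"dgf1": "regB7", "dgf2": None, "dgf3": "regB9", "dgf4": None}
occupied_in_b = {"regB7"}  # regions of species B bound by a transcription factor

tally = Counter(classify_dgf(h, occupied_in_b) for h in homologs.values())
for category, n in tally.items():
    print(f"{category}: {n}")
```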
Consistent with the rapid turnover of DGF documented genome-wide (Figure 4B), the majority of C4 genes did not share DGF (Table S10 and S11). However, three genes associated with the core C4 and the Calvin-Benson-Bassham cycle that are strongly expressed in either bundle sheath or mesophyll cells contained the same cis-elements bound by a transcription factor in all three C4 species. For example, in the 2-oxoglutarate/malate transporter (OMT1) gene, four sites defined by transcription factor binding were detected in all three C4 species (Figure 4C). However, the position of these sites within the gene varied in each species. In the transketolase (TKL) gene that is preferentially expressed in bundle sheath cells, three conserved motifs defined by transcription factor binding were detected in all C4 species, but they were also all found in the C3 species B. distachyon (Figure 4D). Thus, in some cases patterning of C4 gene expression appears linked to pre-existing regulatory architecture operating in the ancestral C3 state, but in cases where the cis-regulatory code associated with C4 gene expression is strongly conserved the position of these transcription factor binding sites within any gene is variable.
Hyper-conserved cis-regulators found in coding sequences of C4 genes
To investigate the extent to which transcription factor binding sites associated with C4 genes within a C4 lineage are conserved, genes encoding the core C4 cycle were compared in S. bicolor and Z. mays (Figure 5A). Twenty-eight genes associated with the C4 and Calvin-Benson-Bassham cycles contained a total of 531 DGF. Although many of these transcription factor footprints were conserved in sequence within orthologous genes, only twenty were both conserved and bound by a transcription factor (Figure 5A). These data therefore indicate that although many cis-elements found in orthologous genes of the C4 cycle are conserved in sequence, only a small proportion were bound by a transcription factor at the time of sampling.
Genome-wide, the number of DGF that were conserved in sequence and bound by a transcription factor decayed in a non-linear manner with phylogenetic distance (Figure 5B). For example, Z. mays and S. bicolor shared 9,446 DGF that were both conserved and occupied. S. italica shared only 1,194 DGF with Z. mays and S. bicolor (Figure 5B). Finally, comparison of these C4 grasses with C3 B. distachyon yielded 192 DGF that have been conserved over >60Myr of evolution. 95 of these highly conserved DGF were present in whole leaf samples of the C3 species, but in the C4 species were restricted to the bundle sheath (Figure 5B). This set of 192 ancient and highly conserved DGF were located predominantly in 5’ UTRs and coding sequence and strikingly, in bundle sheath strands, over fifty percent of these hyper-conserved DGF were in coding sequence (Figure 5B).
One such hyper-conserved DGF is found in the NdhM gene that encodes a subunit of the NADH complex that preferentially assembles in bundle sheath cells of C4 plants (Figure 5C) but it is not known how this evolved. In the ancestral C3 state a hyper-conserved DGF is found in whole leaves of B. distachyon (Figure 5D). However, in all three C4 species rather than this DGF being detected in whole leaf material, it is detected in the bundle sheath. It is also noticeable that this motif has proliferated within the gene in the C4 species compared with C3 B. distachyon, and in maize and sorghum is also found in the 5’ UTR as well as coding sequence. Furthermore, in whole leaf samples of these C4 species, transcription factor binding is shifted upstream or downstream (Figure 5D). We therefore propose that preferential expression of NdhM in the bundle sheath is built upon a cis-regulator present in the C3 state that activates expression in all photosynthetic cells of the leaf. During the evolution of C4 photosynthesis, whilst accessibility of this ancient and highly conserved cis-element is maintained in the bundle sheath to allow expression of NdhM, in mesophyll cells an additional transcription factor(s) binds flanking sequence that blocks access to this pre-existing architecture. These findings are consistent with hyper-conserved DGF located in coding sequence playing an important role in the cell specific gene expression required in leaves of C4 grasses.
As genome-wide analysis indicated that a specific group of DGF was associated with coding sequence (Figure S8-S10) we investigated whether motifs associated with the 192 hyper-conserved DGF found in all four grasses were over-represented in this set. Remarkably, of the 96 families of transcription factors strongly associated with binding motifs in coding sequence (Figure S10), 47 and 55 were hyper-conserved in the whole leaf and bundle sheath respectively and the ERF family was particularly common (Figure S11, S12). Overall, these data indicate that in these grasses specific families of transcription factors are particularly important in binding coding sequences, and that the duons bound by these transcription factors are highly conserved across deep evolutionary time.
Discussion
Genome-wide transcription factor binding in grasses
The data presented here provide insight into genome-wide binding of transcription factors in photosynthetic tissue of grasses, including maize and sorghum, which represent two of the world’s most productive crops. This transcription factor binding landscape shows both similarities and differences with other eukaryotic systems. For example, in contrast with A. thaliana in which AT-rich DNA is preferentially bound, the grasses showed preferential binding of transcription factors to GC-rich DNA. Preference for GC-rich DNA has also been observed in humans (Wang et al., 2012) and so the differences in binding likely reflect the relative proportion of nucleotides in each genome. In all these eukaryotes, individual genes are bound by a complex mosaic of transcription factors distributed across major genic features including promoter regions, UTRs and coding sequence. However, in grasses this standard architecture exemplified by yeast, animals and A. thaliana appears to have been modified such that a much higher proportion of transcription factor footprints are located in exonic and coding regions. For example, in human cells ~3% of transcription factor binding sites are exonic (Stergachis et al., 2013). In contrast, in grass leaves studied here up to 36% and 25% of transcription factor binding sites were located in exonic and coding sequence respectively. This finding is supported by the following observations. First, within individual genes the distribution of transcription factor binding sites peaked after the predicted transcriptional start site. Second, in grasses, strong and reproducible expression of transgenes is routinely achieved by fusing 5’ exon and intron sequence to the promoter of interest (Cornejo et al., 1993; Jeon et al., 2000; Maas et al., 1991).
Third, although the functional importance of transcription factor binding to coding sequences has been debated in animals (Xing and He, 2015), in grasses these motifs are bound by specific families of transcription factors, and so it is not the case that all transcription factors contribute to this non-random distribution. Moreover, in plants functional analysis has now indicated that duons can control the patterning of gene expression (Reyna-Llorens et al., 2016). Although it is unclear why transcription factor binding in grasses should be particularly prevalent in 5’ UTR and coding sequences, these findings combined with the available literature argue for duons and the cognate transcription factors that bind them being of pervasive importance in grass genomes.
The transcription factor landscape underpinning gene expression in specific cell types
Given the central importance of cellular compartmentation to C4 photosynthesis, there have been significant efforts to identify cis-elements that restrict gene expression to either mesophyll or bundle sheath cells of C4 leaves (Hibberd and Covshoff, 2010; Sheen, 1999; Wang et al., 2014). As in many other systems, initial analysis focussed on regulatory elements located in promoters of C4 genes (Sheen, 1999), but it has become increasingly apparent that the patterning of gene expression between cells in the C4 leaf can be mediated by elements in various parts of a gene. This includes untranslated regions (Kajala et al., 2011; Patel et al., 2004; Viret et al., 1994; Williams et al., 2016; Xu et al., 2001) and coding sequences (Brown et al., 2011; Reyna-Llorens et al., 2016). The genome-wide data reported here provide an unbiased insight into where transcription factors bind C4 genes and, as for the rest of the genome, indicate that binding is most dense in the 5’ UTRs and coding exons.
Mechanistically, this DNaseI dataset also indicates that cell specific gene expression in C4 leaves is not strongly correlated with changes to large-scale accessibility of DNA as defined by DHS. This strongly implies that modifications to chromatin density within any one gene do not impact on its expression between cell types. Rather, as only 8-24% of transcription factor binding sites detected in the bundle sheath were also found in whole leaves, the data strongly implicate complex modifications to patterns of transcription factor binding in controlling gene expression between cell types. These findings are consistent with analogous analysis in roots where genes with clear spatial patterns of expression are bound by multiple transcription factors (Sparks et al., 2016) and highly combinatorial interactions between multiple activators and repressors tune the output (de Lucas et al., 2016). However, it is also the case that particular classes of transcription factors including the C2C2GATA, bZIP, bHLH and ARID families are implicated in patterning of gene expression because they were preferentially detected as binding their cognate cis-elements in either bundle sheath strands or whole leaves. Our findings therefore strongly imply that the spatial patterning of gene expression in leaves is mediated by a quantitative switch in the abundance of a group of transcription factors.
More generally, the finding that so few transcription factor binding sites were shared between different cell types in leaves of S. bicolor, Z. mays and S. italica argues strongly for the need to isolate these cells when attempting to understand the control of gene expression. Although separating bundle sheath strands from C4 leaves is relatively straightforward (Covshoff et al., 2013; Furbank et al., 1985; Leegood, 1985), this is not the case for C3 leaves. Approaches in which nuclei from specific cell-types are labelled with an exogenous tag (Deal and Henikoff, 2011) should now allow their transcription factor landscapes to be defined. In the future, the application of DNase I-SEQ to specific cell types from both C3 and C4 leaves should provide insight into the extent to which gene regulatory networks have been re-wired during the evolution of the complex C4 trait.
Characteristics of the transcription factor repertoire facilitating evolution of the C4 pathway
Comparison of transcription factor binding in the C3 grass B. distachyon with three C4 species provides insight into mechanisms associated with the evolution of C4 photosynthesis. One striking finding was that in all four species, irrespective of whether they used C3 or C4 photosynthesis, the most abundant DNA motifs bound by transcription factors were similar. Thus, motifs recognised by the AP2EREBP, C2C2 and C2C2DOF classes of transcription factor were most commonly bound across each genome. This indicates that during the evolution of C4 photosynthesis, there has been relatively little alteration to the most abundant classes of transcription factors that bind DNA.
The repeated evolution of the C4 pathway has frequently been associated with convergent evolution (Sage, 2004; Sage et al., 2012). However, parallel alterations to amino acid and nucleotide sequence that allow altered kinetics of the C4 enzymes (Christin et al., 2014, 2007) and patterning of C4 gene expression (Brown et al., 2011) respectively have also been reported. The genome-wide analysis of transcription factor binding now indicates that parallel evolution of transcription factors has contributed to the repeated appearance of C4 photosynthesis. This is best exemplified by the fact that in the three C4 species that are derived from two independent C4 lineages, motifs bound by the ARID and C2C2GATA classes of transcription factor were enriched in bundle sheath and whole leaves respectively. In the case of the C2C2GATA family, transcripts derived from one specific orthologue were more abundant in mesophyll cells of all C4 species. Thus, within separate lineages of C4 plant, the same classes of transcription factors have been recruited into functioning preferentially in one cell type, and in the case of the C2C2GATA family this is associated with orthologous genes being preferentially expressed in mesophyll cells.
When orthologous genes were compared between genomes, the majority of transcription factor binding sites were not conserved. Furthermore, of the DGF that were conserved, we found that their position within orthologous genes varied. This indicates that C4 photosynthesis in grasses tolerates a rapid turnover of the cis-code and that, when motifs are conserved in sequence, their position and number within a gene can vary. It therefore appears that the cell-specific accumulation patterns of C4 proteins can be maintained despite considerable modifications to the cistrome of C4 leaves. This finding is analogous to the situation in yeast, where the output of genetic circuits can be maintained despite rapid turnover of the regulatory mechanisms underpinning them (Tsong et al., 2006). It was also the case that some conserved motifs bound by transcription factors in the C4 species were present in B. distachyon, which uses the ancestral C3 pathway. Previous work has shown that cis-elements used in C4 photosynthesis can be found in orthologous genes from C3 species (Reyna-Llorens et al., 2016; Williams et al., 2016). However, these previous studies identified cis-elements that were conserved in both sequence and position. As it is now clear that such conserved sites are mobile within a gene, it seems likely that many more examples of ancient cis-elements important in C4 photosynthesis will be found in C3 plants.
Although we were only able to detect a small number of transcription factor binding sites that were conserved and occupied in all four species that were sampled, these ancient hyper-conserved motifs appear to have played a role in the evolution of C4 photosynthesis. Interestingly, a large proportion of these motifs bound by transcription factors are found in coding sequence, and this bias was particularly noticeable in bundle sheath cells. Because of the constraints imposed by the amino acid code, the rate of mutation of coding sequence is restricted compared with the rest of the genome. If such regions have a longer half-life than transcription factor binding sites in other regions of the genome, then they may represent an excellent source of raw material for the repeated evolution of complex traits (Martin and Orgogozo, 2013). It remains to be determined why this characteristic is particularly noticeable in bundle sheath cells of C4 leaves.
In summary, the data presented here provide a transcription factor binding atlas for leaves of grasses using either C3 or C4 photosynthesis. Surprisingly, many sequences bound by transcription factors are found within genes rather than promoter regions. Indeed, particular transcription factor families preferentially bind coding sequence and the motifs that they bind are highly conserved in the grasses. Moreover, the canonical patterning of gene expression in C4 leaves is underpinned by complex combinatorial modifications to transcription factor binding. Lastly, consistent with the deep evolutionary time associated with the divergence of the monocotyledons and dicotyledons, the cistrome of grasses is highly divergent from that of the model plant Arabidopsis thaliana.
Methods
Growth conditions and isolation of nuclei
S. bicolor, S. italica and Z. mays were grown under controlled conditions at the University of Cambridge in a chamber set to 12h/12h light/dark; 28°C light/20°C dark; 400µmol m-2 s-1 photon flux density, 60% humidity. For germination, S. bicolor and Z. mays seeds were imbibed in dH2O for 48h, whereas S. italica seeds were incubated on wet filter paper at 30°C overnight in the dark. Z. mays, S. bicolor and S. italica were grown on a 3:1 (v/v) mixture of M3 compost and medium vermiculite, with a thin covering of soil. Seedlings were hand-watered. B. distachyon plants were grown under controlled conditions at the Sainsbury Laboratory Cambridge University, first under short day conditions (14h/10h, light/dark) for 2 weeks, then shifted to long day (20h/4h, light/dark) for 1 week, and harvested at ZT20. Temperature was set at 20°C, humidity 65% and light intensity 350µmol m-2 s-1.
To isolate nuclei from S. bicolor, Z. mays and S. italica, mature third and fourth leaves with a fully developed ligule were harvested 4-6 h into the light cycle on the 18th day after germination. Bundle sheath cells were mechanically isolated (Covshoff et al., 2013). At least 3 g of tissue was used for each extraction. Nuclei were isolated using a sucrose gradient adapted from Gendrel et al. (2005) and the number of nuclei in each preparation was quantified using a haemocytometer. For B. distachyon, plants were flash frozen and material pulverised in a coffee grinder. 3 g of plant material was added to 45 ml NIB buffer (10mM Tris-HCl, 0.2M sucrose, 0.01% (v/v) Triton X-100, pH 5.3 containing protease inhibitors (SIGMA)) and incubated at 4°C on a rotating wheel for 5 min. Debris was then removed by sieving through 2 layers of Miracloth (Millipore) into pre-cooled flasks. Nuclei were spun down at 4,000 rpm, 4°C for 20 min. Plastids were lysed by adding Triton to a final concentration of 0.3% (v/v) and incubating for 15 min on ice. Nuclei were pelleted by centrifugation at 5,000 rpm at 4°C for 15 min. Pellets were washed 3 times with chilled NIB buffer.
DNAse-I digestion, sequencing and library preparation
To obtain sufficient DNA, each biological replicate consisted of leaves from tens of individuals, and to conform to standards set by the Human Genome project at least two biological replicates were sequenced for each sample. 2 x 10^8 freshly extracted nuclei were re-suspended at 4°C in digestion buffer (15 mM Tris-HCl, 90 mM NaCl, 60 mM KCl, 6 mM CaCl2, 0.5 mM spermidine, 1 mM EDTA and 0.5 mM EGTA, pH 8.0). DNAse-I (Fermentas) at 7.5 U was added to each tube and incubated at 37 °C for 3 min. Digestion was arrested with addition of a 1:1 volume of stop buffer (50 mM Tris-HCl, 100 mM NaCl, 0.1% (w/v) SDS, 100 mM EDTA, pH 8.0, 1 mM Spermidine, 0.3 mM Spermine, RNase A 40 µg/ml) and incubated at 55°C for 15 min. 50 U of Proteinase K was added and samples incubated at 55 °C for 1 h. DNA was isolated with 25:24:1 Phenol:Chloroform:Isoamyl Alcohol (Ambion) followed by ethanol precipitation. Samples were then size-selected using agarose gel electrophoresis. The extracted DNA samples were quantified fluorometrically with a Qubit 3.0 Fluorometer (Life technologies), and a total of 10 ng of digested DNA (200 pg µl-1) was used for library construction.
Initial sample quality control of pre-fragmented DNA was assessed using a Tapestation DNA 1000 High sensitivity Screen tape (Agilent, Cheadle UK). Sequencing ready libraries were prepared using the Hyper Prep DNA Library preparation kit (Kapa Biosystems, London UK) according to the manufacturer’s instructions and indexed for pooling using NextFlex DNA barcoded adapters (Bioo Scientific, Austin TX US). Libraries were quantified using a Tapestation DNA 1000 Screen tape and by qPCR using an NGS Library Quantification Kit (KAPA Biosystems) on an AriaMx qPCR system (Agilent) and then normalised, pooled, diluted and denatured for sequencing on the NextSeq 500 (Illumina, Chesterford UK). The main library was spiked at 10% with the PhiX control library (Illumina). Sequencing was performed using Illumina NextSeq in the Departments of Biochemistry and Pathology at the University of Cambridge, UK, with 2x75 cycles of sequencing.
Data processing
Genome sequences were downloaded from Phytozome (v10) (Goodstein et al., 2012). The following genome assemblies were used: Bdistachyon_283_assembly_v2.0; Sbicolor_255_v2.0; Sitalica_164_v2; Zmays_284_AGPv3. Reads were mapped to genomes using bowtie2 (Langmead and Salzberg, 2012) with the following parameters: --local -D 15 -R 2 -N 0 -L 20 -i S,1,0.75.
Aligned reads were then processed using samtools (Li et al., 2009) to remove those with a MAPQ score <42. DHS sites were identified using a procedure adapted from the ENCODE 3 pipeline (https://sites.google.com/site/anshulkundaje/projects/idr) (Marinov et al., 2014). Briefly, DHS were called using MACS2 (Feng et al., 2012) with the following parameters to offset read locations in order to position the DHS cut site in the middle of peak regions: -p 1e-1 --nomodel --extsize 150 --shift -75 --llocal 50000
The final set of peak calls were determined using the irreproducible discovery rate (IDR (Li et al., 2011)) and calculated using the script batch_consistency_analysis.R (https://github.com/modENCODE-DCC/Galaxy/blob/master/modENCODE_DCC_tools/idr/batch-consistency-analysis.r).
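The effect of the read-offset parameters given to MACS2 above can be illustrated with a small worked example. The following is a minimal sketch, assuming idealised coordinates and the usual interpretation of the shift and extension size as operating along each read's 5' to 3' direction; the function name centred_window is hypothetical and is not part of MACS2.

```python
def centred_window(five_prime, strand, shift=-75, extsize=150):
    """Return the (start, end) window produced by shifting a read's 5' end
    and then extending it, both along the read's 5'->3' direction.

    Assumption: a negative shift moves the 5' end "backwards" (upstream of
    the read), so shift = -extsize / 2 centres the window on the cut site.
    """
    direction = 1 if strand == "+" else -1
    shifted = five_prime + direction * shift      # move the 5' end upstream
    far_end = shifted + direction * extsize       # extend towards the 3' end
    return (min(shifted, far_end), max(shifted, far_end))


# A cut site at position 1000 gives the same 150 bp window on either strand.
print(centred_window(1000, "+"))
print(centred_window(1000, "-"))
```

With a shift of -75 and an extension of 150, the window spans 75 bp either side of the cut site regardless of strand, which is why the cut site sits in the middle of the called region.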
Quality metrics and identification of Digital Genomic Footprints (DGF)
The SPOT score (John et al., 2011), defined as the number of subsampled mapped reads (5M) falling within DHS divided by the total number of subsampled mapped reads (5M), was calculated using BEDTools (Quinlan and Hall, 2010) to determine the number of mapped reads possessing at least 1 bp overlap with a DHS site. NSC and RSC scores were calculated using SPP (Kharchenko et al., 2008) and the PCR bottleneck coefficient (PBC) was calculated using BEDTools and the following bash code:
bedtools bamtobed -bedpe -i ${FILT_BAM_FILE_n} | awk 'BEGIN{OFS="\t"}{print $1,$2,$4,$6,$9,$10}' | grep -v 'ChrM\|ChrC' | sort | uniq -c | awk 'BEGIN{mt=0;m0=0;m1=0;m2=0} ($1==1){m1=m1+1} ($1==2){m2=m2+1} {m0=m0+1} {mt=mt+$1} END{printf "%d\t%d\t%d\t%d\t%f\t%f\t%f\n",mt,m0,m1,m2,m0/mt,m1/m0,m1/m2}' > ${PBC_FILE_QC}
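The two quality metrics described above can be sketched in a few lines of Python. This is an illustrative re-implementation on in-memory data, not the pipeline that was run: the function names are hypothetical, the labels NRF, PBC1 and PBC2 follow the ENCODE naming convention for the three ratios printed by the awk command, and intervals are assumed to be half-open BED-style coordinates.

```python
from collections import Counter


def pbc_metrics(fragments):
    """Library-complexity counts from a list of fragment positions.

    Each fragment is any hashable key, e.g. (chrom, start, mate_start, mate_end,
    strand1, strand2). Mirrors the mt, m0, m1, m2 counters of the awk command.
    """
    counts = Counter(fragments)
    mt = sum(counts.values())                        # total fragments
    m0 = len(counts)                                 # distinct positions
    m1 = sum(1 for n in counts.values() if n == 1)   # positions seen once
    m2 = sum(1 for n in counts.values() if n == 2)   # positions seen twice

    def ratio(a, b):
        return a / b if b else float("nan")

    return {"total": mt, "distinct": m0, "once": m1, "twice": m2,
            "NRF": ratio(m0, mt), "PBC1": ratio(m1, m0), "PBC2": ratio(m1, m2)}


def spot_score(reads, dhs):
    """Fraction of reads with at least 1 bp overlap with any DHS.

    reads and dhs are lists of (chrom, start, end) half-open intervals.
    A simple quadratic scan; adequate for illustration only.
    """
    def overlaps(read):
        chrom, start, end = read
        return any(c == chrom and start < e and s < end for c, s, e in dhs)

    return sum(1 for r in reads if overlaps(r)) / len(reads) if reads else float("nan")


frags = [("Chr1", 10), ("Chr1", 10), ("Chr1", 50), ("Chr2", 5),
         ("Chr2", 5), ("Chr2", 5), ("Chr3", 1)]
print(pbc_metrics(frags))
```

In practice the reads would first be subsampled (5M in the text above) before computing the SPOT score, and organellar reads excluded before computing PBC.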
Digital Genomic Footprints (DGF) were identified using the Wellington algorithm (Piper et al., 2013) in the pyDNase software package (http://pythonhosted.org/pyDNase/) with the following parameters:
-fdr 0.05 [regions] [reads] [output directory]
where [regions] represents a BED file of DHS locations within which footprints were called and [reads] a filtered BAM file of sequenced reads.
Differential DGF were identified using Wellington bootstrap algorithm (Piper et al., 2015) from pyDNase package with the following parameters:
-fdr 0.05 [treatment_BAM] [control_BAM] [regions] [treatment_output] [control_output]
where [treatment_BAM] is a filtered BAM file containing mapped reads from the sample of interest, [control_BAM] is a filtered BAM file containing mapped reads from the sample used for comparison, and [regions] is a BED file containing DHS locations within which footprints are called. Differential DGF with a score of 10 or higher were considered differentially abundant.
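The score cut-off applied to differential footprints can be expressed as a simple filter. The sketch below assumes, purely for illustration, that each differential footprint is a tab-separated BED-style line with the score in the fifth column; the actual layout of the pyDNase output should be checked before reuse, and the function name is hypothetical.

```python
def filter_by_score(bed_lines, min_score=10.0, score_column=4):
    """Keep BED-style records whose score is at least min_score.

    Assumption: tab-separated lines with the score in column index 4
    (the fifth column). Blank, comment and track lines are skipped.
    """
    kept = []
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track")):
            continue
        fields = line.rstrip("\n").split("\t")
        if float(fields[score_column]) >= min_score:
            kept.append(fields)
    return kept


example = [
    "Chr1\t100\t120\tfp1\t12.5\t+",
    "Chr1\t300\t320\tfp2\t9.9\t+",
    "Chr2\t50\t70\tfp3\t10\t-",
]
for record in filter_by_score(example):
    print("\t".join(record))
```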
Data visualisation
DHS and DGF sequences were loaded into and visualised in the Integrative Genomics Viewer (Thorvaldsdóttir et al., 2013), and figures were produced in Inkscape. Bar plots were generated with the R package ggplot2 (Wickham, 2009), scatterplots with the R function plot(), and figures depicting conservation of DGF or motifs between orthologous sequences with genoPlotR (Guy et al., 2010). Word clouds were created with the wordcloud R package (Fellows, 2012).
TreeView images were produced in two stages. The script ‘dnase_to_javatreeview.py’ from pyDNase was run with the following parameters to generate the input file:
[regions_BED] [reads_BAM] [OUTPUT]
Where [regions_BED] is a bed file containing locations of all DGF sites, [reads_BAM] is the BAM file containing all aligned reads, and [OUTPUT] specifies the output csv file name. To visualize files Java TreeView (Saldanha, 2004) was run with the following command:
java -Xmx4G -jar TreeView.jar
The file format setting was changed to All Files and the csv file from pyDNase was loaded into TreeView. From the dropdown menu, Settings->Pixel Setting was selected, all the Fill boxes were checked, the Contrast Value was set to 1 and the colours to Red and Blue. The output was saved as an .svg file.
Average cut density plots were generated using the script ‘dnase_average_profile.py’ from pyDNase (Piper et al., 2013, 2015) with the following parameters:
-w 100 -b [regions_BED] [reads_BAM] [OUTPUT]
Where [regions_BED] is a bed file containing locations of all DGF sites, [reads_BAM] is the BAM file containing all aligned reads, and [OUTPUT] specifies the output file name.
Genomic features were annotated and their distribution calculated using the Bioconductor package ChIPpeakAnno (Zhu, 2013; Zhu et al., 2010a) interfaced with a custom R script. The required gff3 files (Goodstein et al., 2012) (Sitalica_164_v2.1.gene_exons.gff3; Sbicolor_255_v2.1.gene_exons.gff3; Zmays_284_6a.gene.exons.gff3; Bdistachyon_283_v2.1.gene_exons.gff3) were downloaded from Phytozome.
In order to convert motif files into MEME format for motif scanning a multi-step procedure was necessary. Background frequency files are required when generating motifs (Thijs et al., 2001); to produce background files FASTA sequences for the regions of interest (DGF) were extracted using BEDTools suite (Quinlan and Hall, 2010) with the following command:
bedtools getfasta -fi [FASTA_genome] -bed [regions] -fo [FASTA_regions]
Background frequency files were tailored for each species for motif searching, using scripts from the MEME suite (Bailey et al., 2009):
fasta-get-markov [FASTA_all] [background_file_MEME]
Motif files in FASTA format were converted to STAMP format using the online tool (http://www.benoslab.pitt.edu/stamp/) (Mahony and Benos, 2007), then RSAT was used to convert STAMP format into TRANSFAC format (http://rsat01.biologie.ens.fr/rsa-tools/convert-matrix_form.cgi) (Medina-Rivera et al., 2015). A bug in the transfac2meme script requires that all bp frequencies are represented as floating point numbers containing two decimal places. In order to convert the TRANSFAC file to a suitable format the following code was used:
sed 's/0 /0.00/g' [transfac file] | sed 's/1 /1.00/g' | sed 's/2 /2.00/g' | sed 's/3 /3.00/g' | sed 's/4 /4.00/g' | sed 's/5 /5.00/g' | sed 's/6 /6.00/g' | sed 's/7 /7.00/g' | sed 's/8 /8.00/g' | sed 's/9 /9.00/g' | sed 's/0$/0.00/g' | sed 's/1$/1.00/g' | sed 's/2$/2.00/g' | sed 's/3$/3.00/g' | sed 's/4$/4.00/g' | sed 's/5$/5.00/g' | sed 's/6$/6.00/g' | sed 's/7$/7.00/g' | sed 's/8$/8.00/g' | sed 's/9$/9.00/g' | sed 's/\P0.00 /\P0/g' > [transfac_fixed]
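The intent of the conversion above, rewriting every count in a TRANSFAC matrix row as a number with two decimal places, can also be sketched in Python. This is an alternative illustration rather than the code that was used, and it assumes that matrix rows begin with a numeric position label (for example 01) followed by whitespace-separated counts and an optional consensus letter; the function name is hypothetical.

```python
import re

# Assumption: matrix rows start with a numeric position label such as "01".
MATRIX_ROW = re.compile(r"^\d+\s")


def format_transfac_counts(lines):
    """Rewrite counts on TRANSFAC matrix rows with two decimal places.

    Header lines (e.g. "P0  A  C  G  T"), separators and annotation lines
    are passed through unchanged apart from stripping the trailing newline.
    """
    out = []
    for line in lines:
        if MATRIX_ROW.match(line):
            label, *values = line.split()
            fixed = []
            for token in values:
                try:
                    fixed.append(f"{float(token):.2f}")
                except ValueError:
                    fixed.append(token)  # e.g. a trailing consensus letter
            out.append("\t".join([label] + fixed))
        else:
            out.append(line.rstrip("\n"))
    return out


example = ["P0      A      C      G      T",
           "01      0      4      4      0      S",
           "//"]
print("\n".join(format_transfac_counts(example)))
```

Parsing each row into fields avoids the risk of a digit-by-digit substitution touching position labels or values that already contain a decimal point.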
MEME motif files were created from TRANSFAC files using scripts from the MEME suite (Bailey et al., 2009) with the following command:
transfac2meme -bg [background_file] [transfac_fixed] > [MEME_FILE]
where [background_file] is the background base pair distribution file and [MEME_FILE] is the motif file output.
de novo motif prediction, motif scanning and enrichment testing
de novo motif prediction was performed using findMotifsGenome.pl script from the HOMER suite (Heinz et al., 2010) using digital genomic footprints (DGF) as input together with the reference genome sequence for each species with the following command:
findMotifsGenome.pl [INPUT_DGFs.bed] [REF_GENOME.fasta] [OUTFILE].motifs -size 200 -cpg
A set of 872 transcription factor binding motifs (O’Malley et al., 2016) in meme format was downloaded from
http://neomorph.salk.edu/dev/pages/shhuang/dap_web/pages/browse_table_aj.php
Motif scanning was performed using FIMO (Grant et al., 2011) with default parameters:
--bgfile [background_file] --o [OUTPUT_FILE] [MOTIF_FILE] [FASTA_REGIONS]
where [background_file] is the background base pair distribution file, [OUTPUT_FILE] is the output file name, [MOTIF_FILE] is the file containing input motif(s) in MEME format and [FASTA_REGIONS] is a FASTA file containing all DGF sequences motifs are scanned against.
To determine overrepresentation of TF family motifs in samples hypergeometric tests were performed using R with the following parameters:
over<-phyper(hitInSample-1,hitInPop,failInPop,sampleSize,lower.tail=F)
where:
Population: Unique genes with an annotation in whole leaf and bundle sheath samples.
sampleSize: Number of unique genes with an annotation in whole leaf samples.
hitInPop: Total number of unique genes annotated with given transcription factor in tissue sample.
hitInSample: Number of unique genes sharing an annotation in WL and BS samples (overlap).
failInPop: Number of unique genes with annotation only in WL samples.
p-values were adjusted for the false discovery rate using the procedure of Benjamini & Hochberg (Benjamini and Hochberg, 1995).
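For readers without R to hand, the upper-tail hypergeometric probability computed by the phyper call above, followed by Benjamini-Hochberg adjustment, can be sketched with the Python standard library. The function names are hypothetical and the argument names simply mirror those used in the R call; this is an illustration of the calculation, not the analysis script.

```python
from math import comb


def hypergeom_upper_tail(hit_in_sample, hit_in_pop, fail_in_pop, sample_size):
    """P(X >= hit_in_sample) for a hypergeometric draw.

    Equivalent in intent to phyper(hit_in_sample - 1, hit_in_pop,
    fail_in_pop, sample_size, lower.tail = FALSE) in R.
    """
    total = comb(hit_in_pop + fail_in_pop, sample_size)
    upper = min(hit_in_pop, sample_size)
    favourable = sum(
        comb(hit_in_pop, x) * comb(fail_in_pop, sample_size - x)
        for x in range(max(hit_in_sample, 0), upper + 1)
    )
    return favourable / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p = hypergeom_upper_tail(hit_in_sample=3, hit_in_pop=5, fail_in_pop=15, sample_size=4)
print(p)
print(benjamini_hochberg([0.01, 0.04, 0.03]))
```

Subtracting one from the observed overlap in the R call is what turns the strict upper tail P(X > q) into the inclusive probability P(X >= observed), which is the quantity summed here.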
The distribution of each motif across different genomic features was obtained for each of the 525 known annotated motifs by dividing the number of hits in a particular feature by the total number of hits in the genome. K-means clustering was then employed to group motifs by genomic feature in Z. mays, S. italica, S. bicolor and B. distachyon.
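The per-feature distribution and clustering step described above can be sketched as follows. This is a minimal illustration with made-up counts, assuming a plain Lloyd's-algorithm k-means with user-supplied starting centroids; the function names, feature labels and numbers are hypothetical and the clustering implementation actually used is not specified in the text.

```python
def feature_proportions(hits_by_feature):
    """Divide the hits in each genomic feature by the total hits for a motif."""
    total = sum(hits_by_feature.values())
    if total == 0:
        return {}
    return {feature: n / total for feature, n in hits_by_feature.items()}


def kmeans_labels(points, centroids, iterations=20):
    """Plain Lloyd's k-means; returns the cluster index of each point.

    Starting centroids are passed in explicitly so the result is deterministic.
    """
    def nearest(point, centres):
        return min(range(len(centres)),
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centres[c])))

    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        members = [[] for _ in centroids]
        for p in points:
            members[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its members (keep it if empty).
        centroids = [
            tuple(sum(col) / len(group) for col in zip(*group)) if group else old
            for group, old in zip(members, centroids)
        ]
    return [nearest(p, centroids) for p in points]


# Hypothetical motif hit counts per genomic feature.
motif_hits = {
    "motifA": {"promoter": 90, "CDS": 10},
    "motifB": {"promoter": 80, "CDS": 20},
    "motifC": {"promoter": 10, "CDS": 90},
}
features = ["promoter", "CDS"]
points = [tuple(feature_proportions(h)[f] for f in features) for h in motif_hits.values()]
print(kmeans_labels(points, centroids=[(1.0, 0.0), (0.0, 1.0)]))
```

Normalising by the total number of hits per motif means that motifs are grouped by where they bind rather than by how often they occur.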
Whole genome alignments and pairwise cross mapping of genomic features
To cross map genomic features between species, mapping files were generated according to (http://genomewiki.ucsc.edu/index.php/Whole_genome_alignment_howto) using tools from the UCSC Genome Browser, including trfBig, faToNib, faSize, lavToPsl, faSplit, axtChain, chainNet (Kent et al., 2002) and LASTZ (Harris, 2007).
Genomic features were then mapped between genomes using bnMapper (Denas et al., 2015) and the following parameters:
-f BED4 --threshold 0.7 -o [outfile] [infile] [Chain file]
where [infile] is a BED file of DGF locations in the species of origin, [Chain file] is a chain file providing mapping coordinates between the species of origin and comparison.
Data
Detailed step-by-step methods for DNase I digestion are available on protocols.io (dx.doi.org/10.17504/protocols.io.hdfb23n). Raw sequencing data and processed files are deposited in Gene Expression Omnibus (GSE97369).
Contributions
SJB and I-RL grew and harvested nuclei from S. bicolor, S. italica and Z. mays. KJ provided the nuclei from B. distachyon. SJB and I-RL performed DNase I digestion and data analysis. SJB, I-RL and JMH wrote the manuscript and prepared the figures.
Acknowledgements
KJ was supported by a Gatsby Career Development Fellowship, IRL was supported by CONCyT and BBSRC grant BB/L014130, whilst SJB was supported by the 3to4 grant from the EU and BB/I002243 from the BBSRC.
Footnotes
I-RL - suallorens{at}gmail.com
SJB - sjb287{at}cam.ac.uk
KJ - katja.jaeger{at}slcu.cam.ac.uk