Abstract
MADS-box transcription factors (TFs) are ubiquitous among eukaryotes, and are classified into two groups: type I or SRF (Serum Response Factor)-like, and type II or MEF2 (Myocyte Enhancer Factor 2)-like1. In flowering plants, type I MADS-box TFs are associated with reproductive development and many are active in the endosperm, a nutritive tissue supporting the embryo2. Deregulation of these genes has been frequently linked to failure of endosperm development and seed inviability, both in the Brassicaceae3–6, and in crop species like tomato and rice7,8. Nevertheless, a mechanistic explanation for these observations, clarifying the role of MADS-box TFs in endosperm development, remains to be established. Here we show that the imprinted Arabidopsis thaliana MADS-box TF PHERES1 (PHE1)9 has a central role in endosperm development as a master regulator of imprinted gene expression, especially of paternally expressed genes (PEGs), which have been previously implicated in endosperm development5,10–12. Control of imprinted gene expression by PHE1 is mediated by parental asymmetry of epigenetic modifications in PHE1 DNA-binding sites, conferring different accessibilities to maternal and paternal alleles. Importantly, we show that the CArG-box-like DNA-binding motifs used by PHE1 to access gene promoters are carried by RC/Helitron transposable elements (TEs), providing an example of molecular domestication of these elements. Hence, this work shows that TEs are intrinsically linked to imprinting: not only by enforcing specific epigenetic landscapes13–15, but also by serving as important sources of cis-regulatory elements. Moreover, it provides an example of how TEs can widely distribute TF binding sites in a plant genome, allowing the recruitment of crucial endosperm regulators into a single transcriptional network.
Main
PHE1 was the first described PEG in Arabidopsis9. To identify the genes regulated by PHE1, we performed a ChIP-seq experiment using a line expressing PHE1::PHE1-GFP (Supplementary Table 1-2). The 1942 identified PHE1 target genes were enriched for Gene Ontology (GO)-terms associated with development, metabolic processes, and transcriptional regulation, showing that many PHE1 targets are themselves transcriptional regulators (Extended Data Fig. 1a). Among PHE1 targets are several known regulators of endosperm development like AGL62, YUC10, IKU2, MINI3, and ZHOUPI, revealing a central role for PHE1 in regulating endosperm development (Extended Data Table 1). Importantly, type I MADS-box family genes were over-represented among PHE1 targets (Extended Data Fig. 1b), pointing to a high degree of cross-regulation among members of this family.
Our data revealed that PHE1 uses two distinct DNA-binding motifs. Motif A was present in about 53% of PHE1 binding sites, while motif B could be found in 43% of those (Fig. 1a). PHE1 motifs closely resemble type II CArG-boxes, the signature motif of MADS-box TFs16 (Fig. 1a, Extended Data Fig. 2), suggesting that DNA-binding properties between type I and type II members are conserved.
Strikingly, PHE1 binding sites significantly overlapped with TEs, preferentially with those of the RC/Helitron superfamily (29% in PHE1 binding sites, versus 9% in random binding sites) (Fig. 1b), and in particular with ATREP10D elements (Extended Data Fig. 3). We thus asked whether RC/Helitrons contain sequence properties that promote PHE1 binding. Indeed, a screen of genomic regions for the presence of PHE1 DNA-binding motifs revealed significantly higher motif densities within all RC/Helitrons, compared to other TE superfamilies (Fig. 1c). Interestingly, even though motif densities were higher in RC/Helitrons overlapped by PHE1 binding sites than in non-overlapped RC/Helitrons, this difference was not significant (Fig. 1c). This reveals that the enrichment of PHE1 DNA-binding motifs is a specific feature of RC/Helitrons, suggesting that the exaptation of these TEs as cis-regulatory regions facilitates TF binding and modulation of gene expression. Furthermore, we found that RC/Helitron sequences associated with both PHE1 DNA-binding motifs could be grouped into multiple clusters (127 for motif A and 97 for motif B) based on aligned homologous sequences around the motifs. This suggests that multiple ancestral RC/Helitrons played a role in acquiring these motifs, which were subsequently amplified in the genome through transposition.
Among the PHE1 target genes we detected a significant enrichment of imprinted genes, with 12% of all maternally expressed genes (MEGs) and 31% of all PEGs being targeted (Fig. 2a). Imprinted gene expression relies on parental-specific epigenetic modifications, which are asymmetrically established during male and female gametogenesis, and inherited in the endosperm17,18. Demethylation of repeat sequences and TEs – occurring in the central cell, but not in sperm – is a major driver for imprinted gene expression17,18. In MEGs, DNA hypomethylation of maternal alleles leads to their expression, while DNA methylation represses the paternal allele17,18. In PEGs, the hypomethylated maternal allele undergoes trimethylation of lysine 27 of histone H3 (H3K27me3), a repressive histone modification established by the Fertilization Independent Seed-Polycomb Repressive Complex2 (FIS-PRC2). This renders the maternal alleles inactive, while the paternal allele is expressed14,17,18.
Given the significant overrepresentation of imprinted genes among PHE1 targets, we assessed how the epigenetic landscape at those loci correlates with DNA-binding by PHE1. We surveyed levels of endosperm H3K27me3 within PHE1 binding sites and identified two distinct clusters (Fig. 2b). Cluster 1 was characterized by an accumulation of H3K27me3 in regions flanking the centre of the binding site, while the centre itself was devoid of this mark. Cluster 2 contained binding sites largely devoid of H3K27me3. The distribution of H3K27me3 in cluster 1 was mostly attributed to the deposition of H3K27me3 on the maternal alleles, while the paternal alleles were devoid of this mark (Fig. 2c) – a pattern usually associated with PEGs14. Consistently, genes associated with cluster 1 binding sites had a more paternally-biased expression in the endosperm when compared to genes associated with cluster 2 (Extended Data Fig. 4a). This is reflected by more PEGs and putative PEGs being associated with cluster 1 (Extended Data Fig. 4b). We also identified parental-specific differences in DNA methylation, specifically in the CG context: PHE1 binding sites associated with MEGs had significantly higher methylation levels in paternal alleles than in maternal alleles (Fig. 2d).
We hypothesized that the differential parental deposition of epigenetic marks in PHE1 binding sites can result in differential accessibility of these regions, which might therefore impact transcription in a parent-of-origin-specific manner. To test this, we performed ChIP using a Ler maternal plant and a Col PHE1::PHE1-GFP pollen donor, taking advantage of SNPs between these two accessions to discern parental preferences of PHE1 binding. Using Sanger sequencing, we determined the parental origin of enriched ChIP-DNA in MEG, non-imprinted, and PEG targets (Fig. 2e, Extended Data Fig. 5). While binding of PHE1 was biallelic in non-imprinted targets (Fig. 2e), only maternal binding was detected in the tested MEG targets, supporting the idea that CG hypermethylation of paternal alleles prevents their binding by PHE1 (Fig. 2e, Extended Data Fig. 6a). Interestingly, we observed biallelic binding in PEG targets (Fig. 2e). Even though the maternal PHE1 binding sites in PEGs were flanked by H3K27me3 (Fig. 2b-c), correlating with transcriptional repression of maternal alleles, the absence of this mark within the binding site centres seems to be permissive for maternal PHE1 binding. We speculate that the accessibility of this site might be important to mediate recruitment of H3K27me3 in the central cell and/or for maintenance of this mark during endosperm development (Extended Data Fig. 6a).
Previous studies have shown that PEGs are often flanked by RC/Helitrons19, a phenomenon suggested to lead to the parental asymmetry of epigenetic marks in these genes13,14. Consistent with our finding that PHE1 binding sites overlapped with RC/Helitrons (Fig. 1b), we found that PHE1 DNA-binding motifs were contained within these TEs significantly more frequently in PEGs than in non-imprinted genes (Extended Data Fig. 6b). Furthermore, we detected the presence of homologous RC/Helitrons containing PHE1 binding motifs in the promoter regions of several PHE1-targeted PEG orthologs (Extended Data Fig. 7), indicating ancestral insertion events. The presence of these RC/Helitrons correlated with paternally-biased expression of the associated orthologs, providing further support to the hypothesis these TEs contribute to the gain of imprinting, especially of PEGs. Thus, besides facilitating the asymmetry of epigenetic marks, these TEs can contribute to the generation of novel gene promoters that ensure the timely endosperm expression of PEGs, under the control of PHE1, and possibly other type I MADS-box TFs (Extended Data Fig. 6a).
Among the PEGs targeted by PHE1 were ADM, SUVH7, PEG2, and NRPD1a (Extended Data Table 1). Mutants in all four PEGs suppress the abortion of triploid (3x) seeds generated by paternal excess interploidy crosses10,20. Furthermore, we found that close to 50% of highly upregulated genes in 3x seeds are targeted by PHE1 (Fig. 3a), suggesting this TF might play a central role in mediating the strong gene deregulation observed in these seeds. If this is true, removal of PHE1 is expected to alleviate gene deregulation and prevent 3x seed inviability. To test this hypothesis, we generated a phe1 CRISPR/Cas9 mutant in the phe2 background, since both genes are likely redundant21 (Extended Data Fig. 8a). We introduced phe1 phe2 into the omission of second division 1 (osd1) mutant background, which produces diploid gametes at high frequency22. Wild-type (wt) and phe2 maternal plants pollinated with osd1 pollen form 3x seeds that abort at high frequency23 (Fig. 3b, Extended Data Fig. 8b). In contrast, phe1 phe2 osd1 pollen strongly suppressed 3x seed inviability, reflected by the increased germination of 3x phe1 phe2 seeds (Fig. 3c, Extended Data Fig. 8c). This phenotype could be reverted by introducing the PHE1::PHE1-GFP transgene paternally (Fig. 3b-c, Extended Data Fig. 8b-c). Notably, 3x seed rescue was mostly mediated by phe1, as the presence of a wt PHE2 allele in 3x seeds (wt × phe1 phe2 osd1) led to rescue levels comparable to those observed when no wt PHE2 allele was present (phe2 × phe1 phe2 osd1) (Fig. 3b-c, Extended Data Fig. 8b-c). Importantly, 3x seed rescue was accompanied by reestablishment of endosperm cellularisation (Extended Data Fig. 8d-e), and reduced expression of PHE1 target genes (Extended Data Fig. 8f).
Loss of FIS-PRC2 function causes a similar phenotype to that of paternal excess 3x seeds, correlating with largely overlapping sets of deregulated genes3,5. Since FIS-PRC2 is a major regulator of PEGs in Arabidopsis endosperm14, we asked whether imprinting is disrupted in 3x seeds. To assess this, we analysed the parental expression ratio of imprinted genes in the endosperm of 2x and 3x seeds. Surprisingly, imprinting was not disrupted in 3x seeds (Fig. 3d), consistent with similar levels of H3K27me3 on the maternal alleles of PEGs in the endosperm of 2x and 3x seeds (Fig. 3e, Extended Data Fig. 9). Collectively, these data show that the major upregulation of imprinted gene expression in 3x seeds is due to increased transcription of the active allele, likely mediated by PHE1 and other MADS-box TFs, with maintenance of the imprinting status.
In summary, this work reveals that the MADS-box TF PHE1 is a major regulator of imprinted genes in the Arabidopsis endosperm, and that this TF establishes a reproductive barrier in response to interploidy hybridizations. We furthermore show that deregulated PEGs in 3x seeds remain imprinted, but that the active allele becomes strongly overexpressed, correlating with increased PHE1 activity in these seeds5. Importantly, we reveal a novel role for RC/Helitrons in the regulation of imprinted genes by showing that they contain PHE1 DNA-binding sites. Our data favour a scenario where these elements have been domesticated to function as providers of cis-regulatory sequences that facilitate transcription of imprinted genes. Thus, this study provides an example of TE-mediated distribution of TF binding sites throughout a flowering plant genome, adding support to the long-standing idea that transposition facilitates the formation of cis-regulatory architectures required to control complex biological processes24–26. We speculate that this process may have contributed to endosperm evolution by allowing the recruitment of crucial developmental genes into a single transcriptional network, regulated by type I MADS-box TFs. The diversification of the mammalian placenta has been connected with the dispersal of hundreds of placenta-specific enhancers by endogenous retroviruses27, suggesting that the convergent evolution of the endosperm in flowering plants and the mammalian placenta has been promoted by TE transpositions.
Methods
Plant material and growth conditions
Arabidopsis thaliana seeds were sterilized in a closed vessel containing chlorine gas, for 3 hours. Chlorine gas was produced by mixing 3 mL HCl 37% and 100 mL of 100% commercial bleach. Sterile seeds were plated in ½ MS-medium (0.43% MS salts, 0.8% Bacto Agar, 0.19% MES hydrate) supplemented with 1% Sucrose. When required, appropriate antibiotics were supplemented to the medium. Seeds were stratified for 48 h, at 4°C, in darkness. Plates containing stratified seeds were transferred to a long-day growth chamber (16h light / 8h dark; 110 μmol s−1m−2; 21°C; 70% humidity), where seedlings grew for 10 days. After this period, the seedlings were transferred to soil and placed in a long-day growth chamber.
Several mutant lines used in this study have been previously described: osd1-122, osd1-328 and pi-129. The phe2 allele corresponds to a T-DNA insertion mutant (SALK_105945). Phenotypical analysis of this mutant revealed no deviant phenotype relative to Col wt plants (data not shown). Genotyping of phe2 was done using the following primers (PHE2 fw 5’-AAATGTCTGGTTTTATGCCCC-3’, PHE2 rv 5’-GTAGCGAGACAATCGATTTCG-3’, T-DNA 5’-ATTTTGCCGATTTCGGAAC-3’).
Generation of phe1 phe2
The phe1 phe2 double mutant was generated using the CRISPR/Cas9 technique. A 20-nt sgRNA targeting PHE1 was designed using the CRISPR Design Tool30. A single-stranded DNA oligonucleotide corresponding to the sequence of the sgRNA, as well as its complementary oligonucleotide, were synthesized. BsaI restriction sites were added at the 5’ and 3’ ends, as represented by the underlined sequences (sgRNA fw 5’-ATTGCTCCTGGATCGAGTTGTAC-3’; sgRNA rv 5’-AAACGTACAACTCGATCCAGGAG-3’). These two oligonucleotides were then annealed to produce a double stranded DNA molecule.
The double-stranded oligonucleotide was ligated into the egg-cell specific pHEE401E CRISPR/Cas9 vector31 through the BsaI restriction sites. This vector was transformed into the Agrobacterium tumefaciens strain GV3101, and phe2/- plants were subsequently transformed using the floral-dip method32.
To screen for T1 mutant plants, we performed Sanger sequencing of PHE1 amplicons derived from these plants, and obtained with the following primers (fw 5’-AGTGAGGAAAACAACATTCACCA-3’; rv 5’-GCATCCACAACAGTAGGAGC-3’). The selected mutant contained a homozygous two base pair deletion that leads to a premature stop codon, and therefore a truncated PHE1 protein (amino acids 1-50). In the T2 generation, the segregation of pHEE401E allowed the selection of plants that did not contain this vector and that were double homozygous phe1 phe2 mutants. Genotyping of the phe1 allele was done using primers fw 5’-AAGGAAGAAAGGGATGCTGA-3’ and rv 5’-TCTGTTTCTTTGGCGATCCT-3’, followed by RsaI digestion.
Seed imaging
Analysis of endosperm cellularisation status was done following the Feulgen staining protocol described previously33. Imaging of Feulgen-stained seeds was done in a Zeiss LSM780 NLO multiphoton microscope with excitation wavelength of 800 nm, and acquisition between 520 nm – 695 nm.
RT-qPCR
RNA extraction, cDNA synthesis and qPCR analyses for AGL and PEG expression were performed as described previously33. Two biological replicates per cross were used. Primer sequences for YUC10, AGL62, and PP2A were described previously33. For the remaining genes, primer sequences were as follows: AGL48 (fw 5’-TTCGCGATCCACCAGTGTTT-3’, rv 5’-GACCGCCTCCTACAAAACCA-3’), AGL90 (fw 5’-TTGGTGATGAGTCGTTTTCCGA-3’, rv 5’-TCATATTCGCATTTGCGTCCG-3’).
Chromatin immunoprecipitation (ChIP)
To find targets of PHE1, we performed a ChIP experiment using a reporter line containing the PHE1 protein tagged with GFP. This reporter line, which we denote as PHE1::PHE1-GFP, contains the PHE1 promoter, its coding sequence and its 3’ regulatory sequence, and is present in a Col background34. Given the presence of the 3’ regulatory regions35, this reporter behaves as a paternally expressed gene, similarly to the endogenous PHE1 gene. Crosslinking of plant material was done by collecting 600 mg of 2 days after pollination (DAP) PHE1::PHE1-GFP siliques, and vacuum infiltrating them with a 1% formaldehyde solution in PBS. The vacuum infiltration was done for two periods of 15 min, with a vacuum release between each period. The crosslinking was then stopped by adding 0.125 mM glycine in PBS and performing a vacuum infiltration for a total of 15 min, with a vacuum release every 5 min. The material was then ground in liquid nitrogen, resuspended in 5 mL Honda buffer36, and incubated for 15 min with gentle rotation. This mixture was filtered twice through Miracloth and once through a CellTrics filter (30 μm), after which it was centrifuged for 5 min at 4°C and 1,500 g. The nuclei pellet was then resuspended in 100 μL of nuclei lysis buffer36, and the ChIP protocol was continued as described before36. ChIP DNA was isolated using the Pure Kit v2 (Diagenode), following the manufacturer’s instructions.
For the parental-specific PHE1 ChIP the starting material consisted of 600 mg of 2DAP siliques from crosses between a pi-1 mother (Ler ecotype) and a PHE1::PHE1-GFP father (Col ecotype). The male sterile pi-1 mutant was used to avoid emasculation of maternal plants. Crosslinking of plant material, nuclei isolation, ChIP protocol, and ChIP-DNA purification were the same as described before.
To assess parental-specific H3K27me3 profiles in 3x seeds, the INTACT system was used to isolate 4DAP endosperm of seeds derived from Ler pi-1 × Col INTACT osd1-1 crosses, as described previously36,37. osd1 mutants were used for their ability to generate unreduced male gametes, leading to the formation of 3x seeds. ChIPs against H3 and H3K27me3 were then performed on the isolated endosperm nuclei, following the previously described protocol36.
The antibodies used for these ChIP experiments were as follows: GFP Tag Antibody (A-11120, Thermo Fisher Scientific), anti-H3 (Sigma, H9289), and anti-H3K27me3 (Millipore, cat. no. 07-449). All experiments were performed with two biological replicates.
Library preparation and sequencing
PHE1::PHE1-GFP ChIP libraries were prepared using the Ovation Ultralow v2 Library System (NuGEN), with a starting material of 1 ng, following the manufacturer’s instructions. These libraries were sequenced at the SciLife Laboratory (Uppsala, Sweden), on an Illumina HiSeq2500 platform, using 50 bp single-end reads.
Library preparation and sequencing of H3K27me3 ChIPs in 3x seeds was done as described previously37.
Both datasets were deposited at NCBI’s Gene Expression Omnibus database (https://www.ncbi.nlm.nih.gov/geo/), under the accession number GSE129744.
qPCR and Sanger sequencing of parental-specific PHE1 ChIP
Purified ChIP DNA and its respective input DNA obtained from the parental-specific PHE1 ChIP were used to perform qPCR. Positive and negative genomic regions for PHE1 binding were amplified using the following primers: AT1G55650 (fw 5’-CGAAGCGAAAAAGCACTCAC-3’; rv 5’-CCTTTTACATAATCCGCGTTAAA-3’), AT2G28890 (fw 5’-TTTGTGGTTGGAGGTTGTGA-3’; rv 5’-GTTGTTCGTGCCCATTTCTT-3’), AT5G04250 (fw 5’-AATTGACAAATGGTGTAATGGT-3’; rv 5’-CCAAAGAATTTGTTTTTCTATTCC-3’), AT1G72000 (fw 5’-AACAAATATGCACAAGAAGTGC-3’; rv 5’-ACCTAGCAAGCTGGCAAAAC-3’), AT3G18550 (fw 5’-TCCTTTTCCAAATAAAGGCATAA-3’; rv 5’-AAATGAAAGAAATAAAAGGTAATGAGA-3’), AT2G20160 (fw 5’-TCCTAAATAAGGGAAGAGAAAGCA-3’; rv 5’-TGTTAGGTGAAACTGAATCCAA-3’), negative region (fw 5’-TGGTTTTGCTGGTGATGATG-3’, rv 5’-CCATGACACCAGTGTGCCTA-3’). HOT FIREPol EvaGreen qPCR Mix Plus (ROX) (Solis Biodyne) was used as a master mix for qPCR amplification in a iQ5 qPCR system (Bio-Rad).
For Sanger sequencing, positive genomic regions for PHE1 binding were amplified by PCR using the Phusion High-Fidelity DNA Polymerase (Thermo Fisher Scientific), in combination with the primers described above. Amplified DNA was purified using the GeneJET PCR Purification kit (Thermo Fisher Scientific) and used for Sanger sequencing. The chromatograms obtained from Sanger sequencing were then analysed for the presence of SNPs.
Bioinformatic analysis of ChIP-seq data
For the PHE1::PHE1-GFP ChIP, reads were aligned to the Arabidopsis (TAIR10) genome using Bowtie version 1.2.238, allowing 2 mismatches (-v 2). Only uniquely mapped reads were kept. ChIP-seq peaks were called using MACS2 version 2.1.1, with its default settings39. Input samples served as control for their corresponding GFP ChIP sample. Each biological replicate was handled individually, and only the overlapping peak regions between the two replicates were considered for further analysis. These regions are referred to throughout the text as PHE1 binding sites (Supplementary Table 1). Peak overlap was determined with BEDtools version 2.26.040. Each PHE1 binding site was annotated to a genomic feature and matched with a target gene using the peak annotation feature (annotatePeaks.pl) provided in HOMER version 4.941 (Supplementary Table 2). Only binding sites located less than 3 kb away from the nearest transcription start site were considered.
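The replicate-overlap and distance-filter steps described above can be illustrated with a minimal Python sketch. It is not the BEDtools or HOMER implementation: the function names, the half-open interval convention, and the choice to measure distance from the peak centre to the transcription start site (TSS) are all assumptions made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open coordinates


def intersect_peaks(rep1: List[Interval], rep2: List[Interval]) -> List[Interval]:
    """Return the regions covered by a peak in both replicates."""
    shared = []
    for chrom1, s1, e1 in rep1:
        for chrom2, s2, e2 in rep2:
            if chrom1 != chrom2:
                continue
            start, end = max(s1, s2), min(e1, e2)
            if start < end:  # keep only non-empty intersections
                shared.append((chrom1, start, end))
    return sorted(shared)


def near_tss(peak: Interval, tss_positions: List[Tuple[str, int]],
             max_dist: int = 3000) -> bool:
    """True if the peak centre lies less than max_dist bp from any TSS.

    Assumption: distance is measured from the peak centre.
    """
    chrom, start, end = peak
    centre = (start + end) // 2
    return any(c == chrom and abs(centre - pos) < max_dist
               for c, pos in tss_positions)


# Hypothetical toy data, not taken from the study
rep1 = [("Chr1", 100, 200), ("Chr1", 500, 600)]
rep2 = [("Chr1", 150, 250), ("Chr2", 500, 600)]
binding_sites = intersect_peaks(rep1, rep2)
kept = [p for p in binding_sites if near_tss(p, [("Chr1", 3000)])]
```

The quadratic loop is adequate for a toy example; for genome-scale peak sets, sorted sweeps or interval trees (as used by dedicated tools) would be preferable.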
PHE1 DNA-binding motifs were identified from PHE1::PHE1-GFP ChIP-seq peak regions with HOMER’s findMotifsGenome.pl function, using the default settings. P-values of motif enrichment, as well as alignments between PHE1 motifs and known motifs were generated by HOMER.
Read mapping, coverage analysis, purity calculations, normalisation of data, and determination of parental origin of reads derived from H3K27me3 ChIPs in 3x seeds were done following previously published methods14 (Supplementary Table 3).
Analysis of PHE1 target genes
Significantly enriched Gene Ontology terms within target genes of PHE1 were identified using AtCOECIS42, and further summarized using REVIGO43.
Enrichment of specific transcription factor families within PHE1 targets was calculated by first normalizing the number of PHE1-targeted TFs in each family, to the total number of TFs targeted by PHE1. As a control, the number of TFs belonging to a certain family was normalised to the total number of TFs in the Arabidopsis genome. The Log2 fold change between these ratios was then calculated for each family. Significance of the enrichment was assessed using the hypergeometric test. Annotation of transcription factor families was done following the Plant Transcription Factor Database version 4.044. Only TF families containing more than 5 members were considered in this analysis.
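As an illustration of the enrichment calculation described above, the following Python sketch computes the Log2 fold change between the two normalised ratios and a one-sided hypergeometric p-value. The function names and the example counts are hypothetical, and the use of the upper tail, P(X ≥ k), is an assumption about how the test was applied.

```python
from math import comb, log2


def hypergeom_sf(k: int, N: int, K: int, n: int) -> float:
    """P(X >= k) for X ~ Hypergeometric(population N, successes K, draws n)."""
    upper = min(K, n)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def family_enrichment(targeted_in_family: int, targeted_tfs: int,
                      family_size: int, all_tfs: int):
    """Return (log2 fold change, hypergeometric p-value) for one TF family."""
    ratio_targets = targeted_in_family / targeted_tfs  # family share among PHE1-targeted TFs
    ratio_genome = family_size / all_tfs               # family share among all TFs
    fold = log2(ratio_targets / ratio_genome)
    p_value = hypergeom_sf(targeted_in_family, all_tfs, family_size, targeted_tfs)
    return fold, p_value


# Hypothetical counts: 10 of 100 targeted TFs belong to a family of 20, out of 1000 TFs
fold, p_value = family_enrichment(10, 100, 20, 1000)
```

With these made-up counts the family is five times more frequent among targets than in the genome, giving a Log2 fold change of log2(5).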
To determine which imprinted genes are targeted by PHE1, a custom list consisting of the sum of imprinted genes identified in different studies13,19,45–47 was used.
To determine the proportion of genes overexpressed in paternal excess crosses that are targeted by PHE1, a previously published transcriptome dataset of 3x seeds was used48.
Spatial overlap of TEs and PHE1 binding sites
Spatial overlap between PHE1 ChIP-seq peak regions (binding sites) and TEs was determined using the regioneR package version 1.8.149, implemented in R version 3.4.150. As a control, a mock set of binding sites was created, which we refer to as random binding sites. This random binding site set had the same total number of binding sites and the same size distribution as the PHE1 binding site set. Using regioneR, a Monte Carlo permutation test with 10000 iterations was performed. In each iteration the random binding sites were arbitrarily shuffled in the 3 kb promoter region of all Arabidopsis thaliana genes. From this shuffling, the average overlap and standard deviation of the random binding site set was determined, as well as the statistical significance of the association between PHE1 binding sites and TE superfamilies/families.
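The logic of such a permutation test can be sketched in Python as follows. This is a simplified stand-in, not the regioneR implementation: it uses a single contiguous region on one chromosome in place of the set of 3 kb promoters, and the function names, the seed, and the empirical p-value formula (permutations at least as extreme, plus one, divided by iterations plus one) are assumptions for illustration.

```python
import random
from statistics import mean, pstdev


def count_overlaps(sites, tes):
    """Number of sites overlapping at least one TE (half-open intervals)."""
    return sum(any(s < te_e and te_s < e for te_s, te_e in tes) for s, e in sites)


def permutation_test(sites, tes, region, n_iter=1000, seed=0):
    """Shuffle sites within `region`, preserving their sizes, and compare overlaps."""
    rng = random.Random(seed)
    region_start, region_end = region
    observed = count_overlaps(sites, tes)
    permuted = []
    for _ in range(n_iter):
        shuffled = []
        for s, e in sites:
            length = e - s
            new_start = rng.randint(region_start, region_end - length)
            shuffled.append((new_start, new_start + length))
        permuted.append(count_overlaps(shuffled, tes))
    # Empirical one-sided p-value for "more overlap than expected by chance"
    p_value = (sum(c >= observed for c in permuted) + 1) / (n_iter + 1)
    return observed, mean(permuted), pstdev(permuted), p_value


# Hypothetical toy data: four sites that all sit inside one TE
sites = [(100, 150), (120, 170), (130, 180), (140, 190)]
tes = [(90, 200)]
observed, perm_mean, perm_sd, p_value = permutation_test(
    sites, tes, region=(0, 100_000), n_iter=200, seed=1)
```

Keeping the number and size distribution of the shuffled sites identical to the real set is what makes the permuted overlaps a fair null distribution.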
BedTools version 2.26.040 was used to determine the fraction of PHE1 binding sites targeting MEGs, PEGs, or non-imprinted genes where a spatial overlap between binding sites, RC/Helitrons and PHE1 DNA-binding motifs is simultaneously observed. The hypergeometric test was used to assess the significance of the enrichment of PHE1 binding sites where this overlap is observed, across different target types.
Calculation of PHE1 DNA-binding motif densities
To measure the density of PHE1 DNA-binding motifs within different genomic regions of interest, the fasta sequences of these regions were first obtained using BEDtools. HOMER’s scanMotifGenomeWide.pl function was then used to screen these sequences for the presence of PHE1 DNA-binding motifs, and to count the number of occurrences of each motif. Motif density was then calculated as the number of occurrences of each motif, normalized to the size of the genomic region of interest. Chi-square tests of independence were used to test if there were any associations between specific genomic regions and PHE1 DNA-binding motifs. This was done by comparing the proportion of DNA bases corresponding to PHE1 DNA-binding motifs in each genomic region.
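The density calculation and the comparison between regions can be sketched in Python as below. HOMER scores sequences against position weight matrices; this sketch substitutes a simple regular expression for the motif, and the placeholder pattern (a canonical CArG-box consensus, CC[AT]{6}GG), the function names, and the example counts are assumptions rather than the motifs or data of this study. The chi-square statistic is computed for a 2×2 table without continuity correction.

```python
import re
from math import erfc, sqrt


def motif_density(sequences, motif_regex):
    """Return (hits, total bp, hits per bp) for a motif across a set of sequences."""
    pattern = re.compile(f"(?={motif_regex})")  # lookahead so overlapping hits are counted
    hits = sum(len(pattern.findall(seq)) for seq in sequences)
    total_bp = sum(len(seq) for seq in sequences)
    return hits, total_bp, hits / total_bp


def chi_square_2x2(a, b, c, d):
    """Chi-square test of independence for the table [[a, b], [c, d]] (1 d.f.)."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = erfc(sqrt(stat / 2))  # upper tail of chi-square with 1 degree of freedom
    return stat, p_value


# Hypothetical sequences and placeholder motif
hits, total_bp, density = motif_density(["AACCAAATTTGGAA", "GGGG"], "CC[AT]{6}GG")

# Hypothetical counts: motif vs non-motif bases in two kinds of genomic region
stat, p_value = chi_square_2x2(20, 80, 5, 95)
```

In the 2×2 table, each row is a genomic region type and the columns are bases inside and outside motif occurrences, matching the comparison of motif-base proportions described above.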
Identification of homologous PHE1 DNA-binding motifs carried by RC/Helitrons
To assess the homology of PHE1 DNA-binding motifs and associated RC/Helitron sequences, pairwise comparisons were made among all sequences, using the BLASTN program. The following parameters were followed: word size = 7, match/mismatch scores = 2/-3, gap penalties, existence = 5, extension = 2. The RC/Helitron sequences were considered to be homologous if the alignment covered at least 9 out of the 10 bp PHE1 DNA-binding motif sites, extended longer than 30 bp, and had more than 75% identity. Because the mean length of intragenomic conserved non-coding sequences is around 30 bp in A. thaliana51, we considered this as the minimal length of alignments to define a pair of related motif-carrying TE sequences. We identified 1107 RC/Helitrons sequences out of 1232 that were associated with PHE1 A motifs and shared homology with at least one other sequence carrying a PHE1 A motif. For PHE1 B motifs, 675 out of 849 sequences had shared homology with at least one other sequence. The pairwise homologous sequences were then merged in higher order clusters, based on shared elements in the homologous pairs.
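The homology criteria and the merging of homologous pairs into higher order clusters can be illustrated with a short Python sketch. The thresholds are taken from the text above; the function names, the union-find approach to merging pairs, and the example identifiers are assumptions, not the procedure actually used.

```python
def is_homologous(motif_bp_covered: int, aln_length: int, identity_pct: float) -> bool:
    """Apply the stated criteria: >= 9 of 10 motif bp covered, > 30 bp, > 75% identity."""
    return motif_bp_covered >= 9 and aln_length > 30 and identity_pct > 75.0


def cluster_pairs(pairs):
    """Merge homologous pairs into clusters of sequences linked by shared elements."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Hypothetical alignments: (sequence 1, sequence 2, motif bp covered, length, % identity)
alignments = [
    ("TE_a", "TE_b", 10, 45, 82.0),
    ("TE_b", "TE_c", 9, 31, 76.5),
    ("TE_d", "TE_e", 10, 60, 90.0),
    ("TE_a", "TE_d", 8, 50, 95.0),   # fails the motif-coverage criterion
]
pairs = [(a, b) for a, b, cov, length, ident in alignments
         if is_homologous(cov, length, ident)]
clusters = cluster_pairs(pairs)
```

Because clusters are built transitively, two sequences can end up in the same cluster without being directly homologous to each other, as long as a chain of homologous pairs connects them.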
Phylogenetic analyses of PHE1-targeted PEG orthologs in the Brassicaceae
Amino acid sequences and nucleotide sequences of PHE1-targeted PEGs were obtained from TAIR10. The sequences of homologous genes in the Brassicaceae and several other rosids were obtained in PLAZA 4.0 (https://bioinformatics.psb.ugent.be/plaza/)52, BRAD database (http://brassicadb.org/brad/)53, and Phytozome v.12 (https://phytozome.jgi.doe.gov/)54.
For each PEG of interest, the amino acid sequences of the gene family were used to generate a guided codon alignment by MUSCLE with default settings55. A maximum likelihood tree was then generated by IQ-TREE 1.6.7 with codon alignment as the input56. The implemented ModelFinder was executed to determine the best substitution model57, and 1000 replicates of ultrafast bootstrap were applied to evaluate the branch support58. The tree topology and branch supports were reciprocally compared with, and supported by another maximum likelihood tree generated using RAxML v. 8.1.259.
We selected PEGs that had well supported gene family phylogeny with no lineage-specific duplication in the Arabidopsis and Capsella clades, and where imprinting data were available for all Capsella grandiflora (Cgr), C. rubella (Cru), and Arabidopsis lyrata (Aly) orthologs of interest13,60,61. We then obtained the promoter region, defined as 3 kb upstream of the translation start site, of the orthologs and paralogs in Brassicaceae and rosid species. These promoter sequences were searched for the presence of homologous RC/Helitron sequences, as well as for putative PHE1 DNA-binding sites contained in these TEs.
Epigenetic profiling of PHE1 binding sites
Parental-specific H3K27me3 profiles14 and DNA methylation profiles48 generated from endosperm of 2x seeds were used for this analysis. Levels of H3K27me3 and CG DNA methylation were quantified in each 50 bp bin across the 2 kb region surrounding PHE1 binding site centres using deepTools version 2.062. These values were then used to generate H3K27me3 heatmaps and metagene plots, as well as boxplots of CG methylation in PHE1 binding sites. Clustering analysis of H3K27me3 distribution in PHE1 binding sites was done following the k-means algorithm as implemented by deepTools. A two-tailed Mann-Whitney test with continuity correction was used to assess statistical significance of differences in CG methylation levels.
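The binning step can be sketched in plain Python as follows. This is not deepTools code: the function name, the per-base signal list, and the reading of "2 kb region surrounding" the centre as 1 kb on each side (40 bins of 50 bp) are assumptions made for illustration.

```python
def bin_signal(signal, centre, flank=1000, bin_size=50):
    """Mean signal in consecutive bins spanning [centre - flank, centre + flank).

    `signal` is a per-base list of values for one chromosome; the caller must
    ensure that the window lies entirely within the list.
    """
    bins = []
    for start in range(centre - flank, centre + flank, bin_size):
        window = signal[start:start + bin_size]
        bins.append(sum(window) / len(window))
    return bins


# Hypothetical signal: a 50 bp block of enrichment starting at the site centre
signal = [0.0] * 3000
signal[1500:1550] = [1.0] * 50
profile = bin_signal(signal, centre=1500)
```

Stacking one such profile per binding site gives the matrix from which heatmaps, metagene averages, and k-means clusters can be derived.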
Parental gene expression ratios in 2x and 3x seeds
To determine parental gene expression ratios in 2x and 3x seeds, we used previously generated endosperm gene expression data20. In this dataset, Ler plants were used as maternal plants pollinated with wt Col or osd1 Col plants, allowing the parental origin of sequenced reads to be determined following the method described before63.
Parental gene expression ratios were calculated as the number of maternally-derived reads divided by the sum of maternally- and paternally-derived reads available for any given gene. Ratios were calculated separately for the two biological replicates of each cross (Ler × wt Col and Ler × osd1 Col), and the average of both replicates was considered for further analysis. The MEG and PEG ratio thresholds for 2x and 3x seeds indicated in Figure 3d were defined as a four-fold deviation of the expected read ratios, towards more maternal or paternal read accumulation, respectively. The expected read ratio for a biallelically expressed gene in 2x seeds is 2 maternal reads : 3 total reads, while for 3x paternal excess seeds this ratio is 2 maternal reads : 4 total reads. Deviations from these expected ratios were used to classify the expression of published imprinted genes13,19,45–47 as maternally or paternally biased in 3x seeds, according to the direction of the deviation. As a control, the parental bias of these imprinted genes was also assessed in 2x seeds.
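The ratio calculation and the expected values can be made concrete with a small Python sketch. The maternal ratio and the expected ratios (2/3 in 2x endosperm, 2/4 in 3x paternal excess endosperm) follow the text above. The threshold function is only one possible reading of "four-fold deviation", namely a four-fold shift in the maternal : paternal read proportion; that interpretation, the function names, and the read counts are assumptions.

```python
def maternal_ratio(maternal_reads: int, paternal_reads: int) -> float:
    """Maternal reads divided by the sum of maternal and paternal reads."""
    return maternal_reads / (maternal_reads + paternal_reads)


def mean_ratio(replicates) -> float:
    """Average the per-replicate ratios; `replicates` holds (maternal, paternal) counts."""
    ratios = [maternal_ratio(m, p) for m, p in replicates]
    return sum(ratios) / len(ratios)


def thresholds(maternal_copies: int, paternal_copies: int, fold: int = 4):
    """Return (expected ratio, MEG threshold, PEG threshold).

    Assumption: a `fold`-fold deviation is applied to the maternal : paternal
    proportion of reads, in the maternal or paternal direction respectively.
    """
    expected = maternal_copies / (maternal_copies + paternal_copies)
    meg = fold * maternal_copies / (fold * maternal_copies + paternal_copies)
    peg = maternal_copies / (maternal_copies + fold * paternal_copies)
    return expected, meg, peg


# Endosperm genome dosage: 2 maternal : 1 paternal (2x seeds), 2 : 2 (3x paternal excess)
expected_2x, meg_2x, peg_2x = thresholds(2, 1)
expected_3x, meg_3x, peg_3x = thresholds(2, 2)

# Hypothetical read counts for one gene in two biological replicates of a 3x cross
gene_ratio = mean_ratio([(10, 90), (20, 80)])
is_paternally_biased = gene_ratio <= peg_3x
```

Under this reading, a gene in 3x endosperm would be called paternally biased when its maternal ratio falls to 0.2 or below, and maternally biased when it reaches 0.8 or above.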
Parental expression ratios of genes associated with H3K27me3 clusters
Previously published endosperm gene expression data, generated with the INTACT system, was used for this analysis64. Parental gene expression ratios were determined as the mean between ratios observed in the Ler × Col cross and its reciprocal cross. As a reference, the parental gene expression ratio for all endosperm expressed genes was also determined. A two-tailed Mann-Whitney test with continuity correction was used to assess statistical significance of differences between parental gene expression ratios.
H3K27me3 accumulation in imprinted genes
Parental-specific accumulation of H3K27me3 across imprinted gene bodies in the endosperm of 2x14 and 3x seeds (this study) was estimated by calculating the mean values of the H3K27me3 z-score across the gene length. Imprinted genes were considered as those genes previously identified in different studies13,19,45–47. A two-tailed Mann-Whitney test with continuity correction was used to assess statistical significance of differences in H3K27me3 z-score levels.
Statistics
Sample size, statistical tests used, and respective p-values are indicated in each figure or figure legend, and further specified in the corresponding Methods sub-section.
Data availability
ChIP-seq data generated in this study is available at NCBI’s Gene Expression Omnibus database (https://www.ncbi.nlm.nih.gov/geo/), under the accession number GSE129744. Additional data used to support the findings of this study are available at NCBI’s Gene Expression Omnibus, under the following accession numbers: H3K27me3 ChIP-seq data from 2x endosperm14 – GSE66585; Gene expression data in 2x and 3x endosperm20 – GSE84122; Gene expression data in 2x and 3x seeds and parental-specific DNA methylation from 2x endosperm48 – GSE53642; Parental-specific gene expression data of 2x INTACT-isolated endosperm nuclei64 – GSE119915.
Author contributions
R.A.B., J.M-R., Y.Q., D.D.F. and C.K. performed the experimental design. R.A.B., J.M-R., J.V.B. and Y.Q. performed experiments. R.A.B., J.M-R., Y.Q., J.S-G., and C.K. analysed the data. R.A.B. and C.K. wrote the manuscript. All authors read and commented on the manuscript.
Competing interests
The authors declare no competing interests.
Materials and correspondence
The materials generated in this study are available upon request to C.K. (claudia.kohler{at}slu.se).
Acknowledgments
We thank Qi-Jun Chen for providing the pHEE401E CRISPR/Cas9 vector. We are grateful to Cecilia Wärdig for technical assistance. Sequencing was performed by the SNP&SEQ Technology Platform, Science for Life Laboratory at Uppsala University, a national infrastructure supported by the Swedish Research Council (VRRFI) and the Knut and Alice Wallenberg Foundation. This research was supported by a grant from the Swedish Research Council (to C.K.), a grant from the Knut and Alice Wallenberg Foundation (to C.K.), and support from the Göran Gustafsson Foundation for Research in Natural Sciences and Medicine.