Abstract
Understanding evolution of plant immunity is necessary to inform rational approaches for genetic control of plant diseases. The plant immune system is innate, encoded in the germline, yet plants are capable of recognizing diverse rapidly evolving pathogens. Plant immune receptors (NLRs) can gain pathogen recognition through point mutation, recombination of recognition domains with other receptors, and through acquisition of novel ‘integrated’ protein domains. The exact molecular pathways that shape immune repertoire including new domain integration remain unknown. Here, we describe a non-uniform distribution of integrated domains among NLR subfamilies in grasses and identify genomic hotspots that demonstrate rapid expansion of NLR gene fusions. We show that just one clade in the Poaceae is responsible for the majority of unique integration events. Based on these observations we propose a model for the expansion of integrated domain repertoires that involves a flexible NLR ‘acceptor’ that is capable of fusion to diverse domains derived across the genome. The identification of a subclass of NLRs that is naturally adapted to new domain integration can inform biotechnological approaches for generating synthetic receptors with novel pathogen ‘traps’.
Introduction
Plants have powerful defence mechanisms, which rely on an arsenal of plant immune receptors (Jones, Vance and Dangl, 2016; Dodds and Rathjen, 2010). The Nucleotide Binding Leucine Rich Repeat (NLR) proteins represent one of the major classes of plant immune receptors. Plant NLRs are modular proteins characterized by a common NB-ARC domain similar to the NACHT domain in mammalian immune receptor proteins (Jones, Vance and Dangl, 2016). On the population level, NLRs provide plants with enough diversity to keep up with rapidly evolving pathogens (Hall et al., 2009; Joshi et al., 2013). With over 50 fully sequenced plant genomes today, it is timely to apply comparative genomics approaches to investigate common trends in NLR evolution across the plant kingdom, including key crop species.
In contrast to the highly conserved NB-ARC domains, the Leucine Rich Repeats (LRRs) of NLRs show high variability (Noel et al., 1999; Jacob, Vernaldi and Maekawa, 2013). The functional consequence of high LRR variation is thought to be the generation of novel recognition specificities (Bakker et al., 2006; Sukarta, Slootweg and Goverse, 2016). In addition, recent findings show that novel pathogen recognition specificities can also be acquired through the fusion of non-canonical domains to NLRs (Le Roux et al., 2015; Kroj et al., 2016). These exogenous domains can serve as ‘baits’ mimicking host targets of pathogen-derived effector molecules and therefore act in concert with LRR variation to broaden the spectra of recognised pathogen-derived effectors (Cesari, Bernoux et al., 2014a; Cesari et al., 2014b; Le Roux et al., 2015).
NLR plant immune receptors were discovered over 20 years ago through cloning of plant disease resistance genes in Arabidopsis (Mindrinos et al., 1994; Bent et al., 1994). Sequencing of the Arabidopsis genome allowed annotation of the NLR repertoire based on a genome-wide scan for the conserved NB-ARC domain that subsequently revealed common and non-canonical NLR architectures. Application of this method to newly sequenced plant genomes has revealed common principles in NLR composition. Additionally, genome scans have contributed to our understanding of the genome-wide architecture of NLRs, including a tendency for NLRs to form major resistance clusters (Christopoulou et al., 2015; Christie et al., 2016). The relatively poor quality of assembled genome sequence in repetitive regions has hampered accurate identification and annotation of NLR genes, which are present at high copy number in the genome and also encode repetitive LRR domains. To overcome this problem, a method called resistance gene enrichment sequencing (RenSeq) was developed (Jupe et al., 2013; Witek et al., 2016; Andolfo et al., 2014); it involves enrichment of NLRs from genomic or transcribed DNA and enables their accurate assembly. The identification of NLRs across plant genomes using uniform computational methods, such as scanning genomes with Hidden Markov Models (HMMs) for the NB-ARC domain, has allowed the NLR repertoire to be compared across species (Sarris et al., 2016; Kroj et al., 2016; Yue et al., 2016). This has led to identification of plant families with a significantly expanded or reduced number of NLRs (Sarris et al., 2016; Kroj et al., 2016; Zhang et al., 2016) and the identification of co-evolutionary links between NLR diversification and their regulation by miRNAs (Zhang et al., 2016). Comparative genomics analyses also revealed that formation of NLRs with non-canonical architectures is common across flowering plants (Sarris et al., 2016; Kroj et al., 2016).
The NLR copy number variation identified in genomic and RenSeq scans of different plant genomes has been attributed to the birth and death process of gene evolution (Michelmore, Meyers and Young, 1998). The mechanisms by which new NLR genes are created and upon which selection can act remain elusive. The prevailing consensus holds that NLR diversity is likely to be generated through a variety of mechanisms including duplication, unequal crossing over, non-homologous (ectopic) recombination, gene conversion and transposable elements (Jacob, Vernaldi and Maekawa, 2013). Identifying the selection pressures acting on the NLR gene family has also proved challenging, with often only subtle or divergent signatures of selection identified for individual NLRs. This has led to the conclusion that NLRs are generally under purifying and neutral selection (Bakker et al., 2006).
The most recent paradigm in NLR diversification involves fusion to exogenous protein domains, also called integrated domains (NLR-ID), a mechanism deployed across flowering plants (Sarris et al., 2016; Kroj et al., 2016). The availability of sequenced genomes now allows the evolution and diversification of NLRs with integrated domains to be addressed, including the following questions. First, are NLR-IDs distributed uniformly across different subclasses of NLRs or are there specialized clades that are more prone to exogenous domain integration? Previous BLAST analyses of known NLR genes, such as RGA5 and Sr33, hinted at diversity of integrated domains fused to their homologs; however, no evolutionary links between these genes have been established (Cesari et al., 2014a; Periyannan et al., 2013). Second, is NLR-ID diversification associated with particular genomic locations and, if so, are these locations syntenic across species? The answer to this question might shed light on the mechanisms by which NLR-IDs are formed. Third, previous functional analyses of two NLR-ID genes demonstrated that they require activation partners that are co-located in the same genomic locus and even share the same regulatory sequence, being expressed from the opposite strand (Okuyama et al., 2011; Le Roux et al., 2015; Zhai et al., 2014; Sarris et al., 2016). It is still not clear whether such a requirement for a paired NLR is a rule for all NLR-IDs (Sarris et al., 2016) and, if so, how diversification of NLR-IDs would be coupled to the diversity of the pair. While the first question about adaptability of NLRs to new gene fusions can be addressed by studying the evolutionary history of NLR-IDs themselves, definitive answers to the second question might require near-complete genomes with high continuity as well as population genetics analyses.
The grasses (Poaceae) represent a highly successful family of flowering plants that originated as early as 120 million years ago (Prasad et al., 2005; Prasad et al., 2011) and rapidly colonized diverse environments, becoming the most abundant plant family on Earth. Among grasses are three major cereals that form the basis of modern day agriculture and human diet: maize (Zea mays), rice (Oryza sativa) and wheat (Triticum species). It has been suggested that high genomic plasticity of grasses contributed to their adaptability and success in agriculture (Dubcovsky and Dvorak, 2007). The genomes of grasses range in size, ploidy and chromosome number from the 270 Mb genome of Brachypodium distachyon and the 400 Mb genome of rice (O. sativa) to the 17 Gb hexaploid genome of bread wheat (T. aestivum). With genome expansion, transposable elements proliferated from comprising 21% of the Brachypodium genome to over 80% of the wheat genome and play a major role in genome evolution (Vogel et al., 2010; Choulet et al., 2010; Wicker et al., 2016). Genomes of the Poaceae have also undergone major re-arrangements through chromosome fusions, duplications and translocations, leading to divergent chromosome numbers, yet maintaining long syntenic blocks which allow the identification of common and divergent genome regions (Vogel et al., 2010; Salse et al., 2008). Through both global and local rearrangements, the genomes of grasses acquired diverse variation in gene copy numbers, including high copy number of NLRs (Sarris et al., 2016; Zhang et al., 2016), which makes Poaceae an attractive system to study NLR evolution.
In this paper, we have examined the evolutionary dynamics of NLR-IDs in the genomes of nine grass species, their distribution within the NLR phylogeny and the diversity of their integrated domains within and across species. We identified several “hotspot” clades and were able to define one ancient monophyletic clade of NLRs that is highly amenable to new domain integrations, in which most diversity was attained in the Triticeae species. This clade is present at syntenic locations in grasses following the evolutionary history of genomes as well as local species-specific chromosomal translocation events. We also observed that it is absent in maize, likely as a consequence of overall contraction of NLRs in this species. The identification of this NLR-ID hotspot can form the basis for new biotechnological approaches for designing NLR receptors with synthetic fusions to new pathogen traps.
RESULTS AND DISCUSSION
NLR-IDs are distributed non-uniformly across NLR protein subgroups
We examined the evolution of NLRs across nine grass species with available genomes - Setaria italica, Sorghum bicolor, Zea mays (maize), Brachypodium distachyon, Oryza sativa (rice), Hordeum vulgare (barley), Aegilops tauschii, Triticum urartu and Triticum aestivum (hexaploid bread wheat). The phylogeny of 4,130 NLRs from these species, based on the common NB-ARC domain, showed that proteins within a few clades are highly prone to domain integrations compared to other clades, although NLR-ID formation is not exclusive to these clades (Figure 1A). One hotspot clade was particularly enriched in NLR-ID proteins (59 % are NLR-IDs), compared with 8 % NLR-IDs across all clades (Figure 1A, hotspot 1, highlighted in red). This clade was found to be nested within an outer clade (Figure 1A, highlighted in blue) with only 0 to 14 % of proteins containing NLR-IDs. These two clades include proteins representative of all the studied grass species with the exception of Z. mays (Figure 1E). Therefore, we predict that this hotspot clade originated before the split of Panicoideae, Ehrhartoideae and Pooideae (BEP and PACCMAD clades) from the rest of the Poaceae 60 MYA (Vogel et al., 2010). Supporting our hypothesis, an outer ancestral clade was apparent (Figure 1A, highlighted in cyan) that contained proteins from all the grass species present in the tree, including Z. mays, although the bootstrap support value leading to this clade was only 85. Separate NLR phylogenies for each of the grass species showed that the pattern of integrated domain hotspots was strongest in Brachypodium and in the Triticeae species (Figure 2; Supplemental Figures 1 – 9). It is also clear that NLR(-ID) protein duplication has proliferated most strongly in these species for this hotspot clade (Figure 1E).
However, the ratio of NLRs with and without extra domains in this clade has remained relatively constant at around 59 %, suggesting that the rate of domain recycling has been constant across these species (Figure 1B; Supplemental Table 1).
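The contrast between a clade-level NLR-ID fraction and the tree-wide background can be checked with a simple over-representation calculation. The following is a minimal sketch, not the analysis used in this study: it assumes a one-sided hypergeometric test, and the clade size of 100 and the background count of 330 NLR-IDs (roughly 8 % of 4,130) are hypothetical illustration values, not figures reported here.

```python
from math import comb


def fraction_with_id(n_with_id: int, n_total: int) -> float:
    """Fraction of proteins in a clade that carry an integrated domain."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_with_id / n_total


def hypergeom_upper_tail(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) when n proteins are drawn without replacement from a
    population of N proteins, K of which are NLR-IDs."""
    denom = comb(N, n)
    upper = min(n, K)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible configurations contribute nothing to the sum.
    numer = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return numer / denom


# Hypothetical illustration values (assumptions, not reported data):
# a clade of 100 proteins with 59 NLR-IDs, against 330 NLR-IDs in 4,130 NLRs.
clade_fraction = fraction_with_id(59, 100)
p_value = hypergeom_upper_tail(59, 100, 330, 4130)
print(f"clade fraction = {clade_fraction:.2f}, one-sided p = {p_value:.3g}")
```

Exact integer arithmetic is used so that no external statistics library is needed; for routine work a library implementation of the same test would be the more usual choice.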
Two other major NLR-ID hotspots were investigated (Figure 1, hotspots 2 and 3). Hotspot 2 contains 81 proteins from the grass species present in the tree, except for O. sativa and H. vulgare. It is located in an inner clade - 38 % are NLR-IDs - nested within an outer clade, but no ancestral clade was apparent. For hotspot 3, 42 % of proteins (46 out of 109) are NLR-IDs and there is no outer clade or ancestral clade detectable. In each of these hotspots one integrated domain dominates, a DDE superfamily endonuclease domain (hotspot 2) and a BED-type zinc finger domain (hotspot 3), in contrast to hotspot 1, which contains 34 different domains.
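The three quantities used to compare hotspots (the fraction of NLR-IDs, the number of distinct integrated domains, and the most frequent domain) can be tabulated from per-protein domain lists. This is a minimal sketch; the function name, the input layout and the toy domain labels are assumptions made for illustration.

```python
from collections import Counter


def summarise_hotspot(domains_per_protein):
    """Summarise one clade.

    domains_per_protein: one list of integrated-domain names per protein;
    an empty list means the protein has no integrated domain.
    """
    n_total = len(domains_per_protein)
    n_with_id = sum(1 for domains in domains_per_protein if domains)
    # Count each domain once per protein so tandem repeats do not inflate it.
    counts = Counter(
        name for domains in domains_per_protein for name in set(domains)
    )
    return {
        "n_total": n_total,
        "fraction_with_id": n_with_id / n_total if n_total else 0.0,
        "n_distinct_domains": len(counts),
        "dominant": counts.most_common(1)[0] if counts else None,
    }


# Toy input (hypothetical): a clade dominated by a single domain type.
toy_clade = [["BED"], ["BED"], [], ["Kelch"], []]
print(summarise_hotspot(toy_clade))
```

A clade with a high NLR-ID fraction but a single dominant domain is consistent with one ancestral fusion followed by duplication, whereas many distinct domains point to repeated independent integrations.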
Expansion of the NLR-ID hotspot 1 clade is linked to diversification through new gene fusions
The NLR-ID proteins from evolutionary hotspot 1 were examined further to test the hypothesis that the increase in the number of NLRs with integrated domains was due to the creation of novel gene fusions rather than the duplication of existing ID fusions. A clear expansion of the ID domain repertoire was found for this group of proteins, particularly for the Triticeae species (Table 1). It is possible that differences in the observed repertoires can be explained partly by incomplete annotation of genomes or fragmented assembly of NLRs; such proteins were omitted from the phylogenetic analysis if they were < 70 % complete across the NB-ARC domain. However, the overall trend across the genomes strongly suggests that differences cannot be explained solely by differences in genome assemblies. Moreover, genomes such as B. distachyon, Z. mays and O. sativa are assembled to much higher quality than those of the Triticeae species, yet they contain fewer NLR-IDs and have lower ID diversity. This fact suggests that the trend we observed is not only biologically relevant, but could in reality be even more pronounced when complete Triticeae genomes become available.
To further understand the evolution of ID fusions, the section of the tree in Figure 1 for hotspot 1 and the associated outer and ancestral clades were re-aligned and analyzed by maximum likelihood phylogeny (Figure 3A; Supplemental Dataset 4). Each gene was annotated with a cartoon showing both canonical and non-canonical domains. We observed examples in the Triticeae in which neighbouring proteins in the tree - clustering with high bootstrap support - share the same domain at the same position, indicating common ancestry and suggesting selection to maintain a functional fusion. We were only able to find such evidence of conservation between proteins from the Triticeae and Brachypodium.
In contrast to these observations of gene architecture conservation across species, throughout the tree, we also observed orthologs and closely related paralogs with distinct domain fusions. This observation suggests that this subfamily of NLR protein has a high ability to form independent fusions with a wide variety of domains, unlike other types of NLR protein in the rest of the NB-ARC protein family.
The clades highlighted in Figure 3A differ most strikingly in the position of the integrated domain within the protein. The majority of proteins in the outer clade have integrated domains at the N-terminal end. These integrated domains belong to the same Pfam family, which suggests a single integration event followed by gene duplication and secondary losses (Figure 3B), as observed for hotspots 2 and 3. In contrast, the proteins from the inner clade have IDs integrated primarily at their C-terminal end and are much more diverse (Figure 3C). Most of the NLR-ID diversity was observed in T. aestivum, in which 68 proteins in this clade carry at least one integrated domain, together representing 21 non-redundant IDs. The ID repertoire of proteins in the individual T. aestivum subgenomes is larger and more diverse than that of two of the diploid progenitors, T. urartu (A genome) and A. tauschii (D genome), indicating that new domains have continued to be integrated de novo into T. aestivum proteins following the divergence of T. aestivum from these progenitors (Supplemental Dataset 1 and 2).
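The N-terminal versus C-terminal distinction can be made explicit by comparing the coordinates of an integrated domain with those of the NB-ARC domain on the same protein. The sketch below assumes 1-based inclusive amino acid coordinates such as those reported by a domain search; the function name and the "overlapping" category are illustrative choices rather than part of the published pipeline.

```python
def id_position(nbarc, domain):
    """Classify an integrated domain relative to the NB-ARC domain.

    nbarc, domain: (start, end) tuples in protein coordinates (inclusive).
    """
    nb_start, nb_end = nbarc
    d_start, d_end = domain
    if d_end < nb_start:
        return "N-terminal"   # domain lies entirely before the NB-ARC
    if d_start > nb_end:
        return "C-terminal"   # domain lies entirely after the NB-ARC
    return "overlapping"      # ambiguous call, flag for manual inspection


# Toy coordinates (hypothetical) for an NB-ARC spanning residues 200-500.
print(id_position((200, 500), (10, 150)))
print(id_position((200, 500), (900, 1000)))
```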
Genomic locations involved in proliferation and diversification of NLR-IDs
We observed that NLRs from the hotspot clade were found on different chromosomes across and within species. For five species analyzed in this study, the chromosomal location of NLR-IDs was available from the genome annotation. We examined whether NLR-IDs from the hotspot clade were enriched on any particular chromosome and investigated whether these inter-species differences could be explained by whole-genome rearrangement during evolution (Table 2). Indeed, in most species NLR-IDs from the hotspot clade were concentrated on only 1-2 chromosomes, such as chromosomes 2 and 5 in S. bicolor, chromosome 11 in O. sativa and chromosome 4 in B. distachyon, which form known syntenic blocks (Vogel et al., 2010), indicating an ancient origin of the locus that was present in the common ancestor of grasses. With the proliferation of NLR-ID hotspot 1 in wheat, we also observed more divergent locations of NLR-IDs, mostly concentrated on chromosomes 7AS, 7DS and 4AL, but also on chromosomes 1, 3 and 6 (Table 2, Supplemental Table 3). Such proliferation can be explained by more recent large-scale genomic rearrangements, such as translocation of a chromosomal region from 7BS to 4AL and other known chromosomal translocations and duplications (Salse et al., 2008; Clavijo et al., 2016). This indicates that proliferation of NLR-IDs in the Triticeae might be linked to the greater plasticity of their genomes. Since some of the larger translocations in wheat occurred after the formation of NLR-ID hotspot 1, it is also possible that interactions among members of the NLR-ID locus contribute to larger genomic rearrangement events.
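The concentration of hotspot NLR-IDs on one or two chromosomes can be quantified as the share of genes falling on the most populated chromosomes. This is a minimal sketch assuming that a chromosome label is available for each gene from the genome annotation; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def chromosome_shares(locations, top_n=2):
    """Return the top_n most populated chromosomes and their combined share.

    locations: one chromosome label per hotspot-clade NLR-ID gene.
    """
    counts = Counter(locations)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    top = counts.most_common(top_n)
    share = sum(count for _, count in top) / total
    return top, share


# Toy labels (hypothetical), loosely modelled on a species in which most
# hotspot genes sit on a single chromosome.
toy_locations = ["11"] * 8 + ["2"] + ["5"]
print(chromosome_shares(toy_locations, top_n=1))
```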
When we examined orthologous NLRs located on different wheat sub-genomes, we identified rapid local proliferation of domain fusions (Figure 4). In some instances, orthologous copies were subjected to simple domain loss, such as the sub-clade with the Kelch domain, while others exhibited domain swap, such as the sub-clade harbouring NPR1/AP2/Myb_DNA_Binding domains (Figure 4). This indicates that active and very rapid gene rearrangement continues to take place in the local genomic context of NLR-ID hotspot 1.
Possible mechanisms driving NLR-ID diversification
Any mechanism that creates gene fusions requires a move or a copy-and-paste event of an exogenous gene from one location to another. Since NLRs from the hotspot clade are mostly found at syntenic locations, yet harbour diverse fusions, it is most likely that these NLRs act as hotspot ‘acceptors’ for exogenous genes to create NLR-IDs rather than moving themselves. We observed that the overall number of NLRs in the hotspot increases proportionally to the total increase of NLRs in the genome. Therefore, we hypothesize that duplication of NLRs at hotspots creates more ‘acceptor’ sites, which results in greater NLR-ID diversity. What makes these particular NLRs more amenable to new gene fusions compared to other NLRs remains unclear.
How can exogenous domains become fused to NLRs? Wicker, Buchmann and Keller (2010) formulated three main models for gene movement/duplication into non-homologous locations and observed all of them taking place in cereal genomes. The first model involves transposable elements (TEs) acting alone that either excise and transpose genes from one location to another (DNA transposons) or copy and paste them via an RNA intermediate (retrotransposons). In the latter case, the gene at the new site does not contain introns. The second model relies on the endogenous host machinery alone, which repairs double-stranded DNA breaks using a non-homologous exogenous DNA template. The third and most prevalent process in grasses combines the activity of TEs and DNA repair. In this model, TEs insert in new locations themselves or act on common repeats to induce double-stranded breaks without bringing in any exogenous genes; these double-stranded breaks are then repaired by the endogenous machinery using non-homologous DNA fragments. In all cases, gene movement has the potential to create new gene fusions, such as NLR-IDs.
In order to distinguish among these mechanisms, we extracted the coding DNA sequence of integrated domains for 40 T. aestivum genes from the hotspot clade and aligned them back to the genome (blastn, e-value 1e-3). Similar to the NLR portion of the genes, most of the integrated domains contained introns, which indicates that they were not acquired by retrotransposition. Similarly, integrated domains that were acquired recently in cereals and have been validated previously (Sarris et al., 2016), such as NPR1 and Exo70, had unintegrated single-copy paralogs (one per sub-genome) elsewhere in the genome, providing evidence against gene movement through DNA transposition. Therefore, we hypothesize that the integration of exogenous domains would follow ‘copy-and-paste’ mechanisms observed previously in cereals (Wicker, Buchmann and Keller, 2010). Such mechanisms involve double-stranded DNA breaks and repair with a non-homologous template. Whether this process is driven by endogenous plant machinery alone or triggered by the movement of transposable elements remains unclear. Expansion of the hotspot clade in the Triticeae, its proliferation to multiple genomic locations and the increased diversity of integrated domains might be linked to the overall increased fraction of TEs in these genomes compared to the smaller genomes of other grasses.
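One way to read intron presence from a coding-sequence-to-genome alignment is to look for consecutive alignment blocks that are adjacent in the query but separated in the genome. The sketch below illustrates that logic only. It assumes that all blocks for one query lie on the same genomic sequence and strand and that coordinates follow the blastn tabular convention (subject start greater than subject end on the minus strand); the thresholds `min_intron` and `max_query_gap` are hypothetical and were not taken from this study.

```python
from typing import List, NamedTuple


class HSP(NamedTuple):
    """One alignment block between a coding sequence (query) and the genome."""
    qstart: int
    qend: int
    sstart: int
    send: int


def infer_introns(hsps: List[HSP], min_intron: int = 20,
                  max_query_gap: int = 5) -> List[int]:
    """Return the genomic gap lengths that look like introns."""
    ordered = sorted(hsps, key=lambda h: h.qstart)
    introns = []
    for prev, nxt in zip(ordered, ordered[1:]):
        query_gap = nxt.qstart - prev.qend - 1
        if prev.sstart <= prev.send:            # plus-strand alignment
            subject_gap = nxt.sstart - prev.send - 1
        else:                                   # minus-strand alignment
            subject_gap = prev.send - nxt.sstart - 1
        # Blocks that abut in the query but not in the genome imply an intron.
        if abs(query_gap) <= max_query_gap and subject_gap >= min_intron:
            introns.append(subject_gap)
    return introns


# Toy example (hypothetical coordinates): two exons split by a 200 bp gap.
print(infer_introns([HSP(1, 100, 1000, 1099), HSP(101, 200, 1300, 1399)]))
```

A query that aligns as a single uninterrupted block yields an empty list, the pattern expected for a retrotransposed copy.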
Evolutionary model of NLR-ID hotspot formation and proliferation
The processes underpinning genome evolution include domain duplication, fission and fusion (Moore et al., 2008), which have recently been implicated in NLR evolution (Zhong and Cheng, 2016; Kroj et al., 2016; Sarris et al., 2016). Our model (Figure 5) summarizes how these processes could have driven the expansion and diversification of the proteins in NLR-ID hotspot 1 during evolution. The model in Figure 5 can be used to illustrate how a subset of NLR-ID diversity (Figure 4) may have been generated. After the ancestral genes of the hotspot clade acquired a high ability to form fusions, diversification occurred, resulting in the large variety of exogenous integrated domains. The NLR-IDs in the upper clade of Figure 4 contain only fusions to the Kelch domain and could therefore be represented by the model’s ID1 domain, as it appears to be a conserved fusion despite several duplications. The absence of the Kelch domain in NLRs (TRIAE AA2059910.1) nested within clades where all other members have the domain suggests that its absence is likely the result of fission events. Conversely, a diversity of exogenous integrated domains can be seen in the lower clade of Figure 4, which suggests an NLR with an evolutionary history similar to the model’s ID2. Thus, the B3 NLR fusion could be analogous to ID2, with the NLR-ID having undergone successive duplications resulting in both domain preservation and domain swap, where ID3 could hypothetically be a Myb_DNA_Binding domain.
CONCLUSIONS
The source of plants’ ability to rapidly acquire new pathogen recognition specificities remains a key question in plant immunity. Its answer is closely linked to the evolution of plant immune receptors and their diversification. Recently, NLR diversification has been shown to involve acquisition of new exogenous domains, likely through gene fusion events. These integrated domains can be baits for the pathogens and resemble their original targets in the host, thus rapidly expanding the recognition potential of plant NLRs. Here we found that formation of NLR-IDs is not random in NLR evolution and we observed a clear hotspot of NLR-ID proliferation and diversification in grasses, particularly in the Triticeae. This hotspot is of ancient origin and is present in all surveyed grasses except for maize. Such proliferation of NLR-IDs involves new diverse domain integrations. Genomic locations of NLR-IDs from the hotspot clade indicate that it evolved very early near the origin of grasses (or before) and expanded alongside whole genome evolution of these species. In the Triticeae, it shows more rapid movement across the genome as well as rapid local rearrangements. Although the exact mechanisms of NLR-ID formation remain to be uncovered, we predict that they involve double-stranded DNA breaks and could be driven either by endogenous machinery such as non-homologous (ectopic) recombination, which has previously been shown to drive evolution of the Mla locus in barley (Leister et al., 1998), or alternatively by local activity of transposable elements and endogenous DNA repair machinery, as has been previously documented for other types of gene duplications in cereals (Wicker, Buchmann and Keller, 2010).
In the future, the availability of higher quality genome assemblies as well as multiple genomes for each species will allow more detailed analyses of syntenic gene clusters and will identify the precise location of DNA breakpoints that lead to NLR-ID formation. Combining long molecule sequencing RenSeq (Giolai et al., 2016) with population genetic analyses will allow us to estimate how rapidly new gene fusions are formed within populations and how fast selection of advantageous combinations occurs in nature.
Modern plant breeding practices have dramatically reduced the genetic variability of crops. Isolation and utilization of major race-specific resistance genes has been one of the major genetic methods of disease management. However, these approaches are hampered by the appearance of pathogen races that overcome resistance. Reliance on synthetic fungicides for pathogen control has been successful, but it imposes heavy economic costs and has suffered from the same consequences as over-reliance on antibiotics in medicine, leading to selection of highly virulent drug-resistant pathogens. Moreover, fungicides adversely affect human health and the environment, which has resulted in stringent rules on pesticide use and banning of chemicals deemed harmful to human health (Ilbery et al., 2013). Furthermore, future bans are planned to reduce environmental damage. Consequently, it is predicted there will be a significant yield reduction due to pathogen pandemics. There is an urgent need for new genetic sources of resistance for future sustainable crop production (Dangl, Horvath and Staskawicz, 2013; Ellis et al., 2014). Our identification of NLRs that are highly amenable to integration of exogenous domains can be exploited to advance understanding of how new immune receptor specificities are formed and to provide new avenues for generating novel synthetic fusions.
METHODS
Identification of NLRs and NLR-IDs in plant genomes
NLR plant immune receptors were identified in nine monocot species by the presence of the common NB-ARC domain (Pfam PF00931) as described previously (Sarris et al., 2016). T. aestivum (TGAC v1) and A. tauschii (ASM34733v1) genomes were downloaded from EnsemblPlants and analyzed using the same pipeline as before (Sarris et al., 2016). All up-to-date scripts are available from https://github.com/krasileva-group/plant_rgenes.
Phylogenetic Analysis
An HMM model - based on the Pfam model, PF00931 - was built to include the ARC2 subdomain which is also present in plant NB-ARC proteins (Supplemental File 1). To build the model of NB-ARC1-ARC2, eight proteins (Swiss-Prot identifiers: APAF_HUMAN, LOV1A_ARATH, K4BY49_SOLLC, RPM1_ARATH, R13L4_ARATH, RPS2_ARATH, DRL24_ARATH, DRL15_ARATH) were aligned using the PRANK program (Loytynoja and Goldman, 2008) and the HMM profile was built from this alignment with the HMMER3 HMMBUILD program (Mistry et al., 2013), using default parameters for both programs. Amino acid sequences of the NB-ARC proteins were aligned to this HMM model using the HMMER3 HMMALIGN program (version 3.1b2) (Mistry et al., 2013). The resulting alignment of the NB-ARC1-ARC2 domain was converted to fasta format using the HMMER ESL-REFORMAT program. Any amino acids with non-match states in the HMM model were removed from the alignment. Sequences with less than 70 % coverage across the alignment were removed from the data set. The longest sequence for each gene out of the available set of splice versions was used for phylogenetic analysis. In addition, 35 proteins encoded by genes with characterized and known functions in pathogen defence from the literature were also included; the list of genes was based on a curated R-gene dataset by Sanseverino et al., 2012 (http://prgdb.crg.eu). Phylogenetic analysis was carried out using the MPI version of the RAxML (v8.2.9) program (Stamatakis, 2014) with the following method parameters set: -f a, -x 12345, -p 12345, -# 100, -m PROTCATJTT. The tree contained 4,130 sequences and 338 columns, took 67 hours to generate and required 17 GB of RAM. Separate trees for each species were also prepared for Figure 2 using the same methods. Overall species phylogeny was constructed using NCBI taxon identification numbers at phyloT (phylot.biobyte.de).
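The two filtering steps applied before tree building (removing sequences with less than 70 % coverage across the alignment and keeping the longest splice version per gene) can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the published scripts: the alignment is assumed to be a mapping from sequence name to aligned string with `-` or `.` as gap characters, and the `gene_of` callable that maps an isoform identifier to its gene is hypothetical.

```python
def coverage(aligned_seq: str) -> float:
    """Fraction of alignment columns occupied by a residue rather than a gap."""
    if not aligned_seq:
        return 0.0
    gaps = aligned_seq.count("-") + aligned_seq.count(".")
    return (len(aligned_seq) - gaps) / len(aligned_seq)


def filter_by_coverage(alignment: dict, min_cov: float = 0.70) -> dict:
    """Keep sequences covering at least min_cov of the alignment columns."""
    return {name: seq for name, seq in alignment.items()
            if coverage(seq) >= min_cov}


def longest_isoform(proteins: dict, gene_of) -> dict:
    """Keep the longest protein per gene.

    proteins: isoform id -> unaligned sequence.
    gene_of: callable mapping an isoform id to its gene id (an assumption;
             real annotations encode this relationship in different ways).
    """
    best = {}
    for isoform, seq in proteins.items():
        gene = gene_of(isoform)
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (isoform, seq)
    return {isoform: seq for isoform, seq in best.values()}


# Toy data (hypothetical identifiers using a ".N" isoform suffix).
toy_proteins = {"g1.1": "MKT", "g1.2": "MKTLLV", "g2.1": "MA"}
print(longest_isoform(toy_proteins, lambda iso: iso.split(".")[0]))
print(filter_by_coverage({"a": "MKTLLVAAGG", "b": "MK--------"}))
```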
The trees were mid-point rooted and visualized using the Interactive Tree of Life (iTOL) tool (Letunic and Bork, 2016) and are publicly available at http://itol.embl.de under ‘Sharing data’ and ‘KrasilevaGroup’ and in Newick format in Supplemental Dataset 3. Annotation files were prepared for displaying the presence of ID domains in the proteins, identifying species gene identifiers by colour and visualising the location of individual domains within the protein backbone. An ID domain was defined as any domain other than LRR, AAA, TIR and RPW8, which are often associated with NB-ARC-containing proteins.
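The definition of an integrated domain given above translates into a simple rule on the domain content of each NB-ARC-containing protein. The sketch below assumes simplified domain labels in which a family suffix follows an underscore (for example `LRR_8`); this prefix convention and the function names are illustrative assumptions, and real Pfam annotations may need a curated list of canonical family names instead.

```python
# Domains regarded as canonical for NLRs: NB-ARC itself plus LRR, AAA, TIR, RPW8.
CANONICAL_PREFIXES = {"NB-ARC", "LRR", "AAA", "TIR", "RPW8"}


def is_canonical(domain_name: str) -> bool:
    """True if the label belongs to one of the canonical NLR domain families."""
    return domain_name.split("_")[0] in CANONICAL_PREFIXES


def integrated_domains(domains):
    """Return the domains of a protein that count as integrated domains."""
    return [d for d in domains if not is_canonical(d)]


def is_nlr_id(domains) -> bool:
    """An NLR-ID has an NB-ARC domain and at least one integrated domain."""
    return "NB-ARC" in domains and bool(integrated_domains(domains))


# Toy annotations (hypothetical labels).
print(integrated_domains(["NB-ARC", "LRR_8", "Kelch_1"]))
print(is_nlr_id(["NB-ARC", "LRR_1"]))
```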
AUTHOR CONTRIBUTIONS
PB, KVK and WH designed the study. PB, GD, EB and KVK analyzed the data. All authors contributed to writing of the manuscript.
ACKNOWLEDGEMENTS
The authors are grateful to all members of the Krasileva group and many colleagues, especially Sophien Kamoun, for thoughtful discussions of the presented material. We thank Daniil Prigozhin for suggestions on data analyses and the manuscript. We are grateful to Matthew Moscou and William Jackson as well as to the 2016 European Research Council interview panel members for providing additional motivation to write this manuscript. KVK is strategically supported by the Biotechnology and Biological Sciences Research Council (BBSRC) and the Gatsby Charitable Foundation. This project was also supported by the BBSRC Institute Strategic Programme Grant at The Earlham Institute (BB/J004669/1) and BBSRC National Capability in Genomics at The Earlham Institute (BB/J010375/1). The high-performance computing resources and services used in this work were supported by the EI Scientific Computing group alongside the NBIP Computing infrastructure for Science (CiS) group.