Abstract
Nicotiana benthamiana is an important model organism and representative of the Solanaceae (Nightshade) family. N. benthamiana has a complex ancient allopolyploid genome with 19 chromosomes, and an estimated genome size of 3.1Gb. Several draft assemblies of the N. benthamiana genome have been generated; however, many of the gene-models in these draft assemblies appear incorrect. Here we present a nearly non-redundant database of 42,855 improved N. benthamiana gene-models. With an estimated 97.6% completeness, the new predicted proteome is more complete than the previous proteomes. We show that the database is more sensitive and accurate in proteomics applications, while maintaining a reasonably low gene number. As a proof-of-concept we use this proteome to compare the leaf extracellular (apoplastic) proteome to a total extract of leaves. Several gene families are more abundant in the apoplast. For one of these apoplastic protein families, the subtilases, we present a phylogenetic analysis illustrating the utility of this database. Besides proteome annotation, this database will aid the research community with improved target gene selection for genome editing and off-target prediction for gene silencing.
Introduction
Nicotiana benthamiana has risen to prominence as a model organism for several reasons. First, N. benthamiana is highly susceptible to viruses, resulting in highly efficient virus-induced gene-silencing (VIGS) for rapid reverse genetic screens (Senthil-Kumar and Mysore, 2014). This hypersusceptibility to viruses is due to an ancient disruptive mutation in the RNA-dependent RNA polymerase 1 gene (Rdr1), present in the lineage of N. benthamiana which is used in laboratories around the world (Bally et al., 2015). Reverse genetics using N. benthamiana has confirmed many genes important for disease resistance (Wu et al., 2017; Senthil-Kumar et al., 2018). Additionally, N. benthamiana is highly amenable to the generation of stable transgenic lines (Clemente, 2006; Sparkes et al., 2006) and to transient expression of transgenes (Goodin et al., 2008). This easy manipulation has facilitated rapid forward genetic screens and has established N. benthamiana as the plant bioreactor of choice for the production of biopharmaceuticals (Stoger et al., 2014). Finally, N. benthamiana is a member of the Solanaceae (Nightshade) family which includes important crops such as potato (Solanum tuberosum), tomato (Solanum lycopersicum), eggplant (Solanum melongena), and pepper (Capsicum spp.), as well as tobacco (Nicotiana tabacum) and petunia (Petunia spp.).
N. benthamiana belongs to the Suaveolentes section of the Nicotiana genus, and has an ancient allopolyploid origin (>10Mya) accompanied by chromosomal re-arrangements resulting in a complex genome with 19 chromosomes in the haploid genome (reduced from the ancestral allotetraploid 24 chromosomes) and an estimated haploid genome size of ~3.1Gb (Leitch et al., 2008; Goodin et al., 2008; Wang and Bennetzen, 2015). There are four independent draft assemblies of the N. benthamiana genome (Bombarely et al., 2012; Naim et al., 2012), as well as a de-novo transcriptome generated from short-read RNAseq (Nakasugi et al., 2014). These datasets have greatly facilitated research in N. benthamiana, allowing for efficient prediction of off-targets of VIGS (Fernandez-Pozo et al., 2015) and genome editing using CRISPR/Cas9 (Liu et al., 2017), as well as RNAseq and proteomics studies (Grosse-Holz et al., 2018). These draft assemblies are, however, several years old and gene annotations have not been updated since. Furthermore, in the course of our research we realized that many of the gene models in these draft assemblies are incorrect, and that putative pseudo-genes are often annotated as protein-encoding genes. This is exacerbated because these draft assemblies are highly fragmented and because N. benthamiana has a complex origin. Furthermore, the de-novo transcriptome assembly has a high proportion of chimeric transcripts. Because of incorrect annotations, extensive processing is required to select target genes for reverse genetic approaches such as gene silencing and editing, or for phylogenetic analysis of gene families. We realized that gene annotation in several other species in the Nicotiana genus was much better (Xu et al., 2017; Sierro et al., 2013; Sierro et al., 2014), and decided to re-annotate the available N. benthamiana draft genomes using these gene models as a template.
The gene models obtained in this way were extracted into a single non-redundant database with improved gene models. Here we show that this database is more accurate and sensitive for proteomics, facilitates phylogenetic analysis of gene families, and may be useful for genome editing and VIGS on- and off-target prediction.
Results and Discussion
Re-annotation of gene-models in the N. benthamiana genome assemblies
For the annotation of gene-models in the different N. benthamiana draft genomes, we chose to use Scipio (Keller et al., 2008). Scipio refines the transcription start-site, exon-exon boundaries, and the stop-codon position of protein sequences aligned to the genome using BLAT (Keller et al., 2008). Importantly, given that the input protein sequences are well-annotated, this method is more accurate and sensitive than other gene prediction methods (Keller et al., 2011). Because the efficiency of this process correlates with phylogenetic distance, we took the predicted protein sequences from recently sequenced Nicotiana species (Figure 1) (Sierro et al., 2013; Xu et al., 2017). We then used CD-HIT at a 95% identity cut-off to reduce the redundancy in this database and additionally to remove partial sequences (Figure 1, Step 1). The resulting database – Nicotiana_db95 – contains 85,453 protein sequences from various Nicotiana species. We used the protein sequences in this database as an input to annotate gene-models in the four independent draft assemblies of the N. benthamiana genome using Scipio (Figure 1, Step 2). As the available N. benthamiana draft genomes are highly fragmented and each individual draft genome may miss a number of genes, we extracted the gene-models generated by Scipio, filtered for redundancy using CD-HIT-EST and combined the gene-models into a single database (NbA) containing 41,651 gene-models (Figure 1, Step 3). We next compared the predicted proteome derived from our NbA database against the published predicted proteomes using a proteomics dataset from a full-leaf extract or apoplastic fluid (samples described further in the manuscript). Proteins for which peptides were identified in the published proteomes but not in our NbA database were extracted and re-annotated in the draft genomes as described above and added to our NbA database resulting in the NbB database containing 42,884 gene-models (Figure 1; Step 4, 1233 additional entries). 
Finally, missing BUSCOs (Benchmarking Universal Single-Copy Orthologues) (Simão et al., 2015; Waterhouse et al., 2018) were re-annotated in our database as above, together with the manual curation of several gene-families in which several duplicated genes were removed (see Material & Methods) to obtain our final NbC database containing 42,853 entries (Figure 1; Step 5).
The new proteome database is more complete, more sensitive and accurate, and relatively small
We next compared the predicted proteome database to the published predicted proteomes. We also included our Nicotiana_db95 proteome database in this comparison. The published proteomes included the predicted proteomes from the Niben0.4.4 and Niben1.0.1 draft genomes, a previously described curated database in which gene-models from Niben1.0.1 were corrected using RNAseq reads (Grosse-Holz et al., 2018), and the predicted proteome derived from the de-novo transcriptomes (Nakasugi et al., 2014).
We used BUSCO (Simão et al., 2015; Waterhouse et al., 2018) as a quality measure to estimate the completion of our database as compared to the published predicted proteomes (Figure 2a). The BUSCO set used contains 1440 highly conserved plant genes which are expected to be predominantly found in a single-copy (Simão et al., 2015). Nicotiana_db95 has one fragmented and nine missing BUSCOs, indicating that at best we should be able to identify 99.3% of the N. benthamiana genes using this database (Figure 2a). In our NbB database, 1406/1440 (97.6%) BUSCO proteins were identified as complete, of which 762 were single-copy (54.2%), 644 were duplicated (45.8%), nine sequences were fragmented (0.63%), and 25 were missing (1.74%) (Figure 2a). The high number of duplications is likely due to either technical duplication generated by small variations between the different draft genomes, or genuinely duplicated genes arising from the allo-tetraploid origin of N. benthamiana. By adding missing BUSCOs into the NbC database, we recovered eight of the nine fragmented BUSCOs and ten of the 25 missing BUSCOs. In comparison, the next most complete, previously published proteome is the predicted proteome from the Nbv5.1 primary + alternate transcriptome, which has 12 fragmented and 32 missing BUSCOs, but it also has nearly five times more proteins than our database and 71.8% of BUSCOs are duplicated.
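The percentages reported above follow directly from the BUSCO counts. As a minimal sketch (the helper name `busco_summary` is ours for illustration and is not part of the BUSCO software), the NbB figures can be recomputed as follows. Note that the single-copy and duplicated fractions are expressed relative to the complete BUSCOs, whereas the complete, fragmented and missing fractions are relative to the full set of 1440.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Recompute BUSCO summary percentages from raw counts (illustrative helper)."""
    complete = single + duplicated
    total = complete + fragmented + missing
    return {
        "total": total,
        "complete": complete,
        "complete_pct": 100 * complete / total,
        # single-copy and duplicated are reported as a share of complete BUSCOs
        "single_pct_of_complete": 100 * single / complete,
        "duplicated_pct_of_complete": 100 * duplicated / complete,
        "fragmented_pct": 100 * fragmented / total,
        "missing_pct": 100 * missing / total,
    }


# Counts for the NbB database as given in the text
nbb = busco_summary(single=762, duplicated=644, fragmented=9, missing=25)
for key, value in nbb.items():
    print(key, round(value, 2))
```

The same helper applied to Nicotiana_db95 (one fragmented and nine missing out of 1440) gives the 99.3% upper bound quoted above.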
Next, we investigated the number of unique PFAM identifiers found with each entry in each proteome, as an estimation of the number of proteins incorrectly annotated (Figure 2b). We expect that mis-annotated sequences and fragmented gene products are less likely to get a PFAM annotation. Indeed, significantly more proteins get at least one PFAM identifier in our three databases as compared to the published proteomes, indicating that proteins in our database are better annotated.
Furthermore, we looked at the length distributions of proteins in the different predicted proteomes (Figure 2c). We reasoned that the protein-length distribution should be similar to that of the Nicotiana_db95 database. The proteins in the final proteome are significantly longer than those in the Niben0.4.4, Niben1.0.1, and manually curated proteome (Grosse-Holz et al., 2018) while the proteins in the Nbv5.1 primary + alternate proteome are on average larger than in our final NbC database. We speculate that the Niben0.4.4 and Niben1.0.1 predicted proteomes contain many pseudo-genes which are annotated as protein-encoding as well as partial genes (Figure 2c), while the Nbv5.1 primary + alternative proteome has a high proportion of chimeric sequences which due to the short-read sequencing techniques used are biased towards long transcripts (Figure 2c). Additionally, the curated proteome has a large proportion of very small proteins and 47.3% of genes do have a PFAM annotation, which we speculate is due to partial sequences or spurious small ORFs being annotated as protein-encoding (Figure 2b,c).
Finally, comparing the different proteomes on a proteomics dataset indicates the new database has the highest sensitivity, with the highest percentage of annotated MS/MS spectra in both tested samples, while it has the fewest entries (Figure 2d). Additionally, using our new NbC database, we identify the highest number of unique peptides found in at least 3 out of 4 biological replicates of both proteomes (Figure 2d). These metrics combined indicate that the new NbC database is more sensitive and accurate for proteomics than the currently available databases. Importantly, this does not come at the cost of increased redundancy, which would hinder downstream applications.
Since the previous proteomics dataset was also used to re-annotate gene-models (Figure 1; Step 4), we validated our database on an independent dataset by reanalysing a previously published apoplastic proteome of agro-infiltrated N. benthamiana as compared to non-infiltrated N. benthamiana (PRIDE repository PXD006708) (Grosse-Holz et al., 2018). The new NbC database was also more sensitive and accurate than the Curated database on this dataset (18,430 vs 17,960 peptides detected, 22.5%±3.1% vs 21.7%±3.0% spectra identified).
Finally, since phylogenetic analysis of gene families in closely related species often relies on gene-annotations, we compared the predicted proteome from our NbB database against the predicted proteomes of Solanaceae species for which genomes have been sequenced (Figure S1a,b). Our NbB proteome compares well to the predicted proteomes of other sequenced Solanaceae species. Additionally, since the predicted proteomes of some of these species miss a relatively high proportion of genes (up to 28.5% of genes missing or fragmented), care must be taken to not over-interpret results derived from phylogenetic analysis using these sequences.
Improved annotation of the apoplastic proteome of N. benthamiana
Next we used our final NbC database to analyse the extracellular protein repertoire of the N. benthamiana apoplast. The plant apoplast is the primary interface in plant-pathogen interactions (Misas-Villamil and van der Hoorn, 2008; Doehlemann and Hemetsberger, 2013) and apoplastic proteins include many enzymes potentially important in plant-pathogen interactions. We found the protein composition of leaf apoplastic fluid (AF) to be distinct from that of a leaf total extract (TE) (Figure 3a). We considered proteins apoplastic when only detected in the AF samples or when they had a log2 fold abundance difference ≥1.5 with a p-value cut-off of ≤0.01 (BH-adjusted moderated t-test) in the comparison of AF vs TE (518 proteins). Similarly, we considered proteins intracellular when found only in the TE or when they had a log2 fold abundance difference ≤-1.5 with a p-value cut-off of ≤0.01 in the comparison of AF vs TE (1042 proteins) (Figure 3b). The remaining proteins were considered both apoplastic and intracellular (832 proteins). As expected, the apoplastic proteome is significantly enriched for signal peptide containing proteins, while the intracellular proteins and the proteins present both in the apoplast and intracellularly are significantly enriched for proteins lacking a signal peptide (BH-adjusted hypergeometric test, p<0.001).
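The classification rule described above can be written down compactly. The following is a minimal sketch, not the code used for the analysis: the function name and arguments are illustrative, `log2_fc` is assumed to be the AF minus TE log2 abundance difference, and `adj_p` the BH-adjusted p-value of the moderated t-test.

```python
def classify_protein(detected_in_af, detected_in_te, log2_fc=None, adj_p=None,
                     fc_cutoff=1.5, p_cutoff=0.01):
    """Assign a protein to 'apoplastic', 'intracellular' or 'both'.

    Thresholds default to those stated in the text; everything else
    (names, argument layout) is an assumption made for illustration.
    """
    if not (detected_in_af or detected_in_te):
        raise ValueError("protein must be detected in at least one sample type")
    # Proteins seen in only one sample type are assigned to it directly
    if detected_in_af and not detected_in_te:
        return "apoplastic"
    if detected_in_te and not detected_in_af:
        return "intracellular"
    # Detected in both: require significance and a sufficient fold change
    if log2_fc is not None and adj_p is not None and adj_p <= p_cutoff:
        if log2_fc >= fc_cutoff:
            return "apoplastic"
        if log2_fc <= -fc_cutoff:
            return "intracellular"
    return "both"


print(classify_protein(True, True, log2_fc=2.3, adj_p=0.002))
print(classify_protein(True, True, log2_fc=0.4, adj_p=0.002))
```

With the counts reported above, the three classes sum to 518 + 1042 + 832 = 2392 proteins.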
Proteins considered predominantly intracellular are enriched for GO-SLIM terms associated with translation, photosynthesis and transport as biological processes (Figure 3c), and a similar pattern is seen for the molecular function terms (Figure 3d). Proteins present both in TE and in AF are enriched for GO-SLIM terms associated with biosynthetic processes and homeostasis (Figure 3c). These processes are usually performed by proteins acting at multiple subcellular localizations. The apoplastic proteome is enriched for proteins acting in catabolic processes and carbohydrate and lipid metabolic processes (Figure 3c), which is reflected in the enrichment of peptidases and glycosidases (Figure 3d, Table S1 for a full list).
To specify which peptidases are enriched in the apoplast, we also annotated the proteome with MEROPS peptidase identifiers (Rawlings et al., 2018). Three of the 15 different families of peptidases detected in the apoplast have significantly more members enriched in the AF as compared to TE, namely the subtilase (S08; 13 members, p<0.001), serine carboxypeptidase-like (S10; 8 members, p<0.01), and aspartic peptidase families (A01; 16 members, p<0.001), while the proteasome is enriched in the intracellular fraction (T01; 27 members, p<0.001) (BH-adjusted hypergeometric test, Table S2 for a full list).
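For readers unfamiliar with the hypergeometric enrichment test used here, a minimal sketch of the upper-tail probability is given below. The function name is ours, and the counts in the usage example are hypothetical (the total number of detected members per family is not stated in the text); in the actual analysis the resulting p-values were additionally BH-adjusted.

```python
from math import comb


def hypergeom_enrichment_p(k, K, n, N):
    """Upper-tail hypergeometric probability P(X >= k).

    N: proteins in the background, K: family members in the background,
    n: proteins in the selected set, k: family members in the selected set.
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(K, n)):
        raise ValueError("inconsistent counts")
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical example: 13 of 15 detected family members fall in an
# apoplastic set of 518 proteins drawn from 2392 detected proteins.
print(hypergeom_enrichment_p(k=13, K=15, n=518, N=2392))
```

The exact sum over binomial coefficients is used instead of a library call so that the sketch has no dependencies beyond the standard library.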
Pseudogenization in the subtilisin family is consistent with a contracting functional genome
One of the gene families found enriched in the apoplast is the subtilisin family. Several subtilisins are implicated in immunity, notably the tomato P69 clade of subtilisins (Taylor and Qiu, 2017). In order to estimate the completeness of our database, we manually verified and corrected genes belonging to the subtilisin gene family. Our NbC database contains 64 complete subtilisin genes, and one partial gene. By searching the Niben1.0.1 and Niben0.4.4 genome assemblies, we identified an additional 43 putative subtilisin pseudo-genes which had internal stop-codons and are therefore likely non-coding.
Interestingly, phylogenetic analysis shows that close paralogs are often pseudogenised. This pattern of pseudogenization in the subtilisin gene family is consistent with a contracting functional genome upon polyploidization, where for each functional protein-encoding gene there is a corresponding pseudo-gene (Figure 4, and Figure S2). Remarkably, no SBT3 clade family members were identified in N. benthamiana (Figure S3). Finally, we looked for the amino acid residue at the pro-domain junction, as the presence of an aspartic acid residue is indicative of phytaspase activity (Reichardt et al., 2018). Three N. benthamiana subtilisins may possess phytaspase activity based on the presence of an aspartic acid residue at the pro-domain junction as well as a histidine residue in the S1 pocket which is thought to bind to P1 aspartic acid (Figure S2, and Figure S3, Reichardt et al., 2018).
During this analysis we discovered three subtilisin genes that are missing in our NbB database, and six incomplete sequences lacking 5-107 amino acids. In addition, five putative pseudo-genes were annotated as protein-encoding genes and were removed from the final NbC database, and 18 subtilase genes were found to be duplicated and these duplicates were removed in the final NbC database (Table S3). In comparison, the Niben1.0.1 genome annotation predicts 103 different subtilisin gene products. However, we found that these annotated genes correspond to 38 pseudo-genes and 49 protein-encoding genes - none of which are correctly annotated - while 16 subtilisin genes are absent from Niben1.0.1 (Table S3). Furthermore, the predicted proteome from the Nbv5.1 primary+alternate transcriptome contains more than 400 subtilisin gene products, largely due to a large number of chimeric sequences. In conclusion, the new database represents a significant improvement over previous genome annotations and facilitates more accurate and meaningful phylogenetic analysis of gene families in N. benthamiana.
Improved accuracy for genome editing: the subtilase gene-family
Target selection for genome editing is improved by the use of our new database for several reasons: 1) gene-models in this database are more complete; 2) fewer pseudo-genes are annotated as protein-encoding genes; 3) gene duplication is reduced as compared to the de-novo transcriptome; and 4) the remaining duplication in our database is easily resolved for genes of interest as it mostly involves genes with slight sequence variations between the different draft genomes. These sequence variations may be due to heterozygosity or technical artefacts of the sequencing and assembly. As an example we show the gene-model of one of the subtilisins in the different databases. In our NbC database, this subtilisin is encoded by a single-exon gene-model of 2,268bp encoding a 756 amino acid protein (Figure 5a). This subtilisin is highly fragmented in the Niben1.0.1 genome assembly, with parts of the sequence present on different contigs, while the gene is only partially annotated (Figure 5b). The last 90bp of this gene are not annotated in the Nbv0.5 genome (Figure 5c). Furthermore, there is a 132bp insertion in the Niben0.4.4 genome assembly resulting in a predicted protein with a 44 amino-acid insertion (Figure 5d). Additionally, using BLAST we identified 13 sequences in the Nbv5.1 primary + alternate predicted proteome that correspond to partial or chimeric variants of this subtilisin, with no full match. Finally, this subtilisin differs by three non-synonymous SNPs between the Niben1.0.1 and Nbv0.5 genome assemblies, while two of these non-synonymous SNPs are present in the Niben0.4.4 genome assembly (Figure 5b,d). In conclusion, this example shows how combining gene-models derived from different genome assemblies has made our database more complete than annotating any single genome assembly currently available.
Although our NbC database does not contain the genomic context and lacks non-coding genes, this database will vastly improve research on N. benthamiana. We trust our NbC database to be useful for the large research community of plant scientists using N. benthamiana as a model system, for example to identify novel interactors in Co-IP experiments, but also to facilitate reverse genetic approaches such as genome editing and VIGS.
Material & Methods
Sequence retrieval
The predicted proteomes for N. attenuata (GCF_001879085.1) (Xu et al., 2017) (http://nadh.ice.mpg.de/NaDH/), N. tabacum TN90 (GCF_000715135.1) (Sierro et al., 2013), N. sylvestris (GCF_000393655.1) (Sierro et al., 2013) and N. tomentosiformis (GCF_000390325.2) (Sierro et al., 2013), and Daucus carota subsp. sativus (GCA_001625215.1) (Iorizzo et al., 2016) were downloaded from Genbank. In addition, we retrieved 565 full-length N. benthamiana protein sequences from Genbank. The Arabidopsis thaliana predicted proteome (Araport11_genes.201606.pep) was obtained from Araport (Cheng et al., 2017). The Solanum melongena predicted proteome (SME_r2.5.1_pep) was obtained from the Eggplant Genome DataBase (Hirakawa et al., 2014). The N. obtusifolia (NIOBT_r1.0) predicted proteome was obtained from the Nicotiana attenuata Data Hub (Xu et al., 2017) (http://nadh.ice.mpg.de/NaDH/). The Petunia axillaris N (Petunia_axillaris_v1.6.2_proteins) and P. inflata S6 (Petunia_inflata_v1.0.1_proteins) (Bombarely et al., 2016), Capsicum annuum glabriusculum (CaChiltepin.pep) and C. annuum zunla-1 (CaZL1.pep) (Qin et al., 2014), C. annuum cv CM334 (Pepper.v.1.55.proteins.annotated) (Kim et al., 2014), Solanum tuberosum (PGSC_DM_v3.4_pep) (Consortium, 2011), and Solanum lycopersicum (ITAG3.2_proteins) (Consortium, 2012) predicted proteomes were downloaded from Solgenomics. The N. benthamiana draft genome builds Niben1.0.1 and Niben0.4.4 - both generated by the Boyce Thompson Institute for Plant Research (BTI) (Bombarely et al., 2012) - were downloaded from Solgenomics, and the Nbv0.5 and Nbv0.3 draft genomes were made available by the Waterhouse lab at the Queensland University of Technology (Naim et al., 2012).
Annotation
In order to extract gene-models from the published N. benthamiana draft genomes we combined all the Nicotiana protein sequences, except for those from N. obtusifolia, in one database, with the addition of 110 genes which we had previously manually curated, leading to a database with 226,543 protein sequences. We used CD-HIT (v4.6.8) (Fu et al., 2012) to cluster these sequences at a 95% identity threshold and reduce the redundancy in our database while removing partials (Nicotiana_db95; 85,453 sequences). This database was used to annotate the gene-models in the different N. benthamiana genome builds using Scipio version 1.4.1 (Keller et al., 2008) which was run with default settings. After running Scipio we used Augustus (v3.3) (Stanke et al., 2006) to extract complete and partial gene models. Putative pseudo-genes (containing internal stop codons) and genes lacking an ATG start or stop codon were stored separately. Transdecoder (v5.0.2) (Haas et al., 2013) was used to retrieve the single-best ORF on the putative pseudo-genes containing homology to the Nicotiana_db95 database as determined by BlastP searches. If a putative pseudo-gene contained an ORF >90% of the annotated gene length and lacking <30 amino acids it was considered a putative gene. Other putative pseudo-genes were discarded. Next we used CD-HIT-EST to filter the redundancy from this database. First, we used CD-HIT-EST to cluster the CDS of the gene-models derived from the different databases at 100% identity. Next, we selected the longest sequence at 99% identity between the different genome builds using CD-HIT-EST-2D in the following order for both the complete and the partial databases: Niben1.0.1 > Nbv0.5 > Niben0.4.4 > Nbv0.3. Because this procedure retains shorter sequences, we then used the reduced databases in the opposite order to remove partial genes: Nbv0.3 > Niben0.4.4 > Nbv0.5 > Niben1.0.1.
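The rule used to rescue putative pseudo-genes can be illustrated with a short sketch. This is an illustration under stated assumptions, not the script used in this study: the function name is hypothetical, and both lengths are assumed to be measured in amino acids.

```python
def is_putative_gene(orf_length_aa, annotated_length_aa,
                     min_fraction=0.90, max_missing_aa=30):
    """Apply the rescue rule from the text to one putative pseudo-gene.

    Returns True when the single-best ORF covers more than `min_fraction`
    of the annotated protein length and lacks fewer than `max_missing_aa`
    residues relative to it. Lengths in amino acids are an assumption.
    """
    if annotated_length_aa <= 0:
        raise ValueError("annotated_length_aa must be positive")
    missing = annotated_length_aa - orf_length_aa
    covers_enough = orf_length_aa > min_fraction * annotated_length_aa
    return covers_enough and missing < max_missing_aa


# A 756-residue reference with a 740-residue ORF passes both criteria,
# whereas a 720-residue ORF passes the 90% criterion but lacks 36 residues.
print(is_putative_gene(740, 756))
print(is_putative_gene(720, 756))
```

Both criteria are needed: the 90% threshold is the limiting one for proteins shorter than 300 residues, while the 30-residue threshold is the limiting one for longer proteins.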
Finally we used CD-HIT-EST-2D to remove genes from the partial database with a longer representative in the complete database at 99% identity and vice-versa. This resulted in the NbA database. We compared this database for proteomic analysis on the described proteomics dataset containing apoplastic fluid (AF) samples and full-leaf extract (TE) samples and compared its performance to the other published predicted proteomes. For this analysis we predicted the Nbv5.1 proteome from the transcriptome using Transdecoder and selecting the single-best ORF with homology to the Nicotiana_db95 database, and filtered the database using CD-HIT at 100% identity. Proteins for which peptides were identified in the other databases but absent from the NbA database search were extracted, clustered at 100% using CD-HIT, and re-annotated in the genomes as above. This resulted in the NbB database. Finally, we ran BUSCO (v3.0.2; dependencies: NCBI-BLAST v2.7.1+; HMMER v3.1; Augustus v3.3) (Simão et al., 2015; Waterhouse et al., 2018) on the different N. benthamiana predicted proteomes using the plants set (Embryophyta_odb9), extracted the missing BUSCOs and re-annotated these as above. Additionally, we manually inspected the database for the PLCP, subtilisin, VPE, and GH35-domain encoding gene families, and manually removed redundant sequences. This resulted in our final database. This database was annotated using SignalP (v4) (Petersen et al., 2011), ApoplastP (v1.0.1) (Sperschneider et al., 2018), and PFAM (v31) (Finn et al., 2016). Finally we annotated the predicted proteome with GO terms and UniProt identifiers using Sma3s v2 (Casimiro-Soriguer et al., 2017).
Sample preparation for proteomics and definition of biological replicates
Four-week-old N. benthamiana plants were used. The AF was extracted by vacuum infiltrating N. benthamiana leaves with ice-cold MilliQ water. Leaves were dried to remove excess liquid, and apoplastic fluid was extracted by centrifugation of the leaves in a 20 ml syringe barrel (without needle or plunger) in a 50 ml falcon tube at 2000× g, 4°C for 25min. Samples were snap-frozen in liquid nitrogen and stored at −80°C prior to use. TE was collected by removing the central vein and snap-freezing the leaves in liquid nitrogen followed by grinding with a pestle and mortar and addition of three volumes of phosphate-buffered saline (PBS) (w/v). One biological replicate was defined as a sample, AF or TE, consisting of one leaf from three independent plants (3 leaves total). Four independent biological replicates were taken for AF and TE.
Protein digestion and sample clean-up
AF and TE samples corresponding to 15μg of protein were taken (based on Bradford assay). Dithiothreitol (DTT) was added to a concentration of 40mM, and the volume adjusted to 250μl with MS-grade water (Sigma). Proteins were precipitated by the addition of 4 volumes of ice-cold acetone, followed by a 1hr incubation at −20°C and subsequent centrifugation at 18,000× g, 4°C for 20min. The pellet was dried at room temperature (RT) for 5min and resuspended in 25μL 8M urea, followed by a second chloroform/methanol precipitation. The pellet was dried at RT for 5 min and resuspended in 25μL 8M urea. Protein reduction and alkylation were achieved by sequential incubation with DTT (final 5mM, 30 min, RT) and iodoacetamide (IAM; final 20mM, 30min, RT, dark). Non-reacted IAM was quenched by raising the DTT concentration to 25mM. Protein digestion was started by addition of 1000ng LysC (Wako Chemicals GmbH) and incubation for 3hr at 37°C while gently shaking (800rpm). The samples were then diluted with ammonium bicarbonate (final concentration 80mM) to a final urea concentration of 1M. 1000ng sequencing-grade trypsin (Promega) was added and the samples were incubated overnight at 37°C while gently shaking (800rpm). Protein digestion was stopped by addition of formic acid (FA, final 5% v/v). Tryptic digests were desalted on home-made C18 StageTips (Rappsilber et al., 2007) by passing the solution over 2 disc StageTips in 150μL aliquots by centrifugation (600-1200× g). Bound peptides were washed with 0.1% FA and subsequently eluted with 80% acetonitrile (ACN). Samples were dried using a vacuum concentrator (Eppendorf), and the peptides were resuspended in 20 μL 0.1% FA solution.
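The urea dilution step above follows from C1·V1 = C2·V2. As a minimal worked example (the helper name is illustrative, and the small volumes contributed by DTT, IAM and LysC are ignored as a simplifying assumption), diluting 25μL of 8M urea to 1M requires a final volume of about 200μL:

```python
def final_volume_ul(start_volume_ul, start_conc_m, target_conc_m):
    """Solve C1*V1 = C2*V2 for V2 (volumes in microlitres, concentrations in M)."""
    if target_conc_m <= 0 or target_conc_m > start_conc_m:
        raise ValueError("target concentration must be in (0, start concentration]")
    return start_volume_ul * start_conc_m / target_conc_m


v2 = final_volume_ul(start_volume_ul=25, start_conc_m=8, target_conc_m=1)
diluent = v2 - 25
print(v2, diluent)  # 200.0 175.0
```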
LC-MS/MS
The samples were analysed as described in Grosse-Holz et al. (2018). Briefly, samples were run on an Orbitrap Elite instrument (Thermo) (Michalski et al., 2011) coupled to an EASY-nLC 1000 liquid chromatography (LC) system (Thermo) operated in the one-column mode. Peptides were directly loaded on a fused silica capillary (75μm × 30cm) with an integrated PicoFrit emitter (New Objective) analytical column packed in-house with Reprosil-Pur 120 C18-AQ 1.9 μm resin (Dr. Maisch), taking care to not exceed the set pressure limit of 980 bar (usually around 0.5-0.8μl/min). The analytical column was encased by a column oven (Sonation; 45°C during data acquisition) and attached to a nanospray flex ion source (Thermo). Peptides were separated on the analytical column by running a 140-min gradient of solvent A (0.1% FA in water; Ultra-Performance Liquid Chromatography (UPLC) grade) and solvent B (0.1% FA in ACN; UPLC grade) at a flow rate of 300nl/min (gradient: start with 7% B; gradient 7% to 35% B for 120 min; gradient 35% to 100% B for 10 min and 100% B for 10 min). The mass spectrometer was operated using Xcalibur software (version 2.2 SP1.48) in positive ion mode. Precursor ion scanning was performed in the Orbitrap analyzer (FTMS; Fourier Transform Mass Spectrometry) in the scan range of m/z 300-1800 and at a resolution of 60000 with the internal lock mass option turned on (lock mass was 445.120025 m/z, polysiloxane) (Olsen et al., 2005). Product ion spectra were recorded in a data-dependent manner in the ion trap (ITMS) in a variable scan range and at a rapid scan rate. The ionization potential was set to 1.8kV. Peptides were analysed by a repeating cycle of a full precursor ion scan (1.0 × 10⁶ ions or 50ms) followed by 15 product ion scans (1.0 × 10⁴ ions or 50ms). Peptides exceeding a threshold of 500 counts were selected for tandem mass (MS2) spectrum generation.
Collision induced dissociation (CID) energy was set to 35% for the generation of MS2 spectra. Dynamic ion exclusion was set to 60 seconds with a maximum list of excluded ions consisting of 500 members and a repeat count of one. Ion injection time prediction, preview mode for the Fourier transform mass spectrometer (FTMS, the orbitrap), monoisotopic precursor selection and charge state screening were enabled. Only charge states higher than 1 were considered for fragmentation.
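The 140-min LC gradient described above can be expressed as a piecewise function of time. The sketch below assumes linear ramps between the stated set points (the text gives only the start and end composition of each segment); the names `GRADIENT` and `percent_b` are ours.

```python
# (time in min, % solvent B) set points taken from the gradient description:
# 7% B at start, 7-35% over 120 min, 35-100% over 10 min, 100% held for 10 min
GRADIENT = [(0.0, 7.0), (120.0, 35.0), (130.0, 100.0), (140.0, 100.0)]


def percent_b(t_min):
    """Return % solvent B at time t_min, assuming linear interpolation."""
    if not GRADIENT[0][0] <= t_min <= GRADIENT[-1][0]:
        raise ValueError("time outside the 0-140 min gradient")
    for (t0, b0), (t1, b1) in zip(GRADIENT, GRADIENT[1:]):
        if t_min <= t1:
            return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)


for t in (0, 60, 125, 140):
    print(t, percent_b(t))
```

The three segments sum to 120 + 10 + 10 = 140 min, matching the stated gradient length.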
Peptide and Protein Identification
Peptide spectra were searched in MaxQuant (version 1.5.3.30) using the Andromeda search engine (Cox et al., 2011) with default settings and label-free quantification and match-between-runs activated (Cox and Mann, 2008; Cox et al., 2014) against the databases specified in the text including a known contaminants database. Included modifications were carbamidomethylation (static) and oxidation and N-terminal acetylation (dynamic). Precursor mass tolerance was set to ±20 ppm (first search) and ±4.5 ppm (main search), while the MS/MS match tolerance was set to ±0.5 Da. The peptide spectrum match FDR and the protein FDR were set to 0.01 (based on a target-decoy approach) and the minimum peptide length was set to 7 amino acids. Protein quantification was performed in MaxQuant (Tyanova et al., 2016), based on unique and razor peptides including all modifications.
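Because the precursor tolerances above are given in parts per million, the absolute matching window scales with the measured value. A minimal sketch of the conversion follows; the function name and the example value of 800.0 are hypothetical and serve only to illustrate the two tolerances used in the first and main searches.

```python
def ppm_window(value, tol_ppm):
    """Return the (low, high) window for a relative tolerance given in ppm."""
    if value <= 0 or tol_ppm < 0:
        raise ValueError("value must be positive and tolerance non-negative")
    delta = value * tol_ppm / 1e6
    return value - delta, value + delta


example_mz = 800.0  # hypothetical precursor m/z
for tol in (20, 4.5):  # first-search and main-search tolerances from the text
    low, high = ppm_window(example_mz, tol)
    print(tol, low, high)
```

By contrast, the ±0.5 Da MS/MS tolerance is absolute and does not depend on the fragment m/z.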
Proteomics processing in R
Identified protein groups were filtered for reverse and contaminant proteins and those only identified by matching, and only those protein groups identified in 3 out of 4 biological replicates of either AF or TE were selected. The LFQ values were log2 transformed, and missing values were imputed using a minimal distribution as implemented in imputeLCMD (v2.0) (Lazar, 2015). A moderated t-test was used as implemented in Limma (v3.34.3) (Ritchie et al., 2015; Phipson et al., 2016) and adjusted using Benjamini–Hochberg (BH) adjustment to identify protein groups significantly differing between AF and TE. Bona fide apoplastic protein groups were those only detected in AF and those significantly (p≤0.01) enriched in AF samples with a log2 fold change ≥1.5. Protein groups only detected in TE and those significantly (p≤0.01) depleted in AF samples with a log2 fold change ≤-1.5 were considered intracellular. The remainder was considered both apoplastic and intracellular. Majority proteins were annotated with SignalP, PFAM, MEROPS (v12) (Rawlings et al., 2018), GO, and UniProt keyword identifiers. A BH-adjusted hypergeometric test was used to identify those terms that were either depleted or enriched (p≤0.05) in the bona fide AF protein groups as compared to bona fide AF-depleted proteins or protein groups present both in the AF and TE.
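The Benjamini–Hochberg adjustment applied to both the moderated t-test and the hypergeometric test can be sketched in a few lines. This is a plain-Python illustration of the standard step-up procedure, not the R code used for the analysis; the function name and the example p-values are ours.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]  # hypothetical raw p-values
print(bh_adjust(raw))
```

Each raw p-value is multiplied by m/rank, and the running minimum ensures that adjusted values never decrease as the raw p-values increase.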
Phylogenetic analysis
Predicted proteomes were annotated with PFAM identifiers, and all sequences containing a Peptidase S8 (PF00082) domain were extracted from the different databases. Additionally, we manually curated the subtilisin gene family in the Niben1.0.1 draft genome, identifying putative pseudogenes that were annotated as protein-encoding genes, as well as missing genes, incorrect gene models, and genes for which the reference sequence was absent in Niben1.0.1. Tomato subtilisins were retrieved from Solgenomics, and other previously characterized subtilisins (Taylor and Qiu, 2017) were retrieved from NCBI. Clustal Omega (Sievers et al., 2011; Li et al., 2015) was used to align these sequences. The putative pseudogene sequences were substituted with the best BLAST hit in NCBI in order to visualize pseudogenization in the alignment and phylogenetic tree. Selection of the best model for maximum likelihood phylogenetic analysis and the phylogenetic analysis itself were performed in MEGA X (Kumar et al., 2018). The evolutionary history was inferred using the Maximum Likelihood method based on the Whelan and Goldman model. A discrete Gamma distribution was used to model evolutionary rate differences among sites, and the rate variation model allowed for some sites to be evolutionarily invariable. All positions with less than 80% site coverage were eliminated. Niben101Scf00595_742942-795541 was used to root the phylogenetic trees.
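The site coverage filter can be illustrated with a short Python sketch. It is not MEGA X code. The function name `filter_by_site_coverage` is hypothetical, and treating only "-" and "?" as missing data is a simplifying assumption made here.

```python
def filter_by_site_coverage(alignment, min_coverage=0.8, missing="-?"):
    """Drop alignment columns in which too few sequences have a residue.

    alignment:    dict mapping sequence name to aligned sequence
                  (all sequences must have the same length).
    min_coverage: minimum fraction of sequences with a residue at a site.
    missing:      characters counted as gap or missing data (assumption).
    """
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all aligned sequences must have the same length")
    # Keep a column only if enough sequences carry a residue there.
    keep = [
        col for col in range(length)
        if sum(seq[col] not in missing for seq in seqs) / len(seqs) >= min_coverage
    ]
    return {
        name: "".join(seq[col] for col in keep)
        for name, seq in zip(names, seqs)
    }


if __name__ == "__main__":
    # Toy alignment with made-up sequences, for illustration only.
    toy = {
        "a": "MK-LV",
        "b": "MKALV",
        "c": "MK-L-",
        "d": "M--LV",
        "e": "MK-LV",
    }
    for name, seq in filter_by_site_coverage(toy).items():
        print(name, seq)
```

In this toy alignment, only the third column is removed, because a residue is present in just one of the five sequences (20% coverage); columns with exactly 80% coverage are retained.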
Data availability
The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE (Vizcaíno et al., 2016) partner repository (https://www.ebi.ac.uk/pride/archive/) with the data set identifier PXD010435. During the review process the data can be accessed via a reviewer account (Username: reviewer17475@ebi.ac.uk; Password: PQSfFZyN). Samples FGH01-04 represent AF and FGH05-08 represent TE.
Funding
This work has been supported by ‘The Clarendon Fund’ (JK), and the ERC Consolidator grant 616449 ‘GreenProteases’ (RvdH, FGH). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests
The authors have declared that no competing interests exist.
Author contributions
Conceptualization: JK, RvdH; Formal analysis: JK; Funding acquisition: RvdH; Wetlab experiments: FGH; Proteomics: FK, MK; Programming: JK, FH; Writing: JK, RvdH.
Acknowledgements
We would like to thank Philippe Varennes-Jutras and Daniela Sueldo for critically reading the manuscript and providing important suggestions for its improvement.
Footnotes
Database availability. The database has been uploaded to Oxford Research Archives (ORA) and can be downloaded from https://deposit.ora.ox.ac.uk/datasets/uuid:f09e1d98-f0f1-4560-aed4-a5147bc7739d. We hope that this database will be accessible via the SolGenomics database in the near future.