Abstract
Background High-throughput sequencing (HTS) technologies are increasingly applied to analyse complex microbial ecosystems by mRNA sequencing of whole communities, also known as metatranscriptome sequencing. This approach is at the moment largely limited to prokaryotic communities and communities of few eukaryotic species with sequenced genomes. For eukaryotes the analysis is hindered mainly by a low and fragmented coverage of the reference databases to infer the community composition, but also by lack of automated workflows for the task.
Results From the databases of the National Center for Biotechnology Information and Marine Microbial Eukaryote Transcriptome Sequencing Project, 142 references were selected in such a way that the taxa represent the main lineages within each of the seven supergroups of eukaryotes and possess predominantly complete transcriptomes or genomes. From these references, we created an annotated microeukaryotic reference database. We developed a tool called TaxMapper for reliable mapping of sequencing reads against this database and filtering of unreliable assignments. For filtering, a classifier was trained and tested on sequences in the database, sequences of related taxa to those in the database and randomly generated sequences. Additionally, TaxMapper is part of a metatranscriptomic Snakemake workflow developed to perform quality assessment, functional and taxonomic annotation and (multivariate) statistical analysis including environmental data. The workflow is provided and described in detail to empower researchers to easily apply it for metatranscriptome analysis of any environmental sample.
Conclusions TaxMapper shows superior performance compared to standard approaches, resulting in a higher number of true positive taxonomic assignments. Both the TaxMapper tool and the workflow are available as open-source code at Bitbucket under the MIT license: https://bitbucket.org/dbeisser/taxmapper and as a Bioconda package: https://bioconda.github.io/recipes/taxmapper/README.html.
Background
Motivation and goals
Metatranscriptome sequencing of diverse ecosystems is becoming a common methodology in many research institutions, and large scale sampling campaigns such as the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP, [1]) and the Tara Oceans expedition [2] have contributed to a growing amount of available environmental sequencing data. However, the analysis of the resulting short read sequences is still far from routine, especially for unicellular eukaryotic organisms, due to what was termed by Escobar-Zepeda et al. as “the neglected world of eukaryotes in metagenomics” [3]. This is particularly severe since microscopic eukaryotes (protists) constitute a paraphyletic taxon [4] spread over the whole eukaryotic tree of life and represent the bulk of most major groups, whereas multicellular lineages are confined to small corners [5]. Protists occur at high abundance in almost all habitats, e.g. in freshwaters, oceans, biofilms and soils [5, 6, 7, 2, 8, 9]. They maintain ecosystem functions, as they are responsible for most planktonic primary production [10], are the most important feeders of bacteria [11, 7] and key players in the regulation of element cycling, particularly carbon [7, 12].
Perhaps surprisingly then, protists are poorly covered by genomic reference databases despite their broad diversity, and if at all, only few model species are present. Therefore, most recent metatranscriptome approaches were designed for prokaryotes, which offer more complete databases (e.g. NCBI) in contrast to eukaryotes. Here, efficient mapping approaches, such as BWA or Bowtie, and methodologies allowing few differences to the reference sequences (e.g. k-mer indices) can be used. It is frequently possible to obtain taxonomic assignments even down to species level.
In contrast, few genome sequences from eukaryotes exist, and those that do are not well balanced across the main lineages of the eukaryotic tree of life, and therefore do not reflect the diversity within these lineages. The main focus of publicly available genomes lies on the Opisthokonta (Fungi/Metazoa group), including many animals, in particular model organisms, and Viridiplantae (green plants, containing Streptophyta and Chlorophyta) with an emphasis on crop plants. For example, in the NCBI database the available genomes in these two groups already represent 96% of the available genomes for eukaryotes, whereas eukaryotic genomes represent 43% of all genomes from the three domains (bacteria: 54%, archaea: 3%, NCBI June 2017).
The diversity of microbial eukaryotes is strongly underrepresented and database searches that aim at an assignment of metatranscriptomic reads on species level will, for the most part, be incorrect. This is caused by the fact that neither the species nor a close relative are included in the database and by the disproportional coverage of taxonomic groups leading to misassignments of reads to incorrect taxa by chance. In addition, available databases are often too large to be used in their entirety to map or search with millions of metatranscriptomic sequences on the read level.
A possible way out (taken here) is to restrict the taxonomic assignment to broader taxonomic groups, using appropriate reference organisms for each group. In turn, this requires a different approach to the similarity search, one that can find more distantly related sequences. Since such similarity search tools are more time consuming, a reasonable search time can only be obtained by restricting the analysis to smaller reference databases.
Many existing approaches base their taxonomic assignments on selected sequenced marker genes. However, for a joint taxonomic and functional analysis (which taxonomic group performs which functions?), it is necessary to assign each single read to a taxonomic group and to a protein family.
Our goal was therefore to design, test and provide a comprehensive tool and workflow for eukaryotic metatranscriptome analysis, encompassing everything from preprocessing to integration of environmental data. A large impediment, as already mentioned, was a missing reference for the taxonomic assignment of sequences, which we constructed for all major taxonomic groups based on 142 publicly available transcriptomes and genomes. Our tool TaxMapper assigns taxonomic information to each read by mapping to the database using a reduced amino acid alphabet, and subsequently filtering out unreliable assignments. It is part of an automated rule-based Snakemake workflow developed to perform quality assessment and both functional and taxonomic annotation, as well as (multivariate) statistical analysis including environmental data.
In this work, we (i) describe the microeukaryotic reference database, (ii) present the TaxMapper software for taxonomic mapping and filtering of reads, and (iii) provide a detailed step-wise instruction on how to analyse metatranscriptomes from eukaryotic microorganisms using a modular workflow.
Related work
Metatranscriptome workflows
Existing metatranscriptome workflows often focus on bacterial composition, like Leimena et al. [13] who describe in detail an analysis pipeline for prokaryotic datasets. Other studies construct pipelines for subparts of the analysis, including Goncalves et al. [14] who constructed an R-based pipeline for pre-processing, quality assessment and expression estimation of RNA sequence datasets, and Marchetti et al. [15] who provide an R package for differential expression analysis of metatranscriptome sequences starting from a count matrix of genes and a phylogenetic annotation. For our purposes, these approaches have two disadvantages: (i) they provide no complete executable workflow, and (ii) the available workflow parts cannot be easily adapted to eukaryotic data.
Metatranscriptome analysis tools
Many metagenomics or metatranscriptomics analysis tools were conceived for the analysis of bacterial communities. For example, CLARK [16, 17] is a tool for the taxonomic classification of metagenomic reads using known bacterial genomes. GOTTCHA [18] is a taxonomic profiler that uses nonredundant signature databases for prokaryotic and viral genomes. Genometa [19] is a Java program to identify bacterial species and gene content from high-throughput datasets. MetaPhyler [20] estimates bacterial composition from metagenomic samples.
Others use a subset of the sequences for taxonomic profiling of metagenomes. Web-based solutions are provided by MG-RAST [21] and EBI metagenomics [22] that automatically analyse rRNA and mRNA in submitted samples. MetaPhlAn2 [23] and mOTU [24] use a subset of marker genes for taxonomic profiling. QIIME [25] uses Operational Taxonomic Units (OTUs) to assign a taxonomy.
A user-specified library of genomes of species that are present in the samples has to be provided for recent programs utilizing k-mers such as Kraken [26], LMAT [27] or DUDes [28].
The last category of tools searches the NCBI database to assign reads to a taxonomic level, either after a BLAST search, as in MEGAN [29] and Taxator-tk [30], or after a mapping with Bowtie, as in Centrifuge [31].
For our purposes, we found that each existing tool exhibited a shortcoming that rendered it unsuitable for the read-level assignment of taxonomic and functional information to microeukaryotic sequences. We summarize our requirements versus the properties of existing tools in Table 1.
Methodology and implementation
Reference database
To counter-balance the uneven diversity of eukaryotic microorganisms present in public databases, we construct the TaxMapper reference database such that it evenly includes genomic and transcriptomic sequences from all eukaryotic supergroups and taxonomic groups. References from the databases of NCBI [32] and the Marine Microbial Eukaryote Transcriptome Sequencing Project [1] were selected based on the following criteria: (i) The taxa represent the main lineages within each of the seven supergroups of eukaryotes (see Fig. 1). (ii) Their genomes or transcriptomes are mostly complete; i.e., we excluded obviously incomplete datasets that consisted of only some hundred sequences. We thus selected 142 transcriptomes and genomes; the selection is described under “Results”.
The protein sequences of all reference genomes or transcriptomes were downloaded, redundant sequences were discarded for each species and the amino acid sequences were used to build a database index.
TaxMapper
TaxMapper is designed to allow an easy-to-use search with sequence reads in the compiled database and to filter erroneous hits. It consists of five modules (search, map, filter, count, plot) that can be run individually with user defined parameters or as a single step with default settings.
The initial search in the indexed database is conducted for a single read file or forward and reverse reads in parallel using the protein similarity search tool RAPSearch2 [33] (v2.24, fast mode, using a loose E-value cutoff of 10^5, but restricted to the best 20 hits). RAPSearch2 performs a fast similarity search in a reduced amino-acid search space. The best 20 hits are returned for each query (read) sequence and mapped to the 7 taxonomic supergroups and 28 main lineages. Two hits are kept subsequently, the best hit (BH) and the next best hit, according to E-value, that falls into another lineage (next lineage hit, NLH). (Hits that are better than the NLH and agree with the taxonomic group of the BH are skipped.) Forward and reverse results can be combined by choosing either the option “best” to use the better of both searches or “concordant”, where forward and reverse have to map to the same taxon.
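The selection of the BH and NLH from the ranked hit list can be sketched in a few lines of Python. This is only an illustration of the rule described above, not TaxMapper's actual code; the `Hit` record, its field names and the example hits are hypothetical.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    """One search hit for a read (hypothetical record layout)."""
    subject: str       # identifier of the reference sequence
    lineage: str       # main lineage of the reference, e.g. "Chlorophyta"
    log_evalue: float  # log10 E-value, smaller is better
    identity: float    # percent identity of the alignment

def best_and_next_lineage_hit(hits):
    """Return (BH, NLH); NLH is None when all hits share the BH's lineage."""
    if not hits:
        return None, None
    ranked = sorted(hits, key=lambda h: h.log_evalue)
    best = ranked[0]
    # Skip further hits from the BH's lineage; keep the first hit from another lineage.
    for hit in ranked[1:]:
        if hit.lineage != best.lineage:
            return best, hit
    return best, None

# Made-up hits for a single read.
hits = [
    Hit("ref_a", "Chlorophyta", -12.0, 88.0),
    Hit("ref_b", "Chlorophyta", -11.5, 86.0),
    Hit("ref_c", "Bacillariophyta", -4.0, 61.0),
]
bh, nlh = best_and_next_lineage_hit(hits)
print(bh.subject, nlh.subject)
```

In this example the second Chlorophyta hit is skipped because it agrees with the lineage of the BH, so the Bacillariophyta hit becomes the NLH.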
The filter idea behind TaxMapper is to assign taxonomic information only if the BH and NLH are “different enough”. If the differences between BH and NLH in mapping properties such as the E-value, identity, alignment score etc. are large, the assignment of the best hit is regarded trustworthy and is returned, otherwise no taxonomic group is ascribed to reduce false positive assignments. The details of the filter approach are discussed below (Subsection Filtering). Fig. 2 illustrates the difference of this approach to other approaches that use only the best hit or the lowest common ancestor (LCA) of several hits. While the best hit approach returns just the best hit, regardless of further results that might be equally good, the lowest common ancestor approach returns the lowest level in the taxonomic tree that the hits have in common, which might be close to the root if the hits are too diverse.
Subsequently, count matrices can be generated over samples, summarizing the reads for all taxonomic groups to apply total count normalization and plot community compositions.
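As a minimal sketch of this step, the following Python code builds a count matrix from per-read assignments and applies total count normalization (each count divided by the sample total). The function names, sample names and input layout are assumptions made for illustration and do not reflect TaxMapper's file formats.

```python
from collections import Counter

def count_matrix(assignments_per_sample):
    """Build (taxa, {sample: [count per taxon]}) from lists of per-read taxa."""
    counts = {s: Counter(a) for s, a in assignments_per_sample.items()}
    taxa = sorted(set().union(*counts.values()))
    matrix = {s: [c[t] for t in taxa] for s, c in counts.items()}
    return taxa, matrix

def total_count_normalize(matrix):
    """Divide every count by the total number of assigned reads in its sample."""
    normalized = {}
    for sample, row in matrix.items():
        total = sum(row)
        normalized[sample] = [c / total if total else 0.0 for c in row]
    return normalized

# Made-up assignments that passed the filter, one taxon per read.
assignments = {
    "control": ["Chlorophyta", "Chlorophyta", "Ciliophora", "Bacillariophyta"],
    "silver": ["Chlorophyta", "Ciliophora"],
}
taxa, matrix = count_matrix(assignments)
relative = total_count_normalize(matrix)
print(taxa)
print(relative)
```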
TaxMapper is implemented as a stand-alone tool in the Python language (v3.5). The statistical model for the filtering step (described below) was estimated using the generalized linear model function in R, applying maximum likelihood estimation (MLE). R is not required for running the TaxMapper software. TaxMapper can be run either stepwise with user-defined settings or for easier handling in one analysis step with default parameters. In the second case, just a folder of raw data in FASTQ or FASTA format has to be provided and all results are generated automatically. The analysis can be parallelized by declaring the number of threads to use and it is suggested to run it on a multicore machine or server for large datasets.
Filtering
The filtering step based on the best hit (BH) and the next lineage hit (NLH) is a distinguishing feature of TaxMapper. Since we found it impossible to separate correct from incorrect taxonomic assignments based on BH and NLH E-values alone, we estimated a logistic regression model based on five BH/NLH properties:
percent identity of the BH,
ratio of identities between BH and NLH,
log10 E-value of BH,
difference in log10 E-values of BH and NLH,
the total size (in basepairs) in the database of the BH’s taxonomic group
The database size of the BH’s taxonomic group was added as an independent variable alongside the mapping statistics (E-value and identity) to account for the different number of sequences per taxonomic group, which can bias hits toward taxa that are more abundant in the reference database.
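In general, the binary logistic model is used to estimate the probability of a binary response y ∈ {0,1}, based on one or more independent variables (x1,…,xp):
$$P(y = 1 \mid x_1, \dots, x_p) = \frac{1}{1 + \exp\left(-\left(\beta_0 + \sum_{k=1}^{p} \beta_k x_k\right)\right)}$$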
Here the xk are the five hit properties described above, and y = 1 corresponds to the event that the BH is a correct assignment, whereas y = 0 means that the BH is an incorrect assignment. The goal is to search for values of the coefficients β such that the probability P(y = 1 | x) is large when the hit properties x indicate that BH and NLH are sufficiently different such that the taxonomic assignment based on the BH is correct.
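To make the filter concrete, the following Python sketch evaluates a logistic model on the five hit properties and keeps an assignment only if the resulting probability reaches a cutoff. The coefficient values, the feature names and the sign convention for the E-value difference (BH minus NLH) are placeholders chosen for illustration; they are not the estimates obtained for TaxMapper.

```python
import math

# Hypothetical coefficients for illustration only (not the fitted model).
INTERCEPT = -1.0
COEFFICIENTS = {
    "bh_identity": 0.03,       # percent identity of the BH
    "identity_ratio": 0.8,     # identity of BH divided by identity of NLH
    "bh_log_evalue": -0.05,    # log10 E-value of the BH
    "log_evalue_diff": -0.1,   # log10 E-value of BH minus that of NLH
    "group_size": -1e-9,       # database size of the BH's taxonomic group
}

def assignment_probability(features):
    """Logistic model: P(y = 1 | x) = 1 / (1 + exp(-(b0 + sum_k b_k * x_k)))."""
    linear = INTERCEPT + sum(COEFFICIENTS[k] * features[k] for k in COEFFICIENTS)
    return 1.0 / (1.0 + math.exp(-linear))

def keep_assignment(features, cutoff=0.5):
    """Report the BH's taxon only if the estimated probability reaches the cutoff."""
    return assignment_probability(features) >= cutoff

# Made-up properties of one read's BH and NLH.
example = {
    "bh_identity": 85.0,
    "identity_ratio": 1.4,
    "bh_log_evalue": -10.0,
    "log_evalue_diff": -6.0,
    "group_size": 5e8,
}
print(assignment_probability(example), keep_assignment(example))
```

Raising the cutoff trades retained reads for fewer false positive assignments, which is the trade-off explored in the evaluation.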
For estimating and testing the classifier, reads were chosen from 18 species that are included in the reference database and 17 species that are not included in the database, but where the taxonomic lineage is known and present in the database. Not all of the 28 groups could be used, since for some groups all available species were included in the database and further species for testing were not obtainable.
We obtained raw read data belonging to the above 35 species, listed with accession number in the supplementary file Suppl_TestTable.csv. Since for these reads, we know the correct taxonomic origin, we sorted them into two classes based on TaxMapper’s best hit (BH) alone: correctly classified or misclassified. We randomly chose 500 000 correctly classified (true positive, TP) and 500 000 misclassified (false positive, FP) reads as training data for estimating the model (see Fig. 3). This dataset of one million reads was split into 20% holdout data and 80% training and test data. The training and test data was again randomly split into 80% training and 20% test data 100 times to train and evaluate the classifier using 100-fold Monte Carlo cross-validation. In addition, in each cross validation round, the holdout data and randomly created reads were used to evaluate the classifier. Performance on the random reads (which by definition have no relation to any database sequence) allows us to estimate how well we are able to reject sequences that are from none of the eukaryotic lineages contained in the database. Results are given in the “Results” section.
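The splitting scheme can be sketched as follows in Python, using integers as stand-ins for read identifiers and a much smaller dataset than the one million reads used here. The function names and the fixed random seed are assumptions made for this example; the model fitting itself is omitted.

```python
import random

def split_off(items, fraction, rng):
    """Shuffle a copy of items and return (remainder, split-off part)."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_split = round(len(shuffled) * fraction)
    return shuffled[n_split:], shuffled[:n_split]

def monte_carlo_splits(items, test_fraction, rounds, rng):
    """Yield (training, test) pairs from repeated independent random splits."""
    for _ in range(rounds):
        yield split_off(items, test_fraction, rng)

rng = random.Random(0)      # fixed seed so the example is reproducible
reads = list(range(1000))   # stand-in for labelled TP and FP reads

# 20% holdout data, 80% training and test data.
work, holdout = split_off(reads, 0.2, rng)

# 100 rounds, each an 80%/20% split of the remaining data.
splits = list(monte_carlo_splits(work, 0.2, 100, rng))
train, test = splits[0]
print(len(holdout), len(train), len(test), len(splits))
```

Unlike k-fold cross-validation, the test sets of different rounds may overlap, because every round draws a fresh random split.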
Workflow
A comprehensive workflow for metatranscriptome analysis was developed and made available in an executable Snakemake-based workflow. Snakemake is a workflow description language and execution environment developed by Köster et al. [34]. The workflow steps are defined in terms of rules with input, output and Shell, Python or R code. Dependencies between rules are automatically resolved and rules are automatically parallelized where possible. It features an easy to read, self-documenting syntax which also serves for version and parameter tracking. For the described workflow Snakemake version 3.9.1 was used.
The workflow covers both taxonomic assignment of each read (using TaxMapper) and functional assignment (using RAPSearch2 on the UniProt database). Steps and parameters can be adjusted using a provided configuration file (config.yaml).
In the following, the most important rules and steps of the workflow are explained. An overview is given in Figure 4.
The steps of the bioinformatic workflow are specified in the workflow management system Snakemake. Snakemake rules describe how to create output files from input files by executing commands on the input files. The commands can also be run on single files in the terminal, Python or R, but for automation, parallelization and reproducibility of the workflow, Snakemake is used. We briefly explain the Snakemake syntax here on a short exemplary Snakemake file:
The desired final outputs of the workflow are described in the rule all; these are “plots/dataset1.pdf” and “plots/dataset2.pdf”. To create the plots, we run a shell command in the rule create_plots on the input “raw/{dataset}.csv” to create the output “plots/{dataset}.pdf”. Snakemake determines the rule dependencies by matching file names and automatically fills the wildcard dataset with the correct names, dataset1 and dataset2, which are expected as the input of rule all.
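The idea of resolving wildcards by matching file names can be imitated in plain Python. The sketch below is only meant to illustrate the principle; it is not how Snakemake is implemented, and the helper `match_wildcards` is a hypothetical name.

```python
import re

def match_wildcards(pattern, path):
    """Return the wildcard values if path matches a pattern such as
    'plots/{dataset}.pdf', otherwise None."""
    regex = ""
    pos = 0
    for m in re.finditer(r"\{(\w+)\}", pattern):
        # Literal text is escaped, each {name} becomes a named group.
        regex += re.escape(pattern[pos:m.start()]) + "(?P<%s>.+)" % m.group(1)
        pos = m.end()
    regex += re.escape(pattern[pos:])
    found = re.fullmatch(regex, path)
    return found.groupdict() if found else None

# Targets requested by the rule "all".
targets = ["plots/dataset1.pdf", "plots/dataset2.pdf"]
for target in targets:
    # Match the target against the output pattern of "create_plots" ...
    wildcards = match_wildcards("plots/{dataset}.pdf", target)
    # ... and fill the same wildcard into the input pattern.
    required_input = "raw/{dataset}.csv".format(**wildcards)
    print(target, "<-", required_input)
```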
Preprocessing
The quality of raw sequencing reads is analysed using the quality control tool FastQC [35]. It computes various quality measures such as the base quality, overrepresented sequences, read length et cetera. The compressed FASTQ files are used as input and the Snakemake rule runs FastQC as a shell command on the input. The wildcards sample and pair represent the sample name and forward and reverse read respectively.
Identified low quality bases and sequencing adapters can be removed with trimming tools such as cutadapt (v1.12, [36]). From the forward and reverse read, given as input, the adapter beginning with ‘GATCGGAAGAGCA’ and bases with a quality value below 20 are trimmed. If the remaining read length is below 50, the whole read will be discarded. All output files are saved in the folder results/cleaned.
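The effect of these three settings on a single read can be sketched in Python as follows. This is a deliberately simplified stand-in and not cutadapt's algorithm: it only finds exact, complete adapter matches and trims trailing low-quality bases naively, whereas cutadapt tolerates errors and partial adapter matches and uses a different quality trimming procedure. The function name and the example read are made up.

```python
ADAPTER = "GATCGGAAGAGCA"
MIN_QUALITY = 20
MIN_LENGTH = 50

def trim_read(sequence, qualities):
    """Return the trimmed (sequence, qualities), or None if the read is discarded.

    qualities is a list of Phred scores, one per base.
    """
    # Remove the adapter and everything after it (exact match only).
    start = sequence.find(ADAPTER)
    if start != -1:
        sequence, qualities = sequence[:start], qualities[:start]
    # Naive 3' quality trimming: drop trailing bases below the threshold.
    end = len(sequence)
    while end > 0 and qualities[end - 1] < MIN_QUALITY:
        end -= 1
    sequence, qualities = sequence[:end], qualities[:end]
    # Discard reads that became too short.
    if len(sequence) < MIN_LENGTH:
        return None
    return sequence, qualities

# Made-up read: 60 bases of insert, then the adapter and 4 further bases.
read = "ACGT" * 15 + ADAPTER + "TTTT"
quals = [30] * 55 + [10] * 5 + [30] * 17
trimmed = trim_read(read, quals)
print(len(trimmed[0]))
```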
Taxa identification
TaxMapper is used for the assignment and filtering of taxonomic information. For brevity, the one-step version is shown below, since it just needs an input folder with all FASTQ files and parallelization is performed within TaxMapper (here 20 threads are used via option -t). We have to get the input folder from the input files and provide an output file from TaxMapper as output for Snakemake. The expand command is used to get a list of all input files by filling in the wildcards for sample and pair, which are lists of all filenames and forward and reverse reads provided in the configuration file. The database index is created within the subworkflow taxonomy which is given as the input database. To let Snakemake handle parallelization and provide user-defined parameters, the workflow can also be run in five successive steps: search, map, filter, count and plot (see Fig. 4 TaxMapper box).
Functional annotation
RAPSearch2 (v2.24, [33]), a fast protein similarity search tool, is used to search the read sequences in the UniProt database (release 2016_06) [37]. The UniProt database is downloaded and indexed as part of the workflow (in a subworkflow termed uniprot). The similarity search is performed with default parameters and the best hit is returned. Via a UniProt identifier mapping file, obtained from the UniProt database, KEGG (Kyoto Encyclopedia of Genes and Genomes, [38]) Orthology identifiers can be assigned to the query sequence.
Additional rules are used to shorten the output and combine the forward and reverse read mapping (see Fig. 4 Uniprot box). The input FASTQ files first have to be extracted from the gz archive to use them as input for RAPSearch2; they are then searched against UniProt, returning the alignment of the best hit or no result for each read.
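The identifier mapping step can be illustrated with a short Python sketch. It assumes a tab-separated mapping file with three columns (accession, identifier type, identifier) in which KEGG Orthology entries carry the type label "KO"; this layout, the accessions and the function names are assumptions for the example and should be checked against the mapping file actually used.

```python
def load_ko_mapping(lines):
    """Collect KEGG Orthology identifiers per accession from mapping lines."""
    mapping = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Keep only well-formed lines of the assumed type "KO".
        if len(fields) == 3 and fields[1] == "KO":
            mapping.setdefault(fields[0], []).append(fields[2])
    return mapping

def annotate_reads(best_hits, ko_mapping):
    """Map each read's best-hit accession to its KO identifiers (may be empty)."""
    return {read: ko_mapping.get(acc, []) for read, acc in best_hits.items()}

# Made-up mapping lines and best hits.
mapping_lines = [
    "ACC0001\tKO\tK00001\n",
    "ACC0001\tGeneID\t12345\n",
    "ACC0002\tKO\tK00002\n",
]
best_hits = {"read_1": "ACC0001", "read_2": "ACC0003"}
ko_mapping = load_ko_mapping(mapping_lines)
print(annotate_reads(best_hits, ko_mapping))
```

Reads whose best hit has no KO entry keep an empty list, so they can still be counted taxonomically without contributing to the pathway analysis.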
Statistics and downstream analysis
Subsequent statistical analyses depend on the type of study and question. Since it is not always possible or intended to perform e.g. differential expression analysis, we included several possible rules in the workflow. All of the rules execute R code that is longer than a couple of lines and therefore not depicted here.
Existing rules include a differential expression analysis given different conditions using the Bioconductor package edgeR (v3.14.0, [39]), ordination analyses such as principal component analysis and redundancy analysis using the R package vegan (v2.3-4, [40]) and KEGG pathway analyses with the R packages GAGE (v2.21.1, [41]) and pathview (v1.9.0, [42]).
Results and discussion
Reference database
According to our criteria, 142 reference sequences were selected for the TaxMapper reference database (for details see Suppl file: Suppl_TaxTable.csv). These references belong to the seven supergroups of eukaryotes, including 28 main lineages. In accordance with the taxonomy published by Boenigk and Wodniok [43] and with the tree of life project [44], we chose different levels of each lineage to cover their molecular and functional diversity. Figure 1 and Table 2 give an overview.
The supergroup Amorphea consists of two main lineages, the Opisthokonta (Holomycota and Holozoa) and Amoebozoa. Additionally, the small phylum Apusozoa is considered as a likely paraphyletic sister group of the Opisthokonta [45, 46]. In the database the Amorphea are represented by 27 reference taxa. 19 taxa are affiliated with the Opisthokonta, including fungi representing the Holomycota, and Eumetazoa, Choanoflagellida (Choanomonada) and basal Opisthokonta, e.g. Filasterea and Ichthyosporea, here called Opisthokonta Rest, as representatives for the Holozoa. The Amoebozoa contain 7 reference taxa including lobose Amoebae, Archamoebae and Mycetozoa (slime moulds). One reference taxon is included for the phylum Apusozoa.
The supergroup Excavata is a very diverse group that can be summarized into two main groups, the Discoba including the lineages Euglenozoa, Heterolobosea and Jakobida as well as the Metamonada including the lineages Parabasalia and Fornicata. Many species of this supergroup are parasites [5] but some taxa e.g. most Euglenida are free-living and often occur in freshwater [47]. In the database the Excavata are represented by 9 reference taxa affiliated with Euglenozoa, Heterolobosea, Parabasalia and Fornicata. Due to few available transcriptomes of this supergroup in public databases and the focus on free-living taxa, only few references could be added.
The supergroup Archaeplastida includes three main lineages, the species-poor Glaucophyta (Glaucocystophyceae), the mostly marine Rhodophyta and the species-rich Viridiplantae (Chlorophyta, Streptophyta). Particularly the Chlorophyta are important primary producers in freshwater habitats [48]. Therefore, Archaeplastida are represented by 22 reference taxa affiliated with Chlorophyta, Streptophyta, Rhodophyta and Glaucocystophyceae.
The supergroup Rhizaria is a diverse group and consists of two main lineages, Cercozoa and Retaria (Foraminifera and Radiolaria). Cercozoa are very abundant in soil but can also occur in freshwaters and marine habitats [49]. In the database Rhizaria are represented by only 7 taxa belonging to Cercozoa and Foraminifera as there are only a few sequenced species available in public databases, particularly from Cercozoa.
The supergroup Alveolata is a very diverse group. It consists of three main lineages, Ciliophora, Apicomplexa and Dinophyta. Further, the smaller lineages Chromerida, Colpodellids and Perkinsea are affiliated with the Alveolata. Ciliophora and Dinophyceae can occur in high abundances and are important predators of other protists [50, 51]. Due to their importance and diversity they are covered by a high number of reference taxa (26) in the database: Ciliophora, Apicomplexa, Dinophyceae, Chromerida and Perkinsea.
The supergroup Stramenopiles is a very diverse group including many lineages which can be summarized into three groups, the Pseudofungi, the heterotrophic Bigyra and the plastid bearing Ochrophyta [52]. Some of these lineages, e.g. Bacillariophyta and Chrysophyceae, are very abundant in freshwater habitats [48, 50]. They are important primary producers and predators of bacteria. Therefore, we covered this group by a high number of 40 reference taxa. Pseudofungi were included as well as Bigyra summarizing the three lineages Bicosoecida, Blastocystis and Labyrinthulida. The Ochrophyta are represented by the two abundant freshwater groups Bacillariophyta and Chrysophyceae and a collection of other reference taxa affiliated with several Stramenopile lineages called Stramenopiles Rest.
An additional “group” in the eukaryotic tree of life are the incertae sedis Eukaryota which include amongst others the Hacrobia (Cryptophyta, Haptophyta) [5]. The evolutionary position of these taxa is still uncertain as the phylogenetic position differs depending on the studied organism and genes. In the database Hacrobia are represented by 11 reference taxa, affiliated with Cryptophyta and Haptophyta.
Evaluation of the filtering step
After training the classifier to reject assignments of training reads whose best hit misses the correct taxonomic group, we evaluated the performance on the test, random and holdout dataset.
The results are depicted as receiver operating characteristic (ROC) curves in Fig. 5 A and compared based on the area under the curve (AUC) and accuracy (ACC) in Table 3. Shown are true positive rate (TPR) and false positive rate (FPR) of TaxMapper results varying over the cutoff for the probability P(y = 1|x1,…, x5). Results are also given when no logistic model, but a simple E-value cutoff for the best hit, is used.
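For reference, one point of such a ROC curve and the corresponding accuracy can be computed as in the following Python sketch. Here a read counts as positive when its assignment is kept, i.e. when the estimated probability reaches the cutoff; the probabilities and labels are made-up toy values, not data from this study.

```python
def rates_at_cutoff(probabilities, labels, cutoff):
    """Return (TPR, FPR, ACC) when assignments with P >= cutoff are kept.

    labels: 1 if the best hit is the correct taxon, 0 otherwise.
    """
    tp = fp = tn = fn = 0
    for p, y in zip(probabilities, labels):
        kept = p >= cutoff
        if kept and y == 1:
            tp += 1
        elif kept and y == 0:
            fp += 1
        elif not kept and y == 1:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    acc = (tp + tn) / len(labels) if labels else 0.0
    return tpr, fpr, acc

# Toy data: three correct and three incorrect best hits.
probs = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
truth = [1, 1, 1, 0, 0, 0]

# Sweeping the cutoff traces the ROC curve point by point.
for cutoff in (0.0, 0.5, 1.0):
    print(cutoff, rates_at_cutoff(probs, truth, cutoff))
```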
TaxMapper yields superior results, especially in the desired area with low false positive rates, and an AUC of 0.90–0.91 in contrast to 0.84 for the simple E-value cutoff method. The highest accuracy of 0.84 was obtained for TaxMapper at a probability cutoff of 0.38 and 0.40 (test and holdout data, respectively). For the simple E-value cutoff, the best accuracy (0.79) was obtained when keeping hits with a log10 E-value below –0.92.
A false positive rate below 0.1 could be obtained with a probability cutoff of 0.58 or log10 E-value below 1.66. Obviously, in the random dataset only the number of false positives can be reduced, resulting in the best accuracy of 1.0 for a probability cutoff of 1.0, filtering out all reads. But as shown in Fig. 5 B and C, the accuracy increases rapidly and a low false positive rate below 0.1 is already obtained with an average probability cutoff of 0.29 (see Fig. 5 and Tab. 3).
Evaluation of TaxMapper against other tools
The runtime and results of TaxMapper were compared to the tools Taxator-tk and Centrifuge, to our knowledge the only tools that can be run on a server and assign sequences to a taxonomy on read-level (see Fig. 6). Both tools were run with default parameters and as described in the manual. As a reference for Centrifuge the nonredundant NCBI index was used as provided by the authors of Centrifuge. For Taxator-tk the provided refpacks could not be used, since they focus on prokaryotic taxa; therefore a refpack using the NCBI nr database was built according to the instructions on the website. The search step of Taxator-tk utilises a blastn or LAST search against the NCBI nonredundant nucleotide database. Due to the long runtime, only the holdout data with 200 000 reads was tested. Overall, Taxator-tk using the Megan algorithm takes 3980:13 minutes, Centrifuge takes 15:07 minutes and TaxMapper 32:49 minutes (wall clock time) on a server with AMD Opteron processors (6176, 2.3 GHz) using 20 threads. This corresponds to a user time of 182:18 minutes for TaxMapper, of which the search step takes longest with 180:23 minutes.
Centrifuge uses the fast mapping algorithm Bowtie to map the reads against the NCBI database. The drawback is that Bowtie allows few mismatches and therefore reads map only to very similar sequences. If the organism or a close relative is not contained in the database, a taxonomy cannot be assigned, leading to many unclassified reads for this method. The Megan algorithm of Taxator-tk uses BLAST, therefore only few reads are unclassified, but the majority map to the root node of the taxonomy, due to the lowest common ancestor approach described in Figure 2. The original algorithm developed for Taxator-tk is optimized for longer reads, starting with 500 bp, and was not used here. TaxMapper results in the highest number of true positive assignments and the lowest number of false positives. Results where the taxonomic assignment of the best hit was uncertain were removed in the filter step.
Discussion
Example application: silver dataset
To showcase an application, the metatranscriptome workflow was run on a subset of sequencing data from a study published in 2014 by Boenigk et al. [52]. In brief, a short-term silver exposure experiment was conducted on nine 20 L plastic tanks containing water from a natural plankton community from a eutrophic pond at the campus Essen of the University Duisburg-Essen. The nine tanks were divided into three experimental groups (control, silver nitrate and silver nanoparticle exposure) with three replicate tanks each. The subsample used here contains the control samples and the silver nitrate samples. The metatranscriptomic workflow was applied to analyse the functional and taxonomic differences between the treatments. Figure 7 A depicts the community compositions with the largest changes visible in the groups Bacillariophyta and Chlorophyta. The taxonomic changes are also depicted in the PCA in Figure 7 B, separating on the second principal component the control samples from the samples treated with a sublethal silver concentration of 5 μg/L. On the functional level a test for differential expression reveals 34 KEGG orthologous genes that differ significantly (FDR < 0.1) between the two groups and show an enrichment of photosynthesis pathways. It is known that silver ions affect the primary metabolism, in particular photosynthesis, by direct interference [52, 53]. On the other hand, it has been shown that at low concentrations of silver the growth of green algae is increased, as observed in Figure 7 A [54].
A subset of this study with the first 100,000 reads per FASTQ file is provided with the workflow as test dataset.
Future database updates
When new sequences become available which further complete the diversity of the eukaryotic supergroups, an update of the database will be released. In particular, the Excavata and Rhizaria should be extended in future versions, for which at the moment only few appropriate genomes or transcriptomes are present.
Conclusions
Despite the large number of tools developed for taxonomic analyses, the majority of them aims at different sequencing data (e.g. rRNA, contigs) or organismic groups (prokaryotes) and does not allow a combined functional and taxonomic analysis of metatranscriptomic data. We therefore developed the presented tool TaxMapper to work in conjunction with a constructed microeukaryotic reference database for taxonomic assignment, and included the taxonomic analysis in a complete workflow for metatranscriptomic sequence analysis.
The smaller but more appropriate reference for protists allows a faster search than a comparable search against the whole NCBI database.
False positive assignments can be filtered using a probability cutoff on a logistic regression model based on features of the best hit and next lineage hit, which yielded better results than a simple E-value cutoff.
TaxMapper can be run straightforwardly on a folder of sequencing data or as part of the Snakemake workflow. The workflow performs quality assessment, functional and taxonomic annotation and (multivariate) statistical analyses using available environmental factors or different sample groups. The provided workflow ensures a reproducible analysis which can be easily extended to new samples.
Both the TaxMapper tool and the workflow are available as open-source software at Bitbucket under the MIT license: https://bitbucket.org/dbeisser/taxmapper and as a Bioconda package: https://bioconda.github.io/recipes/taxmapper/README.html.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent to publish
Not applicable.
Availability of data and material
The data and software are available at Bitbucket https://bitbucket.org/dbeisser/taxmapper, https://bitbucket.org/dbeisser/taxmapper_supplement and as a Bioconda package: https://bioconda.github.io/recipes/taxmapper/README.html.
Competing interests
The authors declare that they have no competing interests.
Funding
DB, SR and JB thank the Deutsche Forschungsgemeinschaft (DFG) for the support within the Priority Programme DynaTrait (SPP 1704), grants RA 1898/1-1 and BO 3245/14-1.
Authors’ contributions
DB compiled the reference database, developed the tool TaxMapper and the Snakemake workflow and wrote the manuscript. NG selected the taxa for the reference database and wrote the Reference Database section in the manuscript. LG provided the first version of reference taxa. HT contributed to the tool TaxMapper, created the conda package and performed code review. JB provided expertise on protists and the eukaryotic phylogeny. SR provided bioinformatics expertise and wrote sections of the manuscript. JB and SR led and guided the study. All authors participated in writing and approved the final version of the manuscript.
Additional files
Suppl_TaxTable.csv — Taxa contained in reference database
Information on taxa contained in reference database, including taxonomic affiliation, accession number and database.
Suppl_TestTable.csv — Validation taxa
Information on taxa used for evaluating the logistic regression model.
Acknowledgements
We thank Matthias Höller for running Centrifuge on the holdout data.
List of abbreviations
- ACC: Accuracy
- AUC: Area under the curve
- BH: Best hit
- DFG: Deutsche Forschungsgemeinschaft
- FDR: False discovery rate
- FP: False positive
- FPR: False positive rate
- HTS: High-throughput sequencing
- KEGG: Kyoto Encyclopedia of Genes and Genomes
- LCA: Lowest common ancestor
- MLE: Maximum likelihood estimation
- MMETSP: Marine Microbial Eukaryote Transcriptome Sequencing Project
- NCBI: National Center for Biotechnology Information
- NLH: Next lineage hit
- OTU: Operational Taxonomic Unit
- PCA: Principal component analysis
- ROC: Receiver operating characteristic
- TMM: Trimmed mean of M-values
- TP: True positive
- TPR: True positive rate