Abstract
Using metagenomics to determine animal diet offers a new and promising alternative to current methods. Here we show that rapid and inexpensive diet quantification is possible through metagenomic sequencing with the portable Oxford Nanopore MinION. Using a simple amplification-free approach, we profiled the stomach contents from wild-caught rats. We conservatively identified diet items from over 50 taxonomic orders, ranging across nine phyla that include plants, vertebrates, invertebrates, and fungi. This highlights the wide range of taxa that can be identified using this simple approach. We calibrate the accuracy of this method by comparing the characteristics of reads matching the ground-truth host genome (rat) to those matching diet items. We also suggest a means to correct for biases in metagenomic approaches that arise due to the paucity of genomic sequence in databases as compared to mitochondrial DNA or rDNA. Finally, we implement a constrained ordination analysis to show that it is possible to identify the sampling location of an individual rat within tens of kilometres based on diet content alone. This work establishes long-read metagenomic methods as a straightforward and robust approach for diet quantification. It considerably simplifies the workflow and avoids many inherent biases as compared to metabarcoding. Continued increases in the accuracy and throughput of Nanopore sequencing, along with improved genomic databases, mean that this approach will continue to improve in accuracy.
Introduction
Bias in current methods
Accurate information about what organisms are eating informs many aspects of our understanding of ecosystems and food web dynamics. However, unbiased and sensitive assessment of diet content is extremely difficult to achieve due to the limited accuracy of available methods. A variety of methods have been applied to quantify diet components in animals, including visual inspection of gut contents (Daniel, 1973; Pierce & Boyle, 1991), stable isotope analysis (Carreon-Martinez & Heath, 2010; Major, Jones, Charette, & Diamond, 2007), and time-lapse video (Brown, Moller, Innes, & Jansen, 2008; Dunlap & Pawlik, 1996). However, these methods can be biased and imprecise. Identification of prey items using visual examination of stomach contents is strongly affected by which items are most easily degraded (for example, soft-bodied species).
Stable isotope analysis yields only broad information on diet such as relative consumption of protein and plant matter, as well as information on whether prey items are terrestrial or marine in origin (Basha, Chamberlain, Zaki, Kandeel, & Fares, 2016; Hobson, 1987). Time-lapse video (Dunlap & Pawlik, 1996; Volpov et al., 2015) requires identification of the specific prey item, often difficult or impossible for small prey items or in low-light conditions. To circumvent these issues, DNA-based methods (King, Read, Traugott, & Symondson, 2008; Soininen et al., 2009) are becoming more popular.
Perhaps the most widely applied DNA-based method is metabarcoding. This approach relies on PCR amplification and sequencing of conserved regions from nuclear, mitochondrial, or plastid genomes (King et al., 2008). With adequate primer selection, this method can detect a wide range of species, and does not require specific expertise necessary for other methods (for example identifying degraded prey items).
However, DNA metabarcoding is not free from bias. PCR primers must be specifically tailored to particular sets of taxa or species (Jarman, Gales, Tierney, Gill, & Elliott, 2002). Although more “universal” PCR primer pairs have been developed (for example targeting all bilaterians or even all eukaryotes; Jarman, Deagle, & Gales, 2004), all primer sets exhibit bias towards certain taxa. Tedersoo et al. (2015) found five-fold differences in fungal operational taxonomic unit (OTU) estimates when using different sets of fungal-specific PCR primer pairs. Leray et al. (2013) found that published universal primer pairs (i.e. those that do not target specific taxa) were capable of amplifying only between 57% and 91% of tested metazoan species, with as few as 33% of species in some phyla being amplified at all (e.g. cnidarians). Deagle et al. (2014) argued that in general, COI regions are simply not sufficiently conserved, and thus should not be used for metabarcoding studies at all (Deagle, Jarman, Coissac, Pompanon, & Taberlet, 2014). Finally, Pawluczyk et al. (2015) showed that different loci from the same species exhibit up to 2,000-fold differences in qPCR-estimated DNA quantity within samples. It has even been shown that the polymerase itself can bias diversity metrics when using metabarcoding methods (Pereira, Peplies, Brettar, & Hoefle, 2018). For these reasons, a less biased method is desirable.
Metagenomic sequencing for diet
Metagenomic sequencing, in which all of the DNA in the sample is directly sequenced, offers an attractive alternative to metabarcoding for several reasons. Metagenomic approaches have most frequently been used to yield insights into microbial diversity and function (Anantharaman et al., 2016; Fierer et al., 2012; Hover et al., 2018; Xu & Knight, 2015). Recent advances in computational methods (Breitwieser & Salzberg, 2018; Huson, Mitra, Ruscheweyh, Weber, & Schuster, 2011; Kim, Song, Breitwieser, & Salzberg, 2016; Wood & Salzberg, 2014) now allow routine rapid quantification of microbial taxa in metagenomic samples. However, metagenomic approaches have rarely been used to quantify eukaryotic taxa. An important application of such a method would be for diet analysis, as many diet items are difficult to identify based on macro- or microscopic analysis.
Here, we quantify rat diet composition using a novel metagenomic approach based on long-read nanopore sequencing (Oxford Nanopore Technologies). This study shows for the first time that low-accuracy long-read sequences can be used to accurately classify eukaryotic metagenomic data. As a test case, we quantify rat diet using stomach contents. Using such samples is opportune for both methodological and ecological reasons.
First, rats are extremely omnivorous. As such, they serve as an excellent means to quantify the breadth of taxa that can be detected using a metagenomic long read approach. Second, the use of stomach samples means that a significant number of reads will be host reads. This allows us to assess the characteristics of true positive sequence reads (rat-derived reads that match rat database sequences), as well as false negative and false positive reads (rat-derived reads that match non-rat database sequences). We can then determine whether reads matching diet items have similar characteristics to known true positive (host) reads.
Finally, understanding rat diets has important ecological implications. It is well-established that the relatively recent introduction of mammalian predators to New Zealand and other islands has had significant negative effects on many of the native animal populations. This ranges from insects (Gibbs, 1998), to reptiles (Towns, Daugherty, & Cree, 2001), to molluscs (Stringer, Bassett, McLean, McCartney, & Parrish, 2003), to birds (Diamond & Veitch, 1981; Dowding & Murphy, 2001), and can have detrimental effects for entire terrestrial and aquatic ecosystems (Graham et al., 2018). Currently, an ambitious plan is being put into place that aims for the eradication of all mammalian predators from New Zealand (including possums, rats, stoats, and hedgehogs) by 2050 (http://www.doc.govt.nz/predator-free-2050; Russell, Innes, Brown, & Byrom, 2015). A useful step toward this goal would be to prioritise the management of predators, and establish in which locations native species experience the highest levels of predation. To do so requires establishing the diet content of local mammalian predators.
Materials and Methods
Study Areas
We trapped rats from three locations near Auckland, New Zealand. Each location comprised a different type of habitat: undisturbed inland native forest (Waitakere Regional Parklands, WP); native bush surrounding an estuary (Okura Bush Walkway, OB); and restored coastal wetland (Long Bay Regional Park, LB) (Fig. 1). Traps in OB and LB were baited with peanut butter, apple, and cinnamon wax pellets; or bacon fat and flax pellets. Traps in WP were baited with chicken eggs, rabbit meat, or cinnamon scented poison pellets. From 16 November to 16 December 2016, traps were surveyed by established conservation groups at each site every 48 hours. A total of 36 rats were collected from these locations. The majority of rats collected (34/36) were determined to be male Rattus rattus by visual inspection. These 34 rats were selected for further analysis.
DNA Isolation
Within 48 hours of trapping, rats were stored at either −20°C or −80°C until dissection. We removed intact stomachs from each animal and removed the contents. After snap freezing in liquid nitrogen, we homogenised the stomach contents using a sterile mini blender to ensure sampling was representative of the entire stomach.
We purified DNA from 10-20 mg of homogenised stomach contents using the Promega Wizard Genomic DNA Purification Kit, with the following modifications to the Animal Tissue protocol: after protein precipitation, we transferred the supernatant to a new tube and centrifuged a second time to minimise protein carryover. The DNA pellet was washed twice with ethanol. These modifications were performed to improve DNA purity. We rehydrated precipitated DNA by incubating overnight in molecular biology grade water at 4°C, and stored the DNA at −20°C. DNA quantity, purity, and quality were ascertained by nanodrop and agarose gel electrophoresis. The DNA samples were ranked according to quantity and purity (based on A260/A280 and secondarily, A230/A280 ratios). The eight highest quality DNA samples from each of the three locations were selected for DNA sequencing.
DNA Sequencing
Sequencing was performed on two different dates (24 January 2017 and 17 March 2017) using a MinION Mk1B device and R9.4 chemistry. For each sequencing run, DNA from each rat was barcoded using the 1D Native Barcoding Kit (Barcode expansion kit EXP-NBD103 with sequencing kit SQK-LSK108) following the manufacturer’s instructions. Twelve samples were pooled and run on each flow cell, for a total of 24 individual rats. The flow cells had 1373 active pores (January) and 1439 active pores (March). Sequencing was performed using local base calling in MinKNOW v1.3.25 (January) or MinKNOW v1.5.5 (March), but both runs were re-basecalled after data collection using Albacore 2.2.7 with demultiplexing performed in Albacore and filtering disabled (options --barcoding --disable_filtering).
Sequence classification
All sequences were BLASTed (blastn v2.6.0+) against a locally compiled database consisting of the combined NCBI other_genomic and nt databases (downloaded on 13th June 2018 from NCBI). Default blastn parameters were used (gapopen 5, gapextend 2), and only hits with an e-value of 1e-2 or less were saved. Due to the predominance of short indels present in nanopore sequence data, we used an initial set of basecalled data to test whether changing these default penalties affected the results (gapopen 1, gapextend 1). We found that these adjusted parameters did not qualitatively change our results.
We assigned sequence reads to specific taxon levels using MEGAN6 (v.6.11.7 June 2018) (Huson et al., 2016). We only used reads with BLAST hits having an e-value of 1e-20 or lower (corresponding to a bit score of 115 or higher) and an alignment length of 100 base pairs or more. To assign reads to taxon levels, we considered all hits having bit scores within 20% of the bit score of the best hit (MEGAN parameter Top Percent).
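The hit selection described above can be sketched in a few lines of Python. This is only an illustration of the stated thresholds, not the MEGAN6 implementation; the `Hit` record and the function names are hypothetical, and the example hits are invented.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One BLAST hit for a read (hypothetical record layout)."""
    read_id: str
    subject_id: str
    evalue: float
    bitscore: float
    aln_length: int


# Thresholds taken from the text.
MAX_EVALUE = 1e-20     # e-value of 1e-20 or lower
MIN_ALN_LENGTH = 100   # alignment of 100 bp or more
TOP_PERCENT = 20.0     # keep hits within 20% of the best bit score


def passes_quality_filter(hit: Hit) -> bool:
    """Apply the e-value and alignment-length cut-offs."""
    return hit.evalue <= MAX_EVALUE and hit.aln_length >= MIN_ALN_LENGTH


def top_percent_hits(hits: List[Hit], top_percent: float = TOP_PERCENT) -> List[Hit]:
    """Return the filtered hits of one read whose bit score lies within
    top_percent of the best filtered hit."""
    kept = [h for h in hits if passes_quality_filter(h)]
    if not kept:
        return []
    best = max(h.bitscore for h in kept)
    cutoff = best * (1.0 - top_percent / 100.0)
    return [h for h in kept if h.bitscore >= cutoff]


# Invented hits for a single read.
example = [
    Hit("read1", "Rattus", 1e-50, 200.0, 400),
    Hit("read1", "Mus", 1e-40, 170.0, 380),
    Hit("read1", "Homo", 1e-25, 120.0, 150),        # bit score too far below best
    Hit("read1", "short_hit", 1e-30, 130.0, 80),    # alignment too short
]
selected = [h.subject_id for h in top_percent_hits(example)]
```

Only the hits that survive both steps would then be passed to the taxon assignment.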
Multivariate analyses
Multivariate analyses were done using the software PRIMER v7 (K. R. Clarke & Gorley, 2015). The data used in the multivariate analyses were in the form of a sample-(i.e. individual rat) by-family matrix of read counts. All bacteria, rodent, and primate families were removed. The majority of rodent hits were to rat and mouse, resulting from the rats’ own DNA (see below). The majority of the primate hits were to human sequences, which likely resulted from sample contamination.
The read counts were converted to proportions per individual rat, by dividing by the total count for each rat, to account for the fact that the number of reads varied substantially among rats (K. R. Clarke, Somerfield, & Chapman, 2006). The proportions were then square-root transformed so that subsequent analyses were informed by the full range of taxa, rather than just the most abundant families (K. Clarke & Green, 1988). We then calculated a matrix of Bray-Curtis dissimilarities, which quantified the difference in the gut DNA of each pair of rats based on the square-root transformed proportions of read counts across families (K. R. Clarke et al., 2006).
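As a minimal sketch of this standardisation and dissimilarity calculation (written in plain Python rather than PRIMER, with hypothetical function names and invented read counts), the steps for a single pair of rats are:

```python
import math
from typing import Dict


def sqrt_proportions(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert per-family read counts for one rat to proportions,
    then square-root transform them."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {family: math.sqrt(n / total) for family, n in counts.items()}


def bray_curtis(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / sum(a_i + b_i),
    taken over every family seen in either sample."""
    families = set(a) | set(b)
    numerator = sum(abs(a.get(f, 0.0) - b.get(f, 0.0)) for f in families)
    denominator = sum(a.get(f, 0.0) + b.get(f, 0.0) for f in families)
    return numerator / denominator if denominator else 0.0


# Invented read counts for two rats.
rat_a = sqrt_proportions({"Poaceae": 90, "Arecaceae": 10})
rat_b = sqrt_proportions({"Poaceae": 50, "Fabaceae": 50})
d_ab = bray_curtis(rat_a, rat_b)
```

The value lies between 0 (identical composition) and 1 (no shared families); PRIMER reports the same quantity on a 0 to 100 scale.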
We used unconstrained ordination, specifically non-metric multidimensional scaling (nMDS) applied to the dissimilarity matrix, to examine the overall patterns in the diet composition among rats. To assess the degree to which the diet compositions of rats were distinguishable among the three locations, we applied canonical analysis of principal coordinates (CAP) (Anderson & Willis, 2003) to the dissimilarity matrix. CAP is a constrained ordination which aims to find axes through multivariate data that best separate a priori groups of samples (in this case, the groups are the locations from which the rats were sampled); CAP is akin to linear discriminant analysis but it can be used with any resemblance matrix. The out-of-sample classification success was evaluated using a leave-one-out cross-validation procedure (Anderson & Willis, 2003).
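The leave-one-out idea can be illustrated with a deliberately simplified stand-in for CAP. In the sketch below each held-out sample is assigned to the group with the lowest mean dissimilarity to it; this nearest-group rule is an assumption made for illustration and is not the canonical analysis performed in PRIMER. The function name and the toy matrix are hypothetical.

```python
from typing import Dict, List, Sequence


def leave_one_out_success(dissim: Sequence[Sequence[float]], labels: Sequence[str]) -> float:
    """Fraction of samples whose group is recovered when each sample is
    held out in turn and assigned to the group with the smallest mean
    dissimilarity to it (a simple stand-in classifier, not CAP)."""
    n = len(labels)
    correct = 0
    for i in range(n):
        by_group: Dict[str, List[float]] = {}
        for j in range(n):
            if j != i:  # the held-out sample never votes for itself
                by_group.setdefault(labels[j], []).append(dissim[i][j])
        predicted = min(by_group, key=lambda g: sum(by_group[g]) / len(by_group[g]))
        correct += predicted == labels[i]
    return correct / n


# Toy example: two well-separated groups of two samples each.
toy_dissim = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.1],
    [0.9, 0.9, 0.1, 0.0],
]
toy_labels = ["OB", "OB", "LB", "LB"]
success = leave_one_out_success(toy_dissim, toy_labels)
```

The key property shared with the procedure of Anderson and Willis (2003) is that the sample being classified plays no part in defining the groups it is compared against.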
We used Similarity Percentage (SIMPER; K. R. Clarke, 1993) to characterise and distinguish between the locations. This allowed us to identify the families with the greatest percentage contributions to (1) the Bray-Curtis similarities of diets within each location (Table S3) and (2) the Bray-Curtis dissimilarities between each pair of locations (Table S4).
Results
DNA sequencing and assignment of reads to taxa
After DNA isolation and sequencing, we obtained a total of 82,977 reads from the January run and 96,150 reads from the March run. Median read lengths were 606 bp and 527 bp for the January and March datasets, respectively (Fig. 2A). These lengths are considerably shorter than other nanopore sequencing results from both our own and others' work (Jain, Olsen, Paten, & Akeson, 2016). This is most likely due to degradation of the DNA during digestion in the stomach as well as fragmentation during DNA isolation (Deagle, Eveson, & Jarman, 2006) and sequencing library preparation. The median phred quality scores per read ranged from 7-12 (0.80 - 0.94 accuracy) for both runs (Fig. S1). The number of reads per barcoded rat sample varied by 10-fold for January and up to 40-fold in March (Fig. 2B and 2C). This is due mostly to the highly variable quality of DNA in each sample. However, read length and quality were similar for all samples (Fig. S1).
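The correspondence between phred quality scores and accuracy quoted above follows from the standard phred definition, Q = -10 log10(error probability). A short sketch of the conversion in both directions (function names are our own):

```python
import math


def phred_to_accuracy(q: float) -> float:
    """Per-base accuracy implied by a phred quality score."""
    return 1.0 - 10.0 ** (-q / 10.0)


def accuracy_to_phred(accuracy: float) -> float:
    """Phred quality score implied by a per-base accuracy below 1."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


low = phred_to_accuracy(7)    # lower end of the reported median range
high = phred_to_accuracy(12)  # upper end of the reported median range
```

Rounded to two decimal places these give the 0.80 and 0.94 accuracies stated for quality scores of 7 and 12.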
To quantify diet contents we first BLASTed all sequences against a combined database of the NCBI nt database (the partially non-redundant nucleotide sequences from all traditional divisions of GenBank excluding genome survey sequence, EST, high-throughput genome, and whole genome shotgun (ftp://ftp.ncbi.nlm.nih.gov/blast/db/README)) and the NCBI other_genomic database (RefSeq chromosome records for non-human organisms (ftp://ftp.ncbi.nlm.nih.gov/blast/db/README)). We used BLAST as it is generally viewed as the gold standard method in metagenomic analyses (McIntyre et al., 2017). Of the 133,022 barcoded reads, 30,535 (23%) hit a sequence in the combined nt and other_genomic database at an e-value cutoff of 1e-2.
As an initial assessment of the quality of these hits, we examined the alignment lengths and e-values. We found a bimodal distribution of alignment lengths and a highly skewed distribution of e-values (Fig. 3A). We hypothesized that many of the short alignments with high e-values were false positives. We thus first filtered this hit set, only retaining BLAST hits with e-values less than 1e-20 and alignments greater than 100 bp. Similar quality filters have been imposed previously (Srivathsan, Sha, Vogler, & Meier, 2015). A total of 22,154 hits passed this filter (Datafile S1). Mean read quality had substantial effects on the likelihood of a read yielding a BLAST hit, with almost 40% of high accuracy reads having hits in the March dataset, as compared to 1% of low accuracy reads (Fig. 3B).
To specifically assign each sequence read to a taxon, we analysed the BLAST results in MEGAN6 (Huson et al., 2016). The algorithm employed in MEGAN6 assigns reads to a most recent common ancestor (MRCA) taxon level. For example, if a read has BLAST hits to five species, three of which have bit scores within 20% of the best hit, the read will be assigned to the genus, family, order, or higher taxon level that is the MRCA of those best-hit three species (Huson, Auch, Qi, & Schuster, 2007). If a read matches one species far better than to any other, by definition, the MRCA is that species.
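The MRCA rule can be sketched as follows, assuming each retained hit is represented by its lineage written as a list of taxon names from the root downwards. The `mrca` function and the hand-written lineages are illustrative only; MEGAN6 works on the full NCBI taxonomy and has further parameters not modelled here.

```python
from typing import List, Optional


def mrca(lineages: List[List[str]]) -> Optional[str]:
    """Return the deepest taxon shared by all lineages.
    Each lineage is ordered from the root to the most specific taxon."""
    if not lineages:
        return None
    shared = None
    for names_at_rank in zip(*lineages):
        if all(name == names_at_rank[0] for name in names_at_rank):
            shared = names_at_rank[0]
        else:
            break
    return shared


# Hand-written lineages for illustration.
base = ["Eukaryota", "Chordata", "Mammalia", "Rodentia", "Muridae"]
rattus_rattus = base + ["Rattus", "Rattus rattus"]
rattus_norvegicus = base + ["Rattus", "Rattus norvegicus"]
mus_musculus = base + ["Mus", "Mus musculus"]

genus_level = mrca([rattus_rattus, rattus_norvegicus])
family_level = mrca([rattus_rattus, rattus_norvegicus, mus_musculus])
species_level = mrca([rattus_rattus])
```

A read whose best hits span two Rattus species is placed at the genus, one whose hits also include Mus is placed at Muridae, and a read with a single qualifying hit stays at the species.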
A total of 5,334 reads (24%) were not assigned to any taxon by MEGAN. Of the remainder, 31% were assigned by MEGAN as being bacterial; 55% of these were Lactobacillus spp. These results match previous studies on rat stomach microbiomes, which have found lactobacilli to be the dominant taxa (Brownlee & Moss, 1961; Horáková, Zierdt, & Beaven, 1971; Li et al., 2017; Maurice et al., 2015). Plant-associated Pseudomonas and Lactococcus taxa were also common, at 7% and 6%, respectively.
MEGAN assigned reads to a wide range of eukaryotic taxa. To conservatively infer taxon presence, we first reclassified MEGAN species-level assignments to the level of genus. However, after this, many clear false positive assignments remained (e.g. hippo and naked mole rat). These matches were generally short and of low identity. To reduce such false positive taxon inferences, we used information from reads assigned to the genera Rattus (rat) and Mus (mouse). We inferred that the reads assigned to Rattus (2,696 reads in total) were true positive genus-level assignments and that the reads assigned to Mus (2,798 reads in total) were false positive genus-level assignments (and not true positive Mus-derived reads). Although rats are known to prey on mice (Bridgman, Innes, Gillies, Fitzgerald, & King, 2013), if this had occurred, we would expect that (1) the ratio of mouse to rat reads would be higher in the subset of rats that had predated mice; (2) in those same rats, the percent identity of the reads assigned to Mus would be higher than in rats that had not predated mice. However, we found that the ratio of mouse to rat reads was similar for all rats. In addition, there was no evidence of higher percent identities for Mus reads from rats that had higher ratios.
Notably, the mean percent identity values of the best BLAST hits for Rattus and Mus reads differed substantially, with Rattus reads having a median identity of 86.4%, and Mus 81.0% (Fig. 4A). The mean percent identity for Rattus reads corresponds very well to that expected given the mean quality scores of the reads (assuming the true sequence of the read is 100% identical to Rattus, 86.4% identity corresponds to a mean quality score of 8.7; Fig. S2A-C). There was also a clear difference in the alignment lengths: the median ratio of alignment length to read length was 0.57 for Rattus and 0.52 for Mus (Fig. 4B). We note that read identity and the ratio of alignment length to read length are positively correlated (Fig. S2G-I). There is little correlation between read identity and alignment length alone (Fig. S2D-F).
Importantly, the majority of diet items have percent identities that overlap with the Rattus reads, and alignment length to read length ratios that often exceed the Rattus reads. This suggests that many diet taxa assignments are correct down to the level of genus (as the Rattus-assigned reads are correct to the level of genus). However, to further decrease false positive taxon assignments of diet items, we implemented cut-offs based on the characteristics of the Mus- and Rattus-assigned reads. For genus-level assignment, we required at least 82.5% identity and an alignment length to read length ratio of at least 0.55. These cutoffs exclude 88% of the reads falsely assigned to Mus, instead assigning them correctly to one taxon level higher, the Family Muridae. For family-level assignments, we required 77.5% identity, an alignment length to read length ratio of at least 0.1, and a total alignment length of at least 150 bp. Using higher cutoffs for the ratio of alignment length to read length excluded a large number of likely true positive taxa for which only short mtDNA or rDNA database sequences were present in the databases. For all other read-to-taxon assignments, we placed the read at the level of Order, or used the taxon level assigned by MEGAN. Using these cutoffs, 16% of all reads were classified at the Genus level; 71% were classified at the Family-level or below; 89% were classified at the Order-level or below; and 98% were classified at the Phylum-level or below.
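These tiered cut-offs can be written as a small decision rule. The sketch below is one reading of the text: the thresholds cap how specific an assignment may be, and a coarser MEGAN assignment is always kept. The function names and the rank list are hypothetical, and species-level assignments are assumed to have already been moved to genus.

```python
RANKS = ["genus", "family", "order", "class", "phylum"]  # finest to coarsest


def most_specific_allowed(pct_identity: float, aln_length: int, read_length: int) -> str:
    """Most specific rank permitted by the identity and alignment cut-offs."""
    ratio = aln_length / read_length
    if pct_identity >= 82.5 and ratio >= 0.55:
        return "genus"
    if pct_identity >= 77.5 and ratio >= 0.1 and aln_length >= 150:
        return "family"
    return "order"


def final_rank(megan_rank: str, pct_identity: float, aln_length: int, read_length: int) -> str:
    """Keep the coarser of the MEGAN rank and the rank allowed by the cut-offs."""
    cap = most_specific_allowed(pct_identity, aln_length, read_length)
    return max(megan_rank, cap, key=RANKS.index)


# A read resembling the median Rattus hit stays at genus level ...
rattus_like = final_rank("genus", 86.4, 570, 1000)
# ... while one resembling the median Mus hit is moved up to family.
mus_like = final_rank("genus", 81.0, 520, 1000)
```

With these values a typical Mus-assigned read fails the genus tier on both identity and alignment ratio, which is the behaviour the cut-offs were chosen to produce.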
After filtering out bacterial, host, and contaminant reads (matching primate DNA), 4,719 reads remained (28% of all classified reads) (Datafile S2). Within these, we observed that a small number of likely false positive taxa remained. Most were single reads with short alignments: Poeciliidae (177 bp); Salmonidae (172 bp); Cyprinodontiformes (140 bp and 177 bp); and Octopodidae (151 bp). The exceptions were three reads from two rats matching Buthidae (scorpions), which had alignment lengths of 762 bp, 664 bp, and 298 bp. It is unlikely these are true positives, and instead we hypothesise that these rats predated harvestmen (Opiliones), a closely related sister taxon within Arachnida but lacking significant amounts of genomic data. Despite the presence of these false positive taxa, we did not further increase the stringency of our filters, allowing us to resolve most taxa at the level of family, with a small rate of false positive inference (here, eight clear instances out of almost 5,000 reads).
Identification of diet
Within each rat, a wide variety of plant, animal, and fungal orders were discernible, ranging from two to 25 orders per rat (mean 8.7; Fig. 5). In total, we identified taxa from 68 different Families, 55 different Orders, 15 different Classes, and eight different Phyla (Fig. 6). Plants were the primary diet item, with the largest fraction of rats consuming four predominant orders: Poales (grasses), Fabales (legumes), Arecales (palms), and Araucariales (podocarps). The dominance of plant matter (fruits and seeds) in rat diets has been established previously (Riofrío-Lazo & Páez-Rosas, 2015; Sweetapple & Nugent, 2007). Animal taxa made up a smaller component of each rat’s diet, with Insecta dominating: Hymenoptera, Coleoptera, Lepidoptera (moths and butterflies), Blattodea (cockroaches), Diptera (flies), and Phasmatodea (stick insects). In addition, Stylommatophora (slugs and snails) were present in substantial numbers (Fig. 6A and 6B). Fungi were only a small component of the rats’ diet, although several orders were present: Sclerotiniales (plant pathogens), Saccharomycetales (budding yeasts), Mucorales (pin molds), Russulales (brittlegills and milk-caps), and Chaetothyriales (black yeasts). Finally, for many rats, a substantial proportion of the stomach contents were parasitic worms (primarily Spirurida (nematodes) and Hymenolepididae (tapeworms)).
Due to our metagenomic approach, the fraction of each element of the rats’ diets is distorted by biases in genomic databases: whole genome data exists for only a few taxa, while mtDNA and rDNA sequence data are present in the database for the vast majority of animal and plant genera. To quantify this bias, we determined the fraction of hits that mapped to non-genomic database sequences relative to the fraction of hits that mapped to genomic DNA. By quantifying this fraction for species with complete genome sequences in the database and species without complete genomes we aimed to assess the effects of this bias.
For the majority of animals with sequenced genomes in the database, we found that the fraction of reads that mapped to genomic sequence ranged from 61% (Gallus) to 73% (Rattus) to 100% (Coturnix and Numida) (Fig. 7). We hypothesise that this variation is likely due to the type of tissue sequenced. For Rattus the sequenced tissue was primarily stomach muscle, which has a relatively high fraction of mtDNA; for Coturnix and Numida it may have been eggs. For plants with sequenced genomes, the fraction of reads matching genomic sequence was generally higher: between 88% (Zea) and 98% (Cenchrus).
In contrast, for genera with little or no genomic sequence in the database, the vast majority of matches were solely to mtDNA, rDNA, or microsatellite loci: 90% of Phoenix (date palm) hits; all Helix (snail); and all Rhaphidophora (cave weta) hits. All Artioposthia (New Zealand flatworm) hits were to rDNA. These results indicate that for genera with no genomic sequence data, we have underestimated the actual number of sequences from that taxon by approximately three- to twenty-fold (for animals and plants, respectively). It is difficult to determine how these numbers correlate with biomass.
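One way to see where fold-differences of this size come from is the following back-of-envelope sketch. It assumes (our assumption, not a stated formula) that a taxon lacking genomic database sequence can only be detected through the non-genomic share of its reads, and that this share resembles that of a related taxon with a sequenced genome. The function name is hypothetical.

```python
def underestimation_factor(genomic_fraction: float) -> float:
    """Fold by which reads are undercounted for a taxon without genomic
    database sequence, if only the non-genomic share (1 - genomic_fraction)
    of its reads can find a database match."""
    if not 0.0 <= genomic_fraction < 1.0:
        raise ValueError("genomic_fraction must be in [0, 1)")
    return 1.0 / (1.0 - genomic_fraction)


# Genomic fractions reported above for taxa with sequenced genomes.
animal_low = underestimation_factor(0.61)   # Gallus
animal_high = underestimation_factor(0.73)  # Rattus
plant_low = underestimation_factor(0.88)    # Zea
plant_high = underestimation_factor(0.98)   # Cenchrus
```

Under this assumption the animal values give factors of roughly 2.6 to 3.7 and the plant values roughly 8 to 50, which bracket the approximate three-fold and twenty-fold figures. Taxa for which all reads matched genomic sequence (Coturnix and Numida) fall outside the range where this simple ratio is defined.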
Close examination of the sequence classification data suggested that specific families (and orders) were overrepresented in the diets of rats from particular locations. For example, six out of eight rats from the native estuarine bush habitat (OB) consumed Arecaceae, while only one in the restored wetland area (LB) did. All three rats that consumed Phasianidae were from the native estuarine habitat (OB). All five rats that consumed Solanales were from the restored wetland area. These patterns suggested that it might be possible to use diet components alone to pinpoint the habitat from which each rat was sampled.
nMDS and CAP analysis by location
To determine whether the diet composition of the rats differed consistently between locations, we first performed an unconstrained analysis using nMDS on taxa assigned at the family level. Using family rather than order or genus provides a balance between how precisely we identify the taxon of a diet item (genus, family, order), and whether we assign a taxon at all. While family-level assignments are less precise than genus-level, only 16% of all reads were classified at the genus level, while 71% were classified at the family level.
The family-level unconstrained ordination (nMDS) showed no obvious grouping of rats with respect to the locations (Fig. 8a), indicating that locations did not correspond to the predominant axes of variation among the diets. However, a constrained ordination analysis (CAP) identified axes of variation that distinguished the diets of rats from different locations (Fig. 8b). We found that the CAP axes correctly classified the locations of 19 out of 24 (79%) rats using a leave-one-out procedure. The families having the largest correlations with the first two principal coordinates, and most responsible for the separation between groups, were primarily plants: Arecaceae, Podocarpaceae, Piperaceae, and Pinaceae. In addition, insect groups (Cerambycids and Formicids) and birds (Phasianidae and Numididae) played a role (Fig. 8c).
The families driving similarity within the three locations (i.e., those with the greatest within-location SIMPER scores) varied among locations. LB had an average Bray-Curtis within-location similarity of 13%, mostly attributable to Hymenolepididae (accounting for 51% of the within-group similarity), Solanaceae (11%), and Fabaceae (11%). The average similarity for OB was 21%, with the greatest contributing taxa being Arecaceae (33%), Poaceae (23%), Fabaceae (9%), and Phasianidae (8%). The average similarity for WP was 24%, with the greatest contributing taxon being Poaceae (72%) (Table S4).
Discussion
Accuracy and sensitivity
Here we have shown that using a simple metagenomic approach with error-prone long reads allows rapid and accurate classification of rat diet components. We expect that this technique can be used to infer diet for a wide variety of animal and sample types, including samples that use less invasive collection methods, such as fecal matter. The sensitivity of this approach will likely improve as the accuracy and yield of Oxford Nanopore sequencing increases. The analysis here is based on less than 200,000 reads from two flow cells. The rapid improvement of this technology is such that current yields are often far in excess of two million reads per flow cell. The method will also improve as the diversity of taxa in genomic sequence databases increases. Several aspects of the data support this.
First, we note that we did not find BLAST hits for the majority of reads. This is partially due to the relatively low accuracy of the Oxford Nanopore sequencing platform at the time these data were collected (approximately 87%). However, the fraction of reads yielding hits in the database increased substantially for higher quality reads, approaching 40% for very high quality reads (Fig. 3B). Other factors also likely reduce the numbers of BLAST hits, such as the paucity of genome sequence data for many taxa. This is convincingly illustrated by comparing across taxa the fraction of genomic hits to mitochondrial or rDNA sequence hits.
As the species sampling of genomic databases increases (Lewin et al., 2018), the taxon-level precision of this method will improve. Given the current rate of genomic sequencing, with careful sampling, the vast majority of multicellular plant and animal families (and even genera) will likely have at least one type species with a sequenced genome within the next decade. Continued advancement in sequence database search algorithms as compared to current methods (Kim et al., 2016; Nasko, Koren, Phillippy, & Treangen, 2018; Wood & Salzberg, 2014) should considerably decrease the computational workload necessary to find matching sequences.
Although metagenomic approaches decrease the bias arising from PCR amplification of specific DNA regions, additional biases can arise, as the presence or absence of species and genera can only be inferred for those species or genera present in genomic databases. Although this is similarly true for metabarcoding approaches, metabarcode databases are rapidly becoming more comprehensive in terms of species representation as compared to genomic databases. Importantly, genomic sequence databases are rapidly increasing in species diversity, as are the methods to query these large databases (Kim et al., 2016; Wood & Salzberg, 2014).
To decrease biases in genomic databases, some previous studies have performed metagenomic classification using mitogenome data alone. Using such methods, Srivathsan, Ang, Vogler, & Meier (2016) and Paula et al. (2016) found that between 0.004% and 0.008% of all metagenomic reads matched mitogenomes from diet taxa. Limiting database searches to mitogenomes partially ameliorates biases in terms of taxon representation (i.e. most taxa will have similar levels of genomic representation in the databases). However, it considerably decreases diet resolution given that for some taxa, only a small percentage of sequence reads derive from the mitochondria as opposed to the nuclear genome.
It is also important to note that our interest in diet also includes resolving relative biomass and relative numbers of each prey species, neither of which necessarily correlate well with the amount of DNA (either mitochondrial or nuclear) purified from a sample. Even a simple correction for the fraction of reads matching mitochondrial versus nuclear genomes is difficult, as different plant and animal tissues differ considerably in the relative amounts of mitochondrial versus nuclear DNA (e.g. leaf versus fruit).
Methodological advantages
We found that rats consumed many soft-bodied species (e.g. mushrooms, flatworms, slugs, and lepidopterans) that would be difficult to identify using visual inspection of stomach contents. Data on such a wide variety of taxa would be difficult to obtain using other molecular methods, as there are no universal 18S or COI primers capable of amplifying sequences in all these taxa. While it might be possible to use primer sets targeted at different phyla or orders, quantitatively comparing diet components using sequences amplified with different primer sets is extremely difficult due to differences in primer binding and PCR efficiency.
The nanopore MinION-based sequencing method used in this simple metagenomic approach has several advantages. Compared to other high throughput sequencing technologies (e.g. Illumina, IonTorrent, or PacBio), there is no initial capital investment required to use the platform. On a per-sample basis, data generation is inexpensive (approximately $150 USD per barcoded sample, and approximately half this price if reagents are purchased in bulk). Library preparation and sequencing can be extremely rapid, going from DNA sample to sequence in less than two hours (Zaaijer et al., 2017). Furthermore, the sequencing platform itself is highly portable. As the cost of nanopore-based sequencing continues to decrease (both per sample and per base pair), it should become possible to use molecular methods for routine ecological monitoring of species presence or absence in field settings, without significant investment in infrastructure (Kamenova et al., 2017). Finally, we suggest that our approach of standardising the read counts by sample, followed by an optional transformation such as square root and dissimilarity-based multivariate ordination, offers a useful analytical pipeline for analysing metagenomic diet-composition data.
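The first steps of the pipeline suggested above (standardising read counts by sample, applying a square-root transformation, and computing pairwise dissimilarities) can be sketched as follows. This is a minimal illustration rather than the code used in this study: the read counts are invented, the function names are hypothetical, and Bray-Curtis is assumed here as one common choice of dissimilarity measure. The resulting matrix would then be passed to an ordination routine of the analyst's choosing.

```python
from math import sqrt

def standardise_by_sample(counts):
    """Convert raw read counts per taxon into within-sample proportions."""
    result = {}
    for sample, taxa in counts.items():
        total = sum(taxa.values())
        result[sample] = {t: (n / total if total else 0.0) for t, n in taxa.items()}
    return result

def sqrt_transform(profiles):
    """Square-root transform to down-weight the most abundant taxa."""
    return {s: {t: sqrt(v) for t, v in taxa.items()} for s, taxa in profiles.items()}

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two taxon profiles (assumed measure)."""
    taxa = set(a) | set(b)
    num = sum(abs(a.get(t, 0.0) - b.get(t, 0.0)) for t in taxa)
    den = sum(a.get(t, 0.0) + b.get(t, 0.0) for t in taxa)
    return num / den if den else 0.0

def dissimilarity_matrix(profiles):
    """Return sample names (sorted) and the square dissimilarity matrix."""
    samples = sorted(profiles)
    matrix = [[bray_curtis(profiles[i], profiles[j]) for j in samples] for i in samples]
    return samples, matrix

# Invented read counts per family for three hypothetical rats.
counts = {
    "rat_A": {"Poaceae": 40, "Arionidae": 10},
    "rat_B": {"Poaceae": 8, "Arionidae": 2},
    "rat_C": {"Phasmatidae": 25},
}

profiles = sqrt_transform(standardise_by_sample(counts))
samples, matrix = dissimilarity_matrix(profiles)
for name, row in zip(samples, matrix):
    print(name, [round(d, 3) for d in row])
```

In this toy example, rat_A and rat_B differ fivefold in sequencing depth but have identical diet proportions, so standardising by sample gives them a dissimilarity of zero, whereas rat_C shares no taxa with either and is maximally dissimilar.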
We note that modifications to our approach might further increase the precision of our ability to infer community composition. Any error-prone long read dataset (i.e. PacBio or ONT) has both short (e.g. 500 bp) and long (e.g. 5000 bp) reads, as well as high quality (e.g. mean accuracy greater than 90%) and low quality (e.g. mean accuracy less than 80%) reads. When inferring community composition, a null expectation is that taxa should be equally represented by long, high quality reads as they are by short, low quality reads. If some taxa are represented only by short, low quality reads, this suggests that these taxa may be false positive inferences. Similarly, the difficulty in correctly mapping short inaccurate reads could be mitigated by weighting the probability of taxon mapping by the number of long, accurate reads that map to certain taxa. Thus, the fact that not all reads are extremely long and accurate does not mean that they cannot all be used to infer taxon presence in metagenomic analyses.
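One simple way to apply this idea is to flag taxa that are supported only by short or low-quality reads. The sketch below is illustrative and was not part of our analysis: the read records are invented, the function names are hypothetical, and the thresholds (5000 bp and 90% mean accuracy) simply echo the example values given above rather than calibrated cut-offs.

```python
from collections import defaultdict

# Illustrative thresholds only (assumptions, not calibrated values).
MIN_LENGTH = 5000      # bp
MIN_ACCURACY = 0.90    # mean read accuracy

def summarise_support(reads, min_length=MIN_LENGTH, min_accuracy=MIN_ACCURACY):
    """Count, per taxon, all assigned reads and those that are long and accurate."""
    support = defaultdict(lambda: {"total": 0, "strong": 0})
    for taxon, length, accuracy in reads:
        support[taxon]["total"] += 1
        if length >= min_length and accuracy >= min_accuracy:
            support[taxon]["strong"] += 1
    return dict(support)

def flag_suspect_taxa(support):
    """Return taxa with no long, accurate reads: candidate false positives."""
    return sorted(t for t, s in support.items() if s["strong"] == 0)

# Invented (taxon, read length in bp, mean accuracy) records.
reads = [
    ("Poaceae", 6200, 0.93),
    ("Poaceae", 480, 0.78),
    ("Arionidae", 5100, 0.91),
    ("Felidae", 450, 0.76),
    ("Felidae", 520, 0.79),
]

support = summarise_support(reads)
print(flag_suspect_taxa(support))
```

Here the hypothetical Felidae assignment rests entirely on short, inaccurate reads and is flagged for scrutiny, while Poaceae is retained because at least one long, accurate read supports it. A weighting scheme, rather than a hard threshold, would be a natural extension.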
Conclusion
Here we have shown that a rapid, error-prone long-read metagenomic approach is able to accurately characterise diet taxa at the family level, and to distinguish between the diets of rats according to the locations from which they were sourced. This information may be used to guide conservation efforts toward specific areas and habitats in which native species are most at risk from this highly destructive introduced predator.
Data Accessibility
Sequence data are available in the Sequence Read Archive (SRA) under accession number PRJEB27647.
Author Contributions
WP, JD, NF, and OS conceived the project. WP performed the stomach dissections. WP and NF optimised the genomic DNA isolation and library preparation. NF performed the nanopore sequencing. GB and OS processed and performed quality control on the sequencing data. WP and OS performed the sequence classification. WP, AS, NF, and OS analysed the data. WP, NF, AS, and OS wrote the paper, with input from all authors.
Supplemental Tables
Datafile S1. Table of read BLAST hits and assigned MEGAN taxa with no filters applied.
Datafile S2. Table of read BLAST hits and assigned MEGAN taxa for diet items, with reads reclassified at the family or order level by filtering on read length to alignment length ratio and percent identity.
Supplemental Figures
Fig S1. Biplots of read lengths and qualities for each barcode in the January and March runs.
Fig S2. Correlation of read accuracy with alignment characteristics. (a-c) Read accuracy is positively correlated with the percent identity of the top BLAST hit. Points show a subsample of reads; orange line indicates a running median; red dotted line is the y=x line, which is expected if accuracy corresponds exactly to percent identity. (a) indicates the relationship for diet items; (b) for rats; and (c) for mice. (d-f) Read accuracy and alignment length show no significant relationship. Plots again are (d) diet items; (e) rats; and (f) mice. (g-i) Read accuracy and the ratio of read length to alignment length are positively correlated: more accurate reads are more likely to have long alignments relative to read length. Plots again are (g) diet items; (h) rats; and (i) mice.
Acknowledgements
This work was supported by a Massey University Research Fund to NF, a Marsden Fund Grant (15-MAU-136) to JD and Marsden Fund Grant MAU1703 to OS. Thanks to Friends of Okura Bush, Mary Stewart from Auckland Council, and Gillian Wadams and the volunteers at the Waitakere Ranges for collecting rat samples and aiding in rat species identification. Sample collection was performed under Auckland Council Permit to Undertake Research WS1064.
Footnotes
Communicating authors: Olin K. Silander, Institute of Natural and Mathematical Sciences, Massey University, Auckland 0745, New Zealand, olinsilander{at}gmail.com, +64 9 213 6618; Nikki E. Freed, Institute of Natural and Mathematical Sciences, Massey University, Auckland 0745, New Zealand, freednikki{at}gmail.com, +64 9 213 6639