Abstract
Tuberculosis remains one of the main causes of death worldwide. The long and cumbersome process of culturing Mycobacterium tuberculosis complex (MTBC) bacteria has encouraged the development of specific molecular tools for detecting the pathogen. Most of these tools aim to become novel tuberculosis diagnostics, and considerable effort and resources are invested in their development in pursuit of endorsement by the main public health agencies. Surprisingly, no study has used the vast amount of available genomic data to identify the best MTBC diagnostic markers. In this work, we use large-scale comparative genomics to provide a catalog of 30 characterized loci that are unique to the MTBC. Some of these genes could be targeted to assess the physiological status of the bacilli. Remarkably, none of the conventional MTBC markers is in our catalog. In addition, we develop a qPCR assay to accurately quantify MTBC DNA in clinical samples.
Background
Tuberculosis (TB) is the most lethal infectious disease caused by a single agent, namely bacteria belonging to the Mycobacterium tuberculosis complex (MTBC)[1]. Whereas isolating the bacteria from clinical specimens is a time-consuming process that delays both clinical diagnosis and research workflows, rapid molecular tests have the potential to identify the pathogen DNA in a few hours[2,3]. This is the main reason why the development of new molecular tools for TB diagnosis is an active area of research, with many companies involved and seeking the endorsement of the World Health Organization (WHO)[4]. The most successful example has been the Xpert MTB/RIF test[5], which was endorsed by the WHO in 2010 for TB diagnosis and recommended as the first-line diagnostic in 2017[6]. Achieving high sensitivity and specificity is pivotal for the development and improvement of molecular tests to ensure an accurate diagnosis. To this end, most tests incorporate specific markers for the detection of MTBC bacteria. For instance, the new Xpert MTB/RIF Ultra assay, whose predecessor targeted the rpoB gene alone, now also targets the insertion sequences IS6110 and IS1081[7]. The insertion sequence IS6110 has been extensively used as an MTBC-specific marker since it was first described in 1990[8]. In addition, IS6110 can be present in high copy numbers in some MTBC strains (copy number ranges from 0 to 27)[9], so nucleic acid amplification tests (NAATs) targeting this sequence achieve higher sensitivity for strains carrying several copies. However, the specificity of IS6110 has been questioned for two decades[10–15], which, along with the fact that some strains lack this insertion sequence, can lead to an incorrect diagnosis[16,17].
Several other genes have been used as markers for the accurate identification of MTBC bacteria[18–21]. However, the accuracy of NAATs based on these markers relies on the specificity of the primers: most of the targeted loci are claimed to be MTBC-specific, yet they were evaluated with limited genomic information on the diversity of non-tuberculous mycobacteria (NTM) and MTBC bacteria.
Nowadays, publicly available omic data can help identify species-specific genetic markers for developing accurate molecular tools. Analyzing omic data has proven to be an effective strategy for the identification of specific markers in several organisms[22–26], and workflows have even been published for the evaluation of genetic markers based on genomic data[27]. For instance, Zozaya-Valdés et al. used comparative genomics to assess the population structure of Mycobacterium chimaera, identifying six loci specific to this organism that allowed them to develop a highly accurate qPCR assay.
Strikingly, the use of comparative genomics for the identification of MTBC-specific loci has been very limited. The few published studies focused on genetic regions acquired by horizontal gene transfer and used the limited datasets available at the time of publication, a decade ago[28–30]. By contrast, recent years have witnessed a burst of available genomic sequences from a wide range of mycobacterial species and thousands of MTBC strains[31–33].
In this work, we perform a large-scale comparative genomic analysis to provide a reference list of 30 MTBC-specific loci that will be of great utility for the scientific community working on the development of new research and clinical tools for tuberculosis. Remarkably, we found that the main MTBC markers used to date are also present in other organisms, mainly NTM. In our analysis, we assess the global diversity of each MTBC-specific gene across a comprehensive dataset of more than 4,700 MTBC strains, showing the value of using the genomic data at hand to identify the best targets for diagnostic assays. In addition, we develop a qPCR assay based on one of these markers that is capable of quantifying MTBC DNA in clinical samples.
Methods
In silico identification of MTBC-specific diagnostic gene markers
To identify MTBC-specific loci, we used blastn[34] to search for all the genes of the tuberculosis reference strain H37Rv (NC_000962.3) in the NCBI nucleotide non-redundant database (accessed October 2018) and in a custom database comprising 4,277 NTM assemblies (Supplementary Methods 1). All searches were run with the blastn algorithm and a word size (seed) of 7 bp. We then filtered the results with a set of stringent parameters to discard loci similar to any genomic region of any organism other than MTBC. We discarded every gene with an alignment covering more than 25% of its sequence (query coverage) at a similarity greater than 80%. A gene aligned over 60% or more of its sequence was discarded regardless of the similarity of the alignment. Finally, we kept only those genes that were present in all MTBC bacteria.
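The two filtering thresholds described above can be expressed as a short predicate. The following Python sketch is illustrative only and is not the pipeline used in this work: the `Hit` record, its field names and the `subject_is_mtbc` flag are hypothetical, and the additional requirement that a gene be present in all MTBC bacteria is not modelled.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Hit:
    """One blastn alignment of an H37Rv gene (hypothetical record layout)."""
    gene: str
    query_coverage: float   # percent of the gene length covered by the alignment
    identity: float         # percent similarity of the alignment
    subject_is_mtbc: bool   # True if the matched organism belongs to the MTBC


def hit_disqualifies(hit: Hit) -> bool:
    """Return True if this alignment is enough to discard the gene."""
    if hit.subject_is_mtbc:
        return False  # matches within the MTBC are expected
    if hit.query_coverage >= 60.0:
        return True   # long alignment: discarded regardless of similarity
    return hit.query_coverage > 25.0 and hit.identity > 80.0


def specific_genes(genes: Iterable[str], hits: Iterable[Hit]) -> List[str]:
    """Keep the genes that have no disqualifying non-MTBC alignment."""
    flagged = {h.gene for h in hits if hit_disqualifies(h)}
    return [g for g in genes if g not in flagged]
```

A gene is rejected as soon as a single non-MTBC alignment crosses either threshold, which matches the conservative intent of the filter.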
Once potential MTBC-specific markers were identified, we assessed their genetic diversity. To do this, we analyzed the polymorphisms (single nucleotide polymorphisms (SNPs) and indels) observed at each position across a dataset comprising 4,766 genomes of MTBC strains[35]. The number of SNPs of each gene was calculated as the number of positions showing any nucleotide other than the reference. In the case of indels, we considered only those positions showing an indel in at least 10 strains (0.2% of the dataset), to avoid the noise introduced by single-strain indels spanning large genic regions and possible false deletions arising from sequencing runs with uneven genomic coverage. This allowed us to calculate different metrics for each gene, such as the absolute number of polymorphisms, the polymorphisms per base and, most importantly, the prevalence of each polymorphism.
Finally, we searched the literature for available information on these genes, which allowed us to discard some candidates based on their genomic context and to provide extended information about their physiology. We gathered transcriptomic and proteomic data from different published studies: transcriptomic data in response to overexpression of 206 transcription factors[36], to different genotoxic stresses[37] and to nitric oxide stress at different time-points[38], as well as proteomic data in response to nutrient starvation[39].
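The per-gene metrics described in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the analysis: the input dictionaries (gene position mapped to the number of strains carrying a variant there) and all function names are hypothetical.

```python
from typing import Dict

MIN_INDEL_STRAINS = 10   # indel positions seen in fewer strains are ignored
N_STRAINS = 4766         # size of the MTBC dataset described in the text


def gene_diversity(gene_length: int,
                   snp_strain_counts: Dict[int, int],
                   indel_strain_counts: Dict[int, int]) -> Dict[str, float]:
    """Summarise the polymorphisms of one gene.

    Both dictionaries map a position within the gene to the number of
    strains showing a SNP (or an indel) at that position.
    """
    # A SNP position is any position with a non-reference nucleotide
    # in at least one strain.
    n_snps = sum(1 for n in snp_strain_counts.values() if n > 0)
    # Indel positions only count when shared by enough strains.
    n_indels = sum(1 for n in indel_strain_counts.values()
                   if n >= MIN_INDEL_STRAINS)
    total = n_snps + n_indels
    return {
        "snps": n_snps,
        "indels": n_indels,
        "polymorphisms": total,
        "polymorphisms_per_base": total / gene_length,
    }


def prevalence(n_strains_with_variant: int, n_strains: int = N_STRAINS) -> float:
    """Fraction of strains in the dataset that carry a given polymorphism."""
    return n_strains_with_variant / n_strains
```

For example, `prevalence(4182)` gives the fraction of the dataset carrying a polymorphism found in 4,182 strains.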
Set-up of a MTBC-specific qPCR assay for DNA detection and quantification
We used the list of 30 MTBC-specific loci to set up a qPCR assay for the detection and quantification of MTBC DNA. To select the target for the assay, we took into consideration the number of polymorphisms per base, the absence of highly prevalent polymorphisms, the gene length and its genomic context. These criteria enabled an optimal primer design, amplifying a universal and highly specific region for the detection of MTBC. We designed the primers and probe for the assay using the web tool Primer-BLAST[40], checking that no nonspecific amplicons were predicted. The final qPCR assay consisted of the amplification of a 65 bp region within the Rv2341 gene using the following oligonucleotides: Forward-GCCGCTCATGCTCCTTGGAT, Reverse-AGGTCGGTTCGCTGGTCTTG, Probe-TGAGTGCCTGCGGCCGCAGCGC.
To test the specificity of the assay, we performed qPCR experiments with DNA from all MTBC lineages (except lineage 7, due to unavailability), human DNA, a mock sample with mixed DNA from 20 different bacterial species (ATCC® MSA-1002™), and 17 different species of NTM (Supplementary Methods 2).
The reaction efficiency was calculated using serial dilutions of pure H37Rv DNA as template (0.5 ng/µl to 0.5 × 10^-5 ng/µl). In addition, we evaluated the performance of the assay in detecting and quantifying MTBC DNA in a test set of clinical samples. We used DNA extracted from 12 homogenized sputum samples from culture-positive TB patients, two of them with negative smear microscopy. We also used a DNA extraction from the sputum of a non-TB patient, spiked with known concentrations of pure H37Rv DNA (0.5 ng/µl to 0.5 × 10^-5 ng/µl), to calculate the reaction efficiency in clinical samples.
All qPCR reactions were carried out using hydrolysis probe chemistry (FAM/BHQ) in a total volume of 20 µl, containing 10 µl of Kapa Probe Fast Master Mix 2X (Kapa Biosystems), 250 mM of each primer, 350 mM of probe and 2 µl of sample. All reactions were performed in a Roche LightCycler 96 (Roche Diagnostics), with two replicates per sample and with no-template reactions as negative controls (NTC). When calculating reaction efficiencies, we used three replicates per point instead of two. Each run comprised an initial denaturation step at 95°C for 3 minutes, followed by 55 amplification cycles of 20 seconds at 60°C for annealing, 1 second at 72°C for extension, and 10 seconds at 95°C for denaturation. The results were analyzed with the LightCycler® 96 software, version 1.1. Each assay was carried out in triplicate to check reproducibility.
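As a quick sanity check on the oligonucleotides listed above, their length and GC content can be computed directly from the sequences. The sketch below uses only the three sequences given in the text; the helper names are illustrative, and the check does not replace the Primer-BLAST evaluation.

```python
# Oligonucleotide sequences exactly as listed in the text (5' to 3').
OLIGOS = {
    "forward": "GCCGCTCATGCTCCTTGGAT",
    "reverse": "AGGTCGGTTCGCTGGTCTTG",
    "probe": "TGAGTGCCTGCGGCCGCAGCGC",
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def reverse_complement(seq: str) -> str:
    """Reverse complement, e.g. to locate the reverse primer on the forward strand."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def describe(oligos: dict) -> dict:
    """Length and GC fraction for each named oligonucleotide."""
    return {name: (len(seq), gc_fraction(seq)) for name, seq in oligos.items()}


if __name__ == "__main__":
    for name, (length, gc) in describe(OLIGOS).items():
        print(f"{name}: {length} nt, GC = {gc:.1%}")
```

The combined length of the two primers and the probe (62 nt) is compatible with the 65 bp amplicon reported for the assay.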
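Reaction efficiency is conventionally derived from the slope of the standard curve (Cq against log10 of the template concentration) as E = 10^(-1/slope) - 1. The sketch below assumes this standard relation and tenfold dilution steps between 0.5 and 0.5 × 10^-5 ng/µl; the text does not state the dilution factor, and the Cq values shown are invented for illustration, not measured data.

```python
import math
from typing import Sequence


def least_squares_slope(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Slope of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def amplification_efficiency(concentrations: Sequence[float],
                             cq_values: Sequence[float]) -> float:
    """Efficiency from a standard curve: E = 10**(-1/slope) - 1.

    A value of 1.0 corresponds to perfect doubling per cycle (100%).
    """
    log_conc = [math.log10(c) for c in concentrations]
    slope = least_squares_slope(log_conc, cq_values)
    return 10 ** (-1.0 / slope) - 1.0


# Assumed tenfold dilution series (ng/ul) spanning the range in the text.
DILUTIONS = [0.5 * 10 ** -k for k in range(6)]
# Hypothetical Cq values, for illustration only.
EXAMPLE_CQ = [18.1, 21.5, 24.9, 28.3, 31.7, 35.1]

if __name__ == "__main__":
    eff = amplification_efficiency(DILUTIONS, EXAMPLE_CQ)
    print(f"efficiency = {eff:.1%}")
```

Fitting the mean Cq of the replicates at each dilution point, rather than single wells, would reduce the influence of pipetting noise on the slope.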
Bacterial culture, clinical specimens and DNA extraction
All DNA extractions were performed in our laboratory, except for the commercial DNA mix of 20 bacterial species. Available cultures of different NTM species were subcultured in 7H11 solid agar medium, and the DNA was then extracted following the standard CTAB protocol[41] with an inactivation step of 1 hour at 80°C. DNA concentrations were measured with the Qubit fluorometer (dsDNA high-sensitivity kit), and samples with a concentration higher than 1 ng/µl were normalized to 1 ng/µl. For the 13 sputum specimens, DNA extraction was performed as described by Votintseva et al.[42]. All samples were handled in a BSL-3 facility until DNA was extracted and purified.
Ethics approval
The clinical specimens used in this study were collected as part of the surveillance program of communicable diseases by the General Directorate of Public Health of the Comunidad Valenciana and, as such, fall outside the mandate of the corresponding Ethics Committee for Biomedical Research. All personal information was anonymized and no data allowing individual identification were retained.
Results
We identified 40 genes as uniquely present in members of the MTBC according to our filtering parameters (Figure 1). After evaluating their genetic diversity across a database of more than 4,700 MTBC strains, we observed that the median number of SNPs per base was 0.07, with some genes showing higher or lower diversity (as high as 0.1 and as low as 0.04 SNPs/base, respectively), probably as a result of different selective pressures. Importantly, although most of the polymorphisms analyzed were strain-specific, we also observed highly prevalent polymorphisms (Figure 1, Supplementary File 1). For instance, Rv0610c showed a SNP present in 4,182 strains and Rv2823c showed an insertion in 4,345 strains. Analysis of the phylogenetic distribution of these polymorphisms confirmed that they mapped to deep branches in the phylogeny. For example, the SNP in Rv0610c affected all modern lineages (L2, L3, L4).
Among these, 9 genes were discarded as potential diagnostic markers because they were located in regions of difference (RD) 182 (Rv2274c) and RD 207 (Rv2816c-Rv2820c), as described in Gagneux et al.[43], or in variable genomic regions associated with CRISPR elements (Rv2816c-2823c)[44]. Another gene, Rv3424c, was also discarded because we found it to be duplicated in a very labile genomic region, between the (putative) transposase of the insertion sequence IS1532 and PPE 59. Therefore, the curated list of MTBC-specific diagnostic markers finally consisted of 30 genes (Figure 1).
When looking at published transcriptomic and proteomic data (see Methods), we observed that the Rv2003c, Rv2142c, and Rv3472 proteins are found at higher levels (6.19-, 3.6- and 100-fold, respectively) when the bacteria are subjected to starvation. Interestingly, Rv2003c is also overexpressed upon treatment with nitric oxide (Supplementary File 2).
Based on our large genomic analysis, we set up a qPCR assay targeting the Rv2341 gene. This gene, described as “probable conserved lipoprotein lppQ” in the Mycobrowser database[45], is situated in a stable genomic region, between the asparagine tRNA and the gene encoding the DNA primase, which is involved in the synthesis of Okazaki fragments. Furthermore, we were able to design an optimized set of primers that avoids any region harboring prevalent polymorphisms (Figure 1).
When testing the qPCR assay with a panel of samples including different MTBC lineages, human DNA, mock bacterial communities and different NTM, the specificity of the assay was 100%. The efficiency of the reaction was 95%, with a limit of detection of 10 fg (hypothetically corresponding to 2 genome equivalents). When using a standard curve of pure H37Rv DNA spiked into sputum samples, the efficiency of the reaction was similar (97%) and the limit of detection remained unaltered (Figure 2). When testing our qPCR assay with a panel of 12 TB sputum samples, we were able to detect and quantify MTBC DNA in all TB patient sputa, including 2 confirmed TB cases with negative smear microscopy (Supplementary File 4).
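The correspondence between 10 fg of DNA and roughly 2 genome equivalents can be reproduced with a short calculation. The sketch below rests on two assumptions that are not stated in the text: an H37Rv genome length of 4,411,532 bp and an average mass of about 650 g/mol per base pair of double-stranded DNA.

```python
AVOGADRO = 6.02214076e23          # molecules per mole
BP_MASS_G_PER_MOL = 650.0         # assumed average mass of one dsDNA base pair
H37RV_GENOME_BP = 4_411_532       # assumed length of the H37Rv genome in bp


def genome_mass_fg(genome_bp: int = H37RV_GENOME_BP,
                   bp_mass: float = BP_MASS_G_PER_MOL) -> float:
    """Approximate mass of a single genome copy, in femtograms."""
    grams = genome_bp * bp_mass / AVOGADRO
    return grams * 1e15


def genome_equivalents(dna_fg: float,
                       genome_bp: int = H37RV_GENOME_BP,
                       bp_mass: float = BP_MASS_G_PER_MOL) -> float:
    """Number of genome copies represented by a DNA mass given in femtograms."""
    return dna_fg / genome_mass_fg(genome_bp, bp_mass)


if __name__ == "__main__":
    print(f"one genome  ~ {genome_mass_fg():.2f} fg")
    print(f"10 fg of DNA ~ {genome_equivalents(10.0):.1f} genome equivalents")
```

Under these assumptions a single genome weighs a little under 5 fg, so 10 fg corresponds to about two copies, in line with the figure quoted for the limit of detection.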
Discussion
Identification of MTBC markers for the development of new diagnostic and research tools for tuberculosis has been an active area of research over the last decades, focusing on the direct or indirect detection of the tubercle bacilli. It is striking that, for such a relevant disease from both the epidemiological and economic points of view, and one for which large amounts of genomic data are already available, the identification of MTBC-specific genes has been relegated to the background. This has probably been motivated by the fact that current molecular tools, for instance assays targeting the insertion sequence IS6110[46] or rpoB[47], have been shown to perform well in most situations. However, the available tools are not enough to stop the spread of the disease, and for this reason many new-generation diagnostics are still being developed with the aim of improving the accuracy of the existing ones and tackling their known flaws.
Our analysis provides valuable information for developing such diagnostics, in the form of a catalog of specific MTBC markers. Remarkably, some of the markers that we identify could be targeted to determine the physiological status of MTBC bacteria under certain conditions. For example, Rv2003c, overexpressed during starvation and upon treatment with nitric oxide[38,39], is also upregulated during dormancy[48]. Similarly, Rv1374c has been described as a small RNA that is highly expressed during exponential growth[49], and hence could be used to evaluate the replicative state of the bacilli.
Strikingly, none of the markers considered to be MTBC-specific to date are in our list of unique MTBC genes. For instance, when examining in which species IS6110 can be found, we observed several non-MTBC organisms, including 14 NTM, carrying at least one copy. The same is true for IS1081 and mpt64, present in 38 and 6 NTM, respectively (Supplementary File 3). Similarly, the short-chain dehydrogenase/reductase gene (SDR) (Rv0303, region 365,234–366,142), which has recently been described as an M. tuberculosis-specific marker[28], is actually present in several NTM, as revealed by a blastn search in the non-redundant database of the NCBI web server (accessed January 2019) and in our database of NTM assemblies (Supplementary File 5). The fact that IS6110 is still one of the most widely used genetic targets for MTBC DNA detection (for example in the new Xpert MTB/RIF Ultra assay[7]) highlights the great utility, and the necessity, of translating the results of genomic analyses to the laboratory.
To illustrate the translational potential of our work, we set up an accurate qPCR assay capable of quantifying MTBC DNA with 100% specificity and a sensitivity down to 2 genome copies. Quantifying MTBC DNA from clinical samples is challenging due to the presence of PCR inhibitors along with large proportions of DNA from the human host and the oropharyngeal microbiota. However, this capability is invaluable not only for diagnostic purposes but also in the research context, for example when developing new protocols[42,50]. Remarkably, our assay, targeting a small region of the Rv2341 gene, showed excellent performance in a test set of clinical specimens. Nevertheless, we want to highlight that the list provided here comprises 30 loci, from which many different molecular tools for tuberculosis could be developed.
Altogether, our analysis has a direct translational value, as it represents an important resource for research groups and companies involved in the development and improvement of novel TB diagnostics. For instance, the markers identified in this work could be used to improve existing tests such as the Xpert MTB/RIF assay, by including targets that we have demonstrated to be globally conserved and fully specific to the MTBC.
Declarations
Competing Interests
The authors declare no conflict of interest in this article.
Funding Sources
This work was supported by projects of the European Research Council (ERC) (638553-TB-ACCELERATE), Ministerio de Economía y Competitividad, and Ministerio de Ciencia, Innovación y Universidades (Spanish Government): SAF2013-43521-R, SAF2016-77346-R and SAF2017-92345-EXP (to IC), BES-2014-071066 (to GAG), and FPU 13/00913 (to ACO).
Author Contributions
GAG and IC designed the study and analyzed the data. GAG and ACO analyzed the 4,766 MTBC strains dataset. GAG, MTP and CML performed the qPCR experiments. GAG and LMV cultured the non-tuberculous mycobacteria and performed the DNA extractions. AGB and RB did the microbiological identification of the isolates of non-tuberculous mycobacteria. RB provided the clinical specimens and clinical data. GAG and IC wrote the first draft of the manuscript. All authors contributed to the final version of the manuscript.
Contact Information
Corresponding author e-mail, Iñaki Comas: icomas{at}ibv.csic.es