Abstract
Asthma is a common, under-diagnosed disease affecting all ages. We sought to identify a nasal brush-based classifier of mild/moderate asthma. One hundred ninety subjects, comprising individuals with mild/moderate asthma and controls, underwent nasal brushing and RNA sequencing of nasal samples. A machine learning-based pipeline, comprising feature selection, classification, and statistical analyses, identified a diagnostic classifier of asthma consisting of 90 nasally expressed genes interpreted via an L2-regularized logistic regression classification model. This nasal brush-based classifier performed with strong predictive value and sensitivity across eight validation test sets, including (1) a test set of independent asthmatic and non-asthmatic subjects profiled by RNA sequencing (positive and negative predictive values of 1.00 and 0.96, respectively; AUC of 0.994), (2) two independent case-control cohorts of asthma profiled by microarray, and (3) five independent cohorts of subjects with other respiratory conditions (allergic rhinitis, upper respiratory infection, cystic fibrosis, smoking), where the panel had a low to zero rate of misclassification. Translational development of this classifier into a diagnostic nasal brush-based biomarker for clinical use could aid in asthma detection and care.
Introduction
Asthma is a chronic respiratory disease that affects 8.6% of children and 7.4% of adults in the United States [1]. Its true prevalence may be higher. The fluctuating airflow obstruction, bronchial hyper-responsiveness, and airway inflammation that characterize mild to moderate asthma can be difficult to detect in busy, routine clinical settings [2]. In one study of US middle school children, 11% reported physician-diagnosed asthma with current symptoms, while an additional 17% reported active asthma-like symptoms without a diagnosis of asthma [3]. Undiagnosed asthma leads to missed school and work, restricted activity, emergency department visits, and hospitalizations [3, 4]. Given the high prevalence of asthma and consequences of missed diagnosis, there is high potential impact of improved diagnostic tools for asthma [5].
National and international guidelines recommend that the diagnosis of asthma should be based on a history of typical symptoms and objective findings of variable expiratory airflow limitation [6, 7]. However, obtaining such objective findings can be challenging given currently available tools. Pulmonary function tests (PFTs) require equipment, expertise, and experience to execute well [8, 9]. Many individuals have difficulty with PFTs because they require coordinated breaths into a device. Results are unreliable if the procedure is done with poor technique [8]. Further, PFTs are usually not immediately available in primary care settings. Despite guidelines recommending objective tests such as PFTs to assess possible asthma, PFTs are not done in over half of patients suspected of having asthma [8]. Induced sputum and exhaled nitric oxide have been explored as asthma biomarkers, but their implementation requires technical expertise and does not yield better clinical results than physician-guided management alone [10]. Given the above, the reality is that most asthma is still clinically diagnosed and managed based on self-report [8, 9]. This is suboptimal for mild/moderate asthma given its waxing/waning nature, and because self-reported symptoms and medication use are biased [11].
A nasal biomarker of asthma is of high interest given the accessibility of the nose and shared airway biology between the upper and lower respiratory tracts [12-15]. The easily accessible nasal passages are directly connected to the lungs and exposed to common environmental and microbial factors. In this study, we applied next-generation sequencing and machine learning to identify a novel nasal brush-based classifier of asthma (Figure 1). Specifically, we used RNA sequencing (RNAseq) to comprehensively profile gene expression from nasal brushings collected from subjects with mild to moderate asthma and controls, creating the largest nasal RNAseq data set in asthma to date. Using a robust machine learning-based pipeline comprised of feature selection [16], classification [17], and statistical analyses [18], we identified an asthma gene panel that accurately differentiates subjects with and without mild-moderate asthma. This pipeline was designed with a systems biology-based perspective that many genes, even ones with marginal effects, can collectively classify phenotypes (here asthma) more accurately than individual genes [19].
We validated this asthma gene panel on eight test sets of independent subjects with asthma and other respiratory conditions, finding that it performed with high accuracy, sensitivity, and specificity. As the study of nasal transcriptomics in asthma has been marked by small studies thus far, our relatively large study importantly adds RNAseq data to the field while also leveraging smaller existing data sets for external validation. We see our identification of a diagnostic nasal brush-based classifier of asthma as the first step in the development of minimally invasive, nasal biomarkers for asthma care, with translational development for clinical implementation to follow next. As with any disease, the first step is to accurately identify affected patients, and a next phase of research will be to develop nasal biomarkers to predict treatment response.
Results
Study population and baseline characteristics
We performed nasal brushing on 190 subjects for this study, including 66 subjects with well-defined mild to moderate persistent asthma (based on symptoms, medication need, and demonstrated airway hyper-responsiveness by methacholine challenge) and 124 subjects without asthma (based on no personal or family history of asthma, normal spirometry, and no bronchodilator response). The definitional criteria we used for mild-moderate asthma are consistent with US National Heart Lung Blood Institute guidelines for the diagnosis of asthma [7], and are the same criteria used in the longest NIH-sponsored study of mild-moderate asthma [20, 21].
From these 190 subjects, a random selection of 150 subjects were a priori assigned as the development set (to be used for asthma classifier development), and the remaining 40 subjects were a priori assigned as the RNAseq test set (to be used as one of 8 validation test sets for testing of the asthma classifier identified from the development set).
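A random assignment of this kind can be sketched as follows. This is a minimal illustration rather than the study's actual procedure: the subject identifiers, the seed, and the use of a simple unstratified shuffle are all assumptions made for the example.

```python
import random

N_SUBJECTS = 190   # total subjects, from the text
N_DEV = 150        # development set size, from the text

# Hypothetical subject identifiers; the study's real IDs are not given.
subject_ids = [f"S{i:03d}" for i in range(1, N_SUBJECTS + 1)]

rng = random.Random(2024)   # arbitrary seed, fixed so the split is reproducible
shuffled = subject_ids[:]
rng.shuffle(shuffled)

# The first 150 shuffled subjects form the development set, the remaining 40 the test set.
development_set = shuffled[:N_DEV]
rnaseq_test_set = shuffled[N_DEV:]
```

Fixing the split before any modeling, as described above, keeps the test subjects from influencing feature selection or model choice.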
The baseline characteristics of the subjects in the development set (n=150) are shown in the left section of Table 1. The mean age of subjects with asthma was somewhat lower than that of subjects without asthma, with slightly more male subjects with asthma and more female subjects without asthma. Caucasians were more prevalent in subjects without asthma, which was expected based on the inclusion criteria. Consistent with the reversible airway obstruction that characterizes asthma [2], subjects with asthma had significantly greater bronchodilator response than control subjects (T-test P = 1.4 × 10⁻⁵). Allergic rhinitis was more prevalent in subjects with asthma (Fisher’s exact test P = 0.005), consistent with known comorbidity between allergic rhinitis and asthma [22]. Rates of smoking between subjects with and without asthma were not significantly different.
RNA isolated from nasal brushings from the subjects was of good quality, with mean RIN 7.8 (±1.1). The median number of paired-end reads per sample from RNA sequencing was 36.3 million. Following pre-processing (normalization and filtering) of the raw RNAseq data, 11,587 genes were used for statistical and machine learning analysis. VariancePartition analysis [23], which is designed to analyze the contribution of technical and biological factors to variation in gene expression, showed that age, race, and sex contributed minimally to total gene expression variance (Supplementary Figure 1). For this reason, we did not adjust the pre-processed RNAseq data for these factors.
Differential gene expression analysis by DESeq2 [24] showed that 1613 and 1259 genes were over- and under-expressed, respectively, in asthma cases versus controls (false discovery rate (FDR) ≤ 0.05) (Supplementary Table 1). These genes were enriched for disease-relevant pathways in the Molecular Signatures Database [25], including immune system (fold change=3.6, FDR=1.07 × 10⁻²²), adaptive immune system (fold change=3.91, FDR=1.46 × 10⁻¹⁵), and innate immune system (fold change=4.1, FDR=4.47 × 10⁻⁹) (Supplementary Table 1).
Identifying a nasal brush-based classifier to predict asthma status
To identify a nasal brush-based classifier that accurately predicts asthma status using the RNAseq data generated, we developed a rigorous machine learning pipeline that combined feature (gene) selection [16] and classification techniques [17] that was applied to the development set (Materials and Methods and Supplementary Figure 2). This pipeline was designed with a systems biology-based perspective that many genes, even ones with marginal effects, can collectively classify phenotypes (here asthma) more accurately than individual genes. Each gene expression trait can be evaluated on its own or in combination with other gene expression traits to assess how well it distinguishes asthma cases from controls (a process referred to as feature selection). Once the most predictive gene expression traits (features) are identified, various machine learning algorithms can be applied to build a classifier that is optimized to predict asthma status as accurately as possible given the data (a process referred to as classification analysis).
Feature selection in our pipeline involved a cross validation-based protocol [26] using the well-established Recursive Feature Elimination (RFE) algorithm [16] combined with L2-regularized Logistic Regression (LR or Logistic) and Support Vector Machine with a linear kernel (SVM-Linear) algorithms [17] (combinations referred to as LR-RFE and SVM-RFE, respectively) (Supplementary Figure 3). Classification analysis was then performed by applying four global classification algorithms (SVM-Linear, AdaBoost, Random Forest, and Logistic) [17] to the expression profiles of the gene sets identified by feature selection. To reduce the potential adverse effect of overfitting, this process (feature selection and classification) was repeated 100 times on 100 random splits of the development set into training and holdout sets. The final classifier was selected by statistically comparing the models in terms of both classification performance and parsimony, i.e., the number of genes included in the model [18] (Supplementary Figure 4).
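The LR-RFE & Logistic combination evaluated over repeated random splits can be sketched with scikit-learn as below. This is a simplified illustration under stated assumptions, not the pipeline used in the study: the data are synthetic, the panel size `N_GENES`, the number of splits, the holdout fraction, and the regularization strength are hypothetical, and the nested cross-validation and statistical model comparison described above are omitted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for an expression matrix (samples x genes); NOT the study data.
X, y = make_classification(n_samples=150, n_features=200, n_informative=15,
                           weights=[0.65, 0.35], random_state=0)

N_SPLITS = 10   # the text describes 100 random splits; fewer here to keep the sketch fast
N_GENES = 20    # hypothetical panel size for this toy data

splitter = StratifiedShuffleSplit(n_splits=N_SPLITS, test_size=0.25, random_state=0)
f_scores = []
for train_idx, holdout_idx in splitter.split(X, y):
    # LogisticRegression uses L2 regularization by default; RFE ranks genes by its coefficients.
    selector = RFE(LogisticRegression(C=1.0, max_iter=2000),
                   n_features_to_select=N_GENES, step=0.1)
    selector.fit(X[train_idx], y[train_idx])

    # Train the final classifier on the selected genes only, then score on the holdout set.
    clf = LogisticRegression(C=1.0, max_iter=2000)
    clf.fit(selector.transform(X[train_idx]), y[train_idx])
    pred = clf.predict(selector.transform(X[holdout_idx]))
    f_scores.append(f1_score(y[holdout_idx], pred, pos_label=1))

mean_f = float(np.mean(f_scores))   # average F-measure for the positive class across splits
```

Feature selection is fitted on the training portion of each split only, so the holdout F-measure is not inflated by genes chosen with knowledge of the holdout samples.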
Due to the imbalance of the two classes (asthma and controls) in our cohort (consistent with imbalances in the general population), we used F-measure as the main evaluation metric in our study [27]. This class-specific measure is a conservative mean of precision (predictive value) and recall (same as sensitivity), and is described in detail in Box 1 and Supplementary Figure 5. F-measure can range from 0 to 1, with higher values indicating superior classification performance. An F-measure value of 0.5 does not represent a random model. To provide context for our performance assessments, we also computed commonly used evaluation measures, including positive and negative predictive values (PPVs and NPVs) and Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) scores (Box 1 and Supplementary Figure 5).
Box 1. Evaluation measures for predictive models
Many measures exist for evaluating the performance of classifiers. The most commonly used evaluation measures in medicine are the positive and negative predictive values (PPV and NPV respectively; Supplementary Figure 5), and Area Under the Receiver Operating Characteristic (ROC) Curve (AUC score) [27]. However, these measures have several limitations. PPV and NPV ignore the critical dimension of sensitivity [27]. For instance, a classifier may predict perfectly for only one asthma sample in a cohort and make no predictions for all other asthma samples. This will yield a PPV of 1, but poor sensitivity, since none of the other asthma samples were identified by the classifier. ROC curves and their AUC scores do not accurately reflect performance when the number of cases and controls in a sample are imbalanced [27], which is frequently the case in clinical studies and medical practice. For such situations, precision, recall, and F-measure (Supplementary Figure 5) are considered more meaningful performance measures for classifier evaluation. Note that precision for cases (e.g. asthma) is equivalent to PPV, and precision for controls (e.g. no asthma) is equivalent to NPV (Supplementary Figure 5). Recall is the same as sensitivity. F-measure is the harmonic (conservative) mean of precision and recall that is computed separately for each class, and thus provides a more comprehensive and reliable assessment of model performance for cohorts with unbalanced class distributions. Like PPV, NPV and AUC, F-measure ranges from 0 to 1, with higher values indicating superior classification performance, but a value of 0.5 for F-measure does not represent a random model and could in some cases indicate superior performance over random. For the above reasons, we consider F-measure as the primary evaluation measure in our study, although we also provide PPV, NPV and AUC measures for context.
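The relationship between these measures can be made concrete with a small sketch. The cohort below (10 asthma samples, 20 controls) is invented for illustration and mirrors the example above of a classifier that correctly flags a single asthma sample and nothing else; the function name `class_metrics` is likewise hypothetical.

```python
def class_metrics(y_true, y_pred, positive):
    """Return (precision, recall, F-measure) treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Harmonic mean of precision and recall; defined as 0 when both are 0.
    f = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f

# Invented cohort: 10 asthma, 20 controls; only the first asthma sample is called asthma.
y_true = ["asthma"] * 10 + ["control"] * 20
y_pred = ["asthma"] + ["control"] * 29

asthma_metrics = class_metrics(y_true, y_pred, "asthma")     # precision here is the PPV
control_metrics = class_metrics(y_true, y_pred, "control")   # precision here is the NPV
```

In this toy case the PPV is a perfect 1.0 while recall for asthma is only 0.1, and the F-measure for asthma (about 0.18) exposes the weakness that PPV alone hides.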
The best performing and most parsimonious combination of feature selection and classification algorithm identified by our machine learning pipeline was LR-RFE & Logistic (Regression) (Supplementary Figure 4). The classifier inferred using this combination was built on 90 predictive genes and will be henceforth referred to as the asthma gene panel. We emphasize that the expression values of the panel’s 90 genes must be used in combination with the Logistic classifier and the model’s optimal classification threshold (i.e., predicted label = asthma if the classifier’s probability output ≥ 0.76, else predicted label = no asthma) to be used effectively for asthma classification.
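The decision rule described above, a logistic model followed by a fixed probability cut-off, can be sketched as follows. Only the 0.76 threshold comes from the text; the two-gene weights, intercept, and expression values are invented placeholders, since the fitted coefficients of the 90-gene model are not given here.

```python
import math

ASTHMA_THRESHOLD = 0.76   # classification threshold stated in the text

def predict_label(expression, weights, intercept, threshold=ASTHMA_THRESHOLD):
    """Logistic model output followed by the threshold rule; returns (label, probability)."""
    score = intercept + sum(w * x for w, x in zip(weights, expression))
    probability = 1.0 / (1.0 + math.exp(-score))
    label = "asthma" if probability >= threshold else "no asthma"
    return label, probability

# Hypothetical two-gene model, for illustration only.
weights, intercept = [1.0, -0.5], 0.0

label_a, prob_a = predict_label([2.0, 1.0], weights, intercept)   # score 1.5
label_b, prob_b = predict_label([1.0, 0.0], weights, intercept)   # score 1.0
```

The second sample has a probability above 0.5 (about 0.73) but is still labeled "no asthma", which illustrates why the gene list, the model, and the threshold have to be used together.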
Validation of the asthma gene panel classifier in an RNAseq test set of independent subjects
Our next step was to validate the asthma gene panel in independent subjects, for which we used the RNAseq test set (n=40) that had been assigned a priori. The baseline characteristics of the subjects in this test set are shown in the right section of Table 1. Subjects in the development and test sets were generally similar, except for a lower prevalence of allergic rhinitis among those without asthma in the test set.
The asthma gene panel performed with high accuracy in the RNAseq test set’s independent subjects, achieving AUC=0.994 (Figure 2), PPV 1.00, and NPV 0.96 (Figures 3B and 3D, leftmost bar). In terms of the F-measure metric, the panel achieved F=0.98 and 0.96 for classifying asthma and no asthma, respectively (Figures 3A and 3C, leftmost bar). For comparison, the much lower performance of permutation-based random models is shown in Supplementary Figure 6.
Our machine learning pipeline evaluated models from several combinations of feature selection and classification algorithms to select the most predictive classifier. Potentially predictive genes can also be identified from differential expression analysis and results from prior asthma-related studies. Figure 4 shows the performance of the asthma gene panel in the RNAseq test set relative to that of alternative classifiers trained on the development set using: (1) other classifiers tested in our machine learning pipeline, (2) all genes in our data set (11,587 genes after filtering), (3) all differentially expressed genes in the development set (2872 genes) (Supplementary Table 1), (4) genes associated with asthma from prior studies [28] (70 genes) (Supplementary Table 2), and (5) a commonly used one-step classification model (L1-Logistic) [29] (243 genes). The asthma gene panel identified by our pipeline outperformed all these alternative classifiers despite its reliance on a small number of genes.
We emphasize that our panel produced more accurate predictions than models using all genes, all differentially expressed genes, and all known asthma genes. This supports that data-driven methods can build more effective classifiers than those built exclusively on traditional statistical methods (which do not necessarily target classification), and current domain knowledge (which may be incomplete and subject to investigation bias). Our panel also outperformed and was more parsimonious than the model learned using the commonly used L1-Logistic method, which combined feature selection and classification into a single step. The fact that our asthma gene panel performed well in an independent RNAseq test set while also outperforming alternative models lends confidence to the panel’s classification ability.
Validation of the asthma gene panel in external asthma cohorts
To assess the generalizability of our asthma gene panel for asthma classification in other populations and profiling platforms, we applied the panel to microarray-derived nasal gene expression data generated from independent cohorts of asthmatics and controls: Asthma1 (GEO GSE19187) [30] and Asthma2 (GEO GSE46171) [31]. Supplementary Table 3 summarizes the characteristics of these external, independent case-control cohorts. In general, RNAseq-based predictive models are not expected to translate well to microarray-profiled samples [32, 33]. A major reason is that gene mappings do not perfectly correspond between RNAseq and microarray due to disparities between array annotations and RNAseq gene models [33]. Our goal was to assess the performance of our asthma gene panel despite discordances in study designs, sample collections, and gene expression profiling platforms.
The asthma gene panel performed relatively well (Figure 3 middle bars) and consistently better than permutation-based random models (Supplementary Figure 6) in classifying asthma and no asthma in both the Asthma1 and Asthma2 microarray-based test sets. The panel achieved similar F-measures in the two test sets (Figures 3A and 3C middle bars), although the PPV and NPV measures were more dissimilar for Asthma2 (PPV 0.93, NPV 0.31) than for Asthma1 (PPV 0.61, NPV 0.67) (Figure 3B and 3D middle bars). Although the panel’s performance was better than its random counterparts for both these test sets, the difference in this performance was smaller for Asthma2. This occurred partially because Asthma2 includes many more asthma cases than controls (23 vs. 5), which is counter to the expected distribution in the general population. In such a skewed data set, it is possible for a random model to yield an artificially high F-measure for asthma by predicting every sample as asthmatic. We verified that this occurred with the random models tested on Asthma2.
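The effect of class skew on a trivial model can be checked with a short calculation. The class sizes (23 asthma, 5 controls) are those given above for Asthma2; the helper `f_measure` is a hypothetical name, and the "always asthma" predictor is an idealized stand-in for the random models rather than a reproduction of them.

```python
def f_measure(tp, fp, fn):
    """F-measure from counts: 2TP / (2TP + FP + FN); 0 when all three counts are 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

n_asthma, n_control = 23, 5   # Asthma2 class sizes, from the text

# Trivial model: every sample is called asthma.
f_asthma = f_measure(tp=n_asthma, fp=n_control, fn=0)
# For the no-asthma class, the same model identifies no controls at all.
f_no_asthma = f_measure(tp=0, fp=0, fn=n_control)

print(round(f_asthma, 3), f_no_asthma)   # 0.902 0.0
```

An asthma F-measure of about 0.90 with no discriminative ability at all shows why the no-asthma F-measure and the comparison against random models matter in this cohort.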
To assess how the asthma gene panel might perform in a larger external test set, we combined samples from Asthma1 and Asthma2 and performed the evaluation on this combined set. We chose this approach because no single large, external dataset of nasal gene expression in asthma exists, and combining cohorts could yield a joint test set with heterogeneity that partially reflects real-life heterogeneity of asthma. As expected, all the performance measures for this combined test set were intermediate to those for Asthma1 and Asthma2 (Figure 3, rightmost bars). These results supported that our panel also performs reasonably well in a larger and more heterogeneous cohort.
Overall, despite the discordance of gene expression profiling platforms, study designs, and sample collection methods, our asthma gene panel performed reasonably well in these external test sets, supporting a degree of generalizability of the panel across platforms and cohorts. Such a translatable result is not frequently observed in genomic medicine research, especially those based on gene expression [34, 35].
Specificity of the asthma gene panel: validation in external cohorts with non-asthma respiratory conditions
To assess the specificity of our panel, we next sought to determine if it would misclassify as asthma other respiratory conditions with symptoms that overlap with asthma. To this end, we evaluated the performance of the asthma gene panel on nasal gene expression data derived from case-control cohorts with allergic rhinitis (GSE43523) [36], upper respiratory infection (GSE46171) [31], cystic fibrosis (GSE40445) [37], and smoking (GSE8987) [12]. Supplementary Table 4 details the characteristics for these external cohorts with non-asthma respiratory conditions. In three of these five non-asthma cohorts (Allergic Rhinitis, Cystic Fibrosis and Smoking), the panel appropriately produced one-sided classifications, i.e., samples were all appropriately classified as “no asthma.” This is shown by the zero F-measure for the positive (asthma) class (Figure 5A) and perfect F-measure for the negative (no asthma) class (Figure 5C) obtained by the panel in these cohorts. In other words, the precision for the asthma class (PPV) of our panel was exactly and appropriately zero (Figure 5B), and NPV was perfectly 1.00 for these cohorts with non-asthma conditions (Figure 5D). The URI day 2 and day 6 cohorts deviated slightly from these trends: the panel achieved perfect NPVs of 1.00 (Figure 5D), but a marginally lower F-measure for the “no asthma” class (Figure 5C) due to slightly lower than perfect sensitivity. This may have been influenced by common inflammatory pathways underlying early viral inflammation and asthma [38]. Nonetheless, consistent with the other non-asthma test sets, the panel’s misclassification of URI as asthma was rare and substantially less than its random counterpart classifiers (Supplementary Figure 7).
To assess the asthma gene panel’s performance if presented with a large, heterogeneous collection of non-asthma respiratory conditions reflective of real clinical settings, we aggregated the non-asthma cohorts into a “Combined non-asthma” test set and applied the asthma gene panel. The results included an appropriately zero F-measure for asthma and zero PPV, and F-measure 0.97 for no asthma and NPV 1.00 (Figure 5, rightmost bars). Results from the individual and combined non-asthma test sets collectively support that the asthma gene panel would rarely misclassify other respiratory diseases as asthma.
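As a rough consistency check on these figures (a back-of-envelope calculation from the rounded values reported above, not an additional result): with precision for the no-asthma class equal to 1.00, the F-measure for that class depends only on its recall r, so the reported F-measure of 0.97 implies

```latex
F_{\text{no asthma}} = \frac{2 \cdot 1 \cdot r}{1 + r}
\quad\Longrightarrow\quad
r = \frac{F_{\text{no asthma}}}{2 - F_{\text{no asthma}}} = \frac{0.97}{2 - 0.97} \approx 0.94
```

Under this reading, roughly 94% of the combined non-asthma samples were labeled "no asthma", which is in line with the rare misclassification described above.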
Statistical and Pathway Examination of Genes in the Asthma Gene Panel
An interesting question for a disease classification panel is how its predictive ability relates to the individual differential expression status of its constituent genes. We found that 46 of the 90 genes included in our panel were differentially expressed (FDR ≤ 0.05), with 22 and 24 genes over- and under-expressed in asthma, respectively (Figure 6, Supplementary Table 1). More generally, the genes in our panel had lower differential expression FDR values than other genes (Kolmogorov-Smirnov statistic=0.289, P value=2.73 × 10⁻³⁷) (Supplementary Figure 8).
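The two-sample Kolmogorov-Smirnov statistic used for this comparison is the largest gap between the empirical cumulative distributions of the two groups. The sketch below computes only the statistic, not the P value, and the FDR values in it are invented for illustration; they are not taken from Supplementary Table 1.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: maximum absolute difference between empirical CDFs."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # The maximum gap is always attained at one of the observed values.
    for v in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, v) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, v) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Invented FDR values: panel genes skew toward smaller FDRs than the remaining genes.
panel_fdr = [0.001, 0.01, 0.02, 0.04, 0.30]
other_fdr = [0.03, 0.20, 0.40, 0.60, 0.80, 0.90]

d_stat = ks_statistic(panel_fdr, other_fdr)
```

A larger statistic means the two FDR distributions are further apart; in practice a library routine that also returns a P value would normally be used.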
In terms of biological function, pathway enrichment analysis of our panel’s 90 genes, though statistically limited by the small number of genes, yielded enrichment for pathways including defense response (fold change=2.86, FDR=0.006) and response to external stimulus (fold change=2.50, FDR=0.012). Only four (C3, DEFB1, CYFIP2 and GSTT1) of the 90 genes are known asthma genes and are functionally involved in complement activation, microbicidal activity, T-cell differentiation, and oxidative stress, respectively [28]. These results suggest that our machine learning pipeline was able to extract information beyond individually differentially expressed or previously known asthma genes, allowing for the identification of a parsimonious panel of genes that collectively enabled accurate asthma classification.
Discussion
We identified a panel of genes expressed in nasal brushings that accurately classifies subjects with mild/moderate asthma from controls. This nasal brush-based panel, consisting of the expression profiles of 90 genes interpreted via a logistic regression classification model, performed with high precision (PPV=1.00 and NPV=0.96) and recall for classifying asthma (AUC=0.994). The performance of the asthma gene panel across independent asthma test sets demonstrates the generalizability of the panel across study populations and two major modalities of gene expression profiling (RNAseq and microarray). Additionally, the panel’s low to zero rate of misclassification on external cohorts with non-asthma respiratory conditions supported the specificity of this panel.
Our nasal brush-based asthma gene panel is based on the common biology of the upper and lower airway, a concept supported by clinical practice and previous findings [12-15]. Clinically, we rely on the united airway by screening for lower airway infections (e.g. influenza, methicillin-resistant Staphylococcus aureus) with nasal swabs [39]. Sridhar et al. found that gene expression consequences of tobacco smoking in bronchial epithelial cells were reflected in nasal epithelium [12]. Wagener et al. compared gene expression in the nasal and bronchial epithelia from 17 subjects, finding that 99% of the 33,000 genes tested exhibited no differential expression between the nasal and bronchial epithelia in those with airway disease [13]. In a study of 30 children, Guajardo et al. identified gene clusters with differential expression in nasal epithelium between subjects with exacerbated asthma vs. controls [14]. The above studies were done with small sample sizes and microarray technology. More recently, Poole et al. compared RNAseq profiles of nasal brushings from 10 asthmatic and 10 control subjects to publicly available bronchial transcriptional data, finding correlation (ρ=0.87) between nasal and bronchial transcripts, as well as correlation (ρ=0.77) between nasal differential expression and previously observed bronchial differential expression in asthmatics [15]. To our knowledge, our study has generated the largest nasal RNAseq data set in asthma to date and is the first to identify a nasal brush-based classifier of asthma.
Although based on only 90 genes, our asthma gene panel classified asthma with greater accuracy than models based on all genes, all differentially expressed genes, and known asthma genes (Figure 4). Its superior performance supports that our machine learning pipeline successfully selected a parsimonious set of informative genes that (1) captures more actionable knowledge than traditional differential expression and genetic association analyses, and (2) cuts through the potential noise of genes irrelevant to asthma. These results show that data-driven methods can build more effective classifiers than those built exclusively on current domain knowledge. About half the genes in our asthma gene panel were not differentially expressed at FDR ≤ 0.05, and as such would not have been examined with greater interest had we only performed traditional differential expression analysis, which is the main analytic approach of virtually all studies of gene expression in asthma [12-15, 40, 41]. Consistent with basic hypotheses underlying systems biology approaches, our study demonstrated that the asthma gene panel captures signal from differentially expressed genes as well as from genes below traditional significance thresholds that may still contribute to asthma classification. Only four of the 90 genes (complement component 3 (C3), defensin beta-1 (DEFB1), cytoplasmic FMR1 interacting protein 2 (CYFIP2) and glutathione S-transferase theta 1 (GSTT1)) were previously identified to be relevant to asthma by genetic association studies [28].
Our asthma gene panel has the potential to be developed into a minimally invasive biomarker to aid asthma diagnosis at clinical frontlines, where time and resources often preclude pulmonary function testing (PFT). Nasal brushing can be performed quickly, does not require machinery for collection, and implementation of our classification model yields a straightforward, binary result of asthma or no asthma. According to the Global Initiative for Asthma and US National Heart Lung Blood Institute, the diagnosis of asthma should be based on a history of typical symptoms and objective findings of variable expiratory airflow limitation by PFT [6, 7]. Practically, however, objective measures are often not obtained. Patients with mild/moderate asthma are frequently asymptomatic at the time of exam. PFTs are often not done, with one study showing that over half of 465,866 patients over age 7 years with newly diagnosed asthma had no PFTs performed within a 3.5-year window surrounding diagnosis [8]. Clinicians defer PFTs due to lack of equipment, time, and/or expertise to perform and interpret results [8, 9]. Diagnosing asthma based on history alone contributes to its under-diagnosis, as patients with asthma under-perceive and under-report their symptoms [11]. Misdiagnosis of asthma also occurs frequently given overlapping symptoms between asthma and other conditions [42]. Even if PFTs are obtained, spirometric abnormalities in mild/moderate asthmatics are not always present. An objective, accurate diagnostic classifier that is easy to obtain and interpret with minimal effort from the provider and patient could improve asthma diagnostic accuracy so that appropriate management can then be pursued.
Implementation of the asthma gene panel could involve clinicians brushing a patient’s nose, placing the brush in a prepackaged tube, and submitting the sample for gene expression profiling targeted to the panel. Some platforms allow for direct transcriptional profiling of tissue without an RNA isolation step, avoiding inconveniences associated with direct RNA work [43, 44] and yielding comparable results to RNAseq [45]. Bioinformatic interpretation of the output via the logistic regression-based classifier and classification threshold check could be automated, resulting in a determination of asthma or no asthma for the clinician to consider. Gene expression-based diagnostic classifiers are being successfully used in other disease areas, with prominent examples including the commercially available MammaPrint [46] and Oncotype DX [47] for diagnosing/predicting breast cancer phenotypes. These examples from the cancer field demonstrate an existing path for moving a diagnostic gene panel such as ours to clinical use.
Because nasal brushing takes seconds, an asthma gene panel such as ours may be attractive to time-strapped clinicians, particularly primary care providers at the frontlines of asthma diagnosis. Asthma is frequently diagnosed and treated in the primary care setting [48], where access to PFTs is often not immediately available. Although PFTs yield results without specimen handling, this advantage does not seem to overcome their logistical limitations, as evidenced by their low rate of real-life implementation [8, 9]. The direct costs of our panel are likely to be slightly higher than those of PFTs. Targeted profiling of our 90-gene panel currently costs about $100 per sample, while PFTs cost about $80 according to the Medicare Physician Fee Schedule [49]. However, gene expression profiling costs are likely to decrease [50], and implementation of the asthma gene panel could result in cost savings if it reduces the under-diagnosis and misdiagnosis of asthma [4]. Undiagnosed asthma leads to costly healthcare utilization worldwide [4], including in the United States, where asthma accounts for $56 billion in medical costs, lost school and work days, and early deaths [51]. Clinical implementation of our asthma gene panel could identify undiagnosed asthma, leading to its appropriate management before high healthcare costs from unrecognized asthma are incurred. Given the panel’s demonstrated specificity, use of our asthma gene panel could also reduce asthma misdiagnosis by correctly providing a determination of “no asthma” in non-asthmatic subjects with conditions often confused with asthma. Clinical benefit from gene-expression based classification has already been seen in the breast cancer field, where use of the 70-gene panel test MammaPrint to guide chemotherapy in a clinical trial led to a lower 5-year rate of survival without metastasis compared to standard management [46].
We recognize that our asthma gene panel did not perform quite as well in the microarray-based asthma test sets as in the RNAseq-based test set, which was to be expected given differences in study design and technological factors between RNAseq and microarray profiling. First, the baseline characteristics and phenotyping of the subjects differed. Subjects in the RNAseq test set were adults who were classified as mild/moderate asthmatic or healthy using the same strict criteria as the development set, which required subjects with asthma to have an objective measure of obstructive airway disease (i.e. positive methacholine challenge response). In contrast, subjects in the Asthma1 microarray test set were all children with nasal pathology, as entry criteria included dust mite allergic rhinitis specifically [30] (Supplementary Table 3). Subjects from the Asthma2 cohort were adults who were classified as having asthma or healthy based on history. As mentioned, the diagnosis of asthma based on history alone without objective lung function testing can be inaccurate [52]. The phenotypic differences between these test sets alone could explain differences in performance of our asthma gene panel in these test sets. Second, the differential performance may be due to the difference in profiling approach. Gene mappings do not perfectly correspond between RNAseq and microarray due to disparities between array annotations and RNAseq gene models [33]. Compared to microarrays, RNAseq quantifies more RNA species and captures a wider range of signal [40]. Prior studies have shown that microarray-derived models can reliably predict phenotypes based on samples’ RNAseq profiles, but the converse does not often hold [33]. Despite the above limitations, our asthma gene panel performed with reasonable accuracy in classifying asthma in these independent microarray-based test sets. These results support a degree of generalizability of our panel to asthma populations that may be phenotyped or profiled differently.
An effective clinical classifier should have good positive and negative predictive value [53]. In our case, if an individual has asthma, the ideal classifier would reliably indicate asthma so that an accurate diagnosis is made, and if an individual does not have asthma, the ideal classifier would indicate “no asthma” so that misdiagnosis does not occur. This was indeed the case with our asthma gene panel, which achieved high positive and negative predictive values of 1.00 and 0.96, respectively, in the RNAseq test set. We also tested our asthma gene panel on independent test sets of subjects with allergic rhinitis, upper respiratory infection, cystic fibrosis, and smoking, and showed that the panel had a low to zero rate of misclassifying other respiratory conditions as asthma (Figure 5). These results were particularly notable for allergic rhinitis, a predominantly nasal condition. Although our panel is based on nasal gene expression, and asthma and allergic rhinitis frequently co-occur [22], our panel did not misdiagnose allergic rhinitis as asthma. Although these conclusions are based on relatively small validation sets due to the scarcity of nasal gene expression data in the public domain, the strong performance of our panel gives hope that it will be generalizable and specific in other larger cohorts as well.
One current limitation of RNAseq is the cost of processing large numbers of samples and generating large data sets. Although we have generated one of the largest nasal RNAseq data sets in asthma to date, a future direction of this study is to recruit additional cohorts for nasal gene expression profiling and extend validation of our findings in a prospective manner, which will aid in the panel’s path to clinical translation. This will also be facilitated by the rapidly falling costs of sequencing technologies [50], especially if done in a targeted manner. We recognize that our development set was from a single center and its baseline characteristics do not characterize all populations. For example, the development set consisted of adults, and our control subjects were largely Caucasian. However, variancePartition analysis demonstrated minimal contribution of age, race, and sex to gene expression variance in our data (Supplementary Figure 1). We also find it reassuring that the panel performed reasonably well in multiple external data sets spanning children and adults of varied racial distributions, and with asthma and other respiratory conditions defined by heterogeneous criteria. Subjects with asthma in our development cohort were not all symptomatic at the time of sampling. The fact that the performance of our asthma gene panel does not rely on symptomatic asthma is a strength, as many mild/moderate asthmatics are only sporadically symptomatic given the fluctuating nature of the disease.
We see our diagnostic nasal brush-based classifier of asthma as the first step in the development of nasal biomarkers for multiple aspects of asthma care. As with any disease, the first step is to accurately identify affected patients. The asthma gene panel described in this study provides an accurate path to this critical diagnostic step. With a correct diagnosis, an array of existing asthma treatment options can be considered [6]. A next phase of research will be to develop a nasal biomarker to predict endotypes and treatment response, so that asthma treatment can be targeted, and even personalized, with greater efficiency and effectiveness [54].
In summary, we applied RNA sequencing and machine learning to identify a panel of genes expressed in nasal brushings that accurately distinguishes subjects with mild/moderate asthma from controls. This panel performed with accuracy across independent and external test sets, indicating reasonable generalizability across study populations and gene expression profiling modalities, as well as specificity to asthma. Our asthma gene panel has the potential to be developed into a clinical biomarker to aid in asthma diagnosis, as it could be quickly obtained by simple nasal brush, does not require machinery for collection, and can be easily interpreted. Technical translation of the panel into the clinical environment, as well as prospective trials of its clinical effectiveness as a diagnostic asthma biomarker, are needed next. If further developed and applied to clinical practice, this nasal brush-based asthma gene panel could improve asthma detection and care.
Materials and Methods
Study design and subjects
Subjects with mild / moderate asthma were a subset of participants of the Childhood Asthma Management Program (CAMP), a multicenter North American study of 1041 subjects with mild to moderate persistent asthma [20, 21]. Findings from the CAMP cohort have defined current practice and guidelines for asthma care and research [21]. Asthma was defined by symptoms ≥ 2 times per week, use of an inhaled bronchodilator ≥ twice weekly or use of daily medication for asthma, and increased airway responsiveness to methacholine (PC20 ≤ 12.5 mg / ml). The subset of subjects included in this study were CAMP participants who presented for a visit between July 2011 and June 2012 at Brigham and Women’s Hospital (Boston, MA), one of the eight study centers for CAMP.
Subjects with “no asthma” were recruited during the same time period by advertisement at Brigham & Women’s Hospital. Selection criteria were no personal history of asthma, no family history of asthma in first-degree relatives, and self-described Caucasian ethnicity. Participation was limited to Caucasian individuals because a concurrent independent study was planned that would compare these same subjects to 968 Caucasian CAMP subjects who participated in the CAMP Genetics Ancillary study [55]. Subjects underwent pre-and post-bronchodilator spirometry according to American Thoracic Society guidelines. Only those meeting selection criteria and with demonstrated normal lung function without bronchodilator response were considered to have “no asthma.”
Nasal brushing and RNA sequencing
Nasal brushing was performed with a cytology brush. Brushes were immediately placed in RNALater (Thermo Fisher Scientific, Waltham, MA) and then stored at 4°C until RNA extraction. RNA extraction was performed with the Qiagen RNeasy Mini Kit (Valencia, CA). Samples were assessed for yield and quality using the 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA) and Qubit fluorometry (Thermo Fisher Scientific, Grand Island, NY).
Of the 190 subjects who underwent nasal brushing (66 with mild/moderate asthma, 124 with no asthma), 150 randomly selected subjects were assigned a priori to the development set (for classification model development), with the 40 remaining subjects earmarked to serve as a test set of independent subjects (for testing the classification model). To minimize potential batch effects, all samples were submitted together for RNA sequencing (RNAseq). Staff at the Mount Sinai genomics core were blinded to the assignment of samples as development or test set. The sequencing library was prepared with the standard TruSeq RNA Sample Prep Kit v2 protocol (Illumina). The mRNA libraries were sequenced on the Illumina HiSeq 2500 platform with a per-sample target of 40-50 million 100 bp paired-end reads. The data were put through Mount Sinai’s standard mapping pipeline [56] (using Bowtie [57] and TopHat [58]) and assembled into gene- and transcript-level summaries using Cufflinks [59]. Mapped data were subjected to quality control with FastQC and RNA-SeQC [60]. Data were pre-processed separately for the development and test sets to avoid leakage of information across the two data sets and maintain fairness of the machine learning procedures as much as possible. Genes with fewer than 100 counts in at least half the samples were dropped to reduce the potentially adverse effects of noise. DESeq2 [24] was used to normalize the data sets using its variance stabilizing transformation method.
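The count filter described above can be sketched in a few lines of Python. This is a minimal illustration rather than the study's code: the samples-by-genes matrix layout, the function name `filter_low_count_genes`, and the toy counts are assumptions, and the DESeq2 variance stabilizing transformation is not reproduced here.

```python
import numpy as np

def filter_low_count_genes(counts, min_count=100, low_fraction_cutoff=0.5):
    """Drop genes whose count is below min_count in at least half the samples.

    counts: 2-D array with rows = samples and columns = genes (assumed layout).
    Returns the filtered matrix and a boolean mask marking the genes kept.
    """
    counts = np.asarray(counts)
    # Per-gene fraction of samples in which the gene falls below the count floor.
    low_fraction = (counts < min_count).mean(axis=0)
    keep = low_fraction < low_fraction_cutoff
    return counts[:, keep], keep

# Toy matrix: 4 samples x 3 genes (made-up values, for illustration only).
toy_counts = np.array([
    [500,  10, 150],
    [300,  20,  90],
    [120, 400,  80],
    [ 90,  30, 200],
])
filtered, kept = filter_low_count_genes(toy_counts)
print(kept)  # mask of retained genes
```

In this toy case only the first gene is retained: the second is low in three of four samples and the third is low in exactly half of them.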
VariancePartition Analysis of Potential Confounders
Given differences in age, race, and sex distributions between the asthma and “no asthma” classes, we used the variancePartition method [23] to assess the degree to which these variables influenced gene expression and potentially confounded the target phenotype (asthma status). The total variance in gene expression was partitioned into the variance attributable to age, race, and sex using a linear mixed model implemented in variancePartition v1.0.0 [23]. Age (continuous variable) was modeled as a fixed effect while race and sex (categorical variables) were modeled as random effects. The results showed that age, race, and sex accounted for minimal contributions to total gene expression variance (Supplementary Figure 1). Downstream analyses were therefore performed with gene expression data unadjusted for these variables.
Differential gene expression and pathway enrichment analysis
DESeq2 [24] was used to identify differentially expressed genes in the development set. Genes with FDR ≤ 0.05 were deemed differentially expressed, with fold change <1 implying under-expression and vice versa. To identify the functions underlying these genes, pathway enrichment analysis was performed using the Gene Set Enrichment Analysis method applied to the Molecular Signature Database (MSigDB) [25].
Identification of the Asthma Gene Panel by Machine Learning Analyses of the RNAseq Development Set
To identify gene expression-based classifiers that predict asthma status, we applied a rigorous machine learning pipeline implemented in Python using the scikit-learn package [61] that combined feature (gene) selection [16], classification [17], and statistical analyses of classification performance [18] to the development set (Supplementary Figure 2). Feature selection and classification were applied to a training set comprised of 120 randomly selected samples from the development set (n=150) as described below. For an independent evaluation of the candidate classifiers generated from the training set by this process, they were then evaluated on the remaining 30 samples (holdout set). Finally, to reduce the dependence of the finally chosen classifier on a specific training-holdout split, this process was repeated 100 times on 100 random splits of the development set into training and holdout sets. The details of the overall process as well as the individual components are as follows.
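The repeated training-holdout scheme can be illustrated as follows. This is a sketch under stated assumptions: the sizes (150 development samples, 120 for training, 100 repetitions) come from the text, while the generator name `training_holdout_splits`, the seed, and the use of plain unstratified random permutations are illustrative choices.

```python
import numpy as np

def training_holdout_splits(n_samples=150, n_train=120, n_splits=100, seed=0):
    """Yield (training, holdout) index arrays for repeated random splits."""
    rng = np.random.default_rng(seed)
    for _ in range(n_splits):
        order = rng.permutation(n_samples)
        # First n_train shuffled indices form the training set, the rest the holdout set.
        yield order[:n_train], order[n_train:]

splits = list(training_holdout_splits())
train_idx, holdout_idx = splits[0]
print(len(splits), len(train_idx), len(holdout_idx))
```

Each split would then be passed through feature selection and classification independently, so that no single partition of the development set determines the chosen classifier.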
Feature selection
The purpose of the feature selection component was to identify subsets of the full set of genes in the development set, whose expression profiles could be used to predict the asthma status as accurately as possible. The two main computations constituting this component were (i) the optimal number of features that should be selected, and (ii) the identification of this number of genes from the full gene set. To reduce the likelihood of overfitting when conducting both of these computations on the entire training set, we used a 5x5 nested (outer and inner) cross-validation (CV) setup [26] for selecting features from the training set (Supplementary Figure 3). The inner CV round was used to determine the optimal number of genes to be selected, and the outer CV round was used to select the set of predictive genes based on this number, thus separating the samples on which these decisions are made. The supervised Recursive Feature Elimination (RFE) algorithm [62] was executed on the inner CV training split to determine the optimal number of features. The use of RFE within this setting enabled us to identify groups of features that are collectively, but not necessarily individually, predictive. This reflects our systems biology-based expectation that many genes, even ones with marginal effects, can play a role in classifying diseases/phenotypes (here asthma) in combination with other more strongly predictive genes [19]. Specifically, we used the L2-regularized Logistic Regression (LR or Logistic) [63] and SVM-Linear (kernel) [64] classification algorithms in conjunction with RFE (combinations henceforth referred to as LR-RFE and SVM-RFE, respectively). For a given inner CV training split, all the features (genes) were ranked using the absolute values of the weights assigned to them by an inner classification model, trained using the LR or SVM algorithm, over this split.
Next, for each of these combinations, the set of top-k ranked features, with k starting with 11587 (all filtered genes) and being reduced by 10% in each iteration until k=1, was considered. The discriminative strength of feature sets consisting of the top k features as per this ranking was assessed by evaluating the performance of the LR or SVM classifier based on them over all the inner CV training-test splits. The optimal number of features to be selected was determined as the value of k that produced the best performance. Next, a ranking of features was derived from the outer CV training split using exactly the same procedure as applied to the inner CV training split. The optimal number of features determined above was selected from the top of this ranking to determine the optimal set of predictive features for this outer CV training split. Executing this process over all five outer CV training splits created from the training set identified five such sets. Finally, the set of features (genes) that was common to all these sets (i.e. in their intersection/overlap), which is expected to yield a more robust feature set than the individual outer CV splits, was selected as the predictive gene set for this training set. One such set was identified for each of LR-RFE and SVM-RFE.
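The elimination schedule at the core of LR-RFE can be sketched as below. This is a simplified illustration under stated assumptions, not the study's implementation: it refits the model at every round in the usual RFE manner, omits the nested CV used to choose k and the intersection across outer folds, and uses a hypothetical function name (`rfe_elimination_path`) with synthetic data. The loop is written by hand so that each round removes about 10% of the features that currently remain.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def rfe_elimination_path(X, y, shrink=0.10):
    """Recursively drop the lowest-|weight| features, about 10% per round.

    Returns a list of feature-index arrays, from all features down to one.
    """
    active = np.arange(X.shape[1])
    path = [active]
    while active.size > 1:
        # L2 regularization is scikit-learn's default for LogisticRegression.
        model = LogisticRegression(max_iter=1000).fit(X[:, active], y)
        order = np.argsort(-np.abs(model.coef_[0]))  # largest |weight| first
        n_keep = min(active.size - 1, max(1, int(active.size * (1 - shrink))))
        active = active[order[:n_keep]]
        path.append(active)
    return path

# Synthetic data: 40 samples x 20 features, with feature 0 made informative.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 20))
y_toy = np.repeat([0, 1], 20)
X_toy[y_toy == 1, 0] += 2.0

path = rfe_elimination_path(X_toy, y_toy)
sizes = [len(p) for p in path]
print(sizes)
```

In the full procedure, each feature-set size in this path would be scored across the inner CV splits, and the best-scoring size would then be applied to the ranking from the outer CV training split.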
Classification analyses
Once predictive gene sets had been selected from feature selection, four global classification algorithms (L2-regularized Logistic Regression (LR or Logistic) [63], SVM-Linear [64], AdaBoost [65], and Random Forest (RF) [66]) were used to learn intermediate classification models over the training set. These intermediate models were then applied to the corresponding holdout set to generate probabilistic asthma predictions for the samples. An optimal threshold for converting these probabilistic predictions into binary ones (higher than threshold=asthma, lower than threshold=no asthma) was then computed as the threshold that yielded the highest classification performance on the holdout set. This optimization resulted in the proposed classification models.
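The threshold optimization step can be sketched as a simple grid search over candidate cut-offs. This is a minimal illustration under assumptions: the 0.01 grid, the function names, and the toy labels and probabilities are hypothetical, and the F-measure is computed for the asthma (positive) class.

```python
import numpy as np

def f_measure(y_true, y_pred):
    """F-measure of the positive (asthma = 1) class."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return 0.0 if tp == 0 else float(2 * tp / (2 * tp + fp + fn))

def best_threshold(y_true, probs, grid=None):
    """Return the threshold (and its score) that maximizes the F-measure."""
    grid = np.linspace(0.0, 1.0, 101) if grid is None else grid
    # Probability above the threshold is called asthma, otherwise no asthma.
    scores = [f_measure(y_true, (probs > t).astype(int)) for t in grid]
    best = int(np.argmax(scores))
    return float(grid[best]), scores[best]

# Hypothetical holdout labels and predicted asthma probabilities.
y_holdout = np.array([0, 0, 1, 1, 0, 1])
p_holdout = np.array([0.125, 0.435, 0.375, 0.815, 0.225, 0.935])
threshold, score = best_threshold(y_holdout, p_holdout)
print(threshold, round(score, 3))
```

When several thresholds tie, `np.argmax` returns the lowest one; a different tie-breaking rule could equally be chosen.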
Statistical analyses of classification performance
After the above components had been run on 100 training-holdout splits of the development set, we obtained 100 proposed classification models for each of eight feature selection-global classification combinations (two feature selection algorithms, LR-RFE and SVM-RFE, and four global classification algorithms, Logistic, SVM-Linear, AdaBoost and RF). The next step of our pipeline was to determine the best performing combination. Instead of making this determination just based on the highest evaluation score, as is typically done in ML studies, we utilized this large population of models and their optimized holdout evaluation scores to conduct a statistical comparison to make this determination. Specifically, we applied the Friedman test followed by the Nemenyi test [18, 67] to this population of models and their evaluation scores. These tests, which account for multiple hypothesis testing, assessed the statistical significance of the relative difference of performance of the combinations in terms of their relative ranks across the 100 splits.
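The rank-based comparison can be sketched as follows. This is an illustration under assumptions, not the study's code: the scores are simulated, the function name `friedman_nemenyi` is hypothetical, and the Nemenyi step is expressed through the critical difference CD = q_alpha * sqrt(k(k+1)/(6N)), with q_alpha taken from the standard table of studentized-range-based critical values at alpha = 0.05.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Critical values q_alpha (alpha = 0.05) for the two-tailed Nemenyi test, by number of methods k.
Q_ALPHA_05 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850, 7: 2.949, 8: 3.031}

def friedman_nemenyi(scores):
    """scores: array of shape (n_splits, n_methods); higher scores are better."""
    n, k = scores.shape
    _, p_value = friedmanchisquare(*scores.T)
    # Rank methods within each split (rank 1 = best), then average over splits.
    ranks = np.vstack([rankdata(-row) for row in scores])
    avg_ranks = ranks.mean(axis=0)
    critical_difference = Q_ALPHA_05[k] * np.sqrt(k * (k + 1) / (6.0 * n))
    return p_value, avg_ranks, critical_difference

# Simulated scores: 100 splits x 8 combinations, with combination 0 made clearly better.
rng = np.random.default_rng(0)
sim_scores = rng.normal(0.7, 0.05, size=(100, 8))
sim_scores[:, 0] += 0.3

p_value, avg_ranks, cd = friedman_nemenyi(sim_scores)
print(p_value, avg_ranks.round(2), round(cd, 3))
```

Two combinations whose average ranks differ by more than the critical difference are considered significantly different, which is the relationship a Critical Difference plot displays.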
Optimization for parsimony
For an effective phenotype classifier, it is essential to consider parsimony in model selection (i.e. to minimize the number of features, here genes) to enhance its biological and clinical utility and acceptability. To enforce this for our classifier, an adapted performance measure, defined as the absolute performance measure (F-measure) divided by the number of genes in that model, was used for the above statistical comparison, i.e. as input to the Friedman-Nemenyi tests. In terms of this measure, a model that does not obtain the best performance measure among all models, but uses far fewer genes than the others, may be judged to be the best model. The result of the statistical comparison using this adapted measure was visualized as a Critical Difference plot [18] (Supplementary Figure 4), and enabled us to identify the best combination of feature selection and classification method as the left-most entry in this plot.
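The adapted measure amounts to a single division, shown here with two hypothetical models. The F-measures and gene counts below are made-up values chosen only to show how a smaller model can win despite a slightly lower F-measure.

```python
def parsimony_adjusted(f_measure, n_genes):
    """Adapted performance measure: F-measure divided by the number of genes."""
    if n_genes <= 0:
        raise ValueError("a model must use at least one gene")
    return f_measure / n_genes

# Hypothetical candidates: (F-measure, number of genes).
candidates = {"model_A": (0.98, 275), "model_B": (0.95, 90)}
adjusted = {name: parsimony_adjusted(f, n) for name, (f, n) in candidates.items()}
best = max(adjusted, key=adjusted.get)
print(adjusted, best)
```

Here model_B is preferred because its F-measure per gene is roughly three times that of model_A.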
Final model development
The final step in our pipeline was to determine the representative model out of the 100 learned by the above best combination, by finding which of these models yielded the highest evaluation measure (F-measure). In case of ties among multiple candidates, the gene set that produced the best average asthma classification F-measure (Box 1 and Supplementary Figure 5) across all four global classification algorithms was chosen as the gene set constituting the representative model for that combination. This analysis yielded the representative gene set, global classification algorithm, and the optimized asthma classification threshold. Finally, our asthma gene panel was built by training the global classification algorithm on the expression profiles of the representative gene set, and using the optimized threshold for classifying samples with and without asthma.
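The final model construction can be sketched as a classifier restricted to a chosen gene set and paired with a fixed decision threshold. This is a minimal sketch under assumptions: the class name `ThresholdedPanel`, the gene indices, the threshold of 0.5, and the synthetic expression data are all hypothetical, and logistic regression is used because the abstract names it as the panel's classification model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class ThresholdedPanel:
    """Logistic regression on a fixed gene subset with a fixed probability cut-off."""

    def __init__(self, gene_idx, threshold):
        self.gene_idx = np.asarray(gene_idx)
        self.threshold = threshold
        self.model = LogisticRegression(max_iter=1000)  # L2 penalty by default

    def fit(self, X, y):
        self.model.fit(X[:, self.gene_idx], y)
        return self

    def predict(self, X):
        # Column 1 of predict_proba is the probability of the class labeled 1 (asthma).
        probs = self.model.predict_proba(X[:, self.gene_idx])[:, 1]
        return np.where(probs > self.threshold, "asthma", "no asthma")

# Synthetic expression matrix: 60 samples x 10 genes; genes 2 and 5 carry the signal.
rng = np.random.default_rng(0)
X_dev = rng.normal(size=(60, 10))
y_dev = np.repeat([0, 1], 30)
X_dev[y_dev == 1, 2] += 3.0
X_dev[y_dev == 1, 5] += 3.0

panel = ThresholdedPanel(gene_idx=[2, 5], threshold=0.5).fit(X_dev, y_dev)
calls = panel.predict(X_dev)
print(calls[:3], calls[-3:])
```

Keeping the gene set and the threshold as part of the fitted object means a new sample can be classified in one call, which is the kind of automated interpretation a clinical workflow would need.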
Validation of the Asthma Gene Panel in an RNAseq test set of independent subjects
The asthma gene panel identified by our machine learning pipeline was then tested on the RNAseq test set (n=40) to assess its performance in independent subjects. F-measure was used as the primary measure for classification performance, as described in Box 1 and Supplementary Figure 5. AUC, PPV and NPV were additionally calculated for context.
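The evaluation measures named above can be computed directly from predicted scores and true labels. The following sketch uses made-up labels and scores, not study data; the function name `summary_metrics` and the threshold are assumptions, and the AUC is computed as the fraction of asthma/no-asthma pairs in which the asthma sample receives the higher score (ties counted as one half).

```python
import numpy as np

def summary_metrics(y_true, scores, threshold):
    """Return PPV, NPV, F-measure (positive class) and AUC for binary labels."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    y_pred = (scores > threshold).astype(int)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    f = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    # Pairwise comparison of every positive score against every negative score.
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    auc = float(wins / (pos.size * neg.size))
    return {"PPV": ppv, "NPV": npv, "F": f, "AUC": auc}

# Made-up example: 3 asthma (1) and 4 no-asthma (0) samples.
y_example = [1, 1, 1, 0, 0, 0, 0]
s_example = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2, 0.1]
metrics = summary_metrics(y_example, s_example, threshold=0.45)
print(metrics)
```

PPV, NPV and F-measure depend on the chosen threshold, whereas AUC summarizes the ranking of samples across all thresholds.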
Performance Comparison to Alternative Classification Models
For comparison, the same machine learning methodology was used to train and evaluate models from all combinations of feature selection and global classification methods considered in our pipeline. We also applied our machine learning pipeline with replacement of the feature (gene) selection step with these pre-determined gene sets: (1) all filtered RNAseq genes, (2) all differentially expressed genes, and (3) known asthma genes from a recent review of asthma genetics [28]. To maintain consistency with the machine learning pipeline-derived models, these were each used as a predetermined gene set that was run through the same pipeline (Supplementary Figure 2 with the feature selection component turned off) to identify the best performing global classification algorithm and the optimal asthma classification threshold for this predetermined set of features. The algorithm and threshold were used to train each of these gene sets’ representative classification model over the entire development set, and the resulting model for each of these gene sets was then evaluated on the RNAseq test set. Finally, as a baseline representative of alternative sparse classification algorithms, which represent a one-step option for doing feature selection and classification simultaneously, we also trained an L1-regularized logistic regression model (L1-Logistic) [29] on the development set and evaluated it on the RNAseq test set.
Performance Comparison to Permutation-based Random Models
To determine the extent to which the performance of all the above classification models could have been due to chance, we compared their performance with that of their random counterpart models (Supplementary Figure 6, Supplementary Figure 7). These counterparts were obtained by randomly permuting the labels of the samples in the development set and executing each of the above model training procedures on these randomized data sets in the same way as for the real development set. These random models were then applied to each of the test sets considered in our study, and their performances were also evaluated in terms of the same measures. For each of the real models tested in our study, 100 corresponding random models were learned and evaluated as above, and the performance of the real models was compared with the average performance of the corresponding random models.
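The permutation comparison can be sketched generically. This is an illustration under assumptions: a simple nearest-centroid classifier scored by accuracy stands in for the study's full training procedure and F-measure, and the function names and synthetic data are hypothetical. Only the structure follows the text: permute the development labels, retrain, evaluate on the test set, and average over 100 random models.

```python
import numpy as np

def centroid_accuracy(X_train, y_train, X_test, y_test):
    """Stand-in model: assign each test sample to the nearer class centroid."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = np.linalg.norm(X_test - c0, axis=1)
    d1 = np.linalg.norm(X_test - c1, axis=1)
    return float(np.mean((d1 < d0).astype(int) == y_test))

def permutation_baseline(fit_and_score, X_dev, y_dev, X_test, y_test, n_perm=100, seed=0):
    """Compare the real model's score with the mean score of label-permuted models."""
    rng = np.random.default_rng(seed)
    real = fit_and_score(X_dev, y_dev, X_test, y_test)
    # Only the development labels are shuffled; the test labels stay intact.
    random_scores = [fit_and_score(X_dev, rng.permutation(y_dev), X_test, y_test)
                     for _ in range(n_perm)]
    return real, float(np.mean(random_scores))

# Synthetic data with a strong class signal in every feature.
rng = np.random.default_rng(1)
y_dev, y_test = np.repeat([0, 1], 30), np.repeat([0, 1], 10)
X_dev = rng.normal(size=(60, 5)) + 3.0 * y_dev[:, None]
X_test = rng.normal(size=(20, 5)) + 3.0 * y_test[:, None]

real_score, random_mean = permutation_baseline(centroid_accuracy, X_dev, y_dev, X_test, y_test)
print(real_score, random_mean)
```

A real model whose score clearly exceeds the average of its permuted counterparts is unlikely to owe its performance to chance alone.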
Validation of the asthma gene panel in external independent asthma cohorts
To assess the generalizability of the asthma gene panel to other populations, microarray-profiled data sets of nasal gene expression from two external asthma cohorts, Asthma1 (GSE19187) [30] and Asthma2 (GSE46171) [31] (Supplementary Table 3), were obtained from NCBI Gene Expression Omnibus (GEO) [68]. The asthma gene panel was then applied and its performance evaluated on these external asthma cohorts.
Validation of the asthma gene panel in external cohorts with other respiratory conditions
To assess the panel’s ability to distinguish asthma from respiratory conditions that can have overlapping symptoms with asthma, i.e. its specificity to asthma, microarray-profiled data sets of nasal gene expression were also obtained for five external cohorts with allergic rhinitis (GSE43523) [36], upper respiratory infection (GSE46171) [31], cystic fibrosis (GSE40445) [37], and smoking (GSE8987) [12] (Supplementary Table 4). The asthma gene panel was then applied and its performance evaluated on these external cohorts with non-asthma respiratory conditions.
Data availability
Data and code for this study (doi:10.7303/syn9878922) are available via Synapse at https://www.synapse.org/#!Synapse:syn9878922/files/
Declarations
Ethics approval and consent to participate
The institutional review boards of Brigham & Women’s Hospital and the Icahn School of Medicine at Mount Sinai approved the study protocols. Written informed consent was obtained from all subjects.
Funding
This study was supported by the US National Institutes of Health (NIH R01AI118833, K08AI093538, R01GM114434) and the Icahn Institute for Genomics and Multiscale Biology, including computational resources provided by Scientific Computing at the Icahn School of Medicine at Mount Sinai.
Author contributions
SB directed the study. SB, BAR, and EES designed the study. SB and AJR directed the recruitment of subjects and sample collection. BAR and STW provided guidance for access to subjects. EES advised on sequencing strategy. SB curated the clinical data. SB, GP, and OPP designed and performed the statistical and computational analyses. SB and GP wrote the manuscript. SB, GP, OPP, AJR, GEH, BAR, STW, and EES edited the manuscript. All authors contributed significantly to the work presented in this paper.
Competing interests
SB, GP, and EES have filed a patent application related to the findings of this manuscript. The remaining authors declare that they have no competing interests.
Acknowledgments
We thank Kathryn Paul, Laura Ting, Anne Plunkett, Nancy Madden, Ann Fuhlbrigge, Kelan Tantisira, Dan Cossette, Aimee Garciano, and Roxanne Kelly for their assistance and support with recruitment, specimen collection, and sample processing. We thank Robert Griffin and Ana Stanescu for critically reviewing the paper.