Abstract
Single-cell RNA-seq (scRNA-seq) technologies have been broadly utilized to reveal the molecular mechanisms of respiratory diseases and physiology at single-cell resolution. Here, we constructed a cigarette smoking lung atlas by integrating data from 8 public datasets, including 104 lung scRNA-seq samples with patient state information. The cigarette smoking lung atlas generated by this single-cell meta-analysis (scMeta-analysis) revealed early carcinogenesis events and defined the alterations of single-cell gene expression, cell population, fundamental properties of biological pathways, and cell–cell interactions induced by cigarette smoking. In addition, we developed two novel scMeta-analysis methods incorporating clinical metadata: VARIED (Visualized Algorithms of Relationships In Expressional Diversity) and AGED (Aging-related Gene Expressional Differences). VARIED analysis revealed the expressional diversity associated with smoking carcinogenesis in each cell population. AGED analysis revealed differences in gene expression related to both aging and smoking states. Our scMeta-analysis provided new insights into the effects of smoking and into cellular diversity in the human lung at single-cell resolution.
Introduction
Smoking is the leading risk factor for early death, and its negative effects present individual and public health hazards (1, 2). Cigarette smoke is a mixture of thousands of chemical compounds generated from tobacco burning (3) that causes chronic airway inflammation, reactive oxygen species (ROS) production, and DNA damage. Specifically, it has been discovered that smoking injures the respiratory organs and cardiovascular system and causes carcinogenesis, chronic obstructive pulmonary disease (COPD), and atherosclerosis (4). In particular, the incidence of lung squamous carcinoma is significantly increased by cigarette smoking (5, 6).
Single-cell RNA-seq (scRNA-seq) technologies have been broadly utilized to reveal the molecular mechanisms of respiratory diseases and physiology at single-cell resolution. scRNA-seq in human lungs identified novel cell populations and cellular diversity (7–13). However, there are several concerns regarding scRNA-seq analysis. One of these concerns is sample size, that is, that clinical scRNA-seq analyses could be biased due to insufficient sample sizes. A possible solution is meta-analysis of scRNA-seq data. The recently developed single-cell meta-analysis (scMeta-analysis) method has been considered a powerful tool for large-scale analysis of integrated single-cell cohorts. The scMeta-analysis shows robust statistical significance and the capacity to compare the results among different studies at the single-cell level. In fact, integrated scMeta-analysis of a number of cohorts has revealed a previously unappreciated diversity of cell types and gene expression; for example, scMeta-analysis of lung endothelial cells, including human and mouse datasets, revealed novel endothelial cell populations (14–17). In addition, comparative analysis of scRNA-seq cohorts revealed pan-cancer tumor-specific myeloid lineages (18).
In this study, we integrated 8 publicly available datasets comprising 104 lung scRNA-seq samples and analyzed a total of 257,663 single cells to construct a cigarette smoking lung atlas. The scMeta-analysis of the cigarette smoking lung atlas defined single-cell gene expression according to smoking, age, and gender. In addition, we developed novel scMeta-analysis methods: VARIED (Visualized Algorithms of Relationships In Expressional Diversity) analysis and AGED (Aging-related Gene Expressional Differences) analysis with clinical metadata. VARIED analysis revealed the diversity of gene expression associated with cancer-related events in each cell population, and AGED analysis revealed the expressional differences in relation to both aging and smoking states.
Results
Integrated single-cell lung atlas with cigarette smoking
According to scRNA-seq collection criteria (see methods), we chose 8 publicly available datasets of lung scRNA-seq data to construct a cigarette smoking lung atlas (Figure 1A). To this end, we collected data from 374,658 single cells from 104 scRNA-seq samples (smoker: 55 samples, never-smoker: 49 samples, Figure 1A). In the process of quality control with Seurat in R, 116,995 low-quality single cells (nFeatures < 1,000 or mt.percent > 20%) were removed. Integration of the 8 datasets was performed by the Harmony algorithm with the smoking states of scRNA-seq samples (19) (Supplementary Figure S1A). Integrated single-cell transcriptome data were linked with clinical metadata such as smoking states, age, gender, and race (Supplementary Table S1, Supplementary Figure S1B). The cigarette smoking lung atlas is composed of a total of 257,663 single cells (Figure 1B). UMAP plots with cell type-specific markers (PTPRC as an immune marker, EPCAM as an epithelial marker, CLDN5 as an endothelial marker, and COL1A2 as a fibroblast marker) showed an obvious segregation of immune, epithelial, endothelial, and fibroblastic lineages (Figure 1C). The density plot showed that the majority of single cells in the atlas were immune cells and epithelial cells (Supplementary Figure S1C). There were 132,956 single cells in the smoker group and 124,707 single cells in the never-smoker group (Supplementary Figure S2A). Comparison of the atlases by smoking states revealed that most of the cell populations in the UMAP plot overlapped; however, parts of epithelial clusters were specific to the never-smoker group (Supplementary Figure S2A). To confirm that the integration of the 8 datasets reduced bias, we showed the atlas marked with the datasets (Figure 1D). All major clusters seemed to overlap among the 8 datasets (Supplementary Figure S2B), although the populations of cells were different in each dataset (Figure 1E). This difference in cell populations could be caused by differences in tissue collection and cell isolation processes.
In the atlas with all cell types (Figure 1B), we first identified the cell types present within the atlas according to the lung cell markers in the human lung scRNA-seq atlas (7) (Supplementary Figure S3). To investigate the cell types in further detail, we extracted subsets of “epithelia”, “fibroblasts”, “endothelia”, “lymphoids”, and “myeloids” and repeated the UMAP procedure with each subset, which comprised 44 subpopulations in total (Figure 2A, B). There were 14 epithelial cell types (smoker: 27,583 cells, never-smoker: 58,418 cells; Supplementary Figure S4), 7 fibroblastic cell types (smoker: 3,583 cells, never-smoker: 1,920 cells; Supplementary Figure S5), 7 endothelial cell types (smoker: 8,642 cells, never-smoker: 4,523 cells; Supplementary Figure S6), 8 lymphoid cell types (smoker: 27,804 cells, never-smoker: 12,174 cells; Supplementary Figure S7), and 8 myeloid cell types (smoker: 55,671 cells, never-smoker: 40,647 cells; Supplementary Figure S8).
Cigarette smoking is known to induce alterations in cell populations in the lungs. For example, the number of basal lineage cells decreased (20), and the number of basophils increased (21) in smoking lungs. The atlas showed differences in the numbers of 44 cell subpopulations by smoking states (Figure 2C). Evidently, the cell numbers of basal, basal-proximal (px), ionocyte, mucous, proliferating epithelia, and tracheal basal clusters significantly decreased. Previous bulk studies have reported that the number of bronchial epithelial cells is altered by smoking (9, 20, 22). Consistent with these reports, our data confirmed that smoking had a devastating effect on epithelial cells in the bronchus and bronchiole. On the other hand, the numbers of alveolar type 1 cells (AT1), alveolar fibroblasts, adventitial fibroblasts, B cells, CD4+ memory/effector T cells, CD8+ T cells, natural killer (NK) cells, NK T cells (NKT), and basophils significantly increased. Previously, the number of basophils infiltrating lung tissue has been reported to increase in COPD models, and basophils contribute to emphysema formation by cytokine production in the early phase of COPD (21). The atlas confirmed the increase in basophil cell number with smoking. We also examined the cell cycle in each cell cluster. The cell cycle indices in each subpopulation were not obviously changed between the smoking and never-smoking groups (Supplementary Figure S9A and B).
VARIED analysis visualized variations in epithelial populations by smoking states
Cigarette smoking is the leading risk factor for carcinogenesis of squamous carcinoma in the bronchi and trachea of the lung (2, 5). To comprehensively understand the effects of smoking in the lung, we developed VARIED (Visualized Algorithms of Relationships In Expressional Diversity) analysis to quantify the alteration in gene expressional diversity. VARIED analysis is based on the network centrality of a correlational network with graph theory in each single cell (23). The differences in the centrality between smokers and never-smokers represent the alteration of gene expressional diversity in each cell cluster (Figure 3A). VARIED analysis revealed greater diversity in epithelial clusters, suggesting that cigarette smoking primarily perturbed epithelial populations, particularly in the bronchi and trachea (Figure 3B and 3C). These data are consistent with the fact that epithelial cells located in the bronchi are considered to be the origin of lung squamous carcinoma (24). Interestingly, the diversity in basophils was also remarkably altered by cigarette smoking. To examine the molecular basis for diversity in gene expression, we extracted differentially expressed genes (DEGs) in the basal-px cluster between smokers and never-smokers, focusing on basal-px because this cluster was the most influenced by cigarette smoking (Figure 3B, Supplementary Table S3). Enrichment analysis of the DEGs revealed that cancer-related categories were significantly enriched in the smoker basal-px cluster (Figure 3D and E, Supplementary Table S4). The cigarette smoking lung atlas and VARIED analysis confirmed the early oncogenic events in bronchial and tracheal epithelial cells. Our data indicate that smoking adversely affects bronchial epithelial cells and alters gene expressional diversity in carcinogenesis.
Cigarette smoking affected GWAS-related genes in lung squamous carcinoma
As the cigarette smoking lung atlas provided high-resolution expression data in 44 cell types, we explored gene expression profiles from a genome-wide association study (GWAS) of lung squamous carcinoma with smoking (25). To identify the expressional patterns and the broad contributions of different lung cell types to squamous carcinoma susceptibility, the average expression levels of 92 GWAS genes were examined in all lung cell types (Supplementary Figure S10A). High expression of squamous carcinoma GWAS genes was observed in specific clusters, and cigarette smoking affected the expression of GWAS-related genes in some clusters. In particular, the expression of MUC1 was increased in the smoker epithelial clusters (Supplementary Figure S10B), and the expression of HLA-A was increased in the smoker myeloid clusters (Supplementary Figure S10C). Mutated MUC1 has oncogenic roles in carcinogenesis in the human lung (26, 27). Truncating mutations in HLA-A carry a risk of dysregulation of cancer-related pathways (28).
Gender differences in the cigarette smoking lung atlas
We also examined the effect of gender differences on gene expression at single-cell resolution in all epithelial clusters (Supplementary Figure S11A). As a first step, we analyzed the cell cycle distribution in males and females in the smoker group. The results showed almost no difference in cell cycle state between males and females; however, we found subtle differences. For example, the female basal-px cluster exhibited an increased S/G2M index ratio compared to the male basal-px cluster; in contrast, the tracheal basal-px cluster in males exhibited an increase in the S index ratio compared to that in females (Supplementary Figure S11B). Next, we performed pathway enrichment analysis to identify the differences in epithelial clusters between male and female smokers. Differences between males and females were detected; however, the gender-specific alterations were shared across the epithelial clusters rather than being specific to individual clusters (Supplementary Figure S11C).
Cancer-associated alterations induced by smoking
Given that cigarette smoking has a significant impact on carcinogenesis in bronchial epithelial cell clusters, we next focused on the alteration of cancer-associated fibroblasts (CAFs) and tumor endothelial cells (TECs). These types of cells are well known to contribute to tumor malignancy (29–31). We examined the expression of marker genes such as ACTA2, PDPN, and COL1A1 in CAFs and COL18A1, COL4A1, and COL4A2 in TECs by smoking states (Figure 4A and 4B). A typical CAF marker, ACTA2, was significantly induced in the adventitial fibroblast, alveolar fibroblast, and myofibroblast clusters in the smoker group (Figure 4A, top panel). Likewise, other CAF markers such as PDPN and COL1A1 were also significantly upregulated in the adventitial fibroblast, alveolar fibroblast, and myofibroblast clusters in the smoker group (Figure 4A, middle and bottom panels). Additionally, TEC markers such as COL18A1, COL4A1, and COL4A2 were increased in several endothelial cell clusters (Figure 4B). For further investigation of CAF marker expression, we divided the smoker adventitial fibroblast cluster into a high-ACTA2 group and a low-ACTA2 group and analyzed the DEGs between them (Supplementary Figure S12A). The DEGs analysis showed that collagen family and SPARC expression increased in the smoker high-ACTA2 group (Supplementary Figure S12B). Likewise, DEGs analysis was performed between an ANGPT2-high lymphatic group and an ANGPT2-low lymphatic group, and the results suggested that FABP4 was highly expressed in the ANGPT2-high group. FABP4 is a key regulator of tumor angiogenesis (32). These results suggested that transformation of cancer-associated stromal cells was induced in the early phase of carcinogenesis promoted by cigarette smoking.
Next, we performed module analysis with cancer-related gene sets, such as senescence, ROS production, IFN signaling, heme metabolism, and epithelial to mesenchymal transition (EMT) genes. The module analysis depicted the alteration of cancer-related events by smoking in each cluster (Figure 4C). Several modules were drastically altered between the smoker and never-smoker groups, such as IFN signaling in endothelial and myeloid clusters; EMT in epithelial, fibroblastic, and endothelial clusters; and mitophagy in lymphoid and myeloid clusters. Because increased expression of EMT module genes in endothelial clusters was observed, we examined the expression of endothelial to mesenchymal transition (EndMT) marker genes (FN1, POSTN, VIM) (17, 33). These EndMT markers were significantly upregulated, suggesting that smoking induced EndMT in some endothelial clusters (Figure 4D top, Supplementary Figure S12E). Autophagy in immune cells is important for cellular immunity, differentiation and survival (34). The autophagy module was especially increased in NKT cells from lymphoid clusters and some myeloid clusters (Figure 4C), suggesting that immune cells enhanced cellular immunity and IFN signaling in smoking lungs (Figure 4D middle). Cigarette smoking induced upregulation of transferrin and ferritin in epithelial, endothelial, and myeloid cells of the lung. The dysregulation of heme metabolism is linked with smoking-related respiratory diseases (35). The heme metabolism module increased in most epithelial, fibroblastic, and myeloid clusters and some endothelial cell clusters, such as veins and capillaries (Figure 4C, 4D bottom). Finally, increased senescence module scores were broadly observed across most cell types (Supplementary Figure S12F), suggesting that smoking induced aging in the lung. The module analysis of the cigarette smoking lung atlas evidently indicated what cell types were influenced by smoking and how smoking affected these cells in the lung.
Increased cell–cell interactions between epithelial cell clusters and lymphoid or myeloid clusters in smokers
From the module analysis, we observed increased IFN signaling throughout the lung cells. These data suggested that smoking produced chronic inflammation in the lung and prompted us to examine the interactions between epithelial and immune cells via inflammatory signaling. For this purpose, we performed cell–cell interaction (CCI) analysis using 7,200 interactions between interferon, interleukin, and chemokine family genes at single-cell resolution (Figure 5A). CXCL8 (interleukin 8: IL8) is produced by lymphocytes, endothelial cells, fibroblasts, and epithelial cells in the lung and has important roles in pulmonary diseases and cancers (36, 37). In epithelial-immune cell interactions, the CXCL8-interaction network was expanded by increasing the expression in club, goblet, and serous cells of the smoker groups (Figure 5B). The CCI networks between epithelial and lymphoid cell clusters showed increased epithelial to lymphoid cluster interactions in smokers compared to never-smokers (Figure 5C top). On the other hand, the lymphoid to epithelial cluster interactions showed smaller differences between groups (Figure 5C bottom, Supplementary Figure S13A left and S13B left). This result suggested that the epithelial to lymphoid interaction was mainly unidirectional, and it is consistent with the module analysis result that the IFN signaling module did not increase in the lymphoid clusters (Figure 4C). In contrast, epithelial–myeloid interactions (both “from epithelia to myeloid” and “from myeloid to epithelial”) were clearly enhanced in the smoker group compared to the never-smoker group (Supplementary Figure S13A-C). Therefore, cigarette smoking enhanced the mutual interaction between epithelial and myeloid cells via inflammatory signaling.
Aging-related gene expression in the cigarette smoking lung atlas
As the majority of the samples in the atlas had patient age information, we aimed to identify aging-related genes associated with cigarette smoking (Figure 6A). We developed AGED (Aging-related Gene Expressional Differences) analysis based on regression analysis with single-cell transcriptome data (see methods). Briefly, by using regression analysis with age and gene expression in the smoker and never-smoker groups, we calculated the differences in slopes (Δ) for all genes in 44 cell clusters (Figure 6B). For selected genes that were obviously changed with advancing age between the smoker and never-smoker groups, the Δ values were plotted as AGED results in a heatmap (Figure 6C). These data showed that the lung surfactant proteins SFTPC and SFTPB decreased in secretory epithelial clusters with advancing age in smokers (Figure 6C and 6D left). These lung surfactant proteins maintain the activation of alveolar macrophages and promote recovery from injuries induced by smoking (38). Additionally, secretoglobins (SCGB3A1, SCGB3A2, and SCGB1A1) were also decreased with advancing age in smokers (Figure 6C and 6D middle and right). MALAT1 is a well-known lncRNA in lung cancer, and its expression contributes to malignancy (39). AGED analysis showed that MALAT1 expression increased in most cell types with advancing age in smokers (Figure 6C and 6E), suggesting that the oncogenic risk associated with MALAT1 increased with age. From the module analysis, heme metabolism was dysregulated in the lung (Figure 4C). The expression levels of FTL and FTH1 genes (ferritin) were significantly altered with advancing age in smokers (Figure 6C). In the “CD68+ macrophage” and “macrophage” clusters, ferritin significantly increased with smoking and aging. In addition, the expression patterns of several mitochondrial genes were altered with advancing age in smokers (Supplementary Figure S14). The module analysis showed that the mitophagy, ferroptosis, and ROS production modules, which are related to mitochondrial dysfunction, were also altered by smoking (Figure 4C). AGED analysis confirmed that age-related mitochondrial dysregulation contributed to the progression of respiratory diseases. Collectively, the AGED analysis revealed changes in aging-related gene expression with smoking in each cell cluster.
Discussion
In this study, we presented a human cigarette smoking lung atlas, generated via the meta-analysis of 104 samples from 8 public scRNA-seq datasets. Our integrated smoking atlas confirmed the alteration of gene expression in the lung at single-cell resolution and identified the early oncogenic events induced by cigarette smoking. Additionally, the novel VARIED and AGED analyses revealed cell type and gene expressional diversity with smoking and age.
One of the significant contributions of this study is that the scMeta-analysis of integrated datasets identified expressional diversity in the early phase of lung squamous carcinoma at the single-cell level. In fact, expression analysis following VARIED revealed early oncogenic signaling in epithelial, fibroblastic, and endothelial cells, expression changes in GWAS-related genes, and gender-dependent alterations in the smoking lung. In previous studies of the effects of smoking, genetic mutations in oncogenes and tumor suppressor genes were discovered (40–42). Bronchial epithelial cells from smokers have mutations in TP53, NOTCH1, FAT1, CHEK2, PTEN, ARID1A and other genes (40). Our atlas showed that survival AKT-mTOR signaling, mitochondrial dysregulation, and sirtuin signaling pathways were altered in bronchial basal cells by smoking (Supplementary Table S4). Mutations in PTEN contribute to the activation of AKT-mTOR signaling (43). FAT1 controls mitochondrial functions (44), and its mutations induce the dysregulation of mitochondria. Additionally, cigarette smoking promotes lung carcinogenesis by IKKβ- and JNK-dependent inflammation (45). DEGs analysis of basal-px clusters indicated that JUN and FOS expression levels were increased in the smoker basal-px cluster (Supplementary Table S3). Our module analysis and CCI analysis showed enhancement of inflammatory signaling in the epithelial clusters. Furthermore, our results showed that sirtuin signaling was enhanced in bronchial epithelial cells in smokers. The atlas confirmed the signaling related to genetic mutations induced by smoking.
The first scMeta-analysis was performed to investigate severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)-related genes by The Human Cell Atlas Lung Biological Network (14). Further scMeta-analyses were reported for endothelial cells in the human and mouse lung (15) and liver-specific immune cells (16), which revealed the alteration of cell populations and expressional heterogeneity with single-cell resolution. Additionally, the study of pan-cancer scRNA-seq cohorts revealed heterogeneity in tumor-infiltrating myeloid cell composition and the functions of cancer-specific myeloid cells (18). scMeta-analysis is a powerful tool and strategy to overcome the problem of sample bias in small clinical cohorts. Additionally, our integrated datasets enabled us to perform single-cell analysis linked with clinical information in meta-cohorts such as AGED analysis, which identified aging-related gene expression with single-cell resolution. Furthermore, it revealed correlations in the alterations of gene expression associated with smoking and aging. Further scMeta-analyses incorporating additional clinical information will be helpful for understanding homeostasis and diseases.
Our study has limitations. First, differences in the tissue sampling and single-cell isolation methods generated bias in the cell populations used in this study. This bias could not be completely removed by computational normalization. In fact, our integrated datasets showed differences in cell subpopulations among the datasets (Supplementary Figure S2B). Second, the available clinical information, such as smoking states, gender, and age, depended on what was collected in the primary studies. The atlas has only a simple classification: smoker or never-smoker; we could not consider detailed smoking information such as the amount of smoking, years of smoking, and Brinkman index (Supplementary Tables S1 and S2). Additionally, patient age was significantly different between the smoker and never-smoker populations (Figure 6A). Moreover, clinical information such as age and gender was not available for all datasets. In the future, it will be necessary to expand the integrated dataset following the publication of new appropriate datasets for a more robust analysis.
The integrated atlas presented herein contributed to the characterization of the alterations caused by cigarette smoking that are related to carcinogenesis of lung squamous carcinoma. However, lung cancer also develops in never-smokers, in whom lung adenocarcinoma is predominant (5, 6). scMeta-analysis focused on lung adenocarcinoma in different clinical states has the potential to reveal the nature of genetic carcinogenesis. As a future study, the integration of scRNA-seq data from normal lungs (never-smokers) and lung adenocarcinoma could be a feasible approach to discover the mechanism of carcinogenesis and elucidate the cellular diversity in lung adenocarcinoma. In addition, clinical scRNA-seq and scMeta-analysis will be powerful tools in combination with data from pan-cancer multiomics analyses, such as those in The Cancer Genome Atlas (TCGA) (46, 47). Therefore, the integration of scMeta-analysis data with clinical and omics data paves the way for an in-depth understanding of the nature of cancer.
Materials and Methods
scRNA-seq data collection from public databases
The scRNA-seq cohorts were downloaded from the public Gene Expression Omnibus (GEO) and European Genome-Phenome Archive (EGA) databases (Supplementary Table S1). We collected scRNA-seq samples of human lungs for which smoking state information was available. From physiological studies of the lung airway, all 10 never-smoker samples were extracted from the EGA00001004082 dataset (48), and 1 never-smoker and 3 smoker samples were extracted from the GSE130148 dataset (13). From idiopathic pulmonary fibrosis (IPF) studies, 5 never-smoker and 3 smoker samples were extracted from a total of 17 samples in the GSE122960 dataset (49), 1 never-smoker and 7 smoker samples were extracted from a total of 34 samples in the GSE135893 dataset (12), and 22 never-smoker and 23 smoker samples were extracted from a total of 78 samples in the GSE136831 dataset (11). From studies of lung disease in smokers, 3 never-smoker and 3 smoker samples were extracted from the GSE123405 dataset (50), and 3 never-smoker and 9 smoker samples were extracted from the GSE173896 dataset (23). From lung cancer studies, 4 never-smoker and 7 smoker samples were extracted from a total of 58 samples in the GSE131907 dataset (51). A total of 104 samples (never-smoker: 49, smoker: 55) were collected, and the details of the extracted samples are shown in Supplementary Table S2. These datasets were imported into R software version 3.6.3 and transformed into Seurat objects with the package Seurat version 3.2 (52). The Seurat objects from the different datasets were then integrated in R.
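As a quick consistency check, the per-dataset sample counts listed above can be tallied. The short Python sketch below is illustrative only and is not part of the authors' R pipeline; the dictionary simply restates the counts given in this section, with zero smoker samples assumed for EGA00001004082 because only never-smoker samples are reported for it.

```python
# (never-smoker samples, smoker samples) per dataset, as stated in the text.
# The 0 for EGA00001004082 is an assumption: only never-smokers are listed.
samples = {
    "EGA00001004082": (10, 0),
    "GSE130148": (1, 3),
    "GSE122960": (5, 3),
    "GSE135893": (1, 7),
    "GSE136831": (22, 23),
    "GSE123405": (3, 3),
    "GSE173896": (3, 9),
    "GSE131907": (4, 7),
}

never_total = sum(never for never, _ in samples.values())
smoker_total = sum(smoker for _, smoker in samples.values())
total = never_total + smoker_total

print(f"never-smoker: {never_total}, smoker: {smoker_total}, total: {total}")
```

The tally gives 49 never-smoker and 55 smoker samples, 104 in total, which agrees with the sample numbers reported in the Results.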
Integration of datasets, data quality control and removal of batch effects
The integrated dataset was subjected to normalization, scaling, and principal component analysis (PCA) with Seurat functions. Low-quality cells were removed from the merged dataset before batch effect removal; cells were retained only if they met both criteria (nFeature_RNA > 1000 and percent.mt < 20). To remove the batch effect between cohort studies, the Harmony algorithm (version 1.0) was applied to the integrated datasets (19, 53) following the instructions in the Quick start vignettes (https://portals.broadinstitute.org/harmony/articles/quickstart.html).
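The retention rule can be illustrated with a minimal Python sketch. This is not the authors' code, which uses Seurat in R; the `Cell` class, the `passes_qc` helper, and the toy barcodes are hypothetical names introduced only for illustration, while the two thresholds are taken from the criteria stated above.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    barcode: str
    n_features: int    # detected genes per cell (Seurat's nFeature_RNA)
    percent_mt: float  # percentage of mitochondrial reads (percent.mt)


def passes_qc(cell, min_features=1000, max_percent_mt=20.0):
    """Keep a cell only if it satisfies both thresholds (strict inequalities)."""
    return cell.n_features > min_features and cell.percent_mt < max_percent_mt


# Toy cells (hypothetical values) covering each outcome of the filter.
cells = [
    Cell("AAAC", 2500, 5.0),   # passes both criteria
    Cell("AAAG", 800, 3.0),    # too few detected genes
    Cell("AAAT", 1500, 35.0),  # too many mitochondrial reads
    Cell("AACA", 1000, 10.0),  # exactly 1000 genes fails the strict '>'
]

kept = [c.barcode for c in cells if passes_qc(c)]
print(kept)
```

Because both conditions must hold for a cell to be kept, a cell is discarded as soon as either threshold is violated.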
Cell type annotation and cell cycle scoring
Clustering of neighboring cells was performed by the functions ‘FindNeighbors’ and ‘FindClusters’ from Seurat using Harmony reduction. First, the clusters were grouped based on the expression of tissue compartment markers (for example, EPCAM for epithelia, CLDN5 for endothelia, COL1A2 for fibroblasts, and PTPRC for immune cells) (Figure 1C and Supplementary Figure S3) and then annotated in detail according to “A molecular cell atlas of the human lung” (7). Cell cycle analysis was performed with the ‘CellCycleScoring’ function of Seurat.
VARIED (Visualized Algorithms of Relationships In Expressional Diversity) analysis
To evaluate the expressional heterogeneity in the cell populations, we calculated the correlation coefficients for each cell population between smokers and never-smokers. In each cluster, normalized closeness centrality was calculated in R, as previously described (23, 54).
where r is the absolute value of Pearson’s correlational coefficient and n is the number of cells in the cluster.
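To make the centrality calculation concrete, the following Python sketch computes a normalized closeness centrality for each cell in a cluster. It rests on assumptions that are not confirmed by the text: cells are treated as nodes of a complete graph, the distance between two cells is taken as 1 − r (with r the absolute Pearson correlation of their expression profiles), and closeness is (n − 1) divided by the sum of distances to all other cells. The function names and toy profiles are hypothetical, and the authors' R implementation may differ.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant profile carries no correlation information
    return cov / (sx * sy)


def closeness_centrality(profiles):
    """Normalized closeness per cell, assuming distance = 1 - |r| (an assumption)."""
    n = len(profiles)
    scores = []
    for i in range(n):
        total = sum(
            max(0.0, 1.0 - abs(pearson(profiles[i], profiles[j])))
            for j in range(n)
            if j != i
        )
        scores.append((n - 1) / total if total > 0 else float("inf"))
    return scores


def mean(values):
    return sum(values) / len(values)


# Toy clusters: rows are cells, columns are genes (hypothetical values).
homogeneous = [[1, 2, 3, 4], [2, 4, 6, 8.5], [1, 2, 3, 4.5]]
heterogeneous = [[1, 2, 3, 4], [4, 1, 3, 2], [2, 4, 1, 3]]

mean_homogeneous = mean(closeness_centrality(homogeneous))
mean_heterogeneous = mean(closeness_centrality(heterogeneous))
print(mean_homogeneous > mean_heterogeneous)
```

Under these assumptions, a cluster of tightly correlated cells yields a higher mean centrality than a cluster with divergent profiles, so a shift in centrality between smokers and never-smokers can be read as a change in expressional diversity.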
Module analysis
Module analysis was performed by the ‘AddModuleScore’ function in Seurat using the gene lists from MSigDB (https://www.gsea-msigdb.org/gsea/msigdb/). The EMT module (HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION), heme metabolism module (HALLMARK_HEME_METABOLISM), ROS module (HOUSTIS_ROS), autophagy module (REACTOME_AUTOPHAGY), IFN signaling module (REACTOME_INTERFERON_SIGNALING), senescence module (REACTOME_CELLULAR_SENESCENCE), circadian module (REACTOME_CIRCADIAN_CLOCK), mitophagy module (REACTOME_MITOPHAGY), pyroptosis module (REACTOME_PYROPTOSIS), and ferroptosis module (WP_FERROPTOSIS) were subjected to module analysis in each cell population.
Pathway enrichment analyses and IPA
We performed enrichment analysis against the marker gene list in each cluster between male and female smokers by the ‘clusterProfiler’ (55) and ‘ReactomePA’ (56) packages in R. Gene symbols were converted to ENTREZ IDs using the ‘org.Hs.eg.db’ package version 3.10.0. Pathway datasets were downloaded from the Reactome database. Pathway enrichment analysis was performed using the ‘enrichPathway’ function, with P values adjusted by the Benjamini-Hochberg (BH) method. Marker genes of the basal-px cluster in smokers and never-smokers were calculated by ‘FindMarkers’ with the MAST method (57). Enrichment analysis of basal-px was performed using QIAGEN Ingenuity Pathway Analysis software.
Cell–cell interaction (CCI) analysis
Gene–gene interactions, including ligand–receptor interactions, were obtained from the interaction database of the Bader laboratory at the University of Toronto (https://baderlab.org/CellCellInteractions#Download_Data). We selected the genes that were categorized as ‘interferons’, ‘interleukins’ and ‘TNFSF superfamily’ in the HUGO Gene Nomenclature Committee database (https://www.genenames.org/). For each gene, we counted the cells in each subpopulation with expression values greater than 2. Only subpopulations whose expressing cell ratio exceeded 10% were extracted for CCI network analysis, and the CCI score between epithelial and immune cell subpopulations in smokers and never-smokers was calculated as previously described (23).
L: Ligand subpopulation (ligand gene expression > 2), R: receptor subpopulation (receptor gene expression > 2), n: cell number.
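The selection of subpopulations for the CCI network can be sketched as follows. This Python example covers only the filtering step (expression > 2, expressing cell ratio > 10%) and does not reproduce the CCI score itself; the function names and the toy expression values are hypothetical, and the authors' implementation is in R.

```python
def expressing_ratio(values, threshold=2.0):
    """Fraction of cells in a subpopulation with expression above the threshold."""
    if not values:
        return 0.0
    return sum(v > threshold for v in values) / len(values)


def candidate_pairs(ligand_expr, receptor_expr, min_ratio=0.10):
    """Pair every ligand-expressing subpopulation with every receptor-expressing one.

    Both arguments map a subpopulation name to per-cell expression values of
    the ligand gene or the receptor gene, respectively.
    """
    senders = [s for s in sorted(ligand_expr)
               if expressing_ratio(ligand_expr[s]) > min_ratio]
    receivers = [r for r in sorted(receptor_expr)
                 if expressing_ratio(receptor_expr[r]) > min_ratio]
    return [(s, r) for s in senders for r in receivers]


# Toy data (hypothetical): per-cell expression of one ligand and one receptor.
ligand_expr = {
    "club": [3.1, 0.0, 2.5, 0.4],   # 2 of 4 cells express the ligand
    "basal": [0.0] * 19 + [2.4],    # 1 of 20 cells, below the 10% cutoff
}
receptor_expr = {
    "macrophage": [2.2, 2.9, 0.1],  # 2 of 3 cells express the receptor
    "B cell": [0.0, 0.3, 1.9, 0.0], # no cell above the threshold
}

pairs = candidate_pairs(ligand_expr, receptor_expr)
print(pairs)
```

Only the subpopulations that pass the 10% cutoff on both the ligand and the receptor side contribute edges to the network in this sketch.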
AGED (Aging-related Gene Expressional Differences) analysis
We calculated the average expression of all genes in each cluster in both smokers and never-smokers and performed regression analysis of gene expression against patient age in R. Next, we calculated the differences in the regression slopes (delta) between smokers and never-smokers and extracted the genes with the highest delta to be shown in a heatmap.
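The slope comparison can be sketched in a few lines of Python. The example assumes ordinary least-squares regression of average expression on patient age and defines delta as the smoker slope minus the never-smoker slope; the sign convention, the function names, and the toy values are assumptions made for illustration and are not taken from the authors' R code.

```python
def slope(ages, expression):
    """Ordinary least-squares slope of expression regressed on age."""
    n = len(ages)
    mean_age = sum(ages) / n
    mean_expr = sum(expression) / n
    sxx = sum((a - mean_age) ** 2 for a in ages)
    if sxx == 0:
        raise ValueError("all ages are identical; slope is undefined")
    sxy = sum((a - mean_age) * (e - mean_expr) for a, e in zip(ages, expression))
    return sxy / sxx


def aged_delta(smoker, never_smoker):
    """Difference in age slopes; smoker minus never-smoker is an assumed convention."""
    return slope(*smoker) - slope(*never_smoker)


# Toy data (hypothetical): per-sample age and average expression of one gene
# in one cell cluster.
smoker = ([40, 50, 60, 70], [1.0, 1.5, 2.0, 2.5])        # slope 0.05 per year
never_smoker = ([30, 40, 50, 60], [1.0, 1.1, 1.2, 1.3])  # slope 0.01 per year

delta = aged_delta(smoker, never_smoker)
print(round(delta, 4))
```

With this convention, a positive delta indicates that expression rises faster with age in smokers than in never-smokers, and repeating the calculation for every gene and cluster gives the matrix of values that is ranked for the heatmap.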
Code and data availability
The datasets GSE122960, GSE123405, GSE130148, GSE131907, GSE135893, GSE136831, and GSE173896 are available in the NCBI GEO database (https://www.ncbi.nlm.nih.gov/geo/). The EGA00001004082 dataset is available in the EGA database (https://ega-archive.org/). The source code of scMeta-analysis and integrated datasets is available on GitHub (https://github.com/JunNakayama/scMeta-analysis-of-cigarette-smoking).
Data visualization
The dimensionality-reduced cell clustering is shown as a UMAP plot generated with the function ‘RunUMAP’. Heatmaps were drawn by Morpheus from the Broad Institute. A ridge plot was drawn using the ‘ggridges’ package in R. Bubble plots and violin plots were drawn using the ‘ggplot2’ package in R. Sankey plots were drawn using the ‘networkD3’ package in R.
Statistical Analysis
Correlation coefficients were calculated by Spearman correlation in R. Welch’s t test or Tukey’s or Dunnett’s multiple comparison test was used for comparison of the datasets. Significance was defined as P < 0.05.
Author contributions
J.N. and Y.Y. conceived and designed the study. J.N. performed the data analysis and construction of the datasets. J.N. and Y.Y. wrote the manuscript. Y.Y. supervised this project. All authors reviewed and edited the manuscript.
Funding
This work was supported by Project for MEXT KAKENHI (Grant-in-Aid for Scientific Research (B); grant number: 21H02721, Grant-in-Aid for JSPS Fellows: 20J01794, Grant-in-Aid for Early-Career Scientists: 21K15562), GSK Japan Research Grant, and Tokyo Biochemical Research Foundation Research Grant.
Data availability
The data used for the scMeta-analysis are available from the NCBI GEO database and the EGA database. Detailed information is shown in Supplementary Table S1.
Competing Interests statement
The authors have declared that no conflict of interest exists.
Supplemental Tables
Supplementary Table S1. A list of the 8 public datasets used for the atlas
Supplementary Table S2. The details of integrated scRNA-seq samples in the atlas
Supplementary Table S3. The DEGs list in basal-px clusters
Supplementary Table S4. IPA canonical pathways in smoker basal-px cluster
Acknowledgments
We are grateful to all members of the lab for stimulating discussions during the preparation of this manuscript.
Footnotes
Lead contact: Yusuke Yamamoto (yuyamamo{at}ncc.go.jp) ORCID ID: 0000-0001-8844-4295 (Jun Nakayama), 0000-0002-5262-8479 (Yusuke Yamamoto)