Abstract
Cell surface proteins play critical roles in a wide range of biological functions and disease processes through mediation of adhesion and signaling between a cell and its environment. Owing to their biological significance and accessibility, cell surface proteomes (i.e. surfaceomes) are a rich source of targets for developing tools and strategies to identify, study, and manipulate specific cell types of interest, from immunophenotyping and immunotherapy to targeted drug delivery and in vivo imaging. Despite its relevance, the unique combination of molecules present at the cell surface is not yet described for most cell types. While modern mass spectrometry approaches have proven invaluable for generating discovery-driven, empirically-derived snapshot views of the surfaceome, significant challenges remain when analyzing these often-large datasets to identify the candidate markers that are most applicable for downstream applications. To overcome these challenges, we developed SurfaceGenie, a web-based application that integrates a consensus-based prediction of cell surface localization with user-input data to prioritize candidate cell type specific surface markers. Here, we outline the development of the strategy and demonstrate its utility for analyzing human and rodent data from proteomic and transcriptomic workflows. An easy-to-use web application is freely available at www.cellsurfer.net/surfacegenie.
Introduction
Cell surface proteins play critical roles in a wide range of biological functions and disease processes through mediation of adhesion and signaling between a cell and its environment. Owing to their biological significance and accessibility, cell surface proteomes (i.e. surfaceomes) are a rich source of targets for developing tools and strategies to identify, study, and manipulate specific cell types of interest, from immunophenotyping and immunotherapy to targeted drug delivery and in vivo imaging. A growing interest in cell type specific data has fueled the generation of the Cell Surface Protein Atlas (1), Human Protein Atlas (2), Human Cell Atlas Project (3), and related efforts. However, the unique combination of molecules present specifically at the cell surface is not yet described for most cell types or disease states, and thus continued discovery and annotation efforts are needed.
Mass spectrometry (MS) based workflows can be applied to identify and quantify hundreds to thousands of cell surface proteins (1, 4-14). Particularly, chemoproteomic methods to specifically label and subsequently affinity enrich cell surface proteins can provide experimental evidence of a protein’s subcellular location and therefore enable the generation of discovery-driven, empirically-derived snapshot views of the surfaceome (10, 15, 16). These approaches offer significant advantages over transcriptomic approaches, which cannot directly inform protein abundance or localization, and antibody-based strategies which are limited to molecules for which high quality reagents are available. As such, these MS-based chemoproteomic approaches are well-suited to defining cell type specific surfaceomes and serve as a useful first step in defining the cellular phenotype, enabling the development of marker combinations (i.e. barcodes) that are cell type specific (17, 18).
Despite their advantages, these chemoproteomic methods generally require >50 million cells, on average, to produce high quality results, which may preclude their application to sample-limited cell types such as primary cells. Although a recent study suggests these methods can be applied to smaller numbers of cells (15), methods that enable routine discovery on very low numbers of cells are not yet widely available. Furthermore, to ensure the results from these approaches provide empirical evidence of surface localization, the initial chemical labeling must be applied to cells with intact plasma membranes, which can pose challenges for certain cell types. For these reasons, more general proteomic approaches that accurately identify and quantify proteins will continue to be useful in the search for cell surface proteins that are informative for a particular cell type or disease status, albeit with the caveat that they offer less inherent specificity for cell surface proteins. Independent of the discovery strategy employed, bioinformatic predictions can serve as an important complement to experimental approaches by providing a means to filter data and prioritize the focus on proteins that are predicted to be localized to the cell surface (19-22).
Though MS is well-suited to the identification of cell-type specific proteins, ultimately, antibodies (Ab) or other affinity reagents that recognize specific epitopes on cell surface proteins are required for most downstream applications such as live cell sorting, imaging, and drug targeting by Ab-drug conjugates. Considering the significant cost and time required to generate and validate affinity reagents for these purposes, it is prudent that the candidate marker prioritization is as selective as possible prior to reagent generation. Specifically, candidate selection should consider whether a marker is likely to be accessible to and detectable by affinity reagents in a manner that allows cell types of interest to be discriminated from non-target cells. Moreover, these assessments should be objective and suited to the analysis of large datasets such as those provided in proteomic and transcriptomic studies. To address these outstanding needs, we developed GenieScore, a mathematical strategy that integrates a consensus-based prediction of cell surface localization with user-input data to prioritize candidate cell type specific surface markers. Here, we outline the development of the strategy and demonstrate its utility for analyzing data from proteomic workflows that specifically identify cell surface proteins (e.g. CSC) and more general strategies (e.g. whole-cell lysate proteomics and transcriptomics). To facilitate its implementation for a broad range of study and data types, we developed SurfaceGenie, an easy-to-use web application that calculates the GenieScore for user-input data and further annotates the data with ontology information relevant for cell surface proteins. SurfaceGenie is freely available at www.cellsurfer.net/surfacegenie.
Results
Generation of a surface prediction consensus (SPC) dataset for predictive localization
Based on first principles, three features of a protein predominantly determine its capacity to serve as a cell surface marker capable of distinguishing among cell types (Figure 1A). These include (1) presence at the cell surface, (2) difference in abundance among cell types, and (3) sufficient abundance for antibody-based detection. Whereas features concerning the abundance must be determined empirically, a consensus-based predictive approach was adopted to represent whether a protein is capable of being present at the cell surface, as this feature is largely a function of its primary sequence. To this end, four previous bioinformatic-based constructions of the human cell surface proteome were compiled into a single surface prediction consensus (SPC) dataset comprising 5,407 protein accession numbers (Dataset S1, 4.1). The strategies used to generate these predicted human surface protein datasets varied markedly, from manual curation to machine learning, and resulted in datasets ranging from 1090 to 4393 surfaceome proteins each. Overall, the dataset sizes are a primary determinant of how the datasets intersect (Figure S1). For example, the number of proteins exclusive to a prediction strategy is positively correlated with the size of the original dataset, albeit not in a linear manner, comprising 1.7%, 4.4%, 9.6%, and 26.5% for the Diaz-Ramos, Bausch-Fluck, Town, and Cunha datasets, respectively. Despite these differences, there was considerable overlap among these predictions, with 69% and 41% of proteins in the SPC dataset occurring in ≥ 2 or ≥ 3 individual prediction sets, respectively. To stratify the proteins in the SPC dataset according to how likely they are to be truly present at the cell surface, each protein was assigned one point for each of the individual predicted datasets in which that protein appeared, termed the SPC score; any protein not present in the dataset is assigned a score of 0 (Dataset S1, 4.1).
The distribution of SPC scores in the compiled dataset is shown in the histogram in Figure 1B, where 1671, 1507, 1497, and 732 proteins are assigned a score of 1, 2, 3, and 4, respectively (Figure S1). To enable more widespread application, homologous accession numbers were mapped between human and mouse using the Mouse Genome Informatics database (http://www.informatics.jax.org) and between human and rat using the Rat Genome Database (https://rgd.mcw.edu) (Dataset S1, 4.2-3).
Benchmarking the SPC dataset against other annotations
The SPC dataset was compared to three established strategies for determination of cell surface localization: Gene Ontology Cellular Component (GO-CC) annotations, annotations within the Cell Surface Protein Atlas (CSPA), and annotations generated through application of HyperLOPIT (23). Comparisons to GO-CC were consistent with expectations, as ‘nucleus’ and ‘cytoplasm’ were the two most common terms for proteins with an SPC score of 0, ‘integral component of membrane’ and ‘membrane’ for SPC scores of 1, and ‘integral component of membrane’ and ‘plasma membrane’ for SPC scores of 2-4 (Figure S2A). The ‘confidence’ assignment to proteins in the CSPA correlated well with SPC score for both human and mouse, with the notable outlier of ~17% of proteins assigned ‘high confidence’ having an SPC score of 0 (Figure S2B). However, upon closer inspection, 95% of these proteins are predicted to be secreted or extracellular matrix proteins (SecretomeP, (24)), which can be captured by CSC but are not integral membrane proteins. HyperLOPIT annotations agreed with SPC score to a lesser extent: the most common annotation for proteins with SPC scores of 3 or 4 was ‘plasma membrane’, whereas ‘ER/Golgi apparatus’ was the most common annotation for proteins with SPC scores of 1 or 2 (Figure S2C). Though these comparisons demonstrated agreement overall, the SPC dataset provides unique and specific information in addition to assigning the predictions in a non-binary manner. As the SPC score is not dependent on experimental observation, it is more comprehensive in coverage than the CSPA and HyperLOPIT. These differences offer significant advantages for mathematically assigning the likelihood that a protein is present at the cell surface in a predictive manner.
Applying the SPC dataset to compare two proteomic approaches for surface protein identification
The concept of specificity as it relates to cell surface markers is always context dependent, meaning a protein or set of proteins may be useful for identifying a particular cell type in one context, but not another (e.g. a protein that is specific to a single cell type within an organ may not be specific to that organ when all other tissues in the body are considered). Therefore, prioritization of cell surface proteins that are likely capable of serving as informative markers should consider experimental data from relevant cell types, including the target and non-target cell types that are to be discriminated. We previously demonstrated that the Cell Surface Capture Technology (CSC) applied to 100 million cells can yield proteins capable of distinguishing among four human lymphocyte cell lines (25). Here, we performed whole-cell lysate (WCL) digestion of 5 million cells of these same cell lines to determine whether a generic proteomic approach coupled with SPC score and GenieScore analysis could identify cell surface proteins sufficient to distinguish among these cell lines. Compared to the CSC analysis, which identified 470 proteins, the WCL approach identified 3858 proteins (≥2 unique peptides). While the majority, 73% (343), of the CSC-identified proteins are predicted to be cell surface localized (i.e. SPC scores of 1-4), only 13% (485) of the WCL proteins met this criterion (Figure 2). This trend is expected due to the high specificity of CSC for cell surface proteins (10, 11, 13, 25). Though predicted surface proteins were identified by both proteomic approaches, the distributions of SPC scores suggest more confidence in the surface localization of CSC proteins compared to WCL. This is exemplified by the number of cluster of differentiation (CD) molecules in each SPC-scoring subset, where 109 of 343 proteins from CSC and 50 of 485 proteins from WCL are annotated as CD molecules (Dataset S1 4.4-5).
Despite these differences, applying a hierarchical clustering approach to the peptide spectrum matches (PSMs) assigned to individual biological replicates for the subset of proteins in each dataset with SPC scores of 1-4 recapitulated the clustering predicted based on the entire dataset for both proteomic approaches (Figure 2). Although these datasets were collected on the same cell lines, only 127 proteins with SPC scores of 1-4 were observed in both datasets, which represent 37% and 26% of the CSC and WCL predicted surface proteins, respectively. These data highlight that despite the challenges in identifying cell surface proteins when using generic proteomic strategies that do not specifically enrich for them, application of the SPC-scoring approach can provide a statistical strategy for determining whether the data are sufficient to differentiate among cell lines.
Testing two label-free quantitation strategies as input data for SurfaceGenie
The GenieScore was calculated for each protein in the CSC and WCL datasets using PSMs as inputs for the two terms based on experimental data, signal dispersion and signal strength (Figure 1). GenieScores were plotted against the rank-order (according to GenieScore) for CSC and WCL data, resulting in a rectangular-hyperbola-like shape, namely, a subset of higher-scoring proteins that trail off into a majority of proteins that are lower-scoring (Figure 2). Although the range of GenieScores was similar for both proteomics approaches (6.59 and 6.16 for CSC and WCL, respectively), there are significant differences in the average and distribution, due to the statistical differences between CSC and WCL for each of the terms used to calculate GenieScore: SPC scores, signal dispersion, and signal strength (Figure S3). These differences are likely consequences of the highly-selective nature of CSC for identifying cell surface proteins. Although CSC, unlike WCL, provides empirical evidence of surface localization, the laborious sample processing involved in selective enrichment of N-glycopeptides can introduce more experimental variability compared to the simple WCL digestion. Moreover, CSC results in fewer peptides identified per protein owing to the restriction to tryptic N-glycosylated peptides. Despite the differences between these two proteomic approaches, the GenieScores for the 127 proteins identified in both proteomic approaches were relatively well correlated (R = 0.66) (Dataset S1 4.6, Figure S3). Recognizing the potential challenges of relying on PSMs for quantitative comparisons, peak areas for selected proteins were calculated using Skyline to provide an alternative type of experimental data for calculating the GenieScore. Selection criteria for peptides analyzed in Skyline are provided in the Supporting Information Methods section.
The GenieScores calculated using MS1 peak areas correlated well with the GenieScores using PSMs (R = 0.79 and 0.86 for CSC and WCL, respectively (Figure 2)). As the calculation of GenieScore relies on averages (as opposed to individual replicate measurements) the relationship between the product of the GenieScore experimental terms (signal dispersion and signal strength) and the statistical difference (which considers variability in measurement) between cell lines was investigated. A positive relationship was observed, with correlations of 0.47 and 0.73 for CSC and WCL, respectively. The positive relationship suggests that the equation for the GenieScore is likely to be prioritizing proteins for which there is a statistical difference (Dataset S1 4.7-8). Overall the GenieScore is a robust prioritization metric, demonstrating similar rank ordering for proteins common to CSC and WCL data and for proteins within CSC or WCL using the different quantitative measurements (PSMs or MS1 peak area).
Benchmarking GenieScore against a published study of surface proteins in cancer cell lines
Though the GenieScore appears to be a valid metric insofar as it produced similar rank ordering independent of the type of input data, we sought to benchmark it against a published study that validated markers which were originally selected based on experimental proteomic and transcriptomic data. In the test dataset, seven antibodies were generated to surface proteins upregulated on RAS-driven cancer cells compared to a control cell line (26). As the CSC results in this study were reported as a log-fold change without individual values, the signal strength component of the GenieScore was calculated using the FPKM values from the RNA-Seq dataset. Of the 122 proteins found to be more abundant in the MCF10A KRASG12V cells relative to empty vector control, the proteins selected for antibody development ranked 1, 2, 3, 8, 28, and 30 in our GenieScore analysis (Figure 3A). The rank-order by GenieScore was compared to the rank-order of log2 fold change in abundance (a metric denoted as selection criteria in the original manuscript) (Figure 3A). The GenieScore also performed well using the RNA-Seq data as a starting point, with the SPC analysis rapidly reducing the candidate list from 1139 upregulated proteins to 330 with SPC scores of 1-4. The proteins selected for validation by antibody-based analysis in the manuscript are among the top candidates when rank-ordered by GenieScore (3, 4, 9, 10, 36), with four of the five genes in the top 3% of the 330 SPC-scoring upregulated proteins. These rank-orders compare favorably to those obtained using log2 fold change in transcript levels (25, 37, 43, 50, 115) (Figure 3B). Based on these results, the GenieScore is a powerful metric for selection of cell surface proteins that can serve as markers for immunodetection applications, and in this example highlights additional proteins of interest that were not targeted in the original study.
Integrating GenieScores of proteomic and transcriptomic data to reveal candidate markers for Mouse Islet Cell Types
As the GenieScore produced useful rank-ordering of potential protein markers from both RNA-Seq and CSC data that were consistent with published results, we sought to determine if it would be a useful metric for integrating data from disparate studies for marker discovery. To this end, we performed CSC on mouse alpha and beta cell lines and compared the results to published RNA-Seq data acquired on primary alpha and beta cells from dissociated mouse islets (27). The datasets shared 321 predicted surface proteins, but when the GenieScores from CSC data were plotted against the GenieScores from the RNA-Seq data, they revealed a poor correlation (R = 0.25) (Figure 4A). This could be because the CSC dataset was acquired on cell lines whereas the RNA-Seq data were acquired on primary cells. However, in the context of marker discovery, each of these approaches offers advantages: the CSC data provide experimental evidence regarding abundance at the cell surface, and the RNA-Seq analysis of primary cells avoids possible artifacts introduced by culturing cells ex vivo. Recognizing the benefits of these complementary approaches, the data were combined in a manner that weighs them equally. Specifically, the GenieScores were normalized to the maximum value from each dataset and then the scores were averaged (Figure 4B). The top candidate markers for alpha and beta cells revealed by this combined approach are provided in Figure 4C. Several of these have been studied in the context of islet biology (e.g. GLP1R (28), LRP1 (29), CRHR1 (29)) and most (26/30) were identified in a proteomic study of intact human islets, suggesting potential utility across species (30). Altogether, GenieScore calculations provide a rapid method for integrating proteomic and transcriptomic data for surface marker prioritization.
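The equal-weight combination described above (normalize each dataset's GenieScores to its own maximum, then average) can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `combine_genie_scores`, the gene identifiers, and the score values are hypothetical, and it assumes that the maximum is taken over each full dataset and that only proteins present in both datasets are averaged.

```python
def combine_genie_scores(scores_a, scores_b):
    """Average max-normalized scores for proteins present in both datasets.

    scores_a, scores_b: dicts mapping a protein identifier to its GenieScore.
    Assumption: each dataset is normalized to its own maximum over all of
    its proteins, not only over the shared ones.
    """
    max_a = max(scores_a.values())
    max_b = max(scores_b.values())
    if max_a <= 0 or max_b <= 0:
        raise ValueError("each dataset needs at least one positive score")
    shared = scores_a.keys() & scores_b.keys()
    return {
        protein: (scores_a[protein] / max_a + scores_b[protein] / max_b) / 2
        for protein in shared
    }


# Toy values, purely illustrative (not taken from the study).
csc_scores = {"GeneA": 6.0, "GeneB": 3.0, "GeneC": 1.5}
rnaseq_scores = {"GeneA": 3.0, "GeneB": 4.0, "GeneD": 1.0}

combined = combine_genie_scores(csc_scores, rnaseq_scores)
ranked = sorted(combined, key=combined.get, reverse=True)
```

Because each dataset is rescaled to a maximum of 1 before averaging, neither data type dominates the combined rank order simply because its raw scores span a larger range.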
SurfaceGenie: a web-based application for integrating GenieScore and relevant annotations
SurfaceGenie, a Shiny app written in R, was developed to enable calculation of GenieScores for user-input data. In this interface, users upload data as a csv file and can view the distribution of GenieScores and SPC scores for their data. Proteins are annotated with ontological information including CD and HLA molecule annotations. The plots and data generated are available for download, including the results for individual terms used to calculate GenieScore. Additional functionality includes the ability to query accession numbers in single or batch mode, independent of data type, to obtain SPC scores. SurfaceGenie is freely available at http://www.cellsurfer.net/surfacegenie.
Discussion
Despite the central role cell surface proteins play in maintaining cellular structure and function, the cell surface is not well documented for most human cell types. There is currently no comprehensive reference repository of experimentally determined cell surface proteins cataloged by individual human cell types that can be used for comparison to experimental or diseased phenotype. Although specialized proteomic approaches allow for probing the occupancy of the cell surface, the sample requirements and technical sophistication often preclude widespread application, and quantitation is challenging. To overcome these challenges, predictions of surface localization can enable insights from more easily implemented proteomic and transcriptomic approaches, which can be performed on smaller sample sizes. Here, we describe the development of GenieScore, a calculation that integrates a predictive metric regarding surface localization with experimental data to prioritize proteins which may be useful as cell surface markers. We demonstrate that GenieScore is compatible with CSC, WCL, and RNA-Seq data and is a useful framework by which to integrate multiple sources of data for marker discovery. A web-based application, SurfaceGenie, was generated to enable the calculation of SPC-scores and GenieScores on user-input data and annotation of datasets with functional annotations relevant for cell surface proteins.
It is anticipated that SurfaceGenie will enable prioritization of cell surface markers to support a broad range of applications, including immunophenotyping, immunotherapy, and drug targeting for a range of research questions, from mechanistic studies to those in search of markers for disease. However, whether an expressed protein is localized to the cell surface on a specific cell type in a specific experimental or biological condition remains difficult to predict. This is especially true for proteins that do not fit the canonical model (e.g. lack a signal peptide) or are only trafficked to the cell surface upon ligand binding (e.g. glucose transporter). For these reasons, experimental workflows that provide capabilities for discovery (i.e. not limited to available affinity reagents) while providing experimental evidence of cell surface localization on a particular cell type of interest with a specific context (e.g. experimental condition, disease state) will remain invaluable.
Methods
All experimental details are provided in Supporting Information.
Cell culture
Human lymphocyte cell lines (Ramos, HG-3, RCH-ACV, Jurkat) were cultured and passaged as previously described (25). Alpha TC1 clone 6 (ATCC CRL-2934) and beta-TC-6 (ATCC CRL-11506) cells were maintained at 37°C and 5% CO2, cultured in Dulbecco’s Modified Eagle’s Medium (Gibco #11885-084) supplemented with 10% heat-inactivated fetal bovine serum containing 16.6 mM or 5.5 mM glucose, respectively.
Cell Lysis, Protein Digestion, and Peptide Cleanup
For WCL analysis of lymphocytes, cell pellets were lysed in 100 mM ammonium bicarbonate containing 20% acetonitrile and 40% Invitrosol (ThermoFisher Scientific), digested with trypsin overnight, and cleaned by SP2 following the standard operating protocol as described (31). Peptides were quantified using Pierce Quantitative Fluorometric Peptide Assay (ThermoFisher Scientific) according to manufacturer’s instructions on a Varioskan LUX Multimode Microplate Reader and SkanIt 5.0 software (ThermoFisher Scientific). For CSC analysis of mouse islet cell lines, samples were prepared as previously described (11, 13, 25).
Label Free Quantitation by Mass Spectrometry
Lymphocyte peptides and CSC samples of mouse islet cell types were analyzed by LC-MS/MS using a Dionex UltiMate 3000 RSLCnano system (ThermoFisher Scientific) in line with a Q Exactive (ThermoFisher Scientific). Lymphocyte samples were prepared as 50 ng/µL total sample peptide concentration with Pierce Peptide Retention Time Calibration Mixture (PRTC, Thermo) spiked in at a final concentration of 2 fmol/µL PRTC, and then blocked and randomized with two technical replicates analyzed per sample. CSC samples of mouse islet cell types were analyzed as described (32, 33). MS data were analyzed using Proteome Discoverer 2.2 (ThermoFisher Scientific) and SkylineDaily.
Construction of a consensus dataset of predicted surface proteins
Four published surfaceome datasets (19-22), each of which used a distinct methodology to bioinformatically predict the subset of the proteome which can be surface localized, were concatenated into a single consensus dataset. In this process, the UniProt retrieve/mapping ID tool (www.uniprot.org) was used to convert the gene names provided in the published surfaceomes to UniProt Accession numbers. Ambiguous matches were clarified by any supplementary information provided in the datasets in addition to gene name (i.e. alternate name, molecule name, chromosome). To stratify the proteins within the consensus dataset, each was assigned a surface prediction consensus score (SPC score), a summed value whereby one point was awarded for each of the prediction strategies in which the protein appeared.
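The scoring rule described above, one point per prediction set in which a protein appears, can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the function name `spc_scores` and the accession numbers are hypothetical placeholders, and the set memberships shown are invented for the example. It assumes that gene names have already been mapped to accession numbers.

```python
from collections import Counter


def spc_scores(prediction_sets):
    """Return a Counter mapping each accession to its SPC score.

    prediction_sets: dict mapping a source name to an iterable of accessions.
    A protein scores one point for every source that lists it.
    """
    counts = Counter()
    for accessions in prediction_sets.values():
        # set() ensures a source contributes at most one point per protein,
        # even if an accession is listed twice within that source.
        counts.update(set(accessions))
    return counts


# Placeholder accessions and memberships, for illustration only.
example_sets = {
    "Diaz-Ramos": ["P00001", "P00002"],
    "Bausch-Fluck": ["P00001", "P00002", "P00003"],
    "Town": ["P00001", "P00003", "P00003"],
    "Cunha": ["P00001", "P00004"],
}

scores = spc_scores(example_sets)
```

A `Counter` returns 0 for any key it has not seen, which matches the convention that a protein absent from the consensus dataset has an SPC score of 0.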
GenieScore – A mathematical representation of surface marker potential
An equation was developed to mathematically represent key features deemed relevant when considering whether a protein has high potential to be useful as a cell surface marker for distinguishing between cell types or experimental groups. The equation, which returns a metric termed the GenieScore, is the product of 1) the SPC score (described above); 2) signal dispersion, a measure of the disparity in observations among investigated samples that is mathematically equivalent to the square of the normalized Gini coefficient; and 3) signal strength, a logarithmic transformation of the experimental data (e.g. number of peptide spectral matches, MS1 peak area, FPKM, or RPKM). A thorough definition and rationalization of the individual equation terms is provided in Supporting Information.
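The three-term product described above can be sketched as follows. The exact definitions are given only in the Supporting Information, so several details here are assumptions rather than the published equation: the Gini coefficient is computed from the mean absolute difference and normalized by n/(n-1), and signal strength is taken as log10 of the maximum per-sample value plus one. The log base, the +1 offset, the use of the maximum, and all function names are hypothetical choices for illustration.

```python
import math


def normalized_gini(values):
    """Gini coefficient scaled by n/(n-1) so that it spans 0 to 1."""
    n = len(values)
    total = sum(values)
    if n < 2 or total == 0:
        return 0.0
    abs_diff_sum = sum(abs(a - b) for a in values for b in values)
    gini = abs_diff_sum / (2 * n * total)  # equals sum|xi - xj| / (2 n^2 mean)
    return gini * n / (n - 1)


def signal_strength(values):
    """ASSUMPTION: log10 of (maximum observed value + 1)."""
    return math.log10(max(values) + 1)


def genie_score(spc_score, values):
    """Product of SPC score, squared normalized Gini, and signal strength.

    values: one averaged measurement per cell type or group (e.g. PSMs).
    """
    dispersion = normalized_gini(values) ** 2
    return spc_score * dispersion * signal_strength(values)


# Toy inputs: a protein seen in only one of four cell types versus one
# seen equally in all four.
exclusive = genie_score(4, [9, 0, 0, 0])
uniform = genie_score(4, [5, 5, 5, 5])
```

Under these assumptions, a protein detected in a single cell type has a dispersion of 1, so its score is the SPC score times the signal strength, whereas a protein with identical values in every cell type scores 0 regardless of abundance or predicted localization.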
SurfaceGenie Web application
A web application for accessing SurfaceGenie was developed as an interactive Shiny app written in R and is available at www.cellsurfer.net/surfacegenie.
Supporting Information
Figure S1 – Visualization of the intersections between datasets used to generate SPC score
Figure S2 – Benchmarking the SPC score against GO terms, CSPA, and HyperLOPIT
Figure S3 – Distributions of GenieScore terms in WCL and CSC lymphocyte data
Dataset S1 – (1) Human SPC dataset, (2) Mouse SPC dataset, (3) Rat SPC dataset, (4) lymphocyte WCL data with GenieScores, (5) lymphocyte CSC data with GenieScores, (6) GenieScores for proteins common to CSC and WCL, (7) ANOVA test statistics for WCL data, (8) ANOVA test statistics for CSC data
Supplemental Methods
Author Contributions
R.L.G. and M.W. conceived the study; R.L.G. supervised the study; M.W. developed the algorithms and designed and performed MS experiments; S.S. developed the Python code; S.S. and J.L. developed the web application; R.A.J.L., P.A.H., and J.A.C. performed analyses of mouse islet cell lines; M.W. and R.L.G. analyzed data; M.W. generated figures; M.W. and R.L.G. co-wrote the manuscript; all authors approved the final manuscript.
Acknowledgements
This work was supported by the National Institutes of Health [R01-HL126785 and R01-HL134010 to R.L.G.; F31-HL140914 to M.W.]. Funding sources had no involvement in study design, data collection, interpretation, analysis or publication.