Abstract.
Microarray technology has unlocked doors to a multitude of open analysis problems that if conceived with efficacy may uncover varied genotypic and phenotypic traits. Algorithms belonging to different cultures in computer science have been applied to gene expression data to derive correlation and stratification parameters. While most outcomes are subject to clinical validation, majority of which get declined, the search for the precisely targeted therapeutic agents is still on. This paper is an effort in the similar direction and strives to delineate genes with significant stromal signatures. We suggest a corroborative indulgence of a human laterality disorder gene, CCDC11 in the metastasis, in addition to the role of WDR88 and ARPP21 genes has been further materialized in the analysis. Another standout aspect of the study has been the associated implications of the genes in rare disorders of male breast and female prostate cancers. There is also a threshold proposal that stratifies “safe” expression space for genes. Complimentarily, the manuscript serves as an expedient protocol for anyone seeking microarray data analysis, particularly in R.
1. Introduction
There appears nothing proverbially eerie about the technology at the get go. Microarrays usage throve with (Schena et al. 1995) and were originally applied to harbour global gene expression (DeRisi et al. 1997; DeRisi et al. 1996) in association with yeast studies. With the proliferation of data pertaining to medication and that too in the digital proforma, it is crucial to constantly challenge and update the current configuration of systems that are being used to analyse it [genomic data] for compliance to the medical care. NGS is one such advancement that was gullible to the geneticists. Unlike the microarray data that catalogues gene expression values under a predefined probe, the RNA-seq data from NGS documents expression range in totality (Uziela & Honkela 2013). RNA sequencing technology pictures a comprehensive view of the transcriptome with the data being reproducible for novel discoveries yielded by disparate analyses. RNA sequencing is also helpful in detection of structural variations as gene fusions, alternative splicing events, etc. But microarrays still continue to provide a relatively affordable first-foot to genomics, bearing robustness and short turn-around time. With significant disparities owing to the definite and specific backgrounds of the individuals, the genomic data available via microarray format has shown likewise results when particular maladies come into question as cancer, diabetes, amongst others. The big question however is that could the genes be standardized via ontology driven mechanism so that specific drug targets be known and hunted for. Scientists are always looking for particular biomarkers that can be universally acclaimed and acknowledged. In the current paper, we attempt to underline key players that are actively responsible for representing the metastatic behaviour and proliferation of oncogenic state in a body induced with breast-type and pros-tate-type cancers, in cognizance to stromal reaction. The results are based on a comparative meta-analysis.
Reactive stroma is a response to the aberration into the tissues due to tumor invasion (Planche et al. 2011). Synonymous to desmoplasia, it has also been recognized that the stromal response is exclusive to tumor type. It can be perceived that desmoplastic response is in tandem to carcinogenesis and subsequent metastasis. Thus, it is unstated that desmoplastic reaction is also a prospective antecedent of premalignant stage, as the growth of connective, fibrous tissues around the tumor cells commences. Genetic irregularity in the cells compartmentalized in epithelium represents carcinoma in situ and the lesions initiate cell fibroblasts as a tackling measure. Functionalities of stromal initiation include homeostasis and tissue structure restoration. Chronicled is also that cell division govern mechanism is hampered because of the tumor induction and eventual progression. The amount of reactive stroma is proportional to the disease state (Martin & Rowley 2013). Once the tumor foray infiltrates through the ECM into adjoining host tissues, they become potent to further metastasize. Vascular structures, blood and lymph vessels, ECM, and fibroblasts constitute the stroma (Casey et al. 2009). Diverse studies by (Tuxhorn et al. 2002), (Ayala et al. 2003), (Roepman et al. 2006), and (Finak et al. 2006) implemented LCM to scrutinize gene expression profiles of tumor stroma (breast) versus normal epithelium and clinched that the alterations in the stromal microenvironment is comparative to the tumor progression.
In the following work, we attempt to ascertain genes that are prominent to tumor progression and subsequent stromal response. This may aid identification of key pathways (genes instituted) that are liable for the cancer metastasis. As the dataset may reveal, we attempt to analyse breast and prostate oncogenes.
This paper is organized as follows:
First, the developments in the breast cancer and prostate research, over the years, are catalogued. Various data analysis methodologies that have inferred some very seminal results have been underlined. We then present our viewpoint and improvements in the domain and propose a novel algorithm to analyse cancer stroma data. As it would necessitate, the sifted targets are subjected to validation; but due to accessibility constraints, could only be done via available erstwhile published research work. Their [genes] analysis can further substantiate our studies for preventing the spread of cancer to the other tissues through pathway blockage and rendering them benign through a drug treatment.
The statistical analysis and visualizations are covered with R language (version 3.2.3) (interface used is RStudio version 0.99.491) on a desktop computer with 8 Gb RAM and an Intel i5 CPU with 3.50 Ghz clock speed. For further distillations, MeV version 4.9.0 (Anon n.d.), and Cytoscape are employed.
1.1 An Alarming Statistic
A not so long ago article (Kamath et al. 2013), reports that India is overwhelmed with 2.5 million cancer patients in aggregate and close to a million such augmented annually. To put things into perspective, there were 1.7 million and 11.4 million cancer incidences in the South East Asia region and world over, respectively in 2004. According to Globocan data (International Agency for Research on Cancer), India tops the chart with 1.85 million years of healthy life lost due to breast cancer alone. The aftermaths of this malady are equally likely for the rest of the world too.
* Healthy life lost is defined by years lost owing to premature death and deterioration of health standards on account of a disease induction into the body.
An elucidation from an erstwhile research confirms that after cervical cancer, breast cancer is highly promulgated amongst Indian women. Also shown is that Indian women are likely to inhibit breast cancer, a decade earlier than their Western counterparts. The paucity of early detection and incompetent control mechanism can largely explain the succumbing rate. Ex-orbitance in breast cancer cases throughout developing nations is proportionate to varying lifestyle being is unregulated and sporadic, expectancy and delivery of fewer children, and hormonal intervention exemplified by post-menopausal hormonal therapy. The symptoms are profound at a later stage of the malignancy and hence pose greater challenge to review the disease at the initiation. The authors of the study (Kamath et al. 2013) stressed upon the need to exorcise this “ticking time bomb” and called for apt administrative measures for the same.
Prostate cancer, mostly occurring in elderly men, has similar danger trail and accounts for second largest cancer causing deaths in U.S. males after lung cancer (Siegel et al. 2016; Gaylis et al. 2016).
1.2 Provenance
When it comes to being most defiant and stubborn, and not to mention “incurable”, cancer is christened far and wide for being the malady that poses serious threat to the manhood. Many of the responsible genes involved in the pathways oriented to oncological disorders have complex and overlapped functioning. Not to mention, some genes remain dormant at an instance and are activated by a particular range of expression level of other corresponding gene[s]. They also tend to become chemotherapy resistant through a self-regulatory mechanism. These attributes account for a thorough and complacent inspection of the various parameters involved in gene functioning, mapped and homed-in.
In the exploration of gene expression data, the magnitude of tissue samples is lower with respect to number of genes that may inevitably lead to overfitting of data and inappropriate results (Shen et al. 2007). Gene selection is critical to elucidate tissue classification as well as to model complex genetic and molecular underpinnings, which explain the relation between genes and varied biological phenomenon. The stability of the analysis model can be accomplished through it.
BRCA1 and BRCA2 are vehemently recognized for hereditary breast and ovarian cancer proneness. These are human genes that produce tumor suppressor proteins that implicitly initiate DNA repair mechanism. If a mutation is detected in any of the aforementioned genes, the susceptibility to espousing tumor inception is high (Anon n.d.). BRCA mutation may lead to the following probabilities:
40%-80% for breast cancer
11%-40% for ovarian cancer
1%-10% for male breast cancer
Up to 39% for prostate cancer
1%-7% for pancreatic cancer
Likewise, if any other relative cancer genes could be deciphered by comprehensively analyzing the gene expression data and establishing their helm in metabolism via clinical validation, we can get closer to disease treatment and increased understanding towards biology. (Bosdet et al. 2013) take BRCA mutation testing to a whole new level by incorporating the Second Generation Sequencing and Third Generation Sequencing procedures, collectively known as NGS, to deal with increasing number of tests that the people are willing to take to judge their cancer proneness. This era of NGS renders reduced cost, greater efficiency and high throughput. The assay defined uses automated small amplicon PCR followed by sample pooling and sequencing with a second-generation instrument.
1.3 Androgen Receptor: An observable cause commune
Classically abnormality in males associated with prostate cancer, androgen receptor response has been apropos (Yu et al. 2000). AR gene isn’t solely responsible to harbor design and characteristic instructions for sex drive and hair growth, but also facilitate sexual physiognomies. Positioned on the long (q) wing of the X chromosome at the 12th position, the AR gene encompasses cohorts of CAG repeat regions (triplets or trinucleotide repeats). The strength of quantifiable occurrences of these DNA segments account for the proneness of the prostate cancer and breast cancer; while some studies hold more repeats liable, others blame lesser ones (Yu et al. 2000). Research also depicts that mutations in the AR gene are accountable for prostate cancer instantiation (Nelson 2002) (Giovannucci et al. 1997), albeit somatic in nature. In women, longer CAG repeats and polymorphisms may increase the risk of endometrial and breast cancers (Mehdipour, Pirouzpanah, Kheirollahi, & Atri, 2010).
1.4 Gene Selection
While holding candescence to the fact that intergenic regions relegated as “junk DNA” have long been undermined, numerous follow up studies have unraveled that non-coding RNAs, amongst other “dark” regions have a profound effect on regulation of gene expression (Birney et al. 2007) (Carninci et al. 2005) (Cheng et al. 2005) (He et al. 2008). Since microarrays are designed to study gene measurements, the aforementioned parameters are left diluted. This aspect holds its vitality and is sure to influence the end result. Notwithstanding, it has been known that Particle Swarm Optimization Technique (PSO) has been meticulously significant in harnessing gene selection (Shen et al. 2007) (Yuan & Chu 2007) (Shen et al. 2008) (Chuang et al. 2008) (Lin et al. 2008). Other approaches include Artificial Neural Networks (ANN) and Fisher Discriminant Analysis (FDA), to name a few. An ensemble methodology involving Particle Swarm Optimization (PSO) and Support Vector Machines (SVM) has been observed to be particularly critical to feature selection and cornering genes of interest (Yeung et al. 2009).
1.5 Elucidation of Cancer Subtypes
Breast cancer is a neoplasm that with distinct subtypes has differently representable histo-pathological features and response to systemic therapies (Dai et al. 2016). Patient age, tumor size, and axillary lymph node status have been deciding factors as well (Schnitt 2010). Im-munohistochemistry (IHC) biomarkers have been classically deployed to ascertain subtyping. They entail Estrogen Receptor (ER), Progesterone Receptor (PR), Androgen Receptor (AR), and Human Epidermal growth factor Receptor 2 (HER2). Back in the 70’s, there were two subtypes that became known to us, viz. (luminal epithelial) ER+ and ER-(Perou et al. 2000) (Sorlie et al. 2003) (Alexe et al. 2007). Triple negative breast cancer is characterized by a cancer subtype devoid of ER, PR, and HER2 gene expressions. Compounds like tamoxifen (for ER), and trastuzumab (for HER2), are tactless in dealing with triple-negative breast cancer. It is chemotherapeutically challenging as it warrants a grouping of disparately rated drugs to target each of the receptor. Owing to its profile, triple negative breast cancer is revered as a basal-type. Another recent study (Vici et al. 2015), illumes reasonableness of the triple positive breast cancer.
From prostate cancer viewpoint, gene fusions between TMPRSS2 and ETS hierarchies have been stressfully documented (Tomlins et al. 2006), and also with ERG genealogy (Penney et al. 2016). Expression levels of genes MUC1 and AZGP1 were also shown to categorically underline exclusive subtypes of prostate cancer from clinicopathological stance (Lapointe et al. 2004).
(Herschkowitz et al. 2007) orchestrated a pioneering work that led to elucidation of a novel sub-type pertaining to breast cancer disorder. This new subtype, referred to as Claudin-Low was implicit of low expression genes. Also, traditionally, tumor types could be classified as basal epithelial-like group (ERs), an ERBB2-overexpressing group, and a normal breast-like group (Davidson & Liu. 2010). Another feature discovery from a study by (Sorlie et al. 2001) had confirmed the possible subdivision of the ER+ tumor type into two clusters with distinctive gene sets having particular corresponding clinical outcome.
2. Results and Discussion
The exegesis is premeditated so as to elucidate a quantifiable threshold that stratifies gene expression space in conjunction to normal and cancer stromal response states. We deliberate to identify key transcriptional features that determine the high dimensional feature space and visualize their inter-linkups via a regulatory network illustration. This is always complimentary to ascertain our knowledge about genes and their pathway-occurrence motifs… (Figure 1)
2.1 Anonymous Genes/ Probes
We identified 27236 entries, while scrutinizing the annotation fields of the dataset that eluded ontological reference. There are also considerable amount of genes whose expression values are catalogued under incongruent probes, resulting in their multiple incidences. This is a purported case of genes’ splice variants, as the dataset suggests. While we aim to identify DEG and construct a respective GRN, there is also a prudence of elucidating functionally coherent genes that may unravel profiling of all or few anonymous genes. Thus, it appears duteous to abandon blank values to sustain quality of biological interpretation and germaneness.
2.2 Normality and Data stabilization
The data appears normalized data sans log transformation. Hence it is log-transformed and metamorphosed to render mean=0, and standard deviation=1, i.e. it followed normal distribution. Since, the normality isn’t skewed as a result of multiple comparisons problem (Dunn 1961), as we’re not envisioning multitude of significance values, there is no proliferation of Type I error occurrence anomalies… (Figure 2)
2.3 Differentially Expressed Genes
With respect to the assumed significance level (α) to be 0.01 and hence a stern confidence interval of 99%, we aim to copiously optimize the gene(s) search by postulating as follows:
(Null hypothesis)H0: Genes are not differentially expressed (equal means)
(Alternate hypothesis)H1: Genes are differentially expressed (unequal means)
The listing of differentially expressed genes will implicitly catalogue up- and downregulated genes too. To prudently list them out, a within genes correlation does the job. The negative numbers represent down-regulated genes and positives up-regulated ones. (Danielsson et al. 2013) report that maximum of genes en route malignancy, are down regulated. This is not for reference, but only to mark. There is also to note that since breast cancer and prostate cancer find unique origins pertaining sex discrepancy, it’s only logical to work with bifurcated dataset. We contemplate breast cancer and prostate cancer entries with distinct exegesis and later combine and compare the results owing to significance to biological interpretation.
The exploration renders 356 probes being differentially expressed in breast stroma and 221 in prostate stroma (with p-value < 0.01) amongst which ADH1B, COL10A1 are most distinctly expressed in breast strata, while BMP5, SFRP4 are notable enough in prostate cluster…
Desmoplastic Retort to Prostate and Breast Carcinomas’ Metastasis.
We also acknowledge that packages like siggenes (Schwender, 2012), samr are available that incorporate Significance Analysis of Microarrays (SAM) (Tusher, Tibshirani and Chu, 2001) working procedure, but in view of keeping the analysis more abstract and interactive, there is a minimum use of readymade library functions.
2.4 Gene Set Enrichment Analysis
Gene Set Enrichment Analysis (GSEA) is a scheme to map statistically relevant genes to pre-known biological profiles, eg. phenotype, to discern their life relevance. The molecular signatures are updated as the curation cascades. There exist a consortium of metadata libraries for cataloguing genes and gene products’ information. To standardize the practice of annotation in genomics, this bioinformatics initiative is absolutely imperative as we’re riding the snowball of discoveries in GWAS. (In R language, GWAS is facilitated by Fischer’s exact test.)
This method is applied to the resultant set of differentially expressed genes and only those with a reference in MSigDB (Subramanian, Tamayo, et al., 2005) (with a valid Unigene_ID, Entrez_ID, and GO_TERM) are selected for further analysis. To accomplish the same, MeV and Cytoscape (in parts) are used. The GO listing wasn’t available for 10 probes in prostate data and 11 in breast data, which led to their discard. At this stage, our dataset has 345 and 211 rows in breast and prostate data, respectively.
2.5 Functionally Coherent Genes
An important aspect that escorts investigation of the differentially expressed genes is the strength of associativity between their tumor and normal roles. This can be explored using correlation technique of statistical descent. Commonly known Pearson’s product-moment (or simply, correlation) coefficient helps establish connect between two linearly distributed variables. In simplified terms, Spearman’s coefficient is a non-parametric version of Pearson’s coefficient with ranked data (Hauke & Tomasz 2011). Since, our dataset selection is so, we would prefer using Pearson’s correlation measure as opposed to Spearman’s or Kendall’s which is equally effective (or more) for the qualitative data. Kendall’s t is based on concordance and discordance. The question is to establish similarity between two distinct genes, technically two expression vectors (Saeed et al. 2003). An expression vector spans expression values vide all featured experimental conditions. Albeit microarrays are not known to cater isoform expression detection as they are not absolute exhibitors of gene expression and rather give a relative value (log ratios of hybridization intensities).
The Pearson’s correlation coefficient (r) for class labels X and Y is mathematically represented as follows:
The resultant genes in breast cancer and prostate cancer were tested for correlation amongst themselves. With each gene confronting every other gene, a comparisons are expected in a pairwise matrix format. Since distance measure is hinged around mean values, a mix of positive and negative integers is likely. It is to note here that notwithstanding the amplitude of correlation, there is a significance of signs in the correlation matrix. A negative (-) number indicates that a gene is inhibiting another gene, while a positive (+) marks that the two are expressing collaterally. To safeguard our conviction to the fullest, the gene list was filtered with a dual-parameterized statistic. We sifted the genes with low p-value significance and high correlation measure. The tables catalogue genes from cancer-duos, with correlation > 0.95 and p-value < 10-7.
2.6 Gene Regulatory Networks (GRN)
Transcriptional activity can be precisely monitored with GRNs (Chai et al. 2014). A visualization of putative pathways and the absolute values that are symbolic of the degree of strength between two components can bring out some very useful linkage information. After the elucidation of differentially expressed genes, the inkling is to draw a correlation measure amongst them to infer a relational matrix with values {-1, 0, 1} with interpretations anticorrelated, no dependence, and correlated, respectively. This notion is certainly not delimited to “naivety of adjacency”. The informal theme is the distance measure, but logically it may falsify the overall outcome due to inherent biases with the chip construction. Therefore, the notion of correlated transcripts is revered more viable. A gene regulatory network is a visualization of a set/part(s) of genes that result in myriad (all) of cell processes, including metabolism, cell signaling and transduction, cell growth control etc., which is vital to understand the dynamics of molecular biology (Karlebach & Shamir 2008). But, the mechanistic inference of the architecture is subject to experimental biology, a wet lab gig (Davidson & Levin 2005). Nevertheless, the disposability of GRNs can’t be disparaged as they provide a blueprint of the underlying system and tellingly optimize our erudition.
Correlation establishes the linear propensity in-between variables. For pursuit of the same, we deliberate a Bayesian approach. In continuum to our expedition with R, the packages, BNArray (Chen et al. 2006), NATbox (Chavan et al. 2009) deploy probabilistic slant to decipher gene interactions, where NATbox shows competitive proficiency (Chavan et al. 2009). In this treatise, however, we’ve considered Cytoscape as a pliant tool for visualization the transcriptional network in corroboration with the GeneMania plugin. The output network matrix of genes was exported to Cytoscape for visualization and analysis.
Post validation of the transcriptional networks, gene CCDC11, which has traditionally been revered for human laterality disorder (Perles et al. 2012; Narasimhan et al. 2015), has been elicited to show strong propensity in both (breast and prostate) cancer profiles. The Coiled-Coil Domain Containing 11 or CCDC11 is a protein coding gene which is closely associated with epidermis in amphibians and skin fibroblasts from Homo sapiens (Narasimhan et al. 2015). Re-annotated as Cilia and Flagella Associated Protein 53 (CFAP53), the mutation in CCDC11 exhibits perturbed left-right asymmetry (Silva et al. 2016). As showcased in the particular analysis, it has thorough connectedness at the order of 11 (aggregate) to other genes, which may signify functional co-regulation. From the understanding, it is proposed that this hub gene could be responsible to stage the process of stromal response and coordinate in the transcriptional activities of the same.
WDR88 gene on chromosome 19, revered WD repeat domain 88, is a protein-coding gene and a branded marker for the onset of prostate cancer (Chinnaiyan et al. 2013). In a topical finding, the gene has also been shown to have links with schizophrenia (Richards et al. 2016). 167 organisms have orthologs with human gene WDR88 that is conserved in chimpanzee, Rhesus monkey, dog, cow, mouse, rat, chicken, and frog.
Another gene ARPP21, located in chromosome 9, has been exceptionally highlighted in the breast and prostate cancer profiles. It has been duly captured to be frequently deregulated as is miR-128 (Pellagatti et al. 2010; Li et al. 2013). According to NCBI RefSeq (June 2012), this gene encodes a cAMP-regulated phosphoprotein. The encoded protein is enriched in the caudate nucleus and cerebellar cortex. A similar protein in mouse may be involved in regulating the effects of dopamine in the basal ganglia. Alternate splicing results in multiple transcript variants. It is thence fathomed that these hub genes could be responsible to stage the process of stromal response and coordinate in the transcriptional activities of the same.
From gene ontology, ARPP21 is also attributed to the response to stimulus, triggering cellular response to heat; at any temperature higher than the optimal stimulus of that organism.
A deliberation to the current study also entails the exceptional, yet formidable idea of cross-linkages of breast and prostate cancers. Although the rudiments of breast and prostate are oriented towards females and males, respectively, nonetheless, an exceptional yet indelible facet of female prostate and male breast profiles has been dimly studied. According to the American Cancer Society, breast cancer is aggregate 100 times less common in males than females; that is to calibrate the lifetime risk of a male getting breast cancer is 1 in 1000. Contrastingly, the skene/ periurethral gland carcinoma (female prostate cancer, in generic terms) is also found to contribute less than 0.0003 percent towards all genital cancers in women (Dodson et al. 1994). The numbers aren’t intellectually stimulating, albeit we choose to delve a little deeper.
2.7 Female Prostate and Male Breast Carcinomas
Female prostate, i.e. Skene gland, named after Alexander Johnston Chalmers Skene, who was a British gynecologist from Scotland, is a homologue for the male prostate organ and its adenocarcinoma is a scarce occurrence. Elevated Prostate Specific Antigen (PSA) and PSA Phosphatase (PSAP) are potent markers for detecting prostate cancer in general (female as well as male prostate specimens). Owing to the rarity, the female prostate cancer isn’t thoroughly researched too. With the limited physiological understanding, an older case study presented a female subject with advanced form of the disease. It was treated with convention-al surgery (Ueda et al. 2012), to eventually weed out all the spread. Other techniques including radiotherapy (Korytko et al. 2012) have been sublimely effective as well.
The female prostate is acknowledged as a functional part of the urinary and reproductive systems in female humans (Zaviacic et al. 2000). It is located on the anterior wall of the vagina, around the lower end of urethra, on each side. A chance that skene gland could be a secondary cancer site is also plausible. Estrogen (Estradiol, Estriol, and Estrone) and Progesterone are the two key enzymes/ hormones that regulate the female breast development, menstrual cycle, and sexual function. They are also luminaries in the prostate region in the female gerbils. Estrogen is present in both male humans and female humans, and can be measured for analyzing cancer of the reproductive system subunits, viz. ovaries, testicles, etc. The cancer of the Skene gland is also more recognized in older females, showing tangible lesions (Custodio et al. 2010). Additionally, it has also been extensively deliberated that a family history of breast cancer and prostate cancer engenders augmented jeopardy to a postmenopausal woman gestating breast cancer (Robinson et al. 2015), (Beebe-Dimmer et al. 2015). The abscesses in the gerbils are also shown to be driven by progesterone. A case history of multiple pregnancies and ageing could be culpable for the female prostate disorder (Oliveira et al. 2011).
Owing to the limited case studies of Skene gland cancer, the symptoms of the disorder aren’t well acknowledged and etiology is apparently impervious. As general indications, bleeding in the urethra, that could also accompanied by sporadic pain, are primarily contingent to symptomatic treatments. If the following conditions hold, a quick visit to the physician is often advisable.
Arduous, frequent, and often difficult urination
Bleeding from the urethra
Painful sexual intercourse and pubic area
Erratic menstrual cycle
The causes for Skene gland cancer are diversely plethoric. They can include infection as prostatitis, some sexually transmitted infections (STIs) as gonorrhea; Polycystic Ovarian Syndrome (PCOS) that renders imbalance and frequently abundance of reproductive hormones, cysts, and adenofibroma.
Another malady, although uncommon but not to be belittled as the rate of occurrence increases every year, is the Male Breast Cancer (MBC). Mainly, females are more vulnerable to breast cancer, having stocky breast tissue; however, males have pertinent breast tissue as well. Scientifically, mutated copies of BRCA1 and BRCA2 genes incubated by male humans are proverbial causes for MBC. Tamoxifen and anti-hormonal drugs are FDA-approved chemotherapeutics to treat breast cancer in both male and females. Requisite surgery (mastectomy/ lumpectomy) followed by radiation therapy is standardly warranted, although individual therapies could include more aggressive treatment options. The ideology of a male human incubating breast cancer is largely pondered with ignorance and aversion; this conviction, in most cases, delays the screening of the disease. Peculiar symptoms of MBC entail ruptured (and often painful) nipples, puckering and dimpled masses of the breast, decolorized jaggy surfaces, etc. MBC is usually detected as a hard lump underneath the nipple and areola. The histopathological derivatives in MBC and female breast cancer are homogenous.
Gynecomastia is also a disorder in men of benign nature, where the breast tissue becomes enlarged due to hormonal misbalance (oestrogen to testosterone ratio), especially during puberty. Although a natural phenomenon, it is usually conceived with humiliation and anxiety; however paltry cases have been reported to establish that gynecomastia and MBC are concomitant. In conjunction, pseudogynecomastia is a condition when adipose tissue (fat) causes gynecomastia.
Therefore, it can be argued that the denominations of origins, histopathology, causes, symptoms, and treatments are overlapped for male-breast and female-breast cancer; and likely so for male-prostate and female-prostate cancer. The contributing genes and pathways could be further explored for overlap in disease profile and therapy.
2.8 Stromal Response Threshold
The cancer metastasis presents an intriguing case of classical science theory- a medium required for propagation. The carcinogenesis triggers a parallel desmoplastic reaction that serves as carrier of malignancy (Whatcott et al. 2012). Technically, desmoplasia is the development of fibrous and connective tissues encompassing tumor cells. Cells like endothelium and fibroblasts stage all structural and functional profiles from carcinogenesis to metastasis (Kalluri & Zeisberg 2006) and vitally graded therapeutic targets. Chemo resistance is highly attributed to mutations in cancer cells as per the Darwinian doctrine of evolution, survival of the fittest (Fodale, Pierobon, Liotta, & Petricoin, 2011) (Pisco et al. 2013).
We import widely recognized e1071 package library (Meyer et al. 2015) to employ Support Vector Machine (SVM) classification to distill the transcriptional threshold to desmoplastic response. Although there are four others catalogued in the R library that carry out the SVM implementation, viz. kernlab, klaR, svmpath, and shogun. Technically, a decision boundary equation is sought here. Our aim, from epidemiological context, is to aid medicinal normalization of transcriptional impressions so as to contain tumor invasion trans-organismal cultures. The one-versus-one favor of classification is evident for 6 class pairs (normal-tumor duo). Multiple kernel types were considered and cost functions analyzed before arriving at the tune() that cross-validates a range of SVM models outputs the optima.
The equation of the hyperplane separating the negative and positive examples is given by: where w is weight vector, x is input vector, and b is bias. The decision boundary can also be deemed as a linear combination of support vectors. As calculated, 8 and 9 support vectors were rendered from prostate and breast data respectively. The bias vectors from <svm model>$rho are 1.429841 and 4.139861, from prostate and breast data correspondingly. Further information can be found, as code output, from the supplementary documents.
2.9 Conclusion and Future Work
This text has been premeditated to render the most interactive portrayal of working with gene expression data analysis. As a part of the original work, the authors have carried out survival analysis too. The treatise however concentrates on the improved biomarker(s) dissemination.
As an imminent applicability, the study can aid fostering of pertinent therapeutics to deride proliferation of cancer metastasis from one tissue to another by monitoring the expression threshold and keeping it checked.
The procedure highlights an illustration of the packages available in the R language and Bioconductor that duly facilitate the exploratory analysis of the genomic data. While doing so, certain cohorts of genes were found relevant and were statistically narrowed to seed further analysis. This aids reducing the search space for biomarkers (broadly explains the doc-trine of bioinformatics) and the pipeline of wet laboratory testing and validation, proceeds. If the genes CDCC11, WDR88 and ARPP21 have any causal implications in the stromal response to the cancer metastasis, can and will only be substantiated through valid laboratory studies. Researches alike add to the annotations of the known gene functionality. An array of such explorations is warranted and is indeed happening. This trend over a period of time is believed to pave way for a precision medicine schedule, when drug compounds’ applications and the respective gene functions are almost perfectly matched.
3. Material and Methods
3.1 Dataset Selection
The gene expression dataset chosen from the study is derived out of a study based on stromal cells and invasive breast and prostate cancer development (Planche et al. 2011)
The authors have commenced performing log transformation oriented normalization and moved further with a primary cue gathering via Principal Component Analysis (PCA). It is also reported that a very few number of overlain genes befall from breast and tumor profiles. Pearson correlation coefficients exhibit stout propensity of breast stromal genes with breast data and prostate stromal genes with prostate data (Figure 1) (Planche et al. 2011). To add to further consolidation of the outcome, survival analysis was carried out using Univariate Cox approach that highlighted genes whose expression levels were crucially associated with the patient survival. The downloaded dataset has no observable missing values in the cells to impute; rather the blank entries are subsidiary to gene and probe ids.
Technically deduced from the background meta-analysis of the subject, we may decipher that cancer will need a host medium (tissue) to proliferate to the other cells/ tissues/ organs. The metastasis front of cancer would seek for the favorable restructuring of the basal tissue framework. From the anticancer therapeutic vantage, hence, it renders incumbent that the oncogenes and stromal response must be equally thrusted.
Through this exemplar multifaceted exegesis, we objectivize to construe the following:
Contrivance of differentially expressed genes (DEG)
GRN reconstruction, and
Decoding functionally coherent genes (eliciting anonymous genes) in accord to iso-form expression.
Designing a classifier (machine learning approach) that embraces a threshold value of gene expression that triggers ambient oncological desmoplastic response.
From statistical standpoint, the data concerned is paired, i.e. two different conditions (cancerous and normal, here) hybridized on the same slide. A recce exhibits noticeable gene entries that outlie the tightly stratified expression space, as can be derived from the Fig. 3. The dataset dimension of 54675 features tacitly conveys the infestation of multiple gene entries associated with diverse probes. However, a cursory reconnaissance shall also establish that there is only one replicate to each experimental condition.
In recent years, the molecular data has become reverently large. R has evolved as the defacto tool for genomic data analysis attributable to its IDE, flexibility and workflow control. Amongst others Python is a viable option too. Biopython is a dedicated version of the language for biological data analytics. However, R has an edge over other languages in terms of packages (functionalities) to cope with the multidimensional data. Being open-source and fully distributable adds to the prowess as well.
Declaration of Interest
The authors register no conflict of interest.
Author Contribution
Conception and design: Rajni Jaiswal, Shaurya Jauhari.
Development of methodology: Rajni Jaiswal
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): Rajni Jaiswal, Shaurya Jauhari.
Writing, review, and/or revision of the manuscript: Shaurya Jauhari, S.A.M. Rizvi.
Study supervision: S.A.M. Rizvi.
Stromal Data Analysis: R Script file
# Installing GEOquery source(“http://www.bioconductor.org/biocLite.R”) biocLite(“GEOquery”) # Loading GEO file with GEOquery library(Biobase) library(GEOquery) #Download GPL file, put it in the current directory, and load it: gpl570 <- getGEO(‘GPL570’, destdir=“.”) #Or, open an existing GPL file: gpl570 <- getGEO(filename=‘GPL570.soft’) # Handpicked description (three columns: ID, Gene Symbol, Gene Title). Table(gpl570) [c(“ID”,”Gene Symbol”,”Gene Title”)] IDs <- attr(dataTable(gpl570), “table”)[, c(“ID”, “Gene Symbol”, “Gene Title”)] # Extract the expression values from the dataset # line 64 contains field names DS_Main <- read.table(“GSE26910_series_matrix.txt.gz”, skip = 63, header = TRUE, sep = “\t”, fill = TRUE) # Remove the last line from the matrix that says “!series_matrix_table_end” DS_Main <- DS_Main[-54676, ] # Merging the annotation information to the expression values matrix and rejecting null entries. names(IDs)[1] <- “ID_REF” DS <- merge(IDs,DS_Main, by = “ID_REF”) DS[DS == ““] <-NA DS <- na.omit(DS) # Reordering of respective breast cancer and prostate cancer datsets. # Prostate Normal [1:6], Prostate Tumor [7:12], Breast Normal [13:18], Breast Tumor [19:24] WorkDS <- DS [c(4,6,8,10,12,14, 5,7,9,11,13,15, 16,18,20,22,24,26, 17,19,21,23,25,27)] # RowMeans calculation ProstateNormalMean <- rowMeans(log2(WorkDS[,1:6])) ProstateTumorMean <- rowMeans(log2(WorkDS[,7:12])) BreastNormalMean <- rowMeans(log2(WorkDS[,13:18])) BreastTumorMean <- rowMeans(log2(WorkDS[,19:24])) # MA-Plot par(mfrow=c(1,2)) ProstateMean <- rowMeans(log2(WorkDS[, 1:12])) BreastMean <- rowMeans(log2(WorkDS[, 13:24])) plot(ProstateMean, ProstateTumorMean-ProstateNormalMean, main=“MA Plot for Prostate data”, pch=16, cex=0.35) hold() plot(BreastMean, BreastTumorMean-BreastNormalMean, main="MA Plot for Breast data", pch=16, cex=0.35) # Rough draft of extreme probes DS[which.min(BreastTumorMean-BreastNormalMean), ] ### most negatively expressed breast gene DS[which.min(ProstateTumorMean-ProstateNormalMean), ] ### most negatively expressed prostate gene DS[which.max(ProstateTumorMean-ProstateNormalMean), ] ### most positively expressed prostate gene DS[which.max(BreastTumorMean-BreastNormalMean), ] ### most positively expressed breast gene # Standard Deviation calculation for t-test install.packages(genefilter) library(genefilter) ProstateNormalSD <- rowSds(log2(WorkDS[,1:6])) ProstateTumorSD <- rowSds(log2(WorkDS[,7:12])) BreastNormalSD <- rowSds(log2(WorkDS[,13:18])) BreastTumorSD <- rowSds(log2(WorkDS[,19:24])) # t-test calculation and histogram plot par(mfrow=c(1,2)) Prostate_ttest <-(ProstateTumorMean-ProstateNormalMean)/sqrt(ProstateTumorSD^2/6 + ProstateNormalSD^2/6) hist(Prostate_ttest,nclass=100) hold() Breast_ttest <- (BreastTumorMean-BreastNormalMean)/sqrt(BreastTumorSD^2/6 + BreastNormalSD^2/6) hist(Breast_ttest, nclass=100) # p-value calculation and histogram plot Prostate_pval <- 2*(1-pt(abs(Prostate_ttest),5)) Breast_pval <- 2*(1-pt(abs(Breast_ttest),5)) par(mfrow=c(1,2)) hist(Prostate_pval, nclass=100) hold() hist(Breast_pval, nclass = 100) # volcano Plot par(mfrow=c(1,2)) plot(ProstateTumorMean-ProstateNormalMean, -log10(Prostate_pval), main =“Volcano Plot@Prostate tissue”, xlab= “Sample Mean Difference”, ylab= “-log10(p value)”, pch=16, cex=0.35) hold() plot(BreastTumorMean-BreastNormalMean, -log10(Breast_pval), main =“Volcano Plot@Breast tissue”, xlab= “Sample Mean Difference”, ylab= “-log10(p value)”, pch=16, cex=0.35) # Boxplots for the normal data and its log transformed version.(Log2 transformation applied) par(mfrow = c(1, 2)) boxplot(WorkDS, col = c(2,3,2,3,2,3,2,3,2,3,2,3), main = “Expression values pre-normalization”, xlab = “Slides”, ylab = “Expression”, las = 2, cex.axis = 0.7) hold() boxplot(log2(WorkDS), col = c(2,3,2,3,2,3,2,3,2,3,2,3), main = “Expression values post-log-transformation”, xlab = “Slides”, ylab = “Expression”, las = 2, cex.axis = 0.7) abline(0, 0, col = "black") # Check Normality par(mfrow=c(1,2)) qqnorm(Prostate_ttest, main = “QQ Plot@Prostate Data”) qqline(Prostate_ttest) hold() qqnorm(Breast_ttest, main = “QQ Plot@Breast Data”) qqline(Breast_ttest) # Elucidating genes with particular p-values. for (i in c(0.01, 0.05, 0.001, 1e-04, 1e-05, 1e-06, 1e-07)) print(paste(“genes with p-values smaller than”,i, length(which(Prostate_pval < i)))) for (i in c(0.01, 0.05, 0.001, 1e-04, 1e-05, 1e-06, 1e-07)) print(paste(“genes with p-values smaller than”,i, length(which(Breast_pval < i)))) # Plot heatmap of differentially expressed genes: Genes are differentially expressed if its p-value is under a given threshold, which must be smaller than the usual 0.05 or 0.01 due to multiplicity of tests BreastDEGenes <- data.frame(which(Breast_pval < 0.01)) ProstateDEGenes <- data.frame(which(Prostate_pval < 0.01)) ProstateDEGenesData <- ProstateDEGenes[, 1] BreastDEGenesData <- BreastDEGenes[, 1] ProstateData <- as.matrix(WorkDS[ProstateDEGenesData, 1:12]) heatmap(ProstateData, col = topo.colors(100), cexRow = 0.5) BreastData <- as.matrix(WorkDS[BreastDEGenesData, 13:24]) heatmap(BreastData, col = topo.colors(100), cexRow = 0.5) # List of differentially expressed genes. #Breast Data BDEG <- matrix(nrow = nrow(BreastDEGenes), ncol = 1) for(i in 1:nrow(BreastDEGenes)) BDEG[i,]<- paste(DS[BreastDEGenes[i,], “ID_REF”]) BDEG <- as.data.frame(BDEG) names(BDEG)[1] <- “ID_REF” FinalBDEG <- merge(BDEG,DS) BDEG <- merge(BDEG, IDs, by = ‘ID_REF’) view(BDEG) #Prostate Data PDEG <- matrix(nrow = nrow(ProstateDEGenes),ncol = 1) for(i in 1:nrow(ProstateDEGenes)) PDEG[i,] <- paste(DS[ProstateDEGenes[i,], “ID_REF”]) PDEG <- as.data.frame(PDEG) names(PDEG)[1] <- “ID_REF” FinalPDEG <- merge(PDEG,DS) PDEG <- merge(PDEG, IDs, by = ‘ID_REF’) view(PDEG) ##Intersecting transcripts in breast and prostate cancer types as marked in the dataset BDEG$match <- match(BDEG$location, PDEG$location, nomatch=0) # Reordering of respective breast cancer and prostate cancer datsets. # Prostate Normal [1:6], Prostate Tumor [7:12], Breast Normal [13:18], Breast Tumor [19:24] FinalPDEG <- FinalPDEG [c(4,6,8,10,12,14, 5,7,9,11,13,15, 16,18,20,22,24,26, 17,19,21,23,25,27)] WorkFinalPDEG <- FinalPDEG[1:12] FinalBDEG <- FinalBDEG [c(4,6,8,10,12,14, 5,7,9,11,13,15, 16,18,20,22,24,26, 17,19,21,23,25,27)] WorkFinalBDEG <- FinalBDEG[13:24] ##Prostate data regrerssion analysis(linear model) par(mfrow=c(1,6)) plot(log2(WorkFinalPDEG$GSM662756),log2(WorkFinalPDEG$GSM662757), pch = 16, cex = 1.3, col = c(“blue”,”red”)) abline(lm(log2(WorkFinalPDEG$GSM662756) ∼ log2(WorkFinalPDEG$GSM662757)), col= 1) plot(log2(WorkFinalPDEG$GSM662758),log2(WorkFinalPDEG$GSM662759), pch = 16, cex = 1.3, col = c(“blue”,”red”)) abline(lm(log2(WorkFinalPDEG$GSM662758) ∼ log2(WorkFinalPDEG$GSM662759)), col= 1) plot(log2(WorkFinalPDEG$GSM662760),log2(WorkFinalPDEG$GSM662761), pch = 16, cex = 1.3, col = c(“blue”,”red”)) abline(lm(log2(WorkFinalPDEG$GSM662760) ∼ log2(WorkFinalPDEG$GSM662761)), col= 1) plot(log2(WorkFinalPDEG$GSM662762),log2(WorkFinalPDEG$GSM662763), pch = 16, cex = 1.3, col = c(“blue”,”red”)) abline(lm(log2(WorkFinalPDEG$GSM662762) ∼ log2(WorkFinalPDEG$GSM662763)), col= 1) plot(log2(WorkFinalPDEG$GSM662764),log2(WorkFinalPDEG$GSM662765), pch = 16, cex = 1.3, col = c(“blue”,”red”)) abline(lm(log2(WorkFinalPDEG$GSM662764) ∼ log2(WorkFinalPDEG$GSM662765)), col= 1) plot(log2(WorkFinalPDEG$GSM662766),log2(WorkFinalPDEG$GSM662767), pch = 16, cex = 1.3, col = c(“blue”,”red”)) abline(lm(log2(WorkFinalPDEG$GSM662766) ∼ log2(WorkFinalPDEG$GSM662767)), col= 1) ##Gene Set Enrichment Analysis library(genefilter) library(GSEABase) Breast_GSEA <- GeneSetCollection(WorkFinalBDEG, setType = KEGGCollection()) Prostate_GSEA <- GeneSetCollection(WorkFinalPDEG, setType = KEGGCollection()) ##Correlation Analysis ##Breast WorkFinalBDEG <- read.csv(“WorkFinalBDEG_GSEAFiltered.csv”) ## Import filtered annotation file from MeV. btemp <- WorkFinalBDEG btemp$ID_REF <-NULL btemp <- log2(btemp) pairs(btemp) BreastCorrelationMatrix <- cor(t(as.matrix(btemp))) BreastCorMat <- as.data.frame(BreastCorrelationMatrix) rownames(BreastCorMat) <- WorkFinalBDEG$ID_REF colnames(BreastCorMat) <- WorkFinalBDEG$ID_REF ##Prostate WorkFinalPDEG <- read.csv(“WorkFinalPDEG_GSEAFiltered.csv”) ## Import filtered annotation file from MeV. ptemp <- WorkFinalPDEG ptemp$ID_REF <-NULL ptemp <- log2(ptemp) pairs(ptemp) ProstateCorrelationMatrix <- cor(t(as.matrix(ptemp))) ProCorMat<- as.data.frame(ProstateCorrelationMatrix) rownames(ProCorMat) <- WorkFinalPDEG$ID_REF colnames(ProCorMat)<- WorkFinalPDEG$ID_REF ### Feature Selection: Clustering of robustly entwined genes. install.packages(“gplots”) install.packages(“Hmisc”) library(Hmisc) library(gplots) heatmap.2(ProstateCorrelationMatrix, main=“Hierarchical Cluster”, dendrogram=“column”,trace=“none”,col=greenred(10)) heatmap.2(1-abs(ProstateCorrelationMatrix), distfun=as.dist, trace=“none”) heatmap.2(BreastCorrelationMatrix, main=“Hierarchical Cluster”, dendrogram=“column”,trace=“none”,col=greenred(10)) heatmap.2(1-abs(BreastCorrelationMatrix), distfun=as.dist, trace=“none”) ##Prostate Data library(caret) HighlyCorrelated <- findCorrelation(ProstateCorrelationMatrix, cutoff = 0.95, verbose = TRUE, names = FALSE) print(HighlyCorrelated) WorkFinalPDEG[HighlyCorrelated,1] IDs[WorkFinalPDEG[HighlyCorrelated,1],c(2,3)] for(i in 2:nrow(BreastCorMat)) { for(j in 1:ncol(BreastCorMat)-1) { if(i>j) { out <- c (rownames(BreastCorMat[i,]), colnames(BreastCorMat[j]), BreastCorMat[i,j]) write.table(out, file=“output.txt”, append=TRUE, sep= “ “) } else break } } #Network Ready Matrix Format (Function) // Credit: http://www.sthda.com flattenCorrMatrix <-function(cmat, pmat) { ut <- upper.tri(cmat) data.frame( row = rownames(cmat)[row(cmat)[ut]], column = rownames(cmat)[col(cmat)[ut]], cor =(cmat)[ut], p = pmat[ut] ) } library(Hmisc) btemp <- as.matrix(btemp) rownames(btemp)<- WorkFinalBDEG$ID_REF BreastNet <-rcorr(t(btemp)) BreastNetworkInputMatrix<- flattenCorrMatrix(BreastNet$r, BreastNet$P) #lets map the gene names to row and column entries BreastNetworkInputMatrix$row <-IDs[WorkFinalBDEG[BreastNetworkInputMatrix$row,1],2] BreastNetworkInputMatrix$column <-IDs[WorkFinalBDEG[BreastNetworkInputMatrix$column,1],2] ptemp <- as.matrix(ptemp) rownames(ptemp)<- WorkFinalPDEG$ID_REF ProstateNet <-rcorr(t(ptemp)) ProstateNetworkInputMatrix <- flattenCorrMatrix(ProstateNet$r, ProstateNet$P) ProstateNetworkInputMatrix$row <-IDs[WorkFinalPDEG[ProstateNetworkInputMatrix$row,1],2] ProstateNetworkInputMatrix$column <-IDs[WorkFinalPDEG[ProstateNetworkInputMatrix$column,1],2] symnum(BreastCorrelationMatrix) symnum(ProstateCorrelationMatrix) install.packages(“corrplot”) library(corrplot) corrplot(BreastCorrelationMatrix, type=“upper”, order=“hclust”, tl.col=“black”, tl.srt=45) corrplot(ProstateCorrelationMatrix, type=“upper”, order=“hclust”, tl.col=“black”, tl.srt=45) install.packages(“PerformanceAnalytics”) library(PerformanceAnalytics) chart.Correlation(BreastCorrelationMatrix, histogram= TRUE, pch= 19) chart.Correlation(ProstateCorrelationMatrix, histogram= TRUE, pch= 19) col<- colorRampPalette(c(“blue”, “white”, “red”))(20) heatmap(x = BreastCorrelationMatrix, col = col, symm = TRUE) heatmap(x = ProstateCorrelationMatrix, col = col, symm = TRUE) ## Optimize network ready correlation and p-values matrix ## top candidates which manifest low p-value and high correlation. BreastFinal <- BreastNetworkInputMatrix[which(abs(BreastNetworkInputMatrix$cor) > 0.95 | BreastNetworkInputMatrix$p < 0.0000001), c(1,2,3,4)] ProstateFinal <- ProstateNetworkInputMatrix[which(abs(ProstateNetworkInputMatrix$cor) > 0.95 | ProstateNetworkInputMatrix$p < 0.0000001), c(1,2,3,4)] write.csv(BreastFinal, “BreastFinalTest.csv”) write.csv(ProstateFinal, “ProstateFinalTest.csv”) ##Intersecting transcripts in breast and prostate cancer types as marked in the dataset BDEG$match <- match(BDEG$location, PDEG$location, nomatch=0) # Reordering of respective breast cancer and prostate cancer datsets. # Prostate Normal [1:6], Prostate Tumor [7:12], Breast Normal [13:18], Breast Tumor [19:24] FinalPDEG <- FinalPDEG [c(4,6,8,10,12,14, 5,7,9,11,13,15, 16,18,20,22,24,26, 17,19,21,23,25,27)] WorkFinalPDEG <- FinalPDEG[1:12] FinalBDEG <- FinalBDEG [c(4,6,8,10,12,14, 5,7,9,11,13,15, 16,18,20,22,24,26, 17,19,21,23,25,27)] WorkFinalBDEG <- FinalBDEG[13:24] ##Prostate data regrerssion analysis(linear model) par(mfrow=c(1,6)) plot(log2(WorkFinalPDEG$GSM662756),log2(WorkFinalPDEG$GSM662757), pch = 16, cex = 1.3, col = c(“blue”,”red”)) abline(lm(log2(WorkFinalPDEG$GSM662756) ∼ log2(WorkFinalPDEG$GSM662757)), col= 1) plot(log2(WorkFinalPDEG$GSM662758),log2(WorkFinalPDEG$GSM662759), pch = 16, cex = 1.3, col = c(“blue”,”red”)) abline(lm(log2(WorkFinalPDEG$GSM662758) ∼ log2(WorkFinalPDEG$GSM662759)), col= 1) plot(log2(WorkFinalPDEG$GSM662760),log2(WorkFinalPDEG$GSM662761), pch = 16, cex = 1.3, col = c(“blue”,”red”)) abline(lm(log2(WorkFinalPDEG$GSM662760) ∼ log2(WorkFinalPDEG$GSM662761)), col= 1) plot(log2(WorkFinalPDEG$GSM662762),log2(WorkFinalPDEG$GSM662763), pch = 16, cex = 1.3, col = c(“blue”,”red”)) abline(lm(log2(WorkFinalPDEG$GSM662762) ∼ log2(WorkFinalPDEG$GSM662763)), col= 1) plot(log2(WorkFinalPDEG$GSM662764),log2(WorkFinalPDEG$GSM662765), pch = 16, cex = 1.3, col = c(“blue”,”red”)) abline(lm(log2(WorkFinalPDEG$GSM662764) ∼ log2(WorkFinalPDEG$GSM662765)), col= 1) plot(log2(WorkFinalPDEG$GSM662766),log2(WorkFinalPDEG$GSM662767), pch = 16, cex = 1.3, col = c(“blue”,”red”)) abline(lm(log2(WorkFinalPDEG$GSM662766) ∼ log2(WorkFinalPDEG$GSM662767)), col= 1) ##Gene Set Enrichment Analysis library(genefilter) library(GSEABase) Breast_GSEA <- GeneSetCollection(WorkFinalBDEG, setType = KEGGCollection()) Prostate_GSEA <- GeneSetCollection(WorkFinalPDEG, setType = KEGGCollection()) # Support Vector Machine Implementation ## Prostate install.packages(“e1071”) library(e1071) temp1 <- WorkFinalPDEG temp1$ID_REF <-NULL temp1 <- log2(temp1) temp1 <- t(temp1) ClassLabels1 <- c(rep(1,6),rep(-1,6)) DataFrame1 <- data.frame(Gene=temp1,ClassLabels=as.factor(ClassLabels1)) SVMModel1 <- svm(ClassLabels1∼., data=DataFrame1, kernel=“linear”, cost=10, scale = FALSE) GeneWeights1<-t(SVMModel1$coefs)%*%SVMModel1$SV sort.list(GeneWeights1) ## Genes 212 and 129 have highest and second highest weights, respectively. plot(SVMModel1,DataFrame1, Gene.212 ∼ Gene.129) ##Breast temp2 <- WorkFinalBDEG temp2$ID_REF <-NULL temp2 <- log2(temp2) temp2 <- t(temp2) ClassLabels2<- c(rep(1,6),rep(-1,6)) DataFrame2 <- data.frame(Gene=temp2,ClassLabels=as.factor(ClassLabels2)) SVMModel2 <- svm(ClassLabels2∼., data=DataFrame2, kernel=“linear”, cost=10, scale = FALSE) GeneWeights2<-t(SVMModel2$coefs)%*%SVMModel2$SV sort.list(GeneWeights2) ## Genes 346 and 133 have highest and second highest weights, respectively. plot(SVMModel2,DataFrame2, Gene.346 ∼ Gene.133) install.packages(“kernlab”) library(kernlab) x <- as.matrix(temp1) y <- matrix(c(rep(1,6),rep(-1,6))) svp <- ksvm(x,y,type=“C-svc”, prob.model= TRUE) predict (svp, x, type= “probabilities”)Footnotes
- Abbreviations
- ANN
- Artificial Neural Networks
- AR
- Androgfen Receptor
- DEG
- Differentially Expressed Genes
- ECM
- Extracellular Matrix
- ER
- Estrogen Receptor
- FDA
- Fisher Discriminant Analysis
- GRN
- Gene Regulatory Network
- GSEA
- Gene Set Enrichment Analysis
- GWAS
- Genome Wide Association Studies
- IHC
- Immunohistochemistry
- LCM
- Laser Capture Microdissection
- NGS
- Next Generation Sequencing
- PCA
- Principal Component Analysis
- PCR
- Polymerase Chain Reaction
- PSO
- Particle Swarm Optimization
- SVM
- Support Vector Machines
- HER2
- Human Epidermal growth factor Receptor 2
- PR
- Progesterone Receptor