RT Journal Article
SR Electronic
T1 Nano Random Forests to mine protein complexes and their relationships in quantitative proteomics data
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 050302
DO 10.1101/050302
A1 Luis F. Montaño-Gutierrez
A1 Shinya Ohta
A1 Georg Kustatscher
A1 William C. Earnshaw
A1 Juri Rappsilber
YR 2016
UL http://biorxiv.org/content/early/2016/05/01/050302.abstract
AB The large and ever-increasing number of quantitative proteomics datasets constitutes a currently underexploited resource for drawing biological insights on proteins and their functions. Multiple observations by different laboratories indicate that protein complexes often follow consistent trends. However, proteomic data is often noisy and incomplete: members of a complex may correlate only in a fraction of all experiments, or may not always be observed. Inclusion of potentially uninformative data hence carries the risk of weakening such biological signals. We have previously used the Random Forest (RF) machine-learning algorithm to distinguish functional chromosomal proteins from ‘hitchhikers’ in an analysis of mitotic chromosomes. Even though it is assumed that RFs need large training sets, in this technical note we show that RFs are also able to detect small high-covariance groups, like protein complexes, and relationships between them. We use artificial datasets to demonstrate the robustness of RFs in identifying small groups even when working with mixes of noisy and apparently uninformative experiments. We then use our procedure to retrieve a number of chromosomal complexes from real quantitative proteomics results, which compare wild-type and multiple different knock-out mitotic chromosomes. The procedure also revealed other proteins that covary strongly with these complexes, suggesting novel functional links.
Integrating the RF analysis for several complexes revealed the known interdependency of kinetochore subcomplexes, as well as an unexpected dependency between the Constitutive-Centromere-Associated Network (CCAN) and the condensin (SMC 2/4) complex. Serving as a negative control, ribosomal proteins remained independent of kinetochore complexes. Together, these results show that this complex-oriented RF (nanoRF) can uncover subtle protein relationships and higher-order dependencies in integrated proteomics data. Abbreviations: RF, Random Forest; MCCP, Multi-Classifier Combinatorial Proteomics; nanoRF, Random forests trained with small training sets; MVP, Multivariate proteomic profiling; FP, Fractionation profiling; ICP, interphase chromatin probability; CCAN, Constitutive Centromere-Associated Network; Nup, Nucleoporin; SMC, Structural Maintenance of Chromosomes; SILAC, Stable Isotope Labeling by Amino acids in Cell culture