TY - JOUR
T1 - Nano Random Forests to mine protein complexes and their relationships in quantitative proteomics data
JF - bioRxiv
DO - 10.1101/050302
SP - 050302
AU - Luis F. Montaño-Gutierrez
AU - Shinya Ohta
AU - Georg Kustatscher
AU - William C. Earnshaw
AU - Juri Rappsilber
Y1 - 2016/01/01
UR - http://biorxiv.org/content/early/2016/04/26/050302.abstract
N2 - The large and ever-increasing number of quantitative proteomics datasets constitutes a currently underexploited resource for drawing biological insights on proteins and their functions. Multiple observations by different laboratories indicate that protein complexes often follow similar trends. However, proteomic data are often noisy and incomplete – members of a complex may correlate weakly or only in a fraction of all experiments, or may not be observed in all experiments. We have previously used the Random Forest (RF) machine-learning algorithm to distinguish functional chromosomal proteins from ‘hitchhikers’ in an analysis of mitotic chromosomes. Even though it is assumed that RFs need large training sets, in this technical note we show that RFs are also able to detect small protein complexes and relationships between them. We use artificial datasets to demonstrate the robustness of RFs in identifying small groups even when working with mixes of noisy and apparently uninformative experiments. We then use our procedure to retrieve a number of chromosomal complexes from real quantitative proteomics datasets, comparing wild-type and multiple different knock-out mitotic chromosomes. The procedure also revealed other proteins that covary strongly with these complexes, suggesting novel functional links. Integrating the RF analysis for several complexes revealed the known interdependency of kinetochore subcomplexes, as well as an unexpected dependency between the Constitutive Centromere-Associated Network (CCAN) and the condensin (SMC 2/4) complex.
Serving as a negative control, ribosomal proteins remained independent of kinetochore complexes. Together, these results show that this complex-oriented RF (nanoRF) can uncover subtle protein relationships and higher-order dependencies in integrated proteomics data. Abbreviations: RF, Random Forest; MCCP, Multi-Classifier Combinatorial Proteomics; nanoRF, Random Forests trained with small training sets; MVP, Multivariate proteomic profiling; FP, Fractionation profiling; ICP, interphase chromatin probability; CCAN, Constitutive Centromere-Associated Network; Nup, Nucleoporin; SMC, Structural Maintenance of Chromosomes; SILAC, Stable Isotope Labeling by Amino acids in Cell culture
ER -