Abstract
Protein complexes can be computationally identified from protein-interaction networks with community detection methods, suggesting new multi-protein assemblies. Most community detection algorithms tend to be un- or semi-supervised and assume that communities are dense network subgraphs, which is not always true, as protein complexes can exhibit diverse network topologies. The few existing supervised machine learning methods are serial and can potentially be improved in terms of accuracy and scalability by using better-suited machine learning models and by using parallel algorithms, respectively. Here, we present Super.Complex, a distributed supervised machine learning pipeline for community detection in networks. Super.Complex learns a community fitness function from known communities using an AutoML method and applies this fitness function to detect new communities. A heuristic local search algorithm finds maximally scoring communities with epsilon-greedy and pseudo-metropolis criteria, and an embarrassingly parallel implementation can be run on a computer cluster for scaling to large networks. In order to evaluate Super.Complex, we propose three new measures for the still outstanding issue of comparing sets of learned and known communities. On a yeast protein-interaction network, Super.Complex outperforms 6 other supervised and 4 unsupervised methods. Application of Super.Complex to a human protein-interaction network with ~8k nodes and ~60k edges yields 1,028 protein complexes, with 234 complexes linked to SARS-CoV-2, with 111 uncharacterized proteins present in 103 learned complexes. Super.Complex is generalizable and can be used in different applications of community detection, with the ability to improve results by incorporating domain-specific features. Learned community characteristics can also be transferred from existing applications to detect communities in a new application with no known communities. Code and interactive visualizations of learned human protein complexes are freely available at: https://sites.google.com/view/supercomplex/super-complex-v3-0.
Author summary Characterization of protein complexes, i.e. sets of proteins assembling into a single larger physical entity, is important, as such assemblies play many essential roles in cells, ranging from gene regulation (as for the elaborate multiprotein complexes of mRNA transcription and elongation) to establishing the major structural elements of cells (as for key cytoskeletal protein complexes, such as microtubules and their trafficking proteins). Disruption of protein-protein interactions often leads to disease, therefore identifying a complete list of protein complexes allows us to better understand the association of protein and disease. From networks of protein-protein interactions, potential protein complexes can be identified computationally through the application of community detection methods, which flag groups of entities interacting with each other in certain patterns. In this work, we present Super.Complex, a generalizable and scalable supervised machine learning-based community detection algorithm that outperforms existing methods by accurately learning and using patterns from known communities. We propose 3 novel evaluation measures to compare learned and known communities, an outstanding issue. We use Super.Complex to identify 1028 human protein complexes, including 234 complexes linked to SARS-CoV-2, the virus causing COVID-19, and 103 complexes containing 111 uncharacterized proteins.
Competing Interest Statement
The authors have declared no competing interest.