Where Natural Protein Sequences Stand out From Randomness

Laura Weidmann; Tjeerd Dijkstra; Oliver Kohlbacher; Andrei Lupas

doi:10.1101/706119

Abstract

Biological sequences are the product of natural selection, raising the expectation that they differ substantially from random sequences. We test this expectation by analyzing all fragments of a given length derived from either a natural dataset or different random models. For each dataset, we compile all pairwise distances in sequence space among its fragments and compare the resulting distance distributions. Even for 100mers, 95.4% of all distances between natural fragments are in accordance with those of a model based on the natural residue composition. Hence, natural sequences are distributed almost randomly in global sequence space. When further accounting for the specific residue composition of domain-sized fragments, 99.2% of all distances between natural fragments can be modeled. Local residue composition, which might reflect biophysical constraints on protein structure, is thus the predominant feature characterizing distances between natural sequences globally, whereas the effects of homology are only barely detectable.
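
The comparison described above can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not specify: Hamming distance stands in for the distance in sequence space, the random model draws each residue independently from the pooled composition of the input, and agreement between two distance distributions is measured as the shared area of their normalized histograms. All function names and the toy sequences are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter
from itertools import combinations


def fragments(sequences, k):
    """Return all overlapping fragments of length k from each sequence."""
    return [s[i:i + k] for s in sequences for i in range(len(s) - k + 1)]


def hamming(a, b):
    """Number of mismatching positions (assumed distance measure)."""
    return sum(x != y for x, y in zip(a, b))


def distance_distribution(frags):
    """Normalized histogram of all pairwise distances among fragments.

    Requires at least two fragments.
    """
    counts = Counter(hamming(a, b) for a, b in combinations(frags, 2))
    total = sum(counts.values())
    return {d: c / total for d, c in counts.items()}


def composition_model(sequences, rng):
    """Random sequences of matching lengths, with residues drawn
    independently from the pooled residue composition (assumed model)."""
    pool = "".join(sequences)
    residues = sorted(set(pool))
    weights = [pool.count(r) for r in residues]
    return ["".join(rng.choices(residues, weights=weights, k=len(s)))
            for s in sequences]


def overlap(p, q):
    """Shared area of two normalized histograms, between 0 and 1
    (assumed measure of agreement)."""
    return sum(min(p.get(d, 0.0), q.get(d, 0.0)) for d in set(p) | set(q))


# Toy input: made-up strings, not taken from any real dataset.
natural = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
           "MSLNEKQRAISFLKDHFSKQLEEKLGLTEVE"]
k = 5  # fragment length; the abstract reports results for 100mers

rng = random.Random(0)
random_seqs = composition_model(natural, rng)

p_natural = distance_distribution(fragments(natural, k))
p_random = distance_distribution(fragments(random_seqs, k))
print(f"shared fraction of distances: {overlap(p_natural, p_random):.3f}")
```

Enumerating all fragment pairs scales quadratically with the number of fragments, so this form is only practical for toy inputs; a dataset of realistic size would need sampling or a more efficient counting scheme.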