PT  - JOURNAL ARTICLE
AU  - Florian Massip
AU  - Michael Sheinman
AU  - Sophie Schbath
AU  - Peter F. Arndt
TI  - Comparing the Statistical Fate of Paralogous and Orthologous Sequences
AID  - 10.1101/053843
DP  - 2016 Jan 01
TA  - bioRxiv
PG  - 053843
4099  - http://biorxiv.org/content/early/2016/05/17/053843.short
4100  - http://biorxiv.org/content/early/2016/05/17/053843.full
AB  - Sequence alignment has been a widely used tool in bioinformatics for several decades. For instance, finding homologous sequences with known function in large databases is used to gain insight into the function of non-annotated genomic regions. Very efficient tools, such as BLAST, have been developed to identify and rank possible homologous sequences. To estimate the significance of the homology, the ranking of alignment scores takes a background model for random sequences into account. Using this model, one can estimate the probability of finding two exactly matching subsequences by chance in two unrelated sequences. The corresponding probability for two homologous sequences is much higher, which allows them to be identified. Here we focus on the distribution of lengths of exact sequence matches between protein-coding regions of pairs of evolutionarily distant genomes. We show that this distribution exhibits a power-law tail with exponent α = −5. Developing a simple model of sequence evolution by substitutions and segmental duplications, we show analytically that paralogous and orthologous gene pairs contribute differently to this distribution. Our model explains the differences observed in the comparison of coding and non-coding parts of genomes, thus providing a better understanding of the statistical properties of genomic sequences and their evolution.