On triangular inequalities of correlation-based distances for gene expression profiles

Jiaxing Chen; Yen Kaow Ng; Lu Lin; Yiqi Jiang; Shuaicheng Li

doi:10.1101/582106

Abstract

Various distance functions for evaluating the differences between gene expression profiles have been proposed in the past. Such a function would output a low value if the profiles are strongly correlated, either negatively or positively, and a high value otherwise. One popular distance function is the absolute correlation distance, d_a = 1 − |ρ|, where ρ is some similarity measure, such as the Pearson or Spearman correlation. However, the absolute correlation distance fails to fulfill the triangular inequality, a property that would guarantee better performance in vector quantization, allow fast data localization, and speed up data clustering. In this work, we propose d_r as an alternative. We prove that d_r satisfies the triangular inequality when ρ represents Pearson correlation, Spearman correlation, or cosine similarity. We empirically compared d_r with d_a in gene clustering and sample clustering experiments, using real biological data. The two distances performed similarly in both gene clustering and sample clustering, under both hierarchical clustering and PAM clustering. However, d_r demonstrated more robust clustering. In bootstrap experiments, d_r generated the more robust sample pair partition significantly more often (p-value < 0.05). This advantage in robustness is also supported by the class “dissolved” events.
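The failure of d_a to satisfy the triangular inequality can be checked on a small example. The sketch below is illustrative only and is not taken from the paper: the three toy profiles x, y, z and the helper names `pearson` and `d_a` are assumptions chosen for the demonstration, with ρ taken to be the Pearson correlation. The profiles are built so that x and z are uncorrelated while y lies "halfway" between them.

```python
import math

def pearson(u, v):
    """Pearson correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    du = [a - mu for a in u]
    dv = [b - mv for b in v]
    num = sum(a * b for a, b in zip(du, dv))
    den = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    return num / den

def d_a(u, v):
    """Absolute correlation distance d_a = 1 - |rho| from the abstract."""
    return 1.0 - abs(pearson(u, v))

# Hypothetical toy profiles (not real expression data).
# x and z are zero-mean and orthogonal, so their correlation is 0.
x = [1.0, -1.0, 0.0]
z = [1.0, 1.0, -2.0]
# y is the sum of the unit-normalised x and z, i.e. at 45 degrees to both.
y = [a / math.sqrt(2) + b / math.sqrt(6) for a, b in zip(x, z)]

d_xy, d_yz, d_xz = d_a(x, y), d_a(y, z), d_a(x, z)

# The triangular inequality would require d_xz <= d_xy + d_yz.
violated = d_xz > d_xy + d_yz
print(f"d_a(x,y)={d_xy:.4f}  d_a(y,z)={d_yz:.4f}  d_a(x,z)={d_xz:.4f}")
print("triangular inequality violated:", violated)
```

Here the direct distance between x and z exceeds the sum of the two distances through y, so d_a cannot be a metric on these profiles.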

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.