Saturday, July 28, 2012

Non-metric similarities and clustering algorithms

I have often had to deal with clustering data for which a distance metric cannot be easily defined. In fact, in some cases it is possible to define a non-metric dissimilarity function that satisfies the first three of the following conditions, but not the fourth one (the triangle inequality):

  1. \( d(x,x) \ge 0, \forall x \)
  2. \( d(x,y) = 0 \Leftrightarrow x = y \)
  3. \( d(x,y) = d(y,x) \)
  4. \( d(x,y) \le d(x,z) + d(z,y) \)

Fortunately, some clustering algorithms are applicable to pairwise dissimilarity matrices computed using (symmetric) non-metric dissimilarity functions (as well as distance functions in arbitrary metric spaces). For example, [1] describes the SLINK algorithm for single-linkage hierarchical clustering, and mentions that it only requires a "symmetric non-negative dissimilarity coefficient" for which \(d(a,a)=0,  \forall a \). Also, [2] presents the k-medoids algorithm, and explicitly states that "such dissimilarities do not have to satisfy the triangle inequality" and provides an example of application for such a scenario.


While there exist other clustering algorithms, such as [3] and [4], that do not require a vector space representation of the objects to be clustered, they do require that the function used to compute the dissimilarity matrix is actually a metric.


[1] Sibson. "SLINK: An optimally efficient algorithm for the single-link cluster method"
(http://www.cs.gsu.edu/~wkim/index_files/papers/sibson.pdf)
[2] L. Kaufman and P.J. Rousseeuw. "Clustering by means of medoids"
(ftp://ftp.win.ua.ac.be/pub/preprints/87/Clumea87.pdf)
[3] J. Zhou and J. Sander. "Data Bubbles for Non-Vector Data: Speeding-up Hierarchical 
Clustering in Arbitrary Metric Spaces"
(http://www.vldb.org/conf/2003/papers/S14P03.pdf)
[4] V. Ganti et al. "Clustering Large Datasets in Arbitrary Metric Spaces"
(http://www.cs.virginia.edu/~cyberia/papers/ICDE99.pdf)



No comments:

Post a Comment

Note: Only a member of this blog may post a comment.