Saturday, July 28, 2012

Non-metric similarities and clustering algorithms

I have often had to deal with clustering data for which a distance metric cannot be easily defined. In such cases it is sometimes possible to define a non-metric dissimilarity function that satisfies the first three of the following conditions, but not the fourth (the triangle inequality):

  1. \( d(x,x) \ge 0, \forall x \)
  2. \( d(x,y) = 0 \Leftrightarrow x = y \)
  3. \( d(x,y) = d(y,x) \)
  4. \( d(x,y) \le d(x,z) + d(z,y) \)
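A classic example of a function that satisfies the first three conditions but not the fourth is the squared Euclidean distance, sketched here for scalars:

```python
# Squared Euclidean distance: non-negative, symmetric, and zero iff x == y,
# but it violates the triangle inequality.
def d(x, y):
    return (x - y) ** 2

print(d(0, 2))            # 4
print(d(0, 1) + d(1, 2))  # 2  ->  d(0,2) > d(0,1) + d(1,2)
```

Since \(d(0,2) = 4 > 2 = d(0,1) + d(1,2)\), condition 4 fails, so this dissimilarity is not a metric.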

Fortunately, some clustering algorithms are applicable to pairwise dissimilarity matrices computed using (symmetric) non-metric dissimilarity functions, as well as distance functions in arbitrary metric spaces. For example, [1] describes the SLINK algorithm for single-linkage hierarchical clustering, and notes that it only requires a "symmetric non-negative dissimilarity coefficient" for which \(d(a,a)=0, \forall a \). Similarly, [2] presents the k-medoids algorithm, explicitly states that "such dissimilarities do not have to satisfy the triangle inequality", and provides an example of such an application.
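To illustrate that only pairwise dissimilarities are needed, here is a minimal sketch of an alternating k-medoids update operating directly on a precomputed symmetric dissimilarity matrix. This is a simplification, not the full PAM algorithm of [2] (which uses a swap phase); the function name and random initialization are my own choices.

```python
import random

def k_medoids(D, k, iters=100, seed=0):
    """Cluster n objects given only an n x n dissimilarity matrix D
    (list of lists). No triangle inequality is required.
    Returns (medoid indices, cluster index per object)."""
    n = len(D)
    rng = random.Random(seed)
    medoids = rng.sample(range(n), k)
    for _ in range(iters):
        # Assignment step: each object joins its nearest medoid.
        assign = [min(range(k), key=lambda j: D[i][medoids[j]])
                  for i in range(n)]
        # Update step: each medoid becomes the cluster member that
        # minimizes the total in-cluster dissimilarity.
        new_medoids = []
        for j in range(k):
            members = [i for i in range(n) if assign[i] == j]
            if not members:
                new_medoids.append(medoids[j])
                continue
            new_medoids.append(
                min(members, key=lambda m: sum(D[m][i] for i in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    # Recompute the assignment for the final medoids.
    assign = [min(range(k), key=lambda j: D[i][medoids[j]])
              for i in range(n)]
    return medoids, assign

# Toy example: objects {0,1} and {2,3} form two obvious clusters.
D = [[0, 1, 9, 9],
     [1, 0, 9, 9],
     [9, 9, 0, 1],
     [9, 9, 1, 0]]
medoids, assign = k_medoids(D, 2)
print(assign)  # objects 0,1 share a cluster; so do 2,3
```

Note that `D` is consumed as an opaque table of numbers: nothing in the algorithm relies on the dissimilarities coming from a metric space.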


By contrast, other clustering algorithms, such as [3] and [4], do not require a vector space representation of the objects to be clustered, but they do require that the function used to compute the dissimilarity matrix be an actual metric.


[1] Sibson. "SLINK: An optimally efficient algorithm for the single-link cluster method"
(http://www.cs.gsu.edu/~wkim/index_files/papers/sibson.pdf)
[2] L. Kaufman and P.J. Rousseeuw. "Clustering by means of medoids"
(ftp://ftp.win.ua.ac.be/pub/preprints/87/Clumea87.pdf)
[3] J. Zhou and J. Sander. "Data Bubbles for Non-Vector Data: Speeding-up Hierarchical 
Clustering in Arbitrary Metric Spaces"
(http://www.vldb.org/conf/2003/papers/S14P03.pdf)
[4] V. Ganti et al. "Clustering Large Datasets in Arbitrary Metric Spaces"
(http://www.cs.virginia.edu/~cyberia/papers/ICDE99.pdf)



Friday, July 27, 2012

Tversky index

Measuring the similarity between sets is useful in many applications. The Jaccard index is probably one of the best-known set similarity indexes. The Tversky index is a more general index, which reduces to the Jaccard index when \(\gamma\) and \(\eta\) are both equal to 1, and to the Dice index when they are both equal to 0.5:

\[sim(X,Y) = \frac{|X \cap Y|}{|X \cap Y| + \gamma|X - Y| + \eta |Y - X|}\]

(see http://en.wikipedia.org/wiki/Tversky_index)
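The formula above translates almost directly into code. A small sketch (the function name is my own):

```python
def tversky(X, Y, gamma, eta):
    """Tversky index between two sets X and Y."""
    inter = len(X & Y)
    denom = inter + gamma * len(X - Y) + eta * len(Y - X)
    # Both sets empty: define the similarity as 1 (a convention).
    return inter / denom if denom else 1.0

A = {1, 2, 3, 4}
B = {3, 4, 5}
print(tversky(A, B, 1, 1))      # Jaccard: 2 / (2 + 2 + 1) = 0.4
print(tversky(A, B, 0.5, 0.5))  # Dice:    2 / (2 + 1 + 0.5) ~ 0.571
```

As a sanity check, \(|A \cup B| = 5\), so the Jaccard index is \(2/5 = 0.4\), and the Dice coefficient \(2|A \cap B|/(|A|+|B|) = 4/7 \approx 0.571\) matches the \(\gamma=\eta=0.5\) case.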