Saturday, July 28, 2012

Non-metric similarities and clustering algorithms

I have often had to deal with clustering data for which a distance metric cannot be easily defined. In such cases it is sometimes possible to define a non-metric dissimilarity function that satisfies the first three of the following conditions, but not the fourth (the triangle inequality):

  1. \( d(x,x) \ge 0, \forall x \)
  2. \( d(x,y) = 0 \Leftrightarrow x = y \)
  3. \( d(x,y) = d(y,x) \)
  4. \( d(x,y) \le d(x,z) + d(z,y) \)
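A classic example of a function that satisfies the first three conditions but not the fourth is the squared Euclidean distance, sketched here for scalars:

```python
# Squared Euclidean distance: non-negative, symmetric, and zero iff x == y,
# but it violates the triangle inequality.
def d(x, y):
    return (x - y) ** 2

print(d(0, 2))            # 4
print(d(0, 1) + d(1, 2))  # 2  ->  d(0,2) > d(0,1) + d(1,2)
```

Since \(d(0,2) = 4 > 2 = d(0,1) + d(1,2)\), condition 4 fails, so this dissimilarity is not a metric.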

Fortunately, some clustering algorithms are applicable to pairwise dissimilarity matrices computed using (symmetric) non-metric dissimilarity functions, as well as distance functions in arbitrary metric spaces. For example, [1] describes the SLINK algorithm for single-linkage hierarchical clustering, and notes that it only requires a "symmetric non-negative dissimilarity coefficient" for which \(d(a,a)=0, \forall a \). Similarly, [2] presents the k-medoids algorithm, explicitly states that "such dissimilarities do not have to satisfy the triangle inequality", and provides an example of such an application.
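To illustrate that only pairwise dissimilarities are needed, here is a minimal sketch of an alternating k-medoids update operating directly on a precomputed symmetric dissimilarity matrix. This is a simplification, not the full PAM algorithm of [2] (which uses a swap phase); the function name and random initialization are my own choices.

```python
import random

def k_medoids(D, k, iters=100, seed=0):
    """Cluster n objects given only an n x n dissimilarity matrix D
    (list of lists). No triangle inequality is required.
    Returns (medoid indices, cluster index per object)."""
    n = len(D)
    rng = random.Random(seed)
    medoids = rng.sample(range(n), k)
    for _ in range(iters):
        # Assignment step: each object joins its nearest medoid.
        assign = [min(range(k), key=lambda j: D[i][medoids[j]])
                  for i in range(n)]
        # Update step: each medoid becomes the cluster member that
        # minimizes the total in-cluster dissimilarity.
        new_medoids = []
        for j in range(k):
            members = [i for i in range(n) if assign[i] == j]
            if not members:
                new_medoids.append(medoids[j])
                continue
            new_medoids.append(
                min(members, key=lambda m: sum(D[m][i] for i in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    # Recompute the assignment for the final medoids.
    assign = [min(range(k), key=lambda j: D[i][medoids[j]])
              for i in range(n)]
    return medoids, assign

# Toy example: objects {0,1} and {2,3} form two obvious clusters.
D = [[0, 1, 9, 9],
     [1, 0, 9, 9],
     [9, 9, 0, 1],
     [9, 9, 1, 0]]
medoids, assign = k_medoids(D, 2)
print(assign)  # objects 0,1 share a cluster; so do 2,3
```

Note that `D` is consumed as an opaque table of numbers: nothing in the algorithm relies on the dissimilarities coming from a metric space.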


By contrast, other clustering algorithms, such as [3] and [4], do not require a vector space representation of the objects to be clustered, but they do require that the function used to compute the dissimilarity matrix be an actual metric.


[1] Sibson. "SLINK: An optimally efficient algorithm for the single-link cluster method"
(http://www.cs.gsu.edu/~wkim/index_files/papers/sibson.pdf)
[2] L. Kaufman and P.J. Rousseeuw. "Clustering by means of medoids"
(ftp://ftp.win.ua.ac.be/pub/preprints/87/Clumea87.pdf)
[3] J. Zhou and J. Sander. "Data Bubbles for Non-Vector Data: Speeding-up Hierarchical 
Clustering in Arbitrary Metric Spaces"
(http://www.vldb.org/conf/2003/papers/S14P03.pdf)
[4] V. Ganti et al. "Clustering Large Datasets in Arbitrary Metric Spaces"
(http://www.cs.virginia.edu/~cyberia/papers/ICDE99.pdf)



Friday, July 27, 2012

Tversky index

Measuring the similarity between sets is useful in many applications. The Jaccard index is probably one of the best-known set similarity indexes. The Tversky index is a more general index, which reduces to the Jaccard index when \(\gamma\) and \(\eta\) are both equal to 1, and to the Dice index when they are both equal to 0.5:

\[sim(X,Y) = \frac{|X \cap Y|}{|X \cap Y| + \gamma|X - Y| + \eta |Y - X|}\]

(see http://en.wikipedia.org/wiki/Tversky_index)
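The formula above translates almost directly into code. A small sketch (the function name is my own):

```python
def tversky(X, Y, gamma, eta):
    """Tversky index between two sets X and Y."""
    inter = len(X & Y)
    denom = inter + gamma * len(X - Y) + eta * len(Y - X)
    # Both sets empty: define the similarity as 1 (a convention).
    return inter / denom if denom else 1.0

A = {1, 2, 3, 4}
B = {3, 4, 5}
print(tversky(A, B, 1, 1))      # Jaccard: 2 / (2 + 2 + 1) = 0.4
print(tversky(A, B, 0.5, 0.5))  # Dice:    2 / (2 + 1 + 0.5) ~ 0.571
```

As a sanity check, \(|A \cup B| = 5\), so the Jaccard index is \(2/5 = 0.4\), and the Dice coefficient \(2|A \cap B|/(|A|+|B|) = 4/7 \approx 0.571\) matches the \(\gamma=\eta=0.5\) case.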