Sunday, August 12, 2012

General Coefficient of Similarity

I've recently read an interesting paper published in 1971 [1] that presents a well-organized discussion of what it means to measure the similarity between two objects (or individuals), and of how to deal with missing information and dichotomous features.

Let \(s_{ij}^{(k)}\) be the similarity between two objects \(i\) and \(j\) according to their \(k\)-th feature. Also let \( \delta_{ij}^{(k)} \in \{0,1\} \) indicate whether feature \(k\) of \(i\) and \(j\) can be compared (\( \delta_{ij}^{(k)}=1\)) or not (\( \delta_{ij}^{(k)}=0\)). For example, when \(i\) and/or \(j\) are missing feature \(k\), then \( \delta_{ij}^{(k)} = 0 \). Accordingly, the overall similarity \(S_{ij}\) between \(i\) and \(j\) can be written as
\[ S_{ij} = \frac{ \sum_k w(x_i^{(k)},x_j^{(k)}) s_{ij}^{(k)} }{ \sum_k w(x_i^{(k)},x_j^{(k)}) \delta_{ij}^{(k)} }  \]
where \(x_i^{(k)}\) and \(x_j^{(k)}\) are the values of the \(k\)-th feature of \(i\) and \(j\), respectively, and \(w(x_i^{(k)},x_j^{(k)})\) is the weight assigned to feature \(k\), with the convention that \(s_{ij}^{(k)} = 0\) whenever \(\delta_{ij}^{(k)} = 0\), so that non-comparable features contribute to neither sum. Interestingly, the weights are expressed as a function of the feature values rather than being constant, which, among other things, allows hierarchical features to be handled elegantly (see [1], Section 4.1).
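
As a concrete sketch of the formula above, the following Python function computes \(S_{ij}\) from per-feature scores and value-dependent weights. The names `general_similarity`, `per_feature_sim`, and `weight` are mine, not from [1], and missing values are represented by `None`:

```python
def general_similarity(x_i, x_j, per_feature_sim, weight):
    """Gower-style general coefficient of similarity between two objects.

    x_i, x_j        -- sequences of feature values; None marks a missing value
    per_feature_sim -- per_feature_sim(k, a, b): similarity in [0, 1] on feature k
    weight          -- weight(a, b): non-negative weight for a pair of values
    """
    num = den = 0.0
    for k, (a, b) in enumerate(zip(x_i, x_j)):
        if a is None or b is None:
            continue  # delta = 0: the feature contributes to neither sum
        w = weight(a, b)
        num += w * per_feature_sim(k, a, b)  # w * s
        den += w                             # w * delta, with delta = 1
    return num / den if den > 0 else float('nan')

# Example: numeric features scored as 1 - |a - b| / range_k, constant weights.
ranges = [10.0, 4.0, 2.0]
s = general_similarity(
    [3.0, None, 1.0], [5.0, 2.0, 1.0],
    per_feature_sim=lambda k, a, b: 1.0 - abs(a - b) / ranges[k],
    weight=lambda a, b: 1.0,
)
print(s)  # (0.8 + 1.0) / 2 = 0.9; the missing feature is simply skipped
```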

[1] J. C. Gower. "A General Coefficient of Similarity and Some of Its Properties." Biometrics, Vol. 27, No. 4. (Dec., 1971), pp. 857-871. (http://venus.unive.it/romanaz/modstat_ba/gowdis.pdf)

Saturday, July 28, 2012

Non-metric similarities and clustering algorithms

I have often had to deal with clustering data for which a distance metric cannot easily be defined. In some of these cases it is still possible to define a non-metric dissimilarity function that satisfies the first three of the following conditions but not the fourth (the triangle inequality); a concrete example follows the list:

  1. \( d(x,y) \ge 0, \forall x,y \)
  2. \( d(x,y) = 0 \Leftrightarrow x = y \)
  3. \( d(x,y) = d(y,x) \)
  4. \( d(x,y) \le d(x,z) + d(z,y) \)
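
For instance (my example, not taken from the papers below), the squared Euclidean distance satisfies conditions 1-3 but violates the triangle inequality:

```python
# Squared difference: non-negative, zero iff equal, symmetric (conditions 1-3),
# but not a metric: with x = 0, z = 1, y = 2 we get
# d(x, y) = 4 > d(x, z) + d(z, y) = 1 + 1 = 2.
d = lambda a, b: (a - b) ** 2

x, z, y = 0.0, 1.0, 2.0
assert d(x, y) > d(x, z) + d(z, y)  # the triangle inequality fails
```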

Fortunately, some clustering algorithms can operate directly on pairwise dissimilarity matrices computed with (symmetric) non-metric dissimilarity functions, as well as with distance functions in arbitrary metric spaces. For example, [1] describes the SLINK algorithm for single-linkage hierarchical clustering and notes that it only requires a "symmetric non-negative dissimilarity coefficient" for which \(d(a,a)=0, \forall a \). Likewise, [2] presents the k-medoids algorithm, explicitly states that "such dissimilarities do not have to satisfy the triangle inequality", and gives an example of its application in such a scenario.
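
As an illustration, here is a minimal sketch (not the SLINK implementation from [1], just the same single-linkage criterion) of hierarchical clustering on a precomputed non-metric dissimilarity matrix. SciPy's `linkage` accepts any symmetric, zero-diagonal matrix in condensed form and never checks the triangle inequality:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Toy objects and a symmetric, non-metric dissimilarity (squared difference).
objects = np.array([0.0, 0.1, 0.2, 5.0, 5.1])
D = np.array([[(a - b) ** 2 for b in objects] for a in objects])

# squareform converts the square matrix to the condensed (upper-triangular)
# form that linkage expects; method='single' is single-linkage clustering.
Z = linkage(squareform(D), method='single')
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the tree into 2 clusters
print(labels)  # e.g. [1 1 1 2 2]: the three small values vs. the two large ones
```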


Other clustering algorithms, such as those in [3] and [4], also do not require a vector-space representation of the objects to be clustered, but they do require that the function used to compute the dissimilarity matrix be an actual metric.


[1] R. Sibson. "SLINK: An Optimally Efficient Algorithm for the Single-Link Cluster Method." The Computer Journal, Vol. 16, No. 1 (1973), pp. 30-34. (http://www.cs.gsu.edu/~wkim/index_files/papers/sibson.pdf)
[2] L. Kaufman and P. J. Rousseeuw. "Clustering by Means of Medoids." (1987). (ftp://ftp.win.ua.ac.be/pub/preprints/87/Clumea87.pdf)
[3] J. Zhou and J. Sander. "Data Bubbles for Non-Vector Data: Speeding-up Hierarchical Clustering in Arbitrary Metric Spaces." In Proc. VLDB 2003. (http://www.vldb.org/conf/2003/papers/S14P03.pdf)
[4] V. Ganti et al. "Clustering Large Datasets in Arbitrary Metric Spaces." In Proc. ICDE 1999. (http://www.cs.virginia.edu/~cyberia/papers/ICDE99.pdf)



Friday, July 27, 2012

Tversky index

Measuring the similarity between sets is useful in many applications. The Jaccard index is probably the best known set similarity index. The Tversky index is a more general index that includes the Jaccard and Dice indexes as special cases: Jaccard when \(\gamma = \eta = 1\), and Dice when \(\gamma = \eta = 0.5\).

\[sim(X,Y) = \frac{|X \cap Y|}{|X \cap Y| + \gamma|X - Y| + \eta |Y - X|}\]

(see http://en.wikipedia.org/wiki/Tversky_index)
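
A small Python sketch (the function name `tversky` is mine, not from the reference above) makes the special cases easy to check:

```python
def tversky(x, y, gamma, eta):
    """Tversky index between two sets x and y."""
    inter = len(x & y)
    denom = inter + gamma * len(x - y) + eta * len(y - x)
    return inter / denom if denom > 0 else 0.0

a, b = {1, 2, 3, 4}, {3, 4, 5}
print(tversky(a, b, 1.0, 1.0))  # Jaccard: |a & b| / |a | b| = 2/5 = 0.4
print(tversky(a, b, 0.5, 0.5))  # Dice: 2|a & b| / (|a| + |b|) = 4/7
```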