Sunday, August 12, 2012

General Coefficient of Similarity

I recently read an interesting paper published in 1971 [1] that presents a well-organized discussion of what it means to measure the similarity between two objects (or individuals), and of how to deal with missing information and dichotomous features.

Let \(s_{ij}^{(k)}\) be the similarity between two objects \(i\) and \(j\) according to their \(k\)-th feature. Also let \( \delta_{ij}^{(k)} \in \{0,1\} \) indicate whether feature \(k\) of \(i\) and \(j\) can be compared (\( \delta_{ij}^{(k)}=1\)) or not (\( \delta_{ij}^{(k)}=0\)). For example, when \(i\) and/or \(j\) are missing feature \(k\), then \( \delta_{ij}^{(k)} = 0 \); in that case \(s_{ij}^{(k)}\) is unknown and is conventionally set to 0, so the feature contributes nothing to the numerator below. Accordingly, the overall similarity \(S_{ij}\) between \(i\) and \(j\) can be written as
\[ S_{ij} = \frac{ \sum_k w(x_i^{(k)},x_j^{(k)}) s_{ij}^{(k)} }{ \sum_k w(x_i^{(k)},x_j^{(k)}) \delta_{ij}^{(k)} }  \]
where \(x_i^{(k)}\) and \(x_j^{(k)}\) are the values of the \(k\)-th feature of \(i\) and \(j\), respectively, and \(w(x_i^{(k)},x_j^{(k)})\) is the weight assigned to the \(k\)-th feature. Interestingly, the weights are expressed as a function of the feature values rather than being constant. Among other things, this makes it possible to handle hierarchical features elegantly (see [1], Section 4.1).
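
To make the formula concrete, here is a minimal Python sketch of how \(S_{ij}\) could be computed. It is only an illustration, not code from the paper: the function names (`general_similarity`, `unit_weight`, `range_score`), the use of `None` to mark a missing value, and the per-feature score \(1 - |a - b| / R_k\) with made-up ranges \(R_k\) are all assumptions chosen for this example.

```python
from typing import Callable, Optional, Sequence

Value = Optional[float]  # None marks a missing feature value (an assumption of this sketch)


def unit_weight(k: int, a: float, b: float) -> float:
    """Hypothetical default weight: every comparable feature counts equally."""
    return 1.0


def general_similarity(
    x_i: Sequence[Value],
    x_j: Sequence[Value],
    score: Callable[[int, float, float], float],
    weight: Callable[[int, float, float], float] = unit_weight,
) -> float:
    """Weighted average of per-feature scores over the comparable features."""
    numerator = 0.0
    denominator = 0.0
    for k, (a, b) in enumerate(zip(x_i, x_j)):
        # delta = 0 when either value is missing: the feature is skipped,
        # which is the same as setting its score to 0 and leaving it out
        # of the denominator.
        if a is None or b is None:
            continue
        w = weight(k, a, b)
        numerator += w * score(k, a, b)
        denominator += w  # delta = 1 for this feature
    if denominator == 0:
        raise ValueError("no comparable features: similarity is undefined")
    return numerator / denominator


# Illustrative per-feature score for quantitative features: 1 - |a - b| / range.
# The ranges are made-up values for this example.
RANGES = [4.0, 4.0, 4.0]


def range_score(k: int, a: float, b: float) -> float:
    return 1.0 - abs(a - b) / RANGES[k]


x_i = [1.0, None, 3.0]  # feature 1 is missing for object i
x_j = [1.0, 2.0, 5.0]
print(general_similarity(x_i, x_j, range_score))  # prints 0.75
```

Here feature 0 scores 1, feature 1 is skipped because it is missing for \(i\), and feature 2 scores \(1 - 2/4 = 0.5\), giving \(S_{ij} = (1 + 0.5)/2 = 0.75\). Because `weight` receives the feature values as well as the feature index, a caller can return 0 for particular value combinations, which is one way to mimic the value-dependent weights described above.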

[1] J. C. Gower. "A General Coefficient of Similarity and Some of Its Properties." Biometrics, Vol. 27, No. 4 (Dec. 1971), pp. 857-871. (http://venus.unive.it/romanaz/modstat_ba/gowdis.pdf)