Many clustering methods use distance measures to determine the similarity or dissimilarity between any pair of objects. It is useful to denote the distance between two instances x_i and x_j as d(x_i,x_j). A valid distance measure should be symmetric and attain its minimum value (usually zero) for identical vectors. The distance measure is called a metric distance measure if it also satisfies the following properties:
1. Triangle inequality: d(x_i,x_k) ≤ d(x_i,x_j) + d(x_j,x_k) for all x_i, x_j, x_k.
2. d(x_i,x_j) = 0 implies x_i = x_j, for all x_i, x_j.
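As a small illustration, the properties usually required of a metric (symmetry, zero distance for identical vectors, and the triangle inequality) can be checked numerically on a sample of points. The sketch below is only that: the function names and sample points are assumptions made for the example, and passing the check on a finite sample does not prove that a measure is a metric.

```python
import itertools
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def squared_euclidean(a, b):
    """Squared Euclidean distance: symmetric, but not a metric."""
    return euclidean(a, b) ** 2


def looks_like_metric(dist, points, tol=1e-12):
    """Check d(x, x) = 0, symmetry and the triangle inequality on a sample.

    A failure proves dist is not a metric; a pass is only evidence.
    """
    for a in points:
        if abs(dist(a, a)) > tol:
            return False
    for a, b in itertools.product(points, repeat=2):
        if abs(dist(a, b) - dist(b, a)) > tol:
            return False
    for a, b, c in itertools.product(points, repeat=3):
        if dist(a, c) > dist(a, b) + dist(b, c) + tol:
            return False
    return True


# Hypothetical sample points, chosen so that three of them are collinear.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.5, 1.5)]
euclidean_ok = looks_like_metric(euclidean, points)
squared_ok = looks_like_metric(squared_euclidean, points)
```

On this sample the Euclidean distance passes all three checks, while the squared Euclidean distance fails the triangle inequality on the three collinear points (4 > 1 + 1).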
When instances are characterized by attributes of mixed types, one may calculate the distance by combining various methods. For instance, when calculating the distance between instances i and j using a metric such as the Euclidean distance, one may calculate the difference between nominal and binary attributes as 0 or 1 (“match” or “mismatch”, respectively), and the difference between numeric attributes as the difference between their normalized values. The square of each such difference is added to the total distance. Such a calculation is employed in many clustering algorithms.
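The combination described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: numeric attributes are assumed to be min-max normalized with known ranges, the square root is taken at the end as in the ordinary Euclidean distance, and the names `mixed_euclidean`, `kinds`, and `ranges`, as well as the sample records, are hypothetical.

```python
import math


def min_max_normalize(value, low, high):
    """Scale a numeric value to [0, 1]; a constant attribute maps to 0."""
    if high == low:
        return 0.0
    return (value - low) / (high - low)


def mixed_euclidean(a, b, kinds, ranges):
    """Euclidean-style distance over numeric and nominal attributes.

    kinds[n] is "numeric" or "nominal" (binary attributes count as nominal).
    ranges[n] is (low, high) for numeric attributes and None otherwise.
    """
    total = 0.0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind == "numeric":
            low, high = rng
            diff = min_max_normalize(x, low, high) - min_max_normalize(y, low, high)
        else:
            # Match contributes 0, mismatch contributes 1.
            diff = 0.0 if x == y else 1.0
        total += diff ** 2
    return math.sqrt(total)


# Hypothetical records: (age, favourite colour, is_member).
kinds = ["numeric", "nominal", "nominal"]
ranges = [(20.0, 60.0), None, None]
first = (30.0, "red", True)
second = (50.0, "blue", True)
distance = mixed_euclidean(first, second, kinds, ranges)
```

Here the normalized ages differ by 0.5, the colours mismatch, and the membership flags match, so the squared differences sum to 0.25 + 1 + 0 = 1.25 before the square root is taken.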
The dissimilarity d(x_i, x_j) between two instances, containing p attributes of mixed types, is defined as:
d(x_i, x_j) = ( Σ_{n=1}^{p} δ^(n) d^(n) ) / ( Σ_{n=1}^{p} δ^(n) )
where the indicator δ^(n) = 0 if the value of attribute n is missing in either instance, and δ^(n) = 1 otherwise. The contribution of attribute n to the distance between the two objects, d^(n), is computed according to its type:
If the attribute is binary or categorical: d^(n) = 0 if x_i,n = x_j,n, and d^(n) = 1 otherwise.
If the attribute is continuous-valued (where h runs over all non-missing objects for attribute n): d^(n) = |x_i,n - x_j,n| / (max_h x_h,n - min_h x_h,n)
If the attribute is ordinal, the standardized values of the attribute are computed first, and then z_i,n is treated as continuous-valued.
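The per-type rules above can be put together in a short sketch. It assumes the usual Gower-style form, in which the per-attribute contributions are averaged over the attributes observed in both instances. It also assumes that ordinal values are given as ranks 1..M and standardized as (rank - 1) / (M - 1), which is a common convention rather than something stated in the text. The function and parameter names are hypothetical, and missing values are represented by `None`.

```python
def gower_dissimilarity(a, b, kinds, ranges, levels):
    """Mixed-type dissimilarity between two instances.

    kinds[n]:  "nominal" (also used for binary), "numeric" or "ordinal".
    ranges[n]: (min, max) over non-missing values, for numeric attributes.
    levels[n]: number of ordered states M, for ordinal attributes (ranks 1..M).
    """
    weighted_sum = 0.0
    observed = 0  # sum of the indicators over all attributes
    for x, y, kind, rng, m in zip(a, b, kinds, ranges, levels):
        if x is None or y is None:
            continue  # indicator is 0: the attribute is skipped
        if kind == "nominal":
            contribution = 0.0 if x == y else 1.0
        elif kind == "numeric":
            low, high = rng
            contribution = abs(x - y) / (high - low) if high > low else 0.0
        elif kind == "ordinal":
            # Assumed standardization of ranks to [0, 1], then treated
            # as continuous with range 1.
            if m > 1:
                contribution = abs((x - 1) / (m - 1) - (y - 1) / (m - 1))
            else:
                contribution = 0.0
        else:
            raise ValueError(f"unknown attribute kind: {kind!r}")
        weighted_sum += contribution
        observed += 1
    if observed == 0:
        raise ValueError("no attribute is observed in both instances")
    return weighted_sum / observed


# Hypothetical instances: (colour, weight, size rank out of 3, price).
kinds = ["nominal", "numeric", "ordinal", "numeric"]
ranges = [None, (0.0, 10.0), None, (0.0, 100.0)]
levels = [None, None, 3, None]
first = ("red", 2.0, 1, None)      # price is missing
second = ("blue", 7.0, 3, 40.0)
dissimilarity = gower_dissimilarity(first, second, kinds, ranges, levels)
```

In this example the price is missing for the first instance and is therefore ignored; the remaining contributions are 1 (colour mismatch), 0.5 (weight), and 1 (size ranks 1 and 3 of 3), giving 2.5 / 3.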