Many clustering methods use distance measures to determine the similarity or dissimilarity between any pair of objects. It is useful to denote the distance between two instances x_i and x_j as d(x_i, x_j). A valid distance measure should be symmetric and attain its minimum value (usually zero) for identical vectors. The distance measure is called a metric distance measure if it also satisfies the following properties:
1. Triangle inequality: d(x_i, x_k) <= d(x_i, x_j) + d(x_j, x_k) for all instances x_i, x_j, x_k.
2. d(x_i, x_j) = 0 implies x_i = x_j for all instances x_i, x_j.
Given two p-dimensional instances, x_i = (x_i1, x_i2, . . . , x_ip) and x_j = (x_j1, x_j2, . . . , x_jp), the distance between the two data instances can be calculated using the Minkowski metric:
d(x_i, x_j) = (|x_i1 - x_j1|^g + |x_i2 - x_j2|^g + . . . + |x_ip - x_jp|^g)^(1/g)
The commonly used Euclidean distance between two objects is obtained when g = 2. With g = 1, the sum of absolute paraxial distances (Manhattan metric) is obtained, and with g = ∞ one gets the greatest of the paraxial distances (Chebyshev metric).
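The three special cases above can be checked with a short sketch. The function name minkowski_distance and the sample points are illustrative assumptions, not part of the original text; the g = ∞ case is handled as the limit, that is, the largest per-attribute difference.

```python
import math


def minkowski_distance(x, y, g=2.0):
    """Minkowski distance of order g between two equal-length instances.

    Hypothetical helper for illustration; g is assumed to be >= 1,
    with math.inf selecting the Chebyshev (maximum) case.
    """
    if len(x) != len(y):
        raise ValueError("instances must have the same number of attributes")
    if g < 1:
        raise ValueError("g must be >= 1")
    # Absolute per-attribute (paraxial) differences.
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(g):
        # Limit g -> infinity: the greatest paraxial distance.
        return max(diffs)
    return sum(d ** g for d in diffs) ** (1.0 / g)


x_i = (0.0, 0.0)
x_j = (3.0, 4.0)
print(minkowski_distance(x_i, x_j, g=1))         # Manhattan: 7.0
print(minkowski_distance(x_i, x_j, g=2))         # Euclidean: 5.0
print(minkowski_distance(x_i, x_j, g=math.inf))  # Chebyshev: 4.0
```

For the points (0, 0) and (3, 4) the distance shrinks from 7 to 5 to 4 as g grows, since larger g gives more influence to the single largest difference.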
The measurement unit used can affect the clustering analysis. To avoid dependence on the choice of measurement units, the data should be standardized. Standardizing measurements attempts to give all variables an equal weight. However, if each variable is assigned a weight according to its importance, then the weighted distance can be computed as:
d(x_i, x_j) = (w_1 |x_i1 - x_j1|^g + w_2 |x_i2 - x_j2|^g + . . . + w_p |x_ip - x_jp|^g)^(1/g)
where each weight w_k is non-negative.
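A minimal sketch of both ideas follows. It assumes z-score standardization (subtract the column mean, divide by the population standard deviation), which is one common choice; the text does not prescribe a particular scheme. It also assumes the weighted form of the Minkowski metric in which each term |x_ik - x_jk|^g is multiplied by a non-negative weight w_k. The names standardize and weighted_minkowski and the sample data are hypothetical.

```python
import statistics


def standardize(data):
    """Z-score every attribute (column) of a list of instances.

    Assumption: population standard deviation is used; a constant
    column (zero deviation) is mapped to 0.0 to avoid division by zero.
    """
    columns = list(zip(*data))
    means = [statistics.mean(c) for c in columns]
    stdevs = [statistics.pstdev(c) for c in columns]
    return [
        tuple((v - m) / s if s > 0 else 0.0
              for v, m, s in zip(row, means, stdevs))
        for row in data
    ]


def weighted_minkowski(x, y, weights, g=2.0):
    """Weighted Minkowski distance with non-negative per-attribute weights."""
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(w * abs(a - b) ** g for a, b, w in zip(x, y, weights))
    return total ** (1.0 / g)


# Two attributes on very different scales (say, height in cm, mass in kg).
raw = [(170.0, 60.0), (180.0, 80.0)]
z = standardize(raw)

# After standardization both attributes contribute equally.
print(weighted_minkowski(z[0], z[1], weights=(1.0, 1.0)))
# A weight of zero removes the second attribute from the distance.
print(weighted_minkowski(z[0], z[1], weights=(1.0, 0.0)))
```

Without standardization the second attribute would dominate the distance simply because its raw differences are numerically larger; the weights then express importance explicitly rather than through the choice of units.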