Contents
ToggleInternal quality criteria
Internal quality criteria typically measure the compactness of clusters using a similarity measure. It typically measures intra-cluster homogeneity, inter-cluster separability, or a combination of these two. It does not use external information alongside the data itself.
List
- Sum of the quadratic error
- Dispersion criteria
- Category utility metric
- Cutting measures
- Ball Hall
- Banfeld-Raftery
- Condorcet criterion
- Criterion C
- Calinski-Harabasz
- Davies-Bouldin
- Det_Ratio
- Dunn
- GDImn
- Gamma
- G+
- Ksq_DetW
- Log_Det_Ratio
- Log_SS_Ratio
- McClain-Rao
- PBM
- Biserial point
- Ratkawsky-Lance
- Ray-Turi
- Scott Symons
- SD_Scat
- SD_Dis
- S_Dbw
- Silhouette
- Trace W
- WiB trace
- Wemmert-Gançarski
- Xie-Beni
How to Read Internal Quality Metrics
In order to find the best partition of the data, we generally run a algorithm of clustering with different values of the expected number of clusters K: let's say that Km ≤ K ≤ KM. The applied clustering algorithm could be Ascending Hierarchical Clustering (AHC) or k-means algorithm or any other technique. We then calculate a QK quality index for each value of K and select the partition that led to the “best” value for QK.
This section explains what is considered the “best” value for the different quality indices.
The table summarizes, for each index, which rule should be applied in order to determine the best index value. For example, in the case of the Calinski-Harabasz index, if the quality index has been calculated for different partitions of the data, the best partition is the one corresponding to the largest value of the index.
The decision rules called max and min in the table mean selecting the largest or smallest index value, respectively. The decision rule called max diff means that the best value of K is that corresponding to the greatest difference between two successive slopes. On a chart representing index values versus number of clusters selected, this corresponds to an elbow.
Sum of squared error
Internal quality metrics typically measure cluster compactness using a measure of similarity (such as Sum of Squared Error). It typically measures intra-cluster homogeneity, inter-cluster separability, or a combination of these two. It does not use external information alongside the data itself.
Sum of squared error is the simplest and most widely used criterion measure for clustering. It is calculated as:
where C_k is the set of instances of cluster k; μ_k is the vector mean of cluster k. The components of μ_k are calculated as:
where N_k = | C_k | is the number of instances belonging to cluster k.
The methods of partitioning which minimize the SSE criterion are often called minimum variance partitions, because by simple algebraic manipulation the SSE criterion can be written:
The SSE criterion function is suitable for cases where the clusters form compact clouds well separated from each other.
Additional minimum criteria for SSE can be produced by replacing the value of S_k with expressions such as:
Dispersion criteria
Internal quality metrics generally measure the compactness of clusters using a measure of similarity (such as Dispersion criteria: trace, determinant, invariance). It typically measures intra-cluster homogeneity, inter-cluster separability, or a combination of these two. It does not use external information alongside the data itself.
The scalar diffusion criteria are derived from the diffusion matrices, reflecting the intra-cluster diffusion, the inter-cluster diffusion and their summation – the total diffusion matrix. For the k-th cluster, the diffusion matrix can be calculated as follows:
The intra-cluster dispersion matrix is calculated as the sum of the last definition over all W clusters:
The diffusion matrix between clusters can be calculated as follows:
where μ is the total mean vector and is defined as:
The total diffusion matrix should be calculated as follows:
Three scalar criteria can be derived from S_W, S_B and S_T.
The trace is the sum of the diagonal elements of a matrix. Minimizing the trace of S_W is similar to minimizing HSE and is therefore commonly used. This criterion, representing the intra-cluster dispersion, is calculated as follows:
Another criterion, which can be maximized, is the criterion between clusters:
The determinant of a scattering matrix measures approximately the square of the scattering volume. Since S_B will be singular if the number of clusters is less than or equal to dimensionality, or if mc is less than dimensionality, its determinant is not an appropriate criterion. If we assume that S_W is not singular, the function of the determinant criterion is:
The eigenvalues λ_1, λ_2 ,. . . , λ_d of S_W * S_B are the basic linear invariants of the diffusion matrices. The good partitions are those for which the non-zero eigenvalues are large. As a result, several criteria can be derived, including eigenvalues. Three of these criteria are:
Category utility
Category utility is defined as increasing the expected number of entity values that can be correctly predicted given a certain grouping. This metric is useful for problems that contain a relatively small number of nominal features each having a small cardinality.
Cutting Measurement
In some cases, it is useful to represent the clustering problem as a minimal cutting problem. In such cases, the quality is measured as the ratio of the remaining weights to the total cut weights. If there is no restriction on the size of the clusters, it is easy to find the optimal value. Thus, the min-cut measurement is revised to penalize unbalanced structures.
Ball Hall
The average dispersion of a cluster is the average of the squares of the distances of the points in the cluster from their barycenter. The Ball-Hall index is the average, across all clusters, of their average dispersion:
In the particular case where all the clusters have the same size N/K, this sum is reduced to WGSS/N.
Banfeld-Raftery
This index is the weighted sum of the logarithms of the traces of the variance-covariance matrix of each cluster.
The index can be written as follows:
The quantity Tr(WG{k})/nk can be interpreted as the average of the squares of the distances between the points of the cluster Ck and their barycenter G{k}. If a cluster contains a single point, this trace is equal to 0 and the logarithm is undefined.
Concordet
Another suitable approach is to apply the Condorcet solution to the clustering problem. In this case, the criterion is calculated as follows:
where s (x_j, x_k) and d (x_j, x_k) measure the similarity and distance of vectors x_j and x_k.
Criterion C
Criterion C is an extension of Condorcet's criterion and is defined as (where γ is a threshold value):
If we consider the NT distances between pairs of points as a series of values sorted in ascending order, the index C uses the smallest NW values and the largest NW values to calculate the sums Smin and Smax: the sum S implies the NW distances in this sequence which correspond to pairs present in a cluster (i.e. pairs whose two points are in the same cluster). At most, 3NW distances are effectively retained in the calculation of this index.
Criterion C is then written as follows:
Calinski-Harabasz
Performance based on HSE average intra and inter-cluster (Tr):
where B_k is the matrix of dispersion between clusters and W_k is the intra-cluster scatter matrix defined by:
with N the number of points in our data, C_q the set of points in cluster q, c_q the center of cluster q, c the center of E, n_q the number of points in cluster q.
Davies-Bouldin
This index treats each cluster individually and seeks to measure how similar it is to the cluster closest to it. The DB index is formulated as follows:
I (c_i) represents the average of the distances between the objects belonging to the cluster C_i and its center. And I (c_i, c_j) represents the distance between the centers of the two clusters C_i and C_j.
For each cluster i of the partition, we seek the cluster j which maximizes the index described as follows:
The best partition is therefore the one that minimizes the average of the value calculated for each cluster. In other words, the best partition is the one that minimizes the similarity between clusters.
Det_Ratio
The Det_Ratio index is defined like this:
T denotes the total diffusion matrix. It is the sum of the BG and WG matrices.
Dunn
Dunn's Index is another internal cluster validation metric that can be calculated as follows:
- For each cluster, calculate the distance between each of the objects of the cluster and the objects of the other clusters
- Use the minimum of this distance per pair as inter-cluster separation (min.separation)
- For each cluster, calculate the distance between objects in the same cluster.
- Use maximum intra-cluster distance (i.e. maximum diameter) as intra-cluster compactness
- Calculate Dunn's Index (D) as follows:
GDImn
The Generalized Dunn Index (GDI) gives a value based on the “good behavior” of clusters and their members, measured based on distances between clusters and within clusters.
Let us denote by the letter δ a measure of the inter-cluster distance and by Δ a measure of the intra-cluster distance (which is also called cluster diameter). The GDI index, relating to these distances, is defined as follows:
With k and k' between 1 and K.
Six different definitions of δ (denoted δ1 to δ6) and three definitions of Δ (denoted Δ1 to Δ3) have been suggested. This leads to 18 different indices denoted Cuv: here u is an integer designating the distance between the clusters (1 ≤ u ≤ 6) and v an integer designating the distance within the groups (1 ≤ v ≤ 3). The definitions of the distances Δ within the cluster are:
Here d is the Euclidean distance. The factor 2 in the definition of Δ3 allows the value to be interpreted as a diameter rather than a radius. The definitions of the distances between clusters δ are:
The first four distances (δ1 to δ4) appear in bottom-up clustering algorithms and are called single linkage, full linkage, average linkage, and centroid linkage, respectively. The measure δ5 is the weighted average (with the weights nk and nk′) of the average distances between the points of the clusters Ck and Ck′ and their respective barycenter. The measure δ6 is the Hausdorff distance DH.
Baker-Hubert Gamma
The Baker-Hubert Gamma index is an adaptation, within the framework of clustering, of the Γ index of correlation between two data vectors A and B of the same size.
Generally, for two indices i and j such that ai < aj , we say that the two vectors are concordant if bi < bj , in other words if the values are classified in the same order in the two vectors. We calculate the number s+ of concordant pairs {i, j} and the number s− of discordant pairs. Note that inequalities are strict, meaning ties are removed. In this context, the index Γ is classically defined as follows:
The value is between -1 and 1.
In the context of a partition, the first vector A is chosen to be the set of distances dij between pairs of points {Mi,Mj} (with i < j). The second vector B is a binary vector: in this vector, the coordinate corresponding to a pair {Mi,Mj} is worth 0 if the two points are in the same cluster and 1 otherwise. These two vectors have a length NT = N(N − 1)/2.
The number s+ represents the number of times that a distance between two points belonging to the same cluster (i.e. a couple for which the value of vector B is 0) is strictly less than the distance between two points n not belonging to the same cluster. cluster (i.e. a couple for which the value of vector B is 1).
The number s− represents the number of times where the opposite situation occurs, that is to say that a distance between two points belonging to the same cluster (value 0 in B) is strictly greater than a distance between two points not belonging to the same cluster. (value 1 in B). Cases where there is a tie (tie or ex-aequos) are not taken into account.
There are NB inter-cluster distances and, for each of them, we compare with the NW intra-cluster distances: we finally carry out NB × NW comparisons. We can write the numbers s+ and s− in the following form:
G+
Using the same notations as for the Baker-Hubert Γ index, the G+ index is defined as follows:
It is the proportion of discordant pairs among all pairs of distinct points.
Ksq_DetW
As its name suggests, its formula is K²|WG|.
Log_Det_Ratio
The Log_Det_Ratio index is defined like this:
where T is the diffusion matrix and WG. This is a logarithmic variant of the Det_Ratio index.
Log_SS_Ratio
The Log_SS_Ratio index is defined like this:
where BGSS and WGSS are the traces of the BG and WG matrices respectively.
McClain-Rao
As for the index C, let SW denote the sum of the intra-cluster distances:
Recall that the total number of distances between pairs of points belonging to the same cluster is NW. Let us denote SB the sum of the distances between clusters:
The total number of distances between pairs of points that do not belong to the same cluster is NB = N(N − 1)/2 − NW. The McClain-Rao index is defined as the quotient between the average distances within a cluster and between clusters:
PBM
The PBM index (acronym consisting of the initials of the names of its authors, Pakhira, Bandyopadhyay and Maulik) is calculated from the distances between the points and their barycenters and the distances between the barycenters themselves.
Let us denote by DB the greatest distance between two cluster barycenters:
On the other hand, let us denote EW the sum of the distances of the points of each cluster to their barycenter and ET the sum of the distances of all the points to the barycenter G of the whole data:
The PBM index is defined like this:
ET is a constant that does not depend on the partition or the number of clusters.
Point-Biserial
Generally speaking, in statistics, the point-biserial coefficient is a measure of correlation between a continuous variable A and a binary variable B (i.e. a variable whose values are 0 or 1). A and B are sets of the same length n.
The values of A are divided into two groups A0 and A1 depending on whether the corresponding value in B is 0 or 1. Let MA0 and MA1 denote the averages in A0 and A1, and nA0 and nA1 the number of elements in each group. The point-biserial correlation coefficient is defined as the quantity:
where sn is the standard deviation of A.
In the context of a comparison between different clusterings, the term sn can be omitted because it does not depend on the partitions but only on the entire data set.
As in the case of the index Γ, we adapt this definition by choosing A as the set of NT distances between pairs of points Mi and Mj. The corresponding value in B is 1 if the two points are in the same cluster and 0 otherwise:
MA1 is the average of all distances within the cluster and MA0 is the average of all distances between clusters. Thus, the definition of the point-biserial index is:
Ratkowsky-Lance
We calculate the average ¯R of the quotients between BGSS and TSS for each dimension of the data, that is to say for each column of the matrix A. Note:
BGSSj is in fact the j-th diagonal term of the BG matrix. The Ratkowsky-Lance index (¯c/√K) is defined as follows:
Ray-Turi
The Ray-Turi index is defined as a quotient:
– the numerator is the average of the squares of the distances of all the points relative to the barycenter of the cluster to which they belong:
– the denominator is the minimum of the squares of the distances Δkk′ between all the barycenters of the cluster:
The Ray-Turi index can therefore be written as follows:
Scott Symons
This index is the weighted sum of the logarithms of the determinants of the variance-covariance matrix of each cluster. This can be written as follows:
The determinants of WG{k} matrices are greater than or equal to 0 because these matrices are positive semi-definite. If either of them is 0, the index is undefined.
SD_Scat and SD_Dis
We define two quantities S and D called respectively average diffusion of clusters and total separation between clusters.
The average cluster diffusion, denoted S, is defined as follows. Consider the vector of variances for each variable in the dataset. It is a vector V of size p defined by:
Similarly, we define variance vectors V{k} for each cluster Ck:
The quantity S is the average of the norms of the vectors V{k} divided by the norm of the vector V:
On the other hand, the total separation between clusters, denoted D, is defined as follows. Let us denote Dmax and Dmin respectively as the largest and smallest distance between the barycenters of the clusters:
The SD index is finally defined like this:
where α is a weight equal to the value of D obtained for the partition with the greatest number of clusters. In order to compare several partitions of the data, it is first necessary to calculate the value of D corresponding to the greatest number of clusters in order to find the value of the coefficient α then calculate the other indices from this coefficient.
S_Dbw
This index is based on the notion of density of points belonging to two clusters. We first define a limit value σ equal to the square root of the sum of the norms of the variance vectors V{k} divided by the number of clusters:
The density γkk′ for a given point, relating to two clusters Ck and Ck′, is equal to the number of points in these two clusters whose distance from this point is less than σ. Geometrically, this amounts to considering the ball of radius σ centered at a given point and counting the number of points of Ck ∪ Ck′ located in this ball.
For each pair of clusters, let us evaluate the densities for the barycenters G{k} and G{k′} of the clusters and for their midpoint Hkk′. We form the quotient Rkk' between the density in the middle and the greatest density at the two barycenters:
On the other hand, we define an inter-cluster density G as the average of the quotients Rkk′:
The S-Dbw index is defined as the sum of the average dispersion in clusters S and the inter-cluster density G:
Silhouette
Validates performance based on intra and inter-cluster distances:
with a (i) the average dissimilarity with the other data of the cluster and b (i) the weakest dissimilarity with any non-member cluster for each x_i and center of the cluster y:
The silhouette coefficient varies between -1 (worst ranking) and 1 (best ranking). Silhouette's overall average is often calculated.
tau
Using the same notations as for the Gamma index, the Kendall τ index between two data vectors of length NT is classically defined in statistics as the quantity:
The numbers s+ and s− do not count links, so if an inter-cluster distance and an intra-cluster distance are equal, they do not enter the numerator. In order to take equalities into account, we modify the denominator and define the corrected index τc like this:
with
where ti is the number of values in the i-th group of links for vector A and uj is the number of values in the j-th group of links for vector B. Here vector B consists only of values 0 and 1 (corresponding respectively to the inter-cluster and intra-cluster pairs) and we thus have:
A simple calculation shows that ν0 − ν2 = NBNW. If we make the reasonable assumption that the vector A contains few identical values, we can estimate that ν2 is negligible compared to ν0. This justifies the following definition of the Tau clustering index:
Trace_W
The Trace_W index is defined like this:
Trace_WiB
The Trace_WiB index is defined like this:
Wemmert-Gançarski
The Wemmert-Gançarski index is constructed from the quotients of distances between the points and the barycenters of all the clusters.
For a point M belonging to cluster Ck, we form the quotient R(M) between the distance of this point from the barycenter of the cluster to which it belongs and the smallest distance from this point to the barycenters of all the other clusters:
We then average these quotients in each cluster. If this average is greater than 1, we ignore it, otherwise we take its complement to 1. Specifically, let's define:
The Wemmert-Gançarski index is defined as the weighted average, for all clusters, of the quantities Jk like this:
Which can be rewritten as:
Xie-Beni
The Xie-Beni index is a fuzzy clustering index, but it also applies to sharp clustering.
It is defined as the quotient between the mean square error and the minimum of the minimum square distances between the points of the clusters. The mean square error, in the case of net clustering, is simply the quantity 1/N*WGSS, in other words the average of the squares of the distances of all points relative to the barycenter of the cluster to which they belong.
and the Xie-Beni index can be written as follows: