Calinski-Harabasz, Davies-Bouldin, Dunn and Silhouette perform well in a wide range of situations.
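The Calinski-Harabasz index measures performance as the ratio of the average inter-cluster to intra-cluster SSE, expressed through the trace (Tr) of the dispersion matrices:

$$s(k) = \frac{\mathrm{Tr}(B_k)}{\mathrm{Tr}(W_k)} \times \frac{N - k}{k - 1}$$

where k is the number of clusters, B_k is the between-group dispersion matrix and W_k is the within-cluster dispersion matrix, defined by:

$$W_k = \sum_{q=1}^{k} \sum_{x \in C_q} (x - c_q)(x - c_q)^T$$

$$B_k = \sum_{q=1}^{k} n_q (c_q - c)(c_q - c)^T$$

with N the number of points in the data, C_q the set of points in cluster q, c_q the center of cluster q, c the center of the whole data set E, and n_q the number of points in cluster q.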
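As an illustration, the Calinski-Harabasz ratio can be sketched in plain Python. It uses the fact that the trace of each dispersion matrix equals a sum of squared Euclidean distances. The function name `calinski_harabasz` and the toy data are assumptions made for this example, and the sketch assumes at least two clusters and a non-zero within-cluster dispersion.

```python
def calinski_harabasz(points, labels):
    """Return Tr(B_k) / Tr(W_k) * (N - k) / (k - 1) for labelled points."""
    n = len(points)
    dim = len(points[0])

    # Group the points by cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)

    def mean(pts):
        return [sum(p[d] for p in pts) / len(pts) for d in range(dim)]

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    c = mean(points)  # center of the whole data set
    tr_w = 0.0        # trace of the within-cluster dispersion matrix
    tr_b = 0.0        # trace of the between-group dispersion matrix
    for pts in clusters.values():
        c_q = mean(pts)
        tr_w += sum(sqdist(p, c_q) for p in pts)
        tr_b += len(pts) * sqdist(c_q, c)

    return (tr_b / tr_w) * (n - k) / (k - 1)


# Toy data (illustrative only): two tight, well-separated clusters.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
labels = [0, 0, 1, 1]
ch_score = calinski_harabasz(points, labels)
print(ch_score)
```

Larger values indicate clusters that are compact relative to how far apart their centers are.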
The Davies-Bouldin (DB) index treats each cluster individually and measures how similar it is to its closest cluster. For a partition into k clusters, the DB index is formulated as follows:

$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{I(c_i) + I(c_j)}{I(c_i, c_j)}$$

Here I(c_i) is the mean distance between the objects belonging to cluster C_i and its center, and I(c_i, c_j) is the distance between the centers of the two clusters C_i and C_j.
For each cluster i in the partition, we look for the cluster j that maximizes the following index:

$$\frac{I(c_i) + I(c_j)}{I(c_i, c_j)}$$

The best clustering is therefore the one that minimizes the average of the value calculated for each cluster. In other words, the best clustering is the one that minimizes the similarity between the clusters.
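A minimal sketch of the Davies-Bouldin computation in plain Python follows, assuming Euclidean distance and cluster centers taken as coordinate-wise means. The function name `davies_bouldin` and the toy data are hypothetical, and the sketch assumes at least two clusters with distinct centers.

```python
import math


def davies_bouldin(points, labels):
    """Average, over clusters, of the worst-case similarity to another cluster."""
    dim = len(points[0])

    # Group the points by cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    # Cluster centers as coordinate-wise means.
    centers = {
        lab: [sum(p[d] for p in pts) / len(pts) for d in range(dim)]
        for lab, pts in clusters.items()
    }
    # I(c_i): mean distance between the objects of a cluster and its center.
    scatter = {
        lab: sum(math.dist(p, centers[lab]) for p in pts) / len(pts)
        for lab, pts in clusters.items()
    }

    total = 0.0
    for i in clusters:
        # For cluster i, keep the cluster j that maximizes the similarity ratio.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


# Toy data (illustrative only): two tight, well-separated clusters.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
labels = [0, 0, 1, 1]
db_score = davies_bouldin(points, labels)
print(db_score)
```

Values close to zero correspond to compact clusters whose centers lie far apart.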
The Dunn index is another internal clustering validation measure, which can be computed as follows:
- For each cluster, compute the distances between each of its objects and the objects in the other clusters.
- Use the minimum of these pairwise distances as the inter-cluster separation (min.separation).
- For each cluster, compute the distances between the objects within that cluster.
- Use the maximal intra-cluster distance (i.e., the maximum diameter) as the intra-cluster compactness (max.diameter).
- Calculate the Dunn index (D) as follows:

$$D = \frac{\text{min.separation}}{\text{max.diameter}}$$
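The steps above can be sketched in plain Python as follows, assuming Euclidean distance. The function name `dunn_index` and the toy data are hypothetical. The sketch assumes at least two clusters and at least one cluster containing two distinct points, since otherwise the maximum diameter is undefined or zero.

```python
import math
from itertools import combinations


def dunn_index(points, labels):
    """Minimum inter-cluster separation divided by maximum cluster diameter."""
    labelled = list(zip(points, labels))

    # Smallest distance between two objects that belong to different clusters.
    min_separation = min(
        math.dist(p, q)
        for (p, lp), (q, lq) in combinations(labelled, 2)
        if lp != lq
    )
    # Largest distance between two objects that belong to the same cluster.
    max_diameter = max(
        math.dist(p, q)
        for (p, lp), (q, lq) in combinations(labelled, 2)
        if lp == lq
    )
    return min_separation / max_diameter


# Toy data (illustrative only): two tight, well-separated clusters.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
labels = [0, 0, 1, 1]
dunn = dunn_index(points, labels)
print(dunn)
```

A larger Dunn index means the clusters are small relative to the gaps between them. The sketch compares every pair of points, so its cost grows quadratically with the number of points.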
The Silhouette coefficient validates performance based on intra- and inter-cluster distances:

$$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$$
Here, a(i) is the average dissimilarity between x_i and the other data in its own cluster, and b(i) is the lowest dissimilarity between x_i and any cluster of which it is not a member, for each x_i and cluster center y:
The silhouette coefficient ranges from -1 (worst classification) to 1 (best classification). The global mean silhouette over all points is frequently computed.
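A minimal sketch of the per-point silhouette and its global mean in plain Python follows. It assumes the standard point-to-point definition with Euclidean distance, where b(i) is the lowest average distance to the points of another cluster; this is an assumption and may differ from a center-based variant. The function name `silhouette_values`, the convention of scoring singleton clusters as 0, and the toy data are also assumptions.

```python
import math


def silhouette_values(points, labels):
    """Return the silhouette coefficient s(i) of every point."""
    # Group the points by cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    values = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            # Assumed convention: a point alone in its cluster scores 0.
            values.append(0.0)
            continue
        # a(i): mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, hence the len(own) - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b(i): lowest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in pts) / len(pts)
            for other, pts in clusters.items()
            if other != lab
        )
        values.append((b - a) / max(a, b))
    return values


# Toy data (illustrative only): two tight, well-separated clusters.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
labels = [0, 0, 1, 1]
scores = silhouette_values(points, labels)
mean_silhouette = sum(scores) / len(scores)
print(mean_silhouette)
```

The global mean summarizes the whole partition, while the individual values help spot points that sit closer to a neighboring cluster than to their own.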