Quality on the number of clusters

A question closely related to cluster validation is whether the number of clusters obtained is the right one. This point is particularly important for algorithms that take the number of clusters as a parameter. The usual procedure is to compare the characteristics of clusterings with different numbers of clusters, typically by means of internal validity indices. Plotting these indices against the number of clusters can reveal the most likely number of clusters.
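The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy data set, the helper names (`kmeans`, `calinski_harabasz`) and the deterministic farthest-first seeding are choices made for this example, not part of the original text. The sketch clusters the data for several values of k and records the Calinski-Harabasz index for each.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def kmeans(data, k, iters=100):
    """Lloyd's k-means with farthest-first seeding (an assumption made
    here so that the example is deterministic)."""
    centers = [data[0]]
    while len(centers) < k:
        centers.append(max(data, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in data]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(data, labels) if l == j]
            if members:  # keep the previous center if a cluster empties
                centers[j] = centroid(members)
    return labels, centers

def calinski_harabasz(data, labels, centers):
    """Ratio of between-cluster to within-cluster dispersion,
    each divided by its degrees of freedom."""
    n, k = len(data), len(centers)
    overall = centroid(data)
    within = sum(sq_dist(p, centers[l]) for p, l in zip(data, labels))
    between = sum(labels.count(j) * sq_dist(centers[j], overall) for j in range(k))
    return (between / (k - 1)) / (within / (n - k))

# Hypothetical toy data: three compact groups of five points each.
offsets = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5), (0.0, 0.0)]
blobs = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
data = [(cx + dx, cy + dy) for cx, cy in blobs for dx, dy in offsets]

scores = {}
for k in range(2, 7):  # the index is undefined for k = 1
    labels, centers = kmeans(data, k)
    scores[k] = calinski_harabasz(data, labels, centers)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(k, round(score, 1))
print("best k:", best_k)
```

Plotting `scores` against k gives the kind of graph mentioned above; for the Calinski-Harabasz index the candidate number of clusters is the one with the highest value.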

Some internal validity indices can be used for this purpose, such as the Calinski-Harabasz index and the Silhouette index. Other criteria, such as the Hartigan index and the Krzanowski-Lai index, can be defined from the intra-class dispersion matrix (S_W):
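Writing W(k) for the trace of S_W obtained with k clusters, n for the number of examples and p for the number of attributes (a notation assumed here, since the text does not fix one), the two indices are usually defined in the literature as:

```latex
\[
H(k) = \left( \frac{W(k)}{W(k+1)} - 1 \right) (n - k - 1)
\]
\[
\mathrm{DIFF}(k) = (k-1)^{2/p}\, W(k-1) - k^{2/p}\, W(k), \qquad
KL(k) = \left| \frac{\mathrm{DIFF}(k)}{\mathrm{DIFF}(k+1)} \right|
\]
```

With the Hartigan index, a common rule of thumb is to keep adding clusters while H(k) > 10; with the Krzanowski-Lai index, the value of k that maximizes KL(k) is chosen.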

Another approach, the Gap statistic, estimates the number of clusters by comparing the clustering of the data with what would be expected under the null hypothesis (no clusters). The data are clustered with an increasing number of clusters, and each clustering is compared with the clusterings of B reference data sets generated from a uniform distribution.

The intra-class dispersion matrix S_W is computed both for the data and for the uniform reference data, and the two are compared. The correct number of clusters is the one where the largest difference appears between the S_W of the data and that of the uniform data (the first term of the following equation):
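In the usual formulation of the Gap statistic, assumed here to be the intended one, W_k denotes the trace of S_W for the data with k clusters and W*_kb the same quantity for the b-th of the B uniform reference data sets:

```latex
\[
\mathrm{Gap}(k) = \frac{1}{B} \sum_{b=1}^{B} \log W^{*}_{kb} \; - \; \log W_k
\]
```

The first term is the average over the reference data sets and the second term comes from the actual data, so a large value of Gap(k) means that the data are much more compact with k clusters than uniform data would be.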

The probable number of clusters is the smallest number that satisfies:
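In the standard formulation of the Gap statistic (assumed here), the condition on k reads:

```latex
\[
\mathrm{Gap}(k) \geq \mathrm{Gap}(k+1) - s_{k+1}
\]
```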

where s_k is defined as (sd_k is the standard deviation of the first term of Gap):
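The usual definition, assumed here, is s_k = sd_k · sqrt(1 + 1/B), where B is the number of uniform reference data sets. Under that assumption the whole procedure can be sketched as follows. The function names, the toy data, the bounding-box sampling of the reference sets and the simple deterministic k-means are all illustrative choices, not part of the original text.

```python
import math
import random
import statistics

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def within_dispersion(data, k, iters=100):
    """W_k: sum of squared distances to the cluster centroids (trace of S_W),
    using Lloyd's k-means with farthest-first seeding (an assumption)."""
    centers = [data[0]]
    while len(centers) < k:
        centers.append(max(data, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in data]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(data, labels) if l == j]
            if members:  # keep the previous center if a cluster empties
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return sum(sq_dist(p, centers[l]) for p, l in zip(data, labels))

def gap_statistic(data, k_max, n_refs=20, seed=0):
    """Return (estimated k, Gap values, s values) for k = 1..k_max."""
    rng = random.Random(seed)
    lows = [min(c) for c in zip(*data)]
    highs = [max(c) for c in zip(*data)]
    # B = n_refs reference sets drawn uniformly from the bounding box of the data
    refs = [[tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
             for _ in data] for _ in range(n_refs)]
    gaps, s = [], []
    for k in range(1, k_max + 1):
        ref_logs = [math.log(within_dispersion(ref, k)) for ref in refs]
        gaps.append(statistics.mean(ref_logs) - math.log(within_dispersion(data, k)))
        # sd_k is the standard deviation of the reference term of Gap(k)
        s.append(statistics.pstdev(ref_logs) * math.sqrt(1 + 1 / n_refs))
    # Smallest k with Gap(k) >= Gap(k+1) - s_{k+1}; the lists are 0-indexed.
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - s[k]:
            return k, gaps, s
    return k_max, gaps, s

# Hypothetical toy data: three compact groups of five points each.
offsets = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5), (0.0, 0.0)]
blobs = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
data = [(cx + dx, cy + dy) for cx, cy in blobs for dx, dy in offsets]

k_hat, gaps, s = gap_statistic(data, k_max=5)
for k, (g, sk) in enumerate(zip(gaps, s), start=1):
    print(k, round(g, 3), round(sk, 3))
print("estimated number of clusters:", k_hat)
```

With three well-separated groups one would expect the rule to select k = 3, although the outcome depends on the randomly generated reference sets and on the quality of the clustering obtained for each k.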