Hopkins Statistic

Before clustering a dataset, we can test whether it contains clusters at all. The test contrasts the hypothesis that the data contain patterns with the hypothesis that the data are uniformly distributed (a homogeneous distribution).

The Hopkins statistic is computed as follows:

  1. Sample n points (p_i) uniformly from the dataset (D) and compute the distance from each one to its nearest neighbor in D, other than itself (d(p_i))
  2. Generate n points (q_i) uniformly distributed in the space of the dataset and compute the distance from each one to its nearest neighbor in D (d(q_i))
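  3. Compute the quotient H, written here in the common convention that places the distances of the uniform points in the numerator:

       H = \frac{\sum_{i=1}^{n} d(q_i)}{\sum_{i=1}^{n} d(q_i) + \sum_{i=1}^{n} d(p_i)}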

If the data are uniformly distributed, the two sums of distances are similar in size, so the value of H will be around 0.5.
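
The three steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the names `nearest_distance` and `hopkins` are hypothetical, "the space of the dataset" is taken to be its axis-aligned bounding box, distances are Euclidean, and the quotient is assumed to be sum d(q_i) / (sum d(q_i) + sum d(p_i)). Under that assumed convention, strongly clustered data push H towards 1. The nearest-neighbor search is brute force, so the sketch suits only small datasets.

```python
import math
import random


def nearest_distance(point, data, skip_index=None):
    """Euclidean distance from `point` to its nearest neighbor in `data`.

    `skip_index` excludes one row, so that a sampled data point is not
    matched with itself.
    """
    return min(
        math.dist(point, other)
        for j, other in enumerate(data)
        if j != skip_index
    )


def hopkins(data, n, rng=None):
    """Hopkins statistic for a list of equal-length numeric tuples."""
    rng = rng or random.Random()
    dims = range(len(data[0]))
    # Assumption: the sampling space is the bounding box of the data.
    lows = [min(row[k] for row in data) for k in dims]
    highs = [max(row[k] for row in data) for k in dims]

    # Step 1: sample n data points (without replacement) and measure the
    # distance from each to its nearest neighbor in D, excluding itself.
    sample_idx = rng.sample(range(len(data)), n)
    d_p = [nearest_distance(data[i], data, skip_index=i) for i in sample_idx]

    # Step 2: generate n uniform points in the bounding box and measure the
    # distance from each to its nearest neighbor in D.
    d_q = []
    for _ in range(n):
        q = [rng.uniform(lo, hi) for lo, hi in zip(lows, highs)]
        d_q.append(nearest_distance(q, data))

    # Step 3: assumed convention, uniform-point distances in the numerator.
    return sum(d_q) / (sum(d_q) + sum(d_p))


# Two small synthetic datasets: one uniform, one with two tight groups.
demo_rng = random.Random(0)
uniform_data = [(demo_rng.random(), demo_rng.random()) for _ in range(300)]
clustered_data = [
    (demo_rng.gauss(cx, 0.02), demo_rng.gauss(cy, 0.02))
    for cx, cy in [(0.2, 0.2), (0.8, 0.8)]
    for _ in range(150)
]
print("uniform:  ", hopkins(uniform_data, 30, demo_rng))
print("clustered:", hopkins(clustered_data, 30, demo_rng))
```

Because both the sample and the generated points are random, H varies from run to run; in practice the computation is often repeated several times and the values averaged.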