External quality criteria

External measures can be useful for examining whether the structure of the clusters matches some predefined classification of the instances.

Mutual information based measure

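The mutual information criterion can be used as an external measure for clustering. The measure for m instances clustered using C = {C_1, ..., C_g} and referring to the target attribute y whose domain is dom(y) = {c_1, ..., c_k} is defined as follows:

$$MI = \frac{2}{m} \sum_{l=1}^{g} \sum_{h=1}^{k} m_{l,h} \log_{g \cdot k} \left( \frac{m_{l,h} \cdot m}{m_{.,h} \cdot m_{l,.}} \right)$$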

where m_{l,h} indicates the number of instances that are in cluster C_l and also in class c_h, m_{.,h} denotes the total number of instances in class c_h, and m_{l,.} denotes the total number of instances in cluster C_l.
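The counts m_{l,h}, m_{l,.} and m_{.,h} can be tabulated directly from two label sequences. The sketch below does so and then computes the plain mutual information with natural logarithms. That choice of scaling and log base is an assumption made for illustration and may differ from the measure above by a constant factor. The function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log

def contingency(clusters, classes):
    """Count m_{l,h}: instances in cluster l that also belong to class h."""
    if len(clusters) != len(classes):
        raise ValueError("label sequences must have the same length")
    return Counter(zip(clusters, classes))

def mutual_information(clusters, classes):
    """Plain mutual information (natural log) between two labelings."""
    m = len(clusters)
    joint = contingency(clusters, classes)   # m_{l,h}
    cluster_sizes = Counter(clusters)        # m_{l,.}
    class_sizes = Counter(classes)           # m_{.,h}
    mi = 0.0
    for (l, h), m_lh in joint.items():       # empty cells contribute nothing
        mi += (m_lh / m) * log(m_lh * m / (cluster_sizes[l] * class_sizes[h]))
    return mi

# Hypothetical toy data: six instances, two clusters, two classes.
clusters = [0, 0, 0, 1, 1, 1]
classes = ["a", "a", "b", "b", "b", "b"]
print(dict(contingency(clusters, classes)))
print(mutual_information(clusters, classes))
```

Only non-empty cells of the contingency table are visited, which avoids evaluating log(0) and matches the usual convention that 0 log 0 = 0.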

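MI is combined with entropy in the normalized mutual information (NMI), which rescales MI by the entropies of the clustering and of the class labels. Several normalizers are in use; one common form divides by the mean of the two entropies:

$$NMI(C, y) = \frac{MI(C, y)}{\tfrac{1}{2}\left[ H(C) + H(y) \right]}$$

The adjusted mutual information (AMI) additionally corrects for chance by subtracting the MI expected under random labelings with the same cluster and class sizes:

$$AMI(C, y) = \frac{MI(C, y) - E[MI(C, y)]}{\tfrac{1}{2}\left[ H(C) + H(y) \right] - E[MI(C, y)]}$$

where H(·) is the entropy of a labeling and MI is measured in the same logarithmic units as H, without any further scaling.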
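As a rough illustration of how the normalized and adjusted variants relate to plain MI, the sketch below uses only the standard library. It assumes the arithmetic mean of the two entropies as the normalizer and a hypergeometric model of random labelings for the expected MI. Both are common choices but are assumptions here, and all function names are hypothetical.

```python
from collections import Counter
from math import exp, lgamma, log

def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def mutual_information(u, v):
    """Plain mutual information (natural log) between two labelings."""
    n = len(u)
    joint, a, b = Counter(zip(u, v)), Counter(u), Counter(v)
    return sum((nij / n) * log(nij * n / (a[i] * b[j]))
               for (i, j), nij in joint.items())

def expected_mutual_information(u, v):
    """Expected MI over random labelings with the same group sizes."""
    n = len(u)
    emi = 0.0
    for ai in Counter(u).values():
        for bj in Counter(v).values():
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                # log-probability of nij shared instances (hypergeometric)
                log_p = (lgamma(ai + 1) + lgamma(bj + 1)
                         + lgamma(n - ai + 1) + lgamma(n - bj + 1)
                         - lgamma(n + 1) - lgamma(nij + 1)
                         - lgamma(ai - nij + 1) - lgamma(bj - nij + 1)
                         - lgamma(n - ai - bj + nij + 1))
                emi += (nij / n) * log(n * nij / (ai * bj)) * exp(log_p)
    return emi

def nmi(u, v):
    """MI divided by the mean entropy (assumed normalizer)."""
    norm = (entropy(u) + entropy(v)) / 2
    # Convention assumed here: two trivial labelings count as identical.
    return mutual_information(u, v) / norm if norm > 0 else 1.0

def ami(u, v):
    """Chance-corrected MI with the same assumed normalizer."""
    mi = mutual_information(u, v)
    emi = expected_mutual_information(u, v)
    denom = (entropy(u) + entropy(v)) / 2 - emi
    return (mi - emi) / denom if abs(denom) > 1e-12 else 1.0

print(nmi([0, 0, 1, 1], [0, 1, 0, 1]), ami([0, 0, 1, 1], [0, 1, 0, 1]))
```

On small samples, a labeling that is independent of the classes has zero NMI but a negative AMI, because chance alone is expected to produce some positive MI.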

Precision-recall measure

The precision-recall measure from information retrieval can be used as an external measure for evaluating clusters. Each cluster is viewed as the result of a query for a specific class. Precision is the fraction of the retrieved instances (those in the cluster) that belong to the class, while recall is the fraction of all instances of the class that were retrieved. The two can be combined into an F-measure for evaluating a clustering structure.
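A minimal sketch of this view follows, treating one cluster as the retrieved set and one class as the relevant set. The combined score is taken to be the balanced F-measure (the harmonic mean of precision and recall), which is an assumption since the text does not fix the weighting. The function name and toy data are hypothetical.

```python
def precision_recall_f(clusters, classes, cluster_id, class_id):
    """Precision, recall and balanced F-measure of one cluster for one class."""
    retrieved = 0   # instances in the cluster
    relevant = 0    # instances in the class
    hits = 0        # instances in both
    for cl, y in zip(clusters, classes):
        in_cluster = cl == cluster_id
        in_class = y == class_id
        retrieved += in_cluster
        relevant += in_class
        hits += in_cluster and in_class
    precision = hits / retrieved if retrieved else 0.0
    recall = hits / relevant if relevant else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Hypothetical toy data: cluster 0 holds both "a" instances plus one "b".
clusters = [0, 0, 0, 1, 1, 1]
classes = ["a", "a", "b", "b", "b", "b"]
print(precision_recall_f(clusters, classes, cluster_id=0, class_id="a"))
```

To score a whole clustering rather than a single cluster, one option is to pair each class with its best-scoring cluster and average the results weighted by class size; that aggregation is a design choice rather than something the text prescribes.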

Rand index

The Rand index is a simple criterion used to compare an induced clustering structure (C1) with a given clustering structure (C2). Let a be the number of pairs of instances that are assigned to the same cluster in C1 and in the same cluster in C2; b be the number of pairs of instances that are in the same cluster in C1, but not in the same cluster in C2; c be the number of pairs of instances that are in the same cluster in C2, but not in the same cluster in C1; and d be the number of pairs of instances that are assigned to different clusters in C1 and C2. The quantities a and d can be interpreted as agreements, and b and c as disagreements. The Rand index is defined as:

$$RAND = \frac{a + d}{a + b + c + d}$$

The Rand index lies between 0 and 1. When the two partitions agree perfectly, the Rand index is 1.
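The pair counts a, b, c and d can be obtained by enumerating all pairs of instances, and the Rand index is then the fraction of agreeing pairs, (a + d) / (a + b + c + d). The sketch below is a direct, quadratic-time illustration of this; the function names and toy data are hypothetical.

```python
from itertools import combinations

def pair_counts(c1, c2):
    """Return (a, b, c, d) for two labelings of the same instances."""
    if len(c1) != len(c2):
        raise ValueError("label sequences must have the same length")
    a = b = c = d = 0
    for i, j in combinations(range(len(c1)), 2):
        same1 = c1[i] == c1[j]
        same2 = c2[i] == c2[j]
        if same1 and same2:
            a += 1          # together in both
        elif same1:
            b += 1          # together in C1 only
        elif same2:
            c += 1          # together in C2 only
        else:
            d += 1          # apart in both
    return a, b, c, d

def rand_index(c1, c2):
    """Fraction of instance pairs on which the two labelings agree."""
    a, b, c, d = pair_counts(c1, c2)
    total = a + b + c + d
    # With fewer than two instances there are no pairs to disagree on.
    return (a + d) / total if total else 1.0

# Hypothetical toy data: C2 splits the second cluster of C1 in two.
print(pair_counts([0, 0, 1, 1], [0, 0, 1, 2]))
print(rand_index([0, 0, 1, 1], [0, 0, 1, 2]))
```

Cluster identifiers are never compared across the two labelings, only within each one, so the result does not depend on how the clusters are named.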

A problem with the Rand index is that its expected value for two random clusterings is not a constant (such as zero). Hubert and Arabie (1985) suggested an adjusted Rand index that overcomes this disadvantage.
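A sketch of the adjusted index in its commonly used contingency-table form follows: the number of co-clustered pairs is compared with its expectation under random labelings with fixed cluster sizes, then rescaled so that perfect agreement scores 1. The function name is hypothetical, `math.comb` requires Python 3.8 or later, and the handling of degenerate inputs is an assumed convention.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(c1, c2):
    """Chance-corrected Rand index of two labelings."""
    if len(c1) != len(c2):
        raise ValueError("label sequences must have the same length")
    total_pairs = comb(len(c1), 2)
    if total_pairs == 0:
        return 1.0  # assumed convention: nothing to disagree on
    # Pairs placed together in both, in C1, and in C2 respectively.
    together_both = sum(comb(v, 2) for v in Counter(zip(c1, c2)).values())
    together_1 = sum(comb(v, 2) for v in Counter(c1).values())
    together_2 = sum(comb(v, 2) for v in Counter(c2).values())
    expected = together_1 * together_2 / total_pairs
    maximum = (together_1 + together_2) / 2
    if maximum == expected:
        return 1.0  # both labelings are trivial and identical
    return (together_both - expected) / (maximum - expected)

print(adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 1]))
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

Unlike the plain Rand index, the adjusted value can be negative when two labelings agree less often than chance would predict.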