
Data partitioning
The data partitioning (clustering) process is the sequence of steps required for a complete cluster analysis. The decisions taken at each step affect the final result:
 The entities to be grouped must be selected. They should be chosen so that they are representative of the cluster structure in the population.
 The variables to be used in the cluster analysis must be selected. Again, the variables must contain enough information to allow the objects to be grouped.
 The user must decide whether or not to normalize the data. If normalization is to be performed, the user must select one procedure from several different approaches.
 A measure of similarity or dissimilarity must be selected. These measures reflect the degree of proximity or separation between objects.
 A clustering method must be selected. The user's concept of what constitutes a cluster is important, because different methods have been devised to find different types of cluster structure.
 The number of clusters must be determined.
 The final step is to interpret, test, and replicate the cluster analysis. Interpreting the clusters within the applied context requires the knowledge and expertise of the user's particular discipline. Testing means determining whether the clustering is significant or merely an arbitrary partition of random noise. Finally, replication determines whether the resulting cluster structure can be reproduced on other samples.
Although variations on this seven-step process may be needed to suit a particular application, this sequence represents the critical steps in a cluster analysis.
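As an illustration, steps 3 to 6 can be sketched in plain Python. This is a minimal sketch under stated assumptions, not a reference implementation: the toy data set is hypothetical, z-score scaling, Euclidean distance, k-means, and the mean silhouette width are only one possible choice at each step, and the farthest-first initialization is used here simply to keep the run deterministic. Step 7 (interpretation, testing, replication) is left to the analyst.

```python
import math

def standardize(rows):
    """Step 3: z-score each variable (one normalization among several)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]

def dist(a, b):
    """Step 4: Euclidean distance as the dissimilarity measure."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def init_centers(rows, k):
    """Assumption: deterministic farthest-first initialization."""
    centers = [rows[0]]
    while len(centers) < k:
        centers.append(max(rows, key=lambda r: min(dist(r, c) for c in centers)))
    return centers

def kmeans(rows, k, max_iter=100):
    """Step 5: plain k-means; returns one cluster label per row."""
    centers = init_centers(rows, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: dist(r, centers[j])) for r in rows]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        for j in range(k):
            members = [r for r, l in zip(rows, labels) if l == j]
            if members:  # keep the previous center if a cluster is empty
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels

def silhouette(rows, labels):
    """Step 6 helper: mean silhouette width in [-1, 1]; higher is better."""
    clusters = {}
    for r, l in zip(rows, labels):
        clusters.setdefault(l, []).append(r)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for r, l in zip(rows, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        a = sum(dist(r, o) for o in own) / (len(own) - 1)
        b = min(sum(dist(r, o) for o in other) / len(other)
                for m, other in clusters.items() if m != l)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(rows)

# Steps 1-2: hypothetical entities described by two variables on very different scales.
data = [[1.0, 100.0], [1.2, 110.0], [0.9, 95.0],
        [5.0, 500.0], [5.1, 520.0], [4.8, 490.0],
        [9.0, 100.0], [9.2, 90.0], [8.9, 105.0]]
scaled = standardize(data)

# Step 6: try several values of k and keep the one with the best silhouette.
scores = {k: silhouette(scaled, kmeans(scaled, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 2))
```

On this toy data the three visible groups are recovered and k = 3 obtains the highest silhouette. Without the normalization step, the second variable would dominate the distance, which is why the decision in step 3 matters.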
Data partitioning and classification are two fundamental tasks in data mining. Classification is mainly used for supervised learning, while data partitioning is used for unsupervised learning (some partitioning models do both). The objective of data partitioning is descriptive, whereas that of classification is predictive. Since the goal of data partitioning is to discover a new set of categories, the new groups are interesting in themselves and their evaluation is intrinsic: it relies only on the data and the partition. In classification tasks, however, an important part of the assessment is extrinsic, since the groups must reflect a set of reference classes.
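The difference between intrinsic and extrinsic evaluation can be made concrete with a small sketch. The code below is illustrative only: the points, the partition `found`, and the `reference` classes are hypothetical, the sum of squared error stands in for intrinsic criteria in general, and the Rand index stands in for extrinsic ones.

```python
from itertools import combinations

def sse(points, labels):
    """Intrinsic: sum of squared distances from each point to its cluster centroid.
    Needs only the data and the partition."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    total = 0.0
    for members in clusters.values():
        centroid = [sum(c) / len(members) for c in zip(*members)]
        total += sum(sum((x - m) ** 2 for x, m in zip(p, centroid)) for p in members)
    return total

def rand_index(labels, reference):
    """Extrinsic: share of point pairs on which the partition and the
    reference classes agree (same group in both, or different in both)."""
    pairs = list(combinations(range(len(labels)), 2))
    agree = sum((labels[i] == labels[j]) == (reference[i] == reference[j])
                for i, j in pairs)
    return agree / len(pairs)

points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
found = [0, 0, 1, 1]                 # hypothetical partition produced by some algorithm
reference = ["a", "a", "b", "b"]     # hypothetical reference classes

print(sse(points, found))            # 1.0
print(rand_index(found, reference))  # 1.0
```

The intrinsic score can always be computed, whereas the extrinsic score requires reference classes, which are usually unavailable in a purely exploratory partitioning task.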
Here is a 4V comparison of the most frequently used algorithms:
The strengths and weaknesses of each category:
As well as the most common comparison metrics: