Data Partitioning / Clustering 101

Data partitioning

Data partitioning, also called cluster analysis, proceeds through a sequence of steps that together make up a complete analysis. The decisions taken at each step affect the result:

  1. The entities to be grouped must be selected. They should be chosen so that the sample is representative of the cluster structure in the population.
  2. The variables to be used in the cluster analysis are selected. Again, the variables must contain enough information to allow grouping of objects.
  3. The user must decide whether or not to normalize the data. If normalization is to be performed, the user must select a procedure from among several different approaches.
  4. A measure of similarity or dissimilarity should be selected. These measurements reflect the degree of proximity or separation between objects.
  5. A clustering method must be selected. The user's concept of what constitutes a cluster is important because different methods have been devised to find different types of cluster structures.
  6. The number of clusters must be determined.
  7. The final step in the clustering process is to interpret, test, and replicate the cluster analysis. Interpreting the clusters in the applied context requires the user's knowledge of and expertise in the particular discipline. Testing addresses whether the result is a significant clustering or merely an arbitrary partition of random noise. Finally, replication determines whether the resulting cluster structure can be reproduced in other samples.
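Steps 3 to 5 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a recommended pipeline: z-score normalization, Euclidean distance, and k-means with a deterministic farthest-first initialization are choices made here for the example, and the function names and the sample data are hypothetical.

```python
import math

def zscore(rows):
    """Step 3 (assumed choice): rescale each column to mean 0, std 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has std 0; fall back to 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def euclidean(a, b):
    """Step 4 (assumed choice): dissimilarity between two objects."""
    return math.dist(a, b)

def kmeans(points, k, max_iter=100):
    """Step 5 (assumed choice): Lloyd's k-means, one label per point."""
    # Farthest-first initialization: start from the first point, then
    # repeatedly add the point farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(euclidean(p, c) for c in centroids)))
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: euclidean(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels

# Hypothetical data: two variables measured on very different scales.
rows = [[1.0, 1000.0], [1.2, 1100.0], [0.9, 900.0],
        [8.0, 9000.0], [8.3, 9100.0], [7.9, 8800.0]]
labels = kmeans(zscore(rows), k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Without the normalization step, the second variable would dominate the Euclidean distance simply because its values are larger, which is why step 3 comes before step 4.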

Although variations on this seven-phase process may be necessary to suit a particular application, this sequence represents the critical steps in a cluster analysis.

Data partitioning and classification are two fundamental tasks in data mining. Classification is primarily a supervised learning method, whereas data partitioning is unsupervised (although some partitioning models do both). The purpose of data partitioning is descriptive; that of classification is predictive. Because data partitioning aims to discover a new set of categories, the new groups are interesting in themselves and their evaluation is intrinsic. In classification tasks, however, an important part of the assessment is extrinsic, because the groups must reflect a set of reference classes.
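To make the intrinsic/extrinsic distinction concrete, here is a minimal sketch with one measure of each kind. The choice of measures is an assumption for illustration: the mean silhouette coefficient needs only the data and the partition, while purity needs a set of reference classes. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def mean_silhouette(points, labels):
    """Intrinsic measure: uses only the data and the partition."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, grouped by cluster label.
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], [])
        if not own or not by_cluster:
            # Singleton cluster, or only one cluster overall: score 0 here.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

def purity(labels, reference):
    """Extrinsic measure: share of points in their cluster's majority class."""
    by_cluster = {}
    for label, ref in zip(labels, reference):
        by_cluster.setdefault(label, Counter())[ref] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(labels)

# Hypothetical toy data: two compact groups and a set of reference classes.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = [0, 0, 0, 1, 1, 1]
reference = ["a", "a", "b", "c", "c", "c"]

print(mean_silhouette(points, labels))  # near 1 for compact, well-separated groups
print(purity(labels, reference))        # 5 of 6 points match the majority class
```

The two numbers answer different questions: the silhouette says the groups are geometrically well formed, while purity says how closely they agree with classes defined in advance.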

Here is a 4V comparison of the most frequently used algorithms:

[Image: comparison of the most frequently used clustering algorithms]

The strengths and weaknesses of each category:

[Image: strengths and weaknesses of each category of clustering algorithm]

As well as the most common comparison metrics:

[Image: most common comparison metrics for clustering]