Data Partitioning / Clustering 101

Data partitioning

Data partitioning (cluster analysis) proceeds through a sequence of steps that together make up a complete analysis. The decisions taken at each step affect the result:

  1. The entities to be grouped must be selected. The elements must be chosen to be representative of the structure of the clusters in the population.
  2. The variables to be used in cluster analysis are selected. Again, variables must contain enough information to allow grouping of objects.
  3. The user must decide whether or not to normalize the data. If normalization is to be performed, the user must select one procedure from several different approaches.
  4. A measure of similarity or dissimilarity should be selected. These measurements reflect the degree of proximity or separation between objects.
  5. A clustering method must be selected. The user's concept of what constitutes a cluster is important because different methods have been devised to find different types of cluster structures.
  6. The number of clusters must be determined.
  7. The final step in the clustering process is to interpret, test, and replicate the cluster analysis. Interpreting the clusters within the applied context requires the knowledge and expertise of the user's particular discipline. Testing addresses whether the data contain significant cluster structure or whether the result is merely an arbitrary partition of random noise. Finally, replication determines whether the resulting cluster structure can be reproduced in other samples.
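Steps 3 and 4 can be illustrated with a minimal sketch. It assumes z-score standardization as the normalization procedure and Euclidean distance as the dissimilarity measure; both are only one choice among several, and the data values and function names below are hypothetical.

```python
import math

def zscore_columns(rows):
    """Standardize each column to mean 0 and population standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has standard deviation 0; fall back to 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical objects described by (height in cm, income in dollars).
raw = [[170.0, 30000.0], [180.0, 31000.0], [171.0, 90000.0]]
scaled = zscore_columns(raw)

# On the raw data the income column dominates every distance.
raw_d01 = euclidean(raw[0], raw[1])
raw_d02 = euclidean(raw[0], raw[2])

# After standardization both variables contribute on a comparable scale.
scaled_d01 = euclidean(scaled[0], scaled[1])
scaled_d02 = euclidean(scaled[0], scaled[2])

print(raw_d02 / raw_d01, scaled_d02 / scaled_d01)
```

On the raw values the third object looks far more distant from the first than the second does, purely because income is measured in larger units. After standardization the two distances are of similar size, which shows why the normalization decision in step 3 can change which objects end up grouped together.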

Although variations on this seven-step process may be needed to suit a particular application, this sequence represents the critical steps in a cluster analysis.
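As a rough illustration of steps 5 and 6, the sketch below uses k-means (Lloyd's algorithm) as the clustering method and the within-cluster sum of squares as a guide to the number of clusters. These are assumptions made for the example, not the only options; the farthest-point initialization, the toy data, and all function names are hypothetical choices that keep the run deterministic.

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=100):
    """Lloyd's algorithm with a deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from its nearest already-chosen center.
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centers

def wcss(points, labels, centers):
    """Within-cluster sum of squared distances to the assigned center."""
    return sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))

# Toy data: three small, well-separated groups in the plane.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
          [0.0, 10.0], [1.0, 10.0], [0.0, 11.0]]

curve = {}
for k in range(1, 5):
    labels, centers = kmeans(points, k)
    curve[k] = wcss(points, labels, centers)
    print(k, round(curve[k], 3))

labels3, _ = kmeans(points, 3)
```

The sum of squares falls sharply up to k = 3 and only slightly afterwards, which is the "elbow" often read as a suggestion for the number of clusters. On real data the elbow is rarely this clear, so it is best treated as one heuristic among several.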

Data partitioning and classification are two fundamental tasks in data mining. Classification is mainly used as a method of supervised learning, whereas data partitioning is used for unsupervised learning (although some partitioning models do both). The objective of data partitioning is descriptive; that of classification is predictive. Since the goal of data partitioning is to discover a new set of categories, the new groups are interesting in themselves and their evaluation is intrinsic. In classification tasks, however, an important part of the assessment is extrinsic, since the groups must reflect a set of reference classes.
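The difference between intrinsic and extrinsic evaluation can be made concrete with a small sketch. It assumes the mean silhouette coefficient as an example of an intrinsic measure (computed from the data and the grouping alone) and the Rand index as an example of an extrinsic one (computed against reference classes). The data, labels, and function names are hypothetical.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_silhouette(points, labels):
    """Intrinsic: compares within-cluster cohesion to separation from the nearest other cluster."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(euclidean(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster or single cluster: 0 by convention
            continue
        a = sum(own) / len(own)                                 # mean distance inside own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())   # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

def rand_index(labels_a, labels_b):
    """Extrinsic: fraction of object pairs on which two labelings agree (same group vs. different)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)

points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
found = [0, 0, 1, 1]       # a grouping that matches the visible structure
mixed = [0, 1, 0, 1]       # a grouping that cuts across it
reference = [1, 1, 0, 0]   # reference classes, with different label names

print(mean_silhouette(points, found), mean_silhouette(points, mixed))
print(rand_index(found, reference), rand_index(mixed, reference))
```

The silhouette needs no reference labels and rewards the compact, well-separated grouping. The Rand index needs reference classes but is insensitive to how the groups are named, which is why `found` agrees perfectly with `reference` even though the label values are swapped.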

[Figure: 4V comparison of the most frequently used algorithms]

[Figure: strengths and weaknesses of each category]

[Figure: most common comparison metrics]
