Cluster analysis

Distance Measures and similarity functions:

Evaluation criteria:

Algorithms:

Introduction

The clustering process consists of the following sequence of steps, which together make up a complete analysis:

  1. The entities to be clustered must be selected. The sample of elements should be chosen to be representative of the cluster structure in the population.
  2. The variables to be used in the cluster analysis are selected. Again, the variables must contain sufficient information to permit the clustering of the objects.
  3. The user must decide whether or not to standardize the data. If standardization is to be performed, then the user must select a procedure from several different approaches.
  4. A similarity or dissimilarity measure must be selected. These measures reflect the degree of closeness or separation between objects. A dissimilarity measure, such as distance, assumes larger values as two objects become less similar. A similarity measure, such as correlation, assumes larger values as two objects become more similar.
  5. A clustering method must be selected. The user’s concept of what constitutes a cluster is important because different methods have been designed to find different types of cluster structures.
  6. The number of clusters must be determined.
  7. The last step in the clustering process is to interpret, test, and replicate the resulting cluster analysis. Interpreting the clusters within the applied context requires the knowledge and expertise of the user’s particular discipline. Testing addresses whether the result is a significant clustering or merely an arbitrary partition of random noise. Finally, replication determines whether the resulting cluster structure can be reproduced in other samples.

Although variations on this seven-step process may be necessary to fit a particular application, this sequence represents the critical steps in a cluster analysis.
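
Steps 3 and 4 can be illustrated with a small sketch. It assumes z-score standardization as the chosen procedure (only one of the several possible approaches), Euclidean distance as the dissimilarity measure, and Pearson correlation as the similarity measure. The function names and the toy data are hypothetical and serve only as an illustration.

```python
import math

def zscore_columns(rows):
    """Standardize each variable (column) to mean 0 and unit standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Population standard deviation; a zero spread is replaced by 1.0
    # so that constant variables do not cause a division by zero.
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
        for c, m in zip(cols, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def euclidean(a, b):
    """Dissimilarity: grows as two objects become less similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    """Similarity: grows as the profiles of two objects become more alike."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

# Hypothetical data: three objects measured on four variables with very
# different scales. The first two objects are alike; the third is not.
rows = [
    [1.0, 200.0, 3.0, 10.0],
    [1.2, 210.0, 3.1, 11.0],
    [5.0, 900.0, 0.5, 40.0],
]

z = zscore_columns(rows)
a, b, c = z

print("distance a-b:", euclidean(a, b), " distance a-c:", euclidean(a, c))
print("correlation a-b:", pearson(a, b), " correlation a-c:", pearson(a, c))
```

With this data the distance between the two alike objects is smaller than the distance to the third object, while their correlation is larger, which matches the opposite directions of the two kinds of measure described in step 4.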

Clustering and classification are both fundamental tasks in data mining. Classification is used mostly as a supervised learning method, and clustering as an unsupervised one (although some clustering models serve both purposes). The goal of clustering is descriptive; that of classification is predictive. Since the goal of clustering is to discover a new set of categories, the new groups are of interest in themselves, and their assessment is intrinsic. In classification tasks, however, an important part of the assessment is extrinsic, since the groups must reflect some reference set of classes.
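
The difference between intrinsic and extrinsic assessment can be sketched as follows. The two criteria used here, within-cluster sum of squares and purity, are chosen only for illustration and are not named in the text above; the function names, points, and labels are likewise hypothetical.

```python
from collections import Counter

def within_cluster_ss(points, labels):
    """Intrinsic criterion: uses only the data and the grouping itself.

    Sums the squared distances from each point to its cluster centroid;
    smaller values indicate more compact clusters.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        centroid = [sum(coord) / len(members) for coord in zip(*members)]
        total += sum(
            sum((x - m) ** 2 for x, m in zip(point, centroid))
            for point in members
        )
    return total

def purity(labels, reference):
    """Extrinsic criterion: compares the grouping with reference classes.

    Returns the fraction of objects that belong to the majority
    reference class of their assigned group.
    """
    by_group = {}
    for label, ref in zip(labels, reference):
        by_group.setdefault(label, Counter())[ref] += 1
    return sum(max(counts.values()) for counts in by_group.values()) / len(labels)

# Hypothetical data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
reference = ["a", "a", "a", "b", "b", "b"]  # known classes, used only by purity

good = [0, 0, 0, 1, 1, 1]  # grouping that follows the visible structure
bad = [0, 1, 0, 1, 0, 1]   # arbitrary grouping

print("intrinsic:", within_cluster_ss(points, good), within_cluster_ss(points, bad))
print("extrinsic:", purity(good, reference), purity(bad, reference))
```

The intrinsic score can be computed for any clustering, whereas the extrinsic score requires a reference set of classes, which is normally available only in a classification setting.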