K-means and K-medoids Clustering

Clustering partitions data into meaningful groups such that data items within each group are similar to one another but dissimilar to items in other groups. One of the most widely used clustering procedures is the K-means algorithm, which seeks to group the data into K clusters, each represented by the mean of its members. Starting with a set of K initial centers, K-means assigns each data item to the cluster whose center is closest and then recomputes the cluster centers. The process repeats, successively improving the locations of the cluster centers and the cluster memberships, as shown in the series of panels below.
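The assign-then-recompute loop just described can be sketched in a few lines of NumPy. The initialization scheme here (sampling K data items as starting centers) and the convergence check are illustrative choices, not part of the algorithm's definition:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means sketch: alternate assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    # One common initialization: pick k distinct data items as starting centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each item joins the cluster with the nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its current members
        # (an empty cluster simply keeps its old center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

Note that each center is an average of its members, so in general it is not itself one of the data items.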


For K-means to work, each data item must be represented by numerical attributes. Note also that a cluster center need not correspond to an actual data item in the cluster, as the pictorial example above illustrates. This aspect of K-means becomes an issue when you want each cluster representative to be an actual data item, a requirement that typically arises when clustering, for example, text documents, pictures, or other multimedia artifacts. Suppose you have a set of pictures and want to represent the set by a single picture. You would select the one picture you believe is most representative of the set; you would not think of computing an "average picture" as its representative. This is what the K-medoids algorithm does: it works just like K-means but represents each cluster by the data item that is most similar to the rest of the items in the cluster. This kind of situation is shown below.
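A minimal sketch of this idea is below, assuming the simple alternating variant (sometimes called Voronoi iteration) rather than the classic PAM swap search, and assuming a precomputed pairwise dissimilarity matrix D so that no numerical attributes are needed:

```python
import numpy as np

def kmedoids(D, k, n_iters=100, seed=0):
    """Minimal K-medoids sketch over a precomputed dissimilarity matrix D.

    D[i, j] is the dissimilarity between items i and j; each cluster's
    representative (medoid) is always an actual data item.
    """
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    for _ in range(n_iters):
        # Assignment step: each item joins the cluster of its nearest medoid.
        labels = D[:, medoids].argmin(axis=1)
        # Update step: within each cluster, pick the member whose total
        # dissimilarity to the other members is smallest.
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if len(members):
                costs = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[costs.argmin()]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break
        medoids = new_medoids
    # Recompute labels so they match the returned medoids.
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels
```

Because the update step only ever selects indices of existing items, the returned representatives are guaranteed to be members of their clusters.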


The K-medoids algorithm is also used in applications where data items have attributes that are qualitative in nature. In such cases, the data item showing the most similarity to the other items in the cluster is again the preferred choice for cluster representation. Keep in mind that K-medoids is computationally more expensive than K-means, since its update step compares cluster members pairwise rather than simply averaging them.
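With qualitative attributes, the "most similar to the rest" idea only requires a dissimilarity function, not averaging. The mismatch-count (Hamming-style) distance and the toy records below are illustrative assumptions:

```python
def hamming(a, b):
    """Number of attribute positions where two qualitative records disagree."""
    return sum(x != y for x, y in zip(a, b))

def medoid(items, dist):
    """Return the item with the smallest total dissimilarity to all others."""
    return min(items, key=lambda a: sum(dist(a, b) for b in items))

# Hypothetical categorical records: (species, hair length, color).
pets = [
    ("dog", "short", "brown"),
    ("dog", "short", "black"),
    ("cat", "long", "brown"),
    ("dog", "long", "brown"),
]

medoid(pets, hamming)  # → ('dog', 'short', 'brown')
```

No "average" of these records is meaningful, but the medoid is always a genuine record from the set.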