Determining the optimal number of clusters using distance based k-means algorithm

Publisher

Faculty of Applied Sciences, South Eastern University of Sri Lanka, Sammanthurai

Abstract

In the current digital era, data are generated rapidly and in enormous volumes from many different sources, and managing such large datasets is a major challenge. Clustering algorithms can find hidden patterns and extract useful information from large datasets. Among clustering techniques, the k-means algorithm is the most commonly used unsupervised classification technique; it partitions the data into a specified number of clusters (k). However, choosing the optimal k is a prominent problem when applying k-means. In most cases, when clustering large datasets, k is pre-determined by the researcher, and an incorrectly chosen k increases computational cost. To obtain the optimal number of clusters, a distance-based k-means algorithm was proposed and evaluated on a simulated dataset. Two distance measures were considered in the k-means algorithm, namely the Euclidean and Manhattan distances. The results on simulated data reveal that k-means with the Euclidean distance, rather than the Manhattan distance, yields the optimal number of clusters. Testing on real datasets gives results consistent with those from the simulated data.
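
The abstract does not state how the optimal k is selected, so the following is only a minimal sketch of the general idea, not the authors' procedure. It assumes a Lloyd-style k-means in which only the assignment distance changes (Euclidean or Manhattan) while the centroid update stays the coordinate-wise mean, and it assumes that candidate values of k are compared by total within-cluster distance. The names `simulate`, `kmeans`, and `cost_curve`, the three cluster centers, and the number of restarts are all hypothetical choices made for illustration.

```python
import random


def euclidean(p, q):
    """Straight-line (L2) distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def manhattan(p, q):
    """City-block (L1) distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))


def simulate(seed=1, per_cluster=50):
    """Hypothetical simulated data: three 2-D Gaussian blobs (assumed centers)."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
    return [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
            for cx, cy in centers for _ in range(per_cluster)]


def kmeans(points, k, dist, iters=50, seed=0):
    """Lloyd-style k-means; returns (centroids, labels, total within-cluster distance)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids drawn from the data
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid under the chosen distance.
        new_labels = [min(range(k), key=lambda j: dist(p, centroids[j]))
                      for p in points]
        # Update step: coordinate-wise mean of each non-empty cluster
        # (assumption: the mean is kept for both distance measures).
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
    cost = sum(dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, cost


def cost_curve(points, dist, k_max=6, restarts=10):
    """Best total within-cluster distance over several restarts, for k = 1..k_max."""
    return {k: min(kmeans(points, k, dist, seed=s)[2] for s in range(restarts))
            for k in range(1, k_max + 1)}


if __name__ == "__main__":
    data = simulate()
    for name, dist in (("Euclidean", euclidean), ("Manhattan", manhattan)):
        curve = cost_curve(data, dist)
        print(name, {k: round(v, 1) for k, v in curve.items()})
```

Under these assumptions, the value of k at which the cost stops dropping sharply (the "elbow") would be taken as the candidate optimum, and the two distance measures can be compared by whether that elbow matches the number of clusters used to simulate the data.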

Citation

Proceedings of the 11th Annual Science Research Sessions, FAS, SEUSL, Sri Lanka, 15th November 2022, "Scientific Engagement for Sustainable Futuristic Innovations", p. 60.
