Improved K-means Using an Integration of Density based and Genetic Algorithm based Model
Abstract
In today's data-oriented world, many clustering algorithms exist, but K-means is among the most widely studied and used. Despite its popularity, K-means has limitations. This work identifies several of them and proposes ways to address them so as to make the algorithm more efficient. The first limitation is that the optimal K value must be known beforehand; a poorly chosen value results in poor clustering. This is addressed with a method that uses density plots to determine the best K value. The second limitation concerns the selection of initial centroids. This step is optimized with an evolutionary algorithm, the genetic algorithm, which improves the efficiency of K-means as a whole. The enhanced version is tested on benchmark clustering datasets and artificially generated datasets. It is then applied to a dataset of geographical coordinates of taxi pickup locations to identify hotspots.