11-08-2014, 12:48 PM
Improvements in K-means Algorithm to Execute on Large Amounts of Data
IMPROVEMENTS IN K-MEANS.pdf (Size: 483.4 KB / Downloads: 50)
ABSTRACT
By the help of large storage capacities of current computer systems, datasets of
companies has expanded dramatically in recent years. Rapid growth of current
companies’ databases has raised the need of faster data mining algorithms as time is very
critical for those companies.
Large amounts of datasets have historical data about the transactions of
companies which hold valuable hidden patterns which can provide competitive advantage
to them. As time is also very important for these companies, they need to mine these huge
databases and make accurate decisions in short durations in order to gain marketing
advantage. Therefore, classical data mining algorithms need to be revised such that they
discover hidden patterns and relationships in databases in shorter durations.
In this project, K-means data mining algorithm has been proposed to be improved
in performance in order to cluster large datasets in shorter time. Algorithm is decided to
be improved by using parallelization. Parallelization of the algorithm has been considered
to be a suitable solution as the popular way of increasing computation power is to
connect computers and execute algorithms simultaneously on network of computers. This
popularity also increases the availability of parallel computation clusters day by day.
Parallel version of the K-means algorithm has been designed and implemented by
using C language. For the parallelisation, MPI (Message Passing Interface) library has
been used. Serial algorithm has also been implemented by using C language for the
purpose of comparison. And then, algorithms have been run for several times under same
conditions and results have been discussed. Summarized results of these executions by
using tables and graphics has showed that parallelization of the K-means algorithm has
provied a performance gain almost proportional by the count of computers used for
parallel execution
INTRODUCTION
Parallel version of the K-means algorithm has been designed and implemented in this
project for the purpose of improvement of K-means algorithm in execution time. Serial
(Classical K-means) version of the algorithm has also been implemented for the purpose of
comparison with parallel the version in time. Both implementations have been tested on the
same environment and results have been discussed. As K-means is a clustering algorithm
which is a type of data mining algorithm, data mining and clustering have also been
examined in the project. KDD (Knowledge Discovery in Databases) has also been
discussed, because data mining is a step of it. After addressing where K-means stands,
details of serial and proposed parallel K-means algorithms has been presented. Before
examining parallel K-means algorithm, parallelisation concept of algorithms has been
introduced in order to prepare a background for the details of parallel K-means algorithm.
After all, both versions (serial and parallel) of the algorithm have been executed on the
same platform and results have been discussed
KNOWLEDGE DISCOVERY IN DATABASES AND DATA MINING
DD refers to the overall process of discovering useful knowledge from databases.
KDD consist of several steps. Data mining refers to a particular one of those steps of
overall KDD process. Data mining is the application of specific algorithms for extracting
patterns, which then will be interpreted and evaluated to produce knowledge, from data
[Fayyad, Piatetsky-Shapiro, and Smyth 1996]. Main aspect of this Master’s Thesis Project
is the data mining itself, not the whole KDD process. Therefore, data mining will be
examined in more detail then the overall KDD process.
In addition to data mining step, KDD process also has data selection, preprocessing,
transformation and interpretation steps as shown in Figure 2.1 [Fayyad, Piatetsky-Shapiro,
and Smyth 1996]. Composition of these steps constitutes the KDD process. In order to
understand what data mining is and address where it stands, overall KDD process will be
presented shortly. In Figure 2.1, overall representation of KDD process, which also
includes data mining step, can be seen
Data Warehousing
Success of data mining is strongly related with data warehousing functions of
companies. Data mining uses pre-processed data which are supplied by data warehouses.
Companies’ data warehousing departments develop functions which collect valuable data
from business activities continuously. Collected and cleansed data are then stored in data
warehouses in order to be used in statistical works. Data mining is the one of the most
important ones of those statistical works which uses the data stored in data warehouses
Types of Data Mining Algorithms
Data mining algorithms are the collection of techniques in order to perform data
mining task. Currently, there are a lot of data mining algorithms for a wide range of data
mining tasks. Mainly, these algorithms can be categorized into there groups according to
the types of patterns which those algorithms try to discover [Apte 1997]. These three types
of algorithm are presented in the following parts.
Deficiencies of Serial K-means Algorithm
In today’s world, company’s databases have grown explosively which makes it very
time consuming or sometimes impossible to run traditional data mining algorithms on their
data bases. Therefore, serial K-means algorithm will either lack in performance or crash
when trying to cluster huge amounts of databases which arises the need of improvement of
K-means algorithm in order for the K-means algorithm to cluster large amounts of data in
reasonable times.
An ideal data mining algorithm should scale well. In other words, the algorithm
should produce true results in reasonable time even the database grows to very large
amounts. Therefore, some modifications should be done to traditional data mining
algorithms in order to make them scale well in case of huge amounts of data.
PARALLELIZATION OF ALGORITHMS
heory of parallelisation is mainly breaking large tasks into smaller ones, assigning
these smaller tasks to multiple working units and finally managing these multiple working
units [MHPCC 1999]. When efficient coordinating (well parallelism) of working units is
present, tasks can be completed in much shorter durations. Parallelisation of tasks is very
common in real life such as building construction.
Parallelisation is also applicable for algorithms. In the scope of this master thesis
project, serial version of K-means algorithm is implemented and the parallel version of the
algorithm is designed and implemented in C Language. Development of the parallel version
of the algorithm is to improve the algorithm in order to scale well for large volumes of data
and produce correct results in reasonable time.
As parallelisation is a strategy for performing large and time consuming tasks faster,
when parallel computing is considered, it is obvious that process power will be increased
and the process time will be decreased with the provision that design and implementation
of parallelization is done properly [MHPCC 1999]. And also, properties of algorithms,
which are to be parallelized, are very important for parallelization task. Some algorithms
are very suitable for parallelization but some of them are not
Parallel Programming with MPI
Before designing the parallel algorithm, a parallel programming technique should be
selected in order to design the parallel algorithm according to that technique. In this project,
a tool called LAM-MPI (Local Area Multicomputer – Massage Passing Interface) which
uses MPI (Message Passing Interface) standard will be used as the parallel programming
tool and technique
2
. LAM-MPI is selected as the parallel programming tool because of its
widely acceptance in the world and availability of platforms to execute LAM-MPI
programs. In order to execute LAM-MPI programs, an Ethernet network with Linux
operating systems is enough. And also the LAM-MPI utility must be installed to the
computers.
LAM-MPI utility is a collection of runtime components and libraries which helps
writing parallel programs in C Programming Language without dealing with networking
tasks and scheduling of processors. Written C program by the programmer must only deal
with parallelization of the business logic to be performed. Programmer deals with processes
(not the processors) which of each perform the same task on different data and uses
massage passing commands to communicate the processes. Scheduling of processes to
computers in the network and communicating the computers in the network with each other
is the job of LAM-MPI utility.
PARALLEL K-MEANS ALGORITHM
As mentioned in previous chapters, parallel K-means algorithm has been developed
and implemented in this project by the aim of performance increase when compared with
serial K-means. Parallelization of K-means algorithm has been proposed to be a solution
for the need of a faster K-means algorithm in order to cluster large amounts of databases in
reasonable durations. And also, by using parallel K-means, it has been aimed to gather
exactly same clustering results with serial algorithm, since purpose of parallelization in this
project is to perform exactly same clustering in shorter duration.
K-means algorithm has been re-designed to run in parallel manner by using the
message passing technique of parallelization. LAM-MPI (Local Area Multicomputer –
Message Passing Interface) utility has been used in order to implement the parallelized
version of K-means algorithm. This utility provides development of parallel programs
which are to be executed on network of computers (ethernet networks). LAM-MPI utility
has been preferred to be used for the development of message passing based parallel
algorithm because of widely acceptance of the utility.
Expectations from the Parallel Algorithm
Aim of the parallelization of K-means algorithm was to improve the algorithm such
that it will produce the same results with serial algorithm. As no change has been
performed on the K-means algorithm during parallelization, it is expected that the
algorithm will produce exactly same clustering results with serial one. This can also be seen
by the sample convergences of both the serial algorithm (Section 3.4) and the parallel
algorithm (Section 5.3).
In the design of a parallel algorithm, overhead of root computer for synchronisation
and the amount of data interchanged between computers should be minimized as possible
for the success of parallelization. If root overhead of synchronisation and frequency of data
interchange are high and if large amount of data are passed between computers then it
won’t be surprising that parallelization of the algorithm.
In this project, K-means algorithm has been designed carefully to run in parallel
manner such that frequency of messaging and the amount of messages transmits between
processes have been lessened as much as possible.
CONCLUSION AND FUTURE WORKS
Main aspect of this project has been to improve the K-means algorithm so that it can
perform clustering on large datasets in reasonable durations. When considering databases of
current companies, this improvement becomes very critical, because current companies
have huge amounts of data and they need to mine their databases in short durations in order
to have marketing advantage.
When examining serial K-means algorithm, it can be observed that the algorithm
deals with all objects in dataset serially which very time consuming especially for large
databases. When huge datasets are in account, serial K-means algorithm either lacks in
performance or crashes because of the larger dataset than the amount of memory of a single
machine.
In this project, parallelization of K-means algorithm has been proposed as an
improvement for the algorithm. It has been proposed that, parallel version of the algorithm
will produce exactly the same results with the serial algorithm in much shorter durations
almost by the number of computers used.
Parallel version of the algorithm has been designed and implemented by using C
language and LAM-MPI (Local Area Multi Computer - Message Passing Interface) utility
which can be used by C programs. Serial algorithm has also been implemented by C
language for the purpose of comparison with parallel version.
Design of the parallel algorithm has been performed carefully, such that algorithm
requires minimal frequency of messaging between processes with minimal message sizes.
Unlike the previous parallelization works of K-means algorithm, which transmit objects in
whole dataset between processes in each iteration of the clustering algorithm, only
summary of objects in dataset is transmitted between processes. Summary of objects,
mentioned above, is the sum and count of objects in a cluster which is a very small data
when comparing with the all objects in datase