15-12-2012, 01:59 PM
A Project Report on Parallel Data Mining in Cloud using Hadoop
Abstract
In the information industry, huge amounts of data are widely available, and there is a
pressing need to turn such data into useful information. Data mining fulfils this need:
it is the process of exploring and analyzing large quantities of data, by automatic or
semi-automatic means, in order to discover meaningful patterns and rules. On a single
system with a few processors, there are limits on both the speed of processing and the
size of the data that can be processed at a time. Both limits can be raised if data
mining is carried out in parallel on coordinated systems connected over a LAN. The
problem with this solution is that a LAN lacks elasticity: the number of systems over
which the work is distributed cannot be changed according to the size of the data to be
processed. Our main aim is to distribute the data to be analyzed across the nodes of a
private cloud, so our first task is to set up a private cloud. Eucalyptus is software
that helps in creating and managing a private cloud. This project report investigates
how to set up a private cloud using Eucalyptus, along with its usability and
requirements. For optimal data distribution and efficient data mining as per the user's
needs, we will implement algorithms such as K-Means.
Introduction
Today is the information era. The explosion of unstructured information has created a
need to extract useful information from the huge datasets that are available. That is
what data mining techniques do. Data mining aims at finding meaningful patterns or
rules in large datasets. It is an interdisciplinary field that combines research from
areas such as machine learning, statistics, high performance computing, and neural
networks. A common feature of most data mining tasks is that they are resource
intensive and operate on large sets of data. Data sources measuring in gigabytes or
terabytes are now quite common in data mining. For example, WalMart records 20 million
transactions and AT&T produces 275 million call records every day. This calls for fast
data mining algorithms that can mine huge databases in a reasonable amount of time.
However, despite the many algorithmic improvements proposed for serial algorithms, the
large size and high dimensionality of many databases make data mining tasks too slow
and too big to run on a single-processor machine. A more efficient technology is needed
that can carry out data mining in a more automated way, more accurately and, most
importantly, faster. The cloud is an infrastructure that can fulfil these needs.
Scope of Project
First of all, we will set up a private cloud. We use a private cloud because it
imitates public cloud computing on a private network, providing the benefits of cloud
computing without its pitfalls with respect to data security, corporate governance, and
reliability. We will use Eucalyptus, which provides infrastructure as a service for
setting up our own private cloud. Our private cloud will consist of a cloud controller,
one cluster controller, and multiple nodes. We will run multiple instances on these
nodes, on which further data processing will take place.
Once the private cloud has been set up, we will configure Hadoop on it.
Hadoop is a software platform that lets one easily write and run applications that
process vast amounts of data. Since our main aim is to process large amounts of data
using clustering algorithms, Hadoop, which takes care of data distribution and parallel
data processing, will help us a great deal.
HDFS (Hadoop Distributed File System), which is part of the Hadoop platform, will
automatically distribute large datasets across the various instances.
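As a rough sketch (the NameNode address and file paths below are placeholders), a dataset can be copied into HDFS from Java using Hadoop's FileSystem API; HDFS then splits the file into blocks and replicates them across the DataNodes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; in a real cluster this comes from core-site.xml.
        conf.set("fs.default.name", "hdfs://master:9000");

        FileSystem fs = FileSystem.get(conf);
        // Copy the local dataset into HDFS; HDFS splits it into blocks
        // and replicates the blocks across the DataNodes automatically.
        fs.copyFromLocalFile(new Path("/home/user/dataset.csv"),
                             new Path("/data/dataset.csv"));
        fs.close();
    }
}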
Software Study
To carry out this project, it was necessary to understand the concepts of cloud
computing, to compare different cloud computing software platforms, to study the Hadoop
and Mahout frameworks, and to study the K-Means algorithm provided by Mahout.
Cloud Concepts
The definition of cloud computing is based on five attributes: multitenancy (shared
resources), massive scalability, elasticity, pay as you go, and self-provisioning of
resources.
Multitenancy (shared resources)
Cloud computing is based on a business model in which resources are shared (i.e.
multiple users use the same resource) at the network level, host level, and
application level.
Massive scalability
Although organizations might have hundreds or thousands of systems, cloud
computing provides the ability to scale to tens of thousands of systems, as well as
the ability to massively scale bandwidth and storage space.
Elasticity
Users can rapidly increase and decrease their computing resources as needed, as
well as release resources for other uses when they are no longer required.
Pay as you go
Users pay for only the resources they actually use and for only the time they
require them.
Self-provisioning of resources
Users self-provision resources, such as additional systems (processing capability,
software, storage) and network resources. This cloud capability allows users to
increase and decrease their computing resources as needed.
Eucalyptus- Private Cloud Computing Platform
EUCALYPTUS – Elastic Utility Computing Architecture for Linking Your
Programs to Useful Systems – is an open-source software infrastructure for
implementing “cloud computing” on clusters. The current interface to
EUCALYPTUS is compatible with Amazon’s EC2 interface, but the infrastructure
is designed to support multiple client-side interfaces. EUCALYPTUS is
implemented using commonly available Linux tools and basic Web-service technologies,
making it easy to install and maintain.
Eucalyptus enables the creation of on-premise private clouds, without requiring the
organization's existing IT infrastructure to be retooled or specialized hardware to be
introduced.
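Because the interface is EC2-compatible, a standard EC2 client can manage a Eucalyptus cloud simply by being pointed at the cloud controller instead of Amazon. The rough sketch below uses the AWS SDK for Java; the endpoint address, credentials, and image ID are placeholders for values issued by one's own Eucalyptus installation:

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.ec2.AmazonEC2Client;
import com.amazonaws.services.ec2.model.RunInstancesRequest;
import com.amazonaws.services.ec2.model.RunInstancesResult;

public class LaunchInstance {
    public static void main(String[] args) {
        // Credentials issued by the Eucalyptus cloud controller (placeholders).
        BasicAWSCredentials creds =
                new BasicAWSCredentials("EUCA_ACCESS_KEY", "EUCA_SECRET_KEY");
        AmazonEC2Client ec2 = new AmazonEC2Client(creds);

        // Point the EC2 client at the private cloud instead of Amazon.
        ec2.setEndpoint("http://cloud-controller:8773/services/Eucalyptus");

        // Launch one instance of a registered machine image (placeholder ID).
        RunInstancesRequest request = new RunInstancesRequest()
                .withImageId("emi-12345678")
                .withInstanceType("m1.small")
                .withMinCount(1)
                .withMaxCount(1);
        RunInstancesResult result = ec2.runInstances(request);
        System.out.println(result.getReservation().getInstances());
    }
}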
Hadoop
Apache Hadoop is a software framework that supports data-intensive
distributed applications under a free license. It enables applications to work with
thousands of computationally independent computers and petabytes of data.
A small Hadoop cluster will include a single master and multiple worker
nodes. The master node consists of a JobTracker, TaskTracker, NameNode, and
DataNode. A slave or worker node acts as both a DataNode and TaskTracker,
though it is possible to have data-only worker nodes, and compute-only worker
nodes; these are normally only used in non-standard applications. Hadoop requires
JRE 1.6 or higher. The standard startup and shutdown scripts require ssh to be set
up between nodes in the cluster.
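To illustrate how an application is written and run on such a cluster, the sketch below is the standard word-count example (not specific to this project): the JobTracker schedules its map and reduce tasks on the TaskTrackers, while the NameNode and DataNodes serve its input and output files. The HDFS input and output directories are passed as command-line arguments.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}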
Apache Mahout
Apache Mahout is an open source project by the Apache Software
Foundation (ASF) with the primary goal of creating scalable machine-learning
algorithms. While Mahout's core algorithms for clustering, classification and
batch-based collaborative filtering are implemented on top of Apache Hadoop using the
map/reduce paradigm, it does not restrict contributions to Hadoop-based
implementations. Mahout programs can also run on a single node or on a non-Hadoop
cluster. We will use Mahout on top of Hadoop to implement the K-Means algorithm.
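K-Means repeats two steps until the centroids stabilise: assign each point to its nearest centroid, then recompute each centroid as the mean of its assigned points. Mahout parallelizes these two steps as map/reduce jobs over data stored in HDFS. The single-machine sketch below only illustrates the algorithm itself; it is not the Mahout implementation:

import java.util.Arrays;
import java.util.Random;

public class KMeansSketch {

    // Euclidean distance between two points of equal dimension.
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Runs K-Means on 'points' and returns the final centroids.
    static double[][] kmeans(double[][] points, int k, int maxIterations) {
        int dim = points[0].length;
        Random rnd = new Random(42);

        // Initialise centroids with k randomly chosen points.
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[rnd.nextInt(points.length)].clone();
        }

        int[] assignment = new int[points.length];
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach each point to its nearest centroid.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (distance(points[p], centroids[c])
                            < distance(points[p], centroids[best])) {
                        best = c;
                    }
                }
                assignment[p] = best;
            }
            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    for (int d = 0; d < dim; d++) {
                        centroids[c][d] = sums[c][d] / counts[c];
                    }
                }
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] points = { {1, 1}, {1.5, 2}, {8, 8}, {8.5, 9}, {0.5, 1.2} };
        System.out.println(Arrays.deepToString(kmeans(points, 2, 10)));
    }
}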
Private Cloud Setup
Eucalyptus, described in the Software Study section above, provides the
infrastructure-as-a-service layer on which our private cloud is built.
Modifying the Ubuntu image with a built-in Hadoop setup
Hadoop is normally configured on an operating system running on a dedicated server, but
since we want to run it on operating system instances in the cloud, we have to modify
the image itself. If we do not, the Hadoop configuration has to be repeated every time
a standard Ubuntu image is launched on the cloud. The best way to overcome this is to
modify the standard Ubuntu image once and upload the modified image to Walrus, the
storage service of Eucalyptus.
Conclusion and Future Scope
The aim of the project was to carry out the data mining process in the cloud. This
should provide a much more efficient way of processing data than a centralized system,
reducing processing time to a great extent and making better use of resources. The
technique will prove convenient for processing large datasets.
As described earlier, three tasks had to be completed in order to achieve this goal:
setting up the private cloud, configuring Hadoop on the cloud, and carrying out the
data mining task with the help of Mahout. Of these three tasks, we have completed the
first two successfully, but because of time constraints we have not been able to
resolve some technical issues with the third.
We observed a problem with the Hadoop configuration: it is not possible to upload the
necessary input files to HDFS. To perform a data mining task, two conditions are
mandatory. First, the necessary input files must be available in the Hadoop Distributed
File System on the node. Second, the DataNode must be running continuously on the slave
machines. However, perhaps because of a lack of resources or for some other reason, the
DataNode goes down automatically, so we are not able to add the Mahout libraries to
those machines. We are working to solve this particular issue and to overcome whatever
technical limitations remain.