
CLOUD COMPUTING


INTRODUCTION

1.1 Abstract

This report introduces our group's final year project. We describe the project's objective, the experiments we performed, the benefits and trade-offs of using cloud and Hadoop, and the problems we solved.
The objective of our project is to use a private cloud and Hadoop to handle massive data processing tasks such as video searching. In this final year project, we chose two technologies. The first is cloud computing, which allows us to allocate resources in an easy and efficient way; the cloud we chose is Ubuntu Enterprise Cloud, a private cloud. The second technology is Hadoop, a software framework for implementing data-intensive programs across thousands of machines. Eucalyptus provides resources in the form of individual virtual machines, and Hadoop manages all the resources of those virtual machines: it evenly allocates storage and computation power and divides large jobs into many small tasks spread across separate machines. Since we are using two "new" technologies (cloud and Hadoop) to handle a large-scale job, we must fully understand the characteristics of these systems. To do so, we ran many experiments testing their performance, especially their scalability, and optimized the environment. We found the private cloud powerful but with significant limitations, especially performance trade-offs, while Hadoop allows us to easily process large amounts of data in parallel, which is difficult to handle in traditional ways. In the middle of this report, we describe the advantages and disadvantages of using the cloud and Hadoop.



1.2 Motivation

File sizes keep increasing
Although the computing power of machines keeps increasing at a very high speed (roughly every three years, CPU computing power doubles), the size of files is also growing at an amazing rate.
Twenty years ago, the common format was plain text. Later, computers could handle graphics well and play low-quality movies. In recent years we were no longer satisfied with DVD quality and introduced the Blu-ray disc. Common file sizes have grown from a few KB to now nearly 1 TB.
Limitations of a single CPU core
The CPU is the most important part of a computer, but with current technology its clock speed has reached an upper bound. Directly increasing the frequency of a CPU core is not a viable way to improve computation power, as the temperature of such a core could rise to thousands of degrees. The current solution is to put more cores into a single CPU, but the better solution is to run applications across multiple cores and multiple machines in parallel, which is the ultimate aim of most such applications.

Parallel system can make computation faster

File sizes are increasing, but algorithms are not improving as quickly. The improvement gained from a faster CPU running a similar algorithm is limited. Moreover, newer formats usually require more complicated decoding algorithms, which makes processing take even longer. The only way to greatly speed up processing is to run the job in parallel on different machines.


Cloud reduces the cost

Buying expensive facilities for a single purpose is unreasonable, whereas the cloud charges only for how long we use it. This enables many people to run large-scale projects at an acceptable cost.
We hope to use the cloud and related technologies to implement applications that would otherwise be difficult to implement.






1.3 Project Objective

This project aims to implement large applications with Hadoop (a parallel programming software framework) on a cloud. Instead of programming very fast applications on a few machines, we focus on programming applications that scale well from a few machines to thousands of machines.

1.4 Idea of our project

Our idea is very simple. Assume there is a very large video database. Given a set of videos or frames, we want to find them in that database and report the position of the matching video.

The idea is simple, but it is useful in different contexts. If we change the algorithm to an object-detection algorithm, it can be used in a surveillance video application. If we plug in a face-detection algorithm, it becomes a video diary. The key point of this project is building an application with high scalability: when the database grows, the application can still handle it.




1.5 Project resource

Our project introduces the cloud and its performance. The View Lab gave us machines to set up a cloud.





1.5.1 Hardware

3 x Dell Desktop PC (CPU: Intel Pentium D 3 GHz with VT-x support; Memory: 1 GB; Storage: 160 GB): these serve as the node servers of the cloud. KVM requires the physical hardware to support VT-x or AMD-V, the Intel and AMD hardware virtualization extensions.

1 x D-Link wireless router: it plays the role of DHCP server and switch and helps set up the local network.

1 x Asus 1000M switch: most of the machines are connected to this switch in order to achieve 1000 Mbit/s bandwidth and eliminate the network bottleneck.



1.5.2 Software and framework

Ubuntu Server Edition 10.10 (64-bit): Ubuntu Enterprise Cloud (UEC) is the most common platform for setting up a private cloud, and it has the most support and discussion on the internet. Hadoop: MapReduce is a programming model that appeared in recent years, and Hadoop is the well-known open source framework that implements it; it is very suitable for large-scale, data-intensive applications.



Chapter 2:
Cloud Computing

2.1 Introduction to cloud’s concept

Cloud computing provides processing over the web with a large amount of resources. Users of the cloud obtain the service through a network (either the internet or an intranet). In other words, users are using or buying computing services from others. The resources of the cloud can be anything IT-related. In general, a cloud provides applications, computation power, storage, bandwidth, databases and technologies such as MapReduce. As the resource pool is very large, users can scale their applications on the cloud to any size, fully under their control.

2.2 From the point of view of resources

2.2.1 Types of cloud
There are four main types of cloud:

1) Public cloud: the cloud computing resources are shared with outsiders; anyone can use them, and some payment may be needed.

2) Private cloud: the opposite of a public cloud; a private cloud's resources are limited to a group of people, such as the staff of a company.


3) Hybrid cloud: a mixture of the previous two; some cloud computing resources are shared externally and some are not.

4) Community cloud: a special kind of cloud that makes use of cloud computing's features. More than one community shares a cloud in order to share and reduce the cost of the computing system. This is discussed later.

We focus only on private and public clouds. Typical examples of public clouds are Amazon Web Services (AWS EC2) and WSO2. We introduce Eucalyptus and AWS later.









2.2.2 Types of cloud service

There are three main types of cloud service:

1) Software as a Service (SaaS):

Clients can use the software provided by the provider. It usually does not need to be installed and is usually a one-to-many service, like Gmail or a search engine. The typical cloud application we used is Google Docs, a powerful web application similar to Microsoft Office. It is free for most users and has a paid version for companies that want more features.


Apart from the software we often use in the office, there are more powerful cloud services such as Jaycut and Pixlr. Jaycut is a free online application implemented in Flash; you can upload movies and edit them. It is very powerful and can do almost all the basic effects of a desktop application. After you finish your work, it compiles the final product at a surprisingly high speed and gives you a download link. You can deliver the movie by passing the link to others and need not worry about storage.


2) Platform as a Service (PaaS):

Clients can run their own applications on the platform provided; typical platforms are Linux and Windows.

3) Infrastructure as a Service (IaaS):

Clients can put their own operating system on the cloud. For example, a user can deploy a Linux image optimized for networking.

In practical usage, PaaS and IaaS are actually very similar; the difference is whether the image is provided by the user or not. If you use a provided image, it is PaaS; otherwise, it is IaaS.
Clouds usually provide many images with pre-set-up environments such as an SQL server and PHP. Users treat it like online shopping, buying the IT services they need.



2.2.3 User interface

Apart from accessing applications with web browsers, many other tools have been developed to provide a user-friendly interface. For example, Eucalyptus and AWS EC2 instances can be managed with ElasticFox, which was developed by Amazon, and Eucalyptus provides a web interface for accessing the configuration of the cloud.
The terminal is of course a very useful interface for controlling the cloud. It is also the best way to handle more complex operations, such as running an instance with additional options.
2.2.4 Why cloud is so powerful

First, we define physical machines as cloud servers, and an instance or virtual machine (VM) as the virtual server provided to users.

The resource pool is large
A public cloud surprises users with its elastic property and its giant resource pool of thousands of machines. With a few lines of commands, users can use these machines to do anything they want.
Easy to manage
A private cloud surprises users with how easily it manages resources and how well it uses them. Private cloud users can send requests just like public cloud users, and they get what they want almost immediately. All these actions are handled by the cloud system, and administrators need not set up any new machines.

Virtualization lets a physical machine divide its resources and play many roles
Typical clouds (Eucalyptus and AWS EC2) make heavy use of virtualization and run many virtual machines on a single physical machine. Each virtual machine gets an environment with a different amount of resources, such as the size of RAM, storage and the number of virtual CPU cores. The allocated resources can be very small or extremely large.

2.2.5 Usage of public cloud

A public cloud usually charges per user-hour at a reasonable price, often within $20. Apart from adjusting the amount of resources allocated, the platform the instance runs can also differ, such as Windows or Linux. Since an instance is only a virtual machine, applications can be preinstalled inside the image, giving many different types of images. Essentially, the cloud is an IT-resource hiring service, and there are many ways to use it. The main aim of using the cloud is reducing cost: public clouds generally let users pay as they go and get only the services they want.

Using the public cloud as the only IT support
Purely using a public cloud can reduce or eliminate the cost of managing an IT system, since a company must spend many resources on maintenance beyond buying the facilities. Cloud providers usually offer many management tools, some of them free. These tools can manage resources according to usage and adjust the number of instances automatically. Detailed reports are also sent to users, alerting them to any issues their instances encounter.

Using expensive facilities
The cloud can also provide expensive facilities at a reasonable price. For example, if you need a few high-performance machines for some experiments lasting only a few days, it does not make sense to buy those machines for that single purpose. AWS, for instance, provides very powerful machines with GPU support and charges accordingly.



Using the cloud as part of the IT resources
In some situations, a company's bandwidth and storage may not be enough to serve users all over the world. It may use the cloud for storage, as a cloud pool usually has very large bandwidth.

Public cloud is a service
In conclusion, the cloud is a service, and it is a business term rather than a technical one. It separates the concerns of the user and the provider. The cloud is very powerful because it uses virtualization and has a giant resource pool, and using it can reduce the cost of IT management.

Many people misunderstand the cloud
The cloud is a term that confuses many people. It seems to be a way to solve large-scale problems and to give an instance more computation power than its host machine. In reality, a virtual machine cannot be more powerful than the physical host it runs on.


2.2.6 Private cloud

Owning a cloud is not the sole right of big companies, though of course a private cloud is weaker than a public one because of its smaller resource pool.
There are a few open source cloud projects, and Eucalyptus is the most famous one in the Linux environment. In our project, we chose Ubuntu Enterprise Cloud (UEC) as it has more support and documentation. Before we introduce Eucalyptus, we introduce the private cloud itself.
Private clouds usually provide only virtual machines and storage as services, and they have features very similar to public ones: the cloud server generally provides instances in the form of virtual machines. Since a very powerful machine is not always required, a cloud server can run many small instances on one machine and greatly reduce the cost of buying servers. Apart from reducing the number of machines, the IT resources of an instance can be dynamically adjusted (a reboot is required, as with VirtualBox) or moved to other machines, even to a public cloud.

Fast virtual machine setup
Imagine you are setting up 100 machines to run MapReduce: you might set up the machines one by one, repeating the same job. It is surely a slow task. In the worst case, if you need to install new software on each machine, such as FFmpeg (the encoder we used in this project), you must spend a long time applying that minor change to every machine. Using a cloud is a totally different story. We implemented a MapReduce program for movie conversion that required FFmpeg; we only needed to modify a single image and upload it to each server, which took just a few hours. In our project, we spent only a few hours setting up a Hadoop cluster on the cloud, but almost a whole day on physical machines. The virtual machine environment also spares us from driver problems.

Easy to replace the hardware
Generally, an operating system is bound to its physical machine. Replacing hardware usually requires reinstalling the operating system, drivers, applications and configuration, which surely takes a long time.
On a cloud server, the image and the physical machine are separated by virtualization, so even if the physical machine is replaced, the image still gets a similar running environment. Whenever a hardware facility is replaced, only the computation power changes. After installing virtualization software such as KVM or Xen (two technologies widely applied in cloud computing), images can still run on the new server. Since administrators need not worry about the configuration of images, they can upgrade and replace the system faster and more easily.

Better use of resources
Since the cloud is a resource pool shared over the network, it can be elastically adjusted. Administrators can allocate suitable amounts and types of resources to different users according to their needs. For example, an application handling text data can use less memory and fewer CPU cores, while an application simulating a 3D world can be allocated many more powerful instances, even with GPU support.

Security
Security is surely an important issue. A virus can easily propagate to other programs within a server and can cause very serious damage that takes time to clean up. Apart from viruses, some applications may crash and use up all the CPU resources, preventing other programs from working properly. Using a cloud separates applications into different virtual machines within one physical host. Since the virtual machines are independent, the chance of infection decreases significantly. A virtual machine shares only a limited number of virtual cores, so even if an application hangs and uses up its resources, other machines are not affected and keep working properly.


2.3 From the point of view of application structure

2.3.1 Using a cloud layer

Usually, a cloud provides users with virtual machines. A virtual machine can be independent of the others, just like the servers inside a server room. But how can a cloud be elastic and fit a user's business?
There are two main approaches. Both adjust the number of virtual machines running, but with two totally different mechanisms. The first approach gives different virtual machines different roles. The typical example is Microsoft Azure. Take a large-scale web application similar to YouTube as an example: the duties of the application are to let users upload videos and play videos.
Azure is actually a cloud layer running on a number of Windows servers. It contains many technologies such as crash handling and a workload distribution system. If a machine crashes, the application can still run on other servers and the database is not lost.
Azure requires programmers to set up one or more roles (types of virtual machines) within the cloud application framework, and each role processes a different type of job. For example: a) a Web Role processes website queries, e.g. displaying videos; b) a Worker Role processes background jobs, e.g. converting videos to different formats; c) a VM Role is used by administrators to manage the application. This configuration even allows a role to run a large-scale process in parallel across more than one virtual machine.



2.3.2 Using cloud instances with another framework

Another approach is migrating the application directly onto the cloud in the form of virtual machines, and the administrator manages the system by managing the virtual machines. The typical example is AWS EC2. The cloud can run different numbers of machines from different images, similar to the previous approach. However, AWS provides only the instances, not a cloud layer for each user: if a server crashes, no handler takes over its jobs. This approach therefore requires the application itself to be elastic and to handle system crashes. To get elasticity and automatic crash handling, some other technology is needed, such as a distributed system or MapReduce; in our project, Hadoop (the MapReduce application framework) acts as the layer connecting all the machines.



2.4 Summary of the cloud concept

A private cloud is a way to manage a system; the cloud is a business term rather than a technical one. The cloud features elasticity and virtualization. The most important benefit of the cloud is reducing cost, not raw power: the power of a cloud is limited by the size of its resource pool and relies heavily on network conditions.

2.5 Eucalyptus (UEC)

2.5.1 Principle of the Eucalyptus cloud

There are different cloud architectures, and in general a cloud will not provide physical machines but rather virtual machines or an application service to the users. We introduce Eucalyptus, an open source project for private and public clouds.
Eucalyptus has an architecture similar to Amazon Elastic Compute Cloud (EC2). In fact, Eucalyptus images can run directly on EC2, and the commands of the two systems are almost the same. Eucalyptus relies heavily on virtualization technology: the cloud provides virtual machines to users, and a physical machine can run more than one virtual machine (instance). Users can run self-created images (an operating system with their own applications) or the default images provided. This architecture has many advantages. A physical machine can play the roles of several operating systems, so fewer physical machines are needed in the network, which greatly reduces the budget for setting up a computer system. Every application can be independent and more secure: each virtual machine has its own resource bounds, so even if an application uses up the resources of its virtual machine or crashes, other virtual machines are not affected. Since all applications are separated, this also helps prevent attacks from others and the spreading of viruses.



2.5.2 Our private cloud

We built our first private cloud with four old machines; the hardware is listed in the previous discussion. The most important requirement is having CPUs that support VT-x.
After building the cloud, we built many images to test its actual performance. Each image requires three files: a kernel, a ramdisk, and a hard disk image.


2.5.3 Technical troubleshooting

2.5.3.1 Failure to run an image

After the cloud was set up, we found it failed to run any image, even the default ones: the instance shows "pending" and is then terminated with no error message displayed. The reason is that the CPUs did not support VT-x, and hence KVM. There are two ways to solve this problem: use CentOS and run Xen instead of KVM, or use machines with VT-x support. In the end, the department gave us three machines with VT-x, so we could install the cloud and run it without learning to set it up in a new environment.




2.5.3.2 Failure to add a node server

After installing the required OS on the Cluster Controller and the nodes, the Cluster Controller could not discover or add any nodes. There is only a little discussion of this problem online, and we found it is a bug of Eucalyptus 1.6. Eucalyptus does not publish any guideline for handling it, but it can be fixed with a few simple steps by resetting the cloud: 1) clear all the clusters and nodes in the Cluster Controller; 2) reboot all machines; 3) clear all the .ssh/known_hosts records; 4) let the Cluster Controller discover the nodes again.

2.5.3.3 Failure to de-register
This is a bug similar to the previous problem and can be fixed in the same way.

2.5.3.4 Instance IP problems
Whenever any machine's IP changes, the cluster stops working properly. If the Cluster Controller's IP changes, it cannot execute any "euca-*" command; all the settings of the cluster need to be reset at http://localhost:8443 and the credentials updated. If a node's IP changes, it is more complicated: the Cluster Controller may not find the node, it will not add any other node that uses this IP, and the certificate in the Cluster Controller does not accept the new IP, so the Cluster Controller may give warnings about possible attacks. To fix this problem, you can make the Cluster Controller forget the old nodes by removing the "known_hosts" file in /var/lib/eucalyptus/.ssh on the Cluster Controller (.ssh is a hidden folder). The best way to avoid this problem is to bind the IPs to MAC addresses.

2.5.4.1 Startup time

The startup time of a virtual machine depends heavily on the network status. Before running a virtual machine, the cloud goes through a few steps:
1. Check the availability of the cloud to see whether there are enough resources.
2. Decrypt and decompress the image.
3. At the same time, send the image to one of the nodes.
4. The node saves the image to its cache.
5. The node makes a fresh copy and runs the virtual machine on its virtualization layer.

When an image is sent to a node server, a copy is kept if the node has sufficient space. The next time the same image runs on that node, it can copy the image directly and save the bandwidth; a second startup requires only about one third of the first startup time.
Startup of an AWS image requires only about two minutes.


2.5.4.2 Overall performance

The overall performance of our private cloud (after optimization) is about 73% of the physical machine, similar to the value Eucalyptus reports. The most important factor slowing it down is the IO performance of the virtual machines.

2.5.4.3 IO performance

The IO time of a virtual machine is 1.47 times that of the host (physical machine). The disk write speed is only 50.9 MB/s, about half of the physical machine's 103 MB/s. If two processes write to the hard disk at the same time, the virtual machine writes at 26 MB/s (65 MB/s on the physical machine). So in our test case the write speed of the virtual machine reaches about 50% of the physical machine, and the read speed is about 80% of the physical machine.
A virtual machine can usually share about 50% (20 MB/s) of the physical host's bandwidth (~40 MB/s), which is fast enough for our project.
Some published hard disk benchmarks show that a virtual machine can reach ~90% of the performance of the physical machine, so our result seems to be a problem of configuration or of the facilities.

2.5.4.4 Multi-platform

An important feature of a cloud is supporting both Linux and Windows. After the cloud was set up, we tried to build our own images instead of using the default images provided by Eucalyptus.
Setting up a Linux image is easy and the same as installing on a physical machine. The performance of the self-built image is faster than the default image, and it supports a GUI desktop; there is a virtual monitor for each virtual machine.
We could install and set up Windows in an image; however, it showed a kernel panic when the image was uploaded to the cloud.

2.5.4.5 KVM and XEN

KVM and Xen are the most common open source virtualization technologies used in Eucalyptus and AWS EC2. Both use the same image format and are compatible with each other, but their hardware requirements differ: KVM requires the host CPU to support Intel's VT-x or AMD-V. These two technologies are similar and allow the code of a virtual machine to run directly on the host's CPU without many changes. Xen and KVM both run very fast; in our tests KVM had the better performance.

2.5.4.10 Limitations and weaknesses of Eucalyptus
Although Eucalyptus is a pretty good open source cloud project with many powerful features, it has many limitations.

Instances are not saved after termination
Eucalyptus allows a user to copy an image to many instances. You can reboot an instance and the data on its hard disk still exists; however, if you terminate the instance, all the data in the virtual machine is freed. So before terminating an instance, the user must upload the data to cloud storage such as Walrus. Another approach is to implement the application so that it uses only central storage, but this requires very large bandwidth on the local network and may be charged a lot on a public cloud. Even worse, the poor IO performance may reduce the overall performance of the whole system.

1. Poor instance startup mechanism
2. Large image file problem
3. Sensitive to the network

4. Non-static IP
Since the number of IPs allocated to instances is limited, the cloud reuses IPs.

2.5.5 Difficulties in setting up a private cloud
Setting up a cloud actually requires only a few steps.


2.5.6 Cloud Summary

The cloud is actually a very powerful technology. Its most significant features are elasticity and cost reduction. It is not a tool for implementing faster applications, since a virtual machine seldom runs faster than its host, but virtualization makes it possible to manage and share resources among different users and to allocate resources that fit any situation.
However, we have shown some limitations of a typical cloud. There are many trade-offs and awkward designs in cloud technology, and the performance of a cloud is limited by the power of its hosts and the stability of the network.
Chapter 3:
Apache Hadoop

3.1 Introduction to Hadoop

In 1965, Gordon E. Moore observed that transistor counts had been doubling every year. In recent years the technology has reached the limit of what is possible with a single CPU; that is why parallel computing has emerged and, thanks to the convenience of management, why cloud computing architectures and the most widely used open source framework, Hadoop, have appeared.
Apache Hadoop is a software framework that supports data-intensive distributed applications; it enables applications to work with thousands of nodes and petabytes of data.

The Hadoop project consists of several subprojects:
1. Hadoop Common: the common utilities that support the other Hadoop subprojects.
2. Hadoop Distributed File System (HDFS): a distributed file system that stores data across the nodes of the cluster.
3. Hadoop MapReduce: a software framework for distributed processing of large data sets on compute clusters.
4. Avro: a data serialization system.
5. Chukwa: a data collection system for managing large distributed systems.
6. HBase: a scalable, distributed database that supports structured data storage for large tables.
7. Hive: a data warehouse infrastructure that provides data summarization and ad hoc querying.
8. Mahout: a scalable machine learning and data mining library.
9. Pig: a high-level data-flow language and execution framework for parallel computation.
In our project, we mainly focus on HDFS, MapReduce and Hive.

Hadoop benchmark
In 2009, Hadoop sorted 100 TB in 173 minutes (0.578 TB/min) on 3452 nodes, each with 2 quad-core Xeons, 8 GB of memory and 4 SATA disks.

HDFS
Hadoop Distributed File System (HDFS) is the open source storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computation; it is designed for fault tolerance.

MapReduce
MapReduce is a programming model and software framework introduced by Google to support distributed computing on large data sets on clusters of computers. Hadoop MapReduce is an open source project from Apache. It allows us to develop parallel applications without any special parallel-programming techniques, and the applications can be easily deployed and executed. It works with HDFS and processes huge data sets (e.g. more than 1 TB) on distributed clusters (thousands of CPUs) in parallel, and it lets programmers write code easily and quickly.


Working Environment:
Currently the newest version of Hadoop is 0.21, which is not well tested and may have problems starting the HDFS service, so we chose version 0.20. It is supported in the Linux environment and has been tested with 2000 nodes; theoretically it also works on Windows, but few people have tried it. As it is a Java-based system, installing Java is a must.

3.2 HDFS

The average bandwidth of the world is increasing, and many service providers need more and more storage. Facebook, for example, generated 4 TB of compressed new data per day in 2009 and scanned 135 TB of compressed data per day. Parallel computing is the choice, and it has worked for many years, but the need is still growing: providers need more and more machines and hard disks, and clusters of more than 1000 machines working together are no longer unusual. So another big problem appears: hardware failure. A report from Google claims that hard drives have a 7% failure rate per year, so on average such providers see a hard drive failure every day. One possible solution is choosing higher-grade hard disks with a smaller failure rate, but the cost is extremely high. HDFS is designed to handle this large-scale reading and writing of data and hard drive failure: data is stored on different machines in 3 or more copies, and if a hard drive failure is detected, the rebalancer does the work to keep the files complete.
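Although the report interacts with HDFS mainly through Hadoop's own tools, the file system can also be used programmatically. The following is a minimal sketch (not taken from the project's code) of writing and reading a file through Hadoop's Java FileSystem API; the NameNode address hdfs://namenode:9000 and the path /user/demo/hello.txt are placeholders, and in practice the address usually comes from core-site.xml.

// Minimal HDFS read/write sketch against the Hadoop Java API (0.20-era).
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the NameNode (placeholder host and port).
        conf.set("fs.default.name", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS replicates its blocks across DataNodes.
        Path file = new Path("/user/demo/hello.txt");
        FSDataOutputStream out = fs.create(file);
        out.writeBytes("hello HDFS\n");
        out.close();

        // Read it back through the same FileSystem handle.
        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
        System.out.println(in.readLine());
        in.close();
        fs.close();
    }
}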

3.2.1 HDFS Structure
3.2.1.1 Namenode

The NameNode server is the most important part of HDFS. It stores all the metadata of the stored files and directories, such as the list of files, the list of blocks for each file, the list of DataNodes for each block, and the file attributes. Its other role is to serve client queries: it allows clients to add/copy/move/delete a file and records these actions in a transaction log. For performance, it keeps the whole file system tree in RAM as well as on the hard drive. An HDFS cluster allows only one running NameNode, which is why it is a single point of failure: if the NameNode fails or goes down, the whole file system goes offline too. So the NameNode machine needs special care, such as adding more RAM (which increases the file system capacity) and not making it a DataNode, JobTracker or any other optional role.

3.2.1.2 SecondaryNameNode

From the name, it may look like an active-standby node for the NameNode; in fact, it is not. Its job is to back up the metadata and store it to the hard disk, which helps reduce the restart time of the NameNode. In HDFS, recent actions are stored in a file called the EditLog on the NameNode; after restarting HDFS, the NameNode replays the EditLog. The SecondaryNameNode periodically combines the content of the EditLog into a checkpoint and clears the EditLog file; after that, the NameNode replays only from the latest checkpoint, so its restart time is reduced. By default the SecondaryNameNode runs on the same machine as the NameNode, which is not a good idea, because if the NameNode machine crashes it is hard to restore. Placing the SecondaryNameNode on another machine helps a new NameNode restore the file structure.


3.2.1.3 Datanode

The DataNode is the scaled-out part of HDFS; there may be thousands of them, and they mainly store the file data. On startup, a DataNode connects to the NameNode and gets ready to respond to operations from it. After the NameNode tells a client the position of a file, the client talks directly to the DataNodes to access the file. DataNodes can also talk to each other when they replicate data. Each DataNode also periodically sends a report of all existing blocks to the NameNode and validates the data block checksums (CRC32).

3.2.1.4 Compare with NFS
Network File System (NFS) is one of the best-known distributed file systems. Its design is straightforward: it provides remote access to a single logical volume stored on a single machine. But it has limitations: all the files in an NFS volume reside on a single machine, so only as much information can be stored as fits on that machine, and there is no reliability guarantee if it goes down. Another limitation is that, since all the data is on a single machine, all clients must go to this machine to retrieve their data, which can overload the server when a large number of clients must be handled. HDFS is designed to solve these problems:

1. HDFS should store data reliably. If individual machines in the cluster malfunction, data should still be available.
2. HDFS should provide fast, scalable access to this information. It should be possible to serve a larger number of clients by simply adding more machines to the cluster.
3. HDFS should integrate well with Hadoop MapReduce, allowing data to be read and computed upon locally when possible.
4. Applications that use HDFS are assumed to perform long sequential streaming reads from files. HDFS is optimized to provide streaming read performance; this comes at the expense of random seek times to arbitrary positions in files.
5. Data will be written to the HDFS once and then read several times; updates to files after they have already been closed are not supported. (An extension to Hadoop will provide support for appending new data to the ends of files; it is scheduled to be included in Hadoop 0.19 but is not available yet.)


3.2.1.6 Limitation of HDFS

The latest version of Hadoop is only 0.21, so some limitations remain. The NameNode is a single point of failure and allows only a single namespace for the entire cluster; the development team says they are working on a Backup Node to solve this problem. HDFS also does not rebalance based on access patterns or load, so average performance can be reduced. It is designed around a write-once-read-many access model: a client can only append to existing files, and there is no general file-modify operation.
3.3 MapReduce

3.3.1 Machine Roles

There are two roles in a MapReduce job: the JobTracker and the TaskTracker.

JobTracker
The JobTracker farms out MapReduce tasks to the cluster. It allocates each task, with the highest priority, to the node that holds the data the task needs, to maximize performance. When a client submits a job to the JobTracker, it talks to the NameNode to find the data locations, determines which nodes are nearest to the data, and submits the tasks to the chosen TaskTracker nodes. When a TaskTracker finishes a task, the JobTracker gets the signal and assigns it another task. By default the JobTracker runs on the NameNode machine.

TaskTracker
The TaskTracker is like the DataNode: it runs on every cluster node, receives tasks from the JobTracker, and mainly performs the Map, Reduce and Shuffle operations. Every TaskTracker has a set of slots, and each slot accepts one task. When the JobTracker tries to find a TaskTracker to run a MapReduce task, it first looks for an empty slot on the same server that hosts the DataNode containing the data and, if there is none, for an empty slot on a machine in the same rack. The TaskTracker runs the actual work in a separate JVM process, so if a task crashes it does not take down the TaskTracker. While a task is running, the TaskTracker monitors the process and stores the output and return value. It also sends heartbeat signals to the JobTracker to show that it is alive; the heartbeat includes information such as the number of empty slots. After finishing a task, it sends a signal to notify the JobTracker.

3.3.2 Programming Model

MapReduce is an easy programming model for computer scientists and non-computer-scientists alike. The whole structure of a MapReduce program is mainly the work of two functions, the Mapper and the Reducer. A key-value pair is the input to the Mapper; the Map functions run in parallel and generate any number of intermediate key-value pairs from the different input data sets. These are sent to the Reduce function as input; the reducer sorts the input key-value pairs by key and combines the values of pairs with the same key into a list. Finally the reducers, which also run in parallel, generate the output key-value pairs. All the tasks work independently. A minimal mapper and reducer written against this model are sketched below.
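To make the model concrete, here is a hedged word-count style sketch using the old org.apache.hadoop.mapred API that ships with Hadoop 0.20. It is an illustration rather than the report's own pilot program; TokenMapper and SumReducer are illustrative class names.

// Mapper and Reducer sketch: count occurrences of each word in text input.
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Mapper: one input key-value pair in, any number of intermediate pairs out.
class TokenMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            output.collect(new Text(tokens.nextToken()), ONE);
        }
    }
}

// Reducer: the framework groups intermediate values by key before calling reduce.
class SumReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text word, Iterator<IntWritable> counts,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (counts.hasNext()) {
            sum += counts.next().get();
        }
        output.collect(word, new IntWritable(sum));
    }
}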

3.3.3 Execution flow

The user program splits the data source into small pieces, normally 16-64 MB blocks, and copies the program to the selected nodes to get ready for processing the data. The JobTracker then distributes the tasks to the TaskTrackers.
The Mapper reads records from the data source, such as lines of a text file, rows from a database, or a list of file paths. After the mapping process, it sends the intermediate key-value pairs to the reducers. Prototype of the Map function: Map(in_key, in_value) → list(out_key, intermediate_value).

3.3.4 Anatomy of a Job

1. Client creates a job, configures it, and submits it to job tracker
2. JobClient computes input splits (on client end)
3. Job data (jar, configuration XML) are sent to JobTracker
4. JobTracker puts the job data in a shared location and enqueues the tasks
5. TaskTrackers poll for tasks (a client-side driver covering steps 1-3 is sketched below)
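A hedged sketch of the client-side driver that performs steps 1-3 above, again with the old Hadoop 0.20 mapred API. TokenMapper and SumReducer refer to the illustrative classes sketched in section 3.3.2, and the input and output paths are placeholders.

// Driver sketch: configure a job and submit it to the JobTracker.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);
        conf.setJobName("word-count-sketch");

        conf.setMapperClass(TokenMapper.class);
        conf.setReducerClass(SumReducer.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        FileInputFormat.setInputPaths(conf, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(conf, new Path("/user/demo/output"));

        // runJob submits the job to the JobTracker, which ships the job jar and
        // configuration, computes splits, and hands tasks to polling TaskTrackers.
        JobClient.runJob(conf);
    }
}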

3.3.5 Features

Automatic parallelization and distribution
When we run our application, the JobTracker automatically handles all the messy details: distributing tasks, handling failures and reporting progress.

Fault tolerance
The TaskTracker nodes are monitored. If they do not submit heartbeat signals within a period of time, they are deemed to have failed and their work is scheduled on a different TaskTracker. If a task fails 4 times (the default), the whole job fails. A TaskTracker notifies the JobTracker when a task fails, and the JobTracker decides what to do: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, or it may even blacklist the TaskTracker as unreliable. The retry limit itself is a per-job setting, as sketched below.

Locality optimization
The JobTracker assigns each split to the node closest to the data it needs; this reduces the job distribution time in the Map phase and the time spent collecting the output key-value pairs in the Reduce phase.

3.3.6 Our testing
The JobTracker is a single point of failure for the MapReduce service: if it crashes, all running jobs fail. Hadoop accounts are not isolated; they share the same file pool and resources.
Our pilot programs
We wanted to have our own MapReduce application, not just the word count example, so we implemented all the basic interfaces of the MapReduce framework: InputFormat, InputSplit, RecordReader, Writable, RecordWriter, OutputFormat, Mapper and Reducer. Based on these interfaces, we can write any MapReduce program easily; an illustrative skeleton follows.
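As an illustration of what implementing these interfaces looks like (this is not the report's actual pilot code), the skeleton below defines a custom old-API InputFormat that refuses to split files, so each whole file becomes exactly one map task. WholeFileRecordReader is the illustrative reader sketched in section 3.4 below.

// Hedged skeleton of a non-splittable InputFormat for the old mapred API.
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {
    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;   // one file = one split = one map task
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
    }
}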

3.4 Troubleshooting
Failed to start the NameNode: installing Hadoop is relatively simple, but there can be problems. After setting up the environment, we need to specify the Java installation path by setting the parameter JAVA_HOME in $Hadoop_path/conf/hadoop-env.sh; otherwise HDFS reports an error when it tries to boot.

No TaskTracker available: the NameNode runs normally and some DataNodes exist, but no TaskTracker appears. This is because the JobTracker will not accept any TaskTracker until HDFS is ready. HDFS starts in safe mode, in which no modifications to the file system are allowed; if the number of DataNodes is not enough, it reports an error and stays in safe mode, for example: "The ratio of reported blocks 0.9982 has not reached the threshold 0.9990. Safe mode will be turned off automatically." Turning on enough DataNodes to reach the threshold solves this problem; if some DataNodes have crashed and the available ratio cannot be reached, the only choice is to re-format the whole HDFS. So running fewer than 3 DataNodes is dangerous.

MapReduce job always has 0 map input and 0 reduce input: in the RecordReader, the next() function has to return true once even though there is no obvious "next" value; we use a member variable as a flag to record whether the single key-value pair has already been emitted (see the sketch below).
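The "processed flag" fix described above can be sketched as follows. This is a hedged illustration in the old Hadoop 0.20 mapred API, not the report's actual code: the reader emits the whole split as a single record, returns true exactly once, and false afterwards.

// Hedged sketch of a whole-split RecordReader using a processed flag.
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;

public class WholeFileRecordReader implements RecordReader<NullWritable, BytesWritable> {
    private final FileSplit split;
    private final JobConf conf;
    private boolean processed = false;   // the flag: has the single record been emitted?

    public WholeFileRecordReader(FileSplit split, JobConf conf) {
        this.split = split;
        this.conf = conf;
    }

    public boolean next(NullWritable key, BytesWritable value) throws IOException {
        if (processed) {
            return false;                 // no more records: ends the map task's input loop
        }
        Path file = split.getPath();
        FileSystem fs = file.getFileSystem(conf);
        byte[] contents = new byte[(int) split.getLength()];
        FSDataInputStream in = fs.open(file);
        try {
            IOUtils.readFully(in, contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        value.set(contents, 0, contents.length);
        processed = true;                 // emit exactly one record per split
        return true;
    }

    public NullWritable createKey() { return NullWritable.get(); }
    public BytesWritable createValue() { return new BytesWritable(); }
    public long getPos() { return processed ? split.getLength() : 0; }
    public float getProgress() { return processed ? 1.0f : 0.0f; }
    public void close() throws IOException { }
}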

3.5 Hadoop summary

For our test cases, the scalability of Hadoop is very significant: most of them show a linear trend. We tested on only 5 machines; if the number of cluster nodes is large enough, it will be very powerful for any job. On the performance side, it was not what we expected at the beginning: even an empty job takes 30 to 60 seconds to run. The overhead of Hadoop on small jobs is very significant, which is why Hadoop is for large-scale, data-intensive jobs; otherwise MPI is the better choice.

Hadoop is also not good for frequent I/O access: our MD5 generator shows that the performance is much lower than running on a single machine locally, because frequently appending data to a file is very slow. For the word count example, we were very surprised that processing 40 MB and 400 MB of text takes nearly the same time. This is because each task is too small for the 40 MB input, so the overhead is larger than the effective working time. On average, in our environment, each task spends at least 2 seconds on the whole flow: if the effective working time is close to this value, efficiency becomes very low, while if the effective working time is too long, a task may hold the machine and occupy a lot of resources, and if the task fails the overhead is much larger because the task starts again from zero.

The suggestion from Cloudera, the enterprise Hadoop developer, is that a map task should run for about 14 to 17 seconds for better efficiency, but some jobs cannot be tuned to this optimal value. For video conversion, for example, meeting this time would require cutting a video into small pieces, each taking roughly that long to convert; after the conversion an additional step is needed to combine the pieces, and the cutting and combining may also use up a lot of time, so the overall efficiency may be lower. For a task such as the MD5 generator, however, we can tune the length of each split and let the map function run for any time we expect. For jobs that have many small input files, we can pack the files together, which may improve performance; a sketch of this kind of split tuning follows.
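As a hedged illustration of the split tuning mentioned above (not the report's actual configuration): in the old mapred API, FileInputFormat decides split sizes from the block size and a minimum-split-size hint, and the number of map tasks set on the JobConf is only a hint. The property name and values below are examples for Hadoop 0.20-era releases.

// Sketch: ask for larger splits so each map task amortises its startup overhead.
import org.apache.hadoop.mapred.JobConf;

public class SplitTuning {
    public static void tune(JobConf conf) {
        // Minimum split size in bytes (128 MB here); larger splits mean
        // fewer, longer-running map tasks.
        conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);
        // Hints only: the real number of map tasks follows from the splits.
        conf.setNumMapTasks(10);
        conf.setNumReduceTasks(2);
    }
}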

Theoretically speaking, any job that runs on a single machine can be parallelized with MapReduce in Hadoop, and the programming is easy, but it is not always a good choice: jobs focused mainly on computation are not a good fit for Hadoop, and jobs that do not consume much data may have better options. For convenience of parallel programming, Hadoop is perhaps the easiest framework to learn; the programmer does not need any parallel programming skills to make a job parallel with MapReduce, which is good for non-computer-scientists. Hadoop is made for scalability, not raw performance; it is good at data processing!
Chapter 4

Conclusion


We can see that the cloud is a powerful IT resource pool (public cloud) or management tool (private cloud). Its elastic property allows an application to resize itself to any scale. The most important benefit of using the cloud is paying for and using IT resources as you go: however much you use, that is how much you pay. These features enable a small company or an organization such as a school to run large-scale applications or experiments at a reasonable cost.
Even though the cloud provides users with elastic resources, not all applications can run directly on the cloud. To fully bring out the performance of the cloud, an application must either be re-developed for a cloud layer (e.g. Azure) or distribute the workload on its own (e.g. AWS). The second approach requires a powerful framework on top of all the virtual machines that fully uses the resources of each virtual machine.
Hadoop (MapReduce) is one such powerful framework, enabling easy development of data-intensive applications. Its objective is to help build applications with high scalability across thousands of machines. We can see that Hadoop is very suitable for data-intensive background applications and fits our project's requirements perfectly.
Apart from running applications in parallel, Hadoop provides job monitoring features similar to Azure. If any machine crashes, the data can be recovered from other machines, and they take up the failed jobs automatically.
When we put Hadoop on the cloud, we also saw how convenient it is to set Hadoop up. With a few command lines, we can allocate any number of nodes to run Hadoop, which saves a lot of time and effort.
We found that the combination of cloud and Hadoop is surely a sound way to set up large-scale applications with lower cost and higher elasticity.