Hadoop
Abstract
Hadoop is a framework for running applications on large clusters built of commodity hardware.
The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements a computational paradigm named MapReduce, in which an application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.
In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster.
Both MapReduce and the distributed file system are designed so that node failures are automatically handled by the framework.
Introduction
What is Hadoop?
Software platform that lets one easily write and run applications that process vast amounts of data.
It includes:
– MapReduce – offline (batch) computing engine
– HDFS – Hadoop distributed file system
– HBase (pre-alpha) – online data access
Yahoo! is the biggest contributor
Here's what makes it especially useful:
Scalable – it can reliably store and process petabytes of data.
Economical – it distributes the data and processing across clusters of commodity computers.
Efficient – by distributing the data, it can process it in parallel on the nodes where the data is located.
Reliable – it automatically maintains multiple copies of data and automatically redeploys computing tasks on failure.
What does it do?
Hadoop implements Google’s MapReduce, using HDFS
MapReduce divides applications into many small blocks of work.
HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster.
MapReduce can then process the data where it is located.
Hadoop's target is to run on clusters on the order of 10,000 nodes.
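To make this concrete, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API (org.apache.hadoop.mapreduce). The class names are the conventional tutorial ones, not anything specific to this presentation: the map step emits a (word, 1) pair for every word in its input split, and the reduce step sums the counts per word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Running it with "hadoop jar wordcount.jar WordCount <input> <output>" reads from the first HDFS path and writes the per-word counts to the second.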
Hadoop: Assumptions
It is written with large clusters of computers in mind and is built around the following assumptions:
Hardware will fail.
Processing will be run in batches. Thus there is an emphasis on high throughput as opposed to low latency.
Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size.
It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.
Applications need a write-once-read-many access model.
Moving Computation is Cheaper than Moving Data.
Portability across heterogeneous hardware and software platforms is important.
HDFS
Hadoop's Distributed File System is designed to reliably store very large files across machines in a large cluster.
It is inspired by the Google File System. HDFS stores each file as a sequence of blocks; all blocks in a file except the last are the same size. Blocks belonging to a file are replicated for fault tolerance.
The block size and replication factor are configurable per file. Files in HDFS are "write once" and have strictly one writer at any time.
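As an illustration of per-file configuration, this sketch uses the HDFS Java API (org.apache.hadoop.fs.FileSystem) to create a file with an explicit replication factor and block size; the path and the values chosen are placeholders for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCreateExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    Path file = new Path("/data/example.log");  // placeholder path
    int bufferSize = 4096;
    short replication = 3;                      // three replicas of each block
    long blockSize = 64L * 1024 * 1024;         // 64 MB blocks

    // create(path, overwrite, bufferSize, replication, blockSize)
    try (FSDataOutputStream out =
             fs.create(file, true, bufferSize, replication, blockSize)) {
      out.writeBytes("hello hdfs\n");
    }

    // The replication factor can also be changed after the file is written:
    fs.setReplication(file, (short) 2);
  }
}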
Hadoop Distributed File System – Goals:
Store large data sets
Cope with hardware failure
Emphasize streaming data access
Protocol
All HDFS communication protocols are layered on top of the TCP/IP protocol.
A client establishes a connection to a configurable TCP port on the Namenode machine and talks the ClientProtocol to the Namenode.
The Datanodes talk to the Namenode using the DatanodeProtocol.
An RPC abstraction wraps both the ClientProtocol and the DatanodeProtocol.
The Namenode is simply a server and never initiates a request; it only responds to RPC requests issued by Datanodes or clients.
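As a small sketch of the client side of this arrangement: a client finds the Namenode's TCP endpoint through the fs.defaultFS configuration key, and FileSystem.get() sets up the ClientProtocol RPC connection behind the scenes. The hostname below is a placeholder; 8020 is a commonly used Namenode RPC port.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ClientConnect {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Point the client at the Namenode's RPC endpoint (placeholder host).
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
    // FileSystem.get() opens the ClientProtocol connection to the Namenode.
    FileSystem fs = FileSystem.get(conf);
    System.out.println("Connected to " + fs.getUri());
  }
}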
Mapping workers to Processors
The input data (on HDFS) is stored on the local disks of the machines in the cluster. HDFS divides each file into 64 MB blocks, and stores several copies of each block (typically 3 copies) on different machines.
The MapReduce master takes the location information of the input files into account and attempts to schedule a map task on a machine that contains a replica of the corresponding input data. Failing that, it attempts to schedule a map task near a replica of that task's input data.
When running large MapReduce operations on a significant fraction of the workers in a cluster, most input data is read locally and consumes no network bandwidth.
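The locality information the scheduler relies on is visible from client code as well: each input split reports the hosts holding a replica of its block. The sketch below (placeholder input path, standard org.apache.hadoop.mapreduce API) prints that mapping.

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitLocations {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration());
    FileInputFormat.addInputPath(job, new Path("/data/input"));  // placeholder

    // Each split corresponds (roughly) to one HDFS block; getLocations()
    // names the Datanodes holding a replica of that block.
    List<InputSplit> splits = new TextInputFormat().getSplits(job);
    for (InputSplit split : splits) {
      System.out.println(split + " -> " + String.join(", ", split.getLocations()));
    }
  }
}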
What are Hadoop/MapReduce's limitations?
Cannot control the order in which the maps or reductions are run.
For maximum parallelism, you need Maps and Reduces to not depend on data generated in the same MapReduce job (i.e. stateless).
A database with an index will always be faster than a MapReduce job on unindexed data.
Reduce operations do not take place until all Maps are complete (or have failed then been skipped).
There is a general assumption that the output of Reduce is smaller than the input to Map: a large data source is used to generate smaller final values.
Summary and Conclusion
Hadoop MapReduce is a large-scale, open source software framework dedicated to scalable, distributed, data-intensive computing.
The framework breaks up large data into smaller parallelizable chunks and handles scheduling
▫ Maps each piece to an intermediate value
▫ Reduces intermediate values to a solution
▫ User-specified partition and combiner options (see the partitioner sketch after this list)
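As a toy illustration of the partition hook (not from the slides): the Partitioner below routes keys beginning with a-m to the lower half of the reducers and everything else to the upper half. It would be registered on a job with job.setPartitionerClass(FirstLetterPartitioner.class).

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String s = key.toString();
    char first = s.isEmpty() ? 'a' : Character.toLowerCase(s.charAt(0));
    // Keys a-m go to the lower half of the reducers, everything else
    // (n-z, digits, symbols) to the upper half; hash within each half.
    int half = Math.max(1, numPartitions / 2);
    int base = (first >= 'a' && first <= 'm') ? 0 : half;
    return (base + (s.hashCode() & Integer.MAX_VALUE) % half) % numPartitions;
  }
}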
Fault tolerant, reliable, and supports thousands of nodes and petabytes of data
If you can rewrite your algorithms as Maps and Reduces, and your problem can be broken up into small pieces solvable in parallel, then Hadoop's MapReduce is the way to go for a distributed problem-solving approach to large datasets.
Tried and tested in production
Many implementation options