Hadoop
Abstract
Hadoop is a framework for running applications on large clusters built of commodity hardware.
The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements a computational paradigm named MapReduce, in which an application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.
In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster.
Both MapReduce and the distributed file system are designed so that node failures are automatically handled by the framework.
Introduction
What is Hadoop?
Software platform that lets one easily write and run applications that process vast amounts of data.
It includes:
– MapReduce – offline (batch) computing engine
– HDFS – Hadoop distributed file system
– HBase (pre-alpha) – online data access
Yahoo! is the biggest contributor
Here's what makes it especially useful:
Scalable – it can reliably store and process petabytes of data.
Economical – it distributes the data and processing across clusters of commodity computers.
Efficient – by distributing the data, it can process it in parallel on the nodes where the data is located.
Reliable – it automatically maintains multiple copies of data and automatically redeploys computing tasks on failure.
What does it do?
Hadoop implements Google’s MapReduce, using HDFS
MapReduce divides applications into many small blocks of work.
HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster.
MapReduce can then process the data where it is located.
Hadoop's target is to run on clusters on the order of 10,000 nodes.
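To make this concrete, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API (org.apache.hadoop.mapreduce). The class names are the conventional tutorial ones, not anything specific to this presentation: the map step emits a (word, 1) pair for every word in its input split, and the reduce step sums the counts per word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Running it with "hadoop jar wordcount.jar WordCount <input> <output>" reads from the first HDFS path and writes the per-word counts to the second.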
Hadoop: Assumptions
It is written with large clusters of computers in mind and is built around the following assumptions:
Hardware will fail.
Processing will be run in batches. Thus there is an emphasis on high throughput as opposed to low latency.
Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size.
It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.
Applications need a write-once-read-many access model.
Moving Computation is Cheaper than Moving Data.
Portability across heterogeneous hardware and software platforms is important.
HDFS
Hadoop's Distributed File System is designed to reliably store very large files across machines in a large cluster.
It is inspired by the Google File System. HDFS stores each file as a sequence of blocks; all blocks in a file except the last are the same size. Blocks belonging to a file are replicated for fault tolerance.
The block size and replication factor are configurable per file. Files in HDFS are "write once" and have strictly one writer at any time.
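As an illustration of per-file configuration, this sketch uses the HDFS Java API (org.apache.hadoop.fs.FileSystem) to create a file with an explicit replication factor and block size; the path and the values chosen are placeholders for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCreateExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    Path file = new Path("/data/example.log");  // placeholder path
    int bufferSize = 4096;
    short replication = 3;                      // three replicas of each block
    long blockSize = 64L * 1024 * 1024;         // 64 MB blocks

    // create(path, overwrite, bufferSize, replication, blockSize)
    try (FSDataOutputStream out =
             fs.create(file, true, bufferSize, replication, blockSize)) {
      out.writeBytes("hello hdfs\n");
    }

    // The replication factor can also be changed after the file is written:
    fs.setReplication(file, (short) 2);
  }
}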
Hadoop Distributed File System – Goals:
Store large data sets
Cope with hardware failure
Emphasize streaming data access
Protocol
All HDFS communication protocols are layered on top of the TCP/IP protocol.
A client establishes a connection to a configurable TCP port on the Namenode machine and talks the ClientProtocol to the Namenode.
The Datanodes talk to the Namenode using the DatanodeProtocol.
An RPC abstraction wraps both the ClientProtocol and the DatanodeProtocol.
The Namenode is simply a server and never initiates a request; it only responds to RPC requests issued by Datanodes or clients.
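As a small sketch of the client side of this arrangement: a client finds the Namenode's TCP endpoint through the fs.defaultFS configuration key, and FileSystem.get() sets up the ClientProtocol RPC connection behind the scenes. The hostname below is a placeholder; 8020 is a commonly used Namenode RPC port.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ClientConnect {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Point the client at the Namenode's RPC endpoint (placeholder host).
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
    // FileSystem.get() opens the ClientProtocol connection to the Namenode.
    FileSystem fs = FileSystem.get(conf);
    System.out.println("Connected to " + fs.getUri());
  }
}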
Mapping workers to Processors
The input data (on HDFS) is stored on the local disks of the machines in the cluster. HDFS divides each file into 64 MB blocks, and stores several copies of each block (typically 3 copies) on different machines.
The MapReduce master takes the location information of the input files into account and attempts to schedule a map task on a machine that contains a replica of the corresponding input data. Failing that, it attempts to schedule a map task near a replica of that task's input data.
When running large MapReduce operations on a significant fraction of the workers in a cluster, most input data is read locally and consumes no network bandwidth.
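The locality information the scheduler relies on is visible from client code as well: each input split reports the hosts holding a replica of its block. The sketch below (placeholder input path, standard org.apache.hadoop.mapreduce API) prints that mapping.

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitLocations {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration());
    FileInputFormat.addInputPath(job, new Path("/data/input"));  // placeholder

    // Each split corresponds (roughly) to one HDFS block; getLocations()
    // names the Datanodes holding a replica of that block.
    List<InputSplit> splits = new TextInputFormat().getSplits(job);
    for (InputSplit split : splits) {
      System.out.println(split + " -> " + String.join(", ", split.getLocations()));
    }
  }
}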
What are Hadoop/MapReduce's limitations?
Cannot control the order in which the maps or reductions are run.
For maximum parallelism, you need Maps and Reduces to not depend on data generated in the same MapReduce job (i.e. stateless).
A database with an index will always be faster than a MapReduce job on unindexed data.
Reduce operations do not take place until all Maps are complete (or have failed then been skipped).
There is a general assumption that the output of Reduce is smaller than the input to Map: a large data source is used to generate smaller final values.
Summary and Conclusion
Hadoop MapReduce is a large-scale, open source software framework dedicated to scalable, distributed, data-intensive computing.
The framework breaks up large data into smaller parallelizable chunks and handles scheduling
▫ Maps each piece to an intermediate value
▫ Reduces intermediate values to a solution
▫ User-specified partition and combiner options (see the partitioner sketch after this list)
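As a toy illustration of the partition hook (not from the slides): the Partitioner below routes keys beginning with a-m to the lower half of the reducers and everything else to the upper half. It would be registered on a job with job.setPartitionerClass(FirstLetterPartitioner.class).

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String s = key.toString();
    char first = s.isEmpty() ? 'a' : Character.toLowerCase(s.charAt(0));
    // Keys a-m go to the lower half of the reducers, everything else
    // (n-z, digits, symbols) to the upper half; hash within each half.
    int half = Math.max(1, numPartitions / 2);
    int base = (first >= 'a' && first <= 'm') ? 0 : half;
    return (base + (s.hashCode() & Integer.MAX_VALUE) % half) % numPartitions;
  }
}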
Fault tolerant, reliable, and supports thousands of nodes and petabytes of data
If you can rewrite your algorithms as Maps and Reduces, and your problem can be broken up into small pieces solvable in parallel, then Hadoop's MapReduce is the way to go for a distributed problem-solving approach to large datasets.
Tried and tested in production
Many implementation options