Data-Intensive Computing with Hadoop
Hadoop programming.pdf
Hadoop overview
• Apache Software Foundation project
• Framework for running applications on large clusters
• Modeled after Google’s MapReduce / GFS framework
• Implemented in Java
• Includes
• HDFS - a distributed filesystem
• Map/Reduce - offline computing engine
• Recently: libraries for machine learning and sparse matrix computation
• Yahoo! is the biggest contributor
• A young project, but already used by many organizations
Hadoop Goals
• Scalable
• Petabytes (10^15 bytes) of data on thousands of nodes
• Much larger than RAM, even single disk capacity
• Economical
• Use commodity components when possible
• Lash thousands of these into an effective compute and
storage platform
• Reliable
• In a large enough cluster something is always broken
• Engineering reliability into every app is expensive
Sample Applications
• Data analysis is the core of Internet services.
• Log Processing
• Reporting
• Session Analysis
• Building dictionaries
• Click fraud detection
• Building Search Index
• Site Rank
• Machine Learning
• Automated Pattern-Detection/Filtering
• Mail spam filter creation
• Competitive Intelligence
• What percentage of websites use a given feature?
Problem: Bandwidth to Data
• Need to process 100TB datasets
• On a 1000-node cluster reading from remote storage (on LAN)
• Scanning @ 10 MB/s = 165 min
• On a 1000-node cluster reading from local storage
• Scanning @ 50-200 MB/s = 33-8 min (worked through in the sketch after this list)
• Moving computation to the data enables I/O
bandwidth scaling
• Network is the bottleneck
• Data size is reduced by the processing
• Need visibility into data placement
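The scan times above are just arithmetic: 100 TB spread over 1000 nodes is roughly 100 GB per node, and the time is the per-node data divided by the scan rate. A minimal Java sketch of that back-of-the-envelope calculation (the rates are the figures quoted above, not measurements):

// Back-of-the-envelope scan times for a 100 TB dataset on a 1000-node cluster
public class ScanTime {
    public static void main(String[] args) {
        double totalMB = 100e6;                // 100 TB expressed in MB (decimal units)
        double perNodeMB = totalMB / 1000;     // ~100,000 MB per node
        double[] ratesMBps = {10, 50, 200};    // remote (LAN) rate vs. local-disk range
        for (double rate : ratesMBps) {
            double minutes = perNodeMB / rate / 60;
            System.out.printf("%3.0f MB/s -> %4.0f min per node%n", rate, minutes);
        }
    }
}

At 10 MB/s this comes to roughly 165-170 minutes per node; at local-disk rates of 50-200 MB/s it drops to about 33-8 minutes, which is where the figures above come from.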
Hadoop Distributed File System
• Fault tolerant, scalable, distributed storage system
• Designed to reliably store very large files across
machines in a large cluster
• Data Model
• Data is organized into files and directories
• Files are divided into uniform-sized blocks and distributed across cluster nodes
• Blocks are replicated to handle hardware failure
• Corruption detection and recovery: filesystem-level checksumming
• HDFS exposes block placement so that computation can be migrated to the data (see the sketch after this list)
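Because HDFS exposes block placement, a client (or a scheduler) can ask the Namenode where a file's blocks live. A minimal sketch using the HDFS Java API, assuming the default Configuration points at the cluster's Namenode and using a hypothetical path:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockPlacement {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // assumes the HDFS config is on the classpath
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/logs/part-00000");  // hypothetical HDFS file
        FileStatus status = fs.getFileStatus(file);

        // Ask the Namenode which Datanodes hold each block of the file
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset %d, length %d, hosts %s%n",
                    block.getOffset(), block.getLength(), Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}

The Map/Reduce scheduler uses the same placement information to run map tasks on, or close to, the nodes that already hold the input blocks.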
HDFS Architecture
• Similar to other NASD-based distributed filesystems
• Master-Worker architecture
• HDFS Master “Namenode”
• Manages the filesystem namespace
• Controls read/write access to files
• Manages block replication
• Reliability: Namespace checkpointing and journaling
• HDFS Workers “Datanodes”
• Serve read/write requests from clients (a client write sketch follows this list)
• Perform replication tasks upon instruction from the Namenode
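As a rough illustration of the Namenode/Datanode split, here is a sketch of a client write through the FileSystem API; the path, replication factor, and block size are example values only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);     // metadata operations (create, rename, ...) go to the Namenode

        Path path = new Path("/tmp/hello.txt");   // hypothetical path
        short replication = 3;                    // each block replicated to 3 Datanodes
        long blockSize = 64L * 1024 * 1024;       // 64 MB blocks, a common default

        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(path, true, 4096, replication, blockSize);
        out.writeBytes("hello hdfs\n");           // file data streams to the Datanodes, not the Namenode
        out.close();

        System.out.println("replication = " + fs.getFileStatus(path).getReplication());
        fs.close();
    }
}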
Map-Reduce
• Application writer specifies
• A pair of functions called Map and Reduce
• A set of input files
• Workflow
• Generate FileSplits from input files, one per Map task
• Map phase executes the user map function transforming
input records into a new set of kv-pairs
• The framework shuffles & sorts tuples according to their keys
• Reduce phase combines all kv-pairs with the same key into new kv-pairs (see the WordCount sketch after this list)
• Output phase writes the resulting pairs to files
• All phases are distributed among many tasks
• Framework handles scheduling of tasks on cluster
• Framework handles recovery when a node fails
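The classic instance of this workflow is WordCount. A minimal sketch of the Map and Reduce functions with the org.apache.hadoop.mapreduce API (the class names are examples, not part of Hadoop):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: one input line -> a (word, 1) pair per word
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);                 // emit (word, 1)
        }
    }
}

// Reduce: (word, [1, 1, ...]) -> (word, total count); runs after the shuffle & sort
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));    // one output pair per distinct word
    }
}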
Hadoop M-R architecture
• Map/Reduce Master “Job Tracker”
• Accepts Map/Reduce jobs submitted by users (a job-submission sketch follows this list)
• Assigns Map and Reduce tasks to Task Trackers
• Monitors task and Task Tracker status, re-executes tasks
upon failure
• Map/Reduce Slaves “Task Trackers”
• Run Map and Reduce tasks upon instruction from the Job
Tracker
• Manage storage and transmission of intermediate output
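From the user's side, handing a job to the Job Tracker amounts to configuring a Job object and submitting it. A sketch of a WordCount driver, assuming the Mapper and Reducer classes from the earlier sketch and taking input/output paths as arguments:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up the cluster configuration on the classpath
        Job job = new Job(conf, "word count");         // the unit of work the Job Tracker schedules

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input files -> FileSplits, one per Map task
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS

        // Submit and block until the framework reports success or failure;
        // task scheduling and re-execution on failure are handled by the framework.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}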