Data-Intensive Computing with Hadoop
Hadoop programming.pdf
Hadoop overview
• Apache Software Foundation project
• Framework for running applications on large clusters
• Modeled after Google’s MapReduce / GFS framework
• Implemented in Java
• Includes
• HDFS - a distributed filesystem
• Map/Reduce - offline computing engine
• Recently: libraries for machine learning and sparse matrix computation
• Yahoo! is the biggest contributor
• A young project, but already used by many organizations
Hadoop Goals
• Scalable
• Petabytes (10^15 bytes) of data on thousands of nodes
• Much larger than RAM, even single disk capacity
• Economical
• Use commodity components when possible
• Lash thousands of these into an effective compute and
storage platform
• Reliable
• In a large enough cluster something is always broken
• Engineering reliability into every app is expensive
Sample Applications
• Data analysis is the core of Internet services.
• Log Processing
• Reporting
• Session Analysis
• Building dictionaries
• Click fraud detection
• Building Search Index
• Site Rank
• Machine Learning
• Automated Pattern-Detection/Filtering
• Mail spam filter creation
• Competitive Intelligence
• What percentage of websites use a given feature?
Problem: Bandwidth to Data
• Need to process 100TB datasets
• On a 1000-node cluster reading from remote storage (on LAN)
• Scanning @ 10 MB/s = 165 min
• On a 1000-node cluster reading from local storage
• Scanning @ 50-200 MB/s = 33-8 min (worked through in the sketch after this list)
• Moving computation to the data enables I/O
bandwidth scaling
• Network is the bottleneck
• Data size is reduced by the processing
• Need visibility into data placement
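The scan times above are just arithmetic: 100 TB spread over 1000 nodes is roughly 100 GB per node, and the time is the per-node data divided by the scan rate. A minimal Java sketch of that back-of-the-envelope calculation (the rates are the figures quoted above, not measurements):

// Back-of-the-envelope scan times for a 100 TB dataset on a 1000-node cluster
public class ScanTime {
    public static void main(String[] args) {
        double totalMB = 100e6;                // 100 TB expressed in MB (decimal units)
        double perNodeMB = totalMB / 1000;     // ~100,000 MB per node
        double[] ratesMBps = {10, 50, 200};    // remote (LAN) rate vs. local-disk range
        for (double rate : ratesMBps) {
            double minutes = perNodeMB / rate / 60;
            System.out.printf("%3.0f MB/s -> %4.0f min per node%n", rate, minutes);
        }
    }
}

At 10 MB/s this comes to roughly 165-170 minutes per node; at local-disk rates of 50-200 MB/s it drops to about 33-8 minutes, which is where the figures above come from.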
Hadoop Distributed File System
• Fault tolerant, scalable, distributed storage system
• Designed to reliably store very large files across
machines in a large cluster
• Data Model
• Data is organized into files and directories
• Files are divided into uniform-sized blocks and distributed across cluster nodes
• Blocks are replicated to handle hardware failure
• Corruption detection and recovery: filesystem-level checksumming
• HDFS exposes block placement so that computation can be migrated to the data (see the sketch after this list)
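Because HDFS exposes block placement, a client (or a scheduler) can ask the Namenode where a file's blocks live. A minimal sketch using the HDFS Java API, assuming the default Configuration points at the cluster's Namenode and using a hypothetical path:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockPlacement {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // assumes the HDFS config is on the classpath
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/logs/part-00000");  // hypothetical HDFS file
        FileStatus status = fs.getFileStatus(file);

        // Ask the Namenode which Datanodes hold each block of the file
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset %d, length %d, hosts %s%n",
                    block.getOffset(), block.getLength(), Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}

The Map/Reduce scheduler uses the same placement information to run map tasks on, or close to, the nodes that already hold the input blocks.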
HDFS Architecture
• Similar to other NASD-based distributed filesystems
• Master-Worker architecture
• HDFS Master “Namenode”
• Manages the filesystem namespace
• Controls read/write access to files
• Manages block replication
• Reliability: Namespace checkpointing and journaling
• HDFS Workers “Datanodes”
• Serve read/write requests from clients (a client write sketch follows this list)
• Perform replication tasks upon instruction from the Namenode
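As a rough illustration of the Namenode/Datanode split, here is a sketch of a client write through the FileSystem API; the path, replication factor, and block size are example values only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);     // metadata operations (create, rename, ...) go to the Namenode

        Path path = new Path("/tmp/hello.txt");   // hypothetical path
        short replication = 3;                    // each block replicated to 3 Datanodes
        long blockSize = 64L * 1024 * 1024;       // 64 MB blocks, a common default

        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(path, true, 4096, replication, blockSize);
        out.writeBytes("hello hdfs\n");           // file data streams to the Datanodes, not the Namenode
        out.close();

        System.out.println("replication = " + fs.getFileStatus(path).getReplication());
        fs.close();
    }
}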
Map-Reduce
• Application writer specifies
• A pair of functions called Map and Reduce
• A set of input files
• Workflow
• Generate FileSplits from input files, one per Map task
• Map phase executes the user map function transforming
input records into a new set of kv-pairs
• The framework shuffles & sorts tuples according to their keys
• Reduce phase combines all kv-pairs with the same key into new kv-pairs (see the WordCount sketch after this list)
• Output phase writes the resulting pairs to files
• All phases are distributed among many tasks
• Framework handles scheduling of tasks on cluster
• Framework handles recovery when a node fails
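The classic instance of this workflow is WordCount. A minimal sketch of the Map and Reduce functions with the org.apache.hadoop.mapreduce API (the class names are examples, not part of Hadoop):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: one input line -> a (word, 1) pair per word
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);                 // emit (word, 1)
        }
    }
}

// Reduce: (word, [1, 1, ...]) -> (word, total count); runs after the shuffle & sort
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));    // one output pair per distinct word
    }
}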
Hadoop M-R architecture
• Map/Reduce Master “Job Tracker”
• Accepts Map/Reduce jobs submitted by users (a job-submission sketch follows this list)
• Assigns Map and Reduce tasks to Task Trackers
• Monitors task and Task Tracker status, re-executes tasks
upon failure
• Map/Reduce Slaves “Task Trackers”
• Run Map and Reduce tasks upon instruction from the Job
Tracker
• Manage storage and transmission of intermediate output
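From the user's side, handing a job to the Job Tracker amounts to configuring a Job object and submitting it. A sketch of a WordCount driver, assuming the Mapper and Reducer classes from the earlier sketch and taking input/output paths as arguments:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up the cluster configuration on the classpath
        Job job = new Job(conf, "word count");         // the unit of work the Job Tracker schedules

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input files -> FileSplits, one per Map task
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS

        // Submit and block until the framework reports success or failure;
        // task scheduling and re-execution on failure are handled by the framework.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}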