09-10-2012, 10:55 AM
Cloud Computing with MapReduce and Hadoop
What is Cloud Computing?
“Cloud” refers to large Internet services like Google, Yahoo, etc. that run on tens of thousands of machines
More recently, “cloud computing” refers to services offered by these companies that let external customers rent computing cycles on their clusters
Amazon EC2: virtual machines at 10¢/hour, billed hourly
Amazon S3: storage at 15¢/GB/month
Attractive features:
Scale: up to 100s of nodes
Fine-grained billing: pay only for what you use
Ease of use: sign up with credit card, get root access
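To make the fine-grained billing concrete, here is a minimal sketch of a cost estimate using the per-unit prices quoted above (10¢ per instance-hour on EC2, billed by whole hours; 15¢ per GB-month on S3). The function name and the whole-hour rounding policy are assumptions for illustration, not Amazon's actual billing code.

```python
import math

def monthly_cost_cents(instance_hours, storage_gb):
    """Hypothetical cost estimate using the prices quoted above.

    EC2: 10 cents per instance-hour, billed hourly (partial
    hours round up to a whole hour -- an assumed policy).
    S3: 15 cents per GB-month of storage.
    Returns the total in whole cents to avoid float rounding.
    """
    ec2 = 10 * math.ceil(instance_hours)  # 10 cents/hour
    s3 = 15 * storage_gb                  # 15 cents/GB-month
    return ec2 + s3
```

For example, 100 instance-hours plus 50 GB of storage comes to 1000¢ + 750¢ = $17.50 for the month.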
What is MapReduce?
Simple data-parallel programming model designed for scalability and fault-tolerance
Pioneered by Google
Processes 20 petabytes of data per day
Popularized by open-source Hadoop project
Used at Yahoo!, Facebook, Amazon, …
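The data-parallel model above boils down to two user-supplied functions: a map function that emits key-value pairs, and a reduce function that combines all values sharing a key. A minimal single-machine sketch (word count, the canonical example) illustrates the model; the function names here are illustrative, and real Hadoop distributes the map, shuffle, and reduce phases across a cluster.

```python
from collections import defaultdict

def map_fn(line):
    # Map phase: emit (word, 1) for every word in the input line
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce phase: sum all the counts emitted for one word
    yield (word, sum(counts))

def mapreduce(lines, map_fn, reduce_fn):
    # Shuffle phase: group intermediate values by key
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    # Apply the reducer to each key group
    result = {}
    for key, values in groups.items():
        for out_key, out_value in reduce_fn(key, values):
            result[out_key] = out_value
    return result
```

Running `mapreduce(["the quick fox", "the lazy dog"], map_fn, reduce_fn)` yields counts such as `{"the": 2, "quick": 1, ...}`; the framework handles grouping and distribution, so the programmer only writes the two small functions.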
What is MapReduce used for?
At Google:
Index construction for Google Search
Article clustering for Google News
Statistical machine translation
At Yahoo!:
“Web map” powering Yahoo! Search
Spam detection for Yahoo! Mail
At Facebook:
Data mining
Ad optimization
Spam detection
MapReduce Design Goals
Scalability to large data volumes:
1000s of machines, 10,000s of disks
Cost-efficiency:
Commodity machines (cheap, but unreliable)
Commodity network
Automatic fault-tolerance (fewer administrators)
Easy to use (fewer programmers)
Hadoop Distributed File System
Files split into 128MB blocks
Blocks replicated across several datanodes (usually 3)
Single namenode stores metadata (file names, block locations, etc.)
Optimized for large files, sequential reads
Files are append-only
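The block layout above can be sketched as follows: split a file into 128 MB blocks and place each block's replicas on three distinct datanodes. The round-robin placement here is a simplifying assumption for illustration; the real HDFS namenode uses a rack-aware placement policy.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB blocks, as described above
REPLICATION = 3                 # usual replication factor

def plan_blocks(file_size, datanodes):
    """Hypothetical block-placement plan: split a file into
    128 MB blocks and assign each block's 3 replicas to distinct
    datanodes, round-robin (real HDFS placement is rack-aware).
    Returns a list of (block_index, [replica datanodes]) pairs --
    the kind of metadata the namenode keeps in memory.
    """
    n_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    plan = []
    for b in range(n_blocks):
        replicas = [datanodes[(b + i) % len(datanodes)]
                    for i in range(REPLICATION)]
        plan.append((b, replicas))
    return plan
```

A 300 MB file, for instance, becomes three blocks (two full 128 MB blocks plus a 44 MB tail), each stored on three of the available datanodes.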
Fault Tolerance in MapReduce
If a task is going slowly (straggler):
Launch second copy of task on another node (“speculative execution”)
Take the output of whichever copy finishes first, and kill the other
Surprisingly important in large clusters
Stragglers occur frequently due to failing hardware, software bugs, misconfiguration, etc.
Single straggler may noticeably slow down a job
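Speculative execution as described above can be modeled in a few lines: run two copies of the same task concurrently, take whichever finishes first, and attempt to cancel the other. This is a toy single-machine sketch using Python threads; the function names are illustrative, and real Hadoop schedules the backup copy on a different node and kills the loser's process.

```python
import concurrent.futures
import time

def run_with_backup(task, backup):
    """Toy model of speculative execution: run two copies of a
    task, return the result of whichever finishes first, and
    cancel the other (best effort -- an already-running thread
    cannot be forcibly killed, unlike a task on another node).
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(task), pool.submit(backup)]
        done, pending = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
        for f in pending:
            f.cancel()  # best-effort kill of the slower copy
        return next(iter(done)).result()

def healthy_task():
    return "done"

def straggler_task():
    time.sleep(0.5)  # simulates a slow node
    return "done (late)"
```

Here `run_with_backup(healthy_task, straggler_task)` returns the healthy copy's result without waiting for the straggler to report, which is exactly why a single slow node need not slow down the whole job.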