Apache Hadoop and Hive
Hadoop, Why?
Need to process Multi-Petabyte Datasets
Expensive to build reliability into each application.
Nodes fail every day
– Failure is expected, rather than exceptional.
– The number of nodes in a cluster is not constant.
Need common infrastructure
– Efficient, reliable, open source (Apache License)
The above goals are the same as Condor's, but
– Workloads are I/O bound, not CPU bound
Hive, Why?
Need a Multi-Petabyte Warehouse
Files alone are an insufficient data abstraction
Need tables, schemas, partitions, indices
SQL is highly popular
Need for an open data format
– RDBMSs have a closed data format
– Need a flexible schema
Hive is a Hadoop subproject!
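A minimal sketch of what this buys in practice, using the HiveServer2 JDBC driver: a partitioned table defined over plain HDFS files and queried with familiar SQL. The host/port, table name, and columns are illustrative assumptions, not taken from the slides.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: tables, schemas, and partitions over HDFS files, queried via SQL.
// Connection details and the page_views table are hypothetical examples.
public class HiveSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
        Statement stmt = conn.createStatement();

        // A partitioned table: each dt value maps to a directory of files
        stmt.execute("CREATE TABLE IF NOT EXISTS page_views ("
                + "user_id BIGINT, url STRING) "
                + "PARTITIONED BY (dt STRING) "
                + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");

        // Plain SQL; the partition predicate prunes the files that are read
        ResultSet rs = stmt.executeQuery(
                "SELECT url, COUNT(*) FROM page_views "
                + "WHERE dt = '2014-03-27' GROUP BY url");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        conn.close();
    }
}
```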
Goals of HDFS
Very Large Distributed File System
– 10K nodes, 100 million files, 10 PB
Assumes Commodity Hardware
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
Optimized for Batch Processing
– Data locations exposed so that computations can move to where data resides
– Provides very high aggregate bandwidth
Runs in user space on heterogeneous operating systems
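The "data locations exposed" goal is visible directly in the HDFS client API. A minimal Java sketch, assuming a reachable cluster and an illustrative file path, asks the NameNode which DataNodes hold each block of a file so a scheduler can move computation to the data:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: query the NameNode for per-block replica locations of a file.
public class BlockLocality {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml etc.
        FileSystem fs = FileSystem.get(conf);

        FileStatus status = fs.getFileStatus(new Path("/data/input/part-00000"));
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation b : blocks) {
            // Each block is replicated on several DataNodes
            System.out.println("offset " + b.getOffset()
                    + " len " + b.getLength()
                    + " hosts " + String.join(",", b.getHosts()));
        }
    }
}
```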
NameNode Metadata
Metadata in Memory
– The entire metadata is kept in main memory
– No demand paging of metadata
Types of Metadata
– List of files
– List of Blocks for each file
– List of DataNodes for each block
– File attributes, e.g. creation time, replication factor
A Transaction Log
– Records file creations, file deletions, etc.
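A conceptual Java sketch (not actual Hadoop source) of the structures described above, i.e. what the NameNode holds entirely in RAM:

```java
import java.util.List;
import java.util.Map;

// Conceptual model of NameNode metadata: files map to block lists, blocks
// map to the DataNodes holding replicas, plus per-file attributes.
class NameNodeMetadata {
    static class FileAttributes {
        long creationTime;       // e.g. creation time
        short replicationFactor; // e.g. replication factor
    }

    Map<String, List<Long>> fileToBlocks;       // list of blocks per file
    Map<Long, List<String>> blockToDataNodes;   // DataNodes for each block
    Map<String, FileAttributes> fileAttributes; // attributes per file

    // Every mutation (create/delete) is appended to the transaction log
    // (the edit log) before being applied to the in-memory state.
    void logEdit(String op, String path) { /* append to edit log */ }
}
```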
NameNode Failure
The NameNode is a single point of failure
Transaction Log stored in multiple directories
– A directory on the local file system
– A directory on a remote file system (NFS/CIFS)
Need to develop a real HA solution
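A minimal sketch of the "multiple directories" mitigation. The setting normally lives in hdfs-site.xml; it is shown programmatically here with illustrative paths, using the classic Hadoop 1.x key dfs.name.dir (dfs.namenode.name.dir in later releases):

```java
import org.apache.hadoop.conf.Configuration;

// Sketch: mirror the NameNode image and transaction log into several
// directories, typically one local disk plus one NFS mount.
public class NameDirConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Comma-separated list: a local directory and a remote (NFS) one
        conf.set("dfs.name.dir", "/data/hdfs/name,/mnt/nfs/hdfs/name");
        System.out.println(conf.get("dfs.name.dir"));
    }
}
```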
Job Scheduling
Current state of affairs
– FIFO and Fair Share schedulers (see the sketch below)
– Checkpointing and parallelism are tied together
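A minimal sketch of switching the JobTracker from the default FIFO scheduler to the Fair Scheduler (Hadoop 1.x era). These properties normally live in mapred-site.xml and assume the fair scheduler module is on the classpath:

```java
import org.apache.hadoop.conf.Configuration;

// Sketch: select the Fair Scheduler instead of the default FIFO scheduler.
public class SchedulerConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("mapred.jobtracker.taskScheduler",
                 "org.apache.hadoop.mapred.FairScheduler");
        // Group jobs into pools, e.g. one pool per submitting user
        conf.set("mapred.fairscheduler.poolnameproperty", "user.name");
        System.out.println(conf.get("mapred.jobtracker.taskScheduler"));
    }
}
```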
Topics for Research
– Cycle-scavenging scheduler
– Separate checkpointing from parallelism
– Use resource matchmaking to support heterogeneous Hadoop compute clusters
– Scheduler and API for MPI workloads