21-08-2012, 03:05 PM
Managing Big Data in the Enterprise
Introduction
According to IDC, the amount of digital information produced in
2011 will be ten times that produced in 2006: 1,800 exabytes.
The majority of this data will be “unstructured” – complex data
poorly‐suited to management by structured storage systems
like relational databases.
Unstructured data comes from many sources and takes many
forms – web logs, text files, sensor readings, user‐generated
content like product reviews or text messages, audio, video and
still imagery and more.
Large volumes of complex data can hide important insights.
Are there buying patterns in point‐of‐sale data that can
forecast demand for products at particular stores? Do RFID tag
reads show anomalies in the movement of goods during
distribution? Do user logs from a web site, or calling records in
a mobile network, contain information about relationships
among individual customers? Can a collection of nucleotide
sequences be assembled into a single gene? Companies that
can extract facts like these from huge volumes of data can
better control processes and costs, predict demand more
accurately and build better products.
Reliable Storage: HDFS
Hadoop includes a fault‐tolerant storage system called the
Hadoop Distributed File System, or HDFS. HDFS is able to store
huge amounts of information, scale up incrementally and
survive the failure of significant parts of the storage
infrastructure without losing data.
Hadoop creates clusters of machines and coordinates work
among them. Clusters can be built with inexpensive computers.
If one machine fails, Hadoop shifts its work to the remaining
machines in the cluster and continues operating without losing
data or interrupting jobs.
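The fault tolerance described above rests on block replication: a file is split into fixed-size blocks and each block is copied to several machines, so a read can succeed even after a machine is lost. The following is a toy single-process sketch of that idea (not HDFS itself; the block size, replication factor and node names are illustrative assumptions):

```python
BLOCK_SIZE = 4          # bytes per block (tiny, for illustration; HDFS uses megabytes)
REPLICATION = 3         # copies kept of each block (HDFS defaults to 3)

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split a byte string into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}  # block index -> list of node names holding a replica
    for i, _ in enumerate(blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

def read_file(blocks, placement, live_nodes):
    """Reassemble the file using any surviving replica of each block."""
    out = []
    for i, block in enumerate(blocks):
        if not any(n in live_nodes for n in placement[i]):
            raise IOError("block %d lost: all replicas down" % i)
        out.append(block)
    return b"".join(out)

nodes = ["node1", "node2", "node3", "node4"]
blocks = split_into_blocks(b"hello big data world")
placement = place_blocks(blocks, nodes)

# Even with node2 down, every block still has a live replica,
# so the read completes with no data loss.
survivors = {"node1", "node3", "node4"}
assert read_file(blocks, placement, survivors) == b"hello big data world"
```

With three replicas per block, any single machine failure leaves at least two live copies, which is why the cluster keeps serving data while the lost replicas are re-copied elsewhere.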
Hadoop for Big Data Analysis
Many popular tools for enterprise data management –
relational database systems, for example – are designed to
make simple queries run quickly. They use techniques like
indexing to examine just a small portion of all the available
data in order to answer a question.
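The contrast can be sketched in a few lines (a toy illustration, not a real database; the table and column names are assumptions): a full scan examines every row, while a prebuilt index points straight at the matching rows.

```python
rows = [
    {"customer_id": 7,  "amount": 10.0},
    {"customer_id": 42, "amount": 25.5},
    {"customer_id": 42, "amount": 3.0},
    {"customer_id": 9,  "amount": 8.0},
]

# Full scan: every row is examined to answer the query.
scan_hits = [i for i, r in enumerate(rows) if r["customer_id"] == 42]

# Index: built once up front, after which a lookup touches
# only the rows that actually match.
index = {}
for i, r in enumerate(rows):
    index.setdefault(r["customer_id"], []).append(i)

index_hits = index.get(42, [])

assert scan_hits == index_hits == [1, 2]
```

Both approaches return the same rows; the index simply avoids reading the rest of the table, which is exactly the optimization that simple, selective queries reward.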
Hadoop is a different sort of tool. Hadoop is aimed at problems
that require examination of all the available data. For example,
text analysis and image processing generally require that every
single record be read, and often interpreted in the context of
similar records. Hadoop uses a technique called MapReduce to
carry out this exhaustive analysis quickly.
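The MapReduce model itself is simple enough to sketch on a single machine (this is an illustration of the programming model, not the Hadoop API): a map function emits key/value pairs from every record, the framework groups the pairs by key, and a reduce function aggregates each group. Word counting is the classic example:

```python
from collections import defaultdict

def map_phase(record):
    """Emit (word, 1) for every word in one input record."""
    for word in record.split():
        yield (word.lower(), 1)

def reduce_phase(key, values):
    """Aggregate one key's values: here, sum the counts for one word."""
    return (key, sum(values))

def mapreduce(records):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)
    # Reduce each group independently -- on a real cluster these
    # run in parallel across many machines.
    return dict(reduce_phase(k, vs) for k, vs in groups.items())

logs = ["big data big insight", "data at scale"]
counts = mapreduce(logs)
# counts == {"big": 2, "data": 2, "insight": 1, "at": 1, "scale": 1}
```

Because every record passes through the map function and each key's group is reduced independently, the whole job parallelizes naturally: Hadoop distributes map tasks across the cluster, shuffles intermediate pairs over the network, and runs the reducers concurrently, which is what makes exhaustive analysis of all the data tractable.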
Summary
Hadoop’s MapReduce and HDFS use simple, robust techniques
on inexpensive computer systems to deliver very high data
availability and to analyze enormous amounts of information
quickly. Hadoop offers enterprises a powerful new tool for
managing big data.