15-12-2012, 04:37 PM
DATA WAREHOUSING AND DATA MINING
What Motivated Data Mining? Why Is It Important?
Data mining has attracted a great deal of attention in the information industry and in society as a whole in
recent years, due to the wide availability of huge amounts of data and the imminent need for turning such
data into useful information and knowledge. The information and knowledge gained can be used for
applications ranging from market analysis, fraud detection, and customer retention, to production control and
science exploration.
Data mining can be viewed as a result of the natural evolution of information technology. The database
system industry has witnessed an evolutionary path in the development of the following functionalities
(Figure 1.1): data collection and database creation, data management (including data storage and
retrieval, and database transaction processing), and advanced data analysis (involving data warehousing
and data mining).
What Is Data Mining?
Simply stated, data mining refers to extracting or “mining” knowledge from large amounts of data.
The term is actually a misnomer. Remember that the mining of gold from rocks or sand is referred to as
gold mining rather than rock or sand mining. Thus, data mining should have been more appropriately
named “knowledge mining from data,” which is unfortunately somewhat long. “Knowledge mining,” a
shorter term, may not reflect the emphasis on mining from large amounts of data. Nevertheless, mining is
a vivid term characterizing the process that finds a small set of precious nuggets from a great deal of raw
material (Figure 1.3). Thus, such a misnomer that carries both “data” and “mining” became a popular
choice. Many other terms carry a similar or slightly different meaning to data mining, such as knowledge
mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging.
Relational Databases
A database system, also called a database management system (DBMS), consists of a
collection of interrelated data, known as a database, and a set of software programs to manage and
access the data.
A relational database is a collection of tables, each of which is assigned a unique name. Each
table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or
rows). Each tuple in a relational table represents an object identified by a unique key and described by a
set of attribute values. A semantic data model, such as an entity-relationship (ER) data model, is often
constructed for relational databases. An ER data model represents the database as a set of entities and
their relationships.
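
The terms above (table, attribute, tuple, key) can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The customer table, its column names, and the sample rows are hypothetical and are not taken from the AllElectronics database; they only show how each tuple is identified by a unique key and described by attribute values.

```python
import sqlite3

# In-memory database, so the sketch leaves nothing on disk.
conn = sqlite3.connect(":memory:")

# A table with a unique name ("customer") and a set of attributes (columns).
# cust_id is the key that identifies each tuple. All names are hypothetical.
conn.execute(
    """
    CREATE TABLE customer (
        cust_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        city    TEXT
    )
    """
)

# Each inserted row is one tuple (record) described by its attribute values.
conn.executemany(
    "INSERT INTO customer (cust_id, name, city) VALUES (?, ?, ?)",
    [(1, "Smith", "Vancouver"), (2, "Jones", "Toronto")],
)

# List the attributes of the table (second field of each PRAGMA row is the name).
columns = [row[1] for row in conn.execute("PRAGMA table_info(customer)")]

# Retrieve one tuple by its unique key.
row = conn.execute(
    "SELECT name, city FROM customer WHERE cust_id = ?", (2,)
).fetchone()

# The key must stay unique: a second tuple with cust_id = 1 is rejected.
try:
    conn.execute("INSERT INTO customer (cust_id, name, city) VALUES (1, 'Lee', 'Ottawa')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

Here the DBMS, not the application, enforces key uniqueness, which is part of what "a set of software programs to manage and access the data" means in practice.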
Data Warehouses
Suppose that AllElectronics is a successful international company, with branches around the
world. Each branch has its own set of databases. The president of AllElectronics has asked you to
provide an analysis of the company’s sales per item type per branch for the third quarter. This is a difficult
task, particularly since the relevant data are spread out over several databases, physically located at
numerous sites.
If AllElectronics had a data warehouse, this task would be easy. A data warehouse is a repository of
information collected from multiple sources, stored under a unified schema, and usually residing at a
single site. Data warehouses are constructed via a process of data cleaning, data integration, data
transformation, data loading, and periodic data refreshing.
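
As a rough illustration of the construction steps just listed, the sketch below cleans and integrates sales records from two branch sources into one unified schema, then answers the president's question (sales per item type per branch for the third quarter). The branch names, record layouts, dates, and amounts are all invented for this example, and an in-memory list stands in for a real warehouse.

```python
from collections import defaultdict
from datetime import date

# Hypothetical extracts from two branch databases with different layouts.
vancouver_rows = [
    {"item_type": "TV", "sold_on": date(2012, 7, 14), "amount": 1200.0},
    {"item_type": "tv ", "sold_on": date(2012, 8, 2), "amount": 800.0},
    {"item_type": "Phone", "sold_on": date(2012, 5, 30), "amount": 300.0},  # second quarter
]
new_york_rows = [
    ("2012-09-10", "PHONE", "450.50"),  # (date string, item type, amount string)
    ("2012-07-01", "TV", "999.50"),
]

def to_unified(branch, item_type, sold_on, amount):
    """Data cleaning and transformation: one record in the unified schema."""
    return {
        "branch": branch,
        "item_type": item_type.strip().lower(),  # normalize inconsistent labels
        "sold_on": sold_on,
        "amount": float(amount),
    }

# Data integration and loading: all sources end up in a single store.
warehouse = []
for r in vancouver_rows:
    warehouse.append(to_unified("Vancouver", r["item_type"], r["sold_on"], r["amount"]))
for day, item_type, amount in new_york_rows:
    warehouse.append(to_unified("New York", item_type, date.fromisoformat(day), amount))

def quarterly_sales(rows, year, quarter):
    """Total sales per (branch, item type) for the given year and quarter."""
    totals = defaultdict(float)
    for row in rows:
        sold_on = row["sold_on"]
        if sold_on.year == year and (sold_on.month - 1) // 3 + 1 == quarter:
            totals[(row["branch"], row["item_type"])] += row["amount"]
    return dict(totals)

q3_sales = quarterly_sales(warehouse, 2012, 3)
```

Once the data share one schema at one site, the quarterly report reduces to a single aggregation instead of a separate query against each branch database.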
Advanced Data and Information Systems and Advanced Applications
Relational database systems have been widely used in business applications. With the progress
of database technology, various kinds of advanced data and information systems have emerged and are
undergoing development to address the requirements of new applications.
The new database applications include handling spatial data (such as maps), engineering design
data (such as the design of buildings, system components, or integrated circuits), hypertext and
multimedia data (including text, image, video, and audio data), time-related data (such as historical
records or stock exchange data), stream data (such as video surveillance and sensor data, where data flow
in and out like streams), and the World Wide Web (a huge, widely distributed information repository made
available by the Internet). These
applications require efficient data structures and scalable methods for handling complex object structures;
variable-length records; semistructured or unstructured data; text, spatiotemporal, and multimedia data;
and database schemas with complex structures and dynamic changes.
Object-Relational Databases
Object-relational databases are constructed based on an object-relational data model. This model
extends the relational model by providing a rich data type for handling complex objects and object
orientation. Because most sophisticated database applications need to handle complex objects and
structures, object-relational databases are becoming increasingly popular in industry and applications.
Conceptually, the object-relational data model inherits the essential concepts of object-oriented
databases, where, in general terms, each entity is considered as an object. Following the AllElectronics
example, objects can be individual employees, customers, or items.
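
The idea that each entity is treated as an object can be sketched in plain Python. This is only a conceptual illustration, not an object-relational DBMS: the Item and Customer classes, their fields, and the sample values are hypothetical. It shows an object holding a complex attribute (a list of other objects), something a flat relational tuple cannot store directly.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Item:
    """One item entity. Field names are hypothetical."""
    item_id: str
    name: str
    price: float

@dataclass
class Customer:
    """One customer entity whose purchases attribute is itself a list of objects."""
    cust_id: str
    name: str
    purchases: List[Item] = field(default_factory=list)

    def total_spent(self) -> float:
        # Behaviour is attached to the object alongside its data.
        return sum(item.price for item in self.purchases)

# Individual customers and items are individual objects.
tv = Item("I1", "high-resolution TV", 500.0)
phone = Item("I2", "smartphone", 250.0)
smith = Customer("C1", "Smith")
smith.purchases.extend([tv, phone])

total = smith.total_spent()
```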