04-09-2012, 03:01 PM
DATA MINING TECHNIQUES FOR STRUCTURED AND SEMISTRUCTURED DATA
Abstract
Data mining is the application of sophisticated analysis to large amounts of data in order
to discover new knowledge in the form of patterns, trends, and associations. With the
advent of the World Wide Web, the amount of data stored and accessible electronically has
grown tremendously and the process of knowledge discovery (data mining) from this data
has become very important for the business and scientific-research communities alike.
This doctoral thesis introduces Query Flocks, a general framework over relational data
that enables the declarative formulation, systematic optimization, and efficient processing
of a large class of mining queries. In Query Flocks, each mining problem is expressed as a
datalog query with parameters and a filter condition. In the optimization phase, a query
flock is transformed into a sequence of simpler queries that can be executed efficiently. As
a proof of concept, Query Flocks have been integrated with a conventional database system
and the thesis reports on the architectural issues and performance results.
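To make the idea concrete, here is a minimal Python sketch of a query flock for the classic market-basket problem. It is an illustration under stated assumptions, not the thesis's implementation: the relation name baskets, the toy data, and the function names are all hypothetical. The flock it mimics could be written roughly as the parameterized query answer(B) :- baskets(B, $1) AND baskets(B, $2) with the filter COUNT(answer.B) >= min_support. The second function shows the spirit of the optimization phase: a simpler query first prunes items that cannot satisfy the filter, and only then is the original query evaluated.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy relation baskets(B, I): (basket id, item) pairs.
BASKETS = [
    (1, "bread"), (1, "milk"), (1, "eggs"),
    (2, "bread"), (2, "milk"),
    (3, "bread"), (3, "milk"), (3, "beer"),
    (4, "milk"), (4, "eggs"),
]


def naive_flock(baskets, min_support):
    """Direct evaluation: count, for every pair of items, the baskets that
    contain both, then keep the pairs that pass the filter condition."""
    items_by_basket = defaultdict(set)
    for basket, item in baskets:
        items_by_basket[basket].add(item)

    baskets_by_pair = defaultdict(set)
    for basket, items in items_by_basket.items():
        # Sorting gives each unordered pair one canonical representation.
        for pair in combinations(sorted(items), 2):
            baskets_by_pair[pair].add(basket)

    return {pair for pair, bs in baskets_by_pair.items()
            if len(bs) >= min_support}


def planned_flock(baskets, min_support):
    """Two-step plan: a simpler single-parameter query prunes infrequent
    items first, then the original query runs on the reduced relation."""
    baskets_by_item = defaultdict(set)
    for basket, item in baskets:
        baskets_by_item[item].add(basket)

    # Step 1: an item in fewer than min_support baskets cannot be part of
    # any pair that appears in at least min_support baskets.
    frequent = {item for item, bs in baskets_by_item.items()
                if len(bs) >= min_support}

    # Step 2: evaluate the original flock over the surviving tuples only.
    pruned = [(b, i) for b, i in baskets if i in frequent]
    return naive_flock(pruned, min_support)


if __name__ == "__main__":
    print(sorted(planned_flock(BASKETS, 2)))
```

Both functions return the same set of item pairs; the planned version simply does less work on data where many items are rare, which is the kind of saving a sequence of simpler queries is meant to deliver.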
While the Query-Flock framework is well suited for relational data, it has limited use
for semistructured data, i.e., nested data with implicit and/or irregular structure, e.g. web
pages. The lack of an explicit fixed schema makes semistructured data easy to generate or
extract but hard to browse and query. This thesis presents methods for structure discovery
in semistructured data that alleviate this problem. The discovered structure can be of
varying precision and complexity. The thesis introduces an algorithm for deriving a schema-by-example
and an algorithm for extracting an approximate schema in the form of a datalog
program.
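As a rough illustration of what "discovering structure" can mean, the following Python sketch summarizes which label paths occur in a handful of irregular nested records and how often. This is not the schema-by-example algorithm or the datalog-schema extraction from the thesis; it is only a simple path summary, and the sample records and function names are hypothetical.

```python
from collections import Counter

# Hypothetical records with irregular structure: fields are missing,
# nested in some records and flat in others, or repeated as lists.
RECORDS = [
    {"name": "Ann", "address": {"city": "Oslo", "zip": "0150"}},
    {"name": "Bob", "address": "unknown", "phone": ["555-0100", "555-0101"]},
    {"name": "Cy", "phone": "555-0199"},
]


def label_paths(node, prefix=()):
    """Yield every label path (a tuple of labels) found below this node."""
    if isinstance(node, dict):
        for label, child in node.items():
            path = prefix + (label,)
            yield path
            yield from label_paths(child, path)
    elif isinstance(node, list):
        # List elements share the label path of the list itself.
        for child in node:
            yield from label_paths(child, prefix)


def path_summary(records):
    """Count, for each label path, how many records contain it."""
    counts = Counter()
    for record in records:
        counts.update(set(label_paths(record)))
    return counts


if __name__ == "__main__":
    for path, n in sorted(path_summary(RECORDS).items()):
        print(".".join(path), n)
```

Paths that occur in every record (here, name) behave like a fixed schema, while rarer paths (address.city) expose the irregular part that makes such data hard to browse and query.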
Introduction
The amount of data stored and available electronically has been growing at an ever increasing
rate for the last decade. In the business community, companies collect all sorts of
information about the business process such as financial, payroll, and customer data. The
data is often among the most valuable assets of a business. In the scientific community, a
single experiment can produce terabytes of data. Consequently, there is growing demand
for methods and tools that analyze large volumes of data. However, even storing, let alone
analyzing, such huge amounts of data presents many new obstacles and challenges. The oft-used
metaphor that "we are drowning in data, and yet starving for knowledge" sums up
the situation perfectly. The field of data mining has emerged out of this necessity.
Data mining is broadly defined as the process of finding "patterns" in large amounts
of data. The definition is necessarily vague because it has to encompass the vast array
of methods, techniques, and algorithms from various fields such as databases, machine
learning, and statistics. To complicate matters further, data mining is often considered
to be only one step, albeit the most important one, in the knowledge discovery process. The
knowledge discovery process involves several other pre-mining and post-mining steps such
as data cleansing and data visualization. In the present thesis, the focus is on data mining
from the database perspective.
Data Mining for Structured Data
The state of the art in data mining for structured data consists of many different algorithms that
operate on limited types of data. Furthermore, most data-mining methods are at best
loosely coupled with relational DBMSs and thus do not take advantage of existing database
technology. In this thesis, we propose a framework, called query flocks, that allows the
declarative formulation of a large class of data-mining queries over relational data. We
also present a method for systematic optimization and efficient processing, called query
Data Mining for Semistructured Data
The importance of semistructured data has been recognized in the database community and
is emphasized by the flurry of research activities in the last several years. The emergence
of XML and its rapid adoption by e-commerce companies has made semistructured
data equally important for the business community. However, since the proliferation of
semistructured data has been relatively recent, there is a lack of tools and methods for
analysis of such data. Standard data-mining techniques developed for structured data are
difficult to apply and have been shown to be ineffective [NAM98]. In this thesis, we make
the following two contributions that address the problem of analyzing semistructured data.