13-06-2012, 12:13 PM
Discovering Conditional Functional Dependencies
Discovering Conditional Functional Dependencies.pdf (Size: 617.67 KB / Downloads: 34)
INTRODUCTION
Conditional functional dependencies (CFDs) [1] were recently
introduced for data cleaning. They extend standard functional
dependencies (FDs) by enforcing patterns of semantically
related constants. CFDs have been proven more effective than
FDs in detecting and repairing inconsistencies (dirtiness) of
data [1], [2], and are expected to be adopted by data-cleaning
tools that currently employ standard FDs (e.g., [3], [4], [5];
see [6], [7] for surveys on data cleaning tools).
CFDS AND CFD DISCOVERY
We first review the definition of CFDs [1]. We then formalize
the notions of minimal CFDs and frequent CFDs. Finally,
we state the discovery problem for CFDs and show how the
discovered CFDs are used to generate so-called tableau CFDs.
Conditional Functional Dependencies
Consider a relation schema R defined over a fixed set of attributes,
denoted by attr®. For each attribute A $ attr®,
we use dom(A) to denote its domain.
The Discovery Problem for CFDs
Given a sample relation r of a schema R, an algorithm for
CFD discovery aims to find CFDs defined over R that hold
on r. Obviously it is not a good idea to return the set of
all CFDs that hold on r, since the set contains trivial and
redundant CFDs and is unnecessarily large. Thus we want to
find a canonical cover, i.e., a non-redundant set consisting of
minimal CFDs only, from which all CFDs on r can be derived
via implication analysis. Moreover, real-life data is often dirty,
containing errors and noise. To exclude CFDs that match errors
and noise only, we consider frequent CFDs, which have a
pattern tuple with support in r above a threshold.
RELATED WORK
Prior work on conditional dependencies has mostly focused
on the consistency and implication analyses of CFDs [1],
repairing methods to localize and fix errors detected by CFDs
[26], propagation of CFDs from source data to views in data
integration [27], extensions of CFDs by adding disjunction and
negation [28] or adding ranges [10], confidence of CFDs [29],
as well as extensions of inclusion dependencies with conditions
(referred to as CINDs) [30]. To our knowledge, CFD
discovery was only studied in [21], [10], [31]. Except these,
the previous work assumes that CFDs are already designed
and provided.
Discovering Conditional Functional Dependencies.pdf (Size: 617.67 KB / Downloads: 34)
INTRODUCTION
Conditional functional dependencies (CFDs) [1] were recently
introduced for data cleaning. They extend standard functional
dependencies (FDs) by enforcing patterns of semantically
related constants. CFDs have been proven more effective than
FDs in detecting and repairing inconsistencies (dirtiness) of
data [1], [2], and are expected to be adopted by data-cleaning
tools that currently employ standard FDs (e.g., [3], [4], [5];
see [6], [7] for surveys on data cleaning tools).
CFDS AND CFD DISCOVERY
We first review the definition of CFDs [1]. We then formalize
the notions of minimal CFDs and frequent CFDs. Finally,
we state the discovery problem for CFDs and show how the
discovered CFDs are used to generate so-called tableau CFDs.
Conditional Functional Dependencies
Consider a relation schema R defined over a fixed set of attributes,
denoted by attr®. For each attribute A $ attr®,
we use dom(A) to denote its domain.
The Discovery Problem for CFDs
Given a sample relation r of a schema R, an algorithm for
CFD discovery aims to find CFDs defined over R that hold
on r. Obviously it is not a good idea to return the set of
all CFDs that hold on r, since the set contains trivial and
redundant CFDs and is unnecessarily large. Thus we want to
find a canonical cover, i.e., a non-redundant set consisting of
minimal CFDs only, from which all CFDs on r can be derived
via implication analysis. Moreover, real-life data is often dirty,
containing errors and noise. To exclude CFDs that match errors
and noise only, we consider frequent CFDs, which have a
pattern tuple with support in r above a threshold.
RELATED WORK
Prior work on conditional dependencies has mostly focused
on the consistency and implication analyses of CFDs [1],
repairing methods to localize and fix errors detected by CFDs
[26], propagation of CFDs from source data to views in data
integration [27], extensions of CFDs by adding disjunction and
negation [28] or adding ranges [10], confidence of CFDs [29],
as well as extensions of inclusion dependencies with conditions
(referred to as CINDs) [30]. To our knowledge, CFD
discovery was only studied in [21], [10], [31]. Except these,
the previous work assumes that CFDs are already designed
and provided.