05-09-2016, 11:53 AM
1452759585-AReviewpaperonProbabilisticDatabasemethods.docx (Size: 52.35 KB / Downloads: 4)
Abstract:
Real-world databases experience various data quality problems of different causes including heterogeneity of consolidated data sources, imprecision of reading devices, and data entry errors. Existence of duplicate records is a prominent data quality problem. The process of duplicate elimination often involves uncertainty in deciding on the true duplicates. Current tools resolve such uncertainty either through expert intervention, which is not always possible, or by taking destructive decisions that may lead to unrecoverable errors.
In this paper, the approach is that duplicate elimination from a new perspective treating de-duplication procedure as data processing tasks with uncertain out comes. We propose a complete uncertainty model that compactly encodes the space of clean instances of the input data, and introduce efficient model implementations.
Keywords: Entity Resolution, Uncertain Data, Probabilistic Databases, Query processing
INTRODUCTION
They are databases in which every attribute of an entity or a tuple can have a number of alternative values, each with some probability. A probabilistic database is the result of integration with uncertain alternative tuples, or a consequence of probabilistic mappings from the sources to the integration. Due to these probabilities, query answers are unavoidably probabilistic. The “probabilistic cleaning” approach builds on and extends the current practices in data cleaning in several ways: Capturing all possible cleaning outcomes allows preserving the potentially useful cleaning results without destroying the original unclean data. This enables data processing without expert intervention if uncertain query answers are encountered. Further, it provides experts with a scope of the conflicting/uncertain cleaning decisions that need to be addressed.
•Contrasting and comparing the uncertainty involved in the outputs of multiple cleaning procedures, or the same procedure with multiple specifications. This allow for using and integrating different cleaning tools appropriately in cleaning workflows.
•Enabling new types of cleaning requirements defined on the query output, and hence are independent from the implementation details of underlying cleaning procedures.
•Allowing specifying cleaning requirements as query predicates by pushing the expensive off-line clustering and matching tasks to model building.
The need for efficiently querying such databases and for supporting practical query scenarios that require analytical or summarized information is discussed here. The author also sketch possible methodologies and techniques that would allow performing efficient processing of queries over such probabilistic databases, and especially without the need to materialize the potentially, huge collection of all possible deduplication worlds
RELATED WORK
In this section, we are going to survey some currently available methods of probabilistic databases for uncertain predicates.
[1] On the Fly Entity Aware Query Processing in the Presence of Linkage
In this paper author, described a novel framework for entity linkage with uncertainty. Here available linkage is been used with data and its trust value instead of using the linkage information to merge structures a-priori. For probabilistic linkage a new probabilistic query answering technique has been used. This novel idea introduce below details: (i) it performs merges at run time for not only on existing linkages but also for given query; (ii) it allows results that contain structures, but generated as a result also for linkages; and (iii) support and query conditions evaluation across linked structures. This offers a functionality not currently supported by any traditional probabilistic databases.
This work proposes a new technique for query answering under data uncertainty that combines the ideas of entity linkage and probabilistic databases. First accept data with uncertainty at the attribute level. Then obtain alternative matching suggestions as returned by existing entity linkage techniques. This forms new probabilistic database that apart from uncertainty at the attribute level, contains uncertainty at the merging pairs.
Definition of probabilistic linkage database:
The linkage probabilistic database is a tuple of <E,L,p^a,p^l>, where E is the entity set, L is the assignment of linkages on E, and p^a,p^l are the attributes and probability linkages assignment functions.
In particular,
p^l|L→[0,1] and p^a|B →[0, 1]| with B= {a|∃<id, A>ϵE^aϵA}.
There are probabilities on linkages, the linkage probabilistic database models the different number of probabilistic databases having linkage relationships. This type of a database is created from the linkage probabilistic database by choosing a fraction of the linkages.
L-world probability:
Here these above probabilistic databases having assignments like possible linkage of worlds or possible l-world is called as l-world probability. The group of all these possible l-world of linkage probabilistic database D is denoted as plw (D)
A linkage probabilistic database D=< E,L,p^a,p^l>, the specification of linkage is a linkage assignment L^sp≥L such that∀xϵ L^sp:p^l (x)≠0. The Boolean expression of specification of linkage is the expression ⋀_(1,n)▒c_(k,) where
c_(k={█( x≠y if (x,y)ϵL^(x,y)∄L^sp@x=y if (x,y)ϵL^sp )┤ )
When the Boolean expression is false then the linkage specification is invalid [1].
This technique has benefits as if it first overcomes drawback that may result from the one-time a-priori merging plan, as happens with the traditional entity linkage techniques. As no merging decisions have taken place, consider only updates required are on the linkages related to the new or modified data. This method produces additional valid query answering results compared to those of entity linkage and probabilistic databases, and cannot simulated with any of other techniques. Efficient query evaluation takes place in this paper has two major benefits as below. First only the required merges take place. Second, it avoids overwhelming the user with answers containing long lists of attributes originating from different possibly linked entities that are outside the users’ interests.
It is concluded that proposed technique in this paper able to formally define the semantics of a new kind of query answering, and describe efficient techniques for query processing. The results of evaluation for this method on real world datasets show the efficiency and effectiveness compare to others.
[2] Ranking Query Answers in Probabilistic Databases: Complexity and Efficient Algorithms
In this paper author, proposed the solution for issue of ranking query answers in probabilistic databases. Author introduced a technique to make top-k exact answers for conjunctive queries. Statically obtained share plans and deterministic approximation used in this approach which refresh upper and lower bounds probable query answers. To implement this technique SPROUT query engine has been use of a MayBMS that is the part of probabilistic DBMS. This mechanism is been based on Incremental approximate probability computations for events with shared factors , Probability Computation by Event Decomposition, Decomposition of Events with Shared Factors, and Ranking Algorithm.
Ranking Problem in PP
The group of decision problems is the complexity class PP and it can be solve a nondeterministic polynomial-time Turing machine where the receipt condition is that at least half of the computation paths accept.
PP is tightly connected to #P class of functions which is count the total number of accepting nondeterministic polynomial-time Turing machines: P^PP=P^(#P).This meant to the class of decision problems computable in polynomial time using #P oracles coincides with the class of decision problems computable in polynomial time using PP oracles[2].
Membership in PP
The ranking problem is in PP by the following argument: Let F and G are the events of two answers t1 and t2 respectively. We have that P (t1) = P (F) and P (t2) = P (G). In the case of conjunctive queries, the events F and G are positive DNF formulas; the following argument holds however for arbitrary events.
The problem is to check whether P (F) ≤ P(G). Define the disjoint events
A = F ∧ ¬G and B = G ∧ ¬F. Then,
P (F) ≤ P (G) ⇔ P (A) ≤ P (B), since
P(A) = P(F) + P(¬G) –P(F ∨ ¬G)
= P(F) –P(G) + P(G ∧ ¬F) = P(F) –P(G) + P(B).
Now denote A′ = A∨¬A¬B and B′ = B∨¬A¬B. Clearly,
A′ ∨ B′ = true. Then,
P(A′) = P(A) + P(¬A¬B). P(B′) = P(B) + P(¬A¬B)
Hence, P(A) ≤ P(B) ⇔ P(A′) ≤ P(B′) ⇔ P(B′) ≥ 1/2.
In the case of uniform probability distributions, the latter inequality is the PP-complete problem MAJ SAT for arbitrary formulas B′. In case of non-uniform probability distributions, the same argument used for membership in #P of the probability computation problem can be made here [2].
Main goal of this technique is to quantify the probabilities of the events of the query answers to the extent needed to separate the top-k most probable events from the others. This method is suitable solution for ranking query problem
[3] Efficient Query Evaluation on Probabilistic Databases
In this paper author described a system, which sustain complex SQL queries. Here complex SQL queries is been supported by “uncertain” predicates. Probabilistic model is been used for query semantics and results ranked by information retrieval strategy. Author mainly focuses on query execution. Here author used both a Monte-Carlo simulation algorithm and approximation algorithm for their purpose. Probability is been assigned to each tuple in given database of a uncertain predicates. To observe how correctly probability matches with uncertain predicates. These observations of a probability then collected in answer for result ranking.
Example of probabilistic database
The probabilistic database is nothing but each tuple has the defined probability of belonging to that database. Figure 1 shows a probabilistic database D^P with tablesS^p 〖and T〗^p. The meaning of probabilistic database is the probability of distribution on all database instances, that is possible worlds, and denoted by pwd (D^p).Figure 2(a) shows 8 probable instances which having non-zero probabilities, which are computed by multiplying the tuple probabilities, as we have assumed them to be independent.
For example the probability of D3= (1-0.7)*0.4*0.5=0.06, since the instance contains the tuples s2, t1 and does not contain s1.
RELATED WORK
In this section, we are going to survey some currently available methods of probabilistic databases for uncertain predicates.
[1] On the Fly Entity Aware Query Processing in the Presence of Linkage
In this paper author, described a novel framework for entity linkage with uncertainty. Here available linkage is been used with data and its trust value instead of using the linkage information to merge structures a-priori. For probabilistic linkage a new probabilistic query answering technique has been used. This novel idea introduce below details: (i) it performs merges at run time for not only on existing linkages but also for given query; (ii) it allows results that contain structures, but generated as a result also for linkages; and (iii) support and query conditions evaluation across linked structures. This offers a functionality not currently supported by any traditional probabilistic databases.
This work proposes a new technique for query answering under data uncertainty that combines the ideas of entity linkage and probabilistic databases. First accept data with uncertainty at the attribute level. Then obtain alternative matching suggestions as returned by existing entity linkage techniques. This forms new probabilistic database that apart from uncertainty at the attribute level, contains uncertainty at the merging pairs.
Definition of probabilistic linkage database:
The linkage probabilistic database is a tuple of <E,L,p^a,p^l>, where E is the entity set, L is the assignment of linkages on E, and p^a,p^l are the attributes and probability linkages assignment functions.
In particular,
p^l|L→[0,1] and p^a|B →[0, 1]| with B= {a|∃<id, A>ϵE^aϵA}.
There are probabilities on linkages, the linkage probabilistic database models the different number of probabilistic databases having linkage relationships. This type of a database is created from the linkage probabilistic database by choosing a fraction of the linkages.
L-world probability:
Here these above probabilistic databases having assignments like possible linkage of worlds or possible l-world is called as l-world probability. The group of all these possible l-world of linkage probabilistic database D is denoted as plw (D)
A linkage probabilistic database D=< E,L,p^a,p^l>, the specification of linkage is a linkage assignment L^sp≥L such that∀xϵ L^sp:p^l (x)≠0. The Boolean expression of specification of linkage is the expression ⋀_(1,n)▒c_(k,) where
c_(k={█( x≠y if (x,y)ϵL^(x,y)∄L^sp@x=y if (x,y)ϵL^sp )┤ )
When the Boolean expression is false then the linkage specification is invalid [1].
This technique has benefits as if it first overcomes drawback that may result from the one-time a-priori merging plan, as happens with the traditional entity linkage techniques. As no merging decisions have taken place, consider only updates required are on the linkages related to the new or modified data. This method produces additional valid query answering results compared to those of entity linkage and probabilistic databases, and cannot simulated with any of other techniques. Efficient query evaluation takes place in this paper has two major benefits as below. First only the required merges take place. Second, it avoids overwhelming the user with answers containing long lists of attributes originating from different possibly linked entities that are outside the users’ interests.
It is concluded that proposed technique in this paper able to formally define the semantics of a new kind of query answering, and describe efficient techniques for query processing. The results of evaluation for this method on real world datasets show the efficiency and effectiveness compare to others.
[2] Ranking Query Answers in Probabilistic Databases: Complexity and Efficient Algorithms
In this paper author, proposed the solution for issue of ranking query answers in probabilistic databases. Author introduced a technique to make top-k exact answers for conjunctive queries. Statically obtained share plans and deterministic approximation used in this approach which refresh upper and lower bounds probable query answers. To implement this technique SPROUT query engine has been use of a MayBMS that is the part of probabilistic DBMS. This mechanism is been based on Incremental approximate probability computations for events with shared factors , Probability Computation by Event Decomposition, Decomposition of Events with Shared Factors, and Ranking Algorithm.
Ranking Problem in PP
The group of decision problems is the complexity class PP and it can be solve a nondeterministic polynomial-time Turing machine where the receipt condition is that at least half of the computation paths accept.
PP is tightly connected to #P class of functions which is count the total number of accepting nondeterministic polynomial-time Turing machines: P^PP=P^(#P).This meant to the class of decision problems computable in polynomial time using #P oracles coincides with the class of decision problems computable in polynomial time using PP oracles[2].
Membership in PP
The ranking problem is in PP by the following argument: Let F and G are the events of two answers t1 and t2 respectively. We have that P (t1) = P (F) and P (t2) = P (G). In the case of conjunctive queries, the events F and G are positive DNF formulas; the following argument holds however for arbitrary events.
The problem is to check whether P (F) ≤ P(G). Define the disjoint events
A = F ∧ ¬G and B = G ∧ ¬F. Then,
P (F) ≤ P (G) ⇔ P (A) ≤ P (B), since
P(A) = P(F) + P(¬G) –P(F ∨ ¬G)
= P(F) –P(G) + P(G ∧ ¬F) = P(F) –P(G) + P(B).
Now denote A′ = A∨¬A¬B and B′ = B∨¬A¬B. Clearly,
A′ ∨ B′ = true. Then,
P(A′) = P(A) + P(¬A¬B). P(B′) = P(B) + P(¬A¬B)
Hence, P(A) ≤ P(B) ⇔ P(A′) ≤ P(B′) ⇔ P(B′) ≥ 1/2.
In the case of uniform probability distributions, the latter inequality is the PP-complete problem MAJ SAT for arbitrary formulas B′. In case of non-uniform probability distributions, the same argument used for membership in #P of the probability computation problem can be made here [2].
Main goal of this technique is to quantify the probabilities of the events of the query answers to the extent needed to separate the top-k most probable events from the others. This method is suitable solution for ranking query problem
[3] Efficient Query Evaluation on Probabilistic Databases
In this paper author described a system, which sustain complex SQL queries. Here complex SQL queries is been supported by “uncertain” predicates. Probabilistic model is been used for query semantics and results ranked by information retrieval strategy. Author mainly focuses on query execution. Here author used both a Monte-Carlo simulation algorithm and approximation algorithm for their purpose. Probability is been assigned to each tuple in given database of a uncertain predicates. To observe how correctly probability matches with uncertain predicates. These observations of a probability then collected in answer for result ranking.
Example of probabilistic database
The probabilistic database is nothing but each tuple has the defined probability of belonging to that database. Figure 1 shows a probabilistic database D^P with tablesS^p 〖and T〗^p. The meaning of probabilistic database is the probability of distribution on all database instances, that is possible worlds, and denoted by pwd (D^p).Figure 2(a) shows 8 probable instances which having non-zero probabilities, which are computed by multiplying the tuple probabilities, as we have assumed them to be independent.
For example the probability of D3= (1-0.7)*0.4*0.5=0.06, since the instance contains the tuples s2, t1 and does not contain s1.
Supporting ranking queries on uncertain and incomplete data
This approach also focuses on uncertain records ranking problem. In this paper, author presents a new probabilistic model, based on partial orders, to encapsulate the space of possible rankings originating from score uncertainty. Paper describes and analyzes a set of efficient query evaluation algorithms. This technique can used to solve the problem of rank aggregation in partial orders under two widely adopted distance metrics. In addition, designed sampling techniques based on Markov chains to quantify approximate query answers. This paper addresses ranking query with uncertain scores. Paper address issues regarding uncertain scores and integrates probabilistic ranking quantifications in semantics queries. General score uncertainty model used to compute ranking queries of different semantics. To achieve this goal author proposed a probabilistic ranking model of partial orders. To cut down answer space, answer space-pruning algorithm used. Branch-and-bound search algorithms are used to compute exact query answers based on a search. To compute exact query answers, Markov Chain Monte-Carlo (MCMC) method used for sampling.
.Markov chain Monte-Monte-Carlo (MCMC) method
The Metropolis-Hastings is the standard MCMC sampling algorithm. For example suppose the importance in drawing samples from a target distribution π(x).
The (M–H) algorithm generates a sequence of random draws of samples that follow π(x) as follows:
1. Start from an initial sample x0.
2. Generate a candidate sample x1 from an arbitrary proposal distribution q(x1|x0).
3. Accept the new sample x1 with probability
α=〖min(〗〖(π(x1).q(x0|x1))/(π(x0).q(x1|x0))〗,1)
4. If x1 is accepted, then set x0 = x1.
5. Repeat from step (2).
Monte-Carlo integration
Let v be the volume of τ, s be the total number of samples and x1….xm be the samples that are inside τ. Then,
∫_τ^0▒〖f(x)dx≈m/s〗.v.1/m ∑_(i=1)^m▒〖f(xi)〗
The expected value of the above approximation is the true integral value with an O (1/√s) approximation error[4].
To understand the problem of optimal rank aggregation in partial orders obtained by uncertain scores under both Spearman footrule and Kendall tau distance metrics. Polynomial time algorithm used to solve the problem under Spearman footrule distance. This novel probabilistic model that extends partial orders to represent the uncertainty in the scores of database records.
[5] Ranking Queries on Uncertain Data: A Probabilistic Threshold Approach
In this paper, author analyzed problem of answering probabilistic threshold top-k queries on uncertain data that collects uncertain records. This technique using a probability of at least p to be in the top-k list in that p is a user specified probability threshold. Author presented a perfect algorithm, based on Poisson approximation method based algorithm and rapid sampling algorithm. Real and synthetic datasets used to check ability of probabilistic threshold top-k queries and the capability of this method. In this approach author addressed challenges:
Challenge 1: What does a probabilistic top-k query mean?
Challenge 2: How can a probabilistic threshold top-k query be answered efficiently?
To develop this approach, author first apply efficient algorithm to avoid unfolding of all possible words. It shows the proper top-k probability for every tuple by investing the sorted list of all tuples only once. Then sampling method used to compute an approximation with quality assurance to the answer set by making a small sample of the uncertain database. Next author observes statistical properties of the top-k probability, and derive a general scan stopping condition for query answering algorithms. Moreover, a Poisson approximation based method is been proposed to answer probabilistic threshold top-k queries in linear time.
A Poisson Approximation Based Method
If success probability is trivial and the numbers of Poisson trials are greater, Poisson binomial distribution can be estimated well by Poisson distribution.
For a set of Poisson trials X1,….,Xn such that Pr(Xi =1) = pi, let X =∑_(i=1)^n▒Xi. X follows a Poisson binomial distribution. Let μ = E[X] =∑_(i=0)^n▒Pi. The probability ofX = k can be approximated by
Pr(X = k) ≈ f(k; μ) =μ^k/k! e^(-μ), where f(k; μ) is the Poisson probability mass function. Thus, the probability of X < k can be approximated by
Pr(X<k)≈F(k;μ)=(τ([k+1],μ))/[k]!,
Where F (k; μ) is the cumulative distribution function corresponding to
f(k; μ), and Γ(x, y) =∫_y^∞▒t^(x-1) e^(-t)dt is the upper incomplete gamma function. Theoretically, Le Cam showed that the quality of the approximation has the following upper bound.
█(min@0≤1≤n)∑_(k=0)^1▒〖Pr(X=k)-∑_(k=0)^1▒〖f(k;μ)〗〗≤9max{Pi}
The above upper bound depends on only the maximum success probability in the Poisson trials. In the worst case where maxi {pi} = 1, the error bound is very loose [5].
A proper method to give solution a PT-k query is to enumerate all possible worlds and apply the query to each possible world. Then, we can compute the top-k probability of each tuple and select the tuples passing the probability threshold. The results show that they are efficient and each of them has its unique edge. It is interesting to extend our study to handle different kinds of ranking and preference queries on uncertain data.