14-01-2013, 03:55 PM
Truth Discovery with Multiple Conflicting Information Providers on the Web
ABSTRACT
The world-wide web has become the most important information source for most of us. Unfortunately, there is no guarantee for the correctness of information on the web. Moreover, different web sites often provide conflicting information on a subject, such as different specifications for the same product. In this paper we propose a new problem called Veracity, i.e., conformity to truth, which studies how to find true facts from a large amount of conflicting information on many subjects that is provided by various web sites. We design a general framework for the Veracity problem, and invent an algorithm called TruthFinder, which utilizes the relationships between web sites and their information, i.e., a web site is trustworthy if it provides many pieces of true information, and a piece of information is likely to be true if it is provided by many trustworthy web sites. Our experiments show that TruthFinder successfully finds true facts among conflicting information, and identifies trustworthy web sites better than the popular search engines.
INTRODUCTION
The world-wide web has become a necessary part of our lives, and might have become the most important information source for most people. Every day, people retrieve all kinds of information from the web. For example, when shopping online, people find product specifications on web sites like Amazon.com or ShopZilla.com. When looking for interesting DVDs, they get information and read movie reviews on web sites such as NetFlix.com or IMDB.com.
“Is the world-wide web always trustable?” Unfortunately,
the answer is “no”. There is no guarantee for the correctness
of information on the web. Even worse, different web sites
often provide conflicting information, as shown below.
Example 1: Authors of books. We tried to find out who wrote the book “Rapid Contextual Design” (ISBN: 0123540518). We found many different sets of authors from different online bookstores, and we show several of them in Table 1. From the image of the book cover we found that A1 Books provides the most accurate information. In comparison, the information from Powell’s Books is incomplete, and that from Lakeside Books is incorrect.
COMPUTATIONAL MODEL
Based on the above heuristics, we know that if a fact is provided by many trustworthy web sites, it is likely to be true, and if a fact conflicts with the facts provided by many trustworthy web sites, it is unlikely to be true. On the other hand, a web site is trustworthy if it provides facts with high confidence. Web site trustworthiness and fact confidence thus determine each other, and we can use an iterative method to compute both. Because true facts are more consistent than false facts (Heuristic 3), it is likely that we can distinguish true facts from false ones at the end. In this section we discuss the computational model.
Influences between Facts
The above discussion shows how to compute the confidence of a fact that is the only fact about an object. However, there are usually many different facts about an object (such as f1 and f2 in Figure 2), and these facts influence each other. Suppose in Figure 2 the implication from f2 to f1 is very high (e.g., they are very similar). If f2 is provided by many trustworthy web sites, then f1 is also somewhat supported by these web sites, and f1 should have reasonably high confidence. Therefore, we should increase the confidence score of f1 according to the confidence score of f2, which is the sum of the trustworthiness scores of the web sites providing f2. We define the adjusted confidence score of a fact f as

σ*(f) = σ(f) + ρ · Σ_{f′} imp(f′ → f) · σ(f′), (6)

where the sum ranges over the other facts f′ about the same object, and ρ is a parameter between 0 and 1 which controls the influence of related facts. We can see that σ*(f) is the sum of the confidence score of f and a portion of the confidence score of each related fact f′, multiplied by the implication from f′ to f. Please notice that imp(f′ → f) < 0 when f conflicts with f′.
We can compute the confidence of f based on σ*(f) in the same way as computing it based on σ(f) (defined in Equation (4)). We use s*(f) to represent this confidence:

s*(f) = 1 − e^(−σ*(f)). (7)
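As a concrete illustration of this computation, the adjusted confidence and Equation (7) might be implemented as follows. This is a minimal sketch, not the paper's code; the function names, the parameter name rho, and the input format are assumptions.

```python
import math

def adjusted_confidence(sigma_f, related, rho=0.5):
    """Adjusted confidence of fact f: sigma(f) plus a portion (rho) of each
    related fact's confidence weighted by its implication toward f.

    `related` is a list of (implication, sigma) pairs for the other facts
    about the same object; implication is negative for conflicting facts.
    (rho = 0.5 is an illustrative value, not from the paper.)
    """
    return sigma_f + rho * sum(imp * s for imp, s in related)

def confidence_from_sigma_star(sigma_star):
    """Equation (7): s*(f) = 1 - exp(-sigma*(f))."""
    return 1.0 - math.exp(-sigma_star)
```

A supporting related fact (positive implication) raises the adjusted confidence, while a conflicting one (negative implication) lowers it.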
Handling Additional Subtlety
One problem with the above model is that we have been assuming different web sites are independent of each other. This assumption is often incorrect, because errors can be propagated between web sites. According to the definitions above, if a fact f is provided by five web sites, each with trustworthiness of 0.6 (which is quite low), f will have confidence of 0.99! But actually some of the web sites may copy contents from others. In order to compensate for the problem of overly high confidence, we add a dampening factor γ into Equation (7), and redefine fact confidence as

s*(f) = 1 − e^(−γ·σ*(f)),

where 0 < γ < 1.
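The 0.99 figure, and the effect of the dampening factor, can be checked with a quick computation. This assumes the combination rule τ(w) = −ln(1 − t(w)) for the trustworthiness score of each site (so that σ(f) sums over independent providers), and γ = 0.3 is an illustrative value, not one prescribed by the paper.

```python
import math

# Five independent sites, each with trustworthiness 0.6, providing fact f.
sigma = 5 * -math.log(1.0 - 0.6)         # sigma(f): sum of tau(w) = -ln(1 - t(w))
s_undamped = 1.0 - math.exp(-sigma)      # Equation (7): about 0.99
s_damped = 1.0 - math.exp(-0.3 * sigma)  # with dampening factor gamma = 0.3
print(round(s_undamped, 2), round(s_damped, 2))  # prints 0.99 0.75
```

With dampening, five mediocre (possibly copying) sites no longer push the confidence to near-certainty.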
The second problem with our model is that the confidence of a fact f can easily be negative if f conflicts with facts provided by trustworthy web sites, which makes σ*(f) < 0 and s*(f) < 0. This is unreasonable, because even with negative evidence there is still a chance that f is correct, so its confidence should remain above zero. Therefore, we adopt the widely used Logistic function [3], which is a variant of Equation (7), as the final definition for fact confidence:

s(f) = 1 / (1 + e^(−γ·σ*(f))). (8)
When γ·σ*(f) is significantly less than zero, s(f) is close to zero but remains positive, which is consistent with the real situation. Equation (8) is also very similar to the Sigmoid function [6], which has been successfully used in various models in many fields.
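A minimal sketch of this logistic form (the function name and the illustrative value gamma = 0.3 are assumptions, not from the paper):

```python
import math

def logistic_confidence(sigma_star, gamma=0.3):
    """Equation (8): s(f) = 1 / (1 + exp(-gamma * sigma*(f))).

    Unlike Equation (7), the result stays strictly between 0 and 1
    even when sigma*(f) is very negative.
    """
    return 1.0 / (1.0 + math.exp(-gamma * sigma_star))
```

Strongly contradicted facts get confidence near zero but never below it, matching the intuition that negative evidence reduces, but cannot eliminate, the chance that a fact is true.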
Iterative Computation
As described above, we can infer the web site trustworthiness if we know the fact confidence, and vice versa. As in Authority-hub analysis [2] and PageRank [4], TruthFinder adopts an iterative method to compute the trustworthiness of web sites and the confidence of facts. Initially, it has very little information about the web sites and the facts. At each iteration TruthFinder tries to improve its knowledge about their trustworthiness and confidence, and it stops when the computation reaches a stable state.
We choose the initial state in which all web sites have a uniform trustworthiness t0 (t0 is set to the estimated average trustworthiness, such as 0.9). In each iteration, TruthFinder first uses the web site trustworthiness to compute the fact confidence, and then recomputes the web site trustworthiness from the fact confidence. It stops iterating when it reaches a stable state.
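The loop described above can be sketched as follows. This is an illustrative simplification, not the paper's implementation: it ignores implications between facts and the dampening factor, takes a site's trustworthiness to be the average confidence of the facts it provides, and assumes the combination rule τ(w) = −ln(1 − t(w)) so that confidence from independent sites accumulates additively.

```python
import math

def truthfinder(facts_by_site, t0=0.9, max_iters=50, tol=1e-6):
    """Simplified sketch of the TruthFinder iteration.

    facts_by_site: dict mapping each web site to the set of facts it provides.
    Returns (trustworthiness per site, confidence per fact).
    """
    trust = {site: t0 for site in facts_by_site}  # uniform initial state
    all_facts = {f for facts in facts_by_site.values() for f in facts}
    conf = {f: 0.0 for f in all_facts}
    for _ in range(max_iters):
        # Step 1: fact confidence from site trustworthiness.
        # sigma(f) = sum of tau(w) = -ln(1 - t(w)) over sites providing f,
        # then s(f) = 1 - exp(-sigma(f)) as in Equation (7), without implications.
        new_conf = {}
        for f in all_facts:
            sigma = sum(-math.log(max(1.0 - trust[w], 1e-12))  # guard t -> 1
                        for w, facts in facts_by_site.items() if f in facts)
            new_conf[f] = 1.0 - math.exp(-sigma)
        # Step 2: site trustworthiness = average confidence of its facts.
        new_trust = {w: sum(new_conf[f] for f in facts) / len(facts)
                     for w, facts in facts_by_site.items()}
        # Stop when the computation reaches a stable state.
        change = max(abs(new_trust[w] - trust[w]) for w in trust)
        trust, conf = new_trust, new_conf
        if change < tol:
            break
    return trust, conf
```

For example, if two sites assert f1 and a third site asserts f2, f1 ends up with higher confidence than f2, and the sites providing f1 with higher trustworthiness.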
CONCLUSIONS
In this paper we introduce and formulate the Veracity problem, which aims at resolving conflicting facts from multiple web sites and finding the true facts among them. We propose TruthFinder, an approach that utilizes the interdependency between web site trustworthiness and fact confidence to find trustworthy web sites and true facts. Experiments show that TruthFinder achieves high accuracy at finding true facts, and at the same time identifies web sites that provide more accurate information.