07-02-2013, 04:27 PM
Assigning Trust to Wikipedia Content
ABSTRACT
The Wikipedia is a collaborative encyclopedia: anyone can contribute
to its articles simply by clicking on an “edit” button. The
open nature of the Wikipedia has been key to its success, but has
also created a challenge: how can readers develop an informed opinion
on its reliability? We propose a system that computes quantitative
values of trust for the text in Wikipedia articles; these trust
values provide an indication of text reliability.
The system uses as input the revision history of each article, as
well as information about the reputation of the contributing authors,
as provided by a reputation system. The trust of a word in an article
is computed on the basis of the reputation of the original author
of the word, as well as the reputation of all authors who edited
text near the word. The algorithm computes word trust values that
vary smoothly across the text; the trust values can be visualized using
varying text-background colors. The algorithm ensures that all
changes to an article’s text are reflected in the trust values, preventing
surreptitious content changes.
INTRODUCTION
Wikipedia is an online encyclopedia that grew in the span of a
few years to become one of the most widely used sources of information
on the web. Wikipedia owes its growth and breadth of
coverage to its ability to harness the contributions of millions of
individuals, ranging from casual visitors, to domain experts, to dedicated
editors. On the other hand, the open process that gives rise
to Wikipedia content makes it difficult for visitors to form an idea
of the reliability of the content. Wikipedia articles are constantly
changing, and the contributors range from domain experts, to vandals,
to dedicated editors, to superficial contributors not fully aware
of the quality standards the Wikipedia aspires to attain. Wikipedia
visitors are presented with the latest version of each article they
visit: this latest version does not offer them any simple insight into
how the article content has evolved into its most current form, nor
does it offer a measure of how much the content can be relied upon.
These considerations generated interest in algorithmic systems for
estimating the trust of Wikipedia content [21, 34].
The Trust Assignment Algorithm
The goal of our trust system is to convey information on the degree
to which the text has been revised, and to flag any recent
unchecked content modifications. We rely on a simple idea: the
trust of text should depend on the reliability of the author, and on
the reliability of the people who subsequently revised, checked, and
edited the text [21, 34].
As a measure of author and revisor quality, we take the author
reputation computed by the author reputation system of [1]. That
reputation system, like the trust system described in this paper, is
content-driven: it relies on content analysis, rather than user-to-user
feedback. Users who contribute long-lived content gain reputation,
while users who contribute content that is quickly removed lose reputation.
The resulting author reputation was shown to correlate well
with the quality of the author’s future contributions, justifying its
use in the computation of text trust.
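As an illustrative sketch only (a hypothetical formula, not the paper's exact algorithm), the core idea can be expressed as follows: a word starts out with trust derived from its original author's reputation, and each subsequent revision by a higher-reputation author pulls the word's trust upward by some fraction.

```python
def word_trust(author_rep, revisor_reps, inherit=0.5):
    """Hypothetical word-trust update, for illustration.

    author_rep:   reputation of the word's original author
    revisor_reps: reputations of authors who later revised text
                  near the word, in chronological order
    inherit:      fraction of the reputation gap absorbed per revision
                  (an assumed parameter, not from the paper)
    """
    trust = author_rep
    for rep in revisor_reps:
        # A higher-reputation revisor who leaves the word in place
        # implicitly vouches for it, raising its trust.
        if rep > trust:
            trust += inherit * (rep - trust)
    return trust
```

For example, a word by a low-reputation author (reputation 2.0) that survives a revision by an author with reputation 8.0 would move to trust 5.0 under this sketch. The real system additionally smooths trust across neighboring words and lowers trust near fresh edits, so that all content changes are reflected in the labeling.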
Trust Quality Metrics
The trust values are computed from the past history of text, and
reflect the degree to which text has been edited and revised. Ideally,
we would like to show that high trust text conveys with high
probability correct information. However, correctness is very difficult
to define and measure. As a substitute, we study the correlation
between trust and future text stability, under the hypothesis that correct
(or high-quality) content is less likely to be revised [34]. The
quality metrics will also provide quantitative performance indices
that will be useful in fine-tuning the behavior of the algorithms. We
note that the quality metrics capture only in part the intent underlying
our trust system: in particular, the goals of predicting future text
stability, and warning readers about recent modifications, do not always
coincide, as we will see in more detail later. Nevertheless, the
metrics offer valuable insight into the performance of the system.
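One simplified, hypothetical way to quantify the trust–stability correlation (not the paper's exact metric) is a deletion recall: among words that are deleted in a subsequent revision, what fraction carried trust at or below a given threshold? High recall would indicate that low trust is a useful predictor of future instability.

```python
def deletion_recall(trust_values, deleted_flags, threshold):
    """Hypothetical quality metric, for illustration.

    trust_values:  trust of each word in a revision
    deleted_flags: True where the word was removed in a later revision
    threshold:     trust level at or below which a word counts as
                   "low trust" (an assumed cutoff)

    Returns the fraction of deleted words that were low-trust,
    or None if nothing was deleted.
    """
    deleted = [t for t, d in zip(trust_values, deleted_flags) if d]
    if not deleted:
        return None
    low = sum(1 for t in deleted if t <= threshold)
    return low / len(deleted)
```

Sweeping the threshold over such a metric is one way the quantitative indices mentioned above could be used to fine-tune the algorithm's parameters.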
Related Work
The problem of the reliability of Wikipedia content has often
emerged both in the press (see, e.g., [27, 12]) and in scientific journals
[8]. The idea of assigning trust to specific sections of text of
Wikipedia articles as a guide to readers has been previously proposed
in [21, 4, 34], as well as in white papers [14] and blogs [20];
these papers also contain the idea of using text background color to
visualize trust values.
The work most closely related to ours is [34], where the trust of
a piece of text is computed from the Wikipedia roles (anonymous,
registered user, or editor) of the original author, and of the authors
who subsequently revised the article. The Wikipedia roles of authors
are thus used in lieu of author reputation; as a consequence,
the algorithm can only be applied to wikis where authors are organized
in a well-defined hierarchy. Text analysis is performed at
the granularity level of sentences; all sentences introduced in the
same revision form a fragment, and share the same trust. A change
anywhere in a sentence causes the whole sentence to be considered
new, and the position of the change in the sentence is not flagged
via the trust labeling.
IMPLEMENTATION
We have implemented a trust tool that computes text trust and
provenance for the Wikipedia. The trust tool takes as input an XML
dump containing all the text of all the revisions of the Wikipedia;
such dumps are periodically made available from the Wikimedia
Foundation. The trust tool is written in OCaml [16]; we chose this
language for its combination of speed and excellent memory management.
On an Intel Core 2 Duo 2 GHz CPU, our tool is capable
of assigning trust to versions of Wikipedia articles at over 15
versions/second, or roughly 1.5 million versions per day, an edit rate
much higher than that of the online Wikipedia [32]. We have
run the trust tool over the entire English Wikipedia, as of its February
6, 2007 dump; the results can be viewed on a live demo [29]. To
save disk space on the server, the demo contains only the last 100
versions of each article, but all versions were considered in trust
computation.