21-05-2012, 01:47 PM
eRDF: Live Discovery for the Web of Data
Problem Description
The Web of Data is growing at an amazing rate as more and more data-sources
are being made available online in RDF, and linked. At the same time, specialised
triple-stores, such as Virtuoso[9], OWLIM[1] or 4store[6], have matured into
powerful engines that can efficiently answer queries for a given schema over
static data sets of billions of RDF triples.
However, in many cases the schema is not known, nor is the precise nature
of the search query. As the name suggests, query engines are suitable for precise
querying, but necessarily fail when the task is more explorative, when the user
needs to discover information first. A second drawback of current approaches
is that static data sets are explored and queried rather than the actual data
sources themselves. It is acknowledged that currently the most convenient use
of Semantic Data is querying collections of static data, which are often
outdated, instead of live discovery. This is due to the difficulty of joining results
from different engines in federated querying. Finally, given the open character
of the web, which is intrinsically incoherent, incomplete and incorrect, an exploration
engine for the Web of Data must be robust. We claim that the eRDF
infrastructure makes significant steps in these four areas: exploration, live-access,
decentralisation and robustness. We now discuss these areas in more detail before
discussing eRDF and its use in the Billion Triple Challenge (BTC).
Discovery queries. The paradigm shift on the WWW from browsing to search
was one of the critical elements for its success, as it allowed users to find relevant
information without knowing its exact location in the network. In search, users
define their needs by providing keywords, often with the goal of finding relevant
information without having a specific information source in mind. While semantic
search engines, such as sig.ma, are beginning to provide search over the
Web of Data, there is still the need for new techniques to discover what data is
available, particularly, for software agents. Indeed, generating queries for a given
data source usually requires extensive knowledge of that data-source in order to
produce reasonable results. By integrating an approximate component into the
query process, eRDF can aid discovery.
Anytime answers over live distributed data-sources. Many of the applications
based on the Web of Data do not use data sources directly, as federated querying
over live SPARQL endpoints is known to be extremely expensive, because known
optimizations (for example to deal with joins) do not work in the distributed
case. Instead, snapshots are taken at intervals, dumped into gigantic repositories
and made available in database style for querying. The effect is that the available
information is constantly outdated: not just the index (as in traditional search
engines), but even the data itself.
eRDF allows distributed queries over live data-sources as only very simple
unary queries are needed. Additionally, eRDF can issue all of its queries in
a fully parallel fashion. There is no theoretical restriction on the number of
data-sources, and their data-size only marginally increases individual response
time. Of course, increasing data-size in combination with a constant population
size will increase convergence time. However, given the anytime character of
evolutionary methods, good answers are still returned comparatively quickly.
This makes eRDF an interesting alternative for exploration and discovery for
the Web of Data.
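The idea of issuing many simple queries in parallel can be sketched in Python. This is only an illustration under stated assumptions, not the eRDF implementation: a "unary query" is assumed here to mean asking one source whether one ground triple is present, the sources are in-memory sets rather than live endpoints, and the names SOURCES, ask and check_triples are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for live data-sources: each is just a set of triples.
SOURCES = {
    "source_a": {("ex:alice", "foaf:knows", "ex:bob")},
    "source_b": {("ex:bob", "foaf:name", '"Bob"')},
}


def ask(source, triple):
    """Simulate one simple membership query against one source."""
    return triple in source


def check_triples(triples, sources=SOURCES, max_workers=8):
    """Check every (source, triple) pair in parallel.

    A triple counts as valid if at least one source contains it.
    """
    jobs = [(sources[name], t) for t in triples for name in sources]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda job: ask(*job), jobs))
    valid = {t: False for t in triples}
    for (_, t), found in zip(jobs, results):
        valid[t] = valid[t] or found
    return valid
```

Because each check is independent of the others, no join across sources is needed; adding a source only adds more independent checks.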
Robustness. Although SPARQL has been developed as an RDF query language
for Web data, there is a discrepancy between the database-like query
formalism and the adaptive, open-world, incoherent and inconsistent character
of the Web of Data. Schemas are often unknown, and posing promising queries
requires explicit knowledge of the structure of the information. The effect of this
is that many good answers are missed, as queries are simply not adequate for
certain information needs. eRDF does not extend SPARQL but relaxes some semantic
constraints if required by the application. This makes it more robust for
querying unknown information, which is essential for exploration and discovery.
The eRDF infrastructure at a glance
In [8,7] we introduced RDF query answering by evolutionary algorithms (eRDF).
The basic idea is simple: instead of indexing the triples and joining results of
ground queries, we guess a population of candidate solutions. Those are then
improved by the classical mutation operation guided by a fitness function which,
roughly speaking, calculates the distance of a candidate from being a solution. This
distance can simply be the number of invalid triples in our solution, or a more complex
combination of such simple metrics with user-defined similarity measures.
Based on such well-defined, user-specified notions of similarity, eRDF returns
"perfect" answers if possible, and approximate answers if necessary, which
is exactly what is required for discovery queries.
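The guess-and-mutate loop described above can be sketched in Python on a toy example. This is a minimal sketch under assumptions, not the authors' implementation: the data set, the query, the selection scheme and all names (DATA, QUERY, fitness, mutate, evolve) are hypothetical, and the fitness used is the simplest one mentioned in the text, the number of invalid triples.

```python
import random

# Toy data set standing in for the Web of Data (hypothetical triples).
DATA = {
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("carol", "livesIn", "amsterdam"),
}
TERMS = sorted({term for s, _, o in DATA for term in (s, o)})

# Query: ?x knows ?y . ?y knows ?z
QUERY = [("?x", "knows", "?y"), ("?y", "knows", "?z")]
VARIABLES = sorted({t for triple in QUERY for t in triple if t.startswith("?")})


def instantiate(candidate, query=QUERY):
    """Replace each variable in the query by the candidate's binding."""
    return [tuple(candidate.get(t, t) for t in triple) for triple in query]


def fitness(candidate, data=DATA):
    """Number of invalid triples; 0 means a perfect answer."""
    return sum(1 for triple in instantiate(candidate) if triple not in data)


def mutate(candidate, rng):
    """Rebind one randomly chosen variable to a random term."""
    child = dict(candidate)
    child[rng.choice(VARIABLES)] = rng.choice(TERMS)
    return child


def evolve(population_size=20, generations=200, seed=0):
    """Return the best candidate found so far and its fitness (anytime)."""
    rng = random.Random(seed)
    population = [
        {v: rng.choice(TERMS) for v in VARIABLES} for _ in range(population_size)
    ]
    best = min(population, key=fitness)
    for _ in range(generations):
        if fitness(best) == 0:
            break
        offspring = [mutate(c, rng) for c in population]
        # Offspring are listed first so that they win ties in the stable sort.
        population = sorted(offspring + population, key=fitness)[:population_size]
        best = population[0]
    return best, fitness(best)
```

Calling evolve() returns the best binding found within the generation budget together with its count of invalid triples, so a caller can stop early and still receive an approximate answer. A similarity measure could be substituted for the exact membership test inside fitness to obtain the approximate matching the text describes.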