06-09-2016, 11:12 AM
ABSTRACT
Keyword search is the process of searching for relevant documents on the Web using one or
more user-specified words called keywords. Keywords and their related data elements are linked
through keyword-element relationships, and keyword search is a method of querying linked data
sources on the Web. These queries look for related data across all relevant sources and return a
large number of results, many of which are not useful. The number of irrelevant results can be
reduced by combining keywords in the query, but this makes the query harder to process
efficiently and increases its response time, which is undesirable on today's Web, where high
responsiveness is expected.
To reduce this high cost of query processing, a novel method is proposed that routes the
keywords only to the relevant sources rather than to all sources. Keyword query routing is a novel
approach that improves the performance of keyword search and helps minimize both time and space costs.
Introduction
The Web today is not only a collection of textual documents but also a collection of interlinked
data sources (e.g., Linked Data). Linking Open Data is one such large project, through which large
amounts of legacy data are transformed into the Resource Description Framework (RDF), linked
to other sources, and published as Linked Data [1]. Linked Data comprises many sources that
contain billions of Resource Description Framework triples, connected by millions of links
such as 'sameAs' links, and new data is being published ever more frequently.
It would be difficult for a typical Web user to explore this linked data using any
structured query language. This is where keyword search applies. Unlike structured
query languages, it does not require the user to have any knowledge of the schema of the
underlying data to be queried. In the current state of the art, when a keyword query is passed to a
database, the system searches for the most relevant structured results [1], [2], [3] or selects a
single most relevant database. The issue with this approach is that it is not directly
applicable to the Web of Linked Data, where an answer may span many linked data sources. The main
problem is therefore not finding the single most relevant source, but computing the most relevant
combination of sources [6], [7]. We propose to generate a routing plan that can compute results from
multiple data sources.
1.2 Linked Data
Linked Data describes a method of publishing structured data so that it can be interlinked
and made more useful through semantic queries.
Related documents and related data are linked on the Web. Linked Data defines a set of best
practices for connecting structured data and publishing it on the Web [15]. Linked Data is built on
standard Web technologies such as HTTP, URIs and RDF [14]. Rather than using these technologies
just to serve web pages for human readers, Linked Data employs them to share information in such a
way that computers can read it directly. In this way, data from different sources is connected and can be
queried. Linked Data describes how the Web can be used to connect related data that was not
previously connected, and to lower the barriers to linking data that is currently linked by
other methods [15]. Fig 1.1 shows an overview of how data from different datasets is connected
on the Web.
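As a small illustration of this idea, the sketch below models two hypothetical data sources whose resources describe the same real-world entity under different URIs, connected by a 'sameAs' link. All URIs, predicates and values are invented for illustration; real Linked Data would use RDF tooling rather than plain tuples.

```python
# Two hypothetical sources publishing triples about the same entity
# under different URIs (all names here are invented examples).
SOURCE_A = [
    ("http://a.example/Berlin", "http://a.example/population", "3700000"),
]
SOURCE_B = [
    ("http://b.example/Berlin", "http://b.example/country", "Germany"),
]
# A 'sameAs' link states that the two URIs denote the same resource.
SAME_AS = [("http://a.example/Berlin", "http://b.example/Berlin")]

def facts_about(uri):
    """Collect facts about a resource across sources, following sameAs links."""
    aliases = {uri}
    for left, right in SAME_AS:
        if left == uri or right == uri:
            aliases.update((left, right))
    return [(s, p, o) for s, p, o in SOURCE_A + SOURCE_B if s in aliases]

# Facts from both sources are combined for one real-world entity.
print(facts_about("http://a.example/Berlin"))
```

Because the sameAs link is followed, the query over one URI returns facts published by both datasets, which is exactly the kind of cross-source connection described above.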
Resource Description Framework (RDF)
The Resource Description Framework (RDF) is a set of specifications designed by the
World Wide Web Consortium (W3C) as a metadata model [16]. The Resource Description
Framework is generally used for the conceptual description and modeling of information
in Web resources. It is similar to traditional conceptual modeling approaches such as class
diagrams or entity–relationship models, but is mainly used to describe relations between Web resources.
In the Resource Description Framework, relations are expressed as triples of the form subject–
predicate–object. Here, the subject denotes a resource, the object denotes information about the
subject, and the predicate describes the relation between the subject and the object. Put simply,
a predicate is a labeled edge between two nodes, the subject and the object. An RDF
triple resembles the classical entity–attribute–value model used in object-oriented design,
where the subject plays the role of the entity, the predicate that of the attribute, and the object
that of the value. A collection of Resource Description Framework triples can be represented as a
labeled directed multi-graph [16]. Hence a data model based on the Resource Description Framework
is more suitable for certain kinds of knowledge representation than the traditional
entity–relationship model or other ontological models.
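The triple-as-graph view above can be sketched in a few lines: each triple contributes one labeled edge, and following edges with a given label answers simple queries. The triples below are invented examples, not taken from any real dataset.

```python
# Minimal sketch: a set of RDF-style triples viewed as a labeled directed
# multi-graph, where each predicate labels an edge from subject to object.
from collections import defaultdict

triples = [
    ("Alice", "knows", "Bob"),
    ("Alice", "worksAt", "AcmeCorp"),
    ("Bob", "knows", "Alice"),  # edges may run in both directions
]

# Adjacency list: subject -> [(predicate, object), ...]
graph = defaultdict(list)
for s, p, o in triples:
    graph[s].append((p, o))

def objects_of(subject, predicate):
    """All objects reachable from `subject` over edges labeled `predicate`."""
    return [o for p, o in graph[subject] if p == predicate]

print(objects_of("Alice", "knows"))  # ['Bob']
```

A real application would use an RDF library and full URIs, but the graph structure is the same.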
1.4 Data Mining
Though the term data mining was coined in the 1990s, the concept dates back
many years. The growth of data mining began with the storage of data on computers, and it
evolved alongside advances in computer technology: data storage, the processing
power of computers, new software and new algorithms. However, the major advances in
data mining came with the introduction of relational databases and structured query languages.
The next improvement came with the advent of data warehousing and online analytical processing.
Data mining is the process of extracting knowledge from large data sets by analyzing
the data and discovering consistent patterns and semantic relations between variables [13]. This
knowledge is then validated by applying the detected patterns to new data. The
study of data mining draws on artificial intelligence, statistics, machine learning and databases.
Data mining has three main phases: exploration, pattern identification and deployment.
The exploration phase deals with preparing the data: cleaning it, transforming it
and selecting subsets of records through preliminary operations based
on the requirements. The second phase, also known as model building, considers various models
and chooses the best one based on predictive performance. A variety of techniques have been
developed to attain this goal through competitive evaluation of models, including
bagging, boosting, stacking and meta-learning [13]. The last phase, deployment,
takes the model chosen as best in the previous phase and applies it to new data to
produce predictions and estimates.
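The three phases can be sketched end to end on toy data. The data and the two candidate "models" below are deliberately trivial inventions (a constant predictor and a crude proportional fit); they only illustrate the workflow of cleaning, competitive model selection and deployment, not real learning algorithms.

```python
# Toy walk-through of the three data mining phases on invented data.
import statistics

# Phase 1: exploration -- clean the data (drop missing values) and
# split off a validation subset.
raw = [(1, 2.1), (2, 3.9), (3, None), (4, 8.2), (5, 9.8), (6, 12.1)]
clean = [(x, y) for x, y in raw if y is not None]
train, validate = clean[:3], clean[3:]

# Phase 2: model building -- fit two trivial candidate models and
# choose the one with the lower validation error.
mean_y = statistics.mean(y for _, y in train)
slope = statistics.mean(y / x for x, y in train)  # crude proportional fit

models = {
    "constant": lambda x: mean_y,
    "proportional": lambda x: slope * x,
}

def error(model):
    """Total absolute error on the held-out validation records."""
    return sum(abs(model(x) - y) for x, y in validate)

best_name = min(models, key=lambda name: error(models[name]))

# Phase 3: deployment -- apply the winning model to new data.
prediction = models[best_name](7)
print(best_name, round(prediction, 1))
```

Competitive techniques such as bagging or boosting replace the naive `min` over two models with ensembles of many models, but the exploration/selection/deployment shape is the same.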
1.5 Existing Systems
Current work on keyword querying goes in two different directions. The first line of work
focuses on search approaches that compute the most relevant structured results, while the second
focuses on source selection, i.e., computing the most relevant source [1].
A number of frameworks have been designed to produce keyword query
results. Given a keyword query, these frameworks retrieve the most relevant structured
results, or simply select the single most relevant database. However, these approaches are
single-source solutions. They are not directly applicable to the Web of Linked Data, where results are
not bounded by a single source but may span several linked data sources. As opposed to
the source selection problem, which focuses on computing the most relevant individual sources, the
problem here is to compute the most relevant combination of sources.
When a keyword query is issued in the existing system, it searches for all relevant results,
generates routing plans for them and displays them all. The number of potential
results may grow exponentially with the number of sources and the links connecting them.
Many of the results for such queries may be redundant, particularly when the query is simple and
many links connect to the queried keyword. In the routing problem, we need to compute
results that capture specific elements at the data level, whereas routing keywords returns entire
sources, which may or may not be relevant.
Disadvantages:
The following are the major drawbacks of the existing approach, which can be minimized with
minor changes to it.
1. As the number of sources and the links connecting them increases, the number of potential results
may grow exponentially, and most of these results may be useless when they are not relevant
to the user's query.
2. The actual routing problem is computing results that capture elements at the data level.
3. Routing keywords usually returns an entire source, which may or may not be relevant.
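Drawback 1 can be made concrete with a quick count: if a result may span any non-empty subset of the available sources, the number of candidate source combinations is 2^n - 1, which explodes even for modest n.

```python
# Illustration of drawback 1: the number of possible source combinations
# a result could span grows exponentially with the number of sources.
from itertools import combinations

def combination_count(n_sources):
    """Count the non-empty subsets of n sources by explicit enumeration."""
    total = 0
    for k in range(1, n_sources + 1):
        total += sum(1 for _ in combinations(range(n_sources), k))
    return total  # equals 2**n_sources - 1

for n in (3, 10, 20):
    print(n, combination_count(n))
# 3 sources -> 7 combinations, 10 -> 1023, 20 -> 1048575
```

This is why enumerating and ranking results over all combinations is infeasible, and why routing to a small set of relevant combinations matters.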
1.6 Proposed System
We propose a new method to solve the problem of keyword search over a large number of
linked, structured data sources using keyword query routing. The high cost of searching for
keywords that span different sources can be reduced by routing the keywords only to the
relevant sources. Unlike the existing system, which only uses the relationships between
keywords, we employ the keyword-element relationship graph [9] and compute routing plans over
the obtained results. We then apply a maximum-likelihood algorithm to these results to
reduce their number by filtering out the unwanted results obtained from the keyword-element
relationship graph.
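The sketch below conveys the general shape of this idea: route each keyword only to sources associated with it in a summary index, enumerate candidate source combinations as routing plans, and keep the highest-scoring plan using a likelihood-style product of relevance scores. The index, the source names and the scoring rule are all invented for illustration; they are not the paper's actual data structures or algorithm.

```python
# Hedged sketch of keyword query routing: score source combinations and
# keep only the most likely routing plan (all numbers are made up).
from itertools import product

# Hypothetical summary: keyword -> {source: relevance score}.
index = {
    "film":     {"dbpedia": 0.9, "linkedmdb": 0.8},
    "director": {"linkedmdb": 0.7, "freebase": 0.4},
}

def routing_plans(keywords):
    """Enumerate source combinations covering all keywords, best first."""
    per_keyword = [list(index[k].items()) for k in keywords]
    plans = []
    for choice in product(*per_keyword):
        sources = frozenset(src for src, _ in choice)
        score = 1.0
        for _, s in choice:
            score *= s  # likelihood-style product of relevance scores
        plans.append((score, sources))
    plans.sort(key=lambda plan: plan[0], reverse=True)
    return plans

best_score, best_sources = routing_plans(["film", "director"])[0]
print(sorted(best_sources), round(best_score, 2))
```

Only the winning combination of sources would then be queried for actual results, which is how routing avoids touching every source.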
Advantages:
The following are the advantages of the proposed system.
1. It reduces the cost of the search.
2. It reduces the time taken by the search.
3. It produces results from multiple sources.