Seminar Topics & Project Ideas On Computer Science Electronics Electrical Mechanical Engineering Civil MBA Medicine Nursing Science Physics Mathematics Chemistry ppt pdf doc presentation downloads and Abstract

Full Version: Amit Kumar resume
You're currently viewing a stripped down version of our content. View the full version with proper formatting.

project maker

EXPERIMENTAL RESULTS ON ALGORITHMS FOR
MULTIPLE KEYWORD MATCHING


[attachment=66636]

ABSTRACT

Multiple keyword matching is a basic problem in computer science and is used to locate all the appearances of a finite set
of keywords (the so called ``patterns'') inside an input string (the so called ``text''). This paper presents experimental
results for the well known Commentz-Walter, Wu-Manber, Set Backward Oracle Matching and Salmela-Tarhio-Kytöjoki
multiple keyword matching algorithms. The algorithms are compared in terms of running time for random texts using a
binary alphabet and texts using the English alphabet while the experimental results are validated against the theoretical
complexity


INTRODUCTION

Multiple keyword matching is a basic problem in computer science. It is a variant of string matching for
multiple keywords that is commonly used to locate all the appearances of a finite set of keywords (the so
called “patterns”) inside an input string (the so called “text”).
Given an input string T = t1t2...tn over an alphabet Σ and a finite set of r keywords P = {p1, p2,...,pr} where
all keywords have a length of m characters and the total size of all keywords is denoted as |P|, the multiple
keyword matching problem can be defined as the way to report all locations i in T where there is an exact
occurrence of any keyword (Navarro, 2002).
The multiple keyword matching problem has many applications across several areas of scientific
computing including information retrieval, data filtering, DNA sequence matching, virus detection and
packet classification in intrusion detection systems. The multiple keyword matching algorithms are also used
by many well known tools: Snort (Snort, 2010) uses a variant of the Aho-Corasick algorithm to perform
intrusion detection, the Wu-Manber algorithm is utilized by Agrep (Wu, 1992) to perform fuzzy string
searching while a Commentz-Walter-like algorithm is used in GNU Grep (GNU Grep, 2010) when searching
for the occurrences of multiple keywords in a text.
The simplest solution to the multiple keyword matching problem is to apply a string matching algorithm
iteratively on the input string for each keyword. While frequently used in the past, this technique is not
efficient when a large keyword set is involved. The aim of all modern multiple keyword matching algorithms
is to scan the input string T in a single pass to locate the occurrences of all keywords. These algorithms are
based of single-keyword matching algorithms, with some of their functions generalized to process multiple


EXPERIMENTAL METHODOLOGY

The experiments were executed locally on an Intel Core 2 Duo CPU with a 3.00GHz clock speed and 2
Gb of memory, 64 KB L1 cache and 6 MB L2 cache. The operating system used was an Ubuntu Linux and
during the experiments only the typical background processes ran. To decrease random variation, the time
results were averages of 100 runs. The algorithms presented in the previous section were implemented using
the ANSI C programming language and were compiled using the GCC 4.4.3 compiler with the “-O2” and “-
funroll-loops” optimization flags



CONCLUSION

In this paper experimental results of the well known Commentz-Walter, Wu-Manber, SBOM and the
Salmela-Tarhio-Kytöjoki algorithms were presented. The algorithms were compared in terms of running time
for texts of English and binary alphabet, for a keyword set of size between 100 and 100.000 keywords and
for keyword lengths of size m = 8 and m = 32. It was shown that for different data sets, different algorithms
are preferable: Salmela-Tarhio-Kytöjoki had the best performance on an English text and for either keyword
lengths, SBOM was the fastest algorithm for texts with a binary alphabet when keywords of length m = 8
were used while the Commentz-Walter algorithm was faster for the same data when keywords of length m =
32 were used.
The work presented in this paper could be extended with experiments that use keywords of varying length
and larger keyword sets while focusing on the preprocessing time and memory requirements of the
algorithms. It would also be interesting to examine the application of the multiple pattern matching
algorithms in areas of scientific computing including indexing and computational biology. Since the data set
that needs to be processed by multiple keyword matching is usually inherently parallel in nature, future
research could focus on the speed up of the existing algorithms when parallel processed on traditional parallel
architectures like clusters environments (Michailidis, 2002) and multicore systems as well as on modern
parallel systems like GPU architectures (Kouzinopoulos, 2009)