23-05-2012, 05:16 PM
Burst Tries
Tries data structure.pdf (Size: 348.24 KB / Downloads: 18)
Introduction
Many computing applications involve management of large sets of distinct strings. For example,
vast quantities of information are held as text documents, ranging from news archives, law
collections, and business reports to repositories of documents gathered from the web. These
collections contain many millions of distinct words, the number growing more or less linearly
with collection size [52]. Other applications also involve large numbers of distinct strings, such
as bibliographic databases and indexes for genomic data [51].
Not only must these sets of strings be managed, but, in applications such as indexing, for
eciency reasons the complete set should be held in memory [55]. Construction of an inverted
index involves the determination of the set of distinct terms|that is, the vocabulary of the
collection|and, for each term, maintenance of information such as its total occurrence count
and the set of locations at which it occurs.
Existing data structures Binary search trees
In a binary search tree (BST), each tree node stores a string and two pointers to left and right
child nodes. A search to nd a query string involves, at each node, comparing the query string
to the node's string to determine whether the string has been found or, otherwise, whether to
branch left or right. At the root, the string comparison typically terminates after inspection of
a single character; as the search progresses, the number of characters inspected at each string
comparison gradually increases.
Although the allocation of strings to nodes is determined by the insertion order, for a skew
distribution it is reasonable to expect that common words occur close to the beginning of the
text collection and are therefore close to the root of the BST. Assuming the distribution is
stable, accesses to a common term should be fast, since the rst levels of the tree are usually
kept in cache and only a few string comparisons are required.
Hash tables
The most ecient form of hash table for vocabulary accumulation is based on chaining [56]. In
such hash tables, a large array is used to index a set of linked lists of nodes, each of which is
said to be at a slot. On search, an array index is computed by hashing the query string. The
string is then sought for in the linked list for that index.
Several factors are important for good performance. First, the hash function needs to achieve
a reasonably uniform distribution of strings to slots. In practice, this means that the order of
slots does not correspond to string sort order, so once the vocabulary has been accumulated the
distinct strings must be sorted|a small overhead only.
Tries and ternary search trees
A trie is an alternative to a BST for storing strings in sort order [4, 16, 22, 29, 35]. A node in
a standard trie is an array of pointers, one for each letter in the alphabet, with an additional
pointer for the empty string. A leaf is a record concerning a particular string. A string is found
in a trie by using the letters of the string to determine which pointers to follow. For example,
if the alphabet is the letters from `a' to `z', the rst pointer in the root corresponds to the
letter `a'; the node T indicated by this pointer is for all strings beginning \a-". In node T, the
pointer corresponding to the letter `c' is followed for all strings beginning \ac-"; the pointer
corresponding to the empty string is for the record concerning the single-letter string \a".
Bursttries
In earlier work comparing the tree data structures discussed above [56], we observed that,
compared to hashing, these structures had three sources of ineciency. First, the average search
lengths were surprisingly high, typically exceeding ten pointer traversals and string comparisons
even on moderate-sized data sets with highly skew distributions. In contrast, a search under
hashing rarely requires more than a string traversal to compute a hash value and a single
successful comparison. Second, for structures based on BSTs, the string comparisons involved
redundant character inspections, and were thus unnecessarily expensive. For example, given
the query string \middle" and given that, during search, \michael" and \mideld" have been
encountered, it is clear that all subsequent strings inspected must begin with the prex \mi".
Third, in tries the set of strings in a subtrie tends to have a highly skew distribution: typically
the vast majority of accesses to a subtrie are to nd one particular string. Thus use of a highly
time-ecient, space-intensive structure for the remaining strings is not a good use of resources.