09-02-2013, 02:25 PM
Data Structures - HashSets
Data Structures.pdf (Size: 455.43 KB / Downloads: 27)
Hashing concepts
The naïve idea of hashing is to store an element into an array at a position index computed as follows:
obtain element_hash_code of the element by processing the element's data and generating an integer
value (the idea of "hashing" the element roughly means "grinding it up")
use a simple mod operation to map into the array's range:
index = hash(element) = abs(element_hash_code % array_capacity)
The element_hash_code can be negative, and so the absolute value is needed to ensure that we obtain
a non-negative value.
The obvious issue is that there may be two or more elements which map to the same index; if so, we say there
is a collision.
A hash function which produces no collisions for a set of elements is referred to as perfect hash function for that
set. If we had a perfect hash function we could simply store the element at that index position in the array,
ignoring empty array positions. The cost of searching for such an element would be no more than the cost of
computing the hash code for that element.
Java's hashCode function
In order to do hashing, we must have access to, or be able to create a suitable hash code for any element. In a
generic situation, this means that the element type must be "hashable". Fortunately, Java has a built-in, Objectlevel
member function called hashCode which generates such a hash code for any element. You can try this
simple example to view the hashCode values of various element:
Hashing is really not constant time, but we say it is
If we have no collisions, the search for a given string, would simply involve computing the hash index and look in
table at that index position. Based on this idea, it is usually stated that the cost of a perfect hash-based search
is constant time, O(1), or constant time, regardless of the number of elements.
The constant time claim is not technically correct. The reason is that in order to have the ability to create N
distinct elements, each element must be constructed from a finite set of symbols using a sequence of length
Ω(log(N)). The hash function would have to process the entire symbol sequence in order to avoid duplicating
hash codes, thus forcing an Ω(log(N)) lower bound to compute the hash code.
Despite this technicality, hash search time (at least for perfect hashing) is said to be O(1), or constant time.
Open hashing, chaining
The key issue in hashing is dealing with, or resolving, collisions. The simplest idea of resolving collisions is to
group them together as a set of colliding elements accessed through the array at the hash index. The technique
called open hashing (as opposed to closed) maintains the set of colliding elements in a separate data structure
called a buck et accessed through the array index.