
Commit 63f4f39

docs: Add invariant and properties of good hashes to README

1 parent 3e5aafd

2 files changed: +50 -9 lines changed

src/main/java/dataStructures/hashSet/openAddressing/HashSet.java

Lines changed: 0 additions & 8 deletions

@@ -146,14 +146,6 @@ public boolean contains(T element) {
         while (collisions < capacity()) {
             int bucketIndex = hashFunction(element, collisions);

-            // Invariant: Probe sequence is unbroken (no null values between buckets in the sequence).
-            // This is maintained by add and delete.
-            // This means that given a probe sequence e.g. (1, 2, 3, 4, 5, ...) for a given element, add will attempt to
-            // add the element into the buckets in the given order. As a result, if an element is in bucket 3, there
-            // will be elements in buckets 1 and 2, given that there must have been collisions for the element to be
-            // added to bucket 3 instead of bucket 1, or bucket 2.
-            // Similarly, to maintain that invariant, delete will not replace the element with null, but with a
-            // marker (Tombstone).
             // If a bucket contains null in the probe sequence, we can be sure that the Set does not
             // contain the element, and return false immediately.
             // Unlike HashSet::add, HashSet::contains ignores buckets containing Tombstones.

src/main/java/dataStructures/hashSet/openAddressing/README.md

Lines changed: 50 additions & 1 deletion

@@ -6,6 +6,35 @@ A hash collision is resolved by <b>probing</b>, or searching through alternative
 the array (the probe sequence) until either the target element is found, or an unused array slot is found,
 which indicates that there is no such key in the table.
 
+## Implementation Invariant
+Note that the buckets are 1-indexed in the following explanation.
+
+Invariant: The probe sequence is unbroken. That is to say, given an element that is initially hashed to
+bucket 1 (arbitrary), the buckets along the probe sequence {1, 2, ..., m} visited when attempting to `add`/`remove`/`find`
+the element will ***never*** contain null.
+
+This invariant helps us ensure the correctness and efficiency of `add`/`remove`/`contains`.
+With the above example of an element generating a probe sequence {1, 2, ...}, `add` will check each bucket
+sequentially, attempting to add the element, treating buckets containing `Tombstones` (explained below) and
+`nulls` as **empty** buckets available for insertion.
+
+As a result, if the element is inserted into bucket `m`, such that the probe sequence {1, 2, ..., m} is
+generated, then there must have been elements occupying buckets {1, 2, ..., m - 1}, resulting in collisions.
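A minimal sketch of how such an `add` probe loop could look (buckets are 0-indexed here, unlike the 1-indexed buckets in the prose); the `buckets` array, the `TOMBSTONE` sentinel and `hashFunction(element, i)` are illustrative assumptions rather than the repository's actual fields and methods.

```java
import java.util.Objects;

// Minimal sketch of the probe loop in add; names are illustrative assumptions.
class OpenAddressingAddSketch<T> {
    private static final Object TOMBSTONE = new Object(); // marks a deleted bucket
    private final Object[] buckets = new Object[16];

    // i-th probe of a linear probe sequence for the element (0-indexed buckets).
    private int hashFunction(T element, int i) {
        return (Math.floorMod(Objects.hashCode(element), buckets.length) + i) % buckets.length;
    }

    public boolean add(T element) {
        int firstEmpty = -1; // first bucket seen holding null or a tombstone
        for (int collisions = 0; collisions < buckets.length; collisions++) {
            int bucketIndex = hashFunction(element, collisions);
            Object current = buckets[bucketIndex];
            if (current == null || current == TOMBSTONE) {
                if (firstEmpty == -1) {
                    firstEmpty = bucketIndex; // remember the first reusable slot
                }
                if (current == null) {
                    break; // unbroken-sequence invariant: the element cannot appear beyond a null
                }
            } else if (current.equals(element)) {
                return false; // already in the set
            }
        }
        if (firstEmpty == -1) {
            return false; // every bucket is occupied by other elements
        }
        buckets[firstEmpty] = element;
        return true;
    }
}
```

Because insertion always lands in the first reusable bucket of the probe sequence, the occupied prefix {1, 2, ..., m - 1} described above is exactly what the loop walks over before it stops.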
+
+`remove` maintains this invariant with the help of a `Tombstone` class. As explained in the CS2040S lecture notes,
+simply replacing the removed element with `null` would cause `contains` to **fail** to find an element even when it
+is still present.
+
+`Tombstones` mark a bucket as deleted, letting `contains` know that the target element may still be found later in
+the probe sequence, while still allowing it to return false immediately upon encountering `null`.
+
+We could simply look into every bucket in the sequence, but that would give `remove` and `contains` an O(n)
+runtime complexity, defeating the purpose of hashing.
+
+TL;DR: We need to differentiate between deleted elements and `nulls` to ensure operations on the Set keep an expected O(1)
+time complexity.
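Under the same illustrative assumptions as the sketch above, `contains` and `remove` could look as follows: `remove` writes a tombstone instead of `null`, and `contains` probes past tombstones but can return false as soon as it reaches `null`.

```java
import java.util.Objects;

// Companion sketch for lookup and removal; the names and the TOMBSTONE
// sentinel are assumptions for illustration, not the repository's actual API.
class OpenAddressingLookupSketch<T> {
    private static final Object TOMBSTONE = new Object();
    private final Object[] buckets = new Object[16];

    private int hashFunction(T element, int i) {
        return (Math.floorMod(Objects.hashCode(element), buckets.length) + i) % buckets.length;
    }

    public boolean contains(T element) {
        for (int collisions = 0; collisions < buckets.length; collisions++) {
            Object current = buckets[hashFunction(element, collisions)];
            if (current == null) {
                return false; // a null ends the probe sequence: the element cannot be present
            }
            if (current == TOMBSTONE) {
                continue; // deleted slot: keep probing, the element may still appear later
            }
            if (current.equals(element)) {
                return true;
            }
        }
        return false;
    }

    public boolean remove(T element) {
        for (int collisions = 0; collisions < buckets.length; collisions++) {
            int bucketIndex = hashFunction(element, collisions);
            Object current = buckets[bucketIndex];
            if (current == null) {
                return false; // not present
            }
            if (current != TOMBSTONE && current.equals(element)) {
                buckets[bucketIndex] = TOMBSTONE; // keep the probe sequence unbroken
                return true;
            }
        }
        return false;
    }
}
```

If `remove` wrote `null` instead, a later `contains` for an element stored further along the same probe sequence would stop at that hole and wrongly report the element as absent.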
+
 ## Probing Strategies
 
 ### Linear Probing
@@ -40,7 +69,7 @@ h(k, i) = (h1(k) + i * h2(k)) mod m where h1(k) and h2(k) are two ordinary hash
 
 *Source: https://courses.csail.mit.edu/6.006/fall11/lectures/lecture10.pdf*
 
-## Analysis
+## Complexity Analysis
 
 let α = n / m where α is the load factor of the table
 
@@ -50,3 +79,23 @@ For n items, in a table of size m, assuming uniform hashing, the expected cost o
 
 e.g. if α = 90%, then E[#probes] = 10;
 
+## Properties of Good Hash Functions
+There are two properties that measure the "goodness" of a hash function:
+1. h(key, i) enumerates all possible buckets.
+    - For every bucket j, there is some i such that h(key, i) = j.
+    - The hash function is a permutation of {1..m}.
+
+Linear probing satisfies the first property, because it will probe every possible bucket in the Set. I.e. if an element
+is initially hashed to bucket 1, in a Set with capacity n, linear probing generates the sequence {1, 2, ..., n - 1, n},
+enumerating every single bucket.
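For instance, a linear probe with a fixed step of 1 (sketched below with an arbitrary capacity of 8 and 0-indexed buckets, both assumptions for the demo) visits every bucket exactly once:

```java
// Illustrative check that linear probing enumerates every bucket exactly once.
// The capacity and starting hash below are arbitrary values for the demo.
public class LinearProbeDemo {
    static int linearProbe(int keyHash, int i, int capacity) {
        return (keyHash + i) % capacity; // fixed step size of 1
    }

    public static void main(String[] args) {
        int capacity = 8;
        int keyHash = 5; // pretend h1(key) = 5
        for (int i = 0; i < capacity; i++) {
            System.out.print(linearProbe(keyHash, i, capacity) + " ");
        }
        // Prints: 5 6 7 0 1 2 3 4 -- every bucket appears exactly once.
    }
}
```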
+
+2. Uniform Hashing Assumption (NOT SUHA)
+    - Every key is equally likely to be mapped to every ***permutation***, independent of every other key.
+    - Under this assumption, the probe sequences should be randomly and uniformly distributed among all possible
+      permutations, implying `n!` possible permutations for a probe sequence of size `n`.
+    - Linear probing does ***NOT*** fulfil UHA. When a collision occurs, the HashSet handles it by checking the next
+      bucket, linearly, until an empty bucket is found; the next slot is always determined in a fixed, linear manner.
+    - In practice, achieving UHA is difficult. Double hashing can come close to it by using a second hash function to
+      vary the step size (unlike linear probing, where the step size is constant), resulting in a more uniform
+      distribution of keys and better performance for the hash table.
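A sketch of such a double-hashing probe, h(k, i) = (h1(k) + i * h2(k)) mod m; the two hash functions below are arbitrary examples chosen for illustration, not the repository's. Two keys that collide at h1 diverge immediately because their step sizes h2(k) differ:

```java
// Illustrative double hashing: h(k, i) = (h1(k) + i * h2(k)) mod m.
// h1 and h2 are arbitrary example hash functions chosen for the demo.
public class DoubleHashingDemo {
    static int h1(int key, int m) {
        return Math.floorMod(key, m);
    }

    static int h2(int key, int m) {
        // Never 0, and relatively prime to a prime m, so the sequence still
        // enumerates every bucket (property 1 is preserved).
        return 1 + Math.floorMod(key, m - 1);
    }

    static int probe(int key, int i, int m) {
        return (h1(key, m) + i * h2(key, m)) % m;
    }

    public static void main(String[] args) {
        int m = 11; // a prime capacity keeps h2(k) relatively prime to m
        for (int key : new int[]{7, 18}) { // both keys start at bucket 7
            StringBuilder sequence = new StringBuilder();
            for (int i = 0; i < m; i++) {
                sequence.append(probe(key, i, m)).append(" ");
            }
            System.out.println("key " + key + ": " + sequence);
        }
        // key 7:  7 4 1 9 6 3 0 8 5 2 10   (step size 8)
        // key 18: 7 5 3 1 10 8 6 4 2 0 9   (step size 9)
    }
}
```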
