@@ -6,6 +6,35 @@ A hash collision is resolved by <b>probing</b>, or searching through alternative locations in
the array (the probe sequence) until either the target element is found, or an unused array slot is found,
which indicates that there is no such key in the table.
+ ## Implementation Invariant
+ Note that the buckets are 1-indexed in the following explanation.
+
+ Invariant: Probe sequence is unbroken. That is to say, given an element that is initially hashed to
+ bucket 1 (arbitrary), the probe sequence {1, 2, ..., m} generated when attempting to `add`/`remove`/`find`
+ the element will ***never*** contain null.
+
+ This invariant is used to help us ensure the correctness and efficiency of `add`/`remove`/`contains`.
+ With the above example of an element generating a probe sequence {1, 2, ...}, `add` will check each bucket
+ sequentially, attempting to add the element, treating buckets containing `Tombstones` (to be explained later) and
+ `nulls` as **empty** buckets available for insertion.
+
+ As a result, if the element is inserted into bucket `m`, such that the probe sequence {1, 2, ..., m} is
+ generated, then there must have been elements occupying buckets {1, 2, ..., m - 1}, resulting in collisions.
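+ For example, if an element hashes to bucket 1 but ends up being inserted into bucket 4, then buckets 1, 2 and 3 must
+ each already hold some element (a collision at every step), so a later probe for it walks {1, 2, 3, 4} without ever
+ meeting null.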
+
+ `remove` maintains this invariant with the help of a `Tombstone` class. As explained in the CS2040S lecture notes,
+ simply replacing the element to be removed with `null` would break the probe sequence, causing `contains` to **fail**
+ to find an element even if it is present.
+
28
+ ` Tombstones ` allow us to mark the bucket as deleted, which allows ` contains ` to know that there is a
29
+ possibility that the targeted element can be found later in the probe sequence, returning false immediately upon
30
+ encountering ` null ` .
31
+
+ We could simply look into every bucket in the probe sequence regardless of what it contains, but that would result
+ in `remove` and `contains` having an O(n) runtime complexity, defeating the purpose of hashing.
+
+ TLDR: There is a need to differentiate between deleted elements and `nulls` to ensure operations on the Set have an
+ expected O(1) time complexity.
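+ As a rough sketch of how these pieces fit together, here is a minimal open-addressing set, assuming a plain `Object[]`
+ table with linear probing, 0-indexed buckets and no resizing; the names (`ProbingSet`, `TOMBSTONE`) are illustrative
+ and this is not the actual CS2040S skeleton. Note that `add` keeps scanning past a tombstone so it can reject
+ duplicates before reusing it.
+
+ ```java
+ // Minimal open-addressing set sketch: tombstones keep the probe sequence unbroken.
+ public class ProbingSet {
+     private static final class Tombstone {}                // marker for a deleted bucket
+     private static final Tombstone TOMBSTONE = new Tombstone();
+     private final Object[] buckets;
+
+     public ProbingSet(int capacity) { buckets = new Object[capacity]; }
+
+     // Linear probing: home bucket plus i, wrapped around the table.
+     private int probe(Object key, int i) {
+         return (Math.floorMod(key.hashCode(), buckets.length) + i) % buckets.length;
+     }
+
+     public boolean add(Object key) {
+         int firstFree = -1;                                 // first tombstone seen, reusable for insertion
+         for (int i = 0; i < buckets.length; i++) {
+             int b = probe(key, i);
+             if (buckets[b] == null) {                       // end of probe sequence: key is absent
+                 buckets[firstFree == -1 ? b : firstFree] = key;
+                 return true;
+             }
+             if (buckets[b] == TOMBSTONE) {
+                 if (firstFree == -1) firstFree = b;         // remember it, but keep scanning for a duplicate
+             } else if (buckets[b].equals(key)) {
+                 return false;                               // already present
+             }
+         }
+         if (firstFree != -1) { buckets[firstFree] = key; return true; }
+         return false;                                       // table is full
+     }
+
+     public boolean contains(Object key) {
+         for (int i = 0; i < buckets.length; i++) {
+             int b = probe(key, i);
+             if (buckets[b] == null) return false;           // unbroken sequence ended: not present
+             if (buckets[b] != TOMBSTONE && buckets[b].equals(key)) return true;
+             // tombstone or other element: keep probing, the key may appear later
+         }
+         return false;
+     }
+
+     public boolean remove(Object key) {
+         for (int i = 0; i < buckets.length; i++) {
+             int b = probe(key, i);
+             if (buckets[b] == null) return false;           // not present
+             if (buckets[b] != TOMBSTONE && buckets[b].equals(key)) {
+                 buckets[b] = TOMBSTONE;                     // do NOT null it out: the sequence must stay unbroken
+                 return true;
+             }
+         }
+         return false;
+     }
+ }
+ ```
+
+ The early `return false` in `contains` upon seeing `null` is exactly what the unbroken-sequence invariant makes safe.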
+
## Probing Strategies
### Linear Probing
@@ -40,7 +69,7 @@ h(k, i) = (h1(k) + i * h2(k)) mod m where h1(k) and h2(k) are two ordinary hash functions
*Source: https://courses.csail.mit.edu/6.006/fall11/lectures/lecture10.pdf*
- ## Analysis
+ ## Complexity Analysis
let α = n / m where α is the load factor of the table
@@ -50,3 +79,23 @@ For n items, in a table of size m, assuming uniform hashing, the expected cost o
e.g. if α = 90%, then E[#probes] = 10;
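+ (This follows from the standard uniform-hashing bound E[#probes] ≤ 1 / (1 - α): at α = 0.9 it gives 1 / 0.1 = 10.)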
+ ## Properties of Good Hash Functions
+ There are two properties to measure the "goodness" of a hash function:
+ 1. h(key, i) enumerates all possible buckets.
+     - For every bucket j, there is some i such that h(key, i) = j.
+     - Equivalently, the probe sequence is a permutation of {1..m}.
+
+ Linear probing satisfies the first property, because it will probe all possible buckets in the Set. For example, if an element
+ is initially hashed to bucket 1, in a Set with capacity m, linear probing generates the sequence {1, 2, ..., m - 1, m},
+ enumerating every single bucket.
+
+ 2. Uniform Hashing Assumption (NOT SUHA)
+     - Every key is equally likely to be mapped to every ***permutation***, independent of every other key.
+     - Under this assumption, the probe sequences should be randomly and uniformly distributed among all possible
+       permutations, implying `m!` possible permutations for a probe sequence of length `m`.
+     - Linear Probing does ***NOT*** fulfil UHA. In linear probing, when a collision occurs, the HashSet handles it by
+       checking the next bucket linearly until an empty bucket is found. The next slot is always determined in a fixed,
+       linear manner, so there are only `m` distinct probe sequences rather than `m!`.
+     - In practice, achieving UHA is difficult. Double hashing can come close to achieving UHA by using another
+       hash function to vary the step size (unlike linear probing, where the step size is constant), resulting in a more
+       uniform distribution of keys and better performance for the hash table. A small sketch contrasting the two probe
+       functions follows below.
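+ As a rough illustration of the two properties, the sketch below generates the probe sequence of a single key under
+ linear probing and under double hashing. The table size `m = 7` and hash values `h1 = 3`, `h2 = 5` are arbitrary
+ example numbers, not taken from the notes above:
+
+ ```java
+ import java.util.LinkedHashSet;
+ import java.util.Set;
+
+ public class ProbeSequences {
+     // Linear probing: fixed step of 1 from the home bucket.
+     static int linearProbe(int h1, int i, int m) {
+         return Math.floorMod(h1 + i, m);
+     }
+
+     // Double hashing: the step size itself comes from a second hash function.
+     // h2 should be non-zero and relatively prime to m so that all m buckets are enumerated (property 1).
+     static int doubleHashProbe(int h1, int h2, int i, int m) {
+         return Math.floorMod(h1 + i * h2, m);
+     }
+
+     public static void main(String[] args) {
+         int m = 7;
+         Set<Integer> linear = new LinkedHashSet<>();
+         Set<Integer> dbl = new LinkedHashSet<>();
+         for (int i = 0; i < m; i++) {
+             linear.add(linearProbe(3, i, m));        // 3, 4, 5, 6, 0, 1, 2
+             dbl.add(doubleHashProbe(3, 5, i, m));    // 3, 1, 6, 4, 2, 0, 5
+         }
+         // Both sequences cover every bucket (property 1), but linear probing can only ever produce
+         // m distinct sequences (one per home bucket), far fewer than the m! that UHA asks for.
+         System.out.println(linear);
+         System.out.println(dbl);
+     }
+ }
+ ```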