
Commit 63f4f39

docs: Add invariant and properties of good hashes to README

1 parent 3e5aafd

2 files changed: +50 -9 lines changed

src/main/java/dataStructures/hashSet/openAddressing/HashSet.java

Lines changed: 0 additions & 8 deletions

@@ -146,14 +146,6 @@ public boolean contains(T element) {
         while (collisions < capacity()) {
             int bucketIndex = hashFunction(element, collisions);

-            // Invariant: Probe sequence is unbroken (no null values between buckets in the sequence).
-            // This is maintained by add and delete.
-            // This means that given a probe sequence e.g. (1, 2, 3, 4, 5, ...) for a given element, add will attempt to
-            // add the element into the buckets in the given order. As a result, if an element is in bucket 3, there
-            // will be elements in buckets 1 and 2, given that there must have been collisions for the element to be
-            // added to bucket 3 instead of bucket 1, or bucket 2.
-            // Similarly, to maintain that invariant, delete will not replace the element with null, but with a
-            // marker (Tombstone).
             // If a bucket contains null in the probe sequence, we can be sure that the Set does not
             // contain the element, and return false immediately.
             // Unlike HashSet::add, HashSet::contains ignores buckets containing Tombstones.

src/main/java/dataStructures/hashSet/openAddressing/README.md

Lines changed: 50 additions & 1 deletion

@@ -6,6 +6,35 @@ A hash collision is resolved by <b>probing</b>, or searching through alternative
 the array (the probe sequence) until either the target element is found, or an unused array slot is found,
 which indicates that there is no such key in the table.
 
+## Implementation Invariant
+Note that the buckets are 1-indexed in the following explanation.
+
+Invariant: The probe sequence is unbroken. That is to say, given an element that is initially hashed to
+bucket 1 (arbitrary), the buckets along the probe sequence {1, 2, ..., m} visited when attempting to `add`/`remove`/`find`
+the element will ***never*** contain null.
+
+This invariant helps us ensure the correctness and efficiency of `add`/`remove`/`contains`.
+With the above example of an element generating a probe sequence {1, 2, ...}, `add` will check each bucket
+sequentially, attempting to add the element, treating buckets containing `Tombstones` (explained below) and
+`nulls` as **empty** buckets available for insertion.
+
+As a result, if the element is inserted into bucket `m`, such that the probe sequence {1, 2, ..., m} is
+generated, then there must have been elements occupying buckets {1, 2, ..., m - 1}, resulting in collisions.
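A minimal sketch of how such an `add` probe loop could look (buckets are 0-indexed here, unlike the 1-indexed buckets in the prose); the `buckets` array, the `TOMBSTONE` sentinel and `hashFunction(element, i)` are illustrative assumptions rather than the repository's actual fields and methods.

```java
import java.util.Objects;

// Minimal sketch of the probe loop in add; names are illustrative assumptions.
class OpenAddressingAddSketch<T> {
    private static final Object TOMBSTONE = new Object(); // marks a deleted bucket
    private final Object[] buckets = new Object[16];

    // i-th probe of a linear probe sequence for the element (0-indexed buckets).
    private int hashFunction(T element, int i) {
        return (Math.floorMod(Objects.hashCode(element), buckets.length) + i) % buckets.length;
    }

    public boolean add(T element) {
        int firstEmpty = -1; // first bucket seen holding null or a tombstone
        for (int collisions = 0; collisions < buckets.length; collisions++) {
            int bucketIndex = hashFunction(element, collisions);
            Object current = buckets[bucketIndex];
            if (current == null || current == TOMBSTONE) {
                if (firstEmpty == -1) {
                    firstEmpty = bucketIndex; // remember the first reusable slot
                }
                if (current == null) {
                    break; // unbroken-sequence invariant: the element cannot appear beyond a null
                }
            } else if (current.equals(element)) {
                return false; // already in the set
            }
        }
        if (firstEmpty == -1) {
            return false; // every bucket is occupied by other elements
        }
        buckets[firstEmpty] = element;
        return true;
    }
}
```

Because insertion always lands in the first reusable bucket of the probe sequence, the occupied prefix {1, 2, ..., m - 1} described above is exactly what the loop walks over before it stops.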
+
+`remove` maintains this invariant with the help of a `Tombstone` class. As explained in the CS2040S lecture notes,
+simply replacing the removed element with `null` would cause `contains` to **fail** to find an element even when it
+is still present.
+
+`Tombstones` mark a bucket as deleted, letting `contains` know that the target element may still be found later in
+the probe sequence, while still allowing it to return false immediately upon encountering `null`.
+
+We could simply look into every bucket in the sequence, but that would give `remove` and `contains` an O(n)
+runtime complexity, defeating the purpose of hashing.
+
+TL;DR: We need to differentiate between deleted elements and `nulls` to ensure operations on the Set keep an expected O(1)
+time complexity.
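Under the same illustrative assumptions as the sketch above, `contains` and `remove` could look as follows: `remove` writes a tombstone instead of `null`, and `contains` probes past tombstones but can return false as soon as it reaches `null`.

```java
import java.util.Objects;

// Companion sketch for lookup and removal; the names and the TOMBSTONE
// sentinel are assumptions for illustration, not the repository's actual API.
class OpenAddressingLookupSketch<T> {
    private static final Object TOMBSTONE = new Object();
    private final Object[] buckets = new Object[16];

    private int hashFunction(T element, int i) {
        return (Math.floorMod(Objects.hashCode(element), buckets.length) + i) % buckets.length;
    }

    public boolean contains(T element) {
        for (int collisions = 0; collisions < buckets.length; collisions++) {
            Object current = buckets[hashFunction(element, collisions)];
            if (current == null) {
                return false; // a null ends the probe sequence: the element cannot be present
            }
            if (current == TOMBSTONE) {
                continue; // deleted slot: keep probing, the element may still appear later
            }
            if (current.equals(element)) {
                return true;
            }
        }
        return false;
    }

    public boolean remove(T element) {
        for (int collisions = 0; collisions < buckets.length; collisions++) {
            int bucketIndex = hashFunction(element, collisions);
            Object current = buckets[bucketIndex];
            if (current == null) {
                return false; // not present
            }
            if (current != TOMBSTONE && current.equals(element)) {
                buckets[bucketIndex] = TOMBSTONE; // keep the probe sequence unbroken
                return true;
            }
        }
        return false;
    }
}
```

If `remove` wrote `null` instead, a later `contains` for an element stored further along the same probe sequence would stop at that hole and wrongly report the element as absent.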
+
 ## Probing Strategies
 
 ### Linear Probing
@@ -40,7 +69,7 @@ h(k, i) = (h1(k) + i * h2(k)) mod m where h1(k) and h2(k) are two ordinary hash
 
 *Source: https://courses.csail.mit.edu/6.006/fall11/lectures/lecture10.pdf*
 
-## Analysis
+## Complexity Analysis
 
 let α = n / m where α is the load factor of the table
 
@@ -50,3 +79,23 @@ For n items, in a table of size m, assuming uniform hashing, the expected cost o
 
 e.g. if α = 90%, then E[#probes] = 10;
 
+## Properties of Good Hash Functions
+There are two properties that measure the "goodness" of a hash function:
+1. h(key, i) enumerates all possible buckets.
+    - For every bucket j, there is some i such that h(key, i) = j.
+    - The hash function is a permutation of {1..m}.
+
+Linear probing satisfies the first property, because it will probe every possible bucket in the Set. I.e. if an element
+is initially hashed to bucket 1, in a Set with capacity n, linear probing generates the sequence {1, 2, ..., n - 1, n},
+enumerating every single bucket.
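For instance, a linear probe with a fixed step of 1 (sketched below with an arbitrary capacity of 8 and 0-indexed buckets, both assumptions for the demo) visits every bucket exactly once:

```java
// Illustrative check that linear probing enumerates every bucket exactly once.
// The capacity and starting hash below are arbitrary values for the demo.
public class LinearProbeDemo {
    static int linearProbe(int keyHash, int i, int capacity) {
        return (keyHash + i) % capacity; // fixed step size of 1
    }

    public static void main(String[] args) {
        int capacity = 8;
        int keyHash = 5; // pretend h1(key) = 5
        for (int i = 0; i < capacity; i++) {
            System.out.print(linearProbe(keyHash, i, capacity) + " ");
        }
        // Prints: 5 6 7 0 1 2 3 4 -- every bucket appears exactly once.
    }
}
```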
+
+2. Uniform Hashing Assumption (NOT SUHA)
+    - Every key is equally likely to be mapped to every ***permutation***, independent of every other key.
+    - Under this assumption, the probe sequences should be randomly and uniformly distributed among all possible
+      permutations, implying `n!` possible permutations for a probe sequence of size `n`.
+    - Linear probing does ***NOT*** fulfil UHA. When a collision occurs, the HashSet handles it by checking the next
+      bucket, linearly, until an empty bucket is found; the next slot is always determined in a fixed, linear manner.
+    - In practice, achieving UHA is difficult. Double hashing can come close to it by using a second hash function to
+      vary the step size (unlike linear probing, where the step size is constant), resulting in a more uniform
+      distribution of keys and better performance for the hash table.
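A sketch of such a double-hashing probe, h(k, i) = (h1(k) + i * h2(k)) mod m; the two hash functions below are arbitrary examples chosen for illustration, not the repository's. Two keys that collide at h1 diverge immediately because their step sizes h2(k) differ:

```java
// Illustrative double hashing: h(k, i) = (h1(k) + i * h2(k)) mod m.
// h1 and h2 are arbitrary example hash functions chosen for the demo.
public class DoubleHashingDemo {
    static int h1(int key, int m) {
        return Math.floorMod(key, m);
    }

    static int h2(int key, int m) {
        // Never 0, and relatively prime to a prime m, so the sequence still
        // enumerates every bucket (property 1 is preserved).
        return 1 + Math.floorMod(key, m - 1);
    }

    static int probe(int key, int i, int m) {
        return (h1(key, m) + i * h2(key, m)) % m;
    }

    public static void main(String[] args) {
        int m = 11; // a prime capacity keeps h2(k) relatively prime to m
        for (int key : new int[]{7, 18}) { // both keys start at bucket 7
            StringBuilder sequence = new StringBuilder();
            for (int i = 0; i < m; i++) {
                sequence.append(probe(key, i, m)).append(" ");
            }
            System.out.println("key " + key + ": " + sequence);
        }
        // key 7:  7 4 1 9 6 3 0 8 5 2 10   (step size 8)
        // key 18: 7 5 3 1 10 8 6 4 2 0 9   (step size 9)
    }
}
```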
