Commit 48870c9

refactor: Improved clarity
1 parent cd3c465 commit 48870c9

12 files changed

+247
-94
lines changed


README.md

Lines changed: 10 additions & 10 deletions
Original file line number | Diff line number | Diff line change
@@ -7,7 +7,7 @@ It is aligned with [CS2040s](https://nusmods.com/courses/CS2040S/data-structures
77

88
The work here is continually being developed by CS2040s Teaching Assistants (TAs) and ex-2040s students,
99
under the guidance of Prof Seth.
10-
It is still in its infant stage, mostly covering lecture content and discussion notes.
10+
It mostly covers lecture content and discussion notes.
1111
Future plans include deeper discussion into the tougher parts of tutorials and even practice problems / puzzles related
1212
to DSA.
1313

@@ -30,13 +30,13 @@ Gradle is used for development.
3030
- [Linked List](src/main/java/dataStructures/linkedList)
3131
- [LRU Cache](src/main/java/dataStructures/lruCache)
3232
- Minimum Spanning Tree
33-
* Kruskal
34-
* Prim's
33+
* [Kruskal](src/main/java/algorithms/minimumSpanningTree/kruskal)
34+
* [Prim's](src/main/java/algorithms/minimumSpanningTree/prim)
3535
* Borůvka
3636
- [Queue](src/main/java/dataStructures/queue)
3737
- [Deque](src/main/java/dataStructures/queue/Deque)
3838
- [Monotonic Queue](src/main/java/dataStructures/queue/monotonicQueue)
39-
- Segment Tree
39+
- [Segment Tree](src/main/java/dataStructures/segmentTree)
4040
- [Stack](src/main/java/dataStructures/stack)
4141
- [Segment Tree](src/main/java/dataStructures/segmentTree)
4242
- [Trie](src/main/java/dataStructures/trie)
@@ -47,10 +47,10 @@ Gradle is used for development.
4747
* [Template](src/main/java/algorithms/binarySearch/binarySearchTemplated)
4848
- [Counting Sort](src/main/java/algorithms/sorting/countingSort)
4949
- [Cyclic Sort](src/main/java/algorithms/sorting/cyclicSort)
50-
* [Special case](src/main/java/algorithms/sorting/cyclicSort/simple) of O(n) time complexity
51-
* [Generalized case](src/main/java/algorithms/sorting/cyclicSort/generalised) of O(n^2) time complexity
50+
* [Special case](src/main/java/algorithms/sorting/cyclicSort/simple)
51+
* [Generalized case](src/main/java/algorithms/sorting/cyclicSort/generalised)
5252
- [Insertion Sort](src/main/java/algorithms/sorting/insertionSort)
53-
- [Knuth-Morris-Pratt](src/main/java/algorithms/patternFinding) aka KMP algorithm
53+
- [Knuth-Morris-Pratt](src/main/java/algorithms/patternFinding) (KMP algorithm)
5454
- [Merge Sort](src/main/java/algorithms/sorting/mergeSort)
5555
* [Recursive](src/main/java/algorithms/sorting/mergeSort/recursive)
5656
* [Bottom-up iterative](src/main/java/algorithms/sorting/mergeSort/iterative)
@@ -76,8 +76,8 @@ Gradle is used for development.
7676
* [Selection](src/main/java/algorithms/sorting/selectionSort)
7777
* [Merge](src/main/java/algorithms/sorting/mergeSort)
7878
* [Quick](src/main/java/algorithms/sorting/quickSort)
79-
* [Hoare's](src/main/java/algorithms/sorting/quickSort/hoares)
80-
* [Lomuto's](src/main/java/algorithms/sorting/quickSort/lomuto) (Not discussed in CS2040s)
79+
* [Hoare's](src/main/java/algorithms/sorting/quickSort/hoares) (this version is the one shown in lecture!)
80+
* [Lomuto's](src/main/java/algorithms/sorting/quickSort/lomuto)
8181
* [Paranoid](src/main/java/algorithms/sorting/quickSort/paranoid)
8282
* [3-way Partitioning](src/main/java/algorithms/sorting/quickSort/threeWayPartitioning)
8383
* [Counting Sort](src/main/java/algorithms/sorting/countingSort) (found in tutorial)
@@ -88,7 +88,7 @@ Gradle is used for development.
8888
* [Trie](src/main/java/dataStructures/trie)
8989
* [B-Tree](src/main/java/dataStructures/bTree)
9090
* [Segment Tree](src/main/java/dataStructures/segmentTree) (Not covered in CS2040s but useful!)
91-
* Red-Black Tree (Not covered in CS2040s but useful!)
91+
* Red-Black Tree (**WIP**)
9292
* [Orthogonal Range Searching](src/main/java/algorithms/orthogonalRangeSearching)
9393
* Interval Trees (**WIP**)
9494
5. [Binary Heap](src/main/java/dataStructures/heap) (Max heap)

src/main/java/algorithms/patternFinding/KMP.java

Lines changed: 0 additions & 6 deletions
@@ -107,12 +107,6 @@ public static List<Integer> findOccurrences(String sequence, String pattern) {
107107
pTrav += 1;
108108
sTrav += 1;
109109
}
110-
// ALTERNATIVELY
111-
// if pTrav == 0 i.e. nothing matched, move on
112-
// sTrav += 1
113-
// continue
114-
//
115-
// pTrav = prefixTable[pTrav]
116110
}
117111
}
118112
return indicesFound;

src/main/java/algorithms/sorting/cyclicSort/README.md

Lines changed: 6 additions & 5 deletions
@@ -3,7 +3,8 @@
33
## Background
44

55
Cyclic sort is a comparison-based, in-place algorithm that performs sorting (generally) in O(n^2) time.
6-
Though under some conditions (discussed later), the best case could be done in O(n) time.
6+
Under some special conditions (discussed later), the algorithm is non-comparison based and
7+
the best case could be done in O(n) time. This is the version that tends to be used in practice.
78

89
### Implementation Invariant
910

@@ -24,16 +25,16 @@ This allows cyclic sort to have a time complexity of O(n) for certain inputs.
2425

2526
We discuss more implementation-specific details and complexity analysis in the respective folders. In short,
2627

27-
1. The [**simple**](./simple) case discusses the non-comparison based implementation of cyclic sort under
28+
1. The [**simple**](./simple) case discusses the **non-comparison based** implementation of cyclic sort under
2829
certain conditions. This allows the best case to be better than O(n^2).
2930
2. The [**generalised**](./generalised) case discusses cyclic sort for general inputs. This is comparison-based and is
30-
usually implemented in O(n^2).
31+
typically implemented in O(n^2).
3132

3233
Note that, in practice, the generalised case is hardly used. There are more efficient algorithms to use for sorting,
3334
e.g. merge and quick sort. If the concern is the number of swaps, generalized cyclic sort does indeed require fewer
3435
swaps, but likely won't lower than selection sort's.
3536

36-
In other words, cyclic sort is specially designed for situations where the elements to be sorted are
37-
known to fall within a specific, continuous range, such as integers from 1 to n, without any gaps or duplicates.
37+
In other words, **cyclic sort is specially designed for situations where the elements to be sorted are
38+
known to fall within a specific, continuous range, such as integers from 1 to n, without any gaps or duplicates.**
3839

3940

src/main/java/algorithms/sorting/cyclicSort/simple/README.md

Lines changed: 26 additions & 1 deletion
@@ -10,7 +10,31 @@ This is typically applicable when sorting a sequence of integers that are in a c
1010
or can be easily mapped to such a range. We illustrate the idea with n integers from 0 to n-1.
1111

1212
In this implementation, the algorithm is **not comparison-based**! (unlike the general case).
13-
It makes use of the known inherent ordering of the numbers, bypassing the nlogn lower bound for most sorting algorithms.
13+
It makes use of the known inherent ordering of the numbers,
14+
bypassing the `n log n` lower bound that applies to comparison-based sorting algorithms.
15+
16+
<details>
17+
<summary> <b>Duplicates</b> </summary>
18+
Not designed to handle duplicates. When duplicates are present, the algorithm can run into issues,
19+
such as overwriting elements or getting stuck in infinite loops,
20+
because it assumes that each element has a unique position in the array.
21+
22+
If you need to handle duplicates, modifications are required,
23+
such as checking for duplicate values before placing elements,
24+
which can impact the simplicity and efficiency (possibly degrade to `O(n^2)`) of the algorithm.
25+
</details>
26+
27+
<details>
28+
<summary> <b>Inherent Ordering..?</b> </summary>
29+
This property allows the sorting algorithm to avoid comparing elements with each other
30+
and instead directly place each element in its correct position.
31+
32+
For example, if sorting integers from 0 to n-1, the number 0 naturally belongs at index 0, 1 at index 1, and so on.
33+
This inherent structure allows Cyclic Sort to achieve `O(n)` time complexity,
34+
bypassing the `Ω(n log n)` lower bound on comparison-based sorting algorithms
35+
([proof](https://tildesites.bowdoin.edu/~ltoma/teaching/cs231/fall07/Lectures/sortLB.pdf))
36+
by using the known order of elements rather than making comparisons to determine their positions.
37+
</details>
1438

1539
## Complexity Analysis
1640

@@ -48,3 +72,4 @@ otherwise there would be a contradiction.
4872
and sorting needs to be done in O(1) auxiliary space.
4973
2. The implementation here uses integers from 0 to n-1. This can be easily modified for n contiguous integers starting
5074
at some arbitrary number (simply offset by this start number).
75+
3. This version of cyclic sort does not handle duplicates (at least, sorting might not be guaranteed to be in O(n))
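
To make the "inherent ordering" idea concrete, here is a minimal sketch, assuming the input is a permutation of 0 to n-1 with no duplicates. The class and method names are illustrative and are not taken from the repository's implementation.

```java
import java.util.Arrays;

final class CyclicSortSketch {
    /** Sorts a permutation of 0..n-1 in place by sending each value v to index v. */
    static void cyclicSort(int[] arr) {
        int i = 0;
        while (i < arr.length) {
            int target = arr[i]; // value v belongs at index v, no comparison with other elements needed
            if (target != i) {
                // Swap arr[i] into its home slot; whatever lands at i is examined on the next pass.
                arr[i] = arr[target];
                arr[target] = target;
            } else {
                i++; // arr[i] is already home, move on
            }
        }
    }

    public static void main(String[] args) {
        int[] data = {3, 0, 4, 1, 2};
        cyclicSort(data);
        System.out.println(Arrays.toString(data)); // [0, 1, 2, 3, 4]
    }
}
```

Each swap sends one value to its final index, so there are at most n-1 swaps in total, which is where the O(n) bound comes from. With a duplicate (for example `{1, 1}`), this sketch would keep swapping the same slot forever, which is the kind of failure point 3 warns about.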

src/main/java/dataStructures/avlTree/README.md

Lines changed: 19 additions & 3 deletions
@@ -13,6 +13,19 @@ Here we discuss a type of self-balancing BST, known as the AVL tree, that avoids
1313
across the operations by ensuring careful updating of the tree's structure whenever there is a change
1414
(e.g. insert or delete).
1515

16+
<details>
17+
<summary> <b>Terminology</b> </summary>
18+
<li>
19+
Level: Refers to the number of edges from the root to that particular node. Root is at level 0.
20+
</li>
21+
<li>
22+
Depth: The depth of a node is the same as its level; i.e. how far a node is from the root of the tree.
23+
</li>
24+
<li>
25+
Height: The number of edges on the longest path from that node to a leaf. A leaf node has height 0.
26+
</li>
27+
</details>
28+
1629
### Definition of Balanced Trees
1730
Balanced trees are a special subset of trees with **height in the order of log(n)**, where n is the number of nodes.
1831
This choice is not an arbitrary one. It can be mathematically shown that a binary tree of n nodes has height of at least
@@ -39,8 +52,11 @@ former.
3952

4053
<details>
4154
<summary> <b>Ponder..</b> </summary>
42-
Consider any two nodes (need not have the same immediate parent node) in the tree. Is the difference in height
43-
between the two nodes <= 1 too?
55+
Can an AVL tree exist in which two leaf nodes have depths that differ by more than 1? What about 2? 10?
56+
<details>
57+
<summary> <b>Answer</b> </summary>
58+
Yes! In fact, for any x, you can construct a large enough AVL tree with two leaves whose depths differ by more than x!
59+
</details>
4460
</details>
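
One way to convince yourself of the answer is to build the sparsest height-balanced shape of a given height (left subtree one level taller than the right at every node) and measure its leaves. The sketch below is illustrative only: the names are hypothetical, keys are omitted since only the shape matters, and height is counted in edges with an empty subtree treated as height -1.

```java
final class AvlLeafDepthSketch {
    static final class Node {
        Node left;
        Node right;
    }

    /** Builds the sparsest height-balanced shape of the given height (in edges; negative means empty). */
    static Node sparsest(int height) {
        if (height < 0) {
            return null;
        }
        Node n = new Node();
        n.left = sparsest(height - 1);  // taller side
        n.right = sparsest(height - 2); // shorter side, exactly one level less
        return n;
    }

    static int height(Node n) {
        return n == null ? -1 : 1 + Math.max(height(n.left), height(n.right));
    }

    /** Checks the AVL condition at every node: sibling subtree heights differ by at most 1. */
    static boolean isHeightBalanced(Node n) {
        if (n == null) {
            return true;
        }
        return Math.abs(height(n.left) - height(n.right)) <= 1
                && isHeightBalanced(n.left)
                && isHeightBalanced(n.right);
    }

    /** Depth of the leaf closest to the root. */
    static int shallowestLeaf(Node n, int depth) {
        if (n.left == null && n.right == null) {
            return depth;
        }
        int best = Integer.MAX_VALUE;
        if (n.left != null) {
            best = Math.min(best, shallowestLeaf(n.left, depth + 1));
        }
        if (n.right != null) {
            best = Math.min(best, shallowestLeaf(n.right, depth + 1));
        }
        return best;
    }

    public static void main(String[] args) {
        Node root = sparsest(10);
        int deepest = height(root);               // the deepest leaf sits at depth == height of the root
        int shallowest = shallowestLeaf(root, 0);
        System.out.println(isHeightBalanced(root)); // true
        System.out.println(deepest - shallowest);   // 5
    }
}
```

In this shape the shallowest leaf sits at roughly half the height, so the gap between leaf depths grows with the height of the tree even though every node still satisfies the balance condition.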
4561

4662
It can be mathematically shown that a **height-balanced tree with n nodes has height at most 2log(n)** (
@@ -75,7 +91,7 @@ Hence, we need some re-balancing operations. To do so, tree rotation operations
7591

7692
Prof Seth explains it best! Go re-visit his slides (Lecture 10) for the operations :P <br>
7793
Here is a [link](https://www.youtube.com/watch?v=dS02_IuZPes&list=PLgpwqdiEMkHA0pU_uspC6N88RwMpt9rC8&index=9)
78-
for prof's lecture on trees. <br>
94+
to prof's lecture on trees. <br>
7995
_We may add a summary in the near future._
8096

8197
## Application

src/main/java/dataStructures/bTree/README.md

Lines changed: 88 additions & 17 deletions
@@ -104,26 +104,97 @@ Image Source: https://www.geeksforgeeks.org/insert-operation-in-b-tree/
104104
The delete operation has a similar idea as the insert operation, but involves a lot more edge cases. If you are
105105
interested to learn about it, you can read more [here](https://www.geeksforgeeks.org/delete-operation-in-b-tree/).
106106

107-
## Application
108-
There are many uses of B-Trees but the most common is their utility in database management systems in handling large
109-
datasets by optimizing disk accesses.
107+
## Application: Index Structure
108+
B+ trees tend to be used in practice over vanilla B-trees.
109+
The B+ tree is a specific variant of the B-tree that is optimized for efficient data retrieval from disk
110+
and range queries.
110111

111-
Large amounts of data have to be stored on the disk. But disk I/O operations are slow and not knowing where to look
112-
for the data can drastically worsen search time. B-Tree is used as an index structure to efficiently locate the
113-
desired data. Note, the B-Tree itself can be partially stored in RAM (higher levels) and partially on disk
114-
(lower, less freq accessed levels).
112+
We will discuss two common applications of B+ trees: **database indexing** and **file system indexing**.
115113

116-
Consider a database of all the CS modules offered in NUS. Suppose there is a column "Code" (module code) in the
117-
"CS Modules" table. If the database has a B-Tree index on the "Code" column, the keys in the B-Tree would be the
118-
module code of all CS modules offered.
114+
---
119115

120-
Each key in the B-Tree is associated with a pointer, that points to the location on the disk where the corresponding
121-
data can be found. For e.g., a key for "CS2040s" would have a pointer to the disk location(s) where the row(s)
122-
(i.e. data) with "CS2040s" is stored. This efficient querying allows the database quickly navigate through the keys
123-
and find the disk location of the desired data without having to scan the whole "CS Modules" table.
116+
### Indexing Structure
124117

125-
The choice of t will impact the height of the tree, and hence how fast the query is. Trade-off would be space, as a
126-
higher t means more keys in each node, and they would have to be (if not already) loaded to RAM.
118+
B+ trees are often used to efficiently manage large amounts of data stored on disk.
119+
They do not store the actual data itself but instead store **pointers** (or references)
120+
to where the data is located on the disk.
121+
122+
#### Pointer / Reference
123+
A pointer in the context of a B+ tree refers to some piece of information that can be used to
124+
retrieve actual data from the disk. Some common examples include:
125+
- **Disk address/block number**
126+
- **Filename with offset**
127+
- **Database page and record ID**
128+
- **Primary key ID**
129+
130+
<details>
131+
<summary> <b>File System Indexing</b> </summary>
132+
133+
### B+ Trees for File System Indexing
134+
135+
File system indexing refers to the process by which an operating system organizes and manages files on
136+
storage media (such as hard drives, SSDs) to enable efficient file retrieval, searching, and management.
137+
It involves creating and maintaining indexes (similar to those in a database) that help quickly locate files,
138+
directories, and their metadata (like file names, attributes, permissions, and timestamps).
139+
140+
#### Workflow:
141+
- The **root node** of a B+ tree is typically stored in **RAM** to speed up access.
142+
- **Nodes** in the tree contain keys and child pointers to other nodes.
143+
- **Intermediate nodes** do not store actual data but guide the search process toward the leaf nodes.
144+
- **Leaf nodes** either contain the actual data or pointers to the data stored on disk.
145+
This is where the data retrieval process ends.
146+
147+
#### Optimized Disk I/O:
148+
B+ trees are optimized for disk I/O, especially for **range queries**.
149+
The tree nodes are designed to fit into disk pages, meaning a single disk read operation can bring in multiple keys
150+
and pointers. This reduces the overall number of disk accesses required and efficiently utilizes disk pages.
151+
152+
#### Range Queries:
153+
B+ trees are particularly effective for **range queries**. Since the leaf nodes in a B+ tree are linked together
154+
(typically via a **doubly linked list**), this makes sequential access for range queries efficient.
155+
For example, in a file system, this allows fetching multiple adjacent keys (like file names in a directory)
156+
without requiring additional disk I/O.
157+
158+
</details>
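
The range-query idea can be sketched with just the leaf level. This is a simplified illustration, not a full B+ tree: it assumes the starting leaf has already been found by descending from the root, and it uses a singly linked `next` pointer where a real tree would often keep links in both directions. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class LeafChainSketch {
    static final class Leaf {
        final int[] keys; // sorted keys held by this leaf (think: one disk page)
        Leaf next;        // link to the right sibling leaf

        Leaf(int... keys) {
            this.keys = keys;
        }
    }

    /** Collects keys in [lo, hi], starting from the leaf where lo would be located. */
    static List<Integer> rangeScan(Leaf start, int lo, int hi) {
        List<Integer> out = new ArrayList<>();
        for (Leaf leaf = start; leaf != null; leaf = leaf.next) {
            for (int key : leaf.keys) {
                if (key > hi) {
                    return out; // past the range: stop without reading later leaves
                }
                if (key >= lo) {
                    out.add(key);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Leaf a = new Leaf(1, 4);
        Leaf b = new Leaf(7, 9);
        Leaf c = new Leaf(12, 15);
        a.next = b;
        b.next = c;
        System.out.println(rangeScan(a, 4, 12)); // [4, 7, 9, 12]
    }
}
```

Once the first leaf is located, the scan never goes back up to the internal nodes; it only follows sibling links, which is why adjacent keys are cheap to fetch.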
159+
160+
<details>
161+
<summary> <b>SQL Engines</b> </summary>
162+
163+
### B+ Trees in SQL Engines
164+
165+
In **MySQL**, B+ trees are extensively used in the **InnoDB** storage engine
166+
(the default storage engine for MySQL databases).
167+
168+
#### Primary Key Index (Clustered Index):
169+
In **InnoDB**, the primary key is always stored in a **clustered index**.
170+
This means the leaf nodes of the B+ tree store the actual rows of the table.
171+
In a clustered index, the rows are physically stored in the order of the primary key,
172+
making retrieval by primary key highly efficient.
173+
174+
#### Secondary Indexes:
175+
For secondary indexes in MySQL (specifically in InnoDB),
176+
once the B+ tree for the secondary index is navigated to the leaf node, the following process occurs:
177+
178+
1. **Secondary Index B+ Tree**: The leaf nodes store the indexed column value (e.g., `last_name`)
179+
along with a reference to the primary key (e.g., `emp_id`).
180+
2. **Reference to Primary Key**: This reference (the primary key value) is used to look up the actual data
181+
in the **clustered index** (which is also a B+ tree). The clustered index stores the entire row data in its leaf nodes.
182+
183+
#### Detailed Process:
184+
- **Step 1**: MySQL navigates the secondary index tree based on the query condition (e.g. a range query on `last_name`)
185+
- The internal nodes guide the search, and the leaf node contains the `last_name`
186+
value and the corresponding primary key (`emp_id`).
187+
188+
- **Step 2**: Once MySQL reaches the leaf node of the secondary index B+ tree, it retrieves the primary key (`emp_id`).
189+
190+
- **Step 3**: MySQL uses this primary key to directly access the **clustered index** (the B+ tree for the primary key).
191+
- It navigates the primary key B+ tree to locate the row in its leaf nodes, where the full row data
192+
(e.g., `emp_id`, `last_name`, `first_name`, `salary`) is stored.
193+
194+
> **Note**: If multiple results match a query on the secondary index,
195+
the leaf nodes of the secondary index B+ tree will store multiple primary keys corresponding to the matching rows.
196+
197+
</details>
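
The two-step lookup described for secondary indexes can be mimicked in memory. The sketch below is only an analogy: `java.util.TreeMap` is a red-black tree, not a B+ tree, and nothing here reflects InnoDB's actual code. It merely shows the secondary index yielding primary keys that are then resolved through the clustered index. The class and method names are hypothetical; the column names follow the example above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class SecondaryIndexSketch {
    static final class Employee {
        final int empId;
        final String lastName;
        final String firstName;
        final int salary;

        Employee(int empId, String lastName, String firstName, int salary) {
            this.empId = empId;
            this.lastName = lastName;
            this.firstName = firstName;
            this.salary = salary;
        }
    }

    // Stand-in for the clustered index: primary key -> full row.
    private final TreeMap<Integer, Employee> clustered = new TreeMap<>();
    // Stand-in for the secondary index: last_name -> primary keys of the matching rows.
    private final TreeMap<String, List<Integer>> byLastName = new TreeMap<>();

    void insert(Employee e) {
        clustered.put(e.empId, e);
        byLastName.computeIfAbsent(e.lastName, k -> new ArrayList<>()).add(e.empId);
    }

    /** Range query on last_name in [from, to], resolved through the primary key. */
    List<Employee> findByLastNameBetween(String from, String to) {
        List<Employee> rows = new ArrayList<>();
        // Steps 1 and 2: walk the secondary index in key order and read off the primary keys.
        for (List<Integer> ids : byLastName.subMap(from, true, to, true).values()) {
            for (int id : ids) {
                // Step 3: use each primary key to fetch the full row from the clustered index.
                rows.add(clustered.get(id));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        SecondaryIndexSketch table = new SecondaryIndexSketch();
        table.insert(new Employee(1, "Tan", "Ann", 5000));
        table.insert(new Employee(2, "Lim", "Bob", 4000));
        table.insert(new Employee(3, "Tan", "Cy", 6000));
        table.insert(new Employee(4, "Wong", "Di", 4500));
        for (Employee e : table.findByLastNameBetween("Lim", "Tan")) {
            System.out.println(e.empId + " " + e.lastName + " " + e.firstName);
        }
    }
}
```

Note how every row returned costs one extra lookup in the clustered index; that second traversal is the price of going through a secondary index rather than the primary key.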
127198

128199
## References
129-
This description heavily references CS2040S Recitation Sheet 4.
200+
CS2040S Recitation Sheet 4.
