@@ -104,26 +104,97 @@ Image Source: https://www.geeksforgeeks.org/insert-operation-in-b-tree/
The delete operation follows a similar idea to the insert operation, but involves many more edge cases. If you are
interested in learning about it, you can read more [here](https://www.geeksforgeeks.org/delete-operation-in-b-tree/).

- ## Application
- There are many uses of B-Trees but the most common is their utility in database management systems in handling large
- datasets by optimizing disk accesses.
+ ## Application: Index Structure
+ B+ trees tend to be used in practice over vanilla B-trees.
+ The B+ tree is a specific variant of the B-tree that is optimized for efficient data retrieval from disk
+ and range queries.

- Large amounts of data have to be stored on the disk. But disk I/O operations are slow and not knowing where to look
- for the data can drastically worsen search time. B-Tree is used as an index structure to efficiently locate the
- desired data. Note, the B-Tree itself can be partially stored in RAM (higher levels) and partially on disk
- (lower, less freq accessed levels).
+ We will discuss two common applications of B+ trees: **database indexing** and **file system indexing**.

- Consider a database of all the CS modules offered in NUS. Suppose there is a column "Code" (module code) in the
- "CS Modules" table. If the database has a B-Tree index on the "Code" column, the keys in the B-Tree would be the
- module code of all CS modules offered.
+ ---

- Each key in the B-Tree is associated with a pointer, that points to the location on the disk where the corresponding
- data can be found. For e.g., a key for "CS2040s" would have a pointer to the disk location(s) where the row(s)
- (i.e. data) with "CS2040s" is stored. This efficient querying allows the database quickly navigate through the keys
- and find the disk location of the desired data without having to scan the whole "CS Modules" table.
+ ### Indexing Structure

- The choice of t will impact the height of the tree, and hence how fast the query is. Trade-off would be space, as a
- higher t means more keys in each node, and they would have to be (if not already) loaded to RAM.
+ B+ trees are often used to efficiently manage large amounts of data stored on disk.
+ When used as an index, they typically do not store the actual data itself but instead store **pointers**
+ (or references) to where the data is located on the disk.
+
+ #### Pointer / Reference
+ A pointer in the context of a B+ tree refers to some piece of information that can be used to
+ retrieve actual data from the disk. Some common examples include:
+ - **Disk address/block number**
+ - **Filename with offset**
+ - **Database page and record ID**
+ - **Primary key ID**
+
+ <details>
+ <summary><b>File System Indexing</b></summary>
+
+ ### B+ Trees for File System Indexing
+
+ File system indexing refers to the process by which an operating system organizes and manages files on
+ storage media (such as hard drives, SSDs) to enable efficient file retrieval, searching, and management.
+ It involves creating and maintaining indexes (similar to those in a database) that help quickly locate files,
+ directories, and their metadata (like file names, attributes, permissions, and timestamps).
+
+ #### Workflow:
+ - The **root node** of a B+ tree is typically stored in **RAM** to speed up access.
+ - **Nodes** in the tree contain keys and child pointers to other nodes.
+ - **Intermediate nodes** do not store actual data but guide the search process toward the leaf nodes.
+ - **Leaf nodes** either contain the actual data or pointers to the data stored on disk.
+ This is where the data retrieval process ends.
+
+ #### Optimized Disk I/O:
+ B+ trees are optimized for disk I/O, especially for **range queries**.
+ The tree nodes are designed to fit into disk pages, meaning a single disk read operation can bring in multiple keys
+ and pointers. This reduces the overall number of disk accesses required and efficiently utilizes disk pages.
+
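As a rough illustration of why page-sized nodes reduce disk accesses, the sketch below estimates how many keys fit in one node and how many levels a tree would need. The 4 KiB page size, 8-byte keys, and 8-byte pointers are assumptions chosen for illustration, the function names are hypothetical, and the estimate ignores node headers and partially filled nodes.

```python
def estimate_fanout(page_size: int, key_size: int, pointer_size: int) -> int:
    """Approximate number of (key, pointer) pairs that fit in one page-sized node."""
    return page_size // (key_size + pointer_size)


def estimate_height(num_keys: int, fanout: int) -> int:
    """Approximate number of levels needed so that the tree can index num_keys keys."""
    height = 1
    capacity = fanout  # keys reachable with a single level
    while capacity < num_keys:
        capacity *= fanout  # each extra level multiplies the reach by the fanout
        height += 1
    return height


# Assumed sizes: 4 KiB pages, 8-byte keys, 8-byte pointers.
fanout = estimate_fanout(page_size=4096, key_size=8, pointer_size=8)
height = estimate_height(num_keys=100_000_000, fanout=fanout)
print(f"fanout={fanout}, height={height}")
```

Under these assumptions, each page read brings in a few hundred keys, so even a hundred million keys are reachable in a handful of page reads, fewer if the upper levels are cached in RAM.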
+ #### Range Queries:
+ B+ trees are particularly effective for **range queries**. Since the leaf nodes in a B+ tree are linked together
+ (typically via a **doubly linked list**), sequential access for range queries is efficient.
+ For example, in a file system, this allows fetching multiple adjacent keys (like file names in a directory)
+ without re-traversing the tree from the root for each key.
+
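The range scan described above can be sketched as follows. This is a minimal illustration, not a full B+ tree: it assumes a single internal level, uses a singly linked leaf chain (a doubly linked list would add a `prev` link for reverse scans), and the class names, function names, and file names are all hypothetical.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Leaf:
    keys: List[str]
    next: Optional["Leaf"] = None  # link to the right sibling leaf


@dataclass
class Internal:
    separators: List[str]  # separators[i] is the smallest key in children[i + 1]
    children: List[Leaf]


def find_leaf(root: Internal, key: str) -> Leaf:
    """One root-to-leaf descent: pick the child whose key range covers `key`."""
    return root.children[bisect_right(root.separators, key)]


def range_scan(root: Internal, low: str, high: str) -> List[str]:
    """Return all keys k with low <= k <= high by walking the leaf chain."""
    result = []
    leaf = find_leaf(root, low)  # the only descent from the root
    while leaf is not None:
        for key in leaf.keys[bisect_left(leaf.keys, low):]:
            if key > high:
                return result
            result.append(key)
        leaf = leaf.next  # follow the sibling link instead of going back up
    return result


# Hypothetical directory listing spread over three leaves.
leaf3 = Leaf(["e.txt", "f.txt"])
leaf2 = Leaf(["c.txt", "d.txt"], next=leaf3)
leaf1 = Leaf(["a.txt", "b.txt"], next=leaf2)
root = Internal(separators=["c.txt", "e.txt"], children=[leaf1, leaf2, leaf3])

print(range_scan(root, "b.txt", "e.txt"))
```

The scan descends from the root once to find the starting leaf, then only follows sibling links, so each additional leaf costs one page read rather than a fresh traversal.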
158
+ </details >
159
+
160
+ <details >
161
+ <summary > <b >SQL Engines</b > </summary >
162
+
163
+ ### B+ Trees in SQL Engines
164
+
165
+ In ** MySQL** , B+ trees are extensively used in the ** InnoDB** storage engine
166
+ (the default storage engine for MySQL databases).
167
+
168
+ #### Primary Key Index (Clustered Index):
169
+ In ** InnoDB** , the primary key is always stored in a ** clustered index** .
170
+ This means the leaf nodes of the B+ tree store the actual rows of the table.
171
+ In a clustered index, the rows are physically stored in the order of the primary key,
172
+ making retrieval by primary key highly efficient.
+
+ #### Secondary Indexes:
+ For secondary indexes in MySQL (specifically in InnoDB),
+ once the B+ tree for the secondary index is navigated to the leaf node, the following process occurs:
+
+ 1. **Secondary Index B+ Tree**: The leaf nodes store the indexed column value (e.g., `last_name`)
+ along with a reference to the primary key (e.g., `emp_id`).
+ 2. **Reference to Primary Key**: This reference (the primary key value) is used to look up the actual data
+ in the **clustered index** (which is also a B+ tree). The clustered index stores the entire row data in its leaf nodes.
+
+ #### Detailed Process:
+ - **Step 1**: MySQL navigates the secondary index tree based on the query condition (e.g., a range query on `last_name`).
+   - The internal nodes guide the search, and the leaf node contains the `last_name`
+   value and the corresponding primary key (`emp_id`).
+
+ - **Step 2**: Once MySQL reaches the leaf node of the secondary index B+ tree, it retrieves the primary key (`emp_id`).
+
+ - **Step 3**: MySQL uses this primary key to directly access the **clustered index** (the B+ tree for the primary key).
+   - It navigates the primary key B+ tree to locate the row in its leaf nodes, where the full row data
+   (e.g., `emp_id`, `last_name`, `first_name`, `salary`) is stored.
+
+ > **Note**: If multiple results match a query on the secondary index,
+ the leaf nodes of the secondary index B+ tree will store multiple primary keys corresponding to the matching rows.
+
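The two-step lookup above can be sketched in a few lines. This is only a toy model and does not reflect InnoDB's actual implementation: a `dict` stands in for the clustered index, a sorted list of `(last_name, emp_id)` pairs stands in for the leaf level of the secondary index, and the sample rows and the `lookup_by_last_name` helper are made up for illustration.

```python
from bisect import bisect_left, bisect_right

# Clustered index stand-in: primary key (emp_id) -> full row (hypothetical data).
clustered_index = {
    1: {"emp_id": 1, "last_name": "Tan", "first_name": "Alice", "salary": 5000},
    2: {"emp_id": 2, "last_name": "Lim", "first_name": "Bob", "salary": 4200},
    3: {"emp_id": 3, "last_name": "Tan", "first_name": "Carol", "salary": 6100},
    4: {"emp_id": 4, "last_name": "Ng", "first_name": "Dan", "salary": 3900},
}

# Secondary index stand-in: (last_name, emp_id) entries kept in sorted order.
secondary_index = sorted(
    (row["last_name"], pk) for pk, row in clustered_index.items()
)


def lookup_by_last_name(low: str, high: str) -> list:
    """Return rows with low <= last_name <= high using the two-step lookup."""
    # Steps 1 and 2: search the secondary index and collect matching primary keys.
    start = bisect_left(secondary_index, (low,))
    end = bisect_right(secondary_index, (high, float("inf")))
    primary_keys = [pk for _, pk in secondary_index[start:end]]
    # Step 3: use each primary key to fetch the full row from the clustered index.
    return [clustered_index[pk] for pk in primary_keys]


for row in lookup_by_last_name("Tan", "Tan"):
    print(row["emp_id"], row["first_name"], row["last_name"])
```

Two employees share the last name "Tan", so the secondary index yields two primary keys, and each one triggers its own lookup in the clustered index.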
+ </details>

## References
- This description heavily references CS2040S Recitation Sheet 4.
+ CS2040S Recitation Sheet 4.