@@ -104,26 +104,97 @@ Image Source: https://www.geeksforgeeks.org/insert-operation-in-b-tree/
104104The delete operation has a similar idea as the insert operation, but involves a lot more edge cases. If you are
105105interested in learning about it, you can read more [here](https://www.geeksforgeeks.org/delete-operation-in-b-tree/).
106106
107- ## Application
108- There are many uses of B-Trees but the most common is their utility in database management systems in handling large
109- datasets by optimizing disk accesses.
107+ ## Application: Index Structure
108+ In practice, B+ trees are usually preferred over plain B-trees.
109+ The B+ tree is a variant of the B-tree that is optimized for efficient data retrieval from disk
110+ and for range queries.
110111
111- Large amounts of data have to be stored on the disk. But disk I/O operations are slow and not knowing where to look
112- for the data can drastically worsen search time. B-Tree is used as an index structure to efficiently locate the
113- desired data. Note, the B-Tree itself can be partially stored in RAM (higher levels) and partially on disk
114- (lower, less freq accessed levels).
112+ We will discuss two common applications of B+ trees: **database indexing** and **file system indexing**.
115113
116- Consider a database of all the CS modules offered in NUS. Suppose there is a column "Code" (module code) in the
117- "CS Modules" table. If the database has a B-Tree index on the "Code" column, the keys in the B-Tree would be the
118- module code of all CS modules offered.
114+ ---
119115
120- Each key in the B-Tree is associated with a pointer, that points to the location on the disk where the corresponding
121- data can be found. For e.g., a key for "CS2040s" would have a pointer to the disk location(s) where the row(s)
122- (i.e. data) with "CS2040s" is stored. This efficient querying allows the database quickly navigate through the keys
123- and find the disk location of the desired data without having to scan the whole "CS Modules" table.
116+ ### Indexing Structure
124117
125- The choice of t will impact the height of the tree, and hence how fast the query is. Trade-off would be space, as a
126- higher t means more keys in each node, and they would have to be (if not already) loaded to RAM.
118+ B+ trees are often used to efficiently manage large amounts of data stored on disk.
119+ When used as an index, they often do not store the actual data itself but instead store **pointers** (or references)
120+ to where the data is located on the disk.
121+
122+ #### Pointer / Reference
123+ A pointer in the context of a B+ tree refers to some piece of information that can be used to
124+ retrieve actual data from the disk. Some common examples include:
125+ - **Disk address/block number**
126+ - **Filename with offset**
127+ - **Database page and record ID**
128+ - **Primary key ID**
129+
130+ <details >
131+ <summary > <b >File System Indexing</b > </summary >
132+
133+ ### B+ Trees for File System Indexing
134+
135+ File system indexing refers to the process by which an operating system organizes and manages files on
136+ storage media (such as hard drives, SSDs) to enable efficient file retrieval, searching, and management.
137+ It involves creating and maintaining indexes (similar to those in a database) that help quickly locate files,
138+ directories, and their metadata (like file names, attributes, permissions, and timestamps).
139+
140+ #### Workflow:
141+ - The **root node** of a B+ tree is typically stored in **RAM** to speed up access.
142+ - **Nodes** in the tree contain keys and child pointers to other nodes.
143+ - **Intermediate nodes** do not store actual data but guide the search process toward the leaf nodes.
144+ - **Leaf nodes** either contain the actual data or pointers to the data stored on disk.
145+ This is where the data retrieval process ends.
146+
147+ #### Optimized Disk I/O:
148+ B+ trees are optimized for disk I/O, especially for **range queries**.
149+ The tree nodes are designed to fit into disk pages, meaning a single disk read operation can bring in multiple keys
150+ and pointers. This reduces the overall number of disk accesses required and efficiently utilizes disk pages.
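
To make the page-sized-node idea concrete, here is a rough back-of-the-envelope sketch. The page size, key size, and pointer size are illustrative assumptions, not figures from any particular file system, and `levels_needed` is a hypothetical helper.

```python
import math

# Assumed sizes, for illustration only.
PAGE_SIZE = 4096      # bytes per disk page
KEY_SIZE = 8          # bytes per key
POINTER_SIZE = 8      # bytes per child pointer

# Approximate number of (key, pointer) pairs that fit in one page-sized node.
fanout = PAGE_SIZE // (KEY_SIZE + POINTER_SIZE)


def levels_needed(num_keys: int, fanout: int) -> int:
    """Smallest number of levels such that fanout ** levels >= num_keys."""
    levels, capacity = 1, fanout
    while capacity < num_keys:
        capacity *= fanout
        levels += 1
    return levels


btree_levels = levels_needed(1_000_000, fanout)
# A balanced binary tree over the same keys needs about log2(n) levels.
binary_levels = math.ceil(math.log2(1_000_000))
```

Under these assumptions, each node holds 256 entries, so a million keys fit in 3 levels, compared with roughly 20 levels for a balanced binary tree. Each level is at most one page read, which is why a wide fanout matters on disk.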
151+
152+ #### Range Queries:
153+ B+ trees are particularly effective for **range queries**. Since the leaf nodes in a B+ tree are linked together
154+ (typically via a **doubly linked list**), sequential access for range queries is efficient.
155+ For example, in a file system, this allows fetching multiple adjacent keys (like file names in a directory)
156+ by following the leaf links, without traversing the tree from the root again for each key.
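
A minimal sketch of a range scan over linked leaves might look like the following. The `Leaf` class and `range_scan` function are hypothetical names for illustration; for brevity the leaves are linked in the forward direction only, and the descent from the root that would normally find the starting leaf is omitted.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class Leaf:
    """Hypothetical leaf node: sorted keys plus a link to the next leaf."""
    keys: List[str]
    next: Optional["Leaf"] = None


def range_scan(start_leaf: Leaf, low: str, high: str) -> Iterator[str]:
    """Yield keys in [low, high] by walking leaf links, never revisiting the root."""
    leaf: Optional[Leaf] = start_leaf
    while leaf is not None:
        # Skip any keys below `low`, then emit keys until we pass `high`.
        for key in leaf.keys[bisect_left(leaf.keys, low):]:
            if key > high:
                return
            yield key
        leaf = leaf.next


# Three linked leaves holding file names in sorted order (made-up data).
c = Leaf(["notes.txt", "report.pdf"])
b = Leaf(["draft.md", "index.html"], next=c)
a = Leaf(["a.txt", "b.txt"], next=b)

result = list(range_scan(a, "b.txt", "notes.txt"))
```

Here `result` contains the four names from `b.txt` through `notes.txt`, gathered across all three leaves by following `next` links alone.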
157+
158+ </details >
159+
160+ <details >
161+ <summary > <b >SQL Engines</b > </summary >
162+
163+ ### B+ Trees in SQL Engines
164+
165+ In **MySQL**, B+ trees are extensively used in the **InnoDB** storage engine
166+ (the default storage engine for MySQL databases).
167+
168+ #### Primary Key Index (Clustered Index):
169+ In **InnoDB**, every table has a **clustered index**; when the table defines a primary key, that key is the clustered index.
170+ This means the leaf nodes of the B+ tree store the actual rows of the table.
171+ In a clustered index, the rows are physically stored in the order of the primary key,
172+ making retrieval by primary key highly efficient.
173+
174+ #### Secondary Indexes:
175+ For secondary indexes in MySQL (specifically in InnoDB),
176+ a lookup that reaches a leaf node of the secondary index B+ tree proceeds as follows:
177+
178+ 1. **Secondary Index B+ Tree**: The leaf nodes store the indexed column value (e.g., `last_name`)
179+ along with a reference to the primary key (e.g., `emp_id`).
180+ 2. **Reference to Primary Key**: This reference (the primary key value) is used to look up the actual data
181+ in the **clustered index** (which is also a B+ tree). The clustered index stores the entire row data in its leaf nodes.
182+
183+ #### Detailed Process:
184+ - **Step 1**: MySQL navigates the secondary index tree based on the query condition (e.g., a range query on `last_name`).
185+   - The internal nodes guide the search, and the leaf node contains the `last_name`
186+ value and the corresponding primary key (`emp_id`).
187+
188+ - **Step 2**: Once MySQL reaches the leaf node of the secondary index B+ tree, it retrieves the primary key (`emp_id`).
189+
190+ - **Step 3**: MySQL uses this primary key to directly access the **clustered index** (the B+ tree for the primary key).
191+   - It navigates the primary key B+ tree to locate the row in its leaf nodes, where the full row data
192+ (e.g., `emp_id`, `last_name`, `first_name`, `salary`) is stored.
193+
194+ > **Note**: If multiple rows match a query on the secondary index,
195+ the leaf nodes of the secondary index B+ tree hold one entry per matching row, each with that row's primary key.
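
The two-step lookup can be sketched in a few lines of Python. This is only an illustration of the idea, not how InnoDB is implemented: a dictionary stands in for the clustered index, a sorted list of `(last_name, emp_id)` pairs stands in for the secondary index, and the rows and the `find_by_last_name` helper are made up for the example.

```python
from bisect import bisect_left, bisect_right

# Clustered index stand-in: primary key (emp_id) -> full row (made-up data).
clustered_index = {
    1: {"emp_id": 1, "last_name": "Tan", "first_name": "Amy", "salary": 5000},
    2: {"emp_id": 2, "last_name": "Lim", "first_name": "Ben", "salary": 4200},
    3: {"emp_id": 3, "last_name": "Tan", "first_name": "Cal", "salary": 6100},
    4: {"emp_id": 4, "last_name": "Ong", "first_name": "Dee", "salary": 3900},
}

# Secondary index stand-in: sorted (last_name, emp_id) pairs, one per row.
secondary_index = sorted(
    (row["last_name"], pk) for pk, row in clustered_index.items()
)


def find_by_last_name(low: str, high: str) -> list:
    """Return full rows whose last_name lies in [low, high]."""
    # Steps 1-2: find matching secondary entries and collect their primary keys.
    start = bisect_left(secondary_index, (low,))
    end = bisect_right(secondary_index, (high, float("inf")))
    primary_keys = [pk for _, pk in secondary_index[start:end]]
    # Step 3: look up each primary key in the clustered index for the full row.
    return [clustered_index[pk] for pk in primary_keys]


rows = find_by_last_name("Ong", "Tan")
```

Note that the two rows sharing the last name `Tan` appear as two separate secondary entries, each carrying its own `emp_id`, and each one triggers its own lookup in the clustered index.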
196+
197+ </details >
127198
128199## References
129- This description heavily references CS2040S Recitation Sheet 4.
200+ CS2040S Recitation Sheet 4.