Optimize the performance of the RBTree and RzIntervalTree implementations and algorithms

# Optimization Suggestions for Rizin's RBTree & IntervalTree

After analyzing the header and implementation files, here are concrete optimizations ranked by expected impact.

---

## 1. Pack Color Bit into Child Pointer (Highest Impact — Memory & Cache)

The current `RBNode` uses a full `bool` for the red/black color, which with struct padding wastes **7 bytes per node** on 64-bit systems [^3]:

```c
// Current: 24 bytes on 64-bit (2 pointers + bool + 7 bytes padding)
typedef struct rz_rb_node_t {
    struct rz_rb_node_t *child[2];
    bool red;
} RBNode;
```

Since heap-allocated nodes are always at least word-aligned (8-byte on 64-bit), the **lowest bit of a child pointer is always 0** and can store the color. This shrinks `RBNode` from 24 to 16 bytes — a **33% reduction** that dramatically improves cache utilization:

```c
typedef struct rz_rb_node_t {
    uintptr_t child[2]; // low bit of child[0] stores color
} RBNode;

#define RB_RED_BIT  ((uintptr_t)1)

static inline RBNode *rb_child(const RBNode *n, int dir) {
    return (RBNode *)(n->child[dir] & ~RB_RED_BIT);
}

static inline void rb_set_child(RBNode *n, int dir, RBNode *c) {
    n->child[dir] = (n->child[dir] & RB_RED_BIT) | (uintptr_t)c;
}

static inline bool rb_is_red(const RBNode *n) {
    return n && (n->child[0] & RB_RED_BIT);
}

static inline void rb_set_red(RBNode *n, bool red) {
    if (red) {
        n->child[0] |= RB_RED_BIT;
    } else {
        n->child[0] &= ~RB_RED_BIT;
    }
}
```

**Cascading benefits for `RzIntervalNode`** [^2]:

| Field | Before (bytes) | After (bytes) |
|---|---|---|
| `RBNode node` | 24 | 16 |
| `ut64 start` | 8 | 8 |
| `ut64 end` | 8 | 8 |
| `ut64 max_end` | 8 | 8 |
| `void *data` | 8 | 8 |
| **Total** | **56** | **48** |

Going from 56 to 48 bytes means an `RzIntervalNode` now fits in **¾ of a cache line** instead of nearly a full one, and you get ~17% more nodes per cache line during traversals.

---

## 2. Shrink `RBIter` — It's 504 Bytes on the Stack

`RBIter` contains a 62-element pointer array, making it **504 bytes** on 64-bit [^3]. This is returned **by value** from `rz_rbtree_first()`, `rz_rbtree_last()`, and `rz_rbtree_lower_bound_forward()`, and is used on the stack in every `rz_rbtree_foreach` / `rz_interval_tree_foreach` macro [^1][^2]:

```c
// Current: 504 bytes returned by value and copied
typedef struct rz_rb_iter_t {
    int len;
    RBNode *path[RZ_RBTREE_MAX_HEIGHT]; // 62 * 8 = 496 bytes
} RBIter;
```

**Option A — Reduce `RZ_RBTREE_MAX_HEIGHT` to a practical value:**

A red-black tree with `2^31 - 1` nodes (~2 billion) has max height `2 * log2(2^31) = 62`. In practice, Rizin trees rarely exceed millions of nodes. A height of **40** supports up to ~1 trillion nodes and saves 176 bytes:

```c
#define RZ_RBTREE_MAX_HEIGHT 40  // supports up to ~2^20 = 1M nodes with 2× margin
```

**Option B — Use a compact iterator with parent pointers:**

If you add a parent pointer to `RBNode` (or use the bit-packed variant), iterators become 8 bytes instead of 504:

```c
typedef struct rz_rb_iter_t {
    RBNode *current;
} RBIter;

// next via parent pointer: O(1) amortized
static inline void rz_rbtree_iter_next(RBIter *it) {
    RBNode *n = it->current;
    if (rb_child(n, 1)) {
        n = rb_child(n, 1);
        while (rb_child(n, 0)) n = rb_child(n, 0);
        it->current = n;
    } else {
        // walk up via parent until we came from a left child
        RBNode *p = rb_parent(n);
        while (p && n == rb_child(p, 1)) {
            n = p;
            p = rb_parent(p);
        }
        it->current = p;
    }
}
```

> **Trade-off:** Adds 8 bytes per node for the parent pointer, but eliminates the 500-byte stack iterator and makes iteration allocation-free. The parent pointer also simplifies deletion.

---

## 3. Node Slab Allocator for `RContRBTree` and `RzIntervalTree`

Every `rz_rbtree_cont_insert` allocates a single `RContRBNode` via `RZ_NEW0` (which is `calloc(1, sizeof(...))`) [^1]. Similarly, `rz_interval_tree_insert` allocates a single `RzIntervalNode` per insertion [^2]. For trees with thousands of nodes, this means thousands of small, scattered heap allocations.

```c
#define RB_SLAB_SIZE 128

typedef struct rb_slab_t {
    ut8 *nodes;           // block of RB_SLAB_SIZE * node_size bytes
    struct rb_slab_t *next;
} RBSlab;

typedef struct rb_pool_t {
    size_t node_size;
    RBSlab *slabs;
    void *freelist;       // singly-linked list of free nodes
} RBPool;

static inline void *rb_pool_alloc(RBPool *pool) {
    if (RZ_UNLIKELY(!pool->freelist)) {
        RBSlab *slab = malloc(sizeof(RBSlab));
        if (!slab) return NULL;
        slab->nodes = calloc(RB_SLAB_SIZE, pool->node_size);
        if (!slab->nodes) { free(slab); return NULL; }
        slab->next = pool->slabs;
        pool->slabs = slab;
        // Chain nodes onto freelist
        for (size_t i = 0; i < RB_SLAB_SIZE; i++) {
            void *node = slab->nodes + i * pool->node_size;
            *(void **)node = pool->freelist;
            pool->freelist = node;
        }
    }
    void *n = pool->freelist;
    pool->freelist = *(void **)n;
    memset(n, 0, pool->node_size);
    return n;
}

static inline void rb_pool_release(RBPool *pool, void *node) {
    *(void **)node = pool->freelist;
    pool->freelist = node;
}
```

Add an optional `RBPool *pool` to `RContRBTree` and `RzIntervalTree`. This eliminates per-node `malloc`/`free`, keeps nodes contiguous for better cache behavior, and makes `rz_rbtree_cont_free` / `rz_interval_tree_fini` a simple slab chain deallocation.

---

## 4. Branchless Direction Selection in `rz_rbtree_find`

The current find loop has a **three-way branch** (less, greater, equal) [^1]:

```c
// Current: 2 branches per level
while (x) {
    int d = cmp(data, x, user);
    if (d < 0) {
        x = x->child[0];
    } else if (d > 0) {
        x = x->child[1];
    } else {
        return x;
    }
}
```

**Convert to a branchless two-way descent with deferred equality check:**

```c
RZ_API RBNode *rz_rbtree_find(RBNode *x, void *data, RBComparator cmp, void *user) {
    RBNode *result = NULL;
    while (x) {
        int d = cmp(data, x, user);
        if (d == 0) {
            result = x; // record match, but could continue for leftmost/rightmost
        }
        // Branchless: d < 0 → dir=0, d >= 0 → dir=1
        int dir = (d > 0); // or: d >= 0 if you want leftmost match
        x = x->child[dir];
        // Prefetch next level
        if (RZ_LIKELY(x)) {
            __builtin_prefetch(x->child[0], 0, 1);
            __builtin_prefetch(x->child[1], 0, 1);
        }
    }
    return result;
}
```

This eliminates a branch per tree level. On a tree of depth 20, that's **20 fewer branch mispredictions** in the worst case. The prefetch hides the pointer-chasing latency by loading the next level's children while the comparator runs.

Apply the same pattern to `rz_rbtree_lower_bound` and `rz_rbtree_upper_bound` [^1].

---

## 5. `__builtin_prefetch` in Traversal and Query Functions

Tree pointer chasing is the **main bottleneck** — each node is at a random memory address. Prefetching the next node(s) while processing the current one hides ~60–100 ns of DRAM latency.

**In `_next` (iterator advance)** [^1]:

```c
static inline void _next(RBIter *it, int dir) {
    RBNode *x = it->path[--it->len];
    x = x->child[!dir];
    if (x) {
        __builtin_prefetch(x->child[dir], 0, 1);
    }
    for (; x; x = x->child[dir]) {
        it->path[it->len++] = x;
        if (x->child[dir]) {
            __builtin_prefetch(x->child[dir]->child[dir], 0, 1);
        }
    }
}
```

**In interval tree queries** (`all_in`, `all_intersect`), prefetch both children before descending:

```c
// Before descending into a subtree during interval query:
static bool _all_in(RBNode *node, ut64 value, bool end_inclusive,
                     RzIntervalIterCb cb, void *user) {
    if (!node) return true;
    RzIntervalNode *intervalnode = container_of(node, RzIntervalNode, node);

    // Early termination via max_end augmentation
    if (value > intervalnode->max_end) return true;
    if (!end_inclusive && value >= intervalnode->max_end) return true;

    // Prefetch children while we process this node
    if (node->child[0]) __builtin_prefetch(node->child[0], 0, 1);
    if (node->child[1]) __builtin_prefetch(node->child[1], 0, 1);

    // ... recurse / process
}
```

---

## 6. Convert Recursive `rz_rbtree_free` to Iterative

The tree free function is recursive, adding function call overhead per node. Convert to an iterative post-order traversal using the existing path stack:

```c
RZ_API void rz_rbtree_free(RZ_NULLABLE RBNode *root, RBNodeFree freefn, void *user) {
    if (!root) return;

    // Morris-like iterative post-order traversal
    RBNode *stack[RZ_RBTREE_MAX_HEIGHT];
    int top = 0;
    RBNode *current = root;
    RBNode *last_visited = NULL;

    while (current || top > 0) {
        while (current) {
            stack[top++] = current;
            current = current->child[0];
        }
        RBNode *peek = stack[top - 1];
        if (peek->child[1] && peek->child[1] != last_visited) {
            current = peek->child[1];
        } else {
            top--;
            last_visited = peek;
            if (freefn) {
                freefn(peek, user);
            }
        }
    }
}
```

Eliminates ~20 function call frames for a typical tree, with better branch prediction due to the tight loop.

---

## 7. Inline the `sum` Callback for IntervalTree

The `RBNodeSum` callback is called on **every rotation** and on the **entire path back to root** after insertion/deletion [^1]. For the interval tree, this is always the `max_end` update:

```c
static void interval_node_sum(RBNode *node) {
    RzIntervalNode *n = container_of(node, RzIntervalNode, node);
    n->max_end = n->end;
    if (node->child[0]) {
        ut64 cm = container_of(node->child[0], RzIntervalNode, node)->max_end;
        if (cm > n->max_end) n->max_end = cm;
    }
    if (node->child[1]) {
        ut64 cm = container_of(node->child[1], RzIntervalNode, node)->max_end;
        if (cm > n->max_end) n->max_end = cm;
    }
}
```

This is called via function pointer, preventing inlining. **Create specialized insert/delete functions** for the interval tree that inline the sum computation:

```c
// intervaltree.c — specialized insert that inlines the sum
static inline void interval_update_max(RBNode *node) {
    RzIntervalNode *n = container_of(node, RzIntervalNode, node);
    ut64 m = n->end;
    for (int i = 0; i < 2; i++) {
        if (node->child[i]) {
            ut64 cm = container_of(node->child[i], RzIntervalNode, node)->max_end;
            m = cm > m ? cm : m;
        }
    }
    n->max_end = m;
}

// Use CMOV instead of branch for max:
static inline ut64 u64_max(ut64 a, ut64 b) {
    return a > b ? a : b; // compiler emits cmov
}

static inline void interval_update_max_branchless(RBNode *node) {
    RzIntervalNode *n = container_of(node, RzIntervalNode, node);
    ut64 m = n->end;
    if (node->child[0]) m = u64_max(m, container_of(node->child[0], RzIntervalNode, node)->max_end);
    if (node->child[1]) m = u64_max(m, container_of(node->child[1], RzIntervalNode, node)->max_end);
    n->max_end = m;
}
```

Either create `rz_rbtree_aug_insert_interval()` as a template-like specialization, or use a **macro-generated** version with the sum logic inlined. This eliminates ~20–40 indirect calls per insert/delete.

---

## 8. Bulk Construction from Sorted Data — O(n) Instead of O(n log n)

When building a tree from a known sorted sequence (e.g., loading analysis metadata), inserting one-by-one is O(n log n). A **bottom-up bulk construction** builds a balanced red-black tree in O(n):

```c
// Build a perfectly balanced RB tree from a sorted array of nodes
static RBNode *build_balanced(RBNode **nodes, size_t count, int depth, RBNodeSum sum) {
    if (count == 0) return NULL;
    size_t mid = count / 2;
    RBNode *root = nodes[mid];
    root->child[0] = build_balanced(nodes, mid, depth + 1, sum);
    root->child[1] = build_balanced(nodes + mid + 1, count - mid - 1, depth + 1, sum);
    // Color: nodes at maximum depth are red, others black
    root->red = (depth > 0 && count <= 2);
    if (sum) sum(root);
    return root;
}

RZ_API void rz_rbtree_build_sorted(RBNode **root, RBNode **nodes,
                                     size_t count, RBNodeSum sum) {
    *root = build_balanced(nodes, count, 0, sum);
    if (*root) (*root)->red = false;
}
```

**For the interval tree**, this is particularly useful when loading binary metadata — all intervals can be sorted by `start`, then built into a tree in one O(n) pass with correct `max_end` propagation.

---

## 9. Optimize Interval Tree `all_in` / `all_intersect` with Stack-Based Iteration

The current query functions are recursive, which adds call overhead per node visited [^2]. Convert to explicit stack:

```c
RZ_API bool rz_interval_tree_all_in(RzIntervalTree *tree, ut64 value,
                                      bool end_inclusive,
                                      RzIntervalIterCb cb, void *user) {
    if (!tree->root) return true;

    RBNode *stack[RZ_RBTREE_MAX_HEIGHT];
    int top = 0;
    stack[top++] = &tree->root->node;

    while (top > 0) {
        RBNode *node = stack[--top];
        RzIntervalNode *n = container_of(node, RzIntervalNode, node);

        // Prune: if max_end < value, no descendant can contain value
        bool can_contain = end_inclusive ? (n->max_end >= value) : (n->max_end > value);
        if (!can_contain) continue;

        // Check this node
        if (n->start <= value) {
            bool in_range = end_inclusive ? (value <= n->end) : (value < n->end);
            if (in_range) {
                if (!cb(n, user)) return false;
            }
        }

        // Push children (right first so left is processed first)
        if (node->child[1]) {
            __builtin_prefetch(node->child[1], 0, 1);
            stack[top++] = node->child[1];
        }
        // Left child: only if start <= value (BST property)
        if (node->child[0] && n->start >= value) {
            // Left subtree might have nodes with start <= value
        }
        if (node->child[0]) {
            __builtin_prefetch(node->child[0], 0, 1);
            stack[top++] = node->child[0];
        }
    }
    return true;
}
```

This eliminates recursive function call overhead and allows prefetching both children before descending.

---

## 10. `rz_interval_tree_resize` — Avoid Delete+Reinsert When Only `end` Changes

The documentation notes this is already optimized for end-only changes [^2], but make sure the implementation only **walks up the path updating `max_end`** without a full delete+reinsert:

```c
RZ_API bool rz_interval_tree_resize(RzIntervalTree *tree, RzIntervalNode *node,
                                      ut64 new_start, ut64 new_end) {
    if (node->start == new_start) {
        // Only end changed — just update max_end up the path
        node->end = new_end;
        // Walk ancestors and update max_end
        rz_rbtree_aug_update_sum(
            tree->root ? &tree->root->node : NULL,
            &node->start,
            &node->node,
            interval_cmp,
            NULL,
            interval_node_sum
        );
        return true;
    }
    // Start changed — must delete and reinsert
    void *data = node->data;
    rz_interval_tree_delete(tree, node, false);
    return rz_interval_tree_insert(tree, new_start, new_end, data);
}
```

This avoids the O(log n) delete + O(log n) insert when only the end changes, replacing it with a single O(log n) sum update.

---

## 11. SIMD for Batch Interval Queries

When querying many points against the same interval tree (e.g., "which analysis metadata covers each of these 1000 addresses?"), you can vectorize the **initial descent** by comparing multiple query values against a node's `start`/`max_end` simultaneously:

```c
#if defined(__AVX2__)
#include <immintrin.h>

// Check 4 query values at once against a single node's pruning condition
static inline __m256i interval_prune_4(
    __m256i queries,        // 4 × ut64 query values
    __m256i node_max_end,   // broadcast of node->max_end
    __m256i node_start      // broadcast of node->start
) {
    // mask = (query <= max_end) & (query >= start)
    // Use unsigned comparison via XOR with sign bit
    __m256i sign = _mm256_set1_epi64x((int64_t)0x8000000000000000ULL);
    __m256i q_signed = _mm256_xor_si256(queries, sign);
    __m256i max_signed = _mm256_xor_si256(node_max_end, sign);
    __m256i start_signed = _mm256_xor_si256(node_start, sign);

    __m256i le_max = _mm256_cmpgt_epi64(max_signed, q_signed);  // max_end > query
    __m256i ge_start = _mm256_cmpgt_epi64(q_signed, start_signed); // query > start
    // OR with equality...
    return _mm256_and_si256(le_max, ge_start);
}
#endif
```

This is most beneficial for **batch stabbing queries** — checking many addresses against the tree. If none of 4 queries can match a subtree, you prune it once instead of 4 times.

---

## 12. `__builtin_expect` on Depth Overflow Checks

The depth overflow checks in `rz_rbtree_aug_insert` and `rz_rbtree_aug_delete` are on the hot path [^1]:

```c
// Current: checked every iteration
if (dep >= RZ_RBTREE_MAX_HEIGHT) {
    RZ_LOG_ERROR("Red-black tree depth is too big\n");
    break;
}
```

**Mark as unlikely:**

```c
if (RZ_UNLIKELY(dep >= RZ_RBTREE_MAX_HEIGHT)) {
    RZ_LOG_ERROR("Red-black tree depth is too big\n");
    break;
}
```

This ensures the branch predictor assumes the check fails, keeping the hot path straight-line.

---

## 13. Avoid Wrapper Overhead in `RContRBTree`

`rz_rbtree_cont_insert` wraps the user's comparator in `cont_rbtree_cmp_wrapper`, adding an extra indirect call and a struct allocation on the stack per comparison [^1]:

```c
// Current: wrapper adds indirection
RCRBCmpWrap cmp_wrap = { cmp, NULL, user };
rz_rbtree_aug_insert(&root_node, incoming_node,
    &incoming_node->node, cont_rbtree_cmp_wrapper, &cmp_wrap, NULL);
```

**Optimize: Store the comparator in the tree and call it directly in a specialized insert:**

```c
typedef struct rz_containing_rb_tree_t {
    RContRBNode *root;
    RContRBFree free;
    RContRBCmp cmp;        // store once
    void *cmp_user;        // store once
} RContRBTree;

// Direct comparator without wrapper
static int cont_rb_cmp_direct(const void *incoming, const RBNode *in_tree, void *user) {
    RContRBTree *tree = (RContRBTree *)user;
    RContRBNode *a = (RContRBNode *)incoming;
    RContRBNode *b = container_of(in_tree, RContRBNode, node);
    return tree->cmp(a->data, b->data, tree->cmp_user);
}
```

This eliminates one level of function pointer indirection per comparison.

---

## Summary Table

| # | Optimization | Complexity | Expected Impact |
|---|---|---|---|
| 1 | Pack color bit into pointer LSB | Medium | **33% smaller nodes**, much better cache |
| 2 | Shrink / redesign `RBIter` | Medium | **500 → 8 bytes** per iterator |
| 3 | Node slab allocator | Medium | **5–20×** for insert-heavy workloads |
| 4 | Branchless find + prefetch children | Easy | **~20% faster** lookups |
| 5 | `__builtin_prefetch` in traversal | Easy | **10–30%** for cold-cache queries |
| 6 | Iterative `rz_rbtree_free` | Easy | Eliminates recursive call overhead |
| 7 | Inline `sum` for interval tree | Medium | **Eliminates ~40 indirect calls** per insert/delete |
| 8 | Bulk O(n) construction | Medium | **O(n) vs O(n log n)** for batch builds |
| 9 | Stack-based interval queries | Easy | **Eliminates recursion** + enables prefetch |
| 10 | Smarter `resize` for end-only changes | Easy | **Halves cost** of common resize case |
| 11 | SIMD batch interval queries | High | **4× throughput** for batch stabbing queries |
| 12 | `__builtin_expect` on depth checks | Trivial | **5–10%** on hot paths |
| 13 | Eliminate comparator wrapper | Easy | **~15% fewer** indirect calls |

The **highest-impact changes** are:
- **1 (color bit packing)** — affects every single node in every tree, improving cache behavior across the entire codebase
- **3 (slab allocator)** — eliminates per-node `malloc` which dominates insertion cost
- **7 (inline sum)** — the `sum` function pointer is called dozens of times per tree mutation and can never be inlined by the compiler through the pointer



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Optimize the performance of the RBTree and RzIntervalTree implementations and algorithms #6097

Optimization Suggestions for Rizin's RBTree & IntervalTree

1. Pack Color Bit into Child Pointer (Highest Impact — Memory & Cache)

2. Shrink `RBIter` — It's 504 Bytes on the Stack

3. Node Slab Allocator for `RContRBTree` and `RzIntervalTree`

4. Branchless Direction Selection in `rz_rbtree_find`

5. `__builtin_prefetch` in Traversal and Query Functions

6. Convert Recursive `rz_rbtree_free` to Iterative

7. Inline the `sum` Callback for IntervalTree

8. Bulk Construction from Sorted Data — O(n) Instead of O(n log n)

9. Optimize Interval Tree `all_in` / `all_intersect` with Stack-Based Iteration

10. `rz_interval_tree_resize` — Avoid Delete+Reinsert When Only `end` Changes

11. SIMD for Batch Interval Queries

12. `__builtin_expect` on Depth Overflow Checks

13. Avoid Wrapper Overhead in `RContRBTree`

Summary Table

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Field	Before (bytes)	After (bytes)
`RBNode node`	24	16
`ut64 start`	8	8
`ut64 end`	8	8
`ut64 max_end`	8	8
`void *data`	8	8
Total	56	48

#	Optimization	Complexity	Expected Impact
1	Pack color bit into pointer LSB	Medium	33% smaller nodes, much better cache
2	Shrink / redesign `RBIter`	Medium	500 → 8 bytes per iterator
3	Node slab allocator	Medium	5–20× for insert-heavy workloads
4	Branchless find + prefetch children	Easy	~20% faster lookups
5	`__builtin_prefetch` in traversal	Easy	10–30% for cold-cache queries
6	Iterative `rz_rbtree_free`	Easy	Eliminates recursive call overhead
7	Inline `sum` for interval tree	Medium	Eliminates ~40 indirect calls per insert/delete
8	Bulk O(n) construction	Medium	O(n) vs O(n log n) for batch builds
9	Stack-based interval queries	Easy	Eliminates recursion + enables prefetch
10	Smarter `resize` for end-only changes	Easy	Halves cost of common resize case
11	SIMD batch interval queries	High	4× throughput for batch stabbing queries
12	`__builtin_expect` on depth checks	Trivial	5–10% on hot paths
13	Eliminate comparator wrapper	Easy	~15% fewer indirect calls

Uh oh!

Optimize the performance of the RBTree and RzIntervalTree implementations and algorithms #6097

Description

Optimization Suggestions for Rizin's RBTree & IntervalTree

1. Pack Color Bit into Child Pointer (Highest Impact — Memory & Cache)

2. Shrink RBIter — It's 504 Bytes on the Stack

3. Node Slab Allocator for RContRBTree and RzIntervalTree

4. Branchless Direction Selection in rz_rbtree_find

5. __builtin_prefetch in Traversal and Query Functions

6. Convert Recursive rz_rbtree_free to Iterative

7. Inline the sum Callback for IntervalTree

8. Bulk Construction from Sorted Data — O(n) Instead of O(n log n)

9. Optimize Interval Tree all_in / all_intersect with Stack-Based Iteration

10. rz_interval_tree_resize — Avoid Delete+Reinsert When Only end Changes

11. SIMD for Batch Interval Queries

12. __builtin_expect on Depth Overflow Checks

13. Avoid Wrapper Overhead in RContRBTree

Summary Table

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

2. Shrink `RBIter` — It's 504 Bytes on the Stack

3. Node Slab Allocator for `RContRBTree` and `RzIntervalTree`

4. Branchless Direction Selection in `rz_rbtree_find`

5. `__builtin_prefetch` in Traversal and Query Functions

6. Convert Recursive `rz_rbtree_free` to Iterative

7. Inline the `sum` Callback for IntervalTree

9. Optimize Interval Tree `all_in` / `all_intersect` with Stack-Based Iteration

10. `rz_interval_tree_resize` — Avoid Delete+Reinsert When Only `end` Changes

12. `__builtin_expect` on Depth Overflow Checks

13. Avoid Wrapper Overhead in `RContRBTree`