Optimization Suggestions for Rizin's RBTree & IntervalTree
After analyzing the header and implementation files, here are concrete optimizations ranked by expected impact.
1. Pack Color Bit into Child Pointer (Highest Impact — Memory & Cache)
The current RBNode uses a full bool for the red/black color, which with struct padding wastes 7 bytes per node on 64-bit systems [^3]:
```c
// Current: 24 bytes on 64-bit (2 pointers + bool + 7 bytes padding)
typedef struct rz_rb_node_t {
	struct rz_rb_node_t *child[2];
	bool red;
} RBNode;
```
Since heap-allocated nodes are always at least word-aligned (8-byte on 64-bit), the lowest bit of a child pointer is always 0 and can store the color. This shrinks RBNode from 24 to 16 bytes — a 33% reduction that dramatically improves cache utilization:
```c
typedef struct rz_rb_node_t {
	uintptr_t child[2]; // low bit of child[0] stores color
} RBNode;

#define RB_RED_BIT ((uintptr_t)1)

static inline RBNode *rb_child(const RBNode *n, int dir) {
	return (RBNode *)(n->child[dir] & ~RB_RED_BIT);
}

static inline void rb_set_child(RBNode *n, int dir, RBNode *c) {
	n->child[dir] = (n->child[dir] & RB_RED_BIT) | (uintptr_t)c;
}

static inline bool rb_is_red(const RBNode *n) {
	return n && (n->child[0] & RB_RED_BIT);
}

static inline void rb_set_red(RBNode *n, bool red) {
	if (red) {
		n->child[0] |= RB_RED_BIT;
	} else {
		n->child[0] &= ~RB_RED_BIT;
	}
}
```
Cascading benefits for RzIntervalNode [^2]:
| Field | Before (bytes) | After (bytes) |
| --- | --- | --- |
| `RBNode node` | 24 | 16 |
| `ut64 start` | 8 | 8 |
| `ut64 end` | 8 | 8 |
| `ut64 max_end` | 8 | 8 |
| `void *data` | 8 | 8 |
| **Total** | 56 | 48 |
Going from 56 to 48 bytes means an RzIntervalNode now fits in ¾ of a cache line instead of nearly a full one, and you get ~17% more nodes per cache line during traversals.
2. Shrink RBIter — It's 504 Bytes on the Stack
RBIter contains a 62-element pointer array, making it 504 bytes on 64-bit [^3]. This is returned by value from rz_rbtree_first(), rz_rbtree_last(), and rz_rbtree_lower_bound_forward(), and is used on the stack in every rz_rbtree_foreach / rz_interval_tree_foreach macro [^1][^2]:
```c
// Current: 504 bytes returned by value and copied
typedef struct rz_rb_iter_t {
	int len;
	RBNode *path[RZ_RBTREE_MAX_HEIGHT]; // 62 * 8 = 496 bytes
} RBIter;
```
Option A — Reduce RZ_RBTREE_MAX_HEIGHT to a practical value:
A red-black tree needs height at most 2 * log2(n + 1), so the current limit of 62 covers 2^31 - 1 nodes (~2 billion). In practice, Rizin trees rarely exceed a few million nodes. A height of 40 still supports ~2^20 ≈ 1M nodes and saves 176 bytes per iterator:

```c
#define RZ_RBTREE_MAX_HEIGHT 40 // supports up to ~2^20 = 1M nodes with 2x margin
```
Option B — Use a compact iterator with parent pointers:
If you add a parent pointer to RBNode (or use the bit-packed variant), iterators become 8 bytes instead of 504:
```c
typedef struct rz_rb_iter_t {
	RBNode *current;
} RBIter;

// next via parent pointer: O(1) amortized
static inline void rz_rbtree_iter_next(RBIter *it) {
	RBNode *n = it->current;
	if (rb_child(n, 1)) {
		n = rb_child(n, 1);
		while (rb_child(n, 0)) {
			n = rb_child(n, 0);
		}
		it->current = n;
	} else {
		// walk up via parent until we came from a left child
		RBNode *p = rb_parent(n);
		while (p && n == rb_child(p, 1)) {
			n = p;
			p = rb_parent(p);
		}
		it->current = p;
	}
}
```
Trade-off: Adds 8 bytes per node for the parent pointer, but eliminates the 500-byte stack iterator and makes iteration allocation-free. The parent pointer also simplifies deletion.
3. Node Slab Allocator for RContRBTree and RzIntervalTree
Every rz_rbtree_cont_insert allocates a single RContRBNode via RZ_NEW0 (which is calloc(1, sizeof(...))) [^1]. Similarly, rz_interval_tree_insert allocates a single RzIntervalNode per insertion [^2]. For trees with thousands of nodes, this means thousands of small, scattered heap allocations.
```c
#define RB_SLAB_SIZE 128

typedef struct rb_slab_t {
	ut8 *nodes; // block of RB_SLAB_SIZE * node_size bytes
	struct rb_slab_t *next;
} RBSlab;

typedef struct rb_pool_t {
	size_t node_size;
	RBSlab *slabs;
	void *freelist; // singly-linked list of free nodes
} RBPool;

static inline void *rb_pool_alloc(RBPool *pool) {
	if (RZ_UNLIKELY(!pool->freelist)) {
		RBSlab *slab = malloc(sizeof(RBSlab));
		if (!slab) {
			return NULL;
		}
		slab->nodes = calloc(RB_SLAB_SIZE, pool->node_size);
		if (!slab->nodes) {
			free(slab);
			return NULL;
		}
		slab->next = pool->slabs;
		pool->slabs = slab;
		// Chain nodes onto freelist
		for (size_t i = 0; i < RB_SLAB_SIZE; i++) {
			void *node = slab->nodes + i * pool->node_size;
			*(void **)node = pool->freelist;
			pool->freelist = node;
		}
	}
	void *n = pool->freelist;
	pool->freelist = *(void **)n;
	memset(n, 0, pool->node_size);
	return n;
}

static inline void rb_pool_release(RBPool *pool, void *node) {
	*(void **)node = pool->freelist;
	pool->freelist = node;
}
```
Add an optional RBPool *pool to RContRBTree and RzIntervalTree. This eliminates per-node malloc/free, keeps nodes contiguous for better cache behavior, and makes rz_rbtree_cont_free / rz_interval_tree_fini a simple slab chain deallocation.
4. Branchless Direction Selection in rz_rbtree_find
The current find loop has a three-way branch (less, greater, equal) [^1]:
```c
// Current: 2 branches per level
while (x) {
	int d = cmp(data, x, user);
	if (d < 0) {
		x = x->child[0];
	} else if (d > 0) {
		x = x->child[1];
	} else {
		return x;
	}
}
```
Convert to a branchless two-way descent with deferred equality check:
```c
RZ_API RBNode *rz_rbtree_find(RBNode *x, void *data, RBComparator cmp, void *user) {
	RBNode *result = NULL;
	while (x) {
		int d = cmp(data, x, user);
		if (d == 0) {
			result = x; // record the match, but keep descending for the leftmost one
		}
		// Branchless: d <= 0 → dir=0, d > 0 → dir=1 (ties go left → leftmost match)
		int dir = (d > 0); // use (d >= 0) instead for the rightmost match
		x = x->child[dir];
		// Prefetch next level
		if (RZ_LIKELY(x)) {
			__builtin_prefetch(x->child[0], 0, 1);
			__builtin_prefetch(x->child[1], 0, 1);
		}
	}
	return result;
}
```
This eliminates a branch per tree level. On a tree of depth 20, that's 20 fewer branch mispredictions in the worst case. The prefetch hides the pointer-chasing latency by loading the next level's children while the comparator runs.
Apply the same pattern to rz_rbtree_lower_bound and rz_rbtree_upper_bound [^1].
5. __builtin_prefetch in Traversal and Query Functions
Tree pointer chasing is the main bottleneck — each node is at a random memory address. Prefetching the next node(s) while processing the current one hides ~60–100 ns of DRAM latency.
In _next (iterator advance) [^1]:
```c
static inline void _next(RBIter *it, int dir) {
	RBNode *x = it->path[--it->len];
	x = x->child[!dir];
	if (x) {
		__builtin_prefetch(x->child[dir], 0, 1);
	}
	for (; x; x = x->child[dir]) {
		it->path[it->len++] = x;
		if (x->child[dir]) {
			__builtin_prefetch(x->child[dir]->child[dir], 0, 1);
		}
	}
}
```
In interval tree queries (all_in, all_intersect), prefetch both children before descending:
```c
// Before descending into a subtree during interval query:
static bool _all_in(RBNode *node, ut64 value, bool end_inclusive,
		RzIntervalIterCb cb, void *user) {
	if (!node) {
		return true;
	}
	RzIntervalNode *intervalnode = container_of(node, RzIntervalNode, node);
	// Early termination via max_end augmentation
	if (value > intervalnode->max_end) {
		return true;
	}
	if (!end_inclusive && value >= intervalnode->max_end) {
		return true;
	}
	// Prefetch children while we process this node
	if (node->child[0]) {
		__builtin_prefetch(node->child[0], 0, 1);
	}
	if (node->child[1]) {
		__builtin_prefetch(node->child[1], 0, 1);
	}
	// ... recurse / process
}
```
6. Convert Recursive rz_rbtree_free to Iterative
The tree free function is recursive, adding function call overhead per node. Convert to an iterative post-order traversal using the existing path stack:
```c
RZ_API void rz_rbtree_free(RZ_NULLABLE RBNode *root, RBNodeFree freefn, void *user) {
	if (!root) {
		return;
	}
	// Iterative post-order traversal with an explicit stack
	RBNode *stack[RZ_RBTREE_MAX_HEIGHT];
	int top = 0;
	RBNode *current = root;
	RBNode *last_visited = NULL;
	while (current || top > 0) {
		while (current) {
			stack[top++] = current;
			current = current->child[0];
		}
		RBNode *peek = stack[top - 1];
		if (peek->child[1] && peek->child[1] != last_visited) {
			current = peek->child[1];
		} else {
			top--;
			last_visited = peek;
			if (freefn) {
				freefn(peek, user);
			}
		}
	}
}
```
Eliminates one function call per node freed and caps stack usage at the fixed path array, with better branch prediction due to the tight loop.
7. Inline the sum Callback for IntervalTree
The RBNodeSum callback is called on every rotation and on the entire path back to root after insertion/deletion [^1]. For the interval tree, this is always the max_end update:
```c
static void interval_node_sum(RBNode *node) {
	RzIntervalNode *n = container_of(node, RzIntervalNode, node);
	n->max_end = n->end;
	if (node->child[0]) {
		ut64 cm = container_of(node->child[0], RzIntervalNode, node)->max_end;
		if (cm > n->max_end) {
			n->max_end = cm;
		}
	}
	if (node->child[1]) {
		ut64 cm = container_of(node->child[1], RzIntervalNode, node)->max_end;
		if (cm > n->max_end) {
			n->max_end = cm;
		}
	}
}
```
This is called via function pointer, preventing inlining. Create specialized insert/delete functions for the interval tree that inline the sum computation:
```c
// intervaltree.c — specialized helper that inlines the sum
static inline void interval_update_max(RBNode *node) {
	RzIntervalNode *n = container_of(node, RzIntervalNode, node);
	ut64 m = n->end;
	for (int i = 0; i < 2; i++) {
		if (node->child[i]) {
			ut64 cm = container_of(node->child[i], RzIntervalNode, node)->max_end;
			m = cm > m ? cm : m;
		}
	}
	n->max_end = m;
}

// Use a conditional move instead of a branch for max:
static inline ut64 u64_max(ut64 a, ut64 b) {
	return a > b ? a : b; // compilers typically emit cmov for this
}

static inline void interval_update_max_branchless(RBNode *node) {
	RzIntervalNode *n = container_of(node, RzIntervalNode, node);
	ut64 m = n->end;
	if (node->child[0]) {
		m = u64_max(m, container_of(node->child[0], RzIntervalNode, node)->max_end);
	}
	if (node->child[1]) {
		m = u64_max(m, container_of(node->child[1], RzIntervalNode, node)->max_end);
	}
	n->max_end = m;
}
```
Either create rz_rbtree_aug_insert_interval() as a template-like specialization, or use a macro-generated version with the sum logic inlined. This eliminates ~20–40 indirect calls per insert/delete.
8. Bulk Construction from Sorted Data — O(n) Instead of O(n log n)
When building a tree from a known sorted sequence (e.g., loading analysis metadata), inserting one-by-one is O(n log n). A bottom-up bulk construction builds a balanced red-black tree in O(n):
```c
// Build a perfectly balanced RB tree from a sorted array of nodes.
// red_depth = floor(log2(count)): only nodes on the deepest level are
// colored red, which keeps the black-height equal on every path.
static RBNode *build_balanced(RBNode **nodes, size_t count, int depth,
		int red_depth, RBNodeSum sum) {
	if (count == 0) {
		return NULL;
	}
	size_t mid = count / 2;
	RBNode *root = nodes[mid];
	root->child[0] = build_balanced(nodes, mid, depth + 1, red_depth, sum);
	root->child[1] = build_balanced(nodes + mid + 1, count - mid - 1, depth + 1, red_depth, sum);
	root->red = (depth == red_depth);
	if (sum) {
		sum(root);
	}
	return root;
}

RZ_API void rz_rbtree_build_sorted(RBNode **root, RBNode **nodes,
		size_t count, RBNodeSum sum) {
	int red_depth = 0;
	for (size_t c = count; c > 1; c >>= 1) {
		red_depth++; // floor(log2(count))
	}
	*root = build_balanced(nodes, count, 0, red_depth, sum);
	if (*root) {
		(*root)->red = false;
	}
}
```
For the interval tree, this is particularly useful when loading binary metadata — all intervals can be sorted by start, then built into a tree in one O(n) pass with correct max_end propagation.
9. Optimize Interval Tree all_in / all_intersect with Stack-Based Iteration
The current query functions are recursive, which adds call overhead per node visited [^2]. Convert to explicit stack:
```c
RZ_API bool rz_interval_tree_all_in(RzIntervalTree *tree, ut64 value,
		bool end_inclusive, RzIntervalIterCb cb, void *user) {
	if (!tree->root) {
		return true;
	}
	RBNode *stack[RZ_RBTREE_MAX_HEIGHT];
	int top = 0;
	stack[top++] = &tree->root->node;
	while (top > 0) {
		RBNode *node = stack[--top];
		RzIntervalNode *n = container_of(node, RzIntervalNode, node);
		// Prune: if max_end < value, no descendant can contain value
		bool can_contain = end_inclusive ? (n->max_end >= value) : (n->max_end > value);
		if (!can_contain) {
			continue;
		}
		// Check this node
		if (n->start <= value) {
			bool in_range = end_inclusive ? (value <= n->end) : (value < n->end);
			if (in_range && !cb(n, user)) {
				return false;
			}
		}
		// Right subtree: every start there is >= n->start (BST property),
		// so it can only contain value if n->start <= value.
		// Push right first so the left subtree is processed first.
		if (node->child[1] && n->start <= value) {
			__builtin_prefetch(node->child[1], 0, 1);
			stack[top++] = node->child[1];
		}
		// Left subtree may always hold intervals starting at or before value
		if (node->child[0]) {
			__builtin_prefetch(node->child[0], 0, 1);
			stack[top++] = node->child[0];
		}
	}
	return true;
}
```
This eliminates recursive function call overhead and allows prefetching both children before descending.
10. rz_interval_tree_resize — Avoid Delete+Reinsert When Only end Changes
The documentation notes this is already optimized for end-only changes [^2], but make sure the implementation only walks up the path updating max_end without a full delete+reinsert:
```c
RZ_API bool rz_interval_tree_resize(RzIntervalTree *tree, RzIntervalNode *node,
		ut64 new_start, ut64 new_end) {
	if (node->start == new_start) {
		// Only end changed — just update max_end up the path
		node->end = new_end;
		// Walk ancestors and update max_end
		rz_rbtree_aug_update_sum(
			tree->root ? &tree->root->node : NULL,
			&node->start,
			&node->node,
			interval_cmp,
			NULL,
			interval_node_sum);
		return true;
	}
	// Start changed — must delete and reinsert
	void *data = node->data;
	rz_interval_tree_delete(tree, node, false);
	return rz_interval_tree_insert(tree, new_start, new_end, data);
}
```
This avoids the O(log n) delete + O(log n) insert when only the end changes, replacing it with a single O(log n) sum update.
11. SIMD for Batch Interval Queries
When querying many points against the same interval tree (e.g., "which analysis metadata covers each of these 1000 addresses?"), you can vectorize the initial descent by comparing multiple query values against a node's start/max_end simultaneously:
```c
#if defined(__AVX2__)
#include <immintrin.h>

// Check 4 query values at once against a single node's pruning condition:
// match = (start <= query) && (query <= max_end)
static inline __m256i interval_prune_4(
		__m256i queries, // 4 x ut64 query values
		__m256i node_max_end, // broadcast of node->max_end
		__m256i node_start // broadcast of node->start
) {
	// AVX2 only has signed 64-bit compares; XOR with the sign bit maps
	// unsigned order onto signed order.
	__m256i sign = _mm256_set1_epi64x((int64_t)0x8000000000000000ULL);
	__m256i q = _mm256_xor_si256(queries, sign);
	__m256i hi = _mm256_xor_si256(node_max_end, sign);
	__m256i lo = _mm256_xor_si256(node_start, sign);
	// Express <= as NOT(>): gt_hi = query > max_end, lt_lo = start > query
	__m256i gt_hi = _mm256_cmpgt_epi64(q, hi);
	__m256i lt_lo = _mm256_cmpgt_epi64(lo, q);
	// match = !(gt_hi || lt_lo)
	return _mm256_andnot_si256(_mm256_or_si256(gt_hi, lt_lo), _mm256_set1_epi64x(-1));
}
#endif
```
This is most beneficial for batch stabbing queries — checking many addresses against the tree. If none of 4 queries can match a subtree, you prune it once instead of 4 times.
12. __builtin_expect on Depth Overflow Checks
The depth overflow checks in rz_rbtree_aug_insert and rz_rbtree_aug_delete are on the hot path [^1]:
```c
// Current: checked every iteration
if (dep >= RZ_RBTREE_MAX_HEIGHT) {
	RZ_LOG_ERROR("Red-black tree depth is too big\n");
	break;
}
```
Mark as unlikely:
```c
if (RZ_UNLIKELY(dep >= RZ_RBTREE_MAX_HEIGHT)) {
	RZ_LOG_ERROR("Red-black tree depth is too big\n");
	break;
}
```
This tells the compiler to lay out the error path out of line, so the branch predictor assumes the overflow condition is false and the hot path stays straight-line code.
13. Avoid Wrapper Overhead in RContRBTree
rz_rbtree_cont_insert wraps the user's comparator in cont_rbtree_cmp_wrapper, adding an extra indirect call per comparison plus a wrapper struct set up on the stack for every operation [^1]:
```c
// Current: wrapper adds indirection
RCRBCmpWrap cmp_wrap = { cmp, NULL, user };
rz_rbtree_aug_insert(&root_node, incoming_node,
	&incoming_node->node, cont_rbtree_cmp_wrapper, &cmp_wrap, NULL);
```
Optimize: Store the comparator in the tree and call it directly in a specialized insert:
```c
typedef struct rz_containing_rb_tree_t {
	RContRBNode *root;
	RContRBFree free;
	RContRBCmp cmp; // store once
	void *cmp_user; // store once
} RContRBTree;

// Direct comparator without wrapper
static int cont_rb_cmp_direct(const void *incoming, const RBNode *in_tree, void *user) {
	RContRBTree *tree = (RContRBTree *)user;
	RContRBNode *a = (RContRBNode *)incoming;
	RContRBNode *b = container_of(in_tree, RContRBNode, node);
	return tree->cmp(a->data, b->data, tree->cmp_user);
}
```
This removes the per-operation wrapper setup; pairing it with a specialized insert loop that calls tree->cmp directly would also eliminate the remaining level of function-pointer indirection per comparison.
Summary Table
| # | Optimization | Complexity | Expected Impact |
| --- | --- | --- | --- |
| 1 | Pack color bit into pointer LSB | Medium | 33% smaller nodes, much better cache |
| 2 | Shrink / redesign `RBIter` | Medium | 504 → 8 bytes per iterator |
| 3 | Node slab allocator | Medium | 5–20× for insert-heavy workloads |
| 4 | Branchless find + prefetch children | Easy | ~20% faster lookups |
| 5 | `__builtin_prefetch` in traversal | Easy | 10–30% for cold-cache queries |
| 6 | Iterative `rz_rbtree_free` | Easy | Eliminates recursive call overhead |
| 7 | Inline `sum` for interval tree | Medium | Eliminates ~40 indirect calls per insert/delete |
| 8 | Bulk O(n) construction | Medium | O(n) vs O(n log n) for batch builds |
| 9 | Stack-based interval queries | Easy | Eliminates recursion + enables prefetch |
| 10 | Smarter `resize` for end-only changes | Easy | Halves cost of common resize case |
| 11 | SIMD batch interval queries | High | 4× throughput for batch stabbing queries |
| 12 | `__builtin_expect` on depth checks | Trivial | 5–10% on hot paths |
| 13 | Eliminate comparator wrapper | Easy | ~15% fewer indirect calls |
The highest-impact changes are:
- 1 (color bit packing) — affects every single node in every tree, improving cache behavior across the entire codebase
- 3 (slab allocator) — eliminates per-node malloc, which dominates insertion cost
- 7 (inline sum) — the sum function pointer is called dozens of times per tree mutation and can never be inlined by the compiler through the pointer