
[RFC] Perf: Two-Level Hierarchical BitSet for Query Entity Storage and more (#230)

Draft
jerzakm wants to merge 5 commits into pmndrs:main from jerzakm:perf

Conversation

jerzakm (Contributor) commented Feb 10, 2026

I have been doing some research into modern gamedev architecture and the optimizations native engines make in their query data structures. I implemented some by hand, then threw a good bit of Opus tokens at finishing it up, building benchmarks, and formatting my ramblings into a readable tech summary.

Results:

Pre-warmed 3 times; each test ran 50 iterations.

Main branch:

==========================================================================================
  SUMMARY
==========================================================================================
  Benchmark                                                Mean      Median         P99       P99.9      StdDev
  -------------------------------------------------------------------------------------------------------------
  spawn & destroy 50000 entities                      68.518 ms   68.900 ms   76.400 ms   76.400 ms    3.434 ms
  trait add/remove churn (10000 ents x 5 cycles)      30.342 ms   30.000 ms   36.300 ms   36.300 ms    1.688 ms
  updateEach Position+Velocity (100000 ents)           5.748 ms    5.800 ms    6.400 ms    6.400 ms    393.6 µs
  create 20 queries over 50000 entities              142.744 ms  144.400 ms  152.100 ms  152.100 ms    6.102 ms
  wide archetype (5000 ents x 30 traits)              32.224 ms   31.700 ms   38.600 ms   38.600 ms    1.784 ms
  Not-query churn (20000 spawn+destroy)               70.014 ms   68.500 ms  154.500 ms  154.500 ms   13.913 ms
  mixed workload (50000 base, +500/-500 per tick)      3.518 ms    3.500 ms    4.200 ms    4.200 ms    205.6 µs
==========================================================================================

This branch:

==========================================================================================
  SUMMARY
==========================================================================================
  Benchmark                                                Mean      Median         P99       P99.9      StdDev
  -------------------------------------------------------------------------------------------------------------
  spawn & destroy 50000 entities                      68.106 ms   68.500 ms   74.000 ms   74.000 ms    3.750 ms
  trait add/remove churn (10000 ents x 5 cycles)      30.860 ms   30.500 ms   33.900 ms   33.900 ms    1.481 ms
  updateEach Position+Velocity (100000 ents)           4.880 ms    5.000 ms    5.200 ms    5.200 ms    266.8 µs
  create 20 queries over 50000 entities              126.390 ms  127.300 ms  138.400 ms  138.400 ms    6.042 ms
  wide archetype (5000 ents x 30 traits)              31.638 ms   31.200 ms   34.700 ms   34.700 ms    1.629 ms
  Not-query churn (20000 spawn+destroy)               31.048 ms   31.000 ms   35.900 ms   35.900 ms    1.828 ms
  mixed workload (50000 base, +500/-500 per tick)      3.026 ms    3.000 ms    3.900 ms    3.900 ms    198.8 µs
==========================================================================================

Change 1: Two-Level Hierarchical BitSet for Query Entity Storage

This is the highest-impact change. It transforms query population from O(N) per-entity checks to bulk bitwise intersection, and makes Not-query churn dramatically faster because remove() is a single bit clear + conditional top-level update.

Files: utils/bit-set.ts (new), query/types.ts, query/query.ts

Queries stored their matched entity sets as SparseSet instances. SparseSet uses a dense/sparse array pair — O(1) add/remove/has, but iteration visits every element sequentially and set intersection requires checking each element of one set against the other.

Queries now store entities in a two-level hierarchical BitSet. The structure uses two layers of Uint32Array:

  • Bottom level: 1 bit per entity ID. Word i covers entity IDs [i*32, i*32+31].
  • Top level: 1 bit per bottom-level word. If bit j in top word i is set, then bottom[(i*32)+j] has at least one set bit.
Top:    [ ...0001_0100... ]     ← 1 bit per 32-element block
                |    |
Bottom: [ ...11010... ] [ ...01001... ]  ← 1 bit per entity

Why This Is Good

Iteration uses a trailing-zero-count loop (31 - Math.clz32(v & -v)) to jump directly to set bits, skipping empty space. A single zero bit in the top level skips 32 bottom-level words (1024 entity IDs) without touching memory. This is the same pattern used by hardware TZCNT/BSF instructions.
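The skip-ahead iteration can be sketched as a standalone function over raw top/bottom words (illustrative, not the PR's exact code):

```typescript
// Visit every set bit in ascending order. The outer loop consumes set bits
// of each top word with v & -v (isolate lowest bit) and Math.clz32
// (count leading zeros), so empty 32-entity words are never touched.
function forEachSetBit(
    top: Uint32Array,
    bottom: Uint32Array,
    cb: (id: number) => void
): void {
    for (let ti = 0; ti < top.length; ti++) {
        let t = top[ti];
        while (t !== 0) {
            const wi = (ti << 5) + (31 - Math.clz32(t & -t)); // non-empty bottom word
            let v = bottom[wi];
            while (v !== 0) {
                cb((wi << 5) + (31 - Math.clz32(v & -v))); // entity ID
                v &= v - 1; // clear lowest set bit
            }
            t &= t - 1;
        }
    }
}
```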

bitSetAndMany() intersects multiple BitSets by ANDing top-level words first. If the AND result is zero, 1024 entities are skipped in one operation. For the common case of world.query(A, B, C), this replaces per-entity checkQuery() calls with bulk bitwise operations.

// Fast path: AND all required trait BitSets
bitSetAndMany(requiredSets, (eid) => {
    query.add(entityIndex.dense[entityIndex.sparse[eid]]);
});

The top level adds only 1 bit per 32 elements (3.125% overhead). For 100k entities, the entire BitSet fits in ~12.5 KB — well within L1 cache.

BitSet iteration naturally produces entity IDs in ascending order. Entities sharing a cache line are processed sequentially, improving spatial locality for downstream SoA store access patterns (store.x[eid], store.y[eid]).
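The multi-set intersection can be sketched the same way (hypothetical shapes and signature, not the PR's exact bitSetAndMany):

```typescript
// AND the top levels first: a zero top word proves all 32 bottom words
// (1024 entity IDs) are empty in at least one set, so they are skipped
// without ever being loaded.
interface Bits { top: Uint32Array; bottom: Uint32Array; }

function bitSetAndMany(sets: Bits[], cb: (id: number) => void): void {
    const topLen = Math.min(...sets.map((s) => s.top.length));
    for (let ti = 0; ti < topLen; ti++) {
        let t = sets[0].top[ti];
        for (let si = 1; si < sets.length && t !== 0; si++) t &= sets[si].top[ti];
        while (t !== 0) {
            const wi = (ti << 5) + (31 - Math.clz32(t & -t));
            let v = sets[0].bottom[wi];
            for (let si = 1; si < sets.length && v !== 0; si++) v &= sets[si].bottom[wi];
            while (v !== 0) {
                cb((wi << 5) + (31 - Math.clz32(v & -v))); // entity in ALL sets
                v &= v - 1;
            }
            t &= t - 1;
        }
    }
}
```

A top-level bit can survive the AND while the bottom-level AND is zero (both blocks non-empty but disjoint); the inner loop simply falls through in that case.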


Change 2: Per-Trait Entity BitSet for Fast Query Population

Files: trait/types.ts, trait/trait.ts, query/query.ts

When a new query was created, it had to iterate all entities in the world and call checkQuery() on each one to determine initial membership. For a world with 50k entities and 20 queries, this meant 1M checkQuery() calls at startup.

Each TraitData now maintains an entityBitSet: BitSet that tracks which entities currently have that trait. When a query is created with required traits [A, B, C], initial population uses bitSetAndMany([A.entityBitSet, B.entityBitSet, C.entityBitSet]) — a bulk bitwise AND that produces the intersection without per-entity checks.

// In registerTrait():
entityBitSet: new BitSet(),

// In addTrait():
data.entityBitSet.add(eid);

// In removeTrait():
data.entityBitSet.remove(eid);

// In createQuery() — fast population path:
const requiredSets = required.map((td) => td.entityBitSet);
bitSetAndMany(requiredSets, (eid) => {
    query.add(entityIndex.dense[entityIndex.sparse[eid]]);
});

Why This Is Good

1. Query creation scales with intersection size, not world size.
If traits A, B, C each have 10k entities but only 5k share all three, we process ~5k entities instead of 50k. The top-level AND skips entire 1024-entity blocks where any trait is absent.

2. No per-entity checkQuery() for the common case.
The fast path applies when a query has only required traits (no Or modifier) and no forbidden traits (other than the implicit IsExcluded). This covers the vast majority of real-world queries.

3. Negligible maintenance cost.
BitSet.add() and BitSet.remove() are single-word bitwise operations — they add virtually zero overhead to addTrait()/removeTrait().


Change 3: Map<Trait, TraitData> → TraitData[] (Array-Indexed Lookup)

This change touches the most call sites (14 Map.get()/Map.has() replacements across 4 files) and contributes to every benchmark improvement. It is especially impactful in trait churn scenarios where addTrait/removeTrait are called thousands of times per frame.

Files: world/world.ts, trait/trait.ts, query/query.ts, query/modifiers/changed.ts

Before

traitData: new Map<Trait, TraitData>()

// Every lookup:
ctx.traitData.get(trait)    // Map.get() — hash + bucket walk
ctx.traitData.has(trait)    // Map.has() — hash + bucket walk
ctx.traitData.set(trait, d) // Map.set() — hash + possible rehash

After

traitData: [] as (TraitData | undefined)[]

// Every lookup:
ctx.traitData[trait[$internal].id]  // Direct array index — single memory access

Why This Is Good

Map.get() must hash the key, walk a bucket chain, and compare references. Array index access (arr[i]) compiles to a single bounds-checked memory load. V8 optimizes dense arrays into contiguous memory with no hashing overhead.

Every trait gets a monotonically increasing id via let traitId = 0; id: traitId++. This makes them perfect array indices — no sparse gaps, no wasted memory.

traitData is accessed in addTrait(), removeTrait(), hasTrait(), getStore(), setChanged(), and every query operation. Even a 10ns improvement per access compounds across thousands of entities and traits per frame.

if (!ctx.traitData[tid]) is a simple falsy check on undefined, replacing Map.has() which must perform the same hash + bucket walk as Map.get().


Change 4: Set<Query> → Query[] (Array-Based Query Collections)

Files: world/world.ts, trait/types.ts, trait/trait.ts, query/query.ts, query/modifiers/changed.ts, entity/entity.ts

Before

// In TraitData:
queries: Set<Query>
notQueries: Set<Query>

// In World:
notQueries: new Set<Query>()

// Iteration:
for (const query of queries) { ... }      // Set iterator — allocates iterator object
instance.queries.add(query)                // Set.add() — hash + bucket

After

// In TraitData:
queries: Query[]
notQueries: Query[]

// In World:
notQueries: [] as Query[]

// Iteration:
for (let qi = 0; qi < queries.length; qi++) {  // Indexed for loop — no allocation
    const query = queries[qi];
    ...
}
instance.queries.push(query)                     // Array.push() — amortized O(1)

Why This Is Good

1. for..of on a Set allocates an iterator object every time.
In V8, for (const x of set) creates a SetIterator object on the heap. In a hot loop that runs per-entity-per-trait-change, this creates GC pressure. An indexed for loop over an array allocates nothing.

2. Array iteration is JIT-friendly.
V8's TurboFan compiler can optimize indexed array loops into tight machine code with bounds-check elimination. Set iteration goes through the iterator protocol, which is harder to optimize and involves virtual dispatch.

3. Query collections are small and append-only during normal operation.
A typical trait has 1–5 associated queries. These are added once during createQuery() and never removed during normal operation (only on world.reset()). This is the ideal use case for arrays — push() is O(1) amortized, and small arrays have excellent cache locality.

4. Set.add() deduplication is unnecessary here.
Queries are only added to trait collections once, during createQuery(). The code already ensures no duplicates by construction, so Set's deduplication overhead is pure waste.

Impact

This change is most visible in the Not-query churn benchmark (~2.3x speedup), where addTrait()/removeTrait() iterate over each trait's query list thousands of times per frame.


Change 5: Uint32Array Entity Masks with Dynamic Growth

Files: world/world.ts, world/utils/ensure-entity-mask-size.ts (new), world/utils/increment-world-bit-flag.ts

Before

entityMasks: [[]] as number[][]

// On new generation:
ctx.entityMasks.push(new Array(prevLen).fill(0))

Entity masks were JavaScript number[] arrays — each element a regular JS number.

After

entityMasks: [new Uint32Array(1024)] as Uint32Array[]

// On new generation:
ctx.entityMasks.push(new Uint32Array(prevLen))

// Dynamic growth when entity ID exceeds capacity:
function ensureEntityMaskSize(masks, generationId, eid) {
    const arr = masks[generationId];
    if (eid < arr.length) return;
    let newLen = arr.length;
    while (newLen <= eid) newLen *= 2;
    const grown = new Uint32Array(newLen);
    grown.set(arr);
    masks[generationId] = grown;
}

Why This Is Good

1. Uint32Array is a contiguous typed array in memory.
Regular number[] in V8 can be stored as either SMI (small integer) arrays or double arrays, and may have holes or be backed by a dictionary. Uint32Array is always a flat, contiguous buffer of 32-bit unsigned integers — exactly what bitmask operations need.

2. Bitwise operations on typed arrays avoid boxing.
When you do arr[i] |= flag on a number[], V8 must check the element type, potentially unbox it, perform the operation, and rebox. On a Uint32Array, the element is always a 32-bit integer — no type checks, no boxing.

3. Predictable memory layout improves cache behavior.
A Uint32Array(1024) is exactly 4 KB — one memory page. Sequential access patterns (iterating entity masks during checkQuery()) benefit from hardware prefetching that works best on contiguous memory.

4. Doubling growth strategy amortizes allocation cost.
ensureEntityMaskSize() doubles the array when capacity is exceeded, giving O(1) amortized growth. The initial 1024-element capacity covers most small-to-medium worlds without any reallocation.

5. Array.from() for tracking snapshots is faster than structuredClone().
In setTrackingMasks(), converting Uint32Array to number[] via Array.from(mask) is significantly faster than structuredClone() on the previous number[][] — this was confirmed by benchmarking (Revert #1 showed measurable impact).
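As a tiny illustration of that snapshot conversion (variable names here are hypothetical):

```typescript
// Array.from copies the 32-bit integers into a plain number[] in one pass,
// whereas structuredClone must walk and serialize the whole object graph.
const mask = new Uint32Array([0b0101, 0b0011, 0]);
const snapshot: number[] = Array.from(mask); // independent plain-array copy
mask[0] = 0; // mutating the live mask does not affect the snapshot
```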

Impact

This change improves every operation that reads or writes entity masks — which is every addTrait(), removeTrait(), hasTrait(), and checkQuery() call. The impact compounds with entity count.


Change 6: Hoisted Invariants and Reusable Event Objects

Files: trait/trait.ts, query/modifiers/changed.ts

Before

// In addTrait() / removeTrait() — new object per call:
const queryEvent = { type: 'add', traitData: data };

// In setChanged() — redundant work inside loop:
for (const changedMask of ctx.changedMasks.values()) {
    const eid = getEntityId(entity);           // Same result every iteration
    const data = ctx.traitData.get(trait)!;     // Same result every iteration
    const { generationId, bitflag } = data;     // Same result every iteration
    ...
}

After

// Module-level reusable object:
const queryEvent: { type: 'add' | 'remove' | 'change'; traitData: TraitData } = {
    type: 'add',
    traitData: null!,
};

// In addTrait():
queryEvent.type = 'add';
queryEvent.traitData = data;
// Pass queryEvent to query.check() — no allocation

// In setChanged() — hoisted outside loop:
const eid = getEntityId(entity);
const { generationId, bitflag } = data;
for (const changedMask of ctx.changedMasks.values()) {
    // Uses eid, generationId, bitflag from outer scope
    ...
}

Why This Is Good

1. Eliminates per-call object allocation on the hottest path.
addTrait() and removeTrait() are called for every trait on every entity. Creating a { type, traitData } object each time generates garbage that the GC must collect. A module-level reusable object is allocated once and mutated in place — zero GC pressure.

2. Hoisting loop-invariant computations is a classic optimization.
getEntityId(entity) and ctx.traitData.get(trait) return the same value on every iteration of the changedMask loop. Moving them outside the loop eliminates redundant function calls and Map lookups. While V8 can sometimes hoist invariants automatically, it cannot do so when the loop body contains function calls that might have side effects.

3. The reusable object pattern is safe here because calls are synchronous.
queryEvent is written immediately before being read by query.check(), and query.check() is synchronous. There is no risk of concurrent mutation.

Impact

This is a micro-optimization that contributes incrementally to every benchmark. Its effect is most visible in high-churn scenarios where addTrait()/removeTrait() are called thousands of times per frame.


Change 7: Pre-Allocated Entity Array in runQuery()

Files: query/query.ts

Before

const entities: Entity[] = [];
query.entities.forEach((eid) => {
    entities.push(entityIndex.dense[entityIndex.sparse[eid]] as Entity);
});

After

const dense = entityIndex.dense;
const sparse = entityIndex.sparse;
const entities: Entity[] = new Array(query.entities.count);
let ei = 0;
query.entities.forEach((eid) => {
    entities[ei++] = dense[sparse[eid]] as Entity;
});

Why This Is Good

1. new Array(n) pre-allocates the backing store.
V8 allocates the array's internal storage in one shot instead of repeatedly growing it via push(). For a query returning 50k entities, this avoids ~12 reallocations (doubling from 16 → 32 → 64 → ... → 65536).

2. Index assignment is faster than push().
arr[i] = value is a direct store. arr.push(value) must check capacity, potentially grow the array, update the length property, and store the value. The difference is small per call but compounds over thousands of entities.

3. Hoisting dense and sparse avoids repeated property access.
entityIndex.dense and entityIndex.sparse are accessed once and cached in local variables. V8's TurboFan can often optimize repeated property access, but local variables are guaranteed to be register-allocated.

Impact

This is a targeted optimization for runQuery(), which is called every time world.query() is invoked. The impact scales with query result size.


New Files

  • utils/bit-set.ts: Two-level hierarchical BitSet with has, add, remove, clear, forEach, toArray, toArrayInto, plus standalone bitSetAnd, bitSetAndMany, bitSetAndNot, bitSetAndAny, bitSetIsSubset
  • world/utils/ensure-entity-mask-size.ts: Dynamic growth for Uint32Array entity masks with doubling strategy
  • tests/utils/bit-set.test.ts: 47 unit tests for BitSet
  • benches/sims/stress-test/: Headless benchmark suite (7 benchmarks)
  • benches/apps/stress-test/: Browser-based benchmark with DOM output

Modified Files

  • entity/entity.ts: for..of Set → indexed for loop over notQueries[]
  • query/modifiers/changed.ts: Map.get() → array index, hoisted invariants, for..of Set → indexed for loop
  • query/query.ts: Pre-allocated entity array, Map.get() → array index, Set.add() → Array.push(), BitSet-based query population
  • query/types.ts: entities: SparseSet → entities: BitSet, toRemove: SparseSet → toRemove: BitSet
  • trait/trait.ts: Map.get() → array index, Set → Array, reusable queryEvent, per-trait entityBitSet maintenance
  • trait/types.ts: queries: Set<Query> → Query[], notQueries: Set<Query> → Query[], added entityBitSet: BitSet
  • world/world.ts: Map<Trait, TraitData> → TraitData[], Set<Query> → Query[], number[][] → Uint32Array[]
  • world/utils/increment-world-bit-flag.ts: New generation allocates a Uint32Array instead of a plain Array

Risks and Tradeoffs

  1. BitSet iteration order differs from SparseSet. BitSet iterates in ascending entity ID order, while SparseSet iterates in insertion order. This changes the order of entities returned by world.query(). No user-facing contract guarantees iteration order, but tests were updated to reflect this.

  2. Uint32Array entity masks require dynamic growth. The ensureEntityMaskSize() function adds a bounds check on every addTrait() call. This is a single comparison (eid < arr.length) that is almost always false — the branch predictor handles this efficiently.

  3. Module-level mutable state (queryEvent, changeEvent). These reusable objects are safe because all usage is synchronous and single-threaded. If Koota ever supports concurrent/async trait operations, these would need to become thread-local or per-call.

  4. TraitData[] has undefined gaps if traits are registered non-sequentially. In practice, trait IDs are sequential (0, 1, 2, ...) so the array is dense. If trait IDs ever become sparse (e.g., after trait unregistration), this would waste memory proportional to the highest trait ID.

Testing

All 138 tests pass (91 original + 47 new BitSet tests):

cd packages/core && npx vitest run --reporter=verbose

Benchmarks can be run via:

pnpm sim stress-test    # Headless (Node/tsx)
pnpm app stress-test    # Browser (Vite dev server)

krispya (Member) commented Feb 10, 2026

Ah hah! Thanks for putting this together. I'll read it over one of these mornings and try to grok the implications. In the meantime, could you tell me in your own words what the big idea is?
