[RFC] Perf: Two-Level Hierarchical BitSet for Query Entity Storage and more #230
Draft
jerzakm wants to merge 5 commits into pmndrs:main from
Conversation
Member

Ah hah! Thanks for putting this together. I'll read it over one of these mornings and try to grok the implications. In the meantime, could you tell me in your own words what the big idea is?
I have been doing some research into modern gamedev architecture and the optimizations native engines make in their query data structures. I implemented some of them by hand, then spent a good bit of Opus tokens finishing it up, writing benchmarks, and formatting my ramblings into a readable tech summary.
Results:
Pre-warmed 3 times; each test ran 50 iterations.
Main branch:
This branch:
## BIGGEST IMPACT Change: Two-Level Hierarchical BitSet for Query Entity Storage
This is the highest-impact change. It transforms query population from O(N) per-entity checks into bulk bitwise intersection, and makes Not-query churn dramatically faster because `remove()` is a single bit clear plus a conditional top-level update.

Files: `utils/bit-set.ts` (new), `query/types.ts`, `query/query.ts`

Queries stored their matched entity sets as `SparseSet` instances. `SparseSet` uses a dense/sparse array pair: O(1) add/remove/has, but iteration visits every element sequentially, and set intersection requires checking each element of one set against the other.

Queries now store entities in a two-level hierarchical BitSet. The structure uses two layers of `Uint32Array`: bottom word `i` covers entity IDs `[i*32, i*32+31]`; if bit `j` in top word `i` is set, then `bottom[(i*32)+j]` has at least one set bit.

### Why This Is Good
Iteration uses a trailing-zero-count loop (`31 - Math.clz32(v & -v)`) to jump directly to set bits, skipping empty space. A single zero bit in the top level skips 32 bottom-level words (1024 entity IDs) without touching memory. This is the same pattern the hardware `TZCNT`/`BSF` instructions implement.

`bitSetAndMany()` intersects multiple BitSets by ANDing top-level words first. If the AND result is zero, 1024 entities are skipped in one operation. For the common case of `world.query(A, B, C)`, this replaces per-entity `checkQuery()` calls with bulk bitwise operations.

The top level adds only 1 bit per 32 elements (3.125% overhead). For 100k entities, the entire BitSet fits in ~12.5 KB, well within L1 cache.
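To make the structure concrete, here is a minimal TypeScript sketch of the two-level layout and the trailing-zero iteration loop. It is illustrative only; the actual `utils/bit-set.ts` implementation differs in naming and detail.

```typescript
// Illustrative two-level BitSet; not the actual utils/bit-set.ts code.
class HierarchicalBitSet {
  top: Uint32Array;    // bit j of top[i] set => bottom[i*32 + j] is non-zero
  bottom: Uint32Array; // bit k of bottom[w] set => entity w*32 + k is present

  constructor(capacity: number) {
    const words = Math.ceil(capacity / 32);
    this.bottom = new Uint32Array(words);
    this.top = new Uint32Array(Math.ceil(words / 32));
  }

  add(eid: number): void {
    const w = eid >>> 5; // eid / 32
    this.bottom[w] |= 1 << (eid & 31);
    this.top[w >>> 5] |= 1 << (w & 31);
  }

  remove(eid: number): void {
    const w = eid >>> 5;
    this.bottom[w] &= ~(1 << (eid & 31));
    // Single bit clear plus a conditional top-level update.
    if (this.bottom[w] === 0) this.top[w >>> 5] &= ~(1 << (w & 31));
  }

  has(eid: number): boolean {
    return (this.bottom[eid >>> 5] & (1 << (eid & 31))) !== 0;
  }

  // Trailing-zero-count iteration: a zero bit in the top level skips
  // 32 bottom words (1024 entity IDs) without touching their memory.
  forEach(fn: (eid: number) => void): void {
    for (let i = 0; i < this.top.length; i++) {
      let t = this.top[i];
      while (t !== 0) {
        const j = 31 - Math.clz32(t & -t); // index of lowest set bit
        const w = (i << 5) + j;
        let b = this.bottom[w];
        while (b !== 0) {
          const k = 31 - Math.clz32(b & -b);
          fn(w * 32 + k);
          b &= b - 1; // clear lowest set bit
        }
        t &= t - 1;
      }
    }
  }
}
```

Note how `remove()` only touches the top level when its bottom word becomes zero, which is what makes Not-query churn cheap.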
BitSet iteration naturally produces entity IDs in ascending order. Entities sharing a cache line are processed sequentially, improving spatial locality for downstream SoA store access patterns (`store.x[eid]`, `store.y[eid]`).

## Change 2: Per-Trait Entity BitSet for Fast Query Population
Files: `trait/types.ts`, `trait/trait.ts`, `query/query.ts`

When a new query was created, it had to iterate all entities in the world and call `checkQuery()` on each one to determine initial membership. For a world with 50k entities and 20 queries, this meant 1M `checkQuery()` calls at startup.

Each `TraitData` now maintains an `entityBitSet: BitSet` that tracks which entities currently have that trait. When a query is created with required traits `[A, B, C]`, initial population uses `bitSetAndMany([A.entityBitSet, B.entityBitSet, C.entityBitSet])`: a bulk bitwise AND that produces the intersection without per-entity checks.

### Why This Is Good
1. Query creation scales with intersection size, not world size.
If traits A, B, C each have 10k entities but only 5k share all three, we process ~5k entities instead of 50k. The top-level AND skips entire 1024-entity blocks where any trait is absent.
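Under the assumptions of the two-level layout described above (field names `top`/`bottom` are hypothetical), the bulk intersection can be sketched as:

```typescript
// Hypothetical sketch of bulk BitSet intersection; the real bitSetAndMany()
// in utils/bit-set.ts may differ in signature and detail.
interface TwoLevelBitSet {
  top: Uint32Array;
  bottom: Uint32Array;
}

function bitSetAndMany(sets: TwoLevelBitSet[]): number[] {
  const out: number[] = [];
  const topLen = Math.min(...sets.map((s) => s.top.length));
  for (let i = 0; i < topLen; i++) {
    // AND the top-level words first: a zero result skips 32 bottom
    // words (1024 entity IDs) in a single operation.
    let t = sets[0].top[i];
    for (let s = 1; s < sets.length; s++) t &= sets[s].top[i];
    while (t !== 0) {
      const j = 31 - Math.clz32(t & -t); // lowest candidate bottom word
      const w = (i << 5) + j;
      let b = sets[0].bottom[w];
      for (let s = 1; s < sets.length; s++) b &= sets[s].bottom[w];
      while (b !== 0) {
        const k = 31 - Math.clz32(b & -b);
        out.push(w * 32 + k); // entity present in every set
        b &= b - 1;
      }
      t &= t - 1;
    }
  }
  return out;
}
```

The work done is proportional to the number of surviving bits, which is why query creation scales with the intersection size.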
2. No per-entity `checkQuery()` for the common case.

The fast path applies when a query has only required traits (no `Or` modifier) and no forbidden traits (other than the implicit `IsExcluded`). This covers the vast majority of real-world queries.

3. Negligible maintenance cost.

`BitSet.add()` and `BitSet.remove()` are single-word bitwise operations; they add virtually zero overhead to `addTrait()`/`removeTrait()`.

## Change 3: `Map<Trait, TraitData>` → `TraitData[]` (Array-Indexed Lookup)

This change touches the most call sites (14 `Map.get()`/`Map.has()` replacements across 4 files) and contributes to every benchmark improvement. It is especially impactful in trait-churn scenarios where `addTrait`/`removeTrait` are called thousands of times per frame.

Files: `world/world.ts`, `trait/trait.ts`, `query/query.ts`, `query/modifiers/changed.ts`

Before

After
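The shape of the change, in a hypothetical sketch (the actual Before/After snippets are not reproduced here):

```typescript
// Hypothetical sketch of the lookup swap; not the actual koota diff.
interface TraitData { store: unknown }
interface Trait { id: number } // assigned from a monotonically increasing counter

// Before: hashed lookup.
const traitDataMap = new Map<Trait, TraitData>();
function getTraitDataBefore(t: Trait): TraitData | undefined {
  return traitDataMap.get(t); // hash the key, walk a bucket, compare references
}

// After: the trait's sequential id doubles as a dense array index.
const traitData: TraitData[] = [];
function getTraitDataAfter(t: Trait): TraitData | undefined {
  return traitData[t.id]; // single bounds-checked load; undefined means "missing"
}
```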
### Why This Is Good
`Map.get()` must hash the key, walk a bucket chain, and compare references. Array index access (`arr[i]`) compiles to a single bounds-checked memory load. V8 optimizes dense arrays into contiguous memory with no hashing overhead.

Every trait gets a monotonically increasing `id` via `let traitId = 0; id: traitId++`. This makes trait IDs perfect array indices: no sparse gaps, no wasted memory.

`traitData` is accessed in `addTrait()`, `removeTrait()`, `hasTrait()`, `getStore()`, `setChanged()`, and every query operation. Even a 10 ns improvement per access compounds across thousands of entities and traits per frame.

`if (!ctx.traitData[tid])` is a simple falsy check on `undefined`, replacing `Map.has()`, which must perform the same hash and bucket walk as `Map.get()`.

## Change 4: `Set<Query>` → `Query[]` (Array-Based Query Collections)

Files: `world/world.ts`, `trait/types.ts`, `trait/trait.ts`, `query/query.ts`, `query/modifiers/changed.ts`, `entity/entity.ts`

Before

After
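A hypothetical sketch of the collection swap (the elided diff is not reproduced):

```typescript
// Hypothetical sketch; not the actual koota diff.
interface Query { check: (eid: number) => void }

// Before: Set iteration allocates a SetIterator object on every loop.
function notifyBefore(queries: Set<Query>, eid: number): void {
  for (const q of queries) q.check(eid);
}

// After: an indexed loop over a plain array allocates nothing; duplicates
// are already impossible by construction, so Set's deduplication is waste.
function notifyAfter(queries: Query[], eid: number): void {
  for (let i = 0; i < queries.length; i++) queries[i].check(eid);
}
```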
### Why This Is Good
1. `for..of` on a Set allocates an iterator object every time.

In V8, `for (const x of set)` creates a `SetIterator` object on the heap. In a hot loop that runs per-entity-per-trait-change, this creates GC pressure. An indexed `for` loop over an array allocates nothing.

2. Array iteration is JIT-friendly.
V8's TurboFan compiler can optimize indexed array loops into tight machine code with bounds-check elimination. Set iteration goes through the iterator protocol, which is harder to optimize and involves virtual dispatch.
3. Query collections are small and append-only during normal operation.
A typical trait has 1–5 associated queries. These are added once during `createQuery()` and never removed during normal operation (only on `world.reset()`). This is the ideal use case for arrays: `push()` is O(1) amortized, and small arrays have excellent cache locality.

4. `Set.add()` deduplication is unnecessary here.

Queries are only added to trait collections once, during `createQuery()`. The code already ensures no duplicates by construction, so Set's deduplication overhead is pure waste.

### Impact

This change is most visible in the trait add/remove churn benchmark (2.5x speedup), where `addTrait()`/`removeTrait()` iterate over each trait's query list thousands of times per frame.

## Change 5: `Uint32Array` Entity Masks with Dynamic Growth

Files:
`world/world.ts`, `world/utils/ensure-entity-mask-size.ts` (new), `world/utils/increment-world-bit-flag.ts`

Before

Entity masks were JavaScript `number[]` arrays, each element a regular JS number.

After
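A hypothetical sketch of the growth helper (the real `ensureEntityMaskSize()` may differ in shape):

```typescript
// Hypothetical sketch of the doubling growth strategy for Uint32Array
// entity masks; not the actual world/utils/ensure-entity-mask-size.ts code.
function ensureEntityMaskSize(mask: Uint32Array, index: number): Uint32Array {
  if (index < mask.length) return mask; // common case: a single comparison
  let newLen = mask.length > 0 ? mask.length : 1024;
  while (newLen <= index) newLen *= 2; // double until the index fits
  const grown = new Uint32Array(newLen); // contiguous, zero-filled
  grown.set(mask); // copy existing mask words
  return grown;
}
```

A caller replaces its stored mask with the returned array; for in-bounds indices the helper is a single branch that the predictor learns quickly.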
### Why This Is Good
1. `Uint32Array` is a contiguous typed array in memory.

Regular `number[]` in V8 can be stored as either SMI (small integer) arrays or double arrays, and may have holes or be backed by a dictionary. `Uint32Array` is always a flat, contiguous buffer of 32-bit unsigned integers: exactly what bitmask operations need.

2. Bitwise operations on typed arrays avoid boxing.

When you do `arr[i] |= flag` on a `number[]`, V8 must check the element type, potentially unbox it, perform the operation, and rebox. On a `Uint32Array`, the element is always a 32-bit integer: no type checks, no boxing.

3. Predictable memory layout improves cache behavior.

A `Uint32Array(1024)` is exactly 4 KB, one memory page. Sequential access patterns (iterating entity masks during `checkQuery()`) benefit from hardware prefetching, which works best on contiguous memory.

4. Doubling growth strategy amortizes allocation cost.

`ensureEntityMaskSize()` doubles the array when capacity is exceeded, giving O(1) amortized growth. The initial 1024-element capacity covers most small-to-medium worlds without any reallocation.

5. `Array.from()` for tracking snapshots is faster than `structuredClone()`.

In `setTrackingMasks()`, converting `Uint32Array` to `number[]` via `Array.from(mask)` is significantly faster than `structuredClone()` on the previous `number[][]`; this was confirmed by benchmarking (Revert #1 showed measurable impact).

### Impact
This change improves every operation that reads or writes entity masks, which is every `addTrait()`, `removeTrait()`, `hasTrait()`, and `checkQuery()` call. The impact compounds with entity count.

## Change 6: Hoisted Invariants and Reusable Event Objects
Files:
`trait/trait.ts`, `query/modifiers/changed.ts`

Before

After
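A hypothetical sketch of the reusable-event pattern (the real `queryEvent` shape in `trait/trait.ts` may differ):

```typescript
// Hypothetical sketch; not the actual koota diff.
interface TraitData { id: number }
interface QueryLike {
  check: (event: { type: string; traitData: TraitData | null }) => void;
}

// Module-level object, allocated once and mutated in place. Safe because
// every query check is synchronous: the event is fully written before it
// is read, and it never escapes the call.
const queryEvent: { type: string; traitData: TraitData | null } = {
  type: "",
  traitData: null,
};

function notifyQueries(queries: QueryLike[], type: string, traitData: TraitData): void {
  queryEvent.type = type;
  queryEvent.traitData = traitData;
  for (let i = 0; i < queries.length; i++) {
    queries[i].check(queryEvent); // no per-call { type, traitData } allocation
  }
}
```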
### Why This Is Good
1. Eliminates per-call object allocation on the hottest path.
`addTrait()` and `removeTrait()` are called for every trait on every entity. Creating a `{ type, traitData }` object each time generates garbage that the GC must collect. A module-level reusable object is allocated once and mutated in place: zero GC pressure.

2. Hoisting loop-invariant computations is a classic optimization.

`getEntityId(entity)` and `ctx.traitData.get(trait)` return the same value on every iteration of the `changedMask` loop. Moving them outside the loop eliminates redundant function calls and Map lookups. While V8 can sometimes hoist invariants automatically, it cannot do so when the loop body contains function calls that might have side effects.

3. The reusable object pattern is safe here because calls are synchronous.

`queryEvent` is written immediately before being read by `query.check()`, and `query.check()` is synchronous. There is no risk of concurrent mutation.

### Impact
This is a micro-optimization that contributes incrementally to every benchmark. Its effect is most visible in high-churn scenarios where `addTrait()`/`removeTrait()` are called thousands of times per frame.

## Change 7: Pre-Allocated Entity Array in `runQuery()`

Files: `query/query.ts`

Before

After
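A sketch of the pattern with a hypothetical helper; the `count` would come from the query's entity set size (an assumption, not the actual `runQuery()` code):

```typescript
// Hypothetical sketch of the pre-allocated result array; the real
// runQuery() in query/query.ts differs in detail.
function collectEntities(
  count: number,
  forEachEntity: (fn: (eid: number) => void) => void
): number[] {
  const result = new Array<number>(count); // backing store allocated once
  let i = 0;
  forEachEntity((eid) => {
    result[i++] = eid; // index assignment instead of push()
  });
  return result;
}
```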
### Why This Is Good
1. `new Array(n)` pre-allocates the backing store.

V8 allocates the array's internal storage in one shot instead of repeatedly growing it via `push()`. For a query returning 50k entities, this avoids ~12 reallocations (doubling from 16 → 32 → 64 → ... → 65536).

2. Index assignment is faster than `push()`.

`arr[i] = value` is a direct store. `arr.push(value)` must check capacity, potentially grow the array, update the length property, and store the value. The difference is small per call but compounds over thousands of entities.

3. Hoisting `dense` and `sparse` avoids repeated property access.

`entityIndex.dense` and `entityIndex.sparse` are accessed once and cached in local variables. V8's TurboFan can often optimize repeated property access, but local variables are guaranteed to be register-allocated.

### Impact

This is a targeted optimization for `runQuery()`, which is called every time `world.query()` is invoked. The impact scales with query result size.

## New Files
- `utils/bit-set.ts`: `has`, `add`, `remove`, `clear`, `forEach`, `toArray`, `toArrayInto`, plus standalone `bitSetAnd`, `bitSetAndMany`, `bitSetAndNot`, `bitSetAndAny`, `bitSetIsSubset`
- `world/utils/ensure-entity-mask-size.ts`: `Uint32Array` entity masks with doubling strategy
- `tests/utils/bit-set.test.ts`
- `benches/sims/stress-test/`
- `benches/apps/stress-test/`

## Modified Files
- `entity/entity.ts`: `for..of` Set → indexed `for` loop over `notQueries[]`
- `query/modifiers/changed.ts`: `Map.get()` → array index, hoisted invariants, `for..of` Set → indexed `for` loop
- `query/query.ts`: `Map.get()` → array index, `Set.add()` → `Array.push()`, BitSet-based query population
- `query/types.ts`: `entities: SparseSet` → `entities: BitSet`, `toRemove: SparseSet` → `toRemove: BitSet`
- `trait/trait.ts`: `Map.get()` → array index, `Set` → `Array`, reusable `queryEvent`, per-trait `entityBitSet` maintenance
- `trait/types.ts`: `queries: Set<Query>` → `Query[]`, `notQueries: Set<Query>` → `Query[]`, added `entityBitSet: BitSet`
- `world/world.ts`: `Map<Trait, TraitData>` → `TraitData[]`, `Set<Query>` → `Query[]`, `number[][]` → `Uint32Array[]`
- `world/utils/increment-world-bit-flag.ts`: `Uint32Array` instead of `Array`

## Risks and Tradeoffs
- BitSet iteration order differs from SparseSet. BitSet iterates in ascending entity ID order, while SparseSet iterated in insertion order. This changes the order of entities returned by `world.query()`. No user-facing contract guarantees iteration order, but tests were updated to reflect this.
- `Uint32Array` entity masks require dynamic growth. The `ensureEntityMaskSize()` function adds a bounds check on every `addTrait()` call. This is a single comparison (`eid < arr.length`) that almost always succeeds, so the branch predictor handles it efficiently.
- Module-level mutable state (`queryEvent`, `changeEvent`). These reusable objects are safe because all usage is synchronous and single-threaded. If Koota ever supports concurrent/async trait operations, they would need to become thread-local or per-call.
- `TraitData[]` has `undefined` gaps if traits are registered non-sequentially. In practice, trait IDs are sequential (0, 1, 2, ...), so the array is dense. If trait IDs ever become sparse (e.g., after trait unregistration), this would waste memory proportional to the highest trait ID.

## Testing
All 138 tests pass (91 original + 47 new BitSet tests):
Benchmarks can be run via: