Commit 2a63f33

Merge branch 'main' into mempool3
2 parents bb79316 + dd2bfee commit 2a63f33

File tree

28 files changed

+3211
-272
lines changed


Cargo.lock

Lines changed: 559 additions & 5 deletions
Lines changed: 46 additions & 0 deletions
## Exploring `Collector: Allocator` possibilities (note 1/2)

Note author: shruti2522

Issue #11 brought up making `Collector` a supertrait of `Allocator` from `allocator_api2`. The idea is to use `Vec<T, &MyCollector>` directly, backed by the GC arena, without extra handle types.

It seems possible, but actually running with it hits some Rust ownership walls. Here's what I've run into so far. I'm still wrapping my head around some of this, so the fixes are mostly just ideas at this point.
### 1. `&self` vs `&mut self`

**What I'm seeing:**

`Allocator` needs `&self` to allocate, but our `collect()` needs `&mut self`. If we make a `Vec<u64, &MarkSweepGarbageCollector>`, we've borrowed the GC. If we then try to call `gc.collect()`, the borrow checker rejects it, because we can't take a mutable borrow while the Vec holds a shared one.

**Potential ways around it:**

- Just cast pointers to bypass the check. This works in testing, but it feels risky because we'd have to manually guarantee `collect` never frees the Vec's page.
- Maybe we change `collect(&mut self)` to `collect(&self)` and put the internal state in `RefCell` or `UnsafeCell`? This might be a cleaner solution, but it needs more thought.
### 2. Reentrancy issues

**What I'm seeing:**

If we try the interior-mutability route, the arena is in a `RefCell`. `allocate()` would call `borrow_mut()` on it. If an allocation triggers `collect()` internally, it might try to borrow the arena again, and a double `borrow_mut()` is a runtime panic.

**Potential ways around it:**

- Use `UnsafeCell` to skip runtime tracking, but that could be dangerous.
- Decouple allocation from collection: maybe `allocate()` should never trigger a GC cycle? It could just ask for memory, and we could have something else watch the threshold and trigger `collect()` separately. This seems safer.
### 3. `deallocate` does nothing

**What I'm seeing:**

Our arena is a bump allocator, which doesn't give memory back slot by slot. `deallocate` has nowhere to put the memory, so a shrinking Vec might just leak its old buffer until the whole arena drops.

**Potential ways around it:**

- Maybe this is fine? Standard bump allocators work this way.
- If we want to be clean, we could make an `ArenaHeapItem::RawBytes` variant so `allocate()` returns a proper chunk that can be reclaimed during the sweep phase. I'm not sure if it's worth the overhead yet.
35+
36+
### 4. Trace integration
37+
38+
**What I'm seeing:**
39+
The mark phase doesn't know what's in a `Vec<T, &Collector>`. If there's a `Gc<U>` inside, the gc won't see it and might accidentally sweep it.
40+
41+
**Potential ways around it:**
42+
- Maybe use a GC aware wrapper like `GcVec<T>` that implements `Trace`? It could register itself with the gc root queue on creation
### Early thoughts

`Collector: Allocator` seems possible, but it looks like we'd need to shift the collector entirely to `&self` with interior mutability and rework how allocation thresholds are handled.
Lines changed: 52 additions & 0 deletions
## Implementing `Collector: Allocator`: why arena3 was needed and what problems came up (2/2)

Note author: shruti2522

This continues from the previous note. Writing `unsafe impl Allocator for MarkSweepGarbageCollector` ran into several problems, some in the arena design itself and some from Rust's rules. Below is each problem and what fixed it.
### Why arena2 could not be used

arena2 is a `LinkedList` of fixed-size pages, where every slot has an 8-byte header storing allocation chain pointers. The collector uses that header to decide if a slot is alive. The issue is there is no way to tell a typed GC object apart from a raw byte slab, which means they can land on the same page mixed together.

problem:

`Allocator::allocate` gives back raw bytes. When a `Vec<T, &MarkSweepGarbageCollector>` asks for a buffer, that buffer has no `GcBox` header. So under arena2, a `GcBox<Node>` and a Vec buffer could end up side by side on the same page. The sweep phase would then walk into the Vec buffer expecting a `GcBox` at each slot and produce garbage. On top of that, there was no per-page count of raw allocations, so `drop_dead_arenas` would either hold onto a page forever or free it too early while the Vec was still alive.

fix:

arena3 fixes this by splitting into two separate page lists: `typed_arenas` for GC objects and `raw_arenas` for raw byte allocations. The mark and sweep phases only ever touch `typed_arenas`. Each raw page gets an `active_raw_allocs: Cell<usize>` counter: `try_alloc_bytes` increments it, `dealloc_bytes` decrements it, and `drop_dead_arenas` frees the page once it hits zero. On the typed side, a per-page bitmap (one bit per slot) tracks liveness instead of storing anything inside the slot itself. This keeps all slots in a page the same size, which the size-class `alloc_cache` index needs to work correctly, and it's also what lets `grow` extend a Vec buffer in place when the new layout stays in the same size class.
16+
17+
### Problems that came up during the implementation
18+
19+
**`&self` vs `&mut self`**
20+
`Allocator::allocate` takes `&self`, but all arena code was using `&mut self`. The obvious workaround was `Arc/Mutex` but that would break `no_std`, so that was a dead end
21+
22+
fix: the arena field was changed to `RefCell<ArenaAllocator<'static>>`. `allocate()` calls `borrow_mut()` for just the duration of the allocation and drops it right after. Since the collector is single threaded there is never a second caller trying to borrow at the same time anyway
23+
24+
**Reentrancy**
25+
If `allocate()` triggered a collection internally to free up space, `collect()` would try to call `borrow_mut()` on the same `RefCell` that `allocate()` was alraedy holding leading to instant runtime panic
26+
27+
fix: the threshold check and the `collect()` call happen before the allocator borrow is taken:
28+
29+
```rust
if self.collect_needed.get() && !self.is_collecting.get() {
    self.collect_needed.set(false);
    self.collect(); // borrows and releases the arena internally
}
let result = self.allocator.borrow_mut().try_alloc_bytes(layout)?;
```
There's also a debug panic if `allocate()` is called while `is_collecting` is set, mostly as a safety net to catch any future code that sneaks back in through an unexpected path.

**`deallocate` leak risk**

Bump allocators can't free individual slots.

fix: arena3's `active_raw_allocs` counter gives `drop_dead_arenas` a way to reclaim a whole raw page once every allocation on it has been released. It's not per-slot free, but it's enough for the bump-allocator model.
43+
44+
**Trace integration**
45+
plain `Vec<T, &MarkSweepGarbageCollector>` is invisible to the mark phase. If it's holding `Gc<T>` values inside a GC managed object, those values look unreachable and get swept
46+
47+
fix: `GcAllocVec<T: Trace>` and `GcAllocBox<T: Trace>` wrap `allocator_api2::Vec`/`Box` and implement `Trace` by visiting each element. Any `Vec` or `Box` on the GC heap that holds `Gc<T>` values must use one of these wrappers, plain data rooted at the stack or a `Root` boundary doesn't need one.
**Hidden `'static` in `alloc_gc_node`**

`Collector::alloc_gc_node` used to return `ArenaPointer<'static, GcBox<T>>`, which silently extended the lifetime inside the impl with no visible `unsafe` at the call site.

fix: the return type is now `ArenaPointer<'gc, _>` tied to `&'gc self`, so the lifetime is explicit. Call sites that actually need `'static` storage (`Root::new_in`, `WeakGc::new_in`, and `WeakMap::insert`) now call `unsafe { ptr.extend_lifetime() }` directly, with a safety comment.
Lines changed: 127 additions & 0 deletions
# arena2 vs arena3 benchmark results

date: 2026-03-02
Note author: shruti2522

This goes over the results of the `arena2_vs_arena3` bench suite. The question was whether arena3's size-class bitmap design is worth using over arena2's simpler linked list with per-slot headers.

**answer**: arena2 is faster at raw allocation, but arena3 fits more objects into the same amount of memory.
## Results

### allocation speed

arena2 is faster at every size, and the gap grows as object count goes up:

| objects | arena3 | arena2 |
|---------|---------|---------|
| 100 | 1.02 µs | 643 ns |
| 500 | 4.15 µs | 1.83 µs |
| 1000 | 8.36 µs | 2.77 µs |

At 100 objects arena2 is roughly 2x faster; at 1000 it's 3x. arena3 is slower for two reasons: it has to do a size-class lookup on every allocation (finding the right arena for the object's size) and set a bit in the bitmap. arena2 just moves a pointer forward and writes an 8-byte header.
### small object overhead

arena2 is faster here too, which might seem odd given that it writes an 8-byte header on every object. But this bench measures allocation time, not memory use: writing the header is cheap, and what costs time in arena3 is the size-class routing.

| objects | arena3 (0-byte header) | arena2 (8-byte header) |
|---------|------------------------|------------------------|
| 100 | 781 ns | 257 ns |
| 500 | 3.56 µs | 1.08 µs |
| 1000 | 7.02 µs | 2.15 µs |

arena2 is roughly 3x faster across all sizes here; the cost of the bitmap and size-class lookup shows up clearly when the objects are small.
### mixed sizes

allocating objects of four different sizes (16, 32, 64, 128 bytes) in interleaved batches of 50 each:

- arena3: 1.878 µs
- arena2: 441 ns

arena2 is ~4x faster. arena3 sends each allocation to a different per-size arena, which means more branching and more work keeping track of arena pointers.
### memory efficiency

how many 16-byte objects fit in a single 4KB page before it needs to grab a new one:

- arena3: **254 objects**
- arena2: **170 objects**

arena3 fits ~50% more objects per page. The reason is arena2's 8-byte header per slot: a 16-byte object actually takes 24 bytes. arena3 tracks liveness in a bitmap at the top of the page instead, so each slot stays 16 bytes.

This is the number that drove the decision: fewer pages means fewer pointer reads during the sweep phase, better cache use, and less work for the collector overall.
### Vec growth pattern

simulating a Vec doubling from capacity 1 to 1024 (11 allocations of increasing size):

- arena3: 1.12 µs
- arena2: 370 ns

arena2 is ~3x faster. For a growing Vec the size-class lookup cost hits on every doubling step, since each new size lands in a different arena.
### sustained throughput (10k allocations)

- arena3: 71.5 µs
- arena2: 23.0 µs

arena2 is ~3x faster at a steady allocation rate. This is the biggest gap in the whole suite.
### deallocation speed

time to free all objects and reclaim dead arenas:

| objects | arena3 | arena2 |
|---------|---------|---------|
| 100 | 951 ns | 665 ns |
| 500 | 2.57 µs | 2.11 µs |
| 1000 | 4.65 µs | 4.97 µs |

arena2 is ~30% faster at 100 objects and ~18% faster at 500. Dealloc in arena2 is a single bit flip on the slot header; arena3 has to write a free-list node into the slot and clear the bitmap bit.

At 1000 objects it flips: arena3 is ~6% faster. arena3 recycles freed slots via the free list, so fewer pool pages accumulate and `drop_dead_arenas` has less to walk. arena2 cannot recycle slots, so all pages stay alive until the whole arena is dropped.

The crossover is somewhere between 500 and 1000 objects, roughly where slot recycling starts paying back the per-free overhead.
## what this means

arena2 wins every timing number, but for a GC, allocation is only half the work. The other half is how cheap it is to sweep dead objects and how well the heap fits in cache.

254 vs 170 objects per 4KB page means fewer pages to walk and less memory for the mark phase to touch. arena2 also has to read and decode an 8-byte header on every slot during the sweep. arena3's bitmap checks 64 slots at once with a single 64-bit word read and a `trailing_zeros` call.
The tradeoff is intentional: arena3 pays more at allocation time to get cheaper collection, a smaller heap, and better cache behavior during the sweep. The supertrait benchmark results confirm this holds in practice. The collection pause improvements over `boa_gc` come from arena3's sweep being cheaper.
## things to keep in mind

- the allocation slowdown matters for workloads that alloc a lot and collect rarely. Worth profiling Boa's JS workloads to check the alloc/collect ratio.
- the size-class lookup at mixed sizes is the main cost. A binary search or a small table indexed by leading zeros could speed it up without changing the bitmap design.
Lines changed: 47 additions & 0 deletions
# `Collector: Allocator`: is it possible and is it safe?

Note author: shruti2522
date: 2026-03-03

Issue #11 raised two questions about making `Collector` a supertrait of `Allocator`.
## Is it possible in a generic way across multiple collectors?

Mostly yes. The main friction is that `Allocator` requires `&self` while allocation paths take `&mut self`. The fix in PR #15 was to store the arena as `RefCell<ArenaAllocator<'static>>` directly on `MarkSweepGarbageCollector`, then write `unsafe impl allocator_api2::alloc::Allocator for MarkSweepGarbageCollector`. `allocate()` calls `borrow_mut()` on the `RefCell` for the duration of the allocation and drops it right after. Because mark-sweep is non-moving, raw pointers from `allocate()` stay valid.
For compacting collectors it doesn't work without extra machinery. A compacting collector moves objects around, which silently invalidates raw pointers held in a `Vec<T, GcAllocator>` buffer; those would need a pinning mechanism or a different allocator surface. The supertrait is practical here specifically because mark-sweep is non-moving.

There's also a reentrancy problem: `allocate()` takes a mutable borrow on the `RefCell`, and if that triggers a collection pass, the second borrow panics at runtime. Fixed it by putting the threshold check and `collect()` call before the allocator borrow, not inside the arena.
## Is it even safe to do this and use collections that may not be properly designed for this use case?

It depends on whether the collection implements `Trace`. A `Vec<T, GcAllocator>` on the stack or in a rooted `GcRefCell` is fine. The problem is when the Vec lives inside a GC-managed object. The mark phase needs to trace into the Vec's elements: if it's holding `Gc<T>` pointers and the collector doesn't know about them, those pointers look unreachable and get swept.

`GcAllocVec` handles this by implementing `Trace` and visiting its elements. Any collection stored in a GC-managed object that holds `Gc<T>` values needs a `Trace` wrapper; plain data rooted at `Root` or on the stack doesn't.
`Collector: Allocator` is safe as long as:

1. The collector is non-moving.
2. `deallocate` is a no-op or correctly releases arena memory. For a bump arena, doing nothing is fine.
3. Any collection stored in a GC-managed object that holds `Gc<T>` values implements `Trace`.

The `oscars_vs_boa_gc` bench confirms this holds in practice: `GcAllocVec` is always wrapped in `Root` or accessed through a `GcRefCell` that implements `Trace`, and the sweep phase causes no issues.
Lines changed: 92 additions & 0 deletions
# oscars (with `Collector: Allocator` supertrait) vs boa_gc benchmark results

Note author: shruti2522
date: 2026-03-02

I wrote this benchmark to measure what adding the `Collector: Allocator` supertrait and the size-class bitmap in arena3 do to performance.

Ran the `oscars_vs_boa_gc` bench suite with the `gc_allocator` feature on. It compares oscars against `boa_gc` across node allocation, collection pauses, vector operations, mixed workloads, memory pressure, and deep object graphs.

overall: oscars is faster across the board, and the gap grows at larger sizes. A few regressions showed up worth watching, but I think the overall direction is good.
## Results

### gc_node_allocation

oscars got ~12% faster at size 10. Sizes 100 and 1000 were flat. boa_gc got ~15% faster at all three sizes. The numbers still favor oscars heavily: boa_gc at 1000 nodes takes ~59 µs vs ~24 µs for oscars.

### gc_collection_pause

oscars stayed flat at 100 and 1000 objects; at 500 objects it got ~30% slower. `boa_gc` got ~30% faster at 100 and 1000, but also got ~75% slower at 500.

Both sides regressed at 500 in the same direction, which I think is due to benchmark noise or a scheduling blip rather than a code change. Still worth watching.
### vector_creation (oscars_gc_allocator vs boa_gc_std_vec)

oscars got ~8% faster at size 10 and ~10% faster at size 100; size 1000 was flat. `boa_gc` was flat at 10 and 100, but at 1000 it showed a regression of over 2000%. Almost certainly a fluke: `Criterion` flagged a warmup warning at that size, which means the bench run was unstable. Will look into it again.

### vec_of_gc_pointers

oscars got ~8% faster at 50 elements. 10 and 100 were unchanged for both.

### mixed_workload

oscars was flat. boa_gc got ~12% faster. The ratio between them is about the same as before: oscars at ~6.7 µs vs boa_gc at ~15.7 µs.
### memory_pressure

oscars got ~9% slower; boa_gc was unchanged. The churn pattern (allocate 50 per round, keep 1 in 10, collect over 10 rounds) puts a lot of pressure on arena reuse. Would look into this; I think arenas that are nearly but not fully empty may be the cause.

### deep_object_graph (depth 5, branching factor 3)

oscars got ~20% faster (15.6 µs → 16.3 µs). `boa_gc` improved ~99%, likely from a very bad baseline run, down to ~39.8 µs. oscars is still roughly 2.5x faster for this workload.
## What the supertrait and size-class bitmap had to do with it

### `Collector: Allocator` supertrait

The `Allocator` supertrait means `MarkSweepGarbageCollector` implements `allocator_api2::Allocator` through a shared reference. This lets us write `Vec<T, &MarkSweepGarbageCollector>`, which is what `GcAllocVec` is. The Vec's backing buffer lives inside the GC arena directly instead of going through the system allocator.

The `vector_creation` and `vec_of_gc_pointers` benchmarks show this most clearly: when oscars creates a `GcAllocVec`, the capacity slab and the GC node header both come out of the same arena page. The system allocator is never touched, and that's where the consistent improvement in the vec benchmarks comes from.

I think it also helps the mixed-workload and deep-graph cases. A `Node` with a `Vec<Gc<...>>` field puts its children buffer in the arena too, so the whole object graph ends up packed together rather than spread across the system heap.
### size-class bitmap (arena3)

arena3 stores liveness in a 64-bit bitmap at the top of each page instead of a per-object header field. This means:

- **zero per-object overhead**: no extra bytes per object for a live-or-dead flag

- **fast sweep**: during `collect()`, the sweep scans bitmap words with bitwise ops instead of visiting every object. For 100 or 1000 small objects the mark-and-clear pass is cheap enough to keep collection pauses low

- **size-class routing**: objects go into arenas sized to the nearest class (16, 24, 32 ... 2048 bytes). This keeps all slots in a page the same size, which makes bitmap indexing simple and free-list reuse reliable. Allocation stays fast because `alloc_slot` checks the free list first, then bumps
As for the allocation improvement at `gc_node_allocation/oscars/10` and the collection pause improvements across all sizes, I think these come from tight arena packing and the cheap bitwise sweep working together.

oscars/Cargo.toml

Lines changed: 17 additions & 0 deletions
```diff
@@ -4,10 +4,27 @@ version = "0.1.0"
 edition = "2024"

 [dependencies]
+allocator-api2 = { version = "0.4.0", optional = true }
 hashbrown = "0.16.1"
 oscars_derive = { path = "../oscars_derive", version = "0.1.0" }
+rustc-hash = "2.1.1"
+
+[dev-dependencies]
+criterion = { version = "0.5", features = ["html_reports"] }
+
+boa_gc = { git = "https://github.com/boa-dev/boa", branch = "main" }
+
+[[bench]]
+name = "oscars_vs_boa_gc"
+harness = false
+required-features = ["gc_allocator"]
+
+[[bench]]
+name = "arena2_vs_arena3"
+harness = false

 [features]
 default = ["mark_sweep"]
 std = []
 mark_sweep = []
+gc_allocator = ["dep:allocator-api2", "mark_sweep"]
```
