Commit 2a63f33

Merge branch 'main' into mempool3
2 parents bb79316 + dd2bfee commit 2a63f33

File tree

28 files changed

+3211
-272
lines changed


Cargo.lock

Lines changed: 559 additions & 5 deletions
Lines changed: 46 additions & 0 deletions
## Exploring `Collector: Allocator` possibilities (note 1/2)

Note author: shruti2522

Issue #11 brought up making `Collector` a supertrait of `Allocator` from `allocator_api2`. The idea is to use `Vec<T, &MyCollector>` directly, backed by the GC arena, without extra handle types.

It seems possible, but actually running with it hits some Rust ownership walls. Here's what I've run into so far. I'm still wrapping my head around some of this, so the fixes are mostly just ideas at this point.
### 1. `&self` vs `&mut self`

**What I'm seeing:**

`Allocator` needs `&self` to allocate, but our `collect()` needs `&mut self`. If we make a `Vec<u64, &MarkSweepGarbageCollector>`, we've borrowed the GC. If we then try to call `gc.collect()`, the borrow checker rejects it, because we can't take a mutable borrow while the Vec holds a shared one.

**Potential ways around it:**

- Just cast pointers to bypass the check. This works in testing, but it feels risky because we'd have to manually guarantee `collect` never frees the Vec's page.
- Maybe we change `collect(&mut self)` to `collect(&self)` and put the internal state in `RefCell` or `UnsafeCell`? This might be a cleaner solution, but it needs more thought.
### 2. Reentrancy issues

**What I'm seeing:**

If we try the interior-mutability route, the arena is in a `RefCell`. `allocate()` would call `borrow_mut()` on it. If an allocation triggers `collect()` internally, it might try to borrow the arena again, and a double `borrow_mut()` is a runtime panic.

**Potential ways around it:**

- Use `UnsafeCell` to skip runtime tracking, but that could be dangerous.
- Decouple allocation from collection: maybe `allocate()` should never trigger a GC cycle? It could just ask for memory, and we could have something else watch the threshold and trigger `collect()` separately. This seems safer.
### 3. `deallocate` does nothing

**What I'm seeing:**

Our arena is a bump allocator, which doesn't give memory back slot by slot. `deallocate` has nowhere to put the memory, so a shrinking Vec might just leak its old buffer until the whole arena drops.

**Potential ways around it:**

- Maybe this is fine? Standard bump allocators work this way.
- If we want to be clean, we could make an `ArenaHeapItem::RawBytes` variant so `allocate()` returns a proper chunk that can be reclaimed during the sweep phase. I'm not sure if it's worth the overhead yet.
35+
36+
### 4. Trace integration
37+
38+
**What I'm seeing:**
39+
The mark phase doesn't know what's in a `Vec<T, &Collector>`. If there's a `Gc<U>` inside, the gc won't see it and might accidentally sweep it.
40+
41+
**Potential ways around it:**
42+
- Maybe use a GC aware wrapper like `GcVec<T>` that implements `Trace`? It could register itself with the gc root queue on creation
### Early thoughts

`Collector: Allocator` seems possible, but it looks like we'd need to shift the collector entirely to `&self` with interior mutability and rework how allocation thresholds are handled.
Lines changed: 52 additions & 0 deletions
## Implementing `Collector: Allocator`: why arena3 was needed and what problems came up (2/2)

Note author: shruti2522

This continues from the previous note. Writing `unsafe impl Allocator for MarkSweepGarbageCollector` ran into several problems, some in the arena design itself and some from Rust's rules. Below is each problem and what fixed it.
### Why arena2 could not be used

arena2 is a `LinkedList` of fixed-size pages, where every slot has an 8-byte header storing allocation chain pointers. The collector uses that header to decide if a slot is alive. The issue is there is no way to tell a typed GC object apart from a raw byte slab, which means they can land on the same page mixed together.

problem:

`Allocator::allocate` gives back raw bytes. When a `Vec<T, &MarkSweepGarbageCollector>` asks for a buffer, that buffer has no `GcBox` header. So under arena2, a `GcBox<Node>` and a Vec buffer could end up side by side on the same page. The sweep phase would then walk into the Vec buffer expecting a `GcBox` at each slot and produce garbage. On top of that, there was no per-page count of raw allocations, so `drop_dead_arenas` would either hold onto a page forever or free it too early while the Vec was still alive.

fix:

arena3 fixes this by splitting into two separate page lists: `typed_arenas` for GC objects and `raw_arenas` for raw byte allocations. The mark and sweep phases only ever touch `typed_arenas`. Each raw page gets an `active_raw_allocs: Cell<usize>` counter: `try_alloc_bytes` increments it, `dealloc_bytes` decrements it, and `drop_dead_arenas` frees the page once it hits zero. On the typed side, a per-page bitmap (one bit per slot) tracks liveness instead of storing anything inside the slot itself. This keeps all slots in a page the same size, which the size-class `alloc_cache` index needs to work correctly, and it's also what lets `grow` extend a Vec buffer in place when the new layout stays in the same size class.
16+
17+
### Problems that came up during the implementation
18+
19+
**`&self` vs `&mut self`**
20+
`Allocator::allocate` takes `&self`, but all arena code was using `&mut self`. The obvious workaround was `Arc/Mutex` but that would break `no_std`, so that was a dead end
21+
22+
fix: the arena field was changed to `RefCell<ArenaAllocator<'static>>`. `allocate()` calls `borrow_mut()` for just the duration of the allocation and drops it right after. Since the collector is single threaded there is never a second caller trying to borrow at the same time anyway
23+
24+
**Reentrancy**
25+
If `allocate()` triggered a collection internally to free up space, `collect()` would try to call `borrow_mut()` on the same `RefCell` that `allocate()` was alraedy holding leading to instant runtime panic
26+
27+
fix: the threshold check and the `collect()` call happen before the allocator borrow is taken:
28+
29+
```rust
if self.collect_needed.get() && !self.is_collecting.get() {
    self.collect_needed.set(false);
    self.collect(); // borrows and releases the arena internally
}
let result = self.allocator.borrow_mut().try_alloc_bytes(layout)?;
```
There's also a debug panic if `allocate()` is called while `is_collecting` is set, mostly as a safety net to catch any future code that sneaks back in through an unexpected path.

**`deallocate` leak risk**

Bump allocators can't free individual slots.

fix: arena3's `active_raw_allocs` counter gives `drop_dead_arenas` a way to reclaim a whole raw page once every allocation on it has been released. It's not per-slot free, but it's enough for the bump-allocator model.
43+
44+
**Trace integration**
45+
plain `Vec<T, &MarkSweepGarbageCollector>` is invisible to the mark phase. If it's holding `Gc<T>` values inside a GC managed object, those values look unreachable and get swept
46+
47+
fix: `GcAllocVec<T: Trace>` and `GcAllocBox<T: Trace>` wrap `allocator_api2::Vec`/`Box` and implement `Trace` by visiting each element. Any `Vec` or `Box` on the GC heap that holds `Gc<T>` values must use one of these wrappers, plain data rooted at the stack or a `Root` boundary doesn't need one.
**Hidden `'static` in `alloc_gc_node`**

`Collector::alloc_gc_node` used to return `ArenaPointer<'static, GcBox<T>>`, which silently extended the lifetime inside the impl with no visible `unsafe` at the call site.

fix: the return type is now `ArenaPointer<'gc, _>` tied to `&'gc self`, so the lifetime is explicit. Call sites that actually need `'static` storage (`Root::new_in`, `WeakGc::new_in`, and `WeakMap::insert`) now call `unsafe { ptr.extend_lifetime() }` directly, with a safety comment.
Lines changed: 127 additions & 0 deletions
# arena2 vs arena3 benchmark results

date: 2026-03-02
Note author: shruti2522

This goes over the results of the `arena2_vs_arena3` bench suite. The question was whether arena3's size-class bitmap design is worth using over arena2's simpler linked list with per-slot headers.

**answer**: arena2 is faster at raw allocation, but arena3 fits more objects into the same amount of memory.
## Results

### allocation speed

arena2 is faster at every size, and the gap grows as object count goes up:

| objects | arena3 | arena2 |
|---------|---------|---------|
| 100 | 1.02 µs | 643 ns |
| 500 | 4.15 µs | 1.83 µs |
| 1000 | 8.36 µs | 2.77 µs |

At 100 objects arena2 is roughly 2x faster; at 1000 it's 3x. arena3 is slower for two reasons: it has to do a size-class lookup on every allocation (finding the right arena for the object's size) and set a bit in the bitmap. arena2 just moves a pointer forward and writes an 8-byte header.
### small object overhead

arena2 is faster here too, which might seem odd given that it writes an 8-byte header on every object. But this bench measures allocation time, not memory use: writing the header is cheap, and what costs time in arena3 is the size-class routing.

| objects | arena3 (0-byte header) | arena2 (8-byte header) |
|---------|------------------------|------------------------|
| 100 | 781 ns | 257 ns |
| 500 | 3.56 µs | 1.08 µs |
| 1000 | 7.02 µs | 2.15 µs |

arena2 is roughly 3x faster across all sizes here; the cost of the bitmap and size-class lookup shows up clearly when the objects are small.
### mixed sizes

allocating objects of four different sizes (16, 32, 64, 128 bytes) in interleaved batches of 50 each:

- arena3: 1.878 µs
- arena2: 441 ns

arena2 is ~4x faster. arena3 sends each allocation to a different per-size arena, which means more branching and more work keeping track of arena pointers.
### memory efficiency

how many 16-byte objects fit in a single 4KB page before it needs to grab a new one:

- arena3: **254 objects**
- arena2: **170 objects**

arena3 fits ~50% more objects per page. The reason is arena2's 8-byte header per slot: a 16-byte object actually takes 24 bytes. arena3 tracks liveness in a bitmap at the top of the page instead, so each slot stays 16 bytes.

This is the number that drove the decision: fewer pages means fewer pointer reads during the sweep phase, better cache use, and less work for the collector overall.
### Vec growth pattern

simulating a Vec doubling from capacity 1 to 1024 (11 allocations of increasing size):

- arena3: 1.12 µs
- arena2: 370 ns

arena2 is ~3x faster. For a growing Vec the size-class lookup cost hits on every doubling step, since each new size lands in a different arena.
### sustained throughput (10k allocations)

- arena3: 71.5 µs
- arena2: 23.0 µs

arena2 is ~3x faster at a steady allocation rate. This is the biggest gap in the whole suite.
### deallocation speed

time to free all objects and reclaim dead arenas:

| objects | arena3 | arena2 |
|---------|---------|---------|
| 100 | 951 ns | 665 ns |
| 500 | 2.57 µs | 2.11 µs |
| 1000 | 4.65 µs | 4.97 µs |

arena2 is ~30% faster at 100 objects and ~18% faster at 500. Dealloc in arena2 is a single bit flip on the slot header; arena3 has to write a free-list node into the slot and clear the bitmap bit.

At 1000 objects it flips: arena3 is ~6% faster. arena3 recycles freed slots via the free list, so fewer pool pages accumulate and `drop_dead_arenas` has less to walk. arena2 cannot recycle slots, so all pages stay alive until the whole arena is dropped.

The crossover is somewhere between 500 and 1000 objects, roughly where slot recycling starts paying back the per-free overhead.
## what this means

arena2 wins every timing number, but for a GC, allocation is only half the work. The other half is how cheap it is to sweep dead objects and how well the heap fits in cache.

254 vs 170 objects per 4KB page means fewer pages to walk and less memory for the mark phase to touch. arena2 also has to read and decode an 8-byte header on every slot during the sweep. arena3's bitmap checks 64 slots at once with a single 64-bit word read and a `trailing_zeros` call.
The tradeoff is intentional: arena3 pays more at allocation time to get cheaper collection, a smaller heap, and better cache behavior during the sweep. The supertrait benchmark results confirm this holds in practice. The collection pause improvements over `boa_gc` come from arena3's sweep being cheaper.
## things to keep in mind

- the allocation slowdown matters for workloads that alloc a lot and collect rarely. Worth profiling Boa's JS workloads to check the alloc/collect ratio.
- the size-class lookup at mixed sizes is the main cost. A binary search or a small table indexed by leading zeros could speed it up without changing the bitmap design.
Lines changed: 47 additions & 0 deletions
# `Collector: Allocator`: is it possible and is it safe?

Note author: shruti2522
date: 2026-03-03

Issue #11 raised two questions about making `Collector` a supertrait of `Allocator`.
## Is it possible in a generic way across multiple collectors?

Mostly yes. The main friction is that `Allocator` requires `&self` while allocation paths take `&mut self`. The fix in PR #15 was to store the arena as `RefCell<ArenaAllocator<'static>>` directly on `MarkSweepGarbageCollector`, then write `unsafe impl allocator_api2::alloc::Allocator for MarkSweepGarbageCollector`. `allocate()` calls `borrow_mut()` on the `RefCell` for the duration of the allocation and drops it right after. Because mark-sweep is non-moving, raw pointers from `allocate()` stay valid.
For compacting collectors it doesn't work without extra machinery. A compacting collector moves objects around, which silently invalidates raw pointers held in a `Vec<T, GcAllocator>` buffer; those would need a pinning mechanism or a different allocator surface. The supertrait is practical here specifically because mark-sweep is non-moving.

There's also a reentrancy problem: `allocate()` takes a mutable borrow on the `RefCell`, and if that triggers a collection pass, the second borrow panics at runtime. Fixed it by putting the threshold check and `collect()` call before the allocator borrow, not inside the arena.
## Is it even safe to do this and use collections that may not be properly designed for this use case?

It depends on whether the collection implements `Trace`. A `Vec<T, GcAllocator>` on the stack or in a rooted `GcRefCell` is fine. The problem is when the Vec lives inside a GC-managed object. The mark phase needs to trace into the Vec's elements: if it's holding `Gc<T>` pointers and the collector doesn't know about them, those pointers look unreachable and get swept.

`GcAllocVec` handles this by implementing `Trace` and visiting its elements. Any collection stored in a GC-managed object that holds `Gc<T>` values needs a `Trace` wrapper; plain data rooted at `Root` or on the stack doesn't.
`Collector: Allocator` is safe as long as:

1. The collector is non-moving.
2. `deallocate` is a no-op or correctly releases arena memory. For a bump arena, doing nothing is fine.
3. Any collection stored in a GC-managed object that holds `Gc<T>` values implements `Trace`.

The `oscars_vs_boa_gc` bench confirms this holds in practice: `GcAllocVec` is always wrapped in `Root` or accessed through a `GcRefCell` that implements `Trace`, and the sweep phase causes no issues.
Lines changed: 92 additions & 0 deletions
# oscars (with `Collector: Allocator` supertrait) vs boa_gc benchmark results

Note author: shruti2522
date: 2026-03-02

I wrote this benchmark to measure what adding the `Collector: Allocator` supertrait and the size-class bitmap in arena3 do to performance.

Ran the `oscars_vs_boa_gc` bench suite with the `gc_allocator` feature on. It compares oscars against `boa_gc` across node allocation, collection pauses, vector operations, mixed workloads, memory pressure, and deep object graphs.

overall: oscars is faster across the board, and the gap grows at larger sizes. A few regressions showed up worth watching, but I think the overall direction is good.
## Results

### gc_node_allocation

oscars got ~12% faster at size 10. Sizes 100 and 1000 were flat. boa_gc got ~15% faster at all three sizes. The numbers still favor oscars heavily: boa_gc at 1000 nodes takes ~59 µs vs ~24 µs for oscars.

### gc_collection_pause

oscars stayed flat at 100 and 1000 objects; at 500 objects it got ~30% slower. `boa_gc` got ~30% faster at 100 and 1000, but also got ~75% slower at 500.

Both sides regressed at 500 in the same direction, which I think is due to benchmark noise or a scheduling blip rather than a code change. Still worth watching.
### vector_creation (oscars_gc_allocator vs boa_gc_std_vec)

oscars got ~8% faster at size 10 and ~10% faster at size 100; size 1000 was flat. `boa_gc` was flat at 10 and 100, but at 1000 it showed a regression of over 2000%. Almost certainly a fluke: `Criterion` flagged a warmup warning at that size, which means the bench run was unstable. Will look into it again.

### vec_of_gc_pointers

oscars got ~8% faster at 50 elements. 10 and 100 were unchanged for both.

### mixed_workload

oscars was flat. boa_gc got ~12% faster. The ratio between them is about the same as before: oscars at ~6.7 µs vs boa_gc at ~15.7 µs.
### memory_pressure

oscars got ~9% slower; boa_gc was unchanged. The churn pattern (allocate 50 per round, keep 1 in 10, collect over 10 rounds) puts a lot of pressure on arena reuse. Would look into this; I think arenas that are nearly but not fully empty may be the cause.

### deep_object_graph (depth 5, branching factor 3)

oscars got ~20% faster (15.6 µs → 16.3 µs). `boa_gc` improved ~99%, likely from a very bad baseline run, down to ~39.8 µs. oscars is still roughly 2.5x faster for this workload.
## What the supertrait and size-class bitmap had to do with it

### `Collector: Allocator` supertrait

The `Allocator` supertrait means `MarkSweepGarbageCollector` implements `allocator_api2::Allocator` through a shared reference. This lets us write `Vec<T, &MarkSweepGarbageCollector>`, which is what `GcAllocVec` is. The Vec's backing buffer lives inside the GC arena directly instead of going through the system allocator.

The `vector_creation` and `vec_of_gc_pointers` benchmarks show this most clearly: when oscars creates a `GcAllocVec`, the capacity slab and the GC node header both come out of the same arena page. The system allocator is never touched, and that's where the consistent improvement in the vec benchmarks comes from.

I think it also helps the mixed-workload and deep-graph cases. A `Node` with a `Vec<Gc<...>>` field puts its children buffer in the arena too, so the whole object graph ends up packed together rather than spread across the system heap.
### size-class bitmap (arena3)

arena3 stores liveness in a 64-bit bitmap at the top of each page instead of a per-object header field. This means:

- **zero per-object overhead**: no extra bytes per object for a live-or-dead flag

- **fast sweep**: during `collect()`, the sweep scans bitmap words with bitwise ops instead of visiting every object. For 100 or 1000 small objects the mark-and-clear pass is cheap enough to keep collection pauses low

- **size-class routing**: objects go into arenas sized to the nearest class (16, 24, 32 ... 2048 bytes). This keeps all slots in a page the same size, which makes bitmap indexing simple and free-list reuse reliable. Allocation stays fast because `alloc_slot` checks the free list first, then bumps
As for the allocation improvement at `gc_node_allocation/oscars/10` and the collection pause improvements across all sizes, I think these come from tight arena packing and the cheap bitwise sweep working together.

oscars/Cargo.toml

Lines changed: 17 additions & 0 deletions
```diff
@@ -4,10 +4,27 @@ version = "0.1.0"
 edition = "2024"

 [dependencies]
+allocator-api2 = { version = "0.4.0", optional = true }
 hashbrown = "0.16.1"
 oscars_derive = { path = "../oscars_derive", version = "0.1.0" }
+rustc-hash = "2.1.1"
+
+[dev-dependencies]
+criterion = { version = "0.5", features = ["html_reports"] }
+
+boa_gc = { git = "https://github.com/boa-dev/boa", branch = "main" }
+
+[[bench]]
+name = "oscars_vs_boa_gc"
+harness = false
+required-features = ["gc_allocator"]
+
+[[bench]]
+name = "arena2_vs_arena3"
+harness = false

 [features]
 default = ["mark_sweep"]
 std = []
 mark_sweep = []
+gc_allocator = ["dep:allocator-api2", "mark_sweep"]
```
