perf(block): sync.Map -> atomic bitmap for cache dirty tracking#2235
Conversation
Replace the `sync.Map`-based dirty block tracking in `block.Cache` with a
pre-allocated `[]atomic.Uint64` bitset. Each bit represents one block;
atomic OR (`atomic.Uint64.Or`) is used for concurrent-safe marking.
## Problem
The block cache tracks which blocks have been written (are "dirty") so that
`Slice()` can decide whether to serve from mmap or return `BytesNotAvailable`.
The previous implementation used `sync.Map` with one entry per dirty block
offset. For a 64 MiB file at 4K block size, that's up to 16,384 map entries —
each requiring a heap-allocated key and value, plus the internal hash map
overhead of `sync.Map`.
Every call to `setIsCached` allocated a `[]int64` via `header.BlocksOffsets()`
and then called `sync.Map.Store()` in a loop. Every call to `isCached` did
the same allocation and called `sync.Map.Load()` in a loop. These are the
hottest paths in the block device — called on every NBD read and every
chunk fetch.
## Solution
- `dirty` field changed from `sync.Map` to `[]atomic.Uint64`, sized at
`ceil(numBlocks / 64)` and allocated once in `NewCache`.
- `setIsCached` computes a bitmask per 64-bit word and applies it with
`atomic.Uint64.Or` — one atomic op per word instead of one map store per
block.
- `isCached` / `isBlockCached` read bits with `atomic.Uint64.Load` — no
allocation, no map lookup.
- `dirtySortedKeys` iterates words and uses `bits.TrailingZeros64` to extract
set bits — produces a naturally sorted result without `slices.Sort`.
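The four bullets above can be sketched roughly as follows. This is a hypothetical standalone version, not the actual `block.Cache` code (names like `SetRange`/`Has`/`Keys` are illustrative), and `atomic.Uint64.Or` requires Go 1.23+:

```go
package main

import (
	"fmt"
	"math/bits"
	"sync/atomic"
)

// Bitset is a fixed-size bitmap safe for concurrent use.
type Bitset struct {
	words []atomic.Uint64
}

// New returns a Bitset with capacity for n bits, all zero,
// allocated once up front (ceil(n/64) words).
func New(n int64) *Bitset {
	return &Bitset{words: make([]atomic.Uint64, (n+63)/64)}
}

// SetRange marks bits [start, end) with one atomic OR per 64-bit word.
func (b *Bitset) SetRange(start, end int64) {
	if end <= start {
		return
	}
	for w := start / 64; w <= (end-1)/64; w++ {
		lo, hi := int64(0), int64(64)
		if w == start/64 {
			lo = start % 64
		}
		if w == (end-1)/64 {
			hi = (end-1)%64 + 1
		}
		var mask uint64
		if hi-lo == 64 {
			mask = ^uint64(0)
		} else {
			mask = ((uint64(1) << (hi - lo)) - 1) << lo
		}
		b.words[w].Or(mask)
	}
}

// Has reports whether bit i is set: a single atomic load, no allocation.
func (b *Bitset) Has(i int64) bool {
	return b.words[i/64].Load()&(1<<(i%64)) != 0
}

// Keys returns set bit indices in ascending order. Extracting bits with
// bits.TrailingZeros64 yields a naturally sorted result, no slices.Sort.
func (b *Bitset) Keys() []int64 {
	var keys []int64
	for wi := range b.words {
		w := b.words[wi].Load()
		for w != 0 {
			bit := bits.TrailingZeros64(w)
			keys = append(keys, int64(wi*64+bit))
			w &^= 1 << bit
		}
	}
	return keys
}

func main() {
	b := New(256)
	b.SetRange(62, 67) // crosses a word boundary
	fmt.Println(b.Keys())
}
```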
## Benchmarks
64 MiB cache, 4K blocks, 4 MiB chunks (1024 blocks per chunk):
```
│ sync.Map │ bitmap │
│ sec/op │ sec/op vs base │
MarkRangeCached-16 63857.50n ± 3% 25.04n ± 6% -99.96% (p=0.002 n=6)
IsCached_Hit-16 19406.0n ± 1% 908.6n ± 0% -95.32% (p=0.002 n=6)
IsCached_Miss-16 1168.500n ± 1% 2.813n ± 1% -99.76% (p=0.002 n=6)
Slice_Hit-16 19296.0n ± 1% 913.7n ± 1% -95.27% (p=0.002 n=6)
Slice_Miss-16 1161.000n ± 3% 3.765n ± 0% -99.68% (p=0.002 n=6)
│ sync.Map │ bitmap │
│ B/op │ B/op vs base │
MarkRangeCached-16 64.05Ki ± 0% 0.00Ki ± 0% -100.00% (p=0.002 n=6)
IsCached_Hit-16 8.000Ki ± 0% 0.000Ki ± 0% -100.00% (p=0.002 n=6)
IsCached_Miss-16 8.000Ki ± 0% 0.000Ki ± 0% -100.00% (p=0.002 n=6)
Slice_Hit-16 8.000Ki ± 0% 0.000Ki ± 0% -100.00% (p=0.002 n=6)
Slice_Miss-16 8.000Ki ± 0% 0.000Ki ± 0% -100.00% (p=0.002 n=6)
│ sync.Map │ bitmap │
│ allocs/op │ allocs/op vs base │
MarkRangeCached-16 2.049k ± 0% 0.000k ± 0% -100.00% (p=0.002 n=6)
IsCached_Hit-16 1.000 ± 0% 0.000 ± 0% -100.00% (p=0.002 n=6)
IsCached_Miss-16 1.000 ± 0% 0.000 ± 0% -100.00% (p=0.002 n=6)
Slice_Hit-16 1.000 ± 0% 0.000 ± 0% -100.00% (p=0.002 n=6)
Slice_Miss-16 1.000 ± 0% 0.000 ± 0% -100.00% (p=0.002 n=6)
```
Zero allocations across all operations. `setIsCached` goes from 64µs to 25ns
(2550x), `Slice` hit from 19µs to 914ns (21x).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix OOB panic in `setIsCached` when the range extends past the cache size. The old `sync.Map.Store` silently accepted out-of-bounds keys, but the bitmap would panic on an out-of-bounds index. Cap `n` to `len(dirty)*64`.
- Add `TestSetIsCached_PastCacheSize` to verify the fix.
- Add `TestSetIsCached_ConcurrentOverlapping`: 8 goroutines with overlapping ranges under `-race` to prove atomic OR correctness.
- Remove redundant `sort.SliceIsSorted` (covered by `require.Equal`).
- Use 2 MiB block size in the BoundaryCrossing test (real hugepage size) to exercise a second block size and drop `nolint:unparam`.
- Make benchmark consts consistently `int64`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dobrac
left a comment
Could we either use a library (like bits-and-blooms/bitset) for this functionality (preferred), or refactor the `[]atomic.Uint64` functionality into a separate file (module)?
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
dobrac
left a comment
Let's then move it to its own utility right away; there is almost no added cost in doing it now.
Move the []atomic.Uint64 bitmap from block.Cache into a standalone atomicbitset.Bitset in packages/shared/pkg/atomicbitset. Word-level masking in HasRange reduces atomic loads on the chunker hot path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
```
// New returns a Bitset with capacity for n bits, all initially zero.
func New(n uint) Bitset {
```
I am at most 1/5 on `uint` here (and for indexing into the set in all methods). It's clean for the interface, but the callers use `int64`, which would make the bitset index awkward IMO. 2/5 that we should do a separate cleanup PR where we assume int == int64 and use `int` everywhere except where `uint` is absolutely needed.
For this case, I think we should be explicit with `uint64`. Any value other than `uint64` will be handled incorrectly in this code, since the 64-bit word length is used directly in the operations.
We do not need 64-bit values for the number of blocks, do we? The normal roaring bitset supports only 32-bit indices for addressing bits. There is a 64-bit version, but it is less mature. And using a `uint64` would still cause an ugly cast at the caller?
I think it's okay to have the ~8 TiB limit for now (for 4 KiB blocks); I don't expect we would approach it any time soon.
@dobrac For the compressed bitmaps, https://github.com/RoaringBitmap/roaring seems to be widely used, and for the filesystem it could really improve the default sizes (even beyond this PR's optimization). The only problem is that it is also not atomic, so we would need a mutex there. This PR is better in that it effectively shards the mutexes by using atomics.
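To make the mutex point concrete, a wrapper like the one discussed might look as below. This is only a sketch: the plain map is a stand-in for the compressed roaring container, and all names (`LockedBitmap`, `SetRange`, `Has`) are hypothetical. Every operation contends on the one lock, unlike the atomic bitset, where each 64-bit word is an independent synchronization point:

```go
package main

import (
	"fmt"
	"sync"
)

// LockedBitmap wraps a non-thread-safe bitmap (here a plain map standing in
// for a compressed container like roaring) behind a single RWMutex.
type LockedBitmap struct {
	mu   sync.RWMutex
	bits map[uint32]struct{}
}

func NewLockedBitmap() *LockedBitmap {
	return &LockedBitmap{bits: make(map[uint32]struct{})}
}

// SetRange marks bits [start, end) under the write lock.
func (b *LockedBitmap) SetRange(start, end uint32) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for i := start; i < end; i++ {
		b.bits[i] = struct{}{}
	}
}

// Has checks one bit under the read lock.
func (b *LockedBitmap) Has(i uint32) bool {
	b.mu.RLock()
	defer b.mu.RUnlock()
	_, ok := b.bits[i]
	return ok
}

func main() {
	b := NewLockedBitmap()
	var wg sync.WaitGroup
	for g := uint32(0); g < 8; g++ {
		wg.Add(1)
		go func(g uint32) { // overlapping ranges, safe only via the lock
			defer wg.Done()
			b.SetRange(g*100, g*100+200)
		}(g)
	}
	wg.Wait()
	fmt.Println(b.Has(750))
}
```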
```
 	mmap  *mmap.MMap
 	mu    sync.RWMutex
-	dirty sync.Map
+	dirty atomicbitset.Bitset
```
if we have a 100GB disk and we set a dirty block at the end, won't we use like 400KB of memory just for storing that one bit?
Won't this be 3.125MiB?
((100*1024^3)/4096/64)/1024/1024
To be fair here though, this is also a problem with the bitset we use now. Only the roaring bitmaps solve this, but they don't support the iteration/lookup we would ideally need, so for pause processing I would still like to use the current bitset (or implement a similarly effective iterator over the bitmaps).
One other note: with the current NBD, the cache map might not be under much contention (but this might be relevant if we use it for the other caches).
Won't this be 3.125MiB?
divided by 64 as you have 64 bits (chunks) for one array slot
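The arithmetic being discussed, spelled out (assuming 4 KiB blocks and one bit per block; each `atomic.Uint64` slot covers 64 bits and occupies 8 bytes):

```go
package main

import "fmt"

func main() {
	const (
		diskBytes = 100 << 30 // 100 GiB disk
		blockSize = 4096      // 4 KiB blocks
	)
	blocks := diskBytes / blockSize // one bit per block
	words := (blocks + 63) / 64     // 64 bits per uint64 slot
	total := words * 8              // 8 bytes per slot
	fmt.Printf("%d blocks -> %d words -> %.3f MiB\n",
		blocks, words, float64(total)/(1<<20))
}
```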
So I added 2 implementations now: roaring (by far the most compact, and plenty fast on its own, single-threaded), and a sharded atomic uint64 array, which is still quite compact and much (100x) faster when concurrency is involved.
The roaring implementation looks great. Unless we really need the speed optimization, I'm for opting for the roaring approach only for now, leaving the sharded atomic optimization outside of this PR.
We've agreed @levb to put this on hold for now unless there is something important that needs it.
Introduce three atomicbitset implementations behind a shared Bitset
interface, selectable at runtime via the chunker-config feature flag
("bitset" key): roaring (default), and atomic (flat/sharded auto).
- Flat: []atomic.Uint64, lock-free, for small bitmaps (≤64KB)
- Sharded: two-level with lazy shard allocation, lock-free, for large sparse bitmaps
- Roaring: compressed bitmap with RWMutex, default via FF
- Iterator/UnsafeIterator split for safe vs caller-synchronized iteration
- Conformance tests across all implementations with race detector
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
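A rough sketch of the lazy-shard idea from the bullets above (hypothetical names and sizes; the real code lives in packages/shared/pkg/atomicbitset and differs in thresholds and interface). Each shard of words is published via an atomic compare-and-swap on its pointer, so untouched regions of a large sparse bitmap cost only a nil pointer:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

const shardWords = 1024 // 64 Kib of bits (8 KiB of memory) per shard

type shard [shardWords]atomic.Uint64

// Sharded is a two-level bitset: a slice of atomically published shard
// pointers, each shard allocated lazily on first Set.
type Sharded struct {
	shards []atomic.Pointer[shard]
}

func NewSharded(nBits int64) *Sharded {
	nShards := (nBits + shardWords*64 - 1) / (shardWords * 64)
	return &Sharded{shards: make([]atomic.Pointer[shard], nShards)}
}

// getOrAlloc returns shard i, allocating it with CAS so that concurrent
// first writers agree on a single shard.
func (s *Sharded) getOrAlloc(i int64) *shard {
	if sh := s.shards[i].Load(); sh != nil {
		return sh
	}
	fresh := new(shard)
	if s.shards[i].CompareAndSwap(nil, fresh) {
		return fresh
	}
	return s.shards[i].Load() // lost the race; use the winner's shard
}

// Set marks one bit, allocating its shard on first touch.
func (s *Sharded) Set(bit int64) {
	word := bit / 64
	sh := s.getOrAlloc(word / shardWords)
	sh[word%shardWords].Or(1 << (bit % 64))
}

// Has reports whether a bit is set; a nil shard means all-zero.
func (s *Sharded) Has(bit int64) bool {
	word := bit / 64
	sh := s.shards[word/shardWords].Load()
	if sh == nil {
		return false
	}
	return sh[word%shardWords].Load()&(1<<(bit%64)) != 0
}

func main() {
	s := NewSharded(1 << 30) // 2^30 bits; only shard pointers up front
	s.Set(1 << 29)           // allocates exactly one 8 KiB shard
	fmt.Println(s.Has(1<<29), s.Has(0))
}
```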
- Use header.TotalBlocks and header.BlockIdx instead of inline math
- Restore isCached comments per reviewer request

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Expand Has() with named locals for readability (flat.go, sharded.go)
- Fix nlreturn, staticcheck, thelper lint issues
- Fix DefaultShardBits comment to match staticcheck ST1022 form

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fixes revive redefines-builtin-id lint error. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use 512 MB / 4 KB = 131072 bits to match production rootfs. Consolidate per-impl benchmark functions into table-driven subtests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Temporarily make atomic (flat/sharded) the default to verify integration tests pass with the lock-free implementation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove internal RWMutex from Roaring and UnsafeIterator from the Bitset interface. Cache now holds a dirtyMu that fences mmap writes for non-atomic impls (roaring), skipped for atomic impls (flat/sharded). Concurrent benchmarks use external RWMutex for Roaring to match prod. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Precursor to PR #2034, separated to reduce the scope there.