
perf(block): sync.Map -> atomic bitmap for cache dirty tracking#2235

Draft
levb wants to merge 10 commits into main from lev-block-cache-bitmap

Conversation

Contributor

@levb levb commented Mar 26, 2026

Precursor to PR #2034, separated out to reduce the scope there.

Replace the sync.Map-based dirty block tracking in block.Cache with a pre-allocated []atomic.Uint64 bitset. Each bit represents one block; atomic OR (atomic.Uint64.Or) is used for concurrent-safe marking.

Problem

The block cache tracks which blocks have been written (are "dirty") so that Slice() can decide whether to serve from mmap or return BytesNotAvailable. The previous implementation used sync.Map with one entry per dirty block offset. For a 64 MiB file at 4K block size, that's up to 16,384 map entries — each requiring a heap-allocated key and value, plus the internal hash map overhead of sync.Map.

Every call to setIsCached allocated a []int64 via header.BlocksOffsets() and then called sync.Map.Store() in a loop. Every call to isCached did the same allocation and called sync.Map.Load() in a loop. These are the hottest paths in the block device — called on every NBD read and every chunk fetch.

Solution

  • dirty field changed from sync.Map to []atomic.Uint64, sized at ceil(numBlocks / 64) and allocated once in NewCache.
  • setIsCached computes a bitmask per 64-bit word and applies it with atomic.Uint64.Or — one atomic op per word instead of one map store per block.
  • isCached / isBlockCached read bits with atomic.Uint64.Load — no allocation, no map lookup.
  • dirtySortedKeys iterates words and uses bits.TrailingZeros64 to extract set bits — produces a naturally sorted result without slices.Sort.

Benchmarks

64 MiB cache, 4K blocks, 4 MiB chunks (1024 blocks per chunk):

```
                   │  sync.Map      │            bitmap                  │
                   │     sec/op     │   sec/op     vs base               │
MarkRangeCached-16   63857.50n ± 3%   25.04n ± 6%  -99.96% (p=0.002 n=6)
IsCached_Hit-16       19406.0n ± 1%   908.6n ± 0%  -95.32% (p=0.002 n=6)
IsCached_Miss-16     1168.500n ± 1%   2.813n ± 1%  -99.76% (p=0.002 n=6)
Slice_Hit-16          19296.0n ± 1%   913.7n ± 1%  -95.27% (p=0.002 n=6)
Slice_Miss-16        1161.000n ± 3%   3.765n ± 0%  -99.68% (p=0.002 n=6)

                   │ sync.Map     │              bitmap                     │
                   │     B/op     │     B/op      vs base                   │
MarkRangeCached-16   64.05Ki ± 0%   0.00Ki ± 0%  -100.00% (p=0.002 n=6)
IsCached_Hit-16      8.000Ki ± 0%   0.000Ki ± 0%  -100.00% (p=0.002 n=6)
IsCached_Miss-16     8.000Ki ± 0%   0.000Ki ± 0%  -100.00% (p=0.002 n=6)
Slice_Hit-16         8.000Ki ± 0%   0.000Ki ± 0%  -100.00% (p=0.002 n=6)
Slice_Miss-16        8.000Ki ± 0%   0.000Ki ± 0%  -100.00% (p=0.002 n=6)

                   │ sync.Map     │              bitmap                     │
                   │  allocs/op   │  allocs/op   vs base                    │
MarkRangeCached-16    2.049k ± 0%   0.000k ± 0%  -100.00% (p=0.002 n=6)
IsCached_Hit-16        1.000 ± 0%    0.000 ± 0%  -100.00% (p=0.002 n=6)
IsCached_Miss-16       1.000 ± 0%    0.000 ± 0%  -100.00% (p=0.002 n=6)
Slice_Hit-16           1.000 ± 0%    0.000 ± 0%  -100.00% (p=0.002 n=6)
Slice_Miss-16          1.000 ± 0%    0.000 ± 0%  -100.00% (p=0.002 n=6)
```

Zero allocations across all operations. setIsCached goes from 64µs to 25ns (2550x), Slice hit from 19µs to 914ns (21x).

levb and others added 2 commits March 26, 2026 13:03
- Fix OOB panic in setIsCached when range extends past cache size.
  The old sync.Map.Store silently accepted out-of-bounds keys, but
  the bitmap would panic on index OOB. Cap n to len(dirty)*64.
- Add TestSetIsCached_PastCacheSize to verify the fix.
- Add TestSetIsCached_ConcurrentOverlapping: 8 goroutines with
  overlapping ranges under -race to prove atomic OR correctness.
- Remove redundant sort.SliceIsSorted (covered by require.Equal).
- Use 2 MiB block size in BoundaryCrossing test (real hugepage size)
  to exercise a second block size and drop nolint:unparam.
- Make benchmark consts consistently int64.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@levb levb marked this pull request as ready for review March 26, 2026 20:35
@dobrac dobrac assigned dobrac and unassigned djeebus Mar 27, 2026
@dobrac dobrac added the improvement Improvement for current functionality label Mar 27, 2026
Contributor

@dobrac dobrac left a comment


Could we either use a library (like the bits-and-blooms/bitset) for this functionality - prefered, or refactor the []atomic.Uint64 functionality to a separate file (module)?

```go
}

func (c *Cache) isCached(off, length int64) bool {
// Make sure the offset is within the cache size
```
Contributor


please keep

Contributor


this comment is still missing

```go
return false
}

// Cap if the length goes beyond the cache size, so we don't check for blocks that are out of bounds.
```
Contributor


keep

Contributor


this comment is still missing

levb and others added 4 commits March 27, 2026 05:30
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@levb levb requested a review from dobrac March 27, 2026 13:11
Contributor

@dobrac dobrac left a comment


Let's then move it to its own utility right away; there is almost no added cost to doing it now.

Move the []atomic.Uint64 bitmap from block.Cache into a standalone
atomicbitset.Bitset in packages/shared/pkg/atomicbitset. Word-level
masking in HasRange reduces atomic loads on the chunker hot path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
```go
}

// New returns a Bitset with capacity for n bits, all initially zero.
func New(n uint) Bitset {
```
Contributor Author

@levb levb Mar 27, 2026


I am at most 1/5 on uint here (and for indexing in the set in all methods). It's clean for the interface. The callers use int64, which would be awkward for the bitset index IMO. 2/5 we should do a separate cleanup PR where we assume int == int64 and use ints everywhere except uints where absolutely needed.

Contributor


For this case, I think we should be explicit with uint64. Every other value than uint64 will be incorrectly handled in this code part as the 64 length is directly used in the operations.

@levb levb requested a review from dobrac March 27, 2026 15:52
@ValentaTomas
Member

Could we either use a library (like the bits-and-blooms/bitset) for this functionality - prefered, or refactor the []atomic.Uint64 functionality to a separate file (module)?

@dobrac For compressed bitmaps, https://github.com/RoaringBitmap/roaring seems to be widely used, and for the filesystem it could really improve the default sizes (even beyond this PR's optimization). The only problem is that it is also not atomic, so we would need to have a mutex there. This PR is better in that it effectively shards the mutexes by using atomics.

Comment on lines +251 to +253
```go
func (c *Cache) startBlock(off int64) uint {
return uint(off / c.blockSize)
}
```
Contributor


Let's use BlockIdx in the implementation of startBlock.

```go
return nil, fmt.Errorf("error mapping file: %w", err)
}

numBlocks := (size + blockSize - 1) / blockSize
```
Contributor


TotalBlocks can be used instead

```go
return false
}

// Cap if the length goes beyond the cache size, so we don't check for blocks that are out of bounds.
```
Contributor


this comment is still missing

```go
}

func (c *Cache) isCached(off, length int64) bool {
// Make sure the offset is within the cache size
```
Contributor


this comment is still missing

```go
mmap *mmap.MMap
mu sync.RWMutex
dirty sync.Map
dirty atomicbitset.Bitset
```
Contributor


if we have a 100GB disk and we set a dirty block at the end, won't we use like 400KB of memory just for storing that one bit?

Member

@ValentaTomas ValentaTomas Mar 30, 2026


Won't this be 3.125MiB?

((100*1024^3)/4096/64)/1024/1024 (edited)

Member


To be fair here though, this is also a problem with the bitset we use now. Only the roaring bitmaps solve this, but they don't support the iteration/lookup we would ideally need, so for pause processing I would still like to use the current bitset (or implement a similarly effective iterator over the bitmaps).

One other note: with the current NBD the cache map might not be under much contention (but this might be relevant if we use it for the other caches).

Contributor


Won't this be 3.125MiB?

divided by 64 as you have 64 bits (chunks) for one array slot

Member


👌 My bad
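For reference, the arithmetic from this thread as a runnable check:

```go
package main

import "fmt"

func main() {
	const (
		diskSize  = int64(100) << 30 // 100 GiB
		blockSize = int64(4096)      // 4K blocks
	)
	blocks := diskSize / blockSize // 26,214,400 blocks, one bit each
	words := (blocks + 63) / 64    // one uint64 covers 64 blocks
	bytes := words * 8
	fmt.Printf("%.3f MiB\n", float64(bytes)/(1<<20)) // 3.125 MiB
}
```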


```go
func (b *Bitset) Len() uint { return b.n }

// Has reports whether bit i is set. Out-of-range returns false.
func (b *Bitset) Has(i uint) bool {
```
Contributor


Can we make all the function implementations more readable, so it's clear what we're doing?

Something like this, for example:

```go
func (b *Bitset) Has(i uint) bool {
	if i >= b.n {
		return false
	}

	wordIndex := i / 64
	bitIndex := i % 64

	// selector mask (a 1 at the specific bit position)
	mask := uint64(1) << bitIndex

	wordValue := b.words[wordIndex].Load()
	return (wordValue & mask) != 0
}
```
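In the same explicit style, a matching Set could look like the sketch below. This is self-contained for illustration; the Bitset definition and field names here are assumptions, not the PR's actual code.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Bitset is a minimal stand-in for the atomicbitset type under review.
type Bitset struct {
	n     uint
	words []atomic.Uint64
}

func New(n uint) Bitset {
	return Bitset{n: n, words: make([]atomic.Uint64, (n+63)/64)}
}

// Set atomically sets bit i; out-of-range is a no-op.
func (b *Bitset) Set(i uint) {
	if i >= b.n {
		return
	}

	wordIndex := i / 64
	bitIndex := i % 64

	// selector mask (a 1 at the specific bit position)
	mask := uint64(1) << bitIndex

	b.words[wordIndex].Or(mask)
}

// Has reports whether bit i is set; out-of-range returns false.
func (b *Bitset) Has(i uint) bool {
	if i >= b.n {
		return false
	}
	return b.words[i/64].Load()&(uint64(1)<<(i%64)) != 0
}

func main() {
	b := New(130)
	b.Set(0)
	b.Set(129)
	b.Set(200) // out of range: ignored
	fmt.Println(b.Has(0), b.Has(129), b.Has(64))
}
```

Note the mask must be an explicit `uint64`; an untyped `1 << bitIndex` would default to `int` and fail to compile against the `uint64` word.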

@dobrac
Contributor

dobrac commented Mar 30, 2026

We've agreed with @levb to put this on hold for now, unless there is something important that needs it.

@dobrac dobrac marked this pull request as draft March 30, 2026 16:24