
Conversation

@seanses (Collaborator) commented Nov 6, 2025

With the disk cache turned off, when multiple ranges are coalesced into one larger download range (i.e. one pre-signed URL), that range is downloaded multiple times without being cached. This PR adds an in-memory cache dedicated to these coalesced ranges. The cache is per-segment, meaning it is dropped after the entire segment is fully reconstructed, because reconstruction info is returned segment-wise and only adjacent or overlapping ranges within a segment are coalesced.

Partly adopted the memory cache implementation from ec66175.
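
The lifecycle described above can be pictured roughly as follows. This is a minimal sketch with stand-in types; only coalesced_range_reuse_cache matches a name in the PR.

use std::sync::Arc;

// Stand-ins for the real types in this PR.
#[derive(Default)]
struct MemoryCache; // would hold coalesced ranges keyed by xorb + chunk range
struct Term; // one reconstruction term within a segment

fn download_term(_term: &Term, _cache: Arc<MemoryCache>) {
    // ...consult the cache before re-downloading a coalesced range...
}

// One cache per segment: coalesced ranges are only reused by terms within
// the same segment, so nothing is gained by keeping the cache any longer.
fn reconstruct_segment(terms: &[Term]) {
    let coalesced_range_reuse_cache = Arc::new(MemoryCache::default());
    for term in terms {
        download_term(term, Arc::clone(&coalesced_range_reuse_cache));
    }
    // Returning drops the last Arc clone, freeing the cached ranges.
}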

@seanses changed the title to "Sequential download uses an in-memory cache for coalesced range within a segment" on Nov 6, 2025
@seanses marked this pull request as ready for review on November 6, 2025 09:56
@hoytak (Collaborator) left a comment

LGTM, though I'm worried a bit about memory use when there are a lot of parallel downloads. Maybe we should change some defaults for the cache size?

debug!(call_id, num_tasks = terms.len(), "enqueueing download tasks");

// in-memory cache for this segment
let coalesced_range_reuse_cache = Arc::new(MemoryCache::default());
Collaborator

Like @hoytak, I'm worried this could suck up too much RAM since it can keep growing.

I think capping the overall byte size of the cache would be good - maybe a 1GB default, configurable by env variable.

Instead of random eviction, do we want to evict based on a min-heap of num_bytes_in_segment? That is, evict the smallest segments first, since that would minimize the number of network requests.

Also, I think we should log a bit more so we know how many times a segment is reused. That is likely a nice-to-have.
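
A minimal sketch of the cap-plus-env-variable idea. HF_XET_MEM_CACHE_BYTES is a hypothetical variable name, and the oldest-first eviction here is only a placeholder for whichever policy gets settled on.

use std::collections::VecDeque;
use std::env;

// Hypothetical env variable; 1 GB default as suggested above.
fn cache_cap_bytes() -> u64 {
    env::var("HF_XET_MEM_CACHE_BYTES")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(1 << 30)
}

struct CappedCache {
    cap: u64,
    used: u64,
    entries: VecDeque<Vec<u8>>, // front = oldest; keys elided for brevity
}

impl CappedCache {
    fn new() -> Self {
        Self { cap: cache_cap_bytes(), used: 0, entries: VecDeque::new() }
    }

    fn insert(&mut self, data: Vec<u8>) {
        let num_bytes = data.len() as u64;
        // Evict until the new entry fits under the byte cap.
        while self.used + num_bytes > self.cap {
            match self.entries.pop_front() {
                Some(old) => self.used -= old.len() as u64,
                None => return, // entry alone exceeds the cap: skip caching it
            }
        }
        self.used += num_bytes;
        self.entries.push_back(data);
    }
}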

@hoytak (Collaborator) commented Nov 6, 2025

I don't think evicting based on num_bytes_in_segment would work: the cache would fill up with large chunks, and any small chunks in it would get evicted immediately, before they could be used.

A better heuristic here is perhaps insertion time. However, I think the memory issues could be resolved by having a common cache with a larger limit instead of the per-download cache, or simply by making the default smaller to reflect the limit on the number of parallel downloads.
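
A sketch of the common-cache alternative: one process-wide cache shared by all downloads, so total memory is bounded by a single limit rather than by the per-cache limit times the number of parallel downloads. Names here are illustrative, not the PR's types.

use std::collections::HashMap;
use std::sync::{Mutex, OnceLock};

type RangeKey = (u64, u32, u32); // illustrative: (xorb id, chunk start, chunk end)

// One shared cache for the whole process instead of one per download/segment.
static SHARED_CACHE: OnceLock<Mutex<HashMap<RangeKey, Vec<u8>>>> = OnceLock::new();

fn shared_cache() -> &'static Mutex<HashMap<RangeKey, Vec<u8>>> {
    SHARED_CACHE.get_or_init(|| Mutex::new(HashMap::new()))
}

The trade-off is that entries can no longer be dropped simply by finishing a segment, so an explicit eviction policy (such as the insertion-time one above) becomes mandatory.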

Collaborator

Isn't the main optimization goal behind adding a memory cache to limit the number of network requests needed to reconstruct a file? Minimizing that reduces overall download time and ensures we don't download more bytes than we need by repeatedly fetching the same segment.

So with an upper bound on cache size we need to evict entries, and the entries we evict should be the least-used and smallest segments: evicting those costs the fewest extra network requests.

Since we get all the reconstruction info at the beginning of the download, we can do the bookkeeping to see which segments have the most terms, and maybe weight those segments by overall byte count when inserting into the cache. Then, as we consume from the cache, we decrement the remaining terms for a segment and remove it from the cache when the segment is fully consumed. When evicting, we evict the segments with the least usage and the fewest bytes, preserving the biggest segments with the most remaining terms.

Yes, this will result in the cache being filled up with large chunks, but those large chunks also have lots of remaining terms to be consumed, so they are worth keeping. And the large chunks will be removed from the cache once all their terms are consumed anyway.

Collaborator

The map would be something like:

(xorb_id : (num_terms_remaining, weighted_segment_score, bytes))
weighted_segment_score = (segment_bytes / HF_XET_NUM_RANGE_IN_SEGMENT_BASE) / num_terms

That is, segment_bytes as a fraction of the configurable segment size, divided by the number of terms the segment satisfies. The entry with the smallest weighted score should be evicted.

This way, on a cache hit, once num_terms_remaining hits 0 the entire cache entry is removed.
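
A sketch of that bookkeeping, implementing the formula as written; the key type, and whether HF_XET_NUM_RANGE_IN_SEGMENT_BASE exists as a config knob, are assumptions.

use std::collections::HashMap;

// Value stored per xorb_id, mirroring the mapping above.
struct CacheEntry {
    num_terms_remaining: u32,
    weighted_segment_score: f64,
    data: Vec<u8>,
}

// (segment_bytes / HF_XET_NUM_RANGE_IN_SEGMENT_BASE) / num_terms, as proposed.
fn weighted_segment_score(segment_bytes: u64, segment_base: u64, num_terms: u32) -> f64 {
    (segment_bytes as f64 / segment_base as f64) / num_terms as f64
}

fn on_cache_hit(map: &mut HashMap<u64, CacheEntry>, xorb_id: u64) {
    let fully_consumed = match map.get_mut(&xorb_id) {
        Some(entry) => {
            entry.num_terms_remaining = entry.num_terms_remaining.saturating_sub(1);
            entry.num_terms_remaining == 0
        }
        None => false,
    };
    // A fully consumed entry can never be reused, so drop it eagerly.
    if fully_consumed {
        map.remove(&xorb_id);
    }
}

fn evict_one(map: &mut HashMap<u64, CacheEntry>) {
    // Pick the entry with the smallest weighted score, then remove it.
    let victim = map
        .iter()
        .min_by(|a, b| a.1.weighted_segment_score.total_cmp(&b.1.weighted_segment_score))
        .map(|(&id, _)| id);
    if let Some(id) = victim {
        map.remove(&id);
    }
}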

// The requested range cannot be larger than the fetched range; cache the fetched
// range because it will be reused within the segment.
// In the "else" case the data matches exactly, so save some work and return the whole data.
if self.term.range.start != chunk_range.start || self.term.range.end != chunk_range.end {
Contributor

This condition captures some of the cases but not all: it helps with the case we're dealing with, but it isn't general for everything.

Collaborator

True, but I think it's a pretty solid heuristic here -- the only case it doesn't capture is when the downloaded range is referenced in full here and then only partially referenced later. But for now, best-effort is probably fine.

/* effectively taking [skip_bytes..skip_bytes+take]
 * out of the downloaded range */
#[derivative(Debug = "ignore")]
pub coalesced_range_reuse_cache: Arc<dyn ChunkCache>,
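
For context, extracting that sub-range could look roughly like this. The helper name and the exact layout of chunk_byte_indices are assumptions, not the PR's code.

// Hypothetical helper: take the bytes for chunk range [start, end) out of a
// fetched coalesced range that begins at chunk `fetched_start`, assuming
// chunk_byte_indices[i] is the byte offset where chunk (fetched_start + i) starts.
fn slice_requested(
    data: &[u8],
    chunk_byte_indices: &[u32],
    fetched_start: u32,
    start: u32,
    end: u32,
) -> Vec<u8> {
    let skip_bytes = chunk_byte_indices[(start - fetched_start) as usize] as usize;
    let take = chunk_byte_indices[(end - fetched_start) as usize] as usize - skip_bytes;
    // Effectively [skip_bytes..skip_bytes + take] out of the downloaded range.
    data[skip_bytes..skip_bytes + take].to_vec()
}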
Contributor

I wish there were a way to use one cache passed in here; there's already one in FetchTermDownload.

Collaborator (Author)

This is definitely a trade-off. A disk cache brings down memory usage, but as Adrien and Corentin mentioned, it slows down their environment a LOT. I don't think there is a universal solution for all environments, and this approach is tailored to their usage; by the same token, I don't think an in-memory cache is suitable for all hf-xet users either.

Contributor

No evictions, so this could get large. I think it's fair to assume that for fragmented files we will end up with all of the ranges in here.

struct MemoryCacheItem {
    range: ChunkRange,
    chunk_byte_indices: Vec<u32>,
    data: Vec<u8>,
}
Contributor

can we use Bytes here?

@seanses (Collaborator, Author) commented Nov 7, 2025

What would be the benefit of Bytes over Vec w.r.t. implementing this ChunkCache trait?
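
For what it's worth, the usual argument for Bytes (from the bytes crate) is cheap, zero-copy cloning and slicing, which matters when a cached coalesced range is handed out to several terms. A minimal illustration:

use bytes::Bytes;

fn main() {
    // One coalesced range, downloaded once.
    let downloaded: Vec<u8> = vec![7u8; 4 * 1024 * 1024];

    // With Vec<u8>, handing a sub-range to a term copies the bytes.
    let term_a: Vec<u8> = downloaded[0..1024].to_vec();

    // With Bytes, clones and slices share one reference-counted buffer.
    let cached = Bytes::from(downloaded); // takes ownership, no copy
    let term_b = cached.slice(0..1024);   // a view plus a refcount bump

    assert_eq!(&term_a[..], &term_b[..]);
}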
