
Conversation

@seanses (Collaborator) commented Nov 6, 2025

With the disk cache turned off, when multiple ranges are coalesced into one larger download range (i.e. one pre-signed URL), that range is downloaded multiple times without being cached. This PR adds an in-memory cache dedicated to these coalesced ranges. The cache is per-segment, meaning it is dropped after the entire segment is fully reconstructed, because reconstruction info is returned segment-wise and only adjacent or overlapping ranges within a segment are coalesced.

Partly adopted the memory cache implementation from ec66175.
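
The lifecycle described above can be pictured roughly as follows. This is a minimal sketch with stand-in types; only coalesced_range_reuse_cache matches a name in the PR.

use std::sync::Arc;

// Stand-ins for the real types in this PR.
#[derive(Default)]
struct MemoryCache; // would hold coalesced ranges keyed by xorb + chunk range
struct Term; // one reconstruction term within a segment

fn download_term(_term: &Term, _cache: Arc<MemoryCache>) {
    // ...consult the cache before re-downloading a coalesced range...
}

// One cache per segment: coalesced ranges are only reused by terms within
// the same segment, so nothing is gained by keeping the cache any longer.
fn reconstruct_segment(terms: &[Term]) {
    let coalesced_range_reuse_cache = Arc::new(MemoryCache::default());
    for term in terms {
        download_term(term, Arc::clone(&coalesced_range_reuse_cache));
    }
    // Returning drops the last Arc clone, freeing the cached ranges.
}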

@seanses changed the title to "Sequential download uses an in-memory cache for coalesced range within a segment" on Nov 6, 2025
@seanses marked this pull request as ready for review on November 6, 2025 09:56
@hoytak (Collaborator) left a comment

LGTM, though I'm worried a bit about memory use when there are a lot of parallel downloads. Maybe we should change some defaults for the cache size?

debug!(call_id, num_tasks = terms.len(), "enqueueing download tasks");

// in-memory cache for this segment
let coalesced_range_reuse_cache = Arc::new(MemoryCache::default());
Collaborator

Like @hoytak, I'm worried this could suck up too much RAM since it can keep growing.

I think capping the overall byte size of the cache would be good - maybe a 1GB default, configurable by env variable.

Instead of random eviction, do we want to evict based on a min-heap of num_bytes_in_segment? That is, evict the smallest segments first, since that would minimize the number of network requests.

Also, I think we should log a bit more so we know how many times a segment is reused. That is likely a nice-to-have.
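
A minimal sketch of the cap-plus-env-variable idea. HF_XET_MEM_CACHE_BYTES is a hypothetical variable name, and the oldest-first eviction here is only a placeholder for whichever policy gets settled on.

use std::collections::VecDeque;
use std::env;

// Hypothetical env variable; 1 GB default as suggested above.
fn cache_cap_bytes() -> u64 {
    env::var("HF_XET_MEM_CACHE_BYTES")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(1 << 30)
}

struct CappedCache {
    cap: u64,
    used: u64,
    entries: VecDeque<Vec<u8>>, // front = oldest; keys elided for brevity
}

impl CappedCache {
    fn new() -> Self {
        Self { cap: cache_cap_bytes(), used: 0, entries: VecDeque::new() }
    }

    fn insert(&mut self, data: Vec<u8>) {
        let num_bytes = data.len() as u64;
        // Evict until the new entry fits under the byte cap.
        while self.used + num_bytes > self.cap {
            match self.entries.pop_front() {
                Some(old) => self.used -= old.len() as u64,
                None => return, // entry alone exceeds the cap: skip caching it
            }
        }
        self.used += num_bytes;
        self.entries.push_back(data);
    }
}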

@hoytak (Collaborator) commented Nov 6, 2025

I don't think evicting based on num_bytes_in_segment would work: the cache would fill up with large chunks, and any small chunks in it would get evicted immediately, before they could be used.

A better heuristic here is perhaps insertion time. However, I think the memory issues could be resolved by having a common cache with a larger limit instead of the per-download cache, or simply by making the default smaller to reflect the limit on the number of parallel downloads.
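
A sketch of the common-cache alternative: one process-wide cache shared by all downloads, so total memory is bounded by a single limit rather than by the per-cache limit times the number of parallel downloads. Names here are illustrative, not the PR's types.

use std::collections::HashMap;
use std::sync::{Mutex, OnceLock};

type RangeKey = (u64, u32, u32); // illustrative: (xorb id, chunk start, chunk end)

// One shared cache for the whole process instead of one per download/segment.
static SHARED_CACHE: OnceLock<Mutex<HashMap<RangeKey, Vec<u8>>>> = OnceLock::new();

fn shared_cache() -> &'static Mutex<HashMap<RangeKey, Vec<u8>>> {
    SHARED_CACHE.get_or_init(|| Mutex::new(HashMap::new()))
}

The trade-off is that entries can no longer be dropped simply by finishing a segment, so an explicit eviction policy (such as the insertion-time one above) becomes mandatory.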

Collaborator

Isn't the main optimization goal behind adding a memory cache to limit the number of network requests needed to reconstruct a file? Minimizing that reduces overall download time and ensures we don't download more bytes than we need by repeatedly fetching the same segment.

So with an upper bound on cache size we need to evict entries, and the entries we evict should be the least-used and smallest segments: evicting those costs the fewest extra network requests.

Since we get all the reconstruction info at the beginning of the download, we can do the bookkeeping to see which segments have the most terms, and maybe weight those segments by overall byte count when inserting into the cache. Then, as we consume from the cache, we decrement the remaining terms for a segment and remove it from the cache when the segment is fully consumed. When evicting, we evict the segments with the least usage and the fewest bytes, preserving the biggest segments with the most remaining terms.

Yes, this will result in the cache being filled up with large chunks, but those large chunks also have lots of remaining terms to be consumed, so they are worth keeping. And the large chunks will be removed from the cache once all their terms are consumed anyway.

Collaborator

The map would be something like:

(xorb_id : (num_terms_remaining, weighted_segment_score, bytes))
weighted_segment_score = (segment_bytes / HF_XET_NUM_RANGE_IN_SEGMENT_BASE) / num_terms

That is, segment_bytes as a fraction of the configurable segment size, divided by the number of terms the segment satisfies. The entry with the smallest weighted score should be evicted.

This way, on a cache hit, once num_terms_remaining hits 0 the entire cache entry is removed.
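
A sketch of that bookkeeping, implementing the formula as written; the key type, and whether HF_XET_NUM_RANGE_IN_SEGMENT_BASE exists as a config knob, are assumptions.

use std::collections::HashMap;

// Value stored per xorb_id, mirroring the mapping above.
struct CacheEntry {
    num_terms_remaining: u32,
    weighted_segment_score: f64,
    data: Vec<u8>,
}

// (segment_bytes / HF_XET_NUM_RANGE_IN_SEGMENT_BASE) / num_terms, as proposed.
fn weighted_segment_score(segment_bytes: u64, segment_base: u64, num_terms: u32) -> f64 {
    (segment_bytes as f64 / segment_base as f64) / num_terms as f64
}

fn on_cache_hit(map: &mut HashMap<u64, CacheEntry>, xorb_id: u64) {
    let fully_consumed = match map.get_mut(&xorb_id) {
        Some(entry) => {
            entry.num_terms_remaining = entry.num_terms_remaining.saturating_sub(1);
            entry.num_terms_remaining == 0
        }
        None => false,
    };
    // A fully consumed entry can never be reused, so drop it eagerly.
    if fully_consumed {
        map.remove(&xorb_id);
    }
}

fn evict_one(map: &mut HashMap<u64, CacheEntry>) {
    // Pick the entry with the smallest weighted score, then remove it.
    let victim = map
        .iter()
        .min_by(|a, b| a.1.weighted_segment_score.total_cmp(&b.1.weighted_segment_score))
        .map(|(&id, _)| id);
    if let Some(id) = victim {
        map.remove(&id);
    }
}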

// The requested range cannot be larger than the fetched range; cache the fetched
// range because it will be reused within the segment.
// In the "else" case the data matches exactly, so save some work and return the whole data.
if self.term.range.start != chunk_range.start || self.term.range.end != chunk_range.end {
Contributor

This condition captures some of the cases but not all: it helps with the case we're dealing with, but it isn't general for everything.

Collaborator

True, but I think it's a pretty solid heuristic here -- the only case it doesn't capture is when the downloaded range is referenced in full here and then only partially referenced later. But for now, best-effort is probably fine.

/* effectively taking [skip_bytes..skip_bytes+take]
 * out of the downloaded range */
#[derivative(Debug = "ignore")]
pub coalesced_range_reuse_cache: Arc<dyn ChunkCache>,
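
For context, extracting that sub-range could look roughly like this. The helper name and the exact layout of chunk_byte_indices are assumptions, not the PR's code.

// Hypothetical helper: take the bytes for chunk range [start, end) out of a
// fetched coalesced range that begins at chunk `fetched_start`, assuming
// chunk_byte_indices[i] is the byte offset where chunk (fetched_start + i) starts.
fn slice_requested(
    data: &[u8],
    chunk_byte_indices: &[u32],
    fetched_start: u32,
    start: u32,
    end: u32,
) -> Vec<u8> {
    let skip_bytes = chunk_byte_indices[(start - fetched_start) as usize] as usize;
    let take = chunk_byte_indices[(end - fetched_start) as usize] as usize - skip_bytes;
    // Effectively [skip_bytes..skip_bytes + take] out of the downloaded range.
    data[skip_bytes..skip_bytes + take].to_vec()
}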
Contributor

I wish there were a way to use one cache passed in here; there's already one in FetchTermDownload.

Collaborator (Author)

This is definitely a trade-off. A disk cache brings down memory usage, but as Adrien and Corentin mentioned, it slows down their environment a LOT. I don't think there is a universal solution for all environments, and this approach is tailored to their usage; by the same token, I don't think an in-memory cache is suitable for all hf-xet users either.

Contributor

No evictions, so this could get large. I think it's fair to assume that for fragmented files we will end up with all of the ranges in here.

struct MemoryCacheItem {
    range: ChunkRange,
    chunk_byte_indices: Vec<u32>,
    data: Vec<u8>,
}
Contributor

can we use Bytes here?

@seanses (Collaborator, Author) commented Nov 7, 2025

What would be the benefit of Bytes over Vec w.r.t. implementing this ChunkCache trait?
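
For what it's worth, the usual argument for Bytes (from the bytes crate) is cheap, zero-copy cloning and slicing, which matters when a cached coalesced range is handed out to several terms. A minimal illustration:

use bytes::Bytes;

fn main() {
    // One coalesced range, downloaded once.
    let downloaded: Vec<u8> = vec![7u8; 4 * 1024 * 1024];

    // With Vec<u8>, handing a sub-range to a term copies the bytes.
    let term_a: Vec<u8> = downloaded[0..1024].to_vec();

    // With Bytes, clones and slices share one reference-counted buffer.
    let cached = Bytes::from(downloaded); // takes ownership, no copy
    let term_b = cached.slice(0..1024);   // a view plus a refcount bump

    assert_eq!(&term_a[..], &term_b[..]);
}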
