Commit 195a566: make the impl paragraphs more readable
Parent: 6cd9ccb

1 file changed: +76 -151 lines

core/core/src/docs/rfcs/7127_foyer_chunked.md (76 additions, 151 deletions)
@@ -158,13 +158,13 @@ struct ObjectMetadata {
}
```

-### Read Operation Implementation
+### Implementation

-The read operation follows this flow (inspired by SlateDB's design):
+The read operation follows this flow:

1. **Check chunked mode**: If `chunk_size_bytes` is `None`, fall back to whole-object caching (current implementation).

-2. **Prefetch with aligned range** (key optimization):
+2. **Prefetch with aligned range**:
```rust
async fn maybe_prefetch_range(
    &self,
@@ -227,8 +227,6 @@ The read operation follows this flow (inspired by SlateDB's design):

**Why alignment matters**: When the object is not yet cached, aligning the range allows us to fetch complete chunks in a single request. For example, if the user requests bytes 100MB-150MB with 64MB chunks, we fetch 64MB-192MB in one request and save chunks 1 and 2. Future reads to any part of chunks 1 or 2 will hit the cache.

-**Version handling**: The version (etag) is obtained from the read response and included in all cache keys. This ensures that when an object is updated (etag changes), old cached chunks won't be used.
-
3. **Split range into chunks**:
```rust
fn split_range_into_chunks(
@@ -300,6 +298,7 @@ The read operation follows this flow (inspired by SlateDB's design):
    // Save to cache (best-effort, ignore errors)
    self.cache.insert(chunk_key, chunk_data.clone()).await.ok();

+    // Return the requested range
    Ok(chunk_data.slice(range_in_chunk))
}
```
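The hunks above refer to a `split_range_into_chunks` helper whose body is cut off by the diff context. As a rough sketch of what such a helper could look like (the signature, the `ChunkRead` type, and the return shape are assumptions, not the RFC's actual code):

```rust
use std::ops::Range;

/// Illustrative only: one read against a single chunk.
/// `ChunkRead` is a made-up type; the RFC's real return type is not shown here.
struct ChunkRead {
    chunk_idx: u64,             // which chunk of the object to read
    range_in_chunk: Range<u64>, // byte range relative to that chunk's start
}

/// Split an absolute byte range of the object into per-chunk reads.
fn split_range_into_chunks(range: Range<u64>, object_size: u64, chunk_size: u64) -> Vec<ChunkRead> {
    let end = range.end.min(object_size);
    let mut out = Vec::new();
    let mut pos = range.start;
    while pos < end {
        let chunk_idx = pos / chunk_size;
        let chunk_start = chunk_idx * chunk_size;
        // The last chunk may be shorter than `chunk_size`.
        let chunk_end = ((chunk_idx + 1) * chunk_size).min(object_size);
        let read_end = end.min(chunk_end);
        out.push(ChunkRead {
            chunk_idx,
            range_in_chunk: (pos - chunk_start)..(read_end - chunk_start),
        });
        pos = read_end;
    }
    out
}
```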
@@ -309,168 +308,94 @@ The read operation follows this flow (inspired by SlateDB's design):
- Each chunk is fetched lazily when the stream is polled
- This reduces memory pressure and allows streaming large ranges efficiently

-### Write Operation Implementation
-
-Following SlateDB's pattern, write operations can optionally cache the written data:
+### Key Design Considerations

-```rust
-async fn write(&self, path: &str, args: OpWrite) -> Result<RpWrite> {
-    // Write to underlying storage first
-    let result = self.inner.write(path, args.clone()).await?;
-
-    // Optionally cache the written data (can be controlled by a flag)
-    if self.cache_writes {
-        // Fetch metadata via stat
-        if let Ok(meta) = self.inner.stat(path).await {
-            let metadata = ObjectMetadata::from(meta);
-            let meta_key = format!("{}#meta", path);
-            self.cache.insert(meta_key, serialize_metadata(&metadata)?).await;
-
-            // Stream the written data into chunks
-            // Note: This requires buffering the write payload, which may not be desirable
-            // For now, we can skip caching write data and only cache on subsequent reads
-        }
-    }
+1. **Range alignment strategy**

-    Ok(result)
-}
-```
+When metadata is not yet cached, the implementation aligns the requested range to chunk boundaries before fetching from the underlying storage. For example, if a user requests bytes 100-150MB with 64MB chunks configured, the system will fetch the aligned range of 64-192MB.

-**Write caching strategy**:
-- **Simple approach**: Only invalidate metadata, don't cache write data
-  - Remove `{path}#meta` from cache
-  - Let chunks naturally expire via LRU
-  - Subsequent reads will populate cache
-- **Aggressive approach**: Cache written data if enabled
-  - Useful for write-then-read patterns
-  - Requires access to write payload (may need buffering)
-  - Can be controlled via `cache_writes` flag (similar to SlateDB)
+While this fetches more data initially, it significantly reduces the number of requests to the underlying storage by consolidating multiple chunk fetches into a single aligned request. This trade-off proves beneficial as it populates the cache more efficiently and reduces overall latency.

-### Delete Operation Implementation
+The alignment is only applied on the first fetch (cache miss). Subsequent reads can directly use the cached chunks without additional alignment overhead.
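To make the alignment step concrete, here is a small illustrative sketch in the spirit of the `align_range` helper named in the testing section below; the exact signature is an assumption:

```rust
use std::ops::Range;

/// Illustrative only: widen a requested byte range outward to chunk
/// boundaries, clamped to the object size. The real `align_range` signature
/// is not shown in this diff, so this shape is an assumption.
fn align_range(range: Range<u64>, object_size: u64, chunk_size: u64) -> Range<u64> {
    // Round the start down to the containing chunk boundary.
    let aligned_start = (range.start / chunk_size) * chunk_size;
    // Round the end up to the next chunk boundary, but never past the object end.
    let aligned_end = (((range.end + chunk_size - 1) / chunk_size) * chunk_size).min(object_size);
    aligned_start..aligned_end
}

// Example from the text: 100 MB..150 MB with 64 MB chunks becomes 64 MB..192 MB
// (assuming the object is at least 192 MB long).
```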

-When a delete completes:
+2. **Streaming result**

-```rust
-async fn delete(&self, path: &str) -> Result<RpDelete> {
-    let result = self.inner.delete(path).await?;
+The implementation returns data as a stream where each chunk is fetched lazily when consumed.

-    // Best-effort cache invalidation
-    // Remove metadata (chunks will be evicted naturally)
-    let meta_key = format!("{}#meta", path);
-    self.cache.remove(&meta_key).await.ok();
+This approach is critical for memory efficiency when reading large ranges that span many chunks. Without streaming, reading a multi-gigabyte range would require loading all chunks into memory simultaneously, potentially exhausting available memory and causing performance degradation.
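For illustration, a lazily evaluated chunk stream can be sketched with the `futures` crate roughly as below; the chunk type and `fetch_chunk` are placeholders, not the RFC's API:

```rust
use futures::stream::{self, Stream, StreamExt};

/// Illustrative only: a stream that fetches each chunk when it is polled,
/// so at most one chunk is buffered at a time.
fn chunk_stream(
    chunk_indices: Vec<u64>, // e.g. produced by split_range_into_chunks
) -> impl Stream<Item = Result<Vec<u8>, std::io::Error>> {
    stream::iter(chunk_indices).then(|chunk_idx| async move {
        // Placeholder for "try the Foyer cache, otherwise read from the inner service".
        fetch_chunk(chunk_idx).await
    })
}

async fn fetch_chunk(_chunk_idx: u64) -> Result<Vec<u8>, std::io::Error> {
    Ok(Vec::new())
}
```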

-    // Optionally: If metadata is in cache, calculate and remove all chunks
-    // This is more thorough but requires additional cache lookup
+3. **Best-effort cache operations**

-    Ok(result)
-}
-```
+All cache operations (insert, remove, get) are designed to never fail the user's read or write operation.

-**Rationale**: Lazy chunk removal is acceptable because:
-- Cached chunks for deleted objects are harmless (worst case: wasted cache space)
-- They'll be evicted naturally by LRU when cache pressure increases
-- Scanning cache for all chunks is expensive and not worth the cost
-
-### Key Design Decisions
-
-**Range alignment strategy**:
-- When metadata is not cached, align the requested range to chunk boundaries before fetching
-- Example: Request 100-150MB with 64MB chunks → fetch aligned 64-192MB
-- **Trade-off**: Fetches more data initially, but populates cache more efficiently
-- **Benefit**: Reduces number of requests to underlying storage (one aligned request vs. multiple chunk requests)
-- Only apply alignment on first fetch (cache miss); subsequent reads use cached chunks
-
-**Streaming instead of buffering**:
-- Return data as a stream rather than loading all chunks into memory
-- Each chunk is fetched lazily when consumed
-- Matches OpenDAL's streaming API design
-- Critical for memory efficiency when reading large ranges
-
-**Chunk size validation**:
-- Require chunk size to be aligned to 1KB (similar to SlateDB)
-- Prevents edge cases with very small or misaligned chunks
-- Recommended range: 16MB - 128MB
-
-**Cache operation error handling**:
-- All cache operations (insert, remove, get) should be best-effort
-- Cache failures should NOT fail the user's read/write operation
-- Log warnings for cache errors but continue with fallback to underlying storage
-- This ensures cache is truly transparent to users
+If a cache operation encounters an error, the implementation logs a warning and continues by falling back to the underlying storage. This ensures that the cache layer remains truly transparent to users.
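For illustration, the best-effort pattern can be expressed roughly as below; this is a sketch with placeholder names, and only the swallowing of insert errors mirrors code actually shown in this diff:

```rust
use std::future::Future;

/// Illustrative only: a cache insert whose failure is logged and swallowed,
/// so the user's read still succeeds from the underlying storage.
async fn save_chunk_best_effort<F, Fut>(insert: F, chunk_key: String, chunk_data: Vec<u8>)
where
    F: FnOnce(String, Vec<u8>) -> Fut,
    Fut: Future<Output = Result<(), std::io::Error>>,
{
    if let Err(err) = insert(chunk_key, chunk_data).await {
        // In the real layer this would be a log warning rather than stderr.
        eprintln!("warn: chunk cache insert failed: {err}");
    }
}
```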

### Edge Cases and Considerations

-**Last chunk handling**:
-- The last chunk may be smaller than `chunk_size_bytes`
-- Calculate actual chunk size: `min((chunk_idx + 1) * chunk_size, object_size) - chunk_idx * chunk_size`
-- Example: 200 MB file with 64 MB chunks → chunks 0, 1, 2 (64MB each), chunk 3 (8MB)
-- Already handled in `split_range_into_chunks` logic above
-
-**Empty or invalid range requests**:
-- Empty range: Return empty result without cache operations
-- Start beyond object size: Return error (per OpenDAL semantics)
-- End beyond object size: Clamp end to object size
-
-**Concurrent access**:
-- Foyer's built-in request deduplication handles concurrent reads to the same chunk
-- Multiple concurrent reads to chunk N will result in only one fetch from underlying storage
-- Other readers wait and reuse the result
-- No additional locking needed in FoyerLayer
-
-**Cache consistency**:
-- Cache follows eventual consistency model (same as OpenDAL)
-- No distributed coordination for concurrent writes from different processes
-- Cache invalidation on write/delete is best-effort
-- Acceptable for object storage workloads (most are read-heavy, immutable objects)
-
-### Performance Characteristics
-
-**Benefits of aligned prefetching**:
-- **Fewer requests**: One aligned request instead of N chunk requests on cache miss
-  - Example: Request 100-150MB → 1 aligned fetch (64-192MB) vs. 2 separate chunk fetches
-- **Better locality**: Neighboring chunks are likely to be accessed together
-- **Reduced latency**: Fewer round-trips to underlying storage
-
-**Memory efficiency**:
-- Metadata overhead: ~100-200 bytes per object
-- Chunk data follows normal LRU eviction
-- Streaming API avoids buffering large ranges in memory
-- Each chunk is independently evictable
-
-**Cache hit rate analysis**:
-- **Partial reads**: Significantly improved hit rate
-  - Chunks are smaller units, higher reuse probability
-  - Example: Reading different columns of a Parquet file reuses row group chunks
-- **Whole-object reads**: Slightly lower hit rate due to fragmentation
-  - Requires all chunks to be cached vs. one whole-object entry
-  - Trade-off is acceptable given target workload (partial reads)
+**Last chunk handling**
+
+The last chunk of an object may be smaller than the configured chunk size and requires special attention. The implementation calculates the actual chunk size using the formula `min((chunk_idx + 1) * chunk_size, object_size) - chunk_idx * chunk_size`.
+
+For example, a 200 MB file with 64 MB chunks would be split into chunks 0, 1, and 2 of 64MB each, followed by chunk 3 containing only 8MB.
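The last-chunk formula translates directly into code; a tiny illustrative helper (the function name is made up):

```rust
const MB: u64 = 1024 * 1024;

/// Illustrative only: length in bytes of chunk `chunk_idx` for an object of
/// `object_size` bytes split into `chunk_size`-byte chunks.
fn chunk_len(chunk_idx: u64, chunk_size: u64, object_size: u64) -> u64 {
    ((chunk_idx + 1) * chunk_size).min(object_size) - chunk_idx * chunk_size
}

fn main() {
    // 200 MB object with 64 MB chunks: chunks 0, 1, 2 are 64 MB, chunk 3 is 8 MB.
    assert_eq!(chunk_len(0, 64 * MB, 200 * MB), 64 * MB);
    assert_eq!(chunk_len(3, 64 * MB, 200 * MB), 8 * MB);
}
```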
+
+**Empty or invalid range requests**
+
+Range requests are handled according to OpenDAL's existing semantics:
+- Empty range: Returns empty result without performing any cache operations
+- Range start beyond object size: Returns error to match OpenDAL's behavior
+- Range end exceeds object size: Clamped to the actual object size, allowing partial reads near the end of objects
+
+**Concurrent access**
+
+Concurrent access patterns benefit from Foyer's built-in request deduplication mechanism. When multiple concurrent reads request the same chunk, Foyer ensures that only one fetch actually occurs from the underlying storage, while other readers wait and reuse the result.
+
+This deduplication happens transparently within the Foyer cache layer, requiring no additional locking or coordination logic in FoyerLayer itself.
+
+**Cache consistency**
+
+The cache follows an eventual consistency model aligned with OpenDAL's consistency guarantees. There is no distributed coordination for concurrent writes from different processes, and cache invalidation on write or delete operations is performed on a best-effort basis.
+
+This relaxed consistency model is acceptable for typical object storage workloads, which are predominantly read-heavy and often involve immutable objects.

### Testing Strategy

-**Unit tests**:
-- `split_range_into_chunks` with various ranges and object sizes
-- `align_range` edge cases (aligned, unaligned, boundary conditions)
-- Last chunk handling (smaller than chunk_size)
-- Empty and invalid ranges
-
-**Integration tests**:
-- End-to-end read with cache hit and miss
-- Concurrent reads to same chunk (verify deduplication)
-- Write invalidation behavior
-- Mixed whole-object and chunked reads
-
-**Behavior tests**:
-- Use existing OpenDAL behavior test suite
-- Add chunked cache specific scenarios:
-  - Large file with range reads
-  - Sequential read patterns
-  - Random access patterns
+1. **Unit tests**
+
+Focus on the core algorithms with various test cases:
+- `split_range_into_chunks` with different combinations of ranges and object sizes to verify correct chunk boundary calculations
+- `align_range` with aligned ranges, unaligned ranges, and boundary conditions to ensure all edge cases are handled correctly
+- Last chunk handling when it's smaller than chunk_size
+- Empty and invalid range scenarios
+
+2. **Integration tests**
+
+Validate end-to-end behavior of the chunked cache system:
+- Cache hit and miss scenarios to ensure prefetching and caching logic works correctly
+- Concurrent reads to the same chunk to verify Foyer's request deduplication
+- Write invalidation behavior to confirm cached data is properly invalidated when objects are modified
+- Mixed workloads using both whole-object mode and chunked mode
+
+3. **Behavior tests**
+
+Leverage OpenDAL's existing behavior test suite, which provides comprehensive coverage across different backends.
+
+Add chunked cache specific scenarios:
+- Large files with range reads to validate performance characteristics
+- Sequential read patterns to verify prefetching efficiency
+- Random access patterns to ensure proper handling of non-sequential workloads
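As an illustration of the first category, unit tests over the splitting logic could look roughly like this, built on the `split_range_into_chunks` sketch assumed earlier rather than the RFC's actual tests:

```rust
#[cfg(test)]
mod tests {
    use super::*;

    const MB: u64 = 1024 * 1024;

    #[test]
    fn splits_unaligned_range_across_two_chunks() {
        // 100 MB..150 MB with 64 MB chunks touches chunks 1 and 2.
        let chunks = split_range_into_chunks(100 * MB..150 * MB, 200 * MB, 64 * MB);
        assert_eq!(chunks.len(), 2);
        assert_eq!(chunks[0].chunk_idx, 1);
        assert_eq!(chunks[0].range_in_chunk, 36 * MB..64 * MB);
        assert_eq!(chunks[1].chunk_idx, 2);
        assert_eq!(chunks[1].range_in_chunk, 0..22 * MB);
    }

    #[test]
    fn last_chunk_is_short() {
        // 200 MB object: chunk 3 only holds the final 8 MB.
        let chunks = split_range_into_chunks(190 * MB..200 * MB, 200 * MB, 64 * MB);
        assert_eq!(chunks.last().unwrap().chunk_idx, 3);
        assert_eq!(chunks.last().unwrap().range_in_chunk, 0..8 * MB);
    }

    #[test]
    fn empty_range_produces_no_chunks() {
        assert!(split_range_into_chunks(10 * MB..10 * MB, 200 * MB, 64 * MB).is_empty());
    }
}
```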

### Compatibility and Migration

-- **Backward compatible**: Defaults to `chunk_size_bytes = None` (whole-object mode)
-- **No breaking changes**: Existing users unaffected
-- **Opt-in**: Users explicitly enable chunked mode via configuration
-- **Cache format change**: Whole-object cache and chunked cache use different key formats
-  - No automatic migration needed (cache rebuilds naturally)
-  - Changing chunk size also invalidates cache (keys change)
-  - This is acceptable since cache is ephemeral
+**Backward compatibility**
+
+The chunked cache feature is fully backward compatible with existing FoyerLayer usage. The implementation defaults to `chunk_size_bytes = None`, which activates whole-object mode matching the current behavior. This means existing users are completely unaffected by the introduction of chunked caching.
+
+**Opt-in design**
+
+Chunked cache is an opt-in feature that users must explicitly enable through configuration by setting the chunk size. This conservative approach ensures that users who haven't evaluated whether chunked caching benefits their workload will continue to use the proven whole-object caching strategy.
+
+**Cache key migration**
+
+The cache key format changes between whole-object and chunked modes, but this requires no special migration handling. Since whole-object cache uses different keys than chunked cache, and different chunk sizes use different keys from each other, old cache entries simply coexist harmlessly with new ones.
+
+As the LRU eviction policy runs, old entries naturally expire and are replaced with new entries in the current format. This natural invalidation is acceptable because the cache is ephemeral by design, storing temporary performance-optimization data rather than durable state.
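To illustrate why entries from the two modes and from different chunk sizes never collide, a possible key layout is sketched below. Only the `{path}#meta` metadata key appears in this diff; the chunk-key shape (etag plus chunk size plus index) is an assumption based on the surrounding description:

```rust
/// Illustrative only: the metadata key mirrors the `{path}#meta` form seen in
/// the diff; the chunk key layout is assumed, not taken from the RFC.
fn meta_key(path: &str) -> String {
    format!("{path}#meta")
}

fn chunk_key(path: &str, etag: &str, chunk_size: u64, chunk_idx: u64) -> String {
    format!("{path}#{etag}#{chunk_size}#chunk-{chunk_idx}")
}

// A whole-object entry (keyed by the bare path), a chunked entry, and entries
// written with a different chunk size all use distinct keys, so stale entries
// coexist harmlessly until LRU eviction removes them.
```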
