Fix physical data corruption in tpu_raiden by implementing CPU Cache Displacement. #190
Open
copybara-service[bot] wants to merge 1 commit into
PiperOrigin-RevId: 938149369
1f51a6c to 56767db
We identified a critical CPU cache coherence failure on the Sender (D2H) side that corrupted approximately 32 MB of data (matching the L3 cache slice size) under high parallelism (P=8, or P=4 with tight semaphore limits).
The high-performance host allocator uses a first-touch policy that writes zeroes to the buffer, filling the CPU cache with Dirty lines of 0s. When the TPU performs D2H DMA, it writes directly to DRAM (No-Snoop PCIe), bypassing the CPU cache and leaving the dirty 0 lines intact. When the CPU eventually evicts these lines, it overwrites the TPU's fresh data in DRAM with 0s. Under tight semaphore limits, the allocator immediately recycles the same buffer back-to-back, guaranteeing stale cache hits.
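The failure sequence described above can be illustrated with a toy model. This is only a sketch, not code from this change: `ToyMemory` and `RunScenario` are hypothetical names, and the model tracks a single write-back cache line to show why a dirty line of zeroes that is evicted after a non-snooping DMA write destroys the DMA data. It does not model where the real fix schedules the write-back.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Toy model (hypothetical): one optional dirty cached value per DRAM line.
struct ToyMemory {
  std::vector<std::uint8_t> dram;                  // backing store
  std::vector<std::optional<std::uint8_t>> dirty;  // dirty cached value, if any

  explicit ToyMemory(std::size_t lines) : dram(lines, 0xFF), dirty(lines) {}

  // CPU store: lands in the cache as a dirty line; DRAM is not updated yet.
  void CpuWrite(std::size_t i, std::uint8_t v) { dirty[i] = v; }

  // Non-snooping DMA write: updates DRAM and leaves the cache untouched.
  void DmaWriteNoSnoop(std::size_t i, std::uint8_t v) { dram[i] = v; }

  // Eviction: a dirty line is written back over whatever DRAM holds.
  void Evict(std::size_t i) {
    if (dirty[i]) {
      dram[i] = *dirty[i];
      dirty[i].reset();
    }
  }

  // CPU load: served from the cached copy if present, otherwise from DRAM.
  std::uint8_t CpuRead(std::size_t i) const {
    return dirty[i] ? *dirty[i] : dram[i];
  }
};

// Returns the value the CPU finally observes for line 0.
// write_back_before_dma == false reproduces the corruption;
// write_back_before_dma == true models the dirty zeroes reaching DRAM first.
inline std::uint8_t RunScenario(bool write_back_before_dma) {
  ToyMemory m(1);
  m.CpuWrite(0, 0x00);                    // first-touch zeroing by the allocator
  if (write_back_before_dma) m.Evict(0);  // stale line leaves the cache early
  m.DmaWriteNoSnoop(0, 0xAB);             // device writes fresh data to DRAM
  m.Evict(0);                             // later, unrelated eviction
  return m.CpuRead(0);
}
```

In this model `RunScenario(false)` ends with the zero written back over the device data, while `RunScenario(true)` preserves the device's `0xAB`.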
To resolve this without the 57-second performance penalty of a full clflush on 32 GB, we implement a hardware-portable CPU Cache Displacement mechanism. By sequentially reading a thread-local 128 MB dummy buffer, we force the CPU to evict all stale/dirty lines from the L3 cache to DRAM in 2-3 milliseconds (a 20,000x speedup).
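A minimal sketch of the displacement read, under stated assumptions: the function name `DisplaceCpuCache`, the 64-byte line size, and the use of `std::vector` for the dummy buffer are illustrative choices, not taken from this change. Whether a single sequential pass actually evicts every stale line depends on the cache's size and replacement policy, which this sketch does not verify.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kCacheLineBytes = 64;  // assumed line size
constexpr std::size_t kDisplacementBytes = 128u * 1024 * 1024;  // 128 MB, per the description

// Touches one byte in every cache line of a thread-local dummy buffer.
// The returned sum exists only so the compiler cannot drop the loads.
inline std::uint64_t DisplaceCpuCache(std::size_t bytes = kDisplacementBytes) {
  thread_local std::vector<std::uint8_t> dummy;
  if (dummy.size() < bytes) dummy.assign(bytes, 1);  // allocate once per thread

  const volatile std::uint8_t* p = dummy.data();
  std::uint64_t sum = 0;
  for (std::size_t i = 0; i < bytes; i += kCacheLineBytes) {
    sum += p[i];  // each load pulls one dummy line into the cache
  }
  return sum;
}
```

For scale, 57 seconds divided by 2 to 3 milliseconds is roughly 19,000x to 28,500x, which is consistent with the quoted figure of about 20,000x.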
We integrate this displacement automatically into PjRtCopyFuture::Await() for all futures marked as is_d2h, transparently protecting JAX and PyTorch D2H transfers.
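The shape of that integration might look like the sketch below. `SketchCopyFuture` is a hypothetical stand-in, not the real `PjRtCopyFuture`; the `std::future<void>` member, the callback parameter, and in particular the order of the wait and the displacement inside `Await()` are assumptions made for illustration only.

```cpp
#include <future>
#include <utility>

// Hypothetical stand-in for a copy future that knows its transfer direction.
class SketchCopyFuture {
 public:
  SketchCopyFuture(std::future<void> done, bool is_d2h)
      : done_(std::move(done)), is_d2h_(is_d2h) {}

  // Blocks until the copy completes. For D2H copies it then invokes the
  // supplied displacement hook. Returns true if the hook ran.
  template <typename DisplaceFn>
  bool Await(DisplaceFn&& displace) {
    done_.wait();
    if (!is_d2h_) return false;  // H2D copies skip displacement
    std::forward<DisplaceFn>(displace)();
    return true;
  }

 private:
  std::future<void> done_;
  bool is_d2h_;
};
```

Keeping the hook inside `Await()` means callers do not need to know about the cache hazard, which matches the "transparent" behavior the description claims.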
Additionally, we implement clean C++ CPU cache flushing (clwb + sfence) on the H2D path before TPU DMA launches.
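A range write-back of this kind could be sketched as follows. The function name `WriteBackRange` and the 64-byte line size are assumptions, not taken from this change. The sketch uses `_mm_clwb` only when the compiler advertises CLWB support (for example via `-mclwb`); otherwise it falls back to `_mm_clflush` on x86-64, which also invalidates the line, and to a plain fence with no cache maintenance on other architectures.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#define SKETCH_X86_64 1
#endif

constexpr std::size_t kLineBytes = 64;  // assumed cache line size

// Writes back every cache line overlapping [addr, addr + len) and then
// issues a store fence. Returns the number of lines visited.
inline std::size_t WriteBackRange(const void* addr, std::size_t len) {
  if (len == 0) return 0;
  const auto start = reinterpret_cast<std::uintptr_t>(addr);
  const std::uintptr_t first = start & ~static_cast<std::uintptr_t>(kLineBytes - 1);
  const std::uintptr_t last = start + len;

  std::size_t lines = 0;
  for (std::uintptr_t p = first; p < last; p += kLineBytes, ++lines) {
#if defined(SKETCH_X86_64) && defined(__CLWB__)
    _mm_clwb(reinterpret_cast<void*>(p));     // write back, line may stay cached
#elif defined(SKETCH_X86_64)
    _mm_clflush(reinterpret_cast<void*>(p));  // fallback: write back and invalidate
#endif
  }

#if defined(SKETCH_X86_64)
  _mm_sfence();  // order the write-backs before subsequent stores
#else
  std::atomic_thread_fence(std::memory_order_seq_cst);  // no cache maintenance here
#endif
  return lines;
}
```

A caller on the H2D path would invoke `WriteBackRange(buffer, size)` on the host buffer before handing it to the device, so that the device reads what the CPU last wrote.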