Exploring fastsafetensors for Network Storage: Seeking Community Input on Custom Reader Integration #55
Description
Hi maintainers,
The two-phase abstraction in fastsafetensors is excellent:
- Parallel model loading with multiple workers
- Efficient tensor distribution via NCCL broadcast, leveraging high-speed GPU memory and NVLink bandwidth
While this design excels in GPUDirect Storage (GDS) scenarios, we believe it's equally well-suited to network storage workloads that benefit from:
- High-concurrency file loading (different files in parallel)
- Sequential bulk data reads
Building on fastsafetensors' architecture (referenced in issue #29), we've implemented a zero-copy reader optimized for network storage and achieved promising results. Since zero-copy implementations vary across storage systems, we extended this approach to 3FS (an open-source distributed filesystem), creating a usrbio-based reader.
Performance highlights:
- Peak throughput: 35 GB/s, saturating a single 400 Gbps RDMA link
- Setup: 8 processes via the usrbio SDK
- Real-world result: loading the 640 GB DeepSeek-R1 checkpoint in ~27 seconds
We're excited about these results and would love to open-source our 3FS reader. Before doing so, we'd like to ask whether fastsafetensors plans to support such custom readers, and whether you'd be open to this type of contribution.
Our initial implementation consists of three main components, though we're happy to refine the architecture based on your feedback if there's interest in moving forward.
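To make the idea concrete, here is a minimal sketch of what a pluggable reader contract might look like. All names here (`FileReader`, `read_into`, `PosixReader`) are our assumptions for illustration only, not the actual fastsafetensors API; a 3FS/usrbio backend would implement the same interface with registered buffers and sequential bulk reads.

```python
# Hypothetical pluggable-reader sketch; the interface names are
# assumptions, NOT the real fastsafetensors API.
import os
from abc import ABC, abstractmethod


class FileReader(ABC):
    """Minimal contract a storage backend would implement.

    A 3FS/usrbio backend would issue sequential bulk reads into a
    pre-registered (pinned) buffer; a GDS backend would instead DMA
    directly into GPU memory.
    """

    @abstractmethod
    def read_into(self, path: str, offset: int, length: int,
                  buf: bytearray) -> int:
        """Read `length` bytes at `offset` from `path` into `buf`.

        Returns the number of bytes actually read.
        """


class PosixReader(FileReader):
    """Baseline backend using plain pread(2); stands in for usrbio here."""

    def read_into(self, path, offset, length, buf):
        fd = os.open(path, os.O_RDONLY)
        try:
            data = os.pread(fd, length, offset)
            buf[:len(data)] = data  # copy into the caller-owned buffer
            return len(data)
        finally:
            os.close(fd)
```

With a boundary like this, the loading pipeline stays storage-agnostic: the parallel-loading and NCCL-broadcast phases only see buffers, while each backend decides how bytes reach those buffers.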
Looking forward to your thoughts!
