
Exploring fastsafetensors for Network Storage: Seeking Community Input on Custom Reader Integration #55

@ABNER-1


Hi maintainers,

The two-phase abstraction in fastsafetensors is excellent:

  1. Parallel model loading with multiple workers
  2. Efficient tensor distribution via NCCL broadcast, leveraging high-speed GPU memory and NVLink bandwidth
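The first phase can be sketched roughly as follows (a minimal illustration of the concurrency pattern, not the actual fastsafetensors implementation; `read_file` and `load_shards_parallel` are hypothetical names):

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_file(path: str) -> bytes:
    # Sequential bulk read of one safetensors shard.
    return Path(path).read_bytes()


def load_shards_parallel(paths: list[str], workers: int = 8) -> dict[str, bytes]:
    # Phase 1: each worker pulls a different file, so per-file
    # bandwidth limits are hidden behind file-level concurrency.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(read_file, paths)))
```

Phase 2 (NCCL broadcast of the loaded tensors across ranks) is omitted here since it requires a multi-GPU setup.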

While this design excels in GDS scenarios, we believe it's equally well-suited for network storage workloads that benefit from:

  1. High-concurrency file loading (different files in parallel)
  2. Sequential bulk data reads

Building on fastsafetensors' architecture (referenced in issue #29), we've implemented a zero-copy reader optimized for network storage and achieved promising results. Since zero-copy implementations vary across storage systems, we extended this approach to 3FS (an open-source distributed filesystem), creating a usrbio-based reader.
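To make the idea concrete, a pluggable reader could look roughly like this. This is our sketch only: the `FileReader` protocol, `read_into`, and the mmap-based fallback are illustrative assumptions, not the existing fastsafetensors interface or our usrbio code.

```python
import mmap
from typing import Protocol


class FileReader(Protocol):
    # Hypothetical plug-in point: each storage backend (GDS, 3FS
    # usrbio, plain POSIX) would supply its own zero-copy variant.
    def read_into(self, path: str, offset: int, length: int) -> memoryview: ...


class MmapReader:
    # POSIX illustration: mmap yields a zero-copy view backed by the
    # page cache instead of copying the bytes into a Python buffer.
    def read_into(self, path: str, offset: int, length: int) -> memoryview:
        with open(path, "rb") as f:
            mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        # The returned slice keeps the mapping alive via its base object.
        return memoryview(mm)[offset:offset + length]
```

A 3FS backend would implement the same method on top of the usrbio SDK, which is where the zero-copy details diverge per storage system.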

Performance highlights:

  • Peak throughput: 35 GB/s, approaching the limit of a single 400 Gbps RDMA link
  • Setup: 8 processes using the usrbio SDK
  • Real-world result: loading the 640 GB DeepSeek-R1 checkpoint in ~27 seconds
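For context, the average throughput implied by these numbers (our arithmetic, not an additional measurement):

```python
# Average end-to-end throughput implied by loading a 640 GB
# checkpoint in ~27 seconds of wall-clock time.
model_size_gb = 640
load_seconds = 27
avg_throughput_gbs = model_size_gb / load_seconds  # ~23.7 GB/s

# The 35 GB/s figure is a peak; the average is lower because the
# wall-clock time also covers metadata parsing and tensor distribution.
```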

We're excited about these results and would love to open-source our 3FS reader. Before doing so, we'd like to ask: does fastsafetensors plan to support such custom readers, and would you be open to this type of contribution?

Our initial implementation consists of three main components, though we're happy to refine the architecture based on your feedback if there's interest in moving forward.

Looking forward to your thoughts!

