Commit 1de5f13

[FEAT] [ROCm] Add ROCm support to fastsafetensors (#34)

Signed-off-by: tjtanaa <[email protected]>

1 parent e99b606 commit 1de5f13

File tree

18 files changed: +664 / -46 lines changed

.gitignore

Lines changed: 10 additions & 1 deletion

```diff
@@ -10,4 +10,13 @@ htmlcov/
 .idea
 *.log
 *.pyc
-examples/paddle_case/log
+*.so
+examples/paddle_case/log
+
+# Auto-generated hipified files and directories (created during ROCm build)
+fastsafetensors/cpp/hip/
+fastsafetensors/cpp/*.hip.*
+fastsafetensors/cpp/hip_compat.h
+
+# Auto-generated PyPI index (generated by GitHub Actions)
+pypi-index/
```

README.md

Lines changed: 20 additions & 2 deletions

````diff
@@ -48,16 +48,34 @@ Please refer to [Foundation Model Stack Community Code of Conduct](https://githu
 
 Takeshi Yoshimura, Tatsuhiro Chiba, Manish Sethi, Daniel Waddington, Swaminathan Sundararaman. (2025) Speeding up Model Loading with fastsafetensors [arXiv:2505.23072](https://arxiv.org/abs/2505.23072) and IEEE CLOUD 2025.
 
+## For NVIDIA
 
-## Install from PyPI
+### Install from PyPI
 
 See https://pypi.org/project/fastsafetensors/
 
 ```bash
 pip install fastsafetensors
 ```
 
-## Install from source
+### Install from source
+
+```bash
+pip install .
+```
+
+## For ROCm
+
+ROCm has no GDS equivalent, so on ROCm fastsafetensors supports only `nogds=True` mode.
+A performance example can be found at [amd-perf.md](./docs/amd-perf.md).
+
+### Install from GitHub source
+
+```bash
+pip install git+https://github.com/foundation-model-stack/fastsafetensors.git
+```
+
+### Install from source
 
 ```bash
 pip install .
````
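To make the `nogds=True` requirement concrete: loading on ROCm would look roughly like the sketch below. This follows the usage pattern from the project README; treat `SingleGroup` and the exact keyword and method names as assumptions rather than something this commit verifies.

```python
# Minimal sketch: on ROCm there is no GDS, so construct the loader with nogds=True.
import torch
from fastsafetensors import SafeTensorsFileLoader, SingleGroup

device = "cuda:0" if torch.cuda.is_available() else "cpu"  # ROCm PyTorch also reports "cuda"
loader = SafeTensorsFileLoader(SingleGroup(), device, nogds=True)
loader.add_filenames({0: ["/path/to/model.safetensors"]})  # hypothetical path
try:
    fb = loader.copy_files_to_device()    # pinned-host-buffer I/O path (no GDS)
    tensor = fb.get_tensor("wte.weight")  # hypothetical tensor name
    print(tensor.shape)
finally:
    loader.close()
```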

docs/amd-perf.md

Lines changed: 88 additions & 0 deletions (new file)

# Performance of FastSafeTensors on AMD GPUs

## DeepSeek-R1 vLLM Model Weight Loading Speed

This benchmark compares the performance of `safetensors` and `fastsafetensors` when loading model weights on AMD GPUs.

NOTE: `fastsafetensors` does not support the GDS feature on ROCm, as ROCm has no GDS alternative.

### Benchmark Methodology

**Platform:** AMD ROCm 7.0.1
**GPUs:** 8x AMD Instinct MI300X
**Library:** fastsafetensors 0.1.15

1. **Clear the system cache** to ensure consistent starting conditions:

   ```bash
   sudo sh -c 'sync && echo 3 > /proc/sys/vm/drop_caches'
   ```

2. **Launch vLLM** with either `--load-format safetensors` or `--load-format fastsafetensors`:

   ```bash
   MODEL=EmbeddedLLM/deepseek-r1-FP8-Dynamic

   VLLM_USE_V1=1 \
   VLLM_ROCM_USE_AITER=1 \
   vllm serve $MODEL \
     --tensor-parallel-size 8 \
     --disable-log-requests \
     --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
     --trust-remote-code \
     --load-format fastsafetensors \
     --block-size 1
   ```

### Results

The experiments were carried out on MI300X.

**Cache Scenarios:**

- **No cache**: model weights are loaded after clearing the system cache (cold start).
- **Cached**: model weights are loaded immediately after a previous load, so they are already cached in the filesystem and RAM (warm start).

<img src="./images/fastsafetensors-rocm.png" alt="FastSafeTensors on ROCm" width="70%">

## GPT-2 perf tests based on the script [perf/fastsafetensors_perf/perf.py](../perf/fastsafetensors_perf/perf.py)

### Test Configuration

All tests were performed on single-GPU loading scenarios with two different model sizes:

- **GPT-2 (small):** 523MB safetensors file
- **GPT-2 Medium:** ~1.4GB safetensors file

#### Key Parameters Tested

- **nogds mode:** ROCm fallback (GDS is not available on AMD GPUs)
- **Thread counts:** 8, 16, 32
- **Buffer sizes:** 8MB, 16MB, 32MB
- **Loading methods:** nogds (async I/O), mmap (memory-mapped)
- **Data types:** AUTO (no conversion), F16 (half-precision conversion)

A sketch of how these knobs could map onto the loader API follows this list.
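As referenced above, here is a rough sketch of how the tested knobs could be expressed through the loader API. The `bbuf_size_kb` and `max_threads` keyword names are assumptions based on the library's public constructor; this diff does not define them.

```python
# Hypothetical single-GPU run mirroring one row of the tables below:
# nogds, 16 threads, 16MB buffer, AUTO dtype (no conversion).
from fastsafetensors import SafeTensorsFileLoader, SingleGroup

loader = SafeTensorsFileLoader(
    SingleGroup(),
    "cuda:0",
    bbuf_size_kb=16 * 1024,  # 16MB bounce buffer (assumed kwarg)
    max_threads=16,          # I/O thread count (assumed kwarg)
    nogds=True,              # ROCm fallback path
)
loader.add_filenames({0: ["gpt2.safetensors"]})  # hypothetical file
fb = loader.copy_files_to_device()
```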
---

#### Performance Results

##### GPT-2 (523MB) - Single GPU Tests

| Test # | Method | Threads | Buffer | Config | Bandwidth | Elapsed Time | Notes |
|--------|--------|---------|--------|--------|-----------|--------------|-------|
| 1 | nogds | 16 | 16MB | default | **1.91 GB/s** | 0.268s | Baseline test |
| 2 | nogds | 32 | 32MB | default | **2.07 GB/s** | 0.246s | Higher threads/buffer |
| 3 | nogds | 8 | 8MB | default | **2.10 GB/s** | 0.243s | Lower threads/buffer |
| 4 | mmap | N/A | N/A | default | **1.01 GB/s** | 0.505s | Memory-mapped |
| 5 | nogds | 32 | 32MB | cache-drop | **1.24 GB/s** | 0.410s | Cold cache test |
| 6 | nogds | 32 | 32MB | F16 dtype | **0.77 GB/s** | 0.332s | With type conversion |
| 8 | nogds | 16 | 16MB | **optimal** | **2.62 GB/s** | 0.195s | Best config |

##### GPT-2 Medium (1.4GB) - Single GPU Tests

| Test # | Method | Threads | Buffer | Block Size | Bandwidth | Elapsed Time | Notes |
|--------|--------|---------|--------|------------|-----------|--------------|-------|
| 9 | nogds | 16 | 16MB | 160MB | **6.02 GB/s** | 0.235s | Optimal config |
| 10 | mmap | N/A | N/A | N/A | **1.28 GB/s** | 1.104s | Memory-mapped |
| 11 | nogds | 32 | 32MB | 160MB | **5.34 GB/s** | 0.265s | Higher threads |

---
docs/images/fastsafetensors-rocm.png

Binary file added (49.5 KB)

fastsafetensors/common.py

Lines changed: 9 additions & 0 deletions

```diff
@@ -14,6 +14,15 @@
 from .st_types import Device, DType
 
 
+def is_gpu_found():
+    """Check if any GPU (CUDA or HIP) is available.
+
+    Returns True if either CUDA or ROCm/HIP GPUs are detected.
+    This allows code to work transparently across both platforms.
+    """
+    return fstcpp.is_cuda_found() or fstcpp.is_hip_found()
+
+
 def get_device_numa_node(device: Optional[int]) -> Optional[int]:
     if device is None or not sys.platform.startswith("linux"):
         return None
```
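A quick illustration of how the new helper composes the two extension exports (a sketch; `fstcpp` is the compiled `fastsafetensors.cpp` module already imported in this file):

```python
from fastsafetensors import cpp as fstcpp
from fastsafetensors.common import is_gpu_found

# On a CUDA box:  is_cuda_found() is True,  is_hip_found() is False.
# On a ROCm box:  is_cuda_found() is False, is_hip_found() is True.
# Either way, is_gpu_found() is True whenever a GPU runtime was loaded.
print(fstcpp.is_cuda_found(), fstcpp.is_hip_found(), is_gpu_found())
```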

fastsafetensors/copier/gds.py

Lines changed: 28 additions & 9 deletions

```diff
@@ -5,7 +5,7 @@
 from typing import Dict, Optional
 
 from .. import cpp as fstcpp
-from ..common import SafeTensorsMetadata
+from ..common import SafeTensorsMetadata, is_gpu_found
 from ..frameworks import FrameworkOpBase, TensorBase
 from ..st_types import Device, DeviceType, DType
 from .base import CopierInterface
@@ -30,12 +30,29 @@ def __init__(
         self.fh: Optional[fstcpp.gds_file_handle] = None
         self.copy_reqs: Dict[int, int] = {}
         self.aligned_length = 0
-        cudavers = list(map(int, framework.get_cuda_ver().split(".")))
-        # CUDA 12.2 (GDS version 1.7) introduces support for non O_DIRECT file descriptors
-        # Compatible with CUDA 11.x
-        self.o_direct = not (
-            cudavers[0] > 12 or (cudavers[0] == 12 and cudavers[1] >= 2)
-        )
+        cuda_ver = framework.get_cuda_ver()
+        if cuda_ver and cuda_ver != "0.0":
+            # Parse the version string (e.g., "cuda-12.1" or "hip-5.7.0"):
+            # extract the numeric part after the platform prefix
+            ver_parts = cuda_ver.split("-", 1)
+            if len(ver_parts) == 2:
+                cudavers = list(map(int, ver_parts[1].split(".")))
+                # CUDA 12.2 (GDS version 1.7) introduces support for non-O_DIRECT
+                # file descriptors; compatible with CUDA 11.x.
+                # Only applies to the CUDA platform (not ROCm/HIP)
+                if ver_parts[0] == "cuda":
+                    self.o_direct = not (
+                        cudavers[0] > 12 or (cudavers[0] == 12 and cudavers[1] >= 2)
+                    )
+                else:
+                    # ROCm/HIP platform, use O_DIRECT
+                    self.o_direct = True
+            else:
+                # Fallback if the format is unexpected
+                self.o_direct = True
+        else:
+            # No GPU platform detected, use O_DIRECT
+            self.o_direct = True
 
     def set_o_direct(self, enable: bool):
         self.o_direct = enable
@@ -151,8 +168,10 @@ def new_gds_file_copier(
     nogds: bool = False,
 ):
     device_is_not_cpu = device.type != DeviceType.CPU
-    if device_is_not_cpu and not fstcpp.is_cuda_found():
-        raise Exception("[FAIL] libcudart.so does not exist")
+    if device_is_not_cpu and not is_gpu_found():
+        raise Exception(
+            "[FAIL] GPU runtime library (libcudart.so or libamdhip64.so) does not exist"
+        )
     if not fstcpp.is_cufile_found() and not nogds:
         warnings.warn(
             "libcufile.so does not exist but nogds is False. use nogds=True",
```

fastsafetensors/cpp/cuda_compat.h

Lines changed: 37 additions & 0 deletions (new file)

```cpp
// SPDX-License-Identifier: Apache-2.0
/*
 * CUDA/HIP compatibility layer for fastsafetensors
 * Minimal compatibility header - only defines what hipify-perl doesn't handle
 */

#ifndef __CUDA_COMPAT_H__
#define __CUDA_COMPAT_H__

// Platform detection - this gets hipified to check __HIP_PLATFORM_AMD__
#ifdef __HIP_PLATFORM_AMD__
#ifndef USE_ROCM
#define USE_ROCM
#endif
// Note: We do NOT include <hip/hip_runtime.h> here to avoid compile-time dependencies.
// Instead, we dynamically load the ROCm runtime library (libamdhip64.so) at runtime
// using dlopen(), just like we do for CUDA (libcudart.so).
// Minimal types are defined in ext.hpp.
#else
// For the CUDA platform, we also avoid including headers and define minimal types in ext.hpp
#endif

// Runtime library name - hipify-perl doesn't change string literals
#ifdef USE_ROCM
#define GPU_RUNTIME_LIB "libamdhip64.so"
#else
#define GPU_RUNTIME_LIB "libcudart.so"
#endif

// Custom function pointer names that hipify-perl doesn't recognize
// These are our own naming in the ext_funcs struct, not standard CUDA API
#ifdef USE_ROCM
#define cudaDeviceMalloc hipDeviceMalloc
#define cudaDeviceFree hipDeviceFree
#endif

#endif // __CUDA_COMPAT_H__
```

fastsafetensors/cpp/ext.cpp

Lines changed: 40 additions & 3 deletions

```diff
@@ -10,6 +10,7 @@
 #include <chrono>
 #include <dlfcn.h>
 
+#include "cuda_compat.h"
 #include "ext.hpp"
 
 #define ALIGN 4096
@@ -78,6 +79,7 @@ ext_funcs_t cpu_fns = ext_funcs_t {
 ext_funcs_t cuda_fns;
 
 static bool cuda_found = false;
+static bool is_hip_runtime = false; // Track if we loaded HIP (not auto-hipified)
 static bool cufile_found = false;
 
 static int cufile_ver = 0;
@@ -89,7 +91,7 @@ template <typename T> void mydlsym(T** h, void* lib, std::string const& name) {
 static void load_nvidia_functions() {
     cudaError_t (*cudaGetDeviceCount)(int*);
     const char* cufileLib = "libcufile.so.0";
-    const char* cudartLib = "libcudart.so";
+    const char* cudartLib = GPU_RUNTIME_LIB;
     const char* numaLib = "libnuma.so.1";
     bool init_log = getenv(ENV_ENABLE_INIT_LOG);
     int mode = RTLD_LAZY | RTLD_GLOBAL | RTLD_NODELETE;
@@ -122,8 +124,12 @@ static void load_nvidia_functions() {
             count = 0; // why cudaGetDeviceCount returns non-zero for errors?
         }
         cuda_found = count > 0;
+        // Detect whether we loaded the HIP runtime (ROCm) vs. the CUDA runtime
+        if (cuda_found && std::string(cudartLib).find("hip") != std::string::npos) {
+            is_hip_runtime = true;
+        }
         if (init_log) {
-            fprintf(stderr, "[DEBUG] device count=%d, cuda_found=%d\n", count, cuda_found);
+            fprintf(stderr, "[DEBUG] device count=%d, cuda_found=%d, is_hip_runtime=%d\n", count, cuda_found, is_hip_runtime);
         }
     } else {
         cuda_found = false;
@@ -217,11 +223,28 @@ static void load_nvidia_functions() {
     }
 }
 
+// Note: is_cuda_found gets auto-hipified to is_hip_found on ROCm builds,
+// so this function will be is_hip_found() after hipification on ROCm
 bool is_cuda_found()
 {
     return cuda_found;
 }
 
+// Separate function that always returns false on ROCm (CUDA is not available on ROCm).
+// This is used for the "is_cuda_found" Python export on ROCm builds.
+bool cuda_not_available()
+{
+    return false; // On ROCm, CUDA is never available
+}
+
+// Separate function for checking HIP runtime detection (not hipified).
+// On CUDA: checks if the HIP runtime was detected.
+// On ROCm: not used (is_cuda_found gets hipified to is_hip_found).
+bool check_hip_runtime()
+{
+    return is_hip_runtime;
+}
+
 bool is_cufile_found()
 {
     return cufile_found;
@@ -718,7 +741,21 @@ cpp_metrics_t get_cpp_metrics() {
 
 PYBIND11_MODULE(__MOD_NAME__, m)
 {
-    m.def("is_cuda_found", &is_cuda_found);
+    // Export both is_cuda_found and is_hip_found on all platforms.
+    // String concatenation prevents hipify from rewriting the export names.
+#ifdef USE_ROCM
+    // On ROCm after hipify:
+    // - is_cuda_found() becomes is_hip_found(), so export it as "is_hip_found"
+    // - Export cuda_not_available() as "is_cuda_found" (CUDA not available on ROCm)
+    m.def(("is_" "cuda" "_found"), &cuda_not_available); // Returns false on ROCm
+    m.def(("is_" "hip" "_found"), &is_cuda_found); // hipified to is_hip_found; returns HIP status
+#else
+    // On CUDA:
+    // - is_cuda_found() checks for CUDA
+    // - check_hip_runtime() checks if the HIP runtime was loaded
+    m.def(("is_" "cuda" "_found"), &is_cuda_found);
+    m.def(("is_" "hip" "_found"), &check_hip_runtime);
+#endif
     m.def("is_cufile_found", &is_cufile_found);
     m.def("cufile_version", &cufile_version);
     m.def("set_debug_log", &set_debug_log);
```

fastsafetensors/cpp/ext.hpp

Lines changed: 5 additions & 0 deletions

```diff
@@ -15,6 +15,8 @@
 #include <pybind11/pybind11.h>
 #include <pybind11/stl.h>
 
+#include "cuda_compat.h"
+
 #define ENV_ENABLE_INIT_LOG "FASTSAFETENSORS_ENABLE_INIT_LOG"
 
 #ifndef __MOD_NAME__
@@ -33,6 +35,9 @@ typedef struct CUfileDescr_t {
     const void *fs_ops; /* CUfileFSOps_t */
 } CUfileDescr_t;
 typedef struct CUfileError { CUfileOpError err; } CUfileError_t;
+
+// Define minimal CUDA/HIP types for both platforms to avoid compile-time dependencies.
+// We load all GPU functions dynamically at runtime via dlopen().
 typedef enum cudaError { cudaSuccess = 0, cudaErrorMemoryAllocation = 2 } cudaError_t;
 enum cudaMemcpyKind { cudaMemcpyHostToDevice=2, cudaMemcpyDefault = 4 };
```