Commit e2455af

docs(gpu): update content
1 parent 0e1d4be commit e2455af

File tree

1 file changed
pages/gpu/reference-content/migration-h100.mdx

Lines changed: 7 additions & 16 deletions
@@ -9,26 +9,14 @@ dates:
Scaleway is optimizing its H100 GPU Instance portfolio to improve long-term availability and provide better performance for all users.

-## Current situation
-
-Below is an overview of the current status of each Instance type:
-
-| Instance type  | Availability status     | Notes |
-| -------------- | ----------------------- | ----- |
-| H100-1-80G     | Low stock               | No additional GPUs can be added at this time. |
-| H100-2-80G     | Frequently out of stock | Supply remains unstable, and shortages are expected to continue. |
-| H100-SXM-2-80G | Good availability       | This Instance type can scale further and is ideal for multi-GPU workloads, offering NVLink connectivity and superior memory bandwidth. |
-
-In summary, while the single- and dual-GPU PCIe Instances (H100-1-80G and H100-2-80G) are experiencing supply constraints, the H100-SXM-2-80G remains available in good quantity and is the recommended option for users requiring scalable performance and high-bandwidth interconnects.
-
We recommend that users migrate their workloads from PCIe-based GPU Instances to SXM-based GPU Instances for improved performance and future-proof access to GPUs. As H100 PCIe variants become increasingly scarce, migrating ensures uninterrupted access to H100-class compute.
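For example, a replacement SXM-based Instance can be created with the Scaleway CLI. The snippet below is a minimal sketch: the zone and image label are example values, so check current H100-SXM availability and GPU OS image names for your project first.

```bash
# Create an H100-SXM-2-80G Instance (zone and image are example values)
scw instance server create \
  type=H100-SXM-2-80G \
  zone=fr-par-2 \
  image=ubuntu_jammy_gpu_os_12 \
  name=h100-sxm-migration-target
```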
## Benefits of the migration

There are two primary scenarios: migrating **Kubernetes (Kapsule)** workloads or **standalone** workloads.

<Message type="important">
-Always ensure that your **data is backed up** before performing any operations that could affect it.
+Always ensure that your **data is backed up** before performing any operations that could affect it. Keep in mind that **Scratch Storage** is ephemeral and does not survive a full stop/start cycle: powering the Instance off and on again will **erase the scratch data**. A simple reboot, or the stop-in-place function, keeps the data intact.
</Message>
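This distinction maps onto the Scaleway CLI server actions roughly as follows (a quick sketch assuming the scw CLI v2, where `standby` performs a stop in place; the zone is an example value):

```bash
# Safe: a reboot keeps scratch data
scw instance server reboot <server-id> zone=fr-par-2

# Safe: stop in place (standby) keeps scratch data
scw instance server standby <server-id> zone=fr-par-2

# Destructive for scratch data: a full poweroff/poweron cycle
scw instance server stop <server-id> zone=fr-par-2
```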

### Migrating Kubernetes workloads (Kubernetes Kapsule)
@@ -96,12 +84,15 @@ For further information, refer to the [Instance CLI documentation](https://githu
H100 PCIe-based GPU Instances are not End-of-Life (EOL), but due to limited availability, we recommend migrating to `H100-SXM-2-80G` to avoid future disruptions.

#### Is H100-SXM-2-80G compatible with my current setup?
-Yes — it runs the same CUDA toolchain and supports standard frameworks (PyTorch, TensorFlow, etc.). However, verify that your workload does not require large system RAM or NVMe scratch space.
+Yes — it runs the same CUDA toolchain and supports standard frameworks (PyTorch, TensorFlow, etc.). No changes to your code base are required when upgrading to an SXM-based GPU Instance.

#### Why is H100-SXM better for multi-GPU?
-Because of *NVLink*, which enables near-shared-memory speeds between GPUs. In contrast, PCIe-based instances like H100-2-80G have slower interconnects that can bottleneck training. Learn more: [Understanding NVIDIA NVLink](https://www.scaleway.com/en/docs/gpu/reference-content/understanding-nvidia-nvlink/)
+The NVIDIA H100-SXM outperforms the H100-PCIe in multi-GPU configurations due to its superior interconnect and higher power capacity.
+It leverages fourth-generation NVLink and NVSwitch, providing up to 900 GB/s of bidirectional bandwidth for rapid GPU-to-GPU communication, compared to the H100-PCIe's 128 GB/s via PCIe Gen 5, which creates bottlenecks in demanding workloads like large-scale AI training and HPC.
+Additionally, the H100-SXM's 700 W TDP enables higher clock speeds and sustained performance, while the H100-PCIe's 300-350 W TDP limits its throughput.
+For high-communication, multi-GPU tasks, the H100-SXM is the optimal choice, while the H100-PCIe suits less intensive applications with greater flexibility.
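After migrating, the interconnect can be verified from inside the Instance with standard NVIDIA tooling (assuming the NVIDIA driver that ships with the GPU OS image):

```bash
# Print the GPU topology matrix; on H100-SXM Instances, GPU-to-GPU links
# should appear as NV<n> (NVLink) rather than PIX/PHB/SYS (PCIe paths)
nvidia-smi topo -m

# Show per-link NVLink status and speeds
nvidia-smi nvlink --status
```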
#### What if my workload needs more CPU or RAM?
-Let us know via [support ticket we’re evaluating options for compute-optimized configurations to complement our GPU offerings.
+Let us know via [support ticket](https://console.scaleway.com/support/tickets/create) what your specific requirements are. We are currently evaluating options for compute-optimized configurations to complement our GPU offerings.
