---
title: Migrating from H100-2-80G to H100-SXM-2-80G
description: Learn how to migrate from H100-2-80G to H100-SXM-2-80G GPU Instances.
tags: gpu nvidia
dates:
  validation: 2025-11-04
  posted: 2025-11-04
---

Scaleway is optimizing its H100 GPU Instance portfolio to improve long-term availability and provide better performance for all users.

For optimal availability and performance, we recommend switching from **H100-2-80G** to the improved **H100-SXM-2-80G** GPU Instance. This latest generation offers more stock, improved NVLink, and faster, higher-bandwidth VRAM.

## Migration scenarios

There are two primary scenarios: migrating **Kubernetes (Kapsule)** workloads or **standalone** workloads.

<Message type="important">
  Always make sure your **data is backed up** before performing any operation that could affect it. Remember that **scratch storage** is ephemeral and will not persist after an Instance is fully stopped. A full stop/start cycle, such as during an Instance server migration, will **erase all scratch data**. However, outside of server-type migrations, a simple reboot or using **stop in place** will preserve the data stored on the Instance's scratch storage.
</Message>
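
For example, data stored on scratch storage can be copied to an Object Storage bucket before the migration. The snippet below is a minimal sketch, assuming the AWS CLI is configured with Scaleway Object Storage credentials and that `my-backup-bucket`, the `fr-par` endpoint, and the `/scratch` path are replaced with your own bucket, region, and mount point:

```
aws s3 sync /scratch s3://my-backup-bucket/scratch-backup/ \
  --endpoint-url https://s3.fr-par.scw.cloud
```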

### Migrating Kubernetes workloads (Kubernetes Kapsule)

If you are using Kapsule, follow these steps to move existing workloads to nodes powered by `H100-SXM-2-80G` GPUs. A consolidated command-line sketch of the whole procedure is shown after the list below.

<Message type="important">
  The Kubernetes autoscaler may get stuck if it tries to scale up a node pool with out-of-stock Instances. We recommend switching to `H100-SXM-2-80G` GPU Instances proactively to avoid disruptions.
</Message>

#### Step-by-step

1. Create a new node pool using `H100-SXM-2-80G` GPU Instances.
2. Run `kubectl get nodes` to check that the new nodes are in a `Ready` state.
3. Cordon the nodes in the old node pool to prevent new Pods from being scheduled there. For each node, run: `kubectl cordon <node-name>`
   <Message type="tip">
     You can use a selector on the pool name label to cordon or drain multiple nodes at the same time if your application allows it (e.g. `kubectl cordon -l k8s.scaleway.com/pool-name=mypoolname`).
   </Message>
4. Drain the nodes to evict the Pods gracefully.
   - For each node, run: `kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data`
   - The `--ignore-daemonsets` flag is required because DaemonSet-managed Pods cannot be evicted; the DaemonSet controller reschedules them automatically.
   - The `--delete-emptydir-data` flag is necessary if your Pods use `emptyDir` volumes, but use this option carefully as it deletes the data stored in these volumes.
   - Refer to the [official Kubernetes documentation](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/) for further information.
5. Run `kubectl get pods -o wide` after draining to verify that the Pods have been rescheduled to the new node pool.
6. Delete the old node pool.
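
The following is a minimal sketch of steps 1-6 as shell commands. The pool names, cluster ID, sizing, and the exact `scw k8s pool create` arguments are assumptions to adapt to your own cluster (check `scw k8s pool create --help` for your CLI version); the label selector assumes the old pool is named `h100-pcie-pool`.

```
# 1. Create a new pool of H100-SXM-2-80G nodes (adjust cluster ID, name, size, and zone)
scw k8s pool create cluster-id=<cluster_id> name=h100-sxm-pool \
  node-type=H100-SXM-2-80G size=2 zone=fr-par-2

# 2. Wait until the new nodes report a Ready status
kubectl get nodes

# 3. Cordon every node of the old pool so no new Pods land there
kubectl cordon -l k8s.scaleway.com/pool-name=h100-pcie-pool

# 4. Drain the old nodes so running Pods are rescheduled onto the new pool
kubectl drain -l k8s.scaleway.com/pool-name=h100-pcie-pool \
  --ignore-daemonsets --delete-emptydir-data

# 5. Check that the GPU Pods are now running on the new nodes
kubectl get pods -o wide

# 6. Delete the old node pool once it is empty
scw k8s pool delete <old_pool_id>
```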

<Message type="tip">
  For further information, refer to our dedicated documentation: [How to migrate existing workloads to a new Kapsule node pool](/kubernetes/how-to/manage-node-pools/#how-to-migrate-existing-workloads-to-a-new-kubernets-kapsule-node-pool).
</Message>

### Migrating a standalone Instance

For standalone GPU Instances, you can recreate your environment on an `H100-SXM-2-80G` GPU Instance using the CLI, the API, or the Scaleway console.

#### Quick start (CLI example)

1. Stop the Instance.
   ```
   scw instance server stop <instance_id> zone=<zone>
   ```
   Replace `<instance_id>` with the ID of your Instance and `<zone>` with its Availability Zone. For example, if your Instance is located in Paris-1, the zone is `fr-par-1`.
   <Message type="tip">
     You can find the ID of your Instance on its overview page in the Scaleway console or by running the following CLI command: `scw instance server list`.
   </Message>

2. Update the commercial type of the Instance.
   ```
   scw instance server update <instance_id> commercial-type=H100-SXM-2-80G zone=<zone>
   ```
   Replace `<instance_id>` with the UUID of your Instance and `<zone>` with the Availability Zone of your GPU Instance.

3. Power on the Instance.
   ```
   scw instance server start <instance_id> zone=<zone>
   ```

For further information, refer to the [Instance CLI documentation](https://github.com/scaleway/scaleway-cli/blob/master/docs/commands/instance.md).
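
To confirm that the migration was applied, you can inspect the Instance once it is running again; the commercial type reported in the output should now be `H100-SXM-2-80G`:

```
scw instance server get <instance_id> zone=<zone>
```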

<Message type="tip">
  You can also migrate your GPU Instances using the [API](https://www.scaleway.com/en/docs/instances/api-cli/migrating-instances/) or via the [Scaleway console](/instances/how-to/migrate-instances/).
</Message>
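
As an illustration of the API route, the same migration can be sketched with the Instances API. This is a minimal example, assuming the Instance is in `fr-par-1` and a valid secret key is available in `$SCW_SECRET_KEY`; refer to the API documentation linked above for the authoritative request format.

```
# Stop the Instance
curl -X POST "https://api.scaleway.com/instance/v1/zones/fr-par-1/servers/<instance_id>/action" \
  -H "X-Auth-Token: $SCW_SECRET_KEY" -H "Content-Type: application/json" \
  -d '{"action": "poweroff"}'

# Change the commercial type once the Instance is stopped
curl -X PATCH "https://api.scaleway.com/instance/v1/zones/fr-par-1/servers/<instance_id>" \
  -H "X-Auth-Token: $SCW_SECRET_KEY" -H "Content-Type: application/json" \
  -d '{"commercial_type": "H100-SXM-2-80G"}'

# Power the Instance back on
curl -X POST "https://api.scaleway.com/instance/v1/zones/fr-par-1/servers/<instance_id>/action" \
  -H "X-Auth-Token: $SCW_SECRET_KEY" -H "Content-Type: application/json" \
  -d '{"action": "poweron"}'
```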

## FAQ

#### Are PCIe-based H100s being discontinued?

H100 PCIe-based GPU Instances are not End-of-Life (EOL), but due to limited availability, we recommend migrating to `H100-SXM-2-80G` to avoid future disruptions.

#### Is H100-SXM-2-80G compatible with my current setup?

Yes. It runs the same CUDA toolchain and supports standard frameworks (PyTorch, TensorFlow, etc.). No changes to your code base are required when upgrading to an SXM-based GPU Instance.
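
As a quick sanity check after migrating, and assuming PyTorch is already installed in your environment, you can confirm that the new GPUs are visible to your framework:

```
python3 -c "import torch; print(torch.cuda.get_device_name(0), torch.version.cuda)"
```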

#### Why is the H100-SXM better for multi-GPU workloads?

The NVIDIA H100-SXM outperforms the H100-PCIe in multi-GPU configurations primarily due to its higher interconnect bandwidth and greater power capacity. It uses fourth-generation NVLink and NVSwitch, delivering up to **900 GB/s of bidirectional bandwidth** for fast GPU-to-GPU communication. In contrast, the H100-PCIe is limited to a **theoretical maximum of 128 GB/s** via PCIe Gen 5, which becomes a bottleneck in communication-heavy workloads such as large-scale AI training and HPC.

The H100-SXM also provides **HBM3 memory** with up to **3.35 TB/s of bandwidth**, compared to **2 TB/s** with the H100-PCIe's HBM2e, improving performance in memory-bound tasks.

Additionally, the H100-SXM's **700W TDP** allows higher sustained clock speeds and throughput, while the H100-PCIe's **300–350W TDP** imposes stricter performance limits.

Overall, the H100-SXM is the optimal choice for high-communication, multi-GPU workloads, whereas the H100-PCIe offers more flexibility for less communication-intensive applications.
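
To see this interconnect difference on a running `H100-SXM-2-80G` Instance, you can inspect the GPU topology with `nvidia-smi` (shipped with the NVIDIA driver); NVLink connections between the two GPUs appear as `NV<n>` entries in the matrix:

```
# Show how the GPUs are connected to each other
nvidia-smi topo -m

# Show the state and per-link speed of the NVLink connections
nvidia-smi nvlink --status
```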
