---
title: Migrating from H100-2-80G to H100-SXM-2-80G
description: Learn how to migrate from H100-2-80G to H100-SXM-2-80G GPU Instances.
tags: gpu nvidia
dates:
  validation: 2025-10-21
  posted: 2025-10-21
---

Scaleway is optimizing its H100 GPU Instance portfolio to improve long-term availability and provide better performance for all users.

## Current situation

Below is an overview of the current status of each Instance type:

| Instance type | Availability status | Notes |
| ------------------ | ----------------------- | -------------------------------------------------------------------------------------------------------------------------------------- |
| H100-1-80G | Low stock | No additional GPUs can be added at this time. |
| H100-2-80G | Frequently out of stock | Supply remains unstable, and shortages are expected to continue. |
| H100-SXM-2-80G | Good availability | This Instance type can scale further and is ideal for multi-GPU workloads, offering NVLink connectivity and superior memory bandwidth. |

In summary, while the single- and dual-GPU PCIe Instances (H100-1-80G and H100-2-80G) are experiencing supply constraints, the H100-SXM-2-80G remains available in good quantity and is the recommended option for users requiring scalable performance and high-bandwidth interconnects.

We recommend migrating your workloads from PCIe-based GPU Instances to SXM-based GPU Instances for better performance and future-proof access to GPUs. As H100 PCIe variants become increasingly scarce, migrating ensures uninterrupted access to H100-class compute.

## How to migrate

There are two primary scenarios: migrating **Kubernetes (Kapsule)** workloads or **standalone** workloads.

<Message type="important">
Always ensure that your **data is backed up** before performing any operations that could affect it.
</Message>

### Migrating Kubernetes workloads (Kubernetes Kapsule)

If you are using Kapsule, follow these steps to move existing workloads to nodes powered by `H100-SXM-2-80G`.

<Message type="important">
The Kubernetes autoscaler may get stuck if it tries to scale up a node pool whose Instance type is out of stock. We recommend switching to `H100-SXM-2-80G` proactively to avoid disruptions.
</Message>

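Before creating the new pool, you can check which Instance types your existing node pools use. A minimal check with the `scw` CLI, where `<cluster_id>` and `<region>` are placeholders for your own values (run `scw k8s pool list --help` to confirm the exact flags for your CLI version):

```
scw k8s pool list cluster-id=<cluster_id> region=<region>
```
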
#### Step-by-step

1. Create a new node pool using `H100-SXM-2-80G` GPU Instances.
2. Run `kubectl get nodes` to check that the new nodes are in a `Ready` state.
3. Cordon the nodes in the old node pool to prevent new Pods from being scheduled there. For each node, run: `kubectl cordon <node-name>`
   <Message type="tip">
   You can use a selector on the pool name label to cordon or drain multiple nodes at the same time if your application allows it (e.g. `kubectl cordon -l k8s.scaleway.com/pool-name=mypoolname`).
   </Message>
4. Drain the nodes to evict the Pods gracefully. A combined sketch of steps 3 to 5 follows this list.
   - For each node, run: `kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data`
   - The `--ignore-daemonsets` flag is required because DaemonSet-managed Pods cannot be evicted; they are recreated automatically on their nodes.
   - The `--delete-emptydir-data` flag is necessary if your Pods use `emptyDir` volumes, but use it carefully, as it deletes the data stored in those volumes.
   - Refer to the [official Kubernetes documentation](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/) for further information.
5. Run `kubectl get pods -o wide` after draining to verify that the Pods have been rescheduled to the new node pool.
6. Delete the old node pool.

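As a recap, here is a minimal sketch of steps 3 to 5 using the pool name label, assuming the old and new pools are named `my-pcie-pool` and `my-sxm-pool` (replace these with your own pool names):

```
# Confirm the new H100-SXM-2-80G nodes are Ready
kubectl get nodes -l k8s.scaleway.com/pool-name=my-sxm-pool

# Cordon all nodes of the old pool so no new Pods are scheduled on them
kubectl cordon -l k8s.scaleway.com/pool-name=my-pcie-pool

# Drain the old nodes; evicted Pods are rescheduled onto the new pool
kubectl drain -l k8s.scaleway.com/pool-name=my-pcie-pool --ignore-daemonsets --delete-emptydir-data

# Verify that the workloads now run on the new nodes
kubectl get pods -o wide
```
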
<Message type="tip">
For further information, refer to our dedicated documentation [How to migrate existing workloads to a new Kapsule node pool](/kubernetes/how-to/manage-node-pools/#how-to-migrate-existing-workloads-to-a-new-kubernets-kapsule-node-pool).
</Message>

### Migrating a standalone Instance

For standalone GPU Instances, you can recreate your environment on an `H100-SXM-2-80G` GPU Instance using the CLI, the API, or the Scaleway console.

#### Quick start (CLI example)

1. Stop the Instance.
   ```
   scw instance server stop <instance_id> zone=<zone>
   ```
   Replace `<instance_id>` with the ID of your Instance and `<zone>` with its Availability Zone. For example, if your Instance is located in Paris-1, the zone is `fr-par-1`.
   <Message type="tip">
   You can find the ID of your Instance on its overview page in the Scaleway console, or by running the following CLI command: `scw instance server list`.
   </Message>

2. Update the commercial type of the Instance.
   ```
   scw instance server update <instance_id> commercial-type=H100-SXM-2-80G zone=<zone>
   ```
   Replace `<instance_id>` with the UUID of your Instance and `<zone>` with the Availability Zone of your GPU Instance.

3. Power on the Instance.
   ```
   scw instance server start <instance_id> zone=<zone>
   ```

For further information, refer to the [Instance CLI documentation](https://github.com/scaleway/scaleway-cli/blob/master/docs/commands/instance.md).

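To confirm that the migration was applied, you can inspect the Instance and check that its commercial type is now `H100-SXM-2-80G`. A minimal check with the `scw` CLI, using the same placeholders as above:

```
scw instance server get <instance_id> zone=<zone>
```
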
<Message type="tip">
You can also migrate your GPU Instances using the [API](https://www.scaleway.com/en/docs/instances/api-cli/migrating-instances/) or via the [Scaleway console](/instances/how-to/migrate-instances/).
</Message>

## FAQ

#### Are PCIe-based H100 GPU Instances being discontinued?
H100 PCIe-based GPU Instances are not End-of-Life (EOL), but due to limited availability, we recommend migrating to `H100-SXM-2-80G` to avoid future disruptions.

#### Is H100-SXM-2-80G compatible with my current setup?
Yes. It runs the same CUDA toolchain and supports standard frameworks (PyTorch, TensorFlow, etc.). However, verify that your workload does not require more system RAM or NVMe scratch space than the `H100-SXM-2-80G` provides.

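To sanity-check a freshly migrated Instance, you can confirm that the GPUs, driver, and framework are visible. A minimal sketch, assuming the NVIDIA drivers are installed and, for the second command, that PyTorch is available in your environment:

```
# List the GPUs, driver version, and total memory visible on the Instance
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv

# Optional: check that PyTorch detects the GPUs
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```
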
#### Why is H100-SXM better for multi-GPU workloads?
Because of *NVLink*, which provides far higher GPU-to-GPU bandwidth than PCIe, enabling near-shared-memory access speeds between GPUs. In contrast, PCIe-based Instances like H100-2-80G have slower interconnects that can bottleneck multi-GPU training. Learn more: [Understanding NVIDIA NVLink](https://www.scaleway.com/en/docs/gpu/reference-content/understanding-nvidia-nvlink/)

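You can inspect the interconnect topology of your Instance directly, assuming the NVIDIA drivers are installed: the command below prints a matrix where `NV#` entries indicate NVLink connections between GPU pairs, while entries such as `PIX` or `PHB` indicate PCIe paths.

```
nvidia-smi topo -m
```
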
#### What if my workload needs more CPU or RAM?
Let us know via a support ticket; we are evaluating options for compute-optimized configurations to complement our GPU offerings.