docs(gpu): H100 #5756
Merged
Commits (8):
- 06ee375 docs(gpu): migrate h100 pcie (bene2k1)
- d7206ae docs(gpu): update content (bene2k1)
- 9d262d9 docs(gpu): update content (bene2k1)
- d07c116 Apply suggestions from code review (bene2k1)
- e88be1c docs(gpu): wording (bene2k1)
- c2ecd76 Update pages/gpu/reference-content/migration-h100.mdx (bene2k1)
- f0267a2 Update pages/gpu/reference-content/migration-h100.mdx (bene2k1)
- 4b42e3d Apply suggestions from code review (bene2k1)

File: pages/gpu/reference-content/migration-h100.mdx (new file, +94 lines)

---
title: Migrating from H100-2-80G to H100-SXM-2-80G
description: Learn how to migrate from H100-2-80G to H100-SXM-2-80G GPU Instances.
tags: gpu nvidia
dates:
  validation: 2025-11-04
  posted: 2025-11-04
---

Scaleway is optimizing its H100 GPU Instance portfolio to improve long-term availability and provide better performance for all users.

For optimal availability and performance, we recommend switching from **H100-2-80G** to the next-generation **H100-SXM-2-80G** GPU Instance. This latest generation offers better stock availability, improved NVLink, and faster VRAM.

## How to migrate

There are two primary scenarios: migrating **Kubernetes (Kapsule)** workloads or **standalone** workloads.

<Message type="important">
  Always make sure your **data is backed up** before performing any operation that could affect it. Remember that **scratch storage** is ephemeral and does not persist after an Instance is fully stopped: a full stop/start cycle, such as the one performed during an Instance server migration, **erases all scratch data**. Outside of server-type migrations, a simple reboot or a **stop in place** preserves the data stored on the Instance's scratch storage.
</Message>

### Migrating Kubernetes workloads (Kapsule)

If you are using Kapsule, follow these steps to move existing workloads to nodes powered by `H100-SXM-2-80G` GPUs.

<Message type="important">
  The Kubernetes autoscaler may get stuck if it tries to scale up a node pool with out-of-stock Instances. We recommend switching to `H100-SXM-2-80G` GPU Instances proactively to avoid disruptions.
</Message>

#### Step-by-step
1. Create a new node pool using `H100-SXM-2-80G` GPU Instances (see the CLI sketch after this list).
2. Run `kubectl get nodes` to check that the new nodes are in a `Ready` state.
3. Cordon the nodes in the old node pool to prevent new Pods from being scheduled there. For each node, run: `kubectl cordon <node-name>`
   <Message type="tip">
     You can use a selector on the pool name label to cordon or drain multiple nodes at the same time if your app allows it (e.g. `kubectl cordon -l k8s.scaleway.com/pool-name=mypoolname`).
   </Message>
4. Drain the nodes to evict the Pods gracefully.
   - For each node, run: `kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data`
   - The `--ignore-daemonsets` flag is used because DaemonSets manage Pods across all nodes and will automatically reschedule them.
   - The `--delete-emptydir-data` flag is necessary if your Pods use `emptyDir` volumes, but use this option carefully, as it deletes the data stored in these volumes.
   - Refer to the [official Kubernetes documentation](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/) for further information.
5. Run `kubectl get pods -o wide` after draining to verify that the Pods have been rescheduled to the new node pool.
6. Delete the old node pool.

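A minimal sketch of steps 1, 3, and 4, assuming a configured `scw` CLI, your cluster ID in `$CLUSTER_ID`, and illustrative pool names (`h100-sxm-pool` for the new pool, `h100-pcie-pool` for the old one):

```
# 1. Create a new node pool backed by H100-SXM-2-80G GPU Instances
scw k8s pool create cluster-id=$CLUSTER_ID name=h100-sxm-pool node-type=H100-SXM-2-80G size=2

# 3. Cordon all nodes of the old pool so no new Pods land there
kubectl cordon -l k8s.scaleway.com/pool-name=h100-pcie-pool

# 4. Drain the old nodes so their Pods are rescheduled onto the new pool
kubectl drain -l k8s.scaleway.com/pool-name=h100-pcie-pool --ignore-daemonsets --delete-emptydir-data
```
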
<Message type="tip">
  For further information, refer to our dedicated documentation [How to migrate existing workloads to a new Kapsule node pool](/kubernetes/how-to/manage-node-pools/#how-to-migrate-existing-workloads-to-a-new-kubernets-kapsule-node-pool).
</Message>

### Migrating a standalone Instance

For standalone GPU Instances, you can recreate your environment on an `H100-SXM-2-80G` GPU Instance using the CLI, the API, or the Scaleway console.

#### Quick start (CLI example)
1. Stop the Instance.
   ```
   scw instance server stop <instance_id> zone=<zone>
   ```
   Replace `<zone>` with the Availability Zone of your Instance. For example, if your Instance is located in Paris-1, the zone is `fr-par-1`. Replace `<instance_id>` with the ID of your Instance.
   <Message type="tip">
     You can find the ID of your Instance on its overview page in the Scaleway console, or by running the following CLI command: `scw instance server list`.
   </Message>

2. Update the commercial type of the Instance.
   ```
   scw instance server update <instance_id> commercial-type=H100-SXM-2-80G zone=<zone>
   ```
   Replace `<instance_id>` with the UUID of your Instance and `<zone>` with the Availability Zone of your GPU Instance.

3. Power on the Instance.
   ```
   scw instance server start <instance_id> zone=<zone>
   ```

For further information, refer to the [Instance CLI documentation](https://github.com/scaleway/scaleway-cli/blob/master/docs/commands/instance.md).

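To confirm the migration before resuming your workloads, you can inspect the Instance and check that its commercial type now reads `H100-SXM-2-80G` (same placeholders as above):

```
# Show the Instance details, including commercial type and state
scw instance server get <instance_id> zone=<zone>
```
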
<Message type="tip">
  You can also migrate your GPU Instances using the [API](https://www.scaleway.com/en/docs/instances/api-cli/migrating-instances/) or the [Scaleway console](/instances/how-to/migrate-instances/).
</Message>

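For the API route, a sketch of the same commercial-type change, assuming the server update endpoint accepts `commercial_type` as described in the linked API documentation (run it only while the Instance is stopped):

```
# Hedged sketch: update the commercial type of a stopped Instance via the Instance API
curl -X PATCH \
  -H "X-Auth-Token: $SCW_SECRET_KEY" \
  -H "Content-Type: application/json" \
  -d '{"commercial_type": "H100-SXM-2-80G"}' \
  "https://api.scaleway.com/instance/v1/zones/<zone>/servers/<instance_id>"
```
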
## FAQ

#### Are PCIe-based H100s being discontinued?
H100 PCIe-based GPU Instances are not End-of-Life (EOL), but due to limited availability, we recommend migrating to `H100-SXM-2-80G` to avoid future disruptions.

#### Is H100-SXM-2-80G compatible with my current setup?
Yes. It runs the same CUDA toolchain and supports standard frameworks (PyTorch, TensorFlow, etc.). No changes to your code base are required when upgrading to an SXM-based GPU Instance.

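As a quick sanity check after the first boot, assuming the NVIDIA drivers shipped with the GPU OS image are in place:

```
# Verify that both GPUs, the driver, and the CUDA runtime are visible
nvidia-smi
```
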
#### Why is the H100-SXM better for multi-GPU workloads?

The NVIDIA H100-SXM outperforms the H100-PCIe in multi-GPU configurations primarily due to its higher interconnect bandwidth and greater power capacity. It uses fourth-generation NVLink and NVSwitch, delivering up to **900 GB/s of bidirectional bandwidth** for fast GPU-to-GPU communication. In contrast, the H100-PCIe is limited to a **theoretical maximum of 128 GB/s** via PCIe Gen 5, which becomes a bottleneck in communication-heavy workloads such as large-scale AI training and HPC.
The H100-SXM also provides **HBM3 memory** with up to **3.35 TB/s of bandwidth**, compared to **2 TB/s** with the H100-PCIe's HBM2e, improving performance in memory-bound tasks.
Additionally, the H100-SXM's **700 W TDP** allows higher sustained clock speeds and throughput, while the H100-PCIe's **300-350 W TDP** imposes stricter performance limits.
Overall, the H100-SXM is the optimal choice for high-communication, multi-GPU workloads, whereas the H100-PCIe offers more flexibility for less communication-intensive applications.

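If you want to verify the interconnect on your own Instance, `nvidia-smi` can report the NVLink topology and per-link status:

```
# Show how the two GPUs are connected (NVLink vs. PCIe paths)
nvidia-smi topo -m

# Report the state and speed of each NVLink lane
nvidia-smi nvlink --status
```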