---
title: Migrating from H100-2-80G to H100-SXM-2-80G
description: Learn how to migrate from H100-2-80G to H100-SXM-2-80G GPU Instances.
tags: gpu nvidia
dates:
  validation: 2025-10-21
  posted: 2025-10-21
---

Scaleway is optimizing its H100 GPU Instance portfolio to improve long-term availability and provide better performance for all users.

## Current situation

Below is an overview of the current status of each Instance type:

| Instance type  | Availability status     | Notes                                                                                                                                  |
| -------------- | ----------------------- | -------------------------------------------------------------------------------------------------------------------------------------- |
| H100-1-80G     | Low stock               | No additional GPUs can be added at this time.                                                                                          |
| H100-2-80G     | Frequently out of stock | Supply remains unstable, and shortages are expected to continue.                                                                       |
| H100-SXM-2-80G | Good availability       | This Instance type can scale further and is ideal for multi-GPU workloads, offering NVLink connectivity and superior memory bandwidth. |

In summary, while the single- and dual-GPU PCIe Instances (H100-1-80G and H100-2-80G) are experiencing supply constraints, the H100-SXM-2-80G remains available in good quantity and is the recommended option for users requiring scalable performance and high-bandwidth interconnects.

We recommend that users migrate their workloads from PCIe-based GPU Instances to SXM-based GPU Instances for improved performance and future-proof access to GPUs. As H100 PCIe variants become increasingly scarce, migrating ensures uninterrupted access to H100-class compute.

## How to migrate

There are two primary scenarios: migrating **Kubernetes (Kapsule)** workloads or **standalone** workloads.

<Message type="important">
  Always ensure that your **data is backed up** before performing any operations that could affect it.
</Message>

### Migrating Kubernetes workloads (Kubernetes Kapsule)

If you are using Kapsule, follow these steps to move existing workloads to nodes powered by `H100-SXM-2-80G`.

<Message type="important">
  The Kubernetes autoscaler may get stuck if it tries to scale up a node pool whose Instance type is out of stock. We recommend switching to `H100-SXM-2-80G` proactively to avoid disruptions.
</Message>

#### Step-by-step
1. Create a new node pool using `H100-SXM-2-80G` GPU Instances (a consolidated CLI sketch follows this list).
2. Run `kubectl get nodes` to check that the new nodes are in a `Ready` state.
3. Cordon the nodes in the old node pool to prevent new Pods from being scheduled there. For each node, run: `kubectl cordon <node-name>`
   <Message type="tip">
     You can use a selector on the pool name label to cordon or drain multiple nodes at the same time if your app allows it (e.g., `kubectl cordon -l k8s.scaleway.com/pool-name=mypoolname`).
   </Message>
4. Drain the nodes to evict the Pods gracefully.
   - For each node, run: `kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data`
   - The `--ignore-daemonsets` flag is used because DaemonSets manage Pods across all nodes and will automatically reschedule them.
   - The `--delete-emptydir-data` flag is necessary if your Pods use `emptyDir` volumes, but use this option carefully, as it deletes the data stored in these volumes.
   - Refer to the [official Kubernetes documentation](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/) for further information.
5. Run `kubectl get pods -o wide` after draining to verify that the Pods have been rescheduled to the new node pool.
6. Delete the old node pool.
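
As a minimal consolidated sketch of steps 1, 3, 4, and 5 from a terminal: the commands below assume a configured `scw` CLI, a placeholder `<cluster-id>`, an example new pool name `pool-h100-sxm`, and an old pool named `mypoolname`; adjust the names, size, and region to match your setup.

```
# Step 1: create the new H100-SXM-2-80G node pool (example name and size)
scw k8s pool create cluster-id=<cluster-id> name=pool-h100-sxm node-type=H100-SXM-2-80G size=2 region=<region>

# Steps 3 and 4: cordon, then drain, every node of the old pool via its pool name label
kubectl cordon -l k8s.scaleway.com/pool-name=mypoolname
kubectl drain -l k8s.scaleway.com/pool-name=mypoolname --ignore-daemonsets --delete-emptydir-data

# Step 5: verify that the Pods were rescheduled onto the new pool
kubectl get pods -o wide
```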

<Message type="tip">
  For further information, refer to our dedicated documentation [How to migrate existing workloads to a new Kapsule node pool](/kubernetes/how-to/manage-node-pools/#how-to-migrate-existing-workloads-to-a-new-kubernets-kapsule-node-pool).
</Message>

### Migrating a standalone Instance

For standalone GPU Instances, you can recreate your environment on an `H100-SXM-2-80G` GPU Instance using the CLI, the API, or the Scaleway console.

#### Quick start (CLI example)
1. Stop the Instance.
   ```
   scw instance server stop <instance_id> zone=<zone>
   ```
   Replace `<zone>` with the Availability Zone of your Instance. For example, if your Instance is located in Paris-1, the zone would be `fr-par-1`. Replace `<instance_id>` with the ID of your Instance.
   <Message type="tip">
     You can find the ID of your Instance on its overview page in the Scaleway console, or by running the following CLI command: `scw instance server list`.
   </Message>

2. Update the commercial type of the Instance.
   ```
   scw instance server update <instance_id> commercial-type=H100-SXM-2-80G zone=<zone>
   ```
   Replace `<instance_id>` with the UUID of your Instance and `<zone>` with the Availability Zone of your GPU Instance.

3. Power on the Instance.
   ```
   scw instance server start <instance_id> zone=<zone>
   ```
For further information, refer to the [Instance CLI documentation](https://github.com/scaleway/scaleway-cli/blob/master/docs/commands/instance.md).
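
Once the Instance has restarted, you can confirm that both GPUs are visible. A minimal check, assuming the NVIDIA drivers are installed on your OS image:

```
# List the GPUs; two H100 devices should appear
nvidia-smi
```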

<Message type="tip">
  You can also migrate your GPU Instances using the [API](https://www.scaleway.com/en/docs/instances/api-cli/migrating-instances/) or via the [Scaleway console](/instances/how-to/migrate-instances/).
</Message>

## FAQ

#### Are PCIe-based H100 Instances being discontinued?
H100 PCIe-based GPU Instances are not End-of-Life (EOL), but due to limited availability, we recommend migrating to `H100-SXM-2-80G` to avoid future disruptions.

#### Is H100-SXM-2-80G compatible with my current setup?
Yes: it runs the same CUDA toolchain and supports standard frameworks (PyTorch, TensorFlow, etc.). However, verify that your workload does not require more system RAM or NVMe scratch space than the Instance provides.
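
As a quick compatibility check, here is a minimal sketch (assuming PyTorch is installed on the new Instance) that verifies the framework sees both GPUs:

```
# Count the visible CUDA devices and print the first device's name
python3 -c "import torch; print(torch.cuda.device_count(), torch.cuda.get_device_name(0))"
```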

#### Why is H100-SXM better for multi-GPU workloads?
Because of *NVLink*, which provides much higher GPU-to-GPU bandwidth than PCIe. In contrast, PCIe-based Instances like H100-2-80G have slower interconnects that can bottleneck multi-GPU training. Learn more: [Understanding NVIDIA NVLink](https://www.scaleway.com/en/docs/gpu/reference-content/understanding-nvidia-nvlink/)
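
You can observe the interconnect directly with `nvidia-smi`. On an `H100-SXM-2-80G` Instance, the topology matrix should report NVLink (`NV*`) between the two GPUs instead of a PCIe link:

```
# Show the GPU-to-GPU topology matrix
nvidia-smi topo -m

# Query the status of each NVLink link
nvidia-smi nvlink --status
```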

#### What if my workload needs more CPU or RAM?
Let us know via a support ticket: we're evaluating options for compute-optimized configurations to complement our GPU offerings.