|
9 | 9 |
|
10 | 10 | Scaleway is optimizing its H100 GPU Instance portfolio to improve long-term availability and provide better performance for all users. |
11 | 11 |
|
12 | | -## Current situation |
13 | | - |
14 | | -Below is an overview of the current status of each instance type: |
15 | | - |
16 | | -| Instance type | Availability status | Notes | |
17 | | -| ------------------ | ----------------------- | -------------------------------------------------------------------------------------------------------------------------------------- | |
18 | | -| H100-1-80G | Low stock | No additional GPUs can be added at this time. | |
19 | | -| H100-2-80G | Frequently out of stock | Supply remains unstable, and shortages are expected to continue. | |
20 | | -| H100-SXM-2-80G | Good availability | This Instance type can scale further and is ideal for multi-GPU workloads, offering NVLink connectivity and superior memory bandwidth. | |
21 | | - |
22 | | -In summary, while the single- and dual-GPU PCIe instances (H100-1-80G and H100-2-80G) are experiencing supply constraints, the H100-SXM-2-80G remains available in good quantity and is the recommended option for users requiring scalable performance and high-bandwidth interconnects. |
23 | | - |
24 | 12 | We recommend migrating your workloads from PCIe-based GPU Instances to SXM-based GPU Instances for better performance and future-proof access to GPUs. As H100 PCIe variants become increasingly scarce, migrating ensures uninterrupted access to H100-class compute.
25 | 13 |
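To make the recommendation concrete, a migration can start from the Scaleway CLI. The sketch below is a minimal example rather than an exact procedure: the zone, image label, and Instance name are assumptions to adapt to your own setup.

```bash
# List your existing Instances to identify PCIe-based H100 servers
scw instance server list zone=fr-par-2

# Create a replacement SXM-based GPU Instance
# (zone, image label, and name are examples; adjust them to your setup)
scw instance server create \
  type=H100-SXM-2-80G \
  zone=fr-par-2 \
  image=ubuntu_jammy_gpu_os_12 \
  name=my-h100-sxm-instance
```

Once the new Instance is running and your workload is verified, delete the PCIe-based Instance to stop incurring its cost.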
|
26 | 14 | ## Benefits of the migration |
27 | 15 |
|
28 | 16 | There are two primary migration scenarios: migrating **Kubernetes (Kapsule)** workloads or migrating **standalone** workloads.
29 | 17 |
|
30 | 18 | <Message type="important"> |
31 | | - Always ensure that your **data is backed up** before performing any operations that could affect it. |
| 19 | + Always ensure that your **data is backed up** before performing any operations that could affect it. Keep in mind that **Scratch Storage** is ephemeral and does not survive a full stop of the Instance: a stop/start cycle will **erase the scratch data**. However, a simple reboot or the **stop in place** function keeps the data.
32 | 20 | </Message> |
33 | 21 |
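If your current Instance uses scratch storage, copy anything worth keeping to durable storage before stopping it. Below is a minimal sketch assuming the scratch volume is mounted at `/scratch` and that you back up to an Object Storage bucket (`my-backup-bucket` and the endpoint region are placeholders) with S3 credentials already configured:

```bash
# Copy scratch data to S3-compatible Object Storage before stopping the Instance.
# Mount point, bucket name, and endpoint are examples; adjust to your environment.
aws s3 sync /scratch s3://my-backup-bucket/scratch-backup/ \
  --endpoint-url https://s3.fr-par.scw.cloud
```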
|
34 | 22 | ### Migrating Kubernetes workloads (Kubernetes Kapsule) |
@@ -96,12 +84,15 @@ For further information, refer to the [Instance CLI documentation](https://githu |
96 | 84 | H100 PCIe-based GPU Instances are not End-of-Life (EOL), but due to limited availability, we recommend migrating to `H100-SXM-2-80G` to avoid future disruptions. |
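To check what is currently offered in a given Availability Zone before migrating, you can list Instance types with the Scaleway CLI (the zone below is an example):

```bash
# List the Instance types offered in a zone, including GPU types
scw instance server-type list zone=fr-par-2
```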
97 | 85 |
|
98 | 86 | #### Is H100-SXM-2-80G compatible with my current setup? |
99 | | -Yes — it runs the same CUDA toolchain and supports standard frameworks (PyTorch, TensorFlow, etc.). However, verify that your workload does not require large system RAM or NVMe scratch space. |
| 87 | +Yes: it runs the same CUDA toolchain and supports standard frameworks (PyTorch, TensorFlow, etc.). No changes to your code base are required when upgrading to an SXM-based GPU Instance.
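After migrating, a quick sanity check confirms that the driver and CUDA toolchain see the new GPUs. A minimal example, assuming PyTorch is installed on the Instance:

```bash
# Confirm the driver sees the H100 SXM GPUs
nvidia-smi

# Confirm the CUDA toolchain works from your framework (PyTorch here)
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
```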
100 | 88 |
|
101 | 89 | #### Why is H100-SXM better for multi-GPU? |
102 | | -Because of *NVLink*, which enables near-shared-memory speeds between GPUs. In contrast, PCIe-based instances like H100-2-80G have slower interconnects that can bottleneck training. Learn more: [Understanding NVIDIA NVLink](https://www.scaleway.com/en/docs/gpu/reference-content/understanding-nvidia-nvlink/) |
| 90 | +The NVIDIA H100-SXM outperforms the H100-PCIe in multi-GPU configurations due to its superior interconnect and higher power capacity. |
| 91 | +It leverages fourth-generation NVLink and NVSwitch, providing up to 900 GB/s of bidirectional bandwidth for rapid GPU-to-GPU communication, compared to the H100-PCIe's 128 GB/s via PCIe Gen 5, which creates bottlenecks in demanding workloads like large-scale AI training and HPC. |
| 92 | +Additionally, the H100-SXM’s 700W TDP enables higher clock speeds and sustained performance, while the H100-PCIe’s 300-350W TDP limits its throughput. |
| 93 | +For high-communication, multi-GPU tasks, the H100-SXM is the optimal choice, while the H100-PCIe suits less intensive applications with greater flexibility. |
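You can verify that NVLink is active between the GPUs of an H100-SXM-2-80G Instance with standard NVIDIA tooling; links reported as `NV<n>` in the topology matrix indicate NVLink connectivity:

```bash
# Show the GPU interconnect topology; NVLink links appear as NV<n> entries
nvidia-smi topo -m

# Report per-link NVLink status and bandwidth
nvidia-smi nvlink --status
```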
103 | 94 |
|
104 | 95 | #### What if my workload needs more CPU or RAM? |
105 | | -Let us know via [support ticket we’re evaluating options for compute-optimized configurations to complement our GPU offerings. |
| 96 | +Let us know via [support ticket](https://console.scaleway.com/support/tickets/create) what your specific requirements are. We are currently evaluating options for compute-optimized configurations to complement our GPU offerings.
106 | 97 |
|