---
title: Supporting Intel Gaudi
date: 2025-02-21
description: "dstack now supports container orchestration on on-prem machines with Intel Gaudi accelerators."
slug: intel-gaudi
image: https://github.com/dstackai/static-assets/blob/main/static-assets/images/dstack-intel-gaudi-and-intel-tiber-cloud-v2.png?raw=true
categories:
  - Fleets
---

# Supporting Intel Gaudi

At `dstack`, our goal is to make AI container orchestration simpler and fully vendor-agnostic. That’s why we support not
just leading cloud providers and on-prem environments but also a wide range of accelerators.

With our latest release, we’re adding support
for [Intel Gaudi :material-arrow-top-right-thin:{ .external }](https://www.intel.com/content/www/us/en/products/details/processors/ai-accelerators/gaudi.html){:target="_blank"} and
launching a new partnership with Intel.

<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/dstack-intel-gaudi-and-intel-tiber-cloud-v2.png?raw=true" width="630"/>

<!-- more -->

## About Intel Gaudi

Intel Gaudi is a series of accelerators built for AI workloads. Based on Intel’s Habana architecture, Gaudi is
tailored for high-performance AI training and inference, offering high throughput and efficiency. Its scalable
design pairs dedicated Matrix Multiplication Engines (MME) and Tensor Processor Cores (TPC) with ample memory
bandwidth, enabling better performance per watt.

Here’s a brief spec comparison of Gaudi 2 and Gaudi 3:

|                      | **Gaudi 2** | **Gaudi 3** |
|----------------------|-------------|-------------|
| **MME units**        | 2           | 8           |
| **TPC units**        | 24          | 64          |
| **HBM capacity**     | 96 GB       | 128 GB      |
| **HBM bandwidth**    | 2.46 TB/s   | 3.7 TB/s    |
| **Networking**       | 600 GB/s    | 1200 GB/s   |
| **FP8 performance**  | 865 TFLOPS  | 1835 TFLOPS |
| **BF16 performance** | 432 TFLOPS  | 1835 TFLOPS |

With the latest release, `dstack` supports orchestrating containers across on-prem
machines equipped with Intel Gaudi accelerators.

## Create a fleet

To manage container workloads on on-prem machines with Intel Gaudi accelerators, start by configuring an
[SSH fleet](../../docs/concepts/fleets.md#ssh). Here’s an example configuration for your fleet:

<div editor-title="examples/misc/fleets/gaudi.dstack.yml">

```yaml
type: fleet
name: my-gaudi2-fleet
ssh_config:
  hosts:
    - hostname: 100.83.163.67
      user: sdp
      identity_file: ~/.ssh/id_rsa
      # Let multiple runs share the host's accelerators
      blocks: auto
    - hostname: 100.83.163.68
      user: sdp
      identity_file: ~/.ssh/id_rsa
      blocks: auto
  # Reach the hosts through a jump host
  proxy_jump:
    hostname: 146.152.186.135
    user: guest
    identity_file: ~/.ssh/intel_id_rsa
```

</div>

To provision the fleet, run the [`dstack apply`](../../docs/reference/cli/dstack/apply.md) command:

<div class="termy">

```shell
$ dstack apply -f examples/misc/fleets/gaudi.dstack.yml

Provisioning...
---> 100%

 FLEET            INSTANCE  BACKEND  GPU                        STATUS  CREATED
 my-gaudi2-fleet  0         ssh      152xCPU, 1007GB, 8xGaudi2  idle    3 mins ago
                                     (96GB), 388.0GB (disk)
                  1         ssh      152xCPU, 1007GB, 8xGaudi2  idle    3 mins ago
                                     (96GB), 388.0GB (disk)
```

</div>

## Apply a configuration

With your fleet provisioned, you can now run [dev environments](../../docs/concepts/dev-environments.md), [tasks](../../docs/concepts/tasks.md), and [services](../../docs/concepts/services.md).
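
For instance, here’s a minimal sketch of a dev environment configuration for this fleet. The name is an arbitrary placeholder, and the image is the same Habana PyTorch image used in the task example below:

```yaml
type: dev-environment
name: gaudi-ide  # hypothetical name

# Habana PyTorch image (same as in the task example below)
image: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0
ide: vscode

resources:
  gpu: gaudi2:8  # all eight Gaudi 2 accelerators of a host
```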

Below is an example of a task configuration that fine-tunes the [`DeepSeek-R1-Distill-Qwen-7B` :material-arrow-top-right-thin:{ .external }](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B){:target="_blank"}
model using [Optimum for Intel Gaudi :material-arrow-top-right-thin:{ .external }](https://github.com/huggingface/optimum-habana){:target="_blank"}
and [DeepSpeed :material-arrow-top-right-thin:{ .external }](https://docs.habana.ai/en/latest/PyTorch/DeepSpeed/DeepSpeed_User_Guide/DeepSpeed_User_Guide.html#deepspeed-user-guide){:target="_blank"} on
the [`lvwerra/stack-exchange-paired` :material-arrow-top-right-thin:{ .external }](https://huggingface.co/datasets/lvwerra/stack-exchange-paired){:target="_blank"} dataset:

<div editor-title="examples/fine-tuning/trl/intel/.dstack.yml">

```yaml
type: task
name: trl-train

# Habana PyTorch container image
image: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0
env:
  - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
  - WANDB_API_KEY
  - WANDB_PROJECT
commands:
  - pip install --upgrade-strategy eager optimum[habana]
  - pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0
  - git clone https://github.com/huggingface/optimum-habana.git
  - cd optimum-habana/examples/trl
  - pip install -r requirements.txt
  - pip install wandb
  # Spawn one training process per Gaudi accelerator via DeepSpeed
  - DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 python ../gaudi_spawn.py --world_size $DSTACK_GPUS_NUM --use_deepspeed sft.py
    --model_name_or_path $MODEL_ID
    --dataset_name "lvwerra/stack-exchange-paired"
    --deepspeed ../language-modeling/llama2_ds_zero3_config.json
    --output_dir="./sft"
    --do_train
    --max_steps=500
    --logging_steps=10
    --save_steps=100
    --per_device_train_batch_size=1
    --per_device_eval_batch_size=1
    --gradient_accumulation_steps=2
    --learning_rate=1e-4
    --lr_scheduler_type="cosine"
    --warmup_steps=100
    --weight_decay=0.05
    --optim="paged_adamw_32bit"
    --lora_target_modules "q_proj" "v_proj"
    --bf16
    --remove_unused_columns=False
    --run_name="sft_deepseek_70"
    --report_to="wandb"
    --use_habana
    --use_lazy_mode

resources:
  gpu: gaudi2:8
```

</div>
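
The `gpu` property under `resources` selects accelerators by name and count: `gaudi2:8` requests all eight Gaudi 2 cards of a host. If your fleet were built from Gaudi 3 machines instead (an assumption for illustration), only this spec would need to change:

```yaml
resources:
  gpu: gaudi3:8  # eight Gaudi 3 accelerators (assumes Gaudi 3 hosts in the fleet)
```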

Submit the task using the [`dstack apply`](../../docs/reference/cli/dstack/apply.md) command:

<div class="termy">

```shell
$ dstack apply -f examples/fine-tuning/trl/intel/.dstack.yml -R
```

</div>

`dstack` will automatically create containers according to the run configuration and execute them across the fleet.
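
Once training completes, the fine-tuned model can be deployed on the same fleet as a service. The sketch below is illustrative rather than an official example: it assumes Hugging Face’s TGI image for Gaudi, and the image tag, service name, and launcher flags are assumptions to verify against the `tgi-gaudi` repository:

```yaml
type: service
name: deepseek-tgi  # hypothetical name

# Assumed TGI image for Gaudi; check ghcr.io/huggingface/tgi-gaudi for current tags
image: ghcr.io/huggingface/tgi-gaudi:2.0.6
env:
  - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
commands:
  - text-generation-launcher --model-id $MODEL_ID --port 8000
port: 8000

resources:
  gpu: gaudi2:1  # a 7B model fits comfortably in a single Gaudi 2's 96 GB HBM
```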

!!! info "Examples"
    Explore our [examples](../../examples/accelerators/intel/index.md) to learn how to train and deploy large models on Intel Gaudi.

!!! info "What's next?"
    1. Refer to [Quickstart](../../docs/quickstart.md)
    2. Check [dev environments](../../docs/concepts/dev-environments.md), [tasks](../../docs/concepts/tasks.md), [services](../../docs/concepts/services.md), and [fleets](../../docs/concepts/fleets.md)
    3. Join [Discord :material-arrow-top-right-thin:{ .external }](https://discord.gg/u8SmfwPpMd){:target="_blank"}