
Commit 3cdcf7b

[Docs] Intel Gaudi blog
1 parent bf6cae4 commit 3cdcf7b

3 files changed (+170 −2 lines)

docs/blog/posts/distributed-training-with-aws-efa.md

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ With the latest release of `dstack`, you can now leverage AWS EFA to supercharge

<!-- more -->

- ## Why EFA?
+ ## About EFA

AWS EFA delivers up to 400 Gbps of bandwidth, enabling lightning-fast GPU-to-GPU communication across nodes. By
bypassing the kernel and providing direct network access, EFA minimizes latency and maximizes throughput. Its native

docs/blog/posts/intel-gaudi.md

Lines changed: 168 additions & 0 deletions
@@ -0,0 +1,168 @@
---
title: Supporting Intel Gaudi
date: 2025-02-21
description: "TBA"
slug: intel-gaudi
image: https://github.com/dstackai/static-assets/blob/main/static-assets/images/dstack-intel-gaudi-and-intel-tiber-cloud-v2.png?raw=true
categories:
  - Fleets
---

# Supporting Intel Gaudi

At `dstack`, our goal is to make AI container orchestration simpler and fully vendor-agnostic. That’s why we support not
just leading cloud providers and on-prem environments but also a wide range of accelerators.

With our latest release, we’re adding support
for [Intel Gaudi :material-arrow-top-right-thin:{ .external }](https://www.intel.com/content/www/us/en/products/details/processors/ai-accelerators/gaudi.html){:target="_blank"} and
launching a new partnership with Intel.

<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/dstack-intel-gaudi-and-intel-tiber-cloud-v2.png?raw=true" width="630"/>

<!-- more -->

## About Intel Gaudi

Intel Gaudi is a family of accelerators purpose-built for AI workloads. Designed by Habana Labs, an Intel company,
Gaudi is tailored for high-performance AI inference and training, offering high throughput and efficiency. Its scalable
design, with numerous compute cores and ample memory bandwidth, enables strong performance per watt.

Here’s a brief spec comparison of Gaudi 2 and Gaudi 3:

|                      | **Gaudi 2** | **Gaudi 3** |
|----------------------|-------------|-------------|
| **MME Units**        | 2           | 8           |
| **TPC Units**        | 24          | 64          |
| **HBM Capacity**     | 96 GB       | 128 GB      |
| **HBM Bandwidth**    | 2.46 TB/s   | 3.7 TB/s    |
| **Networking**       | 600 GB/s    | 1200 GB/s   |
| **FP8 Performance**  | 865 TFLOPS  | 1835 TFLOPS |
| **BF16 Performance** | 432 TFLOPS  | 1835 TFLOPS |

With the latest release, `dstack` supports orchestrating containers across on-prem machines equipped with Intel Gaudi
accelerators.

## Create a fleet

To manage container workloads on on-prem machines with Intel Gaudi accelerators, start by configuring an
[SSH fleet](../../docs/concepts/fleets.md#ssh). Here’s an example configuration for your fleet:

<div editor-title="examples/misc/fleets/gaudi.dstack.yml">

```yaml
type: fleet
name: my-gaudi2-fleet
ssh_config:
  hosts:
    - hostname: 100.83.163.67
      user: sdp
      identity_file: ~/.ssh/id_rsa
      blocks: auto
    - hostname: 100.83.163.68
      user: sdp
      identity_file: ~/.ssh/id_rsa
      blocks: auto
  proxy_jump:
    hostname: 146.152.186.135
    user: guest
    identity_file: ~/.ssh/intel_id_rsa
```

</div>
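
The `blocks: auto` setting lets `dstack` split each host into per-accelerator blocks so that multiple runs can share
one machine; omit it if each run should occupy a whole host. Before provisioning, you may also want to verify that a
host actually sees its accelerators. A minimal check, assuming the Habana driver stack and its tools are already
installed on the hosts, is to run Habana’s `hl-smi` utility (the Gaudi counterpart of `nvidia-smi`) over SSH:

<div class="termy">

```shell
# Host, user, and jump host are taken from the fleet configuration above
$ ssh -J guest@146.152.186.135 sdp@100.83.163.67 hl-smi
```

</div>

If `hl-smi` is missing or reports no devices, install the Gaudi driver and tooling on the host before adding it to the
fleet.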

To provision the fleet, run the [`dstack apply`](../../docs/reference/cli/dstack/apply.md) command:

<div class="termy">

```shell
$ dstack apply -f examples/misc/fleets/gaudi.dstack.yml

Provisioning...
---> 100%

 FLEET            INSTANCE  BACKEND  GPU                        STATUS  CREATED
 my-gaudi2-fleet  0         ssh      152xCPU, 1007GB, 8xGaudi2  idle    3 mins ago
                                     (96GB), 388.0GB (disk)
                  1         ssh      152xCPU, 1007GB, 8xGaudi2  idle    3 mins ago
                                     (96GB), 388.0GB (disk)
```

</div>
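
Once provisioning completes, you can re-check the fleet and its instances at any time from the CLI:

<div class="termy">

```shell
# List fleets and their instances
$ dstack fleet

# Tear the fleet down when it is no longer needed
$ dstack delete -f examples/misc/fleets/gaudi.dstack.yml
```

</div>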

## Apply a configuration

With your fleet provisioned, you can now run [dev environments](../../docs/concepts/dev-environments.md), [tasks](../../docs/concepts/tasks.md), and [services](../../docs/concepts/services.md).
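
For example, here’s what a minimal dev environment that takes a single Gaudi 2 from the fleet might look like (a
sketch: the file path, name, and IDE choice are placeholders, and the image reuses the Habana PyTorch image from the
task below):

<div editor-title="examples/misc/dev/gaudi.dstack.yml">

```yaml
type: dev-environment
name: gaudi-ide

# Same Habana PyTorch image as in the fine-tuning task below
image: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0
ide: vscode

resources:
  gpu: gaudi2:1
```

</div>

Since the fleet hosts are configured with `blocks: auto`, a one-accelerator dev environment can share a machine with
other runs.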

Below is an example of a task configuration for fine-tuning the [`DeepSeek-R1-Distill-Qwen-7B` :material-arrow-top-right-thin:{ .external }](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B){:target="_blank"}
model using [Optimum for Intel Gaudi :material-arrow-top-right-thin:{ .external }](https://github.com/huggingface/optimum-habana){:target="_blank"}
and [DeepSpeed :material-arrow-top-right-thin:{ .external }](https://docs.habana.ai/en/latest/PyTorch/DeepSpeed/DeepSpeed_User_Guide/DeepSpeed_User_Guide.html#deepspeed-user-guide){:target="_blank"} with
the [`lvwerra/stack-exchange-paired` :material-arrow-top-right-thin:{ .external }](https://huggingface.co/datasets/lvwerra/stack-exchange-paired){:target="_blank"} dataset:

<div editor-title="examples/fine-tuning/trl/intel/.dstack.yml">

```yaml
type: task
name: trl-train

image: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0
env:
  - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
  - WANDB_API_KEY
  - WANDB_PROJECT
commands:
  - pip install --upgrade-strategy eager optimum[habana]
  - pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0
  - git clone https://github.com/huggingface/optimum-habana.git
  - cd optimum-habana/examples/trl
  - pip install -r requirements.txt
  - pip install wandb
  - DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 python ../gaudi_spawn.py --world_size $DSTACK_GPUS_NUM --use_deepspeed sft.py
    --model_name_or_path $MODEL_ID
    --dataset_name "lvwerra/stack-exchange-paired"
    --deepspeed ../language-modeling/llama2_ds_zero3_config.json
    --output_dir="./sft"
    --do_train
    --max_steps=500
    --logging_steps=10
    --save_steps=100
    --per_device_train_batch_size=1
    --per_device_eval_batch_size=1
    --gradient_accumulation_steps=2
    --learning_rate=1e-4
    --lr_scheduler_type="cosine"
    --warmup_steps=100
    --weight_decay=0.05
    --optim="paged_adamw_32bit"
    --lora_target_modules "q_proj" "v_proj"
    --bf16
    --remove_unused_columns=False
    --run_name="sft_deepseek_70"
    --report_to="wandb"
    --use_habana
    --use_lazy_mode

resources:
  gpu: gaudi2:8
```

</div>

Submit the task using the [`dstack apply`](../../docs/reference/cli/dstack/apply.md) command:

<div class="termy">

```shell
$ dstack apply -f examples/fine-tuning/trl/intel/.dstack.yml -R
```

</div>

`dstack` will automatically create containers according to the run configuration and execute them across the fleet.
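
While the run is active, you can inspect and follow it from the CLI, using the run name from the configuration above:

<div class="termy">

```shell
# Show runs and their status
$ dstack ps

# Stream the logs of the fine-tuning run
$ dstack logs trl-train
```

</div>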

!!! info "Examples"
    Explore our [examples](../../examples/accelerators/intel/index.md) to learn how to train and deploy large models on Intel Gaudi.

!!! info "What's next?"
    1. Refer to [Quickstart](../../docs/quickstart.md)
    2. Check [dev environments](../../docs/concepts/dev-environments.md), [tasks](../../docs/concepts/tasks.md), [services](../../docs/concepts/services.md), and [fleets](../../docs/concepts/fleets.md)
    3. Join [Discord :material-arrow-top-right-thin:{ .external }](https://discord.gg/u8SmfwPpMd){:target="_blank"}

docs/examples.md

Lines changed: 1 addition & 1 deletion
@@ -90,7 +90,7 @@ hide:
</h3>

<p>
-   Deploy and fine-tune LLMs on AMD
+   Deploy and fine-tune LLMs on Intel Gaudi
</p>
</a>
