Commit 6122471

add images
1 parent 06b9b6a commit 6122471

File tree

3 files changed: 4 additions, 0 deletions


cloud-infrastructure/ai-infra-gpu/ai-infrastructure/dstack/README.md

Lines changed: 4 additions & 0 deletions
@@ -121,6 +121,8 @@ fsdp_config:
 * `distributed_type`: `FSDP` indicates the use of Fully Sharded Data Parallel (FSDP), a technique that enables training large models that would otherwise not fit on a single GPU.
 * `fsdp_config`: These settings control how FSDP operates, such as how the model is sharded (`fsdp_sharding_strategy`) and whether parameters are offloaded to CPU (`fsdp_offload_params`).
+
+![Hybrid shards](/assets/images/image2.jpg "Hybrid shards")
 
 With `distributed_type` set to `FSDP` and `fsdp_sharding_strategy` set to `FULL_SHARD`, the model is sharded across all participating GPUs, including across multiple compute nodes. To instead shard the model only within each node, so that every node holds one complete sharded replica processing its own batches of the dataset, set `fsdp_sharding_strategy` to `HYBRID_SHARD`.
 
 Additional parameters such as `machine_rank`, `num_machines`, and `num_processes` are important for coordination. It is recommended to set these values dynamically at runtime, since this provides flexibility when switching between different infrastructure setups.
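
The options above correspond to keys in Hugging Face `accelerate`'s YAML config. A minimal sketch of the relevant fragment is shown below; the values are illustrative and not taken from the diff itself:

```yaml
# Illustrative accelerate config fragment.
distributed_type: FSDP
fsdp_config:
  # FULL_SHARD shards the model across all GPUs in the job;
  # HYBRID_SHARD shards within each node and replicates across nodes.
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_offload_params: false
```

`machine_rank`, `num_machines`, and `num_processes` can then be supplied at launch time via `accelerate launch --machine_rank … --num_machines … --num_processes …` rather than hard-coded in the file.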
@@ -206,6 +208,8 @@ curl -X POST https://black-octopus-1.mycustomdomain.com/generate \
 
 Additionally, for a deployed model, dstack automatically provides a user interface for interacting with the model directly:
 
+![User interface](/assets/images/image1.jpg "User interface")
+
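
The `/generate` endpoint referenced in the hunk header above can also be called programmatically. A minimal sketch in Python, assuming a TGI-style request body; the payload shape and prompt are assumptions, not confirmed by the diff:

```python
import json

def build_generate_request(prompt: str, max_new_tokens: int = 64) -> str:
    """Build a JSON body for a TGI-style /generate endpoint.

    The payload shape ({"inputs": ..., "parameters": ...}) is an
    assumption based on Text Generation Inference's API, not taken
    from the diff itself.
    """
    return json.dumps({
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    })

# POST this body (e.g. with requests.post(url, data=body,
# headers={"Content-Type": "application/json"})) to the deployed endpoint.
body = build_generate_request("What is FSDP?")
print(body)
```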
## Conclusion
210214

211215
By following the steps outlined in this article, you've unlocked a powerful approach to fine-tuning and deploying LLMs using the combined capabilities of dstack, OCI, and Hugging Face's ecosystem. You can now leverage dstack's user-friendly interface to manage your OCI resources effectively, streamlining the process of setting up distributed training environments for your LLM projects.
Binary files: two images added (220 KB and 171 KB).