# From Fine-Tuning to Serving LLMs with OCI and dstack

dstack is an open-source tool that simplifies AI container orchestration and makes distributed training and deployment of LLMs more accessible. Combining dstack and OCI unlocks a streamlined process for setting up cloud infrastructure for distributed training and scalable model deployment.

This article walks you through fine-tuning a model with dstack on OCI, incorporating best practices from the Hugging Face Alignment Handbook, and then deploying the model with Hugging Face's Text Generation Inference (TGI).

**NOTE**: The experiment described in this article used an OCI cluster of three nodes, each with 2 x A10 GPUs, to fine-tune the Gemma 7B model.

## How dstack works

dstack offers a unified interface for developing, training, and deploying AI models across any cloud or data center. For example, you can specify a configuration for a training task or a model to be deployed, and dstack takes care of provisioning the required infrastructure and orchestrating the containers. Another advantage is that dstack lets you use any hardware, frameworks, and scripts.
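To make this concrete, here is a minimal, purely illustrative task configuration; the script name and GPU requirement are placeholders, and the full, real configurations used in this article appear in the sections below:

```
type: task
python: "3.11"
commands:
  - pip install -r requirements.txt
  - python train.py
resources:
  gpu: 24GB
```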

## Setting up dstack with OCI

Using dstack with OCI takes four simple steps. First, install the dstack Python package. Since dstack supports multiple cloud providers, we can narrow the scope to OCI:

```
pip install dstack[oci]
```

Next, configure OCI-specific credentials inside `~/.dstack/server/config.yml`. The example below assumes you already have credentials for the OCI CLI configured. For other configuration options, follow dstack's official documentation.

```
projects:
- name: main
  backends:
  - type: oci
    creds:
      type: default
```

The third step is to start the dstack server, as shown below.

```
dstack server
INFO Applying ~/.dstack/server/config.yml...
INFO Configured the main project in ~/.dstack/config.yml
INFO The admin token is ab6e8759-9cd9-4e84-8d47-5b94ac877ebf
INFO The dstack server 0.18.4 is running at http://127.0.0.1:3000
```

Finally, switch to the folder with your project scripts and initialize dstack.

```
dstack init
```

## Fine-tuning on OCI with dstack

To fine-tune the Gemma 7B model, we'll use the Hugging Face Alignment Handbook to incorporate established fine-tuning best practices. The source code for this tutorial is available on GitHub. Let's dive into the practical steps for fine-tuning your LLM.

Once you have switched to the project folder, here's the command to initiate the fine-tuning job on OCI with dstack:

```
ACCEL_CONFIG_PATH=fsdp_qlora_full_shard.yaml \
  FT_MODEL_CONFIG_PATH=qlora_finetune_config.yaml \
  HUGGING_FACE_HUB_TOKEN=xxxx \
  WANDB_API_KEY=xxxx \
  dstack run . -f ft.task.dstack.yml
```

The `FT_MODEL_CONFIG_PATH`, `ACCEL_CONFIG_PATH`, `HUGGING_FACE_HUB_TOKEN`, and `WANDB_API_KEY` environment variables are defined inside the `ft.task.dstack.yml` task configuration, and `dstack run` submits the task defined in `ft.task.dstack.yml` on OCI.

**NOTE**: dstack automatically copies the current directory's contents when executing the task.

Let's explore the key parts of each YAML file (for the full contents, check the repository).

The `qlora_finetune_config.yaml` file is the recipe configuration that tells the Alignment Handbook how you want to fine-tune the LLM:

```
# Model arguments
model_name_or_path: google/gemma-7b
tokenizer_name_or_path: philschmid/gemma-tokenizer-chatml
torch_dtype: bfloat16
bnb_4bit_quant_storage: bfloat16

# LoRA arguments
load_in_4bit: true
use_peft: true
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
# ...

# Data training arguments
dataset_mixer:
  chansung/mental_health_counseling_conversations: 1.0
dataset_splits:
  - train
  - test
# ...
```

* **Model arguments**

  * `model_name_or_path`: Google's Gemma 7B is chosen as the base model.
  * `tokenizer_name_or_path`: the Alignment Handbook applies the chosen tokenizer's `apply_chat_template()` method. This tutorial uses the ChatML template instead of Gemma 7B's standard conversation template.
  * `torch_dtype` and `bnb_4bit_quant_storage`: these two values should be set to the same dtype to leverage the FSDP+QLoRA fine-tuning method. Since Gemma 7B does not comfortably fit on a single A10 GPU, this post uses FSDP+QLoRA to shard the model across 2 x A10 GPUs while applying the QLoRA technique.
* **LoRA arguments**: LoRA-specific configuration. Since this post leverages the FSDP+QLoRA technique, `load_in_4bit` is set to `true`. The other values can vary from experiment to experiment.
* **Data training arguments**: we prepared a dataset based on Amod's mental health counseling conversations dataset. Because the Alignment Handbook only understands data in the form `[{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}, …]`, which the tokenizer's `apply_chat_template()` method can interpret, the prepared dataset is essentially the original dataset converted into that `apply_chat_template()`-compatible format (see the sketch after this list).
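The data-conversion point above can be made concrete with a small sketch. The snippet below is illustrative only: it assumes the original dataset exposes `Context` and `Response` columns, and it is not the exact script used to build the prepared dataset.

```
# Illustrative sketch: reshape each counseling example into the "messages"
# format (a list of role/content dicts) that apply_chat_template() understands.
# The "Context"/"Response" column names are assumptions about the source dataset.
from datasets import load_dataset

def to_messages(example):
    return {
        "messages": [
            {"role": "user", "content": example["Context"]},
            {"role": "assistant", "content": example["Response"]},
        ]
    }

raw = load_dataset("Amod/mental_health_counseling_conversations", split="train")
chat_dataset = raw.map(to_messages, remove_columns=raw.column_names)
```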

The `fsdp_qlora_full_shard.yaml` file configures Accelerate on how to use the underlying infrastructure for fine-tuning the LLM:

```
compute_environment: LOCAL_MACHINE
distributed_type: FSDP # Use Fully Sharded Data Parallelism
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_use_orig_params: false
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  # ... (other FSDP configurations)
# ... (other configurations)
```

* `distributed_type`: `FSDP` indicates the use of Fully Sharded Data Parallel (FSDP), a technique that enables training large models that would otherwise not fit on a single GPU.
* `fsdp_config`: These settings control how FSDP operates, such as how the model is sharded (`fsdp_sharding_strategy`) and whether parameters are offloaded to the CPU (`fsdp_offload_params`).

With `distributed_type` set to `FSDP` and `fsdp_sharding_strategy` set to `FULL_SHARD`, the model's parameters, gradients, and optimizer states are sharded across all participating GPUs, including GPUs on different nodes, so each rank holds only a slice of the model while processing its own batches of the dataset. If you instead want each node to keep its own copy of the model, sharded only across the GPUs within that node and replicated across nodes, set `fsdp_sharding_strategy` to `HYBRID_SHARD`.
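As a small illustrative sketch of that change (only the relevant key is shown; the remaining options stay as in `fsdp_qlora_full_shard.yaml`):

```
fsdp_config:
  # Shard within each node and replicate the model across nodes.
  fsdp_sharding_strategy: HYBRID_SHARD
  # ... (other FSDP configurations unchanged)
```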

Additional parameters like `machine_rank`, `num_machines`, and `num_processes` are important for coordinating the nodes. However, it's recommended to set these values dynamically at runtime, as this provides flexibility when switching between different infrastructure setups.
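For reference, without an orchestrator you would pass these values by hand on every node. The sketch below is hypothetical and assumes this article's setup of three nodes with two GPUs each; `10.0.0.1` is a placeholder for the rank-0 node's private IP, and `--machine_rank` would be changed to 1 and 2 on the other nodes:

```
# Hypothetical manual multi-node launch (run a variant of this on each node).
ACCELERATE_LOG_LEVEL=info accelerate launch \
  --config_file fsdp_qlora_full_shard.yaml \
  --main_process_ip=10.0.0.1 \
  --main_process_port=8008 \
  --machine_rank=0 \
  --num_machines=3 \
  --num_processes=6 \
  scripts/run_sft.py qlora_finetune_config.yaml
```

As the next section shows, dstack injects these values automatically.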

## The power of dstack: simplified configuration

Finally, let's explore the `ft.task.dstack.yml` configuration that puts everything together and instructs dstack on how to provision the infrastructure and run the task:

```
type: task
nodes: 3

python: "3.11"
env:
  - ACCEL_CONFIG_PATH
  - FT_MODEL_CONFIG_PATH
  - HUGGING_FACE_HUB_TOKEN
  - WANDB_API_KEY
commands:
  # ... (setup steps, cloning repo, installing requirements)
  - ACCELERATE_LOG_LEVEL=info accelerate launch \
    --config_file recipes/custom/accel_config.yaml \
    --main_process_ip=$DSTACK_MASTER_NODE_IP \
    --main_process_port=8008 \
    --machine_rank=$DSTACK_NODE_RANK \
    --num_processes=$DSTACK_GPUS_NUM \
    --num_machines=$DSTACK_NODES_NUM \
    scripts/run_sft.py recipes/custom/config.yaml
ports:
  - 6006
resources:
  gpu: 1..2
  shm_size: 24GB
```

**Key points to highlight**:
* **Seamless Integration**: dstack integrates smoothly with Hugging Face's open-source ecosystem. In particular, you can use the Accelerate library with the configuration we defined in `fsdp_qlora_full_shard.yaml`, just as you would locally.
* **Automatic Configuration**: the `DSTACK_MASTER_NODE_IP`, `DSTACK_NODE_RANK`, `DSTACK_GPUS_NUM`, and `DSTACK_NODES_NUM` variables are managed automatically by dstack, reducing manual setup.
* **Resource Allocation**: dstack makes it easy to specify the number of nodes and GPUs (`gpu: 1..2`) for your fine-tuning job. For this post, the run uses three nodes, each equipped with 2 x A10 (24 GB) GPUs.

## Serving your fine-tuned model with dstack

Once your model is fine-tuned, dstack makes it a breeze to deploy it on OCI using Hugging Face's Text Generation Inference (TGI) framework.

Here's an example of how you can define a service in dstack:

```
type: service
image: ghcr.io/huggingface/text-generation-inference:latest
env:
  - HUGGING_FACE_HUB_TOKEN
  - MODEL_ID=chansung/mental_health_counseling_merged_v0.1
commands:
  - text-generation-launcher \
    --max-input-tokens 512 --max-total-tokens 1024 \
    --max-batch-prefill-tokens 512 --port 8000
port: 8000

resources:
  gpu:
    memory: 48GB

# (Optional) Enable the OpenAI-compatible endpoint
model:
  format: tgi
  type: chat
  name: chansung/mental_health_counseling_merged_v0.1
```
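Assuming the service definition above is saved as `serve.dstack.yml` (a hypothetical filename) and a dstack gateway has been set up, it can be deployed with the same workflow used for the fine-tuning task:

```
dstack run . -f serve.dstack.yml
```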

**Key advantages of this approach**:
* **Secure HTTPS Gateway**: dstack simplifies setting up a secure HTTPS connection through a gateway, a crucial aspect of production-level model serving.
* **Optimized for Inference**: The TGI framework is designed for efficient text-generation inference, ensuring your model delivers responsive and reliable results.
* **Auto-scaling**: dstack lets you specify an auto-scaling policy, including the minimum and maximum number of model replicas.

At this point, you can interact with the service using a standard `curl` command, or from Python with the `requests` library, the OpenAI SDK, or Hugging Face's `InferenceClient`. For instance, the snippet below shows an example using `curl`.

```
curl -X POST https://black-octopus-1.mycustomdomain.com/generate \
  -H "Authorization: Bearer <dstack-token>" \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "I feel bad...", "parameters": {"max_new_tokens": 128}}'
```
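Because the service enables the OpenAI-compatible endpoint, you can also query it with the OpenAI Python SDK. The sketch below is illustrative: the base URL and token are placeholders for your own gateway domain and dstack token, and the exact endpoint path depends on your gateway configuration.

```
from openai import OpenAI

# Placeholders: substitute your own gateway URL and dstack token.
client = OpenAI(
    base_url="https://gateway.mycustomdomain.com",
    api_key="<dstack-token>",
)

response = client.chat.completions.create(
    model="chansung/mental_health_counseling_merged_v0.1",
    messages=[{"role": "user", "content": "I feel bad..."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```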

Additionally, dstack automatically provides a user interface for interacting directly with the deployed model:

<p align="center">
  <img src="https://github.com/oracle-devrel/technology-engineering/blob/dstack-tutorial/cloud-infrastructure/ai-infra-gpu/ai-infrastructure/dstack/assets/images/image1.png" width="600">
</p>

## Conclusion

By following the steps outlined in this article, you've unlocked a powerful approach to fine-tuning and deploying LLMs using the combined capabilities of dstack, OCI, and the Hugging Face ecosystem. You can now leverage dstack's user-friendly interface to manage your OCI resources effectively, streamlining the process of setting up distributed training environments for your LLM projects.

Furthermore, the integration with Hugging Face's Alignment Handbook and the TGI framework lets you fine-tune and serve your models seamlessly, ensuring they're optimized for performance and ready for real-world applications. We encourage you to explore the possibilities further and experiment with different models and configurations to achieve your desired outcomes.

**About the author**: Chansung Park is a Hugging Face fellow and an AI researcher working on LLMs.