
Commit c3cf73a

Merge remote-tracking branch 'upstream/main'

2 parents: 46a70a9 + 92f2668

File tree: 4 files changed (+18 -20 lines)

* examples/kfto-sft-llm/README.md
* examples/kfto-sft-llm/docs/run01.png
* examples/kfto-sft-llm/docs/run03.png
* examples/kfto-sft-llm/sft.ipynb

examples/kfto-sft-llm/README.md

Lines changed: 16 additions & 18 deletions
@@ -52,7 +52,7 @@ This example has been validated with the following configurations:
 ### Llama 3.1 8B Instruct - GSM8k Dataset - LoRA - 8x NVIDIA A100/80G
 
 * Infrastructure:
-* OpenShift AI 2.17
+* OpenShift AI 2.19
 * 8x NVIDIA-A100-SXM4-80GB
 * Configuration:
 ```yaml
@@ -61,7 +61,7 @@ This example has been validated with the following configurations:
 model_revision: main
 torch_dtype: bfloat16
 attn_implementation: flash_attention_2
-use_liger: false
+use_liger_kernel: true
 
 # PEFT / LoRA
 use_peft: true
@@ -79,12 +79,13 @@ This example has been validated with the following configurations:
 dataset_config: main
 
 # SFT
-max_seq_length: 1024
+max_length: 4096
 packing: false
+padding_free: true
 
 # Training
-per_device_train_batch_size: 64
-per_device_eval_batch_size: 64
+per_device_train_batch_size: 128
+per_device_eval_batch_size: 128
 
 bf16: true
 tf32: false
@@ -108,8 +109,8 @@ This example has been validated with the following configurations:
 resources_per_worker:
   "nvidia.com/gpu": 1
   "memory": 96Gi
-  "cpu": 4
-base_image: quay.io/modh/training:py311-cuda121-torch241
+  "cpu": 8
+base_image: quay.io/modh/training:py311-cuda124-torch251
 env_vars:
   "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True"
   "NCCL_DEBUG": "INFO"
@@ -189,7 +190,7 @@ This example has been validated with the following configurations:
 ### Llama 3.1 8B Instruct - GSM8k Dataset - LoRA - 8x AMD Instinct MI300X
 
 * Infrastructure:
-* OpenShift AI 2.17
+* OpenShift AI 2.19
 * 8x AMD Instinct MI300X
 * Configuration:
 ```yaml
@@ -198,15 +199,14 @@ This example has been validated with the following configurations:
 model_revision: main
 torch_dtype: bfloat16
 attn_implementation: flash_attention_2
-use_liger: true
+use_liger_kernel: true
 
 # PEFT / LoRA
 use_peft: true
 lora_r: 16
 lora_alpha: 8
 lora_dropout: 0.05
 lora_target_modules: ["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
-lora_modules_to_save: []
 
 # QLoRA (BitsAndBytes)
 load_in_4bit: false
@@ -217,12 +217,13 @@ This example has been validated with the following configurations:
 dataset_config: main
 
 # SFT
-max_seq_length: 4096
+max_length: 8192
 packing: false
+padding_free: true
 
 # Training
-per_device_train_batch_size: 128
-per_device_eval_batch_size: 128
+per_device_train_batch_size: 512
+per_device_eval_batch_size: 512
 
 bf16: true
 tf32: false
@@ -245,18 +246,15 @@ This example has been validated with the following configurations:
 num_procs_per_worker: 1
 resources_per_worker:
   "amd.com/gpu": 1
-  "memory": 96Gi
+  "memory": 128Gi
   "cpu": 4
-base_image: quay.io/modh/training:py311-rocm62-torch241
+base_image: quay.io/modh/training:py311-rocm62-torch251
 env_vars:
   "PYTORCH_HIP_ALLOC_CONF": "expandable_segments:True"
   "NCCL_DEBUG": "INFO"
 ```
 * Metrics:
 ![](./docs/run03.png)
-Blue: with Liger kernels
-
-Orange: without Liger kernels
 
 ### Llama 3.3 70B Instruct - GSM8k Dataset - LoRA - 8x NVIDIA A100/80G

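The key renames in the YAML above (`use_liger` -> `use_liger_kernel`, `max_seq_length` -> `max_length`, plus the new `padding_free` flag) correspond to fields on `trl.SFTConfig`, which subclasses `transformers.TrainingArguments`. Below is a minimal, illustrative sketch of how the updated NVIDIA configuration maps onto that class; how the example's training function actually parses the YAML is outside this diff, and `output_dir` is a placeholder.

```python
# Minimal sketch (not the example's actual training code): the renamed SFT keys
# from the diff, passed as trl.SFTConfig arguments. SFTConfig subclasses
# transformers.TrainingArguments, so use_liger_kernel and the batch-size /
# precision flags below live on the same object.
from trl import SFTConfig

training_args = SFTConfig(
    output_dir="/mnt/shared/sft-output",  # placeholder, not taken from the diff
    # SFT: max_seq_length was renamed to max_length; padding_free is new
    max_length=4096,
    packing=False,
    padding_free=True,
    # Training
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    bf16=True,
    tf32=False,
    # Renamed from use_liger
    use_liger_kernel=True,
)
```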

examples/kfto-sft-llm/docs/run01.png (binary, 69.8 KB)

examples/kfto-sft-llm/docs/run03.png (binary, 114 KB)

examples/kfto-sft-llm/sft.ipynb

Lines changed: 2 additions & 2 deletions
@@ -334,7 +334,7 @@
 "* Amend the resources per worker according to the job requirements\n",
 "* If you use AMD accelerators:\n",
 " * Change `nvidia.com/gpu` to `amd.com/gpu` in `resources_per_worker`\n",
-" * Change `base_image` to `quay.io/modh/training:py311-rocm62-torch241`\n",
+" * Change `base_image` to `quay.io/modh/training:py311-rocm62-torch251`\n",
 "* Update the PVC name to the one you've attached to the workbench if needed"
 ]
 },
@@ -356,7 +356,7 @@
 " \"memory\": \"64Gi\",\n",
 " \"cpu\": 4,\n",
 " },\n",
-" base_image=\"quay.io/modh/training:py311-cuda121-torch241\",\n",
+" base_image=\"quay.io/modh/training:py311-cuda124-torch251\",\n",
 " env_vars={\n",
 " # HuggingFace\n",
 " \"HF_HOME\": \"/mnt/shared/.cache\",\n",

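The notebook hunks apply the same base-image bump on the client side, where the job is submitted from the workbench. A hedged sketch of the surrounding call is below, assuming the Kubeflow Training SDK's `TrainingClient.create_job()` API; only the `resources_per_worker` values, `base_image`, and the start of `env_vars` are visible in this diff, so the job name, worker count, and training function are placeholders.

```python
from kubeflow.training import TrainingClient


def main():
    # Placeholder for the notebook's training function (not shown in this diff).
    ...


client = TrainingClient()
client.create_job(
    job_kind="PyTorchJob",
    name="sft",                      # placeholder job name
    train_func=main,
    num_workers=8,                   # assumption, matching the README's 8-GPU runs
    num_procs_per_worker=1,
    resources_per_worker={
        "nvidia.com/gpu": 1,
        "memory": "64Gi",
        "cpu": 4,
    },
    # Updated base image: CUDA 12.4 / PyTorch 2.5.1 (previously cuda121-torch241)
    base_image="quay.io/modh/training:py311-cuda124-torch251",
    env_vars={
        # HuggingFace cache on the shared volume
        "HF_HOME": "/mnt/shared/.cache",
    },
)
```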