Note that training on CPU will be significantly slower than training on a GPU. The CPU configuration uses:

1. A smaller model (`Phi-3.5-mini-instruct`), which is more CPU-friendly
2. Reduced batch size and increased gradient accumulation steps
3. Fewer total training steps (50 instead of 300)
4. Half-precision (float16) where possible to reduce memory usage
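
As a rough sketch, these settings might look like the following with the Hugging Face `transformers` API; the concrete values (batch size 1, 8 accumulation steps, the output directory) are illustrative assumptions, not this project's exact config:

```python
# Illustrative sketch only: the values mirror the list above, not the shipped config.
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

# 1. + 4. A smaller model, loaded in float16 to reduce memory usage
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-mini-instruct",
    torch_dtype=torch.float16,
)

# 2. + 3. Small batches offset by accumulation, and a short run
args = TrainingArguments(
    output_dir="cpu-finetune",       # assumed path
    per_device_train_batch_size=1,   # assumed value
    gradient_accumulation_steps=8,   # assumed value
    max_steps=50,                    # instead of 300 on GPU
    use_cpu=True,                    # requires a recent transformers release
)
```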
For best results, we recommend:

- Using a machine with at least 16GB of RAM
- Being patient! LLM training on CPU is much slower than on GPU
- If you still encounter memory issues, try reducing the `max_train_samples` parameter even further in the config file, as sketched below
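
For illustration, lowering `max_train_samples` has roughly the effect of capping the training set by hand; the dataset file and cap below are placeholders, not this project's actual data-loading code:

```python
# Hypothetical illustration: keep at most max_train_samples training examples.
from datasets import load_dataset

max_train_samples = 100  # lower this further if memory is still tight

dataset = load_dataset("json", data_files="train.json", split="train")  # placeholder
dataset = dataset.select(range(min(max_train_samples, len(dataset))))
```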
### Known Issues and Workarounds
Some large language models like Phi-3.5 have caching mechanisms that are optimized for GPU usage and may encounter issues when running on CPU. Our CPU configuration includes several workarounds:

1. Disabling KV caching for model generation
2. Using the `torch.float16` data type to reduce memory usage
3. Disabling flash attention, which isn't needed on CPU
4. Using the standard AdamW optimizer instead of 8-bit optimizers that require a GPU
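
A minimal sketch of what these workarounds can look like with the `transformers` API; the model id and learning rate are assumptions, and the real config may wire them up differently:

```python
# Sketch of the CPU workarounds listed above (assumed model id and hyperparameters).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-mini-instruct",
    torch_dtype=torch.float16,        # 2. float16 to reduce memory usage
    attn_implementation="eager",      # 3. plain attention; flash attention isn't needed on CPU
)
model.config.use_cache = False        # 1. disable KV caching for generation

# 4. Standard AdamW instead of a GPU-only 8-bit optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
```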