-
Here's an example with DeepSpeed: https://wandb.ai/axolotl-ai/fft-70b-w-liger/runs/wtexza0s and the YAML to reproduce it: https://wandb.ai/axolotl-ai/fft-70b-w-liger/runs/wtexza0s/files/tmp/axolotl_config_70oa5qvt.yml
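For anyone who can't open the links, the DeepSpeed side of an Axolotl YAML is normally just a single deepspeed key pointing at one of the JSON presets shipped in the repo's deepspeed_configs folder. The line below is only a hedged sketch of that wiring, not the contents of the linked config, and the particular preset chosen here is an assumption:
deepspeed: deepspeed_configs/zero3_bf16.json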
-
@winglian Also, I would like to make a donation for your work. However, my circumstances do not allow me to use Visa or MasterCard. Do you have a crypto wallet?
-
@winglian I'm also very interested in this configuration and can't access it via the links. Any chance you could add it to examples?
-
@sklyar61 @tcleberg Sorry about that! The wandb project is public now!
-
@winglian I'm trying and failing to reproduce that run's success. Could you share the commands you used to run it and the image version?
-
Greetings, Axolotl community members!
I recently attempted a full fine-tune of the llama-3-70B model using the docker image winglian/axolotl-cloud on runpod.io's cloud service. The experience left me wondering: is it even feasible to run a full fine-tune of this model on a single machine with 640 GB, or even 754 GB, of GPU memory? Unfortunately, I was unable to get training started :(
Initially, I worked out the FSDP settings on a machine equipped with 4×A100 while fine-tuning the llama-3-8B model; that training started and finished successfully:
fsdp:
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: true
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
According to the llama repository, full fine-tuning requires about 500 GB of memory, so I opted for a machine with 640 GB and tried to start a full fine-tune of the 70B model on 8×A100. I ran into an OOM error: the system reported a shortage of 1.6 GB of memory on the card. I then tried training on 8×H100 NVL with 754 GB of total GPU memory. That attempt also ended in an error:
File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/launcher/a
pi.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
axolotl.cli.train FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-09-05_11:17:50
host : d5a636c8b4c6
rank : 7 (local_rank: 7)
exitcode : 1 (pid: 17103)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
In this case, the logs indicated a shortage of less than 1 GB of memory.
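For a rough sense of scale (my own back-of-envelope numbers, not taken from the runs above): 70B parameters in bf16 take about 140 GB for weights and roughly another 140 GB for gradients, and AdamW keeps two moment tensors per parameter, which adds around 280 GB if the moments match the bf16 dtype or about 560 GB if they end up in fp32. That is already 560-840 GB before activations and buffers; even fully sharded across 8 GPUs it works out to roughly 70-105 GB per card, which helps explain why 640 GB total can still come up a couple of GB short once activations and framework overhead are added, unless more state is sharded or offloaded to CPU.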
Here are the relevant settings:
base_model: failspy/Meta-Llama-3-70B-Instruct-abliterated-v3.5
sequence_len: 2048
sample_packing: false
pad_to_sequence_len: true
gradient_accumulation_steps: 16
micro_batch_size: 1
num_epochs: 2
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 1e-5
bf16: auto
fp16:
tf32: false
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
flash_attention: true
fsdp:
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: true
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
I have also tried DeepSpeed with the zero2.json and zero3.json configs.
I also tried changing the precision to fp16. Moreover, I launched training with the command PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True accelerate launch --mixed-precision bf16 -m axolotl.cli.train config.yml.
I also attempted to switch the CUDA version to 12.4 and PyTorch to 2.4.0, but to no avail.
I also tried setting the variables:
export CUDA_HOME=/usr/local/cuda-12.1
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
export NCCL_P2P_DISABLE=1
export NCCL_TIMEOUT=600
This also did not help.
Which brings me to my question: has anyone managed to run a full fine-tune of the 70B model on a single machine, or is this mission impossible just for me?