Replies: 1 comment 1 reply
-
Hey, this is a weird problem. If anything, people usually go OOM 😄
-
Here is a description of my issue; I wonder if anyone has seen anything similar. On my last training run, using a 0411 py3.11 build, it used 136 GB on each GPU. I have tried multiple servers on Vast.ai with the same result. Thanks ahead of time if you can point me in the right direction.
Here are snippets of the logs (the configs follow below):
```
[2025-05-17 18:15:03,315] [INFO] [root.spawn:77] [PID:387141] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -c /tmp/tmp_yct_3l1/test.c -o /tmp/tmp_yct_3l1/test.o
[2025-05-17 18:15:03,333] [INFO] [root.spawn:77] [PID:387141] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat /tmp/tmp_yct_3l1/test.o -laio -o /tmp/tmp_yct_3l1/a.out
[2025-05-17 18:15:03,358] [INFO] [root.spawn:77] [PID:387141] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -c /tmp/tmpkolcnp42/test.c -o /tmp/tmpkolcnp42/test.o
[2025-05-17 18:15:03,375] [INFO] [root.spawn:77] [PID:387141] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat /tmp/tmpkolcnp42/test.o -L/usr/local/cuda -L/usr/local/cuda/lib64 -lcufile -o /tmp/tmpkolcnp42/a.out
[2025-05-17 18:15:04,130] [INFO] [datasets.:54] [PID:387141] PyTorch version 2.6.0+cu124 available.
Using the WANDB_DISABLED environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
[... same WANDB_DISABLED deprecation warning repeated; duplicates trimmed ...]
INFO 05-17 18:15:05 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-17 18:15:05 [__init__.py:239] Automatically detected platform cuda.
[... same WANDB_DISABLED deprecation warning repeated; duplicates trimmed ...]
[2025-05-17 18:15:18,459] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-17 18:15:18,521] [INFO] [root.spawn:77] [PID:387065] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -c /tmp/tmpbq1bi_7t/test.c -o /tmp/tmpbq1bi_7t/test.o
[2025-05-17 18:15:18,538] [INFO] [root.spawn:77] [PID:387065] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat /tmp/tmpbq1bi_7t/test.o -laio -o /tmp/tmpbq1bi_7t/a.out
[2025-05-17 18:15:18,566] [INFO] [root.spawn:77] [PID:387065] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -c /tmp/tmpgzp1ho_z/test.c -o /tmp/tmpgzp1ho_z/test.o
[2025-05-17 18:15:18,583] [INFO] [root.spawn:77] [PID:387065] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat /tmp/tmpgzp1ho_z/test.o -L/usr/local/cuda -L/usr/local/cuda/lib64 -lcufile -o /tmp/tmpgzp1ho_z/a.out
[2025-05-17 18:15:19,312] [INFO] [datasets.:54] [PID:387065] PyTorch version 2.6.0+cu124 available.
[... same WANDB_DISABLED deprecation warning repeated; duplicates trimmed ...]
INFO 05-17 18:15:20 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-17 18:15:20 [__init__.py:239] Automatically detected platform cuda.
[... same WANDB_DISABLED deprecation warning repeated; duplicates trimmed ...]
[2025-05-17 18:16:23,053] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-17 18:16:23,115] [INFO] [root.spawn:77] [PID:387133] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -c /tmp/tmpz5_fg1q_/test.c -o /tmp/tmpz5_fg1q_/test.o
[2025-05-17 18:16:23,134] [INFO] [root.spawn:77] [PID:387133] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat /tmp/tmpz5_fg1q_/test.o -laio -o /tmp/tmpz5_fg1q_/a.out
[2025-05-17 18:16:23,162] [INFO] [root.spawn:77] [PID:387133] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -c /tmp/tmpth0h5hka/test.c -o /tmp/tmpth0h5hka/test.o
[2025-05-17 18:16:23,179] [INFO] [root.spawn:77] [PID:387133] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat /tmp/tmpth0h5hka/test.o -L/usr/local/cuda -L/usr/local/cuda/lib64 -lcufile -o /tmp/tmpth0h5hka/a.out
[2025-05-17 18:16:23,891] [INFO] [datasets.:54] [PID:387133] PyTorch version 2.6.0+cu124 available.
[... same WANDB_DISABLED deprecation warning repeated; duplicates trimmed ...]
INFO 05-17 18:16:25 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-17 18:16:25 [__init__.py:239] Automatically detected platform cuda.
[... same WANDB_DISABLED deprecation warning repeated; duplicates trimmed ...]
```
axolotl config:
```yaml
#base_model: NousResearch/Meta-Llama-3.1-8B
#base_model: NousResearch/Hermes-3-Llama-3.1-8B
#base_model: meta-llama/Llama-3.1-8B-Instruct
#base_model: meta-llama/Llama-3.2-11B-Vision-Instruct
base_model: unsloth/Phi-4

seed: 1337
#tokenizer_num_proc: 189
#dataset_processes: 189

plugins:
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: true

strict: false

# Save model as safetensors (requires safetensors package)
save_safetensors: true
auto_resume_from_checkpoints: true

# Custom JSON dataset configuration
datasets:
  - type: completion
    data_files:
    ds_type: json
    split: train
    field: text
dataset_prepared_path: /root/app/last_run_prepared
val_set_size: 0
output_dir: /root/app/model  # Output directory for the trained model

# Sequence and batch configuration
sequence_len: 128  # Adjust based on your model's needs
sample_packing: true
pad_to_sequence_len: true

# Optional W&B logging setup
#wandb_project: your_project_name  # Set your Weights & Biases project name
#wandb_entity: your_entity_name
#wandb_watch: false
#wandb_name: your_run_name
#wandb_log_model: true  # Set to true if you want to log the model in W&B

# Training parameters
gradient_accumulation_steps: 1  # From your Python script
micro_batch_size: 1000  # Batch size from your script
num_epochs: 1  # Number of epochs from your script
#optimizer: adamw_torch
optimizer: adopt_adamw
adam_beta1: 0.90
adam_beta2: 0.999
#optimizer_kwargs:
betas: [0.85, 0.999]
lr_scheduler: cosine
learning_rate: 1e-6  # Learning rate from your Python code
max_grad_norm: 0.15

# Mixed precision training
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:  # Enable FP16 based on your system's capabilities
tf32: false
#gpu_memory_limit: 20GiB

# Gradient checkpointing and memory optimization
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false

# Logging and evaluation settings
logging_steps: 1
xformers_attention:
flash_attention: true

# Save and evaluation settings
warmup_steps: 25
#evals_per_epoch: 1
save_strategy: steps
save_steps: 200  # Save every 200 steps
saves_per_epoch:  # Leave this empty since it's mutually exclusive with save_steps
save_total_limit: 2  # Keep a maximum of 2 checkpoints
debug: false
deepspeed: /root/app/zero3.json
weight_decay: 0.01

# Special tokens configuration
special_tokens:
  pad_token: <|dummy_87|>
  eos_token: <|im_end|>
# Add extra tokens.
tokens:
```
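The 136 GB per-GPU figure can be sanity-checked against what ZeRO Stage 3 should need for the model states alone. A rough sketch follows; the ~14.7B parameter count for Phi-4 and the classic 16 bytes/param mixed-precision Adam accounting (2 B half-precision params + 2 B half-precision grads + 12 B fp32 master copy, momentum, and variance, all sharded across GPUs) are assumptions, and activations, packing buffers, and allocator overhead come on top.

```python
def zero3_model_states_per_gpu_gb(n_params: float, n_gpus: int,
                                  bytes_per_param: int = 16) -> float:
    """Estimate per-GPU memory (GiB) for ZeRO-3 model states.

    ZeRO-3 shards parameters, gradients, and optimizer states across
    all ranks, so each GPU holds roughly 1/n_gpus of the total.
    """
    total_bytes = n_params * bytes_per_param
    return total_bytes / n_gpus / 1024**3

# Assumed ~14.7B parameters for Phi-4 (check the model card).
for n in (1, 2, 4, 8):
    print(n, round(zero3_model_states_per_gpu_gb(14.7e9, n), 1))
```

If observed per-GPU usage is far above this estimate, the states are probably not being sharded as expected (e.g. the DeepSpeed config isn't being picked up, or the number is actually host RAM from offload).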
zero3.json
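The contents of zero3.json aren't shown above. For comparison only, a typical DeepSpeed ZeRO Stage-3 config (modeled on the stock examples shipped with axolotl, not the poster's actual file) looks roughly like this:

```json
{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": { "enabled": "auto" },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```

Note that a zero3.json containing `offload_param` or `offload_optimizer` sections would move state to host RAM, which is one way a "136 GB" number could appear without an OOM.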