Replies: 1 comment 1 reply
-
Hey, this is a weird problem. If anything, people usually go OOM 😄
-
Here is a description of my issue; I wonder if anyone has seen anything similar. On my last training run, using a 0411 py3.11 build, it used 136 GB on each GPU. I have tried multiple servers on Vast.ai with the same result. Thanks ahead of time if you can point me in the right direction.
Here are snippets of the logs (the configs follow below):
```
[2025-05-17 18:15:03,315] [INFO] [root.spawn:77] [PID:387141] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -c /tmp/tmp_yct_3l1/test.c -o /tmp/tmp_yct_3l1/test.o
[2025-05-17 18:15:03,333] [INFO] [root.spawn:77] [PID:387141] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat /tmp/tmp_yct_3l1/test.o -laio -o /tmp/tmp_yct_3l1/a.out
[2025-05-17 18:15:03,358] [INFO] [root.spawn:77] [PID:387141] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -c /tmp/tmpkolcnp42/test.c -o /tmp/tmpkolcnp42/test.o
[2025-05-17 18:15:03,375] [INFO] [root.spawn:77] [PID:387141] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat /tmp/tmpkolcnp42/test.o -L/usr/local/cuda -L/usr/local/cuda/lib64 -lcufile -o /tmp/tmpkolcnp42/a.out
[2025-05-17 18:15:04,130] [INFO] [datasets.:54] [PID:387141] PyTorch version 2.6.0+cu124 available.
Using the WANDB_DISABLED environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
[... same WANDB_DISABLED deprecation warning repeated; duplicates trimmed ...]
INFO 05-17 18:15:05 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-17 18:15:05 [__init__.py:239] Automatically detected platform cuda.
[... same WANDB_DISABLED deprecation warning repeated; duplicates trimmed ...]
[2025-05-17 18:15:18,459] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-17 18:15:18,521] [INFO] [root.spawn:77] [PID:387065] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -c /tmp/tmpbq1bi_7t/test.c -o /tmp/tmpbq1bi_7t/test.o
[2025-05-17 18:15:18,538] [INFO] [root.spawn:77] [PID:387065] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat /tmp/tmpbq1bi_7t/test.o -laio -o /tmp/tmpbq1bi_7t/a.out
[2025-05-17 18:15:18,566] [INFO] [root.spawn:77] [PID:387065] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -c /tmp/tmpgzp1ho_z/test.c -o /tmp/tmpgzp1ho_z/test.o
[2025-05-17 18:15:18,583] [INFO] [root.spawn:77] [PID:387065] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat /tmp/tmpgzp1ho_z/test.o -L/usr/local/cuda -L/usr/local/cuda/lib64 -lcufile -o /tmp/tmpgzp1ho_z/a.out
[2025-05-17 18:15:19,312] [INFO] [datasets.:54] [PID:387065] PyTorch version 2.6.0+cu124 available.
[... same WANDB_DISABLED deprecation warning repeated; duplicates trimmed ...]
INFO 05-17 18:15:20 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-17 18:15:20 [__init__.py:239] Automatically detected platform cuda.
[... same WANDB_DISABLED deprecation warning repeated; duplicates trimmed ...]
[2025-05-17 18:16:23,053] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-17 18:16:23,115] [INFO] [root.spawn:77] [PID:387133] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -c /tmp/tmpz5_fg1q_/test.c -o /tmp/tmpz5_fg1q_/test.o
[2025-05-17 18:16:23,134] [INFO] [root.spawn:77] [PID:387133] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat /tmp/tmpz5_fg1q_/test.o -laio -o /tmp/tmpz5_fg1q_/a.out
[2025-05-17 18:16:23,162] [INFO] [root.spawn:77] [PID:387133] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -c /tmp/tmpth0h5hka/test.c -o /tmp/tmpth0h5hka/test.o
[2025-05-17 18:16:23,179] [INFO] [root.spawn:77] [PID:387133] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat /tmp/tmpth0h5hka/test.o -L/usr/local/cuda -L/usr/local/cuda/lib64 -lcufile -o /tmp/tmpth0h5hka/a.out
[2025-05-17 18:16:23,891] [INFO] [datasets.:54] [PID:387133] PyTorch version 2.6.0+cu124 available.
[... same WANDB_DISABLED deprecation warning repeated; duplicates trimmed ...]
INFO 05-17 18:16:25 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-17 18:16:25 [__init__.py:239] Automatically detected platform cuda.
[... same WANDB_DISABLED deprecation warning repeated; duplicates trimmed ...]
```
axolotl config:
```yaml
#base_model: NousResearch/Meta-Llama-3.1-8B
#base_model: NousResearch/Hermes-3-Llama-3.1-8B
#base_model: meta-llama/Llama-3.1-8B-Instruct
#base_model: meta-llama/Llama-3.2-11B-Vision-Instruct
base_model: unsloth/Phi-4

seed: 1337
#tokenizer_num_proc: 189
#dataset_processes: 189

plugins:
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: true

strict: false

# Save model as safetensors (requires safetensors package)
save_safetensors: true
auto_resume_from_checkpoints: true

# Custom JSON dataset configuration
datasets:
  - type: completion
    data_files:
    ds_type: json
    split: train
    field: text
dataset_prepared_path: /root/app/last_run_prepared
val_set_size: 0
output_dir: /root/app/model  # Output directory for the trained model

# Sequence and batch configuration
sequence_len: 128  # Adjust based on your model's needs
sample_packing: true
pad_to_sequence_len: true

# Optional W&B logging setup
#wandb_project: your_project_name  # Set your Weights & Biases project name
#wandb_entity: your_entity_name
#wandb_watch: false
#wandb_name: your_run_name
#wandb_log_model: true  # Set to true if you want to log the model in W&B

# Training parameters
gradient_accumulation_steps: 1  # From your Python script
micro_batch_size: 1000  # Batch size from your script
num_epochs: 1  # Number of epochs from your script
#optimizer: adamw_torch
optimizer: adopt_adamw
adam_beta1: 0.90
adam_beta2: 0.999
#optimizer_kwargs:
betas: [0.85, 0.999]
lr_scheduler: cosine
learning_rate: 1e-6  # Learning rate from your Python code
max_grad_norm: 0.15

# Mixed precision training
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:  # Enable FP16 based on your system's capabilities
tf32: false
#gpu_memory_limit: 20GiB

# Gradient checkpointing and memory optimization
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false

# Logging and evaluation settings
logging_steps: 1
xformers_attention:
flash_attention: true

# Save and evaluation settings
warmup_steps: 25
#evals_per_epoch: 1
save_strategy: steps
save_steps: 200  # Save every 200 steps
saves_per_epoch:  # Leave this empty since it's mutually exclusive with save_steps
save_total_limit: 2  # Keep a maximum of 2 checkpoints
debug: false
deepspeed: /root/app/zero3.json
weight_decay: 0.01

# Special tokens configuration
special_tokens:
  pad_token: <|dummy_87|>
  eos_token: <|im_end|>
# Add extra tokens.
tokens:
```
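The 136 GB per-GPU figure can be sanity-checked against what ZeRO Stage 3 should need for the model states alone. A rough sketch follows; the ~14.7B parameter count for Phi-4 and the classic 16 bytes/param mixed-precision Adam accounting (2 B half-precision params + 2 B half-precision grads + 12 B fp32 master copy, momentum, and variance, all sharded across GPUs) are assumptions, and activations, packing buffers, and allocator overhead come on top.

```python
def zero3_model_states_per_gpu_gb(n_params: float, n_gpus: int,
                                  bytes_per_param: int = 16) -> float:
    """Estimate per-GPU memory (GiB) for ZeRO-3 model states.

    ZeRO-3 shards parameters, gradients, and optimizer states across
    all ranks, so each GPU holds roughly 1/n_gpus of the total.
    """
    total_bytes = n_params * bytes_per_param
    return total_bytes / n_gpus / 1024**3

# Assumed ~14.7B parameters for Phi-4 (check the model card).
for n in (1, 2, 4, 8):
    print(n, round(zero3_model_states_per_gpu_gb(14.7e9, n), 1))
```

If observed per-GPU usage is far above this estimate, the states are probably not being sharded as expected (e.g. the DeepSpeed config isn't being picked up, or the number is actually host RAM from offload).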
zero3.json
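The contents of zero3.json aren't shown above. For comparison only, a typical DeepSpeed ZeRO Stage-3 config (modeled on the stock examples shipped with axolotl, not the poster's actual file) looks roughly like this:

```json
{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": { "enabled": "auto" },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```

Note that a zero3.json containing `offload_param` or `offload_optimizer` sections would move state to host RAM, which is one way a "136 GB" number could appear without an OOM.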