
Conversation

@NanoCode012
Collaborator

@NanoCode012 NanoCode012 commented Aug 7, 2025

Description

Some training notes on 4xH100:

  • Offload would require patching the modeling code to remove e_score_correction_bias, otherwise there is a device mismatch during calculation (a rough sketch of such a patch follows this list).
  • FFT with offload: checkpointing error.
  • FFT without offload: OOM even with a 4-bit optimizer, 2k context, 1 micro batch / GA.
  • QLoRA DDP: OOM.
  • QLoRA FSDP: stuck on step 0, memory 65GiB/GPU.
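
For future readers, here is a rough, untested sketch of the kind of patch meant above. It assumes the glm4_moe gate module exposes an e_score_correction_bias parameter (DeepSeek-V3-style routing); instead of removing the bias outright, it moves it onto the activations' device at forward time, which should avoid the same mismatch. The module/attribute handling here is an assumption, not what this PR ships:

```python
# Rough, untested sketch: keep e_score_correction_bias on the same device as
# the incoming hidden states so offloaded modules don't hit a device mismatch.
# Assumes the gate module exposes the bias directly (DeepSeek-V3-style routing)
# and that hidden_states is passed positionally.
import torch


def keep_gate_bias_on_input_device(model: torch.nn.Module) -> None:
    for module in model.modules():
        if not hasattr(module, "e_score_correction_bias"):
            continue

        def pre_hook(mod, args):
            hidden_states = args[0]
            bias = mod.e_score_correction_bias
            if bias is not None and bias.device != hidden_states.device:
                if isinstance(bias, torch.nn.Parameter):
                    # Parameters can't be reassigned with a plain tensor,
                    # so move the underlying storage instead.
                    bias.data = bias.data.to(hidden_states.device)
                else:
                    mod.e_score_correction_bias = bias.to(hidden_states.device)
            return args

        module.register_forward_pre_hook(pre_hook)
```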

Motivation and Context

How has this been tested?

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

@coderabbitai
Contributor

coderabbitai bot commented Aug 7, 2025

Review skipped (draft detected).


@NanoCode012
Collaborator Author

Model training seems stuck; I've tested up to 4xH200 for LoRA and FFT.

  • LoRA SFT (4xH200 @ 84GB/GPU), but stuck
  • FFT SFT (4xH200, OOM without checkpointing)

If anyone's interested in this model, please feel free to test using the configs in this YAML.

@zerofata

Follow-up from the Discord discussion earlier:

Looks like some form of Liger support might be there, based on the log below and the fact that I don't crash out:

[2025-08-14 09:50:21,568] [WARNING] [axolotl.integrations.liger.plugin.warning_once:39] [PID:3125] [RANK:0] Applied ONLY liger_fused_linear_cross_entropy generic patches for model type: glm4_moe
[2025-08-14 09:50:21,568] [WARNING] [axolotl.integrations.liger.plugin.warning_once:39] [PID:3125] [RANK:0] Liger + glm4_moe generic FLCE support is experimental and may not work as expected.

Pod setups tested (Environment is generic PyTorch 2.8 template pod from RunPod, with torch downgraded to 2.7.0):

Tried a modified config based on what we discussed, with 6xH200, and got stuck at step 0. Tried my old config that I know works (posted in Discord), and that also got stuck.

Did another 4xH200 pod using the old config again and confirmed it worked as expected. First step took 1:37 with a bsz of 16 and seq_len 8192.

[2025-08-14 09:57:38,233] [INFO] [axolotl.utils.callbacks.on_train_begin:827] [PID:3125] [RANK:0] The DeepSpeed config has been saved to the WandB run under files.
{'loss': 1.9211, 'grad_norm': 1.2673743963241577, 'learning_rate': 8e-07, 'memory/max_mem_active(gib)': 102.28, 'memory/max_mem_allocated(gib)': 102.28, 'memory/device_mem_reserved(gib)': 107.05, 'epoch': 0.01}
{'loss': 1.9185, 'grad_norm': 1.1321855783462524, 'learning_rate': 1.6e-06, 'memory/max_mem_active(gib)': 102.32, 'memory/max_mem_allocated(gib)': 102.32, 'memory/device_mem_reserved(gib)': 107.09, 'epoch': 0.03}
{'loss': 1.9239, 'grad_norm': 1.4860970973968506, 'learning_rate': 2.4e-06, 'memory/max_mem_active(gib)': 102.33, 'memory/max_mem_allocated(gib)': 102.33, 'memory/device_mem_reserved(gib)': 107.13, 'epoch': 0.04}
{'loss': 1.9629, 'grad_norm': 1.5326619148254395, 'learning_rate': 3.2e-06, 'memory/max_mem_active(gib)': 102.33, 'memory/max_mem_allocated(gib)': 102.33, 'memory/device_mem_reserved(gib)': 107.13, 'epoch': 0.05}
{'loss': 1.9507, 'grad_norm': 1.2274515628814697, 'learning_rate': 4e-06, 'memory/max_mem_active(gib)': 102.33, 'memory/max_mem_allocated(gib)': 102.33, 'memory/device_mem_reserved(gib)': 107.16, 'epoch': 0.07}
  2%|██▋                                                                                                                | 5/219 [06:27<4:25:26, 74.42s/it]

Re the odd issue with the missing layer 46 / MTP layer when the LoRA gets merged in: I was able to transplant the layer from the base model into the trained model, and it seemed to take well enough that I could convert the model to GGUF. Dunno if the actual MTP functionality itself still works as expected, but that's not such a biggie for me.

Had infinite generation issues with the trained model, but that one's probably an issue with my dataset / hyperparams not working right with the hybrid reasoning.

Side note related to those infinite generations: How does axolotl handle models like this with hybrid reasoning / enable thinking settings in chat templates when training / formatting the dataset?

It seems like, currently, with multi-turn datasets where turns contain reasoning or empty think tags, you'd need some sort of masking: my understanding is that when you train a sample, the think tags and the content inside them should be available for that specific turn, but not for any previous turns in the conversation.

Might be a useful feature, although feel free to ignore if that goes outside the scope of the PR.

@NanoCode012
Collaborator Author

NanoCode012 commented Aug 14, 2025

Thanks for the points @zerofata

Re the odd issue with the missing layer 46 / MTP layer when the LoRA gets merged in: I was able to transplant the layer from the base model into the trained model, and it seemed to take well enough that I could convert the model to GGUF

Do you mind sharing a snippet for any future readers?

Side note related to those infinite generations: How does axolotl handle models like this with hybrid reasoning / enable thinking settings in chat templates when training / formatting the dataset?

It seems like, currently, with multi-turn datasets where turns contain reasoning or empty think tags, you'd need some sort of masking: my understanding is that when you train a sample, the think tags and the content inside them should be available for that specific turn, but not for any previous turns in the conversation.

This is a bit mixed. You can run axolotl preprocess config.yaml --debug to see what we mask.

Currently, I think for glm4_moe, the think section is masked if reasoning is not provided, but is unmasked if it is.

On the turns level, we unmask the assistant turn by default (can set via config). If you only want to unmask the last turn, it's possible via our legacy method. https://docs.axolotl.ai/docs/dataset-formats/conversation.html#training-on-last-message
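
To illustrate the idea with a toy sketch (not our actual collator code; the span bookkeeping here is made up): labels for think content are kept only when the span falls inside the turn being trained on, and set to the ignore index otherwise.

```python
# Toy sketch of the masking idea, not Axolotl's implementation.
# labels: token-level labels for the conversation sample.
# think_spans: (start, end) token ranges covering <think>...</think> content.
# trained_turn_span: (start, end) token range of the turn being trained on.
IGNORE_INDEX = -100


def mask_earlier_think_spans(labels, think_spans, trained_turn_span):
    turn_start, turn_end = trained_turn_span
    for start, end in think_spans:
        inside_trained_turn = turn_start <= start and end <= turn_end
        if not inside_trained_turn:
            # Reasoning from earlier turns is hidden at inference time,
            # so it should not contribute to the loss either.
            for i in range(start, end):
                labels[i] = IGNORE_INDEX
    return labels
```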

@NanoCode012 NanoCode012 mentioned this pull request Aug 15, 2025
@zerofata

fix_air_mtp.py
Script I've been using is attached. Fair warning: it was vibe-coded by Claude and there are a few things not quite right with it; I think they're relatively minor, although I'm by no means an expert. (It doesn't update the model size when the layer is re-added, and it doesn't take into account the precision the merged model is in; even if converted to bf16, it will re-add the layer in mixed bf16/f32 precision.) It gets the job done for allowing GGUFs to be created with llama.cpp, though.
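
For anyone who can't grab the attachment, the rough shape of the transplant is below. This is a sketch rather than the actual script; the paths and the layer-46 prefix are guesses about the checkpoint layout, and like the script above it leaves the total_size metadata untouched.

```python
# Sketch of the layer-transplant idea: copy the missing MTP / layer-46 tensors
# from the base checkpoint into an extra shard next to the merged model and
# extend the weight index so the GGUF converter can find them.
import json
import os

from safetensors.torch import load_file, save_file

BASE_DIR = "/models/GLM-4.5-Air"        # assumed path to the base model
MERGED_DIR = "/models/glm-air-merged"   # assumed path to the merged model
MTP_PREFIX = "model.layers.46."         # assumed prefix of the missing layer

# The base index tells us which shards hold the MTP tensors.
with open(os.path.join(BASE_DIR, "model.safetensors.index.json")) as f:
    base_weight_map = json.load(f)["weight_map"]

mtp_names = [name for name in base_weight_map if name.startswith(MTP_PREFIX)]
shards = {base_weight_map[name] for name in mtp_names}

transplant = {}
for shard in shards:
    tensors = load_file(os.path.join(BASE_DIR, shard))
    transplant.update({n: t for n, t in tensors.items() if n.startswith(MTP_PREFIX)})

# Write the transplanted tensors as an extra shard and register them in the
# merged model's index. Note: total_size metadata is not updated here either.
extra_shard = "model-mtp-transplant.safetensors"
save_file(transplant, os.path.join(MERGED_DIR, extra_shard))

index_path = os.path.join(MERGED_DIR, "model.safetensors.index.json")
with open(index_path) as f:
    merged_index = json.load(f)
merged_index["weight_map"].update({name: extra_shard for name in transplant})
with open(index_path, "w") as f:
    json.dump(merged_index, f, indent=2)
```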

Thanks for the preprocess command; that was useful for debugging, and it looked like axolotl was handling it exactly as I'd hoped.

I found that to stop my infinite gens I had to add this to my config. The model has three EOT tokens; maybe they're all that's needed and the eos_token can be left alone, but I didn't want to risk another failed train, so I went with the primary stop token that the chat template uses.

eot_tokens:
  - "<|user|>"
special_tokens:
  eos_token: "<|user|>"

@zerofata

zerofata commented Aug 27, 2025

Was something changed in the last few days?

Just did another training attempt on GLM-4.5-Air and noticed that when I tried to merge the adapter with the base model, I was receiving an AssertionError.

It went away after adding lora_mlp_kernel: false to the config.yml.

root@3d9e66bb03ae:/workspace/axolotl# python3 -m axolotl.cli.merge_lora glm-air/config_sft_nothink.yml  --lora_model_dir="./GLM-AIR-SFT_v2-5/"
[2025-08-27 09:33:50,262] [INFO] [axolotl.integrations.base.register:369] [PID:4969] [RANK:0] Attempting to load plugin: axolotl.integrations.liger.LigerPlugin
[2025-08-27 09:33:51,004] [INFO] [axolotl.integrations.base.register:372] [PID:4969] [RANK:0] Plugin loaded successfully: axolotl.integrations.liger.LigerPlugin
[2025-08-27 09:33:51,004] [INFO] [axolotl.integrations.base.register:369] [PID:4969] [RANK:0] Attempting to load plugin: axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
[2025-08-27 09:33:51,005] [INFO] [axolotl.integrations.base.register:372] [PID:4969] [RANK:0] Plugin loaded successfully: axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
[2025-08-27 09:33:51,083] [WARNING] [axolotl.utils.schemas.config.check_auto_enable_lora_kernels:1118] [PID:4969] [RANK:0] Auto-enabling LoRA kernel optimizations for faster training. Please explicitly set lora_*_kernel config values to false to disable. See https://docs.axolotl.ai/docs/lora_optims.html for more info.
[2025-08-27 09:33:51,083] [INFO] [axolotl.utils.schemas.validation.check_eval_packing:130] [PID:4969] [RANK:0] setting remove_unused_columns: false for when sample_packing and eval_sample_packing don't match
[2025-08-27 09:33:51,084] [WARNING] [axolotl.utils.schemas.validation.check_sample_packing_without_attention:179] [PID:4969] [RANK:0] sample_packing without flash, sdp, xformers or flex attention does not handle cross sample decontamination.
[2025-08-27 09:33:51,287] [INFO] [axolotl.cli.config.load_cfg:245] [PID:4969] [RANK:0] config:
{
"activation_offloading": false,
"adapter": "qlora",
"axolotl_config_path": "glm-air/config_sft_nothink.yml",
"base_model": "zai-org/GLM-4.5-Air",
"base_model_config": "zai-org/GLM-4.5-Air",
"batch_size": 8,
"bf16": true,
"capabilities": {
"bf16": true,
"compute_capability": "sm_90",
"fp8": false,
"n_gpu": 1,
"n_node": 1
},
"context_parallel_size": 1,
"cut_cross_entropy": false,
"dataloader_num_workers": 1,
"dataloader_pin_memory": true,
"dataloader_prefetch_factor": 256,
"dataset_prepared_path": "./last_run_prepared",
"dataset_processes": 96,
"datasets": [
{
"chat_template": "tokenizer_default",
"field_messages": "messages",
"message_property_mappings": {
"content": "content",
"role": "role"
},
"path": "./data/automated_dataset_nothink.jsonl",
"roles": {
"assistant": [
"assistant"
],
"system": [
"system"
],
"user": [
"user"
]
},
"split": "train",
"trust_remote_code": false,
"type": "chat_template"
},
{
"chat_template": "tokenizer_default",
"field_messages": "messages",
"message_property_mappings": {
"content": "content",
"role": "role"
},
"path": "./data/chat_dataset_nothink.jsonl",
"roles": {
"assistant": [
"assistant"
],
"system": [
"system"
],
"user": [
"user"
]
},
"split": "train",
"trust_remote_code": false,
"type": "chat_template"
},
{
"chat_template": "tokenizer_default",
"field_messages": "messages",
"message_property_mappings": {
"content": "content",
"role": "role"
},
"path": "./data/cw_claude_dataset_nothink.jsonl",
"roles": {
"assistant": [
"assistant"
],
"system": [
"system"
],
"user": [
"user"
]
},
"split": "train",
"trust_remote_code": false,
"type": "chat_template"
},
{
"chat_template": "tokenizer_default",
"field_messages": "messages",
"message_property_mappings": {
"content": "content",
"role": "role"
},
"path": "./data/cw_dataset_nothink.jsonl",
"roles": {
"assistant": [
"assistant"
],
"system": [
"system"
],
"user": [
"user"
]
},
"split": "train",
"trust_remote_code": false,
"type": "chat_template"
},
{
"chat_template": "tokenizer_default",
"field_messages": "messages",
"message_property_mappings": {
"content": "content",
"role": "role"
},
"path": "./data/handcrafted_dataset_nothink.jsonl",
"roles": {
"assistant": [
"assistant"
],
"system": [
"system"
],
"user": [
"user"
]
},
"split": "train",
"trust_remote_code": false,
"type": "chat_template"
},
{
"chat_template": "tokenizer_default",
"field_messages": "messages",
"message_property_mappings": {
"content": "content",
"role": "role"
},
"path": "./data/instruct_dataset_nothink.jsonl",
"roles": {
"assistant": [
"assistant"
],
"system": [
"system"
],
"user": [
"user"
]
},
"split": "train",
"trust_remote_code": false,
"type": "chat_template"
},
{
"chat_template": "tokenizer_default",
"field_messages": "messages",
"message_property_mappings": {
"content": "content",
"role": "role"
},
"path": "./data/stories_dataset_nothink.jsonl",
"roles": {
"assistant": [
"assistant"
],
"system": [
"system"
],
"user": [
"user"
]
},
"split": "train",
"trust_remote_code": false,
"type": "chat_template"
},
{
"chat_template": "tokenizer_default",
"field_messages": "messages",
"message_property_mappings": {
"content": "content",
"role": "role"
},
"path": "./data/summaries_dataset_nothink.jsonl",
"roles": {
"assistant": [
"assistant"
],
"system": [
"system"
],
"user": [
"user"
]
},
"split": "train",
"trust_remote_code": false,
"type": "chat_template"
}
],
"ddp": false,
"device": "cuda:0",
"device_map": "auto",
"dion_rank_fraction": 1.0,
"dion_rank_multiple_of": 1,
"env_capabilities": {
"torch_version": "2.7.0"
},
"eot_tokens": [
"<|user|>",
"<|endoftext|>"
],
"eval_batch_size": 2,
"eval_causal_lm_metrics": [
"sacrebleu",
"comet",
"ter",
"chrf"
],
"eval_max_new_tokens": 128,
"eval_sample_packing": false,
"eval_steps": 35,
"eval_table_size": 0,
"flash_attention": false,
"fp16": false,
"gradient_accumulation_steps": 4,
"gradient_checkpointing": true,
"gradient_checkpointing_kwargs": {
"use_reentrant": true
},
"greater_is_better": false,
"learning_rate": 4.5e-06,
"liger_fused_linear_cross_entropy": true,
"liger_glu_activation": true,
"liger_layer_norm": true,
"liger_rms_norm": true,
"liger_rope": false,
"lisa_layers_attribute": "model.layers",
"load_best_model_at_end": true,
"load_in_4bit": false,
"load_in_8bit": false,
"local_rank": 0,
"logging_steps": 1,
"lora_alpha": 32,
"lora_dropout": 0.0,
"lora_mlp_kernel": true,
"lora_model_dir": "./GLM-AIR-SFT_v2-5/",
"lora_o_kernel": true,
"lora_qkv_kernel": true,
"lora_r": 32,
"lora_target_modules": [
"gate_proj",
"down_proj",
"up_proj",
"q_proj",
"v_proj",
"k_proj",
"o_proj"
],
"loraplus_lr_embedding": 1e-06,
"lr_scheduler": "rex",
"max_grad_norm": 1.0,
"max_prompt_len": 512,
"mean_resizing_embeddings": false,
"merge_lora": true,
"metric_for_best_model": "eval_loss",
"micro_batch_size": 2,
"model_config_type": "glm4_moe",
"num_epochs": 3.0,
"optimizer": "paged_adamw_8bit",
"output_dir": "./GLM-AIR-SFT_v2-5",
"pad_to_sequence_len": true,
"plugins": [
"axolotl.integrations.liger.LigerPlugin",
"axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin"
],
"pretrain_multipack_attn": true,
"pretrain_multipack_buffer_size": 10000,
"profiler_steps_start": 0,
"qlora_sharded_model_loading": false,
"ray_num_workers": 1,
"remove_unused_columns": false,
"resources_per_worker": {
"GPU": 1
},
"sample_packing": true,
"sample_packing_bin_size": 200,
"sample_packing_group_size": 100000,
"save_only_model": false,
"save_safetensors": true,
"save_steps": 20,
"save_strategy": "steps",
"save_total_limit": 18,
"sequence_len": 8192,
"shuffle_before_merging_datasets": false,
"shuffle_merged_datasets": true,
"skip_prepare_dataset": false,
"special_tokens": {
"eos_token": "<|user|>"
},
"strict": false,
"tensor_parallel_size": 1,
"tiled_mlp_use_original_mlp": true,
"tokenizer_config": "zai-org/GLM-4.5-Air",
"torch_dtype": "torch.bfloat16",
"train_on_inputs": false,
"trl": {
"log_completions": false,
"mask_truncated_completions": false,
"ref_model_mixup_alpha": 0.9,
"ref_model_sync_steps": 64,
"scale_rewards": true,
"sync_ref_model": false,
"use_vllm": false,
"vllm_server_host": "0.0.0.0",
"vllm_server_port": 8000
},
"use_ray": false,
"use_wandb": true,
"val_set_size": 0.02,
"vllm": {
"device": "auto",
"dtype": "auto",
"gpu_memory_utilization": 0.9,
"host": "0.0.0.0",
"port": 8000
},
"wandb_name": "GLM-AIR-SFT_v2-5",
"wandb_project": "GLM-AIR-SFT",
"warmup_ratio": 0.05,
"weight_decay": 0.01,
"world_size": 1
}
[2025-08-27 09:33:51,288] [INFO] [axolotl.cli.utils.load.load_model_and_tokenizer:40] [PID:4969] [RANK:0] loading tokenizer... zai-org/GLM-4.5-Air
[2025-08-27 09:33:51,966] [INFO] [axolotl.loaders.tokenizer.load_tokenizer:300] [PID:4969] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference.
[2025-08-27 09:33:51,966] [INFO] [axolotl.cli.utils.load.load_model_and_tokenizer:43] [PID:4969] [RANK:0] loading model...
[2025-08-27 09:33:52,054] [INFO] [axolotl.monkeypatch.transformers.trainer_loss_calc.patch_evaluation_loop:110] [PID:4969] [RANK:0] Patched Trainer.evaluation_loop with nanmean loss calculation
[2025-08-27 09:33:52,055] [INFO] [axolotl.monkeypatch.transformers.trainer_loss_calc.patch_maybe_log_save_evaluate:164] [PID:4969] [RANK:0] Patched Trainer._maybe_log_save_evaluate with nanmean loss calculation
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/usr/local/lib/python3.11/dist-packages/axolotl/cli/merge_lora.py", line 90, in <module>
fire.Fire(do_cli)
File "/usr/local/lib/python3.11/dist-packages/fire/core.py", line 135, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/fire/core.py", line 468, in _Fire
component, remaining_args = _CallAndUpdateTrace(
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/fire/core.py", line 684, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/axolotl/cli/merge_lora.py", line 86, in do_cli
do_merge_lora(cfg=parsed_cfg)
File "/usr/local/lib/python3.11/dist-packages/axolotl/cli/merge_lora.py", line 24, in do_merge_lora
model, tokenizer, processor = load_model_and_tokenizer(cfg=cfg)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/axolotl/cli/utils/load.py", line 45, in load_model_and_tokenizer
model, _ = model_loader.load()
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/axolotl/loaders/model.py", line 168, in load
self.patch_manager.apply_pre_model_load_patches()
File "/usr/local/lib/python3.11/dist-packages/axolotl/loaders/patch_manager.py", line 67, in apply_pre_model_load_patches
self._apply_self_attention_lora_patch()
File "/usr/local/lib/python3.11/dist-packages/axolotl/loaders/patch_manager.py", line 248, in _apply_self_attention_lora_patch
patch_self_attn_lora(self.cfg)
File "/usr/local/lib/python3.11/dist-packages/axolotl/monkeypatch/lora_kernels.py", line 206, in patch_self_attn_lora
assert any(
^^^^
AssertionError: Original QKV code not found

@NanoCode012
Collaborator Author

NanoCode012 commented Aug 28, 2025

@zerofata, hey, nope, this branch has been stale for the past week or so. Could you have been using a newer transformers version? Although I'm not aware of a newer transformers release breaking the QKV patch.
