Feat: add glm45 #3030
Conversation
Force-pushed from bf0778b to 272a456
Model training seems stuck; I have tested up to 4xH200 for both LoRA and FFT.
If anyone is interested in this model, please feel free to test using the configs in this YAML.
Follow-up from the Discord discussion earlier. Looks like some form of Liger support might be there, based on the below and the fact that I don't crash out.

Pod setups tested (environment is a generic PyTorch 2.8 template pod from RunPod, with torch downgraded to 2.7.0):
- Tried a modified config based on what we discussed with 6xH200 and got stuck at step 0.
- Tried my old config that I know works (posted in Discord) and that also got stuck.
- Did another 4xH200 pod using the old config again and confirmed it worked as expected. First step took 1:37 with a bsz of 16 and seq_len 8192.

Re the odd issue with the missing layer 46 / MTP layer when the LoRA gets merged in: I was able to transplant the layer from the base model into the trained model, and it seemed to take that well enough that I could convert the model to GGUF (see the sketch after this comment). Don't know if the actual MTP functionality itself still works as expected, but that's not a big deal for me.

Had infinite generation issues with the trained model, but that one is probably an issue with my dataset / hyperparams not working right with the hybrid reasoning.

Side note related to those infinite generations: how does axolotl handle models like this with hybrid reasoning / enable-thinking settings in chat templates when training / formatting the dataset? With multi-turn datasets, if turns within a sample have reasoning or empty think tags, it seems like you'd need some sort of masking: my understanding is that when you train the sample, the think tags and the content inside them should be available for that specific turn, but not anything inside think tags for previous turns of the conversation. Might be a useful feature, although feel free to ignore this if it goes outside the scope of the PR.
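A minimal sketch of the "layer transplant" described above, assuming both the base GLM-4.5-Air checkpoint and the merged LoRA model are sharded safetensors, and that the missing layer is `model.layers.46` as reported. The paths and the extra-shard approach are assumptions for illustration, not the script actually used here:

```python
# Hypothetical sketch: copy the MTP (model.layers.46.*) tensors from the base
# checkpoint into a merged LoRA model that is missing them. Paths are assumptions.
import glob
import json
import os

from safetensors.torch import load_file, save_file

BASE_DIR = "/models/GLM-4.5-Air"          # assumption: base model path
MERGED_DIR = "/models/glm45-air-merged"   # assumption: merged LoRA model path
MTP_PREFIX = "model.layers.46."           # the MTP layer reported missing after merge

# 1. Collect the MTP tensors from the base model's shards.
mtp_tensors = {}
for shard in sorted(glob.glob(os.path.join(BASE_DIR, "*.safetensors"))):
    weights = load_file(shard)
    mtp_tensors.update({k: v for k, v in weights.items() if k.startswith(MTP_PREFIX)})
print(f"found {len(mtp_tensors)} MTP tensors in the base model")

# 2. Write them into the merged model as an extra shard.
extra_shard = "model-mtp-transplant.safetensors"
save_file(mtp_tensors, os.path.join(MERGED_DIR, extra_shard), metadata={"format": "pt"})

# 3. Register the new shard in the merged model's weight index
#    (assumes the merged checkpoint is sharded and has an index file).
index_path = os.path.join(MERGED_DIR, "model.safetensors.index.json")
with open(index_path) as f:
    index = json.load(f)
for name in mtp_tensors:
    index["weight_map"][name] = extra_shard
index["metadata"]["total_size"] += sum(
    t.numel() * t.element_size() for t in mtp_tensors.values()
)
with open(index_path, "w") as f:
    json.dump(index, f, indent=2)
```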
Thanks for the points @zerofata
Do you mind sharing a snippet for any future readers?
This is a bit mixed. You can see it during ... Currently, I think for glm4_moe, the think section is masked if reasoning is not provided, but is unmasked if it is. On the turns level, we unmask the ...
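For future readers, here is a conceptual sketch of the per-turn masking being discussed. This is illustrative only, not axolotl's actual implementation; the span bookkeeping and names are assumptions:

```python
# Conceptual sketch: keep <think> content trainable only for the final assistant
# turn and mask it (label = -100) for earlier turns. Not axolotl's real code.
from typing import List, Tuple

IGNORE_INDEX = -100


def mask_previous_think_spans(
    labels: List[int],
    think_spans: List[Tuple[int, int]],    # (start, end) token offsets of each <think>...</think> block
    last_assistant_span: Tuple[int, int],  # token offsets of the final assistant turn
) -> List[int]:
    """Mask reasoning content that belongs to earlier turns."""
    masked = list(labels)
    for start, end in think_spans:
        # Reasoning inside the final assistant turn stays unmasked.
        if start >= last_assistant_span[0] and end <= last_assistant_span[1]:
            continue
        for i in range(start, end):
            masked[i] = IGNORE_INDEX
    return masked
```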
fix_air_mtp.py

Thanks for the preprocess command; that was useful for debugging. It looked like axolotl was handling it exactly as I'd hoped.

I found that to stop my infinite generations, I had to add this to my config. The model has three EOT tokens; maybe they're all that's needed and the eos_token can be left alone, but I didn't want to risk another failed train, so I went with the primary stop token that the chat template uses.

```yaml
eot_tokens:
  - "<|user|>"
special_tokens:
  eos_token: "<|user|>"
```
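If it helps anyone else, a quick way to confirm the override actually landed in the saved checkpoint (the model path is an assumption):

```python
# Sanity check that "<|user|>" really became the EOS token of the trained model.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("/models/glm45-air-merged")  # assumed path
print("eos_token:", tok.eos_token)
print("eos_token_id:", tok.eos_token_id)
print("<|user|> id:", tok.convert_tokens_to_ids("<|user|>"))
assert tok.eos_token == "<|user|>", "EOS token was not updated as configured"
```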
Was something changed in the last few days? I just did another training attempt on GLM-4.5-Air and noticed that when I tried to merge the adapter with the base model, I was getting an AssertionError. It went away by adding ...
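For reference, the merge step can also be reproduced with plain PEFT, roughly as below. The paths and dtype are assumptions, and this is not necessarily the exact command used above, since axolotl ships its own merge step:

```python
# Sketch of merging a LoRA adapter into the base model with PEFT.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "/models/GLM-4.5-Air",            # assumed base model path
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(base, "/outputs/glm45-air-lora")  # assumed adapter path
merged = model.merge_and_unload()     # folds the LoRA weights into the base weights
merged.save_pretrained("/models/glm45-air-merged")
```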
@zerofata, hey, nope, this branch has been stale for the past week or so. Could you have been using a newer transformers version? Although, I am not aware of newer transformers breaking QKV.
Description
Some training notes on 4xH100:
- e_score_correction_bias needs handling, else device mismatch during calculation.

Motivation and Context
How has this been tested?
Screenshots (if appropriate)
Types of changes
Social Handles (Optional)