fix token state json and mistral tokenizer issue #3522

Merged
winglian merged 11 commits into main from ci-fixes
Mar 22, 2026

Conversation

@winglian (Collaborator) commented Mar 20, 2026

Description

These bugs were showing up in CI.

Summary by CodeRabbit

  • Bug Fixes
    • Fixed checkpoint token tracking state file being saved with an incomplete filename, ensuring proper persistence of training state.
    • Enhanced tokenizer initialization logic to automatically assign a fallback padding token when not explicitly configured, improving tokenizer reliability.

@coderabbitai bot (Contributor) commented Mar 20, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 32ba8040-f817-45b2-9dab-9a22638a54a0

📝 Walkthrough

This pull request contains two separate fixes: completing the checkpoint state filename constant from the truncated "tokens_state." to the full "tokens_state.json", and adding a post-configuration fallback that sets the tokenizer's pad token to the end-of-sequence token when no pad token is configured.

Changes

Cohort / File(s): Summary

Checkpoint State File Naming (src/axolotl/core/trainers/base.py): Updated the TOKENS_STATE_FILE constant from "tokens_state." to "tokens_state.json" so the checkpoint token tracking state is saved with the complete filename.
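The filename fix can be sketched as below. `save_tokens_state` is a hypothetical helper used only to illustrate how the constant is joined into a checkpoint path; the actual save logic in base.py may differ.

```python
import json
import os
import tempfile

# Fixed constant: the previous truncated value "tokens_state." produced an
# extensionless file that the checkpoint-resume check could not find.
TOKENS_STATE_FILE = "tokens_state.json"

def save_tokens_state(checkpoint_dir: str, state: dict) -> str:
    """Hypothetical helper: persist token-tracking state into a checkpoint dir."""
    path = os.path.join(checkpoint_dir, TOKENS_STATE_FILE)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(state, f)
    return path

with tempfile.TemporaryDirectory() as ckpt:
    saved = save_tokens_state(ckpt, {"total_tokens": 12345})
    print(os.path.basename(saved))  # tokens_state.json
```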
Tokenizer Configuration Fallback (src/axolotl/loaders/tokenizer.py): Added a generic post-configuration fallback that sets tokenizer.pad_token to tokenizer.eos_token when the pad token is None and an EOS token is available, logging a warning when the fallback is applied.
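The fallback behaves roughly like the sketch below. A stub object stands in for a real transformers tokenizer, and `ensure_pad_token` is a hypothetical name for the logic added in tokenizer.py, not the actual function in the PR.

```python
import logging

LOG = logging.getLogger(__name__)

class StubTokenizer:
    """Stand-in for a transformers tokenizer with only the relevant attributes."""
    def __init__(self, pad_token=None, eos_token="</s>"):
        self.pad_token = pad_token
        self.eos_token = eos_token

def ensure_pad_token(tokenizer):
    """Assign the EOS token as pad token when no pad token is configured."""
    if tokenizer.pad_token is None and tokenizer.eos_token is not None:
        LOG.warning(
            "tokenizer has no pad_token; falling back to eos_token %r",
            tokenizer.eos_token,
        )
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer

tok = ensure_pad_token(StubTokenizer())
print(tok.pad_token)  # </s>
```

An already-configured pad token is left untouched, so the fallback only fires for tokenizers (such as the Mistral/Mixtral ones in the failing tests) that ship without one.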

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks: ✅ 3 passed

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately describes the two main changes: fixing the token state JSON filename and addressing a tokenizer issue, both of which are reflected in the actual changeset.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, above the required threshold of 80.00%.


@winglian winglian requested a review from NanoCode012 March 20, 2026 12:43
@coderabbitai bot (Contributor) left a comment
🧹 Nitpick comments (1)
src/axolotl/core/trainers/base.py (1)

54-54: Consider consolidating duplicate constant definitions.

TOKENS_STATE_FILE is defined independently in both base.py and tokens_per_second.py. This duplication creates a risk that future changes might update one but not the other, breaking checkpoint resumption.

Consider defining the constant in one location (e.g., a shared constants module) and importing it in both files to maintain consistency.

♻️ Potential consolidation approach

Create a shared constants module (e.g., src/axolotl/core/trainers/constants.py):

"""Shared constants for trainers and callbacks."""

TOKENS_STATE_FILE = "tokens_state.json"

Then import in both files:

+from axolotl.core.trainers.constants import TOKENS_STATE_FILE
+
-TOKENS_STATE_FILE = "tokens_state.json"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/core/trainers/base.py` at line 54, TOKENS_STATE_FILE is
duplicated across modules; consolidate it into a single shared constant and
import it where needed: create a small constants module (e.g.,
trainers.constants with TOKENS_STATE_FILE = "tokens_state.json"), then replace
the local TOKENS_STATE_FILE definitions in base.py and tokens_per_second.py to
import TOKENS_STATE_FILE from that new module so both trainers use the same
symbol and avoid drift.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6b1e048c-5d53-40ff-a639-3b748884cf14

📥 Commits

Reviewing files that changed from the base of the PR and between 1bcfc08 and 7542a88.

📒 Files selected for processing (2)
  • src/axolotl/core/trainers/base.py
  • src/axolotl/loaders/tokenizer.py

@github-actions (Contributor) commented
📖 Documentation Preview: https://69bd4611a3243b3a8c53001a--resonant-treacle-0fd729.netlify.app

Deployed on Netlify from commit 25caf42

@codecov bot commented Mar 20, 2026

Codecov Report

❌ Patch coverage is 88.23529% with 6 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
src/axolotl/utils/schedulers.py | 66.66% | 3 Missing ⚠️
src/axolotl/utils/quantization.py | 93.10% | 2 Missing ⚠️
src/axolotl/utils/callbacks/qat.py | 50.00% | 1 Missing ⚠️


@NanoCode012 (Collaborator) commented

Related details: https://github.com/axolotl-ai-cloud/axolotl/actions/runs/23338977708/job/67889510691

=========================== short test summary info ============================
FAILED tests/e2e/patched/test_mistral_samplepack.py::TestMistral::test_ft_packing - ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
FAILED tests/e2e/patched/test_mistral_samplepack.py::TestMistral::test_lora_packing - ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
FAILED tests/e2e/patched/test_mixtral_samplepack.py::TestMixtral::test_ft - ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
FAILED tests/e2e/patched/test_mixtral_samplepack.py::TestMixtral::test_qlora - ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
FAILED tests/e2e/patched/test_resume.py::TestResumeLlama::test_resume_lora_packed - AssertionError: tokens_state.json should exist in checkpoint at /tmp/tmpwzhyhw10/checkpoint-9/tokens_state.json
assert False
 +  where False = <function isfile at 0x2b24a669c360>('/tmp/tmpwzhyhw10/checkpoint-9/tokens_state.json')
 +    where <function isfile at 0x2b24a669c360> = <module 'posixpath' (frozen)>.isfile
 +      where <module 'posixpath' (frozen)> = os.path
===== 5 failed, 16 passed, 11 skipped, 1187 warnings in 312.50s (0:05:12) ======
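The first four failures all trace back to the missing pad token: any batching path that needs padding raises once it finds `pad_token` unset. The guard can be illustrated with the simplified stand-in below; this is not transformers' actual implementation, just a minimal model of the error the tests hit.

```python
def pad_batch(sequences, pad_token_id):
    """Right-pad integer token sequences to a common length.

    Simplified stand-in for a tokenizer's padding path; real tokenizers
    raise a similar ValueError when no padding token is configured.
    """
    if pad_token_id is None:
        raise ValueError(
            "Asking to pad but the tokenizer does not have a padding token."
        )
    width = max(len(seq) for seq in sequences)
    return [seq + [pad_token_id] * (width - len(seq)) for seq in sequences]

print(pad_batch([[1, 2, 3], [4]], pad_token_id=0))  # [[1, 2, 3], [4, 0, 0]]
```

The fifth failure (test_resume_lora_packed) is the filename bug: with the truncated constant, no tokens_state.json ever appeared in the checkpoint directory.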

@winglian winglian added the scheduled_release This PR is slated for the upcoming release label Mar 21, 2026
@winglian winglian merged commit 0ee98a0 into main Mar 22, 2026
20 of 21 checks passed
@winglian winglian deleted the ci-fixes branch March 22, 2026 02:46
@winglian winglian removed the scheduled_release This PR is slated for the upcoming release label Mar 22, 2026