fix token state json and mistral tokenizer issue #3522

Merged
winglian merged 11 commits into main from ci-fixes
Mar 22, 2026

Conversation

@winglian (Collaborator) commented Mar 20, 2026

Description

These bugs were showing up in CI.

Summary by CodeRabbit

  • Bug Fixes
    • Fixed checkpoint token tracking state file being saved with an incomplete filename, ensuring proper persistence of training state.
    • Enhanced tokenizer initialization logic to automatically assign a fallback padding token when not explicitly configured, improving tokenizer reliability.

@coderabbitai bot (Contributor) commented Mar 20, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 32ba8040-f817-45b2-9dab-9a22638a54a0

📝 Walkthrough

This pull request contains two separate fixes: completing the checkpoint state filename constant from the truncated "tokens_state." to the full "tokens_state.json", and adding a post-configuration fallback that sets the tokenizer's pad token to the end-of-sequence token when no pad token is configured.

Changes

Cohort / File(s): Summary

Checkpoint State File Naming (src/axolotl/core/trainers/base.py): Updated the TOKENS_STATE_FILE constant from "tokens_state." to "tokens_state.json" so the checkpoint token tracking state is saved with the complete filename.
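The filename fix can be sketched as below. `save_tokens_state` is a hypothetical helper used only to illustrate how the constant is joined into a checkpoint path; the actual save logic in base.py may differ.

```python
import json
import os
import tempfile

# Fixed constant: the previous truncated value "tokens_state." produced an
# extensionless file that the checkpoint-resume check could not find.
TOKENS_STATE_FILE = "tokens_state.json"

def save_tokens_state(checkpoint_dir: str, state: dict) -> str:
    """Hypothetical helper: persist token-tracking state into a checkpoint dir."""
    path = os.path.join(checkpoint_dir, TOKENS_STATE_FILE)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(state, f)
    return path

with tempfile.TemporaryDirectory() as ckpt:
    saved = save_tokens_state(ckpt, {"total_tokens": 12345})
    print(os.path.basename(saved))  # tokens_state.json
```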
Tokenizer Configuration Fallback (src/axolotl/loaders/tokenizer.py): Added a generic post-configuration fallback that sets tokenizer.pad_token to tokenizer.eos_token when the pad token is None and an EOS token is available, logging a warning when the fallback is applied.
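The fallback behaves roughly like the sketch below. A stub object stands in for a real transformers tokenizer, and `ensure_pad_token` is a hypothetical name for the logic added in tokenizer.py, not the actual function in the PR.

```python
import logging

LOG = logging.getLogger(__name__)

class StubTokenizer:
    """Stand-in for a transformers tokenizer with only the relevant attributes."""
    def __init__(self, pad_token=None, eos_token="</s>"):
        self.pad_token = pad_token
        self.eos_token = eos_token

def ensure_pad_token(tokenizer):
    """Assign the EOS token as pad token when no pad token is configured."""
    if tokenizer.pad_token is None and tokenizer.eos_token is not None:
        LOG.warning(
            "tokenizer has no pad_token; falling back to eos_token %r",
            tokenizer.eos_token,
        )
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer

tok = ensure_pad_token(StubTokenizer())
print(tok.pad_token)  # </s>
```

An already-configured pad token is left untouched, so the fallback only fires for tokenizers (such as the Mistral/Mixtral ones in the failing tests) that ship without one.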

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks: ✅ 3 passed

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately describes the two main changes: fixing the token state JSON filename and addressing a tokenizer issue, both of which are reflected in the actual changeset.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, above the required threshold of 80.00%.


@winglian winglian requested a review from NanoCode012 March 20, 2026 12:43
@coderabbitai bot (Contributor) left a comment
🧹 Nitpick comments (1)
src/axolotl/core/trainers/base.py (1)

54-54: Consider consolidating duplicate constant definitions.

TOKENS_STATE_FILE is defined independently in both base.py and tokens_per_second.py. This duplication creates a risk that future changes might update one but not the other, breaking checkpoint resumption.

Consider defining the constant in one location (e.g., a shared constants module) and importing it in both files to maintain consistency.

♻️ Potential consolidation approach

Create a shared constants module (e.g., src/axolotl/core/trainers/constants.py):

"""Shared constants for trainers and callbacks."""

TOKENS_STATE_FILE = "tokens_state.json"

Then import in both files:

+from axolotl.core.trainers.constants import TOKENS_STATE_FILE
+
-TOKENS_STATE_FILE = "tokens_state.json"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/core/trainers/base.py` at line 54, TOKENS_STATE_FILE is
duplicated across modules; consolidate it into a single shared constant and
import it where needed: create a small constants module (e.g.,
trainers.constants with TOKENS_STATE_FILE = "tokens_state.json"), then replace
the local TOKENS_STATE_FILE definitions in base.py and tokens_per_second.py to
import TOKENS_STATE_FILE from that new module so both trainers use the same
symbol and avoid drift.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6b1e048c-5d53-40ff-a639-3b748884cf14

📥 Commits

Reviewing files that changed from the base of the PR and between 1bcfc08 and 7542a88.

📒 Files selected for processing (2)
  • src/axolotl/core/trainers/base.py
  • src/axolotl/loaders/tokenizer.py

@github-actions (Contributor) commented
📖 Documentation Preview: https://69bd4611a3243b3a8c53001a--resonant-treacle-0fd729.netlify.app

Deployed on Netlify from commit 25caf42

@codecov bot commented Mar 20, 2026

Codecov Report

❌ Patch coverage is 88.23529% with 6 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
src/axolotl/utils/schedulers.py | 66.66% | 3 Missing ⚠️
src/axolotl/utils/quantization.py | 93.10% | 2 Missing ⚠️
src/axolotl/utils/callbacks/qat.py | 50.00% | 1 Missing ⚠️


@NanoCode012 (Collaborator) commented

Related details: https://github.com/axolotl-ai-cloud/axolotl/actions/runs/23338977708/job/67889510691

=========================== short test summary info ============================
FAILED tests/e2e/patched/test_mistral_samplepack.py::TestMistral::test_ft_packing - ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
FAILED tests/e2e/patched/test_mistral_samplepack.py::TestMistral::test_lora_packing - ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
FAILED tests/e2e/patched/test_mixtral_samplepack.py::TestMixtral::test_ft - ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
FAILED tests/e2e/patched/test_mixtral_samplepack.py::TestMixtral::test_qlora - ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
FAILED tests/e2e/patched/test_resume.py::TestResumeLlama::test_resume_lora_packed - AssertionError: tokens_state.json should exist in checkpoint at /tmp/tmpwzhyhw10/checkpoint-9/tokens_state.json
assert False
 +  where False = <function isfile at 0x2b24a669c360>('/tmp/tmpwzhyhw10/checkpoint-9/tokens_state.json')
 +    where <function isfile at 0x2b24a669c360> = <module 'posixpath' (frozen)>.isfile
 +      where <module 'posixpath' (frozen)> = os.path
===== 5 failed, 16 passed, 11 skipped, 1187 warnings in 312.50s (0:05:12) ======
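The first four failures all trace back to the missing pad token: any batching path that needs padding raises once it finds `pad_token` unset. The guard can be illustrated with the simplified stand-in below; this is not transformers' actual implementation, just a minimal model of the error the tests hit.

```python
def pad_batch(sequences, pad_token_id):
    """Right-pad integer token sequences to a common length.

    Simplified stand-in for a tokenizer's padding path; real tokenizers
    raise a similar ValueError when no padding token is configured.
    """
    if pad_token_id is None:
        raise ValueError(
            "Asking to pad but the tokenizer does not have a padding token."
        )
    width = max(len(seq) for seq in sequences)
    return [seq + [pad_token_id] * (width - len(seq)) for seq in sequences]

print(pad_batch([[1, 2, 3], [4]], pad_token_id=0))  # [[1, 2, 3], [4, 0, 0]]
```

The fifth failure (test_resume_lora_packed) is the filename bug: with the truncated constant, no tokens_state.json ever appeared in the checkpoint directory.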

@winglian winglian added the scheduled_release This PR is slated for the upcoming release label Mar 21, 2026
@winglian winglian merged commit 0ee98a0 into main Mar 22, 2026
20 of 21 checks passed
@winglian winglian deleted the ci-fixes branch March 22, 2026 02:46
@winglian winglian removed the scheduled_release This PR is slated for the upcoming release label Mar 22, 2026