
Conversation

Collaborator

@i-vainn i-vainn commented Jan 26, 2026

Changes

  • get_code_execution_model now passes require_tokenizer=True to ensure tokenizer availability
  • CodeExecutionWrapper now uses tokenizer to count tokens for:
    • Code execution output added to prompts
    • Streaming segment token counts
  • Added RuntimeError with clear message if tokenizer is unavailable
  • Fixed _initialize_tokenizer to catch exceptions and allow fallback to server endpoint
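
Below is a minimal sketch of the counting flow these bullets describe. It is hypothetical and heavily simplified: model.generate, run_code, and format_code_output are placeholder names, not the actual helpers in CodeExecutionWrapper; only the tokenizer.encode-based counting mirrors the change.

# Hypothetical sketch only: illustrates the budget accounting, not the real NeMo-Skills code.
def generate_with_code_execution(model, run_code, format_code_output, request, code_begin, code_end):
    remaining = request["tokens_to_generate"]
    total_generated = 0
    while remaining > 0:
        output = model.generate(prompt=request["prompt"], tokens_to_generate=remaining)["generation"]
        # Close an unterminated code block and charge for the marker we injected ourselves.
        if output.count(code_end) + 1 == output.count(code_begin):
            output += code_end
        segment_tokens = len(model.tokenizer.encode(output))
        total_generated += segment_tokens
        remaining -= segment_tokens
        request["prompt"] += output
        if code_end not in output or remaining <= 0:
            break
        # Execute the generated code and charge the injected (formatted) execution output as well.
        formatted_output = format_code_output(run_code(output, code_begin, code_end))
        output_tokens = len(model.tokenizer.encode(formatted_output))
        total_generated += output_tokens
        remaining -= output_tokens
        request["prompt"] += formatted_output
    return request["prompt"], total_generated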

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Enhanced tokenizer initialization with improved error handling and fallback mechanisms
    • Improved token counting accuracy during code execution and generation to ensure proper token budget tracking
    • Added runtime validation to ensure tokenizer availability during generation processes

✏️ Tip: You can customize this high-level summary in your review settings.

@i-vainn i-vainn requested a review from Kipok January 26, 2026 16:59
Contributor

coderabbitai bot commented Jan 26, 2026

📝 Walkthrough


The changes add a require_tokenizer parameter to enforce tokenizer initialization at model construction time and implement comprehensive token counting for manually injected tokens during code execution, with runtime validation to ensure tokenizer availability.

Changes

Cohort / File(s) | Summary
Model initialization configuration
nemo_skills/inference/model/__init__.py, nemo_skills/inference/model/base.py
Added require_tokenizer: bool = False parameter to BaseModel.__init__ to conditionally enforce tokenizer initialization. Modified tokenizer initialization logic to trigger when enable_soft_fail or require_tokenizer is true. Enhanced _initialize_tokenizer with try/except error handling and logging. Updated get_code_execution_model to pass require_tokenizer=True when constructing models.
Token accounting in code execution
nemo_skills/inference/model/code_execution.py
Introduced token counting for manually injected tokens (code_end, code_output, formatted_code_output) in both streaming and non-streaming generation paths. Added runtime validation requiring tokenizer availability before token counting operations, raising RuntimeError if tokenizer is missing. Reordered token budget updates and code block execution logic to perform token accounting before break conditions.
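
A rough sketch of the initialization behavior summarized in the model-initialization cohort above (illustrative only; the real BaseModel.__init__ takes many more arguments, and the WrapperAutoTokenizer import path is inferred from the analysis chain later in this thread):

import logging

# Import path inferred from the PR's analysis chain; adjust if the repo layout differs.
from nemo_skills.inference.model.utils import WrapperAutoTokenizer

LOG = logging.getLogger(__name__)


class BaseModelSketch:
    """Illustrative only: mirrors the described enable_soft_fail / require_tokenizer behavior."""

    def __init__(self, tokenizer=None, enable_soft_fail=False, require_tokenizer=False):
        self.tokenizer = None
        if enable_soft_fail or require_tokenizer:
            self.tokenizer = self._initialize_tokenizer(tokenizer)

    def _initialize_tokenizer(self, tokenizer):
        if tokenizer is None:
            return None
        try:
            return WrapperAutoTokenizer(tokenizer)
        except OSError:
            LOG.warning(f"Tokenizer not found at '{tokenizer}', trying fallback to server /tokenize endpoint")
            return None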

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 42.86%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed: The title directly reflects the main change, adding proper token counting to the code execution model, which is the primary objective across all modified files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

greptile-apps bot commented Jan 26, 2026

Greptile Overview

Greptile Summary

This PR adds proper token counting to the code execution model to ensure accurate token budget tracking during code generation and execution cycles.

Key changes:

  • The get_code_execution_model function now requests tokenizer availability
  • CodeExecutionWrapper uses the tokenizer to count tokens for both generated code segments and execution outputs
  • Added runtime validation to fail fast if tokenizer is unavailable
  • Improved error handling in tokenizer setup to catch exceptions and allow fallback to server endpoint
  • Token budget now accurately accounts for manually added code end markers and code execution outputs

The implementation addresses the previous limitation where code execution outputs were not counted toward the token budget (as noted in the removed TODO comment).
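
As a sketch of the fail-fast validation mentioned above (a simplified illustration, not the actual CodeExecutionWrapper code, which performs this check inside its generation paths):

class CodeExecutionWrapperSketch:
    """Illustrative only: shows the shape of the runtime tokenizer validation."""

    def __init__(self, model):
        self.model = model  # the underlying model, constructed with require_tokenizer=True

    def _require_tokenizer(self):
        if self.model.tokenizer is None:
            raise RuntimeError(
                "A tokenizer is required to count tokens for code execution output, "
                "but none could be initialized from the provided path or server endpoint."
            )
        return self.model.tokenizer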

Confidence Score: 4/5

  • This PR is safe to merge with minor considerations around tokenization accuracy
  • The changes are well-structured and address a real issue with token counting. The score is 4/5 rather than 5/5 because: (1) the tokenizer.encode() approach may have context-dependent differences from actual LLM token usage (as noted in previous review), and (2) the exception handling catches generic Exception which is quite broad, though this is acceptable for fallback logic
  • No files require special attention - all changes are straightforward improvements to token counting

Important Files Changed

Filename | Overview
nemo_skills/inference/model/base.py | Added require_tokenizer parameter and improved error handling in _initialize_tokenizer with try/except for OSError and generic exceptions
nemo_skills/inference/model/code_execution.py | Implemented proper token counting for code execution outputs and streaming segments using tokenizer.encode(), added runtime validation, and fixed token budget tracking
nemo_skills/inference/model/__init__.py | Added require_tokenizer parameter to get_code_execution_model to ensure tokenizer initialization for token counting

Sequence Diagram

sequenceDiagram
    participant Client
    participant CodeExecModel
    participant BaseModel
    participant TokenCounter
    participant LLMServer

    Client->>CodeExecModel: get_code_execution_model()
    CodeExecModel->>BaseModel: construct with require_tokenizer
    BaseModel->>BaseModel: check require_tokenizer setting
    BaseModel->>TokenCounter: setup tokenizer
    TokenCounter-->>BaseModel: tokenizer ready
    BaseModel-->>CodeExecModel: model ready
    CodeExecModel->>CodeExecModel: verify tokenizer available
    CodeExecModel-->>Client: wrapper ready
    
    Client->>CodeExecModel: generate_async()
    loop Code execution rounds
        CodeExecModel->>LLMServer: request generation
        LLMServer-->>CodeExecModel: text segment
        CodeExecModel->>TokenCounter: count segment tokens
        TokenCounter-->>CodeExecModel: token count
        CodeExecModel->>CodeExecModel: update budget
        CodeExecModel->>CodeExecModel: run code
        CodeExecModel->>TokenCounter: count output tokens
        TokenCounter-->>CodeExecModel: token count
        CodeExecModel->>CodeExecModel: update budget
    end
    CodeExecModel-->>Client: final result

Contributor

@greptile-apps greptile-apps bot left a comment


1 file reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

yield {"generation": code_end}

# Calculate token count for this segment (after adding code_end if needed)
num_generated_tokens = len(self.model.tokenizer.encode(current_output_segment))
Contributor


Token count uses encode() on the segment string, which may differ from actual token count if tokenization is context-dependent. Consider verifying this matches the LLM's actual token usage.
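
For readers wondering why this matters: BPE tokenizers can merge or split differently at segment boundaries, and some add special tokens on every encode() call, so encoding a segment in isolation only approximates what the server actually charged. A quick, self-contained way to observe the effect (gpt2 is just a convenient public tokenizer):

# Small experiment: token counts of a whole string vs. the sum of its parts can differ.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
prefix = "hel"
segment = "lo world"
joint = len(tok.encode(prefix + segment))
separate = len(tok.encode(prefix)) + len(tok.encode(segment))
print(joint, separate)  # the counts may differ when pieces merge across the boundary

When the two counts disagree for typical outputs, segment-level encode() counts can drift from the server-reported usage by a few tokens per round.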

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@nemo_skills/inference/model/base.py`:
- Around line 207-211: The try/except in the tokenizer initialization around
WrapperAutoTokenizer(tokenizer) is catching a broad Exception; narrow it to the
documented failure modes by catching OSError, ValueError and ImportError
specifically (instead of Exception) so failures from
AutoTokenizer.from_pretrained()/WrapperAutoTokenizer are handled but other
unexpected exceptions still surface; update the except clause accordingly in the
code that constructs WrapperAutoTokenizer(tokenizer).

In `@nemo_skills/inference/model/code_execution.py`:
- Around line 198-201: After you decrement request["tokens_to_generate"] by
code_output_tokens (computed via self.model.tokenizer.encode(code_output)),
immediately guard against a non‑positive budget: if
request["tokens_to_generate"] <= 0 then stop further generation (break/return
from the generation loop) and avoid calling the model with a negative
tokens_to_generate; apply the same guard in the other identical path around
lines 431-434 so both code_output_tokens paths early‑exit when budget is
exhausted and still update total_num_generated_tokens.
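
A minimal sketch of the guard described in that prompt, with variable names following the comment; the surrounding generation loop is omitted, so this is not the actual code_execution.py code:

def charge_code_output(request, tokenizer, code_output, total_num_generated_tokens):
    """Deduct injected code-output tokens from the budget and report whether generation may continue."""
    code_output_tokens = len(tokenizer.encode(code_output))
    total_num_generated_tokens += code_output_tokens
    request["tokens_to_generate"] -= code_output_tokens
    # Early-exit signal: never call the model with a zero or negative token budget.
    return total_num_generated_tokens, request["tokens_to_generate"] > 0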

Comment on lines 207 to 211
try:
    return WrapperAutoTokenizer(tokenizer)
except (OSError, Exception) as e:
    LOG.debug(f"Failed to initialize tokenizer from '{tokenizer}': {e}")
    return None
Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the actual code at lines 207-211
head -n 215 nemo_skills/inference/model/base.py | tail -n 20

Repository: NVIDIA-NeMo/Skills

Length of output: 801


🏁 Script executed:

# Find where AutoTokenizer comes from
rg "AutoTokenizer" nemo_skills/inference/model/base.py -B 3 -A 3

Repository: NVIDIA-NeMo/Skills

Length of output: 1280


🏁 Script executed:

# Check the entire tokenizer initialization context
sed -n '190,220p' nemo_skills/inference/model/base.py

Repository: NVIDIA-NeMo/Skills

Length of output: 1128


🏁 Script executed:

# Look for any documentation or requirements that specify Transformers version
fd -e "txt" -e "in" -e "yaml" | xargs rg -l "transformers" | head -5

Repository: NVIDIA-NeMo/Skills

Length of output: 162


🏁 Script executed:

# Find and examine WrapperAutoTokenizer implementation
find . -name "utils.py" -path "*/inference/model/*" -exec cat {} \;

Repository: NVIDIA-NeMo/Skills

Length of output: 4062


🏁 Script executed:

# Check the main requirements to understand Transformers version
cat requirements/main.txt | grep -i transformers

Repository: NVIDIA-NeMo/Skills

Length of output: 117


🏁 Script executed:

# Also check stem requirements
cat requirements/stem.txt | grep -i transformers

Repository: NVIDIA-NeMo/Skills

Length of output: 74


🏁 Script executed:

# Search for documentation or examples of AutoTokenizer exceptions in the codebase
rg "AutoTokenizer" -A 5 -B 5 | grep -i "except\|raise\|error" | head -20

Repository: NVIDIA-NeMo/Skills

Length of output: 1792


🏁 Script executed:

# Check if there are other try-except blocks handling tokenizer initialization elsewhere
rg "from_pretrained" -A 3 -B 1 | grep -E "(except|OSError|ValueError|ImportError)"

Repository: NVIDIA-NeMo/Skills

Length of output: 358


🌐 Web query:

AutoTokenizer.from_pretrained exceptions transformers library

💡 Result:

Common AutoTokenizer.from_pretrained(...) exceptions in 🤗 Transformers, what they mean, and the usual fix:

  1. OSError: We couldn't connect to 'https://huggingface.co' ... couldn't find it in the cached files ... not the path to a directory containing ... config.json
    Cause: no internet / blocked Hub access, or you’re pointing to a local folder that doesn’t actually contain the needed tokenizer files.
    Fix: ensure network access (or pre-download), or pass a correct local directory that contains tokenizer artifacts (e.g., tokenizer.json, tokenizer_config.json, vocab files). (discuss.huggingface.co)

  2. huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name' ...
    Cause: you passed something that is not a Hub repo ID (often a file path like .../tf_model.h5, an S3 URI, or a malformed string).
    Fix: pass either a valid Hub ID ("org/model") or a local folder path (directory), not a single weights file / invalid URI. (stackoverflow.com)

  3. ValueError: Unrecognized model ... Should have a 'model_type' key in its config.json ...
    Cause: the local folder/repo you’re loading is missing a proper config.json (or it lacks model_type), or you’re using an incompatible/old export.
    Fix: verify the folder has a valid config.json with model_type (and that you saved/pushed the model/tokenizer correctly). (discuss.huggingface.co)

  4. ValueError: Loading <repo> requires you to execute the configuration file in that repo ... set trust_remote_code=True
    Cause: the model/tokenizer relies on custom Python code from the repo.
    Fix: only if you trust the repo, pass trust_remote_code=True (and ensure you pass it everywhere needed in your loading flow). (discuss.huggingface.co)

  5. Auth/token-related breakages (e.g., private/gated repos), including reports around token= behavior changing
    Cause: missing/incorrect auth token, or a version-specific regression/change in how token is propagated during tokenizer loading.
    Fix: confirm you’re logged in / passing a valid token; if it “suddenly” started failing after an upgrade, try pinning/adjusting transformers/huggingface_hub versions and retest. (github.com)



🏁 Script executed:

# Check how other parts of the codebase handle similar initialization errors
rg "from_pretrained|WrapperAutoTokenizer" -B 3 -A 5 | head -50

Repository: NVIDIA-NeMo/Skills

Length of output: 3855


🏁 Script executed:

# Check if there are any HFValidationError imports or usages
rg "HFValidationError|huggingface_hub" nemo_skills/ -B 2 -A 2

Repository: NVIDIA-NeMo/Skills

Length of output: 1379


🌐 Web query:

transformers AutoTokenizer.from_pretrained exception types ValueError OSError ImportError

💡 Result:

transformers.AutoTokenizer.from_pretrained(...) (and the underlying tokenizer .from_pretrained(...)) commonly fails with these exception types:

  • OSError (historically also raised as EnvironmentError, which is an alias of OSError on Python 3): raised when the tokenizer files can’t be resolved/loaded (bad model id, wrong directory, missing required tokenizer files, etc.). The HF code explicitly does raise EnvironmentError("Can't load tokenizer for ..."). [1]

  • ValueError: raised for “logical” issues such as:

    • passing a single file/URL to tokenizers that require a directory/model id (explicit raise ValueError(...)). [1]
    • AutoTokenizer can’t map the model/config to a tokenizer (unrecognized configuration), or a required tokenizer implementation is unavailable (e.g., SentencePiece-backed slow tokenizer class is None so AutoTokenizer raises a ValueError instructing to install sentencepiece). [2]
    • fast tokenizer backend can’t be instantiated (no tokenizer.json, no slow tokenizer available to convert, etc.), raising a ValueError with the “Couldn't instantiate the backend tokenizer…” message. [3]
  • ImportError: can be raised indirectly when the load path requires optional deps that aren’t installed (most often during slow→fast conversion), e.g. requires_protobuf(...) raising ImportError: ... requires the protobuf library .... [4]

Sources: [1] [2] [3] [4]


Narrow exception handling for tokenizer initialization.
Catching a broad Exception can mask unexpected bugs and silently degrade behavior; replace it with the specific exception types that AutoTokenizer.from_pretrained() raises.

🛠️ Proposed refinement
-            except (OSError, Exception) as e:
+            except (OSError, ValueError, ImportError) as e:
                 LOG.debug(f"Failed to initialize tokenizer from '{tokenizer}': {e}")
                 return None

AutoTokenizer.from_pretrained() raises OSError (missing files, network issues), ValueError (unrecognized model config, missing dependencies), and ImportError (optional dep not installed). These three types cover the documented failure modes for tokenizer loading.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
     try:
         return WrapperAutoTokenizer(tokenizer)
-    except (OSError, Exception) as e:
+    except (OSError, ValueError, ImportError) as e:
         LOG.debug(f"Failed to initialize tokenizer from '{tokenizer}': {e}")
         return None
🧰 Tools
🪛 Ruff (0.14.13)

209-209: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In `@nemo_skills/inference/model/base.py` around lines 207 - 211, The try/except
in the tokenizer initialization around WrapperAutoTokenizer(tokenizer) is
catching a broad Exception; narrow it to the documented failure modes by
catching OSError, ValueError and ImportError specifically (instead of Exception)
so failures from AutoTokenizer.from_pretrained()/WrapperAutoTokenizer are
handled but other unexpected exceptions still surface; update the except clause
accordingly in the code that constructs WrapperAutoTokenizer(tokenizer).

Contributor

@greptile-apps greptile-apps bot left a comment


No files reviewed, no comments



Collaborator

@Kipok Kipok left a comment


small question, otherwise looks good. Please run gpt-oss slurm test to double check nothing is broken and code isn't much slower

-    return WrapperAutoTokenizer(tokenizer)
+    try:
+        return WrapperAutoTokenizer(tokenizer)
+    except (OSError, Exception) as e:
Collaborator


Why do we need this? If we were asked to get a tokenizer but it failed, we should just fail completely, no?

Collaborator Author


This is intentional - see https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/inference/model/base.py#L177-L195, which first tries to initialize the tokenizer from the string and otherwise initializes the tokenizer endpoint.

Without this catch, the server-model path crashes before reaching the server endpoint fallback.
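
For context, the fallback being defended here has roughly the following shape. This is a paraphrase under assumptions: the helper name and the callable-based endpoint fallback are placeholders, not the identifiers used in base.py.

import logging

LOG = logging.getLogger(__name__)


def setup_token_counter(tokenizer_spec, server_tokenize_fn):
    """Prefer a local HF tokenizer; otherwise count tokens through the server's /tokenize endpoint."""
    try:
        from transformers import AutoTokenizer

        tok = AutoTokenizer.from_pretrained(tokenizer_spec)
        return lambda text: len(tok.encode(text))
    except OSError as exc:
        # Without this catch the server-backed path would crash here and never
        # reach the endpoint fallback, which is exactly the point made above.
        LOG.warning(f"Local tokenizer '{tokenizer_spec}' unavailable ({exc}); falling back to server tokenization")
        return server_tokenize_fn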

Contributor

@greptile-apps greptile-apps bot left a comment


4 files reviewed, 4 comments


if output.count(code_end) + 1 == output.count(code_begin):
    output += code_end
    # Count tokens for the manually added code_end
    num_generated_tokens += len(self.model.tokenizer.encode(code_end))
Contributor


In non-streaming mode, code_end tokens are added to num_generated_tokens which came from output_dict.get("num_generated_tokens", 0). If the server already counted tokens for code_end when it was manually added, this would double-count those tokens.

Comment on lines +207 to +211
try:
    return WrapperAutoTokenizer(tokenizer)
except OSError:
    LOG.warning(f"Tokenizer not found at '{tokenizer}', trying fallback to server /tokenize endpoint")
    return None
Contributor


Only catching OSError may miss other exceptions during tokenizer initialization (e.g., ImportError, ValueError). If the goal is to gracefully fall back to server endpoint, consider catching broader exceptions.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Signed-off-by: i-vainn <[email protected]>
Contributor

@greptile-apps greptile-apps bot left a comment


No files reviewed, no comments


Collaborator Author

i-vainn commented Jan 29, 2026

Please run gpt-oss slurm test to double check nothing is broken and code isn't much slower

@Kipok, main and this branch complete within the same time.

Contributor

@greptile-apps greptile-apps bot left a comment


1 file reviewed, 1 comment


Comment on lines +207 to +211
try:
    return WrapperAutoTokenizer(tokenizer)
except OSError:
    LOG.warning(f"Tokenizer not found at '{tokenizer}', trying fallback to server /tokenize endpoint")
    return None
Contributor


Catching only OSError doesn't follow CONTRIBUTING.md guidelines about not being overly defensive. If require_tokenizer is True, the code should fail loudly when tokenizer initialization fails, not silently fall back. The runtime check on line 279 of code_execution.py will catch this later, but it happens during generation (after model setup), which could cause issues in production.

Consider checking the flag here and only catching when fallback is acceptable (when the flag is False).
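
A sketch of what that suggestion might look like (hypothetical; note the earlier reply explains why the unconditional fallback exists for the server-model path):

import logging

from nemo_skills.inference.model.utils import WrapperAutoTokenizer  # path inferred from the analysis chain earlier

LOG = logging.getLogger(__name__)


def initialize_tokenizer(tokenizer, require_tokenizer=False):
    """Illustrative variant: only swallow the failure when falling back is acceptable."""
    try:
        return WrapperAutoTokenizer(tokenizer)
    except OSError:
        if require_tokenizer:
            raise  # fail loudly at setup time when a tokenizer was explicitly required
        LOG.warning(f"Tokenizer not found at '{tokenizer}', trying fallback to server /tokenize endpoint")
        return None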

@Kipok Kipok merged commit 7ded756 into main Jan 30, 2026
6 checks passed
@Kipok Kipok deleted the imoshkov/code-execution-token-count branch January 30, 2026 04:57
