Add proper token counting to code execution model #1184
Conversation
Signed-off-by: i-vainn <[email protected]>
Signed-off-by: i-vainn <[email protected]>
Signed-off-by: i-vainn <[email protected]>
📝 Walkthrough
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Greptile Overview
Greptile Summary
This PR adds proper token counting to the code execution model to ensure accurate token budget tracking during code generation and execution cycles.
The implementation addresses the previous limitation where code execution outputs were not counted toward the token budget (as noted in the removed TODO comment).
Confidence Score: 4/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Client
participant CodeExecModel
participant BaseModel
participant TokenCounter
participant LLMServer
Client->>CodeExecModel: get_code_execution_model()
CodeExecModel->>BaseModel: construct with require_tokenizer
BaseModel->>BaseModel: check require_tokenizer setting
BaseModel->>TokenCounter: setup tokenizer
TokenCounter-->>BaseModel: tokenizer ready
BaseModel-->>CodeExecModel: model ready
CodeExecModel->>CodeExecModel: verify tokenizer available
CodeExecModel-->>Client: wrapper ready
Client->>CodeExecModel: generate_async()
loop Code execution rounds
CodeExecModel->>LLMServer: request generation
LLMServer-->>CodeExecModel: text segment
CodeExecModel->>TokenCounter: count segment tokens
TokenCounter-->>CodeExecModel: token count
CodeExecModel->>CodeExecModel: update budget
CodeExecModel->>CodeExecModel: run code
CodeExecModel->>TokenCounter: count output tokens
TokenCounter-->>CodeExecModel: token count
CodeExecModel->>CodeExecModel: update budget
end
CodeExecModel-->>Client: final result
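For readers skimming the diagram, here is a minimal Python sketch of the accounting pattern it describes; the helpers `generate_segment` and `run_code` and the `request` fields are hypothetical stand-ins, not the repository's actual API.

```python
# Minimal sketch of the budget loop from the diagram above. The callables
# `generate_segment` and `run_code` are hypothetical stand-ins for the real
# server call and sandbox execution; only the accounting pattern matters.
def generate_with_budget(tokenizer, request, generate_segment, run_code, max_rounds=8):
    budget = request["tokens_to_generate"]
    output, total_generated = "", 0
    for _ in range(max_rounds):
        # Ask the model for the next segment under the remaining budget.
        segment = generate_segment(output, tokens_to_generate=budget)
        seg_tokens = len(tokenizer.encode(segment))
        output += segment
        total_generated += seg_tokens
        budget -= seg_tokens
        if budget <= 0:
            break
        # Execute any code block in the segment and charge its output as well.
        code_output = run_code(segment)
        if code_output is None:  # no code block -> generation is finished
            break
        out_tokens = len(tokenizer.encode(code_output))
        output += code_output
        total_generated += out_tokens
        budget -= out_tokens
        if budget <= 0:
            break
    return output, total_generated
```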
1 file reviewed, 1 comment
yield {"generation": code_end}

# Calculate token count for this segment (after adding code_end if needed)
num_generated_tokens = len(self.model.tokenizer.encode(current_output_segment))
Token count uses encode() on the segment string, which may differ from actual token count if tokenization is context-dependent. Consider verifying this matches the LLM's actual token usage.
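A quick way to see why this caveat matters (an illustrative check, not part of the PR, using the public gpt2 tokenizer as an arbitrary example):

```python
# Compare summing per-segment token counts against encoding the concatenated
# text; BPE merges can span segment boundaries, so the two counts may drift.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
segments = ["<tool_call>\nprint(1 + 1)\n", "</tool_call>", "\n2\n"]

per_segment = sum(len(tok.encode(s)) for s in segments)
whole = len(tok.encode("".join(segments)))
print(per_segment, whole)  # the two counts can differ by a few tokens
```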
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@nemo_skills/inference/model/base.py`:
- Around line 207-211: The try/except in the tokenizer initialization around
WrapperAutoTokenizer(tokenizer) is catching a broad Exception; narrow it to the
documented failure modes by catching OSError, ValueError and ImportError
specifically (instead of Exception) so failures from
AutoTokenizer.from_pretrained()/WrapperAutoTokenizer are handled but other
unexpected exceptions still surface; update the except clause accordingly in the
code that constructs WrapperAutoTokenizer(tokenizer).
In `@nemo_skills/inference/model/code_execution.py`:
- Around line 198-201: After you decrement request["tokens_to_generate"] by
code_output_tokens (computed via self.model.tokenizer.encode(code_output)),
immediately guard against a non‑positive budget: if
request["tokens_to_generate"] <= 0 then stop further generation (break/return
from the generation loop) and avoid calling the model with a negative
tokens_to_generate; apply the same guard in the other identical path around
lines 431-434 so both code_output_tokens paths early‑exit when budget is
exhausted and still update total_num_generated_tokens.
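A hedged sketch of what that guard could look like; the field names mirror the comment above, but the helper and its call sites are assumptions rather than the PR's actual code:

```python
# Hypothetical helper: charge the executed-code output against the remaining
# budget and report whether generation should stop.
def charge_code_output(tokenizer, request, code_output, total_num_generated_tokens):
    code_output_tokens = len(tokenizer.encode(code_output))
    total_num_generated_tokens += code_output_tokens
    request["tokens_to_generate"] -= code_output_tokens
    budget_exhausted = request["tokens_to_generate"] <= 0
    return total_num_generated_tokens, budget_exhausted
```

The caller would break out of the generation loop when `budget_exhausted` is True in both paths, instead of issuing another request with a non-positive `tokens_to_generate`.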
try:
    return WrapperAutoTokenizer(tokenizer)
except (OSError, Exception) as e:
    LOG.debug(f"Failed to initialize tokenizer from '{tokenizer}': {e}")
    return None
🧩 Analysis chain
🏁 Script executed:
# First, let's look at the actual code at lines 207-211
head -n 215 nemo_skills/inference/model/base.py | tail -n 20
Repository: NVIDIA-NeMo/Skills
Length of output: 801
🏁 Script executed:
# Find where AutoTokenizer comes from
rg "AutoTokenizer" nemo_skills/inference/model/base.py -B 3 -A 3Repository: NVIDIA-NeMo/Skills
Length of output: 1280
🏁 Script executed:
# Check the entire tokenizer initialization context
sed -n '190,220p' nemo_skills/inference/model/base.py
Repository: NVIDIA-NeMo/Skills
Length of output: 1128
🏁 Script executed:
# Look for any documentation or requirements that specify Transformers version
fd -e "txt" -e "in" -e "yaml" | xargs rg -l "transformers" | head -5Repository: NVIDIA-NeMo/Skills
Length of output: 162
🏁 Script executed:
# Find and examine WrapperAutoTokenizer implementation
find . -name "utils.py" -path "*/inference/model/*" -exec cat {} \;
Repository: NVIDIA-NeMo/Skills
Length of output: 4062
🏁 Script executed:
# Check the main requirements to understand Transformers version
cat requirements/main.txt | grep -i transformers
Repository: NVIDIA-NeMo/Skills
Length of output: 117
🏁 Script executed:
# Also check stem requirements
cat requirements/stem.txt | grep -i transformers
Repository: NVIDIA-NeMo/Skills
Length of output: 74
🏁 Script executed:
# Search for documentation or examples of AutoTokenizer exceptions in the codebase
rg "AutoTokenizer" -A 5 -B 5 | grep -i "except\|raise\|error" | head -20Repository: NVIDIA-NeMo/Skills
Length of output: 1792
🏁 Script executed:
# Check if there are other try-except blocks handling tokenizer initialization elsewhere
rg "from_pretrained" -A 3 -B 1 | grep -E "(except|OSError|ValueError|ImportError)"Repository: NVIDIA-NeMo/Skills
Length of output: 358
🌐 Web query:
AutoTokenizer.from_pretrained exceptions transformers library
💡 Result:
Common AutoTokenizer.from_pretrained(...) exceptions in 🤗 Transformers, what they mean, and the usual fix:
- OSError: We couldn't connect to 'https://huggingface.co' ... couldn't find it in the cached files ... not the path to a directory containing ... config.json
  Cause: no internet / blocked Hub access, or you're pointing to a local folder that doesn't actually contain the needed tokenizer files.
  Fix: ensure network access (or pre-download), or pass a correct local directory that contains tokenizer artifacts (e.g., tokenizer.json, tokenizer_config.json, vocab files). (discuss.huggingface.co)
- huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name' ...
  Cause: you passed something that is not a Hub repo ID (often a file path like .../tf_model.h5, an S3 URI, or a malformed string).
  Fix: pass either a valid Hub ID ("org/model") or a local folder path (directory), not a single weights file / invalid URI. (stackoverflow.com)
- ValueError: Unrecognized model ... Should have a 'model_type' key in its config.json ...
  Cause: the local folder/repo you're loading is missing a proper config.json (or it lacks model_type), or you're using an incompatible/old export.
  Fix: verify the folder has a valid config.json with model_type (and that you saved/pushed the model/tokenizer correctly). (discuss.huggingface.co)
- ValueError: Loading <repo> requires you to execute the configuration file in that repo ... set trust_remote_code=True
  Cause: the model/tokenizer relies on custom Python code from the repo.
  Fix: only if you trust the repo, pass trust_remote_code=True (and ensure you pass it everywhere needed in your loading flow). (discuss.huggingface.co)
- Auth/token-related breakages (e.g., private/gated repos), including reports around token= behavior changing.
  Cause: missing/incorrect auth token, or a version-specific regression/change in how token is propagated during tokenizer loading.
  Fix: confirm you're logged in / passing a valid token; if it "suddenly" started failing after an upgrade, try pinning/adjusting transformers/huggingface_hub versions and retest. (github.com)
Citations:
- 1: https://discuss.huggingface.co/t/huggingface-offline-error/89170?utm_source=openai
- 2: https://stackoverflow.com/questions/76900349/repo-id-error-when-using-hugging-face-transformers?utm_source=openai
- 3: https://discuss.huggingface.co/t/assistance-needed-unrecognized-model-error-in-autotokenizer-from-pretrained-for-gemma-2-2b/140143?utm_source=openai
- 4: https://discuss.huggingface.co/t/how-to-use-trust-remote-code-true-with-load-checkpoint-and-dispatch/39849?utm_source=openai
- 5: AutoTokenizer.from_pretrained does not propagate token (huggingface/transformers#39030)
🏁 Script executed:
# Check how other parts of the codebase handle similar initialization errors
rg "from_pretrained|WrapperAutoTokenizer" -B 3 -A 5 | head -50Repository: NVIDIA-NeMo/Skills
Length of output: 3855
🏁 Script executed:
# Check if there are any HFValidationError imports or usages
rg "HFValidationError|huggingface_hub" nemo_skills/ -B 2 -A 2Repository: NVIDIA-NeMo/Skills
Length of output: 1379
🌐 Web query:
transformers AutoTokenizer.from_pretrained exception types ValueError OSError ImportError
💡 Result:
transformers.AutoTokenizer.from_pretrained(...) (and the underlying tokenizer .from_pretrained(...)) commonly fails with these exception types:
- OSError (historically also raised as EnvironmentError, which is an alias of OSError on Python 3): raised when the tokenizer files can't be resolved/loaded (bad model id, wrong directory, missing required tokenizer files, etc.). The HF code explicitly does raise EnvironmentError("Can't load tokenizer for ..."). [1]
- ValueError: raised for "logical" issues such as:
  - passing a single file/URL to tokenizers that require a directory/model id (explicit raise ValueError(...)). [1]
  - AutoTokenizer can't map the model/config to a tokenizer (unrecognized configuration), or a required tokenizer implementation is unavailable (e.g., the SentencePiece-backed slow tokenizer class is None, so AutoTokenizer raises a ValueError instructing to install sentencepiece). [2]
  - the fast tokenizer backend can't be instantiated (no tokenizer.json, no slow tokenizer available to convert, etc.), raising a ValueError with the "Couldn't instantiate the backend tokenizer…" message. [3]
- ImportError: can be raised indirectly when the load path requires optional deps that aren't installed (most often during slow-to-fast conversion), e.g. requires_protobuf(...) raising ImportError: ... requires the protobuf library .... [4]
Sources: [1] [2] [3] [4]
Narrow exception handling for tokenizer initialization.
Catching broad Exception can mask unexpected bugs and silently degrade; replace with specific exception types that AutoTokenizer.from_pretrained() raises.
🛠️ Proposed refinement
-    except (OSError, Exception) as e:
+    except (OSError, ValueError, ImportError) as e:
         LOG.debug(f"Failed to initialize tokenizer from '{tokenizer}': {e}")
         return None
AutoTokenizer.from_pretrained() raises OSError (missing files, network issues), ValueError (unrecognized model config, missing dependencies), and ImportError (optional dep not installed). These three types cover the documented failure modes for tokenizer loading.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Current:
    try:
        return WrapperAutoTokenizer(tokenizer)
    except (OSError, Exception) as e:
        LOG.debug(f"Failed to initialize tokenizer from '{tokenizer}': {e}")
        return None
Suggested:
    try:
        return WrapperAutoTokenizer(tokenizer)
    except (OSError, ValueError, ImportError) as e:
        LOG.debug(f"Failed to initialize tokenizer from '{tokenizer}': {e}")
        return None
🧰 Tools
🪛 Ruff (0.14.13)
209-209: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In `@nemo_skills/inference/model/base.py` around lines 207 - 211, The try/except
in the tokenizer initialization around WrapperAutoTokenizer(tokenizer) is
catching a broad Exception; narrow it to the documented failure modes by
catching OSError, ValueError and ImportError specifically (instead of Exception)
so failures from AutoTokenizer.from_pretrained()/WrapperAutoTokenizer are
handled but other unexpected exceptions still surface; update the except clause
accordingly in the code that constructs WrapperAutoTokenizer(tokenizer).
Signed-off-by: i-vainn <[email protected]>
No files reviewed, no comments
Signed-off-by: i-vainn <[email protected]>
No files reviewed, no comments
Kipok
left a comment
Small question, otherwise looks good. Please run the gpt-oss slurm test to double-check that nothing is broken and the code isn't much slower.
nemo_skills/inference/model/base.py
Outdated
-    return WrapperAutoTokenizer(tokenizer)
+    try:
+        return WrapperAutoTokenizer(tokenizer)
+    except (OSError, Exception) as e:
why do we need this? If we were asked to get tokenizer, but it failed, we should just fail completely, no?
This is intentional - see https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/inference/model/base.py#L177-L195, which tries to initialize the tokenizer from the string and otherwise falls back to the tokenizer endpoint.
Without this catch, the server model path crashes before reaching the server endpoint fallback.
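A rough sketch of that two-stage flow (assumed shape, not the actual base.py code): try a local/HF tokenizer first and return None so the caller can fall back to the server's /tokenize endpoint.

```python
# Hypothetical sketch: local tokenizer first, None signals "use the server".
import logging

LOG = logging.getLogger(__name__)

def init_local_tokenizer(tokenizer_name_or_path):
    try:
        from transformers import AutoTokenizer
        return AutoTokenizer.from_pretrained(tokenizer_name_or_path)
    except (OSError, ValueError, ImportError) as exc:
        LOG.warning("Local tokenizer init failed (%s); falling back to server /tokenize", exc)
        return None
```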
Signed-off-by: i-vainn <[email protected]>
Signed-off-by: i-vainn <[email protected]>
4 files reviewed, 4 comments
if output.count(code_end) + 1 == output.count(code_begin):
    output += code_end
    # Count tokens for the manually added code_end
    num_generated_tokens += len(self.model.tokenizer.encode(code_end))
In non-streaming mode, code_end tokens are added to num_generated_tokens which came from output_dict.get("num_generated_tokens", 0). If the server already counted tokens for code_end when it was manually added, this would double-count those tokens.
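One possible shape for avoiding the double count, as a sketch only; `server_counts_appended_tags` is a hypothetical flag standing in for whatever signal the server provides about its reported count:

```python
# Hypothetical helper: only charge the manually appended code_end when the
# server's reported num_generated_tokens is known not to include it.
def close_code_block(tokenizer, output, num_generated_tokens, code_begin, code_end,
                     server_counts_appended_tags=False):
    if output.count(code_end) + 1 == output.count(code_begin):
        output += code_end
        if not server_counts_appended_tags:
            num_generated_tokens += len(tokenizer.encode(code_end))
    return output, num_generated_tokens
```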
try:
    return WrapperAutoTokenizer(tokenizer)
except OSError:
    LOG.warning(f"Tokenizer not found at '{tokenizer}', trying fallback to server /tokenize endpoint")
    return None
Only catching OSError may miss other exceptions during tokenizer initialization (e.g., ImportError, ValueError). If the goal is to gracefully fall back to the server endpoint, consider catching broader exceptions.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Signed-off-by: i-vainn <[email protected]>
No files reviewed, no comments
@Kipok, main vs branch complete within the same time.
1 file reviewed, 1 comment
try:
    return WrapperAutoTokenizer(tokenizer)
except OSError:
    LOG.warning(f"Tokenizer not found at '{tokenizer}', trying fallback to server /tokenize endpoint")
    return None
Catching only OSError doesn't follow CONTRIBUTING.md guidelines about not being overly defensive. If require_tokenizer is True, the code should fail loudly when tokenizer initialization fails, not silently fall back. The runtime check on line 279 of code_execution.py will catch this later, but it happens during generation (after model setup), which could cause issues in production.
Consider checking the flag here and only catching when fallback is acceptable (when the flag is False).
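A sketch of that flag-aware behaviour (names assumed, not the repository's code): fail loudly when a tokenizer is required, fall back quietly otherwise.

```python
# Hypothetical sketch: only swallow tokenizer-loading failures when the
# server-endpoint fallback is acceptable.
def init_tokenizer(tokenizer_name, require_tokenizer, load_fn):
    try:
        return load_fn(tokenizer_name)  # e.g. a WrapperAutoTokenizer-style loader
    except (OSError, ValueError, ImportError):
        if require_tokenizer:
            raise  # surface the failure at setup time, not mid-generation
        return None  # optional tokenizer -> caller uses the server /tokenize endpoint
```

Passing require_tokenizer through from the model config would make a hard requirement fail during setup rather than later, during generation.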
Changes
- get_code_execution_model now passes require_tokenizer=True to ensure tokenizer availability
- CodeExecutionWrapper now uses the tokenizer to count tokens for generated segments and code execution outputs
- _initialize_tokenizer now catches tokenizer-initialization exceptions to allow fallback to the server endpoint

Summary by CodeRabbit
Release Notes