Feat: Update OpenAI models and handle unknown models gracefully #3
base: main
Conversation
Reviewer's Guide

This PR enhances model support by extending the OPENAI_MODELS mapping with two new high-capacity variants and hardening the tokenizer initialization to gracefully handle unknown model names by falling back to a default encoding and issuing a warning. It also introduces tests to verify both the updated model limits and the warning behavior for unrecognized models.

Sequence diagram for graceful handling of unknown model names

```mermaid
sequenceDiagram
    participant User
    participant MarkdownLLMSplitter
    participant tiktoken
    participant sys.stderr
    User->>MarkdownLLMSplitter: Instantiate with unknown gptok_model
    MarkdownLLMSplitter->>tiktoken: encoding_for_model(gptok_model)
    tiktoken-->>MarkdownLLMSplitter: KeyError
    MarkdownLLMSplitter->>tiktoken: get_encoding("cl100k_base")
    MarkdownLLMSplitter->>sys.stderr: Print warning
    MarkdownLLMSplitter-->>User: Instance ready (with fallback encoding)
```
Entity relationship diagram for updated OPENAI_MODELS mapping

```mermaid
erDiagram
    OPENAI_MODELS {
        string model_name
        int token_limit
    }
    MarkdownLLMSplitter {
        string gptok_model
        int gptok_limit
    }
    OPENAI_MODELS ||--o| MarkdownLLMSplitter : "used for token limits"
```
Class diagram for updated MarkdownLLMSplitter initialization

```mermaid
classDiagram
    class MarkdownLLMSplitter {
        - gptoker
        - gptok_limit
        - md_meta
        - md_str
        + __init__(gptok_model: str = "gpt-3.5-turbo", gptok_limit: int = None)
    }
    MarkdownLLMSplitter : +__init__() uses tiktoken.encoding_for_model()
    MarkdownLLMSplitter : +__init__() falls back to tiktoken.get_encoding("cl100k_base") on KeyError
    MarkdownLLMSplitter : +__init__() prints warning to sys.stderr if model unknown
```
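The initialization flow in the diagrams can be sketched roughly as follows. This is an illustrative reconstruction, not the PR's actual code: the real implementation calls `tiktoken.encoding_for_model()` and `tiktoken.get_encoding()`, which are replaced here by small stand-ins so the sketch runs standalone, and the default token limit of 4096 is an assumption.

```python
import sys
from io import StringIO

# Stand-ins for tiktoken; the real code calls tiktoken.encoding_for_model()
# and tiktoken.get_encoding() instead.
_ENCODING_BY_MODEL = {"gpt-3.5-turbo": "cl100k_base", "gpt-4-turbo": "cl100k_base"}

def encoding_for_model(model: str) -> str:
    if model not in _ENCODING_BY_MODEL:
        raise KeyError(model)  # same failure mode as tiktoken
    return _ENCODING_BY_MODEL[model]

def get_encoding(name: str) -> str:
    return name

# Illustrative token limits; the project's OPENAI_MODELS dict is authoritative.
OPENAI_MODELS = {"gpt-3.5-turbo": 4096, "gpt-4-turbo": 128000}
DEFAULT_TOKEN_LIMIT = 4096  # assumed default for unknown models

class MarkdownLLMSplitter:
    def __init__(self, gptok_model: str = "gpt-3.5-turbo", gptok_limit: int = None):
        try:
            # Known models resolve to their dedicated encoding.
            self.gptoker = encoding_for_model(gptok_model)
        except KeyError:
            # Unknown models fall back to cl100k_base with a stderr warning.
            print(f"Warning: Model '{gptok_model}' not found. "
                  "Falling back to 'cl100k_base' encoding.", file=sys.stderr)
            self.gptoker = get_encoding("cl100k_base")
        self.gptok_limit = gptok_limit or OPENAI_MODELS.get(gptok_model, DEFAULT_TOKEN_LIMIT)
```

The key design point is that the `KeyError` is caught at construction time, so callers never see a crash for an unrecognized model name: they get a working instance plus a warning on stderr.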
File-Level Changes
Hey there - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `tests/test_splitter.py:21-30` </location>
<code_context>
+ splitter_4_turbo = MarkdownLLMSplitter(gptok_model="gpt-4-turbo")
+ assert splitter_4_turbo.gptok_limit == 128000
+
+def test_unknown_model_warning():
+ """Test that a warning is printed for unknown models."""
+ # Redirect stderr to capture the warning message
+ old_stderr = sys.stderr
+ sys.stderr = captured_stderr = StringIO()
+
+ MarkdownLLMSplitter(gptok_model="claude-3-opus-20240229")
+
+ # Restore stderr
+ sys.stderr = old_stderr
+
+ warning_message = captured_stderr.getvalue()
+ assert "Warning: Model 'claude-3-opus-20240229' not found" in warning_message
</code_context>
<issue_to_address>
**suggestion (testing):** Test does not assert the fallback tokenizer or token limit for unknown models.
Please add assertions to verify that the fallback tokenizer ('cl100k_base') is used and that the default token limit is correctly set when an unknown model is provided.
</issue_to_address>
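One way to address this suggestion is to extend the test with assertions on the fallback encoding's name and the resulting token limit. The sketch below is self-contained for illustration: it uses a stub splitter and encoding in place of the real `MarkdownLLMSplitter` and tiktoken objects, and the default limit of 4096 is an assumption.

```python
import sys
from io import StringIO

class _StubEncoding:
    """Stand-in for a tiktoken Encoding; the real object also exposes .name."""
    def __init__(self, name: str):
        self.name = name

class MarkdownLLMSplitter:
    """Minimal stub mirroring the splitter's fallback behavior."""
    DEFAULT_LIMIT = 4096  # assumed default
    KNOWN = {"gpt-3.5-turbo": 4096, "gpt-4-turbo": 128000}

    def __init__(self, gptok_model: str = "gpt-3.5-turbo"):
        if gptok_model not in self.KNOWN:
            print(f"Warning: Model '{gptok_model}' not found. Using 'cl100k_base'.",
                  file=sys.stderr)
            self.gptoker = _StubEncoding("cl100k_base")
            self.gptok_limit = self.DEFAULT_LIMIT
        else:
            self.gptoker = _StubEncoding("cl100k_base")
            self.gptok_limit = self.KNOWN[gptok_model]

def test_unknown_model_fallback():
    """Unknown models should warn, fall back to cl100k_base, and use the default limit."""
    old_stderr = sys.stderr
    sys.stderr = captured = StringIO()
    try:
        splitter = MarkdownLLMSplitter(gptok_model="claude-3-opus-20240229")
    finally:
        sys.stderr = old_stderr  # restore even if construction fails

    assert "not found" in captured.getvalue()
    # Assert the fallback tokenizer, not just the warning text...
    assert splitter.gptoker.name == "cl100k_base"
    # ...and the default token limit for unknown models.
    assert splitter.gptok_limit == MarkdownLLMSplitter.DEFAULT_LIMIT
```

In the real test file, the stub would be replaced by the actual import, and `try/finally` around the stderr swap is worth keeping so a failing constructor cannot leave stderr redirected.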
- Adds `gpt-5` and `gpt-4.1` to the known models list with their correct token limits.
- Updates tests to verify the new models.
- Improves handling of unknown models by catching the `KeyError` from `tiktoken` and falling back to a default encoding, which prevents crashes and increases robustness.
I've updated this PR with more current information. The model list now includes
- Corrects the token limit for `gpt-5` to 400k.
- Adds `gpt-5-mini` and `gpt-5-nano` with a 400k token limit.
- Updates tests to reflect the new, accurate model information.
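Taken together, the commits imply a mapping along these lines. This is a sketch of the shape of `OPENAI_MODELS`, not the repository's actual contents: the `gpt-4-turbo` value comes from the tests above, the `gpt-5` family values from this commit note, and the `gpt-4o` and `gpt-3.5-turbo` values are assumptions; `gpt-4.1`'s limit is not stated in the thread, so it is omitted here.

```python
# Model name -> maximum context window in tokens (illustrative values).
OPENAI_MODELS = {
    "gpt-3.5-turbo": 4096,    # assumed
    "gpt-4-turbo": 128000,    # per the PR's tests
    "gpt-4o": 128000,         # assumed
    "gpt-5": 400000,          # corrected in this commit
    "gpt-5-mini": 400000,     # added in this commit
    "gpt-5-nano": 400000,     # added in this commit
}

# Unknown names simply miss the dict, which is what triggers the
# default-limit fallback in the splitter's __init__.
limit = OPENAI_MODELS.get("gpt-5", None)
```

Keeping the limits in one flat dict means adding a model is a one-line change, with the `KeyError`/fallback path covering anything not yet listed.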
You were right to question the model data. After a much more thorough investigation, I've pushed a new commit to this PR with the correct token limits for the
This PR introduces two improvements to model handling:
1. Adds `gpt-4o` and `gpt-4-turbo` to the list of known models, allowing for accurate token limit calculations for these newer models.
2. No longer raises a `KeyError` when an unknown model name is used. Instead, it falls back to a default tokenizer (`cl100k_base`) and prints a warning to stderr. This makes the tool more robust.

Includes new tests to verify both functionalities.
Summary by Sourcery
Update the list of known OpenAI models to include gpt-4o and gpt-4-turbo, ensure accurate token limit calculations for them, and handle unrecognized model names gracefully by using a default tokenizer and emitting a warning.