
Commit 3b54c38

dori committed
feat: implement dynamic token limits with model-specific context management

This commit introduces a comprehensive token management system that replaces hardcoded limits with dynamic, model-specific token limits while maintaining backward compatibility.

## Key Features Added:

### Dynamic Token Limits (NEW)
- `src/mcp_as_a_judge/db/dynamic_token_limits.py`: New module providing model-specific token limits with LiteLLM integration
- Initialization pattern: start with hardcoded defaults, upgrade from cache or LiteLLM API if available, return whatever is available
- Caching system to avoid repeated API calls for model information

### Enhanced Token Calculation
- `src/mcp_as_a_judge/db/token_utils.py`: Upgraded to async functions with accurate LiteLLM token counting and character-based fallback
- Unified model detection from LLM config or MCP sampling context
- Functions: `calculate_tokens_in_string`, `calculate_tokens_in_record`, `filter_records_by_token_limit` (all now async)

### Two-Level Token Management
- **Database Level**: Storage limits enforced during save operations
  - Record count limit (20 per session)
  - Token count limit (dynamic based on model, fallback to 50K)
  - LRU session cleanup (50 total sessions max)
- **Load Level**: LLM context limits enforced during retrieval
  - Ensures history + current prompt fits within model's input limit
  - FIFO removal of oldest records when limits exceeded

### Updated Service Layer
- `src/mcp_as_a_judge/db/conversation_history_service.py`: Added await for async token filtering function
- `src/mcp_as_a_judge/db/providers/sqlite_provider.py`: Integrated dynamic token limits in cleanup operations

### Test Infrastructure
- `tests/test_helpers/`: New test utilities package
- `tests/test_helpers/token_utils_helpers.py`: Helper functions for token calculation testing and model cache management
- `tests/test_improved_token_counting.py`: Comprehensive async test suite
- Updated existing tests to support async token functions

## Implementation Details:

### Model Detection Strategy:
1. Try LLM configuration (fast, synchronous)
2. Try MCP sampling detection (async, requires context)
3. Fallback to None with hardcoded limits

### Token Limit Logic:
- **On Load**: Check total history + current prompt tokens against model max input
- **On Save**: Two-step cleanup (record count limit, then token limit)
- **FIFO Removal**: Always remove oldest records first to preserve recent context

### Backward Compatibility:
- All existing method signatures preserved with alias support
- Graceful fallback when model information unavailable
- No breaking changes to existing functionality

## Files Changed:
- Modified: 5 core files (service, provider, token utils, server)
- Added: 3 new files (dynamic limits, test helpers)
- Enhanced: 2 test files with async support

## Testing:
- All 160 tests pass (1 skipped for integration-only)
- Comprehensive coverage of token calculation, limits, and cleanup logic
- Edge cases and error handling verified

This implementation follows the user's preferred patterns:
- Configuration-based approach with rational fallbacks
- Clean separation of concerns between storage and LLM limits
- Efficient FIFO cleanup maintaining recent conversation context
1 parent 24f9cc7 commit 3b54c38

File tree

10 files changed: +517, -70 lines

src/mcp_as_a_judge/db/conversation_history_service.py

Lines changed: 8 additions & 4 deletions
@@ -13,7 +13,9 @@
     create_database_provider,
 )
 from mcp_as_a_judge.db.db_config import Config
-from mcp_as_a_judge.db.token_utils import filter_records_by_token_limit
+from mcp_as_a_judge.db.token_utils import (
+    filter_records_by_token_limit,
+)
 from mcp_as_a_judge.logging_config import get_logger
 
 # Set up logger
@@ -37,7 +39,7 @@ def __init__(
         self.db = db_provider or create_database_provider(config)
 
     async def load_filtered_context_for_enrichment(
-        self, session_id: str, current_prompt: str = ""
+        self, session_id: str, current_prompt: str = "", ctx=None
     ) -> list[ConversationRecord]:
         """
         Load recent conversation records for LLM context enrichment.
@@ -49,6 +51,7 @@ async def load_filtered_context_for_enrichment(
         Args:
             session_id: Session identifier
             current_prompt: Current prompt that will be sent to LLM (for token calculation)
+            ctx: MCP context for model detection and accurate token counting (optional)
 
         Returns:
             List of conversation records for LLM context (filtered for LLM limits)
@@ -63,8 +66,9 @@ async def load_filtered_context_for_enrichment(
 
         # Apply LLM context filtering: ensure history + current prompt will fit within token limit
         # This filters the list without modifying the database (only token limit matters for LLM)
-        filtered_records = filter_records_by_token_limit(
-            recent_records, current_prompt=current_prompt
+        # Pass ctx for accurate token counting when available
+        filtered_records = await filter_records_by_token_limit(
+            recent_records, current_prompt=current_prompt, ctx=ctx
         )
 
         logger.info(
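
The diff above now awaits `filter_records_by_token_limit`, whose `token_utils.py` implementation is not shown in this commit excerpt. As a rough illustration of the load-level behavior the commit message describes (keep the newest records whose tokens, plus the current prompt, fit the model's input limit), here is a self-contained sketch; the record shape and the numbers are assumptions for illustration, not the project's actual types:

```python
from dataclasses import dataclass


@dataclass
class Record:
    content: str
    tokens: int


def fifo_filter(
    records_newest_first: list[Record], prompt_tokens: int, max_input_tokens: int
) -> list[Record]:
    """Keep the newest records whose combined tokens (plus the prompt) fit the input limit."""
    kept: list[Record] = []
    running = prompt_tokens
    for record in records_newest_first:  # newest first, so the oldest records drop out (FIFO)
        if running + record.tokens <= max_input_tokens:
            kept.append(record)
            running += record.tokens
        else:
            break
    return kept


# With a 50K fallback limit and a 1K prompt, only the two newest records fit:
history = [Record("newest", 20_000), Record("older", 25_000), Record("oldest", 10_000)]
print([r.content for r in fifo_filter(history, 1_000, 50_000)])  # ['newest', 'older']
```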
src/mcp_as_a_judge/db/dynamic_token_limits.py

Lines changed: 106 additions & 0 deletions

@@ -0,0 +1,106 @@
+"""
+Dynamic token limits based on actual model capabilities.
+
+This module provides dynamic token limit calculation based on the actual model
+being used, replacing hardcoded MAX_CONTEXT_TOKENS and MAX_RESPONSE_TOKENS
+with model-specific limits from LiteLLM.
+"""
+
+from dataclasses import dataclass
+
+from mcp_as_a_judge.constants import MAX_CONTEXT_TOKENS, MAX_RESPONSE_TOKENS
+
+
+@dataclass
+class ModelLimits:
+    """Model-specific token limits."""
+
+    context_window: int  # Total context window size
+    max_input_tokens: int  # Maximum tokens for input (context + prompt)
+    max_output_tokens: int  # Maximum tokens for output/response
+    model_name: str  # Model name for reference
+    source: str  # Where the limits came from ("litellm", "hardcoded", "estimated")
+
+
+# Cache for model limits to avoid repeated API calls
+_model_limits_cache: dict[str, ModelLimits] = {}
+
+
+def get_model_limits(model_name: str | None = None) -> ModelLimits:
+    """
+    Get token limits: start with hardcoded, upgrade from cache or LiteLLM if available.
+    """
+    # Start with hardcoded defaults
+    limits = ModelLimits(
+        context_window=MAX_CONTEXT_TOKENS + MAX_RESPONSE_TOKENS,
+        max_input_tokens=MAX_CONTEXT_TOKENS,
+        max_output_tokens=MAX_RESPONSE_TOKENS,
+        model_name=model_name or "unknown",
+        source="hardcoded",
+    )
+
+    # If no model name, return hardcoded
+    if not model_name:
+        return limits
+
+    # Try to upgrade from cache
+    if model_name in _model_limits_cache:
+        return _model_limits_cache[model_name]
+
+    # Try to upgrade from LiteLLM
+    try:
+        import litellm
+
+        model_info = litellm.get_model_info(model_name)
+
+        limits = ModelLimits(
+            context_window=model_info.get("max_tokens", limits.context_window),
+            max_input_tokens=model_info.get(
+                "max_input_tokens", limits.max_input_tokens
+            ),
+            max_output_tokens=model_info.get(
+                "max_output_tokens", limits.max_output_tokens
+            ),
+            model_name=model_name,
+            source="litellm",
+        )
+
+        # Cache and return what we have
+        _model_limits_cache[model_name] = limits
+
+    except Exception:
+        pass
+
+    return limits
+
+
+def get_llm_input_limit(model_name: str | None = None) -> int:
+    """
+    Get dynamic context token limit for conversation history.
+
+    This replaces the hardcoded MAX_CONTEXT_TOKENS with model-specific limits.
+
+    Args:
+        model_name: Name of the model (optional)
+
+    Returns:
+        Maximum tokens for conversation history/context
+    """
+    limits = get_model_limits(model_name)
+    return limits.max_input_tokens
+
+
+def get_llm_output_limit(model_name: str | None = None) -> int:
+    """
+    Get dynamic response token limit for LLM output.
+
+    This replaces the hardcoded MAX_RESPONSE_TOKENS with model-specific limits.
+
+    Args:
+        model_name: Name of the model (optional)
+
+    Returns:
+        Maximum tokens for LLM response/output
+    """
+    limits = get_model_limits(model_name)
+    return limits.max_output_tokens

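A quick usage sketch for the new module above (the model name is illustrative; if LiteLLM does not recognize it, the hardcoded `MAX_CONTEXT_TOKENS`/`MAX_RESPONSE_TOKENS` defaults come back with `source="hardcoded"`):

```python
from mcp_as_a_judge.db.dynamic_token_limits import get_llm_input_limit, get_model_limits

limits = get_model_limits("gpt-4o")  # illustrative model name
print(limits.source, limits.max_input_tokens, limits.max_output_tokens)

# Callers that only need the history/input budget:
input_budget = get_llm_input_limit("gpt-4o")
```

Repeated lookups for the same model hit the in-module cache instead of calling `litellm.get_model_info` again.
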
src/mcp_as_a_judge/db/providers/sqlite_provider.py

Lines changed: 18 additions & 16 deletions
@@ -11,10 +11,10 @@
 from sqlalchemy import create_engine
 from sqlmodel import Session, SQLModel, desc, select
 
-from mcp_as_a_judge.constants import MAX_CONTEXT_TOKENS
 from mcp_as_a_judge.db.cleanup_service import ConversationCleanupService
+from mcp_as_a_judge.db.dynamic_token_limits import get_llm_input_limit
 from mcp_as_a_judge.db.interface import ConversationHistoryDB, ConversationRecord
-from mcp_as_a_judge.db.token_utils import calculate_record_tokens
+from mcp_as_a_judge.db.token_utils import calculate_tokens_in_record, detect_model_name
 from mcp_as_a_judge.logging_config import get_logger
 
 # Set up logger
@@ -94,16 +94,14 @@ def _cleanup_excess_sessions(self) -> int:
         """
         return self._cleanup_service.cleanup_excess_sessions()
 
-    def _cleanup_old_messages(self, session_id: str) -> int:
+    async def _cleanup_old_messages(self, session_id: str) -> int:
         """
-        Remove old messages from a session using efficient hybrid FIFO strategy.
+        Remove old messages from a session using token-based FIFO cleanup.
 
-        Two-step process:
-        1. If record count > max_records, remove oldest record
-        2. If total tokens > max_tokens, remove oldest records until within limit
+        Uses dynamic token limits based on current model (get_llm_input_limit).
+        Removes oldest records until total tokens are within the model's input limit.
 
         Optimization: Single DB query with ORDER BY, then in-memory list operations.
-        Eliminates 2 extra database queries compared to naive implementation.
         """
         with Session(self.engine) as session:
             # Get current records ordered by timestamp DESC (newest first for token calculation)
@@ -140,25 +138,29 @@ def _cleanup_old_messages(self, session_id: str) -> int:
                 # Update our in-memory list to reflect the deletion
                 current_records.remove(oldest_record)
 
-            # STEP 2: Handle token limit (list is already sorted newest first - perfect for token calculation)
+            # STEP 2: Handle token limit using dynamic model-specific limits
             current_tokens = sum(record.tokens for record in current_records)
 
+            # Get dynamic token limit based on current model
+            model_name = await detect_model_name()
+            max_input_tokens = get_llm_input_limit(model_name)
+
             logger.info(
                 f" 🔢 {len(current_records)} records, {current_tokens} tokens "
-                f"(max: {MAX_CONTEXT_TOKENS})"
+                f"(max: {max_input_tokens} for model: {model_name or 'default'})"
             )
 
-            if current_tokens > MAX_CONTEXT_TOKENS:
+            if current_tokens > max_input_tokens:
                 logger.info(
-                    f" 🚨 Token limit exceeded, removing oldest records to fit within {MAX_CONTEXT_TOKENS} tokens"
+                    f" 🚨 Token limit exceeded, removing oldest records to fit within {max_input_tokens} tokens"
                 )
 
                 # Calculate which records to keep (newest first, within token limit)
                 records_to_keep = []
                 running_tokens = 0
 
                 for record in current_records:  # Already ordered newest first
-                    if running_tokens + record.tokens <= MAX_CONTEXT_TOKENS:
+                    if running_tokens + record.tokens <= max_input_tokens:
                         records_to_keep.append(record)
                         running_tokens += record.tokens
                     else:
@@ -220,7 +222,7 @@ async def save_conversation(
         is_new_session = self._is_new_session(session_id)
 
         # Calculate token count for input + output
-        token_count = calculate_record_tokens(input_data, output)
+        token_count = await calculate_tokens_in_record(input_data, output)
 
         # Create new record
         record = ConversationRecord(
@@ -244,9 +246,9 @@ async def save_conversation(
             logger.info(f"🆕 New session detected: {session_id}, running LRU cleanup")
             self._cleanup_excess_sessions()
 
-        # Per-session FIFO cleanup: maintain max 20 records per session
+        # Per-session FIFO cleanup: maintain max records per session and model-specific token limits
         # (runs on every save)
-        self._cleanup_old_messages(session_id)
+        await self._cleanup_old_messages(session_id)
 
         return record_id
 
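The provider above awaits `calculate_tokens_in_record` and `detect_model_name` from `token_utils.py`, which is not part of this excerpt. A minimal sketch of the counting behavior the commit message describes (accurate LiteLLM counting when the model is known, character-based fallback otherwise); the function name and the ~4 characters/token heuristic are assumptions, not the project's actual code:

```python
import litellm


async def count_tokens(text: str, model_name: str | None = None) -> int:
    """Sketch: accurate count via LiteLLM when possible, rough heuristic otherwise."""
    if model_name:
        try:
            # Uses the model's own tokenizer when LiteLLM recognizes the model.
            return litellm.token_counter(model=model_name, text=text)
        except Exception:
            pass
    # Fallback: approximate tokens as characters / 4.
    return max(1, len(text) // 4)
```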