
Commit 3b73bbc

doriwal and dori authored
feat: Implement hybrid token-based conversation history system (#22)
* feat: Implement hybrid token-based conversation history system

## Summary
Implemented comprehensive token-based conversation history management that respects both record count and token limits (50K tokens max). The system uses a hybrid approach with efficient two-level filtering for optimal performance.

## Key Features Added

### 1. Token Calculation & Storage
- Added `tokens` field to the ConversationRecord model for storing the combined input + output token count
- Created `token_utils.py` with token calculation utilities (1 token ≈ 4 characters)
- Automatic token calculation and storage on every record save

### 2. Hybrid Database Cleanup (Save-time)
- Enhanced `_cleanup_old_messages()` with an efficient two-step process (see the sketch after this message):
  1. If the record count exceeds max_records, remove the single oldest record (records are added one at a time)
  2. If total tokens exceed 50K, remove the oldest records until the session is within the limit
- Maintains both the record count limit (20) and the token limit (50K) in persistent storage
- Sessions can hold fewer than 20 records if they contain large records

### 3. LLM Context Filtering (Load-time)
- Updated `load_context_for_enrichment()` to filter history for LLM context
- Ensures history + current prompt fits within token limits
- Filters the in-memory list without modifying the database
- Two-level approach: the DB enforces storage limits, the load path enforces LLM context limits

### 4. Constants & Configuration
- Added the `MAX_CONTEXT_TOKENS = 50000` constant
- Token limit integrated into the filtering utilities for consistent usage

## Files Modified

### Core Implementation
- `src/mcp_as_a_judge/constants.py` - Added MAX_CONTEXT_TOKENS constant
- `src/mcp_as_a_judge/db/interface.py` - Added tokens field to ConversationRecord
- `src/mcp_as_a_judge/db/providers/sqlite_provider.py` - Enhanced with hybrid cleanup logic
- `src/mcp_as_a_judge/db/conversation_history_service.py` - Updated load logic for LLM context

### New Utilities
- `src/mcp_as_a_judge/utils/__init__.py` - Created utils package
- `src/mcp_as_a_judge/utils/token_utils.py` - Token calculation and filtering utilities

### Comprehensive Testing
- `tests/test_token_based_history.py` - New comprehensive test suite (10 tests)
- `tests/test_conversation_history_lifecycle.py` - Enhanced existing tests with token verification

## Technical Improvements

### Performance Optimizations
- Simplified record count cleanup to remove exactly one record (matches the one-by-one addition pattern)
- Removed unnecessary parameter passing (limit=None) by relying on method defaults
- Efficient two-step cleanup process instead of recalculating everything

### Architecture Benefits
- **Write heavy, read light**: Enforce constraints at save time, keep loads simple
- **Two-level filtering**: Storage limits and LLM context limits serve different purposes
- **FIFO consistency**: The oldest records are removed first in both cleanup phases
- **Hybrid approach**: Respects whichever limit (record count or tokens) is more restrictive

## Test Coverage
- ✅ Token calculation accuracy (1 token ≈ 4 characters)
- ✅ Database token storage and retrieval
- ✅ Record count limit enforcement
- ✅ Token limit enforcement with FIFO removal
- ✅ Hybrid behavior (record vs. token limits)
- ✅ Mixed record sizes handling
- ✅ Edge cases and error conditions
- ✅ Integration with existing lifecycle tests
- ✅ Database cleanup during save operations
- ✅ LLM context filtering during load operations

## Backward Compatibility
- All existing functionality preserved
- Existing tests continue to pass
- Database schema extended (not breaking)
- API remains the same for consumers

## Usage Example

```python
# The system automatically handles both limits:
service = ConversationHistoryService(config)

# Save: enforces storage limits (record count + tokens)
await service.save_tool_interaction(session_id, tool, input, output)

# Load: filters for LLM context (history + prompt ≤ 50K tokens)
context = await service.load_context_for_enrichment(session_id)
```

The implementation provides a robust, efficient, and well-tested foundation for token-aware conversation history management.

* feat: during load, only verify the max token limit and filter old records accordingly

* feat: refactor AI code

* feat: refactor AI code

* feat: fix error

* feat: cleanup

* feat: fix response token

* feat: implement dynamic token limits with model-specific context management

This commit introduces a comprehensive token management system that replaces hardcoded limits with dynamic, model-specific token limits while maintaining backward compatibility.

## Key Features Added

### Dynamic Token Limits (NEW)
- `src/mcp_as_a_judge/db/dynamic_token_limits.py`: New module providing model-specific token limits with LiteLLM integration
- Initialization pattern: start with hardcoded defaults, upgrade from the cache or the LiteLLM API if available, and return whatever is available
- Caching system to avoid repeated API calls for model information

### Enhanced Token Calculation
- `src/mcp_as_a_judge/db/token_utils.py`: Upgraded to async functions with accurate LiteLLM token counting and a character-based fallback
- Unified model detection from the LLM config or the MCP sampling context
- Functions: `calculate_tokens_in_string`, `calculate_tokens_in_record`, `filter_records_by_token_limit` (all now async)

### Two-Level Token Management
- **Database level**: storage limits enforced during save operations
  - Record count limit (20 per session)
  - Token count limit (dynamic based on model, fallback to 50K)
  - LRU session cleanup (50 total sessions max)
- **Load level**: LLM context limits enforced during retrieval
  - Ensures history + current prompt fits within the model's input limit
  - FIFO removal of the oldest records when limits are exceeded

### Updated Service Layer
- `src/mcp_as_a_judge/db/conversation_history_service.py`: Added await for the async token filtering function
- `src/mcp_as_a_judge/db/providers/sqlite_provider.py`: Integrated dynamic token limits into cleanup operations

### Test Infrastructure
- `tests/test_helpers/`: New test utilities package
- `tests/test_helpers/token_utils_helpers.py`: Helper functions for token calculation testing and model cache management
- `tests/test_improved_token_counting.py`: Comprehensive async test suite
- Updated existing tests to support the async token functions

## Implementation Details

### Model Detection Strategy
1. Try the LLM configuration (fast, synchronous)
2. Try MCP sampling detection (async, requires context)
3. Fall back to None with hardcoded limits

### Token Limit Logic
- **On load**: Check total history + current prompt tokens against the model's max input
- **On save**: Two-step cleanup (record count limit, then token limit)
- **FIFO removal**: Always remove the oldest records first to preserve recent context

### Backward Compatibility
- All existing method signatures preserved with alias support
- Graceful fallback when model information is unavailable
- No breaking changes to existing functionality

## Files Changed
- Modified: 5 core files (service, provider, token utils, server)
- Added: 3 new files (dynamic limits, test helpers)
- Enhanced: 2 test files with async support

## Testing
- All 160 tests pass (1 skipped for integration-only)
- Comprehensive coverage of token calculation, limits, and cleanup logic
- Edge cases and error handling verified

This implementation follows the user's preferred patterns:
- Configuration-based approach with rational fallbacks
- Clean separation of concerns between storage and LLM limits
- Efficient FIFO cleanup maintaining recent conversation context

* feat: fix build

* feat: try to fix build

* feat: try to fix build

* feat: try to fix build

* feat: try to fix build

* feat: try to fix build

* feat: fix build

---------

Co-authored-by: dori <[email protected]>
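The save-time cleanup referenced above lives in `sqlite_provider._cleanup_old_messages()`, which is not part of this diff, so the following is only a minimal sketch of the two-step idea under assumed types: a session is modeled as a plain list of records ordered oldest-first, and the stored `tokens` field is summed directly.

```python
# Minimal sketch of the two-step save-time cleanup (illustrative, not the
# actual provider code). Records are assumed to be ordered oldest-first.
from dataclasses import dataclass


@dataclass
class StoredRecord:
    input: str
    output: str
    tokens: int  # combined input + output token count, as stored in the DB


def cleanup_session(
    records: list[StoredRecord],
    max_records: int = 20,
    max_tokens: int = 50_000,
) -> list[StoredRecord]:
    """Enforce both storage limits, dropping the oldest records first (FIFO)."""
    # Step 1: record-count limit. Records are saved one at a time, so at most
    # one record ever needs to be dropped here.
    if len(records) > max_records:
        records = records[1:]

    # Step 2: token limit. Keep dropping the oldest record until the session
    # fits within the token budget.
    while records and sum(r.tokens for r in records) > max_tokens:
        records = records[1:]

    return records
```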
1 parent a6874a6 commit 3b73bbc

13 files changed: +1306 −106 lines changed

src/mcp_as_a_judge/constants.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -15,3 +15,5 @@
 DATABASE_URL = "sqlite://:memory:"
 MAX_SESSION_RECORDS = 20  # Maximum records to keep per session (FIFO)
 MAX_TOTAL_SESSIONS = 50  # Maximum total sessions to keep (LRU cleanup)
+MAX_CONTEXT_TOKENS = 50000  # Maximum tokens for session context (1 token ≈ 4 characters)
+MAX_RESPONSE_TOKENS = 5000  # Maximum tokens for LLM responses
```

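The new constants rely on the "1 token ≈ 4 characters" heuristic. `token_utils.py` is not included in this diff, so the snippet below is only an assumed sketch of that character-based fallback and of how the two constants bound a request.

```python
from mcp_as_a_judge.constants import MAX_CONTEXT_TOKENS, MAX_RESPONSE_TOKENS


def estimate_tokens(text: str) -> int:
    """Character-based fallback: roughly one token per four characters."""
    return max(1, len(text) // 4)


# Roughly 200,000 characters of history would exhaust the 50K-token context
# budget, while responses are capped separately at MAX_RESPONSE_TOKENS.
history_fits = estimate_tokens("x" * 100_000) <= MAX_CONTEXT_TOKENS  # True
response_budget = MAX_RESPONSE_TOKENS  # 5000
```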
src/mcp_as_a_judge/db/conversation_history_service.py

Lines changed: 36 additions & 33 deletions
```diff
@@ -7,12 +7,17 @@
 3. Managing session-based conversation history
 """
 
+from typing import Any
+
 from mcp_as_a_judge.db import (
     ConversationHistoryDB,
     ConversationRecord,
     create_database_provider,
 )
 from mcp_as_a_judge.db.db_config import Config
+from mcp_as_a_judge.db.token_utils import (
+    filter_records_by_token_limit,
+)
 from mcp_as_a_judge.logging_config import get_logger
 
 # Set up logger
@@ -35,34 +40,54 @@ def __init__(
         self.config = config
         self.db = db_provider or create_database_provider(config)
 
-    async def load_context_for_enrichment(
-        self, session_id: str
+    async def load_filtered_context_for_enrichment(
+        self, session_id: str, current_prompt: str = "", ctx: Any = None
     ) -> list[ConversationRecord]:
         """
         Load recent conversation records for LLM context enrichment.
 
+        Two-level filtering approach:
+        1. Database already enforces storage limits (record count + token limits)
+        2. Load-time filtering ensures history + current prompt fits within LLM context limits
+
         Args:
             session_id: Session identifier
+            current_prompt: Current prompt that will be sent to LLM (for token calculation)
+            ctx: MCP context for model detection and accurate token counting (optional)
 
         Returns:
-            List of conversation records for LLM context
+            List of conversation records for LLM context (filtered for LLM limits)
         """
         logger.info(f"🔍 Loading conversation history for session: {session_id}")
 
-        # Load recent conversations for this session
-        recent_records = await self.db.get_session_conversations(
-            session_id=session_id,
-            limit=self.config.database.max_session_records,  # load last X records (same as save limit)
-        )
+        # Load all conversations for this session - database already contains
+        # records within storage limits, but we may need to filter further for LLM context
+        recent_records = await self.db.get_session_conversations(session_id)
 
         logger.info(f"📚 Retrieved {len(recent_records)} conversation records from DB")
-        return recent_records
 
-    async def save_tool_interaction(
+        # Apply LLM context filtering: ensure history + current prompt will fit within token limit
+        # This filters the list without modifying the database (only token limit matters for LLM)
+        # Pass ctx for accurate token counting when available
+        filtered_records = await filter_records_by_token_limit(
+            recent_records, current_prompt=current_prompt, ctx=ctx
+        )
+
+        logger.info(
+            f"✅ Returning {len(filtered_records)} conversation records for LLM context"
+        )
+        return filtered_records
+
+    async def save_tool_interaction_and_cleanup(
         self, session_id: str, tool_name: str, tool_input: str, tool_output: str
     ) -> str:
         """
-        Save a tool interaction as a conversation record.
+        Save a tool interaction as a conversation record and perform automatic cleanup in the provider layer.
+
+        After saving, the database provider automatically performs cleanup to enforce limits:
+        - Removes old records if session exceeds MAX_SESSION_RECORDS (20)
+        - Removes old records if session exceeds MAX_CONTEXT_TOKENS (50,000)
+        - Removes least recently used sessions if total sessions exceed MAX_TOTAL_SESSIONS (50)
 
         Args:
             session_id: Session identifier from AI agent
@@ -87,28 +112,6 @@ async def save_tool_interaction(
         logger.info(f"✅ Saved conversation record with ID: {record_id}")
         return record_id
 
-    async def get_conversation_history(
-        self, session_id: str
-    ) -> list[ConversationRecord]:
-        """
-        Get conversation history for a session to be injected into user prompts.
-
-        Args:
-            session_id: Session identifier
-
-        Returns:
-            List of conversation records for the session (most recent first)
-        """
-        logger.info(f"🔄 Loading conversation history for session {session_id}")
-
-        context_records = await self.load_context_for_enrichment(session_id)
-
-        logger.info(
-            f"📝 Retrieved {len(context_records)} conversation records for session {session_id}"
-        )
-
-        return context_records
-
     def format_conversation_history_as_json_array(
         self, conversation_history: list[ConversationRecord]
     ) -> list[dict]:
```
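A hedged usage sketch of the two renamed service methods above. The `Config()` construction, session id, and tool values are placeholders, and the optional `ctx` argument is omitted, so token counting falls back to the character-based estimate.

```python
import asyncio

from mcp_as_a_judge.db.conversation_history_service import ConversationHistoryService
from mcp_as_a_judge.db.db_config import Config


async def main() -> None:
    service = ConversationHistoryService(Config())  # default config is an assumption

    # Save: the provider enforces storage limits (record count + tokens) right after the write.
    await service.save_tool_interaction_and_cleanup(
        session_id="session-123",
        tool_name="judge_coding_plan",
        tool_input='{"plan": "..."}',
        tool_output='{"approved": true}',
    )

    # Load: history is filtered so it fits the LLM context together with the current prompt.
    records = await service.load_filtered_context_for_enrichment(
        session_id="session-123",
        current_prompt="Review the latest plan",
    )
    print(f"{len(records)} records will be sent as LLM context")


asyncio.run(main())
```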
src/mcp_as_a_judge/db/dynamic_token_limits.py

Lines changed: 131 additions & 0 deletions (new file)

```python
"""
Dynamic token limits based on actual model capabilities.

This module provides dynamic token limit calculation based on the actual model
being used, replacing hardcoded MAX_CONTEXT_TOKENS and MAX_RESPONSE_TOKENS
with model-specific limits from LiteLLM.
"""

from dataclasses import dataclass

from mcp_as_a_judge.constants import MAX_CONTEXT_TOKENS, MAX_RESPONSE_TOKENS
from mcp_as_a_judge.logging_config import get_logger

# Set up logger
logger = get_logger(__name__)


@dataclass
class ModelLimits:
    """Model-specific token limits."""

    context_window: int  # Total context window size
    max_input_tokens: int  # Maximum tokens for input (context + prompt)
    max_output_tokens: int  # Maximum tokens for output/response
    model_name: str  # Model name for reference
    source: str  # Where the limits came from ("litellm", "hardcoded", "estimated")


# Cache for model limits to avoid repeated API calls
_model_limits_cache: dict[str, ModelLimits] = {}


def get_model_limits(model_name: str | None = None) -> ModelLimits:
    """
    Get token limits: start with hardcoded, upgrade from cache or LiteLLM if available.
    """
    # Start with hardcoded defaults
    limits = ModelLimits(
        context_window=MAX_CONTEXT_TOKENS + MAX_RESPONSE_TOKENS,
        max_input_tokens=MAX_CONTEXT_TOKENS,
        max_output_tokens=MAX_RESPONSE_TOKENS,
        model_name=model_name or "unknown",
        source="hardcoded",
    )

    # If no model name, return hardcoded
    if not model_name:
        return limits

    # Try to upgrade from cache
    if model_name in _model_limits_cache:
        return _model_limits_cache[model_name]

    # Try to upgrade from LiteLLM
    try:
        import litellm

        model_info = litellm.get_model_info(model_name)

        # Extract values with proper fallbacks
        context_window = model_info.get("max_tokens")
        if context_window is not None:
            context_window = int(context_window)
        else:
            context_window = limits.context_window

        max_input_tokens = model_info.get("max_input_tokens")
        if max_input_tokens is not None:
            max_input_tokens = int(max_input_tokens)
        else:
            max_input_tokens = limits.max_input_tokens

        max_output_tokens = model_info.get("max_output_tokens")
        if max_output_tokens is not None:
            max_output_tokens = int(max_output_tokens)
        else:
            max_output_tokens = limits.max_output_tokens

        limits = ModelLimits(
            context_window=context_window,
            max_input_tokens=max_input_tokens,
            max_output_tokens=max_output_tokens,
            model_name=model_name,
            source="litellm",
        )

        # Cache and return what we have
        _model_limits_cache[model_name] = limits
        logger.debug(
            f"Retrieved model limits from LiteLLM for {model_name}: {limits.max_input_tokens} input tokens"
        )

    except ImportError:
        logger.debug("LiteLLM not available, using hardcoded defaults")
    except Exception as e:
        logger.debug(f"Failed to get model info from LiteLLM for {model_name}: {e}")
        # Continue with hardcoded defaults

    return limits


def get_llm_input_limit(model_name: str | None = None) -> int:
    """
    Get dynamic context token limit for conversation history.

    This replaces the hardcoded MAX_CONTEXT_TOKENS with model-specific limits.

    Args:
        model_name: Name of the model (optional)

    Returns:
        Maximum tokens for conversation history/context
    """
    limits = get_model_limits(model_name)
    return limits.max_input_tokens


def get_llm_output_limit(model_name: str | None = None) -> int:
    """
    Get dynamic response token limit for LLM output.

    This replaces the hardcoded MAX_RESPONSE_TOKENS with model-specific limits.

    Args:
        model_name: Name of the model (optional)

    Returns:
        Maximum tokens for LLM response/output
    """
    limits = get_model_limits(model_name)
    return limits.max_output_tokens
```

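A short usage example for the new module; the model name is a placeholder, and when LiteLLM is not installed or does not recognize the model, the functions simply return the hardcoded defaults.

```python
from mcp_as_a_judge.db.dynamic_token_limits import (
    get_llm_input_limit,
    get_llm_output_limit,
    get_model_limits,
)

# No model name: hardcoded defaults (50,000 input / 5,000 output tokens).
defaults = get_model_limits(None)
print(defaults.source, defaults.max_input_tokens, defaults.max_output_tokens)

# Placeholder model name: limits come from LiteLLM when available and are
# cached, so repeated calls skip the lookup.
print(get_llm_input_limit("gpt-4o-mini"), get_llm_output_limit("gpt-4o-mini"))
```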
src/mcp_as_a_judge/db/interface.py

Lines changed: 3 additions & 0 deletions
```diff
@@ -21,6 +21,9 @@ class ConversationRecord(SQLModel, table=True):
     source: str  # tool name
     input: str  # tool input query
     output: str  # tool output string
+    tokens: int = Field(
+        default=0
+    )  # combined token count for input + output (1 token ≈ 4 characters)
     timestamp: datetime = Field(
         default_factory=datetime.utcnow, index=True
     )  # when the record was created
```
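A sketch of how the new `tokens` field might be filled when a record is created. Only the fields visible in this hunk are set, other ConversationRecord fields (such as the session identifier) are omitted, and the synchronous 4-characters-per-token estimate stands in for the async LiteLLM-based calculation, so treat this as illustrative rather than the provider's actual save path.

```python
from mcp_as_a_judge.db.interface import ConversationRecord


def build_record(source: str, tool_input: str, tool_output: str) -> ConversationRecord:
    # Combined input + output size, using the ~4 characters per token heuristic.
    estimated_tokens = (len(tool_input) + len(tool_output)) // 4
    return ConversationRecord(
        source=source,
        input=tool_input,
        output=tool_output,
        tokens=estimated_tokens,
        # timestamp defaults to datetime.utcnow via the model definition
    )
```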
