A comprehensive framework with context and memory management capabilities in AI agents using a healthy diet planning benchmark.
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Context-Memory-Management Framework β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β βββββββββββββββββββ βββββββββββββββββββ ββββββββββββββββ β
β β Discord Bot β β Core Agent β β Evaluation β β
β β (discordBot) β β(agent w/ rag.py)β β Framework β β
β β β β β β β β
β β β’ User Input βββββΊβ β’ GeminiClient ββββββ β’ Baseline β β
β β β’ Mention β β β’ Calculator β β Evaluation β β
β β β’ Logging β β β’ Web Search β β β’ Analysis β β
β β β β β’ Rate Limiting β β β’ Reporting β β
β βββββββββββββββββββ βββββββββββββββββββ ββββββββββββββββ β
β β β β β² β
β β β β β β
β β βββββββββββββββββββββ β β β
β β β β β β
β β βΌ βΌ β β
β βββββββββββββββββββ βββββββββββββββββββ ββββββββββββββββ β
β β External β β Tools & β β Benchmark β β
β β APIs β β Helpers β β Dataset β β
β β β β β β β β
β β β’ Gemini API ββββββ β’ Calculator ββββββ β’ 15 Tests β β
β β β’ Tavily Search β β β’ Search Tool β β β’ 4 Users β β
β β β’ Discord API β β β’ Grading Help β β β’ Multi-turn β β
β β β β β’ Validation β β β’ Memory Dep β β
β βββββββββββββββββββ βββββββββββββββββββ ββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β Data Flow β
β β
β User Input β Discord Bot β Core Agent β Response β
β β β
β Benchmark β Evaluation β Analysis β Reports β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
| Core Agent Data Flow |
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
| |
| ββββββββββββββββββββββββββββββββββββββββ |
| | (1) Input | |
| |βββββββββββββββββββββββββββββββββββββββ€ |
| | β Agent.chat_with_tools() | |
| ββββββββββββββββββββββββββββββββββββββββ |
| |
| |
| βββββββββββββββββββββββββββββββββββββββ ββββββββββββββββββββββββββββββββ |
| | (2) Tool Planning Loop | | (3) Tool Execution | |
| |ββββββββββββββββββββββββββββββββββββββ€ tool call |βββββββββββββββββββββββββββββββ€ |
| | - Agent.determine_if_calc_needed() | request | - CalculatorTool.calculate() | |
| | - Agent.refine_calc_expression() |----------->| - TavilyClient.search() | |
| | - Agent.determine_if_search_needed()| | - RateLimiter.acquire() | |
| | - Agent.refine_search_term() | | - _retry_with_backoff() | |
| βββββββββββββββββββββββββββββββββββββββ ββββββββββββββββββββββββββββββββ |
| | ^ | |
| | | tool result & updated history | |
| | βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ |
| | |
| | no more tool calls needed |
| v |
| βββββββββββββββββββββββββββββββββββββββ |
| | (4) Generate Final Answer & Output | |
| |ββββββββββββββββββββββββββββββββββββββ€ |
| | - Agent.generate_response() | |
| | - GeminiClient.infer() | |
| βββββββββββββββββββββββββββββββββββββββ |
| |
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
- GeminiClient: Wrapper for Google's Gemini API with rate limiting and retry logic
- CalculatorTool: Safe arithmetic expression evaluator for nutrition calculations
- Agent: Main agent class with tool-calling capabilities (search + calculator) and context-aware response generation
- RateLimiter: Prevents API rate limit violations
- SimpleLogger: Structured logging for debugging and analysis
- HybridMemoryManager: Manages user profiles, conversation history, and menu tracking with SQLite persistence
- UserProfile: Structured dataclass for user preferences (calories, allergies, equipment, etc.)
- MenuExtractor: Extracts meals from responses and enforces variety rules
- Database Schema: Three tables (user_profiles, conversation_history, menu_history) with indexed queries
- Context Building: Intelligently combines profile + recent conversation + menu history for LLM prompts
- See MEMORY_SYSTEM.md for detailed documentation
- RAGSystem: Semantic search over conversation and menu history using BGE embeddings (BAAI/bge-small-en-v1.5)
- BackgroundIndexer: Asynchronous indexing in background thread to avoid blocking user requests
- EnhancedMemoryManager: Combines traditional memory with semantic search for smarter context retrieval
- Benefits: Find relevant past conversations by meaning, retrieve similar dishes semantically, reduce LLM context size
- Performance: ~2ms overhead per query, runs locally with no API costs
- Enable with
USE_RAG=truein.env
- Discord integration for real-time agent interaction
- Responds to mentions in "general" channel
- Logs user interactions with timestamps
baseline_evaluation.py: Comprehensive baseline evaluation without context managementanalyze_results.py: Analysis tools for evaluation results with trend analysisgrading_helpers.py: Helper functions for nutrition validation and user requirement checking
This pack contains:
healthy_diet_benchmark.jsonlβ 15 multi-turn conversations across 4 users (short/medium/long). Each object includes:id,user_id,session_id,length,required_tools,rotation_policy,memory_dependencies(intra/inter-session),turns(assistant messages left blank for evaluation),ground_truth(pass/fail criteria),notes(what this test stresses).
grading_helpers.pyβ tiny helpers to map ambiguous phrases (e.g., βmy usualβ) to hard anchors and verify variety and macro rules.
Use these anchors when a user says βmy usualβ:
- u01 β 1800 kcal/day; β₯140 g protein; β₯30 g fiber; US units; peanut allergy; stove/oven only; Med/Mex; no blender.
- u02 β 1600 kcal/day; β₯110 g protein; metric; vegetarian; lactose-free; 12β20 fasting; microwave + rice cooker only.
- u03 β 2000 kcal/day; β₯150 g protein; US units; halal; low-glycemic; grill + air fryer; (fiber β₯30 g where specified).
- u04 β 2200 kcal/day; β₯130 g protein; fiber β₯30 g; US units; pescatarian; no tuna; Japanese/Thai.
Every test object has a rotation_policy. Enforce:
no_repeat_days: no exact dish repeats within that window.max_same_primary_protein_per_week: cap per primary protein across the plan.ingredient_jaccard_max: keep day-to-day ingredient overlap below this threshold.
All tests require both tools. Log tool usage (e.g., ["search","calculator"]) and validate with check_tool_usage.
Amazon Lightsail Ubuntu 24.04 TLS
Recommended Size: 1 GB Memory / 2 vCPUs / 40 GB SSD / 2 TB Transfer
-
Install dependencies:
pip install -r requirements.txt -
Create
.envfile with API keys (see.env.examplefor full options):Option A: Using Gemini (default)
LLM_PROVIDER=gemini GEMINI_API_KEY=your_gemini_key TAVILY_API_KEY=your_tavily_key DISCORD_TOKEN=your_discord_token
Option B: Using OpenRouter (supports Claude, GPT-4, Llama, etc.)
LLM_PROVIDER=openrouter OPENROUTER_API_KEY=your_openrouter_key OPENROUTER_MODEL=anthropic/claude-3.5-sonnet TAVILY_API_KEY=your_tavily_key DISCORD_TOKEN=your_discord_token
Get OpenRouter API key: https://openrouter.ai/keys Available models: https://openrouter.ai/models
-
Optional: Enable RAG for semantic search
# Add to .env USE_RAG=true # Install BGE embeddings pip install sentence-transformers
What RAG adds:
- Semantic search: "breakfast like before" finds similar breakfasts, not just recent ones
- Background indexing: No blocking - indexing happens in separate thread
- Smart context: Only includes most relevant past conversations/meals
- Local & free: Runs on your machine, no API costs
-
For quick access to our deployed instance, join our discord channel https://discord.gg/Ur7dS9Fut2 and @ the bot in
#general
- Baseline evaluation:
python baseline_evaluation.py - Results analysis:
python analyze_results.py - Discord bot with memory:
python discordBot.py - Test memory system:
python test_memory_system.py
- Load JSONL line-by-line. For each test, run your agent over the
turnsand capture outputs. - Compare outputs against
ground_truthusing the helpers or your own evaluator. - Provide any
menu_historyand prior-session context your harness maintains to check variety.
- Task Completion Rate: Overall pass/fail rate across all tests
- Nutrition Validation: Proper macro/micro nutrient calculations
- User Requirements: Adherence to dietary restrictions and preferences
- Context Handling: Memory dependency resolution
- Variety Rules: No repeats, protein rotation, ingredient diversity
- Timing Constraints: Fasting windows, meal timing
- Tool Usage: Calculator and search tool utilization
- Inter-session Memory: Cross-session context retention
- Length-based performance analysis (short/medium/long conversations)
- User-specific requirement tracking
- Context weakness identification
- Trend analysis across conversation types
We evaluated the system on a benchmark of 15 multi-turn diet-planning tasks, comparing the Phase 1 Baseline (stateless) against the Phase 2 Final System (Memory + RAG).
The enhanced agent achieved a 4x improvement in task completion rate and significantly higher nutrition validity.
| Metric | Baseline (Stateless) | Final (Memory + RAG) | Improvement |
|---|---|---|---|
| Task Completion Rate | 13.3% (2/15) | 53.3% (8/15) | +40.0 pp |
| Nutrition Validity | 0/15 Valid Plans | 8/15 Valid Plans | +8 plans |
| User Consistency (u01) | 25% | 100% | +75 pp |
While the memory overhead increases latency slightly, the system is significantly more efficient at producing successful plans.
- Latency: Average latency increased by ~22% (20.8s β 25.5s) due to retrieval and verification overhead.
- Token Efficiency: Tokens per successful plan dropped by 35% (12.8k β 8.3k).
- Cost Implication: The system is "expensive but worthwhile"βit spends more compute per interaction but wastes substantially less on plans that ultimately fail safety or nutrition requirements.
Zihao Wang, Ye Tian, Yiming Zhao, Hengzhou Li, Ziqiao Xi University of California, San Diego December 9, 2025
