Skip to content

ConstBob/Context-Memory-Management-for-Agents

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

27 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Context-Memory-Management-for-Agents

A comprehensive framework with context and memory management capabilities in AI agents using a healthy diet planning benchmark.

Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                Context-Memory-Management Framework              β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚   Discord Bot   β”‚    β”‚   Core Agent    β”‚    β”‚  Evaluation  β”‚ β”‚
β”‚  β”‚   (discordBot)  β”‚    β”‚(agent w/ rag.py)β”‚    β”‚  Framework   β”‚ β”‚
β”‚  β”‚                 β”‚    β”‚                 β”‚    β”‚              β”‚ β”‚
β”‚  β”‚ β€’ User Input    │───►│ β€’ GeminiClient  │◄───│ β€’ Baseline   β”‚ β”‚
β”‚  β”‚ β€’ Mention       β”‚    β”‚ β€’ Calculator    β”‚    β”‚   Evaluation β”‚ β”‚
β”‚  β”‚ β€’ Logging       β”‚    β”‚ β€’ Web Search    β”‚    β”‚ β€’ Analysis   β”‚ β”‚
β”‚  β”‚                 β”‚    β”‚ β€’ Rate Limiting β”‚    β”‚ β€’ Reporting  β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚           β”‚                     β”‚ β”‚                       β–²     β”‚
β”‚           β”‚                     β”‚ β”‚                       β”‚     β”‚
β”‚           β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚                       β”‚     β”‚
β”‚           β”‚ β”‚                     β”‚                       β”‚     β”‚
β”‚           β”‚ β–Ό                     β–Ό                       β”‚     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚   External      β”‚    β”‚   Tools &       β”‚    β”‚  Benchmark   β”‚ β”‚
β”‚  β”‚   APIs          β”‚    β”‚   Helpers       β”‚    β”‚  Dataset     β”‚ β”‚
β”‚  β”‚                 β”‚    β”‚                 β”‚    β”‚              β”‚ β”‚
β”‚  β”‚ β€’ Gemini API    │◄───│ β€’ Calculator    │◄───│ β€’ 15 Tests   β”‚ β”‚
β”‚  β”‚ β€’ Tavily Search β”‚    β”‚ β€’ Search Tool   β”‚    β”‚ β€’ 4 Users    β”‚ β”‚
β”‚  β”‚ β€’ Discord API   β”‚    β”‚ β€’ Grading Help  β”‚    β”‚ β€’ Multi-turn β”‚ β”‚
β”‚  β”‚                 β”‚    β”‚ β€’ Validation    β”‚    β”‚ β€’ Memory Dep β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚                                                                 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                        Data Flow                                β”‚
β”‚                                                                 β”‚
β”‚  User Input β†’ Discord Bot β†’ Core Agent β†’ Response               β”‚
β”‚       ↓                                                         β”‚
β”‚  Benchmark β†’ Evaluation β†’ Analysis β†’ Reports                    β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
|                                Core Agent Data Flow                                   |
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
|                                                                                       |
|  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                             |
|  | (1) Input                            |                                             |
|  |───────────────────────────────────────                                             |
|  | ─ Agent.chat_with_tools()            |                                             |
|  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                             |
|                                                                                       |
|                                                                                       |
|  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  |
|  | (2) Tool Planning Loop              |            | (3) Tool Execution           |  |
|  |────────────────────────────────────── tool call  |───────────────────────────────  |
|  | - Agent.determine_if_calc_needed()  |  request   | - CalculatorTool.calculate() |  |
|  | - Agent.refine_calc_expression()    |----------->| - TavilyClient.search()      |  |
|  | - Agent.determine_if_search_needed()|            | - RateLimiter.acquire()      |  |
|  | - Agent.refine_search_term()        |            | - _retry_with_backoff()      |  |
|  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜            β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  |
|      |         ^                                                             |        |
|      |         |  tool result & updated history                              |        |
|      |         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        |
|      |                                                                                |
|      | no more tool calls needed                                                      |
|      v                                                                                |
|  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                              |
|  | (4) Generate Final Answer & Output  |                                              |
|  |──────────────────────────────────────                                              |
|  | - Agent.generate_response()         |                                              |
|  | - GeminiClient.infer()              |                                              |
|  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                              |
|                                                                                       |
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Components

Core Agent (agent.py)

  • GeminiClient: Wrapper for Google's Gemini API with rate limiting and retry logic
  • CalculatorTool: Safe arithmetic expression evaluator for nutrition calculations
  • Agent: Main agent class with tool-calling capabilities (search + calculator) and context-aware response generation
  • RateLimiter: Prevents API rate limit violations
  • SimpleLogger: Structured logging for debugging and analysis

Hybrid Memory System (memory_manager.py, menu_extractor.py)

  • HybridMemoryManager: Manages user profiles, conversation history, and menu tracking with SQLite persistence
  • UserProfile: Structured dataclass for user preferences (calories, allergies, equipment, etc.)
  • MenuExtractor: Extracts meals from responses and enforces variety rules
  • Database Schema: Three tables (user_profiles, conversation_history, menu_history) with indexed queries
  • Context Building: Intelligently combines profile + recent conversation + menu history for LLM prompts
  • See MEMORY_SYSTEM.md for detailed documentation

RAG System (rag_system.py) - Optional Enhancement

  • RAGSystem: Semantic search over conversation and menu history using BGE embeddings (BAAI/bge-small-en-v1.5)
  • BackgroundIndexer: Asynchronous indexing in background thread to avoid blocking user requests
  • EnhancedMemoryManager: Combines traditional memory with semantic search for smarter context retrieval
  • Benefits: Find relevant past conversations by meaning, retrieve similar dishes semantically, reduce LLM context size
  • Performance: ~2ms overhead per query, runs locally with no API costs
  • Enable with USE_RAG=true in .env

Discord Bot (discordBot.py)

  • Discord integration for real-time agent interaction
  • Responds to mentions in "general" channel
  • Logs user interactions with timestamps

Evaluation Framework

  • baseline_evaluation.py: Comprehensive baseline evaluation without context management
  • analyze_results.py: Analysis tools for evaluation results with trend analysis
  • grading_helpers.py: Helper functions for nutrition validation and user requirement checking

Healthy Diet Agent Benchmark (JSONL) + Grading Helpers

This pack contains:

  • healthy_diet_benchmark.jsonl β€” 15 multi-turn conversations across 4 users (short/medium/long). Each object includes:
    • id, user_id, session_id, length, required_tools, rotation_policy,
    • memory_dependencies (intra/inter-session),
    • turns (assistant messages left blank for evaluation),
    • ground_truth (pass/fail criteria),
    • notes (what this test stresses).
  • grading_helpers.py β€” tiny helpers to map ambiguous phrases (e.g., β€œmy usual”) to hard anchors and verify variety and macro rules.

β€œMy usual” β†’ anchors

Use these anchors when a user says β€œmy usual”:

  • u01 β†’ 1800 kcal/day; β‰₯140 g protein; β‰₯30 g fiber; US units; peanut allergy; stove/oven only; Med/Mex; no blender.
  • u02 β†’ 1600 kcal/day; β‰₯110 g protein; metric; vegetarian; lactose-free; 12–20 fasting; microwave + rice cooker only.
  • u03 β†’ 2000 kcal/day; β‰₯150 g protein; US units; halal; low-glycemic; grill + air fryer; (fiber β‰₯30 g where specified).
  • u04 β†’ 2200 kcal/day; β‰₯130 g protein; fiber β‰₯30 g; US units; pescatarian; no tuna; Japanese/Thai.

Variety rules

Every test object has a rotation_policy. Enforce:

  • no_repeat_days: no exact dish repeats within that window.
  • max_same_primary_protein_per_week: cap per primary protein across the plan.
  • ingredient_jaccard_max: keep day-to-day ingredient overlap below this threshold.

Tool usage

All tests require both tools. Log tool usage (e.g., ["search","calculator"]) and validate with check_tool_usage.

Running

Server Environment

Amazon Lightsail Ubuntu 24.04 TLS

Recommended Size: 1 GB Memory / 2 vCPUs / 40 GB SSD / 2 TB Transfer

Setup

  1. Install dependencies: pip install -r requirements.txt

  2. Create .env file with API keys (see .env.example for full options):

    Option A: Using Gemini (default)

    LLM_PROVIDER=gemini
    GEMINI_API_KEY=your_gemini_key
    TAVILY_API_KEY=your_tavily_key
    DISCORD_TOKEN=your_discord_token

    Option B: Using OpenRouter (supports Claude, GPT-4, Llama, etc.)

    LLM_PROVIDER=openrouter
    OPENROUTER_API_KEY=your_openrouter_key
    OPENROUTER_MODEL=anthropic/claude-3.5-sonnet
    TAVILY_API_KEY=your_tavily_key
    DISCORD_TOKEN=your_discord_token

    Get OpenRouter API key: https://openrouter.ai/keys Available models: https://openrouter.ai/models

  3. Optional: Enable RAG for semantic search

    # Add to .env
    USE_RAG=true
    
    # Install BGE embeddings
    pip install sentence-transformers

    What RAG adds:

    • Semantic search: "breakfast like before" finds similar breakfasts, not just recent ones
    • Background indexing: No blocking - indexing happens in separate thread
    • Smart context: Only includes most relevant past conversations/meals
    • Local & free: Runs on your machine, no API costs
  4. For quick access to our deployed instance, join our discord channel https://discord.gg/Ur7dS9Fut2 and @ the bot in #general

    Query Example

Evaluation

  • Baseline evaluation: python baseline_evaluation.py
  • Results analysis: python analyze_results.py
  • Discord bot with memory: python discordBot.py
  • Test memory system: python test_memory_system.py

Benchmark Usage

  • Load JSONL line-by-line. For each test, run your agent over the turns and capture outputs.
  • Compare outputs against ground_truth using the helpers or your own evaluator.
  • Provide any menu_history and prior-session context your harness maintains to check variety.

Evaluation Metrics

Core Metrics

  • Task Completion Rate: Overall pass/fail rate across all tests
  • Nutrition Validation: Proper macro/micro nutrient calculations
  • User Requirements: Adherence to dietary restrictions and preferences
  • Context Handling: Memory dependency resolution

Advanced Metrics

  • Variety Rules: No repeats, protein rotation, ingredient diversity
  • Timing Constraints: Fasting windows, meal timing
  • Tool Usage: Calculator and search tool utilization
  • Inter-session Memory: Cross-session context retention

Analysis Features

  • Length-based performance analysis (short/medium/long conversations)
  • User-specific requirement tracking
  • Context weakness identification
  • Trend analysis across conversation types

Experimental Results

We evaluated the system on a benchmark of 15 multi-turn diet-planning tasks, comparing the Phase 1 Baseline (stateless) against the Phase 2 Final System (Memory + RAG).

Quality & Accuracy

The enhanced agent achieved a 4x improvement in task completion rate and significantly higher nutrition validity.

Metric Baseline (Stateless) Final (Memory + RAG) Improvement
Task Completion Rate 13.3% (2/15) 53.3% (8/15) +40.0 pp
Nutrition Validity 0/15 Valid Plans 8/15 Valid Plans +8 plans
User Consistency (u01) 25% 100% +75 pp

Efficiency & Latency

While the memory overhead increases latency slightly, the system is significantly more efficient at producing successful plans.

  • Latency: Average latency increased by ~22% (20.8s β†’ 25.5s) due to retrieval and verification overhead.
  • Token Efficiency: Tokens per successful plan dropped by 35% (12.8k β†’ 8.3k).
  • Cost Implication: The system is "expensive but worthwhile"β€”it spends more compute per interaction but wastes substantially less on plans that ultimately fail safety or nutrition requirements.

Contributors

Zihao Wang, Ye Tian, Yiming Zhao, Hengzhou Li, Ziqiao Xi University of California, San Diego December 9, 2025

About

CSE 291A Fall 2025 Course Project

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •  

Languages