Context-Memory-Management-for-Agents

A comprehensive framework with context and memory management capabilities in AI agents using a healthy diet planning benchmark.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                Context-Memory-Management Framework              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────┐    ┌─────────────────┐    ┌──────────────┐ │
│  │   Discord Bot   │    │   Core Agent    │    │  Evaluation  │ │
│  │   (discordBot)  │    │(agent w/ rag.py)│    │  Framework   │ │
│  │                 │    │                 │    │              │ │
│  │ • User Input    │───►│ • GeminiClient  │◄───│ • Baseline   │ │
│  │ • Mention       │    │ • Calculator    │    │   Evaluation │ │
│  │ • Logging       │    │ • Web Search    │    │ • Analysis   │ │
│  │                 │    │ • Rate Limiting │    │ • Reporting  │ │
│  └─────────────────┘    └─────────────────┘    └──────────────┘ │
│           │                     │ │                       ▲     │
│           │                     │ │                       │     │
│           │ ┌───────────────────┘ │                       │     │
│           │ │                     │                       │     │
│           │ ▼                     ▼                       │     │
│  ┌─────────────────┐    ┌─────────────────┐    ┌──────────────┐ │
│  │   External      │    │   Tools &       │    │  Benchmark   │ │
│  │   APIs          │    │   Helpers       │    │  Dataset     │ │
│  │                 │    │                 │    │              │ │
│  │ • Gemini API    │◄───│ • Calculator    │◄───│ • 15 Tests   │ │
│  │ • Tavily Search │    │ • Search Tool   │    │ • 4 Users    │ │
│  │ • Discord API   │    │ • Grading Help  │    │ • Multi-turn │ │
│  │                 │    │ • Validation    │    │ • Memory Dep │ │
│  └─────────────────┘    └─────────────────┘    └──────────────┘ │
│                                                                 │
├─────────────────────────────────────────────────────────────────┤
│                        Data Flow                                │
│                                                                 │
│  User Input → Discord Bot → Core Agent → Response               │
│       ↓                                                         │
│  Benchmark → Evaluation → Analysis → Reports                    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────────────────────────────────┐
|                                Core Agent Data Flow                                   |
├───────────────────────────────────────────────────────────────────────────────────────┤
|                                                                                       |
|  ┌──────────────────────────────────────┐                                             |
|  | (1) Input                            |                                             |
|  |──────────────────────────────────────┤                                             |
|  | ─ Agent.chat_with_tools()            |                                             |
|  └──────────────────────────────────────┘                                             |
|                                                                                       |
|                                                                                       |
|  ┌─────────────────────────────────────┐            ┌──────────────────────────────┐  |
|  | (2) Tool Planning Loop              |            | (3) Tool Execution           |  |
|  |─────────────────────────────────────┤ tool call  |──────────────────────────────┤  |
|  | - Agent.determine_if_calc_needed()  |  request   | - CalculatorTool.calculate() |  |
|  | - Agent.refine_calc_expression()    |----------->| - TavilyClient.search()      |  |
|  | - Agent.determine_if_search_needed()|            | - RateLimiter.acquire()      |  |
|  | - Agent.refine_search_term()        |            | - _retry_with_backoff()      |  |
|  └─────────────────────────────────────┘            └──────────────────────────────┘  |
|      |         ^                                                             |        |
|      |         |  tool result & updated history                              |        |
|      |         └─────────────────────────────────────────────────────────────┘        |
|      |                                                                                |
|      | no more tool calls needed                                                      |
|      v                                                                                |
|  ┌─────────────────────────────────────┐                                              |
|  | (4) Generate Final Answer & Output  |                                              |
|  |─────────────────────────────────────┤                                              |
|  | - Agent.generate_response()         |                                              |
|  | - GeminiClient.infer()              |                                              |
|  └─────────────────────────────────────┘                                              |
|                                                                                       |
└───────────────────────────────────────────────────────────────────────────────────────┘

Components

Core Agent (`agent.py`)

GeminiClient: Wrapper for Google's Gemini API with rate limiting and retry logic
CalculatorTool: Safe arithmetic expression evaluator for nutrition calculations
Agent: Main agent class with tool-calling capabilities (search + calculator) and context-aware response generation
RateLimiter: Prevents API rate limit violations
SimpleLogger: Structured logging for debugging and analysis

Hybrid Memory System (`memory_manager.py`, `menu_extractor.py`)

HybridMemoryManager: Manages user profiles, conversation history, and menu tracking with SQLite persistence
UserProfile: Structured dataclass for user preferences (calories, allergies, equipment, etc.)
MenuExtractor: Extracts meals from responses and enforces variety rules
Database Schema: Three tables (user_profiles, conversation_history, menu_history) with indexed queries
Context Building: Intelligently combines profile + recent conversation + menu history for LLM prompts
See MEMORY_SYSTEM.md for detailed documentation

RAG System (`rag_system.py`) - Optional Enhancement

RAGSystem: Semantic search over conversation and menu history using BGE embeddings (BAAI/bge-small-en-v1.5)
BackgroundIndexer: Asynchronous indexing in background thread to avoid blocking user requests
EnhancedMemoryManager: Combines traditional memory with semantic search for smarter context retrieval
Benefits: Find relevant past conversations by meaning, retrieve similar dishes semantically, reduce LLM context size
Performance: ~2ms overhead per query, runs locally with no API costs
Enable with USE_RAG=true in .env

Discord Bot (`discordBot.py`)

Discord integration for real-time agent interaction
Responds to mentions in "general" channel
Logs user interactions with timestamps

Evaluation Framework

baseline_evaluation.py: Comprehensive baseline evaluation without context management
analyze_results.py: Analysis tools for evaluation results with trend analysis
grading_helpers.py: Helper functions for nutrition validation and user requirement checking

Healthy Diet Agent Benchmark (JSONL) + Grading Helpers

This pack contains:

healthy_diet_benchmark.jsonl — 15 multi-turn conversations across 4 users (short/medium/long). Each object includes:
- id, user_id, session_id, length, required_tools, rotation_policy,
- memory_dependencies (intra/inter-session),
- turns (assistant messages left blank for evaluation),
- ground_truth (pass/fail criteria),
- notes (what this test stresses).
grading_helpers.py — tiny helpers to map ambiguous phrases (e.g., “my usual”) to hard anchors and verify variety and macro rules.

“My usual” → anchors

Use these anchors when a user says “my usual”:

u01 → 1800 kcal/day; ≥140 g protein; ≥30 g fiber; US units; peanut allergy; stove/oven only; Med/Mex; no blender.
u02 → 1600 kcal/day; ≥110 g protein; metric; vegetarian; lactose-free; 12–20 fasting; microwave + rice cooker only.
u03 → 2000 kcal/day; ≥150 g protein; US units; halal; low-glycemic; grill + air fryer; (fiber ≥30 g where specified).
u04 → 2200 kcal/day; ≥130 g protein; fiber ≥30 g; US units; pescatarian; no tuna; Japanese/Thai.

Variety rules

Every test object has a rotation_policy. Enforce:

no_repeat_days: no exact dish repeats within that window.
max_same_primary_protein_per_week: cap per primary protein across the plan.
ingredient_jaccard_max: keep day-to-day ingredient overlap below this threshold.

Tool usage

All tests require both tools. Log tool usage (e.g., ["search","calculator"]) and validate with check_tool_usage.

Running

Server Environment

Amazon Lightsail Ubuntu 24.04 TLS

Recommended Size: 1 GB Memory / 2 vCPUs / 40 GB SSD / 2 TB Transfer

Setup

Install dependencies: pip install -r requirements.txt

Create .env file with API keys (see .env.example for full options):

Option A: Using Gemini (default)

LLM_PROVIDER=gemini
GEMINI_API_KEY=your_gemini_key
TAVILY_API_KEY=your_tavily_key
DISCORD_TOKEN=your_discord_token

Option B: Using OpenRouter (supports Claude, GPT-4, Llama, etc.)

LLM_PROVIDER=openrouter
OPENROUTER_API_KEY=your_openrouter_key
OPENROUTER_MODEL=anthropic/claude-3.5-sonnet
TAVILY_API_KEY=your_tavily_key
DISCORD_TOKEN=your_discord_token

Get OpenRouter API key: https://openrouter.ai/keys Available models: https://openrouter.ai/models

Optional: Enable RAG for semantic search
```
# Add to .env
USE_RAG=true

# Install BGE embeddings
pip install sentence-transformers
```
What RAG adds:
- Semantic search: "breakfast like before" finds similar breakfasts, not just recent ones
- Background indexing: No blocking - indexing happens in separate thread
- Smart context: Only includes most relevant past conversations/meals
- Local & free: Runs on your machine, no API costs
For quick access to our deployed instance, join our discord channel https://discord.gg/Ur7dS9Fut2 and @ the bot in #general

Evaluation

Baseline evaluation: python baseline_evaluation.py
Results analysis: python analyze_results.py
Discord bot with memory: python discordBot.py
Test memory system: python test_memory_system.py

Benchmark Usage

Load JSONL line-by-line. For each test, run your agent over the turns and capture outputs.
Compare outputs against ground_truth using the helpers or your own evaluator.
Provide any menu_history and prior-session context your harness maintains to check variety.

Evaluation Metrics

Core Metrics

Task Completion Rate: Overall pass/fail rate across all tests
Nutrition Validation: Proper macro/micro nutrient calculations
User Requirements: Adherence to dietary restrictions and preferences
Context Handling: Memory dependency resolution

Advanced Metrics

Variety Rules: No repeats, protein rotation, ingredient diversity
Timing Constraints: Fasting windows, meal timing
Tool Usage: Calculator and search tool utilization
Inter-session Memory: Cross-session context retention

Analysis Features

Length-based performance analysis (short/medium/long conversations)
User-specific requirement tracking
Context weakness identification
Trend analysis across conversation types

Experimental Results

We evaluated the system on a benchmark of 15 multi-turn diet-planning tasks, comparing the Phase 1 Baseline (stateless) against the Phase 2 Final System (Memory + RAG).

Quality & Accuracy

The enhanced agent achieved a 4x improvement in task completion rate and significantly higher nutrition validity.

Metric	Baseline (Stateless)	Final (Memory + RAG)	Improvement
Task Completion Rate	13.3% (2/15)	53.3% (8/15)	+40.0 pp
Nutrition Validity	0/15 Valid Plans	8/15 Valid Plans	+8 plans
User Consistency (u01)	25%	100%	+75 pp

Efficiency & Latency

While the memory overhead increases latency slightly, the system is significantly more efficient at producing successful plans.

Latency: Average latency increased by ~22% (20.8s → 25.5s) due to retrieval and verification overhead.
Token Efficiency: Tokens per successful plan dropped by 35% (12.8k → 8.3k).
Cost Implication: The system is "expensive but worthwhile"—it spends more compute per interaction but wastes substantially less on plans that ultimately fail safety or nutrition requirements.

Contributors

Zihao Wang, Ye Tian, Yiming Zhao, Hengzhou Li, Ziqiao Xi University of California, San Diego December 9, 2025

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
evaluation_results_baseline		evaluation_results_baseline
evaluation_results_final		evaluation_results_final
logs		logs
tutorial		tutorial
.gitignore		.gitignore
MEMORY_SYSTEM.md		MEMORY_SYSTEM.md
README.md		README.md
agent.py		agent.py
agent_with_rag.py		agent_with_rag.py
analysis_report.txt		analysis_report.txt
analyze_results.py		analyze_results.py
baseline_evaluation.py		baseline_evaluation.py
discordBot.py		discordBot.py
grading_helpers.py		grading_helpers.py
healthy_diet_benchmark.jsonl		healthy_diet_benchmark.jsonl
log.txt		log.txt
maodie.jpg		maodie.jpg
memory_enhanced_evaluation.py		memory_enhanced_evaluation.py
memory_manager.py		memory_manager.py
menu_extractor.py		menu_extractor.py
rag_system.py		rag_system.py
requirements.txt		requirements.txt
test_memory_simple.py		test_memory_simple.py
test_memory_system.py		test_memory_system.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Context-Memory-Management-for-Agents

Architecture

Components

Core Agent (`agent.py`)

Hybrid Memory System (`memory_manager.py`, `menu_extractor.py`)

RAG System (`rag_system.py`) - Optional Enhancement

Discord Bot (`discordBot.py`)

Evaluation Framework

Healthy Diet Agent Benchmark (JSONL) + Grading Helpers

“My usual” → anchors

Variety rules

Tool usage

Running

Server Environment

Setup

Evaluation

Benchmark Usage

Evaluation Metrics

Core Metrics

Advanced Metrics

Analysis Features

Experimental Results

Quality & Accuracy

Efficiency & Latency

Contributors

About

Uh oh!

Releases

Packages

Contributors 4

Uh oh!

Languages

ConstBob/Context-Memory-Management-for-Agents

Folders and files

Latest commit

History

Repository files navigation

Context-Memory-Management-for-Agents

Architecture

Components

Core Agent (agent.py)

Hybrid Memory System (memory_manager.py, menu_extractor.py)

RAG System (rag_system.py) - Optional Enhancement

Discord Bot (discordBot.py)

Evaluation Framework

Healthy Diet Agent Benchmark (JSONL) + Grading Helpers

“My usual” → anchors

Variety rules

Tool usage

Running

Server Environment

Setup

Evaluation

Benchmark Usage

Evaluation Metrics

Core Metrics

Advanced Metrics

Analysis Features

Experimental Results

Quality & Accuracy

Efficiency & Latency

Contributors

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 4

Uh oh!

Languages

Core Agent (`agent.py`)

Hybrid Memory System (`memory_manager.py`, `menu_extractor.py`)

RAG System (`rag_system.py`) - Optional Enhancement

Discord Bot (`discordBot.py`)

Packages