Comprehensive evaluation framework for the Investor Paradise ADK agent using Google's Agent Development Kit (ADK).
```
evaluations/
├── test_config.json           # Evaluation criteria & thresholds
├── integration.evalset.json   # 12 fixed integration test cases
├── user_simulation.json       # 6 dynamic conversation scenarios
├── session_input.json         # Session metadata for tests
└── README.md                  # This file
```

Sample integration test cases:

- ✅ test_01_greeting - Agent greets the user warmly
- ✅ test_02_capability_query - Agent explains its capabilities
- ✅ test_03_automobile_sector_list - Agent lists all Automobile sector stocks
- ✅ test_04_bhartiartl_analysis - Full stock analysis with news & recommendations
- ✅ test_05_prompt_injection_security - Agent rejects prompt injection attempts

Sample user simulation scenario:

- Banking Stocks Query - User asks for banking stocks → Agent lists them → User thanks
Note: User simulations test conversational flow without assertions. Integration tests provide better coverage with specific pass/fail criteria.
1. Install ADK (if not already installed):

   ```shell
   pip install google-adk
   ```

2. Set up the environment:

   ```shell
   export GOOGLE_API_KEY="your-api-key-here"
   ```

3. Ensure data is loaded: the agent needs NSE stock data pre-loaded, so make sure your data cache is available.
```shell
# From the project root directory
adk eval investor_agent evaluations/integration.evalset.json \
  --config_file_path=evaluations/test_config.json \
  --print_detailed_results
```

Expected output:

- Pass/Fail status for each test case
- Tool trajectory scores (should be ≥ 0.85)
- Response match scores (should be ≥ 0.70)
- Detailed diff for any failures
```shell
# Create the user-simulation eval set
adk eval_set create investor_agent eval_user_simulation

# Add the simulation scenarios
adk eval_set add_eval_case investor_agent eval_user_simulation \
  --scenarios_file=evaluations/user_simulation.json \
  --session_input_file=evaluations/session_input.json

# Run the simulations
adk eval investor_agent eval_user_simulation \
  --config_file_path=evaluations/test_config.json \
  --print_detailed_results
```

To run both suites together:

```shell
# Integration tests
echo "🧪 Running Integration Tests..."
adk eval investor_agent evaluations/integration.evalset.json \
  --config_file_path=evaluations/test_config.json \
  --print_detailed_results

# User simulation tests
echo "🤖 Running User Simulation Tests..."
adk eval_set create investor_agent eval_user_simulation
adk eval_set add_eval_case investor_agent eval_user_simulation \
  --scenarios_file=evaluations/user_simulation.json \
  --session_input_file=evaluations/session_input.json
adk eval investor_agent eval_user_simulation \
  --config_file_path=evaluations/user_sim_config.json \
  --print_detailed_results
```

Tool trajectory score:

- Measures whether the agent uses the correct tools with the correct parameters
- Checks the sequence of tool calls against expected behavior
- Score of 1.0 = perfect tool usage, 0.0 = wrong tools/parameters
What it validates:
- ✅ EntryRouter calls `get_index_constituents` for "What stocks are in NIFTY 50?"
- ✅ MarketAnalyst calls `check_data_availability` before analysis
- ✅ MarketAnalyst uses `get_top_gainers` for "top gainers this week"
- ✅ NewsIntelligence uses both `semantic_search` (PDF) and `google_search` (Web)
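The trajectory check can be pictured as a comparison of the expected and actual tool-call sequences. The sketch below is illustrative only — `score_trajectory` and its ordered exact-match averaging are assumptions for this README, not ADK's actual scoring code.

```python
# Illustrative sketch only -- NOT ADK's actual scoring implementation.
# Scores a trajectory as the fraction of expected calls the agent
# reproduced in order, with matching tool names and arguments.

def score_trajectory(expected, actual):
    """Each call is a (tool_name, args_dict) tuple; returns 0.0-1.0."""
    if not expected:
        return 1.0
    matches = sum(
        1 for exp, act in zip(expected, actual)
        if exp[0] == act[0] and exp[1] == act[1]
    )
    return matches / len(expected)

expected = [("check_data_availability", {}), ("get_top_gainers", {"period": "week"})]
actual   = [("check_data_availability", {}), ("get_top_gainers", {"period": "week"})]
print(score_trajectory(expected, actual))  # 1.0 -- perfect tool usage

wrong = [("analyze_stock", {"symbol": "TCS"})]
print(score_trajectory(expected, wrong))   # 0.0 -- wrong tool, missing call
```

This mirrors the failure modes listed later: a wrong tool, a missing call, or wrong parameters all reduce the score.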
Response match score:

- Measures how similar the agent's response is to the expected response
- Uses text similarity algorithms to compare content
- Score of 1.0 = perfect match, 0.0 = completely different
What it validates:
- ✅ Response structure and formatting
- ✅ Key information presence (stock symbols, metrics, recommendations)
- ✅ Tone and communication style
- ✅ Presence of follow-up prompts
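Response matching can be approximated with any text-similarity measure. The sketch below uses the stdlib `difflib.SequenceMatcher` purely for illustration — ADK's own `response_match_score` metric may use a different algorithm.

```python
# Illustrative only -- ADK's response_match metric may use a different
# similarity algorithm; this sketch uses stdlib difflib as a stand-in.
from difflib import SequenceMatcher

def response_match(expected: str, actual: str) -> float:
    """Return a 0.0-1.0 similarity ratio between two responses."""
    return SequenceMatcher(None, expected.lower(), actual.lower()).ratio()

expected = "📋 NIFTY 50 Index Constituents (50 stocks)..."
actual = "Here are the NIFTY 50 stocks: RELIANCE, TCS..."
score = response_match(expected, actual)
print(f"{score:.2f}", "PASS" if score >= 0.70 else "FAIL")
```

Two responses with the same facts but different formatting score well below 1.0, which is why the threshold is set at 0.70 rather than demanding an exact match.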
A passing test:

```
✅ test_08_top_gainers_full_pipeline: PASS
   Tool Trajectory: 1.0/0.85
   Response Match: 0.82/0.70
```

- Both scores meet their thresholds
- Agent behavior matches the expected pattern

A failing test:

```
❌ test_05_nifty50_constituents: FAIL
   Tool Trajectory: 0.95/0.85 ✅
   Response Match: 0.65/0.70 ❌
   Diff:
     Expected: "📋 NIFTY 50 Index Constituents (50 stocks)..."
     Actual: "Here are the NIFTY 50 stocks: RELIANCE, TCS..."
```

- Tool usage is correct, but the response formatting differs
- Adjust the prompt or update the expected response
Tool Trajectory Failures:
- Agent uses the wrong tool (e.g., `analyze_stock` instead of `get_top_gainers`)
- Missing tool calls (e.g., forgot `check_data_availability`)
- Wrong parameters (e.g., wrong sector name mapping)
Response Match Failures:
- Different formatting (bullet points vs comma-separated)
- Missing follow-up prompts
- Different tone or phrasing
- Extra/missing information
Always run before:
- Merging prompt changes
- Updating tool definitions
- Changing model versions (e.g., Flash → Pro)
- Production deployments
Good practices:
- Baseline: Run full suite on stable version, save results
- Compare: Run suite after changes, compare scores
- Investigate: Any score drop > 5% requires investigation
- Fix: Update code or adjust test expectations
- Repeat: Re-run until all tests pass
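The baseline/compare steps above can be sketched as a small script. The flat `{test_name: score}` JSON layout assumed here is hypothetical — adapt it to however you save your suite results.

```python
# Flags score regressions larger than 5% between a saved baseline run
# and the current run. The flat {test_name: score} dict layout is an
# assumed format, not ADK's native results output.

def find_regressions(baseline: dict, current: dict, tolerance: float = 0.05):
    """Return tests whose score dropped by more than `tolerance`."""
    regressions = {}
    for name, base_score in baseline.items():
        cur_score = current.get(name, 0.0)
        if base_score - cur_score > tolerance:
            regressions[name] = (base_score, cur_score)
    return regressions

baseline = {"test_01_greeting": 0.95, "test_08_top_gainers": 0.90}
current = {"test_01_greeting": 0.94, "test_08_top_gainers": 0.72}
for name, (b, c) in find_regressions(baseline, current).items():
    print(f"REGRESSION {name}: {b:.2f} -> {c:.2f}")
```

A test that disappears from the current run counts as a full regression (score 0.0), which is usually the behavior you want in CI.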
Add to your CI/CD pipeline:

```yaml
# Example GitHub Actions step
- name: Run ADK Evaluations
  run: |
    export GOOGLE_API_KEY=${{ secrets.GOOGLE_API_KEY }}
    adk eval investor_agent evaluations/integration.evalset.json \
      --config_file_path=evaluations/test_config.json
```

Edit integration.evalset.json:
```json
{
  "eval_id": "test_13_custom_test",
  "conversation": [
    {
      "user_content": {
        "parts": [{"text": "Your test query"}]
      },
      "final_response": {
        "parts": [{"text": "Expected response"}]
      },
      "intermediate_data": {
        "tool_uses": [
          {"name": "expected_tool", "args": {...}}
        ]
      }
    }
  ]
}
```

Edit user_simulation.json:
```json
{
  "starting_prompt": "Initial user message",
  "conversation_plan": "Describe the expected conversation flow: what the user will ask, how the agent should respond, what tools should be used, and the final outcome."
}
```

Edit test_config.json:
```jsonc
{
  "criteria": {
    "tool_trajectory_avg_score": 0.90,  // Increase for stricter tool checks
    "response_match_score": 0.65        // Decrease if formatting varies
  }
}
```

When you intentionally change agent behavior:
- Run the evaluation to see the current output
- Review the actual response in the detailed results
- If it is correct, update `final_response` in the test case
- Re-run to verify
Use ADK Web UI to create test cases from real sessions:
- Start the ADK web UI: `adk web`
- Have a conversation with the agent
- In the Eval tab, click "Add current session"
- Export the evalset file
- Copy the relevant test cases to `integration.evalset.json`
Minimum passing criteria for production:
- ✅ All integration tests pass (12/12)
- ✅ User simulation success rate ≥ 80% (5/6)
- ✅ Tool trajectory avg ≥ 0.85
- ✅ Response match avg ≥ 0.70
- ✅ No security failures (prompt injection must be blocked)
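The criteria above can be enforced in a single gate function. The `summary` dict shape below is hypothetical — adapt it to however you aggregate your suite results.

```python
# Gate check for the production criteria above. The `summary` dict
# shape is hypothetical -- adapt to however you collect results.

def production_ready(summary: dict) -> bool:
    return (
        summary["integration_passed"] == summary["integration_total"]
        and summary["simulation_passed"] / summary["simulation_total"] >= 0.80
        and summary["tool_trajectory_avg"] >= 0.85
        and summary["response_match_avg"] >= 0.70
        and summary["security_failures"] == 0
    )

summary = {
    "integration_passed": 12, "integration_total": 12,
    "simulation_passed": 5, "simulation_total": 6,
    "tool_trajectory_avg": 0.91, "response_match_avg": 0.78,
    "security_failures": 0,
}
print(production_ready(summary))  # True -- 12/12, 5/6 = 83%, all thresholds met
```

Note that a single security failure fails the gate regardless of how good the other scores are.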
If `adk` is not found:

```shell
pip install google-adk
```

If the API key is missing:

```shell
export GOOGLE_API_KEY="your-api-key"
```

If data is not loaded:

```shell
# Make sure you're in the project root directory
cd /path/to/investor_paradise

# Pre-load data cache
python -c "from investor_agent.data_engine import NSESTORE; print(len(NSESTORE.df))"
```

If you hit API rate limits:

- Run tests sequentially, not in parallel
- Add delays between test runs
- Use Flash-Lite model for faster quota recovery
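Sequential execution with delays can be scripted as below. The command lists and the 30-second delay are placeholder examples, and the `runner` parameter is injectable so the pacing logic can be exercised without actually invoking `adk`.

```python
# Runs eval commands one at a time with a pause between them to ease
# API rate limits. Commands and the 30s delay are placeholder examples.
import subprocess
import time

def run_sequentially(commands, delay_s=30, runner=subprocess.run):
    """Run each command in order, sleeping delay_s seconds between runs."""
    for i, cmd in enumerate(commands):
        if i > 0:
            time.sleep(delay_s)  # let API quota recover between suites
        runner(cmd, check=True)

commands = [
    ["adk", "eval", "investor_agent", "evaluations/integration.evalset.json",
     "--config_file_path=evaluations/test_config.json"],
    ["adk", "eval", "investor_agent", "eval_user_simulation",
     "--config_file_path=evaluations/user_sim_config.json"],
]
# run_sequentially(commands)  # uncomment to execute for real
```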
- Integration tests use fixed expected outputs - update them if you intentionally change prompts
- User simulation tests are dynamic - the LLM generates user messages based on conversation_plan
- Tool trajectory is critical for multi-agent systems - ensures proper coordination
- Response match can vary due to LLM non-determinism - set realistic thresholds (0.70-0.75)
Last Updated: November 30, 2025
ADK Version: Compatible with google-adk >= 1.0.0
Test Count: 12 integration + 6 user simulation = 18 total test scenarios