Add AG-UI Protocol Integration for Agent Evaluation #2395

contextablemark · 2025-11-01T22:08:36Z

Summary

This PR adds comprehensive support for evaluating agents that use the AG-UI (Agent User Interaction) protocol, enabling real-time
evaluation of streaming agent interactions with support for tool calls, multi-turn conversations, and full event-based message reconstruction.

What is AG-UI?

AG-UI is an event-based protocol for streaming agent-to-UI communication that uses typed events for messages, tool calls, and
state synchronization. Popular agent frameworks supporting AG-UI include:

LangGraph (LangChain)
Google ADK (Agent Development Kit)
Pydantic AI
Mastra

🎯 Key Features

Core Event Processing - Converts core AG-UI events into Ragas messages.
AG-UI Endpoint Integration - Because all AG-UI compliant agents emit a common event stream, they can be invoked directly as part of the eval.
Multi-Turn Conversation Support - Supports multi-turn evals and tool call evals.

📁 Files Added

Integration Code:

src/ragas/integrations/ag_ui.py (1,283 lines) - Complete AG-UI integration

Tests:

tests/unit/integrations/test_ag_ui.py (1,186 lines) - 33 tests covering all features

Examples:

examples/ragas_examples/ag_ui_agent_evals/ - Complete runnable example
- evals.py (314 lines) - Evaluation script with two scenarios
- README.md (314 lines) - Comprehensive documentation
- test_data/scientist_biographies.csv - Factual correctness test cases
- test_data/weather_tool_calls.csv - Tool call evaluation test cases

🧪 Testing

33 comprehensive unit tests covering:

✅ Basic event conversion and streaming reconstruction
✅ Metadata preservation across event types
✅ Tool call parsing and association with AI messages
✅ Multi-turn conversation handling
✅ Chunk event processing (both text and tool calls)
✅ Message snapshot conversion with type-based checking
✅ Error handling (malformed JSON, incomplete sequences, orphaned events)
✅ Role mapping (user → HumanMessage, assistant → AIMessage)
✅ FastAPI endpoint integration with mocked SSE responses
✅ MultiTurnSample support with conversation appending
✅ Retroactive tool call attachment for validation compliance

Test Coverage: All edge cases including invalid JSON, missing messages, event ordering issues, and validation requirements.

🔧 Technical Implementation

Key Classes:

AGUIEventCollector: Stateful event accumulation with streaming reconstruction
Caches AG-UI imports for performance
Tracks context (run_id, thread_id, step) for metadata
Handles both streaming triads and chunk events

Event Processing Flow:

Lifecycle events update context (run_id, thread_id, step)
Text message events accumulate content chunks
Tool call events accumulate args and create ToolCall objects
Message end events create Ragas messages with pending tool calls
Tool result events ensure preceding AIMessage has tool_calls (validation requirement)

Multi-Turn Processing:

Converts Ragas messages → AG-UI messages for request payload
Sends to endpoint and collects AG-UI events
Converts events → new Ragas messages (AIMessage, ToolMessage only)
Appends to conversation for iterative evaluation

📖 Documentation

Comprehensive docstrings for all public functions
Module-level examples for common use cases
Complete README in examples directory with:
- Setup instructions linking to AG-UI quickstart
- Usage examples with all CLI options
- Expected output formats
- Troubleshooting guide
- Metric interpretation guide

🎓 References

AG-UI Documentation: https://docs.ag-ui.com
AG-UI Quickstart: https://docs.ag-ui.com/quickstart/applications
Compatible frameworks: LangGraph, Google ADK, Pydantic AI, Mastra

Ready for review! All tests passing ✅ (33/33)

Add comprehensive integration with AG-UI protocol to enable evaluation of agents using the AG-UI event-based communication standard. This integration converts AG-UI streaming events (text messages, tool calls, state updates) into Ragas message format for evaluation. Key features: - Convert streaming AG-UI events to Ragas messages - Support for both event sequences and MessagesSnapshotEvent - AGUIEventCollector for stateful event stream reconstruction - Handles text messages, tool calls with arguments, and tool results - Optional metadata preservation (run_id, thread_id, step_name) - Automatic filtering of non-message events (lifecycle, state management) - Uses official ag-ui-protocol package (>=0.1.9) with Pydantic models Files added: - src/ragas/integrations/ag_ui.py: Main integration module - tests/unit/integrations/test_ag_ui.py: Comprehensive test suite (19 tests) - pyproject.toml: Added ag-ui optional dependency The integration follows the same patterns as existing framework integrations (langgraph, swarm, llama_index) while properly leveraging the AG-UI protocol libraries instead of recreating structures. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Implement HTTP client and batch evaluation support for AG-UI agents running on FastAPI endpoints. Changes: - Add httpx>=0.27.0 to ag-ui optional dependency group - Implement _call_ag_ui_endpoint() for async HTTP requests to AG-UI endpoints - Parses Server-Sent Events (SSE) streams line-by-line - Collects AG-UI protocol events from streaming responses - Handles malformed JSON gracefully with warnings - Implement evaluate_ag_ui_agent() for batch evaluation - Follows llama_index integration pattern - Uses Executor for parallel HTTP calls - Converts streaming events to Ragas messages - Extracts responses and retrieved contexts from AI/tool messages - Evaluates with specified metrics - Add 6 comprehensive tests for FastAPI integration - Test SSE parsing and event collection - Test batch evaluation with tool calls - Test error handling for HTTP failures - Tests skipped when httpx/ag-ui-protocol not installed - Update module documentation with FastAPI examples - Update CLAUDE.md with project overview and development setup 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Extends AG-UI protocol integration to support both streaming event triads (Start-Content-End) and convenience chunk events (TextMessageChunk, ToolCallChunk) for complete messages delivered in a single event. Key changes: - Add handlers for TEXT_MESSAGE_CHUNK and TOOL_CALL_CHUNK event types - Refactor test suite to use real AG-UI events instead of mocks - Update documentation to reflect dual event pattern support - Fix RunAgentInput thread_id generation and sample mutation logic This eliminates mock maintenance burden and ensures accurate event handling across both streaming and non-streaming use cases. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Enables multi-turn conversation evaluation with AG-UI agents, supporting tool call metrics like ToolCallF1. Agent responses are appended to the conversation history for metrics that analyze complete interactions. Key changes: - Add MultiTurnSample support alongside existing SingleTurnSample - Create message conversion helper for Ragas → AG-UI format - Update _call_ag_ui_endpoint to accept both string and message list - Implement dual processing: single-turn extracts response, multi-turn appends to conversation - Fix ToolMessage validation: ensure preceding AIMessage has tool_calls - Add comprehensive multi-turn tests (4 new tests, 31 total passing) Technical details: - MultiTurnSample requires ToolMessage be preceded by AIMessage with tool_calls - Fixed event collector to attach tool calls before creating ToolMessages - Handles edge cases: tool calls before/after text messages, missing AIMessages - AG-UI ToolCall uses nested FunctionCall structure - ToolMessage in conversion skipped (sent FROM agent, not TO agent) Backward compatibility: All existing single-turn tests pass unchanged. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Switch from role-based to type-based checking when converting AG-UI AssistantMessage/UserMessage/ToolMessage objects to Ragas messages in snapshot processing. This is more explicit and type-safe. Changes: - _handle_messages_snapshot now uses isinstance() checks - Import AG-UI message types (AssistantMessage, UserMessage, ToolMessage) - Raise ImportError if AG-UI types unavailable (no fallback) - Streaming events still use role-based checking (events have role attribute) This ensures we correctly identify AG-UI message types rather than relying on role attributes that could be ambiguous. All 31 tests passing. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

…der ag_ui_agent_evals in the ragas_examples folder.

anistark

Looks great @contextablemark
Thanks for the PR 🙌🏼

Please check the formatting stuff which fails the CI. Run: make run-ci locally to check it all.

Would you also mind adding a docs page for the integration as well in here?

contextablemark · 2025-11-03T18:30:36Z

Looks great @contextablemark Thanks for the PR 🙌🏼

Please check the formatting stuff which fails the CI. Run: make run-ci locally to check it all.

Would you also mind adding a docs page for the integration as well in here?

Sure... sounds good. I also see the code quality issues that need addressing that are coming out in the check builds.

contextablemark · 2025-11-04T05:50:12Z

Looks great @contextablemark Thanks for the PR 🙌🏼

Please check the formatting stuff which fails the CI. Run: make run-ci locally to check it all.

Would you also mind adding a docs page for the integration as well in here?

@anistark Please re-review. make run-ci should be passing now and I added "How-to" docs along with the .ipynb (and associated generated .md). Please let me know if anything else is needed.

contextablemark · 2025-11-04T07:49:27Z

@anistark It appears that there were a couple of additional issues that crept in once I added the examples. The build should be clean now.

anistark

Thanks for the update @contextablemark

We're refactoring our metrics approach from LangchainLLMWrapper to work with InstructorLLM via llm_factory.

While your code update works with both, the doc shows the earlier approach. We're yet to write a detailed guide on the migration, but since we're adding this at such a stage, would be great to have it in the newer structure to avoid being updated again by next week. :)

Checkout more info on the implementation changes in new metrics collections approach: /src/ragas/metrics/collections/

contextablemark · 2025-11-04T14:15:31Z

Thanks for the update @contextablemark

We're refactoring our metrics approach from LangchainLLMWrapper to work with InstructorLLM via llm_factory.

While your code update works with both, the doc shows the earlier approach. We're yet to write a detailed guide on the migration, but since we're adding this at such a stage, would be great to have it in the newer structure to avoid being updated again by next week. :)

Checkout more info on the implementation changes in new metrics collections approach: /src/ragas/metrics/collections/

Sure... I had tried using it initially when I saw the deprecation warning message, but had some issues - I'll take another look.

contextablemark · 2025-11-04T16:22:29Z

@anistark Looking into the refactoring raised some additional issues/questions regarding other steps in the workflow :

Support in the core evaluator: It seems that ragas.evaluate doesn’t recognize collections metrics yet and still expects legacy Metric subclasses. Is there a plan in the works to bring collections metrics into the core evaluator?
Blended metrics: - Along similar lines, if a workflow needs both legacy and collections metrics, is there a recommended “bridge” pattern—or should we avoid mixing the two until the execution pipeline handles both families?
Manual evaluation path: In the interim, is it acceptable to call metric.ascore(...) manually for the Instructor-based metrics and stitch the results together (i.e., handling all of the orchestration directly)?
Documentation guidance: Should any documentation mention the manual evaluation as an interim approach, hinting at something else to come?

Just trying to figure out whether it makes sense to wait until next week if some changes are imminent that will make the overall implementation easier.

Mark

anistark · 2025-11-04T18:36:02Z

Support in the core evaluator: It seems that ragas.evaluate doesn’t recognize collections metrics yet and still expects legacy Metric subclasses. Is there a plan in the works to bring collections metrics into the core evaluator?

evaluate will be deprecated once all metrics are migrated to collections.

Blended metrics: - Along similar lines, if a workflow needs both legacy and collections metrics, is there a recommended “bridge” pattern—or should we avoid mixing the two until the execution pipeline handles both families?

I think we can focus on collections as of now going forward. We'll support legacy evaluate till a certain version (undecided) and then completely remove.

Manual evaluation path: In the interim, is it acceptable to call metric.ascore(...) manually for the Instructor-based metrics and stitch the results together (i.e., handling all of the orchestration directly)?

While manually is fine, but better to align with rest of it, so we don't have to make changes again in couple of weeks.

Documentation guidance: Should any documentation mention the manual evaluation as an interim approach, hinting at something else to come?

If you want to do manual, then yes. Otherwise, not required.

contextablemark · 2025-11-05T03:51:46Z

@anistark Thanks for the answers to my questions. I'm starting to think that my integration may be attempting to do too much; in particular the "evaluate_ag_ui_agent" method has at its core ragas.evaluate, which is going away. And if this PR is reflective of the intended direction of the overall project, I may need to rethink my examples.

Do you think it would make sense for me to wait for the "dust to settle" and revisit the topic next week after more of the changes have been merged? If so, I'll move this PR back to draft.

Thanks,
Mark

anistark · 2025-11-05T06:30:37Z

Do you think it would make sense for me to wait for the "dust to settle" and revisit the topic next week after more of the changes have been merged? If so, I'll move this PR back to draft.

Sure, if that's what you want. Can park it till all migrations are done, with docs update.

contextablemark · 2025-11-05T09:22:34Z

Do you think it would make sense for me to wait for the "dust to settle" and revisit the topic next week after more of the changes have been merged? If so, I'll move this PR back to draft.

Sure, if that's what you want. Can park it till all migrations are done, with docs update.

Thanks... I'll keep an eye on the situation

contextablemark and others added 11 commits October 27, 2025 07:00

Merge branch 'explodinggradients:main' into feature/ag-ui

fbc510f

Merge branch 'explodinggradients:main' into feature/ag-ui

569354c

Update to get things to run for AI Tinkerers.

9bdf83a

Merge branch 'explodinggradients:main' into feature/ag-ui

5f3aaa6

Slight refactoring on ag_ui.py integration. Added detailed example un…

a67261a

…der ag_ui_agent_evals in the ragas_examples folder.

chore: revert trivial whitespace changes to CLAUDE.md

013ce37

dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Nov 1, 2025

Updated README with different framework names.

f1f0848

anistark reviewed Nov 3, 2025

View reviewed changes

contextablemark added 2 commits November 3, 2025 19:20

Addressing issues with make run-ci

545124e

Added "How-to" docs and Jupyter notebook.

008d6c4

contextablemark added 2 commits November 3, 2025 23:39

Addressing formatting issue.

4fc413a

More formatting / import issues.

fb7dd52

anistark reviewed Nov 4, 2025

View reviewed changes

Merge branch 'explodinggradients:main' into feature/ag-ui

d6fe626

contextablemark marked this pull request as draft November 5, 2025 09:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add AG-UI Protocol Integration for Agent Evaluation #2395

Add AG-UI Protocol Integration for Agent Evaluation #2395

Uh oh!

contextablemark commented Nov 1, 2025 •

edited

Loading

Uh oh!

anistark left a comment •

edited

Loading

Uh oh!

contextablemark commented Nov 3, 2025 •

edited

Loading

Uh oh!

contextablemark commented Nov 4, 2025

Uh oh!

contextablemark commented Nov 4, 2025

Uh oh!

anistark left a comment

Uh oh!

contextablemark commented Nov 4, 2025 •

edited

Loading

Uh oh!

contextablemark commented Nov 4, 2025

Uh oh!

anistark commented Nov 4, 2025

Uh oh!

contextablemark commented Nov 5, 2025

Uh oh!

anistark commented Nov 5, 2025

Uh oh!

contextablemark commented Nov 5, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Add AG-UI Protocol Integration for Agent Evaluation #2395

Are you sure you want to change the base?

Add AG-UI Protocol Integration for Agent Evaluation #2395

Uh oh!

Conversation

contextablemark commented Nov 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

anistark left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

contextablemark commented Nov 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

contextablemark commented Nov 4, 2025

Uh oh!

contextablemark commented Nov 4, 2025

Uh oh!

anistark left a comment

Choose a reason for hiding this comment

Uh oh!

contextablemark commented Nov 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

contextablemark commented Nov 4, 2025

Uh oh!

anistark commented Nov 4, 2025

Uh oh!

contextablemark commented Nov 5, 2025

Uh oh!

anistark commented Nov 5, 2025

Uh oh!

contextablemark commented Nov 5, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

contextablemark commented Nov 1, 2025 •

edited

Loading

anistark left a comment •

edited

Loading

contextablemark commented Nov 3, 2025 •

edited

Loading

contextablemark commented Nov 4, 2025 •

edited

Loading