-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Add AG-UI Protocol Integration for Agent Evaluation #2395
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add AG-UI Protocol Integration for Agent Evaluation #2395
Conversation
Add comprehensive integration with AG-UI protocol to enable evaluation of agents using the AG-UI event-based communication standard. This integration converts AG-UI streaming events (text messages, tool calls, state updates) into Ragas message format for evaluation. Key features: - Convert streaming AG-UI events to Ragas messages - Support for both event sequences and MessagesSnapshotEvent - AGUIEventCollector for stateful event stream reconstruction - Handles text messages, tool calls with arguments, and tool results - Optional metadata preservation (run_id, thread_id, step_name) - Automatic filtering of non-message events (lifecycle, state management) - Uses official ag-ui-protocol package (>=0.1.9) with Pydantic models Files added: - src/ragas/integrations/ag_ui.py: Main integration module - tests/unit/integrations/test_ag_ui.py: Comprehensive test suite (19 tests) - pyproject.toml: Added ag-ui optional dependency The integration follows the same patterns as existing framework integrations (langgraph, swarm, llama_index) while properly leveraging the AG-UI protocol libraries instead of recreating structures. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Implement HTTP client and batch evaluation support for AG-UI agents running on FastAPI endpoints. Changes: - Add httpx>=0.27.0 to ag-ui optional dependency group - Implement _call_ag_ui_endpoint() for async HTTP requests to AG-UI endpoints - Parses Server-Sent Events (SSE) streams line-by-line - Collects AG-UI protocol events from streaming responses - Handles malformed JSON gracefully with warnings - Implement evaluate_ag_ui_agent() for batch evaluation - Follows llama_index integration pattern - Uses Executor for parallel HTTP calls - Converts streaming events to Ragas messages - Extracts responses and retrieved contexts from AI/tool messages - Evaluates with specified metrics - Add 6 comprehensive tests for FastAPI integration - Test SSE parsing and event collection - Test batch evaluation with tool calls - Test error handling for HTTP failures - Tests skipped when httpx/ag-ui-protocol not installed - Update module documentation with FastAPI examples - Update CLAUDE.md with project overview and development setup 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Extends AG-UI protocol integration to support both streaming event triads (Start-Content-End) and convenience chunk events (TextMessageChunk, ToolCallChunk) for complete messages delivered in a single event. Key changes: - Add handlers for TEXT_MESSAGE_CHUNK and TOOL_CALL_CHUNK event types - Refactor test suite to use real AG-UI events instead of mocks - Update documentation to reflect dual event pattern support - Fix RunAgentInput thread_id generation and sample mutation logic This eliminates mock maintenance burden and ensures accurate event handling across both streaming and non-streaming use cases. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Enables multi-turn conversation evaluation with AG-UI agents, supporting tool call metrics like ToolCallF1. Agent responses are appended to the conversation history for metrics that analyze complete interactions. Key changes: - Add MultiTurnSample support alongside existing SingleTurnSample - Create message conversion helper for Ragas → AG-UI format - Update _call_ag_ui_endpoint to accept both string and message list - Implement dual processing: single-turn extracts response, multi-turn appends to conversation - Fix ToolMessage validation: ensure preceding AIMessage has tool_calls - Add comprehensive multi-turn tests (4 new tests, 31 total passing) Technical details: - MultiTurnSample requires ToolMessage be preceded by AIMessage with tool_calls - Fixed event collector to attach tool calls before creating ToolMessages - Handles edge cases: tool calls before/after text messages, missing AIMessages - AG-UI ToolCall uses nested FunctionCall structure - ToolMessage in conversion skipped (sent FROM agent, not TO agent) Backward compatibility: All existing single-turn tests pass unchanged. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Switch from role-based to type-based checking when converting AG-UI AssistantMessage/UserMessage/ToolMessage objects to Ragas messages in snapshot processing. This is more explicit and type-safe. Changes: - _handle_messages_snapshot now uses isinstance() checks - Import AG-UI message types (AssistantMessage, UserMessage, ToolMessage) - Raise ImportError if AG-UI types unavailable (no fallback) - Streaming events still use role-based checking (events have role attribute) This ensures we correctly identify AG-UI message types rather than relying on role attributes that could be ambiguous. All 31 tests passing. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
…der ag_ui_agent_evals in the ragas_examples folder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great @contextablemark
Thanks for the PR 🙌🏼
Please check the formatting stuff which fails the CI. Run: make run-ci locally to check it all.
Would you also mind adding a docs page for the integration as well in here?
Sure... sounds good. I also see the code quality issues that need addressing that are coming out in the check builds. |
@anistark Please re-review. make run-ci should be passing now and I added "How-to" docs along with the .ipynb (and associated generated .md). Please let me know if anything else is needed. |
|
@anistark It appears that there were a couple of additional issues that crept in once I added the examples. The build should be clean now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update @contextablemark
We're refactoring our metrics approach from LangchainLLMWrapper to work with InstructorLLM via llm_factory.
While your code update works with both, the doc shows the earlier approach. We're yet to write a detailed guide on the migration, but since we're adding this at such a stage, would be great to have it in the newer structure to avoid being updated again by next week. :)
Checkout more info on the implementation changes in new metrics collections approach: /src/ragas/metrics/collections/
Sure... I had tried using it initially when I saw the deprecation warning message, but had some issues - I'll take another look. |
|
@anistark Looking into the refactoring raised some additional issues/questions regarding other steps in the workflow :
Just trying to figure out whether it makes sense to wait until next week if some changes are imminent that will make the overall implementation easier.
|
I think we can focus on collections as of now going forward. We'll support legacy evaluate till a certain version (undecided) and then completely remove.
While manually is fine, but better to align with rest of it, so we don't have to make changes again in couple of weeks.
If you want to do manual, then yes. Otherwise, not required. |
|
@anistark Thanks for the answers to my questions. I'm starting to think that my integration may be attempting to do too much; in particular the "evaluate_ag_ui_agent" method has at its core ragas.evaluate, which is going away. And if this PR is reflective of the intended direction of the overall project, I may need to rethink my examples. Do you think it would make sense for me to wait for the "dust to settle" and revisit the topic next week after more of the changes have been merged? If so, I'll move this PR back to draft. Thanks, |
Sure, if that's what you want. Can park it till all migrations are done, with docs update. |
Thanks... I'll keep an eye on the situation |
Summary
This PR adds comprehensive support for evaluating agents that use the AG-UI (Agent User Interaction) protocol, enabling real-time
evaluation of streaming agent interactions with support for tool calls, multi-turn conversations, and full event-based message reconstruction.
What is AG-UI?
AG-UI is an event-based protocol for streaming agent-to-UI communication that uses typed events for messages, tool calls, and
state synchronization. Popular agent frameworks supporting AG-UI include:
🎯 Key Features
Core Event Processing - Converts core AG-UI events into Ragas messages.
AG-UI Endpoint Integration - Because all AG-UI compliant agents emit a common event stream, they can be invoked directly as part of the eval.
Multi-Turn Conversation Support - Supports multi-turn evals and tool call evals.
📁 Files Added
Integration Code:
Tests:
Examples:
🧪 Testing
33 comprehensive unit tests covering:
Test Coverage: All edge cases including invalid JSON, missing messages, event ordering issues, and validation requirements.
🔧 Technical Implementation
Key Classes:
Event Processing Flow:
Multi-Turn Processing:
📖 Documentation
🎓 References
Ready for review! All tests passing ✅ (33/33)