Skip to content

Conversation

@contextablemark
Copy link

@contextablemark contextablemark commented Nov 1, 2025

Summary

This PR adds comprehensive support for evaluating agents that use the AG-UI (Agent User Interaction) protocol, enabling real-time
evaluation of streaming agent interactions with support for tool calls, multi-turn conversations, and full event-based message reconstruction.

What is AG-UI?

AG-UI is an event-based protocol for streaming agent-to-UI communication that uses typed events for messages, tool calls, and
state synchronization. Popular agent frameworks supporting AG-UI include:

  • LangGraph (LangChain)
  • Google ADK (Agent Development Kit)
  • Pydantic AI
  • Mastra

🎯 Key Features

  1. Core Event Processing - Converts core AG-UI events into Ragas messages.

  2. AG-UI Endpoint Integration - Because all AG-UI compliant agents emit a common event stream, they can be invoked directly as part of the eval.

  3. Multi-Turn Conversation Support - Supports multi-turn evals and tool call evals.

📁 Files Added

Integration Code:

  • src/ragas/integrations/ag_ui.py (1,283 lines) - Complete AG-UI integration

Tests:

  • tests/unit/integrations/test_ag_ui.py (1,186 lines) - 33 tests covering all features

Examples:

  • examples/ragas_examples/ag_ui_agent_evals/ - Complete runnable example
    • evals.py (314 lines) - Evaluation script with two scenarios
    • README.md (314 lines) - Comprehensive documentation
    • test_data/scientist_biographies.csv - Factual correctness test cases
    • test_data/weather_tool_calls.csv - Tool call evaluation test cases

🧪 Testing

33 comprehensive unit tests covering:

  • ✅ Basic event conversion and streaming reconstruction
  • ✅ Metadata preservation across event types
  • ✅ Tool call parsing and association with AI messages
  • ✅ Multi-turn conversation handling
  • ✅ Chunk event processing (both text and tool calls)
  • ✅ Message snapshot conversion with type-based checking
  • ✅ Error handling (malformed JSON, incomplete sequences, orphaned events)
  • ✅ Role mapping (user → HumanMessage, assistant → AIMessage)
  • ✅ FastAPI endpoint integration with mocked SSE responses
  • ✅ MultiTurnSample support with conversation appending
  • ✅ Retroactive tool call attachment for validation compliance

Test Coverage: All edge cases including invalid JSON, missing messages, event ordering issues, and validation requirements.

🔧 Technical Implementation

Key Classes:

  • AGUIEventCollector: Stateful event accumulation with streaming reconstruction
  • Caches AG-UI imports for performance
  • Tracks context (run_id, thread_id, step) for metadata
  • Handles both streaming triads and chunk events

Event Processing Flow:

  1. Lifecycle events update context (run_id, thread_id, step)
  2. Text message events accumulate content chunks
  3. Tool call events accumulate args and create ToolCall objects
  4. Message end events create Ragas messages with pending tool calls
  5. Tool result events ensure preceding AIMessage has tool_calls (validation requirement)

Multi-Turn Processing:

  • Converts Ragas messages → AG-UI messages for request payload
  • Sends to endpoint and collects AG-UI events
  • Converts events → new Ragas messages (AIMessage, ToolMessage only)
  • Appends to conversation for iterative evaluation

📖 Documentation

  • Comprehensive docstrings for all public functions
  • Module-level examples for common use cases
  • Complete README in examples directory with:
    • Setup instructions linking to AG-UI quickstart
    • Usage examples with all CLI options
    • Expected output formats
    • Troubleshooting guide
    • Metric interpretation guide

🎓 References


Ready for review! All tests passing ✅ (33/33)

contextablemark and others added 11 commits October 27, 2025 07:00
Add comprehensive integration with AG-UI protocol to enable evaluation of
agents using the AG-UI event-based communication standard. This integration
converts AG-UI streaming events (text messages, tool calls, state updates)
into Ragas message format for evaluation.

Key features:
- Convert streaming AG-UI events to Ragas messages
- Support for both event sequences and MessagesSnapshotEvent
- AGUIEventCollector for stateful event stream reconstruction
- Handles text messages, tool calls with arguments, and tool results
- Optional metadata preservation (run_id, thread_id, step_name)
- Automatic filtering of non-message events (lifecycle, state management)
- Uses official ag-ui-protocol package (>=0.1.9) with Pydantic models

Files added:
- src/ragas/integrations/ag_ui.py: Main integration module
- tests/unit/integrations/test_ag_ui.py: Comprehensive test suite (19 tests)
- pyproject.toml: Added ag-ui optional dependency

The integration follows the same patterns as existing framework integrations
(langgraph, swarm, llama_index) while properly leveraging the AG-UI protocol
libraries instead of recreating structures.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Implement HTTP client and batch evaluation support for AG-UI agents
running on FastAPI endpoints.

Changes:
- Add httpx>=0.27.0 to ag-ui optional dependency group
- Implement _call_ag_ui_endpoint() for async HTTP requests to AG-UI endpoints
  - Parses Server-Sent Events (SSE) streams line-by-line
  - Collects AG-UI protocol events from streaming responses
  - Handles malformed JSON gracefully with warnings
- Implement evaluate_ag_ui_agent() for batch evaluation
  - Follows llama_index integration pattern
  - Uses Executor for parallel HTTP calls
  - Converts streaming events to Ragas messages
  - Extracts responses and retrieved contexts from AI/tool messages
  - Evaluates with specified metrics
- Add 6 comprehensive tests for FastAPI integration
  - Test SSE parsing and event collection
  - Test batch evaluation with tool calls
  - Test error handling for HTTP failures
  - Tests skipped when httpx/ag-ui-protocol not installed
- Update module documentation with FastAPI examples
- Update CLAUDE.md with project overview and development setup

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Extends AG-UI protocol integration to support both streaming event triads
(Start-Content-End) and convenience chunk events (TextMessageChunk,
ToolCallChunk) for complete messages delivered in a single event.

Key changes:
- Add handlers for TEXT_MESSAGE_CHUNK and TOOL_CALL_CHUNK event types
- Refactor test suite to use real AG-UI events instead of mocks
- Update documentation to reflect dual event pattern support
- Fix RunAgentInput thread_id generation and sample mutation logic

This eliminates mock maintenance burden and ensures accurate event
handling across both streaming and non-streaming use cases.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Enables multi-turn conversation evaluation with AG-UI agents, supporting
tool call metrics like ToolCallF1. Agent responses are appended to the
conversation history for metrics that analyze complete interactions.

Key changes:
- Add MultiTurnSample support alongside existing SingleTurnSample
- Create message conversion helper for Ragas → AG-UI format
- Update _call_ag_ui_endpoint to accept both string and message list
- Implement dual processing: single-turn extracts response, multi-turn
  appends to conversation
- Fix ToolMessage validation: ensure preceding AIMessage has tool_calls
- Add comprehensive multi-turn tests (4 new tests, 31 total passing)

Technical details:
- MultiTurnSample requires ToolMessage be preceded by AIMessage with tool_calls
- Fixed event collector to attach tool calls before creating ToolMessages
- Handles edge cases: tool calls before/after text messages, missing AIMessages
- AG-UI ToolCall uses nested FunctionCall structure
- ToolMessage in conversion skipped (sent FROM agent, not TO agent)

Backward compatibility: All existing single-turn tests pass unchanged.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Switch from role-based to type-based checking when converting AG-UI
AssistantMessage/UserMessage/ToolMessage objects to Ragas messages in
snapshot processing. This is more explicit and type-safe.

Changes:
- _handle_messages_snapshot now uses isinstance() checks
- Import AG-UI message types (AssistantMessage, UserMessage, ToolMessage)
- Raise ImportError if AG-UI types unavailable (no fallback)
- Streaming events still use role-based checking (events have role attribute)

This ensures we correctly identify AG-UI message types rather than
relying on role attributes that could be ambiguous.

All 31 tests passing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…der ag_ui_agent_evals in the ragas_examples folder.
@dosubot dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Nov 1, 2025
Copy link
Member

@anistark anistark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @contextablemark
Thanks for the PR 🙌🏼

Please check the formatting stuff which fails the CI. Run: make run-ci locally to check it all.

Would you also mind adding a docs page for the integration as well in here?

@contextablemark
Copy link
Author

contextablemark commented Nov 3, 2025

Looks great @contextablemark Thanks for the PR 🙌🏼

Please check the formatting stuff which fails the CI. Run: make run-ci locally to check it all.

Would you also mind adding a docs page for the integration as well in here?

Sure... sounds good. I also see the code quality issues that need addressing that are coming out in the check builds.

@contextablemark
Copy link
Author

Looks great @contextablemark Thanks for the PR 🙌🏼

Please check the formatting stuff which fails the CI. Run: make run-ci locally to check it all.

Would you also mind adding a docs page for the integration as well in here?

@anistark Please re-review. make run-ci should be passing now and I added "How-to" docs along with the .ipynb (and associated generated .md). Please let me know if anything else is needed.

@contextablemark
Copy link
Author

@anistark It appears that there were a couple of additional issues that crept in once I added the examples. The build should be clean now.

Copy link
Member

@anistark anistark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update @contextablemark

We're refactoring our metrics approach from LangchainLLMWrapper to work with InstructorLLM via llm_factory.

While your code update works with both, the doc shows the earlier approach. We're yet to write a detailed guide on the migration, but since we're adding this at such a stage, would be great to have it in the newer structure to avoid being updated again by next week. :)

Checkout more info on the implementation changes in new metrics collections approach: /src/ragas/metrics/collections/

@contextablemark
Copy link
Author

contextablemark commented Nov 4, 2025

Thanks for the update @contextablemark

We're refactoring our metrics approach from LangchainLLMWrapper to work with InstructorLLM via llm_factory.

While your code update works with both, the doc shows the earlier approach. We're yet to write a detailed guide on the migration, but since we're adding this at such a stage, would be great to have it in the newer structure to avoid being updated again by next week. :)

Checkout more info on the implementation changes in new metrics collections approach: /src/ragas/metrics/collections/

Sure... I had tried using it initially when I saw the deprecation warning message, but had some issues - I'll take another look.

@contextablemark
Copy link
Author

@anistark Looking into the refactoring raised some additional issues/questions regarding other steps in the workflow :

  • Support in the core evaluator: It seems that ragas.evaluate doesn’t recognize collections metrics yet and still expects legacy Metric subclasses. Is there a plan in the works to bring collections metrics into the core evaluator?

  • Blended metrics: - Along similar lines, if a workflow needs both legacy and collections metrics, is there a recommended “bridge” pattern—or should we avoid mixing the two until the execution pipeline handles both families?

  • Manual evaluation path: In the interim, is it acceptable to call metric.ascore(...) manually for the Instructor-based metrics and stitch the results together (i.e., handling all of the orchestration directly)?

  • Documentation guidance: Should any documentation mention the manual evaluation as an interim approach, hinting at something else to come?

Just trying to figure out whether it makes sense to wait until next week if some changes are imminent that will make the overall implementation easier.

  • Mark

@anistark
Copy link
Member

anistark commented Nov 4, 2025

  • Support in the core evaluator: It seems that ragas.evaluate doesn’t recognize collections metrics yet and still expects legacy Metric subclasses. Is there a plan in the works to bring collections metrics into the core evaluator?

evaluate will be deprecated once all metrics are migrated to collections.

  • Blended metrics: - Along similar lines, if a workflow needs both legacy and collections metrics, is there a recommended “bridge” pattern—or should we avoid mixing the two until the execution pipeline handles both families?

I think we can focus on collections as of now going forward. We'll support legacy evaluate till a certain version (undecided) and then completely remove.

  • Manual evaluation path: In the interim, is it acceptable to call metric.ascore(...) manually for the Instructor-based metrics and stitch the results together (i.e., handling all of the orchestration directly)?

While manually is fine, but better to align with rest of it, so we don't have to make changes again in couple of weeks.

  • Documentation guidance: Should any documentation mention the manual evaluation as an interim approach, hinting at something else to come?

If you want to do manual, then yes. Otherwise, not required.

@contextablemark
Copy link
Author

@anistark Thanks for the answers to my questions. I'm starting to think that my integration may be attempting to do too much; in particular the "evaluate_ag_ui_agent" method has at its core ragas.evaluate, which is going away. And if this PR is reflective of the intended direction of the overall project, I may need to rethink my examples.

Do you think it would make sense for me to wait for the "dust to settle" and revisit the topic next week after more of the changes have been merged? If so, I'll move this PR back to draft.

Thanks,
Mark

@anistark
Copy link
Member

anistark commented Nov 5, 2025

Do you think it would make sense for me to wait for the "dust to settle" and revisit the topic next week after more of the changes have been merged? If so, I'll move this PR back to draft.

Sure, if that's what you want. Can park it till all migrations are done, with docs update.

@contextablemark contextablemark marked this pull request as draft November 5, 2025 09:21
@contextablemark
Copy link
Author

Do you think it would make sense for me to wait for the "dust to settle" and revisit the topic next week after more of the changes have been merged? If so, I'll move this PR back to draft.

Sure, if that's what you want. Can park it till all migrations are done, with docs update.

Thanks... I'll keep an eye on the situation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:XXL This PR changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants