Conversation

@nina-kollman (Contributor) commented Dec 8, 2025

Fixes TLP-1313

  • I have added tests that cover my changes.
  • If adding a new instrumentation or changing an existing one, I've added screenshots from some observability platform showing the change.
  • PR name follows conventional commits format: feat(instrumentation): ... or fix(instrumentation): ....
  • (If applicable) I have updated the documentation accordingly.

Important

Add agent evaluators and synonym mapping for flexible task output validation, with comprehensive tests.

  • New Features:
    • Added agent evaluators for goal accuracy, tool error detection, flow quality, efficiency, and goal completeness in agents_exp.py.
    • Introduced synonym mapping for field names in field_mapping.py to allow flexible naming of task output fields.
  • Tests:
    • Added tests for synonym mapping and task output validation in test_field_mapping.py and test_evaluator.py.
    • Updated test_experiment.py to test validation with multiple evaluators and synonym mapping.
  • Misc:
    • Refactored validate_task_output to validate_and_normalize_task_output in evaluator.py to include normalization.

This description was created by Ellipsis for 2f8dd94. You can customize this summary. It will automatically update as commits are pushed.


Summary by CodeRabbit

  • New Features

    • Many new evaluators added (agent performance, tool-error detection, flow quality, efficiency, goal completeness, correctness, completeness, uncertainty, tone, etc.).
    • Sample app: demo experiment to run and view multi-evaluator agent results.
    • Travel-planner agent example with tooling and itinerary generation.
  • Behavior Changes

    • Input normalization: evaluator inputs accept common synonyms and are normalized; clearer guidance is shown when required fields are missing.
  • Tests

    • Added comprehensive tests for field-mapping/normalization; existing tests updated to match new validation behavior.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai bot commented Dec 8, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds synonym-based field mapping and normalization applied before evaluator validation, renames the validator (which now returns the normalized output), adds many new "made by traceloop" evaluator factories, introduces a sample agents experiment demo and a travel-agent example, and updates tests to cover the mappings and validator behavior.

Changes

  • Field Mapping Infrastructure (packages/traceloop-sdk/traceloop/sdk/evaluator/field_mapping.py)
    New module with SYNONYM_GROUPS, _SYNONYM_MAP, and helpers get_synonyms(), normalize_task_output(), get_field_suggestions(), and format_field_help() to map synonyms, normalize task outputs, and produce suggestions/help text. A minimal illustrative sketch follows this table.
  • Validator Refactor & Integration (packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py, packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py)
    Renames validate_task_output() → validate_and_normalize_task_output() and changes it to return a normalized Dict[str, Any]; integrates the normalization step, uses get_field_suggestions() and format_field_help() for richer missing-field messages, and updates experiment codepaths to consume the normalized output.
  • New Evaluator Factories (packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py)
    Adds many new factory methods on EvaluatorMadeByTraceloop (e.g., context_relevance, answer_completeness, answer_correctness, uncertainty_detector, agent_tool_error_detector, agent_flow_quality, agent_efficiency, agent_goal_completeness, instruction_adherence, conversation_quality, intent_change, tone_detection), each declaring required_input_fields.
  • Sample Agent Evaluator Demo (packages/sample-app/sample_app/experiment/made_by_traceloop/agents_exp.py)
    New demo script that generates asynchronous agent traces (uses AsyncOpenAI), normalizes inputs, configures five Traceloop agent evaluators, runs an experiment via client.experiment.run, and prints results/errors.
  • Agent Example — Tools & Agent (packages/sample-app/sample_app/agents/travel_agent_example.py)
    New travel-planner agent with Pydantic models, six function tools (search, weather, coords, destination info, distance, create_itinerary), a TravelPlannerAgent class, streaming tool handling, a query generator, a CLI driver, and run-summary reporting.
  • Tests — Validator Usage & Field-Mapping Tests (packages/traceloop-sdk/tests/evaluator/test_evaluator.py, packages/traceloop-sdk/tests/evaluator/test_field_mapping.py, packages/traceloop-sdk/tests/experiment/test_experiment.py)
    Tests updated to call validate_and_normalize_task_output(), with relaxed exact-error-string assertions and new comprehensive tests for synonyms, normalization, suggestions, help formatting, and end-to-end normalization behavior.
  • Manifests / Metadata (pyproject.toml, requirements.txt)
    Manifest files present in the diff (no functional changes described).
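
To make the field-mapping behavior concrete, here is a minimal, self-contained sketch of how synonym-based normalization of this kind can work. The group contents, signatures, and the exact-match-over-synonym precedence follow the descriptions in this thread, but they are illustrative assumptions; the real field_mapping.py differs in detail.

    from typing import Any, Dict, List, Set

    # Illustrative synonym groups only; the SDK's actual SYNONYM_GROUPS differ.
    SYNONYM_GROUPS: List[Set[str]] = [
        {"completion", "answer", "response", "text", "output"},
        {"question", "prompt", "input", "query"},
    ]

    # Invert the groups into a field -> synonyms lookup.
    _SYNONYM_MAP: Dict[str, Set[str]] = {
        field: group for group in SYNONYM_GROUPS for field in group
    }

    def normalize_task_output(task_output: Dict[str, Any],
                              required_fields: Set[str]) -> Dict[str, Any]:
        """Rename synonym keys onto required field names; keep everything else."""
        normalized = dict(task_output)
        for required in required_fields:
            if required in normalized:
                continue  # exact matches take precedence over synonyms
            for synonym in sorted(_SYNONYM_MAP.get(required, set())):
                if synonym in normalized:
                    normalized[required] = normalized.pop(synonym)
                    break
        return normalized

    # 'text' is treated here as a synonym of 'completion':
    print(normalize_task_output({"text": "hi", "extra": 1}, {"completion"}))
    # -> {'extra': 1, 'completion': 'hi'}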

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant AgentDemo as AgentDemo (sample_app)
    participant OpenAI as AsyncOpenAI
    participant Normalizer as FieldMapping.normalize_task_output
    participant Traceloop as Traceloop Client
    participant Evaluators as EvaluatorMadeByTraceloop
    participant Experiment as Experiment.run
    AgentDemo->>OpenAI: generate_agent_trace(task_description)
    OpenAI-->>AgentDemo: trace (prompt, completion, tools, etc.)
    AgentDemo->>Normalizer: normalize_task_output(trace, required_fields)
    Normalizer-->>AgentDemo: normalized_task_output
    AgentDemo->>Traceloop: build experiment payload (evaluators list)
    Traceloop->>Evaluators: instantiate evaluators (required_input_fields)
    AgentDemo->>Experiment: client.experiment.run(dataset_slug, payload)
    Experiment-->>Traceloop: dispatch to evaluators
    Evaluators-->>Experiment: evaluations/results
    Experiment-->>AgentDemo: aggregated results & errors

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Areas needing extra attention:
    • field_mapping.py: verify synonym groups, precedence rules (exact match over synonyms), preservation of unmapped fields, and edge cases (empty inputs, multiple candidate synonyms).
    • evaluator.py: confirm normalized output is consistently used downstream and that error messages are formatted correctly and helpful.
    • experiment.py: ensure single-row and bulk paths correctly accept and persist the normalized output shape.
    • evaluators_made_by_traceloop.py: spot-check each new evaluator's required_input_fields for accuracy.
    • agents_exp.py and travel_agent_example.py: validate async/OpenAI usage patterns, environment assumptions, and safe tool orchestration.
    • Tests: ensure relaxed assertions still validate intended behavior and that new mapping tests cover tricky synonym/priority cases.

Poem

🐰 I hopped through synonyms with a twitchy nose,
Mapped prompts to completions in tidy rows,
Five evaluators now dance and peep,
Traces normalized before they leap —
A rabbit's cheer for tidy tests and prose!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Out of Scope Changes check (⚠️ Warning): The PR includes out-of-scope changes: travel_agent_example.py introduces a complete travel planning agent unrelated to the agent evaluator requirements, and agents_exp.py demonstrates evaluator usage beyond the scope of adding evaluators. Resolution: consider separating the travel agent demo and the evaluator demonstrations into separate PRs focused solely on adding the agent evaluators and field mapping infrastructure.
✅ Passed checks (4 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): The title 'fix(evals): Add agent evaluators to made by traceloop' clearly summarizes the main change: adding agent evaluators to the evaluators_made_by_traceloop module.
  • Linked Issues check (✅ Passed): The PR addresses TLP-1313 by adding agent evaluators (agent_tool_error_detector, agent_flow_quality, agent_efficiency, agent_goal_completeness) and field mapping for input normalization, as required.
  • Docstring Coverage (✅ Passed): Docstring coverage is 95.83%, which is above the required threshold of 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
  • 🧪 Generate unit tests (beta)
    • Create PR with unit tests
    • Post copyable unit tests in a comment
    • Commit unit tests in branch nk/add_agent_evals

Comment @coderabbitai help to get the list of available commands and usage tips.

@nina-kollman marked this pull request as ready for review December 8, 2025 09:22
@ellipsis-dev bot (Contributor) left a comment


Important

Looks good to me! 👍

Reviewed everything up to 3532412 in 2 minutes and 24 seconds.
  • Reviewed 1364 lines of code in 8 files
  • Skipped 0 files when reviewing.
  • Skipped posting 10 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/sample-app/sample_app/experiment/made_by_traceloop/agents_exp.py:29
  • Draft comment:
    Consider verifying that OPENAI_API_KEY is set to avoid runtime errors.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 20% <= threshold 50% The comment suggests verifying that an environment variable is set, which is a general suggestion to avoid runtime errors. It doesn't provide a specific code suggestion or ask for a specific test to be written. It falls under the category of asking the author to ensure something is done, which is against the rules.
2. packages/sample-app/sample_app/experiment/made_by_traceloop/agents_exp.py:32
  • Draft comment:
    Consider adding error handling around the OpenAI API call to handle potential failures.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50% This comment is suggesting a code quality improvement (adding error handling) rather than identifying a specific bug. While error handling is generally good practice, the rules state that I should not comment on things that are "obvious" and that comments should only be made when "there is clearly a code change required." This is more of a suggestion for defensive programming. Additionally, this is sample/example code in an experimental file, where error handling might be intentionally omitted for clarity. The comment doesn't point to a specific failure case or bug - it's generic advice. According to the rules, "Comments that suggest code quality refactors are good! But only if they are actionable and clear" - while this is actionable, it's also quite generic and could apply to almost any API call. The rules emphasize that I should only keep comments with "STRONG EVIDENCE that the comment is correct" and by default assume comments are incorrect or not useful. This could be legitimate advice since API calls can fail for various reasons (network issues, rate limits, authentication failures, etc.), and the lack of error handling could cause the entire experiment to crash. In production code, this would definitely be important. The comment is actionable and clear about what needs to be done. While error handling is important in production code, this is explicitly sample/example code in an experimental file. The rules state not to make comments that are "obvious" - and adding error handling to API calls is fairly obvious best practice. More importantly, the experiment framework itself has stop_on_error=False at line 159, suggesting the framework handles errors at a higher level. The comment is generic advice rather than identifying a specific problem with the code change. This comment should be deleted. It's generic advice about error handling that could apply to any API call, rather than identifying a specific issue with the code. This is sample/example code where such details might be intentionally simplified, and the suggestion falls under "obvious" best practices rather than a clear code change requirement.
3. packages/traceloop-sdk/tests/evaluator/test_evaluator.py:50
  • Draft comment:
    Great tests! The error messages properly include required field names and synonym suggestions.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative and does not provide any actionable feedback or suggestions for improvement. It simply states that the error messages include required field names and synonym suggestions, which does not align with the rules for useful comments.
4. packages/traceloop-sdk/tests/evaluator/test_field_mapping.py:10
  • Draft comment:
    Comprehensive test coverage for synonym mapping and normalization. Consider adding additional edge cases if necessary.
  • Reason this comment was not posted:
    Comment looked like it was already resolved.
5. packages/traceloop-sdk/experiment/experiment.py:447
  • Draft comment:
    Consider logging a warning when skipping invalid JSON lines in _parse_jsonl_to_rows to help diagnose data issues.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
6. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py:149
  • Draft comment:
    The validate_and_normalize_task_output function effectively integrates synonym mapping and provides detailed error messages. Nice work!
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative and does not provide any actionable feedback or suggestions for improvement. It simply praises the function without offering any specific guidance or questions.
7. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:243
  • Draft comment:
    Docstring update for word_count_ratio improves clarity. The configuration details are well documented.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
8. packages/traceloop-sdk/traceloop/sdk/evaluator/field_mapping.py:162
  • Draft comment:
    The normalization logic is clear; consider documenting that an exact match is prioritized over synonyms.
  • Reason this comment was not posted:
    Comment looked like it was already resolved.
9. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:230
  • Draft comment:
    Using asyncio.Semaphore(50) is good for concurrency control; consider making the concurrency limit configurable for scalability.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
10. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:169
  • Draft comment:
    Integrating task output normalization using validate_and_normalize_task_output enhances consistency. Good design choice.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative and does not provide any actionable feedback or suggestions for improvement. It simply states that a change enhances consistency, which is not useful for the PR author.

Workflow ID: wflow_XEMgq1SW6kNzRGaS

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

@coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
packages/traceloop-sdk/tests/evaluator/test_evaluator.py (1)

6-7: Minor: Class docstring still references old function name.

The docstring says "Tests for validate_task_output function" but the tests now use validate_and_normalize_task_output.

 class TestValidateTaskOutput:
-    """Tests for validate_task_output function"""
+    """Tests for validate_and_normalize_task_output function"""
packages/sample-app/sample_app/experiment/made_by_traceloop/agents_exp.py (1)

20-21: Consider deferring Traceloop.init() to avoid side effects on import.

Module-level initialization can cause unintended side effects when the module is imported but not executed. Consider moving the initialization inside run_agents_experiment().

-# Initialize Traceloop
-client = Traceloop.init()
+# Client will be initialized when running the experiment
+client = None


 async def generate_agent_trace(task_description: str) -> dict:

Then in run_agents_experiment():

async def run_agents_experiment():
    global client
    if client is None:
        client = Traceloop.init()
    # ... rest of function
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f12aaec and 3532412.

📒 Files selected for processing (8)
  • packages/sample-app/sample_app/experiment/made_by_traceloop/agents_exp.py (1 hunks)
  • packages/traceloop-sdk/tests/evaluator/test_evaluator.py (14 hunks)
  • packages/traceloop-sdk/tests/evaluator/test_field_mapping.py (1 hunks)
  • packages/traceloop-sdk/tests/experiment/test_experiment.py (3 hunks)
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (4 hunks)
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1 hunks)
  • packages/traceloop-sdk/traceloop/sdk/evaluator/field_mapping.py (1 hunks)
  • packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/traceloop-sdk/tests/experiment/test_experiment.py
  • packages/traceloop-sdk/tests/evaluator/test_field_mapping.py
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/agents_exp.py
  • packages/traceloop-sdk/tests/evaluator/test_evaluator.py
  • packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
  • packages/traceloop-sdk/traceloop/sdk/evaluator/field_mapping.py
🧬 Code graph analysis (5)
packages/traceloop-sdk/tests/evaluator/test_field_mapping.py (3)
packages/traceloop-sdk/traceloop/sdk/evaluator/field_mapping.py (4)
  • get_synonyms (52-63)
  • normalize_task_output (66-126)
  • get_field_suggestions (129-149)
  • format_field_help (152-167)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (1)
  • validate_and_normalize_task_output (149-218)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
  • EvaluatorDetails (5-23)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (2)
packages/traceloop-sdk/traceloop/sdk/evaluator/field_mapping.py (3)
  • normalize_task_output (66-126)
  • get_field_suggestions (129-149)
  • format_field_help (152-167)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
  • EvaluatorDetails (5-23)
packages/traceloop-sdk/tests/evaluator/test_evaluator.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (1)
  • validate_and_normalize_task_output (149-218)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (2)
  • Evaluator (16-146)
  • validate_and_normalize_task_output (149-218)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
  • EvaluatorDetails (5-23)
🔇 Additional comments (41)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (13)

243-243: Minor docstring fix looks good.

The docstring now correctly documents both required fields (numerator_text and denominator_text) for the word_count_ratio evaluator.


456-478: New answer_completeness evaluator looks good.

The implementation follows the established pattern with appropriate required fields (question, completion, context).


480-499: New prompt_perplexity evaluator looks good.

Follows the established pattern with the required prompt field.


501-522: New answer_correctness evaluator looks good.

Implementation is consistent with the required fields (question, completion, ground_truth).


524-543: New uncertainty_detector evaluator looks good.

Appropriately requires only the prompt field.


545-565: New agent_tool_error_detector evaluator looks good.

Required fields (tool_input, tool_output) are appropriate for detecting tool execution errors.


567-587: New agent_flow_quality evaluator looks good.

Uses trajectory fields (trajectory_prompts, trajectory_completions) as documented.


589-609: New agent_efficiency evaluator looks good.

Consistent with other trajectory-based evaluators using the same required fields.


611-631: New agent_goal_completeness evaluator looks good.

Uses trajectory fields consistently with other agent evaluators.


633-653: New instruction_adherence evaluator looks good.

Required fields (instructions, response) are appropriate for evaluating instruction following.


655-676: New conversation_quality evaluator looks good.

Required fields (prompts, completions) are appropriate for conversation evaluation.


678-699: New intent_change evaluator looks good.

Uses the same conversation fields (prompts, completions) consistently.


701-720: New tone_detection evaluator looks good.

Requires only text field, consistent with other text-analysis evaluators.

packages/traceloop-sdk/tests/experiment/test_experiment.py (3)

168-169: Test assertions updated correctly for new error message format.

The split assertions ("pii-detector requires:" and "'text'" separately) correctly accommodate the new multi-line error format from validate_and_normalize_task_output.


215-216: Consistent with the assertion pattern change above.


315-320: Helpful inline comment explaining synonym mapping behavior.

The comment clarifies that 'text' maps to 'response' via synonym mapping, so only 'prompt' remains missing. This makes the test's expected behavior clear for future maintainers.
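
A hedged illustration of the behavior described above, reusing the toy normalizer sketched after the Changes table (the real SDK's synonym groups and output may differ):

    # 'text' maps to 'response' via the synonym group, so only 'prompt' is missing.
    out = normalize_task_output({"text": "some model output"}, {"response", "prompt"})
    missing = {"response", "prompt"} - out.keys()
    print(sorted(missing))  # ['prompt']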

packages/traceloop-sdk/tests/evaluator/test_evaluator.py (3)

2-2: Import correctly updated to validate_and_normalize_task_output.


15-15: All function calls correctly updated.

All invocations of the validation function have been updated to use validate_and_normalize_task_output.

Also applies to: 26-26, 40-40, 50-50, 73-73, 94-94, 116-116, 131-131, 150-150, 160-160, 180-180, 196-196, 218-218


54-55: Assertions correctly adapted for new multi-line error format.

The split assertions properly verify both the evaluator slug and the missing field name are present in the error message.

Also applies to: 163-164, 200-201

packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (3)

8-8: Import correctly updated to validate_and_normalize_task_output.


169-171: Correct usage of normalized task output.

The normalized task_result is now properly assigned and used downstream for task creation and evaluator execution. The comment accurately describes the behavior.


484-487: Consistent normalization in _execute_tasks.

The same normalization pattern is correctly applied in the GitHub execution path, ensuring consistent behavior across both local and GitHub execution modes.

packages/traceloop-sdk/tests/evaluator/test_field_mapping.py (6)

1-7: Well-organized test imports and structure.

Clean imports of the field mapping functions under test.


10-75: Comprehensive tests for get_synonyms.

Good coverage including:

  • Different field categories (text, question, reference, trajectory)
  • Non-synonym and unknown fields returning single-element sets
  • Symmetry verification ensuring all synonyms in a group map to the same set

77-239: Thorough tests for normalize_task_output.

Excellent coverage of normalization scenarios:

  • Various synonym mappings (text→completion, prompt→question, etc.)
  • Multiple fields normalization
  • Preservation of extra fields
  • Empty input handling
  • Priority of exact matches over synonyms
  • Mixed synonym and non-synonym fields

242-296: Good tests for get_field_suggestions.

Covers key scenarios: single suggestions, multiple synonyms available, no suggestions, and empty available fields.


299-323: Tests for format_field_help look good.

Verifies both fields with and without synonyms are formatted correctly.


326-411: Valuable integration tests with validate_and_normalize_task_output.

End-to-end tests verify the complete flow including:

  • Synonym mapping during validation
  • Helpful error messages with synonym information
  • Context→reference mapping
  • Trajectory field mappings for agent evaluators
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (5)

3-3: Clean import of field mapping utilities.

The imports are appropriately scoped to only the needed functions.


149-168: Well-documented function signature and docstring.

The renamed function clearly communicates its dual purpose (validation and normalization). The docstring accurately describes the return value and exception behavior.


170-177: Clean normalization flow.

Collecting all required fields and normalizing in a single pass is efficient. The normalized output is then used for validation, ensuring synonyms are properly resolved before checking for missing fields.
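
A sketch of that collect-then-normalize shape, using a stand-in for the SDK's EvaluatorDetails (the slugs and field lists come from elsewhere in this review; everything else is illustrative):

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class EvaluatorDetails:  # stand-in for traceloop.sdk.evaluator.config.EvaluatorDetails
        slug: str
        required_input_fields: List[str]

    evaluators = [
        EvaluatorDetails("pii-detector", ["text"]),
        EvaluatorDetails("answer-correctness", ["question", "completion", "ground_truth"]),
    ]

    # Union all required fields once, then normalize the task output a single time.
    all_required = set().union(*(set(e.required_input_fields) for e in evaluators))
    print(sorted(all_required))  # ['completion', 'ground_truth', 'question', 'text']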


179-195: Validation logic correctly checks normalized output.

The validation now operates on normalized_output instead of the raw task_output, ensuring that synonym-mapped fields are recognized as present.


197-218: Enhanced error reporting with helpful suggestions.

The error messages now include:

  • Per-evaluator breakdown of missing fields
  • Synonym information via format_field_help
  • Suggestions via get_field_suggestions
  • Helpful hint about using synonyms

This significantly improves the developer experience when debugging validation failures.

packages/sample-app/sample_app/experiment/made_by_traceloop/agents_exp.py (5)

1-12: Clear and informative module docstring.

The docstring provides a good overview of the five agent evaluators being demonstrated.


29-29: Good security practice: API key accessed via environment variable.

Using os.getenv("OPENAI_API_KEY") follows the coding guidelines for not hardcoding secrets.


52-109: Excellent documentation of field synonym mapping.

The detailed docstring explaining synonym mappings is very helpful for users learning the SDK. The task function demonstrates practical usage of synonyms.


153-155: Hardcoded dataset slug requires user action.

The comment on line 154 explains the user needs to create a matching dataset, which is appropriate for a sample app. Consider adding this to the README or top-level docstring as well.


137-143: Clean evaluator configuration demonstrating all 5 agent evaluators.

The evaluator setup clearly shows usage of the new factory methods.

packages/traceloop-sdk/traceloop/sdk/evaluator/field_mapping.py (3)

52-63: LGTM!

The function correctly returns all synonyms for a given field, with appropriate fallback behavior when the field has no defined synonyms.


129-149: LGTM!

The synonym-based field suggestion logic is correct and efficient. The use of set intersection to identify synonym overlap is appropriate.


152-167: LGTM!

The help formatting logic correctly handles both single fields and fields with synonyms. Good use of sorted() on line 166 to ensure consistent output.

@ellipsis-dev bot (Contributor) left a comment

Important

Looks good to me! 👍

Reviewed 5f6a12e in 52 seconds.
  • Reviewed 87 lines of code in 3 files
  • Skipped 0 files when reviewing.
  • Skipped posting 4 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/tests/evaluator/test_field_mapping.py:28
  • Draft comment:
    Updated test for 'question' now checks for the 'query' synonym and expects 4 items. This aligns with the updated synonym mapping.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative, explaining what was changed in the test. It doesn't provide any suggestions or raise any issues that need to be addressed. According to the rules, purely informative comments should be removed.
2. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:299
  • Draft comment:
    New 'context_relevance' evaluator added with required fields ['query', 'context']. Looks consistent; ensure documentation reflects the new evaluator.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% The comment is asking to ensure that documentation reflects a new feature. This is a request for confirmation and does not provide a specific code suggestion or ask for a test to be written. It violates the rule against asking the PR author to ensure something is done.
3. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:477
  • Draft comment:
    Removed 'prompt_perplexity' evaluator. Confirm that this renaming (to 'perplexity') is intentional and that any external references or documentation have been updated accordingly.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
4. packages/traceloop-sdk/traceloop/sdk/evaluator/field_mapping.py:21
  • Draft comment:
    The input/question synonym group now includes 'query'—this change is consistent with evaluator updates. Ensure all related docs/tests are in sync.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% The comment is asking the PR author to ensure that all related documentation and tests are in sync with the change. This falls under the rule of not asking the author to ensure things are tested or documented, which is not allowed.

Workflow ID: wflow_lWaGc86NZC3i8ve5

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

@coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
packages/traceloop-sdk/traceloop/sdk/evaluator/field_mapping.py (2)

11-28: Clarify/remove “canonical form” wording for unordered synonym groups.

SYNONYM_GROUPS are sets, so there is no stable “first field” or canonical element. The comment on Line 12 is misleading and may confuse future maintainers. Either drop the canonical-form claim or switch to an ordered type (e.g., tuple/list) if you truly need a canonical element.


97-119: Make synonym selection deterministic when multiple synonyms are present.

The loop over synonyms (a set) means that when the task output contains multiple synonyms for a required field (e.g., both "answer" and "response" for "completion"), which one is chosen is effectively random across runs/Python versions. This can yield inconsistent normalized outputs.

Make the selection ordered, for example:

  • Sort before iterating: for synonym in sorted(synonyms): ..., or
  • Define SYNONYM_GROUPS as ordered (e.g., tuples with canonical-first) and iterate that sequence.

Your test for preferring an exact match already covers one important case; this change will make the remaining cases deterministic as well.
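
A tiny runnable sketch of why sorting pins the outcome when two synonyms are both present (toy data, not SDK code):

    task_output = {"answer": "A", "response": "B"}
    candidates = {"answer", "response"}  # both synonyms of the same required field

    # Iterating the set directly makes the winner depend on hash order;
    # sorting first deterministically picks 'answer' on every run.
    chosen = next(s for s in sorted(candidates) if s in task_output)
    print(chosen)  # answer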

🧹 Nitpick comments (3)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (2)

241-243: Minor docstring indentation inconsistency.

Line 242 has 11 leading spaces while the modified line 243 has 10 leading spaces. Consider aligning them for consistency.

        Required task output fields:
-           - numerator_text: The numerator text for ratio calculation
-          - denominator_text: The denominator text for ratio calculation
+            - numerator_text: The numerator text for ratio calculation
+            - denominator_text: The denominator text for ratio calculation

500-503: Remove extra blank line (PEP 8 E303).

There are two consecutive blank lines between methods. PEP 8 recommends a maximum of two blank lines between top-level definitions, but only one blank line between method definitions inside a class.

         return EvaluatorDetails(
             slug="answer-completeness",
             version=None,
             config=config,
             required_input_fields=["question", "completion", "context"],
         )

-
     @staticmethod
     def answer_correctness(
packages/traceloop-sdk/traceloop/sdk/evaluator/field_mapping.py (1)

31-64: Avoid exposing and sharing mutable synonym sets across callers.

_build_synonym_map stores each group set directly, and get_synonyms returns that same set instance. Any caller that mutates the returned set will silently change _SYNONYM_MAP (and effectively SYNONYM_GROUPS) for the entire process.

To make this safer, consider one of:

  • Clone groups when building the map: synonym_map[field] = set(group).
  • Or keep the current map but have get_synonyms return a copy or frozenset(synonyms).

This keeps internal configuration immutable from the outside while preserving current behavior.
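
A sketch of the defensive-copy variant, assuming the module-level shapes this review describes:

    from typing import Dict, FrozenSet, List, Set

    SYNONYM_GROUPS: List[Set[str]] = [{"completion", "answer", "response"}]

    # Freeze each group while building the map so callers cannot mutate it...
    _SYNONYM_MAP: Dict[str, FrozenSet[str]] = {
        field: frozenset(group) for group in SYNONYM_GROUPS for field in group
    }

    def get_synonyms(field: str) -> FrozenSet[str]:
        # ...and hand back an immutable view as a second line of defense.
        return _SYNONYM_MAP.get(field, frozenset({field}))

    print(get_synonyms("answer"))  # immutable, safe to share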

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 3532412 and 5f6a12e.

📒 Files selected for processing (3)
  • packages/traceloop-sdk/tests/evaluator/test_field_mapping.py (1 hunks)
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (2 hunks)
  • packages/traceloop-sdk/traceloop/sdk/evaluator/field_mapping.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/traceloop-sdk/tests/evaluator/test_field_mapping.py
  • packages/traceloop-sdk/traceloop/sdk/evaluator/field_mapping.py
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
🧬 Code graph analysis (2)
packages/traceloop-sdk/tests/evaluator/test_field_mapping.py (3)
packages/traceloop-sdk/traceloop/sdk/evaluator/field_mapping.py (4)
  • get_synonyms (52-63)
  • normalize_task_output (66-126)
  • get_field_suggestions (129-149)
  • format_field_help (152-167)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (1)
  • validate_and_normalize_task_output (149-218)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
  • EvaluatorDetails (5-23)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
  • EvaluatorDetails (5-23)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Test Packages (3.10)
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Build Packages (3.11)
  • GitHub Check: Lint
🔇 Additional comments (3)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (3)

302-322: LGTM!

The context_relevance evaluator follows the established pattern with appropriate required fields (query, context) and clear documentation.


526-633: LGTM!

The agent-focused evaluators (uncertainty_detector, agent_tool_error_detector, agent_flow_quality, agent_efficiency, agent_goal_completeness) are well-structured with appropriate required input fields for their respective use cases. The trajectory-based evaluators consistently require trajectory_prompts and trajectory_completions, which aligns with the PR's agent evaluation objectives.


635-722: LGTM!

The remaining evaluators (instruction_adherence, conversation_quality, intent_change, tone_detection) are well-implemented with clear documentation. The docstrings helpfully specify the expected data format for conversation-based evaluators (e.g., "flattened dict with llm.prompts.X.content/role").

@ellipsis-dev bot (Contributor) left a comment

Caution

Changes requested ❌

Reviewed c27da2f in 2 minutes and 31 seconds.
  • Reviewed 927 lines of code in 2 files
  • Skipped 0 files when reviewing.
  • Skipped posting 3 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/sample-app/sample_app/agents/travel_agent_example.py:178
  • Draft comment:
    Consider using an asynchronous HTTP client (e.g., aiohttp) instead of blocking 'requests.get' calls in async functions to avoid blocking the event loop.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 30% vs. threshold = 50% This is a code quality/performance suggestion about using async HTTP clients in async functions. While technically correct that blocking calls in async functions can block the event loop, I need to consider: 1) This is a sample/demo app (in "sample-app" directory), 2) The author explicitly used requests throughout despite writing async code, suggesting a deliberate choice, 3) The comment is somewhat generic and doesn't point to a specific bug or broken functionality, 4) According to the rules, I should only keep comments with STRONG EVIDENCE of correctness and that require clear code changes. This feels more like a "nice to have" optimization rather than a critical issue. The code will work as-is, just not optimally from an async perspective. This could be a legitimate performance concern if the application is meant to handle concurrent requests efficiently. The blocking calls could indeed cause issues in a production async application. The comment is actionable (switch to aiohttp) and technically correct. Maybe I'm being too lenient by dismissing it as just a sample app issue. While the technical advice is sound, the rules state I should not keep comments unless there's STRONG EVIDENCE of incorrectness. This is more of a best practice/optimization suggestion rather than a bug. The code functions correctly as written. The author clearly made a deliberate choice to use synchronous requests (they imported requests, used it consistently throughout, and added asyncio.sleep calls showing async awareness). For a sample/demo app, this is acceptable. The comment doesn't point to broken functionality that would be caught as a bug. This is a valid best practice suggestion but not a critical issue. The code works correctly as written, and this appears to be a deliberate design choice in a sample application. The comment is more of a "nice to have" optimization rather than pointing to broken functionality. According to the rules requiring STRONG EVIDENCE of incorrectness, this should be deleted.
2. packages/sample-app/sample_app/agents/travel_agent_example.py:536
  • Draft comment:
    Verify if the OpenAI client call is non-blocking. If the client is synchronous, consider using an async client or running it in an executor.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50% The comment raises a valid concern about using a synchronous OpenAI client in an async function. However, there are several issues: 1) The comment is placed on the client instantiation line (536) rather than the actual API call line (561), 2) The comment uses weak language like "Verify if..." which violates the rule against asking the author to confirm things, 3) Looking at the actual usage at line 561, client.beta.chat.completions.parse() is indeed a synchronous call that could block the event loop. While the underlying concern is valid (synchronous API calls in async functions are problematic), the comment's phrasing ("Verify if...") makes it non-actionable and speculative rather than definitive. The underlying technical concern is legitimate - using synchronous OpenAI client methods in an async function can block the event loop. However, the comment violates the rule against asking authors to "verify" or "ensure" things. It should either definitively state the problem and suggest a fix, or not be made at all. While the technical concern is valid, the comment explicitly violates the rules by starting with "Verify if..." which is a pattern we're told to reject. The rules state: "Do NOT ask the PR author to confirm their intention, to explain, to double-check things, to ensure the behavior is intended... If the comments starts with 'Verify that...' or 'Ensure that...', it is likely not useful." This comment should be deleted based on its phrasing alone. Delete this comment. It violates the rule against asking authors to "verify" things. Even though there's a legitimate technical concern about using synchronous OpenAI calls in async functions, the comment's phrasing makes it non-actionable and speculative rather than definitive.
3. packages/traceloop-sdk/tests/evaluator/test_field_mapping.py:327
  • Draft comment:
    Docstring and function references updated to 'validate_and_normalize_task_output' look correct. Ensure all module references are consistent.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None

Workflow ID: wflow_A9Ea3hOHiVYhRDoz

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

@coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
packages/traceloop-sdk/tests/evaluator/test_field_mapping.py (1)

327-412: Integration tests give good end-to-end coverage; minor naming nit

The integration tests do a nice job of verifying that validate_and_normalize_task_output actually wires in normalization and synonym-aware error messages, including the new trajectory fields for agent evaluators.

One minor naming nit: the class name TestIntegrationWithValidateTaskOutput (Line 327) still references the old function name, even though the tests now correctly exercise validate_and_normalize_task_output. Renaming the class (e.g. to TestIntegrationWithValidateAndNormalizeTaskOutput or similar) would avoid confusion for future readers.

🧹 Nitpick comments (6)
packages/sample-app/sample_app/agents/travel_agent_example.py (5)

2-2: Remove unused import.

The json module is imported but never used in this file.

 import asyncio
-import json
 import random

313-320: Remove debug traceback output before production use.

The traceback.print_exc() call is useful for debugging but should be removed or replaced with structured logging for production code. This applies here and at line 585-586 as well.

     except Exception as e:
         print(f"Error getting weather forecast (Exception): {type(e).__name__}: {str(e)}")
-        import traceback
-        traceback.print_exc()
         return WeatherResponse(
             status="error",
             message=f"Failed to get weather forecast: {type(e).__name__}: {str(e)}"
         )

583-590: Remove debug traceback output.

Same as the earlier occurrence—remove or replace with structured logging.

     except Exception as e:
         print(f"Error creating itinerary: {type(e).__name__}: {str(e)}")
-        import traceback
-        traceback.print_exc()
         return ItineraryResponse(
             status="error",
             message=f"Failed to create itinerary: {str(e)}"
         )

706-706: Remove unused variable travel_ctx.

The travel_ctx is instantiated but never passed to the runner or agent.

-    travel_ctx = TravelContext(conversation_history=[])
     travel_agent = TravelPlannerAgent()

871-873: Use asyncio.sleep in async context.

Using blocking time.sleep in an async function blocks the event loop. While this is at the end of a query iteration and doesn't block concurrent operations in this demo, using asyncio.sleep is more idiomatic.

         if i < args.count:
             print(f"\nWaiting {args.delay} seconds before next query...")
-            time.sleep(args.delay)
+            await asyncio.sleep(args.delay)
packages/traceloop-sdk/tests/evaluator/test_field_mapping.py (1)

10-76: Robust synonym tests; length checks may be brittle over time

The coverage of synonym groups and symmetry looks solid. The strict len(synonyms) == N assertions (Lines 20, 29, 37, 44, 51) will force test updates any time a new synonym is added, even when such additions are backwards compatible. If you expect synonym groups to evolve, consider asserting only membership (e.g. required subset) and dropping the length checks to make the tests less brittle.
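
A membership-style assertion in that spirit (a sketch, not the suite's actual code; 'question' and 'query' are confirmed group members per this thread):

    from traceloop.sdk.evaluator.field_mapping import get_synonyms

    def test_question_synonyms_contain_expected():
        # A subset check survives backwards-compatible synonym additions.
        assert {"question", "query"} <= get_synonyms("question")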

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 5f6a12e and c27da2f.

📒 Files selected for processing (2)
  • packages/sample-app/sample_app/agents/travel_agent_example.py (1 hunks)
  • packages/traceloop-sdk/tests/evaluator/test_field_mapping.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/traceloop-sdk/tests/evaluator/test_field_mapping.py
  • packages/sample-app/sample_app/agents/travel_agent_example.py
🧬 Code graph analysis (2)
packages/traceloop-sdk/tests/evaluator/test_field_mapping.py (3)
packages/traceloop-sdk/traceloop/sdk/evaluator/field_mapping.py (4)
  • get_synonyms (52-63)
  • normalize_task_output (66-126)
  • get_field_suggestions (129-149)
  • format_field_help (152-167)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (1)
  • validate_and_normalize_task_output (149-218)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
  • EvaluatorDetails (5-23)
packages/sample-app/sample_app/agents/travel_agent_example.py (3)
packages/traceloop-sdk/traceloop/sdk/__init__.py (2)
  • Traceloop (36-267)
  • init (48-198)
packages/traceloop-sdk/traceloop/sdk/tracing/tracing.py (1)
  • flush (219-224)
packages/traceloop-sdk/traceloop/sdk/decorators/__init__.py (1)
  • tool (61-71)
🪛 Flake8 (7.3.0)
packages/sample-app/sample_app/agents/travel_agent_example.py

  • [error] F401 ('json' imported but unused): line 2
  • [error] E128 (continuation line under-indented for visual indent): line 186
  • [error] F841 (local variable 'travel_ctx' is assigned to but never used): line 706

🪛 GitHub Check: CodeQL
packages/sample-app/sample_app/agents/travel_agent_example.py

  • [failure] Clear-text logging of sensitive information (line 238): this expression logs sensitive data (private) as clear text.

🪛 Ruff (0.14.7)
packages/sample-app/sample_app/agents/travel_agent_example.py

  • ARG001 (unused function argument: cw): lines 153, 222, 325, 389, 439, 506
  • RUF010 (use explicit conversion flag; replace with conversion flag): lines 213, 216, 308, 311, 314, 319, 380, 383, 430, 433, 497, 500, 584, 589
  • BLE001 (do not catch blind exception: Exception): lines 313, 496, 583
  • F841 (local variable travel_ctx is assigned to but never used; remove the assignment): line 706
  • S311 (standard pseudo-random generators are not suitable for cryptographic purposes): lines 799, 802, 803, 809, 810, 811, 812, 813, 816, 817

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Lint
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Test Packages (3.10)
  • GitHub Check: Build Packages (3.11)
🔇 Additional comments (5)
packages/sample-app/sample_app/agents/travel_agent_example.py (2)

141-148: LGTM!

The mutable default for conversation_history is correctly handled using __post_init__ to avoid the common Python pitfall of shared mutable defaults.


593-649: LGTM!

The agent class is well-structured with comprehensive instructions that guide the LLM through different query specificity levels. The tool configuration is correct.

packages/traceloop-sdk/tests/evaluator/test_field_mapping.py (3)

78-241: normalize_task_output test matrix is comprehensive

These tests exercise the key behaviors of normalize_task_output: synonym mapping in both single- and multi-field scenarios, preservation of extra fields, handling of empty inputs/requirements, preference for exact matches, and coexistence with non-synonym fields. This gives good confidence that the normalization logic matches the documented behavior.


243-297: get_field_suggestions tests align well with synonym logic

The suggestion tests cover positive matches, negative cases, multiple-suggestion scenarios, and empty availability. They mirror the get_field_suggestions semantics closely and should catch regressions in how synonym overlap is computed without over-specifying formatting.


300-325: format_field_help tests validate both branched behaviors

You verify both the single-synonym case and the multi-synonym formatting, plus a small multi-field sanity check. This is an appropriate level of assertion (checking for key substrings rather than exact formatting), keeping tests resilient to minor wording tweaks while still guarding behavior.

@ellipsis-dev bot (Contributor) left a comment

Important

Looks good to me! 👍

Reviewed 2f8dd94 in 1 minute and 2 seconds.
  • Reviewed 140 lines of code in 2 files
  • Skipped 0 files when reviewing.
  • Skipped posting 3 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/sample-app/sample_app/agents/travel_agent_example.py:703
  • Draft comment:
    Good removal: the unused 'travel_ctx' variable was removed since it's not utilized. Ensure that context is managed internally.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
2. packages/sample-app/sample_app/agents/travel_agent_example.py:728
  • Draft comment:
    Improved readability by converting query templates into multi-line parenthesized strings. This formatting change is clear and improves maintainability.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative and does not provide any actionable feedback or suggestions for improvement. It simply praises the change without offering any specific guidance or questions.
3. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:243
  • Draft comment:
    Added the missing '- denominator_text' description in the word_count_ratio evaluator docstring. The documentation now fully reflects the required input fields.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.

Workflow ID: wflow_BAekXsQaVidmm2IN

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

@coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (4)
packages/sample-app/sample_app/agents/travel_agent_example.py (4)

404-413: URL‑encode destination_name in Wikipedia request (duplicate of existing suggestion).

Concatenating destination_name directly into the path will break for spaces and special characters and can cause 404s or misrouted requests. Encode it before building the URL.

+    from urllib.parse import quote
@@
-        url = "https://en.wikipedia.org/api/rest_v1/page/summary/" + destination_name
+        encoded_name = quote(destination_name, safe="")
+        url = "https://en.wikipedia.org/api/rest_v1/page/summary/" + encoded_name

Make sure the import is at the top of the file.
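
For reference, urllib.parse.quote handles both spaces and special characters (the example string is illustrative):

    from urllib.parse import quote

    # safe="" also encodes '/' so the name cannot add path segments.
    print(quote("San José / Costa Rica", safe=""))
    # -> San%20Jos%C3%A9%20%2F%20Costa%20Rica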


848-854: Argparse help text doesn’t match the actual default (duplicate of existing suggestion).

--count has default=1 but the help string still says “default: 3”. Recommend updating the help to avoid confusing users:

     parser.add_argument(
         "--count",
         type=int,
         default=1,
-        help="Number of queries to run (default: 3)"
+        help="Number of queries to run (default: 1)"
     )

237-238: Printed location + coordinates may be sensitive (duplicate of existing warning).

print(f"Getting weather forecast for {location_name} ({latitude}, {longitude})") logs a potentially identifying place together with precise coordinates. For production usage, consider either removing this log, redacting coordinates, or downgrading it behind a debug flag, especially if logs are retained centrally.


889-891: Use await asyncio.sleep instead of time.sleep in async def main (duplicate of existing suggestion).

time.sleep(args.delay) blocks the entire event loop between queries. Replace it with await asyncio.sleep(args.delay) to keep the async runtime responsive:

-        if i < args.count:
-            print(f"\nWaiting {args.delay} seconds before next query...")
-            time.sleep(args.delay)
+        if i < args.count:
+            print(f"\nWaiting {args.delay} seconds before next query...")
+            await asyncio.sleep(args.delay)
🧹 Nitpick comments (5)
packages/sample-app/sample_app/agents/travel_agent_example.py (5)

150-155: Unused cw context parameter in tool functions (ARG001).

Ruff flags cw as unused here and in the other @function_tool functions. If you don't plan to use the context inside these tools, consider renaming to _cw (or adding # noqa: ARG001 on the function) to keep the signature compatible with the framework while silencing the lint warning.
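
A sketch of the rename; the signature is a simplified stand-in for the actual tool definitions:

# The underscore prefix marks the parameter as intentionally unused,
# which satisfies ARG001 while preserving the framework-required signature.
async def search_destinations(_cw, query: str) -> str:
    return f"Searching destinations matching: {query}"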


168-179: Blocking HTTP calls inside async tools.

All the @function_tool coroutines (search_destinations, get_weather_forecast, get_location_coordinates, get_destination_info) call requests.get directly, which blocks the event loop and can stall other async work. For better responsiveness, either switch to an async HTTP client (e.g., httpx.AsyncClient) or wrap blocking calls with await asyncio.to_thread(requests.get, ...).
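
A minimal sketch of the to_thread approach, combined with the URL-encoding fix suggested above (URL and timeout are illustrative):

import asyncio
from urllib.parse import quote

import requests

async def fetch_destination_summary(destination_name: str) -> dict:
    # quote() handles spaces and special characters in the page title.
    url = (
        "https://en.wikipedia.org/api/rest_v1/page/summary/"
        + quote(destination_name, safe="")
    )
    # to_thread runs the blocking requests.get in a worker thread,
    # keeping the event loop responsive for other tools.
    response = await asyncio.to_thread(requests.get, url, timeout=10)
    response.raise_for_status()
    return response.json()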


211-216: Error handling: avoid echoing raw exceptions and address RUF010.

Two points here (and similarly in other error handlers):

  • f"... {str(e)}" triggers Ruff RUF010; prefer f"... {e!s}" or simply f"... {e}".
  • Returning/logging the full exception string to the user can leak implementation details; consider logging the detailed error internally and returning a generic, user‑friendly message instead.

Same pattern appears in the weather, coordinates, destination info, distance, and itinerary helpers.
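
A sketch combining both fixes (logger name and the user-facing message are illustrative):

import logging

import requests

logger = logging.getLogger(__name__)

def get_forecast_text(url: str) -> str:
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.RequestException as e:
        # RUF010-friendly: explicit conversion flag instead of str(e).
        logger.warning(f"Forecast request failed: {e!s}")
        # Detailed error stays in internal logs; the caller gets a
        # generic message with no implementation details.
        return "Weather data is temporarily unavailable."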


497-502: Broad except Exception in distance calculation.

Catching Exception hides programming errors and makes debugging harder. If you mainly expect numeric issues or bad inputs, narrow this to specific exception types (e.g., ValueError, TypeError) and let unexpected errors surface, or re‑raise after logging.
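
Sketched with a haversine-style helper (purely illustrative; the sample app's actual distance code may differ):

import math

def travel_distance_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    try:
        phi1, phi2 = math.radians(lat1), math.radians(lat2)
        dphi = math.radians(lat2 - lat1)
        dlmb = math.radians(lon2 - lon1)
        a = (
            math.sin(dphi / 2) ** 2
            + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
        )
        return 6371.0 * 2 * math.asin(math.sqrt(a))
    except (TypeError, ValueError) as e:
        # Only expected input errors are translated; genuine bugs propagate.
        raise ValueError(f"Invalid coordinates for distance calculation: {e!s}") from e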


584-591: Broad except Exception in create_itinerary.

Same concern as in calculate_travel_distance: a blanket except Exception will swallow unexpected bugs (including misconfiguration of the OpenAI client). Prefer catching the specific OpenAI/library exceptions you expect plus input/validation errors, and allow truly unexpected failures to propagate or be re‑raised after logging.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between c27da2f and 2f8dd94.

📒 Files selected for processing (2)
  • packages/sample-app/sample_app/agents/travel_agent_example.py (1 hunks)
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/sample-app/sample_app/agents/travel_agent_example.py
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
🧬 Code graph analysis (2)
packages/sample-app/sample_app/agents/travel_agent_example.py (2)
packages/traceloop-sdk/traceloop/sdk/__init__.py (2)
  • Traceloop (36-267)
  • init (48-198)
packages/traceloop-sdk/traceloop/sdk/decorators/__init__.py (1)
  • tool (61-71)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
  • EvaluatorDetails (5-23)
🪛 Ruff (0.14.7)
packages/sample-app/sample_app/agents/travel_agent_example.py

152, 221, 326, 390, 440, 507: Unused function argument: cw (ARG001)

212, 215, 309, 312, 315, 320, 381, 384, 431, 434, 498, 501, 585, 590: Use explicit conversion flag; replace with conversion flag (RUF010)

314, 497, 584: Do not catch blind exception: Exception (BLE001)

817, 820-821, 827-831, 834-835: Standard pseudo-random generators are not suitable for cryptographic purposes (S311)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test Packages (3.10)
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Build Packages (3.11)
  • GitHub Check: Lint
🔇 Additional comments (4)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (4)

243-243: Good documentation fix.

The missing denominator_text field documentation now aligns with the required_input_fields specification on line 254.


479-544: LGTM! Answer evaluators are well-structured.

The three new answer-focused evaluators follow consistent patterns with existing code. Each has clear documentation and appropriate required fields for its evaluation purpose.


546-632: LGTM! Agent evaluators are well-designed.

The agent evaluators follow a coherent design where three trajectory-based evaluators (agent_flow_quality, agent_efficiency, agent_goal_completeness) appropriately share the same input fields while evaluating different aspects of agent behavior. The agent_tool_error_detector correctly uses distinct fields for tool monitoring.


634-721: LGTM! Conversation and tone evaluators are consistent.

The evaluators follow logical groupings with appropriate field sharing between conversation_quality and intent_change. Documentation clearly distinguishes between conversation-level analysis (using prompts/completions) and agent trajectory analysis.

@nina-kollman nina-kollman merged commit f2cda38 into main Dec 8, 2025
12 checks passed