Conversation

Contributor

@nina-kollman nina-kollman commented Jan 8, 2026

Fixes: TLP-1613

  • I have added tests that cover my changes.
  • If adding a new instrumentation or changing an existing one, I've added screenshots from some observability platform showing the change.
  • PR name follows conventional commits format: feat(instrumentation): ... or fix(instrumentation): ....
  • (If applicable) I have updated the documentation accordingly.

Important

Refactor evals API to use static definitions, add agent tool trajectory evaluation example, and update code generation scripts.

  • Behavior:
    • Added agent_tool_trajectory.py to demonstrate agent tool trajectory evaluation.
    • Refactored evaluator API to use static definitions in definitions.py.
  • Code Generation:
    • Updated generate_evaluator_models.py to generate static factory methods in definitions.py.
    • Removed dynamic factory create_evaluator from evaluators_made_by_traceloop.py.
  • Models:
    • Added AgentToolTrajectoryRequest and AgentToolTrajectoryResponse to request.py and response.py.
    • Updated registry.py to map new request/response models.
  • Misc:
    • Updated imports and exports in __init__.py files to reflect new structure.

This description was created by Ellipsis for acabf5e. You can customize this summary. It will automatically update as commits are pushed.


Summary by CodeRabbit

  • New Features

    • Added a new example demonstrating agent tool trajectory evaluation functionality
  • Refactor

    • Restructured evaluator definitions and imports for improved type safety and consistency

✏️ Tip: You can customize this high-level summary in your review settings.


coderabbitai bot commented Jan 8, 2026

📝 Walkthrough

Walkthrough

This PR refactors the evaluator factory system from dynamic runtime generation to static code generation. It removes the dynamic metaclass-based EvaluatorMadeByTraceloop factory, introduces a codegen function to generate static definitions, updates import paths, modifies input validation to use nested payloads, and adds a sample experiment demonstrating agent tool trajectory evaluation.

Changes

  • Sample App & Experiment (packages/sample-app/sample_app/experiment/made_by_traceloop/agent_tool_trajectory.py): New module with two functions: agent_evaluators_task() to extract tool call data from rows, and the async run_agent_tool_trajectory_experiment() to configure and execute trajectory evaluation via the Traceloop client with threshold and sensitivity parameters.
  • Evaluator Module Refactoring (packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py): Import source for EvaluatorMadeByTraceloop redirected from .evaluators_made_by_traceloop to ..generated.evaluators.definitions; removed create_evaluator from the public exports.
  • Evaluator Module Refactoring (packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py): Modified _validate_evaluator_input() to construct the request model with a nested input payload (request_model(input=input)) instead of unpacking the input dictionary directly.
  • Module Removal (packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py): Entire module deleted (120 lines removed), including the dynamic EvaluatorMadeByTraceloop class, the _EvaluatorMadeByTraceloopMeta metaclass, the create_evaluator() factory function, and helper utilities (_get_required_fields, _get_config_fields, slug/method-name converters).
  • Codegen Enhancement (scripts/codegen/generate_evaluator_models.py): New generate_definitions_py() function added to generate static, type-safe evaluator definitions from slug mappings; main() extended to invoke this function and report generation status. The generated definitions.py now provides the EvaluatorMadeByTraceloop class with one static method per evaluator slug (a rough sketch follows below).
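
For orientation, here is a minimal, hand-written sketch of the static shape described in the last row above. It is an illustration only, not the generated code: the EvaluatorDetails fields, parameter types, and defaults are assumptions, and the real definitions.py is produced by the codegen script.

# Illustrative sketch only; the real definitions.py is auto-generated.
# EvaluatorDetails fields and the sensitivity type are assumptions.
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvaluatorDetails:
    slug: str
    config: Optional[dict] = None


class EvaluatorMadeByTraceloop:
    @staticmethod
    def agent_tool_trajectory(
        threshold: Optional[float] = None,
        sensitivity: Optional[float] = None,
    ) -> EvaluatorDetails:
        """Configure the agent-tool-trajectory evaluator."""
        # Keep only the config values the caller actually provided.
        config = {
            key: value
            for key, value in {"threshold": threshold, "sensitivity": sensitivity}.items()
            if value is not None
        }
        return EvaluatorDetails(slug="agent-tool-trajectory", config=config or None)

Compared with the removed runtime factory, this static form gives editors and type checkers a concrete signature per evaluator slug.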

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • fix(evals): auto generate evals #3529: Directly related refactoring of the evaluator generation system—removes dynamic evaluators_made_by_traceloop factory and introduces static generated EvaluatorMadeByTraceloop definitions and codegen scripts at the code level.

Suggested reviewers

  • doronkopit5

Poem

🐰 From dynamic springs to static code,
We hop along this refactored road,
Where factories bloom in generation's light,
And definitions stand forever bright!
Trajectories trace through agents' flight. ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately summarizes the main changes: refactoring the evaluators API to support input + config separation and generating static MBT functions instead of dynamic factories.
  • Linked Issues Check: ✅ Passed. The PR objectives clearly align with TLP-1613: refactoring the evaluator API to use static definitions, updating code generation, and removing the dynamic factory pattern. Changes match the stated goals.
  • Out of Scope Changes Check: ✅ Passed. All changes are in scope: the new example module, refactored evaluator infrastructure, updated code generation, and import path changes directly support the stated objective of migrating to static definitions.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 85.00%, which is above the required threshold of 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Comment @coderabbitai help to get the list of available commands and usage tips.

@nina-kollman nina-kollman changed the title from "added tool" to "fix(evals): evals API supports input + config, generate mbt functions" on Jan 8, 2026
@nina-kollman nina-kollman marked this pull request as ready for review January 8, 2026 12:44
Contributor

@ellipsis-dev ellipsis-dev bot left a comment


Caution

Changes requested ❌

Reviewed everything up to 747845b in 2 minutes and 29 seconds. Click for details.
  • Reviewed 2054 lines of code in 10 files
  • Skipped 0 files when reviewing.
  • Skipped posting 9 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/sample-app/sample_app/experiment/made_by_traceloop/agent_tool_trajectory.py:20
  • Draft comment:
    Consider using a proper JSON structure (e.g. a list/dict) for the default 'expected_tool_calls' rather than a string literal to ensure type consistency.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 30% vs. threshold = 50% This is a new file being added, so all the code is technically "changed". The comment raises a valid point about type consistency - using a string literal for a JSON structure when it could be a proper Python list/dict. However, I need to consider: 1) Is this definitely wrong or just a style preference? 2) Without seeing how the evaluator API expects this data, I can't be certain if strings are acceptable or if proper data structures are required. 3) The comment says "consider" which makes it sound like a suggestion rather than identifying a definite bug. 4) Both executed_tool_calls and expected_tool_calls use string defaults (empty string vs JSON string), so there's some consistency there. 5) This could be intentional if the API expects string representations. Without seeing the API documentation or how expected_tool_calls is used by the evaluator, I cannot definitively say this is wrong. The code might intentionally use strings if that's what the API expects. The comment uses "consider" which suggests it's more of a style suggestion than identifying a definite bug. While the critique is valid, there is a type inconsistency issue here - one default is an empty string while the other is a JSON-formatted string. If the API expects structured data, using actual Python data structures would be clearer and less error-prone. However, without strong evidence this is definitely wrong, and given the speculative nature ("consider"), this falls into the category of suggestions without clear evidence of incorrectness. This comment is a code quality suggestion about using proper data structures instead of string literals. However, it lacks strong evidence that the current approach is wrong - it could be intentional based on API requirements. The use of "consider" indicates it's speculative rather than identifying a definite issue. Per the rules, without strong evidence of correctness, the comment should be deleted.
2. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py:32
  • Draft comment:
    The update to call 'request_model(input=input)' correctly nests the input as required by the Pydantic model. Ensure all evaluator callers supply input accordingly.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50%. This comment is purely informative and does not provide any actionable feedback or suggestions for improvement. It also asks the author to ensure something, which is against the rules.
3. packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py:3
  • Draft comment:
    Import now points to the auto-generated factory class; this streamlines evaluator configuration.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50%. This comment is purely informative and does not provide any actionable feedback or suggestions for improvement. It simply states what the change does without offering any guidance or raising any concerns.
4. packages/traceloop-sdk/traceloop/sdk/generated/evaluators/factories.py:109
  • Draft comment:
    Factory methods (e.g. for 'agent-tool-trajectory') are well-structured with type hints and filtering. Ensure that future changes in config schema are synchronized with these auto-generated methods.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50% None
5. scripts/codegen/generate_evaluator_models.py:509
  • Draft comment:
    Using 'unlink(missing_ok=True)' effectively cleans up the temporary file. Verify that this usage meets the compatibility requirements for all targeted Python versions.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50% None
6. packages/traceloop-sdk/traceloop/sdk/generated/evaluators/__init__.py:116
  • Draft comment:
    Typographical issue: The identifier 'HtmlComparisonResponse' might be better written as 'HTMLComparisonResponse' for consistency with other acronyms like JSON and PII in this file.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50% This is a tricky case. The comment makes a valid observation about naming consistency - HTML should probably be all caps like JSON, PII, and SQL. However, the file is explicitly marked as auto-generated with a "DO NOT EDIT MANUALLY" warning at the top. This means any changes would need to be made to the source (the swagger.json file or the generation script), not to this file directly. The comment doesn't acknowledge this and suggests a change that would violate the "do not edit manually" instruction. Additionally, this is a style/consistency issue, not a functional bug. The rules state I should not comment on things that are obvious or unimportant, and naming style in auto-generated code falls into that category since it can't be manually fixed anyway. While the naming inconsistency observation is technically correct, perhaps the comment could still be valuable if it prompts the team to fix the underlying generation script or swagger definition. The inconsistency might indicate a real issue in the source that should be addressed. Even if the underlying issue is real, this comment is not actionable for the PR author since they cannot manually edit this auto-generated file. The comment would need to be directed at a different place (the generation script or swagger definition) to be useful. Comments on auto-generated code that can't be manually fixed are not useful in a PR review context. This comment should be deleted. The file is explicitly marked as auto-generated and should not be manually edited. Any naming consistency issues would need to be fixed in the source (swagger.json or generation script), not in this file. The comment is not actionable for the PR author.
7. packages/traceloop-sdk/traceloop/sdk/generated/evaluators/__init__.py:245
  • Draft comment:
    Typographical note: The added entry "HtmlComparisonResponse" uses 'Html' instead of 'HTML'. Given that abbreviations like JSON in "JSONValidatorResponse" are fully capitalized, consider using "HTMLComparisonResponse" for consistency.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50% This file is explicitly marked as auto-generated code that should not be edited manually (lines 1-5). The comment is suggesting a naming convention change from HtmlComparisonResponse to HTMLComparisonResponse for consistency with JSONValidatorResponse. However, since this is generated code, the fix would need to be made in the source (likely the OpenAPI spec or the code generator), not in this file. The comment doesn't acknowledge this and seems to be treating it as manually editable code. Additionally, the comment is about code quality/style rather than a functional bug. According to the rules, I should not comment on things that would require changes outside the scope of what's being reviewed, and this is generated code that explicitly says not to edit manually. While the naming inconsistency observation is valid, perhaps the code generator is correctly following the naming from the OpenAPI spec, and changing it here would be wrong. Also, maybe "Html" is intentionally not fully capitalized for some reason in the source specification. Even if the observation about inconsistency is valid, the comment is not actionable in this context because the file is auto-generated and explicitly marked as "DO NOT EDIT MANUALLY". The PR author cannot fix this in this file - it would need to be fixed in the source spec or generator. This makes the comment not useful for this PR. This comment should be deleted. The file is explicitly marked as auto-generated code that should not be manually edited, making the suggested naming change not actionable in this context. Any naming convention fixes would need to be made in the source OpenAPI spec or code generator, not in this generated file.
8. packages/traceloop-sdk/traceloop/sdk/generated/evaluators/registry.py:28
  • Draft comment:
    Typo/consistency: The name HtmlComparisonRequest might be better written as HTMLComparisonRequest to match the capitalization style used in other acronyms (e.g., JSONValidatorRequest, PIIDetectorRequest).
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50% This is a style/consistency comment about naming conventions. While the observation about inconsistency is valid (JSON and PII are all-caps, but Html is not), there are several issues: 1) The file is auto-generated according to the header comment, so manual edits would be overwritten. 2) This is a purely stylistic comment that doesn't affect functionality. 3) The comment doesn't provide a clear, actionable change since fixing it would require changing the code generation script or the source swagger.json file. 4) This type of naming consistency issue is relatively minor and may not be worth addressing. According to the rules, I should only keep comments that require a clear code change and are not obvious or unimportant. The naming inconsistency is real and could be confusing for developers. Also, even if the file is auto-generated, the comment could prompt the team to fix the underlying source, which would be valuable for long-term code quality. While the inconsistency exists, this file is explicitly marked as auto-generated and should not be manually edited. The comment is not actionable in the context of this PR since it would require changes to the generation script or source data, not this file. Additionally, this is a minor stylistic issue that doesn't affect functionality, making it fall under "obvious or unimportant" comments that should be avoided. This comment should be deleted because: 1) The file is auto-generated and shouldn't be manually edited, 2) The comment is about a minor stylistic inconsistency rather than a functional issue, and 3) It's not actionable within the scope of this PR.
9. packages/traceloop-sdk/traceloop/sdk/generated/evaluators/registry.py:66
  • Draft comment:
    Typo: The acronym in 'HtmlComparisonResponse' appears to have inconsistent casing compared to similar responses (e.g. 'JSONValidatorResponse' and 'PIIDetectorResponse'). Consider using 'HTMLComparisonResponse' if that is the intended convention.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50% The comment is technically correct about the inconsistency - HTML should probably be all caps like JSON, PII, and SQL. However, there are several issues: 1) The file explicitly states "DO NOT EDIT MANUALLY" and is auto-generated from a swagger.json file, so any fix would need to happen at the source, not in this file. 2) The comment doesn't acknowledge this auto-generation aspect. 3) This is a code quality/style issue rather than a functional bug. 4) The comment is speculative ("if that is the intended convention") rather than definitive. 5) According to the rules, I should not comment on things that are obvious or unimportant, and naming convention issues in auto-generated code fall into this category. While the inconsistency is real, maybe it's intentional based on the source swagger file, or maybe "HTML" vs "Html" follows some other convention I'm not aware of. Also, pointing out style issues in auto-generated code might still be valuable if it helps improve the generation script. Even if the feedback could theoretically improve the generation script, the comment is directed at this PR and this file, not at the generation script. The PR author likely just ran the generation script and shouldn't be expected to fix upstream naming conventions. This violates the rule about not making comments unless there's clearly a code change required in this PR. This comment should be deleted. It's pointing out a style inconsistency in auto-generated code where the file explicitly says not to edit manually. The PR author cannot and should not fix this in the generated file - any fix would need to happen in the source swagger file or generation script, which is outside the scope of this PR.

Workflow ID: wflow_5ZJYARk5JEBlbvHh

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @scripts/codegen/generate_evaluator_models.py:
- Line 307: The assignment using prop.get("examples", [None])[0] can raise
IndexError when "examples" exists but is an empty list; instead, first retrieve
the examples value from prop, check whether it's truthy/non-empty, and only then
take the first element, otherwise use None. Update the example extraction near
the prop variable to something like: read examples = prop.get("examples"), then
set example to the first item if examples is non-empty, otherwise set example to
None. (A minimal sketch of this follows after these fix prompts.)
- Around line 391-431: Several string literals in the
generate_evaluator_models.py generator are unnecessarily using f-strings where
there are no placeholders (e.g., lines building method signature, docstring, and
return block that reference method_name, config_fields, input_fields,
EvaluatorDetails, and slug); change those f-strings to plain string literals
(remove the leading f) wherever the string contains no interpolation to satisfy
Flake8 and avoid false-positive linting issues, keeping interpolation only on
lines that actually use {method_name}, {slug}, or other variables.
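
A minimal sketch of the safer extraction suggested in the first prompt above. The prop dict here is a stand-in for one JSON-schema property as the generator sees it; the surrounding generator code is not shown.

# Hypothetical property dict, as the generator might read it from the schema.
prop = {"type": "string", "examples": []}

# Unsafe: prop.get("examples", [None])[0] raises IndexError when "examples"
# is present but empty.

# Safer: only index into the list when it is non-empty.
examples = prop.get("examples")
example = examples[0] if examples else None  # None for a missing or empty list

For the second prompt, the fix is simply dropping the f prefix from string literals that contain no {placeholders}.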
🧹 Nitpick comments (3)
packages/sample-app/sample_app/experiment/made_by_traceloop/agent_tool_trajectory.py (3)

18-28: Consider extracting the default expected_tool_calls to a module-level constant.

The default value for expected_tool_calls on lines 20-23 is a long JSON string that reduces readability. Consider extracting it to a module-level constant with a descriptive name.

♻️ Proposed refactor
+# Default expected tool trajectory for demonstration
+DEFAULT_EXPECTED_TRAJECTORY = (
+    "[{'name': 'search', 'input': {'query': 'weather'}}, "
+    "{'name': 'book_flight', 'input': {'flight': 'NYC to Paris'}}, "
+    "{'name': 'get_confirmation', 'input': {'confirmation': 'flight booked'}}]"
+)
+
 async def agent_evaluators_task(row):
     executed_tool_calls = row.get("actual", "")
-    expected_tool_calls = row.get(
-        "expected",
-        "[{'name': 'search', 'input': {'query': 'weather'}}, {'name': 'book_flight', 'input': {'flight': 'NYC to Paris'}}, {'name': 'get_confirmation', 'input': {'confirmation': 'flight booked'}}]"
-    )
+    expected_tool_calls = row.get("expected", DEFAULT_EXPECTED_TRAJECTORY)
 
     return {
         "executed_tool_calls": executed_tool_calls,
         "expected_tool_calls": expected_tool_calls,
     }

18-18: Async function without await is unusual but acceptable.

The agent_evaluators_task function is declared as async but doesn't contain any await expressions. While this works (the coroutine completes immediately when awaited), it's typically a style inconsistency. However, this may be intentional to maintain a consistent interface with other task functions that do require async operations.

If this function doesn't need to be async, consider making it synchronous:

-async def agent_evaluators_task(row):
+def agent_evaluators_task(row):

Otherwise, leave it as-is if the async signature is required by the experiment framework.


58-58: Dataset dependency requires external setup.

The experiment references a dataset with slug "agent-tool-trajectory" that must exist in the Traceloop platform. While there's a helpful comment on line 56, this dependency could cause the example to fail for new users.

Consider adding clearer setup instructions in the module docstring or adding error handling to provide actionable feedback:

# At the top of the file in the docstring
"""
...

Setup:
1. Create a dataset in the Traceloop platform with slug "agent-tool-trajectory"
2. Add test cases with "actual" and "expected" fields containing tool call trajectories
3. Run this script to evaluate agent performance
"""
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between a528699 and 747845b.

⛔ Files ignored due to path filters (5)
  • packages/traceloop-sdk/traceloop/sdk/generated/evaluators/__init__.py is excluded by !**/generated/**
  • packages/traceloop-sdk/traceloop/sdk/generated/evaluators/factories.py is excluded by !**/generated/**
  • packages/traceloop-sdk/traceloop/sdk/generated/evaluators/registry.py is excluded by !**/generated/**
  • packages/traceloop-sdk/traceloop/sdk/generated/evaluators/request.py is excluded by !**/generated/**
  • packages/traceloop-sdk/traceloop/sdk/generated/evaluators/response.py is excluded by !**/generated/**
📒 Files selected for processing (5)
  • packages/sample-app/sample_app/experiment/made_by_traceloop/agent_tool_trajectory.py
  • packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
  • scripts/codegen/generate_evaluator_models.py
💤 Files with no reviewable changes (1)
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/sample-app/sample_app/experiment/made_by_traceloop/agent_tool_trajectory.py
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py
  • packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py
  • scripts/codegen/generate_evaluator_models.py
🧬 Code graph analysis (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py (1)
packages/traceloop-sdk/traceloop/sdk/generated/evaluators/factories.py (1)
  • EvaluatorMadeByTraceloop (14-548)
🪛 Flake8 (7.3.0)
scripts/codegen/generate_evaluator_models.py

[error] F541 (f-string is missing placeholders) at lines 391, 394, 396, 416, 418, 419, 420, 422, 425, 427, 431

🪛 Ruff (0.14.10)
scripts/codegen/generate_evaluator_models.py

F541 (f-string without any placeholders; remove extraneous f prefix) at lines 391, 394, 396, 416, 418, 419, 420, 422, 425, 427, 431

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Build Packages (3.11)
  • GitHub Check: Lint
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Test Packages (3.10)
🔇 Additional comments (8)
scripts/codegen/generate_evaluator_models.py (3)

206-208: LGTM!

The factory class import is correctly added to the generated __init__.py.


228-229: LGTM!

The factory class is correctly exported in __all__.


516-518: LGTM!

The factory generation is properly integrated into the main flow with consistent logging.

packages/sample-app/sample_app/experiment/made_by_traceloop/agent_tool_trajectory.py (2)

41-46: Good use of the generated factory method.

The configuration of the agent_tool_trajectory evaluator with explicit parameters demonstrates proper usage of the generated factory API. The parameters are clear and self-documenting.


57-65: Good experiment configuration and error handling.

The experiment is well-configured with:

  • Clear parameters for dataset and experiment identification
  • stop_on_error=False to process all rows
  • wait_for_results=True for synchronous result retrieval
  • Proper error collection and reporting
packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py (2)

3-3: Import path verified. The generated factories module exists at the correct location and properly exports EvaluatorMadeByTraceloop with static factory methods for each evaluator type.


1-9: No action required—create_evaluator does not exist in the codebase.

Verification found no definition, imports, or usages of create_evaluator anywhere in the codebase. There is no breaking change because this function either never existed or had no external dependencies.

Likely an incorrect or invalid review comment.

packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (1)

31-32: The request model schema change is correct and consistent.

The change to request_model(input=input) is not a breaking change—it aligns with the actual structure of all generated request models. All 35 request model classes in the registry have an 'input' field as a required attribute, confirming that this validation pattern is the correct and expected structure across all evaluators.

Likely an incorrect or invalid review comment.
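
To make the nested-payload shape concrete, here is a simplified Pydantic sketch. The real AgentToolTrajectoryRequest lives in the generated request.py; the inner model name and its field types here are assumptions.

from pydantic import BaseModel


class AgentToolTrajectoryInput(BaseModel):
    executed_tool_calls: str
    expected_tool_calls: str


class AgentToolTrajectoryRequest(BaseModel):
    input: AgentToolTrajectoryInput


# Old pattern: request_model(**task_output) unpacked the dict into top-level
# fields, which no longer matches the generated models.
# New pattern: wrap the task output under the required "input" field.
task_output = {"executed_tool_calls": "[]", "expected_tool_calls": "[]"}
request = AgentToolTrajectoryRequest(input=task_output)  # validates the nested payload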

Contributor

@ellipsis-dev ellipsis-dev bot left a comment


Important

Looks good to me! 👍

Reviewed 403dbbf in 1 minute and 15 seconds. Click for details.
  • Reviewed 74 lines of code in 2 files
  • Skipped 0 files when reviewing.
  • Skipped posting 3 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/sample-app/sample_app/experiment/made_by_traceloop/agent_tool_trajectory.py:18
  • Draft comment:
    Changed 'agent_evaluators_task' from async to a regular function. This is acceptable since no async operations occur, but ensure that downstream code (e.g., the experiment runner) correctly handles a synchronous task.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50%. The comment is asking the PR author to ensure that downstream code handles the change from an async to a regular function. This is a request for confirmation and testing, which violates the rules.
2. scripts/codegen/generate_evaluator_models.py:307
  • Draft comment:
    Improved example extraction: using a conditional check on the 'examples' list is safer than defaulting to [None][0].
  • Reason this comment was not posted:
    Comment looked like it was already resolved.
3. scripts/codegen/generate_evaluator_models.py:388
  • Draft comment:
    Refactored string concatenation: removed unnecessary f-string usage for static text, which clarifies the intent.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.

Workflow ID: wflow_ncfYth13aSA8L8u8

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
packages/sample-app/sample_app/experiment/made_by_traceloop/agent_tool_trajectory.py (1)

14-15: Consider deferring initialization to runtime.

The module-level Traceloop.init() call executes immediately on import, which works for scripts but may cause issues if this module is imported elsewhere. For better flexibility, consider initializing the client inside the experiment function or within the main guard.

♻️ Proposed refactor
-# Initialize Traceloop
-client = Traceloop.init()
-
-
 def agent_evaluators_task(row):

And inside run_agent_tool_trajectory_experiment:

 async def run_agent_tool_trajectory_experiment():
+    # Initialize Traceloop
+    client = Traceloop.init()
+
     print("\n" + "="*80)
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 747845b and 403dbbf.

📒 Files selected for processing (2)
  • packages/sample-app/sample_app/experiment/made_by_traceloop/agent_tool_trajectory.py
  • scripts/codegen/generate_evaluator_models.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • scripts/codegen/generate_evaluator_models.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/agent_tool_trajectory.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test Packages (3.10)
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Build Packages (3.11)
  • GitHub Check: Lint
🔇 Additional comments (6)
scripts/codegen/generate_evaluator_models.py (4)

206-207: LGTM!

The import of EvaluatorMadeByTraceloop from the generated factories module correctly integrates the new static factory approach into the public API.


228-229: LGTM!

The factory class is correctly added to __all__ with appropriate documentation, ensuring proper public API exposure.


251-439: LGTM! Comprehensive factory generation implementation.

The function correctly generates type-safe factory methods with proper handling of:

  • Type hint conversion from JSON schema to Python (including modern list[type] syntax for Python 3.10+), sketched below
  • Optional vs required parameters with appropriate defaults
  • Config field extraction and filtering of None values
  • Input field requirements from nested model definitions
  • Proper docstring generation with examples
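
As a rough illustration of the type-hint conversion mentioned in the first bullet, a hypothetical helper (not the actual generator code) might look like this:

# Hypothetical helper: map a JSON-schema property to a Python type-hint string.
def json_schema_to_hint(prop: dict) -> str:
    scalar = {"string": "str", "number": "float", "integer": "int", "boolean": "bool"}
    if prop.get("type") == "array":
        item_hint = json_schema_to_hint(prop.get("items", {}))
        return f"list[{item_hint}]"  # builtin-generic syntax
    return scalar.get(prop.get("type", ""), "str")


print(json_schema_to_hint({"type": "array", "items": {"type": "string"}}))  # list[str]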

517-519: LGTM!

The factory generation step is correctly integrated into the main workflow with appropriate arguments and status reporting.

packages/sample-app/sample_app/experiment/made_by_traceloop/agent_tool_trajectory.py (2)

18-30: Verify the expected format for tool calls.

The function uses an empty string as the default for missing "actual" data, and string representations for the tool call sequences. Confirm that the evaluator API accepts string-formatted tool calls rather than parsed data structures.

Additional context

If the API expects parsed lists instead of string representations, the defaults should be actual Python lists:

default_expected = [
    {'name': 'search', 'input': {'query': 'weather'}},
    {'name': 'book_flight', 'input': {'flight': 'NYC to Paris'}},
    {'name': 'get_confirmation', 'input': {'confirmation': 'flight booked'}}
]

And the empty string default on Line 19 should be [] or None depending on API requirements.


82-85: LGTM!

The main guard correctly executes the async experiment using asyncio.run().

Contributor

@ellipsis-dev ellipsis-dev bot left a comment


Important

Looks good to me! 👍

Reviewed c35518b in 1 minute and 32 seconds. Click for details.
  • Reviewed 394 lines of code in 15 files
  • Skipped 0 files when reviewing.
  • Skipped posting 5 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/sample-app/sample_app/experiment/made_by_traceloop/agent_tool_trajectory.py:12
  • Draft comment:
    Good update replacing EvaluatorMadeByTraceloop with EvaluatorMadeByTraceloopDefinition. Ensure related docs and tests are updated accordingly.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
2. packages/sample-app/sample_app/guardrail_medical_chat_example.py:41
  • Draft comment:
    The evaluator in the first guardrail is correctly updated to EvaluatorMadeByTraceloopDefinition.pii_detector. However, note that the second guardrail uses a string ('medicaladvicegiven'). Confirm if this disparity is intentional.
  • Reason this comment was not posted:
    Comment was on unchanged code.
3. packages/sample-app/sample_app/guardrail_travel_agent_example.py:64
  • Draft comment:
    Guardrail wrapper appears well refactored; capturing stdout for safe display is a good practice. Ensure testing includes edge cases where _captured_stdout may be empty.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
4. packages/traceloop-sdk/traceloop/sdk/generated/evaluators/definitions.py:109
  • Draft comment:
    Factory methods in EvaluatorMadeByTraceloopDefinition look clean and well documented. Verify that new configuration parameters (e.g. 'threshold', sensitivities) are properly covered by unit tests.
  • Reason this comment was not posted:
    Confidence changes required: 10% <= threshold 50% None
5. scripts/codegen/generate_evaluator_models.py:490
  • Draft comment:
    The codegen script effectively generates registry, definitions, and init.py files. Consider capturing and logging subprocess output for the datamodel-codegen call for easier debugging in case of errors.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.

Workflow ID: wflow_9Pqr2p8ab3faeHTB

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
packages/sample-app/sample_app/experiment/made_by_traceloop/compliance_exp.py (1)

79-83: Add None-check for evaluator config.

Line 80 accesses evaluator.config.items() without verifying that config is not None. Evaluators without configuration parameters (like profanity_detector()) may have config=None, which would raise an AttributeError.

🛡️ Proposed fix
     print("Running experiment with content safety evaluators:")
     for evaluator in evaluators:
-        config_str = ", ".join(f"{k}={v}" for k, v in evaluator.config.items() if k != "description")
+        config_str = ", ".join(f"{k}={v}" for k, v in (evaluator.config or {}).items() if k != "description")
         print(f"  - {evaluator.slug}")
         if config_str:
             print(f"    Config: {config_str}")
packages/sample-app/sample_app/experiment/made_by_traceloop/formatting_exp.py (1)

117-121: Invalid parameters for placeholder_regex() method.

Lines 118-119 pass regex and placeholder_name parameters, but according to the method signature in definitions.py (lines 321-345), placeholder_regex() only accepts: case_sensitive, dot_include_nl, multi_line, and should_match. This will cause a TypeError at runtime.

🐛 Proposed fix

Based on the method signature, remove the invalid parameters:

-        EvaluatorMadeByTraceloopDefinition.placeholder_regex(
-            regex=r"^user_.*",
-            placeholder_name="username",
-            should_match=True
-        ),
+        EvaluatorMadeByTraceloopDefinition.placeholder_regex(
+            should_match=True,
+            case_sensitive=False,
+        ),

Note: If you need regex pattern validation, consider using regex_validator() instead of placeholder_regex().

🤖 Fix all issues with AI agents
In
@packages/sample-app/sample_app/experiment/made_by_traceloop/agent_tool_trajectory.py:
- Around line 18-30: The default_expected string in agent_evaluators_task uses
Python-style single quotes and will not be valid JSON; update the
default_expected value to a JSON-valid string using double quotes (e.g., change
the list/dict literals to use double quotes for keys and string values) so that
expected_tool_calls (and executed_tool_calls) remain JSON strings parsable by
downstream evaluators; ensure you only modify the default_expected literal
inside the agent_evaluators_task function.

In @packages/sample-app/sample_app/experiment/made_by_traceloop/agents_exp.py:
- Line 142: The call to
EvaluatorMadeByTraceloopDefinition.agent_goal_completeness() is missing its
required threshold parameter; update the call in agents_exp.py to pass a float
threshold (e.g., 0.8 or the appropriate value for your tests) so it matches the
signature defined in definitions.py and avoids the TypeError.
- Line 140: The call to EvaluatorMadeByTraceloopDefinition.agent_flow_quality()
is missing its required parameters and will raise a TypeError; update the call
in agents_exp.py to pass a list of condition strings for the first arg
(conditions) and a float for the second arg (threshold) per the signature
defined in definitions.py (agent_flow_quality(conditions: list[str], threshold:
float)); either supply literal values that match the evaluator expectations
(e.g., ["conditionA", "conditionB"], 0.8) or pass through appropriately named
variables (e.g., conditions, threshold) that are defined earlier in the module. (A sketch of the corrected calls follows after these fix prompts.)

In @packages/sample-app/sample_app/guardrail_travel_agent_example.py:
- Line 14: The import line for EvaluatorMadeByTraceloopDefinition contains an
unnecessary noqa directive; remove the trailing "# noqa: E402" from the import
statement (the symbol is EvaluatorMadeByTraceloopDefinition) so the line becomes
a normal import without the unused suppression comment.

In @scripts/codegen/generate_evaluator_models.py:
- Line 354: The generated class name is incorrect: the file emits class
EvaluatorMadeByTraceloop but other code expects
EvaluatorMadeByTraceloopDefinition; update the generator so the class
declaration uses the exact name EvaluatorMadeByTraceloopDefinition (or
alternatively update all references to match the emitted name), ensuring symbol
consistency with references like EvaluatorMadeByTraceloopDefinition at lines
that import/export it.
- Line 362: Update the docstring example that imports and references
EvaluatorMadeByTraceloop to instead import and reference
EvaluatorMadeByTraceloopDefinition; change the import line (currently "from
traceloop.sdk.evaluator import EvaluatorMadeByTraceloop") and any subsequent
example usages to use the correct class name EvaluatorMadeByTraceloopDefinition
so the docstring matches the real class API.
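
Putting the agents_exp.py prompts together, the corrected calls would look roughly like this. The threshold value and condition strings are placeholders rather than the project's real test values; only the signatures are taken from the prompts above.

from traceloop.sdk.evaluator import EvaluatorMadeByTraceloopDefinition

# Per the signatures quoted above:
#   agent_flow_quality(conditions: list[str], threshold: float)
#   agent_goal_completeness(threshold: float)
evaluators = [
    EvaluatorMadeByTraceloopDefinition.agent_flow_quality(
        conditions=["agent greets the user", "agent confirms the booking"],
        threshold=0.8,
    ),
    EvaluatorMadeByTraceloopDefinition.agent_goal_completeness(threshold=0.8),
]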
🧹 Nitpick comments (1)
scripts/codegen/generate_evaluator_models.py (1)

251-253: Function name doesn't match its behavior.

The function is named generate_factories_py but it actually generates definitions.py (see line 437). Consider renaming to generate_definitions_py for clarity.

♻️ Proposed refactor
-def generate_factories_py(
+def generate_definitions_py(
     output_dir: Path, slug_mappings: dict, filtered_definitions: dict
 ) -> int:
-    """Generate factories.py with static factory methods for each evaluator.
+    """Generate definitions.py with static factory methods for each evaluator.

And update the call site at line 518:

-    factory_count = generate_factories_py(output_dir, slug_mappings, filtered_definitions)
+    factory_count = generate_definitions_py(output_dir, slug_mappings, filtered_definitions)
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 403dbbf and c35518b.

⛔ Files ignored due to path filters (2)
  • packages/traceloop-sdk/traceloop/sdk/generated/evaluators/__init__.py is excluded by !**/generated/**
  • packages/traceloop-sdk/traceloop/sdk/generated/evaluators/definitions.py is excluded by !**/generated/**
📒 Files selected for processing (13)
  • packages/sample-app/sample_app/experiment/made_by_traceloop/agent_tool_trajectory.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/agents_exp.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/compliance_exp.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/correctness_exp.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/formatting_exp.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/security_exp.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/style_exp.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py
  • packages/sample-app/sample_app/guardrail_medical_chat_example.py
  • packages/sample-app/sample_app/guardrail_travel_agent_example.py
  • packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py
  • scripts/codegen/generate_evaluator_models.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/sample-app/sample_app/experiment/made_by_traceloop/correctness_exp.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/style_exp.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/compliance_exp.py
  • scripts/codegen/generate_evaluator_models.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/security_exp.py
  • packages/sample-app/sample_app/guardrail_medical_chat_example.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/agent_tool_trajectory.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/formatting_exp.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/agents_exp.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py
  • packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py
  • packages/sample-app/sample_app/guardrail_travel_agent_example.py
🧬 Code graph analysis (11)
packages/sample-app/sample_app/experiment/made_by_traceloop/style_exp.py (1)
packages/traceloop-sdk/traceloop/sdk/generated/evaluators/definitions.py (5)
  • EvaluatorMadeByTraceloopDefinition (14-548)
  • char_count (169-177)
  • word_count (529-537)
  • char_count_ratio (180-188)
  • word_count_ratio (540-548)
packages/sample-app/sample_app/experiment/made_by_traceloop/compliance_exp.py (1)
packages/traceloop-sdk/traceloop/sdk/generated/evaluators/definitions.py (4)
  • EvaluatorMadeByTraceloopDefinition (14-548)
  • profanity_detector (349-357)
  • toxicity_detector (497-515)
  • sexism_detector (443-461)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (1)
packages/traceloop-sdk/traceloop/sdk/generated/evaluators/definitions.py (4)
  • agent_goal_accuracy (66-74)
  • agent_flow_quality (43-63)
  • agent_efficiency (32-40)
  • agent_goal_completeness (77-95)
packages/sample-app/sample_app/experiment/made_by_traceloop/security_exp.py (1)
packages/traceloop-sdk/traceloop/sdk/generated/evaluators/definitions.py (4)
  • EvaluatorMadeByTraceloopDefinition (14-548)
  • pii_detector (301-319)
  • secrets_detector (421-429)
  • prompt_injection (360-378)
packages/sample-app/sample_app/guardrail_medical_chat_example.py (1)
packages/traceloop-sdk/traceloop/sdk/generated/evaluators/definitions.py (1)
  • pii_detector (301-319)
packages/sample-app/sample_app/experiment/made_by_traceloop/agent_tool_trajectory.py (2)
packages/traceloop-sdk/traceloop/sdk/generated/evaluators/definitions.py (1)
  • agent_tool_trajectory (109-133)
packages/sample-app/sample_app/experiment/made_by_traceloop/agents_exp.py (1)
  • agent_evaluators_task (52-109)
packages/sample-app/sample_app/experiment/made_by_traceloop/formatting_exp.py (1)
packages/traceloop-sdk/traceloop/sdk/generated/evaluators/definitions.py (5)
  • EvaluatorMadeByTraceloopDefinition (14-548)
  • json_validator (267-287)
  • sql_validator (464-472)
  • regex_validator (392-418)
  • placeholder_regex (322-346)
packages/sample-app/sample_app/experiment/made_by_traceloop/agents_exp.py (1)
packages/traceloop-sdk/traceloop/sdk/generated/evaluators/definitions.py (5)
  • agent_goal_accuracy (66-74)
  • agent_tool_error_detector (98-106)
  • agent_flow_quality (43-63)
  • agent_efficiency (32-40)
  • agent_goal_completeness (77-95)
packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py (1)
packages/traceloop-sdk/traceloop/sdk/generated/evaluators/definitions.py (4)
  • perplexity (290-298)
  • agent_goal_accuracy (66-74)
  • semantic_similarity (432-440)
  • topic_adherence (486-494)
packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py (1)
packages/traceloop-sdk/traceloop/sdk/generated/evaluators/definitions.py (1)
  • EvaluatorMadeByTraceloopDefinition (14-548)
packages/sample-app/sample_app/guardrail_travel_agent_example.py (1)
packages/traceloop-sdk/traceloop/sdk/generated/evaluators/definitions.py (1)
  • pii_detector (301-319)
🪛 Ruff (0.14.10)
packages/sample-app/sample_app/guardrail_travel_agent_example.py

14-14: Unused noqa directive (non-enabled: E402)

Remove unused noqa directive

(RUF100)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Test Packages (3.10)
  • GitHub Check: Lint
  • GitHub Check: Build Packages (3.11)
🔇 Additional comments (20)
packages/sample-app/sample_app/guardrail_medical_chat_example.py (2)

6-6: LGTM! Import updated to new evaluator definition API.

The import change from EvaluatorMadeByTraceloop to EvaluatorMadeByTraceloopDefinition aligns with the PR's refactoring of the evaluator factory system.


42-42: LGTM! Evaluator usage updated correctly.

The guardrail decorator now uses EvaluatorMadeByTraceloopDefinition.pii_detector(...) with the correct method signature.

packages/sample-app/sample_app/experiment/made_by_traceloop/correctness_exp.py (1)

16-16: LGTM! Correctness evaluators migrated to new API.

The import and both evaluator usages (answer_relevancy and faithfulness) have been correctly updated to use EvaluatorMadeByTraceloopDefinition.

Also applies to: 79-80

packages/sample-app/sample_app/experiment/made_by_traceloop/style_exp.py (1)

15-15: LGTM! Style evaluators migrated to new API.

The import and all four style evaluators (char_count, word_count, char_count_ratio, word_count_ratio) have been correctly updated to use EvaluatorMadeByTraceloopDefinition.

Also applies to: 77-80

packages/sample-app/sample_app/experiment/made_by_traceloop/security_exp.py (1)

17-17: LGTM! Security evaluators migrated to new API.

The import and all three security evaluators (pii_detector, secrets_detector, prompt_injection) have been correctly updated to use EvaluatorMadeByTraceloopDefinition.

Also applies to: 76-78

packages/sample-app/sample_app/guardrail_travel_agent_example.py (1)

14-14: LGTM! Evaluator API updated correctly.

The import and usage have been correctly updated to use EvaluatorMadeByTraceloopDefinition.

Also applies to: 65-65

packages/sample-app/sample_app/experiment/made_by_traceloop/compliance_exp.py (2)

16-16: LGTM!

The import change to EvaluatorMadeByTraceloopDefinition is consistent with the broader refactor.


73-75: LGTM!

The evaluator instantiation correctly uses the new EvaluatorMadeByTraceloopDefinition API with appropriate configuration parameters.

packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (2)

21-21: LGTM!

The import change to EvaluatorMadeByTraceloopDefinition is consistent with the refactor.


97-103: LGTM!

The evaluator instantiation correctly uses the new EvaluatorMadeByTraceloopDefinition API. The agent_flow_quality configuration with threshold and conditions is properly structured.

packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py (1)

1-9: LGTM!

The refactor correctly updates the public API to export EvaluatorMadeByTraceloopDefinition from the new generated definitions module. This change aligns with the broader migration across the codebase.

scripts/codegen/generate_evaluator_models.py (3)

206-208: LGTM!

Adding the import of EvaluatorMadeByTraceloopDefinition to the generated __init__.py correctly exposes the new public API.


228-229: LGTM!

Adding EvaluatorMadeByTraceloopDefinition to the __all__ export list correctly makes it part of the public API.


251-439: LGTM on the overall code generation logic!

The function correctly extracts config fields, input fields, generates type hints, and constructs factory methods with proper docstrings. The logic for filtering None values from config and handling optional parameters is well-implemented.

packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py (2)

18-18: LGTM!

The import change to EvaluatorMadeByTraceloopDefinition is consistent with the refactor.


91-94: LGTM!

The evaluator instantiation correctly uses the new EvaluatorMadeByTraceloopDefinition API. All four evaluators are properly instantiated without configuration parameters.

packages/sample-app/sample_app/experiment/made_by_traceloop/formatting_exp.py (1)

18-18: LGTM! Import updated correctly.

The import correctly uses EvaluatorMadeByTraceloopDefinition, aligning with the refactored evaluator factory system.

packages/sample-app/sample_app/experiment/made_by_traceloop/agent_tool_trajectory.py (2)

10-15: LGTM!

Imports and initialization are correct.


43-48: LGTM!

The agent_tool_trajectory() evaluator configuration correctly uses all valid parameters as defined in the method signature.

packages/sample-app/sample_app/experiment/made_by_traceloop/agents_exp.py (1)

18-18: LGTM! Import updated correctly.

The import correctly uses EvaluatorMadeByTraceloopDefinition, consistent with the refactored API.

Comment on lines +18 to +30
def agent_evaluators_task(row):
    executed_tool_calls = row.get("actual", "")
    default_expected = (
        "[{'name': 'search', 'input': {'query': 'weather'}}, "
        "{'name': 'book_flight', 'input': {'flight': 'NYC to Paris'}}, "
        "{'name': 'get_confirmation', 'input': {'confirmation': 'flight booked'}}]"
    )
    expected_tool_calls = row.get("expected", default_expected)

    return {
        "executed_tool_calls": executed_tool_calls,
        "expected_tool_calls": expected_tool_calls,
    }

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

rg -n -A5 -B5 "tool_calls" --type=py

Repository: traceloop/openllmetry

Length of output: 50378


🏁 Script executed:

# First, let's look at the file in question to understand the context
cat -n packages/sample-app/sample_app/experiment/made_by_traceloop/agent_tool_trajectory.py

# Search for the evaluator that uses agent_evaluators_task
rg -n "agent_evaluators_task\|AgentToolTrajectoryEvaluator\|agent_tool_trajectory" --type=py packages/sample-app/

Repository: traceloop/openllmetry

Length of output: 3431


🏁 Script executed:

# Search for the agent_tool_trajectory evaluator implementation
rg -n "agent_tool_trajectory" --type=py -A10 -B5 | head -100

# Also search for EvaluatorMadeByTraceloopDefinition
fd -e py -x grep -l "agent_tool_trajectory\|AgentToolTrajectory" {} \;

Repository: traceloop/openllmetry

Length of output: 6137


🏁 Script executed:

# Get the full definition of agent_tool_trajectory evaluator
sed -n '109,135p' packages/traceloop-sdk/traceloop/sdk/generated/evaluators/definitions.py

# Also check the response.py to understand how it processes tool calls
cat packages/traceloop-sdk/traceloop/sdk/generated/evaluators/response.py

Repository: traceloop/openllmetry

Length of output: 6133


🏁 Script executed:

# Search for other examples of agent_tool_trajectory evaluator usage
rg -n "executed_tool_calls\|expected_tool_calls" --type=py -B5 -A5

# Also search for any tests or documentation
fd -e md -e txt -x grep -l "agent.*tool.*trajectory\|executed_tool_calls" {} \;

Repository: traceloop/openllmetry

Length of output: 47


🏁 Script executed:

# Search for any README or documentation files related to experiments
fd -e md "README\|GUIDE\|EXAMPLE" packages/sample-app/ | head -20

# Check if there are any other evaluator examples in sample-app
ls -la packages/sample-app/sample_app/experiment/made_by_traceloop/

# Search for how row data is structured or passed
rg -n "agent-tool-trajectory" --type=py -B3 -A3

Repository: traceloop/openllmetry

Length of output: 5601


🏁 Script executed:

# Check other agent experiment files to see how they handle tool calls
cat packages/sample-app/sample_app/experiment/made_by_traceloop/agents_exp.py

# Also check travel_agent_exp.py
cat packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py

Repository: traceloop/openllmetry

Length of output: 12825


🏁 Script executed:

# Search for the actual evaluator implementation or any processing logic
find packages/traceloop-sdk -name "*.py" -type f -exec grep -l "agent_tool_trajectory\|AgentToolTrajectory" {} \;

# Check if there's any validation or parsing code for tool_calls
rg -n "json.loads\|ast.literal_eval\|parse.*tool" --type=py packages/traceloop-sdk/ | head -50

Repository: traceloop/openllmetry

Length of output: 416


🏁 Script executed:

# Check the request.py to see AgentToolTrajectoryRequest structure
cat packages/traceloop-sdk/traceloop/sdk/generated/evaluators/request.py

Repository: traceloop/openllmetry

Length of output: 13294


🏁 Script executed:

# Check if there are any examples or tests showing proper format
rg -n "executed_tool_calls\|expected_tool_calls" --type=py -B2 -A2 packages/

Repository: traceloop/openllmetry

Length of output: 47


Fix JSON string format for tool calls to use valid JSON syntax.

Lines 20-24 define default_expected as a string, which is correct. However, the string uses single quotes instead of double quotes, creating invalid JSON that will fail parsing. Update to proper JSON format:

Corrected code
default_expected = (
    '[{"name": "search", "input": {"query": "weather"}}, '
    '{"name": "book_flight", "input": {"flight": "NYC to Paris"}}, '
    '{"name": "get_confirmation", "input": {"confirmation": "flight booked"}}]'
)

The agent-tool-trajectory evaluator expects both executed_tool_calls and expected_tool_calls as JSON strings (not Python lists).
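As a standalone illustration (not part of the PR), the snippet below shows why the single-quoted default breaks strict JSON parsing while the double-quoted version parses cleanly, assuming the evaluator backend relies on json.loads or an equally strict JSON parser:

import json

# Hypothetical payloads mirroring the example above; only the quoting differs.
single_quoted = "[{'name': 'search', 'input': {'query': 'weather'}}]"
double_quoted = '[{"name": "search", "input": {"query": "weather"}}]'

try:
    json.loads(single_quoted)
except json.JSONDecodeError as err:
    print(f"single-quoted payload rejected: {err}")

tool_calls = json.loads(double_quoted)  # parses into a list of dicts
print(tool_calls[0]["name"])  # -> search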

🤖 Prompt for AI Agents
In
@packages/sample-app/sample_app/experiment/made_by_traceloop/agent_tool_trajectory.py
around lines 18 - 30, The default_expected string in agent_evaluators_task uses
Python-style single quotes and will not be valid JSON; update the
default_expected value to a JSON-valid string using double quotes (e.g., change
the list/dict literals to use double quotes for keys and string values) so that
expected_tool_calls (and executed_tool_calls) remain JSON strings parsable by
downstream evaluators; ensure you only modify the default_expected literal
inside the agent_evaluators_task function.

        EvaluatorMadeByTraceloop.agent_goal_completeness(),
        EvaluatorMadeByTraceloopDefinition.agent_goal_accuracy(),
        EvaluatorMadeByTraceloopDefinition.agent_tool_error_detector(),
        EvaluatorMadeByTraceloopDefinition.agent_flow_quality(),

⚠️ Potential issue | 🔴 Critical

Missing required parameters for agent_flow_quality().

According to the method signature in definitions.py (lines 42-62), agent_flow_quality() requires two mandatory parameters: conditions (list[str]) and threshold (float). Calling it without arguments will cause a TypeError.

🐛 Proposed fix

Add the required parameters:

-        EvaluatorMadeByTraceloopDefinition.agent_flow_quality(),
+        EvaluatorMadeByTraceloopDefinition.agent_flow_quality(
+            conditions=["Agent should not repeat questions", "Agent should complete task efficiently"],
+            threshold=0.7,
+        ),
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        EvaluatorMadeByTraceloopDefinition.agent_flow_quality(),
+        EvaluatorMadeByTraceloopDefinition.agent_flow_quality(
+            conditions=["Agent should not repeat questions", "Agent should complete task efficiently"],
+            threshold=0.7,
+        ),
🤖 Prompt for AI Agents
In @packages/sample-app/sample_app/experiment/made_by_traceloop/agents_exp.py at
line 140, The call to EvaluatorMadeByTraceloopDefinition.agent_flow_quality() is
missing its required parameters and will raise a TypeError; update the call in
agents_exp.py to pass a list of condition strings for the first arg (conditions)
and a float for the second arg (threshold) per the signature defined in
definitions.py (agent_flow_quality(conditions: list[str], threshold: float));
either supply literal values that match the evaluator expectations (e.g.,
["conditionA", "conditionB"], 0.8) or pass through appropriately named variables
(e.g., conditions, threshold) that are defined earlier in the module.

        EvaluatorMadeByTraceloopDefinition.agent_tool_error_detector(),
        EvaluatorMadeByTraceloopDefinition.agent_flow_quality(),
        EvaluatorMadeByTraceloopDefinition.agent_efficiency(),
        EvaluatorMadeByTraceloopDefinition.agent_goal_completeness(),

⚠️ Potential issue | 🔴 Critical

Missing required parameter for agent_goal_completeness().

According to the method signature in definitions.py (lines 76-94), agent_goal_completeness() requires a mandatory threshold (float) parameter. Calling it without arguments will cause a TypeError.

🐛 Proposed fix

Add the required parameter:

-        EvaluatorMadeByTraceloopDefinition.agent_goal_completeness(),
+        EvaluatorMadeByTraceloopDefinition.agent_goal_completeness(
+            threshold=0.8,
+        ),
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        EvaluatorMadeByTraceloopDefinition.agent_goal_completeness(),
+        EvaluatorMadeByTraceloopDefinition.agent_goal_completeness(
+            threshold=0.8,
+        ),
🤖 Prompt for AI Agents
In @packages/sample-app/sample_app/experiment/made_by_traceloop/agents_exp.py at
line 142, The call to
EvaluatorMadeByTraceloopDefinition.agent_goal_completeness() is missing its
required threshold parameter; update the call in agents_exp.py to pass a float
threshold (e.g., 0.8 or the appropriate value for your tests) so it matches the
signature defined in definitions.py and avoids the TypeError.


from traceloop.sdk.guardrails.guardrails import guardrail # noqa: E402
from traceloop.sdk.evaluator import EvaluatorMadeByTraceloop # noqa: E402
from traceloop.sdk.evaluator import EvaluatorMadeByTraceloopDefinition # noqa: E402

⚠️ Potential issue | 🟡 Minor

Remove unused noqa directive.

The # noqa: E402 comment is not needed as the E402 check is not enabled or not triggering on this line.

🧹 Proposed cleanup
-from traceloop.sdk.evaluator import EvaluatorMadeByTraceloopDefinition  # noqa: E402
+from traceloop.sdk.evaluator import EvaluatorMadeByTraceloopDefinition
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-from traceloop.sdk.evaluator import EvaluatorMadeByTraceloopDefinition  # noqa: E402
+from traceloop.sdk.evaluator import EvaluatorMadeByTraceloopDefinition
🧰 Tools
🪛 Ruff (0.14.10)

14-14: Unused noqa directive (non-enabled: E402)

Remove unused noqa directive

(RUF100)

🤖 Prompt for AI Agents
In @packages/sample-app/sample_app/guardrail_travel_agent_example.py at line 14,
The import line for EvaluatorMadeByTraceloopDefinition contains an unnecessary
noqa directive; remove the trailing "# noqa: E402" from the import statement
(the symbol is EvaluatorMadeByTraceloopDefinition) so the line becomes a normal
import without the unused suppression comment.

Contributor

@ellipsis-dev ellipsis-dev bot left a comment


Important

Looks good to me! 👍

Reviewed 6f9cd16 in 1 minute and 28 seconds.
  • Reviewed 51 lines of code in 2 files
  • Skipped 0 files when reviewing.
  • Skipped posting 4 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/traceloop/sdk/generated/evaluators/definitions.py:22
  • Draft comment:
    Docstring examples have been updated to use EvaluatorMadeByTraceloopDefinition. Confirm that all documentation is consistent.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% The comment is asking the PR author to confirm that all documentation is consistent, which violates the rule against asking for confirmation or verification. It does not provide a specific suggestion or point out a specific issue with the code.
2. scripts/codegen/generate_evaluator_models.py:265
  • Draft comment:
    get_type_hint may return 'Any' for unknown or array item types. Consider importing 'Any' from typing in the generated file for proper type checking.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
3. scripts/codegen/generate_evaluator_models.py:424
  • Draft comment:
    When embedding input_fields in the factory method, consider using repr(input_fields) to ensure the list is formatted as a valid Python literal.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
4. scripts/codegen/generate_evaluator_models.py:371
  • Draft comment:
    Typo: The closing triple quotes on line 371 are single quotes (''' rather than """). This inconsistency in docstring delimiters should be corrected to match the opening delimiter.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.

Workflow ID: wflow_H3S3THMwTV5DIih7

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @scripts/codegen/generate_evaluator_models.py:
- Around line 251-257: The docstring for generate_factories_py incorrectly
states it "Generate factories.py" while the function actually writes to
definitions.py; update the function docstring to say it generates definitions.py
(and optionally mention "static factory methods for each evaluator written to
definitions.py") so the description matches the actual output file and intent.
🧹 Nitpick comments (1)
scripts/codegen/generate_evaluator_models.py (1)

263-278: Consider handling object type explicitly.

The get_type_hint function returns "Any" for unhandled types including object. This is acceptable but could be enhanced if config fields contain nested objects.

💡 Optional enhancement
         elif prop_type == "array":
             items = prop.get("items", {})
             item_type = get_type_hint(items) if items else "Any"
             return f"list[{item_type}]"
+        elif prop_type == "object":
+            return "dict[str, Any]"
         return "Any"
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between c35518b and 6f9cd16.

⛔ Files ignored due to path filters (1)
  • packages/traceloop-sdk/traceloop/sdk/generated/evaluators/definitions.py is excluded by !**/generated/**
📒 Files selected for processing (1)
  • scripts/codegen/generate_evaluator_models.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • scripts/codegen/generate_evaluator_models.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Build Packages (3.11)
  • GitHub Check: Test Packages (3.10)
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Lint
🔇 Additional comments (6)
scripts/codegen/generate_evaluator_models.py (6)

206-208: LGTM!

The import statement for EvaluatorMadeByTraceloopDefinition is correctly added. Since generate_factories_py is always called in main(), definitions.py will exist when this import is used.


228-229: LGTM!

The EvaluatorMadeByTraceloopDefinition is correctly added to __all__ with proper categorization comment.


280-316: LGTM!

The get_config_fields function correctly traverses the OpenAPI definitions to extract config fields with their types, required status, and examples.


318-338: LGTM!

The get_input_fields function correctly extracts required input field names for documentation purposes in the generated factory methods.


373-432: LGTM!

The factory method generation loop is well-structured:

  • Sorted iteration ensures deterministic output
  • Type hints correctly use Python 3.10+ union syntax (type | None)
  • Config dict filtering correctly handles optional parameters
  • Docstrings include helpful information about required input fields

517-519: LGTM!

The generate_factories_py call is correctly integrated into the main workflow with appropriate status output.

Contributor

@ellipsis-dev ellipsis-dev bot left a comment


Important

Looks good to me! 👍

Reviewed f628e5e in 1 minute and 38 seconds.
  • Reviewed 13 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 1 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. scripts/codegen/generate_evaluator_models.py:337
  • Draft comment:
    Updated comment now refers to 'definitions.py' instead of 'factories.py', which aligns with the new static evaluator factory generation. Ensure consistency in the docstring header (if needed) to clearly reflect the purpose of the generated file.
  • Reason this comment was not posted:
    Comment looked like it was already resolved.

Workflow ID: wflow_Rywm6LcquLcBa53n

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
scripts/codegen/generate_evaluator_models.py (1)

263-278: Consider handling additional JSON Schema types.

The get_type_hint function handles common types well but doesn't explicitly handle "object" type (returns "Any" as fallback). This is acceptable, but if the OpenAPI spec contains object types without $ref, they'll be typed as Any.

Additionally, consider handling "null" type if the spec uses nullable schemas.

♻️ Optional: Extend type hint coverage
     def get_type_hint(prop: dict) -> str:
         """Convert JSON schema property to Python type hint."""
         prop_type = prop.get("type")
         if prop_type == "number":
             return "float"
         elif prop_type == "integer":
             return "int"
         elif prop_type == "boolean":
             return "bool"
         elif prop_type == "string":
             return "str"
         elif prop_type == "array":
             items = prop.get("items", {})
             item_type = get_type_hint(items) if items else "Any"
             return f"list[{item_type}]"
+        elif prop_type == "object":
+            return "dict[str, Any]"
         return "Any"
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6f9cd16 and f628e5e.

📒 Files selected for processing (1)
  • scripts/codegen/generate_evaluator_models.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • scripts/codegen/generate_evaluator_models.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Build Packages (3.11)
  • GitHub Check: Test Packages (3.10)
  • GitHub Check: Lint
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Test Packages (3.12)
🔇 Additional comments (3)
scripts/codegen/generate_evaluator_models.py (3)

206-208: LGTM!

The import generation for EvaluatorMadeByTraceloopDefinition from .definitions is correctly placed and will be available since generate_factories_py is called before generate_init_py in the main workflow.


228-229: LGTM!

The export is properly added to __all__ with a descriptive comment, following the existing pattern.


517-519: LGTM!

The factory generation is correctly integrated into the workflow. The ordering is appropriate—generate_factories_py runs before generate_init_py, ensuring definitions.py exists when __init__.py imports from it.

Contributor

@ellipsis-dev ellipsis-dev bot left a comment


Important

Looks good to me! 👍

Reviewed a27c7eb in 1 minute and 15 seconds.
  • Reviewed 54 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 4 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. scripts/codegen/generate_evaluator_models.py:246
  • Draft comment:
    Good renaming: 'generate_factories_py' is now 'generate_definitions_py' with updated docstring. This improves clarity.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
2. scripts/codegen/generate_evaluator_models.py:339
  • Draft comment:
    Updated header in definitions.py reflects purpose change from factory to definition. Looks consistent.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
3. scripts/codegen/generate_evaluator_models.py:353
  • Draft comment:
    Class docstring now correctly describes this as a definition class instead of a factory. Good update.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
4. scripts/codegen/generate_evaluator_models.py:514
  • Draft comment:
    Correctly updated print message and variable names (definition_count vs factory_count) for generating definitions.py.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None

Workflow ID: wflow_TDuBROQi5g3luOPI

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @scripts/codegen/generate_evaluator_models.py:
- Around line 263-278: get_type_hint can return the fallback "Any" but the
generated definitions.py only imports EvaluatorDetails, causing a NameError;
update the code generation path so that whenever get_type_hint may emit "Any"
you add "from typing import Any" to the generated definitions.py (or include Any
alongside existing imports such as EvaluatorDetails) — modify the generator that
writes definitions.py to detect use of "Any" (or always include it) and emit the
proper import so the runtime has typing.Any available.
- Around line 380-398: The function that builds method signatures iterates
config_fields as-is which can place optional parameters before required ones and
produce invalid Python signatures; sort config_fields so required fields come
first (e.g., stable sort by field["required"] descending or partition into
required then optional) before building params for the method signature for
method_name and use that same sorted order when generating the docstring and the
config dict (the code paths that append to content and create the
EvaluatorDetails/config entries should consume the sorted list so parameter
order is consistent and valid).
🧹 Nitpick comments (1)
scripts/codegen/generate_evaluator_models.py (1)

414-420: Config dict comprehension may be difficult to read in generated code.

The generated config dictionary construction spans multiple lines with a nested comprehension. While functional, this pattern produces hard-to-read generated code.

♻️ Alternative: Generate a simpler conditional pattern

Consider generating code that builds the dict more explicitly:

config = {}
if field1 is not None:
    config["field1"] = field1
if field2 is not None:
    config["field2"] = field2

This is more verbose but clearer in the generated output. However, the current approach is correct and this is just a stylistic suggestion.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f628e5e and a27c7eb.

📒 Files selected for processing (1)
  • scripts/codegen/generate_evaluator_models.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • scripts/codegen/generate_evaluator_models.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Build Packages (3.11)
  • GitHub Check: Test Packages (3.10)
  • GitHub Check: Lint
🔇 Additional comments (3)
scripts/codegen/generate_evaluator_models.py (3)

206-208: LGTM!

The import statement generation for EvaluatorMadeByTraceloopDefinition follows the established pattern in the file and correctly references the new definitions module.


228-229: LGTM!

The export is correctly added to __all__ with a descriptive comment, following the existing pattern.


517-519: LGTM!

The generate_definitions_py call is correctly placed before generate_init_py, ensuring the definitions module exists when the __init__.py import statement is generated.

Comment on lines +263 to +278
    def get_type_hint(prop: dict) -> str:
        """Convert JSON schema property to Python type hint."""
        prop_type = prop.get("type")
        if prop_type == "number":
            return "float"
        elif prop_type == "integer":
            return "int"
        elif prop_type == "boolean":
            return "bool"
        elif prop_type == "string":
            return "str"
        elif prop_type == "array":
            items = prop.get("items", {})
            item_type = get_type_hint(items) if items else "Any"
            return f"list[{item_type}]"
        return "Any"

⚠️ Potential issue | 🟠 Major

Missing Any import in generated definitions.py.

The get_type_hint function returns "Any" as a fallback type (line 278), but the generated definitions.py only imports EvaluatorDetails. If an evaluator has a property with an unhandled JSON schema type, the generated code will raise a NameError at runtime.

🐛 Proposed fix: Add Any import to the generated file
     content = '''"""
 Definitions for creating Traceloop evaluators.
 
 Provides type-safe methods with IDE autocomplete support.
 
 DO NOT EDIT MANUALLY - Regenerate with:
     ./scripts/generate-models.sh /path/to/swagger.json
 """
 from __future__ import annotations
 
+from typing import Any
+
 from ...evaluator.config import EvaluatorDetails

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In @scripts/codegen/generate_evaluator_models.py around lines 263 - 278,
get_type_hint can return the fallback "Any" but the generated definitions.py
only imports EvaluatorDetails, causing a NameError; update the code generation
path so that whenever get_type_hint may emit "Any" you add "from typing import
Any" to the generated definitions.py (or include Any alongside existing imports
such as EvaluatorDetails) — modify the generator that writes definitions.py to
detect use of "Any" (or always include it) and emit the proper import so the
runtime has typing.Any available.

Comment on lines +380 to +398
        # Build method signature
        params = []
        for field in config_fields:
            type_hint = field["type"]
            if not field["required"]:
                type_hint = f"{type_hint} | None"
            default = " = None" if not field["required"] else ""
            params.append(f"{field['name']}: {type_hint}{default}")

        # Method signature
        if params:
            params_str = ",\n        ".join(params)
            content += "    @staticmethod\n"
            content += f"    def {method_name}(\n"
            content += f"        {params_str},\n"
            content += "    ) -> EvaluatorDetails:\n"
        else:
            content += "    @staticmethod\n"
            content += f"    def {method_name}() -> EvaluatorDetails:\n"

⚠️ Potential issue | 🔴 Critical

Parameter ordering may generate invalid Python syntax.

Required parameters (without defaults) must come before optional parameters (with defaults) in Python function signatures. The current implementation iterates over config_fields without sorting by required status, which can produce invalid code like def method(optional: str | None = None, required: str).

🐛 Proposed fix: Sort parameters by required status
         # Build method signature
         params = []
+        # Sort fields: required first, then optional
+        sorted_fields = sorted(config_fields, key=lambda f: (not f["required"], f["name"]))
-        for field in config_fields:
+        for field in sorted_fields:
             type_hint = field["type"]
             if not field["required"]:
                 type_hint = f"{type_hint} | None"
             default = " = None" if not field["required"] else ""
             params.append(f"{field['name']}: {type_hint}{default}")

Also update the docstring and config dict generation to use the same sorted order for consistency:

         # Docstring
         content += f'        """Create {slug} evaluator.\n'
         if config_fields:
             content += "\n        Args:\n"
-            for field in config_fields:
+            for field in sorted_fields:
                 example_str = f" (example: {field['example']})" if field["example"] is not None else ""
                 content += f"            {field['name']}: {field['type']}{example_str}\n"

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In @scripts/codegen/generate_evaluator_models.py around lines 380 - 398, The
function that builds method signatures iterates config_fields as-is which can
place optional parameters before required ones and produce invalid Python
signatures; sort config_fields so required fields come first (e.g., stable sort
by field["required"] descending or partition into required then optional) before
building params for the method signature for method_name and use that same
sorted order when generating the docstring and the config dict (the code paths
that append to content and create the EvaluatorDetails/config entries should
consume the sorted list so parameter order is consistent and valid).
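A minimal standalone sketch of the required-first ordering and the signature fragment it yields (the field names below are illustrative, not taken from the actual spec):

# Sort config fields so required parameters precede optional ones before
# rendering a signature fragment; the dict shape mirrors the generator's fields.
config_fields = [
    {"name": "sensitivity", "type": "float", "required": False},
    {"name": "threshold", "type": "float", "required": True},
]

# Required first, then optional; sorting by name keeps the output deterministic.
sorted_fields = sorted(config_fields, key=lambda f: (not f["required"], f["name"]))

params = []
for f in sorted_fields:
    type_hint = f["type"] if f["required"] else f'{f["type"]} | None'
    default = "" if f["required"] else " = None"
    params.append(f'{f["name"]}: {type_hint}{default}')

print(", ".join(params))
# -> threshold: float, sensitivity: float | None = None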

Contributor

@ellipsis-dev ellipsis-dev bot left a comment


Important

Looks good to me! 👍

Reviewed acabf5e in 1 minute and 43 seconds.
  • Reviewed 410 lines of code in 15 files
  • Skipped 0 files when reviewing.
  • Skipped posting 6 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/sample-app/sample_app/experiment/made_by_traceloop/agent_tool_trajectory.py:12
  • Draft comment:
    Consistently renamed to EvaluatorMadeByTraceloop; verify the new 'threshold' param is handled correctly.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% The comment asks the PR author to verify that the new threshold parameter is handled correctly. This falls under the category of asking the author to ensure the behavior is intended or to double-check things, which is against the rules. Therefore, this comment should be removed.
2. packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py:21
  • Draft comment:
    Refactor: evaluator factory renamed consistently; ensure conditions (e.g., in agent_flow_quality) are validated.
  • Reason this comment was not posted:
    Comment was on unchanged code.
3. packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py:3
  • Draft comment:
    The export now uses EvaluatorMadeByTraceloop for consistency; double-check downstream dependencies.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% The comment is asking the PR author to double-check downstream dependencies, which violates the rule against asking for confirmation or double-checking. It doesn't provide a specific suggestion or point out a specific issue with the code.
4. packages/traceloop-sdk/traceloop/sdk/generated/evaluators/definitions.py:14
  • Draft comment:
    Renamed factory class to EvaluatorMadeByTraceloop; ensure docstrings and method signatures reflect new naming.
  • Reason this comment was not posted:
    Comment looked like it was already resolved.
5. scripts/codegen/generate_evaluator_models.py:207
  • Draft comment:
    Code generation updated to export 'EvaluatorMadeByTraceloop'; ensure generated models remain backward compatible.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is asking the PR author to ensure backward compatibility, which falls under the rule of not asking the author to ensure behavior is intended or tested. It doesn't provide a specific suggestion or point out a specific issue.
6. scripts/codegen/generate_evaluator_models.py:371
  • Draft comment:
    There appears to be an extraneous triple single quote (''') at the end of the file, which looks unintended. Please remove or correct this typo.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.

Workflow ID: wflow_MTxL56jU84NmHnjE

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @scripts/codegen/generate_evaluator_models.py:
- Around line 380-388: The parameter list builder in
generate_evaluator_models.py appends params from config_fields in original order
which can place optional (defaulted) parameters before required ones and produce
invalid Python signatures; update the logic that builds params (the loop
populating params using field, type_hint, default) to first sort or partition
config_fields so that all required fields (where field["required"] is True) are
processed before optional ones, preserving relative order within each group,
then build params from that ordered list so required parameters appear before
optional/defaulted parameters in the generated method signature.
- Around line 411-432: Several generated-string concatenation lines in the
content-building logic of generate_evaluator_models.py are exceeding Flake8
E501; refactor the long literal/f-string expressions used when building
config_items and the EvaluatorDetails return block (references: config_items,
content += "...", f'"{f["name"]}": {f["name"]}', and the f'           
required_input_fields={input_fields},\n' lines) into shorter pieces—e.g., split
long f-strings into multiple concatenations or build lists of lines and join
them, or use implicit adjacent string literals or a small helper to format each
line—so each physical source line stays within the max length while preserving
the same generated output and behavior.
🧹 Nitpick comments (3)
scripts/codegen/generate_evaluator_models.py (1)

263-278: Consider handling object types in get_type_hint.

The get_type_hint function handles basic JSON schema types but doesn't handle object types or $ref references. If a property has "type": "object" or uses a $ref, it will return "Any" which may lose type information.

This is acceptable for code generation but could be enhanced if stricter typing is needed.

♻️ Optional: Handle object type explicitly
     def get_type_hint(prop: dict) -> str:
         """Convert JSON schema property to Python type hint."""
         prop_type = prop.get("type")
+        if "$ref" in prop:
+            # Could extract model name from ref if needed
+            return "Any"
         if prop_type == "number":
             return "float"
         elif prop_type == "integer":
             return "int"
         elif prop_type == "boolean":
             return "bool"
         elif prop_type == "string":
             return "str"
         elif prop_type == "array":
             items = prop.get("items", {})
             item_type = get_type_hint(items) if items else "Any"
             return f"list[{item_type}]"
+        elif prop_type == "object":
+            return "dict"
         return "Any"
packages/sample-app/sample_app/experiment/made_by_traceloop/agent_tool_trajectory.py (2)

14-15: Consider initializing client inside the experiment function.

Module-level initialization of Traceloop.init() executes as a side effect when the module is imported, which could cause issues if this file is imported elsewhere (e.g., for testing or as part of a larger application). For a sample app this is acceptable, but moving initialization inside the main guard or experiment function would be more robust.

♻️ Optional: Move initialization inside function
-# Initialize Traceloop
-client = Traceloop.init()
+client = None


 def agent_evaluators_task(row):
     # ... existing code ...


 async def run_agent_tool_trajectory_experiment():
+    global client
+    if client is None:
+        client = Traceloop.init()
+
     print("\n" + "="*80)

73-80: Consider printing actual result details for debugging.

The results summary only prints counts. For a sample/demo application, it might be helpful to also print sample result details to help users understand the evaluator output.

♻️ Optional: Add result details for demonstration
     print("Results summary:")
     print(f"  - Total rows processed: {len(results) if results else 0}")
     print(f"  - Errors encountered: {len(errors) if errors else 0}")

+    if results:
+        print("\nSample results (first 3):")
+        for i, result in enumerate(results[:3]):
+            print(f"  {i+1}. {result}")
+
     if errors:
         print("\nErrors:")
         for error in errors:
             print(f"  - {error}")
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between a27c7eb and acabf5e.

⛔ Files ignored due to path filters (2)
  • packages/traceloop-sdk/traceloop/sdk/generated/evaluators/__init__.py is excluded by !**/generated/**
  • packages/traceloop-sdk/traceloop/sdk/generated/evaluators/definitions.py is excluded by !**/generated/**
📒 Files selected for processing (3)
  • packages/sample-app/sample_app/experiment/made_by_traceloop/agent_tool_trajectory.py
  • packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py
  • scripts/codegen/generate_evaluator_models.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • scripts/codegen/generate_evaluator_models.py
  • packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/agent_tool_trajectory.py
🧬 Code graph analysis (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py (1)
packages/traceloop-sdk/traceloop/sdk/generated/evaluators/definitions.py (1)
  • EvaluatorMadeByTraceloop (14-548)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Lint
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Test Packages (3.10)
  • GitHub Check: Build Packages (3.11)
🔇 Additional comments (6)
scripts/codegen/generate_evaluator_models.py (2)

340-371: LGTM!

The module header, imports, and class docstring are well-structured with clear documentation and usage examples.


517-519: LGTM!

The main function correctly integrates the new generate_definitions_py call with appropriate status messaging.

packages/sample-app/sample_app/experiment/made_by_traceloop/agent_tool_trajectory.py (3)

18-30: LGTM!

The task function correctly extracts executed_tool_calls and expected_tool_calls from the row data, with a reasonable default fallback for expected values. The return structure matches the required input fields for the agent-tool-trajectory evaluator (['executed_tool_calls', 'expected_tool_calls']).


33-67: LGTM!

The experiment configuration properly uses the new static EvaluatorMadeByTraceloop.agent_tool_trajectory() method with typed configuration parameters, and the async experiment run is well structured, with proper awaiting and result handling.


82-85: LGTM!

The main guard correctly runs the async experiment using asyncio.run().

packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py (1)

1-9: LGTM!

The import path correctly references the new generated definitions module at ..generated.evaluators.definitions, and the exports are properly aligned with the refactored architecture. The removal of create_evaluator is consistent with the shift to static definitions—confirmed that no references to create_evaluator exist elsewhere in the codebase.

@nina-kollman nina-kollman merged commit 8e52c97 into main Jan 8, 2026
14 checks passed