feat: add primary_key eval #156

pnilan · 2025-10-24T17:24:37Z

Summary

Introduces a primary key evaluation workflow for declarative connectors and expands evaluator coverage in the Phoenix evaluation suite.

Changes

Added primary key evaluator.
Added expected_primary_key field to connectors.yaml.
Added unit tests for the evaluation suite.
Added airbyte-cdk dependency to connector-builder-agents.
Introduced ConnectorBuilderEvalTaskOutput model to manage build artifacts and metadata.
Updated eval suite to parse the task output manifest as a DeclarativeStream.
- Each evaluator now accepts the output argument (required by Phoenix) as a dictionary.

Implementation Details

Supports DeclarativeStream primary key formats as defined in declarative_component_schema.yaml:
- None
- str
- List[str]
- List[List[str]]
Does not handle post-parse transformations or downstream normalization.

Scope

Record count validation.
Future LLM-based evaluator improvements.

Summary by CodeRabbit

New Features
- Added manifest validation, primary-key checks, and a unified evaluation sequence; task outputs now return structured result metadata and tool retry behavior improved.
Bug Fixes
- More robust manifest/YAML parsing with clearer, actionable error messages.
Documentation
- Guidance checklist expanded with new rules and wording clarifications about manifests and dotenv handling.
Chores
- Updated runtime and dev dependencies; added a test task and pytest configuration.
Tests
- Added comprehensive unit tests for evaluators and task behavior.

…heck

coderabbitai · 2025-10-24T17:24:48Z

📝 Walkthrough

Walkthrough

Adds manifest-aware evaluations (DeclarativeSource parsing, manifest validation, primary-key checks), refactors task output to a Pydantic model, extends manifest utilities and tests, updates guidance and tool retry config, and adds/adjusts test and runtime dependencies and test tooling.

Changes

Cohort / File(s)	Summary
Dependency & tooling `connector_builder_agents/pyproject.toml`, `pyproject.toml`, `poe_tasks.toml`	Added `airbyte-cdk>=7.3.8,<8.0` and testing dev deps (pytest, pytest-asyncio); added pytest config lines and a new `test-agents` Poe task.
Evaluators & control flow `connector_builder_agents/src/evals/evaluators.py`, `connector_builder_agents/src/evals/phoenix_run.py`	Introduced `manifest_validation_eval` and `primary_key_eval`; updated `readiness_eval` and `streams_eval` signatures to accept `dict
Task output model & task flow `connector_builder_agents/src/evals/task.py`, `connector_builder_agents/tests/test_task.py`	Added `ConnectorBuilderEvalTaskOutput` (Pydantic) and changed `run_connector_build_task` to return it; added try/except around build run, explicit artifact reads, and updated tests to assert new output shape.
Manifest parsing utilities `connector_builder_mcp/_util.py`, `tests/unit/test_util.py`	Wrapped YAML parsing with clearer errors, changed `validate_manifest_structure` return to `(bool, str
Evaluator unit tests & test data `connector_builder_agents/tests/test_evaluators.py`, `connector_builder_agents/src/evals/data/connectors.yaml`	Added comprehensive unit tests for evaluators; converted `expected_streams` entries from strings to objects with `name` and `primary_key` arrays for all connectors/streams.
Guidance & tools `connector_builder_mcp/_guidance.py`, `connector_builder_agents/src/tools.py`	Minor copy edits and a new "Rules" section in guidance; removed detailed tool-arg logging and added `max_retries=5` to MCP server builder configs.
Tests & CI config `pyproject.toml`, `connector_builder_agents/pyproject.toml`	Added pytest configuration (`testpaths`, `norecursedirs`) and updated test-related dev dependencies.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Runner as Test / Runner
    participant Task as run_connector_build_task
    participant Build as run_connector_build()
    participant Art as Artifact store
    participant EvalMgr as EVALUATORS
    participant DSrc as DeclarativeSource

    Runner->>Task: invoke(dataset_row)
    Task->>Build: run_connector_build(...)
    Build-->>Task: final_result, turn_count, artifacts
    Task->>Art: read readiness_report, manifest
    alt manifest present & valid
        Task->>DSrc: DeclarativeSource.parse_obj(manifest)
        DSrc-->>EvalMgr: parsed manifest model
        EvalMgr->>EvalMgr: manifest_validation_eval()
        EvalMgr->>EvalMgr: streams_eval(expected, model)
        EvalMgr->>EvalMgr: primary_key_eval(expected, model)
    else manifest missing/invalid
        EvalMgr->>EvalMgr: fallback evaluations (readiness-only / diagnostics)
    end
    Task->>Task: assemble ConnectorBuilderEvalTaskOutput
    Task-->>Runner: ConnectorBuilderEvalTaskOutput

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

• Areas to focus on:

DeclarativeSource usage and error messages in connector_builder_mcp/_util.py.
Manifest parsing, root-object handling, and resilient stream discovery in connector_builder_agents/src/evals/evaluators.py.
The changed return type and artifact handling in connector_builder_agents/src/evals/task.py and corresponding tests.
New and updated unit tests in connector_builder_agents/tests/test_evaluators.py and tests/unit/test_util.py for coverage/accuracy.

Possibly related PRs

feat(agents): Replace openai-agents with pydantic-ai implementation #139 — Related changes to connector_builder_agents/src/evals/task.py return shape and artifact/final output handling.
chore(evals): restructure YAML to use input/expected top-level keys #116 — Related changes to the eval pipeline and expected input shapes for connector data and streams.

Suggested reviewers

aaronsteers

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The PR title "feat: add primary_key eval" directly reflects the primary objective stated in the PR description: introducing a primary key evaluation workflow for declarative connectors. The title is concise, specific, and uses conventional commit formatting, clearly communicating that a primary key evaluator is being added to the evaluation suite. While the changeset includes supporting modifications (such as the new ConnectorBuilderEvalTaskOutput model, manifest_validation_eval function, and updated test coverage), these are all in service of the main feature highlighted in the title. The title accurately summarizes the core purpose of the PR without being misleading or overly vague.
Docstring Coverage	✅ Passed	Docstring coverage is 94.87% which is sufficient. The required threshold is 80.00%.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch pnilan/evals/primary-key

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2025-10-24T17:24:57Z

👋 Welcome to the Airbyte Connector Builder MCP!

Thank you for your contribution! Here are some helpful tips and reminders for your convenience.

Testing This Branch via MCP

To test the changes in this specific branch with an MCP client like Claude Desktop, use the following configuration:

{
  "mcpServers": {
    "connector-builder-mcp-dev": {
      "command": "uvx",
      "args": ["--from", "git+https://github.com/airbytehq/connector-builder-mcp.git@pnilan/evals/primary-key", "connector-builder-mcp"]
    }
  }
}

Testing This Branch via CLI

You can test this version of the MCP Server using the following CLI snippet:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/connector-builder-mcp.git@pnilan/evals/primary-key#egg=airbyte-connector-builder-mcp' --help

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

/autofix - Fixes most formatting and linting issues
/build-connector - Builds the default connector on-demand using the AI builder
/build-connector prompt="<your prompt>" - Builds a connector on-demand using the AI builder
/poe <command> - Runs any poe command in the uv virtual environment

AI Builder Evaluations

AI builder evaluations run automatically under the following conditions:

When a PR is marked as "ready for review"
When a PR is reopened

A set of standardized evaluations also run on a schedule (Mon/Wed/Fri at midnight UTC) and can be manually triggered via workflow dispatch.

Helpful Resources

If you have any questions, feel free to ask in the PR comments or join our Slack community.

📝 Edit this welcome message.

github-actions · 2025-10-24T17:25:14Z

PyTest Results (Full)

0 tests 0 ✅ 0s ⏱️
0 suites 0 💤
0 files 0 ❌

Results for commit 361954b.

♻️ This comment has been updated with latest results.

github-actions · 2025-10-24T17:25:16Z

PyTest Results (Fast)

0 tests ±0 0 ✅ ±0 0s ⏱️ ±0s
0 suites ±0 0 💤 ±0
0 files ±0 0 ❌ ±0

Results for commit 361954b. ± Comparison against base commit 1806c6b.

♻️ This comment has been updated with latest results.

coderabbitai

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

connector_builder_agents/src/evals/evaluators.py (1)
141-144: Apply vacuous-truth when no expectations are defined.

When expected_stream_names is empty, return 1.0 instead of 0.0 to avoid penalizing missing expectations. Based on learnings.
-    if not expected_stream_names:
-        logger.warning("No expected streams found")
-        return 0.0
+    if not expected_stream_names:
+        logger.info("No expected streams defined; treating as pass (vacuous truth)")
+        return 1.0

🧹 Nitpick comments (11)

connector_builder_mcp/_guidance.py (1)
183-189: Consider removing "please" for consistency.

The new Rules section provides valuable guidance. However, line 185 uses "please" ("please pass it without parsing secrets yourself"), which is inconsistent with the imperative tone established elsewhere in this file (e.g., line 168 where "please" was removed).

Apply this diff for consistency:
-- All MCP tools support receiving .env file path - please pass it without parsing secrets yourself.
+- All MCP tools support receiving .env file path - pass it without parsing secrets yourself.
connector_builder_mcp/_util.py (1)

163-180: Consider the breadth of exception handling.

The function catches all exceptions with except Exception as e, which is quite broad. While this ensures that any parsing error is caught and reported, it may also hide programming errors or unexpected issues.

For validation purposes, this broad catch may be intentional to handle any failure gracefully. However, consider whether specific exception types from DeclarativeSource.parse_obj() should be caught instead.

Additionally, the docstring format (opening triple quotes on a separate line) differs from other functions in this file. Consider aligning with the single-line opening style used elsewhere for consistency.

connector_builder_agents/src/tools.py (1)

82-82: Reduced logging may impact debugging.

The removal of tool arguments from the log message reduces verbosity but may make debugging tool failures more difficult. Consider whether argument details are needed for troubleshooting.
connector_builder_agents/tests/test_evaluators.py (1)
385-385: Remove duplicate import.

pytest is already imported at line 8. The duplicate import at line 385 is redundant.

Apply this diff:
-import pytest
-
-
 @pytest.mark.parametrize(
connector_builder_agents/src/evals/task.py (2)
48-50: Preserve stack traces on failures.

Use logger.exception(...) to capture the traceback; re-raise unchanged.
-    except Exception as e:
-        logger.error(f"Failed to build connector '{connector_name}': {e}")
-        raise
+    except Exception:
+        logger.exception(f"Failed to build connector '{connector_name}'")
+        raise
55-66: Success computation and log message clarity.

Consider success = bool(build_result) so an empty list isn’t treated as success.

Avoid unconditional “successfully” wording; include the status in the log.
-    final_result = build_result[-1] if build_result else None
-    success = build_result is not None
-    num_turns = len(build_result) if build_result else 0
-
-    logger.info(f"Build completed - Success: {success}, Turns: {num_turns}")
+    final_result = build_result[-1] if build_result else None
+    success = bool(build_result)
+    num_turns = len(build_result) if build_result else 0
+
+    logger.info(f"Build completed - success={success}, turns={num_turns}")
@@
-    logger.info(f"Task completed successfully for connector '{connector_name}'")
+    logger.info(f"Task completed for connector '{connector_name}' (success={success})")
connector_builder_agents/src/evals/evaluators.py (5)
74-77: Keep return types consistent (ints for readiness_eval).

Function is annotated to return int; return 0 instead of 0.0.
-    if output is None:
-        logger.warning("No output provided - build task likely failed")
-        return 0.0
+    if output is None:
+        logger.warning("No output provided - build task likely failed")
+        return 0
113-116: Return float type consistently.

streams_eval returns float; use 0.0.
-    if not manifest_str:
-        logger.warning("No manifest found or manifest is empty")
-        return 0
+    if not manifest_str:
+        logger.warning("No manifest found or manifest is empty")
+        return 0.0
190-191: Avoid print; use logger and lower log level.

Replace prints with logger.debug to keep logs structured.
-        print(f"Actual primary key: {actual_pk}")
-        print(f"Expected primary key: {expected_by_name[name]}")
+        logger.debug(f"Actual primary key for {name}: {actual_pk}")
+        logger.debug(f"Expected primary key for {name}: {expected_by_name[name]}")
186-195: Normalize primary key shapes before comparing.

Treat str, List[str], and List[List[str]] forms equivalently and compare on a canonical representation (order-insensitive). This avoids false negatives due to formatting differences.
-    correct = 0
-    for name in matched_names:
-        pk = getattr(available[name], "primary_key", None)
-        actual_pk = getattr(pk, "__root__", pk)
-        logger.debug(f"Actual primary key for {name}: {actual_pk}")
-        logger.debug(f"Expected primary key for {name}: {expected_by_name[name]}")
-        if actual_pk == expected_by_name[name]:
-            correct += 1
+    def _canon(pk):
+        if pk is None:
+            return None
+        # Coerce to list-of-lists
+        if isinstance(pk, str):
+            pk = [[pk]]
+        elif isinstance(pk, list) and pk and all(isinstance(x, str) for x in pk):
+            pk = [pk]
+        elif not isinstance(pk, list):
+            return pk
+        # Sort inner lists and outer list for order-insensitive compare
+        return tuple(sorted(tuple(sorted(inner)) for inner in pk))
+
+    correct = 0
+    for name in matched_names:
+        pk = getattr(available[name], "primary_key", None)
+        actual_pk = getattr(pk, "__root__", pk)
+        if _canon(actual_pk) == _canon(expected_by_name[name]):
+            correct += 1
If you prefer strict order-sensitive equality, skip the sorting and only coerce shapes.

120-125: Optional: future-proof DeclarativeSource parsing across Pydantic v1/v2.

Airbyte CDK models may migrate to Pydantic v2 (model_validate). Consider a helper that tries both.
def _parse_declarative_source(obj):
    try:
        return DeclarativeSource.parse_obj(obj)
    except AttributeError:
        # Pydantic v2
        return DeclarativeSource.model_validate(obj)
Use _parse_declarative_source(manifest_dict) in both streams_eval and manifest_validation_eval.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1806c6b and 7f4a9ef.

⛔ Files ignored due to path filters (2)

connector_builder_agents/uv.lock is excluded by !**/*.lock
uv.lock is excluded by !**/*.lock

📒 Files selected for processing (12)

connector_builder_agents/pyproject.toml (1 hunks)
connector_builder_agents/src/evals/data/connectors.yaml (1 hunks)
connector_builder_agents/src/evals/evaluators.py (4 hunks)
connector_builder_agents/src/evals/phoenix_run.py (4 hunks)
connector_builder_agents/src/evals/task.py (3 hunks)
connector_builder_agents/src/tools.py (4 hunks)
connector_builder_agents/tests/test_evaluators.py (1 hunks)
connector_builder_agents/tests/test_task.py (1 hunks)
connector_builder_mcp/_guidance.py (3 hunks)
connector_builder_mcp/_util.py (4 hunks)
poe_tasks.toml (1 hunks)
pyproject.toml (1 hunks)

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-10-11T05:20:31.622Z

Learnt from: aaronsteers
PR: airbytehq/connector-builder-mcp#136
File: connector_builder_agents/src/evals/evaluators.py:0-0
Timestamp: 2025-10-11T05:20:31.622Z
Learning: In connector builder evaluator functions (e.g., primary_keys_eval, records_eval, stream_names_eval), when there are zero expectations to validate against (total_expected_streams == 0), the evaluator should return 1.0 (representing vacuous truth: all zero tests passed) rather than 0.0 (failure). This prevents penalizing scenarios where no validation criteria are defined.

Applied to files:

connector_builder_agents/tests/test_evaluators.py
connector_builder_agents/src/evals/evaluators.py

🧬 Code graph analysis (5)

connector_builder_agents/tests/test_evaluators.py (2)

connector_builder_agents/src/evals/task.py (1)

ConnectorBuilderEvalTaskOutput (23-28)

connector_builder_agents/src/evals/evaluators.py (4)

manifest_validation_eval (42-68)

primary_key_eval (153-195)

readiness_eval (71-100)

streams_eval (103-150)

connector_builder_agents/src/evals/task.py (2)

connector_builder_agents/src/run.py (1)

get_workspace_dir (34-38)

connector_builder_agents/src/evals/helpers.py (1)

get_artifact (4-13)

connector_builder_agents/src/evals/evaluators.py (1)

connector_builder_agents/src/evals/task.py (1)

ConnectorBuilderEvalTaskOutput (23-28)

connector_builder_agents/tests/test_task.py (1)

connector_builder_agents/src/evals/task.py (2)

ConnectorBuilderEvalTaskOutput (23-28)

run_connector_build_task (31-76)

connector_builder_agents/src/evals/phoenix_run.py (1)

connector_builder_agents/src/evals/task.py (1)

run_connector_build_task (31-76)

🔇 Additional comments (13)

connector_builder_mcp/_guidance.py (2)

14-14: Grammar improvement looks good.

The change from "Not include" to "Do not include" corrects the grammar and improves clarity.

168-168: Imperative tone is appropriate for checklist instructions.

The removal of "please" makes the instruction more direct and consistent with standard checklist formatting.

connector_builder_mcp/_util.py (2)

114-117: LGTM!

The try-except block improves error handling by providing a clearer error message when YAML parsing fails. The exception chaining with from e is appropriate and maintains the stack trace.

11-12: ✓ airbyte_cdk dependency is properly declared.

The import of DeclarativeSource from airbyte_cdk is supported. The root pyproject.toml includes "airbyte-cdk>=6.61.6,<7.0" in its dependencies, which governs the connector_builder_mcp package.

connector_builder_agents/pyproject.toml (1)

17-17: LGTM! Dependency additions support new evaluation features.

The new dependencies appropriately support:

airbyte-cdk>=7.3.8,<8.0: Enables DeclarativeSource parsing for manifest validation

pytest>=8.4.2 and pytest-asyncio>=0.23.0: Support the new test files added in this PR

Also applies to: 24-25

pyproject.toml (1)

12-12: LGTM! Runtime dependency additions are appropriate.

The dependencies support core agent functionality:

emoji>=2.0.0,<3.0: Used for emoji detection in progress logging

pydantic-ai-slim[openai,duckduckgo]>=0.0.14,<1.0: Provides agent framework capabilities

Also applies to: 17-17

connector_builder_agents/src/tools.py (1)

121-121: LGTM! Retry configuration improves resilience.

Adding max_retries=5 to MCP server configurations provides appropriate resilience against transient failures in tool execution.

Also applies to: 150-150, 176-176

poe_tasks.toml (1)

20-20: LGTM! New test task enables agent test execution.

The test-agents task provides a convenient way to run the new agent tests with verbose output, complementing the existing test task for the main codebase.

connector_builder_agents/src/evals/data/connectors.yaml (1)

7-18: LGTM! Stream structure enhanced to support primary key evaluation.

The transition from plain string stream names to structured objects with name and primary_key fields enables the new primary key evaluator. The primary key values are appropriate:

JSONPlaceholder and Rick & Morty APIs correctly use ["id"]

StarWars API correctly uses ["url"] as the unique identifier

Also applies to: 24-35, 41-46

connector_builder_agents/src/evals/phoenix_run.py (2)

28-28: LGTM! Refactoring improves maintainability.

The centralization of evaluators into a single EVALUATORS collection from the evaluators module improves maintainability and makes it easier to add new evaluators in the future.

Also applies to: 62-62, 70-70

44-50: LGTM! Improved documentation of Phoenix instrumentation.

The added comments clearly explain the automatic instrumentation mechanism and the environment variables used for configuration, improving developer understanding.

connector_builder_agents/tests/test_task.py (1)

17-77: LGTM! Comprehensive test coverage for task execution.

The test effectively validates:

Correct construction of ConnectorBuilderEvalTaskOutput

Proper handling of task artifacts (readiness report and manifest)

Correct propagation of session ID, API name, and model configurations

Expected interaction with mocked dependencies

The test structure is well-organized with clear mocking and assertions.

connector_builder_agents/tests/test_evaluators.py (1)

300-327: LGTM! Comprehensive test coverage for evaluators.

The test suite thoroughly validates:

readiness_eval: Both passing and failing scenarios with mocked LLM classification

streams_eval: Perfect match, partial match, no match, and extra streams scenarios

primary_key_eval: Various primary key formats (None, string, single list, nested list, composite keys), no match, and partial match cases

The parametrized tests for primary key formats effectively cover the different formats supported by DeclarativeStream.

Based on learnings

Also applies to: 330-382, 388-437

connector_builder_agents/src/evals/evaluators.py

connector_builder_agents/src/evals/task.py

connector_builder_agents/tests/test_evaluators.py

connector_builder_mcp/_util.py

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (3)

connector_builder_agents/tests/test_evaluators.py (1)

11-16: Import path inconsistency (already flagged).

The imports use a relative path src.evals.evaluators instead of the full path connector_builder_agents.src.evals.evaluators. This inconsistency has been flagged in previous reviews.

connector_builder_mcp/_util.py (1)

142-160: Improved error reporting with tuple return.

The updated signature returning tuple[bool, str | None] is a good improvement that provides detailed error messages to callers. Note: There's a typo "stucture" on line 158 that was flagged in a previous review.

connector_builder_agents/src/evals/evaluators.py (1)

178-180: Apply vacuous truth when no expected streams (already flagged).

When there are zero expected streams, the function should return 1.0 (vacuous truth: all zero tests passed) rather than 0.0. This issue was flagged in a previous review.

Based on learnings.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6b3c114 and ef12d56.

📒 Files selected for processing (4)

connector_builder_agents/src/evals/evaluators.py (4 hunks)
connector_builder_agents/src/evals/task.py (3 hunks)
connector_builder_agents/tests/test_evaluators.py (1 hunks)
connector_builder_mcp/_util.py (4 hunks)

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-10-11T05:20:31.622Z

Learnt from: aaronsteers
PR: airbytehq/connector-builder-mcp#136
File: connector_builder_agents/src/evals/evaluators.py:0-0
Timestamp: 2025-10-11T05:20:31.622Z
Learning: In connector builder evaluator functions (e.g., primary_keys_eval, records_eval, stream_names_eval), when there are zero expectations to validate against (total_expected_streams == 0), the evaluator should return 1.0 (representing vacuous truth: all zero tests passed) rather than 0.0 (failure). This prevents penalizing scenarios where no validation criteria are defined.

Applied to files:

connector_builder_agents/tests/test_evaluators.py
connector_builder_agents/src/evals/evaluators.py

🧬 Code graph analysis (3)

connector_builder_agents/tests/test_evaluators.py (1)

connector_builder_agents/src/evals/evaluators.py (4)

manifest_validation_eval (42-68)

primary_key_eval (157-199)

readiness_eval (71-100)

streams_eval (103-154)

connector_builder_agents/src/evals/task.py (2)

connector_builder_agents/src/run.py (1)

get_workspace_dir (34-38)

connector_builder_agents/src/evals/helpers.py (1)

get_artifact (4-13)

connector_builder_agents/src/evals/evaluators.py (1)

connector_builder_agents/src/evals/task.py (1)

ConnectorBuilderEvalTaskOutput (23-28)

🔇 Additional comments (11)

connector_builder_mcp/_util.py (3)

11-11: LGTM!

The import of DeclarativeSource is appropriate for the new validation function added below.

114-117: LGTM!

Adding try-except around YAML parsing with a clear error message improves error handling and user experience.

163-180: LGTM!

The new validation function properly leverages DeclarativeSource.parse_obj for schema validation and provides helpful error guidance. The return signature is consistent with validate_manifest_structure.

connector_builder_agents/src/evals/task.py (3)

6-10: LGTM!

The imports are correct, including Path from the standard library pathlib module. The necessary types for the new ConnectorBuilderEvalTaskOutput model are properly imported.

23-28: LGTM!

The ConnectorBuilderEvalTaskOutput model provides a clean, well-typed structure for task results. Using Mapping[str, Any] for artifacts provides flexibility while maintaining type safety.

31-76: LGTM!

The refactored function provides better error handling and returns a structured, type-safe output. The try-except block with re-raise preserves the exception context while allowing for logging.

connector_builder_agents/src/evals/evaluators.py (5)

13-14: LGTM!

The new imports are necessary for DeclarativeSource-based manifest parsing and the structured task output model.

42-68: LGTM!

The manifest validation function properly handles all error cases with appropriate try-except blocks for both YAML parsing and DeclarativeSource parsing. This addresses concerns from previous reviews.

71-100: LGTM!

The readiness_eval function properly integrates with the new ConnectorBuilderEvalTaskOutput model and maintains appropriate error handling throughout.

103-154: LGTM!

The streams_eval function now includes proper YAML error handling (as requested in previous reviews) and correctly parses the manifest as a DeclarativeSource to extract available streams.

202-202: LGTM!

All evaluators, including the newly added primary_key_eval, are properly registered in the EVALUATORS list. This resolves concerns from previous reviews.

connector_builder_agents/src/evals/evaluators.py

connector_builder_agents/tests/test_evaluators.py

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

connector_builder_agents/src/evals/evaluators.py (1)
145-147: Apply vacuous-truth when no expected streams.

When there are zero expectations to validate against, the evaluator should return 1.0 (all zero tests passed) rather than 0.0 (failure). Based on learnings.

Apply this diff:
     if not expected_stream_names:
-        logger.warning("No expected streams found")
-        return 0.0
+        logger.info("No expected streams defined; treating as pass (vacuous truth)")
+        return 1.0

♻️ Duplicate comments (1)

connector_builder_agents/src/evals/evaluators.py (1)
178-180: Apply vacuous-truth when no expected streams.

When there are zero expectations to validate against, the evaluator should return 1.0 (all zero tests passed) rather than 0.0 (failure). Based on learnings.

Apply this diff:
     if not expected_streams:
-        logger.warning("No expected streams found")
-        return 0.0
+        logger.info("No expected streams defined; treating as pass (vacuous truth)")
+        return 1.0

🧹 Nitpick comments (2)

tests/unit/test_util.py (1)
71-71: Consider verifying error messages in negative test cases.

While the boolean checks are correct, these negative test cases could also verify that appropriate error messages are returned in the second tuple element. This would ensure the error reporting is working as expected.

For example:
 def test_missing_required_fields(self):
     """Test validation fails for missing required fields."""
     manifest = {"version": "0.1.0", "type": "DeclarativeSource"}

-    assert validate_manifest_structure(manifest)[0] is False
+    is_valid, error_msg = validate_manifest_structure(manifest)
+    assert is_valid is False
+    assert error_msg is not None
+    assert "required fields" in error_msg.lower()
Also applies to: 77-77
connector_builder_agents/src/evals/evaluators.py (1)
117-129: Consider extracting manifest parsing into a helper function.

The manifest parsing logic (YAML parse → DeclarativeSource parse → __root__ unwrap) is duplicated in both streams_eval and primary_key_eval. Consider extracting this into a reusable helper function to reduce duplication and improve maintainability.

Example:
def parse_manifest_to_declarative_source(manifest_str: str) -> DeclarativeSource | None:
    """Parse manifest string to DeclarativeSource, handling __root__ unwrapping."""
    try:
        manifest_dict = yaml.safe_load(manifest_str)
        declarative_source = DeclarativeSource.parse_obj(manifest_dict)
        return getattr(declarative_source, "__root__", declarative_source)
    except Exception as e:
        logger.error(f"Failed to parse manifest: {e}")
        return None
Also applies to: 169-175

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ef12d56 and 8494bf9.

📒 Files selected for processing (3)

connector_builder_agents/src/evals/evaluators.py (4 hunks)
pyproject.toml (1 hunks)
tests/unit/test_util.py (1 hunks)

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-10-11T05:20:31.622Z

Learnt from: aaronsteers
PR: airbytehq/connector-builder-mcp#136
File: connector_builder_agents/src/evals/evaluators.py:0-0
Timestamp: 2025-10-11T05:20:31.622Z
Learning: In connector builder evaluator functions (e.g., primary_keys_eval, records_eval, stream_names_eval), when there are zero expectations to validate against (total_expected_streams == 0), the evaluator should return 1.0 (representing vacuous truth: all zero tests passed) rather than 0.0 (failure). This prevents penalizing scenarios where no validation criteria are defined.

Applied to files:

connector_builder_agents/src/evals/evaluators.py

🧬 Code graph analysis (2)

connector_builder_agents/src/evals/evaluators.py (1)

connector_builder_agents/src/evals/task.py (1)

ConnectorBuilderEvalTaskOutput (23-28)

tests/unit/test_util.py (1)

connector_builder_mcp/_util.py (1)

validate_manifest_structure (142-160)

🔇 Additional comments (4)

tests/unit/test_util.py (1)

65-65: LGTM!

The test correctly handles the new tuple return type from validate_manifest_structure.

pyproject.toml (1)

74-75: This review comment is incorrect.

The norecursedirs = ["connector_builder_agents"] setting is not redundant. The directory connector_builder_agents/tests/ exists and contains actual test files (test_evaluators.py, test_task.py).

While testpaths = ["tests"] restricts pytest discovery to the top-level tests/ directory under normal circumstances, norecursedirs serves as a necessary safeguard to prevent pytest from discovering and running tests in connector_builder_agents/tests/ if:

The configuration is bypassed or changed

pytest is invoked with different arguments

The testpaths constraint is not applied in certain CI contexts

This is a legitimate and defensive configuration choice.

Likely an incorrect or invalid review comment.

connector_builder_agents/src/evals/evaluators.py (2)

42-68: LGTM! Manifest validation is well-structured.

The function correctly handles all error cases (missing output, missing manifest, YAML parse errors, DeclarativeSource parse errors) and returns consistent integer values (0 or 1).

200-200: LGTM! All evaluators properly exported.

The EVALUATORS list now correctly includes the newly added primary_key_eval function. This addresses a previous review comment.

connector_builder_agents/src/evals/evaluators.py

aaronsteers · 2025-10-25T02:41:23Z

connector_builder_agents/src/evals/evaluators.py

+        if hasattr(declarative_source, "__root__"):
+            declarative_source = declarative_source.__root__


I'm not familiar with root. Does this jump dump as dict? I think model_dump() will do the same.

https://docs.pydantic.dev/latest/concepts/serialization/#python-mode

Our DeclarativeSource model is defined as a Union of DeclarativeSource1 and DeclarativeSource2. It’s a bit awkward, but __root__ is just how you access the underlying declarative source model. This pattern was auto-generated when dynamic streams were added to the CDK.

aaronsteers

@pnilan - Looks great. A couple non-blocking comments. Feel free to merge when ready!

coderabbitai

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

connector_builder_agents/src/evals/evaluators.py (1)
145-147: Apply vacuous truth when no expected streams defined.

When expected_stream_names is empty (zero expectations), the evaluator should return 1.0 (vacuous truth: all zero tests passed) rather than 0.0. Based on learnings.

Apply this diff:
     if not expected_stream_names:
-        logger.warning("No expected streams found")
-        return 0.0
+        logger.info("No expected streams defined; treating as pass (vacuous truth)")
+        return 1.0

♻️ Duplicate comments (1)

connector_builder_agents/src/evals/evaluators.py (1)
177-179: Apply vacuous truth when no expected streams defined.

When expected_streams is empty (zero expectations), the evaluator should return 1.0 (vacuous truth: all zero tests passed) rather than 0.0. This aligns with the vacuous-truth pattern used in other evaluators and prevents penalizing scenarios where no validation criteria are defined. Based on learnings.

Apply this diff:
     if not expected_streams:
-        logger.warning("No expected streams found")
-        return 0.0
+        logger.info("No expected streams defined; treating as pass (vacuous truth)")
+        return 1.0

🧹 Nitpick comments (2)

connector_builder_agents/src/evals/evaluators.py (2)

125-126: Prefer consistent getattr() pattern for __root__ handling.

For consistency with primary_key_eval (line 171), consider using the more concise getattr() pattern instead of hasattr() check.

Apply this diff for consistency:

-        if hasattr(declarative_source, "__root__"):
-            declarative_source = declarative_source.__root__
+        declarative_source = getattr(declarative_source, "__root__", declarative_source)

112-129: Consider extracting manifest parsing to reduce duplication.

The manifest parsing logic (YAML loading + DeclarativeSource parsing + __root__ handling) is duplicated between streams_eval and primary_key_eval. Extracting this to a helper function would improve maintainability.

Example helper function:

def _parse_manifest_as_declarative_source(output: dict) -> DeclarativeSource | None:
    """Parse manifest from output into DeclarativeSource, returning None on error."""
    connector_builder_eval_task_output = ConnectorBuilderEvalTaskOutput(**output)
    manifest_str = connector_builder_eval_task_output.artifacts.get("manifest")
    
    if not manifest_str:
        logger.warning("No manifest found or manifest is empty")
        return None
    
    try:
        manifest_dict = yaml.safe_load(manifest_str)
    except yaml.YAMLError as e:
        logger.warning(f"Failed to parse manifest YAML: {e}")
        return None
    
    try:
        declarative_source = DeclarativeSource.parse_obj(manifest_dict)
        return getattr(declarative_source, "__root__", declarative_source)
    except Exception as e:
        logger.error(f"Failed to parse DeclarativeSource from manifest: {e}")
        return None

Also applies to: 163-174

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8494bf9 and 361954b.

📒 Files selected for processing (1)

connector_builder_agents/src/evals/evaluators.py (4 hunks)

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-10-11T05:20:31.622Z

Learnt from: aaronsteers
PR: airbytehq/connector-builder-mcp#136
File: connector_builder_agents/src/evals/evaluators.py:0-0
Timestamp: 2025-10-11T05:20:31.622Z
Learning: In connector builder evaluator functions (e.g., primary_keys_eval, records_eval, stream_names_eval), when there are zero expectations to validate against (total_expected_streams == 0), the evaluator should return 1.0 (representing vacuous truth: all zero tests passed) rather than 0.0 (failure). This prevents penalizing scenarios where no validation criteria are defined.

Applied to files:

connector_builder_agents/src/evals/evaluators.py

🧬 Code graph analysis (1)

connector_builder_agents/src/evals/evaluators.py (1)

connector_builder_agents/src/evals/task.py (1)

ConnectorBuilderEvalTaskOutput (23-28)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pytest (All, Python 3.11, Ubuntu)

🔇 Additional comments (3)

connector_builder_agents/src/evals/evaluators.py (3)

42-68: LGTM! Manifest validation implementation is solid.

The function properly validates the manifest by parsing it as a DeclarativeSource with comprehensive error handling for YAML parsing and DeclarativeSource instantiation.

71-100: LGTM! Readiness evaluation properly returns float.

The function signature and return values are now type-consistent (all return float), and the LLM-based classification approach is appropriate for readiness evaluation.

199-199: LGTM! All evaluators properly exported.

The EVALUATORS list now includes all four evaluator functions, including the newly added primary_key_eval. This addresses the previous review comment about missing exports.

pnilan added 13 commits October 17, 2025 14:01

eval update first pass

86439f8

update airbyte-cdk to stable

d70efb7

merge

74a2094

Merge branch 'main' into pnilan/connector-readiness/add-primary-key-c…

06bac15

…heck

add primary key eval and manifest parsing

9830a65

fix expected primary keys

a25ae31

remove explicit project name arg

7cdd913

Merge branch 'main' into pnilan/evals/primary-key

7301b69

add retries

98207c7

update pk eval for clarity

581ca25

revert manifest scaffold

72fc545

revert validation testing

745f5fe

remove redundant package

7f4a9ef

github-actions bot added the enhancement New feature or request label Oct 24, 2025

chore: format & linting

6b3c114

coderabbitai bot reviewed Oct 24, 2025

View reviewed changes

incorporate review suggestions

ef12d56

pnilan requested review from aaronsteers and pedroslopez October 24, 2025 21:56

coderabbitai bot reviewed Oct 24, 2025

View reviewed changes

connector_builder_agents/src/evals/evaluators.py Outdated Show resolved Hide resolved

connector_builder_agents/tests/test_evaluators.py Show resolved Hide resolved

pnilan added 2 commits October 24, 2025 15:15

fix test config

0202e26

remove unused deps

8494bf9

coderabbitai bot reviewed Oct 24, 2025

View reviewed changes

connector_builder_agents/src/evals/evaluators.py Show resolved Hide resolved

connector_builder_agents/src/evals/evaluators.py Outdated Show resolved Hide resolved

aaronsteers approved these changes Oct 25, 2025

View reviewed changes

aaronsteers reviewed Oct 25, 2025

View reviewed changes

connector_builder_agents/src/evals/evaluators.py Outdated Show resolved Hide resolved

aaronsteers reviewed Oct 25, 2025

View reviewed changes

aaronsteers approved these changes Oct 25, 2025

View reviewed changes

fix eval return types (float)

361954b

coderabbitai bot reviewed Oct 27, 2025

View reviewed changes

pnilan merged commit b2164c2 into main Oct 27, 2025
15 checks passed

pnilan deleted the pnilan/evals/primary-key branch October 27, 2025 16:52

pnilan mentioned this pull request Oct 27, 2025

Import declarative_component_schema for pydantic model parsing #146

Closed

coderabbitai bot mentioned this pull request Oct 28, 2025

feat: add phase prompt tools, fix pydantic validation, add output models #159

Merged

coderabbitai bot mentioned this pull request Nov 10, 2025

feat: Add checklist tools for tracking connector development progress #181

Merged

5 tasks

		if hasattr(declarative_source, "__root__"):
		declarative_source = declarative_source.__root__

feat: add primary_key eval #156

feat: add primary_key eval #156

Uh oh!

Conversation

pnilan commented Oct 24, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Implementation Details

Scope

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Oct 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Pre-merge checks and finishing touches

Uh oh!

github-actions bot commented Oct 24, 2025

👋 Welcome to the Airbyte Connector Builder MCP!

Testing This Branch via MCP

Testing This Branch via CLI

PR Slash Commands

AI Builder Evaluations

Helpful Resources

Uh oh!

github-actions bot commented Oct 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PyTest Results (Full)

Uh oh!

github-actions bot commented Oct 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PyTest Results (Fast)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

aaronsteers Oct 25, 2025

Choose a reason for hiding this comment

Uh oh!

pnilan Oct 27, 2025

Choose a reason for hiding this comment

Uh oh!

aaronsteers left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

pnilan commented Oct 24, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Oct 24, 2025 •

edited

Loading

github-actions bot commented Oct 24, 2025 •

edited

Loading

github-actions bot commented Oct 24, 2025 •

edited

Loading