Skip to content

Conversation

@pnilan
Copy link
Contributor

@pnilan pnilan commented Oct 24, 2025

Summary

Introduces a primary key evaluation workflow for declarative connectors and expands evaluator coverage in the Phoenix evaluation suite.

Changes

  • Added primary key evaluator.
  • Added expected_primary_key field to connectors.yaml.
  • Added unit tests for the evaluation suite.
  • Added airbyte-cdk dependency to connector-builder-agents.
  • Introduced ConnectorBuilderEvalTaskOutput model to manage build artifacts and metadata.
  • Updated eval suite to parse the task output manifest as a DeclarativeStream.
    • Each evaluator now accepts the output argument (required by Phoenix) as a dictionary.

Implementation Details

  • Supports DeclarativeStream primary key formats as defined in declarative_component_schema.yaml:
    • None
    • str
    • List[str]
    • List[List[str]]
  • Does not handle post-parse transformations or downstream normalization.

Scope

  • Record count validation.
  • Future LLM-based evaluator improvements.

Summary by CodeRabbit

  • New Features

    • Added manifest validation, primary-key checks, and a unified evaluation sequence; task outputs now return structured result metadata and tool retry behavior improved.
  • Bug Fixes

    • More robust manifest/YAML parsing with clearer, actionable error messages.
  • Documentation

    • Guidance checklist expanded with new rules and wording clarifications about manifests and dotenv handling.
  • Chores

    • Updated runtime and dev dependencies; added a test task and pytest configuration.
  • Tests

    • Added comprehensive unit tests for evaluators and task behavior.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Oct 24, 2025

📝 Walkthrough

Walkthrough

Adds manifest-aware evaluations (DeclarativeSource parsing, manifest validation, primary-key checks), refactors task output to a Pydantic model, extends manifest utilities and tests, updates guidance and tool retry config, and adds/adjusts test and runtime dependencies and test tooling.

Changes

Cohort / File(s) Summary
Dependency & tooling
connector_builder_agents/pyproject.toml, pyproject.toml, poe_tasks.toml
Added airbyte-cdk>=7.3.8,<8.0 and testing dev deps (pytest, pytest-asyncio); added pytest config lines and a new test-agents Poe task.
Evaluators & control flow
connector_builder_agents/src/evals/evaluators.py, connector_builder_agents/src/evals/phoenix_run.py
Introduced manifest_validation_eval and primary_key_eval; updated readiness_eval and streams_eval signatures to accept `dict
Task output model & task flow
connector_builder_agents/src/evals/task.py, connector_builder_agents/tests/test_task.py
Added ConnectorBuilderEvalTaskOutput (Pydantic) and changed run_connector_build_task to return it; added try/except around build run, explicit artifact reads, and updated tests to assert new output shape.
Manifest parsing utilities
connector_builder_mcp/_util.py, tests/unit/test_util.py
Wrapped YAML parsing with clearer errors, changed validate_manifest_structure return to `(bool, str
Evaluator unit tests & test data
connector_builder_agents/tests/test_evaluators.py, connector_builder_agents/src/evals/data/connectors.yaml
Added comprehensive unit tests for evaluators; converted expected_streams entries from strings to objects with name and primary_key arrays for all connectors/streams.
Guidance & tools
connector_builder_mcp/_guidance.py, connector_builder_agents/src/tools.py
Minor copy edits and a new "Rules" section in guidance; removed detailed tool-arg logging and added max_retries=5 to MCP server builder configs.
Tests & CI config
pyproject.toml, connector_builder_agents/pyproject.toml
Added pytest configuration (testpaths, norecursedirs) and updated test-related dev dependencies.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Runner as Test / Runner
    participant Task as run_connector_build_task
    participant Build as run_connector_build()
    participant Art as Artifact store
    participant EvalMgr as EVALUATORS
    participant DSrc as DeclarativeSource

    Runner->>Task: invoke(dataset_row)
    Task->>Build: run_connector_build(...)
    Build-->>Task: final_result, turn_count, artifacts
    Task->>Art: read readiness_report, manifest
    alt manifest present & valid
        Task->>DSrc: DeclarativeSource.parse_obj(manifest)
        DSrc-->>EvalMgr: parsed manifest model
        EvalMgr->>EvalMgr: manifest_validation_eval()
        EvalMgr->>EvalMgr: streams_eval(expected, model)
        EvalMgr->>EvalMgr: primary_key_eval(expected, model)
    else manifest missing/invalid
        EvalMgr->>EvalMgr: fallback evaluations (readiness-only / diagnostics)
    end
    Task->>Task: assemble ConnectorBuilderEvalTaskOutput
    Task-->>Runner: ConnectorBuilderEvalTaskOutput
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

• Areas to focus on:

  • DeclarativeSource usage and error messages in connector_builder_mcp/_util.py.
  • Manifest parsing, root-object handling, and resilient stream discovery in connector_builder_agents/src/evals/evaluators.py.
  • The changed return type and artifact handling in connector_builder_agents/src/evals/task.py and corresponding tests.
  • New and updated unit tests in connector_builder_agents/tests/test_evaluators.py and tests/unit/test_util.py for coverage/accuracy.

Possibly related PRs

Suggested reviewers

  • aaronsteers

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The PR title "feat: add primary_key eval" directly reflects the primary objective stated in the PR description: introducing a primary key evaluation workflow for declarative connectors. The title is concise, specific, and uses conventional commit formatting, clearly communicating that a primary key evaluator is being added to the evaluation suite. While the changeset includes supporting modifications (such as the new ConnectorBuilderEvalTaskOutput model, manifest_validation_eval function, and updated test coverage), these are all in service of the main feature highlighted in the title. The title accurately summarizes the core purpose of the PR without being misleading or overly vague.
Docstring Coverage ✅ Passed Docstring coverage is 94.87% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch pnilan/evals/primary-key

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions bot added the enhancement New feature or request label Oct 24, 2025
@github-actions
Copy link

👋 Welcome to the Airbyte Connector Builder MCP!

Thank you for your contribution! Here are some helpful tips and reminders for your convenience.

Testing This Branch via MCP

To test the changes in this specific branch with an MCP client like Claude Desktop, use the following configuration:

{
  "mcpServers": {
    "connector-builder-mcp-dev": {
      "command": "uvx",
      "args": ["--from", "git+https://github.com/airbytehq/connector-builder-mcp.git@pnilan/evals/primary-key", "connector-builder-mcp"]
    }
  }
}

Testing This Branch via CLI

You can test this version of the MCP Server using the following CLI snippet:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/connector-builder-mcp.git@pnilan/evals/primary-key#egg=airbyte-connector-builder-mcp' --help

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /build-connector - Builds the default connector on-demand using the AI builder
  • /build-connector prompt="<your prompt>" - Builds a connector on-demand using the AI builder
  • /poe <command> - Runs any poe command in the uv virtual environment

AI Builder Evaluations

AI builder evaluations run automatically under the following conditions:

  • When a PR is marked as "ready for review"
  • When a PR is reopened

A set of standardized evaluations also run on a schedule (Mon/Wed/Fri at midnight UTC) and can be manually triggered via workflow dispatch.

Helpful Resources

If you have any questions, feel free to ask in the PR comments or join our Slack community.

📝 Edit this welcome message.

@github-actions
Copy link

github-actions bot commented Oct 24, 2025

PyTest Results (Full)

0 tests   0 ✅  0s ⏱️
0 suites  0 💤
0 files    0 ❌

Results for commit 361954b.

♻️ This comment has been updated with latest results.

@github-actions
Copy link

github-actions bot commented Oct 24, 2025

PyTest Results (Fast)

0 tests  ±0   0 ✅ ±0   0s ⏱️ ±0s
0 suites ±0   0 💤 ±0 
0 files   ±0   0 ❌ ±0 

Results for commit 361954b. ± Comparison against base commit 1806c6b.

♻️ This comment has been updated with latest results.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
connector_builder_agents/src/evals/evaluators.py (1)

141-144: Apply vacuous-truth when no expectations are defined.

When expected_stream_names is empty, return 1.0 instead of 0.0 to avoid penalizing missing expectations. Based on learnings.

-    if not expected_stream_names:
-        logger.warning("No expected streams found")
-        return 0.0
+    if not expected_stream_names:
+        logger.info("No expected streams defined; treating as pass (vacuous truth)")
+        return 1.0
🧹 Nitpick comments (11)
connector_builder_mcp/_guidance.py (1)

183-189: Consider removing "please" for consistency.

The new Rules section provides valuable guidance. However, line 185 uses "please" ("please pass it without parsing secrets yourself"), which is inconsistent with the imperative tone established elsewhere in this file (e.g., line 168 where "please" was removed).

Apply this diff for consistency:

-- All MCP tools support receiving .env file path - please pass it without parsing secrets yourself.
+- All MCP tools support receiving .env file path - pass it without parsing secrets yourself.
connector_builder_mcp/_util.py (1)

163-180: Consider the breadth of exception handling.

The function catches all exceptions with except Exception as e, which is quite broad. While this ensures that any parsing error is caught and reported, it may also hide programming errors or unexpected issues.

For validation purposes, this broad catch may be intentional to handle any failure gracefully. However, consider whether specific exception types from DeclarativeSource.parse_obj() should be caught instead.

Additionally, the docstring format (opening triple quotes on a separate line) differs from other functions in this file. Consider aligning with the single-line opening style used elsewhere for consistency.

connector_builder_agents/src/tools.py (1)

82-82: Reduced logging may impact debugging.

The removal of tool arguments from the log message reduces verbosity but may make debugging tool failures more difficult. Consider whether argument details are needed for troubleshooting.

connector_builder_agents/tests/test_evaluators.py (1)

385-385: Remove duplicate import.

pytest is already imported at line 8. The duplicate import at line 385 is redundant.

Apply this diff:

-import pytest
-
-
 @pytest.mark.parametrize(
connector_builder_agents/src/evals/task.py (2)

48-50: Preserve stack traces on failures.

Use logger.exception(...) to capture the traceback; re-raise unchanged.

-    except Exception as e:
-        logger.error(f"Failed to build connector '{connector_name}': {e}")
-        raise
+    except Exception:
+        logger.exception(f"Failed to build connector '{connector_name}'")
+        raise

55-66: Success computation and log message clarity.

  • Consider success = bool(build_result) so an empty list isn’t treated as success.
  • Avoid unconditional “successfully” wording; include the status in the log.
-    final_result = build_result[-1] if build_result else None
-    success = build_result is not None
-    num_turns = len(build_result) if build_result else 0
-
-    logger.info(f"Build completed - Success: {success}, Turns: {num_turns}")
+    final_result = build_result[-1] if build_result else None
+    success = bool(build_result)
+    num_turns = len(build_result) if build_result else 0
+
+    logger.info(f"Build completed - success={success}, turns={num_turns}")
@@
-    logger.info(f"Task completed successfully for connector '{connector_name}'")
+    logger.info(f"Task completed for connector '{connector_name}' (success={success})")
connector_builder_agents/src/evals/evaluators.py (5)

74-77: Keep return types consistent (ints for readiness_eval).

Function is annotated to return int; return 0 instead of 0.0.

-    if output is None:
-        logger.warning("No output provided - build task likely failed")
-        return 0.0
+    if output is None:
+        logger.warning("No output provided - build task likely failed")
+        return 0

113-116: Return float type consistently.

streams_eval returns float; use 0.0.

-    if not manifest_str:
-        logger.warning("No manifest found or manifest is empty")
-        return 0
+    if not manifest_str:
+        logger.warning("No manifest found or manifest is empty")
+        return 0.0

190-191: Avoid print; use logger and lower log level.

Replace prints with logger.debug to keep logs structured.

-        print(f"Actual primary key: {actual_pk}")
-        print(f"Expected primary key: {expected_by_name[name]}")
+        logger.debug(f"Actual primary key for {name}: {actual_pk}")
+        logger.debug(f"Expected primary key for {name}: {expected_by_name[name]}")

186-195: Normalize primary key shapes before comparing.

Treat str, List[str], and List[List[str]] forms equivalently and compare on a canonical representation (order-insensitive). This avoids false negatives due to formatting differences.

-    correct = 0
-    for name in matched_names:
-        pk = getattr(available[name], "primary_key", None)
-        actual_pk = getattr(pk, "__root__", pk)
-        logger.debug(f"Actual primary key for {name}: {actual_pk}")
-        logger.debug(f"Expected primary key for {name}: {expected_by_name[name]}")
-        if actual_pk == expected_by_name[name]:
-            correct += 1
+    def _canon(pk):
+        if pk is None:
+            return None
+        # Coerce to list-of-lists
+        if isinstance(pk, str):
+            pk = [[pk]]
+        elif isinstance(pk, list) and pk and all(isinstance(x, str) for x in pk):
+            pk = [pk]
+        elif not isinstance(pk, list):
+            return pk
+        # Sort inner lists and outer list for order-insensitive compare
+        return tuple(sorted(tuple(sorted(inner)) for inner in pk))
+
+    correct = 0
+    for name in matched_names:
+        pk = getattr(available[name], "primary_key", None)
+        actual_pk = getattr(pk, "__root__", pk)
+        if _canon(actual_pk) == _canon(expected_by_name[name]):
+            correct += 1

If you prefer strict order-sensitive equality, skip the sorting and only coerce shapes.


120-125: Optional: future-proof DeclarativeSource parsing across Pydantic v1/v2.

Airbyte CDK models may migrate to Pydantic v2 (model_validate). Consider a helper that tries both.

def _parse_declarative_source(obj):
    try:
        return DeclarativeSource.parse_obj(obj)
    except AttributeError:
        # Pydantic v2
        return DeclarativeSource.model_validate(obj)

Use _parse_declarative_source(manifest_dict) in both streams_eval and manifest_validation_eval.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1806c6b and 7f4a9ef.

⛔ Files ignored due to path filters (2)
  • connector_builder_agents/uv.lock is excluded by !**/*.lock
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (12)
  • connector_builder_agents/pyproject.toml (1 hunks)
  • connector_builder_agents/src/evals/data/connectors.yaml (1 hunks)
  • connector_builder_agents/src/evals/evaluators.py (4 hunks)
  • connector_builder_agents/src/evals/phoenix_run.py (4 hunks)
  • connector_builder_agents/src/evals/task.py (3 hunks)
  • connector_builder_agents/src/tools.py (4 hunks)
  • connector_builder_agents/tests/test_evaluators.py (1 hunks)
  • connector_builder_agents/tests/test_task.py (1 hunks)
  • connector_builder_mcp/_guidance.py (3 hunks)
  • connector_builder_mcp/_util.py (4 hunks)
  • poe_tasks.toml (1 hunks)
  • pyproject.toml (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-11T05:20:31.622Z
Learnt from: aaronsteers
PR: airbytehq/connector-builder-mcp#136
File: connector_builder_agents/src/evals/evaluators.py:0-0
Timestamp: 2025-10-11T05:20:31.622Z
Learning: In connector builder evaluator functions (e.g., primary_keys_eval, records_eval, stream_names_eval), when there are zero expectations to validate against (total_expected_streams == 0), the evaluator should return 1.0 (representing vacuous truth: all zero tests passed) rather than 0.0 (failure). This prevents penalizing scenarios where no validation criteria are defined.

Applied to files:

  • connector_builder_agents/tests/test_evaluators.py
  • connector_builder_agents/src/evals/evaluators.py
🧬 Code graph analysis (5)
connector_builder_agents/tests/test_evaluators.py (2)
connector_builder_agents/src/evals/task.py (1)
  • ConnectorBuilderEvalTaskOutput (23-28)
connector_builder_agents/src/evals/evaluators.py (4)
  • manifest_validation_eval (42-68)
  • primary_key_eval (153-195)
  • readiness_eval (71-100)
  • streams_eval (103-150)
connector_builder_agents/src/evals/task.py (2)
connector_builder_agents/src/run.py (1)
  • get_workspace_dir (34-38)
connector_builder_agents/src/evals/helpers.py (1)
  • get_artifact (4-13)
connector_builder_agents/src/evals/evaluators.py (1)
connector_builder_agents/src/evals/task.py (1)
  • ConnectorBuilderEvalTaskOutput (23-28)
connector_builder_agents/tests/test_task.py (1)
connector_builder_agents/src/evals/task.py (2)
  • ConnectorBuilderEvalTaskOutput (23-28)
  • run_connector_build_task (31-76)
connector_builder_agents/src/evals/phoenix_run.py (1)
connector_builder_agents/src/evals/task.py (1)
  • run_connector_build_task (31-76)
🔇 Additional comments (13)
connector_builder_mcp/_guidance.py (2)

14-14: Grammar improvement looks good.

The change from "Not include" to "Do not include" corrects the grammar and improves clarity.


168-168: Imperative tone is appropriate for checklist instructions.

The removal of "please" makes the instruction more direct and consistent with standard checklist formatting.

connector_builder_mcp/_util.py (2)

114-117: LGTM!

The try-except block improves error handling by providing a clearer error message when YAML parsing fails. The exception chaining with from e is appropriate and maintains the stack trace.


11-12: ✓ airbyte_cdk dependency is properly declared.

The import of DeclarativeSource from airbyte_cdk is supported. The root pyproject.toml includes "airbyte-cdk>=6.61.6,<7.0" in its dependencies, which governs the connector_builder_mcp package.

connector_builder_agents/pyproject.toml (1)

17-17: LGTM! Dependency additions support new evaluation features.

The new dependencies appropriately support:

  • airbyte-cdk>=7.3.8,<8.0: Enables DeclarativeSource parsing for manifest validation
  • pytest>=8.4.2 and pytest-asyncio>=0.23.0: Support the new test files added in this PR

Also applies to: 24-25

pyproject.toml (1)

12-12: LGTM! Runtime dependency additions are appropriate.

The dependencies support core agent functionality:

  • emoji>=2.0.0,<3.0: Used for emoji detection in progress logging
  • pydantic-ai-slim[openai,duckduckgo]>=0.0.14,<1.0: Provides agent framework capabilities

Also applies to: 17-17

connector_builder_agents/src/tools.py (1)

121-121: LGTM! Retry configuration improves resilience.

Adding max_retries=5 to MCP server configurations provides appropriate resilience against transient failures in tool execution.

Also applies to: 150-150, 176-176

poe_tasks.toml (1)

20-20: LGTM! New test task enables agent test execution.

The test-agents task provides a convenient way to run the new agent tests with verbose output, complementing the existing test task for the main codebase.

connector_builder_agents/src/evals/data/connectors.yaml (1)

7-18: LGTM! Stream structure enhanced to support primary key evaluation.

The transition from plain string stream names to structured objects with name and primary_key fields enables the new primary key evaluator. The primary key values are appropriate:

  • JSONPlaceholder and Rick & Morty APIs correctly use ["id"]
  • StarWars API correctly uses ["url"] as the unique identifier

Also applies to: 24-35, 41-46

connector_builder_agents/src/evals/phoenix_run.py (2)

28-28: LGTM! Refactoring improves maintainability.

The centralization of evaluators into a single EVALUATORS collection from the evaluators module improves maintainability and makes it easier to add new evaluators in the future.

Also applies to: 62-62, 70-70


44-50: LGTM! Improved documentation of Phoenix instrumentation.

The added comments clearly explain the automatic instrumentation mechanism and the environment variables used for configuration, improving developer understanding.

connector_builder_agents/tests/test_task.py (1)

17-77: LGTM! Comprehensive test coverage for task execution.

The test effectively validates:

  • Correct construction of ConnectorBuilderEvalTaskOutput
  • Proper handling of task artifacts (readiness report and manifest)
  • Correct propagation of session ID, API name, and model configurations
  • Expected interaction with mocked dependencies

The test structure is well-organized with clear mocking and assertions.

connector_builder_agents/tests/test_evaluators.py (1)

300-327: LGTM! Comprehensive test coverage for evaluators.

The test suite thoroughly validates:

  • readiness_eval: Both passing and failing scenarios with mocked LLM classification
  • streams_eval: Perfect match, partial match, no match, and extra streams scenarios
  • primary_key_eval: Various primary key formats (None, string, single list, nested list, composite keys), no match, and partial match cases

The parametrized tests for primary key formats effectively cover the different formats supported by DeclarativeStream.

Based on learnings

Also applies to: 330-382, 388-437

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (3)
connector_builder_agents/tests/test_evaluators.py (1)

11-16: Import path inconsistency (already flagged).

The imports use a relative path src.evals.evaluators instead of the full path connector_builder_agents.src.evals.evaluators. This inconsistency has been flagged in previous reviews.

connector_builder_mcp/_util.py (1)

142-160: Improved error reporting with tuple return.

The updated signature returning tuple[bool, str | None] is a good improvement that provides detailed error messages to callers. Note: There's a typo "stucture" on line 158 that was flagged in a previous review.

connector_builder_agents/src/evals/evaluators.py (1)

178-180: Apply vacuous truth when no expected streams (already flagged).

When there are zero expected streams, the function should return 1.0 (vacuous truth: all zero tests passed) rather than 0.0. This issue was flagged in a previous review.

Based on learnings.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6b3c114 and ef12d56.

📒 Files selected for processing (4)
  • connector_builder_agents/src/evals/evaluators.py (4 hunks)
  • connector_builder_agents/src/evals/task.py (3 hunks)
  • connector_builder_agents/tests/test_evaluators.py (1 hunks)
  • connector_builder_mcp/_util.py (4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-11T05:20:31.622Z
Learnt from: aaronsteers
PR: airbytehq/connector-builder-mcp#136
File: connector_builder_agents/src/evals/evaluators.py:0-0
Timestamp: 2025-10-11T05:20:31.622Z
Learning: In connector builder evaluator functions (e.g., primary_keys_eval, records_eval, stream_names_eval), when there are zero expectations to validate against (total_expected_streams == 0), the evaluator should return 1.0 (representing vacuous truth: all zero tests passed) rather than 0.0 (failure). This prevents penalizing scenarios where no validation criteria are defined.

Applied to files:

  • connector_builder_agents/tests/test_evaluators.py
  • connector_builder_agents/src/evals/evaluators.py
🧬 Code graph analysis (3)
connector_builder_agents/tests/test_evaluators.py (1)
connector_builder_agents/src/evals/evaluators.py (4)
  • manifest_validation_eval (42-68)
  • primary_key_eval (157-199)
  • readiness_eval (71-100)
  • streams_eval (103-154)
connector_builder_agents/src/evals/task.py (2)
connector_builder_agents/src/run.py (1)
  • get_workspace_dir (34-38)
connector_builder_agents/src/evals/helpers.py (1)
  • get_artifact (4-13)
connector_builder_agents/src/evals/evaluators.py (1)
connector_builder_agents/src/evals/task.py (1)
  • ConnectorBuilderEvalTaskOutput (23-28)
🔇 Additional comments (11)
connector_builder_mcp/_util.py (3)

11-11: LGTM!

The import of DeclarativeSource is appropriate for the new validation function added below.


114-117: LGTM!

Adding try-except around YAML parsing with a clear error message improves error handling and user experience.


163-180: LGTM!

The new validation function properly leverages DeclarativeSource.parse_obj for schema validation and provides helpful error guidance. The return signature is consistent with validate_manifest_structure.

connector_builder_agents/src/evals/task.py (3)

6-10: LGTM!

The imports are correct, including Path from the standard library pathlib module. The necessary types for the new ConnectorBuilderEvalTaskOutput model are properly imported.


23-28: LGTM!

The ConnectorBuilderEvalTaskOutput model provides a clean, well-typed structure for task results. Using Mapping[str, Any] for artifacts provides flexibility while maintaining type safety.


31-76: LGTM!

The refactored function provides better error handling and returns a structured, type-safe output. The try-except block with re-raise preserves the exception context while allowing for logging.

connector_builder_agents/src/evals/evaluators.py (5)

13-14: LGTM!

The new imports are necessary for DeclarativeSource-based manifest parsing and the structured task output model.


42-68: LGTM!

The manifest validation function properly handles all error cases with appropriate try-except blocks for both YAML parsing and DeclarativeSource parsing. This addresses concerns from previous reviews.


71-100: LGTM!

The readiness_eval function properly integrates with the new ConnectorBuilderEvalTaskOutput model and maintains appropriate error handling throughout.


103-154: LGTM!

The streams_eval function now includes proper YAML error handling (as requested in previous reviews) and correctly parses the manifest as a DeclarativeSource to extract available streams.


202-202: LGTM!

All evaluators, including the newly added primary_key_eval, are properly registered in the EVALUATORS list. This resolves concerns from previous reviews.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
connector_builder_agents/src/evals/evaluators.py (1)

145-147: Apply vacuous-truth when no expected streams.

When there are zero expectations to validate against, the evaluator should return 1.0 (all zero tests passed) rather than 0.0 (failure). Based on learnings.

Apply this diff:

     if not expected_stream_names:
-        logger.warning("No expected streams found")
-        return 0.0
+        logger.info("No expected streams defined; treating as pass (vacuous truth)")
+        return 1.0
♻️ Duplicate comments (1)
connector_builder_agents/src/evals/evaluators.py (1)

178-180: Apply vacuous-truth when no expected streams.

When there are zero expectations to validate against, the evaluator should return 1.0 (all zero tests passed) rather than 0.0 (failure). Based on learnings.

Apply this diff:

     if not expected_streams:
-        logger.warning("No expected streams found")
-        return 0.0
+        logger.info("No expected streams defined; treating as pass (vacuous truth)")
+        return 1.0
🧹 Nitpick comments (2)
tests/unit/test_util.py (1)

71-71: Consider verifying error messages in negative test cases.

While the boolean checks are correct, these negative test cases could also verify that appropriate error messages are returned in the second tuple element. This would ensure the error reporting is working as expected.

For example:

 def test_missing_required_fields(self):
     """Test validation fails for missing required fields."""
     manifest = {"version": "0.1.0", "type": "DeclarativeSource"}

-    assert validate_manifest_structure(manifest)[0] is False
+    is_valid, error_msg = validate_manifest_structure(manifest)
+    assert is_valid is False
+    assert error_msg is not None
+    assert "required fields" in error_msg.lower()

Also applies to: 77-77

connector_builder_agents/src/evals/evaluators.py (1)

117-129: Consider extracting manifest parsing into a helper function.

The manifest parsing logic (YAML parse → DeclarativeSource parse → __root__ unwrap) is duplicated in both streams_eval and primary_key_eval. Consider extracting this into a reusable helper function to reduce duplication and improve maintainability.

Example:

def parse_manifest_to_declarative_source(manifest_str: str) -> DeclarativeSource | None:
    """Parse manifest string to DeclarativeSource, handling __root__ unwrapping."""
    try:
        manifest_dict = yaml.safe_load(manifest_str)
        declarative_source = DeclarativeSource.parse_obj(manifest_dict)
        return getattr(declarative_source, "__root__", declarative_source)
    except Exception as e:
        logger.error(f"Failed to parse manifest: {e}")
        return None

Also applies to: 169-175

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ef12d56 and 8494bf9.

📒 Files selected for processing (3)
  • connector_builder_agents/src/evals/evaluators.py (4 hunks)
  • pyproject.toml (1 hunks)
  • tests/unit/test_util.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-11T05:20:31.622Z
Learnt from: aaronsteers
PR: airbytehq/connector-builder-mcp#136
File: connector_builder_agents/src/evals/evaluators.py:0-0
Timestamp: 2025-10-11T05:20:31.622Z
Learning: In connector builder evaluator functions (e.g., primary_keys_eval, records_eval, stream_names_eval), when there are zero expectations to validate against (total_expected_streams == 0), the evaluator should return 1.0 (representing vacuous truth: all zero tests passed) rather than 0.0 (failure). This prevents penalizing scenarios where no validation criteria are defined.

Applied to files:

  • connector_builder_agents/src/evals/evaluators.py
🧬 Code graph analysis (2)
connector_builder_agents/src/evals/evaluators.py (1)
connector_builder_agents/src/evals/task.py (1)
  • ConnectorBuilderEvalTaskOutput (23-28)
tests/unit/test_util.py (1)
connector_builder_mcp/_util.py (1)
  • validate_manifest_structure (142-160)
🔇 Additional comments (4)
tests/unit/test_util.py (1)

65-65: LGTM!

The test correctly handles the new tuple return type from validate_manifest_structure.

pyproject.toml (1)

74-75: This review comment is incorrect.

The norecursedirs = ["connector_builder_agents"] setting is not redundant. The directory connector_builder_agents/tests/ exists and contains actual test files (test_evaluators.py, test_task.py).

While testpaths = ["tests"] restricts pytest discovery to the top-level tests/ directory under normal circumstances, norecursedirs serves as a necessary safeguard to prevent pytest from discovering and running tests in connector_builder_agents/tests/ if:

  • The configuration is bypassed or changed
  • pytest is invoked with different arguments
  • The testpaths constraint is not applied in certain CI contexts

This is a legitimate and defensive configuration choice.

Likely an incorrect or invalid review comment.

connector_builder_agents/src/evals/evaluators.py (2)

42-68: LGTM! Manifest validation is well-structured.

The function correctly handles all error cases (missing output, missing manifest, YAML parse errors, DeclarativeSource parse errors) and returns consistent integer values (0 or 1).


200-200: LGTM! All evaluators properly exported.

The EVALUATORS list now correctly includes the newly added primary_key_eval function. This addresses a previous review comment.

Comment on lines +125 to +126
if hasattr(declarative_source, "__root__"):
declarative_source = declarative_source.__root__
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with root. Does this jump dump as dict? I think model_dump() will do the same.

https://docs.pydantic.dev/latest/concepts/serialization/#python-mode

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our DeclarativeSource model is defined as a Union of DeclarativeSource1 and DeclarativeSource2. It’s a bit awkward, but __root__ is just how you access the underlying declarative source model. This pattern was auto-generated when dynamic streams were added to the CDK.

Copy link
Contributor

@aaronsteers aaronsteers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pnilan - Looks great. A couple non-blocking comments. Feel free to merge when ready!

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
connector_builder_agents/src/evals/evaluators.py (1)

145-147: Apply vacuous truth when no expected streams defined.

When expected_stream_names is empty (zero expectations), the evaluator should return 1.0 (vacuous truth: all zero tests passed) rather than 0.0. Based on learnings.

Apply this diff:

     if not expected_stream_names:
-        logger.warning("No expected streams found")
-        return 0.0
+        logger.info("No expected streams defined; treating as pass (vacuous truth)")
+        return 1.0
♻️ Duplicate comments (1)
connector_builder_agents/src/evals/evaluators.py (1)

177-179: Apply vacuous truth when no expected streams defined.

When expected_streams is empty (zero expectations), the evaluator should return 1.0 (vacuous truth: all zero tests passed) rather than 0.0. This aligns with the vacuous-truth pattern used in other evaluators and prevents penalizing scenarios where no validation criteria are defined. Based on learnings.

Apply this diff:

     if not expected_streams:
-        logger.warning("No expected streams found")
-        return 0.0
+        logger.info("No expected streams defined; treating as pass (vacuous truth)")
+        return 1.0
🧹 Nitpick comments (2)
connector_builder_agents/src/evals/evaluators.py (2)

125-126: Prefer consistent getattr() pattern for __root__ handling.

For consistency with primary_key_eval (line 171), consider using the more concise getattr() pattern instead of hasattr() check.

Apply this diff for consistency:

-        if hasattr(declarative_source, "__root__"):
-            declarative_source = declarative_source.__root__
+        declarative_source = getattr(declarative_source, "__root__", declarative_source)

112-129: Consider extracting manifest parsing to reduce duplication.

The manifest parsing logic (YAML loading + DeclarativeSource parsing + __root__ handling) is duplicated between streams_eval and primary_key_eval. Extracting this to a helper function would improve maintainability.

Example helper function:

def _parse_manifest_as_declarative_source(output: dict) -> DeclarativeSource | None:
    """Parse manifest from output into DeclarativeSource, returning None on error."""
    connector_builder_eval_task_output = ConnectorBuilderEvalTaskOutput(**output)
    manifest_str = connector_builder_eval_task_output.artifacts.get("manifest")
    
    if not manifest_str:
        logger.warning("No manifest found or manifest is empty")
        return None
    
    try:
        manifest_dict = yaml.safe_load(manifest_str)
    except yaml.YAMLError as e:
        logger.warning(f"Failed to parse manifest YAML: {e}")
        return None
    
    try:
        declarative_source = DeclarativeSource.parse_obj(manifest_dict)
        return getattr(declarative_source, "__root__", declarative_source)
    except Exception as e:
        logger.error(f"Failed to parse DeclarativeSource from manifest: {e}")
        return None

Also applies to: 163-174

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8494bf9 and 361954b.

📒 Files selected for processing (1)
  • connector_builder_agents/src/evals/evaluators.py (4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-11T05:20:31.622Z
Learnt from: aaronsteers
PR: airbytehq/connector-builder-mcp#136
File: connector_builder_agents/src/evals/evaluators.py:0-0
Timestamp: 2025-10-11T05:20:31.622Z
Learning: In connector builder evaluator functions (e.g., primary_keys_eval, records_eval, stream_names_eval), when there are zero expectations to validate against (total_expected_streams == 0), the evaluator should return 1.0 (representing vacuous truth: all zero tests passed) rather than 0.0 (failure). This prevents penalizing scenarios where no validation criteria are defined.

Applied to files:

  • connector_builder_agents/src/evals/evaluators.py
🧬 Code graph analysis (1)
connector_builder_agents/src/evals/evaluators.py (1)
connector_builder_agents/src/evals/task.py (1)
  • ConnectorBuilderEvalTaskOutput (23-28)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
🔇 Additional comments (3)
connector_builder_agents/src/evals/evaluators.py (3)

42-68: LGTM! Manifest validation implementation is solid.

The function properly validates the manifest by parsing it as a DeclarativeSource with comprehensive error handling for YAML parsing and DeclarativeSource instantiation.


71-100: LGTM! Readiness evaluation properly returns float.

The function signature and return values are now type-consistent (all return float), and the LLM-based classification approach is appropriate for readiness evaluation.


199-199: LGTM! All evaluators properly exported.

The EVALUATORS list now includes all four evaluator functions, including the newly added primary_key_eval. This addresses the previous review comment about missing exports.

@pnilan pnilan merged commit b2164c2 into main Oct 27, 2025
15 checks passed
@pnilan pnilan deleted the pnilan/evals/primary-key branch October 27, 2025 16:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants