@aaronsteers aaronsteers commented Oct 28, 2025

feat: add anonymized and final_pages_only parameters to execute_stream_test_read

Summary

Adds two new parameters to the execute_stream_test_read MCP tool to support capturing real API data for mock server test fixtures:

  1. anonymized: bool - Enables deterministic anonymization of raw API responses and records using HMAC-SHA256. Requires the MOCK_ANON_SALT environment variable so that hashing is reproducible across runs.

  2. final_pages_only: bool | None - Captures ONLY the last 2 pages of paginated results, ignoring all intermediate pages. Sets very high limits (10,000 pages max) to force full pagination, then post-processes to keep only the final 2 pages.

Use Case: These parameters enable creating mock server test fixtures from real API data by (1) capturing terminal pages that demonstrate end-of-pagination behavior, and (2) anonymizing sensitive data while preserving structure for testing.

Implementation Details:

  • New anonymize.py module with format-preserving anonymization (emails keep @example.com structure)
  • Redacts sensitive HTTP headers (Authorization, Cookie, API keys) and query parameters
  • Field detection uses pattern matching (IDs, emails, names, etc.)
  • Safety guard prevents unbounded pagination (stops at 10,000 pages)
  • Includes metadata (reached_end, anonymized, salt_id) in output
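
The deterministic hashing described above can be illustrated with a minimal sketch. It is not the code in anonymize.py: the helper names hmac_token and anonymize_email_sketch, the 12-character truncation, and the user_ prefix are all assumptions made for illustration. Only the use of HMAC-SHA256 with a salt and the example.com domain come from the description.

```python
import hashlib
import hmac


def hmac_token(value: str, salt: bytes, length: int = 12) -> str:
    """Return a stable hex token for `value` under `salt` (hypothetical helper)."""
    digest = hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncation length is an assumption; shorter tokens raise collision risk.
    return digest[:length]


def anonymize_email_sketch(email: str, salt: bytes) -> str:
    """Replace an email with a deterministic address at example.com."""
    # Lowercasing first means "A@x.io" and "a@x.io" map to the same token.
    return f"user_{hmac_token(email.lower(), salt)}@example.com"


salt = b"test_secret_123"
first = anonymize_email_sketch("jane.doe@corp.test", salt)
second = anonymize_email_sketch("jane.doe@corp.test", salt)
other = anonymize_email_sketch("jane.doe@corp.test", b"different-salt")
```

With the same salt, first and second are identical, which is what makes fixtures stable across runs. Changing the salt changes every token, so other differs from first.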

Review & Testing Checklist for Human

This is a yellow risk change - new functionality with comprehensive tests but limited real-world validation.

  • Verify safety guard appropriateness: Is 10,000 pages the right limit? Could this cause issues with deeply paginated APIs or timeout problems?
  • Test with real connector: Pick a connector (HubSpot, GitHub, etc.) and verify that (a) anonymization preserves enough structure for testing, and (b) final_pages_only correctly captures terminal pages
  • Check anonymization patterns: Review the field detection patterns in anonymize.py:78-102 - are these comprehensive enough? Too aggressive? Missing common sensitive field names?
  • Validate MOCK_ANON_SALT requirement: Is requiring an environment variable acceptable, or should there be a fallback to non-deterministic anonymization?
  • Review metadata fields: Are reached_end, salt_id, and anonymized fields in the output actually useful? Should there be additional metadata?

Recommended Test Plan:

  1. Set MOCK_ANON_SALT=test_secret_123 in environment
  2. Run execute_stream_test_read with anonymized=true and final_pages_only=true on a connector with 5+ pages of results
  3. Verify only last 2 pages are returned, all sensitive fields are anonymized, and structure is preserved
  4. Try without MOCK_ANON_SALT set - this should fail gracefully with a clear error message
  5. Test that same input produces same anonymized output (determinism check)
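
Steps 4 and 5 of the plan above can be sketched as follows. The function names load_salt and salt_fingerprint are hypothetical, and deriving the identifier from a truncated SHA-256 of the salt is an assumption: the PR does not show how salt_id is computed. The ValueError on a missing variable matches the sequence diagram in the walkthrough.

```python
import hashlib
import os


def load_salt(env_var: str = "MOCK_ANON_SALT") -> bytes:
    """Read the salt from the environment, failing loudly if it is absent."""
    value = os.environ.get(env_var)
    if not value:
        raise ValueError(
            f"{env_var} must be set when anonymized=True "
            "so that hashes are reproducible across runs."
        )
    return value.encode("utf-8")


def salt_fingerprint(salt: bytes) -> str:
    """Short identifier for logs (assumed derivation, not the PR's get_salt_id)."""
    return hashlib.sha256(salt).hexdigest()[:8]
```

Two runs that log the same fingerprint used the same salt, so their anonymized fixtures should be comparable. A low-entropy salt could still be guessed from its fingerprint, so the salt itself should be treated as a secret.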

Notes

  • All 48 tests passing (35 unit + 13 integration)
  • Type checking passes with mypy
  • Linting passes with ruff
  • Changes are backwards compatible (new parameters with sensible defaults)
  • Implementation uses very high limits for final_pages_only rather than explicit parameter conflict detection - this is pragmatic but worth noting

Link to Devin run: https://app.devin.ai/sessions/75f2bd231eea4960b5a76b7d14b8ddcf
Requested by: AJ Steers ([email protected]) / @aaronsteers

Summary by CodeRabbit

  • New Features

    • Added deterministic salt-based anonymization for test data and HTTP interactions, including records, headers, and query parameters.
    • Introduced anonymized parameter to enable/disable anonymization in stream test reads.
    • Introduced final_pages_only parameter to focus test collection on final pages only.
  • Tests

    • Added comprehensive unit tests for anonymization utilities.
    • Added integration tests validating anonymization and final pages parameters with various configurations.

…m_test_read

- Add anonymize.py module with deterministic HMAC-SHA256 anonymization
- Add anonymized parameter to anonymize raw responses and records
- Add final_pages_only parameter to capture only last 2 pages
- Implement safety guard (MAX_PAGES_SCAN = 10,000 pages)
- Add comprehensive unit tests for anonymization module (35 tests)
- Add integration tests for new parameters (13 tests)
- All tests passing with proper type checking

The anonymized parameter requires MOCK_ANON_SALT environment variable
for reproducible anonymization. The final_pages_only parameter forces
full pagination but only returns the final 2 pages in output, useful
for creating mock server test fixtures from real API data.

Co-Authored-By: AJ Steers <[email protected]>
@devin-ai-integration

Original prompt from AJ Steers
Received message in Slack channel #ask-devin-ai:

@Devin - Take context from this docs pr: <https://github.com/airbytehq/airbyte/pull/68667/files>

It occurs to me that many of the configuration elements of the mock server tests can be inferred from a manifest.yaml file. Let's make a plan to create a simplified version of mock server tests that requires much less work for humans and LLMs to generate.
Thread URL: https://airbytehq-team.slack.com/archives/C08BHPUMEPJ/p1761623322937639?thread_ts=1761623322.937639

@devin-ai-integration

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@coderabbitai

coderabbitai bot commented Oct 28, 2025

📝 Walkthrough

Introduces deterministic, salt-based anonymization for test data and HTTP interactions through a new module. Integrates anonymization into the test reading workflow via new anonymized and final_pages_only parameters. Adds comprehensive unit and integration test coverage for the new functionality.

Changes

Cohort / File(s) | Summary
Core anonymization module: connector_builder_mcp/anonymize.py | New module providing 11 functions for deterministic anonymization: salt retrieval and validation, field-aware anonymization of strings/emails, recursive data anonymization (dicts, lists), HTTP header/query parameter redaction, and batch record processing. Uses HMAC-SHA256 with environment-provided salt.
Validation testing integration: connector_builder_mcp/validation_testing.py | Added anonymized and final_pages_only parameters to execute_stream_test_read. Implemented runtime anonymization validation, final-pages-only mode with permissive scan limits, aggregated page collection, conditional anonymization of records and HTTP interactions, and updated error/logging paths.
Unit test coverage: tests/unit/test_anonymize.py | New test module covering 10 test groups: salt management, string/email anonymization, field classification heuristics, value anonymization, recursive dict/list traversal, HTTP header/query parameter redaction, and batch record processing with determinism validation.
Integration test coverage: tests/integration/test_anonymized_and_final_pages.py | New test module validating anonymized and final_pages_only parameters with Rick and Morty manifest fixture, covering parameter validation, error conditions, combined parameter usage, log assertions, and environment variable mocking.
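
The "recursive data anonymization (dicts, lists)" mentioned in the summary can be pictured with a small sketch. This is an illustration under assumptions, not the module's anonymize_dict: the function name anonymize_tree, the SENSITIVE_KEYS set, and the 12-character token are all hypothetical.

```python
import hashlib
import hmac
from typing import Any, Optional

# Hypothetical key set; the real module uses pattern-based field detection.
SENSITIVE_KEYS = {"id", "email", "name"}


def _token(value: str, salt: bytes) -> str:
    return hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()[:12]


def anonymize_tree(data: Any, salt: bytes, key: Optional[str] = None) -> Any:
    """Return an anonymized copy of nested dicts/lists, leaving the input untouched."""
    if isinstance(data, dict):
        return {k: anonymize_tree(v, salt, k) for k, v in data.items()}
    if isinstance(data, list):
        # List items inherit the key of the field that holds the list.
        return [anonymize_tree(item, salt, key) for item in data]
    if isinstance(data, str) and key in SENSITIVE_KEYS:
        return _token(data, salt)
    # Non-string values and non-sensitive fields pass through unchanged.
    return data


record = {"id": "abc", "count": 3, "owner": {"email": "a@b.co"}, "tags": [{"name": "x"}]}
anonymized_record = anonymize_tree(record, b"test_secret_123")
```

Because the function builds new containers instead of mutating in place, the caller has to use the returned value. That is the same write-back concern the review comments raise about raw_api_responses.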

Sequence Diagram

sequenceDiagram
    participant User
    participant execute_stream_test_read
    participant Salt Management
    participant Test Execution
    participant Anonymization
    participant Result

    User->>execute_stream_test_read: Call with anonymized=True
    alt anonymized enabled
        execute_stream_test_read->>Salt Management: get_anonymization_salt()
        alt MOCK_ANON_SALT missing
            Salt Management-->>Result: ValueError
            Result-->>User: Error result
        else MOCK_ANON_SALT present
            Salt Management-->>execute_stream_test_read: salt bytes + salt_id
            execute_stream_test_read->>Test Execution: Run test read
            Test Execution-->>execute_stream_test_read: records + pages
            execute_stream_test_read->>Anonymization: anonymize_records()
            execute_stream_test_read->>Anonymization: anonymize_http_interaction()
            Anonymization-->>execute_stream_test_read: anonymized data
            execute_stream_test_read->>Result: Return with anonymized data + salt_id log
            Result-->>User: Success with anonymization
        end
    else anonymized disabled
        execute_stream_test_read->>Test Execution: Run test read
        Test Execution-->>Result: records + pages (unmodified)
        Result-->>User: Success without anonymization
    end

    alt final_pages_only enabled
        execute_stream_test_read->>Test Execution: Set permissive limits
        Test Execution-->>execute_stream_test_read: Aggregate all_pages
        execute_stream_test_read->>execute_stream_test_read: Truncate to last 2 pages
        execute_stream_test_read->>Result: Log page summary
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Core anonymization module: 11 new functions with cryptographic operations (HMAC-SHA256), field detection heuristics (regex patterns), and recursive data traversal requiring logic verification across multiple paths.
  • Validation testing integration: Significant control flow changes introducing conditional anonymization, final-pages-only mode with tuned limits, page aggregation refactoring, and error handling integration that affects existing test execution paths.
  • Test coverage: Comprehensive unit and integration tests, though homogeneous in pattern, require verification of test assertions against implementation behavior.
  • Key areas requiring extra attention:
    • HMAC-SHA256 implementation correctness and determinism across salt operations
    • Field detection heuristics in should_anonymize_field() to ensure appropriate coverage without over-anonymizing
    • Page aggregation and truncation logic in final_pages_only mode to verify correctness of "last two pages" logic
    • Integration of anonymization flow with existing error handling and logging in execute_stream_test_read()
    • Environment variable validation (MOCK_ANON_SALT) error paths in both production and test contexts

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check | ✅ Passed | The pull request title "feat: add anonymized and final_pages_only parameters to execute_stream_test_read" accurately and specifically describes the primary change in the changeset. The title directly aligns with the PR objective which states "This PR adds two parameters to the execute_stream_test_read MCP tool." The title is concrete and specific, naming both the function (execute_stream_test_read) and the parameters being added (anonymized and final_pages_only), making it clear to a teammate scanning the history what the main feature is. While the PR also includes a supporting anonymize.py module and comprehensive tests, the title appropriately focuses on the user-facing API change, which is the main deliverable.
Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

@github-actions github-actions bot added the enhancement New feature or request label Oct 28, 2025
@github-actions

👋 Welcome to the Airbyte Connector Builder MCP!

Thank you for your contribution! Here are some helpful tips and reminders for your convenience.

Testing This Branch via MCP

To test the changes in this specific branch with an MCP client like Claude Desktop, use the following configuration:

{
  "mcpServers": {
    "connector-builder-mcp-dev": {
      "command": "uvx",
      "args": ["--from", "git+https://github.com/airbytehq/connector-builder-mcp.git@devin/1761628542-add-anonymized-final-pages-params", "connector-builder-mcp"]
    }
  }
}

Testing This Branch via CLI

You can test this version of the MCP Server using the following CLI snippet:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/connector-builder-mcp.git@devin/1761628542-add-anonymized-final-pages-params#egg=airbyte-connector-builder-mcp' --help

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /build-connector - Builds the default connector on-demand using the AI builder
  • /build-connector prompt="<your prompt>" - Builds a connector on-demand using the AI builder
  • /poe <command> - Runs any poe command in the uv virtual environment

AI Builder Evaluations

AI builder evaluations run automatically under the following conditions:

  • When a PR is marked as "ready for review"
  • When a PR is reopened

A set of standardized evaluations also runs on a schedule (Mon/Wed/Fri at midnight UTC) and can be manually triggered via workflow dispatch.

Helpful Resources

If you have any questions, feel free to ask in the PR comments or join our Slack community.


@github-actions

PyTest Results (Fast)

0 tests  ±0   0 ✅ ±0   0s ⏱️ ±0s
0 suites ±0   0 💤 ±0 
0 files   ±0   0 ❌ ±0 

Results for commit 3af39c8. ± Comparison against base commit b2164c2.

@github-actions

PyTest Results (Full)

0 tests  ±0   0 ✅ ±0   0s ⏱️ ±0s
0 suites ±0   0 💤 ±0 
0 files   ±0   0 ❌ ±0 

Results for commit 3af39c8. ± Comparison against base commit b2164c2.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
connector_builder_mcp/validation_testing.py (1)

454-471: Security/privacy bug: sanitized pages not propagated to raw_api_responses

You filter secrets into slices but never write them back into stream_data, which is what gets returned in raw_api_responses. This can leak unsanitized request/response data. Write the sanitized structure back (handling the auxiliary_requests fallback).

Apply this diff:

@@
-        slices = cast(
-            list[dict[str, Any]],
-            filter_config_secrets(slices_from_stream),
-        )
+        slices = cast(
+            list[dict[str, Any]],
+            filter_config_secrets(slices_from_stream),
+        )
+        # Ensure sanitized slices/aux requests are what we later return
+        if not stream_data.get("slices") and "auxiliary_requests" in stream_data:
+            stream_data["auxiliary_requests"] = slices
+        else:
+            stream_data["slices"] = slices
🧹 Nitpick comments (8)
tests/unit/test_anonymize.py (2)

188-200: Add assertion for name anonymization in lists

In TestAnonymizeDict.test_anonymize_dict_with_list, also assert that the "name" field is anonymized to catch regressions in should_anonymize_field("name").


150-183: Clarify numeric ID policy (anonymize or preserve?)

Tests currently assert non-string values are preserved. Confirm this is intentional for numeric IDs, or add coverage for deterministic numeric anonymization if required. I can propose an approach that preserves digit count.

tests/integration/test_anonymized_and_final_pages.py (2)

181-197: Strengthen assertions: raw responses should be trimmed/anonymized

When anonymized=True and final_pages_only=True with include_raw_responses_data=True, assert that:

  • Only the last 2 pages are present in raw_api_responses.
  • Sensitive headers are redacted (e.g., Authorization) and volatile headers removed.

This guards against leakage and ensures final_pages_only semantics apply to raw payloads, not just records.


116-137: Also assert reached_end metadata via logs

You already check for the presence of final_pages_only info. Add an assertion that the log explicitly contains reached_end=true/false to validate the pagination guard behavior.

connector_builder_mcp/validation_testing.py (1)

347-365: Expose result metadata (optional)

Consider adding fields to StreamTestResult: anonymized: bool, reached_end: bool | None, and salt_id: str | None. Today this is only in logs; structured fields improve downstream automation.
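
As a rough picture of the suggestion above, the three fields could be grouped as shown below. CaptureMetadata is a hypothetical stand-in: the real StreamTestResult class is not shown in this conversation, so its base class and existing fields are unknown, and a dataclass is used here only to keep the sketch self-contained.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class CaptureMetadata:
    """Hypothetical container for the metadata the review proposes to expose."""

    anonymized: bool = False
    # None means final_pages_only was not requested, so the flag does not apply.
    reached_end: Optional[bool] = None
    # None when anonymization is off; otherwise an identifier for the salt used.
    salt_id: Optional[str] = None


meta = CaptureMetadata(anonymized=True, reached_end=False, salt_id="ab12cd34")
meta_as_dict = asdict(meta)
```

Structured fields let a fixture generator refuse to write a fixture when reached_end is False, instead of parsing log messages for that string.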

connector_builder_mcp/anonymize.py (3)

107-129: Numeric IDs are not anonymized

anonymize_value() only transforms strings. Fields like user_id=12345 remain unchanged, which may contradict “IDs anonymized” expectations. If desired, add deterministic numeric anonymization (e.g., HMAC→int of same digit length), gated by field detection.

Example helper:

+def anonymize_int(n: int, salt: bytes, digits: int | None = None) -> int:
+    h = hmac.new(salt, str(n).encode("utf-8"), hashlib.sha256).hexdigest()
+    d = digits or len(str(n))
+    return int(str(int(h, 16)).zfill(d)[-d:])
@@
-    if isinstance(value, str):
+    if isinstance(value, str):
         if "@" in value and "." in value.split("@")[-1]:
             return anonymize_email(value, salt)
         return anonymize_string(value, salt)
+    if isinstance(value, int):
+        return anonymize_int(value, salt)
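
Two details of the helper above are worth noting before adopting it. Slicing the last d digits can yield a value with leading zeros, which int() then shortens, so the digit count is not always preserved. Also, isinstance(value, int) is true for booleans in Python, so True and False would be rewritten. The sketch below is one possible variant that addresses both; the names anonymize_int_sketch and maybe_anonymize_number are hypothetical.

```python
import hashlib
import hmac
from typing import Any


def anonymize_int_sketch(n: int, salt: bytes) -> int:
    """Map n to a deterministic integer with the same sign and digit count."""
    digits = len(str(abs(n)))
    digest = hmac.new(salt, str(n).encode("utf-8"), hashlib.sha256).digest()
    raw = int.from_bytes(digest, "big")
    if digits == 1:
        magnitude = raw % 10
    else:
        low = 10 ** (digits - 1)
        # Range [low, 10*low - 1] guarantees a non-zero leading digit.
        magnitude = low + raw % (9 * low)
    return -magnitude if n < 0 else magnitude


def maybe_anonymize_number(value: Any, salt: bytes) -> Any:
    """Anonymize plain ints only; bool is a subclass of int and is left alone."""
    if isinstance(value, bool) or not isinstance(value, int):
        return value
    return anonymize_int_sketch(value, salt)
```

Like the reviewer's version, this would need to be gated by field detection so that counts and page numbers are not rewritten.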

160-201: Unused parameter ‘salt’ in headers/query helpers

salt isn’t used in anonymize_headers/anonymize_query_params. Either remove it or mark as intentionally unused (e.g., _salt: bytes) to avoid confusion.
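
Redaction with a fixed placeholder needs no salt, which is likely why the parameter goes unused. A minimal sketch of salt-free redaction follows. The function names, the header and query key sets, and the REDACTED placeholder are assumptions; the PR only states that Authorization, Cookie, and API key headers and query parameters are redacted.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed lists; the actual module may cover more or different names.
SENSITIVE_HEADERS = {"authorization", "cookie", "set-cookie", "x-api-key"}
SENSITIVE_QUERY_KEYS = {"api_key", "access_token", "token"}
REDACTED = "REDACTED"


def redact_headers(headers: dict) -> dict:
    """Return a copy with sensitive header values replaced (case-insensitive match)."""
    return {
        name: (REDACTED if name.lower() in SENSITIVE_HEADERS else value)
        for name, value in headers.items()
    }


def redact_query(url: str) -> str:
    """Return the URL with sensitive query parameter values replaced."""
    parts = urlsplit(url)
    pairs = [
        (key, REDACTED if key.lower() in SENSITIVE_QUERY_KEYS else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(pairs)))


safe_headers = redact_headers({"Authorization": "Bearer abc", "Accept": "application/json"})
safe_url = redact_query("https://api.example.com/v1/items?page=2&api_key=abc")
```

If the helpers were later changed to hash values instead of replacing them, the salt would become necessary, which is an argument for keeping the parameter but marking it as intentionally unused.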


88-105: Extend field heuristics (optional)

Consider adding patterns like uuid, guid, token, apikey, key$, secret$ to reduce misses. Keep anchored variants to avoid false positives.
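
The point about anchoring can be made concrete with a short sketch. The pattern list and the name looks_sensitive are hypothetical and are not the contents of should_anonymize_field; they only show how anchoring on a word boundary (start of name or underscore) and on the end of the name avoids matching fields such as valid or monkey.

```python
import re

# Hypothetical patterns, applied to the lowercased field name.
FIELD_PATTERNS = [
    re.compile(r"(^|_)id$"),
    re.compile(r"(^|_)(uuid|guid)$"),
    re.compile(r"email"),
    re.compile(r"(^|_)name$"),
    re.compile(r"(^|_)(token|secret|api_?key)$"),
]


def looks_sensitive(field_name: str) -> bool:
    """Return True when the field name matches any sensitive-name pattern."""
    name = field_name.lower()
    return any(pattern.search(name) for pattern in FIELD_PATTERNS)


should_match = ["id", "user_id", "email", "first_name", "api_key", "apikey", "access_token"]
should_not_match = ["valid", "monkey", "filename", "tokenizer", "page"]
```

Each anchored pattern still misses camelCase names such as userId, so a real implementation would probably normalize names before matching.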

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b2164c2 and 3af39c8.

📒 Files selected for processing (4)
  • connector_builder_mcp/anonymize.py (1 hunks)
  • connector_builder_mcp/validation_testing.py (4 hunks)
  • tests/integration/test_anonymized_and_final_pages.py (1 hunks)
  • tests/unit/test_anonymize.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
tests/unit/test_anonymize.py (1)
connector_builder_mcp/anonymize.py (11)
  • anonymize_dict (132-157)
  • anonymize_email (60-76)
  • anonymize_headers (160-202)
  • anonymize_http_interaction (240-277)
  • anonymize_query_params (205-237)
  • anonymize_records (280-290)
  • anonymize_string (43-57)
  • anonymize_value (107-129)
  • get_anonymization_salt (14-29)
  • get_salt_id (32-40)
  • should_anonymize_field (79-104)
connector_builder_mcp/validation_testing.py (3)
connector_builder_mcp/anonymize.py (4)
  • anonymize_http_interaction (240-277)
  • anonymize_records (280-290)
  • get_salt_id (32-40)
  • get_anonymization_salt (14-29)
connector_builder_mcp/_util.py (6)
  • as_bool (183-203)
  • as_dict (208-211)
  • as_dict (213-215)
  • as_dict (217-219)
  • as_dict (222-242)
  • parse_manifest_input (82-139)
connector_builder_mcp/secrets.py (1)
  • hydrate_config (265-295)
tests/integration/test_anonymized_and_final_pages.py (2)
connector_builder_mcp/validation_testing.py (2)
  • StreamTestResult (61-77)
  • execute_stream_test_read (306-567)
tests/conftest.py (1)
  • resources_path (9-11)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)

Comment on lines +487 to +508
reached_end = True
if final_pages_only:
    total_pages = len(all_pages)
    if total_pages >= MAX_PAGES_SCAN:
        reached_end = False
        execution_logs.append(
            {
                "level": "WARNING",
                "message": f"Reached max pages scan limit ({MAX_PAGES_SCAN}). May not have captured true final pages.",
            }
        )

    if total_pages > 2:
        logger.info(f"final_pages_only: keeping last 2 of {total_pages} pages")
        all_pages = all_pages[-2:]

    execution_logs.append(
        {
            "level": "INFO",
            "message": f"final_pages_only mode: captured {len(all_pages)} final pages out of {total_pages} total pages. reached_end={reached_end}",
        }
    )

🛠️ Refactor suggestion | 🟠 Major

final_pages_only not reflected in returned raw payload

You compute all_pages[-2:] for records but do not trim the returned raw_api_responses. This defeats the “final pages only” promise and increases payload size.

Apply this diff to trim raw output as well and surface reached_end info in logs (already added) and optionally as a result field (see next comment):

@@
-        if total_pages > 2:
+        if total_pages > 2:
             logger.info(f"final_pages_only: keeping last 2 of {total_pages} pages")
             all_pages = all_pages[-2:]
 
         execution_logs.append(
             {
                 "level": "INFO",
                 "message": f"final_pages_only mode: captured {len(all_pages)} final pages out of {total_pages} total pages. reached_end={reached_end}",
             }
         )
+        # Reflect trimming in raw output if we return it
+        try:
+            # Flatten into a minimal structure to avoid returning thousands of pages
+            stream_data["slices"] = [{"pages": all_pages}]
+        except Exception:
+            # Best-effort; do not fail the run on trimming issues
+            logger.exception("Failed to trim raw slices to final pages")
🤖 Prompt for AI Agents
In connector_builder_mcp/validation_testing.py around lines 487-508,
final_pages_only is applied to all_pages but raw_api_responses isn't trimmed, so
the returned raw payload still contains all pages; update the code to mirror the
all_pages trimming for raw_api_responses (e.g., if final_pages_only and
total_pages > 2, set raw_api_responses = raw_api_responses[-2:]) and ensure
reached_end is included in the execution_logs (already present) and also added
to the function's returned result dict (e.g., result['reached_end'] =
reached_end) so callers can programmatically see whether the true end was
reached.
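
One way to keep the records view and the raw view consistent is to compute the trimmed page list and the reached_end flag in a single helper and reuse the result for both. The sketch below assumes pages are collected into a flat list; keep_final_pages and its parameters are hypothetical names, while MAX_PAGES_SCAN and the two-page window come from the PR.

```python
from typing import Any

MAX_PAGES_SCAN = 10_000  # safety limit named in the PR
FINAL_PAGES_TO_KEEP = 2  # "last 2 pages" per the PR description


def keep_final_pages(
    pages: list,
    max_scan: int = MAX_PAGES_SCAN,
    keep: int = FINAL_PAGES_TO_KEEP,
) -> tuple:
    """Return (last `keep` pages, reached_end) for an already-collected page list."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    # Hitting the scan limit means pagination may have been cut short.
    reached_end = len(pages) < max_scan
    return pages[-keep:], reached_end


all_pages: list = [{"page": i} for i in range(5)]
final_pages, reached_end = keep_final_pages(all_pages)
# The same trimmed list would then back both the records and the raw payload.
stream_data: dict = {"slices": [{"pages": final_pages}]}
```

Slicing with a negative index returns the whole list when fewer than keep pages exist, so single-page streams need no special case.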

Comment on lines +515 to +533
if anonymized:
    try:
        logger.info("Anonymizing records and raw responses")
        records_data = anonymize_records(records_data)

        for slice_obj in slices:
            if isinstance(slice_obj, dict) and "pages" in slice_obj:
                for page in slice_obj["pages"]:
                    if isinstance(page, dict):
                        if "request" in page or "response" in page:
                            anonymized_interaction = anonymize_http_interaction(page)
                            page.update(anonymized_interaction)

        execution_logs.append(
            {
                "level": "INFO",
                "message": f"Applied anonymization with salt_id={get_salt_id()}",
            }
        )

🛠️ Refactor suggestion | 🟠 Major

Anonymized raw payload may not be returned

You anonymize in-memory slices but don’t ensure the returned raw_api_responses reflect those changes. After anonymization, reassign sanitized pages back into stream_data.

Apply this diff:

@@
         try:
             logger.info("Anonymizing records and raw responses")
             records_data = anonymize_records(records_data)
 
             for slice_obj in slices:
                 if isinstance(slice_obj, dict) and "pages" in slice_obj:
                     for page in slice_obj["pages"]:
                         if isinstance(page, dict):
                             if "request" in page or "response" in page:
                                 anonymized_interaction = anonymize_http_interaction(page)
                                 page.update(anonymized_interaction)
 
+            # Ensure anonymized slices are what we return in raw_api_responses
+            stream_data["slices"] = slices
🤖 Prompt for AI Agents
connector_builder_mcp/validation_testing.py around lines 515 to 533, you
anonymize in-memory slices/pages but never write those sanitized pages back into
the stream_data/raw_api_responses returned to callers; after you update each
page in slices, replace the corresponding entries in
stream_data["raw_api_responses"] (or the variable holding the original API
responses) with the anonymized pages so the function returns the sanitized
payloads (iterate matching slices to raw responses and reassign the modified
page dicts, then append the execution log as already done).

@aaronsteers aaronsteers marked this pull request as draft October 29, 2025 06:51

Labels

enhancement New feature or request

2 participants