-
Notifications
You must be signed in to change notification settings - Fork 2
feat: add anonymized and final_pages_only parameters to execute_stream_test_read #158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: add anonymized and final_pages_only parameters to execute_stream_test_read #158
Conversation
…m_test_read - Add anonymize.py module with deterministic HMAC-SHA256 anonymization - Add anonymized parameter to anonymize raw responses and records - Add final_pages_only parameter to capture only last 2 pages - Implement safety guard (MAX_PAGES_SCAN = 10,000 pages) - Add comprehensive unit tests for anonymization module (35 tests) - Add integration tests for new parameters (13 tests) - All tests passing with proper type checking The anonymized parameter requires MOCK_ANON_SALT environment variable for reproducible anonymization. The final_pages_only parameter forces full pagination but only returns the final 2 pages in output, useful for creating mock server test fixtures from real API data. Co-Authored-By: AJ Steers <[email protected]>
Original prompt from AJ Steers |
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
📝 WalkthroughWalkthroughIntroduces deterministic, salt-based anonymization for test data and HTTP interactions through a new module. Integrates anonymization into the test reading workflow via new Changes
Sequence DiagramsequenceDiagram
participant User
participant execute_stream_test_read
participant Salt Management
participant Test Execution
participant Anonymization
participant Result
User->>execute_stream_test_read: Call with anonymized=True
alt anonymized enabled
execute_stream_test_read->>Salt Management: get_anonymization_salt()
alt MOCK_ANON_SALT missing
Salt Management-->>Result: ValueError
Result-->>User: Error result
else MOCK_ANON_SALT present
Salt Management-->>execute_stream_test_read: salt bytes + salt_id
execute_stream_test_read->>Test Execution: Run test read
Test Execution-->>execute_stream_test_read: records + pages
execute_stream_test_read->>Anonymization: anonymize_records()
execute_stream_test_read->>Anonymization: anonymize_http_interaction()
Anonymization-->>execute_stream_test_read: anonymized data
execute_stream_test_read->>Result: Return with anonymized data + salt_id log
Result-->>User: Success with anonymization
end
else anonymized disabled
execute_stream_test_read->>Test Execution: Run test read
Test Execution-->>Result: records + pages (unmodified)
Result-->>User: Success without anonymization
end
alt final_pages_only enabled
execute_stream_test_read->>Test Execution: Set permissive limits
Test Execution-->>execute_stream_test_read: Aggregate all_pages
execute_stream_test_read->>execute_stream_test_read: Truncate to last 2 pages
execute_stream_test_read->>Result: Log page summary
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
👋 Welcome to the Airbyte Connector Builder MCP!Thank you for your contribution! Here are some helpful tips and reminders for your convenience. Testing This Branch via MCPTo test the changes in this specific branch with an MCP client like Claude Desktop, use the following configuration: {
"mcpServers": {
"connector-builder-mcp-dev": {
"command": "uvx",
"args": ["--from", "git+https://github.com/airbytehq/connector-builder-mcp.git@devin/1761628542-add-anonymized-final-pages-params", "connector-builder-mcp"]
}
}
}Testing This Branch via CLIYou can test this version of the MCP Server using the following CLI snippet: # Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/connector-builder-mcp.git@devin/1761628542-add-anonymized-final-pages-params#egg=airbyte-connector-builder-mcp' --helpPR Slash CommandsAirbyte Maintainers can execute the following slash commands on your PR:
AI Builder EvaluationsAI builder evaluations run automatically under the following conditions:
A set of standardized evaluations also run on a schedule (Mon/Wed/Fri at midnight UTC) and can be manually triggered via workflow dispatch. Helpful ResourcesIf you have any questions, feel free to ask in the PR comments or join our Slack community. |
Co-Authored-By: AJ Steers <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
connector_builder_mcp/validation_testing.py (1)
454-471: Security/privacy bug: sanitized pages not propagated to raw_api_responsesYou filter secrets into
slicesbut never write them back intostream_data, which is what gets returned inraw_api_responses. This can leak unsanitized request/response data. Write the sanitized structure back (handling theauxiliary_requestsfallback).Apply this diff:
@@ - slices = cast( - list[dict[str, Any]], - filter_config_secrets(slices_from_stream), - ) + slices = cast( + list[dict[str, Any]], + filter_config_secrets(slices_from_stream), + ) + # Ensure sanitized slices/aux requests are what we later return + if not stream_data.get("slices") and "auxiliary_requests" in stream_data: + stream_data["auxiliary_requests"] = slices + else: + stream_data["slices"] = slices
🧹 Nitpick comments (8)
tests/unit/test_anonymize.py (2)
188-200: Add assertion for name anonymization in listsIn TestAnonymizeDict.test_anonymize_dict_with_list, also assert that the "name" field is anonymized to catch regressions in should_anonymize_field("name").
150-183: Clarify numeric ID policy (anonymize or preserve?)Tests currently assert non-string values are preserved. Confirm this is intentional for numeric IDs, or add coverage for deterministic numeric anonymization if required. I can propose an approach that preserves digit count.
tests/integration/test_anonymized_and_final_pages.py (2)
181-197: Strengthen assertions: raw responses should be trimmed/anonymizedWhen anonymized=True and final_pages_only=True with include_raw_responses_data=True, assert that:
- Only the last 2 pages are present in raw_api_responses.
- Sensitive headers are redacted (e.g., Authorization) and volatile headers removed.
This guards against leakage and ensures final_pages_only semantics apply to raw payloads, not just records.
116-137: Also assert reached_end metadata via logsYou already check for the presence of final_pages_only info. Add an assertion that the log explicitly contains reached_end=true/false to validate the pagination guard behavior.
connector_builder_mcp/validation_testing.py (1)
347-365: Expose result metadata (optional)Consider adding fields to StreamTestResult:
anonymized: bool,reached_end: bool | None, andsalt_id: str | None. Today this is only in logs; structured fields improve downstream automation.connector_builder_mcp/anonymize.py (3)
107-129: Numeric IDs are not anonymizedanonymize_value() only transforms strings. Fields like user_id=12345 remain unchanged, which may contradict “IDs anonymized” expectations. If desired, add deterministic numeric anonymization (e.g., HMAC→int of same digit length), gated by field detection.
Example helper:
+def anonymize_int(n: int, salt: bytes, digits: int | None = None) -> int: + h = hmac.new(salt, str(n).encode("utf-8"), hashlib.sha256).hexdigest() + d = digits or len(str(n)) + return int(str(int(h, 16)).zfill(d)[-d:]) @@ - if isinstance(value, str): + if isinstance(value, str): if "@" in value and "." in value.split("@")[-1]: return anonymize_email(value, salt) return anonymize_string(value, salt) + if isinstance(value, int): + return anonymize_int(value, salt)
160-201: Unused parameter ‘salt’ in headers/query helpers
saltisn’t used in anonymize_headers/anonymize_query_params. Either remove it or mark as intentionally unused (e.g.,_salt: bytes) to avoid confusion.
88-105: Extend field heuristics (optional)Consider adding patterns like
uuid,guid,token,apikey,key$,secret$to reduce misses. Keep anchored variants to avoid false positives.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
connector_builder_mcp/anonymize.py(1 hunks)connector_builder_mcp/validation_testing.py(4 hunks)tests/integration/test_anonymized_and_final_pages.py(1 hunks)tests/unit/test_anonymize.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
tests/unit/test_anonymize.py (1)
connector_builder_mcp/anonymize.py (11)
anonymize_dict(132-157)anonymize_email(60-76)anonymize_headers(160-202)anonymize_http_interaction(240-277)anonymize_query_params(205-237)anonymize_records(280-290)anonymize_string(43-57)anonymize_value(107-129)get_anonymization_salt(14-29)get_salt_id(32-40)should_anonymize_field(79-104)
connector_builder_mcp/validation_testing.py (3)
connector_builder_mcp/anonymize.py (4)
anonymize_http_interaction(240-277)anonymize_records(280-290)get_salt_id(32-40)get_anonymization_salt(14-29)connector_builder_mcp/_util.py (6)
as_bool(183-203)as_dict(208-211)as_dict(213-215)as_dict(217-219)as_dict(222-242)parse_manifest_input(82-139)connector_builder_mcp/secrets.py (1)
hydrate_config(265-295)
tests/integration/test_anonymized_and_final_pages.py (2)
connector_builder_mcp/validation_testing.py (2)
StreamTestResult(61-77)execute_stream_test_read(306-567)tests/conftest.py (1)
resources_path(9-11)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
| reached_end = True | ||
| if final_pages_only: | ||
| total_pages = len(all_pages) | ||
| if total_pages >= MAX_PAGES_SCAN: | ||
| reached_end = False | ||
| execution_logs.append( | ||
| { | ||
| "level": "WARNING", | ||
| "message": f"Reached max pages scan limit ({MAX_PAGES_SCAN}). May not have captured true final pages.", | ||
| } | ||
| ) | ||
|
|
||
| if total_pages > 2: | ||
| logger.info(f"final_pages_only: keeping last 2 of {total_pages} pages") | ||
| all_pages = all_pages[-2:] | ||
|
|
||
| execution_logs.append( | ||
| { | ||
| "level": "INFO", | ||
| "message": f"final_pages_only mode: captured {len(all_pages)} final pages out of {total_pages} total pages. reached_end={reached_end}", | ||
| } | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion | 🟠 Major
final_pages_only not reflected in returned raw payload
You compute all_pages[-2:] for records but do not trim the returned raw_api_responses. This defeats the “final pages only” promise and increases payload size.
Apply this diff to trim raw output as well and surface reached_end info in logs (already added) and optionally as a result field (see next comment):
@@
- if total_pages > 2:
+ if total_pages > 2:
logger.info(f"final_pages_only: keeping last 2 of {total_pages} pages")
all_pages = all_pages[-2:]
execution_logs.append(
{
"level": "INFO",
"message": f"final_pages_only mode: captured {len(all_pages)} final pages out of {total_pages} total pages. reached_end={reached_end}",
}
)
+ # Reflect trimming in raw output if we return it
+ try:
+ # Flatten into a minimal structure to avoid returning thousands of pages
+ stream_data["slices"] = [{"pages": all_pages}]
+ except Exception:
+ # Best-effort; do not fail the run on trimming issues
+ logger.exception("Failed to trim raw slices to final pages")🤖 Prompt for AI Agents
In connector_builder_mcp/validation_testing.py around lines 487-508,
final_pages_only is applied to all_pages but raw_api_responses isn't trimmed, so
the returned raw payload still contains all pages; update the code to mirror the
all_pages trimming for raw_api_responses (e.g., if final_pages_only and
total_pages > 2, set raw_api_responses = raw_api_responses[-2:]) and ensure
reached_end is included in the execution_logs (already present) and also added
to the function's returned result dict (e.g., result['reached_end'] =
reached_end) so callers can programmatically see whether the true end was
reached.
| if anonymized: | ||
| try: | ||
| logger.info("Anonymizing records and raw responses") | ||
| records_data = anonymize_records(records_data) | ||
|
|
||
| for slice_obj in slices: | ||
| if isinstance(slice_obj, dict) and "pages" in slice_obj: | ||
| for page in slice_obj["pages"]: | ||
| if isinstance(page, dict): | ||
| if "request" in page or "response" in page: | ||
| anonymized_interaction = anonymize_http_interaction(page) | ||
| page.update(anonymized_interaction) | ||
|
|
||
| execution_logs.append( | ||
| { | ||
| "level": "INFO", | ||
| "message": f"Applied anonymization with salt_id={get_salt_id()}", | ||
| } | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion | 🟠 Major
Anonymized raw payload may not be returned
You anonymize in-memory slices but don’t ensure the returned raw_api_responses reflect those changes. After anonymization, reassign sanitized pages back into stream_data.
Apply this diff:
@@
try:
logger.info("Anonymizing records and raw responses")
records_data = anonymize_records(records_data)
for slice_obj in slices:
if isinstance(slice_obj, dict) and "pages" in slice_obj:
for page in slice_obj["pages"]:
if isinstance(page, dict):
if "request" in page or "response" in page:
anonymized_interaction = anonymize_http_interaction(page)
page.update(anonymized_interaction)
+ # Ensure anonymized slices are what we return in raw_api_responses
+ stream_data["slices"] = slices🤖 Prompt for AI Agents
connector_builder_mcp/validation_testing.py around lines 515 to 533, you
anonymize in-memory slices/pages but never write those sanitized pages back into
the stream_data/raw_api_responses returned to callers; after you update each
page in slices, replace the corresponding entries in
stream_data["raw_api_responses"] (or the variable holding the original API
responses) with the anonymized pages so the function returns the sanitized
payloads (iterate matching slices to raw responses and reassign the modified
page dicts, then append the execution log as already done).
feat: add anonymized and final_pages_only parameters to execute_stream_test_read
Summary
Adds two new parameters to the
execute_stream_test_readMCP tool to support capturing real API data for mock server test fixtures:anonymized: bool- Enables deterministic anonymization of raw API responses and records using HMAC-SHA256. RequiresMOCK_ANON_SALTenvironment variable for reproducible hashing across runs.final_pages_only: bool | None- Captures ONLY the last 2 pages of paginated results, ignoring all intermediate pages. Sets very high limits (10,000 pages max) to force full pagination, then post-processes to keep only the final 2 pages.Use Case: These parameters enable creating mock server test fixtures from real API data by (1) capturing terminal pages that demonstrate end-of-pagination behavior, and (2) anonymizing sensitive data while preserving structure for testing.
Implementation Details:
anonymize.pymodule with format-preserving anonymization (emails keep @example.com structure)reached_end,anonymized,salt_id) in outputReview & Testing Checklist for Human
This is a yellow risk change - new functionality with comprehensive tests but limited real-world validation.
anonymize.py:78-102- are these comprehensive enough? Too aggressive? Missing common sensitive field names?reached_end,salt_id, andanonymizedfields in the output actually useful? Should there be additional metadata?Recommended Test Plan:
MOCK_ANON_SALT=test_secret_123in environmentexecute_stream_test_readwithanonymized=trueandfinal_pages_only=trueon a connector with 5+ pages of resultsMOCK_ANON_SALTset - should fail gracefully with clear error messageNotes
final_pages_onlyrather than explicit parameter conflict detection - this is pragmatic but worth notingLink to Devin run: https://app.devin.ai/sessions/75f2bd231eea4960b5a76b7d14b8ddcf
Requested by: AJ Steers ([email protected]) / @aaronsteers
Summary by CodeRabbit
New Features
anonymizedparameter to enable/disable anonymization in stream test reads.final_pages_onlyparameter to focus test collection on final pages only.Tests