Skip to content

fix(cloud): Handle corrupted resources gracefully in list operations#898

Closed
Aaron ("AJ") Steers (aaronsteers) wants to merge 2 commits intomainfrom
devin/1765244433-investigate-flaky-ci-test
Closed

fix(cloud): Handle corrupted resources gracefully in list operations#898
Aaron ("AJ") Steers (aaronsteers) wants to merge 2 commits intomainfrom
devin/1765244433-investigate-flaky-ci-test

Conversation

@aaronsteers
Copy link
Member

@aaronsteers Aaron ("AJ") Steers (aaronsteers) commented Dec 9, 2025

Summary

When a workspace contains corrupted resources (e.g., destinations with references to deleted secrets), the Airbyte API returns a 500 error during list operations. Previously, this caused list_connections, list_sources, and list_destinations to fail entirely, breaking higher-level operations like deploy_destination (which uses list_destinations for uniqueness checks).

This PR adds targeted error handling that detects the specific "Secret reference does not exist but is referenced in the config" error pattern and returns an empty list with a warning instead of raising an exception. Other API errors are still raised normally.

Context: This was discovered during CI investigation where test_deploy_destination failed intermittently due to a corrupted destination in the test workspace.

Review & Testing Checklist for Human

  • Verify the empty list behavior is acceptable - When corrupted resources exist, callers will receive an empty list. For deploy_destination, this means the uniqueness check will pass (potentially allowing duplicates). Consider if this is the desired behavior or if a different approach is needed.
  • Verify error pattern matching - The detection relies on substring matching ("Secret reference" AND "does not exist but is referenced in the config"). Confirm this matches the actual error format from the API.
  • Consider adding unit tests - No unit tests were added for the new error handling. Consider whether tests should be added to lock in this behavior.

Suggested test plan:

  1. Run the integration tests to verify existing functionality still works
  2. If possible, manually test with a workspace containing a corrupted resource to verify the warning is printed and an empty list is returned

Notes

Summary by CodeRabbit

  • Bug Fixes

    • Workspace listing operations now detect corrupted workspace resources and, when encountered, log a warning and return an empty list instead of causing the listing call to fail.
  • Tests

    • Added unit tests covering detection of corrupted-resource errors and verifying listing behavior for both corrupted and unrelated errors.

✏️ Tip: You can customize this high-level summary in your review settings.

When listing connections, sources, or destinations, if the workspace
contains corrupted resources (e.g., references to deleted secrets),
the API may return a 500 error. Previously this would cause the entire
list operation to fail.

Now, when a specific 'Secret reference does not exist' error is detected,
the list methods will log a warning and return an empty list instead of
raising an exception. This allows operations like deploy_destination
(which uses list_destinations for uniqueness checks) to continue working
even when the workspace contains corrupted resources.

Other API errors are still raised normally to avoid masking genuine issues.

Co-Authored-By: AJ Steers <aj@airbyte.io>
@devin-ai-integration
Copy link
Contributor

Original prompt from AJ Steers
Received message in Slack channel #ask-devin-ai:

@Devin - Check the PyAirbyte Coral MCP tools - specifically the "check workspace" MCP tool. Can you see if it is possible for that tool to also return the organization ID that the workspace belongs to (and the name too if available)? This will help users who don't know there organization ID. While we're at it, we can return a workspace name if available as well.

If you find a path forward, go ahead and create the PR. Lmk if you need help or get stuck.
Thread URL: https://airbytehq-team.slack.com/archives/C08BHPUMEPJ/p1765226026855169

@devin-ai-integration
Copy link
Contributor

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@github-actions
Copy link

github-actions bot commented Dec 9, 2025

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This PyAirbyte Version

You can test this version of PyAirbyte using the following:

# Run PyAirbyte CLI from this branch:
uvx --from 'git+https://github.com/airbytehq/PyAirbyte.git@devin/1765244433-investigate-flaky-ci-test' pyairbyte --help

# Install PyAirbyte from this branch for development:
pip install 'git+https://github.com/airbytehq/PyAirbyte.git@devin/1765244433-investigate-flaky-ci-test'

Helpful Resources

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /fix-pr - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test-pr - Runs tests with the updated PyAirbyte

Community Support

Questions? Join the #pyairbyte channel in our Slack workspace.

📝 Edit this welcome message.

Tests verify that:
- _is_corrupted_resource_error correctly identifies the secret reference error pattern
- list_destinations, list_sources, list_connections return empty list on corrupted resource
- Other SDKErrors are still raised normally

Co-Authored-By: AJ Steers <aj@airbyte.io>
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 9, 2025

📝 Walkthrough

Walkthrough

Adds private pattern-based detection for corrupted workspace resource errors and updates list operations in airbyte/cloud/workspaces.py to catch such SDKErrors, log a warning, and return empty lists instead of propagating the error; unrelated SDKErrors are still re-raised. wdyt?

Changes

Cohort / File(s) Summary
Workspace error handling
airbyte/cloud/workspaces.py
Added _CORRUPTED_RESOURCE_ERROR_PATTERNS and _is_corrupted_resource_error(error: SDKError) -> bool; updated list_connections(), list_sources(), and list_destinations() to catch SDKError, treat matching corruption errors by logging a warning and returning [], and re-raise other errors.
Unit tests for corrupted-resource handling
tests/unit_tests/test_cloud_workspace_error_handling.py
Added tests validating _is_corrupted_resource_error() behavior across bodies, ensuring list methods return empty lists for corrupted-resource SDKErrors and re-raise for other SDKErrors; mocks __post_init__ and API calls.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Review areas to focus on:
    • Pattern list _CORRUPTED_RESOURCE_ERROR_PATTERNS for completeness and false positives/negatives.
    • Correct identification and handling in _is_corrupted_resource_error(), including None/missing body cases.
    • That list methods still re-raise non-matching SDKErrors and retain existing public behavior.
    • Tests in tests/unit_tests/test_cloud_workspace_error_handling.py for coverage and realistic mocking.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the main change: adding graceful error handling for corrupted resources in list operations, which directly matches the PR's primary objective and code changes.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch devin/1765244433-investigate-flaky-ci-test

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
airbyte/cloud/workspaces.py (2)

59-76: Make _is_corrupted_resource_error a bit more defensive and resilient to message format changes?

The helper looks good and nicely scoped, but a couple of small hardening tweaks might make it more future‑proof:

  • It assumes SDKError always has a .body attribute; if that ever changes, this could raise AttributeError on the error path.
  • Matching is case‑sensitive and only looks at the stringified body. If the API tweaks capitalization or nests the message differently, we might silently stop recognizing these “corrupted resource” errors.

Would you be open to something like this to be a bit more defensive and case‑insensitive, while keeping the same semantics (all patterns must be present), wdyt?

-def _is_corrupted_resource_error(error: SDKError) -> bool:
+def _is_corrupted_resource_error(error: SDKError) -> bool:
@@
-    error_body = str(error.body) if error.body else ""
-    return all(pattern in error_body for pattern in _CORRUPTED_RESOURCE_ERROR_PATTERNS)
+    body = getattr(error, "body", "") or ""
+    error_body = str(body).lower()
+    return all(pattern.lower() in error_body for pattern in _CORRUPTED_RESOURCE_ERROR_PATTERNS)

And separately, would it make sense to add a small unit test that feeds a fake SDKError with a representative .body into this function to lock in the pattern and guard against regressions, wdyt?


470-501: Confirm the empty‑list fallback semantics and exception type, and optionally DRY/log this pattern?

The new try/except blocks around api_util.list_connections/sources/destinations are a nice targeted mitigation to keep higher‑level flows running when the workspace has corrupted resources.

A few points I’d love your take on:

  1. Behavior change / uniqueness checks

    When a corrupted resource triggers the specific SDKError, these methods now return [] instead of raising. That means flows relying on uniqueness checks (e.g., deploy_destination calling list_destinations(name=...)) can now proceed and potentially create “duplicates” if the workspace is in a corrupted state.

    Given the PR description, this seems intentional, but can you confirm you’re comfortable with:

    • treating “workspace contains at least one corrupted resource” as “treat lists as empty” for these three methods, and
    • the fact that consumers won’t be able to distinguish “no resources exist” from “resources exist but listing is blocked by corruption”?

    If you’d like, we could consider an opt‑in flag or a dedicated warning type in a follow‑up so callers can choose stricter behavior, wdyt?

  2. Exception type alignment

    These handlers currently catch SDKError. From the surrounding context (airbyte/_util/api_util.py), it looks like the list helpers may historically have raised AirbyteError. Can you double‑check that, on this branch, api_util.list_connections/list_sources/list_destinations are in fact surfacing an SDKError with the populated .body you’re matching against (and not wrapping it in another exception type), wdyt?

  3. Duplication and logging (optional polish)

    The three blocks are nearly identical and each uses print for warnings. To keep things DRY and a bit more “library‑friendly”, would you be open to, in a follow‑up:

    • factoring the try/except into a small helper like _safe_list_resources(func, ...) that applies the corrupted‑resource handling, and
    • switching the print(...) calls to a module‑level logger (e.g. logger.warning(...)) for better integration with host applications’ logging,

    or do you prefer the current explicit duplication for readability here, wdyt?

Also applies to: 511-542, 552-583

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9de76d2 and 8b6ea3e.

📒 Files selected for processing (1)
  • airbyte/cloud/workspaces.py (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
airbyte/cloud/workspaces.py (1)
airbyte/_util/api_util.py (3)
  • list_connections (133-181)
  • list_sources (233-278)
  • list_destinations (281-330)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Pytest (All, Python 3.11, Windows)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Windows)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (No Creds)
  • GitHub Check: Pytest (Fast)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/unit_tests/test_cloud_workspace_error_handling.py (1)

78-107: Consider verifying the warning output?

The test correctly verifies that an empty list is returned when a corrupted resource error occurs. However, the implementation also prints a warning message (using print() in the actual code). Would it be valuable to verify this warning is emitted, perhaps using pytest's capsys fixture? This could help ensure users receive proper feedback about the issue, wdyt?

Example:

 def test_list_operations_return_empty_on_corrupted_resource(
-    list_method: str, api_mock_path: str
+    list_method: str, api_mock_path: str, capsys
 ) -> None:
     """List operations should return empty list when corrupted resource error occurs."""
     # ... existing setup ...
     
     method = getattr(workspace, list_method)
     result = method()
     assert result == []
+    
+    captured = capsys.readouterr()
+    assert "Warning: Failed to list" in captured.out
+    assert "corrupted resource" in captured.out
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8b6ea3e and 8ec1a32.

📒 Files selected for processing (1)
  • tests/unit_tests/test_cloud_workspace_error_handling.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: yohannj
Repo: airbytehq/PyAirbyte PR: 716
File: airbyte/logs.py:384-402
Timestamp: 2025-07-11T19:53:44.427Z
Learning: In the PyAirbyte project, when reviewing PRs, maintain clear separation of concerns. Don't suggest changes that are outside the scope of the PR's main objective, even if they would improve consistency or fix other issues. This helps with reviewing changes and potential reverts.
🧬 Code graph analysis (1)
tests/unit_tests/test_cloud_workspace_error_handling.py (1)
airbyte/cloud/workspaces.py (2)
  • _is_corrupted_resource_error (67-76)
  • CloudWorkspace (86-744)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.11, Windows)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (No Creds)
🔇 Additional comments (3)
tests/unit_tests/test_cloud_workspace_error_handling.py (3)

1-13: LGTM! Clean test setup.

The imports are well-organized and appropriate for testing the error handling behavior. Directly testing the private _is_corrupted_resource_error function is a good practice for unit tests.


15-56: Excellent pattern matching test coverage!

This parametrized test thoroughly validates the error detection logic with positive, negative, partial match, and edge cases. The test cases are well-designed and the descriptive IDs make the test output clear.


109-154: Great defensive test!

This test ensures that only the specific corrupted resource errors are caught and handled gracefully, while other API errors continue to propagate normally. This is important for maintaining proper error visibility and preventing silent failures.

Comment on lines +483 to +492
try:
connections = api_util.list_connections(
api_root=self.api_root,
workspace_id=self.workspace_id,
name=name,
name_filter=name_filter,
client_id=self.client_id,
client_secret=self.client_secret,
)
except SDKError as ex:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong. Needs to be handled on the inner loop so all valid objects are still returned. It is not sufficient to explain why we are failing. We need to not fail. Try again from scratch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right - returning an empty list isn't acceptable. I investigated the API layer and found the constraint:

The SDKError is thrown by airbyte_api.destinations.list_destinations before we receive any item data. The API fails the entire request when it tries to serialize a destination with a corrupted secret reference.

Options I see:

  1. Per-page handling with binary search fallback: Move error handling into api_util.list_destinations at the pagination level. When a page fails with the secret error, use binary search on the offset range to isolate the corrupted item(s) and return all valid items from that page. This gives true "inner loop" semantics.

  2. Alternative endpoint fallback: If there's a Config API endpoint that can list destination IDs without resolving secret-backed configs, we could use that as a fallback to enumerate IDs, then call get_destination per-item with inner-loop error handling.

  3. Backend fix: Address this in the backend so the list API doesn't fail entirely when one item is corrupted.

Which approach would you prefer? Option 1 is the most self-contained client-side fix but adds complexity. Happy to implement whichever you think is best.

@github-actions
Copy link

github-actions bot commented Dec 9, 2025

PyTest Results (Fast Tests Only, No Creds)

360 tests  +12   360 ✅ +12   5m 56s ⏱️ -26s
  1 suites ± 0     0 💤 ± 0 
  1 files   ± 0     0 ❌ ± 0 

Results for commit 8ec1a32. ± Comparison against base commit 9de76d2.

@github-actions
Copy link

github-actions bot commented Dec 9, 2025

PyTest Results (Full)

428 tests  +12   412 ✅ +12   26m 4s ⏱️ + 1m 26s
  1 suites ± 0    16 💤 ± 0 
  1 files   ± 0     0 ❌ ± 0 

Results for commit 8ec1a32. ± Comparison against base commit 9de76d2.

@aaronsteers
Copy link
Member Author

Closing. No good solution available. If thesis is correct that the 'list' endpoint implodes if anything within it is corrupted, then an upstream fix is needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant