fix(cloud): Handle corrupted resources gracefully in list operations#898
fix(cloud): Handle corrupted resources gracefully in list operations#898Aaron ("AJ") Steers (aaronsteers) wants to merge 2 commits intomainfrom
Conversation
When listing connections, sources, or destinations, if the workspace contains corrupted resources (e.g., references to deleted secrets), the API may return a 500 error. Previously this would cause the entire list operation to fail. Now, when a specific 'Secret reference does not exist' error is detected, the list methods will log a warning and return an empty list instead of raising an exception. This allows operations like deploy_destination (which uses list_destinations for uniqueness checks) to continue working even when the workspace contains corrupted resources. Other API errors are still raised normally to avoid masking genuine issues. Co-Authored-By: AJ Steers <aj@airbyte.io>
Original prompt from AJ Steers |
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
👋 Greetings, Airbyte Team Member!Here are some helpful tips and reminders for your convenience. Testing This PyAirbyte VersionYou can test this version of PyAirbyte using the following: # Run PyAirbyte CLI from this branch:
uvx --from 'git+https://github.com/airbytehq/PyAirbyte.git@devin/1765244433-investigate-flaky-ci-test' pyairbyte --help
# Install PyAirbyte from this branch for development:
pip install 'git+https://github.com/airbytehq/PyAirbyte.git@devin/1765244433-investigate-flaky-ci-test'Helpful ResourcesPR Slash CommandsAirbyte Maintainers can execute the following slash commands on your PR:
Community SupportQuestions? Join the #pyairbyte channel in our Slack workspace. |
Tests verify that: - _is_corrupted_resource_error correctly identifies the secret reference error pattern - list_destinations, list_sources, list_connections return empty list on corrupted resource - Other SDKErrors are still raised normally Co-Authored-By: AJ Steers <aj@airbyte.io>
📝 WalkthroughWalkthroughAdds private pattern-based detection for corrupted workspace resource errors and updates list operations in Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (2)
airbyte/cloud/workspaces.py (2)
59-76: Make_is_corrupted_resource_errora bit more defensive and resilient to message format changes?The helper looks good and nicely scoped, but a couple of small hardening tweaks might make it more future‑proof:
- It assumes
SDKErroralways has a.bodyattribute; if that ever changes, this could raiseAttributeErroron the error path.- Matching is case‑sensitive and only looks at the stringified body. If the API tweaks capitalization or nests the message differently, we might silently stop recognizing these “corrupted resource” errors.
Would you be open to something like this to be a bit more defensive and case‑insensitive, while keeping the same semantics (all patterns must be present), wdyt?
-def _is_corrupted_resource_error(error: SDKError) -> bool: +def _is_corrupted_resource_error(error: SDKError) -> bool: @@ - error_body = str(error.body) if error.body else "" - return all(pattern in error_body for pattern in _CORRUPTED_RESOURCE_ERROR_PATTERNS) + body = getattr(error, "body", "") or "" + error_body = str(body).lower() + return all(pattern.lower() in error_body for pattern in _CORRUPTED_RESOURCE_ERROR_PATTERNS)And separately, would it make sense to add a small unit test that feeds a fake
SDKErrorwith a representative.bodyinto this function to lock in the pattern and guard against regressions, wdyt?
470-501: Confirm the empty‑list fallback semantics and exception type, and optionally DRY/log this pattern?The new try/except blocks around
api_util.list_connections/sources/destinationsare a nice targeted mitigation to keep higher‑level flows running when the workspace has corrupted resources.A few points I’d love your take on:
Behavior change / uniqueness checks
When a corrupted resource triggers the specific
SDKError, these methods now return[]instead of raising. That means flows relying on uniqueness checks (e.g.,deploy_destinationcallinglist_destinations(name=...)) can now proceed and potentially create “duplicates” if the workspace is in a corrupted state.Given the PR description, this seems intentional, but can you confirm you’re comfortable with:
- treating “workspace contains at least one corrupted resource” as “treat lists as empty” for these three methods, and
- the fact that consumers won’t be able to distinguish “no resources exist” from “resources exist but listing is blocked by corruption”?
If you’d like, we could consider an opt‑in flag or a dedicated warning type in a follow‑up so callers can choose stricter behavior, wdyt?
Exception type alignment
These handlers currently catch
SDKError. From the surrounding context (airbyte/_util/api_util.py), it looks like the list helpers may historically have raisedAirbyteError. Can you double‑check that, on this branch,api_util.list_connections/list_sources/list_destinationsare in fact surfacing anSDKErrorwith the populated.bodyyou’re matching against (and not wrapping it in another exception type), wdyt?Duplication and logging (optional polish)
The three blocks are nearly identical and each uses
- factoring the try/except into a small helper like
_safe_list_resources(func, ...)that applies the corrupted‑resource handling, and- switching the
print(...)calls to a module‑level logger (e.g.logger.warning(...)) for better integration with host applications’ logging,or do you prefer the current explicit duplication for readability here, wdyt?
Also applies to: 511-542, 552-583
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
airbyte/cloud/workspaces.py(5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
airbyte/cloud/workspaces.py (1)
airbyte/_util/api_util.py (3)
list_connections(133-181)list_sources(233-278)list_destinations(281-330)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: Pytest (All, Python 3.11, Windows)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Windows)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (No Creds)
- GitHub Check: Pytest (Fast)
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
tests/unit_tests/test_cloud_workspace_error_handling.py (1)
78-107: Consider verifying the warning output?The test correctly verifies that an empty list is returned when a corrupted resource error occurs. However, the implementation also prints a warning message (using
print()in the actual code). Would it be valuable to verify this warning is emitted, perhaps using pytest'scapsysfixture? This could help ensure users receive proper feedback about the issue, wdyt?Example:
def test_list_operations_return_empty_on_corrupted_resource( - list_method: str, api_mock_path: str + list_method: str, api_mock_path: str, capsys ) -> None: """List operations should return empty list when corrupted resource error occurs.""" # ... existing setup ... method = getattr(workspace, list_method) result = method() assert result == [] + + captured = capsys.readouterr() + assert "Warning: Failed to list" in captured.out + assert "corrupted resource" in captured.out
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/unit_tests/test_cloud_workspace_error_handling.py(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: yohannj
Repo: airbytehq/PyAirbyte PR: 716
File: airbyte/logs.py:384-402
Timestamp: 2025-07-11T19:53:44.427Z
Learning: In the PyAirbyte project, when reviewing PRs, maintain clear separation of concerns. Don't suggest changes that are outside the scope of the PR's main objective, even if they would improve consistency or fix other issues. This helps with reviewing changes and potential reverts.
🧬 Code graph analysis (1)
tests/unit_tests/test_cloud_workspace_error_handling.py (1)
airbyte/cloud/workspaces.py (2)
_is_corrupted_resource_error(67-76)CloudWorkspace(86-744)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.11, Windows)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (No Creds)
🔇 Additional comments (3)
tests/unit_tests/test_cloud_workspace_error_handling.py (3)
1-13: LGTM! Clean test setup.The imports are well-organized and appropriate for testing the error handling behavior. Directly testing the private
_is_corrupted_resource_errorfunction is a good practice for unit tests.
15-56: Excellent pattern matching test coverage!This parametrized test thoroughly validates the error detection logic with positive, negative, partial match, and edge cases. The test cases are well-designed and the descriptive IDs make the test output clear.
109-154: Great defensive test!This test ensures that only the specific corrupted resource errors are caught and handled gracefully, while other API errors continue to propagate normally. This is important for maintaining proper error visibility and preventing silent failures.
| try: | ||
| connections = api_util.list_connections( | ||
| api_root=self.api_root, | ||
| workspace_id=self.workspace_id, | ||
| name=name, | ||
| name_filter=name_filter, | ||
| client_id=self.client_id, | ||
| client_secret=self.client_secret, | ||
| ) | ||
| except SDKError as ex: |
There was a problem hiding this comment.
Wrong. Needs to be handled on the inner loop so all valid objects are still returned. It is not sufficient to explain why we are failing. We need to not fail. Try again from scratch.
There was a problem hiding this comment.
You're right - returning an empty list isn't acceptable. I investigated the API layer and found the constraint:
The SDKError is thrown by airbyte_api.destinations.list_destinations before we receive any item data. The API fails the entire request when it tries to serialize a destination with a corrupted secret reference.
Options I see:
-
Per-page handling with binary search fallback: Move error handling into
api_util.list_destinationsat the pagination level. When a page fails with the secret error, use binary search on the offset range to isolate the corrupted item(s) and return all valid items from that page. This gives true "inner loop" semantics. -
Alternative endpoint fallback: If there's a Config API endpoint that can list destination IDs without resolving secret-backed configs, we could use that as a fallback to enumerate IDs, then call
get_destinationper-item with inner-loop error handling. -
Backend fix: Address this in the backend so the list API doesn't fail entirely when one item is corrupted.
Which approach would you prefer? Option 1 is the most self-contained client-side fix but adds complexity. Happy to implement whichever you think is best.
|
Closing. No good solution available. If thesis is correct that the 'list' endpoint implodes if anything within it is corrupted, then an upstream fix is needed. |
Summary
When a workspace contains corrupted resources (e.g., destinations with references to deleted secrets), the Airbyte API returns a 500 error during list operations. Previously, this caused
list_connections,list_sources, andlist_destinationsto fail entirely, breaking higher-level operations likedeploy_destination(which useslist_destinationsfor uniqueness checks).This PR adds targeted error handling that detects the specific "Secret reference does not exist but is referenced in the config" error pattern and returns an empty list with a warning instead of raising an exception. Other API errors are still raised normally.
Context: This was discovered during CI investigation where
test_deploy_destinationfailed intermittently due to a corrupted destination in the test workspace.Review & Testing Checklist for Human
deploy_destination, this means the uniqueness check will pass (potentially allowing duplicates). Consider if this is the desired behavior or if a different approach is needed."Secret reference"AND"does not exist but is referenced in the config"). Confirm this matches the actual error format from the API.Suggested test plan:
Notes
Summary by CodeRabbit
Bug Fixes
Tests
✏️ Tip: You can customize this high-level summary in your review settings.