fix(source-github): prevent blocking sleep during rate limit handling#74758
fix(source-github): prevent blocking sleep during rate limit handling#74758devin-ai-integration[bot] wants to merge 10 commits intomasterfrom
Conversation
Replace single long blocking time.sleep() with chunked 60-second intervals during rate limit wait periods. This prevents the platform heartbeat from timing out when the connector waits for GitHub API rate limits to reset. Also fix FailureType from config_error to transient_error for rate limit errors, since rate limits are temporary conditions, not configuration issues. Resolves airbytehq/oncall#11614 Co-Authored-By: bot_apk <apk@cognition.ai>
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
👋 Greetings, Airbyte Team Member!Here are some helpful tips and reminders for your convenience. 💡 Show Tips and TricksPR Slash CommandsAirbyte Maintainers (that's you!) can execute the following slash commands on your PR:
📚 Show Repo GuidanceHelpful Resources
|
Co-Authored-By: bot_apk <apk@cognition.ai>
|
Deploy preview for airbyte-docs ready! ✅ Preview Built with commit 8d2e7cd. |
|
|
↪️ Triggering Reason: Draft PR with CI passing. Ready for regression validation and live testing. |
|
Fix Validation EvidenceOutcome: Fix/Feature Proven Successfully (regression tests) | Live testing pending approval Evidence SummaryRegression tests (SPEC, CHECK, DISCOVER, READ) all passed with no regressions vs baseline v2.1.14. The fix replaces a single blocking Live connection testing against the customer connection from the oncall issue is pending Slack approval for version pinning. Next Steps
Connector & PR DetailsConnector: Evidence PlanProving Criteria
Disproving Criteria
Cases Attempted
Pre-flight Checks
Detailed Evidence Log2026-03-13 11:53 UTC — Initial status comment posted, pre-flight checks started Note: Connection IDs and detailed logs are recorded in the linked private issue. Devin Session: https://app.devin.ai/sessions/515bfef5c6ec4515a97d359c702b0234 |
|
|
↪️ Triggering Reason: Prove-fix passed (regression tests all green). PR marked as ready-for-review. Ready for final AI review gate. |
Reviewing PR for connector safety and quality.
|
|
AI PR Review in progress. Gathering evidence and evaluating gates now. Session: https://app.devin.ai/sessions/cc6ab7c7ff3b44138ac6060d993aadce |
AI PR Review ReportReview Action: NO ACTION (NOT ELIGIBLE)
📋 PR Details & EligibilityConnector & PR InfoConnector(s): Auto-Approve EligibilityEligible: No Review Action DetailsNO ACTION (NOT ELIGIBLE) — All enforced gates pass, but the Behavioral Changes anti-pattern gate is flagged. This PR changes runtime behavior (sleep chunking, error type reclassification) which requires human sign-off. No PR review submitted.
🔍 Gate Evaluation DetailsGate-by-Gate Analysis
Behavioral Changes DetailChange 1 — Chunked sleep (
Change 2 — Error classification (
Breaking Change EvaluationEvaluated against the breaking change checklist:
Conclusion: NOT a breaking change. PATCH version bump (2.1.14 → 2.1.15) is correct. Progressive rollout is disabled ( 📚 Evidence ConsultedEvidence
❓ How to RespondBehavioral Changes — Human Sign-Off RequiredThe Behavioral Changes gate is flagged because this PR changes:
These are intentional bug fixes. A human reviewer should verify:
Minor HousekeepingThe changelog entry at Providing Context or JustificationYou can add explanations that the bot will see on the next review: Option 1: PR Description (recommended) ## AI PR Review Justification
### Behavioral Changes
[Your explanation here]Option 2: PR Comment After adding your response, re-run Note: Justifications provide context for the bot to evaluate. For the Behavioral Changes gate, justifications help explain the situation but still require human sign-off. |
|
↪️ Triggering Reason: Draft PR with prove-fix passed (regression tests all green). Ready for AI review before merge consideration. |
Reviewing PR for connector safety and quality.
|
|
AI PR Review in progress. Gathering evidence and evaluating gates now. Session: https://app.devin.ai/sessions/9900985c80f4426ca43c5a2a6ab01ab0 |
AI PR Review ReportReview Action: NO ACTION (NOT ELIGIBLE)
📋 PR Details & EligibilityConnector & PR InfoConnector(s): Auto-Approve EligibilityEligible: No Review Action DetailsNO ACTION (NOT ELIGIBLE) -- All enforced gates pass, but the Behavioral Changes anti-pattern gate is flagged. This PR changes runtime behavior (sleep chunking, error type reclassification) which requires human sign-off. No PR review submitted.
🔍 Gate Evaluation DetailsGate-by-Gate Analysis
Behavioral Changes DetailChange 1 -- Chunked sleep (
Change 2 -- Error classification (
Breaking Change EvaluationEvaluated against the breaking change checklist:
Conclusion: NOT a breaking change. PATCH version bump (2.1.14 -> 2.1.15) is correct. 📚 Evidence ConsultedEvidence
❓ How to RespondBehavioral Changes -- Human Sign-Off RequiredThe Behavioral Changes gate is flagged because this PR changes:
These are intentional bug fixes. A human reviewer should verify:
Minor HousekeepingThe changelog entry at Providing Context or JustificationYou can add explanations that the bot will see on the next review: Option 1: PR Description (recommended) ## AI PR Review Justification
### Behavioral Changes
[Your explanation here]Option 2: PR Comment After adding your response, re-run Note: Justifications provide context for the bot to evaluate. For the Behavioral Changes gate, justifications help explain the situation but still require human sign-off. |
…before rate limit exhaustion Instead of draining all tokens to zero and then blocking with a long sleep, inject small proportional delays once every token's remaining quota drops below a configurable reserve (default: 50 calls or 10% of limit). This spreads remaining calls over the reset window and reduces the chance of hitting the wall entirely. Also reclassifies rate-limit exhaustion as transient_error (not config_error) since it is a temporary condition.
|
🙋 Escalating per Hands-Free AI Triage Project triage. Reason: PR has prove-fix passed and AI review completed twice, but the |
|
/format-fix
|
|
/publish-connectors-prerelease
|
|
|
/publish-connectors-prerelease
|
…heartbeat sleep - _get_budget_reserve() now uses token's actual remaining count instead of ignoring its parameters - _budget_logged resets after rate limit window rolls over via check_all_tokens() - Exhaustion sleep is now chunked into 60s intervals with periodic log messages so the platform heartbeat stays alive - Updated test to verify chunked sleep behavior
|
/format-fix
|
|
/ai-prove-fix
|
Fix Validation EvidenceOutcome: Fix/Feature Proven Successfully Evidence SummaryTested on 1 customer connection that was previously experiencing rate limit exhaustion and heartbeat timeout failures. After applying the fix ( Additionally, all regression tests passed (SPEC, CHECK, DISCOVER, READ) with no regressions vs baseline v2.1.14. Next Steps
Connector & PR DetailsConnector: Evidence PlanProving Criteria
Disproving Criteria
Cases Attempted
Pre-flight Checks
Detailed Evidence Log
Regression Test Workflow: https://github.com/airbytehq/airbyte-ops-mcp/actions/runs/23642845435 Note: Connection IDs and detailed logs are recorded in the linked private issue. |
|
|
↪️ Triggering Reason: PR has prove-fix evidence (regression tests passed). No |
Reviewing PR for connector safety and quality.
|
|
AI PR Review in progress. Gathering evidence and evaluating gates now. Session: https://app.devin.ai/sessions/b21aa318270941858dc52384dac2463e |
AI PR Review ReportReview Action: NO ACTION (NOT ELIGIBLE)
📋 PR Details & EligibilityConnector & PR InfoConnector(s): Auto-Approve EligibilityEligible: No Review Action DetailsNO ACTION (NOT ELIGIBLE) — All enforced gates pass, but the Behavioral Changes anti-pattern gate is flagged. This PR changes runtime behavior (sleep chunking, error type reclassification, new budget throttling) which requires human sign-off. No PR review submitted.
🔍 Gate Evaluation DetailsGate-by-Gate Analysis
Behavioral Changes DetailChange 1 — Chunked sleep (
Change 2 — Error classification (
Change 3 — API budget throttling (
Breaking Change EvaluationEvaluated against the breaking change checklist:
Conclusion: NOT a breaking change. PATCH version bump (2.1.14 → 2.1.15) is correct. 📚 Evidence ConsultedEvidence
❓ How to RespondBehavioral Changes — Human Sign-Off RequiredThe Behavioral Changes gate is flagged because this PR changes:
These are intentional bug fixes and enhancements. A human reviewer should verify:
Providing Context or JustificationYou can add explanations that the bot will see on the next review: Option 1: PR Description (recommended) ## AI PR Review Justification
### Behavioral Changes
[Your explanation here]Option 2: PR Comment After adding your response, re-run Note: Justifications provide context for the bot to evaluate. For the Behavioral Changes gate, justifications help explain the situation but still require human sign-off. |
…over GitHub rate limit window Co-Authored-By: gl_serhii.lazebnyi <serglazebny@gmail.com>
What
Resolves https://github.com/airbytehq/oncall/issues/11614:
When all GitHub API tokens hit rate limits, the connector sleeps in a single blocking
time.sleep()call for the entire wait duration. This blocks all output, causing the platform's heartbeat mechanism to consider the connector dead and terminate the sync.Additionally, the old default
max_waiting_timeof 10 minutes was well below GitHub's 60-minute rate limit reset window. This meant the chunked sleep path was almost never reachable — the connector would immediately raise an exception instead of waiting for the reset.Rate limit exhaustion was also classified as
FailureType.config_error, which is incorrect — rate limits are a transient condition, not a configuration problem.How
Chunked sleep (
utils.py): Replaced the singletime.sleep(min_time_to_wait)call with_sleep_with_heartbeat(), which sleeps in 60-second intervals and emits a log line between each interval. This keeps the platform heartbeat alive during long rate-limit waits.API budget throttle (
utils.py): Added a proactive throttling mechanism that injects small delays when all tokens drop below a reserve threshold (10% of quota or 50 calls minimum). This spreads remaining calls over the reset window to reduce the chance of full exhaustion.Increased default
max_waiting_time(spec.json,source.py,utils.py): Bumped the default from 10 minutes → 120 minutes and the spec maximum from 60 → 240. GitHub rate limits reset every 60 minutes, so the old 10-minute default meant the connector would always fail immediately rather than wait for a reset. The new 120-minute default ensures the connector can survive a full rate-limit cycle.Error classification (
streams.py): ChangedFailureType.config_error→FailureType.transient_errorforGitHubAPILimitException. Also cleaned up the user-facing error message (removed embedded URL and remediation language).Test updates: Adjusted assertions to match the new chunked-sleep behavior, updated expected
FailureType, and added tests for the budget throttle mechanism.Review guide
source_github/utils.py— core changes:_sleep_with_heartbeat(),_apply_budget_throttle(),_get_budget_reserve(), and the_max_timedefault increasesource_github/spec.json— default/maximum/description changes formax_waiting_timesource_github/source.py— default fallback updated from 10 → 120source_github/streams.py— error classification and message changesunit_tests/test_multiple_token_authenticator.py— updated test assertions and new budget throttle testsKey things to verify in review:
max_waiting_timeincrease (10 min → 120 min): Existing connections that relied on the old 10-minute default will now wait up to 2 hours when rate-limited instead of failing fast. Is this the desired behavior for all users, or should the default be lower (e.g. 60 min)?_sleep_with_heartbeatis sufficient to satisfy the platform heartbeat (vs. needing to emit records). The v2.1.14 fix forpull_request_statsused a similar pattern.FailureType.transient_error— does the platform retry differently for transient vs config errors? This means the platform may auto-retry on rate limit exhaustion._budget_loggedflag only resets oncheck_all_tokens()(after exhaustion sleep completes). During normal throttled operation, the log message appears only once per connector lifetime.getattr/setattrfor dynamic attribute access (count_restvscount_graphql) — this is inherited from the existing pattern inprocess_token.User Impact
Syncs with many configured repositories (or other configurations that exhaust all token rate limits) should no longer time out during rate-limit waits. The connector will log progress periodically and resume once limits reset, rather than silently blocking and getting killed by the platform.
Behavior change: The default
max_waiting_timeincreases from 10 to 120 minutes. Existing connections using the default will now wait longer before failing on rate limits. This is intentional — the old default was too low to ever allow waiting through a GitHub rate limit reset cycle.Rate limit failures that do exceed
max_waiting_timewill now surface as transient errors instead of config errors, which more accurately reflects the nature of the problem.Can this PR be safely reverted and rolled back?
No state, config, or schema changes. The fix is purely behavioral (sleep chunking, error classification, and default tuning). Reverting restores the previous defaults and behavior.
Link to Devin session: https://app.devin.ai/sessions/d857c90514c549fda9c170dbac70633e
Requested by: Serhii Lazebnyi (@lazebnyi)