-
Notifications
You must be signed in to change notification settings - Fork 53
fix: add configurable connection timeouts for couchbase connector #623
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Add connect_timeout_seconds and bootstrap_timeout_seconds configuration parameters to CouchbaseConnectionConfig to prevent indefinite hangs when Couchbase cluster is unreachable. Defaults to 10 seconds for both timeouts, which is sufficient for most network conditions. Users can override these values in their connector configuration if needed for slower networks.
The Couchbase SDK's ClusterTimeoutOptions are not respected, and the SDK catches standard exceptions like TimeoutError. Use signal.SIGALRM to force timeout and raise KeyboardInterrupt which cannot be easily caught. Also reduce wait_until_ready timeout to match the alarm timeout.
| # Set alarm for connection timeout (ClusterTimeoutOptions doesn't work reliably) | ||
| logger.info(f"Setting connection alarm for {self.connect_timeout_seconds} seconds") | ||
| old_handler = signal.signal(signal.SIGALRM, timeout_handler) | ||
| signal.alarm(self.connect_timeout_seconds) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Zero timeout value disables all connection timeout protection
If connect_timeout_seconds is set to 0, signal.alarm(0) cancels any pending alarm rather than setting one, completely disabling timeout protection. Additionally, wait_until_ready(timedelta(seconds=min(5, 0))) would use a 0-second timeout, causing undefined behavior. There's no validation to ensure the timeout value is positive, which means users could inadvertently disable the entire timeout protection that this PR is meant to provide.
Additional Locations (1)
The Cluster() constructor doesn't respect ClusterTimeoutOptions, but the documented Cluster.connect() method does. Switching to connect() enables proper timeout enforcement. Removed signal-based workaround and wait_until_ready() call as connect() handles connection establishment internally with configured timeouts. Verified: Connection to unreachable host now fails in 3s (configured timeout) instead of hanging for 120s.
| @@ -1,4 +1,5 @@ | |||
| import hashlib | |||
| import signal | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Change timeout fields to Optional[int] with None default - Only create ClusterTimeoutOptions if custom values provided - SDK defaults (connect=2s, bootstrap=10s) used when not specified - Users can override if needed for specific network conditions
Per SDK docs, wait_until_ready() should be called after connect() to ensure cluster is fully ready before operations. Uses bootstrap_timeout value (or 10s default) for wait timeout.
| if self.connect_timeout_seconds is not None: | ||
| timeout_kwargs['connect_timeout'] = timedelta(seconds=self.connect_timeout_seconds) | ||
| if self.bootstrap_timeout_seconds is not None: | ||
| timeout_kwargs['bootstrap_timeout'] = timedelta(seconds=self.bootstrap_timeout_seconds) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Timeout None checks always pass, SDK defaults never used
The connect_timeout_seconds and bootstrap_timeout_seconds fields are typed as int with a default of 10, so they can never be None. The is not None checks will always evaluate to True, meaning the comment's claim to "use SDK defaults" when not specified is incorrect. The code will always apply 10-second timeouts. According to the PR description, the SDK's connect_timeout default is 2 seconds, so this effectively uses a 5x longer timeout than intended. To actually allow SDK defaults, these fields would need to be Optional[int] with a default of None.
Problem
Couchbase connector hangs for ~120 seconds when the Couchbase cluster is unreachable, instead of failing quickly with the SDK's default timeouts (connect=2s, bootstrap=10s).
Root Cause
The connector uses
Cluster()constructor +wait_until_ready()instead of the documentedCluster.connect()method. The constructor doesn't properly respect SDK timeout configurations, causing connections to hang with an undocumented 120s fallback timeout.Solution
Switch to using
Cluster.connect()method as documented in the official Couchbase SDK docs. This enables the SDK's default timeouts to work correctly:connect_timeout: 2 seconds (SDK default)bootstrap_timeout: 10 seconds (SDK default)Also adds optional configuration fields for users who need custom timeout values for specific network conditions.
Changes
Testing
Verified with unreachable cluster (
couchbase://127.0.0.1):Backwards Compatibility
✅ Fully backwards compatible - uses SDK defaults, behavior improved for unreachable clusters.
Note
Improves Couchbase connector reliability and failure behavior by honoring SDK timeouts.
connect_timeout_secondsandbootstrap_timeout_secondstoCouchbaseConnectionConfigCluster.connect(...)withClusterTimeoutOptionsinstead ofCluster(...)so SDK/default timeouts are appliedwait_until_readynow uses the configured/bootstrap timeout for readiness checksWritten by Cursor Bugbot for commit 3a54951. This will update automatically on new commits. Configure here.