@aballman aballman commented Jan 2, 2026

Problem

The Couchbase connector hangs for ~120 seconds when the cluster is unreachable, instead of failing quickly with the SDK's default timeouts (connect=2s, bootstrap=10s).

Root Cause

The connector uses the Cluster() constructor plus wait_until_ready() instead of the documented Cluster.connect() method. The constructor does not respect the SDK's timeout configuration, so connections hang until an undocumented 120s fallback timeout expires.

Solution

Switch to the Cluster.connect() method, as documented in the official Couchbase SDK docs. This lets the SDK's default timeouts take effect:

  • connect_timeout: 2 seconds (SDK default)
  • bootstrap_timeout: 10 seconds (SDK default)

Also adds optional configuration fields for users who need custom timeout values for specific network conditions.

Changes

```python
# Before (hangs for 120s)
options = ClusterOptions(auth)
cluster = Cluster(connection_string, options)
cluster.wait_until_ready(timedelta(seconds=5))

# After (respects SDK defaults)
options = ClusterOptions(auth)
cluster = Cluster.connect(connection_string, options)
```

Testing

Verified with unreachable cluster (couchbase://127.0.0.1):

  • ✅ Without custom timeouts: Fails after ~2-10s (SDK defaults)
  • ✅ With custom 3s timeout: Fails after 3.0s
  • ❌ Before this fix: Hung for 120s
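The timing check above was run against a real unreachable cluster. A minimal, stdlib-only sketch of the same methodology (measure how long a doomed connection attempt takes to fail, and assert it stays inside a bound) can use a raw socket as a stand-in; the host, port, and helper name here are illustrative, not part of the connector:

```python
import socket
import time

def fail_time(host: str, port: int, timeout_s: float) -> float:
    """Return how long a connection attempt takes to fail.

    Stand-in for the real check, which points the connector at an
    unreachable Couchbase cluster and times the resulting error."""
    start = time.monotonic()
    try:
        socket.create_connection((host, port), timeout=timeout_s).close()
        raise AssertionError("expected the connection attempt to fail")
    except OSError:
        return time.monotonic() - start

# A closed loopback port fails quickly (refused or timed out),
# well inside the configured bound:
elapsed = fail_time("127.0.0.1", 4, timeout_s=2.0)
print(f"failed after {elapsed:.3f}s")
```

The same shape applies to the connector test: wrap the connect call, time the exception, and assert the elapsed time is near the configured timeout rather than 120s.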

Backwards Compatibility

✅ Fully backwards compatible - uses SDK defaults, behavior improved for unreachable clusters.


Note

Improves Couchbase connector reliability and failure behavior by honoring SDK timeouts.

  • Adds optional connect_timeout_seconds and bootstrap_timeout_seconds to CouchbaseConnectionConfig
  • Uses Cluster.connect(...) with ClusterTimeoutOptions instead of Cluster(...) so SDK/default timeouts are applied
  • wait_until_ready now uses the configured/bootstrap timeout for readiness checks

Written by Cursor Bugbot for commit 3a54951.

Add connect_timeout_seconds and bootstrap_timeout_seconds configuration
parameters to CouchbaseConnectionConfig to prevent indefinite hangs when
Couchbase cluster is unreachable.

Defaults to 10 seconds for both timeouts, which is sufficient for most
network conditions. Users can override these values in their connector
configuration if needed for slower networks.

The Couchbase SDK's ClusterTimeoutOptions are not respected, and the SDK
catches standard exceptions like TimeoutError. Use signal.SIGALRM to force a
timeout and raise KeyboardInterrupt, which cannot be easily caught.

Also reduce wait_until_ready timeout to match the alarm timeout.
```python
# Set alarm for connection timeout (ClusterTimeoutOptions doesn't work reliably)
logger.info(f"Setting connection alarm for {self.connect_timeout_seconds} seconds")
old_handler = signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(self.connect_timeout_seconds)
```
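The snippet above depends on connector state (self, logger, an externally defined timeout_handler). A self-contained sketch of the same alarm pattern, with an illustrative handler and a sleep standing in for the hanging Cluster() call, looks like this (SIGALRM is Unix-only):

```python
import signal
import time

def timeout_handler(signum, frame):
    # Raise KeyboardInterrupt because the SDK swallows TimeoutError
    raise KeyboardInterrupt("connection attempt timed out")

old_handler = signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(1)  # deliver SIGALRM after 1 second
try:
    time.sleep(5)  # stands in for the hanging Cluster() call
    result = "connected"
except KeyboardInterrupt as exc:
    result = str(exc)
finally:
    signal.alarm(0)                        # cancel any pending alarm
    signal.signal(signal.SIGALRM, old_handler)  # restore previous handler

print(result)
```

Restoring the old handler in a finally block matters: leaving a custom SIGALRM handler installed would affect every later alarm in the process.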
Zero timeout value disables all connection timeout protection

If connect_timeout_seconds is set to 0, signal.alarm(0) cancels any pending alarm rather than setting one, completely disabling timeout protection. Additionally, wait_until_ready(timedelta(seconds=min(5, 0))) would use a 0-second timeout, causing undefined behavior. There's no validation to ensure the timeout value is positive, which means users could inadvertently disable the entire timeout protection that this PR is meant to provide.
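One way to close this hole is a validation step when the config is constructed. The helper below is a hypothetical sketch, not code from the PR; the field name follows the connector config, and the check rejects exactly the values that would disarm signal.alarm:

```python
def validate_timeout(value, name="connect_timeout_seconds"):
    """Reject values that would silently disable the alarm:
    signal.alarm(0) cancels the pending alarm instead of arming one,
    and negative values are rejected by signal.alarm outright."""
    if value is not None and value <= 0:
        raise ValueError(
            f"{name} must be a positive number of seconds, got {value!r}"
        )
    return value

validate_timeout(3)     # ok: positive custom timeout
validate_timeout(None)  # ok: fall back to defaults
try:
    validate_timeout(0)
except ValueError as exc:
    error = str(exc)
print(error)
```

The same check also covers the wait_until_ready problem: a validated positive value can never produce a 0-second readiness timeout.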


The Cluster() constructor doesn't respect ClusterTimeoutOptions, but the
documented Cluster.connect() method does. Switching to connect() enables
proper timeout enforcement.

Removed signal-based workaround and wait_until_ready() call as connect()
handles connection establishment internally with configured timeouts.

Verified: Connection to unreachable host now fails in 3s (configured timeout)
instead of hanging for 120s.
```diff
@@ -1,4 +1,5 @@
 import hashlib
+import signal
```


Unused signal import accidentally committed

The signal module is imported but never used anywhere in the file. This appears to be leftover debugging or testing code that was accidentally included in the commit.


- Change timeout fields to Optional[int] with None default
- Only create ClusterTimeoutOptions if custom values provided
- SDK defaults (connect=2s, bootstrap=10s) used when not specified
- Users can override if needed for specific network conditions

Per SDK docs, wait_until_ready() should be called after connect() to
ensure cluster is fully ready before operations. Uses bootstrap_timeout
value (or 10s default) for wait timeout.
```python
if self.connect_timeout_seconds is not None:
    timeout_kwargs['connect_timeout'] = timedelta(seconds=self.connect_timeout_seconds)
if self.bootstrap_timeout_seconds is not None:
    timeout_kwargs['bootstrap_timeout'] = timedelta(seconds=self.bootstrap_timeout_seconds)
```
Timeout None checks always pass, SDK defaults never used

The connect_timeout_seconds and bootstrap_timeout_seconds fields are typed as int with a default of 10, so they can never be None. The is not None checks will always evaluate to True, meaning the comment's claim to "use SDK defaults" when not specified is incorrect. The code will always apply 10-second timeouts. According to the PR description, the SDK's connect_timeout default is 2 seconds, so this effectively uses a 5x longer timeout than intended. To actually allow SDK defaults, these fields would need to be Optional[int] with a default of None.
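The fix the comment asks for can be sketched with stdlib dataclasses (the real config class presumably extends the project's own base; the class and method names here are illustrative, while the field names follow the PR):

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional

@dataclass
class CouchbaseTimeouts:
    # None means "defer to the SDK defaults" (connect=2s, bootstrap=10s)
    connect_timeout_seconds: Optional[int] = None
    bootstrap_timeout_seconds: Optional[int] = None

    def to_kwargs(self) -> dict:
        """Build kwargs only for explicitly set timeouts, so the
        is-not-None checks are actually meaningful."""
        kwargs = {}
        if self.connect_timeout_seconds is not None:
            kwargs["connect_timeout"] = timedelta(seconds=self.connect_timeout_seconds)
        if self.bootstrap_timeout_seconds is not None:
            kwargs["bootstrap_timeout"] = timedelta(seconds=self.bootstrap_timeout_seconds)
        return kwargs

print(CouchbaseTimeouts().to_kwargs())                        # empty: SDK defaults apply
print(CouchbaseTimeouts(connect_timeout_seconds=3).to_kwargs())
```

The resulting dict would then feed something like ClusterTimeoutOptions(**kwargs); with an int field defaulting to 10, as currently written, the empty-dict branch is unreachable.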

