Skip to content

Conversation

@isubasinghe
Copy link
Member

@isubasinghe isubasinghe commented Nov 5, 2025

Fixes #15011

Motivation

Modifications

Verification

Documentation

Summary by CodeRabbit

Release Notes

  • New Features

    • Added automatic database reconnection with configurable retry logic for queries that fail due to network issues. Queries now retry automatically with exponential backoff, allowing the system to recover from temporary database connectivity problems without user intervention.
  • Documentation

    • Added documentation for the database reconnection feature and updated workflow controller configuration guidance to reflect new reconnection settings.

✏️ Tip: You can customize this high-level summary in your review settings.

@isubasinghe isubasinghe changed the title feat: retry connections on network errors feat: retry connections on network errors. Fixes #15011 Nov 10, 2025
Signed-off-by: isubasinghe <[email protected]>
@isubasinghe isubasinghe marked this pull request as ready for review November 13, 2025 04:34
@isubasinghe isubasinghe requested a review from Joibel November 17, 2025 09:33
@isubasinghe
Copy link
Member Author

isubasinghe commented Nov 17, 2025

Not sure how to test this unfortunately, any recommendations here? Maybe spin up test containers, spin down postgres and then spin it up again?

EDIT: I have a testing solution, going to write up a test logger to collect logs, spin down postgres, witness the logs, spin up postgres and witness the logs indicating recovery.

Signed-off-by: isubasinghe <[email protected]>
Signed-off-by: isubasinghe <[email protected]>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds automatic retry and reconnection capabilities for database connections on network errors to address issue #15011. The implementation introduces a SessionProxy wrapper around upperdb sessions that detects network errors and automatically attempts to reconnect and retry operations.

Key Changes:

  • Introduced SessionProxy with automatic reconnection on network errors
  • Refactored all database operations to use SessionProxy instead of raw db.Session
  • Added configurable retry parameters (DBReconnectConfig) with exponential backoff
  • Updated sync manager, workflow archive, and persistence layers to use the new proxy pattern

Reviewed Changes

Copilot reviewed 35 out of 35 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
util/sqldb/session.go Core SessionProxy implementation with retry/reconnection logic
util/sqldb/session_test.go Tests for SessionProxy reconnection behavior
util/sync/db/queries.go Refactored all query methods to use SessionProxy.With() pattern
util/sync/db/config.go Updated DBInfo to use SessionProxy instead of Session
workflow/sync/sync_manager.go Updated to create and use SessionProxy, simplified transaction handling
workflow/sync/transaction.go Changed to hold SessionProxy instead of raw Session
workflow/sync/database_semaphore.go Updated all database calls to use SessionProxy
workflow/sync/database_mutex.go Updated to use SessionProxy
persist/sqldb/workflow_archive.go Wrapped all database operations with SessionProxy.With()
persist/sqldb/offload_node_status_repo.go Updated to use SessionProxy pattern
workflow/controller/controller.go Updated to create SessionProxy for persistence
workflow/controller/config.go Updated configuration to use SessionProxy
server/apiserver/argoserver.go Updated to use SessionProxy
config/config.go Added DBReconnectConfig for retry configuration
docs/workflow-controller-configmap.md Documented new DBReconnectConfig options
Multiple test files Updated all test instantiations to pass ensureDBConnection parameter

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

func (sp *SessionProxy) TxWith(ctx context.Context, fn func(*SessionProxy) error, opts *sql.TxOptions) error {
return sp.With(ctx, func(s db.Session) error {
return s.TxContext(ctx, func(sess db.Session) error {
newSp := SessionProxy{sp.kubectlConfig, sp.namespace, sp.dbConfig, sp.username, sp.password, sess, sync.RWMutex{}, sp.closed, sp.maxRetries, sp.baseDelay, sp.maxDelay, sp.retryMultiple, true}
Copy link

Copilot AI Nov 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In TxWith, line 147 creates a new SessionProxy with insideTransaction set to true in the struct literal (position 12), but the parent's sp.insideTransaction value is used in position 11. This is the same issue as in the Tx() method. The value at position 12 should be true, but the struct initialization is confusing. Consider using named fields in the struct literal for clarity and correctness.

Copilot uses AI. Check for mistakes.
Comment on lines +238 to +280
sp.mu.RLock()
if sp.closed {
sp.mu.RUnlock()
logger.Warn(ctx, "session proxy is closed")
return fmt.Errorf("session proxy is closed")
}
sp.mu.RUnlock()

sp.mu.RLock()
sess := sp.sess
sp.mu.RUnlock()

if sess == nil {
return fmt.Errorf("no active session")
}

err := fn(sess)
if err == nil {
return nil
}

// If it's not a network error or inside a tx do not retry
if !sp.isNetworkError(err) || sp.insideTransaction {
return err
}

if reconnectErr := sp.Reconnect(ctx); reconnectErr != nil {
return fmt.Errorf("operation failed and reconnection failed: %w", reconnectErr)
}

sp.mu.RLock()
sess = sp.sess
sp.mu.RUnlock()

if sess == nil {
return fmt.Errorf("no active session after reconnection")
}

if retryErr := fn(sess); retryErr != nil {
return fmt.Errorf("operation failed after reconnection: %w", retryErr)
}

return nil
Copy link

Copilot AI Nov 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The retry logic in With() only retries once after reconnection. If the second attempt also fails due to a network error, it will return an error without attempting to reconnect again. Consider implementing a retry loop that respects maxRetries for consistency with the reconnection logic in Reconnect().

Suggested change
sp.mu.RLock()
if sp.closed {
sp.mu.RUnlock()
logger.Warn(ctx, "session proxy is closed")
return fmt.Errorf("session proxy is closed")
}
sp.mu.RUnlock()
sp.mu.RLock()
sess := sp.sess
sp.mu.RUnlock()
if sess == nil {
return fmt.Errorf("no active session")
}
err := fn(sess)
if err == nil {
return nil
}
// If it's not a network error or inside a tx do not retry
if !sp.isNetworkError(err) || sp.insideTransaction {
return err
}
if reconnectErr := sp.Reconnect(ctx); reconnectErr != nil {
return fmt.Errorf("operation failed and reconnection failed: %w", reconnectErr)
}
sp.mu.RLock()
sess = sp.sess
sp.mu.RUnlock()
if sess == nil {
return fmt.Errorf("no active session after reconnection")
}
if retryErr := fn(sess); retryErr != nil {
return fmt.Errorf("operation failed after reconnection: %w", retryErr)
}
return nil
var lastErr error
for attempt := 0; attempt <= sp.maxRetries; attempt++ {
sp.mu.RLock()
if sp.closed {
sp.mu.RUnlock()
logger.Warn(ctx, "session proxy is closed")
return fmt.Errorf("session proxy is closed")
}
sess := sp.sess
sp.mu.RUnlock()
if sess == nil {
return fmt.Errorf("no active session")
}
err := fn(sess)
if err == nil {
return nil
}
// If it's not a network error or inside a tx do not retry
if !sp.isNetworkError(err) || sp.insideTransaction {
return err
}
lastErr = err
logger.WithField("attempt_number", attempt).Warn(ctx, fmt.Sprintf("operation failed with network error: %v", err))
// If this is the last attempt, break
if attempt == sp.maxRetries {
break
}
if reconnectErr := sp.Reconnect(ctx); reconnectErr != nil {
return fmt.Errorf("operation failed and reconnection failed: %w", reconnectErr)
}
// Exponential backoff before next retry, similar to Reconnect
delay := time.Duration(float64(sp.baseDelay) * float64(attempt+1) * sp.retryMultiple)
delay = min(delay, sp.maxDelay)
select {
case <-ctx.Done():
return ctx.Err()
case <-time.After(delay):
}
}
return fmt.Errorf("operation failed after %d retries, last error: %w", sp.maxRetries, lastErr)

Copilot uses AI. Check for mistakes.
Comment on lines +138 to +142
s := SessionProxy{sp.kubectlConfig, sp.namespace, sp.dbConfig, sp.username, sp.password, sp.Session(), sync.RWMutex{}, sp.closed, sp.maxRetries, sp.baseDelay, sp.maxDelay, sp.retryMultiple, sp.insideTransaction}
s.insideTransaction = true
return &s
}

Copy link

Copilot AI Nov 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Tx() method creates a new SessionProxy with insideTransaction = true, but it doesn't properly copy the field. The statement s.insideTransaction at the end should copy the value from the parent, but then immediately sets it to true on the next line. However, there's a bug: the field at position 11 in the struct literal is sp.insideTransaction, which is the parent's value (likely false), not the new struct's field. This means the transaction flag won't be properly set.

The corrected line 138 should be:

s := SessionProxy{sp.kubectlConfig, sp.namespace, sp.dbConfig, sp.username, sp.password, sp.Session(), sync.RWMutex{}, sp.closed, sp.maxRetries, sp.baseDelay, sp.maxDelay, sp.retryMultiple, false}
Suggested change
s := SessionProxy{sp.kubectlConfig, sp.namespace, sp.dbConfig, sp.username, sp.password, sp.Session(), sync.RWMutex{}, sp.closed, sp.maxRetries, sp.baseDelay, sp.maxDelay, sp.retryMultiple, sp.insideTransaction}
s.insideTransaction = true
return &s
}
s := SessionProxy{sp.kubectlConfig, sp.namespace, sp.dbConfig, sp.username, sp.password, sp.Session(), sync.RWMutex{}, sp.closed, sp.maxRetries, sp.baseDelay, sp.maxDelay, sp.retryMultiple, true}
return &s
}

Copilot uses AI. Check for mistakes.
@Joibel
Copy link
Member

Joibel commented Dec 10, 2025

@coderabbitai full review

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 10, 2025

✅ Actions performed

Full review triggered.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 10, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

This pull request implements database query resilience by introducing a SessionProxy abstraction that automatically reconnects and retries failed queries on network errors. The feature includes configurable retry parameters (max retries, backoff delays, multiplier), network error detection, and integration across persistence, sync, and controller layers.

Changes

Cohort / File(s) Summary
Session Proxy Core Infrastructure
util/sqldb/session.go, util/sqldb/session_test.go
Introduces SessionProxy type with automatic reconnection support, exponential backoff retry logic, network error detection, and methods (Tx, TxWith, With, Reconnect, Session, Close) for safe DB operations. Includes test coverage for PostgreSQL container-based reconnection scenarios.
Configuration & Documentation
.features/pending/feat-refresh-connection.md, config/config.go, docs/workflow-controller-configmap.md
Adds DBReconnectConfig struct with retry parameters (MaxRetries, BaseDelaySeconds, MaxDelaySeconds, RetryMultiple) to DBConfig, PersistConfig, and SyncConfig. Includes feature documentation describing reconnect and retry behavior.
Persistence Layer Migration
persist/sqldb/{workflow_archive.go, offload_node_status_repo.go, archived_workflow_labels.go, migrate.go}
Migrates all persistence repository operations from direct db.Session to SessionProxy, updating constructor signatures and internal DB operations to use sessionProxy.With(...) patterns while preserving existing business logic.
Sync Database Layer Migration
util/sync/db/{queries.go, config.go, migrate.go, mocks/SyncQueries.go}
Converts SyncQueries and related database operations from db.Session to *sqldb.SessionProxy, updating all query methods (GetCurrentState, GetPendingInQueue, UpdateStateToHeld, InsertHeldState, etc.) with new session proxy signatures and mock definitions.
Workflow Sync Components Migration
workflow/sync/{sync_manager.go, database_semaphore.go, database_mutex.go, transaction.go}
Replaces db.Session with SessionProxy across lock manager, semaphore, and mutex implementations. NewLockManager now accepts ensureDBConnection flag and returns error; all internal database operations updated to use sessionProxy-based flows.
Server & Controller Integration
server/apiserver/argoserver.go, server/sync/sync_server.go, workflow/controller/{config.go, controller.go}
Integrates SessionProxy creation and usage at initialization points; replaces direct session creation with sqldb.NewSessionProxy configured with KubectlConfig, Namespace, and DBConfig.
Test Updates & Utilities
hack/db/main.go, test/e2e/fixtures/persistence.go, workflow/controller/{operator_test.go, operator_concurrency_test.go}, workflow/sync/{*_test.go}
Updates test fixtures and all test files to use SessionProxy instead of direct sessions; adds error handling for NewLockManager updates; adjusts test database setup and assertions to work with proxy-based session access.
Dependency Management
go.mod
Converts github.com/docker/docker v28.1.1+incompatible and github.com/docker/go-connections v0.5.0 from indirect to direct dependencies.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Application<br/>(Query)
    participant SP as SessionProxy
    participant DB as Database
    participant Net as Network

    rect rgb(220, 240, 250)
    Note over Client,Net: Normal Query Flow
    Client->>SP: With(ctx, func(session))
    SP->>SP: Check if in transaction
    SP->>DB: Execute query via session
    DB-->>SP: Result
    SP-->>Client: Success
    end

    rect rgb(250, 240, 220)
    Note over Client,Net: Network Error & Reconnect Flow
    Client->>SP: With(ctx, func(session))
    SP->>DB: Execute query via session
    DB-xNet: Network error occurs
    Net-->>SP: Error (timeout/connection lost)
    SP->>SP: isNetworkError(err) → true
    SP->>SP: Reconnect(ctx)<br/>with exponential backoff
    SP->>SP: Attempt 1: wait baseDelay ms
    SP->>Net: New connection attempt
    alt Connection succeeds
        Net-->>DB: Connected
        SP->>DB: Retry query
        DB-->>SP: Result
        SP-->>Client: Success (after reconnect)
    else Max retries exhausted
        SP->>SP: Close old session
        SP-->>Client: Error (retry limit reached)
    end
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention:

  • util/sqldb/session.go – Core reconnection logic with network error detection, backoff calculations, and transaction safety; verify exponential backoff formula and network error classification comprehensiveness
  • workflow/sync/sync_manager.go – Critical control flow change to NewLockManager signature (added ensureDBConnection flag and error return); validate error propagation paths and transaction handling via sessionProxy.TxWith
  • workflow/sync/database_semaphore.go – Extensive refactoring of semaphore acquire/release logic to use sessionProxy; verify state consistency in concurrent scenarios and query parameter flow through proxy
  • persist/sqldb/workflow_archive.go – Large surface migration touching ArchiveWorkflow, GetWorkflow, ListWorkflows, and related methods; ensure backward-compatibility labels and error handling are preserved across all paths
  • util/sync/db/queries.go – All SyncQueries methods updated to use sessionProxy.With(...); cross-check that each database operation maintains proper error propagation and transaction semantics
  • Integration points – Verify sessionProxy initialization in argoserver.go, sync_server.go, and workflow/controller/config.go handle nil sessions correctly and propagate errors appropriately

Poem

🐰 A proxy spun from careful code,
Retries when the network road
Grows weary, backs off with a sigh,
Then reconnects to reach the sky.
Where queries once would simply fail,
Now patience writes a better tale. 🔄✨

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description includes only the template boilerplate with 'Fixes #15011' filled in. All substantive sections (Motivation, Modifications, Verification, Documentation) are empty placeholders marked with TODO comments. Complete the Motivation, Modifications, Verification, and Documentation sections with concrete details about the implementation, testing approach, and user-facing changes.
Docstring Coverage ⚠️ Warning Docstring coverage is 30.77% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The PR title 'feat: retry connections on network errors. Fixes #15011' clearly and concisely describes the main feature addition: implementing automatic retry logic for database connections on network errors.
Linked Issues check ✅ Passed The code changes comprehensively implement the core requirement from #15011: automatic database query retry on network errors via reconnection. SessionProxy with configurable retry parameters, exponential backoff, network error detection, and integration across all persistence layers demonstrates full compliance with the linked objective.
Out of Scope Changes check ✅ Passed Changes are tightly scoped to implement connection retry logic. All modifications (SessionProxy abstraction, retry configuration, error detection, migration updates) directly support the network error resilience objective. No unrelated refactoring or feature creep detected.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
workflow/controller/controller.go (1)

130-135: Handle lock‑manager creation failures to avoid nil syncManager panics

createSynchronizationManager logs the error from sync.NewLockManager but leaves wfc.syncManager potentially nil. Later calls such as wfc.initManagers(ctx) (which does wfc.syncManager.Initialize(...)) and the background wfc.syncManager.CheckWorkflowExistence goroutine assume a non‑nil manager and will panic if the DB connection fails during creation.

It would be safer to treat this as a fatal startup error and stop Run early rather than continuing with a nil syncManager. One way to do this is to return an error from createSynchronizationManager and bail out in Run:

-func (wfc *WorkflowController) createSynchronizationManager(ctx context.Context) {
+func (wfc *WorkflowController) createSynchronizationManager(ctx context.Context) error {
@@
-	var err error
-	wfc.syncManager, err = sync.NewLockManager(ctx, wfc.kubeclientset, wfc.namespace, wfc.Config.Synchronization, getSyncLimit, nextWorkflow, isWFDeleted, true)
-
-	if err != nil {
-		logging.RequireLoggerFromContext(ctx).WithError(err).Error(ctx, "Failed to create lock manager")
-	}
+	var err error
+	wfc.syncManager, err = sync.NewLockManager(ctx, wfc.kubeclientset, wfc.namespace, wfc.Config.Synchronization, getSyncLimit, nextWorkflow, isWFDeleted, true)
+	if err != nil {
+		logging.RequireLoggerFromContext(ctx).WithError(err).WithFatal().Error(ctx, "Failed to create lock manager")
+		return err
+	}
+	return nil
 }

and in Run:

-	// Create Synchronization Manager
-	wfc.createSynchronizationManager(ctx)
+	// Create Synchronization Manager
+	if err := wfc.createSynchronizationManager(ctx); err != nil {
+		return
+	}

Also applies to: 401-444

test/e2e/fixtures/persistence.go (1)

31-54: Potential resource leak: sessionProxy is not closed.

The code creates both a raw session (line 31) and a sessionProxy (lines 41-45). The session is stored and closed in Close(), but sessionProxy is only passed to the repos and never explicitly closed. If SessionProxy manages its own DB connection internally, this could leak connections.

Additionally, creating both may result in maintaining two separate database connections when only one is needed.

Consider either:

  1. Close the sessionProxy in Close() method, or
  2. Remove the raw session creation if sessionProxy can provide the underlying session for cleanup
#!/bin/bash
# Check if SessionProxy has a Close method or manages connection lifecycle
ast-grep --pattern $'type SessionProxy struct {
  $$$
}'

rg -n "func.*SessionProxy.*Close|sessionProxy.*Close" util/sqldb/
🧹 Nitpick comments (13)
docs/workflow-controller-configmap.md (1)

238-250: Consider documenting default values and fix capitalization inconsistency.

The DBReconnectConfig documentation is clear, but consider these improvements:

  1. Default values: Users would benefit from knowing the default values for these fields (e.g., "MaxRetries defaults to 5"). Based on the code snippet in util/sqldb/session.go, the defaults appear to be: maxRetries=5, baseDelay=100ms, maxDelay=30s, retryMultiple=2.0.

  2. Capitalization: Line 247 uses lowercase retryMultiple when referring to the field name RetryMultiple.

Apply this diff to document defaults and fix capitalization:

 ## DBReconnectConfig
 
 DBReconnectConfig contains database reconnect settings
 
 ### Fields
 
 |     Field Name     | Field Type |                                                 Description                                                 |
 |--------------------|------------|-------------------------------------------------------------------------------------------------------------|
-| `MaxRetries`       | `int`      | MaxRetries defines how many connection attempts should be made before we give up                            |
-| `BaseDelaySeconds` | `int`      | BaseDelaySeconds delays retries by this amount multiplied by the retryMultiple, capped to `maxDelaySeconds` |
-| `MaxDelaySeconds`  | `int`      | MaxDelaySeconds the absolute upper limit to wait before retrying                                            |
-| `RetryMultiple`    | `float64`  | RetryMultiple is the growth factor for `baseDelaySeconds`                                                   |
+| `MaxRetries`       | `int`      | MaxRetries defines how many connection attempts should be made before we give up (default: 5)                            |
+| `BaseDelaySeconds` | `int`      | BaseDelaySeconds delays retries by this amount multiplied by the `RetryMultiple`, capped to `MaxDelaySeconds` (default: 0.1 seconds) |
+| `MaxDelaySeconds`  | `int`      | MaxDelaySeconds the absolute upper limit to wait before retrying (default: 30 seconds)                                            |
+| `RetryMultiple`    | `float64`  | RetryMultiple is the growth factor for `BaseDelaySeconds` (default: 2.0)                                                   |
server/sync/sync_server.go (1)

37-44: Consider logging session proxy creation errors.

The session proxy creation error is silently ignored (line 42), causing the DATABASE sync provider to not be registered. While graceful degradation to CONFIGMAP-only sync is reasonable, users might not realize their database sync configuration failed.

Consider logging the error to help with troubleshooting:

 	if syncConfig != nil && syncConfig.EnableAPI {
 		sessionProxy, err := sqldb.NewSessionProxy(ctx, sqldb.SessionProxyConfig{
 			KubectlConfig: kubectlConfig,
 			Namespace:     namespace,
 			DBConfig:      syncConfig.DBConfig,
 		})
 		if err == nil {
 			server.providers[syncpkg.SyncConfigType_DATABASE] = &dbSyncProvider{db: syncdb.NewSyncQueries(sessionProxy, syncdb.DBConfigFromConfig(syncConfig))}
+		} else {
+			log := logging.RequireLoggerFromContext(ctx)
+			log.WithError(err).Warn(ctx, "Failed to create database session proxy for sync, DATABASE sync provider will not be available")
 		}
 	}
workflow/controller/config.go (1)

36-77: Persistence now uses SessionProxy; consider behavior on DB config changes

The new wiring that creates a single wfc.sessionProxy (used for offload repo, archive, and migrations) is consistent with the SessionProxy abstraction and gives persistence the same reconnect/retry behavior as the rest of the system.

However, because updateConfig only creates the proxy when wfc.sessionProxy == nil and sqldb.NewSessionProxy takes a copy of persistence.DBConfig, subsequent changes to persistence DB settings in the controller ConfigMap won’t affect an already‑running controller: it will keep using the original connection parameters until restart.

If supporting hot changes to persistence DB/reconnect settings is important, consider:

  • detecting relevant changes to persistence.DBConfig and
  • recreating the SessionProxy (closing the old one) when they occur, so that initDB and later operations run against the updated configuration.

Also applies to: 93-107

util/sqldb/session_test.go (2)

21-26: Fixed port may cause test flakiness in parallel runs.

Using a hardcoded port (15432) can cause conflicts when tests run in parallel or when the port isn't released quickly after container termination. Consider using dynamic port allocation with a mechanism to pass the new port to the SessionProxy after reconnection, or document that this test must run serially.


100-124: Test assertions inside goroutine may not fail the test correctly.

When assert.NoError or assert.True fail inside a goroutine after the main test function has returned, the failure may not be properly reported. The test does wait via <-doneChan, but if the assertions fail, t.FailNow() won't work correctly from a goroutine.

Consider capturing errors and asserting in the main goroutine:

 	doneChan := make(chan struct{})
+	var goroutineErr error
+	var sawError bool
 	go func() {
-		hasSeenErr := false
 		outerErr := sessionProxy.With(ctx, func(s db.Session) error {
 			innerErr := s.Ping()
 			if innerErr != nil {
-				hasSeenErr = true
+				sawError = true
 			}
 			return innerErr
 		})
-		assert.NoError(t, outerErr)
-		assert.True(t, hasSeenErr)
+		goroutineErr = outerErr
 		doneChan <- struct{}{}
 	}()
 
 	newDBConfig, cancel, err := setupPostgresContainer(ctx, t)
 	require.NoError(t, err)
 	assert.Equal(t, cfg.PostgreSQL.Host, newDBConfig.PostgreSQL.Host)
 	assert.Equal(t, cfg.PostgreSQL.Port, newDBConfig.PostgreSQL.Port)
 	<-doneChan
+	require.NoError(t, goroutineErr)
+	assert.True(t, sawError)
 	cancel()
persist/sqldb/workflow_archive.go (1)

290-357: Consider extracting duplicated name filter logic.

The name filter logic (lines 311-325 and similar blocks in countWorkflowsOptimized and HasMoreWorkflows) is duplicated across multiple methods. Consider extracting to a helper function for maintainability.

func applyNameFilter(selector db.Selector, options sutils.ListOptions) db.Selector {
    if options.Name == "" {
        return selector
    }
    nameFilter := options.NameFilter
    if nameFilter == "" {
        nameFilter = "Exact"
    }
    switch nameFilter {
    case "Exact":
        return selector.And(nameEqual(options.Name))
    case "Contains":
        return selector.And(nameContainsClause(options.Name))
    case "Prefix":
        return selector.And(namePrefixClause(options.Name))
    default:
        return selector
    }
}
util/sqldb/session.go (3)

137-141: Redundant assignment in Tx() method.

Line 138 initializes s.insideTransaction with sp.insideTransaction (position 12 in struct literal), then line 139 immediately overwrites it with true. This is confusing but functionally correct. Consider using named fields for clarity.

 func (sp *SessionProxy) Tx() *SessionProxy {
-	s := SessionProxy{sp.kubectlConfig, sp.namespace, sp.dbConfig, sp.username, sp.password, sp.Session(), sync.RWMutex{}, sp.closed, sp.maxRetries, sp.baseDelay, sp.maxDelay, sp.retryMultiple, sp.insideTransaction}
-	s.insideTransaction = true
-	return &s
+	return &SessionProxy{
+		kubectlConfig:     sp.kubectlConfig,
+		namespace:         sp.namespace,
+		dbConfig:          sp.dbConfig,
+		username:          sp.username,
+		password:          sp.password,
+		sess:              sp.Session(),
+		closed:            sp.closed,
+		maxRetries:        sp.maxRetries,
+		baseDelay:         sp.baseDelay,
+		maxDelay:          sp.maxDelay,
+		retryMultiple:     sp.retryMultiple,
+		insideTransaction: true,
+	}
 }

143-151: Consider using named fields in TxWith for consistency.

Similar to Tx(), using named fields would improve readability and reduce maintenance risk when fields are added/reordered.

 func (sp *SessionProxy) TxWith(ctx context.Context, fn func(*SessionProxy) error, opts *sql.TxOptions) error {
 	return sp.With(ctx, func(s db.Session) error {
 		return s.TxContext(ctx, func(sess db.Session) error {
-			newSp := SessionProxy{sp.kubectlConfig, sp.namespace, sp.dbConfig, sp.username, sp.password, sess, sync.RWMutex{}, sp.closed, sp.maxRetries, sp.baseDelay, sp.maxDelay, sp.retryMultiple, true}
+			newSp := SessionProxy{
+				kubectlConfig:     sp.kubectlConfig,
+				namespace:         sp.namespace,
+				dbConfig:          sp.dbConfig,
+				username:          sp.username,
+				password:          sp.password,
+				sess:              sess,
+				closed:            sp.closed,
+				maxRetries:        sp.maxRetries,
+				baseDelay:         sp.baseDelay,
+				maxDelay:          sp.maxDelay,
+				retryMultiple:     sp.retryMultiple,
+				insideTransaction: true,
+			}
 			return fn(&newSp)
 		}, opts)
 	})
 }

236-281: With() only retries once after reconnection.

If the second attempt also fails with a network error, no further retries are attempted. This may be intentional to avoid indefinite retry loops, but consider documenting this behavior or aligning with maxRetries configuration.

The current single-retry approach may be intentional for simplicity and to avoid long blocking. If this is the desired behavior, consider adding a doc comment clarifying that With() performs at most one retry after reconnection, while Reconnect() itself uses maxRetries for the connection attempt.

-// With executes with a db Session
+// With executes fn with a db.Session. On network errors, it attempts one reconnection
+// (which itself may retry up to maxRetries times) and retries the operation once.
+// Operations inside transactions (insideTransaction=true) are not retried.
 func (sp *SessionProxy) With(ctx context.Context, fn func(db.Session) error) error {
util/sync/db/queries.go (4)

63-86: Consider consolidating duplicate method pairs in the interface.

The interface contains duplicate method pairs with identical signatures:

  • GetPendingInQueue (line 72) vs GetPendingInQueueWithSession (line 83)
  • UpdateStateToHeld (line 70) vs UpdateStateToHeldWithSession (line 84)
  • InsertHeldState (line 71) vs InsertHeldStateWithSession (line 85)

Both variants accept *sqldb.SessionProxy and have the same parameters. Consider removing the WithSession suffix variants and keeping only one set of methods, as the distinction is no longer meaningful after the migration.


263-277: Duplicate implementations: GetPendingInQueue and GetPendingInQueueWithSession.

These two methods have identical signatures and implementations. Since both accept *sqldb.SessionProxy as a parameter, one should be removed. The WithSession variant at lines 364-377 could simply delegate to GetPendingInQueue or be removed entirely.

-func (q *syncQueries) GetPendingInQueueWithSession(ctx context.Context, session *sqldb.SessionProxy, semaphoreName, holderKey, controllerName string) ([]StateRecord, error) {
-	var pending []StateRecord
-	err := session.With(ctx, func(s db.Session) error {
-		err := s.SQL().
-			Select(StateKeyField).
-			From(q.config.StateTable).
-			Where(db.Cond{StateNameField: semaphoreName}).
-			And(db.Cond{StateKeyField: holderKey}).
-			And(db.Cond{StateControllerField: controllerName}).
-			And(db.Cond{StateHeldField: false}).
-			All(&pending)
-		return err
-	})
-	return pending, err
+func (q *syncQueries) GetPendingInQueueWithSession(ctx context.Context, session *sqldb.SessionProxy, semaphoreName, holderKey, controllerName string) ([]StateRecord, error) {
+	return q.GetPendingInQueue(ctx, session, semaphoreName, holderKey, controllerName)
 }

Also applies to: 364-377


243-254: Duplicate implementations: UpdateStateToHeld and UpdateStateToHeldWithSession.

These methods are identical in signature and implementation. Consider having UpdateStateToHeldWithSession delegate to UpdateStateToHeld:

-func (q *syncQueries) UpdateStateToHeldWithSession(ctx context.Context, session *sqldb.SessionProxy, semaphoreName, holderKey, controllerName string) error {
-	return session.With(ctx, func(s db.Session) error {
-		_, err := s.SQL().Update(q.config.StateTable).
-			Set(StateHeldField, true).
-			Where(db.Cond{StateNameField: semaphoreName}).
-			And(db.Cond{StateKeyField: holderKey}).
-			And(db.Cond{StateControllerField: controllerName}).
-			And(db.Cond{StateHeldField: false}).
-			Exec()
-		return err
-	})
+func (q *syncQueries) UpdateStateToHeldWithSession(ctx context.Context, session *sqldb.SessionProxy, semaphoreName, holderKey, controllerName string) error {
+	return q.UpdateStateToHeld(ctx, session, semaphoreName, holderKey, controllerName)
 }

Also applies to: 380-391


256-261: Duplicate implementations: InsertHeldState and InsertHeldStateWithSession.

Same duplication pattern as the other methods. Consider delegating:

-func (q *syncQueries) InsertHeldStateWithSession(ctx context.Context, session *sqldb.SessionProxy, record *StateRecord) error {
-	return session.With(ctx, func(s db.Session) error {
-		_, err := s.Collection(q.config.StateTable).Insert(record)
-		return err
-	})
+func (q *syncQueries) InsertHeldStateWithSession(ctx context.Context, session *sqldb.SessionProxy, record *StateRecord) error {
+	return q.InsertHeldState(ctx, session, record)
 }

Also applies to: 393-398

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f4eef2b and 13f1d2b.

📒 Files selected for processing (35)
  • .features/pending/feat-refresh-connection.md (1 hunks)
  • config/config.go (1 hunks)
  • docs/workflow-controller-configmap.md (3 hunks)
  • go.mod (1 hunks)
  • hack/db/main.go (3 hunks)
  • persist/sqldb/archived_workflow_labels.go (2 hunks)
  • persist/sqldb/migrate.go (1 hunks)
  • persist/sqldb/offload_node_status_repo.go (8 hunks)
  • persist/sqldb/workflow_archive.go (9 hunks)
  • server/apiserver/argoserver.go (2 hunks)
  • server/sync/sync_server.go (2 hunks)
  • test/e2e/fixtures/persistence.go (1 hunks)
  • util/sqldb/migrate.go (1 hunks)
  • util/sqldb/session.go (1 hunks)
  • util/sqldb/session_test.go (1 hunks)
  • util/sync/db/config.go (2 hunks)
  • util/sync/db/migrate.go (1 hunks)
  • util/sync/db/mocks/SyncQueries.go (29 hunks)
  • util/sync/db/queries.go (3 hunks)
  • workflow/controller/config.go (3 hunks)
  • workflow/controller/controller.go (3 hunks)
  • workflow/controller/operator_concurrency_test.go (11 hunks)
  • workflow/controller/operator_test.go (1 hunks)
  • workflow/sync/database_helper_test.go (1 hunks)
  • workflow/sync/database_mutex.go (1 hunks)
  • workflow/sync/database_mutex_test.go (1 hunks)
  • workflow/sync/database_semaphore.go (10 hunks)
  • workflow/sync/database_semaphore_test.go (12 hunks)
  • workflow/sync/multiple_test.go (6 hunks)
  • workflow/sync/mutex_test.go (4 hunks)
  • workflow/sync/semaphore_test.go (7 hunks)
  • workflow/sync/sync_manager.go (9 hunks)
  • workflow/sync/sync_manager_multiple_test.go (1 hunks)
  • workflow/sync/sync_manager_test.go (23 hunks)
  • workflow/sync/transaction.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (22)
workflow/sync/sync_manager_test.go (1)
workflow/sync/sync_manager.go (1)
  • NewLockManager (49-59)
util/sync/db/config.go (1)
util/sqldb/session.go (1)
  • SessionProxy (22-42)
workflow/sync/database_mutex_test.go (1)
util/sqldb/session.go (1)
  • SessionProxy (22-42)
workflow/sync/database_semaphore_test.go (1)
util/sqldb/session.go (1)
  • SessionProxy (22-42)
workflow/controller/operator_concurrency_test.go (1)
workflow/sync/sync_manager.go (1)
  • NewLockManager (49-59)
workflow/controller/operator_test.go (2)
workflow/sync/sync_manager.go (1)
  • NewLockManager (49-59)
pkg/apis/workflow/v1alpha1/workflow_types.go (2)
  • Synchronization (1765-1770)
  • Mutex (1797-1804)
workflow/sync/mutex_test.go (2)
workflow/sync/sync_manager.go (1)
  • NewLockManager (49-59)
workflow/sync/sync_manager_test.go (1)
  • WorkflowExistenceFunc (330-332)
workflow/sync/transaction.go (1)
util/sqldb/session.go (1)
  • SessionProxy (22-42)
workflow/sync/multiple_test.go (2)
workflow/sync/sync_manager.go (1)
  • NewLockManager (49-59)
workflow/sync/sync_manager_test.go (1)
  • WorkflowExistenceFunc (330-332)
util/sync/db/migrate.go (2)
util/sqldb/session.go (1)
  • SessionProxy (22-42)
util/sqldb/migrate.go (2)
  • Migrate (25-97)
  • Change (12-14)
workflow/sync/sync_manager_multiple_test.go (2)
util/sqldb/session.go (1)
  • SessionProxy (22-42)
workflow/sync/sync_manager_test.go (1)
  • WorkflowExistenceFunc (330-332)
hack/db/main.go (2)
util/sqldb/session.go (1)
  • NewSessionProxyFromSession (122-133)
persist/sqldb/workflow_archive.go (1)
  • NewWorkflowArchive (100-102)
workflow/sync/database_helper_test.go (2)
util/sqldb/session.go (3)
  • NewSessionProxy (73-119)
  • SessionProxyConfig (45-54)
  • SessionProxy (22-42)
util/sync/db/config.go (2)
  • DBInfo (26-29)
  • DBConfigFromConfig (77-91)
workflow/sync/semaphore_test.go (1)
util/sqldb/session.go (1)
  • SessionProxy (22-42)
persist/sqldb/workflow_archive.go (5)
util/sqldb/session.go (1)
  • SessionProxy (22-42)
util/sqldb/db_type.go (3)
  • DBTypeFor (19-27)
  • MySQL (14-14)
  • Postgres (15-15)
server/utils/list_options.go (1)
  • ListOptions (15-24)
persist/sqldb/selector.go (1)
  • BuildArchivedWorkflowSelector (12-54)
workflow/common/common.go (1)
  • LabelKeyWorkflowArchivingStatus (89-89)
util/sync/db/mocks/SyncQueries.go (2)
util/sync/db/queries.go (2)
  • SyncQueries (57-86)
  • StateRecord (18-25)
util/sqldb/session.go (1)
  • SessionProxy (22-42)
workflow/sync/sync_manager.go (3)
util/sqldb/session.go (3)
  • SessionProxy (22-42)
  • NewSessionProxy (73-119)
  • SessionProxyConfig (45-54)
util/sync/db/config.go (1)
  • DBConfigFromConfig (77-91)
util/sync/db/queries.go (1)
  • NewSyncQueries (97-102)
util/sync/db/queries.go (1)
util/sqldb/session.go (1)
  • SessionProxy (22-42)
workflow/sync/database_mutex.go (2)
util/sync/db/queries.go (1)
  • NewSyncQueries (97-102)
util/sqldb/session.go (1)
  • SessionProxy (22-42)
workflow/controller/controller.go (2)
util/sqldb/session.go (1)
  • SessionProxy (22-42)
workflow/sync/sync_manager.go (1)
  • NewLockManager (49-59)
workflow/controller/config.go (6)
util/sqldb/session.go (2)
  • NewSessionProxy (73-119)
  • SessionProxyConfig (45-54)
config/config.go (2)
  • DBConfig (239-248)
  • ConnectionPool (323-330)
util/sqldb/sqldb.go (1)
  • ConfigureDBSession (117-124)
persist/sqldb/offload_node_status_repo.go (1)
  • NewOffloadNodeStatusRepo (35-41)
persist/sqldb/workflow_archive.go (1)
  • NewWorkflowArchive (100-102)
persist/sqldb/migrate.go (1)
  • Migrate (16-231)
workflow/sync/database_semaphore.go (3)
util/sync/db/queries.go (2)
  • NewSyncQueries (97-102)
  • StateRecord (18-25)
util/sync/db/mocks/SyncQueries.go (1)
  • NewSyncQueries (18-28)
util/sqldb/session.go (1)
  • SessionProxy (22-42)
🪛 GitHub Actions: CI
util/sqldb/session_test.go

[error] 89-89: TestSessionReconnect failed: generic container: create container: ensure default network: network create: Error response from daemon: could not find plugin bridge in v1 plugin registry: plugin not found

🔇 Additional comments (68)
go.mod (2)

1-212: Scope mismatch: Missing implementation code in review.

The PR objective describes implementing a SessionProxy abstraction with retry parameters, network error detection, and integration across multiple layers. However, the provided file only shows go.mod dependency changes. The actual implementation code (SessionProxy, retry logic, network error handling) is not included in this review.

For a comprehensive review of the retry resilience feature, please ensure all modified/added implementation files are included (e.g., persistence layer changes, SessionProxy definition, retry configuration, error handling logic).


128-129: Clarify why Docker dependencies are promoted to direct dependencies.

The review focuses on database connection resilience via SessionProxy, but the shown changes promote github.com/docker/docker and github.com/docker/go-connections from indirect to direct dependencies. Since testcontainers-go is already listed as a direct dependency, verify whether docker packages are used directly in the codebase or if they are only needed transitively through testcontainers. If the latter, these should remain indirect dependencies.

util/sqldb/migrate.go (1)

117-121: LGTM!

The comment clarification accurately describes that the change is applied directly with the transaction session (tx), which ensures proper transactional semantics during migration.

workflow/sync/sync_manager_multiple_test.go (1)

69-80: LGTM!

The test setup correctly migrates to the SessionProxy pattern:

  • Uses SessionProxy.Session() to access the underlying session for SQL operations
  • Passes SessionProxy to createLockManager as expected by the new signature
  • Assertions properly verify that sessions are initialized through the proxy

The changes are consistent with the broader SessionProxy migration.

workflow/sync/database_mutex.go (1)

19-19: LGTM!

The change correctly passes info.SessionProxy to NewSyncQueries, matching the updated function signature that now expects a *sqldb.SessionProxy parameter.

config/config.go (1)

250-260: Review validation logic or document expected behavior for DBReconnectConfig edge cases.

The configuration struct lacks explicit handling for edge cases:

  • What happens if MaxRetries is 0 or negative?
  • What if MaxDelaySeconds < BaseDelaySeconds?
  • What are sensible defaults when this config is nil or partially set?

If validation or defaulting is handled elsewhere in SessionProxy initialization, this comment can be disregarded. Otherwise, consider adding a validation method or documenting expected behavior.

workflow/sync/transaction.go (1)

7-9: [rewritten review comment]
[classification tag]

hack/db/main.go (2)

33-35: LGTM!

The error handling after createDBSession is correct and ensures failures are properly propagated.


79-80: Verify transaction semantics for SessionProxy.Tx() in the fake data generation loop.

The sessionProxy is created via SessionProxy.Tx() before the loop and reused for multiple ArchiveWorkflow calls. Please confirm:

  1. Does Tx() initiate an actual database transaction that must be explicitly committed?
  2. If so, should the entire batch be committed after the loop completes, or should each iteration have separate transaction handling?
  3. Are there defer/cleanup mechanisms that automatically handle rollback on errors?

This is important for data integrity in bulk operations.

.features/pending/feat-refresh-connection.md (1)

1-7: LGTM!

The feature documentation clearly describes the reconnection and retry behavior for database queries on network failures.

docs/workflow-controller-configmap.md (2)

185-185: LGTM!

The DBReconnectConfig field addition to PersistConfig is well-documented with a clear description.


332-349: LGTM!

The DBReconnectConfig field addition to SyncConfig is consistent with the PersistConfig addition and properly documented.

workflow/sync/database_mutex_test.go (1)

26-26: LGTM!

The migration from direct session usage to SessionProxy.Tx() aligns with the broader SessionProxy adoption across the codebase.

server/apiserver/argoserver.go (2)

241-245: LGTM!

The migration from sqldb.CreateDBSession to sqldb.NewSessionProxy properly configures the session proxy with the necessary Kubernetes configuration, namespace, and database settings.


255-261: LGTM!

The updates to NewOffloadNodeStatusRepo and NewWorkflowArchive correctly pass the session proxy instead of a direct session, aligning with the new constructor signatures.

persist/sqldb/archived_workflow_labels.go (2)

21-26: LGTM!

The refactor to use sessionProxy.With properly scopes the SQL query to the provided context, enabling automatic retry and reconnection logic while preserving the original query behavior.


42-48: LGTM!

Consistent with the previous change, this refactor properly wraps the SQL query with sessionProxy.With for context-aware session management.

persist/sqldb/migrate.go (1)

15-17: Verify session vs proxy usage in migration function.

The new Migrate function accepts a direct db.Session rather than a *SessionProxy. While this might be intentional for migration scenarios (where retry logic could interfere with schema changes), please confirm:

  1. Is it safe to run migrations without retry/reconnection logic?
  2. Should migrations be wrapped in a transaction at a higher level?
  3. Are migrations idempotent enough that manual retry is acceptable?

Based on learnings from the PR context, other components like util/sync/db/migrate.go have been updated to use SessionProxy. Consider checking if this migration should follow the same pattern for consistency.

util/sync/db/config.go (1)

26-30: SessionProxy integration in DBInfo and Migrate looks consistent

Switching DBInfo to hold SessionProxy and delegating migration via migrate(ctx, d.SessionProxy, &d.Config) cleanly aligns with the new proxy abstraction while preserving the “disable DB locks on migration failure” behavior.

Also applies to: 57-75

workflow/sync/database_semaphore_test.go (1)

44-279: Tests correctly pivot to SessionProxy and transaction sessionProxy

Updating the tests to route all DB calls through info.SessionProxy.Session() and to construct transaction{sessionProxy: info.SessionProxy} keeps them aligned with the production SessionProxy abstraction and ensures semaphore/mutex behavior is validated against the new retry‑capable session layer.

Also applies to: 299-369

workflow/sync/sync_manager_test.go (1)

355-1851: Updated NewLockManager test wiring and SessionProxy assertions look sound

Adapting all tests to the new NewLockManager(..., ensureDBConnection bool) (*Manager, error) signature (with require.NoError checks) and using ensureDBConnection=false where DB connectivity isn’t under test keeps the suite focused and avoids spurious failures. The direct createLockManager(..., info.SessionProxy, ...) usage and assertion that syncManager.dbInfo.SessionProxy.Session() is non‑nil appropriately exercise the SessionProxy‑backed DB path.

util/sync/db/migrate.go (1)

13-47: SessionProxy-aware migrate helper is correctly wired

Updating migrate to accept a *sqldb.SessionProxy and delegate via sessionProxy.Session() into sqldb.Migrate cleanly aligns schema setup with the new proxy layer while keeping the migration SQL and ordering unchanged.

workflow/controller/operator_concurrency_test.go (1)

163-627: Controller tests correctly adapt to NewLockManager’s new signature

Initializing controller.syncManager via sync.NewLockManager(..., false) with explicit require.NoError checks in all the operator‑concurrency tests is the right adjustment: it exercises the updated API while keeping these tests focused on ConfigMap‑based synchronization rather than DB connectivity.

Also applies to: 835-839, 910-919, 981-985

workflow/sync/database_helper_test.go (1)

44-83: Test DB helper’s switch to SessionProxy is well-structured

Creating a SessionProxy in createTestDBSession, storing it in syncdb.DBInfo, and using it for controller‑health inserts (with proper Close and container termination in the defer) cleanly mirrors the production path while keeping tests self‑contained.

workflow/sync/mutex_test.go (4)

118-121: LGTM! Properly updated to handle the new API signature.

The constructor call correctly passes false for ensureDBConnection (appropriate for unit tests without DB) and properly handles the new error return value with require.NoError.


132-136: Consistent update pattern.

Same appropriate pattern: ensureDBConnection: false and error assertion.


215-219: Consistent update pattern.


408-412: Consistent update pattern.

workflow/sync/multiple_test.go (6)

46-49: LGTM! Properly updated to handle the new API signature.


128-131: Consistent update pattern.


211-214: Consistent update pattern.


331-334: Consistent update pattern.


410-413: Consistent update pattern.


423-426: Consistent update pattern.

util/sqldb/session_test.go (1)

86-127: Race condition exists between container lifecycle operations and port binding.

Container cleanup at line 100 and new container startup at line 120 may overlap if the first container hasn't fully released port 15432 before the replacement attempts to bind. A delay or port-availability verification is needed between these operations.

The concern about assertions inside goroutines is invalid—the test properly waits for goroutine completion via doneChan (line 124) before returning, ensuring assertions execute before cleanup.

test/e2e/fixtures/persistence.go (1)

41-45: Verify SessionProxyConfig field names.

The field is named KubectlConfig but typically Kubernetes client interfaces use KubeClient or similar naming. Confirm this matches the struct definition in sqldb.SessionProxyConfig.

workflow/controller/operator_test.go (2)

8662-8665: NewLockManager initialization and error handling look correct

The explicit err variable, handling of the new (manager, error) return, and passing ensureDBConnection=false for this mutex-only test are all appropriate and idiomatic.


8668-8668: Updated TryAcquire usage is aligned with the new signature

Discarding the non-error return values and only asserting err == nil is fine for this test, which just needs to ensure the preemptive lock acquisition doesn’t fail.

persist/sqldb/offload_node_status_repo.go (6)

35-41: LGTM! Constructor properly migrated to SessionProxy.

The constructor signature and initialization correctly use *sqldb.SessionProxy instead of db.Session, aligning with the PR's objective to enable automatic reconnection on network errors.


93-107: LGTM! Save method correctly uses sessionProxy.With pattern.

The closure properly captures record and logCtx, and error handling for duplicate keys is preserved within the proxy-wrapped block.


128-145: LGTM! Get method properly migrated.

The query result is correctly populated via the closure parameter, and the unmarshaling happens after the proxy block completes.


150-173: LGTM! List method correctly structured.

The slice is populated within the closure, and post-processing (JSON unmarshaling) happens outside the proxy block as expected.


178-195: LGTM! ListOldOffloads properly migrated.

Consistent pattern with other methods.


206-226: LGTM! Delete method correctly captures rowsAffected.

The rowsAffected variable is declared outside the closure and properly assigned within, allowing the logging to occur after the proxy block.

workflow/sync/semaphore_test.go (3)

19-27: LGTM! Factory type correctly updated.

The semaphore factory signature now returns *sqldb.SessionProxy, and the in-memory implementation correctly returns nil for the proxy since it doesn't use a database.


30-44: LGTM! Database semaphore setup uses Session() for direct SQL.

Using info.SessionProxy.Session().SQL() for test setup (inserting sync_limit) is appropriate since test setup doesn't need retry logic.


100-126: LGTM! Test correctly uses updated transaction struct.

The transaction{sessionProxy: sessionProxy} initialization aligns with the new transaction implementation that uses SessionProxy.

persist/sqldb/workflow_archive.go (3)

100-102: LGTM! Constructor properly migrated to SessionProxy.

The dbType is correctly derived from sessionProxy.Session() for database-specific query handling.


115-166: LGTM! ArchiveWorkflow correctly uses TxWith for atomicity.

The transactional wrapper ensures delete-insert-label operations are atomic. The nested sp.Session() calls within TxWith properly use the transaction-scoped session.


550-624: LGTM! GetWorkflow correctly handles result via closure.

The result variable is declared outside and assigned within the closure. The db.ErrNoMoreRows check correctly returns nil, nil for not-found cases.

workflow/sync/sync_manager.go (4)

48-59: LGTM! NewLockManager correctly handles conditional DB connection.

The ensureDBConnection flag allows callers to control whether a database session should be created, and errors are properly propagated.


69-92: LGTM! createLockManager properly initializes with SessionProxy.

The DBInfo.SessionProxy field is correctly set, and nil checks are used before starting background tasks that require database access.


324-336: LGTM! TryAcquire correctly uses TxWith with serializable isolation.

The transaction block properly uses serializable isolation for lock acquisition, ensuring consistency. The transaction-scoped proxy is passed to tryAcquireImpl.


244-249: LGTM! Initialize correctly creates transactions for lock acquisition.

The transaction{sessionProxy: sm.dbInfo.SessionProxy} pattern is consistent with the updated transaction struct.

Also applies to: 269-272

util/sqldb/session.go (3)

56-70: LGTM! Validation correctly rejects negative parameters.

The validation prevents misconfiguration with negative retry parameters.


181-233: LGTM! Comprehensive network error detection.

The isNetworkError method covers common patterns including string matching, net.Error interface, and specific driver errors (driver.ErrBadConn, sql.ErrConnDone).


153-179: Potential race: sp.closed set before sp.sess in connect().

Line 175 sets sp.closed = false before line 177 sets sp.sess = sess. If another goroutine checks sp.closed via With() between these lines, it might proceed with a nil session. This is safe only if connect() is always called with proper synchronization—verify that connect() is only called from NewSessionProxy (before the proxy is returned) and Reconnect() (which holds the mutex).

workflow/sync/database_semaphore.go (5)

9-11: Import update aligns with SessionProxy migration.

The import change from github.com/upper/db/v4 to github.com/argoproj/argo-workflows/v3/util/sqldb correctly reflects the transition to the new SessionProxy abstraction for database operations.


38-41: Correct initialization with SessionProxy.

The SyncQueries construction properly uses info.SessionProxy for the new session proxy-based API.


90-114: Consistent SessionProxy propagation through state query methods.

The pattern of accepting sessionProxy as a parameter in internal methods while using s.info.SessionProxy in public methods allows for transaction-aware queries during acquire operations while maintaining a simple API for standalone calls.


161-169: Queue ordering query correctly uses SessionProxy.

The queueOrdered method properly accepts and uses the sessionProxy parameter for transactional consistency when called from checkAcquire.


232-377: Transaction-aware session handling in checkAcquire and acquire.

The checkAcquire and acquire methods correctly use tx.sessionProxy for all database operations, ensuring transactional consistency during semaphore acquisition. This allows the SessionProxy's retry logic to work within the transaction context.

util/sync/db/mocks/SyncQueries.go (1)

1-14: Auto-generated mock file correctly updated.

This file is auto-generated by mockery and properly reflects the interface changes to use *sqldb.SessionProxy. The updated method signatures are consistent with the SyncQueries interface definition.

util/sync/db/queries.go (5)

7-10: Import structure correctly updated.

The import section properly adds the sqldb package while retaining github.com/upper/db/v4 for the db.Session type used within the With() callback.


96-102: Constructor correctly initializes SessionProxy.

The NewSyncQueries constructor properly accepts and stores the SessionProxy for use in query operations.


105-160: Limit operations correctly use SessionProxy.With() pattern.

All limit operations properly wrap database interactions in q.sessionProxy.With(), enabling automatic retry on network errors. The error handling in UpdateSemaphoreLimit correctly propagates db.ErrNoMoreRows when no rows are affected.


163-183: State query methods correctly implement session parameter pattern.

GetCurrentState and GetCurrentHolders accept a session parameter for transaction support, while GetCurrentPending uses the internal sessionProxy for standalone queries. This design supports both transactional and non-transactional use cases.


279-361: Lock and controller operations correctly wrapped with SessionProxy.

All lock operations (GetExistingLocks, InsertLock, DeleteLock, ExpireInactiveLocks) and controller operations (InsertControllerHealth, UpdateControllerTimestamp) properly use q.sessionProxy.With() for automatic retry support on network errors.

Comment on lines +284 to +324
func (sp *SessionProxy) Reconnect(ctx context.Context) error {
logger := logging.RequireLoggerFromContext(ctx)
sp.mu.Lock()
defer sp.mu.Unlock()

var err error

for attempt := 0; attempt <= sp.maxRetries; attempt++ {
// Perform the reconnection attempt
// Close the bad connection if it exists
if sp.sess != nil {
sp.sess.Close()
sp.closed = true
}

err = sp.connect(ctx)
if err == nil {
logger.WithField("attempt_number", attempt).Info(ctx, "connected to database")
return nil
}
logger.WithField("attempt_number", attempt).Warn(ctx, "failed to connect to database")

// If this is the last attempt, don't wait
if attempt == sp.maxRetries || !sp.isNetworkError(err) {
break
}

// Calculate delay for next retry with exponential backoff
delay := time.Duration(float64(sp.baseDelay) * float64(attempt+1) * sp.retryMultiple)
delay = min(delay, sp.maxDelay)

// Wait before retrying with context cancellation support
select {
case <-ctx.Done():
return ctx.Err()
case <-time.After(delay):
}
}

return fmt.Errorf("reconnection failed after %d retries, last error: %w", sp.maxRetries, err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Reconnect() holds mutex for entire retry duration.

Reconnect() acquires the write lock at line 286 and holds it through all retry attempts with backoff delays. This could cause significant contention if other goroutines are waiting to use the session. Consider releasing the lock during backoff waits.

 func (sp *SessionProxy) Reconnect(ctx context.Context) error {
 	logger := logging.RequireLoggerFromContext(ctx)
-	sp.mu.Lock()
-	defer sp.mu.Unlock()
 
 	var err error
 
 	for attempt := 0; attempt <= sp.maxRetries; attempt++ {
+		sp.mu.Lock()
 		// Close the bad connection if it exists
 		if sp.sess != nil {
 			sp.sess.Close()
 			sp.closed = true
 		}
 
 		err = sp.connect(ctx)
 		if err == nil {
 			logger.WithField("attempt_number", attempt).Info(ctx, "connected to database")
+			sp.mu.Unlock()
 			return nil
 		}
+		sp.mu.Unlock()
 		logger.WithField("attempt_number", attempt).Warn(ctx, "failed to connect to database")
 
 		// If this is the last attempt, don't wait
 		if attempt == sp.maxRetries || !sp.isNetworkError(err) {
 			break
 		}
 
 		// Calculate delay for next retry with exponential backoff
 		delay := time.Duration(float64(sp.baseDelay) * float64(attempt+1) * sp.retryMultiple)
 		delay = min(delay, sp.maxDelay)
 
 		// Wait before retrying with context cancellation support
 		select {
 		case <-ctx.Done():
 			return ctx.Err()
 		case <-time.After(delay):
 		}
 	}
 
 	return fmt.Errorf("reconnection failed after %d retries, last error: %w", sp.maxRetries, err)
 }

@Joibel
Copy link
Member

Joibel commented Dec 10, 2025

@coderabbitai generate docstrings

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 10, 2025

✅ Actions performed

Initiated docstring generation; will generate only if new commits exist.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 10, 2025

Note

Docstrings generation - SUCCESS
Generated docstrings for this pull request at #15142

coderabbitai bot added a commit that referenced this pull request Dec 10, 2025
Docstrings generation was requested by @Joibel.

* #15006 (comment)

The following files were modified:

* `hack/db/main.go`
* `persist/sqldb/migrate.go`
* `persist/sqldb/offload_node_status_repo.go`
* `persist/sqldb/workflow_archive.go`
* `server/sync/sync_server.go`
* `test/e2e/fixtures/persistence.go`
* `util/sqldb/migrate.go`
* `util/sqldb/session.go`
* `util/sync/db/config.go`
* `util/sync/db/migrate.go`
* `workflow/sync/database_mutex.go`
* `workflow/sync/database_semaphore.go`
* `workflow/sync/sync_manager.go`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support retrying of queries by reconnecting on network error.

2 participants