Skip to content

Also retry SSH connections#20

Merged
mangelajo merged 3 commits intomainfrom
retry-all-ssh
Sep 8, 2025
Merged

Also retry SSH connections#20
mangelajo merged 3 commits intomainfrom
retry-all-ssh

Conversation

@mangelajo
Copy link
Member

@mangelajo mangelajo commented Sep 8, 2025

Previously, on an SSH connection failure everthing fails, instead we must retry connections too, and the SSH connection must be created at each attempt instead of just once.

Summary by CodeRabbit

  • Bug Fixes

    • Improved reliability by creating fresh SSH connections for each retry.
    • Retry handling now distinguishes system upgrades from per-instance retries and queues failures appropriately.
    • More accurate final error messages when maximum retry attempts are reached.
  • Chores

    • Applied host templating before establishing SSH connections to reduce connection failures and ensure retries target the rendered host.
    • Enhanced logging for templating, connection status, and clearer human-readable retry descriptions.
    • Added explicit SSH resource cleanup to prevent leaked connections.

@coderabbitai
Copy link

coderabbitai bot commented Sep 8, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

Applies host templating before SSH, replaces RetryItem.HostSSH with RetryItem.RenderedHost, creates a fresh SSH per retry, integrates bootc upgrades into per-host flow, enhances global retry messaging, and adds HostManager.Close() for SSH resource cleanup.

Changes

Cohort / File(s) Summary
Retry flow & host templating
internal/exporter/host/host.go
- Removed HostSSH from RetryItem; added RenderedHost *api.ExporterHost.
- Added getRetryItemDescription(retryItem RetryItem) string helper.
- Updated processExporterInstancesAndBootc signature to accept renderedHost *api.ExporterHost and adjusted call sites.
- Integrated bootc upgrade handling into per-host processing and enqueue retries with RenderedHost.
- Reworked retry flow to always create/test a fresh SSH connection from RenderedHost before retrying.
- Updated SyncExporterHosts to apply templating before SSH and enqueue SSH/status failures with RenderedHost; adjusted logging.
SSH resource cleanup API
internal/exporter/ssh/ssh.go
- Added Close() error to HostManager interface.
- Implemented func (m *SSHHostManager) Close() error to close SFTP and SSH clients and return appropriate errors.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Operator
  participant Sync as SyncExporterHosts
  participant Template as HostTemplating
  participant SSH as createSSHConnection
  participant Proc as processExporterInstancesAndBootc
  participant RetryQ as processGlobalRetryQueue

  Operator->>Sync: trigger sync
  Sync->>Template: apply templates -> RenderedHost
  Template-->>Sync: RenderedHost
  Sync->>SSH: createSSHConnection(RenderedHost)
  alt SSH succeeds
    SSH-->>Sync: hostSsh
    Sync->>Proc: processExporterInstancesAndBootc(instances, hostName, RenderedHost, retryQueue)
    Proc-->>RetryQ: enqueue RetryItem(RenderedHost) on errors
  else SSH fails
    Sync->>RetryQ: enqueue RetryItem(RenderedHost, reason)
  end

  loop retries
    RetryQ->>SSH: createSSHConnection(RetryItem.RenderedHost)
    alt Connection ok
      SSH-->>Proc: retry per-item or bootc upgrade using hostSsh + RenderedHost
      Proc-->>RetryQ: success or requeue(with incremented attempts)
    else Connection failed
      RetryQ->>RetryQ: increment attempts / finalize with bootc vs instance message
    end
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Poem

I nibble code, render host then try,
Fresh hops each retry beneath SSH sky.
Bootc or instance, queued with care,
Close the sockets, tidy the lair.
A carrot in logs — hop on, repair. 🥕🐇


📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 17bea57 and 490d49f.

📒 Files selected for processing (2)
  • internal/exporter/host/host.go (6 hunks)
  • internal/exporter/ssh/ssh.go (2 hunks)
✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch retry-all-ssh

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
internal/exporter/host/host.go (3)

196-204: Consider adding defensive nil check for renderedHost parameter.

While the current callers always pass a non-nil renderedHost, adding a defensive check would make the function more robust.

 // createSSHConnection creates a new SSH connection for the given rendered exporter host
 func (e *ExporterHostSyncer) createSSHConnection(renderedHost *api.ExporterHost) (ssh.HostManager, error) {
+	if renderedHost == nil {
+		return nil, fmt.Errorf("rendered host cannot be nil")
+	}
 	hostSsh, err := ssh.NewSSHHostManager(renderedHost)
 	if err != nil {
 		return nil, fmt.Errorf("error creating SSH host manager for %s: %w", renderedHost.Name, err)
 	}
 
 	return hostSsh, nil
 }

206-213: Consider using pointer receiver for consistency.

The function getRetryItemDescription doesn't modify the RetryItem, but using a pointer receiver would be more consistent with Go conventions for structs and avoid copying.

 // getRetryItemDescription returns a human-readable description of what is being retried
-func getRetryItemDescription(retryItem RetryItem) string {
+func getRetryItemDescription(retryItem *RetryItem) string {
 	if retryItem.ExporterInstance == nil {
 		return "bootc upgrade"
 	} else {
 		return fmt.Sprintf("instance %s", retryItem.ExporterInstance.Name)
 	}
 }

Also update the call site on line 286:

 			fmt.Printf("🔄 Retrying %s on %s (attempt %d/%d)...\n",
-				getRetryItemDescription(retryItem), retryItem.HostName, retryItem.Attempts+1, e.retryConfig.MaxAttempts)
+				getRetryItemDescription(&retryItem), retryItem.HostName, retryItem.Attempts+1, e.retryConfig.MaxAttempts)

73-78: Consider making retry configuration user-configurable.

The retry configuration is currently hardcoded. Consider making these values configurable through environment variables or configuration files to allow users to adjust retry behavior based on their network conditions.

Would you like me to create an implementation that reads retry configuration from environment variables with sensible defaults?

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e2e0d13 and fead6cf.

📒 Files selected for processing (1)
  • internal/exporter/host/host.go (6 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go

📄 CodeRabbit inference engine (.cursorrules)

**/*.go: Run golangci-lint via make lint and fix all linting errors before proceeding
Run make build to ensure the binary compiles successfully and fix any build errors
Run make fmt to ensure code formatting (go fmt) before finalizing changes

Files:

  • internal/exporter/host/host.go
🧬 Code graph analysis (1)
internal/exporter/host/host.go (2)
api/v1alpha1/exporterhost_types.go (1)
  • ExporterHost (102-108)
internal/exporter/ssh/ssh.go (2)
  • HostManager (32-40)
  • NewSSHHostManager (56-79)
🔇 Additional comments (5)
internal/exporter/host/host.go (5)

37-37: LGTM! Good architectural change for retry resilience.

Replacing HostSSH with RenderedHost is the right approach, as it enables creating fresh SSH connections on each retry attempt rather than reusing a potentially stale connection. This aligns well with the PR's objective of making SSH connections more resilient to transient failures.


284-323: Excellent implementation of fresh SSH connections per retry!

The retry logic correctly creates a new SSH connection for each attempt, which is crucial for recovering from transient network issues or stale connections. The connection test before attempting the actual operation is also a good practice.


389-399: Good handling of initial SSH connection failures.

Adding SSH connection failures to the retry queue ensures that transient network issues during initial connection attempts are properly handled. The retry item correctly uses ExporterInstance: nil to distinguish SSH connection failures from instance-specific failures.


378-382: Template application is correctly positioned before SSH connection.

Applying templates to the host before establishing SSH connections ensures that any template-based configuration (like dynamic addresses or credentials) is properly resolved. The use of DeepCopy() prevents mutations to the original host configuration.


257-265: Clear and informative error messages for different retry scenarios.

The differentiation between bootc upgrade failures and instance-specific failures in error messages helps with debugging and understanding what exactly failed after max retries.

Previously, on an SSH connection failure everthing fails, instead
we must retry connections too, and the SSH connection must be created
at each attempt instead of just once.
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (4)
internal/exporter/host/host.go (4)

196-203: Clarify retry description text (optional)

getRetryItemDescription treats nil ExporterInstance as “bootc upgrade”. If you later add a host-connection-only retry type, this will mislabel logs. Consider adding a distinct retry type enum to disambiguate (instance vs bootc vs host-connect).

Example (outside diff scope):

type RetryType int
const (
  RetryTypeInstance RetryType = iota
  RetryTypeBootc
  RetryTypeHostConnect
)
type RetryItem struct {
  Type              RetryType
  ExporterInstance  *api.ExporterInstance
  // ...
}

416-417: Minor wording nit

The retry queue now includes bootc and SSH-connect failures, not just “instances”. Consider “failed items” for accuracy.

-    fmt.Printf("\n🔄 Processing retry queue (%d failed instances) ===========================\n", len(retryQueue))
+    fmt.Printf("\n🔄 Processing retry queue (%d failed items) ===========================\n", len(retryQueue))

274-295: Small UX improvement: show attempt counts consistently

You print attempt count in the retry log, but not on initial enqueue. Consider adding attempt info to initial failure logs for parity.


180-186: Backoff computation is fine; consider using time.Until to avoid rounding quirks (optional)

Rounding to seconds and adding 1s works; alternatively compute exact remaining wait with time.Until(nextAllowed) for precision.

Also applies to: 321-339

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fead6cf and 8c10a7c.

📒 Files selected for processing (1)
  • internal/exporter/host/host.go (6 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go

📄 CodeRabbit inference engine (.cursorrules)

**/*.go: Run golangci-lint via make lint and fix all linting errors before proceeding
Run make build to ensure the binary compiles successfully and fix any build errors
Run make fmt to ensure code formatting (go fmt) before finalizing changes

Files:

  • internal/exporter/host/host.go
🧬 Code graph analysis (1)
internal/exporter/host/host.go (2)
api/v1alpha1/exporterhost_types.go (1)
  • ExporterHost (102-108)
internal/exporter/ssh/ssh.go (2)
  • HostManager (32-40)
  • NewSSHHostManager (56-79)
🔇 Additional comments (7)
internal/exporter/host/host.go (7)

205-233: Confirm desired bootc behavior when no instances match filter

Bootc upgrade runs only when at least one exporter instance is present (hosts with zero viable instances are continued earlier). If bootc upgrades should still run for such hosts, you’ll need to move the upgrade attempt before the “len(exporterInstances) == 0” check or handle it separately.

Do you want bootc upgrades to run even if all instances are filtered out or marked dead?


247-255: Max-attempts logging looks good

Nice distinction between bootc upgrade and instance failures in the final error reporting.


37-41: Storing the rendered host for retries is the right call

Carrying RenderedHost ensures retries don’t depend on stale template state. Good change.


73-79: Retry tuning parameters: sensible defaults

MaxAttempts=3, BaseDelay=5s, multiplier=2, capped at 60s looks reasonable.


351-359: Template-before-SSH sequencing: good

Applying host templates before establishing SSH aligns with the PR goal and avoids stale connection params.


6-6: License header present

All good.


20-31: Use Go tooling for formatting, linting, and build verification
Imports look correct; the sandbox showed make fmt, make lint, and make build commands aren’t available. Please ensure you run equivalent checks locally or in CI—e.g.:

  • go fmt ./...
  • golangci-lint run
  • go build ./...
    Otherwise install/configure the Makefile targets accordingly.

@mangelajo mangelajo merged commit 8f608da into main Sep 8, 2025
3 of 4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant