Skip to content

Parameterize certificate get and store operations.#283

Open
juriskrumins wants to merge 1 commit intomedik8s:mainfrom
juriskrumins:jkrumins/parameterize-cert-storage-apitimeout
Open

Parameterize certificate get and store operations.#283
juriskrumins wants to merge 1 commit intomedik8s:mainfrom
juriskrumins:jkrumins/parameterize-cert-storage-apitimeout

Conversation

@juriskrumins
Copy link

@juriskrumins juriskrumins commented Feb 10, 2026

Why we need this PR

I've created issue, we have in our cluster - too short timeout for certificate collection -10 seconds. Works good for us when we set it something like 30-40s, but I suppose parameter with default value of 10s is fine.

Changes made

Parameterize constant - apiTimeout = 10 * time.Second in certificate storage.

Which issue(s) this PR fixes

Fixes #281

Test plan

Summary by CodeRabbit

  • New Features
    • Added a configurable timeout for certificate storage API connectivity checks. Control the duration via the new --cert-storage-api-timeout flag (default: 10 seconds).
    • Certificate validation now enforces minimum timeout requirements for API operations.

@openshift-ci openshift-ci bot requested review from clobrano and slintes February 10, 2026 15:21
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 10, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: juriskrumins
Once this PR has been reviewed and has the lgtm label, please assign mshitrit for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 10, 2026

Hi @juriskrumins. Thanks for your PR.

I'm waiting for a medik8s member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@coderabbitai
Copy link

coderabbitai bot commented Feb 10, 2026

📝 Walkthrough

Walkthrough

This pull request adds a configurable timeout parameter for certificate storage API operations, making the previously hardcoded 10-second timeout adjustable through the SelfNodeRemediationConfig custom resource. The timeout is threaded through the system from the API specification to the certificate storage implementation.

Changes

Cohort / File(s) Summary
CRD Type Definition
api/v1alpha1/selfnoderemediationconfig_types.go
Added CertStorageApiTimeout optional field (pointer to metav1.Duration) to SelfNodeRemediationConfigSpec with JSON tag and kubebuilder validations (default 10s, duration pattern).
CRD Validation
api/v1alpha1/selfnoderemediationconfig_webhook.go
Integrated CertStorageApiTimeout into webhook validation flow with minimum duration constraint, ensuring spec timeouts are validated alongside other API timeouts.
Generated Code
api/v1alpha1/zz_generated.deepcopy.go
Added deepcopy logic to handle cloning of the CertStorageApiTimeout field when non-nil.
CRD Manifests
config/crd/bases/self-node-remediation.medik8s.io_selfnoderemediationconfigs.yaml, bundle/manifests/self-node-remediation.medik8s.io_selfnoderemediationconfigs.yaml
Updated CustomResourceDefinition specs with new certStorageApiTimeout field (type: string, default: 10s, validation pattern for time units).
Controller Reconciler
controllers/selfnoderemediationconfig_controller.go
Added SecretCertStorageApiTimeout field to SelfNodeRemediationConfigReconciler and wired timeout into template rendering and SecretCertStorage initialization.
Certificate Storage Implementation
pkg/certificates/storage.go
Converted hardcoded apiTimeout constant to per-instance field in SecretCertStorage struct; updated constructor to accept and store timeout parameter; modified GetCerts and StoreCerts to use instance-scoped timeout.
Entry Point Configuration
main.go
Added --cert-storage-api-timeout command-line flag (default 10s) and propagated the timeout parameter through both manager and agent initialization paths, passing it to controller and certificate storage constructors.
Test and Deployment Updates
api/v1alpha1/selfnoderemediationconfig_webhook_test.go, controllers/tests/config/suite_test.go, pkg/certificates/storage_test.go, install/self-node-remediation-deamonset.yaml
Updated test suite and deployment manifests to initialize, validate, and pass the new timeout parameter through constructor calls and DaemonSet container arguments.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

approved, lgtm

Suggested reviewers

  • slintes
  • razo7
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 28.57% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the main change: parameterizing the certificate storage API timeout to make it configurable instead of hardcoded.
Linked Issues check ✅ Passed The PR successfully implements all coding requirements from issue #281: making the hardcoded apiTimeout configurable with a 10-second default, propagating it through initialization paths, and exposing it as a command-line flag.
Out of Scope Changes check ✅ Passed All changes are directly scoped to parameterizing certificate storage timeout: CRD updates, controller logic, test updates, and certificate storage implementation. No unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

We've launched Issue Planner and it is currently in beta. Please try it out and share your feedback on Discord!


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
controllers/tests/config/suite_test.go (1)

102-109: ⚠️ Potential issue | 🟠 Major

SecretCertStorageApiTimeout not set on the test reconciler.

The SelfNodeRemediationConfigReconciler here doesn't set SecretCertStorageApiTimeout, so it defaults to 0. When syncCerts runs during reconciliation, it will create a SecretCertStorage with a zero timeout, which could cause test flakiness or unexpected behavior.

🔧 Suggested fix
 	err = (&controllers.SelfNodeRemediationConfigReconciler{
 		Client:                   k8sManager.GetClient(),
 		Log:                      ctrl.Log.WithName("controllers").WithName("self-node-remediation-config-controller"),
 		InstallFileFolder:        "../../../install/",
 		Scheme:                   scheme.Scheme,
 		Namespace:                shared.Namespace,
 		RebootDurationCalculator: shared.MockRebootDurationCalculator{},
+		SecretCertStorageApiTimeout: 10 * time.Second,
 	}).SetupWithManager(k8sManager)
api/v1alpha1/selfnoderemediationconfig_webhook_test.go (2)

337-352: ⚠️ Potential issue | 🔴 Critical

Missing certStorageApiTimeout case in setFieldValue — the test field override is silently ignored.

setFieldValue has no case certStorageApiTimeout: branch, so createSelfNodeRemediationConfigCRWithFieldValue(certStorageApiTimeout, ...) never applies the test value. The CR retains the default (10s), and the "invalid value" test for this field will not actually test an invalid value.

🐛 Proposed fix
 	case peerUpdateInterval:
 		snrc.Spec.PeerUpdateInterval = timeValue
+	case certStorageApiTimeout:
+		snrc.Spec.CertStorageApiTimeout = timeValue
 	}

275-309: ⚠️ Potential issue | 🔴 Critical

testValidCR doesn't set CertStorageApiTimeout — nil pointer dereference in validateTimes.

The CR is constructed as a bare struct (line 276) and only existing Duration fields are explicitly set (lines 281–286). CertStorageApiTimeout remains nil. When validateTimes accesses spec.CertStorageApiTimeout.Duration (webhook.go line 147), it will panic.

🐛 Proposed fix
 	snrc.Spec.PeerUpdateInterval = &metav1.Duration{Duration: 10 * time.Second}
+	snrc.Spec.CertStorageApiTimeout = &metav1.Duration{Duration: 10 * time.Second}
 	snrc.Spec.CustomDsTolerations = []v1.Toleration{{Key: "validValue", Effect: v1.TaintEffectNoSchedule}, {}, {Operator: v1.TolerationOpEqual, TolerationSeconds: pointer.Int64(-5)}, {Value: "SomeValidValue"}}
main.go (1)

97-108: ⚠️ Potential issue | 🔴 Critical

Fix type mismatch: controller renders float64 seconds that IntVar flag cannot parse.

The controller converts CertStorageApiTimeout to float64 via .Seconds() and renders it directly in the template: --cert-storage-api-timeout={{.CertStorageApiTimeout}}. However, flag.IntVar cannot parse decimal values. If a user sets a fractional duration in the CRD (e.g., 500ms or 1.5s), the template will render 0.5 or 1.5 respectively, causing the pod to crash on startup with a flag parse error.

Fix: Either convert to int64 in the controller before rendering (e.g., int64(snrConfig.Spec.CertStorageApiTimeout.Seconds())), or change the flag to accept a duration string format instead of an integer.

🤖 Fix all issues with AI agents
In `@api/v1alpha1/selfnoderemediationconfig_webhook_test.go`:
- Line 43: The test case uses certStorageApiTimeout with a value equal to the
minimum (10 * time.Second) so validation won't reject it; change the test input
to a value strictly below minDurCertStorageApiTimeout (for example 9 *
time.Second) so the validation in the webhook test for certStorageApiTimeout
correctly triggers rejection; update the test row where certStorageApiTimeout is
paired with minDurCertStorageApiTimeout to use a sub-minimum duration (e.g., 9 *
time.Second).

In `@controllers/selfnoderemediationconfig_controller.go`:
- Line 160: The code directly calls
snrConfig.Spec.CertStorageApiTimeout.Seconds() which can be nil; guard by
checking if snrConfig.Spec.CertStorageApiTimeout != nil before calling
Seconds(), and set data.Data["CertStorageApiTimeout"] to the default (e.g., 10)
or 0 when nil; apply the same nil-guard pattern used for other timeout fields
(the CertStorageApiTimeout field in snrConfig.Spec) to prevent panics.

In `@install/self-node-remediation-deamonset.yaml`:
- Line 42: The template renders CertStorageApiTimeout as a float (via
.Seconds()) causing values like "10.0" which flag.IntVar in main.go cannot
parse; update the DaemonSet template to emit an integer number of nanoseconds or
whole seconds to match main.go's parsing: either change the template to use
.Nanoseconds() (matching other timeouts) or apply math.Ceil on
snrConfig.Spec.CertStorageApiTimeout.Seconds() and cast to int before rendering
so the rendered --cert-storage-api-timeout value is an integer that flag.IntVar
can parse; locate uses of CertStorageApiTimeout, the DaemonSet flag rendering
line, and the parsing in main.go (flag.IntVar) to make the change consistently.

In `@main.go`:
- Around line 107-108: The flag certStorageApiTimeout (registered with
flag.IntVar) can be zero or negative which makes any context.WithTimeout using
it immediately expired; add validation after flag.Parse() to ensure
certStorageApiTimeout > 0 (e.g., if <= 0 then set a sensible minimum like 1,
return an error, or log.Fatal) and use the validated value when creating
contexts for certificate API calls (references: certStorageApiTimeout and the
code paths that call context.WithTimeout for collecting
self-node-remediation-certificates); ensure the chosen behavior is consistent
(reject invalid input or coerce to a minimum) and update any related error
messages to mention the timeout value.
🧹 Nitpick comments (2)
controllers/tests/config/suite_test.go (1)

123-123: Non-idiomatic duration expression.

time.Duration(10)*time.Second works but is unusual. The idiomatic form is 10 * time.Second.

♻️ Suggested fix
-	certReader = certificates.NewSecretCertStorage(k8sClient, ctrl.Log.WithName("SecretCertStorage"), shared.Namespace, time.Duration(10)*time.Second)
+	certReader = certificates.NewSecretCertStorage(k8sClient, ctrl.Log.WithName("SecretCertStorage"), shared.Namespace, 10*time.Second)
pkg/certificates/storage_test.go (1)

31-31: Nit: time.Duration(10)*time.Second can be simplified to 10 * time.Second.

The explicit cast is unnecessary here since 10 is an untyped constant. This same pattern appears in main.go as well (see comment there).

Suggested simplification
-			store := NewSecretCertStorage(k8sClient, ctrl.Log.WithName("TestSecretCertStore"), "default", time.Duration(10)*time.Second)
+			store := NewSecretCertStorage(k8sClient, ctrl.Log.WithName("TestSecretCertStore"), "default", 10*time.Second)

{peerRequestTimeout, -1 * time.Minute, minDurPeerRequestTimeout},
{apiCheckInterval, -1 * time.Second, minDurApiCheckInterval},
{peerUpdateInterval, -10 * time.Second, minDurPeerUpdateInterval},
{certStorageApiTimeout, 10 * time.Second, minDurCertStorageApiTimeout},
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Test value equals the minimum — this test case won't trigger a rejection.

The test value 10 * time.Second equals minDurCertStorageApiTimeout (also 10s). The validation only rejects values less than the minimum. This test expects rejection but will pass validation, causing a test failure.

Use a value strictly below the minimum (e.g., 9 * time.Second):

🐛 Proposed fix
-	{certStorageApiTimeout, 10 * time.Second, minDurCertStorageApiTimeout},
+	{certStorageApiTimeout, 9 * time.Second, minDurCertStorageApiTimeout},
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
{certStorageApiTimeout, 10 * time.Second, minDurCertStorageApiTimeout},
{certStorageApiTimeout, 9 * time.Second, minDurCertStorageApiTimeout},
🤖 Prompt for AI Agents
In `@api/v1alpha1/selfnoderemediationconfig_webhook_test.go` at line 43, The test
case uses certStorageApiTimeout with a value equal to the minimum (10 *
time.Second) so validation won't reject it; change the test input to a value
strictly below minDurCertStorageApiTimeout (for example 9 * time.Second) so the
validation in the webhook test for certStorageApiTimeout correctly triggers
rejection; update the test row where certStorageApiTimeout is paired with
minDurCertStorageApiTimeout to use a sub-minimum duration (e.g., 9 *
time.Second).

data.Data["MinPeersForRemediation"] = snrConfig.Spec.MinPeersForRemediation
data.Data["HostPort"] = snrConfig.Spec.HostPort
data.Data["IsSoftwareRebootEnabled"] = fmt.Sprintf("\"%t\"", snrConfig.Spec.IsSoftwareRebootEnabled)
data.Data["CertStorageApiTimeout"] = snrConfig.Spec.CertStorageApiTimeout.Seconds()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, let's look at the specific file and lines around the concern
cat -n controllers/selfnoderemediationconfig_controller.go | sed -n '140,170p'

Repository: medik8s/self-node-remediation

Length of output: 1812


🏁 Script executed:

# Search for CertStorageApiTimeout usage across the codebase
rg -n 'CertStorageApiTimeout' --type=go -C 3

Repository: medik8s/self-node-remediation

Length of output: 9510


🏁 Script executed:

# Search for webhook definitions that might set defaults
rg -n 'webhook\|defaulting\|mutating' --type=go -i | head -50

Repository: medik8s/self-node-remediation

Length of output: 55


🏁 Script executed:

# Look for the CRD definition or struct that defines the field
fd -e yaml -e json | xargs rg -l 'CertStorageApiTimeout' 2>/dev/null | head -10

Repository: medik8s/self-node-remediation

Length of output: 117


🏁 Script executed:

# Get full webhook validation logic around line 147
cat -n api/v1alpha1/selfnoderemediationconfig_webhook.go | sed -n '130,160p'

Repository: medik8s/self-node-remediation

Length of output: 1367


🏁 Script executed:

# Look for where defaults might be set in the webhook
rg -n 'Default|default' api/v1alpha1/selfnoderemediationconfig_webhook.go -C 3

Repository: medik8s/self-node-remediation

Length of output: 1641


🏁 Script executed:

# Check the types of other timeout fields to see if they're also optional
cat -n api/v1alpha1/selfnoderemediationconfig_types.go | sed -n '120,148p'

Repository: medik8s/self-node-remediation

Length of output: 1775


🏁 Script executed:

# Look for the Default method in the webhook
rg -n 'func.*Default' api/v1alpha1/selfnoderemediationconfig_webhook.go -A 20

Repository: medik8s/self-node-remediation

Length of output: 55


🏁 Script executed:

# Look at the full validate method to see the order of operations
cat -n api/v1alpha1/selfnoderemediationconfig_webhook.go | sed -n '80,135p'

Repository: medik8s/self-node-remediation

Length of output: 2826


🏁 Script executed:

# Search for the Default method (mutating webhook)
rg -n 'func.*Default' api/v1alpha1/ --type=go

Repository: medik8s/self-node-remediation

Length of output: 710


🏁 Script executed:

# Check if there's a separate defaulter for this type
fd . api/v1alpha1/ -name "*webhook*.go" -o -name "*default*.go"

Repository: medik8s/self-node-remediation

Length of output: 303


🏁 Script executed:

# Look for all webhook-related files
ls -la api/v1alpha1/ | grep -i webhook

Repository: medik8s/self-node-remediation

Length of output: 673


🏁 Script executed:

# Check if there's a Default method for SelfNodeRemediationConfig anywhere
rg -n 'func.*SelfNodeRemediationConfig.*Default' api/v1alpha1/ --type=go -A 5

Repository: medik8s/self-node-remediation

Length of output: 55


🏁 Script executed:

# Look at the complete SelfNodeRemediationConfigSpec to see which fields are optional
cat -n api/v1alpha1/selfnoderemediationconfig_types.go | sed -n '55,148p'

Repository: medik8s/self-node-remediation

Length of output: 5107


🏁 Script executed:

# Check if there's a mutating webhook or default handler that would apply these defaults
rg -n 'Default|Defaulting' api/v1alpha1/selfnoderemediationconfig_webhook.go -C 3

Repository: medik8s/self-node-remediation

Length of output: 55


🏁 Script executed:

# Look for any pre-existing nil checks for these timeout fields in the codebase
rg -n 'CertStorageApiTimeout.*nil\|if.*CertStorageApiTimeout' --type=go -B 2 -A 2

Repository: medik8s/self-node-remediation

Length of output: 55


🏁 Script executed:

# Check test cases to see if they test nil scenarios
rg -n 'nil\|CertStorageApiTimeout' api/v1alpha1/selfnoderemediationconfig_webhook_test.go -B 2 -A 2

Repository: medik8s/self-node-remediation

Length of output: 55


🏁 Script executed:

# Check the actual webhook validation more carefully - look at validateTimes and how it's called
cat -n api/v1alpha1/selfnoderemediationconfig_webhook.go | sed -n '134,160p'

Repository: medik8s/self-node-remediation

Length of output: 1299


🏁 Script executed:

# Check if there's any nil guard before validateTimes is called
rg -n 'validateTimes' api/v1alpha1/selfnoderemediationconfig_webhook.go -B 5 -A 5

Repository: medik8s/self-node-remediation

Length of output: 1237


🏁 Script executed:

# Look for any checks before line 160 in the controller
cat -n controllers/selfnoderemediationconfig_controller.go | sed -n '125,165p'

Repository: medik8s/self-node-remediation

Length of output: 2270


Add nil guard before accessing CertStorageApiTimeout.Seconds().

The field is optional (+optional) and can be nil. Although +kubebuilder:default:="10s" should apply in normal scenarios, edge cases like CRD/CR upgrades from older versions or webhook failures could leave it nil, causing a panic. This pattern affects all timeout fields (lines 149–154), but adding a nil guard here protects the latest field and sets a pattern for the others.

🤖 Prompt for AI Agents
In `@controllers/selfnoderemediationconfig_controller.go` at line 160, The code
directly calls snrConfig.Spec.CertStorageApiTimeout.Seconds() which can be nil;
guard by checking if snrConfig.Spec.CertStorageApiTimeout != nil before calling
Seconds(), and set data.Data["CertStorageApiTimeout"] to the default (e.g., 10)
or 0 when nil; apply the same nil-guard pattern used for other timeout fields
(the CertStorageApiTimeout field in snrConfig.Spec) to prevent panics.

containers:
- args:
- --is-manager=false
- --cert-storage-api-timeout={{.CertStorageApiTimeout}}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Find the daemonset file and controller code
find . -type f -name "*daemonset*" -o -name "*controller*" -o -name "main.go" | head -20

Repository: medik8s/self-node-remediation

Length of output: 1289


🏁 Script executed:

# Search for CertStorageApiTimeout and cert-storage-api-timeout across the repo
rg -l "CertStorageApiTimeout|cert-storage-api-timeout" --type=go --type=yaml

Repository: medik8s/self-node-remediation

Length of output: 396


🏁 Script executed:

# Search for context around cert-storage-api-timeout parameter
rg "cert-storage-api-timeout" -C5 --type=go --type=yaml

Repository: medik8s/self-node-remediation

Length of output: 1604


🏁 Script executed:

# Read the daemonset YAML file to see all timeout parameters
cat install/self-node-remediation-deamonset.yaml

Repository: medik8s/self-node-remediation

Length of output: 3627


🏁 Script executed:

# Read the controller code around line 160 where CertStorageApiTimeout is rendered
sed -n '140,180p' controllers/selfnoderemediationconfig_controller.go

Repository: medik8s/self-node-remediation

Length of output: 1885


🏁 Script executed:

# Read main.go to see the full context of certStorageApiTimeout parsing
sed -n '1,100p' main.go

Repository: medik8s/self-node-remediation

Length of output: 3436


🏁 Script executed:

# Read more of main.go to see how certStorageApiTimeout is used
sed -n '75,130p' main.go

Repository: medik8s/self-node-remediation

Length of output: 2119


🏁 Script executed:

# Check the type definition of CertStorageApiTimeout in the API types
grep -n "CertStorageApiTimeout" api/v1alpha1/selfnoderemediationconfig_types.go -A2 -B2

Repository: medik8s/self-node-remediation

Length of output: 231


🏁 Script executed:

# Check if CertStorageApiTimeout has a kubebuilder default
grep -n "CertStorageApiTimeout" api/v1alpha1/ -r -B5

Repository: medik8s/self-node-remediation

Length of output: 4859


🏁 Script executed:

# Verify that time.Duration.Seconds() returns float64 by checking the actual usage
rg "\.Seconds\(\)" --type=go -B2 -A2

Repository: medik8s/self-node-remediation

Length of output: 30122


🏁 Script executed:

# Check if there's any conversion happening elsewhere or if certStorageApiTimeout is used
grep -n "certStorageApiTimeout" main.go -A5 -B5

Repository: medik8s/self-node-remediation

Length of output: 1882


🏁 Script executed:

# Verify the exact behavior of metav1.Duration with Seconds()
cat api/v1alpha1/selfnoderemediationconfig_types.go | grep -A10 "CertStorageApiTimeout"

Repository: medik8s/self-node-remediation

Length of output: 503


Critical bug: --cert-storage-api-timeout flag will fail to parse at runtime.

The controller renders CertStorageApiTimeout using .Seconds(), which returns a float64 value. For the default "10s" duration, this renders to --cert-storage-api-timeout=10.0 in the DaemonSet. However, main.go parses this flag with flag.IntVar(), which cannot parse "10.0" as a valid integer. This will cause a runtime error when the DaemonSet agent starts.

Additionally, this approach is inconsistent with other timeout parameters, which use .Nanoseconds() (returning int64) and are passed as environment variables. Change CertStorageApiTimeout to use .Nanoseconds() or apply math.Ceil(snrConfig.Spec.CertStorageApiTimeout.Seconds()) to convert to int before rendering.

🤖 Prompt for AI Agents
In `@install/self-node-remediation-deamonset.yaml` at line 42, The template
renders CertStorageApiTimeout as a float (via .Seconds()) causing values like
"10.0" which flag.IntVar in main.go cannot parse; update the DaemonSet template
to emit an integer number of nanoseconds or whole seconds to match main.go's
parsing: either change the template to use .Nanoseconds() (matching other
timeouts) or apply math.Ceil on snrConfig.Spec.CertStorageApiTimeout.Seconds()
and cast to int before rendering so the rendered --cert-storage-api-timeout
value is an integer that flag.IntVar can parse; locate uses of
CertStorageApiTimeout, the DaemonSet flag rendering line, and the parsing in
main.go (flag.IntVar) to make the change consistently.

Comment on lines +107 to +108
flag.IntVar(&certStorageApiTimeout, "cert-storage-api-timeout", 10,
"Set Kubernetes timeout value while collecting self-node-remediation-certificates secret.")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

No validation for zero or negative timeout values.

If a user passes --cert-storage-api-timeout=0 or a negative value, the resulting context.WithTimeout will create an already-expired context, causing all certificate API calls to fail immediately. Consider adding a guard.

Suggested validation
 flag.IntVar(&certStorageApiTimeout, "cert-storage-api-timeout", 10,
 	"Set Kubernetes timeout value while collecting self-node-remediation-certificates secret.")
+// ... after flag.Parse():
+if certStorageApiTimeout <= 0 {
+	setupLog.Error(fmt.Errorf("invalid cert-storage-api-timeout: %d", certStorageApiTimeout), "cert-storage-api-timeout must be positive")
+	os.Exit(1)
+}
🤖 Prompt for AI Agents
In `@main.go` around lines 107 - 108, The flag certStorageApiTimeout (registered
with flag.IntVar) can be zero or negative which makes any context.WithTimeout
using it immediately expired; add validation after flag.Parse() to ensure
certStorageApiTimeout > 0 (e.g., if <= 0 then set a sensible minimum like 1,
return an error, or log.Fatal) and use the validated value when creating
contexts for certificate API calls (references: certStorageApiTimeout and the
code paths that call context.WithTimeout for collecting
self-node-remediation-certificates); ensure the chosen behavior is consistent
(reject invalid input or coerce to a minimum) and update any related error
messages to mention the timeout value.

@slintes
Copy link
Member

slintes commented Feb 10, 2026

Thanks for providing a PR.
As mentioned in my comment on the issue, just setting a high timeout isn't enough. It needs to be reflected in the calculation of the time a reboot needs (calculateMinimumRebootDuration() in pkg/reboot/calculator.go).

I went the more direct approach of waiting on informer sync in this PR, WDYT? #282
That fixes the problem with the initial sync, while keeping the 10s timeout on later reads.

@juriskrumins
Copy link
Author

If later reads will take more then 10s ? As far as I understand it creates cache and sync from k8s tom this cache. So cache should probably eventually expire at some point in time, no ?

@slintes
Copy link
Member

slintes commented Feb 10, 2026

No, there isn't a cache expiration. It's updated on the fly when secrets are added / modified / deleted. It's just the initial read of all secrets which can be slow for many secrets.
Can you maybe give the other PR a try on your environment?

@juriskrumins
Copy link
Author

I can test it, but It's just the initial read of all secrets which can be slow for many secrets. question is, why we need all secrets, rather then only couple of them - related to specific operator ?

@openshift-merge-robot
Copy link
Contributor

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@slintes
Copy link
Member

slintes commented Feb 11, 2026

the get function is getting the one specific secret only. However, the informer always has all of them 🤷🏼‍♂️

@juriskrumins
Copy link
Author

Ok, thanks for the explanation.
By the way, what's the schedule for v0.12.0 release ?

@slintes
Copy link
Member

slintes commented Feb 11, 2026

when things go well, we will have this in a 0.12.1 bugfix release in 2-3 weeks

@juriskrumins
Copy link
Author

And one more question, looking at the code once again specifically related to informers as far as I understand it's not possible to specify filter ( for example based on namespace and kind ) on which objects we'd like to sync to the cache and keep eventually updated later on. Which is from my perspective is the root cause, since in cluster with huge amount of objects we're effectively pulling in all specified objects (in this case all Secrets land in the cache). Correct me if I'm wrong, probably don't understand all this good enough.

@slintes
Copy link
Member

slintes commented Feb 12, 2026

feel free to further investigate if we can change the informer. Priority was to have simple quick fix for the planned release for now...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Issue starting daemonset pods.

3 participants

Comments