Skip to content

Conversation

coderbirju
Copy link
Contributor

@coderbirju coderbirju commented Jul 28, 2025

PR to add orchestration logic to nerdctl healthchecks

This PR uses systemd timers to orchestrate healthcheks in nerdctl without the need for a daemon process. Feature currently scoped for rootful linux systems.

@coderbirju coderbirju force-pushed the automate_health_checks_feat branch 15 times, most recently from a4676a1 to bba6059 Compare August 8, 2025 20:24
@coderbirju coderbirju force-pushed the automate_health_checks_feat branch from bba6059 to 97643eb Compare August 11, 2025 00:23
@coderbirju coderbirju force-pushed the automate_health_checks_feat branch 3 times, most recently from f4fc171 to f12e708 Compare August 27, 2025 18:49
@coderbirju coderbirju marked this pull request as ready for review August 27, 2025 19:03
@coderbirju
Copy link
Contributor Author

@swagatbora90 @Shubhranshu153 @AkihiroSuda PTAL when you get a chance, open to improvements

@AkihiroSuda AkihiroSuda added this to the v2.1.4 milestone Aug 28, 2025
logrus.Debugf("Creating healthcheck timer unit: %s", hcName)

cmd := []string{}
if rootlessutil.IsRootless() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function seems to already support rootless.
Any reason to skip tests on rootless?

Comment on lines 505 to 509
// Clean up healthcheck units if configured.
// if err := healthcheck.RemoveTransientHealthCheckFiles(ctx, container); err != nil {
// return fmt.Errorf("failed to clean up healthcheck units for container %s", container.ID())
// }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What’s the purpose of this commented code?

Comment on lines 362 to 366
// Clean up healthcheck units if configured.
// if err := healthcheck.RemoveTransientHealthCheckFiles(ctx, container); err != nil {
// return fmt.Errorf("failed to clean up healthcheck units for container %s", container.ID())
// }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What’s the purpose of this commented code?

@@ -120,7 +120,7 @@ require (
github.com/santhosh-tekuri/jsonschema/v6 v6.0.1 // indirect
github.com/sasha-s/go-deadlock v0.3.5 // indirect
//gomodjail:unconfined
github.com/sirupsen/logrus v1.9.3 // indirect
github.com/sirupsen/logrus v1.9.3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reason for this change?

"github.com/containerd/nerdctl/v2/pkg/testutil"
"github.com/containerd/nerdctl/v2/pkg/testutil/nerdtest"
)

func TestContainerHealthCheckBasic(t *testing.T) {
if rootlessutil.IsRootless() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason we cant support rootless?

@@ -391,43 +399,6 @@ func TestContainerHealthCheckAdvance(t *testing.T) {
}
},
},
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we needed to remove the test?

"--health-interval", "2s",
testutil.CommonImage, "sleep", "30")
nerdtest.EnsureContainerStarted(helpers, data.Identifier())
// Wait for a healthcheck to execute
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we need to wait, ensure container started makes sure the container has started.

Cleanup: func(data test.Data, helpers test.Helpers) {
// Ensure proper cleanup of systemd units
helpers.Anyhow("stop", data.Identifier())
time.Sleep(500 * time.Millisecond) // Allow systemd cleanup
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here wait should not be needed when you are doing rm -f

Expected: func(data test.Data, helpers test.Helpers) *test.Expected {
return &test.Expected{
ExitCode: 0,
Output: expect.All(func(stdout string, t tig.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this portion needs a retry to for container to be in a healthy state.
Initially it should be starting and we can retry, if unhealthy we break with error, if healthy success.
And we retry a few times.

"--health-interval", "1s",
testutil.CommonImage, "sleep", "30")
nerdtest.EnsureContainerStarted(helpers, data.Identifier())
time.Sleep(2 * time.Second) // Wait for at least one health check to execute
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above.

containerEnd, err := time.Parse(time.RFC3339Nano, inspect.State.FinishedAt)
assert.NilError(t, err, "parsing container FinishedAt")

// Assert all healthcheck log start times are before container finished
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we check we remove the systemd timer on a kill signal?
Integ test should try to see the intended steady state condition in case of an event.


Health checks can be configured in multiple ways:

1. At container creation time using nerdctl run or nerdctl create with `--health-*` flags
1. At container creation time using `nerdctl run` or `nerdctl create` with these flags:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should specify if an option is not supported. Have we checked all options are supported? if not we cannot make the claims. Test should cover it, some may be part of the previous PR, but lets do a sanity check on what exists and what is not there.

@@ -111,6 +112,11 @@ func killContainer(ctx context.Context, container containerd.Container, signal s
return err
}

// Clean up healthcheck systemd units
if err := healthcheck.RemoveTransientHealthCheckFiles(ctx, container); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems like a risky function to be called in kill. Is there a possibility the function hangs?
It should also follow the pattern similar to the signal in question. Majorly SIGKILL and SIGTERM should try to be forced removal of the files.

@@ -179,6 +180,11 @@ func RemoveContainer(ctx context.Context, c containerd.Container, globalOptions
// Otherwise, nil the error so that we do not write the error label on the container
retErr = nil

// Clean up healthcheck systemd units
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above.

@@ -179,6 +180,11 @@ func RemoveContainer(ctx context.Context, c containerd.Container, globalOptions
// Otherwise, nil the error so that we do not write the error label on the container
retErr = nil

// Clean up healthcheck systemd units
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs a integ test for remove case (kill is there but i didnt see remove and stop)

@@ -128,26 +128,42 @@ func updateHealthStatus(ctx context.Context, container containerd.Container, hcC
currentHealth = &HealthState{
Status: Starting,
FailingStreak: 0,
StartPeriod: hcConfig.StartPeriod > 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would like to avoid logical assertion in assignment. It might confuse the actual intent of the operation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the startPeriod tested?

currentHealth.Status = Unhealthy

// Check if we're in start period workflow
inStartPeriodTime := hcResult.Start.Sub(containerCreated) < hcConfig.StartPeriod
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would suggest to have unit test for this logic, seems to have a few cases

}

// StartTimer starts the healthcheck timer unit.
// TODO if we persist hcName to container state, pass that to this function.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we are adding TODO we can create a ticket for future follow up.

// RemoveTransientHealthCheckFilesByID stops and cleans up the transient timer and service using just the container ID.
func RemoveTransientHealthCheckFilesByID(ctx context.Context, containerID string) error {
// Don't proceed if systemd is unavailable or disabled
if !defaults.IsSystemdAvailable() || os.Getenv("DISABLE_HC_SYSTEMD") == "true" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is already checked in shouldSkipHealthCheckSystemd

}

// Reset failed units
_ = conn.ResetFailedUnitContext(context.Background(), service)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we reset in all cases

// Stop timer
tChan := make(chan string)
if _, err := conn.StopUnitContext(context.Background(), timer, "ignore-dependencies", tChan); err == nil {
if msg := <-tChan; msg != "done" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking calls in removing workflows is dangerous.

@@ -391,43 +399,6 @@ func TestContainerHealthCheckAdvance(t *testing.T) {
}
},
},
{
Description: "Healthcheck emits large output repeatedly",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why remove this test?

Signed-off-by: Arjun Raja Yogidas <[email protected]>
Signed-off-by: Arjun Raja Yogidas <[email protected]>
Signed-off-by: Arjun Raja Yogidas <[email protected]>
@coderbirju coderbirju force-pushed the automate_health_checks_feat branch from f12e708 to 2a8f99d Compare September 1, 2025 07:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants