-
Notifications
You must be signed in to change notification settings - Fork 696
add healthcheck orchestration logic #4427
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
add healthcheck orchestration logic #4427
Conversation
a4676a1
to
bba6059
Compare
bba6059
to
97643eb
Compare
f4fc171
to
f12e708
Compare
@swagatbora90 @Shubhranshu153 @AkihiroSuda PTAL when you get a chance, open to improvements |
logrus.Debugf("Creating healthcheck timer unit: %s", hcName) | ||
|
||
cmd := []string{} | ||
if rootlessutil.IsRootless() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function seems to already support rootless.
Any reason to skip tests on rootless?
pkg/containerutil/containerutil.go
Outdated
// Clean up healthcheck units if configured. | ||
// if err := healthcheck.RemoveTransientHealthCheckFiles(ctx, container); err != nil { | ||
// return fmt.Errorf("failed to clean up healthcheck units for container %s", container.ID()) | ||
// } | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What’s the purpose of this commented code?
pkg/containerutil/containerutil.go
Outdated
// Clean up healthcheck units if configured. | ||
// if err := healthcheck.RemoveTransientHealthCheckFiles(ctx, container); err != nil { | ||
// return fmt.Errorf("failed to clean up healthcheck units for container %s", container.ID()) | ||
// } | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What’s the purpose of this commented code?
@@ -120,7 +120,7 @@ require ( | |||
github.com/santhosh-tekuri/jsonschema/v6 v6.0.1 // indirect | |||
github.com/sasha-s/go-deadlock v0.3.5 // indirect | |||
//gomodjail:unconfined | |||
github.com/sirupsen/logrus v1.9.3 // indirect | |||
github.com/sirupsen/logrus v1.9.3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reason for this change?
"github.com/containerd/nerdctl/v2/pkg/testutil" | ||
"github.com/containerd/nerdctl/v2/pkg/testutil/nerdtest" | ||
) | ||
|
||
func TestContainerHealthCheckBasic(t *testing.T) { | ||
if rootlessutil.IsRootless() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason we cant support rootless?
@@ -391,43 +399,6 @@ func TestContainerHealthCheckAdvance(t *testing.T) { | |||
} | |||
}, | |||
}, | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we needed to remove the test?
"--health-interval", "2s", | ||
testutil.CommonImage, "sleep", "30") | ||
nerdtest.EnsureContainerStarted(helpers, data.Identifier()) | ||
// Wait for a healthcheck to execute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why we need to wait, ensure container started makes sure the container has started.
Cleanup: func(data test.Data, helpers test.Helpers) { | ||
// Ensure proper cleanup of systemd units | ||
helpers.Anyhow("stop", data.Identifier()) | ||
time.Sleep(500 * time.Millisecond) // Allow systemd cleanup |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here wait should not be needed when you are doing rm -f
Expected: func(data test.Data, helpers test.Helpers) *test.Expected { | ||
return &test.Expected{ | ||
ExitCode: 0, | ||
Output: expect.All(func(stdout string, t tig.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this portion needs a retry to for container to be in a healthy state.
Initially it should be starting and we can retry, if unhealthy we break with error, if healthy success.
And we retry a few times.
"--health-interval", "1s", | ||
testutil.CommonImage, "sleep", "30") | ||
nerdtest.EnsureContainerStarted(helpers, data.Identifier()) | ||
time.Sleep(2 * time.Second) // Wait for at least one health check to execute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as above.
containerEnd, err := time.Parse(time.RFC3339Nano, inspect.State.FinishedAt) | ||
assert.NilError(t, err, "parsing container FinishedAt") | ||
|
||
// Assert all healthcheck log start times are before container finished |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we check we remove the systemd timer on a kill signal?
Integ test should try to see the intended steady state condition in case of an event.
|
||
Health checks can be configured in multiple ways: | ||
|
||
1. At container creation time using nerdctl run or nerdctl create with `--health-*` flags | ||
1. At container creation time using `nerdctl run` or `nerdctl create` with these flags: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should specify if an option is not supported. Have we checked all options are supported? if not we cannot make the claims. Test should cover it, some may be part of the previous PR, but lets do a sanity check on what exists and what is not there.
@@ -111,6 +112,11 @@ func killContainer(ctx context.Context, container containerd.Container, signal s | |||
return err | |||
} | |||
|
|||
// Clean up healthcheck systemd units | |||
if err := healthcheck.RemoveTransientHealthCheckFiles(ctx, container); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems like a risky function to be called in kill. Is there a possibility the function hangs?
It should also follow the pattern similar to the signal in question. Majorly SIGKILL and SIGTERM should try to be forced removal of the files.
@@ -179,6 +180,11 @@ func RemoveContainer(ctx context.Context, c containerd.Container, globalOptions | |||
// Otherwise, nil the error so that we do not write the error label on the container | |||
retErr = nil | |||
|
|||
// Clean up healthcheck systemd units |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as above.
@@ -179,6 +180,11 @@ func RemoveContainer(ctx context.Context, c containerd.Container, globalOptions | |||
// Otherwise, nil the error so that we do not write the error label on the container | |||
retErr = nil | |||
|
|||
// Clean up healthcheck systemd units |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs a integ test for remove case (kill is there but i didnt see remove and stop)
@@ -128,26 +128,42 @@ func updateHealthStatus(ctx context.Context, container containerd.Container, hcC | |||
currentHealth = &HealthState{ | |||
Status: Starting, | |||
FailingStreak: 0, | |||
StartPeriod: hcConfig.StartPeriod > 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would like to avoid logical assertion in assignment. It might confuse the actual intent of the operation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the startPeriod tested?
currentHealth.Status = Unhealthy | ||
|
||
// Check if we're in start period workflow | ||
inStartPeriodTime := hcResult.Start.Sub(containerCreated) < hcConfig.StartPeriod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would suggest to have unit test for this logic, seems to have a few cases
} | ||
|
||
// StartTimer starts the healthcheck timer unit. | ||
// TODO if we persist hcName to container state, pass that to this function. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we are adding TODO we can create a ticket for future follow up.
// RemoveTransientHealthCheckFilesByID stops and cleans up the transient timer and service using just the container ID. | ||
func RemoveTransientHealthCheckFilesByID(ctx context.Context, containerID string) error { | ||
// Don't proceed if systemd is unavailable or disabled | ||
if !defaults.IsSystemdAvailable() || os.Getenv("DISABLE_HC_SYSTEMD") == "true" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is already checked in shouldSkipHealthCheckSystemd
} | ||
|
||
// Reset failed units | ||
_ = conn.ResetFailedUnitContext(context.Background(), service) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we reset in all cases
// Stop timer | ||
tChan := make(chan string) | ||
if _, err := conn.StopUnitContext(context.Background(), timer, "ignore-dependencies", tChan); err == nil { | ||
if msg := <-tChan; msg != "done" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking calls in removing workflows is dangerous.
@@ -391,43 +399,6 @@ func TestContainerHealthCheckAdvance(t *testing.T) { | |||
} | |||
}, | |||
}, | |||
{ | |||
Description: "Healthcheck emits large output repeatedly", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why remove this test?
Signed-off-by: Arjun Raja Yogidas <[email protected]>
Signed-off-by: Arjun Raja Yogidas <[email protected]>
Signed-off-by: Arjun Raja Yogidas <[email protected]>
f12e708
to
2a8f99d
Compare
PR to add orchestration logic to nerdctl healthchecks
This PR uses systemd timers to orchestrate healthcheks in nerdctl without the need for a daemon process. Feature currently scoped for rootful linux systems.