fix: pod with a restart policy of Never or OnFailure stuck at 'Progressing' (#15317) #709

Open
wants to merge 8 commits into base: master


@RoelofKuijpers RoelofKuijpers commented Apr 9, 2025

This implementation extends the health condition check for pods.
Previously, Pods with a restart policy of Never or OnFailure were assumed to be hooks with a finite life, so while running they were reported as Progressing instead of Healthy. However, this assumption does not hold when the pod is managed by an operator (e.g., the Flink operator) and therefore has a restart policy of Never.
We introduce a new annotation whose presence is checked when the pod is Running and which allows this restart-policy logic to be skipped.
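
The decision described above can be sketched roughly as follows. This is a simplified illustration, not the PR's actual code: the `Pod` struct, the `Ready` flag, and the `runningPodHealth` function are hypothetical stand-ins for the `corev1.Pod` handling in gitops-engine, and the Degraded cases are left out. Only the annotation name and the presence-only check are taken from the diff excerpt in this conversation.

```go
package main

// RestartPolicy is a minimal stand-in for corev1.RestartPolicy.
type RestartPolicy string

const (
	RestartPolicyAlways    RestartPolicy = "Always"
	RestartPolicyOnFailure RestartPolicy = "OnFailure"
	RestartPolicyNever     RestartPolicy = "Never"
)

// AnnotationIgnoreRestartPolicy is the annotation name from the PR diff.
const AnnotationIgnoreRestartPolicy = "argocd.argoproj.io/ignore-restart-policy"

// Pod is a hypothetical, stripped-down pod used only for this sketch.
type Pod struct {
	Annotations   map[string]string
	RestartPolicy RestartPolicy
	Ready         bool // assumption: true when all containers report ready
}

// runningPodHealth returns a health status for a pod whose phase is Running.
func runningPodHealth(pod Pod) string {
	// As in the diff excerpt, only the presence of the annotation is checked,
	// not its value.
	_, ignore := pod.Annotations[AnnotationIgnoreRestartPolicy]
	if ignore || pod.RestartPolicy == RestartPolicyAlways {
		// Treated as a long-running pod: health follows readiness.
		if pod.Ready {
			return "Healthy"
		}
		return "Progressing"
	}
	// Never or OnFailure without the annotation: assumed to be a finite hook
	// that has not completed yet.
	return "Progressing"
}
```

With this sketch, a ready pod with restart policy Never is reported as Progressing unless it carries the annotation, in which case it is reported as Healthy.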

codecov bot commented Apr 9, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 47.34%. Comparing base (8849c3f) to head (b216058).
⚠️ Report is 55 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #709      +/-   ##
==========================================
- Coverage   54.26%   47.34%   -6.93%     
==========================================
  Files          64       64              
  Lines        6164     6537     +373     
==========================================
- Hits         3345     3095     -250     
- Misses       2549     3187     +638     
+ Partials      270      255      -15     

@drewhemm

drewhemm commented Apr 9, 2025

This looks like a good approach to the problem.

@drewhemm

drewhemm commented Apr 9, 2025

The pod manifest needs the following to pass the tests:

  • Compute and storage resources defined
  • The alpine tag needs to use something other than latest, e.g. 3.21
  • Add automountServiceAccountToken: false to the pod spec, as per the Kubernetes docs
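
As a rough illustration of the three requirements above, a check along these lines could be written. Everything here is hypothetical: `PodSpec`, `Container`, `manifestIssues`, and `imageTag` are simplified stand-ins invented for this sketch and are not part of the PR, SonarQube, or the Kubernetes API types.

```go
package main

import "strings"

// Container is a hypothetical, simplified container definition.
type Container struct {
	Image            string
	ResourceRequests map[string]string // e.g. "cpu", "memory", "ephemeral-storage"
}

// PodSpec is a hypothetical, simplified pod spec.
type PodSpec struct {
	Containers                   []Container
	AutomountServiceAccountToken *bool
}

// imageTag returns the tag of an image reference, or "" if none is given.
// A colon only counts as a tag separator when it follows the last slash,
// so a registry port such as "registry:5000/alpine" is not mistaken for a tag.
func imageTag(image string) string {
	slash := strings.LastIndex(image, "/")
	colon := strings.LastIndex(image, ":")
	if colon > slash {
		return image[colon+1:]
	}
	return ""
}

// manifestIssues lists which of the three requirements the spec violates.
func manifestIssues(spec PodSpec) []string {
	var issues []string
	if spec.AutomountServiceAccountToken == nil || *spec.AutomountServiceAccountToken {
		issues = append(issues, "automountServiceAccountToken should be set to false")
	}
	for _, c := range spec.Containers {
		if len(c.ResourceRequests) == 0 {
			issues = append(issues, "no resource requests defined for image "+c.Image)
		}
		if tag := imageTag(c.Image); tag == "" || tag == "latest" {
			issues = append(issues, "image tag is not pinned for image "+c.Image)
		}
	}
	return issues
}
```

A spec using `alpine:3.21` with resource requests and `automountServiceAccountToken` set to false would produce no issues under this sketch.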

@RoelofKuijpers
Author

@drewhemm I have made the changes you suggested to get a Quality Gate pass

@drewhemm

drewhemm commented Apr 9, 2025

Cool, looks like the last blocking issue is the commit sign off.

@RoelofKuijpers RoelofKuijpers force-pushed the 15317 branch 2 times, most recently from 444a326 to 84a1039 Compare April 9, 2025 11:41
@drewhemm

drewhemm commented Apr 9, 2025

A non-blocking issue has been flagged by SonarQube, probably best to resolve it as follows:

resources:
  requests:
    ephemeral-storage: "100Mi"

Member

@sivchari sivchari left a comment

LGTM. left nits.

@RoelofKuijpers RoelofKuijpers force-pushed the 15317 branch 2 times, most recently from f6e7045 to 61ddfd1 Compare April 21, 2025 16:32
@RoelofKuijpers RoelofKuijpers requested a review from sivchari April 21, 2025 17:28
Member

@sivchari sivchari left a comment

LGTM

@Liammarwood

This needs to be merged to solve lots of subsequent bugs that have been raised.

@Liammarwood

@christianh814 would you be able to give this PR a review?

@RoelofKuijpers
Author

@crenshaw-dev would you be able to give this PR a review? Much appreciated!

@@ -12,6 +12,10 @@ import (
"github.com/argoproj/gitops-engine/pkg/utils/kube"
)

const (
AnnotationIgnoreRestartPolicy = "argocd.argoproj.io/ignore-restart-policy"
Member

I'd suggest putting it in a separate file -

AnnotationSyncWave = "argocd.argoproj.io/sync-wave"

Author

@nitishfy Thanks for your comment. I have moved this to types.go as you suggested. Indeed better.

Member

@nitishfy nitishfy left a comment

Docs required!

@RoelofKuijpers RoelofKuijpers force-pushed the 15317 branch 2 times, most recently from 3de2f0b to ffeef76 Compare July 29, 2025 12:10
Signed-off-by: Roelof Kuijpers <[email protected]>
Signed-off-by: Roelof Kuijpers <[email protected]>
…sses the checks of the Quality Gate

Signed-off-by: Roelof Kuijpers <[email protected]>
RoelofKuijpers and others added 5 commits July 29, 2025 14:14
Signed-off-by: Roelof Kuijpers <[email protected]>
improve code readability

Co-authored-by: sivchari <[email protected]>
Signed-off-by: Roelof Kuijpers <[email protected]>
Signed-off-by: Roelof Kuijpers <[email protected]>
Signed-off-by: Roelof Kuijpers <[email protected]>
@RoelofKuijpers
Author

RoelofKuijpers commented Jul 29, 2025

Docs required!

Have added info to the documentation now!

@RoelofKuijpers RoelofKuijpers requested a review from nitishfy July 29, 2025 12:24
Comment on lines +72 to +74
A workaround is to use the annotation: argocd.argoproj.io/ignore-restart-policy: "true".

When this annotation is set on the Pod resource, the controller will ignore the `restartPolicy` and consider the Pod *Running* as a valid healthy state.
Member

This behavior, combined with hooks, would be problematic, since hooks would be reported as Healthy before they complete.

It is good to explain the current behavior, but it should be part of the health check documentation, not the hooks documentation.

policy := pod.Spec.RestartPolicy
// If the pod has the AnnotationIgnoreRestartPolicy annotation or its restart policy is Always,
// then treat it as a long-running pod and check its health status.
if _, ok := pod.Annotations[common.AnnotationIgnoreRestartPolicy]; ok || policy == corev1.RestartPolicyAlways {
Member

I don't think this feature is necessary, given that argocd.argoproj.io/ignore-healthcheck: 'true' can already be used. The pod will show as running, because that is the behavior configured in the Pod, and we should behave consistently with Kubernetes.

If a controller creates resources whose health we cannot reliably assess, then we should exclude those resources.
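
To make the alternative concrete, here is a minimal sketch of how excluding resources from health aggregation might work. It is an assumption-laden illustration and not Argo CD's implementation: `Resource`, `severity`, and `aggregateHealth` are hypothetical names, the status ordering is simplified to three values, and only the annotation name comes from the comment above.

```go
package main

// AnnotationIgnoreHealthCheck is the annotation name quoted in the review comment.
const AnnotationIgnoreHealthCheck = "argocd.argoproj.io/ignore-healthcheck"

// Resource is a hypothetical view of a managed resource and its own health.
type Resource struct {
	Name        string
	Annotations map[string]string
	Health      string
}

// severity ranks statuses from best to worst. The three-value ordering is a
// simplifying assumption; statuses missing from the map rank as 0.
var severity = map[string]int{"Healthy": 0, "Progressing": 1, "Degraded": 2}

// aggregateHealth returns the worst health among the resources, skipping any
// resource annotated with ignore-healthcheck: "true".
func aggregateHealth(resources []Resource) string {
	worst := "Healthy"
	for _, r := range resources {
		if r.Annotations[AnnotationIgnoreHealthCheck] == "true" {
			continue // excluded: it does not influence the application health
		}
		if severity[r.Health] > severity[worst] {
			worst = r.Health
		}
	}
	return worst
}
```

Under these assumptions, an operator-managed pod that stays Progressing no longer holds the application at Progressing once it is annotated, while the pod itself is still displayed with its own status.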

6 participants