fix: Annotate PLR when started status is reported #2208
Conversation
Summary of Changes
Hello @gbenhaim, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request addresses a critical bug where the status reporting to SCM systems for PipelineRuns was unreliable due to the use of a mutable label. It introduces a new annotation to ensure consistent and accurate reporting of the 'started' status. Additionally, it enhances the reconciler's resilience by ensuring it always operates on the most up-to-date PipelineRun data, preventing race conditions and stale cache issues.
Highlights
- Improved SCM Status Reporting Reliability: Replaced the mutable state label with a new, immutable scm-reporting-plr-started annotation to reliably track when a PipelineRun's 'started' status has been reported to the Source Code Management (SCM) system. This prevents issues caused by other controllers inadvertently overwriting the state label, ensuring the 'started' status is reported accurately and only once.
- Enhanced Reconciler Robustness: Implemented a mechanism to fetch the latest PipelineRun object directly from the API server at the start of each reconcile loop. This mitigates problems arising from stale informer cache data, ensuring the reconciler always operates on the most up-to-date state of the PipelineRun.
@zakisk can you please guide me on where I should add tests for this change?
Code Review
This pull request addresses a flaky behavior where the SCM status for a started PipelineRun could be reported multiple times. It introduces a new annotation, scm-reporting-plr-started, to reliably track whether the 'started' status has been reported. Additionally, it ensures the reconciler always works with the freshest PipelineRun object by fetching it directly from the API server at the start of the reconciliation loop.
The changes are logical and correctly implement the fix. I've provided a couple of suggestions in pkg/reconciler/reconciler.go to improve code clarity and reduce log verbosity.
pkg/reconciler/reconciler.go
Outdated
scmReportingPLRStarted, exist := pr.GetAnnotations()[keys.SCMReportingPLRStarted]
if exist && scmReportingPLRStarted == "true" {
	startReported = true
} else if !exist {
This block of code for determining startReported can be simplified. The current if/else if structure is a bit verbose for what is essentially a boolean check.
You can make this more concise and idiomatic like this:
scmAnnotation, exists := pr.GetAnnotations()[keys.SCMReportingPLRStarted]
startReported := exists && scmAnnotation == "true"
This correctly handles all cases:
- Annotation is missing (exists is false).
- Annotation is present but not "true".
- Annotation is present and "true".
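As an aside, the suggested check can be captured in a tiny standalone helper. The sketch below is illustrative only: isStartReported is a hypothetical name, and the annotation key is written out literally instead of going through the project's keys package.

```go
package main

import "fmt"

// Hypothetical key written out literally; the real code references
// keys.SCMReportingPLRStarted from the project's keys package.
const scmReportingPLRStarted = "pipelinesascode.tekton.dev/scm-reporting-plr-started"

// isStartReported is true only when the annotation exists and equals "true",
// covering all three cases listed above.
func isStartReported(annotations map[string]string) bool {
	v, exists := annotations[scmReportingPLRStarted]
	return exists && v == "true"
}

func main() {
	fmt.Println(isStartReported(nil))                                                // missing -> false
	fmt.Println(isStartReported(map[string]string{scmReportingPLRStarted: "false"})) // not "true" -> false
	fmt.Println(isStartReported(map[string]string{scmReportingPLRStarted: "true"}))  // "true" -> true
}
```

The single boolean expression replaces the if/else if block without changing behavior.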
This change also makes the log message on line 94 redundant, as the comprehensive log on line 97 already includes the value of startReported.
scmReportingPLRStarted, exist := pr.GetAnnotations()[keys.SCMReportingPLRStarted]
startReported := exist && scmReportingPLRStarted == "true"
done
pkg/reconciler/reconciler.go
Outdated
logger.Infof("RECONCILE: Checking condition - reason='%s' (is Running: %v), state='%s' (is queued: %v) (startReported: %v)",
	reason, reason == string(tektonv1.PipelineRunReasonRunning), state, state == kubeinteraction.StateQueued, startReported)

if reason == string(tektonv1.PipelineRunReasonRunning) && !startReported {
	logger.Infof("RECONCILE: PipelineRun %s/%s condition met (reason=Running, state=queued), calling updatePipelineRunToInProgress", pr.GetNamespace(), pr.GetName())
	repoName := pr.GetAnnotations()[keys.Repository]
	repo, err := r.repoLister.Repositories(pr.Namespace).Get(repoName)
	if err != nil {
		return fmt.Errorf("failed to get repository CR: %w", err)
	}
	return r.updatePipelineRunToInProgress(ctx, logger, repo, pr)
} else {
	logger.Infof("RECONCILE: Condition NOT met for %s/%s - reason='%s', state='%s' (startReported: %v)", pr.GetNamespace(), pr.GetName(), reason, state, startReported)
}
The new Infof logs prefixed with RECONCILE: are very verbose for the Info level and seem to serve a debugging purpose. While helpful for tracking down the flakiness issue, they might add significant noise to production logs.
Consider either removing them or changing them to a debug level (e.g., Debugf) if your logging framework supports it. This would make the logs cleaner for general operations while still keeping the detailed trace available when needed.
done
currentState := pr.GetAnnotations()[keys.State]
logger.Infof("updating pipelineRun %v/%v state from %s to %s", pr.GetNamespace(), pr.GetName(), currentState, state)
annotations := map[string]string{
	keys.State: state,
}
if state == kubeinteraction.StateStarted {
	annotations[keys.SCMReportingPLRStarted] = "true"
}
Instead of setting the SCMReportingPLRStarted annotation only when the state is StateStarted, it might be more robust to set it unconditionally within this function. This ensures that the annotation is always present, regardless of the state transition, and can help avoid potential issues if the state transition logic changes in the future.
Consider setting the annotation like this:
annotations[keys.SCMReportingPLRStarted] = "true"
GithubApplicationID = "github-application-id"
GithubPrivateKey = "github-private-key"
ResultsRecordSummary = "results.tekton.dev/recordSummaryAnnotations"
SCMReportingPLRStarted = pipelinesascode.GroupName + "/scm-reporting-plr-started"
@gbenhaim shouldn't this be on line 63 above, grouped with the other annotation keys?
done
/ok-to-test
pkg/reconciler/reconciler.go
Outdated
@@ -54,6 +54,23 @@ var (
func (r *Reconciler) ReconcileKind(ctx context.Context, pr *tektonv1.PipelineRun) pkgreconciler.Event {
	ctx = info.StoreNS(ctx, system.Namespace())
	logger := logging.FromContext(ctx).With("namespace", pr.GetNamespace())

	logger.Infof("=== RECONCILE START: PipelineRun %s/%s (resourceVersion: %s) ===", pr.GetNamespace(), pr.GetName(), pr.GetResourceVersion())
Can you please rephrase the log messages here and below to align with the log conventions used across the project?
done
/ok-to-test
/hold
But in normal cases (without Kueue), if reconciliation fails, e.g. because of the race condition we're facing, or in other cases we might not know about at this moment, we could end up not having a git status at all. The current setup is working fine, so I think we should keep it as it is. cc: @chmouel
I didn't understand — how can this happen?
What do you mean, if reconciliation fails we won't have a git status at all? The reporting from PAC can fail as well; I don't see how it is different.
I am not saying that reporting would fail, but if many controllers are patching the PipelineRun, there will be many reconciliation loops, and I don't know if they cause any errors in reporting. (We also have the PaC concurrency stuff as well.) At this moment I don't know what would break, but this would be a big design change, I guess.
@gbenhaim for unauthorized users, we create a mirror PR to run E2E tests, but when I do this I get a gitlint error about your commit message. Can you please adjust the line lengths in your commit message? (Every line should be less than 80 chars.)
@zakisk can you maybe make me an authorized user :) ?
@zakisk I fixed the commit message
I am just following the etiquette that we generally add people after some contributions 🙂
🚀 Mirrored PR Created for E2E Testing
@zakisk I pushed another commit that should solve the race condition I mentioned in earlier comments. It solves it without moving all the SCM reporting to the watcher.
pkg/reconciler/reconciler.go
Outdated
return fmt.Errorf("cannot get fresh PipelineRun: %w", err)
}

if freshPR.GetResourceVersion() != pr.GetResourceVersion() {
@gbenhaim there is a similar check below on line 125 in this file; if the version doesn't match, it should return.
I moved it to the top of the function, as you, @enarha, and I agreed today.
triggered CI!
} else {
	// Mark that the start will be reported to the Git provider
	patchAnnotations[keys.SCMReportingPLRStarted] = "true"
	patchAnnotations[keys.State] = kubeinteraction.StateStarted
Why do we need to set the state here? Otherwise, I see it is the reconciler that sets the state label and annotation. I understand why SCMReportingPLRStarted was added here, but why change the existing behavior by also setting the state here?
Because the PLR isn't created with the state anymore, it should be set here as well, to support PLRs which are not pending.
The reconciler won't reconcile PLRs without the state annotation:
func checkStateAndEnqueue(impl *controller.Impl) func(obj any) { |
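For readers following along, the gating described above can be sketched without the Knative controller machinery. hasStateAnnotation and the literal key below are illustrative assumptions, not the project's actual code:

```go
package main

import "fmt"

// Illustrative key written out literally; the real filter uses keys.State
// from the project's keys package.
const stateAnnotation = "pipelinesascode.tekton.dev/state"

// hasStateAnnotation models the enqueue filter: only PipelineRuns that
// already carry the state annotation (set by PAC after the initial SCM
// report) are handed to the reconciler.
func hasStateAnnotation(annotations map[string]string) bool {
	_, ok := annotations[stateAnnotation]
	return ok
}

func main() {
	fmt.Println(hasStateAnnotation(map[string]string{stateAnnotation: "started"})) // enqueued
	fmt.Println(hasStateAnnotation(map[string]string{}))                           // skipped until PAC patches the annotation
}
```

Because the annotation is only patched on after PAC sends the initial status, the first reconcile cannot race ahead of the SCM report.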
OK, I see.
/lgtm
@zakisk what is the next step for getting this one merged?
Caution: There are some errors in your PipelineRun template.
How is this error related to my change?
Seems like it was an infra error... not sure why the stepactions CRDs weren't applied at the time you ran this...
/retest
The use of the state label (which is mutable) for deciding when to report to the SCM that the PLR was started is flaky. It was seen that the reconciler gets events about PLRs with unexpected values for the state label. For example, after the status is reported to the SCM and the label value is patched to "started", after several reconcile iterations the label had the "queued" value again. This can happen because of unsafe patching done by controllers (not just the PAC controllers) which reconcile PLRs.

Introduce a new annotation for indicating that the status was reported to the SCM. By adding an annotation which is set once, we remove the risk that its value will get overwritten by other controllers (since maps are merged when patched, values are not removed unless explicitly defined in the patch - https://datatracker.ietf.org/doc/html/rfc7386#section-2).

In addition, at the start of each reconcile loop, ensure that we operate on the latest version of the PLR and do not use a stale value from the cache.

Assisted-By: Cursor Signed-off-by: Gal Ben Haim <[email protected]>
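The RFC 7386 point in the commit message can be illustrated with a minimal sketch. mergePatch below models only the overwrite-and-preserve behavior of a JSON merge patch on a flat annotations map; null-based deletion and the real Kubernetes patch path are out of scope, and the annotation keys are written out literally as assumptions:

```go
package main

import "fmt"

// mergePatch models RFC 7386 semantics on a flat string map: keys present in
// the patch overwrite, keys absent from the patch are left untouched.
func mergePatch(original, patch map[string]string) map[string]string {
	out := make(map[string]string, len(original))
	for k, v := range original {
		out[k] = v
	}
	for k, v := range patch {
		out[k] = v
	}
	return out
}

func main() {
	annotations := map[string]string{
		"pipelinesascode.tekton.dev/scm-reporting-plr-started": "true",
		"pipelinesascode.tekton.dev/state":                     "started",
	}
	// Another controller patches only the state annotation back to "queued"...
	patched := mergePatch(annotations, map[string]string{
		"pipelinesascode.tekton.dev/state": "queued",
	})
	// ...but the set-once reporting annotation survives the merge.
	fmt.Println(patched["pipelinesascode.tekton.dev/scm-reporting-plr-started"]) // true
	fmt.Println(patched["pipelinesascode.tekton.dev/state"])                     // queued
}
```

This is why the mutable state label could flap back to "queued" while a set-once annotation cannot be removed by other controllers' patches unless they explicitly target it.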
Set the state annotation and labels in pipelineascode only after the state was reported to the SCM. This is achieved by patching the PLR with the state label/annotation after it was created.

This change is needed in order to avoid a case where the watcher reports a status before PAC. With this change the issue can't happen, since the watcher only reconciles PLRs that have the state annotation, so the first reconcile of the PLRs will happen only after PAC has already sent the initial status to the SCM.

In addition, simplify the condition which checks if the "queued" state annotation/label should be set. We only need to check if the PLR is pending; it doesn't matter whether the reason is PAC's concurrency control or an external controller.

Signed-off-by: Gal Ben Haim <[email protected]>
Instead of running the reconcile loop on the fresh pipeline, skip it and let the next iteration reconcile it. Signed-off-by: Gal Ben Haim <[email protected]>
I have done a code review and tried to "simulate" a random failure with concurrency in the code:

diff --git a/pkg/reconciler/reconciler.go b/pkg/reconciler/reconciler.go
index 58493f289..14dee5134 100644
--- a/pkg/reconciler/reconciler.go
+++ b/pkg/reconciler/reconciler.go
@@ -276,7 +276,30 @@ func (r *Reconciler) reportFinalStatus(ctx context.Context, logger *zap.SugaredL
return repo, nil
}
+func randomError(prn string) error {
+ if strings.HasPrefix(prn, "test-gh-2") {
+ return fmt.Errorf("DEBUG: 😈 randomly failing this PipelineRun: %s", prn)
+ }
+ /* if we want to generate more randomness on load testing
+ var b [8]byte
+ if _, err := rand.Read(b[:]); err != nil {
+ return fmt.Errorf("failed to generate random number: %w", err)
+ }
+ randomNumber := binary.LittleEndian.Uint64(b[:]) % 100
+
+ if randomNumber < 50 {
+ return fmt.Errorf("DEBUG: 😈 randomly failing this PipelineRun: %s", prn)
+ }
+
+ */
+ return nil
+}
+
func (r *Reconciler) updatePipelineRunToInProgress(ctx context.Context, logger *zap.SugaredLogger, repo *v1alpha1.Repository, pr *tektonv1.PipelineRun) error {
+ if err := randomError(pr.GetName()); err != nil {
+ return err
+ }
+
pr, err := r.updatePipelineRunState(ctx, logger, pr, kubeinteraction.StateStarted)
if err != nil {
return fmt.Errorf("cannot update state: %w", err)

I thought it would fail to continue and get stuck forever when the concurrency got out of order, but it seems to work well and continue until the end. I haven't dug into the commit flow to understand how that happens, though, but at least it works for me.

2025-08-11.14-41-41.mp4
/lgtm
Congrats @gbenhaim, your PR has been approved 🎉
✅ Pull Request Approved
Approval Status:
- Required Approvals: 1
- Current Approvals: 1
👥 Reviewers Who Approved:
Reviewer | Permission Level | Approval Status
---|---|---
@chmouel | admin | ✅
📝 Next Steps
- Ensure all required checks pass
- Comply with branch protection rules
- Request a maintainer to merge using the /merge command (or merge it directly if you have repository permission).
Automated by the PAC Boussole 🧭
/lgtm
Congrats @gbenhaim, your PR has been approved 🎉
✅ Pull Request Approved
Approval Status:
- Required Approvals: 1
- Current Approvals: 2
👥 Reviewers Who Approved:
Reviewer | Permission Level | Approval Status
---|---|---
@chmouel | admin | ✅
@zakisk | admin | ✅
📝 Next Steps
- Ensure all required checks pass
- Comply with branch protection rules
- Request a maintainer to merge using the /merge command (or merge it directly if you have repository permission).
Automated by the PAC Boussole 🧭
…elines#2208) * fix: Annotate PLR when started status is reported The use of the state label (which is mutable) for deciding when to report to the SCM that the PLR was started is flaky. It was seen that the reconciler gets events about PLRs with unexpected values for the state label. For example, after the status is reported to the SCM and the label value is patched to "started", after several reconcile iterations the label had the "queued" value again. This can happen because of unsafe patching done by controllers (not just the PAC controllers) which reconcile PLRs. Introduce a new annotation for indicating that the status was reported to the SCM. By adding an annotation which is set once, we remove the risk that its value will get overwritten by other controllers (since maps are merged when patched, values are not removed unless explicitly defined in the patch - https://datatracker.ietf.org/doc/html/rfc7386#section-2). In addition, at the start of each reconcile loop, ensure that we operate on the latest version of the PLR and do not use a stale value from the cache. Assisted-By: Cursor Signed-off-by: Gal Ben Haim <[email protected]>
📝 Description of the Change
The use of the state label (which is mutable) for deciding when to report to the SCM that the PLR was started is flaky.
It was seen that the reconciler gets events about PLRs with unexpected values for the state label. For example, after the status is reported to the SCM and the label value is patched to "started", after several reconcile iterations the label had the "queued" value again. This can happen because of unsafe patching done by controllers (not just the PAC controllers) which reconcile PLRs.
Introduce a new annotation for indicating that the status was reported to the SCM. By adding an annotation which is set once, we remove the risk that its value will get overwritten by other controllers (since maps are merged when patched, values are not removed unless explicitly defined in the patch).
In addition, at the start of each reconcile loop, ensure that we operate on the latest version of the PLR and do not use a stale value from the cache.
Set the state annotation and labels in pipelineascode only after the state was reported to the SCM. This is achieved by patching the PLR with the state label/annotation after it was created.
This change is needed in order to avoid a case where the watcher reports a status before PAC.
With this change the issue can't happen, since the watcher only reconciles PLRs that have the state annotation, so the first reconcile of the PLRs will happen only after PAC has already sent the initial status to the SCM.
In addition, simplify the condition which checks if the "queued" state annotation/label should be set. We only need to check if the PLR is pending; it doesn't matter whether the reason is PAC's concurrency control or an external controller.
🔗 Linked GitHub Issue
Fixes #
👨🏻 Linked Jira
https://issues.redhat.com/browse/SRVKP-8255
🚀 Type of Change
- fix:
- feat:
- feat!: / fix!:
- docs:
- chore:
- refactor:
- enhance:
🧪 Testing Strategy
✅ Submitter Checklist
- The commit prefix (e.g. fix:, feat:) matches the "Type of Change" I selected above.
- I have run make test and make lint locally to check for and fix any issues. For an efficient workflow, I have considered installing pre-commit and running pre-commit install to automate these checks.