fix: Annotate PLR when started status is reported #2208
Conversation
Summary of Changes
Hello @gbenhaim, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request addresses a critical bug where the status reporting to SCM systems for PipelineRuns was unreliable due to the use of a mutable label. It introduces a new annotation to ensure consistent and accurate reporting of the 'started' status. Additionally, it enhances the reconciler's resilience by ensuring it always operates on the most up-to-date PipelineRun data, preventing race conditions and stale cache issues.
Highlights
- Improved SCM Status Reporting Reliability: Replaced the mutable state label with a new, immutable scm-reporting-plr-started annotation to reliably track when a PipelineRun's 'started' status has been reported to the Source Code Management (SCM) system. This prevents issues caused by other controllers inadvertently overwriting the state label, ensuring the 'started' status is reported accurately and only once.
- Enhanced Reconciler Robustness: Implemented a mechanism to fetch the latest PipelineRun object directly from the API server at the start of each reconcile loop. This mitigates problems arising from stale informer cache data, ensuring the reconciler always operates on the most up-to-date state of the PipelineRun.
@zakisk can you please guide me on where I should add tests for this change?
Code Review
This pull request addresses a flaky behavior where the SCM status for a started PipelineRun could be reported multiple times. It introduces a new annotation, scm-reporting-plr-started, to reliably track whether the 'started' status has been reported. Additionally, it ensures the reconciler always works with the freshest PipelineRun object by fetching it directly from the API server at the start of the reconciliation loop.
The changes are logical and correctly implement the fix. I've provided a couple of suggestions in pkg/reconciler/reconciler.go to improve code clarity and reduce log verbosity.
pkg/reconciler/reconciler.go
Outdated
scmReportingPLRStarted, exist := pr.GetAnnotations()[keys.SCMReportingPLRStarted]
if exist && scmReportingPLRStarted == "true" {
	startReported = true
} else if !exist {
This block of code for determining startReported can be simplified. The current if/else if structure is a bit verbose for what is essentially a boolean check.
You can make this more concise and idiomatic like this:
scmAnnotation, exists := pr.GetAnnotations()[keys.SCMReportingPLRStarted]
startReported := exists && scmAnnotation == "true"
This correctly handles all cases:
- Annotation is missing (exists is false).
- Annotation is present but not "true".
- Annotation is present and "true".
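As an aside, the suggested check can be captured in a tiny standalone helper. The sketch below is illustrative only: isStartReported is a hypothetical name, and the annotation key is written out literally instead of going through the project's keys package.

```go
package main

import "fmt"

// Hypothetical key written out literally; the real code references
// keys.SCMReportingPLRStarted from the project's keys package.
const scmReportingPLRStarted = "pipelinesascode.tekton.dev/scm-reporting-plr-started"

// isStartReported is true only when the annotation exists and equals "true",
// covering all three cases listed above.
func isStartReported(annotations map[string]string) bool {
	v, exists := annotations[scmReportingPLRStarted]
	return exists && v == "true"
}

func main() {
	fmt.Println(isStartReported(nil))                                                // missing -> false
	fmt.Println(isStartReported(map[string]string{scmReportingPLRStarted: "false"})) // not "true" -> false
	fmt.Println(isStartReported(map[string]string{scmReportingPLRStarted: "true"}))  // "true" -> true
}
```

The single boolean expression replaces the if/else if block without changing behavior.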
This change also makes the log message on line 94 redundant, as the comprehensive log on line 97 already includes the value of startReported.
scmReportingPLRStarted, exist := pr.GetAnnotations()[keys.SCMReportingPLRStarted]
startReported := exist && scmReportingPLRStarted == "true"
done
pkg/reconciler/reconciler.go
Outdated
logger.Infof("RECONCILE: Checking condition - reason='%s' (is Running: %v), state='%s' (is queued: %v) (startReported: %v)",
	reason, reason == string(tektonv1.PipelineRunReasonRunning), state, state == kubeinteraction.StateQueued, startReported)

if reason == string(tektonv1.PipelineRunReasonRunning) && !startReported {
	logger.Infof("RECONCILE: PipelineRun %s/%s condition met (reason=Running, state=queued), calling updatePipelineRunToInProgress", pr.GetNamespace(), pr.GetName())
	repoName := pr.GetAnnotations()[keys.Repository]
	repo, err := r.repoLister.Repositories(pr.Namespace).Get(repoName)
	if err != nil {
		return fmt.Errorf("failed to get repository CR: %w", err)
	}
	return r.updatePipelineRunToInProgress(ctx, logger, repo, pr)
} else {
	logger.Infof("RECONCILE: Condition NOT met for %s/%s - reason='%s', state='%s' (startReported: %v)", pr.GetNamespace(), pr.GetName(), reason, state, startReported)
}
The new Infof logs prefixed with RECONCILE: are very verbose for the Info level and seem to serve a debugging purpose. While helpful for tracking down the flakiness issue, they might add significant noise to production logs.
Consider either removing them or changing them to a debug level (e.g., Debugf) if your logging framework supports it. This would make the logs cleaner for general operations while still keeping the detailed trace available when needed.
done
currentState := pr.GetAnnotations()[keys.State]
logger.Infof("updating pipelineRun %v/%v state from %s to %s", pr.GetNamespace(), pr.GetName(), currentState, state)
annotations := map[string]string{
	keys.State: state,
}
if state == kubeinteraction.StateStarted {
	annotations[keys.SCMReportingPLRStarted] = "true"
}
Instead of setting the SCMReportingPLRStarted annotation only when the state is StateStarted, it might be more robust to set it unconditionally within this function. This ensures that the annotation is always present, regardless of the state transition, and can help avoid potential issues if the state transition logic changes in the future.
Consider setting the annotation like this:
annotations[keys.SCMReportingPLRStarted] = "true"
GithubApplicationID = "github-application-id"
GithubPrivateKey = "github-private-key"
ResultsRecordSummary = "results.tekton.dev/recordSummaryAnnotations"
SCMReportingPLRStarted = pipelinesascode.GroupName + "/scm-reporting-plr-started"
@gbenhaim shouldn't this be on line 63 above, grouped with the other annotation keys?
done
/ok-to-test
pkg/reconciler/reconciler.go
Outdated
@@ -54,6 +54,23 @@ var (
func (r *Reconciler) ReconcileKind(ctx context.Context, pr *tektonv1.PipelineRun) pkgreconciler.Event {
	ctx = info.StoreNS(ctx, system.Namespace())
	logger := logging.FromContext(ctx).With("namespace", pr.GetNamespace())

	logger.Infof("=== RECONCILE START: PipelineRun %s/%s (resourceVersion: %s) ===", pr.GetNamespace(), pr.GetName(), pr.GetResourceVersion())
Can you please rephrase the log messages here and below to align with the log conventions used across the project?
done
/ok-to-test
/hold
But in normal cases (without Kueue), if reconciliation fails, e.g. because of the race condition we're facing, or in other cases we might not know about at this moment, we could end up not having a git status at all. The current setup is working fine, so I think we should keep it as it is. cc: @chmouel
I didn't understand — how can this happen?
What do you mean, if reconciliation fails we won't have a git status at all? The reporting from PAC can fail as well; I don't see how it is different.
I am not saying that reporting would fail, but if many controllers are patching the PipelineRun, there will be many reconciliation loops, and I don't know if they cause any errors in reporting. (We also have the PaC concurrency stuff as well.) At this moment I don't know what would break, but this would be a big design change, I guess.
@gbenhaim for unauthorized users, we create a mirror PR to run E2E tests, but when I do this I get a gitlint error about your commit message. Can you please adjust the line lengths in your commit message? (Every line should be less than 80 chars.)
@zakisk can you maybe make me an authorized user :) ?
@zakisk I fixed the commit message
I am just following the etiquette that we generally add people after some contributions 🙂
🚀 Mirrored PR Created for E2E Testing
@zakisk I pushed another commit that should solve the race condition I mentioned in earlier comments. It solves it without moving all the SCM reporting to the watcher.
pkg/reconciler/reconciler.go
Outdated
return fmt.Errorf("cannot get fresh PipelineRun: %w", err)
}

if freshPR.GetResourceVersion() != pr.GetResourceVersion() {
@gbenhaim there is a similar check below on line 125 in this file; if the version doesn't match, it should return.
I moved it to the top of the function, as you, @enarha, and I agreed today.
triggered CI!
} else {
	// Mark that the start will be reported to the Git provider
	patchAnnotations[keys.SCMReportingPLRStarted] = "true"
	patchAnnotations[keys.State] = kubeinteraction.StateStarted
Why do we need to set the state here? Otherwise, I see it is the reconciler that sets the state label and annotation. I understand why SCMReportingPLRStarted was added here, but why change the existing behavior by also setting the state here?
Because the PLR isn't created with the state anymore, it should be set here as well, to support PLRs which are not pending.
The reconciler won't reconcile PLRs without the state annotation:
func checkStateAndEnqueue(impl *controller.Impl) func(obj any) { |
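For readers following along, the gating described above can be sketched without the Knative controller machinery. hasStateAnnotation and the literal key below are illustrative assumptions, not the project's actual code:

```go
package main

import "fmt"

// Illustrative key written out literally; the real filter uses keys.State
// from the project's keys package.
const stateAnnotation = "pipelinesascode.tekton.dev/state"

// hasStateAnnotation models the enqueue filter: only PipelineRuns that
// already carry the state annotation (set by PAC after the initial SCM
// report) are handed to the reconciler.
func hasStateAnnotation(annotations map[string]string) bool {
	_, ok := annotations[stateAnnotation]
	return ok
}

func main() {
	fmt.Println(hasStateAnnotation(map[string]string{stateAnnotation: "started"})) // enqueued
	fmt.Println(hasStateAnnotation(map[string]string{}))                           // skipped until PAC patches the annotation
}
```

Because the annotation is only patched on after PAC sends the initial status, the first reconcile cannot race ahead of the SCM report.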
OK, I see.
/lgtm
@zakisk what is the next step for getting this one merged?
Caution: There are some errors in your PipelineRun template.
How is this error related to my change?
Seems like it was an infra error... not sure why the stepactions CRDs weren't applied at the time you ran this...
/retest
The use of the state label (which is mutable) for deciding when to report to the SCM that the PLR was started is flaky. It was seen that the reconciler gets events about PLRs with unexpected values for the state label. For example, after the status is reported to the SCM and the label value is patched to "started", after several reconcile iterations the label had the "queued" value again. This can happen because of unsafe patching done by controllers (not just the PAC controllers) which reconcile PLRs.

Introduce a new annotation for indicating that the status was reported to the SCM. By adding an annotation which is set once, we remove the risk that its value will get overwritten by other controllers (since maps are merged when patched, values are not removed unless explicitly defined in the patch - https://datatracker.ietf.org/doc/html/rfc7386#section-2).

In addition, at the start of each reconcile loop, ensure that we operate on the latest version of the PLR and do not use a stale value from the cache.

Assisted-By: Cursor Signed-off-by: Gal Ben Haim <[email protected]>
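The RFC 7386 point in the commit message can be illustrated with a minimal sketch. mergePatch below models only the overwrite-and-preserve behavior of a JSON merge patch on a flat annotations map; null-based deletion and the real Kubernetes patch path are out of scope, and the annotation keys are written out literally as assumptions:

```go
package main

import "fmt"

// mergePatch models RFC 7386 semantics on a flat string map: keys present in
// the patch overwrite, keys absent from the patch are left untouched.
func mergePatch(original, patch map[string]string) map[string]string {
	out := make(map[string]string, len(original))
	for k, v := range original {
		out[k] = v
	}
	for k, v := range patch {
		out[k] = v
	}
	return out
}

func main() {
	annotations := map[string]string{
		"pipelinesascode.tekton.dev/scm-reporting-plr-started": "true",
		"pipelinesascode.tekton.dev/state":                     "started",
	}
	// Another controller patches only the state annotation back to "queued"...
	patched := mergePatch(annotations, map[string]string{
		"pipelinesascode.tekton.dev/state": "queued",
	})
	// ...but the set-once reporting annotation survives the merge.
	fmt.Println(patched["pipelinesascode.tekton.dev/scm-reporting-plr-started"]) // true
	fmt.Println(patched["pipelinesascode.tekton.dev/state"])                     // queued
}
```

This is why the mutable state label could flap back to "queued" while a set-once annotation cannot be removed by other controllers' patches unless they explicitly target it.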
Set the state annotation and labels in pipelineascode only after the state was reported to the SCM. This is achieved by patching the PLR with the state label/annotation after it was created.

This change is needed in order to avoid a case where the watcher reports a status before PAC. With this change the issue can't happen, since the watcher only reconciles PLRs that have the state annotation, so the first reconcile of the PLRs will happen only after PAC has already sent the initial status to the SCM.

In addition, simplify the condition which checks if the "queued" state annotation/label should be set. We only need to check if the PLR is pending; it doesn't matter whether the reason is PAC's concurrency control or an external controller.

Signed-off-by: Gal Ben Haim <[email protected]>
Instead of running the reconcile loop on the fresh pipeline, skip it and let the next iteration reconcile it. Signed-off-by: Gal Ben Haim <[email protected]>
I have done a code review and tried to "simulate" a random failure with concurrency in the code:

diff --git a/pkg/reconciler/reconciler.go b/pkg/reconciler/reconciler.go
index 58493f289..14dee5134 100644
--- a/pkg/reconciler/reconciler.go
+++ b/pkg/reconciler/reconciler.go
@@ -276,7 +276,30 @@ func (r *Reconciler) reportFinalStatus(ctx context.Context, logger *zap.SugaredL
return repo, nil
}
+func randomError(prn string) error {
+ if strings.HasPrefix(prn, "test-gh-2") {
+ return fmt.Errorf("DEBUG: 😈 randomly failing this PipelineRun: %s", prn)
+ }
+ /* if we want to generate more randomness on load testing
+ var b [8]byte
+ if _, err := rand.Read(b[:]); err != nil {
+ return fmt.Errorf("failed to generate random number: %w", err)
+ }
+ randomNumber := binary.LittleEndian.Uint64(b[:]) % 100
+
+ if randomNumber < 50 {
+ return fmt.Errorf("DEBUG: 😈 randomly failing this PipelineRun: %s", prn)
+ }
+
+ */
+ return nil
+}
+
func (r *Reconciler) updatePipelineRunToInProgress(ctx context.Context, logger *zap.SugaredLogger, repo *v1alpha1.Repository, pr *tektonv1.PipelineRun) error {
+ if err := randomError(pr.GetName()); err != nil {
+ return err
+ }
+
pr, err := r.updatePipelineRunState(ctx, logger, pr, kubeinteraction.StateStarted)
if err != nil {
return fmt.Errorf("cannot update state: %w", err)

I thought it would fail to continue and get stuck forever when the concurrency got out of order, but it seems to work well and continue until the end. I haven't dug into the commit flow to understand how that happens, though, but at least it works for me.

2025-08-11.14-41-41.mp4
/lgtm
Congrats @gbenhaim, your PR has been approved 🎉
✅ Pull Request Approved
Approval Status:
- Required Approvals: 1
- Current Approvals: 1
👥 Reviewers Who Approved:
Reviewer | Permission Level | Approval Status
---|---|---
@chmouel | admin | ✅
📝 Next Steps
- Ensure all required checks pass
- Comply with branch protection rules
- Request a maintainer to merge using the /merge command (or merge it directly if you have repository permission).
Automated by the PAC Boussole 🧭
/lgtm
Congrats @gbenhaim, your PR has been approved 🎉
✅ Pull Request Approved
Approval Status:
- Required Approvals: 1
- Current Approvals: 2
👥 Reviewers Who Approved:
Reviewer | Permission Level | Approval Status
---|---|---
@chmouel | admin | ✅
@zakisk | admin | ✅
📝 Next Steps
- Ensure all required checks pass
- Comply with branch protection rules
- Request a maintainer to merge using the /merge command (or merge it directly if you have repository permission).
Automated by the PAC Boussole 🧭
…elines#2208) * fix: Annotate PLR when started status is reported The use of the state label (which is mutable) for deciding when to report to the SCM that the PLR was started is flaky. It was seen that the reconciler gets events about PLRs with unexpected values for the state label. For example, after the status is reported to the SCM and the label value is patched to "started", after several reconcile iterations the label had the "queued" value again. This can happen because of unsafe patching done by controllers (not just the PAC controllers) which reconcile PLRs. Introduce a new annotation for indicating that the status was reported to the SCM. By adding an annotation which is set once, we remove the risk that its value will get overwritten by other controllers (since maps are merged when patched, values are not removed unless explicitly defined in the patch - https://datatracker.ietf.org/doc/html/rfc7386#section-2). In addition, at the start of each reconcile loop, ensure that we operate on the latest version of the PLR and do not use a stale value from the cache. Assisted-By: Cursor Signed-off-by: Gal Ben Haim <[email protected]>
📝 Description of the Change
The use of the state label (which is mutable) for deciding when to report to the SCM that the PLR was started is flaky.
It was seen that the reconciler gets events about PLRs with unexpected values for the state label. For example, after the status is reported to the SCM and the label value is patched to "started", after several reconcile iterations the label had the "queued" value again. This can happen because of unsafe patching done by controllers (not just the PAC controllers) which reconcile PLRs.
Introduce a new annotation for indicating that the status was reported to the SCM. By adding an annotation which is set once, we remove the risk that its value will get overwritten by other controllers (since maps are merged when patched, values are not removed unless explicitly defined in the patch).
In addition, at the start of each reconcile loop, ensure that we operate on the latest version of the PLR and do not use a stale value from the cache.
Set the state annotation and labels in pipelineascode only after the state was reported to the SCM. This is achieved by patching the PLR with the state label/annotation after it was created.
This change is needed in order to avoid a case where the watcher reports a status before PAC.
With this change the issue can't happen, since the watcher only reconciles PLRs that have the state annotation, so the first reconcile of the PLRs will happen only after PAC has already sent the initial status to the SCM.
In addition, simplify the condition which checks if the "queued" state annotation/label should be set. We only need to check if the PLR is pending; it doesn't matter whether the reason is PAC's concurrency control or an external controller.
🔗 Linked GitHub Issue
Fixes #
👨🏻 Linked Jira
https://issues.redhat.com/browse/SRVKP-8255
🚀 Type of Change
- fix:
- feat:
- feat!: / fix!:
- docs:
- chore:
- refactor:
- enhance:
🧪 Testing Strategy
✅ Submitter Checklist
- The commit prefix (e.g. fix:, feat:) matches the "Type of Change" I selected above.
- I have run make test and make lint locally to check for and fix any issues. For an efficient workflow, I have considered installing pre-commit and running pre-commit install to automate these checks.