
Conversation

@SimonTheLeg
Contributor

With this change we now explicitly wait for WorkspaceTypes to report themselves, and their VW URLs, as ready.

We will see if this already fixes the flakiness we have in the test. If not, we'll need to investigate why a VW sometimes cannot be watched within 30 seconds due to failing permissions (for example, see https://public-prow.kcp.k8c.io/view/s3/prow-public-data/pr-logs/pull/kcp-dev_kcp/3412/pull-kcp-test-e2e-sharded/1981638200807919616)
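
For illustration, a minimal sketch of what such a wait could look like in the e2e test. The condition type, the VirtualWorkspaces status field, the client interfaces, and the import paths are assumptions for the sketch, not necessarily the exact code in this PR; `conditionIsTrue` is the small helper visible in the review diff further down this thread:

```go
package framework

import (
	"context"
	"testing"
	"time"

	"github.com/kcp-dev/logicalcluster/v3"
	"github.com/stretchr/testify/require"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"

	tenancyv1alpha1 "github.com/kcp-dev/kcp/sdk/apis/tenancy/v1alpha1"
	kcpclientset "github.com/kcp-dev/kcp/sdk/client/clientset/versioned/cluster"
)

// waitForWorkspaceTypeReady polls a WorkspaceType until it reports itself,
// and its virtual workspace URLs, as ready. The condition type and the
// VirtualWorkspaces status field are assumptions about the tenancy API.
func waitForWorkspaceTypeReady(ctx context.Context, t *testing.T, client kcpclientset.ClusterInterface, path logicalcluster.Path, name string) {
	t.Helper()
	require.Eventually(t, func() bool {
		wt, err := client.Cluster(path).TenancyV1alpha1().WorkspaceTypes().Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return false
		}
		// Only proceed once the URL condition is true and at least one
		// virtual workspace URL has actually been published.
		return conditionIsTrue(wt.Status.Conditions, tenancyv1alpha1.WorkspaceTypeVirtualWorkspaceURLsReady) &&
			len(wt.Status.VirtualWorkspaces) > 0
	}, wait.ForeverTestTimeout, 100*time.Millisecond, "WorkspaceType %s|%s never became ready", path, name)
}
```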

Summary

What Type of PR Is This?

/kind bug

Related Issue(s)

Fixes #

Release Notes

NONE

@kcp-ci-bot kcp-ci-bot added release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has signed the DCO. kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 24, 2025
}
}

func conditionIsTrue(conditions conditionsv1alpha1.Conditions, conditionType conditionsv1alpha1.ConditionType) bool {
Contributor


This could be replaced with conditionsv1alpha1.IsTrue(wt, conditionType)
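
For reference, the suggested change side by side (a sketch; the exact `IsTrue` signature is as suggested above, assuming `wt` satisfies the conditions package's getter interface):

```go
// Before: hand-rolled lookup over the conditions slice.
ready := conditionIsTrue(wt.Status.Conditions, conditionType)

// After: the suggested helper, which takes the object itself.
ready = conditionsv1alpha1.IsTrue(wt, conditionType)
```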

Contributor Author


ah very nice! Changed it.

@SimonTheLeg SimonTheLeg force-pushed the termiantors-wait-for-wst-ready branch from f9909b8 to 06ae166 Compare October 28, 2025 09:32
@kcp-ci-bot kcp-ci-bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 28, 2025
@xrstf
Contributor

xrstf commented Oct 28, 2025

@SimonTheLeg is that test failure a flake you want to look at, or wdyt?

@SimonTheLeg
Contributor Author

This is a different issue. The flake before was that the admin was not able to access the workspaces, presumably because the workspace was marked for deletion too early (e.g. the role had not synced yet). Luckily this no longer seems to occur with this change.

This time it is user-1 having this issue, which raises the question of whether we have a flake in our permission reconciliation or in the applying of the ClusterRoles. I had a look at the kcp.log of this particular run, but everything looks exactly like it does for a successful run: it updates the global cache accordingly (with no direct error), and then in the logs you only see requests failing afterwards:

I1023 13:50:20.817761   63219 labelclusterrolebinding_controller.go:168] "queueing ClusterRoleBinding" reconciler="kcp-tenancy-replicate-clusterrolebinding" key="1hjsxwowl68q6u29|1hjsxwowl68q6u29:gamma-ofbewlbtbr-terminator"
I1023 13:50:20.817833   63219 replication_reconcile.go:187] "Creating object in global cache" component="kcp" postStartHook="kcp-start-controllers" reconciler="kcp-replication-controller" key="v1.clusterrolebindings.rbac.authorization.k8s.io::1hjsxwowl68q6u29|1hjsxwowl68q6u29:gamma-ofbewlbtbr-terminator" reconcilerKey="1hjsxwowl68q6u29|1hjsxwowl68q6u29:gamma-ofbewlbtbr-terminator"
I1023 13:50:20.818767   63219 replication_reconcile.go:209] "Updating object in global cache" component="kcp" postStartHook="kcp-start-controllers" reconciler="kcp-replication-controller" key="v1.clusterroles.rbac.authorization.k8s.io::1hjsxwowl68q6u29|1hjsxwowl68q6u29:gamma-ofbewlbtbr-terminator" reconcilerKey="1hjsxwowl68q6u29|1hjsxwowl68q6u29:gamma-ofbewlbtbr-terminator" kind="ClusterRole" namespace="" name="1hjsxwowl68q6u29:gamma-ofbewlbtbr-terminator"
I1023 13:50:20.817863   63219 labelclusterrolebinding_controller.go:168] "queueing ClusterRoleBinding" reconciler="kcp-apis-replicate-clusterrolebinding" key="1hjsxwowl68q6u29|1hjsxwowl68q6u29:gamma-ofbewlbtbr-terminator"
I1023 13:50:20.818896   63219 labelclusterrolebinding_controller.go:228] "processing key" component="kcp" postStartHook="kcp-start-controllers" reconciler="kcp-apis-replicate-clusterrolebinding" key="1hjsxwowl68q6u29|1hjsxwowl68q6u29:gamma-ofbewlbtbr-terminator"
I1023 13:50:20.818044   63219 httplog.go:134] "HTTP" verb="GET" URI="/clusters/2pgvufntpx5q7kx4/apis/core.kcp.io/v1alpha1/logicalclusters/cluster" latency="1.531337ms" userAgent="kcp/v1.33.3+kcp (linux/arm64) kubernetes/5da7ab5/kcp-workspace+root" audit-ID="e067a559-5107-441f-9d2c-4f7ee0e46f19" srcIP="127.0.0.1:35110" apf_pl="exempt" apf_fs="exempt" apf_iseats=1 apf_fseats=0 apf_additionalLatency="0s" apf_execution_time="1.305494ms" resp=200
I1023 13:50:20.817736   63219 labelclusterrolebinding_controller.go:168] "queueing ClusterRoleBinding" reconciler="kcp-core-replicate-clusterrolebinding" key="1hjsxwowl68q6u29|1hjsxwowl68q6u29:gamma-ofbewlbtbr-terminator"
I1023 13:50:20.818993   63219 labelclusterrolebinding_controller.go:228] "processing key" component="kcp" postStartHook="kcp-start-controllers" reconciler="kcp-tenancy-replicate-clusterrolebinding" key="1hjsxwowl68q6u29|1hjsxwowl68q6u29:gamma-ofbewlbtbr-terminator"
I1023 13:50:20.817749   63219 labelclusterrole_controller.go:166] "queueing ClusterRole" reconciler="kcp-tenancy-replicate-clusterrole" key="1hjsxwowl68q6u29|1hjsxwowl68q6u29:gamma-ofbewlbtbr-terminator" reason="ClusterRoleBinding" ClusterRoleBinding.name="1hjsxwowl68q6u29:gamma-ofbewlbtbr-terminator"
I1023 13:50:20.819017   63219 labelclusterrole_controller.go:228] "processing key" component="kcp" postStartHook="kcp-start-controllers" reconciler="kcp-tenancy-replicate-clusterrole" key="1hjsxwowl68q6u29|1hjsxwowl68q6u29:gamma-ofbewlbtbr-terminator"
I1023 13:50:20.819002   63219 labelclusterrolebinding_controller.go:228] "processing key" component="kcp" postStartHook="kcp-start-controllers" reconciler="kcp-core-replicate-clusterrolebinding" key="1hjsxwowl68q6u29|1hjsxwowl68q6u29:gamma-ofbewlbtbr-terminator"
I1023 13:50:20.817852   63219 labelclusterrole_controller.go:166] "queueing ClusterRole" reconciler="kcp-apis-replicate-clusterrole" key="1hjsxwowl68q6u29|1hjsxwowl68q6u29:gamma-ofbewlbtbr-terminator" reason="ClusterRoleBinding" ClusterRoleBinding.name="1hjsxwowl68q6u29:gamma-ofbewlbtbr-terminator"
I1023 13:50:20.819686   63219 labelclusterrole_controller.go:228] "processing key" component="kcp" postStartHook="kcp-start-controllers" reconciler="kcp-apis-replicate-clusterrole" key="1hjsxwowl68q6u29|1hjsxwowl68q6u29:gamma-ofbewlbtbr-terminator"
I1023 13:50:20.822125   63219 httplog.go:134] "HTTP" verb="GET" URI="/clusters/2pgvufntpx5q7kx4/apis/core.kcp.io/v1alpha1/logicalclusters/cluster" latency="2.577828ms" userAgent="kcp/v1.33.3+kcp (linux/arm64) kubernetes/5da7ab5/kcp-workspace+root" audit-ID="f778af17-9499-4a3b-a265-23858b662a74" srcIP="127.0.0.1:35110" apf_pl="exempt" apf_fs="exempt" apf_iseats=1 apf_fseats=0 apf_additionalLatency="0s" apf_execution_time="2.143503ms" resp=200
I1023 13:50:20.823111   63219 workspace_reconcile_phase.go:114] "LogicalCluster is still deleting, requeuing" component="kcp" postStartHook="kcp-start-controllers" reconciler="kcp-workspace" key="1hjsxwowl68q6u29|e2e-workspace-45gs4" workspace.workspace="1hjsxwowl68q6u29" workspace.namespace="" workspace.name="e2e-workspace-45gs4" workspace.apiVersion="" reconciler="phase" cluster="2pgvufntpx5q7kx4" after="964.62013ms"
I1023 13:50:20.823409   63219 httplog.go:134] "HTTP" verb="POST" URI="/services/cache/shards/root/clusters/1hjsxwowl68q6u29/apis/rbac.authorization.k8s.io/v1/clusterrolebindings" latency="2.975592ms" userAgent="kcp/v1.33.3+kcp (linux/arm64) kubernetes/5da7ab5" audit-ID="aa9723aa-074c-46a3-97ac-fbbc1f9ab2b3" srcIP="127.0.0.1:35110" resp=201
I1023 13:50:20.824408   63219 replication_reconcile.go:209] "Updating object in global cache" component="kcp" postStartHook="kcp-start-controllers" reconciler="kcp-replication-controller" key="v1.clusterrolebindings.rbac.authorization.k8s.io::1hjsxwowl68q6u29|1hjsxwowl68q6u29:gamma-ofbewlbtbr-terminator" reconcilerKey="1hjsxwowl68q6u29|1hjsxwowl68q6u29:gamma-ofbewlbtbr-terminator" kind="ClusterRoleBinding" namespace="" name="1hjsxwowl68q6u29:gamma-ofbewlbtbr-terminator"
I1023 13:50:20.827415   63219 httplog.go:134] "HTTP" verb="PUT" URI="/services/cache/shards/root/clusters/1hjsxwowl68q6u29/apis/rbac.authorization.k8s.io/v1/clusterrolebindings/1hjsxwowl68q6u29:gamma-ofbewlbtbr-terminator" latency="2.680789ms" userAgent="kcp/v1.33.3+kcp (linux/arm64) kubernetes/5da7ab5" audit-ID="c6784973-143b-44c9-9978-97feac7b6d99" srcIP="127.0.0.1:35110" resp=200
I1023 13:50:20.828806   63219 replication_reconcile.go:209] "Updating object in global cache" component="kcp" postStartHook="kcp-start-controllers" reconciler="kcp-replication-controller" key="v1.clusterrolebindings.rbac.authorization.k8s.io::1hjsxwowl68q6u29|1hjsxwowl68q6u29:gamma-ofbewlbtbr-terminator" reconcilerKey="1hjsxwowl68q6u29|1hjsxwowl68q6u29:gamma-ofbewlbtbr-terminator" kind="ClusterRoleBinding" namespace="" name="1hjsxwowl68q6u29:gamma-ofbewlbtbr-terminator"
I1023 13:50:20.831612   63219 httplog.go:134] "HTTP" verb="PUT" URI="/services/cache/shards/root/clusters/1hjsxwowl68q6u29/apis/rbac.authorization.k8s.io/v1/clusterroles/1hjsxwowl68q6u29:gamma-ofbewlbtbr-terminator" latency="12.525095ms" userAgent="kcp/v1.33.3+kcp (linux/arm64) kubernetes/5da7ab5" audit-ID="b16d5d38-8e4b-4a02-917b-c08fbfaf9d01" srcIP="127.0.0.1:35110" resp=200
I1023 13:50:20.832979   63219 httplog.go:134] "HTTP" verb="PUT" URI="/services/cache/shards/root/clusters/1hjsxwowl68q6u29/apis/rbac.authorization.k8s.io/v1/clusterrolebindings/1hjsxwowl68q6u29:gamma-ofbewlbtbr-terminator" latency="3.657279ms" userAgent="kcp/v1.33.3+kcp (linux/arm64) kubernetes/5da7ab5" audit-ID="c5e48176-6eee-4d9c-9bef-f1c000e727fb" srcIP="127.0.0.1:35110" resp=200
I1023 13:50:20.909394   63219 authorization.go:93] "Forbidden" URI="/services/terminatingworkspaces/1hjsxwowl68q6u29:alpha-krsavygsbu/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters" reason="access denied"
I1023 13:50:20.909557   63219 httplog.go:134] "HTTP" verb="LIST" URI="/services/terminatingworkspaces/1hjsxwowl68q6u29:alpha-krsavygsbu/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters" latency="341.444µs" userAgent="terminatingworkspaces.test/v0.0.0 (linux/arm64) kubernetes/$Format/TestTerminatingWorkspacesVirtualWorkspaceAccess-virtual" audit-ID="9a199a70-67cb-4291-b5b3-dc6766bb83a1" srcIP="127.0.0.1:35086" resp=403
I1023 13:50:21.006112   63219 workspace_controller.go:299] "processing key" component="kcp" postStartHook="kcp-start-controllers" reconciler="kcp-workspace" key="1hjsxwowl68q6u29|e2e-workspace-pvk7s"
I1023 13:50:21.008852   63219 httplog.go:134] "HTTP" verb="GET" URI="/clusters/h25xyujjeq713wzp/apis/core.kcp.io/v1alpha1/logicalclusters/cluster" latency="2.273344ms" userAgent="kcp/v1.33.3+kcp (linux/arm64) kubernetes/5da7ab5/kcp-workspace+root" audit-ID="57bfd789-43b4-4bd0-8792-8504533dd203" srcIP="127.0.0.1:35110" apf_pl="exempt" apf_fs="exempt" apf_iseats=1 apf_fseats=0 apf_additionalLatency="0s" apf_execution_time="1.204933ms" resp=200
I1023 13:50:21.009558   63219 authorization.go:93] "Forbidden" URI="/services/terminatingworkspaces/1hjsxwowl68q6u29:alpha-krsavygsbu/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters" reason="access denied"

For now I suggest we get this PR merged, because it fixes a different possible source of flakiness. I'll then create a bug ticket for the other one. So far I have no clue what the issue could be :/

@SimonTheLeg
Contributor Author

/retest

@SimonTheLeg
Contributor Author

The only thing I can think of is some sort of race condition around the LogicalCluster being marked for deletion: if the cache is fast enough, the ClusterRole still gets applied, and if not, we just stop doing it, leading to this issue 🤔 See the sketch below.
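
Purely as an illustration of the suspected window (a hypothetical sketch, not the actual kcp reconciler; `replicateToGlobalCache` is a made-up stand-in for the replication step):

```go
package sketch

import (
	"context"

	rbacv1 "k8s.io/api/rbac/v1"

	corev1alpha1 "github.com/kcp-dev/kcp/sdk/apis/core/v1alpha1"
)

// replicateToGlobalCache is a made-up stand-in for the replication step.
func replicateToGlobalCache(ctx context.Context, cr *rbacv1.ClusterRole) error {
	return nil
}

// reconcileClusterRole sketches the suspected check-then-act race: whether
// the ClusterRole reaches the global cache depends on whether this reconcile
// runs before or after the LogicalCluster is marked for deletion.
func reconcileClusterRole(ctx context.Context, lc *corev1alpha1.LogicalCluster, cr *rbacv1.ClusterRole) error {
	if lc.DeletionTimestamp != nil {
		// Marked for deletion first: replication stops here, the ClusterRole
		// never becomes visible, and subsequent requests get 403s as in the
		// log above.
		return nil
	}
	// The cache was fast enough: the ClusterRole is still applied.
	return replicateToGlobalCache(ctx, cr)
}
```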

@kcp-ci-bot kcp-ci-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 4, 2025
@SimonTheLeg SimonTheLeg force-pushed the termiantors-wait-for-wst-ready branch from 06ae166 to dd0498a Compare November 4, 2025 09:27
@kcp-ci-bot kcp-ci-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 4, 2025
@SimonTheLeg
Contributor Author

/retest

2 similar comments
@SimonTheLeg
Contributor Author

/retest

@xrstf
Contributor

xrstf commented Nov 6, 2025

/retest

@xrstf
Contributor

xrstf commented Nov 6, 2025

/approve
/lgtm

@kcp-ci-bot kcp-ci-bot added the lgtm Indicates that a PR is ready to be merged. label Nov 6, 2025
@kcp-ci-bot
Contributor

LGTM label has been added.

Git tree hash: 252ecb340f7421f2290bd57ccf11f11b6cd26713

@kcp-ci-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: xrstf

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kcp-ci-bot kcp-ci-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 6, 2025
@kcp-ci-bot kcp-ci-bot merged commit 5609047 into kcp-dev:main Nov 6, 2025
14 checks passed