@bernardodemarco
Member

Description

When scaling the compute offering of a stopped k8s cluster, the following exception is logged by the Management Server:

2025-09-08 16:53:27,031 INFO  [c.c.k.c.a.KubernetesClusterActionWorker] (API-Job-Executor-3:ctx-32fdc15a job-43 ctx-3bbdb451) (logid:0a5136ec) Scaling Kubernetes cluster : k8s-min-offering
2025-09-08 16:53:27,035 WARN  [c.c.k.c.a.KubernetesClusterActionWorker] (API-Job-Executor-3:ctx-32fdc15a job-43 ctx-3bbdb451) (logid:0a5136ec) Failed to transition state of the Kubernetes cluster : k8s-min-offering in state Stopped on event ScaleUpRequested
com.cloud.utils.fsm.NoTransitionException: Unable to transition to a new state from Stopped via ScaleUpRequested
	at com.cloud.utils.fsm.StateMachine2.getTransition(StateMachine2.java:108)
	at com.cloud.utils.fsm.StateMachine2.getNextState(StateMachine2.java:94)
	at com.cloud.utils.fsm.StateMachine2.transitTo(StateMachine2.java:124)
	at com.cloud.kubernetes.cluster.actionworkers.KubernetesClusterActionWorker.stateTransitTo(KubernetesClusterActionWorker.java:560)
	at com.cloud.kubernetes.cluster.actionworkers.KubernetesClusterScaleWorker.scaleKubernetesClusterOffering(KubernetesClusterScaleWorker.java:288)
	at com.cloud.kubernetes.cluster.actionworkers.KubernetesClusterScaleWorker.scaleCluster(KubernetesClusterScaleWorker.java:470)
	at com.cloud.kubernetes.cluster.KubernetesClusterManagerImpl.scaleKubernetesCluster(KubernetesClusterManagerImpl.java:1744)
# (...)

Note that although the exception is thrown, it does not interfere with the scaling process.

This PR fixes the exception by adding a new state and the corresponding transitions to the Kubernetes cluster finite state machine.
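
To illustrate the idea, here is a minimal, self-contained sketch of a table-driven state machine. It is not CloudStack's `StateMachine2`; the class `ClusterFsmSketch`, the state name `ScalingStoppedCluster`, and the exact set of transitions are assumptions made for illustration only. It shows why a missing `Stopped` + `ScaleUpRequested` entry produces the "Unable to transition" error, and how registering a dedicated state with its transitions avoids it.

```java
import java.util.EnumMap;
import java.util.Map;

// Simplified, hypothetical stand-in for a cluster state machine.
class ClusterFsmSketch {
    // ScalingStoppedCluster is a made-up name for the new state; the PR's real name may differ.
    enum State { Stopped, Running, Scaling, ScalingStoppedCluster }
    enum Event { ScaleUpRequested, OperationSucceeded }

    static class NoTransitionException extends Exception {
        NoTransitionException(State from, Event via) {
            super("Unable to transition to a new state from " + from + " via " + via);
        }
    }

    // Transition table: current state -> (event -> next state).
    private final Map<State, Map<Event, State>> transitions = new EnumMap<>(State.class);

    void addTransition(State from, Event event, State to) {
        transitions.computeIfAbsent(from, s -> new EnumMap<>(Event.class)).put(event, to);
    }

    State getNextState(State from, Event event) throws NoTransitionException {
        Map<Event, State> byEvent = transitions.get(from);
        State next = (byEvent == null) ? null : byEvent.get(event);
        if (next == null) {
            throw new NoTransitionException(from, event);
        }
        return next;
    }

    // Table with scaling transitions for running clusters only.
    static ClusterFsmSketch withoutStoppedScaling() {
        ClusterFsmSketch fsm = new ClusterFsmSketch();
        fsm.addTransition(State.Running, Event.ScaleUpRequested, State.Scaling);
        fsm.addTransition(State.Scaling, Event.OperationSucceeded, State.Running);
        return fsm;
    }

    // Same table plus an assumed state and transitions for stopped clusters.
    static ClusterFsmSketch withStoppedScaling() {
        ClusterFsmSketch fsm = withoutStoppedScaling();
        fsm.addTransition(State.Stopped, Event.ScaleUpRequested, State.ScalingStoppedCluster);
        fsm.addTransition(State.ScalingStoppedCluster, Event.OperationSucceeded, State.Stopped);
        return fsm;
    }

    public static void main(String[] args) throws Exception {
        try {
            withoutStoppedScaling().getNextState(State.Stopped, Event.ScaleUpRequested);
        } catch (NoTransitionException e) {
            System.out.println("before: " + e.getMessage());
        }
        State next = withStoppedScaling().getNextState(State.Stopped, Event.ScaleUpRequested);
        System.out.println("after: " + next);
    }
}
```

Using a separate state for stopped clusters, rather than reusing the running-cluster scaling state, lets a successful operation return the cluster to `Stopped` instead of `Running`. Whether the actual change is structured this way is an assumption of this sketch.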

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI
  • test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

Before upgrading the environment with the current PR's changes, I verified that when scaling the compute offering of stopped k8s clusters, the exception was thrown.

After applying the PR's changes, I verified that when scaling the compute offering of stopped k8s clusters, no exceptions were thrown.

@bernardodemarco
Member Author

@blueorangutan package

@blueorangutan

@bernardodemarco a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@codecov

codecov bot commented Sep 8, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 4.00%. Comparing base (9349b69) to head (6dd54e6).
⚠️ Report is 10 commits behind head on 4.20.

❗ There is a different number of reports uploaded between BASE (9349b69) and HEAD (6dd54e6). Click for more details.

HEAD has 1 upload less than BASE
Flag BASE (9349b69) HEAD (6dd54e6)
unittests 1 0
Additional details and impacted files
@@              Coverage Diff              @@
##               4.20   #11598       +/-   ##
=============================================
- Coverage     16.17%    4.00%   -12.17%     
=============================================
  Files          5656      402     -5254     
  Lines        498082    32642   -465440     
  Branches      60415     5799    -54616     
=============================================
- Hits          80569     1308    -79261     
+ Misses       408551    31183   -377368     
+ Partials       8962      151     -8811     
Flag Coverage Δ
uitests 4.00% <ø> (-0.01%) ⬇️
unittests ?

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 14906

@bernardodemarco
Member Author

@blueorangutan package

@blueorangutan

@bernardodemarco a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 14919

@weizhouapache
Member

@blueorangutan test

@blueorangutan

@weizhouapache a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-14259)
Environment: kvm-ol8 (x2), zone: Advanced Networking with Mgmt server ol8
Total time taken: 52968 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr11598-t14259-kvm-ol8.zip
Smoke tests completed. 141 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@weizhouapache self-assigned this Sep 10, 2025
Member

@weizhouapache left a comment

code lgtm

I will test it

@weizhouapache
Member

tested ok @bernardodemarco

  • create CKS cluster
  • stop it
  • scale it with different offering: works
  • scale it with the same offering: works, but noticed an exception in log
2025-09-10 10:25:54,591 WARN  [c.c.k.c.a.KubernetesClusterScaleWorker] (API-Job-Executor-3:[ctx-4c667cd3, job-94, ctx-37993c65]) (logid:29bd2ac2) Failed to transition state of the Kubernetes cluster : cks-002 in state Stopped on event OperationSucceeded
com.cloud.utils.fsm.NoTransitionException: Unable to transition to a new state from Stopped via OperationSucceeded

@bernardodemarco
I suggest adding a simple change as below:

diff --git a/plugins/integrations/kubernetes-service/src/main/java/com/cloud/kubernetes/cluster/actionworkers/KubernetesClusterScaleWorker.java b/plugins/integrations/kubernetes-service/src/main/java/com/cloud/kubernetes/cluster/actionworkers/KubernetesClusterScaleWorker.java
index 6fb15088e9b..f6828e3b203 100644
--- a/plugins/integrations/kubernetes-service/src/main/java/com/cloud/kubernetes/cluster/actionworkers/KubernetesClusterScaleWorker.java
+++ b/plugins/integrations/kubernetes-service/src/main/java/com/cloud/kubernetes/cluster/actionworkers/KubernetesClusterScaleWorker.java
@@ -476,6 +476,8 @@ public class KubernetesClusterScaleWorker extends KubernetesClusterResourceModif
             scaleKubernetesClusterOffering();
         } else if (clusterSizeScalingNeeded) {
             scaleKubernetesClusterSize();
+        } else {
+            return true;
         }
         stateTransitTo(kubernetesCluster.getId(), KubernetesCluster.Event.OperationSucceeded);
         return true;
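
The effect of the suggested `else` branch can be sketched in isolation. The class below is a hypothetical stand-in, not the real `KubernetesClusterScaleWorker`: the names `ScaleFlowSketch`, `offeringScalingNeeded`, `inScalingState`, and `failedTransitions` are assumptions, and the `earlyReturn` flag exists only so the behavior with and without the suggested branch can be compared. It assumes that both scaling paths move the cluster into a scaling state, and that firing `OperationSucceeded` outside that state is what produces the warning.

```java
// Hypothetical model of the scale worker's control flow; not CloudStack code.
class ScaleFlowSketch {
    boolean inScalingState = false; // true once a scaling path has started
    int failedTransitions = 0;      // counts OperationSucceeded fired in a non-scaling state

    // Assumed: both scaling paths put the cluster into a scaling state first.
    private void beginScaling() {
        inScalingState = true;
    }

    // Models stateTransitTo(..., OperationSucceeded): only valid while scaling.
    private void fireOperationSucceeded() {
        if (!inScalingState) {
            failedTransitions++; // corresponds to the logged NoTransitionException
            return;
        }
        inScalingState = false;
    }

    boolean scale(boolean offeringScalingNeeded, boolean clusterSizeScalingNeeded, boolean earlyReturn) {
        if (offeringScalingNeeded) {
            beginScaling();
        } else if (clusterSizeScalingNeeded) {
            beginScaling();
        } else if (earlyReturn) {
            return true; // nothing to scale, so no state event is fired
        }
        fireOperationSucceeded();
        return true;
    }

    public static void main(String[] args) {
        ScaleFlowSketch withoutBranch = new ScaleFlowSketch();
        withoutBranch.scale(false, false, false);
        System.out.println("without early return: " + withoutBranch.failedTransitions);

        ScaleFlowSketch withBranch = new ScaleFlowSketch();
        withBranch.scale(false, false, true);
        System.out.println("with early return: " + withBranch.failedTransitions);
    }
}
```

In both variants the call still reports success; the early return only skips the state event when there was no state change to complete.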

@bernardodemarco
Member Author

> tested ok @bernardodemarco

@weizhouapache, thanks for testing!

> I suggest adding a simple change as below

Yes, makes sense. I'll apply that soon. Thanks!

@bernardodemarco
Member Author

@weizhouapache, done! Verified now that when scaling a stopped k8s cluster to the same offering, no exception is thrown

@bernardodemarco
Member Author

@blueorangutan package

@blueorangutan

@bernardodemarco a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@weizhouapache
Member

> @weizhouapache, done! Verified now that when scaling a stopped k8s cluster to the same offering, no exception is thrown

good, tested ok, thanks @bernardodemarco

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 14939

@weizhouapache
Member

@blueorangutan test

@blueorangutan

@weizhouapache a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@weizhouapache
Member

tested ok

  • scale running cluster
  • scale stopped cluster to same offering
  • scale stopped cluster to another offering

@weizhouapache
Member

We had some issues with the testing environments.
Merging based on the approvals and manual test results. I will keep an eye on the smoke tests of the health check PR.

@weizhouapache merged commit 7c727a3 into apache:4.20 Sep 11, 2025
25 of 26 checks passed
@blueorangutan

[SF] Trillian test result (tid-14283)
Environment: kvm-ol8 (x2), zone: Advanced Networking with Mgmt server ol8
Total time taken: 59010 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr11598-t14283-kvm-ol8.zip
Smoke tests completed. 141 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Sep 15, 2025
…he#11598)

* add new k8s cluster transition

* apply suggestion

* apply suggestion