Degraded cloudstack agent #12450
-
Problem: Over time, the CloudStack agent becomes increasingly slow when starting or stopping virtual machines. This performance degradation is especially noticeable during bulk VM operations, where the delays can become significant. The only effective workaround I've found so far is to restart the CloudStack agent service, which temporarily restores normal performance.
Versions: 4.20.1
Steps to reproduce the bug:
What to do about it?
Replies: 84 comments 7 replies
-
Thanks for opening your first issue here! Be sure to follow the issue template!
-
I have a similar problem, with other symptoms as well. For example, alerts in Cloudstack like "Health checks failed" for virtual routers on affected hosts. The agent is constantly consuming 100% CPU, even when there are no jobs or any current actions on that host.
-
@poli1025 is degradation specific only for the agent or is there overall slowness in handling start-stop API calls? (So we can rule out any MS issue)
-
We’ve identified that the issue is specifically related to the agent — there is no noticeable delay in the API calls. The bulk operations are being executed via Terraform, such as the creation of 50 machines. The next day, when we destroy and recreate them, the process takes significantly more time. However, if we restart the agent, the performance returns to normal. Over time, the degradation happens again, and the creation time can eventually triple.
-
I created 50 VM instances in parallel; the deployments and agents worked well. Need more testing and investigation.
-
@weizhouapache The initial deployments and agents work well. However, after a few days we observe degradation: VM creation, power cycling, and console access start to slow down.
-
The problem seems to be a leak in the handler threads while checking storage usage. The more agent threads you configure in agent.properties, and the shorter the interval you configure for retrieving volume usage metrics, the worse it gets and the faster it happens.

On a fresh start of the agent, you get the 'Trying to fetch storage pool xxxx from libvirt' message whenever the usage service retrieves updated metrics. Those requests are either leaking or not getting garbage collected in time. They start to overlap, and you end up seeing the same request to the same primary storage tens or hundreds of times.

The only way to recover from that is to restart the agent, limit the number of agent threads, and read the usage metrics at longer intervals (I think it defaults to 10 minutes or something like that; setting it to once every two hours mitigates it a bit, just enough that you don't have to restart the agent every few hours to keep it from hogging the KVM node's CPU).

Here's a log with redacted storage UUIDs so it's easier to see (take a look at the timestamps). This happens at least since CloudStack 4.19.
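As a concrete sketch of that mitigation (the setting names below are my recollection, not verified against 4.20, so check your version's documentation): `workers` in agent.properties sizes the agent's handler-thread pool, and the volume-usage polling interval is the management server's `storage.stats.interval` global setting, in milliseconds.

```
# /etc/cloudstack/agent/agent.properties (assumed default path)
# Fewer handler threads means fewer overlapping storage-usage requests
workers=3
```

On the management-server side, setting the global setting `storage.stats.interval` to e.g. 7200000 (two hours, in milliseconds) and then restarting the management server would match the "once every two hours" mitigation described above.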
-
thanks for the information @vgarcia-linube
-
We also have this kind of issue. After a few days the agent stops executing tasks; after restarting the agent it starts working again.
Is there any useful debugging data I can collect when this error occurs?
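When it happens again, a few JVM thread dumps taken close together are usually the most useful thing to attach, since a handler-thread leak shows up as the same stacks repeated across dumps. A minimal sketch, assuming a JDK with `jcmd` on the host, that the agent's main class is `com.cloud.agent.AgentShell`, and the default log path `/var/log/cloudstack/agent/agent.log` (adjust names and paths for your install):

```shell
#!/bin/sh
# Sketch: collect thread dumps and a log tail from the CloudStack agent JVM.
OUT=/tmp/acs-agent-debug
mkdir -p "$OUT"

# The agent is a java process; this match pattern is an assumption.
PID=$(pgrep -f 'com.cloud.agent.AgentShell' | head -n1)

if [ -n "$PID" ] && command -v jcmd >/dev/null 2>&1; then
    # Three dumps a few seconds apart make stuck threads easy to spot.
    for i in 1 2 3; do
        jcmd "$PID" Thread.print > "$OUT/threads-$i.txt" 2>&1
        sleep 5
    done
fi

# Keep the recent agent log from the same time window (assumed path).
if [ -f /var/log/cloudstack/agent/agent.log ]; then
    tail -n 2000 /var/log/cloudstack/agent/agent.log > "$OUT/agent-log-tail.txt"
fi

echo "collected into $OUT"
```

Comparing the dumps with the 'Trying to fetch storage pool' log lines mentioned earlier in the thread should show whether the storage-usage threads are piling up.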
-
I wonder if this is the same issue I've been struggling with. That said, at least in my case, just restarting the agent isn't always enough. Something gets 'hung up', like a lock, when the agent tries to connect to the management nodes, which means sometimes I need to restart the management nodes as well.
-
I also had situations where the agent was restarted and up and running (at least from what I could see in the logs, everything was fine), but it took several minutes for the agent to reconnect to the management servers. Usually this happens instantly.
-
@jgotteswinter We've seen that too, but it's usually pressure on other parts of the stack. For example, if the primary storage server is slow to respond, restarting the cloudstack-agent may need more time to initialize as it loops through the storage domains.

On this specific issue, we've reconfigured the agent to set its logs to debug, but we didn't find any more clues than the logs we shared here.

With about 500 instances in our clusters, we're seeing this event once every two days approximately, but the timing varies with node capability and the number of machines concurrently running on that specific node.
-
@vgarcia-linube on the settings that helped, you said:
Can you provide the exact setting names (and where they live, such as agent.properties or something in the management config) and the values you used that helped? I'd like to see if that helps me. Also, does preemptively restarting the agent before it gets too bad, in something like cron.daily, prevent the issue from occurring?
-
@bhouse-nexthop A few days ago I configured a daily cron job which restarts the agent every night. So far, the issue has not shown up again. I was thinking about configuring JMX; maybe that could give more information?
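For reference, the nightly-restart stopgap described here can live in a cron.d file along these lines (the schedule and file name are arbitrary choices; `cloudstack-agent` is the standard systemd unit name on recent packages):

```
# /etc/cron.d/restart-cloudstack-agent  (hypothetical file name)
# Nightly agent restart as a stopgap until the thread leak is fixed
15 3 * * * root systemctl restart cloudstack-agent
```

Note that files in /etc/cron.d need the user field (`root` here), unlike a per-user crontab.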
-
@jgotteswinter
-
@DaanHoogland It would be good to isolate and get a patch upstream to Ubuntu as excluding Ubuntu 24.04 from a supported release seems extreme.
-
@bradh352 , I meant to exclude the combination ubuntu-24/libvirt-10.4 , not the entire range of ubuntu distros (though I am tempted)
-
@DaanHoogland why tempted?
-
CloudStack doesn't really advertise Debian as a first-class distro, as it doesn't list community packages for it: https://cloudstack.apache.org/downloads/#community-packages Interestingly, I am at least seeing bookworm (Debian 12) packages here: It would be good to get trixie in there and make them more official by referencing them on the download page. I probably would have gone Debian on a deployment if I had thought it was supported at the same level as Ubuntu.
-
I don't know how widespread Debian is as a platform for ACS; this might need testing. The Debian 12 life cycle encompasses five years: the initial three years of full Debian support, until June 10th, 2026, and two years of Long Term Support (LTS), until June 30th, 2028.

Bookworm has libvirt 9.0.0; Trixie is using 11.3.0. That's a huge version step, and none of the EL releases are on 11.x yet, AFAIK. I will try to find time to add a Debian 13 host in our dev environment.

@bhouse-nexthop ShapeBlue has generically marked Debian / Ubuntu packages: https://www.shapeblue.com/cloudstack-packages/
-
sheer prejudice
-
I have been testing Debian 13 since yesterday with automated, highly concurrent load tests around the instance and volume lifecycle; so far no signs of problems. We will probably move on to Debian.
-
hey folks, I bring good news. I managed to identify where the bug is and is not on
FYI, @cpaelzer |
-
hmm, the changeset between 10.4 and 10.6 isn't exactly small; I don't know how easy it would be to identify the actual commit that fixes this: libvirt/libvirt@v10.4.0...v10.6.0
-
I can confirm that 11.3 also works perfectly fine
-
So far, the plan is to dissect even more. I have increased my testing environment from 3 to 5 KVM hosts, and I have 2 more on the way. This might speed up the process of identifying the change that fixed the bug.
-
After two weeks of operation in the production environment, I can confirm that the agent with libvirt-10.6.0-1ubuntu3.3 behaves correctly without any apparent problems. The CPU load for the agent process still ranges from 0.x% to 1.8%.

For now, we will have libvirt-10.6.0-1ubuntu3.3 in production as a temporary solution. If the bug fix is backported, we will return (after testing) to the original libvirt from the Ubuntu distribution. If the bug is not fixed, we will remain on libvirt-10.6.0-1ubuntu3.3 until the major upgrade to Ubuntu 26.04 (depending on compatibility with CloudStack).
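If you want apt to keep that exact build in place until you deliberately upgrade, an apt preferences pin is one way to do it (the `libvirt*` glob and the version string below are assumptions; check `apt policy libvirt-daemon` for the actual package names and versions on your hosts):

```
# /etc/apt/preferences.d/pin-libvirt  (hypothetical file)
Package: libvirt*
Pin: version 10.6.0-1ubuntu3.3
Pin-Priority: 1001
```

Alternatively, `apt-mark hold` on the installed libvirt packages prevents unattended upgrades from replacing them.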
-
Is this libvirt 10 issue Ubuntu specific? If not, it would probably be a good idea to collect working libvirt + qemu combinations for specific ACS releases instead of focusing on distributions. There is another Ubuntu 24.04 issue ongoing: #12427
-
Could someone who has migrated from 10.0 to 10.6 share a verified action plan for a trouble-free upgrade? A brief manual would be greatly appreciated. 🙏