api,agent,server,engine-schema: scalability improvements #9840

shwstppr · 2024-10-23T10:05:27Z

Description

Following changes and improvements have been added:

Improvements in handling of PingRoutingCommand
1. Added global config - vm.sync.power.state.transitioning, default value: true, to control syncing of power states for transitioning VMs. This can be set to false to prevent computation of transitioning state VMs.
2. Improved VirtualMachinePowerStateSync to allow power state sync for host VMs in a batch
3. Optimized scanning stalled VMs
Added option to set worker threads for capacity calculation using config - capacity.calculate.workers
Added caching for account/use role API access; the expiration-after-write period can be configured using config - dynamic.apichecker.cache.period. If set to zero, there will be no caching. Default is 0.
Added caching for account/use role API access with expiration after write set to 60 seconds.
Added caching for some recurring DB retrievals
1. CapacityManager - listing service offerings - beneficial in host capacity calculation
2. LibvirtServerDiscoverer existing host for the cluster - beneficial for host joins
3. DownloadListener - hypervisors for zone - beneficial for host joins
4. VirtualMachineManagerImpl - VMs in progress - beneficial for processing stalled VMs during PingRoutingCommands
Optimized MS list retrieval for agent connect
Optimized finding ready systemvm template for zone
Database retrieval optimisations - fix and refactor for cases where only IDs or counts are used, mainly for hosts and other infra entities. Also similar cases for VMs and other host-related entities used by background tasks
Changes in agent-agentmanager connection with NIO client-server classes
1. Optimized the use of the executor service
2. Refactored Agent class to better handle connections.
3. Do SSL handshakes within worker threads
4. Added global configs to control the behaviour depending on the infra. SSL handshake and initial processing of a new agent could be a bottleneck during agent connections. Config agent.max.concurrent.new.connections can be used to control the number of new connections the management server handles at a time. agent.ssl.handshake.timeout can be used to set the number of seconds after which an SSL handshake times out at the MS end.
5. On the agent side, backoff and SSL handshake timeout can be controlled by the agent properties backoff.seconds and ssl.handshake.timeout.
Improvements in StatsCollection - minimize DB retrievals.
Improvements in DeploymentPlanner allow for the retrieval of only desired host fields and fewer retrievals.
Improvements in hosts connection for a storage pool. Added config - storage.pool.host.connect.workers to control the number of worker threads that can be used to connect hosts to a storage pool. Worker thread approach is followed currently only for NFS and ScaleIO pools.
Minor improvements in resource limit calculations wrt DB retrievals
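
For readers unfamiliar with expire-after-write caching, here is a minimal sketch of the behaviour described for dynamic.apichecker.cache.period, where a period of zero disables caching. It is not the code from this PR: the commits referencing this PR mention the Caffeine library, while this sketch uses only the JDK to stay dependency-free. The class name `ExpireAfterWriteCache`, its constructor parameters and the injected clock are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical JDK-only stand-in for an expire-after-write cache; not CloudStack code. */
final class ExpireAfterWriteCache<K, V> {
    private static final class Entry<T> {
        final T value;
        final long writtenAtNanos;

        Entry(T value, long writtenAtNanos) {
            this.value = value;
            this.writtenAtNanos = writtenAtNanos;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlNanos;       // 0 means caching is disabled
    private final LongSupplier clock;  // injectable so that expiry can be tested

    ExpireAfterWriteCache(long ttlSeconds, LongSupplier nanoClock) {
        this.ttlNanos = ttlSeconds * 1_000_000_000L;
        this.clock = nanoClock;
    }

    /** Returns the cached value, reloading it once the write is older than the TTL. */
    V get(K key, Function<K, V> loader) {
        if (ttlNanos <= 0) {
            return loader.apply(key); // period 0: always go to the backing store
        }
        long now = clock.getAsLong();
        Entry<V> entry = entries.get(key);
        if (entry == null || now - entry.writtenAtNanos >= ttlNanos) {
            // Not atomic: two threads may both load the same key; acceptable for a sketch.
            entry = new Entry<>(loader.apply(key), now);
            entries.put(key, entry);
        }
        return entry.value;
    }
}
```

In production code the clock would typically be `System::nanoTime`; a library such as Caffeine additionally handles eviction and concurrent loads, which this sketch leaves out.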

Types of changes

Breaking change (fix or feature that would cause existing functionality to change)
New feature (non-breaking change which adds functionality)
Bug fix (non-breaking change which fixes an issue)
Enhancement (improves an existing feature and functionality)
Cleanup (Code refactoring and cleanup, that may add test cases)
build/CI
test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

Major
Minor

Bug Severity

Screenshots (if appropriate):

How Has This Been Tested?

How did you try to break this feature and the system with this change?

codecov · 2024-10-23T10:12:57Z

Codecov Report

Attention: Patch coverage is 25.68856% with 1403 lines in your changes missing coverage. Please review.

Project coverage is 15.98%. Comparing base (33a37da) to head (6820717).
Report is 19 commits behind head on 4.20.

Files with missing lines	Patch %	Lines
agent/src/main/java/com/cloud/agent/Agent.java	20.79%	270 Missing and 8 partials ⚠️
.../main/java/com/cloud/vm/dao/VMInstanceDaoImpl.java	11.40%	129 Missing and 3 partials ⚠️
...com/cloud/vm/VirtualMachinePowerStateSyncImpl.java	0.00%	87 Missing ⚠️
.../src/main/java/com/cloud/host/dao/HostDaoImpl.java	50.94%	64 Missing and 14 partials ⚠️
.../apache/cloudstack/metrics/MetricsServiceImpl.java	0.00%	73 Missing and 1 partial ⚠️
...n/java/com/cloud/capacity/CapacityManagerImpl.java	20.68%	69 Missing ⚠️
...java/com/cloud/agent/manager/AgentManagerImpl.java	4.68%	61 Missing ⚠️
...n/java/com/cloud/vm/VirtualMachineManagerImpl.java	0.00%	60 Missing ⚠️
...src/main/java/com/cloud/server/StatsCollector.java	0.00%	47 Missing ⚠️
...ain/java/com/cloud/storage/StorageManagerImpl.java	30.64%	40 Missing and 3 partials ⚠️
... and 55 more

Additional details and impacted files

@@             Coverage Diff              @@
##               4.20    #9840      +/-   ##
============================================
- Coverage     16.15%   15.98%   -0.18%     
- Complexity    12987    13042      +55     
============================================
  Files          5639     5644       +5     
  Lines        494148   494806     +658     
  Branches      59916    59934      +18     
============================================
- Hits          79854    79079     -775     
- Misses       405465   406906    +1441     
+ Partials       8829     8821       -8

Flag	Coverage Δ
uitests	`4.02% <ø> (ø)`
unittests	`16.81% <25.68%> (-0.20%)`	⬇️

Flags with carried forward coverage won't be shown.

Following changes and improvements have been added:

- Improvements in handling of PingRoutingCommand
  1. Added global config - `vm.sync.power.state.transitioning`, default value: true, to control syncing of power states for transitioning VMs. This can be set to false to prevent computation of transitioning state VMs.
  2. Improved VirtualMachinePowerStateSync to allow power state sync for host VMs in a batch
  3. Optimized scanning stalled VMs
- Added option to set worker threads for capacity calculation using config - `capacity.calculate.workers`
- Added caching framework based on Caffeine in-memory caching library, https://github.com/ben-manes/caffeine
- Added caching for account/use role API access; the expiration-after-write period can be configured using config - `dynamic.apichecker.cache.period`. If set to zero, there will be no caching. Default is 0.
- Added caching for account/use role API access with expiration after write set to 60 seconds.
- Added caching for some recurring DB retrievals
  1. CapacityManager - listing service offerings - beneficial in host capacity calculation
  2. LibvirtServerDiscoverer existing host for the cluster - beneficial for host joins
  3. DownloadListener - hypervisors for zone - beneficial for host joins
  4. VirtualMachineManagerImpl - VMs in progress - beneficial for processing stalled VMs during PingRoutingCommands
- Optimized MS list retrieval for agent connect
- Optimized finding ready systemvm template for zone
- Database retrieval optimisations - fix and refactor for cases where only IDs or counts are used, mainly for hosts and other infra entities. Also similar cases for VMs and other host-related entities used by background tasks
- Changes in agent-agentmanager connection with NIO client-server classes
  1. Optimized the use of the executor service
  2. Refactored Agent class to better handle connections.
  3. Do SSL handshakes within worker threads
  4. Added global configs to control the behaviour depending on the infra. SSL handshake could be a bottleneck during agent connections. Configs `agent.ssl.handshake.min.workers` and `agent.ssl.handshake.max.workers` can be used to control the number of new connections the management server handles at a time. `agent.ssl.handshake.timeout` can be used to set the number of seconds after which an SSL handshake times out at the MS end.
  5. On the agent side, backoff and SSL handshake timeout can be controlled by the agent properties `backoff.seconds` and `ssl.handshake.timeout`.
- Improvements in StatsCollection - minimize DB retrievals.
- Improvements in DeploymentPlanner allow for the retrieval of only desired host fields and fewer retrievals.
- Improvements in hosts connection for a storage pool. Added config - `storage.pool.host.connect.workers` to control the number of worker threads that can be used to connect hosts to a storage pool. Worker thread approach is followed currently only for NFS and ScaleIO pools.
- Minor improvements in resource limit calculations wrt DB retrievals

Signed-off-by: Abhishek Kumar <[email protected]>
Co-authored-by: Abhishek Kumar <[email protected]>
Co-authored-by: Rohit Yadav <[email protected]>
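
To illustrate the idea of doing SSL handshakes within worker threads and timing them out, here is a minimal JDK-only sketch. It is not the NIO code from this PR. The class `HandshakeWorkers`, the fixed pool size and the `Supplier<Boolean>` standing in for a real handshake are all assumptions made for illustration; it needs Java 9 or later for `completeOnTimeout`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical sketch: run handshakes on a worker pool so the accept loop never blocks. */
final class HandshakeWorkers implements AutoCloseable {
    private final ExecutorService pool;
    private final long timeoutMillis;

    HandshakeWorkers(int workers, long timeoutMillis) {
        this.pool = Executors.newFixedThreadPool(workers);
        this.timeoutMillis = timeoutMillis;
    }

    /**
     * Submits the handshake to a worker and returns immediately.
     * The future yields false if the handshake fails, throws, or exceeds the timeout.
     * The timeout is counted from submission, so time spent queued also counts.
     */
    CompletableFuture<Boolean> handshakeAsync(Supplier<Boolean> handshake) {
        return CompletableFuture.supplyAsync(handshake, pool)
                .completeOnTimeout(Boolean.FALSE, timeoutMillis, TimeUnit.MILLISECONDS)
                .exceptionally(error -> Boolean.FALSE);
    }

    @Override
    public void close() {
        pool.shutdownNow(); // interrupts handshakes that are still running
    }
}
```

A caller would attach a continuation, for example `handshakeAsync(task).thenAccept(ok -> { ... })`, to register or drop the connection once the result is known.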

GutoVeronezi · 2024-10-23T13:31:18Z

Honestly, I don't like PRs with thousands of lines doing thousands of things. They are hard to review and test. I encourage you to split it into several smaller PRs that each address one of the changes you are proposing.

Signed-off-by: Abhishek Kumar <[email protected]>

shwstppr · 2024-10-25T10:46:38Z

@blueorangutan package

blueorangutan · 2024-10-25T10:48:03Z

@shwstppr a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan · 2024-10-25T11:58:31Z

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11441

shwstppr · 2024-11-02T16:23:47Z

@blueorangutan package

blueorangutan · 2024-11-02T16:26:03Z

@shwstppr a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan · 2024-11-02T17:41:03Z

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11486

shwstppr · 2024-11-02T17:43:52Z

@blueorangutan test

blueorangutan · 2024-11-02T17:46:03Z

@shwstppr a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

blueorangutan · 2024-11-03T14:20:15Z

[SF] Trillian test result (tid-11738)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 71834 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9840-t11738-kvm-ol8.zip
Smoke tests completed. 134 look OK, 6 have errors, 1 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
test_01_host_tags	`Failure`	10.40	test_host_tags.py
test_03_secured_to_nonsecured_vm_migration	`Error`	1286.75	test_vm_life_cycle.py
test_03_secured_to_nonsecured_vm_migration	`Error`	1286.76	test_vm_life_cycle.py
ContextSuite context=TestSecuredVmMigration>:teardown	`Error`	2253.98	test_vm_life_cycle.py
ContextSuite context=TestVAppsVM>:setup	`Error`	2409.13	test_vm_life_cycle.py
ContextSuite context=TestVMLifeCycle>:setup	`Error`	2499.28	test_vm_life_cycle.py
ContextSuite context=TestVMSchedule>:setup	`Error`	0.00	test_vm_schedule.py
test_hostha_enable_ha_when_host_in_maintenance	`Error`	2101.84	test_hostha_kvm.py
test_01_migrate_vm_strict_tags_success	`Error`	99.22	test_vm_strict_host_tags.py
test_01_redundant_vpc_site2site_vpn	`Failure`	455.29	test_vpc_vpn.py
all_test_vm_lifecycle_unmanage_import	`Skipped`	---	test_vm_lifecycle_unmanage_import.py

Signed-off-by: Abhishek Kumar <[email protected]>

shwstppr · 2024-11-04T11:40:53Z

@blueorangutan package

blueorangutan · 2024-11-04T11:42:03Z

@shwstppr a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

Signed-off-by: Abhishek Kumar <[email protected]>

…nts-changes

shwstppr · 2025-01-28T05:59:25Z

@blueorangutan package

blueorangutan · 2025-01-28T06:00:04Z

@shwstppr a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan · 2025-01-28T07:26:45Z

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 12230

…nts-changes

rohityadavcloud · 2025-01-31T08:14:49Z

@blueorangutan package

blueorangutan · 2025-01-31T08:16:03Z

@rohityadavcloud a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan · 2025-01-31T09:21:19Z

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 12284

rohityadavcloud · 2025-01-31T13:44:45Z

@blueorangutan test matrix

blueorangutan · 2025-01-31T13:46:05Z

@rohityadavcloud a [SL] Trillian-Jenkins matrix job (EL8 mgmt + EL8 KVM, Ubuntu22 mgmt + Ubuntu22 KVM, EL8 mgmt + VMware 7.0u3, EL9 mgmt + XCP-ng 8.2 ) has been kicked to run smoke tests

blueorangutan · 2025-02-01T04:49:53Z

[SF] Trillian test result (tid-12253)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 51652 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9840-t12253-kvm-ol8.zip
Smoke tests completed. 140 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
test_11_isolated_network_with_dynamic_routed_mode	`Error`	2.30	test_ipv4_routing.py
test_12_vpc_and_tier_with_dynamic_routed_mode	`Error`	2.43	test_ipv4_routing.py
test_12_vpc_and_tier_with_dynamic_routed_mode	`Error`	2.43	test_ipv4_routing.py

blueorangutan · 2025-02-01T05:52:06Z

[SF] Trillian test result (tid-12254)
Environment: kvm-ubuntu22 (x2), Advanced Networking with Mgmt server u22
Total time taken: 55866 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9840-t12254-kvm-ubuntu22.zip
Smoke tests completed. 139 look OK, 2 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
test_11_isolated_network_with_dynamic_routed_mode	`Error`	2.41	test_ipv4_routing.py
test_12_vpc_and_tier_with_dynamic_routed_mode	`Error`	2.52	test_ipv4_routing.py
test_12_vpc_and_tier_with_dynamic_routed_mode	`Error`	2.52	test_ipv4_routing.py
test_oobm_multiple_mgmt_server_ownership	`Failure`	30.75	test_outofbandmanagement.py

blueorangutan · 2025-02-01T12:32:14Z

[SF] Trillian test result (tid-12256)
Environment: xcpng82 (x2), Advanced Networking with Mgmt server ol9
Total time taken: 79994 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9840-t12256-xcpng82.zip
Smoke tests completed. 135 look OK, 6 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
test_01_condensed_drs_algorithm	`Failure`	168.36	test_cluster_drs.py
test_02_balanced_drs_algorithm	`Failure`	185.49	test_cluster_drs.py
test_11_isolated_network_with_dynamic_routed_mode	`Error`	1.30	test_ipv4_routing.py
test_12_vpc_and_tier_with_dynamic_routed_mode	`Error`	2.46	test_ipv4_routing.py
test_12_vpc_and_tier_with_dynamic_routed_mode	`Error`	2.47	test_ipv4_routing.py
ContextSuite context=TestSharedNetworkWithConfigDrive>:setup	`Error`	7.42	test_network.py
test_01_non_strict_host_anti_affinity	`Error`	217.06	test_nonstrict_affinity_group.py
test_02_non_strict_host_affinity	`Error`	113.29	test_nonstrict_affinity_group.py
test_02_create_volume	`Error`	4.28	test_resource_names.py
test_05_scale_vm_dont_allow_disk_offering_change	`Failure`	65.77	test_scale_vm.py

…Pools (apache#446)

Following changes and improvements have been added:

- Allows configuring connection pool library for database connection. As default, replaces dbcp2 connection pool library with more performant HikariCP. db.<DATABASE>.connectionPoolLib property can be set in the db.properties to use the desired library.
  > Set dbcp for using DBCP2
  > Set hikaricp or for using HikariCP
- Improvements in handling of PingRoutingCommand
  1. Added global config - `vm.sync.power.state.transitioning`, default value: true, to control syncing of power states for transitioning VMs. This can be set to false to prevent computation of transitioning state VMs.
  2. Improved VirtualMachinePowerStateSync to allow power state sync for host VMs in a batch
  3. Optimized scanning stalled VMs
- Added option to set worker threads for capacity calculation using config - `capacity.calculate.workers`
- Added caching framework based on Caffeine in-memory caching library, https://github.com/ben-manes/caffeine
- Added caching for dynamic config keys with expiration after write set to 30 seconds.
- Added caching for account/use role API access; the expiration-after-write period can be configured using config - `dynamic.apichecker.cache.period`. If set to zero, there will be no caching. Default is 0.
- Added caching for some recurring DB retrievals
  1. CapacityManager - listing service offerings - beneficial in host capacity calculation
  2. LibvirtServerDiscoverer existing host for the cluster - beneficial for host joins
  3. DownloadListener - hypervisors for zone - beneficial for host joins
  4. VirtualMachineManagerImpl - VMs in progress - beneficial for processing stalled VMs during PingRoutingCommands
- Optimized MS list retrieval for agent connect
- Optimized finding ready systemvm template for zone
- Database retrieval optimisations - fix and refactor for cases where only IDs or counts are used, mainly for hosts and other infra entities. Also similar cases for VMs and other host-related entities used by background tasks
- Changes in agent-agentmanager connection with NIO client-server classes
  1. Optimized the use of the executor service
  2. Refactored Agent class to better handle connections.
  3. Do SSL handshakes within worker threads
  4. Added global configs to control the behaviour depending on the infra. SSL handshake and initial processing of a new agent could be a bottleneck during agent connections. Config `agent.max.concurrent.new.connections` can be used to control the number of new connections the management server handles at a time. `agent.ssl.handshake.timeout` can be used to set the number of seconds after which an SSL handshake times out at the MS end.
  5. On the agent side, backoff and SSL handshake timeout can be controlled by the agent properties `backoff.seconds` and `ssl.handshake.timeout`.
- Improvements in StatsCollection - minimize DB retrievals.
- Improvements in DeploymentPlanner allow for the retrieval of only desired host fields and fewer retrievals.
- Improvements in hosts connection for a storage pool. Added config - `storage.pool.host.connect.workers` to control the number of worker threads that can be used to connect hosts to a storage pool. Worker thread approach is followed currently only for NFS and ScaleIO pools.
- Minor improvements in resource limit calculations wrt DB retrievals

### Schema changes

Schema changes that need to be applied if updating from 4.18.1.x
[FR73B-Phase1-sql-changes.sql.txt](https://github.com/user-attachments/files/17485581/FR73B-Phase1-sql-changes.sql.txt)

Upstream PR: apache#9840

### Changes and details from scoping phase

<details>
<summary>Changes and details from scoping phase</summary>

FR73B isn't a traditional feature FR per se. The only way to scope it is to find classes of problems, put them in buckets, and propose a time-bound phase of developing and delivering optimisations. Instead of a specific proposal on how to fix them, we're looking for approaches and methodologies that can be applied as sprints (or short investigation/fix cycles), and to split out well-defined problems as separate FRs.

Below are some examples of the types of problem we can find around resource contention or spikes (where the resource can be CPU, RAM, DB):

- Resource spikes on management server start/restart (such as maintenance-led restarts)
- Resource spikes on addition of Hosts
- Resource spikes on deploying VMs
- Resource spikes or slowness on running list APIs

As examples, the following issues were found during the scoping exercise:

### 1. Reduce CPU and DB spikes on adding hosts or restarting mgmt server (direct agents, such as Simulator)

Introduced in apache#1403, this gates the logic only to XenServer where this would at all run. The specific code is only applicable for XenServer and SolidFire (https://youtu.be/YQ3pBeL-WaA?si=ed_gT_A8lZYJiEh).

Hotspot took away about 20-40% CPU & DB pressures alone:

<img width="1002" alt="Screenshot 2024-05-03 at 3 10 13 PM" src="https://github.com/shapeblue/cloudstack-apple/assets/95203/f7f86c44-f865-4734-a6fd-89bd6a85ab73">

<img width="1067" alt="Screenshot 2024-05-03 at 3 11 41 PM" src="https://github.com/shapeblue/cloudstack-apple/assets/95203/caa5081b-8fd6-46cd-acb1-f4c5d6b5d10f">

**After the fix:**

![Screenshot 2024-05-03 at 5 31 05 PM](https://github.com/shapeblue/cloudstack-apple/assets/95203/2ba0b1c9-9922-44a9-ae4f-fb65f77866d4)

### 2. Reduce DB load on capacity scans

Another type of code/programming pattern is one wherein we fetch all DB records only to count them and discard them. Such refactoring can reduce CPU/DB load for environments with a really large number of hosts. The common pattern in code to search for and optimise is lists of hosts/hostVOs.

DB hot-spot reduced by ~5-13% during aggressive scans.

### 3. Reduce DB load on Ping command

Upon handling Ping commands, we try to fetch a whole bunch of columns from the vm_instance (joined to other) table(s), but only use the `id` column. We can optimise and reduce DB load by only fetching the `id`. Further, optimise how power reports are handled (for example, previously it called a DB query and then used an iterator -> which was optimised as doing a select query excluding a list of VM ids).

With 1, 2, 3, a single management server host + simulator deployed against a single MySQL 8.x DB was found to do up to 20k hosts across two clusters.

### 4. API and UI optimisation

In this type of issue, the metrics APIs for zone and cluster were optimised, so the pages would load faster. This sort of thing may be possible across the UI, for resources that are very high in number.

### 5. Log optimisations

Reducing (unnecessary) logging can yield anywhere between 5-10% improvement in overall performance throughput (API or operational).

### 6. DB, SQL Query and Mgmt server CPU load Optimisations

Several optimisations were possible. As an example, this was improved wherein `isZoneReady` was causing both DB scans/load and a CPU hotspot:

<img width="1314" alt="Screenshot 2024-05-04 at 9 19 33 PM" src="https://github.com/shapeblue/cloudstack-apple/assets/95203/b0749642-0819-4bb9-803a-faa9754ccefa">

The following were explored:

- Using mysql slow-query logging along with index scan logging to find hotspots, along with jprofiler
- Adding missing indexes to speed up queries
- Reduce table scans by optimising sql queries and using indexes
- Optimising sql queries to remove duplicate rows (use of distinct)
- Reduce CPU and DB load by using jprofiler to optimise both sql queries and CPU hotspots

Example fix: server: reduce CPU and DB load caused by systemvm ::isZoneReady()

For this case, the sql query was doing a large number of table scans only to determine if the zone has any available pool+host to launch systemvms. Accordingly, code and sql query changes along with index optimisations were used to lower both DB scans and mgmt server CPU load.

Further, tools such as EXPLAIN or EXPLAIN ANALYZE or visual explaining of queries can help optimise queries; for example, before:

<img width="508" alt="Screenshot 2024-05-08 at 6 16 17 PM" src="https://github.com/shapeblue/cloudstack-apple/assets/95203/d85f4d19-36a2-41ee-9334-c119a4b2fc52">

After adding an index:

<img width="558" alt="Screenshot 2024-05-08 at 6 22 32 PM" src="https://github.com/shapeblue/cloudstack-apple/assets/95203/14ef3d13-2d25-4f41-ba25-ee68e37b5b76">

Here's a bigger view of the user_vm_view that's optimised by adding an index to the user_ip_address table:

![zzexplain](https://github.com/shapeblue/cloudstack-apple/assets/95203/72e44291-a657-49da-adcd-5803a2fa91f9)

### 7. Better DB Connection Pooling: HikariCP

Several CPU and DB hotspots suggested about 20+% of time was spent processing the `SELECT 1` query, which was later found to be unnecessary for JDBC 4 compliant drivers, which would use Connection::isValid to ascertain if a connection was good enough. Further, heap and GC spikes are seen due to load on the mgmt server with 50k hosts. By replacing the dbcp2 based library with HikariCP, a more performant library with low production overhead, it was found the application heap/GC load and DB CPU/query load could be reduced further. For existing environments, the validation query can be set to `/* ping */ SELECT 1`, which performs a lower-overhead application ping b/w mgmt server and DB.

Migration to HikariCP and the related changes show a lower select query load and about 10-15% lower cpu load:

<img width="1071" alt="Screenshot 2024-05-09 at 10 56 09 PM" src="https://github.com/shapeblue/cloudstack-apple/assets/95203/5dbf919e-4d15-48a3-ab87-5647db666132">

<img width="372" alt="Screenshot 2024-05-09 at 10 58 40 PM" src="https://github.com/shapeblue/cloudstack-apple/assets/95203/9cfc80c6-eb91-4036-b7f2-1e24b6c5b78a">

Caveat: this has led to unit test failures, as many depend on dbcp2 based assumptions, which can be fixed in due time. However, the build is passing and a simulator based test-setup seems to be working.

The following is telemetry of the application (mgmt server), after 50k hosts join:

<img width="1184" alt="Screenshot 2024-05-10 at 12 31 09 AM" src="https://github.com/shapeblue/cloudstack-apple/assets/95203/e47cd71e-2bae-4640-949c-a457c420ab70">

<img width="1188" alt="Screenshot 2024-05-10 at 12 31 26 AM" src="https://github.com/shapeblue/cloudstack-apple/assets/95203/33dec07b-834c-44b8-a9a4-1d7502973fc7">

For 100k hosts added/joining, the connection scaling seems better:

<img width="1180" alt="Screenshot 2024-05-22 at 8 32 44 PM" src="https://github.com/shapeblue/cloudstack-apple/assets/95203/ee4d3c5d-4b6d-43f0-8efb-28aba64917d9">

### 8. Using MySQL slow logs to optimise application logic and queries

MySQL slow logs were enabled using the following configuration:

```
slow_query_log = 1
slow_query_log_file = /var/log/mysql/mysql-slow.log
long_query_time = 1
log_queries_not_using_indexes = 1
min_examined_row_limit = 100
```

Upon analysing the slow logs, the network_offering and user_vm_view related views, queries and application logic, for example, were optimised to demonstrate how the methodology can be used to measure, find and optimise bottlenecks.

It was found that queries that end up doing more table scans than the rows they returned to the application (ACS mgmt server) were adding pressure on the db.

- In case of network_offering_view, adding an index reduced table scans.
- In case of user_vm_view, it was found that MySQL was picking the wrong index, which caused a lot of scans as many IP addresses were there in the user_ip_address table. It turned out to be related to or the same as an old MySQL server bug https://bugs.mysql.com/bug.php?id=41220 and the workaround fix was to force the relevant index. This sped up the listVirtualMachines API in my test env (with 50-100k hosts) from 17s to under 200ms (measured locally).

### 9. Bottlenecks identified and categorised

As part of the FR scoping effort, not everything could possibly be fixed. As an example, some of the code has been marked with FIXME or TODO that relate to hotspots discovered during the profiling process. Some of it was commented out, for example to speed up host additions while reducing CPU/DB load (to allow testing of 50k-100k hosts joining). Such code can be further optimised by exploring and using new caching layer(s) that could be built using the Caffeine library and Hazelcast.

Misc: if distributed multi-primary MySQL cluster support is to be explored: shapeblue/cloudstack-apple#437

Misc: list API optimisations may be worth back porting: apache#9177 apache#8782

</details>

---------

Signed-off-by: Rohit Yadav <[email protected]>
Signed-off-by: Abhishek Kumar <[email protected]>
Co-authored-by: Abhishek Kumar <[email protected]>
Co-authored-by: Fabricio Duarte <[email protected]>
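
As a rough illustration of the worker-thread approach mentioned for `storage.pool.host.connect.workers`, the sketch below fans host connections out over a bounded pool and collects the hosts that failed. It is an assumption-laden sketch, not the StorageManagerImpl code: `PoolHostConnector`, `connectAll` and the `LongPredicate` standing in for the real connect call are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.LongPredicate;

/** Hypothetical sketch: connect many hosts to a storage pool using a fixed number of workers. */
final class PoolHostConnector {
    /** Returns the IDs of hosts whose connect attempt returned false or threw. */
    static List<Long> connectAll(List<Long> hostIds, int workers, LongPredicate connectHost) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, workers));
        try {
            List<Callable<Boolean>> tasks = new ArrayList<>();
            for (Long hostId : hostIds) {
                tasks.add(() -> connectHost.test(hostId));
            }
            // invokeAll blocks until every task has finished and keeps the input order.
            List<Future<Boolean>> results = pool.invokeAll(tasks);
            List<Long> failed = new ArrayList<>();
            for (int i = 0; i < results.size(); i++) {
                try {
                    if (!results.get(i).get()) {
                        failed.add(hostIds.get(i));
                    }
                } catch (ExecutionException e) {
                    failed.add(hostIds.get(i)); // the connect call threw
                }
            }
            return failed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while connecting hosts", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With `workers` set to 1 the hosts are connected one after another, which is a convenient baseline when measuring whether more workers actually help on a given setup.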

In the absence of a SYSTEM type template for a zone, listing of templates can break. Behaviour was changed in apache#9840 but it would be better to find available hypervisors using existing hosts.

* server: fix available hypervisors listing for a zone

  In the absence of a SYSTEM type template for a zone, listing of templates can break. Behaviour was changed in #9840 but it would be better to find available hypervisors using existing hosts.

* fix

Signed-off-by: Abhishek Kumar <[email protected]>

---------

Signed-off-by: Abhishek Kumar <[email protected]>

* api,agent,server,engine-schema: scalability improvements Following changes and improvements have been added: - Improvements in handling of PingRoutingCommand 1. Added global config - `vm.sync.power.state.transitioning`, default value: true, to control syncing of power states for transitioning VMs. This can be set to false to prevent computation of transitioning state VMs. 2. Improved VirtualMachinePowerStateSync to allow power state sync for host VMs in a batch 3. Optimized scanning stalled VMs - Added option to set worker threads for capacity calculation using config - `capacity.calculate.workers` - Added caching framework based on Caffeine in-memory caching library, https://github.com/ben-manes/caffeine - Added caching for account/use role API access with expiration after write can be configured using config - `dynamic.apichecker.cache.period`. If set to zero then there will be no caching. Default is 0. - Added caching for account/use role API access with expiration after write set to 60 seconds. - Added caching for some recurring DB retrievals 1. CapacityManager - listing service offerings - beneficial in host capacity calculation 2. LibvirtServerDiscoverer existing host for the cluster - beneficial for host joins 3. DownloadListener - hypervisors for zone - beneficial for host joins 5. VirtualMachineManagerImpl - VMs in progress- beneficial for processing stalled VMs during PingRoutingCommands - Optimized MS list retrieval for agent connect - Optimize finding ready systemvm template for zone - Database retrieval optimisations - fix and refactor for cases where only IDs or counts are used mainly for hosts and other infra entities. Also similar cases for VMs and other entities related to host concerning background tasks - Changes in agent-agentmanager connection with NIO client-server classes 1. Optimized the use of the executor service 2. Refactore Agent class to better handle connections. 3. Do SSL handshakes within worker threads 5. 
Added global configs to control the behaviour depending on the infra. SSL handshake could be a bottleneck during agent connections. Configs - `agent.ssl.handshake.min.workers` and `agent.ssl.handshake.max.workers` can be used to control number of new connections management server handles at a time. `agent.ssl.handshake.timeout` can be used to set number of seconds after which SSL handshake times out at MS end.
6. On agent side backoff and sslhandshake timeout can be controlled by agent properties. `backoff.seconds` and `ssl.handshake.timeout` properties can be used.

- Improvements in StatsCollection - minimize DB retrievals.
- Improvements in DeploymentPlanner allow for the retrieval of only desired host fields and fewer retrievals.
- Improvements in hosts connection for a storage pool. Added config - `storage.pool.host.connect.workers` to control the number of worker threads that can be used to connect hosts to a storage pool. Worker thread approach is followed currently only for NFS and ScaleIO pools.
- Minor improvements in resource limit calculations wrt DB retrievals

Signed-off-by: Abhishek Kumar <[email protected]>
Co-authored-by: Abhishek Kumar <[email protected]>
Co-authored-by: Rohit Yadav <[email protected]>

* test1, domaindetails, capacitymanager fix
  Signed-off-by: Abhishek Kumar <[email protected]>
* test2 - agent tests
  Signed-off-by: Abhishek Kumar <[email protected]>
* capacitymanagertest fix
  Signed-off-by: Abhishek Kumar <[email protected]>
* change
  Signed-off-by: Abhishek Kumar <[email protected]>
* fix missing changes
  Signed-off-by: Abhishek Kumar <[email protected]>
* address comments
  Signed-off-by: Abhishek Kumar <[email protected]>
* revert marvin/setup.py
  Signed-off-by: Abhishek Kumar <[email protected]>
* fix indent
  Signed-off-by: Abhishek Kumar <[email protected]>
* use space in sql
  Signed-off-by: Abhishek Kumar <[email protected]>
* address duplicate
  Signed-off-by: Abhishek Kumar <[email protected]>
* update host logs
  Signed-off-by: Abhishek Kumar <[email protected]>
* revert e36c6a5
  Signed-off-by: Abhishek Kumar <[email protected]>
* fix npe in capacity calculation
  Signed-off-by: Abhishek Kumar <[email protected]>
* move schema changes to 4.20.1 upgrade
  Signed-off-by: Abhishek Kumar <[email protected]>
* build fix
  Signed-off-by: Abhishek Kumar <[email protected]>
* address comments
  Signed-off-by: Abhishek Kumar <[email protected]>
* fix build
  Signed-off-by: Abhishek Kumar <[email protected]>
* add some more tests
  Signed-off-by: Abhishek Kumar <[email protected]>
* checkstyle fix
  Signed-off-by: Abhishek Kumar <[email protected]>
* remove unnecessary mocks
  Signed-off-by: Abhishek Kumar <[email protected]>
* build fix
  Signed-off-by: Abhishek Kumar <[email protected]>
* replace statics
  Signed-off-by: Abhishek Kumar <[email protected]>
* engine/orchestration,utils: limit number of concurrent new agent connections
  Signed-off-by: Abhishek Kumar <[email protected]>
* refactor - remove unused
  Signed-off-by: Abhishek Kumar <[email protected]>
* unregister closed connections, monitor & cleanup
  Signed-off-by: Abhishek Kumar <[email protected]>
* add check for outdated vm filter in power sync
  Signed-off-by: Abhishek Kumar <[email protected]>
* agent: synchronize sendRequest wait
  Signed-off-by: Abhishek Kumar <[email protected]>

---------

Signed-off-by: Abhishek Kumar <[email protected]>
Co-authored-by: Rohit Yadav <[email protected]>
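
The worker-thread approach mentioned for `storage.pool.host.connect.workers` can be illustrated with a minimal Java sketch. It is not the code from this PR: the class `PoolHostConnectSketch`, the method `connectHosts`, and the `connect` predicate are hypothetical names, and the sketch only assumes that each host connection is an independent task that either succeeds, fails, or throws.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.LongPredicate;

// Hypothetical illustration only; names here are not taken from CloudStack.
class PoolHostConnectSketch {

    // Runs connect.test(hostId) for every host on at most `workers` threads
    // and returns the IDs of hosts whose attempt returned false or threw.
    static List<Long> connectHosts(List<Long> hostIds, int workers, LongPredicate connect)
            throws InterruptedException {
        // Assumed behaviour: a non-positive worker count falls back to one thread.
        ExecutorService executor = Executors.newFixedThreadPool(Math.max(1, workers));
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (long hostId : hostIds) {
                Callable<Boolean> task = () -> connect.test(hostId);
                results.add(executor.submit(task));
            }
            List<Long> failed = new ArrayList<>();
            // Futures are read in submission order, so failures keep the input order.
            for (int i = 0; i < results.size(); i++) {
                try {
                    if (!results.get(i).get()) {
                        failed.add(hostIds.get(i));
                    }
                } catch (ExecutionException e) {
                    // A connect attempt that threw is counted as a failure.
                    failed.add(hostIds.get(i));
                }
            }
            return failed;
        } finally {
            executor.shutdown();
        }
    }
}
```

In this sketch the worker count only bounds how many connect attempts run at once; every host is still attempted, and the caller decides what to do with the returned failures.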

* server: fix available hypervisors listing for a zone

In the absence of a SYSTEM type template for a zone, listing of templates can break. Behaviour was changed in apache#9840 but it would be better to find available hypervisors using existing hosts.

* fix
  Signed-off-by: Abhishek Kumar <[email protected]>

---------

Signed-off-by: Abhishek Kumar <[email protected]>

boring-cyborg bot added component:agent component:api component:compute component:orchestration labels Oct 23, 2024

shwstppr force-pushed the scalability-improvements-changes branch from 2c750db to 080e5af Compare October 23, 2024 12:52

shwstppr force-pushed the scalability-improvements-changes branch from 080e5af to e3cf7fd Compare October 23, 2024 13:27

JoaoJandre requested review from BryanMLima, GutoVeronezi and JoaoJandre October 23, 2024 13:28

shwstppr added 4 commits October 24, 2024 10:22

test1, domaindetails, capacitymanager fix

feec5b1

Signed-off-by: Abhishek Kumar <[email protected]>

test2 - agent tests

eafd03c

Signed-off-by: Abhishek Kumar <[email protected]>

capacitymanagertest fix

5159f26

Signed-off-by: Abhishek Kumar <[email protected]>

change

4ffb091

Signed-off-by: Abhishek Kumar <[email protected]>

apache deleted a comment from blueorangutan Oct 25, 2024

fix missing changes

779d521

Signed-off-by: Abhishek Kumar <[email protected]>

shwstppr added 3 commits January 28, 2025 11:26

add check for outdated vm filter in power sync

eab0531

Signed-off-by: Abhishek Kumar <[email protected]>

agent: synchronize sendRequest wait

2bbdc0c

Signed-off-by: Abhishek Kumar <[email protected]>

Merge remote-tracking branch 'apache/4.20' into scalability-improvements-changes

af5ac23

Merge remote-tracking branch 'apache/4.20' into scalability-improvements-changes

6820717

rohityadavcloud added this to the 4.20.1 milestone Jan 31, 2025

rohityadavcloud marked this pull request as ready for review January 31, 2025 08:14

rohityadavcloud merged commit 0b5a5e8 into apache:4.20 Feb 1, 2025
25 of 26 checks passed

rohityadavcloud deleted the scalability-improvements-changes branch February 1, 2025 06:58

Pearl1594 added this to ACS 4.20.1 Mar 17, 2025

Pearl1594 moved this to Done in ACS 4.20.1 Mar 17, 2025

shwstppr mentioned this pull request Apr 16, 2025

server: fix available hypervisors listing for a zone #10738

Merged

14 tasks

shwstppr mentioned this pull request Oct 10, 2025

SSL_Handshake: close nio channel when NioClient fail to handshake wit… #10153

Closed

14 tasks

shwstppr commented Oct 23, 2024 • edited

codecov bot commented Oct 23, 2024 • edited

GutoVeronezi commented Oct 23, 2024

shwstppr commented Oct 25, 2024

blueorangutan commented Oct 25, 2024

blueorangutan commented Oct 25, 2024

shwstppr commented Nov 2, 2024

blueorangutan commented Nov 2, 2024

blueorangutan commented Nov 2, 2024

shwstppr commented Nov 2, 2024

blueorangutan commented Nov 2, 2024

blueorangutan commented Nov 3, 2024

shwstppr commented Nov 4, 2024

blueorangutan commented Nov 4, 2024

shwstppr commented Jan 28, 2025

blueorangutan commented Jan 28, 2025

blueorangutan commented Jan 28, 2025

rohityadavcloud commented Jan 31, 2025

blueorangutan commented Jan 31, 2025

blueorangutan commented Jan 31, 2025

rohityadavcloud commented Jan 31, 2025

blueorangutan commented Jan 31, 2025

blueorangutan commented Feb 1, 2025

blueorangutan commented Feb 1, 2025

blueorangutan commented Feb 1, 2025

7 participants