@shwstppr
Contributor

@shwstppr shwstppr commented Aug 25, 2025

Description

This pull request refactors the TLS framing and buffer management in the Link class to improve correctness and maintainability, and updates the SSL context initialization to use TLS 1.3 for enhanced security. CloudStack prefixes each packet sent over TLS with a 4-byte length header. Previously, this header was not sent within the TLS application data. That hurt maintainability (simply switching to TLS 1.3 without changing the packet framing did not work and resulted in errors like [1]) and made it harder to implement agent-server communication in a different language. The most important changes are grouped below.

TLS Framing and Buffer Management

  • Reworked the TLS buffer handling in Link.java, replacing legacy header and packet assembly logic with a more robust system using netBuffer, appBuffer, and an explicit headerBuffer for frame length management. This improves frame parsing and avoids buffer overflows.
  • Refactored the read and write logic: the read method now correctly assembles frames from TLS streams, handling buffer resizing and edge cases, while the doWrite method builds TLS packets with a 4-byte length header and payload, ensuring correct framing and handshake handling.
  • Simplified the message sending and writing logic by removing manual header prepending and using the new framing system; the write queue now contains only payload buffers, and the header is added during the TLS wrap process.
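
To make the framing described above more concrete, here is a minimal Java sketch of length-prefixed framing on the plaintext side. The writer prepends a 4-byte length to the payload before the data is handed to TLS, and the reader extracts complete frames from the decrypted bytes while keeping any partial frame for the next read. The names `FrameCodecSketch`, `encode`, and `decode` are illustrative assumptions and are not taken from `Link.java`; the real code also handles `SSLEngine` wrap/unwrap, handshake state, and buffer resizing, all of which are omitted here.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not the PR's actual implementation.
final class FrameCodecSketch {
    static final int HEADER_SIZE = 4; // 4-byte length header, as described in the PR

    // Builds one frame: [4-byte big-endian length][payload].
    // The returned buffer is ready to be read (for example, passed to a TLS wrap step).
    static ByteBuffer encode(byte[] payload) {
        ByteBuffer frame = ByteBuffer.allocate(HEADER_SIZE + payload.length);
        frame.putInt(payload.length);
        frame.put(payload);
        frame.flip();
        return frame;
    }

    // Extracts every complete frame from decrypted bytes accumulated in 'app'.
    // 'app' must be in write mode on entry (bytes were just put into it) and is
    // left in write mode on exit, with any incomplete frame kept at the front.
    static List<byte[]> decode(ByteBuffer app) {
        List<byte[]> frames = new ArrayList<>();
        app.flip();
        while (app.remaining() >= HEADER_SIZE) {
            app.mark();
            int length = app.getInt();
            if (length < 0) {
                throw new IllegalStateException("Negative frame length: " + length);
            }
            if (app.remaining() < length) {
                app.reset(); // header seen but payload incomplete; wait for more data
                break;
            }
            byte[] payload = new byte[length];
            app.get(payload);
            frames.add(payload);
        }
        app.compact(); // keep leftover bytes for the next call
        return frames;
    }
}
```

Because the header travels inside the application data in this scheme, a peer written in another language only needs a TLS library and the same 4-byte length convention to exchange frames.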

Security Improvements

  • Updated SSL context initialization in Link.java to use SSLUtils.getSSLContextWithLatestVersion(), ensuring that TLS 1.3 is used for all server, client, and management SSL contexts.
  • Added a new method getSSLContextWithLatestVersion() in SSLUtils.java, which returns an SSLContext instance for TLS 1.3.
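
As a rough illustration of the SSL context change, the sketch below obtains an `SSLContext` for the standard JSSE protocol name `TLSv1.3` and, optionally, restricts an `SSLEngine` to that version only. The class `TlsContextSketch` and its methods are assumptions made for this example; the body of `SSLUtils.getSSLContextWithLatestVersion()` is not shown in this description and may differ. Default key and trust managers are used here purely to keep the example self-contained.

```java
import java.security.GeneralSecurityException;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;

// Hypothetical helper for illustration only; not the PR's SSLUtils code.
final class TlsContextSketch {

    // Returns an initialized SSLContext requested with the "TLSv1.3" protocol name.
    // Passing nulls to init() selects the JVM's default key/trust managers.
    static SSLContext tls13Context() {
        try {
            SSLContext context = SSLContext.getInstance("TLSv1.3");
            context.init(null, null, null);
            return context;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("TLSv1.3 is not available on this JVM", e);
        }
    }

    // Creates an engine that will negotiate TLS 1.3 only.
    // Without this restriction, a context may still allow older protocol versions.
    static SSLEngine tls13OnlyEngine() {
        SSLEngine engine = tls13Context().createSSLEngine();
        engine.setEnabledProtocols(new String[] {"TLSv1.3"});
        return engine;
    }
}
```

Whether the protocol list should also be pinned on each engine, as in `tls13OnlyEngine()`, is a design choice that depends on whether older agents must still be able to connect.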

[1] Error in agent-server connection with TLS 1.3 without packet framing changes

2025-08-25 18:41:41,698 INFO [utils.nio.NioClient] (main:[]) (logid:) Connecting to 172.120.0.67:8250
2025-08-25 18:41:41,702 INFO [utils.nio.NioClient] (main:[]) (logid:) Connected to 172.120.0.67:8250
2025-08-25 18:41:41,704 INFO [utils.nio.Link] (main:[]) (logid:) Conf file found: /etc/cloudstack/agent/agent.properties
2025-08-25 18:41:41,941 INFO [utils.nio.NioClient] (main:[]) (logid:) SSL: Handshake done
2025-08-25 18:41:41,950 DEBUG [utils.nio.NioClient] (Agent-NioConnectionHandler-1:[]) (logid:) Location 1: Socket Socket[addr=/172.120.0.67,port=8250,localport=59004] closed on read. Probably -1 returned: Input record too big: max = 16709 len = 22679
2025-08-25 18:41:41,950 DEBUG [utils.nio.NioClient] (Agent-NioConnectionHandler-1:[]) (logid:) Closing socket Socket[addr=/172.120.0.67,port=8250,localport=59004]

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI
  • test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

Logs from management server:

[root@qa1-main-kvm-c0c69556-kvm-mgmt1 ~]# tail -f /var/log/cloudstack/management/management-server.log | grep SSL
2025-08-28 11:49:43,597 TRACE  [c.c.u.n.NioServer] (AgentManager-SSLHandshakeHandler-1:[]) (logid:) SSL: Handshake done with /172.120.0.188:34740 protocol: TLSv1.3, cipher suite: TLS_AES_256_GCM_SHA384
2025-08-28 11:49:43,677 TRACE  [c.c.u.n.NioServer] (AgentManager-SSLHandshakeHandler-2:[]) (logid:) SSL: Handshake done with /172.120.0.156:37860 protocol: TLSv1.3, cipher suite: TLS_AES_256_GCM_SHA384
2025-08-28 11:49:43,741 TRACE  [c.c.u.n.NioServer] (AgentManager-SSLHandshakeHandler-3:[]) (logid:) SSL: Handshake done with /172.120.1.143:44026 protocol: TLSv1.3, cipher suite: TLS_AES_256_GCM_SHA384
2025-08-28 11:49:43,781 TRACE  [c.c.u.n.NioServer] (AgentManager-SSLHandshakeHandler-4:[]) (logid:) SSL: Handshake done with /172.120.1.227:36560 protocol: TLSv1.3, cipher suite: TLS_AES_256_GCM_SHA384

Logs from one of the hosts:

[root@qa1-main-kvm-c0c69556-kvm-host1 ~]# tail -f /var/log/cloudstack/agent/agent.log | grep SSL
2025-08-28 11:49:43,673 INFO  [utils.nio.NioClient] (Agent-Handler-3:[]) (logid:) SSL: Handshake done with /172.120.0.67:8250 protocol: TLSv1.3, cipher suite: TLS_AES_256_GCM_SHA384

Communication between hosts, system VMs, and the management server (MS) seemed fine.

How did you try to break this feature and the system with this change?

Signed-off-by: Abhishek Kumar <[email protected]>
@codecov

codecov bot commented Aug 25, 2025

Codecov Report

❌ Patch coverage is 61.97183% with 54 lines in your changes missing coverage. Please review.
✅ Project coverage is 17.55%. Comparing base (8c86f24) to head (8cfd858).
⚠️ Report is 36 commits behind head on main.

Files with missing lines Patch % Lines
utils/src/main/java/com/cloud/utils/nio/Link.java 59.70% 38 Missing and 16 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #11503      +/-   ##
============================================
- Coverage     17.55%   17.55%   -0.01%     
+ Complexity    15543    15534       -9     
============================================
  Files          5910     5910              
  Lines        529334   529359      +25     
  Branches      64654    64656       +2     
============================================
- Hits          92944    92909      -35     
- Misses       425933   425995      +62     
+ Partials      10457    10455       -2     
Flag Coverage Δ
uitests 3.58% <ø> (ø)
unittests 18.61% <61.97%> (-0.01%) ⬇️

@shwstppr
Contributor Author

@blueorangutan package

@blueorangutan

@shwstppr a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 14723

Signed-off-by: Abhishek Kumar <[email protected]>
@apache apache deleted a comment from shwstppr Aug 29, 2025
@apache apache deleted a comment from blueorangutan Aug 29, 2025
@apache apache deleted a comment from blueorangutan Aug 29, 2025
@DaanHoogland
Contributor

@blueorangutan test

@blueorangutan

@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@shwstppr
Contributor Author

@blueorangutan test

@blueorangutan

@shwstppr a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-14133)
Environment: kvm-ol8 (x2), zone: Advanced Networking with Mgmt server ol8
Total time taken: 418823 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr11503-t14133-kvm-ol8.zip
Smoke tests completed. 135 look OK, 11 have errors, 0 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |
| ContextSuite context=TestClusterDRS>:setup | Error | 0.00 | test_cluster_drs.py |
| test_01_internallb_roundrobin_1VPC_3VM_HTTP_port80 | Error | 6208.17 | test_internal_lb.py |
| ContextSuite context=TestIpv4Routing>:setup | Error | 0.00 | test_ipv4_routing.py |
| test_01_create_iso_with_checksum_sha1 | Error | 66.53 | test_iso.py |
| test_03_create_iso_with_checksum_md5 | Error | 66.52 | test_iso.py |
| test_list_system_vms_metrics_history | Failure | 0.25 | test_metrics_api.py |
| test_list_vms_metrics_admin | Error | 3605.09 | test_metrics_api.py |
| test_list_vms_metrics_history | Error | 5.54 | test_metrics_api.py |
| test_01_vpn_usage | Error | 1.10 | test_usage.py |
| test_01_scale_up_verify | Failure | 576.75 | test_vm_autoscaling.py |
| test_02_update_vmprofile_and_vmgroup | Failure | 370.82 | test_vm_autoscaling.py |
| test_07_autoscaling_vmgroup_on_vpc_network | Failure | 734.97 | test_vm_autoscaling.py |
| test_07_autoscaling_vmgroup_on_vpc_network | Error | 734.99 | test_vm_autoscaling.py |
| test_01_create_redundant_VPC_2tiers_4VMs_4IPs_4PF_ACL | Failure | 6953.64 | test_vpc_redundant.py |
| test_01_create_redundant_VPC_2tiers_4VMs_4IPs_4PF_ACL | Error | 6954.10 | test_vpc_redundant.py |
| test_02_redundant_VPC_default_routes | Failure | 8064.42 | test_vpc_redundant.py |
| test_02_redundant_VPC_default_routes | Error | 8064.96 | test_vpc_redundant.py |
| test_03_create_redundant_VPC_1tier_2VMs_2IPs_2PF_ACL_reboot_routers | Failure | 8298.88 | test_vpc_redundant.py |
| test_03_create_redundant_VPC_1tier_2VMs_2IPs_2PF_ACL_reboot_routers | Error | 8299.49 | test_vpc_redundant.py |
| test_04_rvpc_network_garbage_collector_nics | Failure | 8559.59 | test_vpc_redundant.py |
| test_04_rvpc_network_garbage_collector_nics | Error | 8560.14 | test_vpc_redundant.py |
| test_05_rvpc_multi_tiers | Failure | 9811.76 | test_vpc_redundant.py |
| test_05_rvpc_multi_tiers | Error | 9812.64 | test_vpc_redundant.py |
| test_01_VPC_nics_after_destroy | Failure | 4954.75 | test_vpc_router_nics.py |
| test_02_VPC_default_routes | Failure | 5398.62 | test_vpc_router_nics.py |
| test_01_redundant_vpc_site2site_vpn | Failure | 8479.78 | test_vpc_vpn.py |
| test_01_redundant_vpc_site2site_vpn | Error | 8480.33 | test_vpc_vpn.py |
| test_01_vpc_site2site_vpn_multiple_options | Failure | 5471.38 | test_vpc_vpn.py |
| test_01_vpc_site2site_vpn_multiple_options | Error | 5471.82 | test_vpc_vpn.py |
| test_01_vpc_site2site_vpn | Failure | 5990.11 | test_vpc_vpn.py |
| test_01_vpc_site2site_vpn | Error | 5990.47 | test_vpc_vpn.py |
| test_hostha_enable_ha_when_host_in_maintenance | Error | 305.97 | test_hostha_kvm.py |

@blueorangutan

[LL] Trillian Build Failed (tid-7129)

@shwstppr
Contributor Author

shwstppr commented Nov 9, 2025

@blueorangutan package

@blueorangutan

@shwstppr a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 15690

@shwstppr
Contributor Author

shwstppr commented Nov 9, 2025

@blueorangutan test

@blueorangutan

@shwstppr a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@shwstppr
Contributor Author

@blueorangutan test

@blueorangutan

@shwstppr a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@shwstppr
Contributor Author

There is some issue with the smoke test runs. I will investigate and make the required fixes.
