@gangj gangj commented Jan 6, 2026

Problem: CA-384228 - Xapi fails to start on slave during pool join

When a slave joins a pool with jumbo frames (MTU 9000) configured, xapi can hang
during "Synchronising bonds on slave with master" if the network path doesn't
support the configured MTU.

Root Cause:

  • Management interface configured with MTU=9000 (jumbo frames)
  • Network path actual MTU ~1500 bytes (switches/routers don't support jumbo frames)
  • During pool join, xapi makes RPC calls to master with large requests (~1613 bytes)
  • Packets exceeding path MTU are silently dropped by network infrastructure
  • Without TCP MTU probing (packetization-layer PMTUD) enabled, TCP cannot detect the MTU mismatch once ICMP feedback is blocked
  • TCP retransmissions fail repeatedly as packets continue to be dropped
  • The RPC call hangs, eventually timing out
  • Application-level retry logic (in master_connection.ml) attempts reconnection
  • Each retry encounters the same MTU issue
  • Meanwhile, the retrying connections keep holding one database lock
  • Any other DB request from slave to master is blocked waiting for this lock
  • This causes the entire slave xapi to hang, not just the pool join operation
  • Pool join hangs for an extended period before eventually failing
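
The size arithmetic behind this chain can be sketched as follows. This is an illustration only, assuming IPv4 and TCP headers of 20 bytes each with no options; the helper names are hypothetical and not part of xapi. It shows why a small request passes while the ~1613-byte request is dropped when the MSS is derived from the 9000-byte interface MTU.

```python
IP_HEADER = 20   # IPv4 header without options (assumption)
TCP_HEADER = 20  # TCP header without options (timestamps would add 12 more)


def mss_for(mtu):
    """Largest TCP payload that fits in one packet of the given MTU."""
    return mtu - IP_HEADER - TCP_HEADER


def packet_sizes(request_bytes, mss):
    """IP packet sizes produced when a request is split into MSS-sized segments."""
    sizes = []
    remaining = request_bytes
    while remaining > 0:
        payload = min(remaining, mss)
        sizes.append(payload + IP_HEADER + TCP_HEADER)
        remaining -= payload
    return sizes


def delivered(request_bytes, interface_mtu, path_mtu):
    """True if every packet of the request fits through the path."""
    mss = mss_for(interface_mtu)
    return all(size <= path_mtu for size in packet_sizes(request_bytes, mss))


if __name__ == "__main__":
    for request in (1000, 1613):
        print(request, delivered(request, interface_mtu=9000, path_mtu=1500))
```

With an interface MTU of 1500 the same 1613-byte request is split into two packets that both fit, which is why the problem only appears when the configured MTU exceeds the path MTU.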

Why it happens:
By default, TCP relies on ICMP "Fragmentation Needed" messages to discover path MTU.
When the interface is configured with MTU=9000 but the network path only supports
1500 bytes:

  • If ICMP is working: Router sends "Fragmentation Needed" ICMP message back to TCP,
    TCP reduces packet size to 1500, connection works fine.

  • If ICMP is blocked (CA-384228 scenario): Router drops large packets but the ICMP
    message is blocked by firewalls. TCP has no way to discover the mismatch. It keeps
    retransmitting segments larger than the path MTU, which are silently dropped, leading
    to connection hangs and database lock contention.

Before this fix: TCP depends entirely on ICMP for path MTU discovery. When ICMP is
blocked, TCP cannot adapt, causing extended hangs and database lock contention.

After this fix: TCP can detect packet loss patterns and proactively reduce packet
size even without ICMP feedback, preventing hangs and allowing pool operations to
complete successfully.

Solution Overview:
This fix has two parts:

  1. Enable TCP Path MTU Discovery (PMTUD) via TCP MTU probing - Allows TCP to detect
    and adapt to the path MTU automatically, preventing hangs
  2. Add diagnostics during pool join - Detects and warns about MTU mismatches
    and creates an alert so customers are aware of the problem

Commit 1: CA-384228: Enable TCP Path MTU Discovery by default

Add sysctl configuration to enable TCP PMTUD on all XenServer hosts.
This prevents TCP connection hangs when path MTU is smaller than configured
interface MTU (e.g., jumbo frames configured but network infrastructure
doesn't support them).

How it fixes the hang:
With PMTUD enabled, TCP can now automatically:

  1. Detect packet loss patterns indicating MTU issues
  2. Reduce packet size (MSS) to find working MTU
  3. Continue communication with adjusted packet size
  4. Work even when ICMP is blocked by firewalls

This prevents the database lock contention that causes slave xapi to hang completely.
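
A toy model of the recovery described above, under stated assumptions: it is not the kernel's algorithm, only a simplified sketch in which the sender falls back to the base MSS after a fixed number of timeouts. The function name, the timeout threshold and the 40-byte header overhead are all illustrative choices.

```python
IP_TCP_HEADERS = 40  # IPv4 + TCP headers without options (assumption)


def send_request(request_bytes, interface_mtu, path_mtu, probing=True,
                 base_mss=1024, timeouts_before_probe=3, max_timeouts=10):
    """Simulate one request; return (delivered, timeouts, mss_in_use)."""
    mss = interface_mtu - IP_TCP_HEADERS
    timeouts = 0
    while timeouts <= max_timeouts:
        largest_packet = min(request_bytes, mss) + IP_TCP_HEADERS
        if largest_packet <= path_mtu:
            return True, timeouts, mss
        # The packet is silently dropped, so the sender only sees a timeout.
        timeouts += 1
        if probing and timeouts >= timeouts_before_probe:
            # Simplified fallback: shrink the MSS to the configured base value.
            mss = min(mss, base_mss)
    return False, timeouts, mss


if __name__ == "__main__":
    print(send_request(1613, 9000, 1500, probing=False))
    print(send_request(1613, 9000, 1500, probing=True))
```

In this model the request never gets through without probing, while with probing it succeeds after a few timeouts once segments are capped at 1024 bytes.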

Configuration:

  • net.ipv4.tcp_mtu_probing=1: Enable MTU probing only after an ICMP black hole is
    detected (recommended setting)
  • net.ipv4.tcp_base_mss=1024: Base MSS for MTU probing
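
One way to confirm that these values are in effect on a host is sketched below. The contents of the new conf file are not shown in this PR, so `SAMPLE_CONF` is an assumed rendering of the two settings listed above, and the helper names are hypothetical. Live values are read from `/proc/sys`, where each dot in a key maps to a path separator.

```python
from pathlib import Path

# The two settings named in the PR description.
EXPECTED = {
    "net.ipv4.tcp_mtu_probing": "1",
    "net.ipv4.tcp_base_mss": "1024",
}

# Assumed file contents; the actual 92-xapi-tcp-mtu.conf is not shown in the PR.
SAMPLE_CONF = """\
# Enable TCP MTU probing when an ICMP black hole is detected
net.ipv4.tcp_mtu_probing = 1
net.ipv4.tcp_base_mss = 1024
"""


def parse_sysctl_conf(text):
    """Parse 'key = value' lines, skipping blanks and '#' or ';' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def read_live(key, proc_root="/proc/sys"):
    """Return the running kernel's value for key, or None if unreadable."""
    path = Path(proc_root) / key.replace(".", "/")
    try:
        return path.read_text().strip()
    except OSError:
        return None


def mismatches(actual):
    """Map each key whose value differs to (expected, actual)."""
    return {k: (v, actual.get(k)) for k, v in EXPECTED.items() if actual.get(k) != v}


if __name__ == "__main__":
    live = {key: read_live(key) for key in EXPECTED}
    print(mismatches(live) or "all expected values are active")
```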

Files:

  • scripts/92-xapi-tcp-mtu.conf: New sysctl configuration file
  • scripts/Makefile: Install sysctl config to /etc/sysctl.d/

The "92" prefix ensures this loads after basic network configuration
(91-net-ipv6.conf) but before local administrator overrides (99-*).
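
The ordering argument can be illustrated with a small model. It assumes systemd-style sysctl.d rules: files from all directories are sorted by file name, a file in /etc/sysctl.d/ masks a same-named file in /usr/lib/sysctl.d/, and the last file to set a key wins. The directory placement of each file and the `99-local.conf` name are hypothetical.

```python
def load_order(files_by_dir, precedence):
    """Return full paths in the order they would be applied.

    files_by_dir: directory -> list of file names
    precedence:   directories, highest priority first
    """
    chosen = {}
    for directory in reversed(precedence):  # lower priority first, higher overwrites
        for name in files_by_dir.get(directory, []):
            chosen[name] = directory
    return [chosen[name] + "/" + name for name in sorted(chosen)]


def effective_settings(ordered_files, contents):
    """Apply files in order; later files override earlier ones."""
    settings = {}
    for path in ordered_files:
        settings.update(contents.get(path, {}))
    return settings


if __name__ == "__main__":
    files = {
        "/usr/lib/sysctl.d": ["91-net-ipv6.conf", "92-xapi-tcp-mtu.conf"],
        "/etc/sysctl.d": ["99-local.conf"],  # hypothetical admin override
    }
    order = load_order(files, ["/etc/sysctl.d", "/usr/lib/sysctl.d"])
    contents = {
        "/usr/lib/sysctl.d/92-xapi-tcp-mtu.conf": {
            "net.ipv4.tcp_mtu_probing": "1",
            "net.ipv4.tcp_base_mss": "1024",
        },
        "/etc/sysctl.d/99-local.conf": {"net.ipv4.tcp_mtu_probing": "0"},
    }
    print(order)
    print(effective_settings(order, contents))
```

In this model the 92 file is applied after 91-net-ipv6.conf, and an administrator's 99-* file can still override individual keys.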

Reference: https://blog.cloudflare.com/path-mtu-discovery-in-practice/


Commit 2: CA-384228: Add MTU diagnostics during pool join

Add diagnostic tests during pool join to detect and warn about MTU
mismatches, particularly when higher MTU values are configured but
the network path doesn't support them.

Why diagnostics are needed:
While TCP PMTUD (commit 1) fixes the hang automatically, customers need
visibility into MTU configuration issues. This creates an alert visible
in XenCenter/CLI when path MTU < configured MTU, prompting infrastructure
fixes to prevent performance degradation.

The diagnostics:

  1. Query master's management network MTU via RPC
  2. Detect VLAN configuration and account for 4-byte overhead
  3. Calculate ICMP payload dynamically: MTU - IP header (20) - ICMP header (8) - VLAN (4 if present)
  4. Test standard MTU (1500) with ICMP ping
  5. Test configured MTU if > 1500
  6. Create pool-level alert when CA-384228 scenario detected:
    • Standard MTU (1500) works
    • Configured higher MTU fails
    • This indicates path MTU < configured MTU
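
The payload calculation and alert condition from the steps above can be sketched as follows. The PR's actual implementation is not shown here, so the function names and the ping invocation are assumptions. The flags used are standard iputils `ping` options: `-M do` prohibits fragmentation, `-s` sets the payload size, `-c` the packet count and `-W` the reply timeout in seconds.

```python
IP_HEADER = 20      # IPv4 header without options
ICMP_HEADER = 8     # ICMP echo header
VLAN_TAG = 4        # 802.1Q tag overhead, subtracted as described in step 3
STANDARD_MTU = 1500


def icmp_payload(mtu, vlan=False):
    """ICMP payload size that produces a packet of exactly the given MTU."""
    return mtu - IP_HEADER - ICMP_HEADER - (VLAN_TAG if vlan else 0)


def ping_args(host, mtu, vlan=False):
    """Build a single don't-fragment ping sized for the given MTU."""
    return ["ping", "-c", "1", "-W", "2", "-M", "do",
            "-s", str(icmp_payload(mtu, vlan)), host]


def should_alert(configured_mtu, standard_ok, configured_ok):
    """Alert only when 1500 works and the larger configured MTU fails.

    configured_ok may be None when the configured MTU was not tested.
    If the standard test also fails, ICMP is probably blocked, so no
    conclusion is drawn and no alert is raised.
    """
    return configured_mtu > STANDARD_MTU and standard_ok and configured_ok is False


if __name__ == "__main__":
    print(icmp_payload(1500), icmp_payload(9000), icmp_payload(9000, vlan=True))
    print(ping_args("master.example", 9000))  # hypothetical host name
    print(should_alert(9000, standard_ok=True, configured_ok=False))
```

Returning no alert when the standard-MTU ping fails matches the design decision that the diagnostics must not block or mislead when ICMP itself is filtered.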

Key design decisions:

  • Does NOT block pool join (ICMP may be blocked by firewalls)
  • Queries master's DB via verified RPC (slave's DB not yet synced)
  • Called after certificate exchange with verified connection
  • Creates pool-level alert for customer visibility in XenCenter/CLI
  • Relies on TCP PMTUD (enabled by commit 1) to handle issues automatically
  • Diagnostics are informational only, providing visibility

The implementation dynamically calculates test packet sizes based on
actual configured MTU rather than assuming fixed values, making it
work correctly with any MTU configuration (not just jumbo frames).

Warning format highlights the issue clearly and references the
TCP PMTUD fix that handles it automatically, with guidance for
persistent problems.

@gangj gangj force-pushed the private/gangj/CA-384228 branch 2 times, most recently from 1d0e0d4 to 0d7e423 Compare January 8, 2026 09:16
bleader commented Jan 8, 2026

The PR description has " - <title>" and the description " - <another_title>" and the link is not clear for someone outside XenServer. I assume the PR title references the original issue this fixes and the description one is the solution found to fix it?

gangj commented Jan 8, 2026

> The PR description has " - <title>" and the description " - <another_title>" and the link is not clear for someone outside XenServer. I assume the PR title references the original issue this fixes and the description one is the solution found to fix it?

Sorry, added detailed context for the fix in the description, pls review it again, thanks.

@gangj gangj force-pushed the private/gangj/CA-384228 branch from 0d7e423 to dcaf8d9 Compare January 9, 2026 10:30
gangj added 2 commits January 11, 2026 18:27
Fixes slave xapi hang during pool join when jumbo frames are configured
but the network path doesn't support them.

Problem:
When MTU mismatch occurs (interface configured for 9000 but path supports
only 1500), RPC connections hang on large requests (~1613 bytes). The
hanging connection holds a database lock, blocking all other DB operations
and causing the entire slave xapi to become unresponsive during pool join.

Root cause:
Without Path MTU Discovery, TCP cannot detect when the path MTU is smaller
than the configured interface MTU. When ICMP "Fragmentation Needed" messages
are blocked by firewalls, TCP has no feedback mechanism to reduce packet size.
Packets exceeding the path MTU are silently dropped by network infrastructure,
leading to connection timeouts. The application-level retry logic (in
master_connection.ml) attempts reconnection, but each retry encounters the
same issue while holding a database lock, causing extended hangs.

Solution:
Enable TCP PMTUD to allow automatic MTU detection and adaptation.

Configuration:
- net.ipv4.tcp_mtu_probing=1: Enable automatic MTU detection when ICMP
  blackhole is detected (recommended setting)
- net.ipv4.tcp_base_mss=1024: Base MSS for MTU probing

With PMTUD enabled, TCP detects packet loss patterns indicating MTU issues
and proactively reduces packet size to find a working MTU. This works even
when ICMP Fragmentation Needed messages are blocked by firewalls, allowing
connections to succeed and preventing database lock contention.

Files:
- scripts/92-xapi-tcp-mtu.conf: New sysctl configuration file
- scripts/Makefile: Install sysctl config to /usr/lib/sysctl.d/

The conf file is installed to /usr/lib/sysctl.d/ (package-owned location)
rather than /etc/sysctl.d/ (user config space). The "92" prefix ensures this
loads after basic network configuration (91-net-ipv6.conf), and admins can
override with files in /etc/sysctl.d/ if needed.

Reference: https://blog.cloudflare.com/path-mtu-discovery-in-practice/

Signed-off-by: Gang Ji <[email protected]>

Add diagnostic tests during pool join to detect and warn about MTU
mismatches, particularly when higher MTU values are configured but
the network path doesn't support them.

While TCP PMTUD (enabled in previous commit) fixes the hang automatically,
this provides visibility into MTU configuration issues so customers can
fix their network infrastructure.

The diagnostics:
1. Query master's management network MTU via RPC
2. Detect VLAN configuration and account for 4-byte overhead
3. Calculate ICMP payload dynamically:
   MTU - IP header (20) - ICMP header (8) - VLAN (4 if present)
4. Test standard MTU (1500) with ICMP ping
5. Test configured MTU if > 1500
6. Create pool-level alert when CA-384228 scenario detected:
   - Standard MTU (1500) works
   - Configured higher MTU fails
   - This indicates path MTU < configured MTU

Key design decisions:
- Does NOT block pool join (ICMP may be blocked by firewalls)
- Queries master's DB via verified RPC (slave's DB not yet synced)
- Called after certificate exchange with verified connection
- Creates pool-level alert for customer visibility in XenCenter/CLI
- Relies on TCP PMTUD (enabled by previous commit) to prevent hangs
- Diagnostics are informational only, providing visibility

The implementation dynamically calculates test packet sizes based on
actual configured MTU rather than assuming fixed values, making it
work correctly with any MTU configuration (not just jumbo frames).

Warning format highlights the issue clearly and references the
TCP PMTUD fix that handles it automatically, with guidance for
infrastructure improvements.

Signed-off-by: Gang Ji <[email protected]>
@gangj gangj closed this Jan 11, 2026
@gangj gangj deleted the private/gangj/CA-384228 branch January 11, 2026 10:34
gangj commented Jan 11, 2026

Apologies for the notification noise just now—I accidentally deleted the branch and closed the PR due to a typo in my git push command.

@gangj gangj reopened this Jan 11, 2026
@gangj gangj force-pushed the private/gangj/CA-384228 branch from 65433e1 to ea5101b Compare January 11, 2026 10:52

@changlei-li changlei-li left a comment

Nice fix. But I think the critical issue in the CA ticket is the infrastructure/MTU configuration. This PR is an improvement.
IMO, commit 1, which enables TCP PMTUD, should go into a Foundation repo, not XAPI. It also takes effect across the whole host, which should be considered carefully, and it may change the behavior of existing test cases. It only makes sense to add it to the XAPI repo if it needs to be configured and managed by XAPI.
For commit 2, I'm not sure it is worth spending extra time diagnosing MTU during pool join.

@minglumlu

Closing this PR as the author can't continue working on it. It can be re-opened if someone would like to continue from here. Otherwise, we may pick it up in the future with preparation.

@minglumlu minglumlu closed this Jan 20, 2026