Skip to content

Conversation

sjpb
Copy link
Collaborator

@sjpb sjpb commented Apr 9, 2025

Reverts PR #635 which broke deploying new clusters.

635 changed the logic for compute-init to try to avoid an error seen in CI where compute nodes failed to come back up after rebuild. The recollection is that there was a permissions error when the compute-init playbook stats the hostvars (no logs available). The suspicion was that the mount didn't happen due to network race and the stat got done on the unmounted host dir instead. So 635 added a check that the NFS server was reachable before trying the mount. However:

  1. The logic in 635 broke new deployments: it meant ansible-init never completed if the nfs server wasn't up (b/c the compute-init playbook timed out and then systemd restarted ansible-init forever) which meant site.yml hung forever waiting for it to complete.
  2. The unit file for ansible-init specifies
    Wants=network.target
    After=network-online.target
    
    so host networking should be ok by the time ansible-init runs
  3. If the nfs server isn't available, the pre-635 logic means the stat should never happen. So the reasoning for 635 seems faulty.
  4. With 635 reverted, it has not been possible to duplicate a permissions error either locally via faking a rebuild (by removing the ansible-init sentinel file and rebooting), actually reimaging locally using the image in this PR, or in CI.

@sjpb
Copy link
Collaborator Author

sjpb commented Apr 9, 2025

@sjpb sjpb marked this pull request as ready for review April 9, 2025 13:45
@sjpb sjpb requested a review from a team as a code owner April 9, 2025 13:45
@sjpb sjpb requested a review from bertiethorpe April 9, 2025 13:45
Copy link
Member

@bertiethorpe bertiethorpe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sjpb sjpb merged commit 6bf1dbb into main Apr 9, 2025
7 checks passed
@sjpb sjpb deleted the fix/compute-init-deploy branch April 9, 2025 15:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants