fog: Try ipmi power-cycle if stuck in a reimage reboot hang #2146

Merged: djgalloway merged 1 commit into main from reboot-7min on Feb 17, 2026

djgalloway (Contributor) commented Feb 12, 2026

Fixes: https://tracker.ceph.com/issues/74717

Works: https://pulpito.ceph.com/dgalloway-2026-02-12_21:01:11-rados:cephadm-wip-rocky10-branch-of-the-day-2026-02-03-1770151121-distro-default-trial/

2026-02-12T21:03:43.012 INFO:teuthology.provision.fog.trial163:Waiting for deploy to finish
2026-02-12T21:03:43.047 INFO:teuthology.provision.fog.trial020:Waiting for deploy to finish

2026-02-12T21:06:53.165 INFO:teuthology.provision.fog.trial020:Deploy complete!

...

2026-02-12T21:12:28.627 DEBUG:teuthology.orchestra.connection:{'hostname': 'trial163.front.sepia.ceph.com', 'username': 'ubuntu', 'timeout': 60}
2026-02-12T21:12:31.709 WARNING:teuthology.provision.fog.trial163:[Errno None] Unable to connect to port 22 on 10.20.193.163
2026-02-12T21:12:37.715 WARNING:teuthology.provision.fog.trial163:trial163: SSH not up after 427s (~70% of timeout); power cycling and continuing to wait
2026-02-12T21:12:37.716 INFO:teuthology.orchestra.console.trial163:Power off
2026-02-12T21:12:37.716 DEBUG:teuthology.orchestra.console.trial163:pexpect command: ipmitool -H trial163.ipmi.sepia.ceph.com -I lanplus -U inktank -P ApGNXcA7 power off
2026-02-12T21:12:37.758 DEBUG:teuthology.orchestra.console.trial163:power off output: Chassis Power Control: Down/Off
2026-02-12T21:12:41.762 DEBUG:teuthology.orchestra.console.trial163:pexpect command: ipmitool -H trial163.ipmi.sepia.ceph.com -I lanplus -U inktank -P ApGNXcA7 power status
2026-02-12T21:12:41.873 DEBUG:teuthology.orchestra.console.trial163:check power output: Chassis Power is off
2026-02-12T21:12:41.873 INFO:teuthology.orchestra.console.trial163:Power off completed
2026-02-12T21:12:41.973 INFO:teuthology.orchestra.console.trial163:Power on
2026-02-12T21:12:41.974 DEBUG:teuthology.orchestra.console.trial163:pexpect command: ipmitool -H trial163.ipmi.sepia.ceph.com -I lanplus -U inktank -P ApGNXcA7 power on
2026-02-12T21:12:42.014 DEBUG:teuthology.orchestra.console.trial163:power on output: Chassis Power Control: Up/On
2026-02-12T21:12:46.019 DEBUG:teuthology.orchestra.console.trial163:pexpect command: ipmitool -H trial163.ipmi.sepia.ceph.com -I lanplus -U inktank -P ApGNXcA7 power status
2026-02-12T21:12:46.130 DEBUG:teuthology.orchestra.console.trial163:check power output: Chassis Power is on
2026-02-12T21:12:46.130 INFO:teuthology.orchestra.console.trial163:Power on completed
2026-02-12T21:12:46.231 DEBUG:teuthology.orchestra.connection:{'hostname': 'trial163.front.sepia.ceph.com', 'username': 'ubuntu', 'timeout': 60}
2026-02-12T21:12:49.309 WARNING:teuthology.provision.fog.trial163:[Errno None] Unable to connect to port 22 on 10.20.193.163

...

2026-02-12T21:14:49.526 INFO:teuthology.provision.fog.trial163:Deploy complete!
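
The power-off, status check, power-on, status check sequence in the log above can be sketched roughly as follows. This is only an illustration, not teuthology's console code: the log shows teuthology driving `ipmitool` through pexpect, whereas this sketch swaps in `subprocess`. The helper names (`run_subprocess`, `parse_power_status`, `set_power`, `power_cycle`), the `attempts` and `settle` defaults, and the injectable `run`/`sleep` hooks are all assumptions made for the example. The `ipmitool` arguments and the "Chassis Power is on/off" output format are taken from the log.

```python
import subprocess
import time
from typing import Callable, List

# Hypothetical type for anything that runs a command and returns its stdout.
Runner = Callable[[List[str]], str]


def run_subprocess(cmd: List[str]) -> str:
    """Run a command and return stdout (assumed stand-in for pexpect)."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_power_status(output: str) -> str:
    """Extract 'on' or 'off' from 'Chassis Power is on' style output."""
    text = output.strip().lower()
    prefix = "chassis power is "
    if not text.startswith(prefix):
        raise ValueError("unexpected power status output: %r" % output)
    return text[len(prefix):]


def set_power(base: List[str], state: str, run: Runner,
              sleep: Callable[[float], None],
              attempts: int = 5, settle: float = 4.0) -> bool:
    """Request a power state, then poll 'power status' until it matches."""
    run(base + ["power", state])
    for _ in range(attempts):
        sleep(settle)  # give the BMC a moment before checking
        if parse_power_status(run(base + ["power", "status"])) == state:
            return True
    return False


def power_cycle(host: str, user: str, password: str,
                run: Runner = run_subprocess,
                sleep: Callable[[float], None] = time.sleep) -> bool:
    """Power the chassis off, confirm it, then power it on and confirm it."""
    base = ["ipmitool", "-H", host, "-I", "lanplus", "-U", user, "-P", password]
    return (set_power(base, "off", run, sleep)
            and set_power(base, "on", run, sleep))
```

Passing `run` and `sleep` in as arguments keeps the sketch testable without a real BMC; a caller could exercise it with a fake runner that returns canned `Chassis Power is ...` strings.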

I used 70% of the timeout as the power-cycle threshold in the test job, but that's way too long, so I changed it to 50%. I simulated a reboot hang by powering trial163 off once I saw it start booting after the reimage.
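
For illustration, the behavior described here (keep waiting for SSH, and power-cycle once if the host is still unreachable after a fraction of the timeout) might look something like the sketch below. It is not the code from this PR: `wait_for_ssh`, its parameters, and the default values are hypothetical, and the `ssh_is_up` and `power_cycle` callables stand in for whatever the provisioner actually uses.

```python
import time
from typing import Callable


def wait_for_ssh(
    ssh_is_up: Callable[[], bool],
    power_cycle: Callable[[], None],
    timeout: float = 600.0,          # assumed value, not taken from the PR
    cycle_fraction: float = 0.5,     # power-cycle at 50% of the timeout
    poll_interval: float = 6.0,      # assumed value
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until SSH is reachable; power-cycle at most once along the way."""
    start = clock()
    cycled = False
    while True:
        elapsed = clock() - start
        if ssh_is_up():
            return True
        if elapsed >= timeout:
            return False
        # Past the threshold and still unreachable: try one power cycle,
        # then keep waiting for the remainder of the timeout.
        if not cycled and elapsed >= timeout * cycle_fraction:
            power_cycle()
            cycled = True
        sleep(poll_interval)
```

Cycling only once leaves the second half of the timeout for the host to finish booting, which matches the "power cycling and continuing to wait" message in the log.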

@djgalloway djgalloway requested a review from a team as a code owner February 12, 2026 21:18
@djgalloway djgalloway requested review from amathuria and kamoltat and removed request for a team February 12, 2026 21:18
@djgalloway djgalloway requested a review from dmick February 12, 2026 21:55
@dmick dmick changed the title fog: Try rebooting if stuck in a reimage reboot hang fog: Try ipmi power-cycle if stuck in a reimage reboot hang Feb 12, 2026
dmick (Member) commented Feb 12, 2026

I changed the PR title. Please change the commit message to match.

Fixes: https://tracker.ceph.com/issues/74717

Signed-off-by: David Galloway <david.galloway@ibm.com>
dmick (Member) left a comment

I don't see any reason we shouldn't include this, even if other reboot methods prove more reliable.

@djgalloway djgalloway merged commit fa17720 into main Feb 17, 2026
9 checks passed
@djgalloway djgalloway deleted the reboot-7min branch February 17, 2026 00:16