Add badfish support for iDRAC cleanup operations #745

Merged
openshift-merge-bot[bot] merged 1 commit into redhat-performance:main from cjeanner:feature/badfish
Jan 12, 2026

Conversation

cjeanner (Contributor) commented Dec 17, 2025

This commit adds support for using the badfish container to perform iDRAC cleanup operations on Dell hardware. Badfish is used to clear the iDRAC job queue and reset the iDRAC service to improve stability during boot operations. It does not replace the redfish_command or URI modules, which continue to be used for standard Redfish operations.

Changes:

  • Created new 'badfish' Ansible role with install.yml and call.yml tasks
  • Added 'reset_idrac' parameter to control badfish-based iDRAC cleanup operations
  • Integrated badfish container installation into bastion bootstrap process
  • Updated boot-iso/dell.yml to use badfish for:
    • Clearing iDRAC job queue (always executed)
    • Resetting iDRAC service (when reset_idrac is enabled)
    • Waiting for iDRAC to be available after reset
  • Replaced fixed pause with wait_for module to verify host power down
  • All badfish operations use quay.io/quads/badfish container image
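
As a rough illustration of the "wait until available" steps in the list above, the polling idea can be sketched in Python. This is not the role's code, which uses Ansible's wait_for module; the helper name `wait_for_port`, the choice of TCP port 443 as the availability check, and the timeout values are all assumptions made for this sketch.

```python
import socket
import time


def wait_for_port(host: str, port: int = 443, timeout: float = 600.0,
                  interval: float = 5.0, connect_timeout: float = 3.0) -> float:
    """Block until a TCP connection to host:port succeeds.

    Returns the number of seconds waited. Raises TimeoutError if the
    port does not accept a connection within `timeout` seconds.
    Hypothetical helper: it only checks TCP reachability, not that the
    iDRAC web or Redfish service is fully functional.
    """
    start = time.monotonic()
    while True:
        try:
            # A successful connect means something is listening again.
            with socket.create_connection((host, port), timeout=connect_timeout):
                return time.monotonic() - start
        except OSError:
            # Refused, unreachable, or timed out: retry until the deadline.
            if time.monotonic() - start >= timeout:
                raise TimeoutError(
                    f"{host}:{port} not reachable after {timeout:.0f}s")
            time.sleep(interval)
```

A call such as `wait_for_port("mgmt-node1.example.com", 443, timeout=600)` (hostname made up) would return once the port answers again. Replacing a fixed pause with a check like this means the run continues as soon as the target is ready instead of always waiting the worst-case time.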

The badfish role provides a reusable call.yml task file that accepts badfish_host, badfish_user, badfish_password, and badfish_args parameters, making it easy to call badfish commands from other roles.
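
A minimal sketch of how those four parameters could map onto a single container invocation, assuming the role runs the image with podman and passes the host and credentials through badfish's `-H`, `-u` and `-p` options. The helper `build_badfish_command` is hypothetical, and the actual task in call.yml may be structured differently.

```python
from typing import List, Sequence

# Image named in the PR description.
BADFISH_IMAGE = "quay.io/quads/badfish"


def build_badfish_command(badfish_host: str, badfish_user: str,
                          badfish_password: str,
                          badfish_args: Sequence[str]) -> List[str]:
    """Return the argv for one badfish container run (hypothetical helper).

    Assumes `podman run --rm <image> -H <host> -u <user> -p <password> <args>`;
    the real role may add further podman options.
    """
    if not badfish_args:
        raise ValueError("badfish_args must contain at least one badfish option")
    return [
        "podman", "run", "--rm", BADFISH_IMAGE,
        "-H", badfish_host,
        "-u", badfish_user,
        "-p", badfish_password,
        *badfish_args,
    ]


if __name__ == "__main__":
    # Example values only; "--clear-jobs" stands in for the job queue cleanup.
    cmd = build_badfish_command("mgmt-node1.example.com", "root", "secret",
                                ["--clear-jobs"])
    print(" ".join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems with passwords. The `--clear-jobs` option is used here as an example; check `badfish --help` for the exact flags available in the image.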

The 'reset_idrac' parameter pulls and uses the badfish container to perform iDRAC cleanup operations, which helps resolve issues with stuck job queues and improves iDRAC stability during virtual media boot operations.

AI Model: Claude Sonnet 4.5

openshift-ci bot commented Dec 17, 2025

Hi @cjeanner. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

cjeanner (Contributor, Author) commented

I was able to run this updated code against a cluster of old Dell R630 servers. Without these updates, I hit many provisioning issues: nodes weren't booting from the virtual media, or crashed on iDRAC-related issues (error 50x, etc.).

With this patch, I could deploy OCP twice in a row, a big improvement compared to my previous experience. Since the new parameter defaults to "false", it's really on-demand and shouldn't impact others without their knowledge.

akrzos (Member) left a comment

I am concerned that some of the changes here have not been tested, such as changing the use of raw to command.

akrzos (Member) commented Dec 17, 2025

> I could use this updated code against a cluster of old Dell r630. Without those updates, I faced many issues with the provisioning where nodes weren't booting on the virtual media, or crashed for some iDrac-related issues (error 50x, etc).
>
> With this patch, I could deploy twice in a row OCP - a big improvement compared to my previous experience. Since the default value for the new parameter is "false", it's really on-demand and shouldn't impact others without their knowledge.

I do want to acknowledge that I completely feel your pain in trying to deploy on the older Dell R630 lab hardware, so I would like to get this in to help alleviate how difficult it is with that hardware.

cjeanner (Contributor, Author) left a comment

TL;DR: need to revert the raw -> command change and add the missing condition.

akrzos (Member) left a comment

Ran the PR in a self-scheduled environment and it successfully deployed after completing both the job clear and iDRAC reset.

Additional time:

  • job clear - 40s (badfish running)
  • iDRAC reset - 26s (badfish running) + 3m 6s waiting for iDRAC to become responsive again
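
Adding up the figures above gives the extra time for that run, sketched here in Python (the variable names are only for illustration):

```python
# Timings reported above, in seconds.
job_clear = 40
idrac_reset = 26
idrac_wait = 3 * 60 + 6  # 3m 6s until the iDRAC responds again

# The job clear always runs; the reset and the wait only apply
# when reset_idrac is enabled.
always = job_clear
with_reset = job_clear + idrac_reset + idrac_wait


def fmt(seconds: int) -> str:
    """Format a number of seconds as 'Xm Ys'."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}m {secs}s"


print(fmt(always))      # 0m 40s
print(fmt(with_reset))  # 4m 12s
```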

My final feedback would be to rename one of the tasks noted and remove the word "run" from it. Also consider removing the extra podman install and image pull.

akrzos (Member) commented Jan 9, 2026

/test ?

openshift-ci bot commented Jan 9, 2026

@akrzos: The following commands are available to trigger required jobs:

/test deploy-cmno
/test deploy-cmno-private
/test deploy-cmno-private-bond
/test deploy-hmno
/test deploy-hmno-private
/test deploy-mno
/test deploy-mno-private
/test deploy-mno-private-bond
/test deploy-mno-scaleout
/test deploy-sno
/test deploy-sno-private
/test deploy-sno-private-bond
/test deploy-sno-scaleout
/test deploy-sno-self-sched
/test deploy-vmno
/test deploy-vmno-private

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

akrzos (Member) commented Jan 9, 2026

/test deploy-mno

akrzos (Member) commented Jan 9, 2026

@josecastillolema Could you look at this CI failure? It looks like the CI might not be copying and editing the all.sample.yml file, since a var appears to be missing.

akrzos (Member) commented Jan 12, 2026

/test deploy-mno

akrzos (Member) left a comment

LGTM

openshift-ci bot added the lgtm label Jan 12, 2026

openshift-ci bot commented Jan 12, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: akrzos

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-merge-bot bot merged commit d028c3e into redhat-performance:main Jan 12, 2026
2 checks passed
cjeanner deleted the feature/badfish branch January 12, 2026 14:56