@pfliu pfliu commented Feb 18, 2025

fence_kdump_send may fail to send a message if the network is slow to
initialize while local dumping completes quickly.

To address this, add an additional wait for the network and make a best
effort to send the message before rebooting.

Pingfan Liu added 3 commits February 18, 2025 10:57
Resolves: https://issues.redhat.com/browse/RHEL-46337

From fence_kdump_send(8):
-i, --interval=INTERVAL
Time to wait between sending a message. The value for INTERVAL must be greater than zero. (default: 10)

The default 10-second interval is too large, especially when local
dumping finishes quickly. Consider the following scenario:
	network is not ready
	fence_kdump_notify &
	network is ready
	local dumping finishes and the system reboots within 10 seconds
In that case we miss the chance to send out the fence_kdump messages.

Shorten the interval to one second to mitigate this issue.

Signed-off-by: Pingfan Liu <[email protected]>
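
For illustration, the shortened interval described in the commit message above could look roughly like the sketch below. This is not the actual patch: the function name `fence_kdump_notify_sketch` is hypothetical, the default path for `FENCE_KDUMP_SEND` is an assumption, and `FENCE_KDUMP_ARGS` / `FENCE_KDUMP_NODES` are assumed to be populated by the caller. Only `-i` (the `--interval` option quoted from the man page) is taken from the source.

```shell
#!/bin/sh
# Sketch only, not the code from this pull request.
# Assumption: the sender binary lives at this path unless overridden.
FENCE_KDUMP_SEND="${FENCE_KDUMP_SEND:-/usr/libexec/fence_kdump_send}"

# Hypothetical helper: start the sender in the background with a
# 1-second interval instead of the documented default of 10 seconds.
fence_kdump_notify_sketch() {
    # Nothing to do when no target nodes are configured.
    [ -n "${FENCE_KDUMP_NODES:-}" ] || return 0

    # Variables are intentionally unquoted so that the extra arguments
    # and the node list are split into separate words.
    $FENCE_KDUMP_SEND -i 1 ${FENCE_KDUMP_ARGS:-} $FENCE_KDUMP_NODES &
}
```

With a one-second interval, a sender started before the network is up retries about once per second, so the window in which a fast dump can finish and reboot without any message being sent shrinks from up to 10 seconds to about one.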
Resolves: https://issues.redhat.com/browse/RHEL-46337

fence_kdump_send may fail to send a message if the network is slow to
initialize while local dumping completes quickly.

To address this, add an additional wait for the network and make a best
effort to send the message before rebooting.

Signed-off-by: Pingfan Liu <[email protected]>
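
One way to read "an additional wait for the network and a best effort to send the message before rebooting" is a bounded poll followed by a short, bounded send. The sketch below is an assumption about the shape of such logic, not the code in this pull request: every function name, the `NETWORK_CHECK` and `NETWORK_POLL_SECS` knobs, the "default route exists" readiness test, and the retry counts are hypothetical. `-c` (`--count`) and `-i` (`--interval`) are existing fence_kdump_send options.

```shell
#!/bin/sh
# Sketch only. NETWORK_CHECK and NETWORK_POLL_SECS are hypothetical knobs
# introduced for illustration; they are not kdump settings.
FENCE_KDUMP_SEND="${FENCE_KDUMP_SEND:-/usr/libexec/fence_kdump_send}"

# Assumption: "network ready" means a default route exists.
network_ready_sketch() {
    ip route show default 2>/dev/null | grep -q .
}

# wait_for_network_sketch MAX_TRIES
# Returns 0 as soon as the check passes, 1 after MAX_TRIES failed checks.
wait_for_network_sketch() {
    _left="$1"
    while [ "$_left" -gt 0 ]; do
        if "${NETWORK_CHECK:-network_ready_sketch}"; then
            return 0
        fi
        _left=$((_left - 1))
        # Do not sleep after the final failed check.
        if [ "$_left" -gt 0 ]; then
            sleep "${NETWORK_POLL_SECS:-1}"
        fi
    done
    return 1
}

# Best effort: the vmcore is already saved at this point, so a failure
# here must never block the reboot. The count of 3 is illustrative.
final_fence_notify_sketch() {
    [ -n "${FENCE_KDUMP_NODES:-}" ] || return 0
    if wait_for_network_sketch "${1:-10}"; then
        $FENCE_KDUMP_SEND -c 3 -i 1 $FENCE_KDUMP_NODES || true
    fi
    return 0
}
```

Bounding both the wait and the number of messages keeps the extra delay before reboot predictable, which matters if the cluster manager may forcefully fence the node in the meantime.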
}

get_host_ip() {

Member

Hi @pfliu, get_host_ip already waits for the network to be ready. Maybe a simpler solution would be to modify get_host_ip so that it also waits for the network when fence_kdump is configured? Note that fence_kdump_notify already has code to check whether fence_kdump is in use, so get_host_ip can reuse it. Btw, I assume fence_kdump_notify is used for sending the fence message, so it would be better to move get_host_ip before it.

Collaborator Author

@coiby, sorry for the late reply. The assumption here is that saving the vmcore is more important than sending out the fence message.
As we have observed long waits for network readiness, I am a little worried that the cluster manager may forcefully reboot the crashed machine during that period. If that happens, the vmcore will not be saved.

But I am open to this option. What is your opinion now?

thanks

Member

Thanks for the clarification! If I understand correctly, the reason we send out the fence message is precisely to tell the cluster manager not to reboot the machine while vmcore dumping is in progress, right? I agree there is no need to wait for the network to be ready first, since FENCE_KDUMP_SEND can keep sending messages until it succeeds. So if the purpose of fence_kdump is to make sure vmcore dumping is not interrupted, does that mean there is no need to wait for the network to be ready after vmcore dumping has finished?
