Add host failure test to verify VM resiliency and SR stability #313
base: master
Conversation
… host.
- Chooses a host within a LINSTOR SR pool and simulates a crash using sysrq-trigger.
- Verifies VM boot and shutdown on all remaining hosts during the outage, and confirms recovery of the failed host for VM placement post-reboot.
- Ensures SR scan consistency post-recovery.

Signed-off-by: Rushikesh Jadhav <[email protected]>
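For context, a minimal sketch of what such a test flow could look like, assuming xcp-ng-tests-style fixtures and helpers (`linstor_sr`, `host.ssh(...)`, `vm.start(on=...)`, `wait_for_reboot`, etc.); all names here are illustrative, not the PR's actual code:

```python
import random

def test_linstor_sr_fail_host(linstor_sr, vm_on_linstor_sr):
    sr = linstor_sr
    vm = vm_on_linstor_sr
    hosts = list(sr.pool.hosts)
    hosts.remove(sr.pool.master)  # never crash the pool master

    failed_host = random.choice(hosts)
    # sysrq 'c' crashes the kernel immediately; the SSH session dies with
    # the host, so don't expect a clean return from this call.
    try:
        failed_host.ssh(['sh', '-c', 'echo c > /proc/sysrq-trigger'])
    except Exception:
        pass  # expected: the host went down mid-command

    # During the outage, the VM must still boot and shut down cleanly
    # on every surviving host.
    for host in hosts:
        if host != failed_host:
            vm.start(on=host.uuid)
            vm.wait_for_os_booted()
            vm.shutdown(verify=True)

    # Once the crashed host has rebooted, it must accept VM placement
    # again and the SR must scan without errors.
    failed_host.wait_for_reboot()  # hypothetical helper
    vm.start(on=failed_host.uuid)
    vm.wait_for_os_booted()
    vm.shutdown(verify=True)
    sr.scan()
```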
If the test fails while the SR is in a failed state, could this leave the pool in a bad state?
If the test fails and the SR goes bad, I don't see an easy way to recover from it; some scenario-based troubleshooting may be required before cleaning up. However, in principle a single-host failure in XOSTOR should be tolerable, and this test should catch the cases where it is not.
We probably should use nested tests for this so that in case the pool goes bad, we can just wipe and start with a new clean one.
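For illustration, a throw-away setup could be modelled as a function-scoped fixture that always rebuilds the SR; `create_linstor_sr` and the teardown calls below are hypothetical names, not an existing API:

```python
import pytest

@pytest.fixture(scope='function')
def throwaway_linstor_sr(host):
    # Build a fresh SR for this test only (hypothetical helper).
    sr = host.pool.create_linstor_sr('LINSTOR-throwaway')
    try:
        yield sr
    finally:
        # Tear down unconditionally; if the SR is wedged after a failed
        # test, forget it so the next run can start from blank devices.
        try:
            sr.destroy(verify=True)
        except Exception:
            sr.forget()
```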
Or should the SR used for physical tests be a throw-away one too?
There are block devices involved (LVM, DRBD, tapdisk, which gets blocked on I/O), and an improper teardown needs careful inspection and recovery, or a harsh wipe of everything plus reboots. A host failure is still tolerated better than a disk/LVM failure.
An XOSTOR SR is hardly a throw-away.
Well, if we use those block devices for nothing other than this test, we can easily blank them to restart from scratch. We do that for all local SRs, and in some way a Linstor SR is "local to the pool". If our test pool is the sole user, it looks throw-away to me.
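A sketch of what that blanking could look like, assuming each test host reserves the same dedicated disk for LINSTOR (the device path and helper names are assumptions):

```python
LINSTOR_TEST_DEVICE = '/dev/nvme0n1'  # hypothetical: disk reserved for these tests

def blank_linstor_devices(hosts, device=LINSTOR_TEST_DEVICE):
    for host in hosts:
        # wipefs -a erases all filesystem/LVM/DRBD signatures on the device,
        # returning it to a blank state for the next SR creation.
        host.ssh(['wipefs', '-a', device])
```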
Yes, this happens (manually) when the test needs a clean start. If that's acceptable, we can add it into prepare_test or similar so that a manual script is not required.
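If that route is taken, the wipe could hang off the preparation step rather than a manual script; a minimal sketch, reusing the hypothetical helper above:

```python
def prepare_test(hosts):
    # Start every run from blank devices (see blank_linstor_devices above).
    blank_linstor_devices(hosts)
```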
@ydirson do we need a new fixture to handle VM reboots (during the host's failed and recovered states)? https://github.com/xcp-ng/xcp-ng-tests/pull/313/files#diff-e40824d600ab1c5614cf60bf13e30d8bea1634a03c0df205b9cb1a15239a8505R162-R164
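One possible shape for such a fixture: a callable that cycles the VM through boot and clean shutdown on a given host, usable both while a host is down and after it recovers (the fixture and method names are assumptions, not this repo's confirmed API):

```python
import pytest

@pytest.fixture
def vm_cycler(imported_vm):
    def _cycle(host):
        # Boot on the requested host, wait for the guest OS, shut down cleanly.
        imported_vm.start(on=host.uuid)
        imported_vm.wait_for_os_booted()
        imported_vm.shutdown(verify=True)
    return _cycle
```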
```python
    hosts.remove(sr.pool.master)
    # Evacuate the node to be deleted
    try:
        random_host = random.choice(hosts)  # TBD: Choose Linstor Diskfull node
```
Suggested change:

```diff
-        random_host = random.choice(hosts)  # TBD: Choose Linstor Diskfull node
+        random_host = random.choice(hosts)  # TBD: Choose Linstor Diskful node
```
```python
    Fail non master host from the same pool Linstor SR.
    Ensure that VM is able to boot and shutdown on all hosts.
    """
    import random
```
`random` remains a common module; perhaps we should put it in a global import in case a future function in this module uses it?
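Concretely, the suggestion is to hoist the import to module scope so any later function can use it without re-importing; a trivial illustration (the helper is hypothetical):

```python
import random  # module scope, at the top of the test file

def pick_non_master_host(hosts):
    # Any function added to this module later can now use `random` directly.
    return random.choice(hosts)
```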
Added `test_linstor_sr_fail_host` to simulate a crash of a non-master host.