-
Notifications
You must be signed in to change notification settings - Fork 338
DAOS-15993 rebuild: for manual rebuilds do not eval self_heal #17345
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Consider a quick maintenance scenario in which a daos_engine is stopped briefly, and the administrator does not wish to have the DAOS automatic recovery / rebuild mechanism occur. That is, a pool map update (targets from UP_IN to DOWN) is to occur, the pool to enter a degraded mode (still allowing ongoing I/O), and NO rebuild to be triggered during this brief time window. The above can be arranged by modifying the system or pool-specific self_heal property value (to not set the rebuild bit), and then stopping the engine. Now also consider the conclusion of the maintenance that involes re-starting the engine, and reintegrating that rank back into the pool. It is most convenient to directly issue a dmg pool reintegrate command from the maintenance state. Before this change, manual administration commands such as dmg pool exclude/reintegrate were prevented from triggering rebuilds due to the pool self_heal property setting. However, the intention of the self_heal (aka auto recovery) feature is to only apply to automatic rebuilds. With this change, the is_pool_rebuild_allowed() function is updated to accept an indication of whether the self_heal checks are applicable. Manual pool map update and rebuild cases supply false for this argument (allowing those cases to result in a rebuild being scheduled). Features: rebuild pool Signed-off-by: Kenneth Cain <[email protected]>
|
Ticket title is 'pool reintegrate issue when pool property self_heal:exclude (no rebuild)' |
|
Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17345/1/display/redirect |
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17345/1/testReport/ |
liw
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good in general; one "[question]" needs an answer before I approve this PR.
src/include/daos_srv/pool.h
Outdated
|
|
||
| static inline bool | ||
| is_pool_rebuild_allowed(struct ds_pool *pool, bool check_delayed_rebuild) | ||
| is_pool_rebuild_allowed(struct ds_pool *pool, uint64_t self_heal, bool self_heal_applicable, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Nit] Would it be any clearer to name self_heal_applicable auto?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I think so. Changed to auto_recovery.
src/include/daos_srv/rebuild.h
Outdated
| int | ||
| ds_rebuild_regenerate_task(struct ds_pool *pool, daos_prop_t *prop, uint64_t sys_self_heal, | ||
| uint64_t delay_sec); | ||
| bool self_heal_applicable, uint64_t delay_sec); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Nit] self_heal_applicable or perhaps auto? No strong opinion though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to auto_recovery.
src/pool/srv_pool.c
Outdated
|
|
||
| rc = ds_rebuild_regenerate_task(svc->ps_pool, prop, sys_self_heal, 0); | ||
| rc = ds_rebuild_regenerate_task(svc->ps_pool, prop, sys_self_heal, | ||
| true /* self_heal_applicable */, 0 /* delay_sec*/); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Nit] A missing space between delay_sec and */.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fixed.
src/pool/srv_pool.c
Outdated
| self_heal_applicable = (opc == MAP_EXCLUDE && src == MUS_SWIM); | ||
|
|
||
| if (sys_self_heal_applicable) { | ||
| /* do not update pool map if system.self_heal is applicable but does not enable exclude */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Nit] The comment isn't that helpful; if one must be added, I think a conciser one like "If applicable, check the system self-heal policy." might be more helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, simplified.
src/pool/srv_pool.c
Outdated
| } | ||
| } | ||
|
|
||
| /* Update pool map if pool.self_heal is applicable and enables exclude. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Nit] Could we say "The pool self-heal policy is checked by the following call." instead? Considering that the call performs many other checks too...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, fixed.
src/pool/srv_pool.c
Outdated
| d_freeenv_str(&env); | ||
|
|
||
| if (sys_self_heal_applicable && !(sys_self_heal & DS_MGMT_SELF_HEAL_POOL_REBUILD)) { | ||
| /* Do not trigger rebuild if system.self_heal is applicable but does not enable rebuild. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Nit] The log message has already said it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed to de-duplicate the information in the source code.
src/pool/srv_pool.c
Outdated
| } | ||
|
|
||
| if (!is_pool_rebuild_allowed(svc->ps_pool, true)) { | ||
| /* Do not trigger rebuild if pool.self_heal is applicable but does not enable rebuild. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Nit] The log message has already said it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed to de-duplicate the information in the source code.
src/include/daos_srv/pool.h
Outdated
| if (pool->sp_disable_rebuild) | ||
| return false; | ||
| if (!(pool->sp_self_heal & flags)) | ||
| if (self_heal_applicable && !(self_heal & flags)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Question] Does this change affect the delayed rebuild case? I don't know the answer; just making sure this has been considered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. What I've done is remove that argument, since no callers currently specify anything other than true.
Also I did experiment with some manual testing of a pool whose self_heal property value was "exclude;delay_rebuild" and it seemed to work as expected (no exclude rebuilds occur, deferring until a subsequent reintegrate).
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17345/2/testReport/ |
Signed-off-by: Kenneth Cain <[email protected]>
|
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17345/2/execution/node/1100/log |
|
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17345/2/execution/node/1141/log |
|
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17345/3/testReport/ |
|
Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17345/5/display/redirect |
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17345/6/testReport/ |
Consider a quick maintenance scenario in which a daos_engine is stopped briefly, and the administrator does not wish to have the DAOS automatic recovery / rebuild mechanism occur. That is, a pool map update (targets from UP_IN to DOWN) is to occur, the pool to enter a degraded mode (still allowing ongoing I/O), and NO rebuild to be triggered during this brief time window.
The above can be arranged by modifying the system or pool-specific self_heal property value (to not set the rebuild bit), and then stopping the engine.
Now also consider the conclusion of the maintenance that involes re-starting the engine, and reintegrating that rank back into the pool. It is most convenient to directly issue a dmg pool reintegrate command from the maintenance state.
Before this change, manual administration commands such as dmg pool exclude/reintegrate were prevented from triggering rebuilds due to the pool self_heal property setting. However, the intention of the self_heal (aka auto recovery) feature is to only apply to automatic rebuilds.
With this change, the is_pool_rebuild_allowed() function is updated to accept an indication of whether the self_heal checks are applicable. Manual pool map update and rebuild cases supply false for this argument (allowing those cases to result in a rebuild being scheduled).
Features: rebuild pool
Steps for the author:
After all prior steps are complete: