
@kccain kccain commented Jan 6, 2026

Consider a quick maintenance scenario in which a daos_engine is stopped briefly and the administrator does not want the DAOS automatic recovery / rebuild mechanism to run. That is, a pool map update (targets going from UP_IN to DOWN) should occur and the pool should enter a degraded mode (still allowing ongoing I/O), but NO rebuild should be triggered during this brief time window.

The above can be arranged by modifying the system or pool-specific self_heal property value (to not set the rebuild bit), and then stopping the engine.

Now also consider the conclusion of the maintenance, which involves restarting the engine and reintegrating that rank back into the pool. It is most convenient to issue a dmg pool reintegrate command directly from the maintenance state.

Before this change, manual administration commands such as dmg pool exclude/reintegrate were prevented from triggering rebuilds due to the pool self_heal property setting. However, the intention of the self_heal (aka auto recovery) feature is to only apply to automatic rebuilds.

With this change, the is_pool_rebuild_allowed() function is updated to accept an indication of whether the self_heal checks are applicable. Manual pool map update and rebuild cases supply false for this argument (allowing those cases to result in a rebuild being scheduled).
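
As a rough illustration of the gating described above (a minimal sketch, not the actual DAOS code): the struct, the flag name and value, and the `sketch_` function are hypothetical stand-ins for `struct ds_pool`, the real self_heal rebuild bit, and `is_pool_rebuild_allowed()`. The point is only that the self_heal check is skipped when the caller says it does not apply.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit; the real flag names and values live in the DAOS headers. */
#define SKETCH_SELF_HEAL_REBUILD (1ULL << 1)

/* Hypothetical stand-in for the few struct ds_pool fields used here. */
struct sketch_pool {
	bool     disable_rebuild; /* rebuild disabled outright for this pool */
	uint64_t self_heal;       /* pool self_heal policy bits */
};

/*
 * Return true if a rebuild may be scheduled.
 * self_heal_applicable is true for automatic recovery and false for
 * manual operations, which are not subject to the self_heal policy.
 */
static inline bool
sketch_rebuild_allowed(const struct sketch_pool *pool, bool self_heal_applicable)
{
	assert(pool != NULL);
	if (pool->disable_rebuild)
		return false;
	if (self_heal_applicable && !(pool->self_heal & SKETCH_SELF_HEAL_REBUILD))
		return false;
	return true;
}
```

With the rebuild bit cleared, an automatic caller passing true gets false back, while a manual caller passing false is still allowed to schedule a rebuild.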

Features: rebuild pool

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

Signed-off-by: Kenneth Cain <[email protected]>

github-actions bot commented Jan 6, 2026

Ticket title is 'pool reintegrate issue when pool property self_heal:exclude (no rebuild)'
Status is 'In Progress'
Labels: 'scrubbed_2.8,triaged'
https://daosio.atlassian.net/browse/DAOS-15993

@daosbuild3
Collaborator

Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17345/1/display/redirect

@liw liw left a comment

Looks good in general; one "[question]" needs an answer before I approve this PR.


 static inline bool
-is_pool_rebuild_allowed(struct ds_pool *pool, bool check_delayed_rebuild)
+is_pool_rebuild_allowed(struct ds_pool *pool, uint64_t self_heal, bool self_heal_applicable,
Contributor

@liw liw Jan 7, 2026

[Nit] Would it be any clearer to name self_heal_applicable auto?

Contributor Author

Yes I think so. Changed to auto_recovery.

 int
 ds_rebuild_regenerate_task(struct ds_pool *pool, daos_prop_t *prop, uint64_t sys_self_heal,
-			   uint64_t delay_sec);
+			   bool self_heal_applicable, uint64_t delay_sec);
Contributor

@liw liw Jan 7, 2026

[Nit] self_heal_applicable or perhaps auto? No strong opinion though.

Contributor Author

Changed to auto_recovery.


-	rc = ds_rebuild_regenerate_task(svc->ps_pool, prop, sys_self_heal, 0);
+	rc = ds_rebuild_regenerate_task(svc->ps_pool, prop, sys_self_heal,
+					true /* self_heal_applicable */, 0 /* delay_sec*/);
Contributor

[Nit] A missing space between delay_sec and */.

Contributor Author

Thanks, fixed.

self_heal_applicable = (opc == MAP_EXCLUDE && src == MUS_SWIM);

if (sys_self_heal_applicable) {
/* do not update pool map if system.self_heal is applicable but does not enable exclude */
Contributor

[Nit] The comment isn't that helpful; if one must be added, I think a conciser one like "If applicable, check the system self-heal policy." might be more helpful.

Contributor Author

Thanks, simplified.
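
For readers following this thread, the applicability test in the hunk above can be sketched in isolation. This is a hedged illustration only: the enums and `sketch_` names are hypothetical stand-ins for the real opcode and source values (`MAP_EXCLUDE`, `MUS_SWIM`), and it assumes `MUS_SWIM` marks updates that originate from automatic failure detection rather than from an administrator command.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the pool map update opcodes. */
enum sketch_map_opc {
	SKETCH_MAP_EXCLUDE,
	SKETCH_MAP_REINT,
	SKETCH_MAP_DRAIN
};

/* Hypothetical stand-ins for where a map update request came from. */
enum sketch_map_src {
	SKETCH_SRC_SWIM, /* assumed: automatic failure detection */
	SKETCH_SRC_DMG   /* assumed: administrator command */
};

/*
 * The self_heal policy applies only to automatic exclusions.
 * Every manual request, including a manual exclude, bypasses it.
 */
static inline bool
sketch_self_heal_applicable(enum sketch_map_opc opc, enum sketch_map_src src)
{
	return opc == SKETCH_MAP_EXCLUDE && src == SKETCH_SRC_SWIM;
}
```

Under these assumptions, a reintegrate issued by an administrator yields false, so the self_heal setting cannot block the rebuild it triggers.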

}
}

/* Update pool map if pool.self_heal is applicable and enables exclude. */
Contributor

[Nit] Could we say "The pool self-heal policy is checked by the following call." instead? Considering that the call performs many other checks too...

Contributor Author

Agree, fixed.

d_freeenv_str(&env);

if (sys_self_heal_applicable && !(sys_self_heal & DS_MGMT_SELF_HEAL_POOL_REBUILD)) {
/* Do not trigger rebuild if system.self_heal is applicable but does not enable rebuild. */
Contributor

[Nit] The log message has already said it.

Contributor Author

Removed to de-duplicate the information in the source code.

}

if (!is_pool_rebuild_allowed(svc->ps_pool, true)) {
/* Do not trigger rebuild if pool.self_heal is applicable but does not enable rebuild. */
Contributor

[Nit] The log message has already said it.

Contributor Author

Removed to de-duplicate the information in the source code.

 	if (pool->sp_disable_rebuild)
 		return false;
-	if (!(pool->sp_self_heal & flags))
+	if (self_heal_applicable && !(self_heal & flags))
Contributor

[Question] Does this change affect the delayed rebuild case? I don't know the answer; just making sure this has been considered.

Contributor Author

I don't think so. What I've done is remove that argument, since no callers currently specify anything other than true.

Also I did experiment with some manual testing of a pool whose self_heal property value was "exclude;delay_rebuild" and it seemed to work as expected (no exclude rebuilds occur, deferring until a subsequent reintegrate).
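
To make the tested value concrete, here is a minimal sketch of how a property string such as "exclude;delay_rebuild" could map onto policy bits. Everything in it is an assumption for illustration: the bit names and values, the accepted token spellings, and the `sketch_parse_self_heal()` helper are hypothetical and are not the DAOS property parser.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical policy bits; the real names and values are in the DAOS headers. */
#define SKETCH_HEAL_EXCLUDE       (1ULL << 0)
#define SKETCH_HEAL_REBUILD       (1ULL << 1)
#define SKETCH_HEAL_DELAY_REBUILD (1ULL << 2)

/*
 * Parse a ';'-separated policy string into bits.
 * Returns 0 on success, -1 on a NULL argument or an unknown/empty token.
 * *bits is written only on success.
 */
static int
sketch_parse_self_heal(const char *value, uint64_t *bits)
{
	static const struct {
		const char *name;
		uint64_t    bit;
	} known[] = {
		{"exclude", SKETCH_HEAL_EXCLUDE},
		{"rebuild", SKETCH_HEAL_REBUILD},
		{"delay_rebuild", SKETCH_HEAL_DELAY_REBUILD},
	};
	uint64_t    out = 0;
	const char *p   = value;

	if (value == NULL || bits == NULL)
		return -1;

	while (*p != '\0') {
		size_t len   = strcspn(p, ";"); /* length of the current token */
		bool   found = false;
		size_t i;

		for (i = 0; i < sizeof(known) / sizeof(known[0]); i++) {
			if (strlen(known[i].name) == len &&
			    strncmp(p, known[i].name, len) == 0) {
				out |= known[i].bit;
				found = true;
				break;
			}
		}
		if (!found)
			return -1;
		p += len;
		if (*p == ';')
			p++; /* skip the separator */
	}
	*bits = out;
	return 0;
}
```

In this sketch, parsing "exclude;delay_rebuild" sets the exclude and delay bits and leaves the immediate-rebuild bit clear, which matches the observed behavior: automatic exclusions happen, but no rebuild runs until a later reintegrate.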

@daosbuild3
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17345/2/execution/node/1100/log

@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17345/2/execution/node/1141/log

@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17345/3/testReport/

@daosbuild3
Collaborator

Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17345/5/display/redirect
