Conversation

@tanabarr tanabarr commented Dec 29, 2025

Add an intermediate "derived" rebuild state field to indicate transient
pool rebuild conditions. Preserve the rebuild state value (idle/done/busy)
while adding intermediate states (stopping/stopped/failing/failed) in the
derived_state field to better inform the administrator.

Features: control

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@tanabarr tanabarr requested a review from kccain December 29, 2025 05:40
@tanabarr tanabarr self-assigned this Dec 29, 2025
@tanabarr tanabarr requested review from a team as code owners December 29, 2025 05:40

github-actions bot commented Dec 29, 2025

Ticket title is 'Rebuild state reported in pool query human-readable output needs refinement'
Status is 'In Review'
Labels: 'scrubbed_2.8'
https://daosio.atlassian.net/browse/DAOS-18347

@daosbuild3

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17322/1/execution/node/301/log

Features: control
Signed-off-by: Tom Nabarro <[email protected]>
@tanabarr tanabarr force-pushed the tanabarr/control-rebuild-states branch from 6eae49c to 61cc7aa Compare December 29, 2025 13:30
@daosbuild3

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17322/2/execution/node/301/log

Features: control
Signed-off-by: Tom Nabarro <[email protected]>
coverage

Features: control
Signed-off-by: Tom Nabarro <[email protected]>
…build-states

Features: control
Signed-off-by: Tom Nabarro <[email protected]>

@kccain kccain left a comment

I thought I had submitted this feedback earlier, but github still shows it as pending. Trying again. Sorry for the inadvertent delay.

FAILING = 5;
FAILED = 6;
}
State state = 2;
Contributor

To avoid re-assigning the official status/state before producing human-readable or JSON output (i.e., to always stay aligned with what a libdaos API caller would see), a solution could be to add here a derived_state used only by the tools. Both human-readable and JSON output could consistently show all 3 values (e.g., "derived_state (state, status)").

Regardless of whether we adopt the above suggestion, let's inform @jamesanunez and @daltonbohning of the potential changes here in case functional testing (which mostly uses dmg/daos pool query, and less so the libdaos API) is affected.

In the proposal, state would only ever be busy/idle/done and would never be translated to a new value. status (errno) would similarly never be manipulated. It will have that special -DER_OP_CANCELED value for the stopping/stopped conditions and it would always be presented to the caller of the dmg/daos pool query utilities.

derived_state could take on any of the above State values (busy/idle/done if there is no derived condition such as stopping, stopped, failing, or failed). If there is a further derived condition, derived_state could take on one of the new State values to provide the qualifying detail. In that latter case, stopping and failing are considered modifiers to the busy state, stopped is a modifier to idle, and failed is a modifier to done.
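
The mapping proposed above could be sketched roughly as follows. This is an illustration only, not the code from this PR: `derive_state` and `format_rebuild` are hypothetical helpers, `BUSY = 0` is assumed (only values 1 to 6 are visible in the diff context), `DER_OP_CANCELED` is a placeholder rather than the real DAOS errno value, and treating any other non-zero status as a failure is an assumption.

```python
from enum import IntEnum


class State(IntEnum):
    BUSY = 0  # assumed; not visible in the quoted diff context
    IDLE = 1
    DONE = 2
    STOPPING = 3
    STOPPED = 4
    FAILING = 5
    FAILED = 6


# Placeholder only; NOT the real DAOS errno value.
DER_OP_CANCELED = -1


def derive_state(state: State, status: int) -> State:
    """Return the derived state; `state` and `status` are left untouched."""
    if status == DER_OP_CANCELED:
        # Stop requested: stopping modifies busy, stopped modifies idle.
        stop_map = {State.BUSY: State.STOPPING, State.IDLE: State.STOPPED}
        return stop_map.get(state, state)
    if status != 0:
        # Assumption: any other non-zero status means failing/failed.
        fail_map = {State.BUSY: State.FAILING, State.DONE: State.FAILED}
        return fail_map.get(state, state)
    # No derived condition: derived_state mirrors the official state.
    return state


def format_rebuild(state: State, status: int) -> str:
    """Render the 'derived_state (state, status)' form suggested above."""
    derived = derive_state(state, status)
    return f"{derived.name.lower()} ({state.name.lower()}, {status})"
```

Under these assumptions, `format_rebuild(State.IDLE, DER_OP_CANCELED)` gives `stopped (idle, -1)`, where `-1` is only the placeholder status used in this sketch.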

Contributor Author

@kccain @daltonbohning changes applied, please review and verify when you can. TIA

Contributor

So if state and status are never manipulated then this seems to be backward compatible with existing tests. Using the new derived_state probably should be a separate PR because the rebuild code in ftest that detects that is pretty messy currently.
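
As a rough illustration of the compatibility point, a check that reads only `state` and `status` gives the same answer whether or not `derived_state` is present. The dictionary layout, key names, and the `rebuild_is_complete` helper below are assumptions for illustration, not the actual dmg JSON schema or ftest code.

```python
def rebuild_is_complete(rebuild: dict) -> bool:
    """Hypothetical existing-style check; it never looks at derived_state."""
    return rebuild["state"] == "done" and rebuild["status"] == 0


# Hypothetical outputs before and after the new field is added.
old_output = {"state": "done", "status": 0}
new_output = {"state": "done", "status": 0, "derived_state": "done"}

# The extra key does not change the result of the existing-style check.
assert rebuild_is_complete(old_output) == rebuild_is_complete(new_output)
```

Tests that compare the whole output against expected values would still need the new key added to those expected values.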

IDLE = 1;
DONE = 2;
STOPPING = 3;
StOPPED = 4;
Contributor

(change to all upper case letters)

Suggested change
- StOPPED = 4;
+ STOPPED = 4;

Signed-off-by: Tom Nabarro <[email protected]>
Features: control
Signed-off-by: Tom Nabarro <[email protected]>
…build-states

Features: control
Signed-off-by: Tom Nabarro <[email protected]>
@daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17322/7/display/redirect

@daosbuild3

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17322/7/display/redirect

@daosbuild3

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17322/7/display/redirect

@daosbuild3

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17322/8/execution/node/507/log

@daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17322/8/execution/node/482/log

@daosbuild3

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17322/8/testReport/

…build-states

Features: control
Signed-off-by: Tom Nabarro <[email protected]>
@daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17322/9/execution/node/1282/log

@tanabarr

@daltonbohning I need to make some test changes to adjust for the addition of derived_state (https://jenkins.daos.hpc.amslabs.hpecorp.net/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-17322/9/tests). Any idea whether the necessary changes will be in many places or whether it will just be a few helper functions?

@daltonbohning

> @daltonbohning I need to make some test changes to adjust for the addition of derived_state (https://jenkins.daos.hpc.amslabs.hpecorp.net/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-17322/9/tests). Any idea whether the necessary changes will be in many places or whether it will just be a few helper functions?

For control/dmg_pool_query_test.py you can update these expected values:

Most of the tests are failing with errors like this, which I don't understand:

2026/01/14 12:31:45 DEBUG                log_result_data:   opa-[102-103] (rc=1): find: \x2018/mnt/share/tmp.lwK2QyOveh\x2019: No such file or directory
2026/01/14 12:31:45 ERROR                      fail_test: Error determining if /mnt/share/tmp.lwK2QyOveh/valgrind* files exist on opa-[102-103]

Features: control
Signed-off-by: Tom Nabarro <[email protected]>
@tanabarr tanabarr requested review from a team as code owners January 16, 2026 17:28
…build-states

Features: control
Signed-off-by: Tom Nabarro <[email protected]>

@daltonbohning daltonbohning left a comment

ftest LGTM

@tanabarr

awaiting reviews

knard38 commented Jan 19, 2026

Nice update which should be very helpful for support tasks 👍

6 participants