Conversation

@liw
Contributor

@liw liw commented Jan 13, 2026

Add the following to pool query output (shown as dmg pool query output):

- Data redundancy: degraded

When data redundancy is intact, "normal" is shown instead of "degraded".
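
As a rough illustration of the behavior described above, the following C sketch maps a status flag to the text shown after "Data redundancy:". The names `EX_RSF_DEGRADED`, `ex_redundancy_str`, and `ex_format_redundancy` are hypothetical, invented for this example; the real formatting code is not shown in this conversation and may look quite different.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the degraded flag discussed in this PR. */
#define EX_RSF_DEGRADED (1U << 0)

/* Return the text to show after "Data redundancy: ". */
static const char *
ex_redundancy_str(uint32_t rs_flags)
{
	return (rs_flags & EX_RSF_DEGRADED) ? "degraded" : "normal";
}

/* Format the whole output line into buf; returns snprintf's result. */
static int
ex_format_redundancy(char *buf, size_t len, uint32_t rs_flags)
{
	return snprintf(buf, len, "- Data redundancy: %s",
			ex_redundancy_str(rs_flags));
}
```

Calling `ex_format_redundancy(buf, sizeof(buf), 0)` in this sketch would produce the "normal" variant of the line.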

Features: control pool

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@github-actions

github-actions bot commented Jan 13, 2026

Ticket title is 'dmg pool query should show that there are ranks that are DOWN but not DOWNOUT'
Status is 'Open'
https://daosio.atlassian.net/browse/DAOS-17938

@liw liw force-pushed the liw/pool-query-out-pad branch from cfab5bf to 2f86dcb Compare January 13, 2026 07:36
@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17371/2/display/redirect

@daosbuild3
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17371/2/execution/node/1350/log

Add the following to pool query output (shown as dmg pool query output):

  - Data redundancy: degraded

When data redundancy is intact, "normal" is shown instead of "degraded".

Features: control pool
Signed-off-by: Li Wei <[email protected]>
@liw liw force-pushed the liw/pool-query-out-pad branch from 2f86dcb to cf277d5 Compare January 13, 2026 23:51
@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17371/2/execution/node/1335/log

@liw liw marked this pull request as ready for review January 13, 2026 23:55
@liw liw requested review from a team as code owners January 13, 2026 23:55
@liw liw requested review from kccain, mchaarawi and tanabarr January 13, 2026 23:56
@liw
Contributor Author

liw commented Jan 13, 2026

Requesting reviews a bit early, since I'm not that familiar with the areas being changed.

tanabarr
tanabarr previously approved these changes Jan 14, 2026
/** For daos_rebuild_status.rs_flags */
enum daos_rebuild_status_flag {
	/** Data redundancy degraded (the pool has one or more DOWN targets) */
	DAOS_RSF_DEGRADED = (1 << 0),
Contributor

what is the point of shifting by zero bits?

Contributor Author

@tanabarr, it's just a style choice. :) See line 156 of this header, for example:
DPI_SPACE = 1ULL << 0,
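
To illustrate the style point, here is a small sketch of a flag enum in which every member is written as `(1 << n)`. Writing the first member as `(1 << 0)` rather than `1` keeps the bit position visible and the list uniform when more flags are added later. All `EX_`/`ex_` names are hypothetical and are not part of the DAOS headers.

```c
#include <stdint.h>

/* Hypothetical flag set; only the (1 << n) style mirrors the PR. */
enum ex_status_flag {
	EX_FLAG_DEGRADED = (1 << 0), /* bit 0, value 1 */
	EX_FLAG_BUSY     = (1 << 1), /* bit 1, value 2 */
	EX_FLAG_ERROR    = (1 << 2), /* bit 2, value 4 */
};

/* Return 1 if flag is set in flags, otherwise 0. */
static int
ex_flag_is_set(uint32_t flags, enum ex_status_flag flag)
{
	return (flags & (uint32_t)flag) != 0;
}
```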

else
	rebuild->state = MGMT__POOL_REBUILD_STATUS__STATE__BUSY;

rebuild->degraded = !!(info->rs_flags & DAOS_RSF_DEGRADED);
Contributor

what is the point of the double negation here? is it equivalent to ... != 0?

Contributor Author

@tanabarr, it's an idiom (or at least a pattern) for converting a nonzero number to 1, usually used on flags. For example (I've seen this a couple of times in my career), say the flags are uint64_t and the result of the & is 0x100000000: if that is assigned to a uint32_t, the value is truncated and becomes 0!

Contributor Author

And yes, it's the same as != 0.
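
The truncation hazard described in this thread can be sketched as follows, assuming a hypothetical 64-bit flag in bit 32 (`EX_FLAG_HIGH` and both helper functions are made up for illustration and do not exist in DAOS). The first helper narrows the masked value directly and loses the bit; the second normalizes with `!!` first.

```c
#include <stdint.h>

/* Hypothetical flag in bit 32, i.e. 0x100000000. */
#define EX_FLAG_HIGH (1ULL << 32)

/* Buggy: the masked value is narrowed to 32 bits, so bit 32 is lost. */
static uint32_t
ex_is_set_truncating(uint64_t flags)
{
	return (uint32_t)(flags & EX_FLAG_HIGH);
}

/* Safe: !! collapses any nonzero value to 1 before the narrowing. */
static uint32_t
ex_is_set_normalized(uint64_t flags)
{
	return !!(flags & EX_FLAG_HIGH);
}
```

With the flag set, the truncating helper returns 0 while the normalized one returns 1. Writing `(flags & EX_FLAG_HIGH) != 0` gives the same result as the `!!` form.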

mchaarawi
mchaarawi previously approved these changes Jan 14, 2026
Contributor

@mchaarawi mchaarawi left a comment

the C & API changes look good to me.
not sure about all the control changes.

Contributor

@daltonbohning daltonbohning left a comment

ftest LGTM. The rs_padding16 seemed unused anyway

@liw
Contributor Author

liw commented Jan 15, 2026

Mohamad, Dalton, thank you for the quick reviews. I'll wait for one additional reviewer.

liuxuezhao
liuxuezhao previously approved these changes Jan 15, 2026
@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17371/3/execution/node/1283/log

@liw
Contributor Author

liw commented Jan 16, 2026

Two simple regressions need to be fixed (the expected pool query results of two tests need updating):

  • control/dmg_pool_query_test
    • basic: regression
    • ior: SRE-3525
  • dfuse/daos_build: likely SRE-3525 slowness
    • / FTEST_launch: likely CI issue "Error determining if /mnt/share/tmp.Dm0r7JNoE2/valgrind* files exist on opa-112"
  • nvme/pool_capacity: likely CI issue "ssh: connect to host opa-114 port 22: No route to host"
  • pool/list_verbose: regression
  • rebuild/mdtest: likely CI issue '"the server could not be reached at the configured address (opa-112:10001)": unable to contact the DAOS Management Service"'

@liw liw dismissed stale reviews from liuxuezhao, mchaarawi, and tanabarr via eb750f9 January 16, 2026 06:52
Features: control pool
Signed-off-by: Li Wei <[email protected]>
@liw liw force-pushed the liw/pool-query-out-pad branch from f0c7241 to b5f505d Compare January 16, 2026 06:54
@daosbuild3
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17371/6/testReport/

@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17371/6/testReport/

liw added 2 commits January 19, 2026 10:18
Features: control pool
Signed-off-by: Li Wei <[email protected]>
@liw
Contributor Author

liw commented Jan 19, 2026

One rebuild case in pool/list_verbose needs a further update. The container/boundary space errors do not seem to be related to this PR; I've filed DAOS-18477 for them.

7 participants