@liuxuezhao
Contributor

@liuxuezhao liuxuezhao commented Dec 29, 2025

1. Fix a bug where ec_agg_boundary was used before checking whether it is valid.
2. Add more logging for the case where rebuild fetch gets a zero iod_size,
   to provide hints about the layout information.
3. Fix a bug in the EC aggregation peer update: some failed updates need to be
   retried to avoid data corruption.
4. Refine some details of how dtx_resync waits for the rebuild scan.
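
For readers unfamiliar with the two bug classes in items 1 and 3, here is a minimal sketch of the general patterns (validate before use, and retry transient failures). It is not the actual DAOS code: `BOUNDARY_INVALID`, the `RC_*` codes, `pick_fetch_epoch`, `update_with_retry`, and `flaky_update` are all hypothetical names invented for illustration, and the "clamp to the smaller epoch" rule is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sentinel meaning "boundary not known yet"; not a DAOS constant. */
#define BOUNDARY_INVALID UINT64_MAX

/* Hypothetical return codes for this sketch only. */
enum { RC_OK = 0, RC_AGAIN = -1, RC_FATAL = -2 };

/* Item 1 pattern: check the boundary for validity BEFORE using it. */
static uint64_t
pick_fetch_epoch(uint64_t ec_agg_boundary, uint64_t stable_epoch)
{
	if (ec_agg_boundary == BOUNDARY_INVALID)
		return stable_epoch; /* fall back rather than use an invalid value */
	/* Assumed rule: never fetch beyond the smaller of the two epochs. */
	return ec_agg_boundary < stable_epoch ? ec_agg_boundary : stable_epoch;
}

typedef int (*update_fn)(void *arg);

/* Item 3 pattern: retry an update that fails transiently, up to a bound. */
static int
update_with_retry(update_fn fn, void *arg, int max_attempts)
{
	int rc = RC_FATAL;

	assert(max_attempts > 0);
	for (int i = 0; i < max_attempts; i++) {
		rc = fn(arg);
		if (rc != RC_AGAIN)
			break; /* success or a non-retryable error */
	}
	return rc; /* still RC_AGAIN if every attempt failed transiently */
}

/* Stand-in update: returns RC_AGAIN until the failure counter reaches zero. */
static int
flaky_update(void *arg)
{
	int *remaining_failures = arg;

	if (*remaining_failures > 0) {
		(*remaining_failures)--;
		return RC_AGAIN;
	}
	return RC_OK;
}
```

In the sketch, a caller that sees `RC_AGAIN` after the last attempt must still treat the update as not applied; silently dropping it is the kind of behaviour that could leave a peer out of date.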

 Signed-off-by: Xuezhao Liu <[email protected]>

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@liuxuezhao liuxuezhao requested review from a team as code owners December 29, 2025 09:18
@github-actions

Ticket title is 'Data corruption observed with master branch under MDonSSD environment.'
Status is 'In Progress'
Labels: 'md_on_ssd,scrubbed_2.8'
https://daosio.atlassian.net/browse/DAOS-18368

@daosbuild3
Collaborator

Test stage NLT on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17324/2/display/redirect

NiuYawei
NiuYawei previously approved these changes Dec 30, 2025
@liuxuezhao liuxuezhao force-pushed the lxz/rb_fix_ec_agg_eph branch from 10fa58e to 66f44a9 Compare December 31, 2025 06:06
@daosbuild3
Collaborator

@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17324/4/execution/node/1313/log

@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17324/4/execution/node/1323/log

@liuxuezhao
Contributor Author

Just refreshed the branch to change a few log messages.

@liuxuezhao liuxuezhao requested a review from NiuYawei January 4, 2026 10:47
@liuxuezhao liuxuezhao force-pushed the lxz/rb_fix_ec_agg_eph branch 2 times, most recently from 32db84f to e1c08ea Compare January 4, 2026 11:00
@daosbuild3
Collaborator

NiuYawei
NiuYawei previously approved these changes Jan 5, 2026
wangshilong
wangshilong previously approved these changes Jan 5, 2026
@liuxuezhao liuxuezhao dismissed stale reviews from wangshilong and NiuYawei via c474fe4 January 7, 2026 07:31
@liuxuezhao liuxuezhao force-pushed the lxz/rb_fix_ec_agg_eph branch from e1c08ea to c474fe4 Compare January 7, 2026 07:31
@liuxuezhao liuxuezhao changed the title DAOS-18368 rebuild: fix use before check of ec_agg_boundary DAOS-18368 rebuild: fix bug of ec_agg_boundary usage and reintegrate Jan 7, 2026
@daosbuild3
Collaborator

@liuxuezhao liuxuezhao force-pushed the lxz/rb_fix_ec_agg_eph branch from c474fe4 to 3b13f7d Compare January 7, 2026 12:30
@daosbuild3
Collaborator

@liuxuezhao liuxuezhao requested a review from kccain January 7, 2026 14:03
NiuYawei
NiuYawei previously approved these changes Jan 8, 2026
@liuxuezhao liuxuezhao force-pushed the lxz/rb_fix_ec_agg_eph branch from 3b13f7d to 1ba9f49 Compare January 8, 2026 12:36
@liuxuezhao liuxuezhao changed the title DAOS-18368 rebuild: fix bug of ec_agg_boundary usage and reintegrate DAOS-18368 rebuild: fix bug of ec_agg_boundary and agg peer update Jan 8, 2026
@liuxuezhao liuxuezhao requested a review from NiuYawei January 8, 2026 12:54
@liuxuezhao liuxuezhao force-pushed the lxz/rb_fix_ec_agg_eph branch 2 times, most recently from 56751dc to f4bc272 Compare January 8, 2026 13:00
NiuYawei
NiuYawei previously approved these changes Jan 8, 2026
wangshilong
wangshilong previously approved these changes Jan 8, 2026
kccain
kccain previously approved these changes Jan 8, 2026
Contributor

@kccain kccain left a comment

rebuild/ source file changes LGTM.

@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17324/12/execution/node/1277/log

@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17324/12/execution/node/1318/log

1. Fix a bug where ec_agg_boundary was used before checking whether it is valid.
2. Add more logging for the case where rebuild fetch gets a zero iod_size,
   to provide hints about the layout information.

 Signed-off-by: Xuezhao Liu <[email protected]>
Some failures need to be retried.

Signed-off-by: Xuezhao Liu <[email protected]>
A reintegrating rank is excluded from rebuild/reclaim if its co_in_ver
exceeds the rebuild version. Its completion should be set in the rebuild
leader to avoid a possible hang.
Refine dtx_resync wait handling: there is no need to wait again if the
resync has already been done.
Add some logs.

Signed-off-by: Xuezhao Liu <[email protected]>
@liuxuezhao liuxuezhao dismissed stale reviews from kccain, wangshilong, and NiuYawei via 17ef124 January 9, 2026 10:51
@liuxuezhao liuxuezhao force-pushed the lxz/rb_fix_ec_agg_eph branch from f4bc272 to 17ef124 Compare January 9, 2026 10:51
@liuxuezhao liuxuezhao requested a review from Nasf-Fan January 9, 2026 10:56

6 participants