Conversation

@shimizukko (Contributor)

To support MD-on-SSD for ddb, we need to support two commands: ddb prov_mem and ddb ls with --db_path.

Update ddb_utils.py to support the new commands.

Add check_ram_used in recovery_utils.py to detect whether the system is MD-on-SSD.

Update test_recovery_ddb_ls to support MD-on-SSD with the new ddb commands.

We need to update the test yaml to run on MD-on-SSD/HW Medium, but that will break other tests in ddb.py because they don't support MD-on-SSD yet. Keep the original tests as ddb_pmem.py and ddb_pmem.yaml and keep running them on VM (except test_recovery_ddb_ls because that's updated in this PR).
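
Roughly, the MD-on-SSD path in test_recovery_ddb_ls looks like the sketch below. The helper and path names come from the diff excerpts quoted in the review comments; ddb_command is a placeholder for the ddb wrapper object, and the exact ls invocation may differ from the final code.

md_on_ssd = check_ram_used(server_manager=self.server_managers[0], log=self.log)
if md_on_ssd:
    # Load the pool dir to a tmpfs mount, then point ddb at the provisioned copy.
    daos_load_path = "/mnt/daos_load"
    db_path = os.path.join(self.log_dir, "control_metadata", "daos_control", "engine0")
    # ddb "" prov_mem <db_path> <tmpfs_mount>
    ddb_command.prov_mem(db_path=db_path, tmpfs_mount=daos_load_path)
# ddb ls then runs with --db_path so it can read the metadata on MD-on-SSD.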

Skip-unit-tests: true
Skip-fault-injection-test: true
Skip-func-hw-test-medium: false
Test-tag: test_recovery_ddb_ls DdbPMEMTest

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@github-actions

Ticket title is 'CR Test Update - recovery/ddb.py test_recovery_ddb_ls MD-on-SSD Support'
Status is 'In Progress'
Labels: 'catastrophic_recovery'
https://daosio.atlassian.net/browse/DAOS-18387

@daosbuild3 (Collaborator)

Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17332/1/display/redirect

@daosbuild3 (Collaborator)

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17332/5/execution/node/898/log

@shimizukko shimizukko marked this pull request as ready for review January 2, 2026 06:44
@shimizukko shimizukko requested review from a team as code owners January 2, 2026 06:44
@shimizukko shimizukko requested review from dinghwah and phender January 2, 2026 06:44
@shimizukko (Contributor, Author)

@phender @dinghwah
In the Git diff, ddb_pmem.py and ddb_pmem.yaml look new, but that's only because I removed test_recovery_ddb_ls from the originals and renamed the files by adding _pmem. Please focus on the rest of the files.

I have two questions:

  1. In recovery_utils.py, I added check_ram_used to determine whether the system is MD-on-SSD. The logic is that if the test runs on HW Medium and the server config has a ram field, it must be running on MD-on-SSD (a rough sketch of the idea is included after these questions). Is this okay, or is there a better way?

  2. On MD-on-SSD, we need to load/mount the pool dir to a new location. That location can be anywhere, but I chose /mnt/daos_load to make it consistent with the existing pattern. At the end of the test, I call umount and rm -rf on /mnt/daos_load. Is this okay?
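
For reference, a minimal sketch of the check_ram_used idea from question 1. It reads the generated daos_server.yml directly so the example is self-contained; the actual helper in recovery_utils.py takes a server_manager and a log, so the server_config_path argument here is an assumption.

import yaml


def check_ram_used(server_config_path, log):
    """Sketch: return True if any engine storage tier uses class "ram".

    On HW Medium, a ram (tmpfs) tier in the server config implies MD-on-SSD.
    The real helper takes a server_manager instead of a config file path.
    """
    with open(server_config_path, "r") as config_file:
        config = yaml.safe_load(config_file)
    for engine in config.get("engines", []):
        for tier in engine.get("storage", []):
            if tier.get("class") == "ram":
                log.info("Found a ram storage tier; assuming MD-on-SSD")
                return True
    return False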

Thanks.

Comment on lines +129 to +130
md_on_ssd = check_ram_used(server_manager=self.server_managers[0], log=self.log)
if md_on_ssd:
Contributor

We already have a DaosServerManager method to determine if we're using MD on SSD:

Suggested change:
- md_on_ssd = check_ram_used(server_manager=self.server_managers[0], log=self.log)
- if md_on_ssd:
+ md_on_ssd = self.server_managers[0].manager.job.using_control_metadata
+ if md_on_ssd:

return policies


def check_ram_used(server_manager, log):
Contributor

If we're just using this to determine if we are using MD on SSD, we already have self.server_managers[0].manager.job.using_control_metadata.

if md_on_ssd:
    self.log_step(f"MD-on-SSD: Load pool dir to {daos_load_path}")
    db_path = os.path.join(
        self.log_dir, "control_metadata", "daos_control", "engine0")
Contributor

Shouldn't the control metadata path be obtained via self.server_managers[0].job.yaml.metadata_params.path.value?
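
Concretely, that would look something like the following; the daos_control/engine0 suffix is carried over from the hard-coded path above and is an assumption about the control metadata layout:

control_metadata = self.server_managers[0].job.yaml.metadata_params.path.value
db_path = os.path.join(control_metadata, "daos_control", "engine0")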

    return self.run()

def prov_mem(self, db_path, tmpfs_mount):
    """Call ddb "" prov_mem <db_path> <tmpfs_mount>.
Contributor

Are we always calling this command with an empty vos_path, or is this specific to the only test currently using this command?

Comment on lines +104 to +107
command_result = run_remote(
    log=self.log, hosts=self.hostlist_servers, command=command_root).passed
if not command_result:
    self.fail(f"{command} failed!")
Contributor

If you want, we can also report on which host(s) the command failed:

Suggested change:
- command_result = run_remote(
-     log=self.log, hosts=self.hostlist_servers, command=command_root).passed
- if not command_result:
-     self.fail(f"{command} failed!")
+ result = run_remote(
+     log=self.log, hosts=self.hostlist_servers, command=command_root)
+ if not result.passed:
+     self.fail(f"{command} failed on {result.failed_hosts}!")

Args:
    remote_file_path (str): File path to copy to local.
    test_dir (str): Test directory. Usually self.test_dir.
    remote (str): Remote hostname to copy file from.
Contributor

get_clush_command requires a NodeSet.

Suggested change:
- remote (str): Remote hostname to copy file from.
+ remote (NodeSet): Remote hostname to copy file from.

f"ERROR: Copying {remote_file_path} from {remote}: {error}") from error

# Remove the appended .<server_hostname> from the copied file.
current_file_path = "".join([remote_file_path, ".", remote])
Contributor

This will be a problem if there are multiple hosts specified in the remote argument. If the test is only going to use one remote host, is clush rcopy even needed?

To handle multiple hosts, this function could just return the paths of the copied files (with the hostname extension) and the caller could loop over the list of files to process them.
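
A rough sketch of that alternative; the function name and arguments follow the docstring excerpt above, and the body of the copy step is elided since only the return shape changes:

def copy_remote_to_local(remote_file_path, test_dir, remote):
    """Copy remote_file_path from each host in remote (a NodeSet) for local processing.

    Sketch only: instead of stripping the .<hostname> suffix for a single host,
    return every per-host copy and let the caller loop over them.
    """
    # ... existing get_clush_command / rcopy call stays as-is ...
    # clush rcopy appends ".<hostname>" to each copied file, so report those paths.
    return ["".join([remote_file_path, ".", str(host)]) for host in remote]

The caller would then iterate over the returned paths and process each per-host copy instead of assuming a single remote host.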
