Conversation

@shimizukko (Contributor)

To support MD-on-SSD for ddb, we need to support two commands: ddb prov_mem and ddb ls with --db_path.

Update ddb_utils.py to support the new commands.

Add check_ram_used in recovery_utils.py to detect whether the system is MD-on-SSD.

Update test_recovery_ddb_ls to support MD-on-SSD with the new ddb commands.

We need to update the test yaml to run on MD-on-SSD/HW Medium, but that will break other tests in ddb.py because they don't support MD-on-SSD yet. Keep the original tests as ddb_pmem.py and ddb_pmem.yaml and keep running them on VM (except test_recovery_ddb_ls because that's updated in this PR).
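
Roughly, the MD-on-SSD path in test_recovery_ddb_ls looks like the sketch below. The helper and path names come from the diff excerpts quoted in the review comments; ddb_command is a placeholder for the ddb wrapper object, and the exact ls invocation may differ from the final code.

md_on_ssd = check_ram_used(server_manager=self.server_managers[0], log=self.log)
if md_on_ssd:
    # Load the pool dir to a tmpfs mount, then point ddb at the provisioned copy.
    daos_load_path = "/mnt/daos_load"
    db_path = os.path.join(self.log_dir, "control_metadata", "daos_control", "engine0")
    # ddb "" prov_mem <db_path> <tmpfs_mount>
    ddb_command.prov_mem(db_path=db_path, tmpfs_mount=daos_load_path)
# ddb ls then runs with --db_path so it can read the metadata on MD-on-SSD.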

Skip-unit-tests: true
Skip-fault-injection-test: true
Skip-func-hw-test-medium: false
Test-tag: test_recovery_ddb_ls DdbPMEMTest

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@github-actions

Ticket title is 'CR Test Update - recovery/ddb.py test_recovery_ddb_ls MD-on-SSD Support'
Status is 'In Progress'
Labels: 'catastrophic_recovery'
https://daosio.atlassian.net/browse/DAOS-18387

@daosbuild3 (Collaborator)

Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17332/1/display/redirect

@daosbuild3 (Collaborator)

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17332/5/execution/node/898/log

@shimizukko shimizukko marked this pull request as ready for review January 2, 2026 06:44
@shimizukko shimizukko requested review from a team as code owners January 2, 2026 06:44
@shimizukko shimizukko requested review from dinghwah and phender January 2, 2026 06:44
@shimizukko (Contributor, Author)

@phender @dinghwah
In the Git diff, ddb_pmem.py and ddb_pmem.yaml look new, but that's only because I removed test_recovery_ddb_ls from the originals and renamed the files by adding _pmem. Please focus on the rest of the files.

I have two questions:

  1. In recovery_utils.py, I added check_ram_used to determine whether the system is MD-on-SSD. The logic is that if the test runs on HW Medium and the server config has a ram field, it must be running on MD-on-SSD (a rough sketch of the idea is included after these questions). Is this okay, or is there a better way?

  2. On MD-on-SSD, we need to load/mount the pool dir to a new location. That location can be anywhere, but I chose /mnt/daos_load to make it consistent with the existing pattern. At the end of the test, I call umount and rm -rf on /mnt/daos_load. Is this okay?
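
For reference, a minimal sketch of the check_ram_used idea from question 1. It reads the generated daos_server.yml directly so the example is self-contained; the actual helper in recovery_utils.py takes a server_manager and a log, so the server_config_path argument here is an assumption.

import yaml


def check_ram_used(server_config_path, log):
    """Sketch: return True if any engine storage tier uses class "ram".

    On HW Medium, a ram (tmpfs) tier in the server config implies MD-on-SSD.
    The real helper takes a server_manager instead of a config file path.
    """
    with open(server_config_path, "r") as config_file:
        config = yaml.safe_load(config_file)
    for engine in config.get("engines", []):
        for tier in engine.get("storage", []):
            if tier.get("class") == "ram":
                log.info("Found a ram storage tier; assuming MD-on-SSD")
                return True
    return False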

Thanks.

Comment on lines +129 to +130
md_on_ssd = check_ram_used(server_manager=self.server_managers[0], log=self.log)
if md_on_ssd:
Contributor

We already have a DaosServerManager method to determine if we're using MD on SSD:

Suggested change:
- md_on_ssd = check_ram_used(server_manager=self.server_managers[0], log=self.log)
- if md_on_ssd:
+ md_on_ssd = self.server_managers[0].manager.job.using_control_metadata
+ if md_on_ssd:

return policies


def check_ram_used(server_manager, log):
Contributor

If we're just using this to determine if we are using MD on SSD, we already have self.server_managers[0].manager.job.using_control_metadata.

if md_on_ssd:
    self.log_step(f"MD-on-SSD: Load pool dir to {daos_load_path}")
    db_path = os.path.join(
        self.log_dir, "control_metadata", "daos_control", "engine0")
Contributor

Shouldn't the control metadata path be obtained via self.server_managers[0].job.yaml.metadata_params.path.value?
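
Concretely, that would look something like the following; the daos_control/engine0 suffix is carried over from the hard-coded path above and is an assumption about the control metadata layout:

control_metadata = self.server_managers[0].job.yaml.metadata_params.path.value
db_path = os.path.join(control_metadata, "daos_control", "engine0")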

    return self.run()

def prov_mem(self, db_path, tmpfs_mount):
    """Call ddb "" prov_mem <db_path> <tmpfs_mount>.
Contributor

Are we always calling this command with an empty vos_path, or is this specific to the only test currently using this command?

Comment on lines +104 to +107
command_result = run_remote(
    log=self.log, hosts=self.hostlist_servers, command=command_root).passed
if not command_result:
    self.fail(f"{command} failed!")
Contributor

If you want, we can also report on which host(s) the command failed:

Suggested change:
- command_result = run_remote(
-     log=self.log, hosts=self.hostlist_servers, command=command_root).passed
- if not command_result:
-     self.fail(f"{command} failed!")
+ result = run_remote(
+     log=self.log, hosts=self.hostlist_servers, command=command_root)
+ if not result.passed:
+     self.fail(f"{command} failed on {result.failed_hosts}!")

Args:
    remote_file_path (str): File path to copy to local.
    test_dir (str): Test directory. Usually self.test_dir.
    remote (str): Remote hostname to copy file from.
Contributor

get_clush_command requires a NodeSet.

Suggested change:
- remote (str): Remote hostname to copy file from.
+ remote (NodeSet): Remote hostname to copy file from.

f"ERROR: Copying {remote_file_path} from {remote}: {error}") from error

# Remove the appended .<server_hostname> from the copied file.
current_file_path = "".join([remote_file_path, ".", remote])
Contributor

This will be a problem if there are multiple hosts specified in the remote argument. If the test is only going to use one remote host, is clush rcopy even needed?

To handle multiple hosts, this function could just return the paths of the copied files (with the hostname extension) and the caller could loop over the list of files to process them.
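
A rough sketch of that alternative; the function name and arguments follow the docstring excerpt above, and the body of the copy step is elided since only the return shape changes:

def copy_remote_to_local(remote_file_path, test_dir, remote):
    """Copy remote_file_path from each host in remote (a NodeSet) for local processing.

    Sketch only: instead of stripping the .<hostname> suffix for a single host,
    return every per-host copy and let the caller loop over them.
    """
    # ... existing get_clush_command / rcopy call stays as-is ...
    # clush rcopy appends ".<hostname>" to each copied file, so report those paths.
    return ["".join([remote_file_path, ".", str(host)]) for host in remote]

The caller would then iterate over the returned paths and process each per-host copy instead of assuming a single remote host.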
