Error creating dataset with resnet50/cosmoflow using multiple hosts #200

@bigdogdan2

I get a "FileNotFoundError: [Errno 2] No such file or directory: 'tfrecord2idx'" error when trying to create a dataset for the resnet50 and cosmoflow models with more than one host. With a single host, the dataset is created without any issues. Unet3d works fine when creating a dataset across multiple hosts.
Is there a setting to change or a package to install to get around this issue?

Command
mlpstorage training datagen --hosts=10.151.1.5,10.151.1.6 --model=resnet50 --exec-type=mpi --param dataset.num_files_train=10231 --num-processes=64 --results-dir=/mnt/nfs19/resnet_results2 --data-dir=/mnt/nfs19/resnet_data2/

Output
(myenv) root@adclient5:~# mlpstorage training datagen --hosts=10.151.1.5,10.151.1.6 --model=resnet50 --exec-type=mpi --param dataset.num_files_train=10231 --num-processes=64 --results-dir=/mnt/nfs19/resnet_results2 --data-dir=/mnt/nfs19/resnet_data2/ --allow-run-as-root
Hosts is: ['10.151.1.5,10.151.1.6']
Hosts is: ['10.151.1.5', '10.151.1.6']
2025-09-14 06:07:28|STATUS: Benchmark results directory: /mnt/nfs19/resnet_results2/training/resnet50/datagen/20250914_060728
2025-09-14 06:07:28|STATUS: Running benchmark command:: mpirun -n 64 -host 10.151.1.5:32,10.151.1.6:32 --allow-run-as-root /root/.venvs/myenv/bin/dlio_benchmark workload=resnet50_datagen ++hydra.run.dir=/mnt/nfs19/resnet_results2/training/resnet50/datagen/20250914_060728 ++hydra.output_subdir=dlio_config ++workload.dataset.num_files_train=10231 ++workload.dataset.data_folder=/mnt/nfs19/resnet_data2/resnet50 --config-dir=/root/storage/configs/dlio
Warning: Permanently added '10.151.1.6' (ED25519) to the list of known hosts.
[OUTPUT] 2025-09-14T06:07:32.520868 Running DLIO [Generating data] with 64 process(es)
[OUTPUT] 2025-09-14T06:07:32.527553 Starting data generation

Traceback (most recent call last):
  File "/root/.venvs/myenv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 460, in run_benchmark
    benchmark.initialize()
  File "/root/.venvs/myenv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 167, in initialize
    self.data_generator.generate()
  File "/root/.venvs/myenv/lib/python3.12/site-packages/dlio_benchmark/data_generator/tf_generator.py", line 81, in generate
    call([tfrecord2idx_script, out_path_spec, self.storage.get_uri(tfrecord_idx)])
  File "/usr/lib/python3.12/subprocess.py", line 389, in call
    with Popen(*popenargs, **kwargs) as p:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 1026, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.12/subprocess.py", line 1955, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'tfrecord2idx'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Error executing job with overrides: ['workload=resnet50_datagen', '++workload.dataset.num_files_train=10231', '++workload.dataset.data_folder=/mnt/nfs19/resnet_data2/resnet50']
Error executing job with overrides: ['workload=resnet50_datagen', '++workload.dataset.num_files_train=10231', '++workload.dataset.data_folder=/mnt/nfs19/resnet_data2/resnet50']
Error executing job with overrides: ['workload=resnet50_datagen', '++workload.dataset.num_files_train=10231', '++workload.dataset.data_folder=/mnt/nfs19/resnet_data2/resnet50']
Error executing job with overrides: ['workload=resnet50_datagen', '++workload.dataset.num_files_train=10231', '++workload.dataset.data_folder=/mnt/nfs19/resnet_data2/resnet50']
Traceback (most recent call last):
  File "/root/.venvs/myenv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 460, in run_benchmark
    benchmark.initialize()
  File "/root/.venvs/myenv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 167, in initialize
    self.data_generator.generate()
  File "/root/.venvs/myenv/lib/python3.12/site-packages/dlio_benchmark/data_generator/tf_generator.py", line 81, in generate
    call([tfrecord2idx_script, out_path_spec, self.storage.get_uri(tfrecord_idx)])
  File "/usr/lib/python3.12/subprocess.py", line 389, in call
    with Popen(*popenargs, **kwargs) as p:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 1026, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.12/subprocess.py", line 1955, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'tfrecord2idx'
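
The traceback shows subprocess.call being given the bare name 'tfrecord2idx', so the lookup depends on the PATH of each MPI rank. Below is a minimal diagnostic sketch, not part of mlpstorage or dlio_benchmark, that reports whether the name resolves on PATH and whether a script of that name sits next to the running interpreter. The function name locate_tool and the file name check_tool.py are hypothetical, and the guess that the script would live in the virtual environment's bin directory is an assumption.

```python
import os
import shutil
import socket
import sys


def locate_tool(name="tfrecord2idx", path=None):
    """Return (path_hit, interpreter_dir_hit) for the executable `name`.

    path_hit: what a bare-name lookup would resolve to (None if nothing is found).
    interpreter_dir_hit: the file next to sys.executable, if it exists and is
    executable (assumption: a venv-installed script would be placed there).
    """
    path_hit = shutil.which(name, path=path)
    candidate = os.path.join(os.path.dirname(sys.executable), name)
    interpreter_dir_hit = candidate if os.access(candidate, os.X_OK) else None
    return path_hit, interpreter_dir_hit


if __name__ == "__main__":
    path_hit, interpreter_dir_hit = locate_tool()
    print(
        f"{socket.gethostname()}: PATH -> {path_hit}, "
        f"interpreter dir -> {interpreter_dir_hit}"
    )
```

One way to use it, reusing the flags from the failing command, would be to launch one rank per host: mpirun -n 2 -host 10.151.1.5:1,10.151.1.6:1 --allow-run-as-root /root/.venvs/myenv/bin/python check_tool.py. If a host prints None for PATH but a real path for the interpreter directory, the ranks on that host are probably starting without the virtual environment's bin directory on PATH. If both are None, the script is likely not installed on that host at all.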
