Description
I get a "FileNotFoundError: [Errno 2] No such file or directory: 'tfrecord2idx'" error when creating a dataset for the resnet50 and cosmoflow models with more than one host. With a single host, dataset creation works without issues. Unet3d works fine when creating a dataset across multiple hosts.
Is there a setting or package that needs to be installed to get around the issue?
Command
mlpstorage training datagen --hosts=10.151.1.5,10.151.1.6 --model=resnet50 --exec-type=mpi --param dataset.num_files_train=10231 --num-processes=64 --results-dir=/mnt/nfs19/resnet_results2 --data-dir=/mnt/nfs19/resnet_data2/
Output
(myenv) root@adclient5:~# mlpstorage training datagen --hosts=10.151.1.5,10.151.1.6 --model=resnet50 --exec-type=mpi --param dataset.num_files_train=10231 --num-processes=64 --results-dir=/mnt/nfs19/resnet_results2 --data-dir=/mnt/nfs19/resnet_data2/ --allow-run-as-root
Hosts is: ['10.151.1.5,10.151.1.6']
Hosts is: ['10.151.1.5', '10.151.1.6']
2025-09-14 06:07:28|STATUS: Benchmark results directory: /mnt/nfs19/resnet_results2/training/resnet50/datagen/20250914_060728
2025-09-14 06:07:28|STATUS: Running benchmark command:: mpirun -n 64 -host 10.151.1.5:32,10.151.1.6:32 --allow-run-as-root /root/.venvs/myenv/bin/dlio_benchmark workload=resnet50_datagen ++hydra.run.dir=/mnt/nfs19/resnet_results2/training/resnet50/datagen/20250914_060728 ++hydra.output_subdir=dlio_config ++workload.dataset.num_files_train=10231 ++workload.dataset.data_folder=/mnt/nfs19/resnet_data2/resnet50 --config-dir=/root/storage/configs/dlio
Warning: Permanently added '10.151.1.6' (ED25519) to the list of known hosts.
[OUTPUT] 2025-09-14T06:07:32.520868 Running DLIO [Generating data] with 64 process(es)
[OUTPUT] 2025-09-14T06:07:32.527553 Starting data generation
Traceback (most recent call last):
File "/root/.venvs/myenv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 460, in run_benchmark
benchmark.initialize()
File "/root/.venvs/myenv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 167, in initialize
self.data_generator.generate()
File "/root/.venvs/myenv/lib/python3.12/site-packages/dlio_benchmark/data_generator/tf_generator.py", line 81, in generate
call([tfrecord2idx_script, out_path_spec, self.storage.get_uri(tfrecord_idx)])
File "/usr/lib/python3.12/subprocess.py", line 389, in call
with Popen(*popenargs, **kwargs) as p:
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/subprocess.py", line 1026, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/lib/python3.12/subprocess.py", line 1955, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'tfrecord2idx'
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Error executing job with overrides: ['workload=resnet50_datagen', '++workload.dataset.num_files_train=10231', '++workload.dataset.data_folder=/mnt/nfs19/resnet_data2/resnet50']
Error executing job with overrides: ['workload=resnet50_datagen', '++workload.dataset.num_files_train=10231', '++workload.dataset.data_folder=/mnt/nfs19/resnet_data2/resnet50']
Error executing job with overrides: ['workload=resnet50_datagen', '++workload.dataset.num_files_train=10231', '++workload.dataset.data_folder=/mnt/nfs19/resnet_data2/resnet50']
Error executing job with overrides: ['workload=resnet50_datagen', '++workload.dataset.num_files_train=10231', '++workload.dataset.data_folder=/mnt/nfs19/resnet_data2/resnet50']
Traceback (most recent call last):
File "/root/.venvs/myenv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 460, in run_benchmark
benchmark.initialize()
File "/root/.venvs/myenv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 167, in initialize
self.data_generator.generate()
File "/root/.venvs/myenv/lib/python3.12/site-packages/dlio_benchmark/data_generator/tf_generator.py", line 81, in generate
call([tfrecord2idx_script, out_path_spec, self.storage.get_uri(tfrecord_idx)])
File "/usr/lib/python3.12/subprocess.py", line 389, in call
with Popen(*popenargs, **kwargs) as p:
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/subprocess.py", line 1026, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/lib/python3.12/subprocess.py", line 1955, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'tfrecord2idx'
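
The traceback shows `subprocess.call` being given the bare name `tfrecord2idx`, which is resolved through `PATH` on whichever host runs that rank. One possible explanation, which is an assumption and not confirmed here, is that the ranks started on the second host do not have the virtual environment's `bin` directory on `PATH`, or that the package providing `tfrecord2idx` (to my knowledge it ships with NVIDIA DALI) is not installed there. A minimal diagnostic sketch to check this on each host follows; the helper names `find_tool` and `report` are hypothetical and not part of DLIO or mlpstorage.

```python
import os
import shutil
import socket


def find_tool(name, path=None):
    """Return the full path of `name` if it resolves on PATH, else None.

    `path` optionally overrides the search path (os.pathsep-separated string);
    by default the PATH of the current process is used.
    """
    return shutil.which(name, path=path)


def report(name="tfrecord2idx"):
    """Build a one-line report of how `name` resolves on this host."""
    resolved = find_tool(name)
    status = resolved if resolved else "NOT FOUND on PATH"
    return f"{socket.gethostname()}: {name} -> {status}"


if __name__ == "__main__":
    # Print the resolution result and the PATH this process actually sees.
    print(report())
    print("PATH =", os.environ.get("PATH", ""))
```

Running this script under the same `mpirun -n ... -host ...` launch as the failing command, using the same Python interpreter, should show whether the tool resolves on every host. If one host reports `NOT FOUND on PATH`, that would be consistent with the single-host run succeeding and the multi-host run failing.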