S3 support with Boto3 for Pytorch NPZ data generator and dataloader #264
ekaynar wants to merge 6 commits into argonne-lcf:main
@ekaynar Could you try to fix the failing CI tests?
@ekaynar Thank you for your updates for S3 support. I think it's a great addition to DLIO Benchmark. Based on the code guidelines currently followed in DLIO Benchmark, I suggest the following updates.
Let me know if this makes sense or if you have any concerns.
@@ -1,4 +1,5 @@
"""
Copyright (c) 2025 Dell Inc, or its subsidiaries.
We cannot have a company copyright here.
dlio_benchmark/reader/npz_reader.py
Outdated
@@ -1,4 +1,5 @@
"""
Copyright (c) 2025 Dell Inc, or its subsidiaries.
We cannot have a company copyright here.
prev_out_spec = out_path_spec
if self.compression != Compression.ZIP:
    np.savez(out_path_spec, x=records, y=record_labels)
Create a new class that inherits from NPZGenerator, and have generator_factory select between the two based on the storage type.
@hariharan-devarajan, @zhenghh04 Currently, GeneratorFactory only receives the format type (NPZ). Should I pass the storage type during initialization so that it can select either the NPZS3Generator or the NPZGenerator class? Or should I use a new format type called NPZS3?
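The first option raised in this question, passing the storage type to the factory, could be sketched roughly as follows. This is only an illustration under assumptions: the `get_generator` signature, the `StorageType` members, and the empty class bodies are hypothetical stand-ins, not the actual DLIO Benchmark code.

```python
from enum import Enum


class FormatType(Enum):
    NPZ = "npz"


class StorageType(Enum):  # hypothetical members, for illustration only
    LOCAL_FS = "local_fs"
    S3 = "s3"


class NPZGenerator:
    """Stand-in for the existing generator that writes to a local file system."""


class NPZS3Generator(NPZGenerator):
    """Stand-in for an S3-specific subclass; it would override only the write path."""


class GeneratorFactory:
    @staticmethod
    def get_generator(format_type, storage_type=StorageType.LOCAL_FS):
        # Choose by format first, then specialise on the storage backend.
        if format_type == FormatType.NPZ:
            if storage_type == StorageType.S3:
                return NPZS3Generator()
            return NPZGenerator()
        raise ValueError(f"unsupported format: {format_type}")
```

With a default value for `storage_type`, callers that pass only the format would keep getting `NPZGenerator`. The other option, a separate `NPZS3` format type, would avoid the extra argument but would fold a storage concern into the format enum.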
dlio_benchmark/reader/npz_reader.py
Outdated
@dlp.log
def open(self, filename):
    if isinstance(self.storage, S3PytorchStorage):
Move this to a new class that inherits from NPZReader and overrides the open function.
@hariharan-devarajan Thank you for the feedback! I appreciate your suggestions. I will move everything into the NPZS3readers class.
My suggestion for the long term is to have a single reader class per file type that delegates to the appropriate storage class (file or object). That might be more sustainable than keeping separate readers for each access protocol: it avoids duplicating code across file types, and a unified structure can make the codebase easier for other developers to understand.
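The long-term design described in this comment could look something like the sketch below. All names here (`Storage`, `FileStorage`, `ObjectStorage`, `get_data`, `load_npz`, the `decode` parameter) are hypothetical and chosen for illustration; the in-memory `ObjectStorage` merely stands in for an S3-backed class so the example runs offline.

```python
import io
from typing import BinaryIO, Callable, Protocol


class Storage(Protocol):
    """Hypothetical minimal interface shared by file and object backends."""

    def get_data(self, path: str) -> bytes: ...


class FileStorage:
    def get_data(self, path: str) -> bytes:
        with open(path, "rb") as f:
            return f.read()


class ObjectStorage:
    """In-memory stand-in for an S3 bucket (assumption, not the PR's class)."""

    def __init__(self, objects: dict):
        self._objects = objects

    def get_data(self, path: str) -> bytes:
        return self._objects[path]


def load_npz(fileobj: BinaryIO):
    import numpy as np  # imported lazily; needed only when decoding real NPZ data

    # "x" is the key used by np.savez(..., x=records, y=record_labels) in the generator.
    return np.load(fileobj, allow_pickle=True)["x"]


class NPZReader:
    """One reader per file format; how the bytes are fetched is left to the storage."""

    def __init__(self, storage: Storage, decode: Callable = load_npz):
        self.storage = storage
        self.decode = decode

    def open(self, filename: str):
        return self.decode(io.BytesIO(self.storage.get_data(filename)))
```

In this layout, `NPZReader(FileStorage())` and `NPZReader(ObjectStorage(...))` share the same decoding code. One cost of the sketch as written is that each file is read fully into memory before decoding.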
self.dataset_type = dataset_type
self.open_file_map = {}

self.storage = StorageFactory().get_storage(self._args.storage_type, self._args.storage_root,
Move this to the new reader class's init function.
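Taken together, the two review suggestions on npz_reader.py (override open in a subclass, and set up the storage in that subclass's init) might be sketched as below. The simplified `NPZReader` base, the `_decode` helper, the `get_data` method, and the `InMemoryStorage` stand-in are all assumptions made for illustration; in the PR the storage object would come from the `StorageFactory().get_storage(...)` call rather than being passed in.

```python
import io


class NPZReader:
    """Simplified stand-in for the existing local-file reader."""

    def open(self, filename):
        with open(filename, "rb") as f:
            return self._decode(f)

    def _decode(self, fileobj):
        import numpy as np  # imported lazily; needed only when decoding real NPZ data

        return np.load(fileobj, allow_pickle=True)["x"]


class InMemoryStorage:
    """Hypothetical stand-in for the S3 storage class, so the sketch runs offline."""

    def __init__(self, objects):
        self._objects = objects

    def get_data(self, key):
        return self._objects[key]


class NPZS3Reader(NPZReader):
    def __init__(self, storage):
        # The storage is owned by the S3 reader, not by the base class.
        self.storage = storage

    def open(self, filename):
        # Fetch the object's bytes, then reuse the base class's decoding.
        data = self.storage.get_data(filename)
        return self._decode(io.BytesIO(data))
```

Because only `open` and the constructor differ, the base `NPZReader` needs no `isinstance` check on the storage type.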
dlio_benchmark/storage/s3_storage.py
Outdated
@@ -1,4 +1,5 @@
"""
Copyright (c) 2025 Dell Inc, or its subsidiaries.
Remove company copyright.
This PR adds a PytorchS3Storage class, which uses the Boto3 library to interact with S3 storage, and adds support for generating and loading NPZ files.