Skip to content

S3 support with Boto3 for Pytorch NPZ data a generator and dataloader#264

Open
ekaynar wants to merge 6 commits intoargonne-lcf:mainfrom
ekaynar:main
Open

S3 support with Boto3 for Pytorch NPZ data a generator and dataloader#264
ekaynar wants to merge 6 commits intoargonne-lcf:mainfrom
ekaynar:main

Conversation

@ekaynar
Copy link
Copy Markdown

@ekaynar ekaynar commented Feb 28, 2025

PytorchS3Storage class, which leverages the Boto3 library to interact with S3 storage. Adding support to data generation and loading of NPZ files.

@zhenghh04
Copy link
Copy Markdown
Member

@ekaynar could you try to fix the failing ci tests?

@hariharan-devarajan
Copy link
Copy Markdown
Collaborator

@ekaynar Thank you for your updates for S3 support. I think its a great addition to DLIO Benchmark

Based on the code guidelines currently followed on DLIO Benchmark, I suggest the following updates.

  • We do not want switch cases in the NPZ Reader. Instead, inherit the NPZ Reader class (NPZS3Reader in a new module called npz_s3_reader) and update the open logic. The reasoning behind this is that it would easier to not install s3 if it's not supported on a specific system.
  • The self.storage variable is defined within NPZS3Reader's init.
  • No additional company copyrights as open Apache 2.0 License.
  • Add a new enumeration on StorageType as S3 and do a switch case within ReaderFactory class.
  • Similar to Reader, inherit the NPZGenerator class for S3 and override the generate function.
  • We need to include S3 emulation for the CI to test the above functionality and include test-case for that environment. if your not aware of S3 emulators that can run locally, @zhenghh04 could recommend you some.

Let me know if this makes sense or if you have concern.

@@ -1,4 +1,5 @@
"""
Copyright (c) 2025 Dell Inc, or its subsidiaries.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot have companies copyright here.

@@ -1,4 +1,5 @@
"""
Copyright (c) 2025 Dell Inc, or its subsidiaries.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot have companies copyright here.

prev_out_spec = out_path_spec
if self.compression != Compression.ZIP:
np.savez(out_path_spec, x=records, y=record_labels)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create a new class which inherits NPZGenerator and switch on generator_factory based on storage type.

Copy link
Copy Markdown
Author

@ekaynar ekaynar Mar 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hariharan-devarajan, @zhenghh04 currently, GeneratorFactory only receives the format type (NPZ). Should I pass the storage type during the initialization, so that it can select either NPZS3Generator or NPZGenerator class? Or should I use a new format type called NPZS3?


@dlp.log
def open(self, filename):
if isinstance(self.storage, S3PytorchStorage):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this to a new class that inherits NPZReader and overrides open function.

Copy link
Copy Markdown
Author

@ekaynar ekaynar Mar 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hariharan-devarajan Thank you for the feedback! I appreciate your suggestions. I will move everything into the NPZS3readers class.

My suggestion is that, in the long term, having a single reader class for different file types, which then directs to the appropriate storage class (file or object), might be a more sustainable approach compared to having separate readers for different protocol accesses. This way, we avoid duplicating code for different file types, and a unified structure can make the codebase easier to understand for other developers.

self.dataset_type = dataset_type
self.open_file_map = {}

self.storage = StorageFactory().get_storage(self._args.storage_type, self._args.storage_root,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this to new reader class' init function

@@ -1,4 +1,5 @@
"""
Copyright (c) 2025 Dell Inc, or its subsidiaries.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove company copyright.

@ekaynar ekaynar closed this Mar 7, 2025
@ekaynar ekaynar reopened this Mar 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants