S3 support with Boto3 for Pytorch NPZ data a generator and dataloader by ekaynar · Pull Request #264 · argonne-lcf/dlio_benchmark

ekaynar · 2025-02-28T16:35:01Z

This PR adds a PytorchS3Storage class, which uses the Boto3 library to interact with S3 storage, and adds support for generating and loading NPZ files.

zhenghh04 · 2025-03-04T18:31:49Z

@ekaynar could you try to fix the failing CI tests?

hariharan-devarajan · 2025-03-05T18:10:22Z

@ekaynar Thank you for your updates for S3 support. I think it's a great addition to DLIO Benchmark.

Based on the code guidelines currently followed on DLIO Benchmark, I suggest the following updates.

We do not want switch cases in the NPZ Reader. Instead, inherit from the NPZ Reader class (NPZS3Reader in a new module called npz_s3_reader) and update the open logic. The reasoning is that this makes it easier to skip installing the S3 dependencies on systems where S3 is not supported.
Define the self.storage variable within NPZS3Reader's init.
No additional company copyrights, since the project is open source under the Apache 2.0 License.
Add a new enumeration value S3 to StorageType and do a switch case within the ReaderFactory class.
Similar to the reader, inherit from the NPZGenerator class for S3 and override the generate function.
We need to include S3 emulation in the CI to test the above functionality, and include a test case for that environment. If you're not aware of S3 emulators that can run locally, @zhenghh04 could recommend some.

Let me know if this makes sense or if you have any concerns.
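A minimal sketch of the requested reader structure, under stated assumptions: `NPZReader` below is a simplified stand-in for the existing class rather than DLIO's actual code, the `bucket` and `client` constructor parameters and the `decode` hook are hypothetical, and `FakeS3Client` exists only so the example runs without network access. The one real API relied on is Boto3's `get_object`, which returns a dict whose `"Body"` entry is a readable stream.

```python
import io


class NPZReader:
    """Simplified stand-in for the existing local-file NPZ reader."""

    def open(self, filename):
        # Local path: hand an open binary stream to the decoder.
        with open(filename, "rb") as stream:
            return self.decode(stream)

    def decode(self, stream):
        # A real reader would parse the NPZ archive here (e.g. numpy.load(stream));
        # this sketch returns the raw bytes to stay dependency-free.
        return stream.read()


class NPZS3Reader(NPZReader):
    """Hypothetical subclass: only the open logic differs from NPZReader."""

    def __init__(self, bucket, client=None):
        self.bucket = bucket
        if client is None:
            # Imported lazily, so systems without S3 support never need boto3.
            import boto3
            client = boto3.client("s3")
        # The storage handle is defined in the subclass, not in the base reader.
        self.storage = client

    def open(self, filename):
        # Fetch the object, then reuse the inherited decoding step.
        response = self.storage.get_object(Bucket=self.bucket, Key=filename)
        return self.decode(io.BytesIO(response["Body"].read()))


class FakeS3Client:
    """In-memory stand-in that mimics the shape of boto3's get_object reply."""

    def __init__(self, objects):
        self.objects = objects

    def get_object(self, Bucket, Key):
        return {"Body": io.BytesIO(self.objects[(Bucket, Key)])}


fake = FakeS3Client({("demo-bucket", "train/img_0.npz"): b"payload"})
reader = NPZS3Reader("demo-bucket", client=fake)
data = reader.open("train/img_0.npz")
```

Because the Boto3 import sits inside the subclass constructor, a module like this can be skipped entirely on systems that never select S3 storage.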

hariharan-devarajan · 2025-03-05T18:02:38Z

dlio_benchmark/data_generator/npz_generator.py

@@ -1,4 +1,5 @@
 """
+   Copyright (c) 2025 Dell Inc, or its subsidiaries.


We cannot have company copyrights here.

hariharan-devarajan · 2025-03-05T18:10:45Z

dlio_benchmark/reader/npz_reader.py

@@ -1,4 +1,5 @@
 """
+   Copyright (c) 2025 Dell Inc, or its subsidiaries.


We cannot have company copyrights here.

hariharan-devarajan · 2025-03-05T18:11:24Z

dlio_benchmark/data_generator/npz_generator.py

            prev_out_spec = out_path_spec
-            if self.compression != Compression.ZIP:
-                np.savez(out_path_spec, x=records, y=record_labels)
+


Create a new class which inherits NPZGenerator and switch on generator_factory based on storage type.

ekaynar · Mar 7, 2025 (edited)

@hariharan-devarajan, @zhenghh04 Currently, GeneratorFactory only receives the format type (NPZ). Should I pass the storage type during initialization so that it can select either the NPZS3Generator or the NPZGenerator class? Or should I use a new format type called NPZS3?
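The first option in the question above, passing the storage type to the factory, could look roughly like the sketch below. Everything here is illustrative: the enum members, the `get_generator` signature, and the placeholder `generate` bodies are assumptions, not the project's actual code, and the thread does not record which option the maintainers chose.

```python
from enum import Enum


class FormatType(Enum):
    NPZ = "npz"


class StorageType(Enum):
    LOCAL_FS = "local_fs"
    S3 = "s3"  # the new enumeration value requested in the review


class NPZGenerator:
    def generate(self):
        # Placeholder for writing NPZ files to a local path.
        return "local"


class NPZS3Generator(NPZGenerator):
    def generate(self):
        # Placeholder for serializing NPZ data and uploading it to S3.
        return "s3"


class GeneratorFactory:
    @staticmethod
    def get_generator(format_type, storage_type):
        # Hypothetical signature: the factory receives both the format
        # type and the storage type, and switches on the latter.
        if format_type == FormatType.NPZ:
            if storage_type == StorageType.S3:
                return NPZS3Generator()
            return NPZGenerator()
        raise ValueError(f"Unsupported format type: {format_type}")


local_generator = GeneratorFactory.get_generator(FormatType.NPZ, StorageType.LOCAL_FS)
s3_generator = GeneratorFactory.get_generator(FormatType.NPZ, StorageType.S3)
```

Compared with a separate NPZS3 format type, this keeps the format enumeration free of storage concerns, at the cost of one extra argument to the factory.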

hariharan-devarajan · 2025-03-05T18:12:23Z

dlio_benchmark/reader/npz_reader.py


    @dlp.log
    def open(self, filename):
+        if isinstance(self.storage, S3PytorchStorage):


Move this to a new class that inherits NPZReader and overrides open function.

ekaynar · Mar 7, 2025 (edited)

@hariharan-devarajan Thank you for the feedback! I appreciate your suggestions. I will move everything into the NPZS3Reader class.

My suggestion for the long term is a single reader class per file type that delegates to the appropriate storage class (file or object). That might be more sustainable than separate readers for each access protocol: it avoids duplicating code across file types, and a unified structure makes the codebase easier for other developers to understand.
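The alternative design described above can be sketched as a single reader that delegates byte retrieval to a storage backend. This is only an illustration: the `Storage` interface, its `read_bytes` method, and the `FileStorage` and `ObjectStorage` names are all hypothetical, and `FakeS3Client` merely mimics the shape of Boto3's `get_object` reply so the example runs offline.

```python
import io
import os
import tempfile
from abc import ABC, abstractmethod


class Storage(ABC):
    """Hypothetical backend interface: fetch one file or object as bytes."""

    @abstractmethod
    def read_bytes(self, path):
        ...


class FileStorage(Storage):
    def read_bytes(self, path):
        with open(path, "rb") as stream:
            return stream.read()


class ObjectStorage(Storage):
    def __init__(self, bucket, client):
        self.bucket = bucket
        self.client = client  # e.g. boto3.client("s3")

    def read_bytes(self, path):
        response = self.client.get_object(Bucket=self.bucket, Key=path)
        return response["Body"].read()


class NPZReader:
    """One reader per format; the storage backend decides how bytes arrive."""

    def __init__(self, storage):
        self.storage = storage

    def open(self, filename):
        # A real reader would parse this buffer (e.g. with numpy.load);
        # the sketch returns it unparsed.
        return io.BytesIO(self.storage.read_bytes(filename))


class FakeS3Client:
    """In-memory stand-in for an S3 client."""

    def __init__(self, objects):
        self.objects = objects

    def get_object(self, Bucket, Key):
        return {"Body": io.BytesIO(self.objects[(Bucket, Key)])}


# The same reader class serves a local file...
with tempfile.TemporaryDirectory() as tmp:
    local_path = os.path.join(tmp, "img_0.npz")
    with open(local_path, "wb") as stream:
        stream.write(b"local payload")
    local_data = NPZReader(FileStorage()).open(local_path).read()

# ...and an object in a bucket.
s3_client = FakeS3Client({("demo-bucket", "img_0.npz"): b"s3 payload"})
s3_data = NPZReader(ObjectStorage("demo-bucket", s3_client)).open("img_0.npz").read()
```

The trade-off against the subclass approach requested in the review is that the reader no longer needs an S3-specific variant, but every storage backend must then expose the same byte-level interface.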

hariharan-devarajan · 2025-03-05T18:12:48Z

dlio_benchmark/reader/reader_handler.py

        self.dataset_type = dataset_type
        self.open_file_map = {}
-
+        self.storage = StorageFactory().get_storage(self._args.storage_type, self._args.storage_root,


Move this to the new reader class's init function.

hariharan-devarajan · 2025-03-05T18:13:00Z

dlio_benchmark/storage/s3_storage.py

@@ -1,4 +1,5 @@
 """
+   Copyright (c) 2025 Dell Inc, or its subsidiaries.


Remove company copyright.

ekaynar added 2 commits February 28, 2025 15:23

Boto3 Library for Pytorch NPZ reader and dataloader

7c88b48

add copyright

dc768a4

zhenghh04 requested review from hariharan-devarajan and zhenghh04 March 3, 2025 16:03

zhenghh04 added 2 commits March 3, 2025 21:50

Update requirements.txt

7842e1d

Update setup.py to include boto3

d2bf507

hariharan-devarajan requested changes Mar 5, 2025

ekaynar closed this Mar 7, 2025

ekaynar reopened this Mar 7, 2025

ekaynar added 2 commits March 7, 2025 19:32

created a new npz_s3 reader and generator modules

411dbce

remove copyrights

b5ee362

ekaynar wants to merge 6 commits into argonne-lcf:main from ekaynar:main
