
Commit ae458f2

Add support for distributed training to S3IterableDataset (#243)
Add support for multi-process/multi-node sharding to S3IterableDataset
1 parent 2123dfe commit ae458f2

File tree

8 files changed: +533 / -35 lines changed


CHANGELOG.md

Lines changed: 9 additions & 0 deletions
@@ -1,9 +1,18 @@
+ ## v1.x.x (TBD)
+ * Add support for distributed training to S3IterableDataset
+
+ ### Breaking changes
+ * No breaking changes.
+
  ## v1.2.7 (October 29, 2024)

  ### New features
  * Add support for CRT retries (awslabs/mountpoint-s3#1069).
  * Add support for `CopyObject` API (#242).

+ ### Breaking changes
+ * No breaking changes.
+
  ## v1.2.6 (October 9, 2024)

  ### New features

README.md

Lines changed: 30 additions & 1 deletion
@@ -28,7 +28,7 @@ Amazon S3, without first saving to local storage.
  pip install s3torchconnector
  ```

- Amazon S3 Connector for PyTorch supports only Linux via Pip for now. For other platforms,
+ Amazon S3 Connector for PyTorch supports pre-built wheels via Pip only for Linux and macOS for now. For other platforms,
  see [DEVELOPMENT](DEVELOPMENT.md) for build instructions.

  ### Configuration
@@ -114,7 +114,35 @@ For example, assuming the following directory bucket name `my-test-bucket--usw2-
  usw2-az1, then the URI used will look like: `s3://my-test-bucket--usw2-az1--x-s3/<PREFIX>` (**please note that the
  prefix for Amazon S3 Express One Zone should end with '/'**), paired with region us-west-2.

+ ## Parallel/Distributed Training

+ Amazon S3 Connector for PyTorch supports parallel and distributed training with PyTorch,
+ allowing you to leverage multiple processes and nodes for efficient data loading and training.
+ Both S3IterableDataset and S3MapDataset can be used for this purpose.
+
+ ### S3IterableDataset
+
+ The S3IterableDataset can be passed directly to PyTorch's DataLoader for parallel and distributed training.
+ By default, all worker processes share the same list of training objects. However,
+ if you need each worker to have access to a unique portion of the dataset for better parallelization,
+ you can enable dataset sharding with the `enable_sharding` parameter.
+ ```
+ dataset = S3IterableDataset.from_prefix(DATASET_URI, region=REGION, enable_sharding=True)
+ dataloader = DataLoader(dataset, num_workers=4)
+ ```
+ When `enable_sharding` is set to True, the dataset is automatically sharded across the available workers.
+ This sharding mechanism supports both parallel training on a single host and distributed training across multiple hosts.
+ Each worker, regardless of its host, will load and process a distinct subset of the dataset.
+
+ ### S3MapDataset
+
+ For the S3MapDataset, pass it to the DataLoader together with a [DistributedSampler](https://pytorch.org/docs/stable/data.html#torch.utils.data.distributed.DistributedSampler) wrapped around it.
+ The DistributedSampler ensures that each worker or node receives a unique subset of the dataset,
+ enabling efficient parallel and distributed training.
+ ```
+ dataset = S3MapDataset.from_prefix(DATASET_URI, region=REGION)
+ sampler = DistributedSampler(dataset)
+ dataloader = DataLoader(dataset, sampler=sampler, num_workers=4)
+ ```
  ## Lightning Integration

  Amazon S3 Connector for PyTorch includes an integration for PyTorch Lightning, featuring S3LightningCheckpoint, an
@@ -183,3 +211,4 @@ See [CODE_OF_CONDUCT.md](CODE_OF_CONDUCT.md) for more details.
  ## License

  Amazon S3 Connector for PyTorch has a BSD 3-Clause License, as found in the [LICENSE](LICENSE) file.
+
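
For additional context beyond the diff, here is a minimal sketch of how the S3MapDataset example above might be wired into a distributed run. It assumes the job is launched with torchrun (which sets RANK/WORLD_SIZE); DATASET_URI, REGION, the transform, and the training step are placeholders, not part of this commit.
```
# Hypothetical end-to-end sketch; assumes launch via: torchrun --nproc_per_node=2 train.py
import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from s3torchconnector import S3MapDataset

DATASET_URI = "s3://my-bucket/my-prefix/"  # placeholder
REGION = "us-east-1"                       # placeholder

def key_and_bytes(s3reader):
    # Read each object eagerly so the default collate_fn only sees (str, bytes) pairs.
    return s3reader.key, s3reader.read()

def main():
    dist.init_process_group(backend="gloo")  # "nccl" is typical on GPU hosts

    dataset = S3MapDataset.from_prefix(DATASET_URI, region=REGION, transform=key_and_bytes)
    sampler = DistributedSampler(dataset)  # gives each rank a disjoint set of indices
    dataloader = DataLoader(dataset, sampler=sampler, num_workers=4)

    for epoch in range(3):
        sampler.set_epoch(epoch)  # reshuffle the per-rank shards every epoch
        for keys, payloads in dataloader:
            ...  # decode payloads and run the training step

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```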

s3torchconnector/src/s3torchconnector/s3iterable_dataset.py

Lines changed: 58 additions & 2 deletions
@@ -5,6 +5,7 @@
  import logging

  import torch.utils.data
+ import torch

  from . import S3Reader
  from ._s3bucket_key_data import S3BucketKeyData
@@ -32,13 +33,21 @@ def __init__(
      endpoint: Optional[str] = None,
      transform: Callable[[S3Reader], Any] = identity,
      s3client_config: Optional[S3ClientConfig] = None,
+     enable_sharding: bool = False,
  ):
      self._get_dataset_objects = get_dataset_objects
      self._transform = transform
      self._region = region
      self._endpoint = endpoint
      self._s3client_config = s3client_config
      self._client = None
+     self._enable_sharding = enable_sharding
+
+     self._rank = 0
+     self._world_size = 1
+     if torch.distributed.is_initialized():
+         self._rank = torch.distributed.get_rank()
+         self._world_size = torch.distributed.get_world_size()

  @property
  def region(self):
@@ -57,6 +66,7 @@ def from_objects(
      endpoint: Optional[str] = None,
      transform: Callable[[S3Reader], Any] = identity,
      s3client_config: Optional[S3ClientConfig] = None,
+     enable_sharding: bool = False,
  ):
      """Returns an instance of S3IterableDataset using the S3 URI(s) provided.
@@ -66,6 +76,7 @@ def from_objects(
      endpoint(str): AWS endpoint of the S3 bucket where the objects are stored.
      transform: Optional callable which is used to transform an S3Reader into the desired type.
      s3client_config: Optional S3ClientConfig with parameters for S3 client.
+     enable_sharding: If True, shard the dataset across multiple workers for parallel data loading. If False (default), each worker loads the entire dataset independently.

      Returns:
          S3IterableDataset: An IterableStyle dataset created from S3 objects.
@@ -80,6 +91,7 @@ def from_objects(
      endpoint,
      transform=transform,
      s3client_config=s3client_config,
+     enable_sharding=enable_sharding,
  )

  @classmethod
@@ -91,6 +103,7 @@ def from_prefix(
      endpoint: Optional[str] = None,
      transform: Callable[[S3Reader], Any] = identity,
      s3client_config: Optional[S3ClientConfig] = None,
+     enable_sharding: bool = False,
  ):
      """Returns an instance of S3IterableDataset using the S3 URI provided.
@@ -100,6 +113,7 @@ def from_prefix(
      endpoint(str): AWS endpoint of the S3 bucket where the objects are stored.
      transform: Optional callable which is used to transform an S3Reader into the desired type.
      s3client_config: Optional S3ClientConfig with parameters for S3 client.
+     enable_sharding: If True, shard the dataset across multiple workers for parallel data loading. If False (default), each worker loads the entire dataset independently.

      Returns:
          S3IterableDataset: An IterableStyle dataset created from S3 objects.
@@ -114,6 +128,7 @@ def from_prefix(
      endpoint,
      transform=transform,
      s3client_config=s3client_config,
+     enable_sharding=enable_sharding,
  )

  def _get_client(self):
@@ -133,6 +148,47 @@ def _get_transformed_object(self, bucket_key: S3BucketKeyData) -> Any:
      )

  def __iter__(self) -> Iterator[Any]:
-     return map(
-         self._get_transformed_object, self._get_dataset_objects(self._get_client())
+     worker_id = 0
+     num_workers = 1
+     if self._enable_sharding:
+         worker_info = torch.utils.data.get_worker_info()
+         if worker_info is not None:
+             worker_id = worker_info.id
+             num_workers = worker_info.num_workers
+
+     if not self._enable_sharding or (self._world_size == 1 and num_workers == 1):
+         # sharding disabled or only one shard is available, so return the entire dataset
+         return map(
+             self._get_transformed_object,
+             self._get_dataset_objects(self._get_client()),
+         )
+
+     """In a multi-process setting (e.g., distributed training), the dataset needs to be
+     sharded across multiple processes. The following variables control this sharding:
+
+     _rank: The rank (index) of the current process within the world (group of processes).
+     _world_size: The total number of processes in the world (group).
+
+     In addition, within each process, the dataset may be further sharded across multiple
+     worker threads or processes (e.g., for data loading). The following variables control
+     this intra-process sharding:
+
+     worker_id: The ID of the current worker thread/process within the process.
+     num_workers: The total number of worker threads/processes within the process.
+     """
+
+     # First, distribute objects across ranks
+     rank_sharded_objects = (
+         obj
+         for idx, obj in enumerate(self._get_dataset_objects(self._get_client()))
+         if idx % self._world_size == self._rank
      )
+
+     # Then, distribute objects within each rank across workers
+     worker_sharded_objects = (
+         obj
+         for idx, obj in enumerate(rank_sharded_objects)
+         if idx % num_workers == worker_id
+     )
+
+     return map(self._get_transformed_object, worker_sharded_objects)
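
To make the two-level sharding in `__iter__` concrete, here is a small standalone illustration (not part of the commit) of which objects each (rank, worker) pair receives under the same modulo scheme:
```
# Standalone illustration of the sharding arithmetic used in __iter__ above.
# With world_size=2 ranks and num_workers=2 DataLoader workers per rank,
# 12 objects split into 4 disjoint shards of 3 objects each.
objects = [f"obj-{i:02d}" for i in range(12)]

def shard(objects, rank, world_size, worker_id, num_workers):
    # First level: keep every world_size-th object, offset by this process's rank.
    rank_sharded = (o for i, o in enumerate(objects) if i % world_size == rank)
    # Second level: within the rank's slice, keep every num_workers-th object.
    return [o for i, o in enumerate(rank_sharded) if i % num_workers == worker_id]

for rank in range(2):
    for worker_id in range(2):
        print(rank, worker_id, shard(objects, rank, 2, worker_id, 2))
# (0, 0) -> ['obj-00', 'obj-04', 'obj-08']
# (0, 1) -> ['obj-02', 'obj-06', 'obj-10']
# (1, 0) -> ['obj-01', 'obj-05', 'obj-09']
# (1, 1) -> ['obj-03', 'obj-07', 'obj-11']
```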

s3torchconnector/tst/e2e/conftest.py

Lines changed: 57 additions & 14 deletions
@@ -18,41 +18,67 @@ def getenv(var: str, optional: bool = False) -> str:
      return v


- class BucketPrefixFixture(object):
+ class BucketPrefixData(object):
      """An S3 bucket/prefix and its contents for use in a single unit test. The prefix will be unique
      to this instance, so other concurrent tests won't affect its state."""

      region: str
      bucket: str
      prefix: str
      storage_class: str = None
+     contents: dict

      def __init__(
-         self, region: str, bucket: str, prefix: str, storage_class: str = None
+         self,
+         region: str,
+         bucket: str,
+         prefix: str,
+         storage_class: str = None,
+         contents: dict = None,
      ):
          self.bucket = bucket
          self.prefix = prefix
          self.region = region
          self.storage_class = storage_class
-         self.contents = {}
-         session = boto3.Session(region_name=region)
-         self.s3 = session.client("s3")
+         self.contents = contents or {}

      @property
      def s3_uri(self):
          return f"s3://{self.bucket}/{self.prefix}"

+     def __getitem__(self, index):
+         return self.contents[index]
+
+     def __iter__(self):
+         return iter(self.contents)
+
+
+ class BucketPrefixFixture(BucketPrefixData):
+     """An S3 bucket/prefix and its contents for use in a single unit test. The prefix will be unique
+     to this instance, so other concurrent tests won't affect its state."""
+
+     def __init__(
+         self, region: str, bucket: str, prefix: str, storage_class: str = None
+     ):
+         super().__init__(region, bucket, prefix, storage_class)
+         session = boto3.Session(region_name=region)
+         self.s3 = session.client("s3")
+
      def add(self, key: str, contents: bytes, **kwargs):
          """Upload an S3 object to this prefix of the bucket."""
          full_key = f"{self.prefix}{key}"
          self.s3.put_object(Bucket=self.bucket, Key=full_key, Body=contents, **kwargs)
          self.contents[full_key] = contents

-     def __getitem__(self, index):
-         return self.contents[index]
+     def get_data_snapshot(self):
+         """Returns a read-only copy of the current instance's data.

-     def __iter__(self):
-         return iter(self.contents)
+         The returned object cannot modify the actual S3 bucket.
+         Useful when passing data to another process without serializing the S3 client.
+         """
+         return BucketPrefixData(
+             self.region, self.bucket, self.prefix, self.storage_class, self.contents
+         )


  def get_test_bucket_prefix(name: str) -> BucketPrefixFixture:
@@ -71,13 +97,30 @@ def get_test_bucket_prefix(name: str) -> BucketPrefixFixture:

  @pytest.fixture
  def image_directory(request) -> BucketPrefixFixture:
-     """Create a bucket/prefix fixture that contains a directory of random JPG image files."""
      NUM_IMAGES = 10
      IMAGE_SIZE = 100
-     fixture = get_test_bucket_prefix(f"{request.node.name}/image_directory")
-     for i in range(NUM_IMAGES):
-         data = np.random.randint(0, 256, IMAGE_SIZE * IMAGE_SIZE * 3, np.uint8)
-         data = data.reshape(IMAGE_SIZE, IMAGE_SIZE, 3)
+     return _create_image_directory_fixture(NUM_IMAGES, IMAGE_SIZE, request.node.name)
+
+
+ @pytest.fixture
+ def image_directory_for_dp(request) -> BucketPrefixFixture:
+     """When conducting distributed training tests, be cautious about the number of files (images) in the test dataset.
+     If the total number of images cannot be evenly divided by the number of workers,
+     the DistributedSampler will duplicate a subset of the images across workers to ensure an equal
+     distribution of data among all processes. This duplication of images will cause
+     the distributed training integration test to fail.
+     """
+     NUM_IMAGES = 36
+     IMAGE_SIZE = 100
+     return _create_image_directory_fixture(NUM_IMAGES, IMAGE_SIZE, request.node.name)
+
+
+ def _create_image_directory_fixture(num_image: int, image_size: int, node_name: str):
+     """Create a bucket/prefix fixture that contains a directory of random JPG image files."""
+     fixture = get_test_bucket_prefix(f"{node_name}/image_directory")
+     for i in range(num_image):
+         data = np.random.randint(0, 256, image_size * image_size * 3, np.uint8)
+         data = data.reshape(image_size, image_size, 3)
          image = Image.fromarray(data, "RGB")
          image_bytes = io.BytesIO()
          image.save(image_bytes, "jpeg")
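
The reason get_data_snapshot exists is that the boto3 client held by BucketPrefixFixture cannot be pickled, so a test that spawns extra processes can pass the plain-data snapshot instead. A sketch of that pattern follows; the test and worker function below are hypothetical, not code from this commit.
```
# Hypothetical usage of get_data_snapshot; `image_directory_for_dp` is the fixture above.
import torch.multiprocessing as mp

def test_distributed_dataloading(image_directory_for_dp):
    # The snapshot carries region/bucket/prefix/contents but no boto3 client,
    # so it can be pickled and handed to spawned processes.
    snapshot = image_directory_for_dp.get_data_snapshot()
    mp.spawn(_dataloading_worker, args=(snapshot,), nprocs=2, join=True)

def _dataloading_worker(rank, snapshot):
    expected_keys = set(snapshot)  # BucketPrefixData.__iter__ yields the uploaded keys
    uri = snapshot.s3_uri          # prefix URI to build a dataset from
    ...                            # build a dataset for this rank and assert coverage
```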
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
+ # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ # // SPDX-License-Identifier: BSD
+
+ import platform
+ import torch
+ from s3torchconnector import S3Reader
+
+ from typing import Tuple, List
+
+
+ def _get_fork_methods() -> List[str]:
+     """Get a list of valid start methods for PyTorch's multiprocessing.
+     On macOS, the 'fork' and 'forkserver' start methods are known to crash,
+     despite being reported as usable by PyTorch. This function filters out
+     those methods for macOS systems.
+
+     Returns:
+         List[str]: A list of valid start methods for the current platform.
+     """
+     methods = set(torch.multiprocessing.get_all_start_methods())
+
+     if platform.system() == "Darwin":
+         # fork and forkserver crash on macOS, even though they are reported as usable.
+         # https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods
+         # https://bugs.python.org/issue?@action=redirect&bpo=33725
+         methods -= {"fork", "forkserver"}
+     return list(methods)
+
+
+ def _set_start_method(start_method: str):
+     torch.multiprocessing.set_start_method(start_method, force=True)
+
+
+ def _read_data(s3reader: S3Reader) -> Tuple[str, bytes]:
+     return s3reader.key, s3reader.read()
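
For context, a sketch of how these helpers might be exercised in an end-to-end test (an assumed shape, not shown in this commit): _get_fork_methods feeds a pytest parametrization, _set_start_method configures multiprocessing before the DataLoader starts its workers, and _read_data serves as a picklable transform.
```
# Hypothetical test sketch built on the helpers above; the image_directory
# fixture and the assertion are assumptions, not code from this commit.
import pytest
from torch.utils.data import DataLoader
from s3torchconnector import S3IterableDataset

@pytest.mark.parametrize("start_method", _get_fork_methods())
def test_sharded_iteration(start_method, image_directory):
    _set_start_method(start_method)  # force fork/spawn/forkserver per parameter
    dataset = S3IterableDataset.from_prefix(
        image_directory.s3_uri,
        region=image_directory.region,
        enable_sharding=True,
        transform=_read_data,  # (key, bytes) tuples are easy to collect
    )
    loader = DataLoader(dataset, num_workers=2, batch_size=None)
    keys = {key for key, _ in loader}
    # With sharding enabled, the two workers together should cover every object exactly once.
    assert keys == set(image_directory.contents)
```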
