When `enable_sharding` is set to `True`, the dataset is automatically sharded across all available workers.
+This sharding mechanism supports both parallel training on a single host and distributed training across multiple hosts.
+Each worker, regardless of its host, will load and process a distinct subset of the dataset.
+### S3MapDataset
+
+For the S3MapDataset, you need to pass it to DataLoader along with a [DistributedSampler](https://pytorch.org/docs/stable/data.html#torch.utils.data.distributed.DistributedSampler) wrapped around it.
+The DistributedSampler ensures that each worker or node receives a unique subset of the dataset,
+enabling efficient parallel and distributed training.
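The partitioning described for `DistributedSampler` can be illustrated without a cluster. The block below is a simplified pure-Python model of how the sampler splits indices when shuffling is disabled and `drop_last` is off. It is a sketch, not PyTorch's code: the helper name `distributed_indices` is hypothetical, and the dataset size and replica count are arbitrary illustration values.

```python
import math
from typing import List


def distributed_indices(dataset_len: int, num_replicas: int, rank: int) -> List[int]:
    """Hypothetical helper: models DistributedSampler with shuffle=False, drop_last=False."""
    if dataset_len <= 0:
        raise ValueError("dataset_len must be positive")
    if not 0 <= rank < num_replicas:
        raise ValueError("rank must be in [0, num_replicas)")

    # Every replica receives the same number of samples.
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas

    indices = list(range(dataset_len))
    # Pad by wrapping around to the start so the list divides evenly.
    padding = total_size - len(indices)
    if padding > 0:
        repeats = math.ceil(padding / len(indices))
        indices += (indices * repeats)[:padding]

    # Each rank takes every num_replicas-th index, starting at its own rank.
    return indices[rank:total_size:num_replicas]


# Illustration values (assumptions): 10 samples split across 4 replicas.
shards = [distributed_indices(10, 4, rank) for rank in range(4)]
for rank, shard in enumerate(shards):
    print(f"rank {rank}: {shard}")
```

When the dataset size is not divisible by the number of replicas, the padding step repeats a few samples, so the subsets are equal in size but not strictly disjoint. In real training code the wiring is typically `DataLoader(dataset, sampler=DistributedSampler(dataset))`, with `sampler.set_epoch(epoch)` called each epoch when shuffling is enabled.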
         """Returns an instance of S3IterableDataset using the S3 URI(s) provided.

@@ -66,6 +76,7 @@ def from_objects(
             endpoint(str): AWS endpoint of the S3 bucket where the objects are stored.
             transform: Optional callable which is used to transform an S3Reader into the desired type.
             s3client_config: Optional S3ClientConfig with parameters for S3 client.
+            enable_sharding: If True, shard the dataset across multiple workers for parallel data loading. If False (default), each worker loads the entire dataset independently.

         Returns:
             S3IterableDataset: An IterableStyle dataset created from S3 objects.
@@ -80,6 +91,7 @@ def from_objects(
             endpoint,
             transform=transform,
             s3client_config=s3client_config,
+            enable_sharding=enable_sharding,
         )

     @classmethod
@@ -91,6 +103,7 @@ def from_prefix(
         endpoint: Optional[str] = None,
         transform: Callable[[S3Reader], Any] = identity,
         s3client_config: Optional[S3ClientConfig] = None,
+        enable_sharding: bool = False,
     ):
         """Returns an instance of S3IterableDataset using the S3 URI provided.

@@ -100,6 +113,7 @@ def from_prefix(
             endpoint(str): AWS endpoint of the S3 bucket where the objects are stored.
             transform: Optional callable which is used to transform an S3Reader into the desired type.
             s3client_config: Optional S3ClientConfig with parameters for S3 client.
+            enable_sharding: If True, shard the dataset across multiple workers for parallel data loading. If False (default), each worker loads the entire dataset independently.

         Returns:
             S3IterableDataset: An IterableStyle dataset created from S3 objects.
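To make the documented difference between `enable_sharding=True` and `enable_sharding=False` concrete, here is a minimal pure-Python simulation. It is a sketch under stated assumptions: the round-robin assignment by global worker index is one plausible scheme and is not taken from this diff, and `worker_view`, `simulate`, and the `s3://example-bucket/...` keys are hypothetical names used only for illustration.

```python
from typing import Dict, Iterator, List, Sequence, Tuple


def worker_view(
    objects: Sequence[str],
    *,
    rank: int,
    world_size: int,
    worker_id: int,
    num_workers: int,
    enable_sharding: bool = False,
) -> Iterator[str]:
    """Hypothetical model of what one DataLoader worker on one host would see."""
    if not enable_sharding:
        # Without sharding, every worker iterates over the entire dataset.
        yield from objects
        return

    # Assumption: workers are numbered globally across hosts and objects
    # are dealt out round-robin. The real library may assign differently.
    total_workers = world_size * num_workers
    global_worker_id = rank * num_workers + worker_id
    for index, obj in enumerate(objects):
        if index % total_workers == global_worker_id:
            yield obj


# Illustration values (assumptions): 8 objects, 2 hosts, 2 workers per host.
keys = [f"s3://example-bucket/data/{i}.bin" for i in range(8)]
WORLD_SIZE, NUM_WORKERS = 2, 2


def simulate(enable_sharding: bool) -> Dict[Tuple[int, int], List[str]]:
    """Collect the objects seen by every (rank, worker_id) pair."""
    return {
        (rank, worker_id): list(
            worker_view(
                keys,
                rank=rank,
                world_size=WORLD_SIZE,
                worker_id=worker_id,
                num_workers=NUM_WORKERS,
                enable_sharding=enable_sharding,
            )
        )
        for rank in range(WORLD_SIZE)
        for worker_id in range(NUM_WORKERS)
    }


sharded = simulate(True)
unsharded = simulate(False)
print("sharded objects per worker:", {k: len(v) for k, v in sharded.items()})
print("unsharded objects per worker:", {k: len(v) for k, v in unsharded.items()})
```

In this model the sharded run covers each object exactly once across all four workers, while the unsharded run has every worker reading all eight objects, so a training loop would see each sample four times per epoch.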