
Commit 2d90b4b

Update user-facing variable names and add override checks for name, resources, and batch_size (#1223)
* Add override checks for `name`, `resources`, and `batch_size`
* Add pytests
* Rename `name`, `resources`, and `batch_size` to be user-facing
* Fix some tests
* Small update
* Revert DocumentFilter changes
* Revert DocumentModifier changes
* Update GPU test
* Add Abhinav's suggestion

Signed-off-by: Sarah Yurick <[email protected]>
1 parent cebf111 commit 2d90b4b

File tree

110 files changed: +380, −273 lines


api-design.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -110,11 +110,11 @@ class ProcessingStage(ABC, Generic[X, Y], metaclass=StageMeta):
 
     @property
     @abstractmethod
-    def name(self) -> str:
+    def _name(self) -> str:
         """Unique name for this stage."""

     @property
-    def resources(self) -> Resources:
+    def _resources(self) -> Resources:
         """Resource requirements for this stage."""
         return Resources(cpus=1.0)
```
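The diff above turns `name` and `resources` into internal `_name`/`_resources` defaults so the user-facing spellings stay plain attributes. A minimal sketch of how such an override check might work — the class names, default values, and `__init_subclass__` logic here are illustrative assumptions, not the library's actual implementation:

```python
class Resources:
    """Illustrative stand-in for the library's Resources type."""

    def __init__(self, cpus: float = 1.0):
        self.cpus = cpus


class ProcessingStage:
    # User-facing fields: plain class attributes, overridable by assignment.
    name: str = "processing_stage"
    resources: Resources = Resources(cpus=1.0)

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Override check: the user-facing fields must remain plain
        # attributes; redefining them as properties would silently
        # break `stage.name = ...` style configuration.
        for attr in ("name", "resources"):
            if isinstance(cls.__dict__.get(attr), property):
                raise TypeError(
                    f"{cls.__name__}: override '{attr}' as a class "
                    "attribute, not a property"
                )


class GoodStage(ProcessingStage):
    name = "good_stage"
    resources = Resources(cpus=2.0)
```

Defining a subclass with `name` as a `@property` raises `TypeError` at class-creation time, which is the kind of early failure the override checks in this commit aim for.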
docs/about/concepts/image/data-loading-concepts.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -68,7 +68,7 @@ pipeline.add_stage(FilePartitioningStage(
 
 # Load images with DALI
 pipeline.add_stage(ImageReaderStage(
-    task_batch_size=100,
+    batch_size=100,
     verbose=True,
     num_threads=8,
     num_gpus_per_worker=0.25,
```
docs/about/concepts/video/abstractions.md

Lines changed: 2 additions & 5 deletions
````diff
@@ -58,11 +58,8 @@ Composite stages provide a user-facing convenience API and decompose into one or
 
 ```python
 class MyStage(ProcessingStage[X, Y]):
-    @property
-    def name(self) -> str: ...
-
-    @property
-    def resources(self) -> Resources: ...
+    name: str = "..."
+    resources: Resources = Resources(...)

     def inputs(self) -> tuple[list[str], list[str]]: ...
     def outputs(self) -> tuple[list[str], list[str]]: ...
````

docs/about/release-notes/migration-guide.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -247,7 +247,7 @@ In the new version, data loading is encapsulated in a dedicated pipeline stage (
 
 ```python
 # New: Read images from webdataset tar files
 read_stage = ImageReaderStage(
-    task_batch_size=args.task_batch_size,
+    batch_size=args.batch_size,
     num_threads=16,
     num_gpus_per_worker=0.25,
 )
````
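For callers migrating from the old keyword, a thin compatibility shim could map `task_batch_size` to `batch_size` with a deprecation warning. This wrapper is a sketch for illustration only — it is not part of the library, and the warning text is an assumption:

```python
import warnings


def make_reader_kwargs(**kwargs) -> dict:
    """Translate the deprecated `task_batch_size` keyword to `batch_size`."""
    if "task_batch_size" in kwargs:
        warnings.warn(
            "`task_batch_size` is deprecated; use `batch_size` instead",
            DeprecationWarning,
            stacklevel=2,
        )
        # Move the value over to the new, user-facing name.
        kwargs["batch_size"] = kwargs.pop("task_batch_size")
    return kwargs


# Usage: ImageReaderStage(**make_reader_kwargs(task_batch_size=100, num_threads=16))
```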

docs/curate-images/load-data/tar-archives.md

Lines changed: 6 additions & 6 deletions
````diff
@@ -63,7 +63,7 @@ pipeline.add_stage(FilePartitioningStage(
 
 # Stage 2: Read JPEG images from tar files using DALI
 pipeline.add_stage(ImageReaderStage(
-    task_batch_size=100,
+    batch_size=100,
     verbose=True,
     num_threads=16,
     num_gpus_per_worker=0.25,
@@ -77,7 +77,7 @@ results = pipeline.run()
 
 - `file_paths`: Path to directory containing tar files
 - `files_per_partition`: Number of tar files to process per partition (controls parallelism)
-- `task_batch_size`: Number of images per ImageBatch for processing
+- `batch_size`: Number of images per ImageBatch for processing

 ---

@@ -152,7 +152,7 @@ The `ImageReaderStage` is the core component that handles tar archive loading wi
   - Type
   - Default
   - Description
-* - `task_batch_size`
+* - `batch_size`
   - int
   - 100
   - Number of images per ImageBatch for processing
@@ -205,7 +205,7 @@ ImageObject(
 
 ```python
 # Optimal configuration for GPU acceleration
 pipeline.add_stage(ImageReaderStage(
-    task_batch_size=256,  # Larger batches for GPU throughput
+    batch_size=256,  # Larger batches for GPU throughput
     num_threads=16,  # More threads for I/O parallelism
     num_gpus_per_worker=0.5,  # Allocate more GPU memory
     verbose=True,
@@ -217,7 +217,7 @@ pipeline.add_stage(ImageReaderStage(
 
 ```python
 # Optimized for CPU decoding
 pipeline.add_stage(ImageReaderStage(
-    task_batch_size=64,  # Smaller batches to avoid memory pressure
+    batch_size=64,  # Smaller batches to avoid memory pressure
     num_threads=8,  # Fewer threads for CPU processing
     num_gpus_per_worker=0,  # No GPU allocation
     verbose=True,
@@ -228,7 +228,7 @@ pipeline.add_stage(ImageReaderStage(
 
 - **GPU Acceleration**: Use a GPU-enabled environment for optimal performance. The stage automatically detects CUDA availability and uses GPU decoding when possible.
 - **Parallelism Control**: Adjust `files_per_partition` to control how many tar files are processed together. Lower values increase parallelism but may increase overhead.
-- **Batch Size Tuning**: Increase `task_batch_size` for better throughput, but ensure sufficient memory is available.
+- **Batch Size Tuning**: Increase `batch_size` for better throughput, but ensure sufficient memory is available.
 - **Thread Configuration**: Adjust `num_threads` for I/O operations based on your storage system's characteristics.

 ---
````
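The batch-size tuning advice in the tar-archives doc can be made concrete with a back-of-envelope memory estimate. The helper below is purely illustrative arithmetic (the function name, the 50% safety margin, and the uint8 RGB assumption are all assumptions, not library behavior):

```python
def max_batch_size(mem_bytes: int, height: int, width: int,
                   channels: int = 3, dtype_bytes: int = 1,
                   safety: float = 0.5) -> int:
    """Estimate how many decoded images fit in mem_bytes,
    reserving a safety margin for intermediate decode buffers."""
    per_image = height * width * channels * dtype_bytes
    return max(1, int(mem_bytes * safety) // per_image)


# e.g. 8 GiB of GPU memory, 1080p RGB uint8 images:
print(max_batch_size(8 * 1024**3, 1080, 1920))  # → 690
```

Estimates like this only bound the decoded-image buffers; actual headroom also depends on the decoder and any downstream stages holding the same batch.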

docs/curate-images/process-data/embeddings/clip-embedder.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -45,7 +45,7 @@ pipeline.add_stage(FilePartitioningStage(
 
 # Stage 2: Read images
 pipeline.add_stage(ImageReaderStage(
-    task_batch_size=100,
+    batch_size=100,
    num_gpus_per_worker=0.25,
 ))
```

docs/curate-images/process-data/filters/aesthetic.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -51,7 +51,7 @@ pipeline.add_stage(FilePartitioningStage(
 
 # Stage 2: Read images
 pipeline.add_stage(ImageReaderStage(
-    task_batch_size=100,
+    batch_size=100,
    num_gpus_per_worker=0.25,
 ))
```

docs/curate-images/process-data/filters/nsfw.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -51,7 +51,7 @@ pipeline.add_stage(FilePartitioningStage(
 
 # Stage 2: Read images
 pipeline.add_stage(ImageReaderStage(
-    task_batch_size=100,
+    batch_size=100,
    num_gpus_per_worker=0.25,
 ))
```

docs/curate-images/save-export.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -86,7 +86,7 @@ pipeline.add_stage(FilePartitioningStage(
 ))
 
 pipeline.add_stage(ImageReaderStage(
-    task_batch_size=100,
+    batch_size=100,
    num_threads=16,
    num_gpus_per_worker=0.25,
 ))
```

docs/curate-images/tutorials/beginner.md

Lines changed: 2 additions & 2 deletions
````diff
@@ -105,7 +105,7 @@ Load images from tar archives and extract metadata.
 
 ```python
 pipeline.add_stage(ImageReaderStage(
-    task_batch_size=100,  # Images per batch
+    batch_size=100,  # Images per batch
     verbose=True,
     num_threads=16,  # I/O threads
     num_gpus_per_worker=0.25,
@@ -216,7 +216,7 @@ def create_image_curation_pipeline():
 ))
 
 pipeline.add_stage(ImageReaderStage(
-    task_batch_size=100,
+    batch_size=100,
     verbose=True,
     num_threads=16,
     num_gpus_per_worker=0.25,
````
