`docs/about/release-notes/migration-guide.md` (1 addition, 1 deletion)
This guide explains how to transition existing Dask-based NeMo Curator workflows to the new Ray-based pipeline architecture.

```{seealso}
For broader NeMo Framework migration topics, refer to the [NeMo Framework 2.0 Migration Guide](https://docs.nvidia.com/nemo-framework/user-guide/25.11/nemo-2.0/migration/index.html).
```
For large datasets, consider these performance optimizations:
::::{tab-set}

:::{tab-item} XennaExecutor (Default)

`XennaExecutor` is the default executor, optimized for streaming workloads. You can customize its configuration or use the defaults:

```python
from nemo_curator.backends.xenna import XennaExecutor

# Custom configuration for streaming processing
executor = XennaExecutor(config={
    "execution_mode": "streaming",
    "cpu_allocation_percentage": 0.95,
    "logging_interval": 60
})
results = pipeline.run(executor)
```

If no executor is specified, `pipeline.run()` uses `XennaExecutor` with default settings.
:::

:::{tab-item} RayDataExecutor (Experimental)

`RayDataExecutor` provides distributed processing using Ray Data. It has shown performance improvements for filtering workloads compared to the default executor.

```python
from nemo_curator.backends.experimental.ray_data import RayDataExecutor

executor = RayDataExecutor(
    config={"ignore_failures": False},
    ignore_head_node=True,  # Exclude head node from computation
)
```

:::
::::