@@ -116,7 +116,7 @@ New API for tracking and analyzing pipeline execution:
 
 ## Bug Fixes
 
-- Fixed fasttext predict call compatibility with numpy>2
+- Fixed fasttext predict call compatibility with numpy>2
 - Fixed broken NeMo Framework documentation links
 - Fixed MegatronTokenizerWriter to download only necessary tokenizer files
 - Fixed ID generator blocking issues for large-scale processing
@@ -147,7 +147,6 @@ New API for tracking and analyzing pipeline execution:
 - **Memory Management**: New guidance for handling CPU/GPU memory constraints
 - **AWS Integration**: Updated tutorials with correct AWS credentials setup
 
-
 ---
 
 ## What's Next
@@ -56,7 +56,7 @@ workflow = SemanticDeduplicationWorkflow(
     n_clusters=1000,
     id_field="id",
     embedding_field="embedding",
-    embedding_dim=512,  # 512 for InternVideo2, varies for Cosmos-Embed1
+    embedding_dim=768,  # Embedding dimension (768 for Cosmos-Embed1, varies by model)
     input_filetype="parquet",
     eps=0.1,  # Similarity threshold: cosine_sim >= 1.0 - eps identifies duplicates
     ranking_strategy=RankingStrategy.metadata_based(
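
The `eps` comment in the hunk above states that a pair counts as a duplicate when `cosine_sim >= 1.0 - eps`. The following is a minimal sketch of that threshold check only, written in plain Python. It is not NeMo Curator's implementation and ignores the clustering implied by `n_clusters`; the helper names `cosine_similarity` and `is_duplicate` and the toy vectors are hypothetical, introduced purely for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_duplicate(a, b, eps=0.1):
    """Apply the rule from the diff comment: cosine_sim >= 1.0 - eps."""
    return cosine_similarity(a, b) >= 1.0 - eps


# Toy 3-dimensional embeddings (real embeddings would have embedding_dim entries).
clip_a = [1.0, 0.0, 0.0]
clip_b = [0.95, 0.05, 0.0]  # nearly the same direction as clip_a
clip_c = [0.0, 1.0, 0.0]    # orthogonal to clip_a

print(is_duplicate(clip_a, clip_b))  # similarity is close to 1, above the 0.9 threshold
print(is_duplicate(clip_a, clip_c))  # similarity is 0, below the 0.9 threshold
```

Under this rule, a smaller `eps` raises the threshold `1.0 - eps`, so fewer pairs are treated as duplicates; a larger `eps` lowers it and flags more pairs.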
@@ -120,8 +120,16 @@ Here's a simple example to get started with NeMo Curator's image curation pipeli
 Image loading and decoding happens in CPU memory before GPU processing. If you encounter out-of-memory errors during the `ImageReaderStage`, reduce:
 - `batch_size`: Number of images per batch (reduce to 32-50 for systems with limited RAM)
 - `num_threads`: Parallel decoding threads (reduce to 4 for systems with limited RAM)
+- `num_cpus`: Ray Client CPU allocation (reduce to 8-16 for systems with limited RAM)
 
 The example below uses conservative defaults suitable for most systems. For high-memory systems, you can increase these values for better performance.
+
+To configure Ray with limited CPU resources:
+```python
+from nemo_curator.core.client import RayClient
+ray_client = RayClient(num_cpus=8)  # Adjust based on available CPU cores
+ray_client.start()
+```
 :::
 
 ```python
@@ -37,7 +37,7 @@ NeMo Curator provides official Docker containers with all dependencies pre-insta
 
 The primary container includes comprehensive support for all curation modalities:
 
-**Container registry:** `nvcr.io/nvidia/nemo-curator:26.02`
+**Container registry:** `nvcr.io/nvidia/nemo-curator:{{ container_version }}`
 
 **Supported modalities:**
 - ✅ Text curation (CPU/GPU)