docs/about/release-notes/migration-faq.md
Lines changed: 8 additions & 20 deletions
@@ -122,19 +122,6 @@ A set of models assigns a quality score (for example, 0–20), bucketed into hig
 Each task is a data class. You can add whatever statistics you need (input or output counts, tokens dropped, and so on) within stage logic for detailed reporting or logging.
 ```
 
-```{dropdown} Does the new deduplication feature support global deduplication (across all snapshots, not just incremental)?
-
-Yes, NeMo Curator supports {ref}`global deduplication <text-process-data-dedup>` by processing multiple data sources together in a single pass. You can provide files from multiple snapshots (such as different Common Crawl releases) as input, and deduplication will identify duplicates across all of them.
-
-Three GPU-accelerated approaches are available:
-
-- **Exact deduplication**: MD5 hashing for identical documents (unlimited scale)
-- **Fuzzy deduplication**: MinHash with LSH for near-duplicates (petabyte-scale)
-- **Semantic deduplication**: Embedding-based similarity for meaning-based duplicates (terabyte-scale)
-
-For comprehensive documentation, refer to {ref}`Deduplication Concepts <about-concepts-deduplication>`.
-```
-
 ---
 
 ## Fault Tolerance, Checkpointing, and Observability
@@ -168,11 +155,6 @@ Yes, by design. Add new stages or modify process functions to integrate custom l
 Contributions are strongly encouraged. Submit pull requests or join community discussions to help expand NeMo Curator's capabilities for diverse regions and languages.
 ```
 
-```{dropdown} How do I manage different dependencies for separate stages, especially if packaging as Docker images?
-
-Each stage can specify its Conda environment, which must be present in the Docker image. Import your dependencies within stage logic to ensure proper isolation.
-```
-
 ---
 
 ## Deployment, Infrastructure, and Practicalities
@@ -217,7 +199,13 @@ Yes. NeMo Curator supports multiple data modalities including {ref}`text <gs-tex
 
 ```{dropdown} Where can I get support, ask questions, or contribute?
 
-Support and discussion channels are available on Slack and [GitHub](https://github.com/NVIDIA-NeMo/Curator). If you build innovative filters or features for your locale, please engage and contribute back. Regular calls and community check-ins are offered.
+All support and community engagement happens on [GitHub](https://github.com/NVIDIA-NeMo/Curator). We encourage you to:
+
+- **Open an issue** for bugs, feature requests, or questions
+- **Start a discussion** to share ideas or ask for guidance
+- **Submit a pull request** if you build innovative filters or features
+
+Regular community calls and check-ins are also offered to connect with the team and other users.
 ```
 
 ---
@@ -231,5 +219,5 @@ Support and discussion channels are available on Slack and [GitHub](https://gith
 ```
 
 ```{tip}
-If you find something missing or want to share a best practice or feature, join the NeMo Curator community or submit an issue or pull request on GitHub.
+If you find something missing or want to share a best practice or feature, please [open an issue](https://github.com/NVIDIA-NeMo/Curator/issues) or [submit a pull request](https://github.com/NVIDIA-NeMo/Curator/pulls) on GitHub.