Commit 4f9e02b

docs: changelog update (#860)
* docs: changelog update

Signed-off-by: Lawrence Lane <llane@nvidia.com>

* formatting

Signed-off-by: Lawrence Lane <llane@nvidia.com>

* remove item

Signed-off-by: Lawrence Lane <llane@nvidia.com>

---------

Signed-off-by: Lawrence Lane <llane@nvidia.com>
1 parent b164cf9 commit 4f9e02b

1 file changed, +29 -0 lines changed

CHANGELOG.md

Lines changed: 29 additions & 0 deletions
@@ -1,5 +1,34 @@
 # Changelog
 
+## NVIDIA NeMo Curator 0.9.0
+
+### Major Features and Enhancements
+
+- New How-to Data Recipes (Tutorials)
+  - Multimodal DAPT Curation w/ PDF Extraction
+  - Llama Nemotron Data Curation
+  - LLM NIM - PII Redaction
+- Performance and Code Optimizations
+  - Simplified Clustering Logic: Significantly improved semantic deduplication clustering performance
+    - Removed convoluted backend switching logic that caused performance issues
+    - Eliminated expensive length assertions that could cause timeouts on large datasets
+    - Improved GPU utilization during KMeans clustering operations
+    - Tested on 37M embedding dataset (80GB) across 7 GPUs with substantial performance gains
+
+### Bug Fixes
+
+- FastText Download URL Fix
+  - Corrected the `fasttext` model download URL in nemotron-cc tutorial
+  - Changed from `dl.fbaipublicfiles.com/fastText/` to `dl.fbaipublicfiles.com/fasttext/`
+  - Ensures reliable model downloads for language identification
+- NeMo Retriever Tutorial Bug Fix
+  - Fixed lambda function bug in `RetrieverEvalSetGenerator`
+  - Corrected score assignment from `df["question"].apply(lambda: 1)` to `df["score"] = 1`
+- API Usage Updates
+  - Updated examples and tutorials to use correct `DocumentDataset` API
+  - Replaced deprecated `write_to_disk(result, output_dir, output_type="parquet")` with `result.to_parquet(output_dir)`
+  - Updated exact deduplication workflows: `deduplicator.remove()` now returns `DocumentDataset` directly
+
 ## NVIDIA NeMo Curator 0.8.0
 
 - Llama Based PII Redaction
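
Note on the `RetrieverEvalSetGenerator` entry in this changelog: the old expression `df["question"].apply(lambda: 1)` fails because pandas `Series.apply` calls the function once per element and passes that element as an argument, which a zero-argument lambda cannot accept. The sketch below reproduces this in plain Python. `apply_elementwise` is a hypothetical stand-in for `Series.apply`, and the sample questions are made up for illustration; it is not the tutorial's actual code.

```python
def apply_elementwise(values, func):
    """Hypothetical stand-in for element-wise apply: call func(value) per element."""
    return [func(v) for v in values]


# Illustrative data only; not taken from the tutorial.
questions = ["What is data curation?", "What is deduplication?"]

# Buggy pattern: the lambda takes no parameters, so passing an element raises TypeError.
try:
    apply_elementwise(questions, lambda: 1)
    buggy_error = None
except TypeError as exc:
    buggy_error = exc

# Fixed pattern, analogous to `df["score"] = 1`: assign the constant directly
# to every row instead of routing it through apply.
scores = [1] * len(questions)

print(type(buggy_error).__name__)
print(scores)
```

Assigning the constant directly also avoids a per-row function call, which is why the changelog's replacement `df["score"] = 1` is the simpler form.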
