# Changelog

## NVIDIA NeMo Curator 0.9.0

### Major Features and Enhancements

- New How-to Data Recipes (Tutorials)
  - Multimodal DAPT Curation with PDF Extraction
  - Llama Nemotron Data Curation
  - LLM NIM - PII Redaction
- Performance and Code Optimizations
  - Simplified Clustering Logic: Significantly improved semantic deduplication clustering performance
  - Removed convoluted backend switching logic that caused performance issues
  - Eliminated expensive length assertions that could cause timeouts on large datasets
  - Improved GPU utilization during KMeans clustering operations
  - Tested on a 37M-embedding dataset (80 GB) across 7 GPUs with substantial performance gains

### Bug Fixes

- FastText Download URL Fix
  - Corrected the `fasttext` model download URL in the nemotron-cc tutorial
  - Changed from `dl.fbaipublicfiles.com/fastText/` to `dl.fbaipublicfiles.com/fasttext/`
  - Ensures reliable model downloads for language identification
- NeMo Retriever Tutorial Bug Fix
  - Fixed lambda function bug in `RetrieverEvalSetGenerator`
  - Corrected score assignment from `df["question"].apply(lambda: 1)` to `df["score"] = 1`
- API Usage Updates
  - Updated examples and tutorials to use the correct `DocumentDataset` API
  - Replaced deprecated `write_to_disk(result, output_dir, output_type="parquet")` with `result.to_parquet(output_dir)`
  - Updated exact deduplication workflows: `deduplicator.remove()` now returns a `DocumentDataset` directly

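The lambda fix in `RetrieverEvalSetGenerator` listed under Bug Fixes can be illustrated with a minimal sketch. It assumes only that an `apply`-style call invokes the supplied function once per value with that value as its argument, which is how `pandas.Series.apply` behaves. The helper `apply_to_each`, the sample questions, and the dictionary standing in for a DataFrame are hypothetical and are not part of NeMo Curator or pandas.

```python
def apply_to_each(values, func):
    """Hypothetical stand-in for an apply-style call: invoke func once per value."""
    return [func(value) for value in values]


# Illustrative sample data, not taken from the tutorial.
questions = ["What is data curation?", "What is deduplication?"]

# Buggy form: the lambda takes no parameters but is called with one argument,
# so Python raises a TypeError on the first element.
try:
    apply_to_each(questions, lambda: 1)
    buggy_error = None
except TypeError as exc:
    buggy_error = exc

# A lambda that accepts (and ignores) the value would work ...
scores_via_lambda = apply_to_each(questions, lambda _question: 1)

# ... but for a constant column, direct assignment is simpler.
# This mirrors the corrected `df["score"] = 1`, using a dict as a stand-in table.
table = {"question": questions}
table["score"] = [1] * len(questions)
```

Direct assignment avoids a per-row function call entirely, which is why it is the simpler choice for a constant score.
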
## NVIDIA NeMo Curator 0.8.0

- Llama Based PII Redaction