
Commit 79b5c64

Update Llama Nemotron documentation (#771)
* update llama nemotron docs
* add cd cmd
* update blocksize and n workers
* revert n workers

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
1 parent ecd5c57 commit 79b5c64

3 files changed: +25 -6 lines changed

tutorials/README.md

Lines changed: 9 additions & 3 deletions
@@ -16,14 +16,20 @@ To get started, we recommend starting with the following tutorials to become fam
 
 | Tutorial | Description | Additional Resources |
 | --- | --- | --- |
-| [pretraining-data-curation](./pretraining-data-curation/) | Demonstrates accelerated pipeline for curating large-scale data for LLM pretraining in a distributed environment | |
-| [pretraining-vietnamese-data-curation](./pretraining-vietnamese-data-curation/) | Demonstrates how to use NeMo Curator to process large-scale and high-quality Vietnamese data in a distributed environment | |
+| [bitext_cleaning](./bitext_cleaning/) | Highlights several bitext-specific functionalities within NeMo Curator's API | |
+| [curator-llm-pii](./curator-llm-pii/) | Demonstrates how to use NVIDIA's NeMo Curator library to modify text data containing Personally Identifiable Information (PII) using large language models (LLMs) | |
 | [dapt-curation](./dapt-curation) | Data curation sample for domain-adaptive pre-training (DAPT), focusing on [ChipNeMo](https://blogs.nvidia.com/blog/llm-semiconductors-chip-nemo/) data curation as an example | [Blog post](https://developer.nvidia.com/blog/streamlining-data-processing-for-domain-adaptive-pretraining-with-nvidia-nemo-curator/) |
 | [distributed_data_classification](./distributed_data_classification) | Demonstrates machine learning classification with NVIDIA's Hugging Face models at scale in a distributed environment | |
-| [nemotron_340B_synthetic_datagen](./nemotron_340B_synthetic_datagen) | Demonstrates the use of NeMo Curator synthetic data generation modules to leverage [Nemotron-4 340B Instruct](https://build.nvidia.com/nvidia/nemotron-4-340b-instruct) for generating synthetic preference data | |
+| [image-curation](./image-curation/) | Explores all of the functionality that NeMo Curator has for image dataset curation | |
+| [llama-nemotron-data-curation](./llama-nemotron-data-curation/) | Demonstrates how a user can process a subset of the Llama Nemotron dataset using NeMo Curator | |
+| [multimodal_dapt_curation](./multimodal_dapt_curation/) | Covers multimodal extraction and data curation for domain-adaptive pre-training (DAPT) | |
 | [nemo-retriever-synthetic-data-generation](./nemo_retriever_synthetic_data_generation) | Demonstrates the use of NeMo Curator synthetic data generation modules to leverage [NIM models](https://ai.nvidia.com) for generating synthetic data and perform data quality assessment on generated data using LLM-as-judge and embedding-model-as-judge. The generated data would be used to evaluate retrieval/RAG pipelines |
+| [nemotron_340B_synthetic_datagen](./nemotron_340B_synthetic_datagen) | Demonstrates the use of NeMo Curator synthetic data generation modules to leverage [Nemotron-4 340B Instruct](https://build.nvidia.com/nvidia/nemotron-4-340b-instruct) for generating synthetic preference data | |
+| [nemotron-cc](./nemotron-cc/) | How to use NeMo Curator to build the data curation pipeline used to create the Nemotron-CC dataset | [Blog post](https://developer.nvidia.com/blog/building-nemotron-cc-a-high-quality-trillion-token-dataset-for-llm-pretraining-from-common-crawl-using-nvidia-nemo-curator/) |
 | [peft-curation](./peft-curation/) | Data curation sample for parameter efficient fine-tuning (PEFT) use-cases | [Blog post](https://developer.nvidia.com/blog/curating-custom-datasets-for-llm-parameter-efficient-fine-tuning-with-nvidia-nemo-curator/) |
 | [peft-curation-with-sdg](./peft-curation/) | Demonstrates a pipeline to leverage external models such as [Nemotron-4 340B Instruct](https://build.nvidia.com/nvidia/nemotron-4-340b-instruct) for synthetic data generation, data quality annotation via [Nemotron-4 340B Reward](https://build.nvidia.com/nvidia/nemotron-4-340b-reward), as well as other data processing steps (semantic deduplication, HTML tag removal, etc.) for parameter efficient fine-tuning (PEFT) use-cases | [Use this data to fine-tune your own model](https://github.com/NVIDIA/NeMo/blob/main/tutorials/llm/llama-3/sdg-law-title-generation/llama3-sdg-lora-nemofw.ipynb) |
+| [pretraining-data-curation](./pretraining-data-curation/) | Demonstrates accelerated pipeline for curating large-scale data for LLM pretraining in a distributed environment | |
+| [pretraining-vietnamese-data-curation](./pretraining-vietnamese-data-curation/) | Demonstrates how to use NeMo Curator to process large-scale and high-quality Vietnamese data in a distributed environment | |
 | [single_node_tutorial](./single_node_tutorial) | A comprehensive example to demonstrate running various NeMo Curator functionalities locally | |
 | [synthetic-data-hello-world](./synthetic-data-hello-world) | An introductory example of synthetic data generation using NeMo Curator | |
 | [synthetic-preference-data](./synthetic-preference-data) | Demonstrates the use of NeMo Curator synthetic data generation modules to leverage [LLaMa 3.1 405B Instruct](https://build.nvidia.com/meta/llama-3_1-405b-instruct) for generating synthetic preference data |

tutorials/llama-nemotron-data-curation/README.md

Lines changed: 14 additions & 1 deletion
@@ -111,6 +111,19 @@ If you are interested in counting and displaying the number of rows after each s
 
 ## Debugging Out of Memory Errors
 
-If you are running into out of memory (OOM) errors, there are a couple of approaches you can try. One is to avoid very large partitions of data. By default, the JSONL data is read with a blocksize of 256 MB per Dask partition. To customize the file reading logic, the user may specify `--json-blocksize "1gb"` with any string representation for the partition size (e.g., "1gb", "256mb"). Alternatively, the user may specify `--json-files-per-partition 2` with any integer to represent the number of JSONL files per Dask partition. Please note that either the blocksize or files per partition can be specified, but not both. For GPU workflows, a good general rule of thumb is to set the blocksize to 1/32 of the total GPU memory. In general, a blocksize between 100 MB and 1 GB is considered ideal.
+If you are running into out of memory (OOM) errors, there are a couple of approaches you can try. One is to avoid very large partitions of data. By default, the JSONL data is read with a blocksize of 100 MB per Dask partition. To customize the file reading logic, the user may specify `--json-blocksize "1gb"` with any string representation for the partition size (e.g., "1gb", "256mb"). Alternatively, the user may specify `--json-files-per-partition 2` with any integer to represent the number of JSONL files per Dask partition. Please note that either the blocksize or files per partition can be specified, but not both. For GPU workflows, a good general rule of thumb is to set the blocksize to 1/32 of the total GPU memory. In general, a blocksize between 100 MB and 1 GB is considered ideal.
 
 You may also encounter errors about Dask workers unexpectedly shutting down. To help mitigate this, consider lowering the `--n-workers` parameter. By default, we set the number of Dask workers equal to the number of CPU cores. It may be helpful to set `--n-workers` to half or a fourth of the number of CPU cores and possibly reduce the number from there. For example, if `lscpu` shows `CPU(s): 96`, then setting `--n-workers 48` or `--n-workers 24` may help optimize performance while avoiding memory issues. In the example bash script, we set `--n-workers 4` as a safe option to help avoid errors.
+
+## Next Steps
+
+To see how to train a reasoning model with the resulting dataset, please refer to this NeMo tutorial: [Train Your Own Reasoning Model in 48 Hours on a Single GPU](https://github.com/NVIDIA/NeMo/tree/main/tutorials/llm/reasoning).
+
+Before running the NeMo tutorial, you should combine all of the resulting JSONL files from this tutorial into a single file called `training.jsonl`. To do this, you can navigate to the output directory and then combine all of the JSONL files:
+
+```bash
+cd /path/to/curated-data
+find . -name "*.jsonl" -exec cat {} + | sed '/^$/d' > training.jsonl
+```
+
+Please note that the above command contains some additional logic to help ignore any empty JSONL files, which may have resulted from the filtering done by this tutorial.
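
The new "Next Steps" section ends with a `find`/`sed` pipeline that concatenates the curated JSONL shards into `training.jsonl`. As an illustrative follow-up (not part of this commit), a few quick checks can confirm the combined file looks reasonable before handing it to the NeMo tutorial; the directory below reuses the placeholder path from the snippet above.

```bash
# Illustrative sanity checks on the combined file (not part of this commit).
cd /path/to/curated-data

# One JSON object per line, so this is the number of training records.
wc -l training.jsonl

# The sed step should have removed all blank lines; expect a count of 0.
# (grep exits non-zero when there are no matches, hence the `|| true`.)
grep -c '^$' training.jsonl || true

# Peek at the first record.
head -n 1 training.jsonl
```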

tutorials/llama-nemotron-data-curation/main.py

Lines changed: 2 additions & 2 deletions
@@ -424,9 +424,9 @@ def main(args: argparse.Namespace) -> None: # noqa: C901, PLR0915
     # Filter out files that don't contain any of the provided substrings
     input_files = [filename for filename in input_files if any(s in filename for s in args.filename_filter)]
 
-    # If neither is set, set the default blocksize to 1GB
+    # If neither is set, set the default blocksize to 100MB
     if args.json_blocksize is None and args.json_files_per_partition is None:
-        args.json_blocksize = "256mb"
+        args.json_blocksize = "100mb"
 
     dataset = DocumentDataset.read_json(
         input_files, blocksize=args.json_blocksize, files_per_partition=args.json_files_per_partition
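
With this change, `main.py` falls back to a `"100mb"` blocksize whenever neither reading option is given on the command line. For readers who want to override that default, the sketch below shows hypothetical invocations using the flags documented in the tutorial README; the values are placeholders, and the tutorial's other required arguments are deliberately omitted rather than guessed.

```bash
# Hypothetical tuning sketch (values and omissions are illustrative, not from this commit).

# Override the partition size directly; --json-blocksize and
# --json-files-per-partition are mutually exclusive.
python main.py \
    --json-blocksize "512mb" \
    --n-workers 24
    # ...plus the tutorial's remaining arguments

# Or group a fixed number of JSONL files into each Dask partition instead.
python main.py \
    --json-files-per-partition 2 \
    --n-workers 24
    # ...plus the tutorial's remaining arguments
```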
