
Commit 79b5c64

Update Llama Nemotron documentation (#771)
* update llama nemotron docs
* add cd cmd
* update blocksize and n workers
* revert n workers

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
1 parent ecd5c57 commit 79b5c64

3 files changed: +25 -6 lines changed

tutorials/README.md

Lines changed: 9 additions & 3 deletions
@@ -16,14 +16,20 @@ To get started, we recommend starting with the following tutorials to become fam
 
 | Tutorial | Description | Additional Resources |
 | --- | --- | --- |
-| [pretraining-data-curation](./pretraining-data-curation/) | Demonstrates accelerated pipeline for curating large-scale data for LLM pretraining in a distributed environment | |
-| [pretraining-vietnamese-data-curation](./pretraining-vietnamese-data-curation/) | Demonstrates how to use NeMo Curator to process large-scale and high-quality Vietnamese data in a distributed environment | |
+| [bitext_cleaning](./bitext_cleaning/) | Highlights several bitext-specific functionalities within NeMo Curator's API | |
+| [curator-llm-pii](./curator-llm-pii/) | Demonstrates how to use NVIDIA's NeMo Curator library to modify text data containing Personally Identifiable Information (PII) using large language models (LLMs) | |
 | [dapt-curation](./dapt-curation) | Data curation sample for domain-adaptive pre-training (DAPT), focusing on [ChipNeMo](https://blogs.nvidia.com/blog/llm-semiconductors-chip-nemo/) data curation as an example | [Blog post](https://developer.nvidia.com/blog/streamlining-data-processing-for-domain-adaptive-pretraining-with-nvidia-nemo-curator/) |
 | [distributed_data_classification](./distributed_data_classification) | Demonstrates machine learning classification with NVIDIA's Hugging Face models at scale in a distributed environment | |
-| [nemotron_340B_synthetic_datagen](./nemotron_340B_synthetic_datagen) | Demonstrates the use of NeMo Curator synthetic data generation modules to leverage [Nemotron-4 340B Instruct](https://build.nvidia.com/nvidia/nemotron-4-340b-instruct) for generating synthetic preference data | |
+| [image-curation](./image-curation/) | Explores all of the functionality that NeMo Curator has for image dataset curation | |
+| [llama-nemotron-data-curation](./llama-nemotron-data-curation/) | Demonstrates how a user can process a subset of the Llama Nemotron dataset using NeMo Curator | |
+| [multimodal_dapt_curation](./multimodal_dapt_curation/) | Covers multimodal extraction and data curation for domain-adaptive pre-training (DAPT) | |
 | [nemo-retriever-synthetic-data-generation](./nemo_retriever_synthetic_data_generation) | Demonstrates the use of NeMo Curator synthetic data generation modules to leverage [NIM models](https://ai.nvidia.com) for generating synthetic data and perform data quality assessment on generated data using LLM-as-judge and embedding-model-as-judge. The generated data would be used to evaluate retrieval/RAG pipelines |
+| [nemotron_340B_synthetic_datagen](./nemotron_340B_synthetic_datagen) | Demonstrates the use of NeMo Curator synthetic data generation modules to leverage [Nemotron-4 340B Instruct](https://build.nvidia.com/nvidia/nemotron-4-340b-instruct) for generating synthetic preference data | |
+| [nemotron-cc](./nemotron-cc/) | How to use NeMo Curator to build the data curation pipeline used to create the Nemotron-CC dataset | [Blog post](https://developer.nvidia.com/blog/building-nemotron-cc-a-high-quality-trillion-token-dataset-for-llm-pretraining-from-common-crawl-using-nvidia-nemo-curator/) |
 | [peft-curation](./peft-curation/) | Data curation sample for parameter efficient fine-tuning (PEFT) use-cases | [Blog post](https://developer.nvidia.com/blog/curating-custom-datasets-for-llm-parameter-efficient-fine-tuning-with-nvidia-nemo-curator/) |
 | [peft-curation-with-sdg](./peft-curation/) | Demonstrates a pipeline to leverage external models such as [Nemotron-4 340B Instruct](https://build.nvidia.com/nvidia/nemotron-4-340b-instruct) for synthetic data generation, data quality annotation via [Nemotron-4 340B Reward](https://build.nvidia.com/nvidia/nemotron-4-340b-reward), as well as other data processing steps (semantic deduplication, HTML tag removal, etc.) for parameter efficient fine-tuning (PEFT) use-cases | [Use this data to fine-tune your own model](https://github.com/NVIDIA/NeMo/blob/main/tutorials/llm/llama-3/sdg-law-title-generation/llama3-sdg-lora-nemofw.ipynb) |
+| [pretraining-data-curation](./pretraining-data-curation/) | Demonstrates accelerated pipeline for curating large-scale data for LLM pretraining in a distributed environment | |
+| [pretraining-vietnamese-data-curation](./pretraining-vietnamese-data-curation/) | Demonstrates how to use NeMo Curator to process large-scale and high-quality Vietnamese data in a distributed environment | |
 | [single_node_tutorial](./single_node_tutorial) | A comprehensive example to demonstrate running various NeMo Curator functionalities locally | |
 | [synthetic-data-hello-world](./synthetic-data-hello-world) | An introductory example of synthetic data generation using NeMo Curator | |
 | [synthetic-preference-data](./synthetic-preference-data) | Demonstrates the use of NeMo Curator synthetic data generation modules to leverage [LLaMa 3.1 405B Instruct](https://build.nvidia.com/meta/llama-3_1-405b-instruct) for generating synthetic preference data |

tutorials/llama-nemotron-data-curation/README.md

Lines changed: 14 additions & 1 deletion
@@ -111,6 +111,19 @@ If you are interested in counting and displaying the number of rows after each s
 
 ## Debugging Out of Memory Errors
 
-If you are running into out of memory (OOM) errors, there are a couple of approaches you can try. One is to avoid very large partitions of data. By default, the JSONL data is read with a blocksize of 256 MB per Dask partition. To customize the file reading logic, the user may specify `--json-blocksize "1gb"` with any string representation for the partition size (e.g., "1gb", "256mb"). Alternatively, the user may specify `--json-files-per-partition 2` with any integer to represent the number of JSONL files per Dask partition. Please note that either the blocksize or files per partition can be specified, but not both. For GPU workflows, a good general rule of thumb is to set the blocksize to 1/32 of the total GPU memory. In general, a blocksize between 100 MB and 1 GB is considered ideal.
+If you are running into out of memory (OOM) errors, there are a couple of approaches you can try. One is to avoid very large partitions of data. By default, the JSONL data is read with a blocksize of 100 MB per Dask partition. To customize the file reading logic, the user may specify `--json-blocksize "1gb"` with any string representation for the partition size (e.g., "1gb", "256mb"). Alternatively, the user may specify `--json-files-per-partition 2` with any integer to represent the number of JSONL files per Dask partition. Please note that either the blocksize or files per partition can be specified, but not both. For GPU workflows, a good general rule of thumb is to set the blocksize to 1/32 of the total GPU memory. In general, a blocksize between 100 MB and 1 GB is considered ideal.
 
 You may also encounter errors about Dask workers unexpectedly shutting down. To help mitigate this, consider lowering the `--n-workers` parameter. By default, we set the number of Dask workers equal to the number of CPU cores. It may be helpful to set `--n-workers` to half or a fourth of the number of CPU cores and possibly reduce the number from there. For example, if `lscpu` shows `CPU(s): 96`, then setting `--n-workers 48` or `--n-workers 24` may help optimize performance while avoiding memory issues. In the example bash script, we set `--n-workers 4` as a safe option to help avoid errors.
+
+## Next Steps
+
+To see how to train a reasoning model with the resulting dataset, please refer to this NeMo tutorial: [Train Your Own Reasoning Model in 48 Hours on a Single GPU](https://github.com/NVIDIA/NeMo/tree/main/tutorials/llm/reasoning).
+
+Before running the NeMo tutorial, you should combine all of the resulting JSONL files from this tutorial into a single file called `training.jsonl`. To do this, you can navigate to the output directory and then combine all of the JSONL files:
+
+```bash
+cd /path/to/curated-data
+find . -name "*.jsonl" -exec cat {} + | sed '/^$/d' > training.jsonl
+```
+
+Please note that the above command contains some additional logic to help ignore any empty JSONL files, which may have resulted from the filtering done by this tutorial.
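
The new "Next Steps" section ends with a `find`/`sed` pipeline that concatenates the curated JSONL shards into `training.jsonl`. As an illustrative follow-up (not part of this commit), a few quick checks can confirm the combined file looks reasonable before handing it to the NeMo tutorial; the directory below reuses the placeholder path from the snippet above.

```bash
# Illustrative sanity checks on the combined file (not part of this commit).
cd /path/to/curated-data

# One JSON object per line, so this is the number of training records.
wc -l training.jsonl

# The sed step should have removed all blank lines; expect a count of 0.
# (grep exits non-zero when there are no matches, hence the `|| true`.)
grep -c '^$' training.jsonl || true

# Peek at the first record.
head -n 1 training.jsonl
```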

tutorials/llama-nemotron-data-curation/main.py

Lines changed: 2 additions & 2 deletions
@@ -424,9 +424,9 @@ def main(args: argparse.Namespace) -> None: # noqa: C901, PLR0915
     # Filter out files that don't contain any of the provided substrings
     input_files = [filename for filename in input_files if any(s in filename for s in args.filename_filter)]
 
-    # If neither is set, set the default blocksize to 1GB
+    # If neither is set, set the default blocksize to 100MB
     if args.json_blocksize is None and args.json_files_per_partition is None:
-        args.json_blocksize = "256mb"
+        args.json_blocksize = "100mb"
 
     dataset = DocumentDataset.read_json(
         input_files, blocksize=args.json_blocksize, files_per_partition=args.json_files_per_partition
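
With this change, `main.py` falls back to a `"100mb"` blocksize whenever neither reading option is given on the command line. For readers who want to override that default, the sketch below shows hypothetical invocations using the flags documented in the tutorial README; the values are placeholders, and the tutorial's other required arguments are deliberately omitted rather than guessed.

```bash
# Hypothetical tuning sketch (values and omissions are illustrative, not from this commit).

# Override the partition size directly; --json-blocksize and
# --json-files-per-partition are mutually exclusive.
python main.py \
    --json-blocksize "512mb" \
    --n-workers 24
    # ...plus the tutorial's remaining arguments

# Or group a fixed number of JSONL files into each Dask partition instead.
python main.py \
    --json-files-per-partition 2 \
    --n-workers 24
    # ...plus the tutorial's remaining arguments
```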
