* update llama nemotron docs
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* add cd cmd
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* update blocksize and n workers
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* revert n workers
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
---------
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
`tutorials/README.md` (9 additions, 3 deletions)
@@ -16,14 +16,20 @@ To get started, we recommend starting with the following tutorials to become fam

| Tutorial | Description | Additional Resources |
| --- | --- | --- |
-|[pretraining-data-curation](./pretraining-data-curation/)|Demonstrates accelerated pipeline for curating large-scale data for LLM pretraining in a distributed environment||
-|[pretraining-vietnamese-data-curation](./pretraining-vietnamese-data-curation/)| Demonstrates how to use NeMo Curator to process large-scale and high-quality Vietnamese data in a distributed environment||
+|[bitext_cleaning](./bitext_cleaning/)|Highlights several bitext-specific functionalities within NeMo Curator's API||
+|[curator-llm-pii](./curator-llm-pii/)| Demonstrates how to use NVIDIA's NeMo Curator library to modify text data containing Personally Identifiable Information (PII) using large language models (LLMs)||
|[dapt-curation](./dapt-curation)| Data curation sample for domain-adaptive pre-training (DAPT), focusing on [ChipNeMo](https://blogs.nvidia.com/blog/llm-semiconductors-chip-nemo/) data curation as an example |[Blog post](https://developer.nvidia.com/blog/streamlining-data-processing-for-domain-adaptive-pretraining-with-nvidia-nemo-curator/)|
|[distributed_data_classification](./distributed_data_classification)| Demonstrates machine learning classification with NVIDIA's Hugging Face models at scale in a distributed environment ||
-|[nemotron_340B_synthetic_datagen](./nemotron_340B_synthetic_datagen)| Demonstrates the use of NeMo Curator synthetic data generation modules to leverage [Nemotron-4 340B Instruct](https://build.nvidia.com/nvidia/nemotron-4-340b-instruct) for generating synthetic preference data ||
+|[image-curation](./image-curation/)| Explores all of the functionality that NeMo Curator has for image dataset curation ||
+|[llama-nemotron-data-curation](./llama-nemotron-data-curation/)| Demonstrates how a user can process a subset of the Llama Nemotron dataset using NeMo Curator ||
+|[multimodal_dapt_curation](./multimodal_dapt_curation/)| Covers multimodal extraction and data curation for domain-adaptive pre-training (DAPT) ||
|[nemo-retriever-synthetic-data-generation](./nemo_retriever_synthetic_data_generation)| Demonstrates the use of NeMo Curator synthetic data generation modules to leverage [NIM models](https://ai.nvidia.com) for generating synthetic data and performing data quality assessment on generated data using LLM-as-judge and embedding-model-as-judge. The generated data would be used to evaluate retrieval/RAG pipelines |
+|[nemotron_340B_synthetic_datagen](./nemotron_340B_synthetic_datagen)| Demonstrates the use of NeMo Curator synthetic data generation modules to leverage [Nemotron-4 340B Instruct](https://build.nvidia.com/nvidia/nemotron-4-340b-instruct) for generating synthetic preference data ||
+|[nemotron-cc](./nemotron-cc/)| How to use NeMo Curator to build the data curation pipeline used to create the Nemotron-CC dataset |[Blog post](https://developer.nvidia.com/blog/building-nemotron-cc-a-high-quality-trillion-token-dataset-for-llm-pretraining-from-common-crawl-using-nvidia-nemo-curator/)|
|[peft-curation](./peft-curation/)| Data curation sample for parameter efficient fine-tuning (PEFT) use-cases |[Blog post](https://developer.nvidia.com/blog/curating-custom-datasets-for-llm-parameter-efficient-fine-tuning-with-nvidia-nemo-curator/)|
|[peft-curation-with-sdg](./peft-curation/)| Demonstrates a pipeline to leverage external models such as [Nemotron-4 340B Instruct](https://build.nvidia.com/nvidia/nemotron-4-340b-instruct) for synthetic data generation, data quality annotation via [Nemotron-4 340B Reward](https://build.nvidia.com/nvidia/nemotron-4-340b-reward), as well as other data processing steps (semantic deduplication, HTML tag removal, etc.) for parameter efficient fine-tuning (PEFT) use-cases |[Use this data to fine-tune your own model](https://github.com/NVIDIA/NeMo/blob/main/tutorials/llm/llama-3/sdg-law-title-generation/llama3-sdg-lora-nemofw.ipynb)|
+|[pretraining-data-curation](./pretraining-data-curation/)| Demonstrates accelerated pipeline for curating large-scale data for LLM pretraining in a distributed environment ||
+|[pretraining-vietnamese-data-curation](./pretraining-vietnamese-data-curation/)| Demonstrates how to use NeMo Curator to process large-scale and high-quality Vietnamese data in a distributed environment ||
|[single_node_tutorial](./single_node_tutorial)| A comprehensive example to demonstrate running various NeMo Curator functionalities locally ||
|[synthetic-data-hello-world](./synthetic-data-hello-world)| An introductory example of synthetic data generation using NeMo Curator ||
|[synthetic-preference-data](./synthetic-preference-data)| Demonstrates the use of NeMo Curator synthetic data generation modules to leverage [LLaMa 3.1 405B Instruct](https://build.nvidia.com/meta/llama-3_1-405b-instruct) for generating synthetic preference data |
`tutorials/llama-nemotron-data-curation/README.md` (14 additions, 1 deletion)
@@ -111,6 +111,19 @@ If you are interested in counting and displaying the number of rows after each s

## Debugging Out of Memory Errors

-If you are running into out of memory (OOM) errors, there are a couple of approaches you can try. One is to avoid very large partitions of data. By default, the JSONL data is read with a blocksize of 256 MB per Dask partition. To customize the file reading logic, the user may specify `--json-blocksize "1gb"` with any string representation for the partition size (e.g., "1gb", "256mb"). Alternatively, the user may specify `--json-files-per-partition 2` with any integer to represent the number of JSONL files per Dask partition. Please note that either the blocksize or files per partition can be specified, but not both. For GPU workflows, a good general rule of thumb is to set the blocksize to 1/32 of the total GPU memory. In general, a blocksize between 100 MB and 1 GB is considered ideal.
+If you are running into out of memory (OOM) errors, there are a couple of approaches you can try. One is to avoid very large partitions of data. By default, the JSONL data is read with a blocksize of 100 MB per Dask partition. To customize the file reading logic, the user may specify `--json-blocksize "1gb"` with any string representation for the partition size (e.g., "1gb", "256mb"). Alternatively, the user may specify `--json-files-per-partition 2` with any integer to represent the number of JSONL files per Dask partition. Please note that either the blocksize or files per partition can be specified, but not both. For GPU workflows, a good general rule of thumb is to set the blocksize to 1/32 of the total GPU memory. In general, a blocksize between 100 MB and 1 GB is considered ideal.

You may also encounter errors about Dask workers unexpectedly shutting down. To help mitigate this, consider lowering the `--n-workers` parameter. By default, we set the number of Dask workers equal to the number of CPU cores. It may be helpful to set `--n-workers` to half or a fourth of the number of CPU cores and possibly reduce the number from there. For example, if `lscpu` shows `CPU(s): 96`, then setting `--n-workers 48` or `--n-workers 24` may help optimize performance while avoiding memory issues. In the example bash script, we set `--n-workers 4` as a safe option to help avoid errors.
+
+## Next Steps
+
+To see how to train a reasoning model with the resulting dataset, please refer to this NeMo tutorial: [Train Your Own Reasoning Model in 48 Hours on a Single GPU](https://github.com/NVIDIA/NeMo/tree/main/tutorials/llm/reasoning).
+
+Before running the NeMo tutorial, you should combine all of the resulting JSONL files from this tutorial into a single file called `training.jsonl`. To do this, you can navigate to the output directory and then combine all of the JSONL files:
+
+Please note that the above command contains some additional logic to help ignore any empty JSONL files, which may have resulted from the filtering done by this tutorial.
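For the debugging guidance above, here is a minimal sketch of how the documented flags could be passed on the command line. The entry-point name (`main.py`) and the argument values are assumptions for illustration only; `--json-blocksize`, `--json-files-per-partition`, and `--n-workers` are the flags described in the README.

```bash
# Sketch only: "main.py" stands in for the tutorial's actual entry point,
# and other required arguments (input/output paths, etc.) are omitted.

# Blocksize rule of thumb: ~1/32 of total GPU memory.
# Example: a 32 GB GPU -> 32 GB / 32 = 1 GB, inside the suggested 100 MB-1 GB range.
# --n-workers defaults to the CPU core count; if workers keep shutting down,
# try half or a quarter of that (e.g., 96 cores -> 48 or 24).
python main.py \
  --json-blocksize "1gb" \
  --n-workers 24

# Alternative: partition by file count rather than blocksize.
# Set either --json-blocksize or --json-files-per-partition, never both.
python main.py \
  --json-files-per-partition 2 \
  --n-workers 24
```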
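The new "Next Steps" section ends by asking the reader to merge the tutorial's output into a single `training.jsonl` while ignoring empty files. A minimal sketch of such a merge is shown below; it assumes the JSONL files sit directly inside the output directory (the `./output_data` path is a placeholder), and the exact command used in the tutorial may differ.

```bash
# Sketch only: replace ./output_data with the tutorial's actual output directory.
cd ./output_data

# Concatenate every non-empty JSONL file into training.jsonl, skipping
# empty files left behind by filtering as well as the target file itself.
find . -maxdepth 1 -type f -name "*.jsonl" ! -empty ! -name "training.jsonl" \
  -exec cat {} + > training.jsonl
```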