Commit faefef9

Update README for Llama Nemotron tutorial (#804)
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
1 parent 034d7da commit faefef9

File tree

  • tutorials/llama-nemotron-data-curation

1 file changed (+2 −3 lines)


tutorials/llama-nemotron-data-curation/README.md

Lines changed: 2 additions & 3 deletions
@@ -119,11 +119,10 @@ You may also encounter errors about Dask workers unexpectedly shutting down. To
 
 To see how to train a reasoning model with the resulting dataset, please refer to this NeMo tutorial: [Train Your Own Reasoning Model in 48 Hours on a Single GPU](https://github.com/NVIDIA/NeMo/tree/main/tutorials/llm/reasoning).
 
-Before running the NeMo tutorial, you should combine all of the resulting JSONL files from this tutorial into a single file called `training.jsonl`. To do this, you can navigate to the output directory and then combine all of the JSONL files:
+Before running the NeMo tutorial, you should combine all of the resulting JSONL files from this tutorial into a single file called `training.jsonl`. You can use the following command to combine all of the JSONL files:
 
 ```bash
-cd /path/to/curated-data
-find . -name "*.jsonl" -exec cat {} + | sed '/^$/d' > training.jsonl
+find /path/to/curated-data -type f -name "*.jsonl" -size +0c -print0 | xargs -0 cat | awk 'NF' > training.jsonl
 ```
 
 Please note that the above command contains some additional logic to help ignore any empty JSONL files, which may have resulted from the filtering done by this tutorial.
