Commit 2f38e4a

Data prep init
1 parent 436d8d2 commit 2f38e4a

File tree

1 file changed: +68 -2 lines


docs/recipes/data-preparation.md

Lines changed: 68 additions & 2 deletions
@@ -2,6 +2,72 @@
title: Preparing Data for Training
---

-!!! warning

This guide demonstrates how to prepare data for Fast-LLM, starting from a Hugging Face dataset.

## Prerequisites

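The steps below rely on the `huggingface-cli` tool (shipped with the `huggingface_hub` package), the `transformers` library, and a working Fast-LLM installation for the final conversion. A minimal environment sketch, with the package names inferred from the commands used later in this guide:

```bash
# huggingface_hub provides huggingface-cli; transformers is used to fetch the tokenizer.
pip install huggingface_hub transformers
```
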
## 📚 Step 1: Download the dataset from Hugging Face

First, set `HF_HOME` to your Hugging Face cache folder.

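For example (the cache path here is only an illustration; point it at any disk with enough free space):

```bash
# Assumed cache location; any large, writable directory works.
export HF_HOME=/mnt/hf_cache
```
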
Let's create a folder to store the Hugging Face dataset:
```bash
mkdir -p /mnt/datasets/upstream/the-stack
```

Next, we download The Stack dataset from Hugging Face:
```bash
huggingface-cli download bigcode/the-stack --revision v1.2 --repo-type dataset --max-workers 64 --local-dir /mnt/datasets/upstream/the-stack
```

!!! warning "Choice of max-workers"

    Setting a large `--max-workers` value sometimes leads to connection errors.

## ⚙️ Step 2: Prepare the config for converting the data to the gpt_mmap format

In this step, we tokenize the Hugging Face dataset downloaded in Step 1 and save it in the gpt_mmap format that Fast-LLM accepts.

We'll use the Mistral-Nemo-Base-2407 tokenizer. Let's create a folder for it first:
```bash
mkdir -p /mnt/checkpoints/upstream/Mistral-Nemo-Base-2407
```

Then download the tokenizer with this script:
```python
from transformers import AutoTokenizer

# Download the Mistral-Nemo-Base-2407 tokenizer and save it where the
# prepare config below expects to find tokenizer.json.
model_id = "mistralai/Mistral-Nemo-Base-2407"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.save_pretrained("/mnt/checkpoints/upstream/Mistral-Nemo-Base-2407")
```
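
The prepare config below points at `tokenizer.json` inside that folder, so it is worth confirming that `save_pretrained` actually wrote it (a quick check, assuming the paths used above):

```bash
# tokenizer.json should appear among the saved files.
ls /mnt/checkpoints/upstream/Mistral-Nemo-Base-2407
```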

Let's create a folder to store the gpt_mmap dataset:
```bash
mkdir -p /mnt/datasets/tokenized/Mistral-Nemo-Base-2407
```

Create a config like this:

```yaml
output_path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/the-stack/python

loading_workers: 32
tokenize_workers: 32
saving_workers: 32

dataset:
  path: /mnt/datasets/upstream/the-stack
  config_name: "python"
  split: "train"

tokenizer:
  path: /mnt/checkpoints/upstream/Mistral-Nemo-Base-2407/tokenizer.json
```
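
With the config saved (say, as `prepare-the-stack-python.yaml`; the file name is arbitrary), the conversion is launched through Fast-LLM's data-preparation entry point. A sketch of the invocation, assuming a `fast-llm prepare gpt_memmap` subcommand; check `fast-llm prepare --help` in your installation for the exact name and options:

```bash
# Tokenize The Stack (python subset) and write it out in the memmap format Fast-LLM reads.
fast-llm prepare gpt_memmap --config prepare-the-stack-python.yaml
```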

-    This guide’s still in the works. Stay tuned—full instructions coming soon!
