---
title: Preparing Data for Training
---

This guide demonstrates how to prepare training data for Fast-LLM, starting from a dataset hosted on Hugging Face.

## Prerequisites

You will need the `huggingface-cli` tool (shipped with the `huggingface_hub` package) and the `transformers` library, plus enough free disk space for both the raw and the tokenized dataset.
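
A minimal setup might look like this (assuming a working Python environment; pin versions to match your stack):
```bash
pip install huggingface_hub transformers  # huggingface-cli ships with huggingface_hub
```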

## 📚 Step 1: Download the dataset from Hugging Face

First, set `HF_HOME` to your Hugging Face cache folder.

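For example (the cache location `/mnt/hf_cache` below is an arbitrary choice, adjust it to your storage layout):
```bash
export HF_HOME=/mnt/hf_cache  # example location; pick any disk with enough space
```
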
Let's create the folder that will store the raw Hugging Face dataset:
```bash
mkdir -p /mnt/datasets/upstream/the-stack
```

Next, we download The Stack dataset from Hugging Face:
```bash
huggingface-cli download bigcode/the-stack --revision v1.2 --repo-type dataset --max-workers 64 --local-dir /mnt/datasets/upstream/the-stack
```
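
Optionally, verify that the files landed in the expected folder and check their size on disk:
```bash
du -sh /mnt/datasets/upstream/the-stack
```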

!!! warning "Choice of --max-workers"

    Setting a large `--max-workers` value sometimes leads to connection errors. If that happens, lower the value and re-run the command.

## ⚙️ Step 2: Prepare the config for converting the data to the gpt_mmap format

In this step, we tokenize the Hugging Face dataset downloaded in Step 1 and save it in the gpt_mmap format that Fast-LLM accepts.

We'll use the Mistral-Nemo-Base-2407 tokenizer. Let's create a folder for it first:
```bash
mkdir -p /mnt/checkpoints/upstream/Mistral-Nemo-Base-2407
```

Then download the tokenizer with this script:
```python
from transformers import AutoTokenizer

# Fetch the tokenizer and save it where the conversion config below expects it.
model_id = "mistralai/Mistral-Nemo-Base-2407"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.save_pretrained("/mnt/checkpoints/upstream/Mistral-Nemo-Base-2407/")
```
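
After the script runs, the folder should contain a `tokenizer.json` file, which is what the conversion config below points at:
```bash
ls /mnt/checkpoints/upstream/Mistral-Nemo-Base-2407/
```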

Let's create a folder to store the gpt_mmap dataset:
```bash
mkdir -p /mnt/datasets/tokenized/Mistral-Nemo-Base-2407
```

Create a config file like this:

```yaml
output_path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/the-stack/python

loading_workers: 32
tokenize_workers: 32
saving_workers: 32

dataset:
  path: /mnt/datasets/upstream/the-stack
  config_name: "python"
  split: "train"

tokenizer:
  path: /mnt/checkpoints/upstream/Mistral-Nemo-Base-2407/tokenizer.json
```
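
Finally, launch the conversion with Fast-LLM's data-preparation entry point. The exact subcommand name below is an assumption, and `prepare-config.yaml` is a placeholder for wherever you saved the config above, so verify the invocation against `fast-llm --help` for your installed version:
```bash
# Hypothetical invocation; check `fast-llm --help` for the exact subcommand name.
fast-llm prepare gpt_memmap --config prepare-config.yaml
```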