
Commit 862fbaa

[Feature] Support LLaMA-3 CPT and SFT (#5619)
* support LLaMA-3
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Run pre-commit
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent e094933 commit 862fbaa

28 files changed: +89 -87 lines changed

applications/Colossal-LLaMA-2/version.txt

Lines changed: 0 additions & 1 deletion
This file was deleted.

applications/Colossal-LLaMA-2/README.md renamed to applications/Colossal-LLaMA/README.md

Lines changed: 16 additions & 14 deletions
@@ -1,6 +1,6 @@
 <div align="center">
 <h1>
-<img src="https://github.com/hpcaitech/public_assets/blob/main/applications/colossal-llama-2/colossalllam2.jpg?raw=true" width=800/>
+Colossal-LLaMA
 </h1>
 </div>

@@ -47,6 +47,7 @@
 - [Citations](#citations)
 
 ## News
+* [2024/4] Support continual pre-training and supervised fine-tuning of LLaMA-3.
 * [2024/01] [Construct Refined 13B Private Model With Just $5000 USD, Upgraded Colossal-AI Llama-2 Open Source](https://hpc-ai.com/blog/colossal-llama-2-13b).
 [[code]](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Colossal-LLaMA-2)
 [[blog]](https://hpc-ai.com/blog/colossal-llama-2-13b)
@@ -289,7 +290,7 @@ Here is details about CLI arguments:
 
 #### 1. Install required packages
 ```
-cd Colossal-LLaMA-2
+cd Colossal-LLaMA
 pip install -r requirements.txt
 ```
 #### 2. Install `xentropy`, `layer_norm` and `rotary`
@@ -314,7 +315,7 @@ Initialize new tokenizer with additional Chinese tokens. Additional Chinese toke
 Command to initialize new tokenizer:
 ```bash
 export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION='python'
-python colossal_llama2/tokenizer/init_tokenizer.py \
+python colossal_llama/tokenizer/init_tokenizer.py \
     --source_tokenizer_dir "<SOURCE_TOKENIZER_DIR>" \
     --target_tokenizer_dir "<TARGET_TOKENIZER_DIR>" \
     --expand_tokens_file "<NEW_TOKENS_FILE>.jsonl"
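For context, vocabulary expansion of a sentencepiece tokenizer generally means appending new pieces to the serialized model proto, which is also why the pure-Python protobuf implementation is exported above. The snippet below is a minimal sketch of that general technique only, not the actual `init_tokenizer.py` code; the paths and example tokens are placeholders.

```python
# Sketch only: append new tokens to a sentencepiece model proto.
# Paths and the example tokens are placeholders, not values from the repo.
import sentencepiece.sentencepiece_model_pb2 as sp_pb2

model = sp_pb2.ModelProto()
with open("<SOURCE_TOKENIZER_DIR>/tokenizer.model", "rb") as f:
    model.ParseFromString(f.read())

existing = {p.piece for p in model.pieces}
for token in ["你好", "世界"]:  # in the real script these come from <NEW_TOKENS_FILE>.jsonl
    if token not in existing:
        piece = sp_pb2.ModelProto.SentencePiece()
        piece.piece = token
        piece.score = 0.0
        model.pieces.append(piece)

with open("<TARGET_TOKENIZER_DIR>/tokenizer.model", "wb") as f:
    f.write(model.SerializeToString())
```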
@@ -328,7 +329,7 @@ Here is details about CLI arguments:
 Initialize the new model checkpoint by calculating the mean values from the original model checkpoint.
 Command to initialize new model checkpoint:
 ```bash
-python colossal_llama2/model/init_model.py \
+python colossal_llama/model/init_model.py \
     --source_model_and_tokenizer_path "<SOURCE_MODEL_AND_TOKENIZER_DIR>" \
     --target_tokenizer_path "<TARGET_TOKENIZER_DIR>" \
     --target_model_path "<TARGET_MODEL_DIR>"
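The README line above describes mean-value initialization for the resized checkpoint. As a rough illustration of that idea only (not the actual `init_model.py` implementation; the model paths are placeholders), the new embedding rows could be filled with the mean of the original ones:

```python
# Hedged sketch of mean-value initialization for newly added vocabulary rows.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("<SOURCE_MODEL_AND_TOKENIZER_DIR>")
new_tokenizer = AutoTokenizer.from_pretrained("<TARGET_TOKENIZER_DIR>")

old_vocab_size = model.get_input_embeddings().weight.shape[0]
model.resize_token_embeddings(len(new_tokenizer))

with torch.no_grad():
    in_emb = model.get_input_embeddings().weight
    out_emb = model.get_output_embeddings().weight
    # New rows start from the mean of the original rows.
    in_emb[old_vocab_size:] = in_emb[:old_vocab_size].mean(dim=0, keepdim=True)
    out_emb[old_vocab_size:] = out_emb[:old_vocab_size].mean(dim=0, keepdim=True)

model.save_pretrained("<TARGET_MODEL_DIR>")
```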
@@ -362,18 +363,17 @@ Command to convert jsonl dataset to arrow format:
 python prepare_pretrain_dataset.py \
     --data_input_dirs "<JSONL_DIR_1>,<JSONL_DIR_2>,<JSONL_DIR_3>" \
     --tokenizer_dir "<TOKENIZER_DIR>" \
-    --data_cache_dir "jsonl_to_arrow_cache" \
-    --data_jsonl_output_dir "spliced_tokenized_output_jsonl" \
-    --data_arrow_output_dir "spliced_tokenized_output_arrow" \
+    --data_output_dirs "spliced tokenized output" \
     --max_length 4096 \
     --num_spliced_dataset_bins 10
 ```
 Here is details about CLI arguments:
 * Source data directory: `data_input_dirs`. Each `<JSONL_DIR>` can have multiple file in `jsonl` format.
 * Tokenizer directory: `tokenizer_dir`. Path to the tokenizer in Hugging Face format.
-* Data cache directory: `data_cache_dir`. Directory to store Hugging Face data cache. Default case will create `cache` folder locally.
-* Output directory for jsonl format: `data_jsonl_output_dir`. Output directory to store converted dataset in jsonl format.
-* Output directory for arrow format: `data_arrow_output_dir`. Output directory to store converted dataset in arrow format, which can be used for training directly.
+* Data output directory: `data_output_dirs`. Directory to store preprocessed output, including three sub-directories:
+  * `cache`: Directory to store Hugging Face data cache.
+  * `jsonl`: Output directory to store converted dataset in jsonl format.
+  * `arrow`: Output directory to store converted dataset in arrow format, which can be used for training directly.
 * Max length: `max_length`. Max length of spliced samples. Default value is 4096.
 * Number of bins for each category: `num_spliced_dataset_bins`. Number of bins for each category, used for bucket-based training.
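Since `data_output_dirs` now bundles the three sub-directories, a quick sanity check is to open one arrow bin with the `datasets` library. This is a hedged sketch: it assumes each bin under `arrow/` was written with `Dataset.save_to_disk`, and the bin name shown is hypothetical.

```python
# Hypothetical inspection of one preprocessed arrow bin; the directory layout
# under "spliced tokenized output/arrow" is assumed, not read from the script.
from datasets import load_from_disk

ds = load_from_disk("spliced tokenized output/arrow/arrow-000")  # hypothetical bin path
print(ds)                  # number of spliced samples in this bin
print(ds.column_names)     # tokenized fields produced by preprocessing
```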

@@ -392,13 +392,15 @@ Command to convert jsonl dataset to arrow format is similar to the command in [3
 python prepare_sft_dataset.py \
     --data_input_dirs "<JSONL_DIR_1>,<JSONL_DIR_2>,<JSONL_DIR_3>" \
     --tokenizer_dir "<TOKENIZER_DIR>" \
-    --data_cache_dir "jsonl_to_arrow_cache" \
-    --data_jsonl_output_dir "spliced_tokenized_output_jsonl" \
-    --data_arrow_output_dir "spliced_tokenized_output_arrow" \
+    --data_output_dirs "spliced tokenized output" \
     --max_length 4096 \
-    --num_spliced_dataset_bins 10
+    --num_spliced_dataset_bins 10 \
+    --llama_version 3
 ```
 
+Additional CLI arguments:
+* LLaMA version: `llama_version`. Specify the LLaMA version.
+
 #### 4. Command Line Arguments for Training
 
 ##### 4.1 Arguments for Pretraining
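The new `--llama_version` flag pairs with the two conversation templates added in `colossal_llama/dataset/conversation.py` (see the diff further below). The helper below is only a hedged sketch of the kind of selection the flag enables, not the script's actual code.

```python
# Illustrative helper: pick the conversation template matching the LLaMA version.
from colossal_llama.dataset.conversation import LLaMA2_Conv, LLaMA3_Conv


def select_conversation_template(llama_version: int):
    """Return the template whose separators match the requested LLaMA generation."""
    if llama_version == 3:
        return LLaMA3_Conv  # <|begin_of_text|> / <|end_of_text|>
    return LLaMA2_Conv      # <s> / </s>


print(select_conversation_template(3).seps)
```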
File renamed without changes.
File renamed without changes.

applications/Colossal-LLaMA-2/colossal_llama2/dataset/conversation.py renamed to applications/Colossal-LLaMA/colossal_llama/dataset/conversation.py

Lines changed: 12 additions & 2 deletions
@@ -83,7 +83,7 @@ def dict(self):
         }
 
 
-conv = Conversation(
+LLaMA2_Conv = Conversation(
     system="A chat between a curious human and an artificial intelligence assistant. "
     "The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n",
     roles=("Human", "Assistant"),
@@ -93,4 +93,14 @@ def dict(self):
     seps=["<s>", "</s>"],
 )
 
-default_conversation = conv
+LLaMA3_Conv = Conversation(
+    system="A chat between a curious human and an artificial intelligence assistant. "
+    "The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n",
+    roles=("Human", "Assistant"),
+    messages=[],
+    offset=0,
+    sep_style=SeparatorStyle.ADD_BOS_EOS_TOKEN,
+    seps=["<|begin_of_text|>", "<|end_of_text|>"],
+)
+
+default_conversation = LLaMA3_Conv
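The only difference between the two templates is the separator pair, which lines up with the special tokens LLaMA-3 tokenizers report. A quick hedged check (the model path is a placeholder, and instruct checkpoints may report a different EOS than the base models shown here):

```python
# Hedged check that the new separators match LLaMA-3's special tokens.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("<LLAMA3_MODEL_DIR>")  # placeholder path
print(tok.bos_token)  # expected for base checkpoints: <|begin_of_text|>
print(tok.eos_token)  # expected for base checkpoints: <|end_of_text|>
```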
File renamed without changes.
Lines changed: 2 additions & 1 deletion
@@ -12,6 +12,7 @@
 
 from datasets import dataset_dict
 from torch.utils.data import ConcatDataset, Dataset, IterableDataset
+from transformers import AutoTokenizer
 from transformers.models.llama.tokenization_llama import LlamaTokenizer
 from transformers.tokenization_utils import PreTrainedTokenizer
 
@@ -71,7 +72,7 @@ def supervised_tokenize_pretrain(
 
 def supervised_tokenize_sft(
     data_point: Dict[str, str],
-    tokenizer: LlamaTokenizer,
+    tokenizer: AutoTokenizer,
     conversation_template: Conversation = default_conversation,
     ignore_index: int = None,
     max_length: int = 4096,
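The annotation change above reflects that LLaMA-3 does not ship the sentencepiece-based `LlamaTokenizer`, so loading through `AutoTokenizer` covers both generations. A hedged illustration only; the paths are placeholders and the concrete classes printed depend on the checkpoints and on `use_fast`:

```python
# Illustrative only: the concrete tokenizer class differs across LLaMA generations,
# so the SFT tokenization path accepts whatever AutoTokenizer loads.
from transformers import AutoTokenizer

llama2_tok = AutoTokenizer.from_pretrained("<LLAMA2_MODEL_DIR>")  # typically a LlamaTokenizer(Fast)
llama3_tok = AutoTokenizer.from_pretrained("<LLAMA3_MODEL_DIR>")  # typically a PreTrainedTokenizerFast

print(type(llama2_tok).__name__)
print(type(llama3_tok).__name__)
```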
File renamed without changes.
File renamed without changes.
File renamed without changes.

0 commit comments
