
Commit d183a1a

Merge branch 'main' into rlm/update-readme-langchain-agents
2 parents: 9848dc0 + b4e1a42

File tree

25 files changed, with 12965 additions and 343 deletions


docs/multi_gpu.md

Lines changed: 7 additions & 7 deletions
@@ -24,7 +24,7 @@ This runs with the `samsum_dataset` for summarization application by default.
 
 ```bash
 
-torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --model_name /patht_of_model_folder/8B --use_peft --peft_method lora --output_dir Path/to/save/PEFT/model
+torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --use_peft --peft_method lora --output_dir Path/to/save/PEFT/model
 
 ```
 
@@ -43,7 +43,7 @@ We use `torchrun` here to spawn multiple processes for FSDP.
 Setting `use_fast_kernels` will enable using of Flash Attention or Xformer memory-efficient kernels based on the hardware being used. This would speed up the fine-tuning job. This has been enabled in `optimum` library from HuggingFace as a one-liner API, please read more [here](https://pytorch.org/blog/out-of-the-box-acceleration/).
 
 ```bash
-torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --model_name /patht_of_model_folder/8B --use_peft --peft_method lora --output_dir Path/to/save/PEFT/model --use_fast_kernels
+torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --use_peft --peft_method lora --output_dir Path/to/save/PEFT/model --use_fast_kernels
 ```
 
 ### Fine-tuning using FSDP Only
@@ -52,7 +52,7 @@ If interested in running full parameter finetuning without making use of PEFT me
 
 ```bash
 
-torchrun --nnodes 1 --nproc_per_node 8 examples/finetuning.py --enable_fsdp --model_name /patht_of_model_folder/8B --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --use_fast_kernels
+torchrun --nnodes 1 --nproc_per_node 8 examples/finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --use_fast_kernels
 
 ```
 
@@ -62,7 +62,7 @@ If you are interested in running full parameter fine-tuning on the 70B model, yo
 
 ```bash
 
-torchrun --nnodes 1 --nproc_per_node 8 examples/finetuning.py --enable_fsdp --low_cpu_fsdp --pure_bf16 --model_name /patht_of_model_folder/70B --batch_size_training 1 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned
+torchrun --nnodes 1 --nproc_per_node 8 examples/finetuning.py --enable_fsdp --low_cpu_fsdp --pure_bf16 --model_name /path_of_model_folder/70B --batch_size_training 1 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned
 
 ```
 
@@ -95,16 +95,16 @@ To run with each of the datasets set the `dataset` flag in the command as shown
 
 ```bash
 # grammer_dataset
-torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --model_name /patht_of_model_folder/8B --use_peft --peft_method lora --dataset grammar_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
+torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --use_peft --peft_method lora --dataset grammar_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
 
 # alpaca_dataset
 
-torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --model_name /patht_of_model_folder/8B --use_peft --peft_method lora --dataset alpaca_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
+torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --use_peft --peft_method lora --dataset alpaca_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
 
 
 # samsum_dataset
 
-torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --model_name /patht_of_model_folder/8B --use_peft --peft_method lora --dataset samsum_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
+torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --use_peft --peft_method lora --dataset samsum_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
 
 ```
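For context on the `use_fast_kernels` flag these commands pass: the docs point to HuggingFace's `optimum` one-liner API for fused attention kernels. A minimal sketch of that one-liner, assuming `optimum` is installed and reusing the placeholder model path from the commands above (this is not code from this repo):

```python
# Sketch only: the optimum "one-liner" the docs reference, not the recipe's internal code.
from transformers import AutoModelForCausalLM
from optimum.bettertransformer import BetterTransformer

model = AutoModelForCausalLM.from_pretrained("/path_of_model_folder/8B")  # placeholder path
# Swap supported modules for fused, memory-efficient attention kernels.
model = BetterTransformer.transform(model)
```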

docs/single_gpu.md

Lines changed: 4 additions & 4 deletions
@@ -20,7 +20,7 @@ Get access to a machine with one GPU or if using a multi-GPU machine please make
 
 ```bash
 
-python -m llama_recipes.finetuning --use_peft --peft_method lora --quantization --use_fp16 --model_name /patht_of_model_folder/8B --output_dir Path/to/save/PEFT/model
+python -m llama_recipes.finetuning --use_peft --peft_method lora --quantization --use_fp16 --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
 
 ```
 The args used in the command above are:
@@ -51,16 +51,16 @@ to run with each of the datasets set the `dataset` flag in the command as shown
 ```bash
 # grammer_dataset
 
-python -m llama_recipes.finetuning --use_peft --peft_method lora --quantization --dataset grammar_dataset --model_name /patht_of_model_folder/8B --output_dir Path/to/save/PEFT/model
+python -m llama_recipes.finetuning --use_peft --peft_method lora --quantization --dataset grammar_dataset --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
 
 # alpaca_dataset
 
-python -m llama_recipes.finetuning --use_peft --peft_method lora --quantization --dataset alpaca_dataset --model_name /patht_of_model_folder/8B --output_dir Path/to/save/PEFT/model
+python -m llama_recipes.finetuning --use_peft --peft_method lora --quantization --dataset alpaca_dataset --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
 
 
 # samsum_dataset
 
-python -m llama_recipes.finetuning --use_peft --peft_method lora --quantization --dataset samsum_dataset --model_name /patht_of_model_folder/8B --output_dir Path/to/save/PEFT/model
+python -m llama_recipes.finetuning --use_peft --peft_method lora --quantization --dataset samsum_dataset --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
 
 ```
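As a rough illustration of what `--use_peft --peft_method lora --quantization` corresponds to under the hood, here is a minimal sketch using `transformers` and `peft`; the hyperparameters and target modules are assumptions for illustration, not the recipe's exact defaults:

```python
# Sketch only: 8-bit base model plus a LoRA adapter, approximating the flags above.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = AutoModelForCausalLM.from_pretrained(
    "/path_of_model_folder/8B",  # placeholder path, as in the commands above
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=8, lora_alpha=32, lora_dropout=0.05,  # assumed values
    target_modules=["q_proj", "v_proj"],    # assumed target modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA weights are trainable
```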

recipes/README.md

Lines changed: 7 additions & 17 deletions
@@ -2,23 +2,13 @@ This folder contains examples organized by topic:
 
 | Subfolder | Description |
 |---|---|
-[quickstart](./quickstart)|The "Hello World" of using Llama2, start here if you are new to using Llama2
-[multilingual](./multilingual)|Scripts to add a new language to Llama2
-[finetuning](./finetuning)|Scripts to finetune Llama2 on single-GPU and multi-GPU setups
-[inference](./inference)|Scripts to deploy Llama2 for inference locally and using model servers
-[use_cases](./use_cases)|Scripts showing common applications of Llama2
+[quickstart](./quickstart)|The "Hello World" of using Llama 3, start here if you are new to using Llama 3
+[multilingual](./multilingual)|Scripts to add a new language to Llama
+[finetuning](./finetuning)|Scripts to finetune Llama 3 on single-GPU and multi-GPU setups
+[inference](./inference)|Scripts to deploy Llama 3 for inference locally and using model servers
+[use_cases](./use_cases)|Scripts showing common applications of Llama 3
 [responsible_ai](./responsible_ai)|Scripts to use PurpleLlama for safeguarding model outputs
 [llama_api_providers](./llama_api_providers)|Scripts to run inference on Llama via hosted endpoints
-[benchmarks](./benchmarks)|Scripts to benchmark Llama 2 models inference on various backends
+[benchmarks](./benchmarks)|Scripts to benchmark Llama 3 models inference on various backends
 [code_llama](./code_llama)|Scripts to run inference with the Code Llama models
-[evaluation](./evaluation)|Scripts to evaluate fine-tuned Llama2 models using `lm-evaluation-harness` from `EleutherAI`
-
-
-**<a id="replicate_note">Note on using Replicate</a>**
-To run some of the demo apps here, you'll need to first sign in with Replicate with your github account, then create a free API token [here](https://replicate.com/account/api-tokens) that you can use for a while. After the free trial ends, you'll need to enter billing info to continue to use Llama2 hosted on Replicate - according to Replicate's [Run time and cost](https://replicate.com/meta/llama-2-13b-chat) for the Llama2-13b-chat model used in our demo apps, the model "costs $0.000725 per second. Predictions typically complete within 10 seconds." This means each call to the Llama2-13b-chat model costs less than $0.01 if the call completes within 10 seconds. If you want absolutely no costs, you can refer to the section "Running Llama2 locally on Mac" above or the "Running Llama2 in Google Colab" below.
-
-**<a id="octoai_note">Note on using OctoAI</a>**
-You can also use [OctoAI](https://octo.ai/) to run some of the Llama demos under [OctoAI_API_examples](./llama_api_providers/OctoAI_API_examples/). You can sign into OctoAI with your Google or GitHub account, which will give you $10 of free credits you can use for a month. Llama2 on OctoAI is priced at [$0.00086 per 1k tokens](https://octo.ai/pricing/) (a ~350-word LLM response), so $10 of free credits should go a very long way (about 10,000 LLM inferences).
-
-### [Running Llama2 in Google Colab](https://colab.research.google.com/drive/1-uBXt4L-6HNS2D8Iny2DwUpVS4Ub7jnk?usp=sharing)
-To run Llama2 in Google Colab using [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), download the quantized Llama2-7b-chat model [here](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_0.gguf), or follow the instructions above to build it, before uploading it to your Google drive. Note that on the free Colab T4 GPU, the call to Llama could take more than 20 minutes to return; running the notebook locally on M1 MBP takes about 20 seconds.
+[evaluation](./evaluation)|Scripts to evaluate fine-tuned Llama 3 models using `lm-evaluation-harness` from `EleutherAI`

recipes/benchmarks/inference_throughput/on-prem/vllm/chat_vllm_benchmark.py

Lines changed: 2 additions & 4 deletions
@@ -40,8 +40,6 @@
 SAFE_CHECK = params["SAFE_CHECK"]
 # Threshold for tokens per second below which we deem the query to be slow
 THRESHOLD_TPS = params["THRESHOLD_TPS"]
-# Default Llama tokenizer, replace with your own tokenizer
-TOKENIZER_PATH = params["TOKENIZER_PATH"]
 TEMPERATURE = params["TEMPERATURE"]
 TOP_P = params["TOP_P"]
 # Add your model endpoints here, specify the port number. You can acquire the endpoint when creating a on-prem server like vLLM.
@@ -55,8 +53,8 @@
 print("No available GPUs")
 
 
-# This tokenizer is downloaded from Azure model catalog for each specific models. The main purpose is to decode the reponses for token calculation
-tokenizer = transformers.AutoTokenizer.from_pretrained(TOKENIZER_PATH)
+# This tokenizer is downloaded from HuggingFace based on the model path you set. Note Llama 3 use a different tokenizer compare to Llama 2
+tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_PATH)
 
 num_token_input_prompt = len(tokenizer.encode(PROMPT))
 print(f"Number of token for input prompt: {num_token_input_prompt}")

recipes/benchmarks/inference_throughput/on-prem/vllm/parameters.json

Lines changed: 0 additions & 1 deletion
@@ -5,7 +5,6 @@
 "MODEL_HEADERS" : {"Content-Type": "application/json"},
 "SAFE_CHECK" : true,
 "THRESHOLD_TPS" : 7,
-"TOKENIZER_PATH" : "../../tokenizer",
 "RANDOM_PROMPT_LENGTH" : 1000,
 "TEMPERATURE" : 0.6,
 "TOP_P" : 0.9,

recipes/benchmarks/inference_throughput/on-prem/vllm/pretrained_vllm_benchmark.py

Lines changed: 2 additions & 4 deletions
@@ -36,8 +36,6 @@
 SAFE_CHECK = params["SAFE_CHECK"]
 # Threshold for tokens per second below which we deem the query to be slow
 THRESHOLD_TPS = params["THRESHOLD_TPS"]
-# Replace with your own tokenizer
-TOKENIZER_PATH = params["TOKENIZER_PATH"]
 RANDOM_PROMPT_LENGTH = params["RANDOM_PROMPT_LENGTH"]
 TEMPERATURE = params["TEMPERATURE"]
 TOP_P = params["TOP_P"]
@@ -52,8 +50,8 @@
 print("No available GPUs")
 
 
-# This tokenizer is downloaded from Azure model catalog for each specific models. The main purpose is to decode the reponses for token calculation
-tokenizer = transformers.AutoTokenizer.from_pretrained(TOKENIZER_PATH)
+# This tokenizer is downloaded from HuggingFace based on the model path you set. Note Llama 3 use a different tokenizer compare to Llama 2
+tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_PATH)
 
 # Select vocabulary that is longer than 2 tokens (closer to real words) and close to the English (not foolproof)
 vocab = [token for token in tokenizer.get_vocab().keys() if len(token) > 2 and all(ord(c) < 128 for c in token)]
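For context, the unchanged `vocab` line above feeds the script's random-prompt generation (the script also reads `RANDOM_PROMPT_LENGTH` from `parameters.json`). A minimal sketch of that idea; the sampling strategy here is an assumption for illustration, not the script's exact logic:

```python
# Sketch only: build a pseudo-random prompt of roughly RANDOM_PROMPT_LENGTH tokens
# from ASCII-only vocabulary entries, mirroring the filter in the line above.
import random
import transformers

MODEL_PATH = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model id
RANDOM_PROMPT_LENGTH = 1000

tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_PATH)
vocab = [tok for tok in tokenizer.get_vocab().keys()
         if len(tok) > 2 and all(ord(c) < 128 for c in tok)]
prompt = " ".join(random.choices(vocab, k=RANDOM_PROMPT_LENGTH))
print(len(tokenizer.encode(prompt)))  # roughly, not exactly, RANDOM_PROMPT_LENGTH tokens
```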

recipes/finetuning/README.md

Lines changed: 1 addition & 1 deletion
@@ -99,7 +99,7 @@ It lets us specify the training settings for everything from `model_name` to `da
 You can enable [W&B](https://wandb.ai/) experiment tracking by using `use_wandb` flag as below. You can change the project name, entity and other `wandb.init` arguments in `wandb_config`.
 
 ```bash
-python -m llama_recipes.finetuning --use_peft --peft_method lora --quantization --model_name /patht_of_model_folder/8B --output_dir Path/to/save/PEFT/model --use_wandb
+python -m llama_recipes.finetuning --use_peft --peft_method lora --quantization --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model --use_wandb
 ```
 You'll be able to access a dedicated project or run link on [wandb.ai](https://wandb.ai) and see your dashboard like the one below.
 <div style="display: flex;">
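The `wandb_config` mentioned in this hunk ultimately drives a standard `wandb.init` call. A minimal sketch of the equivalent direct call; the project and entity names are placeholders, and this is not the recipe's internal code:

```python
# Sketch only: the kind of wandb.init call that wandb_config configures.
import wandb

run = wandb.init(
    project="llama-recipes-finetuning",  # placeholder project name
    entity="my-team",                    # placeholder entity
    config={"peft_method": "lora", "quantization": True},
)
run.log({"train/loss": 0.0})  # training metrics logged here appear on the dashboard
run.finish()
```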

recipes/finetuning/multigpu_finetuning.md

Lines changed: 7 additions & 7 deletions
@@ -23,7 +23,7 @@ Get access to a machine with multiple GPUs (in this case we tested with 4 A100 a
 <details open>
 <summary>Single-node Multi-GPU</summary>
 
-torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /patht_of_model_folder/8B --use_peft --peft_method lora --output_dir Path/to/save/PEFT/model
+torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --use_peft --peft_method lora --output_dir Path/to/save/PEFT/model
 
 </details>
 
@@ -49,15 +49,15 @@ The args used in the command above are:
 If interested in running full parameter finetuning without making use of PEFT methods, please use the following command. Make sure to change the `nproc_per_node` to your available GPUs. This has been tested with `BF16` on 8xA100, 40GB GPUs.
 
 ```bash
-torchrun --nnodes 1 --nproc_per_node 8 finetuning.py --enable_fsdp --model_name /patht_of_model_folder/8B --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --use_fast_kernels
+torchrun --nnodes 1 --nproc_per_node 8 finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --use_fast_kernels
 ```
 
 ### Using less CPU memory (FSDP on 70B model)
 
 If you are running full parameter fine-tuning on the 70B model, you can enable `low_cpu_fsdp` mode as the following command. This option will load model on rank0 only before moving model to devices to construct FSDP. This can dramatically save cpu memory when loading large models like 70B (on a 8-gpu node, this reduces cpu memory from 2+T to 280G for 70B model). This has been tested with `BF16` on 16xA100, 80GB GPUs.
 
 ```bash
-torchrun --nnodes 1 --nproc_per_node 8 finetuning.py --enable_fsdp --low_cpu_fsdp --pure_bf16 --model_name /patht_of_model_folder/70B --batch_size_training 1 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned
+torchrun --nnodes 1 --nproc_per_node 8 finetuning.py --enable_fsdp --low_cpu_fsdp --pure_bf16 --model_name /path_of_model_folder/70B --batch_size_training 1 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned
 ```
 
 
@@ -79,16 +79,16 @@ To run with each of the datasets set the `dataset` flag in the command as shown
 
 ```bash
 # grammer_dataset
-torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /patht_of_model_folder/8B --use_peft --peft_method lora --dataset grammar_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
+torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --use_peft --peft_method lora --dataset grammar_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
 
 # alpaca_dataset
 
-torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /patht_of_model_folder/8B --use_peft --peft_method lora --dataset alpaca_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
+torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --use_peft --peft_method lora --dataset alpaca_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
 
 
 # samsum_dataset
 
-torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /patht_of_model_folder/8B --use_peft --peft_method lora --dataset samsum_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
+torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --use_peft --peft_method lora --dataset samsum_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
 
 ```
 
@@ -103,7 +103,7 @@ This will require to set the Sharding strategy in [fsdp config](../../src/llama_
 
 ```bash
 
-torchrun --nnodes 4 --nproc_per_node 8 ./finetuning.py --enable_fsdp --low_cpu_fsdp --fsdp_config.pure_bf16 --model_name /patht_of_model_folder/70B --batch_size_training 1 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --hsdp --sharding_group_size n --replica_group_size world_size/n
+torchrun --nnodes 4 --nproc_per_node 8 ./finetuning.py --enable_fsdp --low_cpu_fsdp --fsdp_config.pure_bf16 --model_name /path_of_model_folder/70B --batch_size_training 1 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --hsdp --sharding_group_size n --replica_group_size world_size/n
 
 ```
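The HSDP flags in the last hunk follow a simple relation: `--nnodes 4 --nproc_per_node 8` gives a world size of 32, and `replica_group_size` should equal `world_size / sharding_group_size`. A tiny worked example of that arithmetic; sharding each replica within one 8-GPU node is an assumption for illustration:

```python
# Worked example of the group-size arithmetic behind --sharding_group_size / --replica_group_size.
nnodes, nproc_per_node = 4, 8
world_size = nnodes * nproc_per_node                     # 32 ranks in total
sharding_group_size = 8                                  # assumed: shard each replica across one node's 8 GPUs
replica_group_size = world_size // sharding_group_size   # 32 / 8 = 4 replicas kept in sync
print(world_size, sharding_group_size, replica_group_size)  # 32 8 4
```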

recipes/finetuning/singlegpu_finetuning.md

Lines changed: 4 additions & 4 deletions
@@ -16,7 +16,7 @@ To run fine-tuning on a single GPU, we will make use of two packages:
 ## How to run it?
 
 ```bash
-python -m finetuning.py --use_peft --peft_method lora --quantization --use_fp16 --model_name /patht_of_model_folder/8B --output_dir Path/to/save/PEFT/model
+python -m finetuning.py --use_peft --peft_method lora --quantization --use_fp16 --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
 ```
 The args used in the command above are:
 
@@ -48,16 +48,16 @@ to run with each of the datasets set the `dataset` flag in the command as shown
 ```bash
 # grammer_dataset
 
-python -m finetuning.py --use_peft --peft_method lora --quantization --dataset grammar_dataset --model_name /patht_of_model_folder/8B --output_dir Path/to/save/PEFT/model
+python -m finetuning.py --use_peft --peft_method lora --quantization --dataset grammar_dataset --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
 
 # alpaca_dataset
 
-python -m finetuning.py --use_peft --peft_method lora --quantization --dataset alpaca_dataset --model_name /patht_of_model_folder/8B --output_dir Path/to/save/PEFT/model
+python -m finetuning.py --use_peft --peft_method lora --quantization --dataset alpaca_dataset --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
 
 
 # samsum_dataset
 
-python -m finetuning.py --use_peft --peft_method lora --quantization --dataset samsum_dataset --model_name /patht_of_model_folder/8B --output_dir Path/to/save/PEFT/model
+python -m finetuning.py --use_peft --peft_method lora --quantization --dataset samsum_dataset --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
 
 ```
