@@ -43,7 +43,7 @@ We use `torchrun` here to spawn multiple processes for FSDP.
Setting `use_fast_kernels` will enable the use of Flash Attention or Xformers memory-efficient kernels based on the hardware being used, which speeds up the fine-tuning job. This has been enabled in Hugging Face's `optimum` library as a one-liner API; please read more [here](https://pytorch.org/blog/out-of-the-box-acceleration/).
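For illustration only, a multi-GPU PEFT fine-tuning launch with fast kernels enabled might look like the sketch below; the script path, GPU count, and checkpoint/output paths are assumptions that may differ in your checkout.

```bash
# Hypothetical launch: FSDP + LoRA fine-tuning on 4 GPUs with fast attention
# kernels enabled. Adjust --nproc_per_node and the paths to your setup.
torchrun --nnodes 1 --nproc_per_node 4 recipes/finetuning/finetuning.py \
  --enable_fsdp \
  --use_peft --peft_method lora \
  --model_name /path/to/llama/checkpoint \
  --output_dir /path/to/save/peft/model \
  --use_fast_kernels
```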
recipes/README.md: 7 additions & 17 deletions
@@ -2,23 +2,13 @@ This folder contains examples organized by topic:
| Subfolder | Description |
|---|---|
-[quickstart](./quickstart)|The "Hello World" of using Llama2, start here if you are new to using Llama2
-[multilingual](./multilingual)|Scripts to add a new language to Llama2
-[finetuning](./finetuning)|Scripts to finetune Llama2 on single-GPU and multi-GPU setups
-[inference](./inference)|Scripts to deploy Llama2 for inference locally and using model servers
-[use_cases](./use_cases)|Scripts showing common applications of Llama2
+[quickstart](./quickstart)|The "Hello World" of using Llama 3, start here if you are new to using Llama 3
+[multilingual](./multilingual)|Scripts to add a new language to Llama
+[finetuning](./finetuning)|Scripts to finetune Llama 3 on single-GPU and multi-GPU setups
+[inference](./inference)|Scripts to deploy Llama 3 for inference locally and using model servers
+[use_cases](./use_cases)|Scripts showing common applications of Llama 3
[responsible_ai](./responsible_ai)|Scripts to use PurpleLlama for safeguarding model outputs
[llama_api_providers](./llama_api_providers)|Scripts to run inference on Llama via hosted endpoints
-[benchmarks](./benchmarks)|Scripts to benchmark Llama 2 models inference on various backends
+[benchmarks](./benchmarks)|Scripts to benchmark Llama 3 models inference on various backends
[code_llama](./code_llama)|Scripts to run inference with the Code Llama models
-[evaluation](./evaluation)|Scripts to evaluate fine-tuned Llama2 models using `lm-evaluation-harness` from `EleutherAI`
-
-**<a id="replicate_note">Note on using Replicate</a>**
-To run some of the demo apps here, you'll need to first sign in with Replicate with your github account, then create a free API token [here](https://replicate.com/account/api-tokens) that you can use for a while. After the free trial ends, you'll need to enter billing info to continue to use Llama2 hosted on Replicate - according to Replicate's [Run time and cost](https://replicate.com/meta/llama-2-13b-chat) for the Llama2-13b-chat model used in our demo apps, the model "costs $0.000725 per second. Predictions typically complete within 10 seconds." This means each call to the Llama2-13b-chat model costs less than $0.01 if the call completes within 10 seconds. If you want absolutely no costs, you can refer to the section "Running Llama2 locally on Mac" above or the "Running Llama2 in Google Colab" below.
-
-**<a id="octoai_note">Note on using OctoAI</a>**
-You can also use [OctoAI](https://octo.ai/) to run some of the Llama demos under [OctoAI_API_examples](./llama_api_providers/OctoAI_API_examples/). You can sign into OctoAI with your Google or GitHub account, which will give you $10 of free credits you can use for a month. Llama2 on OctoAI is priced at [$0.00086 per 1k tokens](https://octo.ai/pricing/) (a ~350-word LLM response), so $10 of free credits should go a very long way (about 10,000 LLM inferences).
-
-### [Running Llama2 in Google Colab](https://colab.research.google.com/drive/1-uBXt4L-6HNS2D8Iny2DwUpVS4Ub7jnk?usp=sharing)
-To run Llama2 in Google Colab using [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), download the quantized Llama2-7b-chat model [here](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_0.gguf), or follow the instructions above to build it, before uploading it to your Google drive. Note that on the free Colab T4 GPU, the call to Llama could take more than 20 minutes to return; running the notebook locally on M1 MBP takes about 20 seconds.
+[evaluation](./evaluation)|Scripts to evaluate fine-tuned Llama 3 models using `lm-evaluation-harness` from `EleutherAI`
recipes/finetuning/README.md: 1 addition & 1 deletion
@@ -99,7 +99,7 @@ It lets us specify the training settings for everything from `model_name` to `da
You can enable [W&B](https://wandb.ai/) experiment tracking by using the `use_wandb` flag as shown below. You can change the project name, entity, and other `wandb.init` arguments in `wandb_config`.
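As a rough sketch (the exact entry point and flag spelling may differ by version), enabling tracking on a single-GPU PEFT run could look like the command below; the project name, entity, and other `wandb.init` arguments would still be set via `wandb_config`.

```bash
# Hypothetical run: single-GPU LoRA fine-tuning with W&B logging turned on.
# Project/entity and other wandb.init arguments are configured in wandb_config.
python -m llama_recipes.finetuning \
  --use_peft --peft_method lora --quantization \
  --model_name /path/to/llama/checkpoint \
  --output_dir /path/to/save/peft/model \
  --use_wandb
```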
@@ -49,15 +49,15 @@ The args used in the command above are:
If you are interested in running full parameter finetuning without making use of PEFT methods, please use the following command. Make sure to change `nproc_per_node` to your available GPUs. This has been tested with `BF16` on 8x A100 40GB GPUs.
If you are running full parameter fine-tuning on the 70B model, you can enable the `low_cpu_fsdp` mode with the following command. This option loads the model on rank 0 only before moving it to the devices to construct FSDP, which can dramatically reduce CPU memory when loading large models like the 70B (on an 8-GPU node, it reduces CPU memory from 2+ TB to 280 GB for the 70B model). This has been tested with `BF16` on 16x A100 80GB GPUs.
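For illustration, a launch along these lines is sketched below; the script path, node/GPU layout, model path, and checkpoint folder names are assumptions and may need to be adapted to your checkout.

```bash
# Hypothetical launch: full-parameter fine-tuning of a 70B checkpoint with
# low_cpu_fsdp, loading weights on rank 0 only to keep host RAM usage down.
# Adjust --nnodes / --nproc_per_node to match your available GPUs.
torchrun --nnodes 1 --nproc_per_node 8 recipes/finetuning/finetuning.py \
  --enable_fsdp \
  --low_cpu_fsdp \
  --pure_bf16 \
  --model_name /path/to/llama/70B \
  --batch_size_training 1 \
  --dist_checkpoint_root_folder model_checkpoints \
  --dist_checkpoint_folder fine-tuned
```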