Commit beed7fa

Merge branch 'ggerganov:master' into master
2 parents 425ae74 + 10afa6f

File tree: 14 files changed (+1897 −1324 lines)


README-sycl.md

Lines changed: 4 additions & 6 deletions
@@ -311,15 +311,13 @@ Output (example):
 
 a. Download & install cmake for Windows: https://cmake.org/download/
 
-b. Download & install make for Windows provided by mingw-w64
+b. Download & install mingw-w64 make for Windows provided by w64devkit
 
-- Download binary package for Windows in https://github.com/niXman/mingw-builds-binaries/releases.
+- Download the latest fortran version of [w64devkit](https://github.com/skeeto/w64devkit/releases).
 
-  Like [x86_64-13.2.0-release-win32-seh-msvcrt-rt_v11-rev1.7z](https://github.com/niXman/mingw-builds-binaries/releases/download/13.2.0-rt_v11-rev1/x86_64-13.2.0-release-win32-seh-msvcrt-rt_v11-rev1.7z).
+- Extract `w64devkit` on your pc.
 
-- Unzip the binary package. In the **bin** sub-folder and rename **xxx-make.exe** to **make.exe**.
-
-- Add the **bin** folder path in the Windows system PATH environment.
+- Add the **bin** folder path in the Windows system PATH environment, like `C:\xxx\w64devkit\bin\`.
 
 ### Build locally:
 
README.md

Lines changed: 37 additions & 91 deletions
@@ -33,17 +33,14 @@ Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others)
     <li><a href="#get-the-code">Get the Code</a></li>
     <li><a href="#build">Build</a></li>
     <li><a href="#blas-build">BLAS Build</a></li>
-    <li><a href="#prepare-data--run">Prepare Data & Run</a></li>
+    <li><a href="#prepare-and-quantize">Prepare and Quantize</a></li>
+    <li><a href="#run-the-quantized-model">Run the quantized model</a></li>
     <li><a href="#memorydisk-requirements">Memory/Disk Requirements</a></li>
     <li><a href="#quantization">Quantization</a></li>
     <li><a href="#interactive-mode">Interactive mode</a></li>
     <li><a href="#constrained-output-with-grammars">Constrained output with grammars</a></li>
-    <li><a href="#instruction-mode-with-alpaca">Instruction mode with Alpaca</a></li>
-    <li><a href="#using-openllama">Using OpenLLaMA</a></li>
-    <li><a href="#using-gpt4all">Using GPT4All</a></li>
-    <li><a href="#using-pygmalion-7b--metharme-7b">Using Pygmalion 7B & Metharme 7B</a></li>
-    <li><a href="#obtaining-the-facebook-llama-original-model-and-stanford-alpaca-model-data">Obtaining the Facebook LLaMA original model and Stanford Alpaca model data</a></li>
-    <li><a href="#verifying-the-model-files">Verifying the model files</a></li>
+    <li><a href="#instruct-mode">Instruct mode</a></li>
+    <li><a href="#obtaining-and-using-the-facebook-llama-2-model">Obtaining and using the Facebook LLaMA 2 model</a></li>
     <li><a href="#seminal-papers-and-background-on-the-models">Seminal papers and background on the models</a></li>
     <li><a href="#perplexity-measuring-model-quality">Perplexity (measuring model quality)</a></li>
     <li><a href="#android">Android</a></li>

@@ -83,20 +80,16 @@ improved significantly thanks to many contributions. It is the main playground f
 
 **Supported models:**
 
+Typically finetunes of the base models below are supported as well.
+
 - [X] LLaMA 🦙
 - [x] LLaMA 2 🦙🦙
-- [X] [Mistral AI v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
+- [X] [Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.1)
 - [x] [Mixtral MoE](https://huggingface.co/models?search=mistral-ai/Mixtral)
 - [X] Falcon
-- [X] [Alpaca](https://github.com/ggerganov/llama.cpp#instruction-mode-with-alpaca)
-- [X] [GPT4All](https://github.com/ggerganov/llama.cpp#using-gpt4all)
 - [X] [Chinese LLaMA / Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca) and [Chinese LLaMA-2 / Alpaca-2](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2)
 - [X] [Vigogne (French)](https://github.com/bofenghuang/vigogne)
-- [X] [Vicuna](https://github.com/ggerganov/llama.cpp/discussions/643#discussioncomment-5533894)
 - [X] [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/)
-- [X] [OpenBuddy 🐶 (Multilingual)](https://github.com/OpenBuddy/OpenBuddy)
-- [X] [Pygmalion/Metharme](#using-pygmalion-7b--metharme-7b)
-- [X] [WizardLM](https://github.com/nlpxucan/WizardLM)
 - [X] [Baichuan 1 & 2](https://huggingface.co/models?search=baichuan-inc/Baichuan) + [derivations](https://huggingface.co/hiyouga/baichuan-7b-sft)
 - [X] [Aquila 1 & 2](https://huggingface.co/models?search=BAAI/Aquila)
 - [X] [Starcoder models](https://github.com/ggerganov/llama.cpp/pull/3187)

@@ -149,6 +142,7 @@ Unless otherwise noted these projects are open-source with permissive licensing:
 - [iohub/collama](https://github.com/iohub/coLLaMA)
 - [janhq/jan](https://github.com/janhq/jan) (AGPL)
 - [nat/openplayground](https://github.com/nat/openplayground)
+- [Faraday](https://faraday.dev/) (proprietary)
 - [LMStudio](https://lmstudio.ai/) (proprietary)
 - [LostRuins/koboldcpp](https://github.com/LostRuins/koboldcpp) (AGPL)
 - [Mozilla-Ocho/llamafile](https://github.com/Mozilla-Ocho/llamafile)

@@ -165,7 +159,7 @@ Unless otherwise noted these projects are open-source with permissive licensing:
 
 Here is a typical run using LLaMA v2 13B on M2 Ultra:
 
-```java
+```
 $ make -j && ./main -m models/llama-13b-v2/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
 I llama.cpp build info:
 I UNAME_S: Darwin

@@ -249,7 +243,7 @@ https://user-images.githubusercontent.com/1991296/224442907-7693d4be-acaa-4e01-8
 
 ## Usage
 
-Here are the end-to-end binary build and model conversion steps for the LLaMA-7B model.
+Here are the end-to-end binary build and model conversion steps for most supported models.
 
 ### Get the Code
 

@@ -634,7 +628,7 @@ Building the program with BLAS support may lead to some performance improvements
 
 **Without docker**:
 
-Firstly, you need to make sure you installed [Vulkan SDK](https://vulkan.lunarg.com/doc/view/latest/linux/getting_started_ubuntu.html)
+Firstly, you need to make sure you have installed [Vulkan SDK](https://vulkan.lunarg.com/doc/view/latest/linux/getting_started_ubuntu.html)
 
 For example, on Ubuntu 22.04 (jammy), use the command below:
 

@@ -647,6 +641,8 @@ Building the program with BLAS support may lead to some performance improvements
 vulkaninfo
 ```
 
+Alternatively, your package manager might be able to provide the appropriate libraries. For example, for Ubuntu 22.04 you can install `libvulkan-dev` instead.
+
 Then, build llama.cpp using the cmake command below:
 
 ```bash
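
The distro-package route added in this hunk can be checked with a couple of commands before building. A minimal sketch for Ubuntu 22.04, assuming the `vulkan-tools` package is also installed so that `vulkaninfo` is available without the LunarG SDK:

```bash
# install the Vulkan headers/loader from the distro instead of the LunarG SDK
sudo apt update
sudo apt install libvulkan-dev vulkan-tools

# confirm a Vulkan-capable device is visible before building llama.cpp
vulkaninfo
```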
@@ -661,34 +657,42 @@ Building the program with BLAS support may lead to some performance improvements
 # ggml_vulkan: Using Intel(R) Graphics (ADL GT2) | uma: 1 | fp16: 1 | warp size: 32
 ```
 
-### Prepare Data & Run
+### Prepare and Quantize
+
+To obtain the official LLaMA 2 weights please see the <a href="#obtaining-and-using-the-facebook-llama-2-model">Obtaining and using the Facebook LLaMA 2 model</a> section. There is also a large selection of pre-quantized `gguf` models available on Hugging Face.
 
 ```bash
-# obtain the original LLaMA model weights and place them in ./models
+# obtain the official LLaMA model weights and place them in ./models
 ls ./models
-65B 30B 13B 7B tokenizer_checklist.chk tokenizer.model
+llama-2-7b tokenizer_checklist.chk tokenizer.model
 # [Optional] for models using BPE tokenizers
 ls ./models
-65B 30B 13B 7B vocab.json
+<folder containing weights and tokenizer json> vocab.json
+# [Optional] for PyTorch .bin models like Mistral-7B
+ls ./models
+<folder containing weights and tokenizer json>
 
 # install Python dependencies
 python3 -m pip install -r requirements.txt
 
-# convert the 7B model to ggml FP16 format
-python3 convert.py models/7B/
+# convert the model to ggml FP16 format
+python3 convert.py models/mymodel/
 
 # [Optional] for models using BPE tokenizers
-python convert.py models/7B/ --vocabtype bpe
+python convert.py models/mymodel/ --vocabtype bpe
 
-# quantize the model to 4-bits (using q4_0 method)
-./quantize ./models/7B/ggml-model-f16.gguf ./models/7B/ggml-model-q4_0.gguf q4_0
+# quantize the model to 4-bits (using Q4_K_M method)
+./quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M
 
-# update the gguf filetype to current if older version is unsupported by another application
-./quantize ./models/7B/ggml-model-q4_0.gguf ./models/7B/ggml-model-q4_0-v2.gguf COPY
+# update the gguf filetype to current version if older version is now unsupported
+./quantize ./models/mymodel/ggml-model-Q4_K_M.gguf ./models/mymodel/ggml-model-Q4_K_M-v2.gguf COPY
+```
 
+### Run the quantized model
 
-# run the inference
-./main -m ./models/7B/ggml-model-q4_0.gguf -n 128
+```bash
+# start inference on a gguf model
+./main -m ./models/mymodel/ggml-model-Q4_K_M.gguf -n 128
 ```
 
 When running the larger models, make sure you have enough disk space to store all the intermediate files.
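
As an alternative to converting and quantizing locally, the pre-quantized `gguf` route mentioned in the new section above can be as simple as downloading a single file. A minimal sketch, assuming the `huggingface-cli` tool from the `huggingface_hub` package and the `TheBloke/Llama-2-13B-chat-GGUF` repository referenced later in this README; the exact `.gguf` file name inside that repository is an assumption:

```bash
# install the Hugging Face CLI
python3 -m pip install huggingface_hub

# download a pre-quantized gguf into ./models (file name is assumed)
huggingface-cli download TheBloke/Llama-2-13B-chat-GGUF \
  llama-2-13b-chat.Q4_K_M.gguf --local-dir ./models

# run it directly; no convert/quantize step is needed
./main -m ./models/llama-2-13b-chat.Q4_K_M.gguf -n 128
```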
@@ -709,7 +713,7 @@ From the unzipped folder, open a terminal/cmd window here and place a pre-conver
 
 As the models are currently fully loaded into memory, you will need adequate disk space to save them and sufficient RAM to load them. At the moment, memory and disk requirements are the same.
 
-| Model | Original size | Quantized size (4-bit) |
+| Model | Original size | Quantized size (Q4_0) |
 |------:|--------------:|-----------------------:|
 | 7B    | 13 GB         | 3.9 GB                 |
 | 13B   | 24 GB         | 7.8 GB                 |

@@ -825,9 +829,9 @@ The `grammars/` folder contains a handful of sample grammars. To write your own,
 
 For authoring more complex JSON grammars, you can also check out https://grammar.intrinsiclabs.ai/, a browser app that lets you write TypeScript interfaces which it compiles to GBNF grammars that you can save for local use. Note that the app is built and maintained by members of the community, please file any issues or FRs on [its repo](http://github.com/intrinsiclabsai/gbnfgen) and not this one.
 
-### Instruction mode with Alpaca
+### Instruct mode
 
-1. First, download the `ggml` Alpaca model into the `./models` folder
+1. First, download and place the `ggml` model into the `./models` folder
 2. Run the `main` tool like this:
 
 ```
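
The command block that follows item 2 lies outside this hunk. For orientation, a minimal sketch of an instruct-mode invocation, assuming the `-ins` flag accepted by `main` at this point in the project's history and the `Q4_K_M` file name from the quantize step above:

```bash
# interactive, instruction-following chat with a local gguf model
./main -m ./models/mymodel/ggml-model-Q4_K_M.gguf -n 256 --color -ins
```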
@@ -853,50 +857,6 @@ cadaver, cauliflower, cabbage (vegetable), catalpa (tree) and Cailleach.
 
 >
 ```
-
-### Using [OpenLLaMA](https://github.com/openlm-research/open_llama)
-
-OpenLLaMA is an openly licensed reproduction of Meta's original LLaMA model. It uses the same architecture and is a drop-in replacement for the original LLaMA weights.
-
-- Download the [3B](https://huggingface.co/openlm-research/open_llama_3b), [7B](https://huggingface.co/openlm-research/open_llama_7b), or [13B](https://huggingface.co/openlm-research/open_llama_13b) model from Hugging Face.
-- Convert the model to ggml FP16 format using `python convert.py <path to OpenLLaMA directory>`
-
-### Using [GPT4All](https://github.com/nomic-ai/gpt4all)
-
-*Note: these instructions are likely obsoleted by the GGUF update*
-
-- Obtain the `tokenizer.model` file from LLaMA model and put it to `models`
-- Obtain the `added_tokens.json` file from Alpaca model and put it to `models`
-- Obtain the `gpt4all-lora-quantized.bin` file from GPT4All model and put it to `models/gpt4all-7B`
-- It is distributed in the old `ggml` format which is now obsoleted
-- You have to convert it to the new format using `convert.py`:
-
-```bash
-python3 convert.py models/gpt4all-7B/gpt4all-lora-quantized.bin
-```
-
-- You can now use the newly generated `models/gpt4all-7B/ggml-model-q4_0.bin` model in exactly the same way as all other models
-
-- The newer GPT4All-J model is not yet supported!
-
-### Using Pygmalion 7B & Metharme 7B
-
-- Obtain the [LLaMA weights](#obtaining-the-facebook-llama-original-model-and-stanford-alpaca-model-data)
-- Obtain the [Pygmalion 7B](https://huggingface.co/PygmalionAI/pygmalion-7b/) or [Metharme 7B](https://huggingface.co/PygmalionAI/metharme-7b) XOR encoded weights
-- Convert the LLaMA model with [the latest HF convert script](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py)
-- Merge the XOR files with the converted LLaMA weights by running the [xor_codec](https://huggingface.co/PygmalionAI/pygmalion-7b/blob/main/xor_codec.py) script
-- Convert to `ggml` format using the `convert.py` script in this repo:
-```bash
-python3 convert.py pygmalion-7b/ --outtype q4_1
-```
-> The Pygmalion 7B & Metharme 7B weights are saved in [bfloat16](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format) precision. If you wish to convert to `ggml` without quantizating, please specify the `--outtype` as `f32` instead of `f16`.
-
-
-### Obtaining the Facebook LLaMA original model and Stanford Alpaca model data
-
-- **Under no circumstances should IPFS, magnet links, or any other links to model downloads be shared anywhere in this repository, including in issues, discussions, or pull requests. They will be immediately deleted.**
-- The LLaMA models are officially distributed by Facebook and will **never** be provided through this repository.
-- Refer to [Facebook's LLaMA repository](https://github.com/facebookresearch/llama/pull/73/files) if you need to request access to the model data.
-
 ### Obtaining and using the Facebook LLaMA 2 model
 
 - Refer to [Facebook's LLaMA download page](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) if you want to access the model data.

@@ -908,20 +868,6 @@ python3 convert.py pygmalion-7b/ --outtype q4_1
 - [LLaMA 2 13B chat](https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF)
 - [LLaMA 2 70B chat](https://huggingface.co/TheBloke/Llama-2-70B-chat-GGUF)
 
-### Verifying the model files
-
-Please verify the [sha256 checksums](SHA256SUMS) of all downloaded model files to confirm that you have the correct model data files before creating an issue relating to your model files.
-- The following python script will verify if you have all possible latest files in your self-installed `./models` subdirectory:
-
-```bash
-# run the verification script
-./scripts/verify-checksum-models.py
-```
-
-- On linux or macOS it is also possible to run the following commands to verify if you have all possible latest files in your self-installed `./models` subdirectory:
-    - On Linux: `sha256sum --ignore-missing -c SHA256SUMS`
-    - on macOS: `shasum -a 256 --ignore-missing -c SHA256SUMS`
-
 ### Seminal papers and background on the models
 
 If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:

SHA256SUMS

Lines changed: 0 additions & 40 deletions
This file was deleted.

common/common.cpp

Lines changed: 6 additions & 2 deletions
@@ -46,6 +46,10 @@
 #define GGML_USE_CUBLAS_SYCL
 #endif
 
+#if (defined(GGML_USE_CUBLAS) || defined(GGML_USE_SYCL)) || defined(GGML_USE_VULKAN)
+#define GGML_USE_CUBLAS_SYCL_VULKAN
+#endif
+
 int32_t get_num_physical_cores() {
 #ifdef __linux__
     // enumerate the set of thread siblings, num entries is num cores

@@ -660,8 +664,8 @@ bool gpt_params_parse_ex(int argc, char ** argv, gpt_params & params) {
                 params.tensor_split[i] = 0.0f;
             }
         }
-#ifndef GGML_USE_CUBLAS_SYCL
-        fprintf(stderr, "warning: llama.cpp was compiled without cuBLAS/SYCL. Setting a tensor split has no effect.\n");
+#ifndef GGML_USE_CUBLAS_SYCL_VULKAN
+        fprintf(stderr, "warning: llama.cpp was compiled without cuBLAS/SYCL/Vulkan. Setting a tensor split has no effect.\n");
 #endif // GGML_USE_CUBLAS_SYCL
     } else if (arg == "--no-mmap") {
         params.use_mmap = false;
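
For context, this warning fires when a tensor split is requested on a build that has none of the listed GPU backends compiled in. A minimal sketch of triggering it, assuming a CPU-only build and the placeholder model path used in the README examples above:

```bash
# on a CPU-only build, requesting a tensor split prints the warning patched above
./main -m ./models/mymodel/ggml-model-Q4_K_M.gguf --tensor-split 3,1 -p "Hello" -n 16
# warning: llama.cpp was compiled without cuBLAS/SYCL/Vulkan. Setting a tensor split has no effect.
```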
