Commit afdd774

Merge branch 'ggerganov:master' into master
2 parents 2ac815f + 61408e7 commit afdd774

7 files changed: 541 additions, 77 deletions


convert_lora_to_gguf.py

Lines changed: 3 additions & 3 deletions
@@ -230,7 +230,7 @@ def get_base_tensor_name(lora_tensor_name: str) -> str:
 
 def parse_args() -> argparse.Namespace:
     parser = argparse.ArgumentParser(
-        description="Convert a huggingface PEFT LoRA adapter to a GGML compatible file")
+        description="Convert a Hugging Face PEFT LoRA adapter to a GGUF file")
     parser.add_argument(
         "--outfile", type=Path,
         help="path to write to; default: based on input. {ftype} will be replaced by the outtype.",
@@ -257,11 +257,11 @@ def parse_args() -> argparse.Namespace:
     )
     parser.add_argument(
         "--base", type=Path, required=True,
-        help="directory containing base model file",
+        help="directory containing Hugging Face model config files (config.json, tokenizer.json) for the base model that the adapter is based on - only config is needed, actual model weights are not required",
     )
     parser.add_argument(
         "lora_path", type=Path,
-        help="directory containing LoRA adapter file",
+        help="directory containing Hugging Face PEFT LoRA config (adapter_model.json) and weights (adapter_model.safetensors or adapter_model.bin)",
     )
 
     return parser.parse_args()
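For context, a minimal sketch of how the updated converter might be invoked from another Python script, based only on the arguments visible in the diff above (`--base`, `--outfile`, and the positional `lora_path`). The directory and file paths are hypothetical placeholders, and the script is assumed to be run from the repository root.

# Sketch: convert a Hugging Face PEFT LoRA adapter to GGUF via the script above.
# All paths are hypothetical placeholders; adjust them to your setup.
import subprocess
import sys

base_dir = "models/base-model"        # contains config.json, tokenizer.json (weights not required)
lora_dir = "models/my-lora-adapter"   # contains the PEFT adapter config and adapter weights
outfile = "models/my-lora-adapter.gguf"

subprocess.run(
    [
        sys.executable, "convert_lora_to_gguf.py",
        "--base", base_dir,
        "--outfile", outfile,
        lora_dir,
    ],
    check=True,
)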

examples/main/README.md

Lines changed: 9 additions & 2 deletions
@@ -333,6 +333,15 @@ These options help improve the performance and memory usage of the LLaMA models.
 
 For information about 4-bit quantization, which can significantly improve performance and reduce memory usage, please refer to llama.cpp's primary [README](../../README.md#prepare-and-quantize).
 
+## LoRA (Low-Rank Adaptation) adapters
+
+- `--lora FNAME`: Optional path to a LoRA adapter to use with scaling of 1.0. Can be mixed with `--lora-scaled` and can be repeated to use multiple adapters.
+- `--lora-scaled FNAME`: Optional path to a LoRA adapter with user-defined scaling. Can be mixed with `--lora` and can be repeated to use multiple adapters.
+
+You can add LoRA adapters using `--lora` or `--lora-scaled`. For example: `--lora my_adapter_1.gguf --lora my_adapter_2.gguf ...` or `--lora-scaled lora_task_A.gguf 0.5 --lora-scaled lora_task_B.gguf 0.5`.
+
+LoRA adapters should be in GGUF format. To convert from Hugging Face format use the `convert_lora_to_gguf.py` script. LoRA adapters are loaded separately and applied during inference - they are not merged with the main model. This means that mmap model loading is fully supported when using LoRA adapters. The old `--lora-base` flag has been removed now that merging is no longer performed.
+
 ## Additional Options
 
 These options provide extra functionality and customization when running the LLaMA models:
@@ -341,6 +350,4 @@ These options provide extra functionality and customization when running the LLaMA models:
 - `--verbose-prompt`: Print the prompt before generating text.
 - `-mg i, --main-gpu i`: When using multiple GPUs this option controls which GPU is used for small tensors for which the overhead of splitting the computation across all GPUs is not worthwhile. The GPU in question will use slightly more VRAM to store a scratch buffer for temporary results. By default GPU 0 is used.
 - `-ts SPLIT, --tensor-split SPLIT`: When using multiple GPUs this option controls how large tensors should be split across all GPUs. `SPLIT` is a comma-separated list of non-negative values that assigns the proportion of data that each GPU should get in order. For example, "3,2" will assign 60% of the data to GPU 0 and 40% to GPU 1. By default the data is split in proportion to VRAM but this may not be optimal for performance.
-- `--lora FNAME`: Apply a LoRA (Low-Rank Adaptation) adapter to the model (implies --no-mmap). This allows you to adapt the pretrained model to specific tasks or domains.
-- `--lora-base FNAME`: Optional model to use as a base for the layers modified by the LoRA adapter. This flag is used in conjunction with the `--lora` flag, and specifies the base model for the adaptation.
 - `-hfr URL --hf-repo URL`: The url to the Hugging Face model repository. Used in conjunction with `--hf-file` or `-hff`. The model is downloaded and stored in the file provided by `-m` or `--model`. If `-m` is not provided, the model is auto-stored in the path specified by the `LLAMA_CACHE` environment variable or in an OS-specific local cache.
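As a companion to the new README section, a small hedged Python sketch that launches the main example with LoRA adapters, mirroring the `--lora` / `--lora-scaled` usage described above. It assumes the example binary is built as `llama-cli` and that the model and adapter paths are placeholders; adjust both to your build and files.

# Sketch: run the main example with multiple GGUF LoRA adapters, as described
# in the README section above. Binary name and all paths are assumptions.
import subprocess

cmd = [
    "./llama-cli",                                # assumed build output of examples/main
    "-m", "models/base-model.gguf",               # placeholder base model in GGUF format
    "--lora", "my_adapter_1.gguf",                # applied with scaling 1.0
    "--lora-scaled", "lora_task_B.gguf", "0.5",   # applied with user-defined scaling
    "-p", "Hello",
]
subprocess.run(cmd, check=True)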

ggml/include/ggml-kompute.h

Lines changed: 4 additions & 0 deletions
@@ -11,6 +11,8 @@
 extern "C" {
 #endif
 
+#define GGML_KOMPUTE_MAX_DEVICES 16
+
 struct ggml_vk_device {
     int index;
     int type; // same as VkPhysicalDeviceType
@@ -41,6 +43,8 @@ GGML_API bool ggml_backend_is_kompute(ggml_backend_t backend);
 
 GGML_API ggml_backend_buffer_type_t ggml_backend_kompute_buffer_type(int device);
 
+GGML_API ggml_backend_reg_t ggml_backend_kompute_reg(void);
+
 #ifdef __cplusplus
 }
 #endif
