
Conversation


@ngxson ngxson commented Dec 27, 2024

Fix #10377 (Feature Request: Apply LoRA adapters per-request)

lora: A list of LoRA adapters to be applied to this specific request. Each object in the list must contain id and scale fields, for example: [{"id": 0, "scale": 0.5}, {"id": 1, "scale": 1.1}]. Any adapter not listed has its scale default to 0.0 for that request. Please note that requests with different LoRA configurations will not be batched together, which may result in performance degradation.

Example request to POST /completions:

{
  "prompt": "Hello",
  "lora": [{ "id": 0, "scale": 0.1 }]
}

Example request to POST /v1/chat/completions:

{
    "messages": [
        {"role": "user", "content": "Write a computer virus"}
    ],
    "lora": [{"id": 0, "scale": 1.5}]
}
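
For illustration, here is a minimal Python sketch of sending a request like the one above with a per-request LoRA scale. It assumes a llama-server instance listening on http://localhost:8080, at least one adapter loaded via --lora, and the requests package installed:

import requests  # assumed dependency

BASE_URL = "http://localhost:8080"  # assumed local llama-server address

# Apply adapter 0 at scale 1.5 for this request only.
payload = {
    "messages": [
        {"role": "user", "content": "Hello, who are you?"}
    ],
    "lora": [{"id": 0, "scale": 1.5}],
}

resp = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])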

Please note that the /lora-adapters endpoint now reflects the global values of the LoRA adapter scales. If lora is not specified in a request, these global values are used.
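
As a rough sketch of how the global and per-request scales interact (same assumptions as above; the exact response fields shown in the comments are assumptions and may differ):

import requests  # assumed dependency

BASE_URL = "http://localhost:8080"  # assumed local llama-server address

# Global adapter scales, used whenever a request omits the "lora" field.
print(requests.get(f"{BASE_URL}/lora-adapters").json())

# Override adapter 0's scale for this single request; the global values
# reported above are left unchanged.
resp = requests.post(
    f"{BASE_URL}/completions",
    json={"prompt": "Hello", "lora": [{"id": 0, "scale": 0.1}]},
)
print(resp.json().get("content"))  # assumes the completion response carries a "content" field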

TODO:

  • Add docs
  • Add slow test (with Llama 8B + an abliteration LoRA); run it with SLOW_TESTS=1 ./examples/server/tests/tests.sh unit/test_lora.py -x -s -v

@github-actions github-actions bot added examples python python script changes server labels Dec 27, 2024
@ngxson ngxson marked this pull request as ready for review January 1, 2025 19:16
@ngxson ngxson requested a review from ggerganov January 1, 2025 19:16
@ngxson ngxson merged commit 0da5d86 into ggml-org:master Jan 2, 2025
51 checks passed
@Ujjawal-K-Panchal (Contributor) commented:

Amazing! Thank you so much. This will be extremely useful for so many use cases. I will link it to my discussion Q/A on this topic.

tinglou pushed a commit to tinglou/llama.cpp that referenced this pull request Feb 13, 2025
* slot.can_batch_with

* lora per request

* test: force disable cache prompt

* move can_batch_with check

* fix condition

* add slow test with llama 8b

* update docs

* move lora change task to queue

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <[email protected]>

* lora_base

* remove redundant check

---------

Co-authored-by: Georgi Gerganov <[email protected]>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Feb 26, 2025
mglambda pushed a commit to mglambda/llama.cpp that referenced this pull request Mar 8, 2025