
Commit 24b559d

Merge branch 'feat-add-local-inference' of https://github.com/codelion/optillm into feat-add-local-inference

2 parents 26030d0 + 795556f
1 file changed: +35 −17 lines


README.md: 35 additions & 17 deletions
@@ -48,22 +48,6 @@ python optillm.py
 * Running on http://192.168.10.48:8000
 2024-09-06 07:57:14,212 - INFO - Press CTRL+C to quit
 ```
-
-### Starting the optillm proxy for a local server (e.g. llama.cpp)
-
-- Set the `OPENAI_API_KEY` env variable to a placeholder value
-  - e.g. `export OPENAI_API_KEY="no_key"`
-- Run `./llama-server -c 4096 -m path_to_model` to start the server with the specified model and a context length of 4096 tokens
-- Run `python3 optillm.py --base_url base_url` to start the proxy
-  - e.g. for llama.cpp, run `python3 optillm.py --base_url http://localhost:8080/v1`
-
-> [!WARNING]
-> Note that llama-server currently does not support sampling multiple responses from a model, which limits the available approaches to the following:
-> `cot_reflection`, `leap`, `plansearch`, `rstar`, `rto`, `self_consistency`, `re2`, and `z3`.
-
-> [!NOTE]
-> You'll later need to specify a model name in the OpenAI client configuration. Since llama-server was started with a single model, you can choose any name you want.
-
 ## Usage
 
 Once the proxy is running, you can use it as a drop-in replacement for an OpenAI client by setting the `base_url` as `http://localhost:8000/v1`.
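
A minimal sketch of that drop-in usage, assuming the proxy is listening on `localhost:8000`, `OPENAI_API_KEY` is already exported, and `gpt-4o-mini` stands in for whatever model your upstream endpoint actually serves:

```python
# Illustrative sketch: the standard OpenAI client pointed at the optillm proxy.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],  # your usual upstream API key
    base_url="http://localhost:8000/v1",   # the optillm proxy instead of the provider's URL
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use whatever model your upstream serves
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)
```

Because the proxy exposes a standard chat completions endpoint, the client code is unchanged from a direct OpenAI connection; only `base_url` differs.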
@@ -155,7 +139,41 @@ In the diagram:
 - `A` is an existing tool (like [oobabooga](https://github.com/oobabooga/text-generation-webui/)), framework (like [patchwork](https://github.com/patched-codes/patchwork))
 or your own code where you want to use the results from optillm. You can use it directly using any OpenAI client sdk.
 - `B` is the optillm service (running directly or in a docker container) that will send requests to the `base_url`.
-- `C` is any service providing an OpenAI API compatible chat completions endpoint.
+- `C` is any service providing an OpenAI API compatible chat completions endpoint.
+
+### Local inference server
+
+We support loading any HuggingFace model or LoRA directly in optillm. To use the built-in inference server, set `OPTILLM_API_KEY` to any value (e.g. `export OPTILLM_API_KEY="optillm"`)
+and then use the same value as the API key in your OpenAI client. You can pass any HuggingFace model in the `model` field. If it is a private model, make sure you set the `HF_TOKEN` environment variable
+with your HuggingFace key. We also support adding any number of LoRAs on top of the base model by using the `+` separator.
+
+E.g., the following code loads the base model `meta-llama/Llama-3.2-1B-Instruct` and then adds two LoRAs on top - `patched-codes/Llama-3.2-1B-FixVulns` and `patched-codes/Llama-3.2-1B-FastApply`.
+You can specify which LoRA to use via the `active_adapter` param in the `extra_body` field of the OpenAI SDK client. By default, we will load the last specified adapter.
+
+```python
+from openai import OpenAI
+
+OPENAI_BASE_URL = "http://localhost:8000/v1"
+OPENAI_KEY = "optillm"
+
+# Point the OpenAI client at optillm's built-in inference server.
+client = OpenAI(api_key=OPENAI_KEY, base_url=OPENAI_BASE_URL)
+
+messages = [{"role": "user", "content": "Say hello."}]  # your chat messages
+
+response = client.chat.completions.create(
+    # Base model plus two LoRA adapters, separated by `+`.
+    model="meta-llama/Llama-3.2-1B-Instruct+patched-codes/Llama-3.2-1B-FastApply+patched-codes/Llama-3.2-1B-FixVulns",
+    messages=messages,
+    temperature=0.2,
+    logprobs=True,
+    top_logprobs=3,
+    # Select which LoRA adapter is active for this request.
+    extra_body={"active_adapter": "patched-codes/Llama-3.2-1B-FastApply"},
+)
+```
+
+### Starting the optillm proxy with an external server (e.g. llama.cpp or ollama)
+
+- Set the `OPENAI_API_KEY` env variable to a placeholder value
+  - e.g. `export OPENAI_API_KEY="sk-no-key"`
+- Run `./llama-server -c 4096 -m path_to_model` to start the server with the specified model and a context length of 4096 tokens
+- Run `python3 optillm.py --base_url base_url` to start the proxy
+  - e.g. for llama.cpp, run `python3 optillm.py --base_url http://localhost:8080/v1`
+
+> [!WARNING]
+> Note that llama-server (and ollama) currently do not support sampling multiple responses from a model, which limits the available approaches to the following:
+> `cot_reflection`, `leap`, `plansearch`, `rstar`, `rto`, `self_consistency`, `re2`, and `z3`. To use the remaining approaches, use the built-in local inference server, which supports sampling multiple responses.
 
 ## Implemented techniques
 
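
The warning above hinges on requests that sample more than one completion per call. A minimal sketch of such a request against the built-in local inference server, assuming `OPTILLM_API_KEY="optillm"` was exported as in the local inference section and that the server accepts the standard `n` parameter for multiple samples:

```python
from openai import OpenAI

# Illustrative sketch against optillm's built-in local inference server;
# assumes `export OPTILLM_API_KEY="optillm"` and the proxy on localhost:8000.
client = OpenAI(api_key="optillm", base_url="http://localhost:8000/v1")

# n=3 asks for three sampled completions of the same prompt -- the kind of
# request the warning says llama-server and ollama currently do not support.
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[{"role": "user", "content": "Suggest one name for a test fixture."}],
    n=3,
    temperature=0.7,
)

for i, choice in enumerate(response.choices):
    print(f"--- sample {i} ---")
    print(choice.message.content)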
