Commit 64739ad

docs: OpenAI compatible API (#174)
1 parent a90d443 commit 64739ad

File tree

7 files changed (+655, -7 lines)


README.md

Lines changed: 30 additions & 1 deletion
@@ -25,6 +25,7 @@ LoRAX (LoRA eXchange) is a framework that allows users to serve thousands of fin
  - [Launch LoRAX Server](#launch-lorax-server)
  - [Prompt via REST API](#prompt-via-rest-api)
  - [Prompt via Python Client](#prompt-via-python-client)
+ - [Chat via OpenAI API](#chat-via-openai-api)
  - [Next steps](#next-steps)
  - [🙇 Acknowledgements](#-acknowledgements)
  - [🗺️ Roadmap](#️-roadmap)

@@ -35,7 +36,7 @@ LoRAX (LoRA eXchange) is a framework that allows users to serve thousands of fin
  - 🏋️‍♀️ **Heterogeneous Continuous Batching:** packs requests for different adapters together into the same batch, keeping latency and throughput nearly constant with the number of concurrent adapters.
  - 🧁 **Adapter Exchange Scheduling:** asynchronously prefetches and offloads adapters between GPU and CPU memory, schedules request batching to optimize the aggregate throughput of the system.
  - 👬 **Optimized Inference:** high throughput and low latency optimizations including tensor parallelism, pre-compiled CUDA kernels ([flash-attention](https://arxiv.org/abs/2307.08691), [paged attention](https://arxiv.org/abs/2309.06180), [SGMV](https://arxiv.org/abs/2310.18547)), quantization, token streaming.
- - 🚢 **Ready for Production** prebuilt Docker images, Helm charts for Kubernetes, Prometheus metrics, and distributed tracing with Open Telemetry.
+ - 🚢 **Ready for Production:** prebuilt Docker images, Helm charts for Kubernetes, Prometheus metrics, and distributed tracing with Open Telemetry. OpenAI-compatible API supporting multi-turn chat conversations.
  - 🤯 **Free for Commercial Use:** Apache 2.0 License. Enough said 😎.

@@ -134,6 +135,34 @@ See [Reference - Python Client](https://predibase.github.io/lorax/reference/pyth

For other ways to run LoRAX, see [Getting Started - Kubernetes](https://predibase.github.io/lorax/getting_started/kubernetes), [Getting Started - SkyPilot](https://predibase.github.io/lorax/getting_started/skypilot), and [Getting Started - Local](https://predibase.github.io/lorax/getting_started/local).

### Chat via OpenAI API

LoRAX supports multi-turn chat conversations combined with dynamic adapter loading through an OpenAI-compatible API. Just specify any adapter as the `model` parameter.

```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://127.0.0.1:8080/v1",
)

resp = client.chat.completions.create(
    model="alignment-handbook/zephyr-7b-dpo-lora",
    messages=[
        {
            "role": "system",
            "content": "You are a friendly chatbot who always responds in the style of a pirate",
        },
        {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
    ],
    max_tokens=100,
)
print("Response:", resp.choices[0].message.content)
```

See [OpenAI Compatible API](https://predibase.github.io/lorax/guides/openai_api) for details.

### Next steps

Here are some other interesting Mistral-7B fine-tuned models to try out:

docs/guides/openai_api.md

Lines changed: 125 additions & 0 deletions
@@ -0,0 +1,125 @@

LoRAX supports [OpenAI Chat Completions v1](https://platform.openai.com/docs/api-reference/completions/create) compatible endpoints that serve as a drop-in replacement for the OpenAI SDK. It supports multi-turn chat conversations while retaining support for dynamic adapter loading.

## Chat Completions v1

Using the existing OpenAI Python SDK, replace the `base_url` with your LoRAX endpoint with `/v1` appended. The `api_key` can be anything, as it is unused.

The `model` parameter can be set to the empty string `""` to use the base model, or to any adapter ID on the HuggingFace Hub.

```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://127.0.0.1:8080/v1",
)

resp = client.chat.completions.create(
    model="alignment-handbook/zephyr-7b-dpo-lora",
    messages=[
        {
            "role": "system",
            "content": "You are a friendly chatbot who always responds in the style of a pirate",
        },
        {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
    ],
    max_tokens=100,
)
print("Response:", resp.choices[0].message.content)
```
### Streaming

The streaming API is supported with the `stream=True` parameter:

```python
stream = client.chat.completions.create(
    model="alignment-handbook/zephyr-7b-dpo-lora",
    messages=[
        {
            "role": "system",
            "content": "You are a friendly chatbot who always responds in the style of a pirate",
        },
        {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
    ],
    max_tokens=100,
    stream=True,
)

for chunk in stream:
    print(chunk)
```
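When streaming, each chunk carries an incremental fragment of the reply in `choices[0].delta.content` rather than a full message. A minimal sketch of assembling the complete text, using stand-in chunk objects in place of a live stream:

```python
from types import SimpleNamespace

# Stand-in chunks shaped like the SDK's streaming chunks; a live stream
# from the server yields objects with the same nested fields.
chunks = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=part))])
    for part in ["Arr", ", matey", "!", None]
]

reply = ""
for chunk in chunks:
    delta = chunk.choices[0].delta.content
    if delta is not None:  # the final chunk may carry no content
        reply += delta

print(reply)  # Arr, matey!
```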
### REST API

The REST API can be used directly, in addition to the Python SDK:

```bash
curl http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "alignment-handbook/zephyr-7b-dpo-lora",
        "messages": [
            {
                "role": "system",
                "content": "You are a friendly chatbot who always responds in the style of a pirate"
            },
            {
                "role": "user",
                "content": "How many helicopters can a human eat in one sitting?"
            }
        ],
        "max_tokens": 100
    }'
```
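The endpoint returns JSON in the OpenAI chat completion response format, so the reply can be pulled out of the decoded payload directly. The payload below is a hand-written sample for illustration, not actual server output:

```python
import json

# Hand-written sample payload in the OpenAI chat completion response shape;
# actual ids, extra fields, and usage numbers will differ.
raw = '''{
  "object": "chat.completion",
  "model": "alignment-handbook/zephyr-7b-dpo-lora",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Arr, none, matey!"},
      "finish_reason": "stop"
    }
  ]
}'''

resp = json.loads(raw)
print(resp["choices"][0]["message"]["content"])  # Arr, none, matey!
```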
### Chat Templates

Multi-turn chat conversations are supported through [HuggingFace chat templates](https://huggingface.co/docs/transformers/chat_templating).

If the adapter selected with the `model` parameter has its own tokenizer and chat template, LoRAX applies the adapter's chat template to the request during inference. If the adapter does not have its own chat template, LoRAX falls back to the base model's chat template. If neither exists, an error is raised, as chat templates are required for multi-turn conversations.
## (Legacy) Completions v1

The legacy Completions v1 API can be used as well:

```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://127.0.0.1:8080/v1",
)

# synchronous completion
completion = client.completions.create(
    model=adapter_id,
    prompt=prompt,
)
print("Completion result:", completion.choices[0].text)

# streaming completion
completion_stream = client.completions.create(
    model=adapter_id,
    prompt=prompt,
    stream=True,
)

for message in completion_stream:
    print("Completion message:", message)
```

REST:

```bash
curl http://127.0.0.1:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "",
        "prompt": "Instruct: Write a detailed analogy between mathematics and a lighthouse.\nOutput:",
        "max_tokens": 100
    }'
```

docs/index.md

Lines changed: 29 additions & 1 deletion
@@ -31,7 +31,7 @@ LoRAX (LoRA eXchange) is a framework that allows users to serve thousands of fin
  - 🏋️‍♀️ **Heterogeneous Continuous Batching:** packs requests for different adapters together into the same batch, keeping latency and throughput nearly constant with the number of concurrent adapters.
  - 🧁 **Adapter Exchange Scheduling:** asynchronously prefetches and offloads adapters between GPU and CPU memory, schedules request batching to optimize the aggregate throughput of the system.
  - 👬 **Optimized Inference:** high throughput and low latency optimizations including tensor parallelism, pre-compiled CUDA kernels ([flash-attention](https://arxiv.org/abs/2307.08691), [paged attention](https://arxiv.org/abs/2309.06180), [SGMV](https://arxiv.org/abs/2310.18547)), quantization, token streaming.
- - 🚢 **Ready for Production** prebuilt Docker images, Helm charts for Kubernetes, Prometheus metrics, and distributed tracing with Open Telemetry.
+ - 🚢 **Ready for Production:** prebuilt Docker images, Helm charts for Kubernetes, Prometheus metrics, and distributed tracing with Open Telemetry. OpenAI-compatible API supporting multi-turn chat conversations.
  - 🤯 **Free for Commercial Use:** Apache 2.0 License. Enough said 😎.
@@ -119,6 +119,34 @@ See [Reference - Python Client](./reference/python_client.md) for full details.

For other ways to run LoRAX, see [Getting Started - Kubernetes](./getting_started/kubernetes.md), [Getting Started - SkyPilot](./getting_started/skypilot.md), and [Getting Started - Local](./getting_started/local.md).

### Chat via OpenAI API

LoRAX supports multi-turn chat conversations combined with dynamic adapter loading through an OpenAI-compatible API. Just specify any adapter as the `model` parameter.

```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://127.0.0.1:8080/v1",
)

resp = client.chat.completions.create(
    model="alignment-handbook/zephyr-7b-dpo-lora",
    messages=[
        {
            "role": "system",
            "content": "You are a friendly chatbot who always responds in the style of a pirate",
        },
        {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
    ],
    max_tokens=100,
)
print("Response:", resp.choices[0].message.content)
```

See [OpenAI Compatible API](./guides/openai_api.md) for details.

## 🙇 Acknowledgements

LoRAX is built on top of HuggingFace's [text-generation-inference](https://github.com/huggingface/text-generation-inference), forked from v0.9.4 (Apache 2.0).
