LoRAX (LoRA eXchange) is a framework that allows users to serve thousands of fine-tuned models on a single GPU, dramatically reducing the cost of serving without compromising on throughput or latency.

## 🌳 Features
- 🚅 **Dynamic Adapter Loading:** include any fine-tuned LoRA adapter from [HuggingFace](./models/adapters/index.md#huggingface-hub), [Predibase](./models/adapters/index.md#predibase), or [any filesystem](./models/adapters/index.md#local) in your request; it will be loaded just-in-time without blocking concurrent requests. [Merge adapters](./guides/merging_adapters.md) per request to instantly create powerful ensembles.
- 🏋️‍♀️ **Heterogeneous Continuous Batching:** packs requests for different adapters together into the same batch, keeping latency and throughput nearly constant with the number of concurrent adapters.
- 🧁 **Adapter Exchange Scheduling:** asynchronously prefetches and offloads adapters between GPU and CPU memory, and schedules request batching to optimize the aggregate throughput of the system.
- 👬 **Optimized Inference:** high-throughput and low-latency optimizations including tensor parallelism, pre-compiled CUDA kernels ([flash-attention](https://arxiv.org/abs/2307.08691), [paged attention](https://arxiv.org/abs/2309.06180), [SGMV](https://arxiv.org/abs/2310.18547)), quantization, and token streaming.
- 🚢 **Ready for Production:** prebuilt Docker images, Helm charts for Kubernetes, Prometheus metrics, and distributed tracing with OpenTelemetry. OpenAI-compatible API supporting multi-turn chat conversations. Private adapters through per-request tenant isolation. [Structured Output](./guides/structured_output.md) (JSON mode).
- 🤯 **Free for Commercial Use:** Apache 2.0 License. Enough said 😎.
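From the client's perspective, per-request adapter loading amounts to one extra parameter on the request body. As a minimal sketch (the adapter name and prompt below are illustrative placeholders, not values from this page):

```python
import json

# Illustrative request body for a LoRAX generation request.
# The adapter_id names the fine-tuned LoRA adapter to apply for this
# request only; the server loads it just-in-time without blocking
# concurrent requests that use other adapters.
payload = {
    "inputs": "[INST] What is the capital of France? [/INST]",
    "parameters": {
        "adapter_id": "some-org/some-lora-adapter",  # hypothetical adapter name
        "max_new_tokens": 64,
    },
}

body = json.dumps(payload)
print(body)
```

Requests naming different adapters can be sent concurrently; heterogeneous continuous batching packs them into the same batch on the server.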
Serving a fine-tuned model with LoRAX consists of two components:
- [Base Model](./models/base_models.md): pretrained large model shared across all adapters.
- [Adapter](./models/adapters/index.md): task-specific adapter weights dynamically loaded per request.
LoRAX supports a number of Large Language Models as the base model including [Llama](https://huggingface.co/meta-llama) (including [CodeLlama](https://huggingface.co/codellama)), [Mistral](https://huggingface.co/mistralai) (including [Zephyr](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta)), and [Qwen](https://huggingface.co/Qwen). See [Supported Architectures](./models/base_models.md#supported-architectures) for a complete list of supported base models.
We recommend starting with our pre-built Docker image to avoid compiling custom CUDA kernels and other dependencies.
The minimum system requirements needed to run LoRAX include:
See [Reference - Python Client](./reference/python_client/client.md) for full details.
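Under the hood, the client issues HTTP requests to the server. A rough standard-library sketch of such a call (the endpoint path, `adapter_id` parameter, and `generated_text` response field are assumptions about the server API, not taken from this page):

```python
import json
from urllib import request


def generate(prompt: str, adapter_id: str,
             url: str = "http://127.0.0.1:8080/generate") -> str:
    """POST a prompt to a running LoRAX server, applying a per-request adapter."""
    body = json.dumps({
        "inputs": prompt,
        "parameters": {"adapter_id": adapter_id, "max_new_tokens": 64},
    }).encode("utf-8")
    req = request.Request(url, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        # Assumed response shape: {"generated_text": "..."}
        return json.load(resp)["generated_text"]
```

In practice, prefer the Python client linked above, which wraps this plumbing and supports streaming.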
For other ways to run LoRAX, see [Getting Started - Kubernetes](./getting_started/kubernetes.md), [Getting Started - SkyPilot](./getting_started/skypilot.md), and [Getting Started - Local](./getting_started/local.md).
For all histograms, two metrics are generated automatically: the metric name suffixed with `_sum` and with `_count`, giving the sum of all observed values for that histogram and the count of observations, respectively.
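For example, given a hypothetical histogram named `lorax_request_duration`, the mean observed value can be recovered from the two autogenerated series:

```python
# Values as they might be scraped from the metrics endpoint (hypothetical numbers).
request_duration_sum = 12.5   # lorax_request_duration_sum: sum of all observed durations
request_duration_count = 50   # lorax_request_duration_count: number of observations

# Dividing the running sum by the count yields the mean observation.
mean_duration = request_duration_sum / request_duration_count
print(mean_duration)  # → 0.25
```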