
Commit b3944ad

added metrics docs, updated links in main docs (#663)
1 parent bd92e52 commit b3944ad

File tree

3 files changed: +33 −15 lines


docs/index.md

Lines changed: 13 additions & 15 deletions
@@ -27,25 +27,23 @@ LoRAX (LoRA eXchange) is a framework that allows users to serve thousands of fin
 
 ## 🌳 Features
 
-- 🚅 **Dynamic Adapter Loading:** include any fine-tuned LoRA adapter from [HuggingFace](./models/adapters.md#huggingface-hub), [Predibase](./models/adapters.md#predibase), or [any filesystem](./models/adapters.md#local) in your request, it will be loaded just-in-time without blocking concurrent requests. [Merge adapters](./guides/merging_adapters.md) per request to instantly create powerful ensembles.
-- 🏋️‍♀️ **Heterogeneous Continuous Batching:** packs requests for different adapters together into the same batch, keeping latency and throughput nearly constant with the number of concurrent adapters.
-- 🧁 **Adapter Exchange Scheduling:** asynchronously prefetches and offloads adapters between GPU and CPU memory, schedules request batching to optimize the aggregate throughput of the system.
-- 👬 **Optimized Inference:** high throughput and low latency optimizations including tensor parallelism, pre-compiled CUDA kernels ([flash-attention](https://arxiv.org/abs/2307.08691), [paged attention](https://arxiv.org/abs/2309.06180), [SGMV](https://arxiv.org/abs/2310.18547)), quantization, token streaming.
-- 🚢 **Ready for Production** prebuilt Docker images, Helm charts for Kubernetes, Prometheus metrics, and distributed tracing with Open Telemetry. OpenAI compatible API supporting multi-turn chat conversations. Private adapters through per-request tenant isolation. [Structured Output](./guides/structured_output.md) (JSON mode).
-- 🤯 **Free for Commercial Use:** Apache 2.0 License. Enough said 😎.
-
+- 🚅 **Dynamic Adapter Loading:** include any fine-tuned LoRA adapter from [HuggingFace](./models/adapters/index.md#huggingface-hub), [Predibase](./models/adapters/index.md#predibase), or [any filesystem](./models/adapters/index.md#local) in your request, it will be loaded just-in-time without blocking concurrent requests. [Merge adapters](./guides/merging_adapters.md) per request to instantly create powerful ensembles.
+- 🏋️‍♀️ **Heterogeneous Continuous Batching:** packs requests for different adapters together into the same batch, keeping latency and throughput nearly constant with the number of concurrent adapters.
+- 🧁 **Adapter Exchange Scheduling:** asynchronously prefetches and offloads adapters between GPU and CPU memory, schedules request batching to optimize the aggregate throughput of the system.
+- 👬 **Optimized Inference:** high throughput and low latency optimizations including tensor parallelism, pre-compiled CUDA kernels ([flash-attention](https://arxiv.org/abs/2307.08691), [paged attention](https://arxiv.org/abs/2309.06180), [SGMV](https://arxiv.org/abs/2310.18547)), quantization, token streaming.
+- 🚢 **Ready for Production** prebuilt Docker images, Helm charts for Kubernetes, Prometheus metrics, and distributed tracing with Open Telemetry. OpenAI compatible API supporting multi-turn chat conversations. Private adapters through per-request tenant isolation. [Structured Output](./guides/structured_output.md) (JSON mode).
+- 🤯 **Free for Commercial Use:** Apache 2.0 License. Enough said 😎.
 
 <p align="center">
   <img src="https://github.com/predibase/lorax/assets/29719151/f88aa16c-66de-45ad-ad40-01a7874ed8a9" />
 </p>
 
-
 ## 🏠 Models
 
 Serving a fine-tuned model with LoRAX consists of two components:
 
-- [Base Model](./models/base_models.md): pretrained large model shared across all adapters.
-- [Adapter](./models/adapter.md): task-specific adapter weights dynamically loaded per request.
+- [Base Model](./models/base_models.md): pretrained large model shared across all adapters.
+- [Adapter](./models/adapters/index.md): task-specific adapter weights dynamically loaded per request.
 
 LoRAX supports a number of Large Language Models as the base model including [Llama](https://huggingface.co/meta-llama) (including [CodeLlama](https://huggingface.co/codellama)), [Mistral](https://huggingface.co/mistralai) (including [Zephyr](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta)), and [Qwen](https://huggingface.co/Qwen). See [Supported Architectures](./models/base_models.md#supported-architectures) for a complete list of supported base models.

@@ -61,10 +59,10 @@ We recommend starting with our pre-built Docker image to avoid compiling custom
 
 The minimum system requirements need to run LoRAX include:
 
-- Nvidia GPU (Ampere generation or above)
-- CUDA 11.8 compatible device drivers and above
-- Linux OS
-- Docker (for this guide)
+- Nvidia GPU (Ampere generation or above)
+- CUDA 11.8 compatible device drivers and above
+- Linux OS
+- Docker (for this guide)
 
 ### Launch LoRAX Server
 
@@ -124,7 +122,7 @@ adapter_id = "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k"
 print(client.generate(prompt, max_new_tokens=64, adapter_id=adapter_id).generated_text)
 ```
 
-See [Reference - Python Client](./reference/python_client.md) for full details.
+See [Reference - Python Client](./reference/python_client/client.md) for full details.
 
 For other ways to run LoRAX, see [Getting Started - Kubernetes](./getting_started/kubernetes.md), [Getting Started - SkyPilot](./getting_started/skypilot.md), and [Getting Started - Local](./getting_started/local.md).
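The `client.generate(...)` call in the quickstart snippet corresponds to a plain HTTP request against the server's `/generate` endpoint. A minimal sketch of how such a request body could be assembled — the field names (`inputs`, `parameters`, `adapter_id`) follow the TGI-style schema LoRAX exposes, but treat the exact shape as an assumption and check the REST reference:

```python
import json

def build_generate_payload(prompt, adapter_id=None, max_new_tokens=64):
    """Build a JSON body for a POST to the server's /generate endpoint.

    Field names are assumptions based on the TGI-style API; consult the
    LoRAX REST reference for the authoritative schema.
    """
    parameters = {"max_new_tokens": max_new_tokens}
    if adapter_id is not None:
        # The adapter to load just-in-time for this request.
        parameters["adapter_id"] = adapter_id
    return {"inputs": prompt, "parameters": parameters}

payload = build_generate_payload(
    "[INST] What is 6 times 7? [/INST]",
    adapter_id="vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k",
)
print(json.dumps(payload, indent=2))
```

Sending this body (e.g. with `curl` or `requests.post`) to a running server should be equivalent to the Python client call above.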

docs/reference/metrics.md

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+# Metrics
+
+Prometheus-compatible metrics are made available on the default port, on the `/metrics` endpoint.
+
+Below is a list of the metrics that are exposed:
+| Metric Name                                  | Type      |
+| -------------------------------------------- | --------- |
+| `lorax_request_count`                        | Counter   |
+| `lorax_request_success`                      | Counter   |
+| `lorax_request_failure`                      | Counter   |
+| `lorax_request_duration`                     | Histogram |
+| `lorax_request_queue_duration`               | Histogram |
+| `lorax_request_validation_duration`          | Histogram |
+| `lorax_request_inference_duration`           | Histogram |
+| `lorax_request_mean_time_per_token_duration` | Histogram |
+| `lorax_request_generated_tokens`             | Histogram |
+| `lorax_request_input_length`                 | Histogram |
+
+Each histogram also has autogenerated `_sum` and `_count` series: the sum of all observed values and the number of observations, respectively.
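The `_sum`/`_count` pair makes it straightforward to derive averages from a scrape. A minimal sketch under stated assumptions — the sample text is made up, and a real scrape would fetch the `/metrics` endpoint over HTTP (host and port depend on your deployment); labeled series are not aggregated here:

```python
def mean_from_histogram(metrics_text, name):
    """Compute the mean of a Prometheus histogram from its _sum and _count.

    Assumes the standard Prometheus text exposition format and a single
    (unlabeled) series per metric; a labeled series would overwrite the
    previous value rather than being summed.
    """
    total = count = None
    for line in metrics_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        key, _, value = line.partition(" ")
        base = key.split("{")[0]  # drop any {label="..."} suffix
        if base == name + "_sum":
            total = float(value)
        elif base == name + "_count":
            count = float(value)
    if not count:
        return None
    return total / count

# Hypothetical scrape output for illustration only.
sample = """\
lorax_request_duration_sum 12.5
lorax_request_duration_count 50
"""
print(mean_from_histogram(sample, "lorax_request_duration"))  # → 0.25
```

The same division works for any of the histograms in the table, e.g. average generated tokens per request from `lorax_request_generated_tokens_sum` over `_count`.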

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -50,6 +50,7 @@ nav:
 - lorax.client: reference/python_client/client.md
 # - lorax.types: reference/python_client/types.md
 - OpenAI Compatible API: reference/openai_api.md
+- Metrics: reference/metrics.md
 - 🔬 Guides:
 - Quantization: guides/quantization.md
 - Structured Output (JSON): guides/structured_output.md
