LoRAX (LoRA eXchange) is a framework that allows users to serve thousands of fine-tuned models on a single GPU, dramatically reducing the cost of serving without compromising on throughput or latency.

## 🌳 Features
- 🚅 **Dynamic Adapter Loading:** include any fine-tuned LoRA adapter from [HuggingFace](./models/adapters/index.md#huggingface-hub), [Predibase](./models/adapters/index.md#predibase), or [any filesystem](./models/adapters/index.md#local) in your request; it will be loaded just-in-time without blocking concurrent requests. [Merge adapters](./guides/merging_adapters.md) per request to instantly create powerful ensembles.
- 🏋️‍♀️ **Heterogeneous Continuous Batching:** packs requests for different adapters together into the same batch, keeping latency and throughput nearly constant with the number of concurrent adapters.
- 🧁 **Adapter Exchange Scheduling:** asynchronously prefetches and offloads adapters between GPU and CPU memory, and schedules request batching to optimize the aggregate throughput of the system.
- 👬 **Optimized Inference:** high-throughput and low-latency optimizations including tensor parallelism, pre-compiled CUDA kernels ([flash-attention](https://arxiv.org/abs/2307.08691), [paged attention](https://arxiv.org/abs/2309.06180), [SGMV](https://arxiv.org/abs/2310.18547)), quantization, and token streaming.
- 🚢 **Ready for Production:** prebuilt Docker images, Helm charts for Kubernetes, Prometheus metrics, and distributed tracing with OpenTelemetry. OpenAI-compatible API supporting multi-turn chat conversations. Private adapters through per-request tenant isolation. [Structured Output](./guides/structured_output.md) (JSON mode).
- 🤯 **Free for Commercial Use:** Apache 2.0 License. Enough said 😎.
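From the client's perspective, per-request adapter loading amounts to one extra parameter on the request body. As a minimal sketch (the adapter name and prompt below are illustrative placeholders, not values from this page):

```python
import json

# Illustrative request body for a LoRAX generation request.
# The adapter_id names the fine-tuned LoRA adapter to apply for this
# request only; the server loads it just-in-time without blocking
# concurrent requests that use other adapters.
payload = {
    "inputs": "[INST] What is the capital of France? [/INST]",
    "parameters": {
        "adapter_id": "some-org/some-lora-adapter",  # hypothetical adapter name
        "max_new_tokens": 64,
    },
}

body = json.dumps(payload)
print(body)
```

Requests naming different adapters can be sent concurrently; heterogeneous continuous batching packs them into the same batch on the server.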
Serving a fine-tuned model with LoRAX consists of two components:
- [Base Model](./models/base_models.md): pretrained large model shared across all adapters.
- [Adapter](./models/adapters/index.md): task-specific adapter weights dynamically loaded per request.
LoRAX supports a number of Large Language Models as the base model including [Llama](https://huggingface.co/meta-llama) (including [CodeLlama](https://huggingface.co/codellama)), [Mistral](https://huggingface.co/mistralai) (including [Zephyr](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta)), and [Qwen](https://huggingface.co/Qwen). See [Supported Architectures](./models/base_models.md#supported-architectures) for a complete list of supported base models.
We recommend starting with our pre-built Docker image to avoid compiling custom CUDA kernels and other dependencies.
The minimum system requirements needed to run LoRAX include:
See [Reference - Python Client](./reference/python_client/client.md) for full details.
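Under the hood, the client issues HTTP requests to the server. A rough standard-library sketch of such a call (the endpoint path, `adapter_id` parameter, and `generated_text` response field are assumptions about the server API, not taken from this page):

```python
import json
from urllib import request


def generate(prompt: str, adapter_id: str,
             url: str = "http://127.0.0.1:8080/generate") -> str:
    """POST a prompt to a running LoRAX server, applying a per-request adapter."""
    body = json.dumps({
        "inputs": prompt,
        "parameters": {"adapter_id": adapter_id, "max_new_tokens": 64},
    }).encode("utf-8")
    req = request.Request(url, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        # Assumed response shape: {"generated_text": "..."}
        return json.load(resp)["generated_text"]
```

In practice, prefer the Python client linked above, which wraps this plumbing and supports streaming.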
For other ways to run LoRAX, see [Getting Started - Kubernetes](./getting_started/kubernetes.md), [Getting Started - SkyPilot](./getting_started/skypilot.md), and [Getting Started - Local](./getting_started/local.md).
For all histograms, two metrics are generated automatically: the metric name suffixed with `_sum` and with `_count`, giving the sum of all observed values for that histogram and the count of observations, respectively.
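For example, given a hypothetical histogram named `lorax_request_duration`, the mean observed value can be recovered from the two autogenerated series:

```python
# Values as they might be scraped from the metrics endpoint (hypothetical numbers).
request_duration_sum = 12.5   # lorax_request_duration_sum: sum of all observed durations
request_duration_count = 50   # lorax_request_duration_count: number of observations

# Dividing the running sum by the count yields the mean observation.
mean_duration = request_duration_sum / request_duration_count
print(mean_duration)  # → 0.25
```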