
Commit a0822ff

[docs] Add vLLM health check docs
1 parent 7c9baf5 commit a0822ff

File tree

1 file changed (+26, -0)

README.md

Lines changed: 26 additions & 0 deletions
@@ -311,6 +311,32 @@ parameters: {
}
```

## vLLM Health Check (BETA)

> [!NOTE]
> The vLLM Health Check feature is currently in BETA. Its features and
> functionality are subject to change as we collect feedback. We are excited to
> hear any thoughts you have!

The vLLM backend supports checking for
[vLLM Engine Health](https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/engine/async_llm_engine.py#L1177-L1185)
when an inference request is received. If the health check fails, the entire
model is unloaded, so it becomes NOT Ready on the server.

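A minimal sketch of observing this state from a client, assuming the
`tritonclient` Python package (`pip install tritonclient[http]`), a server at
`localhost:8000`, and a hypothetical model named `vllm_model` (neither name
comes from this repository):

```
# Poll Triton for the model's readiness state. After a failed vLLM
# health check, the model is unloaded, so this probe returns False.
# "vllm_model" and the URL are placeholders.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

if client.is_model_ready("vllm_model"):
    print("vllm_model is READY")
else:
    print("vllm_model is NOT Ready (it may have been unloaded "
          "by a failed health check)")
```
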
The health check is disabled by default. To enable it, set the following
parameter in the model config to true:
```
parameters: {
  key: "ENABLE_VLLM_HEALTH_CHECK"
  value: { string_value: "true" }
}
```
and select
[Model Control Mode EXPLICIT](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_management.md#model-control-mode-explicit)
when the server is started.

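EXPLICIT mode matters here because it allows a model that the health check has
unloaded to be loaded again without restarting the server. A hedged sketch
using the same assumed `tritonclient` package and placeholder model name as
above:

```
# In EXPLICIT model control mode, a client may ask Triton to (re)load
# a model by name. "vllm_model" is a placeholder.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Reload the model after it was unloaded by a failed health check.
client.load_model("vllm_model")
print("ready:", client.is_model_ready("vllm_model"))
```
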
Supported since r24.12.
## Referencing the Tutorial

You can read further in the
