Commit 33cda4b

Changes to multi-model documentation (#941)
- Clarify the use of BBR upfront, and dispatching to different InferencePool/EPP
- Fix typo in example inference request: both requests were sent to the same model

Signed-off-by: Etai Lev Ran <[email protected]>

Parent: 4c319af

File tree: 1 file changed (+5 −2 lines)
site-src/guides/serve-multiple-genai-models.md

Lines changed: 5 additions & 2 deletions
```diff
@@ -7,6 +7,9 @@ You can then route requests based on the model name (such as "chatbot" and "reco
 
 ## How
 The following diagram illustrates how Gateway API Inference Extension routes requests to different models based on the model name.
+The model name is extarcted by [Body-Based routing](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md)
+from the request body to the header. The header is then matched to dispatch
+requests to different `InferencePool` (and their EPPs) instances.
 ![Serving multiple generative AI models](../images/serve-mul-gen-AI-models.png)
 
 This example illustrates a conceptual example regarding how to use the `HTTPRoute` object to route based on model name like “chatbot” or “recommender” to `InferencePool`.
```
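To make the added explanation concrete, the header-based dispatch it describes could look roughly like the following `HTTPRoute` sketch. This is a hypothetical illustration, not the guide's actual manifest: the header name (`X-Gateway-Model-Name`), the gateway name, the pool names, and the `InferencePool` API group are all assumptions here and should be checked against the BBR README and the guide itself.

```yaml
# Hypothetical sketch: match on the model-name header that Body-Based
# Routing (BBR) copies out of the JSON request body, and dispatch each
# model to its own InferencePool. All names below are assumptions.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: routes-by-model          # assumed name
spec:
  parentRefs:
  - name: inference-gateway      # assumed Gateway name
  rules:
  - matches:
    - headers:
      - name: X-Gateway-Model-Name   # assumed BBR header name
        value: chatbot
    backendRefs:
    - group: inference.networking.x-k8s.io  # assumed API group
      kind: InferencePool
      name: chatbot-pool         # assumed pool name
  - matches:
    - headers:
      - name: X-Gateway-Model-Name
        value: recommender
    backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: recommender-pool     # assumed pool name
```

Each `InferencePool` then fronts its own set of model-server pods, with its EPP choosing the endpoint within the pool.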
```diff
@@ -63,9 +66,9 @@ curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
 
 3. Send a few requests to model "recommender" as follows:
 ```bash
 curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
-  "model": "chatbot",
+  "model": "recommender",
   "prompt": "Give me restaurant recommendations in Paris",
   "max_tokens": 100,
   "temperature": 0
 }'
-```
+```
```
