---
mapped_pages:
  - https://www.elastic.co/guide/en/observability/current/observability-llm-performance-matrix.html
applies_to:
  stack: ga 9.2
  serverless: ga
products:
  - id: observability
---

# Large language model performance matrix

This page summarizes internal test results comparing large language models (LLMs) across {{obs-ai-assistant}} use cases. To learn more about these use cases, refer to [AI Assistant](/solutions/observability/observability-ai-assistant.md).

::::{important}
Rating legend:

**Excellent:** Highly accurate and reliable for the use case.<br>
**Great:** Strong performance with minor limitations.<br>
**Good:** Possibly adequate for many use cases, but with noticeable tradeoffs.<br>
**Poor:** Significant issues; not recommended for production use in this use case.

Recommended models are those rated **Excellent** or **Great** for the particular use case.
::::

## Proprietary models [_proprietary_models]

Models from third-party LLM providers.

| Provider | Model | **Alert questions** | **APM questions** | **Contextual insights** | **Documentation retrieval** | **Elasticsearch operations** | **{{esql}} generation** | **Execute connector** | **Knowledge retrieval** |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Amazon Bedrock | **Claude Sonnet 3.5** | Excellent | Excellent | Excellent | Excellent | Excellent | Great | Good | Excellent |
| Amazon Bedrock | **Claude Sonnet 3.7** | Excellent | Excellent | Excellent | Excellent | Excellent | Great | Great | Excellent |
| Amazon Bedrock | **Claude Sonnet 4** | Excellent | Excellent | Excellent | Excellent | Excellent | Excellent | Great | Excellent |
| OpenAI | **GPT-4.1** | Excellent | Excellent | Excellent | Excellent | Excellent | Great | Good | Excellent |
| Google Gemini | **Gemini 2.0 Flash** | Excellent | Good | Excellent | Excellent | Excellent | Good | Good | Excellent |
| Google Gemini | **Gemini 2.5 Flash** | Excellent | Good | Excellent | Excellent | Excellent | Good | Good | Excellent |
| Google Gemini | **Gemini 2.5 Pro** | Excellent | Great | Excellent | Excellent | Excellent | Good | Good | Excellent |

## Open-source models [_open_source_models]

```{applies_to}
stack: preview 9.2
serverless: preview
```

Models you can [deploy and manage yourself](/solutions/observability/connect-to-own-local-llm.md).

| Provider | Model | **Alert questions** | **APM questions** | **Contextual insights** | **Documentation retrieval** | **Elasticsearch operations** | **{{esql}} generation** | **Execute connector** | **Knowledge retrieval** |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Meta | **Llama-3.3-70B-Instruct** | Excellent | Good | Great | Excellent | Excellent | Good | Good | Excellent |
| Mistral | **Mistral-Small-3.2-24B-Instruct-2506** | Excellent | Poor | Great | Great | Excellent | Poor | Good | Excellent |

::::{note}
`Llama-3.3-70B-Instruct` is supported with simulated function calling.
::::

## Evaluate your own model

You can run the {{obs-ai-assistant}} evaluation framework against any model, and use it to benchmark a custom or self-hosted model against the use cases in the matrix. Refer to the [evaluation framework README](https://github.com/elastic/kibana/blob/main/x-pack/solutions/observability/plugins/observability_ai_assistant_app/scripts/evaluation/README.md) for setup and usage details.

For consistency, all ratings in this matrix were generated using `Gemini 2.5 Pro` as the judge model (specified via the `--evaluateWith` flag). Use the same judge when evaluating your own model to ensure comparable results.
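
As a rough illustration, an evaluation run might look like the following. This is a minimal sketch, not a verbatim command: the script entry point is assumed from the README location linked above, the connector ID is a placeholder, and any flags other than `--evaluateWith` (such as selecting the connector for the model under test) are documented in the README rather than shown here.

```bash
# Minimal sketch (assumptions noted in the text above): run from the root of a
# Kibana checkout against a running Kibana/Elasticsearch instance with your
# connectors configured. The script path follows the README linked above; the
# judge connector ID is a placeholder, and other required flags are in the README.
node x-pack/solutions/observability/plugins/observability_ai_assistant_app/scripts/evaluation/index.js \
  --evaluateWith <gemini-2.5-pro-connector-id>
```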