[Obs AI Assistant] Add LLM performance matrix docs #2812

---
mapped_pages:
  - https://www.elastic.co/guide/en/observability/current/observability-llm-performance-matrix.html
applies_to:
  stack: ga 9.2
  serverless: ga
products:
  - id: observability
---

# Large language model performance matrix

_Last updated: 4 September 2025_

This page summarizes internal test results comparing large language models (LLMs) across {{obs-ai-assistant}} use cases. To learn more about these use cases, refer to [AI Assistant](/solutions/observability/observability-ai-assistant.md).

::::{important}
Rating legend

**Excellent** – Highly accurate and reliable for the use case.<br>
**Great** – Strong performance with minor limitations.<br>
**Good** – Possibly adequate for many use cases but with noticeable tradeoffs.<br>
**Poor** – Significant issues; not recommended for this use case in production.

Recommended models are those rated **Excellent** or **Great** for the particular use case.
::::

## Proprietary models [_proprietary_models]

Models from third-party LLM providers.

| Provider | Model | **Alert questions** | **APM questions** | **Contextual insights** | **Documentation retrieval** | **Elasticsearch operations** | **{{esql}} generation** | **Execute connector** | **Knowledge retrieval** |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Amazon Bedrock | **Claude Sonnet 3.5** | Excellent | Excellent | Excellent | Excellent | Excellent | Excellent | Great | Excellent |
| Amazon Bedrock | **Claude Sonnet 3.7** | Excellent | Excellent | Excellent | Excellent | Excellent | Excellent | Excellent | Excellent |
| Amazon Bedrock | **Claude Sonnet 4** | Excellent | Excellent | Excellent | Excellent | Excellent | Excellent | Excellent | Excellent |
| OpenAI | **GPT-4.1** | Excellent | Excellent | Excellent | Excellent | Excellent | Excellent | Great | Excellent |
| Google Gemini | **Gemini 2.0 Flash** | Excellent | Good | Great | Excellent | Excellent | Great | Great | Excellent |
| Google Gemini | **Gemini 2.5 Flash** | Excellent | Great | Excellent | Excellent | Excellent | Great | Great | Excellent |
| Google Gemini | **Gemini 2.5 Pro** | Excellent | Excellent | Excellent | Excellent | Great | Great | Excellent | Excellent |

## Open-source models [_open_source_models]

Models you can [deploy and manage yourself](/solutions/observability/connect-to-own-local-llm.md).

| Provider | Model | **Alert questions** | **APM questions** | **Contextual insights** | **Documentation retrieval** | **Elasticsearch operations** | **{{esql}} generation** | **Execute connector** | **Knowledge retrieval** |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Meta | **Llama-3.3-70B-Instruct** | Excellent | Good | Great | Excellent | Excellent | Great | Great | Excellent |
| Mistral | **Mistral-Small-3.2-24B-Instruct-2506** | Excellent | Good | Great | Excellent | Excellent | Poor | Great | Excellent |

::::{note}
`Llama-3.3-70B-Instruct` is currently supported with simulated function calling.
::::

## Evaluate your own model

You can run the {{obs-ai-assistant}} evaluation framework against any model of your choice. See the [evaluation framework README](https://github.com/elastic/kibana/blob/main/x-pack/solutions/observability/plugins/observability_ai_assistant_app/scripts/evaluation/README.md) for setup and usage details.

You can use it to benchmark a custom or self-hosted model against the use cases in this matrix, then compare your results with the ratings above.
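
For orientation, a minimal run could look like the sketch below. This is an assumption-laden illustration, not the documented invocation: the entry point lives under the `scripts/evaluation` directory linked above, but the exact script name, flags, and values (Kibana URL, connector ID) may differ, so confirm them in the README before running.

```bash
# Illustrative sketch only — the entry point and flags shown here are assumptions;
# check the evaluation framework README for the options your Kibana version supports.
# Runs the evaluation suite against a local Kibana, using the connector configured
# for the model you want to benchmark.
node x-pack/solutions/observability/plugins/observability_ai_assistant_app/scripts/evaluation/index.js \
  --kibana http://elastic:changeme@localhost:5601 \
  --connectorId <your-model-connector-id>
```

Comparing the per-use-case scores from such a run against the tables above gives a rough sense of where a custom or self-hosted model lands on the rating scale.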