[Obs AI Assistant] Add LLM performance matrix docs #2812
Merged: viduni94 merged 33 commits into `elastic:main` from `viduni94:obs-ai-assistant-llm-performance-matrix` on Sep 18, 2025.
Changes shown are from the first 24 of 33 commits.

## Commits

- `135f57c` Add initial docs for performance matrix (viduni94)
- `ce47b33` Add link to the evaluation framework (viduni94)
- `d462cde` Add new docs to toc (viduni94)
- `2f75666` Update formatting (viduni94)
- `fb00b1d` Add Mistral (viduni94)
- `81f1d2a` Update wording (viduni94)
- `8de6028` Update legend (viduni94)
- `3f45db8` Update legend format (viduni94)
- `fc8d256` Update columns (viduni94)
- `2020139` Update rating for proprietary models (viduni94)
- `ea837bb` Remove rating for local models (viduni94)
- `0f180bf` Update llama and mistral small scores (viduni94)
- `a4d1397` Update mistral small es|ql rating (viduni94)
- `c641bee` Update llama model name in note (viduni94)
- `b5bfd40` Fix typo (viduni94)
- `2ed08b5` Update llm-performance-matrix.md (viduni94)
- `b4f3816` Update connect-to-own-local-llm.md (viduni94)
- `0ebcaf2` Update llm-performance-matrix.md (viduni94)
- `95dad45` Add judge model (viduni94)
- `ebb7d47` Merge branch 'main' into obs-ai-assistant-llm-performance-matrix (viduni94)
- `86da386` Update ratings (viduni94)
- `efba9b5` Update ratings to the new scale (viduni94)
- `9c991fc` Update date (viduni94)
- `cb7e64e` Merge branch 'main' into obs-ai-assistant-llm-performance-matrix (viduni94)
- `2e737ae` Update solutions/observability/llm-performance-matrix.md (viduni94)
- `4c2579f` Merge branch 'main' into obs-ai-assistant-llm-performance-matrix (florent-leborgne)
- `9ecce3d` Address review comments (viduni94)
- `8e093b9` Remove date (viduni94)
- `39d5c1b` Merge branch 'main' into obs-ai-assistant-llm-performance-matrix (florent-leborgne)
- `4ee68f0` Merge branch 'main' into obs-ai-assistant-llm-performance-matrix (viduni94)
- `24ab5c0` Merge branch 'main' into obs-ai-assistant-llm-performance-matrix (viduni94)
- `50ef7b2` Merge branch 'main' into obs-ai-assistant-llm-performance-matrix (viduni94)
- `4a47c0c` Merge branch 'main' into obs-ai-assistant-llm-performance-matrix (viduni94)
### `solutions/observability/llm-performance-matrix.md` (new file, 64 lines added)

---
mapped_pages:
  - https://www.elastic.co/guide/en/observability/current/observability-llm-performance-matrix.html
applies_to:
  stack: ga 9.2
  serverless: ga
products:
  - id: observability
---

# Large language model performance matrix

_Last updated: 15 September 2025_

This page summarizes internal test results comparing large language models (LLMs) across {{obs-ai-assistant}} use cases. To learn more about these use cases, refer to [AI Assistant](/solutions/observability/observability-ai-assistant.md).

::::{important}
Rating legend:

**Excellent:** Highly accurate and reliable for the use case.<br>
**Great:** Strong performance with minor limitations.<br>
**Good:** Possibly adequate for many use cases but with noticeable tradeoffs.<br>
**Poor:** Significant issues; not recommended for production for the use case.

Recommended models are those rated **Excellent** or **Great** for the particular use case.
::::

## Proprietary models [_proprietary_models]

Models from third-party LLM providers.

| Provider | Model | **Alert questions** | **APM questions** | **Contextual insights** | **Documentation retrieval** | **Elasticsearch operations** | **{{esql}} generation** | **Execute connector** | **Knowledge retrieval** |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Amazon Bedrock | **Claude Sonnet 3.5** | Excellent | Excellent | Excellent | Excellent | Excellent | Great | Good | Excellent |
| Amazon Bedrock | **Claude Sonnet 3.7** | Excellent | Excellent | Excellent | Excellent | Excellent | Great | Great | Excellent |
| Amazon Bedrock | **Claude Sonnet 4** | Excellent | Excellent | Excellent | Excellent | Excellent | Excellent | Great | Excellent |
| OpenAI | **GPT-4.1** | Excellent | Excellent | Excellent | Excellent | Excellent | Great | Good | Excellent |
| Google Gemini | **Gemini 2.0 Flash** | Excellent | Good | Excellent | Excellent | Excellent | Good | Good | Excellent |
| Google Gemini | **Gemini 2.5 Flash** | Excellent | Good | Excellent | Excellent | Excellent | Good | Good | Excellent |
| Google Gemini | **Gemini 2.5 Pro** | Excellent | Great | Excellent | Excellent | Excellent | Good | Good | Excellent |

## Open-source models [_open_source_models]

::::{warning}
This functionality is in technical preview and may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features.
::::

Models you can [deploy and manage yourself](/solutions/observability/connect-to-own-local-llm.md).

| Provider | Model | **Alert questions** | **APM questions** | **Contextual insights** | **Documentation retrieval** | **Elasticsearch operations** | **{{esql}} generation** | **Execute connector** | **Knowledge retrieval** |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Meta | **Llama-3.3-70B-Instruct** | Excellent | Good | Great | Excellent | Excellent | Good | Good | Excellent |
| Mistral | **Mistral-Small-3.2-24B-Instruct-2506** | Excellent | Poor | Great | Great | Excellent | Poor | Good | Excellent |

::::{note}
`Llama-3.3-70B-Instruct` is currently supported with simulated function calling.
::::

## Evaluate your own model

You can run the {{obs-ai-assistant}} evaluation framework against any model, and use it to benchmark a custom or self-hosted model against the use cases in the matrix. Refer to the [evaluation framework README](https://github.com/elastic/kibana/blob/main/x-pack/solutions/observability/plugins/observability_ai_assistant_app/scripts/evaluation/README.md) for setup and usage details.

For consistency, all ratings in this matrix were generated using `Gemini 2.5 Pro` as the judge model (specified via the `--evaluateWith` flag). Use the same judge when evaluating your own model to ensure comparable results.
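As a rough sketch, an evaluation run with a pinned judge model might look like the following. Only the `--evaluateWith` flag is named on this page; the script entry point and everything else here are assumptions, so check the linked README for the actual command.

```shell
# Hypothetical invocation from the root of a Kibana checkout.
# The script path is assumed from the README's location; the README is the
# source of truth for the real entry point and its flags.
EVAL_SCRIPT="x-pack/solutions/observability/plugins/observability_ai_assistant_app/scripts/evaluation/index.js"

# Pin the judge model so results stay comparable with the matrix above.
# --evaluateWith is the flag mentioned in the docs; the model identifier
# format is an assumption.
node "$EVAL_SCRIPT" --evaluateWith "gemini-2.5-pro"
```

The key point is consistency: whatever the exact invocation, keeping the judge model fixed across runs is what makes your ratings comparable with the published matrix.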
Keeping manually maintained dates isn't something we do nor advise doing in the docs, because they're considered to match the latest release, not specific dates.
Thanks @florent-leborgne
Could you check @pmoust's comment here - #2812 (comment)
Is there a way to link it to a stack release if we are removing the date?
The whole page will be marked as 9.2 thanks to the frontmatter

Okay. Thanks @florent-leborgne
@pmoust are we okay with removing the date and only having the stack version we tested in?
It'll show like this when 9.2 is officially released

@florent-leborgne
The requirement we have is to communicate to customers the date on which we evaluated the models. This is important for Serverless too.
Is it okay to keep the date in this case?
We plan on updating these ratings whenever we come across a scenario in this comment.
cc: @pmoust
If it helps, I can update it to say "the evaluations were done on 15 September 2025".
What do you think?
Yeah, more precise wording already sounds a bit better (even if manually maintaining dates is still a bad practice in technical docs 😄). Something like this maybe?
Last LLM performance evaluation: 15 September 2025
Happy to discuss this further if you'd like to anticipate further updates, lmk :)
@florent-leborgne @viduni94 to unblock the discussion here, I am ok to back away from having a "Last updated" date.
Let's remove the date, and continue the discussion outside of this github issue.
We shouldn't block merging on that.
Thanks @pmoust and @florent-leborgne
I'll remove the date for now.