---
mapped_pages:
  - https://www.elastic.co/guide/en/observability/current/observability-llm-performance-matrix.html
applies_to:
  stack: ga 9.2
  serverless: ga
products:
  - id: observability
---

# Large language model performance matrix

This page summarizes internal test results comparing large language models (LLMs) across {{obs-ai-assistant}} use cases. To learn more about these use cases, refer to [AI Assistant](/solutions/observability/observability-ai-assistant.md).

::::{important}
Rating legend:

**Excellent:** Highly accurate and reliable for the use case.<br>
**Great:** Strong performance with minor limitations.<br>
**Good:** Possibly adequate for many use cases, but with noticeable tradeoffs.<br>
**Poor:** Significant issues; not recommended for production use in this use case.

Recommended models are those rated **Excellent** or **Great** for the particular use case.
::::

## Proprietary models [_proprietary_models]

Models from third-party LLM providers.

| Provider | Model | **Alert questions** | **APM questions** | **Contextual insights** | **Documentation retrieval** | **Elasticsearch operations** | **{{esql}} generation** | **Execute connector** | **Knowledge retrieval** |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Amazon Bedrock | **Claude Sonnet 3.5** | Excellent | Excellent | Excellent | Excellent | Excellent | Great | Good | Excellent |
| Amazon Bedrock | **Claude Sonnet 3.7** | Excellent | Excellent | Excellent | Excellent | Excellent | Great | Great | Excellent |
| Amazon Bedrock | **Claude Sonnet 4** | Excellent | Excellent | Excellent | Excellent | Excellent | Excellent | Great | Excellent |
| OpenAI | **GPT-4.1** | Excellent | Excellent | Excellent | Excellent | Excellent | Great | Good | Excellent |
| Google Gemini | **Gemini 2.0 Flash** | Excellent | Good | Excellent | Excellent | Excellent | Good | Good | Excellent |
| Google Gemini | **Gemini 2.5 Flash** | Excellent | Good | Excellent | Excellent | Excellent | Good | Good | Excellent |
| Google Gemini | **Gemini 2.5 Pro** | Excellent | Great | Excellent | Excellent | Excellent | Good | Good | Excellent |

## Open-source models [_open_source_models]

```{applies_to}
stack: preview 9.2
serverless: preview
```

Models you can [deploy and manage yourself](/solutions/observability/connect-to-own-local-llm.md).

| Provider | Model | **Alert questions** | **APM questions** | **Contextual insights** | **Documentation retrieval** | **Elasticsearch operations** | **{{esql}} generation** | **Execute connector** | **Knowledge retrieval** |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Meta | **Llama-3.3-70B-Instruct** | Excellent | Good | Great | Excellent | Excellent | Good | Good | Excellent |
| Mistral | **Mistral-Small-3.2-24B-Instruct-2506** | Excellent | Poor | Great | Great | Excellent | Poor | Good | Excellent |

::::{note}
`Llama-3.3-70B-Instruct` is supported with simulated function calling.
::::

## Evaluate your own model

You can run the {{obs-ai-assistant}} evaluation framework against any model, and use it to benchmark a custom or self-hosted model against the use cases in the matrix. Refer to the [evaluation framework README](https://github.com/elastic/kibana/blob/main/x-pack/solutions/observability/plugins/observability_ai_assistant_app/scripts/evaluation/README.md) for setup and usage details.

For consistency, all ratings in this matrix were generated using `Gemini 2.5 Pro` as the judge model (specified via the `--evaluateWith` flag). Use the same judge when evaluating your own model to ensure comparable results.
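
As a rough illustration, an evaluation run might look like the following. This is a minimal sketch, not a verbatim command: the script entry point is assumed from the README location linked above, the connector ID is a placeholder, and any flags other than `--evaluateWith` (such as selecting the connector for the model under test) are documented in the README rather than shown here.

```bash
# Minimal sketch (assumptions noted in the text above): run from the root of a
# Kibana checkout against a running Kibana/Elasticsearch instance with your
# connectors configured. The script path follows the README linked above; the
# judge connector ID is a placeholder, and other required flags are in the README.
node x-pack/solutions/observability/plugins/observability_ai_assistant_app/scripts/evaluation/index.js \
  --evaluateWith <gemini-2.5-pro-connector-id>
```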