
Conversation

@benironside
Contributor

@benironside benironside commented Jan 4, 2026

This PR fixes #4307 by updating the LLM performance matrix for Elastic Security to reflect the latest testing. Thanks @dhru42 for your work generating the new data!

For models with one or more "Not recommended" values, I changed the "Average score" value to "N/A", because the "Not recommended" results were skewing the data and, in my opinion, making the average scores misleading. For future versions, it would be ideal to have numeric values for all cells rather than "Not recommended". We might also consider testing performance for Automatic Import.
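To illustrate the skew described above: if "Not recommended" results were coerced to a placeholder number (such as 0), the mean would drop sharply and misrepresent the model's performance on the tasks it did pass. A minimal sketch of the "report N/A instead" approach, using invented scores that are not taken from the actual performance matrix:

```python
# Hypothetical per-task scores for one model; values are made up for
# illustration and do not come from the real performance matrix.
scores = [95, 88, 92, "Not recommended"]

# Keep only the numeric results.
numeric = [s for s in scores if isinstance(s, (int, float))]

if len(numeric) < len(scores):
    # At least one task failed testing ("Not recommended"), so an
    # average over the remaining tasks would be misleading.
    average = "N/A"
else:
    average = sum(numeric) / len(numeric)

print(average)  # N/A
```

By contrast, treating the failed task as 0 here would yield a mean of 68.75 for a model that scored around 90 on every task it could actually complete, which is the distortion the "N/A" convention avoids.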

Generative AI disclosure

1. Did you use a generative AI (GenAI) tool to assist in creating this contribution?
   - [x] Yes
   - [ ] No
2. If you answered "Yes" to the previous question, please specify the tool(s) and model(s) used (e.g., Google Gemini,
   - Gemini 3 web interface for reformatting Google Sheets into markdown

@github-actions
Contributor

github-actions bot commented Jan 4, 2026

Vale Linting Results

Summary: 2 suggestions found

💡 Suggestions (2)
| File | Line | Rule | Message |
| --- | --- | --- | --- |
| solutions/security/ai/large-language-model-performance-matrix.md | 31 | Elastic.Acronyms | 'GPT' has no definition. |
| solutions/security/ai/large-language-model-performance-matrix.md | 50 | Elastic.Acronyms | 'GPT' has no definition. |


Contributor

@nastasha-solomon nastasha-solomon left a comment


Left two minor comments. Great job on continuing to improve this page. It has a ton of super useful info for our customers!

Higher scores indicate better performance. A score of 100 on a task means the model met or exceeded all task-specific benchmarks.

Models with a score of "Not recommended" failed testing. This could be due to various issues, including context window constraints.
Contributor


It could be helpful to include a brief explanation of how to interpret the average score. Maybe something general like: "Models that score above [this threshold] might provide better performance for AI-powered features. We don't recommend using models that score below [this threshold], as they won't perform as well."

Contributor Author


I'll ask the product team if we can provide some more guidance on this. Thank you for the idea!


@benironside benironside self-assigned this Jan 6, 2026


Development

Successfully merging this pull request may close these issues.

[Internal]: Update LLM Performance Matrix with latest models

3 participants