Add LLM benchmarking proposal. #31

Open
crertel wants to merge 3 commits into NixOS:main from crertel:patch-1

@crertel crertel commented Feb 7, 2026

No description provided.


## Generative Nix: Surveying LLM Proficiency In NixOS

Effort: small (90 hours)
Member

I would not trust any survey that took less than 100 hours to conduct.

For reference, see NixOS/nixpkgs#410741 (comment) for a possible "survey", although this one cannot be conducted externally.

Author

Survey in the sense of "see what's out there", as one might survey a landscape to make a map--not survey as in "let's poll a bunch of people". Sorry for any miscommunication.

Member

> Survey in the sense of "see what's out there", as one might survey a landscape to make a map--not survey as in "let's poll a bunch of people".

IIUC, you want to benchmark and rank LLMs to determine the current best one for Nix. With LLMs constantly being obsoleted by better ones, would it not be better to establish a benchmark suite for continuously updating the ranking instead of providing a one-time ranking?

Delegating this effort to the Nix community sounds like a lot of effort, when IMHO LLMs should be the ones promoting and declaring their domain proficiencies.

Either way, take my input with a grain of salt because I am not really interested in using LLMs.

Author

The deliverables for this project include exactly that: a reusable selection of benchmarks for that purpose.

@Eveeifyeve
Member

This would be useful for #32

4 participants