I really like the idea of using the routes for cost efficiency. Though, could you elaborate a bit on what you mean by "compatible with"? Looking at the example issue you posted, Wilmer's current style of routing would at least work somewhat closely to what they're imagining:
Routing Python requests to, say, ChatGPT-4o while TypeScript goes to Claude, things like that, would be something they could do now. But anything more dynamic than that would require a little work. Are you thinking in terms of Wilmer dynamically adjusting the routes based on price, similar to RouteLLM, or something else entirely?
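For what it's worth, here's a rough sketch of what that static, category-based routing could look like; the category names, model IDs, and `route()` helper are illustrative assumptions, not an actual Wilmer config or API.

```python
# Purely illustrative sketch of static, category-based routing -- the kind
# of table lookup Wilmer-style routing could express today. The category
# names and model IDs are assumptions, not an actual Wilmer configuration.
STATIC_ROUTES = {
    "python": "openai/gpt-4o",                    # Python requests -> GPT-4o
    "typescript": "anthropic/claude-3.5-sonnet",  # TypeScript -> Claude
}

def route(category: str, default: str = "openai/gpt-4o") -> str:
    """Look up the target model for a request category, with a fallback."""
    return STATIC_ROUTES.get(category, default)

print(route("typescript"))  # anthropic/claude-3.5-sonnet
```

Anything past that table lookup, like re-pointing a route when a cheaper model is good enough for a given prompt, is the "little work" part.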
-
Sounds weird, but I think Wilmer can be compatible with RouteLLM when it comes to picking some models over others at different times based on computation cost and accuracy: https://github.com/lm-sys/RouteLLM
According to LiveBench, for SOTA open-weight models, here is how I'd use them:
- qwen/qwq-32b-preview for general reasoning and simple maths
- qwen/qwen-2.5-coder-32b-instruct for simple programming tasks
- deepseek/deepseek-chat for data science, heavy-duty programming, and simple instructions
- meta-llama/llama-3.3-70b-instruct for complex Instruction Following (IF)

P.S. It would be useful in specific applications as a cost-to-accuracy trade-off: All-Hands-AI/OpenHands#5869
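To make the cost-to-accuracy idea concrete, here's a minimal sketch of picking the cheapest of those models that clears a quality bar. The scores and prices below are made-up placeholders, not real LiveBench numbers or provider pricing, and the helper is hypothetical rather than anything RouteLLM or Wilmer ships.

```python
# Illustrative cost-to-accuracy picker over the open-weight models listed
# above. Scores and per-million-token prices are placeholder values only.
MODELS = {
    "qwen/qwq-32b-preview":              {"score": 0.62, "price": 0.20},
    "qwen/qwen-2.5-coder-32b-instruct":  {"score": 0.58, "price": 0.20},
    "deepseek/deepseek-chat":            {"score": 0.66, "price": 0.30},
    "meta-llama/llama-3.3-70b-instruct": {"score": 0.64, "price": 0.40},
}

def pick_cost_effective(min_score: float) -> str:
    """Return the cheapest model whose (placeholder) score clears the bar,
    in the same spirit as the cost/accuracy trade-off RouteLLM targets."""
    eligible = {m: v for m, v in MODELS.items() if v["score"] >= min_score}
    if not eligible:  # nothing clears the bar -> fall back to the best model
        return max(MODELS, key=lambda m: MODELS[m]["score"])
    return min(eligible, key=lambda m: eligible[m]["price"])

print(pick_cost_effective(0.60))  # cheapest model that meets the threshold
```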