PL-134151 Move to llama cpp inference#10

Draft
elikoga wants to merge 162 commits into main from move-to-llama-cpp-inference
Conversation

@elikoga
Member

@elikoga elikoga commented Nov 26, 2025

PL-134151

@elikoga elikoga changed the title Move to llama cpp inference PL-134151 Move to llama cpp inference Dec 3, 2025
elikoga and others added 30 commits February 18, 2026 08:34
(Note: this is inconsistent with respect to pre-commit, as I'm picking apart a larger
change and didn't manage to recreate it cleanly.)
I noticed that 100 requests arriving at once easily take
30 s to get passed to the backends. It seems the current way we
authenticate requests costs ~300 ms per request and likely has a DB
contention issue.
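One way to sidestep a per-request DB round-trip like this is to cache authentication results in-process with a short TTL, so a burst of requests bearing the same credentials only hits the database once. This is a minimal sketch, not the actual fix in this PR; `AuthCache` and its `lookup` callback are hypothetical names, and the real authentication path may differ.

```python
import time

class AuthCache:
    """Hypothetical TTL cache wrapping a slow credential check (e.g. a DB query)."""

    def __init__(self, lookup, ttl_seconds=60.0):
        self._lookup = lookup          # slow backend check, e.g. a DB query (~300 ms)
        self._ttl = ttl_seconds
        self._entries = {}             # token -> (is_valid, expires_at)

    def authenticate(self, token):
        now = time.monotonic()
        cached = self._entries.get(token)
        if cached is not None and cached[1] > now:
            return cached[0]           # cache hit: no DB round-trip
        is_valid = self._lookup(token)  # cache miss: pay the full lookup cost once
        self._entries[token] = (is_valid, now + self._ttl)
        return is_valid
```

Under this sketch, 100 concurrent requests with the same token would trigger at most one backend lookup per TTL window instead of 100, removing both the latency and the contention source (at the cost of a revocation delay bounded by the TTL).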
- allow selecting for each model which runner to use
- allow configuring and communicating the max parallel requests
- split the resource monitoring and models into a separate module
- always explicitly specify the port for each model
- disable model cleanup for now, as the Hugging Face integration
  doesn't fit the current abstraction level
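The first, second, and fourth bullets above could be captured in a single per-model config record: each model names its runner, always pins an explicit port, and advertises its maximum parallel requests to the scheduler. The sketch below is illustrative only; the class name, field names, and CLI flags are assumptions, not the repository's actual code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical per-model configuration matching the changes listed above."""
    name: str
    runner: str            # which runner serves this model, e.g. a llama.cpp server
    port: int              # always explicitly specified, never auto-assigned
    max_parallel: int      # max parallel requests, communicated to the scheduler

def runner_command(cfg: ModelConfig) -> list[str]:
    # Assumed flag names for illustration; a real runner's CLI may differ.
    return [
        cfg.runner,
        "--model", cfg.name,
        "--port", str(cfg.port),
        "--parallel", str(cfg.max_parallel),
    ]
```

Making the port explicit keeps the resource-monitoring module (split out in the third bullet) from having to discover where each model is listening.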