PL-134151 Move to llama cpp inference#10

Draft
elikoga wants to merge 162 commits into main from move-to-llama-cpp-inference
Conversation

@elikoga
Member

@elikoga elikoga commented Nov 26, 2025

PL-134151

@elikoga elikoga changed the title Move to llama cpp inference PL-134151 Move to llama cpp inference Dec 3, 2025
elikoga and others added 30 commits February 18, 2026 08:34
(Note: this is inconsistent with respect to pre-commit, as I'm picking apart a larger
change and didn't manage to recreate it cleanly.)
I noticed that 100 requests arriving at once easily take
30 s to get passed to the backends. It seems the current way we
authenticate requests costs ~300 ms per request and likely has a DB
contention issue.
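One way to sidestep a per-request DB round-trip like this is to cache authentication results in-process with a short TTL, so a burst of requests bearing the same credentials only hits the database once. This is a minimal sketch, not the actual fix in this PR; `AuthCache` and its `lookup` callback are hypothetical names, and the real authentication path may differ.

```python
import time

class AuthCache:
    """Hypothetical TTL cache wrapping a slow credential check (e.g. a DB query)."""

    def __init__(self, lookup, ttl_seconds=60.0):
        self._lookup = lookup          # slow backend check, e.g. a DB query (~300 ms)
        self._ttl = ttl_seconds
        self._entries = {}             # token -> (is_valid, expires_at)

    def authenticate(self, token):
        now = time.monotonic()
        cached = self._entries.get(token)
        if cached is not None and cached[1] > now:
            return cached[0]           # cache hit: no DB round-trip
        is_valid = self._lookup(token)  # cache miss: pay the full lookup cost once
        self._entries[token] = (is_valid, now + self._ttl)
        return is_valid
```

Under this sketch, 100 concurrent requests with the same token would trigger at most one backend lookup per TTL window instead of 100, removing both the latency and the contention source (at the cost of a revocation delay bounded by the TTL).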
- allow selecting for each model which runner to use
- allow configuring and communicating the max parallel requests
- split the resource monitoring and models into a separate module
- always explicitly specify the port for each model
- disable model cleanup for now, as the Hugging Face integration
  doesn't fit the current abstraction level
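The first, second, and fourth bullets above could be captured in a single per-model config record: each model names its runner, always pins an explicit port, and advertises its maximum parallel requests to the scheduler. The sketch below is illustrative only; the class name, field names, and CLI flags are assumptions, not the repository's actual code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical per-model configuration matching the changes listed above."""
    name: str
    runner: str            # which runner serves this model, e.g. a llama.cpp server
    port: int              # always explicitly specified, never auto-assigned
    max_parallel: int      # max parallel requests, communicated to the scheduler

def runner_command(cfg: ModelConfig) -> list[str]:
    # Assumed flag names for illustration; a real runner's CLI may differ.
    return [
        cfg.runner,
        "--model", cfg.name,
        "--port", str(cfg.port),
        "--parallel", str(cfg.max_parallel),
    ]
```

Making the port explicit keeps the resource-monitoring module (split out in the third bullet) from having to discover where each model is listening.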