Draft
162 commits
8f1cee1
add inference module, entrypoint
elikoga Nov 26, 2025
5f71d8a
feat: implement model management with download and load into gpu
elikoga Nov 26, 2025
d7b029b
add code review changes
elikoga Dec 2, 2025
45563ab
feat: dynamic port
elikoga Dec 2, 2025
b3ae3f8
move inference module
elikoga Dec 2, 2025
f8005ba
feat: implement model unloading functionality and API endpoint
elikoga Dec 2, 2025
53ce3a1
integrate with skvaider
elikoga Dec 2, 2025
db9bb68
refactor: remove ModelManager call in skvaider and run inference in a…
elikoga Dec 3, 2025
20a48d7
feat: update model configurations and enhance test assertions for emb…
elikoga Dec 3, 2025
0aaa63b
feat: update inference server port handling and improve test endpoint…
elikoga Dec 3, 2025
3bde3be
feat: readd support for Ollama backend and parameterize model names i…
elikoga Dec 3, 2025
6171556
feat: add health check for backends during lifespan test
elikoga Dec 3, 2025
315d6da
change http error code, re-add ollama to lifespan
elikoga Dec 3, 2025
31ee225
feat: refactor openai proxy file structure
elikoga Dec 3, 2025
1e2abc6
feat: update ModelManager initialization to use models directory from…
elikoga Jan 13, 2026
9953f5a
feat: rename load endpoint to get_running_model_or_load and update re…
elikoga Jan 13, 2026
e9c3a46
feat: update download_model to use models directory from ModelManager
elikoga Jan 19, 2026
7edb8a8
make /download endpoint input format more aligned with other api
elikoga Jan 19, 2026
c6cb0d8
unset default for context_size
elikoga Jan 19, 2026
85d854e
move health endpoint to /manager/health , fix tests
elikoga Jan 20, 2026
0f4a6d5
add filename to model configuration in test lifespan
elikoga Jan 20, 2026
fc1c515
add proxy request endpoint to interact with models
elikoga Jan 20, 2026
074f161
add filename to model configuration in test lifespan
elikoga Jan 20, 2026
1f2c95f
update backend configuration to use 'ollama' and adjust health check …
elikoga Jan 20, 2026
d97ef44
fix health check endpoint URL in model manager
elikoga Jan 20, 2026
0f0fafa
refactor: move model management logic from ModelManager to RunningModel
ctheune Jan 19, 2026
b1010b1
Slight cleanups
ctheune Jan 19, 2026
6b61a68
fix: properly cancel monitor task on model termination
ctheune Jan 19, 2026
0fa2cab
refactorings and get test to let manager start a model working again
ctheune Jan 20, 2026
158df6b
wording
ctheune Jan 20, 2026
df23b31
disable the web ui
ctheune Jan 20, 2026
d1dc350
xxx notes
ctheune Jan 20, 2026
85cb7ba
update devenv
ctheune Jan 20, 2026
c59bdf3
- handle llama-server crashes more cleanly and simplify the
ctheune Jan 20, 2026
32ee897
snapshot / inference cleanup:
ctheune Jan 20, 2026
8bf1bab
fix and refactor more tests
ctheune Jan 21, 2026
6c9648b
move logging configuration to the end of lifespan function because it…
elikoga Jan 21, 2026
6258d18
rip out ollama
ctheune Jan 21, 2026
2b6da44
ignore access_log
ctheune Jan 21, 2026
d2503dc
get tests clean again, devenv up delivers working environmnt
ctheune Jan 21, 2026
2cf766b
refactor proxy request handling to simplify model loading and endpoin…
elikoga Jan 21, 2026
466d3d0
increase timeout of llama-server crash test with bad arguments
elikoga Jan 21, 2026
d7e224c
increase timeout for AsyncClient in proxy_request and enhance error l…
elikoga Jan 21, 2026
7ab6536
improve error logging in proxy_request by using log.exception
elikoga Jan 22, 2026
f8ee8df
fix type issue in backend assignments
ctheune Jan 22, 2026
01cd126
ensure strict type checking mode
ctheune Jan 22, 2026
6d48d91
tried adding basedpyright but 2k errors are too much.
ctheune Jan 22, 2026
0322d74
edit endpoints
elikoga Jan 22, 2026
86d7c69
run only one request in skvaider monitor_health_and_update_models
elikoga Jan 22, 2026
ee87ccf
add test for multiple streaming requests with pool management
elikoga Jan 22, 2026
977a623
cleanup more typing
ctheune Jan 22, 2026
cb58275
allow choosing custom llama-server instances
ctheune Jan 22, 2026
f12402d
fix type annotation
ctheune Jan 22, 2026
b17c0a8
load model if it's not loaded anywhere anytime I see it during period…
elikoga Jan 22, 2026
a564976
cleanup
ctheune Jan 22, 2026
a705787
refactor: update model configuration to support multiple files and im…
elikoga Jan 22, 2026
9a1d356
update model configuration structure
elikoga Jan 22, 2026
96962b7
add test for downloading split model files
elikoga Jan 23, 2026
21cb117
Introduce strong typing assertions for the main code base.
ctheune Jan 23, 2026
2b1d445
clean up the llama-server commandline
ctheune Jan 23, 2026
e1e890c
ensure we use a reliable lookup for unqualified llama-server program …
ctheune Jan 23, 2026
a62568c
benchmark for embedding distances in different settings
ctheune Jan 23, 2026
a5bf91f
benchmark: add cosine similarity
ctheune Jan 23, 2026
4af967b
benchmark: heading structure
ctheune Jan 23, 2026
4ff20b4
benchmark: highlight "same precision" combinations
ctheune Jan 23, 2026
0547a73
Add model health monitoring and add tests for health checks
elikoga Jan 23, 2026
090e41d
benchmark: add cosine angles, add our current ollama baseline
ctheune Jan 23, 2026
25a35fb
benchmark: add more
ctheune Jan 23, 2026
6e00ce1
update stability research documentation
ctheune Jan 27, 2026
47714c2
update stability research docs
ctheune Jan 27, 2026
aa3365f
typo
ctheune Jan 27, 2026
489fb3e
update stability research docs
ctheune Jan 27, 2026
9f3f686
formatting
ctheune Jan 27, 2026
55debcb
typo
ctheune Jan 27, 2026
1f0b4d5
update stability research docs
ctheune Jan 27, 2026
c4b2704
stability research:
ctheune Jan 27, 2026
a15c733
stability research doc update: wrap up by providing links to our model
ctheune Jan 27, 2026
a40a506
formatting
ctheune Jan 27, 2026
e2fa044
wording
ctheune Jan 27, 2026
459c836
test fluctuations
ctheune Jan 27, 2026
5ba22f5
inference: cleanup, stability tests
ctheune Jan 27, 2026
5a86b9f
Add embedding verification support
elikoga Jan 28, 2026
0e8e862
allow embedding health check to batch request an have 1e-5 of numeric…
elikoga Jan 29, 2026
4a1c7e7
increase health check logging for embedding value mismatches
elikoga Jan 29, 2026
49016f5
increase tolerance for embedding value mismatches in health check
elikoga Jan 29, 2026
7d5d7fa
increase tolerance for embedding value mismatches in health check
elikoga Jan 29, 2026
6795ea4
revamp status management for models
ctheune Jan 29, 2026
15a96d7
test fixes
ctheune Jan 29, 2026
2c196d8
fix stupid merge mistake
ctheune Jan 29, 2026
07c80aa
Update embeddinggemma output stability test to use expected output fr…
elikoga Jan 29, 2026
6be6657
Update virtual environment path in pyproject.toml
elikoga Jan 29, 2026
6419dcc
Replace file rename with shutil.move to handle cross-filesystem issue…
elikoga Jan 29, 2026
960c173
Add step for uv venv prepare to pre-commit workflow
elikoga Jan 29, 2026
5a7c1df
increase test_embeddinggemma_output_stability timeout for GH actions
elikoga Jan 29, 2026
b9db5e5
increase test timeouts for github actions
elikoga Jan 29, 2026
8b3987c
Normalize model names to lowercase in inference endpoints and configu…
elikoga Jan 29, 2026
507d1b9
update llama-cpp to remove mentions of gpt-3.5-turbo in the output
elikoga Jan 29, 2026
1c11876
Change embeddinggemma output stability test to validate embedding val…
elikoga Jan 29, 2026
7a41f9a
add monitoring for vram usage
ctheune Jan 29, 2026
8a5c657
give CLAUDE.md a try
ctheune Jan 29, 2026
08664d6
extend memory management to allow inspecting real model usage
ctheune Jan 29, 2026
e2160f2
improve logging for memory usage info
ctheune Jan 30, 2026
9c3cbe5
use new memory calculations for placing models on backends
ctheune Jan 30, 2026
6139900
inference: improve recovery from timeouts when loading models
ctheune Jan 30, 2026
5c6d5df
proxy: serialise loading models per backend.
ctheune Jan 30, 2026
11f6cfd
increase model loading timeout
ctheune Jan 30, 2026
4d51ec0
asyncio: clean up task management and support unique/dedup tasks
ctheune Jan 30, 2026
9255d13
add info that this is a typed package to remove missing stub warnings
ctheune Jan 30, 2026
7ec49fb
improve logging and fix a logging error
ctheune Jan 30, 2026
a341177
document the task manager a bit
ctheune Jan 30, 2026
f6f9505
rename "warmup" to "reserved"
ctheune Feb 2, 2026
d50e03c
rename "load_model_with_options" to "load_model"
ctheune Feb 2, 2026
db8d7b7
proxy: don't try loading a model if the fitness has dropped to 0
ctheune Feb 2, 2026
1ebd205
proxy: first implementation of automatically unloading models
ctheune Feb 2, 2026
40e2c57
proxy: wrap up unloading models on demand and also add test coverage
ctheune Feb 3, 2026
6f0b7cf
proxy: implement backend availability check and retry logic
elikoga Feb 3, 2026
90e0e29
inference: ensure models can't be unloaded while being used
ctheune Feb 3, 2026
04b73cb
fix: handle case where task is already removed in cleanup callback
elikoga Feb 6, 2026
27768ed
fix: update proxy endpoint to return 540 status code for unavailable …
elikoga Feb 6, 2026
fcaee29
fix: ensure proper shutdown of fake llama server to avoid blocking
elikoga Feb 6, 2026
41936b5
prevent KeyError in cleanup callback by using pop with default
elikoga Feb 9, 2026
0d0dc61
tests: add wait_for_models_active function to ensure model instances …
elikoga Feb 9, 2026
1b00012
Add metrics endpoint
elikoga Feb 11, 2026
eaf732a
inference: simplify monitoring memory usage by
ctheune Feb 10, 2026
52af250
gateway: revamp model loading and request queueing
ctheune Feb 12, 2026
cda3731
fix: add resource and backend details to overload warning in Pool class
elikoga Feb 13, 2026
bb285e0
fix database access timeouts
ctheune Feb 13, 2026
3531756
increase health check intervals and timeouts for temporary resiliency
ctheune Feb 13, 2026
0fc4063
increase timeout for non-streaming
ctheune Feb 13, 2026
df8832c
add configuration validation scripts for skvaider
elikoga Feb 16, 2026
1749659
fix typo: skavider -> skvaider
elikoga Feb 16, 2026
dae18ff
read config file also from cli arg in check config
elikoga Feb 16, 2026
cf3b890
add debug logging and early exit for empty stdout in ROCmMemoryMonitor
elikoga Feb 18, 2026
bf19dba
add Nvidia memory monitor
elikoga Feb 18, 2026
817d1b4
Refactor test code, rely on DummyBackend and add reused patterns to c…
elikoga Mar 9, 2026
cd33195
refactor, move test mocks to conftest
elikoga Mar 12, 2026
1d3eaff
add test for model memory and placement
elikoga Mar 12, 2026
8a8e68d
add pool semaphore tests, fix bug if all are busy
elikoga Mar 12, 2026
57d0301
add test for size parsing in config
elikoga Mar 12, 2026
7eee65a
backend factory remove url param
elikoga Mar 12, 2026
4d5d7e8
fix devenv up in claude readme
elikoga Mar 12, 2026
71ed732
add slugify, task_manager tests
elikoga Mar 12, 2026
f6868d4
add tests for RAM usage and metrics endpoints
elikoga Mar 12, 2026
251868d
ci: update nix-quick-install-action to v34, add nix cache to pre-comm…
elikoga Mar 12, 2026
0d5fa78
ci: restore devenv.lock after venv setup to keep working tree clean
elikoga Mar 12, 2026
064783a
extract resource monitors to separate module
ctheune Feb 24, 2026
12c9985
extend gitignore
ctheune Mar 13, 2026
c7f8771
update devenv and pin nixos revision
ctheune Mar 13, 2026
007d8b2
switch runner to uvicorn and pass config file as cmdline arg
ctheune Mar 13, 2026
a5b9003
proxy: suppress messages when clients disconnect unexpectedly
ctheune Mar 13, 2026
69f92d2
proxy/auth: cache lookups to reduce a serious performance bottleneck
ctheune Mar 13, 2026
e2f885a
inference: provide separate llama and vllm based models and runners
ctheune Mar 17, 2026
5251cf0
health check: don't run health checks while the models are busy
ctheune Mar 17, 2026
a9b245a
ignore unknown models reported by the inference servers
ctheune Mar 17, 2026
6b43a20
minor cleanups, ensure we use ruff for pre-commit and editing
ctheune Mar 17, 2026
3e7cc04
resource monitoring: bugfix missing resource types
ctheune Mar 17, 2026
b55d4e3
proxy: ignore unhealthy backends when rebalancing models
ctheune Mar 17, 2026
8f7d383
proxy: increase log output, fix bug retrying unavailable backends
ctheune Mar 17, 2026
4a3c635
inference/proxy: ensure we pass through the content type header
ctheune Mar 17, 2026
bd6db76
inference: do not allow configs with duplicate ports
ctheune Mar 17, 2026
8fd66c3
snapshot: move tests towards a working state
ctheune Mar 17, 2026
a3e688f
wrap up getting the tests green again
ctheune Mar 17, 2026
18 changes: 17 additions & 1 deletion .github/workflows/pre-commit.yaml
@@ -9,6 +9,22 @@ jobs:
pre-commit:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4
- uses: nixbuild/nix-quick-install-action@v34
- name: Restore and save Nix store
uses: nix-community/cache-nix-action@v6
with:
primary-key: nix-${{ runner.os }}-${{ hashFiles('**/devenv.nix', '**/devenv.lock') }}
restore-prefixes-first-match: nix-${{ runner.os }}-
gc-max-store-size: 5G
purge: true
purge-prefixes: nix-${{ runner.os }}-
purge-created: 0
purge-last-accessed: 0
purge-primary-key: never
- name: Install devenv.sh
run: nix profile install nixpkgs#devenv
- name: Set up Python venv
run: devenv shell uv sync && git restore devenv.lock
- uses: actions/setup-python@v4
- uses: pre-commit/action@v3.0.0
11 changes: 5 additions & 6 deletions .github/workflows/test.yaml
@@ -10,7 +10,7 @@ jobs:
runs-on: "${{ matrix.os }}"
steps:
- uses: actions/checkout@v5
- uses: nixbuild/nix-quick-install-action@v33
- uses: nixbuild/nix-quick-install-action@v34
- name: Restore and save Nix store
uses: nix-community/cache-nix-action@v6
with:
@@ -25,13 +25,12 @@
purge-primary-key: never
- name: Install devenv.sh
run: nix profile install nixpkgs#devenv
- name: Ollama Model Directorys
id: ollama-models
- name: Model Directories
id: models
uses: actions/cache@v4
with:
path: |
.ollama1
.ollama2
key: ${{ runner.os }}-ollama-models
.models
key: ${{ runner.os }}-models
- name: Run tests
run: devenv test
9 changes: 7 additions & 2 deletions .gitignore
@@ -5,6 +5,8 @@ build/
dist/
wheels/
*.egg-info
.claude/
.zed/

# Virtual environments
.venv
@@ -18,6 +20,7 @@ result
# test generated files
.coverage*
htmlcov
.models

# Devenv
.devenv*
@@ -29,6 +32,8 @@ devenv.local.nix
.aramaki-workdir

.DS_store
.access_log
.access_log*

.ollama?
.ollama*/models*
models/
models-2/
79 changes: 40 additions & 39 deletions .pre-commit-config.yaml
@@ -1,41 +1,42 @@
exclude: ^secrets/|^appenv$
repos:
- hooks:
- id: detect-private-key
- id: check-added-large-files
- exclude: "(?x)^(\n secrets/|environments/.*/secret.*|\n .*\\.patch\n)$\n"
id: trailing-whitespace
- exclude: "(?x)^(\n environments/.*/secret.*|\n .*\\.patch\n)$\n"
id: end-of-file-fixer
- exclude: "(?x)^(\n environments/.*/secret.*|\n .*\\.patch\n)$\n"
id: check-yaml
- exclude: "(?x)^(\n environments/.*/secret.*|\n .*\\.patch\n)$\n"
id: check-json
- exclude: "(?x)^(\n environments/.*/secret.*|\n .*\\.patch\n)$\n"
id: check-xml
- exclude: "(?x)^(\n environments/.*/secret.*|\n .*\\.patch\n)$\n"
id: check-toml
repo: https://github.com/pre-commit/pre-commit-hooks
rev: v5.0.0
- hooks:
- args:
- --profile
- black
- --filter-files
id: isort
name: isort (python)
repo: https://github.com/pycqa/isort
rev: 6.0.1
- hooks:
- id: black
repo: https://github.com/psf/black
rev: 25.1.0
- hooks:
- args:
- --ignore
- E501
- --ignore
- F401
id: ruff
repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.6.9
- hooks:
- id: detect-private-key
- id: check-added-large-files
- exclude: "(?x)^(\n secrets/|environments/.*/secret.*|\n .*\\.patch\n)$\n"
id: trailing-whitespace
- exclude: "(?x)^(\n environments/.*/secret.*|\n .*\\.patch\n)$\n"
id: end-of-file-fixer
- exclude: "(?x)^(\n environments/.*/secret.*|\n .*\\.patch\n)$\n"
id: check-yaml
- exclude: "(?x)^(\n environments/.*/secret.*|\n .*\\.patch\n)$\n"
id: check-json
- exclude: "(?x)^(\n environments/.*/secret.*|\n .*\\.patch\n)$\n"
id: check-xml
- exclude: "(?x)^(\n environments/.*/secret.*|\n .*\\.patch\n)$\n"
id: check-toml
repo: https://github.com/pre-commit/pre-commit-hooks
rev: v6.0.0
- hooks:
- args:
- --profile
- black
- --filter-files
id: isort
name: isort (python)
repo: https://github.com/pycqa/isort
rev: 7.0.0
- hooks:
- args:
- --ignore
- E501
- --ignore
- F401
id: ruff
- id: ruff-format
repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.14.14
- repo: https://github.com/DetachHead/basedpyright-pre-commit-mirror
rev: 1.37.1 # or whatever the latest version is at the time
hooks:
- id: basedpyright
148 changes: 148 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,148 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Build and Test Commands

```bash
# Enter development shell (requires Nix + devenv)
devenv shell

# Run all tests
run-tests

# Run a single test file
uv run pytest src/skvaider/inference/tests/test_manager.py -vv

# Run a specific test
uv run pytest src/skvaider/inference/tests/test_manager.py::test_manager_start_model -vv

# Start all services in the background (terminal stays free)
devenv up -d

# Stop background services
devenv down

# Type checking, linting, formatting, etc. all in one:
pre-commit run -a

```

## Architecture

Skvaider is an OpenAI-compatible API proxy with two parts.


### The OpenAI compatible gateway facing application clients (`skvaider:app_factory()`)

Routes requests to inference backends with load balancing, authentication, health checks and resource management.

- **Entry point**: `src/skvaider/__init__.py`
- **Config file**: `config.toml`
- **Port**: 8000

Key components:
- `proxy/pool.py` - Request queue and backend load balancing
- `proxy/backends.py` - Backend interface (SkvaiderBackend)
- `routers/openai.py` - OpenAI-compatible endpoints (`/openai/v1/...`)
- `auth.py` - Token authentication via aramaki

### Inference server (`skvaider.inference:app_factory()`)

Runs local LLMs via llama-server subprocesses.

- **Entry point**: `src/skvaider/inference/__init__.py`
- **Config file**: `config-inference-{1,2}.toml` (via `SKVAIDER_CONFIG_FILE` env var)
- **Ports**: 8001, 8002

Key components:
- `inference/manager.py` - Model lifecycle (download, start, health check, terminate)
- `inference/routers/models.py` - Model management endpoints (`/models/{name}/load`, `/models/{name}/proxy/{path}`)
- `inference/routers/manager.py` - Health and VRAM usage endpoints

### Aramaki (`src/aramaki/`)

WebSocket-based distributed state management for authentication tokens.

Aramaki is intended to be split off later into a separate package. It is extremely important that no references (imports) from aramaki (`src/aramaki`) to the skvaider code base (`src/skvaider`) are
introduced under any circumstances.

- `manager.py` - WebSocket connections and subscriptions
- `collection.py` - Collection protocol and replication
- `db.py` - SQLite persistence

## Request Flow

1. Client → Proxy (`/openai/v1/chat/completions`)
2. Proxy authenticates via aramaki tokens
3. Pool assigns the request to the least-loaded backend, batching requests that arrive at the same time.
4. Backend proxies to inference server (`/models/{model}/proxy/v1/chat/completions`)
5. Proxy starts models as needed (llama-server subprocess). At least one reserved model instance should always be available. Additional models are stopped and started as needed.
6. Response streams back through the chain

## Model Status System

Models track two status dimensions (inspired by Ceph):
- `process_status`: stopped → starting → running → stopping
- `health_status`: "" → healthy/unhealthy

Combined into `status` set with "active" (running+healthy) or "inactive".
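The combination of the two dimensions could look roughly like this (an assumed sketch of the idea, not the project's real status code):

```
def combined_status(process_status: str, health_status: str) -> set[str]:
    # Collect both dimensions, then derive the combined flag:
    # "active" only when the process is running AND healthy.
    status = {process_status}
    if health_status:
        status.add(health_status)
    if process_status == "running" and health_status == "healthy":
        status.add("active")
    else:
        status.add("inactive")
    return status
```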

## Configuration

Pydantic models in `config.py` files. Key patterns:
- Model files: URL + SHA256 hash for verification
- Logging: structlog with IP anonymization
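The URL-plus-SHA256 pattern for model files amounts to hashing the downloaded file and comparing against the configured digest; a minimal sketch (`verify_file` is an illustrative helper, not the project's actual function):

```
import hashlib


def verify_file(path: str, expected_sha256: str, chunk_size: int = 1 << 20) -> bool:
    # Hash in chunks so multi-gigabyte GGUF files don't need to fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256
```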

## Code Style

- "-> None" is not needed on `__init__` methods
- When filtering lists inside a loop, prefer the `guardian` pattern (early `continue`) to avoid deep indentation.

Good:

```
for x in mylist:
if not condition(x):
continue
... do the happy path work ...
```

Bad:

```
for x in mylist:
if condition(x):
... do the happy path work ...
```

- Do not add superfluous comments to code that is already there. When commenting new
code you generate, skip comments that merely restate what the code already reads like
or is obviously doing; stick to higher-order "why" comments.

bad examples:

```
# do the foo bar thing
do_foo_bar()

# Get per-process VRAM usage from --showpids
await self._update_per_model_vram_rocm()

# Get total VRAM from --showmeminfo
proc = await asyncio.create_subprocess_exec(
"rocm-smi",
"--json",
"--showmeminfo",
"all",
stdout=asyncio.subprocess.PIPE,
stderr=asyncio.subprocess.PIPE,
)
```

- if you log an exception, use the log.exception() function to ensure we see a proper traceback

- basedpyright strict mode
- black + isort (line length 80)
- ruff (ignoring E501, F401)
41 changes: 41 additions & 0 deletions config-inference-1.toml
@@ -0,0 +1,41 @@
# Sample config used by the development environment
models_dir = "models"

[server]
host = "127.0.0.1"
port = 8001

[logging]
log_level = "DEBUG"
access_log_path = ".access_log-inference1"

[[openai.models]]
type = "llama-server"
id = "gemma"
context_size = 4096
port = 8100
cmd_args = []
max_requests = 21

[[openai.models.files]]
url = "https://huggingface.co/unsloth/gemma-3-270m-it-GGUF/resolve/main/gemma-3-270m-it-UD-Q4_K_XL.gguf?download=true"
hash = "e5420636e0cbfee24051ff22e9719380a3a93207a472edb18dd0c89a95f6ef80"

[[openai.models]]
type = "llama-server"
id = "embeddinggemma"
cmd_args = ["--embeddings"]
port = 8101
context_size = 4096

[[openai.models.files]]
url = "https://huggingface.co/unsloth/embeddinggemma-300m-GGUF/resolve/main/embeddinggemma-300M-F32.gguf"
hash = "a3125072128fc76d1c1d8d19f7b095c7e3bfbf00594dcf8a8bd3bcb334935d57"

# It would be useful to have a reasoning model, but the 12G are unwieldy
# for local development and CI/CD caching.
#
# [openai.models."unsloth/gpt-oss-20b-GGUF/gpt-oss-20b-UD-Q4_K_XL.gguf"]
# id = "gpt-oss"
# url = "https://huggingface.co/unsloth/gpt-oss-20b-GGUF/resolve/d449b42d93e1c2c7bda5312f5c25c8fb91dfa9b4/gpt-oss-20b-UD-Q4_K_XL.gguf"
# hash = "10fe673de12c20b74b8d670a9fdf0fd36b43b0a86ffc04daeb175c0a2b98c4f9"
42 changes: 42 additions & 0 deletions config-inference-2.toml
@@ -0,0 +1,42 @@
# Sample config used by the development environment
models_dir = "models-2"

[server]
host = "127.0.0.1"
port = 8002

[logging]
log_level = "DEBUG"
access_log_path = ".access_log-inference2"

[[openai.models]]
type = "llama-server"
id = "gemma"
context_size = 4096
cmd_args = []
max_requests = 42
port = 8200

[[openai.models.files]]
url = "https://huggingface.co/unsloth/gemma-3-270m-it-GGUF/resolve/main/gemma-3-270m-it-UD-Q4_K_XL.gguf"
hash = "e5420636e0cbfee24051ff22e9719380a3a93207a472edb18dd0c89a95f6ef80"

[[openai.models]]
type = "llama-server"
id = "embeddinggemma"
cmd_args = ["--embeddings"]
port = 8201
context_size = 4096

[[openai.models.files]]
url = "https://huggingface.co/unsloth/embeddinggemma-300m-GGUF/resolve/main/embeddinggemma-300M-F32.gguf"
hash = "a3125072128fc76d1c1d8d19f7b095c7e3bfbf00594dcf8a8bd3bcb334935d57"


# It would be useful to have a reasoning model, but the 12G are unwieldy
# for local development and CI/CD caching.
#
# [openai.models."unsloth/gpt-oss-20b-GGUF/gpt-oss-20b-UD-Q4_K_XL.gguf"]
# id = "gpt-oss"
# url = "https://huggingface.co/unsloth/gpt-oss-20b-GGUF/resolve/d449b42d93e1c2c7bda5312f5c25c8fb91dfa9b4/gpt-oss-20b-UD-Q4_K_XL.gguf"
# hash = "10fe673de12c20b74b8d670a9fdf0fd36b43b0a86ffc04daeb175c0a2b98c4f9"