Pull request overview
This PR merges changes from the dev branch, updating model metadata handling, adding support for ik_llama.cpp, expanding OpenAI-compatible logprobs behavior, and refreshing dependencies/build workflows.
Changes:
- Remove regex-based model config (`user_data/models/config.yaml`) and rely on model metadata + `config-user.yaml` overrides for settings/template detection.
- Add `--ik` support (UI + CLI) and ik-specific portable build workflows/requirements.
- Improve OpenAI API logprobs (prompt logprobs, sampled-token correctness) and various JS/UI refactors.
Reviewed changes
Copilot reviewed 50 out of 50 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| user_data/models/config.yaml | Removes legacy regex-based model settings mapping. |
| server.py | Drops global fallback model_config injection at startup. |
| requirements/portable/requirements.txt | Updates Gradio custom wheels and llama-cpp binaries version. |
| requirements/portable/requirements_vulkan.txt | Updates Gradio custom wheels and Vulkan llama-cpp binaries version. |
| requirements/portable/requirements_nowheels.txt | Updates Gradio custom wheels. |
| requirements/portable/requirements_ik.txt | Adds new ik portable requirements (CUDA 12.4). |
| requirements/portable/requirements_ik_cuda131.txt | Adds new ik portable requirements (CUDA 13.1). |
| requirements/portable/requirements_ik_cpu_only.txt | Adds new ik portable requirements (CPU only). |
| requirements/portable/requirements_cuda131.txt | Updates Gradio custom wheels and CUDA 13.1 llama-cpp binaries version. |
| requirements/portable/requirements_cpu_only.txt | Updates Gradio custom wheels and CPU llama-cpp binaries version. |
| requirements/portable/requirements_apple_silicon.txt | Updates Gradio custom wheels and macOS arm64 llama-cpp binaries version. |
| requirements/portable/requirements_apple_intel.txt | Updates Gradio custom wheels and macOS x86_64 llama-cpp binaries version. |
| requirements/portable/requirements_amd.txt | Updates Gradio custom wheels and ROCm llama-cpp binaries version. |
| requirements/full/requirements.txt | Bumps accelerate/transformers, Gradio wheels, llama-cpp binaries; adds ik wheels; bumps exllamav3 wheel. |
| requirements/full/requirements_nowheels.txt | Bumps accelerate/transformers and Gradio wheels. |
| requirements/full/requirements_cpu_only.txt | Bumps accelerate/transformers, Gradio wheels; updates llama-cpp CPU wheels; adds ik CPU wheels. |
| requirements/full/requirements_apple_silicon.txt | Bumps accelerate/transformers, Gradio wheels; updates macOS arm64 llama-cpp wheel. |
| requirements/full/requirements_apple_intel.txt | Bumps accelerate/transformers, Gradio wheels; updates macOS x86_64 llama-cpp wheel. |
| requirements/full/requirements_amd.txt | Bumps accelerate/transformers, Gradio wheels; updates ROCm llama-cpp wheels. |
| modules/ui_model_menu.py | Adds UI toggle for --ik (non-portable). |
| modules/ui_chat.py | Makes “Enable thinking” help text model-agnostic. |
| modules/transformers_loader.py | Adjusts HF torch_dtype selection to prefer --bf16 else config/autodetect. |
| modules/text_generation.py | Centralizes idle-load behavior and adds active-generation accounting. |
| modules/shared.py | Adds --ik CLI flag; removes loading of legacy config.yaml model settings. |
| modules/models.py | Adds idle-load helper + active generation counter; prevents idle unload during active generation. |
| modules/models_settings.py | Switches model settings/template detection to metadata-based defaults + config-user.yaml. |
| modules/logits.py | Uses shared idle-load helper for logits endpoint. |
| modules/loaders.py | Adds ik to supported loader/UI element lists. |
| modules/llama_cpp_server.py | Adds prompt logprobs support; adds ik binary selection + flag patching for ik compatibility. |
| modules/extensions.py | Refactors CSS/JS aggregation to ''.join(...). |
| modules/exllamav3.py | Adds EOS suppression via logit bias; improves sampled-token logprobs capture; adds prompt-logits helper; changes logits computation mode. |
| modules/exllamav3_hf.py | Normalizes text_config; changes labels/logits path behavior for correctness. |
| modules/chat.py | Adjusts GPT-OSS stopping-string handling based on template content. |
| modules/api/models.py | Renames “TRUNCATION LENGTH” log to “CONTEXT LENGTH”. |
| modules/api/completions.py | Adds prompt logprobs computation, improves sampled token selection, adds top_logprobs_ids, tightens max_tokens validation. |
| js/update_big_picture.js | Uses shared helper for profile picture cache-busting URL. |
| js/switch_tabs.js | Refactors button lookup/click logic with scoped search. |
| js/show_controls.js | Refactors DOM querying; consistently toggles gallery extension visibility. |
| js/save_files.js | Simplifies timestamp/path construction and makes mode fallback explicit. |
| js/main.js | Refactors event helpers, sidebar toggling, big-picture URL handling, and visibility-focus helper. |
| js/global_scope_js.js | Adds shared JS helpers (profile picture URL, message element helpers, Gradio input dispatch). |
| js/dark_theme.js | Improves re-highlighting by clearing data-highlighted and reusing syntax-highlighting pipeline. |
| extensions/superboogav2/download_urls.py | Validates URLs before fetching (SSRF mitigation). |
| extensions/superbooga/download_urls.py | Validates URLs before fetching (SSRF mitigation). |
| download-model.py | Refactors HF link filtering logic for safetensors/GGUF selection. |
| docs/12 - OpenAI API.md | Updates instruction-template detection description to metadata-based detection. |
| docs/01 - Chat Tab.md | Updates instruction-template auto-detection description to metadata-based detection. |
| .github/workflows/build-portable-release-ik.yml | Adds workflow to build ik CPU portable artifacts. |
| .github/workflows/build-portable-release-ik-cuda.yml | Adds workflow to build ik CUDA portable artifacts. |
| .github/workflows/build-everything-tgw.yml | Adds ik build jobs to “build everything” workflow. |
```diff
 def generate_reply(*args, **kwargs):
-    if shared.args.idle_timeout > 0 and shared.model is None and shared.model_name not in [None, 'None']:
-        from modules.models import load_model
-        shared.model, shared.tokenizer = load_model(shared.model_name)
+    models.load_model_if_idle_unloaded()

     state = args[1] if len(args) > 1 else kwargs.get('state', {})
     use_parallel = (
```
`load_model_if_idle_unloaded()` is called before acquiring `shared.generation_lock`. If the idle-unload thread unloads the model after this check but before generation begins (while this request is blocked on the lock), `_generate_reply()` can run with `shared.model is None` and return an empty result instead of re-loading the model. Consider moving the idle-load check inside the lock (or re-checking/auto-loading after the lock is acquired) so requests can't race with `unload_model_if_idle()`.
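One way to close the race, sketched below, is to acquire the lock first and only then run the idle-load check. This is a minimal standalone sketch, not the repo's actual code: `_Shared` and the module-level `generation_lock` here are hypothetical stand-ins for `modules.shared`, and it assumes the idle-unload thread takes the same lock before unloading.

```python
import threading

# Hypothetical stand-in for the repo's shared module state.
class _Shared:
    model = None
    model_name = "some-model"

shared = _Shared()
generation_lock = threading.Lock()

def load_model_if_idle_unloaded():
    # Placeholder for the real load_model() call.
    if shared.model is None and shared.model_name not in (None, "None"):
        shared.model = object()

def generate_reply(*args, **kwargs):
    # Acquire the lock first, THEN check/load: an unload thread that
    # also takes this lock cannot unload the model between our check
    # and the start of generation.
    with generation_lock:
        load_model_if_idle_unloaded()
        assert shared.model is not None
        return "generated text"
```

This only eliminates the race if `unload_model_if_idle()` holds the same lock while unloading; otherwise the re-check inside the lock merely narrows the window.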
```python
# Determine torch_dtype: respect --bf16 flag, otherwise autodetect
# from model config, but never allow float32.
if shared.args.bf16:
    params['torch_dtype'] = torch.bfloat16
else:
    dtype = getattr(config, 'torch_dtype', None) or getattr(getattr(config, 'text_config', None), 'torch_dtype', None)
    if dtype in (torch.float16, torch.bfloat16):
        params['torch_dtype'] = dtype
    else:
        params['torch_dtype'] = torch.float16
```
`dtype` read from `AutoConfig` is often serialized as a string (e.g. `"float16"`/`"bfloat16"`) rather than a `torch.dtype`. The current `dtype in (torch.float16, torch.bfloat16)` check will fail in that case and always fall back to fp16. Normalize string dtypes (and possibly `np.dtype`) to `torch.dtype` before the comparison.
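One way to make the check robust, sketched below, is to compare canonical dtype names rather than `torch.dtype` objects. The helper name is hypothetical (not code from this PR); it relies on `str(torch.float16)` being `"torch.float16"` and `str(np.dtype("float16"))` being `"float16"`, so all three representations collapse to the same name.

```python
def normalize_dtype_name(dtype):
    """Map a dtype that may be a torch.dtype (str() gives "torch.float16"),
    an np.dtype or plain string ("float16"), or None to a canonical
    lowercase name like "float16"."""
    if dtype is None:
        return None
    name = str(dtype)
    if name.startswith("torch."):
        name = name[len("torch."):]
    return name.lower()

# The check in transformers_loader.py could then become:
#   if normalize_dtype_name(dtype) in ("float16", "bfloat16"):
#       params['torch_dtype'] = getattr(torch, normalize_dtype_name(dtype))
```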
```python
response = self.session.post(url, json=payload)
result = response.json()

prompt_probs = result.get("prompt_probabilities", [])
```
`get_prompt_logprob_entries()` doesn't call `raise_for_status()` (or set a timeout) before `response.json()`. If the server returns a non-2xx response or a non-JSON body, this will raise a confusing exception. Consider adding a timeout and `response.raise_for_status()`, plus a clearer error when `prompt_probabilities` is missing.
Suggested change:

```diff
-response = self.session.post(url, json=payload)
-result = response.json()
-prompt_probs = result.get("prompt_probabilities", [])
+try:
+    response = self.session.post(url, json=payload, timeout=30)
+    response.raise_for_status()
+    result = response.json()
+except requests.JSONDecodeError as exc:
+    raise RuntimeError("llama.cpp server returned a non-JSON response for prompt logprobs") from exc
+except requests.RequestException as exc:
+    raise RuntimeError("Failed to fetch prompt logprobs from llama.cpp server") from exc
+if "prompt_probabilities" not in result:
+    raise RuntimeError("llama.cpp server response missing 'prompt_probabilities'")
+prompt_probs = result["prompt_probabilities"]
```
```python
import torch
tok = tokenizer.decode(torch.tensor([token_id]))
```
`tokenizer.decode(torch.tensor([token_id]))` assumes the tokenizer accepts torch tensors. Some tokenizers (non-HF / custom) only accept Python lists, which would break prompt logprobs. Prefer `tokenizer.decode([token_id])` and avoid a hard dependency on `torch` for this helper.
Suggested change:

```diff
-import torch
-tok = tokenizer.decode(torch.tensor([token_id]))
+tok = tokenizer.decode([token_id])
```
```python
    Used by prompt logprobs computation. Returns (1, seq_len, vocab) on CPU in float32.
    """
    import torch
    input_ids_tensor = input_ids if isinstance(input_ids, torch.Tensor) else torch.tensor(input_ids, dtype=torch.long)
    input_ids_tensor = input_ids_tensor.view(1, -1).cpu()
    with torch.no_grad():
        return self.model.forward(
            input_ids=input_ids_tensor,
            params={"attn_mode": "flash_attn_nc"}
        ).cpu().float()
```
`get_prompt_logits()` forces logits to CPU float32 (`.cpu().float()`). `_compute_prompt_logprob_entries()` is optimized to do `logsumexp`/`topk` on GPU and only move top-k results to CPU; returning CPU logits will make prompt logprobs extremely slow and can blow up CPU RAM for long prompts. Consider returning logits on the model device (or at least not forcing `.cpu()`) and letting the caller decide when to transfer.
Suggested change:

```diff
-    Used by prompt logprobs computation. Returns (1, seq_len, vocab) on CPU in float32.
-    """
-    import torch
-    input_ids_tensor = input_ids if isinstance(input_ids, torch.Tensor) else torch.tensor(input_ids, dtype=torch.long)
-    input_ids_tensor = input_ids_tensor.view(1, -1).cpu()
-    with torch.no_grad():
-        return self.model.forward(
-            input_ids=input_ids_tensor,
-            params={"attn_mode": "flash_attn_nc"}
-        ).cpu().float()
+    Used by prompt logprobs computation. Returns (1, seq_len, vocab) in float32
+    on the device used by ``self.model.forward(...)``.
+    """
+    import torch
+    input_ids_tensor = input_ids if isinstance(input_ids, torch.Tensor) else torch.tensor(input_ids, dtype=torch.long)
+    input_ids_tensor = input_ids_tensor.view(1, -1)
+    with torch.no_grad():
+        return self.model.forward(
+            input_ids=input_ids_tensor,
+            params={"attn_mode": "flash_attn_nc"}
+        ).float()
```
```python
# Labels path: single pass without cache for correct logits
logits = self.ex_model.forward(
    input_ids=seq_tensor.view(1, -1),
    params={"attn_mode": "flash_attn_nc"}
).float().cpu()
```
In the labels path, logits are moved to CPU (`.float().cpu()`). This makes prompt-logprobs computation (which uses `labels=`) run heavy `topk`/`logsumexp` on CPU and may become prohibitively slow/large for long prompts. Consider keeping logits on GPU (or providing a dedicated GPU-side prompt-logits API) and only transferring the small top-k outputs needed by the API.
Suggested change:

```diff
-# Labels path: single pass without cache for correct logits
-logits = self.ex_model.forward(
-    input_ids=seq_tensor.view(1, -1),
-    params={"attn_mode": "flash_attn_nc"}
-).float().cpu()
+# Labels path: single pass without cache for correct logits.
+# Keep logits on the model device so downstream prompt-logprobs,
+# topk, and logsumexp work stays on GPU.
+logits = self.ex_model.forward(
+    input_ids=seq_tensor.view(1, -1),
+    params={"attn_mode": "flash_attn_nc"}
+).float()
```
```python
if not shared.args.portable:
    shared.gradio['ik'] = gr.Checkbox(label="ik", value=shared.args.ik, info='Use ik_llama.cpp instead of upstream llama.cpp.')
```
The new checkbox label `"ik"` is not self-explanatory in the UI. Consider a clearer label (e.g. `"Use ik_llama.cpp"`) while keeping the help text for details.