Pull request overview
This PR merges changes from the dev branch, updating model metadata handling, adding support for ik_llama.cpp, expanding OpenAI-compatible logprobs behavior, and refreshing dependencies/build workflows.
Changes:
- Remove regex-based model config (`user_data/models/config.yaml`) and rely on model metadata + `config-user.yaml` overrides for settings/template detection.
- Add `--ik` support (UI + CLI) and ik-specific portable build workflows/requirements.
- Improve OpenAI API logprobs (prompt logprobs, sampled-token correctness) and various JS/UI refactors.
Reviewed changes
Copilot reviewed 50 out of 50 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| user_data/models/config.yaml | Removes legacy regex-based model settings mapping. |
| server.py | Drops global fallback model_config injection at startup. |
| requirements/portable/requirements.txt | Updates Gradio custom wheels and llama-cpp binaries version. |
| requirements/portable/requirements_vulkan.txt | Updates Gradio custom wheels and Vulkan llama-cpp binaries version. |
| requirements/portable/requirements_nowheels.txt | Updates Gradio custom wheels. |
| requirements/portable/requirements_ik.txt | Adds new ik portable requirements (CUDA 12.4). |
| requirements/portable/requirements_ik_cuda131.txt | Adds new ik portable requirements (CUDA 13.1). |
| requirements/portable/requirements_ik_cpu_only.txt | Adds new ik portable requirements (CPU only). |
| requirements/portable/requirements_cuda131.txt | Updates Gradio custom wheels and CUDA 13.1 llama-cpp binaries version. |
| requirements/portable/requirements_cpu_only.txt | Updates Gradio custom wheels and CPU llama-cpp binaries version. |
| requirements/portable/requirements_apple_silicon.txt | Updates Gradio custom wheels and macOS arm64 llama-cpp binaries version. |
| requirements/portable/requirements_apple_intel.txt | Updates Gradio custom wheels and macOS x86_64 llama-cpp binaries version. |
| requirements/portable/requirements_amd.txt | Updates Gradio custom wheels and ROCm llama-cpp binaries version. |
| requirements/full/requirements.txt | Bumps accelerate/transformers, Gradio wheels, llama-cpp binaries; adds ik wheels; bumps exllamav3 wheel. |
| requirements/full/requirements_nowheels.txt | Bumps accelerate/transformers and Gradio wheels. |
| requirements/full/requirements_cpu_only.txt | Bumps accelerate/transformers, Gradio wheels; updates llama-cpp CPU wheels; adds ik CPU wheels. |
| requirements/full/requirements_apple_silicon.txt | Bumps accelerate/transformers, Gradio wheels; updates macOS arm64 llama-cpp wheel. |
| requirements/full/requirements_apple_intel.txt | Bumps accelerate/transformers, Gradio wheels; updates macOS x86_64 llama-cpp wheel. |
| requirements/full/requirements_amd.txt | Bumps accelerate/transformers, Gradio wheels; updates ROCm llama-cpp wheels. |
| modules/ui_model_menu.py | Adds UI toggle for --ik (non-portable). |
| modules/ui_chat.py | Makes “Enable thinking” help text model-agnostic. |
| modules/transformers_loader.py | Adjusts HF torch_dtype selection to prefer --bf16 else config/autodetect. |
| modules/text_generation.py | Centralizes idle-load behavior and adds active-generation accounting. |
| modules/shared.py | Adds --ik CLI flag; removes loading of legacy config.yaml model settings. |
| modules/models.py | Adds idle-load helper + active generation counter; prevents idle unload during active generation. |
| modules/models_settings.py | Switches model settings/template detection to metadata-based defaults + config-user.yaml. |
| modules/logits.py | Uses shared idle-load helper for logits endpoint. |
| modules/loaders.py | Adds ik to supported loader/UI element lists. |
| modules/llama_cpp_server.py | Adds prompt logprobs support; adds ik binary selection + flag patching for ik compatibility. |
| modules/extensions.py | Refactors CSS/JS aggregation to ''.join(...). |
| modules/exllamav3.py | Adds EOS suppression via logit bias; improves sampled-token logprobs capture; adds prompt-logits helper; changes logits computation mode. |
| modules/exllamav3_hf.py | Normalizes text_config; changes labels/logits path behavior for correctness. |
| modules/chat.py | Adjusts GPT-OSS stopping-string handling based on template content. |
| modules/api/models.py | Renames “TRUNCATION LENGTH” log to “CONTEXT LENGTH”. |
| modules/api/completions.py | Adds prompt logprobs computation, improves sampled token selection, adds top_logprobs_ids, tightens max_tokens validation. |
| js/update_big_picture.js | Uses shared helper for profile picture cache-busting URL. |
| js/switch_tabs.js | Refactors button lookup/click logic with scoped search. |
| js/show_controls.js | Refactors DOM querying; consistently toggles gallery extension visibility. |
| js/save_files.js | Simplifies timestamp/path construction and makes mode fallback explicit. |
| js/main.js | Refactors event helpers, sidebar toggling, big-picture URL handling, and visibility-focus helper. |
| js/global_scope_js.js | Adds shared JS helpers (profile picture URL, message element helpers, Gradio input dispatch). |
| js/dark_theme.js | Improves re-highlighting by clearing data-highlighted and reusing syntax-highlighting pipeline. |
| extensions/superboogav2/download_urls.py | Validates URLs before fetching (SSRF mitigation). |
| extensions/superbooga/download_urls.py | Validates URLs before fetching (SSRF mitigation). |
| download-model.py | Refactors HF link filtering logic for safetensors/GGUF selection. |
| docs/12 - OpenAI API.md | Updates instruction-template detection description to metadata-based detection. |
| docs/01 - Chat Tab.md | Updates instruction-template auto-detection description to metadata-based detection. |
| .github/workflows/build-portable-release-ik.yml | Adds workflow to build ik CPU portable artifacts. |
| .github/workflows/build-portable-release-ik-cuda.yml | Adds workflow to build ik CUDA portable artifacts. |
| .github/workflows/build-everything-tgw.yml | Adds ik build jobs to “build everything” workflow. |
```diff
 def generate_reply(*args, **kwargs):
-    if shared.args.idle_timeout > 0 and shared.model is None and shared.model_name not in [None, 'None']:
-        from modules.models import load_model
-        shared.model, shared.tokenizer = load_model(shared.model_name)
+    models.load_model_if_idle_unloaded()

     state = args[1] if len(args) > 1 else kwargs.get('state', {})
     use_parallel = (
```
`load_model_if_idle_unloaded()` is called before acquiring `shared.generation_lock`. If the idle-unload thread unloads the model after this check but before generation begins (while this request is blocked on the lock), `_generate_reply()` can run with `shared.model is None` and return an empty result instead of re-loading the model. Consider moving the idle-load check inside the lock (or re-checking/auto-loading after the lock is acquired) so requests can't race with `unload_model_if_idle()`.
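One way to close the race, sketched below, is to acquire the lock first and only then run the idle-load check. This is a minimal standalone sketch, not the repo's actual code: `_Shared` and the module-level `generation_lock` here are hypothetical stand-ins for `modules.shared`, and it assumes the idle-unload thread takes the same lock before unloading.

```python
import threading

# Hypothetical stand-in for the repo's shared module state.
class _Shared:
    model = None
    model_name = "some-model"

shared = _Shared()
generation_lock = threading.Lock()

def load_model_if_idle_unloaded():
    # Placeholder for the real load_model() call.
    if shared.model is None and shared.model_name not in (None, "None"):
        shared.model = object()

def generate_reply(*args, **kwargs):
    # Acquire the lock first, THEN check/load: an unload thread that
    # also takes this lock cannot unload the model between our check
    # and the start of generation.
    with generation_lock:
        load_model_if_idle_unloaded()
        assert shared.model is not None
        return "generated text"
```

This only eliminates the race if `unload_model_if_idle()` holds the same lock while unloading; otherwise the re-check inside the lock merely narrows the window.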
```python
# Determine torch_dtype: respect --bf16 flag, otherwise autodetect
# from model config, but never allow float32.
if shared.args.bf16:
    params['torch_dtype'] = torch.bfloat16
else:
    dtype = getattr(config, 'torch_dtype', None) or getattr(getattr(config, 'text_config', None), 'torch_dtype', None)
    if dtype in (torch.float16, torch.bfloat16):
        params['torch_dtype'] = dtype
    else:
        params['torch_dtype'] = torch.float16
```
`dtype` read from `AutoConfig` is often serialized as a string (e.g. `"float16"`/`"bfloat16"`) rather than a `torch.dtype`. The current `dtype in (torch.float16, torch.bfloat16)` check will fail in that case and always fall back to fp16. Normalize string dtypes (and possibly `np.dtype`) to `torch.dtype` before the comparison.
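One way to make the check robust, sketched below, is to compare canonical dtype names rather than `torch.dtype` objects. The helper name is hypothetical (not code from this PR); it relies on `str(torch.float16)` being `"torch.float16"` and `str(np.dtype("float16"))` being `"float16"`, so all three representations collapse to the same name.

```python
def normalize_dtype_name(dtype):
    """Map a dtype that may be a torch.dtype (str() gives "torch.float16"),
    an np.dtype or plain string ("float16"), or None to a canonical
    lowercase name like "float16"."""
    if dtype is None:
        return None
    name = str(dtype)
    if name.startswith("torch."):
        name = name[len("torch."):]
    return name.lower()

# The check in transformers_loader.py could then become:
#   if normalize_dtype_name(dtype) in ("float16", "bfloat16"):
#       params['torch_dtype'] = getattr(torch, normalize_dtype_name(dtype))
```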
```python
response = self.session.post(url, json=payload)
result = response.json()

prompt_probs = result.get("prompt_probabilities", [])
```
`get_prompt_logprob_entries()` doesn't call `raise_for_status()` (or set a timeout) before `response.json()`. If the server returns a non-2xx response or a non-JSON body, this will raise a confusing exception. Consider adding a timeout and `response.raise_for_status()`, plus a clearer error when `prompt_probabilities` is missing.
Suggested change:

```diff
-response = self.session.post(url, json=payload)
-result = response.json()
-prompt_probs = result.get("prompt_probabilities", [])
+try:
+    response = self.session.post(url, json=payload, timeout=30)
+    response.raise_for_status()
+    result = response.json()
+except requests.JSONDecodeError as exc:
+    raise RuntimeError("llama.cpp server returned a non-JSON response for prompt logprobs") from exc
+except requests.RequestException as exc:
+    raise RuntimeError("Failed to fetch prompt logprobs from llama.cpp server") from exc
+if "prompt_probabilities" not in result:
+    raise RuntimeError("llama.cpp server response missing 'prompt_probabilities'")
+prompt_probs = result["prompt_probabilities"]
```
```python
import torch
tok = tokenizer.decode(torch.tensor([token_id]))
```
`tokenizer.decode(torch.tensor([token_id]))` assumes the tokenizer accepts torch tensors. Some tokenizers (non-HF / custom) only accept Python lists, which would break prompt logprobs. Prefer `tokenizer.decode([token_id])` and avoid a hard dependency on `torch` for this helper.
Suggested change:

```diff
-import torch
-tok = tokenizer.decode(torch.tensor([token_id]))
+tok = tokenizer.decode([token_id])
```
```python
    Used by prompt logprobs computation. Returns (1, seq_len, vocab) on CPU in float32.
    """
    import torch
    input_ids_tensor = input_ids if isinstance(input_ids, torch.Tensor) else torch.tensor(input_ids, dtype=torch.long)
    input_ids_tensor = input_ids_tensor.view(1, -1).cpu()
    with torch.no_grad():
        return self.model.forward(
            input_ids=input_ids_tensor,
            params={"attn_mode": "flash_attn_nc"}
        ).cpu().float()
```
`get_prompt_logits()` forces logits to CPU float32 (`.cpu().float()`). `_compute_prompt_logprob_entries()` is optimized to do `logsumexp`/`topk` on GPU and only move top-k results to CPU; returning CPU logits will make prompt logprobs extremely slow and can blow up CPU RAM for long prompts. Consider returning logits on the model device (or at least not forcing `.cpu()`) and letting the caller decide when to transfer.
Suggested change:

```diff
-    Used by prompt logprobs computation. Returns (1, seq_len, vocab) on CPU in float32.
-    """
-    import torch
-    input_ids_tensor = input_ids if isinstance(input_ids, torch.Tensor) else torch.tensor(input_ids, dtype=torch.long)
-    input_ids_tensor = input_ids_tensor.view(1, -1).cpu()
-    with torch.no_grad():
-        return self.model.forward(
-            input_ids=input_ids_tensor,
-            params={"attn_mode": "flash_attn_nc"}
-        ).cpu().float()
+    Used by prompt logprobs computation. Returns (1, seq_len, vocab) in float32
+    on the device used by ``self.model.forward(...)``.
+    """
+    import torch
+    input_ids_tensor = input_ids if isinstance(input_ids, torch.Tensor) else torch.tensor(input_ids, dtype=torch.long)
+    input_ids_tensor = input_ids_tensor.view(1, -1)
+    with torch.no_grad():
+        return self.model.forward(
+            input_ids=input_ids_tensor,
+            params={"attn_mode": "flash_attn_nc"}
+        ).float()
```
```python
# Labels path: single pass without cache for correct logits
logits = self.ex_model.forward(
    input_ids=seq_tensor.view(1, -1),
    params={"attn_mode": "flash_attn_nc"}
).float().cpu()
```
In the labels path, logits are moved to CPU (`.float().cpu()`). This makes prompt-logprobs computation (which uses `labels=`) run heavy `topk`/`logsumexp` on CPU and may become prohibitively slow/large for long prompts. Consider keeping logits on GPU (or providing a dedicated GPU-side prompt-logits API) and only transferring the small top-k outputs needed by the API.
Suggested change:

```diff
-# Labels path: single pass without cache for correct logits
-logits = self.ex_model.forward(
-    input_ids=seq_tensor.view(1, -1),
-    params={"attn_mode": "flash_attn_nc"}
-).float().cpu()
+# Labels path: single pass without cache for correct logits.
+# Keep logits on the model device so downstream prompt-logprobs,
+# topk, and logsumexp work stays on GPU.
+logits = self.ex_model.forward(
+    input_ids=seq_tensor.view(1, -1),
+    params={"attn_mode": "flash_attn_nc"}
+).float()
```
```python
if not shared.args.portable:
    shared.gradio['ik'] = gr.Checkbox(label="ik", value=shared.args.ik, info='Use ik_llama.cpp instead of upstream llama.cpp.')
```
The new checkbox label `"ik"` is not self-explanatory in the UI. Consider a clearer label (e.g. `"Use ik_llama.cpp"`) while keeping the help text for details.