
Merge dev branch #7452

Merged: oobabooga merged 27 commits into main from dev on Apr 3, 2026

Merge dev branch#7452
oobabooga merged 27 commits intomainfrom
dev

Conversation

@oobabooga (Owner)

No description provided.

Copilot AI (Contributor) left a comment

Pull request overview

This PR merges changes from the dev branch, updating model metadata handling, adding support for ik_llama.cpp, expanding OpenAI-compatible logprobs behavior, and refreshing dependencies/build workflows.

Changes:

  • Remove regex-based model config (user_data/models/config.yaml) and rely on model metadata + config-user.yaml overrides for settings/template detection.
  • Add --ik support (UI + CLI) and ik-specific portable build workflows/requirements.
  • Improve OpenAI API logprobs (prompt logprobs, sampled-token correctness) and various JS/UI refactors.

Reviewed changes

Copilot reviewed 50 out of 50 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| user_data/models/config.yaml | Removes legacy regex-based model settings mapping. |
| server.py | Drops global fallback model_config injection at startup. |
| requirements/portable/requirements.txt | Updates Gradio custom wheels and llama-cpp binaries version. |
| requirements/portable/requirements_vulkan.txt | Updates Gradio custom wheels and Vulkan llama-cpp binaries version. |
| requirements/portable/requirements_nowheels.txt | Updates Gradio custom wheels. |
| requirements/portable/requirements_ik.txt | Adds new ik portable requirements (CUDA 12.4). |
| requirements/portable/requirements_ik_cuda131.txt | Adds new ik portable requirements (CUDA 13.1). |
| requirements/portable/requirements_ik_cpu_only.txt | Adds new ik portable requirements (CPU only). |
| requirements/portable/requirements_cuda131.txt | Updates Gradio custom wheels and CUDA 13.1 llama-cpp binaries version. |
| requirements/portable/requirements_cpu_only.txt | Updates Gradio custom wheels and CPU llama-cpp binaries version. |
| requirements/portable/requirements_apple_silicon.txt | Updates Gradio custom wheels and macOS arm64 llama-cpp binaries version. |
| requirements/portable/requirements_apple_intel.txt | Updates Gradio custom wheels and macOS x86_64 llama-cpp binaries version. |
| requirements/portable/requirements_amd.txt | Updates Gradio custom wheels and ROCm llama-cpp binaries version. |
| requirements/full/requirements.txt | Bumps accelerate/transformers, Gradio wheels, llama-cpp binaries; adds ik wheels; bumps exllamav3 wheel. |
| requirements/full/requirements_nowheels.txt | Bumps accelerate/transformers and Gradio wheels. |
| requirements/full/requirements_cpu_only.txt | Bumps accelerate/transformers, Gradio wheels; updates llama-cpp CPU wheels; adds ik CPU wheels. |
| requirements/full/requirements_apple_silicon.txt | Bumps accelerate/transformers, Gradio wheels; updates macOS arm64 llama-cpp wheel. |
| requirements/full/requirements_apple_intel.txt | Bumps accelerate/transformers, Gradio wheels; updates macOS x86_64 llama-cpp wheel. |
| requirements/full/requirements_amd.txt | Bumps accelerate/transformers, Gradio wheels; updates ROCm llama-cpp wheels. |
| modules/ui_model_menu.py | Adds UI toggle for --ik (non-portable). |
| modules/ui_chat.py | Makes “Enable thinking” help text model-agnostic. |
| modules/transformers_loader.py | Adjusts HF torch_dtype selection to prefer --bf16, else config/autodetect. |
| modules/text_generation.py | Centralizes idle-load behavior and adds active-generation accounting. |
| modules/shared.py | Adds --ik CLI flag; removes loading of legacy config.yaml model settings. |
| modules/models.py | Adds idle-load helper and active generation counter; prevents idle unload during active generation. |
| modules/models_settings.py | Switches model settings/template detection to metadata-based defaults + config-user.yaml. |
| modules/logits.py | Uses shared idle-load helper for logits endpoint. |
| modules/loaders.py | Adds ik to supported loader/UI element lists. |
| modules/llama_cpp_server.py | Adds prompt logprobs support; adds ik binary selection and flag patching for ik compatibility. |
| modules/extensions.py | Refactors CSS/JS aggregation to ''.join(...). |
| modules/exllamav3.py | Adds EOS suppression via logit bias; improves sampled-token logprobs capture; adds prompt-logits helper; changes logits computation mode. |
| modules/exllamav3_hf.py | Normalizes text_config; changes labels/logits path behavior for correctness. |
| modules/chat.py | Adjusts GPT-OSS stopping-string handling based on template content. |
| modules/api/models.py | Renames “TRUNCATION LENGTH” log to “CONTEXT LENGTH”. |
| modules/api/completions.py | Adds prompt logprobs computation, improves sampled token selection, adds top_logprobs_ids, tightens max_tokens validation. |
| js/update_big_picture.js | Uses shared helper for profile picture cache-busting URL. |
| js/switch_tabs.js | Refactors button lookup/click logic with scoped search. |
| js/show_controls.js | Refactors DOM querying; consistently toggles gallery extension visibility. |
| js/save_files.js | Simplifies timestamp/path construction and makes mode fallback explicit. |
| js/main.js | Refactors event helpers, sidebar toggling, big-picture URL handling, and visibility-focus helper. |
| js/global_scope_js.js | Adds shared JS helpers (profile picture URL, message element helpers, Gradio input dispatch). |
| js/dark_theme.js | Improves re-highlighting by clearing data-highlighted and reusing the syntax-highlighting pipeline. |
| extensions/superboogav2/download_urls.py | Validates URLs before fetching (SSRF mitigation). |
| extensions/superbooga/download_urls.py | Validates URLs before fetching (SSRF mitigation). |
| download-model.py | Refactors HF link filtering logic for safetensors/GGUF selection. |
| docs/12 - OpenAI API.md | Updates instruction-template detection description to metadata-based detection. |
| docs/01 - Chat Tab.md | Updates instruction-template auto-detection description to metadata-based detection. |
| .github/workflows/build-portable-release-ik.yml | Adds workflow to build ik CPU portable artifacts. |
| .github/workflows/build-portable-release-ik-cuda.yml | Adds workflow to build ik CUDA portable artifacts. |
| .github/workflows/build-everything-tgw.yml | Adds ik build jobs to the “build everything” workflow. |


Comment on lines 19 to 23
```diff
 def generate_reply(*args, **kwargs):
-    if shared.args.idle_timeout > 0 and shared.model is None and shared.model_name not in [None, 'None']:
-        from modules.models import load_model
-        shared.model, shared.tokenizer = load_model(shared.model_name)
+    models.load_model_if_idle_unloaded()
 
     state = args[1] if len(args) > 1 else kwargs.get('state', {})
     use_parallel = (
```

Copilot AI Apr 3, 2026


`load_model_if_idle_unloaded()` is called before acquiring `shared.generation_lock`. If the idle-unload thread unloads the model after this check but before generation begins (while this request is blocked on the lock), `_generate_reply()` can run with `shared.model is None` and return an empty result instead of re-loading the model. Consider moving the idle-load check inside the lock (or re-checking/auto-loading after the lock is acquired) so requests can’t race with `unload_model_if_idle()`.
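A minimal sketch of the lock ordering this comment suggests. The names (`generation_lock`, `load_model_if_idle_unloaded`, `unload_model_if_idle`) mirror the PR but these are standalone stand-ins, not the project's code: the idle-load re-check happens only once the generation lock is held, and the unloader takes the same lock, so an unload cannot slip in between the check and the generation.

```python
import threading

generation_lock = threading.Lock()
model = None  # stand-in for shared.model


def load_model_if_idle_unloaded():
    global model
    if model is None:
        model = "loaded"  # placeholder for the real load_model() call


def unload_model_if_idle():
    global model
    with generation_lock:  # same lock -> cannot race a generation in flight
        model = None


def generate_reply(prompt):
    with generation_lock:
        # Re-check under the lock: even if the idle-unload thread ran
        # while we were blocked waiting, the model is re-loaded here.
        load_model_if_idle_unloaded()
        return f"reply to {prompt!r} using {model}"
```

An alternative with the same effect is keeping the early check but repeating it after the lock is acquired, which avoids holding the lock during a slow initial load in the common case.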

Comment on lines +122 to +131
```python
# Determine torch_dtype: respect --bf16 flag, otherwise autodetect
# from model config, but never allow float32.
if shared.args.bf16:
    params['torch_dtype'] = torch.bfloat16
else:
    dtype = getattr(config, 'torch_dtype', None) or getattr(getattr(config, 'text_config', None), 'torch_dtype', None)
    if dtype in (torch.float16, torch.bfloat16):
        params['torch_dtype'] = dtype
    else:
        params['torch_dtype'] = torch.float16
```

Copilot AI Apr 3, 2026


`dtype` read from `AutoConfig` is often serialized as a string (e.g. `"float16"`/`"bfloat16"`) rather than a `torch.dtype`. The current `dtype in (torch.float16, torch.bfloat16)` check will fail in that case and always fall back to fp16. Normalize string dtypes (and possibly `np.dtype`) to `torch.dtype` before the comparison.
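One way to do that normalization, sketched here without importing torch by comparing canonical dtype *names* (`str(torch.float16)` is `'torch.float16'`, so stripping the `torch.` prefix covers both real `torch.dtype` objects and string-serialized configs). The helper names are illustrative, not the PR's actual code:

```python
def normalize_dtype_name(dtype):
    """Map a torch.dtype, numpy dtype, or string like 'bfloat16' to a bare name."""
    name = str(dtype)  # str(torch.float16) == 'torch.float16'
    if name.startswith("torch."):
        name = name[len("torch."):]
    return name


def pick_half_precision(config_dtype, bf16_flag=False):
    # Mirrors the snippet's policy: --bf16 wins, otherwise accept only
    # half-precision dtypes from the config, and never allow float32.
    if bf16_flag:
        return "bfloat16"
    if config_dtype is not None and normalize_dtype_name(config_dtype) in ("float16", "bfloat16"):
        return normalize_dtype_name(config_dtype)
    return "float16"
```

The real fix would then map the resulting name back to a `torch.dtype` with something like `getattr(torch, name)` before assigning `params['torch_dtype']`.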

Comment on lines +331 to +334
```python
response = self.session.post(url, json=payload)
result = response.json()

prompt_probs = result.get("prompt_probabilities", [])
```

Copilot AI Apr 3, 2026


`get_prompt_logprob_entries()` doesn’t call `raise_for_status()` (or set a timeout) before `response.json()`. If the server returns a non-2xx response or a non-JSON body, this will raise a confusing exception. Consider adding a timeout and `response.raise_for_status()`, plus a clearer error when `prompt_probabilities` is missing.

Suggested change

```diff
-response = self.session.post(url, json=payload)
-result = response.json()
-
-prompt_probs = result.get("prompt_probabilities", [])
+try:
+    response = self.session.post(url, json=payload, timeout=30)
+    response.raise_for_status()
+    result = response.json()
+except requests.JSONDecodeError as exc:
+    raise RuntimeError("llama.cpp server returned a non-JSON response for prompt logprobs") from exc
+except requests.RequestException as exc:
+    raise RuntimeError("Failed to fetch prompt logprobs from llama.cpp server") from exc
+if "prompt_probabilities" not in result:
+    raise RuntimeError("llama.cpp server response missing 'prompt_probabilities'")
+prompt_probs = result["prompt_probabilities"]
```

Comment on lines +53 to +54
```python
import torch
tok = tokenizer.decode(torch.tensor([token_id]))
```

Copilot AI Apr 3, 2026


`tokenizer.decode(torch.tensor([token_id]))` assumes the tokenizer accepts torch tensors. Some tokenizers (non-HF / custom) only accept Python lists, which would break prompt logprobs. Prefer `tokenizer.decode([token_id])` and avoid a hard dependency on torch in this helper.

Suggested change

```diff
-import torch
-tok = tokenizer.decode(torch.tensor([token_id]))
+tok = tokenizer.decode([token_id])
```

Comment on lines +533 to +542
```python
    Used by prompt logprobs computation. Returns (1, seq_len, vocab) on CPU in float32.
    """
    import torch
    input_ids_tensor = input_ids if isinstance(input_ids, torch.Tensor) else torch.tensor(input_ids, dtype=torch.long)
    input_ids_tensor = input_ids_tensor.view(1, -1).cpu()
    with torch.no_grad():
        return self.model.forward(
            input_ids=input_ids_tensor,
            params={"attn_mode": "flash_attn_nc"}
        ).cpu().float()
```

Copilot AI Apr 3, 2026


`get_prompt_logits()` forces logits to CPU float32 (`.cpu().float()`). `_compute_prompt_logprob_entries()` is optimized to do logsumexp/topk on GPU and only move top-k results to CPU; returning CPU logits will make prompt logprobs extremely slow and can blow up CPU RAM for long prompts. Consider returning logits on the model device (or at least not forcing `.cpu()`), and letting the caller decide when to transfer.

Suggested change

```diff
-    Used by prompt logprobs computation. Returns (1, seq_len, vocab) on CPU in float32.
-    """
-    import torch
-    input_ids_tensor = input_ids if isinstance(input_ids, torch.Tensor) else torch.tensor(input_ids, dtype=torch.long)
-    input_ids_tensor = input_ids_tensor.view(1, -1).cpu()
-    with torch.no_grad():
-        return self.model.forward(
-            input_ids=input_ids_tensor,
-            params={"attn_mode": "flash_attn_nc"}
-        ).cpu().float()
+    Used by prompt logprobs computation. Returns (1, seq_len, vocab) in float32
+    on the device used by ``self.model.forward(...)``.
+    """
+    import torch
+    input_ids_tensor = input_ids if isinstance(input_ids, torch.Tensor) else torch.tensor(input_ids, dtype=torch.long)
+    input_ids_tensor = input_ids_tensor.view(1, -1)
+    with torch.no_grad():
+        return self.model.forward(
+            input_ids=input_ids_tensor,
+            params={"attn_mode": "flash_attn_nc"}
+        ).float()
```
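The pattern the comment above describes (do the heavy reduction — logsumexp — and the top-k selection close to the data, then hand only the small top-k result onward) can be illustrated with a pure-Python stand-in for the torch version. `topk_logprobs` is a hypothetical name, and `logits` here is one position's vocabulary scores:

```python
import math
import heapq


def topk_logprobs(logits, k=5):
    """Return the k (token_id, logprob) pairs with the highest scores."""
    # log-sum-exp with the usual max-subtraction for numerical stability
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    # select top-k before normalizing, so only k pairs (not the full
    # vocab-sized array) leave this function
    top = heapq.nlargest(k, enumerate(logits), key=lambda p: p[1])
    return [(i, x - lse) for i, x in top]
```

In the torch version this corresponds to calling `torch.logsumexp` and `torch.topk` on the GPU tensor and only calling `.cpu()` on the small top-k result.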

Comment on lines +205 to +209
```python
# Labels path: single pass without cache for correct logits
logits = self.ex_model.forward(
    input_ids=seq_tensor.view(1, -1),
    params={"attn_mode": "flash_attn_nc"}
).float().cpu()
```

Copilot AI Apr 3, 2026


In the labels path, logits are moved to CPU (`.float().cpu()`). This makes prompt-logprobs computation (which uses `labels=`) run heavy topk/logsumexp on CPU and may become prohibitively slow/large for long prompts. Consider keeping logits on GPU (or providing a dedicated GPU-side prompt-logits API) and only transferring the small top-k outputs needed by the API.

Suggested change

```diff
-# Labels path: single pass without cache for correct logits
-logits = self.ex_model.forward(
-    input_ids=seq_tensor.view(1, -1),
-    params={"attn_mode": "flash_attn_nc"}
-).float().cpu()
+# Labels path: single pass without cache for correct logits.
+# Keep logits on the model's device so downstream prompt-logprobs,
+# topk, and logsumexp work stays on GPU.
+logits = self.ex_model.forward(
+    input_ids=seq_tensor.view(1, -1),
+    params={"attn_mode": "flash_attn_nc"}
+).float()
```

Comment on lines +54 to +56
```python
if not shared.args.portable:
    shared.gradio['ik'] = gr.Checkbox(label="ik", value=shared.args.ik, info='Use ik_llama.cpp instead of upstream llama.cpp.')
```

Copilot AI Apr 3, 2026


The new checkbox label `"ik"` is not self-explanatory in the UI. Consider a clearer label (e.g. "Use ik_llama.cpp") while keeping the help text for details.

@oobabooga merged commit ae699ac into main on Apr 3, 2026
5 checks passed

Labels: None yet
Projects: None yet
2 participants