A production pipeline that converts GGUF checkpoints of hybrid‑attention and MoE models into NVIDIA NVFP4 quantized HuggingFace safetensors — ready to serve with vLLM on a single RTX 5090. Targets models that transformers cannot load from GGUF directly (Qwen3.5, Qwen3.6‑MoE, Gemma 4).
Languages: English · 中文
transformers cannot load Qwen3.5 (hybrid Gated‑DeltaNet), Qwen3.6‑MoE (A3B, with MTP head), or Gemma 4 (GQA + PLE) directly from GGUF. This repo provides model‑specific GGUF → HF → NVFP4 pipelines that extract tensors, fix the known conversion pitfalls, quantize to NVFP4 with llm-compressor, and emit a vLLM‑ready multimodal checkpoint.
| Model | Architecture | Source format | Output (NVFP4) | Target GPU | Stages |
|---|---|---|---|---|---|
| Qwen3.5‑27B | Hybrid Gated‑DeltaNet (3:1 linear ∶ full‑attn), text + vision | bf16 GGUF | ~18 GB + 0.9 GB vision | RTX 5090 (32 GB) | 3 (convert → quantize → stitch) |
| Qwen3.6‑35B‑A3B MoE | 256 experts (8 routed + 1 shared), hybrid DeltaNet, text + vision + MTP | Q8_K_P GGUF | ~21–22 GB | RTX 5090 (32 GB) | 2 (convert+MTP → quantize) |
| Gemma 4 E4B | Standard GQA + Per‑Layer Embedding, text + vision + audio | Q8_K_P GGUF | ~5–6 GB | RTX 5090 / smaller | 2 (convert → NVFP4A16) |
Quick links: Qwen3.5‑27B · Qwen3.6‑35B‑A3B MoE · Gemma 4 E4B · vLLM deployment · GitHub topics
Qwen3.5 uses a hybrid Gated‑DeltaNet architecture that interleaves Mamba‑style linear attention with full softmax attention at a 3:1 ratio. As of writing, transformers does not support loading Qwen3.5 from GGUF format directly. This repo manually extracts tensors from GGUF, applies the required transformations, quantizes to NVFP4, and produces a model ready for vLLM.
Qwen3.6‑35B‑A3B adds Mixture‑of‑Experts on top of the same hybrid attention and carries an additional MTP (Multi‑Token Prediction) head for speculative decoding. Many community finetunes (e.g. HauhauCS uncensored) publish only quantized GGUFs — no native bf16 safetensors.
Gemma 4 E4B has the same GGUF‑loader gap: transformers supports gemma2/gemma3 from GGUF but not gemma4. The HauhauCS uncensored series is published only as GGUF, so the same GGUF → HF → NVFP4 path applies.
GGUF (bf16 LLM + mmproj vision)
│ step1_convert.py
▼
HuggingFace Safetensors (bf16, sharded)
│ step2_quantize.py
▼
NVFP4 Quantized Model (text‑only)
│ step3_stitch_vision.py
▼
Final Model (NVFP4 text + bf16 vision, vLLM‑ready)
The Qwen3.5 GGUF format (via llama.cpp) has several non‑obvious differences from HuggingFace safetensors. Getting any of these wrong produces a model that loads cleanly but generates garbage.
GGUF stores RMSNorm weights as 1 + learned_parameter; HuggingFace stores just learned_parameter.
```python
# Affected: attn_norm, post_attention_norm, output_norm, attn_q_norm, attn_k_norm
# NOT affected: ssm_norm (GroupNorm, different normalization)
if is_rmsnorm_tensor(gguf_name):
    tensor = (tensor.float() - 1.0).to(torch.bfloat16)
```

Without this fix every RMSNorm output is shifted, causing gradual degradation — the first few tokens may look correct but the output rapidly becomes incoherent.
GGUF stores the SSM decay parameter as the materialized value A = -exp(A_log); HuggingFace expects A_log (log‑space).
```python
# GGUF ssm_a -> HF A_log
if gguf_name.endswith(".ssm_a"):
    tensor = (-tensor.float()).log().to(torch.bfloat16)
```

The most subtle bug. Qwen3.5's linear attention has 48 value heads organized as 16 KV‑groups of 3 heads each. GGUF (llama.cpp) stores these in (3 heads, 16 groups) order; HuggingFace expects (16 groups, 3 heads).
Affected tensors (all linear_attn layers, 48 of 64 layers):
| Tensor | Shape | Fix |
|---|---|---|
| `in_proj_a.weight` | [48, D] | Reshape (3,16,D) → permute (1,0,2) |
| `in_proj_b.weight` | [48, D] | Same as above |
| `dt_bias` | [48] | Reshape (3,16) → permute (1,0) |
| `A_log` | [48] | Same (after exp fix) |
| `in_proj_qkv.weight` | [10240, D] | V‑section only [4096:] |
| `in_proj_z.weight` | [6144, D] | Full reshape (3,16,128,D) → permute (1,0,2,3) |
| `out_proj.weight` | [D, 6144] | Column permutation |
| `conv1d.weight` | [10240, 1, K] | V‑section only [4096:] |
The QKV split for in_proj_qkv is: Q=2048 (16 heads × 128 dim) + K=2048 + V=6144 (48 heads × 128 dim) = 10240 total. Only the V section needs permutation.
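As a sanity check on the head‑order fix, here is a minimal pure‑Python sketch of the (3 heads, 16 groups) → (16 groups, 3 heads) row reordering. The pipeline itself does this with `reshape(3, 16, -1).permute(1, 0, 2)`; the index arithmetic below is equivalent, and the helper name is ours:

```python
def regroup_v_heads(rows, heads=3, groups=16):
    """Reorder per-head rows from GGUF order (head-major: index = h*groups + g)
    to HF order (group-major: index = g*heads + h).
    Equivalent to reshape(heads, groups, -1) -> permute(1, 0, 2) -> flatten."""
    assert len(rows) == heads * groups
    return [rows[h * groups + g] for g in range(groups) for h in range(heads)]

# With rows labeled 0..47 in GGUF order, the first three HF rows are GGUF
# rows 0, 16, 32 — the three heads of KV-group 0.
order = regroup_v_heads(list(range(48)))
```

For `in_proj_qkv` and `conv1d` the same reordering is applied only to the V slice (`[4096:]` of the 10240 rows); the Q and K slices keep their layout. The Qwen3.6 variant below uses `heads=2` with 32 rows.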
GGUF stores tensor shapes in column‑major (Fortran) order. PyTorch uses row‑major (C) order.

```python
shape = list(reversed(tensor_info.shape))
```

The conv1d weight in GGUF is stored as 2D [channels, kernel_size], but PyTorch expects 3D [channels, 1, kernel_size].
```python
if "conv1d.weight" in hf_name and tensor.dim() == 2:
    tensor = tensor.unsqueeze(1).contiguous()
```

Prerequisites
```bash
pip install torch transformers>=5.0 safetensors gguf numpy huggingface-hub
pip install llmcompressor datasets  # for quantization step
```

Step 1 — GGUF → HF safetensors
```bash
python scripts/step1_convert.py \
  --gguf-llm /path/to/model.bf16.gguf \
  --gguf-vision /path/to/mmproj.gguf \
  --output-dir ./model-bf16-hf \
  --reference-repo huihui-ai/Huihui-Qwen3.5-27B-abliterated
```

`--reference-repo` provides `config.json` and tokenizer files. Use any HuggingFace repo of the same model architecture.
Step 2 — NVFP4 Quantization
```bash
python scripts/step2_quantize.py \
  --model-dir ./model-bf16-hf \
  --output-dir ./model-nvfp4
```

Runs oneshot NVFP4 quantization with 512 calibration samples from neuralmagic/calibration. The following layers are excluded:
- `lm_head` — output projection stays bf16
- `visual.*` — vision encoder stays bf16
- `*.in_proj_a`, `*.in_proj_b` — SSM gate parameters stay bf16
Step 3 — Stitch Vision Weights
```bash
python scripts/step3_stitch_vision.py \
  --bf16-dir ./model-bf16-hf \
  --nvfp4-dir ./model-nvfp4
```

Merges the original bf16 vision weights back into the quantized model, remaps weight names (`model.*` → `model.language_model.*`), updates `config.json` for `Qwen3_5ForConditionalGeneration`, and re‑shards the output.
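The weight‑name remap in the stitch step can be sketched as follows. This is a simplification — the exact prefix rules live in `step3_stitch_vision.py`, and the `model.visual` exception here is our assumption about where the vision tower sits in the key namespace:

```python
def remap_key(key):
    """Move text-model weights under the language_model namespace expected by
    Qwen3_5ForConditionalGeneration, leaving vision-tower keys untouched."""
    if key.startswith("model.") and not key.startswith("model.visual"):
        return "model.language_model." + key[len("model."):]
    return key
```

Keys outside `model.*` (e.g. `lm_head.weight`) pass through unchanged.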
Qwen3.5 Hybrid Attention. Qwen3.5‑27B has 64 layers with a 3:1 linear‑to‑full attention ratio:
- Layers 0, 1, 2, 4, 5, 6, 8, 9, 10, … (48 layers): Gated DeltaNet linear attention
- Layers 3, 7, 11, 15, … (16 layers): Full softmax attention
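The 3:1 interleave means every fourth layer (index ≡ 3 mod 4) is full attention; a one‑liner reproduces the schedule above (the helper name is ours):

```python
def attention_kind(layer_idx):
    # Qwen3.5-27B: full softmax attention on every 4th layer, DeltaNet otherwise
    return "full" if layer_idx % 4 == 3 else "linear"

full_layers = [i for i in range(64) if attention_kind(i) == "full"]
# full_layers == [3, 7, 11, ..., 63]: 16 full-attention layers, 48 linear
```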
Linear‑attention layers use:
- `in_proj_qkv`: Fused Q/K/V projection `[10240, 5120]`
- `in_proj_z`: Gate projection `[6144, 5120]`
- `in_proj_a`, `in_proj_b`: SSM parameters `[48, 5120]`
- `out_proj`: Output projection `[5120, 6144]`
- `conv1d`: Causal convolution `[10240, 1, 4]`
- `A_log`, `dt_bias`: Recurrence parameters `[48]`
- `norm`: GroupNorm `[48]`
Memory Budget (RTX 5090, 32 GB VRAM).
| Component | Size |
|---|---|
| NVFP4 quantized weights | ~18 GB |
| Vision encoder (bf16) | ~0.9 GB |
| KV cache (fp8, 32K context) | ~8 GB |
| Overhead | ~3 GB |
| Total | ~30 GB |
Use gpu_memory_utilization=0.90 and kv_cache_dtype=fp8 for comfortable operation.
A separate entry‑point supports Qwen3.6‑35B‑A3B (and community finetunes such as the HauhauCS uncensored series). This is a Mixture‑of‑Experts model with the same hybrid Gated‑DeltaNet attention as Qwen3.5, but fundamentally different in FFN structure — and it carries an MTP (Multi‑Token Prediction) head for speculative decoding.
| | Qwen3.5‑27B | Qwen3.6‑35B‑A3B |
|---|---|---|
| Type | Dense | MoE (256 experts, 8 routed + 1 shared) |
| Layers | 64 | 40 |
| Hidden size | 5120 | 2048 |
| Head dim (full attn) | 128 | 256 |
| V heads (linear attn) | 48 = 3×16 | 32 = 2×16 |
| Full‑attn Q dim | 4096 | 8192 (includes output gate) |
| QKV split (linear) | Q:2048 + K:2048 + V:6144 = 10240 | Q:2048 + K:2048 + V:4096 = 8192 |
| MTP head | — | ✅ (1 layer, 19 tensors, speculative decoding) |
| HF architecture | `Qwen3_5ForConditionalGeneration` | `Qwen3_5MoeForConditionalGeneration` |
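As a quick arithmetic check of the fused‑projection layout in the QKV‑split row (constants taken from the table; the helper name is ours):

```python
def linear_attn_qkv_rows(q_heads=16, kv_groups=16, v_heads_per_group=3, head_dim=128):
    """Row counts of the [Q | K | V] sections of the fused in_proj_qkv."""
    q = q_heads * head_dim
    k = kv_groups * head_dim
    v = kv_groups * v_heads_per_group * head_dim
    return q, k, v

# Qwen3.5-27B: 2048 + 2048 + 6144 = 10240; only rows [4096:] (the V section)
# need the head-order permutation. Qwen3.6 drops to 2 V-heads per group.
```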
GGUF stores MoE expert weights as packed 3D tensors. HF expects a fused gate_up_proj:
```python
# GGUF: ffn_gate_exps [256, 512, 2048] + ffn_up_exps [256, 512, 2048]
# HF:   experts.gate_up_proj [256, 1024, 2048] (no .weight suffix!)
fused = torch.cat([gate_exps, up_exps], dim=1)
```

Other MoE tensors:

- `ffn_down_exps` → `experts.down_proj` (no `.weight` suffix)
- `ffn_gate_inp` → `mlp.gate.weight` (router)
- `ffn_gate_shexp` / `ffn_up_shexp` / `ffn_down_shexp` → `shared_expert.{gate,up,down}_proj.weight`
- `ffn_gate_inp_shexp` → `shared_expert_gate.weight` (needs `unsqueeze(0)`: GGUF `[2048]` → HF `[1, 2048]`)
Beyond the five Qwen3.5 pitfalls (which all still apply), Qwen3.6 adds:
- V‑head permutation is (2,16) not (3,16). 32 V‑heads = 16 KV‑groups × 2 heads. Same reshape/permute logic, different constants.
- Patch embed is 5D. The vision encoder uses a temporal 3D conv. GGUF splits it into `v.patch_embd.weight` + `v.patch_embd.weight.1` (two 4D tensors), which must be stacked into one 5D tensor `[C, 3, 2, H, W]`.
- MTP not in GGUF. The Multi‑Token Prediction head (19 tensors) must be copied from the base HF model (`Qwen/Qwen3.6-35B-A3B`). Only 2 safetensor shards are needed (`model-00025`, `model-00026`), so selective download is recommended over pulling the full 67 GB reference.
- Q8_K_P source. Unlike Qwen3.5, which ships a bf16 GGUF, HauhauCS publishes only quantized GGUFs. Use `gguf.quants.dequantize()`, which handles Q8_K → F32 and auto‑reverses shapes.
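The 5D patch‑embed stacking can be sketched with `torch.stack` — shown here on toy shapes. The real tensors are the two `[C, 3, H, W]` halves of the temporal conv kernel, and the `dim=2` placement (temporal axis between channels and spatial dims) matches the `[C, 3, 2, H, W]` layout described above:

```python
import torch

# Stand-ins for the two 4D GGUF tensors v.patch_embd.weight and
# v.patch_embd.weight.1, toy-sized as [out_ch=8, in_ch=3, H=4, W=4]
frame0 = torch.randn(8, 3, 4, 4)
frame1 = torch.randn(8, 3, 4, 4)

# Stack along a new temporal axis -> one 5D kernel [C, 3, 2, H, W]
patch_embed = torch.stack([frame0, frame1], dim=2)
```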
Two stages — no stitch step needed, because the quantization ignore list keeps vision in bf16 and Qwen3_5MoeForConditionalGeneration loads the full multimodal model in one shot:
```
Q8_K_P GGUF + mmproj GGUF + Reference HF (config/tokenizer + MTP shards)
        │  step1_convert_qwen36_moe.py  (→ 1045 tensors across 22 shards, ~67 GB)
        ▼    (733 text + 333 vision + 19 MTP)
HuggingFace Safetensors (bf16, text + vision + MTP)
        │  step2_quantize_qwen36_moe.py         (conservative: linear_attn + MTP bf16)
        │  step2b_quantize_qwen36_aggressive.py (aggressive: everything NVFP4)
        ▼
Final NVFP4 Model (~21–22 GB, vLLM‑ready)
```
Conservative (step2_quantize_qwen36_moe.py) — AEON‑7 / RedHatAI approach. Keeps linear_attn (DeltaNet) and MTP in bf16 because linear_attn is precision‑sensitive and MTP quality directly affects speculative‑decoding acceptance rates.
```python
ignore=["lm_head", "re:.*visual.*", "re:.*mlp.gate$",
        "re:.*mlp.shared_expert_gate$", "re:.*linear_attn.*", "re:^mtp.*"]
```

Aggressive (step2b_quantize_qwen36_aggressive.py) — sakamakismile approach. Quantizes everything except lm_head, vision, and gates. The smaller footprint buys longer context.

```python
ignore=["lm_head", "re:.*visual.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"]
```

| Profile | Size | RTX 5090 text‑only ctx | With vision |
|---|---|---|---|
| Conservative | ~22 GB | ~131K | ~4K |
| Aggressive | ~21 GB | ~131K+ | ~65K |
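The ignore patterns above follow llm-compressor's convention of a `re:` prefix marking a regex; the sketch below approximates how a module name is tested against the list (an approximation only — the real resolution in compressed-tensors also supports matching by module type, which is omitted here):

```python
import re

def is_ignored(name, patterns):
    """Approximate llm-compressor ignore matching: entries starting with 're:'
    are regexes matched against the module name; others are exact names."""
    for pat in patterns:
        if pat.startswith("re:"):
            if re.match(pat[3:], name):
                return True
        elif name == pat:
            return True
    return False

conservative = ["lm_head", "re:.*visual.*", "re:.*mlp.gate$",
                "re:.*mlp.shared_expert_gate$", "re:.*linear_attn.*", "re:^mtp.*"]
```

Note how the `$` anchors keep `mlp.gate` (the router) ignored while `mlp.gate_proj` inside experts is still quantized.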
```bash
# 1. Download GGUF source
hf download HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive \
  --include "*Q8_K_P*" "*mmproj*" --local-dir ./src

# 2. Download reference config/tokenizer + the 2 MTP shards only
hf download Qwen/Qwen3.6-35B-A3B \
  config.json tokenizer.json tokenizer_config.json chat_template.jinja \
  merges.txt vocab.json generation_config.json preprocessor_config.json \
  video_preprocessor_config.json model.safetensors.index.json \
  model-00025-of-00026.safetensors model-00026-of-00026.safetensors \
  --local-dir ./ref

# 3. Convert GGUF to HF safetensors (text + vision + MTP injection)
python scripts/step1_convert_qwen36_moe.py \
  --gguf-llm ./src/*Q8_K_P*.gguf \
  --gguf-vision ./src/*mmproj*.gguf \
  --output-dir ./qwen36-bf16-hf \
  --reference-repo Qwen/Qwen3.6-35B-A3B

# 4a. Conservative NVFP4 (linear_attn + MTP stay bf16)
python scripts/step2_quantize_qwen36_moe.py \
  --model-dir ./qwen36-bf16-hf \
  --output-dir ./qwen36-nvfp4-conservative

# 4b. OR aggressive NVFP4 (everything quantized, best for long context)
python scripts/step2b_quantize_qwen36_aggressive.py \
  --model-dir ./qwen36-bf16-hf \
  --output-dir ./qwen36-nvfp4-aggressive
```

The 67 GB bf16 model exceeds typical 64 GB system RAM. The step2 scripts use `device_map="auto"` with disk offloading, which requires two library patches plus a post‑save key rename to work around a transformers / llmcompressor save bug triggered by MoE + disk offload:
- `transformers/integrations/accelerate.py` (`load_offloaded_parameter`): wrap `model.get_submodule()` in `try / except AttributeError: continue` to skip non‑matching paths.
- `llmcompressor/.../compressed_tensors_utils.py` (`save_pretrained_wrapper`): comment out `to_accelerate(model)` and `from_accelerate(model)` to prevent tensor‑name‑prefix triplication.
- Post‑save key rename: the saved safetensors will have a triple `model.language_model.language_model.language_model.` prefix. Fix with `key.replace('model.language_model.language_model.language_model.', 'model.language_model.')`.
These patches are not needed on 128 GB+ systems (load without device_map).
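The post‑save rename can be applied with a short helper — sketched here on plain key lists; the real fix would iterate every shard with safetensors and rewrite it:

```python
def fix_triple_prefix(keys):
    """Collapse the triplicated prefix left by the MoE + disk-offload save bug."""
    bad = "model.language_model.language_model.language_model."
    good = "model.language_model."
    return [k.replace(bad, good) for k in keys]
```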
```bash
# Text‑only, 100K+ context on RTX 5090
docker run --gpus all -v ./qwen36-nvfp4:/model vllm/vllm-openai:nightly \
  --model /model \
  --quantization compressed-tensors \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 100000 \
  --max-num-seqs 1 \
  --reasoning-parser qwen3 \
  --language-model-only

# With MTP speculative decoding (when supported)
# --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
```

AEON‑7 recommends setting `VLLM_TEST_FORCE_FP8_MARLIN=1` if CUTLASS NVFP4 is broken on your SM121 GPU.
A separate entry‑point supports Gemma 4 E4B multimodal models (text + vision + audio). Targets include HauhauCS's Gemma-4-E4B-Uncensored-*-Aggressive series and any other Gemma 4 E4B finetune published only as GGUF.
Gemma 4 has architectural differences from Qwen3.5 that require a different conversion path:
- No hybrid attention. Standard GQA transformer — no (3,16) value‑head permutation, no A_log domain mismatch, no conv1d SSM. Half of Qwen3.5's pitfalls disappear.
- RMSNorm stores weights as‑is. Unlike Qwen3.5 / Gemma 2 / Gemma 3 — where GGUF stores `1 + weight` because HF's RMSNorm forward is `(1 + w) * x` — Gemma 4's `Gemma4RMSNorm.forward` is `normed_output * self.weight.float()`: no `+ 1.0`. Subtracting 1 here silently corrupts every layer norm.
- Per‑Layer Embedding (PLE). E4B has per‑layer 256‑dim embeddings and a global projection from hidden size to `42 * 256`. New tensor types: `embed_tokens_per_layer`, `per_layer_model_projection`, `per_layer_input_gate`, `per_layer_projection`.
- Vision + audio towers use quant‑wrapped linears. HF wraps each vision/audio linear as `self_attn.q_proj.linear.weight` (nested) plus four FakeQuantize scalar bounds (`input_max`, `input_min`, `output_max`, `output_min`) with shape `()`. GGUF has only the bare weight with no scalar equivalent.
- Q8_K_P is "Permissive". HauhauCS's `Q8_K_P` variant keeps attention Q/K/V as F16 and only quantizes MLP + output projections to Q8_0 — a better source than pure Q8 when full bf16 isn't published.
- Tied embeddings. No `lm_head.weight` in the HF state dict; `lm_head` is tied to `embed_tokens` at runtime.
Three of Qwen3.5's five conversion pitfalls (RMSNorm +1.0, A_log domain, value‑head (3,16) → (16,3)) do not apply. Of the remaining two, column‑major shape reversal still applies but is handled automatically by `gguf.quants.dequantize`, and the conv1d unsqueeze is moot because Gemma 4 has no SSM. In practice Gemma 4 E4B is simpler than Qwen3.5 once you know the tensor map.
Per‑layer tensors (blk.<i> → model.language_model.layers.<i>):
| GGUF suffix | HF suffix |
|---|---|
| `attn_norm.weight` | `input_layernorm.weight` |
| `attn_q.weight` | `self_attn.q_proj.weight` |
| `attn_k.weight` | `self_attn.k_proj.weight` |
| `attn_v.weight` | `self_attn.v_proj.weight` |
| `attn_output.weight` | `self_attn.o_proj.weight` |
| `attn_q_norm.weight` | `self_attn.q_norm.weight` |
| `attn_k_norm.weight` | `self_attn.k_norm.weight` |
| `post_attention_norm.weight` | `post_attention_layernorm.weight` |
| `ffn_norm.weight` | `pre_feedforward_layernorm.weight` |
| `post_ffw_norm.weight` | `post_feedforward_layernorm.weight` |
| `post_norm.weight` | `post_per_layer_input_norm.weight` |
| `ffn_gate.weight` | `mlp.gate_proj.weight` |
| `ffn_up.weight` | `mlp.up_proj.weight` |
| `ffn_down.weight` | `mlp.down_proj.weight` |
| `inp_gate.weight` | `per_layer_input_gate.weight` |
| `proj.weight` | `per_layer_projection.weight` |
| `layer_output_scale.weight` | `layer_scalar` (no `.weight` suffix) |
Globals:
| GGUF | HF |
|---|---|
| `token_embd.weight` | `model.language_model.embed_tokens.weight` |
| `output_norm.weight` | `model.language_model.norm.weight` |
| `per_layer_model_proj.weight` | `model.language_model.per_layer_model_projection.weight` |
| `per_layer_proj_norm.weight` | `model.language_model.per_layer_projection_norm.weight` |
| `per_layer_token_embd.weight` | `model.language_model.embed_tokens_per_layer.weight` |
| `rope_freqs.weight` | (skip — runtime‑computed by HF) |
Total: 17 per‑layer × 42 layers + 5 globals = 719 text tensors.
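The per‑layer mapping is mechanical; a sketch of the rename follows, with only 4 of the 17 suffixes shown (our assumption: the converter uses an equivalent lookup — the full map lives in `step1_convert_gemma4_e4b.py`):

```python
import re

# Subset of the GGUF-suffix -> HF-suffix table above (4 of 17 entries)
SUFFIX_MAP = {
    "attn_norm.weight": "input_layernorm.weight",
    "attn_q.weight": "self_attn.q_proj.weight",
    "ffn_gate.weight": "mlp.gate_proj.weight",
    "layer_output_scale.weight": "layer_scalar",  # note: no .weight suffix in HF
}

def map_gemma4_name(gguf_name):
    """Map blk.<i>.<suffix> to model.language_model.layers.<i>.<hf_suffix>."""
    m = re.match(r"blk\.(\d+)\.(.+)", gguf_name)
    if not m:
        raise KeyError(gguf_name)
    layer, suffix = m.groups()
    return f"model.language_model.layers.{layer}.{SUFFIX_MAP[suffix]}"
```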
Rather than re‑implementing the quant‑wrapped linear layout and synthesizing FakeQuantize scalar bounds, the Gemma 4 pipeline copies vision, audio, embed_vision, and embed_audio tensors directly from a reference HF repo (default: huihui-ai/Huihui-gemma-4-E4B-it-abliterated). This works because Gemma 4 finetunes typically do not modify vision/audio towers — they inherit unchanged from google/gemma-4-e4b-it.
Verify this assumption for any new finetune before using the pipeline on it — e.g. dequantize a few vision tensors from the finetune's GGUF `mmproj` and diff against the reference.
Two stages instead of three — no stitch, because llmcompressor can quantize the full multimodal Gemma4ForConditionalGeneration in one oneshot call with a regex ignore list, and writes a complete checkpoint:
```
GGUF (text Q8_K_P) + Reference HF repo (config/tokenizer + vision/audio tensors)
        │  step1_convert_gemma4_e4b.py
        ▼
HuggingFace Safetensors (bf16, full multimodal)
        │  step2_quantize_gemma4_e4b.py  (NVFP4A16, weight‑only)
        ▼
Final Model (NVFP4 text + BF16 vision/audio, vLLM‑ready)
```
```bash
# 1. Download the GGUF source (Q8_K_P is the highest precision HauhauCS publishes for Gemma 4 E4B)
pip install hf-transfer
HF_HUB_ENABLE_HF_TRANSFER=1 hf download \
  HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive \
  Gemma-4-E4B-Uncensored-HauhauCS-Aggressive-Q8_K_P.gguf \
  --local-dir ./src

# 2. Convert GGUF to HF safetensors (reference repo auto-downloaded)
python scripts/step1_convert_gemma4_e4b.py \
  --gguf-text ./src/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive-Q8_K_P.gguf \
  --reference-repo huihui-ai/Huihui-gemma-4-E4B-it-abliterated \
  --output-dir ./gemma4-e4b-bf16-hf

# 3. NVFP4A16 quantization (weight-only, no calibration data)
python scripts/step2_quantize_gemma4_e4b.py \
  --model-dir ./gemma4-e4b-bf16-hf \
  --output-dir ./gemma4-e4b-nvfp4
```

E4B's architecture (per‑layer embeddings, audio encoder, dynamic masking) breaks `fx.symbolic_trace`, which is a prerequisite for the sequential calibration pipeline needed by w4a4. NVFP4A16 quantizes weights from their own min/max statistics with no data flow, so no trace is needed at all. Quality impact is small because the 4‑bit weight quantization dominates the final error floor.
The _P in Q8_K_P stands for Permissive: attention Q/K/V projections stay as F16 (no quantization), only MLP and output projections are Q8_0. Since the final target is NVFP4 (4‑bit), Q8's ~8‑bit error floor is well below the NVFP4 noise floor and does not meaningfully affect quality.
Many Gemma 4 E4B finetunes only publish GGUF at up to Q8_K_P — no native bf16/fp16 safetensors. This pipeline's Q8_K_P → bf16 dequant is the fallback. If a bf16 source becomes available, skip step1 entirely and feed it directly to step2_quantize_gemma4_e4b.py.
The Gemma 4 pipeline needs transformers >= 5.5 (for Gemma4ForConditionalGeneration), llmcompressor main branch, and gguf >= 0.18. Tested on RTX 5090 with torch 2.11+cu130.
The final model config must have:
- `model_type: "qwen3_5"` (top‑level)
- `architectures: ["Qwen3_5ForConditionalGeneration"]`
- Nested `text_config` with the text model parameters
- `quantization_config` at the top level (not inside `text_config`)
- `dtype: "bfloat16"` at top level
- Weight names using the `model.language_model.*` prefix
- Ignore‑list entries using the `model.language_model.layers.*` format
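A minimal skeleton of that config layout (the `text_config` values below are illustrative placeholders, and the `quantization_config` contents are written by llm-compressor at save time):

```python
import json

config = {
    "model_type": "qwen3_5",                               # top-level
    "architectures": ["Qwen3_5ForConditionalGeneration"],
    "dtype": "bfloat16",                                   # top-level, not in text_config
    "text_config": {
        # text model parameters go here (placeholder values)
        "hidden_size": 5120,
        "num_hidden_layers": 64,
    },
    "quantization_config": {},  # top-level; populated by llm-compressor on save
}

with open("config.sketch.json", "w") as f:
    json.dump(config, f, indent=2)
```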
```bash
cp deploy/.env.example deploy/.env
# Edit deploy/.env with your paths
docker compose -f deploy/docker-compose.yml up -d
```

See deploy/ for the full configuration.
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-id",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
```

Suggested GitHub About / Description:
GGUF → NVFP4 conversion pipeline for Qwen3.5, Qwen3.6‑A3B MoE, and Gemma 4 E4B — produces vLLM‑ready multimodal checkpoints on a single RTX 5090.
Suggested Topics (copy‑paste into repo settings):
```
nvfp4 gguf quantization vllm llm-compressor qwen qwen3 qwen3-moe
gemma gemma-4 mixture-of-experts moe gated-deltanet hybrid-attention
multimodal huggingface safetensors rtx-5090 nvidia-fp4 speculative-decoding
```
The code in this repository is MIT licensed. Model weights are subject to their original licenses.
- HauhauCS — uncensored Qwen3.5‑27B, Qwen3.6‑35B‑A3B, and Gemma 4 E4B GGUF models
- sakamakismile — Qwen3.6 NVFP4 aggressive quantization recipe reference
- AEON‑7 — Qwen3.6 NVFP4 conservative quantization insights
- Kbenkhaled — original NVFP4 quantization recipe
- huihui‑ai — HF‑format Gemma 4 E4B reference used by the Gemma 4 pipeline
- Neural Magic / llm‑compressor — quantization framework
- vLLM — serving infrastructure