[Bug] Separate tiles / channels #1455

@WhyNotHugo

Description

Git commit

$ git rev-parse HEAD
c97702e1057c2fe13a7074cd9069cb9dd6edc1bf

Operating System & Version

Alpine Linux Edge

GGML backends

Vulkan

Command-line arguments used

./build/bin/sd-cli \
	--diffusion-model ../flux1-dev-q3_k.gguf \
	--vae ../ae.safetensors \
	--clip_l ../clip_l.safetensors \
	--t5xxl ../t5xxl_fp16.safetensors -p "a lovely cat holding a sign says 'lol'" \
	--cfg-scale 1.0 \
	--sampling-method euler -v \
	--clip-on-cpu

Steps to reproduce

  1. Download the exact files listed in the docs: https://github.com/leejet/stable-diffusion.cpp/blob/master/docs/flux.md#download-weights
  2. Generate an image with the above command

What you expected to happen

The image is generated correctly.

What actually happened

The output image is wrong: the color channels appear to be rendered separately, as distinct tiles within the frame, instead of being combined into a single picture.

[image attachment]

I'm sure there's a specific term to describe this. I don't know it.
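One possible reading of the artifact (my guess, not confirmed by the maintainers): the pixel buffer is written in planar order (all R values, then all G, then all B) while the PNG writer expects interleaved RGB, which makes the three channels show up as separate tiles. A minimal sketch of the difference between the two layouts, with a hypothetical `interleave` helper:

```python
def interleave(planar: bytes, width: int, height: int) -> bytes:
    """Convert a planar RGB buffer (RRR...GGG...BBB) to an
    interleaved one (RGBRGB...). Purely illustrative; this is
    not code from stable-diffusion.cpp."""
    n = width * height
    r, g, b = planar[:n], planar[n:2 * n], planar[2 * n:3 * n]
    out = bytearray(3 * n)
    for i in range(n):
        out[3 * i] = r[i]      # red sample for pixel i
        out[3 * i + 1] = g[i]  # green sample for pixel i
        out[3 * i + 2] = b[i]  # blue sample for pixel i
    return bytes(out)
```

If a writer that assumes interleaved data is handed a planar buffer, each third of the image comes out as one channel's plane, which would match the "separate tiles / channels" look.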

Logs / error messages / stack trace

[DEBUG] main.cpp:549  - version: stable-diffusion.cpp version master-586-c97702e, commit c97702e
[DEBUG] main.cpp:550  - System Info: 
    SSE3 = 1 |     AVX = 1 |     AVX2 = 1 |     AVX512 = 0 |     AVX512_VBMI = 0 |     AVX512_VNNI = 0 |     FMA = 1 |     NEON = 0 |     ARM_FMA = 0 |     F16C = 1 |     FP16_VA = 0 |     WASM_SIMD = 0 |     VSX = 0 | 
[DEBUG] main.cpp:551  - SDCliParams {
  mode: img_gen,
  output_path: "output.png",
  image_path: "",
  metadata_format: "text",
  verbose: true,
  color: false,
  canny_preprocess: false,
  convert_name: false,
  preview_method: none,
  preview_interval: 1,
  preview_path: "preview.png",
  preview_fps: 16,
  taesd_preview: false,
  preview_noisy: false,
  metadata_raw: false,
  metadata_brief: false,
  metadata_all: false
}
[DEBUG] main.cpp:552  - SDContextParams {
  n_threads: 12,
  model_path: "",
  clip_l_path: "../clip_l.safetensors",
  clip_g_path: "",
  clip_vision_path: "",
  t5xxl_path: "../t5xxl_fp16.safetensors",
  llm_path: "",
  llm_vision_path: "",
  diffusion_model_path: "../flux1-dev-q3_k.gguf",
  high_noise_diffusion_model_path: "",
  vae_path: "../ae.safetensors",
  taesd_path: "",
  esrgan_path: "",
  control_net_path: "",
  embedding_dir: "",
  embeddings: {
  }
  wtype: NONE,
  tensor_type_rules: "",
  lora_model_dir: ".",
  hires_upscalers_dir: "",
  photo_maker_path: "",
  rng_type: cuda,
  sampler_rng_type: NONE,
  offload_params_to_cpu: false,
  enable_mmap: false,
  control_net_cpu: false,
  clip_on_cpu: true,
  vae_on_cpu: false,
  flash_attn: false,
  diffusion_flash_attn: false,
  diffusion_conv_direct: false,
  vae_conv_direct: false,
  circular: false,
  circular_x: false,
  circular_y: false,
  chroma_use_dit_mask: true,
  qwen_image_zero_cond_t: false,
  chroma_use_t5_mask: false,
  chroma_t5_mask_pad: 1,
  prediction: NONE,
  lora_apply_mode: auto,
  force_sdxl_vae_conv_scale: false
}
[DEBUG] main.cpp:553  - SDGenerationParams {
  loras: "{
  }",
  high_noise_loras: "{
  }",
  prompt: "a lovely cat holding a sign says 'lol'",
  negative_prompt: "",
  clip_skip: -1,
  width: -1,
  height: -1,
  batch_count: 1,
  init_image_path: "",
  end_image_path: "",
  mask_image_path: "",
  control_image_path: "",
  ref_image_paths: [],
  control_video_path: "",
  auto_resize_ref_image: true,
  increase_ref_index: false,
  pm_id_images_dir: "",
  pm_id_embed_path: "",
  pm_style_strength: 20,
  skip_layers: [7, 8, 9],
  sample_params: (txt_cfg: 1.00, img_cfg: 1.00, distilled_guidance: 3.50, slg.layer_count: 0, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: NONE, sample_method: euler, sample_steps: 20, eta: inf, shifted_timestep: 0, flow_shift: inf),
  high_noise_skip_layers: [7, 8, 9],
  high_noise_sample_params: (txt_cfg: 7.00, img_cfg: 7.00, distilled_guidance: 3.50, slg.layer_count: 0, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: NONE, sample_method: NONE, sample_steps: 20, eta: inf, shifted_timestep: 0, flow_shift: inf),
  custom_sigmas: [],
  cache_mode: "",
  cache_option: "",
  cache: disabled (threshold=inf, start=0.15, end=0.95),
  moe_boundary: 0.875,
  video_frames: 1,
  fps: 16,
  vace_strength: 1,
  strength: 0.75,
  control_strength: 0.9,
  seed: 42,
  upscale_repeats: 1,
  upscale_tile_size: 128,
  hires: { enabled: false, upscaler: "Latent (nearest)", model_path: "", scale: 2, target_width: 0, target_height: 0, steps: 0, denoising_strength: 0.7, upscale_tile_size: 128 },
  vae_tiling_params: { 0, 0, 0, 0.5, 0, 0 },
}
[DEBUG] stable-diffusion.cpp:184  - Using Vulkan backend
[DEBUG] ggml_extend.hpp:78   - ggml_vulkan: Found 1 Vulkan devices:
[DEBUG] ggml_extend.hpp:78   - ggml_vulkan: 0 = AMD Radeon RX 7900 XT (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
[INFO ] stable-diffusion.cpp:205  - Vulkan: Using device 0
[INFO ] stable-diffusion.cpp:270  - loading diffusion model from '../flux1-dev-q3_k.gguf'
[INFO ] model.cpp:229  - load ../flux1-dev-q3_k.gguf using gguf format
[DEBUG] model.cpp:278  - init from '../flux1-dev-q3_k.gguf'
[INFO ] stable-diffusion.cpp:286  - loading clip_l from '../clip_l.safetensors'
[INFO ] model.cpp:232  - load ../clip_l.safetensors using safetensors format
[DEBUG] model.cpp:307  - init from '../clip_l.safetensors', prefix = 'text_encoders.clip_l.transformer.'
[INFO ] stable-diffusion.cpp:310  - loading t5xxl from '../t5xxl_fp16.safetensors'
[INFO ] model.cpp:232  - load ../t5xxl_fp16.safetensors using safetensors format
[DEBUG] model.cpp:307  - init from '../t5xxl_fp16.safetensors', prefix = 'text_encoders.t5xxl.transformer.'
[INFO ] stable-diffusion.cpp:331  - loading vae from '../ae.safetensors'
[INFO ] model.cpp:232  - load ../ae.safetensors using safetensors format
[DEBUG] model.cpp:307  - init from '../ae.safetensors', prefix = 'vae.'
[INFO ] stable-diffusion.cpp:356  - Version: Flux 
[INFO ] stable-diffusion.cpp:384  - Weight type stat:                      f32: 720  |     f16: 415  |    q3_K: 304  
[INFO ] stable-diffusion.cpp:385  - Conditioner weight type stat:          f16: 415  
[INFO ] stable-diffusion.cpp:386  - Diffusion model weight type stat:      f32: 476  |    q3_K: 304  
[INFO ] stable-diffusion.cpp:387  - VAE weight type stat:                  f32: 244  
[DEBUG] stable-diffusion.cpp:389  - ggml tensor size = 400 bytes
[INFO ] stable-diffusion.cpp:434  - CLIP: Using CPU backend
[DEBUG] clip_tokenizer.cpp:65   - vocab size: 49408
[INFO ] flux.hpp:1283 - flux: depth = 19, depth_single_blocks = 38, guidance_embed = true, context_in_dim = 4096, hidden_size = 3072, num_heads = 24
[DEBUG] ggml_extend.hpp:2046 - clip params backend buffer size =  235.06 MB(RAM) (196 tensors)
[DEBUG] ggml_extend.hpp:2046 - t5 params backend buffer size =  9083.77 MB(RAM) (219 tensors)
[DEBUG] ggml_extend.hpp:2046 - flux params backend buffer size =  5105.72 MB(VRAM) (780 tensors)
[INFO ] stable-diffusion.cpp:682  - using VAE for encoding / decoding
[INFO ] auto_encoder_kl.hpp:517  - vae decoder: ch = 128
[DEBUG] ggml_extend.hpp:2046 - vae params backend buffer size =  94.57 MB(VRAM) (138 tensors)
[DEBUG] stable-diffusion.cpp:806  - loading weights
[DEBUG] model.cpp:755  - using 12 threads for model loading
[DEBUG] model.cpp:777  - loading tensors from ../flux1-dev-q3_k.gguf
  |===========================>                      | 780/1439 - 8.15GB/s
[DEBUG] model.cpp:777  - loading tensors from ../clip_l.safetensors
  |=================================>                | 976/1439 - 6.41GB/s
[DEBUG] model.cpp:777  - loading tensors from ../t5xxl_fp16.safetensors
  |=========================================>        | 1195/1439 - 11.13GB/s
[DEBUG] model.cpp:777  - loading tensors from ../ae.safetensors
  |==================================================| 1439/1439 - 9.74GB/s
[INFO ] model.cpp:1006 - loading tensors completed, taking 1.47s (process: 0.00s, read: 0.55s, memcpy: 0.00s, convert: 0.01s, copy_to_backend: 0.40s)
[DEBUG] stable-diffusion.cpp:846  - finished loaded file
[INFO ] stable-diffusion.cpp:898  - total params memory size = 14519.13MB (VRAM 5200.30MB, RAM 9318.83MB): text_encoders 9318.83MB(RAM), diffusion_model 5105.72MB(VRAM), vae 94.57MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:987  - running in Flux FLOW mode
[INFO ] stable-diffusion.cpp:3320 - generate_image 512x512
[INFO ] denoiser.hpp:499  - get_sigmas with discrete scheduler
[INFO ] stable-diffusion.cpp:2835 - sampling using Euler method
[DEBUG] conditioner.hpp:1157 - parse 'a lovely cat holding a sign says 'lol'' to [['a lovely cat holding a sign says 'lol'', 1], ]
[DEBUG] bpe_tokenizer.cpp:183  - split prompt "a lovely cat holding a sign says 'lol'" to tokens ["a</w>", "lovely</w>", "cat</w>", "holding</w>", "a</w>", "sign</w>", "says</w>", "'</w>", "lol</w>", "'</w>", ]
[DEBUG] t5_unigram_tokenizer.cpp:336  - split prompt "a lovely cat holding a sign says 'lol'" to tokens ["▁", "a", "▁lovely", "▁cat", "▁holding", "▁", "a", "▁sign", "▁says", "▁", "'", "l", "o", "l", "'", ]
[DEBUG] clip.hpp:318  - identity projection
[DEBUG] ggml_extend.hpp:1859 - clip compute buffer size: 1.42 MB(RAM)
[DEBUG] clip.hpp:318  - identity projection
[DEBUG] ggml_extend.hpp:1859 - t5 compute buffer size: 68.25 MB(RAM)
[DEBUG] conditioner.hpp:1272 - computing condition graph completed, taking 6017 ms
[INFO ] stable-diffusion.cpp:3189 - get_learned_condition completed, taking 6.02s
[INFO ] stable-diffusion.cpp:3354 - generating image: 1/1 - seed 42
[DEBUG] ggml_extend.hpp:1859 - flux compute buffer size: 341.50 MB(VRAM)
  |==================================================| 20/20 - 1.30it/s
[INFO ] stable-diffusion.cpp:3385 - sampling completed, taking 15.37s
[INFO ] stable-diffusion.cpp:3403 - generating 1 latent images completed, taking 15.38s
[INFO ] stable-diffusion.cpp:3213 - decoding 1 latents
[DEBUG] ggml_extend.hpp:1859 - vae compute buffer size: 1984.25 MB(VRAM)
[DEBUG] vae.hpp:206  - computing vae decode graph completed, taking 2.58s
[INFO ] stable-diffusion.cpp:3229 - latent 1 decoded, taking 2.58s
[INFO ] stable-diffusion.cpp:3233 - decode_first_stage completed, taking 2.58s
[INFO ] stable-diffusion.cpp:3540 - generate_image completed in 24.17s
[INFO ] main.cpp:440  - save result image 0 to 'output.png' (success)
[INFO ] main.cpp:489  - 1/1 images saved

Additional context / environment details

I tried other models and got the exact same result. I don't think the models are the issue, because the output always looks like the example above.

Metadata

Labels: bug (Something isn't working)