Commit 489b194

HuangYuChuh and claude authored
Fix CPU/CUDA device mismatch in Klein edit control image encoding (#742)
When training Klein models with a `control_path` (edit/kontext-style paired datasets), `encode_image_refs()` returns tensors that reside on the VAE's device (CPU, since the VAE weights are loaded via `load_file(..., device="cpu")` and are never explicitly moved to the training device). Concatenating those CPU tensors with the training latents (`packed_latents`), which live on CUDA, raises:

    RuntimeError: Expected all tensors to be on the same device

Fix: move `img_cond_seq` and `img_cond_seq_ids` to the same device (and dtype) as `img_input` / `img_input_ids` before concatenation.

Co-authored-by: HuangYuChuh <HuangYuChuh@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 89d2090 commit 489b194

File tree

1 file changed: +2 −2 lines


extensions_built_in/diffusion_models/flux2/flux2_model.py

Lines changed: 2 additions & 2 deletions
@@ -412,8 +412,8 @@ def get_noise_prediction(
             assert img_cond_seq_ids is not None, (
                 "You need to provide either both or neither of the sequence conditioning"
             )
-            img_input = torch.cat((img_input, img_cond_seq), dim=1)
-            img_input_ids = torch.cat((img_input_ids, img_cond_seq_ids), dim=1)
+            img_input = torch.cat((img_input, img_cond_seq.to(img_input.device, img_input.dtype)), dim=1)
+            img_input_ids = torch.cat((img_input_ids, img_cond_seq_ids.to(img_input_ids.device)), dim=1)

             guidance_vec = torch.full(
                 (img_input.shape[0],),
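The device mismatch and its fix can be sketched in isolation (a minimal illustration, not the repository's code; the function and variable names below are hypothetical):

```python
import torch

def concat_with_cond(img_input: torch.Tensor, img_cond_seq: torch.Tensor) -> torch.Tensor:
    # torch.cat requires all inputs on the same device, so move the
    # conditioning sequence to the device and dtype of the training
    # latents before concatenating along the sequence dimension.
    img_cond_seq = img_cond_seq.to(img_input.device, img_input.dtype)
    return torch.cat((img_input, img_cond_seq), dim=1)

device = "cuda" if torch.cuda.is_available() else "cpu"
latents = torch.randn(2, 16, 64, device=device)               # training latents
cond = torch.randn(2, 16, 64, device="cpu", dtype=torch.float16)  # e.g. from a CPU-resident VAE
out = concat_with_cond(latents, cond)
print(out.shape)  # torch.Size([2, 32, 64])
```

Without the `.to(...)` call, running this with `latents` on CUDA and `cond` on CPU reproduces the `RuntimeError` described in the commit message.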
