Congratulations on the NeurIPS Oral. I still have a few questions. Several models on your leaderboard (e.g., Gemini-2.5-Flash-Image, Seedream, BAGEL w/ CoT) appear to run a two-stage pipeline: an MLLM first rewrites/structures the edit instruction (often with CoT), and a second stage then performs the actual image edit. The community has also surfaced system-prompt leaks from some closed models (e.g., Seedream, Nano Banana) that suggest this kind of "rewrite-then-generate" template.
Questions / requests for clarification:
- If we apply a strong, neutral MLLM rewriter (e.g., GPT-5 / Gemini / Qwen-VL) to all RISEBench prompts before feeding them to open-source editors (e.g., FLUX.1-Kontext-dev, Qwen-Image-Edit), do scores jump substantially?
- Does the leaderboard distinguish direct-instruction vs. rewrite-then-edit settings? Could you provide a reproducible toggle and a sub-leaderboard so comparisons aren't methodologically apples-to-oranges?
- Would you consider an official Prompt-Rewrite Protocol (with a fixed rewriter + template) so that gains can be attributed to either "instruction enhancement" or the editor's intrinsic capability? (A rough sketch of what such a toggle/protocol could look like follows this list.)
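For concreteness, here is a minimal sketch of the kind of reproducible toggle I have in mind. Everything in it is a hypothetical placeholder rather than RISEBench's actual API: the `REWRITE_TEMPLATE` text, the `EvalConfig` fields, and the `rewrite_fn` / `edit_fn` callables that would wrap a fixed neutral MLLM rewriter and the editor under test.

```python
# Hypothetical sketch of a "direct vs. rewrite-then-edit" toggle for benchmark evaluation.
# Names (REWRITE_TEMPLATE, EvalConfig, rewrite_fn, edit_fn) are placeholders, not RISEBench's API.
from dataclasses import dataclass
from typing import Callable

# One fixed rewrite template shared by every editor, so gains from "instruction
# enhancement" are held constant across the sub-leaderboard.
REWRITE_TEMPLATE = (
    "You are an instruction rewriter for image editing. Rewrite the user's edit "
    "request into one explicit, unambiguous instruction without adding new content.\n\n"
    "Request: {instruction}"
)

@dataclass
class EvalConfig:
    use_rewriter: bool                         # the toggle: direct vs. rewrite-then-edit
    rewriter_id: str = "fixed-neutral-mllm"    # pinned rewriter identity, logged for reproducibility

def run_case(instruction: str,
             image_path: str,
             cfg: EvalConfig,
             rewrite_fn: Callable[[str], str],    # wraps the fixed MLLM rewriter
             edit_fn: Callable[[str, str], str],  # wraps the editor under test
             ) -> str:
    """Run one benchmark case; the editor call is identical in both settings."""
    if cfg.use_rewriter:
        prompt = rewrite_fn(REWRITE_TEMPLATE.format(instruction=instruction))
    else:
        prompt = instruction
    return edit_fn(prompt, image_path)

# Usage: score each editor twice and report both entries on the sub-leaderboard.
# direct    = run_case(instr, img, EvalConfig(use_rewriter=False), rewrite_fn, edit_fn)
# rewritten = run_case(instr, img, EvalConfig(use_rewriter=True),  rewrite_fn, edit_fn)
```

The delta between the two runs would then cleanly attribute how much of a model's score comes from the rewriter rather than the editor itself.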
Thanks; clear guidance here would be very helpful.