I noticed an inconsistency in the model description between the README and the Technical Report.
README: mentions "...unified encoder-decoder architecture..."
Technical Report: states "...adopts a decoder-only vision–language architecture following the design principles of Qwen3-VL."
To keep the description technically accurate and consistent with the Technical Report and Qwen3-VL, the README should be updated to describe the model as decoder-only.