[enhancement]: Support for Qwen Multimodal Models (Qwen-VL and Image Edit) #8983

@gafda

Description

Is there an existing issue for this?

  • I have searched the existing issues

Contact Details

No response

What should this feature add?

This request is to add native support for running the Qwen series of multimodal models from Alibaba Cloud within InvokeAI.

I'm requesting support for:

  1. Qwen-VL (Vision-Language): Integration would enable advanced image-to-text generation, detailed captioning, visual question answering, and general image analysis. This could also be used to enhance prompt generation.
  2. Qwen Image Edit / Agent: Support for Qwen's instruction-driven image editing capabilities. This would allow users to perform complex, natural-language-guided image manipulations and stylistic modifications without relying solely on traditional in-painting masks.
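For context, a minimal sketch of what a vision-language call might look like: Qwen-VL models consume an OpenAI-style chat payload with interleaved image and text parts. The helper below is purely illustrative (its name and the InvokeAI wiring are assumptions, not an existing API); only the payload shape follows the published Qwen-VL chat format.

```python
# Hypothetical helper (not an InvokeAI or Qwen API) that builds the
# multimodal chat payload Qwen-VL models expect: one user turn holding
# an image part followed by a text instruction part.
def build_vl_messages(image_ref: str, instruction: str) -> list[dict]:
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_ref},   # image reference (path/URL)
                {"type": "text", "text": instruction},   # captioning/editing prompt
            ],
        }
    ]

# Example: a captioning request whose output could feed prompt generation.
messages = build_vl_messages(
    "file:///tmp/render.png",
    "Describe this image in detail.",
)
```

With the `transformers` library, a payload like this would typically be passed through the model's `AutoProcessor` chat template and then to `generate()`; model loading is omitted here since the weights are large and the exact integration point in InvokeAI is the subject of this request.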

Alternatives

InvokeAI already has excellent in-painting, image-to-image, and ControlNet capabilities, but direct instruction-based editing with LLM-guided vision models offers a more conversational and flexible approach to iterative image editing.

Additional Content

No response

Metadata

Assignees

No one assigned

    Labels

    enhancement (New feature or request)
