
Releases: mindspore-lab/mindone

πŸŽ„ MindOne v0.5.0 - Major Release

24 Dec 15:10


We're excited to announce the official release of MindOne v0.5.0, with enhanced community integration and significant performance improvements.

πŸš€ Key Highlights

  • mindone.diffusers: Compatible with πŸ€— diffusers v0.35.2, with preview support for SOTA v0.36 pipelines
  • mindone.transformers: Compatible with πŸ€— transformers v4.57.1
  • ComfyUI: Added initial ComfyUI integration support
  • MindSpore: Compatible with MindSpore 2.6.0 - 2.7.1

mindone.transformers updates

  • Major upgrade: Enhanced compatibility with πŸ€— transformers v4.54 and v4.57.1.
  • 70+ new models added: Check support list here.

Base Updates

  • Transformers 4.54 base support (#1387)
  • Transformers 4.57 base support (#1445)

New Models

  • Vision Models: AIMv2 (#1456), DINOv3 ViT/ConvNeXt (v4.57.1) (#1439), SAM-HQ (v4.57.1) (#1457), Bria (#1384), Florence2 (#1453), EfficientLoftr (#1456), HGNet_v2 (#1395), Ovis2 (#1454)

  • Audio/Speech Models: Granite Speech (#1406), Kyutai Speech-to-Text (#1407), Voxtral (#1456), Parakeet (#1451), XCodec (#1452), Dia (#1404), CSM (#1399)

  • Text/Language Models: Llama4 (#1470), Arcee (#1470), Falcon H1 (#1465), Dots1 (#1469), SmolLM3 (v4.54.1) (#1391), ModernBERT Decoder (v4.54.1) (#1397), Hunyuan V1 Dense/MoE (v4.57.1) (#1401), Evolla (v4.54.1) (#1440), EXAONE (#1396), Doge (#1392), ERNIE 4.5 & ERNIE 4.5 MoE (#1393), GLM4 MoE (#1409), Flex OLMo (#1442), T5Gemma (#1420), VaultGemma (#1450), BLT/Apertus/Ministral (#1462), EOMT/TimesFM (#1403), Seed OSS (#1441), xLSTM (#1466), d_fine, GraniteMoeHybrid, EfficientLoFTR Models (#1405)

  • Multimodal Models: Qwen3 Omni (#1411), Qwen3 Next (#1476), ColQwen2 (v4.54.1) (#1414), Cohere2 Vision (v4.57.1) (#1473), InternVL (v4.57) (#1463), Janus (v4.57) (#1463), Kosmos-2.5 (#1456), LFM2/LFM2-VL (#1456), MetaCLIP 2 (#1456), Mlcd (#1472), SAM2 (#1426), SAM2 Video Support (#1434), Olmo3 Model (#1467), DeepseekV2/DeepseekVL/DeepseekVLHybrid (#1477), MM Grounding DINO (#1486)

  • Model updates: updated Mistral3 to v4.57.1 (#1464), updated Qwen2.5VL to v4.54.1 (#1421)

Multimodal processors for the vLLM-MindSpore community

  • Qwen2.5VL ImageProcessor Fast / VideoProcessor (#1429)
  • Qwen3_VL Video Processor & Qwen2_VL Image Processor Fast (#1419)
  • Phi4/Whisper/Ultravox/InternVL/Qwen2_audio/MiniCPMV/LLaVA-Next/LLaVA-Next-Video processors (#1471)

mindone.diffusers updates

New Features

  • πŸš€ Context parallelism: Ring & Ulysses & Unified Attention (#1438)
  • Added AutoencoderMixin (#1444)
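As a rough intuition for the Ulysses scheme mentioned above: each device starts with a shard of the sequence, and an all-to-all exchange regroups the data so that every device holds the full sequence for a subset of attention heads, which it then processes locally. A minimal numpy sketch of that head-partitioning idea (conceptual only, not mindone's actual implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # q, k, v: (heads, seq, dim)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
H, S, D, P = 4, 8, 16, 2          # heads, seq length, head dim, "devices"
q, k, v = (rng.standard_normal((H, S, D)) for _ in range(3))

# Reference: full multi-head attention on a single device.
ref = attention(q, k, v)

# Ulysses-style: after an all-to-all, each of the P "devices" holds the
# full sequence but only H // P heads, and runs attention locally.
outs = [attention(q[i::P], k[i::P], v[i::P]) for i in range(P)]

# A second all-to-all would restore the sequence-sharded layout;
# here we just reassemble the heads to check correctness.
ulysses = np.empty_like(ref)
for i in range(P):
    ulysses[i::P] = outs[i]

assert np.allclose(ref, ulysses)
```

Because attention mixes information only along the sequence axis and never across heads, the per-device results match the single-device reference exactly.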

New Pipelines

  • Kandinsky5 (#1388), Lucy (#1390), etc.
  • Enabled multi-card inference for the Flux2 pipeline (ZeRO-3 sharding) (#1446)

ComfyUI Integration

  • Added ComfyUI root files and CLI args (#1480)
  • Added text encoder files (#1481)
  • Updated clip_model.py (#1479)

Examples Updates

  • Added Wan2.2 LoRA finetune support (#1418)
  • Updated Emu3 performance for MindSpore 2.6.0 and 2.7.0 (#1417)
  • Updated HunyuanVideo-I2V to MindSpore 2.6.0 and 2.7.0 (#1385)
  • πŸš€ Added accelerated DiT pipelines compatible with MindSpore graph mode (#1433)
  • πŸš€ Added FBCache TaylorSeer graph-mode implementation for Flux.1 (#1475)
  • QwenImage LoRA finetune support (#1394)

Fixed

  • Fixed AIMv2/Arcee torch-dependency bug (#1485)
  • Fixed torch dependencies in other mindone.transformers models (#1482)
  • Fixed Qwen2.5VLProcessor tokenizer converting tensor bug (#1483)
  • Fixed Qwen3_VL text attention selection bug (#1455)
  • Fixed GLM4.1V bs>1 generation index bug (#1437)
  • Fixed training issue in TrainOneStepWrapper (#1408)
  • Fixed import error if env contains accelerate module (#1431)
  • ZeRO: Supported training with MindSpore 2.6.0 and 2.7.0 (#1383)
  • Misc bugfixes (#1424)
  • Fixed some diffusers bugs (#1448)
  • Docs updates for mindone v0.5.0 release, and ut fixes (#1484)

Statistics

  • Total commits: 82
  • Files changed: 798
  • Lines added: 157,122
  • Lines deleted: 22,303

πŸ™ Acknowledgments

Special thanks to our amazing contributors who helped shape MindOne v0.5.0!

Andy Zhou, Chaoran Wei, Cheung Ka Wai, Cui-yshoho, Didan Deng, Feiran Zhang, Fzilan, GUOGUO, Rustam Khadipash, The-truthh, YMC, Yingshu CHEN, alien-0119, jijiarong, liuchuting, vigo999, zackcxb, zyd-ustc

Together We Build, Together We Grow. Thanks to every open source maintainer, contributor, and user. ✨

Start your AI model development journey with MindOne v0.5.0 today! πŸš€

πŸ“– Full Changelog: CHANGELOG.md

v0.4.0

02 Nov 12:05


πŸŽ‰ MindOne v0.4.0 - Major Release

We're excited to announce the official release of MindOne v0.4.0! This is a milestone release that brings extensive AI model support and significant performance improvements.

πŸš€ Key Highlights

  • mindone.diffusers: Compatible with πŸ€— diffusers v0.35.0
  • mindone.transformers: Compatible with πŸ€— transformers v4.50
  • MindSpore: Now requires MindSpore >= 2.6.0

mindone.transformers updates

  • Major upgrade: Enhanced compatibility with πŸ€— transformers v4.50
  • 280+ models supported: Comprehensive model library including vision, audio, multimodal, and text models

new models

mindone.diffusers updates

  • Major upgrade: Enhanced compatibility with πŸ€— diffusers v0.35.0
  • 70+ pipelines supported: Comprehensive pipeline library for text-to-image, image-to-image, text-to-video, and audio generation
  • 50+ model components: Transformers, autoencoders, controlnets, and processing modules as building blocks

new pipelines

  • Video Generation: QwenImage (#1288), HiDream (#1360), Wan-VACE (#1148), SkyReels-V2 (#1203), Chroma-Dev (#1157), Sana Sprint Img2Img/VisualCloze (#1145), HunyuanVideo (#1029), Wan (#1021), Lumina2 (#996), LTXCondition (#997), UniDiffuser (#979)
  • Image Generation: Amused & Ledits++ (#976), OmniGen & Marigold (#1062), Stable Diffusion Attend & Excite (#1013), SD Unclip/PIA (#958)
  • Audio Generation: AudioLDM2 (#981)
  • Advanced Sampling: K-diffusion pipelines (#986)
  • Testing & Documentation: UniDiffusers test (#1007), 'reuse a pipeline' docs (#989), diffusers mint changes (#992)

model components

  • Video Transformers: transformer_qwenimage (#1288), transformer_hidream_image, transformer_wan_vace (#1148), transformer_skyreels_v2 (#1203), transformer_chroma (#1157), transformer_cosmos (#1196), transformer_hunyuan_video_framepack (#1029), consisid_transformer_3d (#1124)
  • Autoencoders: autoencoder_kl_qwenimage (#1288), autoencoder_kl_cosmos (#1196)
  • ControlNets: controlnet_sana (#1145), multicontrolnet_union (#1158)
  • Processing Modules: cache_utils (#1299), auto_model (#1158), lora processing modules (#1158)

mindone.peft updates

  • Added mindone.peft and upgraded to v0.15.2 (#1194)
  • Added Qwen2.5-Omni LoRA finetuning script with transformers 4.53.0 (#1218)
  • Fixed lora and lora_scale from each PEFT layer (#1187)
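LoRA, which the finetuning scripts above rely on, adds a trainable low-rank update on top of a frozen pretrained weight. A minimal numpy sketch of the arithmetic (conceptual only; mindone.peft follows the huggingface/peft API rather than this code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 64, 8, 16     # rank r << d; alpha is the LoRA scale

W = rng.standard_normal((d_out, d_in))    # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01 # trainable down-projection
B = np.zeros((d_out, r))                  # trainable up-projection, zero-initialized

x = rng.standard_normal(d_in)

# LoRA forward pass: base path plus low-rank update scaled by alpha / r.
y = W @ x + (alpha / r) * (B @ (A @ x))

# With B zero-initialized, the adapted layer starts identical to the base,
# so finetuning begins from the pretrained model's behavior.
assert np.allclose(y, W @ x)
```

Only A and B (2 * r * d parameters per layer) are trained, which is why LoRA finetuning fits on far less memory than full finetuning.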

models under examples (mostly with finetune/training scripts)

  • Added Janus model ...

MindONE v0.3.0 release

11 Apr 01:55
875acd8


We are thrilled to announce the release of MindONE 0.3.0, featuring more state-of-the-art multi-modal understanding and generative models and better compatibility with transformers and diffusers. MindONE now supports the latest features in diffusers v0.32.2, including over 160 pipelines, 50 models, and 35 schedulers. It allows users to easily develop new image/video/audio generation models or port existing models from torch to MindSpore. MindONE 0.3.0 is built on MindSpore 2.5 and optimized for Ascend NPUs, ensuring high-performance training for various generative models such as OpenSora, CogVideoX, and JanusPro from DeepSeek.

Key Features

  1. Support Diffusers v0.32.2

MindONE now supports the following new pipelines for image and video generation, along with new training scripts:

  • Video Generation Pipelines: CogVideoX, Latte, Mochi-1, Allegro, LTXVideo, HunyuanVideo, and more.

  • Image Generation Pipelines: CogView3/4, Stable Diffusion 3.5, Flux, SANA, Lumina, Kolors, AuraFlow, and more.

  • Training Scripts: CogvideoX SFT & LoRA, Flux SFT & LoRA & ControlNet, and SD3/3.5 SFT & LoRA.

For more details, visit the diffusers documentation.

  2. Expanded Multi-Modal Generative Models

MindONE v0.3.0 adds various state-of-the-art generative models as examples, ensuring efficient training performance on Ascend NPUs, including:

| task | model | inference | finetune | pretrain | institute |
|---|---|---|---|---|---|
| Image-to-Video | hunyuanvideo-i2v πŸ”₯πŸ”₯ | βœ… | βœ–οΈ | βœ–οΈ | Tencent |
| Text/Image-to-Video | wan2.1 πŸ”₯πŸ”₯πŸ”₯ | βœ… | βœ–οΈ | βœ–οΈ | Alibaba |
| Text-to-Image | cogview4 πŸ”₯πŸ”₯πŸ”₯ | βœ… | βœ–οΈ | βœ–οΈ | Zhipuai |
| Text-to-Video | step_video_t2v πŸ”₯πŸ”₯ | βœ… | βœ–οΈ | βœ–οΈ | StepFun |
| Image-Text-to-Text | qwen2_vl πŸ”₯πŸ”₯πŸ”₯ | βœ… | βœ–οΈ | βœ–οΈ | Alibaba |
| Any-to-Any | janus πŸ”₯πŸ”₯πŸ”₯ | βœ… | βœ… | βœ… | DeepSeek |
| Any-to-Any | emu3 πŸ”₯πŸ”₯ | βœ… | βœ… | βœ… | BAAI |
| Class-to-Image | var πŸ”₯πŸ”₯ | βœ… | βœ… | βœ… | ByteDance |
| Text/Image-to-Video | hpcai open 2.0 πŸ”₯πŸ”₯ | βœ… | βœ–οΈ | βœ–οΈ | HPC-AI Tech |
| Text/Image-to-Video | cogvideox 1.5 5B~30B πŸ”₯πŸ”₯ | βœ… | βœ… | βœ… | Zhipu |
| Text-to-Video | open sora plan 1.3 πŸ”₯πŸ”₯ | βœ… | βœ… | βœ… | PKU |
| Text-to-Video | hunyuanvideo πŸ”₯πŸ”₯ | βœ… | βœ… | βœ… | Tencent |
| Text-to-Video | movie gen 30B πŸ”₯πŸ”₯ | βœ… | βœ… | βœ… | Meta |
| Video-Encode-Decode | magvit | βœ… | βœ… | βœ… | Google |
| Text-to-Image | story_diffusion | βœ… | βœ–οΈ | βœ–οΈ | ByteDance |
| Image-to-Video | dynamicrafter | βœ… | βœ–οΈ | βœ–οΈ | Tencent |
| Video-to-Video | venhancer | βœ… | βœ–οΈ | βœ–οΈ | Shanghai AI Lab |
| Text-to-Video | t2v_turbo | βœ… | βœ… | βœ… | Google |
| Text/Image-to-Video | video composer | βœ… | βœ… | βœ… | Alibaba |
| Text-to-Image | flux πŸ”₯ | βœ… | βœ… | βœ–οΈ | Black Forest Labs |
| Text-to-Image | stable diffusion 3 πŸ”₯ | βœ… | βœ… | βœ–οΈ | Stability AI |
| Text-to-Image | kohya_sd_scripts | βœ… | βœ… | βœ–οΈ | kohya |
| Text-to-Image | t2i-adapter | βœ… | βœ… | βœ… | Shanghai AI Lab |
| Text-to-Image | ip adapter | βœ… | βœ… | βœ… | Tencent |
| Text-to-3D | mvdream | βœ… | βœ… | βœ… | ByteDance |
| Image-to-3D | instantmesh | βœ… | βœ… | βœ… | Tencent |
| Image-to-3D | sv3d | βœ… | βœ… | βœ… | Stability AI |
| Text/Image-to-3D | hunyuan3d-1.0 | βœ… | βœ… | βœ… | Tencent |
  3. Support Text-to-Video Data Curation

MindONE v0.3.0 adds a new pipeline for text-to-video data curation, which supports scene detection and video splitting, de-duplication, aesthetic/OCR/LPIPS/NSFW scoring, and video captioning.

For more details, visit the t2v curation documentation.
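The filtering stages above can be sketched as a simple pipeline. The scorer functions, field names, and thresholds below are hypothetical placeholders for illustration, not the actual MindONE curation interfaces:

```python
# A minimal sketch of the curation stages described above, using
# hypothetical scorers; the real pipeline's interfaces differ.
def curate(clips, aesthetic_fn, ocr_fn, min_aesthetic=4.5, max_text_area=0.1):
    """Keep clips that pass every quality gate."""
    kept = []
    seen = set()
    for clip in clips:
        if clip["hash"] in seen:                 # de-duplication
            continue
        seen.add(clip["hash"])
        if aesthetic_fn(clip) < min_aesthetic:   # aesthetic scoring
            continue
        if ocr_fn(clip) > max_text_area:         # OCR / text-overlay filter
            continue
        kept.append(clip)                        # survivors go on to captioning
    return kept

clips = [
    {"hash": "a", "score": 5.0, "text": 0.02},
    {"hash": "a", "score": 5.0, "text": 0.02},   # duplicate, dropped
    {"hash": "b", "score": 3.0, "text": 0.02},   # low aesthetic score, dropped
    {"hash": "c", "score": 6.0, "text": 0.5},    # too much on-screen text, dropped
]
result = curate(clips, lambda c: c["score"], lambda c: c["text"])
assert [c["hash"] for c in result] == ["a"]
```

Each gate runs independently, so stages such as scene detection or NSFW scoring can be slotted in as additional filters in the same loop.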

MindONE 0.2.0

06 Nov 08:09
4f03eb7


We are excited to announce the official release of MindONE, a state-of-the-art repository dedicated to multi-modal understanding and content generation. Built on MindSpore 2.3.1 and optimized for Ascend NPUs, MindONE provides a comprehensive suite of algorithms and models designed to facilitate advanced content generation across various modalities, including images, audio, videos, and even 3D objects.

Key Features

  1. diffusers support on MindSpore

We aim to provide an interface and usage fully consistent with huggingface/diffusers.
Only the changes necessary for MindSpore are made, so users coming from torch can adopt it seamlessly.

- from diffusers import DiffusionPipeline
+ from mindone.diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
-    torch_dtype=torch.float16,
+    mindspore_dtype=mindspore.float16,
    use_safetensors=True
)

prompt = "An astronaut riding a green horse"

images = pipe(prompt=prompt)[0][0]

Important

Because huggingface/diffusers is still under active development,
many features are not yet well supported.
Currently, most functions of huggingface/diffusers v0.29.x are supported.
For details, see MindOne Diffusers.

  2. MindSpore patch for transformers

This MindSpore patch for huggingface/Transformers enables researchers or developers
in the field of text-to-image (t2i) and text-to-video (t2v) generation to utilize pretrained text and image models
from huggingface/Transformers on MindSpore.
Only the Ascend-related modules are modified; all other modules reuse huggingface/Transformers.

The following example shows how to download and use the pretrained models. Note that the model comes from mindone.transformers, while everything else comes from huggingface/Transformers.

from mindspore import Tensor
# use tokenizer from huggingface/Transformers
from transformers import AutoTokenizer
# use model from mindone.transformers
-from transformers import CLIPTextModel
+from mindone.transformers import CLIPTextModel

model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")

inputs = tokenizer(
    ["a photo of a cat", "a photo of a dog"],
    padding=True,
-    return_tensors="pt",
+    return_tensors="np"
)
-outputs = model(**inputs)
+outputs = model(Tensor(inputs.input_ids))

For details, see MindOne Transformers.

  3. State-of-the-Art generative models

MindONE showcases various state-of-the-art generative models as examples, ensuring efficient training performance on Ascend NPUs, including:

| model | features |
|---|---|
| hpcai open sora | supports v1.0/1.1/1.2, large-scale training with dp/sp/zero |
| open sora plan | supports v1.0/1.1/1.2, large-scale training with dp/sp/zero |
| stable diffusion | supports sd 1.5/2.0/2.1, vanilla fine-tune, LoRA, DreamBooth, textual inversion |
| stable diffusion xl | supports SAI-style (Stability AI) vanilla fine-tune, LoRA, DreamBooth |
| dit | supports text-to-image fine-tune |
| hunyuan_dit | supports text-to-image fine-tune |
| pixart_sigma | supports text-to-image fine-tune at different aspect ratios |
| latte | supports unconditional text-to-image fine-tune |
| animate diff | supports motion module and LoRA training |
| dynamicrafter | supports image-to-video generation |