
Releases: ByteDance-Seed/VeOmni

v0.1.7

11 Mar 02:16
a18e994

v0.1.7 Pre-release

Highlights

Unified Trainer Architecture

  • Refactored task handling so that all training tasks (text, VLM, omni, etc.) are managed through a unified trainer interface, simplifying the codebase and improving extensibility. (#458)

Flash Attention 4 & Quack GEMM for Fused MoE

  • Integrated FA4 with Ulysses sequence parallelism support. (#497, #498)
  • Added Quack GEMM backend for fused MoE, enabling higher-throughput expert computation. (#546)
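The Ulysses sequence-parallel pattern that the FA4 integration plugs into can be illustrated with a tiny single-process sketch (names and shapes are illustrative, not VeOmni's API): each rank starts with a slice of the sequence covering all attention heads, and an all-to-all leaves it with the full sequence for a slice of the heads, so attention can run locally.

```python
# Toy single-process model of the Ulysses all-to-all reshard.
# Before: rank r holds [seq_len / P, num_heads, head_dim].
# After:  rank r holds [seq_len, num_heads / P, head_dim].
# Hypothetical helper, not VeOmni's actual implementation.

def ulysses_all_to_all(seq_shards, num_heads):
    """seq_shards: one nested list per 'rank' with layout
    [local_seq][num_heads][head_dim]. Returns per-rank head shards
    with layout [seq_len][num_heads // P][head_dim]."""
    world = len(seq_shards)
    heads_per_rank = num_heads // world
    out = []
    for r in range(world):               # destination rank r keeps head slice r
        lo, hi = r * heads_per_rank, (r + 1) * heads_per_rank
        head_shard = []
        for shard in seq_shards:         # gather the full sequence in order
            for token in shard:
                head_shard.append(token[lo:hi])
        out.append(head_shard)
    return out
```

After this exchange each rank sees every token for its heads, which is what lets a kernel like FA4 run unmodified inside the sequence-parallel region.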

Qwen3-Omni-MoE Full Training Support

  • End-to-end training support for Qwen3-Omni-MoE including audio processing, fused MoE forward, Expert Parallelism, varlen attention, and SP/FSDP compatibility. (#456, #489, #495, #475)

Transformers v5 Compatibility

  • Introduced patchgen — a tool to generate fully fleshed patched modeling code for transformers v5. Added v5 support for Qwen3, Qwen3-MoE, and Qwen3.5 (dense). (#486, #496, #523)

Distributed HuggingFace Safetensor Saving

  • New HuggingFaceStorageWriter enables distributed safetensor checkpoint saving with built-in synchronization, replacing scattered per-rank logic. (#461, #476, #479)
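The coordination this replaces can be sketched in miniature: each rank writes its own safetensor shard, and one rank then assembles the `model.safetensors.index.json` that HuggingFace loaders consume. The helper below is an illustrative stand-in for that final step, not VeOmni's `HuggingFaceStorageWriter`.

```python
import json

# Illustrative sketch of the index-assembly step in a distributed
# safetensor save: merge each rank's {tensor_name: shard_file} map into
# the standard HuggingFace index layout. Hypothetical helper name.

def write_hf_index(shard_maps, total_size, path="model.safetensors.index.json"):
    """shard_maps: list of {tensor_name: shard_filename} dicts, one per rank."""
    weight_map = {}
    for m in shard_maps:
        overlap = weight_map.keys() & m.keys()
        if overlap:                      # a tensor must live in exactly one shard
            raise ValueError(f"tensor saved by two ranks: {sorted(overlap)}")
        weight_map.update(m)
    index = {"metadata": {"total_size": total_size}, "weight_map": weight_map}
    with open(path, "w") as f:
        json.dump(index, f, indent=2)
    return index
```

The built-in synchronization mentioned above matters because this step can only run after every rank's shard is fully on disk.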

Model Patch Refactoring

  • Migrated Qwen2-VL, Qwen2.5-Omni, DeepSeek-V3, Seed-OSS, and Qwen3-Omni-MoE to a unified patch-based mechanism, eliminating duplicated HF source code. (#444, #445, #437, #450, #475)
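The patch-based mechanism can be pictured with a toy example (all names hypothetical): instead of vendoring a whole HuggingFace modeling file, import the upstream class and replace only the methods that need to differ, so the repository carries just the diff.

```python
# Toy illustration of the patch approach; UpstreamAttention stands in
# for an imported HF modeling class, not any real VeOmni code.

class UpstreamAttention:
    def forward(self, x):
        return f"eager({x})"

def patched_forward(self, x):        # the only code the patch repo carries
    return f"fused({x})"

def apply_patch(cls):
    cls.forward = patched_forward    # swap in the override at import time
    return cls

apply_patch(UpstreamAttention)
```

Everything not overridden keeps tracking the upstream implementation, which is what eliminates the duplicated HF source.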

New Model Support

  • GLM-5: Added support for both NPU and GPU backends. (#531, #541)
  • Qwen3.5 GatedDeltaNet: Added Ulysses sequence parallelism support. (#542)
  • Qwen3 Sequence Classification: Model and data pipeline support. (#426, #427)

Dynamic Batching

  • Added DynamicBatchingSizeDataset for stateful multi-worker dynamic batching. (#488)
  • Added EncoderDataBalance for ViT data balancing in Qwen3-VL MoE training. (#425)
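The core idea behind dynamic batching can be sketched as a token-budget packer (illustrative only; VeOmni's DynamicBatchingSizeDataset additionally manages checkpointable state across multiple dataloader workers): batches are cut by total token count rather than by a fixed number of samples.

```python
# Hedged sketch of token-budget dynamic batching. Hypothetical helper;
# the real dataset is stateful and multi-worker aware.

def dynamic_batches(samples, max_tokens):
    """Group samples (each a list of token ids) so every batch stays
    within max_tokens total, instead of using a fixed batch size."""
    batch, batch_tokens = [], 0
    for sample in samples:
        n = len(sample)
        if batch and batch_tokens + n > max_tokens:
            yield batch                  # close the batch before it overflows
            batch, batch_tokens = [], 0
        batch.append(sample)
        batch_tokens += n
    if batch:
        yield batch
```

This keeps per-step compute roughly constant even when sample lengths vary widely.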

PyTorch 2.9.1 + CUDA 12.9

  • Upgraded GPU dependencies to torch 2.9.1+cu129 with a new cu129 Dockerfile. (#451, #442)

MoE Load Balance Monitoring

  • Added real-time MoE load balance logging for Qwen3-MoE training diagnostics. (#539)
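A load-balance diagnostic of this kind typically reduces to comparing each expert's token count against the ideal uniform share; the sketch below shows one such metric (an assumption for illustration, not VeOmni's exact logged quantity).

```python
# Hedged sketch of a MoE load-balance metric: ratio of the busiest
# expert's token count to the ideal uniform share. 1.0 means perfectly
# balanced routing; large values flag hot experts.

def load_balance_factor(expert_assignments, num_experts):
    counts = [0] * num_experts
    for e in expert_assignments:     # one routed expert id per token
        counts[e] += 1
    ideal = len(expert_assignments) / num_experts
    return max(counts) / ideal
```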

Breaking Changes

  • Unified trainer for all tasks (#458): Training tasks are now dispatched through a unified trainer. Custom task scripts that previously used standalone training loops must migrate to the new trainer-based interface.
  • Argument structure refactored (#538): The CLI argument hierarchy has been reorganized. Existing config files and launch scripts may need to be updated to match the new argument structure.

What's Changed

  • [model] feat: Use a patch to add the modeling code for Qwen3 for Sequence Classification. by @yiwzhao in #426
  • [data] feat: add arguments for sequence classification by @yiwzhao in #427
  • [misc] feat: add linux aarch64 platform check by @wang-hua-2019 in #434
  • [model, ci] refactor: patch qwen3-moe model by @FoolPlayer in #435
  • [ckpt] refactor: use exact match for HF checkpoint verification by @TimYangst in #440
  • [misc] feat: add cu129 Dockerfile for CUDA 12.9 builds by @piyifan123 in #442
  • [model] refactor: patch qwen25omni by @Coach257 in #437
  • [model] refactor: patch qwen2_vl by @Coach257 in #445
  • [misc] fix: Remove pip related logic in Dockerfile.cu129 to avoid 403 failures. by @piyifan123 in #446
  • [data] fix: fix data.py torch2.9 deprecation warning and unit tests by @piyifan123 in #443
  • [model] refactor: add model patch for deepseek-v3 by @FoolPlayer in #444
  • [model] chore: remove qwen3-moe npu expert fwd by @FoolPlayer in #449
  • [model] refactor: Add seed-oss patch by @FoolPlayer in #450
  • [model] fix: bring back router logits for Qwen3Moe by @Luosuu in #453
  • [ci, release] feat: Upgrade gpu deps to use torch 2.9.1+cu129 (and other compatible packages) from 2.8.0+cu128 by @piyifan123 in #451
  • [ci] test: add dsv3 saveload ci by @FoolPlayer in #452
  • [model] fix: add freqs split for RoPE within Context Parallelism by @zbian99 in #448
  • [data, model] feat: add EncoderDataBalance to support ViT data balance for qwen3vl moe model by @TRexces in #425
  • [misc] feat: debug gradient checkpointing by @Luosuu in #459
  • [task] fix: Exclude padding batches from gradient updates by @Ziyi-Wang in #455
  • [task] fix: Add enable_high_precision_for_bf16 to all trainers missing it. by @piyifan123 in #465
  • [config] feat: add wandb_id parameter for resuming wandb runs by @zwhe99 in #416
  • [task] refactor: Centralize HF safetensor saving logic into shared util by @Ziyi-Wang in #463
  • [model] fix: fix weight init by @FoolPlayer in #464
  • [data] feat: Add a data processor suitable for embedding classification. by @yiwzhao in #389
  • [misc] feat: Add transformers5-exp as an optional (non-default) dep (no-op to existing uv sync). by @piyifan123 in #469
  • [ckpt] feat: support distributed huggingface safetensor save via HuggingFaceStorageWriter by @Ziyi-Wang in #461
  • [model] refactor: use patch mechanism for qwen3_omni_moe instead of duplicating HF source by @TimYangst in #473
  • [ckpt] feat: wait for prev async save to finish before safetensor save by @Ziyi-Wang in #476
  • [model] chore: respect USE_LIGER_KERNEL flag in GPU model patch by @Luosuu in #474
  • [misc] fix: update transformers v5 compatibility imports by @piyifan123 in #477
  • [model] feat: qwen3-next flops counter by @yicheng-gong in #481
  • [misc] fix: bump transformers5-exp to 5.2.0 and fix no_split_modules merges by @piyifan123 in #485
  • [ckpt] feat: make save_hf_safetensor self-contained with internal synchronization by @Ziyi-Wang in #479
  • [model] feat: add parallel plan and varlen attention support for Qwen3OmniMoe by @TimYangst in #489
  • [omni,data] refactor: unify Qwen-Omni data processing into shared vlm_data_process by @TimYangst in #490
  • [model] feat: [transformers v5] Add patchgen capability to generate fully fleshed patched modeling code and use qwen3 as an example by @piyifan123 in #486
  • [omni, model] fix: support mixed video_w_audio & video_w/o_audio in Qwen3OmniMoe get_rope_index by @TimYangst in #493
  • [ops] feat: FA4 dependencies by @Luosuu in #497
  • [ops] feat: FA4 integration with ulyssess by @Luosuu in #498
  • [omni, model] feat: add fused MoE forward and EP support for Qwen3-Omni-MoE thinker and update expert weight merging in moe_merge script by @TimYangst in #495
  • [model, data] feat: support audio processing in Qwen3OmniMoe with SP and FSDP fixes by @TimYangst in #456
  • [model] feat: [transformers-v5] Add support for qwen3 moe by @piyifan123 in #496
  • [misc] fix: remove duplicate expert merge logic in moe_merge.py by @TimYangst in #502
  • [model] refactor: refactor Qwen3OmniMoe to patch-based implementation by @TimYangst in #475
  • [data] feat: add DynamicBatchingSizeDataset for stateful multi-worker dynamic batching by @LiuzcEECS in #488
  • [misc] fix: save processor in MoE checkpoint merge script by @TimYangst in #503
  • [parallel] fix: Correct k normalization backward pass order in async_ulysses_dit by @DaiShiResearch in #419
  • [task, model] fix: update import in train_qwen3_omni to use VeOmni subclass by @TimYangst in #509
  • [docs] fix: fix images not rendering on GitHub in ulysses.md by @TimYangst in #512
  • [parallel] fix: ep_group return None by @Coach257 in #513
  • [data] fix: defer DataCollatorWithPositionIDs instantiation to avoid import-time error by @TimYangst in #517
  • [BREAKING][d...

v0.1.6

28 Jan 18:42
aa88bf4

v0.1.6 Pre-release

Highlights

  • We are gradually migrating VeOmni modeling code to patches on top of the HuggingFace implementations, so users can easily see exactly what differs from the original HuggingFace code. Dense models are being migrated first. @Coach257 @FoolPlayer
    • Next, we will work on generating full HuggingFace-style modeling code, which will land together with the HuggingFace Transformers v5 upgrade. @piyifan123
  • Support for Qwen3-Omni-MoE by @Crystal-jiang.
  • VeOmni no longer overrides the HuggingFace Transformers ALL_ATTENTION_FUNCTIONS registry. For backward compatibility, when flash_attention_2/3 is passed in the model arguments, it is remapped to the VeOmni flash attention key names. @Luosuu
  • Support padding packed inputs when using rmpad_with_pos_ids, which eliminates expensive Triton recompilation for varying input sizes and paves the way for torch.compile integration.
  • Support torchcodec-based video preprocessing by @TimYangst.
  • Many fixes.
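The padded-packing idea for rmpad_with_pos_ids can be sketched as follows (a minimal sketch under assumed conventions, not VeOmni's actual code): the packed batch is padded up to a fixed bucket length, and the padding tokens get fresh restarting position ids so that, under varlen attention, they form their own dummy segment and never attend to real tokens.

```python
# Illustrative sketch: pad a packed batch to a fixed bucket length so
# kernel input shapes stay constant across steps. Names and the pad
# conventions are assumptions for illustration.

def pad_packed(input_ids, position_ids, bucket_len, pad_id=0):
    """input_ids / position_ids: flat lists for several packed sequences.
    Padding tokens get position ids 0, 1, ... so they look like a
    separate sequence to varlen attention."""
    assert len(input_ids) <= bucket_len
    n_pad = bucket_len - len(input_ids)
    return (input_ids + [pad_id] * n_pad,
            position_ids + list(range(n_pad)))
```

With every step seeing the same bucket shape, Triton kernels compile once instead of once per input size.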

What's Changed

  • [docs] fix: Optimize document links in Markdown rendering by @Crystal-jiang in #380
  • [config] feat: add MFU calculation for qwen3_vl_moe by @ZhuYajun-AI in #385
  • [data, model] feat: support Qwen3-VL textual token-based time encoding by @Coach257 in #386
  • [data,ci] test: enhance video_utils test suite with robust validation and benchmarks by @TimYangst in #375
  • [data,ci,docs] feat: add torchcodec-based video processing with ffmpeg support and comprehensive testing by @TimYangst in #221
  • [perf, dist] feat: add zero2 in fsdp1 and use_orig_params configurable by @zhtao303 in #382
  • [model] fix: Fused operator fix for qwen3vl by @phdddd in #378
  • [docs] feat: add async doc in ulysses.md by @zbian99 in #388
  • [data] feat: add data collators for embedding classification. by @yiwzhao in #376
  • [model] chore: add moe split script by @FoolPlayer in #390
  • [task] feat: train qwen2.5 omni by @Coach257 in #396
  • [model] refactor: change to model patch in qwen3 by @FoolPlayer in #392
  • [docker] fix: update npu dockerfile with ffmpeg by @FoolPlayer in #398
  • [config] feat: refactor args to support multi-level config by @FoolPlayer in #397
  • [docs] fix: update async ulysses document by @zbian99 in #394
  • [model] fix: add fa3 for Qwen3VL vision attention SP path by @Luosuu in #400
  • [model] feat: patch qwen3vlmoe; qwen3vl; qwen25vl by @Coach257 in #399
  • [model] fix: qwen25vl_config by @Coach257 in #404
  • [model] fix: fsdp1 load model weights by @Coach257 in #403
  • [config] fix: remove None option from choice type args by @FoolPlayer in #405
  • [misc] feat: add live star history in readme by @Luosuu in #406
  • [config, model] chore: avoid polluting huggingface transformer attention registry while keeping job config backward compatiable by @Luosuu in #407
  • [docs] fix: correct typos in documentation by @zwhe99 in #411
  • [data] feat: decouple rmpad and dyn_bsz by @yangtian6781 in #408
  • [model] feat: refactor the rot_pos_emb and fast_pos_embed_interpolate funcs in modeling_qwen_vl by @lipengfei1409 in #391
  • [ci] fix: use wheel URLs to avoid build dependencies in CI by @TimYangst in #417
  • [model] fix: remove the check that self.training==True for SP by @A1waysBeenHere in #421
  • [model] feat: Add pure HuggingFace version of Qwen3-Omni-MoE by @Crystal-jiang in #422
  • [data, perf, ops] feat: option to pad packed input to a fixed shape for text-only models by @Luosuu in #410
  • [ops, perf] feat: Liger-Kernel is now available for NPU by @zheliuyu in #415
  • [model] feat: Support qwen3-omni-moe model by @Crystal-jiang in #409
  • [dist] feat: No resharding enabled for accelerated small model training by @yangtian6781 in #413
  • [model] fix: qwen_vl rope index by @Coach257 in #430
  • [model] refactor: change dense llm to patch style by @FoolPlayer in #431
  • [perf] fix: The H200 compute power is being recognized as H20 by @HSYZhang in #428

New Contributors

Full Changelog: v0.1.5...v0.1.6

v0.1.5

09 Jan 01:36
82e754a

v0.1.5 Pre-release

Highlights

  • Many enhancements on NPU thanks to community contributions, including NPU fused ops and CI coverage
  • Memory-efficient DCP-to-HuggingFace checkpoint conversion script by @TimYangst
  • FlashAttention-3 support by @Coach257 @Luosuu
  • SequenceClassification model support by @yiwzhao

What's Changed

  • [model] fix: quick fix by @Coach257 in #260
  • [misc] chore: add auto label github flow & pr title check by @FoolPlayer in #264
  • [docs] feat: quick start doc for Ascend NPU by @Crystal-jiang in #257
  • [dist] fix: the issue where NPU does not support the PreSumMul operation by @Feng0w0 in #263
  • [misc] fix: fix auto-label for forked pr by @FoolPlayer in #277
  • [chore] fix: The missing weights_path in build_parallelize_model prevents weight loading when using fsdp2 by @Feng0w0 in #275
  • [ci] fix: handle pull_request_target event in auto_label workflow by @Ziyi-Wang in #279
  • [misc] fix: fix auto-label logic & add tag by @FoolPlayer in #281
  • [misc] fix: fix check pr workflow by @FoolPlayer in #282
  • [ops]feat: support npu fused moe by @onehaitao in #269
  • [model] fix: calculate norm by foreach by @heidongxianhua in #274
  • [feat]Add environment variables that may lead to performance optimization on NPU by @Feng0w0 in #276
  • [dist] chore: refactor torch dcp to support multiple storage backends by @Ziyi-Wang in #278
  • [config] feat: add qwen3_vl flops estimation by @yfchen-byted in #220
  • [config] fix: restrict Python version to 3.11 and fix PyTorch CUDA dependency resolution by @TimYangst in #287
  • [dist]fix:fix a bug in the PreMulSum patch by @Feng0w0 in #285
  • [ci,config] chore: migrate to uv dependency management with enhanced installation guide by @TimYangst in #232
  • [dist] chore: make execute_save public to support multiple DCP backends by @Ziyi-Wang in #288
  • [data] fix: video utils by @Coach257 in #294
  • [ci] fix: uv by @Coach257 in #292
  • [dist] fix: remove redundant code (async_op) in ulysses by @KKZ20 in #295
  • [data] fix: interleave iterable dataset sharding by @ValMystletainn in #291
  • [misc] fix: print log twice bug by @pjgao in #290
  • [data] fix: fix annotation by @Fazziekey in #296
  • [docs] chore: update docs with readthedocs by @FoolPlayer in #286
  • [docs] feat: support Readthedocs on npu by @Crystal-jiang in #299
  • [misc] fix: del veomni_patch in helper by @Coach257 in #298
  • [docs] chore: add icon & fix cite by @FoolPlayer in #303
  • [misc] fix: clean vescale by @Coach257 in #302
  • [ops] feat: Optimize Qwen3-Moe Performance on Ascend NPU with Patches by @zhihaofang1017 in #167
  • [model, ops] feat: support fa3 by @Luosuu in #305
  • [ci] chore: add dockerfile and migrate to run ci with uv cache by @FoolPlayer in #300
  • [misc] fix: fix mfu for vlm by @heidongxianhua in #301
  • [ckpt, data] fix: fix merge dcp & fix chatml by @FoolPlayer in #310
  • [ops] feat: veomni_fa_function by @Coach257 in #304
  • [config] refactor: simplify NPU extras with separate aarch64 variant by @TimYangst in #312
  • [config] fix: remove torch extra from transformers dependency by @TimYangst in #313
  • [ops] fix: mv qwen3moe patch by @Coach257 in #314
  • [logging] fix: fix duplicata log print in wandb case by @FoolPlayer in #315
  • [ops] fix: rm ALL_FLASH_ATTENTION_FUNCTIONS by @Coach257 in #316
  • [misc] feat: use MODELING_BACKEND to control force_hf_load by @Coach257 in #311
  • [misc] fix: support enable_full_determinism without enable CUDA_LAUNCH_BLOCKING=1 by @FoolPlayer in #318
  • [model] feat: patch LOSS_MAPPING by @Coach257 in #321
  • [model] feat: fused_moe_forward by @Coach257 in #325
  • [docs] feat: add NPU Overview doc and qwen3 training guide doc by @Crystal-jiang in #319
  • [config] feat: Set the default value of use_wandb to false by @Crystal-jiang in #324
  • [docs] fix: fix npu doc suffix py by @Crystal-jiang in #328
  • [ci] test: add e2e ci test by @FoolPlayer in #320
  • [model] feat: support async ulysses for qwen3vl dense by @yyytiancai in #307
  • [model] fix: logits when labels None by @Coach257 in #332
  • [misc] feat: env param by @Coach257 in #329
  • [docs] refactor: Update README by @FoolPlayer in #334
  • [ops] feat: Optimize Qwen3 Performance on Ascend NPU with Patches by @Crystal-jiang in #308
  • [data] fix: Add sorted() to os.listdir() result in dataset.py to ensure consistent file order by @Crystal-jiang in #331
  • [docs] fix: update readthedoc by @FoolPlayer in #335
  • [ops] feat: Optimize Wan2.1 Performance on Ascend NPU with Patches by @zbian99 in #336
  • [dist] fix: loss when sp enabled by @Coach257 in #339
  • [misc] fix: vl token count by @Coach257 in #341
  • [ckpt] fix: checkpointer kwargs by @Coach257 in #343
  • [ops] refactor: abstract commonly used NPU fused operators into ops module by @zhihaofang1017 in #347
  • [optim] fix: support fused/foreach adamw on ascend by @onehaitao in #349
  • [docs] feat: add wan2.1 training guide doc and mock dataset generator by @zbian99 in #340
  • [model] fix: fix config usage in modeling_qwen3_vl by @yfchen-byted in #353
  • [misc] fix: bugfix for memory snapshot by @onehaitao in #351
  • [ci] fix: add skip remote path test by @FoolPlayer in #356
  • [data] fix: interleave_dataset duplication by @Coach257 in #358
  • [ci] fix: change network setting by @FoolPlayer in #360
  • [ci] feat: Add Dockerfile and workflow for Ascend image by @phdddd in #364
  • [model] fix: repeat all2all in Wan2.1 ulysses. by @TKONIY in #333
  • [ci] fix: Npu fix by @phdddd in #368
  • [ci] fix: fix_npu_proxy by @FoolPlayer in #370
  • [ci] test: Npu docker auto build by @phdddd in #372
  • [ops] fix: RoPE fused op caused freqs shape bug by @zbian99 in #355
  • [ci] fix: sync npu unit tests from gpu unit tests by @onehaitao in #359
  • [ckpt, ci] fix: refactor merge_dcp_to_hf.py to remove internal dependencies and add checkpoint verification by @TimYangst in #272
  • [ckpt, ci, docs] feat: memory-efficient DCP to HuggingFace checkpoint conversion by @TimYangst in #374
  • [ci] fix: fix async ulysses test by @Crystal-jiang in #357
  • [ops] feat: add chunkloss characteristics for NPU by @zhihaofang1017 in #361
  • [model] feat: Add ascend fused operators for Qwen3VL by @phdddd in #323
  • [dist, data] fix: init parallel state in data collator post init to avoid worker processing getting single process state by @Luosuu in #383
  • [model, ops] feat: add Qwen3 sequence classification model and loss for embedding class...

v0.1.4

07 Dec 01:44
f491995

v0.1.4 Pre-release

Highlights

  • We now have GPU and NPU machines in CI and have enabled some tests on them. We welcome more test contributions if you are interested!
  • VeOmni models, datasets, dataloaders, checkpointers, chat templates, and preprocessors are now registry-based, making it easier to add new components and customize existing ones
  • Updated pyproject.toml for uv-based environment management. We now recommend installing VeOmni only through uv.
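The registry pattern behind these components can be sketched in a few lines (illustrative; VeOmni's registries differ in detail, and the names here are hypothetical): a decorator records each class under a name, and a builder looks it up, so adding a new model or dataset is just a registration.

```python
# Minimal sketch of a component registry; MODEL_REGISTRY and the names
# below are illustrative, not VeOmni's actual identifiers.

MODEL_REGISTRY = {}

def register_model(name):
    def deco(cls):
        MODEL_REGISTRY[name] = cls   # record the class under its key
        return cls
    return deco

@register_model("my_custom_model")   # hypothetical user-defined model
class MyCustomModel:
    pass

def build_model(name):
    return MODEL_REGISTRY[name]()    # construct by name at runtime
```

Users extend the framework by registering their own classes rather than editing dispatch code.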

What's Changed

  • update README.md by @Fazziekey in #188
  • [dist] fix: refactor fsdp2 grad norm clipping by @Luosuu in #185
  • [misc] fix: reset hf init flag for random init by @Luosuu in #176
  • Fix Qwen3-Moe MFU by @zhihaofang1017 in #125
  • [model] fix:avoid cpu-device sync for qwenvl on npu by @wey-code in #190
  • [task] fix: replace DataArguments with MyDataArguments and remove duplicated step2token saving by @MuyaoLi-jimo in #189
  • [ci] test: CI env test by @FoolPlayer in #201
  • [ckpt][fix]release cuda mem after dcp sync save by @EricOlivier in #207
  • [misc] feat: update uv support for aarch platform for Ascend+Kunpeng … by @pjgao in #148
  • [data] fix: fix exception raised when fetching current_device on NPU by @ji-huazhong in #211
  • [ci] test: fix data_ci by @Coach257 in #222
  • [ci] test: add npu ci env by @FoolPlayer in #219
  • [data] feat: Implement extensible data preprocessor registry by @TimYangst in #203
  • [ci]Add NPU support to data and model test by @Crystal-jiang in #224
  • [ci]Add Ascend NPU native support to the unit test code by @Crystal-jiang in #208
  • [ci] chore: add gemini config & test by @FoolPlayer in #229
  • [test] ci: add device api check for tests by @onehaitao in #213
  • [core] feat: registry for dataset & dataloader & checkpointer & ckpt_to_state_dict & chat_template & preprocess by @Coach257 in #230
  • [dist] fix: make OptimizerState EP-dim aware to fix its dcp saving by @Luosuu in #228
  • Automatically add the "ascend" label by @Crystal-jiang in #234
  • [helper]:fix npu profiling by @Feng0w0 in #214
  • helper: degrade veomni_patch functions to warnings/no-op by @iqiancheng in #197
  • [ci] fix: dataloader in e2e ckpt test by @Luosuu in #233
  • [feat] nccl_timeout by @brook-cpp in #217
  • Automatically apply "ascend" label to issues and PRs by @Crystal-jiang in #239
  • chore: Upgrade PyTorch dependencies to 2.8.0 and flash-attention to 2.8.3 by @TimYangst in #242
  • [version] update transformers version to 4.57.0 by @phdddd in #243
  • feat: distributed checkpointer support customized backend by @Ziyi-Wang in #182
  • [ckpt] refactor: remove unused output_dir parameter from ckpt_to_state_dict by @TimYangst in #248
  • [config, omni, dis] fix: quick fix for sft of Wan2.1-I2V-14B-480P by @zbian99 in #240
  • [model] fix: Update @check_model_inputs decorator for transformers 4.57+ compatibility by @TimYangst in #252
  • [core] fix is_x_backend by @brook-cpp in #251
  • [data] fix: quick fix for exception raised when building dit dataloader on NPU by @zbian99 in #246
  • upgrade: Upgrade transformers from v4.57.0 to v4.57.3 by @yiwzhao in #249
  • [core] feat: model registry by @Coach257 in #258
  • [dist] feat: unified veomni grad norm clipping by @Luosuu in #205
  • [task]fix: fix train.sh NPROC_PER_NODE calculation logic on the NPU by @Crystal-jiang in #227
  • [chore]: cache ep group by @heidongxianhua in #231

New Contributors

Full Changelog: v0.1.3...v0.1.4

v0.1.3

07 Nov 07:09
135bc45

v0.1.3 Pre-release

Highlights

  • Qwen3VL (both dense and MoE) series support by @Juntian777
  • DeepSeek performance restoration by @Luosuu

What's Changed

  • [dist] feat: enable EP-aware optimizer for FSDP2-based MoE-VLM training. by @Juntian777 in #145
  • [model] enable deepseek ulysses and fix deepseek transpose by @Luosuu in #152
  • [dist] fix: add alltoall async by @heidongxianhua in #146
  • [ckpt] fix: merge ckpt to hf script by @Luosuu in #156
  • [model] fix: remove npu flash attention sync by @wey-code in #154
  • [model] feat: support qwen3-vl dense by @Juntian777 in #164
  • [logging] fix: the logging lineno of log_rank0 by @ValMystletainn in #160
  • fix: repeat kv bug in flash attention forward with ulysses by @HaoyiZhu in #162
  • fix: correct code formatting for PR162 by @Juntian777 in #165
  • [model] perf: eliminate per-layer CPU-GPU sync in Qwen3-VL vision attention by @Juntian777 in #169
  • [model] feat: add dummy forward for video input by @Juntian777 in #177
  • perf: set reshard_after_forward to False for modules without MixedPrecision by @Luosuu in #153
  • [doc] feat: how to enable new models in veomni by @Juntian777 in #179
  • [model] feat: support qwen3 vl moe by @Juntian777 in #178
  • [doc] feat: update README with new support for Qwen3-VL and Qwen3-VL-MoE by @Juntian777 in #180
  • fix: workaround duplicated AllGather for EP+FSDP2 by @Luosuu in #173

New Contributors

Full Changelog: v0.1.2...v0.1.3

v0.1.2

17 Oct 18:08
600fe6d

v0.1.2 Pre-release

What's Changed

  • [misc] shift bytecheckpoint to optional dependency by @Luosuu in #92
  • [misc] revert ckpt default to avoid internal exceptions by @Luosuu in #93
  • [dist] minor fixes by @Luosuu in #94
  • [misc] feat: add GITBUG ISSUE TEMPLETE by @Fazziekey in #95
  • [data] feat: support megatron-energon dataset by @ziqi-wlb in #62
  • [data] add interleaved dataset by @Coach257 in #90
  • fix:remove a failing assertion by @KaijingOfficial in #97
  • [config] clean gitignore by @Luosuu in #99
  • [dist] fix: DCP auto load by @Luosuu in #106
  • [model] fix: Switch qwen3 and seed_oss to veomni defined GradientCheckpointingLayer by @piyifan123 in #109
  • [misc] feat: Add uv support to allow simple uv sync based python package management by @piyifan123 in #110
  • [BREAKING][dist] feat: Unified dcp saving for model and optimizer by @Luosuu in #107
  • [misc] feat: add skip_ulysses flag to bypass Ulysses logic in flash_attention_forward by @Juntian777 in #111
  • [dist] fix: remove unnecessary assert by @Luosuu in #112
  • [misc] feat: option to profile rank0 only or all the ranks by @Luosuu in #113
  • [misc] fix: remove buggy memory timeline export by @Luosuu in #114
  • [config] feat: add allow_cuda_launch_blocking by @Luosuu in #115
  • [ckpt] fix: remove unnecessary path joining for dcp by @Luosuu in #121
  • [ckpt][BREAKING] fix unnecessary wrapping for model and optimizer states by @Luosuu in #122
  • fix: qwen2 vl yaml by @Ziyi-Wang in #127
  • [data] fix :fix data collator for sp with cu_seq_lens_q and max_length_q by @Fazziekey in #126
  • [data] fix: dataset call hdfs api by @Ziyi-Wang in #128
  • [misc] fix: update asomeworks by @Fazziekey in #135
  • [model] fix: remove patch for npu by @heidongxianhua in #134
  • [dist] feat: faster weight loading through broadcasting from rank0 by @Luosuu in #123
  • [data] feat: support correct cu_seqlens handling for SP and non-SP by @Juntian777 in #136
  • [ckpt] fix: rank for get last iteraton for non-dcp path by @Luosuu in #140
  • [model] fix: deepseek-v3 by @Luosuu in #139
  • [model] fix: remove Qwen3-MoE redundant flashattention prep and fix input_ids access bug by @Juntian777 in #141
  • [data] fix: remove hf dependency on prepare_fa_kwargs_from_position_ids by @Juntian777 in #144
  • fix: wan_attnetion_missing_config_issue by @JeffryLee in #133
  • [fsdp] feat: support broadcast large weight by chunk. by @ZZWHU in #142
  • [core] fix: use flash_attention_2 backend by @KKZ20 in #124

New Contributors

Full Changelog: v0.1.1...v0.1.2

v0.1.1: NPU support, Flexible mixed precision, DCP async, and more bug fixes

24 Sep 07:09
53fc5b5


New features

What's Changed

New Contributors

Full Changelog: v0.1.0.post1...v0.1.1

v0.1.0.post1

22 Sep 06:33
2ce14f2

v0.1.0.post1 Pre-release

We are excited to publish the first release of VeOmni. From now on, we will actively develop VeOmni on GitHub and strive to keep features stable. Bug reports and feature requests are welcome!

New features

What's Changed

New Contributors

Full Changelog: https://github.com/ByteDance-Seed/VeOmni/commits/v0.1.0.post1