Releases: xorbitsai/inference
v1.7.1.post1
What's new in 1.7.1.post1 (2025-06-30)
These are the changes in inference v1.7.1.post1.
Enhancements
- BLD: pin transformers version at 4.52.4 to fix "Failed to import module 'SentenceTransformer'" error by @amumu96 in #3743
Full Changelog: v1.7.1...v1.7.1.post1
v1.7.1
What's new in 1.7.1 (2025-06-27)
These are the changes in inference v1.7.1.
New features
- FEAT: [UI] enhance audio & rerank model registration params. by @yiboyasss in #3656
- FEAT: support async client by @zhcn000000 in #3645
- FEAT: [UI] add max_tokens display in rerank model. by @yiboyasss in #3671
- FEAT: [UI] add model_ability options for LLM registration. by @yiboyasss in #3663
- FEAT: support QwenLong-L1 by @Jun-Howie in #3691
- FEAT: [UI] model registration supports packages. by @yiboyasss in #3702
- FEAT: support MLU device by @nan9126 in #3693
- FEAT: vllm v1 auto enabling by @qinxuye in #3637
- FEAT: distributed inference for MLX by @qinxuye in #3700
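The async client added in #3645 enables firing several requests concurrently instead of serially. A minimal sketch of that concurrency pattern, using a placeholder coroutine in place of a real client call (this is illustrative only, not the project's actual client interface):

```python
import asyncio

async def fake_chat(prompt: str) -> str:
    # Placeholder for a real async chat-completion call.
    await asyncio.sleep(0.01)  # simulates network latency
    return f"reply to: {prompt}"

async def main() -> list:
    prompts = ["hello", "how are you", "bye"]
    # gather() awaits all requests concurrently on one event loop
    # and returns results in the same order as the inputs.
    return await asyncio.gather(*(fake_chat(p) for p in prompts))

replies = asyncio.run(main())
```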
Enhancements
- ENH: add enable_flash_attn param for loading qwen3 embedding & rerank by @qinxuye in #3640
- ENH: add more abilities for builtin model families API by @qinxuye in #3658
- ENH: improve local cluster startup reliability via child-process readiness signaling by @Checkmate544 in #3642
- ENH: FishSpeech support pcm by @codingl2k1 in #3680
- ENH: Add 4-sample micro-batching to Qwen-3 reranker to reduce GPU memory by @yasu-oh in #3666
- ENH: Limit default n_parallel for llama.cpp backend by @codingl2k1 in #3712
- BLD: pin flash-attn & flashinfer-python version and limit sgl-kernel version by @amumu96 in #3669
- BLD: Update Dockerfile by @XiaoXiaoJiangYun in #3695
- REF: remove unused code by @qinxuye in #3664
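The micro-batching change in #3666 scores reranker inputs in fixed-size chunks of 4 so that only a few query-document pairs occupy GPU memory at a time. A minimal, library-free sketch of the chunking idea (names are illustrative, not the actual implementation):

```python
def microbatches(items, size=4):
    """Yield successive fixed-size chunks; the last chunk may be smaller."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Scoring 10 pairs in micro-batches of 4 caps peak memory at
# 4 pairs at once instead of all 10.
batches = list(microbatches(list(range(10)), size=4))
```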
Bug fixes
- BUG: fix TTS error "No such file or directory" by @robin12jbj in #3625
- BUG: Fix max_tokens value in Qwen3 Reranker by @yasu-oh in #3665
- BUG: fix custom embedding by @qinxuye in #3677
- BUG: [UI] rename the command-line argument from download-hub to download_hub. by @yiboyasss in #3685
- BUG: fix jina-clip-v2 for text only or image only by @qinxuye in #3690
- BUG: internvl chat error using vllm engine by @amumu96 in #3722
- BUG: fix the parsing logic of streaming tool calls by @amumu96 in #3721
- BUG: fix <think> wrongly added when setting chat_template_kwargs {"enable_thinking": False} by @qinxuye in #3718
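For context on the <think> fix above: when thinking is disabled, reasoning tags should not leak into output. A hedged sketch of stripping such a block from a response string (illustrative only, not the project's actual parsing code):

```python
import re

def strip_think(text: str) -> str:
    # Remove <think>...</think> blocks plus any trailing whitespace.
    # DOTALL lets "." span newlines inside the reasoning block.
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL)

cleaned = strip_think("<think>reasoning...</think>The answer is 42.")
```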
Documentation
- DOC: add doc for paraformer by @leslie2046 in #3631
- DOC: Flexible model (traditional ML models) by @qinxuye in #3714
New Contributors
- @robin12jbj made their first contribution in #3625
- @zhcn000000 made their first contribution in #3645
- @yasu-oh made their first contribution in #3665
- @Checkmate544 made their first contribution in #3642
- @nan9126 made their first contribution in #3693
- @XiaoXiaoJiangYun made their first contribution in #3695
Full Changelog: v1.7.0...v1.7.1
v1.7.0.post1
What's new in 1.7.0.post1 (2025-06-13)
These are the changes in inference v1.7.0.post1.
Bug fixes
- BUG: fix qwen3-rerank to create model on GPU by @qinxuye in #3630
- BUG: fix minicpm4 modeling by @Jun-Howie in #3632
Full Changelog: v1.7.0...v1.7.0.post1
v1.7.0
What's new in 1.7.0 (2025-06-13)
These are the changes in inference v1.7.0.
New features
- FEAT: support CogView4 image model by @qinxuye in #3557
- FEAT: [UI] support model_ability filter for image and video models. by @yiboyasss in #3563
- FEAT: [UI] auto-switch to active tab when Running Models page loads. by @yiboyasss in #3568
- FEAT: support first-last-frame to video by @qinxuye in #3555
- FEAT: [UI] add Japanese and Korean language support. by @yiboyasss in #3574
- FEAT: SeACoParaformer model by @leslie2046 in #3587
- FEAT: support verbose_json for funasr family audio2text models by @leslie2046 in #3591
- FEAT: support deepseek-r1-0528 Mixed quantization by @Jun-Howie in #3601
- FEAT: support engines for embedding models by @pengjunfeng11 in #2791
- FEAT: support MiniCPM4 series by @Jun-Howie in #3609
- FEAT: [UI] add model_engine parameter to embedding model. by @yiboyasss in #3617
- FEAT: add kwargs for transcripts client API by @leslie2046 in #3622
- FEAT: support qwen3 embedding by @qinxuye in #3615
- FEAT: support qwen3-reranker by @qinxuye in #3627
Enhancements
- ENH: Support pcm response_format by @codingl2k1 in #3606
Bug fixes
- BUG: Fix dependency by @codingl2k1 in #3566
- BUG: Fix cmdline by @codingl2k1 in #3589
- BUG: fix potential hang for sglang by @qinxuye in #3597
- BUG: [UI] fixed the mobile language switching bug. by @yiboyasss in #3608
- BUG: Fix the error when using Qwen function call with Spring AI. by @aniya105 in #3614
Documentation
- DOC: update links by @qinxuye in #3565
- DOC: Update CosyVoice doc by @codingl2k1 in #3605
- DOC: update models by @qinxuye in #3628
Others
- FIX: [UI] fix model_engine parameter bug. by @yiboyasss in #3620
Full Changelog: v1.6.1...v1.7.0
v1.6.1
What's new in 1.6.1 (2025-05-30)
These are the changes in inference v1.6.1.
New features
- FEAT: llama.cpp backend support multimodal by @codingl2k1 in #3442
- FEAT: Auto ngl for llama.cpp backend by @codingl2k1 in #3518
- FEAT: [UI] add hint for common parameters with support for custom input. by @yiboyasss in #3521
- FEAT: add some other paraformer series models by @leslie2046 in #3536
- FEAT: support Deepseek-R1-0528 by @Jun-Howie in #3539
- FEAT: support deepseek-r1-0528-qwen3 by @Jun-Howie in #3552
Enhancements
- ENH: [rerank] add instruction for minicpm-reranker by @llyycchhee in #3453
- ENH: pass extra arguments for speech2text API. by @leslie2046 in #3516
- ENH: add modelscope support for kolors by @qinxuye in #3534
- ENH: remove check when specified GPU index for vllm by @kota-iizuka in #3527
- ENH: Support HybridCache in transformers lib, mainly for gemma3 chat model by @ChengjieLi28 in #3538
- ENH: support virtualenv for chattts by @qinxuye in #3541
- BLD: fix setup.cfg by @qinxuye in #3467
- BLD: update flashinfer version by @amumu96 in #3549
- REF: Refactor for multimodal llm models by @ChengjieLi28 in #3462
Bug fixes
- BUG: fix input for jina clip by @llyycchhee in #3440
- BUG: [ui] delete cache file white screen bug. by @yiboyasss in #3482
- BUG: fix import_submodules, ignore test files by @Gmgge in #3545
Documentation
- DOC: remove llama-cpp-python related doc & refine model_ability parts by @qinxuye in #3519
- DOC: Update doc about cosyvoice-2.0 stream and auto NGL by @codingl2k1 in #3547
New Contributors
- @kota-iizuka made their first contribution in #3527
Full Changelog: v1.6.0...v1.6.1
v1.6.0.post1
What's new in 1.6.0.post1 (2025-05-17)
These are the changes in inference v1.6.0.post1.
Full Changelog: v1.6.0...v1.6.0.post1
v1.6.0
What's new in 1.6.0 (2025-05-16)
These are the changes in inference v1.6.0.
New features
- FEAT: [MODEL]XiYanSQL-QwenCoder-2504 by @Minamiyama in #3352
- FEAT: [Model]HuatuoGPT-o1 by @Minamiyama in #3353
- FEAT: [Model]DianJin-R1 by @Minamiyama in #3343
- FEAT: support image_to_video by @qinxuye in #3386
- FEAT: Qwen3-235B-A22B GPTQ Quantization Int4 Int8 by @Jun-Howie in #3422
- FEAT: use xo.wait_for instead of asyncio.wait_for for actor call by @qinxuye in #3439
- FEAT: video UI by @qinxuye in #3448
- FEAT: auto add tag when it is missed by @amumu96 in #3456
- FEAT: Support Skywork-OR1 by @Jun-Howie in #3447
- FEAT: audio UI by @qinxuye in #3457
- FEAT: Support Skywork-OR1 gptq for 32B by @Jun-Howie in #3464
- FEAT: support enable_thinking for loading qwen3 by @qinxuye in #3463
Enhancements
- ENH: Qwen/Qwen2.5-Omni-3B by @Minamiyama in #3366
- ENH: added mlx format for qwen3 & update docs by @qinxuye in #3369
- ENH: Qwen3-AWQ for 14B & 32B by @Minamiyama in #3370
- ENH: Update the activated_size_in_billions parameter in the deepseek-vl2 model by @Jun-Howie in #3380
- ENH: add mlx-community/Qwen2.5-VL-32B-Instruct by @xiaohan815 in #3405
- ENH: [UI] add a documentation link button in side menu by @Minamiyama in #3411
- ENH: QwQ use unsloth gguf by @codingl2k1 in #3408
- ENH: llama.cpp backend use xllamacpp by @codingl2k1 in #3412
- ENH: [UI] display version info in side menu by @Minamiyama in #3423
- ENH: Worker env isolation by @codingl2k1 in #3362
- ENH: Use Qwen's official quantized model repository by @Jun-Howie in #3436
- ENH: Update cosyvoice by @codingl2k1 in #3365
- BLD: isolate autoawq and GPTQModel into separate extra install by @qinxuye in #3397
- BLD: pin transformers version at 4.51.3 by @amumu96 in #3431
- REF: support loading model config in function by @Minamiyama in #3428
Bug fixes
- BUG: fix qwen3 235b spec by @qinxuye in #3375
- BUG: fix incomplete parsing of reasoning content in reasoning_parser by @amumu96 in #3391
- BUG: fix the processing logic for inference content parsing and tool calls by @amumu96 in #3394
- BUG: fix stop word handling logic in vllm model generation configuration by @amumu96 in #3414
- BUG: fix Model._get_full_prompt() takes 3 positional arguments but 4 were given by @qinxuye in #3417
- BUG: fix potential stop hang by @qinxuye in #3434
- BUG: [UI] Added cpu_offload parameter to video model and fixed bug in audio model's filtering function. by @yiboyasss in #3461
New Contributors
- @xiaohan815 made their first contribution in #3405
Full Changelog: v1.5.1...v1.6.0
v1.5.1
What's new in 1.5.1 (2025-04-30)
These are the changes in inference v1.5.1.
New features
- FEAT: Wan 2.1 text2video by @qinxuye in #3297
- FEAT: [UI] highlight the input box content. by @yiboyasss in #3306
- FEAT: [UI] display the model_ability parameter. by @yiboyasss in #3308
- FEAT: add ggufv2 support for vLLM by @harryzwh in #3259
- FEAT: ovis2 by @Minamiyama in #3170
- FEAT: support Qwen3 and Qwen3MOE by @Jun-Howie in #3347
- FEAT: Add support for Qwen3 GPTQ quantization format by @Jun-Howie in #3363
Enhancements
- ENH: support setting sse ping attempts by @llyycchhee in #3313
- ENH: Support GLM4-0414 MLX and GGUF by @Jun-Howie in #3325
- ENH: optimize qwen3, support chat_template_kwargs for all engines by @qinxuye in #3354
- REF: Drop internal compression logic for transformers quantization, using bnb config instead by @ChengjieLi28 in #3324
- REF: Unify audio model abilities by @llyycchhee in #3351
Bug fixes
- BUG: fix sglang chat by @qinxuye in #3326
- BUG: Show engine options on UI even if the specific engine is not installed by @ChengjieLi28 in #3331
- BUG: fix failure of clearing resources when loading model failed by @qinxuye in #3361
New Contributors
- @llyycchhee made their first contribution in #3313
- @harryzwh made their first contribution in #3259
- @qiulang made their first contribution in #3342
Full Changelog: v1.5.0...v1.5.1
v1.5.0.post2
What's new in 1.5.0.post2 (2025-04-21)
These are the changes in inference v1.5.0.post2.
Bug fixes
- BUG: [UI] fix the bug in the cancellation function. by @yiboyasss in #3301
- BUG: fix gemma-3-it max_tokens by @qinxuye in #3304
- BUG: fix potential progress error by @qinxuye in #3305
Full Changelog: v1.5.0.post1...v1.5.0.post2
v1.5.0.post1
What's new in 1.5.0.post1 (2025-04-19)
These are the changes in inference v1.5.0.post1.
Full Changelog: v1.5.0...v1.5.0.post1