
build(deps): bump sglang from 0.5.2 to 0.5.10 #5927

Open
dependabot[bot] wants to merge 1 commit into main from dependabot/pip/sglang-0.5.10

Conversation


@dependabot dependabot bot commented on behalf of github Apr 8, 2026

Bumps sglang from 0.5.2 to 0.5.10.
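If the pin lives in a requirements-style file (an assumption about this repo's layout — the exact file is not shown in the PR), the bump amounts to a one-line change:

```diff
-sglang==0.5.2
+sglang==0.5.10
```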

Release notes

Sourced from sglang's releases.

v0.5.10

Highlights

  • Piecewise CUDA Graph Enabled by Default: Piecewise CUDA graph capture is now the default execution mode, reducing memory overhead and improving throughput for models with complex control flow patterns: #16331

  • Elastic EP for Partial Failure Tolerance: Integrate Elastic NIXL-EP into SGLang, enabling partial failure tolerance for DeepSeek MoE deployments — when a GPU fails, the system redistributes expert weights and continues serving without full restart: #19248, #17374, #12068 blog

  • GPU Staging Buffer for PD Disaggregation: Gathers scattered head slices into contiguous memory for bulk RDMA transfer, reducing RDMA request count on GQA models by ~1000x. TPS/GPU on large concurrency increased by ~5x with Prefill TP4+Decode DEP4 on Qwen3.5: #19890
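The gather-then-transfer idea above can be sketched in plain NumPy. This is an illustrative toy, not sglang's implementation: scattered head slices are copied into one contiguous staging buffer so a single bulk transfer can replace many small per-slice requests.

```python
import numpy as np

# Toy sketch of a staging buffer (not sglang's actual code).
num_heads, head_dim = 8, 16
kv = np.random.rand(num_heads, head_dim).astype(np.float32)
head_ids = [1, 3, 6]  # hypothetical scattered head slices to ship

# Naive path: one RDMA request per slice -> len(head_ids) requests.
# Staged path: gather slices into contiguous memory, transfer once.
staging = np.empty((len(head_ids), head_dim), dtype=np.float32)
for i, h in enumerate(head_ids):
    staging[i] = kv[h]  # gather into the contiguous staging buffer

# Freshly allocated arrays are C-contiguous, so the whole buffer
# can go out in a single bulk transfer.
assert staging.flags["C_CONTIGUOUS"]
print(staging.shape)  # (3, 16)
```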

  • HiSparse for Sparse Attention: Integrate HiSparse sparse attention backend for efficient long-context inference with reduced compute through sparsity-aware attention: #20343

  • SGLang-Diffusion Update:

    • Model support: LTX-2, Hunyuan3D-2, Helios
    • Performance improvements: Qwen-image and Z-image throughput increased by 1.5x
    • New platform: macOS
    • New feature: improved diffusers backend performance by integrating all optimizations from Cache-DiT
    • Skills: explore the curated skill for developing and optimizing sglang-diffusion!
  • FlashInfer MXFP8 Kernel Support: Integrate FlashInfer mxfp8 kernels for GEMM and MoE operations, enabling mixed-precision FP8 inference with higher accuracy through microscaling for RL and general workloads: #19537

  • Transformers 5.3.0 Upgrade: Major upgrade from transformers 4.57.1 to 5.3.0, unlocking support for the latest model architectures and features from HuggingFace. The GLM-5 model is now supported in this image instead of the custom-built image: #17784

  • DeepSeek V3.2 / GLM-5 Optimization: GLM-5 runnable on main branch (with upgraded transformers). Fused Triton kernel for prefill KV cache fetching, NSA fuse store indexer for K cache, TRT-LLM prefill/decode DSA kernels as default on SM100/SM103, and IndexCache for improved throughput by more than 10% on high workloads: #19319, #19148, #20062, #21914, #21405

  • Qwen3.5 GDN/KDA Optimization: Transpose linear attention state layout from [N, HV, K, V] to [N, HV, V, K] and fuse split/reshape/cat ops in GDN projection with Triton kernel, plus CuTeDSL KDA decode kernel support for improved Qwen3.5 performance: #20283, #21019, #21203
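The layout change described above (transposing the state from `[N, HV, K, V]` to `[N, HV, V, K]`) can be illustrated with a minimal NumPy sketch — the data is unchanged, only the memory layout differs:

```python
import numpy as np

# Hypothetical shapes for illustration only.
N, HV, K, V = 2, 4, 64, 128
state_kv = np.arange(N * HV * K * V, dtype=np.float32).reshape(N, HV, K, V)

# Swap the last two axes to get the [N, HV, V, K] layout, then
# materialize it contiguously so downstream kernels see the new strides.
state_vk = np.ascontiguousarray(state_kv.transpose(0, 1, 3, 2))

print(state_vk.shape)  # (2, 4, 128, 64)
# Same values, different layout: each [V, K] slab is the transpose
# of the corresponding [K, V] slab.
print(np.array_equal(state_vk[0, 0], state_kv[0, 0].T))  # True
```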

  • LoRA Support for MoE Layers: Add LoRA fine-tuning support for Mixture-of-Experts layers with JIT alignment kernels, fused Triton kernels, TP support, CUDA graph support, and auto-detection of LoRA target modules — enabling efficient adapter-based tuning on MoE models like DeepSeek: #19710, #19711, #14105, #21439, #21647
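For readers unfamiliar with the adapter math behind the bullet above, here is a generic low-rank adapter sketch (illustrative only — it does not reflect sglang's fused MoE LoRA kernels): the adapted weight is `W' = W + B @ A` with rank `r` much smaller than the layer dimensions, and `B` zero-initialized so training starts from the base model's behavior.

```python
import numpy as np

# Hypothetical dimensions for illustration.
d_in, d_out, r = 64, 32, 4
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))   # frozen base weight
A = rng.normal(size=(r, d_in))       # trainable down-projection
B = np.zeros((d_out, r))             # trainable up-projection, zero-init

x = rng.normal(size=(d_in,))
# Adapter path adds B @ (A @ x) on top of the frozen base output.
y = W @ x + B @ (A @ x)

# With B zero-initialized, the adapted output equals the base output,
# so the adapter starts as a no-op and learns a low-rank delta.
print(np.allclose(y, W @ x))  # True
```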

  • Prefill Context Parallel for MHA (Qwen3): Enable context parallelism during prefill for multi-head attention models like Qwen3 MoE, distributing long sequences across GPUs to reduce per-GPU memory and accelerate prefill: #18233

  • Flash Attention 4 Official Library Support: Upgrade to the official FlashAttention 4 package, bringing the latest attention optimizations and Blackwell GPU support: #20303

  • Skip-Softmax Attention for FlashInfer TRT-LLM Kernels: Reduce computation overhead in attention layers by skipping redundant softmax normalization: #19089

  • Speculative Decoding with FA4 Backend: Enable speculative decoding for the FA4 attention backend, combining speculative inference with next-generation flash attention for faster generation: #21080

  • MM Attention FA4 Default on SM100: Multi-modal attention now uses FA4 by default on Blackwell hardware for improved VLM performance: #21595

  • Stronger Transformers Modeling Backend: Enhanced transformers backend with full TP, PP, MoE, VLM support, and torch.compile compatibility: #19163

  • sglang-kernel 0.4.1: Major kernel package release with renamed package (sgl-kernel → sglang-kernel), consolidated kernels, and cleanup of deprecated ops: #20440, #22009

  • Native MLX Backend for Apple Silicon: Add native MLX execution backend enabling SGLang to run inference directly on Apple Silicon Macs without CUDA: #20342

New Model Support

  • Nemotron-3-Super (bf16/fp8/nvfp4): #20407, cookbook
  • Mistral Small 4 (Pixtral): #20708
  • LFM2-VL (Liquid Foundation Model 2 Vision-Language): #21230
  • Voxtral (speech-to-text): #21635
  • GLM-5: Supported on main branch with transformers 5.3.0

... (truncated)

Commits
  • 1519acf [Hotfix] Fix router gemm on sm103 (#22134)
  • c1927e1 fix: TRT-LLM MHA CUDA illegal address with EAGLE v2 + DP attention (#21649)
  • 07f57fc Enable IndexCache for DeepSeek V3.2 (#21405)
  • 164bc0a [Fix] Fix nightly tests (#22140)
  • 43654ef [diffusion] CI: improve diffusion comparison benchmark setting for realistic ...
  • 1ad6839 [Feature] Add Reasoning Tokens Usage (#15562)
  • bf984ae Revert "[Bugfix] Temporarily skip TRTLLM attention on (G)B300 (SM103) to avoi...
  • 46bf19c chore: bump flashinfer version to 0.6.7.post2 (#22097)
  • 2476325 [Speculative Decoding] Add FA4-based Spec Support (#21080)
  • 34d5765 [VLM] Chunk-aware ViT encoding with per-image cache and lazy device transfer ...
  • Additional commits viewable in compare view

Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

Bumps [sglang](https://github.com/sgl-project/sglang) from 0.5.2 to 0.5.10.
- [Release notes](https://github.com/sgl-project/sglang/releases)
- [Commits](sgl-project/sglang@v0.5.2...v0.5.10)

---
updated-dependencies:
- dependency-name: sglang
  dependency-version: 0.5.10
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
@dependabot dependabot bot added the dependencies (Pull requests that update a dependency file) and python (Pull requests that update python code) labels on Apr 8, 2026