
[Diffusion] Add native /v1/diffusion/generate endpoint for trajectory metadata#19892

Open
Godmook wants to merge 2 commits into sgl-project:main from Godmook:feat/native-sglang

Conversation


@Godmook Godmook commented Mar 4, 2026

Related Issue: #19827

Motivation

SGLang-D currently only exposes OpenAI-compatible endpoints (/v1/images/generations, /v1/videos). The OpenAI image/video API schema has no standard fields for latents or log_probs, so these values are silently dropped before the HTTP response — even though the pipeline already computes them when requested.

This is a blocking issue for RL training workloads: every RL pipeline runs against the server, and without HTTP-level access to trajectory latents and log probs, the log_prob feature added in #18806 is effectively unusable in production.

This PR introduces a native SGLang-D API at POST /v1/diffusion/generate, following the same pattern as SGLang's native LLM API (/generate) that coexists with the OpenAI-compatible endpoints.

Modifications

New file: python/sglang/multimodal_gen/runtime/entrypoints/diffusion_api.py

Adds a native generation endpoint with two extended metadata flags:

# Request
{
  "prompt": "A cat walking",
  "get_latents": false,   # default: false — no latency impact when unused
  "get_log_probs": false  # default: false — populated after PR #18806 lands
}

# Response
{
  "id": "...",
  "output_b64": "<base64-encoded mp4/png>",
  "output_format": "mp4",
  "peak_memory_mb": 12304.0,
  "inference_time_s": 336.46,
  "trajectory": {
    "latents": "<base64-encoded .npy blob>",
    "latents_shape": [1, 50, 16, 21, 60, 104],
    "latents_dtype": "torch.float32",
    "timesteps": ["<b64>", ...],
    "log_probs": null,       # populated after PR #18806
    "log_probs_shape": null
  }
}

Tensors are serialized as base64-encoded NumPy .npy blobs. Client deserialization:

import base64, io, numpy as np
arr = np.load(io.BytesIO(base64.b64decode(response["trajectory"]["latents"])))
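
Since the .npy format stores shape and dtype alongside the data, the encoding round-trips exactly and the latents_shape/latents_dtype fields in the response are convenience metadata rather than required for decoding. A minimal sketch of both directions (the encode_array/decode_array helper names are illustrative, not from the PR):

```python
import base64
import io

import numpy as np


def encode_array(arr: np.ndarray) -> str:
    """Serialize an array to .npy bytes, then base64-encode for JSON transport."""
    buf = io.BytesIO()
    np.save(buf, arr)
    return base64.b64encode(buf.getvalue()).decode("ascii")


def decode_array(b64: str) -> np.ndarray:
    """Inverse of encode_array: base64 -> .npy bytes -> ndarray."""
    return np.load(io.BytesIO(base64.b64decode(b64)))


# Round-trip preserves values, shape, and dtype bit-for-bit.
latents = np.random.randn(1, 4, 8).astype(np.float32)
restored = decode_array(encode_array(latents))
assert np.array_equal(restored, latents) and restored.dtype == latents.dtype
```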

Modified file: python/sglang/multimodal_gen/runtime/entrypoints/http_server.py

Added 3 lines in create_app() to register the new router; no existing lines were modified.

from sglang.multimodal_gen.runtime.entrypoints import diffusion_api
app.include_router(diffusion_api.router)

Design notes

  • Zero impact on existing endpoints: /v1/images/generations, /v1/videos, /v1/meshes are completely untouched.
  • No latency overhead when flags are false: trajectory data is never collected or serialized unless explicitly requested.
  • get_log_probs structure is ready: the field is accepted and returns null today. Once PR #18806 ([diffusion] feat: add rollout log_prob with flow-matching SDE/CPS support) adds trajectory_log_probs to OutputBatch, enabling it requires uncommenting ~3 lines.
  • All pipeline plumbing (build_sampling_params, prepare_request, process_generation_batch) is reused from existing code.
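
The "no overhead when flags are false" behavior can be sketched as a small gating helper (names are hypothetical, not the PR's actual code):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiffusionGenerateRequest:
    # Mirrors the request JSON above; both flags default to off,
    # so the default path collects no trajectory data at all.
    prompt: str
    get_latents: bool = False
    get_log_probs: bool = False


def build_trajectory(req: DiffusionGenerateRequest,
                     latents_b64: Optional[str]) -> Optional[dict]:
    # Returns None ("trajectory": null in the response) unless a flag
    # is set, so nothing is collected or serialized by default.
    if not (req.get_latents or req.get_log_probs):
        return None
    return {
        "latents": latents_b64 if req.get_latents else None,
        "log_probs": None,  # populated once PR #18806 lands
    }
```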

Accuracy Tests

This PR adds a new endpoint and does not modify any model forward code, kernels, or pipeline logic, so no accuracy regression is expected.

Functional test on NVIDIA A40, Wan-AI/Wan2.1-T2V-1.3B-Diffusers:

# Basic generation (no trajectory)
curl -X POST http://localhost:30000/v1/diffusion/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "A cat walking"}'
# -> HTTP 200, output_b64 present, trajectory: null

# With latents
curl -X POST http://localhost:30000/v1/diffusion/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "A cat walking", "get_latents": true}'
# -> HTTP 200, trajectory.latents present (base64 npy)

Server log confirms two successful 200 OK responses:

[2026-03-04 08:55:48] INFO: "POST /v1/diffusion/generate HTTP/1.1" 200 OK
[2026-03-04 09:00:44] INFO: "POST /v1/diffusion/generate HTTP/1.1" 200 OK

Benchmarking and Profiling

No performance impact on the existing OpenAI endpoints. The new endpoint is only invoked when explicitly called.

For the native endpoint itself, observed on A40 with Wan2.1-T2V-1.3B (81 frames, 50 steps):

Run                          Denoising  Decoding  Total
1 (cold, get_latents=false)  305.6 s    17.4 s    336.5 s
2 (warm, get_latents=true)   261.0 s    12.1 s    282.4 s

Peak GPU memory: 12.02 GB (peak allocated 8.67 GB). When get_latents=false, trajectory serialization is skipped entirely — no overhead.

Checklist

@github-actions github-actions bot added the diffusion SGLang Diffusion label Mar 4, 2026
