Diffulex is an easy-to-develop, extensible inference framework for accelerated dLLM decoding, built on Paged Attention. Its design hides as much of the underlying complexity as possible: KV cache management, parallel-strategy scheduling, and memory optimization. By providing a clean, unified API along with flexible inference-strategy configurations (e.g., D2F, Block Diffusion, Fast-dLLM), Diffulex lets developers focus on model inference logic and business requirements while maintaining production-grade inference performance and resource-utilization efficiency.
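As background on the Paged Attention idea mentioned above, here is a minimal, framework-independent sketch of the bookkeeping a paged KV cache performs: logical token positions are mapped onto fixed-size physical blocks allocated on demand. All names and the block size are illustrative; this is not Diffulex's actual implementation.

```python
# Paged KV-cache bookkeeping sketch (illustration only; not Diffulex's
# actual implementation). Tokens are mapped onto fixed-size physical
# blocks, so KV memory grows on demand instead of being reserved for
# the full max_model_len up front.

BLOCK_SIZE = 16  # tokens per physical KV block (hypothetical value)

class BlockTable:
    def __init__(self, free_blocks):
        self.free_blocks = list(free_blocks)  # pool of physical block IDs
        self.blocks = []                      # logical -> physical mapping
        self.num_tokens = 0

    def append_token(self):
        """Reserve space for one more token, grabbing a new block if needed."""
        if self.num_tokens % BLOCK_SIZE == 0:
            self.blocks.append(self.free_blocks.pop())
        self.num_tokens += 1

    def physical_slot(self, pos):
        """Translate a logical token position to (block_id, offset)."""
        return self.blocks[pos // BLOCK_SIZE], pos % BLOCK_SIZE

table = BlockTable(free_blocks=range(100))
for _ in range(20):              # cache 20 tokens
    table.append_token()

print(len(table.blocks))         # → 2 (two 16-token blocks cover 20 tokens)
print(table.physical_slot(17))   # token 17 lives in the second block, offset 1
```

Because sequences only hold the blocks they actually use, many sequences of varying lengths can share one GPU memory pool, which is what makes settings like `gpu_memory_utilization` in the example below meaningful.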
- 12/22/2025 ✨: We are excited to announce that Diffulex, a Paged Attention-based dLLM accelerated decoding inference framework, is now open source and available to the public!
Although Diffulex aims to be portable across a range of devices, it has been specifically tested and validated on the following NVIDIA GPUs: H200, A100, RTX 4090, and RTX 3090.
The only way to get started is to install from source:
```bash
uv pip install -e .
```

Here's a simple example to get started with Diffulex:
```python
from diffulex import Diffulex, SamplingParams
from transformers import AutoTokenizer

# Initialize the Diffulex engine
model_path = "/path/to/your/model"
llm = Diffulex(
    model_path,
    model_name="fast_dllm_v2",            # or "dream", "llada", etc.
    tensor_parallel_size=1,
    data_parallel_size=1,
    gpu_memory_utilization=0.25,
    max_model_len=2048,
    decoding_strategy="block_diffusion",  # or "d2f", "fast_dllm"
    mask_token_id=151665,                 # model-specific mask token ID
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Set sampling parameters
sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=256,
)

# Prepare prompts
prompts = [
    "Question: What is the capital of France? Answer:",
    "Question: Explain quantum computing in simple terms. Answer:",
]

# Generate responses
outputs = llm.generate(prompts, sampling_params)

# Process results
for output in outputs:
    print(f"Generated text: {output['text']}")
    print(f"Number of diffusion steps: {output['n_diff_steps']}")
    print(f"Token IDs: {output['token_ids']}")
```

For more examples, check out the examples directory.
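Switching inference strategies is a matter of changing the constructor arguments shown above. As a sketch (values are illustrative; only parameters that appear in the example above are used, and the correct `mask_token_id` depends on your model), selecting the D2F strategy with 2-way tensor parallelism might look like:

```python
# Illustrative configuration sketch — same constructor as the quickstart,
# with the decoding strategy and parallelism changed.
llm = Diffulex(
    "/path/to/your/model",
    model_name="dream",          # any supported model name
    tensor_parallel_size=2,      # shard the model across 2 GPUs
    data_parallel_size=1,
    gpu_memory_utilization=0.25,
    max_model_len=2048,
    decoding_strategy="d2f",     # D2F instead of block_diffusion
    mask_token_id=151665,        # model-specific; check your model's tokenizer
)
```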
Check our Diffulex v0.0.1 release plan for upcoming features.
Welcome to join our Discord community for discussions, support, and collaboration!
We would like to express our gratitude to Nano-vLLM, which serves as the primary codebase foundation for this project, and vLLM, from which we draw the core architectural concepts, particularly the Paged Attention mechanism. The initial version of this project was mainly developed by Yijie Jin with supervision from Prof. Zhijie Deng at Shanghai Jiao Tong University.
