📄 Paper: Fast-dLLM | 💻 Code: github.com/NVlabs/Fast-dLLM
Resources and examples for running inference and evaluation on LLaDA and Dream with Fast-dLLM.
```
# Pipeline modules for Fast-dLLM
dllm/pipelines/fastdllm
├── __init__.py
├── dream/
│   ├── __init__.py
│   ├── models/
│   │   ├── configuration_dream.py  # Fast-dLLM Dream model configuration
│   │   └── modeling_dream.py       # Fast-dLLM Dream model architecture
│   ├── sampler.py                  # Fast-dLLM Dream inference module
│   └── eval.py                     # Fast-dLLM Dream evaluation module
└── llada/
    ├── __init__.py
    ├── models/
    │   ├── configuration_llada.py  # Fast-dLLM LLaDA model configuration
    │   └── modeling_llada.py       # Fast-dLLM LLaDA model architecture
    ├── sampler.py                  # Fast-dLLM LLaDA inference module
    └── eval.py                     # Fast-dLLM LLaDA evaluation module
```
```
# Example entry points for inference and evaluation
examples/fastdllm
├── README.md     # Documentation (you are here)
├── dream/
│   ├── sample.py # Fast-dLLM Dream inference example
│   └── eval.sh   # Fast-dLLM Dream evaluation example
└── llada/
    ├── sample.py # Fast-dLLM LLaDA inference example
    └── eval.sh   # Fast-dLLM LLaDA evaluation example
```
The implementation matches the original Fast-dLLM: with identical hyperparameters, it yields identical outputs.
Sampling with the Fast-dLLM LLaDA sampler (e.g. prefix cache + confidence-threshold decoding):

```shell
# Use --use_cache (none/prefix/dual) for the cache scheme, --threshold for confidence-based sampling
python examples/fastdllm/llada/sample.py --model_name_or_path "GSAI-ML/LLaDA-8B-Instruct" --use_cache prefix --threshold 0.9
```

Sampling with the Fast-dLLM Dream sampler (e.g. prefix cache + confidence-threshold decoding):
```shell
# Use --use_cache (none/prefix/dual) for the cache scheme, --alg (entropy/confidence_threshold) and --threshold for decoding
python examples/fastdllm/dream/sample.py --model_name_or_path "Dream-org/Dream-v0-Instruct-7B" --use_cache prefix --alg confidence_threshold --threshold 0.9
```

Read the (optional) Evaluation setup section before running evaluation.
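The `--threshold` option in both samplers enables Fast-dLLM's confidence-aware parallel decoding: within a block, every masked position whose top-token probability clears the threshold is decoded in the same step, and at least the single most confident position is decoded so each step makes progress. A minimal sketch of one such step on toy distributions (illustrative only; `parallel_decode_step` and its signature are not from the actual codebase):

```python
def parallel_decode_step(probs, threshold=0.9):
    """One confidence-aware parallel decoding step (toy sketch).

    probs: per-position token distributions (dicts token -> prob) for the
           currently masked positions of a block.
    Returns {position_index: token} for every position decoded this step.
    """
    # Most likely token and its confidence at each masked position.
    best = [max(p.items(), key=lambda kv: kv[1]) for p in probs]
    # Decode every position whose confidence clears the threshold ...
    decoded = {i: tok for i, (tok, conf) in enumerate(best) if conf > threshold}
    if not decoded:
        # ... but always decode at least the single most confident position,
        # so the step is guaranteed to make progress.
        i = max(range(len(best)), key=lambda i: best[i][1])
        decoded = {i: best[i][0]}
    return decoded

# Three masked positions: positions 0 and 2 are confident enough, 1 is not.
probs = [
    {"the": 0.95, "a": 0.05},
    {"cat": 0.60, "dog": 0.40},
    {"sat": 0.92, "ran": 0.08},
]
print(parallel_decode_step(probs, threshold=0.9))  # {0: 'the', 2: 'sat'}
```

A higher threshold decodes fewer tokens per step (closer to one-by-one decoding) but is more conservative; a lower threshold decodes more tokens in parallel at some risk to quality.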
For example, to evaluate LLaDA-8B-Instruct or Dream-v0-Base-7B on GSM8K with 4 GPUs, run:

```shell
# Use --model_args to pass Fast-dLLM sampling options (use_cache, threshold, etc.).
accelerate launch --num_processes 4 \
    dllm/pipelines/fastdllm/llada/eval.py \
    --tasks "gsm8k" \
    --num_fewshot 5 \
    --model "fastdllm_llada" \
    --apply_chat_template \
    --model_args "pretrained=GSAI-ML/LLaDA-8B-Instruct,use_cache=prefix,threshold=0.9,max_new_tokens=256,steps=256,block_size=32,suppress_tokens=[],begin_suppress_tokens=[]"
```
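The `--model_args` value is a single comma-separated string of `key=value` sampler options. A toy sketch of how such a string maps to a dict, to illustrate the format only (the real parsing is handled by lm-evaluation-harness, which also casts values to proper types; here every value stays a plain string):

```python
def parse_model_args(model_args: str) -> dict:
    """Toy parser for a comma-separated key=value string (format illustration
    only; not the actual lm-evaluation-harness implementation)."""
    parsed = {}
    for pair in model_args.split(","):
        key, _, value = pair.partition("=")
        parsed[key.strip()] = value.strip()
    return parsed

args = parse_model_args(
    "pretrained=GSAI-ML/LLaDA-8B-Instruct,use_cache=prefix,threshold=0.9,block_size=32"
)
print(args["use_cache"], args["threshold"])  # prefix 0.9
```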
```shell
accelerate launch --num_processes 4 \
    dllm/pipelines/fastdllm/dream/eval.py \
    --tasks "gsm8k" \
    --num_fewshot 5 \
    --model "fastdllm_dream" \
    --apply_chat_template \
    --model_args "pretrained=Dream-org/Dream-v0-Base-7B,use_cache=prefix,max_new_tokens=256,steps=256,block_size=32,alg=confidence_threshold,threshold=0.9,dtype=bfloat16,add_bos_token=True"
```

To evaluate LLaDA-8B-Instruct and Dream-v0-Base-7B with the Fast-dLLM sampler across all cache settings and benchmarks, run:
```shell
bash examples/fastdllm/llada/eval.sh --model_name_or_path "GSAI-ML/LLaDA-8B-Instruct" --instruct True --num_gpu 1
bash examples/fastdllm/dream/eval.sh --model_name_or_path "Dream-org/Dream-v0-Base-7B" --instruct False --num_gpu 1
```

Results marked *Reproduced* were evaluated with our framework, while results marked *Official* come from the original paper. All evaluation settings follow the Fast-dLLM repository, with minor modifications: we add support for the MBPP and Minerva-Math benchmarks, which are not provided in the original repo.
Each cell below reports Acc / Tok/s, with the throughput multiplier over the same-length baseline in parentheses.

**Len = 256**

| Benchmark | Source | Baseline | +Cache | +Parallel | +Cache & Parallel |
|---|---|---|---|---|---|
| GSM8K | Official | 79.3 / 6.7 (1.0×) | 79.5 / 21.2 (3.2×) | 79.2 / 16.5 (2.5×) | 78.5 / 54.4 (8.1×) |
| GSM8K | Reproduced | 78.0 / 8.1 (1.0×) | 78.2 / 25.6 (3.2×) | 78.9 / 18.4 (2.3×) | 78.0 / 52.8 (6.5×) |
| MATH | Official | 33.5 / 9.1 (1.0×) | 33.3 / 23.7 (2.6×) | 33.4 / 24.8 (2.7×) | 33.2 / 51.7 (5.7×) |
| MATH | Reproduced | 38.3 / 9.7 (1.0×) | 37.6 / 26.4 (2.7×) | 38.6 / 19.6 (2.0×) | 37.5 / 49.0 (5.0×) |
| HumanEval | Official | 41.5 / 30.5 (1.0×) | 42.7 / 40.7 (1.3×) | 43.9 / 101.5 (3.3×) | 43.3 / 114.1 (3.7×) |
| HumanEval | Reproduced | 38.4 / 18.8 (1.0×) | 36.0 / 27.6 (1.5×) | 39.6 / 53.0 (2.8×) | 36.0 / 68.1 (3.6×) |
| MBPP | Official | 29.4 / 6.0 (1.0×) | 29.6 / 17.0 (2.8×) | 28.4 / 24.8 (4.1×) | 28.2 / 44.8 (7.5×) |
| MBPP | Reproduced | 36.4 / 9.3 (1.0×) | 38.0 / 26.2 (2.8×) | 29.0 / 17.6 (1.9×) | 37.8 / 44.7 (4.8×) |

**Len = 512**

| Benchmark | Source | Baseline | +Cache | +Parallel | +Cache & Parallel |
|---|---|---|---|---|---|
| GSM8K | Official | 77.5 / 3.2 (1.0×) | 77.0 / 10.4 (3.3×) | 77.6 / 18.6 (5.8×) | 77.2 / 35.3 (11.0×) |
| GSM8K | Reproduced | 81.1 / 6.7 (1.0×) | 76.0 / 19.9 (3.0×) | 77.6 / 21.8 (3.3×) | 76.6 / 51.7 (7.8×) |
| MATH | Official | 37.2 / 8.0 (1.0×) | 36.2 / 19.7 (2.5×) | 36.8 / 23.8 (3.0×) | 36.0 / 47.1 (5.9×) |
| MATH | Reproduced | 42.4 / 7.4 (1.0×) | 41.9 / 21.1 (2.9×) | 42.5 / 19.8 (2.7×) | 41.8 / 44.8 (6.1×) |
| HumanEval | Official | 43.9 / 18.4 (1.0×) | 45.7 / 29.3 (1.6×) | 43.3 / 57.1 (3.1×) | 44.5 / 73.7 (4.0×) |
| HumanEval | Reproduced | 48.2 / 13.0 (1.0×) | 41.5 / 23.3 (1.8×) | 50.6 / 36.3 (2.8×) | 41.5 / 55.8 (4.3×) |
| MBPP | Official | 14.8 / 4.3 (1.0×) | 13.4 / 10.1 (2.3×) | 15.0 / 22.3 (5.1×) | 13.8 / 39.5 (9.2×) |
| MBPP | Reproduced | 32.2 / 7.7 (1.0×) | 22.0 / 20.8 (2.7×) | 7.6 / 21.0 (2.7×) | 21.4 / 43.9 (5.7×) |

Table 1: Evaluation results of LLaDA-8B-Instruct with Fast-dLLM.
**Len = 256**

| Benchmark | Source | Baseline | +Cache | +Parallel | +Cache & Parallel |
|---|---|---|---|---|---|
| GSM8K | Official | 75.0 / 9.1 (1.0×) | 74.3 / 32.5 (3.6×) | 74.2 / 14.2 (1.6×) | 74.8 / 48.2 (5.3×) |
| GSM8K | Reproduced | 75.4 / 9.0 (1.0×) | 75.0 / 32.9 (3.7×) | 72.6 / 12.1 (1.4×) | 74.2 / 42.2 (4.7×) |
| MATH | Official | 38.4 / 11.4 (1.0×) | 36.8 / 34.3 (3.0×) | 37.9 / 27.3 (2.4×) | 37.6 / 66.8 (5.9×) |
| MATH | Reproduced | 31.5 / 25.1 (1.0×) | 33.3 / 36.8 (1.5×) | 23.5 / 52.1 (2.1×) | 31.1 / 76.7 (3.1×) |
| HumanEval | Official | 49.4 / 23.3 (1.0×) | 53.7 / 35.2 (1.5×) | 49.4 / 45.6 (2.0×) | 54.3 / 62.0 (2.8×) |
| HumanEval | Reproduced | 57.9 / 14.0 (1.0×) | 53.7 / 33.8 (2.4×) | 51.2 / 21.0 (1.5×) | 53.1 / 43.0 (3.1×) |
| MBPP | Official | 56.6 / 11.2 (1.0×) | 53.2 / 34.5 (3.1×) | 53.8 / 31.8 (2.8×) | 56.4 / 76.0 (6.8×) |
| MBPP | Reproduced | 55.6 / 9.9 (1.0×) | 53.8 / 32.1 (3.3×) | 53.6 / 24.1 (2.5×) | 56.0 / 61.8 (6.3×) |

**Len = 512**

| Benchmark | Source | Baseline | +Cache | +Parallel | +Cache & Parallel |
|---|---|---|---|---|---|
| GSM8K | Official | 76.0 / 7.7 (1.0×) | 74.3 / 25.6 (3.3×) | 73.4 / 14.6 (1.9×) | 74.0 / 42.9 (5.6×) |
| GSM8K | Reproduced | 75.7 / 7.6 (1.0×) | 73.8 / 25.6 (3.4×) | 72.7 / 11.8 (1.6×) | 74.5 / 33.7 (4.4×) |
| MATH | Official | 39.8 / 9.6 (1.0×) | 38.0 / 26.8 (2.8×) | 39.5 / 31.6 (3.2×) | 39.3 / 63.3 (6.5×) |
| MATH | Reproduced | 39.2 / 15.8 (1.0×) | 39.2 / 28.1 (1.8×) | 32.0 / 25.5 (1.6×) | 38.9 / 46.3 (2.9×) |
| HumanEval | Official | 54.3 / 16.3 (1.0×) | 54.9 / 27.8 (1.7×) | 51.8 / 29.8 (1.8×) | 54.3 / 52.8 (3.2×) |
| HumanEval | Reproduced | 54.9 / 10.4 (1.0×) | 54.9 / 26.3 (2.5×) | 50.6 / 16.4 (1.6×) | 54.3 / 37.4 (3.6×) |
| MBPP | Official | 55.6 / 9.4 (1.0×) | 53.8 / 26.7 (2.8×) | 55.4 / 37.6 (4.0×) | 55.2 / 73.6 (7.8×) |
| MBPP | Reproduced | 56.0 / 4.6 (1.0×) | 52.6 / 25.0 (5.4×) | 52.8 / 29.5 (6.4×) | 54.4 / 61.1 (13.3×) |

Table 2: Evaluation results of Dream-v0-Base-7B with Fast-dLLM.
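The multiplier in parentheses next to each Tok/s figure is decoding throughput relative to the Baseline column at the same generation length. A quick sketch of the arithmetic, checked against the official GSM8K numbers for LLaDA-8B-Instruct at Len = 256 in Table 1:

```python
def speedup(tok_s: float, baseline_tok_s: float) -> float:
    """Throughput multiplier over the baseline, rounded to one decimal
    as reported in the tables above."""
    return round(tok_s / baseline_tok_s, 1)

# Baseline throughput is 6.7 Tok/s; +Cache reaches 21.2 Tok/s and
# +Cache & Parallel reaches 54.4 Tok/s.
print(speedup(21.2, 6.7))  # 3.2
print(speedup(54.4, 6.7))  # 8.1
```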