🔥 Open-dLLM: Open Diffusion Large Language Models

🌍 Languages: English | 中文

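👉 TL;DR: Open-dLLM is the most open release of a diffusion large language model to date: we open-source the pretraining, evaluation, and inference code along with the model weights.

This repository introduces Open-dCoder, the code-generation variant of Open-dLLM.

💻 Code | 📖 Blog | 🤗 Model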

🎥 Demo

Generating a quicksort algorithm with Open-dCoder (0.5B)

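✨ Highlights

🏋️ Complete pretraining pipeline + open-source dataset
⚡ Inference scripts: run sampling and generation with minimal setup
📊 Evaluation suite: HumanEval, MBPP, and code infilling (supports lm-eval-harness + custom metrics)
📦 Model weights (uploaded to Hugging Face)
🤝 Transparent configs for full reproducibility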

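Why Open-dLLM?

Most diffusion LLM repositories today (for example LLaDA and Dream) release only inference code and weights, which limits reproducibility. Open-dLLM is the first diffusion LLM to open-source the full stack:

👉 From raw data → training → weights → evaluation → inference, the entire pipeline lives in one repository.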

🔎 Openness of diffusion LLMs compared

Project	Data	Training code	Inference	Evaluation	Weights
Open-dLLM / Open-dCoder (ours)	✅	✅	✅	✅	✅
LLaDA	❌	❌	✅	⚠️ Partial	✅
Dream	❌	❌	✅	⚠️ Partial	✅
Gemini-Diffusion	❌	❌	❌	❌	❌ (API only)
Seed Diffusion	❌	❌	❌	❌	❌ (API only)
Mercury	❌	❌	❌	❌	❌ (API only)

✅ = fully open · ❌ = not provided · ⚠️ = partial/limited

⚙️ Installation

We recommend micromamba for environment management (conda also works):

micromamba install -c nvidia/label/cuda-12.3.0 cuda-toolkit -y
pip install ninja

# Install torch 2.5.0 (cu121 build)
pip install torch==2.5.0 --index-url https://download.pytorch.org/whl/cu121

pip install "flash-attn==2.7.4.post1" \
  --extra-index-url https://github.com/Dao-AILab/flash-attention/releases/download

# Quote specifiers containing ">" so the shell does not treat them as redirections
pip install --upgrade --no-cache-dir \
  tensordict torchdata byte-flux "triton>=3.1.0" \
  transformers==4.54.1 accelerate datasets peft hf-transfer \
  codetiming hydra-core pandas "pyarrow>=15.0.0" pylatexenc \
  wandb ninja liger-kernel==0.5.8 \
  pytest yapf py-spy pyext pre-commit ruff packaging

pip install -e .
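
Because several packages above are pinned to exact versions, it can help to confirm what actually ended up in the environment. The following is a minimal sketch using only the standard library; the `PINNED` table and the `check_pins` helper are illustrative names that are not part of this repository, and the versions are copied from the commands above.

```python
from importlib.metadata import PackageNotFoundError, version

# Exact pins copied from the install commands above (illustrative table).
PINNED = {
    "torch": "2.5.0",
    "flash-attn": "2.7.4.post1",
    "transformers": "4.54.1",
    "liger-kernel": "0.5.8",
}


def check_pins(pins):
    """Return a {package: status} report comparing installed versions to pins."""
    report = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        # Ignore local build tags such as "+cu121" when comparing.
        base = installed.split("+", 1)[0]
        report[name] = "ok" if base == expected else f"mismatch (found {installed})"
    return report


if __name__ == "__main__":
    for name, status in check_pins(PINNED).items():
        print(f"{name}: {status}")
```

A "mismatch" line is not necessarily fatal, but it is a reasonable first thing to look at if imports or kernels fail later.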

🚀 Quick start: sampling

from transformers import AutoTokenizer
from veomni.models.transformers.qwen2.modeling_qwen2 import Qwen2ForCausalLM
from veomni.models.transformers.qwen2.generation_utils import MDMGenerationConfig
import torch

model_id = "fredzzp/open-dcoder-0.5B"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = Qwen2ForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).to(device).eval()

# Input prompt
prompt = "Write a quick sort algorithm in Python."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# Generation config
gen_cfg = MDMGenerationConfig(max_new_tokens=128, steps=200, temperature=0.7)

with torch.no_grad():
    outputs = model.diffusion_generate(inputs=input_ids, generation_config=gen_cfg)

print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))

👉 For more logging and file output:

python sample.py
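
To give some intuition for the `max_new_tokens` and `steps` arguments in the quick start, here is a toy sketch of how a masked-diffusion sampler can fill in a continuation over several denoising steps. It is not the logic behind `diffusion_generate`: the random `toy_denoiser`, its tiny vocabulary, and the confidence-based reveal order are all assumptions made for illustration.

```python
import random

MASK = "<mask>"  # placeholder for a not-yet-generated position


def toy_denoiser(sequence, rng):
    """Hypothetical stand-in for a model: one (token, confidence) guess per position."""
    vocab = ["def", "sort", "(", ")", ":", "return"]
    return [(rng.choice(vocab), rng.random()) for _ in sequence]


def toy_diffusion_generate(prompt, max_new_tokens, steps, seed=0):
    """Start from a fully masked continuation and reveal tokens step by step."""
    rng = random.Random(seed)
    seq = list(prompt) + [MASK] * max_new_tokens
    for step in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break  # everything is already filled in
        # Spread the remaining masked positions over the remaining steps (ceil division).
        remaining_steps = steps - step
        n_reveal = -(-len(masked) // remaining_steps)
        proposals = toy_denoiser(seq, rng)
        # Commit the positions the denoiser is most confident about.
        most_confident = sorted(masked, key=lambda i: proposals[i][1], reverse=True)
        for i in most_confident[:n_reveal]:
            seq[i] = proposals[i][0]
    return seq


if __name__ == "__main__":
    print(toy_diffusion_generate(["write", "quicksort"], max_new_tokens=8, steps=3))
```

In this toy, fewer steps means more positions are committed at once, while a step count at or above the number of new tokens reveals one position per step.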

📊 Benchmarks

We release the complete evaluation suite, covering standard code generation tasks and code infilling tasks:

HumanEval / HumanEval+
MBPP / MBPP+
HumanEval-Infill
SantaCoder-FIM

The results tables are identical to those in the main README and are not repeated here.
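
HumanEval and MBPP are conventionally scored with pass@k. As a reference point, the sketch below implements the standard unbiased pass@k estimator for a single problem; it assumes the evaluation suite reports pass@k in the usual way, and the `pass_at_k` helper is an illustrative name rather than a function from this repository.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated, c: number that pass the tests, k: budget.
    Computes 1 - C(n - c, k) / C(n, k), the chance that at least one of k
    samples drawn without replacement from the n is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 10 samples for one problem, 3 of which pass the unit tests.
    print(pass_at_k(n=10, c=3, k=1))
    print(pass_at_k(n=10, c=3, k=5))
```

The benchmark score is then the mean of this quantity over all problems.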

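🏋️ Pretraining

Data: FineCode, an open-source, high-quality code corpus
Initialization: continued pretraining from Qwen2.5-Coder, converting autoregressive → diffusion
Objective: Masked Diffusion Model (MDM), with the mask ratio sampled uniformly from [0, 1]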
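
The masking side of the objective can be sketched in a few lines of plain Python. This is an illustration of the description above, not the repository's training code: `mask_tokens` and `mask_id` are hypothetical names, and masking each token independently with probability t is an assumption about how the sampled ratio is applied. In MDLM- and LLaDA-style objectives the cross-entropy loss is then computed only on the masked positions, often weighted by 1/t; whether this repository uses that weighting is not stated here.

```python
import random


def mask_tokens(token_ids, mask_id, rng):
    """Corrupt a sequence for one masked-diffusion training example.

    Draws a mask ratio t uniformly from [0, 1) and replaces each token
    independently with mask_id with probability t.
    Returns (noisy_ids, is_masked, t).
    """
    t = rng.random()
    noisy_ids, is_masked = [], []
    for tok in token_ids:
        hit = rng.random() < t
        noisy_ids.append(mask_id if hit else tok)
        is_masked.append(hit)
    return noisy_ids, is_masked, t


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = list(range(100, 112))  # stand-in token ids
    noisy, flags, t = mask_tokens(tokens, mask_id=-1, rng=rng)
    print(f"t = {t:.2f}, masked {sum(flags)} of {len(tokens)} tokens")
    print(noisy)
```

A model trained this way sees everything from lightly corrupted to almost fully masked sequences, which is what lets it start generation from an all-mask continuation.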

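🙏 Acknowledgements

This project builds on the following work:

Frameworks and tools: VeOmni, lm-eval-harness
Open-source dLLMs: LLaDA, Dream
Pioneering efforts: Gemini-Diffusion, Seed Diffusion, Mercury
Foundational research: MD4, MDLM, DPLM

We hope Open-dLLM gives back to the community and advances research on diffusion large language models.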

📚 Citation

If you use Open-dLLM or Open-dCoder in your research, please cite:

@misc{opendllm2025,
  title        = {Open-dLLM: Open Diffusion Large Language Models},
  author       = {Fred Zhangzhi Peng and Shuibai Zhang and Alex Tong and others},
  year         = {2025},
  howpublished = {\url{https://github.com/pengzhangzhi/Open-dLLM}},
  note         = {Blog: \url{https://oval-shell-31c.notion.site/Open-Diffusion-Large-Language-Model-25e03bf6136480b7a4ebe3d53be9f68a?pvs=74}, 
                  Model: \url{https://huggingface.co/fredzzp/open-dcoder-0.5B}}
}
