
Commit f3bec0e

Update UltraRAG 2.1 (#110)

Authored by mssssss123xhd0728 and Haidong Xin
- update README files: add latest news section for UltraRAG 2.0 release
- update README files: add release note for UltraRAG 1.0
- fix: add space before link in release note for UltraRAG 1.0
- update: change README language
- update: add QR codes for WeChat, Feishu, and Discord groups in README
- feat: add vllm_serve_vlm.sh script for serving model
- update: add 'embedding/' to .gitignore to exclude embedding files
- feat: enhance retriever functionality with multimodal support
- feat: refactor multimodal corpus handling and image embedding in Retriever class
- feat: update generation parameters for multimodal support
- fix: update vllm_serve_vlm.sh
- fix: add model_warmup parameter to infinity_kwargs in parameter.yaml
- fix: update output parameters in visrag.yaml and generation.py for consistency
- fix: clean up whitespace
- support: add visrag.yaml for multimodal generation
- fix bug: retriever_init_openai misses faiss_use_gpu parameter
- update README files to include star history chart
- feat: support API models for rag pipeline
- feat: add light deepresearch demo
- fix: remove unused retrieval service initialization from light_deepresearch and webnote demos
- fix: update demo title in light_deepresearch.yaml
- fix: add hands-on tutorial entry to the latest news section in README files
- feat: add retriever configuration and initialization updates
- update: OpenAI API embedding for SentenceTransformers
- feat: enhance retriever pipeline with embedding functionality and batch processing
- fix: st multi-GPU bug
- feat: add progress logging for FAISS indexing process
- update: rename retriever_search_maxsim to retriever_search_colbert_maxsim and add backend validation
- refactor: replace _to_st_device_list method with a local function for device conversion
- update: enhance query embedding process
- fix: improve error handling and fallback for document and query encoding
- update: add query and document task configuration for embedding
- update: clear query_instruction in parameter configuration
- delete: remove test_retriever.yaml configuration file
- update: generation server
- feat: add vllm_shutdown method and integrate into generation pipeline
- update: refactor generation methods and enhance multimodal support
- update: enhance async generation handling and fix multimodal path type
- update: consolidate GPU device handling and improve parameter organization in retriever
- update: refactor GPU ID handling and improve error handling in retriever methods
- update: add VisRAG demo configuration and multimodal QA prompt template
- update: change configuration key for model name retrieval in Generation class
- update: chunking method
- update: add mineru support
- feat: add corpus mineru processing pipeline
- feat: add build_image_corpus tool for PDF to image conversion
- update: add text corpus build method
- fix: remove param mineru_corpus_prefix
- update: support text corpus build for a file folder
- update: support image corpus build for a folder
- update: support mineru for a folder
- update: add simple ui
- update: add terminal log for ui
- update: fix some problems in ui
- feat: support hf backend
- update: ui appearance
- update: add ui exit
- fix: auto-calculate tensor_parallel_size
- fix: infinity multi-GPU embedding bug
- fix: remove unused note
- fix: add st prompt name param
- update: use vLLM to deploy embedding model
- code review: generation server
- feat: support batch generation for hf backend
- fix: replace index when overwrite is true
- code review: generation server
- update: support pictures for case study
- fix: add separator for text case study
- code review: corpus server
- fix: normalize file path
- fix: openai gen topk param
- feat: add chat ui
- code review: retriever server
- fix: initialize self.faiss_index
- fix: update pipeline
- fix: remove some default sampling_params
- fix: add auto shutdown for vllm backend
- update: change some default param sets
- feat: add progress bar when loading corpus
- fix: normalize logging messages
- feat: support bm25 retriever
- code review: retriever server
- update: add full rag pipeline
- update: pipeline in tutorial
- feat: support st and openai for reranker backend
- code review & code format
- feat: add bm25 index & search usage
- update: pipeline v2.1
- fix: import errors in retriever server
- update: add loop pipeline demo
- fix: merge psg tool return bug
- update: add branch pipeline demo
- update: add evaluate results pipeline
- fix: load image corpus bug
- update: add visrag pipeline
- fix: show image bug
- update: default settings
- update: add search pipeline
- update: pipeline
- feat: support TREC evaluation
- update: ir trec eval parameter name
- update: align pipeline with v2.1
- update: cli description
- update: readme
- feat: update deps
- update: readme
- fix: English README

Co-authored-by: Haidong Xin <hdxin2002@gmail.com>
1 parent b2779da commit f3bec0e


69 files changed: +7355 / -1571 lines

.gitignore

Lines changed: 2 additions & 1 deletion
```diff
@@ -18,7 +18,8 @@ index/
 output/
 logs/
 test/
-
+corpora/
+data/chat_sessions/
 # Environment variables and secrets
 .env
 
```

README.md

Lines changed: 80 additions & 48 deletions
```diff
@@ -16,8 +16,6 @@
 |
 <a href="https://ultrarag.openbmb.cn"><b>Tutorial Docs</b></a>
 |
-<a href="https://huggingface.co/datasets/UltraRAG/UltraRAG_Benchmark"><b>Datasets</b></a>
-|
 <a href="https://github.com/OpenBMB/UltraRAG/tree/rag-paper-daily/rag-paper-daily"><b>Daily Papers</b></a>
 |
 <b>简体中文</b>
```
```diff
@@ -30,19 +28,26 @@
 
 *Changelog* 🔥
 
+- [2025.10.22] 🎉 UltraRAG 2.1 officially released: a full upgrade of the RAG Servers, with a reworked document-parsing and knowledge-base construction flow, stronger multimodal RAG capabilities, and support for more backend frameworks.
 - [2025.09.23] Added daily RAG paper sharing, updated every day with the latest frontier RAG work 👉 |[📖 Papers](https://github.com/OpenBMB/UltraRAG/tree/rag-paper-daily/rag-paper-daily)|
+
+<details>
+<summary>Past updates</summary>
+
 - [2025.09.09] Released a tutorial on building a lightweight DeepResearch pipeline locally 👉 |[📺 bilibili](https://www.bilibili.com/video/BV1p8JfziEwM/?spm_id_from=333.337.search-card.all.click)|[📖 Blog](https://github.com/OpenBMB/UltraRAG/blob/page/project/blog/cn/01_build_light_deepresearch.md)|
 - [2025.09.01] Released a video walkthrough of installing UltraRAG and running a complete RAG pipeline 👉 |[📺 bilibili](https://www.bilibili.com/video/BV1B9apz4E7K/?share_source=copy_web&vd_source=7035ae721e76c8149fb74ea7a2432710)|[📖 Blog](https://github.com/OpenBMB/UltraRAG/blob/page/project/blog/cn/00_Installing_and_Running_RAG.md)|
 - [2025.08.28] 🎉 UltraRAG 2.0 released! A brand-new upgrade: high-performance RAG in a few dozen lines of code, letting research focus on new ideas!
 - [2025.01.23] UltraRAG released! Helping LLMs understand and make good use of knowledge bases. The UltraRAG 1.0 code is preserved; see [v1](https://github.com/OpenBMB/UltraRAG/tree/v1).
 
+</details>
+
 ---
 
 ## UltraRAG 2.0: a "RAG experimentation" accelerator for research
 
 Retrieval-augmented generation (RAG) systems are evolving from the early, simple "retrieve + generate" pipeline into complex knowledge systems that combine **adaptive knowledge organization**, **multi-turn reasoning**, and **dynamic retrieval** (typified by *DeepResearch* and *Search-o1*). This growing complexity leaves researchers facing steep engineering costs when **reproducing methods** or **iterating quickly on new ideas**.
 
-To address this pain point, Tsinghua University's [THUNLP](https://nlp.csai.tsinghua.edu.cn/) lab, Northeastern University's [NEUIR](https://neuir.github.io) lab, [OpenBMB](https://www.openbmb.cn/home), and [AI9stars](https://github.com/AI9Stars) jointly released UltraRAG 2.0 (UR-2.0), the first RAG framework designed around the [Model Context Protocol (MCP)](https://modelcontextprotocol.io/overview). With this design, researchers can declare serial, loop, and conditional-branch logic directly in a YAML file, building multi-stage reasoning systems with very little code.
+To address this pain point, Tsinghua University's [THUNLP](https://nlp.csai.tsinghua.edu.cn/) lab, Northeastern University's [NEUIR](https://neuir.github.io) lab, [OpenBMB](https://www.openbmb.cn/home), and [AI9stars](https://github.com/AI9Stars) jointly released UltraRAG 2.0 (UR-2.0), the first RAG framework designed around the [Model Context Protocol (MCP)](https://modelcontextprotocol.io/docs/getting-started/intro). With this design, researchers can declare serial, loop, and conditional-branch logic directly in a YAML file, building multi-stage reasoning systems with very little code.
 
 Its core ideas are:
 - Componentized encapsulation: the core RAG components are wrapped as **standardized, independent MCP Servers**;
```
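The YAML-declared flow control mentioned above can be pictured with a short, hypothetical pipeline file. The server paths, tool names, and keys below are illustrative assumptions, not UltraRAG's actual schema; the tutorial docs define the real syntax.

```yaml
# Hypothetical pipeline sketch (illustrative names only):
# serial steps, a loop, and a conditional branch declared in YAML.
servers:
  retriever: servers/retriever
  generation: servers/generation

pipeline:
  - retriever.retriever_search        # serial step: retrieve once
  - loop:                             # loop: iterate generate -> retrieve
      times: 3
      steps:
        - generation.generate
        - branch:                     # branch on an intermediate result
            router: generation.check_answer
            branches:
              stop: []                # answer found: exit early
              continue:
                - retriever.retriever_search
```

The point of the sketch is that control flow lives in the configuration, not in framework code: swapping a retriever backend or adding a loop iteration is a one-line YAML change.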
```diff
@@ -70,7 +75,7 @@
 
 ## The secret: MCP architecture and native flow control
 
-Across different RAG systems, core capabilities such as retrieval and generation are functionally very similar, yet because each developer implements them differently, modules rarely share a unified interface and are hard to reuse across projects. The [Model Context Protocol (MCP)](https://modelcontextprotocol.io/overview) is an open protocol that standardizes how context is provided to large language models (LLMs); its **Client–Server** architecture lets Server components built against the protocol be reused seamlessly across systems.
+Across different RAG systems, core capabilities such as retrieval and generation are functionally very similar, yet because each developer implements them differently, modules rarely share a unified interface and are hard to reuse across projects. The [Model Context Protocol (MCP)](https://modelcontextprotocol.io/docs/getting-started/intro) is an open protocol that standardizes how context is provided to large language models (LLMs); its **Client–Server** architecture lets Server components built against the protocol be reused seamlessly across systems.
 
 Inspired by this, UltraRAG 2.0 builds on the **MCP architecture**: the retrieval, generation, evaluation, and other core functions of a RAG system are abstracted into independent **MCP Servers**, invoked through standardized function-level **Tool interfaces**. This design keeps modules flexible to extend and lets new ones plug in "hot-swap" style, with no invasive changes to global code. For research, it means new models or algorithms can be adapted with very little code while the overall system stays stable and consistent.
 
```
````diff
@@ -112,60 +117,87 @@ uv pip install -e .
 pip install -e .
 ```
 
+Run the following command to verify the installation:
+
+```shell
+# On success it prints the greeting 'Hello, UltraRAG 2.0!'
+ultrarag run examples/sayhello.yaml
+```
+
 
 [Optional] UR-2.0 ships a rich set of Server components; install only the dependencies your task actually needs:
 
 ```shell
-# To use faiss for vector indexing:
-# Build and install the CPU or GPU version of FAISS for your hardware:
+# Retriever/Reranker Server dependencies:
+# infinity
+uv pip install infinity_emb
+# sentence_transformers
+uv pip install sentence_transformers
+# openai
+uv pip install openai
+# bm25
+uv pip install bm25s
+# faiss (install the CPU or GPU build of FAISS matching your hardware)
 # CPU version:
 uv pip install faiss-cpu
 # GPU version (example: CUDA 12.x)
 uv pip install faiss-gpu-cu12
 # For other CUDA versions install the matching package (e.g. faiss-gpu-cu11 for CUDA 11.x)
-
-# To use infinity_emb for corpus encoding and indexing:
-uv pip install -e ".[infinity_emb]"
-
-# To use the lancedb vector database:
-uv pip install -e ".[lancedb]"
-
-# To deploy models with vLLM:
-uv pip install -e ".[vllm]"
-
-# To use corpus document parsing:
+# websearch
+# exa
+uv pip install exa_py
+# tavily
+uv pip install tavily-python
+# One-shot install:
+uv pip install -e ".[retriever]"
+
+# Generation Server dependencies:
+# vllm
+uv pip install vllm
+# openai
+uv pip install openai
+# hf
+uv pip install transformers
+# One-shot install:
+uv pip install -e ".[generation]"
+
+# Corpus Server dependencies:
+# chonkie
+uv pip install chonkie
+# pymupdf
+uv pip install pymupdf
+# mineru
+uv pip install "mineru[core]"
+# One-shot install:
 uv pip install -e ".[corpus]"
 
-# ====== Install all dependencies (except faiss) ======
+# Install all dependencies
 uv pip install -e ".[all]"
+# Or import the environment with conda:
+conda env create -f environment.yml
 ```
 
-Run the following command to verify the installation:
 
-```shell
-# On success it prints the greeting 'Hello, UltraRAG 2.0!'
-ultrarag run examples/sayhello.yaml
-```
 
 ### Building the runtime environment with Docker
 
 Clone the project to your machine or server with git:
 
 ```shell
-git clone https://github.com/OpenBMB/UltraRAG.git
+git clone https://github.com/OpenBMB/UltraRAG.git --depth 1
 cd UltraRAG
 ```
 
 Build the image:
 
 ```shell
-docker build -t ultrarag:v2.0.0-beta .
+docker build -t ultrarag:v0.2.1 .
 ```
 
 Run an interactive container:
 
 ```shell
-docker run -it --rm --gpus all ultrarag:v2.0.0-beta bash
+docker run -it --rm --gpus all ultrarag:v0.2.1 bash
 ```
 
 Run the following command to verify the installation:
````
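As a quick sanity check for the FAISS options listed in the install hunk above, the snippet below builds a tiny inner-product index when `faiss` is importable and falls back to brute-force NumPy otherwise. It is an illustrative sketch, not part of UltraRAG; it only assumes NumPy plus an optional `faiss-cpu`/`faiss-gpu-cu12` install.

```python
import numpy as np

try:
    import faiss  # whichever FAISS build was installed, if any
    HAVE_FAISS = True
except ImportError:
    HAVE_FAISS = False

def build_searcher(xb: np.ndarray):
    """Return a search(xq, k) function giving top-k inner-product
    neighbor indices over the base vectors xb."""
    if HAVE_FAISS:
        index = faiss.IndexFlatIP(xb.shape[1])   # exact inner-product index
        index.add(xb.astype(np.float32))
        return lambda xq, k: index.search(xq.astype(np.float32), k)[1]
    # Brute-force fallback: exact inner-product top-k with NumPy.
    return lambda xq, k: np.argsort(-(xq @ xb.T), axis=1)[:, :k]

base = np.eye(4, dtype=np.float32)               # 4 orthogonal "documents"
search = build_searcher(base)
query = np.array([[0.0, 1.0, 0.0, 0.0]], dtype=np.float32)
print(search(query, 2)[0][0])                    # nearest neighbor: document 1
```

Both branches return the same neighbor indices for exact inner-product search, so code written against this interface keeps working whether or not FAISS is present.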
````diff
@@ -175,46 +207,41 @@ docker run -it --rm --gpus all ultrarag:v2.0.0-beta bash
 ultrarag run examples/sayhello.yaml
 ```
 
-## Getting Started
+## Quick Start
 
 We provide complete tutorials from beginner to advanced; visit the [tutorial docs](https://ultrarag.openbmb.cn) to get started with UltraRAG 2.0!
 
-Read [Getting Started](https://ultrarag.openbmb.cn/pages/cn/getting_started/quick_start) to learn the UltraRAG workflow. It takes three steps: **(1) compile the pipeline file to generate a parameter configuration; (2) edit the parameter file; (3) run the pipeline file**.
-
-We have also compiled a directory of features commonly used in research; jump directly to the module you need:
-
-- [Encode and index a corpus with a retriever](https://ultrarag.openbmb.cn/pages/cn/tutorials/part_3/emb_and_index)
-- [Deploy a retriever](https://ultrarag.openbmb.cn/pages/cn/tutorials/part_4/deploy_retriever_serve)
-- [Deploy an LLM](https://github.com/OpenBMB/UltraRAG/blob/main/script/vllm_serve.sh)
-- [Reproduce baselines](https://ultrarag.openbmb.cn/pages/cn/tutorials/part_3/reproduction)
-- [Case studies of experiment results](https://ultrarag.openbmb.cn/pages/cn/tutorials/part_4/case_study)
-- [Debugging tutorial](https://ultrarag.openbmb.cn/pages/cn/tutorials/part_4/debug)
-
-
+Read [Quick Start](https://ultrarag.openbmb.cn/pages/cn/getting_started/quick_start) to learn how to run a complete RAG pipeline with UltraRAG.
 
 ## Support
 
-UltraRAG 2.0 works out of the box, with built-in support for the most widely used **public evaluation datasets**, **large-scale corpora**, and **typical baseline methods** in the RAG field, making it easy to reproduce and extend experiments. You can also follow the [data format guide](https://ultrarag.openbmb.cn/pages/cn/tutorials/part_3/prepare_dataset) to add any dataset or corpus of your own. The full [datasets](https://huggingface.co/datasets/UltraRAG/UltraRAG_Benchmark) are available for download at that link.
+UltraRAG 2.0 works out of the box: the most widely used **public evaluation datasets** and **large-scale corpora** in the RAG field are published in sync on [ModelScope](https://modelscope.cn/datasets/UltraRAG/UltraRAG_Benchmark) and [Huggingface](https://huggingface.co/datasets/UltraRAG/UltraRAG_Benchmark).
+They can be downloaded and used directly, plugging into UltraRAG's evaluation pipeline with no extra cleaning or conversion. You can also follow the [data format guide](https://ultrarag.openbmb.cn/pages/cn/develop_guide/dataset) to add any dataset or corpus of your own.
 
 ### 1. Supported datasets
 
 | Task type | Dataset | Original size | Eval sample size |
-|------------------|----------------------|--------------------------------------------|--------------------|
+|:------------------|:----------------------|:--------------------------------------------|:--------------------|
 | QA | [NQ](https://huggingface.co/datasets/google-research-datasets/nq_open) | 3,610 | 1,000 |
 | QA | [TriviaQA](https://nlp.cs.washington.edu/triviaqa/) | 11,313 | 1,000 |
 | QA | [PopQA](https://huggingface.co/datasets/akariasai/PopQA) | 14,267 | 1,000 |
 | QA | [AmbigQA](https://huggingface.co/datasets/sewon/ambig_qa) | 2,002 | 1,000 |
 | QA | [MarcoQA](https://huggingface.co/datasets/microsoft/ms_marco/viewer/v2.1/validation) | 55,636 | 1,000 |
 | QA | [WebQuestions](https://huggingface.co/datasets/stanfordnlp/web_questions) | 2,032 | 1,000 |
+| VQA | [MP-DocVQA](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Test-MP-DocVQA) | 591 | 591 |
+| VQA | [ChartQA](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Test-ChartQA) | 63 | 63 |
+| VQA | [InfoVQA](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Test-InfoVQA) | 718 | 718 |
+| VQA | [PlotQA](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Test-PlotQA) | 863 | 863 |
 | Multi-hop QA | [HotpotQA](https://huggingface.co/datasets/hotpotqa/hotpot_qa) | 7,405 | 1,000 |
 | Multi-hop QA | [2WikiMultiHopQA](https://www.dropbox.com/scl/fi/heid2pkiswhfaqr5g0piw/data.zip?e=2&file_subpath=%2Fdata&rlkey=ira57daau8lxfj022xvk1irju) | 12,576 | 1,000 |
 | Multi-hop QA | [Musique](https://drive.google.com/file/d/1tGdADlNjWFaHLeZZGShh2IRcpO6Lv24h/view) | 2,417 | 1,000 |
 | Multi-hop QA | [Bamboogle](https://huggingface.co/datasets/chiayewken/bamboogle) | 125 | 125 |
 | Multi-hop QA | [StrategyQA](https://huggingface.co/datasets/tasksource/strategy-qa) | 2,290 | 1,000 |
+| Multi-hop VQA | [SlideVQA](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Test-SlideVQA) | 556 | 556 |
 | Multiple-choice | [ARC](https://huggingface.co/datasets/allenai/ai2_arc) | 3,548 | 1,000 |
 | Multiple-choice | [MMLU](https://huggingface.co/datasets/cais/mmlu) | 14,042 | 1,000 |
+| Multiple-choice VQA | [ArXivQA](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Test-ArxivQA) | 816 | 816 |
 | Long-form QA | [ASQA](https://huggingface.co/datasets/din0s/asqa) | 948 | 948 |
 | Fact-verification | [FEVER](https://fever.ai/dataset/fever.html) | 13,332 | 1,000 |
 | Dialogue | [WoW](https://huggingface.co/datasets/facebook/kilt_tasks) | 3,054 | 1,000 |
````
```diff
@@ -225,25 +252,30 @@ UltraRAG 2.0 works out of the box, with built-in support for the most widely used **public
 ### 2. Supported corpora
 
 | Corpus | Documents |
-|------------|--------------|
-| [wiki-2018](https://huggingface.co/datasets/RUC-NLPIR/FlashRAG_datasets/tree/main/retrieval-corpus) | 21,015,324 |
-| wiki-2024 | in preparation, coming soon |
+|:--------------|:--------------|
+| Wiki-2018 | 21,015,324 |
+| Wiki-2024 | 30,463,973 |
+| MP-DocVQA | 741 |
+| ChartQA | 500 |
+| InfoVQA | 459 |
+| PlotQA | 9,593 |
+| SlideVQA | 1,284 |
+| ArXivQA | 8,066 |
 
 ---
 
 ### 3. Supported baselines (continuously updated)
 
 | Baseline | Script |
-|------------|--------------|
-| Vanilla LLM | examples/vanilla.yaml |
+|:------------|:--------------|
+| Vanilla LLM | examples/vanilla_llm.yaml |
 | Vanilla RAG | examples/rag.yaml |
 | [IRCoT](https://arxiv.org/abs/2212.10509) | examples/IRCoT.yaml |
 | [IterRetGen](https://arxiv.org/abs/2305.15294) | examples/IterRetGen.yaml |
 | [RankCoT](https://arxiv.org/abs/2502.17888) | examples/RankCoT.yaml |
 | [R1-searcher](https://arxiv.org/abs/2503.05592) | examples/r1_searcher.yaml |
 | [Search-o1](https://arxiv.org/abs/2501.05366) | examples/search_o1.yaml |
 | [Search-r1](https://arxiv.org/abs/2503.09516) | examples/search_r1.yaml |
-| WebNote | examples/webnote.yaml |
 
 ## Contributing
 
```