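🚀 DataFlow v1.0.7 Changelog (v1.0.6 → v1.0.7)

🔑 Major Feature Updates

📘 New VQA Extraction Capability (Based on MinerU 2.5 & Gemini 2.5 Pro)

Introduces MinerU 2.5 along with several compatibility fixes, and adds a complete VQA Extraction Pipeline that supports two-column layout recognition, long-distance distractor construction, and robust structured parsing.
Thanks to @fatty-belly, @YalinFeng01, @haolpku, and @wongzhenhao.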

🧪 Enhanced Model Evaluation

Added model_answer / golden_label comparison operators for automated evaluation of model outputs, along with accompanying test cases.
Thanks to @haolpku.

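🧬 Science & Chemistry Pipeline Fixes and Enhancements

Fixed multiple issues in the Chemistry Pipeline, standardized SMILES Operator naming, and improved the stability of reasoning-related operators.
Thanks to @haolpku and @scuuy.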

👨‍💻 New Code Synthesis Pipeline

Added Code Synthesis operators and a Pipeline that support automatic code generation, testing, and transformation tasks.
Thanks to @J1zz.

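📦 Docker Support Officially Released

DataFlow now provides official Docker support, simplifying deployment and improving consistency across environments.
Thanks to @MOLYHECI.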

☁️ Full Google VertexAI Serving Integration

Supports the Google VertexAI SDK, batch prediction, and JSON Schema output format, unifying Serving capabilities.
Thanks to @HeRunming, @fatty-belly, and @wongzhenhao.

📚 PDF2Model & Eval Pipeline Upgrade

Improved the performance of the PDF2Model module and refactored the Eval Pipeline to support more flexible data evaluation workflows.
Thanks to @YalinFeng01.

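🧩 Other Important Improvements

🔧 Bug Fixes and Compatibility Improvements

Fixed the mineru import, operator naming errors, the SFT playground import, mathbook extraction, the reasoning generator format, and multiple storage and operator issues.
Fixed async issues and the Text2SQLPipeline name bug, and added macOS multi-database support for Text2VecSQL.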

🧠 Prompt System Enhancements

Fixed prompt_restrict, added automatic checking (auto check) and a white list mechanism, and improved the robustness of the PromptTemplate Generator.
Thanks to @SunnyHaze and @wongzhenhao.

📥 Storage & CLI Optimization

Added LazyStorage, which supports saving only the final results, added installation path display (dataflow env), and fixed the deprecated applymap().
Thanks to @SunnyHaze.

🌐 KBC Pipeline Language Support Expansion

Thanks to @ZhaoyangHan04.

🧠 Playground & Pipeline Stability Improvements

Thanks to @HeRunming and @zzy1127.

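🌟 New Demos / Feature Highlights

New VQA Extractor Pipeline that automatically extracts long-distance VQA from books
Automated API Pipeline test script
Text2QA protection against malformed JSON
PromptTemplatedGenerator with multi-column input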
Storage final-only saving

👨‍💻 New Contributors

@fatty-belly – first contribution (PR #321)
@chen200210 – first contribution (PR #319)
@J1zz – first contribution (PR #328)
@Achewwa – first contribution (PR #327)

🚀 DataFlow v1.0.7 Key Feature Updates

🔑 Major Feature Additions

📘 VQA Extraction Revamp (MinerU 2.5 + Gemini 2.5 Pro)

A fully upgraded VQA extraction workflow with MinerU 2.5 adaptation, two-column layout support, long-distance distractor construction, and numerous robustness fixes.
Thanks to @fatty-belly, @YalinFeng01, @haolpku, and @wongzhenhao.

🧪 Enhanced Model Evaluation

Added model_answer vs golden label comparison operators with comprehensive tests for evaluation pipelines.
Thanks to @haolpku.
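
To illustrate the kind of comparison described above, here is a minimal sketch that checks a `model_answer` field against a `golden_label` field and reports accuracy. It is not DataFlow's operator: the `normalize` and `compare_answers` helpers are hypothetical names, and the exact-match-after-normalization rule is an assumption chosen for illustration.

```python
def normalize(text) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(str(text).lower().split())


def compare_answers(rows, answer_key="model_answer", label_key="golden_label"):
    """Hypothetical helper: return per-row match flags and the overall accuracy."""
    matches = [normalize(r[answer_key]) == normalize(r[label_key]) for r in rows]
    accuracy = sum(matches) / len(matches) if matches else 0.0
    return matches, accuracy


rows = [
    {"model_answer": " Paris ", "golden_label": "paris"},
    {"model_answer": "42", "golden_label": "41"},
]
matches, accuracy = compare_answers(rows)
print(matches, accuracy)  # [True, False] 0.5
```

A real evaluation operator would likely need task-specific matching (for example numeric tolerance or semantic equivalence) rather than plain string equality.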

🧬 Science & Chemistry Pipeline Fixes

Significant improvements to chemistry operators, naming standardization for SMILES ops, and reasoning-related bug fixes.
Thanks to @haolpku and @scuuy.

👨‍💻 New Code Synthesis Pipeline

Introduced new operators and a complete pipeline for code generation and transformation tasks.
Thanks to @J1zz.

📦 Official Docker Support

DataFlow now ships with native Docker support for easier deployment and environment consistency.
Thanks to @MOLYHECI.

☁️ Google VertexAI Serving Integration

Added SDK support, batch predictions, and JSON Schema outputs for VertexAI Serving.
Thanks to @HeRunming, @fatty-belly, and @wongzhenhao.

📚 Upgraded PDF2Model & Eval Pipeline

Substantial improvements to PDF2Model accuracy and a fully revised Eval Pipeline.
Thanks to @YalinFeng01.

🧩 Other Important Improvements

🔧 Bug Fixes & Compatibility Enhancements

Fixed issues related to MinerU imports, operator naming errors, SFT playground imports, mathbook extraction, reasoning generator formatting, and multiple storage/operator bugs.
Resolved async execution problems and Text2SQLPipeline naming bugs, and added macOS multi-database support for Text2VecSQL.

🧠 Prompt System Enhancements

Added fixes for prompt_restrict, automatic prompt validation (auto check), whitelist support, and improved robustness for the PromptTemplate Generator.
Thanks to @SunnyHaze and @wongzhenhao.
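
As a rough illustration of what template robustness can mean for multi-column inputs, the sketch below fills a prompt template from several columns and fails early when a row lacks a required column. The `render_prompts` function is a hypothetical stand-in written with the standard library, not the PromptTemplate Generator's actual interface.

```python
from string import Formatter


def render_prompts(template: str, rows):
    """Hypothetical helper: fill `template` from each row, validating columns first."""
    # Collect the placeholder names used in the template, e.g. {question}.
    fields = [name for _, name, _, _ in Formatter().parse(template) if name]
    prompts = []
    for row in rows:
        missing = [f for f in fields if f not in row]
        if missing:
            raise KeyError(f"row is missing columns: {missing}")
        prompts.append(template.format(**{f: row[f] for f in fields}))
    return prompts


template = "Context: {context}\nQuestion: {question}"
rows = [{"context": "DataFlow v1.0.7", "question": "What changed?", "extra": 1}]
prompts = render_prompts(template, rows)
print(prompts[0])
```

Validating the columns before formatting turns a silent or cryptic failure into an explicit error that names the missing columns.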

📥 Storage & CLI Optimization

Added LazyStorage to save only final results, exposed the installation path via dataflow env, and replaced the deprecated df.applymap() with df.map().
Thanks to @SunnyHaze.
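
For context on the `applymap()` change: pandas 2.1 added `DataFrame.map` and deprecated `DataFrame.applymap`. The sketch below shows the element-wise call in a way that also works on older pandas. The sample frame and the `elementwise` name are illustrative only and are not taken from DataFlow's storage code.

```python
import pandas as pd

df = pd.DataFrame({"question": [" a ", "b "], "answer": [" x", " y "]})

# DataFrame.map is available from pandas 2.1; fall back to applymap on older versions.
elementwise = df.map if hasattr(pd.DataFrame, "map") else df.applymap

# Apply str.strip to every cell of the frame.
cleaned = elementwise(str.strip)
print(cleaned.to_dict("list"))  # {'question': ['a', 'b'], 'answer': ['x', 'y']}
```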

🌐 KBC Pipeline Language Support Expansion

Thanks to @ZhaoyangHan04.

🧠 Playground & Pipeline Stability Improvements

Thanks to @HeRunming and @zzy1127.

🌟 New Demos / Feature Highlights

New VQA Extractor Pipeline — Automatically extracts long-distance VQA pairs from book-style documents.
Automated API Pipeline Testing Script — Provides end-to-end validation for API-based pipelines.
Text2QA Exception-Safe JSON Handling — Adds robust protection against malformed or unexpected JSON formats.
Multi-column PromptTemplatedGenerator — Supports generating prompts from multi-column inputs.
Final-Only Storage Mode — Saves only the final results to reduce I/O overhead and storage usage.
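
The exception-safe JSON handling mentioned in the list above can be pictured with a small sketch like the following. It assumes the model returns QA pairs as JSON objects with `question` and `answer` keys. Both the `parse_qa_json` helper and that schema are hypothetical, not the Text2QA implementation.

```python
import json


def parse_qa_json(raw):
    """Hypothetical helper: return a list of QA dicts, or [] if the input is unusable."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        # Malformed text or a non-string input: skip instead of crashing the pipeline.
        return []
    if isinstance(data, dict):
        data = [data]
    if not isinstance(data, list):
        return []
    # Keep only well-formed entries that carry both expected keys.
    return [d for d in data if isinstance(d, dict) and "question" in d and "answer" in d]


print(parse_qa_json('{"question": "q", "answer": "a"}'))
print(parse_qa_json("not json"))
```

Returning an empty list lets a batch pipeline continue past one bad response. Logging the rejected text would be a natural addition.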

👨‍💻 New Contributors

@fatty-belly – first contribution (PR #321)
@chen200210 – first contribution (PR #319)
@J1zz – first contribution (PR #328)
@Achewwa – first contribution (PR #327)

What's Changed

VQA Extraction Demo with MinerU 2.5 by @fatty-belly in #321
fix: unify NLTK data configuration and correct the description by @zzy1127 in #320
Create auto_test_for_api_pipeline.py by @chen200210 in #319
Add model_answer and golden label compare operators. Add test cases for model evaluation. by @haolpku in #322
fix chemistry pipeline by @haolpku in #323
fix mineru issues by @haolpku in #324
adapt mineru2.5 by @YalinFeng01 in #326
add code synthesis operators and pipeline by @J1zz in #328
[feature] add docker support by @MOLYHECI in #330
[Fix Log] Fix log of import error of local vllm server by @HeRunming in #331
fix name bug in reasoning_math_pipeline by @scuuy in #333
remove rare, gradio and adp platform by @haolpku in #332
address the async problem by @Achewwa in #327
[prompt] fix prompt_restrict bug. by @SunnyHaze in #336
Add optional output keys by @J1zz in #337
fix BUG-api_pipelines\text2vecsql_pipeline_gen by @yaodongwen in #338
[VQA Extractor] Fix mineru import issue by @fatty-belly in #340
Modified Eval Pipeline by @YalinFeng01 in #343
[register] add white list function to register by @SunnyHaze in #344
[text2vecsql] support for MacOS when multi sql server by @yaodongwen in #345
Add long distance VQA distractor demo. by @fatty-belly in #342
[VQA Extractor] Use gemini-2.5-pro to support two column pages by @fatty-belly in #346
Fix format problem of Reasoning Question Generator by @HeRunming in #347
[prompt] add auto check for prompt restrict for Dataflow-Agent by @SunnyHaze in #349
Pdf2model&EvalPipeline by @YalinFeng01 in #351
Fix bug for Select Text2SQLPipeline by @TechNomad-ds in #352
Standardize Smiles Operator Naming by @haolpku in #353
fix bug in op reasoninganswerjudgerfilter by @scuuy in #354
[operator] Add PromptTemplatedGenerator for multi-column input case & remove PairedPromptedGenerator by @SunnyHaze in #355
[Update] Increase Prompt Template Generator robustness by @wongzhenhao in #356
Text2qa added try except for unexpected json format by @wongzhenhao in #357
Update NLTK data configuration and modify condor generator by @zzy1127 in #359
Add regex import to reasoning_question_generator.py by @HeRunming in #358
[test & debug] Fix bugs in storage and operators encountered during pipeline testing. by @leaderwolfpipi in #360
fix import bugs in sft playground by @zzy1127 in #361
add language support for kbc pipeline by @ZhaoyangHan04 in #362
fix: fix format_response for better API response handling by @CheinTian in #365
Fix bug of mathbook question extract and update mineru backend in mathbook extract by @HeRunming in #366
[Add Server] add support of google vertexai sdk by @HeRunming in #368
[Serving] add json schema to google_vertexai_serving by @fatty-belly in #369
[Google vertex serving] minor fix by @fatty-belly in #370
[storage&CLI] switch df.applymap() to df.map to avoid deprecated function & add install path to dataflow env by @SunnyHaze in #371
modify example data by @zzy1127 in #367
[storage] add lazystorage for only save final results by @SunnyHaze in #373
[op] hide vqa operator to avoid cv2 import issue by @SunnyHaze in #377
Google VertexAI serving batch_prediction by @wongzhenhao in #378
[update] revise github workflow to fix no space left issue in auto test by @SunnyHaze in #380
[VQA Extractor] New pipeline by @fatty-belly in #379
Change Folder Names by @fatty-belly in #381
Google Vertex naming optimized for BigQuery by @wongzhenhao in #382
[fix] fix naming issue and unexpected file delete issue. by @SunnyHaze in #386
[debug] fix bugs for MetaSampleEvaluator name issue by @MOLYHECI in #387
fix kbc playground bug by @ZhaoyangHan04 in #388
[Fix Playground] Fix errors of filepath and import path by @HeRunming in #389

New Contributors

@fatty-belly made their first contribution in #321
@chen200210 made their first contribution in #319
@J1zz made their first contribution in #328
@Achewwa made their first contribution in #327

Full Changelog: v1.0.6...v1.0.7

Dataflow v1.0.7 Release Note