DataFlow v1.0.7 Release Notes
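🚀 DataFlow v1.0.7 Changelog (v1.0.6 → v1.0.7)
🔑 Major Feature Updates
📘 New VQA Extraction Capability (Based on MinerU 2.5 & Gemini 2.5 Pro)
- Introduces MinerU 2.5 along with several compatibility fixes, and adds a complete VQA Extraction Pipeline that supports two-column layout recognition, long-distance distractor construction, robust structured parsing, and more.
Thanks to @fatty-belly, @YalinFeng01, @haolpku, and @wongzhenhao.
🧪 Enhanced Model Evaluation
- Adds model_answer / golden_label comparison operators for automated evaluation of model outputs, along with accompanying test cases.
Thanks to @haolpku.
🧬 Science & Chemistry Pipeline Fixes and Enhancements
👨💻 New Code Synthesis Pipeline
- Adds Code Synthesis operators and a Pipeline that supports automatic code generation, testing, and transformation tasks.
Thanks to @J1zz.
📦 Docker Support Officially Available
- DataFlow now provides official Docker support, simplifying deployment and improving cross-environment consistency.
Thanks to @MOLYHECI.
☁️ Full Google VertexAI Serving Integration
- Supports the Google VertexAI SDK, batch prediction, and JSON Schema output format, unifying Serving capabilities.
Thanks to @HeRunming, @fatty-belly, and @wongzhenhao.
📚 PDF2Model & Eval Pipeline Upgrade
- Improves PDF2Model module performance and refactors the Eval Pipeline to support more flexible data evaluation workflows.
Thanks to @YalinFeng01.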
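🧩 Other Important Improvements
🔧 Bug Fixes and Compatibility Improvements
- Fixed the mineru import, operator naming errors, the SFT playground import, mathbook extraction, reasoning generator formatting, and several storage and operator issues.
- Fixed async issues, the Text2SQLPipeline name bug, Text2VecSQL multi-database support on MacOS, and more.
🧠 Prompt System Enhancements
- Provides a prompt_restrict fix, automatic checking (auto check), a new white list mechanism, and improved robustness for the PromptTemplate Generator.
Thanks to @SunnyHaze and @wongzhenhao.
📥 Storage & CLI Optimization
- Adds LazyStorage with support for saving only final results, shows the installation path (`dataflow env`), and fixes the deprecated `applymap()`.
Thanks to @SunnyHaze.
🌐 KBC Pipeline Language Support Expansion
Thanks to @ZhaoyangHan04.
🧠 Playground & Pipeline Stability Improvements
Thanks to @HeRunming and @zzy1127.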
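🌟 New Demos / Feature Highlights
- New VQA Extractor Pipeline that automatically extracts long-distance VQA from books
- Automated API Pipeline test script
- Text2QA protection against malformed JSON
- PromptTemplatedGenerator with multi-column input
- Storage final-only saving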
👨💻 New Contributors
- @fatty-belly – first contribution (PR #321)
- @chen200210 – first contribution (PR #319)
- @J1zz – first contribution (PR #328)
- @Achewwa – first contribution (PR #327)
🚀 DataFlow v1.0.7 Key Feature Updates
🔑 Major Feature Additions
📘 VQA Extraction Revamp (MinerU 2.5 + Gemini 2.5 Pro)
A fully upgraded VQA extraction workflow with MinerU 2.5 adaptation, two-column layout support, long-distance distractor construction, and numerous robustness fixes.
Thanks to @fatty-belly, @YalinFeng01, @haolpku, and @wongzhenhao.
🧪 Enhanced Model Evaluation
Added model_answer vs golden label comparison operators with comprehensive tests for evaluation pipelines.
Thanks to @haolpku.
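As a rough illustration of what comparing a model answer against a golden label involves, here is a minimal sketch. It is not the DataFlow operator itself: the function names `normalize` and `compare_answers` are hypothetical, and the normalized exact-match rule is an assumption made for the example. Only the `model_answer` and `golden_label` field names come from the release notes.

```python
import re

def normalize(text):
    """Lowercase, trim, and collapse whitespace so trivial differences are ignored."""
    return re.sub(r"\s+", " ", str(text).strip().lower())

def compare_answers(rows, answer_key="model_answer", label_key="golden_label"):
    """Return per-row match flags and the overall accuracy (hypothetical helper)."""
    flags = [normalize(r[answer_key]) == normalize(r[label_key]) for r in rows]
    accuracy = sum(flags) / len(flags) if flags else 0.0
    return flags, accuracy

rows = [
    {"model_answer": " Paris ", "golden_label": "paris"},
    {"model_answer": "42", "golden_label": "41"},
]
flags, accuracy = compare_answers(rows)
print(flags, accuracy)
```

Exact match after normalization is the simplest possible rule; the real operators may apply different or more tolerant matching.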
🧬 Science & Chemistry Pipeline Fixes
Significant improvements to chemistry operators, naming standardization for SMILES ops, and reasoning-related bug fixes.
Thanks to @haolpku and @scuuy.
👨💻 New Code Synthesis Pipeline
Introduced new operators and a complete pipeline for code generation and transformation tasks.
Thanks to @J1zz.
📦 Official Docker Support
DataFlow now ships with native Docker support for easier deployment and environment consistency.
Thanks to @MOLYHECI.
☁️ Google VertexAI Serving Integration
Added SDK support, batch predictions, and JSON Schema outputs for VertexAI Serving.
Thanks to @HeRunming, @fatty-belly, and @wongzhenhao.
📚 Upgraded PDF2Model & Eval Pipeline
Substantial improvements to PDF2Model accuracy and a fully revised Eval Pipeline.
Thanks to @YalinFeng01.
🧩 Other Important Improvements
🔧 Bug Fixes & Compatibility Enhancements
- Fixed issues related to MinerU imports, operator naming errors, SFT playground imports, mathbook extraction, reasoning generator formatting, and multiple storage/operator bugs.
- Resolved async execution problems, Text2SQLPipeline naming bugs, and added MacOS multi-database support for Text2VecSQL.
🧠 Prompt System Enhancements
- Added fixes for `prompt_restrict`, automatic prompt validation (auto check), whitelist support, and improved robustness for the PromptTemplate Generator.
Thanks to @SunnyHaze and @wongzhenhao.
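To make the idea of a more robust template-based prompt generator concrete, the sketch below fills a template from several input columns and reports missing columns up front. It is an assumption-laden illustration, not DataFlow's implementation: `build_prompt` is a hypothetical helper, and the `{column}` placeholder syntax is simply Python's `str.format` convention.

```python
from string import Formatter

def build_prompt(template, row):
    """Fill `template` from `row`; raise a clear error listing any missing columns."""
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion).
    needed = [name for _, name, _, _ in Formatter().parse(template) if name]
    missing = [name for name in needed if name not in row]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    # Extra columns in the row are ignored rather than treated as errors.
    return template.format(**{name: row[name] for name in needed})

template = "Context: {context}\nQuestion: {question}"
row = {"context": "DataFlow v1.0.7", "question": "What changed?", "extra": 1}
prompt = build_prompt(template, row)
print(prompt)
```

Checking for missing columns before formatting turns an opaque formatting failure into a message that names the offending fields.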
📥 Storage & CLI Optimization
- Added LazyStorage to save only final results, exposed installation path via `dataflow env`, and replaced deprecated `applymap()`.
Thanks to @SunnyHaze.
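The final-only saving idea can be sketched as follows. This is a hypothetical stand-in, not the actual LazyStorage class: the name `FinalOnlyStorage`, its `write`/`flush` methods, and the JSON Lines output format are all assumptions chosen for the example.

```python
import json
import os
import tempfile

class FinalOnlyStorage:
    """Hypothetical stand-in: keeps step outputs in memory, writes only the last one."""

    def __init__(self, path):
        self.path = path
        self._latest = None
        self.steps = 0

    def write(self, records):
        # Intermediate results replace each other in memory; nothing touches disk here.
        self._latest = list(records)
        self.steps += 1

    def flush(self):
        # A single write at the end of the pipeline.
        if self._latest is None:
            raise RuntimeError("nothing to save")
        with open(self.path, "w", encoding="utf-8") as f:
            for rec in self._latest:
                f.write(json.dumps(rec, ensure_ascii=False) + "\n")
        return self.path

out_path = os.path.join(tempfile.mkdtemp(), "final.jsonl")
storage = FinalOnlyStorage(out_path)
storage.write([{"text": "raw"}])       # step 1 output, memory only
storage.write([{"text": "cleaned"}])   # step 2 output, memory only
exists_before_flush = os.path.exists(out_path)
storage.flush()
with open(out_path, encoding="utf-8") as f:
    saved = [json.loads(line) for line in f]
print(exists_before_flush, saved)
```

The trade-off in a design like this is that intermediate outputs are not available for inspection or resumption after a crash, in exchange for fewer disk writes.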
🌐 KBC Pipeline Language Support Expansion
Thanks to @ZhaoyangHan04.
🧠 Playground & Pipeline Stability Improvements
Thanks to @HeRunming and @zzy1127.
🌟 New Demos / Feature Highlights
- New VQA Extractor Pipeline — Automatically extracts long-distance VQA pairs from book-style documents.
- Automated API Pipeline Testing Script — Provides end-to-end validation for API-based pipelines.
- Text2QA Exception-Safe JSON Handling — Adds robust protection against malformed or unexpected JSON formats.
- Multi-column PromptTemplatedGenerator — Supports generating prompts from multi-column inputs.
- Final-Only Storage Mode — Saves only the final results to reduce I/O overhead and storage usage.
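The exception-safe JSON handling mentioned for Text2QA can be illustrated with a small sketch. The function `parse_qa_json` and the `question`/`answer` keys are hypothetical, chosen only to show the try/except pattern; the actual Text2QA code may validate different fields.

```python
import json

def parse_qa_json(raw):
    """Return a list of {'question', 'answer'} dicts, or [] if the text is unusable."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        # Malformed text or a non-string input: skip instead of crashing the pipeline.
        return []
    if isinstance(data, dict):
        data = [data]
    if not isinstance(data, list):
        return []
    # Keep only items that have the expected shape.
    return [
        item for item in data
        if isinstance(item, dict) and "question" in item and "answer" in item
    ]

print(parse_qa_json('{"question": "q", "answer": "a"}'))
print(parse_qa_json("not json"))
```

Returning an empty list for bad input lets a batch job continue past a single malformed model response.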
👨💻 New Contributors
- @fatty-belly – first contribution (PR #321)
- @chen200210 – first contribution (PR #319)
- @J1zz – first contribution (PR #328)
- @Achewwa – first contribution (PR #327)
What's Changed
- VQA Extraction Demo with MinerU 2.5 by @fatty-belly in #321
- fix: unify NLTK data configuration and correct descriptions by @zzy1127 in #320
- Create auto_test_for_api_pipeline.py by @chen200210 in #319
- Add model_answer and golden label compare operators. Add test cases for model evaluation. by @haolpku in #322
- fix chemistry pipeline by @haolpku in #323
- fix mineru issues by @haolpku in #324
- adapt mineru2.5 by @YalinFeng01 in #326
- add code synthesis operators and pipeline by @J1zz in #328
- [feature] add docker support by @MOLYHECI in #330
- [Fix Log] Fix log of import error of local vllm server by @HeRunming in #331
- fix name bug in reasoning_math_pipeline by @scuuy in #333
- remove rare, gradio and adp platform by @haolpku in #332
- address the async problem by @Achewwa in #327
- [prompt] fix prompt_restrict bug. by @SunnyHaze in #336
- Add optional output keys by @J1zz in #337
- fix BUG-api_pipelines\text2vecsql_pipeline_gen by @yaodongwen in #338
- [VQA Extractor] Fix mineru import issue by @fatty-belly in #340
- Modified Eval Pipeline by @YalinFeng01 in #343
- [register] add white list function to register by @SunnyHaze in #344
- [text2vecsql] support for MacOS when multi sql server by @yaodongwen in #345
- Add long distance VQA distractor demo. by @fatty-belly in #342
- [VQA Extractor] Use gemini-2.5-pro to support two column pages by @fatty-belly in #346
- Fix format problem of Reasoning Question Generator by @HeRunming in #347
- [prompt] add auto check for prompt restrict for Dataflow-Agent by @SunnyHaze in #349
- Pdf2model&EvalPipeline by @YalinFeng01 in #351
- Fix bug for Select Text2SQLPipeline by @TechNomad-ds in #352
- Standardize Smiles Operator Naming by @haolpku in #353
- fix bug in op reasoninganswerjudgerfilter by @scuuy in #354
- [operator] Add PromptTemplatedGenerator for multi-column input case & remove PairedPromptedGenerator by @SunnyHaze in #355
- [Update] Increase Prompt Template Generator robustness by @wongzhenhao in #356
- Text2qa added try except for unexpected json format by @wongzhenhao in #357
- Modify NLTK data configuration and update condor generator by @zzy1127 in #359
- Add regex import to reasoning_question_generator.py by @HeRunming in #358
- [test & debug] Fix bugs in storage and operators encountered during pipeline testing. by @leaderwolfpipi in #360
- fix import bugs in sft playground by @zzy1127 in #361
- add language support for kbc pipeline by @ZhaoyangHan04 in #362
- fix: fix format_response for better API response handling by @CheinTian in #365
- Fix bug of mathbook question extract and update mineru backend in mathbook extract by @HeRunming in #366
- [Add Server] add support of google vertexai sdk by @HeRunming in #368
- [Serving] add json schema to google_vertexai_serving by @fatty-belly in #369
- [Google vertex serving] minor fix by @fatty-belly in #370
- [storage&CLI] switch df.applymap() to df.map to avoid deprecated function & add install path to `dataflow env` by @SunnyHaze in #371
- modify example data by @zzy1127 in #367
- [storage] add lazystorage for only save final results by @SunnyHaze in #373
- [op] hide vqa operator to avoid `cv2` import issue by @SunnyHaze in #377
- Google VertexAI serving batch_prediction by @wongzhenhao in #378
- [update] revise github workflow to fix `no space left` issue in auto test by @SunnyHaze in #380
- [VQA Extractor] New pipeline by @fatty-belly in #379
- Change Folder Names by @fatty-belly in #381
- Google Vertex naming optimized for BigQuery by @wongzhenhao in #382
- [fix] fix naming issue and unexpected file delete issue. by @SunnyHaze in #386
- [debug] fix bugs for MetaSampleEvaluator name issue by @MOLYHECI in #387
- fix kbc playground bug by @ZhaoyangHan04 in #388
- [Fix Playground] Fix errors of filepath and import path by @HeRunming in #389
Full Changelog: v1.0.6...v1.0.7