
Dataflow v1.0.7 Release Note


@haolpku released this 20 Nov 06:09
· 114 commits to main since this release
b716ba8

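🚀 DataFlow v1.0.7 Changelog (v1.0.6 → v1.0.7)

🔑 Major Feature Updates

📘 New VQA Extraction Capability (based on MinerU 2.5 & Gemini 2.5 Pro)

  • Introduced MinerU 2.5 along with several compatibility fixes, and added a complete VQA Extraction Pipeline that supports two-column layout recognition, long-distance distractor construction, and robust structured parsing.
    Thanks to @fatty-belly, @YalinFeng01, @haolpku, and @wongzhenhao.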

🧪 Enhanced Model Evaluation

  • Added a model_answer / golden_label comparison operator that supports automated evaluation of model outputs, along with accompanying test cases.
    Thanks to @haolpku.

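🧬 Science & Chemistry Pipeline Fixes and Enhancements

  • Fixed multiple issues in the Chemistry Pipeline, standardized SMILES Operator naming, and improved the stability of reasoning-related operators.
    Thanks to @haolpku and @scuuy.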

👨‍💻 New Code Synthesis Pipeline

  • Added Code Synthesis operators and a Pipeline that support automatic code generation, testing, and transformation tasks.
    Thanks to @J1zz.

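📦 Docker Support Officially Available

  • DataFlow now provides official Docker support, which simplifies deployment and improves consistency across environments.
    Thanks to @MOLYHECI.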

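☁️ Full Google VertexAI Serving Integration

📚 PDF2Model & Eval Pipeline Upgrade

  • Improved the performance of the PDF2Model module and refactored the Eval Pipeline to support more flexible data evaluation workflows.
    Thanks to @YalinFeng01.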

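🧩 Other Important Improvements

🔧 Bug Fixes & Compatibility Improvements

  • Fixed the mineru import, operator naming errors, the SFT playground import, mathbook extraction, reasoning generator formatting, and multiple storage and operator issues.
  • Fixed async issues, the Text2SQLPipeline name bug, multi-database support for Text2VecSQL on macOS, and more.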

🧠 Prompt System Enhancements

  • Fixed prompt_restrict, added automatic checking (auto check) and a whitelist mechanism, and improved the robustness of the PromptTemplate Generator.
    Thanks to @SunnyHaze and @wongzhenhao.

📥 Storage & CLI Optimization

  • Added LazyStorage support for saving only the final results, added installation path display (dataflow env), and fixed the deprecated applymap() call.
    Thanks to @SunnyHaze.

🌐 KBC Pipeline Language Support Expansion

Thanks to @ZhaoyangHan04.

🧠 Playground & Pipeline Stability Improvements

Thanks to @HeRunming and @zzy1127.


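🌟 New Demos / Feature Highlights

  • New VQA Extractor Pipeline that automatically extracts long-distance VQA pairs from books
  • Automated API Pipeline test script
  • Protection against malformed JSON in Text2QA
  • PromptTemplatedGenerator with multi-column input
  • Storage final-only saving

👨‍💻 New Contributors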


🚀 DataFlow v1.0.7 Key Feature Updates

🔑 Major Feature Additions

📘 VQA Extraction Revamp (MinerU 2.5 + Gemini 2.5 Pro)

A fully upgraded VQA extraction workflow with MinerU 2.5 adaptation, two-column layout support, long-distance distractor construction, and numerous robustness fixes.
Thanks to @fatty-belly, @YalinFeng01, @haolpku, and @wongzhenhao.

🧪 Enhanced Model Evaluation

Added operators that compare model_answer against golden_label, with accompanying tests for evaluation pipelines.
Thanks to @haolpku.
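
As a rough illustration of what comparing a model_answer column against a golden_label column involves, the sketch below scores rows by normalized exact match. The function names, the normalization rule, and the row layout are assumptions made for this example; they are not DataFlow's actual operator API.

```python
from typing import Iterable, Mapping


def normalize(text: object) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(str(text).lower().split())


def exact_match_accuracy(
    rows: Iterable[Mapping[str, object]],
    answer_key: str = "model_answer",   # column names taken from the release note
    label_key: str = "golden_label",
) -> float:
    """Hypothetical helper: fraction of rows whose normalized answer equals the label."""
    rows = list(rows)
    if not rows:
        return 0.0
    hits = sum(normalize(r[answer_key]) == normalize(r[label_key]) for r in rows)
    return hits / len(rows)


rows = [
    {"model_answer": "Paris", "golden_label": "paris"},
    {"model_answer": "42", "golden_label": "41"},
]
accuracy = exact_match_accuracy(rows)
```

Exact match is only one possible comparison rule; a real evaluation operator may use semantic or task-specific matching instead.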

🧬 Science & Chemistry Pipeline Fixes

Significant improvements to chemistry operators, naming standardization for SMILES ops, and reasoning-related bug fixes.
Thanks to @haolpku and @scuuy.

👨‍💻 New Code Synthesis Pipeline

Introduced new operators and a complete pipeline for code generation and transformation tasks.
Thanks to @J1zz.

📦 Official Docker Support

DataFlow now ships with native Docker support for easier deployment and environment consistency.
Thanks to @MOLYHECI.

☁️ Google VertexAI Serving Integration

Added SDK support, batch predictions, and JSON Schema outputs for VertexAI Serving.
Thanks to @HeRunming, @fatty-belly, and @wongzhenhao.

📚 Upgraded PDF2Model & Eval Pipeline

Substantial improvements to PDF2Model accuracy and a fully revised Eval Pipeline.
Thanks to @YalinFeng01.


🧩 Other Important Improvements

🔧 Bug Fixes & Compatibility Enhancements

  • Fixed issues related to MinerU imports, operator naming errors, SFT playground imports, mathbook extraction, reasoning generator formatting, and multiple storage/operator bugs.
  • Resolved async execution problems, Text2SQLPipeline naming bugs, and added macOS multi-database support for Text2VecSQL.

🧠 Prompt System Enhancements

  • Fixed prompt_restrict, added automatic prompt validation (auto check) and whitelist support, and improved the robustness of the PromptTemplate Generator.
    Thanks to @SunnyHaze and @wongzhenhao.

📥 Storage & CLI Optimization

  • Added LazyStorage to save only final results, exposed installation path via dataflow env, and replaced deprecated applymap().
    Thanks to @SunnyHaze.
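
The idea of saving only final results can be sketched as follows: intermediate step outputs stay in memory, and a single file is written at the end. The class `FinalOnlyStorage` and its methods are hypothetical stand-ins for illustration and do not reflect the real LazyStorage interface.

```python
import json
import tempfile
from pathlib import Path


class FinalOnlyStorage:
    """Hypothetical storage: keeps step outputs in memory and writes only the last one."""

    def __init__(self, out_path):
        self.out_path = Path(out_path)
        self._latest = None
        self.steps_seen = 0

    def write(self, records):
        # Replace the in-memory snapshot instead of writing one file per step.
        self._latest = list(records)
        self.steps_seen += 1

    def read(self):
        if self._latest is None:
            raise RuntimeError("nothing has been written yet")
        return self._latest

    def flush(self):
        # Persist only the most recent snapshot as JSON Lines.
        with self.out_path.open("w", encoding="utf-8") as f:
            for record in self._latest or []:
                f.write(json.dumps(record, ensure_ascii=False) + "\n")
        return self.out_path


out_dir = Path(tempfile.mkdtemp())
storage = FinalOnlyStorage(out_dir / "final.jsonl")
storage.write([{"text": "raw"}])                                      # step 1
storage.write([{"text": r["text"].upper()} for r in storage.read()])  # step 2
final_path = storage.flush()
```

After both steps, the output directory contains only `final.jsonl`; no per-step files are created.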

🌐 KBC Pipeline Language Support Expansion

Thanks to @ZhaoyangHan04.

🧠 Playground & Pipeline Stability Improvements

Thanks to @HeRunming and @zzy1127.


🌟 New Demos / Feature Highlights

  • New VQA Extractor Pipeline — Automatically extracts long-distance VQA pairs from book-style documents.
  • Automated API Pipeline Testing Script — Provides end-to-end validation for API-based pipelines.
  • Text2QA Exception-Safe JSON Handling — Adds robust protection against malformed or unexpected JSON formats.
  • Multi-column PromptTemplatedGenerator — Supports generating prompts from multi-column inputs.
  • Final-Only Storage Mode — Saves only the final results to reduce I/O overhead and storage usage.
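
To make the exception-safe JSON handling item concrete, here is a minimal sketch of defensive parsing for model output that may wrap or truncate a JSON object. The helper `safe_parse_json` is a hypothetical name, and the recovery strategy shown (take the span from the first `{` to the last `}`) is an assumption, not the Text2QA implementation.

```python
import json
from typing import Optional


def safe_parse_json(raw: object) -> Optional[dict]:
    """Return the JSON object embedded in `raw`, or None if nothing parses."""
    if not isinstance(raw, str):
        return None
    # Models often surround the object with prose; keep only the outermost braces.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


good = safe_parse_json('Sure! {"question": "Q", "answer": "A"} Hope this helps.')
truncated = safe_parse_json('{"question": "Q", ')
invalid = safe_parse_json('{"answer": }')
```

Returning `None` instead of raising lets a pipeline skip or retry a bad row without aborting the whole run.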

👨‍💻 New Contributors

Full Changelog: v1.0.6...v1.0.7