49 commits
29da03c
image preprocessor for ocr
forBlank Sep 23, 2025
bbf8f81
modeling ppocrvl
forBlank Sep 23, 2025
a37d543
modeling siglip for ocr
forBlank Sep 23, 2025
88463b9
ocr workflow & only DP dataflow
forBlank Sep 23, 2025
3586730
ocr 8k training yaml
forBlank Sep 23, 2025
f7ca4e2
update sigliip & GELUTanh & format correction
forBlank Oct 11, 2025
114a20c
update dataflow
forBlank Oct 11, 2025
755519a
constant learning rate scheduler
forBlank Oct 15, 2025
1e2f709
cross entropy by hand
forBlank Oct 15, 2025
7a5b260
update ocr 8k training yaml
forBlank Oct 15, 2025
4c94b84
make lint & remove unused code
forBlank Oct 15, 2025
b5bd996
update ocr 8k training yaml
forBlank Oct 17, 2025
4f81675
update image augmentation
forBlank Oct 17, 2025
d867546
update config
forBlank Oct 17, 2025
697ee0a
make lint & remove unused code
forBlank Oct 18, 2025
3bc4696
remove tp&sp & no fused mlp&attn for hf ckpt
forBlank Oct 22, 2025
713aee6
support padding&packing_size setting & freeze vit
forBlank Oct 22, 2025
524da01
Unified model name
forBlank Oct 22, 2025
89447d7
update doc & requirement
forBlank Oct 22, 2025
066b62d
update doc
forBlank Oct 22, 2025
e74c0c7
update doc for paddleocr_vl_sft
forBlank Oct 23, 2025
6ff11f1
use "PaddleOCR-VL-0.9B" for paddleocr_vl_sft doc
forBlank Oct 23, 2025
a044f4e
update packing setting
forBlank Oct 23, 2025
5cb1282
update Bengali dataset for paddleocr_vl_sft doc
forBlank Oct 23, 2025
07afabf
Merge remote-tracking branch 'upstream/develop' into ocr_vl
forBlank Oct 23, 2025
b9fe7bf
update Bengali dataset link for paddleocr doc&yaml
forBlank Oct 23, 2025
4ce36a4
support paddleocr vl sft with single GPU
forBlank Oct 23, 2025
d915e92
make lint
forBlank Oct 23, 2025
3680e0a
fix paddleocr_vl_sft with single GPU & update doc
forBlank Oct 24, 2025
a9257c8
Merge branch 'develop' into ocr_vl
forBlank Oct 24, 2025
9b60be4
update cli for downloading hf model
forBlank Oct 24, 2025
887cd1f
Merge remote-tracking branch 'origin/ocr_vl' into ocr_vl
forBlank Oct 24, 2025
2780691
update cli model path for paddleocr_vl_sft
forBlank Oct 24, 2025
ca45067
Update the paddleocr_vl config for single GPU
forBlank Oct 24, 2025
5fe0560
Update the paddleocr_vl config & doc
forBlank Oct 25, 2025
6f18125
Merge remote-tracking branch 'upstream/develop' into ocr_vl
forBlank Oct 25, 2025
479974e
Update the paddleocr_vl config & doc
forBlank Oct 25, 2025
c93e078
Update the paddleocr_vl doc
forBlank Oct 27, 2025
55f01d9
Update the paddleocr_vl doc
forBlank Oct 27, 2025
448235b
Merge remote-tracking branch 'upstream/develop' into ocr_vl
forBlank Oct 27, 2025
b73e20b
Fix typo errors in paddleocr_vl doc
forBlank Oct 28, 2025
5e09651
Merge remote-tracking branch 'upstream/develop' into ocr_vl
forBlank Oct 28, 2025
34241e8
Update the paddleocr_vl doc for single GPU
forBlank Oct 28, 2025
9d48804
Update the paddleocr_vl doc
forBlank Nov 5, 2025
115a9ad
Update the huggingface line in paddleocr_vl doc
forBlank Nov 10, 2025
f842f09
update config for paddleocr_vl
forBlank Nov 10, 2025
3ab7f39
Merge remote-tracking branch 'upstream/develop' into ocr_vl
forBlank Nov 10, 2025
78dbab8
Update markdown links in paddleocr_vl doc
forBlank Nov 14, 2025
1d9a30d
Update links of modelcard in paddleocr_vl doc
forBlank Nov 14, 2025
40 changes: 18 additions & 22 deletions docs/paddleocr_vl_sft.md
@@ -6,19 +6,15 @@ English | [简体中文](./paddleocr_vl_sft_zh.md)

PaddleOCR-VL is a SOTA, resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic-resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. The model efficiently supports 109 languages and excels at recognizing complex elements (e.g., text, tables, formulas, and charts) while maintaining minimal resource consumption. In comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, is strongly competitive with top-tier VLMs, and delivers fast inference. These strengths make it well suited to practical deployment in real-world scenarios.

While PaddleOCR-VL-0.9B excels in common scenarios, its performance often faces limitations in many specific or complex business applications. For instance:

- Domain-Specific Applications
  - Finance & Accounting: Recognizing documents such as invoices, receipts, bank statements, and financial reports
  - Healthcare: Processing medical records, lab reports, handwritten prescriptions, and pharmaceutical instructions
  - Legal Sector: Identifying text in contracts, legal instruments, court filings, and certificates.
- Non-Standard Text and Typography
  - Handwriting Recognition: Deciphering handwritten forms, notes, letters, and questionnaires.
  - Stylized & Artistic Fonts: Recognizing text on posters, billboards, product packaging, and menus.
  - Historical & Archival Documents: Processing ancient manuscripts, old newspapers, and historical archives.
- Task-Specific Structured Output
  - Table Recognition & Structuring: Converting tables within images into structured formats like Excel, CSV, or JSON.
  - Mathematical Formula Recognition: Identifying mathematical equations in textbooks or research papers and exporting them into formats like LaTeX.
While PaddleOCR-VL-0.9B performs excellently in common scenarios, its recognition capabilities may face bottlenecks in specific or complex business applications. For example:

- Non-standard text and symbols
  - Artistic or stylized fonts: Recognizing text on posters, billboards, product packaging, cards/documents, and seals.
  - Specialized symbols: For example, symbols used in organic chemistry.
- Specific tasks and output formats
  - Fine-grained text localization and grounding outputs.
  - Flowchart recognition with structured output.
  - Data for specific low-resource languages, such as Tibetan and Bengali.

This is where SFT (Supervised Fine-Tuning) becomes necessary to enhance the model’s accuracy and robustness for these specialized tasks.

@@ -49,20 +45,20 @@ python -m pip install opencv-python-headless
python -m pip install numpy==1.26.4
```

For more installation methods, please refer to the [ERNIEKit Installation Guide]((./erniekit.md#2-installation)).
For more installation methods, please refer to the [ERNIEKit Installation Guide](./erniekit.md#2-installation).

## 3. Model and Dataset Preparation

### 3.1. Model Preparation
The PaddleOCR-VL-0.9B model can be downloaded from [huggingface](https://huggingface.co/PaddlePaddle/PaddleOCR-VL/tree/main/PaddleOCR-VL-0.9B) or [modelscope](https://modelscope.cn/models/PaddlePaddle/PaddleOCR-VL/files).
The PaddleOCR-VL-0.9B model can be downloaded from [huggingface](https://huggingface.co/PaddlePaddle/PaddleOCR-VL) or [modelscope](https://modelscope.cn/models/PaddlePaddle/PaddleOCR-VL).

```bash
huggingface-cli download PaddlePaddle/PaddleOCR-VL --local-dir PaddlePaddle/PaddleOCR-VL
```
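
If the `huggingface-cli` tool is unavailable, the same download can be done from Python via `huggingface_hub`. A minimal sketch, assuming the package is installed; the repo id and local directory simply mirror the CLI command above:

```python
from huggingface_hub import snapshot_download

# Fetch the full PaddleOCR-VL repo (which contains the PaddleOCR-VL-0.9B weights)
# into the same local directory used by the CLI command above.
snapshot_download(
    repo_id="PaddlePaddle/PaddleOCR-VL",
    local_dir="PaddlePaddle/PaddleOCR-VL",
)
```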

### 3.2. Dataset Preparation

For the training dataset format, please refer to [SFT VL Dataset Format]((./datasets.md#sft-vl-dataset)). Required fields are as follows:
For the training dataset format, please refer to [SFT VL Dataset Format](./datasets.md#sft-vl-dataset). Required fields are as follows:
* `text_info`: The list of text data; each element contains a `text` and a `tag`
  * `text`: The text content from the User question or System response
  * `tag`: The mask tag (`no_mask` = include in training, `mask` = exclude)
@@ -75,7 +71,7 @@ Notes:
* Each training sample is in JSON format, with multiple samples separated by newlines
* Please ensure that `mask` items and `no_mask` items alternate in the `text_info`
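
As a minimal sketch of the format described above (the image path and response text below are placeholders, not part of the provided dataset), one training line can be assembled and appended to a JSONL file like this:

```python
import json

# One SFT sample: the image is bound to the first text item via matched_text_index,
# and the conversation alternates mask (prompt, excluded from the loss) with
# no_mask (response, included in training).
sample = {
    "image_info": [
        {"matched_text_index": 0, "image_url": "./assets/example.png"},
    ],
    "text_info": [
        {"text": "OCR:", "tag": "mask"},
        {"text": "recognized text goes here", "tag": "no_mask"},
    ],
}

# Sanity check: mask and no_mask items must alternate.
tags = [item["tag"] for item in sample["text_info"]]
assert all(a != b for a, b in zip(tags, tags[1:])), "mask/no_mask items must alternate"

# JSONL: one JSON object per line.
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```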

For your convenience, we also provide a quick-start [Bengali training dataset]((https://paddleformers.bj.bcebos.com/datasets/ocr_vl_sft-train_Bengali.jsonl)) for fine-tuning PaddleOCR-VL-0.9B on Bengali recognition. Download it using the following command:
For your convenience, we also provide a quick-start [Bengali training dataset](https://paddleformers.bj.bcebos.com/datasets/ocr_vl_sft-train_Bengali.jsonl) for fine-tuning PaddleOCR-VL-0.9B on Bengali recognition. Download it using the following command:

```bash
wget https://paddleformers.bj.bcebos.com/datasets/ocr_vl_sft-train_Bengali.jsonl
@@ -91,7 +87,7 @@ Bengali training example:
```json
{
"image_info": [
{"matched_text_index": 0, "image_url": "./assets/table_example.jps"},
{"matched_text_index": 0, "image_url": "./assets/bengali_train_example.png"},
],
"text_info": [
{"text": "OCR:", "tag": "mask"},
@@ -194,7 +190,7 @@ cp PaddlePaddle/PaddleOCR-VL/inference.yml PaddleOCR-VL-SFT-Bengali
```

### 7.3. Inference Dataset Preparation
We provide a [Bengali test dataset]((https://paddleformers.bj.bcebos.com/datasets/ocr_vl_sft-test_Bengali.jsonl)) that can be used for inference to observe the fine-tuning results. Download it using the following command:
We provide a [Bengali test dataset](https://paddleformers.bj.bcebos.com/datasets/ocr_vl_sft-test_Bengali.jsonl) that can be used for inference to observe the fine-tuning results. Download it using the following command:

```bash
wget https://paddleformers.bj.bcebos.com/datasets/ocr_vl_sft-test_Bengali.jsonl
@@ -236,7 +232,7 @@ Table Data: OTSL format
```json
{
"image_info": [
{"matched_text_index": 0, "image_url": "./assets/table_example.jps"},
{"matched_text_index": 0, "image_url": "./assets/table_example.png"},
],
"text_info": [
{"text": "Table Recognition:", "tag": "mask"},
@@ -254,7 +250,7 @@ Formula Data: LaTeX format
```json
{
"image_info": [
{"matched_text_index": 0, "image_url": "./assets/formula_example.jps"},
{"matched_text_index": 0, "image_url": "./assets/formula_example.jpg"},
],
"text_info": [
{"text": "Formula Recognition:", "tag": "mask"},
@@ -281,7 +277,7 @@ Chart Data: Markdown format
}
```

### Common Issues
### 8.2. Common Issues

If you encounter the following problem while using the above command, it is generally due to a conflict between cv2 and the environment. This can be resolved by installing `opencv-python-headless`.

31 changes: 12 additions & 19 deletions docs/paddleocr_vl_sft_zh.md
@@ -7,21 +7,14 @@ PaddleOCR-VL 是一款为文档解析任务量身打造的、性能顶尖 (SOTA)

这款模型不仅能高效支持 109 种语言,还擅长识别文本、表格、公式、图表等复杂元素,并始终保持极低的资源占用。在多个权威的公开及内部基准测试中,PaddleOCR-VL 的页面级文档解析与元素级识别性能均达到了业界顶尖水平。其性能远超现有方案,面对顶级视觉语言模型也极具竞争力,且推理速度飞快。这些杰出特性使其成为在真实场景中落地部署的理想选择。

虽然 PaddleOCR-VL-0.9B 在常见场景下表现出色,但在许多特定或复杂的业务场景中,其性能会遇到瓶颈。例如:
- 特定行业与专业领域
- 金融与财会领域:识别发票、收据、银行对账单、财务报表等
- 医疗领域:识别病历、化验单、医生手写处方、药品说明书等
- 法律领域:识别合同、法律文书、法庭文件、证书等

- 非标准化的文本与字体
- 手写体识别:识别手写的表单、笔记、信件、问卷调查等
- 艺术字体与设计字体:识别海报、广告牌、产品包装、菜单上的艺术字体等
- 古籍与历史文献:识别古代手稿、旧报纸、历史档案等

虽然 PaddleOCR-VL-0.9B 在常见场景下表现出色,但在特定或复杂的业务场景中,其识别效果可能会遇到瓶颈。例如:
- 非标准化的文本与符号
- 艺术设计字体:识别海报、广告牌、产品包装、卡证、印章字体等
- 特殊符号:有机化学符号识别
- 特定任务与输出格式
- 表格识别与结构化输出:将图像中的表格转换为 Excel、CSV 或 JSON 格式
- 数学公式识别:识别教科书、论文中的数学公式,并输出为 LaTeX 等格式

- 细粒度文本定位和 Grounding 输出
- 流程图识别和结构化输出
- 特定的小语种数据:藏语、孟加拉语……

这时,就需要通过 SFT (Supervised Fine-Tuning) 来提升模型的准确性和鲁棒性。

@@ -58,7 +51,7 @@ python -m pip install numpy==1.26.4

### 3.1. 模型准备

在 [huggingface](https://huggingface.co/PaddlePaddle/PaddleOCR-VL/tree/main/PaddleOCR-VL-0.9B) 或者 [modelscope](https://modelscope.cn/models/PaddlePaddle/PaddleOCR-VL/files) 可以下载 PaddleOCR-VL-0.9B 模型。
在 [huggingface](https://huggingface.co/PaddlePaddle/PaddleOCR-VL) 或者 [modelscope](https://modelscope.cn/models/PaddlePaddle/PaddleOCR-VL) 可以下载 PaddleOCR-VL-0.9B 模型。

```bash
huggingface-cli download PaddlePaddle/PaddleOCR-VL --local-dir PaddlePaddle/PaddleOCR-VL
@@ -93,7 +86,7 @@ wget https://paddleformers.bj.bcebos.com/datasets/ocr_vl_sft-train_Bengali.jsonl
```json
{
"image_info": [
{"matched_text_index": 0, "image_url": "./assets/table_example.jps"},
{"matched_text_index": 0, "image_url": "./assets/bengali_train_example.png"},
],
"text_info": [
{"text": "OCR:", "tag": "mask"},
@@ -237,7 +230,7 @@ paddleocr doc_parser -i https://paddle-model-ecology.bj.bcebos.com/PPOCRVL/datas
```json
{
"image_info": [
{"matched_text_index": 0, "image_url": "./assets/table_example.jps"},
{"matched_text_index": 0, "image_url": "./assets/table_example.png"},
],
"text_info": [
{"text": "Table Recognition:", "tag": "mask"},
@@ -255,7 +248,7 @@ paddleocr doc_parser -i https://paddle-model-ecology.bj.bcebos.com/PPOCRVL/datas
```json
{
"image_info": [
{"matched_text_index": 0, "image_url": "./assets/formula_example.jps"},
{"matched_text_index": 0, "image_url": "./assets/formula_example.jpg"},
],
"text_info": [
{"text": "Formula Recognition:", "tag": "mask"},
@@ -282,7 +275,7 @@ paddleocr doc_parser -i https://paddle-model-ecology.bj.bcebos.com/PPOCRVL/datas
}
```

### 常见问题
### 8.2. 常见问题

如果你使用上述命令过程中遇到下面的问题,一般是因为cv2和环境的冲突,可以通过安装 `opencv-python-headless` 来解决问题

14 changes: 9 additions & 5 deletions ernie/configuration_paddleocr_vl.py
@@ -50,9 +50,12 @@ def __init__(
rms_norm_eps=1e-6,
use_cache=False,
use_flash_attention=False,
use_sparse_flash_attn=False,
recompute=False,
recompute_granularity="core_attn",
recompute_use_reentrant=True,
recompute_use_reentrant=False,
use_rmsnorm=True,
fuse_rms_norm=True,
pad_token_id=0,
bos_token_id=1,
eos_token_id=2,
@@ -66,6 +69,7 @@
hidden_dropout_prob=0.0,
compression_ratio: float = 1.0,
num_key_value_heads=None,
use_sparse_head_and_loss_fn=False,
max_sequence_length=None,
tie_word_embeddings=False,
vision_config=None,
@@ -91,9 +95,12 @@
self.rms_norm_eps = rms_norm_eps
self.use_cache = use_cache
self.use_flash_attention = use_flash_attention
self.use_sparse_flash_attn = use_sparse_flash_attn
self.recompute = recompute
self.recompute_granularity = recompute_granularity
self.recompute_use_reentrant = recompute_use_reentrant
self.use_rmsnorm = use_rmsnorm
self.fuse_rms_norm = fuse_rms_norm
self.pad_token_id = pad_token_id
self.bos_token_id = bos_token_id
self.eos_token_id = eos_token_id
@@ -113,6 +120,7 @@
self.hidden_dropout_prob = hidden_dropout_prob
self.compression_ratio = compression_ratio
self.num_key_value_heads = num_key_value_heads
self.use_sparse_head_and_loss_fn = use_sparse_head_and_loss_fn
self.max_sequence_length = max_sequence_length
self.rope_scaling = rope_scaling
if self.rope_scaling is not None and "type" in self.rope_scaling:
@@ -123,17 +131,13 @@
super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs)

# Currently, these configuration items are hard-coded
self.fuse_rms_norm = True
self.use_sparse_flash_attn = True
self.use_var_len_flash_attn = False
self.scale_qk_coeff = 1.0
self.fuse_softmax_mask = False
self.use_sparse_head_and_loss_fn = False
self.use_recompute_loss_fn = False
self.use_fused_head_and_loss_fn = False
self.fuse_linear = False
self.token_balance_seqlen = False
self.use_rmsnorm = True
self.fuse_ln = False
self.cachekv_quant = False
self.fuse_swiglu = False
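
For context, the switches promoted to constructor arguments above could be set explicitly when building the config. A minimal sketch — the class name `PaddleOCRVLConfig` and the import path are assumptions for illustration, not taken from this diff:

```python
# Hypothetical usage; class and module names are assumed, not confirmed by this PR.
from ernie.configuration_paddleocr_vl import PaddleOCRVLConfig

config = PaddleOCRVLConfig(
    use_flash_attention=True,
    use_sparse_flash_attn=False,       # now a constructor argument rather than hard-coded
    recompute=True,
    recompute_granularity="core_attn",
    recompute_use_reentrant=False,     # default flipped from True to False in this change
    use_rmsnorm=True,
    fuse_rms_norm=True,                # now configurable instead of forced to True
    use_sparse_head_and_loss_fn=False,
)
```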
3 changes: 0 additions & 3 deletions ernie/dataset/vl_sft_reader/vl_sft_dataset_json.py
@@ -752,9 +752,6 @@ def gen_sample_list(self):
indices = []
for i, _ in enumerate(self.task_group):
sample_size = int(self.weight_list[i] * self.length)
print(
f"Take {sample_size} samples from {self.task_group[i]._file_name} (total length: {len(self.task_group[i].exs)}) to construct current sample list"
)
indices.extend([i] * sample_size)
return indices

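The deleted `print` reported how many samples each task contributes to the mixed sample list. A minimal sketch of the same weighted-index construction, routing that information through `logging` instead (the standalone function signature and names here are illustrative, not the module's actual API):

```python
import logging

logger = logging.getLogger(__name__)

def gen_sample_list(task_group, weight_list, length):
    """Build a flat index list where task i appears int(weight_list[i] * length) times."""
    indices = []
    for i, task in enumerate(task_group):
        sample_size = int(weight_list[i] * length)
        logger.debug(
            "Take %d samples from task %d (total length: %d)", sample_size, i, len(task.exs)
        )
        indices.extend([i] * sample_size)
    return indices
```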
5 changes: 4 additions & 1 deletion ernie/siglip/configuration.py
@@ -57,8 +57,9 @@ def __init__(
tokens_per_second=2,
recompute=False,
recompute_granularity="full",
recompute_use_reentrant=True,
recompute_use_reentrant=False,
use_flash_attention=False,
use_sparse_flash_attn=False,
**kwargs,
):
super().__init__(**kwargs)
@@ -82,11 +83,13 @@
self.recompute_granularity = recompute_granularity
self.recompute_use_reentrant = recompute_use_reentrant
self.use_flash_attention = use_flash_attention
self.use_sparse_flash_attn = use_sparse_flash_attn

self.register_unsavable_keys(
[
"recompute",
"recompute_use_reentrant",
"recompute_granularity",
"use_sparse_flash_attn",
]
)