49 commits
29da03c
image preprocessor for ocr
forBlank Sep 23, 2025
bbf8f81
modeling ppocrvl
forBlank Sep 23, 2025
a37d543
modeling siglip for ocr
forBlank Sep 23, 2025
88463b9
ocr workflow & only DP dataflow
forBlank Sep 23, 2025
3586730
ocr 8k training yaml
forBlank Sep 23, 2025
f7ca4e2
update sigliip & GELUTanh & format correction
forBlank Oct 11, 2025
114a20c
update dataflow
forBlank Oct 11, 2025
755519a
constant learning rate scheduler
forBlank Oct 15, 2025
1e2f709
cross entropy by hand
forBlank Oct 15, 2025
7a5b260
update ocr 8k training yaml
forBlank Oct 15, 2025
4c94b84
make lint & remove unused code
forBlank Oct 15, 2025
b5bd996
update ocr 8k training yaml
forBlank Oct 17, 2025
4f81675
update image augmentation
forBlank Oct 17, 2025
d867546
update config
forBlank Oct 17, 2025
697ee0a
make lint & remove unused code
forBlank Oct 18, 2025
3bc4696
remove tp&sp & no fused mlp&attn for hf ckpt
forBlank Oct 22, 2025
713aee6
support padding&packing_size setting & freeze vit
forBlank Oct 22, 2025
524da01
Unified model name
forBlank Oct 22, 2025
89447d7
update doc & requirement
forBlank Oct 22, 2025
066b62d
update doc
forBlank Oct 22, 2025
e74c0c7
update doc for paddleocr_vl_sft
forBlank Oct 23, 2025
6ff11f1
use "PaddleOCR-VL-0.9B" for paddleocr_vl_sft doc
forBlank Oct 23, 2025
a044f4e
update packing setting
forBlank Oct 23, 2025
5cb1282
update Bengali dataset for paddleocr_vl_sft doc
forBlank Oct 23, 2025
07afabf
Merge remote-tracking branch 'upstream/develop' into ocr_vl
forBlank Oct 23, 2025
b9fe7bf
update Bengali dataset link for paddleocr doc&yaml
forBlank Oct 23, 2025
4ce36a4
support paddleocr vl sft with single GPU
forBlank Oct 23, 2025
d915e92
make lint
forBlank Oct 23, 2025
3680e0a
fix paddleocr_vl_sft with single GPU & update doc
forBlank Oct 24, 2025
a9257c8
Merge branch 'develop' into ocr_vl
forBlank Oct 24, 2025
9b60be4
update cli for downloading hf model
forBlank Oct 24, 2025
887cd1f
Merge remote-tracking branch 'origin/ocr_vl' into ocr_vl
forBlank Oct 24, 2025
2780691
update cli model path for paddleocr_vl_sft
forBlank Oct 24, 2025
ca45067
Update the paddleocr_vl config for single GPU
forBlank Oct 24, 2025
5fe0560
Update the paddleocr_vl config & doc
forBlank Oct 25, 2025
6f18125
Merge remote-tracking branch 'upstream/develop' into ocr_vl
forBlank Oct 25, 2025
479974e
Update the paddleocr_vl config & doc
forBlank Oct 25, 2025
c93e078
Update the paddleocr_vl doc
forBlank Oct 27, 2025
55f01d9
Update the paddleocr_vl doc
forBlank Oct 27, 2025
448235b
Merge remote-tracking branch 'upstream/develop' into ocr_vl
forBlank Oct 27, 2025
b73e20b
Fix typo errors in paddleocr_vl doc
forBlank Oct 28, 2025
5e09651
Merge remote-tracking branch 'upstream/develop' into ocr_vl
forBlank Oct 28, 2025
34241e8
Update the paddleocr_vl doc for single GPU
forBlank Oct 28, 2025
9d48804
Update the paddleocr_vl doc
forBlank Nov 5, 2025
115a9ad
Update the huggingface line in paddleocr_vl doc
forBlank Nov 10, 2025
f842f09
update config for paddleocr_vl
forBlank Nov 10, 2025
3ab7f39
Merge remote-tracking branch 'upstream/develop' into ocr_vl
forBlank Nov 10, 2025
78dbab8
Update markdown links in paddleocr_vl doc
forBlank Nov 14, 2025
1d9a30d
Update links of modelcard in paddleocr_vl doc
forBlank Nov 14, 2025
40 changes: 18 additions & 22 deletions docs/paddleocr_vl_sft.md
@@ -6,19 +6,15 @@ English | [简体中文](./paddleocr_vl_sft_zh.md)

PaddleOCR-VL is a SOTA, resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic-resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. The model efficiently supports 109 languages and excels at recognizing complex elements (e.g., text, tables, formulas, and charts) while maintaining minimal resource consumption. In comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, is strongly competitive with top-tier VLMs, and delivers fast inference. These strengths make it well suited to practical deployment in real-world scenarios.

While PaddleOCR-VL-0.9B excels in common scenarios, its performance often faces limitations in many specific or complex business applications. For instance:

- Domain-Specific Applications
  - Finance & Accounting: Recognizing documents such as invoices, receipts, bank statements, and financial reports
  - Healthcare: Processing medical records, lab reports, handwritten prescriptions, and pharmaceutical instructions
  - Legal Sector: Identifying text in contracts, legal instruments, court filings, and certificates.
- Non-Standard Text and Typography
  - Handwriting Recognition: Deciphering handwritten forms, notes, letters, and questionnaires.
  - Stylized & Artistic Fonts: Recognizing text on posters, billboards, product packaging, and menus.
  - Historical & Archival Documents: Processing ancient manuscripts, old newspapers, and historical archives.
- Task-Specific Structured Output
  - Table Recognition & Structuring: Converting tables within images into structured formats like Excel, CSV, or JSON.
  - Mathematical Formula Recognition: Identifying mathematical equations in textbooks or research papers and exporting them into formats like LaTeX.
While PaddleOCR-VL-0.9B performs excellently in common scenarios, its recognition capabilities may face bottlenecks in specific or complex business applications. For example:

- Non-standard text and symbols
  - Artistic or stylized fonts: Recognizing text on posters, billboards, product packaging, cards/documents, and seals.
  - Specialized symbols: For example, symbols used in organic chemistry.
- Specific tasks and output formats
  - Fine-grained text localization and grounding outputs.
  - Flowchart recognition with structured output.
  - Data for specific low-resource languages, such as Tibetan and Bengali.

This is where SFT (Supervised Fine-Tuning) becomes necessary to enhance the model’s accuracy and robustness for these specialized tasks.

@@ -49,20 +45,20 @@ python -m pip install opencv-python-headless
python -m pip install numpy==1.26.4
```

For more installation methods, please refer to the [ERNIEKit Installation Guide]((./erniekit.md#2-installation)).
For more installation methods, please refer to the [ERNIEKit Installation Guide](./erniekit.md#2-installation).

## 3. Model and Dataset Preparation

### 3.1. Model Preparation
The PaddleOCR-VL-0.9B model can be downloaded from [huggingface](https://huggingface.co/PaddlePaddle/PaddleOCR-VL/tree/main/PaddleOCR-VL-0.9B) or [modelscope](https://modelscope.cn/models/PaddlePaddle/PaddleOCR-VL/files).
The PaddleOCR-VL-0.9B model can be downloaded from [huggingface](https://huggingface.co/PaddlePaddle/PaddleOCR-VL) or [modelscope](https://modelscope.cn/models/PaddlePaddle/PaddleOCR-VL).

```bash
huggingface-cli download PaddlePaddle/PaddleOCR-VL --local-dir PaddlePaddle/PaddleOCR-VL
```
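
If the `huggingface-cli` tool is unavailable, the same download can be done from Python via `huggingface_hub`. A minimal sketch, assuming the package is installed; the repo id and local directory simply mirror the CLI command above:

```python
from huggingface_hub import snapshot_download

# Fetch the full PaddleOCR-VL repo (which contains the PaddleOCR-VL-0.9B weights)
# into the same local directory used by the CLI command above.
snapshot_download(
    repo_id="PaddlePaddle/PaddleOCR-VL",
    local_dir="PaddlePaddle/PaddleOCR-VL",
)
```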

### 3.2. Dataset Preparation

For the training dataset format, please refer to [SFT VL Dataset Format]((./datasets.md#sft-vl-dataset)). Required fields are as follows:
For the training dataset format, please refer to [SFT VL Dataset Format](./datasets.md#sft-vl-dataset). Required fields are as follows:
* `text_info`: The list of text data; each element contains a `text` and a `tag`
  * `text`: The text content from the User question or System response
  * `tag`: The mask tag (`no_mask` = include in training, `mask` = exclude)
@@ -75,7 +71,7 @@ Notes:
* Each training sample is in JSON format, with multiple samples separated by newlines
* Please ensure that `mask` items and `no_mask` items alternate in the `text_info`
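
As a minimal sketch of the format described above (the image path and response text below are placeholders, not part of the provided dataset), one training line can be assembled and appended to a JSONL file like this:

```python
import json

# One SFT sample: the image is bound to the first text item via matched_text_index,
# and the conversation alternates mask (prompt, excluded from the loss) with
# no_mask (response, included in training).
sample = {
    "image_info": [
        {"matched_text_index": 0, "image_url": "./assets/example.png"},
    ],
    "text_info": [
        {"text": "OCR:", "tag": "mask"},
        {"text": "recognized text goes here", "tag": "no_mask"},
    ],
}

# Sanity check: mask and no_mask items must alternate.
tags = [item["tag"] for item in sample["text_info"]]
assert all(a != b for a, b in zip(tags, tags[1:])), "mask/no_mask items must alternate"

# JSONL: one JSON object per line.
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```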

For your convenience, we also provide a quick-start [Bengali training dataset]((https://paddleformers.bj.bcebos.com/datasets/ocr_vl_sft-train_Bengali.jsonl)) for fine-tuning PaddleOCR-VL-0.9B on Bengali recognition. Download it using the following command:
For your convenience, we also provide a quick-start [Bengali training dataset](https://paddleformers.bj.bcebos.com/datasets/ocr_vl_sft-train_Bengali.jsonl) for fine-tuning PaddleOCR-VL-0.9B on Bengali recognition. Download it using the following command:

```bash
wget https://paddleformers.bj.bcebos.com/datasets/ocr_vl_sft-train_Bengali.jsonl
@@ -91,7 +87,7 @@ Bengali training example:
```json
{
"image_info": [
{"matched_text_index": 0, "image_url": "./assets/table_example.jps"},
{"matched_text_index": 0, "image_url": "./assets/bengali_train_example.png"},
],
"text_info": [
{"text": "OCR:", "tag": "mask"},
@@ -194,7 +190,7 @@ cp PaddlePaddle/PaddleOCR-VL/inference.yml PaddleOCR-VL-SFT-Bengali
```

### 7.3. Inference Dataset Preparation
We provide a [Bengali test dataset]((https://paddleformers.bj.bcebos.com/datasets/ocr_vl_sft-test_Bengali.jsonl)) that can be used for inference to observe the fine-tuning results. Download it using the following command:
We provide a [Bengali test dataset](https://paddleformers.bj.bcebos.com/datasets/ocr_vl_sft-test_Bengali.jsonl) that can be used for inference to observe the fine-tuning results. Download it using the following command:

```bash
wget https://paddleformers.bj.bcebos.com/datasets/ocr_vl_sft-test_Bengali.jsonl
@@ -236,7 +232,7 @@ Table Data: OTSL format
```json
{
"image_info": [
{"matched_text_index": 0, "image_url": "./assets/table_example.jps"},
{"matched_text_index": 0, "image_url": "./assets/table_example.png"},
],
"text_info": [
{"text": "Table Recognition:", "tag": "mask"},
@@ -254,7 +250,7 @@ Formula Data: LaTeX format
```json
{
"image_info": [
{"matched_text_index": 0, "image_url": "./assets/formula_example.jps"},
{"matched_text_index": 0, "image_url": "./assets/formula_example.jpg"},
],
"text_info": [
{"text": "Formula Recognition:", "tag": "mask"},
@@ -281,7 +277,7 @@ Chart Data: Markdown format
}
```

### Common Issues
### 8.2. Common Issues

If you encounter the following problem while using the above command, it is generally due to a conflict between cv2 and the environment. This can be resolved by installing `opencv-python-headless`.

31 changes: 12 additions & 19 deletions docs/paddleocr_vl_sft_zh.md
@@ -7,21 +7,14 @@ PaddleOCR-VL 是一款为文档解析任务量身打造的、性能顶尖 (SOTA)

这款模型不仅能高效支持 109 种语言,还擅长识别文本、表格、公式、图表等复杂元素,并始终保持极低的资源占用。在多个权威的公开及内部基准测试中,PaddleOCR-VL 的页面级文档解析与元素级识别性能均达到了业界顶尖水平。其性能远超现有方案,面对顶级视觉语言模型也极具竞争力,且推理速度飞快。这些杰出特性使其成为在真实场景中落地部署的理想选择。

虽然 PaddleOCR-VL-0.9B 在常见场景下表现出色,但在许多特定或复杂的业务场景中,其性能会遇到瓶颈。例如:
- 特定行业与专业领域
- 金融与财会领域:识别发票、收据、银行对账单、财务报表等
- 医疗领域:识别病历、化验单、医生手写处方、药品说明书等
- 法律领域:识别合同、法律文书、法庭文件、证书等

- 非标准化的文本与字体
- 手写体识别:识别手写的表单、笔记、信件、问卷调查等
- 艺术字体与设计字体:识别海报、广告牌、产品包装、菜单上的艺术字体等
- 古籍与历史文献:识别古代手稿、旧报纸、历史档案等

虽然 PaddleOCR-VL-0.9B 在常见场景下表现出色,但在特定或复杂的业务场景中,其识别效果可能会遇到瓶颈。例如:
- 非标准化的文本与符号
- 艺术设计字体:识别海报、广告牌、产品包装、卡证、印章字体等
- 特殊符号:有机化学符号识别
- 特定任务与输出格式
- 表格识别与结构化输出:将图像中的表格转换为 Excel、CSV 或 JSON 格式
- 数学公式识别:识别教科书、论文中的数学公式,并输出为 LaTeX 等格式

- 细粒度文本定位和 Grounding 输出
- 流程图识别和结构化输出
- 特定的小语种数据:藏语、孟加拉语……

这时,就需要通过 SFT (Supervised Fine-Tuning) 来提升模型的准确性和鲁棒性。

@@ -58,7 +51,7 @@ python -m pip install numpy==1.26.4

### 3.1. 模型准备

在 [huggingface](https://huggingface.co/PaddlePaddle/PaddleOCR-VL/tree/main/PaddleOCR-VL-0.9B) 或者 [modelscope](https://modelscope.cn/models/PaddlePaddle/PaddleOCR-VL/files) 可以下载 PaddleOCR-VL-0.9B 模型。
在 [huggingface](https://huggingface.co/PaddlePaddle/PaddleOCR-VL) 或者 [modelscope](https://modelscope.cn/models/PaddlePaddle/PaddleOCR-VL) 可以下载 PaddleOCR-VL-0.9B 模型。

```bash
huggingface-cli download PaddlePaddle/PaddleOCR-VL --local-dir PaddlePaddle/PaddleOCR-VL
@@ -93,7 +86,7 @@ wget https://paddleformers.bj.bcebos.com/datasets/ocr_vl_sft-train_Bengali.jsonl
```json
{
"image_info": [
{"matched_text_index": 0, "image_url": "./assets/table_example.jps"},
{"matched_text_index": 0, "image_url": "./assets/bengali_train_example.png"},
],
"text_info": [
{"text": "OCR:", "tag": "mask"},
@@ -237,7 +230,7 @@ paddleocr doc_parser -i https://paddle-model-ecology.bj.bcebos.com/PPOCRVL/datas
```json
{
"image_info": [
{"matched_text_index": 0, "image_url": "./assets/table_example.jps"},
{"matched_text_index": 0, "image_url": "./assets/table_example.png"},
],
"text_info": [
{"text": "Table Recognition:", "tag": "mask"},
@@ -255,7 +248,7 @@ paddleocr doc_parser -i https://paddle-model-ecology.bj.bcebos.com/PPOCRVL/datas
```json
{
"image_info": [
{"matched_text_index": 0, "image_url": "./assets/formula_example.jps"},
{"matched_text_index": 0, "image_url": "./assets/formula_example.jpg"},
],
"text_info": [
{"text": "Formula Recognition:", "tag": "mask"},
@@ -282,7 +275,7 @@ paddleocr doc_parser -i https://paddle-model-ecology.bj.bcebos.com/PPOCRVL/datas
}
```

### 常见问题
### 8.2. 常见问题

如果你使用上述命令过程中遇到下面的问题,一般是因为cv2和环境的冲突,可以通过安装 `opencv-python-headless` 来解决问题

14 changes: 9 additions & 5 deletions ernie/configuration_paddleocr_vl.py
@@ -50,9 +50,12 @@ def __init__(
rms_norm_eps=1e-6,
use_cache=False,
use_flash_attention=False,
use_sparse_flash_attn=False,
recompute=False,
recompute_granularity="core_attn",
recompute_use_reentrant=True,
recompute_use_reentrant=False,
use_rmsnorm=True,
fuse_rms_norm=True,
pad_token_id=0,
bos_token_id=1,
eos_token_id=2,
@@ -66,6 +69,7 @@
hidden_dropout_prob=0.0,
compression_ratio: float = 1.0,
num_key_value_heads=None,
use_sparse_head_and_loss_fn=False,
max_sequence_length=None,
tie_word_embeddings=False,
vision_config=None,
@@ -91,9 +95,12 @@
self.rms_norm_eps = rms_norm_eps
self.use_cache = use_cache
self.use_flash_attention = use_flash_attention
self.use_sparse_flash_attn = use_sparse_flash_attn
self.recompute = recompute
self.recompute_granularity = recompute_granularity
self.recompute_use_reentrant = recompute_use_reentrant
self.use_rmsnorm = use_rmsnorm
self.fuse_rms_norm = fuse_rms_norm
self.pad_token_id = pad_token_id
self.bos_token_id = bos_token_id
self.eos_token_id = eos_token_id
@@ -113,6 +120,7 @@
self.hidden_dropout_prob = hidden_dropout_prob
self.compression_ratio = compression_ratio
self.num_key_value_heads = num_key_value_heads
self.use_sparse_head_and_loss_fn = use_sparse_head_and_loss_fn
self.max_sequence_length = max_sequence_length
self.rope_scaling = rope_scaling
if self.rope_scaling is not None and "type" in self.rope_scaling:
@@ -123,17 +131,13 @@
super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs)

# Currently, these configuration items are hard-coded
self.fuse_rms_norm = True
self.use_sparse_flash_attn = True
self.use_var_len_flash_attn = False
self.scale_qk_coeff = 1.0
self.fuse_softmax_mask = False
self.use_sparse_head_and_loss_fn = False
self.use_recompute_loss_fn = False
self.use_fused_head_and_loss_fn = False
self.fuse_linear = False
self.token_balance_seqlen = False
self.use_rmsnorm = True
self.fuse_ln = False
self.cachekv_quant = False
self.fuse_swiglu = False
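
For context, the switches promoted to constructor arguments above could be set explicitly when building the config. A minimal sketch — the class name `PaddleOCRVLConfig` and the import path are assumptions for illustration, not taken from this diff:

```python
# Hypothetical usage; class and module names are assumed, not confirmed by this PR.
from ernie.configuration_paddleocr_vl import PaddleOCRVLConfig

config = PaddleOCRVLConfig(
    use_flash_attention=True,
    use_sparse_flash_attn=False,       # now a constructor argument rather than hard-coded
    recompute=True,
    recompute_granularity="core_attn",
    recompute_use_reentrant=False,     # default flipped from True to False in this change
    use_rmsnorm=True,
    fuse_rms_norm=True,                # now configurable instead of forced to True
    use_sparse_head_and_loss_fn=False,
)
```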
3 changes: 0 additions & 3 deletions ernie/dataset/vl_sft_reader/vl_sft_dataset_json.py
@@ -752,9 +752,6 @@ def gen_sample_list(self):
indices = []
for i, _ in enumerate(self.task_group):
sample_size = int(self.weight_list[i] * self.length)
print(
f"Take {sample_size} samples from {self.task_group[i]._file_name} (total length: {len(self.task_group[i].exs)}) to construct current sample list"
)
indices.extend([i] * sample_size)
return indices

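The deleted `print` reported how many samples each task contributes to the mixed sample list. A minimal sketch of the same weighted-index construction, routing that information through `logging` instead (the standalone function signature and names here are illustrative, not the module's actual API):

```python
import logging

logger = logging.getLogger(__name__)

def gen_sample_list(task_group, weight_list, length):
    """Build a flat index list where task i appears int(weight_list[i] * length) times."""
    indices = []
    for i, task in enumerate(task_group):
        sample_size = int(weight_list[i] * length)
        logger.debug(
            "Take %d samples from task %d (total length: %d)", sample_size, i, len(task.exs)
        )
        indices.extend([i] * sample_size)
    return indices
```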
5 changes: 4 additions & 1 deletion ernie/siglip/configuration.py
@@ -57,8 +57,9 @@ def __init__(
tokens_per_second=2,
recompute=False,
recompute_granularity="full",
recompute_use_reentrant=True,
recompute_use_reentrant=False,
use_flash_attention=False,
use_sparse_flash_attn=False,
**kwargs,
):
super().__init__(**kwargs)
@@ -82,11 +83,13 @@
self.recompute_granularity = recompute_granularity
self.recompute_use_reentrant = recompute_use_reentrant
self.use_flash_attention = use_flash_attention
self.use_sparse_flash_attn = use_sparse_flash_attn

self.register_unsavable_keys(
[
"recompute",
"recompute_use_reentrant",
"recompute_granularity",
"use_sparse_flash_attn",
]
)