2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -1151,6 +1151,8 @@
        title: Pix2Struct
      - local: model_doc/pixtral
        title: Pixtral
      - local: model_doc/pp_lcnet
        title: PPLCNet
      - local: model_doc/qwen2_5_omni
        title: Qwen2.5-Omni
      - local: model_doc/qwen2_5_vl
131 changes: 131 additions & 0 deletions docs/source/en/model_doc/pp_lcnet.md
@@ -0,0 +1,131 @@
# PP-LCNet

<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>

## Overview

**PP-LCNet** is a family of efficient, lightweight convolutional neural networks designed for real-world document understanding and OCR tasks. It balances accuracy, speed, and model size, which makes it suitable for both server-side and edge deployment. To address different document processing requirements, PP-LCNet has three main variants, each optimized for a specific task.

## Model Architecture

1. The Document Image Orientation Classification Module determines the orientation of a document image so that it can be corrected in post-processing. During document scanning or ID photo capture, the device may be rotated to get a clearer image, which produces images in various orientations. Standard OCR pipelines may not handle these images effectively. Classifying the orientation of documents or IDs that contain text regions up front allows the image to be rotated upright before recognition, which improves OCR accuracy.

2. The Table Classification Module classifies input table images into predefined categories, such as wired and wireless tables, based on the characteristics and content of the image. Its output is consumed by table recognition pipelines, so its performance directly affects the accuracy and efficiency of the entire table recognition process.

3. The Text Line Orientation Classification Module determines the orientation of individual text lines so that they can be corrected in post-processing. During document scanning or license/certificate photography, the capture device may be rotated to get a clearer image, which produces text lines in various orientations. Standard OCR pipelines may not handle such data well. Classifying the orientation of each text line up front allows it to be adjusted before recognition, which improves OCR accuracy.
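
The post-processing step described for the two orientation modules can be sketched as follows. This is a minimal illustration, not the PaddleOCR or Transformers implementation. It assumes the predicted label is the clockwise rotation of the input in degrees (`"0"`, `"90"`, `"180"`, `"270"`), and it uses a nested list as a stand-in for an image. `rotate_ccw_90` and `correct_orientation` are hypothetical helper names.

```python
def rotate_ccw_90(grid):
    """Rotate a 2D grid (a list of rows) by 90 degrees counter-clockwise."""
    # Transpose the grid, then reverse the row order.
    return [list(row) for row in zip(*grid)][::-1]


def correct_orientation(grid, predicted_label):
    """Undo a clockwise rotation given as a label string.

    Assumption: the label is the clockwise angle by which the image is
    rotated, so the fix is the same angle applied counter-clockwise.
    """
    angle = int(predicted_label)
    if angle % 90 != 0:
        raise ValueError(f"unsupported orientation label: {predicted_label!r}")
    for _ in range((angle // 90) % 4):
        grid = rotate_ccw_90(grid)
    return grid


upright = [[1, 2], [3, 4]]
rotated_cw_90 = [[3, 1], [4, 2]]  # `upright` rotated 90 degrees clockwise
restored = correct_orientation(rotated_cw_90, "90")
```

With a real `PIL.Image`, the same correction would be `image.rotate(angle, expand=True)`, since `rotate` turns the image counter-clockwise. Check `model.config.id2label` for the actual label names of the checkpoint you load.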


## Usage

### Single input inference

The example below demonstrates how to classify an image with PP-LCNet using [`Pipeline`] or [`AutoModel`].

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import requests
from PIL import Image
from transformers import pipeline
model_path = "PaddlePaddle/PP-LCNet_x1_0_doc_ori_safetensors"
image = Image.open(requests.get("https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/img_rot180_demo.jpg", stream=True).raw)
image_classifier = pipeline("image-classification", model=model_path, function_to_apply="none")
result = image_classifier(image)
print(result)
```

</hfoption>

<hfoption id="AutoModel">

```py
import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

model_path = "PaddlePaddle/PP-LCNet_x1_0_doc_ori_safetensors"
model = AutoModelForImageClassification.from_pretrained(model_path)
image_processor = AutoImageProcessor.from_pretrained(model_path)

image = Image.open(requests.get("https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/img_rot180_demo.jpg", stream=True).raw)

inputs = image_processor(images=image, return_tensors="pt")
outputs = model(**inputs)
print(outputs)
predicted_label = outputs.logits.argmax(-1).item()
print(model.config.id2label[predicted_label])
```

</hfoption>
</hfoptions>
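
Because the pipeline example passes `function_to_apply="none"`, the returned scores are raw logits rather than probabilities. The sketch below shows how logits map to probabilities with a softmax. The logit values and the `id2label` mapping are made up for illustration and are not real model output.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    # Subtract the maximum first for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for four orientation classes (assumed values).
logits = [0.3, -1.2, 4.1, 0.0]
# Assumed label mapping; read the real one from `model.config.id2label`.
id2label = {0: "0", 1: "90", 2: "180", 3: "270"}

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(id2label[best], round(probs[best], 3))
```

With the `AutoModel` path, the equivalent on tensors is `outputs.logits.softmax(-1)`.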

### Batched inference

The example below demonstrates how to classify a batch of images with PP-LCNet using [`Pipeline`] or [`AutoModel`]:

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import requests
from PIL import Image
from transformers import pipeline
model_path = "PaddlePaddle/PP-LCNet_x1_0_doc_ori_safetensors"
image = Image.open(requests.get("https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/img_rot180_demo.jpg", stream=True).raw)
image_classifier = pipeline("image-classification", model=model_path, function_to_apply="none")
result = image_classifier([image, image])
print(result)

```

</hfoption>

<hfoption id="AutoModel">

```py
import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

model_path = "PaddlePaddle/PP-LCNet_x1_0_doc_ori_safetensors"
model = AutoModelForImageClassification.from_pretrained(model_path)
image_processor = AutoImageProcessor.from_pretrained(model_path)

image = Image.open(requests.get("https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/img_rot180_demo.jpg", stream=True).raw)

inputs = image_processor(images=[image, image], return_tensors="pt")
outputs = model(**inputs)

predicted_labels = outputs.logits.argmax(-1)

for label_id in predicted_labels:
    label_id_scalar = label_id.item()
    label = model.config.id2label[label_id_scalar]
    print(label)
```

</hfoption>
</hfoptions>
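
When there are more images than fit in a single forward pass, one option is to split the list into fixed-size chunks and run each chunk through the processor and model as shown above. The sketch below only covers the chunking step. `batched` is a hypothetical helper, and the file names stand in for loaded images.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder inputs; in practice these would be PIL images.
image_paths = [f"page_{i}.jpg" for i in range(5)]
batches = list(batched(image_paths, batch_size=2))
```

Each element of `batches` can then be passed as `images=...` to the image processor.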

## PPLCNetForImageClassification

[[autodoc]] PPLCNetForImageClassification

## PPLCNetConfig

[[autodoc]] PPLCNetConfig

## PPLCNetModel

[[autodoc]] PPLCNetModel

## PPLCNetImageProcessorFast

[[autodoc]] PPLCNetImageProcessorFast

## PPLCNetImageProcessor

[[autodoc]] PPLCNetImageProcessor
1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -292,6 +292,7 @@
from .plbart import *
from .poolformer import *
from .pop2piano import *
from .pp_lcnet import *
from .prompt_depth_anything import *
from .prophetnet import *
from .pvt import *
2 changes: 2 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -331,6 +331,7 @@
("plbart", "PLBartConfig"),
("poolformer", "PoolFormerConfig"),
("pop2piano", "Pop2PianoConfig"),
("pp_lcnet", "PPLCNetConfig"),
("prompt_depth_anything", "PromptDepthAnythingConfig"),
("prophetnet", "ProphetNetConfig"),
("pvt", "PvtConfig"),
@@ -799,6 +800,7 @@
("plbart", "PLBart"),
("poolformer", "PoolFormer"),
("pop2piano", "Pop2Piano"),
("pp_lcnet", "PPLCNet"),
("prompt_depth_anything", "PromptDepthAnything"),
("prophetnet", "ProphetNet"),
("pvt", "PVT"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/image_processing_auto.py
@@ -163,6 +163,7 @@
("pixio", ("BitImageProcessor", "BitImageProcessorFast")),
("pixtral", ("PixtralImageProcessor", "PixtralImageProcessorFast")),
("poolformer", ("PoolFormerImageProcessor", "PoolFormerImageProcessorFast")),
("pp_lcnet", ("PPLCNetImageProcessor", "PPLCNetImageProcessorFast")),
("prompt_depth_anything", ("PromptDepthAnythingImageProcessor", "PromptDepthAnythingImageProcessorFast")),
("pvt", ("PvtImageProcessor", "PvtImageProcessorFast")),
("pvt_v2", ("PvtImageProcessor", "PvtImageProcessorFast")),
1 change: 1 addition & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -896,6 +896,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
),
),
("poolformer", "PoolFormerForImageClassification"),
("pp_lcnet", "PPLCNetForImageClassification"),
("pvt", "PvtForImageClassification"),
("pvt_v2", "PvtV2ForImageClassification"),
("regnet", "RegNetForImageClassification"),
27 changes: 27 additions & 0 deletions src/transformers/models/pp_lcnet/__init__.py
@@ -0,0 +1,27 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
    from .configuration_pp_lcnet import *
    from .modeling_pp_lcnet import *
else:
    import sys

    _file = globals()["__file__"]
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
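
The `__init__.py` above replaces the package in `sys.modules` with a lazy module so that submodules are imported only when one of their names is first accessed. The sketch below illustrates that general idea with the standard library only. It is not the Transformers `_LazyModule` implementation; `LazyModule` and its `attr_to_module` argument are hypothetical.

```python
import importlib
import types


class LazyModule(types.ModuleType):
    """Resolve attributes from other modules only on first access."""

    def __init__(self, name, attr_to_module):
        super().__init__(name)
        # Maps an exported name to the module that defines it.
        self._attr_to_module = attr_to_module

    def __getattr__(self, attr):
        # Called only when normal attribute lookup fails.
        try:
            module_name = self._attr_to_module[attr]
        except KeyError:
            raise AttributeError(
                f"module {self.__name__!r} has no attribute {attr!r}"
            ) from None
        value = getattr(importlib.import_module(module_name), attr)
        setattr(self, attr, value)  # cache so later lookups skip __getattr__
        return value


# Demo with standard-library targets instead of model files.
lazy = LazyModule("lazy_demo", {"sqrt": "math", "dumps": "json"})
```

Accessing `lazy.sqrt` triggers the import of `math` and caches the function on the module object.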
88 changes: 88 additions & 0 deletions src/transformers/models/pp_lcnet/configuration_pp_lcnet.py
@@ -0,0 +1,88 @@
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
# This file was automatically generated from src/transformers/models/pp_lcnet/modular_pp_lcnet.py.
# Do NOT edit this file manually as any edits will be overwritten by the generation of
# the file from the modular. If any change should be done, please apply the change to the
# modular_pp_lcnet.py file directly. One of our CI enforces this.
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨

from ...configuration_utils import PreTrainedConfig


class PPLCNetConfig(PreTrainedConfig):
    model_type = "pp_lcnet"

    """
    This is the configuration class to store the configuration of a [`PPLCNet`]. It is used to instantiate a
    PP-LCNet model according to the specified arguments, defining the model architecture.
    Instantiating a configuration with the defaults will yield a similar configuration to that of the PP-LCNet
    [PaddlePaddle/PP-LCNet_x1_0_doc_ori_safetensors](https://huggingface.co/PaddlePaddle/PP-LCNet_x1_0_doc_ori_safetensors) architecture.
    Configuration objects inherit from [`PreTrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PreTrainedConfig`] for more information.
    Args:
        scale (`float`, *optional*, defaults to 1.0):
            The scaling factor for the model's channel dimensions, used to adjust the model size and computational cost
            without changing the overall architecture (e.g., 0.25, 0.5, 1.0, 1.5).
        class_num (`int`, *optional*, defaults to 4):
            The number of output classes for the classification task. Typical values are 2 (binary classification) or
            4 (document orientation classification: 0°, 90°, 180°, 270°).
        stride_list (`List[int]`, *optional*, defaults to `[2, 2, 2, 2, 2]`):
            The list of stride values for convolutional layers in the backbone network, controlling the downsampling
            rate of feature maps at each stage to capture multi-scale visual information.
        reduction (`int`, *optional*, defaults to 4):
            The reduction factor for feature channel dimensions in the squeeze-and-excitation (SE) blocks, used to
            reduce the number of model parameters and computational complexity while maintaining feature representability.
        dropout_prob (`float`, *optional*, defaults to 0.2):
            The dropout probability for the classification head, used to prevent overfitting by randomly zeroing out
            a fraction of the neurons during training.
        class_expand (`int`, *optional*, defaults to 1280):
            The number of hidden units in the expansion layer of the classification head, used to enhance the model's
            feature representation capability before the final classification layer.
        use_last_conv (`bool`, *optional*, defaults to `True`):
            Whether to use the final convolutional layer in the classification head. Setting this to `True` helps
            extract more discriminative features for the classification task.
        act (`str`, *optional*, defaults to `"hardswish"`):
            The non-linear activation function used in the model's hidden layers. Supported functions include
            `"hardswish"`, `"relu"`, `"silu"`, and `"gelu"`. `"hardswish"` is preferred for lightweight and efficient
            inference on edge devices.
        backbone_config (`Union[dict, PreTrainedConfig]`, *optional*, defaults to `None`):
            The configuration of the backbone model. If `None`, the default backbone configuration for PP-LCNet
            will be used, which includes the standard block settings for feature extraction.

    Examples:
    ```python
    >>> from transformers import PPLCNetConfig, PPLCNetForImageClassification
    >>> # Initializing a PP-LCNet configuration
    >>> configuration = PPLCNetConfig()
    >>> # Initializing a model (with random weights) from the configuration
    >>> model = PPLCNetForImageClassification(configuration)
    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```
    """

    def __init__(
        self,
        scale: float = 1.0,
        class_num: int = 4,
        stride_list: list[int] = [2, 2, 2, 2, 2],
        reduction: int = 4,
        dropout_prob: float = 0.2,
        class_expand: int = 1280,
        use_last_conv: bool = True,
        act: str = "hardswish",
        backbone_config: dict | None = None,
        **kwargs,
    ):
        super().__init__(**kwargs)

        self.scale = scale
        self.class_num = class_num
        self.stride_list = stride_list
        self.reduction = reduction
        self.dropout_prob = dropout_prob
        self.class_expand = class_expand
        self.use_last_conv = use_last_conv
        self.act = act
        self.backbone_config = backbone_config
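
The `scale` argument is described as a multiplier on channel dimensions. The modeling code is not shown in this diff, so the sketch below is an assumption: it uses the rounding convention common in MobileNet-style networks, where scaled channel counts are snapped to a multiple of 8. `make_divisible`, `scaled_channels`, and the base channel list are hypothetical and only illustrate the effect of `scale`.

```python
def make_divisible(value, divisor=8, min_value=None):
    """Round `value` to the nearest multiple of `divisor` (at least `min_value`)."""
    if min_value is None:
        min_value = divisor
    rounded = max(min_value, int(value + divisor / 2) // divisor * divisor)
    # Avoid shrinking the channel count by more than 10 percent.
    if rounded < 0.9 * value:
        rounded += divisor
    return rounded


# Hypothetical per-stage base widths, not taken from the PR.
base_channels = [16, 32, 64, 128, 256, 512]


def scaled_channels(scale):
    """Apply the width multiplier to every stage."""
    return [make_divisible(c * scale) for c in base_channels]


half_width = scaled_channels(0.5)
wide = scaled_channels(1.5)
```

Under these assumptions, `scale=0.5` halves every stage width while keeping each value a multiple of 8.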


__all__ = ["PPLCNetConfig"]
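
The default `stride_list` of `[2, 2, 2, 2, 2]` implies a total downsampling factor of 32. The sketch below works through that arithmetic for a 224×224 input. The input size is an assumption (the diff does not state it), as is the use of ceiling division to model padded strided convolutions. `feature_map_sizes` is a hypothetical helper.

```python
import math


def feature_map_sizes(input_size, stride_list):
    """Return the spatial size after each strided stage."""
    sizes = []
    size = input_size
    for stride in stride_list:
        # Assumes padding such that output = ceil(input / stride).
        size = math.ceil(size / stride)
        sizes.append(size)
    return sizes


stride_list = [2, 2, 2, 2, 2]
total_stride = math.prod(stride_list)
sizes = feature_map_sizes(224, stride_list)
```

Changing any entry to 1 keeps the resolution at that stage and halves the total stride.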