Refactor to make torch and transformers optional dependencies #53

@lhaibach

Currently, the extraction package always installs and imports torch and transformers, even when only the treebased classifier (RandomForest/XGBoost) is used. This inflates Docker image size and baseline RAM usage.

Example import probe results (AWS workspace, Python 3.10):

Python baseline:       11.9 MB
+ numpy:               25.4 MB
+ pandas:             100.5 MB
+ sklearn:            158.7 MB
+ pymupdf:            183.6 MB
+ shapely:            186.5 MB
+ torch:              632.5 MB
+ transformers:       650.5 MB
+ fasttext:           650.8 MB
+ layoutparser:       650.8 MB

So torch adds ~450 MB just by being imported.
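For anyone who wants to reproduce numbers like these, here is a minimal sketch of an import probe. It is not necessarily how the figures above were collected: it assumes a Unix-like system (the standard-library `resource` module is unavailable on Windows) and reports peak resident set size, which only ever grows, so each row is cumulative. The function names `peak_rss_mb` and `probe_imports` are hypothetical.

```python
import importlib
import resource
import sys


def peak_rss_mb() -> float:
    """Return this process's peak resident set size in MB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def probe_imports(module_names):
    """Import modules in order and record peak RSS after each import.

    Returns a list of (name, megabytes) pairs, starting with a "baseline"
    entry. Modules that are not installed are recorded with None.
    """
    results = [("baseline", peak_rss_mb())]
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            results.append((name, None))
            continue
        results.append((name, peak_rss_mb()))
    return results
```

Running `probe_imports(["numpy", "pandas", "sklearn", "torch", "transformers"])` in a fresh interpreter gives one row per module. Absolute values will differ by platform and package version.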

Make the heavy deep-learning dependencies (torch, transformers, layoutparser) optional.
Import them only when LayoutLMv3 is explicitly requested.
Allow the treebased (RandomForest/XGBoost) and baseline classifiers to run without these dependencies installed.
If the optional dependencies are missing and LayoutLMv3 is requested, raise a clear error:
"torch/transformers are required for LayoutLMv3. Please install with pip install '.[deep-learning]'."
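One possible shape for the lazy import, as a sketch only: the helper `require_deep_learning`, the selector `build_classifier`, and the classifier names are hypothetical and not taken from the package. The heavy modules are imported inside a function, so importing the package itself never touches torch. The module tuple checks only torch and transformers, matching the error message; layoutparser could be added the same way.

```python
import importlib

# Assumption: these are the modules the LayoutLMv3 path needs.
DEEP_LEARNING_MODULES = ("torch", "transformers")

INSTALL_HINT = (
    "torch/transformers are required for LayoutLMv3. "
    "Please install with pip install '.[deep-learning]'."
)


def require_deep_learning(modules=DEEP_LEARNING_MODULES):
    """Import the optional modules on demand and return them by name.

    Raises ImportError with an installation hint if any module is missing.
    """
    loaded = {}
    for name in modules:
        try:
            loaded[name] = importlib.import_module(name)
        except ImportError as exc:
            raise ImportError(INSTALL_HINT) from exc
    return loaded


def build_classifier(kind: str) -> str:
    """Hypothetical selector: only the LayoutLMv3 branch triggers heavy imports."""
    if kind == "layoutlmv3":
        require_deep_learning()  # fails early with a clear message
        return "layoutlmv3 classifier"
    # The treebased and baseline paths never import torch or transformers.
    return f"{kind} classifier"
```

With this layout, `build_classifier("treebased")` works in an environment without torch, while `build_classifier("layoutlmv3")` either succeeds or raises the error quoted above.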
