Commit ab87731

Simplify OCR stack around DeepSeek
1 parent 91207d0 commit ab87731

58 files changed: 4241 additions & 3161 deletions

.gitignore

Lines changed: 4 additions & 0 deletions
@@ -58,10 +58,13 @@ htmlcov/
 # OCR test outputs
 test_ocr_*_output/
 *_demo_output/
+artifacts/
 
 # OCR model weights (if downloaded locally)
 nanonets/
 ocr_models/
+deepseek-ocr-2-model/
+models/
 
 # Noise analysis reports
 glossapi_noise_analysis_report.md
@@ -78,4 +81,5 @@ dependency_setup/.venvs/
 deepseek-ocr/DeepSeek-OCR-empty/
 # Local DeepSeek checkout and repro scripts (keep out of master)
 deepseek-ocr/
+deepseek-ocr-2/
 repro_rapidocr_onnx/

README.md

Lines changed: 18 additions & 23 deletions
@@ -4,7 +4,7 @@ GlossAPI is a GPU-ready document processing pipeline from [GFOSS](https://gfoss.
 
 ## Why GlossAPI
 - Handles download → extraction → cleaning → sectioning in one pipeline.
-- Ships safe PyPDFium extraction plus Docling/RapidOCR for high-throughput OCR.
+- Ships safe PyPDFium extraction plus Docling for structured extraction and DeepSeek-OCR-2 for OCR remediation.
 - Rust-powered cleaner/noise metrics keep Markdown quality predictable.
 - Greek-first metadata and section classification tuned for academic corpora.
 - Modular Corpus API lets you resume from any stage or plug into existing flows.
@@ -40,45 +40,40 @@ PY
 
 ## Automated Environment Profiles
 
-Use `dependency_setup/setup_glossapi.sh` to provision a virtualenv with the right dependency stack for the three supported modes:
+Use `dependency_setup/setup_glossapi.sh` for the Docling environment, or `dependency_setup/setup_deepseek_uv.sh` for the dedicated DeepSeek OCR runtime:
 
 ```bash
-# Vanilla pipeline (no GPU OCR extras)
-./dependency_setup/setup_glossapi.sh --mode vanilla --venv dependency_setup/.venvs/vanilla --run-tests
+# Docling / main GlossAPI environment
+./dependency_setup/setup_glossapi.sh --mode docling --venv dependency_setup/.venvs/docling --run-tests
 
-# Docling + RapidOCR mode
-./dependency_setup/setup_glossapi.sh --mode rapidocr --venv dependency_setup/.venvs/rapidocr --run-tests
-
-# DeepSeek OCR mode (requires weights under /path/to/deepseek-ocr/DeepSeek-OCR)
-./dependency_setup/setup_glossapi.sh \
-  --mode deepseek \
+# DeepSeek OCR runtime (uv-managed)
+./dependency_setup/setup_deepseek_uv.sh \
   --venv dependency_setup/.venvs/deepseek \
-  --weights-dir /path/to/deepseek-ocr \
+  --model-root /path/to/deepseek-ocr-2-model \
+  --download-model \
   --run-tests --smoke-test
 ```
 
-Pass `--download-deepseek` if you need the script to fetch weights automatically; otherwise it looks for `${REPO_ROOT}/deepseek-ocr/DeepSeek-OCR` unless you override `--weights-dir`. Check `dependency_setup/dependency_notes.md` for the latest pins, caveats, and validation history. The script also installs the Rust extensions in editable mode so local changes are picked up immediately.
+`setup_glossapi.sh --mode deepseek` now delegates to the same uv-based installer. `setup_deepseek_uv.sh` uses `uv venv` + `uv sync`, installs the Rust extensions in editable mode, and can download `deepseek-ai/DeepSeek-OCR-2` with `huggingface_hub`.
 
 **DeepSeek runtime checklist**
-- Run `python -m glossapi.ocr.deepseek.preflight` (from your DeepSeek venv) to fail fast if the CLI would fall back to the stub.
-- Export these to force the real CLI and avoid silent stub output:
+- Run `python -m glossapi.ocr.deepseek.preflight` from the DeepSeek venv to fail fast before OCR.
+- Export these to force the real runtime and avoid silent stub output:
   - `GLOSSAPI_DEEPSEEK_ALLOW_CLI=1`
   - `GLOSSAPI_DEEPSEEK_ALLOW_STUB=0`
-  - `GLOSSAPI_DEEPSEEK_VLLM_SCRIPT=/path/to/deepseek-ocr/run_pdf_ocr_vllm.py`
-  - `GLOSSAPI_DEEPSEEK_TEST_PYTHON=/path/to/deepseek/venv/bin/python`
-  - `GLOSSAPI_DEEPSEEK_MODEL_DIR=/path/to/deepseek-ocr/DeepSeek-OCR`
-  - `GLOSSAPI_DEEPSEEK_LD_LIBRARY_PATH=/path/to/libjpeg-turbo/lib`
-- CUDA toolkit with `nvcc` available (FlashInfer/vLLM JIT falls back poorly without it); set `CUDA_HOME` and prepend `$CUDA_HOME/bin` to `PATH`.
-- If FlashInfer is problematic, disable with `VLLM_USE_FLASHINFER=0` and `FLASHINFER_DISABLE=1`.
-- To avoid FP8 KV cache issues, export `GLOSSAPI_DEEPSEEK_NO_FP8_KV=1` (propagates `--no-fp8-kv`).
-- Tune VRAM use via `GLOSSAPI_DEEPSEEK_GPU_MEMORY_UTILIZATION=<0.5–0.9>`.
+  - `GLOSSAPI_DEEPSEEK_PYTHON=/path/to/deepseek/venv/bin/python`
+  - `GLOSSAPI_DEEPSEEK_RUNNER_SCRIPT=/path/to/glossAPI/src/glossapi/ocr/deepseek/run_pdf_ocr_transformers.py`
+  - `GLOSSAPI_DEEPSEEK_MODEL_DIR=/path/to/deepseek-ocr-2-model/DeepSeek-OCR-2`
+- The default fallback locations already point at the in-repo Transformers runner and `${REPO_ROOT}/deepseek-ocr-2-model/DeepSeek-OCR-2`.
+- `flash-attn` is optional. The runner uses `flash_attention_2` when available and falls back to `eager` otherwise.
 
 ## Choose Your Install Path
 
 | Scenario | Commands | Notes |
 | --- | --- | --- |
 | Pip users | `pip install glossapi` | Fast vanilla evaluation with minimal dependencies. |
-| Mode automation (recommended) | `./dependency_setup/setup_glossapi.sh --mode {vanilla\|rapidocr\|deepseek}` | Creates an isolated venv per mode, installs Rust crates, and can run the relevant pytest subset. |
+| Docling environment | `./dependency_setup/setup_glossapi.sh --mode docling` | Creates the main GlossAPI venv for extraction, cleaning, sectioning, and enrichment. |
+| DeepSeek environment | `./dependency_setup/setup_deepseek_uv.sh` | Creates a separate uv-managed OCR runtime pinned to the tested Transformers/Torch stack. |
 | Manual editable install | `pip install -e .` after cloning | Keep this if you prefer to manage dependencies by hand. |
 | Conda-based stacks | `scripts/setup_conda.sh` | Provisions Python 3.10 env + Rust + editable install for Amazon Linux/SageMaker. |
 
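
The README's runtime checklist says the runner uses `flash_attention_2` when `flash-attn` is available and falls back to `eager` otherwise. The runner itself (`run_pdf_ocr_transformers.py`) is not shown in this part of the diff, so the following is only a minimal sketch of how such a fallback might be selected. The function name `pick_attn_implementation` is hypothetical, and the assumption that availability is detected by checking whether the `flash_attn` package is importable is mine, not the commit's.

```python
import importlib.util


def pick_attn_implementation(module_name: str = "flash_attn") -> str:
    """Return "flash_attention_2" if the module is importable, else "eager".

    Hypothetical helper: the real runner may detect flash-attn differently.
    """
    # find_spec returns None when a top-level package is not installed,
    # so nothing is imported just to probe for availability.
    available = importlib.util.find_spec(module_name) is not None
    return "flash_attention_2" if available else "eager"


if __name__ == "__main__":
    # The chosen string would typically be passed as the
    # `attn_implementation` argument when loading a Transformers model.
    print(pick_attn_implementation())
```

Probing with `find_spec` keeps the check cheap and avoids import-time side effects; a stricter variant could attempt the import and also fall back when it fails for ABI or CUDA reasons.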

dependency_setup/deepseek_gpu_smoke.py

Lines changed: 15 additions & 18 deletions
@@ -3,9 +3,9 @@
 Minimal DeepSeek OCR integration smoke test.
 
 This script runs the GlossAPI DeepSeek backend on a tiny sample PDF and
-verifies that real Markdown output is produced. It requires the DeepSeek-OCR
-weights to be available under ``../deepseek-ocr/DeepSeek-OCR`` relative to
-the repository root (override via ``DEEPSEEK_MODEL_DIR``).
+verifies that real Markdown output is produced. It requires the DeepSeek-OCR-2
+weights to be available under ``../deepseek-ocr-2-model/DeepSeek-OCR-2`` relative to the
+repository root (override via ``DEEPSEEK_MODEL_DIR``).
 """
 from __future__ import annotations
 
@@ -20,15 +20,16 @@
 
 REPO_ROOT = Path(__file__).resolve().parents[1]
 SAMPLES_DIR = REPO_ROOT / "samples" / "lightweight_pdf_corpus" / "pdfs"
-DEFAULT_MODEL_ROOT = (REPO_ROOT / ".." / "deepseek-ocr").resolve()
+DEFAULT_MODEL_ROOT = (REPO_ROOT / "deepseek-ocr-2-model").resolve()
 
 
 def ensure_model_available(model_root: Path) -> None:
-    expected = model_root / "DeepSeek-OCR" / "model-00001-of-000001.safetensors"
+    direct_root = model_root if (model_root / "config.json").exists() else (model_root / "DeepSeek-OCR-2")
+    expected = direct_root / "model-00001-of-000001.safetensors"
     if not expected.exists() or expected.stat().st_size < 1_000_000:
         raise FileNotFoundError(
-            f"Expected DeepSeek-OCR weights at {expected}. "
-            "Download the checkpoint (huggingface.co/deepseek-ai/DeepSeek-OCR) "
+            f"Expected DeepSeek-OCR-2 weights at {expected}. "
+            "Download the checkpoint (huggingface.co/deepseek-ai/DeepSeek-OCR-2) "
             "or set DEEPSEEK_MODEL_DIR to the directory that contains them."
         )
 
@@ -37,7 +38,8 @@ def run_smoke(model_root: Path) -> None:
     from glossapi import Corpus
 
     ensure_model_available(model_root)
-    sample_pdf = SAMPLES_DIR / "sample01_plain.pdf"
+    model_dir = model_root if (model_root / "config.json").exists() else (model_root / "DeepSeek-OCR-2")
+    sample_pdf = SAMPLES_DIR / "alpha.pdf"
     if not sample_pdf.exists():
         raise FileNotFoundError(f"Sample PDF not found: {sample_pdf}")
 
@@ -67,22 +69,17 @@ def run_smoke(model_root: Path) -> None:
     parquet_path = dl_dir / "download_results.parquet"
     df.to_parquet(parquet_path, index=False)
 
+    os.environ.setdefault("GLOSSAPI_DEEPSEEK_ALLOW_CLI", "1")
     os.environ.setdefault("GLOSSAPI_DEEPSEEK_ALLOW_STUB", "0")
     os.environ.setdefault(
-        "GLOSSAPI_DEEPSEEK_VLLM_SCRIPT",
-        str(model_root / "run_pdf_ocr_vllm.py"),
+        "GLOSSAPI_DEEPSEEK_RUNNER_SCRIPT",
+        str(REPO_ROOT / "src" / "glossapi" / "ocr" / "deepseek" / "run_pdf_ocr_transformers.py"),
     )
     os.environ.setdefault(
         "GLOSSAPI_DEEPSEEK_PYTHON",
         sys.executable,
     )
-    ld_extra = os.environ.get("GLOSSAPI_DEEPSEEK_LD_LIBRARY_PATH") or str(
-        model_root / "libjpeg-turbo" / "lib"
-    )
-    os.environ["GLOSSAPI_DEEPSEEK_LD_LIBRARY_PATH"] = ld_extra
-    os.environ["LD_LIBRARY_PATH"] = (
-        f"{ld_extra}:{os.environ.get('LD_LIBRARY_PATH','')}".rstrip(":")
-    )
+    os.environ.setdefault("GLOSSAPI_DEEPSEEK_MODEL_DIR", str(model_dir))
 
     corpus = Corpus(input_dir=input_dir, output_dir=output_dir)
     corpus.ocr(
@@ -100,7 +97,7 @@ def run_smoke(model_root: Path) -> None:
 
 
 def main() -> None:
-    model_dir_env = os.environ.get("DEEPSEEK_MODEL_DIR")
+    model_dir_env = os.environ.get("DEEPSEEK_MODEL_DIR") or os.environ.get("GLOSSAPI_DEEPSEEK_MODEL_DIR")
     if model_dir_env:
         model_root = Path(model_dir_env).expanduser().resolve()
     else:
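
The smoke script now accepts either the checkpoint directory itself or its parent, and the same `config.json` check appears in both `ensure_model_available` and `run_smoke`. As an illustration only, that rule could be captured in one helper. The name `resolve_model_dir` is hypothetical and not part of the commit; the `DeepSeek-OCR-2` subdirectory name and the `config.json` marker come from the diff.

```python
import tempfile
from pathlib import Path


def resolve_model_dir(model_root: Path, nested_name: str = "DeepSeek-OCR-2") -> Path:
    """Return the directory expected to hold the checkpoint files.

    If `model_root` already contains config.json it is treated as the
    checkpoint directory; otherwise the nested subdirectory is used.
    """
    if (model_root / "config.json").exists():
        return model_root
    return model_root / nested_name


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # No config.json yet: fall back to the nested layout.
        print(resolve_model_dir(root).name)
        # Once config.json exists, the root itself is the checkpoint dir.
        (root / "config.json").write_text("{}")
        print(resolve_model_dir(root) == root)
```

Calling one helper from both functions would keep the weight check and the exported `GLOSSAPI_DEEPSEEK_MODEL_DIR` value in agreement if the layout rule changes later.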
Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+[project]
+name = "glossapi-deepseek-runtime"
+version = "0.1.0"
+description = "UV-managed runtime for GlossAPI DeepSeek-OCR-2 execution"
+requires-python = ">=3.11,<3.13"
+dependencies = [
+    "glossapi[docling,deepseek]",
+    "torch==2.6.0",
+    "torchvision==0.21.0",
+    "torchaudio==2.6.0",
+]
+
+[dependency-groups]
+test = [
+    "pytest",
+    "fpdf2",
+]
+
+[tool.uv.sources]
+glossapi = { path = "../..", editable = true }
+torch = { index = "pytorch-cu118" }
+torchvision = { index = "pytorch-cu118" }
+torchaudio = { index = "pytorch-cu118" }
+
+[[tool.uv.index]]
+name = "pytorch-cu118"
+url = "https://download.pytorch.org/whl/cu118"
+explicit = true
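
The project file above pins `torch`, `torchvision`, and `torchaudio` and routes them to the `pytorch-cu118` index. A quick way to confirm that a synced environment matches those pins might look like the sketch below. It is not part of the commit: the helper names are hypothetical, and it assumes that wheels from that index report a local version suffix such as `+cu118`, which is stripped before comparing.

```python
from __future__ import annotations

from importlib import metadata

# Pins copied from the dependencies list in the project file above.
PINS = {"torch": "2.6.0", "torchvision": "0.21.0", "torchaudio": "2.6.0"}


def installed_version(name: str) -> str | None:
    """Return the installed distribution version, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_pin_mismatches(
    pins: dict[str, str], installed: dict[str, str | None]
) -> dict[str, tuple[str, str | None]]:
    """Map each mismatching package to (wanted, found)."""
    mismatches = {}
    for name, wanted in pins.items():
        found = installed.get(name)
        # Drop a local version label such as "+cu118" before comparing.
        public = found.split("+", 1)[0] if found else None
        if public != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    current = {name: installed_version(name) for name in PINS}
    print(find_pin_mismatches(PINS, current) or "all pins satisfied")
```

Run inside the DeepSeek venv, an empty result would suggest that the sync resolved the pinned versions; any entry shows the wanted version next to what is actually installed.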
