A Gradio-based demo application for comparing state-of-the-art OCR models: DeepSeek-OCR, Dots.OCR, HunyuanOCR, and Nanonets-OCR2-3B. Users can upload images, select models, apply custom prompts, and generate recognized text or visual grounding results. Supports tasks like free OCR, markdown conversion, figure parsing, and object location.
- Multi-Model Comparison: Switch between DeepSeek-OCR (with resolution and task options), Dots.OCR, HunyuanOCR, and Nanonets-OCR2-3B for flexible OCR workflows.
- Image Upload and Processing: Supports direct upload or clipboard paste; handles various image formats with PIL.
- Customizable Prompts: Tailor queries for text extraction, detection, or specific tasks (e.g., "Extract all text" or "Locate the red car").
- DeepSeek-Specific Tools: Resolution presets (Tiny to Gundam), task types (Free OCR, Markdown, Parse Figure, Locate Object), and bounding box visualization.
- Advanced Generation Controls: Adjust max new tokens (up to 8192), temperature, top-p, and top-k for fine-tuned outputs.
- Streaming Output: Real-time text generation for Dots.OCR and Nanonets-OCR2-3B; non-streaming for others.
- Visual Results: DeepSeek outputs annotated images with bounding boxes or grounding visuals.
- Custom Theme: SteelBlueTheme for a modern, gradient-based UI with enhanced readability.
- Examples and Queueing: Built-in example images; supports queued inferences for up to 30 concurrent users.
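
Streaming in a Gradio app is commonly implemented with a generator handler, where each yielded value refreshes the output component. The sketch below illustrates only that pattern. `fake_token_stream` and `stream_ocr_text` are hypothetical names standing in for a model's token streamer and the app's handler, neither of which is shown in this README.

```python
from typing import Iterable, Iterator


def fake_token_stream() -> Iterator[str]:
    """Hypothetical stand-in for a model's token streamer."""
    yield from ["Total", ": ", "$12", ".50"]


def stream_ocr_text(chunks: Iterable[str]) -> Iterator[str]:
    """Yield the accumulated text after each new chunk arrives."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # In a Gradio handler, each yield would refresh the output textbox.
        yield buffer


if __name__ == "__main__":
    for partial in stream_ocr_text(fake_token_stream()):
        print(partial)
```

A non-streaming model would instead return the full string once generation finishes.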
- Python 3.10 or higher.
- CUDA-compatible GPU (recommended for bfloat16 models; falls back to CPU).
- Git for cloning submodules.
- Hugging Face account (optional, for model caching via `huggingface_hub`).
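
The GPU recommendation and CPU fallback above can be expressed as a small selection step before loading a model. This is a minimal sketch, assuming PyTorch is the backend. `pick_device_and_dtype` and `detect` are hypothetical helpers, not functions taken from `app.py`.

```python
def pick_device_and_dtype(cuda_available: bool, bf16_supported: bool) -> tuple[str, str]:
    """Return (device, dtype name): bfloat16 on capable GPUs, float32 otherwise."""
    if cuda_available and bf16_supported:
        return "cuda", "bfloat16"
    if cuda_available:
        return "cuda", "float32"
    return "cpu", "float32"


def detect() -> tuple[str, str]:
    """Probe the local machine; treat a missing torch install as CPU-only."""
    try:
        import torch
    except ImportError:
        return pick_device_and_dtype(False, False)
    cuda = torch.cuda.is_available()
    bf16 = cuda and torch.cuda.is_bf16_supported()
    return pick_device_and_dtype(cuda, bf16)


if __name__ == "__main__":
    print(detect())
```

The returned dtype name could then be mapped to `torch.bfloat16` or `torch.float32` when calling a model's loading function.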
- Clone the repository:

      git clone https://github.com/PRITHIVSAKTHIUR/Super-OCRs-Demo.git
      cd Super-OCRs-Demo

- Install dependencies: Create a `requirements.txt` file with the following content, then run `pip install -r requirements.txt`.

  requirements.txt content:

      flash-attn @ https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
      git+https://github.com/huggingface/transformers@82a06db03535c49aa987719ed0746a76093b1ec4
      git+https://github.com/huggingface/accelerate.git
      git+https://github.com/huggingface/diffusers.git
      git+https://github.com/huggingface/peft.git
      huggingface_hub
      gradio==5.49.1
      qwen-vl-utils
      sentencepiece
      opencv-python
      torch==2.6.0
      torchvision
      supervision
      matplotlib
      easydict
      kernels
      einops
      spaces
      addict
      hf_xet
      numpy
      av

- Start the application:

      python app.py

  The demo launches at `http://localhost:7860` (or the provided URL if using Spaces).
- Select Model: Choose from the radio buttons (default: DeepSeek-OCR).
  - DeepSeek: Adjust resolution (e.g., "Gundam (Recommended)") and task (e.g., "Convert to Markdown").
  - Others: Use the custom prompt textbox for queries like "Detect and extract all text with coordinates."
- Upload Image: Drag-and-drop or paste an image (supports examples like receipts, figures, or documents).
- Configure Settings:
  - For "Locate Object" in DeepSeek, enter reference text (e.g., "the title").
  - Tune advanced sliders for generation quality.
- Run Inference: Click "Perform OCR" to process. Outputs stream to the textbox (with copy button); DeepSeek may show an annotated image.
- View Results:
  - Text: Raw OCR output, markdown, or formatted coordinates.
  - Image: Bounding boxes in red for detected elements (DeepSeek only).
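
To make the bounding-box results concrete, here is a rough sketch of turning grounding text into pixel boxes. It assumes the output uses `<|ref|>label<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>` tags with coordinates normalized to a 0-999 range. Both the tag format and the scale are assumptions to verify against the model's real output, and `parse_grounding` is a hypothetical helper rather than the app's own code.

```python
import ast
import re

# Assumed tag layout; verify against actual model output before relying on it.
GROUNDING_PATTERN = re.compile(
    r"<\|ref\|>(.*?)<\|/ref\|><\|det\|>(.*?)<\|/det\|>", re.DOTALL
)


def parse_grounding(text, width, height, scale=999):
    """Return a list of (label, (x1, y1, x2, y2)) in pixel coordinates."""
    boxes = []
    for label, raw in GROUNDING_PATTERN.findall(text):
        try:
            coords = ast.literal_eval(raw.strip())
        except (ValueError, SyntaxError):
            continue  # skip spans whose coordinates cannot be parsed
        if not isinstance(coords, (list, tuple)) or not coords:
            continue
        if isinstance(coords[0], (int, float)):
            coords = [coords]  # accept a single flat box as well
        for box in coords:
            if not isinstance(box, (list, tuple)) or len(box) != 4:
                continue
            x1, y1, x2, y2 = box
            boxes.append((label, (
                round(x1 / scale * width),
                round(y1 / scale * height),
                round(x2 / scale * width),
                round(y2 / scale * height),
            )))
    return boxes


if __name__ == "__main__":
    sample = "<|ref|>title<|/ref|><|det|>[[0, 0, 999, 333]]<|/det|>"
    print(parse_grounding(sample, width=1000, height=600))
```

The resulting pixel boxes could be passed to a drawing routine such as `PIL.ImageDraw.Draw(image).rectangle(box, outline="red")` to produce the annotated image.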
- Upload a receipt image.
- Select Dots.OCR, prompt: "Extract items and prices."
- Adjust temperature to 0.1 for more consistent, near-deterministic results.
- Output: Structured text list.
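
The slider values in a workflow like this typically end up as keyword arguments to a Hugging Face `generate` call. A minimal sketch under that assumption follows. `build_generation_kwargs` and its default values are hypothetical, and only the 8192-token ceiling comes from this README. Note that a low temperature still samples; greedy decoding (`do_sample=False`) is the fully deterministic option.

```python
def build_generation_kwargs(max_new_tokens=2048, temperature=0.1, top_p=0.9, top_k=50):
    """Translate UI slider values into generate() keyword arguments."""
    # Clamp to the documented ceiling of 8192 new tokens.
    kwargs = {"max_new_tokens": max(1, min(int(max_new_tokens), 8192))}
    if temperature <= 0:
        # Temperature 0 is treated as greedy decoding (assumption of this sketch).
        kwargs["do_sample"] = False
    else:
        kwargs.update(
            do_sample=True,
            temperature=float(temperature),
            top_p=float(top_p),
            top_k=int(top_k),
        )
    return kwargs


if __name__ == "__main__":
    print(build_generation_kwargs(temperature=0.1))
```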
| Model Name | Key Capabilities | Notes |
|---|---|---|
| DeepSeek-OCR-Latest-BF16.I64 | Free OCR, Markdown, Figure Parsing, Object Location | Visual grounding with bounding boxes; resolution presets. |
| Dots.OCR-Latest-BF16 | General text extraction; streaming | Qwen-based; custom prompts for flexibility. |
| HunyuanOCR | Detection and recognition with coordinates | Tencent model; handles Chinese/English well. |
| Nanonets-OCR2-3B | High-accuracy extraction; streaming | Qwen2.5-VL; suitable for complex layouts. |
- Model Loading Errors: Ensure CUDA is installed for GPU; use `torch.float32` fallback if bfloat16 fails.
- Out of Memory: Reduce resolution in DeepSeek or max_new_tokens; clear cache with `torch.cuda.empty_cache()`.
- Import Issues: Install `spaces` only if deploying to Hugging Face Spaces; mock it locally.
- Generation Loops: Hunyuan may repeat; cleaned automatically via `clean_repeated_substrings`.
- UI Visibility: Model changes toggle DeepSeek-specific groups dynamically.
- Queue Full: Increase `max_size` in `demo.queue()` for high traffic.
Contributions welcome! Open issues for bugs or features (e.g., more models, export to JSON). Fork, branch, and PR with tests. Repository: https://github.com/PRITHIVSAKTHIUR/Super-OCRs-Demo.git
Apache License 2.0. See LICENSE for details.
Built by Prithiv Sakthi. Report issues via the repository.