ProtagoDoc

A curated collection of useful tools organized as git submodules for easy management and deployment.

🛠️ Tools Collection

This repository serves as a centralized hub for various development and productivity tools, each maintained as a git submodule for easy version control and updates.

Document Processing

Location: tools/mineru | Version: 1.3.10 (magic_pdf-1.3.11-released tag)

A high-quality tool for converting PDF files to Markdown and JSON. MinerU is a comprehensive solution for precise document content extraction with support for:

  • ✅ Multiple output formats (Markdown, JSON)
  • ✅ OCR support for 84 languages
  • ✅ Layout and span visualization
  • ✅ CPU and GPU acceleration support
  • ✅ Cross-platform compatibility (Windows, Linux, macOS)

Quick Start:

# Create and activate conda environment
conda create --name pd python=3.13 -y
source /opt/conda/etc/profile.d/conda.sh && conda activate pd

# Navigate to MinerU directory and install dependencies
cd tools/mineru
pip install -e ".[full]"  # quotes keep shells such as zsh from globbing the brackets
cd ../..

# Install required dependencies and download models
pip install requests huggingface_hub
python scripts/download_models_hf.py

# Use MinerU (note: command is magic-pdf, not mineru in v1.3.10)
magic-pdf -p <input_path> -o <output_path>
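
When several PDFs need converting, the command above can be driven from Python. The following is a minimal sketch, assuming only the `-p` and `-o` flags shown above; the helper names `build_command` and `convert_all` are hypothetical and not part of MinerU.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, output_dir):
    """Return the argv list for one magic-pdf run (-p input, -o output)."""
    return ["magic-pdf", "-p", str(pdf_path), "-o", str(output_dir)]


def convert_all(input_dir, output_dir):
    """Run magic-pdf once per PDF found directly inside input_dir."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        # check=True raises CalledProcessError if magic-pdf exits non-zero.
        subprocess.run(build_command(pdf, output_dir), check=True)
```

Calling `convert_all("tools/mineru/demo/pdfs", "output")` from the repository root would process the demo PDFs one by one, provided the `pd` environment is active so that `magic-pdf` is on `PATH`.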

🚀 GPU Acceleration Setup

For optimal performance with CUDA GPU acceleration:

1. Verify GPU Support:

nvidia-smi  # Check GPU availability
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
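
The two checks above can also be combined into one script. This is a sketch only: the function name `gpu_prerequisites` and the keys of the returned dictionary are illustrative, and the PyTorch import is guarded so the script still runs where PyTorch is not installed.

```python
import shutil


def gpu_prerequisites():
    """Report whether nvidia-smi is on PATH and whether PyTorch sees CUDA."""
    report = {
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        "torch_installed": False,
        "cuda_available": False,
    }
    try:
        import torch  # imported lazily so the check works without PyTorch
        report["torch_installed"] = True
        report["cuda_available"] = bool(torch.cuda.is_available())
    except ImportError:
        pass
    return report


print(gpu_prerequisites())
```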

2. Configure GPU Acceleration: The model download script automatically creates a configuration file at ~/magic-pdf.json. To enable GPU acceleration, ensure the device mode is set to cuda:

{
    "device-mode": "cuda",
    "models-dir": "/path/to/downloaded/models"
}

📋 Configuration Template: See configs/magic-pdf-gpu.template.json for a complete configuration template with all available options.
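
If the generated file has a different device mode, it can be switched without touching the other keys. The snippet below is a minimal sketch that assumes the file is plain JSON with the `device-mode` key shown above; `set_device_mode` is a hypothetical helper, not something shipped with MinerU or this repository.

```python
import json
from pathlib import Path


def set_device_mode(config_path, mode="cuda"):
    """Set "device-mode" in a magic-pdf JSON config, keeping every other key."""
    path = Path(config_path).expanduser()
    config = json.loads(path.read_text(encoding="utf-8"))
    config["device-mode"] = mode
    path.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return config
```

For example, `set_device_mode("~/magic-pdf.json")` rewrites the file with `"device-mode": "cuda"`, and `set_device_mode("~/magic-pdf.json", "cpu")` switches it back.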

3. Performance Comparison:

  • CPU Mode: ~16-17 it/s processing speed; the OCR language is switched to ch_lite
  • GPU Mode: ~134+ it/s processing speed (about 8x faster than CPU mode), full language support
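
The 8x figure follows from the two rates quoted above. As a quick check, using the midpoint of the CPU range (an assumption made here for the arithmetic, not a measured value):

```python
cpu_rates = (16.0, 17.0)  # it/s, CPU range quoted above
gpu_rate = 134.0          # it/s, GPU figure quoted above

cpu_mid = sum(cpu_rates) / len(cpu_rates)  # 16.5 it/s
speedup = gpu_rate / cpu_mid
print(f"GPU speedup: about {speedup:.1f}x")  # GPU speedup: about 8.1x
```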

Example Usage:

# Run from the repository root
magic-pdf -p tools/mineru/demo/pdfs/small_ocr.pdf -o output/

📦 Getting Started

🚀 Complete Fresh Setup (From Scratch)

Here's the complete process to reproduce the magic-pdf setup:

1. Clone Repository with Submodules:

git clone --recursive https://github.com/protagolabs/ProtagoDoc.git
cd ProtagoDoc

2. Set up Conda Environment:

# Ensure conda is available and activate environment
source /opt/conda/etc/profile.d/conda.sh && conda activate pd
# or create new environment: conda create -n pd python=3.13 && conda activate pd

3. Install MinerU:

cd tools/mineru
pip install -e ".[full]"  # quotes keep shells such as zsh from globbing the brackets
cd ../..

4. Download Models and Configure GPU:

# Install required dependencies first
pip install requests huggingface_hub
python scripts/download_models_hf.py

5. Verify Setup:

python scripts/test_fresh_setup.py

6. Test Magic-PDF:

mkdir -p output
magic-pdf -p tools/mineru/demo/pdfs/small_ocr.pdf -o output/

Expected result: ~134+ it/s GPU processing speed on the validated setup (4x RTX 4090); other hardware will differ 🔥

Clone with Submodules

To clone this repository with all submodules:

git clone --recursive https://github.com/protagolabs/ProtagoDoc.git

If you've already cloned the repository, initialize and update submodules:

git submodule update --init --recursive

Adding New Tools

To add a new tool as a submodule:

git submodule add <repository-url> tools/<tool-name>
git commit -m "Add <tool-name> submodule"
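
`git submodule add` records each tool's path and URL in `.gitmodules`. To see what is registered, the file can be read with Python's standard `configparser`. This is a sketch that should be adequate for a simple `.gitmodules`, since git's config syntax is only INI-like; `list_submodules` is a hypothetical helper.

```python
import configparser


def list_submodules(gitmodules_text):
    """Return {path: url} for each [submodule "..."] section."""
    # interpolation=None keeps '%' characters in URLs from being interpreted.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(gitmodules_text)
    return {
        parser[section]["path"]: parser[section]["url"]
        for section in parser.sections()
        if section.startswith("submodule ")
    }
```

Reading the file from the repository root with `list_submodules(open(".gitmodules").read())` returns one entry per registered tool.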

Updating Submodules

To update all submodules to their latest versions:

git submodule update --remote

To update a specific submodule:

# For MinerU (uses master branch)
cd tools/mineru
git pull origin master
cd ../..
git add tools/mineru
git commit -m "Update MinerU submodule"

# For other tools that might use main branch
cd tools/<tool-name>
git pull origin main  # or master, depending on the repository
cd ../..
git add tools/<tool-name>
git commit -m "Update <tool-name> submodule"

📁 Repository Structure

ProtagoDoc/
├── tools/                  # All tool submodules
│   └── mineru/            # MinerU - PDF to Markdown/JSON converter
├── scripts/               # Utility scripts
│   ├── download_models_hf.py  # Model download script (local)
│   └── test_fresh_setup.py    # Setup validation script
├── configs/               # Configuration templates
│   ├── magic-pdf-gpu.template.json  # GPU configuration template
│   └── README.md          # Configuration documentation
├── .gitmodules            # Submodule configuration
└── README.md              # This file

🔧 Troubleshooting

⚠️ Critical Setup Requirements

Always run in the correct conda environment:

# ALWAYS activate the environment first
source /opt/conda/etc/profile.d/conda.sh && conda activate pd

# Verify you're in the right environment (should show "pd")
echo $CONDA_DEFAULT_ENV

# If you are not in the pd environment, installations will fail or land in the wrong location
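
The same check can be placed at the top of a Python script so that it refuses to run in the wrong environment. This is a minimal sketch that relies on the `CONDA_DEFAULT_ENV` variable used above; the function name `in_expected_env` is illustrative.

```python
import os


def in_expected_env(expected="pd", environ=None):
    """True when CONDA_DEFAULT_ENV names the expected conda environment."""
    environ = os.environ if environ is None else environ
    return environ.get("CONDA_DEFAULT_ENV") == expected


if not in_expected_env():
    print("Warning: the 'pd' conda environment is not active")
```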

Submodule Update Issues

Error: fatal: couldn't find remote ref main

  • Some repositories use master as the default branch instead of main
  • For MinerU: use git pull origin master
  • Check the default branch with: git branch -r

Updating to a specific version:

# To update MinerU to a newer version tag
cd tools/mineru
git fetch origin
git checkout magic_pdf-1.3.11-released  # or desired version
cd ../..
git add tools/mineru
git commit -m "Update MinerU to version 1.3.10 (magic_pdf-1.3.11-released tag)"

Reset submodule to specific commit:

cd tools/mineru
git checkout ea619281ef43577da91247a9df60f53b12d47cbc  # current pinned commit (magic_pdf-1.3.11-released)
cd ../..
git add tools/mineru
git commit -m "Reset MinerU to pinned version 1.3.10 (magic_pdf-1.3.11-released tag)"
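
To confirm that the submodule really sits on the pinned commit, its `HEAD` can be compared with the expected hash. The sketch below uses `git rev-parse HEAD`; the helper names `matches_pin` and `submodule_head` are hypothetical, and abbreviated hashes are accepted as prefixes.

```python
import subprocess


def matches_pin(head_sha, pinned_sha):
    """True when head_sha starts with the (possibly abbreviated) pinned hash."""
    head = head_sha.strip().lower()
    pin = pinned_sha.strip().lower()
    return bool(pin) and head.startswith(pin)


def submodule_head(path="tools/mineru"):
    """Return the commit currently checked out in the submodule at path."""
    result = subprocess.run(
        ["git", "-C", path, "rev-parse", "HEAD"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()
```

From the repository root, `matches_pin(submodule_head(), "ea619281ef43577da91247a9df60f53b12d47cbc")` should be true after the reset above.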

GPU Configuration Issues

Error: magic-pdf: command not found

  • CRITICAL: Ensure you're in the correct conda environment: conda activate pd
  • Ensure you've run the model download script: python scripts/download_models_hf.py
  • Check if MinerU is properly installed: pip show magic-pdf
  • Verify environment activation: echo $CONDA_DEFAULT_ENV should show pd

Error: Still using CPU despite CUDA configuration

  1. Verify the configuration file exists: ls -la ~/magic-pdf.json
  2. Check device mode setting:
    python -c "from magic_pdf.libs.config_reader import get_device; print('Device:', get_device())"
  3. Ensure device-mode is set to "cuda" in ~/magic-pdf.json:
    {
        "device-mode": "cuda"
    }
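
The checks in this list can be bundled into a single diagnostic. This is a sketch under the assumption that the config is plain JSON with the `device-mode` and `models-dir` keys shown earlier in this README; `config_problems` is a hypothetical helper and does not replace MinerU's own `get_device`.

```python
import json
from pathlib import Path


def config_problems(config_path="~/magic-pdf.json"):
    """Return a list of human-readable problems found in the config file."""
    path = Path(config_path).expanduser()
    if not path.is_file():
        return [f"{path} does not exist"]
    config = json.loads(path.read_text(encoding="utf-8"))
    problems = []
    if config.get("device-mode") != "cuda":
        problems.append(f"device-mode is {config.get('device-mode')!r}, expected 'cuda'")
    models_dir = config.get("models-dir")
    if not models_dir or not Path(models_dir).is_dir():
        problems.append(f"models-dir {models_dir!r} is not an existing directory")
    return problems
```

An empty list means the file exists, requests CUDA, and points at an existing models directory.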

Error: Missing model weights

# Re-download models if they're missing
python scripts/download_models_hf.py

GPU Memory Issues

  • Reduce batch size by modifying the configuration
  • Check available GPU memory: nvidia-smi
  • For GPUs with <6GB VRAM, consider using CPU mode
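
The memory check can be scripted with `nvidia-smi`'s CSV query output. A minimal sketch follows, assuming the 6 GB threshold mentioned above; the function names are illustrative, and `query_memory` needs an NVIDIA driver to be installed.

```python
import subprocess

MIN_VRAM_MIB = 6 * 1024  # the 6 GB threshold mentioned above, in MiB


def parse_memory(csv_text):
    """Parse 'total, free' MiB rows (one per GPU) into (total, free) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        total, free = (int(field) for field in line.split(","))
        rows.append((total, free))
    return rows


def query_memory():
    """Ask nvidia-smi for total and free memory of every GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total,memory.free",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    )
    return parse_memory(result.stdout)


def suggest_cpu_mode(rows):
    """True when no GPU reaches the threshold (also true when rows is empty)."""
    return all(total < MIN_VRAM_MIB for total, _ in rows)
```

On a machine with a driver installed, `suggest_cpu_mode(query_memory())` indicates whether falling back to CPU mode is worth considering.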

Performance Optimization

  • Expected GPU Performance: 130+ it/s for OCR processing
  • Expected CPU Performance: 16-17 it/s for OCR processing
  • If GPU performance is poor, check CUDA installation and drivers

🤝 Contributing

  1. Fork the repository
  2. Add your tool as a submodule in the tools/ directory
  3. Update this README with tool documentation
  4. Submit a pull request

📝 License

This repository serves as a collection hub. Each tool maintains its own license:

  • MinerU: AGPL-3.0 License

Last updated: 2025-07-25 - Fresh setup validated with 4x RTX 4090 GPUs achieving 134+ it/s performance
