ProtagoDoc

A curated collection of useful tools organized as git submodules for easy management and deployment.

🛠️ Tools Collection

Each development or productivity tool in this repository is maintained as its own git submodule, so it can be pinned to a version and updated independently.

Document Processing

MinerU

Location: tools/mineru | Version: 1.3.10 (magic_pdf-1.3.11-released tag)

A tool for converting PDFs to Markdown and JSON. MinerU extracts document content precisely and supports:

✅ Multiple output formats (Markdown, JSON)
✅ OCR support for 84 languages
✅ Layout and span visualization
✅ CPU and GPU acceleration support
✅ Cross-platform compatibility (Windows, Linux, macOS)

Quick Start:

# Create and activate conda environment
conda create --name pd python=3.13 -y
source /opt/conda/etc/profile.d/conda.sh && conda activate pd

# Navigate to MinerU directory and install dependencies
cd tools/mineru
pip install -e ".[full]"  # quotes stop shells such as zsh from globbing the brackets
cd ../..

# Install required dependencies and download models
pip install requests huggingface_hub
python scripts/download_models_hf.py

# Run MinerU (in v1.3.10 the command is magic-pdf, not mineru)
magic-pdf -p <input_path> -o <output_path>

🚀 GPU Acceleration Setup

For optimal performance with CUDA GPU acceleration:

1. Verify GPU Support:

nvidia-smi  # Check GPU availability
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"

2. Configure GPU Acceleration: The model download script automatically creates a configuration file at ~/magic-pdf.json. To enable GPU acceleration, ensure the device mode is set to cuda:

{
    "device-mode": "cuda",
    "models-dir": "/path/to/downloaded/models"
}

📋 Configuration Template: See configs/magic-pdf-gpu.template.json for a complete configuration template with all available options.

3. Performance Comparison:

CPU Mode: ~16-17 it/s; the OCR language is switched to ch_lite
GPU Mode: 134+ it/s (about 8x faster); full language support
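
As a quick sanity check on the "8x" figure, the ratio can be worked out from the rates quoted above. This is only arithmetic on the README's own numbers (the CPU range taken as 16 to 17 it/s and the GPU rate as 134 it/s), not a new measurement:

```python
# Rates quoted in the comparison above, in iterations per second.
cpu_low, cpu_high = 16.0, 17.0
gpu_rate = 134.0

# Dividing the GPU rate by each CPU bound gives the speedup range.
speedup_min = gpu_rate / cpu_high  # against the faster CPU figure
speedup_max = gpu_rate / cpu_low   # against the slower CPU figure

print(f"GPU speedup: {speedup_min:.1f}x to {speedup_max:.1f}x")
```

Both bounds land close to 8, which is consistent with the rounded figure in the comparison.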

Example Usage:

magic-pdf -p demo/pdfs/small_ocr.pdf -o output/

📦 Getting Started

🚀 Complete Fresh Setup (From Scratch)

Here's the complete process to reproduce the magic-pdf setup:

1. Clone Repository with Submodules:

git clone --recursive https://github.com/protagolabs/ProtagoDoc.git
cd ProtagoDoc

2. Set up Conda Environment:

# Ensure conda is available and activate environment
source /opt/conda/etc/profile.d/conda.sh && conda activate pd
# or create new environment: conda create -n pd python=3.13 && conda activate pd

3. Install MinerU:

cd tools/mineru
pip install -e ".[full]"  # quotes stop shells such as zsh from globbing the brackets
cd ../..

4. Download Models and Configure GPU:

# Install required dependencies first
pip install requests huggingface_hub
python scripts/download_models_hf.py

5. Verify Setup:

python scripts/test_fresh_setup.py

6. Test Magic-PDF:

mkdir -p output
magic-pdf -p tools/mineru/demo/pdfs/small_ocr.pdf -o output/

Expected result: 134+ it/s on GPU, the rate measured on the 4x RTX 4090 setup this README was validated on; other hardware will differ 🔥

Clone with Submodules

To clone this repository with all submodules:

git clone --recursive https://github.com/protagolabs/ProtagoDoc.git

If you've already cloned the repository, initialize and update submodules:

git submodule init
git submodule update
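
To see which submodules a checkout is supposed to contain, the .gitmodules file can be read directly. The following is a minimal sketch, assuming the file uses git's usual INI-like layout with a path and url per submodule; the helper name list_submodules is hypothetical and not part of this repository:

```python
import configparser


def list_submodules(gitmodules_text):
    """Return {path: url} for each [submodule "..."] section in the text."""
    # Interpolation is disabled so a literal "%" in a URL is not misread.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(gitmodules_text)
    return {
        parser[section].get("path", ""): parser[section].get("url", "")
        for section in parser.sections()
        if section.startswith("submodule ")
    }


if __name__ == "__main__":
    try:
        with open(".gitmodules", encoding="utf-8") as fh:
            for path, url in list_submodules(fh.read()).items():
                print(f"{path} -> {url}")
    except FileNotFoundError:
        print("No .gitmodules file found; run this from the repository root.")
```

If a listed path is empty or missing on disk, the submodule has most likely not been initialized yet.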

Adding New Tools

To add a new tool as a submodule:

git submodule add <repository-url> tools/<tool-name>
git commit -m "Add <tool-name> submodule"

Updating Submodules

To update all submodules to their latest versions:

git submodule update --remote

To update a specific submodule:

# For MinerU (uses master branch)
cd tools/mineru
git pull origin master
cd ../..
git add tools/mineru
git commit -m "Update MinerU submodule"

# For other tools that might use main branch
cd tools/<tool-name>
git pull origin main  # or master, depending on the repository
cd ../..
git add tools/<tool-name>
git commit -m "Update <tool-name> submodule"

📁 Repository Structure

ProtagoDoc/
├── tools/                  # All tool submodules
│   └── mineru/            # MinerU - PDF to Markdown/JSON converter
├── scripts/               # Utility scripts
│   ├── download_models_hf.py  # Model download script (local)
│   └── test_fresh_setup.py    # Setup validation script
├── configs/               # Configuration templates
│   ├── magic-pdf-gpu.template.json  # GPU configuration template
│   └── README.md          # Configuration documentation
├── .gitmodules            # Submodule configuration
└── README.md              # This file

🔧 Troubleshooting

⚠️ Critical Setup Requirements

Always run in the correct conda environment:

# ALWAYS activate the environment first
source /opt/conda/etc/profile.d/conda.sh && conda activate pd

# Verify you're in the right environment (should show "pd")
echo $CONDA_DEFAULT_ENV

# If you are not in the pd environment, installs will fail or land in the wrong location
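
The same check can be scripted, for example at the top of a setup script. This is a minimal sketch, assuming the environment is named pd as in this README; check_conda_env is a hypothetical helper, and it relies only on the CONDA_DEFAULT_ENV variable that conda sets on activation:

```python
import os


def check_conda_env(expected="pd", environ=None):
    """Return (ok, message) saying whether `expected` is the active conda env."""
    environ = os.environ if environ is None else environ
    active = environ.get("CONDA_DEFAULT_ENV")
    if active == expected:
        return True, f"Active conda environment: {active}"
    if active is None:
        return False, "No conda environment is active."
    return False, f"Active conda environment is {active!r}, expected {expected!r}."


if __name__ == "__main__":
    ok, message = check_conda_env()
    print(message)
```

Passing the environment as a parameter keeps the function easy to test without activating anything.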

Submodule Update Issues

Error: fatal: couldn't find remote ref main

Some repositories use master rather than main as the default branch.
For MinerU, use git pull origin master.
To find a repository's default branch, run git remote show origin and read the "HEAD branch" line; git branch -r lists all remote branches.

Updating to a specific version:

# To update MinerU to a newer version tag
cd tools/mineru
git fetch origin
git checkout magic_pdf-1.3.11-released  # or desired version
cd ../..
git add tools/mineru
git commit -m "Update MinerU to version 1.3.10 (magic_pdf-1.3.11-released tag)"

Reset submodule to specific commit:

cd tools/mineru
git checkout ea619281ef43577da91247a9df60f53b12d47cbc  # current pinned commit (magic_pdf-1.3.11-released)
cd ../..
git add tools/mineru
git commit -m "Reset MinerU to pinned version 1.3.10 (magic_pdf-1.3.11-released tag)"

GPU Configuration Issues

Error: magic-pdf: command not found

CRITICAL: Ensure you're in the correct conda environment: conda activate pd
Ensure you've run the model download script: python scripts/download_models_hf.py
Check if MinerU is properly installed: pip show magic-pdf
Verify environment activation: echo $CONDA_DEFAULT_ENV should show pd

Error: Still using CPU despite CUDA configuration

Verify the configuration file exists: ls -la ~/magic-pdf.json

Check device mode setting:

python -c "from magic_pdf.libs.config_reader import get_device; print('Device:', get_device())"

Ensure device-mode is set to "cuda" in ~/magic-pdf.json:
```
{
    "device-mode": "cuda"
}
```
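
Editing the file by hand works, but the change can also be scripted. The sketch below assumes the config is plain JSON with a top-level "device-mode" key, as shown above; set_device_mode is a hypothetical helper, not part of magic-pdf. It leaves every other key untouched and raises FileNotFoundError if the config does not exist yet:

```python
import json
from pathlib import Path


def set_device_mode(config_path, mode="cuda"):
    """Set "device-mode" in a magic-pdf JSON config; return the previous value."""
    path = Path(config_path).expanduser()
    # read_text raises FileNotFoundError if the config has not been created.
    config = json.loads(path.read_text(encoding="utf-8"))
    previous = config.get("device-mode")
    config["device-mode"] = mode
    path.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return previous
```

A call such as set_device_mode("~/magic-pdf.json") would switch the default config to cuda; copying the file first is a cheap safeguard.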

Error: Missing model weights

# Re-download models if they're missing
python scripts/download_models_hf.py

GPU Memory Issues

Reduce batch size by modifying the configuration
Check available GPU memory: nvidia-smi
For GPUs with <6GB VRAM, consider using CPU mode
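
The VRAM guideline above can be turned into a small check. This is a rough sketch under stated assumptions: the 6 GB threshold is taken as 6 * 1024 MiB, the helper names are hypothetical, and the only external call is nvidia-smi's --query-gpu=memory.total option, which reports total memory per GPU in MiB:

```python
import shutil
import subprocess

MIN_VRAM_MIB = 6 * 1024  # the ~6 GB guideline above, expressed in MiB (assumption)


def choose_device_mode(vram_mib_per_gpu):
    """Return "cuda" if any GPU meets the VRAM guideline, otherwise "cpu"."""
    if any(mib >= MIN_VRAM_MIB for mib in vram_mib_per_gpu):
        return "cuda"
    return "cpu"


def query_vram_mib():
    """Return total memory in MiB for each GPU seen by nvidia-smi, or []."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    return [int(line) for line in result.stdout.splitlines() if line.strip().isdigit()]


if __name__ == "__main__":
    print("Suggested device-mode:", choose_device_mode(query_vram_mib()))
```

The result is only a suggestion for the device-mode setting; actual memory needs depend on the documents being processed.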

Performance Optimization

Expected GPU Performance: 130+ it/s for OCR processing
Expected CPU Performance: 16-17 it/s for OCR processing
If GPU performance is poor, check CUDA installation and drivers

🤝 Contributing

1. Fork the repository
2. Add your tool as a submodule in the tools/ directory
3. Update this README with tool documentation
4. Submit a pull request

📝 License

This repository serves as a collection hub. Each tool maintains its own license:

MinerU: AGPL-3.0 License

Last updated: 2025-07-25 - Fresh setup validated with 4x RTX 4090 GPUs achieving 134+ it/s performance
