OpenDataLoader Docker API

This repository provides a Dockerized REST API wrapper around the opendataloader-pdf Python package.

The API accepts one or more PDF streams and returns the extracted output in one of several formats. The format is selected through content negotiation (the Accept header), explicit conversion options, or both.
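To illustrate the content-negotiation idea, the sketch below builds (without sending) a request that uploads a PDF and asks for one output format through the Accept header. The endpoint path `/convert`, the raw `application/pdf` request body, and the exact media types are assumptions made for illustration only; REST_API.md is the authoritative reference for the real routes and accepted values.

```python
import urllib.request

# Standard media types for the formats this README mentions. Whether the API
# honours exactly these values is an assumption; check REST_API.md.
ACCEPT_TYPES = {
    "json": "application/json",
    "markdown": "text/markdown",
    "html": "text/html",
    "text": "text/plain",
}

# Hypothetical endpoint path, used only for illustration.
CONVERT_PATH = "/convert"


def build_convert_request(base_url: str, pdf_bytes: bytes, fmt: str) -> urllib.request.Request:
    """Build (but do not send) a POST request asking for one output format."""
    if fmt not in ACCEPT_TYPES:
        raise ValueError(f"unsupported format: {fmt!r}")
    return urllib.request.Request(
        url=base_url.rstrip("/") + CONVERT_PATH,
        data=pdf_bytes,
        method="POST",
        headers={
            # Assumes the PDF is sent as the raw request body.
            "Content-Type": "application/pdf",
            # The Accept header carries the requested output format.
            "Accept": ACCEPT_TYPES[fmt],
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Switching `fmt` from `"json"` to `"markdown"` changes only the Accept header, which is the point of content negotiation.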

What OpenDataLoader Does

OpenDataLoader extracts structured content from PDFs for downstream use cases like search, indexing, RAG, and document automation. It can produce formats such as JSON, Markdown, HTML, text, and annotated PDF output through conversion options.

For full product details, capabilities, and documentation, see the official site:

Full API reference for this repository:

  • REST_API.md

Prerequisites

  • Docker Desktop (or Docker Engine + Compose)
  • Python 3.10+ (only needed to run the local test script)

Run the Container

From the repository root:

docker compose up -d --build

Check service health:

curl http://localhost:8080/health

Expected response:

{"status":"ok"}
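Right after `docker compose up`, the service may need a few seconds before it accepts connections. A minimal sketch of a readiness poll follows, assuming the health URL and response body shown above; the function names, timeout, and retry interval are illustrative choices, not part of this repository.

```python
import json
import time
import urllib.request


def fetch_json(url: str) -> object:
    """GET a URL and parse the response body as JSON."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def wait_for_health(url: str = "http://localhost:8080/health",
                    timeout: float = 60.0,
                    interval: float = 1.0,
                    fetch=fetch_json) -> bool:
    """Poll until the endpoint returns {"status": "ok"} or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if fetch(url) == {"status": "ok"}:
                return True
        except (OSError, ValueError):
            # OSError covers refused connections and HTTP errors;
            # ValueError covers a body that is not valid JSON.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

The `fetch` parameter exists so the polling logic can be exercised without a running container. In normal use, `wait_for_health()` with the defaults is enough.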

Test the Container

Option 1: Run automated API tests

python scripts/test_rest_api.py

This validates:

  • /health and /options
  • Single-file conversion with JSON and Markdown responses
  • Multi-file ZIP response
  • JSON options payload handling

Option 2: Use REST Client examples

Open:

  • docker-api/opendataloader-api-examples.http

Run the requests directly from the VS Code REST Client extension to test common scenarios.

Stop the Container

docker compose down

Configuration

The container reads configuration from:

  • /app/docker-api/config.yaml

Override this path with the environment variable:

  • APP_CONFIG
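The override described above can be pictured as a simple lookup with a fallback. The sketch below is not the container's actual code; the function name and the choice to treat an empty value as unset are assumptions.

```python
import os
from pathlib import PurePosixPath

# Default location named in this README.
DEFAULT_CONFIG = "/app/docker-api/config.yaml"


def resolve_config_path(env=None) -> PurePosixPath:
    """Return the APP_CONFIG path if set and non-empty, else the default."""
    env = os.environ if env is None else env
    # Assumption: a blank APP_CONFIG is treated the same as an unset one.
    override = env.get("APP_CONFIG", "").strip()
    return PurePosixPath(override or DEFAULT_CONFIG)
```

For example, `resolve_config_path({"APP_CONFIG": "/data/custom.yaml"})` yields `/data/custom.yaml`, while an empty mapping yields the default path.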

For JSON format reference, see:

  • docker-api/config.example.json

Notes

  • The API validates uploaded PDF streams and returns clear 400 errors for invalid/truncated uploads.
  • The implementation installs and uses opendataloader-pdf from PyPI inside the container.
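To avoid a round trip that ends in a 400 error, a client can run a cheap sanity check before uploading. The sketch below is a client-side heuristic only and is not the validation the API performs: it looks for the `%PDF-` header near the start of the data and the `%%EOF` marker near the end. The function name and the 1024-byte window are illustrative choices.

```python
def looks_like_complete_pdf(data: bytes, window: int = 1024) -> bool:
    """Heuristic check for a PDF header and end-of-file marker."""
    # A PDF starts with a "%PDF-" version header, possibly after some leading bytes.
    has_header = b"%PDF-" in data[:window]
    # A complete PDF ends with "%%EOF"; a truncated upload usually lacks it.
    has_trailer = b"%%EOF" in data[-window:]
    return has_header and has_trailer
```

A file that passes this check can still be rejected by the server, so the API's 400 response remains the source of truth.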