|
| 1 | +# PDF Data Extractor Demo |
| 2 | + |
| 3 | +This demo application allows you to extract structured data from PDF documents using JSON schemas and AI models. |
| 4 | + |
| 5 | +## Features |
| 6 | + |
| 7 | +- 📄 Upload and process PDF files |
| 8 | +- 📋 Define custom JSON schemas for data extraction |
| 9 | +- 🎯 Pre-built schema examples (Invoice, Receipt, Form) |
| 10 | +- 📊 View extracted data with token usage statistics |
| 11 | +- ⚙️ Configurable temperature and model selection |
| 12 | + |
| 13 | +## Prerequisites |
| 14 | + |
| 15 | +Before running this demo, you need: |
| 16 | + |
| 17 | +1. **Node.js** (version 18 or higher) |
| 18 | +2. **Docker Model Runner** |
| 19 | +3. **A suitable AI model** for text extraction |
| 20 | + |
| 21 | +## Setup Instructions |
| 22 | + |
| 23 | +### 1. Enable Docker Model Runner |
| 24 | + |
| 25 | +**Using Docker Desktop:** |
| 26 | +- Open Docker Desktop settings |
| 27 | +- Go to the **AI** tab |
| 28 | +- Select **Enable Docker Model Runner** |
| 29 | +- Enable **host-side TCP support** on port `12434` (default) |
| 30 | + |
| 31 | +For detailed instructions, see the [Docker Model Runner documentation](https://docs.docker.com/ai/model-runner/get-started/#enable-docker-model-runner). |
| 32 | + |
| 33 | +**Using Standalone Docker Engine:** |
| 34 | +TCP support is enabled by default on port `12434`. |
| 35 | + |
| 36 | +### 2. Pull a Suitable Model |
| 37 | + |
| 38 | +You'll need a model capable of understanding and extracting text, for example: |
| 39 | + |
| 40 | +```bash |
| 41 | +# Pull a general-purpose model |
| 42 | +docker model pull ai/gemma3 |
| 43 | +``` |
| 44 | + |
| 45 | +To see available models, visit [Docker Hub - AI Models](https://hub.docker.com/r/ai). |
| 46 | + |
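Once a model is pulled, it can be useful to confirm that it is reachable over TCP before starting the demo. The following is a minimal sketch, not part of the demo itself. It assumes that Docker Model Runner serves an OpenAI-style model listing at `GET <base URL>/models` and that model ids may carry a tag suffix, which is why it matches by prefix. The function names `findModel` and `isModelAvailable` are hypothetical.

```javascript
// Return the first model entry whose id starts with the requested name.
// Prefix matching is an assumption: ids may carry a tag such as ":latest".
function findModel(listResponse, name) {
  const models = Array.isArray(listResponse?.data) ? listResponse.data : [];
  const match = models.find(
    (m) => typeof m.id === "string" && m.id.startsWith(name)
  );
  return match ?? null;
}

// Query the model listing under the given base URL (assumed OpenAI-compatible).
// Uses the global fetch available in Node.js 18 and later.
async function isModelAvailable(baseUrl, name) {
  const response = await fetch(`${baseUrl}/models`);
  if (!response.ok) {
    throw new Error(`Model listing failed with HTTP ${response.status}`);
  }
  return findModel(await response.json(), name) !== null;
}
```

With the default port, a call such as `isModelAvailable("http://127.0.0.1:12434/engines/v1", "ai/gemma3")` should resolve to `true` if the model was pulled and host-side TCP support is enabled.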
| 47 | +## Installation |
| 48 | + |
| 49 | +1. **Navigate to the demo directory:** |
| 50 | + ```bash |
| 51 | + cd demos/extractor |
| 52 | + ``` |
| 53 | + |
| 54 | +2. **Install dependencies:** |
| 55 | + ```bash |
| 56 | + npm install |
| 57 | + ``` |
| 58 | + |
| 59 | +3. **Start the server:** |
| 60 | + ```bash |
| 61 | + npm start |
| 62 | + ``` |
| 63 | + |
| 64 | + The server starts on `http://localhost:3000`. |
| 65 | + |
| 66 | +4. **Open the demo:** |
| 67 | + Open `demo.html` in your web browser. You can double-click the file or serve it with a local server. |
| 68 | + |
| 69 | +## Usage Guide |
| 70 | + |
| 71 | +### Basic Workflow |
| 72 | + |
| 73 | +1. **Configure API Settings** |
| 74 | + - **Base API URL**: Set to `http://127.0.0.1:12434/engines/v1` for Docker Model Runner |
| 75 | + - **Model**: Select from available models |
| 76 | + |
| 77 | +2. **Define Your Schema** |
| 78 | + - Use the provided examples (Invoice, Receipt, Form) or create your own |
| 79 | + - The schema defines what data to extract from the PDF |
| 80 | + - Use standard JSON Schema format with `type`, `properties`, etc. |
| 81 | + |
| 82 | +3. **Upload a PDF** |
| 83 | + - Click "Choose File" and select your PDF document |
| 84 | + - Supported: any PDF with a text layer. Scanned, image-only PDFs need OCR first. |
| 85 | + - You can use the sample PDF [invoice.pdf](invoice.pdf) |
| 86 | + |
| 87 | +4. **Extract Data** |
| 88 | + - Click the "Extract Data" button |
| 89 | + - Wait for processing (may take 10-30 seconds depending on PDF size and model) |
| 90 | + - View extracted data in the result section |
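The workflow above can be sketched as a direct call to the configured Base API URL. This is an illustration under stated assumptions, not the demo server's actual code. It assumes the endpoint accepts OpenAI-style chat completion requests at `/chat/completions`, that the PDF text has already been extracted into a string, and that the schema is passed to the model inside the system prompt. The schema fields and the names `invoiceSchema`, `buildExtractionRequest`, `parseJsonReply`, and `extract` are hypothetical.

```javascript
// Hypothetical invoice schema in standard JSON Schema form.
const invoiceSchema = {
  type: "object",
  properties: {
    invoiceNumber: { type: "string" },
    date: { type: "string" },
    total: { type: "number" },
    lineItems: {
      type: "array",
      items: {
        type: "object",
        properties: {
          description: { type: "string" },
          amount: { type: "number" },
        },
        required: ["description", "amount"],
      },
    },
  },
  required: ["invoiceNumber", "total"],
};

// Build an OpenAI-style chat completion body that embeds the schema in the prompt.
function buildExtractionRequest(model, schema, pdfText, temperature = 0) {
  return {
    model,
    temperature,
    messages: [
      {
        role: "system",
        content:
          "Extract data from the document. Reply with JSON only, matching this JSON Schema:\n" +
          JSON.stringify(schema),
      },
      { role: "user", content: pdfText },
    ],
  };
}

// Parse the model reply, tolerating an optional Markdown code fence around the JSON.
function parseJsonReply(content) {
  const cleaned = content
    .trim()
    .replace(/^`{3}(?:json)?\s*/i, "")
    .replace(/\s*`{3}$/, "");
  return JSON.parse(cleaned);
}

// Send the request and return the extracted data plus token usage, if reported.
async function extract(baseUrl, model, schema, pdfText, temperature = 0) {
  const response = await fetch(`${baseUrl}/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildExtractionRequest(model, schema, pdfText, temperature)),
  });
  if (!response.ok) {
    throw new Error(`Extraction request failed with HTTP ${response.status}`);
  }
  const result = await response.json();
  return {
    data: parseJsonReply(result.choices[0].message.content),
    usage: result.usage ?? null,
  };
}
```

A low temperature is used as the default here because extraction usually benefits from deterministic output. The reply is not validated against the schema in this sketch, so a real application would likely add a validation step before trusting the result.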