# AI Document Processing Workflow with Structured Streaming

A Databricks Asset Bundle demonstrating **incremental document processing** using `ai_parse_document`, `ai_query`, and Databricks Workflows with Structured Streaming.

## Overview

This example shows how to build an incremental workflow that:

1. **Parses** PDFs and images using [`ai_parse_document`](https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_parse_document)
2. **Extracts** clean text with incremental processing
3. **Analyzes** content using [`ai_query`](https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_query) with LLMs

All stages run as Python notebook tasks in a Databricks Workflow, using Structured Streaming on serverless compute.

## Architecture

```
Source Documents (UC Volume)
        ↓
Task 1: ai_parse_document → parsed_documents_raw (variant)
        ↓
Task 2: text extraction   → parsed_documents_text (string)
        ↓
Task 3: ai_query          → parsed_documents_structured (json)
```

### Key Features

- **Incremental processing**: Structured Streaming checkpoints ensure only new files are processed on each run
- **Serverless compute**: all tasks run on serverless compute for cost efficiency
- **Task dependencies**: sequential execution with automatic dependency management
- **Parameterized**: catalog, schema, volumes, and table names are configurable via bundle variables
- **Error handling**: parsing failures are handled gracefully
- **Visual debugging**: an interactive notebook for inspecting results

## Prerequisites

- A Databricks workspace with Unity Catalog enabled
- Databricks CLI v0.218.0+
- Unity Catalog volumes for:
  - Source documents (PDFs/images)
  - Parsed output images
  - Streaming checkpoints
- Access to AI Functions (`ai_parse_document`, `ai_query`)

## Quick Start

1. **Install and authenticate**
   ```bash
   databricks auth login --host https://your-workspace.cloud.databricks.com
   ```

2. **Configure** `databricks.yml` with your workspace settings

3. **Validate** the bundle configuration
   ```bash
   databricks bundle validate
   ```

4. **Deploy**
   ```bash
   databricks bundle deploy
   ```

5. **Upload documents** to your source volume

6. **Run the workflow** from the Databricks UI (Workflows)

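Alternatively, the deployed job can be triggered straight from the CLI. The job key below is a placeholder; use the resource key defined in `resources/ai_parse_document_workflow.job.yml`:

```shell
# Trigger the deployed workflow without opening the UI.
# "ai_parse_document_workflow" is an assumed job resource key.
databricks bundle run ai_parse_document_workflow
```
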
## Configuration

Edit `databricks.yml`:

```yaml
variables:
  catalog: main                                                # Your catalog
  schema: default                                              # Your schema
  source_volume_path: /Volumes/main/default/source_documents   # Source PDFs
  output_volume_path: /Volumes/main/default/parsed_output      # Parsed images
  checkpoint_base_path: /tmp/checkpoints/ai_parse_workflow     # Checkpoints
  raw_table_name: parsed_documents_raw                         # Table names
  text_table_name: parsed_documents_text
  structured_table_name: parsed_documents_structured
```

## Workflow Tasks

### Task 1: Document Parsing
**File**: `src/transformations/01_parse_documents.py`

Uses `ai_parse_document` to extract text, tables, and metadata from PDFs and images:
- Reads files from the source volume via Structured Streaming
- Stores the variant output, including bounding boxes
- Incremental: checkpointed streaming prevents reprocessing
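
As a rough sketch (not the bundle's actual source), Task 1 could look like the following: Auto Loader discovers new files and `ai_parse_document` runs over the raw bytes. All names, paths, and parameters here are illustrative, and parser options (such as image output) are omitted:

```python
# Hypothetical sketch of 01_parse_documents.py; table names, paths, and the
# surrounding bundle parameters are placeholders, not the bundle's real code.

def parse_documents(spark, source_path: str, raw_table: str, checkpoint: str) -> None:
    """Incrementally parse new documents from a UC volume into a variant column."""
    files = (
        spark.readStream.format("cloudFiles")          # Auto Loader: picks up new files only
        .option("cloudFiles.format", "binaryFile")     # read PDFs/images as raw bytes
        .load(source_path)
    )
    parsed = files.selectExpr(
        "path",
        "modificationTime",
        "ai_parse_document(content) AS parsed",        # variant output: text, tables, bboxes
    )
    (
        parsed.writeStream
        .option("checkpointLocation", checkpoint)      # checkpoint = no reprocessing on rerun
        .trigger(availableNow=True)                    # drain all new files, then stop
        .toTable(raw_table)
    )
```

`trigger(availableNow=True)` processes everything new and then stops, which suits a scheduled Workflow task better than a continuously running stream.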

### Task 2: Text Extraction
**File**: `src/transformations/02_extract_text.py`

Extracts clean, concatenated text using `transform()`:
- Reads from the previous task's table via streaming
- Handles both parser v1.0 and v2.0 output formats
- Uses `transform()` for efficient per-element text extraction
- Includes error handling for failed parses
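
A sketch of how the extraction expression might be built. The variant path `document.elements[*].content` is an assumed layout, not the documented `ai_parse_document` schema, and the real notebook handles both parser versions:

```python
# Hypothetical sketch of the text-extraction step in 02_extract_text.py.
# The variant path (document.elements[*].content) is an assumed layout;
# adjust it to the parser version actually in use.

def text_extraction_expr(parsed_col: str = "parsed") -> str:
    """Build a SQL expression that concatenates per-element text with transform()."""
    return (
        f"concat_ws('\\n', transform({parsed_col}:document.elements::array<variant>, "
        "el -> el:content::string))"
    )

def extract_text(spark, raw_table: str, text_table: str, checkpoint: str) -> None:
    (
        spark.readStream.table(raw_table)              # streaming read keeps this incremental
        .selectExpr("path", f"{text_extraction_expr()} AS text")
        .writeStream
        .option("checkpointLocation", checkpoint)
        .trigger(availableNow=True)
        .toTable(text_table)
    )
```
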

### Task 3: AI Query Extraction
**File**: `src/transformations/03_extract_structured_data.py`

Applies an LLM to extract structured insights:
- Reads from the text table via streaming
- Uses `ai_query` with Claude Sonnet 4
- Customizable prompt for domain-specific extraction
- Outputs structured JSON
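
A sketch of the pattern: the prompt and the `databricks-claude-sonnet-4` endpoint name are assumptions, and the real notebook's prompt is domain-specific:

```python
# Hypothetical sketch of 03_extract_structured_data.py. Endpoint name and
# prompt are placeholders to show the ai_query-over-streaming pattern.

EXTRACTION_PROMPT = (
    "Extract the document type, key entities, and a one-sentence summary "
    "from the text below. Respond with a single JSON object."
)

def structured_extraction_expr(
    text_col: str = "text",
    endpoint: str = "databricks-claude-sonnet-4",      # assumed endpoint name
) -> str:
    """Build the ai_query() SQL expression applied to each row of extracted text."""
    return (
        f"ai_query('{endpoint}', "
        f"concat('{EXTRACTION_PROMPT}\\n\\n', {text_col}))"
    )

def extract_structured(spark, text_table: str, out_table: str, checkpoint: str) -> None:
    (
        spark.readStream.table(text_table)             # incremental: only new rows
        .selectExpr("path", f"{structured_extraction_expr()} AS structured_json")
        .writeStream
        .option("checkpointLocation", checkpoint)
        .trigger(availableNow=True)
        .toTable(out_table)
    )
```
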

## Visual Debugger

The included notebook visualizes parsing results with interactive bounding boxes.

**Open**: `src/explorations/ai_parse_document -- debug output.py`

**Configure widgets**:
- `input_file`: `/Volumes/main/default/source_docs/sample.pdf`
- `image_output_path`: `/Volumes/main/default/parsed_out/`
- `page_selection`: `all` (or a range such as `1-3`, or a list such as `1,5,10`)

**Features**:
- Color-coded bounding boxes by element type
- Hover tooltips showing extracted content
- Automatic image scaling
- Page selection support

## Project Structure

```
.
├── databricks.yml                          # Bundle configuration
├── resources/
│   └── ai_parse_document_workflow.job.yml
├── src/
│   ├── transformations/
│   │   ├── 01_parse_documents.py
│   │   ├── 02_extract_text.py
│   │   └── 03_extract_structured_data.py
│   └── explorations/
│       └── ai_parse_document -- debug output.py
└── README.md
```

## Resources

- [Databricks Asset Bundles](https://docs.databricks.com/dev-tools/bundles/)
- [Databricks Workflows](https://docs.databricks.com/workflows/)
- [Structured Streaming](https://docs.databricks.com/structured-streaming/)
- [`ai_parse_document` function](https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_parse_document)
- [`ai_query` function](https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_query)