
pdf_ingestion

Architecture

(architecture diagram)

This project enables Databricks users to ingest large numbers of unstructured files (e.g., PDF, DOCX, PPTX) from a Databricks Unity Catalog volume into a Databricks Delta table. The project is designed to be high-performance, scalable, and fault-tolerant.

Key Features

  • Concurrency within each micro-batch using thread pools
  • Automatic management of skewed tasks
    • Users can set file size and page count thresholds to route large files to separate async jobs
    • Optional salting to alleviate skew within micro-batches
  • Error handling and retry logic (see the sketch after this list)
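
As a rough illustration of the concurrency and retry features above, the sketch below shows how files in a micro-batch might be parsed in parallel with a thread pool and retried on transient failures. It is a minimal sketch, not the exact implementation in this repository: the names parse_file, parse_with_retry, and process_microbatch, as well as the retry settings, are illustrative assumptions.

    import time
    from concurrent.futures import ThreadPoolExecutor

    MAX_RETRIES = 3        # assumption: attempts per file before recording the error
    BACKOFF_SECONDS = 5    # assumption: linear backoff between attempts

    def parse_file(path: str) -> str:
        """Hypothetical hook for the configured parser (unstructured or databricks_ai_parse)."""
        raise NotImplementedError("plug in the configured parser here")

    def parse_with_retry(path: str) -> dict:
        """Parse one file, retrying on failure and keeping the row with its error message."""
        for attempt in range(1, MAX_RETRIES + 1):
            try:
                return {"path": path, "content": parse_file(path), "error": None}
            except Exception as exc:
                if attempt == MAX_RETRIES:
                    return {"path": path, "content": None, "error": str(exc)}
                time.sleep(BACKOFF_SECONDS * attempt)

    def process_microbatch(batch_df, batch_id, num_threads: int = 8):
        """foreachBatch-style handler: parse every file in the micro-batch concurrently."""
        paths = [row.path for row in batch_df.select("path").collect()]
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            results = list(pool.map(parse_with_retry, paths))
        # The results would then be written to the output Delta table.
        # Optional salting (illustrative): spread a skewed micro-batch across tasks, e.g.
        #   batch_df.withColumn("salt", (F.rand() * num_salts).cast("int")).repartition("salt")
        return results

Because per-file parsing is dominated by I/O and parser latency, overlapping files with a thread pool inside each micro-batch can keep workers busy without changing the streaming semantics.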

Getting started

  1. Install the Databricks CLI from https://docs.databricks.com/dev-tools/cli/databricks-cli.html

  2. Authenticate to your Databricks workspace, if you have not done so already:

    $ databricks configure
    
  3. Set up your bundle:

    $ databricks bundle init pdf_ingestion_template
    
  • Follow the prompts to define your project name, set the cloud environment, and set the file size/page count thresholds
  • The Databricks Asset Bundle job configurations will be created automatically
  • Review the main bundle configuration file databricks.yml to customize it further for your workspace
    • Update the variables section with your workspace environment details
  • Refer to the Databricks Asset Bundle documentation
  4. To validate a development copy of this project:

    Validate with the default settings for the dev workflow (note that "dev" is the default target, so the --target parameter is optional here):

    $ databricks bundle validate --target dev
    

    Validate with custom variables:

    $ databricks bundle validate --target dev --var="catalog=test,schema=unstructured_ingestion,volume=pdfs,num_workers=5,table_prefix=test_job,max_concurrent_async_runs=2"
    
  5. To deploy a development copy of this project:

    Deploy the dev workflow to a Databricks workspace:

    $ databricks bundle deploy --target dev
    

    This deploys everything that's defined for this project. For example, the default template would deploy a job called [dev yourname] pdf_ingestion_job to your workspace. You can find that job by opening your workspace and clicking on Workflows.

    Deploy with custom variables:

    $ databricks bundle deploy --target dev --var="catalog=test,schema=unstructured_ingestion,volume=pdfs,num_workers=5,table_prefix=test_job,max_concurrent_async_runs=2"
    
  6. Similarly, to deploy a production copy, type:

    $ databricks bundle deploy --target prod
    

    Note that the default job from the template has a schedule that runs every day (defined in resources/pdf_ingestion_job.yml). The schedule is paused when deploying in development mode (see https://docs.databricks.com/dev-tools/bundles/deployment-modes.html).

  7. To run a job or pipeline, use the "run" command:

    $ databricks bundle run
    
  8. Optionally, install developer tools such as the Databricks extension for Visual Studio Code from https://docs.databricks.com/dev-tools/vscode-ext.html.

  9. For documentation on the Databricks Asset Bundles format used for this project, and for CI/CD configuration, see https://docs.databricks.com/dev-tools/bundles/index.html.

Usage

Databricks Job Variables

Users can customize the following variables in databricks.yml after bundle initialization, or set them with the Databricks CLI. These variables configure the job and can be set to match your specific environment and requirements.

Variable                    Description                                                Default Value
cloud_provider              cloud provider                                             aws
node_type_id                worker node type                                           m5d.2xlarge
async_node_type_id          async job worker node type                                 m5d.4xlarge
num_workers                 number of workers                                          40
async_num_workers           number of workers for the async job                        1
max_concurrent_async_runs   maximum number of concurrent async job runs                10
zone_id                     AWS zone_id attribute                                      us-west-2a
catalog                     Unity Catalog name                                         my_catalog
schema                      schema name                                                my_schema
volume                      source file volume                                         my_volume
checkpoints_volume          checkpoints volume name                                    checkpoints
reset_data                  boolean flag to reset all data tables                      true
table_prefix                output table prefix                                        my_project
file_format                 input file format (extension)                              pdf
parser_name                 name of the parser (unstructured or databricks_ai_parse)   unstructured
strategy                    Unstructured parser strategy (auto or hi_res)              auto

Parsers

The currently supported parsers and file formats:

Databricks AI Parse
  • Supported file types: pdf, jpg, jpeg, png
  • Technology: agentic system with custom models

Unstructured OSS Parser (auto strategy)
  • Supported file types: pdf, pptx, docx, xlsx, eml, msg
  • Technology: dynamically chooses between OCR and OCR + YOLOX based on document layout

Unstructured OSS Parser (hi_res strategy)
  • Supported file types: pdf, pptx, docx, xlsx, eml, msg
  • Technology: OCR + YOLOX

Known Limitations

  • The open-source Unstructured parser uses the YOLOX object detection model for the hi_res strategy and, when dynamically selected, for the auto strategy. The YOLOX model is hosted on the Hugging Face Hub; depending on the time of day and traffic, users may experience 429 errors due to Hugging Face rate limiting (see the retry sketch after this list)
  • Databricks AI Parse is currently in Private Preview; please reach out to your Databricks account team to enable access
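
If you run into those 429 responses, a common mitigation is to retry the parse call with exponential backoff. The snippet below is a minimal sketch assuming the open-source unstructured package and its partition_pdf entry point; the helper name partition_with_backoff and the retry settings are illustrative and not part of this repository.

    import time
    from unstructured.partition.pdf import partition_pdf

    def partition_with_backoff(path: str, strategy: str = "hi_res",
                               max_attempts: int = 5, base_delay: float = 10.0):
        """Retry partitioning with exponential backoff when the YOLOX model download
        is rate-limited by Hugging Face (HTTP 429)."""
        for attempt in range(max_attempts):
            try:
                return partition_pdf(filename=path, strategy=strategy)
            except Exception as exc:
                # Rate-limit errors typically surface with "429" in the message raised
                # by the underlying hub client; back off and retry, otherwise re-raise.
                if "429" not in str(exc) or attempt == max_attempts - 1:
                    raise
                time.sleep(base_delay * (2 ** attempt))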

Coming Soon

  • Memory management
  • Map-reduce page splits option for large files
  • Declarative Document Processing

