PDF Entity Labeling

1 Minute Demo (Click Below! ⬇️)

Idea

The Problem

Every business is modernizing big data processes, but what about information that isn't stored in a database or transferred in EDI format? Most unstructured, confidential information is sent as PDFs. These PDFs are often read once and discarded, or a human has to spend valuable time recording the details. Even tech-forward businesses rely on analysts to read PDFs and manually enter fields or trigger actions in their software systems.

The Solution

NER automates PDF reading. Extracting the key fields from each PDF turns your documents into rows in a SQL database. From there, you can build an automation workflow that executes your business logic, or mine insights from data you didn't realize you had.

Core functionality

In-browser PDF labeling to enable NER (named entity recognition) model training. The page displays a PDF on the left and a table of entities on the right. The user labels the text in the PDF that corresponds to each entity type in the table. The annotation objects, including the entity type and text, are saved to JSON, or could be passed to a database and backend to train models. For each entity type in the table, the user can choose the color and the markup style (highlight, underline, or squiggly), and can delete annotations from the table.
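
As a rough illustration of what a saved annotation could look like, here is a minimal sketch. The type names, field names, and sample values are hypothetical and are not taken from the repo's actual schema; only the entity type, the text, the color, and the three markup styles come from the description above.

```typescript
// Hypothetical shapes: names and fields are illustrative, not the repo's schema.
type MarkupStyle = "highlight" | "underline" | "squiggly";

interface EntityType {
  name: string;
  color: string; // CSS color chosen by the user for this entity type
  style: MarkupStyle;
}

interface Annotation {
  id: string;
  entityType: string; // matches EntityType.name
  text: string; // the labeled text selected in the PDF
  pageIndex: number; // zero-based page of the selection (assumed field)
}

// Look up how an annotation should be drawn, based on its entity type.
function stylingFor(annotation: Annotation, types: EntityType[]): EntityType | undefined {
  return types.find((t) => t.name === annotation.entityType);
}

// Remove one annotation by id, as the delete action in the table would.
function deleteAnnotation(annotations: Annotation[], id: string): Annotation[] {
  return annotations.filter((a) => a.id !== id);
}

// Serialize the annotations for one document so they can be saved as JSON.
function annotationsToJson(documentName: string, annotations: Annotation[]): string {
  return JSON.stringify({ documentName, annotations }, null, 2);
}

const entityTypes: EntityType[] = [
  { name: "invoice_number", color: "#ffd54f", style: "highlight" },
  { name: "total_amount", color: "#4fc3f7", style: "underline" },
];

const annotations: Annotation[] = [
  { id: "a1", entityType: "invoice_number", text: "INV-1042", pageIndex: 0 },
  { id: "a2", entityType: "total_amount", text: "1250.00", pageIndex: 1 },
];

const firstStyle = stylingFor(annotations[0], entityTypes);
const saved = annotationsToJson("invoice.pdf", deleteAnnotation(annotations, "a2"));
console.log(firstStyle?.style, saved);
```

Keeping the styling on the entity type rather than on each annotation means a color change in the table applies to every existing label of that type.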

User Steps

Describe entity types - give each variable a name, a written definition, and constraints for data type, required, and unique
Label documents - use the web interface to label examples of each entity in the documents
Train model - train a custom SLM/transformer on the labeled data, or use the definitions and examples to craft a comprehensive prompt for an LLM to achieve high accuracy without training
Check outputs - create annotations for the predicted labels and display the PDFs in the UI to verify each output visually and correct the model or the output before it is used in prod
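
The constraints from the first step can also be checked mechanically on extracted values before they reach a database. The sketch below is one possible reading, not the project's implementation: the `EntityDefinition` shape, the three data types, and the interpretation of "unique" as "no two documents share the same value" are all assumptions.

```typescript
// Hypothetical schema for the "describe entity types" step.
type DataType = "string" | "number" | "date";

interface EntityDefinition {
  name: string;
  definition: string; // written definition shown to labelers (or used in a prompt)
  dataType: DataType;
  required: boolean;
  unique: boolean; // assumed meaning: no two documents may share a value
}

// One row of extracted values per document, keyed by entity name.
type Row = Record<string, string | undefined>;

function matchesType(value: string, dataType: DataType): boolean {
  switch (dataType) {
    case "number":
      return value.trim() !== "" && Number.isFinite(Number(value));
    case "date":
      return !Number.isNaN(Date.parse(value));
    default:
      return true; // any text is a valid string
  }
}

// Return a human-readable problem for every constraint violation.
function validateRows(defs: EntityDefinition[], rows: Row[]): string[] {
  const problems: string[] = [];
  for (const def of defs) {
    const seen = new Set<string>();
    rows.forEach((row, i) => {
      const value = row[def.name];
      if (value === undefined || value === "") {
        if (def.required) problems.push(`row ${i}: missing required ${def.name}`);
        return;
      }
      if (!matchesType(value, def.dataType)) {
        problems.push(`row ${i}: ${def.name} is not a valid ${def.dataType}`);
      }
      if (def.unique) {
        if (seen.has(value)) problems.push(`row ${i}: duplicate ${def.name} "${value}"`);
        seen.add(value);
      }
    });
  }
  return problems;
}

const defs: EntityDefinition[] = [
  { name: "invoice_number", definition: "Identifier printed on the invoice", dataType: "string", required: true, unique: true },
  { name: "total_amount", definition: "Total amount due", dataType: "number", required: true, unique: false },
  { name: "due_date", definition: "Payment due date", dataType: "date", required: false, unique: false },
];

const rows: Row[] = [
  { invoice_number: "INV-1042", total_amount: "1250.00", due_date: "2024-03-01" },
  { invoice_number: "INV-1042", total_amount: "abc" },
];

const problems = validateRows(defs, rows);
console.log(problems);
```

In this sample the second row is flagged twice: once for repeating the invoice number and once for a non-numeric total. The missing `due_date` is accepted because it is not required.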

Preview site locally

Install Node.js v22 and Git
Clone repo and install dependencies:

git clone https://github.com/optimalcharb/pdf-entity-labeling.git

cd pdf-entity-labeling

npm install

Run the server

npm run dev

Development Tools

Core Frontend

Frontend framework: Next.js 15 App Router + React
Language: TypeScript with ts-reset, config by tsconfig.json
Environment variable management: none for now; all variables should be hard-coded, loaded from an annotations file, or provided by the user
Containerization: none, no Docker or Kubernetes
Styles: Tailwind CSS v4, with CVA (Class Variance Authority) for component style variants and PostCSS for build integration
Linting: ESLint 9, config by eslint.config.mjs
Formatting: Prettier, config by .prettierignore, .prettierrc
Testing: React Testing Library + Bun Test Runner, which has a Jest-compatible API; name files with the suffix ".{spec,test}.{ts,tsx}"
End-to-End Testing: Playwright; name files with the suffix ".e2e.ts"

Backend for Frontend (BFF)

Storage: must get PDF from local storage or URL
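
A minimal sketch of handling both sources, assuming "local storage" means a file the user picks from their machine. The helper names `loadPdfBytes` and `looksLikePdf` are hypothetical; only `fetch`, `Blob`, and `Uint8Array` are standard APIs.

```typescript
// Hypothetical helpers, not part of the repo.
type PdfSource =
  | { kind: "url"; url: string }
  | { kind: "file"; file: Blob }; // a File from <input type="file"> is also a Blob

// Read the raw bytes of a PDF from either source.
async function loadPdfBytes(source: PdfSource): Promise<Uint8Array> {
  if (source.kind === "url") {
    const response = await fetch(source.url);
    if (!response.ok) {
      throw new Error(`Failed to fetch PDF: HTTP ${response.status}`);
    }
    return new Uint8Array(await response.arrayBuffer());
  }
  return new Uint8Array(await source.file.arrayBuffer());
}

// A PDF file normally starts with the ASCII bytes "%PDF-".
function looksLikePdf(bytes: Uint8Array): boolean {
  const magic = [0x25, 0x50, 0x44, 0x46, 0x2d];
  return bytes.length >= magic.length && magic.every((b, i) => bytes[i] === b);
}
```

Checking the header before handing the bytes to the renderer gives the user a clear error when a URL returns an HTML page instead of a PDF.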

Core Backend

None

Scripts

Script	Description
dev	run site locally
build	build for prod
start	start prod server
tsc	type-check without emitting files
lint	check for linting errors
lint:fix	fix some linting errors automatically
prettier	check format
prettier:fix	fix format (.vscode/settings.json does this on every save)
prepare	automatically called by install
postinstall	automatically called by install
depcheck	check for unused dependencies
storybook	view storybook workshop
test	run tests using Bun Test Runner
e2e	run playwright end-to-end tests
others	other scripts can be added to package.json

Version Control

DevOps CI/CD: GitHub Actions with workflows for check and bundle analyzer - currently disabled
Changelog generation: Semantic Release, configured by .releaserc and run by .github/workflows/semantic-release.yml. Conventional Commits are enforced by a husky hook, configured by .commitlintrc.json. Commit messages must start with a prefix from the table below. The workflow updates CHANGELOG.md on any version bump.

commit prefix	version bump	definition
type!:	major (0.0.0 -> 1.0.0)	breaking changes (`feat!:`, `perf!:`, ...)
feat:	minor (0.0.0 -> 0.1.0)	new feature
perf:	patch (0.0.0 -> 0.0.1)	performance improvement
fix:	patch (0.0.0 -> 0.0.1)	bug fix
docs:	none	documentation changes
test:	none	adding or updating tests
ci:	none	CI/CD configuration changes
revert:	none	reverting previous commits
style:	none	formatting without code changes
refactor:	none	reorganizing code without changes
chore:	none	maintenance tasks
build:	none	build system or dependencies
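
The table can be read as a small function from commit message to version bump. The sketch below only mirrors the table; it is not how Semantic Release is implemented, it ignores `BREAKING CHANGE:` footers, and the function names are hypothetical.

```typescript
type Bump = "major" | "minor" | "patch" | "none";

// Map a commit message to the bump listed in the table above.
function versionBump(commitMessage: string): Bump {
  // Conventional Commits header: type, optional (scope), optional "!", then ": "
  const match = /^(\w+)(\([^)]*\))?(!)?: /.exec(commitMessage);
  if (!match) return "none"; // the commit hook would reject such a message anyway
  const [, type, , breaking] = match;
  if (breaking) return "major";
  if (type === "feat") return "minor";
  if (type === "perf" || type === "fix") return "patch";
  return "none";
}

// Apply a bump to a "major.minor.patch" version string.
function applyBump(version: string, bump: Bump): string {
  const [major = 0, minor = 0, patch = 0] = version.split(".").map(Number);
  switch (bump) {
    case "major":
      return `${major + 1}.0.0`;
    case "minor":
      return `${major}.${minor + 1}.0`;
    case "patch":
      return `${major}.${minor}.${patch + 1}`;
    default:
      return version;
  }
}

console.log(applyBump("0.0.0", versionBump("feat: add entity table")));
console.log(applyBump("0.0.0", versionBump("fix(ui): keep selection on scroll")));
console.log(applyBump("0.0.0", versionBump("feat!: change annotation format")));
console.log(applyBump("0.0.0", versionBump("docs: update README")));
```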

Dependency Control

Package manager: npm
Package management: Corepack
Package fixes: Patch-package
Bundle management: Bundle analyzer - currently disabled
Import management: absolute imports, so that imports from the same module are ordered alphabetically

Component Development

State management: React local state, and Zustand for global state stores
Component workshop: Storybook using .stories.tsx files

Features

PDF Rendering

EmbedPDF: GitHub, docs for @embedpdf/pdfium (the JS library that wraps the C++ engine), docs for @embedpdf/core (which I have modified)
Plugins are built in a consistent style defined by core (not the standard Redux style) and must have commented sections following plugin-template/

UI Libraries

shadcn/ui stored in components/shadcn-ui and config by components.json

Icons

Lucide Icons

