
PDF Entity Labeling

1 Minute Demo (Click Below! ⬇️)

Demo

Idea

The Problem

Every business is modernizing big data processes, but what about information that isn't stored in a database or transferred in EDI format? Most unstructured, confidential information is sent as PDFs. These PDFs are often read once and discarded, or a human has to spend valuable time recording the details. Even tech-forward businesses rely on analysts to read PDFs and manually enter fields or trigger actions in their software systems.

The Solution

NER (named entity recognition) automates PDF reading. Extracting key features converts your PDFs into an SQL database. From there, you can easily finalize an automation workflow to execute your business logic or mine insights from data you didn't realize you had.

Core functionality

In-browser PDF labeling to enable NER model training. The page displays a PDF on the left and a table of entities on the right. The user can label the text in the PDF that corresponds to each entity type in the table. The annotation objects, including the entity type and text, are saved to JSON or could be passed to a database and backend to train models. For each entity type in the table, the user can choose the color and the annotation style (highlight, underline, or squiggly), and annotations can also be deleted from the table.
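To make the saved structure concrete, here is a minimal sketch of what one annotation object might look like when serialized to JSON; the field names and types are illustrative assumptions, not the project's exact schema.

```ts
// Hypothetical shape of a saved annotation; field names and types are
// illustrative, not the project's exact schema.
type AnnotationStyle = "highlight" | "underline" | "squiggly";

interface EntityAnnotation {
  id: string;             // unique annotation id
  entityType: string;     // e.g. "invoice_number"
  text: string;           // the exact text selected in the PDF
  page: number;           // page where the selection was made
  color: string;          // user-chosen color for this entity type
  style: AnnotationStyle; // highlight, underline, or squiggly
}

// One document's annotations, as they might be serialized to JSON:
const annotations: EntityAnnotation[] = [
  {
    id: "a1",
    entityType: "invoice_number",
    text: "INV-2024-0042",
    page: 1,
    color: "#FFD54F",
    style: "highlight",
  },
];

console.log(JSON.stringify(annotations, null, 2));
```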

User Steps

  1. Describe entity types - give each variable a name, a written definition, and constraints for data type, required, and unique (see the sketch after this list)
  2. Label documents - use the intuitive web interface to label examples of each entity in the documents
  3. Train model - train a custom SLM/transformer on the labeled data, or use the definitions and examples to craft a comprehensive prompt for an LLM to achieve high accuracy without training
  4. Check outputs - create annotations for the predicted labels and display the PDFs in the UI to verify each output visually, then correct the model or the output before it is used in production
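As a sketch of step 1, an entity type description could be modeled like this; the interface and field names are hypothetical, not the app's actual schema.

```ts
// Hypothetical model of an entity type from step 1; the interface and field
// names are illustrative, not the app's actual schema.
type EntityDataType = "string" | "number" | "date" | "boolean";

interface EntityTypeDefinition {
  name: string;           // variable name, e.g. "invoice_date"
  definition: string;     // written definition shown to labelers (or to an LLM)
  dataType: EntityDataType;
  required: boolean;      // must be found in every document
  unique: boolean;        // at most one value per document
}

const invoiceDate: EntityTypeDefinition = {
  name: "invoice_date",
  definition: "The date the invoice was issued, usually near the page header.",
  dataType: "date",
  required: true,
  unique: true,
};
```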

Preview site locally

  1. Install Node.js v22 and Git
  2. Clone the repo and install dependencies:
git clone https://github.com/optimalcharb/pdf-entity-labeling.git
cd pdf-entity-labeling
npm install
  3. Run the dev server:
npm run dev

Development Tools

Core Frontend

  • Frontend framework: Next.js 15 App Router + React
  • Language: TypeScript with ts-reset, config by tsconfig.json
  • Environment variable management: none; for now, all variables should be hard-coded, loaded from an annotations file, or user-provided
  • Containerization: none, no Docker or Kubernetes
  • Styles: Tailwind CSS v4, with CVA (Class Variance Authority) for class variants and PostCSS for build integration
  • Linting: ESLint 9, config by eslint.config.mjs
  • Formatting: Prettier, config by .prettierignore, .prettierrc
  • Testing: React Testing Library + Bun Test Runner (Jest-compatible), name files as "*.{spec,test}.{ts,tsx}" (see the sample test after this list)
  • End-to-End Testing: Playwright, name files as "*.e2e.ts"
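A minimal sketch of how a Bun + React Testing Library test file could look, assuming a DOM implementation (e.g. happy-dom) is registered through Bun's test preload; the EntityRow component is a stand-in for illustration, not one from this repo.

```tsx
// EntityRow.test.tsx - a minimal sketch of a Bun + React Testing Library test.
// Assumes a DOM implementation (e.g. happy-dom) is registered via Bun's test
// preload; EntityRow is a stand-in component, not one from this repo.
import { describe, expect, it } from "bun:test";
import { render, screen } from "@testing-library/react";

function EntityRow({ name }: { name: string }) {
  return (
    <tr>
      <td>{name}</td>
    </tr>
  );
}

describe("EntityRow", () => {
  it("renders the entity name", () => {
    render(
      <table>
        <tbody>
          <EntityRow name="invoice_number" />
        </tbody>
      </table>,
    );
    expect(screen.getByText("invoice_number")).toBeTruthy();
  });
});
```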

Backend for Frontend (BFF)

  • Storage: the PDF must come from local storage or a URL (a loader helper is sketched below)
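As a rough sketch, one way to satisfy this rule is a single helper that returns raw PDF bytes from either a user-selected File (standing in for "local storage") or a remote URL; the function name, signature, and the File interpretation are assumptions, not code from this repo.

```ts
// Hypothetical loader for the BFF storage rule: the viewer only ever receives
// raw PDF bytes, sourced either from a user-selected File ("local storage")
// or from a URL. The function name and signature are illustrative.
export async function loadPdfBytes(source: File | string): Promise<ArrayBuffer> {
  if (typeof source === "string") {
    // Remote PDF: fetch the URL and return the response bytes.
    const res = await fetch(source);
    if (!res.ok) {
      throw new Error(`Failed to fetch PDF: ${res.status} ${res.statusText}`);
    }
    return res.arrayBuffer();
  }
  // Local PDF: read the bytes from the file the user picked in the browser.
  return source.arrayBuffer();
}
```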

Core Backend

  • None

Scripts

Script        Description
dev           run site locally
build         build for prod
start         start prod server
tsc           compile types without generating files
lint          check for linting errors
lint:fix      fix some linting errors automatically
prettier      check format
prettier:fix  fix format (.vscode/settings.json does this on every save)
prepare       automatically called by install
postinstall   automatically called by install
depcheck      check for unused dependencies
storybook     view Storybook workshop
test          run tests using Bun Test Runner
e2e           run Playwright end-to-end tests
others        other scripts can be added to package.json

Version Control

  • DevOps CI/CD: GitHub Actions with workflows for check and bundle analyzer - currently disabled
  • Changelog generation: Semantic Release, config by .releaserc and run by .github/workflows/semantic-release.yml. Conventional Commits are enforced by husky (config by .commitlintrc.json): commit messages must start with a prefix from the table below, and the workflow edits CHANGELOG.md on any version bump. Example messages follow the table.
commit prefix  version bump            definition
type!:         major (0.0.0 -> 1.0.0)  breaking changes (feat!:, perf!:, ...)
feat:          minor (0.0.0 -> 0.1.0)  new feature
perf:          patch (0.0.0 -> 0.0.1)  performance improvement
fix:           patch (0.0.0 -> 0.0.1)  bug fix
docs:          none                    documentation changes
test:          none                    adding or updating tests
ci:            none                    CI/CD configuration changes
revert:        none                    reverting previous commits
style:         none                    formatting without code changes
refactor:      none                    reorganizing code without behavior changes
chore:         none                    maintenance tasks
build:         none                    build system or dependencies
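For example, messages following this convention might look like the lines below (illustrative only, not actual commits from this repo):

```
feat: add entity type table to the labeling page
fix: keep squiggly annotations aligned after zoom
docs: describe the local preview steps in the README
feat!: change the saved annotation JSON schema
```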

Dependency Control

Component Development

  • State management: React local state, plus Zustand for global state stores (see the sketch below)
  • Component workshop: Storybook using .stories.tsx files
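For illustration, a Zustand store for global state might look like the sketch below; the store shape and names are hypothetical and do not mirror this repo's actual stores.

```ts
// Minimal sketch of a Zustand store for global state; the shape is
// hypothetical and does not mirror this repo's actual stores.
import { create } from "zustand";

interface AnnotationUIState {
  selectedEntityType: string | null;
  selectEntityType: (name: string | null) => void;
}

export const useAnnotationUIStore = create<AnnotationUIState>()((set) => ({
  selectedEntityType: null,
  selectEntityType: (name) => set({ selectedEntityType: name }),
}));

// Usage in any component:
// const selected = useAnnotationUIStore((s) => s.selectedEntityType);
```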

Features

PDF Rendering

  • EmbedPDF: GitHub repo, docs for @embedpdf/pdfium (the JS library wrapping the C++ engine), and docs for @embedpdf/core, which I have modified
  • Plugins are built in a consistent style defined by core (not using the standard Redux style) and must have commented sections following plugin-template/

UI Libraries

  • shadcn/ui stored in components/shadcn-ui and config by components.json
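As an illustration of how shadcn/ui-style components combine Tailwind classes through CVA, here is a hypothetical badge variant definition; it is not a component shipped in components/shadcn-ui.

```ts
// Hypothetical badge variants in the CVA style used by shadcn/ui components;
// illustrative only, not a component shipped in components/shadcn-ui.
import { cva, type VariantProps } from "class-variance-authority";

export const badgeVariants = cva(
  "inline-flex items-center rounded-md px-2 py-1 text-xs font-medium",
  {
    variants: {
      tone: {
        default: "bg-gray-100 text-gray-900",
        entity: "bg-yellow-100 text-yellow-900",
      },
    },
    defaultVariants: { tone: "default" },
  },
);

export type BadgeVariants = VariantProps<typeof badgeVariants>;

// badgeVariants({ tone: "entity" }) returns the merged Tailwind class string.
```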

Icons
