Redact

A fast, containerized, and scalable API service for automated OCR and PII redaction on document batches.

Redact is a production-ready microservice built with FastAPI that handles the secure and asynchronous processing of data redaction tasks. It is designed to be easily deployed with Docker and scaled via a worker-based architecture (Redis/RQ).

Showcases:

  • A robust, modern Python API using FastAPI (complete with automatic documentation).
  • A clear separation of concerns (API, Core, Services, Workers).
  • Scalable asynchronous task processing.
  • Containerization with Docker.
  • Database interaction via SQLAlchemy.

Features

  • RESTful API: Clear endpoints for submitting and retrieving redaction tasks.
  • Asynchronous Processing: Long-running redaction tasks are handled by a dedicated worker pool, keeping the API fast and responsive.
  • Persistent Storage: Uses SQLAlchemy for task metadata and Redis for task queuing.
  • Containerized: Built for easy deployment with a Dockerfile.
  • Benchmarked: Includes load testing scripts using Locust and benchmarks for performance analysis.
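Since long-running redaction work is handed to a worker pool, the API side just enqueues a job and returns its id. The sketch below shows that pattern with RQ; the queue name, the task function, and the regex-based redaction stand-in are illustrative, not the repo's actual code:

```python
import re

# Hypothetical worker task: the real service runs OCR plus a NER model,
# but a regex that masks email addresses is enough to show the shape.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact_text(text: str) -> str:
    """Replace every email address with a fixed redaction token."""
    return EMAIL_RE.sub("[REDACTED]", text)

def enqueue_redaction(text: str):
    """Enqueue redact_text on an RQ worker (requires a running Redis).

    Imports are deferred so this module can be loaded without Redis;
    the queue name "redact" is an assumption for illustration.
    """
    from redis import Redis
    from rq import Queue

    q = Queue("redact", connection=Redis())
    job = q.enqueue(redact_text, text)
    return job.id  # the API would return this id for later polling
```

The API handler calls `enqueue_redaction` and responds immediately with the job id, so the request never blocks on model inference.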

🚧 Project Status: [Phase 5 of 6]

Current: Integration and testing

Goal

Process batches of document images, automatically detect and redact sensitive information (PII).

Features

  • Batch image upload
  • Asynchronous processing
  • PII redaction model
  • Batch prediction
  • REST API
  • Results retrieval
  • Docker/Kubernetes deployment

Phases

  • Phase 1: Basic upload
  • Phase 2: Job tracking
  • Phase 3: Async infrastructure
  • Phase 4: Model development
  • Phase 5: Integration (IN PROGRESS)
  • Phase 6: Deployment

Performance

API Benchmarking Table

| Endpoint | Operation | Payload Size | Concurrent Users | Requests/sec | Avg Latency (ms) | P95 Latency (ms) | Error Rate | Notes |
|---|---|---|---|---|---|---|---|---|
| POST /predict | Create | 100KB image | 2 | 0.67 | 34.95 | 56 | 0% | Includes file validation, disk write, Redis, ML model inference |
| GET /predict/{id} | Read | N/A | 10 | 3.58 | 10.38 | 32 | 0% | Retrieves processed document from server |
| GET /predict/check/{id} | Read | N/A | 10 | 3.52 | 9.94 | 28 | 0% | Fetches job status from Redis |
| DELETE /predict/drop/{id} | Delete | N/A | 10 | 3.47 | 10.43 | 30 | | Deletes a batch and all related files from the DB |

Legend:

  • Payload Size: Size of file or JSON sent in the request.
  • Concurrent Users: Simulated users (e.g., in Locust).
  • Requests/sec: Throughput under load.
  • Latency: Time from request to response (P95 = 95th percentile).
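For reference, the P95 column is just the 95th percentile of the per-request latency samples. A quick nearest-rank computation over raw samples (the latency values below are illustrative, not the real Locust data):

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest value covering pct% of samples."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-indexed rank
    return ordered[rank - 1]

# Illustrative per-request latencies in ms.
latencies = [10, 12, 9, 30, 11, 56, 10, 13, 12, 11]
avg_ms = sum(latencies) / len(latencies)
p95_ms = percentile(latencies, 95)  # one slow outlier dominates P95
```

This is why P95 sits well above the average: a single slow request moves the tail far more than it moves the mean.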

Model Inference Benchmark Table

Models:

  • NER: GLiNER Medium v2.1 (Zaratiana et al., urchade/GLiNER)
  • OCR: Tesseract 5.5.1 via pytesseract

| Model Name | Input Size | Avg Inference Time (ms) | Device | Notes |
|---|---|---|---|---|
| GLiNER Medium v2.1 | 200 lines of text | 790 | CPU | Long cold start (~20 s) on first model load |
| Tesseract 5.5.1 | 512x512 image | 0.96 | CPU | |

Legend:

  • Inference Time: time to run a single prediction (ms).

Tests performed in a subprocess.
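The timing methodology can be reproduced with a small `perf_counter` harness; the warm-up call keeps cold-start cost (such as GLiNER's ~20 s model load) out of the steady-state average. The model call here is a stand-in, not the repo's benchmark script:

```python
import time

def bench(fn, *args, runs=10):
    """Average wall-clock time of fn(*args) in milliseconds."""
    fn(*args)  # warm-up call so one-time startup cost is excluded
    start = time.perf_counter()
    for _ in range(runs):
        fn(*args)
    return (time.perf_counter() - start) / runs * 1000.0

# Stand-in for the real OCR/NER inference call.
def fake_inference(text):
    return text.upper()

avg_ms = bench(fake_inference, "sample text", runs=100)
```

Running the harness in a subprocess additionally isolates the measurement from the parent process's state.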

🔧 Quickstart

Clone the repo

```bash
git clone https://github.com/fw7th/redact.git
cd redact
```

Create and activate virtualenv (optional)

```bash
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows
```

Install dependencies

```bash
pip install -r requirements.txt
```

Copy example env and configure

```bash
cp .env.example .env
```

Start the app

```bash
uvicorn app.main:app --reload
```

Client request example

```bash
curl -X POST http://localhost:8000/predict \
  -F "file=@document.png"
```

Check job status

```bash
curl http://localhost:8000/predict/check/abc123
```
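Because jobs complete asynchronously, a client typically polls the check endpoint until the job settles. A minimal stdlib sketch; the endpoint path follows the benchmark table above, but the response schema and the "queued"/"started" status values (RQ-style job states) are assumptions:

```python
import json
import time
import urllib.request

def poll_status(job_id, fetch=None, base="http://localhost:8000",
                tries=30, delay=1.0):
    """Poll the check endpoint until the job leaves its pending states.

    `fetch` is injectable for testing; by default it performs an HTTP GET
    against the running service and decodes the JSON body.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url) as resp:
                return json.load(resp)
    for _ in range(tries):
        body = fetch(f"{base}/predict/check/{job_id}")
        if body.get("status") not in ("queued", "started"):
            return body  # finished or failed: stop polling
        time.sleep(delay)
    raise TimeoutError(f"job {job_id} did not finish in time")
```

Once the status leaves the pending states, the client can fetch the processed document from `GET /predict/{id}`.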

Running tests

```bash
pytest tests/
```

Optionally benchmark

```bash
./scripts/benchmark
```

API Documentation

[Link to /docs once deployed]

When the server is running locally, visit:

  • http://localhost:8000/docs (Swagger UI)
  • http://localhost:8000/redoc (ReDoc)

These provide interactive documentation of all available endpoints with live testing.

Design Decisions

  • FastAPI: mainly because it's lightweight, and the automatic interactive docs come free.
  • Postgres: I'm already familiar with the DBMS.
  • Redis: chosen for its persistence, speed, and distributed support; it beats a plain multiprocessing.Queue on scalability and integrates well with both Celery and RQ. Kafka or RabbitMQ would be overkill at this stage of the project.
  • RQ: I initially wanted Celery, but after some investigation decided it was more than this use case needs; RQ gives a simple, quick setup without the configuration overhead.

📄 License

This project is licensed under the MIT License — see the LICENSE file for details.
