amidstdebug/document_extraction

Document Extraction and De-identification

This repository contains scripts for converting various document formats into Markdown and performing entity redaction. The workflow is split into two parts:

  1. Conversion (convert.py)
  2. De-identification (de-identification.py)

An additional script deidentification_obsolated.py shows an older approach using Presidio.

Conversion

convert.py walks through a dataset/ directory, converts each document to PDF (using LibreOffice or built-in functions) and then to Markdown via the marker tool. Image tags in the resulting Markdown are replaced with alt text descriptions generated by an OpenAI-compatible server.
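The document-to-PDF step can be sketched as a headless LibreOffice call. This is a minimal illustration, not the code in convert.py: the binary name `soffice` and the helper names `libreoffice_command` and `to_pdf` are assumptions, and the real script may invoke LibreOffice differently.

```python
import subprocess
from pathlib import Path


def libreoffice_command(source: Path, out_dir: Path, binary: str = "soffice") -> list:
    """Build a headless LibreOffice command that converts `source` to PDF.

    `binary` is an assumption; some installs expose `libreoffice` instead.
    """
    return [
        binary,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(source),
    ]


def to_pdf(source: Path, out_dir: Path) -> Path:
    """Run the conversion and return the expected PDF path (hypothetical helper)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(libreoffice_command(source, out_dir), check=True)
    # LibreOffice names the output after the input file's stem.
    return out_dir / (source.stem + ".pdf")


if __name__ == "__main__":
    # Only prints the command; nothing is executed here.
    print(libreoffice_command(Path("dataset/report.docx"), Path("output_real")))
```

Building the command as a list avoids shell quoting problems with file names that contain spaces.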

Generated files are stored under output_real/ and progress is tracked in progress.csv.
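The image-tag replacement described above could look roughly like the following. It is a sketch under assumptions: the regular expression, the `[Image: ...]` output format, and the `describe` callback (standing in for the request to the OpenAI-compatible server) are all hypothetical and not taken from convert.py.

```python
import re
from typing import Callable

# Matches Markdown image tags such as ![](figures/page1_img0.png)
IMAGE_TAG = re.compile(r"!\[(?P<alt>[^\]]*)\]\((?P<path>[^)\s]+)\)")


def replace_image_tags(markdown: str, describe: Callable[[str], str]) -> str:
    """Replace every image tag with a text description of the image.

    `describe` receives the image path and returns a description; in the
    real workflow it would call the model server.
    """
    def _substitute(match: re.Match) -> str:
        return f"[Image: {describe(match.group('path'))}]"

    return IMAGE_TAG.sub(_substitute, markdown)


def fake_describe(path: str) -> str:
    """Stand-in for the model call so the sketch runs offline."""
    return f"description of {path}"


sample = "Intro\n\n![](img/a.png)\n\nOutro"
result = replace_image_tags(sample, fake_describe)
print(result)
```

Passing the describer in as a function keeps the text rewriting testable without a running server.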

Usage

python convert.py

Configure the API endpoint and model at the top of the script if necessary.

De-identification

de-identification.py scans Markdown files and redacts personal or organisational entities. It combines a knowledge base of terms (publicly_avail_knowledge_base.json) with extraction from a vLLM server to replace sensitive data with fake values generated by faker.

The script writes a <name>_final.md file with replacements as well as <name>_entities.json summarising the detected entities.
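One way to keep replacements consistent across a document is to map each detected entity to a single fake value and substitute in one pass. The sketch below assumes entities arrive as dictionaries with `text` and `type` keys, which is a guess at the shape of the extraction output; `counter_fake` is a dependency-free stand-in for a faker-based generator.

```python
import json
import re
from typing import Callable, Dict, List


def build_replacements(entities: List[dict], make_fake: Callable[[str], str]) -> Dict[str, str]:
    """Assign one fake value per distinct entity text (first occurrence wins)."""
    mapping: Dict[str, str] = {}
    for entity in entities:
        if entity["text"] not in mapping:
            mapping[entity["text"]] = make_fake(entity["type"])
    return mapping


def apply_replacements(text: str, mapping: Dict[str, str]) -> str:
    """Substitute all entities in a single regex pass, longest match first."""
    if not mapping:
        return text
    ordered = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(term) for term in ordered))
    return pattern.sub(lambda m: mapping[m.group(0)], text)


def counter_fake() -> Callable[[str], str]:
    """Stand-in generator; a faker-based one (e.g. Faker().name()) could be used instead."""
    counts: Dict[str, int] = {}

    def make(entity_type: str) -> str:
        counts[entity_type] = counts.get(entity_type, 0) + 1
        return f"{entity_type}_{counts[entity_type]}"

    return make


entities = [
    {"text": "Jane Tan", "type": "PERSON"},
    {"text": "Acme Corp", "type": "ORG"},
    {"text": "Jane Tan", "type": "PERSON"},
]
mapping = build_replacements(entities, counter_fake())
redacted = apply_replacements("Jane Tan joined Acme Corp. Jane Tan left.", mapping)
print(redacted)
print(json.dumps(mapping, indent=2))  # roughly what an entities summary might hold
```

A single-pass substitution means a fake value is never itself rescanned and replaced a second time.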

Usage

python de-identification.py --model <model-name> [--input_dir PATH] [--output_dir PATH] [--vllm_url URL]
  • --input_dir: Directory containing Markdown files (default: current directory)
  • --output_dir: Where to write <name>_entities.json and result.txt
  • --vllm_url: URL of the chat completions endpoint
  • --model: Model identifier on the vLLM server
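The options above map naturally onto an argparse parser. The following is a sketch, not the parser in de-identification.py: only the `--input_dir` default comes from this README, while the `--output_dir` default and the `--vllm_url` default (the usual local vLLM OpenAI-compatible endpoint) are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented command-line options."""
    parser = argparse.ArgumentParser(description="Redact entities in Markdown files.")
    parser.add_argument("--model", required=True,
                        help="Model identifier on the vLLM server")
    parser.add_argument("--input_dir", default=".",
                        help="Directory containing Markdown files")
    # Assumption: the README does not state a default for --output_dir.
    parser.add_argument("--output_dir", default=".",
                        help="Directory for the generated output files")
    # Assumption: default URL of a locally running vLLM server.
    parser.add_argument("--vllm_url",
                        default="http://localhost:8000/v1/chat/completions",
                        help="URL of the chat completions endpoint")
    return parser


args = build_parser().parse_args(["--model", "my-model", "--input_dir", "docs"])
print(args.model, args.input_dir, args.output_dir, args.vllm_url)
```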

Knowledge Base

publicly_avail_knowledge_base.json lists known organisation names and shorthand terms, each mapped to a replacement placeholder. The file was renamed from knowledge_base.json, and the de-identification script expects the new name by default; adjust the path if needed.
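A knowledge-base lookup of this kind can be sketched as whole-term matching against a term-to-placeholder mapping. The flat JSON layout, the placeholder format, and the function name `redact_with_kb` below are assumptions for illustration; the actual file may be structured differently.

```python
import json
import re
from typing import Dict

# Hypothetical knowledge base content: term -> placeholder.
KB_JSON = '{"Acme Holdings": "[ORG_1]", "AH": "[ORG_1]"}'


def redact_with_kb(text: str, kb: Dict[str, str]) -> str:
    """Replace whole-term occurrences of knowledge base entries with placeholders."""
    if not kb:
        return text
    # Longest terms first so a full name wins over a shorthand it contains.
    alternatives = "|".join(re.escape(t) for t in sorted(kb, key=len, reverse=True))
    # Lookarounds stop a shorthand from matching inside a longer word.
    pattern = re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)")
    return pattern.sub(lambda m: kb[m.group(0)], text)


kb = json.loads(KB_JSON)
out = redact_with_kb("AH (Acme Holdings) hired AHMED.", kb)
print(out)
```

Matching here is case-sensitive, which reduces false positives for short abbreviations at the cost of missing differently cased spellings.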

Requirements

Python 3 with the following packages:

  • pandas
  • img2pdf
  • Pillow
  • openai
  • requests
  • faker
  • tqdm

Conversion also requires LibreOffice (for documents) and the marker_single command-line tool in your PATH.
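A quick pre-flight check for the external tools can save a failed run. This sketch uses `shutil.which`; the executable name `soffice` for LibreOffice is an assumption (some systems install it as `libreoffice`), and `missing_tools` is a hypothetical helper.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Assumption: LibreOffice is exposed as `soffice` on this system.
REQUIRED_TOOLS = ("soffice", "marker_single")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All external tools found.")
```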

Dataset Structure

Place input files under a dataset/ directory. convert.py creates progress.csv and initial_files.csv to keep track of which files have been processed.
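Progress tracking of this kind can be sketched with a small CSV log that is consulted before each run. The column names `file` and `status` and the helper names are assumptions, and the standard-library csv module is used here only to keep the sketch dependency-free; the layout of the real progress.csv may differ.

```python
import csv
from pathlib import Path
from typing import List, Set

FIELDS = ["file", "status"]  # hypothetical columns


def load_done(progress_path: Path) -> Set[str]:
    """Return the relative paths already marked as done."""
    if not progress_path.exists():
        return set()
    with progress_path.open(newline="") as handle:
        return {row["file"] for row in csv.DictReader(handle) if row["status"] == "done"}


def mark_done(progress_path: Path, file_name: str) -> None:
    """Append a 'done' row, writing the header if the file is new."""
    is_new = not progress_path.exists()
    with progress_path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({"file": file_name, "status": "done"})


def pending(dataset_dir: Path, progress_path: Path) -> List[str]:
    """List files under dataset_dir (recursively) that are not yet done."""
    done = load_done(progress_path)
    found = (p.relative_to(dataset_dir).as_posix()
             for p in dataset_dir.rglob("*") if p.is_file())
    return sorted(name for name in found if name not in done)


if __name__ == "__main__":
    print(pending(Path("dataset"), Path("progress.csv")))
```

Appending one row per finished file means an interrupted run can resume without reprocessing earlier documents.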

Obsolete Script

deidentification_obsolated.py provides an earlier implementation using Presidio for entity detection. It is kept for reference only.
