Document Extraction and De-identification

This repository contains scripts for converting various document formats into Markdown and performing entity redaction. The workflow is split into two parts:

Conversion (convert.py)
De-identification (de-identification.py)

An additional script, deidentification_obsolated.py, shows an older approach that uses Presidio.

Conversion

convert.py walks through a dataset/ directory, converts each document to PDF (using LibreOffice or built-in functions) and then to Markdown via the marker tool. Image tags in the resulting Markdown are replaced with alt text descriptions generated by an OpenAI-compatible server.
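
The image-tag replacement step can be sketched as follows. This is a minimal illustration, not the code in convert.py: the regular expression, the `[Image: ...]` output format, and the `describe` callback (a stand-in for the request to the OpenAI-compatible server) are all assumptions.

```python
import re
from typing import Callable

# Matches Markdown image tags such as ![alt](path/to/img.png "optional title").
IMAGE_TAG = re.compile(r'!\[(?P<alt>[^\]]*)\]\((?P<src>[^)\s]+)(?:\s+"[^"]*")?\)')


def replace_image_tags(markdown: str, describe: Callable[[str], str]) -> str:
    """Replace every image tag with a textual description of the image."""

    def _substitute(match: re.Match) -> str:
        # `describe` receives the image path and returns the alt text.
        return f"[Image: {describe(match.group('src'))}]"

    return IMAGE_TAG.sub(_substitute, markdown)


def placeholder_describe(src: str) -> str:
    # Hypothetical stand-in: the real script would ask the model server
    # for a description of the image file instead.
    return f"description of {src}"
```

Passing the describer as a callback keeps the text substitution testable without a running model server.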

Generated files are stored under output_real/ and progress is tracked in progress.csv.
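
As a rough sketch of the two conversion steps, the commands for one document might be assembled as below. The `pdf` and `markdown` subfolders of output_real/ are hypothetical, the `--output_dir` flag of marker_single is an assumption that depends on the installed marker version, and the image-to-PDF path mentioned above is left out.

```python
from pathlib import Path
from typing import List


def libreoffice_pdf_command(source: Path, pdf_dir: Path) -> List[str]:
    # Headless batch conversion with LibreOffice's `soffice` binary.
    return ["soffice", "--headless", "--convert-to", "pdf",
            "--outdir", str(pdf_dir), str(source)]


def marker_command(pdf: Path, markdown_dir: Path) -> List[str]:
    # Assumption: the installed marker version accepts --output_dir;
    # some older releases take the output folder as a positional argument.
    return ["marker_single", str(pdf), "--output_dir", str(markdown_dir)]


def plan_conversion(source: Path,
                    output_root: Path = Path("output_real")) -> List[List[str]]:
    """Return the commands needed to turn one document into Markdown."""
    pdf_dir = output_root / "pdf"            # hypothetical layout
    markdown_dir = output_root / "markdown"  # hypothetical layout
    steps: List[List[str]] = []
    if source.suffix.lower() == ".pdf":
        pdf_path = source                    # already a PDF, skip LibreOffice
    else:
        pdf_path = pdf_dir / (source.stem + ".pdf")
        steps.append(libreoffice_pdf_command(source, pdf_dir))
    steps.append(marker_command(pdf_path, markdown_dir))
    return steps
```

Each returned command can be executed with `subprocess.run(command, check=True)`.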

Usage

python convert.py

Configure the API endpoint and model at the top of the script if necessary.

De-identification

de-identification.py scans Markdown files and redacts personal or organisational entities. It combines a knowledge base of known terms (publicly_avail_knowledge_base.json) with entities extracted by a model running on a vLLM server, and replaces the sensitive values with fake ones generated by faker.

The script writes a <name>_final.md file with replacements as well as <name>_entities.json summarising the detected entities.
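
The replacement and output step might look roughly like the sketch below. It assumes the entity list has already been obtained, takes the fake-value generator as a callback (the script uses faker; something like `Faker().name()` could be plugged in), and assumes that the entities file is a plain mapping from original to replacement. The actual JSON layout is not documented here.

```python
import json
import re
from pathlib import Path
from typing import Callable, Dict, Iterable, Tuple


def redact(text: str, entities: Iterable[str],
           make_fake: Callable[[str], str]) -> Tuple[str, Dict[str, str]]:
    """Replace every entity with one consistent fake value."""
    # Deduplicate, then try longer entities first so that "Alice Smith"
    # is not partially replaced through the shorter entity "Alice".
    ordered = sorted(dict.fromkeys(e for e in entities if e), key=len, reverse=True)
    if not ordered:
        return text, {}
    mapping = {entity: make_fake(entity) for entity in ordered}
    pattern = re.compile("|".join(re.escape(entity) for entity in ordered))
    # A single pass avoids re-replacing text inside already inserted fakes.
    return pattern.sub(lambda m: mapping[m.group(0)], text), mapping


def write_outputs(markdown_path: Path, output_dir: Path, redacted: str,
                  mapping: Dict[str, str]) -> Tuple[Path, Path]:
    """Write <name>_final.md and <name>_entities.json for one input file."""
    output_dir.mkdir(parents=True, exist_ok=True)
    final_path = output_dir / f"{markdown_path.stem}_final.md"
    entities_path = output_dir / f"{markdown_path.stem}_entities.json"
    final_path.write_text(redacted, encoding="utf-8")
    # Assumed layout: {"original entity": "fake replacement", ...}
    entities_path.write_text(json.dumps(mapping, indent=2, ensure_ascii=False),
                             encoding="utf-8")
    return final_path, entities_path
```

Building the mapping once per document keeps repeated mentions of the same entity consistent in the output.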

Usage

python de-identification.py --model <model-name> [--input_dir PATH] [--output_dir PATH] [--vllm_url URL]

--input_dir Directory containing Markdown files (default: current directory)
--output_dir Where to write <name>_entities.json and result.txt
--vllm_url URL of the chat completions endpoint
--model Model identifier on the vLLM server
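
For orientation, a command-line interface matching these options could be declared with argparse as sketched here. Only the default of --input_dir is stated above; the defaults shown for --output_dir and --vllm_url are assumptions, and the script may define them differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Redact entities in Markdown files.")
    parser.add_argument("--model", required=True,
                        help="Model identifier on the vLLM server")
    parser.add_argument("--input_dir", default=".",
                        help="Directory containing Markdown files")
    # Assumed default: write next to the inputs.
    parser.add_argument("--output_dir", default=".",
                        help="Directory for the generated output files")
    # Assumed default: a local vLLM server on its usual port.
    parser.add_argument("--vllm_url",
                        default="http://localhost:8000/v1/chat/completions",
                        help="URL of the chat completions endpoint")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.model, args.input_dir, args.output_dir, args.vllm_url)
```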

Knowledge Base

publicly_avail_knowledge_base.json lists known organisation names and shorthand terms mapped to replacement placeholders. The file was renamed from knowledge_base.json and the de-identification script expects this name by default; adjust the path if needed.
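
A knowledge-base lookup of this kind can be sketched as below, assuming the file is a flat JSON object that maps each term to its placeholder (the real file may be structured differently). The case-insensitive, whole-word matching is also an illustrative choice.

```python
import json
import re
from pathlib import Path
from typing import Dict


def load_knowledge_base(
        path: Path = Path("publicly_avail_knowledge_base.json")) -> Dict[str, str]:
    # Assumed layout: {"Acme Corporation": "[ORG_1]", "ACME": "[ORG_1]", ...}
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def apply_knowledge_base(text: str, knowledge_base: Dict[str, str]) -> str:
    """Replace known terms with their placeholders, longest term first."""
    if not knowledge_base:
        return text
    lookup = {term.lower(): placeholder
              for term, placeholder in knowledge_base.items()}
    terms = sorted(knowledge_base, key=len, reverse=True)
    # (?<!\w) and (?!\w) keep "ACME" from matching inside a longer word.
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(t) for t in terms) + r")(?!\w)",
        re.IGNORECASE)
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)
```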

Requirements

Python 3 with the following packages:

pandas
img2pdf
Pillow
openai
requests
faker
tqdm

Conversion also requires LibreOffice (for documents) and the marker_single command-line tool in your PATH.
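
Before a long conversion run, it can help to confirm that the external tools are reachable. The check below is a small sketch that is not part of the repository; the LibreOffice binary names (`soffice`, `libreoffice`) are the common ones and may differ on some systems.

```python
import shutil
from typing import Dict, List, Tuple

# Each requirement lists the executable names that would satisfy it.
REQUIRED_TOOLS: Dict[str, Tuple[str, ...]] = {
    "LibreOffice": ("soffice", "libreoffice"),
    "marker": ("marker_single",),
}


def find_missing(required: Dict[str, Tuple[str, ...]] = REQUIRED_TOOLS) -> List[str]:
    """Return the labels of requirements with no executable on PATH."""
    return [label for label, names in required.items()
            if not any(shutil.which(name) for name in names)]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing tools:", ", ".join(missing))
```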

Dataset Structure

Place input files under a dataset/ directory. convert.py creates progress.csv and initial_files.csv to keep track of what has been processed.
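
Resumable tracking of this kind could work along the lines of the sketch below. The column names `file` and `status` are hypothetical, and the standard csv module is used here for self-containment rather than pandas; the actual layout of progress.csv and initial_files.csv may differ.

```python
import csv
from pathlib import Path
from typing import List, Set

FIELDS = ["file", "status"]  # hypothetical column names


def list_dataset_files(dataset_dir: Path) -> List[str]:
    """Return all files under dataset_dir as sorted relative POSIX paths."""
    return sorted(p.relative_to(dataset_dir).as_posix()
                  for p in dataset_dir.rglob("*") if p.is_file())


def load_done(progress_csv: Path) -> Set[str]:
    if not progress_csv.exists():
        return set()
    with progress_csv.open(newline="", encoding="utf-8") as handle:
        return {row["file"] for row in csv.DictReader(handle)
                if row["status"] == "done"}


def mark_done(progress_csv: Path, file_name: str) -> None:
    is_new = not progress_csv.exists()
    with progress_csv.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()  # header only when the file is created
        writer.writerow({"file": file_name, "status": "done"})


def pending_files(dataset_dir: Path, progress_csv: Path) -> List[str]:
    """Files that still need processing, so an interrupted run can resume."""
    done = load_done(progress_csv)
    return [name for name in list_dataset_files(dataset_dir) if name not in done]
```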

Obsolete Script

deidentification_obsolated.py provides an earlier implementation using Presidio for entity detection. It is kept for reference only.

Repository: amidstdebug/document_extraction
