This repository contains scripts for converting various document formats into Markdown and performing entity redaction. The workflow is split into two parts:
- Conversion (convert.py)
- De-identification (de-identification.py)
An additional script deidentification_obsolated.py shows an older approach using Presidio.
convert.py walks through a dataset/ directory, converts each document to PDF (using LibreOffice or built-in functions) and then to Markdown via the marker tool. Image tags in the resulting Markdown are replaced with alt text descriptions generated by an OpenAI-compatible server.
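The image-replacement step can be pictured with a minimal sketch. It is not the code in convert.py: the regular expression, the `[Image: ...]` output format, and the `describe` callable are assumptions for illustration. In the real script the description would come from the OpenAI-compatible server, which is stubbed out here so the example runs offline.

```python
import re
from typing import Callable

# Matches Markdown image tags such as ![](figures/page_1.png)
# or ![old alt](img.png "title"); group 1 captures the image path.
IMAGE_TAG = re.compile(r"!\[[^\]]*\]\(([^)\s]+)[^)]*\)")


def replace_images_with_alt_text(markdown: str, describe: Callable[[str], str]) -> str:
    """Replace every image tag with the text returned by describe(path)."""

    def _substitute(match: re.Match) -> str:
        description = describe(match.group(1)).strip()
        # Hypothetical output format; the real script may format this differently.
        return f"[Image: {description}]"

    return IMAGE_TAG.sub(_substitute, markdown)


def stub_describe(image_path: str) -> str:
    """Stand-in for the call to the OpenAI-compatible server (assumption)."""
    return f"placeholder description of {image_path}"


sample = "Intro\n\n![](figures/chart.png)\n\nOutro"
result = replace_images_with_alt_text(sample, stub_describe)
print(result)
```

Passing the describer as a callable keeps the text rewriting separate from the network call, so the substitution logic can be checked without a running server.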
Generated files are stored under output_real/ and progress is tracked in progress.csv.
    python convert.py

Configure the API endpoint and model at the top of the script if necessary.
de-identification.py scans Markdown files and redacts personal or organisational entities. It combines a knowledge base of terms (publicly_avail_knowledge_base.json) with entities extracted by a model running on a vLLM server, and replaces the sensitive data with fake values generated by faker.
The script writes a <name>_final.md file with replacements as well as <name>_entities.json summarising the detected entities.
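As a rough sketch of the replace-and-write step, the example below maps each detected entity to one replacement value, applies the mapping, and writes the two output files. The function names, the JSON layout, and writing both files to the same directory are assumptions, and a numbered placeholder stands in for faker so the example needs only the standard library.

```python
import json
import re
import tempfile
from pathlib import Path
from typing import Callable, Dict, Iterable, Tuple


def numbered_placeholder() -> Callable[[str], str]:
    """Stand-in for faker (assumption): returns REDACTED_1, REDACTED_2, ..."""
    counter = 0

    def make(entity: str) -> str:
        nonlocal counter
        counter += 1
        return f"REDACTED_{counter}"

    return make


def redact(text: str, entities: Iterable[str],
           make_fake: Callable[[str], str]) -> Tuple[str, Dict[str, str]]:
    """Replace each entity with one consistent fake value."""
    # Longest first, so "Alice Smith" is matched before the shorter "Alice".
    ordered = sorted(set(entities), key=lambda e: (-len(e), e))
    replacements = {entity: make_fake(entity) for entity in ordered}
    if not replacements:
        return text, replacements
    pattern = re.compile("|".join(re.escape(e) for e in replacements))
    return pattern.sub(lambda m: replacements[m.group(0)], text), replacements


def write_outputs(md_path: Path, output_dir: Path,
                  redacted: str, replacements: Dict[str, str]) -> None:
    """Write <name>_final.md and <name>_entities.json (JSON layout is assumed)."""
    output_dir.mkdir(parents=True, exist_ok=True)
    (output_dir / f"{md_path.stem}_final.md").write_text(redacted, encoding="utf-8")
    (output_dir / f"{md_path.stem}_entities.json").write_text(
        json.dumps(replacements, indent=2, ensure_ascii=False), encoding="utf-8")


redacted, mapping = redact(
    "Alice Smith met Alice at Acme Corp.",
    ["Alice", "Alice Smith", "Acme Corp"],
    numbered_placeholder(),
)
with tempfile.TemporaryDirectory() as tmp:
    write_outputs(Path("report.md"), Path(tmp), redacted, mapping)
    written = sorted(p.name for p in Path(tmp).iterdir())
print(redacted, written)
```

Building the mapping once per document means the same entity always receives the same fake value, which keeps the redacted text readable.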
    python de-identification.py --model <model-name> [--input_dir PATH] [--output_dir PATH] [--vllm_url URL]

- --input_dir: Directory containing Markdown files (default: current directory)
- --output_dir: Where to write <name>_entities.json and result.txt
- --vllm_url: URL of the chat completions endpoint
- --model: Model identifier on the vLLM server
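For readers who want to see how these options fit together, here is a hedged argparse sketch of the same interface. Only the flag names, the required --model, and the current-directory default for --input_dir come from the usage line above. The defaults for --output_dir and --vllm_url are assumptions and may differ from the actual script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Redact personal or organisational entities in Markdown files.")
    # --model has no brackets in the usage line, so it is treated as required.
    parser.add_argument("--model", required=True,
                        help="Model identifier on the vLLM server")
    parser.add_argument("--input_dir", default=".",
                        help="Directory containing Markdown files")
    # Assumed default: the README does not state one for --output_dir.
    parser.add_argument("--output_dir", default=".",
                        help="Where to write <name>_entities.json and result.txt")
    # Assumed default: a local server exposing the chat completions route.
    parser.add_argument("--vllm_url",
                        default="http://localhost:8000/v1/chat/completions",
                        help="URL of the chat completions endpoint")
    return parser


args = build_parser().parse_args(["--model", "example-model"])
print(args.model, args.input_dir, args.vllm_url)
```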
publicly_avail_knowledge_base.json lists known organisation names and shorthand terms mapped to replacement placeholders. The file was renamed from knowledge_base.json and the de-identification script expects this name by default; adjust the path if needed.
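The exact schema of the knowledge base is not described here, so the sketch below assumes the simplest layout that fits the description: a flat JSON object mapping each term to its placeholder. The organisation name, the abbreviation, and the placeholder are invented for illustration.

```python
import json
import re
from typing import Dict


def apply_knowledge_base(text: str, knowledge_base: Dict[str, str]) -> str:
    """Replace known terms with their placeholders, matching whole words only."""
    if not knowledge_base:
        return text
    # Longest terms first, so a full name wins over an abbreviation inside it.
    terms = sorted(knowledge_base, key=lambda t: (-len(t), t))
    alternatives = "|".join(re.escape(t) for t in terms)
    # The lookarounds stop "EHA" from matching inside a longer word such as "EHAB".
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)")
    return pattern.sub(lambda m: knowledge_base[m.group(0)], text)


# Hypothetical file content; the real file may be structured differently.
kb = json.loads('{"Example Health Agency": "[ORG_1]", "EHA": "[ORG_1]"}')
redacted_kb = apply_knowledge_base(
    "The EHA report cites the Example Health Agency, not EHAB.", kb)
print(redacted_kb)
```

Mapping a full name and its shorthand to the same placeholder keeps references to one organisation consistent after redaction.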
Python 3 with the following packages:
- pandas
- img2pdf
- Pillow
- openai
- requests
- faker
- tqdm
Conversion also requires LibreOffice (for documents) and the marker_single command-line tool in your PATH.
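A quick way to confirm that the external tools are reachable before starting a long conversion run is sketched below. The executable name `soffice` for LibreOffice is an assumption, since some installations expose `libreoffice` instead. `marker_single` is the name given above.

```python
import shutil
from typing import Iterable, List


def missing_tools(candidates: Iterable[str]) -> List[str]:
    """Return the executables that cannot be found on PATH."""
    return [name for name in candidates if shutil.which(name) is None]


# "soffice" is an assumed binary name for LibreOffice.
REQUIRED_TOOLS = ("soffice", "marker_single")

missing = missing_tools(REQUIRED_TOOLS)
if missing:
    print("Missing from PATH: " + ", ".join(missing))
else:
    print("All external tools found.")
```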
Place input files under a dataset/ directory. The script creates progress.csv and initial_files.csv to keep track of what has been processed.
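The resume behaviour that progress.csv enables could look roughly like the sketch below. The column names (`file`, `status`), the `done` marker, and the helper names are assumptions. The standard csv module is used here to keep the example free of dependencies, although the script itself lists pandas among its requirements.

```python
import csv
from pathlib import Path
from typing import List, Set


def load_done(progress_csv: Path) -> Set[str]:
    """Read the paths already marked as done; empty if the file does not exist."""
    if not progress_csv.exists():
        return set()
    with progress_csv.open(newline="", encoding="utf-8") as fh:
        return {row["file"] for row in csv.DictReader(fh) if row["status"] == "done"}


def mark_done(progress_csv: Path, file_path: str) -> None:
    """Append one processed file, writing the header on first use."""
    is_new = not progress_csv.exists()
    with progress_csv.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        if is_new:
            writer.writerow(["file", "status"])  # assumed column names
        writer.writerow([file_path, "done"])


def pending_files(dataset_dir: Path, progress_csv: Path) -> List[Path]:
    """List files under dataset_dir that are not yet recorded as done."""
    done = load_done(progress_csv)
    files = sorted(p for p in dataset_dir.rglob("*") if p.is_file())
    return [p for p in files if str(p) not in done]
```

Calling `mark_done` right after each successful conversion means an interrupted run can be restarted without reprocessing files that already finished.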
deidentification_obsolated.py provides an earlier implementation using Presidio for entity detection. It is kept for reference only.