An Open Source Python-Based DICOM De-Identification Pipeline

Created by Zixin Nie at RTI International

Publishing date: Feb. 11, 2025

This pipeline leverages several open-source tools to de-identify medical imaging files in DICOM format.

Pydicom: Used for viewing, manipulating, compressing, and writing out DICOM files
Pydicom de-ID: Used for an initial de-identification pass, combining rules-based de-ID of DICOM headers with pre-defined bounding-box de-ID of pixel data
PaddleOCR: Used to detect text in DICOM images
GliNER: Used to detect PII within free-text fields in DICOM headers and within the OCR-detected text

When you first run the pipeline, the PaddleOCR and GliNER models will be downloaded to your system (we recommend doing so in a virtual environment). The total download size is about 3.5 GB. The download only happens on the first run; future runs use the models already on your system.

The pipeline is modularized in the following fashion:

Module 1: De-identification of DICOM headers and bounding-box de-identification of DICOM pixel data using Pydicom and Pydicom de-ID
Module 2: De-identification of DICOM pixel data using OCR and NLP (which is split into two sub-modules)
Module 2.1: Uses OCR to detect all text for de-identification
Module 2.2: Uses OCR and NLP to detect only PII for de-identification
Module 3: De-identification of DICOM headers using NLP, which is split into two sub-modules
Module 3.1: De-identification of free-text fields in DICOM headers using NLP
Module 3.2: De-identification of date fields in DICOM headers using dateshifting
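
As a rough illustration of the dateshifting idea behind Module 3.2, the sketch below shifts DICOM DA-format date strings (YYYYMMDD) by a fixed number of days using only the standard library. The helper names shift_da_date and pick_offset, and the policy of drawing one backward offset and reusing it for every date, are assumptions made for illustration; the pipeline's own dateshift function may work differently.

```python
import random
from datetime import datetime, timedelta


def shift_da_date(da_value: str, offset_days: int) -> str:
    """Shift a DICOM DA string (YYYYMMDD) by offset_days (hypothetical helper)."""
    parsed = datetime.strptime(da_value, "%Y%m%d")
    return (parsed + timedelta(days=offset_days)).strftime("%Y%m%d")


def pick_offset(max_dateshift: int = 365, seed=None) -> int:
    """Draw a backward offset of 1..max_dateshift days (assumed policy)."""
    rng = random.Random(seed)
    return -rng.randint(1, max_dateshift)


# Reusing the same offset for related dates keeps the interval between them intact.
offset = pick_offset(max_dateshift=365, seed=42)
study_date = shift_da_date("20240115", offset)
series_date = shift_da_date("20240117", offset)
```

Applying a single offset to all dates that belong together means the shifted study and series dates above are still two days apart, which is usually what downstream analysis needs.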

Configuration and execution of the pipeline are both controlled by the deid_pipeline function, which has options to turn the different modules and submodules on or off. The function has the following options:

PII: True means we only look for PII in the detected text. False means we detect all text
gliner: True means we use GliNER for finding PII. False means GliNER will not be used and the pipeline defaults to Scrubadub
scrubadub: True means we use Scrubadub for finding PII after using GliNER. False means we do not use Scrubadub after GliNER.
perf_header_bounding_box_deID: True means we use the header_bounding_box_deID function to leverage Pydicom deid to de-identify the header data and perform bounding box pixel de-ID
perf_image_deID: True means we use the image_deID function to detect text in pixel data and perform de-identification
redact: True means we black out detected text using a filled in rectangle. False means we draw a bounding box around detected text.
perf_dateshift: True means we use the dateshift function to dateshift dates in headers
perf_gliner_NLP_header_redaction: True means we use GliNER to scan through free-text fields in headers to remove PII
labels: used by gliner_NLP_header_redaction to tell it what kinds of PII to look for
VR_check: used by gliner_NLP_header_redaction to tell what types of header fields it should look through to remove PII. Default settings cover all fields that can contain free-text
gliner_threshold: significance threshold for GliNER PII detections
ocr_threshold: significance threshold for OCR text detections
compression: flag for whether we want to compress the files. Default is to leave the files uncompressed. Current compression methods supported are RLELossless, JPEGLSLossless, JPEGLSNearLossless, JPEG2000Lossless, and JPEG2000.
id_mask: flag for whether to mask IDs. If set to True, then masking will be performed using the provided lookup table. If False, then no masking of IDs or Patient Names will be performed.
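
To make the threshold and redact options more concrete, here is a minimal sketch in plain Python. It assumes detections arrive as dictionaries with a score and an inclusive (top, left, bottom, right) box, and it uses a small nested list as a stand-in for pixel data. The names keep_confident and mark_region, the detection layout, and the use of a greater-than-or-equal comparison are all hypothetical and are not taken from the pipeline's code.

```python
def keep_confident(detections, threshold):
    """Keep only detections whose score meets the threshold (assumed >= rule)."""
    return [d for d in detections if d["score"] >= threshold]


def mark_region(pixels, box, redact=True, value=0):
    """Fill the box when redact is True, otherwise draw only its outline.

    pixels is a list of rows; box is (top, left, bottom, right), inclusive.
    """
    top, left, bottom, right = box
    for r in range(top, bottom + 1):
        for c in range(left, right + 1):
            on_edge = r in (top, bottom) or c in (left, right)
            if redact or on_edge:
                pixels[r][c] = value
    return pixels


# Two made-up detections: one confident, one below the OCR threshold.
detections = [
    {"text": "DOE^JOHN", "score": 0.91, "box": (0, 0, 2, 3)},
    {"text": "~~", "score": 0.10, "box": (3, 0, 4, 1)},
]
kept = keep_confident(detections, threshold=0.25)

image = [[255] * 5 for _ in range(5)]  # 5x5 all-white stand-in image
for det in kept:
    mark_region(image, det["box"], redact=True)
```

With redact=False the same call would leave the interior of the box untouched and only mark its border, which is useful for reviewing what the detector found before committing to redaction.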

The modules have the following dependencies:

Module 1: Requires a recipe file, which specifies the transformations to be applied to DICOM headers and the locations of the bounding boxes. This requirement comes from the Pydicom de-ID module. More information about recipe files can be found in the Pydicom de-ID documentation (https://pydicom.github.io/deid/examples/recipe/). We have provided pre-built recipes that implement the following de-identification rulesets:

HIPAA Safe Harbor
DICOM PS3.15 2024e - Security and System Management Profiles Table E.1-1. Application Level Confidentiality Profile Attributes (in progress)

Module 2: Requires PaddleOCR and GliNER. By default, the pipeline uses the pre-trained en_PP-OCRv3 model for OCR and the "E3-JSI/gliner-multi-pii-domains-v1" pre-trained GliNER model for NER (https://huggingface.co/E3-JSI/gliner-multi-pii-domains-v1).

Module 3: Requires GliNER. This module uses the same GliNER model as Module 2.

Sample data containing phantom DICOM images (in the Imaging_Phantom_FallTest folder) and DICOM images with synthetic PII (in the ms_sample_dicom folder) have been provided. Lookup tables for ID replacement for these data are provided in the sample_lookup_tables folder (lookup_table.tsv.txt for the phantom data, and lookup_table_ms.tsv for the images with synthetic PII).

Quick Start

To get started, first download the project folder into a directory of your choice.

Next, install Python v3.11. You can get that version from here: https://www.python.org/downloads/release/python-3110/

Then set up a virtual environment to run this process, so that the dependencies do not conflict with your base Python installation (or any other virtual environments you may have).

To create and activate a virtual environment in Python, you can use the built-in venv module:

Open your terminal or command prompt
Navigate to the directory where you want to create the virtual environment
Create the virtual environment using the command python3 -m venv followed by the name of the directory to create the environment in
Activate the virtual environment using the appropriate command for your operating system
Install Python packages using pip
Deactivate the virtual environment when you're done working in it

Once you have the virtual environment set up and activated, use the requirements.txt file downloaded from the repository to install all necessary dependencies.

pip install -r /path/to/requirements.txt

After installing all required dependencies, the environment is set up and ready to go. The next step is to configure the de-identification pipeline.

There are three ways to do so. To configure from within a Python script, edit set_config.py directly and run it to generate the config.json file. Alternatively, open the config.json file found in the repository in a text editor such as Notepad and edit it directly. Finally, to use the command line, run set_config_cmd_args.py to generate a config.json file. set_config_cmd_args.py takes the following arguments, which are also present in the config.json file.

project_directory: A string specifying the full path of your project directory. This directory is where you store the two key files "functions.py" and "classes.py" downloaded from the repository
input_folder: A string specifying the full path of the input folder where the DICOM files are stored (if there are subfolders, the top level directory folder is fine to provide)
recipe_loc: A string specifying the full path of the DICOM recipe file
lookup_loc: A string specifying the full path of the lookup table for ID replacement
output_loc: A string specifying the full path of the output folder
sequential: A boolean. True means we use sequential processing. False means we use parallel processing (default true)
PII: True means we only look for PII in the detected text. False means we detect all text (default true)
gliner: True means we use GliNER for finding PII. False means GliNER will not be used and the pipeline defaults to Scrubadub
scrubadub: True means we use Scrubadub for finding PII after using GliNER. False means we do not use Scrubadub after GliNER.
perf_header_deID: True means we use the header_deID function to leverage Pydicom deid to de-identify the header data (default true)
perf_bounding_box_deID: True means we use the bounding_box_deID function to leverage Pydicom deid to perform bounding box pixel de-ID (default true)
perf_image_deID: True means we use the image_deID function to detect text in pixel data and perform de-identification (default true)
redact: True means we black out detected text using a filled in rectangle. False means we draw a bounding box around detected text.
id_mask: True means masking will be performed using the provided lookup table. If False, then no masking of IDs or Patient Names will be performed.
perf_dateshift: True means we use the dateshift function to dateshift dates in headers (default false)
max_dateshift: Numeric value defining the maximum number of days to shift. Default is 365 (1 year)
perf_gliner_NLP_header_redaction: True means we use GliNER to scan through free-text fields in headers to remove PII (default true)
labels: used by gliner_NLP_header_redaction to tell it what kinds of PII to look for (default ["name", "date", "date of birth", "address"])
VR_check: used by gliner_NLP_header_redaction to tell what types of header fields it should look through to remove PII. Default settings cover all fields that can contain free-text (["LO", "LT", "SH", "ST", "UC", "UN", "UT"])
gliner_threshold: significance threshold for GliNER PII detections (default 0.5)
ocr_threshold: significance threshold for OCR text detections (default 0.25)
compression: Flag for whether we want to compress the files. True means outputted files will be compressed. Current compression supported is RLELossless (default false)
compression_method: Define a compression method from one of pydicom's native compression methods (RLELossless, JPEGLSLossless, JPEGLSNearLossless, JPEG2000Lossless, JPEG2000)
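
For readers who prefer to see the settings in one place, the sketch below builds a dictionary with the keys listed above and writes it out as JSON. The paths are placeholders, values marked "illustrative" have no default stated in this README, and the flat key layout is an assumption, since the exact structure that set_config.py produces is not shown here. The write_config helper is hypothetical and not part of the repository.

```python
import json
from pathlib import Path

# Placeholder paths: replace with real locations on your system.
config = {
    "project_directory": "/path/to/project",
    "input_folder": "/path/to/input_dicoms",
    "recipe_loc": "/path/to/recipe_file",
    "lookup_loc": "/path/to/lookup_table.tsv",
    "output_loc": "/path/to/output",
    "sequential": True,
    "PII": True,
    "gliner": True,                        # illustrative, no default stated
    "scrubadub": False,                    # illustrative, no default stated
    "perf_header_deID": True,
    "perf_bounding_box_deID": True,
    "perf_image_deID": True,
    "redact": True,                        # illustrative, no default stated
    "id_mask": True,                       # illustrative, no default stated
    "perf_dateshift": False,
    "max_dateshift": 365,
    "perf_gliner_NLP_header_redaction": True,
    "labels": ["name", "date", "date of birth", "address"],
    "VR_check": ["LO", "LT", "SH", "ST", "UC", "UN", "UT"],
    "gliner_threshold": 0.5,
    "ocr_threshold": 0.25,
    "compression": False,
    "compression_method": "RLELossless",   # illustrative, no default stated
}


def write_config(settings: dict, path: str) -> Path:
    """Serialize the settings to a JSON file and return its path."""
    target = Path(path)
    target.write_text(json.dumps(settings, indent=2), encoding="utf-8")
    return target
```

Calling write_config(config, "config.json") would then produce a file that can be compared against the config.json shipped in the repository before being used.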

After configuring the file, run main.py, which will run the pipeline. If running main.py from the command line, it takes the argument “config_loc”, which is the full directory path to your config.json file.

De-identified files that have gone through the pipeline will be outputted into the directory specified in “output_loc”. If there is a folder structure in the “input_folder”, then that same folder structure will be copied into the output directory.
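
The folder mirroring described above can be pictured with a short pathlib sketch. The function name mirrored_output_path and the example paths are hypothetical; this only illustrates the idea of preserving each file's location relative to input_folder, not how the pipeline implements it.

```python
from pathlib import Path


def mirrored_output_path(input_file, input_folder, output_loc) -> Path:
    """Map an input file to the same relative location under output_loc."""
    relative = Path(input_file).relative_to(input_folder)
    return Path(output_loc) / relative


# Example with made-up paths: subfolders under the input root are preserved.
out = mirrored_output_path(
    "/data/in/patient1/study1/img001.dcm",
    "/data/in",
    "/data/out",
)
```

Before writing a file to such a path, the parent directory would typically be created with out.parent.mkdir(parents=True, exist_ok=True).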
