RTIInternational/RTI-DICOM-DEID-External

An Open Source Python-Based DICOM De-Identification Pipeline

Created by Zixin Nie at RTI International

Publishing date: Feb. 11, 2025

This pipeline leverages several open-source tools to de-identify medical imaging files in DICOM format.

  • Pydicom: Used for viewing, manipulation, compression, and writing out DICOM files
  • Pydicom de-ID: Used for an initial de-identification pass using rules-based de-ID of DICOM headers and pre-defined bounding-box de-ID of pixel data
  • PaddleOCR: Used to detect text on DICOM images
  • GliNER: Used to detect PII within free-text fields in DICOM headers and within the OCR detected text

When you first run the pipeline, the PaddleOCR and GliNER models will be downloaded to your system (we recommend doing this in a virtual environment). The total download size is about 3.5 GB. The download only happens on the first run; later runs use the models already on your system.

The pipeline is modularized in the following fashion:

  • Module 1: De-identification of DICOM headers and bounding-box de-identification of DICOM pixel data using Pydicom and Pydicom de-ID
  • Module 2: De-identification of DICOM pixel data using OCR and NLP, which is split into two sub-modules
      • Module 2.1: Uses OCR to detect all text for de-identification
      • Module 2.2: Uses OCR and NLP to detect only PII for de-identification
  • Module 3: De-identification of DICOM headers using NLP, which is split into two sub-modules
      • Module 3.1: De-identification of free-text fields in DICOM headers using NLP
      • Module 3.2: De-identification of date fields in DICOM headers using dateshifting
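To illustrate the dateshifting idea behind Module 3.2, here is a minimal sketch. It is not the pipeline's own dateshift function: it assumes DICOM DA values in YYYYMMDD form and one offset reused for all of a patient's dates so that intervals between them are preserved. The names `pick_offset` and `shift_da` are hypothetical, and whether the real pipeline shifts in both directions is an assumption.

```python
import random
from datetime import datetime, timedelta

def pick_offset(max_dateshift=365, rng=None):
    """Draw a non-zero shift in days within [-max_dateshift, max_dateshift].

    Assumption: shifts may go forward or backward in time.
    """
    if max_dateshift < 1:
        raise ValueError("max_dateshift must be at least 1 day")
    rng = rng or random.Random()
    offset = 0
    while offset == 0:  # a zero shift would leave the date unchanged
        offset = rng.randint(-max_dateshift, max_dateshift)
    return offset

def shift_da(value, offset_days):
    """Shift a DICOM DA string (YYYYMMDD) by offset_days and return a DA string."""
    shifted = datetime.strptime(value, "%Y%m%d") + timedelta(days=offset_days)
    return shifted.strftime("%Y%m%d")

# Reuse one offset for every date belonging to the same patient.
offset = pick_offset(max_dateshift=365)
study_date = shift_da("20240101", offset)
series_date = shift_da("20240103", offset)
```

Because both dates move by the same number of days, the two-day gap between the study and series dates is unchanged after shifting.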

Configuration and execution of the pipeline are controlled from the deid_pipeline function, which has options to turn the different modules and submodules on or off. The function has the following options:

  • PII: True means we only look for PII in the detected text. False means we detect all text
  • gliner: True means we use GliNER for finding PII. False means GliNER will not be used (default to Scrubadub)
  • scrubadub: True means we use Scrubadub for finding PII after using GliNER. False means we do not use Scrubadub after GliNER.
  • perf_header_bounding_box_deID: True means we use the header_bounding_box_deID function to leverage Pydicom deid to de-identify the header data and perform bounding box pixel de-ID
  • perf_image_deID: True means we use the image_deID function to detect text in pixel data and perform de-identification
  • redact: True means we black out detected text using a filled-in rectangle. False means we only draw a bounding box around detected text.
  • perf_dateshift: True means we use the dateshift function to dateshift dates in headers
  • perf_gliner_NLP_header_redaction: True means we use GliNER to scan through free-text fields in headers to remove PII
  • labels: used by gliner_NLP_header_redaction to tell it what kinds of PII to look for
  • VR_check: used by gliner_NLP_header_redaction to tell what types of header fields it should look through to remove PII. Default settings cover all fields that can contain free-text
  • gliner_threshold: significance threshold for GliNER PII detections
  • ocr_threshold: significance threshold for OCR text detections
  • compression: flag for whether we want to compress the files. Default is to leave the files uncompressed. Current compression methods supported are RLELossless, JPEGLSLossless, JPEGLSNearLossless, JPEG2000Lossless, and JPEG2000.
  • id_mask: flag for whether to mask IDs. If set to True, then masking will be performed using the provided lookup table. If False, then no masking of IDs or Patient Names will be performed.
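The two threshold options act as confidence cut-offs on detections. A minimal sketch of that kind of filtering follows, assuming each detection is a dictionary with a `score` between 0 and 1. The dictionary layout, the sample values, and the function name `keep_confident` are illustrative and not taken from the pipeline's code; whether the pipeline compares with `>=` or `>` is also an assumption.

```python
def keep_confident(detections, threshold):
    """Return only the detections whose confidence score meets the threshold."""
    return [d for d in detections if d["score"] >= threshold]

# Hypothetical OCR detections: one clear text hit and one low-confidence artifact.
ocr_hits = [
    {"text": "DOE^JOHN", "score": 0.93},
    {"text": "l|l", "score": 0.12},
]

# With a threshold of 0.25, the low-confidence artifact is dropped.
kept = keep_confident(ocr_hits, threshold=0.25)
```

Raising a threshold reduces false positives but risks leaving real PII undetected, so a lower value is the more conservative choice for de-identification.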

The modules have the following dependencies:

Module 1: Requires a recipe file, which specifies the transformations to be applied to DICOM headers and the locations of the bounding boxes. This requirement comes from the Pydicom de-ID module. More information about recipe files can be found in the Pydicom de-ID documentation (https://pydicom.github.io/deid/examples/recipe/). We have provided pre-built recipes that implement the following de-identification rulesets:

  • HIPAA Safe Harbor
  • DICOM PS3.15 2024e - Security and System Management Profiles Table E.1-1. Application Level Confidentiality Profile Attributes (in progress)

Module 2: Requires PaddleOCR and GliNER. By default, it uses the pre-trained en_PP-OCRv3 model for OCR and the pre-trained "E3-JSI/gliner-multi-pii-domains-v1" GliNER model for NER (https://huggingface.co/E3-JSI/gliner-multi-pii-domains-v1).

Module 3: Requires GliNER. This module uses the same GliNER model as Module 2.

Sample data containing phantom DICOM images (in the Imaging_Phantom_FallTest folder) and DICOM images with synthetic PII (in the ms_sample_dicom folder) have been provided. Lookup tables for ID replacement for these data are provided in the sample_lookup_tables folder (lookup_table.tsv.txt for the phantom data, and lookup_table_ms.tsv for the images with synthetic PII).

Quick Start

To get started, first download the project folder into a directory of your choice.

Next, install Python v3.11. You can get that version from here: https://www.python.org/downloads/release/python-3110/

Then set up a virtual environment to run the pipeline, so that its dependencies do not conflict with your base Python installation (or any other virtual environments you may have).

To create and activate a virtual environment in Python, you can use the built-in venv module:

  1. Open your terminal or command prompt
  2. Navigate to the directory where you want to create the virtual environment
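  3. Create the virtual environment using the command python3 -m venv followed by a directory name for the environment (for example, python3 -m venv venv)
  4. Activate the virtual environment using the appropriate command for your operating system (for example, source venv/bin/activate on macOS or Linux, or venv\Scripts\activate on Windows)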
  5. Install Python packages using pip
  6. Deactivate the virtual environment when you're done working in it

Once you have the virtual environment set up and activated, use the requirements.txt file downloaded from the repository to install all necessary dependencies.

pip install -r /path/to/requirements.txt

After installing all required dependencies, the environment is set up and ready to go. The next step is to configure the de-identification pipeline.

There are three ways to do so. To configure from within a Python script, edit set_config.py directly and run it to generate the config.json file. Alternatively, open the config.json file found in the repository in a text editor such as Notepad and edit it directly. Finally, to work from the command line, run set_config_cmd_args.py to generate a config.json file. set_config_cmd_args.py takes the following arguments, which are also present in the config.json file:

  • project_directory: A string specifying the full path of your project directory. This directory is where you store the two key files "functions.py" and "classes.py" downloaded from the repository
  • input_folder: A string specifying the full path of the input folder where the DICOM files are stored (if there are subfolders, the top level directory folder is fine to provide)
  • recipe_loc: A string specifying the full path of the DICOM recipe file
  • lookup_loc: A string specifying the full path of the lookup table for ID replacement
  • output_loc: A string specifying the full path of the output folder
  • sequential: A boolean. If true, sequential processing is used; if false, parallel processing is used (default true)
  • PII: True means we only look for PII in the detected text. False means we detect all text (default true)
  • gliner: True means we use GliNER for finding PII. False means GliNER will not be used (default to Scrubadub)
  • scrubadub: True means we use Scrubadub for finding PII after using GliNER. False means we do not use Scrubadub after GliNER.
  • perf_header_deID: True means we use the header_deID function to leverage Pydicom deid to de-identify the header data (default true)
  • perf_bounding_box_deID: True means we use the bounding_box_deID function to leverage Pydicom deid to perform bounding box pixel de-ID (default true)
  • perf_image_deID: True means we use the image_deID function to detect text in pixel data and perform de-identification (default true)
  • redact: True means we black out detected text using a filled-in rectangle. False means we only draw a bounding box around detected text.
  • id_mask: True means masking will be performed using the provided lookup table. If False, then no masking of IDs or Patient Names will be performed.
  • perf_dateshift: True means we use the dateshift function to dateshift dates in headers (default false)
  • max_dateshift: Numeric value defining the maximum number of days to shift. Default is 365 (1 year)
  • perf_gliner_NLP_header_redaction: True means we use GliNER to scan through free-text fields in headers to remove PII (default true)
  • labels: used by gliner_NLP_header_redaction to tell it what kinds of PII to look for (default ["name", "date", "date of birth", "address"])
  • VR_check: used by gliner_NLP_header_redaction to tell what types of header fields it should look through to remove PII. Default settings cover all fields that can contain free-text (["LO", "LT", "SH", "ST", "UC", "UN", "UT"])
  • gliner_threshold: significance threshold for GliNER PII detections (default 0.5)
  • ocr_threshold: significance threshold for OCR text detections (default 0.25)
  • compression: Flag for whether we want to compress the files. True means outputted files will be compressed using the method given in compression_method (default false)
  • compression_method: Define a compression method from one of pydicom's native compression methods (RLELossless, JPEGLSLossless, JPEGLSNearLossless, JPEG2000Lossless, JPEG2000)
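As an illustration of the resulting file, the sketch below writes a config.json containing the keys listed above with the defaults stated in this README. The paths are placeholders. The values for options with no stated default (gliner, scrubadub, redact, id_mask, compression_method) are assumptions chosen for the example, as are the exact value types the pipeline expects, and `write_config` is a hypothetical helper rather than part of the repository.

```python
import json
from pathlib import Path

config = {
    # Placeholder paths: replace with full paths on your system.
    "project_directory": "/path/to/project",
    "input_folder": "/path/to/input",
    "recipe_loc": "/path/to/recipe",
    "lookup_loc": "/path/to/lookup_table.tsv",
    "output_loc": "/path/to/output",
    "sequential": True,
    "PII": True,
    "gliner": True,        # assumed value, no default stated
    "scrubadub": False,    # assumed value, no default stated
    "perf_header_deID": True,
    "perf_bounding_box_deID": True,
    "perf_image_deID": True,
    "redact": True,        # assumed value, no default stated
    "id_mask": True,       # assumed value, no default stated
    "perf_dateshift": False,
    "max_dateshift": 365,
    "perf_gliner_NLP_header_redaction": True,
    "labels": ["name", "date", "date of birth", "address"],
    "VR_check": ["LO", "LT", "SH", "ST", "UC", "UN", "UT"],
    "gliner_threshold": 0.5,
    "ocr_threshold": 0.25,
    "compression": False,
    "compression_method": "RLELossless",  # assumed value, no default stated
}

def write_config(path="config.json"):
    """Serialize the config dictionary to a JSON file and return its path."""
    Path(path).write_text(json.dumps(config, indent=2))
    return path
```

Calling `write_config("config.json")` produces a file you can then adjust by hand. Compare it against the config.json shipped in the repository before relying on it, since that file is the authoritative template.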

After configuring the file, run main.py, which will run the pipeline. If running main.py from the command line, it takes the argument “config_loc”, which is the full directory path to your config.json file.

De-identified files that have gone through the pipeline will be outputted into the directory specified in “output_loc”. If there is a folder structure in the “input_folder”, then that same folder structure will be copied into the output directory.
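The folder mirroring described above can be pictured with a short sketch. This is not the pipeline's own code; `mirrored_output_path` is a hypothetical helper and the example paths are made up. It maps an input file to the matching location under the output directory.

```python
from pathlib import Path

def mirrored_output_path(input_file, input_folder, output_loc):
    """Map a file under input_folder to the same relative path under output_loc."""
    relative = Path(input_file).relative_to(input_folder)
    return Path(output_loc) / relative

target = mirrored_output_path(
    "/data/in/patient01/study1/img001.dcm",  # hypothetical input file
    "/data/in",                              # plays the role of input_folder
    "/data/out",                             # plays the role of output_loc
)
# Before writing a file, its parent folders would need to exist, e.g.:
# target.parent.mkdir(parents=True, exist_ok=True)
```

Here the sub-folders patient01/study1 are carried over unchanged beneath the output directory.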
