1 | | -# deidentification |
2 | | -Deidentify people's names and gender specific pronouns |
| 1 | +# Deidentification |
| 2 | + |
| 3 | +A Python module that removes personally identifiable information (PII) from text documents, focusing on personal names and gender-specific pronouns. It combines spaCy's Named Entity Recognition (NER) with custom pronoun handling to de-identify text.
| 4 | + |
| 5 | +## Key Features |
| 6 | + |
| 7 | +- Accurately identifies and replaces personal names using spaCy's NER |
| 8 | +- Handles gender-specific pronouns with customizable replacements |
| 9 | +- Supports both plain text and HTML output formats |
| 10 | +- Uses an optimized backward-processing strategy for accurate text replacements |
| 11 | +- Repeats processing until no new entities are detected
| 12 | +- Configurable replacement tokens and debug output |
| 13 | +- GPU acceleration support through spaCy |
| 14 | + |
| 15 | +## Installation |
| 16 | + |
| 17 | +```bash |
| 18 | +pip install git+https://github.com/jftuga/deidentification.git |
| 19 | +``` |
| 20 | + |
| 21 | +### Requirements |
| 22 | + |
| 23 | +- Python 3.7 or higher |
| 24 | +- spaCy |
| 25 | +- spaCy's `en_core_web_trf` model (or another compatible model) |
| 26 | + |
| 27 | +Download the required spaCy model: |
| 28 | +```bash |
| 29 | +python -m spacy download en_core_web_trf |
| 30 | +``` |
| 31 | + |
| 32 | +Debug mode (`debug=True` in `DeidentificationConfig`) also requires [VeryPrettyTable](https://github.com/smeggingsmegger/):
| 33 | +```bash |
| 34 | +pip install VeryPrettyTable |
| 35 | +``` |
| 36 | + |
| 37 | +## Usage |
| 38 | + |
| 39 | +### Basic Usage |
| 40 | + |
| 41 | +```python |
| 42 | +from deidentification import Deidentification |
| 43 | + |
| 44 | +# Create a deidentification instance with default settings |
| 45 | +deidentifier = Deidentification() |
| 46 | + |
| 47 | +# Process text |
| 48 | +text = "John Smith went to the store. He bought some groceries." |
| 49 | +deidentified_text = deidentifier.deidentify(text) |
| 50 | +print(deidentified_text) |
| 51 | +# Output: "PERSON went to the store. HE/SHE bought some groceries." |
| 52 | +``` |
| 53 | + |
| 54 | +### HTML Output |
| 55 | + |
| 56 | +```python |
| 57 | +# Generate HTML output with highlighted replacements |
| 58 | +html_output = deidentifier.deidentify_with_wrapped_html(text) |
| 59 | +``` |
| 60 | + |
| 61 | +### Custom Configuration |
| 62 | + |
| 63 | +```python |
| 64 | +from deidentification import ( |
| 65 | + Deidentification, |
| 66 | + DeidentificationConfig, |
| 67 | + DeidentificationOutputStyle, |
| 68 | +) |
| 69 | + |
| 70 | +config = DeidentificationConfig( |
| 71 | + spacy_model="en_core_web_trf", |
| 72 | + output_style=DeidentificationOutputStyle.HTML, |
| 73 | + replacement="[REDACTED]", |
| 74 | + debug=True |
| 75 | +) |
| 76 | +deidentifier = Deidentification(config) |
| 77 | +``` |
| 78 | + |
| 79 | +## Configuration Options |
| 80 | + |
| 81 | +The `DeidentificationConfig` class supports the following options: |
| 82 | + |
| 83 | +- `spacy_load` (bool): Whether to load the spaCy model (default: True) |
| 84 | +- `spacy_model` (str): Name of the spaCy model to use (default: "en_core_web_trf") |
| 85 | +- `output_style` (DeidentificationOutputStyle): Output format - TEXT or HTML (default: TEXT) |
| 86 | +- `replacement` (str): Replacement text for identified names (default: "PERSON") |
| 87 | +- `debug` (bool): Enable debug output (default: False) |
| 88 | + |
| 89 | +## How It Works |
| 90 | + |
| 91 | +The de-identification process follows these steps: |
| 92 | + |
| 93 | +1. Text is normalized for consistent processing |
| 94 | +2. spaCy processes the text to identify person entities |
| 95 | +3. Gender-specific pronouns are identified using a predefined list |
| 96 | +4. Entities and pronouns are sorted by their position in reverse order |
| 97 | +5. Replacements are made from end to beginning to maintain position accuracy |
| 98 | +6. The process repeats until no new entities are detected |
| 99 | + |
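| 100 | +The backward-processing strategy is key to accurate replacements. A replacement that changes the length of the text only shifts the characters after it, so working from the end toward the beginning keeps the recorded positions of all earlier entities and pronouns valid.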
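
The end-to-beginning replacement described in the steps above can be illustrated with a minimal, standalone sketch. This is not the module's actual source: `PRONOUN_REPLACEMENTS`, `find_pronoun_spans`, and `replace_backward` are hypothetical names, the pronoun map is a small assumed subset, and the person span is supplied by hand where the real tool would obtain it from spaCy's NER.

```python
import re
from typing import List, Tuple

# (start offset, end offset, replacement text)
Span = Tuple[int, int, str]

# Hypothetical pronoun map for illustration; the module's real list may differ.
PRONOUN_REPLACEMENTS = {
    "he": "HE/SHE",
    "she": "HE/SHE",
    "him": "HIM/HER",
}

def find_pronoun_spans(text: str) -> List[Span]:
    """Locate whole-word pronouns and record their character offsets."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, PRONOUN_REPLACEMENTS)) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), PRONOUN_REPLACEMENTS[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]

def replace_backward(text: str, spans: List[Span]) -> str:
    """Apply replacements from the last span to the first.

    Each replacement only shifts text located after it, so the offsets
    of spans that are still waiting to be processed remain correct.
    """
    for start, end, replacement in sorted(spans, key=lambda s: s[0], reverse=True):
        text = text[:start] + replacement + text[end:]
    return text

text = "John Smith went to the store. He bought some groceries."

# Stand-in for NER output: the person span is located by hand here.
name = "John Smith"
name_start = text.index(name)
person_spans = [(name_start, name_start + len(name), "PERSON")]

result = replace_backward(text, person_spans + find_pronoun_spans(text))
print(result)
```

Applying the same spans front to back would require adjusting every later offset after each replacement, because "PERSON" and "John Smith" differ in length; sorting in reverse avoids that bookkeeping.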
| 101 | + |
| 102 | +## Debug Output |
| 103 | + |
| 104 | +When debug mode is enabled, the tool provides detailed information about: |
| 105 | +- Identified person entities |
| 106 | +- Found pronouns |
| 107 | +- Replacement positions and actions |
| 108 | +- Processing iterations |
| 109 | + |
| 110 | +## Contributing |
| 111 | + |
| 112 | +Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change. |
| 113 | + |
| 114 | +## License |
| 115 | + |
| 116 | +This project is licensed under the MIT License - see the LICENSE file for details. |