Commit 1c10533

Merge pull request #1 from jftuga/dev
initial commit
2 parents 53c2cba + 5632149 commit 1c10533

16 files changed, with 2044 additions and 8 deletions

.gitignore

Lines changed: 5 additions & 5 deletions
@@ -161,11 +161,11 @@ dmypy.json
 cython_debug/
 
 # PyCharm
-# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
-# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
-# and can be added to the global gitignore or merged into this file. For a more nuclear
-# option (not recommended) you can uncomment the following to ignore the entire idea folder.
-#.idea/
+.idea/
 
 # PyPI configuration file
 .pypirc
+.??*~
+*.html
+*.json
+

.python-version

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+3.12

LICENSE

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 MIT License
 
-Copyright (c) 2025 John Taylor
+Copyright (c) 2024 John Taylor
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

MANIFEST.in

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+include LICENSE
+include README.md
+include pyproject.toml
+include requirements.txt

Pipfile

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+[[source]]
+url = "https://pypi.org/simple"
+verify_ssl = true
+name = "pypi"
+
+[dev-packages]
+black = "*"
+ruff = "*"
+veryprettytable = {git = "https://github.com/andrewspiers/VeryPrettyTable.git"}
+
+[packages]
+chardet = ">=5.2.0"
+spacy = ">=3.8.3"
+torch = ">=2.5.1"
+
+[requires]
+python_version = "3.12"

Pipfile.lock

Lines changed: 1147 additions & 0 deletions
This generated file is not rendered by default.

README.md

Lines changed: 116 additions & 2 deletions
@@ -1,2 +1,116 @@
-# deidentification
-Deidentify people's names and gender specific pronouns
+# Deidentification
+
+A Python module that removes personally identifiable information (PII) from text documents, focusing on personal names and gender-specific pronouns. This tool uses spaCy's Named Entity Recognition (NER) capabilities combined with custom pronoun handling to provide thorough text de-identification.
+
+## Key Features
+
+- Accurately identifies and replaces personal names using spaCy's NER
+- Handles gender-specific pronouns with customizable replacements
+- Supports both plain text and HTML output formats
+- Uses an optimized backward-processing strategy for accurate text replacements
+- Iterative processing ensures comprehensive PII removal
+- Configurable replacement tokens and debug output
+- GPU acceleration support through spaCy
+
+## Installation
+
+```bash
+pip install git+https://github.com/jftuga/deidentification.git
+```
+
+### Requirements
+
+- Python 3.7 or higher
+- spaCy
+- spaCy's `en_core_web_trf` model (or another compatible model)
+
+Download the required spaCy model:
+```bash
+python -m spacy download en_core_web_trf
+```
+
+For debugging, by setting `config.debug=True`, you will also need [VeryPrettyTable](https://github.com/smeggingsmegger/):
+```bash
+pip install VeryPrettyTable
+```
+
+## Usage
+
+### Basic Usage
+
+```python
+from deidentification import Deidentification
+
+# Create a deidentification instance with default settings
+deidentifier = Deidentification()
+
+# Process text
+text = "John Smith went to the store. He bought some groceries."
+deidentified_text = deidentifier.deidentify(text)
+print(deidentified_text)
+# Output: "PERSON went to the store. HE/SHE bought some groceries."
+```
+
+### HTML Output
+
+```python
+# Generate HTML output with highlighted replacements
+html_output = deidentifier.deidentify_with_wrapped_html(text)
+```
+
+### Custom Configuration
+
+```python
+from deidentification import (
+    Deidentification,
+    DeidentificationConfig,
+    DeidentificationOutputStyle,
+)
+
+config = DeidentificationConfig(
+    spacy_model="en_core_web_trf",
+    output_style=DeidentificationOutputStyle.HTML,
+    replacement="[REDACTED]",
+    debug=True
+)
+deidentifier = Deidentification(config)
+```
+
+## Configuration Options
+
+The `DeidentificationConfig` class supports the following options:
+
+- `spacy_load` (bool): Whether to load the spaCy model (default: True)
+- `spacy_model` (str): Name of the spaCy model to use (default: "en_core_web_trf")
+- `output_style` (DeidentificationOutputStyle): Output format - TEXT or HTML (default: TEXT)
+- `replacement` (str): Replacement text for identified names (default: "PERSON")
+- `debug` (bool): Enable debug output (default: False)
+
+## How It Works
+
+The de-identification process follows these steps:
+
+1. Text is normalized for consistent processing
+2. spaCy processes the text to identify person entities
+3. Gender-specific pronouns are identified using a predefined list
+4. Entities and pronouns are sorted by their position in reverse order
+5. Replacements are made from end to beginning to maintain position accuracy
+6. The process repeats until no new entities are detected
+
+The backward-processing strategy is key to accurate replacements, as it prevents position shifts from affecting subsequent replacements.
+
+## Debug Output
+
+When debug mode is enabled, the tool provides detailed information about:
+- Identified person entities
+- Found pronouns
+- Replacement positions and actions
+- Processing iterations
+
+## Contributing
+
+Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
+
+## License
+
+This project is licensed under the MIT License - see the LICENSE file for details.
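
Reviewer note: the "How It Works" section of the README above describes sorting matches in reverse order and replacing from the end of the text toward the beginning. The following is a minimal sketch of that idea only, not the module's actual implementation. The pronoun map, the helper names `find_pronoun_spans` and `replace_backward`, and the hard-coded person span (which would normally come from spaCy's NER) are all assumptions made for illustration.

```python
import re

# Hypothetical pronoun map for illustration; the project's real list may differ.
PRONOUNS = {
    "he": "HE/SHE",
    "she": "HE/SHE",
    "him": "HIM/HER",
    "her": "HIM/HER",
    "his": "HIS/HER",
}


def find_pronoun_spans(text):
    """Return (start, end, replacement) for each whole-word pronoun match."""
    pattern = re.compile(r"\b(" + "|".join(PRONOUNS) + r")\b", re.IGNORECASE)
    return [
        (m.start(), m.end(), PRONOUNS[m.group(1).lower()])
        for m in pattern.finditer(text)
    ]


def replace_backward(text, spans):
    """Apply replacements from the last span to the first.

    Editing the tail first leaves the offsets of earlier spans untouched,
    even when a replacement is longer or shorter than the original.
    """
    for start, end, replacement in sorted(spans, key=lambda s: s[0], reverse=True):
        text = text[:start] + replacement + text[end:]
    return text


text = "John Smith went to the store. He bought some groceries."
# Assumption: in the real module this span would be produced by spaCy NER.
person_spans = [(0, 10, "PERSON")]
result = replace_backward(text, person_spans + find_pronoun_spans(text))
print(result)
```

Applying the same spans front to back would shift every later offset by the length difference of each earlier replacement, which is the position drift the README says the backward pass avoids.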

deidentification/__init__.py

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+"""A Python module for de-identifying personally identifiable information in text."""
+
+from .deidentification import Deidentification, DeidentificationConfig, DeidentificationOutputStyle
+from .deidentification_constants import pgmName, pgmVersion, pgmUrl
+
+__version__ = pgmVersion
+__author__ = "John Taylor"
+__all__ = [
+    "Deidentification",
+    "DeidentificationConfig",
+    "DeidentificationOutputStyle",
+    "pgmName",
+    "pgmVersion",
+    "pgmUrl",
+]
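
Reviewer note: the package exports `DeidentificationConfig`, and the README in this commit documents five options with their defaults. The sketch below mirrors those documented defaults in a stand-alone dataclass so the shape of the configuration is easy to see without installing spaCy. The names `ConfigSketch` and `OutputStyleSketch` are hypothetical stand-ins; how the real class is implemented is not shown on this page.

```python
from dataclasses import dataclass
from enum import Enum


class OutputStyleSketch(Enum):
    """Hypothetical stand-in for DeidentificationOutputStyle (TEXT or HTML)."""

    TEXT = "text"
    HTML = "html"


@dataclass
class ConfigSketch:
    """Hypothetical mirror of the options and defaults listed in the README."""

    spacy_load: bool = True
    spacy_model: str = "en_core_web_trf"
    output_style: OutputStyleSketch = OutputStyleSketch.TEXT
    replacement: str = "PERSON"
    debug: bool = False


# Override only the options that differ from the documented defaults.
config = ConfigSketch(replacement="[REDACTED]", debug=True)
print(config)
```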
