Commit f64dab7

feat: initial release with text, image, and dict scrubbing
0 parents  commit f64dab7

23 files changed: +2854 -0 lines changed

.gitignore

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# Virtual environments
.venv/
venv/
ENV/

# Testing
.pytest_cache/
.coverage
htmlcov/
.tox/
.nox/

# IDEs
.idea/
.vscode/
*.swp
*.swo
*~

# OS
.DS_Store
Thumbs.db

# uv
uv.lock

README.md

Lines changed: 242 additions & 0 deletions
@@ -0,0 +1,242 @@
# openadapt-privacy

Privacy scrubbing for GUI automation data - PII/PHI detection and redaction.

## Installation

```bash
pip install openadapt-privacy
```

For Presidio-based scrubbing (recommended):

```bash
# Quote the extra so shells such as zsh do not expand the brackets
pip install "openadapt-privacy[presidio]"
python -m spacy download en_core_web_trf
```

## Quick Start

### Text Scrubbing

```python
from openadapt_privacy.providers.presidio import PresidioScrubbingProvider

scrubber = PresidioScrubbingProvider()

text = "Contact John Smith at [email protected] or 555-123-4567"
scrubbed = scrubber.scrub_text(text)
```

**Input:**
```
Contact John Smith at [email protected] or 555-123-4567
```

**Output:**
```
Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>
```

### Example Inputs & Outputs

| Input | Output |
|-------|--------|
| `My email is [email protected]` | `My email is <EMAIL_ADDRESS>` |
| `SSN: 923-45-6789` | `SSN: <US_SSN>` |
| `Card: 4532-1234-5678-9012` | `Card: <CREDIT_CARD>` |
| `Call me at 555-123-4567` | `Call me at <PHONE_NUMBER>` |
| `DOB: 01/15/1985` | `DOB: <DATE_TIME>` |
| `Contact John Smith` | `Contact <PERSON>` |

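The placeholder substitution in the table above can be illustrated without Presidio. The following is a minimal sketch, not the package's implementation: the `Span` tuple, the `find_spans` and `redact` helpers, the two regular expressions, and the sample address are all assumptions made up for this example. Detecting names, dates, or locations realistically needs an NER model rather than regexes.

```python
import re
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int
    end: int
    entity: str


# Hypothetical, deliberately simple patterns; not the recognizers Presidio uses.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE_NUMBER": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def find_spans(text: str) -> List[Span]:
    """Collect a (start, end, entity) span for every pattern match."""
    spans = []
    for entity, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append(Span(match.start(), match.end(), entity))
    return spans


def redact(text: str, spans: List[Span]) -> str:
    """Replace each span with <ENTITY>, working right to left so that
    the offsets of earlier spans stay valid after each substitution."""
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"<{span.entity}>" + text[span.end:]
    return text


sample = "Write to jane@example.com or call 555-123-4567"
print(redact(sample, find_spans(sample)))
```

Replacing from the end of the string backwards is the design choice worth noting: placeholders differ in length from the text they replace, so substituting left to right would shift every later offset.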
## Dict Scrubbing

Scrub PII from nested dictionaries (e.g., GUI element trees):

```python
from openadapt_privacy import scrub_dict
from openadapt_privacy.providers.presidio import PresidioScrubbingProvider

scrubber = PresidioScrubbingProvider()
action = {
    "text": "Email: [email protected]",
    "metadata": {
        "title": "User Profile - John Smith",
        "tooltip": "Click to contact [email protected]",
    },
    "coordinates": {"x": 100, "y": 200},
}
scrubbed = scrub_dict(action, scrubber)
```

**Input:**
```json
{
  "text": "Email: [email protected]",
  "metadata": {
    "title": "User Profile - John Smith",
    "tooltip": "Click to contact [email protected]"
  },
  "coordinates": {"x": 100, "y": 200}
}
```

**Output:**
```json
{
  "text": "Email: <EMAIL_ADDRESS>",
  "metadata": {
    "title": "User Profile - <PERSON>",
    "tooltip": "Click to contact <EMAIL_ADDRESS>"
  },
  "coordinates": {"x": 100, "y": 200}
}
```

96+
## Recording Pipeline
97+
98+
Process complete GUI automation recordings:
99+
100+
```python
101+
from openadapt_privacy import DictRecordingLoader
102+
from openadapt_privacy.providers.presidio import PresidioScrubbingProvider
103+
104+
scrubber = PresidioScrubbingProvider()
105+
loader = DictRecordingLoader()
106+
107+
recording = loader.load_from_dict({
108+
"task_description": "Send email to John Smith at [email protected]",
109+
"actions": [
110+
{"id": 1, "action_type": "click", "text": "Compose", "timestamp": 1000},
111+
{"id": 2, "action_type": "type", "text": "[email protected]", "timestamp": 2000},
112+
{"id": 3, "action_type": "click", "text": "Send", "window_title": "Email to [email protected]", "timestamp": 3000},
113+
],
114+
})
115+
116+
scrubbed = recording.scrub(scrubber)
117+
```
118+
119+
**Input Recording:**
120+
```
121+
task_description: "Send email to John Smith at [email protected]"
122+
123+
actions:
124+
[1] click: "Compose"
125+
[2] type: "[email protected]"
126+
[3] click: "Send" (window: "Email to [email protected]")
127+
```
128+
129+
**Output Recording:**
130+
```
131+
task_description: "Send email to <PERSON> at <EMAIL_ADDRESS>"
132+
133+
actions:
134+
[1] click: "Compose"
135+
[2] type: "<EMAIL_ADDRESS>"
136+
[3] click: "Send" (window: "Email to <EMAIL_ADDRESS>")
137+
```
138+
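One way to picture what scrubbing a whole recording involves is a pair of small dataclasses whose `scrub` methods apply a text scrubber to every free-text field and return new objects. This is only a sketch under assumptions: `SketchAction`, `SketchRecording`, and `scrub_email` are hypothetical stand-ins and do not reproduce the package's `Action` or `Recording` classes.

```python
import re
from dataclasses import dataclass, field, replace
from typing import Callable, List, Optional

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def scrub_email(text: str) -> str:
    """Stand-in text scrubber that only masks email addresses."""
    return EMAIL.sub("<EMAIL_ADDRESS>", text)


@dataclass(frozen=True)
class SketchAction:
    id: int
    action_type: str
    text: Optional[str] = None
    window_title: Optional[str] = None
    timestamp: int = 0

    def scrub(self, scrub_text: Callable[[str], str]) -> "SketchAction":
        # Only free-text fields are scrubbed; ids and timestamps are kept.
        return replace(
            self,
            text=scrub_text(self.text) if self.text else self.text,
            window_title=(
                scrub_text(self.window_title)
                if self.window_title
                else self.window_title
            ),
        )


@dataclass(frozen=True)
class SketchRecording:
    task_description: str
    actions: List[SketchAction] = field(default_factory=list)

    def scrub(self, scrub_text: Callable[[str], str]) -> "SketchRecording":
        return SketchRecording(
            task_description=scrub_text(self.task_description),
            actions=[action.scrub(scrub_text) for action in self.actions],
        )


recording = SketchRecording(
    task_description="Send email to jane@example.com",
    actions=[
        SketchAction(1, "click", text="Compose", timestamp=1000),
        SketchAction(2, "type", text="jane@example.com", timestamp=2000),
    ],
)
scrubbed = recording.scrub(scrub_email)
print(scrubbed.task_description)
```

Returning new objects instead of mutating in place keeps the original recording available for comparison, which is convenient when checking what a scrubber changed.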
## Image Scrubbing

Redact PII from screenshots using OCR + NER:

```python
from PIL import Image
from openadapt_privacy.providers.presidio import PresidioScrubbingProvider

scrubber = PresidioScrubbingProvider()

image = Image.open("screenshot.png")
scrubbed_image = scrubber.scrub_image(image)
scrubbed_image.save("screenshot_scrubbed.png")
```

**Input Screenshot:**

![Original screenshot with PII](assets/screenshot_original.png)

**Output Screenshot:**

![Scrubbed screenshot with PII redacted](assets/screenshot_scrubbed.png)

The image redactor:
1. Runs OCR to detect text regions
2. Analyzes text for PII entities (email, phone, SSN, etc.)
3. Fills detected PII regions with solid color (configurable, default: red)

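The third step above, filling detected regions with a solid color, can be sketched in plain Python. This example assumes that OCR and entity analysis have already produced bounding boxes; the `fill_regions` helper, the `(left, top, width, height)` box layout, and the nested-list image are simplifications invented for illustration and are not the redactor's actual data structures.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]          # (red, green, blue)
Box = Tuple[int, int, int, int]       # (left, top, width, height), assumed layout

RED: Pixel = (255, 0, 0)


def fill_regions(
    pixels: List[List[Pixel]], boxes: List[Box], color: Pixel = RED
) -> List[List[Pixel]]:
    """Return a copy of `pixels` with every box painted in `color`.

    Boxes that extend past the image edge are clipped to the image.
    """
    height = len(pixels)
    width = len(pixels[0]) if pixels else 0
    out = [row[:] for row in pixels]
    for left, top, box_w, box_h in boxes:
        for y in range(max(top, 0), min(top + box_h, height)):
            for x in range(max(left, 0), min(left + box_w, width)):
                out[y][x] = color
    return out


WHITE: Pixel = (255, 255, 255)
image = [[WHITE] * 6 for _ in range(4)]        # 6 pixels wide, 4 pixels high
redacted = fill_regions(image, [(1, 1, 3, 2)])  # one hypothetical PII box
for row in redacted:
    print("".join("#" if pixel == RED else "." for pixel in row))
```

Painting over the pixels, rather than blurring them, means the original text cannot be recovered from the output image.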
## Custom Data Loader

Implement your own loader for custom storage formats:

```python
from openadapt_privacy import RecordingLoader, Recording
from openadapt_privacy.providers.presidio import PresidioScrubbingProvider


class SQLiteRecordingLoader(RecordingLoader):
    def __init__(self, db_path: str):
        self.db_path = db_path

    def load(self, recording_id: str) -> Recording:
        # Load from SQLite database
        ...

    def save(self, recording: Recording, recording_id: str) -> None:
        # Save to SQLite database
        ...

# Usage
loader = SQLiteRecordingLoader("recordings.db")
scrubber = PresidioScrubbingProvider()

# Load, scrub, and save
scrubbed = loader.load_and_scrub("recording_001", scrubber)
loader.save(scrubbed, "recording_001_scrubbed")
```

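The `load` and `save` bodies above are left elided in the README. As a self-contained illustration of the same pattern, the sketch below stores recordings as JSON text in SQLite and provides `load_and_scrub` on a base class. Everything in it is an assumption for demonstration: `SketchLoader`, `SQLiteSketchLoader`, the `recordings` table schema, and the use of plain dicts in place of `Recording` objects are not taken from the package.

```python
import json
import sqlite3
from abc import ABC, abstractmethod
from typing import Callable


class SketchLoader(ABC):
    """Hypothetical base class: subclasses supply storage, the base scrubs."""

    @abstractmethod
    def load(self, recording_id: str) -> dict: ...

    @abstractmethod
    def save(self, recording: dict, recording_id: str) -> None: ...

    def load_and_scrub(
        self, recording_id: str, scrub_text: Callable[[str], str]
    ) -> dict:
        recording = self.load(recording_id)
        # Scrub top-level string fields only, to keep the sketch short.
        return {
            key: scrub_text(value) if isinstance(value, str) else value
            for key, value in recording.items()
        }


class SQLiteSketchLoader(SketchLoader):
    def __init__(self, db_path: str):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS recordings "
            "(id TEXT PRIMARY KEY, payload TEXT NOT NULL)"
        )

    def load(self, recording_id: str) -> dict:
        row = self.conn.execute(
            "SELECT payload FROM recordings WHERE id = ?", (recording_id,)
        ).fetchone()
        if row is None:
            raise KeyError(recording_id)
        return json.loads(row[0])

    def save(self, recording: dict, recording_id: str) -> None:
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO recordings (id, payload) VALUES (?, ?)",
                (recording_id, json.dumps(recording)),
            )


loader = SQLiteSketchLoader(":memory:")
loader.save({"task_description": "Email jane@example.com", "steps": 3}, "recording_001")
scrubbed = loader.load_and_scrub(
    "recording_001", lambda s: s.replace("jane@example.com", "<EMAIL_ADDRESS>")
)
loader.save(scrubbed, "recording_001_scrubbed")
print(loader.load("recording_001_scrubbed"))
```

Saving the scrubbed copy under a separate ID, as the README's usage does, leaves the original row intact so the two versions can be compared or the raw one deleted deliberately.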
## Configuration

```python
from openadapt_privacy.config import PrivacyConfig

custom_config = PrivacyConfig(
    SCRUB_CHAR="X",                    # Character for scrub_text_all
    SCRUB_FILL_COLOR=0xFF0000,         # Red for image redaction (BGR)
    SCRUB_KEYS_HTML=[                  # Keys to scrub in dicts
        "text", "value", "title", "tooltip", "custom_field"
    ],
    SCRUB_PRESIDIO_IGNORE_ENTITIES=[   # Entity types to skip
        "DATE_TIME",
    ],
)
```

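To make the effect of two of these settings concrete, the sketch below shows how an ignore list and a scrub character might be applied. It is a guess at the behavior suggested by the comments above, not the package's code: `SketchConfig`, `apply_config`, `mask_all`, and the hard-coded detections are hypothetical, and the assumption that `scrub_text_all` masks every character with `SCRUB_CHAR` is inferred from the option's comment only.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Detection = Tuple[int, int, str]  # (start, end, entity type)


@dataclass
class SketchConfig:
    scrub_char: str = "*"
    ignore_entities: List[str] = field(default_factory=list)


def apply_config(text: str, detections: List[Detection], cfg: SketchConfig) -> str:
    """Redact detections whose entity type is not on the ignore list."""
    kept = [d for d in detections if d[2] not in cfg.ignore_entities]
    # Replace right to left so earlier offsets remain valid.
    for start, end, entity in sorted(kept, reverse=True):
        text = text[:start] + f"<{entity}>" + text[end:]
    return text


def mask_all(text: str, cfg: SketchConfig) -> str:
    """Assumed meaning of scrub_text_all: mask every character."""
    return cfg.scrub_char * len(text)


cfg = SketchConfig(scrub_char="X", ignore_entities=["DATE_TIME"])
text = "DOB: 01/15/1985, SSN: 923-45-6789"
# Offsets a detector might report for the sample text (hand-written here).
detections = [(5, 15, "DATE_TIME"), (22, 33, "US_SSN")]

print(apply_config(text, detections, cfg))
print(mask_all("secret", cfg))
```

With `DATE_TIME` ignored, the date of birth survives scrubbing, so an ignore list is a trade-off between keeping useful context and leaving potentially identifying data in place.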
## Supported Entity Types

| Entity | Example Input | Example Output |
|--------|---------------|----------------|
| `PERSON` | `John Smith` | `<PERSON>` |
| `EMAIL_ADDRESS` | `[email protected]` | `<EMAIL_ADDRESS>` |
| `PHONE_NUMBER` | `555-123-4567` | `<PHONE_NUMBER>` |
| `US_SSN` | `923-45-6789` | `<US_SSN>` |
| `CREDIT_CARD` | `4532-1234-5678-9012` | `<CREDIT_CARD>` |
| `US_BANK_NUMBER` | `635526789012` | `<US_BANK_NUMBER>` |
| `US_DRIVER_LICENSE` | `A123-456-789-012` | `<US_DRIVER_LICENSE>` |
| `DATE_TIME` | `01/15/1985` | `<DATE_TIME>` |
| `LOCATION` | `Toronto, ON` | `<LOCATION>` |

## Architecture

```
openadapt_privacy/
├── base.py              # ScrubbingProvider, TextScrubbingMixin
├── config.py            # PrivacyConfig dataclass
├── loaders.py           # Recording, Action, Screenshot, RecordingLoader
├── providers/
│   ├── __init__.py      # ScrubProvider registry
│   └── presidio.py      # PresidioScrubbingProvider
└── pipelines/
    └── dicts.py         # scrub_dict, scrub_list_dicts
```

## License

MIT

assets/screenshot_original.png

17.4 KB

assets/screenshot_scrubbed.png

11.1 KB

openadapt_privacy/__init__.py

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
"""OpenAdapt Privacy - PII/PHI detection and redaction for GUI automation data."""

from openadapt_privacy.base import (
    Modality,
    ScrubbingProvider,
    ScrubbingProviderFactory,
    TextScrubbingMixin,
)
from openadapt_privacy.config import PrivacyConfig, config
from openadapt_privacy.loaders import (
    Action,
    DictRecordingLoader,
    Recording,
    RecordingLoader,
    Screenshot,
)
from openadapt_privacy.pipelines.dicts import DictScrubber, scrub_dict, scrub_list_dicts
from openadapt_privacy.providers import ScrubProvider

__version__ = "0.1.0"

__all__ = [
    # Base classes
    "Modality",
    "ScrubbingProvider",
    "ScrubbingProviderFactory",
    "TextScrubbingMixin",
    # Config
    "PrivacyConfig",
    "config",
    # Providers
    "ScrubProvider",
    # Pipelines
    "DictScrubber",
    "scrub_dict",
    "scrub_list_dicts",
    # Data loaders
    "Action",
    "Screenshot",
    "Recording",
    "RecordingLoader",
    "DictRecordingLoader",
]
