Commit 8040968: How to detect and redact PII in Unstructured JSON output files (#695)

1 parent ee62f23
2 files changed: +279 -0 lines changed

docs.json (1 addition, 0 deletions)

```diff
@@ -281,6 +281,7 @@
 "examplecode/tools/onedrive-events",
 "examplecode/tools/sharepoint-events",
 "examplecode/tools/s3-vectors",
+"examplecode/tools/pii",
 "examplecode/tools/jq",
 "examplecode/tools/firecrawl",
 "examplecode/tools/langflow",
```

examplecode/tools/pii.mdx (278 additions, 0 deletions)
---
title: PII detection
---

Personally identifiable information (PII) detection is an important part of an organization's overall strategy for
minimizing potential harm from misuse of, or unauthorized access to, individuals' private data. Detecting PII is a first step
toward helping prevent identity theft, maintaining privacy, and building trust with customers and users. Organizations
might also need to follow various data protection regulations, making PII detection a crucial part of the
organization's legal compliance framework.

This hands-on walkthrough demonstrates how to use the [Microsoft Presidio SDK](https://microsoft.github.io/presidio/)
to identify and then redact PII in Unstructured JSON output files. Presidio can identify
and redact or anonymize entities in text and images, such as credit card numbers, names, locations, social security numbers,
bitcoin wallets, US phone numbers, financial data, and more.

In this walkthrough, you will use Python code to connect to a folder within an Amazon S3 bucket that already contains a
collection of Unstructured JSON output files. For each file, your code will use Presidio to identify PII that matches
specific patterns and then redact the identified PII. Your code will then write the Unstructured JSON output files' contents,
with the PII redacted, to a separate folder within an S3 bucket. You can then compare the output JSON files generated by
Unstructured to the redacted data generated by Presidio and see the impact of using Presidio for PII detection.

_Redaction_ directly removes or obscures PII, for example by replacing names with placeholders or blacking out sensitive text.
The example code in this walkthrough replaces the detected PII characters within text strings with placeholder text. Another technique, _anonymization_,
can involve redaction but also includes techniques such as _generalization_ (for example, replacing specific dates
with age ranges), _suppression_ (removing entire data fields), and _data masking_ (replacing data with random values
while preserving the format). Presidio supports both redaction and anonymization. Although the term "anonymize" is
visible throughout this walkthrough's example code, only redaction is explored here, not Presidio's broader anonymization techniques.
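To make the placeholder-based redaction technique concrete before bringing in Presidio, here is a minimal, standard-library-only sketch. It is an illustration of the idea only, not Presidio itself; the `redact_ssns` helper and its single regex pattern are hypothetical, and real detectors combine many patterns, context words, and NER models:

```python
import re

# Toy recognizer: a regex for US Social Security number-shaped strings
# (illustrative only; it will miss variants and match false positives).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact_ssns(text: str) -> str:
    """Replace each SSN-shaped match with a placeholder, preserving other text."""
    return SSN_PATTERN.sub("<REDACTED_US_SSN>", text)

print(redact_ssns("Call me about case 123-45-6789 tomorrow."))
# Prints: Call me about case <REDACTED_US_SSN> tomorrow.
```

Presidio applies this same replace-with-placeholder pattern, but with production-grade recognizers for many entity types at once.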
<Note>
The Microsoft Presidio SDK is not offered, maintained, or supported by Unstructured. For questions or issues
related to Presidio, see the following resources:

- For general discussions, use the [discussion board](https://github.com/microsoft/presidio/discussions) in the Presidio repository on GitHub.
- For questions or issues, file an [issue](https://github.com/microsoft/presidio/issues) in the Presidio repository on GitHub.
- For other matters, email [[email protected]](mailto:[email protected]).

The example code in this walkthrough is provided as a general reference only. It is not intended to substitute
for a complete PII detection strategy.

The example code in this walkthrough is not guaranteed to detect and redact all possible PII. For instance, the code
looks for PII only in text strings. It does not look for PII in non-text fields such as `image_base64` and `orig_elements`
within Unstructured metadata, and it does not look for PII in images.
</Note>
## Requirements

import GetStartedSimpleUIOnly from '/snippets/general-shared-text/get-started-simple-ui-only.mdx'

To use this example, you will need:

- An Unstructured account, as follows:

  <GetStartedSimpleUIOnly />

- A set of one or more Unstructured JSON output files that have been generated by Unstructured and stored in a folder within an
  Amazon S3 bucket that you have access to. One way to generate these files is to use an Unstructured workflow that
  relies on an S3 destination connector to store these Unstructured JSON output files. Learn how to [create an S3 destination connector](/ui/destinations/s3) and
  [create a custom workflow](/ui/workflows#create-a-custom-workflow) that uses your S3 destination connector.
- Python installed on your local development machine.
## Create and run the Python code

1. In your local Python virtual environment, install the following libraries:

   - `boto3`
   - `presidio_analyzer`
   - `presidio_anonymizer`

   For example, if you are using [uv](https://docs.astral.sh/uv/), you can install these libraries into your local `uv`
   virtual environment with the following command:

   ```bash
   uv add boto3 presidio_analyzer presidio_anonymizer
   ```
2. In your local Python virtual environment, install the appropriate natural language processing (NLP) models for
   [spaCy](https://spacy.io/), which Presidio relies on for various internal tasks related to named entity recognition (NER) and
   PII identification.

   To find and install the appropriate model for your use case, do the following:

   a. Go to [spaCy Trained Models & Pipelines](https://spacy.io/models).<br/>
   b. On the sidebar, click your target language, for example **English**.<br/>
   c. Click the model you want to use, for example **en_core_web_lg**.<br/>
   d. Click **Release details**.<br/>
   e. At the bottom of the release details page, in the **Assets** section, right-click the filename ending in `.whl`, for
   example **en_core_web_lg-3.8.0-py3-none-any.whl**, and select **Copy Link Address** from the context menu.<br/>
   f. Install the model into your local Python virtual environment by using the model's name and the `.whl` URL that you just copied.
   For example, if you are using `uv`, you can install the preceding model with a command such as the following:<br/>

   ```bash
   uv pip install en_core_web_lg@https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl
   ```
3. [Set up Boto3 credentials](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) for your AWS account.
   The following steps assume that you have set up your Boto3 credentials outside of the following code, such as by setting
   [environment variables](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html#environment-variables) or by
   configuring a [shared credentials file](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html#shared-credentials-file).

   One approach to getting and setting up Boto3 credentials is to [create an AWS access key and secret access key](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html#Using_CreateAccessKey)
   and then use the [AWS Command Line Interface (AWS CLI)](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html)
   to [set up your credentials](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html#cli-configure-files-methods) on your local development machine.

   <iframe
     width="560"
     height="315"
     src="https://www.youtube.com/embed/MoFTaGJE65Q"
     title="YouTube video player"
     frameborder="0"
     allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
     allowfullscreen
   ></iframe>
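For reference, a shared credentials file (by default at `~/.aws/credentials` on macOS and Linux, or `%UserProfile%\.aws\credentials` on Windows) uses the following format. The key values shown here are placeholders that you replace with your own:

```ini
[default]
aws_access_key_id = <your-access-key-id>
aws_secret_access_key = <your-secret-access-key>
```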
4. Add the following code to a Python script file in your virtual environment, replacing the following placeholders:

   - Replace `<input-bucket-name>` with the name of the Amazon S3 bucket that contains your original Unstructured JSON files. This is the same
     bucket that you used for your S3 destination connector.
   - Replace `<input-folder-prefix>` with the path to the folder within the input bucket that contains your original Unstructured JSON files.
   - Replace `<output-bucket-name>` with the name of the S3 bucket that will contain copies of the contents of your Unstructured JSON files,
     with the redacted content within those files' copies. This can be the same bucket as the input bucket or a different bucket.
   - Replace `<output-folder-prefix>` with the path to the folder within the output bucket that will contain copies of the contents of your Unstructured
     JSON files, with the redacted content within those files' copies. This must not be the same folder as the input folder.
   - Replace `<bucket-region-short-id>` with the short ID of the AWS Region where your buckets are located, for example `us-east-1`.

   The `operators` variable maps built-in Presidio entities to the operators that redact them. These operators look for common entities such as
   credit card numbers, email addresses, phone numbers, and more. You can remove from `operators` any entities that you do not
   want your code to look for. You can also add operators for additional
   [built-in entities](https://microsoft.github.io/presidio/supported_entities/) that you want your code to look for, and you can
   [add your own custom entities](https://microsoft.github.io/presidio/analyzer/adding_recognizers/).
   ```python
   import boto3
   import json

   from presidio_analyzer import AnalyzerEngine
   from presidio_anonymizer import AnonymizerEngine
   from presidio_anonymizer.entities import OperatorConfig

   operators = {
       "CREDIT_CARD": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_CREDIT_CARD>'}),
       "CRYPTO": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_CRYPTO>'}),
       "EMAIL_ADDRESS": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_EMAIL_ADDRESS>'}),
       "IBAN_CODE": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_IBAN_CODE>'}),
       "IP_ADDRESS": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_IP_ADDRESS>'}),
       "NRP": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_NRP>'}),
       "LOCATION": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_LOCATION>'}),
       "PERSON": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_PERSON>'}),
       "PHONE_NUMBER": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_PHONE_NUMBER>'}),
       "MEDICAL_LICENSE": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_MEDICAL_LICENSE>'}),
       "URL": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_URL>'}),
       "US_BANK_NUMBER": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_US_BANK_NUMBER>'}),
       "US_DRIVER_LICENSE": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_US_DRIVER_LICENSE>'}),
       "US_ITIN": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_US_ITIN>'}),
       "US_PASSPORT": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_US_PASSPORT>'}),
       "US_SSN": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_US_SSN>'})
   }

   # Recursively check for string values in the provided JSON object (in this case,
   # the "metadata" field of the JSON object) and redact them as appropriate.
   def check_string_values(obj, analyzer, anonymizer):
       if isinstance(obj, dict):
           for key, value in obj.items():
               # Skip analyzing Base64-encoded fields.
               if key == 'image_base64' or key == 'orig_elements':
                   pass
               elif isinstance(value, str):
                   anonymized_results = anonymizer.anonymize(
                       text=value,
                       analyzer_results=analyzer.analyze(text=value, language="en"),
                       operators=operators
                   )
                   # Write the redacted text back into the object. (Assigning only
                   # to the loop variable "value" would not update "obj".)
                   obj[key] = anonymized_results.text
               # Recurse through nested "metadata" fields.
               elif isinstance(value, dict):
                   check_string_values(value, analyzer, anonymizer)
               # Skip analyzing non-string fields.
               else:
                   pass
       return obj

   def main():
       s3_input_bucket_name = '<input-bucket-name>'
       s3_input_folder_prefix = '<input-folder-prefix>'
       s3_output_bucket_name = '<output-bucket-name>'
       s3_output_folder_prefix = '<output-folder-prefix>'
       # Not referenced directly below; boto3 reads the Region from your
       # credentials setup.
       s3_bucket_region = '<bucket-region-short-id>'

       s3_client = boto3.client('s3')

       # Load the JSON files from the input folder.
       # Normalize the input folder prefix to ensure it ends with '/'.
       if not s3_input_folder_prefix.endswith('/'):
           s3_input_folder_prefix += '/'

       paginator = s3_client.get_paginator('list_objects_v2')
       page_iterator = paginator.paginate(
           Bucket=s3_input_bucket_name,
           Prefix=s3_input_folder_prefix
       )
       files = []

       # Get the list of file keys from the input folder to analyze.
       # A file's key is the full path to the file within the bucket.
       # For example, if the input folder's name is "original" and the
       # input file's name is "file1.json", the file's key is
       # "original/file1.json".
       # There could be multiple "pages" of file listings available,
       # so each of these "pages" must be looped through, so that
       # no files are missed.
       for page in page_iterator:
           # "Contents" is missing if the folder is empty or the
           # intended prefix is not found.
           if 'Contents' in page:
               for obj in page['Contents']:
                   key = obj['Key']
                   if not key.endswith('/'):  # Skip if it's a folder placeholder.
                       files.append(key)
                       print(f"Found file: {s3_input_bucket_name}/{key}")

       analyzer = AnalyzerEngine()
       anonymizer = AnonymizerEngine()
       s3_resource = boto3.resource('s3')

       # For each JSON file to analyze, load the JSON data.
       for key in files:
           print(f"Analyzing file: {s3_input_bucket_name}/{key}")
           content_object = s3_resource.Object(
               bucket_name=s3_input_bucket_name,
               key=key
           )

           file_content = content_object.get()['Body'].read().decode('utf-8')  # Bytes to text.
           json_data = json.loads(file_content)  # Text to JSON.

           # For each element in the JSON data...
           for element in json_data:
               print(f" Analyzing element with ID: {element['element_id']} in file {s3_input_bucket_name}/{key}")
               # If there is a "text" field...
               if 'text' in element:
                   # ...get the text content...
                   text_element = element['text']
                   # ...and analyze and redact the text content as appropriate.
                   anonymized_results = anonymizer.anonymize(
                       text=text_element,
                       analyzer_results=analyzer.analyze(text=text_element, language="en"),
                       operators=operators
                   )
                   element['text'] = anonymized_results.text
               # If there is a "metadata" field...
               if 'metadata' in element:
                   # ...get the metadata content...
                   metadata_element = element['metadata']
                   # ...and analyze and redact the metadata content as appropriate.
                   element['metadata'] = check_string_values(metadata_element, analyzer, anonymizer)

           # Get the filename from the key.
           filename = key.split(s3_input_folder_prefix)[1]

           # Normalize the output folder prefix to ensure it ends with '/'.
           if not s3_output_folder_prefix.endswith('/'):
               s3_output_folder_prefix += '/'

           # Then save the JSON data with its redactions to the output folder.
           print(f"Saving file: {s3_output_bucket_name}/{s3_output_folder_prefix}{filename}")
           s3_client.put_object(
               Bucket=s3_output_bucket_name,
               Key=f"{s3_output_folder_prefix}{filename}",
               Body=json.dumps(obj=json_data, indent=4).encode('utf-8')
           )

   if __name__ == "__main__":
       main()
   ```
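If you want to see how the recursive metadata traversal behaves in isolation, the following standard-library-only sketch substitutes a simple regex-based stand-in for Presidio. The `fake_redact` helper, its email pattern, and the sample metadata are hypothetical, for illustration only:

```python
import re

def fake_redact(text):
    # Stand-in for Presidio: replace email-shaped strings with a placeholder.
    return re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", "<REDACTED_EMAIL_ADDRESS>", text)

def check_string_values(obj):
    # Same traversal shape as the walkthrough's function: redact string values,
    # skip Base64-encoded fields, and recurse into nested dictionaries.
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key in ('image_base64', 'orig_elements'):
                continue
            if isinstance(value, str):
                obj[key] = fake_redact(value)
            elif isinstance(value, dict):
                check_string_values(value)
    return obj

element_metadata = {
    "filename": "notes.txt",
    "data_source": {"url": "mailto:jane.doe@example.com"},  # Nested dict: recursed into.
    "orig_elements": "eJxLzs8rLs0=",  # Skipped: left untouched.
}
print(check_string_values(element_metadata)["data_source"]["url"])
# Prints: mailto:<REDACTED_EMAIL_ADDRESS>
```

The key detail the sketch demonstrates is that redacted strings must be written back into the dictionary (`obj[key] = ...`); rebinding the loop variable alone would leave the original values in place.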
5. Run the Python script.
6. Go to the output folder in S3 and explore the generated files, searching for the `<REDACTED_` placeholders in the generated files' contents.
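One lightweight way to gauge the impact of the redaction is to count the redaction placeholders that appear in a generated file's contents. Here is a minimal sketch; the `count_redactions` helper and the sample element below are hypothetical, and in practice you would pass in the text of a downloaded output file:

```python
import json
import re

def count_redactions(json_text):
    # Count each "<REDACTED_...>" placeholder type in a redacted JSON document.
    counts = {}
    for match in re.findall(r"<REDACTED_[A-Z_]+>", json_text):
        counts[match] = counts.get(match, 0) + 1
    return counts

# Sample shaped like a redacted Unstructured element (illustrative values).
sample = json.dumps([{
    "element_id": "abc123",
    "text": "Email <REDACTED_EMAIL_ADDRESS> or call <REDACTED_PHONE_NUMBER>.",
}])
print(count_redactions(sample))
# Prints: {'<REDACTED_EMAIL_ADDRESS>': 1, '<REDACTED_PHONE_NUMBER>': 1}
```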
