Commit 8040968: How to detect and redact PII in Unstructured JSON output files (#695)

1 parent ee62f23
2 files changed: +279 -0 lines changed

docs.json (1 addition, 0 deletions)

```diff
@@ -281,6 +281,7 @@
 "examplecode/tools/onedrive-events",
 "examplecode/tools/sharepoint-events",
 "examplecode/tools/s3-vectors",
+"examplecode/tools/pii",
 "examplecode/tools/jq",
 "examplecode/tools/firecrawl",
 "examplecode/tools/langflow",
```

examplecode/tools/pii.mdx (278 additions, 0 deletions)
---
title: PII detection
---

Personally identifiable information (PII) detection is an important part of an organization's overall strategy for
minimizing potential harm from misuse of, or unauthorized access to, individuals' private data. Detecting PII is a first step
toward helping prevent identity theft, maintaining privacy, and building trust with customers and users. Organizations
might also need to follow various data protection regulations, making PII detection a crucial part of the
organization's legal compliance framework.

This hands-on walkthrough demonstrates how to use the [Microsoft Presidio SDK](https://microsoft.github.io/presidio/)
to identify and then redact PII in Unstructured JSON output files. Presidio can identify
and redact or anonymize entities in text and images, such as credit card numbers, names, locations, social security numbers,
bitcoin wallets, US phone numbers, financial data, and more.

In this walkthrough, you will use Python code to connect to a folder within an Amazon S3 bucket that already contains a
collection of Unstructured JSON output files. For each file, your code will use Presidio to identify PII that matches
specific patterns and then redact the identified PII. Your code will then write the Unstructured JSON output files' contents,
with the PII redacted, to a separate folder within an S3 bucket. You can then compare the output JSON files generated by
Unstructured to the redacted data generated by Presidio and see the impact of using Presidio for PII detection.

_Redaction_ directly removes or obscures PII, for example by replacing names with placeholders or blacking out sensitive text.
The example code in this walkthrough replaces the detected PII characters within text strings with placeholder text. Another technique, _anonymization_,
can involve redaction but also includes techniques such as _generalization_ (for example, replacing specific dates
with age ranges), _suppression_ (removing entire data fields), and _data masking_ (replacing data with random values
while preserving the format). Presidio supports both redaction and anonymization. Although the term "anonymize" is
visible throughout this walkthrough's example code, only redaction is explored here, not Presidio's broader anonymization techniques.
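To make the placeholder-based redaction technique concrete before bringing in Presidio, here is a minimal, standard-library-only sketch. It is an illustration of the idea only, not Presidio itself; the `redact_ssns` helper and its single regex pattern are hypothetical, and real detectors combine many patterns, context words, and NER models:

```python
import re

# Toy recognizer: a regex for US Social Security number-shaped strings
# (illustrative only; it will miss variants and match false positives).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact_ssns(text: str) -> str:
    """Replace each SSN-shaped match with a placeholder, preserving other text."""
    return SSN_PATTERN.sub("<REDACTED_US_SSN>", text)

print(redact_ssns("Call me about case 123-45-6789 tomorrow."))
# Prints: Call me about case <REDACTED_US_SSN> tomorrow.
```

Presidio applies this same replace-with-placeholder pattern, but with production-grade recognizers for many entity types at once.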
<Note>
The Microsoft Presidio SDK is not offered, maintained, or supported by Unstructured. For questions or issues
related to Presidio, see the following resources:

- For general discussions, use the [discussion board](https://github.com/microsoft/presidio/discussions) in the Presidio repository on GitHub.
- For questions or issues, file an [issue](https://github.com/microsoft/presidio/issues) in the Presidio repository on GitHub.
- For other matters, email [[email protected]](mailto:[email protected]).

The example code in this walkthrough is provided as a general reference only. It is not intended to substitute
for a complete PII detection strategy.

The example code in this walkthrough is not guaranteed to detect and redact all possible PII. For instance, the code
looks for PII only in text strings. It does not look for PII in non-text fields such as `image_base64` and `orig_elements`
within Unstructured metadata, and it does not look for PII in images.
</Note>
## Requirements

import GetStartedSimpleUIOnly from '/snippets/general-shared-text/get-started-simple-ui-only.mdx'

To use this example, you will need:

- An Unstructured account, as follows:

  <GetStartedSimpleUIOnly />

- A set of one or more Unstructured JSON output files that have been generated by Unstructured and stored in a folder within an
  Amazon S3 bucket that you have access to. One way to generate these files is to use an Unstructured workflow that
  relies on an S3 destination connector to store these Unstructured JSON output files. Learn how to [create an S3 destination connector](/ui/destinations/s3) and
  [create a custom workflow](/ui/workflows#create-a-custom-workflow) that uses your S3 destination connector.
- Python installed on your local development machine.
## Create and run the Python code

1. In your local Python virtual environment, install the following libraries:

   - `boto3`
   - `presidio_analyzer`
   - `presidio_anonymizer`

   For example, if you are using [uv](https://docs.astral.sh/uv/), you can install these libraries into your local `uv`
   virtual environment with the following command:

   ```bash
   uv add boto3 presidio_analyzer presidio_anonymizer
   ```
2. In your local Python virtual environment, install the appropriate natural language processing (NLP) models for
   [spaCy](https://spacy.io/), which Presidio relies on for various internal tasks related to named entity recognition (NER) and
   PII identification.

   To find and install the appropriate model for your use case, do the following:

   a. Go to [spaCy Trained Models & Pipelines](https://spacy.io/models).<br/>
   b. On the sidebar, click your target language, for example **English**.<br/>
   c. Click the model you want to use, for example **en_core_web_lg**.<br/>
   d. Click **Release details**.<br/>
   e. At the bottom of the release details page, in the **Assets** section, right-click the filename ending in `.whl`, for
   example **en_core_web_lg-3.8.0-py3-none-any.whl**, and select **Copy Link Address** from the context menu.<br/>
   f. Install the model into your local Python virtual environment by using the model's name and the `.whl` URL that you just copied.
   For example, if you are using `uv`, you can install the preceding model with a command such as the following:<br/>

   ```bash
   uv pip install en_core_web_lg@https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl
   ```
3. [Set up Boto3 credentials](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) for your AWS account.
   The following steps assume that you have set up your Boto3 credentials outside of the following code, such as by setting
   [environment variables](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html#environment-variables) or by
   configuring a [shared credentials file](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html#shared-credentials-file).

   One approach to getting and setting up Boto3 credentials is to [create an AWS access key and secret access key](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html#Using_CreateAccessKey)
   and then use the [AWS Command Line Interface (AWS CLI)](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html)
   to [set up your credentials](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html#cli-configure-files-methods) on your local development machine.

   <iframe
     width="560"
     height="315"
     src="https://www.youtube.com/embed/MoFTaGJE65Q"
     title="YouTube video player"
     frameborder="0"
     allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
     allowfullscreen
   ></iframe>
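For reference, a shared credentials file (by default at `~/.aws/credentials` on macOS and Linux, or `%UserProfile%\.aws\credentials` on Windows) uses the following format. The key values shown here are placeholders that you replace with your own:

```ini
[default]
aws_access_key_id = <your-access-key-id>
aws_secret_access_key = <your-secret-access-key>
```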
4. Add the following code to a Python script file in your virtual environment, replacing the following placeholders:

   - Replace `<input-bucket-name>` with the name of the Amazon S3 bucket that contains your original Unstructured JSON files. This is the same
     bucket that you used for your S3 destination connector.
   - Replace `<input-folder-prefix>` with the path to the folder within the input bucket that contains your original Unstructured JSON files.
   - Replace `<output-bucket-name>` with the name of the S3 bucket that will contain copies of the contents of your Unstructured JSON files,
     with the redacted content within those files' copies. This can be the same bucket as the input bucket or a different bucket.
   - Replace `<output-folder-prefix>` with the path to the folder within the output bucket that will contain copies of the contents of your Unstructured
     JSON files, with the redacted content within those files' copies. This must not be the same folder as the input folder.
   - Replace `<bucket-region-short-id>` with the short ID of the AWS Region where your buckets are located, for example `us-east-1`.

   The `operators` variable maps built-in Presidio entities to the operators that redact them. These operators look for common entities such as
   credit card numbers, email addresses, phone numbers, and more. You can remove from `operators` any entities that you do not
   want your code to look for. You can also add operators for additional
   [built-in entities](https://microsoft.github.io/presidio/supported_entities/) that you want your code to look for, and you can
   [add your own custom entities](https://microsoft.github.io/presidio/analyzer/adding_recognizers/).
   ```python
   import boto3
   import json

   from presidio_analyzer import AnalyzerEngine
   from presidio_anonymizer import AnonymizerEngine
   from presidio_anonymizer.entities import OperatorConfig

   operators = {
       "CREDIT_CARD": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_CREDIT_CARD>'}),
       "CRYPTO": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_CRYPTO>'}),
       "EMAIL_ADDRESS": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_EMAIL_ADDRESS>'}),
       "IBAN_CODE": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_IBAN_CODE>'}),
       "IP_ADDRESS": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_IP_ADDRESS>'}),
       "NRP": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_NRP>'}),
       "LOCATION": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_LOCATION>'}),
       "PERSON": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_PERSON>'}),
       "PHONE_NUMBER": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_PHONE_NUMBER>'}),
       "MEDICAL_LICENSE": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_MEDICAL_LICENSE>'}),
       "URL": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_URL>'}),
       "US_BANK_NUMBER": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_US_BANK_NUMBER>'}),
       "US_DRIVER_LICENSE": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_US_DRIVER_LICENSE>'}),
       "US_ITIN": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_US_ITIN>'}),
       "US_PASSPORT": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_US_PASSPORT>'}),
       "US_SSN": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_US_SSN>'})
   }

   # Recursively check for string values in the provided JSON object (in this case,
   # the "metadata" field of the JSON object) and redact them as appropriate.
   def check_string_values(obj, analyzer, anonymizer):
       if isinstance(obj, dict):
           for key, value in obj.items():
               # Skip analyzing Base64-encoded fields.
               if key == 'image_base64' or key == 'orig_elements':
                   pass
               elif isinstance(value, str):
                   anonymized_results = anonymizer.anonymize(
                       text=value,
                       analyzer_results=analyzer.analyze(text=value, language="en"),
                       operators=operators
                   )
                   # Write the redacted text back into the object. (Assigning only
                   # to the loop variable "value" would not update "obj".)
                   obj[key] = anonymized_results.text
               # Recurse through nested "metadata" fields.
               elif isinstance(value, dict):
                   check_string_values(value, analyzer, anonymizer)
               # Skip analyzing non-string fields.
               else:
                   pass
       return obj

   def main():
       s3_input_bucket_name = '<input-bucket-name>'
       s3_input_folder_prefix = '<input-folder-prefix>'
       s3_output_bucket_name = '<output-bucket-name>'
       s3_output_folder_prefix = '<output-folder-prefix>'
       # Not referenced directly below; boto3 reads the Region from your
       # credentials setup.
       s3_bucket_region = '<bucket-region-short-id>'

       s3_client = boto3.client('s3')

       # Load the JSON files from the input folder.
       # Normalize the input folder prefix to ensure it ends with '/'.
       if not s3_input_folder_prefix.endswith('/'):
           s3_input_folder_prefix += '/'

       paginator = s3_client.get_paginator('list_objects_v2')
       page_iterator = paginator.paginate(
           Bucket=s3_input_bucket_name,
           Prefix=s3_input_folder_prefix
       )
       files = []

       # Get the list of file keys from the input folder to analyze.
       # A file's key is the full path to the file within the bucket.
       # For example, if the input folder's name is "original" and the
       # input file's name is "file1.json", the file's key is
       # "original/file1.json".
       # There could be multiple "pages" of file listings available,
       # so each of these "pages" must be looped through, so that
       # no files are missed.
       for page in page_iterator:
           # "Contents" is missing if the folder is empty or the
           # intended prefix is not found.
           if 'Contents' in page:
               for obj in page['Contents']:
                   key = obj['Key']
                   if not key.endswith('/'):  # Skip if it's a folder placeholder.
                       files.append(key)
                       print(f"Found file: {s3_input_bucket_name}/{key}")

       analyzer = AnalyzerEngine()
       anonymizer = AnonymizerEngine()
       s3_resource = boto3.resource('s3')

       # For each JSON file to analyze, load the JSON data.
       for key in files:
           print(f"Analyzing file: {s3_input_bucket_name}/{key}")
           content_object = s3_resource.Object(
               bucket_name=s3_input_bucket_name,
               key=key
           )

           file_content = content_object.get()['Body'].read().decode('utf-8')  # Bytes to text.
           json_data = json.loads(file_content)  # Text to JSON.

           # For each element in the JSON data...
           for element in json_data:
               print(f" Analyzing element with ID: {element['element_id']} in file {s3_input_bucket_name}/{key}")
               # If there is a "text" field...
               if 'text' in element:
                   # ...get the text content...
                   text_element = element['text']
                   # ...and analyze and redact the text content as appropriate.
                   anonymized_results = anonymizer.anonymize(
                       text=text_element,
                       analyzer_results=analyzer.analyze(text=text_element, language="en"),
                       operators=operators
                   )
                   element['text'] = anonymized_results.text
               # If there is a "metadata" field...
               if 'metadata' in element:
                   # ...get the metadata content...
                   metadata_element = element['metadata']
                   # ...and analyze and redact the metadata content as appropriate.
                   element['metadata'] = check_string_values(metadata_element, analyzer, anonymizer)

           # Get the filename from the key.
           filename = key.split(s3_input_folder_prefix)[1]

           # Normalize the output folder prefix to ensure it ends with '/'.
           if not s3_output_folder_prefix.endswith('/'):
               s3_output_folder_prefix += '/'

           # Then save the JSON data with its redactions to the output folder.
           print(f"Saving file: {s3_output_bucket_name}/{s3_output_folder_prefix}{filename}")
           s3_client.put_object(
               Bucket=s3_output_bucket_name,
               Key=f"{s3_output_folder_prefix}{filename}",
               Body=json.dumps(obj=json_data, indent=4).encode('utf-8')
           )

   if __name__ == "__main__":
       main()
   ```
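If you want to see how the recursive metadata traversal behaves in isolation, the following standard-library-only sketch substitutes a simple regex-based stand-in for Presidio. The `fake_redact` helper, its email pattern, and the sample metadata are hypothetical, for illustration only:

```python
import re

def fake_redact(text):
    # Stand-in for Presidio: replace email-shaped strings with a placeholder.
    return re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", "<REDACTED_EMAIL_ADDRESS>", text)

def check_string_values(obj):
    # Same traversal shape as the walkthrough's function: redact string values,
    # skip Base64-encoded fields, and recurse into nested dictionaries.
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key in ('image_base64', 'orig_elements'):
                continue
            if isinstance(value, str):
                obj[key] = fake_redact(value)
            elif isinstance(value, dict):
                check_string_values(value)
    return obj

element_metadata = {
    "filename": "notes.txt",
    "data_source": {"url": "mailto:jane.doe@example.com"},  # Nested dict: recursed into.
    "orig_elements": "eJxLzs8rLs0=",  # Skipped: left untouched.
}
print(check_string_values(element_metadata)["data_source"]["url"])
# Prints: mailto:<REDACTED_EMAIL_ADDRESS>
```

The key detail the sketch demonstrates is that redacted strings must be written back into the dictionary (`obj[key] = ...`); rebinding the loop variable alone would leave the original values in place.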
5. Run the Python script.
6. Go to the output folder in S3 and explore the generated files, searching for the `<REDACTED_` placeholders in the generated files' contents.
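One lightweight way to gauge the impact of the redaction is to count the redaction placeholders that appear in a generated file's contents. Here is a minimal sketch; the `count_redactions` helper and the sample element below are hypothetical, and in practice you would pass in the text of a downloaded output file:

```python
import json
import re

def count_redactions(json_text):
    # Count each "<REDACTED_...>" placeholder type in a redacted JSON document.
    counts = {}
    for match in re.findall(r"<REDACTED_[A-Z_]+>", json_text):
        counts[match] = counts.get(match, 0) + 1
    return counts

# Sample shaped like a redacted Unstructured element (illustrative values).
sample = json.dumps([{
    "element_id": "abc123",
    "text": "Email <REDACTED_EMAIL_ADDRESS> or call <REDACTED_PHONE_NUMBER>.",
}])
print(count_redactions(sample))
# Prints: {'<REDACTED_EMAIL_ADDRESS>': 1, '<REDACTED_PHONE_NUMBER>': 1}
```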
