---
title: PII detection
---

Personally identifiable information (PII) detection is important as part of an organization's overall strategy for
minimizing potential harm from misuse or unauthorized access to individuals' private data. Detecting PII is a first step
toward helping avoid identity theft, maintaining privacy, and building trust with customers and users. Organizations
might also need to follow various data protection regulations, making PII detection a crucial part of the
organization's legal compliance framework.

This hands-on example walkthrough demonstrates how to use the [Microsoft Presidio SDK](https://microsoft.github.io/presidio/)
to identify and then redact PII in Unstructured JSON output files. Presidio can identify
and redact or anonymize entities in text and images, such as credit card numbers, names, locations, social security numbers,
bitcoin wallets, US phone numbers, financial data, and more.

In this walkthrough, you will use Python code to connect to a folder within an Amazon S3 bucket that already contains a
collection of Unstructured JSON output files. For each file, your code will use Presidio to identify PII that matches
specific patterns and then redact the identified PII. Your code will then write the Unstructured JSON output files'
contents, with the PII redacted, to a separate folder within an S3 bucket. You can then compare the output JSON files generated by
Unstructured to the redacted data generated by Presidio and see the impact of using Presidio for PII detection.

_Redaction_ directly removes or obscures PII, for example by replacing names with placeholders or blacking out sensitive text.
The example code in this walkthrough replaces the detected PII characters within text strings with placeholder text. Another technique, _anonymization_,
can involve redaction but also includes techniques such as _generalization_ (for example, replacing specific dates
with age ranges), _suppression_ (removing entire data fields), and _data masking_ (replacing data with random values
while preserving the format). Presidio supports both redaction and anonymization. Although the term "anonymize" appears
throughout this walkthrough's example code, only redaction is explored here, not Presidio's other anonymization techniques.

<Note>
  The Microsoft Presidio SDK is not offered, maintained, or supported by Unstructured. For questions or issues
  related to Presidio, see the following resources:

  - For general discussions, use the [discussion board](https://github.com/microsoft/presidio/discussions) in the Presidio repository on GitHub.
  - For questions or issues, file an [issue](https://github.com/microsoft/presidio/issues) in the Presidio repository on GitHub.
  - For other matters, email [[email protected]](mailto:[email protected]).

  The example code in this walkthrough is provided as a general reference only. It is not intended to substitute
  for a complete PII detection strategy.

  The example code in this walkthrough is not guaranteed to detect and redact all possible PII. For instance, the code
  looks for PII only in text strings. It does not look for PII in non-text fields such as `image_base64` and `orig_elements`
  within Unstructured metadata, and it does not look for PII in images.
</Note>

## Requirements

import GetStartedSimpleUIOnly from '/snippets/general-shared-text/get-started-simple-ui-only.mdx'

To use this example, you will need:

- An Unstructured account, as follows:

  <GetStartedSimpleUIOnly />

- A set of one or more Unstructured JSON output files that have been generated by Unstructured and stored in a folder within an
  Amazon S3 bucket that you have access to. One way to generate these files is to use an Unstructured workflow that
  relies on an S3 destination connector to store these Unstructured JSON output files. Learn how to [create an S3 destination connector](/ui/destinations/s3) and
  [create a custom workflow](/ui/workflows#create-a-custom-workflow) that uses your S3 destination connector.
- Python installed on your local development machine.

## Create and run the Python code

1. In your local Python virtual environment, install the following libraries:

    - `boto3`
    - `presidio_analyzer`
    - `presidio_anonymizer`

    For example, if you are using [uv](https://docs.astral.sh/uv/), you can install these libraries into your local `uv`
    virtual environment with the following command:

    ```bash
    uv add boto3 presidio_analyzer presidio_anonymizer
    ```

2. In your local Python virtual environment, install the appropriate natural language processing (NLP) model for
    [spaCy](https://spacy.io/), which Presidio relies on for various internal tasks related to named entity recognition (NER) and
    PII identification.

    To find the appropriate model for your use case, do the following:

    a. Go to [spaCy Trained Models & Pipelines](https://spacy.io/models).<br/>
    b. On the sidebar, click your target language, for example **English**.<br/>
    c. Click the model you want to use, for example **en_core_web_lg**.<br/>
    d. Click **Release details**.<br/>
    e. At the bottom of the release details page, in the **Assets** section, right-click the filename ending in `.whl`, for
    example **en_core_web_lg-3.8.0-py3-none-any.whl**, and select **Copy Link Address** from the context menu.<br/>
    f. Install the model into your local Python virtual environment by using the model's name and the URL that you just copied.
    For example, if you are using `uv`, you can install the preceding model with a command such as the following:<br/>

    ```bash
    uv pip install en_core_web_lg@https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl
    ```

3. [Set up Boto3 credentials](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) for your AWS account.
    The following steps assume you have set up your Boto3 credentials outside of the following code, for example by setting
    [environment variables](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html#environment-variables) or by
    configuring a [shared credentials file](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html#shared-credentials-file).

    One approach to getting and setting up Boto3 credentials is to [create an AWS access key and secret access key](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html#Using_CreateAccessKey)
    and then use the [AWS Command Line Interface (AWS CLI)](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html)
    to [set up your credentials](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html#cli-configure-files-methods) on your local development machine.

    <iframe
      width="560"
      height="315"
      src="https://www.youtube.com/embed/MoFTaGJE65Q"
      title="YouTube video player"
      frameborder="0"
      allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
      allowfullscreen
    ></iframe>
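
    For example, Boto3 reads the standard AWS environment variables when they are set. The variable names below are Boto3's documented ones; the values are placeholders for you to replace:

    ```bash
    export AWS_ACCESS_KEY_ID="<your-access-key-id>"
    export AWS_SECRET_ACCESS_KEY="<your-secret-access-key>"
    export AWS_DEFAULT_REGION="<bucket-region-short-id>"
    ```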

4. Add the following code to a Python script file in your virtual environment, replacing the following placeholders:

    - Replace `<input-bucket-name>` with the name of the Amazon S3 bucket that contains your original Unstructured JSON files. This is the same
      bucket that you used for your S3 destination connector.
    - Replace `<input-folder-prefix>` with the path to the folder within the input bucket that contains your original Unstructured JSON files.
    - Replace `<output-bucket-name>` with the name of the S3 bucket that will contain copies of your Unstructured JSON files' contents,
      with the PII redacted. This can be the same bucket as the input bucket or a different bucket.
    - Replace `<output-folder-prefix>` with the path to the folder within the output bucket that will contain those redacted copies.
      This must not be the same folder as the input folder.
    - Replace `<bucket-region-short-id>` with the short ID of the AWS Region where your buckets are located, for example `us-east-1`.

    The `operators` variable specifies a list of operators for built-in Presidio entities. These operators look for common entities such as
    credit card numbers, email addresses, phone numbers, and more. You can remove any entities from this list that you do not
    want your code to look for. You can also add operators to this list for additional
    [built-in entities](https://microsoft.github.io/presidio/supported_entities/) that you want your code to look for, and you can
    [add your own custom entities](https://microsoft.github.io/presidio/analyzer/adding_recognizers/) to this list.

    ```python
    import json

    import boto3
    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine
    from presidio_anonymizer.entities import OperatorConfig

    operators = {
        "CREDIT_CARD": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_CREDIT_CARD>'}),
        "CRYPTO": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_CRYPTO>'}),
        "EMAIL_ADDRESS": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_EMAIL_ADDRESS>'}),
        "IBAN_CODE": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_IBAN_CODE>'}),
        "IP_ADDRESS": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_IP_ADDRESS>'}),
        "NRP": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_NRP>'}),
        "LOCATION": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_LOCATION>'}),
        "PERSON": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_PERSON>'}),
        "PHONE_NUMBER": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_PHONE_NUMBER>'}),
        "MEDICAL_LICENSE": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_MEDICAL_LICENSE>'}),
        "URL": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_URL>'}),
        "US_BANK_NUMBER": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_US_BANK_NUMBER>'}),
        "US_DRIVER_LICENSE": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_US_DRIVER_LICENSE>'}),
        "US_ITIN": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_US_ITIN>'}),
        "US_PASSPORT": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_US_PASSPORT>'}),
        "US_SSN": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_US_SSN>'})
    }

    # Recursively check for string values in the provided JSON object (in this case,
    # the "metadata" field of the JSON object) and redact them as appropriate.
    def check_string_values(obj, analyzer, anonymizer):
        if isinstance(obj, dict):
            for key, value in obj.items():
                # Skip analyzing Base64-encoded fields.
                if key == 'image_base64' or key == 'orig_elements':
                    pass
                elif isinstance(value, str):
                    anonymized_results = anonymizer.anonymize(
                        text=value,
                        analyzer_results=analyzer.analyze(text=value, language="en"),
                        operators=operators
                    )
                    # Write the redacted text back into the object. (Reassigning
                    # only the local "value" variable would not change "obj".)
                    obj[key] = anonymized_results.text
                # Recurse through nested "metadata" fields.
                elif isinstance(value, dict):
                    check_string_values(value, analyzer, anonymizer)
                # Skip analyzing non-string fields.
                else:
                    pass
        return obj

    def main():
        s3_input_bucket_name = '<input-bucket-name>'
        s3_input_folder_prefix = '<input-folder-prefix>'
        s3_output_bucket_name = '<output-bucket-name>'
        s3_output_folder_prefix = '<output-folder-prefix>'
        s3_bucket_region = '<bucket-region-short-id>'

        s3_client = boto3.client('s3', region_name=s3_bucket_region)

        # Normalize the folder prefixes to ensure they end with '/'.
        if not s3_input_folder_prefix.endswith('/'):
            s3_input_folder_prefix += '/'
        if not s3_output_folder_prefix.endswith('/'):
            s3_output_folder_prefix += '/'

        # Load the JSON files from the input folder.
        paginator = s3_client.get_paginator('list_objects_v2')
        page_iterator = paginator.paginate(
            Bucket=s3_input_bucket_name,
            Prefix=s3_input_folder_prefix
        )
        files = []

        # Get the list of file keys from the input folder to analyze.
        # A file's key is the full path to the file within the bucket.
        # For example, if the input folder's name is "original" and the
        # input file's name is "file1.json", the file's key is
        # "original/file1.json".
        # There could be multiple "pages" of file listings available,
        # so loop through each of these "pages" so that no files are missed.
        for page in page_iterator:
            # "Contents" is missing if the folder is empty or the
            # intended prefix is not found.
            if 'Contents' in page:
                for obj in page['Contents']:
                    key = obj['Key']
                    if not key.endswith('/'):  # Skip folder placeholders.
                        files.append(key)
                        print(f"Found file: {s3_input_bucket_name}/{key}")

        analyzer = AnalyzerEngine()
        anonymizer = AnonymizerEngine()
        s3_resource = boto3.resource('s3', region_name=s3_bucket_region)

        # For each JSON file to analyze, load the JSON data.
        for key in files:
            print(f"Analyzing file: {s3_input_bucket_name}/{key}")
            content_object = s3_resource.Object(
                bucket_name=s3_input_bucket_name,
                key=key
            )

            file_content = content_object.get()['Body'].read().decode('utf-8')  # Bytes to text.
            json_data = json.loads(file_content)  # Text to JSON.

            # For each element in the JSON data...
            for element in json_data:
                print(f"  Analyzing element with ID: {element['element_id']} in file {s3_input_bucket_name}/{key}")
                # If there is a "text" field...
                if 'text' in element:
                    # ...get the text content...
                    text_element = element['text']
                    # ...and analyze and redact the text content as appropriate.
                    anonymized_results = anonymizer.anonymize(
                        text=text_element,
                        analyzer_results=analyzer.analyze(text=text_element, language="en"),
                        operators=operators
                    )
                    element['text'] = anonymized_results.text
                # If there is a "metadata" field...
                if 'metadata' in element:
                    # ...get the metadata content...
                    metadata_element = element['metadata']
                    # ...and analyze and redact the metadata content as appropriate.
                    element['metadata'] = check_string_values(metadata_element, analyzer, anonymizer)

            # Get the filename from the key.
            filename = key.split(s3_input_folder_prefix)[1]

            # Then save the JSON data with its redactions to the output folder.
            print(f"Saving file: {s3_output_bucket_name}/{s3_output_folder_prefix}{filename}")
            s3_client.put_object(
                Bucket=s3_output_bucket_name,
                Key=f"{s3_output_folder_prefix}{filename}",
                Body=json.dumps(obj=json_data, indent=4).encode('utf-8')
            )

    if __name__ == "__main__":
        main()
    ```

5. Run the Python script.
6. Go to the output folder in S3 and explore the generated files, searching for the `<REDACTED_` placeholders in the generated files' contents.
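
As an aside, the output-key logic in the step 4 code (normalize the folder prefixes to end with `/`, take the part of the object key after the input prefix, and prepend the output prefix) can be sketched on its own in plain Python. The `build_output_key` helper here is hypothetical, used only for illustration; it is not part of the walkthrough's script:

```python
def build_output_key(key: str, input_prefix: str, output_prefix: str) -> str:
    # Normalize both prefixes to end with '/', matching the step 4 script.
    if not input_prefix.endswith('/'):
        input_prefix += '/'
    if not output_prefix.endswith('/'):
        output_prefix += '/'
    # Take the part of the key after the input folder prefix...
    filename = key.split(input_prefix)[1]
    # ...and prepend the output folder prefix to it.
    return f"{output_prefix}{filename}"

print(build_output_key("original/file1.json", "original", "redacted"))
# Prints: redacted/file1.json
```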
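
Similarly, the recursive `metadata` walk from step 4 can be exercised without Presidio by substituting any string-redaction callable. The `mask_all` stand-in below is an assumption used only to make the sketch self-contained; in the real script, the callable's role is played by the Presidio analyzer/anonymizer pair:

```python
def redact_strings(obj, redact):
    # Walk a dict, applying "redact" to every string value except
    # Base64-encoded fields, and recursing into nested dicts. This
    # mirrors check_string_values from step 4.
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key in ('image_base64', 'orig_elements'):
                continue  # Skip Base64-encoded fields.
            if isinstance(value, str):
                obj[key] = redact(value)
            elif isinstance(value, dict):
                redact_strings(value, redact)
    return obj

# A stand-in redactor that masks every string, for demonstration only.
mask_all = lambda text: "<REDACTED>"

metadata = {"filename": "a.pdf", "image_base64": "aGk=", "coordinates": {"system": "px"}}
print(redact_strings(metadata, mask_all))
# Prints: {'filename': '<REDACTED>', 'image_base64': 'aGk=', 'coordinates': {'system': '<REDACTED>'}}
```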