|
| 1 | +--- |
| 2 | +title: De-identify multiple documents with the de-identification service in Python |
| 3 | +description: "Learn how to bulk de-identify documents with the asynchronous de-identification service in Python." |
| 4 | +author: kimiamavon-msft |
| 5 | +ms.author: kimiamavon |
| 6 | +ms.service: azure-health-data-services |
| 7 | +ms.subservice: deidentification-service |
| 8 | +ms.topic: tutorial |
| 9 | +ms.date: 05/01/2025 |
| 10 | + |
| 11 | +#customer intent: As an IT admin, I want to de-identify multiple documents with the de-identification service in Python. |
| 12 | + |
| 13 | +--- |
| 14 | + |
| 15 | +# De-identify multiple documents with the asynchronous de-identification service |
| 16 | + |
| 17 | +The Azure Health Data Services de-identification service can de-identify documents in Azure Storage via an asynchronous job. If you have many documents that you would like |
| 18 | +to de-identify, using a job is a good option. Jobs also provide consistent surrogation, meaning that surrogate values in the de-identified output will match across |
| 19 | +all documents. For more information about de-identification, including consistent surrogation, see [What is the de-identification service?](overview.md) |
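
To make consistent surrogation concrete, here is a toy Python sketch. It is not the service's algorithm: the surrogate pool, `surrogate_for`, and `deidentify` are hypothetical names invented for this illustration, and the real service detects PHI itself rather than taking a list of entities. The sketch only shows the property that matters: within one job, the same original value maps to the same surrogate in every document.

```python
import hashlib

# Hypothetical replacement pool; the real service generates its own surrogates.
SURROGATE_HOSPITALS = ["Lakeview Clinic", "Northwind Medical", "Fabrikam Health"]


def surrogate_for(value: str, job_key: str) -> str:
    """Pick a surrogate deterministically from the job key and the original value."""
    digest = hashlib.sha256(f"{job_key}:{value}".encode("utf-8")).digest()
    return SURROGATE_HOSPITALS[digest[0] % len(SURROGATE_HOSPITALS)]


def deidentify(text: str, entities: list[str], job_key: str) -> str:
    """Replace each known entity with its surrogate (entity detection is assumed)."""
    for entity in entities:
        text = text.replace(entity, surrogate_for(entity, job_key))
    return text


docs = [
    "The patient was seen at Contoso Hospital on Monday.",
    "Follow-up scheduled at Contoso Hospital next month.",
]
# Both documents are processed under the same (hypothetical) job key.
results = [deidentify(doc, ["Contoso Hospital"], job_key="job-1") for doc in docs]
for line in results:
    print(line)
```

Because the mapping depends only on the job key and the original value, both documents end up naming the same surrogate hospital, which keeps cross-document references intact.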
| 20 | + |
| 21 | +When you choose to store documents in Azure Blob Storage, you're charged based on Azure Storage pricing. This cost isn't included in the |
| 22 | + de-identification service pricing. [Explore Azure Blob Storage pricing](https://azure.microsoft.com/pricing/details/storage/blobs). |
| 23 | + |
| 24 | +In this tutorial, you: |
| 25 | + |
| 26 | + |
| 27 | + * Create a storage account and container |
| 28 | + * Upload a sample document |
| 29 | + * Grant the de-identification service access |
| 30 | + * Configure network isolation |
| 31 | + |
| 32 | +## Prerequisites |
| 33 | + |
| 34 | +* An Azure account with an active subscription. [Create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F). |
| 35 | +* A de-identification service with system-assigned managed identity. [Deploy the de-identification service](quickstart.md). |
| 36 | + |
| 37 | +## Open Azure CLI |
| 38 | + |
| 39 | +Install [Azure CLI](/cli/azure/install-azure-cli) and open your terminal of choice. In this tutorial, we're using PowerShell. |
| 40 | + |
| 41 | +## Create a storage account and container |
| 42 | +1. Set your context, replacing the `<subscription_name>` placeholder with the name of the subscription that contains your de-identification service: |
| 43 | + ```powershell |
| 44 | + az account set --subscription "<subscription_name>" |
| 45 | + ``` |
| 46 | +1. Save the resource group in a variable, replacing the `<resource_group>` placeholder with the name of the resource group that contains your de-identification service: |
| 47 | + ```powershell |
| 48 | + $ResourceGroup = "<resource_group>" |
| 49 | + ``` |
| 50 | +1. Create a storage account, providing a value for the `<storage_account_name>` placeholder: |
| 51 | + ```powershell |
| 52 | + $StorageAccountName = "<storage_account_name>" |
| 53 | + $StorageAccountId = $(az storage account create --name $StorageAccountName --resource-group $ResourceGroup --sku Standard_LRS --kind StorageV2 --min-tls-version TLS1_2 --allow-blob-public-access false --query id --output tsv) |
| 54 | + ``` |
| 55 | +1. Assign yourself a role to perform data operations on the storage account: |
| 56 | + ```powershell |
| 57 | + $UserId = $(az ad signed-in-user show --query id -o tsv) |
| 58 | + az role assignment create --role "Storage Blob Data Contributor" --assignee $UserId --scope $StorageAccountId |
| 59 | + ``` |
| 60 | +1. Create a container to hold your sample document: |
| 61 | + ```powershell |
| 62 | + az storage container create --account-name $StorageAccountName --name deidtest --auth-mode login |
| 63 | + ``` |
| 64 | +## Upload a sample document |
| 65 | +Next, you upload a document that contains synthetic protected health information (PHI): |
| 66 | +```powershell |
| 67 | +$DocumentContent = "The patient came in for a visit on 10/12/2023 and was seen again November 4th at Contoso Hospital." |
| 68 | +az storage blob upload --data $DocumentContent --account-name $StorageAccountName --container-name deidtest --name deidsample.txt --auth-mode login |
| 69 | +``` |
| 70 | + |
| 71 | +## Grant the de-identification service access to the storage account |
| 72 | + |
| 73 | +In this step, you grant the de-identification service's system-assigned managed identity role-based access to the storage account. You grant the **Storage Blob |
| 74 | +Data Contributor** role because the de-identification service both reads the original document and writes the de-identified output documents. Substitute the name of |
| 75 | +your de-identification service for the `<deid_service_name>` placeholder: |
| 76 | +```powershell |
| 77 | +$DeidServicePrincipalId = $(az resource show -n "<deid_service_name>" -g $ResourceGroup --resource-type microsoft.healthdataaiservices/deidservices --query identity.principalId --output tsv) |
| 78 | +az role assignment create --assignee $DeidServicePrincipalId --role "Storage Blob Data Contributor" --scope $StorageAccountId |
| 79 | +``` |
| 80 | +To verify that the de-identification service has access to the storage account, open <b>Storage accounts</b> in the Azure portal. Under <b>Storage center</b>, on the <b>Resources</b> tab, select your storage account name. Select <b>Access control (IAM)</b>, and then search for the name of your de-identification service (the value you used for `<deid_service_name>`). |
| 81 | + |
| 82 | +## Configure network isolation on the storage account |
| 83 | +Next, you update the storage account to disable public network access and only allow access from trusted Azure services such as the de-identification service. |
| 84 | +After running this command, you won't be able to view the storage container contents without setting a network exception. |
| 85 | +Learn more at [Configure Azure Storage firewalls and virtual networks](/azure/storage/common/storage-network-security). |
| 86 | + |
| 87 | +```powershell |
| 88 | +az storage account update --name $StorageAccountName --public-network-access Disabled --bypass AzureServices |
| 89 | +``` |
| 90 | + |
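| 91 | +## Use the Python SDK |
| 92 | +The following code is a sample from the [Azure Health Deidentification SDK for Python](https://learn.microsoft.com/python/api/overview/azure/health-deidentification?view=azure-python). It requires the `azure-health-deidentification`, `azure-identity`, and `aiohttp` packages, which you can install with `pip`. |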
| 93 | + |
| 94 | +```python |
| 95 | + |
| 96 | +""" |
| 97 | +FILE: deidentify_documents_async.py |
| 98 | + |
| 99 | +DESCRIPTION: |
| 100 | +    This sample demonstrates a basic scenario of de-identifying documents in Azure Storage. |
| 101 | +    Taking a container URI and an input prefix, the sample creates a job and waits for the job to complete. |
| 102 | + |
| 103 | +USAGE: |
| 104 | +    python deidentify_documents_async.py |
| 105 | + |
| 106 | +    Replace the placeholder values in deidentify_documents_async() before running the sample: |
| 107 | +    1) endpoint - the service URL endpoint for a de-identification service. |
| 108 | +    2) storage_location - an Azure Storage container endpoint, like "https://<storageaccount>.blob.core.windows.net/<container>". |
| 109 | +    3) input_prefix - the prefix of the input document name(s) in the container. |
| 110 | +        For example, providing "folder1" would create a job that would process documents like "https://<storageaccount>.blob.core.windows.net/<container>/folder1/document1.txt". |
| 111 | +""" |
| 112 | + |
| 113 | + |
| 114 | +import asyncio |
| 115 | +from azure.core.polling import AsyncLROPoller |
| 116 | +from azure.health.deidentification.aio import DeidentificationClient |
| 117 | +from azure.health.deidentification.models import ( |
| 118 | +    DeidentificationJob, |
| 119 | +    SourceStorageLocation, |
| 120 | +    TargetStorageLocation, |
| 121 | +) |
| 122 | +from azure.identity.aio import DefaultAzureCredential |
| 123 | +import os |
| 124 | +import uuid |
| 125 | + |
| 126 | + |
| 127 | +async def deidentify_documents_async(): |
| 128 | +    endpoint = "<YOUR SERVICE URL HERE>"  # Replace with your service URL |
| 129 | +    storage_location = "https://<STORAGE ACCOUNT NAME>.blob.core.windows.net/deidtest/"  # Replace <STORAGE ACCOUNT NAME> |
| 130 | +    input_prefix = "deidsample" |
| 131 | +    output_prefix = "_output" |
| 132 | + |
| 133 | +    credential = DefaultAzureCredential() |
| 134 | +    client = DeidentificationClient(endpoint, credential) |
| 135 | + |
| 136 | +    jobname = f"sample-job-{uuid.uuid4().hex[:8]}" |
| 137 | + |
| 138 | +    job = DeidentificationJob( |
| 139 | +        source_location=SourceStorageLocation( |
| 140 | +            location=storage_location, |
| 141 | +            prefix=input_prefix, |
| 142 | +        ), |
| 143 | +        target_location=TargetStorageLocation(location=storage_location, prefix=output_prefix, overwrite=True), |
| 144 | +    ) |
| 145 | + |
| 146 | +    async with client: |
| 147 | +        lro: AsyncLROPoller = await client.begin_deidentify_documents(jobname, job) |
| 148 | +        finished_job: DeidentificationJob = await lro.result() |
| 149 | + |
| 150 | +    await credential.close() |
| 151 | + |
| 152 | +    print(f"Job Name: {finished_job.job_name}") |
| 153 | +    print(f"Job Status: {finished_job.status}")  # Succeeded |
| 154 | +    print(f"File Count: {finished_job.summary.total_count if finished_job.summary is not None else 0}") |
| 155 | + |
| 156 | + |
| 157 | +async def main(): |
| 158 | +    await deidentify_documents_async() |
| 159 | + |
| 160 | + |
| 161 | +if __name__ == "__main__": |
| 162 | +    asyncio.run(main()) |
| 163 | + |
| 164 | + |
| 165 | +``` |
| 166 | + |
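
The relationship between the container URL and the input prefix can be sketched in plain Python. This is an illustration under assumptions, not SDK code: `container_url` and `select_inputs` are hypothetical helpers, and the sketch assumes the prefix is matched as a plain string prefix of the blob name, as in Blob Storage prefix listing.

```python
def container_url(account_name: str, container_name: str) -> str:
    """Build the container endpoint from the storage account and container names."""
    return f"https://{account_name}.blob.core.windows.net/{container_name}"


def select_inputs(blob_names: list[str], prefix: str) -> list[str]:
    """Return the blob names a job with this input prefix would pick up (assumed behavior)."""
    return [name for name in blob_names if name.startswith(prefix)]


# Hypothetical container contents for illustration.
blobs = ["deidsample.txt", "folder1/document1.txt", "notes.txt"]

print(container_url("mystorageaccount", "deidtest"))
print(select_inputs(blobs, "deidsample"))
print(select_inputs(blobs, "folder1"))
```

Note that the host name in the URL comes from the storage account, and the container name is the first path segment. In this tutorial the container is `deidtest` and the input prefix `deidsample` matches the uploaded `deidsample.txt` document.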
| 167 | +## Clean up resources |
| 168 | +Once you're done with the storage account, you can delete the storage account and role assignments: |
| 169 | +```powershell |
| 170 | +az role assignment delete --assignee $DeidServicePrincipalId --role "Storage Blob Data Contributor" --scope $StorageAccountId |
| 171 | +az role assignment delete --assignee $UserId --role "Storage Blob Data Contributor" --scope $StorageAccountId |
| 172 | +az storage account delete --ids $StorageAccountId --yes |
| 173 | +``` |
| 174 | + |
| 175 | +## Next step |
| 176 | + |
| 177 | +> [!div class="nextstepaction"] |
| 178 | +> [Quickstart: Azure Health De-identification client library for .NET](quickstart-sdk-net.md) |