Note
Access to this page requires authorization. You can try signing in or changing directories.
Access to this page requires authorization. You can try changing directories.
The Azure Health Data Services de-identification service can de-identify documents in Azure Storage via an asynchronous job. If you have many documents that you would like to de-identify, using a job is a good option. Jobs also provide consistent surrogation, meaning that surrogate values in the de-identified output match across all documents. For more information about de-identification, including consistent surrogation, see What is the de-identification service?
When you choose to store documents in Azure Blob Storage, you're charged based on Azure Storage pricing. This cost isn't included in the de-identification service pricing. Explore Azure Blob Storage pricing.
This tutorial covers how to configure and run the service via the asynchronous (Batch) API.
Prerequisites
- An Azure account with an active subscription. Create an account for free.
- A de-identification service with system-assigned managed identity. Deploy the de-identification service.
- In the top search bar, type
De-identification. - Select De-identification Services from the search results.
- Select Create.
- Fill in required subscription and instance details.
- Select Review + create, then Create.
- Once deployment completes, go to the resource and copy your Service URL and Subscription ID.
Important
To use the batch (asynchronous) API with the multilingual model, ensure you have the DeID Batch Data Owner role assigned to your identity in Access Control (IAM).
Install Azure CLI and open your terminal of choice. In this tutorial, we're using PowerShell.
Create a storage account and container
- Set your context, substituting the subscription name containing your de-identification service for the
<subscription_name>placeholder:az account set --subscription "<subscription_name>" - Save a variable for the resource group, substituting the resource group containing your de-identification service for the
<resource_group>placeholder:$ResourceGroup = "<resource_group>" - Create a storage account, providing a value for the
<storage_account_name>placeholder:$StorageAccountName = "<storage_account_name>" $StorageAccountId = $(az storage account create --name $StorageAccountName --resource-group $ResourceGroup --sku Standard_LRS --kind StorageV2 --min-tls-version TLS1_2 --allow-blob-public-access false --query id --output tsv) - Assign yourself a role to perform data operations on the storage account:
$UserId = $(az ad signed-in-user show --query id -o tsv) az role assignment create --role "Storage Blob Data Contributor" --assignee $UserId --scope $StorageAccountId - Create a container to hold your sample document:
az storage container create --account-name $StorageAccountName --name deidtest --auth-mode login
Upload a sample document
Next, you upload a document that contains synthetic PHI:
- Example with English data:
$DocumentContent = "The patient came in for a visit on 10/12/2023 and was seen again November 4th at Contoso Hospital."
az storage blob upload --data $DocumentContent --account-name $StorageAccountName --container-name deidtest --name deidsample.txt --auth-mode login
- Example with French data:
$DocumentContent = "Julie Tremblay a été consultée par Dr. Marc Lavoie à l'Hôpital de la Cité-des-Prairies le 10 Janvier"
az storage blob upload --data $DocumentContent --account-name $StorageAccountName --container-name deidtest --name deidsample_fr.txt --auth-mode login
Grant the de-identification service access to the storage account
In this step, you grant the de-identification service's system-assigned managed identity role-based access to the container.
You grant the Storage Blob Data Contributor role because the de-identification service reads the original document and writes the de-identified output documents.
Substitute the name of your de-identification service for the <deid_service_name> placeholder:
$DeidServicePrincipalId=$(az resource show -n <deid_service_name> -g $ResourceGroup --resource-type microsoft.healthdataaiservices/deidservices --query identity.principalId --output tsv)
az role assignment create --assignee $DeidServicePrincipalId --role "Storage Blob Data Contributor" --scope $StorageAccountId
To verify that the de-identification service has access to the storage account, you can check on the Azure portal under storage accounts. Under the Storage center and Resources tab, click your storage account name. Select Access control (IAM). In the search bar, search for the name of your de-identification service ($ResourceGroup).
Configure network isolation on the storage account
Next, you update the storage account to disable public network access and only allow access from trusted Azure services such as the de-identification service. After running this command, you won't be able to view the storage container contents without setting a network exception.
Learn more at Configure Azure Storage firewalls and virtual networks.
az storage account update --name $StorageAccountName --public-network-access Disabled --bypass AzureServices
Running the service using the python SDK
The code below contains a sample from the Azure Health De-identification SDK for Python.
This example uses the French Canada language-locale pair. See the full list our service supports here.
"""
FILE: deidentify_documents_async.py
DESCRIPTION:
This sample demonstrates a basic scenario of de-identifying documents in Azure Storage.
Taking a container URI and an input prefix, the sample will create a job and wait for the job to complete.
USAGE:
python deidentify_documents_async.py
Set the environment variables with your own values before running the sample:
1) endpoint - the service URL endpoint for a de-identification service.
2) storage_location - an Azure Storage container endpoint, like "https://<storageaccount>.blob.core.windows.net/<container>".
3) INPUT_PREFIX - the prefix of the input document name(s) in the container.
For example, providing "folder1" would create a job that would process documents like "https://<storageaccount>.blob.core.windows.net/<container>/folder1/document1.txt".
"""
import asyncio
from azure.core.polling import AsyncLROPoller
from azure.health.deidentification.aio import DeidentificationClient
from azure.health.deidentification.models import (
DeidentificationJob,
SourceStorageLocation,
TargetStorageLocation,
DeidentificationJobCustomizationOptions,
DeidentificationOperationType,
)
from azure.identity.aio import DefaultAzureCredential
import uuid
async def deidentify_documents_async():
endpoint = "<YOUR SERVICE URL HERE>" ### Replace
storage_location = "https://<CONTAINER NAME>.blob.core.windows.net/deidtest/" ### Replace <CONTAINER NAME>
inputPrefix = "deidsample"
outputPrefix = "_output"
locale = "fr-CA" ### e.g., fr-FR, de-DE, etc
credential = DefaultAzureCredential()
client = DeidentificationClient(endpoint, credential)
jobname = f"sample-job-{uuid.uuid4().hex[:8]}"
job = DeidentificationJob(
### Set the operation to Surrogate:
operation_type=DeidentificationOperationType.SURROGATE,
source_location=SourceStorageLocation(
location=storage_location,
prefix=inputPrefix,
),
target_location=TargetStorageLocation(location=storage_location, prefix=outputPrefix, overwrite=True),
### Locale customization for surrogate generation:
customizations=DeidentificationJobCustomizationOptions(surrogate_locale=locale),
)
async with client:
lro: AsyncLROPoller = await client.begin_deidentify_documents(jobname, job)
finished_job: DeidentificationJob = await lro.result()
await credential.close()
print(f"Job Name: {finished_job.job_name}")
print(f"Job Status: {finished_job.status}") # Succeeded
print(f"File Count: {finished_job.summary.total_count if finished_job.summary is not None else 0}")
async def main():
await deidentify_documents_async()
if __name__ == "__main__":
asyncio.run(main())
Get job status:
curl -X GET \
-H "Authorization: Bearer $(<token.txt)" \
-H "Content-Type: application/json" \
"<your Service URL>/jobs/<unique-job-name>?api-version=2024-12-15-preview"
Clean up resources
Once you're done with the storage account, you can delete the storage account and role assignments:
az role assignment delete --assignee $DeidServicePrincipalId --role "Storage Blob Data Contributor" --scope $StorageAccountId
az role assignment delete --assignee $UserId --role "Storage Blob Data Contributor" --scope $StorageAccountId
az storage account delete --ids $StorageAccountId --yes