What is Personally Identifiable Information (PII) detection in Azure AI Language?

01/31/2024

PII detection is one of the features offered by Azure AI Language, a collection of machine learning and AI algorithms in the cloud for developing intelligent applications that involve written language. The PII detection feature can identify, categorize, and redact sensitive information in unstructured text, such as phone numbers, email addresses, and forms of identification. The method for using PII detection in conversations is different from other use cases, and the articles for that use are separate.

Quickstarts are getting-started instructions to guide you through making requests to the service.
How-to guides contain instructions for using the service in more specific or customized ways.
The conceptual articles provide in-depth explanations of the service's functionality and features.

PII detection comes in two forms:

PII - works on unstructured text.
Conversation PII (preview) - a tailored model that works on conversation transcriptions.

Typical workflow

To use this feature, you submit data for analysis and handle the API output in your application. Analysis is performed as-is, with no added customization to the model used on your data.

Create an Azure AI Language resource, which grants you access to the features offered by Azure AI Language. It generates a password (called a key) and an endpoint URL that you use to authenticate API requests.
Create a request using either the REST API or the client library for C#, Java, JavaScript, or Python. You can also send asynchronous calls with a batch request to combine API requests for multiple features into a single call.
Send the request containing your text data. Your key and endpoint are used for authentication.
Stream or store the response locally.
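
The request-building step above can be sketched in Python using only the standard library. This is a minimal illustration, not the official client library: the function names `build_pii_request` and `send` are hypothetical, the `2023-04-01` API version is an assumption to check against the REST API reference, and the endpoint and key must come from your own Azure AI Language resource.

```python
import json
import urllib.request

# Assumption: a generally available API version; confirm in the REST reference.
API_VERSION = "2023-04-01"


def build_pii_request(endpoint, key, texts, language="en"):
    """Build (but do not send) a synchronous PII detection request."""
    url = f"{endpoint.rstrip('/')}/language/:analyze-text?api-version={API_VERSION}"
    body = {
        "kind": "PiiEntityRecognition",
        "parameters": {"modelVersion": "latest"},
        "analysisInput": {
            "documents": [
                # Each document needs an id that is unique within the request.
                {"id": str(i), "language": language, "text": text}
                for i, text in enumerate(texts, start=1)
            ]
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            # The resource key authenticates the call.
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request):
    """Send the request and return the parsed JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `send(build_pii_request(endpoint, key, ["Call me at 555-0100."]))` would perform the network call. Keeping request construction separate from sending makes it possible to inspect or log the payload before any data leaves your application.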

Native document support

A native document refers to the file format used to create the original document, such as Microsoft Word (docx) or Portable Document Format (pdf). Native document support eliminates the need for text preprocessing before using Azure AI Language resource capabilities. Currently, native document support is available for the PiiEntityRecognition capability.

Currently, PII detection supports the following native document formats:

File type	File extension	Description
Text	`.txt`	An unformatted text document.
Adobe PDF	`.pdf`	A Portable Document Format document.
Microsoft Word	`.docx`	A Microsoft Word document file.

For more information, see Use native documents for language processing.

Get started with PII detection

To use PII detection, you submit text for analysis and handle the API output in your application. Analysis is performed as-is, with no customization to the model used on your data. There are two ways to use PII detection:

Development option	Description
Language Studio	Language Studio is a web-based platform that lets you try PII detection with text examples without an Azure account, and with your own data when you sign up. For more information, see the Language Studio website or the Language Studio quickstart.
REST API or Client library (Azure SDK)	Integrate PII detection into your applications using the REST API, or the client library available in various languages. For more information, see the PII detection quickstart.

Reference documentation and code samples

As you use this feature in your applications, see the following reference documentation and samples for Azure AI Language:

Development option / language	Reference documentation	Samples
REST API	REST API documentation
C#	C# documentation	C# samples
Java	Java documentation	Java samples
JavaScript	JavaScript documentation	JavaScript samples
Python	Python documentation	Python samples

Responsible AI

An AI system includes not only the technology, but also the people who use it, the people affected by it, and the deployment environment. Read the transparency note for PII to learn about responsible AI use and deployment in your systems.

Example scenarios

Apply sensitivity labels - For example, based on the results from the PII service, a public sensitivity label might be applied to documents where no PII entities are detected. For documents where US addresses and phone numbers are recognized, a confidential label might be applied. A highly confidential label might be used for documents where bank routing numbers are recognized.
Redact some categories of personal information from documents that get wider circulation - For example, if customer contact records are accessible to frontline support representatives, the company can redact all of the customer's personal information except their name from the version of the customer history those representatives see, to preserve the customer's privacy.
Redact personal information in order to reduce unconscious bias - For example, during its resume review process, a company can redact names, addresses, and phone numbers to help reduce unconscious gender or other biases.
Replace personal information in source data for machine learning to reduce unfairness - For example, if you want to remove names that might reveal gender when training a machine learning model, you could use the service to identify them and replace them with generic placeholders for model training.
Remove personal information from call center transcription - For example, if you want to remove names or other PII exchanged between the agent and the customer in a call center scenario, you could use the service to identify and remove them.
Data cleaning for data science - PII detection can be used to prepare data so that data scientists and engineers can use it to train their machine learning models. Redacting the data helps make sure that customer data isn't exposed.
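
Two of the scenarios above, sensitivity labels and placeholder replacement, can be sketched as post-processing on the entities the service returns. This is a hedged illustration rather than a recommended policy. It assumes each entity is a dictionary with `category`, `offset`, and `length` fields, that offsets count Unicode code points (so they line up with Python string indexing), and that bank routing numbers are reported under the category name `ABARoutingNumber`. The function names and the rule of labeling any other PII as confidential are hypothetical.

```python
# Assumption: category name the service uses for bank routing numbers.
HIGHLY_CONFIDENTIAL_CATEGORIES = {"ABARoutingNumber"}


def sensitivity_label(entities):
    """Map detected PII entities to a sensitivity label."""
    if not entities:
        return "public"
    categories = {entity["category"] for entity in entities}
    if categories & HIGHLY_CONFIDENTIAL_CATEGORIES:
        return "highly confidential"
    # Assumption: any other detected PII is treated as confidential.
    return "confidential"


def replace_with_placeholders(text, entities):
    """Replace each detected entity with a generic [Category] placeholder."""
    # Work from the end of the text so earlier offsets stay valid.
    for entity in sorted(entities, key=lambda e: e["offset"], reverse=True):
        start = entity["offset"]
        end = start + entity["length"]
        text = text[:start] + f"[{entity['category']}]" + text[end:]
    return text
```

For the text `Call Ana at 555-0100.` with a `Person` entity at offset 5 (length 3) and a `PhoneNumber` entity at offset 12 (length 8), `replace_with_placeholders` returns `Call [Person] at [PhoneNumber].` and `sensitivity_label` returns `confidential`.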

Next steps

There are two ways to get started using the PII detection feature:

Language Studio, which is a web-based platform that enables you to try several Language service features without needing to write code.
The quickstart article for instructions on making requests to the service using the REST API and client library SDK.