Data privacy and security for AI (and other projects that use sensitive data)

As the amount of data that an organization collects and uses for analysis increases, so do concerns about privacy and security. Analyses require data, and machine learning models typically become more accurate the more data they are trained on. When personal information is part of that data, it is especially important that it remains private throughout its use.

ℹ️ Refer to the Data: Data Governance and Protection content for guidance tailored to a Data Engineer/Governance role. For best practices, refer to the Data Security Best Practices in the Data content.

Data Privacy from an MLOps perspective

Data privacy in MLOps can relate to the following areas:

  • Protecting sensitive data: Anonymizing, obfuscating and synthesizing PII data for model training.
  • Federated Learning or Multi-Party Computation: Building ML models across multiple parties where no single party is allowed access to all of the data.
  • Homomorphic Encryption: Computing directly on encrypted data throughout the MLOps process.
  • Differential Privacy: A set of systems and practices that help keep the data of individuals safe and private.
  • Trusted Research Environments: Enforcing a secure boundary around distinct workspaces so that information governance controls can be applied.
  • Remote Execution: Executing ML workloads on machines that ML practitioners cannot access directly.

PII data redaction, anonymization and synthetic data

Data anonymization is the process of altering data so that a data subject can no longer be identified, directly or indirectly.

Data obfuscation, or pseudonymization, means replacing any information that could be used to identify an individual with a pseudonym. Unlike anonymization, it can still allow some form of re-identification of the data.

Below are some PII data detection and masking resources for Azure:

  • Microsoft Presidio: Microsoft Presidio helps ensure sensitive data is properly managed and governed. It provides identification and anonymization modules for private entities in text and images, such as credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers, and financial data (see the usage sketch after this list).
  • PII Detection Azure AI Service: The PII detection feature can identify, categorize, and redact sensitive information in unstructured text, for example phone numbers, email addresses, and forms of identification.
  • Masking PII data in Azure AI Search: The PII Detection skill extracts personal information from an input text and gives you the option of masking it.
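
As a concrete example, the snippet below shows Presidio's analyzer and anonymizer modules redacting PII from free text. It is a minimal sketch that assumes the presidio-analyzer and presidio-anonymizer packages (and an spaCy language model) are installed:

```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "Jane Doe's phone number is 212-555-0199."

# Detect PII entities (names, phone numbers, ...) in the text.
analyzer = AnalyzerEngine()
results = analyzer.analyze(text=text, language="en")

# Replace each detected entity with a placeholder such as <PERSON>.
anonymizer = AnonymizerEngine()
anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
print(anonymized.text)  # e.g. "<PERSON>'s phone number is <PHONE_NUMBER>."
```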

Below are some synthetic data resources in Azure:

Federated Learning or Multi-Party Computation

According to the Wikipedia definition of Federated Learning, it is a machine learning technique that trains an algorithm across multiple decentralized edge devices or servers holding local data samples, without exchanging those samples. This approach differs from traditional centralized machine learning methods, where all local datasets are uploaded to a single server. It also differs from classical decentralized approaches, which assume local data samples have identical distributions.
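
To make the idea concrete, here is a minimal federated averaging (FedAvg) sketch in plain NumPy, under simplifying assumptions (a linear model, one local gradient step per round, and synthetic data); production systems use frameworks such as the ones listed below:

```python
import numpy as np

# One local training step: each party updates the model on its OWN data.
def local_update(weights, X, y, lr=0.1):
    grad = X.T @ (X @ weights - y) / len(y)  # gradient of mean squared error
    return weights - lr * grad

# Federated averaging: the server receives only model weights,
# never the parties' raw (X, y) datasets.
def federated_round(global_weights, party_data):
    updates = [local_update(global_weights, X, y) for X, y in party_data]
    return np.mean(updates, axis=0)

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
parties = []
for _ in range(3):
    X = rng.normal(size=(100, 2))
    parties.append((X, X @ true_w + rng.normal(scale=0.1, size=100)))

w = np.zeros(2)
for _ in range(200):
    w = federated_round(w, parties)
print(w)  # converges toward true_w without centralizing any raw data
```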

Below are some Federated Learning resources for Azure:

  • Microsoft Flute: Federated Learning Utilities and Tools for Experimentation

Homomorphic Encryption

Homomorphic encryption allows computation directly on encrypted data, making it easier to leverage the cloud for privacy-critical workloads.

  • Microsoft SEAL: Microsoft SEAL is an easy-to-use open-source (MIT licensed) homomorphic encryption library developed by the Cryptography and Privacy Research Group at Microsoft.

Refer to the SEAL documentation and Microsoft Research SEAL for detailed guidance.
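
To see why this is useful, the toy sketch below implements textbook Paillier encryption (an additively homomorphic scheme, not SEAL's API or schemes) with deliberately tiny, insecure parameters: two ciphertexts are multiplied, and decrypting the product yields the sum of the plaintexts.

```python
from math import gcd

# Toy Paillier cryptosystem: additively homomorphic, illustration ONLY.
# The primes are tiny and the randomness is fixed; real workloads should
# use a vetted library such as Microsoft SEAL.
p, q = 1789, 1861
n = p * q
n2 = n * n
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)  # lcm(p-1, q-1)
g = n + 1

def L(x):
    return (x - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)  # modular inverse (Python 3.8+)

def encrypt(m, r):
    # r must be coprime to n; hardcoded here for reproducibility
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

c1 = encrypt(42, 1234)
c2 = encrypt(58, 5678)
c_sum = (c1 * c2) % n2  # multiplying ciphertexts adds the plaintexts
print(decrypt(c_sum))   # 100, computed without decrypting c1 or c2
```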

Differential Privacy

Differential privacy is a set of systems and practices that help keep the data of individuals safe and private. In machine learning solutions, differential privacy might be required for regulatory compliance.

Refer to this document for more information on differential privacy in machine learning.
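
As an illustration, the sketch below applies the Laplace mechanism, a standard differential privacy building block: a counting query is answered with noise scaled to its sensitivity divided by the privacy budget epsilon, so any single individual's presence has only a bounded effect on the output.

```python
import numpy as np

def private_count(values, predicate, epsilon):
    """Answer a counting query with epsilon-differential privacy."""
    true_count = sum(1 for v in values if predicate(v))
    sensitivity = 1  # adding/removing one person changes the count by at most 1
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

ages = [23, 35, 41, 29, 52, 38, 44, 31]
# Smaller epsilon => more noise => stronger privacy, less accuracy.
print(private_count(ages, lambda a: a > 30, epsilon=0.5))
```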

Some resources include:

Trusted Research Environments

Trusted Research Environments (TREs) enforce a secure boundary around distinct workspaces so that information governance controls can be applied. Each workspace is accessible to a set of authorized users, prevents the exfiltration of sensitive data, and has access to one or more datasets provided by the data platform.
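
The sketch below is a purely illustrative Python model of those controls (not a real Azure TRE API): a workspace checks both user authorization and dataset provisioning before granting access, and raw data never leaves the boundary.

```python
from dataclasses import dataclass

# Illustrative only: models the access rules a TRE workspace enforces.
@dataclass(frozen=True)
class Workspace:
    name: str
    authorized_users: frozenset
    datasets: frozenset

    def read(self, user: str, dataset: str) -> str:
        if user not in self.authorized_users:
            raise PermissionError(f"{user} is not authorized for {self.name}")
        if dataset not in self.datasets:
            raise PermissionError(f"{dataset} is not provisioned to {self.name}")
        # Return an in-boundary handle; raw data is never exported.
        return f"secure-handle://{self.name}/{dataset}"

ws = Workspace("oncology-study", frozenset({"alice"}), frozenset({"registry-2023"}))
print(ws.read("alice", "registry-2023"))
```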

Some resources include:

Remote Execution

Remote execution is the ability to run ML workloads securely on remote machines that practitioners cannot access directly.
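
A common pattern on Azure is submitting a training job to a managed compute cluster with the Azure ML Python SDK v2 (azure-ai-ml). The sketch below assumes an existing workspace, compute cluster, and environment, with all identifiers shown as placeholders:

```python
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

# Placeholders -- replace with your own subscription/workspace details.
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# The script runs on a managed cluster; the practitioner submits the job
# but never logs in to the machine that processes the sensitive data.
job = command(
    code="./src",                # local folder containing train.py
    command="python train.py",
    environment="azureml:<environment-name>@latest",  # placeholder environment
    compute="<compute-cluster-name>",                 # placeholder cluster
)
ml_client.jobs.create_or_update(job)
```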

Some resources include: