Data privacy and security for AI (and other projects that use sensitive data)
As the amount of data an organization collects and uses for analysis grows, so do its privacy and security concerns. Analyses require data, and typically the more data used to train machine learning models, the more accurate they are likely to be. When personal information is used for analysis, it's especially important that the data remains private throughout its use.
ℹ️ Refer to the Data: Data Governance and Protection content for guidance tailored to a Data Engineering/Governance role. For best practices, refer to the Data Security Best Practices in the Data content.
Data Privacy from an MLOps perspective
Data privacy in MLOps can relate to the following areas:
- Protecting sensitive data: Anonymizing, obfuscating and synthesizing PII data for model training.
- Federated Learning or Multi-Party Computation: Building ML models across multiple parties where no single party is allowed access to all of the data.
- Homomorphic Encryption: Performing computations directly on encrypted data throughout the MLOps process.
- Differential Privacy: A set of systems and practices that help keep the data of individuals safe and private.
- Trusted Research Environments: Enforcing a secure boundary around distinct workspaces so that information governance controls can be applied.
- Remote Execution: Executing ML on machines that ML practitioners do not have access to.
PII data redaction, anonymization and synthetic data
Data anonymization is the process of altering data so that a data subject can no longer be identified, directly or indirectly.
Data obfuscation, or pseudonymization, means replacing any information that could be used to identify an individual with a pseudonym. Unlike anonymization, it can still allow some form of re-identification of the data.
- Refer to the Pseudonymization definition link for more information.
- Refer to the Anonymization definition link for more information.
- ℹ️ Refer to the Data: Data Privacy for masking and obfuscation from a Data Engineering perspective.
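To make the distinction concrete, below is a minimal pseudonymization sketch using only the Python standard library. It replaces identifying fields with keyed HMAC tokens: the mapping is deterministic (so records can still be joined) but reversing it requires the secret key. The key value and field names here are hypothetical, for illustration only.

```python
import hashlib
import hmac

# Hypothetical secret held by the data controller; in practice this would
# live in a managed secret store, not in source code.
SECRET_KEY = b"replace-with-a-managed-secret"

def pseudonymize(value: str) -> str:
    """Replace an identifier with a keyed, deterministic pseudonym.

    The same input always maps to the same token, so joins still work,
    but re-identification requires the secret key.
    """
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

record = {"name": "Jane Doe", "email": "jane@example.com", "purchase": 42.0}
masked = {
    "name": pseudonymize(record["name"]),
    "email": pseudonymize(record["email"]),
    "purchase": record["purchase"],  # non-identifying fields pass through
}
print(masked)
```

Because the tokens are keyed rather than plain hashes, an attacker cannot re-identify values by hashing guesses without also obtaining the key.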
Below are some PII data detection and masking resources for Azure:
- Microsoft Presidio: Microsoft Presidio helps to ensure sensitive data is properly managed and governed. It provides identification and anonymization modules for private entities in text and images such as: credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers, financial data.
- PII Detection Azure AI Service: The PII detection feature can identify, categorize, and redact sensitive information in unstructured text, for example phone numbers, email addresses, and forms of identification.
- Masking PII data in Azure AI Search: The PII Detection skill extracts personal information from an input text and gives you the option of masking it.
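The services above ship robust, trained recognizers for many entity types. As a toy illustration of the underlying detect-and-mask pattern only (this is not Presidio's or the Azure AI API, and the regexes are deliberately simplistic):

```python
import re

# Toy patterns for two common PII entity types; real services such as
# Presidio provide far more robust recognizers (checksums, context, NER).
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "US_PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace detected PII spans with an entity-type placeholder."""
    for entity, pattern in PATTERNS.items():
        text = pattern.sub(f"<{entity}>", text)
    return text

print(redact("Contact jane@example.com or 555-123-4567."))
# → Contact <EMAIL> or <US_PHONE>.
```

Replacing spans with an entity-type placeholder (rather than deleting them) preserves the structure of the text for downstream model training.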
Below are some synthetic data resources in Azure:
- Microsoft DSynth: DSynth is a flexible template-driven data generator.
- Microsoft Synthetic data showcase: Generates synthetic data and user interfaces for privacy-preserving data sharing and analysis.
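A template-driven generator like the tools above can be sketched in a few lines: each field maps to a generator function, and records are sampled from the template so the output contains no real individuals' data. The template format below is hypothetical and does not reflect DSynth's actual configuration.

```python
import random

random.seed(7)  # reproducible output for the example

# Hypothetical template: field name -> generator function.
FIRST_NAMES = ["Ana", "Bo", "Chen", "Dee"]
CITIES = ["Lisbon", "Oslo", "Quito"]

template = {
    "name": lambda: random.choice(FIRST_NAMES),
    "city": lambda: random.choice(CITIES),
    "age": lambda: random.randint(18, 90),
}

def generate(n: int) -> list:
    """Generate n synthetic records containing no real individuals' data."""
    return [{field: gen() for field, gen in template.items()} for _ in range(n)]

for row in generate(3):
    print(row)
```

Note that naive generation like this preserves no statistical relationships from a source dataset; tools such as Synthetic data showcase additionally aim to preserve aggregate statistics while protecting individuals.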
Federated Learning or Multi-Party Computation
According to the Wikipedia definition for Federated Learning, it is a machine learning technique that trains an algorithm across multiple decentralized edge devices or servers holding local data samples, without exchanging them. This approach differs from traditional centralized machine learning methods, where all local datasets are uploaded to a single server. It also differs from classical decentralized approaches that assume local data samples have identical distributions.
Below are some Federated Learning resources for Azure:
- Microsoft Flute: Federated Learning Utilities and Tools for Experimentation
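The core idea of federated learning can be sketched with federated averaging (FedAvg) in pure Python: each party trains on its own data and only model parameters ever leave the party. The linear model and toy datasets below are hypothetical, chosen so the sketch fits in a few lines.

```python
# Minimal FedAvg sketch: parties share model weights, never raw data.

def local_update(weights, local_data, lr=0.05):
    """One local pass of SGD on a linear model y = w0 + w1 * x."""
    w0, w1 = weights
    for x, y in local_data:
        err = (w0 + w1 * x) - y   # prediction error on a local sample
        w0 -= lr * err
        w1 -= lr * err * x
    return [w0, w1]

def fed_avg(updates, sizes):
    """Server-side aggregation: average weights, weighted by dataset size."""
    total = sum(sizes)
    return [sum(w[i] * n for w, n in zip(updates, sizes)) / total
            for i in range(len(updates[0]))]

# Two parties hold disjoint samples of y = 2x; the samples are never exchanged.
party_a = [(-1.0, -2.0), (1.0, 2.0)]
party_b = [(2.0, 4.0)]

global_w = [0.0, 0.0]
for _ in range(200):  # each round: local training, then averaging
    updates = [local_update(global_w, party_a), local_update(global_w, party_b)]
    global_w = fed_avg(updates, [len(party_a), len(party_b)])

print(global_w)  # converges toward [0.0, 2.0]
```

Production frameworks such as FLUTE add the pieces this sketch omits: secure aggregation, client sampling, communication, and handling of non-identically distributed client data.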
Homomorphic Encryption
Homomorphic encryption allows computation directly on encrypted data, making it easier to apply the potential of the cloud for privacy-critical data.
- Microsoft SEAL: Microsoft SEAL is an easy-to-use open-source (MIT licensed) homomorphic encryption library developed by the Cryptography and Privacy Research Group at Microsoft.
Refer to the SEAL documentation and Microsoft Research SEAL for detailed guidance.
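The homomorphic property itself — computing on ciphertexts so that the result decrypts to the result of computing on plaintexts — can be shown with a deliberately insecure toy scheme. This is only a conceptual illustration: SEAL uses lattice-based schemes (BFV/CKKS), not the trivial padding below.

```python
import random

# A deliberately INSECURE toy additively homomorphic scheme: the sum of
# ciphertexts decrypts to the sum of plaintexts. For illustration only.
N = 2**32

def encrypt(m):
    """Encrypt m with a fresh random pad; returns (ciphertext, pad)."""
    pad = random.randrange(N)
    return (m + pad) % N, pad

def decrypt(c, pad):
    return (c - pad) % N

# A client encrypts its values; a server adds the ciphertexts without
# ever seeing a plaintext; only the key holder can decrypt the result.
(c1, k1), (c2, k2) = encrypt(20), encrypt(22)
c_sum = (c1 + c2) % N                    # computation on encrypted data
result = decrypt(c_sum, (k1 + k2) % N)   # decrypt with the combined pads
print(result)  # → 42
```

Real schemes support this without the server or client exchanging per-message pads, and (in fully homomorphic schemes) support multiplication as well as addition.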
Differential Privacy
Differential privacy is a set of systems and practices that help keep the data of individuals safe and private. In machine learning solutions, differential privacy might be required for regulatory compliance.
Refer to this document for more information on Differential privacy in machine learning.
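The classic building block of differential privacy is the Laplace mechanism: add noise scaled to the query's sensitivity divided by the privacy budget ε, so that one individual joining or leaving the dataset changes the answer's distribution only slightly. A minimal stdlib sketch (production systems such as SmartNoise handle budget accounting and composition, which this omits):

```python
import math
import random

def laplace_noise(scale):
    """Sample Laplace(0, scale) noise via inverse-transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_count(true_count, epsilon):
    """Differentially private counting query.

    A count has sensitivity 1 (one person changes it by at most 1),
    so the noise scale is 1/epsilon: smaller epsilon means stronger
    privacy and a noisier answer.
    """
    return true_count + laplace_noise(1.0 / epsilon)

random.seed(0)
print(private_count(1000, epsilon=0.5))  # a noisy answer near 1000
```

Each released answer consumes privacy budget; answering many queries requires splitting ε across them, which is why validators and runtimes like SmartNoise exist.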
Some resources include:
- Microsoft Synthetic data showcase: Generates synthetic data and user interfaces for privacy-preserving data sharing and analysis.
- Microsoft Counterfit: a CLI that provides a generic automation layer for assessing the security of ML models.
- SmartNoise: Differential privacy validator and runtime.
Trusted Research Environments
Trusted Research Environments (TREs) enforce a secure boundary around distinct workspaces so that information governance controls can be applied. Each workspace is accessible to a set of authorized users, prevents the exfiltration of sensitive data, and has access to one or more datasets provided by the data platform.
Some resources include:
- Azure Trusted Research Environment: an accelerator for deploying TREs on Azure.
- Azure confidential ledger: Tamperproof, unstructured data store hosted in trusted execution environments (TEEs) and backed by cryptographically verifiable evidence.
- Secure MLOps solutions with Azure network security: How to protect ML solutions using Azure network security capabilities such as: Azure Virtual Network, network peering, Azure Private Link, and Azure DNS.
Remote Execution
Remote execution is the ability to securely run ML workloads on remote machines that practitioners cannot directly access.
Some resources include:
- Cloud analytics options
- Microsoft Azure Attestation: A unified solution to remotely verifying the trustworthiness of a platform and integrity of the binaries running on it.
- Security management in Azure: A comprehensive security resource for Azure.
- PySyft: PySyft is an open-source library that provides secure and private Deep Learning in Python.