Examine Microsoft AI principles - Privacy and Security

As artificial intelligence becomes ubiquitous, protecting user privacy and data security is imperative. Microsoft recognized this critical need and made privacy and security key pillars of its Responsible AI initiative. Responsible AI demands principles-based thinking around data practices.

Microsoft's approach

Microsoft's Responsible AI principles are designed to help prevent abuse and breach of user trust. Key tenets of Microsoft's approach to responsible AI privacy and security include:

  • Obtain informed consent before collecting user data. Clearly communicate how the AI system uses data, and obtain user permission before collecting it. Don't collect data covertly. For example, allow users to opt in or out of personal data collection features. Provide explicit notice of data collection in privacy policies and user agreements. Use clear, just-in-time consent prompts describing how the system plans to use the data. Allow granular controls, such as consent for photos but not location. Never assume implicit consent.
  • Limit data collection, retention and use to core necessities. Only gather the minimum data needed for an AI system to function. Don't collect extraneous data. Annotate datasets to mark sensitive fields and purge data once you deploy the AI model into production. Carefully audit data inputs to products and services to eliminate the gathering of any nonessential user data. Build protection against unlawful data gathering into system architectures.
  • Anonymize personal data where possible. Remove personal information from datasets through techniques like pseudonymization and aggregation. Pseudonymization protects personal data by replacing identifying information with artificial identifiers. For example, a dataset might contain fields like name, address, and ID number for each customer record. To pseudonymize this data, remove the name and address and replace them with artificial identifiers, such as "User ABC123" rather than the real name, and replace the ID number with a random value. Doing so preserves privacy while retaining analytic capabilities. Aggregation groups data points into totals, averages, or other summary statistics that remove individual-level detail. For example, rather than showing the exact age of every customer, a dataset could be aggregated into age buckets, such as 25-34 years old, or individual incomes could be aggregated into income bands rather than precise amounts. Aggregation preserves analytical utility for tasks like modeling trends while removing identifiable information about specific individuals. A short pandas sketch of both techniques appears after this list.
  • Encrypt sensitive data, both in transit and at rest. Implement end-to-end encryption for sensitive user data like biometrics, communications, and documents. Use secure protocols like TLS for data moving over networks, and strong encryption methods like AES-256 for data stored long-term. Store encryption keys securely through various methods, including:
    • Use a hardware security module (HSM). An HSM is a specialized secure computer that stores keys and performs encryption operations in a tamper-resistant environment. HSMs prevent attackers from extracting keys.
    • Store keys in a secure vault. Cloud solutions like Microsoft Azure offer hardened key storage services with tightly controlled access. Customers don't directly interact with the HSM devices themselves, but the HSMs add a layer of hardware-level security for keys stored on the cloud platform. This design gives customers both ease of use in managing keys through the cloud service and strong back-end security powered by dedicated HSM hardware. The keys are still accessible to customers, but they gain added protection from the underlying HSMs within Azure's infrastructure.
    • Use envelope encryption. This technique encrypts each data encryption key with a separate symmetric key-encryption key, which organizations store and protect apart from the data encryption keys. In essence, the data is protected with two locks: the data key and the key encryption key. To unlock the data, you need access to both keys, which makes it harder for an attacker to decrypt the data. A code sketch of this two-key pattern appears after this list.
    • Apply access controls. Organizations should restrict and audit which users or services can access or retrieve keys from secure storage. Keys should only be accessible to authorized encryption services.
    • Rotate keys periodically. Organizations should refresh keys on a regular cadence to limit the effect of a compromised key. When retiring old keys, they should remove them wherever possible and ensure that no one can ever recover or reuse them to access encrypted data again. Proper key destruction removes the risk of compromised data if someone later leaks an old key.
    • Back up keys securely. Organizations should tightly protect any backup copies of keys, for example by keeping an encrypted key backup offline in a physically secure facility as a contingency.
  • Restrict access to trained models and sensitive datasets. Organizations should employ the following best practices to restrict user access to trained models and sensitive data:
    • Limit which employees can access production models and underlying training data through permissions and controls. This practice helps prevent theft or misuse.
    • Classify data and models based on sensitivity.
    • Limit access to production systems and underlying data through permissions, controls, and code repository protections.
    • Conduct security audits and access reviews to prevent exposure.
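
The anonymization bullet above describes pseudonymization and aggregation. Here's a minimal sketch of both using pandas; the customer table, column names, and age bands are illustrative assumptions, not a real dataset or prescribed schema.

```python
import uuid

import pandas as pd

# Hypothetical customer records - column names and values are illustrative.
customers = pd.DataFrame({
    "name": ["Ana Silva", "Ben Cohen", "Cho Lee", "Dee Patel"],
    "address": ["12 Elm St", "98 Oak Ave", "7 Pine Rd", "3 Birch Ln"],
    "customer_id": [10482, 20931, 30377, 40518],
    "age": [29, 41, 37, 52],
    "income": [52000, 87000, 61000, 70000],
})

# Pseudonymize: drop direct identifiers and replace the ID with an artificial one.
pseudonymized = customers.drop(columns=["name", "address"]).copy()
pseudonymized["customer_id"] = [
    f"User-{uuid.uuid4().hex[:8]}" for _ in range(len(pseudonymized))
]

# Aggregate: replace exact ages with age bands and report average income per band.
pseudonymized["age_band"] = pd.cut(
    pseudonymized["age"],
    bins=[18, 25, 35, 45, 65],
    labels=["18-24", "25-34", "35-44", "45-64"],
)
print(pseudonymized.groupby("age_band", observed=True)["income"].mean())
```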
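
The envelope encryption bullet describes wrapping a data-encryption key (DEK) with a separate key-encryption key (KEK). Below is a minimal sketch of that two-key pattern using the `cryptography` package; in practice the KEK would live in an HSM or a managed vault such as Azure Key Vault, but here it is a local key purely for illustration.

```python
from cryptography.fernet import Fernet

# Key-encryption key (KEK): normally generated and held inside an HSM or vault.
kek = Fernet(Fernet.generate_key())

# Data-encryption key (DEK): encrypts the actual payload.
dek_bytes = Fernet.generate_key()
dek = Fernet(dek_bytes)

ciphertext = dek.encrypt(b"sensitive user record")
wrapped_dek = kek.encrypt(dek_bytes)  # store the DEK only in wrapped form

# Decryption requires both keys: first unwrap the DEK, then decrypt the data.
recovered_dek = Fernet(kek.decrypt(wrapped_dek))
plaintext = recovered_dek.decrypt(ciphertext)
assert plaintext == b"sensitive user record"
```

An attacker who steals only the wrapped DEK or only the ciphertext learns nothing; both keys must be compromised to recover the data.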

Addressing key risks

AI introduces new privacy and security risks that developers should mitigate, including:

  • Invasive data collection. AI systems could gather more user data than required through techniques like pervasive tracking, recording user interactions, or acquiring data from third parties without consent. Doing so creates risks of overreach, breach, and misuse. Strict governance, privacy reviews, and techniques like data minimization help prevent unnecessary collection. Establishing governance keeps data collection within proper, acceptable limits: it provides the policies, processes, and organizational oversight to ensure teams collect only the data necessary for an intended purpose, and no more. Governance best practices that help prevent overreach include:

    • Require justification for why someone needs user data and how they plan to use it.
    • Conduct impact assessments to evaluate data collection practices.
    • Create oversight teams responsible for reviewing data practices for potential overreach.
    • Implement approval workflows for accessing stored data with audit logging.
    • Employ techniques like data minimization, pseudonymization, and aggregation to collect as little raw data as possible.
    • Establish deletion schedules to remove unneeded data.
    • Train teams on responsible data practices and repercussions of overreach.
  • Reidentification. Even anonymized datasets, when combined with the power of AI, can potentially be used to reidentify individuals through pattern matching. Advanced techniques introduce mathematical guarantees to preserve anonymity even under algorithmic analysis. Such techniques include:

    • Differential privacy. Introduces mathematical noise to query results on a dataset to obscure any one individual's data while preserving overall analytics. Prevents reidentifying individuals through pattern matching. A sketch of the classic Laplace mechanism appears after this list.
    • K-anonymity. Generalizes data so that attackers can't distinguish any row in a dataset from at least k-1 other rows. For example, generalizing geographic data to zip codes rather than precise locations. A sketch of checking this property appears after this list.
    • Distributed learning. Trains models on decentralized data sources like personal devices without aggregating data into a central server. Avoids having consolidated data that's vulnerable to reidentification.
    • Synthetic data generation. Uses generative models to create artificial datasets that have the same structure and patterns as real data but don't represent actual people. Eliminates reidentification risks.
    • Access controls. Restrict access to datasets and train teams on rigorous auditing and approval processes. Doing so reduces the risk of malicious actors attempting reidentification attacks.
  • Model inversion. Attackers may attempt to reverse engineer private data used to train a model by observing its outputs. For example, they may query an AI system in a way that extracts facial features of original training subjects. You can make model inversion attacks more difficult by employing the following techniques:

    • Control model access
      • Require authentication and authorization to access models, and limit access to essential users only.
      • Deploy models within secure enclaves with hardened perimeter defenses.
    • Analyze model vulnerability
      • Probe models with inversion attacks during development to quantify vulnerability.
      • Evaluate how much private data someone can extract from model outputs.
      • Compare inversion resilience across different model architectures.
    • Limit output fidelity
      • Reduce precision of model outputs (for example, coarser classifications versus exact probabilities).
      • Add noise to outputs to increase uncertainty for attackers.
      • Disable features that return rich information like activation maps.
  • Data poisoning. Bad actors could corrupt training data to distort models. They may attempt to do so by injecting false or biased data into datasets to manipulate model behavior after training. Secure data pipelines end-to-end and perform integrity checks. Organizations can detect potential manipulation attempts and investigate issues by verifying integrity through the following practices:

    • Cryptographic data integrity checks
      • Hash training data and verify that the hashes didn't change before use in training. Detects any manipulation; a hashing sketch appears after this list.
      • Digitally sign key training assets like datasets with private keys for later verification.
    • Anomaly detection
      • Analyze datasets for unusual patterns like sudden distribution shifts. Could flag poisoning attempts.
      • Monitor model performance over time for unexplained accuracy drops, which may indicate data tampering.
    • Tracking data provenance
      • Log full provenance of data assets from origin through pipelines. Supports auditing and debugging.
      • Label datasets with unique identifiers that propagate to downstream assets like models.
      • Detect discrepancies in data lineage through profiling and integrity checks.
  • Membership inference. Models may enable inference of whether a given data point was present in the original training data, based on the model's output for that point. Adding noise to model outputs increases uncertainty and makes these attacks more difficult. You can employ the following techniques to add noise to model outputs and help prevent membership inference attacks (a combined sketch of several of them appears after this list):

    • Differential privacy. Add mathematical noise derived from a random distribution to the model outputs. This practice is similar to how adding noise to audio makes voices unrecognizable. The noise obscures the contribution of individual data points while preserving overall accuracy.
    • Dropout. Randomly drop neural network nodes during training. Doing so creates inherent uncertainty in the model that persists during inference because the model relies less on specific nodes and is less predictable.
    • Prediction clipping. Purposely restrict the model's predicted probabilities to a smaller range, rather than allowing 0% to 100% probabilities. Reduces precision.
    • Randomized response. Flip model predictions to randomly incorrect values with some probability (for example, 10% of the time). Doing so obscures the model's true predictions.
    • Output rounding. Round the model's predicted values to less precise numbers, rather than exact probabilities.
    • Additive/multiplicative noise. Deliberately add small amounts of random noise to the model's predictions. Makes them slightly less precise.

    At first glance, some of these noise techniques seem to work against the quality of the model's output. They reduce precision (prediction clipping and output rounding), obscure the model's true predictions (randomized response), and make predictions slightly noisier (additive/multiplicative noise). So why add all this noise if it degrades what the AI model produces?

    There's certainly a tradeoff between adding noise to protect privacy and preserving the usefulness of the AI model for customers. The goal is to add enough noise to provide strong privacy protection while minimizing the degradation in model quality. Striking that balance is a case-by-case decision: tailor the level of noise to the application and the sensitivity of the data involved. While noise techniques do degrade model quality to some degree, there are a few factors to consider:

    • You can tune the noise to have minimal effect. For many applications, perfect accuracy isn't critical, and small degradations may be acceptable to users. For example, a model predicting the probability that a user selects an ad may normally output any probability between 0% and 100%. With prediction clipping, you could limit the model to only predict probabilities between 40% and 60%. So the model could predict values like 42% or 58%, but not values like 15% or 95% that fall outside the range. Doing so makes the model less precise by restricting or "clipping" its range of possible predictions. But the relative probabilities still provide useful information. For example, assume the system predicts that one user is 52% likely to select an ad and another is 48%. In this case, the first user is still more likely to select the ad, even if you don't know the exact probabilities. So prediction clipping reduces precision, but it can still preserve the relative usefulness of the model's forecasts.
    • The noise improves worst-case privacy protection. While average users see little change, vulnerable populations gain greater privacy.
    • Organizations often use noise techniques to protect highly sensitive data like health records where privacy is paramount. Doing so may warrant some loss of usefulness.
    • Approaches like differential privacy offer mathematical guarantees of privacy protection with provable bounds on the utility loss.
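
The differential privacy technique mentioned above adds calibrated noise to query results. Here's a minimal sketch of the classic Laplace mechanism applied to a count query; the epsilon value, sensitivity, and click data are illustrative assumptions, not a recommended parameterization.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def dp_count(records, epsilon=1.0, sensitivity=1.0):
    """Return a count query result with Laplace noise scaled to sensitivity / epsilon."""
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return len(records) + noise

# Hypothetical per-user records: how many users clicked an ad.
clicks = [1] * 1042
print(dp_count(clicks, epsilon=0.5))  # close to 1042, but not the exact count
```

Smaller epsilon values add more noise and give stronger privacy at the cost of accuracy.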
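
K-anonymity requires that every combination of quasi-identifiers appears at least k times after generalization. A minimal sketch of checking that property with pandas follows; the records, column names, and choice of k are hypothetical.

```python
import pandas as pd

# Hypothetical released records after generalizing location to zip code and age to bands.
released = pd.DataFrame({
    "zip": ["98052", "98052", "98053", "98052", "98053"],
    "age_band": ["25-34", "25-34", "35-44", "25-34", "35-44"],
})

k = 2
group_sizes = released.groupby(["zip", "age_band"]).size()  # rows per quasi-identifier group
print(group_sizes)
print("k-anonymous:", bool((group_sizes >= k).all()))
```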
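
The cryptographic integrity checks listed under data poisoning can be as simple as recording a SHA-256 hash of an approved dataset and verifying it again before training. A minimal sketch, with a placeholder file path:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

dataset = Path("training_data.csv")   # placeholder path
approved_hash = sha256_of(dataset)    # recorded when the dataset is approved

# Later, immediately before training:
if sha256_of(dataset) != approved_hash:
    raise RuntimeError("Training data hash mismatch - possible tampering")
```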
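
Several of the membership inference defenses above (additive noise, prediction clipping, and output rounding) can be combined in a small post-processing step on model outputs. A minimal sketch follows; the clip range, rounding precision, and noise scale are illustrative assumptions you would tune per application.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def harden_predictions(probs, clip_range=(0.4, 0.6), decimals=1, noise_scale=0.02):
    """Apply additive noise, prediction clipping, and output rounding to probabilities."""
    probs = np.asarray(probs, dtype=float)
    probs = probs + rng.normal(0.0, noise_scale, size=probs.shape)  # additive noise
    probs = np.clip(probs, *clip_range)                             # prediction clipping
    return np.round(probs, decimals)                                # output rounding

raw = [0.95, 0.52, 0.48, 0.15]     # hypothetical raw click probabilities
print(harden_predictions(raw))     # e.g. [0.6 0.5 0.5 0.4]
```

The relative ordering of predictions is largely preserved, which is often what downstream consumers actually need.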

Human-centered governance

Alongside technical measures, organizations must embrace responsible governance. They can do so using the following techniques:

  • Conduct impact assessments. Organizations should conduct detailed privacy impact assessments for AI projects to surface potential issues early. These assessments thoroughly evaluate how an organization plans to collect, store, use, and share data to identify risks and guide mitigation strategies.
  • Employ stringent controls on personal information. Organizations should institute safeguards proportional to data sensitivity, such as deidentification for highly personal data. They should apply more stringent controls to sensitive data like health records than to less risky data like product reviews.
  • Implement controls on data usage. Organizations should favor transparency mechanisms and user controls around data use and system behaviors. These controls should clearly communicate how an organization uses data. They should allow users to access, edit, or delete their data, and to opt in to or out of data sharing.
  • Monitor data practices. Organizations should continuously monitor systems and data practices post-deployment to identify emerging risks. They should audit log files, watch for anomalous behavior, and proactively probe for issues.
  • Create engineering protocols. Organizations should establish secure development lifecycle protocols like threat modeling and red teaming. They should conduct code reviews and penetration testing to uncover vulnerabilities.
  • Address data breaches. Organizations should develop procedures to address data breaches responsibly through notification and mitigation. They should analyze incidents thoroughly to prevent recurrence.
  • Implement security teams. Organizations should appoint dedicated data privacy and security teams to provide guidance and oversight across the organization. They should centralize key functions to ensure consistency.

At Microsoft, the Office of Responsible AI oversees the company's policies, audits, best practices, and education around privacy and security. While standards provide guidance, fulfilling the promise of AI requires every AI creator to act conscientiously. Microsoft works diligently to earn user trust through robust privacy protections and security. When applied ethically, AI can empower people, but it's society's duty to use it wisely.

Microsoft safeguards user data and models through multilayered technical, governance, and cultural initiatives. Yet this effort transcends any one institution. Society can achieve AI’s benefits without compromising fundamental rights only by embracing privacy-focused principles.