ADF PII Detection and Masking - Schema Drift Issue

Muhammad Abdulbaqi 20 Reputation points
2024-11-15T19:42:49.95+00:00

I am following the tutorial for setting up a pipeline for PII detection and masking (linked below). All steps have been followed exactly as described, without any changes to the pipeline settings or configurations.

However, I am encountering an issue where the PII entities in my input .txt document are drifting into a separate column instead of being placed into the correct categories. The output does not align with the expected results. Relevant screenshots of the pipeline, including the data preview that shows the schema drift, are attached.

I would like to know:

  1. Are there any customizations needed for the CreateRequestBody or data source configurations within the data flow to correct this issue?
  2. How can the PII entities be correctly categorized and aligned in the output?
  3. After successfully running the pipeline, I am unable to locate the masked document. Where can the masked version of the document be viewed?
  4. Where should the code or settings be customized to apply masking to the output document?

User's image

The image below is the data preview of the request body.

User's image

Azure Data Factory

1 answer

  1. Vinodh247 25,291 Reputation points MVP
    2024-11-17T13:46:20.1833333+00:00

    Hi Muhammad Abdulbaqi,

    Thanks for reaching out to Microsoft Q&A.

    It sounds like you are hitting schema drift in ADF during PII detection and masking: the detected PII entities land in a separate drifted column instead of aligning with the correct categories.

    Here’s a breakdown of steps to address this issue:

    1. Check and Customize the CreateRequestBody
    • The CreateRequestBody step in your data flow should map the input fields to the structure that PII detection expects, so that entities come back under categories such as name, email, and phone number. If the input .txt document has columns that don’t align with the required schema, transform them before processing.
    • Inspect the input column mapping in the source transformation and ensure that fields like "case_month," "res_state," and others match the expected structure and are not contributing to the drift by being mismapped or unconfigured.
    2. Configure Column Mapping for PII Detection
    • In the transformation where PII detection is applied, verify that the output schema aligns with your expected categories.
    • Use a Select or Derived Column transformation to map PII entities explicitly. For instance, map detected PII fields like Patient Name, Phone, etc., to named output columns rather than leaving them to be created dynamically.
    3. Customize Data Source and Output Schema
    • The source schema in your PII detection data flow may require additional configuration to prevent schema drift. For example:
      • Turn the "Allow schema drift" option on or off in the source and sink settings, as your requirements dictate.
      • Use the Projection tab of the source transformation to lock down the schema if you’re noticing unwanted columns in the output.
      • In the sink transformation, explicitly define the target schema so that all columns are correctly aligned.
    4. Locating the Masked Document
    • After running the pipeline, the masked document should be stored in the sink location you specified in the data flow or pipeline settings.
    • Verify the output path configuration in the sink transformation and confirm that ADF has write permissions to the specified destination, such as a Blob Storage container.
    5. Setting Up Masking Configuration
    • For actual masking, make sure the step that follows PII detection applies a masking rule (such as hashing or redaction) to the detected entities. This can often be done with custom expressions in a data flow transformation.
    • Additionally, ensure the output format settings in your sink transformation match how you want the masked data written.
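
To make the masking step concrete, here is a minimal Python sketch of the logic outside ADF. It is not the tutorial's implementation. It assumes each detected entity arrives as a dict with `offset` and `length` fields counted in plain string positions, and the sample text and entities are made up for illustration. Inside a data flow you would express the same idea with a custom expression.

```python
def mask_entities(text, entities, mask_char="*"):
    """Replace each detected PII span in `text` with mask characters.

    Assumption: every entity is a dict with `offset` and `length` keys,
    counted in Python string positions. `mask_char` should be a single
    character so that the text length is preserved.
    """
    chars = list(text)
    for entity in entities:
        start = entity["offset"]
        # Clamp the end so a bad length cannot extend the text.
        end = min(start + entity["length"], len(chars))
        chars[start:end] = mask_char * (end - start)
    return "".join(chars)


# Hand-written sample data, not real service output.
sample_text = "Call Jane Doe at 555-0100."
sample_entities = [
    {"text": "Jane Doe", "category": "Person", "offset": 5, "length": 8},
    {"text": "555-0100", "category": "PhoneNumber", "offset": 17, "length": 8},
]

masked = mask_entities(sample_text, sample_entities)
print(masked)
```

Masking by offset keeps the document the same length, which makes it easy to compare the masked output in the sink against the original input.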

    Summary of Action Steps

    • Check and, if necessary, customize CreateRequestBody.
    • Ensure column mappings in data flow transformations are correctly aligned with PII categories.
    • Review source and sink schema configurations to prevent schema drift.
    • Confirm the sink location and permissions to view the masked document output.
    • Apply masking configurations explicitly in your transformations.
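
As a rough illustration of the first three action steps, the Python sketch below builds a request body and then flattens a response into a fixed set of category columns, so that nothing is left to drift into dynamically created columns. It is only a sketch under assumptions: the body shape is modeled on the Azure AI Language PII request as I understand it and may differ from what the tutorial's CreateRequestBody produces, and `EXPECTED_CATEGORIES`, `build_request_body`, `to_fixed_columns`, and the sample result are hypothetical names and data.

```python
import json

# Hypothetical fixed output columns; adjust to the categories you expect.
EXPECTED_CATEGORIES = ("Person", "PhoneNumber", "Email")


def build_request_body(rows, language="en"):
    """Build a PII-detection request body from (id, text) pairs.

    Assumption: the service expects a `documents` list of id/language/text
    objects; check this against the tutorial's own request body.
    """
    return {
        "kind": "PiiEntityRecognition",
        "analysisInput": {
            "documents": [
                {"id": str(doc_id), "language": language, "text": text}
                for doc_id, text in rows
            ]
        },
    }


def to_fixed_columns(document_result, categories=EXPECTED_CATEGORIES):
    """Flatten one document's entities into one column per expected category.

    Entities with an unexpected category go to a single "Other" column
    instead of creating a new column each time.
    """
    row = {"id": document_result["id"]}
    for category in categories:
        row[category] = []
    row["Other"] = []
    for entity in document_result.get("entities", []):
        target = entity["category"] if entity["category"] in categories else "Other"
        row[target].append(entity["text"])
    return row


# Hand-written sample data, not real service output.
sample_result = {
    "id": "1",
    "entities": [
        {"text": "Jane Doe", "category": "Person"},
        {"text": "555-0100", "category": "PhoneNumber"},
        {"text": "Contoso", "category": "Organization"},
    ],
}

body = build_request_body([(1, "Call Jane Doe at 555-0100.")])
row = to_fixed_columns(sample_result)
print(json.dumps(body, indent=2))
print(row)
```

The point of the fixed column list is the same as defining the schema explicitly in the sink: every output row has the same columns, whether or not a given category was detected.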

    If you still encounter issues after these adjustments, let me know the specifics, and we can dive deeper into the configuration options within your ADF pipeline.

    Please feel free to click the 'Upvote' (Thumbs-up) button and 'Accept as Answer'. This helps the community by allowing others with similar queries to easily find the solution.

