How can we preprocess file content (anonymize PII entities) before copying it to an Azure Blob container?

Mansi Yadav 40 Reputation points
2024-06-07T09:27:01.1066667+00:00

We have created pipelines to copy files from SharePoint to an ADLS container. We have deployed a web app that uses Azure OpenAI GPT-35-Turbo with an Azure search index as its data source. Now we want to anonymize some PII that appears in the responses. Initially I tried the PII detection skillset, but the results were not what we need: it masks each character with *, whereas we want to replace each entity with a generic term. For example:

Original text: Amazon and Flipkart are leading e-commerce companies.

Anonymized text: Org1 and Org2 are leading e-commerce companies.

I tried Microsoft Presidio, but it was not able to detect organizations, and it is also difficult to apply it to the OpenAI response and citations. So I think applying PII detection to the whole document and then indexing it would be a better approach.

Please let me know if there is a way to achieve this before copying the files, for example with Data Flow.

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

1 answer

  1. Harishga 5,985 Reputation points Microsoft Vendor
    2024-06-10T06:06:19.4433333+00:00

    Hi @Mansi Yadav
    Welcome to Microsoft Q&A platform and thanks for posting your question here.

    To anonymize PII in the web application's responses, you can build a custom solution that combines Azure services with your own logic.

    To implement the custom solution, you can follow these steps:

    Firstly, you need to develop a custom anonymization module that can integrate with Azure OpenAI GPT-3.5-Turbo’s response. This module should use regular expressions or machine learning models to detect PII and replace detected PII with generic terms.

    For example, names could be replaced with “Name” and organizations with “Organization”.
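
    The replacement step described above can be sketched in plain Python. This is only a minimal illustration, not a complete PII detector: the `PATTERNS` table, the `KNOWN_ORGS` list, and the `anonymize` function are hypothetical names introduced here, and a real module would need much broader detection. It numbers the placeholders (Org1, Org2) as in the question, and reuses the same placeholder when a value repeats.

```python
import re
from collections import defaultdict

# Hypothetical detection rules, for illustration only.
PATTERNS = {
    "Email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
}
KNOWN_ORGS = ["Amazon", "Flipkart"]  # assumed deny list of organization names


def anonymize(text, known_orgs=KNOWN_ORGS):
    """Replace detected values with numbered generic terms (Org1, Email1, ...)."""
    mapping = {}                 # (label, lowercased value) -> placeholder
    counters = defaultdict(int)  # label -> how many placeholders issued so far

    def placeholder(label, value):
        key = (label, value.lower())
        if key not in mapping:
            counters[label] += 1
            mapping[key] = f"{label}{counters[label]}"
        return mapping[key]

    # Regex-based entities first.
    for label, pattern in PATTERNS.items():
        text = pattern.sub(
            lambda m, label=label: placeholder(label, m.group(0)), text
        )

    # Deny-list organizations; longer names first so they win over shorter ones.
    if known_orgs:
        names = sorted(known_orgs, key=len, reverse=True)
        org_pattern = re.compile(
            r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
            re.IGNORECASE,
        )
        text = org_pattern.sub(lambda m: placeholder("Org", m.group(1)), text)

    return text


print(anonymize("Amazon and Flipkart are leading e-commerce companies."))
```

    A deny list only catches names you already know, which is why the next step suggests a trained entity recognizer for organizations.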

    Secondly, you can enhance the organization detection by training a custom Named Entity Recognition model that can better identify organizational names. This model can be trained on a dataset that includes the types of organizations relevant to your domain. This will help in accurately detecting and anonymizing organizational names.
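
    Once a recognizer returns entity spans, the replacement itself is independent of which model produced them. The sketch below assumes each entity is a dict with `category`, `offset`, and `length` keys, and that offsets are Python string indices; that shape and the `replace_entities` helper are assumptions made for illustration, so adapt them to whatever your recognizer actually returns.

```python
def replace_entities(text, entities, labels=None):
    """Swap recognized spans for numbered generic terms.

    `entities` is an assumed shape: dicts with 'category', 'offset', 'length'.
    """
    labels = labels or {"Organization": "Org", "Person": "Name"}
    counters, mapping, planned = {}, {}, []

    # Pass 1: walk spans left to right so numbering follows reading order.
    for ent in sorted(entities, key=lambda e: e["offset"]):
        label = labels.get(ent["category"])
        if label is None:
            continue  # category we do not anonymize
        start, end = ent["offset"], ent["offset"] + ent["length"]
        key = (label, text[start:end].lower())
        if key not in mapping:
            counters[label] = counters.get(label, 0) + 1
            mapping[key] = f"{label}{counters[label]}"
        planned.append((start, end, mapping[key]))

    # Pass 2: apply right to left so earlier offsets stay valid.
    for start, end, token in reversed(planned):
        text = text[:start] + token + text[end:]
    return text


sample = "Amazon and Flipkart are leading e-commerce companies."
spans = [
    {"category": "Organization", "offset": 0, "length": 6},
    {"category": "Organization", "offset": 11, "length": 8},
]
print(replace_entities(sample, spans))
```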

    Thirdly, you can integrate the anonymized data with Azure Search Index. Once the PII is anonymized, the data can be indexed using Azure Cognitive Search. Make sure that the indexing pipeline includes the custom anonymization logic. This will enable users to search for data without exposing sensitive information.

    Fourthly, you can utilize Azure Data Factory’s Data Flow to automate the process of copying the anonymized files. The Data Flow can be configured to ingest documents from the source, apply the custom anonymization logic, index the anonymized documents, and copy the indexed documents to the destination. This will ensure that the anonymization process is automated and efficient.
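
    As a rough local sketch of the "anonymize, then copy" order, the snippet below reads text files from a source folder, anonymizes them, and writes the results to a staging folder that the copy step would then pick up. It assumes the anonymization runs as custom code invoked by the pipeline rather than inside the Data Flow itself, which the original answer does not specify. The `anonymize` stub, the `stage_anonymized_files` helper, and the folder layout are all hypothetical.

```python
from pathlib import Path


def anonymize(text: str) -> str:
    # Stand-in rule for illustration; plug in the real detection logic here.
    return text.replace("Amazon", "Org1").replace("Flipkart", "Org2")


def stage_anonymized_files(source_dir, staging_dir, pattern="*.txt"):
    """Write an anonymized copy of each matching file into staging_dir."""
    source, staging = Path(source_dir), Path(staging_dir)
    staging.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(source.glob(pattern)):
        cleaned = anonymize(path.read_text(encoding="utf-8"))
        target = staging / path.name
        target.write_text(cleaned, encoding="utf-8")
        written.append(target)
    return written
```

    Only the staging folder would then be copied to the container and indexed, so the raw text never reaches the search index.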

    Finally, it is important to test the custom solution with various types of documents to ensure that all PII is accurately detected and anonymized. Iterate on the model and logic based on the test results. Additionally, ensure that the solution complies with relevant data protection regulations and that the anonymization process is secure. This will help in maintaining the privacy and security of the data.
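
    One simple way to test the output is to check anonymized text against a list of values that must never appear. The `find_leaks` helper and the term list below are hypothetical, and a check like this only catches known values; it does not prove that all PII was removed.

```python
import re


def find_leaks(anonymized_text, sensitive_terms):
    """Return the sensitive terms that still appear in the anonymized text."""
    leaks = []
    for term in sensitive_terms:
        # Whole-word, case-insensitive match.
        if re.search(r"\b" + re.escape(term) + r"\b", anonymized_text, re.IGNORECASE):
            leaks.append(term)
    return leaks


terms = ["Amazon", "Flipkart"]
print(find_leaks("Org1 and Org2 are leading e-commerce companies.", terms))
print(find_leaks("Org1 and flipkart are competitors.", terms))
```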

    Reference:
    https://learn.microsoft.com/en-us/azure/ai-services/openai/tutorials/fine-tune?tabs=python-new%2Ccommand-line

    https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/fine-tuning?tabs=turbo%2Cpython-new&pivots=programming-language-studio

    https://learn.microsoft.com/en-us/azure/ai-services/language-service/custom-named-entity-recognition/overview

    https://github.com/microsoft/presidio

    https://learn.microsoft.com/en-us/azure/search/search-how-to-create-search-index?tabs=portal

    I hope this information helps you. Let me know if you have any further questions or concerns.
