How can we preprocess file content (PII entity anonymization) before copying it to an Azure Blob container?

Mansi Yadav 40 Reputation points

We have created pipelines that copy files from SharePoint to an ADLS container, and we have deployed a web app that uses Azure OpenAI GPT-35-Turbo with an Azure Search index as its data source. We now want to anonymize some PII that appears in the responses. I first tried the PII detection skillset, but the results were not suitable: it masks each character with *, whereas we want to replace each entity with a generic term. For example:

Original text: Amazon and Flipkart are leading e-commerce companies.

Anonymized text: Org1 and Org2 are leading e-commerce companies.
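
To make the desired output concrete, here is a minimal Python sketch of the replacement step only. It assumes some detector has already returned non-overlapping entity spans as offset, length, and category; the `Entity` class, the `PREFIXES` mapping, and the `pseudonymize` function are hypothetical names for illustration and are not part of any Azure or Presidio API.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    """A detected entity span (hypothetical shape, not tied to a specific service)."""
    offset: int    # start index of the entity in the text
    length: int    # number of characters in the entity
    category: str  # e.g. "Organization"


# Hypothetical mapping from entity category to placeholder prefix.
PREFIXES = {"Organization": "Org", "Person": "Person"}


def pseudonymize(text: str, entities: list[Entity]) -> str:
    """Replace each entity span with a numbered generic term such as Org1."""
    labels: dict[tuple[str, str], str] = {}  # (category, surface text) -> placeholder
    counters: dict[str, int] = {}            # running count per category

    # Assign placeholders in reading order so numbering follows the text.
    # The same surface text (case-insensitive) always gets the same placeholder.
    ordered = sorted(entities, key=lambda e: e.offset)
    for e in ordered:
        surface = text[e.offset:e.offset + e.length]
        key = (e.category, surface.lower())
        if key not in labels:
            counters[e.category] = counters.get(e.category, 0) + 1
            prefix = PREFIXES.get(e.category, e.category)
            labels[key] = f"{prefix}{counters[e.category]}"

    # Replace from the end of the text so earlier offsets stay valid.
    result = text
    for e in reversed(ordered):
        surface = text[e.offset:e.offset + e.length]
        label = labels[(e.category, surface.lower())]
        result = result[:e.offset] + label + result[e.offset + e.length:]
    return result


text = "Amazon and Flipkart are leading e-commerce companies."
entities = [Entity(0, 6, "Organization"), Entity(11, 8, "Organization")]
print(pseudonymize(text, entities))
# prints: Org1 and Org2 are leading e-commerce companies.
```

The spans are hard-coded here; in practice they would come from whichever entity detector is used, and the sketch does not handle overlapping spans.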

I also tried Microsoft Presidio, but it was not able to detect organizations, and it is difficult to apply to the OpenAI response and citations. So I think applying PII detection to each whole document and then indexing it would be a better approach.

Please let me know if there is a way to achieve this before copying the files, for example with Data Flow.

Azure Data Factory
