Best approach to process data in someone else's Azure blob container

javier cohenar 191 Reputation points
2021-12-03T22:58:38.123+00:00

We have a client that doesn't want to share sensitive sales data with our company.

However, they are interested in having our algorithm process their data. We would like to be able to process their data without giving them access to our algorithm.

What is the best approach to do that? Just looking for a general idea on what the best practice would be and any related resources to learn from.

Our algorithm currently runs on ADF using Mapping Data Flows, a Spark cluster, and Databricks transformations.


Accepted answer
    MartinJaffer-MSFT 26,061 Reputation points
    2021-12-06T19:13:53.773+00:00

    Hello @javier cohenar and welcome to Microsoft Q&A.

    It sounds like both of you have assets you do not want revealed to the other: their sensitive data, and your algorithm/process.

    When both sides belong to the same party, the process would simply pull the data. In your case, however, I think it makes more sense to build an application / endpoint where the client either uploads their data file to your storage, or provides you with a SAS token to read their data. The benefit of this approach is that it puts the client in control of exactly which data files are shared, and it keeps the processing entirely on your side. Event Grid or a Blob trigger can be used to start the algorithm when the token is provided or the blob is uploaded.
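    As a rough illustration (a sketch, not a prescription), here is how your side could read a blob from a SAS URL the client hands you, using the azure-storage-blob Python SDK. The account, container, and blob names are hypothetical placeholders.

    ```python
    # pip install azure-storage-blob
    from azure.storage.blob import BlobClient

    # Hypothetical SAS URL supplied by the client. The query string after '?'
    # is the SAS token, which grants time-limited read access without ever
    # exposing the client's storage account key.
    sas_url = (
        "https://clientaccount.blob.core.windows.net/sales-data/2021/sales.csv"
        "?sv=2020-08-04&se=2021-12-31T00%3A00%3A00Z&sr=b&sp=r&sig=..."
    )

    # BlobClient can be built directly from the full SAS URL.
    blob_client = BlobClient.from_blob_url(sas_url)

    # Pull the client's data into memory for processing.
    data = blob_client.download_blob().readall()
    print(f"Downloaded {len(data)} bytes")
    ```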

    If the sensitive information is also incorporated into the data files, then you may need data masking. Data masking hides some portion of the data, and may be worth a separate conversation. See: Masking in Databricks, Masking in Data Factory, and SQL Dynamic Masking.
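    As a minimal masking sketch in Databricks, assuming a Spark DataFrame with hypothetical customer_id and email columns (spark is the session Databricks notebooks provide):

    ```python
    from pyspark.sql import functions as F

    # Hypothetical input: a DataFrame loaded from the client's files.
    df = spark.read.csv("/mnt/client-data/sales.csv", header=True)

    masked = (
        df
        # One-way hash the identifier: records stay joinable but unreadable.
        .withColumn("customer_id", F.sha2(F.col("customer_id").cast("string"), 256))
        # Redact everything before the '@' in the email address.
        .withColumn("email", F.regexp_replace("email", r"^[^@]+", "***"))
    )
    masked.show(truncate=False)
    ```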

    If the client uploads data to your storage, you should consider which policies or regulations you must comply with regarding data privacy, data retention, etc. Also, depending on the volume of data, uploading might not be the best choice.
    A SAS token can be used to access more than a single file; depending on how it is set up, it can cover an entire folder or container. A SAS token is also time-bound, which gives the client more control.
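    For illustration, this is roughly what the client could run with their own account key to mint a read-only, time-bound SAS for a whole container. The account name, key, and container name are placeholders.

    ```python
    # pip install azure-storage-blob
    from datetime import datetime, timedelta
    from azure.storage.blob import ContainerSasPermissions, generate_container_sas

    # Hypothetical values: the client fills these in on their side.
    account_name = "clientaccount"
    account_key = "<client-account-key>"
    container_name = "sales-data"

    # Read + list access to the whole container, expiring in 24 hours.
    sas_token = generate_container_sas(
        account_name=account_name,
        container_name=container_name,
        account_key=account_key,
        permission=ContainerSasPermissions(read=True, list=True),
        expiry=datetime.utcnow() + timedelta(hours=24),
    )

    sas_url = f"https://{account_name}.blob.core.windows.net/{container_name}?{sas_token}"
    print(sas_url)  # This URL is what the client would hand to you.
    ```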

    There is also the question of where to store, or how to make available, the output of your algorithm. It could live in your blob container, with the client given a SAS token to retrieve it from your account. Or the client could provide a SAS token for you to write the output into their container. Similar policy questions apply here.
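    Continuing the sketch, writing the output back into the client's container through a write-enabled SAS URL could look like this (again, all names are hypothetical):

    ```python
    from azure.storage.blob import BlobClient

    # Hypothetical SAS URL from the client, this time minted with create/write
    # permission (sp=cw) instead of read-only.
    output_sas_url = (
        "https://clientaccount.blob.core.windows.net/results/scored-sales.csv"
        "?sv=2020-08-04&se=2021-12-31T00%3A00%3A00Z&sr=b&sp=cw&sig=..."
    )

    blob_client = BlobClient.from_blob_url(output_sas_url)

    with open("scored-sales.csv", "rb") as f:
        # overwrite=True lets repeated runs replace the previous output.
        blob_client.upload_blob(f, overwrite=True)
    ```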

    Another service to look into is Azure Data Share. I do not know whether it would be appropriate; that depends on the sensitivity of the data.

    These are high-level ideas, intended as inspiration toward your own solution rather than a prescriptive answer. More ideas can be found at the Azure Architecture Center.

