Manual uploads of CSV files to Azure Storage with a control framework and data governance

Sourav 130 Reputation points
2023-08-06T18:26:48.98+00:00

Hello-

We are looking for a solution wherein business users will be uploading files (CSV) manually from on-prem into Azure Data Lake Storage. We want to have control over what is being ingested and ensure no PII data is exposed.

Requirements:

Create a pipeline which picks up files from an on-prem file share location (we don't want users to be able to modify the files or control the process).

Validate the file against pre-defined metadata.

Files can be added/edited/removed easily, but only with process owner approval.

Minimal code changes, if any.

No PII data should be exposed, for example if a user changes a column name in the file or adds new columns that contain PII data.

Thoughts on solution approach:

Pipeline:

  1. Create a pipeline to copy files from the on-prem shared location to Azure Storage and validate the file name against a SharePoint list.
  2. Read the file and validate it against the supplied metadata file (see the sketch after this list for an example metadata definition).
  3. If validation passes, move the file to a target folder; otherwise move it to an error folder and send an alert.
  4. Use Databricks to read the file and update a Delta table.
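
For illustration, here is a minimal sketch of what such a metadata definition could look like and how a validation job might load it. The file name, column names, and the `pii` flag below are assumptions, not an existing standard.

```python
import json

# Hypothetical metadata definition, e.g. a sales_orders.schema.json file kept
# next to the approved-file list. The column names and the "pii" flag are
# illustrative only.
EXAMPLE_METADATA = """
{
  "file_pattern": "sales_orders_*.csv",
  "columns": [
    {"name": "order_id",   "type": "string", "pii": false},
    {"name": "order_date", "type": "date",   "pii": false},
    {"name": "amount",     "type": "double", "pii": false}
  ]
}
"""

def load_expected_columns(text: str) -> dict:
    """Parse a metadata definition and return a lookup of expected columns."""
    meta = json.loads(text)
    return {c["name"]: c for c in meta["columns"]}

if __name__ == "__main__":
    expected = load_expected_columns(EXAMPLE_METADATA)
    print(sorted(expected))  # ['amount', 'order_date', 'order_id']
```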

Control workflow:

  1. Create a SharePoint list with the set of approved files.
  2. Create a workflow to add/remove/update files.

Control pipeline:

  1. Secure an area in the file share and control access using AD groups.
  2. Upload the metadata files here.
  3. Sync this area with Azure Storage via an ADF pipeline.

Question: What are the steps to implement the above solution approach? I already know how to create an ADF instance and a data lake.

Lastly, can we register the whole activity in Purview and create data lineage for it?


1 answer

  1. ShaikMaheer-MSFT 38,521 Reputation points Microsoft Employee
    2023-08-08T06:14:14.2733333+00:00

    Hi Sourav,

    Thank you for posting your query on the Microsoft Q&A platform.

    Here are the steps to implement the solution approach you have described:

    Create an Azure Data Factory (ADF) pipeline to copy files from the on-premises file share to Azure Data Lake Storage (ADLS) and validate the file name against the SharePoint list of approved files. You will need a self-hosted integration runtime so that ADF can reach the on-premises file share. Use the "Copy data" activity to copy the files and the "Lookup" activity (with the SharePoint Online List connector) to validate the file name against the list.
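
    If you ever need to perform the same approved-file check outside ADF (for example from a Databricks notebook instead of the Lookup activity), a rough sketch using the Microsoft Graph endpoint for SharePoint list items could look like this. The site ID, list ID, column name, and token acquisition are placeholders you would supply yourself:

```python
import requests

# Assumptions: you already have an Azure AD access token with permission to
# read the site, and you know the site/list IDs of the approved-files list.
GRAPH_URL = "https://graph.microsoft.com/v1.0"
SITE_ID = "<site-id>"                  # placeholder
LIST_ID = "<approved-files-list-id>"   # placeholder

def approved_file_names(token: str) -> set:
    """Return the file names held in the approved-files SharePoint list."""
    url = f"{GRAPH_URL}/sites/{SITE_ID}/lists/{LIST_ID}/items?expand=fields"
    resp = requests.get(url, headers={"Authorization": f"Bearer {token}"})
    resp.raise_for_status()
    items = resp.json().get("value", [])
    # Assumes the list has a column named 'FileName'; adjust to your list schema.
    return {item["fields"].get("FileName") for item in items}

def is_approved(file_name: str, token: str) -> bool:
    return file_name in approved_file_names(token)
```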

    Read the file and validate it against the supplied metadata file. You can do this in Azure Databricks with a Python or Scala notebook; a minimal sketch follows.
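
    A minimal PySpark sketch of that validation, assuming the expected columns have already been read from the metadata file (the column names and path are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative expected schema; in practice this comes from the metadata file
# that was synced to the storage account.
EXPECTED_COLUMNS = {"order_id", "order_date", "amount"}

def validate_csv(path: str):
    """Check that the incoming CSV has exactly the expected columns."""
    df = spark.read.option("header", "true").csv(path)
    actual = set(df.columns)
    errors = []
    missing = EXPECTED_COLUMNS - actual
    unexpected = actual - EXPECTED_COLUMNS  # renamed or newly added columns
    if missing:
        errors.append(f"missing columns: {sorted(missing)}")
    if unexpected:
        # Reject unexpected columns outright so a user cannot introduce PII
        # by renaming a column or adding a new one.
        errors.append(f"unexpected columns: {sorted(unexpected)}")
    return (not errors), errors
```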

    If the file passes validation, move it to a target folder in ADLS; if it fails, move it to an error folder and send an alert. ADF has no single built-in move activity, so in ADF you would combine a Copy activity with a Delete activity, or you can do the move from the Databricks notebook as sketched below. You can use Azure Logic Apps to send the alert.
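
    If you choose to do the move from the Databricks notebook, the routing step could look roughly like this; the container and folder names are examples, and `dbutils` is the file-system utility available inside Databricks notebooks:

```python
# Example paths; replace <account> and the container/folder names with your own.
LANDING = "abfss://landing@<account>.dfs.core.windows.net/incoming/"
TARGET  = "abfss://curated@<account>.dfs.core.windows.net/validated/"
ERROR   = "abfss://curated@<account>.dfs.core.windows.net/error/"

def route_file(file_name: str, is_valid: bool) -> str:
    """Move a landed file to the validated or error folder and return its new path."""
    destination = (TARGET if is_valid else ERROR) + file_name
    # dbutils is provided by the Databricks runtime; outside Databricks you
    # would use the azure-storage-file-datalake SDK instead.
    dbutils.fs.mv(LANDING + file_name, destination)
    return destination
```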

    Read the validated file in Azure Databricks and merge it into the Delta table; a sketch of the merge is shown below.
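
    A sketch of the merge with the Delta Lake Python API; the table name, merge key, and file path are assumptions to adapt:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative file and table; replace with your validated file and Delta table.
validated_df = (
    spark.read.option("header", "true")
    .csv("abfss://curated@<account>.dfs.core.windows.net/validated/sales_orders_20230801.csv")
)

target = DeltaTable.forName(spark, "bronze.sales_orders")
(
    target.alias("t")
    .merge(validated_df.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```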

    Create a SharePoint list of approved files and a workflow to add/remove/update entries. You can build the approval workflow with Power Automate (SharePoint Designer workflows are deprecated for SharePoint Online).

    Secure an area in the file share and control it using AD groups. You can use Active Directory to control access to the file share.

    Upload the metadata files to the secured area in the file share and sync them to ADLS via an ADF pipeline. You can use the "Copy data" activity (again over the self-hosted integration runtime) to copy the metadata files from the file share to ADLS.
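
    If you ever need to push a metadata file into ADLS outside the ADF pipeline (for example from an admin script after approval), a sketch with the azure-storage-file-datalake SDK could look like this; the account URL, container, and paths are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholders: replace with your storage account and metadata container.
ACCOUNT_URL = "https://<account>.dfs.core.windows.net"
FILESYSTEM = "metadata"

def upload_metadata(local_path: str, remote_path: str) -> None:
    """Upload one metadata file to the metadata container in ADLS Gen2."""
    service = DataLakeServiceClient(ACCOUNT_URL, credential=DefaultAzureCredential())
    file_client = service.get_file_system_client(FILESYSTEM).get_file_client(remote_path)
    with open(local_path, "rb") as data:
        file_client.upload_data(data, overwrite=True)

# Example: upload_metadata("sales_orders.schema.json", "schemas/sales_orders.schema.json")
```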

    Register and create data lineage for the whole activity in Microsoft Purview. Once the Data Factory is connected to your Purview account, lineage for its Copy activities is pushed automatically; the Databricks job and the Delta table can be registered and linked with custom lineage, as sketched below.
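
    One option for the custom lineage is the open-source pyapacheatlas library, roughly as below. The type names, qualified names, and credentials are assumptions to adapt to your Purview account:

```python
from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core import PurviewClient, AtlasEntity, AtlasProcess

# Placeholders: a service principal with rights to write to the Purview catalog.
auth = ServicePrincipalAuthentication(
    tenant_id="<tenant-id>", client_id="<client-id>", client_secret="<secret>"
)
client = PurviewClient(account_name="<purview-account>", authentication=auth)

# Input (validated CSV) and output (Delta table) entities; names are examples.
source = AtlasEntity(
    name="sales_orders_20230801.csv",
    typeName="azure_datalake_gen2_path",
    qualified_name="https://<account>.dfs.core.windows.net/curated/validated/sales_orders_20230801.csv",
    guid=-1,
)
target = AtlasEntity(
    name="bronze.sales_orders",
    typeName="azure_datalake_gen2_path",  # assumption: use whichever type your Purview account holds for the Delta table
    qualified_name="https://<account>.dfs.core.windows.net/delta/bronze/sales_orders",
    guid=-2,
)
process = AtlasProcess(
    name="validate_and_merge_sales_orders",
    typeName="Process",
    qualified_name="databricks://<workspace>/jobs/validate_and_merge_sales_orders",
    guid=-3,
    inputs=[source],
    outputs=[target],
)
client.upload_entities(batch=[source, target, process])
```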

    To summarize, the solution approach involves using ADF pipeline to copy files from on-premises file share to ADLS, using Databricks to validate the files against metadata, using Delta Lake to store the validated files, using SharePoint to manage the list of approved files, using AD to control access to the file share, and using Azure Purview to register and create data lineage for the whole activity.

    Hope this helps. Please let me know if you have any further queries. Thank you.

