Reprocess Dataverse data lake previous date folders

Karthik Eetur 21 Reputation points
2023-08-15T02:40:59.0533333+00:00

I've used the "copy Dataverse to SQL" pipeline (dataflow) below, suggested by MS.

https://learn.microsoft.com/en-us/power-apps/maker/data-platform/azure-synapse-link-pipelines?tabs=synapse-analytics

I am running Synapse pipelines every 15 minutes to copy data from incremental folders to a SQL database.

I have a scenario where the main dataflow pipeline runs for more than 15 minutes, and the orchestrator pipeline skips (status = 3) all subsequent pipeline runs until the in-progress run completes. I then have to manually clear the ProcessingLog table entries with status <> 1 before the orchestrator pipeline will pick up new folders for processing.

How do I reprocess the missed previous datetime folders using the same MS flow?

I don't think the same MS flow works for reprocessing previous datetime folders once new folders have been ingested into the database and maxrowversion is already the latest one. For example: reprocessing 20-30 old folders to catch up on data.

Appreciate any help here

Azure Synapse Analytics

Answer accepted by question author
  1. KranthiPakala-MSFT 46,737 Reputation points Microsoft Employee Moderator
    2023-08-15T22:42:13.19+00:00

    @Karthik Eetur Welcome to Microsoft Q&A forum and thanks for reaching out here.

    When the Orchestrator pipeline encounters an unexpected error and fails for a given folder (e.g., Incremental folder 1), DataverseToSQLPipelineProcessingLog records a 0 (failure) for that folder. Subsequent folders (e.g., Incremental folder 2) are then marked 3 (skipped) rather than successful (1). Because the previous folder is not successful (status code <> 1), all subsequent folders keep being skipped (status code = 3). To break out of this skipping loop, you may follow the resolution steps below, suggested by the product team:

    1. Identify and resolve the root cause of the pipeline failure.
    2. Manually run the DataverseToSQL pipeline for the failed folder.
    3. After the manual run completes successfully, update the corresponding row in DataverseToSQLPipelineProcessingLog to 1 (success).
    4. Chronologically, sequentially, and manually process the subsequent skipped (3) folders, and update the Status column of the corresponding DataverseToSQLPipelineProcessingLog rows to 1 (success).
    5. Once the failed folder and all the skipped folders are marked successful, the DataverseToSQL_Orchestrator pipeline will automatically process the next folder on the next trigger.
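    The log updates in steps 3 and 4 can be sketched as follows, using an in-memory SQLite table as a hypothetical, simplified stand-in for DataverseToSQLPipelineProcessingLog (the column names and folder-name format are assumptions; adapt the UPDATE/SELECT to your actual schema):

```python
import sqlite3

# Hypothetical, simplified version of the DataverseToSQLPipelineProcessingLog
# table. Status codes follow the answer above: 1 = success, 0 = failed, 3 = skipped.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE DataverseToSQLPipelineProcessingLog (
        FolderName TEXT PRIMARY KEY,
        Status     INTEGER
    )
""")
conn.executemany(
    "INSERT INTO DataverseToSQLPipelineProcessingLog VALUES (?, ?)",
    [("2023-08-14T10.00.00Z", 1),   # processed successfully
     ("2023-08-14T10.15.00Z", 0),   # failed run (root cause now fixed, rerun manually)
     ("2023-08-14T10.30.00Z", 3),   # skipped because the previous folder failed
     ("2023-08-14T10.45.00Z", 3)],  # skipped
)

# Step 3: after manually rerunning DataverseToSQL for the failed folder,
# mark that row as successful so the orchestrator no longer sees a failure.
conn.execute(
    "UPDATE DataverseToSQLPipelineProcessingLog SET Status = 1 WHERE FolderName = ?",
    ("2023-08-14T10.15.00Z",),
)

# Step 4: list the skipped folders in chronological order so they can be
# reprocessed one by one and marked successful afterwards.
skipped = [row[0] for row in conn.execute(
    "SELECT FolderName FROM DataverseToSQLPipelineProcessingLog "
    "WHERE Status = 3 ORDER BY FolderName"
)]
print(skipped)
```

    In the real table you would run the equivalent UPDATE/SELECT against your SQL database rather than SQLite.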

    You may ask: there are several skipped folders, and manually executing them and updating DataverseToSQLPipelineProcessingLog is time consuming and error prone. Is there a better approach?

    Recommendation for automated reprocessing:

    Yes. In that case, to avoid manual effort, please consider a new pipeline which:

    1. Retrieves the skipped folders chronologically.
    2. Executes the DataverseToSQL pipeline sequentially.
    3. Updates the Status column of the corresponding DataverseToSQLPipelineProcessingLog rows to 1 (success).

    Below is a sample view of what such a pipeline looks like for processing skipped folders:

    [Screenshots of the sample reprocessing pipeline were attached in the original answer but are not reproduced here.]
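    The three steps above can be sketched in Python. This is a minimal illustration, not the actual Synapse pipeline: run_dataverse_to_sql is a hypothetical stand-in for an Execute Pipeline activity inside a sequential ForEach, and the ProcessingLog table and columns are simplified assumptions.

```python
import sqlite3

# Hypothetical stand-in for triggering the DataverseToSQL pipeline for one
# folder; in Synapse this would be an Execute Pipeline activity (or a REST
# call) inside a ForEach with sequential execution enabled.
def run_dataverse_to_sql(folder: str) -> bool:
    print(f"reprocessing {folder}")
    return True  # assume the rerun succeeds

# Simplified, assumed schema for the processing log (status 1 = success, 3 = skipped).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ProcessingLog (FolderName TEXT PRIMARY KEY, Status INTEGER)")
conn.executemany("INSERT INTO ProcessingLog VALUES (?, ?)", [
    ("2023-08-14T10.00.00Z", 1),
    ("2023-08-14T10.15.00Z", 3),
    ("2023-08-14T10.30.00Z", 3),
])

# 1. Retrieve the skipped folders chronologically.
skipped = [r[0] for r in conn.execute(
    "SELECT FolderName FROM ProcessingLog WHERE Status = 3 ORDER BY FolderName")]

# 2-3. Execute the pipeline sequentially and, on success, mark each folder as 1.
for folder in skipped:
    if run_dataverse_to_sql(folder):
        conn.execute("UPDATE ProcessingLog SET Status = 1 WHERE FolderName = ?", (folder,))

remaining = conn.execute(
    "SELECT COUNT(*) FROM ProcessingLog WHERE Status <> 1").fetchone()[0]
print(remaining)
```

    Once every row is back to 1, the orchestrator pipeline resumes processing new folders on its next trigger, as described in step 5 above.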

    NOTE: To avoid overlapping execution times, it is necessary to set concurrency = 1 on your orchestrator pipeline. That way only one execution is in progress at a time, and subsequent pipeline runs are queued and executed in chronological order. Ensuring the orchestrator pipeline's Concurrency setting is 1 will prevent this scenario in the future.

    I would also recommend reading this thread, where I had a detailed discussion with another user on this implementation: Copy Dataverse data into Azure SQL using Synapse Link not running the DaterverseToSQL step

    Hope this info helps.


    Please don’t forget to Accept Answer and Yes for "was this answer helpful" wherever the information provided helps you, this can be beneficial to other community members.

