Data Factory Scheduling

Questions often arise about how to correctly schedule Azure Data Factory pipelines. In this post we will look at some of the possible options and how they affect your scheduling.

The first two options revolve around changing when the datasets become available for a pipeline to run. A pipeline may be set to run daily, but it will only process once the source dataset signals that its data is ready. The third option delays the pipeline so that it runs only at a specific time within the day.

Option 1 – Set the “offset” on the dataset

         "availability": {
            "frequency": "Day",
            "interval": 1,
            "offset": "06:00:00"
         }

Here the system is being instructed that the data will be available every day at 6 in the morning.
This would give a slice pattern as follows:

 29/04/2017 06:00 AM UTC - 30/04/2017 06:00 AM UTC
 28/04/2017 06:00 AM UTC - 29/04/2017 06:00 AM UTC
 27/04/2017 06:00 AM UTC - 28/04/2017 06:00 AM UTC

As the frequency is daily, each slice also ends at 6 in the morning on the following day.
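The effect of the offset can be illustrated with a short Python sketch (purely illustrative, not Data Factory code) that generates daily slice boundaries shifted from midnight by the offset:

```python
from datetime import datetime, timedelta

def daily_slices_with_offset(window_start, window_end, offset):
    """List (start, end) boundaries for daily slices shifted from
    midnight by `offset`, within the given window."""
    day = datetime(window_start.year, window_start.month, window_start.day)
    start = day + offset
    if start < window_start:
        start += timedelta(days=1)   # first boundary at or after the window start
    slices = []
    while start + timedelta(days=1) <= window_end:
        slices.append((start, start + timedelta(days=1)))
        start += timedelta(days=1)
    return slices

for s, e in daily_slices_with_offset(datetime(2017, 4, 27, 6),
                                     datetime(2017, 4, 30, 6),
                                     timedelta(hours=6)):
    print(s, "-", e)   # three slices, each 06:00 to 06:00 the next day
```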

Option 2 – Set the “anchorDateTime” and “interval” on the dataset

This option on the dataset sets an anchor point for slice start times, together with the interval between slices.

         "availability": {
            "frequency": "Hour",
            "interval": 18,
            "anchorDateTime": "2001-01-01T06:00:00Z"
         }

This is often misunderstood to give a daily slice starting at 6 in the morning. However, the reality is that slice boundaries fall at the anchorDateTime plus whole multiples of the interval. With an 18-hour interval the slices do not repeat at the same clock times each day; instead each slice lasts 18 hours and the boundaries drift by six hours from one slice to the next, as in the following slice timings:

 29/04/2017 06:00 PM UTC - 30/04/2017 12:00 PM UTC
 29/04/2017 12:00 AM UTC - 29/04/2017 06:00 PM UTC
 28/04/2017 06:00 AM UTC - 29/04/2017 12:00 AM UTC
 27/04/2017 12:00 PM UTC - 28/04/2017 06:00 AM UTC
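The drift can be worked out with a short Python sketch (purely illustrative, not Data Factory code): boundaries fall at the anchorDateTime plus whole multiples of the interval, so an 18-hour interval walks through the clock rather than repeating daily.

```python
from datetime import datetime, timedelta

def slices_from_anchor(anchor, interval, window_start, window_end):
    """Slice boundaries fall at anchor + k * interval; return the
    slices that sit fully inside the requested window."""
    k = (window_start - anchor) // interval      # whole intervals elapsed
    start = anchor + k * interval
    if start < window_start:
        start += interval
    slices = []
    while start + interval <= window_end:
        slices.append((start, start + interval))
        start += interval
    return slices

for s, e in slices_from_anchor(datetime(2001, 1, 1, 6),
                               timedelta(hours=18),
                               datetime(2017, 4, 27, 0),
                               datetime(2017, 4, 30, 12)):
    print(s, "-", e)   # four 18-hour slices whose boundaries drift by 6 hours
```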

Option 3 – Set the “delay” on the pipeline

With a dataset and pipeline set to daily (starting and ending at midnight), a policy may be applied to an activity that delays processing by an amount of time. In this example the pipeline will pause execution of the data movement until 6 in the morning, provided the dataset is available at that time (which it will be if it has been set to a daily slice).

         "policy": {
                    "timeout": "1.00:00:00",
                    "delay": "06:00:00",
                    "concurrency": 1,
                    "executionPriorityOrder": "NewestFirst",
                    "style": "StartOfInterval",
                    "retry": 3,
                    "longRetry": 0,
                    "longRetryInterval": "00:00:00"
         }

This way the slice remains a full-day slice but only executes at 6 in the morning.
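This behaviour can be sketched in Python (purely illustrative, not Data Factory code; it assumes the "StartOfInterval" behaviour described above, where a slice would otherwise run at its start time):

```python
from datetime import datetime, timedelta

# With "style": "StartOfInterval" a daily slice would normally run at its
# start time (midnight); "delay" pushes the run back without moving the
# slice boundaries themselves.
def scheduled_run_time(slice_start, delay):
    return slice_start + delay

slice_start = datetime(2017, 4, 29, 0, 0)            # slice covers the full day
run_at = scheduled_run_time(slice_start, timedelta(hours=6))
print(run_at)                                        # runs at 06:00 that morning
```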

Using these three options you can choose to “offset” the slices so the whole slice is shifted by a number of hours; to “anchorDateTime” the start time and then cut the time between one slice and the next with an “interval”; or to “delay” the processing of a pipeline. Used independently or in combination where appropriate, these three options give a very tuneable way to manage when and how often your data is processed.