Automatic Restart of SSIS packages after Failover or Failure
SSIS makes it possible to minimize work lost by starting after the last successfully executed control flow task upon restarting a failed package. From there, one challenge many seek to overcome is automatically restarting that package at that point without any manual intervention. This would make it possible for your SSIS packages to be running, your cluster to failover, and your SSIS packages to restart and complete without you having to respond to the failover and restart them.
This is possible if you set it up correctly. However; clustering the SSIS service will not accomplish this. This has nothing to do with SSIS not being cluster aware. After all, SQL Server is fully cluster aware, but it does not automatically restart queries after a failover – the queries must be submitted again. It is the responsibility of the developer to detect the failure and retry queries in SQL Server. Likewise, in SSIS, it is the responsibility of the developer to detect a failover, and restart any packages that were executing.
In this entry, I will show one way a developer or administrator can do this automatically. This solution takes advantage of SSIS behaviors with checkpoint files, SSIS features, and features of SQL Agent scheduling. The article assumes a working knowledge of SSIS and SQL Agent.
How does it work?
Suppose you have a package executing at its normally scheduled time. This package is set up to use checkpoint files and to write a checkpoint at each control flow task. Checkpoint usages is set to “if Exists.”
During the execution of this package, your server experiences a failure. At this point, the checkpoint file is in place. The fact that the checkpoint file is in place tells you that either the package is currently executing, or else on its last execution, it did not complete successfully. A cluster failover is a process of services failing, or being brought down on one node, then starting on another node. A failure on a standalone might be a crash of the operating system followed by a reboot. In either case, SQL Agent will stop, then restart.
When an SSIS package uses a checkpoint file, it must first acquire a lock on that checkpoint file. If another instance of the SSIS package is running, that other instance will have a lock on the checkpoint file already. In this case, when the new instance of the package tries to start, it will be unable to acquire a lock on the checkpoint file, and will fail immediately with no work being done. This prevents two instances of a package stepping on each other by simultaneous execution.
The key to making the SSIS package restart is scheduling a job in SQL Agent to run at startup. This job will check for the existence of the checkpoint file. If the checkpoint file does not exist, then we can assume that the SSIS package was not executing at the time of the failure, and take no further action. If the checkpoint file exists, then we know that the package is either executing, or its last execution was unsuccessful (presumably due to the failover or server failure). So if the checkpoint file is found, the job executing at startup will restart the package. The package will read its checkpoint file and start after the last successful control flow task, or if the package is already executing, it will attempt to start that task, but it will fail – what you want to happen.
The last key piece to this is preventing the manual intervention that might normally take place. Consider that ETL jobs often run after hours. If the failure occurs after hours, and the DBA is notified of the server failure, in his groggy state, he may attempt to restart jobs just thinking that everything has to run before start of business the next day. If the job scheduled at startup has already restarted the package, then when the DBA attempts to restart the package, this new instance he is attempting to start will not be able to acquire the lock on the checkpoint file and will fail – exactly what you want to happen. So use of the checkpoint file has the additional benefit of preventing duplicate instances of the package from running and causing data integrity problems.
Example
In this example, I will show end to end the development of a self-restarting SSIS package system. The package that will do the work will be simulated by the “MainPackage” package, and we will accomplish the restart with another SSIS package. Variations on this could use a script job step in SQL Agent running at startup to check for the existence of the checkpoint file and restart the MainPackage package if found.
MainPackage package
The MainPackage package would be the critical package you need to execute, and need to restart automatically should SQL experience a cluster failover, or should the server bounce for any reason. This package could be anything, but for this demo, I am going to set up a package with a “FOR” loop so I can also deal with a challenge that containers present us with when using checkpoints.
In my SSIS solution, I create a package named “MainPackage.dtsx.” The package will look like this:
Figure 1 The basic structure of the demo "MainPackage" package - the package that will be automatically restarted after failover or server bounce.
Following through, the Execute SQL Task named “Create the Table if needed” is just a script to make sure the rest of my package will work. The details of this package are not as important to the demo as making the automatic restart of this package work.
In my demo package, the Increment LoopVar is there just to control the flow in the loop. This is to overcome the challenge of restarting in a container – specifically in a “FOR” loop. I will explain this in detail later.
“Insert a new row” is an Execute SQL task to insert the current counter variable value into the table created outside the loop. “Wait for 5 seconds” just executes a “WAITFOR DELAY '00:00:05'” against the SQL Server. Its only purpose is to slow down execution enough to demo the restart.
Once the package is set up, we must set it up for checkpointing. To set the package up for checkpointing, you must configure the following package properties
· Set CheckPointFileName to the name of the checkpoint file you want this package to create when running
· Set CheckPointUsage to “IfExists.” This property is critical to make sure the package restarts only if it did not finish successfully before the server failure.
· Set SaveCheckpoints to “True”
After the package properties are set, highlight each control flow task that you want to save a checkpoint, and set its property “FailPackageOnFailure” to “True.”
NOTE: You can use the “Expressions” at the package level to set a checkpoint file name. Setting this in a variable may be a good practice to let you easily copy this to the package that will start this after a server failure. Using a variable may also allow you to use a configuration that can be shared between this package and the restart package to ensure the name is consistent.
NOTE: For clustered environments, be sure to store your checkpoint file on a cluster volume in the SQL Cluster Group so that it will be available regardless of which node is hosting the SQL instance.
For details on setting a package up for checkpointing, see http://technet.microsoft.com/en-us/library/ms141009.aspx
Package to run at startup
The real key to making your MainPackage package restart after a cluster failover, or any SQL restart is having a job that can check for the existence of the check point file created by that package, then restarting it only if the checkpoint file exists. This job will be scheduled to run when SQL Agent starts. This job could be a script, but for the purposes of this demo, I will use another SSIS package.
For the package, I will set up 3 variables at the package scope:
1. A string variable to hold the name of the directory where the checkpoint file is to be stored.
2. A variable to hold the filename of the checkpoint file
3. A Boolean variable where the package will store whether the checkpoint file was found. This must have an initial value of “False.”
Figure 2 Sample variable set up for this "ScheduleAtStartup" package.
I need three main components for this package:
1. A “FOREACH” loop to go through all the files in the directory where the checkpoint file is to be stored.
2. A script task to compare the file names in the directory to the name of the checkpoint file that we have saved in the “CheckPointFileName” variable.
3. An “[[articles:Execute Package Task (SSIS)|Execute Package]]” task that will be constrained so that it will only execute if the checkpoint file is found.
Figure 3 The overall structure of the control flow for the "ScheduleAtStartup" package.
As you can see, this package is the one that will do the work of the automatic restart after any cluster failover or server reboot.
Search The Checkpoint Folder
This is a foreach loop. Highlight this container, and create another variable named “FileName.” This will store the name of the file from the enumerator as we evaluate each file to see if its name matches the checkpoint file we are searching for.
The following configurations must be set:
· On the “Collections” tab, set the “Enumerator” to “Foreach File Enumerator”
· Below the “Enumerator, click the “Expressions.” Select “Directory” from the list of Expressions and click on the Ellipsis button. In the expression builder that pops up. Expand the variables and drag the User::CheckpointPath variable to the expression window. Click the “Evaluate Expression” button to ensure your expression evaluates as you expect. Click “OK” to save and exit the expression builder.
Figure 4 The expression as it should look when setting the directory to be pulled from the CheckPointPath variable we set up earlier.
· Under “Retrieve File Name,” select “Name and Extension.”
Figure 5 Configuration of the "Collections" in the Foreach loop.
· Click on “Variable Mappings” in the list box. For “Variable,” use the User::FileName variable you created within the scope of the Foreach loop.
Figure 6 The foreach loop will only retrieve one value, so index 0 must map to the variable created to hold the value. In our case, we use the FileName variable we created in the scope of the Foreach loop.
After you have set all configurations, click “OK” to close the Foreach Loop Editor.
Script Task
Drag a script task into the foreach loop. Double click to open it for configuration.
· For the ReadOnlyVariables, select User::FileName and User::CheckPointFileName. These two will be used for comparison to see if they match.
· For ReadWriteVariables, select User::FileFound. This is where we will record whether or not we find the checkpoint file.
Figure 7 Configured Script page of the script task.
Click the “Edit Script” button, and be sure your Main function reads as follows:
public void Main()
{
// set FileFound to true if we find the checkpoint file.
// finding the checkpoint file would mean the last
// or present execution was not completed.
if (Dts.Variables["FileName"].Value.ToString().Equals(Dts.Variables["CheckPointFileName"].Value.ToString()))
{
Dts.Variables["FileFound"].Value = true;
}
Dts.TaskResult = (int)ScriptResults.Success;
}
After editing is complete, close the script and OK to close the Script Task Editor.
Execute Main Task
It will greatly simplify things if you have already installed your “MainPackage” package onto the server where it will execute.
Drag an “Execute Package” task onto the design surface below the Foreach loop. Drag the constraint from the Foreach Loop to the Execute Package task as was shown in Figure 3. Now, a CRITICAL piece to making this function correctly:
· Right click the constraint you just dragged from the foreach loop to the Execute package Task and click “Edit.”
· In the “Precedence Constraint Editor” that comes up, for “Evaluation Operation” choose “Constraint and Expression” in the drop down.
· For value, choose “Success”
· For Expression, type in @FileFound. This is the variable where the script task stored “true” if we found the checkpoint file. Since this is a Boolean, there is no need to do a comparison, it will evaluate to “True” or “False” without the comparison.
Figure 8 The configured Precedence Contraint Editor. With this configuration, the "Execute Package" task it constrains will only execute if the Foreach Loop was successful, and the Checkpoint file was found.
Click “OK” to close the Precedence Constraint Editor and double click the “Execute Package” task you dragged onto the design surface previously. Enter the appropriate name and description, then click on the “Package” in the list box on the left. If necessary, establish connections to the SQL Server where you have stored your package. Configure this to execute your MainPackage package. For our demo, my configuration looks like this:
Figure 9 My MainPackage package is stored in SQL Server. I created a new connection in the Connection drop down and pointed this connection to the server where MainPackage is stored.
After development, test your packages to be sure that the “ScheduleAtStartup” package we just created will function as designed. It should start the MainPackage if it finds the checkpoint file in place, and should not try to execute it if the checkpoint file is not in place.
After everything is working, install your “ScheduleAtStartup” package onto the same server as the package it will restart.
Scheduling it in Agent
For details on scheduling SSIS packages, see “Using SQL Server Agent to Run a Package” in this article: http://msdn.microsoft.com/en-us/library/ms138023.aspx
The key is to schedule your MainPackage package at the time you want it to execute, and scheduling the “ScheduleAtStartup” package to start when SQL Server Agent starts. If this is in place, then in the event of a failover or server reboot, Agent will start. When Agent starts, the ScheduleAtStartup package will execute since it is scheduled to start at startup. If it finds the checkpoint file, then we know the last execution of the MainPackage package was not successful, so it will then start MainPackage. Because you are using checkpoints, MainPackage will start after the last successful control flow task. This achieves the “Automatic Restart after Failover” that so many of my customers have asked me about.
Challenge with Restarting in a For Loop
I used a “FOR” loop in my MainPackage package. Whenever you use a container like this, on restart, the package will restart with the container. If that container is a “FOR” loop, it will always restart this loop by executing the InitExpression first. In effect, this restarts the FOR loop from the beginning. This may not be what is desired.
You can get around this by incrementing variables evaluated in the EvalExpression inside the loop, and using dummy variables for the InitExpression and AssignExpressions in the “FOR” loop. In effect, this turns the “FOR” loop into a “WHILE” loop. The following is an example of a “FOR” loop with this configuration:
Figure 10 A FOR loop configured to overcome the challenge created when the package always executes the "InitExpression" after restart.
Of course, be sure you have something like a script task in the FOR loop to increment your LoopVar, or you will have an infinite loop, something else you probably do not want.