Error handling and detection in Azure Batch
At times, you might need to handle task and application failures in your Azure Batch solution. This article explains different types of Batch errors, and how to resolve common problems.
Error codes
Some general types of errors that you might see in Batch are:
- Networking failures for requests that never reached Batch, or networking failures when the Batch response didn't reach the client in time.
- Internal server errors. These errors have a standard 5xx status code HTTP response.
- Throttling-related errors. These errors include 429 or 503 status code HTTP responses with the Retry-after header.
- 4xx errors such as AlreadyExists and InvalidOperation. These errors indicate that the resource isn't in the correct state for the state transition.
For detailed information about specific error codes, see Batch status and error codes. This reference includes error codes for REST API, Batch service, and for job tasks and scheduling.
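As a rough illustration of the general error types in this section, the following sketch maps an HTTP status code to one of those categories. The function name and the category labels are hypothetical and aren't part of any Batch SDK. A status of None stands in for a networking failure where no response arrived.

```python
from typing import Mapping, Optional


def classify_batch_http_error(
    status: Optional[int],
    headers: Optional[Mapping[str, str]] = None,
) -> str:
    """Map a response status to one of the general error types (hypothetical labels)."""
    if status is None:
        # No response at all: the request or the response was lost on the network.
        return "network_failure"

    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in (headers or {}).items()}

    # 429 always signals throttling; 503 is treated as throttling only when
    # the service also tells the client how long to wait.
    if status == 429 or (status == 503 and "retry-after" in normalized):
        return "throttling"
    if 500 <= status <= 599:
        return "server_error"
    if 400 <= status <= 499:
        # For example AlreadyExists or InvalidOperation: retrying unchanged won't help.
        return "client_error"
    return "other"
```

A client might retry "network_failure", "throttling", and "server_error" results, and surface "client_error" results to the caller, because those usually mean the resource is in the wrong state.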
Application failures
During execution, an application might produce diagnostic output. You can use this output to troubleshoot issues. The Batch service writes standard output and standard error output to the stdout.txt and stderr.txt files in the task directory on the compute node. For more information, see Files and directories in Batch.
To download these output files, use the Azure portal or one of the Batch SDKs. For example, to retrieve files for troubleshooting purposes, use ComputeNode.GetNodeFile and CloudTask.GetNodeFile in the Batch .NET library.
Task errors
Task errors fall into several categories.
Pre-processing errors
If a task fails to start, a pre-processing error is set for the task. Pre-processing errors can occur if:
- The task's resource files have moved.
- The storage account is no longer available.
- Another issue happened that prevented the successful copying of files to the node.
File upload errors
If files that you specified for a task fail to upload for any reason, a file upload error is set for the task. File upload errors can occur if:
- The shared access signature (SAS) token supplied for accessing Azure Storage is invalid.
- The SAS token doesn't provide write permissions.
- The storage account is no longer available.
- Another issue happened that prevented the successful copying of files from the node.
Application errors
The process specified by the task's command line can also fail. For more information, see Task exit codes.
For application errors, you can configure Batch to automatically retry the task up to a specified number of times by setting the task's maxTaskRetryCount constraint.
Constraint errors
To specify the maximum execution duration for a job or task, set the maxWallClockTime constraint. Use this setting to terminate tasks that fail to progress.
When the task exceeds the maximum time:
- The task is marked as completed.
- The exit code is set to 0xC000013A.
- The schedulingError field is marked as { category:"ServerError", code:"TaskEnded" }.
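The three conditions above can be checked together when you inspect a completed task. The following is a minimal sketch that assumes the task is available as a plain dictionary. The key names exitCode and schedulingError are assumptions about how your client represents the task, and the helper name is hypothetical.

```python
# Exit code that Batch sets when a task exceeds maxWallClockTime.
WALL_CLOCK_EXIT_CODE = 0xC000013A


def exceeded_max_wall_clock_time(task: dict) -> bool:
    """Return True if the task record matches the timeout signature."""
    exit_code = task.get("exitCode")
    if exit_code is None:
        return False

    # Defensive step (an assumption): some clients report exit codes as signed
    # 32-bit integers, so compare on the unsigned 32-bit value.
    if (exit_code & 0xFFFFFFFF) != WALL_CLOCK_EXIT_CODE:
        return False

    error = task.get("schedulingError") or {}
    return error.get("category") == "ServerError" and error.get("code") == "TaskEnded"
```

Checking the exit code alone isn't enough to identify a timeout, because a process can return the same value for other reasons. That's why the sketch also requires the schedulingError match.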
Task exit codes
When a task executes a process, Batch populates the task's exit code property with the return code of the process. If the process returns a nonzero exit code, the Batch service marks the task as failed.
The Batch service doesn't determine a task's exit code. The process itself, or the operating system on which the process executes, determines the exit code.
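To see how the process itself determines the exit code, you can reproduce the rule locally. The following sketch doesn't call Batch. It runs a child process with the Python standard library and applies the same zero or nonzero rule. The helper name run_and_classify is hypothetical.

```python
import subprocess
import sys
from typing import List, Tuple


def run_and_classify(command: List[str]) -> Tuple[int, str]:
    """Run a command and label it the way Batch labels a task by exit code."""
    completed = subprocess.run(command)
    # The child process decides the return code; this function only reads it.
    state = "succeeded" if completed.returncode == 0 else "failed"
    return completed.returncode, state


# A child process that deliberately exits with code 3.
code, state = run_and_classify([sys.executable, "-c", "import sys; sys.exit(3)"])
print(code, state)  # prints "3 failed"
```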
Task failures or interruptions
Tasks might occasionally fail or be interrupted. For example:
- The task application itself might fail.
- The node on which the task is running might reboot.
- A resize operation might remove the node from the pool. This action might happen if the pool's deallocation policy removes nodes immediately without waiting for tasks to finish.
In all cases, Batch can automatically requeue the task for execution on another node.
It's also possible for an intermittent issue to cause a task to stop responding or take too long to execute. You can set a maximum execution interval for a task. If a task exceeds the interval, the Batch service interrupts the task application.
Connect to compute nodes
You can perform debugging and troubleshooting by signing in to a compute node remotely. Use the Azure portal to download a Remote Desktop Protocol (RDP) file for Windows nodes, and obtain Secure Shell (SSH) connection information for Linux nodes. You can also download this information using the Batch .NET or Batch Python APIs.
To connect to a node via RDP or SSH, first create a user on the node. Use one of the following methods:
- The Azure portal
- Batch REST API: adduser
- Batch .NET API: ComputeNode.CreateComputeNodeUser
- Batch Python module: add_user
If necessary, configure or disable access to compute nodes.
Troubleshoot problem nodes
Your Batch client application or service can examine the metadata of failed tasks to identify a problem node. Each node in a pool has a unique ID. Task metadata includes the node where a task runs. After you find the problem node, try the following methods to resolve the failure.
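Identifying a problem node from task metadata can be sketched as a simple count of failed tasks per node ID. The example assumes that each task is a dictionary with nodeId and exitCode keys. Those key names and the function name are hypothetical, so adapt them to the task objects that your SDK returns.

```python
from collections import Counter
from typing import Iterable, List, Mapping, Tuple


def rank_nodes_by_failures(tasks: Iterable[Mapping]) -> List[Tuple[str, int]]:
    """Count failed tasks per node, with the most failure-prone node first."""
    failures = Counter(
        task["nodeId"]
        for task in tasks
        # A nonzero exit code marks a failed task; None means no exit code yet.
        if task.get("nodeId") and task.get("exitCode") not in (0, None)
    )
    return failures.most_common()


tasks = [
    {"id": "t1", "nodeId": "node-a", "exitCode": 0},
    {"id": "t2", "nodeId": "node-b", "exitCode": 1},
    {"id": "t3", "nodeId": "node-b", "exitCode": 137},
    {"id": "t4", "nodeId": "node-a", "exitCode": 2},
]
print(rank_nodes_by_failures(tasks))  # prints [('node-b', 2), ('node-a', 1)]
```

A node that collects most of the failures while other nodes run the same tasks successfully is a reasonable first candidate for the actions that follow.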
Reboot node
Restarting a node sometimes fixes latent issues, such as stuck or crashed processes. If your pool uses a start task, or your job uses a job preparation task, a node restart executes these tasks.
- Batch REST API: reboot
- Batch .NET API: ComputeNode.Reboot
Reimage node
Reimaging a node reinstalls the operating system. Start tasks and job preparation tasks rerun after the reimaging happens.
- Batch REST API: reimage
- Batch .NET API: ComputeNode.Reimage
Remove node from pool
Removing the node from the pool is sometimes necessary.
- Batch REST API: removenodes
- Batch .NET API: PoolOperations.RemoveFromPool
Disable task scheduling on node
Disabling task scheduling on a node effectively takes the node offline. Batch assigns no further tasks to the node. However, the node continues running in the pool. You can then further investigate the failures without losing the failed task's data. The node also won't cause more task failures.
For example, disable task scheduling on the node. Then, sign in to the node remotely. Examine the event logs, and do other troubleshooting. After you solve the problems, enable task scheduling again to bring the node back online.
- Batch REST API: disablescheduling and enablescheduling
- Batch .NET API: ComputeNode.DisableScheduling and ComputeNode.EnableScheduling
You can use these actions to specify how Batch handles tasks currently running on the node. For example, when you disable task scheduling with the Batch .NET API, you can specify an enum value for DisableComputeNodeSchedulingOption. You can choose to:
- Terminate running tasks: Terminate
- Requeue tasks for scheduling on other nodes: Requeue
- Allow running tasks to complete before performing the action: TaskCompletion
Retry after errors
The Batch APIs notify you about failures. You can retry all APIs using the built-in global retry handler. It's a best practice to use this option.
After a failure, wait several seconds before retrying. If you retry too frequently or too quickly, the retry handler throttles requests.
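The following sketch shows what waiting before a retry can look like. It isn't the built-in retry handler, which remains the recommended option. The TransientError type, the call_with_retries function, and the default values are all hypothetical and chosen only for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical error for a retryable failure, with an optional wait hint."""

    def __init__(self, message: str, retry_after: Optional[float] = None):
        super().__init__(message)
        # Seconds to wait, for example taken from a Retry-after header.
        self.retry_after = retry_after


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, waiting several seconds (and doubling) between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError as error:
            if attempt == max_attempts:
                raise  # Out of attempts: let the caller see the failure.
            backoff = base_delay * 2 ** (attempt - 1)
            # Prefer the service's own hint when one was provided.
            delay = error.retry_after if error.retry_after is not None else backoff
            sleep(delay)
    raise ValueError("max_attempts must be at least 1")
```

With the defaults, the waits between attempts are 2, 4, and 8 seconds. The sleep parameter is injectable so that you can test the retry logic without actually waiting.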