| 1 | +--- |
| 2 | +title: Error handling in Azure Batch |
| 3 | +description: Learn about error handling in Batch service workflows from a development standpoint. |
| 4 | +ms.topic: conceptual |
| 5 | +ms.date: 05/12/2020 |
| 6 | + |
| 7 | +--- |
| 8 | +# Error handling in Azure Batch |
| 9 | + |
| 10 | +You may need to handle both task and application failures in your Batch solution. This article describes the types of errors that can occur and how to resolve them. |
| 11 | + |
| 12 | +**Should this be combined with [Detecting and handling Batch service errors](batch-retry-after-errors.md)? And/or moved out of this section?** |
| 13 | + |
| 14 | +## Application failures |
| 15 | + |
| 16 | +During execution, an application might produce diagnostic output that you can use to troubleshoot issues. As described in [Files and directories](files-and-directories.md), the Batch service writes standard output and standard error output to `stdout.txt` and `stderr.txt` files in the task directory on the compute node. |
| 17 | + |
| 18 | +You can use the Azure portal or one of the Batch SDKs to download these files. For example, you can retrieve these and other files for troubleshooting purposes by using [ComputeNode.GetNodeFile](https://docs.microsoft.com/dotnet/api/microsoft.azure.batch.computenode) and [CloudTask.GetNodeFile](https://docs.microsoft.com/dotnet/api/microsoft.azure.batch.cloudtask) in the Batch .NET library. |
| 19 | + |
| 20 | +## Task errors |
| 21 | + |
| 22 | +Task errors fall into these categories: |
| 23 | + |
| 24 | +### Pre-processing errors |
| 25 | + |
| 26 | +If a task fails to start, a pre-processing error is set for the task. |
| 27 | + |
| 28 | +Pre-processing errors can occur if the task's resource files have moved, the storage account is no longer available, or another issue was encountered that prevented the successful copying of files to the node. |
| 29 | + |
| 30 | +### File upload errors |
| 31 | + |
| 32 | +If files that are specified for a task fail to upload for any reason, a file upload error is set for the task. |
| 33 | + |
| 34 | +File upload errors can occur if the SAS supplied for accessing Azure Storage is invalid or does not provide write permissions, if the storage account is no longer available, or if another issue was encountered that prevented the successful copying of files from the node. |
| 35 | + |
| 36 | +### Application errors |
| 37 | + |
| 38 | +The process that is specified by the task's command line can also fail. The process is deemed to have failed when it returns a nonzero exit code (see *Task exit codes* later in this article). |
| 39 | + |
| 40 | +For application errors, you can configure Batch to automatically retry the task up to a specified number of times. |
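To make the retry behavior concrete, here is a minimal local sketch in Python. It does not call the Batch service. The helper name `run_with_retries`, its parameters, and the assumption that the retry limit counts retries in addition to one initial attempt are all illustrative and should be checked against the Batch documentation for task constraints.

```python
from typing import Callable, Tuple


def run_with_retries(run_once: Callable[[], int], max_retry_count: int) -> Tuple[int, int]:
    """Run a task body, retrying while it returns a nonzero exit code.

    Returns (final_exit_code, attempts_made). Hypothetical helper, not a Batch API.
    """
    attempts = 0
    exit_code = -1
    # Assumed semantics: one initial attempt plus up to max_retry_count retries.
    while attempts <= max_retry_count:
        attempts += 1
        exit_code = run_once()
        if exit_code == 0:
            break  # Success: no further retries are needed.
    return exit_code, attempts


# Hypothetical task body that fails twice and then succeeds.
outcomes = iter([1, 1, 0])
result = run_with_retries(lambda: next(outcomes), max_retry_count=3)
print(result)
```

A task that never succeeds stops after the retry limit is exhausted and keeps its last nonzero exit code.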
| 41 | + |
| 42 | +### Constraint errors |
| 43 | + |
| 44 | +You can set a constraint that specifies the maximum execution duration for a job or task, the *maxWallClockTime*. This can be useful for terminating tasks that fail to progress. |
| 45 | + |
| 46 | +When the maximum amount of time has been exceeded, the task is marked as *completed*, but the exit code is set to `0xC000013A` and the *schedulingError* field is marked as `{ category:"ServerError", code:"TaskEnded" }`. |
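A client could recognize this signature when it inspects completed tasks. The following Python sketch works on a plain dictionary rather than an SDK object. The key names `state` and `exitCode` and the function name `exceeded_wall_clock` are assumptions for illustration; only the exit code value and the *schedulingError* contents come from the text above. The mask handles the case, also an assumption, where a client reports the code as a signed 32-bit integer.

```python
TIMEOUT_EXIT_CODE = 0xC000013A


def exceeded_wall_clock(task: dict) -> bool:
    """Return True if a task record matches the timeout signature described above."""
    exit_code = task.get("exitCode")
    if exit_code is None:
        return False
    error = task.get("schedulingError") or {}
    return (
        task.get("state") == "completed"
        # Mask to 32 bits so a signed representation of the code also matches.
        and (exit_code & 0xFFFFFFFF) == TIMEOUT_EXIT_CODE
        and error.get("code") == "TaskEnded"
    )


# Hypothetical task records for illustration only.
timed_out_task = {
    "state": "completed",
    "exitCode": 0xC000013A,
    "schedulingError": {"category": "ServerError", "code": "TaskEnded"},
}
normal_task = {"state": "completed", "exitCode": 0, "schedulingError": None}

print(exceeded_wall_clock(timed_out_task), exceeded_wall_clock(normal_task))
```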
| 47 | + |
| 48 | +## Task exit codes |
| 49 | + |
| 50 | +As mentioned earlier, a task is marked as failed by the Batch service if the process that is executed by the task returns a nonzero exit code. When a task executes a process, Batch populates the task's exit code property with the return code of the process. |
| 51 | + |
| 52 | +The Batch service does not determine a task's exit code. The exit code is determined by the process itself or by the operating system on which the process executed. |
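The following Python sketch illustrates the point locally, without involving Batch: the exit code is whatever the child process returns, and a caller only reads it and applies the nonzero-means-failed rule. The helper name `run_and_get_exit_code` is hypothetical.

```python
import subprocess
import sys


def run_and_get_exit_code(command: list) -> int:
    """Run a command and return the exit code reported by the process."""
    completed = subprocess.run(command)
    return completed.returncode


# The child process chooses its own exit code (3 here); the caller only observes it.
exit_code = run_and_get_exit_code([sys.executable, "-c", "import sys; sys.exit(3)"])
task_failed = exit_code != 0
print(exit_code, task_failed)
```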
| 53 | + |
| 54 | +## Task failures or interruptions |
| 55 | + |
| 56 | +Tasks might occasionally fail or be interrupted. The task application itself might fail, the node on which the task is running might be rebooted, or the node might be removed from the pool during a resize operation (if the pool's deallocation policy is set to remove nodes immediately without waiting for tasks to finish). In all cases, the task can be automatically requeued by Batch for execution on another node. |
| 57 | + |
| 58 | +It is also possible for an intermittent issue to cause a task to stop responding or take too long to execute. You can set the maximum execution interval for a task. If the maximum execution interval is exceeded, the Batch service interrupts the task application. |
| 59 | + |
| 60 | +## Connect to compute nodes |
| 61 | + |
| 62 | +You can perform additional debugging and troubleshooting by signing in to a compute node remotely. You can use the Azure portal to download a Remote Desktop Protocol (RDP) file for Windows nodes and obtain Secure Shell (SSH) connection information for Linux nodes. You can also do this by using the Batch APIs, for example with [Batch .NET](https://docs.microsoft.com/dotnet/api/microsoft.azure.batch.computenode) or [Batch Python](batch-linux-nodes.md#connect-to-linux-nodes-using-ssh). |
| 63 | + |
| 64 | +> [!IMPORTANT] |
| 65 | +> To connect to a node via RDP or SSH, you must first create a user on the node. To do this, you can use the Azure portal, [add a user account to a node](https://docs.microsoft.com/rest/api/batchservice/computenode/adduser) by using the Batch REST API, call the [ComputeNode.CreateComputeNodeUser](https://docs.microsoft.com/dotnet/api/microsoft.azure.batch.computenode) method in Batch .NET, or call the [add_user](batch-linux-nodes.md#connect-to-linux-nodes-using-ssh) method in the Batch Python module. |
| 66 | + |
| 67 | +If you need to restrict or disable RDP or SSH access to compute nodes, see [Configure or disable remote access to compute nodes in an Azure Batch pool](pool-endpoint-configuration.md). |
| 68 | + |
| 69 | +## Troubleshoot problem nodes |
| 70 | + |
| 71 | +In situations where some of your tasks are failing, your Batch client application or service can examine the metadata of the failed tasks to identify a misbehaving node. Each node in a pool is given a unique ID, and the node on which a task runs is included in the task metadata. After you've identified a problem node, you can take several actions with it: |
| 72 | + |
| 73 | +- **Reboot the node** ([REST](https://docs.microsoft.com/rest/api/batchservice/computenode/reboot) | [.NET](https://docs.microsoft.com/dotnet/api/microsoft.azure.batch.computenode.reboot)) |
| 74 | + |
| 75 | + Restarting the node can sometimes clear up latent issues like stuck or crashed processes. If your pool uses a start task or your job uses a job preparation task, they are executed when the node restarts. |
| 76 | +- **Reimage the node** ([REST](https://docs.microsoft.com/rest/api/batchservice/computenode/reimage) | [.NET](https://docs.microsoft.com/dotnet/api/microsoft.azure.batch.computenode.reimage)) |
| 77 | + |
| 78 | + This reinstalls the operating system on the node. As with rebooting a node, start tasks and job preparation tasks are rerun after the node has been reimaged. |
| 79 | +- **Remove the node from the pool** ([REST](https://docs.microsoft.com/rest/api/batchservice/pool/removenodes) | [.NET](https://docs.microsoft.com/dotnet/api/microsoft.azure.batch.pooloperations)) |
| 80 | + |
| 81 | + Sometimes it is necessary to completely remove the node from the pool. |
| 82 | +- **Disable task scheduling on the node** ([REST](https://docs.microsoft.com/en-us/rest/api/batchservice/computenode/disablescheduling) | [.NET](https://docs.microsoft.com/dotnet/api/microsoft.azure.batch.computenode.disablescheduling)) |
| 83 | + |
| 84 | + This effectively takes the node offline so that no further tasks are assigned to it, but allows the node to remain running and in the pool. This enables you to perform further investigation into the cause of the failures without losing the failed task's data, and without the node causing additional task failures. For example, you can disable task scheduling on the node, then sign in remotely to examine the node's event logs or perform other troubleshooting. After you've finished your investigation, you can then bring the node back online by enabling task scheduling ([REST](https://docs.microsoft.com/rest/api/batchservice/computenode/enablescheduling) | [.NET](https://docs.microsoft.com/dotnet/api/microsoft.azure.batch.computenode.enablescheduling)), or perform one of the other actions discussed earlier. |
| 85 | + |
| 86 | +> [!IMPORTANT] |
| 87 | +> With the actions described above, you can specify how tasks currently running on the node are handled when you perform the action. For example, when you disable task scheduling on a node by using the Batch .NET client library, you can specify a [DisableComputeNodeSchedulingOption](https://docs.microsoft.com/dotnet/api/microsoft.azure.batch.common.disablecomputenodeschedulingoption) enum value to specify whether to **Terminate** running tasks, **Requeue** them for scheduling on other nodes, or allow running tasks to complete before performing the action (**TaskCompletion**). |
| 88 | + |
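As a rough illustration of how a client might use task metadata to spot a misbehaving node, the Python sketch below counts failed tasks per node ID. It operates on plain dictionaries, not SDK objects. The keys `node_id` and `exit_code`, the function name `failures_by_node`, and the threshold of two failures are all assumptions chosen for the example.

```python
from collections import Counter


def failures_by_node(tasks: list) -> Counter:
    """Count tasks with a nonzero exit code, grouped by the node they ran on."""
    counts = Counter()
    for task in tasks:
        if task["exit_code"] != 0:
            counts[task["node_id"]] += 1
    return counts


# Hypothetical task metadata for illustration only.
tasks = [
    {"id": "task1", "node_id": "node-a", "exit_code": 0},
    {"id": "task2", "node_id": "node-b", "exit_code": 1},
    {"id": "task3", "node_id": "node-b", "exit_code": 1},
    {"id": "task4", "node_id": "node-c", "exit_code": 2},
    {"id": "task5", "node_id": "node-a", "exit_code": 0},
]

# Treat a node as suspect once it accounts for at least two failures (arbitrary threshold).
suspects = [node for node, count in failures_by_node(tasks).most_common() if count >= 2]
print(suspects)
```

A node that appears in `suspects` would be a candidate for one of the actions listed above, such as disabling task scheduling while you investigate.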
| 89 | +## Next steps |
| 90 | + |
| 91 | +- Learn how to [check for pool and node errors](batch-pool-node-error-checking.md). |
| 92 | +- Learn how to [check for job and task errors](batch-job-task-error-checking.md). |