articles/batch/error-handling.md
22 additions & 21 deletions
@@ -2,7 +2,7 @@
2
2
title: Error handling and detection in Azure Batch
3
3
description: Learn about error handling in Batch service workflows from a development standpoint.
4
4
ms.topic: article
5
-
ms.date: 12/20/2021
5
+
ms.date: 04/13/2023
6
6
---
7
7
8
8
# Error handling and detection in Azure Batch
@@ -11,18 +11,18 @@ At times, you might need to handle task and application failures in your Azure B
11
11
12
12
## Error codes
13
13
14
-
Some general types of errors you might see in Batch are:
14
+
Some general types of errors that you might see in Batch are:
15
15
16
-
- Networking failures for requests that never reached Batch. Or, networking failures when the Batch response didn't reach the client in time.
16
+
- Networking failures for requests that never reached Batch, or networking failures when the Batch response didn't reach the client in time.
17
17
- Internal server errors. These errors have a standard `5xx` status code HTTP response.
18
18
- Throttling-related errors. These errors include `429` or `503` status code HTTP responses with the `Retry-after` header.
19
19
- `4xx` errors such as `AlreadyExists` and `InvalidOperation`. These errors indicate that the resource isn't in the correct state for the state transition.
20
20
21
-
For detailed information about specific error codes, see [Batch Status and Error Codes](/rest/api/batchservice/batch-status-and-error-codes). This reference includes error codes for REST API, Batch service, and job tasks and scheduling.
21
+
For detailed information about specific error codes, see [Batch status and error codes](/rest/api/batchservice/batch-status-and-error-codes). This reference includes error codes for REST API, Batch service, and for job tasks and scheduling.
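The error categories above suggest a simple client-side retry policy. The following is a minimal sketch, not part of the Batch SDK: the function names, the backoff parameters, and the decision to treat every non-throttling `4xx` response as non-retryable are assumptions made for illustration. It honors a numeric `Retry-After` header when one is present and otherwise falls back to capped exponential backoff.

```python
from typing import Mapping, Optional

# Status codes the article associates with throttling.
THROTTLING_STATUS = {429, 503}


def is_retryable(status: int) -> bool:
    """Treat throttling responses and 5xx server errors as retryable (assumption)."""
    return status in THROTTLING_STATUS or 500 <= status <= 599


def retry_delay(status: int, headers: Mapping[str, str], attempt: int,
                base: float = 2.0, cap: float = 60.0) -> Optional[float]:
    """Return the number of seconds to wait before retrying, or None to give up.

    `attempt` is zero-based. `base` and `cap` are hypothetical defaults.
    """
    if not is_retryable(status):
        return None
    # HTTP header names are case-insensitive, so compare in lowercase.
    for name, value in headers.items():
        if name.lower() == "retry-after":
            try:
                return max(0.0, float(value))
            except ValueError:
                break  # The HTTP-date form is not handled in this sketch.
    return min(cap, base * (2 ** attempt))
```

For example, `retry_delay(429, {"Retry-After": "7"}, 0)` returns `7.0`, while a `409` response returns `None` because retrying doesn't change the resource state.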
22
22
23
23
## Application failures
24
24
25
-
During execution, an application might produce diagnostic output. You can use this output to troubleshoot issues. The Batch service writes standard output and standard error output to the `stdout.txt` and `stderr.txt` files in the task directory on the compute node. For more information, see [Files and directories in Batch](files-and-directories.md).
25
+
During execution, an application might produce diagnostic output. You can use this output to troubleshoot issues. The Batch service writes standard output and standard error output to the *stdout.txt* and *stderr.txt* files in the task directory on the compute node. For more information, see [Files and directories in Batch](files-and-directories.md).
26
26
27
27
To download these output files, use the Azure portal or one of the Batch SDKs. For example, to retrieve files for troubleshooting purposes, use [ComputeNode.GetNodeFile](/dotnet/api/microsoft.azure.batch.computenode) and [CloudTask.GetNodeFile](/dotnet/api/microsoft.azure.batch.cloudtask) in the Batch .NET library.
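As an illustration in Python, the sketch below saves both output files for one task. It assumes a client object that exposes `file.get_from_task(job_id, task_id, file_path)` and returns an iterable of byte chunks, which is modeled on the Batch Python client; verify the call against the SDK version you use. The `save_task_output` helper and the local file naming are hypothetical.

```python
from pathlib import Path


def save_task_output(batch_client, job_id, task_id, dest_dir,
                     names=("stdout.txt", "stderr.txt")):
    """Download a task's output files into dest_dir and return the local paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for name in names:
        target = dest / f"{task_id}-{name}"
        # Assumed call shape: yields the file content as byte chunks.
        stream = batch_client.file.get_from_task(job_id, task_id, name)
        with open(target, "wb") as handle:
            for chunk in stream:
                handle.write(chunk)
        saved.append(target)
    return saved
```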
28
28
@@ -44,30 +44,30 @@ If files that you specified for a task fail to upload for any reason, a file upl
44
44
45
45
- The shared access signature (SAS) token supplied for accessing Azure Storage is invalid.
46
46
- The SAS token doesn't provide write permissions.
47
-
- The storage account is no longer available
47
+
- The storage account is no longer available.
48
48
- Another issue happened that prevented the successful copying of files from the node.
49
49
50
50
### Application errors
51
51
52
-
The process that the task's command line specifies can also fail. For more information, see [Task exit codes](#task-exit-codes).
52
+
The process specified by the task's command line can also fail. For more information, see [Task exit codes](#task-exit-codes).
53
53
54
54
For application errors, configure Batch to automatically retry the task up to a specified number of times.
55
55
56
56
### Constraint errors
57
57
58
-
To specify the maximum execution duration for a job or task, set the **maxWallClockTime** constraint. Use this setting to terminate tasks that fail to progress.
58
+
To specify the maximum execution duration for a job or task, set the `maxWallClockTime` constraint. Use this setting to terminate tasks that fail to progress.
59
59
60
60
When the task exceeds the maximum time:
61
61
62
-
- The task is marked as **completed**.
63
-
- The exit code is set to `0xC000013A`
62
+
- The task is marked as *completed*.
63
+
- The exit code is set to `0xC000013A`.
64
64
- The **schedulingError** field is marked as `{ category: "ServerError", code: "TaskEnded" }`.
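The three conditions above can be checked together to tell a wall-clock timeout apart from other failures. This is a minimal sketch that assumes the task has already been converted to a plain dictionary; the key names (`state`, `exitCode`, `schedulingError`) are hypothetical and depend on how your client serializes task data.

```python
# Exit code reported when a task exceeds maxWallClockTime.
TIMEOUT_EXIT_CODE = 0xC000013A


def exceeded_wall_clock(task: dict) -> bool:
    """Return True if the task record matches the timeout signature."""
    error = task.get("schedulingError") or {}
    exit_code = task.get("exitCode")
    if exit_code is not None:
        # Some clients report the code as a signed 32-bit integer.
        exit_code &= 0xFFFFFFFF
    return (
        task.get("state") == "completed"
        and exit_code == TIMEOUT_EXIT_CODE
        and error.get("category") == "ServerError"
        and error.get("code") == "TaskEnded"
    )
```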
65
65
66
66
## Task exit codes
67
67
68
68
When a task executes a process, Batch populates the task's exit code property with the return code of the process. If the process returns a nonzero exit code, the Batch service marks the task as failed.
69
69
70
-
The Batch service doesn't determine a task's exit code. The process itself, or the operating system on which the process executed, determines the exit code.
70
+
The Batch service doesn't determine a task's exit code. The process itself, or the operating system on which the process executes, determines the exit code.
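The following local sketch shows the same principle outside Batch: the child process chooses its exit code, and the caller only reads it. The `run_and_classify` helper and the `"failed"`/`"succeeded"` labels are illustrative and mirror the nonzero-means-failed rule described above.

```python
import subprocess
import sys


def run_and_classify(args):
    """Run a command and classify it the way Batch classifies a task."""
    completed = subprocess.run(args)
    # The return code comes from the process itself, not from the caller.
    outcome = "failed" if completed.returncode != 0 else "succeeded"
    return completed.returncode, outcome


if __name__ == "__main__":
    # The child process exits with code 3, so the run is classified as failed.
    print(run_and_classify([sys.executable, "-c", "import sys; sys.exit(3)"]))
```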
71
71
72
72
## Task failures or interruptions
73
73
@@ -83,16 +83,17 @@ It's also possible for an intermittent issue to cause a task to stop responding
83
83
84
84
## Connect to compute nodes
85
85
86
-
You can perform additional debugging and troubleshooting by signing in to a compute node remotely. Use the Azure portal to download a Remote Desktop Protocol (RDP) file for Windows nodes, and obtain Secure Shell (SSH) connection information for Linux nodes. You can also download this information using the [Batch .NET](/dotnet/api/microsoft.azure.batch.computenode) or [Batch Python](batch-linux-nodes.md#connect-to-linux-nodes-using-ssh) APIs.
86
+
You can perform debugging and troubleshooting by signing in to a compute node remotely. Use the Azure portal to download a Remote Desktop Protocol (RDP) file for Windows nodes, and obtain Secure Shell (SSH) connection information for Linux nodes. You can also download this information using the [Batch .NET](/dotnet/api/microsoft.azure.batch.computenode) or [Batch Python](batch-linux-nodes.md#connect-to-linux-nodes-using-ssh) APIs.
87
87
88
88
To connect to a node via RDP or SSH, first create a user on the node. Use one of the following methods:
If necessary, [restrict or disable RDP or SSH access to compute nodes](pool-endpoint-configuration.md).
95
+
If necessary, [configure or disable access to compute nodes](pool-endpoint-configuration.md).
96
+
96
97
## Troubleshoot problem nodes
97
98
98
99
Your Batch client application or service can examine the metadata of failed tasks to identify a problem node. Each node in a pool has a unique ID. Task metadata includes the node where a task runs. After you find the problem node, try the following methods to resolve the failure.
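One way to find a problem node is to count failed tasks per node ID. The sketch below assumes that each task has been reduced to a dictionary with hypothetical `node_id` and `exit_code` keys, and that two or more failures on the same node make it a suspect; both the record shape and the threshold are assumptions to tune for your workload.

```python
from collections import Counter


def suspect_nodes(tasks, min_failures=2):
    """Return (node_id, failure_count) pairs, most failures first."""
    failures = Counter(
        task["node_id"]
        for task in tasks
        # Skip tasks that never ran on a node or haven't finished yet.
        if task.get("node_id") is not None
        and task.get("exit_code") not in (0, None)
    )
    return [(node, count) for node, count in failures.most_common()
            if count >= min_failures]
```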
@@ -116,11 +117,11 @@ Reimaging a node reinstalls the operating system. Start tasks and job preparatio
116
117
Removing the node from the pool is sometimes necessary.
Disabling task scheduling on a node effectively takes the node offline. Batch assigns no further tasks to the node. However, the node continues running in the pool. You can then further investigate the failures without losing the failed tasks's data. The node also won't cause additional task failures.
124
+
Disabling task scheduling on a node effectively takes the node offline. Batch assigns no further tasks to the node. However, the node continues running in the pool. You can then further investigate the failures without losing the failed task's data. The node also won't cause more task failures.
124
125
125
126
For example, disable task scheduling on the node. Then, sign in to the node remotely. Examine the event logs, and do other troubleshooting. After you solve the problems, enable task scheduling again to bring the node back online.
126
127
@@ -129,9 +130,9 @@ For example, disable task scheduling on the node. Then, sign in to the node remo
129
130
130
131
You can use these actions to specify how Batch handles tasks currently running on the node. For example, when you disable task scheduling with the Batch .NET API, you can specify an enum value for [DisableComputeNodeSchedulingOption](/dotnet/api/microsoft.azure.batch.common.disablecomputenodeschedulingoption). You can choose to:
131
132
132
-
- Terminate running tasks (`Terminate`).
133
-
- Requeue tasks for scheduling on other nodes (`Requeue`).
134
-
- Allow running tasks to complete before performing the action (`TaskCompletion`).
133
+
- Terminate running tasks: `Terminate`
134
+
- Requeue tasks for scheduling on other nodes: `Requeue`
135
+
- Allow running tasks to complete before performing the action: `TaskCompletion`
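The disable, investigate, and re-enable workflow can be wrapped in a context manager. This sketch assumes a client whose `compute_node` operations expose `disable_scheduling(pool_id, node_id, node_disable_scheduling_option=...)` and `enable_scheduling(pool_id, node_id)`, which is modeled on the Batch Python client; confirm the names and the accepted option values against your SDK version. The `scheduling_disabled` helper is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def scheduling_disabled(batch_client, pool_id, node_id, option="requeue"):
    """Take a node offline for investigation, then bring it back online.

    Scheduling is re-enabled only if the body finishes without an exception,
    so an unresolved problem leaves the node offline.
    """
    batch_client.compute_node.disable_scheduling(
        pool_id, node_id, node_disable_scheduling_option=option
    )
    yield
    batch_client.compute_node.enable_scheduling(pool_id, node_id)
```

Used as `with scheduling_disabled(client, pool_id, node_id): ...`, the body is where you would sign in to the node and examine logs. Skipping the re-enable step on failure is a design choice that keeps a node that is still faulty from receiving new tasks.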
135
136
136
137
## Retry after errors
137
138
@@ -141,5 +142,5 @@ After a failure, wait several seconds before retrying. If you retry too frequent
141
142
142
143
## Next steps
143
144
144
-
- [Check for Batch pool and node errors](batch-pool-node-error-checking.md).
145
-
- [Check for Batch job and task errors](batch-job-task-error-checking.md).
145
+
- [Check for Batch pool and node errors](batch-pool-node-error-checking.md)
146
+
- [Check for Batch job and task errors](batch-job-task-error-checking.md)