Commit 01ecda8

Merge pull request #234412 from cdpark/batch-error-handling
Freshness Pass for User Story: 79612 Error handling
2 parents db0f1c7 + 3622699 commit 01ecda8

1 file changed

+22
-21
lines changed

articles/batch/error-handling.md

Lines changed: 22 additions & 21 deletions
@@ -2,7 +2,7 @@
 title: Error handling and detection in Azure Batch
 description: Learn about error handling in Batch service workflows from a development standpoint.
 ms.topic: article
-ms.date: 12/20/2021
+ms.date: 04/13/2023
 ---
 
 # Error handling and detection in Azure Batch
@@ -11,18 +11,18 @@ At times, you might need to handle task and application failures in your Azure B
 
 ## Error codes
 
-Some general types of errors you might see in Batch are:
+Some general types of errors that you might see in Batch are:
 
-- Networking failures for requests that never reached Batch. Or, networking failures when the Batch response didn't reach the client in time.
+- Networking failures for requests that never reached Batch, or networking failures when the Batch response didn't reach the client in time.
 - Internal server errors. These errors have a standard `5xx` status code HTTP response.
 - Throttling-related errors. These errors include `429` or `503` status code HTTP responses with the `Retry-after` header.
 - `4xx` errors such as `AlreadyExists` and `InvalidOperation`. These errors indicate that the resource isn't in the correct state for the state transition.
 
-For detailed information about specific error codes, see [Batch Status and Error Codes](/rest/api/batchservice/batch-status-and-error-codes). This reference includes error codes for REST API, Batch service, and job tasks and scheduling.
+For detailed information about specific error codes, see [Batch status and error codes](/rest/api/batchservice/batch-status-and-error-codes). This reference includes error codes for REST API, Batch service, and for job tasks and scheduling.
 
 ## Application failures
 
-During execution, an application might produce diagnostic output. You can use this output to troubleshoot issues. The Batch service writes standard output and standard error output to the `stdout.txt` and `stderr.txt` files in the task directory on the compute node. For more information, see [Files and directories in Batch](files-and-directories.md).
+During execution, an application might produce diagnostic output. You can use this output to troubleshoot issues. The Batch service writes standard output and standard error output to the *stdout.txt* and *stderr.txt* files in the task directory on the compute node. For more information, see [Files and directories in Batch](files-and-directories.md).
 
 To download these output files, use the Azure portal or one of the Batch SDKs. For example, to retrieve files for troubleshooting purposes, use [ComputeNode.GetNodeFile](/dotnet/api/microsoft.azure.batch.computenode) and [CloudTask.GetNodeFile](/dotnet/api/microsoft.azure.batch.cloudtask) in the Batch .NET library.
 
@@ -44,30 +44,30 @@ If files that you specified for a task fail to upload for any reason, a file upl
 
 - The shared access signature (SAS) token supplied for accessing Azure Storage is invalid.
 - The SAS token doesn't provide write permissions.
-- The storage account is no longer available
+- The storage account is no longer available.
 - Another issue happened that prevented the successful copying of files from the node.
 
 ### Application errors
 
-The process that the task's command line specifies can also fail. For more information, see [Task exit codes](#task-exit-codes).
+The process specified by the task's command line can also fail. For more information, see [Task exit codes](#task-exit-codes).
 
 For application errors, configure Batch to automatically retry the task up to a specified number of times.
 
 ### Constraint errors
 
-To specify the maximum execution duration for a job or task, set the **maxWallClockTime** constraint. Use this setting to terminate tasks that fail to progress.
+To specify the maximum execution duration for a job or task, set the `maxWallClockTime` constraint. Use this setting to terminate tasks that fail to progress.
 
 When the task exceeds the maximum time:
 
-- The task is marked as **completed**.
-- The exit code is set to `0xC000013A`
+- The task is marked as *completed*.
+- The exit code is set to `0xC000013A`.
 - The **schedulingError** field is marked as `{ category:"ServerError", code="TaskEnded"}`.
 
 ## Task exit codes
 
 When a task executes a process, Batch populates the task's exit code property with the return code of the process. If the process returns a nonzero exit code, the Batch service marks the task as failed.
 
-The Batch service doesn't determine a task's exit code. The process itself, or the operating system on which the process executed, determines the exit code.
+The Batch service doesn't determine a task's exit code. The process itself, or the operating system on which the process executes, determines the exit code.
 
 ## Task failures or interruptions
 
@@ -83,16 +83,17 @@ It's also possible for an intermittent issue to cause a task to stop responding
 
 ## Connect to compute nodes
 
-You can perform additional debugging and troubleshooting by signing in to a compute node remotely. Use the Azure portal to download a Remote Desktop Protocol (RDP) file for Windows nodes, and obtain Secure Shell (SSH) connection information for Linux nodes. You can also download this information using the [Batch .NET](/dotnet/api/microsoft.azure.batch.computenode) or [Batch Python](batch-linux-nodes.md#connect-to-linux-nodes-using-ssh) APIs.
+You can perform debugging and troubleshooting by signing in to a compute node remotely. Use the Azure portal to download a Remote Desktop Protocol (RDP) file for Windows nodes, and obtain Secure Shell (SSH) connection information for Linux nodes. You can also download this information using the [Batch .NET](/dotnet/api/microsoft.azure.batch.computenode) or [Batch Python](batch-linux-nodes.md#connect-to-linux-nodes-using-ssh) APIs.
 
 To connect to a node via RDP or SSH, first create a user on the node. Use one of the following methods:
 
-- The Azure portal
+- The [Azure portal](https://portal.azure.com)
 - Batch REST API: [adduser](/rest/api/batchservice/computenode/adduser)
 - Batch .NET API: [ComputeNode.CreateComputeNodeUser](/dotnet/api/microsoft.azure.batch.computenode)
 - Batch Python module: [add_user](batch-linux-nodes.md#connect-to-linux-nodes-using-ssh)
 
-If necessary, [restrict or disable RDP or SSH access to compute nodes](pool-endpoint-configuration.md).
+If necessary, [configure or disable access to compute nodes](pool-endpoint-configuration.md).
+
 ## Troubleshoot problem nodes
 
 Your Batch client application or service can examine the metadata of failed tasks to identify a problem node. Each node in a pool has a unique ID. Task metadata includes the node where a task runs. After you find the problem node, try the following methods to resolve the failure.
@@ -116,11 +117,11 @@ Reimaging a node reinstalls the operating system. Start tasks and job preparatio
 Removing the node from the pool is sometimes necessary.
 
 - Batch REST API: [removenodes](/rest/api/batchservice/pool/remove-nodes)
-- Batch .NET API: [pooloperations](/dotnet/api/microsoft.azure.batch.pooloperations)
+- Batch .NET API: [PoolOperations](/dotnet/api/microsoft.azure.batch.pooloperations)
 
 ### Disable task scheduling on node
 
-Disabling task scheduling on a node effectively takes the node offline. Batch assigns no further tasks to the node. However, the node continues running in the pool. You can then further investigate the failures without losing the failed tasks's data. The node also won't cause additional task failures.
+Disabling task scheduling on a node effectively takes the node offline. Batch assigns no further tasks to the node. However, the node continues running in the pool. You can then further investigate the failures without losing the failed task's data. The node also won't cause more task failures.
 
 For example, disable task scheduling on the node. Then, sign in to the node remotely. Examine the event logs, and do other troubleshooting. After you solve the problems, enable task scheduling again to bring the node back online.
 
@@ -129,9 +130,9 @@ For example, disable task scheduling on the node. Then, sign in to the node remo
 
 You can use these actions to specify Batch handles tasks currently running on the node. For example, when you disable task scheduling with the Batch .NET API, you can specify an enum value for [DisableComputeNodeSchedulingOption](/dotnet/api/microsoft.azure.batch.common.disablecomputenodeschedulingoption). You can choose to:
 
-- Terminate running tasks (`Terminate`).
-- Requeue tasks for scheduling on other nodes (`Requeue`).
-- Allow running tasks to complete before performing the action (`TaskCompletion`).
+- Terminate running tasks: `Terminate`
+- Requeue tasks for scheduling on other nodes: `Requeue`
+- Allow running tasks to complete before performing the action: `TaskCompletion`
 
 ## Retry after errors
 
@@ -141,5 +142,5 @@ After a failure, wait several seconds before retrying. If you retry too frequent
 
 ## Next steps
 
-- [Check for Batch pool and node errors](batch-pool-node-error-checking.md).
-- [Check for Batch job and task errors](batch-job-task-error-checking.md).
+- [Check for Batch pool and node errors](batch-pool-node-error-checking.md)
+- [Check for Batch job and task errors](batch-job-task-error-checking.md)
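
Note on the changed article: the "Error codes" hunk describes throttling-related responses (`429` or `503` with the `Retry-after` header), and the "Retry after errors" hunk advises waiting several seconds before retrying. As a rough illustration of how a client might combine the two, here is a minimal Python sketch. It is not taken from the article or from any Batch SDK: the function name `retry_delay_seconds`, the exponential fallback, and the `base` and `cap` values are all assumptions chosen for illustration.

```python
# Hypothetical helper, not part of any Batch SDK: decide how long to wait
# before retrying a request that was throttled with 429 or 503.

THROTTLING_STATUS_CODES = {429, 503}


def retry_delay_seconds(status_code, headers, attempt, base=2.0, cap=60.0):
    """Return the number of seconds to wait, or None if the response
    is not a throttling response and this sketch does not retry it.

    attempt is the zero-based count of retries already made.
    base and cap are illustrative defaults, not documented Batch values.
    """
    if status_code not in THROTTLING_STATUS_CODES:
        return None

    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    retry_after = lowered.get("retry-after")

    if retry_after is not None:
        try:
            # Honor the server's hint when it is given as a number of seconds.
            return max(0.0, float(retry_after))
        except ValueError:
            # The HTTP-date form of the header is not handled in this sketch.
            pass

    # Assumed fallback: capped exponential backoff.
    return min(cap, base * (2 ** attempt))
```

A caller would sleep for the returned number of seconds before reissuing the request, and give up after a bounded number of attempts. Responses outside `429` and `503` return `None` here because the article ties the `Retry-after` header only to those two codes.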
