Commit 1468ddc

splitting up topic
1 parent 7683345 commit 1468ddc

9 files changed

+589
-2
lines changed

articles/batch/TOC.yml

Lines changed: 20 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -53,8 +53,27 @@
5353
items:
5454
- name: Security baseline
5555
href: security-baseline.md
56-
- name: Developer features
56+
- name: "[OLD] Developer features"
5757
href: batch-api-basics.md
58+
- name: Batch service workflow and resources
59+
displayName: developer features
60+
href: batch-service-workflow-resources.md
61+
items:
62+
- name: Batch accounts
63+
displayName: storage account
64+
href: accounts.md
65+
- name: Nodes and pools
66+
displayName: compute node, application package, scaling, schedule, os
67+
href: nodes-and-pools.md
68+
- name: Jobs and tasks
69+
displayName: Batch job
70+
href: jobs-and-tasks.md
71+
- name: Files and directories
72+
displayName: Batch file
73+
href: files-and-directories.md
74+
- name: Error handling
75+
displayName: troubleshooting
76+
href: error-handling.md
5877
- name: APIs and tools
5978
href: batch-apis-tools.md
6079
- name: Detecting and handling Batch service errors

articles/batch/accounts.md

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
1+
---
2+
title: Batch accounts and Azure Storage accounts
3+
description: Learn about Azure Batch accounts and how they are used from a development standpoint.
4+
ms.topic: conceptual
5+
ms.date: 05/12/2020
6+
7+
---
8+
# Batch accounts and Azure Storage accounts
9+
10+
An Azure Batch account is a uniquely identified entity within the Batch service. Most Batch solutions use [Azure Storage](../azure/storage/index.yml) for storing resource files and output files, so each Batch account is usually associated with a corresponding storage account.
11+
12+
## Batch accounts
13+
14+
All processing and resources are associated with a Batch account. When your application makes a request against the Batch service, it authenticates the request using the Azure Batch account name, the URL of the account, and either an access key or an Azure Active Directory token.
15+
16+
You can run multiple Batch workloads in a single Batch account. You can also distribute your workloads among Batch accounts that are in the same subscription but located in different Azure regions.
17+
18+
[!INCLUDE [batch-account-mode-include](../../includes/batch-account-mode-include.md)]
19+
20+
You can create a Batch account using the [Azure portal](batch-account-create-portal.md) or programmatically, such as with the [Batch Management .NET library](batch-management-dotnet.md). When creating the account, you can associate an Azure storage account for storing job-related input and output data or applications.
21+
22+
## Azure Storage accounts
23+
24+
Most Batch solutions use Azure Storage for storing resource files and output files. For example, your Batch tasks (including standard tasks, start tasks, job preparation tasks, and job release tasks) typically specify resource files that reside in a storage account. Storage accounts also store the data that is processed and any output data that is generated.
25+
26+
Batch supports the following types of Azure Storage accounts:
27+
28+
- General-purpose v2 (GPv2) accounts
29+
- General-purpose v1 (GPv1) accounts
30+
- Blob storage accounts (currently supported for pools in the Virtual Machine configuration)
31+
32+
For more information about storage accounts, see [Azure storage account overview](../storage/common/storage-account-overview.md).
33+
34+
You can associate a storage account with your Batch account when you create the Batch account, or later. Consider your cost and performance requirements when choosing a storage account. For example, the GPv2 and blob storage account options support greater [capacity and scalability limits](https://azure.microsoft.com/blog/announcing-larger-higher-scale-storage-accounts/) compared with GPv1. (Contact Azure Support to request an increase in a storage limit.) These account options can improve the performance of Batch solutions that contain a large number of parallel tasks that read from or write to the storage account.
35+
36+
## Next steps
37+
38+
- Learn about [Nodes and pools](nodes-and-pools.md).
39+
- Learn how to create a Batch account using the [Azure portal](batch-account-create-portal.md).

articles/batch/batch-api-basics.md

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ ms.date: 08/29/2019
66
ms.custom: seodec18
77

88
---
9-
# Develop large-scale parallel compute solutions with Batch
9+
# [OLD] Develop large-scale parallel compute solutions with Batch
1010

1111
In this overview of the core components of the Azure Batch service, we discuss the primary service features and resources that Batch developers can use to build large-scale parallel compute solutions.
1212

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
1+
---
2+
title: Batch service workflow and resources
3+
description: Learn about the features of the Batch service and its high-level workflow from a development standpoint.
4+
ms.topic: conceptual
5+
ms.date: 05/12/2020
6+
7+
---
8+
# Batch service workflow and resources
9+
10+
In this overview of the core components of the Azure Batch service, we discuss the primary service resources that Batch developers can use to build large-scale parallel compute solutions.
11+
12+
Whether you're developing a distributed computational application or service that issues direct [REST API](https://docs.microsoft.com/rest/api/batchservice/) calls or you're using another one of the [Batch SDKs](batch-apis-tools.md#batch-service-apis), you'll use many of the resources and features discussed in this article.
13+
14+
> [!TIP]
15+
> For a higher-level introduction to the Batch service, see [Basics of Azure Batch](batch-technical-overview.md). Also see the latest [Batch service updates](https://azure.microsoft.com/updates/?product=batch).
16+
17+
## Batch service workflow
18+
19+
The following high-level workflow is typical of nearly all applications and services that use the Batch service for processing parallel workloads:
20+
21+
1. Upload the **data files** that you want to process to an [Azure Storage](../azure/storage/index.yml) account. Batch includes built-in support for accessing Azure Blob storage, and your tasks can download these files to [compute nodes](nodes-and-pools.md#nodes) when the tasks are run.
22+
2. Upload the **application files** that your tasks will run. These files can be binaries or scripts and their dependencies, and are executed by the tasks in your jobs. Your tasks can download these files from your Storage account, or you can use the [application packages](nodes-and-pools.md#application-packages) feature of Batch for application management and deployment.
23+
3. Create a [pool](nodes-and-pools.md#pools) of compute nodes. When you create a pool, you specify the number of compute nodes for the pool, their size, and the operating system. When each task in your job runs, it's assigned to execute on one of the nodes in your pool.
24+
4. Create a [job](jobs-and-tasks.md#jobs). A job manages a collection of tasks. You associate each job to a specific pool where that job's tasks will run.
25+
5. Add [tasks](jobs-and-tasks.md#tasks) to the job. Each task runs the application or script that you uploaded to process the data files it downloads from your Storage account. As each task completes, it can upload its output to Azure Storage.
26+
6. Monitor job progress and retrieve the task output from Azure Storage.
27+
28+
> [!NOTE]
29+
> You need a [Batch account](accounts.md) to use the Batch service. Most Batch solutions also use an associated [Azure Storage](../azure/storage/index.yml) account for file storage and retrieval.
30+
31+
## Batch service resources
32+
33+
The following sections discuss the resources of Batch that enable your distributed computational scenario. Some of these--accounts, compute nodes, pools, jobs, and tasks--are required by all solutions that use the Batch service. Others, like job schedules and application packages, are helpful but optional features.
34+
35+
- [Batch accounts and storage accounts](accounts.md)
36+
- [Nodes and pools](nodes-and-pools.md)
37+
- [Jobs and tasks](jobs-and-tasks.md)
38+
- [Files and directories](files-and-directories.md)
39+
- [Error handling](error-handling.md)
40+
- Application packages? or can that fit somewhere else?
41+
42+
43+
## Next steps
44+
45+
* Learn about the [Batch APIs and tools](batch-apis-tools.md) available for building Batch solutions.
46+
* Learn the basics of developing a Batch-enabled application using the [Batch .NET client library](quick-run-dotnet.md) or [Python](quick-run-python.md). These quickstarts guide you through a sample application that uses the Batch service to execute a workload on multiple compute nodes, and includes using Azure Storage for workload file staging and retrieval.
47+
* Download and install [Batch Explorer](https://azure.github.io/BatchExplorer/) for use while you develop your Batch solutions. Use Batch Explorer to help create, debug, and monitor Azure Batch applications.
48+
* See community resources including [Stack Overflow](https://stackoverflow.com/questions/tagged/azure-batch), the [Batch Community repo](https://github.com/Azure/Batch), and the [Azure Batch forum](https://social.msdn.microsoft.com/Forums/en-US/home?forum=azurebatch) on MSDN.

articles/batch/error-handling.md

Lines changed: 92 additions & 0 deletions
@@ -0,0 +1,92 @@
1+
---
2+
title: Error handling in Azure Batch
3+
description: Learn about error handling in Batch service workflows from a development standpoint.
4+
ms.topic: conceptual
5+
ms.date: 05/12/2020
6+
7+
---
8+
# Error handling in Azure Batch
9+
10+
At times, you may find it necessary to handle both task and application failures within your Batch solution. This article talks about types of errors and how to resolve them.
11+
12+
**Should this be combined with [Detecting and handling Batch service errors](batch-retry-after-errors.md)? And/or moved out of this section?**
13+
14+
## Application failures
15+
16+
During execution, an application might produce diagnostic output that you can use to troubleshoot issues. As described in [Files and directories](files-and-directories.md), the Batch service writes standard output and standard error output to `stdout.txt` and `stderr.txt` files in the task directory on the compute node.
17+
18+
You can use the Azure portal or one of the Batch SDKs to download these files. For example, you can retrieve these and other files for troubleshooting purposes by using [ComputeNode.GetNodeFile](https://docs.microsoft.com/dotnet/api/microsoft.azure.batch.computenode) and [CloudTask.GetNodeFile](https://docs.microsoft.com/dotnet/api/microsoft.azure.batch.cloudtask) in the Batch .NET library.
19+
20+
## Task errors
21+
22+
Task errors fall into these categories:
23+
24+
### Pre-processing errors
25+
26+
If a task fails to start, a pre-processing error is set for the task.
27+
28+
Pre-processing errors can occur if the task's resource files have moved, the storage account is no longer available, or another issue was encountered that prevented the successful copying of files to the node.
29+
30+
### File upload errors
31+
32+
If files that are specified for a task fail to upload for any reason, a file upload error is set for the task.
33+
34+
File upload errors can occur if the SAS supplied for accessing Azure Storage is invalid or does not provide write permissions, if the storage account is no longer available, or if another issue was encountered that prevented the successful copying of files from the node.
35+
36+
### Application errors
37+
38+
The process that is specified by the task's command line can also fail. The process is deemed to have failed when a nonzero exit code is returned by the process that is executed by the task (see *Task exit codes* in the next section).
39+
40+
For application errors, you can configure Batch to automatically retry the task up to a specified number of times.
41+
42+
### Constraint errors
43+
44+
You can set a constraint that specifies the maximum execution duration for a job or task, the *maxWallClockTime*. This can be useful for terminating tasks that fail to progress.
45+
46+
When the maximum amount of time has been exceeded, the task is marked as *completed*, but the exit code is set to `0xC000013A` and the *schedulingError* field is marked as `{ category: "ServerError", code: "TaskEnded" }`.
47+
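The task error categories above can be told apart with a small client-side check. The following is a minimal sketch, not Batch SDK code: it assumes the task's execution details have already been copied into a plain dictionary, and the keys `exitCode` and `schedulingError`, the function name `classify_task`, and the returned labels are all hypothetical names chosen for illustration.

```python
# Hypothetical helper: the dictionary shape is an assumption for illustration,
# not the object model of any Batch SDK.
WALL_CLOCK_EXIT_CODE = 0xC000013A  # exit code the article associates with maxWallClockTime


def classify_task(execution_info):
    """Return a coarse label describing how a completed task ended."""
    error = execution_info.get("schedulingError")
    exit_code = execution_info.get("exitCode")

    # Constraint error: the timeout exit code together with a TaskEnded error.
    if error and error.get("code") == "TaskEnded" and exit_code == WALL_CLOCK_EXIT_CODE:
        return "constraint"
    # Any other recorded error (for example pre-processing or file upload).
    if error:
        return "other_error"
    # No exit code recorded, so nothing can be concluded.
    if exit_code is None:
        return "unknown"
    # Application error: the process returned a nonzero exit code.
    if exit_code != 0:
        return "application"
    return "success"
```

For example, `classify_task({"exitCode": 1})` returns `"application"`, while a dictionary holding the timeout exit code and a `TaskEnded` error returns `"constraint"`.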
48+
## Task exit codes
49+
50+
As mentioned earlier, a task is marked as failed by the Batch service if the process that is executed by the task returns a nonzero exit code. When a task executes a process, Batch populates the task's exit code property with the return code of the process.
51+
52+
It is important to note that a task's exit code is not determined by the Batch service. A task's exit code is determined by the process itself or the operating system on which the process executed.
53+
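To illustrate that the exit code comes from the process rather than from the service that launches it, here is a local sketch that uses only the Python standard library. It does not show how Batch runs tasks; the helper name `run_and_report` is hypothetical, and the child commands are stand-ins for a task's command line.

```python
import subprocess
import sys


def run_and_report(command):
    """Run a command and return (exit_code, failed).

    The exit code is whatever the child process returned; this wrapper only
    applies the nonzero-means-failed rule described above.
    """
    completed = subprocess.run(command)
    return completed.returncode, completed.returncode != 0


# Stand-in "tasks": child Python processes that exit with a chosen code.
ok = run_and_report([sys.executable, "-c", "raise SystemExit(0)"])
bad = run_and_report([sys.executable, "-c", "raise SystemExit(3)"])
```

Here `ok` is `(0, False)` and `bad` is `(3, True)`: the wrapper reports the code but never chooses it.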
54+
## Task failures or interruptions
55+
56+
Tasks might occasionally fail or be interrupted. The task application itself might fail, the node on which the task is running might be rebooted, or the node might be removed from the pool during a resize operation (if the pool's deallocation policy is set to remove nodes immediately without waiting for tasks to finish). In all cases, the task can be automatically requeued by Batch for execution on another node.
57+
58+
It is also possible for an intermittent issue to cause a task to stop responding or take too long to execute. You can set the maximum execution interval for a task. If the maximum execution interval is exceeded, the Batch service interrupts the task application.
59+
60+
## Connect to compute nodes
61+
62+
You can perform additional debugging and troubleshooting by signing in to a compute node remotely. You can use the Azure portal to download a Remote Desktop Protocol (RDP) file for Windows nodes and obtain Secure Shell (SSH) connection information for Linux nodes. You can also do this by using the Batch APIs such as with [Batch .NET](https://docs.microsoft.com/dotnet/api/microsoft.azure.batch.computenode) or [Batch Python](batch-linux-nodes.md#connect-to-linux-nodes-using-ssh).
63+
64+
> [!IMPORTANT]
65+
> To connect to a node via RDP or SSH, you must first create a user on the node. To do this, you can use the Azure portal, [add a user account to a node](https://docs.microsoft.com/rest/api/batchservice/computenode/adduser) by using the Batch REST API, call the [ComputeNode.CreateComputeNodeUser](https://docs.microsoft.com/dotnet/api/microsoft.azure.batch.computenode) method in Batch .NET, or call the [add_user](batch-linux-nodes.md#connect-to-linux-nodes-using-ssh) method in the Batch Python module.
66+
67+
If you need to restrict or disable RDP or SSH access to compute nodes, see [Configure or disable remote access to compute nodes in an Azure Batch pool](pool-endpoint-configuration.md).
68+
69+
## Troubleshoot problem nodes
70+
71+
In situations where some of your tasks are failing, your Batch client application or service can examine the metadata of the failed tasks to identify a misbehaving node. Each node in a pool is given a unique ID, and the node on which a task runs is included in the task metadata. After you've identified a problem node, you can take several actions with it:
72+
73+
- **Reboot the node** ([REST](https://docs.microsoft.com/rest/api/batchservice/computenode/reboot) | [.NET](https://docs.microsoft.com/dotnet/api/microsoft.azure.batch.computenode.reboot))
74+
75+
Restarting the node can sometimes clear up latent issues like stuck or crashed processes. If your pool uses a start task or your job uses a job preparation task, they are executed when the node restarts.
76+
- **Reimage the node** ([REST](https://docs.microsoft.com/rest/api/batchservice/computenode/reimage) | [.NET](https://docs.microsoft.com/dotnet/api/microsoft.azure.batch.computenode.reimage))
77+
78+
This reinstalls the operating system on the node. As with rebooting a node, start tasks and job preparation tasks are rerun after the node has been reimaged.
79+
- **Remove the node from the pool** ([REST](https://docs.microsoft.com/rest/api/batchservice/pool/removenodes) | [.NET](https://docs.microsoft.com/dotnet/api/microsoft.azure.batch.pooloperations))
80+
81+
Sometimes it is necessary to completely remove the node from the pool.
82+
- **Disable task scheduling on the node** ([REST](https://docs.microsoft.com/en-us/rest/api/batchservice/computenode/disablescheduling) | [.NET](https://docs.microsoft.com/dotnet/api/microsoft.azure.batch.computenode.disablescheduling))
83+
84+
This effectively takes the node offline so that no further tasks are assigned to it, but allows the node to remain running and in the pool. This enables you to perform further investigation into the cause of the failures without losing the failed task's data, and without the node causing additional task failures. For example, you can disable task scheduling on the node, then sign in remotely to examine the node's event logs or perform other troubleshooting. After you've finished your investigation, you can then bring the node back online by enabling task scheduling ([REST](https://docs.microsoft.com/rest/api/batchservice/computenode/enablescheduling) | [.NET](https://docs.microsoft.com/dotnet/api/microsoft.azure.batch.computenode.enablescheduling)), or perform one of the other actions discussed earlier.
85+
86+
> [!IMPORTANT]
87+
> With the actions described above, you can specify how tasks currently running on the node are handled when you perform the action. For example, when you disable task scheduling on a node by using the Batch .NET client library, you can pass a [DisableComputeNodeSchedulingOption](https://docs.microsoft.com/dotnet/api/microsoft.azure.batch.common.disablecomputenodeschedulingoption) enum value to **Terminate** running tasks, **Requeue** them for scheduling on other nodes, or allow running tasks to complete before performing the action (**TaskCompletion**).
88+
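Identifying a misbehaving node from failed-task metadata, as described in this section, amounts to grouping failures by node ID. The sketch below assumes the task metadata has already been retrieved into plain dictionaries; the keys `node_id` and `exit_code`, the function name `suspect_nodes`, and the `min_failures` threshold are hypothetical and not part of any Batch API.

```python
from collections import Counter


def suspect_nodes(tasks, min_failures=2):
    """Return node IDs with at least `min_failures` failed tasks, most failures first."""
    failures = Counter(
        task.get("node_id")
        for task in tasks
        # A task counts as failed when it has a nonzero exit code and a known node.
        if task.get("exit_code") not in (None, 0) and task.get("node_id")
    )
    return [node for node, count in failures.most_common() if count >= min_failures]


# Illustrative metadata only; real values would come from the Batch service.
tasks = [
    {"node_id": "node-a", "exit_code": 1},
    {"node_id": "node-a", "exit_code": 2},
    {"node_id": "node-b", "exit_code": 1},
    {"node_id": "node-c", "exit_code": 0},
]
suspects = suspect_nodes(tasks)
```

With this sample data only `node-a` crosses the default threshold. A single failure on a node is often an application problem rather than a node problem, which is why the sketch uses a threshold instead of flagging every node that had a failed task.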
89+
## Next steps
90+
91+
- Learn how to [check for pool and node errors](batch-pool-node-error-checking.md).
92+
- Learn how to [check for job and task errors](batch-job-task-error-checking.md).
