articles/batch/batch-pool-node-error-checking.md

Pool errors might be related to resize timeout or failure, automatic scaling failure, or pool deletion failure.

### Resize timeout or failure

When you create a new pool or resize an existing pool, you specify the target number of nodes. The create or resize operation completes immediately, but the actual allocation of new nodes or removal of existing nodes might take several minutes. You can specify the resize timeout in the [Pool - Add](/rest/api/batchservice/pool/add) or [Pool - Resize](/rest/api/batchservice/pool/resize) APIs. If Batch can't allocate the target number of nodes during the resize timeout period, the pool goes into a steady state and reports resize errors.

The [ResizeError](/rest/api/batchservice/pool/get#resizeerror) property lists the errors that occurred for the most recent evaluation.

Common causes for resize errors include:

- **Resize timeout too short.** Usually, the default timeout of 15 minutes is long enough to allocate or remove pool nodes. If you're allocating a large number of nodes, such as more than 1,000 nodes from an Azure Marketplace image, or more than 300 nodes from a custom virtual machine (VM) image, you can set the resize timeout to 30 minutes, as shown in the sketch after this list.
- **Insufficient core quota.** A Batch account is limited in the number of cores it can allocate across all pools, and stops allocating nodes once it reaches that quota. You can increase the core quota so Batch can allocate more nodes. For more information, see [Batch service quotas and limits](batch-quota-limit.md).
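
For example, the following minimal Python sketch resizes a pool with an explicit 30-minute resize timeout and then reads any resize errors once the pool reaches steady state. It assumes the [azure-batch](https://pypi.org/project/azure-batch/) SDK; the account URL, key, pool ID, and node count are placeholders.

```python
import time
from datetime import timedelta

from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

credentials = SharedKeyCredentials("<account-name>", "<account-key>")
# In older SDK versions this keyword is base_url instead of batch_url.
batch_client = BatchServiceClient(
    credentials, batch_url="https://<account>.<region>.batch.azure.com"
)

pool_id = "<pool-id>"

# Request a large resize with an explicit 30-minute resize timeout.
batch_client.pool.resize(
    pool_id,
    batchmodels.PoolResizeParameter(
        target_dedicated_nodes=1000,
        resize_timeout=timedelta(minutes=30),
    ),
)

# Wait for the pool to reach steady state, then check for resize errors.
pool = batch_client.pool.get(pool_id)
while pool.allocation_state != batchmodels.AllocationState.steady:
    time.sleep(30)
    pool = batch_client.pool.get(pool_id)

for err in pool.resize_errors or []:
    print(f"Resize error: {err.code}: {err.message}")
```
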

### Pool deletion failures

If the pool deletion is taking longer than expected, Batch retries periodically. The following issues can block or delay pool deletion:

- Resource locks might be placed on Batch-created resources, or on network resources that Batch uses.
- Resources that you created might depend on a Batch-created resource. For instance, if you [create a pool in a virtual network](batch-virtual-network.md), Batch creates a network security group (NSG), a public IP address, and a load balancer. If you use these resources outside the pool, you must remove that dependency to delete the pool.
- The `Microsoft.Batch` resource provider might be unregistered from the subscription that contains your pool.
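
If you suspect the last cause, the following sketch checks and re-registers the `Microsoft.Batch` resource provider. It assumes the `azure-identity` and `azure-mgmt-resource` packages; the subscription ID is a placeholder.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

client = ResourceManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Check the registration state of the Microsoft.Batch resource provider.
provider = client.providers.get("Microsoft.Batch")
print(f"Microsoft.Batch registration state: {provider.registration_state}")

# Re-register the provider if it was unregistered.
if provider.registration_state != "Registered":
    client.providers.register("Microsoft.Batch")
```
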

### Start task failures

As with any task, there can be many causes for a start task failure. To troubleshoot, check the *stdout*, *stderr*, and any other task-specific log files.
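
For example, the following sketch reads a start task's *stdout* and *stderr* from a node. It assumes the `batch_client` from the earlier resize sketch; the pool and node IDs are placeholders.

```python
pool_id, node_id = "<pool-id>", "<node-id>"

# Start task output is written under the node's "startup" folder.
for file_path in ("startup/stdout.txt", "startup/stderr.txt"):
    content = b"".join(
        batch_client.file.get_from_compute_node(pool_id, node_id, file_path)
    )
    print(f"--- {file_path} ---")
    print(content.decode("utf-8", errors="replace"))
```
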

Start tasks must be re-entrant, because the start task can run multiple times on the same node, for example when the node is reimaged or rebooted. In rare cases, when a start task runs after an event causes a node reboot, one operating system (OS) or ephemeral disk reimages while the other doesn't. Since Batch start tasks and all Batch tasks run from the ephemeral disk, this situation isn't usually a problem. However, in some cases where the start task installs an application to the OS disk and keeps other data on the ephemeral disk, there can be sync problems. Protect your application accordingly if you use both disks.

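As an illustration, the following sketch defines a start task whose command line can safely run more than once on the same node. It assumes the `batch_client` from the earlier sketch; the install script and marker file are hypothetical.

```python
import azure.batch.models as batchmodels

start_task = batchmodels.StartTask(
    # Skip the install if a previous run on this node already completed it.
    command_line=(
        "/bin/bash -c 'if [ ! -f installed.marker ]; then "
        "./install-app.sh && touch installed.marker; fi'"
    ),
    wait_for_success=True,   # hold task scheduling until the start task succeeds
    max_task_retry_count=2,  # retry transient failures before failing the node
)

# Apply the start task to an existing pool.
batch_client.pool.patch(
    "<pool-id>", batchmodels.PoolPatchParameter(start_task=start_task)
)
```
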
### Application package download failure
### Node in unusable state

Batch might set the [computeNodeState](/rest/api/batchservice/computenode/get#computenodestate) to `unusable` for many reasons. You can't schedule tasks to an `unusable` node, but the node still incurs charges.

If Batch can determine the cause, the [computeNodeError](/rest/api/batchservice/computenode/get#computenodeerror) property reports it. If a node is in an `unusable` state but has no [computeNodeError](/rest/api/batchservice/computenode/get#computenodeerror), Batch can't communicate with the VM. In this case, Batch always tries to recover the VM. However, Batch doesn't automatically attempt to recover VMs that failed to install application packages or containers, even if their state is `unusable`.

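For example, this sketch lists the nodes in a pool and prints any errors reported on `unusable` nodes. It assumes the `batch_client` from the earlier sketch; the pool ID is a placeholder.

```python
import azure.batch.models as batchmodels

pool_id = "<pool-id>"

for node in batch_client.compute_node.list(pool_id):
    if node.state == batchmodels.ComputeNodeState.unusable:
        if node.errors:
            for err in node.errors:
                print(f"{node.id}: {err.code}: {err.message}")
        else:
            # No computeNodeError reported: Batch can't reach the VM and
            # keeps trying to recover it.
            print(f"{node.id}: unusable with no reported error")
```
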

Other possible causes for `unusable` nodes include:

- A custom VM image is invalid. For example, the image isn't properly prepared.
- A VM is moved because of an infrastructure failure or a low-level upgrade. Batch recovers the node.
- A VM image has been deployed on hardware that doesn't support it. For example, a CentOS HPC image is deployed on a [Standard_D1_v2](/azure/virtual-machines/dv2-dsv2-series) VM.
- The VMs are in an [Azure virtual network](batch-virtual-network.md), and traffic to key ports is blocked.
- The VMs are in a virtual network, but outbound traffic to Azure Storage is blocked.
- The VMs are in a virtual network with a custom DNS configuration, and the DNS server can't resolve Azure Storage.

### Node agent log files

The Batch agent process that runs on each pool node provides log files that might help if you need to contact support about a pool node issue. You can upload log files for a node via the Azure portal, Batch Explorer, or the [Compute Node - Upload Batch Service Logs](/rest/api/batchservice/computenode/uploadbatchservicelogs) API. Upload and save the log files and then delete the node or pool to save the cost of running the nodes.

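For example, the following sketch triggers a log upload for one node. It assumes the `batch_client` from the earlier sketch; the SAS container URL and IDs are placeholders, and the container must grant the Batch service write access.

```python
from datetime import datetime, timedelta

import azure.batch.models as batchmodels

result = batch_client.compute_node.upload_batch_service_logs(
    "<pool-id>",
    "<node-id>",
    batchmodels.UploadBatchServiceLogsConfiguration(
        container_url="https://<storage-account>.blob.core.windows.net/<container>?<sas-token>",
        # Upload agent logs from the last four hours.
        start_time=datetime.utcnow() - timedelta(hours=4),
    ),
)
print(f"Uploading {result.number_of_files_uploaded} files "
      f"to {result.virtual_directory_name}")
```
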
### Node disk full

Batch uses the temporary drive on a node pool VM to store job files, task files, and shared files, such as:

- Application package files
- Task resource files
- Application-specific files downloaded to one of the Batch folders
- *Stdout* and *stderr* files for each task application execution
- Application-specific output files

Some of these files, such as pool application packages or pool start task resource files, are written only once, when Batch creates the pool nodes. Even though they're written only once, if these files are too large, they could fill the temporary drive.

Other files, such as *stdout* and *stderr*, are written for each task that a node runs. If a large number of tasks run on the same node, or the task files are too large, they could fill the temporary drive.
The node also needs a small amount of space on the OS disk to create users after it starts.
The size of the temporary drive depends on the VM size. One consideration when picking a VM size is to ensure that the temporary drive has enough space for the planned workload.

When you add a pool in the Azure portal, you can display the full list of VM sizes, including a **Resource disk size** column. The articles that describe VM sizes have tables with a **Temp Storage** column. For more information, see [Compute optimized virtual machine sizes](/azure/virtual-machines/sizes-compute). For an example size table, see [Fsv2-series](/azure/virtual-machines/fsv2-series).

You can specify a retention time for files written by each task. The retention time determines how long to keep the task files before automatically cleaning them up. You can reduce the retention time to lower storage requirements.
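
For example, this sketch adds a task with a one-hour retention time instead of the default of seven days. It assumes the `batch_client` from the earlier sketch; the job ID, task ID, and command line are placeholders.

```python
from datetime import timedelta

import azure.batch.models as batchmodels

task = batchmodels.TaskAddParameter(
    id="<task-id>",
    command_line="/bin/bash -c 'echo hello'",
    constraints=batchmodels.TaskConstraints(
        retention_time=timedelta(hours=1),  # default is 7 days
    ),
)
batch_client.task.add("<job-id>", task)
```

A shorter retention time mainly helps pools that run many tasks per node, because each task's files stay on the temporary drive until cleanup.
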
You can delete old completed jobs or tasks whose task data is still on the nodes. Look in the `recentTasks` collection in the [taskInformation](/rest/api/batchservice/computenode/get#taskinformation) on the node, or use the [File - List From Compute Node](/rest/api/batchservice/file/listfromcomputenode) API. Deleting a job deletes all the tasks in the job. Deleting the tasks in the job triggers deletion of data in the task directories on the nodes, and frees up space. Once you've freed up enough space, reboot the node. The node should move out of `unusable` state and into `idle` again.
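
For example, the following sketch lists the files still on a node, deletes an old job to trigger cleanup of its task data, and then reboots the node. It assumes the `batch_client` from the earlier sketch; the IDs are placeholders.

```python
pool_id, node_id = "<pool-id>", "<node-id>"

# List task files still on the node, with their sizes.
for f in batch_client.file.list_from_compute_node(pool_id, node_id, recursive=True):
    if not f.is_directory:
        print(f.name, f.properties.content_length)

# Deleting a job deletes its tasks, which deletes the task data on nodes.
batch_client.job.delete("<old-job-id>")

# After space is freed, reboot so the node can return to the idle state.
batch_client.compute_node.reboot(pool_id, node_id)
```
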

To recover an unusable node in [VirtualMachineConfiguration](/rest/api/batchservice/pool/add#virtualmachineconfiguration) pools, you can remove the node from the pool by using the [Pool - Remove Nodes](/rest/api/batchservice/pool/removenodes) API. Then you can grow the pool again to replace the bad node with a fresh one. For [CloudServiceConfiguration](/rest/api/batchservice/pool/add#cloudserviceconfiguration) pools, you can reimage the node by using the [Compute Node - Reimage](/rest/api/batchservice/computenode/reimage) API to clean the entire disk. Reimage isn't currently supported for [VirtualMachineConfiguration](/rest/api/batchservice/pool/add#virtualmachineconfiguration) pools.
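
For example, this sketch removes a bad node and then resizes the pool to replace it. It assumes the `batch_client` from the earlier sketch; the node ID and target node count are placeholders.

```python
import azure.batch.models as batchmodels

pool_id = "<pool-id>"

# Remove the bad node from the pool.
batch_client.pool.remove_nodes(
    pool_id,
    batchmodels.NodeRemoveParameter(node_list=["<node-id>"]),
)

# Once the removal finishes, resize back up to replace the bad node.
batch_client.pool.resize(
    pool_id,
    batchmodels.PoolResizeParameter(target_dedicated_nodes=10),
)
```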