Commit c1bc3fa

authored
Merge pull request #9979 from v-tappelgate/AB#7781
AB#7781:Spaceport IO stops when UsableBlocks for metadata updates reach 0
2 parents 3e2f54e + ec640da commit c1bc3fa

File tree

2 files changed

+183
-0
lines changed

Lines changed: 181 additions & 0 deletions
@@ -0,0 +1,181 @@
1+
---
2+
title: CSV Goes Offline after a Node or Storage Component Goes Offline During Active I/O
3+
description: Discusses a change in Cluster Shared Volume behavior that sends the CSV offline and could cause it to fail.
4+
ms.date: 11/05/2025
5+
author: kaushika-msft
6+
ms.author: kaushika
7+
manager: dcscontentpm
8+
audience: itpro
9+
ms.topic: troubleshooting
10+
ms.custom:
11+
- sap:clustering and high availability\cluster shared volume (csv)
12+
- pcy:WinComm Storage High Avail
13+
ms.reviewer: kaushika, v-mikiwu, v-appelgatet
14+
appliesto:
15+
- <a href=https://learn.microsoft.com/windows/release-health/windows-server-release-info target=_blank>Windows Server 2025</a>
16+
---
17+
# Cluster Shared Volume goes offline after a node or storage component goes offline
18+
19+
This article discusses a situation in which the Cluster Shared Volume (CSV) of a cluster goes offline after other components go offline. The article includes steps to resolve the issue and, if it's necessary, recover any affected virtual machines (VMs).
20+
21+
## Symptoms
22+
23+
This issue starts under the following circumstances:
24+
25+
1. A cluster node or storage component becomes unavailable, but I/O operations continue. For example, a disk array fails or requires maintenance.
26+
1. As I/O operations continue, metadata records accumulate.
27+
1. When the metadata records reach their allocated limits, I/O operations fail.
28+
1. The associated CSV enters a Failed state.
29+
1. Every 15 minutes (the default setting), the cluster tries to bring the CSV online. If the Virtual Machine Management Service (VMMS) manages VMs on the cluster, VMMS periodically tries to start the VMs.
30+
1. After 30 minutes, VMMS stops trying to start the VMs. Any VMs that use the affected CSV can't automatically recover.
31+
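To confirm that a cluster is in the state that the preceding steps describe, you can list the CSVs and cluster resources and keep only those that report a Failed state. The following is a minimal sketch, not a definitive diagnostic. It assumes that the FailoverClusters module is available on the node. The helper name `Select-FailedResource` is hypothetical and isn't part of any module.

```powershell
# Hypothetical helper: passes through only the objects whose State property is "Failed".
# It works on any object that exposes a State property, such as the output of
# Get-ClusterSharedVolume or Get-ClusterResource.
function Select-FailedResource {
    param(
        [Parameter(ValueFromPipeline = $true)]
        $InputObject
    )
    process {
        # Convert State to a string so that both enum values and plain strings match.
        if ("$($InputObject.State)" -eq 'Failed') {
            $InputObject
        }
    }
}

# Usage on a cluster node (requires the FailoverClusters module):
# Get-ClusterSharedVolume | Select-FailedResource
# Get-ClusterResource | Select-FailedResource
```

The filter is kept separate from the cluster cmdlets so that you can reuse it for both CSVs and other cluster resources.
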
32+
## Cause
33+
34+
A recent change in cluster behavior affects how the CSV responds in the situation that's mentioned in the "Symptoms" section. Previously, when metadata records accumulated to the allocated limits, I/O operations could stop indefinitely. Because of the change, I/O operations fail in this situation instead of hanging. The I/O failure, in turn, causes the CSV to go offline and enter a Failed state.
35+
36+
## Recovery
37+
38+
You can use one of two methods to recover the cluster. The method to use depends on whether you can restore the previous cluster components, or you have to replace parts of the cluster.
39+
40+
### Method 1: Restore the offline component and automatically repair the cluster
41+
42+
After you restore the offline node or storage, the following steps occur automatically.
43+
44+
1. The next time that the cluster automatically tries to bring the CSV online, it succeeds.
45+
1. Automatic repair processes start, and then the volume becomes available.
46+
47+
> [!IMPORTANT]
48+
> After the cluster recovers, you might have to manually start any VMMS-managed VMs that use the cluster. After the cluster is down for 30 minutes, VMMS stops automatically trying to restart the VMs.
49+
50+
### Method 2: Replace the offline component and manually recover the cluster
51+
52+
If you can't restore the missing node or storage, follow these steps to manually recover the cluster.
53+
54+
> [!IMPORTANT]
55+
>
56+
> - This procedure temporarily takes all volumes in the pool offline.
57+
> - You can use this procedure for either Storage Spaces or failover clusters. Every step that applies to virtual disks or volumes also applies to Cluster Virtual Disks and Cluster Shared Volumes.
58+
59+
Run the following steps as a cluster administrator on a node that has full access to the storage pool.
60+
61+
1. On that node, open an elevated PowerShell window.
62+
63+
1. To get the properties of the affected storage pool, run the following command at the PowerShell command prompt:
64+
65+
```powershell
Get-ClusterResource <Pool>
```
68+
69+
> [!NOTE]
70+
>
71+
> - In this cmdlet, \<Pool> is the name of the storage pool resource.
72+
> - Later steps in this procedure use properties such as the name and owner group of the resource.
73+
74+
1. To get the properties of the storage pool's virtual disks and CSVs, run the following command:
75+
76+
```powershell
Get-ClusterResource | Where-Object { $_.ResourceType -eq "Physical Disk" }
```

```powershell
Get-ClusterSharedVolume
```
83+
84+
1. Review the properties to determine which resources are in a "Failed" state.
85+
86+
> [!NOTE]
87+
> If you intend to reuse name and ID information for any resources that you replace, you can use `Get-ClusterResource` and `Get-ClusterParameter` to get that information.
88+
89+
1. Whether you're replacing a node or only storage, run the following command to add unpooled disks to the storage pool:

```powershell
Get-StoragePool -IsPrimordial $false | Add-PhysicalDisk -PhysicalDisks $(Get-PhysicalDisk -CanPool $true)
```
94+
95+
1. To retire unhealthy disks (or disks that are associated with a failed node), run the following command:

```powershell
Get-PhysicalDisk | Where-Object { $_.OperationalStatus -eq "Lost Communication" } | Set-PhysicalDisk -Usage Retired
```
100+
101+
1. Monitor the virtual disks by running `Get-VirtualDisk` and checking the `OperationalStatus` property in the cmdlet output. When no virtual disk reports `InService`, go to the next step.
102+
103+
1. To move the affected storage pool (that you identified previously) to the current node, run a PowerShell command that resembles the following:

```powershell
Move-ClusterGroup -Name <OwnerGroup> -Node <Current Node>
```

> [!NOTE]
> In this command, \<Current Node> is the name of the node that you're working from, and \<OwnerGroup> is the value of the OwnerGroup property of the storage pool resource.

1. To move the failed disk and CSV resources to the current node, run `Move-ClusterGroup` again for the owner group of each physical disk resource. For each CSV, you can use `Move-ClusterSharedVolume` and specify the same node. To see the OwnerGroup value of the CSV, run `Get-ClusterSharedVolume | Get-ClusterGroup`.
113+
114+
1. To remove all cluster virtual disks and CSVs from cluster management, run the following PowerShell commands in sequence:

```powershell
# Convert every CSV back to an ordinary clustered disk.
Get-ClusterSharedVolume | Remove-ClusterSharedVolume
# Remove all physical disk resources from cluster management.
Get-ClusterResource | Where-Object { $_.ResourceType -eq "Physical Disk" } | Remove-ClusterResource
```
120+
121+
1. To remove the storage pool from cluster management, run the `Remove-ClusterResource` command for the storage pool objects that you identified in step 2 of this procedure.
122+
123+
1. To make the storage pool writable, run the following command:

```powershell
Get-StoragePool -IsPrimordial $false | Set-StoragePool -IsReadOnly $false
```
128+
129+
1. To configure the virtual disks that you identified in step 3 of this procedure, run the following command:

```powershell
Get-VirtualDisk | Set-VirtualDisk -IsManualAttach $false
```
134+
135+
1. Use the `Get-StorageJob` cmdlet to monitor the storage jobs that are related to repair. After the jobs start (the percentage completed is greater than 0), go to the next step.
136+
137+
1. To restore the storage pool to cluster management, run the following command:

```powershell
Get-CimInstance -Namespace "root\MSCluster" -ClassName "MSCluster_AvailableStoragePool" | Invoke-CimMethod -MethodName AddToCluster
```
142+
143+
1. Restore all non-failed virtual disks to cluster management. If any of the virtual disks from the previous step were configured as CSVs before the failure, convert them to CSVs.
144+
145+
For example, you can bring back any of the virtual disk or CSV resources that weren't in a failed state. To restore these resources, use the `virtualdiskid` and `name` property values from step 3, and then run commands that resemble the following script excerpt:
146+
147+
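```powershell
# Example values. Replace them with the name and ID that you recorded in step 3.
$virtualdiskname = "ClusterPerformanceHistory"
$virtualdiskid = "603bb5d0-9c4d-4fc6-9c25-eec92a478733"
# Add the matching available disk to the cluster, and restore its resource name.
(Get-ClusterAvailableDisk | Where-Object { $_.Id -eq $virtualdiskid } | Add-ClusterDisk).Name = $virtualdiskname
```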
152+
153+
You can use `Add-ClusterSharedVolume` to reconfigure the CSVs.
154+
155+
1. Monitor the virtual disks by running `Get-VirtualDisk` and checking the `OperationalStatus` property in the cmdlet output. When no virtual disk reports `InService`, continue to the next step.
156+
157+
1. To bring the virtual disks online and configure them as read/write, run the following commands:

```powershell
Get-VirtualDisk | Get-Disk | Where-Object { $_.IsReadOnly -eq $true } | Set-Disk -IsReadOnly $false
Get-VirtualDisk | Get-Disk | Where-Object { $_.IsOffline -eq $true } | Set-Disk -IsOffline $false
```
163+
164+
1. Monitor the virtual disk footprints of the retired physical disks by running the following command:

```powershell
Get-PhysicalDisk -Usage Retired | Format-Table DeviceId, Usage, VirtualDiskFootprint
```
169+
170+
When the footprint reaches zero, go to the next step.
171+
172+
1. Restore the previously failed virtual disks to cluster management. If any of these virtual disks were previously configured as CSVs, convert them to CSVs.
173+
174+
1. Bring the virtual disks from the previous step online, and configure them as read/write.
175+
176+
> [!IMPORTANT]
177+
> After the cluster recovers, you might have to manually start any VMMS-managed VMs that use the cluster. After the cluster is down for 30 minutes, VMMS stops automatically trying to restart the VMs.
178+
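Several steps in Method 2 ask you to wait until a condition is met before you continue: until no virtual disk reports `InService`, and until the footprint of the retired disks reaches zero. Instead of rerunning the cmdlets by hand, you can poll for the condition. The following is a minimal sketch under stated assumptions. The helper name `Wait-ForCondition`, the default polling interval, and the default timeout are all hypothetical choices and aren't part of the Storage or FailoverClusters modules.

```powershell
# Hypothetical helper: polls a condition script block until it returns $true
# or the timeout expires. Returns $true on success and $false on timeout.
function Wait-ForCondition {
    param(
        [Parameter(Mandatory = $true)]
        [scriptblock]$Condition,
        [int]$IntervalSeconds = 30,    # assumed polling interval
        [int]$TimeoutSeconds = 3600    # assumed upper limit for the wait
    )
    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    while ($true) {
        # The condition is always checked at least once.
        if (& $Condition) { return $true }
        if ((Get-Date) -ge $deadline) { return $false }
        Start-Sleep -Seconds $IntervalSeconds
    }
}

# Usage on a cluster node (requires the Storage module):
#
# Wait until no virtual disk reports an InService operational status.
# Wait-ForCondition -Condition {
#     -not (Get-VirtualDisk | Where-Object { $_.OperationalStatus -eq 'InService' })
# }
#
# Wait until the retired physical disks no longer hold virtual disk footprints.
# Wait-ForCondition -Condition {
#     -not (Get-PhysicalDisk -Usage Retired | Where-Object { $_.VirtualDiskFootprint -gt 0 })
# }
```

The helper returns `$false` on timeout instead of throwing an error, so a calling script can decide whether to keep waiting or stop and investigate.
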
179+
## Status
180+
181+
This behavior is by design in Windows Server 2025. It's intended to prevent indefinite I/O unresponsiveness.

support/windows-server/toc.yml

Lines changed: 2 additions & 0 deletions
@@ -1124,6 +1124,8 @@ items:
11241124
items:
11251125
- name: 'Troubleshooting guidance: Event ID 5120 Cluster Shared Volume'
11261126
href: ./high-availability/event-id-5120-cluster-shared-volume-troubleshooting-guidance.md
1127+
- name: Cluster Shared Volume goes offline
1128+
href: ./high-availability/csv-offline-after-component-goes-offline-during-active-io.md
11271129
- name: Errors when running the Validation Wizard
11281130
items:
11291131
- name: Cluster validation account causes events or messages
