Commit 93c15ff

Merge pull request #247423 from adwilso/main
Update scheduled events doc
2 parents 0b0ac00 + 15cfbb2 commit 93c15ff

6 files changed

+121
-31
lines changed

articles/virtual-machines/linux/scheduled-events.md

Lines changed: 75 additions & 25 deletions
Large diffs are not rendered by default.

articles/virtual-machines/windows/scheduled-events.md

Lines changed: 46 additions & 6 deletions
@@ -60,7 +60,7 @@ Scheduled events are delivered to and can be acknowledged by:
6060
- All the VMs in a scale set placement group.
6161

6262
> [!NOTE]
63-
> Scheduled Events for all virtual machines (VMs) in a Fabric Controller (FC) tenant are delivered to all VMs in a FC tenant. FC tenant equates to a standalone VM, an entire Cloud Service, an entire Availability Set, and a Placement Group for a VM Scale Set (VMSS) regardless of Availability Zone usage.
63+
> Scheduled Events for all virtual machines (VMs) in a Fabric Controller (FC) tenant are delivered to all VMs in an FC tenant. FC tenant equates to a standalone VM, an entire Cloud Service, an entire Availability Set, and a Placement Group for a Virtual Machine Scale Set regardless of Availability Zone usage.
6464
6565
As a result, check the `Resources` field in the event to identify which VMs are affected.
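As a minimal sketch of that check, the helper below keeps only the events whose `Resources` list names a given VM. The function name `events_affecting`, the VM names, and the sample payload are illustrative assumptions; only the `Resources` field comes from the text above.

```python
def events_affecting(events, vm_name):
    """Return the events whose Resources list includes vm_name."""
    return [e for e in events if vm_name in e.get("Resources", [])]


# Hypothetical payload: every VM in the group sees both events,
# but each event names only the VMs it actually affects.
sample_events = [
    {"EventId": "A", "Resources": ["vm_0", "vm_1"]},
    {"EventId": "B", "Resources": ["vm_2"]},
]
affected = events_affecting(sample_events, "vm_1")
```

Here `affected` contains only event `A`, because `vm_1` is not listed in the `Resources` of event `B`.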
6666

@@ -95,12 +95,52 @@ Scheduled Events is enabled for your service the first time you make a request f
9595
### User-initiated maintenance
9696
User-initiated VM maintenance via the Azure portal, API, CLI, or PowerShell results in a scheduled event. You then can test the maintenance preparation logic in your application, and your application can prepare for user-initiated maintenance.
9797

98-
If you restart a VM, an event with the type `Reboot` is scheduled. If you redeploy a VM, an event with the type `Redeploy` is scheduled. Typically events with a user event source can be immediately approved to avoid a delay on user-initiated actions. We advise having a primary and secondary VM communicating and approving user generated scheduled events in case the primary VM becomes unresponsive. This will prevent delays in recovering your application back to a good state.
98+
If you restart a VM, an event with the type `Reboot` is scheduled. If you redeploy a VM, an event with the type `Redeploy` is scheduled. Typically, events with a user event source can be immediately approved to avoid a delay on user-initiated actions. We advise having a primary and a secondary VM communicating and approving user-generated scheduled events in case the primary VM becomes unresponsive. Immediately approving events prevents delays in recovering your application back to a good state.
9999

100-
Scheduled events are disabled by default for [VMSS Guest OS upgrades or reimages](../../virtual-machine-scale-sets/virtual-machine-scale-sets-automatic-upgrade.md). To enable scheduled events for these operations, first enable them using [OSImageNotificationProfile](https://learn.microsoft.com/rest/api/compute/virtual-machine-scale-sets/create-or-update?tabs=HTTP#osimagenotificationprofile).
100+
Scheduled events are disabled by default for [Virtual Machine Scale Set Guest OS upgrades or reimages](../../virtual-machine-scale-sets/virtual-machine-scale-sets-automatic-upgrade.md). To enable scheduled events for these operations, first enable them using [OSImageNotificationProfile](https://learn.microsoft.com/rest/api/compute/virtual-machine-scale-sets/create-or-update?tabs=HTTP#osimagenotificationprofile).
101101

102102
## Use the API
103103

104+
### High-level overview
105+
106+
There are two major components to handling Scheduled Events: preparation and recovery. All current events impacting the customer are available via the IMDS Scheduled Events endpoint. When an event reaches a terminal state, it is removed from the list of events. The following diagram shows the various state transitions that a single scheduled event can experience:
107+
108+
![State diagram showing the various transitions a scheduled event can take.](media/scheduled-events/scheduled-events-states.png)
109+
110+
For events in the EventStatus:"Scheduled" state, you need to take steps to prepare your workload. Once the preparation is complete, approve the event using the scheduled event API. Otherwise, the event is automatically approved when the NotBefore time is reached. If the VM is on shared infrastructure, the system then waits for all other tenants on the same hardware to also approve the job or time out. Once approvals are gathered from all impacted VMs or the NotBefore time is reached, Azure generates a new scheduled event payload with EventStatus:"Started" and triggers the start of the maintenance event. When the event reaches a terminal state, it is removed from the list of events, which serves as the signal for the tenant to recover their VMs.
111+
112+
The following pseudocode demonstrates a process for reading and managing scheduled events in your application:
113+
```
current_list_of_scheduled_events = get_latest_from_se_endpoint()

# Prepare for new events
for each event in current_list_of_scheduled_events:
    if event not in previous_list_of_scheduled_events:
        prepare_for_event(event)

# Recover from completed events
for each event in previous_list_of_scheduled_events:
    if event not in current_list_of_scheduled_events:
        recover_from_event(event)

# Save the current list for the next poll
previous_list_of_scheduled_events = current_list_of_scheduled_events
```
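The same comparison can be sketched in Python. This is one possible shape rather than the documented implementation: it assumes each event is a dict carrying an `EventId` key and compares polls by that ID instead of by whole event, so that a status change on the same event is not treated as a new event. The function name `reconcile` and the `prepare` and `recover` callbacks are hypothetical.

```python
def reconcile(previous, current, prepare, recover):
    """Compare two polls of the Events array and dispatch handlers.

    previous, current: lists of event dicts, each with an "EventId" key.
    prepare, recover: caller-supplied callbacks (hypothetical names).
    Returns the list to keep as `previous` for the next poll.
    """
    previous_ids = {e["EventId"] for e in previous}
    current_ids = {e["EventId"] for e in current}

    # Prepare for events that appeared since the last poll.
    for event in current:
        if event["EventId"] not in previous_ids:
            prepare(event)

    # Recover from events that were removed since the last poll.
    for event in previous:
        if event["EventId"] not in current_ids:
            recover(event)

    return current


# Usage with stand-in callbacks that only record what they were given.
prepared, recovered = [], []
state = reconcile([], [{"EventId": "A"}], prepared.append, recovered.append)
state = reconcile(state, [], prepared.append, recovered.append)
```

In the usage above, the first poll triggers preparation for event `A`, and the second poll, where `A` has disappeared from the array, triggers recovery.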
126+
As scheduled events are often used for applications with high availability requirements, there are a few exceptional cases that should be considered:
127+
128+
1. Once a scheduled event is completed and removed from the array, there will be no further impacts without a new event, including another EventStatus:"Scheduled" event.
129+
2. Azure monitors maintenance operations across the entire fleet and, in rare circumstances, determines that a maintenance operation is too high risk to apply. In that case, the scheduled event goes directly from “Scheduled” to being removed from the events array.
130+
3. In the case of hardware failure, Azure will bypass the “Scheduled” state and immediately move to the EventStatus:"Started" state.
131+
4. While the event is still in EventStatus:"Started" state, there may be additional impacts of a shorter duration than what was advertised in the scheduled event.
132+
133+
As part of Azure’s availability guarantee, VMs in different fault domains won't be impacted by routine maintenance operations at the same time. However, they may have operations serialized one after another. VMs in one fault domain can receive scheduled events with EventStatus:"Scheduled" shortly after another fault domain’s maintenance is completed. Regardless of what architecture you choose, always keep checking for new events pending against your VMs.
134+
135+
While the exact timings of events vary, the following diagram provides a rough guideline for how a typical maintenance operation proceeds:
136+
137+
- EventStatus:"Scheduled" to Approval Timeout: 15 minutes
138+
- Impact Duration: 7 seconds
139+
- EventStatus:"Started" to Completed (event removed from Events array): 10 minutes
140+
141+
![Diagram of a timeline showing the flow of a scheduled event.](media/scheduled-events/scheduled-events-timeline.png)
142+
143+
104144
### Headers
105145
When you query Metadata Service, you must provide the header `Metadata:true` to ensure the request wasn't unintentionally redirected. The `Metadata:true` header is required for all scheduled events requests. Failure to include the header in the request results in a "Bad Request" response from Metadata Service.
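As a minimal sketch, the request below is built with the required header using only the Python standard library. The endpoint address and `api-version` value are assumptions based on the public IMDS documentation and are not shown in this section; the helper name `build_events_request` is hypothetical. The request is constructed but not sent, because the endpoint is reachable only from inside an Azure VM.

```python
import urllib.request

# Assumption: standard IMDS address and a published api-version; adjust as needed.
SCHEDULED_EVENTS_URL = (
    "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"
)


def build_events_request(url=SCHEDULED_EVENTS_URL):
    """Build (but do not send) a GET request carrying the Metadata:true header."""
    return urllib.request.Request(url, headers={"Metadata": "true"})


request = build_events_request()
# To send it from inside an Azure VM:
#   urllib.request.urlopen(request, timeout=10)
```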
106146

@@ -177,7 +217,7 @@ Each event is scheduled a minimum amount of time in the future based on the even
177217
| Redeploy | 10 minutes |
178218
| Terminate | [User Configurable](../../virtual-machine-scale-sets/virtual-machine-scale-sets-terminate-notification.md#enable-terminate-notifications): 5 to 15 minutes |
179219

180-
Once an event is scheduled, it will move into the `Started` state after it's been approved or the `NotBefore` time passes. However, in rare cases, the operation will be cancelled by Azure before it starts. In that case the event will be removed from the Events array, and the impact will not occur as previously scheduled.
220+
Once an event is scheduled, it will move into the `Started` state after it's been approved or the `NotBefore` time passes. However, in rare cases, the operation will be canceled by Azure before it starts. In that case the event will be removed from the Events array, and the impact won't occur as previously scheduled.
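To illustrate how an application might budget its preparation time, the sketch below computes the seconds remaining until `NotBefore`. It assumes `NotBefore` is an RFC 1123-style GMT string and that it can be empty once an event has started; both are assumptions about the payload that this section does not show. The helper name and the timestamps are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def seconds_until_not_before(not_before, now=None):
    """Return seconds remaining until NotBefore, or None if it is empty."""
    if not not_before:
        return None  # assumed: NotBefore can be blank for started events
    deadline = parsedate_to_datetime(not_before)
    now = now or datetime.now(timezone.utc)
    # Clamp at zero: a deadline in the past leaves no preparation time.
    return max(0.0, (deadline - now).total_seconds())


# Hypothetical timestamps, 15 minutes apart, for illustration only.
now = datetime(2016, 9, 19, 18, 14, 47, tzinfo=timezone.utc)
remaining = seconds_until_not_before("Mon, 19 Sep 2016 18:29:47 GMT", now)
```

Because an event can also be canceled and removed before it starts, the remaining time is best treated as an upper bound for preparation, not as a guarantee that the impact will occur.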
181221

182222
> [!NOTE]
183223
> In some cases, Azure is able to predict host failure due to degraded hardware and will attempt to mitigate disruption to your service by scheduling a migration. Affected virtual machines will receive a scheduled event with a `NotBefore` that is typically a few days in the future. The actual time varies depending on the predicted failure risk assessment. Azure tries to give 7 days' advance notice when possible, but the actual time varies and might be smaller if the prediction is that there's a high chance of the hardware failing imminently. To minimize risk to your service in case the hardware fails before the system-initiated migration, we recommend that you self-redeploy your virtual machine as soon as possible.
@@ -205,7 +245,7 @@ The following JSON sample is expected in the `POST` request body. The request sh
205245
}
206246
```
207247

208-
The service will always return a 200 success code in the case of a valid event ID, even if it was already approved by a different VM. A 400 error code indicates that the request header or payload was malformed.
248+
The service will always return a 200 success code if it is passed a valid event ID, even if the event was already approved by a different VM. A 400 error code indicates that the request header or payload was malformed.
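The sketch below shows one way to build the approval body and interpret the two status codes described above. The `StartRequests` body shape is an assumption based on the public documentation, since the JSON sample itself is not visible in this diff; the helper names and the event ID are hypothetical.

```python
import json


def build_approval_body(event_id):
    """Serialize the body for approving one event (assumed StartRequests shape)."""
    return json.dumps({"StartRequests": [{"EventId": event_id}]})


def describe_approval_status(status_code):
    """Map the status codes named in the text to a short description."""
    if status_code == 200:
        # Also returned when another VM already approved the same event.
        return "accepted"
    if status_code == 400:
        return "malformed header or payload"
    return "unexpected status"


body = build_approval_body("example-event-id")
```

Because a 200 is returned even when the event was already approved by another VM, a caller can treat repeated approvals as harmless rather than as errors.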
209249

210250
> [!Note]
211251
> Events will not proceed unless they are either approved via a POST message or the NotBefore time elapses. This includes user triggered events such as VM restarts from the Azure portal.
@@ -237,7 +277,7 @@ def confirm_scheduled_event(event_id):
237277
> Acknowledging an event allows the event to proceed for all `Resources` in the event, not just the VM that acknowledges the event. Therefore, you can choose to elect a leader to coordinate the acknowledgement, which might be as simple as the first machine in the `Resources` field.
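The simple leader rule mentioned in the note can be sketched as follows: only the VM listed first in `Resources` acknowledges the event. The function name `is_leader`, the VM names, and the sample event are illustrative assumptions.

```python
def is_leader(event, vm_name):
    """True if vm_name is the first entry in the event's Resources list."""
    resources = event.get("Resources", [])
    return bool(resources) and resources[0] == vm_name


# Hypothetical event shared by two VMs.
event = {"EventId": "A", "Resources": ["vm_0", "vm_1"]}
leader_flags = {name: is_leader(event, name) for name in event["Resources"]}
```

With this rule, `vm_0` acknowledges the event and `vm_1` does not. If the leader is unresponsive, the event still proceeds once the `NotBefore` time elapses, so this rule trades some delay for simplicity.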
238278
239279
## Example responses
240-
The following is an example of a series of events that were seen by two VMs that were live migrated to another node.
280+
The following events are an example of what was seen by two VMs that were live migrated to another node.
241281

242282
The `DocumentIncarnation` changes every time there is new information in `Events`. An approval of the event would allow the freeze to proceed for both WestNO_0 and WestNO_1. The `DurationInSeconds` of -1 indicates that the platform doesn't know how long the operation will take.
243283
