Commit 3e7b03d

Merge pull request #285728 from msftadam/patch-12
Update index.yml
2 parents 2b11783 + 6d92fb3 commit 3e7b03d

3 files changed: +140 -0 lines changed

articles/operator-service-manager/TOC.yml

Lines changed: 3 additions & 0 deletions
@@ -35,6 +35,9 @@
   items:
   - name: Get Started with Safe Upgrade Practices
     href: safe-upgrade-practices.md
+  - name: Control rollback behavior on upgrade failure
+
+    href: safe-upgrades-nf-level-rollback.md
   - name: Quickstarts
     expanded: false
     items:

articles/operator-service-manager/index.yml

Lines changed: 12 additions & 0 deletions
@@ -81,6 +81,18 @@ landingContent:
           - text: Troubleshoot common CLI Issues
             url: troubleshoot-cli-common-issues.md
 
+  # Card
+  - title: Safe Upgrade Practices
+    linkLists:
+      - linkListType: overview
+        links:
+          - text: Get started with safe upgrades
+            url: safe-upgrade-practices.md
+      - linkListType: concept
+        links:
+          - text: Control rollback behavior on upgrade failure
+            url: safe-upgrades-nf-level-rollback.md
+
   # Card
   - title: Additional Resources
     linkLists:

Lines changed: 125 additions & 0 deletions
@@ -0,0 +1,125 @@
---
title: Rollback on upgrade failure using Azure Operator Service Manager
description: Revert all prior completed operations during safe upgrade failure.
author: msftadam
ms.author: adamdor
ms.date: 08/28/2024
ms.topic: upgrade-and-migration-article
ms.service: azure-operator-service-manager
---

# Rollback on upgrade failure
This guide describes the optional rollback on failure feature of Azure Operator Service Manager (AOSM) for container network functions (CNFs). As part of the AOSM safe upgrade practices initiative, this feature reduces the service impact of unexpected upgrade failures for network functions (NFs) whose network function applications (NfApps) aren't fully compatible across versions, both forward and backward.

## Pause on failure
If an unexpected failure occurs during an upgrade, AOSM has historically used the pause on failure approach. This method remains the default and implements the following workflow logic:
* The NfApps are created or upgraded following either updateDependsOn ordering, if provided, or the sequential order in which they appear.
* NfApps with the "applicationEnablement" parameter set to "Disabled" are skipped.
* NfApps present before the upgrade, but not referenced by the new network function definition version (NFDV), are deleted.
* The execution is paused if any of the NfApp upgrades fail.
* The failure leaves the NF resource in a failed state.

With pause on failure, AOSM rolls back only the failed NfApp, as controlled by the testOptions, installOptions, or upgradeOptions parameters. This method allows the end user to troubleshoot the failed NfApp and then restart the upgrade from that point forward. As the default behavior, this method is the most efficient upgrade method, but it may cause NF inconsistencies while the NfApps are in a mixed version state.

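The pause on failure workflow can be sketched in a few lines of Python. This is an illustration only, not AOSM code: the `NfApp` record, the `depends_on` field (standing in for updateDependsOn, and assumed here to list the NfApps that must be upgraded first), and the `upgrade` callback are hypothetical names. Deletion of NfApps that the new NFDV no longer references is left out.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class NfApp:
    name: str
    enabled: bool = True  # stands in for applicationEnablement
    depends_on: list[str] = field(default_factory=list)  # stands in for updateDependsOn


def upgrade_order(apps):
    """Return dependency order if any dependencies are given, else listed order."""
    if not any(app.depends_on for app in apps):
        return [app.name for app in apps]
    sorter = TopologicalSorter({app.name: app.depends_on for app in apps})
    return list(sorter.static_order())


def pause_on_failure(apps, upgrade):
    """Upgrade NfApps in order and stop at the first failure.

    `upgrade` is a callable taking an NfApp name and returning True on success.
    Returns (completed, failed), where failed is None if everything succeeded.
    """
    by_name = {app.name: app for app in apps}
    completed = []
    for name in upgrade_order(apps):
        if not by_name[name].enabled:
            continue  # disabled NfApps are skipped
        if not upgrade(name):
            return completed, name  # pause: earlier NfApps stay upgraded
        completed.append(name)
    return completed, None
```

When the second of three chained NfApps fails, the first stays at the new version and the third is never attempted, which is the mixed version state described above.
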
## Rollback on failure
To address the risk of mismatched NfApp versions, AOSM now supports NF-level rollback on failure. With this option enabled, if an NfApp upgrade fails, both the failed NfApp and all previously completed NfApps are rolled back to their initial version state. This method minimizes, or eliminates, the amount of time the NF is exposed to NfApp version mismatches. The optional rollback on failure feature works as follows:
* A user initiates an upgrade and enables the rollback on failure feature.
* A snapshot of the current NfApp versions is captured and stored.
* The snapshot is used to determine the individual NfApp actions needed to reverse the actions that completed successfully:
  - "helm install" action on deleted components,
  - "helm rollback" action on upgraded components,
  - "helm delete" action on newly installed components.
* If an NfApp failure occurs, AOSM restores the NfApps to the snapshot version state from before the upgrade, with the most recent actions reverted first.

> [!NOTE]
> * AOSM doesn't create a snapshot if a user doesn't enable rollback on failure.
> * A rollback on failure only applies to the successfully completed NfApps.
>   - Use the testOptions, installOptions, or upgradeOptions parameters to control rollback of the failed NfApp.

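As a rough illustration of how a snapshot can drive the reverse actions, the following Python sketch maps each completed action to its inverse and reverts the most recent action first. The function name, the tuple layout, and the snapshot dictionary are assumptions made for this example; they don't describe the AOSM implementation. The helm strings are the action labels used in the list above, not complete command lines.

```python
def plan_rollback(snapshot, completed):
    """Build the reverse actions for the upgrade steps that completed.

    snapshot:  {nfapp_name: version} captured before the upgrade (hypothetical shape)
    completed: list of (action, nfapp_name) in the order the actions finished,
               where action is "install", "upgrade", or "delete"
    """
    reverse = {
        "delete": "helm install",    # reinstall a component the upgrade deleted
        "upgrade": "helm rollback",  # roll back a component the upgrade changed
        "install": "helm delete",    # remove a component the upgrade added
    }
    plan = []
    for action, name in reversed(completed):  # most recent action first
        # Newly installed components have no snapshot entry, so the version is None.
        plan.append((reverse[action], name, snapshot.get(name)))
    return plan
```

For example, if nfApp1 was upgraded, then nfApp2 was deleted, then nfApp3 was installed, the plan removes nfApp3, reinstalls nfApp2, and finally rolls back nfApp1.
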
AOSM returns the following operational status and messages, given the respective results:
```
  - Upgrade Succeeded
    - Provisioning State: Succeeded
    - Message: <empty>
```
```
  - Upgrade Failed, Rollback Succeeded
    - Provisioning State: Failed
    - Message: Application(<ComponentName>) : <Failure Reason>; Rollback succeeded
```
```
  - Upgrade Failed, Rollback Failed
    - Provisioning State: Failed
    - Message: Application(<ComponentName>) : <Failure reason>; Rollback Failed (<RollbackComponentName>) : <Rollback Failure reason>
```

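Automation that consumes these results may want to tell the three outcomes apart. The following Python sketch classifies a result using only the message formats shown above. The function name is hypothetical, and the exact text that AOSM returns may differ, so treat the string matching as an assumption to verify against real responses.

```python
def classify_result(provisioning_state, message=""):
    """Map a provisioning state and message to one of the documented outcomes."""
    if provisioning_state == "Succeeded":
        return "upgrade succeeded"
    lowered = message.lower()  # the documented messages vary in capitalization
    if "rollback succeeded" in lowered:
        return "upgrade failed, rollback succeeded"
    if "rollback failed" in lowered:
        return "upgrade failed, rollback failed"
    # No rollback text, for example when rollback on failure wasn't enabled.
    return "upgrade failed, rollback not reported"
```
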
## How to configure rollback on failure
The most flexible method to control failure behavior is to add a new configuration group schema (CGS) parameter, rollbackEnabled, which allows configuration group value (CGV) control via roleOverrideValues in the NF payload. First, define the CGS parameter:
```
{
  "description": "NF configuration",
  "type": "object",
  "properties": {
    "nfConfiguration": {
      "type": "object",
      "properties": {
        "rollbackEnabled": {
          "type": "boolean"
        }
      },
      "required": [
        "rollbackEnabled"
      ]
    }
  }
}
```
> [!NOTE]
> * If nfConfiguration isn't provided through the roleOverrideValues parameter, rollback is disabled by default.

With the new rollbackEnabled parameter defined, the operator can now provide a runtime value, under roleOverrideValues, as part of the NF reput payload. For example:
```
{
  "location": "eastus",
  "properties": {
    // ...
    "roleOverrideValues": [
      "{\"nfConfiguration\":{\"rollbackEnabled\":true}}",
      "{\"name\":\"nfApp1\",\"deployParametersMappingRuleProfile\":{\"applicationEnablement\" : \"Disabled\"}}",
      "{\"name\":\"nfApp2\",\"deployParametersMappingRuleProfile\":{\"applicationEnablement\" : \"Disabled\"}}",
      //... other nfapps overrides
    ]
  }
}
```
> [!NOTE]
> * Each roleOverrideValues entry overrides the default behavior of the NfApps.
> * If multiple nfConfiguration entries are found in roleOverrideValues, the NF reput is returned as a bad request.

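Because each roleOverrideValues entry is a JSON document encoded as a string, escaping it by hand is error prone. The following Python sketch builds the entries with `json.dumps` and counts the nfConfiguration entries before the payload is sent. The helper names are hypothetical, and the entry shapes are copied from the example payload above.

```python
import json


def build_role_override_values(rollback_enabled, disabled_nfapps=()):
    """Return roleOverrideValues entries as JSON-encoded strings."""
    values = [json.dumps({"nfConfiguration": {"rollbackEnabled": rollback_enabled}})]
    for name in disabled_nfapps:
        values.append(json.dumps({
            "name": name,
            "deployParametersMappingRuleProfile": {"applicationEnablement": "Disabled"},
        }))
    return values


def count_nf_configurations(role_override_values):
    """Count entries that carry nfConfiguration; more than one is rejected."""
    return sum("nfConfiguration" in json.loads(v) for v in role_override_values)
```

Checking that `count_nf_configurations` returns at most 1 before submitting catches the duplicate-entry case that would otherwise come back as a bad request.
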
## How to troubleshoot rollback on failure
### Understand pod states
Understanding the different pod states is crucial for effective troubleshooting. The following are the most common pod states:
* Pending: The cluster has accepted the pod, but one or more containers aren't running yet. This state includes time spent waiting to be scheduled.
* Running: The pod is bound to a node and at least one container is running, starting, or restarting.
* Failed: All containers in the pod have terminated, and at least one terminated in failure, for example with a nonzero exit code.
* CrashLoopBackOff: A container within the pod is repeatedly crashing, and Kubernetes waits an increasing delay before each restart attempt.
* ContainerCreating: Container creation is in progress by the container runtime.

### Check pod status and logs
Start by checking pod status and logs using the following kubectl commands:
```
$ kubectl get pods
$ kubectl logs <pod-name>
```
The get pods command lists all the pods in the current namespace, along with their current status. The logs command retrieves the logs for a specific pod, allowing you to inspect any errors or exceptions.

To troubleshoot networking problems, use the following commands:
```
$ kubectl get services
$ kubectl describe service <service-name>
```
The get services command displays all the services in the current namespace. The describe service command provides details about a specific service, including the associated endpoints and any relevant error messages.

If you're encountering issues with persistent volume claims (PVCs), use the following commands to debug them:
```
$ kubectl get persistentvolumeclaims
$ kubectl describe persistentvolumeclaims <pvc-name>
```
The get persistentvolumeclaims command lists all the PVCs in the current namespace. The describe command provides detailed information about a specific PVC, including the status, associated storage class, and any relevant events or errors.
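
When many pods are involved, it can help to filter the output programmatically. The following Python sketch takes the parsed output of `kubectl get pods -o json` and returns the pods that need attention. The function name is hypothetical. Statuses such as CrashLoopBackOff and ContainerCreating are reported as container waiting reasons rather than pod phases, so the sketch checks both.

```python
def unhealthy_pods(pod_list):
    """Return (pod name, reason) pairs for pods that aren't running cleanly.

    pod_list is the parsed JSON document from `kubectl get pods -o json`.
    """
    problems = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        # Collect the waiting reason of every container that is waiting.
        waiting = [
            cs["state"]["waiting"].get("reason", "Waiting")
            for cs in status.get("containerStatuses", [])
            if "waiting" in cs.get("state", {})
        ]
        if waiting:
            problems.append((name, waiting[0]))
        elif phase not in ("Running", "Succeeded"):
            problems.append((name, phase))
    return problems
```

One way to use it is to save the command output to a file, load the file with `json.load`, and pass the result to `unhealthy_pods`. Each returned pod is then a candidate for `kubectl logs <pod-name>`.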
