|
| 1 | +--- |
| 2 | +title: Rollback on upgrade failure using Azure Operator Service Manager |
| 3 | +description: Revert all prior completed operations during safe upgrade failure. |
| 4 | +author: msftadam |
| 5 | +ms.author: adamdor |
| 6 | +ms.date: 08/28/2024 |
| 7 | +ms.topic: upgrade-and-migration-article |
| 8 | +ms.service: azure-operator-service-manager |
| 9 | +--- |
| 10 | + |
| 11 | +# Rollback on upgrade failure |
| 12 | +This guide describes the Azure Operator Service Manager (AOSM) optional rollback on failure feature for container network functions (CNFs). This feature, as part of the AOSM safe upgrade practices initiative, reduces the service impact of unexpected upgrade failures for network functions (NFs) where comprehensive forward and backward version network function application (NfApp) compatibility is not available. |
| 13 | + |
| 14 | +## Pause on failure |
| 15 | +When an upgrade fails unexpectedly, AOSM has historically supported the pause on failure approach. This method remains the default and implements the following workflow logic: |
| 16 | +* The NfApps are created or upgraded following either updateDependsOn ordering, if provided, or in the sequential order they appear. |
| 17 | +* NfApps with the parameter "applicationEnablement" set to disabled are skipped. |
| 18 | +* NfApps present before the upgrade, but not referenced by the new network function definition version (NFDV), are deleted. |
| 19 | +* The execution is paused if any of the NfApp upgrades fail. |
| 20 | +* The failure leaves the NF resource in a failed state. |
| 21 | + |
| 22 | +With pause on failure, AOSM rolls back only the failed NfApp, via the testOptions, installOptions, or upgradeOptions parameters. This method allows the end user to troubleshoot the failed NfApp and then restart the upgrade from that point forward. As the default behavior, this is the most efficient upgrade method, but it may cause NF inconsistencies while the NfApps are in a mixed version state. |
| 23 | + |
| 24 | +## Rollback on failure |
| 25 | +To address the risk of mismatched NfApp versions, AOSM now supports NF level rollback on failure. With this option enabled, if an NfApp upgrade fails, both the failed NfApp and all prior completed NfApps are rolled back to their initial version state. This method minimizes, or eliminates, the amount of time the NF is exposed to NfApp version mismatches. The optional rollback on failure feature works as follows: |
| 26 | +* A user initiates an upgrade and enables the rollback on failure feature. |
| 27 | +* A snapshot of the current NfApp versions is captured and stored. |
| 28 | +* The snapshot is used to determine the individual NfApp actions taken to reverse actions that completed successfully. |
| 29 | + - "helm install" action on deleted components |
| 30 | + - "helm rollback" action on upgraded components |
| 31 | + - "helm delete" action on newly installed components |
| 32 | +* When an NfApp failure occurs, AOSM restores the NfApps to the snapshot version state from before the upgrade, with the most recent actions reverted first. |
| 33 | + |
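The reversal logic described above can be sketched in a few lines of Python. This is only an illustration of the ordering and action mapping, not AOSM's implementation: the function name `plan_rollback`, the snapshot shape (NfApp name to version), and the representation of completed actions are all assumptions made for this example. The action labels are the ones used in the list above.

```python
from typing import Dict, List, Optional, Tuple


def plan_rollback(
    snapshot: Dict[str, str],
    completed: List[Tuple[str, Optional[str]]],
) -> List[Tuple[str, str]]:
    """Return (action, nfapp) pairs that undo completed actions, newest first.

    snapshot:  hypothetical mapping of NfApp name -> version before the upgrade.
    completed: hypothetical ordered list of (NfApp name, new version), where a
               new version of None means the NfApp was deleted by the upgrade.
    """
    plan: List[Tuple[str, str]] = []
    # Most recent actions are reverted first, so walk the list in reverse.
    for name, new_version in reversed(completed):
        if name not in snapshot:
            # Not present before the upgrade: it was newly installed.
            plan.append(("helm delete", name))
        elif new_version is None:
            # Present before the upgrade but deleted by it: reinstall.
            plan.append(("helm install", name))
        else:
            # Present before and after: it was upgraded, so roll it back.
            plan.append(("helm rollback", name))
    return plan


snapshot = {"nfApp1": "1.0", "nfApp2": "1.0"}
completed = [("nfApp1", "2.0"), ("nfApp2", None), ("nfApp3", "1.0")]
for action, name in plan_rollback(snapshot, completed):
    print(action, name)
```

In this example, nfApp3 was installed last and is therefore removed first, followed by the reinstall of nfApp2 and the rollback of nfApp1.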
| 34 | +> [!NOTE] |
| 35 | +> * AOSM doesn't create a snapshot if a user doesn't enable rollback on failure. |
| 36 | +> * A rollback on failure only applies to the successfully completed NfApps. |
| 37 | +> - Use the testOptions, installOptions, or upgradeOptions parameters to control rollback of the failed NfApp. |
| 38 | +
|
| 39 | +AOSM returns the following operational status and messages, given the respective results: |
| 40 | +``` |
| 41 | + - Upgrade Succeeded |
| 42 | + - Provisioning State: Succeeded |
| 43 | + - Message: <empty> |
| 44 | +``` |
| 45 | +``` |
| 46 | + - Upgrade Failed, Rollback Succeeded |
| 47 | + - Provisioning State: Failed |
| 48 | + - Message: Application(<ComponentName>) : <Failure Reason>; Rollback succeeded |
| 49 | +``` |
| 50 | +``` |
| 51 | + - Upgrade Failed, Rollback Failed |
| 52 | + - Provisioning State: Failed |
| 53 | + - Message: Application(<ComponentName>) : <Failure reason>; Rollback Failed (<RollbackComponentName>) : <Rollback Failure reason> |
| 54 | +``` |
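As a rough aid for automation, the three outcomes above can be told apart from the provisioning state and message text. The following is a minimal sketch that assumes the message follows the templates shown above; the function name `classify_outcome` and the substring matching are assumptions for illustration, not an AOSM API.

```python
def classify_outcome(provisioning_state: str, message: str) -> str:
    """Map a provisioning state and message to one of the outcomes above.

    Assumes the message text follows the templates shown in this section.
    Matching is case-insensitive because the templates capitalize
    "Rollback succeeded" and "Rollback Failed" differently.
    """
    if provisioning_state == "Succeeded":
        return "upgrade succeeded"
    text = message.lower()
    if "rollback failed" in text:
        return "upgrade failed, rollback failed"
    if "rollback succeeded" in text:
        return "upgrade failed, rollback succeeded"
    # A failed state without rollback text, for example when rollback is disabled.
    return "upgrade failed, rollback status unknown"


print(classify_outcome("Failed", "Application(nfApp2) : timed out; Rollback succeeded"))
```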
| 55 | + |
| 56 | +## How to configure rollback on failure |
| 57 | +The most flexible method to control failure behavior is to extend the configuration group schema (CGS) with a new parameter, rollbackEnabled, which allows configuration group value (CGV) control via roleOverrideValues in the NF payload. First, define the CGS parameter: |
| 58 | +``` |
| 59 | +{ |
| 60 | + "description": "NF configuration", |
| 61 | + "type": "object", |
| 62 | + "properties": { |
| 63 | + "nfConfiguration": { |
| 64 | + "type": "object", |
| 65 | + "properties": { |
| 66 | + "rollbackEnabled": { |
| 67 | + "type": "boolean" |
| 68 | + } |
| 69 | + }, |
| 70 | + "required": [ |
| 71 | + "rollbackEnabled" |
| 72 | + ] |
| 73 | + } |
| 74 | + } |
| 75 | +} |
| 76 | +``` |
| 77 | +> [!NOTE] |
| 78 | +> * If nfConfiguration isn't provided through the roleOverrideValues parameter, rollback is disabled by default. |
| 79 | +
|
| 80 | +With the new rollbackEnabled parameter defined, the operator can now provide a runtime value, under roleOverrideValues, as part of the NF reput payload: |
| 81 | +``` |
| 82 | +example: |
| 83 | +{ |
| 84 | + "location": "eastus", |
| 85 | + "properties": { |
| 86 | + // ... |
| 87 | + "roleOverrideValues": [ |
| 88 | + "{\"nfConfiguration\":{\"rollbackEnabled\":true}}", |
| 89 | + "{\"name\":\"nfApp1\",\"deployParametersMappingRuleProfile\":{\"applicationEnablement\" : \"Disabled\"}}", |
| 90 | + "{\"name\":\"nfApp2\",\"deployParametersMappingRuleProfile\":{\"applicationEnablement\" : \"Disabled\"}}", |
| 91 | + //... other nfapps overrides |
| 92 | + ] |
| 93 | + } |
| 94 | +} |
| 95 | +``` |
| 96 | +> [!NOTE] |
| 97 | +> * Each roleOverrideValues entry overrides the default behavior of the NfApps. |
| 98 | +> * If multiple entries of nfConfiguration are found in the roleOverrideValues, then the NF reput is returned as a bad request. |
| 99 | +
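Because each roleOverrideValues entry is a JSON document encoded as a string, hand-escaping the quotes is error prone. The sketch below shows one way to generate the entries and to check for duplicate nfConfiguration entries before sending the payload. The helper names `build_role_override_values` and `count_nf_configuration` are hypothetical; only the field names come from the example above.

```python
import json
from typing import Dict, List


def build_role_override_values(
    rollback_enabled: bool, app_enablement: Dict[str, str]
) -> List[str]:
    """Build roleOverrideValues entries as JSON-encoded strings."""
    entries = [json.dumps({"nfConfiguration": {"rollbackEnabled": rollback_enabled}})]
    for name, state in app_enablement.items():
        entries.append(
            json.dumps(
                {
                    "name": name,
                    "deployParametersMappingRuleProfile": {
                        "applicationEnablement": state
                    },
                }
            )
        )
    return entries


def count_nf_configuration(entries: List[str]) -> int:
    """Count entries that carry an nfConfiguration object."""
    return sum(1 for entry in entries if "nfConfiguration" in json.loads(entry))


entries = build_role_override_values(True, {"nfApp1": "Disabled", "nfApp2": "Disabled"})
# More than one nfConfiguration entry would be rejected as a bad request.
assert count_nf_configuration(entries) == 1
print(json.dumps({"roleOverrideValues": entries}, indent=2))
```

Generating the strings with `json.dumps` keeps the inner quoting consistent, and the count check catches the duplicate nfConfiguration case locally.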
|
| 100 | +## How to troubleshoot rollback on failure |
| 101 | +### Understand pod states |
| 102 | +Understanding the different pod states is crucial for effective troubleshooting. The following are the most common pod states: |
| 103 | +* Pending: Pod scheduling is in progress by Kubernetes. |
| 104 | +* Running: All containers in the pod are running and healthy. |
| 105 | +* Failed: All containers in the pod have terminated, and at least one terminated with a nonzero exit code. |
| 106 | +* CrashLoopBackOff: A container within the pod is repeatedly crashing, and Kubernetes is delaying each restart attempt with an increasing back-off. |
| 107 | +* ContainerCreating: Container creation is in progress by the container runtime. |
| 108 | + |
| 109 | +### Check pod status and logs |
| 110 | +Start by checking pod status and logs using the following kubectl commands: |
| 111 | +``` |
| 112 | +$ kubectl get pods |
| 113 | +$ kubectl logs <pod-name> |
| 114 | +``` |
| 115 | +The get pods command lists all the pods in the current namespace, along with their current status. The logs command retrieves the logs for a specific pod, allowing you to inspect any errors or exceptions. To troubleshoot networking problems, use the following commands: |
| 116 | +``` |
| 117 | +$ kubectl get services |
| 118 | +$ kubectl describe service <service-name> |
| 119 | +``` |
| 120 | +The get services command displays all the services in the current namespace. The describe service command provides details about a specific service, including the associated endpoints and any relevant error messages. If you're encountering issues with persistent volume claims (PVCs), you can use the following commands to debug them: |
| 121 | +``` |
| 122 | +$ kubectl get persistentvolumeclaims |
| 123 | +$ kubectl describe persistentvolumeclaims <pvc-name> |
| 124 | +``` |
| 125 | +The "get persistentvolumeclaims" command lists all the PVCs in the current namespace. The describe command provides detailed information about a specific PVC, including the status, associated storage class, and any relevant events or errors. |