articles/operator-service-manager/safe-upgrade-practices.md (60 additions, 44 deletions)
@@ -3,20 +3,20 @@ title: Get started with Azure Operator Service Manager Safe Upgrade Practices
3
3
description: Safely execute complex upgrades of CNF workloads on Azure Operator Nexus
4
4
author: msftadam
5
5
ms.author: adamdor
6
-
ms.date: 02/19/2024
6
+
ms.date: 05/20/2024
7
7
ms.topic: upgrade-and-migration-article
8
8
ms.service: azure-operator-service-manager
9
9
---
10
10
11
11
# Get started with safe upgrade practices
12
-
This article introduces Azure Operator Service Manager (AOSM) safe upgrade practices (SUP). This feature set enables the safe execution of complex container network function (CNF) hosted on Azure Operator Nexus. These upgrades are structured in general compliance with partner In Service Software Upgrade (ISSU) requirements. Look for future articles to expand on advanced SUP features and capabilities.
12
+
This article introduces Azure Operator Service Manager (AOSM) safe upgrade practices (SUP). This feature set enables upgrades to complex container network functions (CNFs) hosted on Azure Operator Nexus. These upgrades generally support partner In Service Software Upgrade (ISSU) methods and requirements. This article introduces basic concepts; see other articles for advanced SUP features and capabilities.
13
13
14
14
## Introduction to safe upgrades
15
-
A given network service supported by AOSM is composed of one to many CNFs which, over time, require software upgrades. For each upgrade, it's necessary to run one to many helm operations, updating dependent network function applications (nfApps), in a particular order, in a manner which least impacts the network service. AOSM SUP represents a set of features, which enables safe automation of these operations on Azure Operator Nexus.
15
+
A given network service supported by AOSM, composed of one to many CNFs, includes components which, over time, require software and/or configuration changes. To make these component-level changes, it's necessary to run one to many helm operations, upgrading each network function application (nfApp) in a particular order and in a manner which least impacts the network service. AOSM safe upgrade practices apply the following high-level capabilities to handle upgrade process and workflow requirements:
16
16
17
17
* SNS Reput Support - Execute helm upgrade operation across all nfApps in network function design version (NFDV).
18
18
* Nexus Platform - Support SNS reput operations on Nexus platform targets.
19
-
* Operation Time-outs - Ability to set operational time-outs for each nfApp operation.
19
+
* Operation Timeouts - Ability to set operational timeouts for each nfApp operation.
20
20
* Synchronous Operations - Ability to run one serial nfApp operation at a time.
21
21
* Control Upgrade Order - Define different nfApp sequence for install and upgrade.
22
22
* Pause On Failure - Default behavior pauses after an nfApp operation failure.
@@ -26,28 +26,30 @@ A given network service supported by AOSM is composed of one to many CNFs which,
26
26
* Image Preloading - Ability to preload images to edge repository.
27
27
28
28
## Safe upgrade approach
29
-
To update an existing Azure Operator Service Manager site network service (SNS), the Operator executes a reput update request against the deployed SNS resource. Where the SNS contains CNFs with multiple nfApps, the request is fanned out across all nfApps defined in the network function definition version (NFDV). By default, in the order, which they appear, or optionally in the order defined by `updateDependsOn` parameter.
29
+
To update an existing AOSM site network service (SNS), the operator executes a reput request against the deployed SNS resource. Where the SNS contains CNFs with multiple nfApps, the request is fanned out across all nfApps defined in the network function definition version (NFDV). By default, nfApps are processed in the order in which they appear, or optionally in the order defined by the `updateDependsOn` parameter.
30
30
31
-
For each nfApp, the reput update request supports increasing a helm chart version, adding/removing helm values and/or adding/removing any nfApps. Time-outs can be set per nfApp, based on known allowable runtimes, but nfApps can only be processed in serial order, one after the other. The reput update implements the following processing logic:
31
+
For each nfApp, the reput request supports various changes including increasing a helm chart version, adding/removing helm values and/or adding/removing any nfApps. While timeouts can be set per nfApp, based on known allowable runtimes, nfApps can only be processed in serial order, one after the other. The reput update implements the following processing logic:
32
32
33
-
* nfApps are processed following either updateDependsOn ordering, or in the sequential order they appear.
33
+
* nfApps are processed following either `updateDependsOn` ordering, or in the sequential order they appear.
34
34
* nfApps with parameter `applicationEnablement` set to disabled are skipped.
35
35
* nfApps with parameter `skipUpgrade` set to `enabled` are skipped if no changes are detected.
36
-
* nfApps which are common between old and new NFDV are upgraded.
37
-
* nfApps which are only in the new NFDV are installed.
38
-
* nfApps deployed, but not referenced by the new NFDV, are deleted.
36
+
* nfApps which are common between old and new NFDV are upgraded using `helm upgrade`.
37
+
* nfApps which are only in the new NFDV are installed using `helm install`.
38
+
* nfApps deployed, but not referenced by the new NFDV, are deleted using `helm delete`.
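The processing logic in the list above can be sketched in a few lines of Python. This is an illustrative model only, not AOSM code: the `NfApp` class, the `plan_reput` function, and the idea of detecting "no changes" by comparing chart versions are all assumptions made for the example. Deletes are placed last, following the in-service upgrade note that old nfApps are removed after new ones are created or upgraded.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class NfApp:
    """Hypothetical stand-in for one nfApp entry in the new NFDV."""
    name: str
    chart_version: str
    enabled: bool = True        # models applicationEnablement
    skip_upgrade: bool = False  # models skipUpgrade


def plan_reput(deployed: dict[str, str], new_nfdv: list[NfApp]) -> list[tuple[str, str]]:
    """Return (nfApp name, action) pairs in processing order.

    `deployed` maps each deployed nfApp name to its chart version.
    `new_nfdv` is assumed to already be in updateDependsOn or sequential order.
    """
    plan = []
    for app in new_nfdv:
        if not app.enabled:
            plan.append((app.name, "skip (disabled)"))
        elif app.name not in deployed:
            plan.append((app.name, "helm install"))
        elif app.skip_upgrade and deployed[app.name] == app.chart_version:
            # Assumption: "no changes" is approximated by an equal chart version.
            plan.append((app.name, "skip (no changes)"))
        else:
            plan.append((app.name, "helm upgrade"))

    # nfApps deployed but not referenced by the new NFDV are deleted last.
    new_names = {app.name for app in new_nfdv}
    for name in deployed:
        if name not in new_names:
            plan.append((name, "helm delete"))
    return plan


deployed = {"a": "1.0.0", "b": "1.0.0", "old": "1.0.0"}
new_nfdv = [
    NfApp("a", "1.1.0"),
    NfApp("b", "1.0.0", skip_upgrade=True),
    NfApp("c", "1.0.0"),
    NfApp("d", "1.0.0", enabled=False),
]
plan = plan_reput(deployed, new_nfdv)
for name, action in plan:
    print(f"{name}: {action}")
```

A plan like this can be useful as a dry-run check before submitting a reput, since it makes the install, upgrade, skip, and delete decisions visible up front.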
39
39
40
-
To ensure outcomes, nfApp testing is supported using helm, either helm upgrade pre/post tests, or standalone helm tests. For pre/post tests failures, the atomic parameter is honored. With atomic/true, the failed chart is rolled back. With atomic/false, no rollback is executed. For more information on standalone helm testing, see the following article: [Run tests after install or upgrade](safe-upgrades-helm-test.md)
40
+
To ensure outcomes, nfApp testing is supported using helm methods, either tests triggered by helm pre or post hooks, or using the standalone helm test hook. For a pre or post hook failure, the `atomic` parameter is honored. With `atomic` set to true, the failed chart is rolled back. With `atomic` set to false, no rollback is executed. For a standalone helm test hook failure, the `rollbackOnTestFailure` parameter is honored, following similar logic as `atomic`. For more information on standalone helm testing, see the following article: [Run tests after install or upgrade](safe-upgrades-helm-test.md)
41
+
42
+
When an nfApp operation failure occurs, and after the failed nfApp is handled via the `atomic` or `rollbackOnTestFailure` parameters, the operator can control how to handle any nfApps changed before the failed nfApp. With pause-on-failure, the operator can force AOSM to break after addressing the failed nfApp, preserving the mixed version environment. With rollback-on-failure, the operator can force AOSM to roll back any prior nfApp, restoring the original environment snapshot. For more information on controlling upgrade failure behavior, see the following article: [Control upgrade failure behavior](safe-upgrades-nf-level-rollback.md)
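The difference between pause-on-failure and rollback-on-failure can be illustrated with a small Python sketch. It is a simplified model, not the AOSM implementation: `run_upgrade`, its callbacks, and the choice to roll back prior nfApps in reverse order are assumptions made for the example.

```python
def run_upgrade(apps, apply, rollback, rollback_on_failure=False):
    """Apply nfApp operations serially and react to the first failure.

    `apply` and `rollback` are hypothetical callbacks that take an nfApp name.
    `apply` is expected to raise after the failed nfApp itself has been
    handled (for example by helm's atomic behavior).
    Returns (status, nfApps left at the new version).
    """
    done = []
    for app in apps:
        try:
            apply(app)
        except Exception:
            if rollback_on_failure:
                # Assumption: prior nfApps are restored in reverse order.
                for prior in reversed(done):
                    rollback(prior)
                return "rolled-back", []
            # Pause-on-failure: stop here and keep the mixed-version state.
            return "paused", done
        done.append(app)
    return "succeeded", done


def fake_apply(app):
    # Simulate a failure on the third nfApp.
    if app == "c":
        raise RuntimeError(f"upgrade of {app} failed")


rolled_back = []
print(run_upgrade(["a", "b", "c", "d"], fake_apply, rolled_back.append))
print(run_upgrade(["a", "b", "c", "d"], fake_apply, rolled_back.append,
                  rollback_on_failure=True))
print(rolled_back)
```

In the paused case, nfApps `a` and `b` stay at the new version while `c` and `d` remain at the old one, which is the mixed version environment described above.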
41
43
42
44
## Considerations for in-service upgrades
43
-
Azure Operator Service Manager generally supports in service upgrades, an upgrade method which advances a deployment version without interrupting the running service. Some considerations are necessary to ensure the proper behavior of AOSM during ISSU operations.
45
+
Azure Operator Service Manager generally supports in-service upgrades, an upgrade method which advances a deployment version without interrupting the running service. Network function owners need to account for the following considerations to ensure the proper behavior of AOSM during ISSU operations.
44
46
* Where AOSM performs an upgrade against an ordered set of multiple nfApps, AOSM first upgrades or creates all new nfApps, then deletes all old nfApps. This approach ensures service isn't impacted until all new nfApps are ready but requires extra platform capacity for transient hosting of both old and new nfApps.
45
47
* Where AOSM upgrades an nfApp with multiple replicas, AOSM honors the deployment profile settings for either the rolling or recreate option. Where rolling is used, expose the values `maxUnavailable` and `maxSurge` as CGS parameters, which can then be set via operator CGV at run-time.
46
48
47
49
Ultimately, the ability for a given service to be upgraded without interruption is a feature of the service itself. Consult further with the service publisher to understand the in-service upgrade capabilities and ensure they're aligned with the proper AOSM behavioral options.
48
50
49
51
## Safe upgrade prerequisites
50
-
When planning for an upgrade using Azure Operator Service Manager, address the following requirements in advance of upgrade execution to optimize the time spent attempting the upgrade.
52
+
When planning for an upgrade using AOSM, address the following requirements in advance of upgrade execution, to optimize time spent attempting and ensure success of the upgrade.
51
53
52
54
- Onboard updated artifacts using publisher and/or designer workflows.
53
55
- In most cases, use the existing publisher to host new version artifacts.
@@ -73,7 +75,7 @@ When planning for an upgrade using Azure Operator Service Manager, address the f
73
75
- Settings used for production may suppress failure details, while settings used for debugging or testing may expose these details.
74
76
75
77
## Safe upgrade procedure
76
-
Follow the following process to trigger an upgrade with Azure Operator Service Manager.
78
+
Use the following process to trigger an upgrade with AOSM.
77
79
78
80
* Create new NFDV resource
79
81
* The new NFDV version must be in a valid SemVer format. The new version can be an upgrade, a greater value versus the deployed version, or a downgrade, a lower value versus the deployed version. The new version can differ by major, minor, or patch values.
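As a rough illustration of the upgrade versus downgrade distinction, the following Python sketch compares two versions by their major, minor, and patch values. The function names are hypothetical, and the sketch assumes plain `MAJOR.MINOR.PATCH` strings; pre-release and build metadata allowed by SemVer aren't handled.

```python
def parse_semver(version: str) -> tuple[int, int, int]:
    """Split a plain MAJOR.MINOR.PATCH string into integers."""
    major, minor, patch = version.split(".")
    return int(major), int(minor), int(patch)


def classify_change(deployed: str, new: str) -> str:
    """Classify the new NFDV version relative to the deployed version."""
    old_parts, new_parts = parse_semver(deployed), parse_semver(new)
    if new_parts > old_parts:
        return "upgrade"
    if new_parts < old_parts:
        return "downgrade"
    return "same"


print(classify_change("1.2.3", "1.10.0"))
print(classify_change("2.0.0", "1.9.9"))
```

Comparing integer tuples rather than raw strings matters here, because a string comparison would order `1.10.0` before `1.2.3`.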
@@ -99,34 +101,50 @@ In cases where a reput update fails, the following process can be followed to re
99
101
* By default, the reput retries nfApps in the declared update order, unless they're skipped using `applicationEnablement` flag.
100
102
101
103
## Control timeouts with installOptions and upgradeOptions
102
-
When an SNS operation starts either a helm install and helm upgrade, a 27-minute default timeout value. This value can be customized at the global NF, but we recommend to customize this value at the component NF level by defining override values in the NF payload template. Further the values in the NF payload template and be exposed as operator values, allowing final customization at run-time. The following example demonstrates supported installOptions and upgradeOptions parameters applied to a single nfApp component;
104
+
When an SNS operation starts either a `helm install` or a `helm upgrade`, a 27-minute default timeout value is used. While this value can be customized at the global network function (NF) level, we recommend customizing this value at the component NF level using `roleOverrideValues` in the NF payload template. Exposing the `roleOverrideValues` in CGS/CGV further allows the operator to control the value at run-time. The following example demonstrates supported `installOptions` and `upgradeOptions` parameters applied across two nfApp components:
103
105
104
-
```
105
-
"roleOverrideValues": ["{
106
-
"name": "hellotest",
107
-
"deployParametersMappingRuleProfile": {
108
-
"helmMappingRuleProfile": {
109
-
"options": {
110
-
"installOptions": {
111
-
"atomic": true,
112
-
"wait": true,
113
-
"timeout": "1" },
114
-
"upgradeOptions": {
115
-
"atomic": true,
116
-
"wait": true,
117
-
"timeout": "2" }
118
-
} } } }"
119
-
]
106
+
```json
107
+
{
108
+
"roleOverrideValues": [
109
+
{
110
+
"name": "nfApplication1",
111
+
"deployParametersMappingRuleProfile": {
112
+
"helmMappingRuleProfile": {
113
+
"options": {
114
+
"installOptions": {
115
+
"atomic": "true",
116
+
"wait": "true",
117
+
"timeout": "1"
118
+
},
119
+
"upgradeOptions": {
120
+
"atomic": "true",
121
+
"wait": "true",
122
+
"timeout": "1"
123
+
} } } } },
124
+
{
125
+
"name": "nfApplication2",
126
+
"deployParametersMappingRuleProfile": {
127
+
"helmMappingRuleProfile": {
128
+
"options": {
129
+
"installOptions": {
130
+
"atomic": "true",
131
+
"wait": "true",
132
+
"timeout": "1"
133
+
},
134
+
"upgradeOptions": {
135
+
"atomic": "true",
136
+
"wait": "true",
137
+
"timeout": "1"
138
+
} } } } }
139
+
]
140
+
}
120
141
```
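When many nfApps need the same option structure, the payload can be generated rather than written by hand. The following Python sketch builds the same shape as the new example above. The helper name `timeout_override` is hypothetical, and the string-valued `atomic`, `wait`, and `timeout` fields simply mirror that example rather than a confirmed schema.

```python
import json


def timeout_override(name, install_timeout, upgrade_timeout, atomic=True, wait=True):
    """Build one roleOverrideValues entry for a single nfApp (illustrative only)."""
    def options(timeout):
        # Values are emitted as strings to match the example payload.
        return {
            "atomic": str(atomic).lower(),
            "wait": str(wait).lower(),
            "timeout": str(timeout),
        }

    return {
        "name": name,
        "deployParametersMappingRuleProfile": {
            "helmMappingRuleProfile": {
                "options": {
                    "installOptions": options(install_timeout),
                    "upgradeOptions": options(upgrade_timeout),
                }
            }
        },
    }


payload = {
    "roleOverrideValues": [
        timeout_override("nfApplication1", 1, 1),
        timeout_override("nfApplication2", 1, 1),
    ]
}
print(json.dumps(payload, indent=2))
```

Generating the entries from one helper keeps the install and upgrade options consistent across nfApps and avoids the brace-matching mistakes that are easy to make in deeply nested JSON.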
121
142
122
143
## Skip nfApps using applicationEnablement
123
-
In the NFDV resource, under `deployParametersMappingRuleProfile` there's a supported property `applicationEnablement` of type enum, which takes values of Unknown, Enabled, or disabled. It can be used to manually exclude nfApp operations during network function (NF) deployment. The following example demonstrates a generic method to parameterize `applicationEnablement` as an included value in `roleOverrideValues` property.
124
-
125
-
### Template changes
126
-
While no NFDV changes are necessarily required, optionally the publisher can use the NFDV to set a default value for the `applicationEnablement` property. The default value is used, unless its changed via `roleOverrideValues`.
144
+
In the NFDV resource, under `deployParametersMappingRuleProfile`, there's a supported property `applicationEnablement` of type enum, which takes values of Unknown, Enabled, or Disabled. It can be used to manually exclude nfApp operations during network function deployment. The following example demonstrates a generic method to parameterize `applicationEnablement` as an included value in the `roleOverrideValues` property.
127
145
128
-
####NFDV template
129
-
Use the NFDV template to set a default value for `applicationEnablement`. The following example sets `enabled` state as the default value for `hellotest` networkfunctionApplication.
146
+
### NFDV template changes
147
+
While no NFDV changes are strictly required, the publisher can optionally use the NFDV template to set a default value for the `applicationEnablement` property. The default value is used unless it's changed via `roleOverrideValues`. The following example sets the `enabled` state as the default value for the `hellotest` networkfunctionApplication.
130
148
131
149
```json
132
150
"location":"<location>",
@@ -145,7 +163,7 @@ Use the NFDV template to set a default value for `applicationEnablement`. The fo
145
163
146
164
To manage the `applicationEnablement` value more dynamically, the Operator can pass a real-time value using the NF template `roleOverrideValues` property. While it's possible for the operator to manipulate the NF template directly, instead parameterize the `roleOverrideValues`, so that values can be passed via a CGV template at runtime. The following examples demonstrate the needed modifications to the CGS, NF templates, and finally the CGV.
147
165
148
-
####CGS template
166
+
### CGS template changes
149
167
The CGS template must be updated to include one variable declaration for each line to parameterize under `roleOverrideValues`. The following example demonstrates three override values.
150
168
151
169
```json
@@ -160,8 +178,8 @@ The CGS template must be updated to include one variable declaration for each li
160
178
}
161
179
```
162
180
163
-
####NF payload template
164
-
The NF template must be update three ways. First, the implicit config parameter must be defined as type object. Second, `roleOverrideValues0`, `roleOverrideValues1`, and `roleOverrideValues2` must be declared as variables mapped to config parameter. Third, `roleOverrideValues0`, `roleOverrideValues1` and `roleOverrideValues2` must be referenced for substitution under `roleOverrideValues` in proper order and following proper syntax.
181
+
### NF payload template changes
182
+
The NF template must be updated in three ways. First, the implicit config parameter must be defined as type object. Second, `roleOverrideValues0`, `roleOverrideValues1`, and `roleOverrideValues2` must be declared as variables mapped to the config parameter. Third, `roleOverrideValues0`, `roleOverrideValues1`, and `roleOverrideValues2` must be referenced for substitution under `roleOverrideValues` in the proper order and following the proper syntax.
165
183
166
184
```json
167
185
"parameters": {
@@ -186,7 +204,7 @@ The NF template must be update three ways. First, the implicit config parameter
186
204
}
187
205
```
188
206
189
-
#### CGV template
207
+
### CGV template changes
190
208
The CGV template can now be updated to include the content for each variable to be substituted into the `roleOverrideValues` property at run-time. The following example sets `rollbackEnabled` to true, followed by override sets for the `hellotest` and `hellotest1` nfApplications.
191
209
192
210
```json
@@ -238,10 +256,8 @@ To enable the SkipUpgrade feature via `roleOverrideValues`, refer to the followi
238
256
- **nfApplication: `runnerTest`**
239
257
- The `skipUpgrade` flag isn't specified. Therefore, `runnerTest` executes a traditional Helm upgrade at the cluster level, even if the precheck criteria are met.
240
258
241
-
242
-
243
259
## Complete roleOverrideValues option reference
244
-
Bringing together all examples in this and other articles, the following reference demonstrates all presently supported install and upgrade options available through the `roleOverrideValues` mechanism.
260
+
Bringing together all examples in this and other articles, the following reference demonstrates all presently supported options available through the `roleOverrideValues` mechanism.