Commit 73e223a

Merge pull request #300087 from msftadam/patch-74
Update safe-upgrade-practices.md
2 parents a9306db + 40d1369 commit 73e223a

1 file changed

+60
-44
lines changed

articles/operator-service-manager/safe-upgrade-practices.md

Lines changed: 60 additions & 44 deletions
@@ -3,20 +3,20 @@ title: Get started with Azure Operator Service Manager Safe Upgrade Practices
33
description: Safely execute complex upgrades of CNF workloads on Azure Operator Nexus
44
author: msftadam
55
ms.author: adamdor
6-
ms.date: 02/19/2024
6+
ms.date: 05/20/2024
77
ms.topic: upgrade-and-migration-article
88
ms.service: azure-operator-service-manager
99
---
1010

1111
# Get started with safe upgrade practices
12-
This article introduces Azure Operator Service Manager (AOSM) safe upgrade practices (SUP). This feature set enables the safe execution of complex container network function (CNF) hosted on Azure Operator Nexus. These upgrades are structured in general compliance with partner In Service Software Upgrade (ISSU) requirements. Look for future articles to expand on advanced SUP features and capabilities.
12+
This article introduces Azure Operator Service Manager (AOSM) safe upgrade practices (SUP). This feature set enables upgrades to complex container network functions (CNFs) hosted on Azure Operator Nexus. These upgrades generally support partner In Service Software Upgrade (ISSU) methods and requirements. This article introduces basic concepts; other articles expand on advanced SUP features and capabilities.
1313

1414
## Introduction to safe upgrades
15-
A given network service supported by AOSM is composed of one to many CNFs which, over time, require software upgrades. For each upgrade, it's necessary to run one to many helm operations, updating dependent network function applications (nfApps), in a particular order, in a manner which least impacts the network service. AOSM SUP represents a set of features, which enables safe automation of these operations on Azure Operator Nexus.
15+
A given network service supported by AOSM, composed of one or more CNFs, includes components which, over time, require software or configuration changes. To make these component-level changes, it's necessary to run one or more helm operations, upgrading each network function application (nfApp) in a particular order and in a manner which least impacts the network service. AOSM safe upgrade practices apply the following high-level capabilities to handle upgrade process and workflow requirements:
1616

1717
* SNS Reput Support - Execute helm upgrade operations across all nfApps in a network function definition version (NFDV).
1818
* Nexus Platform - Support SNS reput operations on Nexus platform targets.
19-
* Operation Time-outs - Ability to set operational time-outs for each nfApp operation.
19+
* Operation Timeouts - Ability to set operational timeouts for each nfApp operation.
2020
* Synchronous Operations - Ability to run one serial nfApp operation at a time.
2121
* Control Upgrade Order - Define different nfApp sequence for install and upgrade.
2222
* Pause On Failure - Default behavior pauses after an nfApp operation failure.
@@ -26,28 +26,30 @@ A given network service supported by AOSM is composed of one to many CNFs which,
2626
* Image Preloading - Ability to preload images to edge repository.
2727

2828
## Safe upgrade approach
29-
To update an existing Azure Operator Service Manager site network service (SNS), the Operator executes a reput update request against the deployed SNS resource. Where the SNS contains CNFs with multiple nfApps, the request is fanned out across all nfApps defined in the network function definition version (NFDV). By default, in the order, which they appear, or optionally in the order defined by `updateDependsOn` parameter.
29+
To update an existing AOSM site network service (SNS), the operator executes a reput request against the deployed SNS resource. Where the SNS contains CNFs with multiple nfApps, the request is fanned out across all nfApps defined in the network function definition version (NFDV). By default, nfApps are processed in the order in which they appear, or optionally in the order defined by the `updateDependsOn` parameter.
3030

31-
For each nfApp, the reput update request supports increasing a helm chart version, adding/removing helm values and/or adding/removing any nfApps. Time-outs can be set per nfApp, based on known allowable runtimes, but nfApps can only be processed in serial order, one after the other. The reput update implements the following processing logic:
31+
For each nfApp, the reput request supports various changes including increasing a helm chart version, adding/removing helm values and/or adding/removing any nfApps. While timeouts can be set per nfApp, based on known allowable runtimes, nfApps can only be processed in serial order, one after the other. The reput update implements the following processing logic:
3232

33-
* nfApps are processed following either updateDependsOn ordering, or in the sequential order they appear.
33+
* nfApps are processed following either `updateDependsOn` ordering, or in the sequential order they appear.
3434
* nfApps with parameter `applicationEnablement` set to `Disabled` are skipped.
3535
* nfApps with parameter `skipUpgrade` set to `enabled` are skipped if no changes are detected.
36-
* nfApps which are common between old and new NFDV are upgraded.
37-
* nfApps which are only in the new NFDV are installed.
38-
* nfApps deployed, but not referenced by the new NFDV, are deleted.
36+
* nfApps which are common between old and new NFDV are upgraded using `helm upgrade`.
37+
* nfApps which are only in the new NFDV are installed using `helm install`.
38+
* nfApps deployed, but not referenced by the new NFDV, are deleted using `helm delete`.
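
The processing logic above can be sketched as a small planning function. This is an illustrative model only, not AOSM's implementation: the `NfApp` class, its field names, and `plan_reput` are hypothetical, `updateDependsOn` ordering is not modeled, and applying `skipUpgrade` only to already-deployed nfApps is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NfApp:
    """Hypothetical, simplified view of an nfApp entry in the new NFDV."""
    name: str
    enabled: bool = True        # stands in for applicationEnablement
    skip_upgrade: bool = False  # stands in for skipUpgrade
    changed: bool = True        # True if chart or values differ from what is deployed


def plan_reput(deployed: list[str], new_nfdv: list[NfApp]) -> list[tuple[str, str]]:
    """Return (nfApp name, action) pairs in processing order."""
    deployed_set = set(deployed)
    plan = []
    for app in new_nfdv:  # sequential order as declared in the NFDV
        if not app.enabled:
            plan.append((app.name, "skip"))
        elif app.name in deployed_set:
            if app.skip_upgrade and not app.changed:
                plan.append((app.name, "skip"))
            else:
                plan.append((app.name, "upgrade"))
        else:
            plan.append((app.name, "install"))
    new_names = {app.name for app in new_nfdv}
    # Deployed nfApps that the new NFDV no longer references are deleted last.
    plan.extend((name, "delete") for name in deployed if name not in new_names)
    return plan
```

Calling `plan_reput(["a", "old"], [NfApp("a"), NfApp("c")])` yields an upgrade for `a`, an install for `c`, and a delete for `old`, in that order.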
3939

40-
To ensure outcomes, nfApp testing is supported using helm, either helm upgrade pre/post tests, or standalone helm tests. For pre/post tests failures, the atomic parameter is honored. With atomic/true, the failed chart is rolled back. With atomic/false, no rollback is executed. For more information on standalone helm testing, see the following article: [Run tests after install or upgrade](safe-upgrades-helm-test.md)
40+
To ensure outcomes, nfApp testing is supported using helm methods, either tests triggered by helm pre or post hooks, or the standalone helm test hook. For a pre or post hook failure, the `atomic` parameter is honored: with atomic/true, the failed chart is rolled back; with atomic/false, no rollback is executed. For a standalone helm test hook failure, the `rollbackOnTestFailure` parameter is honored, following similar logic to `atomic`. For more information on standalone helm testing, see the following article: [Run tests after install or upgrade](safe-upgrades-helm-test.md)
41+
42+
When an nfApp operation failure occurs, and after the failed nfApp is handled via the `atomic` or `rollbackOnTestFailure` parameters, the operator can control how AOSM handles any nfApps changed before the failed nfApp. With pause-on-failure, the operator can force AOSM to break after addressing the failed nfApp, preserving the mixed version environment. With rollback-on-failure, the operator can force AOSM to roll back any prior nfApps, restoring the original environment snapshot. For more information on controlling upgrade failure behavior, see the following article: [Control upgrade failure behavior](safe-upgrades-nf-level-rollback.md)
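
The difference between the two failure behaviors can be illustrated with a short simulation. This is a sketch under stated assumptions, not AOSM code: `run_upgrade` and its callback arguments are hypothetical, and rolling back earlier nfApps in reverse order is an assumption that the text above does not confirm.

```python
def run_upgrade(apps, upgrade, rollback, rollback_on_failure=False):
    """Process nfApps serially; return (names left upgraded, failed name or None)."""
    completed = []
    for name in apps:
        if upgrade(name):
            completed.append(name)
            continue
        # The failed nfApp itself is assumed to be handled already
        # (atomic / rollbackOnTestFailure) by the time we get here.
        if rollback_on_failure:
            # Assumption: earlier nfApps are rolled back in reverse order.
            for done in reversed(completed):
                rollback(done)
            completed = []
        # Pause: no further nfApps are processed after a failure.
        return completed, name
    return completed, None
```

With the default pause behavior, a failure on the second of three nfApps leaves the first one upgraded (a mixed version environment). With `rollback_on_failure=True`, the earlier nfApps are rolled back as well.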
4143

4244
## Considerations for in-service upgrades
43-
Azure Operator Service Manager generally supports in service upgrades, an upgrade method which advances a deployment version without interrupting the running service. Some considerations are necessary to ensure the proper behavior of AOSM during ISSU operations.
45+
Azure Operator Service Manager generally supports in-service upgrades, an upgrade method which advances a deployment version without interrupting the running service. Some network function owner considerations are necessary to ensure the proper behavior of AOSM during ISSU operations.
4446
* Where AOSM performs an upgrade against an ordered set of multiple nfApps, AOSM first upgrades or creates all new nfApps, then deletes all old nfApps. This approach ensures service isn't impacted until all new nfApps are ready but requires extra platform capacity for transient hosting of both old and new nfApps.
4547
* Where AOSM upgrades an nfApp with multiple replicas, AOSM honors the deployment profile settings for either the rolling or recreate option. Where rolling is used, expose the values `maxUnavailable` and `maxSurge` as CGS parameters, which can then be set via operator CGV at run-time.
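
When choosing `maxUnavailable` and `maxSurge` values, it can help to work out the pod counts they allow during a rolling update. The sketch below follows standard Kubernetes Deployment semantics (a percentage for `maxUnavailable` rounds down, a percentage for `maxSurge` rounds up, and both default to 25%). The function names are hypothetical, and edge cases such as both values resolving to zero are not modeled.

```python
def resolve(value, replicas, round_up):
    """Resolve an absolute count or a percentage string such as "25%"."""
    if isinstance(value, str) and value.endswith("%"):
        pct = int(value[:-1])
        if round_up:
            return -(-replicas * pct // 100)  # integer ceiling division
        return replicas * pct // 100          # integer floor division
    return int(value)


def rolling_bounds(replicas, max_unavailable="25%", max_surge="25%"):
    """Return (minimum available pods, maximum total pods) during a rolling update."""
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    surge = resolve(max_surge, replicas, round_up=True)
    return replicas - unavailable, replicas + surge
```

For example, `rolling_bounds(10)` shows that a 10-replica nfApp with default settings keeps at least 8 pods available and may briefly run up to 13, which is the kind of transient capacity the platform needs to accommodate.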
4648

4749
Ultimately, the ability for a given service to be upgraded without interruption is a feature of the service itself. Consult further with the service publisher to understand the in-service upgrade capabilities and ensure they're aligned with the proper AOSM behavioral options.
4850

4951
## Safe upgrade prerequisites
50-
When planning for an upgrade using Azure Operator Service Manager, address the following requirements in advance of upgrade execution to optimize the time spent attempting the upgrade.
52+
When planning for an upgrade using AOSM, address the following requirements in advance of upgrade execution, to optimize the time spent attempting the upgrade and to help ensure its success.
5153

5254
- Onboard updated artifacts using publisher and/or designer workflows.
5355
- In most cases, use the existing publisher to host new version artifacts.
@@ -73,7 +75,7 @@ When planning for an upgrade using Azure Operator Service Manager, address the f
7375
- Settings used for production may suppress failure details, while settings used for debugging, or testing, may choose to expose these details.
7476

7577
## Safe upgrade procedure
76-
Follow the following process to trigger an upgrade with Azure Operator Service Manager.
78+
Use the following process to trigger an upgrade with AOSM.
7779

7880
* Create new NFDV resource
7981
* The new NFDV version must be in a valid SemVer format. The new version can be an upgrade, a greater value versus the deployed version, or a downgrade, a lower value versus the deployed version. The new version can differ by major, minor, or patch values.
@@ -99,34 +101,50 @@ In cases where a reput update fails, the following process can be followed to re
99101
* By default, the reput retries nfApps in the declared update order, unless they're skipped using the `applicationEnablement` flag.
100102

101103
## Control timeouts with installOptions and upgradeOptions
102-
When an SNS operation starts either a helm install and helm upgrade, a 27-minute default timeout value. This value can be customized at the global NF, but we recommend to customize this value at the component NF levelby defining override values in the NF payload template. Further the values in the NF payload template and be exposed as operator values, allowing final customization at run-time. The following example demonstrates supported installOptions and upgradeOptions parameters applied to a single nfApp component;
104+
When an SNS operation starts either a `helm install` or a `helm upgrade`, a 27-minute default timeout value is used. While this value can be customized at the global network function (NF) level, we recommend customizing this value at the component NF level using `roleOverrideValues` in the NF payload template. Further exposing the `roleOverrideValues` in CGS/CGV allows control by the operator at run-time. The following example demonstrates supported `installOptions` and `upgradeOptions` parameters applied across two nfApp components:
103105

104-
```
105-
"roleOverrideValues": ["{
106-
"name": "hellotest",
107-
"deployParametersMappingRuleProfile": {
108-
"helmMappingRuleProfile": {
109-
"options": {
110-
"installOptions": {
111-
"atomic": true,
112-
"wait": true,
113-
"timeout": "1" },
114-
"upgradeOptions": {
115-
"atomic": true,
116-
"wait": true,
117-
"timeout": "2" }
118-
} } } }"
119-
]
106+
```json
107+
{
108+
"roleOverrideValues": [
109+
{
110+
"name": "nfApplication1",
111+
"deployParametersMappingRuleProfile": {
112+
"helmMappingRuleProfile": {
113+
"options": {
114+
"installOptions": {
115+
"atomic": "true",
116+
"wait": "true",
117+
"timeout": "1"
118+
},
119+
"upgradeOptions": {
120+
"atomic": "true",
121+
"wait": "true",
122+
"timeout": "1"
123+
} } } } },
124+
{
125+
"name": "nfApplication2",
126+
"deployParametersMappingRuleProfile": {
127+
"helmMappingRuleProfile": {
128+
"options": {
129+
"installOptions": {
130+
"atomic": "true",
131+
"wait": "true",
132+
"timeout": "1"
133+
},
134+
"upgradeOptions": {
135+
"atomic": "true",
136+
"wait": "true",
137+
"timeout": "1"
138+
} } } } }
139+
]
140+
}
120141
```
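
Where many nfApps need per-component timeouts, the entries can be generated rather than written by hand. The following is a minimal sketch that reproduces the shape of the example above; the helper name `timeout_override` is hypothetical, and emitting every option as a string simply mirrors that example rather than a confirmed schema requirement.

```python
import json


def timeout_override(name, install_timeout, upgrade_timeout, atomic=True, wait=True):
    """Build one roleOverrideValues entry shaped like the example above."""
    def options(timeout):
        # Values are emitted as strings to mirror the example above.
        return {
            "atomic": str(atomic).lower(),
            "wait": str(wait).lower(),
            "timeout": str(timeout),
        }

    return {
        "name": name,
        "deployParametersMappingRuleProfile": {
            "helmMappingRuleProfile": {
                "options": {
                    "installOptions": options(install_timeout),
                    "upgradeOptions": options(upgrade_timeout),
                }
            }
        },
    }


payload = {
    "roleOverrideValues": [
        timeout_override("nfApplication1", 1, 1),
        timeout_override("nfApplication2", 1, 1),
    ]
}
print(json.dumps(payload, indent=2))
```

Keeping install and upgrade timeouts as separate arguments makes it straightforward to give a slower upgrade path more time than the initial install.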
121142

122143
## Skip nfApps using applicationEnablement
123-
In the NFDV resource, under `deployParametersMappingRuleProfile` there's a supported property `applicationEnablement` of type enum, which takes values of Unknown, Enabled, or disabled. It can be used to manually exclude nfApp operations during network function (NF) deployment. The following example demonstrates a generic method to parameterize `applicationEnablement` as an included value in `roleOverrideValues` property.
124-
125-
### Template changes
126-
While no NFDV changes are necessarily required, optionally the publisher can use the NFDV to set a default value for the `applicationEnablement` property. The default value is used, unless its changed via `roleOverrideValues`.
144+
In the NFDV resource, under `deployParametersMappingRuleProfile`, there's a supported property `applicationEnablement` of type enum, which takes values of `Unknown`, `Enabled`, or `Disabled`. It can be used to manually exclude nfApp operations during network function deployment. The following example demonstrates a generic method to parameterize `applicationEnablement` as an included value in the `roleOverrideValues` property.
127145

128-
#### NFDV template
129-
Use the NFDV template to set a default value for `applicationEnablement`. The following example sets `enabled` state as the default value for `hellotest` networkfunctionApplication.
146+
### NFDV template changes
147+
While no NFDV changes are necessarily required, the publisher can optionally use the NFDV template to set a default value for the `applicationEnablement` property. The default value is used unless it's changed via `roleOverrideValues`. The following example sets the `enabled` state as the default value for the `hellotest` networkfunctionApplication.
130148

131149
```json
132150
"location":"<location>",
@@ -145,7 +163,7 @@ Use the NFDV template to set a default value for `applicationEnablement`. The fo
145163

146164
To manage the `applicationEnablement` value more dynamically, the operator can pass a real-time value using the NF template `roleOverrideValues` property. While it's possible for the operator to manipulate the NF template directly, it's better to parameterize the `roleOverrideValues`, so that values can be passed via a CGV template at runtime. The following examples demonstrate the needed modifications to the CGS, NF templates, and finally the CGV.
147165

148-
#### CGS template
166+
### CGS template changes
149167
The CGS template must be updated to include one variable declaration for each line to parameterize under `roleOverrideValues`. The following example demonstrates three override values.
150168

151169
```json
@@ -160,8 +178,8 @@ The CGS template must be updated to include one variable declaration for each li
160178
}
161179
```
162180

163-
#### NF payload template
164-
The NF template must be update three ways. First, the implicit config parameter must be defined as type object. Second, `roleOverrideValues0`, `roleOverrideValues1`, and `roleOverrideValues2` must be declared as variables mapped to config parameter. Third, `roleOverrideValues0`, `roleOverrideValues1` and `roleOverrideValues2` must be referenced for substitution under `roleOverrideValues` in proper order and following proper syntax.
181+
### NF payload template changes
182+
The NF template must be updated in three ways. First, the implicit config parameter must be defined as type object. Second, `roleOverrideValues0`, `roleOverrideValues1`, and `roleOverrideValues2` must be declared as variables mapped to the config parameter. Third, `roleOverrideValues0`, `roleOverrideValues1`, and `roleOverrideValues2` must be referenced for substitution under `roleOverrideValues` in proper order and following proper syntax.
165183

166184
```json
167185
"parameters": {
@@ -186,7 +204,7 @@ The NF template must be update three ways. First, the implicit config parameter
186204
}
187205
```
188206

189-
#### CGV template
207+
### CGV template changes
190208
The CGV template can now be updated to include the content for each variable to be substituted into `roleOverrideValues` property at run-time. The following example sets `rollbackEnabled` to true, followed by override sets for `hellotest` and `hellotest1` nfApplications.
191209

192210
```json
@@ -238,10 +256,8 @@ To enable the SkipUpgrade feature via `roleOverrideValues`, refer to the followi
238256
- **nfApplication: `runnerTest`**
239257
- The `skipUpgrade` flag isn't specified. Therefore, `runnerTest` executes a traditional Helm upgrade at the cluster level, even if the precheck criteria are met.
240258

241-
242-
243259
## Complete roleOverrideValues option reference
244-
Bringing together all examples in this and other articles, the following reference demonstrates all presently supported install and upgrade options available through the `roleOverrideValues` mechanism.
260+
Bringing together all examples in this and other articles, the following reference demonstrates all presently supported options available through the `roleOverrideValues` mechanism.
245261

246262
```json
247263
{
