Merge pull request ceph#64044 from Naveenaidu/wip-71741-tentacle

yuriw · web-flow · commit a19148a0ea5a · 2025-06-24T11:31:13.000-07:00
tentacle: doc/mgr/telemetry: add doc for telemetry upgrade tests

Reviewed-by: Laura Flores <lflores@redhat.com>
diff --git a/src/pybind/mgr/telemetry/tests/telemetry_upgrade_tests.md b/src/pybind/mgr/telemetry/tests/telemetry_upgrade_tests.md
@@ -0,0 +1,214 @@
+The `upgrade` suite is used to verify that upgrades can complete successfully
+without disrupting any ongoing workloads.
+
+The diagram below represents the upgrade test directory from the squid release
+branch. Each release branch's upgrade directory covers X-2 upgrade testing:
+that is, upgrades to the current release from either of the two previous
+releases.
+
+```
+upgrade
+├── quincy-x
+│   ├── filestore-remove-check
+│   ├── parallel
+│   │   ├── 0-start.yaml
+│   │   ├── 1-tasks.yaml
+│   │   ├── upgrade-sequence.yaml
+│   │   └── workload
+│   └── stress-split
+│
+├── reef-x
+│   ├── parallel
+│   │   └── workload
+│   └── stress-split
+│
+├── squid-p2p
+│   ├── squid-p2p-parallel
+│   └── squid-p2p-stress-split
+│
+└── telemetry-upgrade
+    ├── quincy-x
+    └── reef-x
+
+```
+
+Based on the above example where X=squid, it is possible to test the upgrades
+from Quincy (X-2) or from Reef (X-1) to Squid (X).
+
+- The `upgrade/quincy-x/parallel` and `upgrade/reef-x/parallel` sub-suites
+  install a Quincy or Reef cluster, then upgrade the cluster to Squid (X). In
+  parallel, some workloads are run against the cluster, including telemetry
+  workunits.
+- The `upgrade/telemetry-upgrade` sub-suite is identical to the
+  `upgrade/quincy-x/parallel` and `upgrade/reef-x/parallel` sub-suites above,
+  except that it runs only the telemetry workunits and no other workloads.
+
+A simple upgrade test contains these steps in order, divided into separate yaml
+files:
+```
+├── 0-start.yaml
+├── 1-tasks.yaml
+├── upgrade-sequence.yaml
+└── workload
+```
+
+- `0-start.yaml`: Describes the Ceph cluster configuration for the test (number
+  of OSDs, monitors, etc.).
+- `1-tasks.yaml`: Lists the tasks to run on the cluster. This is where an older
+  release is installed, after which the given `workload` and `upgrade-sequence`
+  are run in parallel.
+- `upgrade-sequence.yaml`: Contains the steps for upgrading the cluster to the
+  designated release.
+- `workload`: A directory of yaml files with the workloads to run while the
+  upgrade is in progress.
+
+The parallel step in `1-tasks.yaml` looks like this:
+
+```
+- print: "**** done start parallel"
+- parallel:
+    - workload
+    - upgrade-sequence
+- print: "**** done end parallel"
+```
+
+The `workload` directory contains workload yaml files, just like any other
+suite, and the `upgrade-sequence` is responsible for initiating the upgrade and
+waiting for it to complete.
+
+```
+# renamed tasks: to upgrade-sequence:
+upgrade-sequence:
+   sequential:
+   - print: "**** done start upgrade, wait"
+   ...
+       mon.a:
+         - ceph config set global log_to_journald false --force
+         - ceph orch upgrade start --image quay.ceph.io/ceph-ci/ceph:$sha1
+         - while ceph orch upgrade status | jq '.in_progress' | grep true && ! ceph orch upgrade status | jq '.message' | grep Error ; do ceph orch ps ; ceph versions ; ceph orch upgrade status ; sleep 30 ; done
+   ...
+   - print: "**** done end upgrade, wait..."
+```
+
+## Telemetry Upgrade tests
+
+The telemetry upgrade sub-suite verifies that telemetry emits the correct
+collections after the upgrade. This integration coverage is provided by
+workunits, which are bash scripts that run commands against a Ceph cluster.
+
+In the same manner as the `upgrade/parallel` tests, each release branch
+references the `qa/workunits` directory, which includes telemetry bash scripts
+for the X-2 and X-1 releases. That means we can test telemetry before and after
+the upgrade from either of the previous two releases to the current one.
+
+For instance, the relevant telemetry workunits for the `squid` release are:
+```
+qa/workunits
+├── test_telemetry_quincy.sh
+├── test_telemetry_quincy_x.sh
+├── test_telemetry_reef.sh
+└── test_telemetry_reef_x.sh
+```
+
+- `test_telemetry_quincy.sh` tests for the presence of telemetry collections on
+  a Quincy cluster before the upgrade.
+- `test_telemetry_quincy_x.sh` tests for the presence of new telemetry
+  collections on the X-version cluster after it has been upgraded from Quincy.
+- `test_telemetry_reef.sh` tests for the presence of telemetry collections on a
+  Reef cluster before the upgrade.
+- `test_telemetry_reef_x.sh` tests for the presence of new telemetry
+  collections on the X-version cluster after it has been upgraded from Reef.
+
+A sample telemetry upgrade test file contains the following test:
+```
+...
+# Assert that new collections are available
+COLLECTIONS=$(ceph telemetry collection ls)
+NEW_COLLECTIONS=("perf_perf" "basic_mds_metadata" "basic_pool_usage"
+                 "basic_rook_v01" "perf_memory_metrics" "basic_pool_options_bluestore")
+for col in "${NEW_COLLECTIONS[@]}"; do
+    if ! [[ $COLLECTIONS == *"$col"* ]]; then
+        echo "COLLECTIONS does not contain '$col'."
+        exit 1
+    fi
+done
+...
+```
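
The membership check in the excerpt above can be factored into a small reusable function. The following is a minimal sketch, not code from the Ceph tree: `assert_collections_present` is a hypothetical helper name, and it assumes the listing is passed in as a string (for example the output of `ceph telemetry collection ls`). It uses a `case` pattern for the substring match and reports every missing name instead of stopping at the first one.

```shell
# Sketch only: assert_collections_present is a hypothetical helper and is
# not part of the Ceph workunits.
#
# Usage: assert_collections_present "<collection listing>" name [name...]
# Returns 0 if every name occurs in the listing, 1 otherwise. Each missing
# name is reported on stderr.
assert_collections_present() {
    local listing="$1"
    shift
    local col
    local missing=0
    for col in "$@"; do
        case "$listing" in
            *"$col"*) ;;  # found: substring match, same as the excerpt above
            *)
                echo "COLLECTIONS does not contain '$col'." >&2
                missing=1
                ;;
        esac
    done
    return "$missing"
}
```

In a workunit this could be called as `assert_collections_present "$(ceph telemetry collection ls)" "${NEW_COLLECTIONS[@]}" || exit 1`. Because the match is a plain substring test, a name that is a prefix of another listed collection also passes; a stricter check would need to compare whole fields.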
+
+These workunits are used in the `upgrade` suite, specifically in:
+- [upgrade/quincy-x/parallel](https://github.com/ceph/ceph/blob/squid/qa/suites/upgrade/quincy-x/parallel/1-tasks.yaml)
+- [upgrade/reef-x/parallel](https://github.com/ceph/ceph/blob/squid/qa/suites/upgrade/reef-x/parallel/1-tasks.yaml)
+- [upgrade/telemetry-upgrade](https://github.com/ceph/ceph/tree/squid/qa/suites/upgrade/telemetry-upgrade)
+
+```
+upgrade
+├── quincy-x
+│   └── parallel
+│       └── 1-tasks.yaml
+├── reef-x
+│   └── parallel
+│       └── 1-tasks.yaml
+└── telemetry-upgrade
+    ├── quincy-x
+    └── reef-x
+```
+
+The `upgrade/quincy-x/parallel` and `upgrade/reef-x/parallel` sub-suites install
+a Quincy or Reef cluster, then upgrade the cluster to Squid. In parallel, some
+workloads are run against the cluster, including telemetry workunits. The
+`1-tasks.yaml` file is where the workunits are run.
+
+For instance, the `upgrade/quincy-x/parallel/1-tasks.yaml` file from the
+`squid` release branch looks like this:
+
+```
+...
+- print: "**** done start telemetry quincy..."
+- workunit:
+    clients:
+      client.0:
+        - test_telemetry_quincy.sh
+- print: "**** done end telemetry quincy..."
+
+- print: "**** done start parallel"
+- parallel:
+    - workload
+    - upgrade-sequence
+- print: "**** done end parallel"
+
+- print: "**** done start telemetry x..."
+- workunit:
+    clients:
+      client.0:
+        - test_telemetry_quincy_x.sh
+- print: "**** done end telemetry x..."
+```
+
+The `test_telemetry_quincy.sh` workunit is run on the Quincy cluster before the
+upgrade and `test_telemetry_quincy_x.sh` is run on the X-version cluster (in
+this example `squid`) after the upgrade.
+
+The `upgrade/telemetry-upgrade` sub-suite is identical to the
+`upgrade/quincy-x/parallel` and `upgrade/reef-x/parallel` sub-suites above,
+except that it runs ONLY the telemetry workunits and no other workloads.
+
+Schedule `upgrade/telemetry-upgrade` when you just want to verify that the
+telemetry workunits work as expected, since it completes much faster. Schedule
+the `upgrade/[quincy|reef]-x/parallel` suites when you want to verify that the
+workunits also pass while all the other workloads are running.
+
+### Which tests to update when a telemetry collection is added or removed
+
+- If the collection is added only to the `main` branch or the current release
+  (`tentacle`), update only `test_telemetry_{X-2}_x.sh` and
+  `test_telemetry_{X-1}_x.sh` (where *`X-2` is Reef and `X-1` is Squid for the
+  `tentacle` release branch*).
+- If the collection is backported to the `X-2` releases, update the
+  `test_telemetry_{X-2}.sh` and `test_telemetry_{X-1}.sh` files (where *`X-2`
+  is Reef and `X-1` is Squid for the `tentacle` release branch*) to reflect the
+  collection changes there.