Commit a19148a

Merge pull request ceph#64044 from Naveenaidu/wip-71741-tentacle
tentacle: doc/mgr/telemetry: add doc for telemetry upgrade tests
Reviewed-by: Laura Flores
2 parents 1171194 + fed4c2b commit a19148a

1 file changed

+214
-0
lines changed

The `upgrade` suite verifies that upgrades complete successfully without
disrupting any ongoing workloads.

The diagram below shows the upgrade test directory from the squid release
branch. Each release branch's upgrade directory includes X-2 upgrade testing.
That is, we can test upgrades to the current release from either of the two
previous releases.

```
upgrade
├── quincy-x
│   ├── filestore-remove-check
│   ├── parallel
│   │   ├── 0-start.yaml
│   │   ├── 1-tasks.yaml
│   │   ├── upgrade-sequence.yaml
│   │   └── workload
│   └── stress-split
│
├── reef-x
│   ├── parallel
│   │   └── workload
│   └── stress-split
│
├── squid-p2p
│   ├── squid-p2p-parallel
│   └── squid-p2p-stress-split
│
└── telemetry-upgrade
    ├── quincy-x
    └── reef-x
```

Based on the above example, where X=squid, it is possible to test upgrades
from Quincy (X-2) or from Reef (X-1) to Squid (X).
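The X-2 rule can be sketched as a small shell helper. The `RELEASES` list and the function name `upgrade_sources` are assumptions made for illustration; only the release ordering is taken from this document.

```shell
# Hypothetical helper: list the releases an X-2 upgrade suite starts from.
# RELEASES is an illustrative ordered list (oldest first), not read from Ceph.
RELEASES="quincy reef squid tentacle"

upgrade_sources() {
    local target="$1" prev2="" prev1="" rel
    for rel in $RELEASES; do
        if [ "$rel" = "$target" ]; then
            # Print up to two predecessors: X-2 first, then X-1.
            [ -n "$prev2" ] && echo "$prev2"
            [ -n "$prev1" ] && echo "$prev1"
            return 0
        fi
        prev2=$prev1
        prev1=$rel
    done
    echo "unknown release: $target" >&2
    return 1
}

upgrade_sources squid   # prints "quincy", then "reef"
```

Each printed release corresponds to one `<release>-x` directory in the tree above.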

- The `upgrade/quincy-x/parallel` and `upgrade/reef-x/parallel` sub-suites
  install a Quincy or Reef cluster, then upgrade the cluster to Squid (X). In
  parallel, some workloads are run against the cluster, including telemetry
  workunits.
- The `upgrade/telemetry-upgrade` sub-suite follows the same structure as the
  `upgrade/quincy-x/parallel` and `upgrade/reef-x/parallel` sub-suites above,
  but it runs only the telemetry workunits and no other workloads.

A simple upgrade test contains these steps in order, divided into separate YAML
files:
```
├── 0-start.yaml
├── 1-tasks.yaml
├── upgrade-sequence.yaml
└── workload
```

- `0-start.yaml`: This file describes the Ceph cluster configuration (number of
  OSDs, monitors, etc.) for the test.
- `1-tasks.yaml`: This file lists the tasks we want to run on the cluster. It is
  here that we install an older release, then begin running the given
  `workload` and `upgrade-sequence` in parallel.
- `upgrade-sequence.yaml`: This file contains the steps for upgrading the
  cluster to the designated release.
- `workload`: A set of YAML files with the workloads we want to run while the
  upgrade is in progress.

The parallel step in `1-tasks.yaml` looks like this:

```
- print: "**** done start parallel"
- parallel:
    - workload
    - upgrade-sequence
- print: "**** done end parallel"
```

The `workload` directory contains the workload YAML files, just like any other
suite, and the `upgrade-sequence` is responsible for initiating the upgrade and
waiting for it to complete.
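As a rough analogy only (this is not how teuthology implements `parallel`), the shape of that step resembles two shell jobs that are started together and then joined. The function names `run_workload`, `run_upgrade_sequence`, and `run_parallel` are placeholders invented for this sketch.

```shell
# Placeholders standing in for the `workload` and `upgrade-sequence` steps.
run_workload()         { echo "workload done"; }
run_upgrade_sequence() { echo "upgrade done"; }

run_parallel() {
    local rc=0 workload_pid upgrade_pid
    echo "**** done start parallel"
    run_workload &
    workload_pid=$!
    run_upgrade_sequence &
    upgrade_pid=$!
    # Join both jobs; the step fails if either job failed.
    wait "$workload_pid" || rc=1
    wait "$upgrade_pid" || rc=1
    echo "**** done end parallel"
    return "$rc"
}

run_parallel
```

The point of the join is that the "done end parallel" marker is printed only after both the workload and the upgrade have finished.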

```
# renamed tasks: to upgrade-sequence:
upgrade-sequence:
  sequential:
  - print: "**** done start upgrade, wait"
  ...
      mon.a:
        - ceph config set global log_to_journald false --force
        - ceph orch upgrade start --image quay.ceph.io/ceph-ci/ceph:$sha1
        - while ceph orch upgrade status | jq '.in_progress' | grep true && ! ceph orch upgrade status | jq '.message' | grep Error ; do ceph orch ps ; ceph versions ; ceph orch upgrade status ; sleep 30 ; done
  ...
  - print: "**** done end upgrade, wait..."
```
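The one-line `while` loop above is dense. A more readable variant with a timeout might look like the sketch below. The function name `wait_for_upgrade`, the `STATUS_CMD` override, and the default timeout are assumptions for illustration. The sketch also swaps `jq` for plain `grep` patterns so that it has no extra dependency; the `in_progress` and `message` fields are the ones the loop above reads.

```shell
# Command that prints the upgrade status as JSON. It is overridable so the loop
# can be exercised without a cluster (the override is an assumption of this sketch).
STATUS_CMD=${STATUS_CMD:-"ceph orch upgrade status"}

wait_for_upgrade() {
    local timeout="${1:-3600}" interval="${2:-30}" waited=0 status
    while true; do
        status=$($STATUS_CMD) || return 1
        # Stop with an error if the status message mentions "Error".
        if printf '%s\n' "$status" |
            grep -Eq '"message"[[:space:]]*:[[:space:]]*"[^"]*Error'; then
            echo "upgrade reported an error" >&2
            return 1
        fi
        # Finished once the upgrade is no longer in progress.
        if ! printf '%s\n' "$status" |
            grep -Eq '"in_progress"[[:space:]]*:[[:space:]]*true'; then
            return 0
        fi
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out after ${waited}s" >&2
            return 2
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
}
```

Unlike the original loop, this version distinguishes three outcomes through its return code (0 finished, 1 error, 2 timeout), which is a design choice of the sketch rather than something the suite does.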

## Telemetry Upgrade tests

The telemetry upgrade sub-suite verifies that telemetry emits the correct
collections after the upgrade. This integration test coverage is provided by
workunits, which are bash scripts that run commands against a Ceph cluster.

In the same manner as the `upgrade/parallel` tests, each release branch
references the `qa/workunits` directory, which includes telemetry bash scripts
for the X-2 and X-1 releases. That means we can test telemetry before and after
an upgrade from either of the two previous releases to the current one.

For instance, the relevant telemetry workunits for the `squid` release are:
```
qa/workunits
├── test_telemetry_quincy.sh
├── test_telemetry_quincy_x.sh
├── test_telemetry_reef.sh
└── test_telemetry_reef_x.sh
```

- `test_telemetry_quincy.sh`: tests for the presence of telemetry collections
  on a Quincy cluster before the upgrade.
- `test_telemetry_quincy_x.sh`: tests for the presence of new telemetry
  collections on the X-version cluster after it has been upgraded from Quincy.
- `test_telemetry_reef.sh`: tests for the presence of telemetry collections on
  a Reef cluster before the upgrade.
- `test_telemetry_reef_x.sh`: tests for the presence of new telemetry
  collections on the X-version cluster after it has been upgraded from Reef.

A sample telemetry upgrade test file contains the following test:
```
...
# Assert that new collections are available
COLLECTIONS=$(ceph telemetry collection ls)
NEW_COLLECTIONS=("perf_perf" "basic_mds_metadata" "basic_pool_usage"
                 "basic_rook_v01" "perf_memory_metrics" "basic_pool_options_bluestore")
# Quote the expansion so each array element stays a single word
for col in "${NEW_COLLECTIONS[@]}"; do
    if ! [[ $COLLECTIONS == *$col* ]];
    then
        echo "COLLECTIONS does not contain '$col'."
        exit 1
    fi
done
...
```
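A variant of the check above that reports every missing collection, rather than exiting at the first one, could be sketched as follows. `missing_collections` is a hypothetical helper, not part of the workunits. It takes the listing as an argument so that it can be tried without a cluster, and the sample listing text is made up.

```shell
# Hypothetical helper: print each expected name that is absent from the listing.
# $1 is the text printed by `ceph telemetry collection ls`; the rest are names.
missing_collections() {
    local listing="$1" col rc=0
    shift
    for col in "$@"; do
        # -F: fixed string; -w: whole word, so "perf_perf" does not match
        # a longer name such as "perf_perf_v2".
        if ! printf '%s\n' "$listing" | grep -qFw -- "$col"; then
            echo "$col"
            rc=1
        fi
    done
    return "$rc"
}

# Made-up listing, for illustration only.
sample="basic_pool_usage perf_perf"
missing_collections "$sample" perf_perf basic_rook_v01 ||
    echo "some collections are missing"
```

Note that the whole-word match is stricter than the substring match used in the workunit excerpt. That is a choice of this sketch.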

These workunits are used in the `upgrade` suite, specifically in:
- [upgrade/quincy-x/parallel](https://github.com/ceph/ceph/blob/squid/qa/suites/upgrade/quincy-x/parallel/1-tasks.yaml)
- [upgrade/reef-x/parallel](https://github.com/ceph/ceph/blob/squid/qa/suites/upgrade/reef-x/parallel/1-tasks.yaml)
- [upgrade/telemetry-upgrade](https://github.com/ceph/ceph/tree/squid/qa/suites/upgrade/telemetry-upgrade)

```
upgrade
├── quincy-x
│   └── parallel
│       └── 1-tasks.yaml
├── reef-x
│   └── parallel
│       └── 1-tasks.yaml
└── telemetry-upgrade
    ├── quincy-x
    └── reef-x
```

The `upgrade/quincy-x/parallel` and `upgrade/reef-x/parallel` sub-suites install
a Quincy or Reef cluster, then upgrade the cluster to Squid. In parallel, some
workloads are run against the cluster, including telemetry workunits. The
`1-tasks.yaml` file is where the workunits are run.

For instance, the `upgrade/quincy-x/parallel/1-tasks.yaml` file from the
`squid` release branch looks like this:

```
...
- print: "**** done start telemetry quincy..."
- workunit:
    clients:
      client.0:
        - test_telemetry_quincy.sh
- print: "**** done end telemetry quincy..."

- print: "**** done start parallel"
- parallel:
    - workload
    - upgrade-sequence
- print: "**** done end parallel"

- print: "**** done start telemetry x..."
- workunit:
    clients:
      client.0:
        - test_telemetry_quincy_x.sh
- print: "**** done end telemetry x..."
```

The `test_telemetry_quincy.sh` workunit runs on the Quincy cluster before the
upgrade, and `test_telemetry_quincy_x.sh` runs on the X-version cluster (in
this example, `squid`) after the upgrade.

The `upgrade/telemetry-upgrade` sub-suite follows the same structure as the
`upgrade/quincy-x/parallel` and `upgrade/reef-x/parallel` sub-suites above, but
it runs ONLY the telemetry workunits and no other workloads.

Schedule `upgrade/telemetry-upgrade` when you just want to verify that the
telemetry workunits work as expected (it completes much faster). Schedule the
`upgrade/[quincy|reef]-x/parallel` suites when you want to verify that the
workunits also pass with all the other workloads running.

### What tests to update when a telemetry collection is added/removed

- If the collection is added only to the `main` branch or the current release
  (`tentacle`), update only `test_telemetry_{X-2}_x.sh` and
  `test_telemetry_{X-1}_x.sh` (where `X-2` is Reef and `X-1` is Squid for the
  `tentacle` release branch).
- If the collection is backported to the previous releases (`X-2` and `X-1`),
  update `test_telemetry_{X-2}.sh` and `test_telemetry_{X-1}.sh` (where `X-2`
  is Reef and `X-1` is Squid for the `tentacle` release branch) to reflect the
  collection changes there.
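The two rules above can be written down as a small lookup. This is only a sketch of the rules as stated here; `workunits_to_update` and its `yes`/`no` argument are hypothetical and not part of the Ceph tree.

```shell
# Hypothetical helper: list the workunits to edit for a collection change.
# Usage: workunits_to_update <X-2 release> <X-1 release> <backported: yes|no>
workunits_to_update() {
    local rel backported="$3"
    for rel in "$1" "$2"; do
        if [ "$backported" = yes ]; then
            # Backported to the previous releases: edit the pre-upgrade scripts.
            echo "test_telemetry_${rel}.sh"
        else
            # Only in main or the current release: edit the post-upgrade scripts.
            echo "test_telemetry_${rel}_x.sh"
        fi
    done
}

# For the tentacle branch: X-2 is reef, X-1 is squid.
workunits_to_update reef squid no
```

For the `tentacle` example, the call above prints `test_telemetry_reef_x.sh` and `test_telemetry_squid_x.sh`.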
