
Commit 6a2fc10

Add upgrade workload

1 parent 7b32be0, commit 6a2fc10

File tree: 6 files changed, +442 -31 lines changed

docs/README.md

Lines changed: 33 additions & 31 deletions
# Table of workloads

| Workload/tooling                                     | Short Description                         | Minimum Requirements                         |
|:---------------------------------------------------- |:----------------------------------------- |:-------------------------------------------- |
| [Tooling](tooling.md)                                | Setup pbench instrumentation tools        | Cluster-admin, Privileged Containers         |
| [Test](test.md)                                      | Test/Run your workload from ssh Container | Cluster-admin, Privileged Containers         |
| [Baseline](baseline.md)                              | Baseline metrics capture                  | Tooling job*                                 |
| [Scale](scale.md)                                    | Scales worker nodes                       | Cluster-admin                                |
| [NodeVertical](nodevertical.md)                      | Node Kubelet Density                      | Labeling Nodes                               |
| [PodVertical](podvertical.md)                        | Max Pod Density                           | None                                         |
| [MasterVertical](mastervertical.md)                  | Master Node Stress workload               | None                                         |
| [HTTP](http.md)                                      | HTTP ingress TPS/Latency                  | None                                         |
| [Network](network.md)                                | TCP/UDP Throughput/Latency                | Labeling Nodes, [See below](#network)        |
| [Deployments Per Namespace](deployments-per-ns.md)   | Maximum Deployments                       | None                                         |
| [PVCscale](pvscale.md)                               | PVCScale test                             | Working storageclass                         |
| [Conformance](conformance.md)                        | OCP/Kubernetes e2e tests                  | None                                         |
| [Namespaces per cluster](namespaces-per-cluster.md)  | Maximum Namespaces                        | None                                         |
| [Services per namespace](services-per-namespace.md)  | Maximum services per namespace            | None                                         |
| [FIO I/O test](fio.md)                               | FIO I/O test - stress storage backend     | Privileged Containers, Working storage class |
| [Upgrade](upgrade.md)                                | Upgrades cluster                          | Cluster-admin                                |

* A Baseline job run without a tooled cluster simply idles the cluster. The goal is to capture resource consumption over a period of time to characterize resource requirements, thus tooling is required. (For now)

Each workload implements a form of pass/fail criteria in order to flag failed tests in CI.

| Workload/tooling                                     | Pass/Fail                     |
|:---------------------------------------------------- |:----------------------------- |
| [Tooling](tooling.md)                                | NA                            |
| [Test](test.md)                                      | NA                            |
| [Baseline](baseline.md)                              | NA                            |
| [Scale](scale.md)                                    | Yes: Test Duration            |
| [NodeVertical](nodevertical.md)                      | Yes: Exit Code, Test Duration |
| [PodVertical](podvertical.md)                        | Yes: Exit Code, Test Duration |
| [MasterVertical](mastervertical.md)                  | Yes: Exit Code, Test Duration |
| [HTTP](http.md)                                      | No                            |
| [Network](network.md)                                | No                            |
| [Deployments Per Namespace](deployments-per-ns.md)   | No                            |
| [PVCscale](pvscale.md)                               | No                            |
| [Conformance](conformance.md)                        | No                            |
| [Namespaces per cluster](namespaces-per-cluster.md)  | Yes: Exit code, Test Duration |
| [Services per namespace](services-per-namespace.md)  | Yes: Exit code, Test Duration |
| [FIO I/O test](fio.md)                               | No                            |
| [Upgrade](upgrade.md)                                | Yes: Test Duration            |
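The duration-based pass/fail idea can be sketched as a small shell check (a simplified illustration with assumed values; the `sleep` stands in for a real workload run and is not part of the committed scripts):

```shell
#!/bin/sh
# Time a workload and compare its duration against an expected budget.
start_time=$(date +%s)
sleep 1                      # stand-in for the actual workload
end_time=$(date +%s)
duration=$((end_time - start_time))
expected_duration=1800       # cf. EXPECTED_UPGRADE_DURATION
if [ "${duration}" -gt "${expected_duration}" ]; then
  echo "Test Analysis: Failed"
  exit 1
fi
echo "Test Analysis: Passed"
```

A real harness would read the recorded duration from the workload's result file rather than timing inline.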

docs/upgrade.md

Lines changed: 114 additions & 0 deletions
# Upgrade Workload

The upgrade workload playbook is `workloads/upgrade.yml`; it upgrades a cluster with or without tooling.

Note that upgrades can reboot nodes, so a pbench agent pod that is actively collecting data on a rebooted node will be interrupted. As with cloud-native workloads in general, pods are expected to be ephemeral anyway.

Running from CLI:

```sh
$ cp workloads/inventory.example inventory
$ # Add orchestration host to inventory
$ # Edit vars in workloads/vars/upgrade.yml or define Environment vars (See below)
$ time ansible-playbook -vv -i inventory workloads/upgrade.yml
```

## Environment variables

### PUBLIC_KEY
Default: `~/.ssh/id_rsa.pub`
Public ssh key file for Ansible.

### PRIVATE_KEY
Default: `~/.ssh/id_rsa`
Private ssh key file for Ansible.

### ORCHESTRATION_USER
Default: `root`
User for Ansible to log in as. Must authenticate with PUBLIC_KEY/PRIVATE_KEY.

### WORKLOAD_IMAGE
Default: `quay.io/openshift-scale/scale-ci-workload`
Container image that runs the workload script.

### WORKLOAD_JOB_NODE_SELECTOR
Default: `true`
Enables/disables the node selector that places the workload job on the `workload` node.

### WORKLOAD_JOB_TAINT
Default: `true`
Enables/disables the toleration on the workload job to permit the `workload` taint.

### WORKLOAD_JOB_PRIVILEGED
Default: `false`
Enables/disables running the workload Pod as privileged.

### KUBECONFIG_FILE
Default: `~/.kube/config`
Location of the kubeconfig on the orchestration host.

### PBENCH_INSTRUMENTATION
Default: `false`
Enables/disables running the workload wrapped by pbench-user-benchmark. When enabled, pbench agents can then be enabled (`ENABLE_PBENCH_AGENTS`) for further instrumentation data, and pbench-copy-results can be enabled (`ENABLE_PBENCH_COPY`) to export captured data for further analysis.

### ENABLE_PBENCH_AGENTS
Default: `false`
Enables/disables the collection of pbench data on the pbench agent Pods. These Pods are deployed by the tooling playbook.

### ENABLE_PBENCH_COPY
Default: `false`
Enables/disables the copying of pbench data to a remote results server for further analysis.

### PBENCH_SSH_PRIVATE_KEY_FILE
Default: `~/.ssh/id_rsa`
Location of the ssh private key used to authenticate to the pbench results server.

### PBENCH_SSH_PUBLIC_KEY_FILE
Default: `~/.ssh/id_rsa.pub`
Location of the ssh public key used to authenticate to the pbench results server.

### PBENCH_SERVER
Default: There is no public default.
DNS address of the pbench results server.

### SCALE_CI_RESULTS_TOKEN
Default: There is no public default.
Reserved for future use by the pbench and prometheus scrapers to place results into the git repo that holds results data.

### JOB_COMPLETION_POLL_ATTEMPTS
Default: `360`
Number of retries for Ansible to poll if the workload job has completed. Poll attempts delay 10s between polls, with some additional time taken for each polling action depending on the orchestration host setup.

### UPGRADE_TEST_PREFIX
Default: `upgrade`
Test prefix for the pbench results.

### UPGRADE_NEW_VERSION_URL
Default: No default.
The URL portion of the new version to upgrade to. Examples: `quay.io/openshift-release-dev/ocp-release` or `registry.svc.ci.openshift.org/ocp/release`.

### UPGRADE_NEW_VERSION
Default: No default.
The new version to upgrade to. Check [https://openshift-release.svc.ci.openshift.org/](https://openshift-release.svc.ci.openshift.org/) for versions and upgrade paths based on the installed cluster.

### FORCE_UPGRADE
Default: `false`
Determines the `--force` flag value for the `oc adm upgrade` command used to initiate the upgrade.

### UPGRADE_POLL_ATTEMPTS
Default: `1800`
Number of times to poll to determine if the cluster has been upgraded. Each poll attempt corresponds to approximately a 2s wait plus poll time.

### EXPECTED_UPGRADE_DURATION
Default: `1800`
Pass/fail criteria: the maximum duration (in seconds) within which the upgrade workload is expected to complete.

## Smoke test variables

```sh
UPGRADE_TEST_PREFIX=upgrade_smoke
UPGRADE_NEW_VERSION_URL=registry.svc.ci.openshift.org/ocp/release
UPGRADE_NEW_VERSION=4.2.0-0.nightly-2019-08-13-183722
FORCE_UPGRADE=true
UPGRADE_POLL_ATTEMPTS=7200
```
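For a smoke run, these values can be exported before invoking the playbook (a sketch that follows the CLI example above; the inventory path is an assumption):

```sh
$ export UPGRADE_TEST_PREFIX=upgrade_smoke
$ export UPGRADE_NEW_VERSION_URL=registry.svc.ci.openshift.org/ocp/release
$ export UPGRADE_NEW_VERSION=4.2.0-0.nightly-2019-08-13-183722
$ export FORCE_UPGRADE=true
$ export UPGRADE_POLL_ATTEMPTS=7200
$ time ansible-playbook -vv -i inventory workloads/upgrade.yml
```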
Lines changed: 115 additions & 0 deletions
apiVersion: v1
kind: ConfigMap
metadata:
  name: scale-ci-workload-script
data:
  run.sh: |
    #!/bin/sh
    set -eo pipefail
    workload_log() { echo "$(date -u) $@" >&2; }
    export -f workload_log
    workload_log "Configuring pbench for running upgrade workload"
    mkdir -p /var/lib/pbench-agent/tools-default/
    echo "${USER_NAME:-default}:x:$(id -u):0:${USER_NAME:-default} user:${HOME}:/sbin/nologin" >> /etc/passwd
    if [ "${ENABLE_PBENCH_AGENTS}" = true ]; then
      echo "" > /var/lib/pbench-agent/tools-default/disk
      echo "" > /var/lib/pbench-agent/tools-default/iostat
      echo "workload" > /var/lib/pbench-agent/tools-default/label
      echo "" > /var/lib/pbench-agent/tools-default/mpstat
      echo "" > /var/lib/pbench-agent/tools-default/oc
      echo "" > /var/lib/pbench-agent/tools-default/perf
      echo "" > /var/lib/pbench-agent/tools-default/pidstat
      echo "" > /var/lib/pbench-agent/tools-default/sar
      master_nodes=`oc get nodes -l pbench_agent=true,node-role.kubernetes.io/master= --no-headers | awk '{print $1}'`
      for node in $master_nodes; do
        echo "master" > /var/lib/pbench-agent/tools-default/remote@$node
      done
      infra_nodes=`oc get nodes -l pbench_agent=true,node-role.kubernetes.io/infra= --no-headers | awk '{print $1}'`
      for node in $infra_nodes; do
        echo "infra" > /var/lib/pbench-agent/tools-default/remote@$node
      done
      worker_nodes=`oc get nodes -l pbench_agent=true,node-role.kubernetes.io/worker= --no-headers | awk '{print $1}'`
      for node in $worker_nodes; do
        echo "worker" > /var/lib/pbench-agent/tools-default/remote@$node
      done
    fi
    source /opt/pbench-agent/profile
    workload_log "Done configuring pbench for upgrade workload run"

    workload_log "Running upgrade workload"
    if [ "${PBENCH_INSTRUMENTATION}" = "true" ]; then
      pbench-user-benchmark -- sh /root/workload/workload.sh
      result_dir="/var/lib/pbench-agent/$(ls -t /var/lib/pbench-agent/ | grep "pbench-user" | head -2 | tail -1)"/1/sample1
      if [ "${ENABLE_PBENCH_COPY}" = "true" ]; then
        pbench-copy-results --prefix ${UPGRADE_TEST_PREFIX}
      fi
    else
      sh /root/workload/workload.sh
      result_dir=/tmp
    fi
    workload_log "Completed upgrade workload run"

    workload_log "Checking Test Results"
    workload_log "Checking Test Exit Code"
    if [ $(jq '.exit_code==0' ${result_dir}/exit.json) == "false" ]; then
      workload_log "Test Failure"
      workload_log "Test Analysis: Failed"
      exit 1
    fi
    workload_log "Comparing upgrade duration to expected duration"
    workload_log "Upgrade Duration: $(jq '.duration' ${result_dir}/exit.json)"
    if [ $(jq '.duration>'${EXPECTED_UPGRADE_DURATION}'' ${result_dir}/exit.json) == "true" ]; then
      workload_log "EXPECTED_UPGRADE_DURATION (${EXPECTED_UPGRADE_DURATION}) exceeded ($(jq '.duration' ${result_dir}/exit.json))"
      workload_log "Test Analysis: Failed"
      exit 1
    fi
    # TODO: Check pbench-agent collected metrics for Pass/Fail
    # TODO: Check prometheus collected metrics for Pass/Fail
    workload_log "Test Analysis: Passed"
  workload.sh: |
    #!/bin/sh

    result_dir=/tmp
    if [ "${PBENCH_INSTRUMENTATION}" = "true" ]; then
      result_dir=${benchmark_results_dir}
    fi
    start_time=$(date +%s)

    oc adm upgrade --force=${FORCE_UPGRADE} --to-image=${UPGRADE_NEW_VERSION_URL}:${UPGRADE_NEW_VERSION}

    # Poll to see upgrade started
    retries=0
    while [ ${retries} -le 120 ] ; do
      clusterversion_output=`oc get clusterversion/version`
      if [[ "${clusterversion_output}" == *"Working towards "* ]]; then
        workload_log "Cluster upgrade started"
        break
      else
        workload_log "Cluster upgrade has not started, Poll attempts: ${retries}/120"
        sleep 1
      fi
      retries=$[${retries} + 1]
    done

    # Poll to see if upgrade has completed
    retries=0
    while [ ${retries} -le ${UPGRADE_POLL_ATTEMPTS} ] ; do
      clusterversion_output=`oc get clusterversion/version`
      if [[ "${clusterversion_output}" == *"Cluster version is "* ]]; then
        workload_log "Cluster upgrade complete"
        break
      else
        workload_log "Cluster still upgrading, Poll attempts: ${retries}/${UPGRADE_POLL_ATTEMPTS}"
        sleep 2
      fi
      retries=$[${retries} + 1]
    done
    end_time=$(date +%s)
    duration=$((end_time-start_time))
    exit_code=0
    if [[ "${clusterversion_output}" != *"Cluster version is "* ]]; then
      workload_log "Cluster failed to upgrade to ${UPGRADE_NEW_VERSION} in (${UPGRADE_POLL_ATTEMPTS} * 2s)"
      exit_code=1
    fi
    workload_log "Writing Exit Code and Duration"
    jq -n '. | ."exit_code"='${exit_code}' | ."duration"='${duration}'' > "${result_dir}/exit.json"
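The final `jq -n` line serializes the exit code and duration into `exit.json`; the same record can be sketched with `printf` alone (an illustrative variant with assumed values, not the committed code):

```shell
#!/bin/sh
# Write the {exit_code, duration} result record as JSON without jq
# (illustration only; the committed script uses `jq -n`).
result_dir=$(mktemp -d)
exit_code=0
duration=42
printf '{"exit_code": %d, "duration": %d}\n' "${exit_code}" "${duration}" \
  > "${result_dir}/exit.json"
cat "${result_dir}/exit.json"
```

This is the file that `run.sh` later inspects for its exit-code and duration pass/fail checks.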

workloads/templates/workload-env.yml.j2

Lines changed: 9 additions & 0 deletions
@@ -95,4 +95,13 @@ data:
  FIOTEST_SSH_AUTHORIZED_KEYS: "{{pbench_ssh_public_key_file_slurp['content']}}"
  FIOTEST_SSH_PRIVATE_KEY: "{{pbench_ssh_private_key_file_slurp['content']}}"
  FIOTEST_SSH_PUBLIC_KEY: "{{pbench_ssh_public_key_file_slurp['content']}}"
{% elif workload_job == "upgrade" %}
  PBENCH_INSTRUMENTATION: "{{pbench_instrumentation|bool|lower}}"
  ENABLE_PBENCH_COPY: "{{enable_pbench_copy|bool|lower}}"
  UPGRADE_TEST_PREFIX: "{{upgrade_test_prefix}}"
  UPGRADE_NEW_VERSION_URL: "{{upgrade_new_version_url}}"
  UPGRADE_NEW_VERSION: "{{upgrade_new_version}}"
  FORCE_UPGRADE: "{{force_upgrade}}"
  UPGRADE_POLL_ATTEMPTS: "{{upgrade_poll_attempts}}"
  EXPECTED_UPGRADE_DURATION: "{{expected_upgrade_duration}}"
{% endif %}
