
Commit f54e6b2

Automate Topology Manager on Power
Signed-off-by: Aniruddha Nayek <[email protected]>
1 parent 0f21612 commit f54e6b2

File tree

12 files changed: +627 −0 lines changed

README.md

Lines changed: 1 addition & 0 deletions

@@ -53,6 +53,7 @@ This repository consists of additional ansible playbooks for the following:
  1. Verify IPI day2 operations
  1. Deploy Openshift Data Foundation operator
  1. Enabling Kdump
+ 1. Enable Topology Manager on Power

  ## Assumptions:

examples/all.yaml

Lines changed: 11 additions & 0 deletions

@@ -559,3 +559,14 @@ test_pod_image: "quay.io/powercloud/nginx-unprivileged:latest"
  ## ocp-service-controller-function vars
  ocp-service: false

+ # topology vars
+ topology_enabled: false
+ single_node_cpuv1: ""
+ single_node_cpuv2: ""
+ besteffort_cpuv1: ""
+ besteffort_cpuv2: ""
+ restricted_cpuv1: ""
+ restricted_cpuv2: ""
+ none_cpuv1: ""
+ none_cpuv2: ""

examples/topology_vars.yaml

Lines changed: 13 additions & 0 deletions

@@ -0,0 +1,13 @@
---
# topology vars
# Set the values below according to the number of NUMA nodes and their partitioning on worker-1

topology_enabled: false
single_node_cpuv1: ""
single_node_cpuv2: ""
besteffort_cpuv1: ""
besteffort_cpuv2: ""
restricted_cpuv1: ""
restricted_cpuv2: ""
none_cpuv1: ""
none_cpuv2: ""
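
For reference, a filled-in copy might look like the sketch below; the values are purely illustrative (they assume worker-1 has two NUMA nodes with 8 CPUs each) and must be set according to the actual NUMA partitioning of worker-1:

```
---
# topology vars -- illustrative values assuming two 8-CPU NUMA nodes on worker-1
topology_enabled: true
single_node_cpuv1: "2"    # small enough to fit within one NUMA node
single_node_cpuv2: "4"
besteffort_cpuv1: "2"
besteffort_cpuv2: "6"
restricted_cpuv1: "2"
restricted_cpuv2: "4"
none_cpuv1: "2"
none_cpuv2: "6"
```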

playbooks/main.yml

Lines changed: 3 additions & 0 deletions

@@ -75,6 +75,9 @@
  - import_playbook: ocp-disa-stig-compliance.yml
    when: stig_compliance_enabled is defined and stig_compliance_enabled

+ - import_playbook: topology-manager.yml
+   when: topology_enabled is defined and topology_enabled

  - import_playbook: hypershift.yml
    when: >
      (hypershift_install is defined and hypershift_install) or
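
With this guard in place, the topology validation can also be run from the top-level playbook by enabling the flag, e.g. (illustrative invocation):

```
ansible-playbook -i inventory -e @topology_vars.yaml -e topology_enabled=true playbooks/main.yml
```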
playbooks/roles/topology-manager/README.md

Lines changed: 72 additions & 0 deletions

@@ -0,0 +1,72 @@
Validate Topology Manager
=========================

This playbook validates the Topology Manager by covering the following use cases:

* Validate pod alignment with CPU requests and the Topology Manager policy set to 'single-numa-node'
* Validate pod alignment with CPU requests and the Topology Manager policy set to 'best-effort'
* Validate pod alignment with CPU requests and the Topology Manager policy set to 'restricted'
* Validate pod alignment with CPU requests and the Topology Manager policy set to 'none'

Note: A few other use cases need to be validated manually.

- Validate pod alignment with CPU requests for the Topology Manager 'single-numa-node' policy - pods are expected to be scheduled within NUMA locality.
- Validate pod alignment with CPU requests for the Topology Manager 'best-effort' policy - no pod scheduling is expected to be rejected, since deployments are not restricted to NUMA locality.
- Validate pod alignment with CPU requests for the Topology Manager 'restricted' policy - pods whose CPU requests can be aligned within NUMA locality are admitted; pods that cannot be aligned are rejected.
- Validate pod alignment with CPU requests for the Topology Manager 'none' policy - no deployments are restricted from being scheduled.

So, in accordance with these criteria, all the deployments are expected to be scheduled.

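Where manual validation is needed, a minimal sketch of the kind of checks involved is shown below (the pod name `podcpu2` is illustrative, following the playbook's `podcpu<N>` naming; run with cluster-admin access):

```
# Inspect the NUMA layout of worker-1
oc debug node/worker-1 -q -- chroot /host lscpu | grep -i numa

# Inspect a test pod after deployment; under 'single-numa-node' or
# 'restricted', a pod that cannot be NUMA-aligned is rejected at
# admission with reason TopologyAffinityError
oc get pod podcpu2 -o wide
oc describe pod podcpu2
```
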
Prerequisites & Requirements
----------------------------

- The cluster is in a known good state, without any errors.
- Analyze the NUMA node partitioning & CPU/memory layout on worker-1 for the above use cases.

Role Variables
--------------

| Variable          | Required | Comments                                                            |
| ----------------- | -------- | ------------------------------------------------------------------- |
| topology_enabled  | no       | Set to true to run this playbook.                                    |
| single_node_cpuv1 | yes      | CPU request for the first pod under the 'single-numa-node' policy.   |
| single_node_cpuv2 | yes      | CPU request for the second pod under the 'single-numa-node' policy.  |
| besteffort_cpuv1  | yes      | CPU request for the first pod under the 'best-effort' policy.        |
| besteffort_cpuv2  | yes      | CPU request for the second pod under the 'best-effort' policy.       |
| restricted_cpuv1  | yes      | CPU request for the first pod under the 'restricted' policy.         |
| restricted_cpuv2  | yes      | CPU request for the second pod under the 'restricted' policy.        |
| none_cpuv1        | yes      | CPU request for the first pod under the 'none' policy.               |
| none_cpuv2        | yes      | CPU request for the second pod under the 'none' policy.              |

Example Playbook
----------------

```
- name: Validate topology manager on Power
  hosts: bastion
  roles:
  - topology-manager
```

Steps to run playbook
---------------------

- Copy the `ocp4-playbooks-extras/examples/inventory` file to the home or working directory and modify it to add a remote host.
- Copy `ocp4-playbooks-extras/examples/topology_vars.yaml` to the home or working directory and set the role variables for `roles/topology-manager` with the custom inputs.
- To execute the playbook, run the sample command below.

Sample Command
--------------

    ansible-playbook -i inventory -e @topology_vars.yaml ~/ocp4-playbooks-extras/playbooks/topology-manager.yml

License
-------

See LICENCE.txt

Author Information
------------------

playbooks/roles/topology-manager/defaults/main.yml

Lines changed: 4 additions & 0 deletions

@@ -0,0 +1,4 @@
---
# defaults file

pause_image: "registry.access.redhat.com/ubi8/pause"
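
The default can be overridden at run time, for example in a disconnected environment that mirrors the image to a local registry (the registry URL below is hypothetical):

```
ansible-playbook -i inventory -e @topology_vars.yaml \
  -e pause_image="myregistry.local:5000/ubi8/pause" \
  ~/ocp4-playbooks-extras/playbooks/topology-manager.yml
```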
playbooks/roles/topology-manager/tasks/besteffort_policy.yml

Lines changed: 118 additions & 0 deletions

@@ -0,0 +1,118 @@
---
- name: Create kubelet-config with topologyManagerPolicy 'best-effort'
  kubernetes.core.k8s:
    state: present
    definition:
      apiVersion: machineconfiguration.openshift.io/v1
      kind: KubeletConfig
      metadata:
        name: cpumanager-enabled
      spec:
        machineConfigPoolSelector:
          matchLabels:
            custom-kubelet: cpumanager-enabled
        kubeletConfig:
          cpuManagerPolicy: static
          cpuManagerReconcilePeriod: 5s
          topologyManagerPolicy: best-effort

- name: Wait for 2 minutes before checking MCP status
  pause:
    minutes: 2

- name: Check MCP status
  shell: oc get mcp worker | awk 'NR==2 {print $3}'
  register: mcpstatus
  until: mcpstatus.stdout == 'True'
  retries: 20
  delay: 60

- name: Check the worker node for the updated kubelet.conf
  shell: oc debug node/worker-1 -q -- chroot /host cat /etc/kubernetes/kubelet.conf | grep -E 'reservedSystemCPUs|cpuManager|topologyManager'
  register: pa_result

- name: Verify the updated config
  debug:
    var: pa_result.stdout_lines

- name: Create pod with {{ besteffort_cpuv1 }} CPU request
  kubernetes.core.k8s:
    state: present
    definition:
      apiVersion: v1
      kind: Pod
      metadata:
        name: "podcpu{{ besteffort_cpuv1 }}"
        namespace: default
      spec:
        nodeSelector:
          cpumanager: "true"
        containers:
        - name: appcntr1
          image: "{{ pause_image }}"
          imagePullPolicy: IfNotPresent
          command: [ "/bin/bash", "-c", "--" ]
          args: [ "while true; do sleep 300000; done;" ]
          resources:
            requests:
              cpu: "{{ besteffort_cpuv1 }}"
              memory: 100Mi
            limits:
              cpu: "{{ besteffort_cpuv1 }}"
              memory: 100Mi

- name: Verify the pod gets scheduled
  shell: oc get pod podcpu{{ besteffort_cpuv1 }} | awk 'NR==2 {print $3}'
  register: result
  until: result.stdout == "Running"
  retries: 5
  delay: 10

- name: Create pod with {{ besteffort_cpuv2 }} CPU request
  kubernetes.core.k8s:
    state: present
    definition:
      apiVersion: v1
      kind: Pod
      metadata:
        name: "podcpu{{ besteffort_cpuv2 }}"
        namespace: default
      spec:
        nodeSelector:
          cpumanager: "true"
        containers:
        - name: appcntr1
          image: "{{ pause_image }}"
          imagePullPolicy: IfNotPresent
          command: [ "/bin/bash", "-c", "--" ]
          args: [ "while true; do sleep 300000; done;" ]
          resources:
            requests:
              cpu: "{{ besteffort_cpuv2 }}"
              memory: 310Mi
            limits:
              cpu: "{{ besteffort_cpuv2 }}"
              memory: 310Mi

- name: Verify the pod gets scheduled
  shell: oc get pod podcpu{{ besteffort_cpuv2 }} | awk 'NR==2 {print $3}'
  register: result
  until: result.stdout == "Running"
  retries: 5
  delay: 10

- name: Print a simple message
  debug:
    msg: "Best-effort policy validated successfully"

- name: Cleanup
  block:
    - name: Delete the pods created
      shell: oc delete pods --all

    - name: Verify pods deletion
      shell: oc get pods | wc -l
      register: pods_count
      until: pods_count.stdout | int == 0
      retries: 10
      delay: 10
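
Beyond the `Running` check above, one manual way to confirm that the static CPU manager actually granted exclusive CPUs is to read the kubelet's CPU manager state file on the node (a spot check, not part of the playbook):

```
# Lists per-container exclusive CPU assignments made by the static policy
oc debug node/worker-1 -q -- chroot /host cat /var/lib/kubelet/cpu_manager_state
```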
playbooks/roles/topology-manager/tasks/main.yml

Lines changed: 28 additions & 0 deletions

@@ -0,0 +1,28 @@
---
# tasks file for playbooks/roles/topology-manager

- name: Check if cluster operators and nodes are healthy
  include_role:
    name: check-cluster-health

- name: Setting up CPU manager
  block:
    - name: Label node worker-1
      shell: oc label node worker-1 cpumanager=true

    - name: Enable CPU manager for all workers
      shell: oc patch mcp worker --type=merge -p '{"metadata":{"labels":{"custom-kubelet":"cpumanager-enabled"}}}'
      register: patch_result
      changed_when: "'configured' in patch_result.stdout"

- name: Validate Pod Alignment with CPU requests and Topology Manager policy set to Single Numa Node
  include_tasks: single_numa_node_policy.yml

- name: Validate Pod Alignment with CPU requests and Topology Manager policy set to Best Effort
  include_tasks: besteffort_policy.yml

- name: Validate Pod Alignment with CPU requests and Topology Manager policy set to Restricted
  include_tasks: restricted_policy.yml

- name: Validate Pod Alignment with CPU requests and Topology Manager policy set to None
  include_tasks: none_policy.yml
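
If the MCP never reports Updated, a few manual commands help verify that the label and the custom kubelet config were picked up (a debugging aid, not part of the role):

```
oc get node worker-1 --show-labels | grep cpumanager
oc get mcp worker -o jsonpath='{.metadata.labels.custom-kubelet}{"\n"}'
oc get kubeletconfig cpumanager-enabled -o yaml
```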
playbooks/roles/topology-manager/tasks/none_policy.yml

Lines changed: 118 additions & 0 deletions

@@ -0,0 +1,118 @@
---
- name: Create kubelet-config with topologyManagerPolicy 'none'
  kubernetes.core.k8s:
    state: present
    definition:
      apiVersion: machineconfiguration.openshift.io/v1
      kind: KubeletConfig
      metadata:
        name: cpumanager-enabled
      spec:
        machineConfigPoolSelector:
          matchLabels:
            custom-kubelet: cpumanager-enabled
        kubeletConfig:
          cpuManagerPolicy: static
          cpuManagerReconcilePeriod: 5s
          topologyManagerPolicy: none

- name: Wait for 2 minutes before checking MCP status
  pause:
    minutes: 2

- name: Check MCP status
  shell: oc get mcp worker | awk 'NR==2 {print $3}'
  register: mcpstatus
  until: mcpstatus.stdout == 'True'
  retries: 20
  delay: 60

- name: Check the worker node for the updated kubelet.conf
  shell: oc debug node/worker-1 -q -- chroot /host cat /etc/kubernetes/kubelet.conf | grep -E 'reservedSystemCPUs|cpuManager|topologyManager'
  register: pa_result

- name: Verify the updated config
  debug:
    var: pa_result.stdout_lines

- name: Create pod with {{ none_cpuv1 }} CPU request
  kubernetes.core.k8s:
    state: present
    definition:
      apiVersion: v1
      kind: Pod
      metadata:
        name: "podcpu{{ none_cpuv1 }}"
        namespace: default
      spec:
        nodeSelector:
          kubernetes.io/hostname: worker-1
        containers:
        - name: appcntr1
          image: "{{ pause_image }}"
          imagePullPolicy: IfNotPresent
          command: [ "/bin/bash", "-c", "--" ]
          args: [ "while true; do sleep 300000; done;" ]
          resources:
            requests:
              cpu: "{{ none_cpuv1 }}"
              memory: 200Mi
            limits:
              cpu: "{{ none_cpuv1 }}"
              memory: 200Mi

- name: Verify the pod gets scheduled
  shell: oc get pod podcpu{{ none_cpuv1 }} | awk 'NR==2 {print $3}'
  register: result
  until: result.stdout == "Running"
  retries: 5
  delay: 10

- name: Create pod with {{ none_cpuv2 }} CPU request
  kubernetes.core.k8s:
    state: present
    definition:
      apiVersion: v1
      kind: Pod
      metadata:
        name: "podcpu{{ none_cpuv2 }}"
        namespace: default
      spec:
        nodeSelector:
          kubernetes.io/hostname: worker-1
        containers:
        - name: appcntr1
          image: "{{ pause_image }}"
          imagePullPolicy: IfNotPresent
          command: [ "/bin/bash", "-c", "--" ]
          args: [ "while true; do sleep 300000; done;" ]
          resources:
            requests:
              cpu: "{{ none_cpuv2 }}"
              memory: 330Mi
            limits:
              cpu: "{{ none_cpuv2 }}"
              memory: 330Mi

- name: Verify the pod gets scheduled
  shell: oc get pod podcpu{{ none_cpuv2 }} | awk 'NR==2 {print $3}'
  register: result
  until: result.stdout == "Running"
  retries: 5
  delay: 10

- name: Print a simple message
  debug:
    msg: "None policy validated successfully"

- name: Cleanup
  block:
    - name: Delete the pods created
      shell: oc delete pods --all

    - name: Verify pods deletion
      shell: oc get pods | wc -l
      register: pods_count
      until: pods_count.stdout | int == 0
      retries: 10
      delay: 10
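
Note that every policy task file reuses the same KubeletConfig name (`cpumanager-enabled`), so each include overwrites the previous policy. After a full run, the policy left in effect can be confirmed manually:

```
oc get kubeletconfig cpumanager-enabled \
  -o jsonpath='{.spec.kubeletConfig.topologyManagerPolicy}{"\n"}'
```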
