Description
Expected behavior
As an operator
In order to notice failure to uncordon nodes at startup
In order to respect bosh canary updates and not propagate errors to the whole cluster
I need bosh job status to surface failures
Current behavior
On new strimzi coab instances, the smoke test fails, i.e. fresh new service instances fail to deploy.
We suspect that the k3s server takes time to start and become ready to accept the k8s api request that uncordons the master node in the post-start bosh action. As a result, the uncordon request fails.
However, post-start silently ignores uncordon failures:
k3s-wrapper-boshrelease/jobs/k3s-agent/templates/bin/post-start.erb (lines 23 to 26 in c535c16):

```
#uncordon
/var/vcap/packages/k3s/k3s kubectl --kubeconfig=/var/vcap/data/k3s-agent/drain-kubeconfig.yaml uncordon $K3S_NODE_NAME \
  >> $JOB_DIR/post-start.log \
  2>> $JOB_DIR/post-start-stderr.log
```
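The redirections capture stderr, but the command's exit status is never checked, so a failed uncordon cannot fail the script. A minimal sketch of one way to surface the failure; the 30-attempt / 10s retry budget is an illustrative assumption, not a value from the release:

```bash
#uncordon, retrying while the k8s api may still be starting up;
#exit non-zero on persistent failure so bosh marks post-start as failed
for attempt in $(seq 1 30); do
  if /var/vcap/packages/k3s/k3s kubectl \
      --kubeconfig=/var/vcap/data/k3s-agent/drain-kubeconfig.yaml \
      uncordon "$K3S_NODE_NAME" \
      >> "$JOB_DIR/post-start.log" 2>> "$JOB_DIR/post-start-stderr.log"; then
    exit 0
  fi
  sleep 10
done
echo "failed to uncordon $K3S_NODE_NAME after 30 attempts" >> "$JOB_DIR/post-start-stderr.log"
exit 1
```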
k3s-wrapper-boshrelease/jobs/k3s-server/templates/bin/post-start.erb (lines 15 to 20 in c535c16):

```
#wait for k8s api to be available, wait for 5 min max
<% if_p('k3s.master_vip_api') do |vip| %>
timeout 300 sh -c 'until nc -z <%= vip %> 6443; do sleep 1; done'
/var/vcap/packages/k3s/k3s kubectl --kubeconfig=/var/vcap/store/k3s-server/kubeconfig.yml get pods --all-namespaces
<% end %>
#uncordon
/var/vcap/packages/k3s/k3s kubectl --kubeconfig=/var/vcap/store/k3s-server/kubeconfig.yml uncordon $K3S_NODE_NAME
```
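Note that the wait only runs when `k3s.master_vip_api` is set, and the uncordon's exit status is again ignored. A sketch of one possible fix on the server side, waiting for this node to be registered before uncordoning and letting any failure propagate; the 300s budget mirrors the existing nc loop, and the 5s poll interval is an illustrative assumption:

```bash
set -e  # any failed command below fails post-start, which bosh surfaces

KUBECTL="/var/vcap/packages/k3s/k3s kubectl --kubeconfig=/var/vcap/store/k3s-server/kubeconfig.yml"

# wait until the api answers and this node object exists, unconditionally
# rather than only when k3s.master_vip_api is set
timeout 300 bash -c "until $KUBECTL get node $K3S_NODE_NAME >/dev/null 2>&1; do sleep 5; done"

# uncordon; with set -e a failure here aborts the script with a non-zero exit
$KUBECTL uncordon "$K3S_NODE_NAME"
```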
Therefore, we see additional downstream side effects of the uncordon failures (typically timeouts waiting for kustomizations to complete, i.e. for k8s jobs/workloads to run).
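To confirm the diagnosis on an affected instance, one can check whether the node is still cordoned; a cordoned node reports SchedulingDisabled and workloads stay Pending because nothing can be scheduled on it. The commands below are illustrative and reuse the server kubeconfig path from the snippet above:

```bash
# a cordoned node shows SchedulingDisabled in the STATUS column
/var/vcap/packages/k3s/k3s kubectl --kubeconfig=/var/vcap/store/k3s-server/kubeconfig.yml get nodes

# pods that cannot be scheduled remain Pending
/var/vcap/packages/k3s/k3s kubectl --kubeconfig=/var/vcap/store/k3s-server/kubeconfig.yml \
  get pods --all-namespaces --field-selector=status.phase=Pending
```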