
post-start silently ignores uncordon failures: should fail fast instead #78

@gberche-orange

Description


Expected behavior

As an operator
In order to notice failures to uncordon nodes at startup
And in order to respect the bosh canary update process and avoid propagating errors to the whole cluster
I need the bosh job status to surface those failures

Current behavior

On new strimzi coab instances, the smoke test fails: fresh service instances fail to deploy.

We suspect that the k3s server takes time to start and become ready to accept k8s api requests, so the uncordon of the master node in the post-start bosh action runs too early. As a result, the uncordon request fails.

However, post-start silently ignores uncordon failures:

```shell
#uncordon
/var/vcap/packages/k3s/k3s kubectl --kubeconfig=/var/vcap/data/k3s-agent/drain-kubeconfig.yaml uncordon $K3S_NODE_NAME \
  >> $JOB_DIR/post-start.log \
  2>> $JOB_DIR/post-start-stderr.log
```
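A minimal fail-fast sketch of what the issue asks for, assuming the post-start script may exit non-zero to fail the bosh job (`run_or_fail` is a hypothetical helper, and `true` stands in for the real kubectl uncordon call):

```shell
#!/bin/sh
# Hypothetical helper: run a command and, instead of silently ignoring a
# failure, log it and exit non-zero so bosh marks post-start as failed.
run_or_fail() {
  if ! "$@"; then
    echo "post-start: command failed: $*" >&2
    exit 1
  fi
}

# Stub standing in for:
#   k3s kubectl ... uncordon $K3S_NODE_NAME >> $JOB_DIR/post-start.log ...
run_or_fail true
echo "uncordon succeeded"
```

Because the script's exit status propagates to bosh, a failed uncordon would then stop the canary instead of spreading through the deployment.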

```shell
#wait for k8s api to be available, wait for 5 min max
<% if_p('k3s.master_vip_api') do |vip| %>
timeout 300 sh -c 'until nc -z <%= vip %> 6443; do sleep 1; done'
/var/vcap/packages/k3s/k3s kubectl --kubeconfig=/var/vcap/store/k3s-server/kubeconfig.yml get pods --all-namespaces
<% end %>
#uncordon
/var/vcap/packages/k3s/k3s kubectl --kubeconfig=/var/vcap/store/k3s-server/kubeconfig.yml uncordon $K3S_NODE_NAME
```
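One way to address the suspected startup race would be to retry the uncordon until it succeeds or a deadline passes, and only then fail. A sketch under those assumptions (`retry_until` is a hypothetical helper; `true` stands in for the real kubectl call):

```shell
#!/bin/sh
# Hypothetical retry loop: re-run a command every second until it succeeds
# or the deadline (first argument, in seconds) passes, then report failure.
retry_until() {
  deadline=$(( $(date +%s) + $1 ))
  shift
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "post-start: giving up on: $*" >&2
      return 1
    fi
    sleep 1
  done
}

# Stub in place of: k3s kubectl ... uncordon "$K3S_NODE_NAME"
retry_until 5 true || exit 1
echo "uncordon confirmed"
```

Combined with a non-zero exit on failure, this tolerates a slow k3s api startup while still surfacing a genuine uncordon failure to bosh.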

As a result, we see additional downstream side effects of the uncordon failures (typically timeouts while completing kustomizations, i.e. running k8s jobs/workloads).

Metadata

Labels: bug