Description
I’ve run into this issue multiple times, but now that it affects our automation, I need a proper way to handle it.
We have a use case where all services are scaled down during the maintenance window and then scaled back up afterwards.
This is implemented with two Jenkins jobs: one scales the services down, the other scales them back up.
Most of the time this works fine, but occasionally a service becomes corrupted (see below) or simply fails to start (MariaDB, for example, for no apparent reason).
When that happens, scaling the service back up results in an infinite loop.
The only way to stop the loop is to remove the service with docker service rm, but that also deletes the task history, the logs, and everything else we need.
Otherwise we would simply remove and redeploy the service each time, but that is not the solution I'm aiming for.
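Roughly, each job just runs docker service scale for every service in the stack; a minimal sketch of what the two jobs do (service names and replica counts other than mariadb_mariadb are illustrative):

# Maintenance window starts: scale everything down
docker service scale mariadb_mariadb=0 app_backend=0 app_frontend=0

# Maintenance window ends: bring the services back
docker service scale mariadb_mariadb=1 app_backend=2 app_frontend=2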
The following log is from a Jenkins run attempting to bring back all the services that were scaled down for daily maintenance.
These logs started at 4 a.m. and were manually stopped after running for three hours.
Executing: docker service scale mariadb_mariadb=1
mariadb_mariadb scaled to 1
overall progress: 0 out of 1 tasks
1/1:
1/1: task: non-zero exit (1)
overall progress: 0 out of 1 tasks
1/1: task: non-zero exit (1)
overall progress: 0 out of 1 tasks
1/1: task: non-zero exit (1)
overall progress: 0 out of 1 tasks
1/1: task: non-zero exit (1)
overall progress: 0 out of 1 tasks
1/1: task: non-zero exit (1)
overall progress: 0 out of 1 tasks
1/1: task: non-zero exit (1)
overall progress: 0 out of 1 tasks
1/1: task: non-zero exit (1)
overall progress: 0 out of 1 tasks
1/1: task: non-zero exit (1)
overall progress: 0 out of 1 tasks
...
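One possible stop-gap (untested in our pipeline, so only a sketch) is to let the Jenkins job give up instead of hanging: --detach makes the scale command return immediately, and a bounded poll (30 x 10 s here, chosen arbitrarily) caps how long Jenkins waits. But that only bounds the wait on our side; Swarm itself still keeps retrying the task indefinitely:

# Don't block on convergence; poll with a bounded timeout instead
docker service scale --detach mariadb_mariadb=1
for i in $(seq 1 30); do
  running=$(docker service ps mariadb_mariadb \
    --filter "desired-state=running" \
    --format '{{.CurrentState}}' | grep -c '^Running' || true)
  [ "$running" -ge 1 ] && break
  sleep 10
done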
Docker scaling should respect the docker stack deploy configuration, specifically the restart policy, and stop retrying after the configured number of attempts. For instance, this is the deploy configuration we are using for the MariaDB service:
deploy:
  replicas: 1
  restart_policy:
    condition: any
    delay: 30s
    max_attempts: 5
  update_config:
    parallelism: 1
    delay: 2m
    failure_action: pause
    monitor: 10s
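To rule out the stack file simply not being applied, the restart policy Swarm actually stored can be checked on the running service; a quick sketch (the Delay value in the output is in nanoseconds):

docker service inspect mariadb_mariadb \
  --format '{{json .Spec.TaskTemplate.RestartPolicy}}'
# Expected, if the file above was applied as written:
# {"Condition":"any","Delay":30000000000,"MaxAttempts":5}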
I have observed in the past that a Docker service can become corrupted for no discernible reason. When such a service is scaled back up, it starts reporting the following error:
"No such image: rullion/uk.co.rullion.cloud.notification-service:2.1.26@sha256:87f1001ea072b47779942a8504983fcd84fa61bb588e8b5508e68fb64f27be79"
Nevertheless:
- The same service had already been using this image prior to being scaled down.
- The image is accessible, and we are able to pull it from the repository.
- Most importantly, the issue is "fixed" after docker service rm, which suggests that the service configuration became corrupted.
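The only less destructive workaround I can think of (untested for this exact failure, so treat it as a guess) is to force Swarm to re-resolve and re-pin the image instead of removing the service, e.g.:

# <service-name> = whichever service reports "No such image";
# the image tag is the one from the error above
docker service update <service-name> \
  --with-registry-auth \
  --force \
  --image rullion/uk.co.rullion.cloud.notification-service:2.1.26

Whether that actually clears the bad state I can't say; so far docker service rm is the only thing that has reliably worked, and it throws away the history and logs we need.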