Docker container randomly stops booting #8938

kdubb · 2023-07-23T01:00:03Z

kdubb
Jul 23, 2023

Describe the bug

Quarkus devservices automatically starts a RabbitMQ container during tests (via Testcontainers). Intermittently, the Docker container does not finish starting up. It seemingly just stops at a random point.

It's been hard to "catch" this failure because it is truly intermittent but I've finally managed to capture the log output for a "stalled start".

Here is the log. There are no discernible errors; the output just abruptly stops.

2023-07-23 00:22:48.751640+00:00 [info] <0.230.0> Feature flags: list of feature flags found:
2023-07-23 00:22:48.765448+00:00 [info] <0.230.0> Feature flags:   [ ] implicit_default_bindings
2023-07-23 00:22:48.765496+00:00 [info] <0.230.0> Feature flags:   [ ] maintenance_mode_status
2023-07-23 00:22:48.765512+00:00 [info] <0.230.0> Feature flags:   [ ] quorum_queue
2023-07-23 00:22:48.765522+00:00 [info] <0.230.0> Feature flags:   [ ] stream_queue
2023-07-23 00:22:48.765640+00:00 [info] <0.230.0> Feature flags:   [ ] user_limits
2023-07-23 00:22:48.765649+00:00 [info] <0.230.0> Feature flags:   [ ] virtual_host_metadata
2023-07-23 00:22:48.765663+00:00 [info] <0.230.0> Feature flags: feature flag states written to disk: yes
2023-07-23 00:22:49.087171+00:00 [notice] <0.44.0> Application syslog exited with reason: stopped
2023-07-23 00:22:49.087287+00:00 [notice] <0.230.0> Logging: switching to configured handler(s); following messages may not be visible in this log output
2023-07-23 00:22:49.099955+00:00 [notice] <0.230.0> Logging: configured log handlers are now ACTIVE
2023-07-23 00:22:49.269901+00:00 [info] <0.230.0> ra: starting system quorum_queues
2023-07-23 00:22:49.269998+00:00 [info] <0.230.0> starting Ra system: quorum_queues in directory: /var/lib/rabbitmq/mnesia/rabbit@73372b4d4347/quorum/rabbit@73372b4d4347
2023-07-23 00:22:49.319935+00:00 [info] <0.266.0> ra system 'quorum_queues' running pre init for 0 registered servers
2023-07-23 00:22:49.330775+00:00 [info] <0.267.0> ra: meta data store initialised for system quorum_queues. 0 record(s) recovered
2023-07-23 00:22:49.345526+00:00 [notice] <0.272.0> WAL: ra_log_wal init, open tbls: ra_log_open_mem_tables, closed tbls: ra_log_closed_mem_tables
2023-07-23 00:22:49.352708+00:00 [info] <0.230.0> ra: starting system coordination
2023-07-23 00:22:49.352768+00:00 [info] <0.230.0> starting Ra system: coordination in directory: /var/lib/rabbitmq/mnesia/rabbit@73372b4d4347/coordination/rabbit@73372b4d4347
2023-07-23 00:22:49.354147+00:00 [info] <0.279.0> ra system 'coordination' running pre init for 0 registered servers
2023-07-23 00:22:49.355034+00:00 [info] <0.280.0> ra: meta data store initialised for system coordination. 0 record(s) recovered
2023-07-23 00:22:49.355188+00:00 [notice] <0.285.0> WAL: ra_coordination_log_wal init, open tbls: ra_coordination_log_open_mem_tables, closed tbls: ra_coordination_log_closed_mem_tables
2023-07-23 00:22:49.358086+00:00 [info] <0.230.0> 
2023-07-23 00:22:49.358086+00:00 [info] <0.230.0>  Starting RabbitMQ 3.9.29 on Erlang 25.3.1 [jit]
2023-07-23 00:22:49.358086+00:00 [info] <0.230.0>  Copyright (c) 2007-2023 VMware, Inc. or its affiliates.
2023-07-23 00:22:49.358086+00:00 [info] <0.230.0>  Licensed under the MPL 2.0. Website: https://rabbitmq.com
2023-07-23 00:22:49.358172+00:00 [error] <0.230.0> This release series has reached end of life and is no longer supported. Please visit https://rabbitmq.com/versions.html to learn more and upgrade
  ##  ##      RabbitMQ 3.9.29
  ##  ##
  ##########  Copyright (c) 2007-2023 VMware, Inc. or its affiliates.
  ######  ##
  ##########  Licensed under the MPL 2.0. Website: https://rabbitmq.com
  Erlang:      25.3.1 [jit]
  TLS Library: OpenSSL - OpenSSL 3.0.8 7 Feb 2023
  Release series support status: out of support
  Doc guides:  https://rabbitmq.com/documentation.html
  Support:     https://rabbitmq.com/contact.html
  Tutorials:   https://rabbitmq.com/getstarted.html
  Monitoring:  https://rabbitmq.com/monitoring.html
  Logs: /var/log/rabbitmq/rabbit@73372b4d4347_upgrade.log
        <stdout>
  Config file(s): /etc/rabbitmq/conf.d/10-defaults.conf
  Starting broker...2023-07-23 00:22:49.359502+00:00 [info] <0.230.0> 
2023-07-23 00:22:49.359502+00:00 [info] <0.230.0>  node           : rabbit@73372b4d4347
2023-07-23 00:22:49.359502+00:00 [info] <0.230.0>  home dir       : /var/lib/rabbitmq
2023-07-23 00:22:49.359502+00:00 [info] <0.230.0>  config file(s) : /etc/rabbitmq/conf.d/10-defaults.conf
2023-07-23 00:22:49.359502+00:00 [info] <0.230.0>  cookie hash    : cVEEQfgikirQ1na85hL4ww==
2023-07-23 00:22:49.359502+00:00 [info] <0.230.0>  log(s)         : /var/log/rabbitmq/rabbit@73372b4d4347_upgrade.log
2023-07-23 00:22:49.359502+00:00 [info] <0.230.0>                 : <stdout>
2023-07-23 00:22:49.359502+00:00 [info] <0.230.0>  database dir   : /var/lib/rabbitmq/mnesia/rabbit@73372b4d4347
2023-07-23 00:22:49.595033+00:00 [info] <0.230.0> Feature flags: list of feature flags found:
2023-07-23 00:22:49.595105+00:00 [info] <0.230.0> Feature flags:   [ ] drop_unroutable_metric
2023-07-23 00:22:49.595125+00:00 [info] <0.230.0> Feature flags:   [ ] empty_basic_get_metric
2023-07-23 00:22:49.595152+00:00 [info] <0.230.0> Feature flags:   [ ] implicit_default_bindings
2023-07-23 00:22:49.595164+00:00 [info] <0.230.0> Feature flags:   [ ] maintenance_mode_status
2023-07-23 00:22:49.595174+00:00 [info] <0.230.0> Feature flags:   [ ] quorum_queue
2023-07-23 00:22:49.595183+00:00 [info] <0.230.0> Feature flags:   [ ] stream_queue
2023-07-23 00:22:49.595220+00:00 [info] <0.230.0> Feature flags:   [ ] user_limits
2023-07-23 00:22:49.595231+00:00 [info] <0.230.0> Feature flags:   [ ] virtual_host_metadata
2023-07-23 00:22:49.595269+00:00 [info] <0.230.0> Feature flags: feature flag states written to disk: yes

I've left the docker container running for 15 minutes or so and it just stays in this stalled state.
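For anyone comparing captured logs, a small helper can show where a boot stopped. This is a minimal sketch, assuming the log line layout shown above (timestamp, level, Erlang pid, message) and the "Server startup complete" marker named in the reproduction steps. The names `summarize_boot_log` and `BootSummary` are made up for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Marker the reproduction steps wait for.
STARTUP_MARKER = "Server startup complete"

# Matches lines such as:
# 2023-07-23 00:22:49.595269+00:00 [info] <0.230.0> Feature flags: ...
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+[+-]\d{2}:\d{2}) "
    r"\[(?P<level>\w+)\] <(?P<pid>[\d.]+)> ?(?P<msg>.*)$"
)


@dataclass
class BootSummary:
    completed: bool                # True if the startup marker was seen
    last_timestamp: Optional[str]  # timestamp of the last parsed log line
    last_message: Optional[str]    # message of the last parsed log line


def summarize_boot_log(text: str) -> BootSummary:
    """Report whether boot completed and what the last structured line was."""
    last_ts = None
    last_msg = None
    for line in text.splitlines():
        match = LOG_LINE.match(line)
        if match:  # banner lines without a timestamp are skipped
            last_ts = match.group("ts")
            last_msg = match.group("msg")
    return BootSummary(STARTUP_MARKER in text, last_ts, last_msg)
```

Run against the log above, this reports an incomplete boot whose last structured line is the second "feature flag states written to disk" message.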

Quarkus currently defaults to the 3.9 series (I've filed a PR to update to the 3.12 series), but our project has been using 3.10 for a while and this still happens.

Additional details:
Container: 3.9-management & 3.10-management
Docker Desktop: 8 cpus 16GB
Host: macStudio M1 Ultra 128GB
OS: macOS 13.4

Reproduction steps

Start a docker container and wait for "Server startup complete" log message
Do this 10 or so times (😉)
Observe one or more of the startups failing
...
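The steps above can be sketched as a loop around the Docker CLI. This is only an illustration, not the script used in this thread. It assumes the `docker` command is on the PATH and that the image logs to stdout, as the banner above indicates. `boot_once` and `count_stalls` are hypothetical names, and the image tag in the usage note is an assumption based on the "Additional details" list.

```python
import subprocess
import time
from typing import Callable, List

STARTUP_MARKER = "Server startup complete"


def run(argv: List[str]) -> str:
    """Run a command and return its stdout; raises if the command fails."""
    proc = subprocess.run(argv, capture_output=True, text=True, check=True)
    return proc.stdout


def boot_once(
    image: str,
    timeout_s: float = 60.0,
    poll_s: float = 1.0,
    keep_stalled: bool = True,
    runner: Callable[[List[str]], str] = run,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Start one container; return True if the startup marker shows up in time."""
    container_id = runner(["docker", "run", "-d", image]).strip()
    deadline = clock() + timeout_s
    booted = False
    while clock() < deadline:
        if STARTUP_MARKER in runner(["docker", "logs", container_id]):
            booted = True
            break
        sleep(poll_s)
    # A stalled container is left running by default so it can be inspected.
    if booted or not keep_stalled:
        runner(["docker", "rm", "-f", container_id])
    return booted


def count_stalls(image: str, attempts: int, **kwargs) -> int:
    """Boot the image `attempts` times and count the boots that never completed."""
    return sum(0 if boot_once(image, **kwargs) else 1 for _ in range(attempts))
```

For example, `count_stalls("rabbitmq:3.9-management", 10)` would attempt ten boots and return how many stalled. The `runner`, `clock` and `sleep` parameters exist only so the loop can be exercised without Docker.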

Expected behavior

RabbitMQ finishes startup every time.

Additional context

No response

lukebakken · 2023-07-23T01:23:50Z

lukebakken
Jul 23, 2023
Maintainer

This is almost certainly not a bug in RabbitMQ, and probably not a bug in the docker image.

Start a docker container and wait for "Server startup complete" log message

Are you doing this as part of "Quarkus devservices" or on your own? How exactly is the container being pulled and started?

0 replies

kdubb · 2023-07-23T01:31:15Z

kdubb
Jul 23, 2023
Author

Quarkus uses Testcontainers to start external services for testing. It uses the Docker API to pull and start containers.

The container happens to still be in a "stalled state" on my machine, in case there is some debugging I can do on it.

As stated, 9 times out of 10 it starts properly and quite fast; just intermittently it doesn't. This is the only container we have issues with as part of our automated testing, which also includes Postgres, Redis, Vault, localstack, jaeger, etc., all started via Testcontainers. So it's definitely something unique to the RabbitMQ container.

10 replies

lukebakken Jul 24, 2023
Maintainer

I will re-run the above for at least an hour, just to be sure.

lukebakken Jul 24, 2023
Maintainer

https://github.com/lukebakken/rabbitmq-server-8938/blob/main/TCStartTest-RYUK-DISABLED-2.java.log

Seems to be running fine!

lukebakken Jul 24, 2023
Maintainer

If you'd like to enable debug logging, here is how I would mount config files using docker-compose.yml:

https://github.com/lukebakken/rabbitmq-server-8938/tree/main#enable-debug-logging

...I'm sure something similar is available with the tools you use. We would just be interested in the full set of STDOUT from the beginning of container start until it freezes.
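For a plain `docker run` setup rather than docker-compose, the same idea might look like the sketch below. It is not taken from the linked repository. It assumes that the node also loads any extra file mounted into `/etc/rabbitmq/conf.d/` (the directory shown in the boot banner) and relies on the `log.console` and `log.console.level` keys of `rabbitmq.conf`. The file name `20-debug-logging.conf` and both function names are hypothetical.

```python
from pathlib import Path
from typing import List

# rabbitmq.conf settings that send debug-level logging to the console (stdout).
DEBUG_CONF = "log.console = true\nlog.console.level = debug\n"


def write_debug_conf(directory: Path, name: str = "20-debug-logging.conf") -> Path:
    """Write the debug logging snippet into `directory` and return its path."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / name
    path.write_text(DEBUG_CONF, encoding="utf-8")
    return path


def docker_run_argv(image: str, conf_file: Path) -> List[str]:
    """Build a `docker run` command that mounts the snippet read-only."""
    target = f"/etc/rabbitmq/conf.d/{conf_file.name}"
    mount = f"{conf_file.resolve()}:{target}:ro"
    return ["docker", "run", "-d", "-v", mount, image]
```

The returned list can be passed to `subprocess.run`. The full STDOUT requested above would then be available from `docker logs` for that container.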

kdubb Jul 25, 2023
Author

I decided to bump to 3.12 (in expectation of the coming update to Quarkus) and it runs fine. It seems to be something specific to 3.9 & 3.10.

I added a counter to the script output and it ran to around 500 restarts with no problem, so it seems whatever the issue is has been remedied.

Just for posterity, I am going to enable debug logging and see if I get more information, on the off chance that it could be informative or the issue reappears.

Answer selected by lukebakken

lukebakken Jul 25, 2023
Maintainer

Thanks for the follow-up! If I had to guess, I would suspect something changed in Erlang itself to address this issue, rather than RabbitMQ (but that's just speculation).

kdubb · 2023-07-23T01:33:56Z

kdubb
Jul 23, 2023
Author

I've tried cURLing the management port (and even the AMQP port) of a container exhibiting this, and neither responds with any data.

*   Trying 127.0.0.1:63451...
* Connected to localhost (127.0.0.1) port 63451 (#0)
> GET / HTTP/1.1
> Host: localhost:63451
> User-Agent: curl/7.88.1
> Accept: */*
>
* Empty reply from server
* Closing connection 0
curl: (52) Empty reply from server
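curl's exit code 52 means the TCP connection was accepted but then closed without any response bytes. For automated checks, the same distinction can be made with a raw socket. This is a minimal sketch using only the Python standard library; `probe_http` and its return labels are made up for illustration.

```python
import socket


def probe_http(host: str, port: int, timeout_s: float = 5.0) -> str:
    """Classify a port as 'refused', 'timeout', 'empty_reply' or 'responded'."""
    request = (
        f"GET / HTTP/1.1\r\nHost: {host}:{port}\r\nConnection: close\r\n\r\n"
    ).encode("ascii")
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            sock.sendall(request)
            data = sock.recv(1024)
    except ConnectionRefusedError:
        # Nothing is listening on the port at all.
        return "refused"
    except socket.timeout:
        # Connected (or tried to), but nothing happened within the timeout.
        return "timeout"
    except ConnectionError:
        # Reset or broken pipe before any response bytes arrived.
        return "empty_reply"
    # An empty read means the peer closed the connection without replying.
    return "responded" if data else "empty_reply"
```

With the mapped port from the output above, `probe_http("127.0.0.1", 63451)` would be expected to report `"empty_reply"` while the container is in the stalled state.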

2 replies

michaelklishin Jul 23, 2023
Maintainer

Start with inspecting node logs. You can also enable HTTP API request logging.

It's impossible to say what may be going on with just this curl output.

kdubb Jul 24, 2023
Author

The logs I posted are the entirety of the log output, because the container never completes startup/boot.

The reason I posted this output was only to confirm/inform that the logs are accurately representing the state of the container in that it hasn't completed startup and cannot accept connections.

michaelklishin · 2023-07-23T09:29:53Z

michaelklishin
Jul 23, 2023
Maintainer

How many nodes does this cluster have? If it's more than one, see Restarting Cluster Nodes. While this may use Docker Swarm or something other than Kubernetes, the readiness probe part in the Cluster Formation guide is highly relevant.

RabbitMQ 3.9 is out of support (the node boot banner states as much), so we'd only investigate if there is a set of steps to reproduce with 3.12.

2 replies

kdubb Jul 24, 2023
Author

This is a docker container run as part of automated unit/integration tests. There is no cluster; it's just starting a single RabbitMQ docker container instance.

Yes. 3.9 is out of date. The logs were captured from a simple Quarkus test which is pinned to the 3.9 series; I've filed a PR to update Quarkus to use the 3.12 series for testing.

kdubb Jul 24, 2023
Author

Now that I have the test script I posted for @lukebakken, I can easily test 3.12 to see if it fails the same way. Will report back.

michaelklishin · 2023-07-23T09:30:54Z

michaelklishin
Jul 23, 2023
Maintainer

To reason about what is going on during boot, enabling debug logging is a good idea.

1 reply

kdubb Jul 24, 2023
Author

Good idea. This wasn't easily possible when trying to reproduce by running tests for a Quarkus application. Again, now with the test script I can try this.
