Move DB initialization scripts for postgres and redis into service files. #967

featheredtoast · 2025-05-18T15:12:32Z

This resolves a race condition with unconfigured images attempting to bring up DBs for the first time. This does not affect fully bootstrapped images.

Currently, all jobs start at boot - this includes postgres.

The problem with the current behavior: postgres starts and adds the corresponding .s/.pid files to /var/run/postgres.

Simultaneously, the unicorn job gets started, checks to see if postgres is running (it is already at this point from boot), and runs install_postgres.

Inside the install_postgres script, we mount the shared postgres folder and remove .s/.pid files -- after postgres has already been started. In this case, we remove the (in-use) .s and .pid files.

Subsequent unicorn tasks fail, erroring out the service and forcing it into a restart loop. Since postgres never restarts, it never regenerates the .s/.pid files, and unicorn can never run successfully.

This proposal moves install_postgres into the postgres job file, eliminating the race condition. Since they are part of the same service, install_postgres will always run before starting postgres - it will no longer be able to remove valid .s and .pid files.
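
A minimal sketch of the ordering described above, not the template's actual run script: the one-time setup runs in the same script that launches the server, so it can never fire while the server is already up. The function name run_setup_then_start is hypothetical; the run-once-then-remove behavior mirrors the `rm /root/install_postgres` line visible in the diff further down.

```shell
#!/bin/bash
# Hypothetical helper: run a one-time setup script, remove it, then launch the server.
# Usage: run_setup_then_start /path/to/setup_script server_command [args...]
run_setup_then_start() {
  setup="$1"
  shift
  if [ -f "$setup" ]; then
    # Setup finishes before the server exists, so it cannot delete
    # socket or pid files that a live server is still using.
    "$setup" || return 1
    rm -f "$setup"
  fi
  # A real run script would likely `exec` the server here; a plain call
  # keeps this sketch easy to try out.
  "$@"
}
```

On later boots the setup script no longer exists, so the function goes straight to the server command passed as the remaining arguments.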

Redis has a similar race condition with the creation of its data folder. It isn't as disastrous, since the redis service simply restarts until the unicorn run creates the folder, but moving the step makes the running services easier to reason about.

Run the create_db script, if configured, directly from the postgres service.
Wait more smartly by polling pg_isready rather than using a single long sleep.

Add early exit from unicorn boot scripts for more dependable service restarts - exiting early ensures all will restart.
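
One way to picture the early exit, as a hedged sketch rather than the PR's actual unicorn script: stop at the first failing boot step and return non-zero, so the service supervisor restarts the whole script instead of carrying on past a failure. run_boot_steps and any step names passed to it are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: run each named boot step in order, stop at the first failure.
# Usage: run_boot_steps step_command [step_command...]
run_boot_steps() {
  for step in "$@"; do
    # A non-zero return here lets the caller exit, which in turn lets the
    # supervisor restart the service and retry every step from the top.
    "$step" || { echo "boot step failed: $step" >&2; return 1; }
  done
}
```

A run script would call it as `run_boot_steps step_one step_two || exit 1`, with each step being a function or command defined elsewhere.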

andrewschleifer · 2025-07-24T03:44:33Z

templates/redis.template.yml

        exec 2>&1
+        if [ ! -d /shared/redis_data ]; then
+          install -d -m 0755 -o redis -g redis /shared/redis_data
+        fi


The test here would skip the install if the directory exists but its ownership or permissions were changed externally. Is that what we want?

Running install will still return success if the directory is already there.

I was attempting to save work at boot, but it does seem safe enough to run install regardless. Will remove the if.
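
A small sketch of why dropping the `if` is safe, assuming GNU coreutils `install`: with `-d` it succeeds when the directory already exists and reapplies the requested mode. ensure_dir is a hypothetical wrapper, and the `-o redis -g redis` flags from the diff are left out so the sketch can run without root or a redis user.

```shell
#!/bin/bash
# Hypothetical wrapper around the unconditional call.
# $1 = directory, $2 = octal mode. Ownership flags (-o/-g) are omitted here.
ensure_dir() {
  # Creates the directory if missing; if it exists, resets its mode.
  install -d -m "$2" "$1"
}
```

Calling `ensure_dir /shared/redis_data 0755` twice in a row is harmless: the second call only re-asserts the mode.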

andrewschleifer · 2025-07-24T03:47:35Z

templates/postgres.template.yml

+          rm /root/install_postgres
+        fi
+        if [ "$CREATE_DB_ON_BOOT" = "1" ]; then
+          /usr/local/bin/create_db&


Why is there &? I don't think we want that to happen in the background.

I've updated the way create_db is called so that it follows how the docker-library postgres image solves a very similar problem: ensuring the postgres service is started before running scripts against it.

andrewschleifer · 2025-07-24T03:49:12Z

templates/postgres.template.yml

      chmod: +x
      contents: |
        #!/bin/bash
+        # wait for postgres to start up...


I am uncomfortable putting this check inside the create_db script. Whatever is calling it -- I think /etc/service/postgres/run -- should ensure the database is available first.

I had thought to move the check+sleep into the file itself, since the sleep/check/wait is needed in both cases: on launcher bootstrap (when doing a full, environment-bound build using launcher), and when booting directly from a launcher build with CREATE_DB_ON_BOOT=1.

But I can move the check/sleep back out to both places if that makes things more readable/explicit.

Reverted the sleep on the -exec path. In the service file, the DB is now started with the -w start flag, which does not return (and so does not background the postgres service) until the server is up, which is exactly what we want. This also removes the check from the create_db script.
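
A hedged sketch of the pattern described here, modeled on the docker-library approach this PR cites: start the server with `pg_ctl -w start` so the call returns only once startup has completed, then run create_db in the foreground. The function name start_then_create_db and the use of $PGDATA are assumptions; the final service file is not shown in this thread, and the template's `su postgres -c` wrapping is omitted.

```shell
#!/bin/bash
# Hypothetical helper; assumes pg_ctl and create_db are on PATH and PGDATA is set.
start_then_create_db() {
  # -w makes pg_ctl wait until startup completes (or fail after its timeout),
  # so no separate sleep or polling loop is needed before create_db.
  pg_ctl -D "$PGDATA" -w start || return 1
  if [ "$CREATE_DB_ON_BOOT" = "1" ]; then
    # Foreground call: no trailing '&', so a failure surfaces in this script.
    create_db || return 1
  fi
  # What the run script does after this point is not covered by this sketch.
}
```

Because the wait lives in `pg_ctl` itself, create_db needs no readiness check of its own.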

andrewschleifer · 2025-07-24T03:57:33Z

templates/postgres.template.yml

+        # wait for postgres to start up...
+        for i in {1..5}; do
+          su postgres -c 'pg_isready -q' && break
+          sleep 1


Is it worth using the iterator for increasing backoff (sleep $i)? Do we know how long the database normally takes to start? How long are we willing to wait?

In testing, 1 second was plenty and made the process feel a bit snappier by shaving off a few seconds, but I didn't want to assume all systems could boot within 1 second.

Previously the wait was hardcoded to 5 seconds, so I used that as an upper bound so as not to leave older hardware out if that length mattered for other systems.

I removed the need for the check on start in the service file.

andrewschleifer · 2025-07-24T03:58:13Z

templates/postgres.template.yml

      contents: |
        #!/bin/bash
+        # wait for postgres to start up...
+        for i in {1..5}; do


What happens if we try five times and the database is still not responding? This will just continue ahead. We should issue an explicit error and abort.

I think the previous script crashed against the database not being up and errored there, but it does make much more sense to explicitly throw a "cannot connect" if the timeout is reached. Will adjust 👍
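
A sketch of the explicit failure asked for here, assuming the polling loop stays in a shell script: return success as soon as the check passes, and print an error and return non-zero once the attempts run out. wait_or_fail and the message text are illustrative, not the PR's final wording.

```shell
#!/bin/bash
# Hypothetical helper: poll a readiness command, then fail loudly on timeout.
# Usage: wait_or_fail MAX_TRIES DELAY_SECONDS command [args...]
wait_or_fail() {
  tries="$1"
  delay="$2"
  shift 2
  attempt=1
  while [ "$attempt" -le "$tries" ]; do
    "$@" && return 0
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  # Reaching this point means every attempt failed: say so instead of
  # silently continuing into commands that need the database.
  echo "cannot connect: not ready after $tries attempts" >&2
  return 1
}
```

With the check from the diff, a caller could write `wait_or_fail 5 1 su postgres -c 'pg_isready -q' || exit 1`.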

…les.

This resolves a race condition with unconfigured images attempting to bring up DBs for the first time. This does not affect fully bootstrapped images.

Currently, all jobs start at boot - this includes postgres.

Issue with the current is - postgres starts and adds the corresponding .s/.pid files to /var/run/postgres.

Simultaneously, the unicorn job gets started, checks to see if postgres is running (it is already at this point from boot), and runs install_postgres.

Inside the install_postgres script, we mount the shared postgres folder and remove .s/.pid files -- after postgres has already been started. In this case, we remove the (in-use) .s and .pid files.

Subsequent unicorn tasks fail, erroring out the service and forcing it into a restart loop. Since postgres never restarts, it never regenerates the .s/.pid files, and unicorn can never run successfully.

This proposal moves install_postgres into the postgres job file, eliminating the race condition. Since they are part of the same service, install_postgres will always run before starting postgres - it will no longer be able to remove valid .s and .pid files.

Redis has a similar race condition with the creation of its data folder. This isn't as disastrous as the redis service restarts until the folder exists from unicorn run, but it provides better reasoning about the running services.

Add early exit from unicorn boot scripts to properly retry migrate as well.
Use pg_isready to check if pg is ready directly in create_db.
Merge the ready check into create_db.
Run create_db in a subshell on postgres job start, rather than in unicorn script.
remove postgres-config call

Update DB password on boot in 2 container setup from env if exists

Fixes permissions on the directory if mis-set.

Use pg_ctl with -w start to ensure postgres is started. Allows create_db to run in the foreground on start for CREATE_DB_ON_BOOT. Take inspiration to how docker-library postgres does similar: https://github.com/docker-library/postgres/blob/889f9447cd2dfe21cccfbe9bb7945e3b037e02d8/15/bullseye/docker-entrypoint.sh#L294-L316

featheredtoast force-pushed the fix-cold-boot branch 3 times, most recently from 518415f to 136aefe Compare May 19, 2025 17:20

featheredtoast mentioned this pull request May 19, 2025

DEV: allow for 2 container data services to setup on boot #874

Closed

andrewschleifer requested changes Jul 24, 2025


featheredtoast requested a review from andrewschleifer July 25, 2025 20:37

featheredtoast mentioned this pull request Jul 26, 2025

DEV: re-split build vs configure #878

Merged

featheredtoast added 6 commits August 1, 2025 20:32

Update DB password if exists

1ada3ce

Update DB password on boot in 2 container setup from env if exists

DEV: always run install on redis_data directory

7b9cb85

Fixes permissions on the directory if mis-set.

properly revert sleep for startup

efe4f2d

Better formatting of if statements

d46d5b3

featheredtoast force-pushed the fix-cold-boot branch from d1f6d4b to d46d5b3 Compare August 2, 2025 03:32

andrewschleifer approved these changes Aug 5, 2025


featheredtoast merged commit f1996f3 into main Aug 8, 2025
4 of 5 checks passed

featheredtoast deleted the fix-cold-boot branch August 8, 2025 19:54


3 participants
