Skip to content

Conversation

@featheredtoast
Copy link
Member

@featheredtoast featheredtoast commented May 18, 2025

This resolves a race condition with unconfigured images attempting to bring up DBs for the first time. This does not affect fully bootstrapped images.

Currently, all jobs start at boot - this includes postgres.

Issue with the current is - postgres starts and adds the corresponding .s/.pid files to /var/run/postgres.

Simultaneously, the unicorn job gets started, checks to see if postgres is running (it is already at this point from boot), and runs install_postgres.

Inside the install_postgres script, we mount the shared postgres folder and remove .s/.pid files -- after postgres has already been started. In this case, we remove the (in-use) .s and .pid files.

Subsequent unicorn tasks fail, erroring out the service and forcing it into a restart loop. Since postgres never restarts, it never regenerates the .s/.pid files, and unicorn can never run successfully.

This proposal moves install_postgres into the postgres job file, eliminating the race condition. Since they are part of the same service, install_postgres will always run before starting postgres - it will no longer be able to remove valid .s and .pid files.

Redis has a similar race condition with the creation of its data folder. This isn't as disastrous as the redis service restarts until the folder exists from unicorn run, but it provides better reasoning about the running services.

Run create_db script if configured to directly from the postgres service.
Wait more smartly via polling pg_isready rather than a single long sleep.

Add early exit from unicorn boot scripts for more dependable service restarts - exiting early ensures all will restart.

exec 2>&1
if [ ! -d /shared/redis_data ]; then
install -d -m 0755 -o redis -g redis /shared/redis_data
fi
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test here would skip past if the directory exists but the ownership or permission was changed externally. Is that what we want?

Running install will still return success if the directory is already there.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was attempting to save work at boot, but it does seem safe enough to run install regardless, will remove the if

rm /root/install_postgres
fi
if [ "$CREATE_DB_ON_BOOT" = "1" ]; then
/usr/local/bin/create_db&
Copy link
Member

@andrewschleifer andrewschleifer Jul 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is there &? I don't think we want that to happen in the background.

Copy link
Member Author

@featheredtoast featheredtoast Jul 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the way create_db is called such that it follows how docker postgres resolves a very similar problem with ensuring a postgres service is started before running scripts against it.

chmod: +x
contents: |
#!/bin/bash
# wait for postgres to start up...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am uncomfortable putting this check inside the create_db script. Whatever is calling it -- I think /etc/service/postgres/run -- should ensure the database is available first.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had thought to move the check+sleep into the file itself since the sleep/check/wait is needed to be done both on launcher bootstrap (if doing a full, environment-bound build using launcher), and when booting running directly from launcher build, with CREATE_DB_ON_BOOT=1.

But I can move the check/sleep back out to both places if that makes things more readable/explicit

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted the sleep on the -exec path, and in the service file, started the DB with -w start flag, which will not background the postgres service until it is up, which is exactly what we want. This also removes the check from the create_db script as well.

# wait for postgres to start up...
for i in {1..5}; do
su postgres -c 'pg_isready -q' && break
sleep 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth using the iterator for increased backoff --sleep $i. Do we know how long the database normally takes to start? How long are we willing to wait?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From testing 1 second was plenty and allowed the process to feel a bit snappier by shaving off a few seconds, but I didn't want to assume all systems could boot within 1 second.

Previously wait was hardcoded to 5 seconds, so I used that as an upper bound to not leave older hardware out if that length mattered for other systems.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the need for the check on start in the service file

contents: |
#!/bin/bash
# wait for postgres to start up...
for i in {1..5}; do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if we try five times and the database is still not responding? This will just continue ahead. We should issue an explicit error and abort.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the previous script crashed against the database not being up and errored there, but it does make much more sense to explicitly throw a "cannot connect" if the timeout is reached. Will adjust 👍

…les.

This resolves a race condition with unconfigured images attempting to bring up
DBs for the first time. This does not affect fully bootstrapped images.

Currently, all jobs start at boot - this includes postgres.

Issue with the current is - postgres starts and adds the corresponding .s/.pid
files to /var/run/postgres.

Simultaneously, the unicorn job gets started, checks to see if postgres is
running (it is already at this point from boot), and runs install_postgres.

Inside the install_postgres script, we mount the shared postgres folder and
remove .s/.pid files -- after postgres has already been started. In this case,
we remove the (in-use) .s and .pid files.

Subsequent unicorn tasks fail, erroring out the service and forcing it into
a restart loop. Since postgres never restarts, it never regenerates the .s/.pid
files, and unicorn can never run successfully.

This proposal moves install_postgres into the postgres job file, eliminating the
race condition. Since they are part of the same service, install_postgres will
always run before starting postgres - it will no longer be able to remove valid
.s and .pid files.

Redis has a similar race condition with the creation of its data folder. This
isn't as disastrous as the redis service restarts until the folder exists from
unicorn run, but it provides better reasoning about the running services.

Add early exit from unicorn boot scripts to properly retry migrate as well.

Use pg_isready to check if pg is ready directly in create_db.
Merge the ready check into create_db.

Run create_db in a subshell on postgres job start, rather than in unicorn script.

remove postgres-config call
Update DB password on boot in 2 container setup from env if exists
Fixes permissions on the directory if mis-set.
Use pg_ctl with -w start to ensure postgres is started.

Allows create_db to run in the foreground on start for CREATE_DB_ON_BOOT.

Take inspiration to how docker-library postgres does similar: https://github.com/docker-library/postgres/blob/889f9447cd2dfe21cccfbe9bb7945e3b037e02d8/15/bullseye/docker-entrypoint.sh#L294-L316
@featheredtoast featheredtoast merged commit f1996f3 into main Aug 8, 2025
4 of 5 checks passed
@featheredtoast featheredtoast deleted the fix-cold-boot branch August 8, 2025 19:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

3 participants