---
date: "2025-10-25T19:44:48+08:00"
slug: "database-troubleshooting-for-umami"
tags:
  - analytics
title: "Database troubleshooting for Umami"
---

I suck at SQL. There. I said it. I just never really had to interact with a SQL database directly over the years. So yeah, I can query a database, but managing one? Nah. My SQL knowledge is pretty much limited to `SELECT * FROM something WHERE otherthing = ?`.

My website uses [Umami](https://umami.is/) for site analytics, which was [set up around October 2020](/blog/setting-up-umami-on-heroku/). Between that blog post and now, I had migrated my database from Heroku to AWS RDS, which I did not do on my own. Someone else had set things up on AWS and sent me the database URL to connect to.

I had also migrated to Umami v2 at some point, dutifully following the [migration instructions](https://github.com/umami-software/migrate-v1-v2), which thankfully just worked. But then, when [v2.18.0](https://github.com/umami-software/umami/releases/tag/v2.18.0) was released, I ran into this little error:

```shell
Error: P3018

A migration failed to apply. New migrations cannot be applied before the error is recovered from. Read more about how to resolve migration issues in a production database: https://pris.ly/d/migrate-resolve

Migration name: 09_update_hostname_region

Database error code: 40P01

Database error:
ERROR: deadlock detected
DETAIL: Process 27550 waits for AccessExclusiveLock on relation 16967 of database 16401; blocked by process 26654.
Process 26654 waits for AccessShareLock on relation 16982 of database 16401; blocked by process 27550.
HINT: See server log for query details.

DbError { severity: "ERROR", parsed_severity: Some(Error), code: SqlState(E40P01), message: "deadlock detected", detail: Some("Process 27550 waits for AccessExclusiveLock on relation 16967 of database 16401; blocked by process 26654.\nProcess 26654 waits for AccessShareLock on relation 16982 of database 16401; blocked by process 27550."), hint: Some("See server log for query details."), position: None, where_: None, schema: None, table: None, column: None, datatype: None, constraint: None, file: Some("deadlock.c"), line: Some(1135), routine: Some("DeadLockReport") }

✗ Command failed: prisma migrate deploy
 ELIFECYCLE Command failed with exit code 1.
ERROR: "check-db" exited with 1.
 ELIFECYCLE Command failed with exit code 1.
```

I am not equipped to solve such errors. But I wasn't alone. Scouring GitHub revealed:

<ul>
  <li class="no-margin"><a href="https://github.com/umami-software/umami/issues/3399">09_update_hostname_region migration failing after updating to 2.18.0</a></li>
  <li class="no-margin"><a href="https://github.com/umami-software/umami/issues/3428">Vercel build: 09_update_hostname_region migration failing</a></li>
  <li class="no-margin"><a href="https://github.com/umami-software/umami/issues/3417">Upgrade to 2.18.1 Caused Connection Pool Exhaustion and Rapid DB Size Increase</a></li>
  <li><a href="https://github.com/umami-software/umami/issues/3536">DB prisma migration failed while restoring data for latest deployment 2.19.0 from 2.15.0</a></li>
</ul>

I also submitted an [issue of my own](https://github.com/umami-software/umami/issues/3462), but was told to resolve my database deadlock first. Nobody really had the exact same error as I did, but based on everyone else's issues, I guessed that my database was probably too large for the tiny RDS instance to cope with the migration.

Given that [v3 was coming out soon](https://umami.is/blog/what-is-coming-in-umami-v3), I figured now would be an excellent time to actually sit my ass down and resolve this database issue once and for all.

## The most basic of basics

You need tools for troubleshooting, and the first thing is to figure out how to actually access your database. I finally had to log into the database for the first time ever (the migration was previously run via script, so I never actually had to do anything). AWS provides [this guidance](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_ConnectToPostgreSQLInstance.html) on how to connect to your PostgreSQL DB instance. You need [`psql`](https://www.postgresql.org/docs/current/app-psql.html).

If you're on macOS, like me, install [`libpq`](https://www.postgresql.org/docs/current/libpq.html) via Homebrew.

```bash
brew install libpq
```

Then add it to your `PATH`. I'm using fish shell, so I used `fish_add_path`.

```bash
fish_add_path /opt/homebrew/opt/libpq/bin
```

Connect to your remote database. This really depends on where your database lives. For me, I used this URL pattern.

```bash
psql "postgres://dbuser:password@aws_rds_endpoint:port/db_name?sslrootcert=path_to_pem_file"
```

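Once connected, a quick sanity check never hurts. These are standard psql meta-commands and plain SQL, nothing Umami-specific:

```sql
-- confirm the server version and which database you landed in
SELECT version();
SELECT current_database();

-- list tables; you should see session, website_event, _prisma_migrations, etc.
\dt
```
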
## Start operation recovery

The migration borked at `09_update_hostname_region`, and the [migration.sql](https://github.com/umami-software/umami/blob/master/db/postgresql/migrations/09_update_hostname_region/migration.sql) file looks like this:

```sql
-- AlterTable
ALTER TABLE "website_event" ADD COLUMN "hostname" VARCHAR(100);

-- DataMigration
UPDATE "website_event" w
SET hostname = s.hostname
FROM "session" s
WHERE s.website_id = w.website_id
  and s.session_id = w.session_id;

-- DropIndex
DROP INDEX IF EXISTS "session_website_id_created_at_hostname_idx";
DROP INDEX IF EXISTS "session_website_id_created_at_subdivision1_idx";

-- AlterTable
ALTER TABLE "session" RENAME COLUMN "subdivision1" TO "region";
ALTER TABLE "session" DROP COLUMN "subdivision2";
ALTER TABLE "session" DROP COLUMN "hostname";

-- CreateIndex
CREATE INDEX "website_event_website_id_created_at_hostname_idx" ON "website_event"("website_id", "created_at", "hostname");
CREATE INDEX "session_website_id_created_at_region_idx" ON "session"("website_id", "created_at", "region");
```

The migration never finished, so I first had to mark the failed migration as rolled back so I could do it all over again.

```bash
prisma migrate resolve --rolled-back 09_update_hostname_region
```

A check on the state of my database revealed that I had 1,003,518 rows to deal with, and my database was on a t3.micro instance. Probably not the best combination, but it is what it is.

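For the record, a count and size check along these lines is all it takes to see what you're up against (`pg_size_pretty` and `pg_total_relation_size` are built-in Postgres functions):

```sql
-- total rows in the event table
SELECT count(*) FROM public.website_event;

-- table size on disk, including its indexes
SELECT pg_size_pretty(pg_total_relation_size('public.website_event'));
```
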
It seemed like a better idea to run the migration in batches, committing after every batch, so that if something went wrong I could recover without going through the entire database all over again. These are the steps I went through.

Start off with some defensive session settings for my puny t3.micro instance:

```sql
-- bail out instead of queueing forever behind a lock
SET lock_timeout = '2min';
-- generous ceilings for long-running statements on a slow instance
SET statement_timeout = '60min';
SET idle_in_transaction_session_timeout = '10min';
-- extra memory for index builds and other maintenance work
SET maintenance_work_mem = '256MB';
```

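These only apply to the current session, so they need re-running after a reconnect. `SHOW` confirms a value took effect:

```sql
SHOW lock_timeout;
SHOW statement_timeout;
```
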
In addition to the required column, add two helper indexes to speed up the backfill. `CREATE INDEX CONCURRENTLY` builds them without blocking writes, though it can't run inside a transaction:

```sql
ALTER TABLE public.website_event ADD COLUMN IF NOT EXISTS hostname VARCHAR(100);

-- join helper for the backfill
CREATE INDEX CONCURRENTLY IF NOT EXISTS session_website_id_session_id_idx
  ON public."session"(website_id, session_id);

-- partial index so each batch can find unprocessed rows quickly
CREATE INDEX CONCURRENTLY IF NOT EXISTS website_event_wid_sid_null_idx
  ON public.website_event(website_id, session_id)
  WHERE hostname IS NULL;
```

Create a helper function to fill in the rows for the newly added column:

```sql
CREATE OR REPLACE FUNCTION fill_hostname_batch(batch_size int)
RETURNS integer
LANGUAGE plpgsql
AS $$
DECLARE r_count integer;
BEGIN
  -- grab a batch of rows still missing a hostname; ctid is the row's
  -- physical address, which lets the UPDATE target exactly these rows
  WITH chunk AS (
    SELECT w.ctid AS w_ctid, s.hostname
    FROM public.website_event w
    JOIN public."session" s
      ON s.website_id = w.website_id
     AND s.session_id = w.session_id
    WHERE w.hostname IS NULL
    LIMIT batch_size
  )
  UPDATE public.website_event w
  SET hostname = c.hostname
  FROM chunk c
  WHERE w.ctid = c.w_ctid;

  -- report how many rows this batch touched; 0 means we're done
  GET DIAGNOSTICS r_count = ROW_COUNT;
  RETURN r_count;
END$$;
```

Set the function to run in batches. Since each call runs as its own transaction, progress is committed as it goes. I manually stopped this when it returned 0, but it took more than 24 hours. I actually had to go on a business trip before it completed, so I stopped it and resumed it when I came back home.

```sql
\set batch 2000
-- run one batch per second; stop manually (Ctrl+C) once it returns 0
SELECT fill_hostname_batch(:batch);
\watch 1
```

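To gauge progress from a second psql session, counting the rows still missing a hostname works; the partial index created earlier should keep this query cheap:

```sql
-- rows the batched backfill still has to process
SELECT count(*) FROM public.website_event WHERE hostname IS NULL;
```
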
Create the new `website_event` index:

```sql
CREATE INDEX CONCURRENTLY IF NOT EXISTS website_event_website_id_created_at_hostname_idx
  ON public.website_event(website_id, created_at, hostname);
```

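One caveat with `CREATE INDEX CONCURRENTLY`: if the build fails partway, it leaves behind an `INVALID` index that has to be dropped and recreated. A quick check that nothing is stuck in that state:

```sql
-- any rows here are half-built indexes from a failed CONCURRENTLY build
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;
```
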
Drop the legacy `session` indexes:

```sql
DROP INDEX CONCURRENTLY IF EXISTS public.session_website_id_created_at_hostname_idx;
DROP INDEX CONCURRENTLY IF EXISTS public.session_website_id_created_at_subdivision1_idx;
```

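Listing the remaining indexes on `session` verifies the legacy ones are actually gone:

```sql
SELECT indexname
FROM pg_indexes
WHERE schemaname = 'public' AND tablename = 'session';
```
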
Apply the required `session` column changes:

```sql
BEGIN;
SET LOCAL lock_timeout = '5s';

-- rename only if the old column still exists, so this block is re-runnable
DO $$
BEGIN
  IF EXISTS (
    SELECT 1 FROM information_schema.columns
    WHERE table_schema='public' AND table_name='session' AND column_name='subdivision1'
  ) THEN
    EXECUTE 'ALTER TABLE public."session" RENAME COLUMN subdivision1 TO region';
  END IF;
END$$;

ALTER TABLE public."session" DROP COLUMN IF EXISTS subdivision2;
ALTER TABLE public."session" DROP COLUMN IF EXISTS hostname;

COMMIT;
```

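At this point `session` should have a `region` column and no `subdivision2` or `hostname`; listing the columns confirms the schema matches what the migration expects:

```sql
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'public' AND table_name = 'session'
ORDER BY ordinal_position;
```
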
Almost there now. Create the new `session` index:

```sql
CREATE INDEX CONCURRENTLY IF NOT EXISTS session_website_id_created_at_region_idx
  ON public."session"(website_id, created_at, region);
```

Finalise the migration record:

```sql
UPDATE public._prisma_migrations
SET finished_at = NOW(),
    rolled_back_at = NULL,
    logs = 'Manually completed after deadlock recovery and verified schema.'
WHERE migration_name = '09_update_hostname_region'
  AND finished_at IS NULL;
```

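A last look at Prisma's migrations table confirms the record now reads as finished:

```sql
SELECT migration_name, started_at, finished_at, rolled_back_at
FROM public._prisma_migrations
ORDER BY started_at DESC
LIMIT 5;
```
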
Thankfully at this point, there weren't any new errors, and it seemed like the migration went through.

## Wrapping up

I ran `npx prisma migrate status`, which told me that I had 13 pending migrations, and I went ahead and applied them with `npx prisma migrate deploy`. Again, no new errors, and when I finally ran the build command, it actually worked and the app compiled successfully. <span class="emoji" role="img" tabindex="0" aria-label="partying face">🥳</span>

Hopefully v3 doesn't involve such a mega database update. <span class="emoji" role="img" tabindex="0" aria-label="crossed fingers">🤞</span>