Commit 4a3f707 (1 parent 31d648e)
feat: add database migration blog post
2 files changed: +251, -4 lines
package.json (4 additions & 4 deletions):

```diff
@@ -10,11 +10,11 @@
     "astro": "astro"
   },
   "dependencies": {
-    "@astrojs/mdx": "^4.3.7",
-    "@astrojs/netlify": "^6.5.13",
-    "@astrojs/rss": "^4.0.12",
+    "@astrojs/mdx": "^4.3.8",
+    "@astrojs/netlify": "^6.6.0",
+    "@astrojs/rss": "^4.0.13",
     "@astrojs/sitemap": "^3.6.0",
-    "astro": "^5.14.5",
+    "astro": "^5.15.1",
     "html-to-text": "^9.0.5",
     "markdown-it": "^14.1.0",
     "sharp": "^0.34.4"
```
New file (247 additions & 0 deletions):
---
date: "2025-10-25T19:44:48+08:00"
slug: "database-troubleshooting-for-umami"
tags:
  - analytics
title: "Database troubleshooting for Umami"
---
I suck at SQL. There. I said it. I just never really had to interact with a SQL database directly over the years. So yeah, I can query a database, but managing one? Nah. My SQL knowledge is pretty much limited to `SELECT * FROM something WHERE otherthing = ?`.

My website uses [Umami](https://umami.is/) for site analytics, which was [set up around October 2020](/blog/setting-up-umami-on-heroku/). Between that blog post and now, I migrated my database from Heroku to AWS RDS, though I didn't do that on my own: someone else set things up on AWS and sent me the database URL to connect to.

I had also migrated to Umami v2 at some point, dutifully following the [migration instructions](https://github.com/umami-software/migrate-v1-v2), which thankfully just worked. But then, when [v2.18.0](https://github.com/umami-software/umami/releases/tag/v2.18.0) was released, I ran into this little error:
```shell
Error: P3018

A migration failed to apply. New migrations cannot be applied before the error is recovered from. Read more about how to resolve migration issues in a production database: https://pris.ly/d/migrate-resolve

Migration name: 09_update_hostname_region

Database error code: 40P01

Database error:
ERROR: deadlock detected
DETAIL: Process 27550 waits for AccessExclusiveLock on relation 16967 of database 16401; blocked by process 26654.
Process 26654 waits for AccessShareLock on relation 16982 of database 16401; blocked by process 27550.
HINT: See server log for query details.

DbError { severity: "ERROR", parsed_severity: Some(Error), code: SqlState(E40P01), message: "deadlock detected", detail: Some("Process 27550 waits for AccessExclusiveLock on relation 16967 of database 16401; blocked by process 26654.\nProcess 26654 waits for AccessShareLock on relation 16982 of database 16401; blocked by process 27550."), hint: Some("See server log for query details."), position: None, where_: None, schema: None, table: None, column: None, datatype: None, constraint: None, file: Some("deadlock.c"), line: Some(1135), routine: Some("DeadLockReport") }


✗ Command failed: prisma migrate deploy
Error: P3018

A migration failed to apply. New migrations cannot be applied before the error is recovered from. Read more about how to resolve migration issues in a production database: https://pris.ly/d/migrate-resolve

Migration name: 09_update_hostname_region

Database error code: 40P01

Database error:
ERROR: deadlock detected
DETAIL: Process 27550 waits for AccessExclusiveLock on relation 16967 of database 16401; blocked by process 26654.
Process 26654 waits for AccessShareLock on relation 16982 of database 16401; blocked by process 27550.
HINT: See server log for query details.

DbError { severity: "ERROR", parsed_severity: Some(Error), code: SqlState(E40P01), message: "deadlock detected", detail: Some("Process 27550 waits for AccessExclusiveLock on relation 16967 of database 16401; blocked by process 26654.\nProcess 26654 waits for AccessShareLock on relation 16982 of database 16401; blocked by process 27550."), hint: Some("See server log for query details."), position: None, where_: None, schema: None, table: None, column: None, datatype: None, constraint: None, file: Some("deadlock.c"), line: Some(1135), routine: Some("DeadLockReport") }

 ELIFECYCLE  Command failed with exit code 1.
ERROR: "check-db" exited with 1.
 ELIFECYCLE  Command failed with exit code 1.
```

I am not equipped to solve such errors. But I wasn't alone. Scouring GitHub revealed:

<ul>
<li class="no-margin"><a href="https://github.com/umami-software/umami/issues/3399">09_update_hostname_region migration failing after updating to 2.18.0</a></li>
<li class="no-margin"><a href="https://github.com/umami-software/umami/issues/3428">Vercel build: 09_update_hostname_region migration failing</a></li>
<li class="no-margin"><a href="https://github.com/umami-software/umami/issues/3417">Upgrade to 2.18.1 Caused Connection Pool Exhaustion and Rapid DB Size Increase</a></li>
<li><a href="https://github.com/umami-software/umami/issues/3536">DB prisma migration failed while restoring data for latest deployment 2.19.0 from 2.15.0</a></li>
</ul>
I also submitted an [issue of my own](https://github.com/umami-software/umami/issues/3462), but was told to resolve my database deadlock first. Nobody had exactly the same error as I did, but based on everyone else's issues, I guessed that my database was probably too large for the tiny RDS instance to cope with the migration.

Given that [v3 was coming out soon](https://umami.is/blog/what-is-coming-in-umami-v3), I figured now would be an excellent time to actually sit my ass down and resolve this database issue once and for all.

## The most basic of basics

You need tools for troubleshooting, and the first step is figuring out how to actually access your database. I had to log into the database for the first time ever (the migration had previously been run via a script, so I never had to do anything myself). AWS provides [this guidance](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_ConnectToPostgreSQLInstance.html) on how to connect to your PostgreSQL DB instance. You need [`psql`](https://www.postgresql.org/docs/current/app-psql.html).

If you're on macOS, like me, install [`libpq`](https://www.postgresql.org/docs/current/libpq.html) via Homebrew.
```bash
brew install libpq
```

Then add it to your `PATH`. I'm using the fish shell, so I used `fish_add_path`.

```bash
fish_add_path /opt/homebrew/opt/libpq/bin
```

Connect to your remote database. The details depend on where your database lives; for me, this URL pattern worked.

```bash
psql "postgres://dbuser:password@aws_rds_endpoint:port/db_name?sslrootcert=path_to_pem_file"
```
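Once connected, a couple of `psql` meta-commands help confirm you're in the right place (these are standard `psql` built-ins, nothing Umami-specific):

```sql
\conninfo   -- show the current connection: user, database, host, port
\dt         -- list tables; Umami's schema includes "website_event" and "session"
```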

## Start operation recovery

The migration borked at `09_update_hostname_region`, and the [migration.sql](https://github.com/umami-software/umami/blob/master/db/postgresql/migrations/09_update_hostname_region/migration.sql) file looks like this:
```sql
-- AlterTable
ALTER TABLE "website_event" ADD COLUMN "hostname" VARCHAR(100);

-- DataMigration
UPDATE "website_event" w
SET hostname = s.hostname
FROM "session" s
WHERE s.website_id = w.website_id
and s.session_id = w.session_id;

-- DropIndex
DROP INDEX IF EXISTS "session_website_id_created_at_hostname_idx";
DROP INDEX IF EXISTS "session_website_id_created_at_subdivision1_idx";

-- AlterTable
ALTER TABLE "session" RENAME COLUMN "subdivision1" TO "region";
ALTER TABLE "session" DROP COLUMN "subdivision2";
ALTER TABLE "session" DROP COLUMN "hostname";

-- CreateIndex
CREATE INDEX "website_event_website_id_created_at_hostname_idx" ON "website_event"("website_id", "created_at", "hostname");
CREATE INDEX "session_website_id_created_at_region_idx" ON "session"("website_id", "created_at", "region");
```

The migration never finished, so I first had to mark the failed migration as rolled back so I could do it all over again.

```bash
prisma migrate resolve --rolled-back 09_update_hostname_region
```

A check on the state of my database revealed that I had 1,003,518 rows to deal with, and that my database was on a t3.micro instance. Probably not the best combination, but it is what it is.
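That check is just a row count; a minimal sketch, assuming Umami's standard `website_event` table:

```sql
-- Count the rows the data migration will need to touch
SELECT count(*) FROM public.website_event;
```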
It seemed like a better idea to run the migration in batches and commit after every batch, so that if something went wrong I could recover without going through the entire database all over again. These are the steps I went through.

Start off with some defensive session settings for my puny t3.micro instance:
```sql
SET lock_timeout = '2min';
SET statement_timeout = '60min';
SET idle_in_transaction_session_timeout = '10min';
SET maintenance_work_mem = '256MB';
```

Also add helper indexes in addition to the required column:

```sql
ALTER TABLE public.website_event ADD COLUMN IF NOT EXISTS hostname VARCHAR(100);

CREATE INDEX CONCURRENTLY IF NOT EXISTS session_website_id_session_id_idx
  ON public."session"(website_id, session_id);

CREATE INDEX CONCURRENTLY IF NOT EXISTS website_event_wid_sid_null_idx
  ON public.website_event(website_id, session_id)
  WHERE hostname IS NULL;
```

Create a helper function to fill in the rows for the newly added column:

```sql
CREATE OR REPLACE FUNCTION fill_hostname_batch(batch_size int)
RETURNS integer
LANGUAGE plpgsql
AS $$
DECLARE r_count integer;
BEGIN
  WITH chunk AS (
    SELECT w.ctid AS w_ctid, s.hostname
    FROM public.website_event w
    JOIN public."session" s
      ON s.website_id = w.website_id
     AND s.session_id = w.session_id
    WHERE w.hostname IS NULL
    LIMIT batch_size
  )
  UPDATE public.website_event w
  SET hostname = c.hostname
  FROM chunk c
  WHERE w.ctid = c.w_ctid;

  GET DIAGNOSTICS r_count = ROW_COUNT;
  RETURN r_count;
END$$;
```

Run the function in batches. I manually stopped this once it returned 0, but it took more than 24 hours in total; I actually had to go on a business trip before it completed, so I stopped it and resumed it when I came back home.

```sql
\set batch 2000
SELECT fill_hostname_batch(:batch);
\watch 1
```
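An equivalent way to drive the loop, if you'd rather have something you can stop and resume from the shell: call the function once per `psql` invocation until it reports 0 rows updated. This is a sketch, not what I ran; `DATABASE_URL` here stands in for the connection URL from earlier, and `-At` gives unaligned, tuples-only output so `$n` is just the number.

```bash
# Hypothetical shell driver: one batch per psql call, stop when 0 rows updated
while true; do
  n=$(psql "$DATABASE_URL" -At -c 'SELECT fill_hostname_batch(2000);')
  echo "updated $n rows"
  [ "$n" -eq 0 ] && break
done
```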

Create the new `website_event` index:

```sql
CREATE INDEX CONCURRENTLY IF NOT EXISTS website_event_website_id_created_at_hostname_idx
  ON public.website_event(website_id, created_at, hostname);
```

Drop the legacy `session` indexes:

```sql
DROP INDEX CONCURRENTLY IF EXISTS public.session_website_id_created_at_hostname_idx;
DROP INDEX CONCURRENTLY IF EXISTS public.session_website_id_created_at_subdivision1_idx;
```

Apply the required `session` column changes:

```sql
BEGIN;
SET LOCAL lock_timeout = '5s';

DO $$
BEGIN
  IF EXISTS (
    SELECT 1 FROM information_schema.columns
    WHERE table_schema='public' AND table_name='session' AND column_name='subdivision1'
  ) THEN
    EXECUTE 'ALTER TABLE public."session" RENAME COLUMN subdivision1 TO region';
  END IF;
END$$;

ALTER TABLE public."session" DROP COLUMN IF EXISTS subdivision2;
ALTER TABLE public."session" DROP COLUMN IF EXISTS hostname;

COMMIT;
```

Almost there now. Create the new `session` index:

```sql
CREATE INDEX CONCURRENTLY IF NOT EXISTS session_website_id_created_at_region_idx
  ON public."session"(website_id, created_at, region);
```

Finalise the migration record:

```sql
UPDATE public._prisma_migrations
SET finished_at = NOW(),
    rolled_back_at = NULL,
    logs = 'Manually completed after deadlock recovery and verified schema.'
WHERE migration_name = '09_update_hostname_region'
  AND finished_at IS NULL;
```

Thankfully at this point, there weren't any new errors, and it seemed like the migration went through.
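The "verified schema" in that migration-record log message amounted to checking the catalog; a sketch of the kind of queries that confirm it, using standard `information_schema` and `pg_indexes` lookups:

```sql
-- "session" should now have "region" and no subdivision/hostname columns
SELECT column_name
FROM information_schema.columns
WHERE table_schema = 'public' AND table_name = 'session'
  AND column_name IN ('region', 'subdivision1', 'subdivision2', 'hostname');

-- Both new indexes should exist
SELECT indexname
FROM pg_indexes
WHERE schemaname = 'public'
  AND indexname IN ('website_event_website_id_created_at_hostname_idx',
                    'session_website_id_created_at_region_idx');
```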

## Wrapping up

I ran `npx prisma migrate status`, which told me I had 13 pending migrations, and I went ahead and applied them with `npx prisma migrate deploy`. Again, no new errors, and when I finally ran the build command, it actually worked and the app compiled successfully. <span class="emoji" role="img" tabindex="0" aria-label="partying face">&#x1F973;</span>

Hopefully v3 doesn't involve such a mega database update. <span class="emoji" role="img" tabindex="0" aria-label="crossed fingers">&#x1F91E;</span>
