Commit d611317: Add draft ADR for bigint id column migration (#4191)

# 13: Migrating `int` to `bigint` for `id` Primary Keys

Date: 2025-02-04

## Status

Draft :construction:
## Context

The primary key `id` columns in all database tables use the integer type, which has a maximum value of 2,147,483,647.
As foundations grow over time, the `id` values in some of these tables (e.g., events) are approaching this limit.
If the limit is reached, the cloud controller will be unable to insert new records, leading to critical failures in the CF API.
E.g.:
```
PG::SequenceGeneratorLimitExceeded: ERROR: nextval: reached maximum value of sequence "events_id_seq"
```
The goal is to migrate these primary key `id` columns from `int` to `bigint` without causing downtime and to ensure compatibility across PostgreSQL and MySQL databases.
This migration must:
- Avoid downtime since the CF API is actively used in production.
- Handle tables with millions of records efficiently.
- Provide a safe rollback mechanism in case of issues during the migration.
- Be reusable for other tables in the future.
- Ensure that the migration is only executed once the new `id_bigint` column is fully populated.

The largest tables in a long-running foundation are `events`, `delayed_jobs`, `jobs`, and `app_usage_events`.
## Decisions

### Opt-Out Mechanism

Operators of smaller foundations, which are unlikely to ever encounter the integer overflow issue, may wish to avoid the risks and complexity associated with this migration.
They can opt out of the migration by setting the `skip_bigint_id_migration` flag in the CAPI-Release manifest.
When this flag is set, all migration steps become no-ops but are still marked as applied in the `schema_versions` table.
*Important*: Removing the flag later will *not* re-trigger the migration. Operators who opt out must handle the migration manually.
### Scope

The `events` table will be migrated first as it has the most significant growth in `id` values.
Other tables will be migrated at a later stage.

### Newly Created Foundations

For newly created foundations, the `id` column for the `events` table will be created as `bigint` by default.
This is implemented in migration step 1 and is only applied if the `events` table is empty.
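For the empty-table case, the in-place type change can be sketched as follows (a sketch assuming PostgreSQL syntax; the actual migration step may differ):

```sql
-- Sketch: only switch the column type in place when the table is empty.
-- Assumption: PostgreSQL syntax; the identity/sequence backing `id` may
-- also need to be adjusted (see "Database Specifics").
DO $$
BEGIN
    IF NOT EXISTS (SELECT 1 FROM events) THEN
        ALTER TABLE events ALTER COLUMN id TYPE BIGINT;
    END IF;
END $$;
```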
### Phased Migration

The migration will be conducted in multiple steps to ensure minimal risk.

#### Step 1 - Preparation
- If the opt-out flag is set, this step will be a no-op.
- If the target table is empty, the type of the `id` column will be set to `bigint` directly.
- Otherwise, the following steps will be executed:
  - Add a new column `id_bigint` of type `bigint` to the target table. If the `id` column is referenced as a foreign key in other tables, also add an `<ref>_id_bigint` column in those referencing tables.
  - Create triggers to keep `id_bigint` in sync with `id` when new records are inserted.
  - Add constraints and indexes to `id_bigint` as required to match the primary key and foreign key requirements.
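The foreign-key part of step 1 is not covered by the example scripts at the end of this document; for a hypothetical referencing table `event_annotations` with an `event_id` column, it could look roughly like this (PostgreSQL syntax, names are illustrative only):

```sql
-- Illustrative only: `event_annotations` and `event_id` are hypothetical names.
ALTER TABLE event_annotations ADD COLUMN event_id_bigint BIGINT;

-- Keep the new column in sync on insert, mirroring the trigger on the primary table.
CREATE OR REPLACE FUNCTION event_annotations_set_event_id_bigint_on_insert()
RETURNS TRIGGER AS $$
BEGIN
    NEW.event_id_bigint := NEW.event_id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trigger_event_annotations_set_event_id_bigint
BEFORE INSERT ON event_annotations
FOR EACH ROW
EXECUTE FUNCTION event_annotations_set_event_id_bigint_on_insert();
```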
#### Step 2 - Backfill
- Backfill will not be scheduled if the opt-out flag is set.
- If the `id_bigint` column does not exist, backfill will be skipped or result in a no-op.
- Use a batch-processing script (e.g., a delayed job) to populate `id_bigint` for existing rows in both the primary table and, if applicable, all foreign key references.
- Table locks will be avoided by using a batch-processing approach.
- If the table has a configurable cleanup duration, the backfill job will only process records within the cleanup duration (older records will be removed by the cleanup anyway), reducing the number of records to be processed.
- Backfill will be executed outside the migration due to its potentially long runtime.
- If necessary, the backfill will run for multiple weeks to ensure all records are processed.
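With a cleanup duration in place, the batch update can restrict itself to records that will still exist at migration time; a sketch in PostgreSQL syntax (the 31-day interval is the `events` default, and a `created_at` timestamp column is assumed):

```sql
-- Sketch: backfill only records within the cleanup window; older rows will be
-- removed by the regular cleanup job before the switch-over anyway.
WITH batch AS (
    SELECT id FROM events
    WHERE id_bigint IS NULL
      AND created_at >= NOW() - INTERVAL '31 days'  -- assumes a `created_at` column
    ORDER BY id
    LIMIT 100000
    FOR UPDATE SKIP LOCKED
)
UPDATE events
SET id_bigint = events.id
FROM batch
WHERE events.id = batch.id;
```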
#### Step 3 - Migration
- The migration is divided into two parts, a pre-check and the actual migration, but both are stored in a single migration script.
- This step will be a no-op if the opt-out flag is set or the `id` column is already of type `bigint`.
- All SQL statements will be executed in a single transaction to ensure consistency.

##### Step 3a - Migration Pre-Check
- If the `id_bigint` column does not exist, the migration will fail with a clear error message.
- Add a `CHECK` constraint to verify that `id_bigint` is fully populated (`id_bigint IS NOT NULL AND id_bigint = id`).
- If the backfill is not yet complete and the `id_bigint` column is therefore not fully populated, the migration will fail.
- If the pre-check fails, operators might need to take manual action to ensure all preconditions are met, as the migration will be retried during the next deployment.
##### Step 3b - Actual Migration
- Remove the `CHECK` constraint once verified.
- Drop the primary key constraint on `id`.
- If foreign keys exist, drop the corresponding foreign key constraints.
- Remove the sync triggers.
- Drop the old `id` column.
- Rename the `id_bigint` column to `id`.
- Add the primary key constraint on the `id` column and configure the `id` generator.
- If foreign keys exist, rename the `<ref>_id_bigint` columns to `<ref>_id` in the referencing tables accordingly.
### Database Specifics

#### PostgreSQL

The default value of the `id` column could be either a sequence (for PostgreSQL versions < 10) or an identity column (for newer PostgreSQL versions).
This depends on the version of PostgreSQL which was used when the table was initially created.
The migration script needs to handle both cases.
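Which case applies can be determined from the system catalogs; one possible check (an assumption, not necessarily the one the migration will use):

```sql
-- attidentity is 'a' or 'd' for identity columns and '' for plain defaults;
-- pg_get_serial_sequence() returns the backing sequence name, if any.
SELECT a.attidentity,
       pg_get_serial_sequence('events', 'id') AS sequence_name
FROM pg_attribute a
WHERE a.attrelid = 'events'::regclass
  AND a.attname = 'id';
```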
#### MySQL

MySQL primary key changes typically cause table rebuilds due to clustered indexing, which can be expensive and disruptive, especially with clustered replication setups like Galera.
A common approach to mitigate this involves creating a new shadow table, performing a backfill, and then swapping the tables atomically.
Further details will be refined during implementation.
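A rough sketch of the shadow-table approach in MySQL syntax (illustrative only; the concrete procedure, including how writes are kept in sync during the backfill, is still to be defined):

```sql
-- Create an empty copy with the widened key type.
CREATE TABLE events_shadow LIKE events;
ALTER TABLE events_shadow MODIFY id BIGINT NOT NULL AUTO_INCREMENT;

-- ... backfill events_shadow in batches and keep it in sync with events ...

-- Atomic swap once the shadow table has caught up.
RENAME TABLE events TO events_old, events_shadow TO events;
```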
### Rollback Mechanism

The old `id` column is no longer retained, as the `CHECK` constraint ensures correctness during the migration.
Step 3b (the switch-over) will be executed in a single transaction and will be rolled back if any issues occur.
If unexpected issues occur, the migration will fail explicitly, requiring operator intervention.
If a rollback is needed, either backups can be restored or the migration needs to be reverted manually.
### Standardized Approach

Write reusable scripts for adding `id_bigint`, setting up triggers, backfilling data, and verifying migration readiness.
These scripts can be reused for other tables in the future.
### Release Strategy

Steps 1-2 will be released in a cf-deployment major release to ensure that the database is prepared for the migration.
Step 3 will be released in a subsequent cf-deployment major release to complete the migration.
Between these releases, enough time should pass to allow the backfill to complete.

For the `events` table there is a default cleanup interval of 31 days. Therefore, for the `events` table the gap between the releases should be around 60 days.
## Consequences

### Positive Consequences

- Future-proofing the schema for tables with high record counts.
- Only minimal locking during step 3b (the actual migration); at worst, queries are briefly slower or there is minimal downtime.
- A standardized process for similar migrations across the database.

### Negative Consequences

- Increased complexity in the migration process.
- Potentially long runtimes for backfilling data when tables have millions of records.
- Requires careful coordination across multiple CAPI/cf-deployment versions.
- If backfilling encounters edge cases (e.g., missing cleanup jobs), the migration may be delayed until operators intervene.
## Alternatives Considered

### Switching to `guid` Field as Primary Key

Pros: Provides globally unique identifiers and eliminates the risk of overflow.

Cons: Might decrease query and index performance, requires significant changes for foreign key constraints, and introduces non-sequential keys.

Reason Rejected: The overhead and complexity outweighed the benefits for our use case.

### Implementing Rollover for `id` Reuse

Pros: Delays the overflow issue by reusing IDs from deleted rows. Minimal schema changes.

Cons: Potential issues with foreign key constraints and increased complexity in the rollover process. Could be problematic for tables that do not have frequent deletions.

Reason Rejected: Might work well for tables like `events`, but it is not a universal solution for all tables, as there is no guarantee of frequent deletions.

### Direct Migration of `id` to `bigint` via `ALTER TABLE` Statement

Pros: One-step migration process.

Cons: Requires downtime, locks the table for the duration of the migration, and can be slow for tables with millions of records.

Reason Rejected: Downtime is unacceptable for production foundations.
## Example Migration Scripts With PostgreSQL Syntax For `events` Table

### Step 1 - Preparation
```sql
BEGIN;

-- Add new BIGINT column
ALTER TABLE events ADD COLUMN id_bigint BIGINT;

-- Ensure new inserts populate `id_bigint`
CREATE OR REPLACE FUNCTION events_set_id_bigint_on_insert()
RETURNS TRIGGER AS $$
BEGIN
    NEW.id_bigint := NEW.id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS trigger_events_set_id_bigint ON events;

CREATE TRIGGER trigger_events_set_id_bigint
BEFORE INSERT ON events
FOR EACH ROW
EXECUTE FUNCTION events_set_id_bigint_on_insert();

COMMIT;
```
### Step 2 - Backfill
```sql
WITH batch AS (
    SELECT id FROM events
    WHERE id_bigint IS NULL
    ORDER BY id
    LIMIT 100000
    FOR UPDATE SKIP LOCKED
)
UPDATE events
SET id_bigint = events.id
FROM batch
WHERE events.id = batch.id;
```
### Step 3a - Migration Pre-Check
```sql
ALTER TABLE events ADD CONSTRAINT check_id_bigint_matches CHECK (id_bigint IS NOT NULL AND id_bigint = id);

-- Alternative:
SELECT COUNT(*) FROM events WHERE id_bigint IS DISTINCT FROM id;
-- should return 0
```
### Step 3b - Actual Migration
```sql
BEGIN;

ALTER TABLE events DROP CONSTRAINT check_id_bigint_matches;

-- Drop primary key constraint
ALTER TABLE events DROP CONSTRAINT events_pkey;

-- Drop id_bigint sync trigger
DROP TRIGGER IF EXISTS trigger_events_set_id_bigint ON events;
DROP FUNCTION IF EXISTS events_set_id_bigint_on_insert();

-- Drop the old id column
ALTER TABLE events DROP COLUMN id;

-- Rename columns
ALTER TABLE events RENAME COLUMN id_bigint TO id;

-- Recreate primary key on new `id`
ALTER TABLE events ADD PRIMARY KEY (id);

-- Set `id` as IDENTITY with correct starting value
DO $$
DECLARE max_id BIGINT;
BEGIN
    SELECT COALESCE(MAX(id), 1) + 1 INTO max_id FROM events;

    -- Set the column to IDENTITY with the correct start value
    EXECUTE format(
        'ALTER TABLE events ALTER COLUMN id ADD GENERATED BY DEFAULT AS IDENTITY (START WITH %s)',
        max_id
    );

    RAISE NOTICE 'Set id as IDENTITY starting from %', max_id;
END $$;

COMMIT;
```
### Helpful Commands
```sql
SELECT COUNT(*) FROM events WHERE id_bigint IS NULL;
-- should return 0

SELECT COUNT(*) FROM events WHERE id_bigint IS DISTINCT FROM id;
-- should return 0
```
## References

WIP - e.g.:
- Migration scripts in the repository.
- Backfill script documentation.
- Trigger functions for PostgreSQL and MySQL.

### Helpful Links
- [Stack Overflow: Migrating int to bigint](https://stackoverflow.com/questions/33504982/postgresql-concurrently-change-column-type-from-int-to-bigint)
- [Rollover](https://www.postgresql.org/message-id/10056.1057506282%40sss.pgh.pa.us)
- [PostgreSQL zero-downtime migration of a primary key from int to bigint (with Ruby on Rails specific notes)](https://engineering.silverfin.com/pg-zero-downtime-bigint-migration/)