|
| 1 | +# 13: Migrating `int` to `bigint` for `id` Primary Keys |
| 2 | + |
| 3 | +Date: 2025-02-04 |
| 4 | + |
| 5 | +## Status |
| 6 | + |
| 7 | +Draft :construction: |
| 8 | + |
| 9 | +## Context |
| 10 | + |
| 11 | +The primary key `id` columns in all database tables use the integer type, which has a maximum value of 2,147,483,647. |
| 12 | +As foundations grow over time, the `id` values in some of these tables (e.g., events) are approaching this limit. |
| 13 | +If the limit is reached, the cloud controller will be unable to insert new records, leading to critical failures in the CF API. |
| 14 | +E.g.: |
| 15 | +``` |
| 16 | +PG::SequenceGeneratorLimitExceeded: ERROR: nextval: reached maximum value of sequence "events_id_seq" |
| 17 | +``` |
| 18 | +The goal is to migrate these primary key `id` columns from `int` to `bigint` without causing downtime and to ensure compatibility across PostgreSQL and MySQL databases. |
| 19 | +This migration must: |
| 20 | +- Avoid downtime since the CF API is actively used in production. |
| 21 | +- Handle tables with millions of records efficiently. |
| 22 | +- Provide a safe rollback mechanism in case of issues during the migration. |
| 23 | +- Be reusable for other tables in the future. |
| 24 | +- Ensure that migration only is executed when the new `id_bigint` column is fully populated. |
| 25 | + |
| 26 | +The largest tables in a long-running foundation are `events`, `delayed_jobs`, `jobs`, and `app_usage_events`. |
| 27 | + |
| 28 | +## Decisions |
| 29 | + |
| 30 | +### Opt-Out Mechanism |
| 31 | +Operators of smaller foundations, which are unlikely to ever encounter the integer overflow issue, may wish to avoid the risks and complexity associated with this migration. |
| 32 | +They can opt out of the migration by setting the `skip_bigint_id_migration` flag in the CAPI-Release manifest. |
| 33 | +When this flag is set, all migration steps will result in a no-op but will still be marked as applied in the `schema_versions` table. |
| 34 | +*Important*: Removing the flag later will *not* re-trigger the migration. Operators must handle the migration manually if they choose to opt out. |
| 35 | + |
| 36 | +### Scope |
| 37 | + |
| 38 | +The `events` table will be migrated first as it has the most significant growth in `id` values. |
| 39 | +Other tables will be migrated at a later stage. |
| 40 | + |
| 41 | +### Newly Created Foundations |
| 42 | +For newly created foundations, the `id` column for the `events` table will be created as `bigint` by default. |
| 43 | +This will be implemented with migration step 1 and will be only applied, if the `events` table is empty. |
| 44 | + |
| 45 | +### Phased Migration |
| 46 | +The migration will be conducted in multiple steps to ensure minimal risk. |
| 47 | +#### Step 1 - Preparation |
| 48 | +- If the opt-out flag is set, this step will be a no-op. |
| 49 | +- In case the target table is empty the type of the `id` column will be set to `bigint` directly. |
| 50 | +- Otherwise, the following steps will be executed: |
| 51 | + - Add a new column `id_bigint` of type `bigint` to the target table. If the `id` column is referenced as a foreign key in other tables, also add an `<ref>_id_bigint` column in those referencing tables. |
| 52 | + - Create triggers to keep `id_bigint` in sync with `id` when new records are inserted. |
| 53 | + - Add constraints and indexes to `id_bigint` as required to match primary key and foreign key requirements. |
| 54 | + |
| 55 | +#### Step 2 - Backfill |
| 56 | +- Backfill will not be scheduled if the opt-out flag is set. |
| 57 | +- If the `id_bigint` column does not exist, backfill will be skipped or result in a no-op. |
| 58 | +- Use a batch-processing script (e.g. a delayed job) to populate `id_bigint` for existing rows in both the primary table and, if applicable, all foreign key references. |
| 59 | +- Table locks will be avoided by using a batch processing approach. |
| 60 | +- In case the table has a configurable cleanup duration, the backfill job will only process records which are beyond the cleanup duration to reduce the number of records to be processed. |
| 61 | +- Backfill will be executed outside the migration due to its potentially long runtime. |
| 62 | +- If necessary the backfill will run for multiple weeks to ensure all records are processed. |
| 63 | + |
| 64 | +#### Step 3 - Migration |
| 65 | +- The migration is divided into two parts: a pre-check and the actual migration but both will be stored in a single migration script. |
| 66 | +- This step will be a no-op if the opt-out flag is set or the `id` column is already of type `bigint`. |
| 67 | +- All sql statements will be executed in a single transaction to ensure consistency. |
| 68 | +##### Step 3a - Migration Pre Check |
| 69 | +- In case the `id_bigint` column does not exist the migration will fail with a clear error message. |
| 70 | +- Add a `CHECK` constraint to verify that `id_bigint` is fully populated (`id_bigint == id & id_bigint != NULL`). |
| 71 | +- In case the backfill is not yet complete or the `id_bigint` column is not fully populated the migration will fail. |
| 72 | +- If pre-check fails, operators might need to take manual actions to ensure all preconditions are met as the migration will be retried during the next deployment. |
| 73 | +##### Step 3b - Actual Migration |
| 74 | +- Remove the `CHECK` constraint once verified. |
| 75 | +- Drop the primary key constraint on id. |
| 76 | +- If foreign keys exist, drop the corresponding foreign key constraints. |
| 77 | +- Remove the sync triggers. |
| 78 | +- Drop the old `id` column. |
| 79 | +- Rename the `id_bigint` column to `id`. |
| 80 | +- Add PK constraint on `id` column and configure `id` generator. |
| 81 | +- If foreign keys exist, rename `id_bigint` to `id` in referencing tables accordingly. |
| 82 | + |
| 83 | +### Database Specifics |
| 84 | + |
| 85 | +#### PostgreSQL |
| 86 | +The default value of the `id` column could be either a sequence (for PostgreSQL versions < 10) or an identity column (for newer PostgreSQL versions). |
| 87 | +This depends on the version of PostgreSQL which was used when the table was initially created. |
| 88 | +The migration script needs to handle both cases. |
| 89 | + |
| 90 | +#### MySQL |
| 91 | +MySQL primary key changes typically cause table rebuilds due to clustered indexing, which can be expensive and disruptive, especially with clustered replication setups like Galera. |
| 92 | +A common approach to mitigate this involves creating a new shadow table, performing a backfill, and then swapping tables atomically. |
| 93 | +Further details will be refined during implementation. |
| 94 | + |
| 95 | +### Rollback Mechanism |
| 96 | +The old `id` column is no longer retained, as the `CHECK` constraint ensures correctness during migration. |
| 97 | +Step 3b (switch over) will be executed in a single transaction and will be rolled back if any issues occur. |
| 98 | +If unexpected issues occur, the migration will fail explicitly, requiring intervention. |
| 99 | +If rollback is needed, either backups could be restored or the migration needs to be reverted manually. |
| 100 | + |
| 101 | +### Standardized Approach |
| 102 | +Write reusable scripts for adding `id_bigint`, setting up triggers, backfilling data, and verifying migration readiness. |
| 103 | +These scripts can be reused for other tables in the future. |
| 104 | + |
| 105 | +### Release Strategy |
| 106 | + |
| 107 | +Steps 1-2 will be released as a cf-deployment major release to ensure that the database is prepared for the migration. |
| 108 | +Steps 3-4 will be released as a subsequent cf-deployment major release to complete the migration. |
| 109 | +Between these releases there should be a reasonable time to allow the backfill to complete. |
| 110 | + |
| 111 | +For the `events` table there is a default cleanup interval of 31 days. Therefore, for the `events` table the gap between the releases should be around 60 days. |
| 112 | + |
| 113 | +## Consequences |
| 114 | +### Positive Consequences |
| 115 | + |
| 116 | +- Future-proofing the schema for tables with high record counts. |
| 117 | +- Minimal locking during step 3b (actual migration) could result in slower queries or minimal downtime. |
| 118 | +- A standardized process for similar migrations across the database. |
| 119 | + |
| 120 | +### Negative Consequences |
| 121 | + |
| 122 | +- Increased complexity in the migration process. |
| 123 | +- Potentially long runtimes for backfilling data in case tables have millions of records. |
| 124 | +- Requires careful coordination across multiple CAPI/CF-Deployment versions. |
| 125 | +- If backfilling encounters edge cases (e.g., missing cleanup jobs), the migration may be delayed until operators intervene. |
| 126 | + |
| 127 | +## Alternatives Considered |
| 128 | + |
| 129 | +### Switching to `guid` Field as Primary Key |
| 130 | + |
| 131 | +Pros: Provides globally unique identifiers and eliminates the risk of overflow. |
| 132 | + |
| 133 | +Cons: Might decrease query and index performance, requires significant changes for foreign key constraints, and introduces non-sequential keys. |
| 134 | + |
| 135 | +Reason Rejected: The overhead and complexity outweighed the benefits for our use case. |
| 136 | + |
| 137 | +### Implementing Rollover for `id` Reuse |
| 138 | + |
| 139 | +Pros: Delays the overflow issue by reusing IDs from deleted rows. Minimal schema changes. |
| 140 | + |
| 141 | +Cons: Potential issues with foreign key constraints and increased complexity in the rollover process. Could be problematic for tables which do not have frequent deletions. |
| 142 | + |
| 143 | +Reason Rejected: Might work well for tables like events, but not a universal solution for all tables where there is no guarantee of frequent deletions. |
| 144 | + |
| 145 | + |
| 146 | +### Direct Migration of `id` to `bigint` via `ALTER TABLE` Statement |
| 147 | + |
| 148 | +Pros: One-step migration process. |
| 149 | + |
| 150 | +Cons: Requires downtime, locks the table for the duration of the migration, and can be slow for tables with millions of records. |
| 151 | + |
| 152 | +Reason Rejected: Downtimes are unacceptable for productive foundations. |
| 153 | + |
| 154 | + |
| 155 | +## Example Migration Scripts With PostgreSQL Syntax For `events` Table |
| 156 | + |
| 157 | +### Step 1 - Preparation |
| 158 | +```sql |
| 159 | +BEGIN; |
| 160 | + |
| 161 | +-- Add new BIGINT column |
| 162 | +ALTER TABLE events ADD COLUMN id_bigint BIGINT; |
| 163 | + |
| 164 | +-- Ensure new inserts populate `id_bigint` |
| 165 | +CREATE OR REPLACE FUNCTION events_set_id_bigint_on_insert() |
| 166 | +RETURNS TRIGGER AS $$ |
| 167 | +BEGIN |
| 168 | + NEW.id_bigint := NEW.id; |
| 169 | +RETURN NEW; |
| 170 | +END; |
| 171 | +$$ LANGUAGE plpgsql; |
| 172 | + |
| 173 | +DROP TRIGGER IF EXISTS trigger_events_set_id_bigint ON events; |
| 174 | + |
| 175 | +CREATE TRIGGER trigger_events_set_id_bigint |
| 176 | + BEFORE INSERT ON events |
| 177 | + FOR EACH ROW |
| 178 | + EXECUTE FUNCTION events_set_id_bigint_on_insert(); |
| 179 | + |
| 180 | +COMMIT; |
| 181 | + |
| 182 | +``` |
| 183 | + |
| 184 | +### Step 2 - Backfill |
| 185 | +```sql |
| 186 | +WITH batch AS ( |
| 187 | + SELECT id FROM events |
| 188 | + WHERE id_bigint IS NULL |
| 189 | + ORDER BY id |
| 190 | + LIMIT 100000 |
| 191 | + FOR UPDATE SKIP LOCKED |
| 192 | +) |
| 193 | +UPDATE events |
| 194 | +SET id_bigint = events.id |
| 195 | +FROM batch |
| 196 | +WHERE events.id = batch.id; |
| 197 | +``` |
| 198 | + |
| 199 | + |
| 200 | +### Step 3a - Migration Pre Check |
| 201 | +```sql |
| 202 | +ALTER TABLE events ADD CONSTRAINT check_id_bigint_matches CHECK (id_bigint IS NOT NULL AND id_bigint = id); |
| 203 | + |
| 204 | +-- Alternative: |
| 205 | +SELECT COUNT(*) FROM events WHERE id_bigint IS DISTINCT FROM id; |
| 206 | +-- should return 0 |
| 207 | +``` |
| 208 | + |
| 209 | +### Step 3b - Actual Migration |
| 210 | +```sql |
| 211 | +BEGIN; |
| 212 | + |
| 213 | +ALTER TABLE events DROP CONSTRAINT check_id_bigint_matches; |
| 214 | + |
| 215 | +-- Drop primary key constraint |
| 216 | +ALTER TABLE events DROP CONSTRAINT events_pkey; |
| 217 | + |
| 218 | +-- Drop id_bigint sync trigger |
| 219 | +DROP TRIGGER IF EXISTS trigger_events_set_id_bigint ON events; |
| 220 | +DROP FUNCTION IF EXISTS events_set_id_bigint_on_insert(); |
| 221 | + |
| 222 | +-- Drop the old id column |
| 223 | +ALTER TABLE events DROP COLUMN id; |
| 224 | + |
| 225 | +-- Rename columns |
| 226 | +ALTER TABLE events RENAME COLUMN id_bigint TO id; |
| 227 | + |
| 228 | +-- Recreate primary key on new `id` |
| 229 | +ALTER TABLE events ADD PRIMARY KEY (id); |
| 230 | + |
| 231 | +-- Set `id` as IDENTITY with correct starting value |
| 232 | +DO $$ |
| 233 | +DECLARE max_id BIGINT; |
| 234 | +BEGIN |
| 235 | +SELECT COALESCE(MAX(id), 1) + 1 INTO max_id FROM events; |
| 236 | + |
| 237 | +-- Set the column to IDENTITY with the correct start value |
| 238 | +EXECUTE format( |
| 239 | + 'ALTER TABLE events ALTER COLUMN id ADD GENERATED BY DEFAULT AS IDENTITY (START WITH %s)', |
| 240 | + max_id |
| 241 | + ); |
| 242 | + |
| 243 | +RAISE NOTICE 'Set id as IDENTITY starting from %', max_id; |
| 244 | +END $$; |
| 245 | + |
| 246 | +COMMIT; |
| 247 | +``` |
| 248 | + |
| 249 | +### Helpful Commands |
| 250 | +```sql |
| 251 | +SELECT COUNT(*) FROM events WHERE id_bigint IS NULL; |
| 252 | +-- should return 0 |
| 253 | + |
| 254 | +SELECT COUNT(*) FROM events WHERE id_bigint IS DISTINCT FROM id; |
| 255 | +-- should return 0 |
| 256 | +``` |
| 257 | + |
| 258 | +## References |
| 259 | +WIP - e.g.: |
| 260 | +Migration scripts in the repository. |
| 261 | +Backfill script documentation. |
| 262 | +Trigger functions for PostgreSQL and MySQL. |
| 263 | + |
| 264 | +### Helpful Links |
| 265 | +- [Stack Overflow: Migrating int to bigint](https://stackoverflow.com/questions/33504982/postgresql-concurrently-change-column-type-from-int-to-bigint) |
| 266 | +- [Rollover](https://www.postgresql.org/message-id/10056.1057506282%40sss.pgh.pa.us) |
| 267 | +- [PostgreSQL zero-downtime migration of a primary key from int to bigint (with Ruby on Rails specific notes)](https://engineering.silverfin.com/pg-zero-downtime-bigint-migration/) |
0 commit comments