Commit 18f390f (parent 26bd8e8)

docs(postgres): add 23_populate_storage_locations.md describing data migration needed after schema migrated to version 23, update README.md to mention migratedb.docs directory

2 files changed: +132 −1 lines

postgresql/README.md

Lines changed: 11 additions & 1 deletion
```diff
@@ -2,7 +2,7 @@
 
 We use
 [Postgres 15](https://github.com/docker-library/postgres/tree/master/15/alpine)
-and Alpine 3.17.
+and Alpine 3.23.
 
 Security is hardened:
 
```

```diff
@@ -26,3 +26,13 @@ The following environment variables can be used to configure the database:
 | POSTGRES_VERIFY_PEER | Enforce client verification | verify-ca |
 
 Client verification is enforced if `POSTGRES_VERIFY_PEER` is set to `verify-ca` or `verify-full`.
+
+# Data migration instructions
+
+The [migratedb.docs](data_migration.docs) directory contains instructions on how to execute the data
+migration when upgrading a system with existing data tied to specific versions of the schema.
+
+The file naming convention is `${SCHEMA_VERSION}_${PRE/POST}_${SHORT_DESCRIPTION}.md`, where:
+* `${SCHEMA_VERSION}` is the schema version the data migration instructions relate to.
+* `${PRE/POST}` states whether the instructions should be executed before or after the schema migration has taken place.
+* `${SHORT_DESCRIPTION}` is a short description of the data migration.
```
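As an illustration, a name following this convention can be split into its components with plain shell parameter expansion — a sketch using a hypothetical file name, not necessarily one that exists in the directory:

```shell
# Hypothetical file name following the convention, without the .md suffix
name="23_POST_populate_storage_locations"

schema_version=${name%%_*}     # text before the first "_"
rest=${name#*_}                # strip the version component
phase=${rest%%_*}              # PRE or POST
short_description=${rest#*_}   # everything after the phase

echo "${schema_version} ${phase} ${short_description}"
```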
23_populate_storage_locations.md

Lines changed: 121 additions & 0 deletions
# Data Migration Plan POST schema migration version 23

## 1. Prep

Note: prep is only needed if you have multiple S3 buckets / POSIX volumes for a storage type.

Repeat the steps below for each S3 bucket / POSIX volume.

### 1.1. Get a file of file ids for a storage

#### If S3 storage

List all files in each S3 bucket:

```bash
aws s3api list-objects-v2 --endpoint ${ENDPOINT} --bucket ${BUCKET} > ${BUCKET}_raw
```

Transform the raw response into a plain list of ids:

```bash
jq -r '.Contents[].Key' ${BUCKET}_raw > ${BUCKET}_ids
```
#### If POSIX storage

Run from the root of the POSIX volume:

```bash
find . -type f -exec basename {} \; > ${POSIX_VOLUME}_ids
```
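If GNU find is available, the same listing can be produced without spawning one `basename` process per file by using `-printf '%f\n'` — a sketch, assuming it is run from the volume root (`archive_vol` is a placeholder name):

```shell
# GNU find only: %f prints the file name without leading directories
POSIX_VOLUME=archive_vol   # placeholder volume name
find . -type f -printf '%f\n' > "${POSIX_VOLUME}_ids"
```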
### 1.2. Create new temporary tables to support the DB migration

One table per bucket/volume:

```sql
CREATE TABLE sda.temp_file_in_${BUCKET || POSIX_VOLUME} (
    file_id UUID PRIMARY KEY
);
```

### 1.3. Populate the tables

```bash
psql -U $user -d sda -At -h $host -p $port -c "\copy sda.temp_file_in_${BUCKET || POSIX_VOLUME} from '/path/to/${BUCKET || POSIX_VOLUME}_ids' with delimiter as ','"
```
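Since `file_id` is a primary key, a blank line or duplicate id in an ids file will make the `\copy` fail; a quick pre-flight check may save a round trip. A minimal sketch — `check_ids_file` is a helper invented here, not part of any tooling:

```shell
# Fail if an ids file contains blank lines or duplicate ids,
# either of which would violate the file_id primary key on \copy
check_ids_file() {
    if grep -q '^$' "$1"; then
        echo "blank line(s) in $1" >&2
        return 1
    fi
    if [ -n "$(sort "$1" | uniq -d)" ]; then
        echo "duplicate id(s) in $1" >&2
        return 1
    fi
}

# usage: check_ids_file "${BUCKET}_ids" && echo "ok to load"
```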
## 2. Ensure the schema migration has taken place

Ensure [23_expand_files_table_with_storage_locations.sql](../migratedb.d/23_expand_files_table_with_storage_locations.sql)
has been executed.

This can be checked with:

```sql
SELECT * FROM sda.dbschema_version WHERE version = 23;
```
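The gate can also be scripted; a sketch where the commented-out psql call (it needs a live database) assumes the same connection parameters as step 1.3 and that `sda.dbschema_version` holds one row per applied migration:

```shell
# Succeeds only when the reported schema version is at least 23
schema_is_migrated() {
    [ "${1:-0}" -ge 23 ]
}

# version=$(psql -U $user -d sda -At -h $host -p $port \
#     -c "SELECT max(version) FROM sda.dbschema_version;")
# schema_is_migrated "$version" || { echo "run the schema migration first" >&2; exit 1; }
```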
## 3. Run data migration queries

### 3.1. Inbox location

If using a POSIX inbox, replace `${INBOX_ENDPOINT}/${INBOX_BUCKET}` with `${INBOX_POSIX_VOLUME}` below.

If you have a single inbox storage:

```sql
UPDATE sda.files
SET submission_location = '${INBOX_ENDPOINT}/${INBOX_BUCKET}';
```

If you have multiple inbox storages, repeat the following UPDATE statement for each bucket/volume:

```sql
UPDATE sda.files AS f
SET submission_location = '${INBOX_ENDPOINT}/${INBOX_BUCKET}'
FROM sda.temp_file_in_${INBOX_BUCKET} AS in_buk
WHERE f.id = in_buk.file_id;
```
### 3.2. Archive location

If using a POSIX archive, replace `${ARCHIVE_ENDPOINT}/${ARCHIVE_BUCKET}` with `/${ARCHIVE_POSIX_VOLUME}` below.

If you have a single archive storage:

```sql
UPDATE sda.files
SET archive_location = '${ARCHIVE_ENDPOINT}/${ARCHIVE_BUCKET}'
WHERE archive_file_path != '';
```

If you have multiple archive storages, repeat the following UPDATE statement for each bucket/volume:

```sql
UPDATE sda.files AS f
SET archive_location = '${ARCHIVE_ENDPOINT}/${ARCHIVE_BUCKET}'
FROM sda.temp_file_in_${ARCHIVE_BUCKET} AS in_buk
WHERE f.id = in_buk.file_id;
```
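Because these UPDATE statements rewrite rows in place, it may be worth running each one inside a transaction and checking the affected row count before making it permanent. A sketch for the archive case (not part of the migration itself); compare psql's reported `UPDATE n` against the line count of the corresponding `_ids` file:

```sql
BEGIN;

UPDATE sda.files AS f
SET archive_location = '${ARCHIVE_ENDPOINT}/${ARCHIVE_BUCKET}'
FROM sda.temp_file_in_${ARCHIVE_BUCKET} AS in_buk
WHERE f.id = in_buk.file_id;

-- psql prints "UPDATE n"; if n matches the expected file count:
COMMIT;
-- otherwise run ROLLBACK; instead and investigate
```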
### 3.3. Backup location

Skip this step if you do not have a backup storage.

If using a POSIX backup, replace `${BACKUP_ENDPOINT}/${BACKUP_BUCKET}` with `/${BACKUP_POSIX_VOLUME}` below.

If you have a single backup storage:

```sql
UPDATE sda.files
SET backup_location = '${BACKUP_ENDPOINT}/${BACKUP_BUCKET}'
WHERE stable_id IS NOT NULL;
```

If you have multiple backup storages, repeat the following UPDATE statement for each bucket/volume:

```sql
UPDATE sda.files AS f
SET backup_location = '${BACKUP_ENDPOINT}/${BACKUP_BUCKET}'
FROM sda.temp_file_in_${BACKUP_BUCKET} AS in_buk
WHERE f.id = in_buk.file_id;
```
## 4. Clean up

Only needed if you did the [1. Prep step](#1-prep) and created temporary tables.

Repeat the DROP TABLE statement for each temporary table created:

```sql
DROP TABLE sda.temp_file_in_${BUCKET || POSIX_VOLUME};
```
## 5. Ensure all files have been updated

```sql
SELECT count(id) FROM sda.files
WHERE submission_location IS NULL
   OR (archive_location IS NULL AND archive_file_path != '');
```

If the count is greater than zero, the required locations of those files are not known.
To resolve this, either manually delete those sda.files entries or ensure the files are uploaded to the expected locations.
