-
Hi, we were upgrading our Hasura instance from version 1.3.3 to version 2.0.1. We host behind an AWS Application Load Balancer on AWS Fargate. When we went to ship this to production, we were never able to get version 2.0.1 started to the point where it would accept requests. We only saw these first two logs on startup (truncated a bit):
We have the following log types enabled:

To have zero downtime when deploying to production, we kept the 1.3.3 instances running while the 2.0.1 instances booted. The 2.0.1 instances failed to boot for over an hour (I even changed the health check to give them 10 minutes to start), so we thought it might be a conflict between 1.3.3 running at the same time as 2.0.1 and killed the 1.3.3 instances. This did not change the behavior of 2.0.1: it still started with just the two logs above, and then nothing. We ended up reverting to 1.3.3, which still did not work until we restarted the database.

My only guess is that the difference in database size between our develop and production environments causes long-running operations during the migration to 2.0.1. As an example, our production event trigger logs are a few million rows long, but develop only has a handful. We can reproduce this behavior on our production database - we have tried this upgrade two times with the same result.

So far we haven't found any issues that seem similar to this - but please let us know if there are any we should look at. Thanks in advance!
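For anyone debugging a similar upgrade hang, one way to check the table-size suspicion up front is to measure how large the event tables actually are. A minimal sketch (table names are from the Hasura v1 catalog; pg_total_relation_size also counts indexes and TOAST data):

-- Compare the on-disk size of the event trigger tables between environments
SELECT pg_size_pretty(pg_total_relation_size('hdb_catalog.event_log'))             AS event_log_size,
       pg_size_pretty(pg_total_relation_size('hdb_catalog.event_invocation_logs')) AS invocation_logs_size;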
Replies: 10 comments
-
We’d love to take a look here and get this sorted out asap! :) Could you reach out to Brandon (Brandon.b at hasura.io) and me (tanmaig at hasura.io) and we’ll set up a call so that we can work through this quickly?
-
Thanks! I emailed with the issue number in the subject.
-
I am having the same issue in my dev environment. The only way to fix it for me is to restart the database; then Hasura boots. The PostgreSQL logs are as follows:
I think the critical logs probably start at "2021-08-06 05:21:24.131". It looks like Hasura got hung up on some queries.
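If anyone wants to see the same thing without digging through the PostgreSQL log files, a query along these lines (a rough sketch; the wait_event columns exist from PostgreSQL 9.6 onwards) lists the sessions that have been running or waiting the longest:

-- Longest-running non-idle sessions, oldest first
SELECT pid,
       state,
       wait_event_type,
       wait_event,
       now() - query_start AS runtime,
       left(query, 80)     AS query_text
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY query_start
LIMIT 20;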
-
Additional information: Hasura also hangs on schema reloads. It works after a fresh restart, but after a while it stops working. The only fix is to restart the database; restarting Hasura just makes it hang at startup as described above. After restarting the database I see the statements that were killed, and these statements are always in the logs (as in the logs above):
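Before falling back to a database restart, it may be worth identifying which session is actually holding things up. A sketch along these lines (assuming PostgreSQL 9.6+ for pg_blocking_pids) pairs each blocked statement with its blocker:

-- Show each blocked statement together with the session blocking it
SELECT blocked.pid             AS blocked_pid,
       left(blocked.query, 60) AS blocked_query,
       blocker.pid             AS blocking_pid,
       left(blocker.query, 60) AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocker
  ON blocker.pid = ANY (pg_blocking_pids(blocked.pid));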
-
@reinoldus thanks for your input - I do not think that I had your issue. I did some digging to see if the same function was being dropped in my logs, but I could not see it. We also do not need to consistently restart our database; it was only when the v2 migration failed, which I now know was due to transactions locking access to the hdb_catalog.event_log table.

I looked into the statistics of this table in our database and, since we used eventing heavily, we had actually amassed an enormous event_log table of about 25 GB. The same was true of the linked table hdb_catalog.event_invocation_logs. We had not been cleaning these tables regularly, which I would recommend doing.

The Hasura team suspected that the size of these tables was causing locks with a long-running transactional migration, which definitely seemed to be the case when I actually checked the locks after deploying v2 again to test this. To resolve this the team suggested the steps here to shrink the size of the tables, but I found that even … I checked the table size with … By running …

@coco98 I saved a snapshot of the database before trying any of this, so I can restore it to test any improvements on the migration safely.
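For anyone else who needs to shrink these tables ahead of the upgrade, the cleanup usually takes roughly this shape (a sketch against the v1.3 catalog; the delivered and error columns should be verified against your own hdb_catalog, and it is best run inside a transaction):

-- Remove invocation logs for events that have already been delivered or errored
DELETE FROM hdb_catalog.event_invocation_logs
WHERE event_id IN (
  SELECT id
  FROM hdb_catalog.event_log
  WHERE delivered = true OR error = true
);

-- Then remove the delivered/errored events themselves
DELETE FROM hdb_catalog.event_log
WHERE delivered = true OR error = true;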
-
Thank you for sharing that information! My invocation logs are also massive, so I assume it is the same issue. A quick question for the Hasura team: I guess for cron triggers we'd have the same problem, but it's not documented as it is for the normal event triggers. Are these queries safe to execute for the event logs and cron events:
Or is it unnecessary to purge those? They do not seem to get as big, but I also have fewer cron triggers. For reference, DateDiff is this function (copied from Stack Overflow some time ago):
-
Hi @reinoldus,

The SQL you've posted will not work as written, but it can be fixed with some minor tweaks. I'd recommend executing the following queries instead:

DELETE FROM hdb_catalog.hdb_cron_event_invocation_logs
WHERE created_at < (NOW() - interval '24 hour');

DELETE FROM hdb_catalog.hdb_cron_events
WHERE scheduled_time < (NOW() - interval '24 hour');

Also, please run this in a transaction to avoid any mishaps (see the sketch below). If you're only purging for the v1 -> v2 upgrade, then you don't need to purge the event trigger events and their invocations, as they are not transferred to the metadata DB.

Best,
Karthikeyan
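A transaction-wrapped version might look like this (just the two statements above inside BEGIN/COMMIT):

BEGIN;

-- Purge cron invocation logs older than a day
DELETE FROM hdb_catalog.hdb_cron_event_invocation_logs
WHERE created_at < (NOW() - interval '24 hour');

-- Purge the corresponding cron events older than a day
DELETE FROM hdb_catalog.hdb_cron_events
WHERE scheduled_time < (NOW() - interval '24 hour');

COMMIT;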
-
Hey @codingkarthik 👋
-
Oops @ryands17, my bad 😅
-
Hey, I'm using TimescaleDB and I've found the following query to be stuck:

DROP FUNCTION IF EXISTS hdb_catalog."notify_hasura_<YOUR_EVENT_TRIGGER_HERE>_INSERT"() CASCADE;

Deleting all the locks and closing all the user connections fixed the problem:

SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid <> pg_backend_pid();

SELECT pg_terminate_backend(pid), *
FROM pg_stat_activity
WHERE pid <> pg_backend_pid();
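If killing every connection is too heavy-handed, a narrower variant (an untested sketch; it assumes the stuck statement is the DROP FUNCTION above and relies on pg_blocking_pids, available since PostgreSQL 9.6) terminates only the sessions blocking it:

-- Find the sessions blocking the stuck DROP FUNCTION and terminate just those
SELECT pg_terminate_backend(blockers.blocker_pid)
FROM pg_stat_activity AS stuck,
     LATERAL unnest(pg_blocking_pids(stuck.pid)) AS blockers(blocker_pid)
WHERE stuck.query LIKE 'DROP FUNCTION IF EXISTS hdb_catalog.%';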