-
Hi, we were upgrading our Hasura instance from version 1.3.3 to version 2.0.1. We host behind an AWS Application Load Balancer on AWS Fargate. When we went to ship this to production, we were never able to get version 2.0.1 started to the point where it would accept requests. We only saw these first two logs on startup (truncated a bit):
We have the following log types enabled:

To have zero downtime when deploying to production, we kept the 1.3.3 instances running while the 2.0.1 instances booted. The 2.0.1 instances failed to boot for over an hour (I even changed the health check to give them 10 minutes to start), so we thought it might be a conflict between 1.3.3 running at the same time as 2.0.1 and killed the 1.3.3 instances. This did not change the behavior of 2.0.1: it still started with just the two logs above, and then nothing. We ended up reverting to 1.3.3, which still did not work until we restarted the database.

My only guess is that the difference in database size between our develop and production environments causes long-running operations during the migration to 2.0.1. As an example, our production event trigger logs are a few million rows long, but develop only has a handful. We can reproduce this behavior on our production database - we have tried this upgrade two times with the same result.

So far we haven't found any issues that seem similar to this - but please let us know if there are any we should look at. Thanks in advance!
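For anyone debugging a similar upgrade hang, one way to check the table-size suspicion up front is to measure how large the event tables actually are. A minimal sketch (table names are from the Hasura v1 catalog; pg_total_relation_size also counts indexes and TOAST data):

-- Compare the on-disk size of the event trigger tables between environments
SELECT pg_size_pretty(pg_total_relation_size('hdb_catalog.event_log'))             AS event_log_size,
       pg_size_pretty(pg_total_relation_size('hdb_catalog.event_invocation_logs')) AS invocation_logs_size;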
Replies: 10 comments
-
We’d love to take a look here and get this sorted out asap! :) Could you reach out to Brandon (Brandon.b at hasura.io) and me (tanmaig at hasura.io) and we’ll set up a call so that we can work through this quickly?
-
Thanks! I emailed with the issue number in the subject.
-
I am having the same issue in my dev environment. The only way to fix it for me is to restart the database; then Hasura boots. The PostgreSQL logs are as follows:
I think the critical logs probably start at "2021-08-06 05:21:24.131". It looks like Hasura got hung up on some queries.
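If anyone wants to see the same thing without digging through the PostgreSQL log files, a query along these lines (a rough sketch; the wait_event columns exist from PostgreSQL 9.6 onwards) lists the sessions that have been running or waiting the longest:

-- Longest-running non-idle sessions, oldest first
SELECT pid,
       state,
       wait_event_type,
       wait_event,
       now() - query_start AS runtime,
       left(query, 80)     AS query_text
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY query_start
LIMIT 20;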
-
Additional information: Hasura also hangs on schema reloads. It works after a fresh restart, but after a while it stops working. The only fix is to restart the database; restarting Hasura just makes it hang at startup as described above. After restarting the database I see the statements that were killed, and these statements are always in the logs (as in the logs above):
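Before falling back to a database restart, it may be worth identifying which session is actually holding things up. A sketch along these lines (assuming PostgreSQL 9.6+ for pg_blocking_pids) pairs each blocked statement with its blocker:

-- Show each blocked statement together with the session blocking it
SELECT blocked.pid             AS blocked_pid,
       left(blocked.query, 60) AS blocked_query,
       blocker.pid             AS blocking_pid,
       left(blocker.query, 60) AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocker
  ON blocker.pid = ANY (pg_blocking_pids(blocked.pid));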
-
@reinoldus thanks for your input - I do not think that I had your issue. I did some digging to see if the same function was being dropped in my logs, but I could not see it. We also do not need to consistently restart our database; it was only when the v2 migration failed, which I now know was due to transactions locking access to the hdb_catalog.event_log table.

I looked into the statistics of this table in our database and, since we used eventing heavily, we had actually amassed an enormous event_log table of about 25 GB. The same was true of the linked table hdb_catalog.event_invocation_logs. We had not been cleaning these tables regularly, which I would recommend doing.

The Hasura team suspected that the size of these tables was causing locks with a long-running transactional migration, which definitely seemed to be the case when I actually checked the locks after deploying v2 again to test this. To resolve this the team suggested the steps here to shrink the size of the tables, but I found that even … I checked the table size with … By running …

@coco98 I saved a snapshot of the database before trying any of this, so I can restore it to test any improvements on the migration safely.
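For anyone else who needs to shrink these tables ahead of the upgrade, the cleanup usually takes roughly this shape (a sketch against the v1.3 catalog; the delivered and error columns should be verified against your own hdb_catalog, and it is best run inside a transaction):

-- Remove invocation logs for events that have already been delivered or errored
DELETE FROM hdb_catalog.event_invocation_logs
WHERE event_id IN (
  SELECT id
  FROM hdb_catalog.event_log
  WHERE delivered = true OR error = true
);

-- Then remove the delivered/errored events themselves
DELETE FROM hdb_catalog.event_log
WHERE delivered = true OR error = true;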
-
Thank you for sharing that information! My invocation logs are also massive, so I assume it is the same issue. A quick question for the Hasura team: I guess for cron triggers we'd have the same problem, but it's not documented as it is for the normal event triggers. Are these queries safe to execute for the event logs and cron events:
Or is it unnecessary to purge those? They do not seem to get as big, but I also have fewer cron triggers. For reference, DateDiff is this function (copied from Stack Overflow some time ago):
-
Hi @reinoldus,

The SQL you've posted will not work as written, but it can be fixed with some minor tweaks. I'd recommend executing the following queries instead:

DELETE FROM hdb_catalog.hdb_cron_event_invocation_logs
WHERE created_at < (NOW() - interval '24 hour');

DELETE FROM hdb_catalog.hdb_cron_events
WHERE scheduled_time < (NOW() - interval '24 hour');

Also, please run this in a transaction to avoid any mishaps (see the sketch below). If you're only purging for the v1 -> v2 upgrade, then you don't need to purge the event trigger events and their invocations, as they are not transferred to the metadata DB.

Best,
Karthikeyan
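A transaction-wrapped version might look like this (just the two statements above inside BEGIN/COMMIT):

BEGIN;

-- Purge cron invocation logs older than a day
DELETE FROM hdb_catalog.hdb_cron_event_invocation_logs
WHERE created_at < (NOW() - interval '24 hour');

-- Purge the corresponding cron events older than a day
DELETE FROM hdb_catalog.hdb_cron_events
WHERE scheduled_time < (NOW() - interval '24 hour');

COMMIT;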
-
Hey @codingkarthik 👋
-
Oops @ryands17, my bad 😅
-
Hey, I'm using TimescaleDB and I've found the following query to be stuck:

DROP FUNCTION IF EXISTS hdb_catalog."notify_hasura_<YOUR_EVENT_TRIGGER_HERE>_INSERT"() CASCADE;

Deleting all the locks and closing all the user connections fixed the problem:

SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid <> pg_backend_pid();

SELECT pg_terminate_backend(pid), *
FROM pg_stat_activity
WHERE pid <> pg_backend_pid();
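If killing every connection is too heavy-handed, a narrower variant (an untested sketch; it assumes the stuck statement is the DROP FUNCTION above and relies on pg_blocking_pids, available since PostgreSQL 9.6) terminates only the sessions blocking it:

-- Find the sessions blocking the stuck DROP FUNCTION and terminate just those
SELECT pg_terminate_backend(blockers.blocker_pid)
FROM pg_stat_activity AS stuck,
     LATERAL unnest(pg_blocking_pids(stuck.pid)) AS blockers(blocker_pid)
WHERE stuck.query LIKE 'DROP FUNCTION IF EXISTS hdb_catalog.%';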