We recently had a push of almost 2GB (surprisingly it only took about 3 min for the client to upload it).
Here's an overview of when things broke: it failed on the last chunk, because that's when the server actually starts doing the heavy lifting (trying to apply the commit).
And here's what I think/know happened:
- Chorus sends the last chunk.
- The resumable server detects that it's the last chunk.
- It calls unbundle, which calls `hg incoming`, which takes ~3.5 minutes (it gets logged to `/var/cache/hgresume/<transaction-ID>.bundle.incoming.async_run`).
- Because the request takes so long, Chorus cancels the push request after 30s (so Lexbox returns a 499).
- It's also fairly likely that PHP kills the script, because it takes longer than 30s and/or because it was cancelled.
- It's also fairly likely that Cloudflare/the web server tears down the script process 🤷
- So, presumably, the PHP script gets torn down while it's waiting for the `hg incoming` command to finish (which does finish, because it's in its own process; see the sketch after this list).
- Because the PHP script gets torn down, no work actually happens: it never gets to running the `hg unbundle` command and creating a lockfile for that command. (A lockfile is created for `hg incoming`, but that's a separate file.)
- Because the lockfile doesn't get created, when Chorus retries the push-bundle, the server throws an exception.
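For context, here's a minimal sketch of the async-run pattern I believe is in play. The function name, paths, and hg invocation are my assumptions, not the real hgresume code; the point is just that the hg command survives the request while the follow-up work doesn't:

```php
<?php
// Minimal sketch of the assumed async-run pattern (function name, paths,
// and the hg invocation are illustrative assumptions, not hgresume's API).
// The command is detached with nohup + '&', so it keeps running even if
// PHP/Cloudflare tears the request down mid-wait.
function asyncRun(string $command, string $logFile): void
{
    exec(sprintf('nohup %s > %s 2>&1 &', $command, escapeshellarg($logFile)));
}

asyncRun(
    'hg incoming /path/to/pushed.bundle',
    '/var/cache/hgresume/<transaction-ID>.bundle.incoming.async_run'
);

// The PHP request then waits for `hg incoming` to finish before it can run
// `hg unbundle` (which would create that command's lockfile). If the script
// dies during this wait, `hg incoming` still completes in its own process,
// but `hg unbundle` never starts and its lockfile is never created.
```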
Do we want to allow big pushes like this? I think so! So how:
- It's fine if Chorus times out, as long as the job actually happens and we return a more meaningful response. The code tries to return a 200, but fails because the lockfile it's expecting doesn't exist. The exception is good, because a missing lockfile means nothing is happening.
So we need to do one of the following:
1. Move more stuff into an external command that doesn't get torn down 🙁
2. Prevent the PHP script from getting torn down (e.g. move large pushes to a Lexbox job and make sure we turn off everything that might kill a long PHP script)
3. Make the retries smarter and have them pick up where the last one died
I think 3 sounds like the best bet. Something like:
- Replace the exception-throwing `isComplete` check with something that anticipates this scenario:
  - If there's no lockfile, retry the unbundle.
- In the unbundle, detect if `hg incoming` already ran and, if so:
  - Do the necessary validation.
  - Then start `hg unbundle` (see the sketch below).
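To make that concrete, here's a rough sketch of what the retry-aware flow could look like. Everything here (function names, lockfile paths, the return strings, the cache layout) is an assumption for illustration, not the real hgresume code:

```php
<?php
// Rough sketch of a retry-aware finish step: a missing lockfile is treated
// as "restart the work" instead of "throw". All names are assumptions.

function runDetached(string $command, string $logFile): void
{
    // Background the command so it survives the PHP request being killed.
    exec(sprintf('nohup %s > %s 2>&1 &', $command, escapeshellarg($logFile)));
}

function finishPush(string $txId, string $bundle, string $repo, string $cacheDir): string
{
    $incomingLog  = "$cacheDir/$txId.bundle.incoming.async_run";
    $unbundleLock = "$cacheDir/$txId.unbundle.lock"; // assumed lockfile name

    if (file_exists($unbundleLock)) {
        // Normal path: unbundle already started; report progress as before.
        return 'unbundle in progress';
    }

    // No lockfile: instead of throwing, anticipate the torn-down script.
    if (file_exists($incomingLog)) {
        // `hg incoming` already ran, so do the necessary validation on its
        // output (elided here), then go straight to `hg unbundle` without
        // re-running the slow incoming step.
        touch($unbundleLock);
        runDetached(
            'hg -R ' . escapeshellarg($repo) . ' unbundle ' . escapeshellarg($bundle),
            "$cacheDir/$txId.unbundle.log"
        );
        return 'unbundle started';
    }

    // Neither ran: restart `hg incoming` and let Chorus retry later.
    runDetached(
        'hg -R ' . escapeshellarg($repo) . ' incoming ' . escapeshellarg($bundle),
        $incomingLog
    );
    return 'incoming started';
}
```

The key design choice is that every retry from Chorus becomes a cheap status poll that restarts whichever step is missing, rather than a request that has to survive the full 3.5-minute wait.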