We recently had a push of almost 2GB (surprisingly it only took about 3 min for the client to upload it).
Here's an overview of when things broke: it failed on the last chunk, because that's when the server actually starts doing the heavy lifting (trying to apply the commit).
And here's what I think/know happened:
- Chorus sends the last chunk.
- The resumable server detects that it's the last chunk.
- It calls unbundle, which calls `hg incoming`, which takes ~3.5 minutes (it gets logged to `/var/cache/hgresume/<transaction-ID>.bundle.incoming.async_run`).
- Because the request takes so long, Chorus cancels the push request after 30s (so Lexbox returns a 499).
- It's also fairly likely that PHP kills the script, because it takes longer than 30s and/or because it was cancelled.
- It's also fairly likely that Cloudflare/the web server tears down the script process 🤷
- So, presumably, the PHP script gets torn down while it's waiting for the `hg incoming` command to finish (which does finish, because it's in its own process; see the sketch after this list).
- Because the PHP script gets torn down, no work actually happens: it never gets to running the `hg unbundle` command and creating a lockfile for that command. (A lockfile is created for `hg incoming`, but that's a separate file.)
- Because the lockfile doesn't get created, when Chorus retries the push-bundle, the server throws an exception.
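For context, here's a minimal sketch of the async-run pattern I believe is in play. The function name, paths, and hg invocation are my assumptions, not the real hgresume code; the point is just that the hg command survives the request while the follow-up work doesn't:

```php
<?php
// Minimal sketch of the assumed async-run pattern (function name, paths,
// and the hg invocation are illustrative assumptions, not hgresume's API).
// The command is detached with nohup + '&', so it keeps running even if
// PHP/Cloudflare tears the request down mid-wait.
function asyncRun(string $command, string $logFile): void
{
    exec(sprintf('nohup %s > %s 2>&1 &', $command, escapeshellarg($logFile)));
}

asyncRun(
    'hg incoming /path/to/pushed.bundle',
    '/var/cache/hgresume/<transaction-ID>.bundle.incoming.async_run'
);

// The PHP request then waits for `hg incoming` to finish before it can run
// `hg unbundle` (which would create that command's lockfile). If the script
// dies during this wait, `hg incoming` still completes in its own process,
// but `hg unbundle` never starts and its lockfile is never created.
```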
Do we want to allow big pushes like this? I think so! So how:
- It's fine if Chorus times out, as long as the job actually happens and we return a more meaningful response. The code tries to return a 200, but fails because the lockfile it's expecting doesn't exist. The exception is good, because a missing lockfile means nothing is happening.
So we need to do one of the following:
1. Move more stuff into an external command that doesn't get torn down 🙁
2. Prevent the PHP script from getting torn down (e.g. move large pushes to a Lexbox job and make sure we turn off everything that might kill a long PHP script)
3. Make the retries smarter and have them pick up where the last one died
I think 3 sounds like the best bet. Something like:
- Replace the exception-throwing `isComplete` check with something that anticipates this scenario:
  - If there's no lockfile, retry the unbundle.
- In the unbundle, detect if `hg incoming` already ran and, if so:
  - Do the necessary validation.
  - Then start `hg unbundle` (see the sketch below).
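To make that concrete, here's a rough sketch of what the retry-aware flow could look like. Everything here (function names, lockfile paths, the return strings, the cache layout) is an assumption for illustration, not the real hgresume code:

```php
<?php
// Rough sketch of a retry-aware finish step: a missing lockfile is treated
// as "restart the work" instead of "throw". All names are assumptions.

function runDetached(string $command, string $logFile): void
{
    // Background the command so it survives the PHP request being killed.
    exec(sprintf('nohup %s > %s 2>&1 &', $command, escapeshellarg($logFile)));
}

function finishPush(string $txId, string $bundle, string $repo, string $cacheDir): string
{
    $incomingLog  = "$cacheDir/$txId.bundle.incoming.async_run";
    $unbundleLock = "$cacheDir/$txId.unbundle.lock"; // assumed lockfile name

    if (file_exists($unbundleLock)) {
        // Normal path: unbundle already started; report progress as before.
        return 'unbundle in progress';
    }

    // No lockfile: instead of throwing, anticipate the torn-down script.
    if (file_exists($incomingLog)) {
        // `hg incoming` already ran, so do the necessary validation on its
        // output (elided here), then go straight to `hg unbundle` without
        // re-running the slow incoming step.
        touch($unbundleLock);
        runDetached(
            'hg -R ' . escapeshellarg($repo) . ' unbundle ' . escapeshellarg($bundle),
            "$cacheDir/$txId.unbundle.log"
        );
        return 'unbundle started';
    }

    // Neither ran: restart `hg incoming` and let Chorus retry later.
    runDetached(
        'hg -R ' . escapeshellarg($repo) . ' incoming ' . escapeshellarg($bundle),
        $incomingLog
    );
    return 'incoming started';
}
```

The key design choice is that every retry from Chorus becomes a cheap status poll that restarts whichever step is missing, rather than a request that has to survive the full 3.5-minute wait.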