Large pushes fail because they take too long #11

@myieye


We recently had a push of almost 2GB (surprisingly it only took about 3 min for the client to upload it).
Here's an overview of when things broke:
image

It failed on the last chunk, because that's when the server actually starts doing the heavy lifting: trying to apply the commit.

And here's what I think/know happened:

  • Chorus sends the last chunk
    • The resumable server detects that it's the last chunk
    • It calls unbundle
    • Which calls hg incoming, which takes ~3.5m (it gets logged to /var/cache/hgresume/<transaction-ID>.bundle.incoming.async_run)
  • Because the request takes so long, the PHP script presumably gets torn down while it's waiting for the hg incoming command to finish (which does finish, because it runs in its own process)
  • Because the PHP script gets torn down, no further work happens: it never gets as far as running hg unbundle and creating a lockfile for that command (a lockfile is created for hg incoming, but that's a separate file)
  • Because the lockfile doesn't get created, when Chorus retries the push-bundle, the server throws an Exception

Do we want to allow big pushes like this? I think so! So how:

  • It's fine if Chorus times out, as long as the job actually happens and we return a more meaningful response. The code tries to return a 200, but fails, because the lockfile it's expecting doesn't exist. The exception is good, because a missing lockfile means nothing is happening.

So we either need to:

  1. Move more stuff into an external command that doesn't get torn down 🙁
  2. Prevent the PHP script from getting torn down (e.g. move large pushes to a Lexbox Job and make sure we turn off everything that might kill a long PHP script)
  3. Make the retries smarter and have them pick up where the last one died
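
To make option 1 concrete, here is a minimal sketch of launching work as a detached process with its output logged to a file, similar in spirit to how hg incoming already survives the request. It is written in Python purely for illustration (the real server is PHP), and `run_detached` and the log path are hypothetical names, not part of hgresume. Starting a new session only decouples the child from the caller's process group; whether that is enough to survive a given web server's teardown is an assumption that would need testing.

```python
import subprocess
import sys
from pathlib import Path


def run_detached(command: list[str], log_path: Path) -> subprocess.Popen:
    """Start `command` in its own session, logging stdout/stderr to `log_path`.

    Hypothetical helper: the child is not in the caller's process group, so
    it is not stopped just because the caller's group is signalled.
    """
    log_file = open(log_path, "wb")
    try:
        return subprocess.Popen(
            command,
            stdin=subprocess.DEVNULL,
            stdout=log_file,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # POSIX: run the child in a new session
        )
    finally:
        # The child holds its own copy of the descriptor, so the parent
        # can close its handle immediately.
        log_file.close()
```

The caller can return right after `run_detached(...)`; a later request would then inspect the log or a lockfile to find out how far the job got.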

I think 3 sounds like the best bet. Something like:

  • Replace the exception-throwing isComplete check with something that anticipates this scenario:
    • If there's no lockfile, retry the unbundle
    • In the unbundle, detect if hg incoming already ran and if so:
      • Do the necessary validation
      • Then start hg unbundle
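
The retry logic proposed above could look roughly like the sketch below. It is Python for illustration only, not the server's PHP, and every name in it (`BundlePaths`, `handle_last_chunk_retry`, the callbacks, the file names) is hypothetical. In particular, it assumes that the presence of an "incoming result" file is a reliable sign that hg incoming finished; how the real server would detect that is an open question.

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Callable


class PushState(Enum):
    IN_PROGRESS = "in_progress"  # unbundle lockfile exists: work is running
    RESUMED = "resumed"          # incoming had already run: unbundle started now
    RESTARTED = "restarted"      # nothing reusable found: full unbundle rerun


@dataclass
class BundlePaths:
    incoming_result: Path  # assumed marker that hg incoming already finished
    unbundle_lock: Path    # assumed lockfile created when hg unbundle starts


def handle_last_chunk_retry(
    paths: BundlePaths,
    validate: Callable[[Path], None],
    start_unbundle: Callable[[], None],
    restart_full_unbundle: Callable[[], None],
) -> PushState:
    """Decide what a retried final chunk should do instead of throwing."""
    if paths.unbundle_lock.exists():
        # The original request got far enough: just report progress.
        return PushState.IN_PROGRESS
    if paths.incoming_result.exists():
        # The script died after hg incoming: reuse its result.
        validate(paths.incoming_result)  # expected to raise if the bundle is bad
        start_unbundle()
        return PushState.RESUMED
    # No lockfile and no incoming result: run the whole unbundle again.
    restart_full_unbundle()
    return PushState.RESTARTED
```

The point of the three-way split is that a missing lockfile stops being an error by itself; it only means the retry has to pick up the work, and the result tells Chorus which of the three situations it hit.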
