Fix cancel in exception handling code#96
Conversation
parsonsmatt
left a comment
There was a problem hiding this comment.
The code here is a bit contrived, and possibly is just an abuse of the API. I've recommended that we switch to catch instead of withException as I wasn't aware the semantics were different.
To motivate it a bit more: our original code looked like this:
action `wtihException` sendAlert
sendAlert = runDB . sendMessageToSlack MyAlertChannelMeanwhile, we've got some instrumenting code in runDB that adds to a timer when we're waiting on a connection.
runDB action = do
let
timerAction =
forever do
threadDelay 1_000
recordAnotherMillisecond
withAsync (timerAction `finally` flushMetrics) \a -> do
runSqlPool (liftIO (cancel a) >> action) =<< askSqlPoolBut, since withAsync can't actually be canceled, if we run any database action in a onException or withException handler, then we just... hang forever.
This can be fixed locally, using withAsyncWithUnmask (\unmask -> unmask $ timerThread 0) \a -> ...
This seems deeply unsatisfying to me. But hey it's Haskell exceptions 😂
unliftio/src/UnliftIO/Exception.hs
Outdated
| Left e1 -> do | ||
| -- see explanation in bracket | ||
| _ :: Either SomeException b <- EUnsafe.try $ run $ after e1 | ||
| _ :: Either SomeException b <- EUnsafe.try $ restore $ run $ after e1 |
There was a problem hiding this comment.
So, this line fixes the test. But it also removes the guarantee that action `onException` cleanup will definitely complete cleanup.
| when (n < 10) $ do | ||
| timerAction (n + 1) | ||
| withAsync (timerAction 0) $ \a -> do | ||
| cancel a |
There was a problem hiding this comment.
This line can be replaced with pure (). withAsync is also unable to cancel the thread.
| withAsync (timerAction 0) $ \a -> do | ||
| cancel a | ||
| eresult <- | ||
| race |
There was a problem hiding this comment.
This race call also is unable to cancel one of the threads - so, with the original code, this test hangs for 10 seconds waiting for timerAction to complete before returning Right 10.
| withAsync a b = withRunInIO $ \run -> do | ||
| maskingState <- E.getMaskingState | ||
| case maskingState of | ||
| E.MaskedUninterruptible -> | ||
| A.withAsyncWithUnmask (\unmask -> unmask (run a)) (run . b) | ||
| _ -> | ||
| A.withAsync (run a) (run . b) |
There was a problem hiding this comment.
This is an alternative fix to using restore in the handler of withException and bracket.
The Async API is totally dependent on being able to throw async exceptions to the other thread. I don't think Async makes any sense if it cannot do that.
While this single fix works here, it won't work for eg race or any of the other code, so we may want to patch those as well.
I'm almost curious if this is a thing that needs to get raised with upstream Async.
There was a problem hiding this comment.
I'd be very worried about two things here:
- A very subtle change to behavior that can take production code down
- Diverging from
async's handling of these cases
There was a problem hiding this comment.
Upstream hasn't responded, so, we may want to handle this here?
I have a hard time imagining any code that depends on withAsync creating an unkillable zombie and being unable to exit.
This is a decently large footgun that is only really likely to happen if you are using unliftio or safe-exceptions. While I do think that async should really be launching threads in at most a MaskedInterruptible state, it's not super likely to be triggered - uninterruptibleMask is pretty rare to see, and has lots of warning labels.
Since unliftio uses it a bunch in cleanup functions, though, I think it does make sense to make the combination of UnliftIO.Exception and UnliftIO.Async not quite so dangerous.
There was a problem hiding this comment.
I'm not following why this situation is worse with unliftio or safe-exceptions, it seems like other code would have the same behavior.
I'm still very much on the fence here. I think the change is a good one, but the idiosyncrasies it may cause may not be worth it.
One specific question about the implementation here: IIUC, this is changing uninterruptible masking to unmasked, but leaving interruptible masking as it is. That seems surprising. Should uninterruptible become interruptible, or both become unmasked?
There was a problem hiding this comment.
I'm not following why this situation is worse with unliftio or safe-exceptions, it seems like other code would have the same behavior.
It's unlikely because most people read the docs for uninterruptibleMask and decide they probably don't need it. So the main way it becomes a part of people's programs is when safe-exceptions or unliftio style cleanups enter the picture and puts it in with bracket and friends.
But, yeah, anyone that runs uninterruptibleMask_ $ async $ do ... is going to create an unkillable Async.
I'm still very much on the fence here. I think the change is a good one, but the idiosyncrasies it may cause may not be worth it.
I'm coming around that it should probably be unmasked in the async and withAsync functions, and definitely unmasked in race and race-like functions. If the intent is "one of these threads should die," then unmasking should happen. But concurrently poses a slightly different question: Simon Marlow points out that finally foo (a >> b) and finally foo (concurrently a b should have the same semantics, but if concurrently unmasks, then the action is no longer protected. So the appropriate default for concurrently may be different than for race, since the expectations/assumptions are different.
Likewise, withAsync seems to demand the unmasking much more - uninterruptibleMask_ $ withAsync foo bar cannot terminate if foo is a forever loop - you're deadlocked. But uninterruptibleMask_ $ async foo is less likely to deadlock, since it's entirely possible that you never cancel it, and thus never have the problem.
I'm not really sure the best way to approach it. Maybe separate modules with different behavior is the right answer? A separate opt-in package?
My chaos brain says something like UnliftIO.Async needs to export these functions with a MaskingBehavior argument, forcing all users to upgrade, and maybe give a warning/deprecation message that points to alternative modules. But that's a big breaking change for probably little benefit for those that aren't already experiencing the pain.
One specific question about the implementation here: IIUC, this is changing uninterruptible masking to unmasked, but leaving interruptible masking as it is. That seems surprising. Should uninterruptible become interruptible, or both become unmasked?
Yeah, that's a good pont - unfortunately, we're missing a primop.
I have opened an upstream issue: ideally, we'd have a primop unsafeUnUninterruptibleMask# that would downgrade a UninterruptibleMask to a InterruptibleMask. But we don't have that, so the alternative of asyncWithUnmask $ \unmask -> unmask $ mask_ action actually is vulnerable to an async exception at the unmask point, since unsafeUnmask (the function provided as unmask) triggers any queued async exceptions in the C--. In the case of async, this is really unlikely - I think you'd need to cancel =<< async action to observe the async exception slipping between the unmask $ mask_.
Switching from mask_ (which checks masking state; should be unnecessary after unmask) to block would also accomplish this.
|
Hate to be a squeaky wheel, but this has now bitten us 3 times, resulting in non-graceful SIGKILL shutdowns I've made the patch locally to ensure that all |
|
Fair enough, let's try to move forward on this then. Can I ask you to put together a ChangeLog entry in here that makes it clear that there's a subtle behavioral change in this release? Also: what do you think of posting this on the Haskell Foundation Discourse and giving people a heads-up that this change will be coming before releasing? I'm a bit worried about encouraging a debate on it, but also a bit worried about this one sneaking under people's radars. |
|
I think getting some feedback would be great. At the very least, maybe we can get some more eyes on this ticket: https://gitlab.haskell.org/ghc/ghc/-/issues/22389 the solution to which would introduce a new primitive that should make this a bit safer. |
|
Would you be able to publish such a message? I'll be happy to boost with links from Twitter. |
Initial discussion: fpco/safe-exceptions#3
Follow-up on GHC: https://gitlab.haskell.org/ghc/ghc/-/issues/18899
🤔
Fixes #95