Skip to content

bug: state of _stackprof.last_sample_at can leak between invocations #220

@dalehamel

Description

@dalehamel

We noticed a very peculiar profile:

Screenshot 2023-11-10 at 9 56 27 AM

The top frame is StackProf.start. It is supposedly occupying 20% of the overall aggregated profile time (~10 out of ~50 minutes). How can this be possible?

Checking the code, it looks like it is technically possible that we can have a signal be received within stackprof_start, as the signal handler is installed but there are a few lines left to execute before we return:

https://github.com/tmm1/stackprof/blob/master/ext/stackprof/stackprof.c#L204-L234

Many of the other variables on _stackprof struct are cleared between runs, but notably, last_sample_at is never reset within stackprof_results

Since we never explicitly acquire and relinquish the GVL within stackprof_start, and never reset the value, this is my theory as to what happened:

  • stackprof run A occurs, and on its final sample collection, sets last_sample_at, t_0
  • Some time passes
  • stackprof run B occurs, and during the initialization in stackprof_start, between the installation of the signal handler and before we reset last_sample_at to the current time, the ruby VM swaps to another thread
  • Enough wall clock time is spent on the other thread that our timer fires, and we collect a sample
  • When calculating the weight of the sample, we use t_0 as the base, instead of t_1, because it has not been reset yet
  • Execution returns and we finish executing stackprof_start, and last_sample_at is set to t_1, the current time.

I'm not sure if the above is plausible in ruby's threading model, but it's the best I could come up with^.

I was very surprised to see StackProf.start at all in profiles, let alone so prominently. But from what I can tell, the above scenario is a plausible explanation for how this could occur.

Looking at the same interval of time, we do not see it appear in :cpu mode profiles collected. This makes sense though, as those don't use timestamps for their weighting.

The solution to this I think would be some combination of:

  • Acquire the GVL during stackprof_start and release it just before we return
  • Reset the last_sample_at during the results function, as we do with other variables on the _stackprof struct.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions