Idea: stackalloc from caller's stack frame #1817

svick · 2018-08-26T14:00:00Z

svick
Aug 26, 2018
Collaborator

I think one of the current issues with stackalloc is that it cannot be hidden behind a layer of abstraction. More specifically, when a method wants to return stack allocated data, its caller has to manually allocate the right amount of space on its own stack before calling it.

As a somewhat convoluted motivating example, consider a method to parse a string that contains a sequence of numbers like 1,2,3:

static ReadOnlySpan<int> Parse(ReadOnlySpan<char> input)
{
    int valuesCount = 1;
    foreach (char c in input)
    {
        if (c == ',')
            valuesCount++;
    }
    
    Span<int> values;
    if (valuesCount <= 100)
        values = stackalloc int[valuesCount];
    else
        values = new int[valuesCount];
    
    for (int i = 0; i < valuesCount - 1; i++)
    {
        int index = input.IndexOf(',');
        values[i] = int.Parse(input.Slice(0, index));
        input = input.Slice(index + 1);
    }
    values[valuesCount - 1] = int.Parse(input);
    
    return values;
}

The above code obviously won't compile, because it could return a stack-allocated Span to its caller.

When implementing such a method today, it would have to take Span<int> values as another parameter, and it would be the caller's responsibility to decide whether to stack allocate it and how much to allocate. I think this means the caller has to know too much about the implementation details of the called method, and it will often mean that the caller allocates either too much or too little.
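To make the status quo concrete, here is a minimal sketch of the caller-allocates pattern described above. TryParse and SumOf are hypothetical names, and the buffer size of 16 is an arbitrary guess, which is exactly the problem being described.

```csharp
using System;

static class CallerAllocatedParsing
{
    // Hypothetical helper: parses "1,2,3" into a buffer supplied by the caller.
    // Returns false if the buffer is too small; 'written' is the number of values stored.
    public static bool TryParse(ReadOnlySpan<char> input, Span<int> destination, out int written)
    {
        written = 0;
        while (true)
        {
            int index = input.IndexOf(',');
            ReadOnlySpan<char> token = index < 0 ? input : input.Slice(0, index);

            if (written == destination.Length)
                return false; // the caller guessed too small

            destination[written++] = int.Parse(token);

            if (index < 0)
                return true; // that was the last value
            input = input.Slice(index + 1);
        }
    }

    // The caller has to pick a size up front without knowing what the callee needs.
    public static int SumOf(string text)
    {
        Span<int> buffer = stackalloc int[16]; // a guess: wasteful for "1,2,3", too small for 17 values
        if (!TryParse(text, buffer, out int written))
            throw new ArgumentException("More than 16 values.", nameof(text));

        int sum = 0;
        foreach (int value in buffer.Slice(0, written))
            sum += value;
        return sum;
    }
}
```

Calling CallerAllocatedParsing.SumOf("1,2,3") works, but an input with more than 16 values fails even though the callee could easily have computed the right size itself.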

Being able to somehow stack allocate from the caller's frame would solve this issue. The big obvious problem with that is that the callee's stack frame is in the way. I can think of several ways of working around that:

Approach 1: Coroutines

With this approach, the callee would be implemented as a coroutine. Whenever the callee wanted to stack allocate from the caller's frame, it would yield; the caller would then allocate the required buffer and finally resume the callee. Any callee state that needs to persist across a yield would have to be stored in the caller's frame (probably in a ref struct).

The disadvantages of this approach are that the callee could not stack allocate from its own frame (because that allocation would not survive a yield) and that its state would take up space on the stack even after it returns.

The advantage is that this could be implemented in the C# compiler itself, no CLR changes necessary.
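As a rough illustration of approach 1, the sketch below hand-writes the two halves of such a coroutine for the Parse example. ParseState, Start and Resume are made-up names, not a proposed API, and a real compiler-generated state machine would likely look different. The point is only that the callee's state lives in the caller's frame and that the stackalloc happens in the caller, between the two halves.

```csharp
using System;

// Callee state that must survive the "yield"; it is stored in the caller's frame.
ref struct ParseState
{
    public ReadOnlySpan<char> Input;
    public int RequestedLength;
}

static class ParseCoroutine
{
    // First half of the callee: runs up to the point where it needs a buffer,
    // then "yields" by returning its state together with the requested length.
    public static ParseState Start(ReadOnlySpan<char> input)
    {
        int count = 1;
        foreach (char c in input)
        {
            if (c == ',')
                count++;
        }
        return new ParseState { Input = input, RequestedLength = count };
    }

    // Second half: resumes with a buffer that the caller allocated in its own frame.
    // Assumes buffer.Length == state.RequestedLength.
    public static ReadOnlySpan<int> Resume(ParseState state, Span<int> buffer)
    {
        ReadOnlySpan<char> input = state.Input;
        for (int i = 0; i < buffer.Length - 1; i++)
        {
            int index = input.IndexOf(',');
            buffer[i] = int.Parse(input.Slice(0, index));
            input = input.Slice(index + 1);
        }
        buffer[buffer.Length - 1] = int.Parse(input);
        return buffer;
    }

    // Roughly what the compiler might emit at the call site.
    public static int Sum(string text)
    {
        ParseState state = Start(text);

        // The allocation decision from the original Parse method now runs in the caller.
        Span<int> buffer = state.RequestedLength <= 100
            ? stackalloc int[state.RequestedLength]
            : new int[state.RequestedLength];

        int sum = 0;
        foreach (int value in Resume(state, buffer))
            sum += value;
        return sum;
    }
}
```

Note that state stays on the caller's stack after Resume returns, which is the second disadvantage mentioned above.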

[Picture: how this process would work (not to scale)]

Approach 2: Make the CLR understand

With this approach, the C# compiler would give some signal to the CLR (probably using an intrinsic) and the CLR would modify the stack as required.

I can think of two ways the CLR could make this work:

Approach 2a: Expand the caller's frame

When the callee returns, whatever was in its frame will become part of the caller's frame, including any buffers it's returning (which is what we want) and any other state (which we don't).

The disadvantage of this is that the callee state still takes up stack space even after it returns. It would also require changes to the CLR (but maybe those wouldn't be too big?).

The advantage is that this would not require creating a state machine for a coroutine, which would likely make this more efficient, and possibly easier to implement.

[Picture: approach 2a]

Another approach that would achieve a similar result would be to force inlining of the callee, but this comes with its own set of disadvantages (e.g. it likely couldn't be used in virtual methods).

Approach 2b: Move the callee's frame

When the callee wants to stack allocate from the caller's frame, the CLR first moves the callee's frame by the required amount and then performs the allocation directly from the caller's frame.

The main disadvantage is that it would require moving the callee's frame, which can take some time (especially if it contains its own stack allocations). This would also require adjusting any references to variables from the callee's frame, which might not be easy. Out of the three approaches, this one would likely require the biggest CLR changes.

The main advantage, compared with the other approaches, is that when the callee returns, its state does not stay on the stack.

[Picture: approach 2b]

Closing thoughts

One thing to note is that the immediate caller is not special, so it might make sense to have some way of deciding from which frame to allocate, not limited to just the immediate caller.

Finally, I do realize that all the approaches I suggested above are complicated and unorthodox, especially when considering what problem they're trying to solve. So it's likely they're not going to be implemented, at least not anytime soon. But I do think it's a problem worth solving, so I wanted to start this discussion, even if the eventual solution looked completely different.

tannergooding · 2018-08-26T15:40:35Z

tannergooding
Aug 26, 2018
Collaborator

I don't think this is a good idea. Being able to stackalloc a buffer on the caller's stack space would add a lot of complexity for what looks to be relatively little benefit (IMO).

If calculating the number of results is trivial, it is infinitely better to have some function which tells the consumer how many results they can expect (GetCount()), so they can decide how/where to allocate their buffer.

If calculating the number of results is non-trivial, you likely won't know the total size of the buffer needed until after you finish processing the string, so the output buffer will likely need to be resized multiple times to make things "right". In that case, I believe the better solution is to have some kind of stream class, which tracks the current stream position and parses data into the user-provided buffer until said buffer is full. That also allows the user to determine the appropriately sized buffer and where to allocate it.
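One way such a stream class might look, as a minimal sketch for the comma-separated example from the original post. NumberStream and its Read method are hypothetical names, and the chunk size of 4 in the demo is arbitrary.

```csharp
using System;

// Hypothetical stream-style parser: remembers its position in the text and
// fills whatever buffer the caller hands it.
sealed class NumberStream
{
    private readonly string _text;
    private int _position; // index of the next unread character; -1 once exhausted

    public NumberStream(string text)
    {
        _text = text;
        _position = 0;
    }

    // Parses values into 'buffer' until it is full or the input is exhausted.
    // Returns the number of values written; 0 means nothing is left.
    public int Read(Span<int> buffer)
    {
        int written = 0;
        while (written < buffer.Length && _position >= 0)
        {
            ReadOnlySpan<char> rest = _text.AsSpan(_position);
            int index = rest.IndexOf(',');
            if (index < 0)
            {
                buffer[written++] = int.Parse(rest);
                _position = -1; // last value consumed
            }
            else
            {
                buffer[written++] = int.Parse(rest.Slice(0, index));
                _position += index + 1;
            }
        }
        return written;
    }
}

static class NumberStreamDemo
{
    // The caller chooses both the size and the location of the buffer.
    public static int Sum(string text)
    {
        var stream = new NumberStream(text);
        Span<int> chunk = stackalloc int[4];

        int sum = 0;
        int read;
        while ((read = stream.Read(chunk)) > 0)
        {
            for (int i = 0; i < read; i++)
                sum += chunk[i];
        }
        return sum;
    }
}
```

The trade-off is that the caller processes the values in chunks instead of receiving one contiguous span.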

iSazonov · 2018-08-27T04:39:07Z

iSazonov
Aug 27, 2018

If we know the number of results, we can allocate a span of the required size in the caller.
If we don't know the number of results, we have to reallocate. This works for types like Array. With some limitations, we could implement this for Span too. Technically, at the realloc point we would temporarily return to the caller, move the called method's stack frame (is it small?), expand the Span, and return to the realloc point in the called method. We could also take a threshold into account and switch from the stack to a heap buffer when needed.
This is easy to do for a single stack Span variable, but it makes no sense if there are more variables. (We would be forced to copy the second and subsequent buffers, which is no different from an array reallocation.)
Also, I believe that none of this makes sense for small sizes, but it can be justified when reallocating a Span variable from 1-2 KB to 3-4 KB.

Another scenario is that at the reallocation point we immediately reallocate to the heap. Something similar happens in the internal ValueStringBuilder.
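A minimal sketch of that last scenario: start in a stack buffer provided by the caller and move to a pooled heap array the first time it overflows. It is loosely modeled on the idea behind ValueStringBuilder, not on its actual code. ValueIntList, GrowingParse and their members are hypothetical names, and the initial capacity of 4 is arbitrary.

```csharp
#nullable enable
using System;
using System.Buffers;

// Hypothetical growable list that starts in caller-provided (typically stack) memory.
ref struct ValueIntList
{
    private Span<int> _span;   // current storage: the initial buffer or a pooled array
    private int[]? _rented;    // non-null once the data has moved to the heap
    private int _count;

    public ValueIntList(Span<int> initialBuffer)
    {
        _span = initialBuffer;
        _rented = null;
        _count = 0;
    }

    public bool IsOnHeap => _rented != null;

    public void Add(int value)
    {
        if (_count == _span.Length)
            Grow();
        _span[_count++] = value;
    }

    public ReadOnlySpan<int> AsSpan() => _span.Slice(0, _count);

    // The "reallocation point": copy what we have into a bigger pooled array.
    private void Grow()
    {
        int[] bigger = ArrayPool<int>.Shared.Rent(Math.Max(_span.Length * 2, 4));
        _span.Slice(0, _count).CopyTo(bigger);
        if (_rented != null)
            ArrayPool<int>.Shared.Return(_rented);
        _span = _rented = bigger;
    }

    public void Dispose()
    {
        if (_rented != null)
        {
            ArrayPool<int>.Shared.Return(_rented);
            _rented = null;
        }
    }
}

static class GrowingParse
{
    public static int Sum(string text, out bool movedToHeap)
    {
        Span<int> initial = stackalloc int[4];
        var values = new ValueIntList(initial);

        ReadOnlySpan<char> input = text;
        while (true)
        {
            int index = input.IndexOf(',');
            if (index < 0)
            {
                values.Add(int.Parse(input));
                break;
            }
            values.Add(int.Parse(input.Slice(0, index)));
            input = input.Slice(index + 1);
        }

        int sum = 0;
        foreach (int value in values.AsSpan())
            sum += value;

        movedToHeap = values.IsOnHeap;
        values.Dispose();
        return sum;
    }
}
```

Small inputs never touch the heap, while larger ones pay for one copy per growth step, which is the array-style reallocation cost mentioned above.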

yaakov-h · 2018-08-27T05:08:57Z

yaakov-h
Aug 27, 2018

What about two other possibilities:

When returning from the callee to the caller, expand the caller's stack frame and copy the buffer into the caller. This would require that a buffer of size N begins at least N bytes from the beginning of the callee's stack space.
After returning, subsequent callees ignore the buffer, and bifurcate their stack space around it, until the caller function is completed.

svick · 2018-08-27T09:03:14Z

svick
Aug 27, 2018
Collaborator Author

@yaakov-h

"This would require that a buffer of size N begins at least N bytes from the beginning of the callee's stack space."

I'm not sure about that. I think implementations of memmove can do this kind of overlapping copying efficiently.
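For what it's worth, the same memmove-style behavior is visible from C#: Span<T>.CopyTo is documented to copy correctly even when source and destination overlap. A small sketch (OverlappingCopy is a made-up name, and an array stands in for the stack memory):

```csharp
using System;

static class OverlappingCopy
{
    // Moves elements [2..8) down to [0..6) within the same memory region.
    public static int[] ShiftLeftByTwo()
    {
        int[] data = { 0, 1, 2, 3, 4, 5, 6, 7 };
        Span<int> span = data;

        // Source and destination overlap by four elements; CopyTo handles that.
        span.Slice(2).CopyTo(span);

        return data;
    }
}
```

So a buffer of size N sitting fewer than N bytes from its destination could still be copied in a single call.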

"After returning, subsequent callees ignore the buffer, and bifurcate their stack space around it, until the caller function is completed."

I don't think that would work well. One of the great things about the stack is that you can use very simple machine code with it, e.g. on x86 you can use sub esp, 0x8 to allocate 8 bytes or mov dword [esp+0x4], 0x2a to assign a value to a variable on the stack. With this suggestion, that code would have to be significantly more complicated.

YairHalberstadt · 2020-10-18T11:56:01Z

YairHalberstadt
Oct 18, 2020
Collaborator

One case where the runtime needs to be able to do this is for this API: dotnet/runtime#25423, where the runtime decides whether or not to StackAlloc an array.

This API is necessary for params Span<T>, which would offer significant performance benefits.
