Proposal: Removing C# stack copies syntactically via (Out-Returns) #8786
Replies: 30 comments
-
A hacky way to implement these non-copy operators as fluent methods in C# today would be to use https://gist.github.com/HaloFour/5bf3f2f3a48183aeaab26a3c5aad25ab Works great on .NET Core, not so much on .NET Framework. Of course that's not remotely as nice to use as actual proper operators. |
Beta Was this translation helpful? Give feedback.
-
Theoretically supporting public struct BigStruct {
public double x, y, z;
public static void operator +(in BigStruct p1, in BigStruct p2, out BigStruct result) {
result.x = p1.x + p2.x;
result.y = p1.y + p2.y;
result.z = p1.z + p2.z;
}
}
BigStruct a, b, c;
// init structs here
BigStruct result = a + b + c;
// compiled into the equivalent of
a.op_Addition(b, out BigStruct $temp1);
$temp1.op_Addition(c, out BigStruct result); |
Beta Was this translation helpful? Give feedback.
-
-- Just pointing it out but keep in mind thats still slower which will start to add up in more complex code situations.
-- True and while operators are a major issue here I should point out there are many other non-operator methods used on vectors that should have the ability to work the same way. As the HLSL example shows these algorithms are normally a combination of operator and non-operator methods stringed together. So ideally the proposed solution should work for both. |
Beta Was this translation helpful? Give feedback.
-
I'm wondering whether this needs any syntax change at all. Rather, I think this should be taken care of by the compiler, perhaps with assistance from the runtime - they have all the context to do the requisite analysis of the methods and could rewrite things to achieve this speedup without imposing any burden on either the C# language nor on developers. |
Beta Was this translation helpful? Give feedback.
-
The benchmarks I ran showed it to be as fast as the |
Beta Was this translation helpful? Give feedback.
-
A couple issues I see with this approach.
I'm not saying optimizing at the IL level is bad I'm just saying its not ideal in this situation. Being explicit is usually better for optimizing in this area. @HaloFour |
Beta Was this translation helpful? Give feedback.
-
I'm also in favor of CLR doing RVO automatically, similar to C++, with rules that would be easy for a dev to reason about. |
Beta Was this translation helpful? Give feedback.
-
As much as I would have wanted an implementation of this to be automatic, the fact that doing so would silently change the call signature raised some red flags. It's probably best to have new syntax. It can be backed up/enforced by an analyser if the performance boost is paramount. |
Beta Was this translation helpful? Give feedback.
-
I've commented a bit on the Gitter channel but most of this seems like a runtime issue. From an ABI perspective. If there are codegen problems here, they are likely in the JIT and they should be updated there so that it ends up improving all existing APIs. From the language perspective, I think allowing |
Beta Was this translation helpful? Give feedback.
-
@tannergooding I know we talked on Gitter but will post my disagreement with this solution here.
This requires every relevant .NET runtime to make the same optimizations for what is actually a semantic one. Because this will probably never happen and the fact that even C++ can't optimize here correctly is rather telling it may not be the right approach. Many things I agree the JIT should optimize but not this and the primary reason being is there is no way to handle the multitude of random different situations that come up if left up to the JIT. However fixing this syntactically enforces common sense rules just as an 'out' keyword does guaranteeing a performance increase. The calling conversion isn't that related to the problem: "static MyStruct* Method(MyStruct* pResult)" static Vec3 operator+(Vec3 a, Vec3 b)
{
Vec3 result;
result.x = a.x + b.x;
result.x = a.y + b.y;
result.x = a.z + b.z;
return result;// This COPIES "Vec3 result;" back to the callings stack memory (the copy is the issue)
} You also can't argue this approach below as there is no guarantee there will be a suitable constructor. static 3rdPartyVec3 operator+(Vec3 a, 3rdPartyVec3 b)
{
// this constructor doesn't exist so we can't avoid stack copy
return new 3rdPartyVec3(a.x + b.x, a.t + b.t, a.z + b.z); // compiler error
}
// must use
static 3rdPartyVec3 operator+(Vec3 a, 3rdPartyVec3 b)
{
3rdPartyVec3 result;
result.x = a.x + b.x;
result.x = a.y + b.y;
result.x = a.z + b.z;
return result;// again a copy
} There are many other situations the JIT will be confused about. |
Beta Was this translation helpful? Give feedback.
-
@tannergooding How about this approach below. This approach lets the JIT do everything BUT lets the framework developer give a hint to how this is expected to be optimized. Example: [return:OutReturn]
public static Vec3 operator+(in Vec3 a, in Vec3 b)
{
return new Vec3(a.x + b.x, a.y + b.y, a.z + b.z);
} When the JIT sees that attribute it could convert it in C terms like so: void Vec3_add_Operator(Vec3* result, Vec3* a, Vec3* b)
{
Vec3_INIT(result, a->x + b->x, a->y + b->y, a->z + b->z);
return;
} This at least allows us to get rid of stack copies / memory duplication with less flexibility BUT can be fully done in the JIT while making it easy for Mono, IL2CPP, etc have an easier time knowing what is expected. Everything just works. Seems like a far trade off to me? |
Beta Was this translation helpful? Give feedback.
-
Like I said above. That is already implicitly done by the calling convention. You can see https://docs.microsoft.com/en-us/cpp/build/x64-calling-convention?view=vs-2017#return-values and https://github.com/hjl-tools/x86-psABI/wiki/x86-64-psABI-1.0.pdf for more details on how each calling convention deals with this. The CoreCLR runtime also already does this kind of optimization in some simple cases for the latest .NET Core 3.0 Preview. For example: public struct Vector3 {
public float X;
public float Y;
public float Z;
public Vector3(float x, float y, float z)
{
X = x;
Y = y;
Z = z;
}
public static Vector3 One()
{
return new Vector3(1.0f, 1.0f, 1.0f);
}
}
00007FFD7786F650 vzeroupper
00007FFD7786F653 xchg ax,ax
00007FFD7786F655 vmovss xmm0,dword ptr [7FFD7786F680h]
00007FFD7786F65D vmovss xmm1,dword ptr [7FFD7786F684h]
00007FFD7786F665 vmovss dword ptr [rcx],xmm0
00007FFD7786F669 vmovss dword ptr [rcx+4],xmm1
00007FFD7786F66E vmovss dword ptr [rcx+8],xmm1
00007FFD7786F673 mov rax,rcx
00007FFD7786F676 ret You can also see that native compilers do the same for cases they can rationalize: https://godbolt.org/z/ZuUOBz |
Beta Was this translation helpful? Give feedback.
-
Some cases where the JIT wasn't doing a great job are tracked by https://github.com/dotnet/coreclr/issues/19522. |
Beta Was this translation helpful? Give feedback.
-
That's not true and it's one of the reasons why optimizing the copy But all the ABI discussion isn't really relevant in this case because
No, it doesn't guarantee performance increase. Supporting |
Beta Was this translation helpful? Give feedback.
-
I meant you can guarantee a performance increase if x64 is your target and you're optimizing for a larger struct like Vec3 for example. Vs relying on runtime updates that may or may not handle struct copies or inline correctly. Its explicit. So yes it can be slower but if you explicitly test its faster you can guarantee it. This is why I make a point of non-primitive structs as those for example are slower with this method. |
Beta Was this translation helpful? Give feedback.
-
It may be guaranteed on the particular runtime version you test with. It's not necessarily guaranteed on other runtimes or future version of the same runtime. While I would expect a new version of the runtime to not regress code gen quality, this has happened occasionally in the past. Either as a result of an oversight or a bug fix that made the JIT more conservative.
Using ref/out parameters also relies on the JIT to do its job correctly. If I go delete a certain piece of code from the JIT your out version will be much slower. |
Beta Was this translation helpful? Give feedback.
-
I'm not sure what you mean by that. The concept of
Again, the ABI doesn't have anything to do with your case. Pretty much all interesting methods in your benchmark are inlined.
Sure, but there's also a balance in how much stuff you can add to a language's syntax. Especially when what you want to add isn't guaranteed to be always an improvement. |
Beta Was this translation helpful? Give feedback.
-
Ignore the other stuff I said tired and going to bed after this but if I was going to make a game and target Xbox One using .NET Core 3 or C++ I could guarantee its performance on that system. Again its generally explicit (you're telling the compiler what you think should happen when it may not know whats best). Think in terms of AOT targets. This just gives devs more lang tools to enforce concepts they expect when trying to micro manage memory at a higher level. Anyway hope that makes sense. |
Beta Was this translation helpful? Give feedback.
-
OK, but does it make sense to complicate a general purpose language like C# because you find that a certain approach happens to work well on a certain target? This sounds more like a hack rather than a well designed solution. Anyway, I took a look at the JIT generated code. It certainly looks horrible in the non-USE_OUT case. And one of the reasons it is horrible has an certain compiler issue I already mentioned. Can you guess which one? 😁 |
Beta Was this translation helpful? Give feedback.
-
C# is more than just a general purpose lang at this point (at least its reaching out beyond that and being used for more than that). It may have started out that way but with that logic you could call SIMD a hack vs just letting the compiler auto vectorize stuff. Being explicit isn't a hack. Just like using out explicitly isn't a hack or using SIMD isn't. Being able to simplify an expression in syntax isn't a hack, its syntax sugar and something C# seems to have a lot of... much of which I consider bad short-hands that make stuff harder to look at (shorter isn't always better). If you have to manually modify the IL in a post build step to get this working correctly on Non .NET Core runtimes... now thats a hack (which is whats going to happen). The JIT is basically being used as a hack for flaws in the C# lang that can actually be described in langs like Nim which can explicitly avoid these issues.... and don't tell me the C# / .NET goals haven't changed over the years. What seems best on the surface from a classical perspective isn't always correct just because its always been done a particular way. Having a lang feature and IL optimizations can live together, these principles aren't mutually exclusive. |
Beta Was this translation helpful? Give feedback.
-
It depends. What's for sure is that implementing a feature based on poor/incomplete analysis is a hack. And that's exactly what you're proposing here. You have no idea why the non-USE_OUT version of your code is significantly slower than the USE_OUT version. You looked at the IL, missed the facts, jumped to conclusions and sketched up solution for non-existing and/or different problems. |
Beta Was this translation helpful? Give feedback.
-
And I'm the one making fallacies? You're arguing a straw-man... good luck. |
Beta Was this translation helpful? Give feedback.
-
@HaloFour That would be great! |
Beta Was this translation helpful? Give feedback.
-
I'm with the group thinking this makes far more sense as a runtime optimization. @zezba9000 Do you want to port this to dotnet/coreclr (and maybe provide a sample impl)? |
Beta Was this translation helpful? Give feedback.
-
@CyrusNajmabadi By 'this' you mean the C# syntax prototype idea and topic? As that would be more on the Roslyn side correct or did you mean solely a runtime optimization? |
Beta Was this translation helpful? Give feedback.
-
@zezba9000 I meant solely as a runtime optimization. |
Beta Was this translation helpful? Give feedback.
-
@zezba9000 just curious, any progress on this? |
Beta Was this translation helpful? Give feedback.
-
@Krakean Have not looked at it outside of IL generation. If you want to avoid bad IL generation in C# that causes the issue make sure you format your operator methods like so:
If you don't format your operator methods like the above example, Roslyn starts referencing then de-referencing parameter values for no reason... which I can only guess the JIT is having a very hard time understanding what to optimize. |
Beta Was this translation helpful? Give feedback.
-
@zezba9000 Curious then, in/ref/out like shown in @HaloFour example (https://gist.github.com/HaloFour/5bf3f2f3a48183aeaab26a3c5aad25ab) makes sense? |
Beta Was this translation helpful? Give feedback.
-
@Krakean .NET Core (and I think .NET Framework but can't remember) will optimize it to be the same. The example I gave with "return new Vec3(...)" is the simplest to read and yields the same results as the out method for the most part. Thats the one you should be using. |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
Removing C# stack copies syntactically
A proposal that eliminates stack copies at the heart of C type'ish languages like C# when using 'structs' (aka value types) heavily as is done in image manipulation, graphics in general, physics, games, Unity3D, etc (I'm sure there are many more fields as well).
Key arguments for this feature
The issue!
A potential solution!
Here is some initial pseudocode
If the 'return' keyword isn't the best choice there are other options. However, this is more of a minor syntax issue.
HLSL Comparison
Say we wanted to run a color manipulation algorithm. Here is some standard code that might be used in HLSL. However imagine if you will a similar algorithm being used in C#. I'm simply showing this to give some frame of reference as to why this kind of syntax should be done this way while being able to achieve maximum performance.
So lets take this line: "OUT.rgb = brightness * (color0 * mask + color1 * (1.0 - mask));"
If I wanted to get the best performance here in C# this becomes very verbose and hard to read.
As you can see below writing performant vector based code in C# isn't fun.
Why does this happen?
Every time anything is returned out in C# it must first have its value stored on the stack that in turn gets copied back to the stack before it unwinds. However if you use an 'out' parameter this extra copy is explicitly avoided. Taking a quick look at the IL difference is very telling.
How to handle older C# or .NET versions
Take the example below
To call the example code above in older C# versions one must do:
To call the example code in newer C# versions one can do:
And finally just as you would give a compiler error for methods that only differ in return type, so would you for 'out-return' types. All examples methods below conflict with one another if defined in the same type.
Final thoughts
Given the major performance gains and relatively little changes needed I see this as a big win.
Beta Was this translation helpful? Give feedback.
All reactions