Proposal: Mutable methods to reduce or avoid branching #695

brandon942 · 2017-06-21T20:18:50Z

brandon942
Jun 21, 2017

Method behavior is often determined by states that need to be polled by the code within the method each time the method is called. In most cases states change less frequently than the method that checks them is called, or only once. Those situations can be optimized and the performance improved with the introduction of mutable methods.

Implementation
Let me start with the implementation because it is very simple. A method marked as mutable always starts with a jmp instruction that leads to the currently set version of the method. Changing the version of the method is done by overwriting the jmp instruction in memory. Since the write is atomic it is thread safe. Threads that are still executing the previous version of the method are unaffected.
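
As a rough analogue in today's C#, the "overwrite the jmp" step can be approximated by swapping a static delegate field. This is only a sketch of the idea, not the proposed implementation: `NameProvider`, `_getName`, and the body of `GetRandomName` are hypothetical, and a delegate call costs more than a patched jump would.

```csharp
using System;

static class NameProvider
{
    private static string _name;

    // Holds the "current version" of the getter; starts at the slow path.
    private static Func<string> _getName = InitializeAndGet;

    public static string Name => _getName();

    // First version: initializes the value, then swaps in the fast path.
    private static string InitializeAndGet()
    {
        _name = GetRandomName();
        _getName = GetInitialized; // reference assignment is atomic in C#
        return _name;
    }

    // Second version: returns the value without any null check.
    private static string GetInitialized() => _name;

    // Hypothetical stand-in for the GetRandomName() used in the proposal.
    private static string GetRandomName() => Guid.NewGuid().ToString("N");
}

static class Program
{
    static void Main()
    {
        string first = NameProvider.Name;   // runs InitializeAndGet
        string second = NameProvider.Name;  // runs GetInitialized
        Console.WriteLine(first == second); // prints True
    }
}
```

Note that two threads racing on the very first call may both run the initializer, so the swap being atomic does not by itself make the initialization run exactly once.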

Syntax
Coming up with a syntax is harder. Let's look at a property getter that initializes a value when called for the first time:

private static string _name;
public static string Name{
	mutable get{
		if (_name == null){
			_name = GetRandomName();
			mutate(Name.get, initialized);
		}
		initialized:
		return _name;
	}
}

Here mutate is a keyword, Name.get is the method and initialized is the label to jump to or version name of the method.
Having multiple versions of a method can lead to duplicate code in the source. To avoid code duplication (let the compiler do that) we'd need some syntactic sugar. Let's have another example.

class MyClass
{
	private int _state;
	public int A { get; set; }
	public int B { get; set; }
	public int C { get; set; }
	public int State { get { return _state; } set { _state = value; } }
	public int GetValue()
	{
		int v;
		if (_state > 10)
		{
			v = A++;
			v *= 2;
		}
		else if (_state > 6)
		{
			v = B++;
			v *= 3;
		}
		else v = C++;
		return v;
	}
}

could be rewritten as:

public int State {
	get => _state;
	set {
		_state = value;
		if(_state > 10) mutate(GetValue, GetValue.versionA); // label
		else if(_state > 6) mutate(GetValue, GetValue.versionB); 
		else mutate(GetValue, default);
	}
}
		
public int mutable GetValue()
{
	int v;
	if version(versionA){
		v = A++;
		v *= 2;
	}
	else if version(versionB){
		v = B++;
		v *= 3;
	}
	else v = C++; // if version(!versionA && !versionB)
	return v;
}

GetValue() has 3 versions: default (unassigned name), versionA and versionB. The code for each is inferred.
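
For comparison, the same "pick the version when the state changes" behavior can be sketched in current C# with a per-instance delegate. This is an approximation under stated assumptions, not what the proposal would compile to: the `_getValue` field and the `GetValueA`, `GetValueB`, and `GetValueDefault` helpers are names invented for this sketch.

```csharp
using System;

class MyClass
{
    private int _state;
    private Func<int> _getValue; // per-instance "current version"

    public int A { get; set; }
    public int B { get; set; }
    public int C { get; set; }

    public MyClass() { _getValue = GetValueDefault; }

    public int State
    {
        get => _state;
        set
        {
            _state = value;
            // Select the version once, when the state changes.
            if (_state > 10) _getValue = GetValueA;
            else if (_state > 6) _getValue = GetValueB;
            else _getValue = GetValueDefault;
        }
    }

    // Callers no longer evaluate the state checks on each call.
    public int GetValue() => _getValue();

    private int GetValueA() { int v = A++; return v * 2; }
    private int GetValueB() { int v = B++; return v * 3; }
    private int GetValueDefault() => C++;
}

static class Program
{
    static void Main()
    {
        var m = new MyClass { A = 1, B = 1, C = 1 };
        Console.WriteLine(m.GetValue()); // prints 1 (default version)
        m.State = 11;
        Console.WriteLine(m.GetValue()); // prints 2 (version A)
        m.State = 7;
        Console.WriteLine(m.GetValue()); // prints 3 (version B)
    }
}
```

The state checks move into the setter, but each call now pays for a field load and an indirect call, and every setter assignment allocates a new delegate.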

svick · 2017-06-21T20:32:25Z

svick
Jun 21, 2017
Collaborator

To me, this looks like a lot of complicated machinery and syntax, just to avoid a single field load every time the method is called. Especially considering that, if the version really doesn't change often, branch prediction should work well for such code.

Also, how would this work for instance methods? You can't modify the code, since all instances share it.

0 replies

bondsbw · 2017-06-22T03:37:07Z

bondsbw
Jun 22, 2017

This looks like a coroutine. C# iterator methods already provide coroutines with a similar state machine mechanism.

The first block of code could be replaced with this today:

private string _name;
public IEnumerable<string> Name
{
    get
    {
        if (_name == null)
        {
            _name = GetRandomName();
        }
        
        while (true) yield return _name;
    }
}

...

// Calling the iterator

var n = Name.GetEnumerator();

// First time, executes the if block
n.MoveNext();
Console.WriteLine(n.Current);

// Subsequent calls execute after the previous "yield", i.e. loops
n.MoveNext();
Console.WriteLine(n.Current);
n.MoveNext();
Console.WriteLine(n.Current);

The call site is a bit awkward, which is where this proposal would work better.

I would like to propose a tweak to the iterator syntax to allow for non-iterator coroutines. This would give us more natural call syntax, with the familiar C# yield syntax:

private string _name;
public coroutine string Name
{
    get
    {
        if (_name == null)
        {
            _name = GetRandomName();
        }
        
        while (true) yield return _name;
    }
}

...

// Calling the coroutine

// First time, executes the if block
Console.WriteLine(Name);

// Subsequent calls execute after the previous "yield", i.e. loops
Console.WriteLine(Name);
Console.WriteLine(Name);

0 replies

ig-sinicyn · 2017-06-22T06:38:19Z

ig-sinicyn
Jun 22, 2017

@brandon942 For cases where the state can be cached in a static readonly field, you can implement the desired behavior right now. The JIT is smart enough to promote the value of a static readonly field to a constant, so it can eliminate the unneeded branches completely. Prooflink.
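
A minimal sketch of that pattern, assuming the state is known by the time the class is initialized. The `Config` and `Worker` types and the `USE_FAST_PATH` environment variable are made up for illustration, and whether the branch is actually removed depends on the runtime and JIT version.

```csharp
using System;

static class Config
{
    // Evaluated once, when Config is first initialized.
    // "USE_FAST_PATH" is a made-up switch for this sketch.
    public static readonly bool UseFastPath =
        Environment.GetEnvironmentVariable("USE_FAST_PATH") == "1";
}

static class Worker
{
    public static int Double(int x)
    {
        // When the JIT can treat UseFastPath as a constant, it may compile
        // only one of these two arms and drop the check entirely.
        if (Config.UseFastPath)
            return unchecked(x * 2); // fast path: no overflow check
        return checked(x * 2);       // default path: throws on overflow
    }
}

static class Program
{
    static void Main()
    {
        Console.WriteLine(Worker.Double(21)); // prints 42 on either path
    }
}
```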

0 replies

Joe4evr · 2017-06-22T09:21:12Z

Joe4evr
Jun 22, 2017

That post only specifies that it works for primitives, aka, the types that can already be expressed as const (which you might as well just use const for, in the first place). I doubt the JIT is capable of doing the same thing for complex types, though it is already good at branch elimination in other situations like generics.

0 replies

brandon942 · 2017-06-22T09:47:18Z

brandon942
Jun 22, 2017
Author

"how would this work for instance methods? You can't modify the code, since all instances share it."

@svick That is a good point. Unless the state used by the instance method is static it will have to be a register-indirect jump. Instances would have to store the address offset in a field and pass it to the method. Static classes would not need that.

@bondsbw yield methods return a reference-type Enumerator that contains at least 4 fields. The branching code is moved into the MoveNext() method. That is worse memory wise and does not increase performance. The proposal is all about performance.

0 replies

svick · 2017-06-22T10:17:18Z

svick
Jun 22, 2017
Collaborator

"Instances would have to store the address offset in a field and pass it to the method."

So how is that better than just using the original code with normal ifs? When you're trying to avoid a field load, you can't just replace it with a different field load.

0 replies

brandon942 · 2017-06-22T10:28:10Z

brandon942
Jun 22, 2017
Author

@svick You avoid the checks.

0 replies

Joe4evr · 2017-06-22T11:21:24Z

Joe4evr
Jun 22, 2017

I think @Opiumtm is right, this should be proposed to the JIT team. And in order for that, you would need to have some cold hard data about why your proposal would perform better in all situations than what it does currently.

For sheer example, what if branching is actually a really cheap thing to do and loading a field takes much more time overall (relatively speaking)? Then it would be useless to try and eliminate checks when it's not the main source of your code running slower than you'd like (though personally, it's all pretty negligible).

Without actual numbers of how much time each operation takes individually, it's not exactly good form to just claim "do this to make the language more performant!" Plus (again) most of performance should be handled at the CLR-level before needing a feature in the language that'll expose it.

0 replies

brandon942 · 2017-06-22T11:26:20Z

brandon942
Jun 22, 2017
Author

Btw, expanding on the idea and the good point made by @svick, it might be necessary to differentiate syntactically between local state dependent behavior and static state dependent behavior.

For that, methods marked as mutable static would have their version changed for all instances via mutatestatic(method, version), with full performance benefits.
Methods marked as mutable would be changed for a specific instance via mutate(instance, method, version), with smaller performance benefits. That call sets the instance field holding the address offset.

0 replies

mikedn · 2017-06-22T11:44:28Z

mikedn
Jun 22, 2017

"Unless the state used by the instance method is static it will have to be a register-indirect jump."

And then it's not any better than the original if-based code.

0 replies

Joe4evr · 2017-06-22T12:07:50Z

Joe4evr
Jun 22, 2017

Another concern would be the reliability. Your opening post starts with the following claim:

"Method behavior is often determined by states that need to be polled by the code within the method each time the method is called. In most cases states change less frequently than the method that checks them is called, or only once."

The "only once" part is a bit different from the rest: you'd be best off setting that state in your object constructor, so that the field can be readonly and you'll never have to check something about that state in other methods in the first place. (readonly is love, readonly is life!)
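
As a sketch of the "only once" case in current C#: the value is either fixed in the constructor, or deferred with `Lazy<T>` when it should not be computed up front. The `Person` and `LazyPerson` types and the body of `GetRandomName` are hypothetical, and `Lazy<T>` still performs its own internal check on each access.

```csharp
using System;

class Person
{
    // Assigned once in the constructor; later code never re-checks it.
    private readonly string _name;

    public Person(string name)
    {
        _name = name ?? NameSource.GetRandomName();
    }

    public string Name => _name;
}

class LazyPerson
{
    // Deferred variant: the factory runs on first access to Value only.
    private readonly Lazy<string> _name =
        new Lazy<string>(NameSource.GetRandomName);

    public string Name => _name.Value;
}

static class NameSource
{
    // Hypothetical stand-in for the GetRandomName() used in the proposal.
    public static string GetRandomName() => Guid.NewGuid().ToString("N");
}

static class Program
{
    static void Main()
    {
        var p = new Person("Ann");
        Console.WriteLine(p.Name);           // prints Ann
        var q = new LazyPerson();
        Console.WriteLine(q.Name == q.Name); // prints True
    }
}
```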

But if the field is mutable, and will change over the object lifetime, how exactly do you intend to keep track of when a check should be done or not for subsequent invocations of the method?

From your own example:

"initialized is the label to jump to or version name of the method."

Once a method has been set to jump to that label immediately, and something else changes the _name field back to null, how will the method know to run the check again? Or would you allow the method to just keep returning null from that point on until changed again (in which case it's no better than not having this syntax at all)?

Remember, the onus is on you to prove (likely by way of some example IL/ASM that would have to be emitted) that your proposal will make code perform so much better, yet remains reliable in all cases that it is worth the time and money for Microsoft to implement.

0 replies

Logerfo · 2017-06-22T12:41:32Z

Logerfo
Jun 22, 2017

I like this, but I don't think your syntax would be valid. The way you presented it suggests that the first version of the method shares its scope with the second version, which is not possible.

private static string _name;
public static string Name{
	mutable get{
		int x = 2; //new variable
		if (_name == null){
			_name = GetRandomName();
			mutate(Name.get, initialized);
		}
		initialized:
		x++; //out of scope
		return _name;
	}
}

0 replies

Opiumtm · 2017-06-22T12:58:36Z

Opiumtm
Jun 22, 2017

I have some crude benchmark on this proposal.

Baseline "if" scenario

        private void RunBenchmark()
        {
            gcnt = 0;
            int c = 0;
            bool state = false;
            for (var i = 0; i < Consts.RunCount; i++)
            {
                c++;
                if (c >= 100)
                {
                    c = 0;
                    state = !state;
                }
                if (state)
                {
                    gcnt += 1;
                }
                else
                {
                    gcnt += 2;
                }
            }
        }

"Delegate" scenario (emulating change of routine address when state is changed at 1/100 rate to calls) using static delegates.

        private void RunBenchmark()
        {
            gcnt = 0;
            int c = 0;
            bool state = false;
            BranchDelegate branch = FalseDelegate;
            for (var i = 0; i < Consts.RunCount; i++)
            {
                c++;
                if (c >= 100)
                {
                    c = 0;
                    state = !state;
                    branch = state ? TrueDelegate : FalseDelegate;
                }
                branch(this);
            }
        }

        private static readonly BranchDelegate TrueDelegate = TrueBranch;
        private static readonly BranchDelegate FalseDelegate = FalseBranch;

        private static void TrueBranch(DelegateBranchCallScenario thisObj)
        {
            thisObj.gcnt += 1;
        }

        private static void FalseBranch(DelegateBranchCallScenario thisObj)
        {
            thisObj.gcnt += 2;
        }

        private delegate void BranchDelegate(DelegateBranchCallScenario thisObj);

Scenario set "Branch hypothesis (if vs delegate)"
=======================================================
If branch call: run count = 1000000000, totalTime(ms) = 969, time per run(microsec) = 0.0010
Delegate branch call: run count = 1000000000, totalTime(ms) = 4063, time per run(microsec) = 0.0041, % to baseline = 419.30%

No performance benefits. Actual results for such "optimizations" are much worse.

Tested on x64 release build, .NET Native UWP, Core i7 CPU.

Benchmark results for JIT-ed x64 debug build:

Scenario set "Branch hypothesis (if vs delegate)"
=======================================================
If branch call: run count = 1000000000, totalTime(ms) = 4125, time per run(microsec) = 0.0041
Delegate branch call: run count = 1000000000, totalTime(ms) = 5906, time per run(microsec) = 0.0059, % to baseline = 143.18%

Not as dramatically worse, but still no benefit at all.

JIT and .NET Native are already good at optimizations of simple and trivial branching.

Update
OK, maybe it's all because of passing the "this" parameter.
Let's try making the gcnt field static and using parameterless static delegates.

        public static ulong gcnt = 0;

        private static readonly BranchDelegate TrueDelegate = TrueBranch;
        private static readonly BranchDelegate FalseDelegate = FalseBranch;

        private static void TrueBranch()
        {
            gcnt += 1;
        }

        private static void FalseBranch()
        {
            gcnt += 2;
        }

        private delegate void BranchDelegate();

Here are results on x64 release .NET Native UWP:

Scenario set "Branch hypothesis (if vs delegate)"
=======================================================
If branch call: run count = 1000000000, totalTime(ms) = 1031, time per run(microsec) = 0.0010
Delegate branch call: run count = 1000000000, totalTime(ms) = 5156, time per run(microsec) = 0.0052, % to baseline = 500.10%

Some final thoughts on this.
Trivial branching code can obviously be inlined by the JIT or .NET Native, or at least the method can be marked with an attribute that hints at aggressive inlining.

Any "jump rewrite" scenario obviously cannot be inlined by the compiler or the JIT, so it does more harm than good.
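
A short sketch of the inlining hint mentioned above, using `MethodImplOptions.AggressiveInlining` from `System.Runtime.CompilerServices`. The `Counter` type is invented for this example, and the attribute is only a request: the JIT may still decline to inline.

```csharp
using System;
using System.Runtime.CompilerServices;

static class Counter
{
    public static ulong Total;

    // A hint, not a guarantee, that call sites should get the body inlined.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static void Add(bool state)
    {
        if (state) Total += 1;
        else Total += 2;
    }
}

static class Program
{
    static void Main()
    {
        bool state = false;
        for (int i = 0; i < 1000; i++)
        {
            // Flip the state every 100 iterations, as in the benchmark.
            if (i > 0 && i % 100 == 0) state = !state;
            Counter.Add(state);
        }
        Console.WriteLine(Counter.Total); // prints 1500
    }
}
```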

0 replies

brandon942 · 2017-06-22T14:04:40Z

brandon942
Jun 22, 2017
Author

@Joe4evr It's up to the programmer to decide when a state has changed. In the mentioned case you can switch the getter version back in the setter.
@Logerfo Indeed.
@Opiumtm I'm not sure if the if vs delegate scenario is applicable - delegates do have some overhead - but you brought up a great point: Inlining.

Can mutable methods be inlined? They can if you store the addresses of every call site in a list. Changing the method version means changing the code on every call site in the same way. Despite the memory cost of inlined code I believe that the performance gains add up and become very significant in the end.

0 replies

Opiumtm · 2017-06-22T15:06:03Z

Opiumtm
Jun 22, 2017

@brandon942
Delegates can be thought of as a crude (?) model for this proposal.
The actual jump address must be stored somewhere, and a delegate is something close to that.

Again, some benchmarks on direct method call (inlined), direct method call (not inlined), via delegate call (static and instance).

.NET Native Release x64

.NET direct call: run count = 100000000, totalTime(ms) = 188, time per run(microsec) = 0.0019
.NET direct call no inline: run count = 100000000, totalTime(ms) = 344, time per run(microsec) = 0.0034, % to baseline = 182.98%
.NET delegate call: run count = 100000000, totalTime(ms) = 547, time per run(microsec) = 0.0055, % to baseline = 290.96%
.NET static delegate call: run count = 100000000, totalTime(ms) = 906, time per run(microsec) = 0.0091, % to baseline = 481.91%

JIT-ed x64:

.NET direct call: run count = 100000000, totalTime(ms) = 797, time per run(microsec) = 0.0080
.NET direct call no inline: run count = 100000000, totalTime(ms) = 766, time per run(microsec) = 0.0077, % to baseline = 96.11%
.NET delegate call: run count = 100000000, totalTime(ms) = 938, time per run(microsec) = 0.0094, % to baseline = 117.69%
.NET static delegate call: run count = 100000000, totalTime(ms) = 828, time per run(microsec) = 0.0083, % to baseline = 103.89%

On the JIT-ed runtime, the time to invoke via a delegate is almost the same as a direct call (3.89% for static delegates isn't a number to worry about).
So it can be safely assumed that, at least on the JIT-ed runtime, "delegate emulation" is a fair model.
But the "crude emulation" of an address rewrite with delegates performs considerably worse on the JIT-ed runtime too.

On statically compiled .NET Native, call inlining plays a major role (almost 2x worse with the "no inline" attribute on the called method). Delegate costs are lower than the "no inline" costs.

0 replies

mikedn · 2017-06-22T15:25:55Z

mikedn
Jun 22, 2017

Delegate calls are certainly costlier than the simple jump that was suggested. It's not the delegate call itself that's the problem; it's the fact that you're actually calling a method, and that involves argument passing and frame setup.

Still, in the case of instance methods, where the jump has to be indirect, there's little chance that doing this would be faster than the original if. The jump does a memory load just like the original code does. And the cost of that jump instruction is not 0: it's an instruction like any other and requires some CPU resources (1 uop, 2 cycle latency on a Skylake for example).

This is a rather "strange" optimization that's very costly to implement and has very little value compared to a zillion other optimizations that could be added to the JIT.

0 replies

bondsbw · 2017-06-22T23:35:46Z

bondsbw
Jun 22, 2017

@brandon942

"@bondsbw yield methods return a reference-type Enumerator that contains at least 4 fields. The branching code is moved into the MoveNext() method. That is worse memory wise and does not increase performance. The proposal is all about performance."

My proposed syntax does not rely on an enumerator. Why not optimize it like we are discussing here? It could be compiled into pretty much the same IL as the main proposal.

0 replies

scalablecory · 2017-06-30T15:28:59Z

scalablecory
Jun 30, 2017

I do believe the ability to hotpatch methods is an interesting one, but I agree with others that this won't help the problem of optimization.

Consider that you're only going to save two instructions. The if will be:

mov eax, [_name]
test eax, eax
jnz initialized

While the mutable version would be:

jmp [_func]

Both versions perform the same number of memory loads, and both will have zero latency if branch prediction / branch target prediction gets a hit.

Want to benchmark? Write something in C. An optimizing compiler will translate a function pointer call with identical arguments & calling convention into a jmp.

0 replies

Replies: 18 comments