[CompilerPerf] Faster equality in generic contexts #5112

manofstick · 2018-06-06T09:37:43Z

Rising from the ashes of #513 (3 years later!) comes a new implementation. The new implementation is much simpler, albeit with a smaller scope (I'm no longer attempting to handle Tuples, no longer wandering into forbidden code generation - even if it was only MakeGenericType).

So what does this do? It increases performance of Equals, CompareTo, GetHashCode when used in a generic context. In a non-generic context the F# compiler inserts efficient code through a combination of inline IL, statically resolved type parameters and compiler optimizations (lots of magic).

So what is a generic context?

A type which has generated equality/comparison such as a record type:

type Thing<'T> = {
    Field : 'T
}

or

let f<'a when 'a : equality> =
    let comp = HashIdentity.Structural<'a>
    ...

So any of the containers, such as Map, Set, dict, as well as other places in the standard library that aren't handled by inline functions, also have potential for improvement.
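To make "generic context" concrete, here is a minimal sketch (not code from the PR). The function name `countDistinct` is hypothetical; `HashIdentity.Structural` and `HashSet` are the standard FSharp.Core and .NET APIs. Inside the function the element type is not known at compile time, so the equality implementation has to be resolved at run time rather than inlined.

```fsharp
open System.Collections.Generic

// A record with compiler-generated structural equality over a generic field.
type Thing<'T> = { Field : 'T }

// Hypothetical helper: 'a is unknown here at compile time, so equality
// goes through a comparer object instead of inlined IL.
let countDistinct<'a when 'a : equality> (items : seq<'a>) : int =
    let comparer = HashIdentity.Structural<'a>
    let seen = HashSet<'a>(comparer)
    for item in items do
        seen.Add item |> ignore
    seen.Count

// The first two records are structurally equal, so two distinct items remain.
let n = countDistinct [ { Field = 1 }; { Field = 1 }; { Field = 2 } ]
printfn "%d" n   // 2
```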

abelbraaksma · 2018-06-07T10:08:57Z

Very, very interesting. I rely heavily on CustomComparisonAttribute and CustomEqualityAttribute; these require me to implement the non-generic IComparable interface and an Equals (plus GetHashCode()) overload, which requires (expensive?) matching on the type to get right.

Would this improve such situations as well, or would existing code benefit if it also implemented the generic IEqualityComparer<'T> (or IEquatable<'T>) and/or IComparer<'T> interfaces? Or are custom comparers/equals out of scope?

dsyme · 2018-06-07T11:17:05Z

This is getting close to the territory we would accept, if backed up by perf results

Here are some current failures:


****failure on [nan; 1.0] [nan; 1.0]


................TEST FAILED...............


****failure on [nan; 1.0] [nan; 1.0]


................TEST FAILED...............


****failure on ("Foo", {h = 5.9;
         w = nan;}) ("Foo", {h = 5.9;
         w = nan;})


................TEST FAILED...............


****failure on ("Foo", {h = 5.9;
         w = nan;}) ("Foo", {h = 5.9;
         w = nan;})


................TEST FAILED...............


****failure on [nanf; 1.0f] [nanf; 1.0f]


................TEST FAILED...............


****failure on [nanf; 1.0f] [nanf; 1.0f]


................TEST FAILED...............


****failure on ("Foo", {h = 5.9000001f;
         w = nanf;}) ("Foo", {h = 5.9000001f;
         w = nanf;})


................TEST FAILED...............


****failure on ("Foo", {h = 5.9000001f;
         w = nanf;}) ("Foo", {h = 5.9000001f;
         w = nanf;})


................TEST FAILED...............
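Every failing case above contains a NaN. As a hedged illustration (this is not the failing test itself, and the binding names are made up), F# distinguishes two equality semantics: the `=` operator, under which NaN is never equal to itself, and the `Equals`-style semantics, under which NaN equals NaN. A sketch using only standard FSharp.Core functions:

```fsharp
// PER semantics, as used by the = operator: NaN is not equal to itself,
// so a structure containing NaN is not equal to itself either.
let perScalar = (nan = nan)
let perList = LanguagePrimitives.GenericEquality [nan; 1.0] [nan; 1.0]

// ER semantics, as used by Object.Equals: NaN is equal to NaN.
let erScalar = nan.Equals(nan)
let erList = LanguagePrimitives.GenericEqualityER [nan; 1.0] [nan; 1.0]

printfn "%b %b %b %b" perScalar perList erScalar erList   // false false true true
```

Any faster generic equality path has to preserve whichever of these two behaviours the call site originally had.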

manofstick · 2018-06-07T20:50:38Z

@abelbraaksma

If you create a sealed type with [<CustomEquality>] and provide IEquatable<'T> but don't provide IStructuralEquatable, then after this PR equality will use System.Collections.Generic.EqualityComparer<'T>.Default. This can make a real difference (faster) for value types. The same rules will apply for comparison, but that is not implemented yet. I think it would be helpful to add a new attribute which removes the implementation of IStructuralEquatable and IStructuralComparable so that it could benefit (and really it's a bit of a misnomer, as you still get structural equality through the standard Equals; what is different is how they handle NaNs). There is a little bit more to this story, but basically that would be how the 99% would be affected, and even then there are, from memory, inconsistencies with generic types, etc. I posted some of these ages ago; I can probably dig them up.
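A minimal sketch of the kind of type described above, assuming a struct record; the `Meters` type is hypothetical and not taken from the PR. It uses [<CustomEquality>], implements IEquatable<'T>, and does not implement IStructuralEquatable, so EqualityComparer<'T>.Default can dispatch to the typed Equals.

```fsharp
open System
open System.Collections.Generic

// Hypothetical value type for illustration only.
[<Struct; CustomEquality; NoComparison>]
type Meters =
    { Value : float }
    // Untyped equality, required when CustomEquality is used.
    override this.Equals (other : obj) =
        match other with
        | :? Meters as m -> this.Value = m.Value
        | _ -> false
    override this.GetHashCode () = hash this.Value
    // Typed equality: this is what EqualityComparer<Meters>.Default picks up.
    interface IEquatable<Meters> with
        member this.Equals (other : Meters) = this.Value = other.Value

let comparer = EqualityComparer<Meters>.Default
let same = comparer.Equals({ Value = 1.5 }, { Value = 1.5 })
printfn "%b" same   // true
```

Whether a given call site in FSharp.Core actually takes this route depends on the rules implemented in the PR; the sketch only shows the shape of a type that would be eligible.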

manofstick · 2018-06-07T20:54:08Z

@dsyme

Sorry, WIP, so I will bash my way through failures as I get time...

Oh, currently this is just equality (GetHashCode/Equals + associated operators) but I plan on getting to comparison (CompareTo + associated operators). Should that be done as a separate PR? I think there is value in that, shrinking the surface area of an individual change?

dsyme · 2018-06-08T16:46:28Z

Oh, currently this is just equality (GetHashCode/Equals + associated operators) but I plan on getting to comparison (CompareTo + associated operators). Should that be done as a separate PR? I think there is value in that, shrinking the surface area of an individual change?

If the technique is the same then do it in the same PR

manofstick · 2018-06-09T04:45:39Z

@dsyme

This is getting close to the territory we would accept

Hahahah! I guess I have been a little... ummm... adventurous (?) with my PRs. Anyway, I always kind of intended them to be group efforts and really I see myself throwing out ideas as much as anything... But hey, after 3 years I guess this just isn't the way that this open source stuff works. Get an idea and polish it off yourself or be damned :-) (obviously a bit tongue in cheek - I didn't have a number of helpers on the Seq work...)

if backed up by perf results

Let's drag up some of the old tests from the previous PRs... (micro benchmarks, blah, blah, blah - I haven't run these extensively, numbers jump around, looking for overall trend... at the end I realized I shouldn't have had as many significant figures in the %s as really everything is +/- a few %... But anyway...)

In #574 we are linking to this gist:

Test	Old 64-bit	New 64-bit	New/Old 64-bit	Old 32-bit	New 32-bit	New/Old 32-bit
custom dynamic	188.67	113.67	60.25%	452.67	106.17	23.45%
custom structural	190	136.17	71.67%	457	221.83	48.54%
custom default	119.83	117.83	98.33%	110.83	109	98.35%
value dynamic	864.17	917.67	106.19%	2487.83	2403	96.59%
value structural	483.33	481.83	99.69%	548.17	542.83	99.03%
value default	451	445.83	98.85%	447.67	445.33	99.48%
gen value dynamic	1846.33	1052.67	57.01%	5303	2639.33	49.77%
gen value structural	1397	629.5	45.06%	3262.5	920.67	28.22%
gen value default	1150.67	447.17	38.86%	2018	477	23.64%
ref dynamic	959.17	1242.5	129.54%	2469.33	2344.33	94.94%
ref structural	706.83	699.33	98.94%	697.33	698.17	100.12%
ref default	751.5	742	98.74%	773	751.5	97.22%
gen ref dynamic	1996.83	1396.17	69.92%	5352.5	2550	47.64%
gen ref structural	1726.5	921.67	53.38%	3907	1412.17	36.14%
gen ref default	1732	815.83	47.10%	2597.17	878.67	33.83%
tuple dynamic	565	594.83	105.28%	1157	1163.33	100.55%
tuple structural	245.67	242.5	98.71%	253.83	251.67	99.15%
tuple default	682.5	676.83	99.17%	761.67	753	98.86%
value tuple dynamic	391	147.67	37.77%	988	144.83	14.66%
value tuple structural	386.17	147.33	38.15%	981.83	143.67	14.63%
value tuple default	146.83	147.17	100.23%	151	145	96.03%

In #549 where I've copied the code to this gist:

Test	Old 64-bit	New 64-bit	New/Old 64-bit	Old 32-bit	New 32-bit	New/Old 32-bit
seqGroupBy	1710	674	39.42%	1011	986	97.53%
seqCountBy	2793	676	24.20%	1391	1296	93.17%
listGroupBy	1701	716	42.09%	857	758	88.45%
listCountBy	2490	457	18.35%	1056	949	89.87%
arrayCountBy	2340	304	12.99%	892	818	91.70%
arrayGroupBy	1241	217	17.49%	470	432	91.91%
arrayCountBy	870	384	44.14%	544	525	96.51%

In #930 where I've copied the code to this gist:

Test	Old 64-bit	New 64-bit	New/Old 64-bit	Old 32-bit	New 32-bit	New/Old 32-bit
Perf	11149	4822	43.25%	20301	5512	27.15%

In #513 where I've copied the code to this gist:

Test	Old 64-bit	New 64-bit	New/Old 64-bit	Old 32-bit	New 32-bit	New/Old 32-bit
StructureInt	2578	2639	102.37%	2785	2664	95.66%
StructureGeneric	5635	2891	51.30%	8990	3083	34.29%

More in #513 where I've copied the code to this gist:

Test	Old 64-bit	New 64-bit	New/Old 64-bit	Old 32-bit	New 32-bit	New/Old 32-bit
Perf	503	301	59.84%	847	288	34.00%

More in #513 where I've copied the code to this gist:

Test	Old 64-bit	New 64-bit	New/Old 64-bit	Old 32-bit	New 32-bit	New/Old 32-bit
Perf	3557	710	19.96%	7510	915	12.18%

manofstick · 2018-06-09T06:44:10Z

@dsyme

(seems to be a build server issue, and builds/tests on my machine, and the build before the failure had a 120 minute timeout and thus was incomplete... that wasn't this PR, but something to do with Span...)

Anyway, I'd be happy if I could leave this here for the moment. i.e. I have intentions of doing the comparison side of things, but this stands on its own, and I don't think I currently have the time to do the work (from my, admittedly old, memory I did an initial phase of refactoring out the exception throwing behaviour and I think a bit more refactoring before being able to effectively do the same process as here)

So I'm changing the title of this PR to reflect that it is just equality... and also that it isn't a WIP anymore...

dsyme · 2018-06-11T10:58:40Z

tests/fsharpqa/Source/Optimizations/GenericComparison/Equals09.il.bsl


        .line 16707566,16707566 : 0,0 ''
        IL_000a:  ldarg.0
        IL_000b:  ldloc.0
-        IL_000c:  tail.


I'm not totally comfortable with these tailcalls disappearing - have we discussed this? Thanks

dsyme · 2018-06-11T13:32:51Z

@manofstick I see we talked about the tailcall removal in the context of #513: I added this issue here to track it #946.

My specific reply was here: #946 (comment)

Actually maybe this isn't too bad after all. It seems that the particular form listed above is possibly the only case that currently works and after #513 doesn't.

... If a generic argument is instantiated to a record or union in a way that makes the type recursive, then it currently works, but now doesn't..... I would be willing to accept that limitation (he says, with his fingers crossed behind his back)......

I am also willing to accept this limitation. I think the best thing is to submit testing that pins down this behaviour in positive cases systematically.

Still, I wonder if we should make the tailcall removal a separate PR? In case we needed to revert it at a later point?

Thanks!
Don

dsyme · 2018-06-11T20:55:50Z

@dotnet-bot test this please

manofstick · 2018-06-12T07:49:17Z

@dsyme

Hmm. Removing the tail call avoiding code does smash performance, in many cases making things worse than they were.

The 64-bit JIT now appears to do tail calls internally, as what would have caused a stack overflow under 32-bit doesn't under 64-bit.

I create some example types here.

Anyway, I then modified all calls in FSharp.Core to be of the form:

        let GenericHashIntrinsic input =
            if Environment.Is64BitProcess
            then avoid_tail_call_int (fun () -> GenericHashT<'T>.GetHashCode input)
            else GenericHashT<'T>.GetHashCode input

Which, albeit a bit ugly, does work.

Thoughts?

zpodlovics · 2018-06-12T15:15:56Z

@manofstick Yes, tail calls prevent inlining - hopefully will be revisited soon:

https://github.com/dotnet/coreclr/issues/18361#issuecomment-396060473
https://github.com/dotnet/coreclr/issues/18406

mrange · 2018-06-12T17:24:48Z

@manofstick I applaud your efforts in trying to optimize the big and the small! It's a dangerous game you play (might break things) but someone has to do it :)

Btw in the referenced issue I actually saw performance drop when suppressing .tail. It's not easy to predict performance.

AndyAyersMS · 2018-06-12T17:47:51Z

I would suggest leaving the tail prefix alone as not all jits are going to be able to recognize tail call opportunities without this prefix.

For cases where performance might improve without tail I suspect the real issue is that tail blocks the jit (at least in RyuJit, likely also jit32) from inlining:

the jit will not inline at call sites with a tail prefix
the jit will not inline methods that contain call sites with tail prefixes.

These are things we can revisit on .NET Core (see dotnet/coreclr#18406).

Since we generally only see tail prefixes from F# code it would be great to have more examples like dotnet/coreclr#18361 to look at.
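For readers following the tail-call discussion, a small sketch of what "tail position" means in F# source; the functions `isEven`, `isOdd` and `depth` are hypothetical examples, not code from the PR. Calls in tail position are the ones for which the F# compiler may emit the tail prefix when tail calls are enabled, and whether deep recursion then survives depends on the JIT honouring it.

```fsharp
// Mutually recursive functions: each call to the other function is the
// last thing done, so it is in tail position.
let rec isEven (n : int) : bool =
    if n = 0 then true else isOdd (n - 1)
and isOdd (n : int) : bool =
    if n = 0 then false else isEven (n - 1)

// Not in tail position: the addition runs after the recursive call
// returns, so this call always consumes a stack frame.
let rec depth (n : int) : int =
    if n = 0 then 0 else 1 + depth (n - 1)

printfn "%b %d" (isEven 1000) (depth 1000)   // true 1000
```

The inputs are kept small so the sketch behaves the same with or without tail calls; only the first pair of functions can benefit from them.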

manofstick · 2018-06-13T09:49:12Z

OK; had some good progress. Basically scrapped everything (well except the core idea) and re-implemented.

So, to make everyone happy I have restored tail calls, which now don't have as much of an impact on performance as I have managed to drop one level of call indirection in some cases - but looking with great anticipation for when @AndyAyersMS can implement some optimizations in the JIT!

@mrange - yes, yes I am a sucker for punishment :-) But there is a fairly significant set of tests I put in when I was working on #513, so hopefully have a reasonable degree of confidence...

dsyme

This looks very promising, will review more completely. Can you report latest performance results? thanks

dsyme · 2018-06-13T12:07:55Z