
Conversation

@stephentoub (Member)

In looking through usage data, it's pretty common to call Enumerable.Min/Max with an array. At the cost of a type check when the source isn't an array, we can speed up the cases where it is, at a minimum by avoiding interface dispatch and in the best case by vectorizing. I've done so for int/long. We could look at doing so for float/double as well, but I wasn't sure about the impact of the NaN-related logic on vectorization.
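For reference, here's a minimal sketch of the shape of the fast path (illustrative only; the actual PR code differs in details such as tail handling and the Max counterpart, and `MinOfArray` is a hypothetical name):

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;

internal static class MinMaxSketch
{
    public static int Min(IEnumerable<int> source)
    {
        // One type check buys the fast path for the common array case.
        if (source is int[] array)
        {
            return MinOfArray(array);
        }

        // Otherwise fall back to (a simplified version of) the existing
        // enumerator-based path, paying interface dispatch per element.
        using IEnumerator<int> e = source.GetEnumerator();
        if (!e.MoveNext())
        {
            throw new InvalidOperationException("Sequence contains no elements");
        }
        int min = e.Current;
        while (e.MoveNext())
        {
            min = Math.Min(min, e.Current);
        }
        return min;
    }

    private static int MinOfArray(int[] array)
    {
        if (array.Length == 0)
        {
            throw new InvalidOperationException("Sequence contains no elements");
        }

        if (Vector.IsHardwareAccelerated && array.Length >= Vector<int>.Count * 2)
        {
            // Keep per-lane minimums in a vector, then reduce across lanes.
            var mins = new Vector<int>(array);
            int i = Vector<int>.Count;
            int lastBlock = array.Length - Vector<int>.Count;
            for (; i <= lastBlock; i += Vector<int>.Count)
            {
                mins = Vector.Min(mins, new Vector<int>(array, i));
            }

            int value = mins[0];
            for (int lane = 1; lane < Vector<int>.Count; lane++)
            {
                value = Math.Min(value, mins[lane]);
            }

            // Scalar tail for any elements the vector loop didn't cover.
            for (; i < array.Length; i++)
            {
                value = Math.Min(value, array[i]);
            }
            return value;
        }

        int result = array[0];
        for (int j = 1; j < array.Length; j++)
        {
            result = Math.Min(result, array[j]);
        }
        return result;
    }
}
```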

| Method | Toolchain | Length | Array | Mean | Error | Ratio |
|--------|-----------|-------:|-------|-------------:|-----------:|------:|
| Min | \main\corerun.exe | 1 | False | 31.444 ns | 0.6275 ns | 1.00 |
| Min | \pr\corerun.exe | 1 | False | 31.211 ns | 0.3926 ns | 0.99 |
| Min | \main\corerun.exe | 1 | True | 14.049 ns | 0.0917 ns | 1.00 |
| Min | \pr\corerun.exe | 1 | True | 1.912 ns | 0.0116 ns | 0.14 |
| Min | \main\corerun.exe | 4 | False | 44.083 ns | 0.5091 ns | 1.00 |
| Min | \pr\corerun.exe | 4 | False | 44.900 ns | 0.4524 ns | 1.02 |
| Min | \main\corerun.exe | 4 | True | 24.681 ns | 0.1511 ns | 1.00 |
| Min | \pr\corerun.exe | 4 | True | 3.553 ns | 0.0647 ns | 0.14 |
| Min | \main\corerun.exe | 16 | False | 108.921 ns | 1.0189 ns | 1.00 |
| Min | \pr\corerun.exe | 16 | False | 112.215 ns | 0.5962 ns | 1.03 |
| Min | \main\corerun.exe | 16 | True | 74.743 ns | 0.4251 ns | 1.00 |
| Min | \pr\corerun.exe | 16 | True | 4.557 ns | 0.0403 ns | 0.06 |
| Min | \main\corerun.exe | 64 | False | 333.232 ns | 1.3208 ns | 1.00 |
| Min | \pr\corerun.exe | 64 | False | 331.662 ns | 6.6332 ns | 0.99 |
| Min | \main\corerun.exe | 64 | True | 251.663 ns | 5.0050 ns | 1.00 |
| Min | \pr\corerun.exe | 64 | True | 7.923 ns | 0.0847 ns | 0.03 |
| Min | \main\corerun.exe | 256 | False | 1,275.711 ns | 25.5032 ns | 1.00 |
| Min | \pr\corerun.exe | 256 | False | 1,220.111 ns | 24.3144 ns | 0.96 |
| Min | \main\corerun.exe | 256 | True | 932.077 ns | 2.8967 ns | 1.00 |
| Min | \pr\corerun.exe | 256 | True | 21.124 ns | 0.1536 ns | 0.02 |
| Min | \main\corerun.exe | 1024 | False | 4,951.882 ns | 92.7338 ns | 1.00 |
| Min | \pr\corerun.exe | 1024 | False | 4,673.227 ns | 41.3033 ns | 0.94 |
| Min | \main\corerun.exe | 1024 | True | 3,753.438 ns | 50.7476 ns | 1.00 |
| Min | \pr\corerun.exe | 1024 | True | 86.373 ns | 0.3499 ns | 0.02 |

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Program
{
    public static void Main(string[] args) => BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

    [Params(1, 4, 16, 64, 256, 1024)]
    public int Length { get; set; }

    [Params(false, true)]
    public bool Array { get; set; }

    private IEnumerable<int> _source;

    [GlobalSetup]
    public void Setup()
    {
        _source = Enumerable.Range(1, Length).Reverse();
        if (Array)
        {
            _source = _source.ToArray();
        }
    }

    [Benchmark]
    public int Min() => _source.Min();
}
```

@stephentoub stephentoub added area-System.Linq tenet-performance Performance related issue labels Jan 28, 2022
@stephentoub stephentoub added this to the 7.0.0 milestone Jan 28, 2022
@ghost ghost assigned stephentoub Jan 28, 2022
@ghost commented Jan 28, 2022

Tagging subscribers to this area: @dotnet/area-system-linq
See info in area-owners.md if you want to be subscribed.

```csharp
    return value;
}

private static int Max(int[] array)
```
Member

So what I'd really like to see here is for many of these methods to be available on Span<T> and for us to forward this to the Span implementation.

Exposing new public APIs is of course something that needs to go through API proposal, but I think it would be better long term, would also allow more logic sharing for various helper APIs, and would centralize most of the "array-like SIMD algorithms" in the same place in the BCL.

Member Author


I'd be fine subsequently seeing such methods on MemoryExtensions and deleting the code from here to delegate there instead. Someone would need to do the work to figure out what that set of methods should be. I also don't think it precludes adding this now and then delegating later if/when the methods exist.

Member


Right, I just wanted to call out what I'd like to see here, and I think just about any method we think is worth special-casing for arrays in LINQ is applicable.

Sum/Min/Max are the "most obvious ones"
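For illustration, the centralization being discussed could eventually look something like the sketch below (hypothetical: MemoryExtensions does not expose Min/Max today, and `SpanMin` stands in for whatever that future span-based API would be; scalar-only to keep it short):

```csharp
using System;

internal static class SpanForwardingSketch
{
    // Hypothetical span-based helper; the idea is that one vectorizable
    // implementation like this would live in a central place (e.g.
    // MemoryExtensions) and everything array-like would forward to it.
    public static int SpanMin(ReadOnlySpan<int> span)
    {
        if (span.IsEmpty)
        {
            throw new InvalidOperationException("Sequence contains no elements");
        }

        int min = span[0];
        for (int i = 1; i < span.Length; i++)
        {
            min = Math.Min(min, span[i]);
        }
        return min;
    }

    // LINQ's array special case would then collapse to a one-line forward.
    public static int Min(int[] array) => SpanMin(array);
}
```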


```csharp
// Vectorize the search if possible.
int index, value;
if (Vector.IsHardwareAccelerated && array.Length >= Vector<int>.Count * 2)
```
Member


Do we know how big the average input here is? This is going to give us different perf characteristics between Arm64 and x64, because the former will trigger for 8+ element arrays (32 bytes) while the latter will only trigger for 16+ element arrays (64 bytes).

I'd expect we want similar handling in both, and while manual unrolling may pipeline well on x64, that may not be the case on Arm64 or on all x64-based chipsets.

Generally, using Vector256 is beneficial for very large inputs where you can saturate the decoder and memory pipeline, but less beneficial for "small" inputs (particularly on older hardware where cache-line splits and throughput may be more limited).

Member Author


> Do we know how big the average input here is?

I'm not sure.

Are you suggesting the default stance should be to use Vector128<T> instead of Vector<T> unless you know big inputs are common? Is the tradeoff you're describing between potentially 2x throughput on larger sizes (if you use Vector<T> and end up employing 256-bit operations instead of 128-bit operations) vs. vectorizing smaller sizes (if you use Vector<T>, you can't vectorize the processing of 200 bits at all, but could have vectorized a little of it with Vector128<T>)?

Member Author


(This is why I opened #64409. We need really clear guidance on when to use which.)

Member


> Are you suggesting the default stance should be to use Vector128<T> instead of Vector<T> unless you know big inputs are common? Is the tradeoff you're describing between potentially 2x throughput on larger sizes (if you use Vector<T> and end up employing 256-bit operations instead of 128-bit operations) vs. vectorizing smaller sizes (if you use Vector<T>, you can't vectorize the processing of 200 bits at all, but could have vectorized a little of it with Vector128<T>)?

Yes. There is a lot of nuance here, such as Arm64 not having Vector256<T> and often having different pipelining characteristics. Even just considering x64, there are various things to weigh, such as the expected target hardware: modern chips (Skylake and Zen 2 or later) tend to perform really well with V256, whereas older ones (Haswell/Broadwell or Zen 1) tend to behave worse and may incur downclocking, may not support dual dispatch, and may even emulate 256-bit support as 2x128-bit ops.

I'd generally expect that we do something like:

- `Vector128<T>` is the default choice and is used for practically any input that is at least `Count` long.
  - This is the path that will light up basically "everywhere" (Arm64, x64, x86; eventually WASM, etc.).
  - It may or may not be worth it to specialize cases where we are doing fewer than 4-7 loop iterations, due to the default branch-prediction behavior.
- `Vector256<T>` is used for cases with at least 128 bytes. This is also the cutoff many memcpy algorithms use before they start considering non-temporal or other store sizes.
  - This path will only light up on x86/x64 and can be impacted by things like alignment, whether it's going to split the cache every other read/write, the instructions being used, and the underlying data size.

There are likely cases where microbenchmarks or modern hardware (roughly Skylake and Zen 2 or newer) will show Vector256<T> is better. However, I expect that a good bit of hardware still in the wild (for both servers and consumers) isn't that, and we'll need to take it into consideration. Likewise, the newer Alder Lake CPUs bring in considerations where the Vector256<T> support is slower on the energy-efficient cores (vpmaxud is 1 latency and 0.5 throughput for V128 and V256 on the performance cores, but on the efficiency cores it is 1 latency, 0.33 throughput for V128 and 2 latency, 0.67 throughput for V256).
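A sketch of that tiering, assuming the cross-platform Vector128/Vector256 APIs coming in .NET 7 (the cutoffs and names here are illustrative, not from this PR):

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

internal static class TieredMinSketch
{
    public static int Min(ReadOnlySpan<int> span)
    {
        if (span.IsEmpty)
        {
            throw new InvalidOperationException("Sequence contains no elements");
        }

        ref int start = ref MemoryMarshal.GetReference(span);

        // Tier 2: 256-bit vectors only for inputs of at least 128 bytes (32 ints).
        if (Vector256.IsHardwareAccelerated && span.Length >= 32)
        {
            nuint last = (nuint)span.Length - (nuint)Vector256<int>.Count;
            Vector256<int> mins = Vector256.LoadUnsafe(ref start);
            for (nuint i = (nuint)Vector256<int>.Count; i < last; i += (nuint)Vector256<int>.Count)
            {
                mins = Vector256.Min(mins, Vector256.LoadUnsafe(ref start, i));
            }
            // Final, possibly overlapping load covers the tail without a scalar loop.
            mins = Vector256.Min(mins, Vector256.LoadUnsafe(ref start, last));
            return ReduceMin(Vector128.Min(mins.GetLower(), mins.GetUpper()));
        }

        // Tier 1: 128-bit vectors for practically any input at least Count long;
        // lights up on Arm64, x64, x86 (and eventually WASM).
        if (Vector128.IsHardwareAccelerated && span.Length >= Vector128<int>.Count)
        {
            nuint last = (nuint)span.Length - (nuint)Vector128<int>.Count;
            Vector128<int> mins = Vector128.LoadUnsafe(ref start);
            for (nuint i = (nuint)Vector128<int>.Count; i < last; i += (nuint)Vector128<int>.Count)
            {
                mins = Vector128.Min(mins, Vector128.LoadUnsafe(ref start, i));
            }
            mins = Vector128.Min(mins, Vector128.LoadUnsafe(ref start, last));
            return ReduceMin(mins);
        }

        // Tier 0: scalar fallback for tiny inputs.
        int min = span[0];
        for (int i = 1; i < span.Length; i++)
        {
            min = Math.Min(min, span[i]);
        }
        return min;
    }

    private static int ReduceMin(Vector128<int> v)
    {
        int min = v.GetElement(0);
        for (int e = 1; e < Vector128<int>.Count; e++)
        {
            min = Math.Min(min, v.GetElement(e));
        }
        return min;
    }
}
```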

Member


Noting that I wouldn't block this PR on that doc being written up. I think this largely matches what we're already doing for other cases.

Just that I believe this is roughly what the writeup will follow based on my own knowledge/experience.

Member Author


> If you are trying to get good performance everywhere and don't need to support .NET Standard, Vector128 is a good option.

We have many vectorized methods now. From a cursory scan, they all either use Vector<T> or they have both 128-bit and 256-bit code paths. Can you please list which of these implementations should ideally be changed to just use Vector128<T>? If that is the recommended default choice, I assume the list will include at least half of our methods. Thanks.

@tannergooding (Member) commented Feb 2, 2022


I'm not aware of any existing SIMD code in the BCL where we need to or should only support Vector128<T>. There may be some cases of new SIMD code where that is the right initial choice.

There are likely places where Vector256<T> will end up needing additional or modified checks due to the changing hardware ecosystem.

There are likely places where we would see perf benefits by adding a Vector128<T> path on top of Vector<T>, to ensure that we get better perf for inputs smaller than 32 bytes (or inputs that are small enough that loop unrolling and other considerations would be better).

The BCL is special from many angles and the guidance we have for our own usage may be more complex than the default recommendation for the typical user or individual OSS maintainer looking to accelerate their app.

Member


Put another way, we are the core foundation on which a lot of other code is built and run. We have no way of statically knowing how any given app will use our code, and being the baseline, the places where our code is used vary a lot.

This is different from many applications or smaller libraries out there, which can easily get away with having simpler code/guidance and still see good benefits. That doesn't mean we shouldn't also provide or make available the extended guidance, nor that other libraries may not have the same or similar considerations.

But there are multiple angles and target audiences here and no one single answer.

@tannergooding (Member) commented Feb 2, 2022


To put it another way, in an attempt at drawing a parallel: the BCL provides T[], Span<T>, and several other tools for working with memory (MemoryMarshal, Unsafe, etc.).

Most apps/libraries only need to use T[] to be successful. They get indirect benefits from the BCL's use of Span<T>.

Some apps/libraries can then get additional perf benefits by providing additional overloads which use Span<T> allowing for reduced allocations and handling of unmanaged memory without needing to deal with unsafe.

You can then go a step further and "hyper-optimize" by doing things like casting int to uint and then zero-extending to nuint or by using MemoryMarshal.GetReference to avoid bounds checking, etc. This is something that provides the "best performance" and is also the most complicated.

All of these options are valid and we use all of them throughout the BCL depending on what the exact circumstances and needs are.


In the same way, it is fine for most user code to just use Vector128<T>; it gives them the easiest path and best chance of "succeeding" (for some definition of success) on all platforms. Vector<T> can similarly be used to let them succeed and is almost as good a choice, but there is some complicated nuance around it that makes it more confusing (at least in my own opinion) to recommend as the default. One of these two is the equivalent of T[].

Vector256<T> and providing multiple paths is somewhat like the Span<T> case. This can be beneficial when you know what you're doing and most code can get indirect benefits by the BCL utilizing these techniques.

Using platform-specific intrinsics is a case of "hyper-optimization". You use these when you need the utmost control of the codegen and are trying to eke out every bit of performance, or where the xplat helpers happen to fall short.
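As a concrete illustration of the "hyper-optimize" tier mentioned above, a sketch of what the zero-extension and MemoryMarshal.GetReference tricks look like (hypothetical code, not from this PR; overflow checking omitted):

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

internal static class HyperOptimizedSumSketch
{
    public static int Sum(ReadOnlySpan<int> span)
    {
        // Casting the non-negative int length to uint and widening to nuint
        // zero-extends it, letting the JIT drop sign extensions in addressing.
        ref int start = ref MemoryMarshal.GetReference(span);
        nuint length = (uint)span.Length;

        int sum = 0;
        for (nuint i = 0; i < length; i++)
        {
            // Unsafe.Add reads through the ref with no per-element bounds check.
            sum += Unsafe.Add(ref start, i);
        }
        return sum;
    }
}
```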

@stephentoub (Member, Author) commented Feb 2, 2022


> I'm not aware of any existing SIMD code in the BCL where we need to or should only support Vector128<T>

Right

> The BCL is special from many angles and the guidance we have for our own usage may be more complex than the default recommendation

I do not believe the core libraries are special in this regard. If someone is reaching for Vector, they're already in the 1% who care about getting really great throughput and have already demonstrated an interest in writing additional code for performance. On top of that, most libraries are in a position where they can't predict an upper bound on input lengths.

If the guidance we have doesn't work for us, it doesn't work. What I outlined in #64470 (comment) is essentially what we do, where the starting point is writing one Vector<T> path and then getting more complicated from there if either the API surface is insufficient or the smaller input sizes are critical... I think that's guidance that will actually be followed, and by default it yields the most maintainable code that's also portable and has good perf as a starting point (yes, I've heard and understood all the nuances). Folks may reasonably start and end with Vector<T>... I do not believe folks will end with Vector128<T>.

@tannergooding (Member) left a review comment


LGTM and matches what we are doing for other similar code already.

We probably want to look again later after I do the doc writeup.

@stephentoub (Member, Author)

Thank you for reviewing, and for the pre-look at your thinking :-)

```csharp
int i = array.Length;
while (i > 1)
{
    int j = Random.Shared.Next(i--);
```
Member


Maybe we should use a seeded Random instance here for deterministic test inputs?

Member Author


Sure. I'll do that in a separate PR.
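Something like the following sketch (the seed is arbitrary, and the swap body, which isn't shown in the excerpt above, is assumed to be a standard Fisher-Yates swap):

```csharp
// Seeded for reproducibility across test runs; 42 is an arbitrary choice.
var rand = new Random(42);

int i = array.Length;
while (i > 1)
{
    int j = rand.Next(i--);
    (array[j], array[i]) = (array[i], array[j]);
}
```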

@stephentoub stephentoub merged commit 096c8d5 into dotnet:main Jan 31, 2022
@stephentoub stephentoub deleted the linqminmaxarray branch January 31, 2022 16:52
@ghost ghost locked as resolved and limited conversation to collaborators Mar 4, 2022