
Conversation

@stephentoub (Member)

In looking through usage data, it's pretty common to call Enumerable.Min/Max with an array. At the cost of a type check when the source isn't an array, we can speed up the cases where it is, at a minimum by avoiding interface dispatch and in the best case by vectorizing. I've done so for int/long. We could look at doing so for float/double as well, but I wasn't sure about the impact of the NaN-related logic on vectorization.
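For reference, here's a minimal sketch of the shape of the fast path (illustrative only; the actual PR code differs in details such as tail handling and the Max counterpart, and `MinOfArray` is a hypothetical name):

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;

internal static class MinMaxSketch
{
    public static int Min(IEnumerable<int> source)
    {
        // One type check buys the fast path for the common array case.
        if (source is int[] array)
        {
            return MinOfArray(array);
        }

        // Otherwise fall back to (a simplified version of) the existing
        // enumerator-based path, paying interface dispatch per element.
        using IEnumerator<int> e = source.GetEnumerator();
        if (!e.MoveNext())
        {
            throw new InvalidOperationException("Sequence contains no elements");
        }
        int min = e.Current;
        while (e.MoveNext())
        {
            min = Math.Min(min, e.Current);
        }
        return min;
    }

    private static int MinOfArray(int[] array)
    {
        if (array.Length == 0)
        {
            throw new InvalidOperationException("Sequence contains no elements");
        }

        if (Vector.IsHardwareAccelerated && array.Length >= Vector<int>.Count * 2)
        {
            // Keep per-lane minimums in a vector, then reduce across lanes.
            var mins = new Vector<int>(array);
            int i = Vector<int>.Count;
            int lastBlock = array.Length - Vector<int>.Count;
            for (; i <= lastBlock; i += Vector<int>.Count)
            {
                mins = Vector.Min(mins, new Vector<int>(array, i));
            }

            int value = mins[0];
            for (int lane = 1; lane < Vector<int>.Count; lane++)
            {
                value = Math.Min(value, mins[lane]);
            }

            // Scalar tail for any elements the vector loop didn't cover.
            for (; i < array.Length; i++)
            {
                value = Math.Min(value, array[i]);
            }
            return value;
        }

        int result = array[0];
        for (int j = 1; j < array.Length; j++)
        {
            result = Math.Min(result, array[j]);
        }
        return result;
    }
}
```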

| Method | Toolchain | Length | Array | Mean | Error | Ratio |
|--------|-----------|-------:|-------|-------------:|-----------:|------:|
| Min | \main\corerun.exe | 1 | False | 31.444 ns | 0.6275 ns | 1.00 |
| Min | \pr\corerun.exe | 1 | False | 31.211 ns | 0.3926 ns | 0.99 |
| Min | \main\corerun.exe | 1 | True | 14.049 ns | 0.0917 ns | 1.00 |
| Min | \pr\corerun.exe | 1 | True | 1.912 ns | 0.0116 ns | 0.14 |
| Min | \main\corerun.exe | 4 | False | 44.083 ns | 0.5091 ns | 1.00 |
| Min | \pr\corerun.exe | 4 | False | 44.900 ns | 0.4524 ns | 1.02 |
| Min | \main\corerun.exe | 4 | True | 24.681 ns | 0.1511 ns | 1.00 |
| Min | \pr\corerun.exe | 4 | True | 3.553 ns | 0.0647 ns | 0.14 |
| Min | \main\corerun.exe | 16 | False | 108.921 ns | 1.0189 ns | 1.00 |
| Min | \pr\corerun.exe | 16 | False | 112.215 ns | 0.5962 ns | 1.03 |
| Min | \main\corerun.exe | 16 | True | 74.743 ns | 0.4251 ns | 1.00 |
| Min | \pr\corerun.exe | 16 | True | 4.557 ns | 0.0403 ns | 0.06 |
| Min | \main\corerun.exe | 64 | False | 333.232 ns | 1.3208 ns | 1.00 |
| Min | \pr\corerun.exe | 64 | False | 331.662 ns | 6.6332 ns | 0.99 |
| Min | \main\corerun.exe | 64 | True | 251.663 ns | 5.0050 ns | 1.00 |
| Min | \pr\corerun.exe | 64 | True | 7.923 ns | 0.0847 ns | 0.03 |
| Min | \main\corerun.exe | 256 | False | 1,275.711 ns | 25.5032 ns | 1.00 |
| Min | \pr\corerun.exe | 256 | False | 1,220.111 ns | 24.3144 ns | 0.96 |
| Min | \main\corerun.exe | 256 | True | 932.077 ns | 2.8967 ns | 1.00 |
| Min | \pr\corerun.exe | 256 | True | 21.124 ns | 0.1536 ns | 0.02 |
| Min | \main\corerun.exe | 1024 | False | 4,951.882 ns | 92.7338 ns | 1.00 |
| Min | \pr\corerun.exe | 1024 | False | 4,673.227 ns | 41.3033 ns | 0.94 |
| Min | \main\corerun.exe | 1024 | True | 3,753.438 ns | 50.7476 ns | 1.00 |
| Min | \pr\corerun.exe | 1024 | True | 86.373 ns | 0.3499 ns | 0.02 |

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Program
{
    public static void Main(string[] args) => BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

    [Params(1, 4, 16, 64, 256, 1024)]
    public int Length { get; set; }

    [Params(false, true)]
    public bool Array { get; set; }

    private IEnumerable<int> _source;

    [GlobalSetup]
    public void Setup()
    {
        _source = Enumerable.Range(1, Length).Reverse();
        if (Array)
        {
            _source = _source.ToArray();
        }
    }

    [Benchmark]
    public int Min() => _source.Min();
}
```

@stephentoub stephentoub added area-System.Linq tenet-performance Performance related issue labels Jan 28, 2022
@stephentoub stephentoub added this to the 7.0.0 milestone Jan 28, 2022
@ghost ghost assigned stephentoub Jan 28, 2022
@ghost commented Jan 28, 2022

Tagging subscribers to this area: @dotnet/area-system-linq
See info in area-owners.md if you want to be subscribed.

```csharp
    return value;
}

private static int Max(int[] array)
```
Member

So what I'd really like to see here is for many of these methods to be available on Span<T> and for us to forward this to the Span implementation.

Exposing new public APIs is of course something that needs to go through API proposal, but I think it would be better long term, would also allow more logic sharing for various helper APIs, and would centralize most of the "array-like SIMD algorithms" in the same place in the BCL.

Member Author


I'd be fine subsequently seeing such methods on MemoryExtensions and deleting the code from here to delegate there instead. Someone would need to do the work to figure out what that set of methods should be. I also don't think it precludes adding this now and then delegating later if/when the methods exist.

Member


Right, I just wanted to call out what I'd like to see here, and I think just about any method we think is worth special-casing for arrays in LINQ is applicable.

Sum/Min/Max are the "most obvious ones"
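For illustration, the centralization being discussed could eventually look something like the sketch below (hypothetical: MemoryExtensions does not expose Min/Max today, and `SpanMin` stands in for whatever that future span-based API would be; scalar-only to keep it short):

```csharp
using System;

internal static class SpanForwardingSketch
{
    // Hypothetical span-based helper; the idea is that one vectorizable
    // implementation like this would live in a central place (e.g.
    // MemoryExtensions) and everything array-like would forward to it.
    public static int SpanMin(ReadOnlySpan<int> span)
    {
        if (span.IsEmpty)
        {
            throw new InvalidOperationException("Sequence contains no elements");
        }

        int min = span[0];
        for (int i = 1; i < span.Length; i++)
        {
            min = Math.Min(min, span[i]);
        }
        return min;
    }

    // LINQ's array special case would then collapse to a one-line forward.
    public static int Min(int[] array) => SpanMin(array);
}
```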


```csharp
// Vectorize the search if possible.
int index, value;
if (Vector.IsHardwareAccelerated && array.Length >= Vector<int>.Count * 2)
```
Member


Do we know how big the average input here is? This is going to give us different perf characteristics between Arm64 and x64, because the former will trigger for 8+ element arrays (32 bytes) while the latter will only trigger for 16+ element arrays (64 bytes).

I'd expect we want similar handling in both, and while manual unrolling may pipeline well on x64, that may not be the case on Arm64 or on all x64-based chipsets.

Generally, using Vector256 is beneficial for very large inputs where you can saturate the decoder and memory pipeline, but less beneficial for "small" inputs (particularly on older hardware where cache-line splits and throughput may be more limited).

Member Author


> Do we know how big the average input here is?

I'm not sure.

Are you suggesting the default stance should be to use Vector128<T> instead of Vector<T> unless you know big inputs are common? Is the tradeoff you're describing between potentially 2x throughput on larger sizes (if you use Vector<T> and end up employing 256-bit operations instead of 128-bit operations) vs. vectorizing smaller sizes (if you use Vector<T>, you can't vectorize the processing of 200 bits at all, but could have vectorized a little of it with Vector128<T>)?

Member Author


(This is why I opened #64409. We need really clear guidance on when to use which.)

Member


> Are you suggesting the default stance should be to use Vector128<T> instead of Vector<T> unless you know big inputs are common? Is the tradeoff you're describing between potentially 2x throughput on larger sizes (if you use Vector<T> and end up employing 256-bit operations instead of 128-bit operations) vs. vectorizing smaller sizes (if you use Vector<T>, you can't vectorize the processing of 200 bits at all, but could have vectorized a little of it with Vector128<T>)?

Yes. There is a lot of nuance here, such as Arm64 not having Vector256<T> and often having different pipelining characteristics. Even just considering x64, there are various things to weigh, such as the expected target hardware: modern chips (Skylake and Zen 2 or later) tend to perform really well with V256, whereas older ones (Haswell/Broadwell or Zen 1) tend to behave worse and may incur downclocking, may not support dual dispatch, and may even emulate 256-bit support as 2x128-bit ops.

I'd generally expect that we do something like:

- `Vector128<T>` is the default choice and is used for practically any input that is at least `Count` long.
  - This is the path that will light up basically "everywhere" (Arm64, x64, x86; eventually WASM, etc.).
  - It may or may not be worth it to specialize cases where we are doing fewer than 4-7 loop iterations, due to the default branch-prediction behavior.
- `Vector256<T>` is used for cases with at least 128 bytes. This is also the cutoff many memcpy algorithms use before they start considering non-temporal or other store sizes.
  - This path will only light up on x86/x64 and can be impacted by things like alignment, whether it's going to split the cache every other read/write, the instructions being used, and the underlying data size.

There are likely cases where microbenchmarks or modern hardware (roughly Skylake and Zen 2 or newer) will show Vector256<T> is better. However, I expect that a good bit of hardware still in the wild (for both servers and consumers) isn't that, and we'll need to take it into consideration. Likewise, the newer Alder Lake CPUs bring in considerations where the Vector256<T> support is slower on the energy-efficient cores (vpmaxud is 1 latency and 0.5 throughput for V128 and V256 on the performance cores, but on the efficiency cores it is 1 latency, 0.33 throughput for V128 and 2 latency, 0.67 throughput for V256).
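A sketch of that tiering, assuming the cross-platform Vector128/Vector256 APIs coming in .NET 7 (the cutoffs and names here are illustrative, not from this PR):

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

internal static class TieredMinSketch
{
    public static int Min(ReadOnlySpan<int> span)
    {
        if (span.IsEmpty)
        {
            throw new InvalidOperationException("Sequence contains no elements");
        }

        ref int start = ref MemoryMarshal.GetReference(span);

        // Tier 2: 256-bit vectors only for inputs of at least 128 bytes (32 ints).
        if (Vector256.IsHardwareAccelerated && span.Length >= 32)
        {
            nuint last = (nuint)span.Length - (nuint)Vector256<int>.Count;
            Vector256<int> mins = Vector256.LoadUnsafe(ref start);
            for (nuint i = (nuint)Vector256<int>.Count; i < last; i += (nuint)Vector256<int>.Count)
            {
                mins = Vector256.Min(mins, Vector256.LoadUnsafe(ref start, i));
            }
            // Final, possibly overlapping load covers the tail without a scalar loop.
            mins = Vector256.Min(mins, Vector256.LoadUnsafe(ref start, last));
            return ReduceMin(Vector128.Min(mins.GetLower(), mins.GetUpper()));
        }

        // Tier 1: 128-bit vectors for practically any input at least Count long;
        // lights up on Arm64, x64, x86 (and eventually WASM).
        if (Vector128.IsHardwareAccelerated && span.Length >= Vector128<int>.Count)
        {
            nuint last = (nuint)span.Length - (nuint)Vector128<int>.Count;
            Vector128<int> mins = Vector128.LoadUnsafe(ref start);
            for (nuint i = (nuint)Vector128<int>.Count; i < last; i += (nuint)Vector128<int>.Count)
            {
                mins = Vector128.Min(mins, Vector128.LoadUnsafe(ref start, i));
            }
            mins = Vector128.Min(mins, Vector128.LoadUnsafe(ref start, last));
            return ReduceMin(mins);
        }

        // Tier 0: scalar fallback for tiny inputs.
        int min = span[0];
        for (int i = 1; i < span.Length; i++)
        {
            min = Math.Min(min, span[i]);
        }
        return min;
    }

    private static int ReduceMin(Vector128<int> v)
    {
        int min = v.GetElement(0);
        for (int e = 1; e < Vector128<int>.Count; e++)
        {
            min = Math.Min(min, v.GetElement(e));
        }
        return min;
    }
}
```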

Member


Noting that I wouldn't block this PR on that doc being written up. I think this largely matches what we're already doing for other cases.

Just that I believe this is roughly what the writeup will follow based on my own knowledge/experience.

Member Author


> If you are trying to get good performance everywhere and don't need to support .NET Standard, Vector128 is a good option.

We have many vectorized methods now. From a cursory scan, they all either use Vector<T> or they have both 128-bit and 256-bit code paths. Can you please list which of these implementations should ideally be changed to just use Vector128<T>? If that is the recommended default choice, I assume the list will include at least half of our methods. Thanks.

@tannergooding (Member) commented Feb 2, 2022


I'm not aware of any existing SIMD code in the BCL where we need to or should only support Vector128<T>. There may be some cases of new SIMD code where that is the right initial choice.

There are likely places where Vector256<T> will end up needing additional or modified checks due to the changing hardware ecosystem.

There are likely places where we would see perf benefits by adding a Vector128<T> path on top of Vector<T>, to ensure that we get better perf for inputs smaller than 32 bytes (or inputs that are small enough that loop unrolling and other considerations would be better).

The BCL is special from many angles and the guidance we have for our own usage may be more complex than the default recommendation for the typical user or individual OSS maintainer looking to accelerate their app.

Member


Put another way, we are the core foundation on which a lot of other code is built and run. We have no way of statically knowing how any given app will use our code, and being the baseline, the places where our code is used vary a lot.

This is different from many applications or smaller libraries out there, which can easily get away with having simpler code/guidance and still see good benefits. That doesn't mean we shouldn't also provide or make available the extended guidance, nor that other libraries may not have the same or similar considerations.

But there are multiple angles and target audiences here and no one single answer.

@tannergooding (Member) commented Feb 2, 2022


To put it another way, in an attempt at drawing a parallel: the BCL provides T[], Span<T>, and several other tools for working with memory (MemoryMarshal, Unsafe, etc.).

Most apps/libraries only need to use T[] to be successful. They get indirect benefits from the BCL's use of Span<T>.

Some apps/libraries can then get additional perf benefits by providing additional overloads which use Span<T> allowing for reduced allocations and handling of unmanaged memory without needing to deal with unsafe.

You can then go a step further and "hyper-optimize" by doing things like casting int to uint and then zero-extending to nuint or by using MemoryMarshal.GetReference to avoid bounds checking, etc. This is something that provides the "best performance" and is also the most complicated.

All of these options are valid and we use all of them throughout the BCL depending on what the exact circumstances and needs are.


In the same way, it is fine for most user code to just use Vector128<T>; it gives them the easiest path and best chance of "succeeding" (for some definition of success) on all platforms. Vector<T> can similarly be used to let them succeed and is almost as good a choice, but there is some complicated nuance around it that makes it more confusing (at least in my own opinion) to recommend as the default. One of these two is the equivalent of T[].

Vector256<T> and providing multiple paths is somewhat like the Span<T> case. This can be beneficial when you know what you're doing and most code can get indirect benefits by the BCL utilizing these techniques.

Using platform-specific intrinsics is a case of "hyper-optimization". You use these when you need the utmost control of the codegen and are trying to eke out every bit of performance, or where the xplat helpers happen to fall short.
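As a concrete illustration of the "hyper-optimize" tier mentioned above, a sketch of what the zero-extension and MemoryMarshal.GetReference tricks look like (hypothetical code, not from this PR; overflow checking omitted):

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

internal static class HyperOptimizedSumSketch
{
    public static int Sum(ReadOnlySpan<int> span)
    {
        // Casting the non-negative int length to uint and widening to nuint
        // zero-extends it, letting the JIT drop sign extensions in addressing.
        ref int start = ref MemoryMarshal.GetReference(span);
        nuint length = (uint)span.Length;

        int sum = 0;
        for (nuint i = 0; i < length; i++)
        {
            // Unsafe.Add reads through the ref with no per-element bounds check.
            sum += Unsafe.Add(ref start, i);
        }
        return sum;
    }
}
```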

@stephentoub (Member, Author) commented Feb 2, 2022


> I'm not aware of any existing SIMD code in the BCL where we need to or should only support Vector128<T>

Right

> The BCL is special from many angles and the guidance we have for our own usage may be more complex than the default recommendation

I do not believe the core libraries are special in this regard. If someone is reaching for Vector, they're already in the 1% who care about getting really great throughput and have already demonstrated an interest in writing additional code for performance. On top of that, most libraries are in a position where they can't predict an upper bound on input lengths.

If the guidance we have doesn't work for us, it doesn't work. What I outlined in #64470 (comment) is essentially what we do, where the starting point is writing one Vector<T> path and then getting more complicated from there if either the API surface is insufficient or the smaller input sizes are critical... I think that's guidance that will actually be followed, and by default it yields the most maintainable code that's also portable and has good perf as a starting point (yes, I've heard and understood all the nuances). Folks may reasonably start and end with Vector<T>... I do not believe folks will end with Vector128<T>.

@tannergooding (Member) left a review comment


LGTM and matches what we are doing for other similar code already.

We probably want to look again later after I do the doc writeup.

@stephentoub (Member, Author)

Thank you for reviewing, and for the pre-look at your thinking :-)

```csharp
int i = array.Length;
while (i > 1)
{
    int j = Random.Shared.Next(i--);
```
Member


Maybe we should use a seeded Random instance here for deterministic test inputs?

Member Author


Sure. I'll do that in a separate PR.
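Something like the following sketch (the seed is arbitrary, and the swap body, which isn't shown in the excerpt above, is assumed to be a standard Fisher-Yates swap):

```csharp
// Seeded for reproducibility across test runs; 42 is an arbitrary choice.
var rand = new Random(42);

int i = array.Length;
while (i > 1)
{
    int j = rand.Next(i--);
    (array[j], array[i]) = (array[i], array[j]);
}
```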

@stephentoub stephentoub merged commit 096c8d5 into dotnet:main Jan 31, 2022
@stephentoub stephentoub deleted the linqminmaxarray branch January 31, 2022 16:52
@ghost ghost locked as resolved and limited conversation to collaborators Mar 4, 2022