Skip to content

Conversation

@DeepakRajendrakumaran
Copy link
Contributor

This accelerates Vector512.Sum() using AVX512 instructions.

Vector512.Sum() ->

vmovups  zmm0, zmmword ptr [rcx]
vextracti64x4 ymm1, zmm0, 1
vpaddd   ymm0, ymm1, ymm0
vextracti128 xmm1, ymm0, 1
vpaddd   xmm0, xmm1, xmm0
vphaddd  xmm0, xmm0, xmm0
vphaddd  xmm0, xmm0, xmm0
vmovd    eax, xmm0

@dotnet/avx512-contrib

@ghost ghost added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI community-contribution Indicates that the PR has been added by a community member labels Jun 20, 2023
@ghost
Copy link

ghost commented Jun 20, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

This accelerates Vector512.Sum() using AVX512 instructions.

Vector512.Sum() ->

vmovups  zmm0, zmmword ptr [rcx]
vextracti64x4 ymm1, zmm0, 1
vpaddd   ymm0, ymm1, ymm0
vextracti128 xmm1, ymm0, 1
vpaddd   xmm0, xmm1, xmm0
vphaddd  xmm0, xmm0, xmm0
vphaddd  xmm0, xmm0, xmm0
vmovd    eax, xmm0

@dotnet/avx512-contrib

Author: DeepakRajendrakumaran
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@gfoidl
Copy link
Member

gfoidl commented Jun 21, 2023

Is vphaddd the instruction we want to use here?
I'm asking because of

phaddd is implemented in microcode on all supporting processors, so it's slow compared to pshufd+paddd

(from SixLabors/ImageSharp#1630 (comment))

and

phaddd is no faster than 2 shuffles + a vertical add

(from https://stackoverflow.com/a/59326242/347870

I know that Vector128/256 also uses vphaddd and I don't know if current cpus still implement it in microcode (from Agner Fog's instruction tables it looks like to be microcode).

@DeepakRajendrakumaran
Copy link
Contributor Author

DeepakRajendrakumaran commented Jun 21, 2023

Is vphaddd the instruction we want to use here? I'm asking because of

phaddd is implemented in microcode on all supporting processors, so it's slow compared to pshufd+paddd

(from SixLabors/ImageSharp#1630 (comment))

and

phaddd is no faster than 2 shuffles + a vertical add

(from https://stackoverflow.com/a/59326242/347870

I know that Vector128/256 also uses vphaddd and I don't know if current cpus still implement it in microcode (from Agner Fog's instruction tables it looks like to be microcode).

That's a good question and something we can possibly consider. ICC for eg. creates the following - https://godbolt.org/z/ojGcK9a4W

I'd imagine this would mean optimizing the whole SUM() lowering if we want to change for Vector128/ vector256/vector512. It's something I'm open to since Vector*.Dot() also in theory can use this logic(currently there is little overlap between Dot() and Sum().

I'd love to hear what @tannergooding and @BruceForstall have to add here

@tannergooding
Copy link
Member

Should be fine. We'd be looking at 9 instructions with hadd vs 11 with permute+add, but a more opportunities for parallel dispatch (some CPUs have 2 shuffle/permute ports) and almost half the micro-ops/latency.

Would be good to ensure that all 3 sizes for Sum are doing the "optimal" thing here.

@BruceForstall
Copy link
Contributor

Related: #85207

@DeepakRajendrakumaran
Copy link
Contributor Author

DeepakRajendrakumaran commented Jun 22, 2023

Related: #85207

You can mark 'Shuffle()' from there done. I had a PR a few weeks ago

@tannergooding tannergooding added the avx512 Related to the AVX-512 architecture label Jul 14, 2023
Comment on lines +23899 to +23926
if (simdSize == 64)
{
assert(IsBaselineVector512IsaSupportedDebugOnly());
// This is roughly the following managed code:
// ...
// simd64 tmp2 = tmp1;
// tmp3 = tmp2.GetUpper();
// simd32 tmp4 = Isa.Add(tmp1.GetLower(), tmp2);
// tmp5 = tmp4;
// simd16 tmp6 = tmp4.GetUpper();
// tmp1 = Isa.Add(tmp1.GetLower(), tmp2);
// ...
// From here on we can treat this as a simd16 reduction
GenTree* op1Dup = fgMakeMultiUse(&op1);
GenTree* op1Lower32 = gtNewSimdGetUpperNode(TYP_SIMD32, op1Dup, simdBaseJitType, simdSize);
GenTree* op1Upper32 = gtNewSimdGetLowerNode(TYP_SIMD32, op1, simdBaseJitType, simdSize);

simdSize = simdSize / 2;
op1Lower32 = gtNewSimdBinOpNode(GT_ADD, TYP_SIMD32, op1Lower32, op1Upper32, simdBaseJitType, simdSize);
haddCount--;

GenTree* op1Dup32 = fgMakeMultiUse(&op1Lower32);
GenTree* op1Lower16 = gtNewSimdGetUpperNode(TYP_SIMD16, op1Lower32, simdBaseJitType, simdSize);
GenTree* op1Upper16 = gtNewSimdGetLowerNode(TYP_SIMD16, op1Dup32, simdBaseJitType, simdSize);
simdSize = simdSize / 2;
op1 = gtNewSimdBinOpNode(GT_ADD, TYP_SIMD16, op1Lower16, op1Upper16, simdBaseJitType, simdSize);
haddCount--;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is "correct", but has an interesting side effect in that can change the result for float/double.

Since float/double are not associative due to the rounding, summing accross as [0] + [1] + [2] + [3] + ... is different from summing pairwise as (([0] + [1]) + ([2] + [3])) + ..., which is different from summing per lane, then combining the lanes, etc.

Today, we're basically doing it per lane, then combining lanes. Within that lane we're typically doing pairwise because that's how addv (add across) works on Arm64, it's how hadd (horizontal add) works on x86/x64, and its trivial to emulate using shufps on older hardware.

We don't really want to have the results subtly change based on what the hardware supports, so we should probably try to ensure this keeps things overall consistent in how it operates.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm busy with something else but will get back to this once I'm done. Sorry for the delay

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries. This one isn't "critical" for .NET 8 and we're already generating decent (but not amazing) code that would be similar to what a user might manually write.

@JulieLeeMSFT
Copy link
Member

Markign this as .NET 9.

@JulieLeeMSFT JulieLeeMSFT added this to the 9.0.0 milestone Aug 7, 2023
@BruceForstall
Copy link
Contributor

@DeepakRajendrakumaran Can you mark this "Draft" until it is ready to review again?

@DeepakRajendrakumaran DeepakRajendrakumaran marked this pull request as draft September 13, 2023 16:18
@DeepakRajendrakumaran
Copy link
Contributor Author

@DeepakRajendrakumaran Can you mark this "Draft" until it is ready to review again?

Done

@ghost ghost closed this Oct 13, 2023
@ghost
Copy link

ghost commented Oct 13, 2023

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

@ghost ghost locked as resolved and limited conversation to collaborators Nov 12, 2023
This pull request was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI avx512 Related to the AVX-512 architecture community-contribution Indicates that the PR has been added by a community member

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants