- 
                Notifications
    You must be signed in to change notification settings 
- Fork 5.2k
[Mono] [Arm64] Added SIMD support for vector 2/3/4 methods #98761
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| /azp run runtime-extra-platforms | 
| Azure Pipelines successfully started running 1 pipeline(s). | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, thanks for implementing these. I left a few comments with questions.
Before merging, please take a look if fullAOT llvm compiles correctly these new intrinsic.
        
          
                src/mono/mono/mini/simd-intrinsics.c
              
                Outdated
          
        
      | case SN_Lerp: { | ||
| #if defined (TARGET_ARM64) | ||
| MonoInst* v1 = args [1]; | ||
| if (!strcmp ("Quaternion", m_class_get_name (klass))) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quaternion.Lerp is not marked as intrinsic in the libraries:
runtime/src/libraries/System.Private.CoreLib/src/System/Numerics/Quaternion.cs
Lines 493 to 498 in db7d269
| /// <summary>Performs a linear interpolation between two quaternions based on a value that specifies the weighting of the second quaternion.</summary> | |
| /// <param name="quaternion1">The first quaternion.</param> | |
| /// <param name="quaternion2">The second quaternion.</param> | |
| /// <param name="amount">The relative weight of <paramref name="quaternion2" /> in the interpolation.</param> | |
| /// <returns>The interpolated quaternion.</returns> | |
| public static Quaternion Lerp(Quaternion quaternion1, Quaternion quaternion2, float amount) | 
However, if this implementation generates better codegen we should probably keep it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intrinsified version is around 40% faster on my machine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the intrinsified version faster here? Is it fundamentally doing something differently from the managed implementation or is there potentially a missing JIT optimization?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or perhaps there is simply missing a change on the managed side and so its using scalar logic rather than any actual vectorization and a better fix is to update the managed impl?
We've typically tried to keep a clear separation between intrinsic functionality and more complex methods.
APIs like operator + or Sqrt are generally mapped to exactly 1 hardware instruction and this is the case for most platforms.
APIs like DotProduct or even Create may be mapped to exactly 1 hardware instruction on some platforms and are fairly "core" to the general throughput considerations of many platforms.
APIs like Quaternion.Lerp or CopyTo are more complex functions which use multiple instructions on all platforms and which may even require branching or masking logic. So, we've typically tried to keep them in managed and have them use the intrinsic APIs instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that we should align Mono's behavior with CoreCLR, not intrinsifying Quaternion.Lerp or CopyTo for mono either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the intrinsified version faster here? Is it fundamentally doing something differently from the managed implementation or is there potentially a missing JIT optimization?
In general, Mono's mini JIT doesn't have as comprehensive optimizations as CoreCLR's RyuJIT.
        
          
                src/mono/mono/mini/simd-intrinsics.c
              
                Outdated
          
        
      | MonoInst *sum = emit_simd_ins (cfg, klass, OP_ARM64_XADDV, arg->dreg, -1); | ||
| sum->inst_c0 = INTRINS_AARCH64_ADV_SIMD_FADDV; | ||
| sum->inst_c1 = MONO_TYPE_R4; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth noting that Arm has typically pushed us away from using FADDV as it does not perform well on some hardware.
Rather instead they had us use a sequence of FADDP (AddPairwise) instructions which tend to have better perf/throughput: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/gentree.cpp#L25190-L25210
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for sharing the information, @tannergooding. @jkurdek Feel free to create an issue to address it in a future PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
f68c44b    to
    1ebb378      
    Compare
  
    
Implements #91394.