
Conversation

@EgorBo EgorBo (Member) commented Oct 13, 2024

A small cleanup in importervectorization.cpp:

  • I've removed some obscure logic and the separation between the SIMD and scalar paths. It's now a single function with a single loop.
  • The logic is no longer hard-limited to 1 or 2 loads per expansion. I plan to allow arm64 to use more loads to be on par with x64.
  • I've removed 32-bit-specific limitations (hence the improvements on x86 in the diffs).
  • We now try to minimize the load size for the remainder, e.g. when our string is 6 bytes, we do the comparison as 4+2 instead of 4+4 (with overlap), so we're less likely to straddle a cache line (clang does the same). This change is responsible for the SPMI diffs; see the sketch after this list.
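To make the last bullet concrete, here is a minimal standalone sketch (my own illustration, not the JIT code; splitChunks, maxLoad and roundUpLoadSize are hypothetical names used only here) of how the remainder sizing behaves for GPR loads:

    // Sketch: split a byte length into (offset, size) loads, shrinking the
    // tail load instead of reusing the previous size with a larger overlap.
    #include <cstdio>
    #include <utility>
    #include <vector>

    // Round a remainder up to the next power-of-two load size (2, 4, 8, ...),
    // loosely mirroring what roundUpGPRType does for the remainder.
    static unsigned roundUpLoadSize(unsigned bytes)
    {
        unsigned size = 2;
        while (size < bytes)
        {
            size *= 2;
        }
        return size;
    }

    static std::vector<std::pair<unsigned, unsigned>> splitChunks(unsigned byteLen, unsigned maxLoad)
    {
        std::vector<std::pair<unsigned, unsigned>> loads;
        unsigned remaining = byteLen;
        unsigned loadSize  = maxLoad;
        while (remaining > 0)
        {
            if (remaining < loadSize)
            {
                // New behavior: pick a smaller load for the tail; any remaining
                // overlap with the previous chunk is as small as possible.
                loadSize  = roundUpLoadSize(remaining);
                remaining = loadSize;
            }
            loads.push_back({byteLen - remaining, loadSize});
            remaining -= loadSize;
        }
        return loads;
    }

    int main()
    {
        // 6 bytes with 4-byte loads prints "(0,4) (4,2)" rather than "(0,4) (2,4)".
        for (auto [off, size] : splitChunks(6, 4))
        {
            printf("(%u,%u) ", off, size);
        }
        printf("\n");
        return 0;
    }

For byteLen = 6 and 4-byte loads this yields (offset 0, size 4) + (offset 4, size 2), whereas the previous behavior corresponds to (0,4) + (2,4) with a two-byte overlap.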

Diffs

@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 13, 2024
@EgorBo EgorBo marked this pull request as ready for review October 14, 2024 13:59
@EgorBo EgorBo (Member, Author) commented Oct 14, 2024

@jakobbotsch @AndyAyersMS @dotnet/jit-contrib PTAL

The changes are mainly in one function, so I suggest reviewing it as a whole, e.g. here:

GenTree* Compiler::impExpandHalfConstEquals(
    GenTreeLclVarCommon* data, WCHAR* cns, int charLen, int dataOffset, StringComparison cmpMode)
{
    static_assert_no_msg(sizeof(WCHAR) == 2);
    assert((charLen > 0) && (charLen <= MaxPossibleUnrollSize));

    // A gtNewOperNode which can handle SIMD operands (used for bitwise operations):
    auto bitwiseOp = [&](genTreeOps oper, var_types type, GenTree* op1, GenTree* op2) -> GenTree* {
#ifdef FEATURE_HW_INTRINSICS
        if (varTypeIsSIMD(type))
        {
            return gtNewSimdBinOpNode(oper, type, op1, op2, CORINFO_TYPE_NATIVEUINT, genTypeSize(type));
        }
        if (varTypeIsSIMD(op1))
        {
            // E.g. a comparison of SIMD ops returning TYP_INT;
            assert(varTypeIsSIMD(op2));
            return gtNewSimdCmpOpAllNode(oper, type, op1, op2, CORINFO_TYPE_NATIVEUINT, genTypeSize(op1));
        }
#endif
        return gtNewOperNode(oper, type, op1, op2);
    };

    // Convert charLen to byteLen. It never overflows because charLen is a small value
    unsigned byteLen = (unsigned)charLen * 2;

    // Find the largest possible type to read data
    var_types readType = roundDownMaxType(byteLen, true);

    GenTree* result           = nullptr;
    unsigned byteLenRemaining = byteLen;
    while (byteLenRemaining > 0)
    {
        // We have a remaining data to process and it's smaller than the
        // previously processed data
        if (byteLenRemaining < genTypeSize(readType))
        {
            if (varTypeIsIntegral(readType))
            {
                // Use a smaller GPR load for the remaining data, we're going to zero-extend it
                // since the previous GPR load was larger. Hence, for e.g. 6 bytes we're going to do
                // "(IND<INT> ^ cns1) | (UINT)(IND<USHORT> ^ cns2)"
                readType = roundUpGPRType(byteLenRemaining);
            }
            else
            {
                // TODO-CQ: We should probably do the same for SIMD, e.g. 34 bytes -> SIMD32 and SIMD16
                // while currently we do SIMD32 and SIMD32. This involves a bit more complex upcasting logic.
            }

            // Overlap with the previously processed data
            byteLenRemaining = genTypeSize(readType);
            assert(byteLenRemaining <= byteLen);
        }

        ssize_t byteOffset = ((ssize_t)byteLen - (ssize_t)byteLenRemaining);

        // Total offset includes dataOffset (e.g. 12 for String)
        ssize_t totalOffset = byteOffset + (ssize_t)dataOffset;

        // Clone dst and add offset if necessary.
        GenTree* absOffset  = gtNewIconNode(totalOffset, TYP_I_IMPL);
        GenTree* currData   = gtNewOperNode(GT_ADD, TYP_BYREF, gtCloneExpr(data), absOffset);
        GenTree* loadedData = gtNewIndir(readType, currData, GTF_IND_UNALIGNED | GTF_IND_ALLOW_NON_ATOMIC);

        // For OrdinalIgnoreCase mode we need to convert both data and cns to lower case
        if (cmpMode == OrdinalIgnoreCase)
        {
            WCHAR mask[MaxPossibleUnrollSize] = {};
            int   maskSize                    = (int)genTypeSize(readType) / 2;
            if (!ConvertToLowerCase(cns + (byteOffset / 2), reinterpret_cast<WCHAR*>(&mask), maskSize))
            {
                // value contains non-ASCII chars, we can't proceed further
                return nullptr;
            }

            // 0x20 mask for the current chunk to convert it to lower case
            GenTree* toLowerMask = gtNewGenericCon(readType, (uint8_t*)mask);

            // loadedData is now "loadedData | toLowerMask"
            loadedData = bitwiseOp(GT_OR, genActualType(readType), loadedData, toLowerMask);
        }
        else
        {
            assert(cmpMode == Ordinal);
        }

        GenTree* srcCns = gtNewGenericCon(readType, (uint8_t*)cns + byteOffset);

        // A small optimization: prefer X == Y over X ^ Y == 0 since
        // just one comparison is needed, and we can do it with a single load.
        if ((genTypeSize(readType) == byteLen) && varTypeIsIntegral(readType))
        {
            // TODO-CQ: Figure out why it's a size regression for SIMD
            return bitwiseOp(GT_EQ, TYP_INT, loadedData, srcCns);
        }

        // loadedData ^ srcCns
        GenTree* xorNode = bitwiseOp(GT_XOR, genActualType(readType), loadedData, srcCns);

        // Merge with the previous result with OR
        if (result == nullptr)
        {
            // It's the first check
            result = xorNode;
        }
        else
        {
            if (!result->TypeIs(readType))
            {
                assert(varTypeIsIntegral(result) && varTypeIsIntegral(readType));
                xorNode = gtNewCastNode(result->TypeGet(), xorNode, true, result->TypeGet());
            }

            // Merge with the previous result via OR
            result = bitwiseOp(GT_OR, genActualType(result->TypeGet()), result, xorNode);
        }

        // Move to the next chunk.
        byteLenRemaining -= genTypeSize(readType);
    }

    // Compare the result against zero, e.g. (chunk1 ^ cns1) | (chunk2 ^ cns2) == 0
    return bitwiseOp(GT_EQ, TYP_INT, result, gtNewZeroConNode(result->TypeGet()));
}

This simplifies the code and brings some benefits (see the description).
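For reference, here is a hand-written C++ analogue of the shape this produces for a 5-char ordinal comparison against "hello" (my reading of the emitted pattern, not actual JIT output): 10 bytes are checked as one 8-byte and one 2-byte load, the per-chunk XORs are merged with OR, and a single compare against zero is done at the end.

    // Roughly "(IND<LONG> ^ cns1) | (ULONG)(IND<USHORT> ^ cns2) == 0".
    #include <cstdint>
    #include <cstring>

    static bool equalsHelloOrdinal(const char16_t* data)
    {
        static const char16_t cns[] = u"hello"; // 5 chars = 10 bytes of payload

        uint64_t chunk1, cns1;
        uint16_t chunk2, cns2;
        std::memcpy(&chunk1, data, sizeof(chunk1));     // bytes [0..7]
        std::memcpy(&cns1, cns, sizeof(cns1));
        std::memcpy(&chunk2, data + 4, sizeof(chunk2)); // bytes [8..9]
        std::memcpy(&cns2, cns + 4, sizeof(cns2));

        // XOR each chunk with its constant, OR the differences, compare once.
        return ((chunk1 ^ cns1) | (uint64_t)(chunk2 ^ cns2)) == 0;
    }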

@jakobbotsch jakobbotsch (Member) left a comment

LGTM, nice cleanup.

Btw I see that arm64 still uses bitwise ops instead of conditional compares. The conditional compares would likely be more dense.

@EgorBo EgorBo (Member, Author) commented Oct 17, 2024

Btw I see that arm64 still uses bitwise ops instead of conditional compares. The conditional compares would likely be more dense.

I presume it's better if some other phase (morph/lower) converts this to whatever shape is better, so it can also handle user inputs that use bitwise ops.

@EgorBo EgorBo merged commit 8734660 into dotnet:main Oct 18, 2024
108 checks passed
@EgorBo EgorBo deleted the cleanup-str-unroll branch October 18, 2024 09:28
@jakobbotsch jakobbotsch (Member)

I presume it's better if some other phase (morph/lower) converts this to whatever shape is better, so it can also handle user inputs that use bitwise ops.

It would make more sense to me for this transform to produce the obvious IR (using GT_EQ), which will indeed be transformed to conditional compares.
Then I agree the x64 backend could transform to the bitwise version if that is actually better.
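To illustrate the two shapes under discussion, a rough C++ paraphrase (not actual JIT IR) for two 8-byte chunks a/b compared against constants x/y:

    #include <cstdint>

    // "Obvious" GT_EQ shape: a chain of equalities. On arm64 this maps
    // naturally to cmp + ccmp + cset, which is where the density win of
    // conditional compares comes from.
    static bool equalsChained(uint64_t a, uint64_t b, uint64_t x, uint64_t y)
    {
        return (a == x) && (b == y);
    }

    // Bitwise shape the expansion currently builds: a branch-free merge of
    // the per-chunk differences with a single compare against zero.
    static bool equalsBitwise(uint64_t a, uint64_t b, uint64_t x, uint64_t y)
    {
        return ((a ^ x) | (b ^ y)) == 0;
    }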

@github-actions github-actions bot locked and limited conversation to collaborators Nov 20, 2024