
Conversation

@EgorBo EgorBo (Member) commented Oct 13, 2024

A small cleanup in importervectorization.cpp:

  • I've removed some obscure logic and the separation between the SIMD and scalar paths. It's now a single function with a single loop.
  • The logic is no longer hard-limited to 1 or 2 loads per expansion. I plan to allow arm64 to use more loads to be on par with x64.
  • I've removed 32-bit-specific limitations (hence the improvements on x86 in the diffs).
  • We now try to minimize the load size for the remainder, e.g. when our string is 6 bytes, we do the comparison as 4+2 instead of 4+4 (with overlap), so we're less likely to straddle a cache line (clang does the same). This change is responsible for the SPMI diffs; see the sketch after this list.
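To make the last bullet concrete, here is a minimal standalone sketch (my own illustration, not the JIT code; splitChunks, maxLoad and roundUpLoadSize are hypothetical names used only here) of how the remainder sizing behaves for GPR loads:

    // Sketch: split a byte length into (offset, size) loads, shrinking the
    // tail load instead of reusing the previous size with a larger overlap.
    #include <cstdio>
    #include <utility>
    #include <vector>

    // Round a remainder up to the next power-of-two load size (2, 4, 8, ...),
    // loosely mirroring what roundUpGPRType does for the remainder.
    static unsigned roundUpLoadSize(unsigned bytes)
    {
        unsigned size = 2;
        while (size < bytes)
        {
            size *= 2;
        }
        return size;
    }

    static std::vector<std::pair<unsigned, unsigned>> splitChunks(unsigned byteLen, unsigned maxLoad)
    {
        std::vector<std::pair<unsigned, unsigned>> loads;
        unsigned remaining = byteLen;
        unsigned loadSize  = maxLoad;
        while (remaining > 0)
        {
            if (remaining < loadSize)
            {
                // New behavior: pick a smaller load for the tail; any remaining
                // overlap with the previous chunk is as small as possible.
                loadSize  = roundUpLoadSize(remaining);
                remaining = loadSize;
            }
            loads.push_back({byteLen - remaining, loadSize});
            remaining -= loadSize;
        }
        return loads;
    }

    int main()
    {
        // 6 bytes with 4-byte loads prints "(0,4) (4,2)" rather than "(0,4) (2,4)".
        for (auto [off, size] : splitChunks(6, 4))
        {
            printf("(%u,%u) ", off, size);
        }
        printf("\n");
        return 0;
    }

For byteLen = 6 and 4-byte loads this yields (offset 0, size 4) + (offset 4, size 2), whereas the previous behavior corresponds to (0,4) + (2,4) with a two-byte overlap.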

Diffs

@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 13, 2024
@EgorBo EgorBo marked this pull request as ready for review October 14, 2024 13:59
@EgorBo EgorBo (Member, Author) commented Oct 14, 2024

@jakobbotsch @AndyAyersMS @dotnet/jit-contrib PTAL

The changes are mainly in one function, so I suggest reviewing it as a whole, e.g. here:

GenTree* Compiler::impExpandHalfConstEquals(
    GenTreeLclVarCommon* data, WCHAR* cns, int charLen, int dataOffset, StringComparison cmpMode)
{
    static_assert_no_msg(sizeof(WCHAR) == 2);
    assert((charLen > 0) && (charLen <= MaxPossibleUnrollSize));

    // A gtNewOperNode which can handle SIMD operands (used for bitwise operations):
    auto bitwiseOp = [&](genTreeOps oper, var_types type, GenTree* op1, GenTree* op2) -> GenTree* {
#ifdef FEATURE_HW_INTRINSICS
        if (varTypeIsSIMD(type))
        {
            return gtNewSimdBinOpNode(oper, type, op1, op2, CORINFO_TYPE_NATIVEUINT, genTypeSize(type));
        }
        if (varTypeIsSIMD(op1))
        {
            // E.g. a comparison of SIMD ops returning TYP_INT;
            assert(varTypeIsSIMD(op2));
            return gtNewSimdCmpOpAllNode(oper, type, op1, op2, CORINFO_TYPE_NATIVEUINT, genTypeSize(op1));
        }
#endif
        return gtNewOperNode(oper, type, op1, op2);
    };

    // Convert charLen to byteLen. It never overflows because charLen is a small value
    unsigned byteLen = (unsigned)charLen * 2;

    // Find the largest possible type to read data
    var_types readType = roundDownMaxType(byteLen, true);

    GenTree* result           = nullptr;
    unsigned byteLenRemaining = byteLen;
    while (byteLenRemaining > 0)
    {
        // We have a remaining data to process and it's smaller than the
        // previously processed data
        if (byteLenRemaining < genTypeSize(readType))
        {
            if (varTypeIsIntegral(readType))
            {
                // Use a smaller GPR load for the remaining data, we're going to zero-extend it
                // since the previous GPR load was larger. Hence, for e.g. 6 bytes we're going to do
                // "(IND<INT> ^ cns1) | (UINT)(IND<USHORT> ^ cns2)"
                readType = roundUpGPRType(byteLenRemaining);
            }
            else
            {
                // TODO-CQ: We should probably do the same for SIMD, e.g. 34 bytes -> SIMD32 and SIMD16
                // while currently we do SIMD32 and SIMD32. This involves a bit more complex upcasting logic.
            }

            // Overlap with the previously processed data
            byteLenRemaining = genTypeSize(readType);
            assert(byteLenRemaining <= byteLen);
        }

        ssize_t byteOffset = ((ssize_t)byteLen - (ssize_t)byteLenRemaining);

        // Total offset includes dataOffset (e.g. 12 for String)
        ssize_t totalOffset = byteOffset + (ssize_t)dataOffset;

        // Clone dst and add offset if necessary.
        GenTree* absOffset  = gtNewIconNode(totalOffset, TYP_I_IMPL);
        GenTree* currData   = gtNewOperNode(GT_ADD, TYP_BYREF, gtCloneExpr(data), absOffset);
        GenTree* loadedData = gtNewIndir(readType, currData, GTF_IND_UNALIGNED | GTF_IND_ALLOW_NON_ATOMIC);

        // For OrdinalIgnoreCase mode we need to convert both data and cns to lower case
        if (cmpMode == OrdinalIgnoreCase)
        {
            WCHAR mask[MaxPossibleUnrollSize] = {};
            int   maskSize                    = (int)genTypeSize(readType) / 2;
            if (!ConvertToLowerCase(cns + (byteOffset / 2), reinterpret_cast<WCHAR*>(&mask), maskSize))
            {
                // value contains non-ASCII chars, we can't proceed further
                return nullptr;
            }

            // 0x20 mask for the current chunk to convert it to lower case
            GenTree* toLowerMask = gtNewGenericCon(readType, (uint8_t*)mask);

            // loadedData is now "loadedData | toLowerMask"
            loadedData = bitwiseOp(GT_OR, genActualType(readType), loadedData, toLowerMask);
        }
        else
        {
            assert(cmpMode == Ordinal);
        }

        GenTree* srcCns = gtNewGenericCon(readType, (uint8_t*)cns + byteOffset);

        // A small optimization: prefer X == Y over X ^ Y == 0 since
        // just one comparison is needed, and we can do it with a single load.
        if ((genTypeSize(readType) == byteLen) && varTypeIsIntegral(readType))
        {
            // TODO-CQ: Figure out why it's a size regression for SIMD
            return bitwiseOp(GT_EQ, TYP_INT, loadedData, srcCns);
        }

        // loadedData ^ srcCns
        GenTree* xorNode = bitwiseOp(GT_XOR, genActualType(readType), loadedData, srcCns);

        // Merge with the previous result with OR
        if (result == nullptr)
        {
            // It's the first check
            result = xorNode;
        }
        else
        {
            if (!result->TypeIs(readType))
            {
                assert(varTypeIsIntegral(result) && varTypeIsIntegral(readType));
                xorNode = gtNewCastNode(result->TypeGet(), xorNode, true, result->TypeGet());
            }

            // Merge with the previous result via OR
            result = bitwiseOp(GT_OR, genActualType(result->TypeGet()), result, xorNode);
        }

        // Move to the next chunk.
        byteLenRemaining -= genTypeSize(readType);
    }

    // Compare the result against zero, e.g. (chunk1 ^ cns1) | (chunk2 ^ cns2) == 0
    return bitwiseOp(GT_EQ, TYP_INT, result, gtNewZeroConNode(result->TypeGet()));
}

This simplifies the code and brings some benefits (see the description).
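For reference, here is a hand-written C++ analogue of the shape this produces for a 5-char ordinal comparison against "hello" (my reading of the emitted pattern, not actual JIT output): 10 bytes are checked as one 8-byte and one 2-byte load, the per-chunk XORs are merged with OR, and a single compare against zero is done at the end.

    // Roughly "(IND<LONG> ^ cns1) | (ULONG)(IND<USHORT> ^ cns2) == 0".
    #include <cstdint>
    #include <cstring>

    static bool equalsHelloOrdinal(const char16_t* data)
    {
        static const char16_t cns[] = u"hello"; // 5 chars = 10 bytes of payload

        uint64_t chunk1, cns1;
        uint16_t chunk2, cns2;
        std::memcpy(&chunk1, data, sizeof(chunk1));     // bytes [0..7]
        std::memcpy(&cns1, cns, sizeof(cns1));
        std::memcpy(&chunk2, data + 4, sizeof(chunk2)); // bytes [8..9]
        std::memcpy(&cns2, cns + 4, sizeof(cns2));

        // XOR each chunk with its constant, OR the differences, compare once.
        return ((chunk1 ^ cns1) | (uint64_t)(chunk2 ^ cns2)) == 0;
    }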

@jakobbotsch jakobbotsch (Member) left a comment

LGTM, nice cleanup.

Btw I see that arm64 still uses bitwise ops instead of conditional compares. The conditional compares would likely be more dense.

@EgorBo EgorBo (Member, Author) commented Oct 17, 2024

Btw I see that arm64 still uses bitwise ops instead of conditional compares. The conditional compares would likely be more dense.

I presume it's better if some other phase (morph/lower) converts this to whatever shape is better, so it can also handle user inputs that use bitwise ops.

@EgorBo EgorBo merged commit 8734660 into dotnet:main Oct 18, 2024
108 checks passed
@EgorBo EgorBo deleted the cleanup-str-unroll branch October 18, 2024 09:28
@jakobbotsch jakobbotsch (Member)

I presume it's better if some other phase (morph/lower) converts this to whatever shape is better, so it can also handle user inputs that use bitwise ops.

It would make more sense to me for this transform to produce the obvious IR (using GT_EQ), which will indeed be transformed to conditional compares.
Then I agree the x64 backend could transform to the bitwise version if that is actually better.
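To illustrate the two shapes under discussion, a rough C++ paraphrase (not actual JIT IR) for two 8-byte chunks a/b compared against constants x/y:

    #include <cstdint>

    // "Obvious" GT_EQ shape: a chain of equalities. On arm64 this maps
    // naturally to cmp + ccmp + cset, which is where the density win of
    // conditional compares comes from.
    static bool equalsChained(uint64_t a, uint64_t b, uint64_t x, uint64_t y)
    {
        return (a == x) && (b == y);
    }

    // Bitwise shape the expansion currently builds: a branch-free merge of
    // the per-chunk differences with a single compare against zero.
    static bool equalsBitwise(uint64_t a, uint64_t b, uint64_t x, uint64_t y)
    {
        return ((a ^ x) | (b ^ y)) == 0;
    }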

@github-actions github-actions bot locked and limited conversation to collaborators Nov 20, 2024