@@ -945,6 +945,13 @@ private static unsafe nuint NarrowUtf16ToLatin1_Sse2(char* pUtf16Buffer, byte* p

```csharp
/// </summary>
public static unsafe void WidenLatin1ToUtf16(byte* pLatin1Buffer, char* pUtf16Buffer, nuint elementCount)
{
    if (((nuint)pUtf16Buffer & 1) != 0)
    {
        // Input isn't char aligned, we won't be able to vectorize.
        WidenLatin1ToUtf16_MisalignedAddress(pLatin1Buffer, pUtf16Buffer, elementCount);
```
**Member:**
I am not sure I understand the need for this; I see that we already align the data in:

```csharp
if (elementCount >= SizeOfVector128)
{
    // First, perform an unaligned 1x 64-bit read from the input buffer and an unaligned
    // 1x 128-bit write to the destination buffer.
    latin1Vector = Sse2.LoadScalarVector128((ulong*)pLatin1Buffer).AsByte(); // unaligned load
    Sse2.Store((byte*)pUtf16Buffer, Sse2.UnpackLow(latin1Vector, zeroVector)); // unaligned write

    // Calculate how many elements we wrote in order to get pOutputBuffer to its next alignment
    // point, then use that as the base offset going forward. Remember the >> 1 to account for
    // that we wrote chars, not bytes. This means we may re-read data in the next iteration of
    // the loop, but this is ok.
    currentOffset = (SizeOfVector128 >> 1) - (((nuint)pUtf16Buffer >> 1) & (MaskOfAllBitsInVector128 >> 1));
    Debug.Assert(0 < currentOffset && currentOffset <= SizeOfVector128 / sizeof(char));

    // Calculating the destination address outside the loop results in significant
    // perf wins vs. relying on the JIT to fold memory addressing logic into the
    // write instructions. See: https://github.com/dotnet/runtime/issues/33002
    char* pCurrentWriteAddress = pUtf16Buffer + currentOffset;

    // Now run the main 1x 128-bit read + 2x 128-bit write loop.
    nuint finalOffsetWhereCanIterateLoop = elementCount - SizeOfVector128;
    while (currentOffset <= finalOffsetWhereCanIterateLoop)
```

…don't we?

**Contributor Author:**
If the address that pUtf16Buffer points to is naturally aligned, then this logic works as expected.

The problem is

```csharp
char* pCurrentWriteAddress = pUtf16Buffer + currentOffset;
```

which does not ensure correct alignment for a misaligned char*.

**Contributor Author:**
@EgorBo As an alternative approach, change the following to Sse2.Store and accept the performance hit pre-Nehalem:

```csharp
Sse2.StoreAligned(pLatin1Buffer + currentOffsetInElements, latin1Vector); // aligned
```

**Member:**
We should just use unaligned load/store here. It may have some small hit on hardware that's more than 10-12 years old, but it simplifies things overall and will be increasingly rare to encounter such hardware over time. Plus, we don't use aligned loads/stores in most of our workloads already, so this won't be any different.

Opportunistically aligning the data is still goodness and will avoid cache line splits.

**Member:**
It looks like the Sse2.Store approach is better, since an unaligned char* should be rare? (e.g. char[] is always aligned)

**Member:**
The consideration would be that Sse2.Store doesn't allow containment on pre-AVX hardware, where movups was frequently slower than movaps even if the data was definitely aligned.

But, given the age of the hardware involved and the logic we use elsewhere, it shouldn't be a huge consideration (and the cost also wasn't huge, typically an extra cycle or so; the split cache line accesses were much worse, closer to 20 cycles for loads).

**Contributor Author:**
> Plus, we don't use aligned loads/stores in most of our workloads already, so this won't be any different.

WidenLatin1ToUtf16_Sse2 and NarrowUtf16ToLatin1_Sse2 are the only methods in the repository that use StoreAligned.

**Contributor Author:**
> It looks like the Sse2.Store approach is better, since an unaligned char* should be rare? (e.g. char[] is always aligned)

A potential case is the public API `System.Text.Encoding.GetChars(System.Byte*, System.Int32, System.Char*, System.Int32)`.

**Member:**
> WidenLatin1ToUtf16_Sse2 and NarrowUtf16ToLatin1_Sse2 are the only methods in the repository that use StoreAligned.

Right, and so preserving the alignment here won't make a big difference to downlevel hardware. We should just normalize to not using StoreAligned so the code can be simpler and robust in the face of unalignable data.

```csharp
        return;
    }

    // If SSE2 is supported, use those specific intrinsics instead of the generic vectorized
    // code below. This has two benefits: (a) we can take advantage of specific instructions like
    // punpcklbw which we know are optimized, and (b) we can avoid downclocking the processor while
```

@@ -1106,5 +1113,19 @@ private static unsafe void WidenLatin1ToUtf16_Fallback(byte* pLatin1Buffer, char

```csharp
        currentOffset++;
    }
}

private static unsafe void WidenLatin1ToUtf16_MisalignedAddress(byte* pLatin1Buffer, char* pUtf16Buffer, nuint elementCount)
{
    if (elementCount != 0)
    {
        do
        {
            Unsafe.WriteUnaligned(pUtf16Buffer, (char)*pLatin1Buffer);
            pUtf16Buffer++;
            pLatin1Buffer++;
        }
        while (--elementCount != 0);
    }
}
}
}
```