StringUtilities currently uses a vectorized approach based on hw-intrinsics, which doesn't cover Arm.
In #44040 there's an attempt to use the xplat-intrinsics. In the meantime dotnet/runtime#28230 was completed, so the best option is to base StringUtilities on these new APIs, which lets the custom vectorized code go away (cf. #44040 (comment)).
With the ASCII APIs it could look like d2a1c23...25c5620, but some pieces are missing. Copied from #44040 (comment):
What's missing to achieve this?
ASCII
In StringUtilities, only ASCII values in the open range (0x00, 0x80) are considered valid, i.e. 0x00 is rejected.
Ascii.ToUtf16 treats the whole ASCII range [0x00, 0x80) as valid.
Thus something like
namespace System.Buffers.Text
{
public static class Ascii
{
// existing methods
- public static OperationStatus ToUtf16(ReadOnlySpan<byte> source, Span<char> destination, out int bytesConsumed, out int charsWritten);
+ public static OperationStatus ToUtf16(ReadOnlySpan<byte> source, Span<char> destination, out int bytesConsumed, out int charsWritten, bool treatNullAsInvalid = false);
}
}
is needed.
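To make the intended semantics concrete, here is a minimal scalar sketch of what `treatNullAsInvalid` would mean. The class name `AsciiSketch` and the three-`out`-less shape of the method are hypothetical and only for illustration; this is a reference for the behaviour, not the vectorized implementation the runtime would ship.

```csharp
using System;
using System.Buffers;

public static class AsciiSketch
{
    // Hypothetical scalar reference for the proposed behaviour:
    // bytes >= 0x80 are always invalid, 0x00 only when requested.
    public static OperationStatus ToUtf16(
        ReadOnlySpan<byte> source,
        Span<char> destination,
        out int charsWritten,
        bool treatNullAsInvalid = false)
    {
        int count = Math.Min(source.Length, destination.Length);

        for (int i = 0; i < count; i++)
        {
            byte b = source[i];
            if (b >= 0x80 || (treatNullAsInvalid && b == 0))
            {
                // Report how many chars were written before the bad byte.
                charsWritten = i;
                return OperationStatus.InvalidData;
            }

            destination[i] = (char)b;
        }

        charsWritten = count;
        return source.Length > destination.Length
            ? OperationStatus.DestinationTooSmall
            : OperationStatus.Done;
    }
}
```

With the default `treatNullAsInvalid: false` the input `{ 0x41, 0x00, 0x42 }` is accepted in full; with `treatNullAsInvalid: true` the call stops at index 1 and returns `OperationStatus.InvalidData`, which matches what StringUtilities expects.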
Latin1
I don't know how hot Latin1 is here, but as it's special cased in
aspnetcore/src/Servers/Kestrel/Core/src/Internal/Infrastructure/HttpUtilities.cs (lines 140 to 143 in d3259f9):
if (ReferenceEquals(encoding, Encoding.Latin1))
{
    return span.GetLatin1StringNonNullCharacters();
}
Encoding.Latin1 can't be used on its own, because 0x00 is considered invalid here while Encoding.Latin1 accepts it.
Thus basically the same as for ASCII above applies, i.e.
namespace System.Buffers.Text
{
+ public static class Latin1
+ {
+ // other methods similar to Ascii?
+ public static OperationStatus ToUtf16(ReadOnlySpan<byte> source, Span<char> destination, out int bytesConsumed, out int charsWritten, bool treatNullAsInvalid = false);
+ }
}
If the type Latin1 seems too heavy or too niche, an alternative is something like 25c5620...e3afae2, where Latin1 bytes are expanded to UTF-16 via Ascii.ToUtf16 and, if a non-ASCII byte is met, the remainder is processed in a scalar way. This is a naive approach that should be perf-tested. I don't have numbers on how likely Latin1 inputs with bytes in the range [0x80, 0xFF] are, but if they are rare then that (simple) approach should be good enough.
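The fallback idea can be sketched roughly as follows. This is not the code from the linked commits; `Latin1Sketch` and `ToUtf16NonNull` are hypothetical names, and the sketch assumes the `System.Text.Ascii.ToUtf16(source, destination, out charsWritten)` shape as shipped in .NET 8, which differs from the signature quoted in this issue. Because `Ascii.ToUtf16` accepts 0x00, the null check is done separately up front.

```csharp
using System;
using System.Buffers;
using System.Text;

public static class Latin1Sketch
{
    // Hypothetical helper: widens Latin1 bytes to UTF-16 and rejects 0x00.
    public static OperationStatus ToUtf16NonNull(
        ReadOnlySpan<byte> source,
        Span<char> destination,
        out int charsWritten)
    {
        charsWritten = 0;

        if (destination.Length < source.Length)
        {
            return OperationStatus.DestinationTooSmall;
        }

        // Ascii.ToUtf16 treats 0x00 as valid, so reject it before widening.
        if (source.IndexOf((byte)0) >= 0)
        {
            return OperationStatus.InvalidData;
        }

        // Fast path: widens the leading ASCII run and stops at the
        // first byte >= 0x80 (assumed .NET 8 signature).
        _ = Ascii.ToUtf16(source, destination, out int asciiLength);

        // Scalar remainder: each Latin1 byte maps to the code point
        // with the same numeric value.
        for (int i = asciiLength; i < source.Length; i++)
        {
            destination[i] = (char)source[i];
        }

        charsWritten = source.Length;
        return OperationStatus.Done;
    }
}
```

If bytes in [0x80, 0xFF] are rare, almost all of the work stays in the `Ascii.ToUtf16` call; a single early non-ASCII byte pushes the rest of the input into the scalar loop, which is the case worth measuring.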