StringUtilities based on vectorized helpers provided by core libraries #45962

@gfoidl

StringUtilities currently uses a vectorized approach based on hardware intrinsics, which doesn't cover Arm.
In #44040 there's an attempt to use the xplat intrinsics, but in the meantime dotnet/runtime#28230 was completed. The best option is therefore to base StringUtilities on these new APIs, so that the custom vectorized code can go away (cf. #44040 (comment)).

With the ASCII APIs it could look like d2a1c23...25c5620, but there are some pieces missing. Copied from #44040 (comment):

What's missing to achieve this?

ASCII

In StringUtilities, only ASCII values in the range (0x00, 0x80) are considered valid, i.e. 0x00 is rejected.
Ascii.ToUtf16 treats the whole ASCII range [0x00, 0x80) as valid.

Thus something like

namespace System.Buffers.Text
{
    public static class Ascii
    {
        // existing methods
-       public static OperationStatus ToUtf16(ReadOnlySpan<byte> source, Span<char> destination, out int bytesConsumed, out int charsWritten);
+       public static OperationStatus ToUtf16(ReadOnlySpan<byte> source, Span<char> destination, out int bytesConsumed, out int charsWritten, bool treatNullAsInvalid = false);
    }
}

is needed.
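To illustrate the difference between the two ranges, here is a minimal sketch of how the null check could be layered on top of the existing widening API as a separate pass. It assumes the shipped form `System.Text.Ascii.ToUtf16(ReadOnlySpan<byte>, Span<char>, out int charsWritten)`, which differs from the proposal shape shown above; `AsciiNullCheckSketch` and `ToUtf16RejectingNull` are hypothetical names and not part of StringUtilities or the core libraries.

```csharp
using System;
using System.Buffers;
using System.Text;

static class AsciiNullCheckSketch
{
    // Hypothetical helper: widens ASCII bytes to UTF-16 and treats 0x00 as invalid,
    // i.e. it accepts (0x00, 0x80) instead of [0x00, 0x80).
    public static OperationStatus ToUtf16RejectingNull(
        ReadOnlySpan<byte> source, Span<char> destination, out int charsWritten)
    {
        // Only the prefix before the first 0x00 is valid input.
        int nullIndex = source.IndexOf((byte)0);
        ReadOnlySpan<byte> valid = nullIndex < 0 ? source : source.Slice(0, nullIndex);

        // Assumption: the shipped three-argument overload in System.Text.
        OperationStatus status = Ascii.ToUtf16(valid, destination, out charsWritten);

        // The prefix widened fine, but a null byte cut the input short.
        if (status == OperationStatus.Done && nullIndex >= 0)
        {
            return OperationStatus.InvalidData;
        }

        return status;
    }
}
```

The extra `IndexOf` pass is the cost that a `treatNullAsInvalid` parameter would presumably avoid, since the check could then be folded into the widening loop itself.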

Latin1

I don't know how hot Latin1 is here, but as it's special cased in

if (ReferenceEquals(encoding, Encoding.Latin1))
{
    return span.GetLatin1StringNonNullCharacters();
}

I think it's hot enough to be optimized. Besides that, the standard Encoding.Latin1 can't be used on its own, as 0x00 is considered invalid here.

Thus basically the same as for ASCII above applies, i.e.

namespace System.Buffers.Text
{
+   public static class Latin1
+   {
+       // other methods similar as Ascii?
+       public static OperationStatus ToUtf16(ReadOnlySpan<byte> source, Span<char> destination, out int bytesConsumed, out int charsWritten, bool treatNullAsInvalid = false);
+   }
}

If the type Latin1 seems too heavy, too niche, or whatever, an alternative would be something like 25c5620...e3afae2, where Latin1 bytes are expanded to UTF-16 via Ascii.ToUtf16 and, if a non-ASCII byte is met, the remainder is processed in a scalar loop. This is a naive approach, though, and should be perf-tested. I don't have numbers on how likely Latin1 inputs with bytes in the range [0x80, 0xFF] are, but if they are rare then that (simple) approach should be good enough.
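The fallback idea can be sketched roughly as follows. This is not the code from the linked commits, only an illustration under stated assumptions: it uses the shipped `System.Text.Ascii.ToUtf16(ReadOnlySpan<byte>, Span<char>, out int charsWritten)` overload and assumes that, when `InvalidData` is returned, `charsWritten` reports the length of the ASCII prefix that was already converted. `Latin1FallbackSketch` and `TryWidenRejectingNull` are hypothetical names.

```csharp
using System;
using System.Buffers;
using System.Text;

static class Latin1FallbackSketch
{
    // Hypothetical helper: widens Latin1 bytes to UTF-16, rejecting 0x00.
    public static bool TryWidenRejectingNull(ReadOnlySpan<byte> source, Span<char> destination)
    {
        if (destination.Length < source.Length)
        {
            return false;
        }

        // 0x00 is considered invalid, so check for it up front.
        if (source.IndexOf((byte)0) >= 0)
        {
            return false;
        }

        // Fast path: widen the ASCII prefix with the core library helper.
        OperationStatus status = Ascii.ToUtf16(source, destination, out int written);
        if (status == OperationStatus.Done)
        {
            return true;
        }

        // Slow path: a byte >= 0x80 was met. Each Latin1 byte maps to the
        // UTF-16 code unit with the same value, so widen the rest one by one.
        for (int i = written; i < source.Length; i++)
        {
            destination[i] = (char)source[i];
        }

        return true;
    }
}
```

For example, the bytes `0x48 0xE9 0x21` would take the fast path for `H`, then fall back to the scalar loop for `é` and `!`. Whether this is good enough depends on how often the slow path is hit, which is the open perf question raised above.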

Labels

api-suggestion: Early API idea and discussion, it is NOT ready for implementation
area-networking: Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions
