Two-stage unicode tables for UTF-8 Char format #25653

@stevengj

In the long run, it would be nice to have character-query functions (isalpha, grapheme breaks, etc.) that operate directly on the new Char format, rather than requiring conversion/decoding to UInt32. Since we already maintain our own Unicode tables in utf8proc, it seems reasonable to switch them to use the new format "natively" at some point.

The standard way to do this is a two-stage table, since a single lookup table of all Unicode code points would be too big/slow.
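For readers unfamiliar with the technique, here is a minimal sketch of a two-stage table indexed by code point. The function names, the `block_bits` parameter, and the toy property data are all hypothetical and chosen for illustration; utf8proc generates its own tables with its own layout.

```julia
# Build a two-stage table from a dense property vector, where props[cp + 1]
# holds the property of code point cp. Identical blocks are stored only once.
# (Hypothetical helper, not utf8proc's actual generator.)
function build_two_stage(props::Vector{UInt8}; block_bits::Int = 8)
    block_size = 1 << block_bits
    stage1 = UInt16[]                      # block number for each high part
    stage2 = UInt8[]                       # concatenated unique blocks
    seen = Dict{Vector{UInt8},UInt16}()    # block contents => block number
    for start in 1:block_size:length(props)
        stop = min(start + block_size - 1, length(props))
        block = props[start:stop]
        # Pad the final block with zeros so every block has the same size.
        append!(block, zeros(UInt8, block_size - length(block)))
        idx = get!(seen, block) do
            append!(stage2, block)
            UInt16(length(stage2) ÷ block_size - 1)
        end
        push!(stage1, idx)
    end
    return stage1, stage2
end

# Look up a code point: high bits select the block, low bits the entry in it.
function lookup(stage1, stage2, cp::Integer; block_bits::Int = 8)
    block_size = 1 << block_bits
    hi = cp >> block_bits
    lo = cp & (block_size - 1)
    hi < length(stage1) || return 0x00     # outside the table: default property
    return stage2[Int(stage1[hi + 1]) * block_size + lo + 1]
end

# Toy data: mark ASCII lowercase letters among code points 0x000 to 0x2ff.
props = zeros(UInt8, 0x300)
for c in 'a':'z'
    props[UInt32(c) + 1] = 0x01
end
stage1, stage2 = build_two_stage(props)
```

The space saving comes from the deduplication: in this toy data the second and third blocks are both all zeros, so they share one stored block.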

Before we lock ourselves into the new Char format, however, it would be good to think about how it affects two-stage tables. In particular, since two-stage tables are based on dividing codepoints into blocks via the low-order bits, the fact that the encoded Char values are zero-padded may be a concern.

```julia
julia> reinterpret(UInt32, 'a')
0x61000000
```
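To make the concern concrete, the following sketch prints the raw representation and its low byte for characters of each encoded length. The helper names `raw` and `low_byte` are hypothetical, and using the low 8 bits as the in-block offset is only an assumed block size.

```julia
# Raw 32-bit representation of a Char (UTF-8 bytes in the high-order positions).
raw(c::Char) = reinterpret(UInt32, c)

# The part a naive two-stage table would use as the offset within a block,
# assuming 256-entry blocks.
low_byte(c::Char) = UInt8(raw(c) & 0xff)

# One character each of encoded length 1, 1, 2, 3, and 4 bytes.
for c in ('a', 'z', 'é', '€', '😀')
    println(repr(c), " => ", repr(raw(c)), ", low byte ", repr(low_byte(c)))
end
```

Only the 4-byte character has a nonzero low byte here, so a naive split on the raw bits would map every shorter character to offset 0 of its block.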

For many codepoints in this format, the least-significant bits will provide no information. Does that mean that traditional two-stage tables won't work? Is there an easy fix?
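One possible direction, offered only as a sketch and not as a settled answer to the question above, is to shift out the zero padding before splitting the value, so that the last encoded byte lands in the low-order position. The helpers `compact` and `split_key` are hypothetical, and the sketch assumes well-formed UTF-8 characters.

```julia
# Drop trailing zero bytes from the raw Char representation.
# trailing_zeros(u) & 56 rounds the zero-bit count down to a multiple of 8,
# so only whole padding bytes are removed. Assumes a well-formed character.
function compact(c::Char)
    u = reinterpret(UInt32, c)
    return u >> (trailing_zeros(u) & 56)
end

# Split the compacted value into a stage-1 key (the leading bytes) and a
# stage-2 offset (the final encoded byte).
function split_key(c::Char)
    v = compact(c)
    return (v >> 8, UInt8(v & 0xff))
end
```

With this split, characters that differ only in their final UTF-8 byte share a stage-1 key, for example `split_key('é')` and `split_key('è')` both have key `0xc3`. Whether the extra shift is cheap enough, and how many distinct keys the leading bytes produce, would need to be measured.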

I haven't really thought about this much, but I think it's important to take a look to make sure we aren't creating any headaches for later. cc @StefanKarpinski

Labels: performance, strings, unicode
