Two-stage unicode tables for UTF-8 Char format

In the long run, it would be nice to have character-query functions (`isalpha`, grapheme breaks, etc.) that operate directly on the new `Char` format, rather than requiring conversion/decoding to a `UInt32` code point. Since we already maintain our own Unicode tables in utf8proc, it seems reasonable to switch to "natively" employing the new format at some point.

The standard way to do this is a [two-stage table](https://www.strchr.com/multi-stage_tables), since a single lookup table of all Unicode code points would be too big/slow.
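
For reference, here is a minimal sketch of a conventional two-stage table indexed by decoded code point. The names `build_two_stage` and `lookup`, the 256-entry block size, and the use of `isletter` as the stored property are all assumptions chosen for illustration; this is not how utf8proc lays out its tables.

```julia
# Block size is an illustrative choice: the low 8 bits index within a block.
const BLOCK_BITS = 8
const BLOCK_SIZE = 1 << BLOCK_BITS

# Build the two stages for a property `prop(cp::UInt32) -> Bool`.
# stage1 maps a block number to an index into stage2;
# stage2 holds only the distinct blocks, so identical blocks are stored once.
function build_two_stage(prop, maxcp::Integer = 0x10FFFF)
    stage1 = UInt16[]
    stage2 = Vector{UInt8}[]
    seen = Dict{Vector{UInt8},UInt16}()
    for base in 0:BLOCK_SIZE:Int(maxcp)
        block = UInt8[prop(UInt32(base + i)) for i in 0:BLOCK_SIZE-1]
        idx = get!(seen, block) do
            push!(stage2, block)
            UInt16(length(stage2))
        end
        push!(stage1, idx)
    end
    return stage1, stage2
end

# High-order bits select the block, low-order bits select the entry in it.
lookup(stage1, stage2, cp::Integer) =
    stage2[stage1[(cp >> BLOCK_BITS) + 1]][(cp & (BLOCK_SIZE - 1)) + 1]

# Hypothetical property used only for this sketch.
letterprop(cp::UInt32) = isvalid(Char, cp) && isletter(Char(cp))

stage1, stage2 = build_two_stage(letterprop)
```

With these definitions, `lookup(stage1, stage2, UInt32('a'))` returns a nonzero entry, and `length(stage2)` is far smaller than `length(stage1)` because most blocks repeat. Note that the lookup key here is the decoded code point, which is exactly the conversion step the issue hopes to avoid.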

Before we lock ourselves into the new `Char` format, however, it would be good to think about how it affects two-stage tables. In particular, since two-stage tables are based on dividing code points into blocks via the low-order bits, the fact that the encoded `Char` values are zero-padded in their low-order bytes may be a concern.
```
julia> reinterpret(UInt32, 'a')
0x61000000
```
For many code points in this format, the least-significant bits are just padding and provide no information. Does that mean that traditional two-stage tables won't work? Is there an easy fix?
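
To make the concern concrete, the sketch below splits the raw `Char` bits the way a traditional two-stage table would, and then shows one conceivable workaround: shifting out the padding bytes before indexing. The helper names `rawbits`, `naive_split`, and `packed_utf8` are hypothetical, and the shift is only an assumption about what a fix might look like, not a worked-out table design.

```julia
# Raw 32-bit representation of a Char (UTF-8 bytes, left-aligned, zero-padded).
rawbits(c::Char) = reinterpret(UInt32, c)

# Split the raw bits into (block number, offset within block) using the
# low 8 bits as the offset, as a code-point-based two-stage table would.
function naive_split(c::Char)
    u = rawbits(c)
    return (u >> 8, u & UInt32(0xff))
end

# Hypothetical workaround: drop the zero padding so that the low-order
# bits are the last UTF-8 byte of the character.
packed_utf8(c::Char) = rawbits(c) >> (8 * (4 - ncodeunits(c)))

for c in ('a', '\u00e9', '\u20ac')
    println(repr(c), ": raw = ", repr(rawbits(c)),
            ", naive offset = ", repr(naive_split(c)[2]),
            ", packed = ", repr(packed_utf8(c)))
end
```

For every character encoded in one to three UTF-8 bytes, the naive offset is zero, so each such character would land in its own block and the second stage would not compress anything. The packed form restores informative low-order bits, although its values are UTF-8 byte sequences rather than code points, so any tables built on it would need to be generated in that order.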

I haven't really thought about this much, but I think it's important to take a look to make sure we aren't creating any headaches for later. cc @StefanKarpinski

Two-stage unicode tables for UTF-8 Char format #25653

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Two-stage unicode tables for UTF-8 Char format #25653

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions