In the long run, it would be nice to have character-query functions (isalpha, grapheme breaks, etc.) that operate directly on the new Char format, rather than requiring conversion/decoding to UInt32. Since we already maintain our own Unicode tables in utf8proc, it seems reasonable to switch to "natively" employing the new format at some point.
The standard way to do this is a two-stage table, since a single lookup table of all Unicode code points would be too big/slow.
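To make the structure concrete, here is a toy sketch of a two-stage table (all names are hypothetical, and a trivial "is ASCII letter" predicate stands in for real utf8proc property data): stage 1 maps the high-order bits of a codepoint to a block index, and stage 2 stores the per-codepoint values for each *distinct* block, so identical blocks are deduplicated:

```julia
# Toy two-stage table sketch (hypothetical names, not utf8proc's actual layout).
const BLOCK_BITS = 8                          # 256 codepoints per block
const N_BLOCKS = (0x10FFFF >> BLOCK_BITS) + 1

# Stand-in property; real tables would hold utf8proc character data.
prop(cp) = (0x41 <= cp <= 0x5A) || (0x61 <= cp <= 0x7A)

# Build the tables, deduplicating identical blocks -- this sharing is
# what makes two-stage tables small enough to be practical.
blocks = Vector{Vector{Bool}}()
stage1 = Vector{Int}(undef, N_BLOCKS)
seen = Dict{Vector{Bool},Int}()
for b in 0:N_BLOCKS-1
    block = [prop(UInt32((b << BLOCK_BITS) + i)) for i in 0:(1 << BLOCK_BITS)-1]
    stage1[b + 1] = get!(seen, block) do
        push!(blocks, block)
        length(blocks)
    end
end

# Lookup: high-order bits pick the block, low-order bits index within it.
function lookup(cp::UInt32)
    block = blocks[stage1[(cp >> BLOCK_BITS) + 1]]
    return block[(cp & ((1 << BLOCK_BITS) - 1)) + 1]
end
```

The key point for the discussion below is the last function: the scheme assumes the low-order bits of the key vary densely within a block.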
Before we lock ourselves into the new Char format, however, it would be good to think about how it affects two-stage tables. In particular, since two-stage tables are based on dividing codepoints into blocks via the low-order bits, the fact that the encoded Char values are zero-padded may be a concern.
```julia
julia> reinterpret(UInt32, 'a')
0x61000000
```
For many codepoints in this format, the least-significant bits will provide no information. Does that mean that traditional two-stage tables won't work? Is there an easy fix?
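One candidate fix (just a sketch, not a settled design): byte-swap the encoded value before the two-stage split, so the significant bytes land in the low-order positions. `bswap` is a single instruction on most hardware. The swapped value is not the codepoint for multi-byte characters (it is the UTF-8 bytes in reversed order), but `bswap` is a bijection, so it is still a unique integer key per Char and can index a table:

```julia
# Hypothetical fix: recover low-order significance from the zero-padded
# Char encoding before splitting into block/offset indices.
c = reinterpret(UInt32, 'a')   # 0x61000000: low-order bits carry no information
k = bswap(c)                   # 0x00000061: significant bits are now low-order

# k is not the codepoint for multi-byte characters (the UTF-8 bytes come
# out reversed), but it is still a unique key per Char, so the usual
# high-bits/low-bits split applies to it.
index1 = k >> 8                # selects the block
index2 = k & 0xff              # selects the entry within the block
```

A caveat worth checking: keys produced this way won't cluster into blocks the same way raw codepoints do, which could affect how well block deduplication compresses the tables.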
I haven't really thought about this much, but I think it's important to take a look to make sure we aren't creating any headaches for later. cc @StefanKarpinski