Description
Right now, isalpha simply checks whether the given character is in one of the L categories (isalpha(c::AbstractChar) = UTF8PROC_CATEGORY_LU <= category_code(c) <= UTF8PROC_CATEGORY_LO). This is almost correct, but the Unicode Alphabetic property covers more than these categories: it also includes the Nl category (letter numbers, e.g. Roman numerals) and, crucially, a set of characters defined as Other_Alphabetic, which live mostly in Mc and Mn (spacing and non-spacing marks). Many code points in Indic texts, e.g. most occurrences of vowel signs in Tamil texts, are characters found in this Other_Alphabetic list.
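To make the gap concrete, here is a minimal sketch in Python (used only because its standard unicodedata module exposes general categories; it is not the Julia implementation). The helper category_only_isalpha is a hypothetical name that mimics the L-categories-only check described above. Note that unicodedata does not expose the Alphabetic or Other_Alphabetic properties, so this only shows which general categories the check leaves out.

```python
import unicodedata

# General categories accepted by a check that only looks at the L categories.
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}


def category_only_isalpha(ch):
    """Hypothetical helper: mimic a check based only on the L categories."""
    return unicodedata.category(ch) in LETTER_CATEGORIES


# One sample each from Lo, Nl and Mc.
samples = ["\u0b85", "\u2163", "\u0bbe"]

for ch in samples:
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        "accepted" if category_only_isalpha(ch) else "rejected",
    )
```

The Roman numeral (Nl) and the Tamil vowel sign (Mc) are both rejected by the category-only check, even though the Alphabetic property is meant to cover them.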
Among the few other (programming) languages I tried this check on, Ruby (\p{Alpha}) and Java (Character.isAlphabetic) get this right (the Java documentation explicitly explains the Alphabetic property), while Python 2 and 3 both seem to get it wrong ("அதிகாலை".isalpha()). Perl also identifies the Other_Alphabetic characters correctly under \p{Alpha} (though it also seems to have additional magic on top).
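The Python behaviour mentioned above can be reproduced with a short sketch. The variable names are illustrative only; the word is the same one used in the check above.

```python
word = "அதிகாலை"

# str.isalpha() on the whole word, as in the check quoted above.
whole_word_result = word.isalpha()

# Collect the individual code points that str.isalpha() rejects.
failing = [f"U+{ord(ch):04X}" for ch in word if not ch.isalpha()]

print("word.isalpha():", whole_word_result)
print("code points rejected:", failing)
```

The rejected code points are the dependent vowel signs, which are marks rather than L-category letters.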
Other_Alphabetic apparently applies to 1300 code points according to the Unicode PropList, so letters from quite a few scripts currently fail isalpha.
I'm not sure whether utf8proc supports querying for either the Alphabetic or the Other_Alphabetic property (the utf8proc_property_struct doesn't seem to have either), so this might have to be implemented there first. Also, this is possibly related to #25653 with regard to implementation.