`isalpha` should use Unicode property `Alphabetic`; rename to `isletter`

Right now, it simply checks whether the given character is in one of the L categories (`isalpha(c::AbstractChar) = UTF8PROC_CATEGORY_LU <= category_code(c) <= UTF8PROC_CATEGORY_LO`). This is almost correct, except that the Unicode `Alphabetic` property covers more than these categories: it also includes the `Nl` category ([letter numbers](https://www.compart.com/en/unicode/category/Nl), e.g. Roman numerals) and, crucially, a set of characters defined to be `Other_Alphabetic` that live in `Mc` and `Mn` (spacing and non-spacing marks). A lot of code points in Indic texts, for example most occurrences of vowels in Tamil texts, are characters found in this `Other_Alphabetic` list.
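To make the gap concrete, here is a minimal sketch in Python (not the Julia implementation) that mimics an L-categories-only check and lists the general category of each code point in a Tamil word. The helper names `is_letter_category` and `describe` are made up for illustration.

```python
import unicodedata

def is_letter_category(ch):
    """Mimic the L-only check: true for Lu, Ll, Lt, Lm, Lo and nothing else."""
    return unicodedata.category(ch).startswith("L")

def describe(text):
    """Return (code point, general category, passes L-only check) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), is_letter_category(ch))
        for ch in text
    ]

word = "அதிகாலை"

# The vowel signs show up as marks (M*), not letters (L*).
for row in describe(word):
    print(row)

# The word as a whole fails an L-only check because of those marks.
print(all(is_letter_category(ch) for ch in word))  # False

# A Roman numeral is a letter number (Nl), also outside the L categories.
print(unicodedata.category("\u2167"))  # Nl
```

The consonants and the independent vowel pass, while the dependent vowel signs are reported as marks, which is why the word fails an L-only check.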

Among the few other programming languages I tried this check on, Ruby (`\p{Alpha}`) and Java (`Character.isAlphabetic`) get this right (the Java documentation [explicitly explains the `Alphabetic` property](https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isAlphabetic(int))). Python 2 and 3 (`"அதிகாலை".isalpha()`) both seem to get it wrong. Perl also correctly identifies the `Other_Alphabetic` characters under `\p{Alpha}` (though it seems to have additional magic on top).

`Other_Alphabetic` apparently covers 1300 code points according to the [Unicode PropList](https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt), so letters from quite a few scripts currently fail `isalpha`.
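One way such a check could be built from the PropList data is sketched below in Python. The embedded `SAMPLE_PROPLIST` is a small hand-typed excerpt in the file's `range ; property # comment` format, covering only a few Tamil vowel signs; it should be checked against the real file before being relied on. The names `parse_property` and `is_alphabetic` are hypothetical, and the check follows the approximation described in this issue (L categories, `Nl`, and `Other_Alphabetic`), not the full Unicode derivation of `Alphabetic`.

```python
import unicodedata

# Hand-typed excerpt in PropList.txt format (assumption: verify against the real file).
SAMPLE_PROPLIST = """\
0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>
0BBE..0BBF    ; Other_Alphabetic # Mc   [2] TAMIL VOWEL SIGN AA..TAMIL VOWEL SIGN I
0BC0          ; Other_Alphabetic # Mn       TAMIL VOWEL SIGN II
0BC1..0BC2    ; Other_Alphabetic # Mc   [2] TAMIL VOWEL SIGN U..TAMIL VOWEL SIGN UU
0BC6..0BC8    ; Other_Alphabetic # Mc   [3] TAMIL VOWEL SIGN E..TAMIL VOWEL SIGN AI
0BCA..0BCC    ; Other_Alphabetic # Mc   [3] TAMIL VOWEL SIGN O..TAMIL VOWEL SIGN AU
"""

def parse_property(text, prop):
    """Collect every code point that the given text assigns to `prop`."""
    code_points = set()
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        rng, name = (field.strip() for field in data.split(";"))
        if name != prop:
            continue
        start, _, end = rng.partition("..")
        lo = int(start, 16)
        hi = int(end, 16) if end else lo  # single code point when no ".."
        code_points.update(range(lo, hi + 1))
    return code_points

OTHER_ALPHABETIC = parse_property(SAMPLE_PROPLIST, "Other_Alphabetic")

def is_alphabetic(ch):
    """Approximate Alphabetic as: L categories, Nl, or listed as Other_Alphabetic."""
    category = unicodedata.category(ch)
    return category.startswith("L") or category == "Nl" or ord(ch) in OTHER_ALPHABETIC

word = "அதிகாலை"
print(len(OTHER_ALPHABETIC))                  # 11 for this sample excerpt
print(all(is_alphabetic(ch) for ch in word))  # True
```

Feeding the full PropList text to `parse_property` instead of the excerpt would give the complete `Other_Alphabetic` set; a real implementation would more likely store ranges than a set of individual code points.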

I'm not sure if `utf8proc` supports querying for either the `Alphabetic` or the `Other_Alphabetic` property (the [`utf8proc_property_struct`](https://julialang.org/utf8proc/doc/structutf8proc__property__struct.html) doesn't seem to have either property), so this might have to be implemented there first. Also, possibly related to #25653 with regard to implementation.


`isalpha` should use Unicode property `Alphabetic`; rename to `isletter` #26932
