gh-129117: Expose `_PyUnicode_IsXidContinue/Start` in `unicodedata` #140269

StanFromIreland · 2025-10-17T19:46:57Z

Issue: unicodedata module needs way of accurately determining XID_START and XID_CONTINUE properties. #129117

📚 Documentation preview 📚: https://cpython-previews--140269.org.readthedocs.build/

Include/cpython/unicodeobject.h

vstinner · 2025-10-28T18:09:55Z

@StanFromIreland requested a review from vstinner

I don't know these Unicode properties. The PR documentation doesn't help me:

> Return True if the character has the XID_Start property

What does XID_Start mean?

StanFromIreland · 2025-10-28T18:40:27Z

Ah no worries then. You can find their documentation in this report, I can add a link to it in the docs.

vstinner · 2025-10-28T18:44:44Z

In short, these functions check if a character is an identifier start or an identifier character according to Unicode TR31?

StanFromIreland · 2025-10-28T18:54:20Z

Yes.

Include/internal/pycore_unicodedata.h

Doc/library/unicodedata.rst

Modules/unicodedata.c

Objects/unicodectype.c

This reverts commit b24b994.

Include/internal/pycore_unicodectype.h

Include/internal/pycore_unicodeobject.h

Modules/unicodedata.c

malemburg

LGTM now

Include/internal/pycore_unicodeobject.h

StanFromIreland · 2025-10-29T16:59:17Z

Thanks for the reviews!

Doc/library/unicodedata.rst

vstinner

The change is correct, but I'm not convinced that we have to expose this feature in Python. It seems to be a Unicode feature which is rarely used.

malemburg · 2025-10-29T19:09:56Z

Have a look at https://peps.python.org/pep-3131/ for why these are important to have.

vstinner

About function names, the Unicode annex has also ID_Start and ID_Continue. The XID is a variant. Maybe we should keep x in the function names?

StanFromIreland · 2025-10-29T19:12:25Z

> About function names, the Unicode annex has also ID_Start and ID_Continue.

Note that they explicitly recommend the "X" variants.

malemburg · 2025-10-29T19:15:41Z

> Maybe we should keep x in the function names?

You have a point there. Let's keep the "x" in "xid" for the functions to not cause confusion.

vstinner

LGTM

Doc/whatsnew/3.15.rst

Doc/library/unicodedata.rst

StanFromIreland · 2025-10-30T10:21:22Z

Thanks for merging!

encukou · 2025-11-04T15:11:29Z

> Have a look at https://peps.python.org/pep-3131/ for why these are important to have.

It's still not clear to me why isidentifier() is not enough here.

Note that neither PEP-3131 nor current Python use the Unicode definition of XID_Start to determine identifiers -- they additionally allow the underscore:

>>> unicodedata.isxidstart('_')
False
>>> '_'.isidentifier()
True

Also, there's an easier way to explain name parsing, which involves only id_start & id_continue, and not the xid variants (whose definitions are more complicated): #140464 (review)

malemburg · 2025-11-04T16:31:22Z

Python uses this internally as part of figuring out what a valid identifier is, but XID_Start/Continue are also important to be able to parse other languages which use these as the basis for their identifier definitions.

Note that you need to use the XID variants if you are working with NFKC normalized text. See https://www.unicode.org/reports/tr31/#NFKC_Modifications

From unicodeobject.c:

Py_ssize_t
_PyUnicode_ScanIdentifier(PyObject *self)
{
    Py_ssize_t i;
    Py_ssize_t len = PyUnicode_GET_LENGTH(self);
    if (len == 0) {
        /* an empty string is not a valid identifier */
        return 0;
    }

    int kind = PyUnicode_KIND(self);
    const void *data = PyUnicode_DATA(self);
    Py_UCS4 ch = PyUnicode_READ(kind, data, 0);
    /* PEP 3131 says that the first character must be in
       XID_Start and subsequent characters in XID_Continue,
       and for the ASCII range, the 2.x rules apply (i.e
       start with letters and underscore, continue with
       letters, digits, underscore). However, given the current
       definition of XID_Start and XID_Continue, it is sufficient
       to check just for these, except that _ must be allowed
       as starting an identifier.  */
    if (!_PyUnicode_IsXidStart(ch) && ch != 0x5F /* LOW LINE */) {
        return 0;
    }

    for (i = 1; i < len; i++) {
        ch = PyUnicode_READ(kind, data, i);
        if (!_PyUnicode_IsXidContinue(ch)) {
            return i;
        }
    }
    return i;
}

The underscore is a special exception added for Python.
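For readers who prefer Python, the scan above can be sketched roughly as follows. This is an illustration, not CPython's implementation. `is_xid_start`, `is_xid_continue` and `scan_identifier` are hypothetical helper names. The sketch assumes the functions added by this PR are exposed as `unicodedata.isxidstart` and `unicodedata.isxidcontinue`, and on interpreters without them it falls back to `str.isidentifier()`, relying on the assumption that a one-character string passes `isidentifier()` only when it is XID_Start or the underscore.

```python
import unicodedata


def is_xid_start(ch: str) -> bool:
    """Return True if the single character ch has the XID_Start property."""
    func = getattr(unicodedata, "isxidstart", None)  # assumed name, 3.15+
    if func is not None:
        return func(ch)
    # Fallback (assumption): str.isidentifier() on one character accepts
    # XID_Start plus the underscore, so exclude the underscore here.
    return len(ch) == 1 and ch != "_" and ch.isidentifier()


def is_xid_continue(ch: str) -> bool:
    """Return True if the single character ch has the XID_Continue property."""
    func = getattr(unicodedata, "isxidcontinue", None)  # assumed name, 3.15+
    if func is not None:
        return func(ch)
    # Fallback (assumption): prefix a known-good start character so that
    # only the XID_Continue check on ch decides the result.
    return len(ch) == 1 and ("a" + ch).isidentifier()


def scan_identifier(s: str) -> int:
    """Return the length of the identifier prefix of s (0 if there is none)."""
    if not s:
        # An empty string is not a valid identifier.
        return 0
    # First character: XID_Start, plus Python's special case for "_".
    if not is_xid_start(s[0]) and s[0] != "_":
        return 0
    # Remaining characters: stop at the first one that is not XID_Continue.
    for i in range(1, len(s)):
        if not is_xid_continue(s[i]):
            return i
    return len(s)
```

With this sketch, `scan_identifier(s) == len(s)` for a non-empty `s` corresponds to the check that `s.isidentifier()` performs.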

encukou · 2025-11-05T08:49:30Z

> XID_Start/Continue are also important to be able to parse other languages which use these as the basis for their identifier definitions.

OK, that's a valid reason. Thanks!
I'll add a note about _ to clear up confusion.

> Note that you need to use the XID variants if you are working with NFKC normalized text.

Couldn't you also first normalize and then use the ID variants on the result? (Asking to confirm my understanding, as I'll probably be explaining this to others.)

malemburg · 2025-11-05T09:27:53Z

> > Note that you need to use the XID variants if you are working with NFKC normalized text.
>
> Couldn't you also first normalize and then use the ID variants on the result? (Asking to confirm my understanding, as I'll probably be explaining this to others.)

No, since the normalization creates a few special cases which the ID variants won't handle. From the tech report: "Where programming languages are using NFKC to fold differences between characters, they need the following modifications of the identifier syntax from the Unicode Standard to deal with the idiosyncrasies of a small number of characters. These modifications are reflected in the XID_Start and XID_Continue properties."

See https://www.unicode.org/reports/tr31/#NFKC_Modifications for details.

Since Python is doing exactly that (normalizing to NFKC before parsing), it needs to use the XID variants.

encukou · 2025-11-05T09:42:06Z

AFAIK, these modifications are exactly what's covered by normalizing and checking the result.
For the first example there, a THAI CHARACTER SARA AM is a Lo, which puts it in ID_Start:

>>> unicodedata.category('\N{THAI CHARACTER SARA AM}')
'Lo'

But normalizing turns it into 2 characters, Mn and Lo:

>>> [(unicodedata.name(c), unicodedata.category(c)) for c in unicodedata.normalize('NFKC', '\N{THAI CHARACTER SARA AM}')]
[('THAI CHARACTER NIKHAHIT', 'Mn'), ('THAI CHARACTER SARA AA', 'Lo')]

The NIKHAHIT (Mn) is in ID_Continue but not ID_Start, which means the SARA AM can't start an identifier despite being a letter:

>>> '\N{THAI CHARACTER SARA AM}'.isidentifier()
False

That is: using the XID properties before normalization will get you the same result as using the ID ones after normalization. IOW, you need to use the XID variants if you are not working with NFKC normalized text.
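The SARA AM example can be replayed in one place with a small sketch. It is only an approximation: `approx_id_start`, `approx_id_continue` and `id_check_after_nfkc` are hypothetical helpers that derive ID_Start and ID_Continue from general categories alone, ignoring the Other_ID_Start and Other_ID_Continue additions and the Pattern_Syntax and Pattern_White_Space exclusions. `str.isidentifier()` stands in for the XID check on raw text (it additionally accepts a leading underscore).

```python
import unicodedata

# Assumption: ID_Start / ID_Continue approximated from general categories only.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_CONTINUE_CATEGORIES = ID_START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}


def approx_id_start(ch: str) -> bool:
    """Approximate ID_Start: letters and letter numbers."""
    return unicodedata.category(ch) in ID_START_CATEGORIES


def approx_id_continue(ch: str) -> bool:
    """Approximate ID_Continue: ID_Start plus marks, digits, connectors."""
    return unicodedata.category(ch) in ID_CONTINUE_CATEGORIES


def id_check_after_nfkc(s: str) -> bool:
    """Normalize to NFKC first, then apply the approximate ID rules."""
    normalized = unicodedata.normalize("NFKC", s)
    if not normalized:
        return False
    return approx_id_start(normalized[0]) and all(
        approx_id_continue(c) for c in normalized[1:]
    )


SARA_AM = "\N{THAI CHARACTER SARA AM}"

raw_id_start = approx_id_start(SARA_AM)       # ID rules on the raw character
id_after_nfkc = id_check_after_nfkc(SARA_AM)  # ID rules after normalizing
xid_on_raw = SARA_AM.isidentifier()           # XID rules on the raw character

print(raw_id_start, id_after_nfkc, xid_on_raw)
```

For this character the raw ID check accepts it as a start, while the ID check after NFKC and the XID check on raw text both reject it, which is the agreement described above. The sketch shows this for the one example only and does not establish it for every string.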

malemburg · 2025-11-05T10:27:57Z

> That is: using the XID properties before normalization will get you the same result as using the ID ones after normalization. IOW, you need to use the XID variants if you are not working with NFKC normalized text.

Rereading the section in the TR, you could be right in a way 🙂

It discusses closure under normalization and this essentially means that the isIdentifier() property should give the same results regardless of whether it is applied to normalized text or raw text.

Using the XID variants to implement isIdentifier() will get you this property.

Python uses the XID variants on NFKC normalized text (since it has to normalize anyway) and so the results with respect to being identifiers are the same.

Applications parsing other languages may choose to not normalize first, so for them the XID variants are beneficial as well.

In other places in the TR, it recommends always using the XID variants: "They are recommended for most purposes, especially for security, over the original ID_Start and ID_Continue properties." (see https://www.unicode.org/reports/tr31/#Default_Identifier_Syntax and https://www.unicode.org/reports/tr31/#Migration).

In fact, most of the TR was updated to use the XID variants instead of the ID ones, with the ID variants only left in for backwards compatibility with Unicode versions prior to version 9.

So all in all, you're right in that the purpose of using XID is more generic and can be applied before or after normalization, giving the same results. In addition, it's also safer, since your text may in some cases be half normalized and half raw and XID will still do a proper job, whereas ID may fail in some edge cases.
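A quick way to probe the closure behaviour is to check that strings accepted on raw text are still accepted after NFKC. The sketch below uses `str.isidentifier()` as the XID-based check; `stays_identifier_after_nfkc` is a hypothetical helper and the sample strings are arbitrary choices. It tests one direction only: a string that is rejected raw may still normalize into an identifier, as the circled-digit example at the end shows.

```python
import unicodedata


def stays_identifier_after_nfkc(s: str) -> bool:
    """If s is an identifier on raw text, check it still is after NFKC."""
    if not s.isidentifier():
        return True  # nothing to preserve for non-identifiers
    return unicodedata.normalize("NFKC", s).isidentifier()


# Arbitrary samples, including the characters discussed in this thread.
samples = [
    "name",
    "\N{LATIN SMALL LIGATURE FI}le",
    "x\N{THAI CHARACTER SARA AM}",
    "\N{THAI CHARACTER SARA AM}",
]
results = {s: stays_identifier_after_nfkc(s) for s in samples}

# The reverse direction is not covered: CIRCLED DIGIT ONE is not an
# identifier character, but NFKC turns it into the ASCII digit "1".
circled = "a\N{CIRCLED DIGIT ONE}"
circled_normalized = unicodedata.normalize("NFKC", circled)

print(results)
print(circled.isidentifier(), circled_normalized.isidentifier())
```

A parser for another language that chooses not to normalize could run a check like this over its test corpus to see whether skipping normalization changes which names it accepts.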

encukou · 2025-11-05T10:42:54Z

Ah! It all makes sense now. Thank you!

Commit

c0511cc

StanFromIreland requested a review from malemburg October 17, 2025 19:46

StanFromIreland requested a review from AA-Turner as a code owner October 17, 2025 19:46

bedevere-app bot added the awaiting review label Oct 17, 2025

bedevere-app bot mentioned this pull request Oct 17, 2025

unicodedata module needs way of accurately determining XID_START and XID_CONTINUE properties. #129117

Closed

StanFromIreland requested a review from ezio-melotti October 17, 2025 19:59

Fix linking on windows and refactor test

83bb3a9

vstinner reviewed Oct 18, 2025

Include/cpython/unicodeobject.h (outdated)

Move to pycore_unicodedata.h

571c622

StanFromIreland requested review from a team, emmatyping and erlend-aasland as code owners October 18, 2025 22:03

lint

9043865

StanFromIreland requested a review from vstinner October 18, 2025 22:04

vstinner reviewed Oct 29, 2025

StanFromIreland added 2 commits October 29, 2025 13:36

Part of review

cf197af

Review

b24b994

StanFromIreland requested a review from vstinner October 29, 2025 13:47

StanFromIreland added 2 commits October 29, 2025 15:34

Revert "Review"

b9ae70d

This reverts commit b24b994.

Third times the charm

dc0752f

vstinner reviewed Oct 29, 2025

Include/internal/pycore_unicodectype.h

Include/internal/pycore_unicodeobject.h (outdated)

Modules/unicodedata.c (outdated)

Review

c958c2a

malemburg approved these changes Oct 29, 2025

bedevere-app bot added awaiting merge and removed awaiting review labels Oct 29, 2025

malemburg requested changes Oct 29, 2025

Include/internal/pycore_unicodeobject.h (outdated)

bedevere-app bot added the awaiting merge label Oct 29, 2025

vstinner reviewed Oct 29, 2025

Doc/library/unicodedata.rst (outdated)

vstinner reviewed Oct 29, 2025

Split sentance in docs

14c9536

vstinner reviewed Oct 29, 2025

StanFromIreland added 2 commits October 29, 2025 19:19

"X"

6fe1ab8

More "X"

5ccc2cd

vstinner approved these changes Oct 29, 2025

vstinner reviewed Oct 29, 2025

Doc/whatsnew/3.15.rst (outdated)

StanFromIreland and others added 2 commits October 29, 2025 20:11

Expand What's New and blurb

4900fb7

Merge branch 'main' into startcontinueid

d6133f2

vstinner reviewed Oct 30, 2025

Doc/library/unicodedata.rst (outdated)

Update Doc/library/unicodedata.rst

50300f4

vstinner enabled auto-merge (squash) October 30, 2025 09:53

vstinner merged commit dbe3950 into python:main Oct 30, 2025
46 checks passed

bedevere-app bot removed the awaiting merge label Oct 30, 2025

StanFromIreland deleted the startcontinueid branch October 30, 2025 10:21

encukou mentioned this pull request Nov 5, 2025

gh-135676: Simplify docs on lexing names #140464

Draft

4 participants
