@StanFromIreland
Member

@StanFromIreland StanFromIreland commented Oct 17, 2025

@vstinner
Member

@StanFromIreland StanFromIreland requested a review from vstinner

I don't know these Unicode properties. The PR documentation doesn't help me:

Return True if the character has the XID_Start property

What does XID_Start mean?

@StanFromIreland
Member Author

Ah no worries then. You can find their documentation in this report, I can add a link to it in the docs.

@vstinner
Member

In short, these functions check if a character is an identifier start or an identifier character according to Unicode TR31?

@StanFromIreland
Member Author

Yes.

Member

@malemburg malemburg left a comment


LGTM now

@StanFromIreland
Member Author

Thanks for the reviews!

Member

@vstinner vstinner left a comment


The change is correct, but I'm not convinced that we have to expose this feature in Python. It seems to be a Unicode feature which is rarely used.

@malemburg
Member

Have a look at https://peps.python.org/pep-3131/ for why these are important to have.

Member

@vstinner vstinner left a comment


About function names, the Unicode annex has also ID_Start and ID_Continue. The XID is a variant. Maybe we should keep x in the function names?

@StanFromIreland
Member Author

StanFromIreland commented Oct 29, 2025

About function names, the Unicode annex has also ID_Start and ID_Continue.

Note that they explicitly recommend the "X" variants.

@malemburg
Member

Maybe we should keep x in the function names?

You have a point there. Let's keep the "x" in "xid" for the functions to not cause confusion.

Member

@vstinner vstinner left a comment


LGTM

@vstinner vstinner enabled auto-merge (squash) October 30, 2025 09:53
@vstinner vstinner merged commit dbe3950 into python:main Oct 30, 2025
46 checks passed
@StanFromIreland StanFromIreland deleted the startcontinueid branch October 30, 2025 10:21
@StanFromIreland
Member Author

Thanks for merging!

@encukou
Member

encukou commented Nov 4, 2025

Have a look at https://peps.python.org/pep-3131/ for why these are important to have.

It's still not clear to me why isidentifier() is not enough here.

Note that neither PEP-3131 nor current Python use the Unicode definition of XID_Start to determine identifiers -- they additionally allow the underscore:

>>> unicodedata.isxidstart('_')
False
>>> '_'.isidentifier()
True

Also, there's an easier way to explain name parsing, which involves only id_start & id_continue, and not the xid variants (whose definitions are more complicated): #140464 (review)

@malemburg
Member

Python uses this internally as part of figuring out what a valid identifier is, but XID_Start/XID_Continue are also important to be able to parse other languages which use these as the basis for their identifier definitions.

Note that you need to use the XID variants if you are working with NFKC normalized text. See https://www.unicode.org/reports/tr31/#NFKC_Modifications

From unicodeobject.c:

Py_ssize_t
_PyUnicode_ScanIdentifier(PyObject *self)
{
    Py_ssize_t i;
    Py_ssize_t len = PyUnicode_GET_LENGTH(self);
    if (len == 0) {
        /* an empty string is not a valid identifier */
        return 0;
    }

    int kind = PyUnicode_KIND(self);
    const void *data = PyUnicode_DATA(self);
    Py_UCS4 ch = PyUnicode_READ(kind, data, 0);
    /* PEP 3131 says that the first character must be in
       XID_Start and subsequent characters in XID_Continue,
       and for the ASCII range, the 2.x rules apply (i.e
       start with letters and underscore, continue with
       letters, digits, underscore). However, given the current
       definition of XID_Start and XID_Continue, it is sufficient
       to check just for these, except that _ must be allowed
       as starting an identifier.  */
    if (!_PyUnicode_IsXidStart(ch) && ch != 0x5F /* LOW LINE */) {
        return 0;
    }

    for (i = 1; i < len; i++) {
        ch = PyUnicode_READ(kind, data, i);
        if (!_PyUnicode_IsXidContinue(ch)) {
            return i;
        }
    }
    return i;
}

The underscore is a special exception added for Python.
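For readers who do not read C, the scan above can be sketched in Python. This is only an illustration, not the CPython implementation: `scan_identifier`, `is_xid_start` and `is_xid_continue` are hypothetical helper names. It assumes `unicodedata.isxidstart()` and `unicodedata.isxidcontinue()` (the functions added by this PR) when they are available, and otherwise falls back to `str.isidentifier()`, which applies the same two checks without normalizing.

```python
import unicodedata

# The two functions added by this PR; None on interpreters that lack them.
_isxidstart = getattr(unicodedata, "isxidstart", None)
_isxidcontinue = getattr(unicodedata, "isxidcontinue", None)


def is_xid_start(ch: str) -> bool:
    """Hypothetical helper: True if ch has the XID_Start property."""
    if _isxidstart is not None:
        return bool(_isxidstart(ch))
    # Fallback: a one-character string is an identifier only if the
    # character is XID_Start or the underscore, so exclude the underscore.
    return ch != "_" and ch.isidentifier()


def is_xid_continue(ch: str) -> bool:
    """Hypothetical helper: True if ch has the XID_Continue property."""
    if _isxidcontinue is not None:
        return bool(_isxidcontinue(ch))
    # Fallback: "a" is XID_Start, so the result depends only on ch.
    return ("a" + ch).isidentifier()


def scan_identifier(text: str) -> int:
    """Return the length of the identifier prefix of text (0 if none)."""
    if not text:
        # An empty string is not a valid identifier.
        return 0
    # Python's extra rule: "_" may start an identifier although it is
    # not in XID_Start.
    if not is_xid_start(text[0]) and text[0] != "_":
        return 0
    for i in range(1, len(text)):
        if not is_xid_continue(text[i]):
            return i
    return len(text)
```

Like the C function, the sketch does not normalize its input, so a caller that wants the compiler's behaviour would apply NFKC first.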

@encukou
Member

encukou commented Nov 5, 2025

XID_Start/XID_Continue are also important to be able to parse other languages which use these as the basis for their identifier definitions.

OK, that's a valid reason. Thanks!
I'll add a note about _ to clear up confusion.

Note that you need to use the XID variants if you are working with NFKC normalized text.

Couldn't you also first normalize and then use the ID variants on the result? (Asking to confirm my understanding, as I'll probably be explaining this to others.)

@malemburg
Member

Note that you need to use the XID variants if you are working with NFKC normalized text.

Couldn't you also first normalize and then use the ID variants on the result? (Asking to confirm my understanding, as I'll probably be explaining this to others.)

No, since the normalization creates a few special cases which the ID variants won't handle. From the tech report: "Where programming languages are using NFKC to fold differences between characters, they need the following modifications of the identifier syntax from the Unicode Standard to deal with the idiosyncrasies of a small number of characters. These modifications are reflected in the XID_Start and XID_Continue properties."

See https://www.unicode.org/reports/tr31/#NFKC_Modifications for details.

Since Python is doing exactly that (normalizing to NFKC before parsing), it needs to use the XID variants.

@encukou
Member

encukou commented Nov 5, 2025

AFAIK, these modifications are exactly what's covered by normalizing and checking the result.
For the first example there, a THAI CHARACTER SARA AM is a Lo, which puts it in ID_Start:

>>> unicodedata.category('\N{THAI CHARACTER SARA AM}')
'Lo'

But normalizing turns it into 2 characters, Mn and Lo:

>>> [(unicodedata.name(c), unicodedata.category(c)) for c in unicodedata.normalize('NFKC', '\N{THAI CHARACTER SARA AM}')]
[('THAI CHARACTER NIKHAHIT', 'Mn'), ('THAI CHARACTER SARA AA', 'Lo')]

The NIKHAHIT (Mn) is in ID_Continue but not ID_Start, which means the SARA AM can't start an identifier despite being a letter:

>>> '\N{THAI CHARACTER SARA AM}'.isidentifier()
False

That is: using the XID properties before normalization will get you the same result as using the ID ones after normalization. IOW, you need to use the XID variants if you are not working with NFKC normalized text.
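A quick way to check this on a few samples, as a sketch only: `identifier_before_and_after_nfkc` is a hypothetical helper, and it relies on `str.isidentifier()` applying the XID checks (plus the underscore exception) without normalizing its input. Python exposes no ID_Start/ID_Continue check, so this compares the XID-based answer on raw and on normalized text rather than XID against ID.

```python
import unicodedata


def identifier_before_and_after_nfkc(text: str) -> tuple[bool, bool]:
    """Return str.isidentifier() for the raw text and for its NFKC form."""
    normalized = unicodedata.normalize("NFKC", text)
    return text.isidentifier(), normalized.isidentifier()


sara_am = "\N{THAI CHARACTER SARA AM}"
for sample in ("spam", "_private", sara_am, "2fast"):
    raw_ok, nfkc_ok = identifier_before_and_after_nfkc(sample)
    print(f"{sample!r}: raw={raw_ok} nfkc={nfkc_ok}")
```

For each of these samples the two answers agree. In particular SARA AM is rejected both as a raw character and in its normalized form, where the leading NIKHAHIT is a mark rather than a letter.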

@malemburg
Member

That is: using the XID properties before normalization will get you the same result as using the ID ones after normalization. IOW, you need to use the XID variants if you are not working with NFKC normalized text.

Rereading the section in the TR, you could be right in a way 🙂

It discusses closure under normalization and this essentially means that the isIdentifier() property should give the same results regardless of whether it is applied to normalized text or raw text.

Using the XID variants to implement isIdentifier() will get you this property.

Python uses the XID variants on NFKC normalized text (since it has to normalize anyway) and so the results with respect to being identifiers are the same.

Applications parsing other languages may choose to not normalize first, so for them the XID variants are beneficial as well.

In other places in the TR, it recommends always using the XID variants: "They are recommended for most purposes, especially for security, over the original ID_Start and ID_Continue properties." (see https://www.unicode.org/reports/tr31/#Default_Identifier_Syntax and https://www.unicode.org/reports/tr31/#Migration).

In fact, most of the TR was updated to use the XID variants instead of the ID ones, with the ID variants only left in for backwards compatibility with Unicode versions prior to version 9.

So all in all, you're right in that the purpose of using XID is more generic and can be applied before or after normalization, giving the same results. In addition, it's also safer, since your text may in some cases be half normalized and half raw and XID will still do a proper job, whereas ID may fail in some edge cases.

@encukou
Member

encukou commented Nov 5, 2025

Ah! It all makes sense now. Thank you!


4 participants