-
-
Notifications
You must be signed in to change notification settings - Fork 33.4k
gh-129117: Expose _PyUnicode_IsXidContinue/Start in unicodedata
#140269
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I don't know these Unicode properties. The PR documentation doesn't help me:
What does it mean |
|
Ah no worries then. You can find their documentation in this report, I can add a link to it in the docs. |
|
In short, these functions check if a character is an identifier start or an identifier character according to Unicode TR31? |
|
Yes. |
This reverts commit b24b994.
malemburg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM now
|
Thanks for the reviews! |
vstinner
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change is correct, but I'm not convinced that we have to expose this feature in Python. It seems to be an Unicode feature which rarely used.
|
Have a look at https://peps.python.org/pep-3131/ for why these are important to have. |
vstinner
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About function names, the Unicode annex has also ID_Start and ID_Continue. The XID is a variant. Maybe we should keep x in the function names?
Note that they explicitly recommend the "X" variants. |
You have a point there. Let's keep the "x" in "xid" for the functions to not cause confusion. |
vstinner
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
Thanks for merging! |
It's still not clear to me why Note that neither PEP-3131 nor current Python use the Unicode definition of XID_Start to determine identifiers -- they additionally allow the underscore: >>> unicodedata.isxidstart('_')
False
>>> '_'.isidentifier()
TrueAlso, there's an easier way to explain name parsing, which involves only |
|
Python uses this internally as part of figuring out what a valid identified is, but XID_Start/End are also important to be able to parse other languages which use these are basis for their identifier definitions. Note that you need to use the XID variants if you are working with NFKC normalized text. See https://www.unicode.org/reports/tr31/#NFKC_Modifications From unicodeobject.c: The underscore is a special exception added for Python. |
OK, that's a valid reason. Thanks!
Couldn't you also first normalize and then use the ID variants on the result? (Asking to confirm my understanding, as I'll probably be explaining this to others.) |
No, since the normalization creates a few special cases which the ID variants won't handle. From the tech report: "Where programming languages are using NFKC to fold differences between characters, they need the following modifications of the identifier syntax from the Unicode Standard to deal with the idiosyncrasies of a small number of characters. These modifications are reflected in the XID_Start and XID_Continue properties." See https://www.unicode.org/reports/tr31/#NFKC_Modifications for details. Since Python is doing exactly that (normalizing to NFKC before parsing), it needs to use the XID variants. |
|
AFAIK, these modifications are exactly what's covered by normalizing and checking the result. But normalizing turns it into 2 characters, The NIKHAHIT ( That is: using the XID properties before normalization will get you the same result as using the ID ones after normalization. IOW, you need to use the XID variants if you are not working with NFKC normalized text. |
Rereading the section in the TR, you could be right in a way 🙂 It discusses closure under normalization and this essentially means that the Using the XID variants to implement Python uses the XID variants on NFKC normalized text (since it has to normalize anyway) and so the results with respect to being identifiers are the same. Applications parsing other languages may choose to not normalize first, so for them the XID variants are beneficial as well. In other places in the TR, it recommends always using the XID variants: "They are recommended for most purposes, especially for security, over the original ID_Start and ID_Continue properties." (see https://www.unicode.org/reports/tr31/#Default_Identifier_Syntax and https://www.unicode.org/reports/tr31/#Migration). In fact, most of the TR was updated to use the XID variants instead of the ID ones, with the ID variantes only left in for backwards compatibility with Unicode versions prior to version 9. So all in all, you're right in that the purpose of using XID is more generic and can be applied before or after normalization, giving the same results. In addition, it's also safer, since your text may in some cases be half normalized and half raw and XID will still do a proper job, whereas ID may fail in some edge cases. |
|
Ah! It all makes sense now. Thank you! |
📚 Documentation preview 📚: https://cpython-previews--140269.org.readthedocs.build/