145 changes: 87 additions & 58 deletions Doc/reference/lexical_analysis.rst
@@ -386,73 +386,29 @@ Names (identifiers and keywords)
:data:`~token.NAME` tokens represent *identifiers*, *keywords*, and
*soft keywords*.

Within the ASCII range (U+0001..U+007F), the valid characters for names
include the uppercase and lowercase letters (``A-Z`` and ``a-z``),
the underscore ``_`` and, except for the first character, the digits
``0`` through ``9``.
Names are composed of the following characters:

* uppercase and lowercase letters (``A-Z`` and ``a-z``),
* the underscore (``_``),
* digits (``0`` through ``9``), which cannot appear as the first character, and
* non-ASCII characters. Valid names may only contain "letter-like" and
  "digit-like" characters; see :ref:`lexical-names-nonascii` for details.

Names must contain at least one character, but have no upper length limit.
Case is significant.

Besides ``A-Z``, ``a-z``, ``_`` and ``0-9``, names can also use "letter-like"
and "number-like" characters from outside the ASCII range, as detailed below.

All identifiers are converted into the `normalization form`_ NFKC while
parsing; comparison of identifiers is based on NFKC.

Formally, the first character of a normalized identifier must belong to the
set ``id_start``, which is the union of:

* Unicode category ``<Lu>`` - uppercase letters (includes ``A`` to ``Z``)
* Unicode category ``<Ll>`` - lowercase letters (includes ``a`` to ``z``)
* Unicode category ``<Lt>`` - titlecase letters
* Unicode category ``<Lm>`` - modifier letters
* Unicode category ``<Lo>`` - other letters
* Unicode category ``<Nl>`` - letter numbers
* {``"_"``} - the underscore
* ``<Other_ID_Start>`` - an explicit set of characters in `PropList.txt`_
  to support backwards compatibility

The remaining characters must belong to the set ``id_continue``, which is the
union of:

* all characters in ``id_start``
* Unicode category ``<Nd>`` - decimal numbers (includes ``0`` to ``9``)
* Unicode category ``<Pc>`` - connector punctuations
* Unicode category ``<Mn>`` - nonspacing marks
* Unicode category ``<Mc>`` - spacing combining marks
* ``<Other_ID_Continue>`` - another explicit set of characters in
  `PropList.txt`_ to support backwards compatibility

Unicode categories use the version of the Unicode Character Database as
included in the :mod:`unicodedata` module.

These sets are based on the Unicode standard annex `UAX-31`_.
See also :pep:`3131` for further details.

Even more formally, names are described by the following lexical definitions:
Formally, names are described by the following lexical definitions:

.. grammar-snippet::
   :group: python-grammar

   NAME: `xid_start` `xid_continue`*
   id_start: <Lu> | <Ll> | <Lt> | <Lm> | <Lo> | <Nl> | "_" | <Other_ID_Start>
   id_continue: `id_start` | <Nd> | <Pc> | <Mn> | <Mc> | <Other_ID_Continue>
   xid_start: <all characters in `id_start` whose NFKC normalization is
              in (`id_start` `xid_continue`*)">
   xid_continue: <all characters in `id_continue` whose NFKC normalization is
                 in (`id_continue`*)">
   identifier: <`NAME`, except keywords>
   NAME: `name_start` `name_continue`*
   name_start: "a"..."z" | "A"..."Z" | "_" | <non-ASCII character>
   name_continue: name_start | "0"..."9"
   identifier: <`NAME`, except keywords>

A non-normative listing of all valid identifier characters as defined by
Unicode is available in the `DerivedCoreProperties.txt`_ file in the Unicode
Character Database.


.. _UAX-31: https://www.unicode.org/reports/tr31/
.. _PropList.txt: https://www.unicode.org/Public/17.0.0/ucd/PropList.txt
.. _DerivedCoreProperties.txt: https://www.unicode.org/Public/17.0.0/ucd/DerivedCoreProperties.txt
.. _normalization form: https://www.unicode.org/reports/tr15/#Norm_Forms
Note that not all names matched by this grammar are valid; see
:ref:`lexical-names-nonascii` for details.
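
The gap between the simplified grammar and actual validity can be illustrated
with a small sketch. The regular expression and the helper name
``matches_name_grammar`` below are assumptions made for illustration, not part
of Python's tokenizer; the pattern treats every code point from U+0080 upward
as a "non-ASCII character". The built-in :meth:`str.isidentifier` is used as
the validity check.

```python
import re

# Hypothetical transcription of the simplified NAME grammar:
# name_start is an ASCII letter, "_" or any non-ASCII character;
# name_continue additionally allows the ASCII digits.
NAME_PATTERN = re.compile(
    r"[A-Za-z_\u0080-\U0010FFFF][A-Za-z_0-9\u0080-\U0010FFFF]*"
)


def matches_name_grammar(text):
    """Return True if the whole string matches the simplified grammar."""
    return NAME_PATTERN.fullmatch(text) is not None


# "price" followed by the euro sign (U+20AC, a currency symbol).
candidate = "price\u20ac"

print(matches_name_grammar(candidate))  # matched by the grammar
print(candidate.isidentifier())         # but not a valid name
print(matches_name_grammar("2x"))       # digits cannot come first
```

The euro sign is non-ASCII, so the grammar accepts it, but it is neither
"letter-like" nor "digit-like", so the name is rejected.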


.. _keywords:
@@ -555,6 +511,79 @@ characters:
:ref:`atom-identifiers`.


.. _lexical-names-nonascii:

Non-ASCII characters in names
-----------------------------

Besides ``A-Z``, ``a-z``, ``_`` and ``0-9``, names can use "letter-like"
and "digit-like" characters from outside the ASCII range,
as detailed in this section.

All names are converted into the `normalization form`_ NFKC while parsing.
This means that some typographic variants of characters are converted to
their "basic" form. For example, ``nᵘₘᵇₑʳ`` normalizes to
``number``, so Python treats them as the same name::

   >>> nᵘₘᵇₑʳ = 3

Review comment (Contributor): It would be helpful to add an explicit comment that the normalized form of nᵘₘᵇₑʳ is number.

Reply (Member Author): Does this look good?

   >>> number
   3

.. note::

   Normalization is done at the lexical level only.
   Run-time functions that take names as *strings* generally do not normalize
   their arguments.
   For example, the variable defined above is accessible at run time in the
   :func:`globals` dictionary as ``globals()["number"]`` but not
   ``globals()["nᵘₘᵇₑʳ"]``.
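
The difference between lexical and run-time handling can be checked with a
short sketch. It assumes only the standard :mod:`unicodedata` module and the
built-in :func:`exec`; the variable names (``fancy_name``, ``namespace``) are
illustrative. The name is spelled with escapes so that the code points are
unambiguous.

```python
import unicodedata

# "n" followed by modifier/subscript variants of "u", "m", "b", "e", "r".
fancy_name = "n\u1d58\u2098\u1d47\u2091\u02b3"

# Explicit NFKC normalization of the string yields the plain ASCII spelling.
normalized_name = unicodedata.normalize("NFKC", fancy_name)
print(normalized_name)

# Compiling source code normalizes the identifier, so the assignment
# stores the value under the normalized key.
namespace = {}
exec(f"{fancy_name} = 3", namespace)

print(normalized_name in namespace)  # the normalized key is present
print(fancy_name in namespace)       # the un-normalized string is not
```

Dictionary lookups compare strings as they are, which is why only the
normalized spelling is found.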

The first character of a normalized identifier must be "letter-like".
Formally, this means it must belong to the set ``id_start``,
which is defined as the union of:

* Unicode category ``<Lu>`` - uppercase letters (includes ``A`` to ``Z``)
* Unicode category ``<Ll>`` - lowercase letters (includes ``a`` to ``z``)
* Unicode category ``<Lt>`` - titlecase letters
* Unicode category ``<Lm>`` - modifier letters
* Unicode category ``<Lo>`` - other letters
* Unicode category ``<Nl>`` - letter numbers
* {``"_"``} - the underscore
* ``<Other_ID_Start>`` - an explicit set of characters in `PropList.txt`_
  to support backwards compatibility

The remaining characters must be "letter-like" or "digit-like".
Formally, they must belong to the set ``id_continue``, which is defined as
the union of:

* ``id_start`` (see above)
* Unicode category ``<Nd>`` - decimal numbers (includes ``0`` to ``9``)
* Unicode category ``<Pc>`` - connector punctuations
* Unicode category ``<Mn>`` - nonspacing marks
* Unicode category ``<Mc>`` - spacing combining marks
* ``<Other_ID_Continue>`` - another explicit set of characters in
  `PropList.txt`_ to support backwards compatibility

Unicode categories use the version of the Unicode Character Database as
included in the :mod:`unicodedata` module.
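
The two unions above can be approximated with :func:`unicodedata.category`.
The following is a rough sketch, not the tokenizer's implementation: the
helper names are hypothetical, the ``<Other_ID_Start>`` and
``<Other_ID_Continue>`` sets are deliberately left out because
:mod:`unicodedata` does not expose them, and keywords are not excluded.

```python
import unicodedata

# General categories listed for id_start (the underscore is handled separately).
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
# Additional general categories allowed in id_continue.
ID_CONTINUE_EXTRA_CATEGORIES = {"Nd", "Pc", "Mn", "Mc"}


def in_id_start(char):
    """Approximate id_start membership (ignores Other_ID_Start)."""
    return char == "_" or unicodedata.category(char) in ID_START_CATEGORIES


def in_id_continue(char):
    """Approximate id_continue membership (ignores Other_ID_Continue)."""
    return (
        in_id_start(char)
        or unicodedata.category(char) in ID_CONTINUE_EXTRA_CATEGORIES
    )


def looks_like_name(text):
    """Check the NFKC-normalized form against the approximate sets."""
    normalized = unicodedata.normalize("NFKC", text)
    if not normalized:
        return False
    return in_id_start(normalized[0]) and all(
        in_id_continue(char) for char in normalized[1:]
    )


for sample in ["caf\u00e9", "x2", "2x", "a-b"]:
    print(repr(sample), looks_like_name(sample))
```

For everyday checks, :meth:`str.isidentifier` is the simpler choice; the
sketch only makes the category lists concrete.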

The ``id_start`` and ``id_continue`` sets are based on the Unicode standard
annex `UAX-31`_. See also :pep:`3131` for further details.
Note that Python does not necessarily conform to `UAX-31`_.

A non-normative listing of all valid identifier characters as defined by
Unicode is available in the `DerivedCoreProperties.txt`_ file in the Unicode
Character Database.
The properties *ID_Start* and *ID_Continue* are very similar to Python's
``id_start`` and ``id_continue`` sets; the properties *XID_Start* and
*XID_Continue* play similar roles for identifiers before NFKC normalization.

.. _UAX-31: https://www.unicode.org/reports/tr31/
.. _PropList.txt: https://www.unicode.org/Public/17.0.0/ucd/PropList.txt
.. _DerivedCoreProperties.txt: https://www.unicode.org/Public/17.0.0/ucd/DerivedCoreProperties.txt
.. _normalization form: https://www.unicode.org/reports/tr15/#Norm_Forms


.. _literals:

Literals