
Commit c0e48e9

Sort engine names alphabetically
1 parent 399858d commit c0e48e9

File tree

1 file changed: +29 -23 lines changed

pythainlp/tokenize/core.py

Lines changed: 29 additions & 23 deletions
@@ -133,39 +133,46 @@ def word_tokenize(
          for end of phrase in Thai.
          Otherwise, whitespaces are omitted.
      :param bool join_broken_num: True to rejoin formatted numeric that could be wrongly separated.
-         Otherwise, formatted numeric could be separated.
+         Otherwise, formatted numeric could be wrongly separated.

      :return: list of words
      :rtype: List[str]
      **Options for engine**
-         * *newmm* (default) - dictionary-based, Maximum Matching +
-           Thai Character Cluster
-         * *newmm-safe* - newmm, with a mechanism to help avoid long
-           processing time for text with continuous ambiguous breaking points
-         * *mm* or *multi_cut* - dictionary-based, Maximum Matching.
-         * *nlpo3* - Python binding for nlpO3. It is newmm engine in Rust.
-         * *longest* - dictionary-based, Longest Matching
-         * *icu* - wrapper for ICU (International Components for Unicode,
-           using PyICU), dictionary-based
          * *attacut* - wrapper for
            `AttaCut <https://github.com/PyThaiNLP/attacut>`_.,
            learning-based approach
          * *deepcut* - wrapper for
            `DeepCut <https://github.com/rkcosmos/deepcut>`_,
            learning-based approach
-         * *nercut* - Dictionary-based maximal matching word segmentation,
+         * *icu* - wrapper for a word tokenizer in
+           `PyICU <https://gitlab.pyicu.org/main/pyicu>`_.,
+           from ICU (International Components for Unicode),
+           dictionary-based
+         * *longest* - dictionary-based, longest matching
+         * *mm* - "multi-cut", dictionary-based, maximum matching
+         * *nercut* - dictionary-based, maximal matching,
            constrained with Thai Character Cluster (TCC) boundaries,
-           and combining tokens that are parts of the same named-entity.
+           combining tokens that are parts of the same named-entity
+         * *newmm* (default) - "new multi-cut",
+           dictionary-based, maximum matching,
+           constrained with Thai Character Cluster (TCC) boundaries
+         * *newmm-safe* - newmm, with a mechanism to avoid long
+           processing time for text with continuous ambiguous breaking points
+         * *nlpo3* - wrapper for a word tokenizer in
+           `nlpO3 <https://github.com/PyThaiNLP/nlpo3>`_.,
+           newmm adaptation in Rust (2.5x faster)
+         * *oskut* - wrapper for
+           `OSKut <https://github.com/mrpeerat/OSKut>`_.,
+           Out-of-domain StacKed cut for Word Segmentation
          * *sefr_cut* - wrapper for
            `SEFR CUT <https://github.com/mrpeerat/SEFR_CUT>`_.,
+           Stacked Ensemble Filter and Refine for Word Segmentation
          * *tltk* - wrapper for
            `TLTK <https://pypi.org/project/tltk/>`_.,
-         * *oskut* - wrapper for
-           `OSKut <https://github.com/mrpeerat/OSKut>`_.,
-
+           maximum collocation approach
      :Note:
          - The **custom_dict** parameter only works for \
-           *newmm*, *longest*, and *deepcut* engine.
+           *deepcut*, *longest*, *newmm*, and *newmm-safe* engines.
      :Example:

      Tokenize text with different tokenizer::
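
For context (not part of this commit), here is a minimal usage sketch of word_tokenize with a few of the engines documented above. The sample text is illustrative, and engines such as attacut, deepcut, nlpo3, oskut, and sefr_cut need optional extra packages, so only bundled dictionary-based engines are called here:

    from pythainlp.tokenize import word_tokenize

    text = "ปลาใหญ่กินปลาเล็ก"  # sample Thai text: "big fish eats small fish"
    print(word_tokenize(text))                       # default engine: newmm
    print(word_tokenize(text, engine="newmm-safe"))  # newmm with the safety mechanism for ambiguous text
    print(word_tokenize(text, engine="longest"))     # dictionary-based, longest matching
    print(word_tokenize(text, engine="mm"))          # "multi-cut", maximum matching

Per the :Note: above, a custom_dict (a Trie, e.g. built with pythainlp.util.Trie) can additionally be passed to the deepcut, longest, newmm, and newmm-safe engines.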
@@ -329,12 +336,12 @@ def sent_tokenize(
      :rtype: list[str]
      **Options for engine**
          * *crfcut* - (default) split by CRF trained on TED dataset
+         * *thaisum* - The implementation of sentence segmentator from \
+           Nakhun Chumpolsathien, 2020
+         * *tltk* - split by `TLTK <https://pypi.org/project/tltk/>`_.,
          * *whitespace+newline* - split by whitespaces and newline.
          * *whitespace* - split by whitespaces. Specifiaclly, with \
            :class:`regex` pattern ``r" +"``
-         * *tltk* - split by `TLTK <https://pypi.org/project/tltk/>`_.,
-         * *thaisum* - The implementation of sentence segmentator from \
-           Nakhun Chumpolsathien, 2020
      :Example:

      Split the text based on *whitespace*::
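
As a side note (not part of this commit), a minimal sketch of sent_tokenize with the engines documented above; crfcut is the default, and some engines (e.g. tltk) rely on optional extra packages:

    from pythainlp.tokenize import sent_tokenize

    text = "ฉันไปโรงเรียน เขาไปตลาด"  # two short Thai clauses
    print(sent_tokenize(text))                               # default engine: crfcut
    print(sent_tokenize(text, engine="whitespace+newline"))  # split on whitespace and newlines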
@@ -440,13 +447,12 @@ def subword_tokenize(
      :return: list of subwords
      :rtype: list[str]
      **Options for engine**
-         * *tcc* (default) - Thai Character Cluster (Theeramunkong et al. 2000)
-         * *etcc* - Enhanced Thai Character Cluster (Inrut et al. 2001)
-         * *wangchanberta* - SentencePiece from wangchanberta model.
          * *dict* - newmm word tokenizer with a syllable dictionary
+         * *etcc* - Enhanced Thai Character Cluster (Inrut et al. 2001)
          * *ssg* - CRF syllable segmenter for Thai
+         * *tcc* (default) - Thai Character Cluster (Theeramunkong et al. 2000)
          * *tltk* - syllable tokenizer from tltk
-
+         * *wangchanberta* - SentencePiece from wangchanberta model
      :Example:

      Tokenize text into subword based on *tcc*::
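
Again as an illustration only (not part of this commit), a minimal sketch of subword_tokenize with the rule-based engines listed above; wangchanberta and ssg rely on optional extra packages:

    from pythainlp.tokenize import subword_tokenize

    text = "ประเทศไทย"  # "Thailand"
    print(subword_tokenize(text))                 # default engine: tcc (Thai Character Cluster)
    print(subword_tokenize(text, engine="etcc"))  # Enhanced Thai Character Cluster
    print(subword_tokenize(text, engine="dict"))  # newmm with a syllable dictionary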
