@@ -133,39 +133,46 @@ def word_tokenize(
                                  for end of phrase in Thai.
                                  Otherwise, whitespaces are omitted.
     :param bool join_broken_num: True to rejoin formatted numeric that could be wrongly separated.
-                                 Otherwise, formatted numeric could be separated.
+                                 Otherwise, formatted numeric could be wrongly separated.
 
     :return: list of words
     :rtype: List[str]
     **Options for engine**
-        * *newmm* (default) - dictionary-based, Maximum Matching +
-          Thai Character Cluster
-        * *newmm-safe* - newmm, with a mechanism to help avoid long
-          processing time for text with continuous ambiguous breaking points
-        * *mm* or *multi_cut* - dictionary-based, Maximum Matching.
-        * *nlpo3* - Python binding for nlpO3. It is newmm engine in Rust.
-        * *longest* - dictionary-based, Longest Matching
-        * *icu* - wrapper for ICU (International Components for Unicode,
-          using PyICU), dictionary-based
         * *attacut* - wrapper for
          `AttaCut <https://github.com/PyThaiNLP/attacut>`_.,
          learning-based approach
         * *deepcut* - wrapper for
          `DeepCut <https://github.com/rkcosmos/deepcut>`_,
          learning-based approach
-        * *nercut* - Dictionary-based maximal matching word segmentation,
+        * *icu* - wrapper for a word tokenizer in
+          `PyICU <https://gitlab.pyicu.org/main/pyicu>`_.,
+          from ICU (International Components for Unicode),
+          dictionary-based
+        * *longest* - dictionary-based, longest matching
+        * *mm* - "multi-cut", dictionary-based, maximum matching
+        * *nercut* - dictionary-based, maximal matching,
          constrained with Thai Character Cluster (TCC) boundaries,
-          and combining tokens that are parts of the same named-entity.
+          combining tokens that are parts of the same named-entity
+        * *newmm* (default) - "new multi-cut",
+          dictionary-based, maximum matching,
+          constrained with Thai Character Cluster (TCC) boundaries
+        * *newmm-safe* - newmm, with a mechanism to avoid long
+          processing time for text with continuous ambiguous breaking points
+        * *nlpo3* - wrapper for a word tokenizer in
+          `nlpO3 <https://github.com/PyThaiNLP/nlpo3>`_.,
+          newmm adaptation in Rust (2.5x faster)
+        * *oskut* - wrapper for
+          `OSKut <https://github.com/mrpeerat/OSKut>`_.,
+          Out-of-domain StacKed cut for Word Segmentation
         * *sefr_cut* - wrapper for
          `SEFR CUT <https://github.com/mrpeerat/SEFR_CUT>`_.,
+          Stacked Ensemble Filter and Refine for Word Segmentation
         * *tltk* - wrapper for
          `TLTK <https://pypi.org/project/tltk/>`_.,
-        * *oskut* - wrapper for
-          `OSKut <https://github.com/mrpeerat/OSKut>`_.,
-
+          maximum collocation approach
     :Note:
         - The **custom_dict** parameter only works for \
-          *newmm*, *longest*, and *deepcut* engine.
+          *deepcut*, *longest*, *newmm*, and *newmm-safe* engines.
     :Example:
 
     Tokenize text with different tokenizer::
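For orientation, every engine listed in this hunk is selected through the ``engine`` argument of :func:`word_tokenize`, with ``custom_dict`` and ``keep_whitespace`` as the other documented parameters. A minimal usage sketch for those options; the sample sentence is illustrative only and the exact tokens depend on the installed PyThaiNLP version and its dictionaries::

    from pythainlp.tokenize import word_tokenize
    from pythainlp.util import dict_trie

    text = "โอเคบ่พวกเรารักภาษาบ้านเกิด"  # illustrative Thai sample

    # default engine (newmm, dictionary-based maximum matching with TCC)
    print(word_tokenize(text))

    # pick another engine from the list above
    print(word_tokenize(text, engine="longest"))

    # keep_whitespace=False drops whitespace tokens from the result
    print(word_tokenize(text, keep_whitespace=False))

    # custom_dict is honored by the deepcut, longest, newmm, and newmm-safe engines
    trie = dict_trie(dict_source=["ภาษา", "บ้านเกิด"])
    print(word_tokenize(text, engine="newmm", custom_dict=trie))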
@@ -329,12 +336,12 @@ def sent_tokenize(
     :rtype: list[str]
     **Options for engine**
         * *crfcut* - (default) split by CRF trained on TED dataset
+        * *thaisum* - The implementation of sentence segmentator from \
+          Nakhun Chumpolsathien, 2020
+        * *tltk* - split by `TLTK <https://pypi.org/project/tltk/>`_.,
         * *whitespace+newline* - split by whitespaces and newline.
         * *whitespace* - split by whitespaces. Specifically, with \
          :class:`regex` pattern ``r" +"``
-        * *tltk* - split by `TLTK <https://pypi.org/project/tltk/>`_.,
-        * *thaisum* - The implementation of sentence segmentator from \
-          Nakhun Chumpolsathien, 2020
     :Example:
 
     Split the text based on *whitespace*::
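The sentence tokenizer follows the same calling pattern; a brief sketch using the engine names documented above, with illustrative input text and no claim about the exact split points returned::

    from pythainlp.tokenize import sent_tokenize

    text = "ฉันไปโรงเรียนเมื่อวานนี้ วันนี้ฉันอยู่บ้าน"  # illustrative text

    # default CRF-based splitter (crfcut)
    print(sent_tokenize(text))

    # rule-based splitting on whitespace and newlines
    print(sent_tokenize(text, engine="whitespace+newline"))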
@@ -440,13 +447,12 @@ def subword_tokenize(
     :return: list of subwords
     :rtype: list[str]
     **Options for engine**
-        * *tcc* (default) - Thai Character Cluster (Theeramunkong et al. 2000)
-        * *etcc* - Enhanced Thai Character Cluster (Inrut et al. 2001)
-        * *wangchanberta* - SentencePiece from wangchanberta model.
         * *dict* - newmm word tokenizer with a syllable dictionary
+        * *etcc* - Enhanced Thai Character Cluster (Inrut et al. 2001)
         * *ssg* - CRF syllable segmenter for Thai
+        * *tcc* (default) - Thai Character Cluster (Theeramunkong et al. 2000)
         * *tltk* - syllable tokenizer from tltk
-
+        * *wangchanberta* - SentencePiece from wangchanberta model
     :Example:
 
     Tokenize text into subword based on *tcc*::
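Subword segmentation is called the same way; a short sketch with illustrative input, where the subword granularity depends on the chosen engine and PyThaiNLP version::

    from pythainlp.tokenize import subword_tokenize

    text = "ยุคเริ่มแรกของราชวงศ์หมิง"  # illustrative text

    # default: Thai Character Cluster (tcc)
    print(subword_tokenize(text))

    # Enhanced Thai Character Cluster
    print(subword_tokenize(text, engine="etcc"))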