Skip to content

Conversation

wannaphong
Copy link
Member

@wannaphong wannaphong commented Apr 14, 2021

syllable_tokenize is deprecated, use subword_tokenize instead #322

What does this changes

Deprecated syllable_tokenize

Your checklist for this pull request

🚨Please review the guidelines for contributing to this repository.

  • Passed code styles and structures
  • Passed code linting checks and unit test

syllable_tokenize is deprecated, use subword_tokenize instead
@wannaphong
Copy link
Member Author

wannaphong commented Apr 14, 2021

Todo

  • move test set
  • edit docs

@pep8speaks
Copy link

pep8speaks commented Apr 14, 2021

Hello @wannaphong! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 304:80: E501 line too long (89 > 79 characters)
Line 306:80: E501 line too long (80 > 79 characters)

Comment last updated at 2021-04-22 17:32:08 UTC

@coveralls
Copy link

coveralls commented Apr 14, 2021

Coverage Status

Coverage decreased (-0.02%) to 95.715% when pulling 9bf1842 on merge-syllable-subword into 449e9b0 on dev.

@wannaphong wannaphong requested a review from bact April 19, 2021 15:38
Copy link
Member

@bact bact left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove "(default)" from "dict" in the docstring.

Comment on lines 303 to 307
* *tcc* (default) - Thai Character Cluster (Theeramunkong et al. 2000)
* *etcc* - Enhanced Thai Character Cluster (Inrut et al. 2001)
* *wangchanberta* - SentencePiece from wangchanberta model.
* *dict* (default) - newmm word tokenizer with a syllable dictionary
* *ssg* - CRF syllable segmenter for Thai
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"dict" is not a default subword tokenization engine.

Current default is "tcc",
according to DEFAULT_SUBWORD_TOKENIZE_ENGINE constant in
https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/tokenize/__init__.py

"dict (default) - newmm..." should be just "* dict - newmm..."

@wannaphong wannaphong requested a review from bact April 22, 2021 17:32
@wannaphong wannaphong added this to the 2.4 milestone Apr 23, 2021
@wannaphong wannaphong merged commit 036e985 into dev Apr 23, 2021
@wannaphong wannaphong deleted the merge-syllable-subword branch April 24, 2021 07:01
@wannaphong wannaphong mentioned this pull request Aug 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants