@wannaphong
Member

After a long development period, we have released PyThaiNLP 3.0. It brings many improvements and new features to help with Thai language processing tasks.

You can install it with pip install pythainlp, or upgrade with pip install -U pythainlp.

Documentation: https://pythainlp.github.io/docs/3.0/index.html

Report bugs: https://github.com/PyThaiNLP/pythainlp/issues

See PyThaiNLP 3.0 change log #545

If you want to contribute to PyThaiNLP, you can read Contributing to PyThaiNLP.

News

Starting with PyThaiNLP 3.0, we no longer support Python 3.6. Python 3.6 users can use PyThaiNLP 2.3.2.
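The version split above can be expressed as a small helper that picks a pip requirement string for the running interpreter. This is only a sketch: the function name pythainlp_requirement is hypothetical, and the assumption that Python 3.7 and later should get "pythainlp>=3.0" is inferred from the note about 3.6, not stated in these notes.

```python
import sys


def pythainlp_requirement(version_info=sys.version_info):
    """Return a pip requirement string for PyThaiNLP, or None if unknown.

    Hypothetical helper: it only encodes the guidance from the release
    notes (Python 3.6 users stay on 2.3.2).
    """
    major_minor = tuple(version_info[:2])
    if major_minor >= (3, 7):
        # Assumption: interpreters newer than 3.6 can use the 3.x line.
        return "pythainlp>=3.0"
    if major_minor == (3, 6):
        # Stated in the release notes: 3.6 users can use 2.3.2.
        return "pythainlp==2.3.2"
    # Older interpreters are not covered by the release notes.
    return None


print(pythainlp_requirement())
```

The returned string can be passed to pip or written to a requirements file by whatever tooling you already use.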

We have updated the Thai word dictionary and the rules for newmm. If your model uses newmm for word tokenization, we recommend retraining it.

What is new?

Deprecation and other API changes

  • syllable_tokenize is deprecated; use subword_tokenize instead.
  • pythainlp.tag.named_entity.ThaiNameTagger has moved to pythainlp.tag.thainer.ThaiNameTagger. The old class will be deprecated in PyThaiNLP version 3.1.
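For code that has to run against both old and new installs, one option is to try the new ThaiNameTagger location first and fall back to the old one. The sketch below is an assumption about how you might structure that, not something the library provides: import_first is a hypothetical helper, and it returns None when neither module can be imported (for example, when PyThaiNLP or one of its optional dependencies is missing).

```python
import importlib


def import_first(candidates, attribute):
    """Return `attribute` from the first importable module that has it.

    Hypothetical helper: `candidates` is a list of module paths tried in
    order. Returns None if no candidate provides the attribute.
    """
    for module_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # Module (or one of its dependencies) is unavailable; try the next.
            continue
        if hasattr(module, attribute):
            return getattr(module, attribute)
    return None


# New location first, old location as a fallback for older installs.
ThaiNameTagger = import_first(
    ["pythainlp.tag.thainer", "pythainlp.tag.named_entity"],
    "ThaiNameTagger",
)

if ThaiNameTagger is None:
    print("ThaiNameTagger is not available in this environment")
```

Once you only need to support PyThaiNLP 3.0 and later, a plain import from pythainlp.tag.thainer is simpler.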

Augment

  • Add Thai Text Augmentation

Corpus

  • Fix lots of misspellings in the dictionary (words_th.txt)
  • Add get_corpus_default_db and the thainer 1.5 model. You can add a corpus to default_db.json, so the latest thainer model does not have to be loaded from the Internet.

Tag

  • Add TLTK (pos_tag and ner): TLTK wrappers for pythainlp functions such as ner, word_tokenize, and more.
  • Add NER class for named-entity recognition tasks.

Translate

  • Add pythainlp.translate.Translate Class
  • Add Chinese-Thai Machine Translation
  • Add Thai-French Machine Translation

Tokenization

  • Fix token_max_len bug that makes it always zero
  • Tokenize repeating dots and commas from numbers (fix Problem with syllable tokenization #461)
  • Retrained sentenceseg_crfcut.model for PyThaiNLP 2.4
  • Add SEFR CUT to pythainlp
  • Add TLTK (sentence_tokenize and word_tokenize): TLTK wrappers for pythainlp functions such as ner, word_tokenize, and more.
  • Add nlpo3

Transliterate

  • Refactor Royin Transliterate: avoid nested if blocks and simplify consonant replacement operations
  • Manually merge update-royin branch with dev branch to add O-ANG rule
  • Add TLTK (g2p and ipa): TLTK wrappers for pythainlp functions such as ner, word_tokenize, and more.
  • Add pythainlp.transliterate.puan

Word Vector

  • Fix token_max_len bug that makes it always zero
  • Add pythainlp.word_vector.WordVector

Spell

  • Add more spelling engines
  • Add TLTK (spell): TLTK wrappers for pythainlp functions such as ner, word_tokenize, and more.

Generate

  • Add pythainlp.generate for text generation.

Tool

  • Add misspell module

Other

  • Add TLTK: TLTK wrappers for pythainlp functions such as ner, word_tokenize, and more.
  • Update requirements from ssg 0.0.6 to ssg 0.0.8
  • Spoonerism: add support for words with more than three syllables
  • Add maiyamok, a function that preprocesses MaiYaMok (ๆ) in a Thai sentence.
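To illustrate what preprocessing MaiYaMok can mean, here is a minimal sketch that expands the repetition mark ๆ by repeating the preceding token. It is not the library's implementation: expand_maiyamok is a hypothetical name, and the sketch assumes the sentence is already tokenized with ๆ as its own token. The real maiyamok function may take different input and behave differently.

```python
MAIYAMOK = "\u0e46"  # ๆ, the Thai repetition mark


def expand_maiyamok(tokens):
    """Replace each ๆ token with a copy of the token before it.

    Hypothetical illustration only. A ๆ with nothing before it is
    dropped, because there is no token to repeat.
    """
    expanded = []
    for token in tokens:
        if token == MAIYAMOK:
            if expanded:
                # Repeat the most recent token in the output.
                expanded.append(expanded[-1])
        else:
            expanded.append(token)
    return expanded


tokens = ["เด็ก", "ๆ", "ชอบ", "ไป", "โรงเรียน"]
print(expand_maiyamok(tokens))
# ['เด็ก', 'เด็ก', 'ชอบ', 'ไป', 'โรงเรียน']
```

Expanding the mark before downstream processing means later steps see the repeated word explicitly instead of a standalone ๆ token.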

Contributors

Thanks to all the contributors. (Image made with contributors-img)

This year is PyThaiNLP's sixth year, and it has more than one million downloads. I started developing PyThaiNLP to help me with Thai language processing tasks. Now, PyThaiNLP is used in research and other work worldwide. PyThaiNLP could not have grown without its contributors, sponsors, and users.

Thank you for all your support.

Thank you for using PyThaiNLP.

Wannaphong Phatthiyaphaibun

PyThaiNLP Founder

27 January 2022

@pep8speaks

pep8speaks commented Jan 29, 2022

Hello @wannaphong! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-01-29 16:28:18 UTC

@wannaphong wannaphong linked an issue Jan 29, 2022 that may be closed by this pull request
@wannaphong wannaphong added this to the 3.0 milestone Jan 29, 2022
@wannaphong wannaphong merged commit 66373c8 into dev Jan 29, 2022
@wannaphong wannaphong deleted the v3.0.0 branch February 9, 2022 14:30