Add <Karan> rule to TCC and Change TCC rule for newmm #741
@bact I am not sure about this yet. |
I found the missing rules and fixed them. |
What is |
From the paper:
from pythainlp.tokenize import subword_tokenize
subword_tokenize("พิสูจน์ได้ค่ะ", engine="tcc")
Output before this pull request:
Output with this pull request:
|
from pythainlp.tokenize import word_tokenize
word_tokenize("ทดสอบตัดคำภาษาไทยจอก์น")
# output: ['ทดสอบ', 'ตัด', 'คำ', 'ภาษาไทย', 'จอ', 'ก์น'] |
Good one. Can be added as a test case |
Done |
OK. I found Thanks @c4n |
I found the |
Now, the result from
subword_tokenize("ทดสอบตัดคำภาษาไทยจอก์น", engine="tcc")
# output: ['ท', 'ด', 'ส', 'อ', 'บ', 'ตัด', 'คำ', 'ภา', 'ษา', 'ไท', 'ย', 'จอก์', 'น']
and
word_tokenize("ทดสอบตัดคำภาษาไทยจอก์น")
# output: ['ทดสอบ', 'ตัด', 'คำ', 'ภาษาไทย', 'จอก์น'] |
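The two outputs above can be related with a small check. This is only a sketch, assuming (as the PR title suggests) that newmm allows word breaks only where a TCC cluster ends; the helper names `cluster_boundaries` and `respects_clusters` are hypothetical and are not part of PyThaiNLP.

```python
from itertools import accumulate


def cluster_boundaries(clusters):
    """Return the character offsets at which each cluster ends."""
    return set(accumulate(len(c) for c in clusters))


def respects_clusters(words, clusters):
    """Return True if every word boundary falls on a cluster boundary."""
    if "".join(words) != "".join(clusters):
        raise ValueError("words and clusters must spell the same text")
    allowed = cluster_boundaries(clusters)
    return all(pos in allowed for pos in accumulate(len(w) for w in words))


# TCC output quoted in this thread after the fix.
tcc = ['ท', 'ด', 'ส', 'อ', 'บ', 'ตัด', 'คำ', 'ภา', 'ษา', 'ไท', 'ย', 'จอก์', 'น']
# Word output after the fix, and the earlier output that split inside 'จอก์'.
fixed_words = ['ทดสอบ', 'ตัด', 'คำ', 'ภาษาไทย', 'จอก์น']
old_words = ['ทดสอบ', 'ตัด', 'คำ', 'ภาษาไทย', 'จอ', 'ก์น']

print(respects_clusters(fixed_words, tcc))
print(respects_clusters(old_words, tcc))
```

With the cluster 'จอก์' kept whole, the earlier split 'จอ' / 'ก์น' cuts through a cluster, while the new word output does not.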
@bact I rewrote the TCC rules. Could you help me re-check them? |
I found
|
Now the "พันธ์" problem is fixed, but I found a new problem with "นธ์". |
@bact I think we should not require TCC to match the paper 100%. (Here พันธ์ -> พันธ์, but the TCC paper gives พั/น/ธ์.) |
PyThaiNLP's TCC implements the TCC grammar from JTCC. |
This pull request only adds rules to TCC. |
Test this pull request
PyThaiNLP 3.1.0
|
I talked with @korakot in the PyThaiNLP chat on Facebook. I will move the TCC code used by newmm to a new file and replace TCC with code whose results are closer to the TCC paper. |
SonarCloud Quality Gate failed.
|
Conclusion
You can use this notebook to see the difference: https://github.com/PyThaiNLP/pythainlp/blob/68d7843b319cb99d945ff4ae1645925ebdae4a83/notebooks/test_tcc.ipynb |
Add more tests for TCC: #741 (comment)
What does this change
Add rules to TCC
What was wrong
I tried to tokenize the text "ทดสอบตัดคำภาษาไทยจอก์น", but the result was
['ทดสอบ', 'ตัด', 'คำ', 'ภาษาไทย', 'จอก', '์น'].
I checked TCC from @bact and think PyThaiNLP's TCC regex doesn't cover that case. I found that we forgot the C์ rule, which comes from the <Cons> <TCC1> <Karan> rule: <TCC1> can be NULL, which means ก์ should be the single cluster 'ก์'.
Update: I found that many <Karan> rules were missing, and I added them.
Theeramunkong, Thanaruk & Sornlertlamvanich, Virach & Tanhermhong, Thanasan & Chinnan, Wirat. (2004). Character Cluster Based Thai Information Retrieval. 10.1145/355214.355225
https://www.researchgate.net/publication/2853284_Character_Cluster_Based_Thai_Information_Retrieval
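As a rough illustration of the NULL-<TCC1> case described above, the sketch below keeps a consonant and a following karan mark together. It is not PyThaiNLP's actual TCC regex: the consonant range U+0E01 to U+0E2E is a simplification, treating U+0E4C (thanthakhat) as <Karan> is an assumption, and `split_karan` is a hypothetical helper that leaves every other character as its own piece.

```python
import re

# Assumption: <Cons> is approximated by the code point range ko kai .. ho nokhuk.
CONS = "[\u0e01-\u0e2e]"
# Assumption: <Karan> is the thanthakhat mark (U+0E4C).
KARAN = "\u0e4c"

# Try <Cons><Karan> first (the case where <TCC1> is NULL);
# otherwise fall back to a single character.
PATTERN = re.compile(f"{CONS}{KARAN}|.", re.DOTALL)


def split_karan(text):
    """Split text into pieces, keeping consonant + karan as one piece."""
    return PATTERN.findall(text)


print(split_karan("ก์น"))  # 'ก์' stays together, 'น' is separate
print(list("ก์น"))         # naive per-character split separates the mark
```

A full TCC grammar has many more alternatives (vowels, tone marks, non-NULL <TCC1>); this only shows why the karan mark should never start a cluster on its own.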
How this fixes it
I added <Karan> rules to TCC.
Fixes #662
Thanks to @lalital for advice about the Karan rule and to @c4n for the reference.
Your checklist for this pull request
🚨Please review the guidelines for contributing to this repository.