wannaphong
Member

@wannaphong wannaphong commented Oct 21, 2022

What does this change

Adds the missing <Karan> rules to TCC.

What was wrong

I tried to tokenize the text "ทดสอบตัดคำภาษาไทยจอก์น", but the result was ['ทดสอบ', 'ตัด', 'คำ', 'ภาษาไทย', 'จอก', '์น']. I checked the TCC code from @bact and think that PyThaiNLP's TCC regex does not cover this case. I found that we forgot the C์ rule, which follows from the <Cons> <TCC1> <Karan> rule: <TCC1> can be NULL, which means ก์ should stay together as 'ก์'.

Update: I found that many <Karan> rules were missing, so I added them.
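As a rough illustration of the C์ case (a sketch only, not PyThaiNLP's actual TCC regex): if <TCC1> is allowed to be empty, a consonant directly followed by a Karan mark (U+0E4C) forms one cluster. The names CONS, KARAN, CONS_KARAN and toy_clusters are hypothetical, and falling back to single characters is a simplification of the real grammar.

```python
import re

CONS = "[ก-ฮ]"      # Thai consonant range U+0E01..U+0E2E (simplified)
KARAN = "\u0e4c"    # THAI CHARACTER THANTHAKHAT, the Karan mark

# Toy rule: <Cons> <Karan>, i.e. <Cons> <TCC1> <Karan> with an empty <TCC1>.
CONS_KARAN = re.compile(f"{CONS}{KARAN}")


def toy_clusters(text):
    """Split text into clusters using only the toy <Cons><Karan> rule."""
    clusters = []
    i = 0
    while i < len(text):
        match = CONS_KARAN.match(text, i)
        if match:
            # Keep the consonant and its Karan mark together.
            clusters.append(match.group())
            i = match.end()
        else:
            # No rule applies here: emit a single character.
            clusters.append(text[i])
            i += 1
    return clusters


print(toy_clusters("ก์น"))  # ['ก์', 'น']
```

Without the rule, the Karan mark would be emitted on its own and could end up attached to the following token, which is the '์น' seen in the word_tokenize output above.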

Theeramunkong, Thanaruk & Sornlertlamvanich, Virach & Tanhermhong, Thanasan & Chinnan, Wirat. (2004). Character Cluster Based Thai Information Retrieval. 10.1145/355214.355225

https://www.researchgate.net/publication/2853284_Character_Cluster_Based_Thai_Information_Retrieval

How this fixes it

I added the <Karan> rules to TCC.

Fixes #662

Thanks to @lalital for advice about the Karan rule and to @c4n for the reference.

Your checklist for this pull request

🚨Please review the guidelines for contributing to this repository.

  • Passed code styles and structures
  • Passed code linting checks and unit test

@wannaphong wannaphong requested a review from bact October 21, 2022 06:32
@wannaphong
Member Author

@bact I'm not sure whether there are more <Cons><TCC1><Karan> rules that we haven't added to TCC. Can you re-check the rules?

@coveralls

coveralls commented Oct 21, 2022

Coverage Status

Coverage increased (+1.7%) to 94.201% when pulling 83aa3d9 on fix-karan-tcc into 904439b on dev.

@wannaphong wannaphong changed the title Add <Cons> <Karan> rule to TCC Add <Karan> rule to TCC Oct 21, 2022
@wannaphong
Member Author

I found the missing rules and fixed them.

@wannaphong
Member Author

What does "DSara: lower vowel" refer to?

@wannaphong
Member Author

wannaphong commented Oct 21, 2022

According to the paper, พิสูจน์ได้ค่ะ should be segmented as ['พิ', 'สูจน์', 'ได้', 'ค่ะ']. Here is the result:

from pythainlp.tokenize import subword_tokenize
subword_tokenize("พิสูจน์ได้ค่ะ",engine="tcc")

Output before this pull request:

['พิ', 'สู', 'จ', 'น', '์', 'ได้', 'ค่ะ']

Output with this pull request:

['พิ', 'สูจน์', 'ได้', 'ค่ะ']

@wannaphong
Member Author

"ทดสอบตัดคำภาษาไทยจอก์น"

from pythainlp.tokenize import word_tokenize
word_tokenize("ทดสอบตัดคำภาษาไทยจอก์น")
# output: ['ทดสอบ', 'ตัด', 'คำ', 'ภาษาไทย', 'จอ', 'ก์น']

@bact
Member

bact commented Oct 21, 2022

Regarding the พิสูจน์ได้ค่ะ example from the paper (expected ['พิ', 'สูจน์', 'ได้', 'ค่ะ']): good one. It can be added as a test case.
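A test case along these lines might look like the sketch below. The helper names assert_clusters and check_with_pythainlp are hypothetical; subword_tokenize with engine="tcc" is the call used earlier in this conversation, and the expected list is the one quoted from the paper. The PyThaiNLP check is wrapped in a function so the helper can be used without the library installed.

```python
def assert_clusters(tokenize, text, expected):
    """Check that tokenize(text) returns exactly the expected clusters."""
    got = list(tokenize(text))
    # Segmentation must not drop, add, or reorder characters.
    assert "".join(got) == text, f"tokens do not rebuild the input: {got!r}"
    assert got == expected, f"expected {expected!r}, got {got!r}"


def check_with_pythainlp():
    """Run the paper's example; assumes a PyThaiNLP build that includes this pull request."""
    from pythainlp.tokenize import subword_tokenize

    assert_clusters(
        lambda text: subword_tokenize(text, engine="tcc"),
        "พิสูจน์ได้ค่ะ",
        ["พิ", "สูจน์", "ได้", "ค่ะ"],
    )
```

Checking that the tokens rebuild the input catches a segmenter that loses characters, which a plain list comparison would report less clearly.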

@wannaphong
Member Author

Done: the พิสูจน์ได้ค่ะ example is now a test case.

@wannaphong wannaphong added this to the 4.0 milestone Oct 21, 2022
@wannaphong
Member Author

OK. I found that DSara (lower vowels) are "อุ" and "อู", from https://github.com/c4n/TCC_REIMPLEMENTATION/blob/master/TCC_RE-IMPLEMENTATION.ipynb.
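For reference, a small sketch of what those two characters are at the code-point level. In "อุอู" the อ is only a placeholder consonant that carries the vowel marks on screen; the name DSARA is used here for illustration and is not taken from the PyThaiNLP source.

```python
import unicodedata

# Strip the placeholder consonant "อ" to leave only the two lower vowel marks.
DSARA = "อุอู".replace("อ", "")

for ch in DSARA:
    # Print the code point, Unicode name, and general category of each mark.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
# U+0E38 THAI CHARACTER SARA U Mn
# U+0E39 THAI CHARACTER SARA UU Mn
```

Both are non-spacing marks rendered below the consonant, which is why the grammar calls them lower vowels.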

Thanks @c4n

@wannaphong
Member Author

I found that the cรรc์ rule was missing too, so I added it.

@wannaphong
Member Author

Now, the results from subword_tokenize and word_tokenize are:

subword_tokenize("ทดสอบตัดคำภาษาไทยจอก์น",engine="tcc")
# output: ['ท', 'ด', 'ส', 'อ', 'บ', 'ตัด', 'คำ', 'ภา', 'ษา', 'ไท', 'ย', 'จอก์', 'น']

and

word_tokenize("ทดสอบตัดคำภาษาไทยจอก์น")
# output: ['ทดสอบ', 'ตัด', 'คำ', 'ภาษาไทย', 'จอก์น']

@wannaphong wannaphong changed the title Add <Karan> rule to TCC [WIP] Add <Karan> rule to TCC Oct 21, 2022
@wannaphong
Member Author

@bact I rewrote the TCC rules. Can you help me re-check them?

@wannaphong
Member Author

wannaphong commented Oct 21, 2022

It doesn't look like the TCC from the ETCC paper. I'm not sure which rule is missing.
[image]

This pull request:
[image]

@wannaphong
Member Author

[image]

@wannaphong
Member Author

I found that c[ั]([่-๋]c)?k has a problem with "พันธ์".

พันธ์ should be "พั", "น", "ธ์", not "พันธ์".
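The difference can be sketched with two toy rule sets. This is an illustration under assumptions, not the PyThaiNLP regex: what k expands to is not shown in this conversation, so GREEDY simply assumes a rule that may absorb a trailing one or two consonants closed by a Karan mark, while NARROW keeps only "consonant + ั" and "consonant + Karan". All names here are hypothetical.

```python
import re

CONS = "[ก-ฮ]"            # Thai consonant range (simplified)
MAI_HAN_AKAT = "\u0e31"   # the vowel mark ั
KARAN = "\u0e4c"          # the Karan mark ์

# Assumption: the ั rule may swallow an optional "c c? Karan" tail.
GREEDY = re.compile(
    f"{CONS}{MAI_HAN_AKAT}(?:{CONS}{CONS}?{KARAN})?|{CONS}{KARAN}"
)
# Narrower toy rules: the ั rule stops right after the vowel mark.
NARROW = re.compile(f"{CONS}{MAI_HAN_AKAT}|{CONS}{KARAN}")


def segment(rule, text):
    """Greedy left-to-right matching; unmatched characters stand alone."""
    out = []
    i = 0
    while i < len(text):
        match = rule.match(text, i)
        end = match.end() if match else i + 1
        out.append(text[i:end])
        i = end
    return out


print(segment(GREEDY, "พันธ์"))  # ['พันธ์']
print(segment(NARROW, "พันธ์"))  # ['พั', 'น', 'ธ์']
```

The second output matches the segmentation described above as the expected one; the first shows how an optional Karan tail on the ั rule keeps the whole word in one cluster.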

@wannaphong
Member Author

Now the "พันธ์" problem is fixed, but I found a new problem with "นธ์".

@wannaphong
Member Author

@bact I think we should not insist that this TCC match the paper 100%. (Here พันธ์ stays as พันธ์, but under the TCC paper it would be พั/น/ธ์.)

@wannaphong
Member Author

PyThaiNLP's TCC implements the TCC grammar from JTCC.

JTCC: a library for segmenting Thai text into character clusters

@wannaphong
Member Author

This pull request will only add rules to TCC.

@wannaphong
Member Author

wannaphong commented Oct 22, 2022

Test with "ทดสอบตัดคำภาษาไทยจอก์น".

With this pull request:

from pythainlp.tokenize import subword_tokenize,word_tokenize
word_tokenize("ทดสอบตัดคำภาษาไทยจอก์น")
# output: ['ทดสอบ', 'ตัด', 'คำ', 'ภาษาไทย', 'จอก์น']

With PyThaiNLP 3.1.0:

# output: ['ทดสอบ', 'ตัด', 'คำ', 'ภาษาไทย', 'จอก', '์น']

@wannaphong wannaphong changed the title [WIP] Add <Karan> rule to TCC Add <Karan> rule to TCC Oct 23, 2022
@wannaphong
Member Author

I talked with @korakot in the PyThaiNLP chat on Facebook. I will move the TCC code used in newmm to a new file and replace TCC with code whose results are closer to the TCC paper.

@sonarqubecloud

SonarCloud Quality Gate failed.

  • Bugs: 0 (rating A)
  • Vulnerabilities: 0 (rating A)
  • Security Hotspots: 0 (rating A)
  • Code Smells: 1 (rating A)
  • Coverage: no information
  • Duplication: 30.3%

@wannaphong wannaphong changed the title Add <Karan> rule to TCC Add <Karan> rule to TCC and Change TCC rule for newmm Oct 23, 2022
@wannaphong
Member Author

wannaphong commented Oct 23, 2022

Conclusion

  • tcc is a reimplementation of the TCC rules whose results are close to TCC in the TCC/ETCC papers (including the Karan rules).
  • tcc_p (named TCC+) is the old TCC rule code plus improved rules, for use in newmm.

You can use this notebook to see the differences: https://github.com/PyThaiNLP/pythainlp/blob/68d7843b319cb99d945ff4ae1645925ebdae4a83/notebooks/test_tcc.ipynb

@wannaphong wannaphong merged commit 97b1a51 into dev Oct 23, 2022
@wannaphong wannaphong deleted the fix-karan-tcc branch October 25, 2022 17:18
wannaphong added a commit that referenced this pull request Oct 26, 2022
Add more test for TCC #741 (comment)
@wannaphong wannaphong mentioned this pull request Oct 26, 2022
2 tasks
@wannaphong wannaphong mentioned this pull request Apr 1, 2023
Successfully merging this pull request may close these issues.

newmm has problem about " ์ "