Adding segmentation style for PyThaiNLP `paragraph_tokenize` function

This follows up on my previous PR #806, which concerned the `wtpsplit` engine used in the `paragraph_tokenize` function. During Hacktoberfest I found that wtpsplit can also adapt to the segmentation style of the Universal Dependencies, OPUS100, or Ersatz corpora. As of 2023, it also supports Thai in the `OPUS100` corpus style. Here are the segmentation results from the different styles:


1. `paragraph_tokenize` with default `paragraph_threshold=0.5` (the current version in PyThaiNLP):
```python
from pythainlp.tokenize import paragraph_tokenize

sent = (
    "(1) บทความนี้ผู้เขียนสังเคราะห์ขึ้นมาจากผลงานวิจัยที่เคยทำมาในอดีต"
    +"  มิได้ทำการศึกษาค้นคว้าใหม่อย่างกว้างขวางแต่อย่างใด"
    +" จึงใคร่ขออภัยในความบกพร่องทั้งปวงมา ณ ที่นี้"
)

# same as paragraph_tokenize(sent, paragraph_threshold=0.5)
paragraph_tokenize(sent)

# output
# [['(1) '],
# ['บทความนี้ผู้เขียนสังเคราะห์ขึ้นมาจากผลงานวิจัยที่เคยทำมาในอดีต  ',
#  'มิได้ทำการศึกษาค้นคว้าใหม่อย่างกว้างขวางแต่อย่างใด ',
#  'จึงใคร่ขออภัยในความบกพร่องทั้งปวงมา ',
#  'ณ ที่นี้']]
```

2. Here is the wtpsplit engine with `OPUS100` segmentation style:
```python
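from wtpsplit import WtP

# Assumption: the model used for this run is not stated; "wtp-bert-mini" is a placeholder.
wtp = WtP("wtp-bert-mini")

# `sent` is the same string defined in example 1
wtp.split(sent, lang_code="th", style="opus100", threshold=0.5)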

# output
# ['(1) ',
# 'บทความนี้ผู้เขียนสังเคราะห์ขึ้นมาจากผลงานวิจัยที่เคยทำมาในอดีต  ',
# 'มิได้ทำการศึกษาค้นคว้าใหม่อย่างกว้างขวางแต่อย่างใด จึงใคร่ขออภัยในความบกพร่องทั้งปวงมา ณ ที่นี้']
```

As the two outputs show, different segmentation styles yield different results. Note that the threshold in the OPUS100 segmentation style is not the `alpha` value. The threshold can still be changed, but its values are inherently higher, because what is being thresholded is no longer the newline probability used by the default engine in PyThaiNLP.
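To make the role of the threshold concrete, here is a toy sketch of threshold-based splitting. It is not wtpsplit's actual implementation: the helper `split_by_threshold` is hypothetical and the per-character scores are invented for illustration. It only shows that raising the threshold over the same scores leaves fewer boundaries.

```python
from typing import List


def split_by_threshold(text: str, boundary_scores: List[float], threshold: float) -> List[str]:
    """Cut `text` after every character whose boundary score exceeds `threshold`."""
    if len(text) != len(boundary_scores):
        raise ValueError("expected exactly one score per character")
    segments, start = [], 0
    for i, score in enumerate(boundary_scores):
        if score > threshold:
            segments.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        segments.append(text[start:])  # keep the trailing remainder
    return segments


text = "ab cd ef"
# Invented scores, one per character (not model output).
scores = [0.0, 0.0, 0.6, 0.0, 0.0, 0.9, 0.0, 0.0]

low = split_by_threshold(text, scores, threshold=0.5)   # ['ab ', 'cd ', 'ef']
high = split_by_threshold(text, scores, threshold=0.8)  # ['ab cd ', 'ef']
```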

What do you think about adding the `OPUS100` corpus segmentation style as an option, so that users can choose between the `default engine` and the `OPUS100` style? I would keep the current PyThaiNLP behavior as the default and expose the OPUS100 style through a new argument that must be set explicitly. We can discuss the details. If you agree, I will implement this and open a PR.
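A rough sketch of how such an option could be wired up, only to anchor the discussion. Everything here is hypothetical: the function name `tokenize_with_style`, the `style` argument, and the style keys `"default"` and `"opus100"` are placeholders rather than the existing PyThaiNLP API. The backends are stubs so the snippet runs without downloading a model.

```python
from typing import Callable, Dict, List

# A backend takes (text, threshold) and returns a list of segments.
Backend = Callable[[str, float], List[str]]


def tokenize_with_style(
    text: str,
    backends: Dict[str, Backend],
    style: str = "default",
    threshold: float = 0.5,
) -> List[str]:
    """Dispatch to the backend registered for `style` (hypothetical API)."""
    if style not in backends:
        raise ValueError(f"unknown style {style!r}; expected one of {sorted(backends)}")
    return backends[style](text, threshold)


# Stub backends standing in for the real engines.
# In a real integration, the "opus100" entry could wrap the call from example 2:
#   lambda text, t: wtp.split(text, lang_code="th", style="opus100", threshold=t)
def _stub_default(text: str, threshold: float) -> List[str]:
    return text.split("\n")


def _stub_opus100(text: str, threshold: float) -> List[str]:
    return [text]


backends = {"default": _stub_default, "opus100": _stub_opus100}

default_result = tokenize_with_style("a\nb", backends)                # current behavior stays the default
opus_result = tokenize_with_style("a\nb", backends, style="opus100")  # new style is opt-in
```

With this shape, existing callers that never pass `style` see no change, and an unknown style fails loudly instead of silently falling back.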
