bug: clause_tokenize does not work properly #1011

@panyutsriwirote

Description

As per the documentation of the tokenize submodule (https://pythainlp.org/docs/5.0/api/tokenize.html), clause_tokenize should work as follows:

from pythainlp.tokenize import clause_tokenize
clause_tokenize(["ฉัน","นอน","และ","คุณ","เล่น","มือถือ","ส่วน","น้อง","เขียน","โปรแกรม"])
# [['ฉัน', 'นอน'],
# ['และ', 'คุณ', 'เล่น', 'มือถือ'],
# ['ส่วน', 'น้อง', 'เขียน', 'โปรแกรม']]

However, running the same code in a fresh conda environment gives the following:

from pythainlp.tokenize import clause_tokenize
clause_tokenize(["ฉัน","นอน","และ","คุณ","เล่น","มือถือ","ส่วน","น้อง","เขียน","โปรแกรม"])
# [['ฉัน', 'นอน', 'และ', 'คุณ', 'เล่น', 'มือถือ', 'ส่วน', 'น้อง', 'เขียน', 'โปรแกรม']]

The model was downloaded immediately before running the code above, so a stale model cache should not be the cause.

clause_tokenize not working properly

Expected results

The returned list should consist of 3 clauses:
[['ฉัน', 'นอน'], ['และ', 'คุณ', 'เล่น', 'มือถือ'], ['ส่วน', 'น้อง', 'เขียน', 'โปรแกรม']]

Current results

The returned list consists of only 1 clause:
[['ฉัน', 'นอน', 'และ', 'คุณ', 'เล่น', 'มือถือ', 'ส่วน', 'น้อง', 'เขียน', 'โปรแกรม']]

Steps to reproduce

  1. Create a new conda environment
conda create -n temp_env python=3.10
  2. Install pythainlp and python-crfsuite
conda activate temp_env
python -m pip install pythainlp python-crfsuite
  3. Try running the following code
python
>>> from pythainlp.tokenize import clause_tokenize
>>> clause_tokenize(["ฉัน","นอน","และ","คุณ","เล่น","มือถือ","ส่วน","น้อง","เขียน","โปรแกรม"])
[['ฉัน', 'นอน', 'และ', 'คุณ', 'เล่น', 'มือถือ', 'ส่วน', 'น้อง', 'เขียน', 'โปรแกรม']]
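
For anyone who wants to re-run the check non-interactively, the steps above could be wrapped in a small script along these lines. This is only a sketch: `describe_result` and `run_reproduction` are hypothetical helper names, not part of PyThaiNLP, and the expected value is copied from the documentation example quoted in this issue.

```python
# Input tokens and the documented output, both copied from this issue.
TOKENS = ["ฉัน", "นอน", "และ", "คุณ", "เล่น", "มือถือ", "ส่วน", "น้อง", "เขียน", "โปรแกรม"]
EXPECTED = [
    ["ฉัน", "นอน"],
    ["และ", "คุณ", "เล่น", "มือถือ"],
    ["ส่วน", "น้อง", "เขียน", "โปรแกรม"],
]


def describe_result(clauses, expected=EXPECTED):
    """Hypothetical helper: summarize whether a result matches the documented one."""
    if clauses == expected:
        return "OK: output matches the documented example"
    return f"MISMATCH: got {len(clauses)} clause(s), expected {len(expected)}"


def run_reproduction():
    """Call clause_tokenize on the issue's input and describe the outcome."""
    try:
        # Imported lazily so the helper above stays usable without pythainlp.
        from pythainlp.tokenize import clause_tokenize
    except ImportError:
        return "pythainlp is not installed in this environment"
    return describe_result(clause_tokenize(TOKENS))
```

Calling `print(run_reproduction())` inside the environment from step 2 should report a mismatch for as long as the behaviour described here persists.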

PyThaiNLP version

5.0.4

Python version

3.10.15

Operating system and version

Windows 11 Pro 23H2

More info

python-crfsuite version: 0.9.11

I also tried running the same code on the WSL2 distro for Ubuntu 24.04.1 LTS and got the same result.
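
In case it helps others compare environments, the version details listed above could be gathered with a short standard-library script such as the one below. It is a sketch under the assumption that both packages were installed with pip under the distribution names `pythainlp` and `python-crfsuite`, as in the reproduction steps; `package_versions` and `environment_report` are hypothetical helper names.

```python
import platform
from importlib import metadata


def package_versions(names=("pythainlp", "python-crfsuite")):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def environment_report():
    """Build a plain-text summary of the interpreter, OS, and package versions."""
    lines = [f"Python {platform.python_version()} on {platform.platform()}"]
    for name, version in package_versions().items():
        lines.append(f"{name}: {version or 'not installed'}")
    return "\n".join(lines)


print(environment_report())
```

Pasting the printed report into a comment makes it easier to tell whether a differing result comes from a different package version or from the platform.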

Possible solution

No response

Files

No response

Labels

documentation: improve documentation and test cases
