Description
Describe the bug
The docstring of sent_tokenize says "This function does not yet automatically recognize when a sentence actually ends. Rather it helps split text where white space and a new line is found.", but the function actually splits text on whitespace, newlines and the vertical bar.
For the default engine "whitespace+newline":
sentences = re.sub(r"\n+|\s+", "|", text, re.U).split("|")
Replacing \n and \s with a vertical bar and then splitting on the vertical bar is problematic, because any vertical bar "|" already present in the original text is treated as a sentence boundary as well.
I think plain str.split() is enough here, and it should also be a little faster.
>>> text = 'somethaitext|afterverticalbar after3whitespace\n\n\nthe4thline'
>>> text.split()
['somethaitext|afterverticalbar', 'after3whitespace', 'the4thline']
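The difference between the two approaches can be seen side by side in a small standalone sketch. It does not import PyThaiNLP; it only reuses the substitute-then-split pattern quoted above. One assumption to flag: the flag is passed here as flags=re.U, whereas the quoted line passes re.U positionally, which lands in the count parameter of re.sub rather than in flags.

```python
import re

text = 'somethaitext|afterverticalbar after3whitespace\n\n\nthe4thline'

# Substitute-then-split, as in the quoted default engine
# (flag passed by keyword here, which is an assumption about the intent).
via_bar = re.sub(r"\n+|\s+", "|", text, flags=re.U).split("|")

# Plain str.split() with no arguments splits on any run of whitespace,
# including newlines, and leaves "|" untouched.
via_split = text.split()

print(via_bar)    # the "|" inside the first token becomes a boundary
print(via_split)  # the "|" inside the first token is preserved
```

The only difference between the two results is the first token, which is cut in two by the substitute-then-split version.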
As for engine "whitespace":
sentences = nltk.tokenize.WhitespaceTokenizer().tokenize(text)
This gives the same result as "whitespace+newline" is meant to, since NLTK's WhitespaceTokenizer splits text on spaces, tabs and newlines. NLTK is not needed for this at all (indeed, the docstring of NLTK's WhitespaceTokenizer says "In general, users should use the string split() method instead.").
If the text should be split only on spaces and not on newlines, I think you can simply use re.split(r' +', text):
>>> import re
>>> re.split(r' +', text)
['somethaitext|afterverticalbar', 'after3whitespace\n\n\nthe4thline']
That way the NLTK dependency, which is quite a large library, could be removed. (I'm not sure whether NLTK is used elsewhere in PyThaiNLP, though.)
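Putting the two suggestions together, an NLTK-free version of both engines might look roughly like the sketch below. The function name split_sentences and its error handling are hypothetical and are not PyThaiNLP's actual API; only the engine names are taken from this report. The sketch also drops empty strings from the space-only split, which goes slightly beyond the bare re.split call suggested above.

```python
import re


def split_sentences(text, engine="whitespace+newline"):
    """Hypothetical stand-in for the two engines discussed in this issue."""
    if engine == "whitespace+newline":
        # Split on any run of whitespace (spaces, tabs, newlines).
        return text.split()
    if engine == "whitespace":
        # Split on runs of spaces only, so newlines stay inside a segment.
        # Empty strings (from leading or trailing spaces) are dropped.
        return [part for part in re.split(r" +", text) if part]
    raise ValueError("unknown engine: " + repr(engine))


text = 'somethaitext|afterverticalbar after3whitespace\n\n\nthe4thline'
print(split_sentences(text))
print(split_sentences(text, engine="whitespace"))
```

Neither branch touches the vertical bar, and neither needs anything outside the standard library.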
To Reproduce
>>> import pythainlp
>>> text = 'somethaitext|afterverticalbar after3whitespace\n\n\nthe4thline'  # I don't speak Thai, so I use English here, sorry.
>>> pythainlp.sent_tokenize(text)
['somethaitext', 'afterverticalbar', 'after3whitespace', 'the4thline']
>>> pythainlp.sent_tokenize(text, engine = 'whitespace')
['somethaitext|afterverticalbar', 'after3whitespace', 'the4thline']
Expected behavior
>>> pythainlp.sent_tokenize(text)
['somethaitext|afterverticalbar', 'after3whitespace', 'the4thline']
>>> pythainlp.sent_tokenize(text, engine = 'whitespace')
['somethaitext|afterverticalbar', 'after3whitespace\n\n\nthe4thline']
Desktop (please complete the following information):
- OS: Windows 10 x64
- Python Version: 3.7.1
- PyThaiNLP Version: 1.7.1