sent_tokenize also split text by vertical line #166

@BLKSerene

Describe the bug
The docstring of sent_tokenize says "This function does not yet automatically recognize when a sentence actually ends. Rather it helps split text where white space and a new line is found.", but it actually splits text by whitespace, newline, and the vertical bar.

For the default engine "whitespace+newline":
sentences = re.sub(r"\n+|\s+", "|", text, re.U).split("|")
Replacing \n and \s with the vertical bar and then splitting by the vertical bar is problematic, since any vertical bar "|" already in the original text is treated as a sentence boundary as well.
I think it is okay to just use split(), which should also be a little faster.

>>> text = 'somethaitext|afterverticalbar   after3whitespace\n\n\nthe4thline'
>>> text.split()
['somethaitext|afterverticalbar', 'after3whitespace', 'the4thline']
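
To make the difference concrete, here is a minimal sketch that runs the quoted substitution next to a plain split() on the same sample string. The names via_bar and via_split are only illustrative, and re.U is passed as flags= here (the quoted line passes it positionally, where re.sub expects count).

```python
import re

text = 'somethaitext|afterverticalbar   after3whitespace\n\n\nthe4thline'

# Approach quoted above: turn whitespace runs into "|", then split on "|".
# A "|" already present in the text cannot be told apart from an inserted one.
via_bar = re.sub(r"\n+|\s+", "|", text, flags=re.U).split("|")

# Proposed approach: split directly on runs of whitespace.
via_split = text.split()

print(via_bar)    # 4 items: the original "|" became a boundary
print(via_split)  # 3 items: the original "|" is preserved
```

Note that split() with no arguments also drops leading and trailing whitespace, so it never returns empty strings at the ends.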

As for engine "whitespace":
sentences = nltk.tokenize.WhitespaceTokenizer().tokenize(text)
This is the same as "whitespace+newline", since NLTK's WhitespaceTokenizer will split text by space, tab and newline. And you don't need NLTK to do this at all (and indeed the docstring of NLTK's WhitespaceTokenizer says "In general, users should use the string split() method instead.").
If the text should only be split by whitespace but not newline, I think you can simply use re.split(r' +', text).

>>> import re
>>> re.split(r' +', text)
['somethaitext|afterverticalbar', 'after3whitespace\n\n\nthe4thline']
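
Putting the two suggestions together, an NLTK-free version of both engines could look roughly like the sketch below. The function name split_by_engine is hypothetical and is not PyThaiNLP's actual API; only the two engine names come from the library.

```python
import re


def split_by_engine(text, engine="whitespace+newline"):
    """Hypothetical stand-in for the two engines discussed in this issue."""
    if engine == "whitespace+newline":
        # Split on any run of whitespace (spaces, tabs, newlines).
        return text.split()
    if engine == "whitespace":
        # Split on runs of spaces only; newlines stay inside the pieces.
        return re.split(r" +", text)
    raise ValueError(f"unknown engine: {engine!r}")


text = 'somethaitext|afterverticalbar   after3whitespace\n\n\nthe4thline'
print(split_by_engine(text))
print(split_by_engine(text, engine="whitespace"))
```

One caveat with this sketch: re.split(r" +", ...) returns empty strings when the text starts or ends with spaces, and it does not treat tabs as separators, so the "whitespace" branch may need adjusting depending on the behavior you want.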

So the NLTK dependency, which is quite a large library, can be removed. (I'm not sure whether NLTK is used elsewhere in PyThaiNLP though.)

To Reproduce

>>> import pythainlp
>>> text = 'somethaitext|afterverticalbar   after3whitespace\n\n\nthe4thline'  # I don't speak Thai, so I use English here, sorry.
>>> pythainlp.sent_tokenize(text)
['somethaitext', 'afterverticalbar', 'after3whitespace', 'the4thline']
>>> pythainlp.sent_tokenize(text, engine = 'whitespace')
['somethaitext|afterverticalbar', 'after3whitespace', 'the4thline']

Expected behavior

>>> pythainlp.sent_tokenize(text)
['somethaitext|afterverticalbar', 'after3whitespace', 'the4thline']
>>> pythainlp.sent_tokenize(text, engine = 'whitespace')
['somethaitext|afterverticalbar', 'after3whitespace\n\n\nthe4thline']

Desktop (please complete the following information):

  • OS: Windows 10 x64
  • Python Version: 3.7.1
  • PyThaiNLP Version: 1.7.1

Labels: bug (bugs in the library)