Command line for generating misspell texts #615

@p16i

Detailed description

Since #614, we have a way to generate misspellings for Thai and English text.

One use case for the module is simulating out-of-distribution (OOD) datasets caused by misspelling. This command line would be an interface for practitioners who want to create such datasets.

More precisely, given a text file, the command line will read the file line by line and add misspellings accordingly. The number of misspellings should be configurable, as should the random seed. The user can specify the output path; if not, the command line defaults to the input filename with a suffix added.
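The line-by-line behaviour described above could be sketched roughly as follows. This is only an illustration, not the implementation from #614: `swap_adjacent` is a hypothetical stand-in for the real misspelling routine, `misspell_lines` is a made-up name, and the sketch assumes the ratio is a fraction of the characters in each line (so 0.05 is about 5 errors per 100 characters).

```python
import random
from typing import Callable, Iterable, Iterator


def swap_adjacent(text: str, n_errors: int, rng: random.Random) -> str:
    """Hypothetical stand-in for the misspelling routine from #614.

    It only swaps neighbouring characters, which is far cruder than a
    real misspelling model (and ignores Thai combining characters).
    """
    chars = list(text)
    for _ in range(n_errors):
        if len(chars) < 2:
            break
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def misspell_lines(
    lines: Iterable[str],
    ratio: float,
    seed: int,
    corrupt: Callable[[str, int, random.Random], str] = swap_adjacent,
) -> Iterator[str]:
    """Yield a corrupted copy of each input line, without the newline."""
    # One seeded generator for the whole file keeps runs reproducible.
    rng = random.Random(seed)
    for line in lines:
        text = line.rstrip("\n")
        # Assumption: ratio is the fraction of characters to corrupt.
        n_errors = round(len(text) * ratio)
        yield corrupt(text, n_errors, rng)
```

Because the generator is seeded once per run, calling `misspell_lines` twice with the same input, ratio, and seed should yield the same output, which is the property the seed option is meant to provide.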

Context

In my view, being able to simulate OOD situations has implications for a number of functionalities provided by PyThaiNLP, especially segmentation-related tasks.

Possible implementation

thainlp misspell --file ./some/data.txt --seed=1 --misspell-ratio 0.05

# output file: ./some/data[-misspelled-r.05-seed1].txt

Remarks:

  • [...] is the suffix added by the command line;
  • misspell-ratio could be the number of misspellings per 100 characters.
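As a rough sketch of how the default output name in the example could be derived, assuming the suffix pattern shown above: `default_output_path` is a hypothetical helper, and dropping the leading zero of the ratio (0.05 becomes `.05`) is inferred from the example rather than specified anywhere.

```python
from pathlib import Path


def default_output_path(source: str, ratio: float, seed: int) -> Path:
    """Build '<stem>-misspelled-r<ratio>-seed<seed><ext>' next to the source."""
    src = Path(source)
    tag = f"{ratio:g}"
    # Assumption: mirror the example by dropping the leading zero (0.05 -> .05).
    if tag.startswith("0."):
        tag = tag[1:]
    return src.with_name(f"{src.stem}-misspelled-r{tag}-seed{seed}{src.suffix}")
```

For `./some/data.txt` with ratio 0.05 and seed 1, this gives `some/data-misspelled-r.05-seed1.txt`, so the ratio and seed used for a run are recoverable from the filename.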

What's next?

Once the command line is available, we could apply it to standard datasets such as BEST2010 and evaluate how the segmentation algorithms provided by PyThaiNLP behave on the misspelled text.
