Description
Detailed description
Since #614, we have a way to produce misspellings for Thai and English text.
One use case of the module would be to simulate out-of-distribution (OOD) datasets caused by misspelling. This command line would be an interface for practitioners who want to create such datasets.
More precisely, given a text file, the command line will read the file line by line and add misspellings accordingly. The number of misspellings should be configurable, as well as the random seed. The user can specify the output path; if not, the command line defaults to using the input filename with a suffix added.
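The line-by-line behaviour described above could be sketched roughly as follows. This is only an illustration: `add_misspellings` is a hypothetical stand-in that swaps adjacent characters, not the misspelling module from #614, and the function names are made up for this example. What it shows is the use of a single seeded random generator so that the same seed and ratio reproduce the same output.

```python
import random


def add_misspellings(line: str, ratio: float, rng: random.Random) -> str:
    """Hypothetical stand-in misspeller: swap adjacent characters at random positions.

    The real command would call the misspelling module from #614 instead.
    """
    chars = list(line)
    if len(chars) < 2:
        return line
    # Number of errors is proportional to the line length.
    n_errors = round(len(chars) * ratio)
    for _ in range(n_errors):
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def misspell_lines(lines, ratio: float, seed: int):
    """Yield each input line with misspellings, reproducibly for a given seed."""
    rng = random.Random(seed)  # one generator shared across all lines
    for line in lines:
        yield add_misspellings(line.rstrip("\n"), ratio, rng)
```

Passing an open file object as `lines` would process the file one line at a time without loading it fully into memory.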
Context
In my view, being able to simulate OOD situations has implications for a number of functionalities provided by PyThaiNLP, especially in segmentation-related tasks.
Possible implementation
thainlp misspell --file ./some/data.txt --seed=1 --mispell-ratio 0.05
# output file: ./some/data[-misspelled-r.05-seed1].txt
Remarks:
[...] marks the suffix added by the command line.
mispell-ratio could be the number of misspellings per 100 characters.
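As a rough sketch of how the options and the default output name might fit together, assuming `argparse` is used: the `--output` flag name, the default values, and both function names below are assumptions made for illustration and are not part of the proposal. The `--mispell-ratio` flag is spelled as in the example command.

```python
import argparse
from pathlib import Path


def default_output_path(input_path: Path, ratio: float, seed: int) -> Path:
    """Build the default output path by adding a suffix to the input filename."""
    ratio_text = f"{ratio:g}"
    if ratio_text.startswith("0."):
        ratio_text = ratio_text[1:]  # 0.05 -> .05, as in the example filename
    name = f"{input_path.stem}-misspelled-r{ratio_text}-seed{seed}{input_path.suffix}"
    return input_path.with_name(name)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="thainlp misspell")
    parser.add_argument("--file", type=Path, required=True)
    parser.add_argument("--seed", type=int, default=0)  # default is an assumption
    parser.add_argument("--mispell-ratio", type=float, default=0.05)  # default is an assumption
    parser.add_argument("--output", type=Path, default=None)  # hypothetical flag name
    return parser
```

With the arguments from the example command, `args.output or default_output_path(args.file, args.mispell_ratio, args.seed)` would resolve to `some/data-misspelled-r.05-seed1.txt`.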
What's next?
Once we have the command line, we could try to use it with datasets such as BEST2010 or other standard datasets and evaluate the behavior of segmentation algorithms provided by PyThaiNLP.