Description
I propose that pythainlp.corpus.get_corpus_path()
should not automatically download a file when it cannot find it. That decision should be left to the user.
Currently, get_corpus_path()
tries to download the corpus file if it does not yet exist locally:
pythainlp/pythainlp/corpus/core.py
Line 81 in 831a9fc
def get_corpus_path(name: str) -> Union[str, None]:
    if db.search(query.name == name):
        path = get_full_data_path(db.search(query.name == name)[0]["file"])
        if not os.path.exists(path):
            download(name)
I propose that it should not do that.
If the file does not exist, the user/developer should be notified and decide whether to download it (using the API or the command line).
Currently, inside the pythainlp module, every caller of get_corpus_path()
already does exactly that: it checks whether the returned path is truthy and, if not, calls pythainlp.corpus.download()
itself:
pythainlp/pythainlp/tag/named_entity.py
Line 79 in 831a9fc
self.__data_path = get_corpus_path("thainer-1-3")
self.__filemodel = get_corpus_path("thai2rom-pytorch-attn")
self.__filemodel = get_corpus_path("thai-g2p")
pythainlp/pythainlp/ulmfit/__init__.py
Line 134 in 831a9fc
path = get_corpus_path(fname)
path = get_corpus_path("thai2fit_wv")
So removing the auto-download from pythainlp.corpus.get_corpus_path()
will not change the behavior of those functions in the submodules. (Whether to also remove the auto-downloads from those submodules can be discussed further.)
Proposed return values
I propose these for discussion:
- full path - if the corpus name is valid and the file exists locally
- "" (empty string) - if the corpus name is valid but the file does not exist locally
- None - if the corpus name is not valid (not in the corpus database)
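
The three return values above could be sketched roughly as follows. This is only an illustration for discussion: the `db` dict and the `data_dir` argument are assumptions standing in for the real corpus database and data directory, and the function does not reproduce the actual code in core.py.

```python
import os
import tempfile
from typing import Dict, Optional


def get_corpus_path(name: str, db: Dict[str, str], data_dir: str) -> Optional[str]:
    """Sketch of the proposed lookup with no automatic download.

    `db` maps corpus names to file names (a hypothetical stand-in for the
    corpus database); `data_dir` is where corpus files are stored.
    """
    if name not in db:
        return None  # corpus name is not in the database
    path = os.path.join(data_dir, db[name])
    if os.path.exists(path):
        return path  # valid name, file exists locally
    return ""  # valid name, but file has not been downloaded


# Small demonstration using a temporary directory as the data directory.
with tempfile.TemporaryDirectory() as data_dir:
    db = {"present": "present.bin", "missing": "missing.bin"}
    open(os.path.join(data_dir, "present.bin"), "w").close()
    results = {
        name: get_corpus_path(name, db, data_dir)
        for name in ("present", "missing", "unknown")
    }
```

Because both "" and None are falsy, existing callers that only test `if not path:` would keep working, while callers that care can distinguish "not downloaded yet" from "unknown corpus name".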