pythainlp.corpus.get_corpus_path() should not try to download the corpus automatically #385

@bact

I propose that pythainlp.corpus.get_corpus_path() should not automatically download the file when it cannot find it. That decision should be left to the user.

The current get_corpus_path() tries to download the corpus file if it does not yet exist locally:

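def get_corpus_path(name: str) -> Union[str, None]:
    # Look up the corpus name in the local corpus database
    if db.search(query.name == name):
        path = get_full_data_path(db.search(query.name == name)[0]["file"])

        # The file is missing locally, so a download is triggered implicitly
        if not os.path.exists(path):
            download(name)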

I propose that it should not do that.

If the file does not exist, the user/developer should be notified and should decide whether to download it (using the API or the command line).

Currently, inside the pythainlp module, every caller of get_corpus_path() already does exactly that: it checks whether the returned path is truthy and, if it is not, calls pythainlp.corpus.download() itself.

So removing the auto-download inside pythainlp.corpus.get_corpus_path() will not change the behavior of those functions in the submodules. (Whether we also want to remove the auto-downloads from those submodules can be discussed separately.)
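To make the caller-side pattern concrete, here is a minimal sketch of what such a submodule function does today. The helper name ensure_corpus is hypothetical and not part of pythainlp. The lookup and download functions are passed in as parameters so the sketch runs on its own; in the library they would correspond to pythainlp.corpus.get_corpus_path() and pythainlp.corpus.download().

```python
from typing import Callable, Optional


def ensure_corpus(
    name: str,
    get_corpus_path: Callable[[str], Optional[str]],
    download: Callable[[str], object],
) -> Optional[str]:
    """Hypothetical caller-side helper: look up the path, download only if needed."""
    path = get_corpus_path(name)
    if not path:  # None and "" are both falsy, so either one triggers a download
        download(name)
        path = get_corpus_path(name)  # look the path up again after downloading
    return path
```

Because the download decision already lives in the caller, a get_corpus_path() that never downloads would leave this flow unchanged.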

Proposed return values

I propose these for discussion:

  • full path - if the corpus name is valid and the file exists locally
  • "" (empty string) - if the corpus name is valid but the file does not exist locally
  • None - if the corpus name is not valid (not in the corpus database)
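The three return values above could be sketched as follows. This is only an illustration for discussion, not the library's implementation. The catalog dict and the data_dir argument are stand-ins (assumptions) for the real corpus database and data directory, so the snippet runs without pythainlp.

```python
import os
from typing import Dict, Optional


def get_corpus_path(name: str, catalog: Dict[str, str], data_dir: str) -> Optional[str]:
    """Resolve a corpus name to a local path without ever triggering a download.

    `catalog` maps corpus names to file names; it is a hypothetical stand-in
    for the corpus database.
    """
    filename = catalog.get(name)
    if filename is None:
        return None  # the corpus name is not in the database
    path = os.path.join(data_dir, filename)
    if os.path.exists(path):
        return path  # valid name and the file exists locally
    return ""  # valid name, but the file has not been downloaded yet
```

Existing callers that test `if not path:` keep working, since both "" and None are falsy. Callers that want to tell the two cases apart can compare with `is None`.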

Labels: bug (bugs in the library), corpus (corpus/dataset-related issues)