Feature: keyword extraction with keybert and frequency ranking #751

noppayut · 2022-10-29T15:39:46Z

What does this change

Implements a keyword extraction feature with unit tests. (#145)
The engine has two options, 'keybert' and 'frequency', for KeyBERT and naive frequency ranking respectively.
KeyBERT is an algorithm that ranks keywords by the cosine similarity between the embedding of each n-gram in a document and the embedding of the whole document. Embeddings are produced by a large language model. I use airesearch/wangchanberta-base-att-spm-uncased in this work.
The original implementation offers more ways to rank keywords. Please check https://github.com/MaartenGr/KeyBERT for more details and better explanations 🙂. I'm re-implementing the core features and making them support Thai.
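
For readers unfamiliar with the ranking idea described above, here is a minimal sketch of it. It is not the code in this PR: `toy_embed` is a hypothetical stand-in for the WangchanBERTa encoder (it only counts character bigrams), and `candidate_ngrams` and `rank_keywords` are illustrative names. English tokens are used so the example runs without any model or Thai tokenizer.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in for a language-model encoder: character-bigram counts."""
    padded = f" {text} "
    return dict(Counter(padded[i : i + 2] for i in range(len(padded) - 1)))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def candidate_ngrams(tokens: Sequence[str], min_n: int, max_n: int) -> List[str]:
    """Collect unique word n-grams (min_n..max_n words), in order of first appearance."""
    seen: Dict[str, None] = {}
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            seen[" ".join(tokens[i : i + n])] = None
    return list(seen)


def rank_keywords(
    tokens: Sequence[str],
    embed: Callable[[str], Vector] = toy_embed,
    ngram_range: Tuple[int, int] = (1, 2),
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Score each candidate n-gram against the whole document and keep the top_k."""
    doc_vector = embed(" ".join(tokens))
    scored = [
        (phrase, cosine(embed(phrase), doc_vector))
        for phrase in candidate_ngrams(tokens, *ngram_range)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


tokens = "beer is brewed from malted barley and flavoured with hops".split()
for phrase, score in rank_keywords(tokens, top_k=3):
    print(f"{phrase}\t{score:.3f}")
```

Swapping `toy_embed` for a real sentence encoder is the only change needed to move from this toy to an embedding-based ranker; the candidate generation and cosine scoring stay the same.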

from pythainlp.summarize import extract_keywords

text = (
    "เบียร์ เป็นหนึ่งในเครื่องดื่มแอลกอฮอล์ที่เก่าแก่ที่สุดและบริโภคกันอย่างแพร่หลายมากที่สุดในโลก "
    "และเป็นเครื่องดื่มยอดนิยมอันดับสามทั้งหมด รองจากน้ำดื่มและชา "
    "ถูกผลิตขึ้นโดยการกลั่นเบียร์ (brewing) และกระบวนการหมักของแป้ง ซึ่งส่วนใหญ่ได้มาจากธัญพืช - "
    "ส่วนมากมาจากมอลต์ข้าวบาร์เลย์ แม้กระทั่งข้าวสาลี ข้าวโพด ข้าว และข้าวโอ๊ตก็ใช้ได้เช่น ในช่วงขั้นตอนการกลั่นเบียร์ "
    "กระบวนการหมักของแป้งนั้น น้ำตาลในวอร์ต(wort)จะก่อให้เกิดเอทานอลและคาร์บอนเนชั่นในเบียร์ที่ได้ออกมา "
    "เบียร์สมัยใหม่ส่วนใหญ่จะกลั่นด้วยฮอปส์ ซึ่งจะเป็นการเพิ่มความขมและรสชาติอื่น ๆ "
    "และทำหน้าที่เป็นสารกันบูดและสารคงตัวตามธรรมชาติ สารแต่งกลิ่นรสอื่น ๆ "
    "เช่น กรู๊ต สมุนไพรหรือผลไม้ซึ่งอาจจะรวมทั้งหรือการใช้แทนฮอปส์ ในการกลั่นเบียร์เชิงพาณิชย์ "
    "ผลของการเกิดคาร์บอนเนชั่นตามธรรมชาติมักจะถูกขจัดออกในช่วงกระบวนการผลิตและแทนที่ด้วยการอัดลมด้วยคาร์บอนเนชั่นแบบบังคับ"
)

keywords = extract_keywords(text)
# output: ['wort)', 'brewing)', 'ที่เก่าแก่', 'ขจัดออก', 'อัดลมด้วย']

keywords = extract_keywords(text, engine='frequency')
# output: ['เบียร์', 'กลั่น', 'คาร์บอน', 'เนชั่น', 'เครื่องดื่ม']
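
The 'frequency' engine is described only as naive frequency ranking, so the following is a rough sketch of that idea rather than the PR's implementation. The function name `rank_by_frequency` and its `stopwords` and `min_len` parameters are assumptions made for illustration; the input is assumed to be already tokenized.

```python
from collections import Counter
from typing import AbstractSet, Iterable, List


def rank_by_frequency(
    tokens: Iterable[str],
    stopwords: AbstractSet[str] = frozenset(),
    top_k: int = 5,
    min_len: int = 2,
) -> List[str]:
    """Return the top_k most frequent tokens, skipping stopwords and very short tokens."""
    counts = Counter(
        token
        for token in tokens
        if len(token) >= min_len and token not in stopwords
    )
    # most_common() sorts by count; ties keep the order of first appearance.
    return [word for word, _ in counts.most_common(top_k)]


words = ["beer", "is", "a", "drink", "beer", "malt", "drink", "beer"]
print(rank_by_frequency(words, stopwords={"is", "a"}, top_k=3))
# ['beer', 'drink', 'malt']
```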

What was wrong

Just a brand new feature.

How this fixes it

Fixes #145

There are some issues related to airesearch/wangchanberta-base-att-spm-uncased: it spams warning messages on initialization.

Some weights of the model checkpoint at airesearch/wangchanberta-base-att-spm-uncased were not used when initializing CamembertModel: ['lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.bias', 'lm_head.dense.bias', 'lm_head.bias']
- This IS expected if you are initializing CamembertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CamembertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
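
One possible way to quiet this message, sketched here and not part of this PR, is to raise the log level of the emitting logger while the model loads. The helper `quiet_logger` is hypothetical, and the logger name "transformers.modeling_utils" is an assumption about where the warning originates; it may differ between transformers versions.

```python
import logging
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def quiet_logger(name: str, level: int = logging.ERROR) -> Iterator[None]:
    """Temporarily raise a logger's level, restoring the previous level on exit."""
    logger = logging.getLogger(name)
    previous = logger.level
    logger.setLevel(level)
    try:
        yield
    finally:
        logger.setLevel(previous)


# Assumption: the "Some weights ... were not used" warning is logged under this name.
with quiet_logger("transformers.modeling_utils"):
    pass  # load the model here, e.g. inside the keyword extractor's constructor
```

Scoping the change to a context manager keeps other warnings from the same library visible once the model has loaded.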

Your checklist for this pull request

🚨Please review the guidelines for contributing to this repository.

[✓] Passed code styles and structures
[✓] Passed code linting checks and unit test

coveralls · 2022-10-29T16:07:37Z

Coverage decreased (-2.0%) to 92.327% when pulling 9d70de8 on noppayut:feature/keyword-extraction-keybert into 0bed123 on PyThaiNLP:dev.

wannaphong · 2022-10-29T17:02:57Z

Thank you for the pull request! Can you add the function to docs/api/summarize.rst?

noppayut · 2022-10-30T01:02:13Z

Sure. Please check.

sonarqubecloud · 2022-10-30T01:02:52Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
2 Code Smells

No Coverage information
0.0% Duplication

wannaphong

Good work 👍

noppayut added 8 commits October 29, 2022 22:56

Implement KeyBERT

6dada49

Implement extract_keywords() and incorporate to summarize package

1a50ad6

Add unittest

4c83109

lint

bb4b976

Update API and docstring

7bd9f30

Merge upstream dev, resolve conflict

989d827

Add docstring to embed()

199b773

Fix frequency count

2b0936c

linting

ecc0629

wannaphong added the hacktoberfest-accepted (hacktoberfest accepted pull requests) and enhancement (enhance functionalities) labels Oct 29, 2022

Add doc

9d70de8

wannaphong approved these changes Oct 30, 2022

wannaphong merged commit b002db0 into PyThaiNLP:dev Oct 30, 2022
