@DevinTDHa DevinTDHa commented Jul 18, 2022

Description

Currently, if users provide custom bounds to the SentenceDetector, the matched delimiters are dropped from the output. This PR adds a parameter that returns the custom bounds as part of the separated sentences. Users can also choose where the sentence break falls relative to the matched bound: before it (prepend) or after it (append).

Example

With setCustomBounds([r"\.", ";"]) and the input

This is a sentence. This one uses custom bounds; As is this one;

the result without the new flag is

["This is a sentence", "This one uses custom bounds", "As is this one"]

With the new flag:

.setCustomBounds([r"\.", ";"])
.setCustomBoundsStrategy("append")

the result will be

["This is a sentence.", "This one uses custom bounds;", "As is this one;"]

Similarly with prepend, for the input

1. This is a list
1.1 This is a subpoint
2. Second thing
2.2 Second subthing

and the settings

.setCustomBounds([r"\n[\d\.]+"])
.setCustomBoundsStrategy("prepend")

the result will be

[
    "1. This is a list",
    "1.1 This is a subpoint",
    "2. Second thing",
    "2.2 Second subthing"
]

All test cases here.

Summary of the changes

  • added parameter customBoundsStrategy
    • defaults to "none" which keeps the same behaviour
    • can set to "prepend": prepends sentence break and keeps delimiter
    • can set to "append": appends sentence break and keeps delimiter
  • added new tests to reflect the change
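
To make the three strategies concrete, here is a minimal pure-Python sketch of the splitting behaviour described above. It is only an illustration using the standard `re` module, not the Scala implementation in this PR; the helper name `split_on_custom_bounds` is hypothetical, and trimming whitespace around each sentence is an assumption based on the example outputs.

```python
import re


def split_on_custom_bounds(text, bounds, strategy="none"):
    """Hypothetical helper (not part of Spark NLP) mimicking customBoundsStrategy.

    "none":    the matched bound is dropped (previous behaviour)
    "append":  the break goes after the bound, so it ends the current sentence
    "prepend": the break goes before the bound, so it starts the next sentence
    """
    if strategy not in ("none", "append", "prepend"):
        raise ValueError(f"unknown strategy: {strategy!r}")

    # Combine all custom bounds into one alternation.
    pattern = re.compile("|".join(f"(?:{b})" for b in bounds))

    pieces = []
    start = 0  # index where the current sentence begins
    for match in pattern.finditer(text):
        if strategy == "append":
            pieces.append(text[start:match.end()])
            start = match.end()
        elif strategy == "prepend":
            pieces.append(text[start:match.start()])
            start = match.start()
        else:  # "none"
            pieces.append(text[start:match.start()])
            start = match.end()
    pieces.append(text[start:])

    # Assumption: surrounding whitespace is trimmed and empty pieces are skipped.
    return [p.strip() for p in pieces if p.strip()]


if __name__ == "__main__":
    text = "This is a sentence. This one uses custom bounds; As is this one;"
    print(split_on_custom_bounds(text, [r"\.", ";"], "append"))
```

Calling the helper with `"prepend"` and the bound `r"\n[\d\.]+"` on the numbered-list input keeps each number at the start of its own sentence, matching the prepend example in the description.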

How Has This Been Tested?

New and existing tests are passing.

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Code improvements with no or little impact
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING page.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@DevinTDHa DevinTDHa added the enhancement, new-feature (Introducing a new feature), and DON'T MERGE (Do not merge this PR) labels Jul 18, 2022
@josejuanmartinez
Contributor

@DevinTDHa this is perfection! Very well observed about prepending numbers too, rather than only the use cases which append the delimiter to the end. THANK YOU!

@DevinTDHa DevinTDHa changed the base branch from master to release/402-release-candidage July 18, 2022 16:22
@DevinTDHa DevinTDHa removed the DON'T MERGE Do not merge this PR label Jul 18, 2022
@maziyarpanahi maziyarpanahi merged commit 0f6de01 into JohnSnowLabs:release/402-release-candidage Jul 19, 2022