Feature/regex tok trim #6806

wolliq · 2022-01-21T21:40:13Z

Description

This RegexTokenizer extension adds the ability to trim identified tokens so that they can be used in further processing without applying other data transformations.
In particular, tokens can be trimmed, and the user can decide whether to preserve the original indexes.
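
To illustrate the intended behaviour, here is a minimal, library-independent sketch in plain Python. It is not the code from this PR: the names `Token`, `regex_tokenize`, `trim`, and `preserve_indexes` are hypothetical. It also assumes that "preserving original indexes" means keeping the offsets of the untrimmed span, while not preserving them means recomputing the offsets to match the trimmed text.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    text: str
    begin: int  # inclusive start offset in the input text
    end: int    # inclusive end offset in the input text


def regex_tokenize(text: str, pattern: str = r",", trim: bool = False,
                   preserve_indexes: bool = False) -> List[Token]:
    """Split `text` on `pattern` (hypothetical helper, not the Spark NLP API)."""
    # Collect the [begin, end) spans that lie between delimiter matches.
    spans = []
    start = 0
    for match in re.finditer(pattern, text):
        spans.append((start, match.start()))
        start = match.end()
    spans.append((start, len(text)))

    tokens = []
    for begin, end in spans:
        raw = text[begin:end]
        if not raw:
            continue  # skip empty pieces between adjacent delimiters
        if not trim:
            tokens.append(Token(raw, begin, end - 1))
            continue
        stripped = raw.strip()
        if not stripped:
            continue  # the piece was whitespace only
        if preserve_indexes:
            # Assumption: keep the offsets of the untrimmed span.
            tokens.append(Token(stripped, begin, end - 1))
        else:
            # Assumption: recompute offsets so they match the trimmed text.
            leading = len(raw) - len(raw.lstrip())
            new_begin = begin + leading
            tokens.append(Token(stripped, new_begin, new_begin + len(stripped) - 1))
    return tokens


sample = "a , b,  c "
untrimmed = regex_tokenize(sample)
trimmed = regex_tokenize(sample, trim=True)
trimmed_preserved = regex_tokenize(sample, trim=True, preserve_indexes=True)
```

With trimming enabled, the token texts no longer carry surrounding whitespace, so downstream stages do not need an extra cleanup step. The `preserve_indexes` switch only changes which offsets are reported, not the token text.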

Motivation and Context

RegexTokenizer was missing a token-trimming capability.

How Has This Been Tested?

Full non-regression test run in Scala, with new use cases added to the dedicated test suite.
Added tests for the Python interface.

Screenshots (if appropriate):

[info] Suites: completed 131, aborted 0
[info] Tests: succeeded 638, failed 0, canceled 0, ignored 5, pending 0
[info] All tests passed.

Types of changes

Bug fix (non-breaking change which fixes an issue)
Code improvements with no or little impact
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.
I have read the CONTRIBUTING page.
I have added tests to cover my changes.
All new and existing tests passed.

python/test/annotators.py

python/run-tests.py

…k 2.x validation

Stefano Lori added 5 commits January 16, 2022 19:02

wip added trim features to regex tokenizer

bf126d8

Added Python interface to Regex Tok new params

308532c

wip added trim features to regex tokenizer

bcd051a

Added Python interface to Regex Tok new params

2873eb2

Merge branch 'feature/regex-tok-trim' of https://github.com/JohnSnowL…

346ad4c

…abs/spark-nlp into feature/regex-tok-trim

wolliq requested a review from maziyarpanahi January 21, 2022 21:40