@wolliq wolliq commented Jan 21, 2022

Description

This RegexTokenizer extension adds the ability to trim identified tokens so that they can be used in further processing without applying additional data transformations.
In particular, tokens can be trimmed, and the user can decide whether to preserve their original indexes.
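
To illustrate the behaviour described above, here is a minimal standalone sketch in plain Python. It is not the code in this PR and does not use the Spark NLP API: the function `regex_tokenize` and its parameters `trim` and `preserve_position` are hypothetical names chosen for illustration, and the inclusive `(text, begin, end)` index convention is an assumption.

```python
import re
from typing import List, Tuple

# (token text, begin index, inclusive end index) -- illustrative convention only
Token = Tuple[str, int, int]


def regex_tokenize(text: str, pattern: str, trim: bool = False,
                   preserve_position: bool = True) -> List[Token]:
    """Split `text` on `pattern`, optionally trimming whitespace from tokens.

    Hypothetical helper, not the Spark NLP implementation.
    """
    # Collect the spans of text lying between separator matches.
    spans = []
    start = 0
    for match in re.finditer(pattern, text):
        spans.append((start, match.start()))
        start = match.end()
    spans.append((start, len(text)))

    tokens: List[Token] = []
    for begin, end in spans:
        raw = text[begin:end]
        if not raw:
            continue
        if not trim:
            tokens.append((raw, begin, end - 1))
            continue
        stripped = raw.strip()
        if not stripped:
            continue  # token was whitespace only
        if preserve_position:
            # Keep the indexes of the untrimmed token.
            tokens.append((stripped, begin, end - 1))
        else:
            # Recompute the indexes so they cover only the trimmed text.
            lead = len(raw) - len(raw.lstrip())
            new_begin = begin + lead
            tokens.append((stripped, new_begin, new_begin + len(stripped) - 1))
    return tokens


sample = "a , b ,c"
untrimmed = regex_tokenize(sample, ",")
trimmed_keep = regex_tokenize(sample, ",", trim=True, preserve_position=True)
trimmed_shift = regex_tokenize(sample, ",", trim=True, preserve_position=False)
```

With trimming enabled, the token text loses its surrounding whitespace in both modes. The difference is whether the reported indexes still cover the untrimmed span (`trimmed_keep`) or are recomputed to cover only the trimmed text (`trimmed_shift`).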

Motivation and Context

RegexTokenizer lacked a token trimming capability.

How Has This Been Tested?

Full non-regression test run in Scala, with new use cases added to a dedicated test suite.
Added tests for the Python interface.

Screenshots (if appropriate):

[info] Suites: completed 131, aborted 0
[info] Tests: succeeded 638, failed 0, canceled 0, ignored 5, pending 0
[info] All tests passed.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Code improvements with no or little impact
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING page.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@wolliq wolliq requested a review from maziyarpanahi January 21, 2022 21:40
@maziyarpanahi maziyarpanahi self-assigned this Jan 22, 2022
@maziyarpanahi maziyarpanahi changed the base branch from master to release/341-release-candidate January 25, 2022 20:54
@maziyarpanahi maziyarpanahi merged commit 3089e63 into release/341-release-candidate Jan 25, 2022
@KshitizGIT KshitizGIT deleted the feature/regex-tok-trim branch March 2, 2023 10:34

3 participants