
Conversation

@DevinTDHa
Member

Motivation and Context

The current Tokenizer implementation has performance issues if a large number of exceptions are used.

For example:
When tokenizing src/test/resources/spell/sherlockholmes.txt, a run takes an average of 0.58 seconds over 20 iterations on my machine. With 100 exceptions the average rises to 3.17 seconds, and with 200 it jumps to 5.8 seconds. The following benchmark was used for testing: TokenizerTestSpec
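For reference, here is a minimal sketch of the kind of pipeline such a benchmark exercises. The column names and the generated exception list are placeholders, not the exact setup in TokenizerTestSpec:

```scala
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.{SentenceDetector, Tokenizer}
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

// Placeholder exceptions; the benchmark varies this count (100, 200, ...).
val exceptions = (1 to 100).map(i => s"exception_$i").toArray

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")
  .setExceptions(exceptions)

val pipeline = new Pipeline()
  .setStages(Array(documentAssembler, sentenceDetector, tokenizer))
```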

Regular expressions are compiled twice, and there is a severe performance drop.

Problem

The issue lies with how the exceptions are checked.

  1. Regexes are compiled for each sentence that is tagged, even though this can be done beforehand (see the sketch after this list).
    • Additionally, each regex exception is compiled twice: first when replacing the BREAK_PATTERN in tag, and then again in casedMatchExists.
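To illustrate the difference, here is a generic sketch (not the actual Tokenizer code) of compiling a pattern per sentence versus compiling it once up front:

```scala
import scala.util.matching.Regex

// Before: the exception pattern is rebuilt for every sentence that is tagged.
def tagPerSentence(sentences: Seq[String], exception: String): Int =
  sentences.map { sentence =>
    val pattern: Regex = exception.r          // compiled again on every sentence
    pattern.findAllIn(sentence).length
  }.sum

// After: compile once, e.g. when the model is instantiated, and reuse it.
class PrecompiledTagger(exception: String) {
  private val pattern: Regex = exception.r    // compiled a single time
  def tag(sentences: Seq[String]): Int =
    sentences.map(sentence => pattern.findAllIn(sentence).length).sum
}
```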

Description

The following changes were implemented:

  • exceptions and regular expressions are compiled during instantiation of the model instead of for each sentence
  • exceptions are combined into one regular expression in TokenizerModel
  • the function casedMatchExists was removed in favor of a hash set that checks in tag whether an exception without a break exists (see the sketch after this list)
  • fixed some minor warnings highlighted by IntelliJ
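A rough sketch of the combined idea follows. The class and method names here are hypothetical and do not match the real TokenizerModel code:

```scala
import java.util.regex.Pattern
import scala.collection.mutable

class CompiledExceptions(exceptions: Array[String]) {

  // All exceptions are folded into a single alternation and compiled exactly once,
  // at model instantiation, instead of once (or twice) per sentence.
  val combinedPattern: Pattern =
    Pattern.compile(exceptions.map(e => s"(?:$e)").mkString("|"))

  // Exceptions that are unaffected by the BREAK_PATTERN are remembered in a set,
  // so the per-sentence regex check formerly done by casedMatchExists becomes a
  // constant-time lookup.
  val noBreakExceptions: mutable.HashSet[String] = mutable.HashSet.empty[String]

  def isException(candidate: String): Boolean =
    noBreakExceptions.contains(candidate) || combinedPattern.matcher(candidate).matches()
}
```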

Benchmark Results

These changes result in a performance boost of

  • more than 300% for 100 exceptions (from 3.17 seconds down to 0.95 seconds) and
  • more than 450% for 200 exceptions (from 5.8 seconds down to 1.24 seconds).

How Has This Been Tested?

Added an additional test for the benchmark. All Tokenizer tests are passing.
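For anyone reproducing the numbers, averaging over repeated runs can be done with a small helper like the one below (a hypothetical sketch, not the code in TokenizerTestSpec):

```scala
// Returns the average wall-clock time in seconds over the given number of iterations.
def averageSeconds(iterations: Int)(block: => Unit): Double = {
  val times = (1 to iterations).map { _ =>
    val start = System.nanoTime()
    block
    (System.nanoTime() - start) / 1e9
  }
  times.sum / iterations
}

// Example usage with an already fitted pipeline and a DataFrame `data`:
// val avg = averageSeconds(20) { pipelineModel.transform(data).collect() }
```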

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Code improvements with no or little impact
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING page.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@maziyarpanahi
Contributor

Thanks for this @DevinTDHa, the performance improvements look great.

Regarding the use of mutable.HashSet, I would suggest a stress test on a large dataset, and instead of sentence as inputCols I would use document so you have more text in each regex match. Let's see if the mutable HashSet performs acceptably, since Scala's mutable HashSet is known to perform poorly on copies, at least when it grows.

@maziyarpanahi maziyarpanahi added the DON'T MERGE Do not merge this PR label Apr 11, 2022
@DevinTDHa DevinTDHa marked this pull request as draft April 11, 2022 14:12
@DevinTDHa
Member Author

@maziyarpanahi I will test how well it performs. In this particular case, the HashSet will not grow much: at most it will hold one entry per exception, and only for exceptions that are not affected by the BREAK_PATTERN.

@maziyarpanahi
Contributor

@maziyarpanahi I will test how well it performs. In this particular case, the HashSet will not grow much: at most it will hold one entry per exception, and only for exceptions that are not affected by the BREAK_PATTERN.

That's a great point! But let's make this as ridiculous as possible, since users might come with something like 100k lines of exceptions on large pages taken from OCR, without any sentence splitting.

@maziyarpanahi maziyarpanahi changed the base branch from release/343-release-candidate to master April 13, 2022 13:01
@maziyarpanahi maziyarpanahi marked this pull request as ready for review April 13, 2022 13:01
@maziyarpanahi maziyarpanahi changed the base branch from master to release/344-release-candidate April 13, 2022 13:02
@DevinTDHa
Member Author

DevinTDHa commented Apr 14, 2022

@maziyarpanahi I looked at it in the following ways:

I used free samples from the COCA Dataset as a test. In total these are about 500k sentences and 56 million characters.

As exceptions I chose something ridiculous like \w+, which will effectively match every word in a sentence. I also monitored the tests with VisualVM.

  • Tokenizing it the regular way causes no issues. The set doesn't grow beyond the vocabulary of a single line, and it only exists for the duration of each line of the file, keeping it small.
  • For the second test, I replaced all \n with \t in the sample data. This effectively results in one massive sentence, which is tokenized all at once. The collection grows to a fixed size equal to the vocabulary of the data (this only happens once). Processing the exceptions takes about 5 seconds on my machine (see the sketch after this list for the setup).
    • Note: The whole tokenization process doesn't terminate (at least I didn't wait long enough for it to). Consequently, the current implementation of the exceptions will perform even worse.
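A rough reconstruction of the second stress setup described above; the file path, Spark session configuration, and column names are assumptions:

```scala
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.Tokenizer
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("tokenizer-stress")
  .getOrCreate()

// Drop the sentence boundaries so the whole corpus becomes one massive "sentence".
val rawText = scala.io.Source.fromFile("coca_samples.txt").mkString
val oneGiantSentence = rawText.replace("\n", "\t")

val data = spark.createDataFrame(Seq(Tuple1(oneGiantSentence))).toDF("text")

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")          // tokenize the whole document, no sentence detector
  .setOutputCol("token")
  .setExceptions(Array("\\w+"))      // deliberately matches every word

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer))
pipeline.fit(data).transform(data).select("token").count()
```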

@maziyarpanahi maziyarpanahi merged commit a77fa66 into JohnSnowLabs:release/344-release-candidate Apr 24, 2022