Tokenizer: Optimized tokenization with exceptions #7881
Merged
Motivation and Context
The current Tokenizer implementation has performance issues if a large number of exceptions are used.
For example, when tokenizing src/test/resources/spell/sherlockholmes.txt on my machine, it takes an average of 0.58 seconds over 20 iterations with no exceptions. With 100 exceptions it takes 3.17 seconds on average, and with 200 it jumps to 5.8 seconds. The benchmark in TokenizerTestSpec was used for testing.
In short, the regular expressions are compiled twice, which causes a severe performance drop.
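For reference, the sketch below shows how such a benchmark can be set up through the public pipeline API. It is illustrative only and is not the actual TokenizerTestSpec code; the exception list, iteration count, and timing loop are placeholders chosen to mirror the numbers quoted above.

```scala
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.SparkSession

object TokenizerExceptionBenchmark extends App {
  val spark = SparkSession.builder().master("local[*]").appName("tokenizer-bench").getOrCreate()

  // One row containing the full text of the benchmark file.
  val text = scala.io.Source.fromFile("src/test/resources/spell/sherlockholmes.txt").mkString
  val data = spark.createDataFrame(Seq(Tuple1(text))).toDF("text")

  val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")

  // Placeholder exception list; the benchmark uses 100 or 200 entries.
  val exceptions = (1 to 100).map(i => s"exception_$i").toArray

  val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")
    .setExceptions(exceptions)

  val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer))

  // Time a fixed number of iterations and report the average.
  val iterations = 20
  val start = System.nanoTime()
  (1 to iterations).foreach { _ =>
    pipeline.fit(data).transform(data).select("token").collect()
  }
  val avgSeconds = (System.nanoTime() - start) / 1e9 / iterations
  println(f"average: $avgSeconds%.2f s per iteration")
}
```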
Problem
The issue lies in how the exceptions are checked: for each exception, the BREAK_PATTERN regular expression is compiled in tag and afterwards compiled again in casedMatchExists, so every exception triggers two regex compilations.
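The following sketch illustrates why repeated compilation is so costly. The names and structure are simplified stand-ins, not the actual TokenizerModel code; it only contrasts rebuilding a pattern on every check with compiling it once up front.

```scala
import scala.util.matching.Regex

object RegexCompilationCost extends App {
  val exceptions = (1 to 200).map(i => s"e$i.g.")
  val tokens = Seq.fill(10000)("word")

  // Stand-in for the old behaviour: the pattern is rebuilt
  // (and therefore recompiled) on every check.
  def slowMatch(token: String, exception: String): Boolean = {
    val breakPattern = new Regex(java.util.regex.Pattern.quote(exception))
    breakPattern.findFirstIn(token).isDefined
  }

  // Compiling each pattern once and reusing it avoids the repeated cost.
  val compiled: Map[String, Regex] =
    exceptions.map(e => e -> new Regex(java.util.regex.Pattern.quote(e))).toMap
  def fastMatch(token: String, exception: String): Boolean =
    compiled(exception).findFirstIn(token).isDefined

  def time[T](label: String)(block: => T): T = {
    val start = System.nanoTime()
    val result = block
    println(f"$label: ${(System.nanoTime() - start) / 1e9}%.2f s")
    result
  }

  time("recompile per check") { tokens.foreach(t => exceptions.exists(e => slowMatch(t, e))) }
  time("compile once")        { tokens.foreach(t => exceptions.exists(e => fastMatch(t, e))) }
}
```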
Description
The following changes were implemented:
In TokenizerModel, casedMatchExists was removed in favor of a hash set that checks whether an exception without a break exists in tag, avoiding the second regular expression compilation for every exception. A sketch of the idea follows.
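The sketch below shows the general idea under simplified, assumed names (buildNonBreakingExceptions and the placeholder break characters are hypothetical and not the actual TokenizerModel members): exceptions that contain no break characters go into a hash set, so a candidate token can be checked with an O(1) membership lookup instead of a regex match.

```scala
object ExceptionLookup {
  // Exceptions without break characters can be matched by exact lookup
  // instead of a compiled pattern.
  def buildNonBreakingExceptions(exceptions: Seq[String], breakChars: Set[Char]): Set[String] =
    exceptions.filterNot(_.exists(breakChars.contains)).toSet

  def main(args: Array[String]): Unit = {
    val exceptions = Seq("e.g.", "i.e.", "New York")
    val breakChars = Set(' ') // placeholder break characters
    val nonBreaking = buildNonBreakingExceptions(exceptions, breakChars)

    // O(1) hash-set membership check replaces a per-token regex match.
    println(nonBreaking.contains("e.g."))     // true
    println(nonBreaking.contains("New York")) // false: contains a break, handled separately
  }
}
```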
Benchmark Results
These changes will result in a performance boost of
How Has This Been Tested?
Added an additional test for the benchmark. All Tokenizer tests are passing.
Types of changes
Checklist: