
Conversation

@DevinTDHa
Member

Motivation and Context

The current Tokenizer implementation has performance issues if a large number of exceptions are used.

For example:
When tokenizing src/test/resources/spell/sherlockholmes.txt, a run takes an average of 0.58 seconds over 20 iterations on my machine. With 100 exceptions the average rises to 3.17 seconds, and with 200 it jumps to 5.8 seconds. The following benchmark was used for testing: TokenizerTestSpec
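For reference, here is a minimal sketch of the kind of pipeline such a benchmark exercises. The column names and the generated exception list are placeholders, not the exact setup in TokenizerTestSpec:

```scala
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.{SentenceDetector, Tokenizer}
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

// Placeholder exceptions; the benchmark varies this count (100, 200, ...).
val exceptions = (1 to 100).map(i => s"exception_$i").toArray

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")
  .setExceptions(exceptions)

val pipeline = new Pipeline()
  .setStages(Array(documentAssembler, sentenceDetector, tokenizer))
```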

Regular expressions are compiled twice, and there is a severe performance drop.

Problem

The issue lies with how the exceptions are checked.

  1. Regexes are compiled for each sentence that is tagged, even though this can be done beforehand (see the sketch after this list).
    • Additionally, each regex exception is compiled twice: first when replacing the BREAK_PATTERN in tag, and then again in casedMatchExists.
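To illustrate the difference, here is a generic sketch (not the actual Tokenizer code) of compiling a pattern per sentence versus compiling it once up front:

```scala
import scala.util.matching.Regex

// Before: the exception pattern is rebuilt for every sentence that is tagged.
def tagPerSentence(sentences: Seq[String], exception: String): Int =
  sentences.map { sentence =>
    val pattern: Regex = exception.r          // compiled again on every sentence
    pattern.findAllIn(sentence).length
  }.sum

// After: compile once, e.g. when the model is instantiated, and reuse it.
class PrecompiledTagger(exception: String) {
  private val pattern: Regex = exception.r    // compiled a single time
  def tag(sentences: Seq[String]): Int =
    sentences.map(sentence => pattern.findAllIn(sentence).length).sum
}
```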

Description

The following changes were implemented:

  • exceptions and regular expressions are compiled during instantiation of the model instead of for each sentence
  • exceptions are combined into one regular expression in TokenizerModel
  • the function casedMatchExists was removed in favor of a hash set that checks in tag whether an exception without a break exists (see the sketch after this list)
  • fixed some minor warnings highlighted by IntelliJ
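A rough sketch of the combined idea follows. The class and method names here are hypothetical and do not match the real TokenizerModel code:

```scala
import java.util.regex.Pattern
import scala.collection.mutable

class CompiledExceptions(exceptions: Array[String]) {

  // All exceptions are folded into a single alternation and compiled exactly once,
  // at model instantiation, instead of once (or twice) per sentence.
  val combinedPattern: Pattern =
    Pattern.compile(exceptions.map(e => s"(?:$e)").mkString("|"))

  // Exceptions that are unaffected by the BREAK_PATTERN are remembered in a set,
  // so the per-sentence regex check formerly done by casedMatchExists becomes a
  // constant-time lookup.
  val noBreakExceptions: mutable.HashSet[String] = mutable.HashSet.empty[String]

  def isException(candidate: String): Boolean =
    noBreakExceptions.contains(candidate) || combinedPattern.matcher(candidate).matches()
}
```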

Benchmark Results

These changes result in a performance boost of

  • more than 300% for 100 exceptions (from 3.17 seconds down to 0.95 seconds) and
  • more than 450% for 200 exceptions (from 5.8 seconds down to 1.24 seconds).

How Has This Been Tested?

Added an additional test for the benchmark. All Tokenizer tests are passing.
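For anyone reproducing the numbers, averaging over repeated runs can be done with a small helper like the one below (a hypothetical sketch, not the code in TokenizerTestSpec):

```scala
// Returns the average wall-clock time in seconds over the given number of iterations.
def averageSeconds(iterations: Int)(block: => Unit): Double = {
  val times = (1 to iterations).map { _ =>
    val start = System.nanoTime()
    block
    (System.nanoTime() - start) / 1e9
  }
  times.sum / iterations
}

// Example usage with an already fitted pipeline and a DataFrame `data`:
// val avg = averageSeconds(20) { pipelineModel.transform(data).collect() }
```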

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Code improvements with no or little impact
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING page.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@maziyarpanahi
Contributor

Thanks for this @DevinTDHa, the performance improvements look great.

Regarding the use of mutable.HashSet, I would suggest a stress test on a large dataset, and instead of sentence as inputCols I would use document so you have more text in each regex match. Let's see if the mutable HashSet performs acceptably, since Scala's mutable HashSet is known to perform poorly on copies, at least when it grows.

@maziyarpanahi maziyarpanahi added the DON'T MERGE Do not merge this PR label Apr 11, 2022
@DevinTDHa DevinTDHa marked this pull request as draft April 11, 2022 14:12
@DevinTDHa
Member Author

@maziyarpanahi I will test how well it performs. In this particular case, the HashSet will not grow much: at most it will hold one entry per exception, and only for exceptions that are not affected by the BREAK_PATTERN.

@maziyarpanahi
Contributor

@maziyarpanahi I will test how well it performs. In this particular case, the HashSet will not grow much: at most it will hold one entry per exception, and only for exceptions that are not affected by the BREAK_PATTERN.

That's a great point! But let's make this as ridiculous as possible, since users might come with something like 100k lines of exceptions on large pages taken from OCR, without any sentence splitting.

@maziyarpanahi maziyarpanahi changed the base branch from release/343-release-candidate to master April 13, 2022 13:01
@maziyarpanahi maziyarpanahi marked this pull request as ready for review April 13, 2022 13:01
@maziyarpanahi maziyarpanahi changed the base branch from master to release/344-release-candidate April 13, 2022 13:02
@DevinTDHa
Member Author

DevinTDHa commented Apr 14, 2022

@maziyarpanahi I looked at it in the following ways:

I used free samples from the COCA Dataset as a test. In total these are about 500k sentences and 56 million characters.

As exceptions I chose something ridiculous like \w+, which will effectively match every word in a sentence. I also monitored the tests with VisualVM.

  • Tokenizing it the regular way causes no issues. The set doesn't grow beyond the vocabulary of a single line, and it only exists for the duration of each line of the file, keeping it small.
  • For the second test, I replaced all \n with \t in the sample data. This effectively results in one massive sentence, which is tokenized all at once. The collection grows to a fixed size equal to the vocabulary of the data (this only happens once). Processing the exceptions takes about 5 seconds on my machine (see the sketch after this list for the setup).
    • Note: The whole tokenization process doesn't terminate (at least I didn't wait long enough for it to). Consequently, the current implementation of the exceptions will perform even worse.
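A rough reconstruction of the second stress setup described above; the file path, Spark session configuration, and column names are assumptions:

```scala
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.Tokenizer
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("tokenizer-stress")
  .getOrCreate()

// Drop the sentence boundaries so the whole corpus becomes one massive "sentence".
val rawText = scala.io.Source.fromFile("coca_samples.txt").mkString
val oneGiantSentence = rawText.replace("\n", "\t")

val data = spark.createDataFrame(Seq(Tuple1(oneGiantSentence))).toDF("text")

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")          // tokenize the whole document, no sentence detector
  .setOutputCol("token")
  .setExceptions(Array("\\w+"))      // deliberately matches every word

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer))
pipeline.fit(data).transform(data).select("token").count()
```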

@maziyarpanahi maziyarpanahi merged commit a77fa66 into JohnSnowLabs:release/344-release-candidate Apr 24, 2022