grhoten · 2025-07-24T20:58:19Z

Fixes #63
Fixes #62
Fixes #61
Fixes #60
Fixes #58
Fixes #56
Fixes #55
Fixes #54
Fixes #53
Fixes #52
Fixes #51
Fixes #50

These changes transition the remainder of the languages from stub test data to lexical dictionaries based on Wikidata.

The oldest commit replaces the dictionaries.
The middle commit contains the code changes to consume and use the new lexical dictionaries.
The newest commit disables a few tests in Arabic and Hebrew until the data or code can be changed to pass the tests. It's also possible that the tests are bad, but that requires further review.

Here are some other highlights with these changes in addition to the data transition:

Marisa was upgraded to version 0.3.1
Catch2 was upgraded to version 3.9.0
Mild memory performance changes around static variables. This is mostly in the various GrammarSynthesizer implementations.
Use std::less<> for some string based sets and maps
Switch some map and set checks to use the contains member function, which is new in C++20.
Some static analyzer fixes, like renaming the 3 argument quantify method in CommonConceptFactory to quantifyFormatted so that it doesn't conflict with the 2 argument quantify method.
Various test updates to adapt to the current equivalent behavior. This frequently happened when there was more than 1 possible valid answer.
Remove support for versions of ICU prior to version 77.1
The dictionary-parser handles all languages now. It takes about 5 minutes to generate all of the lexical dictionaries.
The command to regenerate all migrated languages was ./ParseWikidata --all ~/Downloads/wikidata-20250716-lexemes.json. There were some warnings about the data, but the number of issues is small. Most of the issues involve unknown grammemes. The warnings indicate the following:
- The data needs fixing.
- The lexeme data contains linguistic data that is hard to figure out. In this case, Grammar.java should be changed to include the data.
- The lexeme is not actually a valid word to include. The lexeme should be considered for deletion, or omitted with filter.properties files.
Some entries in filter.properties for a given language omit certain lexemes. Typically this was done because I either couldn't figure out what to do with the conflicting entry, or the lexeme creator was being too pedantic. One bad example is the letter "i", which can have the plural "is", which conflicts with the verb "is". The word "is" is singular as a verb, and the plural form of it is "are". I probably could mark the "i" entry as rare, but I decided to just omit it and be done quickly.
Some bugs involving merging of lexemes were fixed in dictionary-parser.
Some documentation and spelling fixes.
If there is a git LFS issue with parsing a lexical dictionary, it will fail with an error.
Other small bug fixes.

nciric

But the CI is complaining about mem leaks and some tests.

grhoten · 2025-07-25T18:02:48Z

@nciric My current theory is that git LFS automation is having issues with downloading these new files.

Inflection-62 Integrate ar Wikidata into Unicode Inflection
Inflection-61 Integrate he Wikidata into Unicode Inflection
Inflection-60 Integrate hi Wikidata into Unicode Inflection
Inflection-58 Integrate nb Wikidata into Unicode Inflection
Inflection-56 Integrate nl Wikidata into Unicode Inflection
Inflection-55 Integrate tr Wikidata into Unicode Inflection
Inflection-54 Integrate ru Wikidata into Unicode Inflection
Inflection-53 Integrate it Wikidata into Unicode Inflection
Inflection-52 Integrate pt Wikidata into Unicode Inflection
Inflection-51 Integrate fr Wikidata into Unicode Inflection
Inflection-50 Integrate de Wikidata into Unicode Inflection


grhoten requested a review from nciric July 24, 2025 21:01

grhoten force-pushed the main branch from 29e997f to 6849342 Compare July 25, 2025 07:33

nciric approved these changes Jul 25, 2025


grhoten force-pushed the main branch from 6849342 to bf3911b Compare July 25, 2025 18:01

grhoten added 3 commits July 25, 2025 21:33

Temporarily disable tests till the data from Wikidata is fixed or the underlying issues are resolved

fcec242

grhoten force-pushed the main branch 2 times, most recently from c8cd26b to fcec242 Compare July 26, 2025 05:02
