Inflection-47 Integrate en Wikidata into Unicode Inflection #83

grhoten · 2025-03-02T06:23:13Z

Resolves #47

This transitions the English lexical data to use Wikidata. There were two main obstacles to adopting it for English:

The --expand-grammemes option was added to complete the inflection table for verbs. This was the most expedient way to reduce the risk of inflecting verbs incorrectly. It added 2520 bytes to the size of the data, which is a small increase.
The weird non-currency form of yen (L15388) had to be sorted last, so a list of words that are rare or should be omitted is now located in filter_en.properties.
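
To illustrate the idea behind the two --expand-grammemes values used in this PR, here is a minimal sketch. It assumes one plausible reading of the option: the part before the colon is a set of grammemes a form must already carry, and the part after it is added to produce a more complete form. The function names, the rule that forms already carrying a person grammeme are skipped, and the data layout are all assumptions for illustration, not the tool's actual implementation.

```python
# Hypothetical sketch: the rule semantics are an assumption, not the real tool's behavior.
PERSON = frozenset({"first", "second", "third"})


def parse_rule(option_value):
    """Split 'a,b,c:x,y' into (grammemes to match, grammemes to add)."""
    match, additions = option_value.split(":")
    return frozenset(match.split(",")), frozenset(additions.split(","))


def expand(grammemes, rules):
    """Return one expanded grammeme set per matching rule.

    Assumed behavior: a form that already carries a person grammeme is
    considered complete and is returned unchanged.
    """
    original = frozenset(grammemes)
    expanded = [
        original | additions
        for match, additions in rules
        if match <= original and not (original & PERSON)
    ]
    return expanded or [original]


# The two option values come from the "generated with options" line of this PR.
rules = [
    parse_rule("verb,present,simple:first,second,singular,plural"),
    parse_rule("verb,present,simple:third,plural"),
]

# A bare simple-present form such as "walk" gains person and number grammemes.
base_form = expand({"verb", "present", "simple"}, rules)

# A form such as "walks" already has a person, so it is left alone.
third_singular = expand({"verb", "present", "simple", "third", "singular"}, rules)
```

Under this reading, a single simple-present form ends up covering every person and number combination except third-person singular, which Wikidata typically lists as its own form.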

The current size of the uncompressed lexical dictionary is about 528 KB. There are likely additional ways to reduce the size of the lexical dictionary. Those methods include:

Fix accidentally irregular inflection patterns so that they match the other inflection patterns. Perhaps an inflection is missing or has an unintended form.
Filter out irrelevant words. This can be done either by ignoring specific types, like proper-noun, or by listing specific words in filter_en.properties, although listing words individually can be laborious.
Ignore irrelevant properties.
If there are phrases in Wikidata, then make sure that they're properly marked in Wikidata as a phrase, like noun phrase, verb phrase and so forth.
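
The two filtering approaches in the list above can be sketched as follows. The entry layout, the names, and the placeholder lemma are hypothetical; the real format of filter_en.properties is not shown in this PR, so the filter list is modeled as a plain set.

```python
# Hypothetical sketch: entry layout and names are assumptions for illustration.
IGNORED_TYPES = {"proper-noun"}   # whole categories to drop
FILTERED_LEMMAS = {"rareword"}    # placeholder for entries listed in a filter file


def keep(entry, ignored_types=IGNORED_TYPES, filtered_lemmas=FILTERED_LEMMAS):
    """Return True if the entry should stay in the lexical dictionary."""
    if entry["lemma"] in filtered_lemmas:
        return False
    # Drop the entry if any of its types is in the ignored set.
    return not (set(entry["types"]) & ignored_types)


entries = [
    {"lemma": "walk", "types": ["verb"]},
    {"lemma": "London", "types": ["proper-noun"]},
    {"lemma": "rareword", "types": ["noun"]},
]
kept = [e["lemma"] for e in entries if keep(e)]
```

Ignoring a type removes many entries with one setting, while the per-word list gives precise control at the cost of manual curation.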

Here is the summary of the data:

==============================================
                       Source: wikidata-20250226-lexemes.json 
                  Lemma terms:   69127
         Unusable lemma terms:    1330
       Incoming surface forms:  143271
                Surface forms:  111504
      Collapsed surface forms:   57546 (40.2%)
       Unusable surface forms:     461
                 Usable terms:  111502  (100%)
           Unclassified terms:       2    (0%)
==============================================
Alternate:
    spelling:                394  (0.4%)

Animacy:
    inanimate:                 5    (0%)

Aspect:
    simple:                24482   (22%)
    perfective:                4    (0%)

ComparisonDegree:
    positive:              12892 (11.6%)
    superlative:            1617  (1.5%)
    comparative:            1603  (1.4%)

Count:
    uncountable:              97  (0.1%)

Definiteness:
    indefinite:               10    (0%)
    demonstrative:             2    (0%)
    definite:                  1    (0%)

Gender:
    masculine:                 7    (0%)
    feminine:                  4    (0%)
    neuter:                    1    (0%)

Mood:
    indicative:               47    (0%)
    subjunctive:               2    (0%)
    imperative:                1    (0%)

Number:
    singular:              46209 (41.4%)
    plural:                39584 (35.5%)

PartOfSpeech:
    noun:                  61540 (55.2%)
    verb:                  32874 (29.5%)
    adjective:             16297 (14.6%)
    adverb:                11057  (9.9%)
    proper-noun:            3886  (3.5%)
    interjection:            304  (0.3%)
    adposition:              151  (0.1%)
    pronoun:                 149  (0.1%)
    conjunction:              76  (0.1%)
    numeral:                  63  (0.1%)
    determiner:               49    (0%)
    interrogative:            11    (0%)
    article:                   3    (0%)

Person:
    third:                 16346 (14.7%)
    second:                 8200  (7.4%)
    first:                  8171  (7.3%)

Polarity:
    negative:                 62  (0.1%)

Register:
    pejorative:                9    (0%)

Sizeness:
    diminutive:                2    (0%)

Sound:
    consonant-start:         852  (0.8%)
    vowel-start:             840  (0.8%)

Tense:
    present:               24537   (22%)
    past:                   8431  (7.6%)
    future:                    2    (0%)

VerbType:
    participle:            16358 (14.7%)
    infinitive:                5    (0%)

processed in 15.12 seconds
License: Creative Commons CC0 License (https://creativecommons.org/publicdomain/zero/1.0/)
generated with options: --language en --add-sound consonant-start,vowel-start --add-extra-grammemes vowelConsonantStartData_en.lst --inflection-types noun,verb,determiner --ignore-entries-with-grammemes abbreviation --ignore-entries-with-grammemes genitive --ignore-entries-with-grammemes Q4335462 --ignore-property particle --ignore-property vocative --ignore-property oblique --ignore-property nominative --ignore-property countable --add-sound consonant-start,vowel-start --expand-grammemes verb,present,simple:first,second,singular,plural --expand-grammemes verb,present,simple:third,plural
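
As a sanity check on the summary above, the percentages can be recomputed from the raw counts. The denominators are an inference from the numbers rather than something stated in the tool output: the collapsed percentage appears to be relative to the incoming surface forms, and the per-grammeme percentages appear to be relative to the surface form count.

```python
# Counts copied from the summary; the choice of denominators is an assumption.
incoming_surface_forms = 143271
surface_forms = 111504
collapsed_surface_forms = 57546
usable_terms = 111502
unclassified_terms = 2


def pct(count, total):
    """Percentage rounded to one decimal place, as in the report."""
    return round(100 * count / total, 1)


collapsed_pct = pct(collapsed_surface_forms, incoming_surface_forms)

part_of_speech = {"noun": 61540, "verb": 32874, "adjective": 16297, "adverb": 11057}
pos_pct = {name: pct(count, surface_forms) for name, count in part_of_speech.items()}

# Usable and unclassified terms together account for every surface form.
accounted_for = usable_terms + unclassified_terms
```

The part-of-speech counts sum to more than the surface form count, which suggests that a single collapsed surface form can carry more than one part of speech.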

Inflection-47 Integrate en Wikidata into Unicode Inflection

733f416

grhoten requested a review from nciric March 2, 2025 06:27

nciric approved these changes Mar 3, 2025

grhoten merged commit 873905e into unicode-org:main Mar 3, 2025
2 checks passed
