@grhoten grhoten commented Mar 2, 2025

Resolves #47

This transitions the English lexical data to use Wikidata. There were two main obstacles to adopting the English data.

  1. The --expand-grammemes option was added to complete the inflection table for verbs. This was the most expedient way to reduce the risk of inflecting verbs incorrectly. It added 2520 bytes to the size of the data, which is a small increase.
  2. The weird non-currency form of yen (L15388) had to be sorted last, so a list of words that are rare or should be omitted is now located in filter_en.properties.

The current size of the uncompressed lexical dictionary is about 528 KB. There are likely additional ways to reduce the size of the lexical dictionary. Those methods include:

  • Fix accidentally irregular inflection patterns so that they match the other inflection patterns. Perhaps an inflection is missing or has an unintentional form.
  • Filter out irrelevant words. This can be done either by ignoring specific types, like proper-noun, or by listing specific words in filter_en.properties, though the latter can be laborious.
  • Ignore irrelevant properties.
  • If there are phrases in Wikidata, then make sure that they're properly marked in Wikidata as a phrase, like noun phrase, verb phrase and so forth.
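
As a rough illustration of the first idea in the list above, the following sketch groups lemmas by the suffixes their forms take and reports patterns that very few lemmas share. It is not the method used by the tooling in this PR: the `SAMPLE` data, the grammeme labels, and the function names `pattern_signature` and `rare_patterns` are all hypothetical, and the suffix-based signature is an assumption made for the example.

```python
import os
from collections import defaultdict


def pattern_signature(lemma, forms):
    """Return a hashable description of how a lemma inflects.

    `forms` maps a grammeme label (e.g. "plural") to a surface form.
    The signature pairs each label with the suffix left after removing
    the prefix shared by the lemma and all of its forms.
    """
    words = [lemma, *forms.values()]
    stem_len = len(os.path.commonprefix(words))
    return tuple(sorted((label, word[stem_len:]) for label, word in forms.items()))


def rare_patterns(lexicon, max_lemmas=1):
    """Return the patterns that are used by at most `max_lemmas` lemmas."""
    by_pattern = defaultdict(list)
    for lemma, forms in lexicon.items():
        by_pattern[pattern_signature(lemma, forms)].append(lemma)
    return {sig: lemmas for sig, lemmas in by_pattern.items() if len(lemmas) <= max_lemmas}


# Hypothetical sample data, not taken from the Wikidata dump.
SAMPLE = {
    "cat": {"singular": "cat", "plural": "cats"},
    "dog": {"singular": "dog", "plural": "dogs"},
    "ox": {"singular": "ox", "plural": "oxen"},
    "cactus": {"singular": "cactus"},  # plural form missing
}

for signature, lemmas in rare_patterns(SAMPLE).items():
    print(lemmas, signature)
```

In this sample, ox and cactus are reported. Only cactus is an accidental irregularity (a missing plural), so a report like this is a list of candidates to review, not a list of errors.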

Here is the summary of the data:

==============================================
                       Source: wikidata-20250226-lexemes.json 
                  Lemma terms:   69127
         Unusable lemma terms:    1330
       Incoming surface forms:  143271
                Surface forms:  111504
      Collapsed surface forms:   57546 (40.2%)
       Unusable surface forms:     461
                 Usable terms:  111502  (100%)
           Unclassified terms:       2    (0%)
==============================================
Alternate:
    spelling:                394  (0.4%)

Animacy:
    inanimate:                 5    (0%)

Aspect:
    simple:                24482   (22%)
    perfective:                4    (0%)

ComparisonDegree:
    positive:              12892 (11.6%)
    superlative:            1617  (1.5%)
    comparative:            1603  (1.4%)

Count:
    uncountable:              97  (0.1%)

Definiteness:
    indefinite:               10    (0%)
    demonstrative:             2    (0%)
    definite:                  1    (0%)

Gender:
    masculine:                 7    (0%)
    feminine:                  4    (0%)
    neuter:                    1    (0%)

Mood:
    indicative:               47    (0%)
    subjunctive:               2    (0%)
    imperative:                1    (0%)

Number:
    singular:              46209 (41.4%)
    plural:                39584 (35.5%)

PartOfSpeech:
    noun:                  61540 (55.2%)
    verb:                  32874 (29.5%)
    adjective:             16297 (14.6%)
    adverb:                11057  (9.9%)
    proper-noun:            3886  (3.5%)
    interjection:            304  (0.3%)
    adposition:              151  (0.1%)
    pronoun:                 149  (0.1%)
    conjunction:              76  (0.1%)
    numeral:                  63  (0.1%)
    determiner:               49    (0%)
    interrogative:            11    (0%)
    article:                   3    (0%)

Person:
    third:                 16346 (14.7%)
    second:                 8200  (7.4%)
    first:                  8171  (7.3%)

Polarity:
    negative:                 62  (0.1%)

Register:
    pejorative:                9    (0%)

Sizeness:
    diminutive:                2    (0%)

Sound:
    consonant-start:         852  (0.8%)
    vowel-start:             840  (0.8%)

Tense:
    present:               24537   (22%)
    past:                   8431  (7.6%)
    future:                    2    (0%)

VerbType:
    participle:            16358 (14.7%)
    infinitive:                5    (0%)

processed in 15.12 seconds
License: Creative Commons CC0 License (https://creativecommons.org/publicdomain/zero/1.0/)
generated with options: --language en --add-sound consonant-start,vowel-start --add-extra-grammemes vowelConsonantStartData_en.lst --inflection-types noun,verb,determiner --ignore-entries-with-grammemes abbreviation --ignore-entries-with-grammemes genitive --ignore-entries-with-grammemes Q4335462 --ignore-property particle --ignore-property vocative --ignore-property oblique --ignore-property nominative --ignore-property countable --add-sound consonant-start,vowel-start --expand-grammemes verb,present,simple:first,second,singular,plural --expand-grammemes verb,present,simple:third,plural
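
The percentages in the summary above can be sanity-checked with a few lines of arithmetic. The sketch below assumes, based only on the printed numbers, that the collapsed percentage is relative to the incoming surface forms and that the grammeme percentages are relative to the surface form count. The helper name `pct` is hypothetical and is not part of the tool that produced the summary.

```python
def pct(count: int, total: int) -> str:
    """Format count/total as a percentage with at most one decimal place."""
    value = round(100.0 * count / total, 1)
    # Whole values are printed without a trailing ".0", as in "22%".
    return f"{value:g}%"


INCOMING_SURFACE_FORMS = 143271
SURFACE_FORMS = 111504

print("collapsed:", pct(57546, INCOMING_SURFACE_FORMS))
print("noun:     ", pct(61540, SURFACE_FORMS))
print("present:  ", pct(24537, SURFACE_FORMS))
print("usable:   ", pct(111502, SURFACE_FORMS))
```

Under these assumptions the values match the summary. The PartOfSpeech percentages add up to more than 100%, which would be consistent with a single surface form carrying more than one part of speech, although the summary itself does not state this.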

@grhoten grhoten requested a review from nciric March 2, 2025 06:27
@grhoten grhoten merged commit 873905e into unicode-org:main Mar 3, 2025
2 checks passed

Successfully merging this pull request may close these issues.

Integrate en Wikidata into Unicode Inflection
