
Only limited and biased selection of words with Natural Words #555

@trapicki


Observations

When selecting "Prefer natural words" in the settings, the generated list of words has a few peculiarities:

  1. Some words appear very often, some not at all. Er, Es, Baby are never generated in German.
  2. Words with fewer than 3 characters do not appear at all.
  3. For rare letters, even if the dictionary has matching words (e.g. German y), the word list seems to be generated from phonemes instead. The German word list in packages/keybr-generators/dictionaries/dictionary-de.csv.gz contains the following words with y:
    Typen, Zylinder, Babys, Sympathie, Ladys, typisch, Gymnasium, Psychiater, Loyalität, Rhythmus, Partys, Symptome, Bombay, Analyse, Systeme, Handys, Gymnasiast, Physiognomie, Hobby
    The generated words when the letters up to 'y' are enabled (ENTIRLHASCUGMOFDKBPZÄÜẞWÖVY) could look like:
    physion zylin loaylist zylin loyalib physion loyalig gymnache typisch.... and so on.
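
One way to confirm that the generated items are pseudo-words rather than dictionary entries is to check them against the word list. The following is a minimal sketch using only the words quoted above; `notInDictionary` is a hypothetical helper, not part of keybr.com.

```typescript
// Words containing "y" from the German dictionary, as listed above.
const dictionaryWords: string[] = [
  "Typen", "Zylinder", "Babys", "Sympathie", "Ladys", "typisch", "Gymnasium",
  "Psychiater", "Loyalität", "Rhythmus", "Partys", "Symptome", "Bombay",
  "Analyse", "Systeme", "Handys", "Gymnasiast", "Physiognomie", "Hobby",
];

// Sample of generated lesson words, as listed above.
const generated: string[] = [
  "physion", "zylin", "loaylist", "zylin", "loyalib",
  "physion", "loyalig", "gymnache", "typisch",
];

// Hypothetical helper: returns the distinct words that have no
// case-insensitive match in the dictionary.
function notInDictionary(words: string[], dictionary: string[]): string[] {
  const known = dictionary.map((w) => w.toLowerCase());
  return words.filter(
    (w, i) => known.indexOf(w.toLowerCase()) === -1 && words.indexOf(w) === i,
  );
}

const pseudoWords = notInDictionary(generated, dictionaryWords);
console.log(pseudoWords);
```

In this sample only "typisch" is a real dictionary word; everything else is reported as a pseudo-word.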

Possible reasons

Looking into https://github.com/aradzie/keybr.com/blob/72d74576f4921953cce1bf461192f5f5e3e8c612/packages/keybr-lesson/lib/guided.ts during runtime, I noticed:

  1. In
    filterWordList(wordList, this.codePoints).filter(
    (word) => word.length > 2,
    ),
    the word dictionary loses all words shorter than 3 letters. I do not understand why this criterion is necessary or beneficial, as it disables words like "he", "it", "I", "er", "es", to name a few from English and German.
  2. In #makeWordGenerator at
    const words = this.dictionary.find(filter).slice(0, 1000);
    , the dictionary has a peculiar property: the dictionary from
    export class Dictionary implements Iterable<string> {
    has 57 members in the Map property. That means that words are indexed by uppercase and lowercase letters separately. The German word list https://github.com/aradzie/keybr.com/blob/4df355907a220e82ca457828a6eca20da4c706d5/packages/keybr-content-words/lib/data/words-de.json contains both capitalized and lowercase words, but apparently not the same word in both variants. All German nouns are capitalized, as is every word at the beginning of a sentence. Capitalized words are never considered at all, even when activating >0% uppercase words in the settings. When I execute this.dictionary.find(filter) in #makeWordGenerator, where filter takes the starting settings (include "entirl", focus "e"), it produces:
'ein', 'eine', 'einen', 'einer', 'nie', 'rein', 'nett', 'erinnere', 'nennt', 'leer', 'erinnerte', 'erinnert', 'inneren',
'irre', 'eilte', 'netter', 'nette', 'reine', 'eintreten', 'leere', 'reinen', 'lernte', 'nenne', 'innen', 'eilt', 'lernt',
'teilte', 'erteilte', 'leerte', 'trennte', 'einerlei', 'rette', 'rettet', 'lerne', 'eilen', 'erlernen', 'inne', 'netten', 
'rennt', 'eilten', 'trennt', 'leitet', 'eitel', 'entrinnen', 'erlitten', 'erteilt', 'trete', 'teilt', 'leite', 'rettete', 
'reitet', 'erlitt', 'innerer', 'irrte', 'reite', 'lernten'
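
The combined effect of the two points above can be reproduced with a small sketch. This is not the actual keybr.com code: `filterWords` and its `minLength` and `foldCase` options are hypothetical names, and the sketch assumes the filter keeps a word only if every letter is in the allowed set.

```typescript
// Hypothetical options, not the actual keybr.com API.
interface FilterOptions {
  minLength: number; // shortest word length to keep
  foldCase: boolean; // compare letters case-insensitively when true
}

// Hypothetical helper: keeps words that are long enough and consist
// only of letters from the allowed set.
function filterWords(
  words: string[],
  allowed: string,
  options: FilterOptions,
): string[] {
  return words.filter((word) => {
    if (word.length < options.minLength) {
      return false;
    }
    const candidate = options.foldCase ? word.toLowerCase() : word;
    // Keep the word only if every letter is in the allowed set.
    return candidate.split("").every((ch) => allowed.indexOf(ch) !== -1);
  });
}

const sample = ["in", "er", "ein", "Nein", "Eltern", "nett", "Tee"];

// Mirrors the observed behavior: length > 2 and case-sensitive matching.
const strict = filterWords(sample, "entirl", { minLength: 3, foldCase: false });

// A possible alternative: no length limit and case-insensitive matching.
const relaxed = filterWords(sample, "entirl", { minLength: 1, foldCase: true });

console.log(strict);
console.log(relaxed);
```

With the strict settings only "ein" and "nett" survive from this sample, while the relaxed settings keep all seven words.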

When I filter the word list with the regular expression "[entirl]+", matched case-insensitively, I get:

in, er, ein, eine, Nein, einen, einer, nie, rein, Eltern, nett, Teil, erinnere, nennt, Eile, Tiere, leer, Tee,
 erinnerte, erinnert, Eier, Innern, inneren, irre, Tier, Ei, Eintritt, Teile, Ernte, eilte, netter, nette, litt,
 reine, eintreten, leere, reinen, Rennen, Linie, lernte, nenne, Titel, innen, Lilien, Ritter, Teint, Nenn, 
eilt, Leiter, lernt, Tritt, teilte, Erinnern, Leine, erteilte, leerte, trennte, einerlei, Nennen, rette, rettet,
 lerne, eilen, erlernen, inne, netten, rennt, Lettern, Linien, Tieren, Tinte, eilten, trennt, Treten, leitet, 
Reiter, Innere, Lilie, Linnen, Rittern, Teilen, eitel, entrinnen, erlitten, erteilt, trete, teilt, Rente, Ente, 
leite, Irrer, rettete, reitet, Liter, erlitt, innerer, irrte, reite, Eiern, Irren, Lernen, lernten, et, Renn, Eil

The difference is:

 Titel, Lilien, Ritter, Teint, Nenn, Leiter, Tritt, Erinnern, Leine, Nennen, Lettern, Linien, Tieren, Tinte,
 Treten, Reiter, Innere, Lilie, Linnen, Rittern, Teilen, Rente, Ente, Irrer, Liter, Eiern, Irren, Lernen, et, 
Renn, Eil

That is, each of these words is either shorter than 3 characters or starts with an uppercase letter.
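
The comparison can be automated with a short sketch that computes the difference between the two lists and labels each missing word with the criterion that excludes it. The arrays below are small excerpts of the lists above, and `explainMissing` is a hypothetical helper.

```typescript
// Excerpt of the regex-filtered word list shown above.
const regexMatches = [
  "in", "er", "ein", "eine", "Nein", "nie", "Eltern", "nett", "Teil", "Ei",
];

// Excerpt of the this.dictionary.find(filter) output shown above.
const found = ["ein", "eine", "nie", "nett"];

// Hypothetical helper: lists the criteria that would exclude a word.
function explainMissing(word: string): string[] {
  const reasons: string[] = [];
  if (word.length < 3) {
    reasons.push("short");
  }
  const first = word.charAt(0);
  if (first !== first.toLowerCase()) {
    reasons.push("capitalized");
  }
  return reasons;
}

// Words matched by the regex but absent from the find() output.
const missing = regexMatches.filter((w) => found.indexOf(w) === -1);

missing.forEach((w) => {
  console.log(w + ": " + explainMissing(w).join(", "));
});
```

An empty reason list would flag a word that neither criterion explains.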

Note: At first I also thought that longer words were somehow filtered out. This seems to be a result of the natural distribution of word lengths instead. Most German words, when not weighted by occurrence, are between 4 and 8 characters long. From length 10 on, the number of words at least halves with each additional letter. There are 20 words of length 2 in the list, though!
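
A word-length distribution like the one described in the note can be counted with a few lines. This is only a sketch: `lengthHistogram` is a hypothetical helper, and the input here is a short excerpt rather than the full word list, whose loading is omitted.

```typescript
// Hypothetical helper: counts how many words exist for each word length.
function lengthHistogram(words: string[]): { [length: number]: number } {
  const counts: { [length: number]: number } = {};
  for (const word of words) {
    counts[word.length] = (counts[word.length] || 0) + 1;
  }
  return counts;
}

// Short excerpt of the list above; the full dictionary would go here.
const excerpt = ["in", "er", "ein", "eine", "einen", "einer", "nie", "rein"];
const histogram = lengthHistogram(excerpt);

// Print the counts in order of increasing word length.
Object.keys(histogram)
  .map(Number)
  .sort((a, b) => a - b)
  .forEach((length) => {
    console.log(length + ": " + histogram[length]);
  });
```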
