Observations
When selecting "Prefer natural words" in the settings, the generated list of words has a few peculiarities:
- Some words appear very often, some not at all. `Er`, `Es`, `Baby` are never generated in German.
- Words with fewer than 3 characters do not appear at all.
- For rare letters, even if the dictionary has words (e.g. German y), the generated word list seems to be built from phonemes. The German word list in `packages/keybr-generators/dictionaries/dictionary-de.csv.gz` contains the following words with y:
Typen, Zylinder, Babys, Sympathie, Ladys, typisch, Gymnasium, Psychiater, Loyalität, Rhythmus, Partys, Symptome, Bombay, Analyse, Systeme, Handys, Gymnasiast, Physiognomie, Hobby
The generated words when the letters up to 'y' are enabled (ENTIRLHASCUGMOFDKBPZÄÜẞWÖVY) could look like:
physion zylin loaylist zylin loyalib physion loyalig gymnache typisch ... and so on.
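As a quick sanity check, the following sketch tests which of the dictionary words listed above can be typed with the enabled letters. The helper `isTypeable` is hypothetical and not part of keybr; the case-sensitive variant only models the assumption that words are matched against lowercase letters, it is not a claim about how keybr actually matches them.

```typescript
// Letters enabled in the lesson (from the list above), lowercased for comparison.
const enabled = new Set("ENTIRLHASCUGMOFDKBPZÄÜẞWÖVY".toLowerCase().split(""));

// Words containing "y" from the German dictionary, as quoted above.
const wordsWithY = [
  "Typen", "Zylinder", "Babys", "Sympathie", "Ladys", "typisch", "Gymnasium",
  "Psychiater", "Loyalität", "Rhythmus", "Partys", "Symptome", "Bombay",
  "Analyse", "Systeme", "Handys", "Gymnasiast", "Physiognomie", "Hobby",
];

// Hypothetical helper: true if every character of `word` is in `letters`.
function isTypeable(word: string, letters: Set<string>): boolean {
  return word.split("").every((ch) => letters.has(ch));
}

// Assumption: a case-sensitive match against lowercase letters rejects
// every capitalized word.
const caseSensitive = wordsWithY.filter((w) => isTypeable(w, enabled));

// A case-insensitive match lowercases the word first.
const caseInsensitive = wordsWithY.filter((w) =>
  isTypeable(w.toLowerCase(), enabled),
);

console.log(caseSensitive);          // only "typisch"
console.log(caseInsensitive.length); // 19
```

Under that assumption only "typisch" survives, which would be consistent with it being the only real word in the generated sample above.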
Possible reasons
Looking into https://github.com/aradzie/keybr.com/blob/72d74576f4921953cce1bf461192f5f5e3e8c612/packages/keybr-lesson/lib/guided.ts during runtime, I noticed:
- The word dictionary loses all words shorter than 3 letters. The criterion is the `word.length > 2` filter quoted below. I do not understand why this is necessary or beneficial, as it disables words like "he", "it", "I", "er", "es", to name a few from English and German.

  keybr.com/packages/keybr-lesson/lib/guided.ts, lines 31 to 33 in 72d7457:

      filterWordList(wordList, this.codePoints).filter(
        (word) => word.length > 2,
      ),

- In `#makeWordGenerator`, the dictionary has some peculiar property:

  keybr.com/packages/keybr-lesson/lib/guided.ts, line 146 in 72d7457:

      const words = this.dictionary.find(filter).slice(0, 1000);

  The `dictionary` (from `export class Dictionary implements Iterable<string> {`) has 57 members in its `Map` property. That means that words are indexed by uppercase and lowercase letters separately. The German word list https://github.com/aradzie/keybr.com/blob/4df355907a220e82ca457828a6eca20da4c706d5/packages/keybr-content-words/lib/data/words-de.json contains words in upper and lower case, but it seems not the same word in both variants. All nouns in German are capitalized, and so is every word at the beginning of a sentence. Capitalized words are never considered at all, even when activating >0% uppercase words in the settings.

  When I execute `this.dictionary.find(filter)` in `#makeWordGenerator`, where `filter` takes the starting setting (include "entirl", focus "e"), it produces:
'ein', 'eine', 'einen', 'einer', 'nie', 'rein', 'nett', 'erinnere', 'nennt', 'leer', 'erinnerte', 'erinnert', 'inneren',
'irre', 'eilte', 'netter', 'nette', 'reine', 'eintreten', 'leere', 'reinen', 'lernte', 'nenne', 'innen', 'eilt', 'lernt',
'teilte', 'erteilte', 'leerte', 'trennte', 'einerlei', 'rette', 'rettet', 'lerne', 'eilen', 'erlernen', 'inne', 'netten',
'rennt', 'eilten', 'trennt', 'leitet', 'eitel', 'entrinnen', 'erlitten', 'erteilt', 'trete', 'teilt', 'leite', 'rettete',
'reitet', 'erlitt', 'innerer', 'irrte', 'reite', 'lernten'
When I filter the word list to only include words matching the regular expression "[entirl]+", I get:
in, er, ein, eine, Nein, einen, einer, nie, rein, Eltern, nett, Teil, erinnere, nennt, Eile, Tiere, leer, Tee,
erinnerte, erinnert, Eier, Innern, inneren, irre, Tier, Ei, Eintritt, Teile, Ernte, eilte, netter, nette, litt,
reine, eintreten, leere, reinen, Rennen, Linie, lernte, nenne, Titel, innen, Lilien, Ritter, Teint, Nenn,
eilt, Leiter, lernt, Tritt, teilte, Erinnern, Leine, erteilte, leerte, trennte, einerlei, Nennen, rette, rettet,
lerne, eilen, erlernen, inne, netten, rennt, Lettern, Linien, Tieren, Tinte, eilten, trennt, Treten, leitet,
Reiter, Innere, Lilie, Linnen, Rittern, Teilen, eitel, entrinnen, erlitten, erteilt, trete, teilt, Rente, Ente,
leite, Irrer, rettete, reitet, Liter, erlitt, innerer, irrte, reite, Eiern, Irren, Lernen, lernten, et, Renn, Eil
The difference is:
Titel, Lilien, Ritter, Teint, Nenn, Leiter, Tritt, Erinnern, Leine, Nennen, Lettern, Linien, Tieren, Tinte,
Treten, Reiter, Innere, Lilie, Linnen, Rittern, Teilen, Rente, Ente, Irrer, Liter, Eiern, Irren, Lernen, et,
Renn, Eil
that is, length < 3 or uppercase first letter.
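The comparison above can be sketched roughly as follows. The sample is a handful of words taken from the lists above plus one non-matching word, not the full dictionary, and `dropReason` is a hypothetical helper that encodes the two suspected rules (length below 3, capitalized first letter). It is not keybr's actual code.

```typescript
// Small sample from the lists above; "Haus" is added as a non-matching word.
const sample = [
  "in", "er", "ein", "eine", "Nein", "nie", "Eltern",
  "nett", "Teil", "Ei", "Titel", "et", "Haus",
];

// Words made only of the letters e, n, t, i, r, l, ignoring case.
const onlyEntirl = /^[entirl]+$/i;
const matching = sample.filter((w) => onlyEntirl.test(w));

// Hypothetical helper: why a word would be dropped under the suspected rules,
// or null if it would be kept.
function dropReason(word: string): string | null {
  if (word.length < 3) return "shorter than 3 letters";
  if (word[0] !== word[0].toLowerCase()) return "capitalized";
  return null;
}

const kept = matching.filter((w) => dropReason(w) === null);
const dropped = matching.filter((w) => dropReason(w) !== null);

console.log(kept);    // words that survive both rules
console.log(dropped.map((w) => `${w}: ${dropReason(w)}`));
```

For this sample, the kept words are "ein", "eine", "nie" and "nett", all of which also appear in the `find(filter)` output above.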
Note: At first I also thought that longer words were somehow filtered out. This seems to be a result of the natural distribution of word lengths instead. Most German words, when not weighted by occurrence, are between 4 and 8 characters long. From length 10 on, the number of words of each length at least halves with every additional letter. There are 20 words of length 2 in the list, though!
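A length distribution like the one described in the note can be computed with a small histogram. This is only a sketch: `lengthHistogram` is a hypothetical helper, and the input is a few words from the lists above rather than the real word list.

```typescript
// Hypothetical helper: count how many words there are of each length.
// Uses UTF-16 length, which is sufficient for German words.
function lengthHistogram(words: string[]): Map<number, number> {
  const counts = new Map<number, number>();
  for (const word of words) {
    counts.set(word.length, (counts.get(word.length) ?? 0) + 1);
  }
  return counts;
}

// A few words from the lists above, standing in for the full dictionary.
const histogram = lengthHistogram(["in", "er", "ein", "eine", "nie", "rein"]);

histogram.forEach((count, length) => {
  console.log(`length ${length}: ${count} word(s)`);
});
```

Running the same function over the full word list would show directly how many two-letter words the `word.length > 2` filter removes.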