Only limited and biased selection of words with Natural Words #555

## Observations
When selecting "Prefer natural words" in the settings, the generated list of words has a few peculiarities:
 
1) Some words appear very often, others not at all. For example, `Er`, `Es`, and `Baby` are never generated in German.
2) Words with fewer than 3 characters do not appear at all.
3) For rare letters, even if the dictionary has words (e.g. German _y_), the generated word list seems to be built from phonemes instead. The German word list in `packages/keybr-generators/dictionaries/dictionary-de.csv.gz` contains the following words with _y_:
```Typen, Zylinder, Babys, Sympathie, Ladys, typisch, Gymnasium, Psychiater, Loyalität, Rhythmus, Partys, Symptome, Bombay, Analyse, Systeme, Handys, Gymnasiast, Physiognomie, Hobby```
The generated words when the letters up to 'y' are enabled (`ENTIRLHASCUGMOFDKBPZÄÜẞWÖVY`) could look like:
`physion zylin loaylist zylin loyalib physion loyalig gymnache typisch....` and so on.
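
A quick way to check how many of the listed _y_ words could be picked at all is to test them against the enabled letters. The following is a minimal sketch, not keybr code: `yWords`, `enabledLetters`, and `usesOnly` are hypothetical names, and the case-sensitive comparison against lowercase letters is an assumption about how the matching might behave.

```typescript
// Hypothetical check, not part of keybr: which of the listed words with "y"
// consist only of enabled letters?
const yWords: string[] = [
  "Typen", "Zylinder", "Babys", "Sympathie", "Ladys", "typisch", "Gymnasium",
  "Psychiater", "Loyalität", "Rhythmus", "Partys", "Symptome", "Bombay",
  "Analyse", "Systeme", "Handys", "Gymnasiast", "Physiognomie", "Hobby",
];

// Enabled letters as quoted in the issue, converted to lowercase.
const enabledLetters = new Set<string>(
  Array.from("ENTIRLHASCUGMOFDKBPZÄÜẞWÖVY".toLowerCase()),
);

// True if every character of the word is in the given letter set.
function usesOnly(word: string, letters: Set<string>): boolean {
  return Array.from(word).every((ch) => letters.has(ch));
}

// Assumption: words are compared case-sensitively against lowercase letters.
const yCaseSensitive = yWords.filter((w) => usesOnly(w, enabledLetters));

// For comparison: the same check with the word lowercased first.
const yCaseInsensitive = yWords.filter((w) =>
  usesOnly(w.toLowerCase(), enabledLetters),
);

console.log(yCaseSensitive); // [ 'typisch' ]
console.log(yCaseInsensitive.length); // 19
```

Under that assumption only `typisch` survives, which would be consistent with it being the only dictionary word in the sample output above.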

## Possible reasons
Looking into https://github.com/aradzie/keybr.com/blob/72d74576f4921953cce1bf461192f5f5e3e8c612/packages/keybr-lesson/lib/guided.ts during runtime, I noticed:
1) In https://github.com/aradzie/keybr.com/blob/72d74576f4921953cce1bf461192f5f5e3e8c612/packages/keybr-lesson/lib/guided.ts#L31-L33 the **word dictionary loses all words shorter than 3 letters**. The criterion is `word.length > 2`. I do not understand why this is necessary or beneficial, as it disables words like "he", "it", "I", "er", "es", to name a few from English and German.
2) In `#makeWordGenerator` at https://github.com/aradzie/keybr.com/blob/72d74576f4921953cce1bf461192f5f5e3e8c612/packages/keybr-lesson/lib/guided.ts#L146, the dictionary has a peculiar property: `dictionary` from https://github.com/aradzie/keybr.com/blob/72d74576f4921953cce1bf461192f5f5e3e8c612/packages/keybr-lesson/lib/dictionary.ts#L9 has 57 members in its `Map` property. That means that **words are indexed by _upper and lower case letters_ separately**. The German word list https://github.com/aradzie/keybr.com/blob/4df355907a220e82ca457828a6eca20da4c706d5/packages/keybr-content-words/lib/data/words-de.json contains words in upper and lower case, but apparently never the same word in both variants. All nouns in German are capitalized, and so is every word at the beginning of a sentence. **Uppercase words are never considered at all**, even when more than 0% uppercase words are enabled in the settings. When I execute `this.dictionary.find(filter)` in `#makeWordGenerator`, where `filter` uses the starting setting (include "`entirl`", focus "`e`"), it produces:
```
'ein', 'eine', 'einen', 'einer', 'nie', 'rein', 'nett', 'erinnere', 'nennt', 'leer', 'erinnerte', 'erinnert', 'inneren',
'irre', 'eilte', 'netter', 'nette', 'reine', 'eintreten', 'leere', 'reinen', 'lernte', 'nenne', 'innen', 'eilt', 'lernt',
'teilte', 'erteilte', 'leerte', 'trennte', 'einerlei', 'rette', 'rettet', 'lerne', 'eilen', 'erlernen', 'inne', 'netten', 
'rennt', 'eilten', 'trennt', 'leitet', 'eitel', 'entrinnen', 'erlitten', 'erteilt', 'trete', 'teilt', 'leite', 'rettete', 
'reitet', 'erlitt', 'innerer', 'irrte', 'reite', 'lernten'
```
When I instead filter the word list myself, keeping every word that consists entirely of the letters in `[entirl]+` regardless of case, I get:
```
in, er, ein, eine, Nein, einen, einer, nie, rein, Eltern, nett, Teil, erinnere, nennt, Eile, Tiere, leer, Tee,
 erinnerte, erinnert, Eier, Innern, inneren, irre, Tier, Ei, Eintritt, Teile, Ernte, eilte, netter, nette, litt,
 reine, eintreten, leere, reinen, Rennen, Linie, lernte, nenne, Titel, innen, Lilien, Ritter, Teint, Nenn, 
eilt, Leiter, lernt, Tritt, teilte, Erinnern, Leine, erteilte, leerte, trennte, einerlei, Nennen, rette, rettet,
 lerne, eilen, erlernen, inne, netten, rennt, Lettern, Linien, Tieren, Tinte, eilten, trennt, Treten, leitet, 
Reiter, Innere, Lilie, Linnen, Rittern, Teilen, eitel, entrinnen, erlitten, erteilt, trete, teilt, Rente, Ente, 
leite, Irrer, rettete, reitet, Liter, erlitt, innerer, irrte, reite, Eiern, Irren, Lernen, lernten, et, Renn, Eil
``` 
The difference is:
```
in, er, Nein, Eltern, Teil, Eile, Tiere, Tee, Eier, Innern, Tier, Ei, Eintritt, Teile, Ernte, litt, Rennen, Linie,
 Titel, Lilien, Ritter, Teint, Nenn, Leiter, Tritt, Erinnern, Leine, Nennen, Lettern, Linien, Tieren, Tinte,
 Treten, Reiter, Innere, Lilie, Linnen, Rittern, Teilen, Rente, Ente, Irrer, Liter, Eiern, Irren, Lernen, et, 
Renn, Eil
``` 
That is, almost every missing word either has fewer than 3 letters or starts with an uppercase letter.
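
The comparison above can be reproduced on a small sample. This is a sketch only, not the keybr implementation: `sampleWords`, `whyDropped`, and both regular expressions are hypothetical, and the case-sensitive pattern plus the `length > 2` check are assumptions that mimic the observed behavior.

```typescript
// Hypothetical reproduction on a few words taken from the lists above.
const sampleWords: string[] = [
  "in", "er", "ein", "eine", "Nein", "nie", "Eltern", "Teil", "Ei", "et", "leer",
];

// Assumption: the observed behavior corresponds to a case-sensitive match
// on lowercase letters combined with a minimum length of 3.
const lowerOnly = /^[entirl]+$/;
// What I would expect instead: the same letters, ignoring case.
const anyCase = /^[entirl]+$/i;

const observedWords = sampleWords.filter(
  (w) => lowerOnly.test(w) && w.length > 2,
);
const expectedWords = sampleWords.filter((w) => anyCase.test(w));

// Words that the expected filter keeps but the observed one drops.
const observedSet = new Set<string>(observedWords);
const droppedWords = expectedWords.filter((w) => !observedSet.has(w));

// Explain why a word was dropped under the assumptions above.
function whyDropped(word: string): string[] {
  const reasons: string[] = [];
  if (word.length <= 2) reasons.push("shorter than 3 letters");
  if (word !== word.toLowerCase()) reasons.push("contains uppercase");
  return reasons;
}

for (const word of droppedWords) {
  console.log(word, whyDropped(word).join(", "));
}
```

For this sample, `observedWords` is `ein, eine, nie, leer`, and every dropped word gets at least one of the two reasons.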

_Note: At first I also thought that longer words were somehow filtered out. This seems to be a result of the natural distribution of word lengths instead. Counted without weighting by frequency, most German words in the list are between 4 and 8 characters long. From length 10 on, the number of words per length at least halves with every additional letter. There are 20 words of length 2 in the list, though!_
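
The length distribution mentioned in the note can be counted with a few lines. This is a rough sketch: `lengthHistogram` is a hypothetical helper, and the inline sample stands in for the real word list, whose exact file format I am not assuming here.

```typescript
// Hypothetical helper: counts how many words exist for each word length,
// without weighting by how often a word occurs in text.
function lengthHistogram(words: readonly string[]): Map<number, number> {
  const counts = new Map<number, number>();
  for (const word of words) {
    // Array.from counts code points rather than UTF-16 code units.
    const len = Array.from(word).length;
    counts.set(len, (counts.get(len) ?? 0) + 1);
  }
  return counts;
}

// Small stand-in sample; the real input would be the full German word list.
const histogramSample = ["in", "er", "ein", "nie", "eine", "rein", "nett", "einen"];
const histogram = lengthHistogram(histogramSample);

// Print the lengths in ascending order.
const sortedEntries = Array.from(histogram.entries()).sort((a, b) => a[0] - b[0]);
for (const [len, count] of sortedEntries) {
  console.log(`length ${len}: ${count} word(s)`);
}
```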

Snippet embedded from `guided.ts` (the length filter discussed under "Possible reasons"):
```
filterWordList(wordList, this.codePoints).filter(
  (word) => word.length > 2,
),
```
