Bigger word lists

You could look into [banking](https://gbdk-2020.github.io/gbdk-2020/docs/api/docs_rombanking_mbcs.html) and [compression](https://en.wikipedia.org/wiki/Byte_pair_encoding).
Also, don’t store the word list as C strings: strings end with [`\0`](https://en.cppreference.com/w/c/string/byte), so each 5-letter word takes up 6 B and one byte in six (about 17%) is basically wasted.
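
To illustrate the point about dropping the terminator, here is a minimal sketch of a fixed-width list. The names `word_list`, `WORD_LEN` and `find_word` and the three sample words are made up for the example, and a plain linear search is assumed only to keep it short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WORD_LEN 5

/* Hypothetical sample list: every entry is exactly 5 bytes, no '\0'. */
static const char word_list[][WORD_LEN] = {
    {'F', 'R', 'A', 'N', 'K'},
    {'R', 'U', 'G', 'B', 'Y'},
    {'W', 'O', 'R', 'L', 'D'},
};

#define WORD_COUNT (sizeof(word_list) / sizeof(word_list[0]))

/* Returns the index of `guess` in the list, or -1 if it is not present.
   Only the first WORD_LEN bytes of `guess` are looked at. */
int find_word(const char *guess)
{
    size_t i;
    assert(guess != NULL);
    for (i = 0; i < WORD_COUNT; i++) {
        /* memcmp compares exactly WORD_LEN bytes, so no terminator is needed */
        if (memcmp(word_list[i], guess, WORD_LEN) == 0)
            return (int)i;
    }
    return -1;
}
```

The trade-off is that `strcmp`, `strlen` and friends no longer work on the entries; you have to use the `mem*` functions with an explicit length instead.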

MBC5 can have up to 512 banks of 16 KiB each (at least one is needed for your code), of which GBDK-2020 can address 255 with its functions. 256 banks are still 4 MiB.

So you can fit more than 1.67 million 5-character words in your list, or more than 835 thousand when you are limited to GBDK-2020’s capabilities. But you also have to manage and find them.
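
The numbers above can be checked with a few lines of C. This is only a sketch of the arithmetic: `words_that_fit` is a made-up helper, and it assumes 5-byte words with no terminator that never straddle a bank boundary (so 4 bytes per bank stay unused).

```c
#include <assert.h>

#define BANK_SIZE 16384UL /* 16 KiB per ROM bank */
#define WORD_LEN  5UL     /* fixed-width words, no '\0' */

/* Number of words that fit into `data_banks` banks when no word
   is allowed to cross a bank boundary. */
unsigned long words_that_fit(unsigned long data_banks)
{
    assert(data_banks <= 512UL); /* MBC5 upper limit */
    return data_banks * (BANK_SIZE / WORD_LEN);
}
```

With 511 data banks this gives 1,674,036 words, and with 255 banks 835,380 words.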

TLDR: just use 0x00-0x19 for A-Z (26 codes), 0x1A-0x7F for digram encoding, and the 0x80 bit to flag the end of a word.

Another trick would be to not use ASCII: it has 128 characters, but you actually just need A-Z (26 characters). You could, for example, use a different encoding where you can fit two of the 15 most frequent characters in one byte (maybe easier to implement than digram encoding). The most frequent characters seem to be:

* 0: escape. In the first part of the byte it means that the second part selects one of the other 11 characters (second list below); in the second part it means `<empty>` (padding); the whole byte `0x00` is `<null>`
* 1: E
* 2: T
* 3: A
* 4: O
* 5: N
* 6: I
* 7: H
* 8: S
* 9: R
* A: L
* B: D
* C: U
* D: C
* E: M
* F: W

* 00: `<null>`
* 01: Y
* 02: F
* 03: G
* 04: P
* 05: B
* 06: V
* 07: K
* 08: J
* 09: X
* 0A: Q
* 0B: Z


Because this encoding has a variable length, each word has to end with `0x00` again, so we still need a delimiter. The upside is that you can use the C string functions, since no other byte of an encoded word is `0x00`.

“WORLD” is `0xF4, 0x9A, 0xB0, 0x00`
“RUGBY” is `0x9C, 0x03, 0x05, 0x01, 0x00`
“FRANK” is `0x02, 0x93, 0x50, 0x07, 0x00`
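
As a rough sketch of how an encoder for this scheme could look (for example in a PC-side tool that generates the ROM data), assuming upper-case A-Z input only. The names `nibble_encode`, `frequent_code`, `FREQUENT` and `RARE` are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The 15 frequent letters; index i is nibble code i + 1 (E = 1 ... W = 0xF). */
static const char FREQUENT[] = "ETAONIHSRLDUCMW";
/* The other 11 letters; index i is escape byte i + 1 (Y = 0x01 ... Z = 0x0B). */
static const char RARE[] = "YFGPBVKJXQZ";

/* Returns the nibble code 1..15 of a frequent letter, or 0 otherwise. */
static unsigned char frequent_code(char c)
{
    const char *p = (c != '\0') ? strchr(FREQUENT, c) : NULL;
    return p ? (unsigned char)(p - FREQUENT + 1) : 0;
}

/* Encodes an upper-case A-Z word into `out` and returns the number of
   bytes written, including the final 0x00.
   `out` must have room for strlen(word) + 1 bytes. */
size_t nibble_encode(const char *word, unsigned char *out)
{
    size_t n = 0;
    assert(word != NULL && out != NULL);
    while (*word) {
        unsigned char hi = frequent_code(*word);
        if (hi) {
            /* Frequent letter: pack the next letter into the low nibble if
               it is frequent too, otherwise leave the low nibble 0 (empty). */
            unsigned char lo = frequent_code(word[1]);
            out[n++] = (unsigned char)((hi << 4) | lo);
            word += lo ? 2 : 1;
        } else {
            /* Rare letter: takes a whole byte 0x01..0x0B. */
            const char *p = strchr(RARE, *word);
            assert(p != NULL); /* only A-Z is supported in this sketch */
            out[n++] = (unsigned char)(p - RARE + 1);
            word += 1;
        }
    }
    out[n++] = 0x00; /* delimiter */
    return n;
}
```

For the three words above this produces 4, 5 and 5 bytes including the delimiter, so the saving depends a lot on how many rare letters a word contains.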

Digram encoding is likely better, but that also needs some form of delimiter. Custom encoding plus some form of compression is also possible.
Digram encoding on its own might always gain something. With 2 B references a word could take 3 B to 5 B, because in the worst case you would just store the plain string. So it’s probably not worth it.

Another encoding is to use one bit (`&0x80`) to mark the end of a word and to replace the parts of ASCII you don’t need with the most common digraphs/digrams:
(er, ti, ar, ou, or, le, th as 01, 02, 03, 04, 05, 06, 07)
“OTHER” is `0x4F, 0x07, 0x81`
“MOUTH” is `0x4D, 0x04, 0x87`
“ORDER” is `0x05, 0x44, 0x81`
“UNCLE” is `0x55, 0x4E, 0x43, 0x86`
“VOICE” is `0x56, 0x4F, 0x49, 0x43, 0xC5`
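
A minimal sketch of an encoder for this variant, using only the seven digrams listed above and assuming upper-case ASCII input. `digram_encode`, `digram_code` and `DIGRAMS` are made-up names, and the left-to-right greedy matching is an assumption that is not guaranteed to give the shortest result for every word.

```c
#include <assert.h>
#include <stddef.h>

/* Digram table from the text: index i is encoded as byte i + 1. */
static const char DIGRAMS[7][2] = {
    {'E', 'R'}, {'T', 'I'}, {'A', 'R'}, {'O', 'U'},
    {'O', 'R'}, {'L', 'E'}, {'T', 'H'}
};

/* Returns the code 0x01..0x07 for the pair (a, b), or 0 if it is no digram. */
static unsigned char digram_code(char a, char b)
{
    unsigned char i;
    for (i = 0; i < 7; i++) {
        if (DIGRAMS[i][0] == a && DIGRAMS[i][1] == b)
            return (unsigned char)(i + 1);
    }
    return 0;
}

/* Encodes a non-empty upper-case word into `out` and returns the number
   of bytes written. The last byte has bit 0x80 set as the end marker.
   `out` must have room for one byte per letter. */
size_t digram_encode(const char *word, unsigned char *out)
{
    size_t n = 0;
    assert(word != NULL && out != NULL && word[0] != '\0');
    while (*word) {
        /* Greedy: try to replace the next two letters with a digram code. */
        unsigned char code = word[1] ? digram_code(word[0], word[1]) : 0;
        if (code) {
            out[n++] = code;
            word += 2;
        } else {
            out[n++] = (unsigned char)*word; /* plain ASCII letter */
            word += 1;
        }
    }
    out[n - 1] |= 0x80; /* flag the end of the word */
    return n;
}
```

To decode, mask each byte with `0x7F`, look it up in the digram table if it is below `0x08`, and stop after the byte that had `0x80` set.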
