  BIP: ?
  Layer: Applications
  Title: Formosa --- Themed mnemonic sentences for generating deterministic keys
  Author: Yuri S Villas Boas <yuri@t3infosecurity.com>
          André Fidencio Gonçalves <andre7c4@gmail.com>
  Comments-Summary: No comments yet.
  Comments-URI: https://github.com/bitcoin/bips/wiki/Comments:BIP-formosa
  Status: Draft
  Type: Standards Track
  Created: 2021-12-10
  License: BSD-2-Clause
  Requires: BIP-0032, BIP-0039
  Post-History: https://www.toptal.com/cryptocurrency/formosa-crypto-wallet-management

Abstract

This BIP describes an expansion of BIP-0039 for the generation of deterministic wallets. Where BIP-0039 uses a flat list of unrelated words, Formosa organizes mnemonic words into themed sentences with syntactic structure and semantic coherence, substantially improving memorability while retaining all properties of the original scheme.

It consists of two parts: generating the mnemonic and converting it into a binary seed. This seed can be later used to generate deterministic wallets using BIP-0032 or similar methods.

Full forward and backward compatibility with BIP-0039 is maintained: seed derivation internally converts any Formosa mnemonic back to its equivalent BIP-0039 representation, so existing keys and addresses are preserved.

Copyright

This BIP is licensed under the BSD 2-clause license.

Motivation

A mnemonic code or sentence is superior for human interaction compared to the handling of raw binary or hexadecimal representations of a wallet seed. The sentence could be written on paper or spoken over the telephone.

However, human memory is an associative process: information is more readily retained when it can be linked to existing knowledge through semantic associations, visual imagery, and narrative context. A BIP-0039 mnemonic is a sequence of unrelated words with no syntactic or semantic relationship, making it difficult to form the mental associations that aid long-term retention.

Formosa builds upon BIP-0039 by organizing mnemonic words into themed sentences with syntactic roles (e.g., subject, adjective, object, location). Each sentence draws vocabulary from a coherent semantic domain --- medieval fantasy, science fiction, nature, finance, or any custom theme --- enabling the user to form vivid mental images that reduce memorization effort per bit of entropy.

This guide is meant to be a way to transport computer-generated randomness with a human-readable transcription. It's not a way to process user-created sentences (also known as brainwallets) into a wallet seed.

Generating the mnemonic

The mnemonic must encode entropy in a multiple of 32 bits. More entropy improves security but lengthens the mnemonic. We refer to the initial entropy length as ENT. ENT must be between 128 and 256 bits, inclusive.

First, an initial entropy of ENT bits is generated. A checksum is generated by taking the first ENT / 32 bits of its SHA256 hash. This checksum is appended to the end of the initial entropy. Next, these concatenated bits are split into groups of 33 bits, which we call sentences. Each sentence is further subdivided into variable-length bit fields, one per syntactic category, whose lengths are defined by the active theme. Each bit field encodes an index into the corresponding category's word list. Finally, we convert these indices into words and use the joined words as a mnemonic sentence.

BIP-0039 is a special case where each sentence contains three 11-bit fields indexing a single 2048-word list (3 x 11 = 33).
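
As a minimal sketch of the steps above (Python, standard library only), the checksum can be appended and the result split into 33-bit sentences as follows. The function names `entropy_to_sentence_bits` and `bip39_fields` are illustrative and are not taken from the reference implementation.

```python
import hashlib


def entropy_to_sentence_bits(entropy: bytes) -> list[str]:
    """Append the SHA256 checksum and split ENT+CS bits into 33-bit groups."""
    ent = len(entropy) * 8
    if ent % 32 != 0 or not 128 <= ent <= 256:
        raise ValueError("ENT must be a multiple of 32 in the range 128-256")
    cs = ent // 32  # checksum length in bits
    entropy_bits = format(int.from_bytes(entropy, "big"), f"0{ent}b")
    digest = hashlib.sha256(entropy).digest()
    checksum_bits = format(int.from_bytes(digest, "big"), "0256b")[:cs]
    bits = entropy_bits + checksum_bits
    return [bits[i:i + 33] for i in range(0, len(bits), 33)]


def bip39_fields(sentence_bits: str) -> list[int]:
    """BIP-0039 special case: three 11-bit indices per 33-bit sentence."""
    return [int(sentence_bits[i:i + 11], 2) for i in range(0, 33, 11)]


groups = entropy_to_sentence_bits(bytes(16))  # 128 bits of all-zero entropy
print(len(groups), bip39_fields(groups[-1]))
```

For 128 zero bits this yields four sentences; only the last one carries non-zero bits, because it ends with the 4-bit checksum.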

The following table describes the relation between the initial entropy length (ENT), the checksum length (CS), the number of 33-bit sentences (S), and the length of the generated mnemonic sentence (MS) in words. The MS (6-word) column assumes a theme with six words per sentence; the MS (BIP-0039) column uses three words per sentence and is therefore half as long.

CS = ENT / 32
S  = (ENT + CS) / 33

|  ENT  | CS | ENT+CS |  S  | MS (6-word) | MS (BIP-0039) |
+-------+----+--------+-----+-------------+---------------+
|  128  |  4 |   132  |  4  |     24      |      12       |
|  160  |  5 |   165  |  5  |     30      |      15       |
|  192  |  6 |   198  |  6  |     36      |      18       |
|  224  |  7 |   231  |  7  |     42      |      21       |
|  256  |  8 |   264  |  8  |     48      |      24       |
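
The rows of the table can be reproduced with a few lines of Python. This is only an illustration of the two formulas; `mnemonic_sizes` and its `words_per_sentence` parameter are names chosen here, not part of the specification.

```python
def mnemonic_sizes(ent: int, words_per_sentence: int = 6) -> tuple[int, int, int]:
    """Return (CS, S, MS) for a given ENT and number of words per sentence."""
    if ent % 32 != 0 or not 128 <= ent <= 256:
        raise ValueError("ENT must be a multiple of 32 in the range 128-256")
    cs = ent // 32            # CS = ENT / 32
    s = (ent + cs) // 33      # S  = (ENT + CS) / 33
    return cs, s, s * words_per_sentence


for ent in range(128, 257, 32):
    cs, s, ms6 = mnemonic_sizes(ent)
    ms3 = mnemonic_sizes(ent, words_per_sentence=3)[2]
    print(ent, cs, ent + cs, s, ms6, ms3)
```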

For each 33-bit sentence, the word selection algorithm proceeds as follows:

  1. Initialize an empty sentence array with one slot per category.
  2. For each category in the theme's filling order:
    1. Extract BIT_LENGTH bits from the current position in the bit stream.
    2. Interpret them as an unsigned integer index.
    3. If the category is led by another category, look up the appropriate sub-list from the leading category's mapping using the already-selected leading word. Otherwise, use the category's total word list.
    4. Select the word at the computed index from the resolved word list.
    5. Place the word into the sentence array at the position given by the theme's natural order.
  3. Output the words in natural order.
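
The steps above can be sketched in Python as follows. The dictionary layout used for a theme here (`bit_length`, `words`, `led_by`, `mapping`) is a hypothetical stand-in chosen for illustration, and the toy theme encodes only 3 bits rather than 33. The actual theme file format is defined by the reference implementation.

```python
def select_words(sentence_bits, categories, filling_order, natural_order):
    """Map one sentence (a '0'/'1' string) to its words in natural order."""
    sentence = [None] * len(natural_order)               # step 1
    pos = 0
    for name in filling_order:                           # step 2
        category = categories[name]
        width = category["bit_length"]
        index = int(sentence_bits[pos:pos + width], 2)   # steps 2.1 and 2.2
        pos += width
        leader = category.get("led_by")
        if leader is not None:                           # step 2.3
            leading_word = sentence[natural_order.index(leader)]
            word_list = categories[leader]["mapping"][name][leading_word]
        else:
            word_list = category["words"]
        # steps 2.4 and 2.5: pick the word and store it at its natural position
        sentence[natural_order.index(name)] = word_list[index]
    if pos != len(sentence_bits):
        raise ValueError("category bit lengths do not cover the sentence")
    return sentence                                      # step 3


# Hypothetical toy theme: the noun leads the adjective.
toy = {
    "noun": {
        "bit_length": 1,
        "words": ["cat", "dog"],
        "mapping": {"adjective": {"cat": ["agile", "black", "calm", "deft"],
                                  "dog": ["eager", "fast", "gentle", "happy"]}},
    },
    "adjective": {"bit_length": 2, "led_by": "noun"},
}
print(select_words("110", toy, ["noun", "adjective"], ["adjective", "noun"]))
# ['gentle', 'dog']
```

The noun is filled first because the adjective's sub-list depends on it, yet the adjective is printed first because it comes first in natural order.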

Themes

The Formosa equivalent to a BIP-0039 wordlist is a theme. A theme is a JSON document that defines syntactic categories, their word lists, bit-widths, and optional semantic restrictions between categories. The sum of all category bit-widths in a theme MUST equal 33.
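
A loader could enforce the 33-bit rule before accepting a theme. The sketch below assumes a hypothetical JSON layout with a top-level `categories` object whose entries carry a `bit_length` field. The real key names may differ and should be taken from the reference implementation.

```python
import json


def category_bit_widths(theme_json: str) -> dict[str, int]:
    """Parse a theme and check that its category bit-widths sum to 33."""
    theme = json.loads(theme_json)
    # "categories" and "bit_length" are assumed key names, for illustration only.
    widths = {name: cat["bit_length"] for name, cat in theme["categories"].items()}
    total = sum(widths.values())
    if total != 33:
        raise ValueError(f"category bit-widths sum to {total}, expected 33")
    return widths


# BIP-0039 expressed in this hypothetical layout: three 11-bit fields.
bip39_like = json.dumps({"categories": {
    "first": {"bit_length": 11},
    "second": {"bit_length": 11},
    "third": {"bit_length": 11},
}})
print(category_bit_widths(bip39_like))
```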

An ideal theme has the following characteristics:

a) specific semantic scope (memory block)

   - the entire vocabulary should adhere to a single coherent topic, enabling
     the user to form a unified mental scene

b) concrete imagery

   - categories should consist of elements easily associated with mental images.
     Prefer concrete nouns and tangible adjectives over abstract terms

c) sorted wordlists

   - the wordlist is sorted which allows for more efficient lookup of the code words
     (i.e. implementations can use binary search instead of linear search)

d) first-letters uniqueness

   - the wordlist is created in such a way that it's enough to type the first two
     letters to unambiguously identify the word

The first-letters uniqueness property yields higher information density than BIP-0039. In BIP-0039, four characters are needed to identify each word, encoding 11 bits per 4 characters = 2.75 bits/character. In Formosa, two characters suffice per word. The achievable density depends on the theme's category bit-widths:

| List size | Bits | Chars to identify | Density (bits/char) |
+-----------+------+-------------------+---------------------+
|   2048    |  11  |        4          |   2.75 (BIP-0039)   |
|    32     |   5  |        2          |   2.50              |
|    64     |   6  |        2          |   3.00              |
|   128     |   7  |        2          |   3.50              |

As an example, the nationalities theme uses four 7-bit nationality categories (128 entries each) and one 5-bit profession category (32 entries), yielding 33 bits per 5-word sentence. A user typing only the first two characters of each word types 10 characters to encode 33 bits, achieving an information density of 33 / 10 = 3.30 bits/character, a 20% improvement over BIP-0039's 2.75 bits/character.
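
The density figures above follow from a one-line calculation, shown here as a small Python check. The helper name `density` is illustrative.

```python
def density(bit_widths: list[int], chars_per_word: int) -> float:
    """Bits encoded per typed character when each word needs chars_per_word."""
    return sum(bit_widths) / (len(bit_widths) * chars_per_word)


nationalities = density([7, 7, 7, 7, 5], chars_per_word=2)  # 33 bits / 10 chars
bip39 = density([11, 11, 11], chars_per_word=4)             # 33 bits / 12 chars
print(nationalities, bip39, round(nationalities / bip39 - 1, 2))
```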

e) semantic restrictions (optional)

   - themes may define restrictions between categories so that the available word list
     for one category changes depending on the word selected in a leading category,
     producing more semantically coherent sentences. Restriction relationships MUST
     be acyclic
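
Acyclicity matters because a led category can only be filled after its leading category. One way to check it is a topological sort, sketched here with Python's standard `graphlib` module (Python 3.9+). The `led_by` dictionary is a hypothetical representation of a theme's restrictions. A theme defines its own filling order, so this sketch only confirms that a valid order exists.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Optional


def valid_filling_order(led_by: dict[str, Optional[str]]) -> list[str]:
    """Return one order in which every leader precedes the categories it leads.

    Raises graphlib.CycleError if the restriction relationships are cyclic.
    """
    # Each category maps to the set of categories that must be filled before it.
    graph = {name: ({leader} if leader else set())
             for name, leader in led_by.items()}
    return list(TopologicalSorter(graph).static_order())


order = valid_filling_order({"adjective": "noun", "noun": None, "place": None})
print(order.index("noun") < order.index("adjective"))  # True
```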

The wordlist can contain native characters, but they must be encoded in UTF-8 using Normalization Form Compatibility Decomposition (NFKD).
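
In Python, NFKD normalization is available through the standard `unicodedata` module. The sketch below shows why normalization matters: a precomposed and a decomposed spelling of the same word produce identical bytes after NFKD. The helper name `normalize_word` is illustrative.

```python
import unicodedata


def normalize_word(word: str) -> bytes:
    """Return the UTF-8 encoding of a word in NFKD form."""
    return unicodedata.normalize("NFKD", word).encode("utf-8")


precomposed = "caf\u00e9"    # 'e with acute' as a single code point
decomposed = "cafe\u0301"    # 'e' followed by a combining acute accent
print(normalize_word(precomposed) == normalize_word(decomposed))  # True
```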

From mnemonic to seed

A user may decide to protect their mnemonic with a passphrase. If a passphrase is not present, an empty string "" is used instead.

To ensure forward and backward compatibility with BIP-0039, seed derivation first converts any Formosa mnemonic back to its equivalent BIP-0039 mnemonic by extracting the underlying entropy and re-encoding it using the BIP-0039 English word list. This guarantees that the same entropy always produces the same seed, keys, and addresses regardless of which theme was used.
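
At the bit level, this conversion amounts to concatenating a sentence's category indices back into 33 bits and re-splitting them into three 11-bit BIP-0039 indices. The following is a minimal sketch under the assumption that the indices are supplied in the order they occupy in the bit stream (the theme's filling order). The function name is illustrative.

```python
def sentence_to_bip39_indices(field_indices: list[int], bit_widths: list[int]) -> list[int]:
    """Re-encode one Formosa sentence as three 11-bit BIP-0039 word indices."""
    if len(field_indices) != len(bit_widths) or sum(bit_widths) != 33:
        raise ValueError("expected one index per category and 33 bits in total")
    bits = ""
    for index, width in zip(field_indices, bit_widths):
        if not 0 <= index < 2 ** width:
            raise ValueError(f"index {index} does not fit in {width} bits")
        bits += format(index, f"0{width}b")
    return [int(bits[i:i + 11], 2) for i in range(0, 33, 11)]


# Widths of the nationalities theme: four 7-bit fields and one 5-bit field.
print(sentence_to_bip39_indices([0, 0, 0, 0, 3], [7, 7, 7, 7, 5]))    # [0, 0, 3]
print(sentence_to_bip39_indices([127, 0, 0, 0, 0], [7, 7, 7, 7, 5]))  # [2032, 0, 0]
```

Looking up each resulting index in the BIP-0039 English word list gives the equivalent BIP-0039 mnemonic.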

To create a binary seed from the resulting BIP-0039 mnemonic, we use the PBKDF2 function with the mnemonic sentence (in UTF-8 NFKD) as the password and the string "mnemonic" + passphrase (again in UTF-8 NFKD) as the salt. The iteration count is set to 2048 and HMAC-SHA512 is used as the pseudo-random function. The length of the derived key is 512 bits (= 64 bytes).
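
With Python's standard library, the derivation described above can be sketched as follows. The function name `mnemonic_to_seed` is illustrative, and the input is assumed to already be the BIP-0039 form of the mnemonic, with words separated by single spaces.

```python
import hashlib
import unicodedata


def mnemonic_to_seed(bip39_mnemonic: str, passphrase: str = "") -> bytes:
    """Derive the 64-byte seed with PBKDF2-HMAC-SHA512 and 2048 iterations."""
    password = unicodedata.normalize("NFKD", bip39_mnemonic).encode("utf-8")
    salt = unicodedata.normalize("NFKD", "mnemonic" + passphrase).encode("utf-8")
    return hashlib.pbkdf2_hmac("sha512", password, salt, 2048, dklen=64)


mnemonic = " ".join(["abandon"] * 11 + ["about"])
seed = mnemonic_to_seed(mnemonic, passphrase="TREZOR")
print(len(seed), seed.hex())
```

Changing the passphrase changes the salt, so each passphrase yields a different seed from the same mnemonic.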

This seed can be later used to generate deterministic wallets using BIP-0032 or similar methods.

The conversion of the mnemonic sentence to a binary seed is completely independent from generating the sentence. This keeps the code rather simple: there are no constraints on sentence structure, and clients are free to implement their own themes or even whole sentence generators, allowing for flexibility in wordlists for typo detection or other purposes.

Although using a mnemonic not generated by the algorithm described in the "Generating the mnemonic" section is possible, this is not advised. Software must compute a checksum for the mnemonic sentence using a wordlist and issue a warning if it is invalid.
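
Once the words have been mapped back to their bit string, the checksum test can be sketched as below. The function name `checksum_is_valid` is illustrative, and the input is assumed to be the full ENT + CS bit string written as '0' and '1' characters.

```python
import hashlib


def checksum_is_valid(bits: str) -> bool:
    """Check that the trailing CS bits match the SHA256 of the leading ENT bits."""
    if len(bits) not in (132, 165, 198, 231, 264):  # allowed ENT + CS lengths
        return False
    cs = len(bits) // 33          # ENT + CS = 33 * CS, since ENT = 32 * CS
    ent = len(bits) - cs
    entropy = int(bits[:ent], 2).to_bytes(ent // 8, "big")
    digest = hashlib.sha256(entropy).digest()
    expected = format(int.from_bytes(digest, "big"), "0256b")[:cs]
    return bits[ent:] == expected


print(checksum_is_valid("0" * 128 + "0011"))  # True
print(checksum_is_valid("0" * 132))           # False
```

A wallet would issue its warning whenever this check returns False.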

The described method also provides plausible deniability, because every passphrase generates a valid seed (and thus a deterministic wallet) but only the correct one will make the desired wallet available.

Standard themes

The reference implementation ships with a set of standard themes, available in the repository linked under "Reference Implementation". Since BIP-0039 is a valid Formosa theme, all existing BIP-0039 mnemonics work without modification.

It is strongly discouraged to use non-standard custom themes for generating mnemonic sentences, as the user assumes responsibility for ensuring the theme file remains available and structurally valid. Users with proper training in security protocols who understand these risks may benefit from custom themes through higher memorization efficiency or an additional layer of obscurity.

Test vectors

The test vectors include input entropy, mnemonic and seed. The passphrase "TREZOR" is used for all vectors. Since Formosa converts back to BIP-0039 before seed derivation, the same test vectors apply to all themes given the same underlying entropy.

https://github.com/Yuri-SVB/formosa/blob/master/vectors.json

Reference Implementation

Reference implementation including themes is available from

https://github.com/Yuri-SVB/formosa