Commit 3f1abc4

RFC beginnings
1 parent 8b01688 commit 3f1abc4

RFC.md

Lines changed: 394 additions & 0 deletions
@@ -0,0 +1,394 @@
1+
# The Multihash Format
2+
3+
## Introduction
4+
5+
TODO
6+
7+
### Hash Functions
8+
9+
TODO
10+
11+
### Self Description
12+
13+
TODO (use a definition from multiformats + example for hashes)
14+
15+
### Example Use Cases
16+
17+
TODO
18+
19+
### Terminology
20+
21+
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
22+
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
23+
document are to be interpreted as described in RFC 2119, BCP 14
24+
[RFC2119] and indicate requirement levels for compliant multihash
25+
implementations.
26+
27+
This specification makes use of the following terminology:
28+
29+
Varint: A variable-sized unsigned integer, encoded using Most
30+
Significant Bit (MSB) continuation bits, with no size limit.
31+
32+
Hash function: TODO
33+
34+
Cryptographic hash function: TODO
35+
36+
Digest: The output value of a hash function.
37+
38+
Length-prefixed: A variable sized byte sequence, prefixed with the
39+
length of the sequence. Example: "0x03aabbcc" is the sequence
40+
"0xaabbcc" prefixed with the length: "0x03". For our purposes,
41+
these prefixes are varints.
42+
43+
## Multihash Function Tables
44+
45+
A Multihash Function Table assigns a unique integer code and a
46+
unique string name to each hash function listed on the table.
47+
You can think of these tables as a list of triples, where
48+
the integer code MUST be unique, the string name MUST be unique,
49+
and the function should be a well defined hash function:
50+
51+
# a multihash table is a list of triples:
52+
<code> <name> <function>
53+
...
54+
55+
# example multihash table:
56+
0x00 identity "The Identity Hash Function"
57+
0x11 sha1 "The SHA1 Cryptographic Hash Function"
58+
0x12 sha2-256 "The SHA256 Cryptographic Hash Function"
59+
0x13 sha2-512 "The SHA512 Cryptographic Hash Function"
60+
0x14 sha3-512 "The SHA3-512 Cryptographic Hash Function"
61+
...
62+
63+
`<code>`: a unique (unsigned) integer code for a compact
64+
representation of the hash function. Usually represented
65+
as a varint. The table has no length limit.
66+
67+
`<name>`: a unique string name for a more human readable
68+
representation of the hash function. It is restricted to
69+
the following ASCII character values: `[a-zA-Z0-9-]+`. In
70+
most cases, this name SHOULD be enough for a programmer
71+
to know without ambiguity precisely which hash function is
72+
intended.
73+
74+
`<function>`: a unique string name or definition of the
75+
hash function. For most purposes, this MUST be enough of a
76+
definition to allow a programmer to know, without any ambiguity,
77+
precisely which hash function is intended. A self-describing
78+
multihash table may opt to represent functions as the code
79+
itself.
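
The uniqueness rules above can be checked mechanically. The following is a minimal sketch, not part of the specification: the helper name `index_table` and the in-memory tuple layout are hypothetical, and the name pattern assumes hyphens are allowed because names such as `sha2-256` contain them.

```python
import re

# Assumption: hyphens are allowed in names, since "sha2-256" uses one.
NAME_PATTERN = re.compile(r"[a-zA-Z0-9-]+")

# A few (code, name, function) triples copied from the tables in this document.
EXAMPLE_TABLE = [
    (0x11, "sha1", "The SHA1 Cryptographic Hash Function"),
    (0x12, "sha2-256", "The SHA256 Cryptographic Hash Function"),
    (0x13, "sha2-512", "The SHA512 Cryptographic Hash Function"),
]


def index_table(table):
    """Build code and name lookups, rejecting duplicate or malformed entries."""
    by_code, by_name = {}, {}
    for code, name, function in table:
        if not isinstance(code, int) or code < 0:
            raise ValueError(f"code must be an unsigned integer: {code!r}")
        if not NAME_PATTERN.fullmatch(name):
            raise ValueError(f"invalid name: {name!r}")
        if code in by_code:
            raise ValueError(f"duplicate code: {code:#x}")
        if name in by_name:
            raise ValueError(f"duplicate name: {name!r}")
        by_code[code] = (name, function)
        by_name[name] = (code, function)
    return by_code, by_name


by_code, by_name = index_table(EXAMPLE_TABLE)
```

Keeping both lookups makes it cheap to translate between the integer code and the string name of the same function.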
80+
81+
TODO: possibly remove `<function>` entirely, and rely solely
82+
on the `<name>`? Is there such a table already to rely on?
83+
Such a table must include both cryptographic and non-cryptographic
84+
hash functions.
85+
86+
### Standard Multihash Table
87+
88+
There is a Standard Multihash Table that all implementations
89+
MUST support. This table exists to provide agreement on the
90+
int and string codes of widely known hash functions.
91+
92+
What level of prominence must a hash function achieve before
93+
being included in the Standard Multihash Table? This question
94+
is left up to the maintainers of the table, presumably IANA.
95+
96+
The current value of the Standard Multihash Table is:
97+
98+
# 0x10-0x3f reserved for SHA standard functions
99+
# 0x60-0x7f reserved for custom hash functions
100+
# 0x3000-0x3fff reserved for custom hash functions
101+
# code, name
102+
0x00, identity
103+
0x11, sha1
104+
0x12, sha2-256
105+
0x13, sha2-512
106+
0x14, sha3-512
107+
0x15, sha3-384
108+
0x16, sha3-256
109+
0x17, sha3-224
110+
0x18, shake-128
111+
0x19, shake-256
112+
0x40, blake2b
113+
0x41, blake2s
114+
115+
TODO: either add the `<function>` part or NOT.
116+
117+
The Standard Multihash Table reserves some ranges for growth,
118+
and for custom tables:
119+
120+
0x10-0x3f: reserved for SHA standard functions.
121+
122+
A fourth of all integers is reserved for custom tables. That
123+
way, neither the numbers for the standard table nor for custom
124+
tables can run out. The reserved numbers are at the upper
125+
fourth of every varint range. This means the following ranges,
126+
and the series they imply:
127+
128+
0d96 to 0d127 or 0b1100000 to 0b1111111
129+
0d12288 to 0d16383 or 0b11000000000000 to 0b11111111111111
130+
131+
Put another way: every varint byte contributes 7 bits to the
132+
number. All numbers whose most significant 7-bit group
133+
starts with 0b11 are reserved for custom tables. This
134+
is a fourth of all numbers, and yields the sequence above.
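
One way to read this rule is as a test on the most significant 7-bit group of a code. The sketch below is an interpretation of the draft text rather than a normative definition, and the function name `is_custom_code` is hypothetical.

```python
def is_custom_code(code: int) -> bool:
    """Return True if the code falls in the upper fourth of its varint range."""
    if code < 0:
        raise ValueError("codes are unsigned integers")
    # Number of 7-bit groups (varint bytes) needed for this code, at least one.
    groups = max(1, (code.bit_length() + 6) // 7)
    # Isolate the most significant 7-bit group.
    top_group = code >> (7 * (groups - 1))
    # The group starts with 0b11 exactly when it is at least 0b1100000 (96).
    return top_group >= 0b1100000
```

Under this reading, 0x60 to 0x7f and 0x3000 to 0x3fff are custom, while 0x40 (blake2b) and 0x80 are not.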
135+
136+
TODO: check if the above makes sense readily, or we want a
137+
better explanation or a better range.
138+
139+
### Custom Tables
140+
141+
Users MAY create custom multihash tables to use custom hash
142+
functions not defined in the Standard Multihash Table. In
143+
order to avoid any ambiguity whatsoever, these custom tables
144+
MUST be strict super-sets of the Standard Multihash Table.
145+
Therefore, they MUST define any custom functions with the
146+
integer codes reserved for this purpose, and MUST use string
147+
names that begin with `x-`.
148+
149+
# example custom table. superset of SMT
150+
# 0x01-0x0f reserved for application specific functions
151+
# 0x10-0x3f reserved for SHA standard functions
152+
0x00, identity
153+
0x11, sha1
154+
0x12, sha2-256
155+
0x13, sha2-512
156+
0x14, sha3-512
157+
0x15, sha3-384
158+
0x16, sha3-256
159+
0x17, sha3-224
160+
0x18, shake-128
161+
0x19, shake-256
162+
0x40, blake2b
163+
0x41, blake2s
164+
165+
### Self-Describing Multihash Tables
166+
167+
It is possible to define a multihash table using completely
168+
self-describing definitions, meaning that the hash function
169+
itself is stored not by name but as a code value. How to do
170+
that is left up to future specifications. If one is defined,
171+
this specification SHOULD be updated to link to it.
172+
173+
## Specification of the multihash format
174+
175+
Multihash is a self-describing format to store, transmit, and
176+
display hash function digest values. It is unambiguous, provides
177+
"algorithmic agility", prevents "algorithm lock-in", improves the
178+
security of hash value transmissions, and improves the usability
179+
of hash functions.
180+
181+
The multihash format is very simple. It defines a single value
182+
type: a multihash value. A multihash value consists of three parts:
183+
184+
1. Hash Function Code: an integer code representing a hash
185+
function from a pre-determined hash function table, expressed
186+
as a varint.
187+
2. Digest Length: the length of the hash function digest value,
188+
expressed as a varint.
189+
3. Digest Value: the hash function digest value, conforming to
190+
the Digest Length.
191+
192+
These three parts are laid out in a byte sequence as follows:
193+
194+
<hash-function-code><digest-length><digest-value>
195+
196+
The multihash format defines two representations for a multihash
197+
value:
198+
199+
1. Packed Representation: a compact representation for use
200+
on the wire, in storage, in displays, and in other identifiers.
201+
2. Explanation Representation: a larger, human readable
202+
representation to make distinguishing values easier on
203+
developers.
204+
205+
Both representations keep the parts in the order above and differ
206+
only in how each part is encoded.
207+
208+
For all intents and purposes, the Packed Representation is the
209+
main contribution of this specification. The Explanation
210+
Representation is merely here to provide a single standard way to
211+
dump out values in more readable forms.
212+
213+
### Packed Representation
214+
215+
The Packed Representation is a compact representation optimized
216+
for transmitting, storing, displaying, and using multihashes.
217+
It is optimized to avoid wasting space. It is also tuned to be
218+
incorporated into other values, such as content-addressed store
219+
identifiers, URI/URLs, unix filesystem paths, and more. This is
220+
the primary representation of a multihash format.
221+
222+
The Packed Representation is structured as follows:
223+
224+
<hash-function-code-v><digest-length-v><digest-value-bin>
225+
226+
Where:
227+
- `<hash-function-code-v>` is a varint storing the code of the hash function
228+
according to a pre-determined multihash table, and the Standard Multihash
229+
Table.
230+
- `<digest-length-v>` is a varint storing the length of `<digest-value-bin>`,
231+
in bytes.
232+
- `<digest-value-bin>` is the hash function digest value, binary-packed.
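
The layout above can be sketched in a few lines. This is an illustrative example, not a reference implementation: the helper names are hypothetical, the varint helpers assume the MSB little-endian encoding this document describes for muvints, and `0x12` is the `sha2-256` code from the Standard Multihash Table.

```python
import hashlib


def _encode_varint(n: int) -> bytes:
    """Encode an unsigned integer, 7 bits per byte, least significant group first."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def _decode_varint(data: bytes, offset: int):
    """Return (value, next_offset) for the varint starting at offset."""
    value, shift = 0, 0
    for pos in range(offset, len(data)):
        value |= (data[pos] & 0x7F) << shift
        if not data[pos] & 0x80:
            return value, pos + 1
        shift += 7
    raise ValueError("truncated varint")


def pack_multihash(code: int, digest: bytes) -> bytes:
    """<hash-function-code-v><digest-length-v><digest-value-bin>"""
    return _encode_varint(code) + _encode_varint(len(digest)) + digest


def unpack_multihash(data: bytes):
    """Return (code, digest) from a packed multihash."""
    code, pos = _decode_varint(data, 0)
    length, pos = _decode_varint(data, pos)
    digest = data[pos:pos + length]
    if len(digest) != length:
        raise ValueError("digest shorter than declared length")
    return code, digest


digest = hashlib.sha256(b"multihash").digest()
packed = pack_multihash(0x12, digest)
```

For a 32-byte SHA-256 digest, the packed value is 34 bytes long: one byte for the code, one for the length, and the digest itself.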
233+
234+
#### Hash Function Code
235+
236+
The Hash Function Code in the packed representation is an unsigned
237+
varint, following the format specified in section (TODO). This code
238+
references a pre-determined multihash table, usually the
239+
Standard Multihash Table maintained alongside this specification.
240+
241+
It is possible for users to define and use their own tables, which
242+
MUST be compatible with the Standard Multihash Table, by using
243+
only the code ranges left undefined for this purpose. Section (TODO)
244+
explains the Standard Multihash Table and which ranges can be used.
245+
246+
#### Digest Length in Bytes
247+
248+
The `<digest-length>` counts bytes, not bits. This reduces the storage
249+
required for most hash function values: 256 and 512 bit lengths are
250+
represented as "32" and "64", using 1-byte varints instead of 2-byte
251+
varints.
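
The saving can be verified with a little arithmetic. The helper below is a sketch with a hypothetical name; it only counts how many 7-bit groups a value needs.

```python
def muvint_size(n: int) -> int:
    """Number of bytes needed to encode n as a varint with 7 payload bits per byte."""
    return max(1, (n.bit_length() + 6) // 7)


# Lengths counted in bytes fit in one varint byte; the same lengths in bits need two.
sizes_in_bytes = [muvint_size(32), muvint_size(64)]
sizes_in_bits = [muvint_size(256), muvint_size(512)]
```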
252+
253+
This size reduction exploits the facts that: (a) most common computer
254+
architectures use 8-bit bytes, (b) most networks transmit sequences of
255+
bytes, (c) most storage systems store sequences of bytes, and (d) most
256+
if not all widely used hash functions have standard digest lengths
257+
divisible by 8.
258+
259+
It is possible that some hash functions are used with digests of lengths
260+
not evenly divisible by 8. In such rare cases, the function implementation
261+
with multihash should define a byte-aligned version, usually by adding a
262+
pre-determined amount of padding bits at the end of the value. Such
263+
padding transformations can be well-defined, and would likely have to
264+
exist to support such hash functions in current computer architectures,
265+
storage systems, and networks. Therefore this specification is comfortable
266+
leaving this up to the implementations and users.
267+
268+
#### Digest Value
269+
270+
The `<digest-value-bin>` is simply a binary-packed representation of the
271+
hash function value. This value ensures the entire Multihash Packed
272+
Representation is as compact as it could be, wasting no space to
273+
represent the digest value.
274+
275+
Base encoding for strings must be performed around the whole multihash
276+
value, not around the digest value alone.
277+
278+
### Explanation Representation
279+
280+
The Explanation Representation is a well-defined, human-readable way
281+
of representing multihash values. It is deliberately wasteful,
282+
and is not meant to be used as an identifier in systems.
283+
284+
The Explanation Representation is structured as follows:
285+
286+
<hash-function-name>.<digest-length>.<digest-value-hex>
287+
288+
Where:
289+
- `<hash-function-name>` is the string name of the code of the hash
290+
function according to a pre-determined multihash table, and the
291+
Standard Multihash Table.
292+
- `<digest-length>` is a decimal number storing the length of the
293+
digest value, in bytes (before hex encoding).
294+
- `<digest-value-hex>` is the hash function digest value, hex-encoded.
295+
- `.` is a delimiter for the values.
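
A debugging helper for this form might look like the sketch below. The function names are hypothetical, and the parser assumes that names never contain `.`, which follows from the character restriction on `<name>`.

```python
def explain(name: str, digest: bytes) -> str:
    """Format <hash-function-name>.<digest-length>.<digest-value-hex>."""
    return f"{name}.{len(digest)}.{digest.hex()}"


def parse_explanation(text: str):
    """Split an Explanation Representation back into (name, digest)."""
    parts = text.split(".")
    if len(parts) != 3:
        raise ValueError("expected exactly three dot-separated parts")
    name, length_text, hex_text = parts
    digest = bytes.fromhex(hex_text)
    if int(length_text) != len(digest):
        raise ValueError("declared length does not match digest")
    return name, digest


example = explain("identity", bytes.fromhex("aabbcc"))
```

Here `example` is `identity.3.aabbcc`: the length counts digest bytes, not hex characters.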
296+
297+
This representation SHOULD NOT be used to transmit, store, or embed a
298+
multihash value. It SHOULD only be used for debugging.
299+
300+
Even when creating human-oriented string identifiers, it is strongly
301+
RECOMMENDED to use the packed representation of multihash, possibly
302+
encoded in a convenient base. This ensures the whole multihash value
303+
is as compact as it can be, and does not accidentally impose other
304+
requirements on the transmission of the value. It is much easier for
305+
systems to deal with a binary value that can be easily encoded in a
306+
variety of bases. Using the Explanation Representation for storage,
307+
transmission, embedding in other identifiers, or anything other
308+
than debugging, defeats the purpose of multihash.
309+
310+
### MSB Unsigned Varints - muvints
311+
312+
Multihash uses muvints, Most-significant-bit Unsigned Variable
313+
INTegerS. These are in use by other multiformats. Their definition
314+
is summarized here for completeness.
315+
316+
Unsigned: muvints are unsigned integers. There is no need for
317+
distinguishing negative integers.
318+
319+
Varints: muvints are variable-length integers, with no size limit.
320+
321+
MSB continuation: muvints use the Most Significant Bit of every byte
322+
to represent a continuation bit. This type of varint is optimized
323+
for space and reads of small numbers, not for reads of very large
324+
numbers (128-bit ints and beyond).
325+
326+
Little-Endian: muvints are based on Protocol Buffers varints, and
327+
are thus little-endian, meaning the least significant bytes are
328+
encoded first.
329+
330+
Examples:
331+
332+
# decimal muvint bytes
333+
127 01111111
334+
128 10000000 00000001
335+
256 10000000 00000010
336+
1024 10000000 00001000
337+
16384 10000000 10000000 00000001
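
The examples above can be reproduced with a short encoder and decoder. This is a minimal sketch following the description in this section; the function names are hypothetical and no size limit is enforced.

```python
def encode_muvint(n: int) -> bytes:
    """Encode an unsigned integer as a muvint (least significant 7-bit group first)."""
    if n < 0:
        raise ValueError("muvints are unsigned")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # MSB set: another byte follows
        else:
            out.append(low7)         # MSB clear: this is the last byte
            return bytes(out)


def decode_muvint(data: bytes, offset: int = 0):
    """Return (value, next_offset) for the muvint starting at offset."""
    value, shift = 0, 0
    for pos in range(offset, len(data)):
        value |= (data[pos] & 0x7F) << shift
        if not data[pos] & 0x80:
            return value, pos + 1
        shift += 7
    raise ValueError("truncated muvint")


def as_bits(encoded: bytes) -> str:
    """Render bytes the way the table above does, e.g. '10000000 00000001'."""
    return " ".join(f"{byte:08b}" for byte in encoded)
```

For instance, `as_bits(encode_muvint(128))` gives `10000000 00000001`, matching the second row of the table.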
338+
339+
## Encoding multihashes
340+
341+
Multihashes are designed to be used as values in a variety of
342+
mediums. They will often need to be encoded in other bases.
343+
In particular, multihash endeavors to keep the function code
344+
and length in the same base encoding as the digest value, which
345+
is important to many applications that must treat hash digest
346+
values opaquely, or that may have base encoding restrictions.
347+
348+
Binary: It is RECOMMENDED that multihash values are stored
349+
and transmitted on the wire as binary packed values wherever
350+
possible. This will ensure the hash digests take up as little
351+
space as possible.
352+
353+
Copiable: It is RECOMMENDED that multihash values are displayed
354+
to users in a "copiable" form, that is in a form easy to select,
355+
copy, and paste, which typically means base16, base32, or base58.
356+
357+
Multibase: Multihash pairs well with Multibase, a standard for
358+
self-describing base encodings. This way, a multihash value can
359+
be stored, transmitted, or displayed in any base without any
360+
ambiguity.
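
To illustrate encoding the whole value rather than only the digest, the sketch below base-encodes a packed multihash with Python's standard library. It is an example under stated assumptions: `0x12` is the `sha2-256` code from the Standard Multihash Table, both the code and the length are below 128 so each fits in one varint byte, and no Multibase prefix is added.

```python
import base64
import hashlib

digest = hashlib.sha256(b"hello").digest()

# Code 0x12 and length 32 each fit in a single varint byte.
packed = bytes([0x12, len(digest)]) + digest

# The function code and length travel inside the encoded string.
as_base16 = packed.hex()
as_base32 = base64.b32encode(packed).decode("ascii")

# Decoding either string yields the same packed bytes.
from_base16 = bytes.fromhex(as_base16)
from_base32 = base64.b32decode(as_base32)
```

Because the code and length are encoded together with the digest, the base16 string begins with `1220`, and an application can treat the whole string as one opaque value.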
361+
362+
363+
## Considerations
364+
365+
### Implementation considerations
366+
367+
TODO
368+
369+
### IANA considerations
370+
371+
It is RECOMMENDED that IANA host the Standard Multihash Table.
372+
373+
### Security considerations
374+
375+
It is RECOMMENDED that implementations establish a reasonable
376+
upper bound on varint sizes to avoid allocating large buffers,
377+
or potential buffer overflows. A sensible limit at any given
378+
time depends on the size of the tables and the digest sizes
379+
of commonly used hash functions. Such a limit is
380+
explicitly left out of this specification as it is liable to be
381+
an incorrect choice as time passes.
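
A decoder that enforces such a bound could look like the following sketch. The limit of 9 bytes is an arbitrary value chosen for illustration, not a recommendation of this specification, and the names are hypothetical.

```python
MAX_VARINT_BYTES = 9  # hypothetical limit; implementations choose their own


def decode_muvint_bounded(data: bytes, offset: int = 0,
                          max_bytes: int = MAX_VARINT_BYTES):
    """Return (value, next_offset), refusing varints longer than max_bytes."""
    value = 0
    for i in range(max_bytes):
        pos = offset + i
        if pos >= len(data):
            raise ValueError("truncated varint")
        byte = data[pos]
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            return value, pos + 1
    raise ValueError("varint exceeds the configured size limit")
```

Rejecting oversized varints before reading a digest keeps a malformed length prefix from driving a large allocation.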
382+
383+
## Acknowledgements
384+
385+
Special thanks to the following people for helping to define,
386+
implement, review, and extend multihash:
387+
388+
TODO list contributors
389+
390+
391+
## References
392+
393+
TODO
394+
