|
| 1 | +# The Multihash Format |
| 2 | + |
| 3 | +## Introduction |
| 4 | + |
| 5 | +TODO |
| 6 | + |
| 7 | +### Hash Functions |
| 8 | + |
| 9 | +TODO |
| 10 | + |
| 11 | +### Self Description |
| 12 | + |
| 13 | +TODO (use a definition from multiformats + example for hashes) |
| 14 | + |
| 15 | +### Example Use Cases |
| 16 | + |
| 17 | +TODO |
| 18 | + |
| 19 | +### Terminology |
| 20 | + |
| 21 | +The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", |
| 22 | +"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this |
| 23 | +document are to be interpreted as described in RFC 2119, BCP 14 |
| 24 | +[RFC2119] and indicate requirement levels for compliant multihash |
| 25 | +implementations. |
| 26 | + |
| 27 | +This specification makes use of the following terminology: |
| 28 | + |
| 29 | +Varint: A variable sized unsigned integer, encoded with a Most |
| 30 | + Significant Bit (MSB) continuation bit in each byte, with no size limit. |
| 31 | + |
| 32 | +Hash function: TODO |
| 33 | + |
| 34 | +Cryptographic hash function: TODO |
| 35 | + |
| 36 | +Digest: The output value of a hash function. |
| 37 | + |
| 38 | +Length-prefixed: A variable sized byte sequence, prefixed with the |
| 39 | + length of the sequence. Example: "0x03aabbcc" is the sequence |
| 40 | + "0xaabbcc" prefixed with the length: "0x03". For our purposes, |
| 41 | + these prefixes are varints. |
| 42 | + |
| 43 | +## Multihash Function Tables |
| 44 | + |
| 45 | +A Multihash Function Table assigns a unique integer code and a |
| 46 | +unique string name to each hash function listed on the table. |
| 47 | +You can think of these tables as a list of triples, where each |
| 48 | +integer code MUST be unique, each string name MUST be unique, |
| 49 | +and each function should be a well defined hash function: |
| 50 | + |
| 51 | + # a multihash table is a list of triples: |
| 52 | + <code> <name> <function> |
| 53 | + ... |
| 54 | + |
| 55 | + # example multihash table: |
| 56 | + 0x00 identity "The Identity Hash Function" |
| 57 | + 0x11 sha1 "The SHA1 Cryptographic Hash Function" |
| 58 | + 0x12 sha2-256 "The SHA256 Cryptographic Hash Function" |
| 59 | + 0x13 sha2-512 "The SHA512 Cryptographic Hash Function" |
| 60 | + 0x14 sha3-512 "The SHA3-512 Cryptographic Hash Function" |
| 61 | + ... |
| 62 | + |
| 63 | +`<code>`: a unique (unsigned) integer code for a compact |
| 64 | + representation of the hash function. Usually represented |
| 65 | + as a varint. The table has no length limit. |
| 66 | + |
| 67 | +`<name>`: a unique string name for a more human readable |
| 68 | + representation of the hash function. It is restricted to |
| 69 | + the following ASCII character values: `[a-zA-Z0-9-]+`. In |
| 70 | + most cases, this name SHOULD be enough for a programmer |
| 71 | + to know without ambiguity precisely which hash function is |
| 72 | + intended. |
| 73 | + |
| 74 | +`<function>`: is a unique string name or definition of the |
| 75 | + hash function. For most purposes, this MUST be enough of a |
| 76 | + definition to allow a programmer to know, without any ambiguity, |
| 77 | + precisely which hash function is intended. A self-describing |
| 78 | + multihash table may opt to represent functions as the code |
| 79 | + itself. |
| 80 | + |
| 81 | +TODO: possibly remove `<function>` entirely, and rely solely |
| 82 | +on the `<name>`? Is there such a table already to rely on? |
| 83 | +Such a table must include both cryptographic and non-cryptographic |
| 84 | +hash functions. |
| 85 | + |
| 86 | +### Standard Multihash Table |
| 87 | + |
| 88 | +There is a Standard Multihash Table that all implementations |
| 89 | +MUST support. This table exists to provide agreement on the |
| 90 | +int and string codes of widely known hash functions. |
| 91 | + |
| 92 | +What level of notability must a hash function achieve before |
| 93 | +being included in the Standard Multihash Table? This question |
| 94 | +is left up to the maintainers of the table, presumably IANA. |
| 95 | + |
| 96 | +The current value of the Standard Multihash Table is: |
| 97 | + |
| 98 | + # 0x10-0x3f reserved for SHA standard functions |
| 99 | + # 0x60-0x7f reserved for custom hash functions |
| 100 | + # 0x3000-0x3fff reserved for custom hash functions |
| 101 | + # code, name |
| 102 | + 0x00, identity |
| 103 | + 0x11, sha1 |
| 104 | + 0x12, sha2-256 |
| 105 | + 0x13, sha2-512 |
| 106 | + 0x14, sha3-512 |
| 107 | + 0x15, sha3-384 |
| 108 | + 0x16, sha3-256 |
| 109 | + 0x17, sha3-224 |
| 110 | + 0x18, shake-128 |
| 111 | + 0x19, shake-256 |
| 112 | + 0x40, blake2b |
| 113 | + 0x41, blake2s |
| 114 | + |
| 115 | +TODO: either add the `<function>` part or NOT. |
| 116 | + |
| 117 | +The Standard Multihash Table reserves some ranges for growth, |
| 118 | +and for custom tables: |
| 119 | + |
| 120 | +0x10-0x3f: reserved for SHA standard functions. |
| 121 | + |
| 122 | +A fourth of all integers is reserved for custom tables. That |
| 123 | +way, neither the numbers for the standard table nor for custom |
| 124 | +tables can run out. The reserved numbers are at the upper |
| 125 | +fourth of every varint range. This means the following ranges, |
| 126 | +and the series they imply: |
| 127 | + |
| 128 | + 0d96 to 0d127 or 0b1100000 to 0b1111111 |
| 129 | + 0d12288 to 0d16383 or 0b11000000000000 to 0b11111111111111 |
| 130 | + |
| 131 | +Put another way: every varint byte contributes 7 bits to the |
| 132 | +number. All numbers whose shortest varint payload (7 bits per |
| 133 | +byte) starts with 0b11 are reserved for custom tables. This |
| 134 | +is a fourth of all numbers, and yields the sequence above. |
| 135 | + |
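The reserved-range rule above can be checked mechanically. The following Python sketch is one reading of that rule, not part of the specification: it assumes a code is always considered at its shortest varint width, and the names `varint_width` and `is_custom_code` are hypothetical helpers introduced only for illustration.

```python
def varint_width(code: int) -> int:
    """Number of varint bytes (7-bit groups) needed for code, at least 1."""
    return max(1, (code.bit_length() + 6) // 7)


def is_custom_code(code: int) -> bool:
    """True if the 7-bit-per-byte payload of code starts with 0b11.

    Assumption: the payload is taken at the shortest varint width.
    """
    if code < 0:
        raise ValueError("codes are unsigned")
    payload_bits = 7 * varint_width(code)
    # Keep only the top two bits of the payload and compare with 0b11.
    return (code >> (payload_bits - 2)) == 0b11
```

Under this reading, `is_custom_code(0x60)` and `is_custom_code(0x3000)` hold, while `is_custom_code(0x12)` (sha2-256) and `is_custom_code(0x80)` do not.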
| 136 | +TODO: check if the above makes sense readily, or we want a |
| 137 | +better explanation or a better range. |
| 138 | + |
| 139 | +### Custom Tables |
| 140 | + |
| 141 | +Users MAY create custom multihash tables to use custom hash |
| 142 | +functions not defined in the Standard Multihash Table. In |
| 143 | +order to avoid any ambiguity whatsoever, these custom tables |
| 144 | +MUST be strict super-sets of the Standard Multihash Table. |
| 145 | +Therefore, they MUST define any custom functions with the |
| 146 | +integer codes reserved for this purpose, and MUST use string |
| 147 | +names that begin with `x-`. |
| 148 | + |
| 149 | + # example custom table. superset of SMT |
| 150 | + # 0x01-0x0f reserved for application specific functions |
| 151 | + # 0x10-0x3f reserved for SHA standard functions |
| 152 | + 0x00, identity |
| 153 | + 0x11, sha1 |
| 154 | + 0x12, sha2-256 |
| 155 | + 0x13, sha2-512 |
| 156 | + 0x14, sha3-512 |
| 157 | + 0x15, sha3-384 |
| 158 | + 0x16, sha3-256 |
| 159 | + 0x17, sha3-224 |
| 160 | + 0x18, shake-128 |
| 161 | + 0x19, shake-256 |
| 162 | + 0x40, blake2b |
| 163 | + 0x41, blake2s |
| 164 | + |
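The requirements on custom tables can be expressed as a small validation routine. This is a minimal Python sketch, assuming a table is held as a mapping from integer code to string name and that "superset" means every standard entry appears unchanged. The names `STANDARD_TABLE`, `is_custom_code`, and `check_custom_table`, and the entry `x-example`, are hypothetical and appear nowhere in the specification.

```python
# Standard Multihash Table as listed in this document: code -> name.
STANDARD_TABLE = {
    0x00: "identity", 0x11: "sha1", 0x12: "sha2-256", 0x13: "sha2-512",
    0x14: "sha3-512", 0x15: "sha3-384", 0x16: "sha3-256", 0x17: "sha3-224",
    0x18: "shake-128", 0x19: "shake-256", 0x40: "blake2b", 0x41: "blake2s",
}


def is_custom_code(code: int) -> bool:
    """True if code lies in the upper fourth of its varint range."""
    width = max(1, (code.bit_length() + 6) // 7)
    return (code >> (7 * width - 2)) == 0b11


def check_custom_table(table: dict) -> list:
    """Return a list of problems; an empty list means the table passed."""
    problems = []
    # Every standard entry must be present and unchanged.
    for code, name in STANDARD_TABLE.items():
        if table.get(code) != name:
            problems.append(f"standard entry 0x{code:02x} {name} missing or changed")
    # String names must be unique (dict keys already make codes unique).
    if len(set(table.values())) != len(table):
        problems.append("duplicate string names")
    # Extra entries must use reserved codes and names starting with x-.
    for code, name in table.items():
        if code in STANDARD_TABLE:
            continue
        if not is_custom_code(code):
            problems.append(f"0x{code:02x} is not in a range reserved for custom tables")
        if not name.startswith("x-"):
            problems.append(f"{name} does not begin with x-")
    return problems


custom = dict(STANDARD_TABLE)
custom[0x60] = "x-example"  # hypothetical custom function
```

Here `check_custom_table(custom)` returns an empty list, whereas adding an entry such as `0x20: "myhash"` would be reported twice: once for the code and once for the name.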
| 165 | +### Self-Describing Multihash Tables |
| 166 | + |
| 167 | +It is possible to define a multihash table using completely |
| 168 | +self-describing definitions, meaning that the hash function |
| 169 | +itself is stored not by name but as a code value. How to do |
| 170 | +that is left up to future specifications. If one is defined, |
| 171 | +this specification SHOULD be updated to link to it. |
| 172 | + |
| 173 | +## Specification of the multihash format |
| 174 | + |
| 175 | +Multihash is a self-describing format to store, transmit, and |
| 176 | +display hash function digest values. It is unambiguous, provides |
| 177 | +"algorithmic agility", prevents "algorithm lock-in", improves the |
| 178 | +security of hash value transmissions, and improves the usability |
| 179 | +of hash functions. |
| 180 | + |
| 181 | +The multihash format is very simple. It defines a single value |
| 182 | +type: a multihash value. A multihash value consists of three parts: |
| 183 | + |
| 184 | +1. Hash Function Code: an integer code representing a hash |
| 185 | + function from a pre-determined hash function table, expressed |
| 186 | + as a varint. |
| 187 | +2. Digest Length: the length of the hash function digest value, |
| 188 | + expressed as a varint. |
| 189 | +3. Digest Value: the hash function digest value, conforming to |
| 190 | + the Digest Length. |
| 191 | + |
| 192 | +These three parts are laid out in a byte sequence as follows: |
| 193 | + |
| 194 | + <hash-function-code><digest-length><digest-value> |
| 195 | + |
| 196 | +The multihash format defines two representations for a multihash |
| 197 | +value: |
| 198 | + |
| 199 | +1. Packed Representation: a compact representation for use |
| 200 | + on the wire, in storage, in displays, and in other identifiers. |
| 201 | +2. Explanation Representation: a larger, human readable |
| 202 | + representation to make distinguishing values easier on |
| 203 | + developers. |
| 204 | + |
| 205 | +Both representations keep the parts in the order above, changing |
| 206 | +only how each part is encoded. |
| 207 | + |
| 208 | +For all intents and purposes, the Packed Representation is the |
| 209 | +main contribution of this specification. The Explanation |
| 210 | +Representation is merely here to provide a single standard way to |
| 211 | +dump out values in more readable forms. |
| 212 | + |
| 213 | +### Packed Representation |
| 214 | + |
| 215 | +The Packed Representation is a compact representation optimized |
| 216 | +for transmitting, storing, displaying, and using multihashes. |
| 217 | +It is optimized to avoid wasting space. It is also tuned to be |
| 218 | +incorporated into other values, such as content-addressed store |
| 219 | +identifiers, URI/URLs, unix filesystem paths, and more. This is |
| 220 | +the primary representation of a multihash format. |
| 221 | + |
| 222 | +The Packed Representation is structured as follows: |
| 223 | + |
| 224 | + <hash-function-code-v><digest-length-v><digest-value-bin> |
| 225 | + |
| 226 | +Where: |
| 227 | +- `<hash-function-code-v>` is a varint storing the code of the hash function |
| 228 | + according to a pre-determined multihash table, usually the Standard |
| 229 | + Multihash Table. |
| 230 | +- `<digest-length-v>` is a varint storing the length of `<digest-value-bin>`, |
| 231 | + in bytes. |
| 232 | +- `<digest-value-bin>` is the hash function digest value, binary-packed. |
| 233 | + |
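The layout above can be illustrated with a short Python sketch. It is an assumption-laden example rather than a reference implementation: the helper names (`encode_varint`, `decode_varint`, `pack_multihash`, `unpack_multihash`) are hypothetical, the decoder applies no upper bound on varint size, and the code `0x12` for sha2-256 is taken from the Standard Multihash Table in this document.

```python
import hashlib
from typing import Tuple


def encode_varint(n: int) -> bytes:
    """Encode an unsigned integer, 7 bits per byte, low groups first."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # set the continuation bit
        n >>= 7
    out.append(n)
    return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode a varint at pos; return (value, position after it)."""
    value = shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:  # continuation bit clear: last byte
            return value, pos
        shift += 7


def pack_multihash(code: int, digest: bytes) -> bytes:
    """<hash-function-code-v><digest-length-v><digest-value-bin>"""
    return encode_varint(code) + encode_varint(len(digest)) + digest


def unpack_multihash(buf: bytes) -> Tuple[int, bytes]:
    """Split a packed multihash into (code, digest), checking the length."""
    code, pos = decode_varint(buf)
    length, pos = decode_varint(buf, pos)
    digest = buf[pos:pos + length]
    if len(digest) != length or pos + length != len(buf):
        raise ValueError("digest does not match digest length")
    return code, digest


digest = hashlib.sha256(b"multihash").digest()
packed = pack_multihash(0x12, digest)  # 0x12 = sha2-256
```

In this example `packed` is 34 bytes long: one byte for the code (`0x12`), one byte for the length (`0x20`, that is 32), and the 32-byte digest.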
| 234 | +#### Hash Function Code |
| 235 | + |
| 236 | +The Hash Function Code in the packed representation is an unsigned |
| 237 | +varint, following the format specified in section (TODO). This code |
| 238 | +references a pre-determined multihash table, usually the |
| 239 | +Standard Multihash Table maintained alongside this specification. |
| 240 | + |
| 241 | +It is possible for users to define and use their own tables, which |
| 242 | +MUST be compatible with the Standard Multihash Table, by using |
| 243 | +only the code ranges left undefined for this purpose. Section (TODO) |
| 244 | +explains the Standard Multihash Table and which ranges can be used. |
| 245 | + |
| 246 | +#### Digest Length in Bytes |
| 247 | + |
| 248 | +The `<digest-length-v>` counts bytes, not bits. This reduces the storage |
| 249 | +required for most hash function values: 256 and 512 bit lengths are |
| 250 | +represented as "32" and "64", using 1-byte varints instead of 2-byte |
| 251 | +varints. |
| 252 | + |
| 253 | +This size reduction exploits the facts that: (a) most common computer |
| 254 | +architectures use 8-bit bytes, (b) most networks transmit sequences of |
| 255 | +bytes, (c) most storage systems store sequences of bytes, and (d) most |
| 256 | +if not all widely used hash functions have standard digest lengths |
| 257 | +divisible by 8. |
| 258 | + |
| 259 | +It is possible that some hash functions are used with digests of lengths |
| 260 | +not evenly divisible by 8. In such rare cases, the function implementation |
| 261 | +with multihash should define a byte-aligned version, usually by adding a |
| 262 | +pre-determined amount of padding bits at the end of the value. Such |
| 263 | +padding transformations can be well-defined, and would likely have to |
| 264 | +exist to support such hash functions in current computer architectures, |
| 265 | +storage systems, and networks. Therefore this specification is comfortable |
| 266 | +leaving this up to the implementations and users. |
| 267 | + |
| 268 | +#### Digest Value |
| 269 | + |
| 270 | +The `<digest-value-bin>` is simply a binary-packed representation of the |
| 271 | +hash function value. This value ensures the entire Multihash Packed |
| 272 | +Representation is as compact as it could be, wasting no space to |
| 273 | +represent the digest value. |
| 274 | + |
| 275 | +Base encoding for strings must be performed around the whole multihash |
| 276 | +value, not around the digest value alone. |
| 277 | + |
| 278 | +### Explanation Representation |
| 279 | + |
| 280 | +The Explanation Representation is a well defined way of representing |
| 281 | +multihash values in a form optimized for human readability. It is |
| 282 | +explicitly wasteful, and not meant to be used as an identifier in systems. |
| 283 | + |
| 284 | +The Explanation Representation is structured as follows: |
| 285 | + |
| 286 | + <hash-function-name>.<digest-length>.<digest-value-hex> |
| 287 | + |
| 288 | +Where: |
| 289 | +- `<hash-function-name>` is the string name of the hash function |
| 290 | + according to a pre-determined multihash table, usually the |
| 291 | + Standard Multihash Table. |
| 292 | +- `<digest-length>` is a decimal number storing the length of |
| 293 | + the digest value, in bytes. |
| 294 | +- `<digest-value-hex>` is the hash function digest value, hex-encoded. |
| 295 | +- `.` is a delimiter for the values. |
| 296 | + |
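A debugging helper for this representation can be very small. The Python sketch below assumes lowercase hex output and relies on table names never containing `.`. The function names `explain` and `parse_explanation` are hypothetical.

```python
import hashlib
from typing import Tuple


def explain(name: str, digest: bytes) -> str:
    """Render <hash-function-name>.<digest-length>.<digest-value-hex>."""
    return f"{name}.{len(digest)}.{digest.hex()}"


def parse_explanation(text: str) -> Tuple[str, bytes]:
    """Parse the form produced by explain() and check the stated length."""
    name, length_text, hex_text = text.split(".")
    digest = bytes.fromhex(hex_text)
    if int(length_text) != len(digest):
        raise ValueError("digest length does not match digest value")
    return name, digest


# sha2-256 digest of the empty byte string, for debugging output only.
example = explain("sha2-256", hashlib.sha256(b"").digest())
```

The resulting string begins with `sha2-256.32.e3b0c442`, which is far easier to read in a log than the packed bytes, but also noticeably longer.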
| 297 | +This representation SHOULD NOT be used to transmit, store, or embed a |
| 298 | +multihash value. It SHOULD only be used for debugging. |
| 299 | + |
| 300 | +Even when creating human-oriented string identifiers, it is strongly |
| 301 | +RECOMMENDED to use the packed representation of multihash, possibly |
| 302 | +encoded in a convenient base. This ensures the whole multihash value |
| 303 | +is as compact as it can be, and does not accidentally impose other |
| 304 | +requirements on the transmission of the value. It is much easier for |
| 305 | +systems to deal with a binary value that can be easily encoded in a |
| 306 | +variety of bases. Using the Explanation Representation for storage, |
| 307 | +transmission, embedding in other identifiers, or anything other |
| 308 | +than debugging, defeats the purpose of multihash. |
| 309 | + |
| 310 | +### MSB Unsigned Varints - muvints |
| 311 | + |
| 312 | +Multihash uses muvints, Most-significant-bit Unsigned Variable |
| 313 | +INTegerS. These are in use by other multiformats. Their definition |
| 314 | +is summarized here for completeness. |
| 315 | + |
| 316 | +Unsigned: muvints are unsigned integers. There is no need for |
| 317 | + distinguishing negative integers. |
| 318 | + |
| 319 | +Varints: muvints are variable integers, with no limit. |
| 320 | + |
| 321 | +MSB continuation: muvints use the Most Significant Bit of every byte |
| 322 | + to represent a continuation bit. This type of varint is optimized |
| 323 | + for space and reads of small numbers, not for reads of very large |
| 324 | + numbers (128-bit ints and beyond). |
| 325 | + |
| 326 | +Little-Endian: muvints are based on Protocol Buffers varints, and |
| 327 | + are thus little-endian, meaning the least significant 7-bit groups |
| 328 | + are encoded first. |
| 329 | + |
| 330 | +Examples: |
| 331 | + |
| 332 | + # decimal muvint bytes |
| 333 | + 127 01111111 |
| 334 | + 128 10000000 00000001 |
| 335 | + 256 10000000 00000010 |
| 336 | + 1024 10000000 00001000 |
| 337 | + 16384 10000000 10000000 00000001 |
| 338 | + |
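The examples above can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the names `encode_muvint`, `decode_muvint`, and `bits` are hypothetical, and the `max_bytes` bound of 9 is an arbitrary illustration of the kind of limit discussed under security considerations, not a value taken from this specification.

```python
from typing import Tuple


def encode_muvint(n: int) -> bytes:
    """Encode n with 7 payload bits per byte, least significant group first."""
    if n < 0:
        raise ValueError("muvints are unsigned")
    out = bytearray()
    while True:
        group = n & 0x7F
        n >>= 7
        if n:
            out.append(group | 0x80)  # more bytes follow
        else:
            out.append(group)  # last byte: continuation bit clear
            return bytes(out)


def decode_muvint(buf: bytes, offset: int = 0, max_bytes: int = 9) -> Tuple[int, int]:
    """Decode one muvint; return (value, offset of the next unread byte).

    max_bytes is an assumed implementation limit, not part of the format.
    """
    value = 0
    for i in range(max_bytes):
        if offset + i >= len(buf):
            raise ValueError("truncated muvint")
        byte = buf[offset + i]
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            return value, offset + i + 1
    raise ValueError("muvint exceeds max_bytes")


def bits(data: bytes) -> str:
    """Render bytes as space-separated binary octets, as in the table above."""
    return " ".join(f"{b:08b}" for b in data)
```

For instance, `bits(encode_muvint(16384))` yields `10000000 10000000 00000001`, matching the last row of the table.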
| 339 | +## Encoding multihashes |
| 340 | + |
| 341 | +Multihashes are designed to be used as values in a variety of |
| 342 | +mediums. They will often need to be encoded in other bases. |
| 343 | +In particular, multihash endeavors to keep the function code |
| 344 | +and length in the same base encoding as the digest value, which |
| 345 | +is important to many applications that must treat hash digest |
| 346 | +values opaquely, or that may have base encoding restrictions. |
| 347 | + |
| 348 | +Binary: It is RECOMMENDED that multihash values are stored |
| 349 | + and transmitted on the wire as binary packed values wherever |
| 350 | + possible. This will ensure the hash digests take up as little |
| 351 | + space as possible. |
| 352 | + |
| 353 | +Copiable: It is RECOMMENDED that multihash values are displayed |
| 354 | + to users in a "copiable" form, that is in a form easy to select, |
| 355 | + copy, and paste, which typically means in base16, base32, or base58. |
| 356 | + |
| 357 | +Multibase: Multihash pairs well with Multibase, a standard for |
| 358 | + self-describing base encodings. This way, a multihash value can |
| 359 | + be stored, transmitted, or displayed in any base without any |
| 360 | + ambiguity. |
| 361 | + |
| 362 | + |
| 363 | +## Considerations |
| 364 | + |
| 365 | +### Implementation considerations |
| 366 | + |
| 367 | +TODO |
| 368 | + |
| 369 | +### IANA considerations |
| 370 | + |
| 371 | +It is RECOMMENDED that IANA host the Standard Multihash Table. |
| 372 | + |
| 373 | +### Security considerations |
| 374 | + |
| 375 | +It is RECOMMENDED that implementations establish a reasonable |
| 376 | +upper bound on varint sizes to avoid allocating large buffers, |
| 377 | +or potential buffer overflows. An appropriate limit changes over |
| 378 | +time, depending on the size of the tables and common sizes for |
| 379 | +the digests of commonly used hash functions. Such a limit is |
| 380 | +explicitly left out of this specification as it is liable to be |
| 381 | +an incorrect choice as time passes. |
| 382 | + |
| 383 | +## Acknowledgements |
| 384 | + |
| 385 | +Special thanks to the following people for helping to define, |
| 386 | +implement, review, and extend multihash: |
| 387 | + |
| 388 | +TODO list contributors |
| 389 | + |
| 390 | + |
| 391 | +## References |
| 392 | + |
| 393 | +TODO |
| 394 | + |