The CID Decoding Algorithm contains a step, "1. If it's a string (ASCII/UTF-8):". This is not sufficiently specified, and is subject to false positives.
The first problem concerns the term "UTF-8", and the lack of an algorithm for determining whether the target byte sequence "is a string (in UTF-8)". UTF-8 strings may consist of byte values less than, greater than, or equal to 0x80, but not all sequences of byte values are valid UTF-8. The Unicode Standard, Section 3.9, Unicode Encoding Forms, goes into the details of what is valid and what is not, but some of the ins and outs are non-obvious, and probably undesirable in a CID decoding algorithm. If the intention is that the test should be "1. if the target byte sequence consists of byte values from 0x21 to 0x7F inclusive", then say that. If the intention is that the test should be "1. if the target byte sequence is a valid base58btc string", then say that instead.
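To make the gap concrete, here is a minimal sketch (not from the spec; the function names are mine) of two candidate readings of the test, which disagree on many inputs:

```python
def is_printable_ascii(data: bytes) -> bool:
    """Reading 1: every byte value is in the range 0x21 to 0x7F inclusive."""
    return len(data) > 0 and all(0x21 <= b <= 0x7F for b in data)

def is_valid_utf8(data: bytes) -> bool:
    """Reading 2: the bytes form a valid UTF-8 sequence per Unicode Section 3.9."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# The two readings disagree:
assert is_valid_utf8("héllo".encode("utf-8"))           # multibyte UTF-8 passes reading 2...
assert not is_printable_ascii("héllo".encode("utf-8"))  # ...but fails reading 1
assert is_valid_utf8(bytes([0x01, 0x02]))               # control bytes are valid UTF-8...
assert not is_printable_ascii(bytes([0x01, 0x02]))      # ...yet are clearly not "text" here
```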
The second problem is that any byte sequence which can be interpreted as a valid ASCII string, or as a valid UTF-8 string, is still also a valid binary byte sequence. Thus the decoding algorithm does not rule out the case where someone constructs a CID which happens to consist of a sequence of bytes with values from 0x21 to 0x7F inclusive. Such a CID might accidentally be parsed according to case 1 of the algorithm, when it should be processed according to case 2. If there is something about the CID structure which means that a binary CID will never pass the test of case 1, it would be clearer for the documentation to say so.
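As an illustration (the bytes below are arbitrary and not claimed to be a valid CID), a binary payload can pass the case-1 test purely by accident:

```python
# Arbitrary bytes intended as binary data, with every value in 0x21..0x7F:
payload = bytes([0x7A, 0x55, 0x4D, 0x24, 0x6E, 0x41])

assert all(0x21 <= b <= 0x7F for b in payload)  # passes a "looks like text" test
print(payload.decode("ascii"))                  # prints "zUM$nA", so case 1 would claim it
```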
A third problem is ambiguity in what a decoder should do if it attempts to decode a string according to case 1, but the contents are not valid according to the base58btc or multibase specifications. Should the decoder treat the string as a binary byte sequence and attempt to decode it that way, or should the decoding attempt fail with an error?
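The two plausible resolutions look quite different in code. The sketch below assumes hypothetical helpers (`base58btc_decode`, `decode_binary_cid`); neither the helpers nor either policy is taken from the spec:

```python
BASE58_ALPHABET = set("123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz")

def base58btc_decode(text: str) -> bytes:
    # Stand-in for a real base58btc decoder; only the validity check is sketched.
    if not text or not set(text) <= BASE58_ALPHABET:
        raise ValueError("not a valid base58btc string")
    raise NotImplementedError("actual decoding elided")

def decode_binary_cid(data: bytes):
    raise NotImplementedError("binary CID decoding elided")

def decode_strict(data: bytes):
    """Option A: a case-1 string that fails base58btc decoding is an error."""
    if all(0x21 <= b <= 0x7F for b in data):
        return base58btc_decode(data.decode("ascii"))  # no fallback; error propagates
    return decode_binary_cid(data)

def decode_with_fallback(data: bytes):
    """Option B: on failure, retry the bytes under the case-2 (binary) rule."""
    if all(0x21 <= b <= 0x7F for b in data):
        try:
            return base58btc_decode(data.decode("ascii"))
        except ValueError:
            pass  # not valid text after all; fall through to binary
    return decode_binary_cid(data)
```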
I am new to the CID design, but I am a software engineer with experience working with text-based formats, including UTF-8. These sorts of ambiguities can cause real problems. It may be that they are made clear elsewhere; this issue reflects my naive reading of the CID Decoding Algorithm in isolation. I suggest that the algorithm should not be ambiguous even when read in isolation.
Suggested rewording of the spec to address the first problem above:
The algorithm to decode a byte sequence into a CID is as follows.
- If the bytes in the sequence, when interpreted as a UTF-8 string, consist only of the characters of the Base58 alphabet ("123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"), then …
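Under that rewording, the case-1 test becomes a simple, deterministic alphabet check, e.g. (function name hypothetical):

```python
BASE58_ALPHABET = frozenset(
    b"123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
)

def is_base58_text(data: bytes) -> bool:
    """True iff every byte is a character of the Base58 alphabet."""
    return len(data) > 0 and all(b in BASE58_ALPHABET for b in data)

assert is_base58_text(b"QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG")
assert not is_base58_text(b"zUM$nA")          # '$' is outside the alphabet
assert not is_base58_text(bytes([0x01, 0x55]))  # bytes outside the alphabet fail
```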