The CID Decoding Algorithm contains a step, "1. If it's a string (ASCII/UTF-8):". This is not sufficiently specified, and is subject to false positives.
The first problem concerns the term "UTF-8", and the lack of an algorithm for determining whether the target byte sequence "is a string (in UTF-8)". UTF-8 strings may consist of byte values less than, greater than, or equal to 0x80, but not all sequences of byte values are valid UTF-8. The Unicode Standard, Section 3.9, Unicode Encoding Forms, goes into the details of what is valid and what is not, but some of the ins and outs are non-obvious, and probably undesirable in a CID decoding algorithm. If the intention is that the test should be "1. if the target byte sequence consists of byte values from 0x21 to 0x7F inclusive", then say that. If the intention is that the test should be "1. if the target byte sequence is a valid base58btc string", then say that instead.
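To make the gap concrete, here is a minimal sketch (not from the spec; the function names are mine) of two candidate readings of the test, which disagree on many inputs:

```python
def is_printable_ascii(data: bytes) -> bool:
    """Reading 1: every byte value is in the range 0x21 to 0x7F inclusive."""
    return len(data) > 0 and all(0x21 <= b <= 0x7F for b in data)

def is_valid_utf8(data: bytes) -> bool:
    """Reading 2: the bytes form a valid UTF-8 sequence per Unicode Section 3.9."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# The two readings disagree:
assert is_valid_utf8("héllo".encode("utf-8"))           # multibyte UTF-8 passes reading 2...
assert not is_printable_ascii("héllo".encode("utf-8"))  # ...but fails reading 1
assert is_valid_utf8(bytes([0x01, 0x02]))               # control bytes are valid UTF-8...
assert not is_printable_ascii(bytes([0x01, 0x02]))      # ...yet are clearly not "text" here
```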
The second problem is that any byte sequence which can be interpreted as a valid ASCII string, or as a valid UTF-8 string, is still also a valid binary byte sequence. Thus the decoding algorithm does not rule out the case where someone constructs a CID which happens to consist of a sequence of bytes with values from 0x21 to 0x7F inclusive. Such a CID might accidentally be parsed according to case 1 of the algorithm, when it should be processed according to case 2. If there is something about the CID structure which means that a binary CID will never pass the test of case 1, it would be clearer for the documentation to say so.
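As an illustration (the bytes below are arbitrary and not claimed to be a valid CID), a binary payload can pass the case-1 test purely by accident:

```python
# Arbitrary bytes intended as binary data, with every value in 0x21..0x7F:
payload = bytes([0x7A, 0x55, 0x4D, 0x24, 0x6E, 0x41])

assert all(0x21 <= b <= 0x7F for b in payload)  # passes a "looks like text" test
print(payload.decode("ascii"))                  # prints "zUM$nA", so case 1 would claim it
```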
A third problem is ambiguity in what a decoder should do if it attempts to decode a string according to case 1, but the contents are not valid according to the base58btc or multibase specifications. Should the decoder treat the string as a binary byte sequence and attempt to decode it that way, or should the decoding attempt fail with an error?
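The two plausible resolutions look quite different in code. The sketch below assumes hypothetical helpers (`base58btc_decode`, `decode_binary_cid`); neither the helpers nor either policy is taken from the spec:

```python
BASE58_ALPHABET = set("123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz")

def base58btc_decode(text: str) -> bytes:
    # Stand-in for a real base58btc decoder; only the validity check is sketched.
    if not text or not set(text) <= BASE58_ALPHABET:
        raise ValueError("not a valid base58btc string")
    raise NotImplementedError("actual decoding elided")

def decode_binary_cid(data: bytes):
    raise NotImplementedError("binary CID decoding elided")

def decode_strict(data: bytes):
    """Option A: a case-1 string that fails base58btc decoding is an error."""
    if all(0x21 <= b <= 0x7F for b in data):
        return base58btc_decode(data.decode("ascii"))  # no fallback; error propagates
    return decode_binary_cid(data)

def decode_with_fallback(data: bytes):
    """Option B: on failure, retry the bytes under the case-2 (binary) rule."""
    if all(0x21 <= b <= 0x7F for b in data):
        try:
            return base58btc_decode(data.decode("ascii"))
        except ValueError:
            pass  # not valid text after all; fall through to binary
    return decode_binary_cid(data)
```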
I am new to the CID design, but I am a software engineer with experience working with text-based formats, including UTF-8. These sorts of ambiguities can cause real problems. It may be that they are made clear elsewhere; this issue reflects my naive reading of the CID Decoding Algorithm in isolation. I suggest that the algorithm should not be ambiguous even when read in isolation.
Suggested rewording of the spec to address the first problem above:
The algorithm to decode a byte sequence into a CID is as follows.
- If the bytes in the sequence, when interpreted as a UTF-8 string, consist only of the characters of the Base58 alphabet ("123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"), then …
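Under that rewording, the case-1 test becomes a simple, deterministic alphabet check, e.g. (function name hypothetical):

```python
BASE58_ALPHABET = frozenset(
    b"123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
)

def is_base58_text(data: bytes) -> bool:
    """True iff every byte is a character of the Base58 alphabet."""
    return len(data) > 0 and all(b in BASE58_ALPHABET for b in data)

assert is_base58_text(b"QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG")
assert not is_base58_text(b"zUM$nA")          # '$' is outside the alphabet
assert not is_base58_text(bytes([0x01, 0x55]))  # bytes outside the alphabet fail
```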