atdgen-ocaml: utf-8 Vs byte-array strings

(I have the impression this should be an FAQ, but I could not find any discussion of it.)

Atdgen maps ATD “strings” to JSON strings, which are supposed to be valid Unicode (UTF-8 in practice), and also directly to OCaml `string` values, which can be arbitrary byte sequences.

- This makes it very easy to generate invalid JSON that other parsers then reject. For example, this [Gist](https://gist.github.com/smondet/88fe66d2f8d79a69c69b0aea8c078431) shows [Jsonm](https://opam.ocaml.org/packages/jsonm/) failing with `"illegal bytes in character stream"` while the round trip `J.string_of_t0 |> J.t0_of_string` succeeds.
- The “data-encoding” world often uses the following as its default solution for byte arrays: <https://gitlab.com/nomadic-labs/data-encoding/-/blob/master/src/json.ml#L125-L145> → if a string is not valid UTF-8, it becomes an array of ints.
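
To make the second point concrete, here is a minimal sketch of that kind of fallback. It is not the data-encoding implementation: the `json` type is a stand-in polymorphic variant, the names `json_of_bytes` and `bytes_of_json` are made up for illustration, and `String.is_valid_utf_8` assumes OCaml 4.14 or later.

```ocaml
(* Sketch only: [json] is a stand-in type, not the data-encoding or
   Yojson API. Requires OCaml >= 4.14 for String.is_valid_utf_8. *)
type json = [ `String of string | `List of json list | `Int of int ]

(* Valid UTF-8 stays a JSON string; anything else becomes an array of
   byte values in the range 0..255. *)
let json_of_bytes (s : string) : json =
  if String.is_valid_utf_8 s then `String s
  else
    `List
      (List.map (fun c -> `Int (Char.code c)) (List.of_seq (String.to_seq s)))

(* Inverse direction: accept either representation, reject the rest. *)
let bytes_of_json (j : json) : string option =
  match j with
  | `String s -> Some s
  | `Int _ -> None
  | `List items ->
    let buf = Buffer.create (List.length items) in
    let ok =
      List.for_all
        (function
          | `Int n when n >= 0 && n <= 255 ->
            Buffer.add_char buf (Char.chr n);
            true
          | _ -> false)
        items
    in
    if ok then Some (Buffer.contents buf) else None
```

One consequence of this design is that the reader must accept two shapes for the same field, so consumers in other languages need to know about the convention.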

Should `Mod_j` functions have the option of failing earlier when an input string is not valid UTF-8? (I guess that would mean having default or first-class-citizen `validator` entries? `-j-pp` seems to only work in one direction.)
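
As a rough illustration of what “failing earlier” could look like from the user side today, here is a sketch of a wrapper around a writer. It does not use any atdgen feature: `checked_writer` is a hypothetical helper, `string_of_t0` is a toy stand-in for a generated function such as `J.string_of_t0`, and the approach assumes the writer copies non-ASCII bytes through unchanged (which is what the Gist suggests). `String.is_valid_utf_8` needs OCaml 4.14 or later.

```ocaml
(* Requires OCaml >= 4.14 for String.is_valid_utf_8. *)

(* Wrap any JSON writer so that invalid UTF-8 in its output is reported
   as an error instead of being returned to the caller. *)
let checked_writer (write : 'a -> string) (x : 'a) : (string, string) result =
  let out = write x in
  if String.is_valid_utf_8 out then Ok out
  else Error "serialized JSON is not valid UTF-8"

(* Toy stand-in for a generated writer: it copies the bytes between
   quotes without checking or escaping them. *)
let string_of_t0 (s : string) : string = "\"" ^ s ^ "\""

let () =
  match checked_writer string_of_t0 "\xff\xfe" with
  | Ok json -> print_endline json
  | Error msg -> prerr_endline msg
```

Checking the whole serialized output is coarse: it tells you that something is wrong but not which field, which is where per-type validators would do better.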

Does it make sense to add a `byte-array` core type to ATD?

Many tools already [just don't care](https://unix.stackexchange.com/questions/757832/how-to-process-json-with-strings-containing-invalid-utf-8). Should this just be documented properly somewhere?

Right now the ATD definition [doc](https://atd.readthedocs.io/en/latest/atd-language.html#predefined-type-names) just says “Sequence of bytes or characters” …


atdgen-ocaml: utf-8 Vs byte-array strings #415
