Description
The compiler cannot currently parse source files encoded in UTF-8 which begin with a byte order mark (BOM). To reproduce, create a file with these contents:
00000000: efbb bf69 6d70 6f72 7420 676c 0000 0000  ...import gl....
00000010: 6561 6d2f 696f 0a0a 7075 6220 666e 206d  eam/io..pub fn m
00000020: 6169 6e28 2920 2d3e 204e 696c 207b 0a20  ain() -> Nil {. 
00000030: 2069 6f2e 7072 696e 746c 6e28 2254 6869   io.println("Thi
00000040: 7320 6669 6c65 2064 6f65 736e 2774 2063  s file doesn't c
00000050: 6f6d 7069 6c65 2229 0a7d 0a              ompile").}.
and then run
xxd -r <file-you-just-created> > src/wont_compile.gleam
inside a fresh gleam project. As of Gleam 1.14.0 (running here on Debian trixie), this gives the following unclear error message:
error: Syntax error
┌─ <path-to-project>/src/wont_compile.gleam:1:1
│
1 │ import gleam/io
│ I can't figure out what to do with this character
Hint: Is it a typo?
I can see two possible fixes:
- add a specific error message for when a BOM is detected, with instructions to remove it (perhaps with the formatter); or
- gracefully handle the BOM and continue parsing as normal.
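To make the two options concrete, here is a minimal Rust sketch of each, applied to the source text before it reaches the lexer. The names `UTF8_BOM`, `BomError`, `reject_bom` and `strip_bom` are hypothetical and are not taken from the Gleam compiler; where such a check would actually live is left open.

```rust
/// U+FEFF, which UTF-8 encodes as the three bytes EF BB BF.
const UTF8_BOM: &str = "\u{feff}";

/// Hypothetical error for the "dedicated error message" option.
#[derive(Debug, PartialEq)]
struct BomError {
    /// Length of the BOM in bytes, so a diagnostic could point at it.
    byte_len: usize,
}

/// Option 1: refuse a leading BOM with a specific error instead of a
/// generic syntax error.
fn reject_bom(src: &str) -> Result<&str, BomError> {
    if src.starts_with(UTF8_BOM) {
        Err(BomError {
            byte_len: UTF8_BOM.len(),
        })
    } else {
        Ok(src)
    }
}

/// Option 2: skip a leading BOM and hand the rest of the source to the
/// parser unchanged.
fn strip_bom(src: &str) -> &str {
    src.strip_prefix(UTF8_BOM).unwrap_or(src)
}
```

Either function only looks at the very start of the input, so a U+FEFF that appears later in the file would still be left for the lexer to deal with.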
According to the Unicode Standard, a UTF-8 file which begins with a BOM is valid:
The UTF-8 encoding scheme permits, but does not require, a BOM to be present.
— Unicode Standard version 17.0.0, pg. 1166
Furthermore, many pieces of software on Windows add a BOM to any UTF-8 file they touch, so raising an error when a BOM is detected could cause frustration among Windows users.
On the other hand, several pieces of software on other OSs don't know how to handle a BOM properly. For example, here's the file from above opened in Neovim on Linux:
[Screenshot: the file open in Neovim, with ^@^@^@^@ displayed]
so it might be better to encourage removal of the BOM from all Gleam files if we want to ensure a good experience for users of any OS.
For a very brief exploration of prior art:
- Python will happily run a file that starts with a BOM, but it seems to break the parsing in strange ways
- Rust compiles and runs just as expected with or without a BOM, and for some reason Rust source files open just fine in my editor, without the ^@^@^@^@ that appears in the screenshot above? Both files open with :set encoding=utf-8, and Emacs displays the same behaviour across both files, so something else must be going on here that might be worth looking into.
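Since the same editor settings give different results for the two files, one way to narrow this down might be to compare their raw bytes. The following is a small Rust sketch, with the hypothetical names `ByteReport` and `inspect`, that reports whether a buffer starts with the UTF-8 BOM and where any NUL bytes occur (Vim-family editors display a NUL byte as ^@).

```rust
/// Summary of the bytes that are most likely to confuse an editor.
#[derive(Debug, PartialEq)]
struct ByteReport {
    /// True if the buffer starts with EF BB BF.
    has_utf8_bom: bool,
    /// Byte offsets of every NUL (0x00) in the buffer.
    nul_offsets: Vec<usize>,
}

/// Inspect raw file contents without assuming they are valid UTF-8.
fn inspect(bytes: &[u8]) -> ByteReport {
    ByteReport {
        has_utf8_bom: bytes.starts_with(&[0xEF, 0xBB, 0xBF]),
        nul_offsets: bytes
            .iter()
            .enumerate()
            .filter(|(_, b)| **b == 0)
            .map(|(i, _)| i)
            .collect(),
    }
}
```

Fed the first 16 bytes of the hex dump in the description, this sketch would report a BOM and NUL bytes at offsets 12 to 15 (the `0000 0000` group after `import gl`), which may be worth ruling out before attributing the editor display to the BOM alone.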