UTF-8 source files with a BOM are handled incorrectly #5391

@lucymcphail

The compiler cannot currently parse source files encoded in UTF-8 which begin with a byte order mark (BOM). To reproduce, create a file with these contents:

00000000: efbb bf69 6d70 6f72 7420 676c 0000 0000  ...import gl....
00000010: 6561 6d2f 696f 0a0a 7075 6220 666e 206d  eam/io..pub fn m
00000020: 6169 6e28 2920 2d3e 204e 696c 207b 0a20  ain() -> Nil {.
00000030: 2069 6f2e 7072 696e 746c 6e28 2254 6869   io.println("Thi
00000040: 7320 6669 6c65 2064 6f65 736e 2774 2063  s file doesn't c
00000050: 6f6d 7069 6c65 2229 0a7d 0a              ompile").}.

and then run

xxd -r <file-you-just-created> > src/wont_compile.gleam

inside a fresh Gleam project. As of Gleam 1.14.0 (running here on Debian trixie), this gives the following unclear error message:

error: Syntax error
  ┌─ <path-to-project>/src/wont_compile.gleam:1:1
  │
1 │ import gleam/io
  │  I can't figure out what to do with this character

Hint: Is it a typo?

I can see two possible fixes:

  1. add a specific error message for when a BOM is detected, with instructions to remove it (perhaps with the formatter); or
  2. gracefully handle the BOM and continue parsing as normal.
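
Either fix starts from the same check: does the source text begin with U+FEFF? Here is a minimal sketch of that check in Rust, assuming the source has already been decoded into a `&str`. The names `UTF8_BOM` and `split_bom` are hypothetical and are not taken from the Gleam compiler.

```rust
/// U+FEFF, which UTF-8 encodes as the three bytes EF BB BF.
/// (Hypothetical constant name, for illustration only.)
const UTF8_BOM: &str = "\u{FEFF}";

/// Hypothetical helper: splits an optional leading BOM off the source text.
/// Returns whether a BOM was present, plus the source without it.
fn split_bom(source: &str) -> (bool, &str) {
    match source.strip_prefix(UTF8_BOM) {
        // Fix 1 could raise a dedicated "remove the BOM" error here;
        // fix 2 could pass `rest` on to the lexer and continue.
        Some(rest) => (true, rest),
        None => (false, source),
    }
}
```

With fix 2, any error positions computed as byte offsets into the stripped text would presumably need shifting by 3 bytes to line up with the original file.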

According to the Unicode Standard, a UTF-8 file which begins with a BOM is valid:

The UTF-8 encoding scheme permits, but does not require, a BOM to be present.

Unicode Standard version 17.0.0, pg. 1166

Furthermore, many pieces of software on Windows will add a BOM to any UTF-8 files they touch, so raising an error when a BOM is detected could cause frustration among Windows users.

On the other hand, several pieces of software on other OSs don't know how to handle a BOM properly. For example, here's the file from above opened in Neovim on Linux:

[Screenshot: the file from above opened in Neovim]

so it might be better to encourage removal of the BOM from all Gleam files if we want to ensure a good experience for users of any OS.

For a very brief exploration of prior art:

  • Python will happily run a file that starts with a BOM, but it seems to break the parsing in strange ways
  • Rust compiles and runs just as expected with or without a BOM. For some reason, Rust source files also open just fine in my editor, without the ^@^@^@^@ that appears in the screenshot above. Both files open with :set encoding=utf-8, and Emacs displays the same behaviour across both files, so something else must be going on here that might be worth looking into.
