@lukaszsamson
Contributor

I noticed that code with invalid UTF8 escapes crashes the tokenizer with badarg.
Repro:

:"\x963"

or

C.\"j\\x963<?j[On%K^!q5;V[`iU.WEI[\\<5\" )Mm0l@
** (ArgumentError) errors were found at the given arguments:

  * 1st argument: invalid UTF8 encoding

    :erlang.binary_to_atom(<<150, 51>>, :utf8)
    (elixir 1.18.4) src/elixir_tokenizer.erl:1021: :elixir_tokenizer.unsafe_to_atom/4
    (elixir 1.18.4) src/elixir_tokenizer.erl:507: :elixir_tokenizer.tokenize/5

Other cases dealing with invalid UTF8 tend to raise UnicodeConversionError. I'm not convinced this is the best approach, though. Maybe the tokenizer should return errors instead of raising.

@josevalim
Member

Can you please include tests? code_test.exs is a good candidate; you can test these cases by calling Code.string_to_quoted with a minimal string. Thank you!
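A minimal sketch of the kind of check suggested here, using the `:"\x963"` repro from the report. The outcome after the fix is an assumption: the thread only says other invalid UTF8 cases tend to raise UnicodeConversionError, so the snippet accepts either exception or an error tuple rather than asserting one of them.

```elixir
# ~S keeps the backslash literal, so the tokenizer itself sees the \x96 escape.
source = ~S(:"\x963")

result =
  try do
    # May return {:error, _} or raise, depending on the Elixir version.
    Code.string_to_quoted(source)
  rescue
    # On 1.18.4 the report shows ArgumentError; UnicodeConversionError
    # after the fix is an assumption, not confirmed by the thread.
    e in [ArgumentError, UnicodeConversionError] -> {:raised, e.__struct__}
  end

IO.inspect(result)
```

In an actual test in code_test.exs this would more likely be an `assert_raise` on whichever exception the tokenizer is expected to produce.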

@lukaszsamson
Contributor Author

Sure

@josevalim josevalim merged commit 6ca0ad8 into elixir-lang:main Jun 19, 2025
13 checks passed
@josevalim
Member

💚 💙 💜 💛 ❤️

@lukaszsamson
Contributor Author

@josevalim I'm wondering whether the tokenizer should raise here or simply return an error tuple.

@josevalim
Member

As far as I understand, we are already raising in other places, right? That's why I went ahead and merged it, but I left a mental note to revisit this for consistency.
