Consistently raise UnicodeConversionError in tokenizer #14589

lukaszsamson · 2025-06-18T12:02:13Z

I noticed that code with invalid UTF-8 escapes crashes with badarg.
Repro:

:"\x963"

or

C.\"j\\x963<?j[On%K^!q5;V[`iU.WEI[\\<5\" )Mm0l@

** (ArgumentError) errors were found at the given arguments:

  * 1st argument: invalid UTF8 encoding

    :erlang.binary_to_atom(<<150, 51>>, :utf8)
    (elixir 1.18.4) src/elixir_tokenizer.erl:1021: :elixir_tokenizer.unsafe_to_atom/4
    (elixir 1.18.4) src/elixir_tokenizer.erl:507: :elixir_tokenizer.tokenize/5

Other cases dealing with invalid UTF-8 tend to raise UnicodeConversionError. I'm not convinced this is the best approach, though. Maybe the tokenizer should return errors instead of raising.

josevalim · 2025-06-18T12:11:25Z

Can you please include tests? code_test.exs is a good candidate; you can test them by calling Code.string_to_quoted with a minimal string. Thank you!

lukaszsamson · 2025-06-18T12:15:23Z

Sure

josevalim · 2025-06-19T19:00:49Z

💚 💙 💜 💛 ❤️

lukaszsamson · 2025-06-19T19:57:09Z

@josevalim I'm wondering if the tokenizer should raise here or simply return an error tuple.

josevalim · 2025-06-19T23:20:26Z

As far as I understand, we are already raising in other places, right? That's why I went ahead and merged it, but left a mental note to revisit this for consistency.

Consistently raise UnicodeConversionError in tokenizer

3cf3e86

add tests

8783aa9

josevalim merged commit 6ca0ad8 into elixir-lang:main Jun 19, 2025
13 checks passed

josevalim pushed a commit that referenced this pull request Jun 19, 2025

Consistently raise UnicodeConversionError in tokenizer (#14589)

711008a

lukaszsamson mentioned this pull request Jul 23, 2025

Return error on invalid unicode sequences #14666

Merged

ggVGc pushed a commit to ggVGc/elixir-verbatim that referenced this pull request Sep 12, 2025

Consistently raise UnicodeConversionError in tokenizer (elixir-lang#14589)

fa49fdc
