Skip to content

Conversation

@lukaszsamson
Copy link
Contributor

I revisited #14589 and attempted to make the tokenizer and parser API more consistent. Code.string_to_quoted should not raise for valid string input. If the string has parsing errors it should always return an error tuple.

I collected all the cases where invalid UTF byte sequence may be used

cases = [
  # binstring
  "\"\\xFF\"",
  "\"\\xFF\#{some()}\"",
  # charlist
  "'\\xFF'",
  "'\\xFF\#{some()}'",
  # bin heredoc
  "\"\"\"\n\\xFF\n\"\"\"",
  "\"\"\"\n\\xFF\#{some()}\n\"\"\"",
  # list heredoc
  "'''\n\\xFF\n'''",
  "'''\n\\xFF\#{some()}\n'''",
  # quoted atom
  ":\"\\xFF\"",
  ":'\\xFF'",
  ":\"\\xFF\#{some()}\"",
  ":'\\xFF\#{some()}'",
  # quoted keyword identifier
  "[\"\\xFF\": 1]",
  "['\\xFF': 1]",
  "[\"\\xFF\#{some()}\": 1]",
  "['\\xFF\#{some()}': 1]",
  # quoted dot call
  "Foo.\"\\xFF\"",
  "Foo.'\\xFF'",
  "Foo.\"\\xFF\#{some()}\"",
  "Foo.'\\xFF\#{some()}'",
  # sigil
  "~s\"\\xFF\"",
  "~s'\\xFF'",
  "~s\"\\xFF\#{some()}\"",
  "~s'\\xFF\#{some()}'",
  "~S\"\\xFF\"",
  "~S'\\xFF'",
  "~S\"\\xFF\#{some()}\"",
  "~S'\\xFF\#{some()}'",
]

Before this PR the tokenizer used to raise on quoted atom/kw/dot call. The parser used to raise on list string and list heredoc.

Note that invalid byte sequences are still allowed in:

  • bin string (AST node and token has raw binary)
  • bin heredoc (AST node and token has raw binary)
  • sigil (AST node and token has raw binary, may crash at runtime)
  • anything with interpolation (AST has function calls with raw binary arg, will crash at runtime)

@josevalim josevalim merged commit 23776d9 into elixir-lang:main Jul 27, 2025
13 checks passed
@josevalim
Copy link
Member

💚 💙 💜 💛 ❤️

error:#{'__struct__' := 'Elixir.UnicodeConversionError', message := Message} ->
Sigil = [$~, S, H],
Message = " (for sigil ~ts starting at line ~B)",
interpolation_error(Message, [$~] ++ SigilName ++ Original, Scope, Tokens, Message, [Sigil, Line], Line, Column, [H], [sigil_terminator(H)])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't there be a test added for this change? Dialyzer says this function call will fail, the first argument of interpolation_error should be a 5 element tuple

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, I will revert it for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Especially because this is wrapping a large chunk of code instead of a small one, similar to others.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try catch is not needed here. It calls tokens_to_binary which may throw on liststring but it will only get binary args here

ggVGc pushed a commit to ggVGc/elixir-verbatim that referenced this pull request Sep 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

3 participants