bug: escape sequences inside regular expressions are not treated as part of regular expressions #284

@vitallium

Did you check existing issues?

  • I have read all the tree-sitter docs if it relates to using the parser
  • I have searched the existing issues of tree-sitter-c

Tree-Sitter CLI Version, if relevant (output of tree-sitter --version)

tree-sitter 0.24.4 (fc8c1863e2e5724a0c40bb6e6cfc8631bfe5908b)

Describe the bug

Regexps containing escape sequences such as \d and \w, or POSIX character classes such as [[:alpha:]], are parsed incorrectly. The scanner ends string content prematurely when it encounters a backslash inside a regex, treating it as an interpolation boundary rather than as part of the regex content.

I used 3 editors to validate this behavior:

neovim:

Image

helix:

Image

zed:

Image

Sorry, I don't know which versions of the grammar neovim and helix use, but Zed uses 71bd32f.

I think this behavior leads to two potential problems:

  1. Regular expressions are not tokenized properly.
  2. As a consequence, syntax highlighting of regular expressions is wrong.

Steps To Reproduce/Bad Parse Tree

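  1. Try to parse a method call whose receiver is the regex /[[[:alpha:]]\d]+/ (the tree below comes from such a call with one string argument)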

The grammar outputs the following parse tree:

(program [0, 0] - [1, 0]
  (call [0, 0] - [0, 31]
    receiver: (regex [0, 0] - [0, 18]
      (string_content [0, 1] - [0, 13])
      (escape_sequence [0, 13] - [0, 15])
      (string_content [0, 15] - [0, 17]))
    method: (identifier [0, 19] - [0, 24])
    arguments: (argument_list [0, 24] - [0, 31]
      (string [0, 25] - [0, 30]
        (string_content [0, 26] - [0, 29])))))
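
To make the split easier to see, here is a small sketch that slices the regex source by the column ranges reported in the tree above. It assumes the grammar in question is the Ruby one and that the regex starts at column 0; `SOURCE`, `NODES`, and `node_texts` are hypothetical names used only for this illustration, not part of tree-sitter.

```ruby
# The regex literal from the report, kept in a single-quoted string so the
# backslash is preserved verbatim.
SOURCE = '/[[[:alpha:]]\d]+/'

# [node name, start column, end column), copied from the reported parse tree.
NODES = [
  ["string_content", 1, 13],
  ["escape_sequence", 13, 15],
  ["string_content", 15, 17],
]

# Returns [name, text] pairs by slicing the source with each node's range.
def node_texts(source, nodes)
  nodes.map { |name, from, to| [name, source[from...to]] }
end

node_texts(SOURCE, NODES).each { |name, text| puts "#{name}: #{text}" }
# string_content: [[[:alpha:]]
# escape_sequence: \d
# string_content: ]+
```

The bracket expression is cut in two at `\d`, so the closing `]+` ends up in a separate string_content node from the opening `[`.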

Expected Behavior/Parse Tree

I expect the regular expression body to be parsed as a single string_content node, with no separate escape_sequence nodes:

(program [0, 0] - [1, 0]
  (call [0, 0] - [0, 31]
    receiver: (regex [0, 0] - [0, 18]
      (string_content [0, 1] - [0, 17]))
    method: (identifier [0, 19] - [0, 24])
    arguments: (argument_list [0, 24] - [0, 31]
      (string [0, 25] - [0, 30]
        (string_content [0, 26] - [0, 29])))))

Repro

The minimal repro case is `/\d+/`, but I used `/[[[:alpha:]]\d]+/`.
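
As a sanity check that both literals are valid, complete regular expressions at the language level, the following sketch runs them in Ruby (assuming Ruby is the language being parsed). `MINIMAL`, `FULL`, and `first_match` are hypothetical names introduced for illustration.

```ruby
# Both repro patterns, written as ordinary Ruby regex literals.
MINIMAL = /\d+/
FULL = /[[[:alpha:]]\d]+/

# Returns the first matched substring, or nil when there is no match.
def first_match(pattern, text)
  m = pattern.match(text)
  m && m[0]
end

puts first_match(MINIMAL, "abc123").inspect    # "123"
puts first_match(FULL, "abc123 def").inspect   # "abc123"
```

In both cases `\d` behaves as part of the pattern, which is why the report expects it to stay inside the regex content rather than split it.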
