Skip to content

Add Tree-sitter Query language#7243

Merged
lildude merged 16 commits intogithub-linguist:mainfrom
mkatychev:feat/tree-sitter-query
Mar 5, 2025
Merged

Add Tree-sitter Query language#7243
lildude merged 16 commits intogithub-linguist:mainfrom
mkatychev:feat/tree-sitter-query

Conversation

@mkatychev
Copy link
Contributor

Description

This PR adds support for the Tree-sitter Query language and attempts to resolve #5746

Checklist:

@mkatychev mkatychev requested a review from a team as a code owner February 18, 2025 19:15
@mkatychev
Copy link
Contributor Author

I recall @lildude mentioned that the decision to do tree-sitter syntax highlighting is handled internally however the grammar in https://github.com/tree-sitter-grammars/tree-sitter-query is one of the most performant grammars out there (and is what the highlighting is based off of).

Tree-sitter Query

$ tree-sitter test --stat total-only --wasm

Total parses: 18; successful parses: 18; failed parses: 0; success percentage: 100.00%; average speed: 9755 bytes/ms

Rust:

$ tree-sitter test --stat total-only --wasm

Total parses: 136; successful parses: 136; failed parses: 0; success percentage: 100.00%; average speed: 7669 bytes/ms

Copy link
Member

@lildude lildude left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also address the test failures.

Copy link
Member

@lildude lildude left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See inline comments.

The classifier failure is because it is determining Tree-sitter Query/injections.scm is Scheme:

Tree-sitter Query/injections.scm BAD (Scheme)

This will be because it's quite a small sample and because there is only one very small Scheme .scm sample being used to train the classifier. I recommend experimenting (locally) with adding a few more larger more representative Scheme samples to better train the classifier... the more different they look from Tree-sitter Query files, the better.

- '(?:''[\(\w\*]|[\w\-]+->[\w\-]+|\b\.\.\.\b|\([+\-#:<>\/=~\)]|~>)'
- '^#:\w+'
- '#\w*\('
- '^\s*\((?i:define|import|library|let)'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is vulnerable to ReDoS attack. Please fix it, and ensure the new regex runs linearly and is a Re2-style regex. See #7242 for other regexes we're slowly cleaning up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are all the pattern entries joined by a | or evaluated separately? Does pattern: ['a*', 'b*'] become pattern: '(a*|b*)'?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can find the magic here.

My finding was based purely on testing the first regex in isolation 😉

This comment was marked as resolved.

@mkatychev

This comment was marked as resolved.

@lildude
Copy link
Member

lildude commented Feb 19, 2025

Could I add those file suffixes to the test case as well?

No. That has no effect on the classifier tests. The classifier is only influenced by the samples themselves. More samples == better training.

EDIT: changed "*.scm" to "*" for Scheme in test_scm_by_heuristics

You don't want to do that. This is about testing a specific heuristic which is tied to an extension.

@mkatychev

This comment was marked as resolved.

@mkatychev mkatychev force-pushed the feat/tree-sitter-query branch from 8e553ac to fbfe049 Compare February 19, 2025 20:43
@mkatychev mkatychev requested a review from lildude February 19, 2025 23:18
remove unneeded assertion
@mkatychev mkatychev requested a review from lildude February 28, 2025 15:19
@mkatychev
Copy link
Contributor Author

@lildude are we waiting on any feedback to be addressed?

Copy link
Member

@lildude lildude left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks.

Important

The changes in this PR will not appear on GitHub until the next release has been made and deployed. See here for more details.

@lildude lildude added this pull request to the merge queue Mar 5, 2025
Merged via the queue into github-linguist:main with commit 6f97158 Mar 5, 2025
5 checks passed
@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jul 2, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Highlight tree-sitter queries

2 participants