refactor(markdown-parser): promote list structural tokens from skipped trivia to explicit CST nodes#9274
refactor(markdown-parser): promote list structural tokens from skipped trivia to explicit CST nodes#9274jfmcdowell wants to merge 4 commits intobiomejs:mainfrom
Conversation
|
c65ad0f to
c31f169
Compare
…d trivia to explicit CST nodes Introduce MdListMarkerPrefix to wrap list marker structure (pre-marker indent, marker, post-marker space, content indent) as real CST nodes instead of skipped trivia. This mirrors the MdQuotePrefix pattern from Phase 1 and makes list structure visible to the formatter harness. - Add MdListMarkerPrefix, MdIndentToken, MdIndentTokenList to grammar - Replace skip_list_marker_indent() with emit_indent_char_list() - Emit MD_LIST_POST_MARKER_SPACE as an explicit token - Promote marker-only line newline to MdNewline node - Remove trim_range(); use raw node ranges for metadata recording - Fix to_html.rs renderer to handle new CST shape correctly - Add verbatim formatter stubs for new node types
c31f169 to
eeedde8
Compare
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review infoConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
WalkthroughThis PR restructures Markdown list indentation handling by introducing a dedicated indent token system. Three new syntax nodes (MdIndentToken, MdListMarkerPrefix, MdIndentTokenList) are added to the grammar, along with corresponding formatter implementations and parser logic. The list parser now explicitly emits these nodes instead of relying on implicit trivia handling, and HTML generation is adjusted to handle the new marker structure and edge cases involving leading newlines. Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 2✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/biome_markdown_parser/src/syntax/list.rs`:
- Around line 750-757: The code currently skips emitting content indent when
first_line_empty is true, which leaves extra spaces before NEWLINE (e.g. "-
\n") and prevents handle_first_line_marker_only from seeing NEWLINE; update the
condition in the block around emit_indent_char_list so that whenever
!setext_marker && spaces_after_marker > 1 you call emit_indent_char_list(p, 0)
(remove the first_line_empty requirement), and make the same change in the
corresponding block around lines 1015-1020; this ensures remaining whitespace is
emitted as MD_INDENT_TOKEN_LIST tokens so handle_first_line_marker_only (and
NEWLINE detection) works correctly.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (17)
crates/biome_markdown_factory/src/generated/node_factory.rsis excluded by!**/generated/**,!**/generated/**and included by**crates/biome_markdown_factory/src/generated/syntax_factory.rsis excluded by!**/generated/**,!**/generated/**and included by**crates/biome_markdown_parser/tests/md_test_suite/ok/bullet_list.md.snapis excluded by!**/*.snapand included by**crates/biome_markdown_parser/tests/md_test_suite/ok/lazy_continuation.md.snapis excluded by!**/*.snapand included by**crates/biome_markdown_parser/tests/md_test_suite/ok/list_continuation_edge_cases.md.snapis excluded by!**/*.snapand included by**crates/biome_markdown_parser/tests/md_test_suite/ok/list_indentation.md.snapis excluded by!**/*.snapand included by**crates/biome_markdown_parser/tests/md_test_suite/ok/list_interrupt_bullet.md.snapis excluded by!**/*.snapand included by**crates/biome_markdown_parser/tests/md_test_suite/ok/list_interrupt_ordered.md.snapis excluded by!**/*.snapand included by**crates/biome_markdown_parser/tests/md_test_suite/ok/list_tightness.md.snapis excluded by!**/*.snapand included by**crates/biome_markdown_parser/tests/md_test_suite/ok/multiline_list.md.snapis excluded by!**/*.snapand included by**crates/biome_markdown_parser/tests/md_test_suite/ok/ordered_list.md.snapis excluded by!**/*.snapand included by**crates/biome_markdown_parser/tests/md_test_suite/ok/paragraph_interruption.md.snapis excluded by!**/*.snapand included by**crates/biome_markdown_parser/tests/md_test_suite/ok/setext_heading_edge_cases.md.snapis excluded by!**/*.snapand included by**crates/biome_markdown_syntax/src/generated/kind.rsis excluded by!**/generated/**,!**/generated/**and included by**crates/biome_markdown_syntax/src/generated/macros.rsis excluded by!**/generated/**,!**/generated/**and included by**crates/biome_markdown_syntax/src/generated/nodes.rsis excluded by!**/generated/**,!**/generated/**and included by**crates/biome_markdown_syntax/src/generated/nodes_mut.rsis excluded by!**/generated/**,!**/generated/**and included by**
📒 Files selected for processing (11)
crates/biome_markdown_formatter/src/generated.rscrates/biome_markdown_formatter/src/markdown/auxiliary/indent_token.rscrates/biome_markdown_formatter/src/markdown/auxiliary/list_marker_prefix.rscrates/biome_markdown_formatter/src/markdown/auxiliary/mod.rscrates/biome_markdown_formatter/src/markdown/lists/indent_token_list.rscrates/biome_markdown_formatter/src/markdown/lists/mod.rscrates/biome_markdown_parser/src/parser.rscrates/biome_markdown_parser/src/syntax/list.rscrates/biome_markdown_parser/src/to_html.rsxtask/codegen/markdown.ungramxtask/codegen/src/markdown_kinds_src.rs
💤 Files with no reviewable changes (1)
- crates/biome_markdown_parser/src/parser.rs
dyc3
left a comment
There was a problem hiding this comment.
At a glance this looks like you're going in the right direction.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
crates/biome_markdown_parser/src/syntax/list.rs (1)
739-747: Keep bullet and orderedfirst_line_emptydetection aligned.The ordered path treats
MD_HARD_LINE_LITERALas an empty first line, but the bullet path does not. Aligning them avoids subtle divergence in marker-only handling.Suggested fix
let first_line_empty = if setext_marker { true } else { p.lookahead(|p| { while p.at(MD_TEXTUAL_LITERAL) && is_whitespace_only(p.cur_text()) { p.bump(MD_TEXTUAL_LITERAL); } - p.at(NEWLINE) || p.at(T![EOF]) + p.at(NEWLINE) || p.at(T![EOF]) || p.at(MD_HARD_LINE_LITERAL) }) };🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/biome_markdown_parser/src/syntax/list.rs` around lines 739 - 747, The bullet-list branch computing first_line_empty diverges from the ordered-list path by overlooking MD_HARD_LINE_LITERAL; update the lookahead in the bullet path (the code setting first_line_empty) to treat MD_HARD_LINE_LITERAL as empty the same way the ordered path does — i.e., inside the p.lookahead closure, additionally consider p.at(MD_HARD_LINE_LITERAL) as a terminating/empty condition (or skip/bump it similarly to MD_TEXTUAL_LITERAL) so that is_whitespace_only and subsequent p.bump calls handle marker-only lines consistently; reference the variable first_line_empty, the p.lookahead closure, MD_TEXTUAL_LITERAL, MD_HARD_LINE_LITERAL, is_whitespace_only, and p.bump when making the change.crates/biome_markdown_parser/tests/spec_test.rs (1)
214-236: Consider extracting the shared parse/assert harness.
checkand the bullet test body repeat the same parse → validate → render pattern. A tiny helper would keep future edge-case tests easier to add and maintain.Also applies to: 263-290
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/biome_markdown_parser/tests/spec_test.rs` around lines 214 - 236, Extract the repeated parse→validate→render logic into a small helper (e.g., fn render_checked(input: &str) -> String) that runs parse_markdown(input), checks !root.syntax().descendants().any(|n| n.kind().is_bogus()), asserts root.diagnostics().is_empty(), casts with MdDocument::cast(root.syntax()), calls document_to_html(&doc, root.list_tightness(), root.list_item_indents(), root.quote_indents()), and returns the resulting HTML string; then update the existing check function and the other test bodies to call this helper and only perform the assert_eq!(expected_html, html, "...") so the parsing/validation code isn't duplicated.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/biome_markdown_parser/src/syntax/list.rs`:
- Around line 163-182: emit_indent_char_list currently counts each '\t' as a
fixed TAB_STOP_SPACES which is wrong when preceding spaces exist; change the tab
width calculation so each tab expands to the next tab stop relative to the
current column (tab_width = TAB_STOP_SPACES - (consumed % TAB_STOP_SPACES)), use
that computed width when checking the max_columns cap and when adding to
consumed, and leave the token emission (p.bump_remap/M...complete) logic
unchanged; update the width computation in the loop inside emit_indent_char_list
accordingly.
---
Nitpick comments:
In `@crates/biome_markdown_parser/src/syntax/list.rs`:
- Around line 739-747: The bullet-list branch computing first_line_empty
diverges from the ordered-list path by overlooking MD_HARD_LINE_LITERAL; update
the lookahead in the bullet path (the code setting first_line_empty) to treat
MD_HARD_LINE_LITERAL as empty the same way the ordered path does — i.e., inside
the p.lookahead closure, additionally consider p.at(MD_HARD_LINE_LITERAL) as a
terminating/empty condition (or skip/bump it similarly to MD_TEXTUAL_LITERAL) so
that is_whitespace_only and subsequent p.bump calls handle marker-only lines
consistently; reference the variable first_line_empty, the p.lookahead closure,
MD_TEXTUAL_LITERAL, MD_HARD_LINE_LITERAL, is_whitespace_only, and p.bump when
making the change.
In `@crates/biome_markdown_parser/tests/spec_test.rs`:
- Around line 214-236: Extract the repeated parse→validate→render logic into a
small helper (e.g., fn render_checked(input: &str) -> String) that runs
parse_markdown(input), checks !root.syntax().descendants().any(|n|
n.kind().is_bogus()), asserts root.diagnostics().is_empty(), casts with
MdDocument::cast(root.syntax()), calls document_to_html(&doc,
root.list_tightness(), root.list_item_indents(), root.quote_indents()), and
returns the resulting HTML string; then update the existing check function and
the other test bodies to call this helper and only perform the
assert_eq!(expected_html, html, "...") so the parsing/validation code isn't
duplicated.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
crates/biome_markdown_parser/src/syntax/list.rscrates/biome_markdown_parser/tests/spec_test.rs
| fn emit_indent_char_list(p: &mut MarkdownParser, max_columns: usize) -> usize { | ||
| let list_m = p.start(); | ||
| let mut consumed = 0usize; | ||
| while p.at(MD_TEXTUAL_LITERAL) && is_whitespace_only(p.cur_text()) { | ||
| let text = p.cur_text(); | ||
| let width: usize = text | ||
| .chars() | ||
| .map(|c| if c == '\t' { TAB_STOP_SPACES } else { 1 }) | ||
| .sum(); | ||
| if max_columns > 0 && consumed + width > max_columns { | ||
| break; | ||
| } | ||
| consumed += width; | ||
| let char_m = p.start(); | ||
| p.bump_remap(MD_INDENT_CHAR); | ||
| char_m.complete(p, MD_INDENT_TOKEN); | ||
| } | ||
| list_m.complete(p, MD_INDENT_TOKEN_LIST); | ||
| consumed | ||
| } |
There was a problem hiding this comment.
emit_indent_char_list miscomputes tab columns after preceding spaces.
On Line 170, each tab is counted as a fixed TAB_STOP_SPACES, but tab expansion depends on the current column. For inputs like " \t", this overcounts columns and can break max_columns gating when a cap is used.
Suggested fix
fn emit_indent_char_list(p: &mut MarkdownParser, max_columns: usize) -> usize {
let list_m = p.start();
let mut consumed = 0usize;
while p.at(MD_TEXTUAL_LITERAL) && is_whitespace_only(p.cur_text()) {
let text = p.cur_text();
- let width: usize = text
- .chars()
- .map(|c| if c == '\t' { TAB_STOP_SPACES } else { 1 })
- .sum();
+ let mut width = 0usize;
+ for c in text.chars() {
+ width += if c == '\t' {
+ TAB_STOP_SPACES - ((consumed + width) % TAB_STOP_SPACES)
+ } else {
+ 1
+ };
+ }
if max_columns > 0 && consumed + width > max_columns {
break;
}
consumed += width;
let char_m = p.start();
p.bump_remap(MD_INDENT_CHAR);
char_m.complete(p, MD_INDENT_TOKEN);
}
list_m.complete(p, MD_INDENT_TOKEN_LIST);
consumed
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/biome_markdown_parser/src/syntax/list.rs` around lines 163 - 182,
emit_indent_char_list currently counts each '\t' as a fixed TAB_STOP_SPACES
which is wrong when preceding spaces exist; change the tab width calculation so
each tab expands to the next tab stop relative to the current column (tab_width
= TAB_STOP_SPACES - (consumed % TAB_STOP_SPACES)), use that computed width when
checking the max_columns cap and when adding to consumed, and leave the token
emission (p.bump_remap/M...complete) logic unchanged; update the width
computation in the loop inside emit_indent_char_list accordingly.
Replace programmatic assertions with a proper snapshot fixture for marker-only list items with trailing spaces. This aligns with the project's existing test convention where CST shape is validated via insta snapshots rather than ad-hoc assertions.
Note
AI Assistance Disclosure: This PR was developed with assistance from Claude Code.
Summary
MdQuotePrefixpattern established in Phase 1.MdListMarkerPrefix,MdIndentToken, andMdIndentTokenListto the markdown grammar, making list indentation structure visible and traversable in the CST.skip_list_marker_indent()(which discarded whitespace as trivia) withemit_indent_char_list()that wraps each indent character in a proper node.MD_LIST_POST_MARKER_SPACEas an explicit token instead of silently consuming it.trim_range()— a legacy workaround that tried to normalize node ranges by stripping leading/trailing whitespace. With structural tokens now in the CST, raw node ranges are correct by construction.to_html.rsto navigate the new CST shape (bullet.prefix().marker()instead ofbullet.bullet()) and correctly handle leading newlines in list item rendering.Test Plan
cargo test -p biome_markdown_parserjust test-markdown-conformanceResults:
129total,0failures)652/652,100%)Docs
N/A — internal parser refactor with no user-facing behavior change.