
refactor(markdown-parser): promote list structural tokens from skipped trivia to explicit CST nodes #9274

Open
jfmcdowell wants to merge 4 commits into biomejs:main from jfmcdowell:refactor/list-marker-prefix

Conversation

@jfmcdowell
Contributor

@jfmcdowell jfmcdowell commented Feb 28, 2026

Note

AI Assistance Disclosure: This PR was developed with assistance from Claude Code.

Summary

  • Promote list structural tokens (pre-marker indent, marker, post-marker space, content indent) from skipped trivia to explicit CST nodes, mirroring the MdQuotePrefix pattern established in Phase 1.
  • Introduce MdListMarkerPrefix, MdIndentToken, and MdIndentTokenList to the markdown grammar, making list indentation structure visible and traversable in the CST.
  • Replace skip_list_marker_indent() (which discarded whitespace as trivia) with emit_indent_char_list() that wraps each indent character in a proper node.
  • Emit MD_LIST_POST_MARKER_SPACE as an explicit token instead of silently consuming it.
  • Remove trim_range() — a legacy workaround that tried to normalize node ranges by stripping leading/trailing whitespace. With structural tokens now in the CST, raw node ranges are correct by construction.
  • Fix to_html.rs to navigate the new CST shape (bullet.prefix().marker() instead of bullet.bullet()) and correctly handle leading newlines in list item rendering.
  • Add verbatim formatter stubs for all new node types.
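
To make the four structural pieces named in the summary concrete, here is a minimal standalone sketch. It is not Biome's parser: `ListPrefix` and `split_bullet_prefix` are hypothetical names, and the sketch assumes a single-line bullet item that uses spaces only (no tabs, no ordered markers, no indented-code special case).

```rust
/// Hypothetical, simplified model of the pieces a list marker prefix wraps.
/// The real nodes are generated from the grammar; this is only an illustration.
#[derive(Debug, PartialEq)]
struct ListPrefix<'a> {
    pre_marker_indent: &'a str,
    marker: &'a str,
    post_marker_space: &'a str,
    content_indent: &'a str,
    rest: &'a str,
}

/// Splits one bullet-list line into its structural pieces.
/// Returns `None` when the line is not a bullet marker followed by a space.
fn split_bullet_prefix(line: &str) -> Option<ListPrefix<'_>> {
    // Spaces before the marker (CommonMark allows at most three).
    let indent_len = line.len() - line.trim_start_matches(' ').len();
    if indent_len > 3 {
        return None;
    }
    let (pre_marker_indent, after_indent) = line.split_at(indent_len);

    // The bullet marker itself is a single ASCII character.
    let first = after_indent.chars().next()?;
    if !matches!(first, '-' | '*' | '+') {
        return None;
    }
    let (marker, after_marker) = after_indent.split_at(1);

    // Exactly one space is the post-marker space.
    if !after_marker.starts_with(' ') {
        return None;
    }
    let (post_marker_space, after_space) = after_marker.split_at(1);

    // Any further spaces are kept as content indent instead of being dropped.
    let content_indent_len = after_space.len() - after_space.trim_start_matches(' ').len();
    let (content_indent, rest) = after_space.split_at(content_indent_len);

    Some(ListPrefix {
        pre_marker_indent,
        marker,
        post_marker_space,
        content_indent,
        rest,
    })
}
```

Because every character lands in exactly one field, concatenating the fields reproduces the input line, which is the property that makes a range-trimming workaround unnecessary in this toy model.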

Test Plan

  • cargo test -p biome_markdown_parser
  • just test-markdown-conformance

Results:

  • Parser tests pass (129 total, 0 failures)
  • CommonMark conformance passes (652/652, 100%)

Docs

N/A — internal parser refactor with no user-facing behavior change.

@changeset-bot

changeset-bot bot commented Feb 28, 2026

⚠️ No Changeset found

Latest commit: 2bbbff0

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

@github-actions github-actions bot added the A-Parser (Area: parser), A-Formatter (Area: formatter), and A-Tooling (Area: internal tools) labels Feb 28, 2026
@jfmcdowell jfmcdowell force-pushed the refactor/list-marker-prefix branch 3 times, most recently from c65ad0f to c31f169 Compare February 28, 2026 14:53
@jfmcdowell jfmcdowell changed the title from "fix(markdown-parser): restore metadata range contract for list/quote rendering" to "refactor(markdown-parser): promote list structural tokens from skipped trivia to explicit CST nodes" Feb 28, 2026
…d trivia to explicit CST nodes

Introduce MdListMarkerPrefix to wrap list marker structure (pre-marker
indent, marker, post-marker space, content indent) as real CST nodes
instead of skipped trivia. This mirrors the MdQuotePrefix pattern from
Phase 1 and makes list structure visible to the formatter harness.

- Add MdListMarkerPrefix, MdIndentToken, MdIndentTokenList to grammar
- Replace skip_list_marker_indent() with emit_indent_char_list()
- Emit MD_LIST_POST_MARKER_SPACE as an explicit token
- Promote marker-only line newline to MdNewline node
- Remove trim_range(); use raw node ranges for metadata recording
- Fix to_html.rs renderer to handle new CST shape correctly
- Add verbatim formatter stubs for new node types
@jfmcdowell jfmcdowell force-pushed the refactor/list-marker-prefix branch from c31f169 to eeedde8 Compare February 28, 2026 15:39
@jfmcdowell jfmcdowell marked this pull request as ready for review February 28, 2026 15:41
@coderabbitai
Contributor

coderabbitai bot commented Feb 28, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a68fa06 and 2bbbff0.

⛔ Files ignored due to path filters (1)
  • crates/biome_markdown_parser/tests/md_test_suite/ok/list_marker_trailing_spaces.md.snap is excluded by !**/*.snap and included by **
📒 Files selected for processing (1)
  • crates/biome_markdown_parser/tests/md_test_suite/ok/list_marker_trailing_spaces.md
✅ Files skipped from review due to trivial changes (1)
  • crates/biome_markdown_parser/tests/md_test_suite/ok/list_marker_trailing_spaces.md

Walkthrough

This PR restructures Markdown list indentation handling by introducing a dedicated indent token system. Three new syntax nodes (MdIndentToken, MdListMarkerPrefix, MdIndentTokenList) are added to the grammar, along with corresponding formatter implementations and parser logic. The list parser now explicitly emits these nodes instead of relying on implicit trivia handling, and HTML generation is adjusted to handle the new marker structure and edge cases involving leading newlines.

Possibly related PRs

  • PR #8962 — Provides the formatter wiring foundation (generated.rs) that this PR directly extends with new trait implementations
  • PR #9228 — Modifies the same list parser file (syntax/list.rs) to refactor marker and indentation emission logic
  • PR #9224 — Implements an analogous indent token pattern for block quote prefixes, establishing parity across Markdown constructs

Suggested reviewers

  • ematipico
  • dyc3
🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title clearly summarises the primary change: promoting list structural tokens from trivia to explicit CST nodes, which aligns perfectly with the substantial refactoring across parser, formatter and grammar files.
Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, outlining the refactoring goals, implementation strategy, test results and the removal of legacy workarounds.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/biome_markdown_parser/src/syntax/list.rs`:
- Around line 750-757: The code currently skips emitting content indent when
first_line_empty is true, which leaves extra spaces before NEWLINE (e.g. "-  
\n") and prevents handle_first_line_marker_only from seeing NEWLINE; update the
condition in the block around emit_indent_char_list so that whenever
!setext_marker && spaces_after_marker > 1 you call emit_indent_char_list(p, 0)
(remove the first_line_empty requirement), and make the same change in the
corresponding block around lines 1015-1020; this ensures remaining whitespace is
emitted as MD_INDENT_TOKEN_LIST tokens so handle_first_line_marker_only (and
NEWLINE detection) works correctly.
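
The condition change described above can be sketched as two small predicates. The function names are hypothetical, and the "old" shape is inferred from the comment's wording (it says the `first_line_empty` requirement should be removed); the actual code in `list.rs` is not shown in this thread.

```rust
/// Assumed shape of the condition before the change: content indent is
/// skipped whenever the first line is empty.
fn emits_content_indent_old(
    setext_marker: bool,
    spaces_after_marker: usize,
    first_line_empty: bool,
) -> bool {
    !setext_marker && !first_line_empty && spaces_after_marker > 1
}

/// Condition proposed by the review comment: the `first_line_empty`
/// requirement is dropped, so leftover spaces are always emitted.
fn emits_content_indent_new(setext_marker: bool, spaces_after_marker: usize) -> bool {
    !setext_marker && spaces_after_marker > 1
}
```

For the marker-only input from the comment (a dash, two spaces, then a newline), `spaces_after_marker` is 2 and the first line is empty, so only the second predicate emits the remaining space.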

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 412a08d and eeedde8.

⛔ Files ignored due to path filters (17)
  • crates/biome_markdown_factory/src/generated/node_factory.rs is excluded by !**/generated/**, !**/generated/** and included by **
  • crates/biome_markdown_factory/src/generated/syntax_factory.rs is excluded by !**/generated/**, !**/generated/** and included by **
  • crates/biome_markdown_parser/tests/md_test_suite/ok/bullet_list.md.snap is excluded by !**/*.snap and included by **
  • crates/biome_markdown_parser/tests/md_test_suite/ok/lazy_continuation.md.snap is excluded by !**/*.snap and included by **
  • crates/biome_markdown_parser/tests/md_test_suite/ok/list_continuation_edge_cases.md.snap is excluded by !**/*.snap and included by **
  • crates/biome_markdown_parser/tests/md_test_suite/ok/list_indentation.md.snap is excluded by !**/*.snap and included by **
  • crates/biome_markdown_parser/tests/md_test_suite/ok/list_interrupt_bullet.md.snap is excluded by !**/*.snap and included by **
  • crates/biome_markdown_parser/tests/md_test_suite/ok/list_interrupt_ordered.md.snap is excluded by !**/*.snap and included by **
  • crates/biome_markdown_parser/tests/md_test_suite/ok/list_tightness.md.snap is excluded by !**/*.snap and included by **
  • crates/biome_markdown_parser/tests/md_test_suite/ok/multiline_list.md.snap is excluded by !**/*.snap and included by **
  • crates/biome_markdown_parser/tests/md_test_suite/ok/ordered_list.md.snap is excluded by !**/*.snap and included by **
  • crates/biome_markdown_parser/tests/md_test_suite/ok/paragraph_interruption.md.snap is excluded by !**/*.snap and included by **
  • crates/biome_markdown_parser/tests/md_test_suite/ok/setext_heading_edge_cases.md.snap is excluded by !**/*.snap and included by **
  • crates/biome_markdown_syntax/src/generated/kind.rs is excluded by !**/generated/**, !**/generated/** and included by **
  • crates/biome_markdown_syntax/src/generated/macros.rs is excluded by !**/generated/**, !**/generated/** and included by **
  • crates/biome_markdown_syntax/src/generated/nodes.rs is excluded by !**/generated/**, !**/generated/** and included by **
  • crates/biome_markdown_syntax/src/generated/nodes_mut.rs is excluded by !**/generated/**, !**/generated/** and included by **
📒 Files selected for processing (11)
  • crates/biome_markdown_formatter/src/generated.rs
  • crates/biome_markdown_formatter/src/markdown/auxiliary/indent_token.rs
  • crates/biome_markdown_formatter/src/markdown/auxiliary/list_marker_prefix.rs
  • crates/biome_markdown_formatter/src/markdown/auxiliary/mod.rs
  • crates/biome_markdown_formatter/src/markdown/lists/indent_token_list.rs
  • crates/biome_markdown_formatter/src/markdown/lists/mod.rs
  • crates/biome_markdown_parser/src/parser.rs
  • crates/biome_markdown_parser/src/syntax/list.rs
  • crates/biome_markdown_parser/src/to_html.rs
  • xtask/codegen/markdown.ungram
  • xtask/codegen/src/markdown_kinds_src.rs
💤 Files with no reviewable changes (1)
  • crates/biome_markdown_parser/src/parser.rs

Contributor

@dyc3 dyc3 left a comment

At a glance this looks like you're going in the right direction.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
crates/biome_markdown_parser/src/syntax/list.rs (1)

739-747: Keep bullet and ordered first_line_empty detection aligned.

The ordered path treats MD_HARD_LINE_LITERAL as an empty first line, but the bullet path does not. Aligning them avoids subtle divergence in marker-only handling.

Suggested fix
     let first_line_empty = if setext_marker {
         true
     } else {
         p.lookahead(|p| {
             while p.at(MD_TEXTUAL_LITERAL) && is_whitespace_only(p.cur_text()) {
                 p.bump(MD_TEXTUAL_LITERAL);
             }
-            p.at(NEWLINE) || p.at(T![EOF])
+            p.at(NEWLINE) || p.at(T![EOF]) || p.at(MD_HARD_LINE_LITERAL)
         })
     };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/biome_markdown_parser/src/syntax/list.rs` around lines 739 - 747, The
bullet-list branch computing first_line_empty diverges from the ordered-list
path by overlooking MD_HARD_LINE_LITERAL; update the lookahead in the bullet
path (the code setting first_line_empty) to treat MD_HARD_LINE_LITERAL as empty
the same way the ordered path does — i.e., inside the p.lookahead closure,
additionally consider p.at(MD_HARD_LINE_LITERAL) as a terminating/empty
condition (or skip/bump it similarly to MD_TEXTUAL_LITERAL) so that
is_whitespace_only and subsequent p.bump calls handle marker-only lines
consistently; reference the variable first_line_empty, the p.lookahead closure,
MD_TEXTUAL_LITERAL, MD_HARD_LINE_LITERAL, is_whitespace_only, and p.bump when
making the change.
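
One way to keep the two paths from drifting is to route both through a single check. The sketch below is a toy model, not parser code: `Kind` is a hypothetical stand-in for the parser's syntax kinds, and the slice stands in for the tokens that follow the list marker.

```rust
/// Hypothetical token kinds standing in for the parser's syntax kinds.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Kind {
    Whitespace, // a whitespace-only MD_TEXTUAL_LITERAL
    Text,       // any other textual token
    HardLine,   // MD_HARD_LINE_LITERAL
    Newline,
    Eof,
}

/// Shared check for bullet and ordered items: skip whitespace tokens, then
/// report whether the line ends immediately.
fn first_line_empty(after_marker: &[Kind]) -> bool {
    let next = after_marker
        .iter()
        .copied()
        .find(|kind| *kind != Kind::Whitespace);
    // Running out of tokens is treated like reaching end of file.
    matches!(next, None | Some(Kind::Newline | Kind::Eof | Kind::HardLine))
}
```

With one function, a hard-line token after the marker is classified the same way regardless of which list kind calls it.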
crates/biome_markdown_parser/tests/spec_test.rs (1)

214-236: Consider extracting the shared parse/assert harness.

check and the bullet test body repeat the same parse → validate → render pattern. A tiny helper would keep future edge-case tests easier to add and maintain.

Also applies to: 263-290

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/biome_markdown_parser/tests/spec_test.rs` around lines 214 - 236,
Extract the repeated parse→validate→render logic into a small helper (e.g., fn
render_checked(input: &str) -> String) that runs parse_markdown(input), checks
!root.syntax().descendants().any(|n| n.kind().is_bogus()), asserts
root.diagnostics().is_empty(), casts with MdDocument::cast(root.syntax()), calls
document_to_html(&doc, root.list_tightness(), root.list_item_indents(),
root.quote_indents()), and returns the resulting HTML string; then update the
existing check function and the other test bodies to call this helper and only
perform the assert_eq!(expected_html, html, "...") so the parsing/validation
code isn't duplicated.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eeedde8 and 726e321.

📒 Files selected for processing (2)
  • crates/biome_markdown_parser/src/syntax/list.rs
  • crates/biome_markdown_parser/tests/spec_test.rs

Comment on lines +163 to +182
fn emit_indent_char_list(p: &mut MarkdownParser, max_columns: usize) -> usize {
    let list_m = p.start();
    let mut consumed = 0usize;
    while p.at(MD_TEXTUAL_LITERAL) && is_whitespace_only(p.cur_text()) {
        let text = p.cur_text();
        let width: usize = text
            .chars()
            .map(|c| if c == '\t' { TAB_STOP_SPACES } else { 1 })
            .sum();
        if max_columns > 0 && consumed + width > max_columns {
            break;
        }
        consumed += width;
        let char_m = p.start();
        p.bump_remap(MD_INDENT_CHAR);
        char_m.complete(p, MD_INDENT_TOKEN);
    }
    list_m.complete(p, MD_INDENT_TOKEN_LIST);
    consumed
}
Contributor

⚠️ Potential issue | 🟡 Minor

emit_indent_char_list miscomputes tab columns after preceding spaces.

On Line 170, each tab is counted as a fixed TAB_STOP_SPACES, but tab expansion depends on the current column. For inputs like " \t", this overcounts columns and can break max_columns gating when a cap is used.

Suggested fix
 fn emit_indent_char_list(p: &mut MarkdownParser, max_columns: usize) -> usize {
     let list_m = p.start();
     let mut consumed = 0usize;
     while p.at(MD_TEXTUAL_LITERAL) && is_whitespace_only(p.cur_text()) {
         let text = p.cur_text();
-        let width: usize = text
-            .chars()
-            .map(|c| if c == '\t' { TAB_STOP_SPACES } else { 1 })
-            .sum();
+        let mut width = 0usize;
+        for c in text.chars() {
+            width += if c == '\t' {
+                TAB_STOP_SPACES - ((consumed + width) % TAB_STOP_SPACES)
+            } else {
+                1
+            };
+        }
         if max_columns > 0 && consumed + width > max_columns {
             break;
         }
         consumed += width;
         let char_m = p.start();
         p.bump_remap(MD_INDENT_CHAR);
         char_m.complete(p, MD_INDENT_TOKEN);
     }
     list_m.complete(p, MD_INDENT_TOKEN_LIST);
     consumed
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/biome_markdown_parser/src/syntax/list.rs` around lines 163 - 182,
emit_indent_char_list currently counts each '\t' as a fixed TAB_STOP_SPACES
which is wrong when preceding spaces exist; change the tab width calculation so
each tab expands to the next tab stop relative to the current column (tab_width
= TAB_STOP_SPACES - (column % TAB_STOP_SPACES), where column is consumed plus
the width already accumulated for the current token), use that computed width
when checking the max_columns cap and when adding to consumed, and leave the
token emission (p.bump_remap/M...complete) logic unchanged; update the width
computation in the loop inside emit_indent_char_list accordingly.
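
The difference between fixed-width and column-aware tab counting can be checked in isolation. The following is a standalone sketch, not the parser's code: `fixed_width` and `advance_columns` are hypothetical helpers, and `TAB_STOP_SPACES` is assumed to be 4 (the tab stop CommonMark specifies), since its value is not shown in this thread.

```rust
/// Assumed tab stop width; CommonMark uses tab stops of 4 columns.
const TAB_STOP_SPACES: usize = 4;

/// Fixed-width counting, as in the code under review: every tab counts as a
/// full tab stop regardless of the current column.
fn fixed_width(text: &str) -> usize {
    text.chars()
        .map(|c| if c == '\t' { TAB_STOP_SPACES } else { 1 })
        .sum()
}

/// Column-aware counting: a tab advances to the next tab stop, so its width
/// depends on the column it starts at. Returns the column reached after `text`.
fn advance_columns(start_column: usize, text: &str) -> usize {
    let mut column = start_column;
    for c in text.chars() {
        column += if c == '\t' {
            TAB_STOP_SPACES - (column % TAB_STOP_SPACES)
        } else {
            1
        };
    }
    column
}
```

For a space followed by a tab, `fixed_width` reports 5 columns while `advance_columns(0, ..)` reports 4, which is the overcount the comment describes. The sketch assumes the starting column is known; in the suggested fix, `consumed` plays that role and is implicitly measured from a tab-stop-aligned column.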

autofix-ci bot and others added 2 commits February 28, 2026 20:03
Replace programmatic assertions with a proper snapshot fixture for
marker-only list items with trailing spaces. This aligns with the
project's existing test convention where CST shape is validated via
insta snapshots rather than ad-hoc assertions.

Labels

A-Formatter (Area: formatter), A-Parser (Area: parser), A-Tooling (Area: internal tools)

Projects

None yet

2 participants