Lfe 9137/md to pdf #2775

Open
felixkrrr wants to merge 17 commits into main from
LFE-9137/md-to-pdf

Conversation

@felixkrrr
Contributor

Issues were:

  • md to pdf conversion does not work for pages like /terms
  • the content/ layout (folders like marketing/, docs/) did not match URL paths 1:1 (e.g. marketing pages are served at /terms, customers at /users). Auto-exporting “raw” paths from content/ produced wrong or missing .md URLs and broke anything that assumed a simple path (/terms.md, PDF fetch, etc.). The root cause is the mismatch between how the content dir is organized and how the site is routed.

Adjacent fixes:

  • added md overrides for certain pages (pricing.md and pricing-self-host.md)

Desired fixes noted by Marc:

  • stable default mapping: the new website folder structure of md files no longer maps 1:1 to URL paths, so the md folder needs a fix that makes it mirror the actual paths
  • I think having an md-override folder would be good: on build, we first auto-generate the mds, then overwrite them wherever an override exists (i.e. we have an md-only version of the page, e.g. for pricing). That gives us a very maintainable structure

@vercel

vercel bot commented Apr 7, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Actions | Updated (UTC)
langfuse-docs | Ready | Preview, Comment | Apr 8, 2026 7:54am

@dosubot dosubot bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Apr 7, 2026
@github-actions

github-actions bot commented Apr 7, 2026

@claude review

@dosubot dosubot bot added the docs label Apr 7, 2026
@felixkrrr
Contributor Author

@claude pls review again

@claude claude bot left a comment

Additional findings (outside current diff — PR may have been updated during review):

  • 🔴 app/api/md-to-pdf/route.ts:111-119: The PDF callout styling pipeline is broken. copy_md_sources.js strips all <Callout type='X'>...</Callout> tags before writing to public/md-src/, so by the time the PDF route fetches those files and calls processCallouts(), zero Callout tags remain. The five CSS callout classes (.callout-info, .callout-warn, etc.) in the HTML template are dead code, and all callout blocks render as unstyled plain paragraphs in every generated PDF. Fix by either preserving Callout tags in md-src files (pass unwrapCalloutsForPlainMd: false at build time and keep the PDF-time call as-is) or applying processCallouts() before stripping at the build step.

    Extended reasoning...

    What the bug is and how it manifests

    The PDF callout styling pipeline has a two-step contradiction. At build time, scripts/copy_md_sources.js calls stripMdxForPlainMarkdown(inlined, { unwrapCalloutsForPlainMd: true }) for every MDX file. Inside that call, stripOutsideFences() invokes unwrapCalloutTags(), which applies /<Callout\s+type=["'](\w+)["']\s*>([\s\S]*?)<\/Callout>/g and replaces every typed Callout tag with just its inner content (wrapped in newlines). By the time public/md-src/ files are written, all Callout tags have been erased.

    The specific code path that triggers it

    At render time, app/api/md-to-pdf/route.ts lines 113–115 fetch from /md-src/ and call stripMdxForPlainMarkdown(markdownContent, { unwrapCalloutsForPlainMd: false }). With no Callout tags remaining in the content, this call is a complete no-op: unwrapCalloutTags() is skipped, and stripMdxJsxArtifacts() (running with preserveCallout: true) also finds nothing to act on. The markdown then flows into marked.parse() as plain text, producing HTML with no <Callout> elements.

    Why existing code doesn't prevent it

    processCallouts() (route.ts lines 26–30) then runs on this HTML using the regex /<Callout\s+type=["'](\w+)["']\s*>([\s\S]*?)<\/Callout>/g. Since marked.parse() received markdown with no Callout tags, the resulting HTML also has none, and the regex finds zero matches. The five CSS callout classes (.callout-info, .callout-warn, .callout-error, .callout-warning, .callout-danger) defined in the HTML template are therefore never applied to any element.

    What the impact is

    Every callout block (e.g., <Callout type='info'>Important note</Callout>) in the source MDX renders as plain unstyled paragraph text in all generated PDFs for Langfuse URL sources. The visual differentiation between info, warning, error, and danger callouts — which is presumably the reason processCallouts() and the five CSS classes were written — is completely absent. This is a functional regression introduced by this PR: before the PR, copy_md_sources.js did not call stripMdxForPlainMarkdown at all, so Callout tags were preserved in md-src files and processCallouts() worked correctly.

    Step-by-step proof

    1. Source MDX contains: <Callout type="info">You must enable billing first.</Callout>
    2. Build time: unwrapCalloutTags() matches it → replaced with \n\nYou must enable billing first.\n\n
    3. public/md-src/docs/example.md now contains only the plain text, no Callout tags
    4. PDF route: fetches /md-src/docs/example.md, calls stripMdxForPlainMarkdown(..., { unwrapCalloutsForPlainMd: false }) — no-op since no Callout tags exist
    5. marked.parse() converts markdown → HTML paragraph: <p>You must enable billing first.</p>
    6. processCallouts(htmlContent) searches for <Callout in HTML → zero matches
    7. CSS class .callout-info is never applied; callout renders as an unstyled <p> tag

    How to fix it

    The cleanest fix is to change the build step to pass unwrapCalloutsForPlainMd: false (i.e., preserve Callout tags in md-src files), relying on the PDF route's processCallouts() to convert them to styled divs at render time. Alternatively, if plain-text md-src output is required for other consumers, add a separate Callout-to-styled-div conversion pass in the build step before stripping.
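
    The contradiction can be reproduced in isolation. The following is a minimal sketch, not the actual route.ts or copy_md_sources.js code: processCalloutsSketch and unwrapCalloutTagsSketch are hypothetical stand-ins built from the regex and the div class naming quoted in this review, and the real functions may differ.

```javascript
// Regex as quoted in the review for both unwrapCalloutTags() and processCallouts().
const CALLOUT_RE = /<Callout\s+type=["'](\w+)["']\s*>([\s\S]*?)<\/Callout>/g;

// Hypothetical stand-in for processCallouts(): turn Callout tags into styled divs.
function processCalloutsSketch(html) {
  return html.replace(
    CALLOUT_RE,
    (_match, type, body) => `<div class="callout callout-${type}">${body}</div>`
  );
}

// Hypothetical stand-in for unwrapCalloutTags(): keep only the inner content.
function unwrapCalloutTagsSketch(md) {
  return md.replace(CALLOUT_RE, (_match, _type, body) => `\n\n${body}\n\n`);
}

const source = '<Callout type="info">You must enable billing first.</Callout>';

// Tags preserved at build time: the PDF-time pass can style the callout.
const preserved = processCalloutsSketch(source);

// Tags unwrapped at build time: the PDF-time pass has nothing left to match.
const unwrapped = processCalloutsSketch(unwrapCalloutTagsSketch(source));

console.log(preserved);
console.log(JSON.stringify(unwrapped));
```

    With the tags preserved, the callout becomes a div carrying the callout-info class; once they are unwrapped first, the second pass returns the plain text unchanged.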

Comment on lines +22 to +50
 * @param {string} content
 * @returns {{ fenced: boolean, text: string }[]}
 */
function splitMarkdownByTripleBacktickFences(content) {
  /** @type {{ fenced: boolean, text: string }[]} */
  const segments = [];
  let remaining = content;
  while (remaining.length > 0) {
    const open = remaining.indexOf("```");
    if (open === -1) {
      segments.push({ fenced: false, text: remaining });
      break;
    }
    if (open > 0) {
      segments.push({ fenced: false, text: remaining.slice(0, open) });
    }
    const afterOpen = remaining.slice(open + 3);
    const close = afterOpen.indexOf("```");
    if (close === -1) {
      segments.push({ fenced: true, text: remaining.slice(open) });
      break;
    }
    const fenceEnd = open + 3 + close + 3;
    segments.push({ fenced: true, text: remaining.slice(open, fenceEnd) });
    remaining = remaining.slice(fenceEnd);
  }
  return segments;
}

🔴 splitMarkdownByTripleBacktickFences() uses indexOf to locate any 3-backtick sequence, so a 4-backtick opening fence (e.g. ````markdown filename=X) is matched by the 4th backtick + the first inner triple-backtick code block as its "close", causing the fenced segment to be truncated and all subsequent inner content to be classified as non-fenced. Current affected files (content/blog/2026-02-13-will-you-be-my-cli.mdx lines 72 and 210, content/docs/observability/features/multi-modality.mdx line 156) contain only bash/prose inside those fences so stripOutsideFences does no damage today, but any future 4-backtick block containing import statements or JSX components would be incorrectly stripped by this misclassification.

Extended reasoning...

What the bug is and how it manifests

splitMarkdownByTripleBacktickFences() (lib/stripMdxForPlainMarkdown.js:22–50) finds fence boundaries with remaining.indexOf("```"). When encountering a 4-backtick opening fence like ````markdown filename=X, it finds the first 3 of those 4 backticks at position open. It then slices to afterOpen = remaining.slice(open + 3), which starts with the remaining single backtick followed by markdown filename=X\n.... The subsequent afterOpen.indexOf("```") then finds the first inner triple-backtick code block, not the matching 4-backtick close fence. This produces a malformed fenced segment that ends prematurely, and all content from that inner opener onward is emitted as non-fenced.

The specific code path that triggers it

In scripts/copy_md_sources.js, every .mdx file is processed through stripMdxForPlainMarkdown(inlined, { unwrapCalloutsForPlainMd: true }). Internally, splitMarkdownByTripleBacktickFences partitions the document, then non-fenced parts are passed to stripOutsideFences() which calls stripTopLevelImportBlocks(), unwrapCalloutTags(), and stripMdxJsxArtifacts(). Content misclassified as non-fenced is therefore subject to all three stripping passes.

Why existing code does not prevent it

The algorithm was written assuming fences always use exactly 3 backticks. CommonMark/MDX allow any run of 3+ backticks as a fence, and the closing fence only needs to match or exceed the opening count. The function has no mechanism to count how many backticks form the opening fence, and always searches for exactly 3 backticks as the closing delimiter.

What the impact is

Today's impact is limited: the files with 4-backtick fences contain bash commands and plain prose, so the stripping passes do not destroy any content. However, the segmentation is structurally wrong. Any future content author who uses 4-backtick fences (a common MDX pattern to embed inner triple-backtick examples) around MDX components or import statements would have those constructs silently stripped by stripMdxJsxArtifacts() or stripTopLevelImportBlocks(), producing a corrupted output .md file with no warning.

How to fix it

Count backticks at the opening fence and require the same count to close:

const open = remaining.search(/`{3,}/);
if (open === -1) { /* ... */ }
const openFence = remaining.match(/`{3,}/)[0];
const closeIdx = remaining.slice(open + openFence.length).indexOf(openFence);

Alternatively, detect fences only at line start with a line-by-line pass, which is the CommonMark spec approach.
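
The line-by-line alternative could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's code: splitMarkdownByBacktickFencesSketch is a hypothetical name, only backtick fences are handled (no tilde fences), and segments hold lines without the newline that separated them, so joining segment texts with "\n" restores the input.

```javascript
// Sketch: split markdown into fenced / non-fenced segments, honoring fence length.
// A fence opens on a line starting with 3+ backticks (up to 3 spaces of indent)
// and closes on a line with at least as many backticks and nothing else.
function splitMarkdownByBacktickFencesSketch(content) {
  const segments = [];
  let buffer = [];
  let openLen = 0; // 0 means "not inside a fence"

  const flush = (fenced) => {
    if (buffer.length > 0) {
      segments.push({ fenced, text: buffer.join("\n") });
      buffer = [];
    }
  };

  for (const line of content.split("\n")) {
    const m = /^ {0,3}(`{3,})(.*)$/.exec(line);
    if (openLen === 0) {
      // An info string after the opening fence may not contain backticks.
      if (m && !m[2].includes("`")) {
        flush(false);
        openLen = m[1].length;
      }
      buffer.push(line);
    } else {
      buffer.push(line);
      if (m && m[1].length >= openLen && m[2].trim() === "") {
        flush(true);
        openLen = 0;
      }
    }
  }
  flush(openLen !== 0); // an unclosed fence runs to end of input
  return segments;
}

// Build the sample with repeat() so no line of this example starts with a fence.
const f3 = "`".repeat(3);
const f4 = "`".repeat(4);
const sample = [
  "intro",
  f4 + "markdown filename=X",
  f3 + "bash",
  "npm i",
  f3,
  "import x from 'y'",
  f4,
  "outro",
].join("\n");

const segments = splitMarkdownByBacktickFencesSketch(sample);
console.log(segments.map((s) => s.fenced));
```

In this sample the inner triple-backtick block and the import line stay inside the single fenced segment, so the stripping passes would never see them.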

Step-by-step proof with content/blog/2026-02-13-will-you-be-my-cli.mdx line 72

  1. remaining starts with ````markdown filename=...\n...inner content...\n```bash\n...
  2. remaining.indexOf("```") → 0 (first 3 of the 4 backticks)
  3. afterOpen = remaining.slice(3) → starts with `markdown filename=...\n...inner content...\n```bash\n...
  4. afterOpen.indexOf("```") → finds the ```bash inner block opener (not the 4-backtick close)
  5. Segment 0 (fenced=true): only ````markdown filename=...\n...inner content\n``` (truncated)
  6. remaining is now advanced to the bash\n... content and all subsequent lines
  7. Those lines are classified as fenced=false (non-fenced) and passed to stripOutsideFences()

Comment on lines +68 to +89
* Does not strip arbitrary `{expr}` — those can appear in YAML frontmatter or JSX
* attributes; comments are the main noise in PDF / md-src output.
* @param {string} text
*/
function stripMdxBlockComments(text) {
  return text.replace(/\{\s*\/\*[\s\S]*?\*\/\s*\}/g, "");
}

/**
 * Single-line `import … from "…"` or `import "…"` without trailing `;` (valid ESM / ASI).
 * @param {string} block
 */
function isCompleteSingleLineImportWithoutSemicolon(block) {
  const parts = block.split(/\r?\n/);
  if (parts.length !== 1) return false;
  const line = parts[0];
  if (!/^\s*import\s/.test(line) || /;\s*$/.test(line)) return false;
  return (
    /\bfrom\s+['"][^'"]+['"]\s*$/.test(line) ||
    /^\s*import\s+['"][^'"]+['"]\s*$/.test(line)
  );
}

🔴 The new stripTopLevelImportBlocks() function in lib/stripMdxForPlainMarkdown.js only strips lines matching /^\s*import\s/, so top-level MDX export function and export const definitions are left completely intact and land verbatim as raw JavaScript + JSX in the generated public/md-src/ .md files. At least three files are actively affected today: content/marketing/enterprise.mdx (line 72: export function TalkToUsButton()), content/docs/administration/rbac.mdx (line 54: export function RolePermissionTable()), and content/docs/administration/llm-connection.mdx (line 57: export function ModelList()). The fix is to extend stripTopLevelImportBlocks() (or add a parallel pass) to also detect and skip top-level export statement blocks by tracking brace depth.

Extended reasoning...

What the bug is and how it manifests

stripTopLevelImportBlocks() (lib/stripMdxForPlainMarkdown.js lines 68–89) scans each line for the pattern /^\s*import\s/ and, when matched, accumulates and discards the full import block including multi-line forms. The function has no equivalent handling for top-level export statements. MDX files commonly define React components at the top level using export function Foo() { ... } or export const Bar = () => (...) — these are local component definitions used within the MDX file. Because none of these lines start with import, they pass entirely through stripTopLevelImportBlocks() unchanged. The multi-line JSX bodies then also pass through stripMdxJsxArtifacts() unchanged, since the JSX inside contains non-empty content that the self-closing and empty-pair patterns cannot strip. The complete JS/JSX function definition lands verbatim as literal paragraph text in public/md-src/ .md files.

The specific code path that triggers it

During the build, scripts/copy_md_sources.js calls stripMdxForPlainMarkdown(inlined, { unwrapCalloutsForPlainMd: true }) on every .mdx file. splitMarkdownByTripleBacktickFences() first splits on triple-backtick fences to leave code blocks alone. The export function definitions in the three affected files are outside any code fence — they are top-level MDX component definitions, not code samples. Those segments pass to stripOutsideFences() -> stripTopLevelImportBlocks(). Since lines start with export, not import, the /^\s*import\s/ test fails and every line is pushed unchanged to out[]. stripMdxJsxArtifacts() subsequently also leaves them intact.

Why existing code does not prevent it

The function was authored only to strip ESM import statements and was never extended to cover export statements. stripMdxJsxArtifacts() only addresses self-closing and empty paired PascalCase tags — it does not recognize raw JavaScript syntax such as function declarations, return (, braces, or arrow function bodies. There is no other pass in the pipeline that catches these blocks.

What the impact would be

Three files confirmed by verifiers: (1) content/marketing/enterprise.mdx lines 72–85: multi-line export function TalkToUsButton() with a full JSX body including Button components survives into public/md-src/enterprise.md. (2) content/docs/administration/rbac.mdx lines 54–71: export function RolePermissionTable({ roleScopes }) with Object.entries and JSX survives. (3) content/docs/administration/llm-connection.mdx line 57: export function ModelList() similarly survives. A fourth case was noted: content/docs/roadmap.mdx line 22 has export const ChangelogList = () => ( outside a code fence. In all cases the generated .md files contain raw JavaScript and JSX as literal paragraph text — garbage output for the PDF pipeline, LLM context, and search indexing.

How to fix it

Extend stripTopLevelImportBlocks() or add a parallel pass to detect and skip top-level export statement blocks. The simplest heuristic: detect a line matching /^\s*export\s+(function|const|class|default)\s/ then accumulate lines while tracking brace depth (increment on {, decrement on }), stopping when depth returns to zero. This handles function bodies, arrow functions with parens, and class definitions without needing a full parser.
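
The brace-depth heuristic described above might be sketched as follows. This is an assumption-laden illustration rather than the project's implementation: stripTopLevelExportBlocksSketch is a hypothetical name, parentheses are counted alongside braces (an extension of the heuristic so that arrow functions opening with "(" are covered), and the sample input is made up.

```javascript
// Sketch: drop top-level `export function|const|class|default` blocks from MDX text.
// Heuristic only: braces and parens inside strings or comments are counted too,
// and an export whose delimiters never balance swallows the rest of the input,
// so a real parser would be more robust.
function stripTopLevelExportBlocksSketch(text) {
  const lines = text.split(/\r?\n/);
  const out = [];
  let i = 0;
  while (i < lines.length) {
    if (/^\s*export\s+(function|const|class|default)\s/.test(lines[i])) {
      let depth = 0;
      let sawOpener = false;
      let j = i;
      for (; j < lines.length; j++) {
        for (const ch of lines[j]) {
          if (ch === "{" || ch === "(") {
            depth++;
            sawOpener = true;
          } else if (ch === "}" || ch === ")") {
            depth--;
          }
        }
        // Stop once the body is balanced, or at once for a one-line export
        // with no body (e.g. `export const x = 1`).
        if (!sawOpener || depth <= 0) break;
      }
      i = j + 1; // skip the whole export block
      continue;
    }
    out.push(lines[i]);
    i++;
  }
  return out.join("\n");
}

// Made-up input shaped like the component definitions named in the review.
const mdx = [
  "# Enterprise",
  "",
  "export function TalkToUsButton() {",
  "  return (",
  "    <Button>Talk to us</Button>",
  "  );",
  "}",
  "",
  "Body text.",
].join("\n");

const stripped = stripTopLevelExportBlocksSketch(mdx);
console.log(JSON.stringify(stripped));
```

The surrounding blank lines are left in place here; in the pipeline described by this review, the later newline-collapsing step in stripOutsideFences() would tidy them.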

Step-by-step proof

  1. Input from content/marketing/enterprise.mdx: lines 72–85 contain export function TalkToUsButton() { return (
    .........
    ); } outside any triple-backtick fence.
  2. splitMarkdownByTripleBacktickFences() returns this block as a { fenced: false, text: ... } segment.
  3. stripTopLevelImportBlocks() scans line by line: line 72 starts with "export function", not "import". The /^\s*import\s/ test fails. Line 72 is pushed to out[]. Lines 73–85 (the function body) are similarly pushed unchanged.
  4. stripMdxJsxArtifacts() runs: self-closing regex — no match (no />); empty-pair regex — no match (non-empty JSX content); empty-div regex — no match. The do/while loop exits with out === prev.
  5. Output: the full export function TalkToUsButton() { ... } block survives verbatim into public/md-src/enterprise.md as literal paragraph text.

Comment on lines 91 to +97
  const dir = path.dirname(dest);
  ensureDir(dir);
  const originalContent = fs.readFileSync(file, 'utf8');
- const processed = inlineComponentsMdx(originalContent, file);
+ const inlined = inlineComponentsMdx(originalContent, file);
+ const processed = stripMdxForPlainMarkdown(inlined, {
+   unwrapCalloutsForPlainMd: true,
+ });

🔴 The md-src build pipeline now calls stripMdxForPlainMarkdown with unwrapCalloutsForPlainMd: true, which strips every <Callout> tag before writing to public/md-src/ — so when the PDF route fetches those files and calls processCallouts(), it finds no <Callout> tags to match. All callout boxes across the 342+ affected content files now render as unstyled plain paragraphs in generated PDFs. Fix by setting unwrapCalloutsForPlainMd: false in scripts/copy_md_sources.js (lines 94–96) so <Callout> tags are preserved in md-src files for the PDF consumer; route.ts already calls stripMdxForPlainMarkdown with unwrapCalloutsForPlainMd: false to preserve them for processCallouts().

Extended reasoning...

What the bug is and how it manifests

Before this PR, copy_md_sources.js only called inlineComponentsMdx() — it never called stripMdxForPlainMarkdown(). The resulting public/md-src/ files retained raw <Callout type="info">content</Callout> tags. The PDF route would fetch these files, call marked.parse() (which passes unknown HTML blocks through verbatim), and then processCallouts(htmlContent) would convert every <Callout> tag into a styled <div class="callout callout-info">...</div>. Callouts were properly styled in PDFs.

After this PR, copy_md_sources.js calls stripMdxForPlainMarkdown(inlined, { unwrapCalloutsForPlainMd: true }) (lines 94–96). Inside stripOutsideFences(), when unwrapCalloutsForPlainMd is true, unwrapCalloutTags(s) is invoked — replacing every <Callout[^>]*>content</Callout> with \n\ncontent\n\n. The public/md-src/ files now contain only plain text where callout blocks used to be.

The specific code path that triggers it

  1. Build time: copy_md_sources.js calls stripMdxForPlainMarkdown(inlined, { unwrapCalloutsForPlainMd: true }) → stripOutsideFences(text, true) → unwrapCalloutTags(s) replaces all <Callout> tags with plain text → written to public/md-src/docs/foo.md.
  2. PDF request: route.ts fetches /md-src/docs/foo.md — no <Callout> tags present in file.
  3. stripMdxForPlainMarkdown(markdownContent, { unwrapCalloutsForPlainMd: false }) in route.ts preserves Callout tags that no longer exist — a no-op.
  4. marked.parse(markdownContent) produces HTML with no <Callout> elements.
  5. processCallouts(htmlContent) regex /<Callout([^>]*)>([\s\S]*?)<\/Callout>/g finds zero matches.
  6. PDF output: callout content appears as an unstyled plain paragraph.

Why existing code does not prevent it

The design intent in route.ts (calling stripMdxForPlainMarkdown with unwrapCalloutsForPlainMd: false to "leave Callout for downstream HTML conversion") is directly contradicted by copy_md_sources.js stripping them at build time with unwrapCalloutsForPlainMd: true. The two consumers (plain markdown for LLM/search vs. PDF needing styled callouts) have conflicting requirements, and the build-time strip wins, making processCallouts() dead code.

Impact

Every content page using <Callout> components served via the PDF pipeline regresses. Grep confirms 342 <Callout> usages across the content/ directory — all affected. The md-override/ pricing files are plain markdown with no Callout components, so those are unaffected. This is a direct regression introduced by this PR.

Step-by-step proof

  1. content/docs/foo.mdx has <Callout type="info">Take note</Callout>
  2. copy_md_sources.js calls stripMdxForPlainMarkdown(inlined, { unwrapCalloutsForPlainMd: true })
  3. unwrapCalloutTags() converts it to \n\nTake note\n\n — tags gone
  4. public/md-src/docs/foo.md contains only plain text
  5. PDF route fetches /md-src/docs/foo.md — no <Callout> tags present
  6. processCallouts(htmlContent) regex finds no matches
  7. Output PDF: "Take note" as plain paragraph, no callout box styling

How to fix it

Change scripts/copy_md_sources.js lines 94–96 to use unwrapCalloutsForPlainMd: false so <Callout> tags remain in md-src files for the PDF consumer. The route.ts path already has the correct flag set, and processCallouts() will then find and style them. For the plain-markdown (LLM/search) consumer, the raw <Callout> HTML in .md files is slightly noisier but functionally benign compared to silently dropping all callout styling in PDFs.

Comment on lines +55 to +64
function stripOutsideFences(text, unwrapCalloutsForPlainMd) {
  let s = stripTopLevelImportBlocks(text);
  if (unwrapCalloutsForPlainMd) {
    s = unwrapCalloutTags(s);
  }
  s = stripMdxJsxArtifacts(s, { preserveCallout: !unwrapCalloutsForPlainMd });
  s = stripMdxBlockComments(s);
  s = s.replace(/\n{3,}/g, "\n\n");
  return s;
}

🟡 In stripOutsideFences(), stripMdxJsxArtifacts runs before stripMdxBlockComments, so a component whose sole content is a JSX block comment — e.g. <Tab>{/* placeholder */}</Tab> — is not matched by the empty-pair regex (since {/* ... */} is non-whitespace), and after the comment is stripped the resulting <Tab></Tab> receives no second cleanup pass and survives verbatim into public/md-src files. This is a pre-existing latent logic flaw; no current content files have components containing only JSX block comments, so there is no active impact. Fix by moving stripMdxBlockComments to run before stripMdxJsxArtifacts.

Extended reasoning...

What the bug is and how it manifests

In stripOutsideFences() (lib/stripMdxForPlainMarkdown.js lines 55–64), the two cleanup passes run in this order: (1) stripMdxJsxArtifacts(s, ...) at line 60, then (2) stripMdxBlockComments(s) at line 61. The empty-pair regex inside stripMdxJsxArtifacts, <Tag(?:\s[^>]*)?>\s*</Tag>, only matches when the content between the tags is all-whitespace. A JSX block comment like {/* placeholder */} is non-whitespace, so a component such as <Tab>{/* placeholder */}</Tab> passes through stripMdxJsxArtifacts unchanged. When stripMdxBlockComments then removes the comment, the result is <Tab></Tab>, but there is no second artifact-stripping pass, so the empty tag pair survives into the output.

The specific code path that triggers it

  1. Input: <Tab>{/* first tab content */}</Tab>
  2. stripMdxJsxArtifacts runs: the empty-pair regex does not match because {/* first tab content */} is non-whitespace. Output unchanged.
  3. stripMdxBlockComments runs: removes {/* first tab content */}. Output: <Tab></Tab>.
  4. No further pass of stripMdxJsxArtifacts runs.
  5. <Tab></Tab> is written verbatim to public/md-src or fed into marked.parse() in the PDF pipeline.

For nested cases like <Tabs><Tab>{/* placeholder */}</Tab><Tab>real content</Tab></Tabs>, the outer <Tabs> wrapper also can never be emptied by the artifact stripper, so it too would survive.

Why existing code does not prevent it

The do/while loop in stripMdxJsxArtifacts is designed to handle nested cleanup in one pass — but it runs to completion before stripMdxBlockComments is called. Once the loop exits, no re-entry occurs. The ordering places the comment-removal step after the artifact-stripping step, which is the inverse of what is needed: comments must be removed first so the artifact stripper can see the resulting empty tags.

What the impact would be and addressing the refutation

An exhaustive grep of the entire content/ directory finds zero instances where a PascalCase component's sole content is a JSX block comment — all real occurrences of {/* */} inside components appear alongside actual content. As all three verifying agents and the refutation agree, there is no active impact today. This is a latent logic flaw that would silently produce orphaned <Tab></Tab>, <Step></Step>, etc. tags in public/md-src .md files and in the PDF pipeline's marked.parse() input if a future content author used the comment-only component pattern.

How to fix it

Move the stripMdxBlockComments call to run before stripMdxJsxArtifacts in stripOutsideFences():

function stripOutsideFences(text, unwrapCalloutsForPlainMd) {
  let s = stripTopLevelImportBlocks(text);
  if (unwrapCalloutsForPlainMd) {
    s = unwrapCalloutTags(s);
  }
  s = stripMdxBlockComments(s);  // moved before artifact stripping
  s = stripMdxJsxArtifacts(s, { preserveCallout: !unwrapCalloutsForPlainMd });
  s = s.replace(/\n{3,}/g, "\n\n");
  return s;
}

This ensures comment-only components are emptied before the empty-pair regex runs, so they are correctly removed.
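
The effect of the ordering can be checked with a small experiment. In this sketch, stripMdxBlockComments uses the regex quoted earlier in this review, while stripEmptyPairsSketch is a hypothetical, simplified stand-in for stripMdxJsxArtifacts() that implements only the empty-pair rule for PascalCase tags.

```javascript
// Comment regex as quoted from stripMdxBlockComments() in this review.
function stripMdxBlockComments(text) {
  return text.replace(/\{\s*\/\*[\s\S]*?\*\/\s*\}/g, "");
}

// Hypothetical stand-in for stripMdxJsxArtifacts(): remove empty PascalCase
// tag pairs, repeating until a pass changes nothing (handles nesting).
function stripEmptyPairsSketch(text) {
  const emptyPair = /<([A-Z][A-Za-z0-9]*)(?:\s[^>]*)?>\s*<\/\1>/g;
  let prev;
  let out = text;
  do {
    prev = out;
    out = out.replace(emptyPair, "");
  } while (out !== prev);
  return out;
}

const input = "<Tabs><Tab>{/* placeholder */}</Tab></Tabs>";

// Current order: artifacts first, comments second. Empty tags survive.
const artifactsFirst = stripMdxBlockComments(stripEmptyPairsSketch(input));

// Proposed order: comments first, artifacts second. Everything is removed.
const commentsFirst = stripEmptyPairsSketch(stripMdxBlockComments(input));

console.log(JSON.stringify(artifactsFirst), JSON.stringify(commentsFirst));
```

Under these assumptions the current order leaves the empty Tabs and Tab pairs behind, while the reordered version reduces the same input to an empty string.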

Comment on lines +95 to +101
function isCompleteMultilineImportWithoutSemicolon(block) {
  const parts = block.split(/\r?\n/);
  if (parts.length < 2) return false;
  const last = parts[parts.length - 1];
  if (!/^\s*import\s/.test(parts[0]) || /;\s*$/.test(last)) return false;
  return /\bfrom\s+['"][^'"]+['"]\s*$/.test(last);
}

🟡 isCompleteMultilineImportWithoutSemicolon() uses /\bfrom\s+['"][^'"]+['"]\s*$/ to detect import completion, but this matches any line ending with from "..." text, including inline comments like // from "shadcn". That causes the accumulation loop to break prematurely and leaks the real closing line (} from "...") into the generated .md output as plain text. No current MDX files trigger this pattern outside code fences, but the regex should be restricted to lines that actually close an import (e.g. /^\s*(}\s*)?from\s+['"][^'"]+['"]\s*$/).

Extended reasoning...

The false positive lives in isCompleteMultilineImportWithoutSemicolon() (lib/stripMdxForPlainMarkdown.js lines 95–101). The function splits the accumulated block on newlines, isolates the last line, and returns true when that line matches /\bfrom\s+['"][^'"]+['"]\s*$/. The use of \b (word boundary) rather than a line-start anchor means any line that ends with from "...", including an inline comment, satisfies the test.

The specific code path: inside stripTopLevelImportBlocks(), when the outer loop finds a line starting with import\s, an inner while loop accumulates subsequent lines. On each iteration it calls isCompleteMultilineImportWithoutSemicolon(block). If any intermediate line ends with from "something" text (e.g. Carousel, // from "shadcn"), the function returns true for a block that is not yet complete. The inner loop breaks, i = j + 1 points past that intermediate line, and the outer loop resumes, pushing the real closing line } from "@/components/ui/carousel" to out[] as plain text.

Why the existing regex does not prevent it: the pattern \bfrom\s+['"][^'"]+['"]\s*$ only requires a word boundary before from, not that from appears at the structural position where an import closes (either as a bare from "..." clause or after a closing brace }). A trailing comment like // from "shadcn" ends with exactly this pattern.

Concrete proof: for block = 'import {\n Carousel, // from "shadcn"', the last line is Carousel, // from "shadcn". Running /\bfrom\s+['"][^'"]+['"]\s*$/.test(" Carousel, // from \"shadcn\"") returns true (all five verifiers confirmed this empirically). The accumulator breaks, and the remaining } from "@/components/ui/carousel" is output as a plain-text paragraph in the generated .md file.

Impact is currently zero: an exhaustive grep of all content/ MDX files finds no instances of the // from "..." comment pattern inside import blocks outside code fences. This is new code introduced by this PR, so it is a latent defect rather than a regression.

Fix: restrict the from-clause check to lines that structurally close an import: /^\s*(\}\s*)?from\s+['"][^'"]+['"]\s*$/.test(last). This requires the line to start with optional whitespace, then optionally } before from, ruling out mid-import comment lines.
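
The two patterns can be compared directly. The sketch below only exercises the regexes quoted in this comment; the variable names are illustrative and the test lines are taken from, or modeled on, the example above.

```javascript
// Current check: any line that ends with `from "..."` counts as a closing line.
const looseFrom = /\bfrom\s+['"][^'"]+['"]\s*$/;

// Proposed check: the line must consist of an optional `}` and the from-clause.
const strictFrom = /^\s*(\}\s*)?from\s+['"][^'"]+['"]\s*$/;

const commentLine = '  Carousel, // from "shadcn"';
const closingLine = '} from "@/components/ui/carousel"';
const bareFromLine = 'from "@/components/ui/carousel"';
// Last specifier and closing brace share a line (a formatting some authors use).
const sameLineClose = '  CarouselItem } from "@/components/ui/carousel"';

const results = {
  looseOnComment: looseFrom.test(commentLine),
  strictOnComment: strictFrom.test(commentLine),
  strictOnClosing: strictFrom.test(closingLine),
  strictOnBareFrom: strictFrom.test(bareFromLine),
  strictOnSameLineClose: strictFrom.test(sameLineClose),
};
console.log(results);
```

One caveat worth noting: the stricter pattern also rejects a closing line where the last specifier precedes the brace on the same line, so an import formatted that way (and lacking a semicolon) would need separate handling.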

Comment on lines 226 to 231
// Run BEFORE Next serves content/public files so it can override HTML routes
// when the client explicitly asks for markdown.
beforeFiles: [
  // /support.md → raw markdown from the Support page (content/marketing/support.mdx → md-src/marketing/support.md)
  {
    source: "/support.md",
    destination: "/md-src/marketing/support.md",
  },

  // Optional: make "/" negotiable too (remove if you don't have md-src/index.md)
  {
    source: "/",

🟡 The afterFiles rewrite source: /:path*.md has no exclusion for paths already under /md-src/, so a request to a non-existent /md-src/foo.md is rewritten to /md-src/md-src/foo.md instead of returning a clean 404. The beforeFiles content-negotiation rewrite already guards against this with (?!api|_next|md-src) in its source pattern. Add the same negative lookahead to afterFiles, e.g. source: '/:path((?!md-src).+).md'.

Extended reasoning...

What the bug is and how it manifests

The afterFiles rewrite block in next.config.mjs (lines 226-231) maps every *.md request to /md-src/*.md:

afterFiles: [
  {
    source: "/:path*.md",
    destination: "/md-src/:path*.md",
  },
],

There is no guard to exclude paths that already start with /md-src/. When a client requests /md-src/nonexistent.md, Next.js first attempts to serve it as a static file from public/md-src/. If the file does not exist, Next.js falls through to the afterFiles rewrite phase, where /:path*.md matches with path = md-src/nonexistent, and the rewrite destination becomes /md-src/md-src/nonexistent.md — a doubled prefix that also does not exist.

Why existing code doesn't prevent it

The beforeFiles content-negotiation rewrite already handles this correctly. Its source pattern explicitly excludes md-src:

source: "/:path((?!api|_next|md-src)(?!.*\.md$)(?!.*\.txt$).*)",

The afterFiles rule was authored without an equivalent negative lookahead, creating an inconsistency between the two rewrite sections.

Impact and scope

The bug is latent — it only triggers when a static file does not exist under public/md-src/. In normal production operation, all valid paths are present after the build. However, any typo, stale link, or programmatically constructed /md-src/ URL that misses the static file check fails with an opaque doubled-prefix 404 (/md-src/md-src/...) rather than a straightforward one, making debugging harder. There is no infinite redirect loop since both paths are dead ends.

Step-by-step proof

  1. Client requests GET /md-src/nonexistent.md
  2. Next.js static file check: public/md-src/nonexistent.md does not exist → static serve fails
  3. afterFiles phase: source: /:path*.md matches with path = md-src/nonexistent
  4. Destination resolves to /md-src/md-src/nonexistent.md
  5. Static file public/md-src/md-src/nonexistent.md also does not exist → 404
  6. The error response references the doubled path, obscuring the original request
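
The steps above can be approximated outside Next.js with a plain regular expression. This is only a sketch: `UNGUARDED_SOURCE` and `rewriteMd` are hypothetical names, and the RegExp is a hand-written stand-in for the `/:path*.md` source, not the matcher Next.js actually compiles.

```javascript
// Hand-written approximation of the afterFiles rule; NOT Next.js's real matcher.
// "/:path*.md" is modelled as: a slash, any path (slashes allowed), then ".md".
const UNGUARDED_SOURCE = /^\/(.*)\.md$/;

// Hypothetical helper: returns the rewrite destination,
// or null when the rule does not match the request path.
function rewriteMd(pathname) {
  const match = UNGUARDED_SOURCE.exec(pathname);
  if (match === null) return null;
  return `/md-src/${match[1]}.md`;
}

console.log(rewriteMd("/terms.md"));              // /md-src/terms.md
console.log(rewriteMd("/md-src/nonexistent.md")); // /md-src/md-src/nonexistent.md
```

The second call shows the doubled prefix: the captured path already starts with `md-src/`, and the destination prepends `/md-src/` again.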

Fix

Add a negative lookahead to the afterFiles source pattern to mirror the guard already present in beforeFiles:

afterFiles: [
  {
    source: "/:path((?!md-src).+).md",
    destination: "/md-src/:path.md",
  },
],
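
As a rough check of the guard, the same negative lookahead can be tried in a plain RegExp. This sketch assumes the `:path` parameter sits directly after the leading slash. `GUARDED_SOURCE` and `rewriteMdGuarded` are illustrative names, and Next.js compiles its source patterns itself, so the real behavior should still be verified in the app.

```javascript
// Approximation of the guarded rule: the lookahead rejects any path whose
// first characters after the leading slash are "md-src".
const GUARDED_SOURCE = /^\/((?!md-src).+)\.md$/;

// Hypothetical helper: destination for a matching path, otherwise null.
function rewriteMdGuarded(pathname) {
  const match = GUARDED_SOURCE.exec(pathname);
  return match === null ? null : `/md-src/${match[1]}.md`;
}

console.log(rewriteMdGuarded("/terms.md"));              // /md-src/terms.md
console.log(rewriteMdGuarded("/docs/overview.md"));      // /md-src/docs/overview.md
console.log(rewriteMdGuarded("/md-src/nonexistent.md")); // null
```

Note that the lookahead as written also skips any path that merely begins with `md-src`, such as `/md-src-notes.md`. Whether that matters depends on the site's routes.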

Comment on lines +108 to +120
function stripTopLevelImportBlocks(text) {
  const lines = text.split(/\r?\n/);
  /** @type {string[]} */
  const out = [];
  let i = 0;
  while (i < lines.length) {
    const line = lines[i];
    if (/^\s*import\s/.test(line)) {
      let j = i;
      let block = lines[j];
      while (!/;\s*$/.test(block) && j + 1 < lines.length) {
        if (
          isCompleteSingleLineImportWithoutSemicolon(block) ||

🟡 In stripTopLevelImportBlocks(), an import line ending with ; // comment causes the accumulation while loop to consume all subsequent lines to EOF, silently dropping all file content after that import. The outer while condition !/;\s*$/.test(block) evaluates to true because the line ends with comment text rather than semicolon+whitespace, and isCompleteSingleLineImportWithoutSemicolon also returns false since the line does not end with the from-clause closing quote pattern. No current top-level MDX import (outside code fences) triggers this today, but the new ASI-safe import logic introduced a gap where a trailing inline comment on a semicoloned import could silently corrupt an entire page's output.

Extended reasoning...

What the bug is and how it manifests

The while loop in stripTopLevelImportBlocks() (lib/stripMdxForPlainMarkdown.js:108-120) accumulates lines until it finds a semicolon terminator. The outer condition is !/;\s*$/.test(block), which checks only whether the string ends with semicolon + optional whitespace. For an import like import { LangfuseClient } from "@langfuse/client"; // Add this import, the line contains a semicolon but ends with comment text — so the condition evaluates to true and the accumulation loop is entered.

The specific code path that triggers it

Once inside the accumulation branch, two completeness checks are evaluated: isCompleteSingleLineImportWithoutSemicolon(block) and isCompleteMultilineImportWithoutSemicolon(block). The first function returns false because its final regex /\bfrom\s+['"][^'"]+['"]\s*$/ requires the line to end with a closing quote — it does not match a line ending in comment text. The second returns false because there is only one line. Neither check breaks the loop, so j increments through every subsequent line until EOF. When the inner loop exits, i = j + 1 = lines.length, causing the outer loop to also exit immediately. None of those accumulated lines are ever pushed to out[], so they are permanently lost.

Why existing code does not prevent it

The fix for no-semicolon imports correctly handles the case where a semicolon is entirely absent, via isCompleteSingleLineImportWithoutSemicolon and isCompleteMultilineImportWithoutSemicolon. However, it did not account for the case where a semicolon is present but obscured by a trailing inline comment. The /;\s*$/ check requires the semicolon to be the final non-whitespace character; a // comment suffix defeats it. The completeness helpers also fail because they check for the from-clause closing quote at end-of-line.

What the impact would be

If any content author adds a top-level MDX import (outside a code fence) with a trailing inline comment — a common style in JS/TS projects — all content after that import would be silently dropped from public/md-src/ output and PDF generation. No build error, no warning, no visible signal; the generated .md file would contain only frontmatter and content up to that import.

Current exposure and why it is nit severity

All four verifiers confirmed that no current MDX file outside a code fence triggers this pattern. The sole matching instance (content/integrations/frameworks/vercel-ai-sdk.mdx line 130) is safely inside a triple-backtick fence, classified as fenced: true by splitMarkdownByTripleBacktickFences(), so stripTopLevelImportBlocks() never sees it. This is a latent defect in new code introduced by this PR.

How to fix it

Extend the while loop termination condition to also treat an import as complete when a semicolon appears before any trailing // comment. For example, also match /;\s*(\/\/.*)?\s*$/ so that a semicolon optionally followed by a space and inline comment is recognized as a terminator.
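
A minimal sketch of that terminator check, assuming only trailing `//` line comments need to be tolerated (block comments and `//` inside string literals are not handled). `IMPORT_TERMINATOR` and `endsImportStatement` are hypothetical names, not identifiers from the PR.

```javascript
// Semicolon, optional whitespace, optional trailing "// ..." comment, end of input.
const IMPORT_TERMINATOR = /;\s*(\/\/.*)?\s*$/;

// Hypothetical helper: true when the accumulated import text looks terminated.
function endsImportStatement(block) {
  return IMPORT_TERMINATOR.test(block);
}

console.log(endsImportStatement('import { X } from "mod";'));            // true
console.log(endsImportStatement('import { X } from "mod"; // comment')); // true
console.log(endsImportStatement('import { X } from "mod"'));             // false
console.log(endsImportStatement("import {"));                            // false
```

The semicolon-free cases still return false here, so they would continue to rely on the existing completeness helpers.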

Step-by-step proof

  1. Input line: import { X } from "mod"; // comment
  2. Outer while: !/;\s*$/.test(block) is true (line ends with comment text, not semicolon+whitespace)
  3. isCompleteSingleLineImportWithoutSemicolon(block): the /\bfrom\s+['"][^'"]+['"]\s*$/ test fails (line ends with comment) so returns false
  4. isCompleteMultilineImportWithoutSemicolon(block): returns false (single line)
  5. Loop increments j, appends next line, repeats for every remaining line through EOF
  6. Loop exits with j = lines.length - 1; i = lines.length; outer loop exits
  7. out[] contains only lines before the import; all subsequent content is permanently dropped
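
The trace above can be reproduced with a simplified stand-in for the loop. This is a sketch reconstructed from the description in this comment, not the PR's code: `stripImportsSketch` and `looksCompleteWithoutSemicolon` are hypothetical, and the latter only reuses the from-clause regex quoted above in place of the two real helpers.

```javascript
// Stand-in for the "complete import without semicolon" helpers (assumption).
function looksCompleteWithoutSemicolon(block) {
  return /\bfrom\s+['"][^'"]+['"]\s*$/.test(block);
}

// Simplified import stripper; `terminator` decides when an import has ended.
function stripImportsSketch(text, terminator) {
  const lines = text.split(/\r?\n/);
  const out = [];
  let i = 0;
  while (i < lines.length) {
    if (/^\s*import\s/.test(lines[i])) {
      let j = i;
      let block = lines[j];
      // Accumulate lines until the import looks finished or the input ends.
      while (!terminator.test(block) && j + 1 < lines.length) {
        if (looksCompleteWithoutSemicolon(block)) break;
        j += 1;
        block += "\n" + lines[j];
      }
      i = j + 1; // skip everything that was accumulated
      continue;
    }
    out.push(lines[i]);
    i += 1;
  }
  return out.join("\n");
}

const page = [
  "# Title",
  'import { X } from "mod"; // comment',
  "First paragraph.",
  "Second paragraph.",
].join("\n");

const strict = /;\s*$/;               // check quoted in this comment
const lenient = /;\s*(\/\/.*)?\s*$/;  // check suggested in this comment

console.log(JSON.stringify(stripImportsSketch(page, strict)));
// "# Title"
console.log(JSON.stringify(stripImportsSketch(page, lenient)));
// "# Title\nFirst paragraph.\nSecond paragraph."
```

With the strict check, both paragraphs after the import are swallowed. With the lenient check, only the import line is removed.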

felixkrrr requested a review from nimarb April 8, 2026 16:30
@felixkrrr
Contributor Author

@nimarb i am a bit lost here claude keeps commenting and stuff gets more complicated regex - i think this is a bottomless thread

I think we want to have you and ben be more in the loop with website functionality.

Any chance one of you can help me with this issue -

@nimarb
Contributor

nimarb commented Apr 8, 2026

> @nimarb i am a bit lost here claude keeps commenting and stuff gets more complicated regex - i think this is a bottomless thread
>
> I think we want to have you and ben be more in the loop with website functionality.
>
> Any chance one of you can help me with this issue -

Is the goal here that people can export the terms as a PDF?
We can help!

Contributor Author

@nimarb

Yes, this is currently broken on prod.

It worked with my fixes locally (it won't work on dev deployments without significant changes, but that would be OK for now). My problem is that the regex .md stripper keeps running into more and more edge cases. We are OK with a few md pages that show too much content.

Fixing the broken terms page download 🛠️ — Tella
