Lfe 9137/md to pdf by felixkrrr · Pull Request #2775 · langfuse/langfuse-docs

felixkrrr · 2026-04-07T16:24:11Z

Issues were:

md-to-pdf conversion does not work for pages like /terms.
The content/ layout (folders like marketing/, docs/) did not match URL paths 1:1 (e.g. marketing pages live at /terms, customers at /users). Auto-exporting “raw” paths from content/ produced wrong or missing .md URLs and broke anything that assumed a simple path (/terms.md, PDF fetch, etc.). In short, how the content directory is organized no longer matches how the site is routed.

Adjacent fixes:

Added md overrides for certain pages: pricing.md and pricing-self-host.md.

Desired fixes noted by Marc:

stable default mapping; the new website folder structure of md files is not 1:1 the path mapping anymore; thus there needs to be some sort of fix that the md folder then mirrors the actual paths
I think having an md-override folder would be good; on build, we first auto-generate the mds; then we overwrite them again in case they get overridden (ie we have an md only version of the page, eg for pricing) -> thereby we have a very maintainable structure

Main upstream

vercel · 2026-04-07T16:24:16Z

Project	Deployment	Actions	Updated (UTC)
langfuse-docs	Ready	Preview, Comment	Apr 8, 2026 7:54am

github-actions · 2026-04-07T16:24:21Z

@claude review

lib/stripMdxForPlainMarkdown.js

scripts/copy_md_sources.js

felixkrrr · 2026-04-07T18:45:44Z

@claude pls review again

lib/stripMdxForPlainMarkdown.js

claude

Additional findings (outside current diff — PR may have been updated during review):

🔴 app/api/md-to-pdf/route.ts:111-119: The PDF callout styling pipeline is broken. copy_md_sources.js strips all <Callout type='X'>...</Callout> tags before writing to public/md-src/, so by the time the PDF route fetches those files and calls processCallouts(), zero Callout tags remain. The five CSS callout classes (.callout-info, .callout-warn, etc.) in the HTML template are dead code, and all callout blocks render as unstyled plain paragraphs in every generated PDF. Fix by either preserving Callout tags in md-src files (pass unwrapCalloutsForPlainMd: false at build time and keep the PDF-time call as-is), or applying processCallouts() before stripping at the build step.
Extended reasoning...

What the bug is and how it manifests

The PDF callout styling pipeline has a two-step contradiction. At build time, scripts/copy_md_sources.js calls stripMdxForPlainMarkdown(inlined, { unwrapCalloutsForPlainMd: true }) for every MDX file. Inside that call, stripOutsideFences() invokes unwrapCalloutTags(), which applies /<Callout\s+type=["'](\w+)["']\s*>([\s\S]*?)<\/Callout>/g and replaces every typed Callout tag with just its inner content (wrapped in newlines). By the time public/md-src/ files are written, all Callout tags have been erased.

The specific code path that triggers it

At render time, app/api/md-to-pdf/route.ts line 113–115 fetches from /md-src/ and calls stripMdxForPlainMarkdown(markdownContent, { unwrapCalloutsForPlainMd: false }). With no Callout tags remaining in the content, this call is a complete no-op — unwrapCalloutTags() is skipped, and stripMdxJsxArtifacts() (running with preserveCallout: true) also finds nothing to act on. The markdown then flows into marked.parse() as plain text, producing HTML with no <Callout> elements.

Why existing code doesn't prevent it

processCallouts() (route.ts line 26–30) then runs on this HTML using regex /<Callout\s+type=["'](\w+)["']\s*>([\s\S]*?)<\/Callout>/g. Since marked.parse() received markdown with no Callout tags, the resulting HTML also has none, and the regex finds zero matches. The five CSS callout classes (.callout-info, .callout-warn, .callout-error, .callout-warning, .callout-danger) defined in the HTML template are therefore never applied to any element.

What the impact is

Every callout block (e.g., <Callout type='info'>Important note</Callout>) in the source MDX renders as plain unstyled paragraph text in all generated PDFs for Langfuse URL sources. The visual differentiation between info, warning, error, and danger callouts — which is presumably the reason processCallouts() and the five CSS classes were written — is completely absent. This is a functional regression introduced by this PR: before the PR, copy_md_sources.js did not call stripMdxForPlainMarkdown at all, so Callout tags were preserved in md-src files and processCallouts() worked correctly.

Step-by-step proof
1. Source MDX contains: <Callout type="info">You must enable billing first.</Callout>
2. Build time: unwrapCalloutTags() matches it → replaced with \n\nYou must enable billing first.\n\n
3. public/md-src/docs/example.md now contains only the plain text, no Callout tags
4. PDF route: fetches /md-src/docs/example.md, calls stripMdxForPlainMarkdown(..., { unwrapCalloutsForPlainMd: false }) — no-op since no Callout tags exist
5. marked.parse() converts markdown → HTML paragraph: <p>You must enable billing first.</p>
6. processCallouts(htmlContent) searches for <Callout in HTML → zero matches
7. CSS class .callout-info is never applied; callout renders as an unstyled <p> tag

How to fix it

The cleanest fix is to change the build step to pass unwrapCalloutsForPlainMd: false (i.e., preserve Callout tags in md-src files), relying on the PDF route's processCallouts() to convert them to styled divs at render time. Alternatively, if plain-text md-src output is required for other consumers, add a separate Callout-to-styled-div conversion pass in the build step before stripping.
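
The contradiction can be reproduced in isolation. The sketch below reuses the regex quoted in this comment; processCallouts() here is a hypothetical stand-in (the real one lives in route.ts and may differ), and the marked.parse() step is skipped for brevity.

```javascript
// Regex quoted in the review for typed Callout tags.
const CALLOUT_RE = /<Callout\s+type=["'](\w+)["']\s*>([\s\S]*?)<\/Callout>/g;

// Build-time behaviour as described above: replace each Callout with its inner text.
function unwrapCalloutTags(text) {
  return text.replace(CALLOUT_RE, "\n\n$2\n\n");
}

// Hypothetical stand-in for processCallouts(): wrap the content in a styled div.
function processCallouts(html) {
  return html.replace(CALLOUT_RE, '<div class="callout callout-$1">$2</div>');
}

const source = '<Callout type="info">You must enable billing first.</Callout>';

// Tags preserved at build time: the PDF-time pass can style them.
const preserved = processCallouts(source);
// Tags unwrapped at build time: the PDF-time pass has nothing left to match.
const stripped = processCallouts(unwrapCalloutTags(source));

console.log(preserved); // <div class="callout callout-info">You must enable billing first.</div>
console.log(JSON.stringify(stripped)); // "\n\nYou must enable billing first.\n\n"
```

Whichever fix is chosen, the invariant is the same: the styling pass has to see the tags before anything unwraps them.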

claude · 2026-04-07T19:28:03Z

lib/stripMdxForPlainMarkdown.js

+ * @param {string} content
+ * @returns {{ fenced: boolean, text: string }[]}
+ */
+function splitMarkdownByTripleBacktickFences(content) {
+  /** @type {{ fenced: boolean, text: string }[]} */
+  const segments = [];
+  let remaining = content;
+  while (remaining.length > 0) {
+    const open = remaining.indexOf("```");
+    if (open === -1) {
+      segments.push({ fenced: false, text: remaining });
+      break;
+    }
+    if (open > 0) {
+      segments.push({ fenced: false, text: remaining.slice(0, open) });
+    }
+    const afterOpen = remaining.slice(open + 3);
+    const close = afterOpen.indexOf("```");
+    if (close === -1) {
+      segments.push({ fenced: true, text: remaining.slice(open) });
+      break;
+    }
+    const fenceEnd = open + 3 + close + 3;
+    segments.push({ fenced: true, text: remaining.slice(open, fenceEnd) });
+    remaining = remaining.slice(fenceEnd);
+  }
+  return segments;
+}
+


🔴 splitMarkdownByTripleBacktickFences() uses indexOf to locate any 3-backtick sequence, so a 4-backtick opening fence (e.g. ````markdown filename=X) is matched by the 4th backtick + the first inner triple-backtick code block as its "close", causing the fenced segment to be truncated and all subsequent inner content to be classified as non-fenced. Current affected files (content/blog/2026-02-13-will-you-be-my-cli.mdx lines 72 and 210, content/docs/observability/features/multi-modality.mdx line 156) contain only bash/prose inside those fences so stripOutsideFences does no damage today, but any future 4-backtick block containing import statements or JSX components would be incorrectly stripped by this misclassification.

Extended reasoning...

What the bug is and how it manifests

splitMarkdownByTripleBacktickFences() (lib/stripMdxForPlainMarkdown.js:22–50) finds fence boundaries with remaining.indexOf("```"). When encountering a 4-backtick opening fence like ````markdown filename=X, it finds the first 3 of those 4 backticks at position open. It then slices to afterOpen = remaining.slice(open + 3), which starts with the remaining single backtick followed by markdown filename=X\n.... The subsequent afterOpen.indexOf("```") then finds the first inner triple-backtick code block, not the matching 4-backtick close fence. This produces a malformed fenced segment that ends prematurely, and all content from that inner opener onward is emitted as non-fenced.

The specific code path that triggers it

In scripts/copy_md_sources.js, every .mdx file is processed through stripMdxForPlainMarkdown(inlined, { unwrapCalloutsForPlainMd: true }). Internally, splitMarkdownByTripleBacktickFences partitions the document, then non-fenced parts are passed to stripOutsideFences() which calls stripTopLevelImportBlocks(), unwrapCalloutTags(), and stripMdxJsxArtifacts(). Content misclassified as non-fenced is therefore subject to all three stripping passes.

Why existing code does not prevent it

The algorithm was written assuming fences always use exactly 3 backticks. CommonMark/MDX allow any run of 3+ backticks as a fence, and the closing fence only needs to match or exceed the opening count. The function has no mechanism to count how many backticks form the opening fence, and always searches for exactly 3 backticks as the closing delimiter.

What the impact is

Today's impact is limited: the files with 4-backtick fences contain bash commands and plain prose, so the stripping passes do not destroy any content. However, the segmentation is structurally wrong. Any future content author who uses 4-backtick fences (a common MDX pattern to embed inner triple-backtick examples) around MDX components or import statements would have those constructs silently stripped by stripMdxJsxArtifacts() or stripTopLevelImportBlocks(), producing a corrupted output .md file with no warning.

How to fix it

Count backticks at the opening fence and require the same count to close:

  const open = remaining.search(/`{3,}/);
  if (open === -1) { /* ... */ }
  const openFence = remaining.match(/`{3,}/)[0];
  const closeIdx = remaining.slice(open + openFence.length).indexOf(openFence);

Alternatively, detect fences only at line start with a line-by-line pass, which is the CommonMark spec approach.
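
A minimal sketch of the line-by-line alternative, assuming backtick fences only (no tilde fences) and a hypothetical function name. Unlike the original helper, segments here do not carry their separating newline, so callers would rejoin them with "\n".

```javascript
// Sketch: split markdown into fenced / non-fenced segments, line by line.
// A fence opens on a line starting with 3+ backticks (up to 3 spaces of indent)
// and closes on a line made only of at least as many backticks.
function splitMarkdownByBacktickFences(content) {
  const segments = [];
  let buffer = [];
  let openLen = 0; // 0 means "not inside a fence"

  const flush = (fenced) => {
    if (buffer.length > 0) {
      segments.push({ fenced, text: buffer.join("\n") });
      buffer = [];
    }
  };

  for (const line of content.split("\n")) {
    if (openLen === 0) {
      const m = /^ {0,3}(`{3,})/.exec(line);
      if (m) {
        flush(false); // close the preceding prose segment
        openLen = m[1].length; // remember how many backticks opened the fence
      }
      buffer.push(line);
    } else {
      buffer.push(line);
      const m = /^ {0,3}(`{3,})\s*$/.exec(line);
      if (m && m[1].length >= openLen) {
        flush(true); // closing fence is part of the fenced segment
        openLen = 0;
      }
    }
  }
  flush(openLen !== 0); // an unclosed fence runs to end of input
  return segments;
}
```

With this shape, an inner 3-backtick block inside a 4-backtick fence stays in the same fenced segment, because a 3-backtick line is too short to close a fence opened with 4.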

Step-by-step proof with content/blog/2026-02-13-will-you-be-my-cli.mdx line 72

remaining starts with ````markdown filename=...\n...inner content...\n```bash\n...

remaining.indexOf("```") → 0 (first 3 of the 4 backticks)

afterOpen = remaining.slice(3) → starts with `markdown filename=...\n...inner content...\n```bash\n...

afterOpen.indexOf("```") → finds the ```bash inner block opener (not the 4-backtick close)

Segment 0 (fenced=true): only ````markdown filename=...\n...inner content\n``` (truncated)

remaining is now advanced to the bash\n... content and all subsequent lines

Those lines are classified as fenced=false (non-fenced) and passed to stripOutsideFences()

claude · 2026-04-07T19:28:03Z

lib/stripMdxForPlainMarkdown.js

+ * Does not strip arbitrary `{expr}` — those can appear in YAML frontmatter or JSX
+ * attributes; comments are the main noise in PDF / md-src output.
+ * @param {string} text
+ */
+function stripMdxBlockComments(text) {
+  return text.replace(/\{\s*\/\*[\s\S]*?\*\/\s*\}/g, "");
+}
+
+/**
+ * Single-line `import … from "…"` or `import "…"` without trailing `;` (valid ESM / ASI).
+ * @param {string} block
+ */
+function isCompleteSingleLineImportWithoutSemicolon(block) {
+  const parts = block.split(/\r?\n/);
+  if (parts.length !== 1) return false;
+  const line = parts[0];
+  if (!/^\s*import\s/.test(line) || /;\s*$/.test(line)) return false;
+  return (
+    /\bfrom\s+['"][^'"]+['"]\s*$/.test(line) ||
+    /^\s*import\s+['"][^'"]+['"]\s*$/.test(line)
+  );
+}


🔴 The new stripTopLevelImportBlocks() function in lib/stripMdxForPlainMarkdown.js only strips lines matching /^\s*import\s/, so top-level MDX export function and export const definitions are left completely intact and land verbatim as raw JavaScript + JSX in the generated public/md-src/ .md files. At least three files are actively affected today: content/marketing/enterprise.mdx (line 72: export function TalkToUsButton()), content/docs/administration/rbac.mdx (line 54: export function RolePermissionTable()), and content/docs/administration/llm-connection.mdx (line 57: export function ModelList()). The fix is to extend stripTopLevelImportBlocks() (or add a parallel pass) to also detect and skip top-level export statement blocks by tracking brace depth.

Extended reasoning...

What the bug is and how it manifests

stripTopLevelImportBlocks() (lib/stripMdxForPlainMarkdown.js lines 68–89) scans each line for the pattern /^\s*import\s/ and, when matched, accumulates and discards the full import block including multi-line forms. The function has no equivalent handling for top-level export statements. MDX files commonly define React components at the top level using export function Foo() { ... } or export const Bar = () => (...) — these are local component definitions used within the MDX file. Because none of these lines start with import, they pass entirely through stripTopLevelImportBlocks() unchanged. The multi-line JSX bodies then also pass through stripMdxJsxArtifacts() unchanged, since the JSX inside contains non-empty content that the self-closing and empty-pair patterns cannot strip. The complete JS/JSX function definition lands verbatim as literal paragraph text in public/md-src/ .md files.

The specific code path that triggers it

During the build, scripts/copy_md_sources.js calls stripMdxForPlainMarkdown(inlined, { unwrapCalloutsForPlainMd: true }) on every .mdx file. splitMarkdownByTripleBacktickFences() first splits on triple-backtick fences to leave code blocks alone. The export function definitions in the three affected files are outside any code fence — they are top-level MDX component definitions, not code samples. Those segments pass to stripOutsideFences() -> stripTopLevelImportBlocks(). Since lines start with export, not import, the /^\s*import\s/ test fails and every line is pushed unchanged to out[]. stripMdxJsxArtifacts() subsequently also leaves them intact.

Why existing code does not prevent it

The function was authored only to strip ESM import statements and was never extended to cover export statements. stripMdxJsxArtifacts() only addresses self-closing and empty paired PascalCase tags — it does not recognize raw JavaScript syntax such as function declarations, return (, braces, or arrow function bodies. There is no other pass in the pipeline that catches these blocks.

What the impact would be

Three files confirmed by verifiers: (1) content/marketing/enterprise.mdx lines 72–85: multi-line export function TalkToUsButton() with a full JSX body including Button components survives into public/md-src/enterprise.md. (2) content/docs/administration/rbac.mdx lines 54–71: export function RolePermissionTable({ roleScopes }) with Object.entries and JSX survives. (3) content/docs/administration/llm-connection.mdx line 57: export function ModelList() similarly survives. A fourth case was noted: content/docs/roadmap.mdx line 22 has export const ChangelogList = () => ( outside a code fence. In all cases the generated .md files contain raw JavaScript and JSX as literal paragraph text — garbage output for the PDF pipeline, LLM context, and search indexing.

How to fix it

Extend stripTopLevelImportBlocks() or add a parallel pass to detect and skip top-level export statement blocks. The simplest heuristic: detect a line matching /^\s*export\s+(function|const|class|default)\s/ then accumulate lines while tracking brace depth (increment on {, decrement on }), stopping when depth returns to zero. This handles function bodies, arrow functions with parens, and class definitions without needing a full parser.
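
A rough sketch of that heuristic, under a few assumptions: the function name is hypothetical, parentheses are counted alongside braces so that export const X = () => ( ... ) is covered, and brackets inside strings or JSX text are not special-cased. If a block never balances, the lines are kept rather than dropped.

```javascript
// Sketch: drop top-level `export ...` blocks by tracking bracket depth.
function stripTopLevelExportBlocks(text) {
  const lines = text.split(/\r?\n/);
  const out = [];
  let i = 0;
  while (i < lines.length) {
    if (/^\s*export\s+(function|const|class|default)\s/.test(lines[i])) {
      let depth = 0;
      let opened = false;
      let end = -1;
      for (let j = i; j < lines.length; j++) {
        for (const ch of lines[j]) {
          if (ch === "{" || ch === "(") {
            depth++;
            opened = true;
          } else if (ch === "}" || ch === ")") {
            depth--;
          }
        }
        // Block is complete once every opened bracket has been closed.
        if (opened && depth <= 0) { end = j; break; }
        // Bracket-free one-liner such as `export const a = 1;`.
        if (!opened && /;\s*$/.test(lines[j])) { end = j; break; }
      }
      if (end !== -1) {
        i = end + 1; // skip the whole export block
        continue;
      }
      // Unbalanced block: fall through and keep the line untouched.
    }
    out.push(lines[i]);
    i++;
  }
  return out.join("\n");
}
```

One known gap in this sketch: a function whose opening brace sits on the line after the signature would be cut off after the first line, so a real implementation would likely want a proper parser or a stricter check.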

Step-by-step proof

Input from content/marketing/enterprise.mdx: lines 72–85 contain export function TalkToUsButton() { return ( ... ); } outside any triple-backtick fence.

splitMarkdownByTripleBacktickFences() returns this block as a { fenced: false, text: ... } segment.

stripTopLevelImportBlocks() scans line by line: line 72 starts with "export function", not "import". The /^\s*import\s/ test fails. Line 72 is pushed to out[]. Lines 73–85 (the function body) are similarly pushed unchanged.

stripMdxJsxArtifacts() runs: self-closing regex — no match (no />); empty-pair regex — no match (non-empty JSX content); empty-div regex — no match. The do/while loop exits with out === prev.

Output: the full export function TalkToUsButton() { ... } block survives verbatim into public/md-src/enterprise.md as literal paragraph text.

claude · 2026-04-07T19:55:59Z

scripts/copy_md_sources.js

        const dir = path.dirname(dest);
        ensureDir(dir);
        const originalContent = fs.readFileSync(file, 'utf8');
-        const processed = inlineComponentsMdx(originalContent, file);
+        const inlined = inlineComponentsMdx(originalContent, file);
+        const processed = stripMdxForPlainMarkdown(inlined, {
+            unwrapCalloutsForPlainMd: true,
+        });


🔴 The md-src build pipeline now calls stripMdxForPlainMarkdown with unwrapCalloutsForPlainMd: true, which strips every <Callout> tag before writing to public/md-src/ — so when the PDF route fetches those files and calls processCallouts(), it finds no <Callout> tags to match. All callout boxes across the 342+ affected content files now render as unstyled plain paragraphs in generated PDFs. Fix by setting unwrapCalloutsForPlainMd: false in scripts/copy_md_sources.js (lines 94–96) so <Callout> tags are preserved in md-src files for the PDF consumer; route.ts already calls stripMdxForPlainMarkdown with unwrapCalloutsForPlainMd: false to preserve them for processCallouts().

Extended reasoning...

What the bug is and how it manifests

Before this PR, copy_md_sources.js only called inlineComponentsMdx() — it never called stripMdxForPlainMarkdown(). The resulting public/md-src/ files retained raw <Callout type="info">content</Callout> tags. The PDF route would fetch these files, call marked.parse() (which passes unknown HTML blocks through verbatim), and then processCallouts(htmlContent) would convert every <Callout> tag into a styled <div class="callout callout-info">...</div>. Callouts were properly styled in PDFs.

After this PR, copy_md_sources.js calls stripMdxForPlainMarkdown(inlined, { unwrapCalloutsForPlainMd: true }) (lines 94–96). Inside stripOutsideFences(), when unwrapCalloutsForPlainMd is true, unwrapCalloutTags(s) is invoked — replacing every <Callout[^>]*>content</Callout> with \n\ncontent\n\n. The public/md-src/ files now contain only plain text where callout blocks used to be.

The specific code path that triggers it

Build time: copy_md_sources.js calls stripMdxForPlainMarkdown(inlined, { unwrapCalloutsForPlainMd: true }) → stripOutsideFences(text, true) → unwrapCalloutTags(s) replaces all <Callout> tags with plain text → written to public/md-src/docs/foo.md.

PDF request: route.ts fetches /md-src/docs/foo.md — no <Callout> tags present in file.

stripMdxForPlainMarkdown(markdownContent, { unwrapCalloutsForPlainMd: false }) in route.ts preserves Callout tags that no longer exist — a no-op.

marked.parse(markdownContent) produces HTML with no <Callout> elements.

processCallouts(htmlContent) regex /<Callout([^>]*)>([\s\S]*?)<\/Callout>/g finds zero matches.

PDF output: callout content appears as an unstyled plain paragraph.

Why existing code does not prevent it

The design intent in route.ts (calling stripMdxForPlainMarkdown with unwrapCalloutsForPlainMd: false to "leave Callout for downstream HTML conversion") is directly contradicted by copy_md_sources.js stripping them at build time with unwrapCalloutsForPlainMd: true. The two consumers (plain markdown for LLM/search vs. PDF needing styled callouts) have conflicting requirements, and the build-time strip wins, making processCallouts() dead code.

Impact

Every content page using <Callout> components served via the PDF pipeline regresses. Grep confirms 342 <Callout> usages across the content/ directory — all affected. The md-override/ pricing files are plain markdown with no Callout components, so those are unaffected. This is a direct regression introduced by this PR.

Step-by-step proof

content/docs/foo.mdx has <Callout type="info">Take note</Callout>

copy_md_sources.js calls stripMdxForPlainMarkdown(inlined, { unwrapCalloutsForPlainMd: true })

unwrapCalloutTags() converts it to \n\nTake note\n\n — tags gone

public/md-src/docs/foo.md contains only plain text

PDF route fetches /md-src/docs/foo.md — no <Callout> tags present

processCallouts(htmlContent) regex finds no matches

Output PDF: "Take note" as plain paragraph, no callout box styling

How to fix it

Change scripts/copy_md_sources.js lines 94–96 to use unwrapCalloutsForPlainMd: false so <Callout> tags remain in md-src files for the PDF consumer. The route.ts path already has the correct flag set, and processCallouts() will then find and style them. For the plain-markdown (LLM/search) consumer, the raw <Callout> HTML in .md files is slightly noisier but functionally benign compared to silently dropping all callout styling in PDFs.

claude · 2026-04-07T19:55:59Z

lib/stripMdxForPlainMarkdown.js

+function stripOutsideFences(text, unwrapCalloutsForPlainMd) {
+  let s = stripTopLevelImportBlocks(text);
+  if (unwrapCalloutsForPlainMd) {
+    s = unwrapCalloutTags(s);
+  }
+  s = stripMdxJsxArtifacts(s, { preserveCallout: !unwrapCalloutsForPlainMd });
+  s = stripMdxBlockComments(s);
+  s = s.replace(/\n{3,}/g, "\n\n");
+  return s;
+}


🟡 In stripOutsideFences(), stripMdxJsxArtifacts runs before stripMdxBlockComments, so a component whose sole content is a JSX block comment — e.g. <Tab>{/* placeholder */}</Tab> — is not matched by the empty-pair regex (since {/* ... */} is non-whitespace), and after the comment is stripped the resulting <Tab></Tab> receives no second cleanup pass and survives verbatim into public/md-src files. This is a pre-existing latent logic flaw; no current content files have components containing only JSX block comments, so there is no active impact. Fix by moving stripMdxBlockComments to run before stripMdxJsxArtifacts.

Extended reasoning...

What the bug is and how it manifests

In stripOutsideFences() (lib/stripMdxForPlainMarkdown.js lines 55–64), the two cleanup passes run in this order: (1) stripMdxJsxArtifacts(s, ...) at line 60, then (2) stripMdxBlockComments(s) at line 61. The empty-pair regex inside stripMdxJsxArtifacts — <Tag(?:\s[^>]*)?>\s*</Tag> — only matches when the content between the tags is all-whitespace. A JSX block comment like {/* placeholder */} is non-whitespace, so a component such as <Tab>{/* placeholder */}</Tab> passes through stripMdxJsxArtifacts unchanged. When stripMdxBlockComments then removes the comment, the result is <Tab></Tab>, but there is no second artifact-stripping pass, so the empty tag pair survives into the output.

The specific code path that triggers it

Input: <Tab>{/* first tab content */}</Tab>

stripMdxJsxArtifacts runs: the empty-pair regex does not match because {/* first tab content */} is non-whitespace. Output unchanged.

stripMdxBlockComments runs: removes {/* first tab content */}. Output: <Tab></Tab>.

No further pass of stripMdxJsxArtifacts runs.

<Tab></Tab> is written verbatim to public/md-src or fed into marked.parse() in the PDF pipeline.

For nested cases like <Tabs><Tab>{/* placeholder */}</Tab><Tab>real content</Tab></Tabs>, the outer <Tabs> wrapper also can never be emptied by the artifact stripper, so it too would survive.

Why existing code does not prevent it

The do/while loop in stripMdxJsxArtifacts is designed to handle nested cleanup in one pass — but it runs to completion before stripMdxBlockComments is called. Once the loop exits, no re-entry occurs. The ordering places the comment-removal step after the artifact-stripping step, which is the inverse of what is needed: comments must be removed first so the artifact stripper can see the resulting empty tags.

What the impact would be and addressing the refutation

An exhaustive grep of the entire content/ directory finds zero instances where a PascalCase component's sole content is a JSX block comment — all real occurrences of {/* */} inside components appear alongside actual content. As all three verifying agents and the refutation agree, there is no active impact today. This is a latent logic flaw that would silently produce orphaned <Tab></Tab>, <Step></Step>, etc. tags in public/md-src .md files and in the PDF pipeline's marked.parse() input if a future content author used the comment-only component pattern.

How to fix it

Move the stripMdxBlockComments call to run before stripMdxJsxArtifacts in stripOutsideFences():

function stripOutsideFences(text, unwrapCalloutsForPlainMd) {
  let s = stripTopLevelImportBlocks(text);
  if (unwrapCalloutsForPlainMd) {
    s = unwrapCalloutTags(s);
  }
  s = stripMdxBlockComments(s); // moved before artifact stripping
  s = stripMdxJsxArtifacts(s, { preserveCallout: !unwrapCalloutsForPlainMd });
  s = s.replace(/\n{3,}/g, "\n\n");
  return s;
}

This ensures comment-only components are emptied before the empty-pair regex runs, so they are correctly removed.
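
The effect of the ordering can be checked with a small standalone script. stripMdxBlockComments() is copied from the diff; stripEmptyPairs() is a simplified stand-in for the empty-pair pass of stripMdxJsxArtifacts() and is an assumption, not the PR's code.

```javascript
// Copied from the diff above.
function stripMdxBlockComments(text) {
  return text.replace(/\{\s*\/\*[\s\S]*?\*\/\s*\}/g, "");
}

// Simplified stand-in: repeatedly remove empty PascalCase tag pairs.
function stripEmptyPairs(text) {
  let prev;
  let out = text;
  do {
    prev = out;
    out = out.replace(/<([A-Z]\w*)(?:\s[^>]*)?>\s*<\/\1>/g, "");
  } while (out !== prev);
  return out;
}

const input = "<Tab>{/* placeholder */}</Tab>";

// Current order: artifacts first, comments second.
const currentOrder = stripMdxBlockComments(stripEmptyPairs(input));
// Proposed order: comments first, artifacts second.
const proposedOrder = stripEmptyPairs(stripMdxBlockComments(input));

console.log(JSON.stringify(currentOrder)); // "<Tab></Tab>"
console.log(JSON.stringify(proposedOrder)); // ""
```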

claude · 2026-04-08T08:20:34Z

lib/stripMdxForPlainMarkdown.js

+function isCompleteMultilineImportWithoutSemicolon(block) {
+  const parts = block.split(/\r?\n/);
+  if (parts.length < 2) return false;
+  const last = parts[parts.length - 1];
+  if (!/^\s*import\s/.test(parts[0]) || /;\s*$/.test(last)) return false;
+  return /\bfrom\s+['"][^'"]+['"]\s*$/.test(last);
+}


🟡 isCompleteMultilineImportWithoutSemicolon() uses /\bfrom\s+['"][^'"]+['"]\s*$/ to detect import completion, but this matches any line ending with from "..." text, including inline comments like // from "shadcn". That causes the accumulation loop to break prematurely and leaks the real closing line (} from "...") into the generated .md output as plain text. No current MDX files trigger this pattern outside code fences, but the regex should be restricted to lines that actually close an import (e.g. /^\s*(}\s*)?from\s+['"][^'"]+['"]\s*$/).

Extended reasoning...

The false positive lives in isCompleteMultilineImportWithoutSemicolon() (lib/stripMdxForPlainMarkdown.js lines 95–101). The function splits the accumulated block on newlines, isolates the last line, and returns true when that line matches /\bfrom\s+['"][^'"]+['"]\s*$/. The use of \b (word boundary) rather than a line-start anchor means any line that ends with from "...", including inline comments, satisfies the test.

The specific code path: inside stripTopLevelImportBlocks(), when the outer loop finds a line starting with import\s, an inner while loop accumulates subsequent lines. On each iteration it calls isCompleteMultilineImportWithoutSemicolon(block). If any intermediate line ends with from "something" text (e.g. Carousel, // from "shadcn"), the function returns true for a block that is not yet complete. The inner loop breaks, i = j + 1 points past that intermediate line, and the outer loop resumes — pushing the real closing line } from "@/components/ui/carousel" to out[] as plain text.

Why the existing regex does not prevent it: the pattern \bfrom\s+['"][^'"]+['"]\s*$ only requires a word boundary before from, not that from appears at the structural position where an import closes (either as a bare from "..." clause or after a closing brace }). A trailing comment like // from "shadcn" ends with exactly this pattern.

Concrete proof: for block = 'import {\n Carousel, // from "shadcn"', the last line is Carousel, // from "shadcn". Running /\bfrom\s+['"][^'"]+['"]\s*$/.test(" Carousel, // from \"shadcn\"") returns true (all five verifiers confirmed this empirically). The accumulator breaks, and the remaining } from "@/components/ui/carousel" is output as a plain-text paragraph in the generated .md file.

Impact is currently zero — an exhaustive grep of all content/ MDX files finds no instances of the // from "..." comment pattern inside import blocks outside code fences. This is new code introduced by this PR, so it is a latent defect rather than a regression.

Fix: restrict the from-clause check to lines that structurally close an import: /^\s*(\}\s*)?from\s+['"][^'"]+['"]\s*$/.test(last). This requires the line to start with optional whitespace, then optionally } before from, ruling out mid-import comment lines.
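
The two patterns can be compared side by side. The constant names below are illustrative only; the regexes are the ones quoted in this comment.

```javascript
// Pattern currently used in isCompleteMultilineImportWithoutSemicolon().
const looseFrom = /\bfrom\s+['"][^'"]+['"]\s*$/;
// Proposed pattern: the line must begin with an optional `}` and then `from`.
const strictFrom = /^\s*(\}\s*)?from\s+['"][^'"]+['"]\s*$/;

const commentLine = '  Carousel, // from "shadcn"';
const closingLine = '} from "@/components/ui/carousel"';

console.log(looseFrom.test(commentLine)); // true  (false positive)
console.log(strictFrom.test(commentLine)); // false
console.log(looseFrom.test(closingLine)); // true
console.log(strictFrom.test(closingLine)); // true
```

Note that the stricter pattern would also reject a closing line that still carries a specifier before the brace (for example, B } from "x"), so it trades one edge case for another.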

claude · 2026-04-08T08:20:34Z

next.config.mjs

      // Run BEFORE Next serves content/public files so it can override HTML routes
      // when the client explicitly asks for markdown.
      beforeFiles: [
-        // /support.md → raw markdown from the Support page (content/marketing/support.mdx → md-src/marketing/support.md)
-        {
-          source: "/support.md",
-          destination: "/md-src/marketing/support.md",
-        },
-
        // Optional: make "/" negotiable too (remove if you don't have md-src/index.md)
        {
          source: "/",


🟡 The afterFiles rewrite source: /:path*.md has no exclusion for paths already under /md-src/, so a request to a non-existent /md-src/foo.md is rewritten to /md-src/md-src/foo.md instead of returning a clean 404. The beforeFiles content-negotiation rewrite already guards against this with (?!api|_next|md-src) in its source pattern; add the same negative lookahead to afterFiles, e.g. source: '/:path((?!md-src).+).md'.

Extended reasoning...

What the bug is and how it manifests

The afterFiles rewrite block in next.config.mjs (lines 226-231) maps every *.md request to /md-src/*.md:

afterFiles: [
  {
    source: "/:path*.md",
    destination: "/md-src/:path*.md",
  },
],

There is no guard to exclude paths that already start with /md-src/. When a client requests /md-src/nonexistent.md, Next.js first attempts to serve it as a static file from public/md-src/. If the file does not exist, Next.js falls through to the afterFiles rewrite phase, where /:path*.md matches with path = md-src/nonexistent, and the rewrite destination becomes /md-src/md-src/nonexistent.md — a doubled prefix that also does not exist.

Why existing code doesn't prevent it

The beforeFiles content-negotiation rewrite already handles this correctly. Its source pattern explicitly excludes md-src:

source: "/:path((?!api|_next|md-src)(?!.*\.md$)(?!.*\.txt$).*)",

The afterFiles rule was authored without an equivalent negative lookahead, creating an inconsistency between the two rewrite sections.

Impact and scope

The bug is latent — it only triggers when a static file does not exist under public/md-src/. In normal production operation, all valid paths are present after the build. However, any typo, stale link, or programmatically constructed /md-src/ URL that misses the static file check fails with an opaque doubled-prefix 404 (/md-src/md-src/...) rather than a straightforward one, making debugging harder. There is no infinite redirect loop since both paths are dead ends.

Step-by-step proof

Client requests GET /md-src/nonexistent.md

Next.js static file check: public/md-src/nonexistent.md does not exist → static serve fails

afterFiles phase: source: /:path*.md matches with path = md-src/nonexistent

Destination resolves to /md-src/md-src/nonexistent.md

Static file public/md-src/md-src/nonexistent.md also does not exist → 404

The error response references the doubled path, obscuring the original request

Fix

Add a negative lookahead to the afterFiles source pattern to mirror the guard already present in beforeFiles:

afterFiles: [
  {
    source: "/:path(((?!md-src).+)).md",
    destination: "/md-src/:path.md",
  },
],
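
To illustrate the doubled prefix and the effect of the guard, here is a plain-RegExp approximation. It does not use Next.js or path-to-regexp, so the patterns and the rewrite() helper are assumptions that only mimic the matching behaviour described above.

```javascript
// Approximation of the current source "/:path*.md".
const current = /^\/(.*)\.md$/;
// Approximation of the guarded source with the (?!md-src) lookahead.
const guarded = /^\/((?!md-src).+)\.md$/;

// Hypothetical helper: returns the rewritten URL, or null when no rule matches.
function rewrite(pattern, url) {
  const m = pattern.exec(url);
  return m ? `/md-src/${m[1]}.md` : null;
}

console.log(rewrite(current, "/md-src/nonexistent.md")); // /md-src/md-src/nonexistent.md
console.log(rewrite(guarded, "/md-src/nonexistent.md")); // null
console.log(rewrite(guarded, "/terms.md")); // /md-src/terms.md
```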

claude · 2026-04-08T08:20:34Z

lib/stripMdxForPlainMarkdown.js

+function stripTopLevelImportBlocks(text) {
+  const lines = text.split(/\r?\n/);
+  /** @type {string[]} */
+  const out = [];
+  let i = 0;
+  while (i < lines.length) {
+    const line = lines[i];
+    if (/^\s*import\s/.test(line)) {
+      let j = i;
+      let block = lines[j];
+      while (!/;\s*$/.test(block) && j + 1 < lines.length) {
+        if (
+          isCompleteSingleLineImportWithoutSemicolon(block) ||


🟡 In stripTopLevelImportBlocks(), an import line ending with ; // comment causes the accumulation while loop to consume all subsequent lines to EOF, silently dropping all file content after that import. The outer while condition !/;\s*$/.test(block) evaluates to true because the line ends with comment text rather than semicolon+whitespace, and isCompleteSingleLineImportWithoutSemicolon also returns false since the line does not end with the from-clause closing quote pattern. No current top-level MDX import (outside code fences) triggers this today, but the new ASI-safe import logic introduced a gap where a trailing inline comment on a semicoloned import could silently corrupt an entire page's output.

Extended reasoning...

What the bug is and how it manifests

The while loop in stripTopLevelImportBlocks() (lib/stripMdxForPlainMarkdown.js:108-120) accumulates lines until it finds a semicolon terminator. The outer condition is !/;\s*$/.test(block), which checks only whether the string ends with semicolon + optional whitespace. For an import like import { LangfuseClient } from "@langfuse/client"; // Add this import, the line contains a semicolon but ends with comment text — so the condition evaluates to true and the accumulation loop is entered.

The specific code path that triggers it

Once inside the accumulation branch, two completeness checks are evaluated: isCompleteSingleLineImportWithoutSemicolon(block) and isCompleteMultilineImportWithoutSemicolon(block). The first function returns false because its final regex /\bfrom\s+['"][^'"]+['"]\s*$/ requires the line to end with a closing quote — it does not match a line ending in comment text. The second returns false because there is only one line. Neither check breaks the loop, so j increments through every subsequent line until EOF. When the inner loop exits, i = j + 1 = lines.length, causing the outer loop to also exit immediately. None of those accumulated lines are ever pushed to out[], so they are permanently lost.

Why existing code does not prevent it

The fix for no-semicolon imports correctly handles the case where a semicolon is entirely absent, via isCompleteSingleLineImportWithoutSemicolon and isCompleteMultilineImportWithoutSemicolon. However, it did not account for the case where a semicolon is present but obscured by a trailing inline comment. The /;\s*$/ check requires the semicolon to be the final non-whitespace character; a // comment suffix defeats it. The completeness helpers also fail because they check for the from-clause closing quote at end-of-line.

What the impact would be

If any content author adds a top-level MDX import (outside a code fence) with a trailing inline comment — a common style in JS/TS projects — all content after that import would be silently dropped from public/md-src/ output and PDF generation. No build error, no warning, no visible signal; the generated .md file would contain only frontmatter and content up to that import.

Current exposure and why it is nit severity

All four verifiers confirmed that no current MDX file outside a code fence triggers this pattern. The sole matching instance (content/integrations/frameworks/vercel-ai-sdk.mdx line 130) is safely inside a triple-backtick fence, classified as fenced: true by splitMarkdownByTripleBacktickFences(), so stripTopLevelImportBlocks() never sees it. This is a latent defect in new code introduced by this PR.

How to fix it

Extend the while loop termination condition to also treat an import as complete when a semicolon appears before any trailing // comment. For example, also match /;\s*(\/\/.*)?\s*$/ so that a semicolon optionally followed by a space and inline comment is recognized as a terminator.
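
As a quick check of the suggested terminator, the sketch below runs the regex from this comment against a few sample lines. The sample names and the `IMPORT_TERMINATOR` constant are made up for illustration; only the regex itself comes from the suggestion above.

```javascript
// Terminator from the suggestion: a semicolon, optionally followed by a
// trailing `//` comment, at the end of the accumulated text.
const IMPORT_TERMINATOR = /;\s*(\/\/.*)?\s*$/;

// Illustrative inputs (not taken from the repository).
const samples = {
  plain: 'import { X } from "mod";',
  commented: 'import { X } from "mod"; // Add this import',
  noSemicolon: 'import { X } from "mod"',
  blockComment: 'import { X } from "mod"; /* note */',
};

// Map each sample name to whether the terminator recognizes it.
const results = Object.fromEntries(
  Object.entries(samples).map(([name, line]) => [name, IMPORT_TERMINATOR.test(line)])
);

console.log(results);
```

The semicolon-free line is still left to the existing completeness helpers, and a trailing block comment is not recognized by this pattern, so it would need separate handling if that style ever appears.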

Step-by-step proof

Input line: import { X } from "mod"; // comment

Outer while: !/;\s*$/.test(block) is true (line ends with comment text, not semicolon+whitespace)

isCompleteSingleLineImportWithoutSemicolon(block): the /\bfrom\s+['"][^'"]+['"]\s*$/ test fails (line ends with comment) so returns false

isCompleteMultilineImportWithoutSemicolon(block): returns false (single line)

Loop increments j, appends next line, repeats for every remaining line through EOF

Loop exits with j = lines.length - 1; i = lines.length; outer loop exits

out[] contains only lines before the import; all subsequent content is permanently dropped
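
The steps above can be reproduced with a reduced model of the loop. This is a sketch under assumptions: `stripImportsModel` is a simplified stand-in for stripTopLevelImportBlocks(), and the helper only applies the one regex quoted in this comment, so the real functions may behave differently in other cases.

```javascript
// Stand-in for the single-line completeness helper; only its final regex is
// quoted in the review, so the rest of the real helper is not modelled here.
function isCompleteSingleLineImportWithoutSemicolon(block) {
  return !block.includes("\n") && /\bfrom\s+['"][^'"]+['"]\s*$/.test(block);
}

// Reduced model of the accumulation loop in stripTopLevelImportBlocks().
function stripImportsModel(text) {
  const lines = text.split(/\r?\n/);
  const out = [];
  let i = 0;
  while (i < lines.length) {
    if (/^\s*import\s/.test(lines[i])) {
      let j = i;
      let block = lines[j];
      // Keep accumulating until the block ends with ";" or looks complete.
      while (!/;\s*$/.test(block) && j + 1 < lines.length) {
        if (isCompleteSingleLineImportWithoutSemicolon(block)) break;
        j += 1;
        block += "\n" + lines[j];
      }
      i = j + 1; // everything accumulated into the block is discarded
      continue;
    }
    out.push(lines[i]);
    i += 1;
  }
  return out.join("\n");
}

const page = [
  "# Title",
  'import { X } from "mod"; // comment',
  "First paragraph.",
  "Second paragraph.",
].join("\n");

const stripped = stripImportsModel(page);
console.log(JSON.stringify(stripped)); // "# Title"
```

Both paragraphs after the import disappear from the output without any error, which matches the silent content loss described above.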

felixkrrr · 2026-04-08T16:32:44Z

@nimarb I am a bit lost here: Claude keeps commenting and the regex keeps getting more complicated. I think this is a bottomless thread.

I think we want to have you and Ben be more in the loop with website functionality.

Any chance one of you can help me with this issue?

nimarb · 2026-04-08T17:15:48Z

Is the goal here that people can export the terms as PDF?
We can help!

felixkrrr · 2026-04-08T17:44:04Z

@nimarb

Yes, this is currently broken on prod.

It worked with my fixes locally (it won't work on dev deployments without significant changes, but that would be OK for now). My problem is that the regex .md stripper keeps running into more and more edge cases. We are OK with a few md pages that show too much content.

Fixing the broken terms page download 🛠️ — Tella

ArkuVonSymfon and others added 12 commits March 12, 2026 10:21

Merge pull request #37 from Altalogy/main-upstream

aff68d6

Main upstream

Merge branch 'main' of https://github.com/Altalogy/langfuse-docs

fc0987c

Merge remote-tracking branch 'upstream/main'

7bd2da5

fix: md-to-pdf api

b9650e0

fix: terms

afbf36f

fix mapping

6e2dc25

clean up

521480c

clean up

e4e5b96

create hard overrides for certain pages

b6c65ca

Merge branch 'main' into fix/md-to-pdf

b6045c7

fix

0850c74

Merge branch 'main' into fix/md-to-pdf

fe4bf90

dosubot bot added the size:XL label (this PR changes 500-999 lines, ignoring generated files) Apr 7, 2026

dosubot bot added the docs label Apr 7, 2026

claude bot reviewed Apr 7, 2026

lib/stripMdxForPlainMarkdown.js (resolved)

scripts/copy_md_sources.js (resolved)

fix

1546a0b

vercel bot deployed to Preview April 7, 2026 18:39 View deployment

claude bot reviewed Apr 7, 2026

lib/stripMdxForPlainMarkdown.js (resolved)

lib/stripMdxForPlainMarkdown.js (resolved)

lib/stripMdxForPlainMarkdown.js (resolved)

fix loop in parser

dc07a6a

vercel bot deployed to Preview April 7, 2026 19:08 View deployment

felixkrrr added 2 commits April 7, 2026 21:08

fix

9b1f196

fix

e3bdc9c

vercel bot deployed to Preview April 7, 2026 19:13 View deployment

claude bot reviewed Apr 7, 2026

Merge branch 'main' into LFE-9137/md-to-pdf

5d3e462

vercel bot deployed to Preview April 8, 2026 07:54 View deployment

claude bot reviewed Apr 8, 2026

felixkrrr requested a review from nimarb April 8, 2026 16:30
