Skip to content

Conversation

@ejc3
Copy link

@ejc3 ejc3 commented Nov 28, 2025

Fixes HTML class attribute corruption in the code viewer.

The bug

The syntax highlighter was applying regex replacements sequentially, which caused later patterns to match content inside already-generated HTML class attributes. For example, "text-purple-400" would get matched by the number pattern, producing broken output like:

class="text-green-class="text-purple-400">400">"text"

The fix

Replaced sequential regex with a single-pass tokenizer that:

  1. Matches all syntax patterns in one pass
  2. Only applies highlighting to the matched tokens
  3. Escapes HTML on matched content only

Also fixed escapeHtml in formatters.ts to not use document.createElement (fails in SSR).

Changes

  • web/app/components/CodeViewer.tsx: Single-pass tokenizer for syntax highlighting
  • web/app/utils/formatters.ts: SSR-safe HTML escaping

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes HTML class attribute corruption in the code viewer's syntax highlighting by replacing sequential regex replacements with a single-pass tokenizer. The issue occurred when later patterns matched content inside already-generated HTML attributes (e.g., matching "400" inside class="text-purple-400"). Additionally, the PR makes escapeHtml SSR-compatible by removing the document.createElement dependency.

Key changes:

  • Single-pass tokenization prevents regex from matching inside generated HTML attributes
  • SSR-safe HTML escaping using manual character replacement
  • Simplified formatLargeText with reduced complexity (though with some issues identified)

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File Description
web/app/utils/formatters.ts Replaced DOM-based HTML escaping with SSR-compatible character replacement; simplified formatLargeText formatting logic
web/app/components/CodeViewer.tsx Implemented single-pass tokenizer for syntax highlighting using combined regex pattern and iterative matching to prevent HTML corruption

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 55 to 70
return escaped
// Preserve existing double line breaks
.replace(/\n\n/g, '<br><br>')
// Convert single line breaks to single <br> tags
// Preserve existing double line breaks as paragraph breaks
.replace(/\n\n/g, '</p><p class="mt-3">')
// Convert single line breaks to <br> tags
.replace(/\n/g, '<br>')
// Format bullet points with modern styling
.replace(/^(\s*)([-*])\s+(.+)$/gm, '$1<span class="inline-flex items-center space-x-2"><span class="w-1.5 h-1.5 bg-blue-500 rounded-full flex-shrink-0"></span><span>$3</span></span>')
// Format numbered lists with modern styling
.replace(/^(\s*)(\d+)\.\s+(.+)$/gm, '$1<span class="inline-flex items-center space-x-2"><span class="w-5 h-5 bg-blue-100 text-blue-700 rounded-full flex items-center justify-center text-xs font-semibold">$2</span><span>$3</span></span>')
// Format headers with better typography
.replace(/^([A-Z][^<\n]*:)(<br>|$)/gm, '<div class="font-semibold text-gray-900 mt-4 mb-2 border-b border-gray-200 pb-1">$1</div>$2')
// Format code blocks with better styling
.replace(/\b([A-Z_]{3,})\b/g, '<code class="bg-gradient-to-r from-gray-100 to-blue-50 border border-gray-200 px-2 py-0.5 rounded-md text-xs text-blue-700 font-mono font-medium">$1</code>')
// Format file paths and technical terms
.replace(/\b([a-zA-Z0-9_-]+\.[a-zA-Z]{2,4})\b/g, '<span class="bg-slate-100 text-slate-700 px-1.5 py-0.5 rounded text-xs font-mono border border-slate-200">$1</span>')
// Format URLs with modern link styling
.replace(/(https?:\/\/[^\s<]+)/g, '<a href="$1" class="text-blue-600 hover:text-blue-800 underline underline-offset-2 decoration-blue-300 hover:decoration-blue-500 transition-colors font-medium" target="_blank" rel="noopener noreferrer">$1</a>')
// Format quoted text
.replace(/^(\s*)([""](.+?)[""])/gm, '$1<blockquote class="border-l-4 border-blue-200 bg-blue-50 pl-4 py-2 my-2 italic text-gray-700 rounded-r">$3</blockquote>')
// Add proper spacing around paragraphs
.replace(/(<br><br>)/g, '<div class="my-4"></div>')
// Clean up any excessive spacing
.replace(/(<br>\s*){3,}/g, '<br><br>')
// Format emphasis patterns
.replace(/\*\*([^*]+)\*\*/g, '<strong class="font-semibold text-gray-900">$1</strong>')
.replace(/\*([^*]+)\*/g, '<em class="italic text-gray-700">$1</em>')
// Format inline code
.replace(/`([^`]+)`/g, '<code class="bg-gray-100 text-gray-800 px-1.5 py-0.5 rounded text-sm font-mono border border-gray-200">$1</code>');
// Format inline code (backticks)
.replace(/`([^`]+)`/g, '<code class="bg-gray-100 text-gray-800 px-1.5 py-0.5 rounded text-sm font-mono">$1</code>')
// Format bold text
.replace(/\*\*([^*]+)\*\*/g, '<strong>$1</strong>')
// Format italic text
.replace(/\*([^*]+)\*/g, '<em>$1</em>')
// Wrap in paragraph
.replace(/^(.*)$/, '<p>$1</p>');
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The final regex .replace(/^(.*)$/, '<p>$1</p>') will wrap the entire result in a single paragraph, creating malformed HTML. Since line 60 already adds </p><p class="mt-3"> for double line breaks, this will produce invalid nesting like <p>...</p><p class="mt-3">...</p>.

Instead, the formatting should either:

  1. Split by double newlines first, wrap each section in <p> tags, then apply the other replacements
  2. Or wrap the entire result in a container div without the final paragraph wrap

Example: Input "Line1\n\nLine2" will become:
<p>Line1</p><p class="mt-3">Line2<p> (missing closing tag and invalid nesting)

Copilot uses AI. Check for mistakes.
// Define token patterns with priorities (first match wins)
// Order matters: strings and comments first to avoid highlighting inside them
const tokenPatterns = [
{ regex: /(["'`])(?:(?=(\\?))\2.)*?\1/, className: 'text-green-400' }, // strings
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The string regex pattern /(["'])(?:(?=(\?))\2.)*?\1/has a subtle issue. The backreference\2refers to capture group 2 (the optional backslash from(\?)`), but this creates a non-capturing lookahead that doesn't work as intended for matching escaped characters.

A more reliable pattern for strings would be:

  • /"(?:[^"\\]|\\.)*"/ for double quotes
  • /'(?:[^'\\]|\\.)*'/ for single quotes
  • /(?:[^\\]|\\.)*/` for backticks

Or combine them: /(["'])(?:[^\1\\]|\.)*?\1/` (though backreferences in character classes aren't supported in JS regex)

The current pattern may fail on strings containing escaped quotes like "He said \"hello\"" or produce incorrect matches.

Suggested change
{ regex: /(["'`])(?:(?=(\\?))\2.)*?\1/, className: 'text-green-400' }, // strings
{ regex: /"(?:[^"\\]|\\.)*"/, className: 'text-green-400' }, // double-quoted strings
{ regex: /'(?:[^'\\]|\\.)*'/, className: 'text-green-400' }, // single-quoted strings
{ regex: /`(?:[^`\\]|\\.)*`/, className: 'text-green-400' }, // backtick-quoted strings

Copilot uses AI. Check for mistakes.
const escapeHtml = (str: string) => str
.replace(/&/g, '&amp;')
.replace(/</g, '&lt;')
.replace(/>/g, '&gt;');
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The escapeHtml function is missing escaping for the " and ' characters, which are necessary for complete XSS protection especially when the escaped content might be used in HTML attribute contexts.

The function should also escape:

  • " as &quot;
  • ' as &#039; or &apos;

While this code viewer uses the escaped content in element text nodes (where quotes don't need escaping), maintaining consistency with the escapeHtml in formatters.ts (lines 44-45) would be safer and prevent future issues if the usage context changes.

Suggested change
.replace(/>/g, '&gt;');
.replace(/>/g, '&gt;')
.replace(/"/g, '&quot;')
.replace(/'/g, '&#039;');

Copilot uses AI. Check for mistakes.
Comment on lines +125 to +134
for (let i = 0; i < tokenPatterns.length; i++) {
if (match[i + 1] !== undefined) {
className = tokenPatterns[i].className;
break;
}
}
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The combined regex pattern creates a large number of capture groups (8 patterns = 8 capture groups). For every match, all capture groups are checked in the loop (lines 125-130) even though only one will ever be defined.

Consider optimizing by using named capture groups or storing pattern indices to avoid the linear search:

const tokenPatterns = [
  { regex: /(["'`])(?:(?=(\\?))\2.)*?\1/, className: 'text-green-400', name: 'string' },
  // ... other patterns
];

// Then use a Map for O(1) lookup
const patternMap = new Map(tokenPatterns.map((p, i) => [i, p]));

Or simply store the result of the loop since you're already iterating through patterns when finding which matched.

Copilot uses AI. Check for mistakes.
Comment on lines +118 to +123
if (match.index! > lastIndex) {
result += escapeHtml(code.slice(lastIndex, match.index));
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The non-null assertion operator ! is used on match.index which is guaranteed to be defined for successful matches from matchAll(). However, TypeScript's type definitions may not reflect this guarantee.

While this is technically safe, consider using nullish coalescing for better defensive programming:

if (match.index ?? 0 > lastIndex) {
  result += escapeHtml(code.slice(lastIndex, match.index ?? 0));
}

Or assert once at the start of the loop:

const index = match.index!;
if (index > lastIndex) {
  result += escapeHtml(code.slice(lastIndex, index));
}

Copilot uses AI. Check for mistakes.
{ regex: /#.*$/, className: 'text-gray-500 italic' }, // hash comments
{ regex: /\b(function|const|let|var|if|else|for|while|return|class|import|export|from|async|await|def|elif|except|finally|lambda|with|as|raise|del|global|nonlocal|assert|break|continue|try|catch|throw|new|this|super|extends|implements|interface|abstract|static|public|private|protected|void|int|string|boolean|float|double|char|long|short|byte|enum|struct|typedef|union|namespace|using|package|goto|switch|case|default|fn|pub|mod|use|mut|match|loop|impl|trait|where|type|readonly|override)\b/, className: 'text-blue-400' }, // keywords
{ regex: /\b(true|false|null|undefined|nil|None|True|False|NULL)\b/, className: 'text-orange-400' }, // literals
{ regex: /\b\d+\.?\d*\b/, className: 'text-purple-400' }, // numbers
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The number regex pattern /\b\d+\.?\d*\b/ will match numbers like "123." (trailing decimal point with no digits after). This is technically valid in some contexts but may not be intended.

Consider using /\b\d+(?:\.\d+)?\b/ which requires at least one digit after the decimal point, or /\b\d+\.?\d+\b|\b\d+\b/ to match either integers or proper decimals.

Examples of current behavior:

  • "123." → matches (may be unintended)
  • "123.456" → matches ✓
  • "123" → matches ✓
  • ".456" → no match ✓
Suggested change
{ regex: /\b\d+\.?\d*\b/, className: 'text-purple-400' }, // numbers
{ regex: /\b\d+(?:\.\d+)?\b/, className: 'text-purple-400' }, // numbers

Copilot uses AI. Check for mistakes.
@ejc3 ejc3 force-pushed the pr/code-viewer-fix branch from 4fe7724 to 0031e85 Compare December 20, 2025 19:26
ejc3 added 2 commits December 20, 2025 11:45
- Replace sequential regex replacements with single-pass tokenizer in
  CodeViewer.tsx highlightCode() function
- The old approach applied patterns sequentially, causing later patterns
  to match numbers inside class attributes (e.g., "400" in "text-purple-400")
- New approach: build combined regex, iterate matches once, escape HTML
  on matched tokens only
- Also fix escapeHtml in formatters.ts to not use document.createElement
  (fails during SSR) and simplify formatLargeText to avoid over-formatting
- Add quote escaping to escapeHtml in CodeViewer.tsx for XSS protection
- Fix string regex patterns to properly handle escaped quotes
- Use template literal for paragraph wrapping in formatLargeText
- Add vitest and tests for escapeHtml, formatLargeText, and string patterns
@ejc3 ejc3 force-pushed the pr/code-viewer-fix branch from 0031e85 to 77b4d70 Compare December 20, 2025 19:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant