-
Notifications
You must be signed in to change notification settings - Fork 54
Fix code viewer syntax highlighting corruption #21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes HTML class attribute corruption in the code viewer's syntax highlighting by replacing sequential regex replacements with a single-pass tokenizer. The issue occurred when later patterns matched content inside already-generated HTML attributes (e.g., matching "400" inside class="text-purple-400"). Additionally, the PR makes escapeHtml SSR-compatible by removing the document.createElement dependency.
Key changes:
- Single-pass tokenization prevents regex from matching inside generated HTML attributes
- SSR-safe HTML escaping using manual character replacement
- Simplified
formatLargeTextwith reduced complexity (though with some issues identified)
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
web/app/utils/formatters.ts |
Replaced DOM-based HTML escaping with SSR-compatible character replacement; simplified formatLargeText formatting logic |
web/app/components/CodeViewer.tsx |
Implemented single-pass tokenizer for syntax highlighting using combined regex pattern and iterative matching to prevent HTML corruption |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
web/app/utils/formatters.ts
Outdated
| return escaped | ||
| // Preserve existing double line breaks | ||
| .replace(/\n\n/g, '<br><br>') | ||
| // Convert single line breaks to single <br> tags | ||
| // Preserve existing double line breaks as paragraph breaks | ||
| .replace(/\n\n/g, '</p><p class="mt-3">') | ||
| // Convert single line breaks to <br> tags | ||
| .replace(/\n/g, '<br>') | ||
| // Format bullet points with modern styling | ||
| .replace(/^(\s*)([-*•])\s+(.+)$/gm, '$1<span class="inline-flex items-center space-x-2"><span class="w-1.5 h-1.5 bg-blue-500 rounded-full flex-shrink-0"></span><span>$3</span></span>') | ||
| // Format numbered lists with modern styling | ||
| .replace(/^(\s*)(\d+)\.\s+(.+)$/gm, '$1<span class="inline-flex items-center space-x-2"><span class="w-5 h-5 bg-blue-100 text-blue-700 rounded-full flex items-center justify-center text-xs font-semibold">$2</span><span>$3</span></span>') | ||
| // Format headers with better typography | ||
| .replace(/^([A-Z][^<\n]*:)(<br>|$)/gm, '<div class="font-semibold text-gray-900 mt-4 mb-2 border-b border-gray-200 pb-1">$1</div>$2') | ||
| // Format code blocks with better styling | ||
| .replace(/\b([A-Z_]{3,})\b/g, '<code class="bg-gradient-to-r from-gray-100 to-blue-50 border border-gray-200 px-2 py-0.5 rounded-md text-xs text-blue-700 font-mono font-medium">$1</code>') | ||
| // Format file paths and technical terms | ||
| .replace(/\b([a-zA-Z0-9_-]+\.[a-zA-Z]{2,4})\b/g, '<span class="bg-slate-100 text-slate-700 px-1.5 py-0.5 rounded text-xs font-mono border border-slate-200">$1</span>') | ||
| // Format URLs with modern link styling | ||
| .replace(/(https?:\/\/[^\s<]+)/g, '<a href="$1" class="text-blue-600 hover:text-blue-800 underline underline-offset-2 decoration-blue-300 hover:decoration-blue-500 transition-colors font-medium" target="_blank" rel="noopener noreferrer">$1</a>') | ||
| // Format quoted text | ||
| .replace(/^(\s*)([""](.+?)[""])/gm, '$1<blockquote class="border-l-4 border-blue-200 bg-blue-50 pl-4 py-2 my-2 italic text-gray-700 rounded-r">$3</blockquote>') | ||
| // Add proper spacing around paragraphs | ||
| .replace(/(<br><br>)/g, '<div class="my-4"></div>') | ||
| // Clean up any excessive spacing | ||
| .replace(/(<br>\s*){3,}/g, '<br><br>') | ||
| // Format emphasis patterns | ||
| .replace(/\*\*([^*]+)\*\*/g, '<strong class="font-semibold text-gray-900">$1</strong>') | ||
| .replace(/\*([^*]+)\*/g, '<em class="italic text-gray-700">$1</em>') | ||
| // Format inline code | ||
| .replace(/`([^`]+)`/g, '<code class="bg-gray-100 text-gray-800 px-1.5 py-0.5 rounded text-sm font-mono border border-gray-200">$1</code>'); | ||
| // Format inline code (backticks) | ||
| .replace(/`([^`]+)`/g, '<code class="bg-gray-100 text-gray-800 px-1.5 py-0.5 rounded text-sm font-mono">$1</code>') | ||
| // Format bold text | ||
| .replace(/\*\*([^*]+)\*\*/g, '<strong>$1</strong>') | ||
| // Format italic text | ||
| .replace(/\*([^*]+)\*/g, '<em>$1</em>') | ||
| // Wrap in paragraph | ||
| .replace(/^(.*)$/, '<p>$1</p>'); |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The final regex .replace(/^(.*)$/, '<p>$1</p>') will wrap the entire result in a single paragraph, creating malformed HTML. Since line 60 already adds </p><p class="mt-3"> for double line breaks, this will produce invalid nesting like <p>...</p><p class="mt-3">...</p>.
Instead, the formatting should either:
- Split by double newlines first, wrap each section in
<p>tags, then apply the other replacements - Or wrap the entire result in a container div without the final paragraph wrap
Example: Input "Line1\n\nLine2" will become:
<p>Line1</p><p class="mt-3">Line2<p> (missing closing tag and invalid nesting)
web/app/components/CodeViewer.tsx
Outdated
| // Define token patterns with priorities (first match wins) | ||
| // Order matters: strings and comments first to avoid highlighting inside them | ||
| const tokenPatterns = [ | ||
| { regex: /(["'`])(?:(?=(\\?))\2.)*?\1/, className: 'text-green-400' }, // strings |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The string regex pattern /(["'])(?:(?=(\?))\2.)*?\1/has a subtle issue. The backreference\2refers to capture group 2 (the optional backslash from(\?)`), but this creates a non-capturing lookahead that doesn't work as intended for matching escaped characters.
A more reliable pattern for strings would be:
/"(?:[^"\\]|\\.)*"/for double quotes/'(?:[^'\\]|\\.)*'/for single quotes/(?:[^\\]|\\.)*/` for backticks
Or combine them: /(["'])(?:[^\1\\]|\.)*?\1/` (though backreferences in character classes aren't supported in JS regex)
The current pattern may fail on strings containing escaped quotes like "He said \"hello\"" or produce incorrect matches.
| { regex: /(["'`])(?:(?=(\\?))\2.)*?\1/, className: 'text-green-400' }, // strings | |
| { regex: /"(?:[^"\\]|\\.)*"/, className: 'text-green-400' }, // double-quoted strings | |
| { regex: /'(?:[^'\\]|\\.)*'/, className: 'text-green-400' }, // single-quoted strings | |
| { regex: /`(?:[^`\\]|\\.)*`/, className: 'text-green-400' }, // backtick-quoted strings |
web/app/components/CodeViewer.tsx
Outdated
| const escapeHtml = (str: string) => str | ||
| .replace(/&/g, '&') | ||
| .replace(/</g, '<') | ||
| .replace(/>/g, '>'); |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The escapeHtml function is missing escaping for the " and ' characters, which are necessary for complete XSS protection especially when the escaped content might be used in HTML attribute contexts.
The function should also escape:
"as"'as'or'
While this code viewer uses the escaped content in element text nodes (where quotes don't need escaping), maintaining consistency with the escapeHtml in formatters.ts (lines 44-45) would be safer and prevent future issues if the usage context changes.
| .replace(/>/g, '>'); | |
| .replace(/>/g, '>') | |
| .replace(/"/g, '"') | |
| .replace(/'/g, '''); |
| for (let i = 0; i < tokenPatterns.length; i++) { | ||
| if (match[i + 1] !== undefined) { | ||
| className = tokenPatterns[i].className; | ||
| break; | ||
| } | ||
| } |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The combined regex pattern creates a large number of capture groups (8 patterns = 8 capture groups). For every match, all capture groups are checked in the loop (lines 125-130) even though only one will ever be defined.
Consider optimizing by using named capture groups or storing pattern indices to avoid the linear search:
const tokenPatterns = [
{ regex: /(["'`])(?:(?=(\\?))\2.)*?\1/, className: 'text-green-400', name: 'string' },
// ... other patterns
];
// Then use a Map for O(1) lookup
const patternMap = new Map(tokenPatterns.map((p, i) => [i, p]));Or simply store the result of the loop since you're already iterating through patterns when finding which matched.
| if (match.index! > lastIndex) { | ||
| result += escapeHtml(code.slice(lastIndex, match.index)); |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The non-null assertion operator ! is used on match.index which is guaranteed to be defined for successful matches from matchAll(). However, TypeScript's type definitions may not reflect this guarantee.
While this is technically safe, consider using nullish coalescing for better defensive programming:
if (match.index ?? 0 > lastIndex) {
result += escapeHtml(code.slice(lastIndex, match.index ?? 0));
}Or assert once at the start of the loop:
const index = match.index!;
if (index > lastIndex) {
result += escapeHtml(code.slice(lastIndex, index));
}| { regex: /#.*$/, className: 'text-gray-500 italic' }, // hash comments | ||
| { regex: /\b(function|const|let|var|if|else|for|while|return|class|import|export|from|async|await|def|elif|except|finally|lambda|with|as|raise|del|global|nonlocal|assert|break|continue|try|catch|throw|new|this|super|extends|implements|interface|abstract|static|public|private|protected|void|int|string|boolean|float|double|char|long|short|byte|enum|struct|typedef|union|namespace|using|package|goto|switch|case|default|fn|pub|mod|use|mut|match|loop|impl|trait|where|type|readonly|override)\b/, className: 'text-blue-400' }, // keywords | ||
| { regex: /\b(true|false|null|undefined|nil|None|True|False|NULL)\b/, className: 'text-orange-400' }, // literals | ||
| { regex: /\b\d+\.?\d*\b/, className: 'text-purple-400' }, // numbers |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The number regex pattern /\b\d+\.?\d*\b/ will match numbers like "123." (trailing decimal point with no digits after). This is technically valid in some contexts but may not be intended.
Consider using /\b\d+(?:\.\d+)?\b/ which requires at least one digit after the decimal point, or /\b\d+\.?\d+\b|\b\d+\b/ to match either integers or proper decimals.
Examples of current behavior:
- "123." → matches (may be unintended)
- "123.456" → matches ✓
- "123" → matches ✓
- ".456" → no match ✓
| { regex: /\b\d+\.?\d*\b/, className: 'text-purple-400' }, // numbers | |
| { regex: /\b\d+(?:\.\d+)?\b/, className: 'text-purple-400' }, // numbers |
4fe7724 to
0031e85
Compare
- Replace sequential regex replacements with single-pass tokenizer in CodeViewer.tsx highlightCode() function - The old approach applied patterns sequentially, causing later patterns to match numbers inside class attributes (e.g., "400" in "text-purple-400") - New approach: build combined regex, iterate matches once, escape HTML on matched tokens only - Also fix escapeHtml in formatters.ts to not use document.createElement (fails during SSR) and simplify formatLargeText to avoid over-formatting
- Add quote escaping to escapeHtml in CodeViewer.tsx for XSS protection - Fix string regex patterns to properly handle escaped quotes - Use template literal for paragraph wrapping in formatLargeText - Add vitest and tests for escapeHtml, formatLargeText, and string patterns
0031e85 to
77b4d70
Compare
Fixes HTML class attribute corruption in the code viewer.
The bug
The syntax highlighter was applying regex replacements sequentially, which caused later patterns to match content inside already-generated HTML class attributes. For example,
"text-purple-400"would get matched by the number pattern, producing broken output like:The fix
Replaced sequential regex with a single-pass tokenizer that:
Also fixed
escapeHtmlinformatters.tsto not usedocument.createElement(fails in SSR).Changes
web/app/components/CodeViewer.tsx: Single-pass tokenizer for syntax highlightingweb/app/utils/formatters.ts: SSR-safe HTML escaping