Add comprehensive SEO static checker #424
Conversation
Implements a production-ready SEO validation system that scans the dist folder and ensures all built pages meet SEO standards. The checker validates 70+ rules across metadata, content, links, images, structured data, and internationalization.

Features:
- Automatically runs after every build via `build:after` script
- Blocks deployment if critical SEO errors are found
- Highly configurable with granular rule exclusions
- Smart filtering to skip error pages and redirects
- Multiple output formats (console, JSON, SARIF)

Current configuration disables non-critical rules to allow initial deployment. Rules can be gradually enabled as issues are fixed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Haiku 4.5 <[email protected]>
📝 Walkthrough

Adds an SEO static checker integrated into the build pipeline with new config and exclusions files, new npm/bun scripts, documentation updates, and .gitignore entries for SEO report outputs; the checker runs post-build against site artifacts and locales.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Developer
    participant CI as CI / Local CLI (bun)
    participant Build as Build Pipeline
    participant SEO as @capgo/seo-checker
    participant FS as dist/ (filesystem)
    participant Report as Reports (console / seo-report.*)
    Developer->>CI: trigger bun run build or CI build
    CI->>Build: execute build steps
    Build->>Build: bun run repair_sitemap
    Build->>SEO: bun run seo:check (--output github)
    SEO->>FS: read artifacts from dist/ and configs (seo-checker.config.json, exclusions)
    SEO->>SEO: evaluate rules across locales (metadata, links, images, headings, structured data...)
    alt issues found
        SEO->>Report: emit reports (console, seo-report.json/.sarif/.txt)
    else no issues
        SEO->>Report: emit success summary (console)
    end
    Report->>Developer: results via console or report files
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks: ✅ Passed checks (3 passed)

📜 Recent review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
🔇 Additional comments (2)
Actionable comments posted: 6
🧹 Nitpick comments (6)
scripts/seo-checker/parser.ts (2)
318-337: Async function uses synchronous filesystem APIs.

The `findFiles` function is declared `async` but internally uses `fs.readdirSync`. This works correctly but is misleading and doesn't provide the expected non-blocking behavior.

🔎 Proposed fix to use async filesystem APIs

```diff
-async function findFiles(dir: string, extension: string): Promise<string[]> {
+import { readdir } from 'node:fs/promises'
+
+async function findFiles(dir: string, extension: string): Promise<string[]> {
   const files: string[] = []
   async function walk(currentDir: string): Promise<void> {
-    const entries = fs.readdirSync(currentDir, { withFileTypes: true })
+    const entries = await readdir(currentDir, { withFileTypes: true })
     for (const entry of entries) {
       const fullPath = path.join(currentDir, entry.name)
       if (entry.isDirectory()) {
         await walk(fullPath)
       } else if (entry.isFile() && entry.name.endsWith(extension)) {
         files.push(fullPath)
       }
     }
   }
   await walk(dir)
   return files
 }
```
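As an aside, on Node 20+ the manual walk can be avoided entirely; a hedged sketch using `readdir`'s `recursive` option (the function name mirrors the review, everything else is illustrative):

```typescript
import { readdir } from 'node:fs/promises'
import * as path from 'node:path'

// Sketch: recursive listing via readdir's `recursive` option (Node >= 20).
// Entries come back as paths relative to `dir`; note this filters on the
// path suffix only, so a directory named "x.html" would also match.
async function findFiles(dir: string, extension: string): Promise<string[]> {
  const names = await readdir(dir, { recursive: true })
  return names
    .filter((name) => name.endsWith(extension))
    .map((name) => path.join(dir, name))
}
```

This trades the explicit walk for a single syscall-driven traversal, at the cost of requiring a recent Node version.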
245-258: Double file read for small files.

Files under `MAX_REDIRECT_SIZE` are read twice: once via `statSync` for size, then via `readFileSync` for content. This is a minor inefficiency.

🔎 Proposed optimization

```diff
 for (const filePath of htmlFiles) {
-  // Skip very small files (likely error pages or placeholders)
   const stats = fs.statSync(filePath)
   if (stats.size < MIN_FILE_SIZE) {
     continue
   }
   // Skip redirect pages (small files with meta refresh)
   if (stats.size < MAX_REDIRECT_SIZE) {
     const content = fs.readFileSync(filePath, 'utf-8')
     if (content.includes('http-equiv="refresh"') || content.includes("http-equiv='refresh'")) {
       continue
     }
+    // Reuse already-read content for parsing
+    const pageData = parseHtmlFileFromContent(content, filePath, distPath, config)
+    pages.set(pageData.relativePath, pageData)
+    // ... continue with duplicate tracking
+    continue
   }
```

scripts/seo-checker/types.ts (1)
86-90: Consider memory implications of storing full HTML.

`PageData.html` stores the entire HTML content for each page. For large sites with hundreds/thousands of pages, this could lead to high memory consumption. Consider whether the full HTML is needed after initial parsing, or if it can be discarded once checks complete.

scripts/seo-checker/checks.ts (1)
11-74: Hardcoded language codes may miss valid BCP47 codes.

The `VALID_LANG_CODES` set contains common codes but BCP47 supports many more valid combinations. The fallback regex at line 925 helps but uses a simplified pattern that may not catch all valid codes.

Consider using a library like `bcp-47` or `bcp-47-match` for robust validation, or document that this is an intentional subset for performance.

scripts/seo-checker/index.ts (1)
77-92: Consider deep merging `severityOverrides` and `thresholdOverrides`.

The `disabled` array is correctly concatenated, but `severityOverrides` and `thresholdOverrides` from the base config are completely replaced rather than merged. If this is intentional, consider adding a comment to clarify; otherwise:

🔎 Proposed fix for consistent deep merge

```diff
 function mergeConfig(
   base: SEOCheckerConfig,
   overrides: Partial<SEOCheckerConfig>
 ): SEOCheckerConfig {
   return {
     ...base,
     ...overrides,
     rules: {
       ...base.rules,
       ...overrides.rules,
       disabled: [...(base.rules?.disabled || []), ...(overrides.rules?.disabled || [])],
+      severityOverrides: {
+        ...(base.rules?.severityOverrides || {}),
+        ...(overrides.rules?.severityOverrides || {}),
+      },
+      thresholdOverrides: {
+        ...(base.rules?.thresholdOverrides || {}),
+        ...(overrides.rules?.thresholdOverrides || {}),
+      },
     },
     exclusions: [...(base.exclusions || []), ...(overrides.exclusions || [])],
     failOn: overrides.failOn || base.failOn,
   }
 }
```

scripts/seo-checker/exclusions.ts (1)
13-23: ReDoS risk in `elementPattern` regex — optional mitigation for trusted config.

Line 49 uses `new RegExp(rule.elementPattern)` directly from config. While the risk is low (patterns are maintainer-controlled in JSON files), if this tool ever accepts untrusted patterns, consider using `re2` (the node-re2 package) or adding static detection via `redos-detector` in CI to catch vulnerable regexes.
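A later commit in this PR notes that invalid patterns now fail gracefully via try/catch; a minimal sketch of that guard (helper name illustrative — and note it handles malformed patterns, not catastrophic backtracking):

```typescript
// Sketch: compile an exclusion pattern defensively so an invalid regex in
// config degrades to "no match" instead of crashing the checker.
// This guards against malformed patterns, not ReDoS itself.
function compilePattern(pattern: string): RegExp | null {
  try {
    return new RegExp(pattern)
  } catch {
    return null
  }
}
```

Callers can then skip any exclusion rule whose pattern compiled to `null`, optionally logging a warning once per bad pattern.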
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
.gitignore, CLAUDE.md, package.json, scripts/seo-checker/checks.ts, scripts/seo-checker/exclusions.ts, scripts/seo-checker/index.ts, scripts/seo-checker/parser.ts, scripts/seo-checker/reporter.ts, scripts/seo-checker/rules.ts, scripts/seo-checker/types.ts, seo-checker.config.json, seo-checker.exclusions.json
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{astro,jsx,tsx,js,ts}
📄 CodeRabbit inference engine (AGENT.md)
Dynamic content must use `aria-live` regions for accessibility

Files:
scripts/seo-checker/rules.ts, scripts/seo-checker/parser.ts, scripts/seo-checker/exclusions.ts, scripts/seo-checker/index.ts, scripts/seo-checker/reporter.ts, scripts/seo-checker/checks.ts, scripts/seo-checker/types.ts
🧠 Learnings (9)
📚 Learning: 2025-12-31T01:44:38.598Z
Learnt from: CR
Repo: Cap-go/website PR: 0
File: AGENT.md:0-0
Timestamp: 2025-12-31T01:44:38.598Z
Learning: Run `bun run build` to check for build errors before completing changes
Applied to files:
CLAUDE.md, package.json
📚 Learning: 2025-12-31T01:44:38.598Z
Learnt from: CR
Repo: Cap-go/website PR: 0
File: AGENT.md:0-0
Timestamp: 2025-12-31T01:44:38.598Z
Learning: Applies to src/pages/**/*.{astro,ts,tsx,js,jsx} : Include JSON-LD structured data using `src/lib/ldJson.ts` helpers for SEO
Applied to files:
CLAUDE.md, scripts/seo-checker/rules.ts, scripts/seo-checker/parser.ts, scripts/seo-checker/index.ts, scripts/seo-checker/checks.ts, scripts/seo-checker/types.ts
📚 Learning: 2025-12-31T01:44:38.598Z
Learnt from: CR
Repo: Cap-go/website PR: 0
File: AGENT.md:0-0
Timestamp: 2025-12-31T01:44:38.598Z
Learning: Applies to src/pages/**/*.{astro,ts,tsx,js,jsx} : Page must have unique `title` and `description` in content object for SEO
Applied to files:
CLAUDE.md, scripts/seo-checker/rules.ts, scripts/seo-checker/parser.ts, scripts/seo-checker/index.ts, scripts/seo-checker/checks.ts, scripts/seo-checker/types.ts
📚 Learning: 2025-12-31T01:44:25.964Z
Learnt from: CR
Repo: Cap-go/website PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-31T01:44:25.964Z
Learning: Applies to src/pages/**/*.{ts,tsx,astro} : Use `keywords` prop for relevant search terms in page SEO configuration
Applied to files:
scripts/seo-checker/rules.ts, scripts/seo-checker/parser.ts, scripts/seo-checker/index.ts, scripts/seo-checker/checks.ts, scripts/seo-checker/types.ts
📚 Learning: 2025-12-31T01:44:25.964Z
Learnt from: CR
Repo: Cap-go/website PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-31T01:44:25.964Z
Learning: Applies to src/pages/**/*.{ts,tsx,astro} : Use the helpers in `src/lib/ldJson.ts` to add structured data (JSON-LD) to pages
Applied to files:
scripts/seo-checker/parser.ts
📚 Learning: 2025-12-31T01:44:25.964Z
Learnt from: CR
Repo: Cap-go/website PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-31T01:44:25.964Z
Learning: Applies to src/**/*.{ts,tsx,astro} : Use semantic HTML headings (h1, h2, h3) in proper hierarchy
Applied to files:
scripts/seo-checker/parser.ts
📚 Learning: 2025-12-31T01:44:25.964Z
Learnt from: CR
Repo: Cap-go/website PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-31T01:44:25.964Z
Learning: Applies to src/pages/**/*.{ts,tsx,astro} : Ensure hreflang tags are present for localized pages
Applied to files:
scripts/seo-checker/parser.ts, scripts/seo-checker/checks.ts
📚 Learning: 2025-12-31T01:44:25.964Z
Learnt from: CR
Repo: Cap-go/website PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-31T01:44:25.964Z
Learning: Applies to src/pages/**/*.{ts,tsx,astro} : Use canonical URLs for duplicate content
Applied to files:
scripts/seo-checker/parser.ts, scripts/seo-checker/index.ts
📚 Learning: 2025-12-31T01:44:25.964Z
Learnt from: CR
Repo: Cap-go/website PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-31T01:44:25.964Z
Learning: Applies to src/pages/**/*.{ts,tsx,astro} : Every page must have a unique `title` and `description` meta tag
Applied to files:
scripts/seo-checker/parser.ts
🧬 Code graph analysis (6)
scripts/seo-checker/rules.ts (1)
scripts/seo-checker/types.ts (1)
`SEORule` (10-19)
scripts/seo-checker/parser.ts (1)
scripts/seo-checker/types.ts (3)
`SEOCheckerConfig` (56-84), `PageData` (86-164), `SiteData` (166-178)
scripts/seo-checker/exclusions.ts (1)
scripts/seo-checker/types.ts (3)
`SEOIssue` (21-41), `ExclusionRule` (43-54), `SEOCheckerConfig` (56-84)
scripts/seo-checker/index.ts (5)
scripts/seo-checker/types.ts (4)
`SEOCheckerConfig` (56-84), `Severity` (6-6), `SEOIssue` (21-41), `CheckResult` (180-192)
scripts/seo-checker/exclusions.ts (3)
`loadExclusionsFromFile` (109-126), `filterDisabledRules` (95-104), `filterExcludedIssues` (68-90)
scripts/seo-checker/parser.ts (1)
`scanDistFolder` (227-313)
scripts/seo-checker/checks.ts (2)
`runPageChecks` (932-955), `checkDuplicates` (819-916)
scripts/seo-checker/reporter.ts (3)
`formatJsonReport` (157-159), `printReport` (257-259), `writeReport` (232-252)
scripts/seo-checker/reporter.ts (1)
scripts/seo-checker/types.ts (3)
`Severity` (6-6), `SEOIssue` (21-41), `CheckResult` (180-192)
scripts/seo-checker/checks.ts (3)
scripts/seo-checker/types.ts (4)
`PageData` (86-164), `SEOIssue` (21-41), `SEOCheckerConfig` (56-84), `SiteData` (166-178)
scripts/seo-checker/rules.ts (1)
`getRule` (744-746)
scripts/seo-checker/parser.ts (2)
`resolveToFilePath` (354-385), `fileExists` (342-349)
🪛 ast-grep (0.40.3)
scripts/seo-checker/exclusions.ts
[warning] 20-20: Regular expression constructed from variable input detected. This can lead to Regular Expression Denial of Service (ReDoS) attacks if the variable contains malicious patterns. Use libraries like 'recheck' to validate regex safety or use static patterns.
Context: ``new RegExp(`^${regexPattern}$`)``
Note: [CWE-1333] Inefficient Regular Expression Complexity [REFERENCES]
- https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS
- https://cwe.mitre.org/data/definitions/1333.html
(regexp-from-variable)
[warning] 48-48: Regular expression constructed from variable input detected. This can lead to Regular Expression Denial of Service (ReDoS) attacks if the variable contains malicious patterns. Use libraries like 'recheck' to validate regex safety or use static patterns.
Context: new RegExp(rule.elementPattern)
Note: [CWE-1333] Inefficient Regular Expression Complexity [REFERENCES]
- https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS
- https://cwe.mitre.org/data/definitions/1333.html
(regexp-from-variable)
🪛 Biome (2.1.2)
scripts/seo-checker/reporter.ts
[error] 248-248: Unexpected control character in a regular expression.
Control characters are unusual and potentially incorrect inputs, so they are disallowed.
(lint/suspicious/noControlCharactersInRegex)
🔇 Additional comments (17)
.gitignore (1)
25-28: LGTM!

The SEO checker report outputs (text, JSON, SARIF) are correctly added to `.gitignore` to prevent build artifacts from being committed.

CLAUDE.md (1)
90-145: LGTM!

The SEO Static Checker documentation is comprehensive, covering validation categories, configuration files, exclusion patterns with specificity levels, and new CLI commands. The examples for exclusion rules are clear and instructive.

scripts/seo-checker/rules.ts (2)
scripts/seo-checker/rules.ts (2)
1-19: LGTM!

The SEO rules catalog is well-structured with clear categorization. Each rule includes comprehensive metadata (id, category, name, scope, severity, fixHint, tags) enabling flexible rule management and reporting.
742-746: LGTM!

The `RULE_MAP` provides O(1) lookup by rule ID, and `getRule` safely returns `undefined` for missing rules, which is properly handled in consumers.

scripts/seo-checker/parser.ts (1)
14-146: LGTM!

The `parseHtmlFile` function comprehensively extracts SEO-relevant data including headings, links, images, JSON-LD, hreflangs, and meta tags. The error handling for malformed JSON-LD (storing a `_parseError` flag) enables proper error reporting downstream.

scripts/seo-checker/types.ts (1)
1-192: LGTM!

The type definitions are comprehensive and well-documented with JSDoc comments. The interfaces provide a clear public API surface for the SEO checker modules.
scripts/seo-checker/checks.ts (2)
819-913: LGTM!

The `checkDuplicates` function correctly handles site-wide duplicate detection with smart handling of language variants for canonicals. The fingerprint generation for site-wide issues uses a consistent format.
929-955: LGTM!

The `runPageChecks` aggregator function provides a clean orchestration of all per-page checks.

seo-checker.config.json (1)
83-83: Clarify `maxIssues: 0` semantics in config documentation.

The value `0` for `maxIssues` means "no limit"; the limit check is skipped when the value is falsy. While this is clear in the implementation (line 250: `if (config.maxIssues && ...)`), consider adding a comment in `seo-checker.config.json` or documentation to clarify that `0` disables the limit and positive values enforce a hard threshold.

seo-checker.exclusions.json (1)
1-5: LGTM!

Clean initial exclusions configuration file with proper schema reference. The empty exclusions array is appropriate for initial deployment.
package.json (1)
16-20: LGTM!

Well-structured script additions with clear naming conventions. The `build:after` integration ensures SEO validation runs automatically post-build, and the multiple output format options (`json`, `report`) support CI/CD integration.

scripts/seo-checker/index.ts (1)
185-328: Solid implementation of the main SEO checker workflow.

The `run()` function is well-structured with proper configuration layering (defaults → file config → exclusions → CLI args), comprehensive error handling, and clear exit code logic based on `failOn` severities.

scripts/seo-checker/reporter.ts (3)
246-248: ANSI stripping regex is intentional; the static analysis hint is a false positive.

The regex `/\x1b\[[0-9;]*m/g` correctly strips ANSI escape codes for file output. The `\x1b` (ESC) control character is necessary for matching ANSI sequences.
193-193: Verify that documentation pages exist for the SARIF `helpUri`.

The `helpUri` references `https://capgo.app/docs/seo/${ruleId}`, which may not exist. Consider either creating these documentation pages or using a placeholder that doesn't result in 404s.
165-227: Well-structured SARIF output format.

The SARIF report correctly implements the 2.1.0 schema with proper severity mapping, rule extraction, and fingerprint inclusion for result deduplication.
scripts/seo-checker/exclusions.ts (2)
95-104: Good use of `Set` for O(1) lookup of disabled rules.

Efficient implementation that avoids repeated array scans.
109-126: Defensive implementation with good fallback handling.

Correctly handles both array and object structures, and silently returns an empty array on parse errors, which is appropriate for optional config files.
scripts/seo-checker/checks.ts
Outdated
```typescript
// SEO00135: Anchor text missing/empty
if (!link.text || link.text.trim() === '') {
  // Check if it has an aria-label or contains an image with alt
  const issue = createIssue('SEO00135', page, { element: link.href })
  if (issue) issues.push(issue)
}
```
Incomplete implementation: aria-label check mentioned but not implemented.
The comment states "Check if it has an aria-label or contains an image with alt" but the code unconditionally creates an issue for empty anchor text. Links with aria-label or containing images with alt text are accessible and should not be flagged.
🔎 Proposed fix to account for accessible link patterns

```diff
 // SEO00135: Anchor text missing/empty
 if (!link.text || link.text.trim() === '') {
-  // Check if it has an aria-label or contains an image with alt
-  const issue = createIssue('SEO00135', page, { element: link.href })
-  if (issue) issues.push(issue)
+  // Note: This check cannot verify aria-label or nested img alt
+  // as PageData.links only stores text content. Consider enhancing
+  // the parser to capture accessibility attributes for more accurate detection.
+  // For now, this may produce false positives for icon-only links.
+  const issue = createIssue('SEO00135', page, { element: link.href })
+  if (issue) issues.push(issue)
 }
```
}To properly fix this, enhance the parser to capture aria-label and check for images with alt text within links.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```typescript
// SEO00135: Anchor text missing/empty
if (!link.text || link.text.trim() === '') {
  // Note: This check cannot verify aria-label or nested img alt
  // as PageData.links only stores text content. Consider enhancing
  // the parser to capture accessibility attributes for more accurate detection.
  // For now, this may produce false positives for icon-only links.
  const issue = createIssue('SEO00135', page, { element: link.href })
  if (issue) issues.push(issue)
}
```
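If the parser were enhanced as suggested, the check could consult those attributes. A sketch assuming hypothetical `ariaLabel` and `hasImgAlt` fields on the link data (neither exists in the current `PageData.links`):

```typescript
// Hypothetical link shape after a parser enhancement; only `href` and
// `text` exist in the current PageData.links entries.
interface LinkData {
  href: string
  text: string
  ariaLabel?: string
  hasImgAlt?: boolean
}

// A link counts as accessible when any of the three signals is present:
// visible text, an aria-label, or a nested <img> with alt text.
function isAccessibleLink(link: LinkData): boolean {
  if (link.text.trim() !== '') return true
  if (link.ariaLabel && link.ariaLabel.trim() !== '') return true
  return link.hasImgAlt === true
}
```

The SEO00135 check would then only flag links where `isAccessibleLink` returns false, eliminating the false positives for icon-only links.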
scripts/seo-checker/index.ts
Outdated
```typescript
for (let i = 0; i < args.length; i++) {
  switch (args[i]) {
    case '--config':
      result.configPath = args[++i]
      break
    case '--dist':
      result.distPath = args[++i]
      break
    case '--output':
      result.outputFormat = args[++i] as 'console' | 'json' | 'sarif'
      break
    case '--report':
      result.reportPath = args[++i]
      break
    case '--fail-on':
      result.failOn = args[++i].split(',') as Severity[]
      break
    case '--max-issues':
      result.maxIssues = parseInt(args[++i], 10)
      break
    case '--generate-config':
      result.generateConfig = true
      break
  }
```
Potential out-of-bounds access when parsing CLI arguments.
When accessing args[++i], there's no validation that the next argument exists. If a user provides --config without a following value, the code will silently use undefined.
🔎 Proposed fix with bounds checking

```diff
 for (let i = 0; i < args.length; i++) {
   switch (args[i]) {
     case '--config':
+      if (i + 1 >= args.length) {
+        console.error('Error: --config requires a path argument')
+        process.exit(1)
+      }
       result.configPath = args[++i]
       break
     case '--dist':
+      if (i + 1 >= args.length) {
+        console.error('Error: --dist requires a path argument')
+        process.exit(1)
+      }
       result.distPath = args[++i]
       break
     // ... apply similar pattern to other options
```

Committable suggestion skipped: line range outside the PR's diff.
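The repeated bounds check could also be factored into a single helper; a sketch with an illustrative name, throwing instead of calling `process.exit` so it stays testable:

```typescript
// Sketch: fetch the value following a flag, failing loudly when missing.
function takeValue(args: string[], i: number, flag: string): string {
  if (i + 1 >= args.length) {
    throw new Error(`${flag} requires a value`)
  }
  return args[i + 1]
}
```

A caller would write e.g. `result.configPath = takeValue(args, i, '--config'); i++`, keeping the index advance explicit rather than buried in `args[++i]`.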
scripts/seo-checker/index.ts
Outdated
```typescript
  // Check max issues limit
  if (config.maxIssues && allIssues.length >= config.maxIssues) {
    console.log(`Reached max issues limit (${config.maxIssues}), stopping scan...`)
    break
  }
}

// Run site-wide duplicate checks
const duplicateIssues = checkDuplicates(siteData, config)
allIssues.push(...duplicateIssues)
```
maxIssues limit can be exceeded by duplicate checks.
The maxIssues short-circuit breaks out of the page loop, but checkDuplicates issues are added unconditionally afterward, potentially exceeding the configured limit.
🔎 Proposed fix

```diff
 // Run site-wide duplicate checks
-const duplicateIssues = checkDuplicates(siteData, config)
-allIssues.push(...duplicateIssues)
+if (!config.maxIssues || allIssues.length < config.maxIssues) {
+  const duplicateIssues = checkDuplicates(siteData, config)
+  const remaining = config.maxIssues ? config.maxIssues - allIssues.length : duplicateIssues.length
+  allIssues.push(...duplicateIssues.slice(0, remaining))
+}
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```typescript
  // Check max issues limit
  if (config.maxIssues && allIssues.length >= config.maxIssues) {
    console.log(`Reached max issues limit (${config.maxIssues}), stopping scan...`)
    break
  }
}

// Run site-wide duplicate checks
if (!config.maxIssues || allIssues.length < config.maxIssues) {
  const duplicateIssues = checkDuplicates(siteData, config)
  const remaining = config.maxIssues ? config.maxIssues - allIssues.length : duplicateIssues.length
  allIssues.push(...duplicateIssues.slice(0, remaining))
}
```
🤖 Prompt for AI Agents
In scripts/seo-checker/index.ts around lines 249 to 258, the loop breaks when
allIssues reaches config.maxIssues but subsequent site-wide duplicate issues are
appended unconditionally, which can exceed the limit; modify the code to respect
maxIssues by calculating remainingSlots = config.maxIssues - allIssues.length
(if config.maxIssues is set) and only append up to remainingSlots duplicates
(e.g., slice duplicateIssues to that length) and log that the max was reached if
not all duplicates could be added; if config.maxIssues is not set, append all
duplicates as before.
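The capping logic in the prompt boils down to a small budget calculation; a sketch with illustrative names, treating a falsy `maxIssues` as "no limit" to match the config semantics discussed earlier:

```typescript
// Sketch: append site-wide issues without exceeding the configured cap.
// maxIssues <= 0 means "no limit", mirroring the falsy check in the config.
function appendWithCap<T>(all: T[], extra: T[], maxIssues: number): T[] {
  if (maxIssues <= 0) return all.concat(extra)
  const remaining = Math.max(0, maxIssues - all.length)
  return all.concat(extra.slice(0, remaining))
}
```

Keeping the cap in one helper means both the per-page loop and the duplicate pass enforce the same limit rather than duplicating the arithmetic.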
- Regenerated rules.ts from CSV with all 1152 SEO rules
- Expanded checks.ts from ~70 to 200+ implemented rules
- Fixed rule ID mismatches (mailto, tel, URL hygiene checks)
- Added SEO00003 (missing meta robots) to disabled list

Categories now covered:
- Metadata (title, description, robots, viewport, charset)
- HTML validity and structure
- Content length and format validation
- Headings hierarchy (H1-H6)
- Indexability and canonicals
- Links (broken, nofollow, noopener, mailto, tel)
- URL hygiene (session IDs, parameters, encoding)
- Images (alt text, dimensions, file size, lazy loading)
- Social tags (OG, Twitter cards)
- International SEO (hreflang)
- Structured data (24 Schema.org types validated)
- Content quality (duplicates, placeholders, thin content)
- Accessibility basics
- HTML semantics

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Move SEO rules from TypeScript to JSON file (reduces 11k lines to 24)
  - Fixes SonarCloud duplication warning (was 76.6%, required ≤3%)
  - JSON data files are not analyzed for code duplication
- Add try-catch for regex patterns in exclusions.ts
  - Fixes security hotspots for ReDoS vulnerability
  - Invalid regex patterns now fail gracefully

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Actionable comments posted: 0
♻️ Duplicate comments (1)
scripts/seo-checker/checks.ts (1)
669-673: Misleading comment: aria-label check not implemented.

The comment states "Check if it has an aria-label or contains an image with alt" but the code unconditionally creates an issue for any empty link text. Links with `aria-label` or containing images with alt text are accessible and should not be flagged. This will produce false positives for icon-only links with proper accessibility attributes.

🔎 Recommended approach

Either:

- Update the comment to reflect the current behavior and acknowledge the limitation:

```diff
-  // SEO00135: Anchor text missing/empty
+  // SEO00135: Anchor text missing/empty
+  // Note: This check cannot verify aria-label or nested img alt attributes
+  // as PageData.links only captures text content. May produce false positives
+  // for accessible icon-only links.
   if (!link.text || link.text.trim() === '') {
-    // Check if it has an aria-label or contains an image with alt
     const issue = createIssue('SEO00135', page, { element: link.href })
     if (issue) issues.push(issue)
   }
```

- Or enhance the parser (scripts/seo-checker/parser.ts) to capture `aria-label` attributes in the link data structure and check them here.
🧹 Nitpick comments (3)
scripts/seo-checker/checks.ts (3)
1220-1260: Fragile rule ID string manipulation.

Lines 1237 and 1252 use `ruleId.replace('382', '383')` and `ruleId.replace('382', '384')` to derive rule IDs for different contexts. This approach has several issues:
- Hardcoded assumption that all pattern ruleIds contain '382'
- Silently fails if a pattern uses a different rule ID
- Creates implicit coupling between PLACEHOLDER_PATTERNS data and transformation logic
🔎 More maintainable approaches
Option 1: Use context-specific rule IDs in the pattern data:
```diff
-const PLACEHOLDER_PATTERNS: [RegExp, string, string][] = [
-  [/lorem ipsum/i, 'lorem ipsum', 'SEO00382'],
+const PLACEHOLDER_PATTERNS: [RegExp, string, { title: string; desc: string; h1: string }][] = [
+  [/lorem ipsum/i, 'lorem ipsum', { title: 'SEO00382', desc: 'SEO00383', h1: 'SEO00384' }],
   // ...
 ]
```

Option 2: Create separate pattern arrays for each context:

```typescript
const PLACEHOLDER_PATTERNS_TITLE: [RegExp, string, string][] = [
  [/lorem ipsum/i, 'lorem ipsum', 'SEO00382'],
  // ...
]
const PLACEHOLDER_PATTERNS_DESC: [RegExp, string, string][] = [
  [/lorem ipsum/i, 'lorem ipsum', 'SEO00383'],
  // ...
]
```
1101-1135: Consider avoiding issue mutation for dynamic schema checks.

Lines 1122 and 1130 mutate `issue.ruleName` after creation to customize the rule name for specific schema fields. While this works, it's unconventional to mutate properties after `createIssue()` returns, and it means the rule names in rules.json are ignored for these dynamic checks.

🔎 Alternative approach

Consider adding an optional parameter to `createIssue` for custom rule names:

```typescript
function createIssue(
  ruleId: string,
  page: PageData,
  options: {
    element?: string
    actual?: string
    expected?: string
    line?: number
    customRuleName?: string
  } = {}
): SEOIssue | null {
  const rule = getRule(ruleId)
  if (!rule) return null
  return {
    ruleId,
    ruleName: options.customRuleName || rule.name,
    // ... rest of properties
  }
}
```

Then use it without mutation:

```typescript
const issue = createIssue('SEO00232', page, {
  element: `${type}: missing '${field}'`,
  customRuleName: `Schema ${type}: missing '${field}'`
})
if (issue) issues.push(issue)
```
1328-1337: Semantic HTML check logic may not match intent.

Lines 1328-1330 flag pages that use `<b>` but have no `<strong>` tags. However, a page with 100 `<b>` tags and one `<strong>` tag won't be flagged, even though it's using non-semantic markup extensively. The rule name "Using `<b>` instead of `<strong>`" suggests the intent is to discourage `<b>` in favor of semantic `<strong>`, but the current logic only catches pages that exclusively use `<b>`.

🔎 More accurate detection

Consider one of these approaches:

Option 1: Flag any use of `<b>` regardless of `<strong>` presence:

```diff
 // SEO00416: Using <b> instead of <strong>
-if (/<b[^>]*>/.test(page.html) && !/<strong[^>]*>/.test(page.html)) {
+if (/<b[^>]*>/.test(page.html)) {
   const issue = createIssue('SEO00416', page)
   if (issue) issues.push(issue)
 }
```

Option 2: Flag excessive `<b>` usage (e.g., more than 5 instances):

```typescript
const bCount = (page.html.match(/<b[^>]*>/gi) || []).length
if (bCount > 5) {
  const issue = createIssue('SEO00416', page, { actual: `${bCount} <b> tags` })
  if (issue) issues.push(issue)
}
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
scripts/seo-checker/checks.ts, scripts/seo-checker/exclusions.ts, scripts/seo-checker/rules.json, scripts/seo-checker/rules.ts, seo-checker.config.json
🚧 Files skipped from review as they are similar to previous changes (1)
- seo-checker.config.json
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{astro,jsx,tsx,js,ts}
📄 CodeRabbit inference engine (AGENT.md)
Dynamic content must use `aria-live` regions for accessibility

Files:
scripts/seo-checker/checks.ts, scripts/seo-checker/rules.ts, scripts/seo-checker/exclusions.ts
🧠 Learnings (7)
📚 Learning: 2025-12-31T01:44:38.598Z
Learnt from: CR
Repo: Cap-go/website PR: 0
File: AGENT.md:0-0
Timestamp: 2025-12-31T01:44:38.598Z
Learning: Applies to src/pages/**/*.{astro,ts,tsx,js,jsx} : Page must have unique `title` and `description` in content object for SEO
Applied to files:
scripts/seo-checker/checks.ts
📚 Learning: 2025-12-31T01:44:25.964Z
Learnt from: CR
Repo: Cap-go/website PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-31T01:44:25.964Z
Learning: Applies to src/**/*.{ts,tsx,astro} : Add `aria-label` to interactive elements without visible text
Applied to files:
scripts/seo-checker/checks.ts
📚 Learning: 2025-12-31T01:44:38.598Z
Learnt from: CR
Repo: Cap-go/website PR: 0
File: AGENT.md:0-0
Timestamp: 2025-12-31T01:44:38.598Z
Learning: Applies to **/*.{astro,jsx,tsx} : Icon-only buttons must have `aria-label`
Applied to files:
scripts/seo-checker/checks.ts
📚 Learning: 2025-12-31T01:44:38.598Z
Learnt from: CR
Repo: Cap-go/website PR: 0
File: AGENT.md:0-0
Timestamp: 2025-12-31T01:44:38.598Z
Learning: Applies to **/*.{astro,html,jsx,tsx} : All images must have descriptive `alt` text
Applied to files:
scripts/seo-checker/checks.ts
📚 Learning: 2025-12-31T01:44:25.964Z
Learnt from: CR
Repo: Cap-go/website PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-31T01:44:25.964Z
Learning: Applies to src/pages/**/*.{ts,tsx,astro} : Every page must have a unique `title` and `description` meta tag
Applied to files:
scripts/seo-checker/checks.ts
📚 Learning: 2025-12-31T01:44:25.964Z
Learnt from: CR
Repo: Cap-go/website PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-31T01:44:25.964Z
Learning: Applies to src/pages/**/*.{ts,tsx,astro} : Use canonical URLs for duplicate content
Applied to files:
scripts/seo-checker/checks.ts
📚 Learning: 2025-12-31T01:44:38.598Z
Learnt from: CR
Repo: Cap-go/website PR: 0
File: AGENT.md:0-0
Timestamp: 2025-12-31T01:44:38.598Z
Learning: Applies to src/pages/**/*.{astro,ts,tsx,js,jsx} : Include JSON-LD structured data using `src/lib/ldJson.ts` helpers for SEO
Applied to files:
scripts/seo-checker/rules.ts
🧬 Code graph analysis (3)
scripts/seo-checker/checks.ts (3)
scripts/seo-checker/types.ts (4)

- PageData (86-164)
- SEOIssue (21-41)
- SEOCheckerConfig (56-84)
- SiteData (166-178)

scripts/seo-checker/rules.ts (1)

- getRule (12-14)

scripts/seo-checker/parser.ts (2)

- resolveToFilePath (354-385)
- fileExists (342-349)

scripts/seo-checker/rules.ts (1)

scripts/seo-checker/types.ts (1)

- SEORule (10-19)

scripts/seo-checker/exclusions.ts (1)

scripts/seo-checker/types.ts (3)

- SEOIssue (21-41)
- ExclusionRule (43-54)
- SEOCheckerConfig (56-84)
🪛 ast-grep (0.40.3)
scripts/seo-checker/exclusions.ts
[warning] 21-21: Regular expression constructed from variable input detected. This can lead to Regular Expression Denial of Service (ReDoS) attacks if the variable contains malicious patterns. Use libraries like 'recheck' to validate regex safety or use static patterns.
Context: ``new RegExp(`^${regexPattern}$`)``
Note: [CWE-1333] Inefficient Regular Expression Complexity [REFERENCES]
- https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS
- https://cwe.mitre.org/data/definitions/1333.html
(regexp-from-variable)
[warning] 54-54: Regular expression constructed from variable input detected. This can lead to Regular Expression Denial of Service (ReDoS) attacks if the variable contains malicious patterns. Use libraries like 'recheck' to validate regex safety or use static patterns.
Context: `new RegExp(rule.elementPattern)`
Note: [CWE-1333] Inefficient Regular Expression Complexity [REFERENCES]
- https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS
- https://cwe.mitre.org/data/definitions/1333.html
(regexp-from-variable)
🔇 Additional comments (4)
scripts/seo-checker/rules.ts (1)
1-24: LGTM! Clean and straightforward implementation. The module provides a simple, type-safe API for loading and querying SEO rules with efficient lookup helpers.
scripts/seo-checker/exclusions.ts (1)
1-169: LGTM! Static analysis warnings are false positives.

The exclusion system implementation is solid. Both regex construction sites (lines 21 and 54) are properly wrapped in try-catch blocks that gracefully handle invalid patterns, making the ReDoS warnings from static analysis false positives. The implementation correctly handles fingerprint matching, glob patterns, and element patterns with appropriate specificity ordering.
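For illustration, the try/catch pattern described above can be sketched as a standalone helper. This is a hypothetical example, not the actual exclusions.ts code; the name `safeAnchoredRegex` is invented here:

```typescript
// Hypothetical helper showing the pattern: regexes built from config
// input are wrapped in try/catch so an invalid pattern degrades to
// "no match" instead of crashing the checker run.
function safeAnchoredRegex(pattern: string): RegExp | null {
  try {
    return new RegExp(`^${pattern}$`)
  } catch {
    return null // invalid pattern: treat as non-matching
  }
}

const ok = safeAnchoredRegex('blog/.*')
const bad = safeAnchoredRegex('([unclosed')
console.log(ok?.test('blog/post-1')) // true
console.log(bad) // null
```

The caller then skips `null` entries, which is why a malicious or malformed pattern can at worst disable one exclusion rule rather than the whole run.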
scripts/seo-checker/checks.ts (2)
1-122: LGTM! The constants, data structures, and helper functions are well-organized. The fingerprinting approach provides stable identifiers for exclusion matching, and the `createIssue` helper properly centralizes issue construction logic.
124-668: Comprehensive SEO check implementation.

The check functions are well-structured and cover a broad range of SEO concerns including metadata, HTML validity, content quality, links, images, social tags, internationalization, structured data, and accessibility. The implementation properly uses the `createIssue` helper for consistent issue reporting and handles edge cases appropriately.
Also applies to: 674-1100, 1136-1219, 1261-1327, 1338-1508
New checks added:

- SEO01153: Missing robots.txt file
- SEO01154: robots.txt syntax errors
- SEO01155: Missing Sitemap directive in robots.txt
- SEO01156: robots.txt blocks all crawlers
- SEO01157: Sitemap URL domain mismatch
- SEO01158: Missing sitemap.xml file
- SEO01159: Invalid XML syntax in sitemap
- SEO01160: Sitemap URL references non-existent page
- SEO01161: Sitemap uses HTTP instead of HTTPS
- SEO01162: Duplicate URLs in sitemap
- SEO01163: Invalid lastmod date format
- SEO01164: Trailing slash inconsistency
- SEO01165: robots.txt references non-existent sitemap

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Actionable comments posted: 1
♻️ Duplicate comments (3)
scripts/seo-checker/index.ts (3)
109-132: Missing bounds checking for CLI argument values.

When a flag like `--config` is provided without a following value, `args[++i]` will access an undefined index, leading to a silent `undefined` assignment.
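A bounds-checked variant of that flag-parsing pattern could look like the sketch below. The helpers `takeValue` and `parseArgs` are hypothetical names for illustration, not the actual index.ts API:

```typescript
// Hypothetical sketch of bounds-checked CLI flag parsing.
// Instead of blindly reading args[++i], verify a value exists first.
function takeValue(args: string[], i: number, flag: string): string {
  const value = args[i + 1]
  if (value === undefined || value.startsWith('--')) {
    throw new Error(`Missing value for ${flag}`)
  }
  return value
}

function parseArgs(args: string[]): Record<string, string> {
  const opts: Record<string, string> = {}
  for (let i = 0; i < args.length; i++) {
    if (args[i] === '--config') {
      opts.config = takeValue(args, i, '--config')
      i++ // skip the consumed value
    }
  }
  return opts
}

console.log(parseArgs(['--config', 'seo-checker.config.json']))
```

Failing fast with a clear "Missing value for --config" message is friendlier than a later crash on an `undefined` config path.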
256-266: Site-wide checks bypass the `maxIssues` limit.

After breaking from the page loop at `maxIssues`, duplicate, robots.txt, and sitemap issues are appended unconditionally, potentially exceeding the configured limit.
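One way to honor the limit across both phases is to cap the combined list once, after merging. This is a minimal sketch under the assumption that issues are plain objects; `capIssues` is a hypothetical helper, not the checker's actual code:

```typescript
// Hypothetical sketch: enforce a global maxIssues cap after merging
// page-level issues with site-wide (duplicate/robots/sitemap) issues.
interface Issue { ruleId: string }

function capIssues(pageIssues: Issue[], siteIssues: Issue[], maxIssues: number): Issue[] {
  return [...pageIssues, ...siteIssues].slice(0, maxIssues)
}

const pages: Issue[] = Array.from({ length: 4 }, (_, n) => ({ ruleId: `SEO0000${n}` }))
const site: Issue[] = [{ ruleId: 'SEO01153' }, { ruleId: 'SEO01158' }]
console.log(capIssues(pages, site, 5).length) // 5
```

Capping once at the end also keeps the truncation policy in a single place instead of scattering `break` conditions across loops.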
77-92: Shallow merge loses `severityOverrides` and `thresholdOverrides` from the base config.

When `overrides.rules` exists (even with only `disabled`), the spread `...overrides.rules` replaces `severityOverrides` and `thresholdOverrides` from `base.rules`. Only `disabled` is explicitly deep-merged.

🔎 Proposed fix for deep merging rules
```diff
 function mergeConfig(
   base: SEOCheckerConfig,
   overrides: Partial<SEOCheckerConfig>
 ): SEOCheckerConfig {
   return {
     ...base,
     ...overrides,
     rules: {
-      ...base.rules,
-      ...overrides.rules,
       disabled: [...(base.rules?.disabled || []), ...(overrides.rules?.disabled || [])],
+      severityOverrides: {
+        ...(base.rules?.severityOverrides || {}),
+        ...(overrides.rules?.severityOverrides || {}),
+      },
+      thresholdOverrides: {
+        ...(base.rules?.thresholdOverrides || {}),
+        ...(overrides.rules?.thresholdOverrides || {}),
+      },
     },
     exclusions: [...(base.exclusions || []), ...(overrides.exclusions || [])],
     failOn: overrides.failOn || base.failOn,
   }
 }
```
🧹 Nitpick comments (1)
scripts/seo-checker/index.ts (1)
177-179: Consider warning before overwriting an existing config file.

`generateSampleConfig` overwrites `seo-checker.config.json` without checking if it already exists. Users might accidentally lose custom configurations.

🔎 Proposed safeguard

```diff
 const outputPath = 'seo-checker.config.json'
+if (fs.existsSync(outputPath)) {
+  console.warn(`Warning: ${outputPath} already exists. Overwriting...`)
+}
 fs.writeFileSync(outputPath, JSON.stringify(sampleConfig, null, 2))
 console.log(`Generated sample config at ${outputPath}`)
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- scripts/seo-checker/checks.ts
- scripts/seo-checker/index.ts
- scripts/seo-checker/rules.json
🚧 Files skipped from review as they are similar to previous changes (1)
- scripts/seo-checker/checks.ts
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{astro,jsx,tsx,js,ts}
📄 CodeRabbit inference engine (AGENT.md)
Dynamic content must use `aria-live` regions for accessibility
Files:
scripts/seo-checker/index.ts
🧠 Learnings (4)
📚 Learning: 2025-12-31T01:44:38.598Z
Learnt from: CR
Repo: Cap-go/website PR: 0
File: AGENT.md:0-0
Timestamp: 2025-12-31T01:44:38.598Z
Learning: Applies to src/pages/**/*.{astro,ts,tsx,js,jsx} : Page must have unique `title` and `description` in content object for SEO
Applied to files:
scripts/seo-checker/index.ts
📚 Learning: 2025-12-31T01:44:25.964Z
Learnt from: CR
Repo: Cap-go/website PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-31T01:44:25.964Z
Learning: Applies to src/pages/**/*.{ts,tsx,astro} : Use `keywords` prop for relevant search terms in page SEO configuration
Applied to files:
scripts/seo-checker/index.ts
📚 Learning: 2025-12-31T01:44:38.598Z
Learnt from: CR
Repo: Cap-go/website PR: 0
File: AGENT.md:0-0
Timestamp: 2025-12-31T01:44:38.598Z
Learning: Applies to src/pages/**/*.{astro,ts,tsx,js,jsx} : Include JSON-LD structured data using `src/lib/ldJson.ts` helpers for SEO
Applied to files:
scripts/seo-checker/index.ts
📚 Learning: 2025-12-31T01:44:25.964Z
Learnt from: CR
Repo: Cap-go/website PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-31T01:44:25.964Z
Learning: Applies to src/pages/**/*.{ts,tsx,astro} : Use canonical URLs for duplicate content
Applied to files:
scripts/seo-checker/index.ts
🧬 Code graph analysis (1)
scripts/seo-checker/index.ts (5)
scripts/seo-checker/types.ts (4)

- SEOCheckerConfig (56-84)
- Severity (6-6)
- SEOIssue (21-41)
- CheckResult (180-192)

scripts/seo-checker/exclusions.ts (3)

- loadExclusionsFromFile (119-136)
- filterDisabledRules (105-114)
- filterExcludedIssues (78-100)

scripts/seo-checker/parser.ts (1)

- scanDistFolder (227-313)

scripts/seo-checker/checks.ts (1)

- runPageChecks (1852-1877)

scripts/seo-checker/reporter.ts (3)

- formatJsonReport (157-159)
- printReport (257-259)
- writeReport (232-252)
🔇 Additional comments (3)
scripts/seo-checker/index.ts (3)
1-24: LGTM! Clear documentation header with all CLI options, and imports follow modern Node.js conventions using the `node:` prefix for built-ins.
195-232: LGTM! Config loading follows a sensible precedence (defaults → file → CLI args), and the dist folder validation provides helpful guidance for users.

338-342: LGTM! Error handling at the entry point is appropriate, ensuring the process exits with a non-zero code on unexpected failures.
scripts/seo-checker/index.ts (outdated)

```js
// Output results
if (config.outputFormat === 'json') {
  console.log(formatJsonReport(result))
} else {
  printReport(result)
}
```
SARIF format not handled for console output.
When `outputFormat` is `'sarif'`, the code falls through to `printReport(result)` (console format) instead of outputting SARIF. The report file (line 311) handles SARIF correctly, but console output is inconsistent.
🔎 Proposed fix to handle SARIF console output
```diff
 // Output results
 if (config.outputFormat === 'json') {
   console.log(formatJsonReport(result))
+} else if (config.outputFormat === 'sarif') {
+  // Import formatSarifReport from reporter if not already imported
+  // console.log(formatSarifReport(result))
+  // Or print a notice that SARIF is only for file output:
+  console.log('SARIF format is only supported for file output. Use --report to generate a SARIF file.')
+  printReport(result)
 } else {
   printReport(result)
 }
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In scripts/seo-checker/index.ts around lines 302 to 307, console output does not
handle the 'sarif' outputFormat and falls back to printReport(result); update
the conditional to explicitly handle 'sarif' by serializing the SARIF report the
same way the file writer does (e.g., call the existing SARIF
formatter/serializer used at line 311 or JSON.stringify the SARIF object) and
console.log that serialized SARIF output; keep existing branches for 'json' and
default console printReport for human-readable output.
- Rewrite parser.ts with parallel file I/O using fs/promises
- Add parallel directory traversal for faster file discovery
- Process HTML files in batches of 200 for optimal memory usage
- Add file existence caching to avoid redundant filesystem checks
- Run page checks in parallel batches of 500 with setImmediate
- Execute site-wide checks (duplicates, robots, sitemap) concurrently

Performance: ~109s for 6085 pages (previously ~142s)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Actionable comments posted: 0
♻️ Duplicate comments (3)
scripts/seo-checker/index.ts (3)
109-132: Out-of-bounds access when parsing CLI arguments.

This issue was already identified in a previous review. When accessing `args[++i]`, there's no validation that the next argument exists.
277-280: The `maxIssues` limit can be exceeded by site-wide checks.

This issue was already identified in a previous review. The loop breaks when `allIssues.length >= config.maxIssues`, but lines 287-293 unconditionally add duplicate, robots, and sitemap issues afterward.
330-335: SARIF format not handled for console output.

This issue was already identified in a previous review. When `outputFormat` is `'sarif'`, the code falls through to `printReport(result)` instead of handling SARIF specifically.
🧹 Nitpick comments (7)
scripts/seo-checker/index.ts (3)
64-72: Improve error reporting for config-loading failures.

The `loadConfig` function catches all errors and returns an empty object, which silently ignores important issues like permission errors, invalid JSON syntax, or file-system failures. Consider logging more specific error details or differentiating between "file not found" (acceptable) and other errors (problematic).

🔎 Suggested improvement

```diff
 function loadConfig(configPath: string): Partial<SEOCheckerConfig> {
   try {
     const content = fs.readFileSync(configPath, 'utf-8')
     return JSON.parse(content)
   } catch (error) {
-    console.error(`Failed to load config from ${configPath}:`, error)
+    if ((error as NodeJS.ErrnoException).code === 'ENOENT') {
+      // File not found is acceptable
+      return {}
+    }
+    console.error(`Failed to load config from ${configPath}:`)
+    if (error instanceof SyntaxError) {
+      console.error('  Invalid JSON syntax')
+    } else {
+      console.error(`  ${(error as Error).message}`)
+    }
     return {}
   }
 }
```
287-291: False parallelism: wrapping synchronous functions doesn't parallelize them.

The site-wide checks (`checkDuplicates`, `checkRobotsTxt`, `checkSitemap`) are synchronous functions wrapped in `Promise.resolve()` and passed to `Promise.all()`. This doesn't actually execute them in parallel: they still run sequentially within the event loop. While this doesn't affect correctness, it creates misleading code that suggests parallelism where none exists.

💡 Suggested approach
If these functions are truly synchronous and cannot be made async, simply call them sequentially:
```diff
-  const [duplicateIssues, robotsIssues, sitemapIssues] = await Promise.all([
-    Promise.resolve(checkDuplicates(siteData, config)),
-    Promise.resolve(checkRobotsTxt(config)),
-    Promise.resolve(checkSitemap(config, siteData)),
-  ])
+  const duplicateIssues = checkDuplicates(siteData, config)
+  const robotsIssues = checkRobotsTxt(config)
+  const sitemapIssues = checkSitemap(config, siteData)
```

Alternatively, if you anticipate making these async in the future, document that intent with a comment.
87-87: Potential duplicate entries in the disabled-rules array.

The `mergeConfig` function concatenates the `disabled` arrays from base and overrides without deduplication. While this doesn't break functionality (filtering an issue multiple times is harmless), it's inefficient and could clutter debugging output.

🔎 Optional deduplication

```diff
 disabled: [...(base.rules?.disabled || []), ...(overrides.rules?.disabled || [])],
+// Or use a Set to deduplicate:
+// disabled: [...new Set([...(base.rules?.disabled || []), ...(overrides.rules?.disabled || [])])],
```

scripts/seo-checker/parser.ts (4)
145-146: Word-count calculation may be inaccurate for non-space-separated languages.

The word count splits on whitespace (`mainText.split(/\s+/)`), which works for space-separated languages like English but will undercount or miscount for languages like Chinese, Japanese, or Thai that don't use spaces between words. Since the SEO checker supports multiple languages (lines 32-47 in index.ts list ja, ko, zh), this could produce misleading metrics for those locales.
- Documenting this limitation in the word count metric
- Using a more sophisticated word counting approach for CJK languages
- Providing language-specific word count logic based on the page's
htmlLangattribute💡 Example language-aware approach
// Calculate word count - fast path + const htmlLang = $('html').attr('lang')?.split('-')[0] + const isCJK = htmlLang && ['zh', 'ja', 'ko'].includes(htmlLang) const mainText = $('main').text() || $('body').text() - const wordCount = mainText.split(/\s+/).filter(w => w.length > 0).length + const wordCount = isCJK + ? mainText.replace(/\s+/g, '').length // Character count for CJK + : mainText.split(/\s+/).filter(w => w.length > 0).length
234-237: Redirect detection only checks `http-equiv="refresh"` with basic string matching.

The redirect detection on line 236 uses a simple `.includes()` check for `http-equiv="refresh"`. This may miss:
- Variations in attribute order or whitespace: `http-equiv = "refresh"` or `http-equiv='REFRESH'`
- JavaScript-based redirects: `window.location.href =` or `window.location.replace()`
- Other meta refresh patterns
While the current approach is fast, it could result in processing redirect pages as regular pages, potentially generating false positives in SEO checks.
💡 More robust redirect detection
```diff
 if (stats.size < MAX_REDIRECT_SIZE) {
   const content = await fsp.readFile(filePath, 'utf-8')
-  if (content.includes('http-equiv="refresh"') || content.includes("http-equiv='refresh'")) {
+  const hasMetaRefresh = /http-equiv\s*=\s*["']?refresh["']?/i.test(content)
+  const hasJSRedirect = /window\.location\.(href|replace)/i.test(content)
+  if (hasMetaRefresh || hasJSRedirect) {
     return null
   }
   return { path: filePath, size: stats.size, content }
```
33-39: Module-level `baseHostname` cache could cause issues with config changes.

The `baseHostname` variable is cached at module level (line 13) and only set once on the first call to `parseHtmlFile`. If the same process were to parse files with different `config.baseUrl` values, the cached hostname would be stale. However, in the current usage pattern (single config per process run), this is unlikely to be an issue.

Consider either:

- Documenting this assumption in a comment
- Invalidating the cache when a new baseUrl is detected
- Making `baseHostname` part of the function scope or passing it as a parameter

💡 Cache invalidation approach

```diff
 // Cache base hostname for URL checks
-if (baseHostname === null) {
+const currentBaseUrl = config.baseUrl
+if (baseHostname === null || baseHostname !== currentBaseUrl) {
   try {
-    baseHostname = new URL(config.baseUrl).hostname
+    baseHostname = new URL(currentBaseUrl).hostname
   } catch {
     baseHostname = ''
   }
 }
```

Note: This would require tracking the baseUrl as well, not just the hostname.
380-392: Synchronous file-system call in `fileExists` could block the event loop.

The `fileExists` function uses `fs.accessSync`, which is synchronous and blocks the event loop. In a build tool processing thousands of files, this could impact performance despite caching. Given that this codebase uses `async`/`await` patterns elsewhere (e.g., `scanDistFolder`), consider using an async version.

🔎 Async alternative

```diff
 /**
  * Check if a file exists at the given path (cached)
  */
-export function fileExists(filePath: string): boolean {
+export async function fileExists(filePath: string): Promise<boolean> {
   const cached = fileExistsCache.get(filePath)
   if (cached !== undefined) return cached
   try {
-    fs.accessSync(filePath, fs.constants.F_OK)
+    await fsp.access(filePath, fs.constants.F_OK)
     fileExistsCache.set(filePath, true)
     return true
   } catch {
     fileExistsCache.set(filePath, false)
     return false
   }
 }
```

Note: This would require updating all call sites to use `await`.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- scripts/seo-checker/index.ts
- scripts/seo-checker/parser.ts
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{astro,jsx,tsx,js,ts}
📄 CodeRabbit inference engine (AGENT.md)
Dynamic content must use `aria-live` regions for accessibility
Files:
scripts/seo-checker/index.ts
scripts/seo-checker/parser.ts
🧠 Learnings (6)
📚 Learning: 2025-12-31T01:44:38.598Z
Learnt from: CR
Repo: Cap-go/website PR: 0
File: AGENT.md:0-0
Timestamp: 2025-12-31T01:44:38.598Z
Learning: Applies to src/pages/**/*.{astro,ts,tsx,js,jsx} : Page must have unique `title` and `description` in content object for SEO
Applied to files:
scripts/seo-checker/index.ts
scripts/seo-checker/parser.ts
📚 Learning: 2025-12-31T01:44:25.964Z
Learnt from: CR
Repo: Cap-go/website PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-31T01:44:25.964Z
Learning: Applies to src/pages/**/*.{ts,tsx,astro} : Use `keywords` prop for relevant search terms in page SEO configuration
Applied to files:
scripts/seo-checker/index.ts
scripts/seo-checker/parser.ts
📚 Learning: 2025-12-31T01:44:38.598Z
Learnt from: CR
Repo: Cap-go/website PR: 0
File: AGENT.md:0-0
Timestamp: 2025-12-31T01:44:38.598Z
Learning: Applies to src/pages/**/*.{astro,ts,tsx,js,jsx} : Include JSON-LD structured data using `src/lib/ldJson.ts` helpers for SEO
Applied to files:
scripts/seo-checker/index.ts
scripts/seo-checker/parser.ts
📚 Learning: 2025-12-31T01:44:25.964Z
Learnt from: CR
Repo: Cap-go/website PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-31T01:44:25.964Z
Learning: Applies to src/pages/**/*.{ts,tsx,astro} : Use canonical URLs for duplicate content
Applied to files:
scripts/seo-checker/index.ts
scripts/seo-checker/parser.ts
📚 Learning: 2025-12-31T01:44:25.964Z
Learnt from: CR
Repo: Cap-go/website PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-31T01:44:25.964Z
Learning: Applies to src/pages/**/*.{ts,tsx,astro} : Use the helpers in `src/lib/ldJson.ts` to add structured data (JSON-LD) to pages
Applied to files:
scripts/seo-checker/parser.ts
📚 Learning: 2025-12-31T01:44:25.964Z
Learnt from: CR
Repo: Cap-go/website PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-31T01:44:25.964Z
Learning: Applies to src/**/*.{ts,tsx,astro} : Use semantic HTML headings (h1, h2, h3) in proper hierarchy
Applied to files:
scripts/seo-checker/parser.ts
🧬 Code graph analysis (2)
scripts/seo-checker/index.ts (5)
scripts/seo-checker/types.ts (4)

- SEOCheckerConfig (56-84)
- Severity (6-6)
- SEOIssue (21-41)
- CheckResult (180-192)

scripts/seo-checker/exclusions.ts (3)

- loadExclusionsFromFile (119-136)
- filterDisabledRules (105-114)
- filterExcludedIssues (78-100)

scripts/seo-checker/parser.ts (1)

- scanDistFolder (204-330)

scripts/seo-checker/checks.ts (4)

- runPageChecks (1852-1877)
- checkDuplicates (1372-1468)
- checkRobotsTxt (1474-1623)
- checkSitemap (1629-1837)

scripts/seo-checker/reporter.ts (3)

- formatJsonReport (157-159)
- printReport (257-259)
- writeReport (232-252)
scripts/seo-checker/parser.ts (1)
scripts/seo-checker/types.ts (2)
- PageData (86-164)
- SiteData (166-178)
🔇 Additional comments (2)
scripts/seo-checker/parser.ts (1)
1-432: LGTM: Efficient parallel HTML parsing implementation.

The parallel processing approach with batching (HTML files: 200, stats: 500) strikes a good balance between concurrency and memory usage. The caching strategy for file existence and base hostname improves performance for repeated checks. The extraction logic using cheerio is comprehensive and covers the major SEO metadata requirements.
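The batching pattern praised here can be sketched generically: run each fixed-size batch concurrently, but run the batches themselves sequentially to bound memory. `processInBatches` is a hypothetical illustration, not the parser.ts API:

```typescript
// Generic sketch of batched parallel processing, as described above.
// Each batch's promises run concurrently; batches run one after another
// so at most `batchSize` file reads/parses are in flight at once.
async function processInBatches<T, R>(
  items: T[],
  batchSize: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = []
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize)
    results.push(...(await Promise.all(batch.map(fn))))
  }
  return results
}

;(async () => {
  const doubled = await processInBatches([1, 2, 3, 4, 5], 2, async (n) => n * 2)
  console.log(doubled) // [ 2, 4, 6, 8, 10 ]
})()
```

Tuning `batchSize` trades throughput against peak memory, which matches the 200/500 batch sizes chosen for HTML parsing versus cheap stat calls.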
scripts/seo-checker/index.ts (1)
248-267: No action needed: setImmediate is fully supported in Bun.

Bun implements `setImmediate` as part of its Node.js compatibility layer and the behavior is consistent with Node.js expectations. The code pattern used here (deferring work with `setImmediate` to allow other operations) is valid and works correctly in Bun. No changes are required.

Likely an incorrect or invalid review comment.
- Replace custom scripts/seo-checker with @capgo/seo-checker package
- Update package.json scripts to use seo-checker CLI
- Configure failOn to include warnings (stricter CI)
- Remove old seo-checker scripts folder

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Actionable comments posted: 1
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- package.json
- seo-checker.config.json
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: CR
Repo: Cap-go/website PR: 0
File: AGENT.md:0-0
Timestamp: 2025-12-31T01:44:38.598Z
Learning: Applies to src/pages/**/*.{astro,ts,tsx,js,jsx} : Page must have unique `title` and `description` in content object for SEO
Learnt from: CR
Repo: Cap-go/website PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-31T01:44:25.964Z
Learning: Applies to src/pages/**/*.{ts,tsx,astro} : Ensure hreflang tags are present for localized pages
📚 Learning: 2025-12-31T01:44:38.598Z
Learnt from: CR
Repo: Cap-go/website PR: 0
File: AGENT.md:0-0
Timestamp: 2025-12-31T01:44:38.598Z
Learning: Run `bun run build` to check for build errors before completing changes
Applied to files:
package.json
📚 Learning: 2025-12-31T01:44:25.964Z
Learnt from: CR
Repo: Cap-go/website PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-31T01:44:25.964Z
Learning: Applies to src/**/*.{ts,tsx,astro} : Use Tailwind CSS for styling the Capgo landing website
Applied to files:
package.json
📚 Learning: 2025-12-31T01:44:38.598Z
Learnt from: CR
Repo: Cap-go/website PR: 0
File: AGENT.md:0-0
Timestamp: 2025-12-31T01:44:38.598Z
Learning: Applies to src/pages/**/*.{astro,ts,tsx,js,jsx} : Include JSON-LD structured data using `src/lib/ldJson.ts` helpers for SEO
Applied to files:
package.json
📚 Learning: 2025-12-31T01:44:38.598Z
Learnt from: CR
Repo: Cap-go/website PR: 0
File: AGENT.md:0-0
Timestamp: 2025-12-31T01:44:38.598Z
Learning: Applies to src/pages/**/*.{astro,ts,tsx,js,jsx} : Page must have unique `title` and `description` in content object for SEO
Applied to files:
package.json
🔇 Additional comments (5)
seo-checker.config.json (3)
1-21: LGTM! Comprehensive i18n configuration.

The base configuration correctly specifies the dist path, base URL, and extensive language support for 15 locales.
83-84: Strict failOn configuration will block builds on warnings.

The `failOn: ["error", "warning"]` setting combined with `build:after` executing `seo:check` means any SEO warning will fail the build and block deployment. While the PR description mentions "blocks deployment if critical SEO errors are found," this configuration is stricter: it blocks on warnings too. Verify this aligns with your intended CI/CD behavior, especially during the initial rollout phase.

Consider starting with `failOn: ["error"]` to allow warnings without blocking, then gradually tightening to include warnings once the initial issues are resolved.
23-78: The original concern about non-existent rule IDs applies to the previous custom implementation, not the current @capgo/seo-checker package. The package is legitimate and published on npm, confirming the SEO00XXX rule ID format. However, the package README does not provide a complete enumeration of all supported rule IDs, making it impossible to definitively verify each of the 55 disabled rules without access to the tool's complete rule catalog.The configuration's intent to disable rules for "initial deployment" remains valid context. Consider documenting why these specific rules are disabled if not already captured elsewhere.
package.json (2)
16-16: SEO check now blocks deployments via build:after.

Chaining `seo:check` in `build:after` means the build will fail if SEO validation fails. Combined with `failOn: ["error", "warning"]` in the config, any SEO warning will block deployment. Ensure your team is prepared for potential build failures and has a process to address SEO issues promptly.

Consider testing this locally with `bun run build:after` before merging to verify it doesn't unexpectedly break CI/CD pipelines.
17-20: LGTM! Well-structured SEO validation scripts.

The new scripts provide flexible SEO checking options (console, JSON, report formats) and config generation. The commands are documented in CLAUDE.md and follow standard CLI patterns.
```diff
 "@types/semver": "^7.7.1",
 "@types/toastify-js": "^1.12.4",
 "astro-font": "^1.1.0",
+"@capgo/seo-checker": "^0.0.1",
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Check if @capgo/seo-checker exists on npm and get version info
npm view @capgo/seo-checker versions --json 2>/dev/null | tail -20
echo "---"
npm view @capgo/seo-checker version 2>/dev/null
echo "---"
npm view @capgo/seo-checker dist-tags 2>/dev/null
echo "---"
npm view @capgo/seo-checker 2>/dev/null | head -30
```

Repository: Cap-go/website

Length of output: 782
@capgo/[email protected] is the brand-new, first-ever release of this package—consider the risk of using it in a deployment-blocking context.
This package was published only hours ago and has no release history. Using a completely untested, newly-published package in your build:after hook creates significant risk: any latent bugs discovered later could block all deployments. Consider either thoroughly testing this version in non-blocking scenarios first, deferring its use until the package has multiple releases and proven stability, or restructuring your build process so SEO checker failures don't block deployments.
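If the non-blocking route is chosen, one possible pattern (a sketch, assuming a bun/npm-style script; the wrapper name and messages are illustrative) is to let the check report failures without failing the build:

```shell
# Hypothetical non-blocking wrapper: run a command, report its failure,
# but keep the build green during the initial rollout.
run_nonblocking() {
  if "$@"; then
    echo "SEO check passed"
  else
    echo "SEO check failed (non-blocking during rollout)"
  fi
}

run_nonblocking false   # stand-in for: bun run seo:check
echo "build continues"
```

Once the package has proven stable, removing the wrapper restores the strict, deployment-blocking behavior.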
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
- seo:check now outputs GitHub Actions annotations
- Add seo:check:local for console output when running locally

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
Summary
Implements a production-ready SEO validation system that scans the dist folder and ensures all built pages meet SEO standards. The checker validates 70+ rules across metadata, content, links, images, structured data, and internationalization.
Features
- Runs automatically after every build via the `build:after` script

Configuration

- `seo-checker.config.json` - Main configuration file with enabled/disabled rules
- `seo-checker.exclusions.json` - Specific issue exclusions for legitimate cases

Current configuration disables non-critical rules to allow initial deployment. Rules can be gradually enabled as issues are fixed.
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation
Chores
✏️ Tip: You can customize this high-level summary in your review settings.