@tessamero tessamero commented Sep 12, 2025

What does this PR do?

This PR implements an LLM-powered description generator for blog posts, building on the initial work from PR #2368 by @arielweinberger.

Key Improvements

  • Type Safety: Replaced any types with proper TypeScript interfaces
  • Simplified Processing: Streamlined content extraction using raw Markdoc content
  • Error Handling: Added comprehensive fallbacks for empty content and parsing failures
  • CI/CD Support: Added a --skip-existing flag for automated workflows (not yet used in CI; included now so it is available when automation is added)
  • Professional Output: Refined prompts for senior engineering audiences
  • Enhanced CLI: Added help system and better error messages
  • Compatibility: Fixed tsx execution issues
  • Documentation: Added usage examples and troubleshooting

Usage

# Generate description for a single file
npm run generate:page-description -- --file-path ./blog-post.markdoc

# Skip files that already have descriptions (intended for future CI/CD automation)
npm run generate:page-description -- --file-path ./blog-post.markdoc --skip-existing

# Show help
npm run generate:page-description -- --help
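
The flags above could be parsed along the following lines. This is a minimal sketch, not the script's actual parser: the `CliOptions` shape and the `parseCliArgs` name are hypothetical, and only the three flags shown in the usage examples are assumed to exist.

```typescript
// Hypothetical option shape; the real script may structure this differently.
interface CliOptions {
  filePath?: string;
  skipExisting: boolean;
  help: boolean;
}

// Parses the arguments that follow `--` on the command line.
function parseCliArgs(argv: string[]): CliOptions {
  const options: CliOptions = { skipExisting: false, help: false };

  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === "--help") {
      options.help = true;
    } else if (arg === "--skip-existing") {
      options.skipExisting = true;
    } else if (arg === "--file-path") {
      const value = argv[i + 1];
      if (value === undefined || value.startsWith("--")) {
        throw new Error("--file-path requires a value");
      }
      options.filePath = value;
      i++; // consume the value as well as the flag
    } else {
      throw new Error(`Unknown argument: ${arg}`);
    }
  }

  // --help is allowed on its own; every other invocation needs a file path.
  if (!options.help && options.filePath === undefined) {
    throw new Error("Missing required --file-path <path>");
  }
  return options;
}
```

Rejecting unknown arguments up front keeps a mistyped flag from being silently ignored in an automated run.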

Test Plan

Manual Testing Steps:

  1. Install Dependencies:

    npm install
  2. Test Script with Help:

    npm run generate:page-description -- --help

    Expected: Shows usage instructions and available options

  3. Test Script with Sample File:

    npm run generate:page-description -- --file-path "./src/routes/changelog/(entries)/2025-07-10.markdoc"

    Expected: Generates SEO description and displays it in terminal

  4. Test Skip Existing Flag:

    npm run generate:page-description -- --file-path "./src/routes/changelog/(entries)/2025-07-10.markdoc" --skip-existing

    Expected: Skips file if description already exists, or generates description if none exists

  5. Test Error Handling:

    npm run generate:page-description -- --file-path "./nonexistent-file.markdoc"

    Expected: Shows appropriate error message

Verification:

  • ✅ Script runs without errors
  • ✅ Generates professional SEO descriptions
  • ✅ Handles missing files gracefully
  • ✅ Skip functionality works as expected
  • ✅ Help system displays correctly

Note: Requires a valid OPENAI_API_KEY environment variable for full functionality.
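
One way to surface a missing key early is a small guard that runs before any model call. This is a sketch under assumptions: `requireOpenAiKey` is a hypothetical helper, not something the PR is known to contain, and the error wording is illustrative.

```typescript
// Hypothetical guard: returns the key or throws a descriptive error.
// The environment is passed in so the function is easy to test.
function requireOpenAiKey(env: Record<string, string | undefined>): string {
  const key = env["OPENAI_API_KEY"]?.trim();
  if (!key) {
    throw new Error(
      "OPENAI_API_KEY is not set. Export it before running generate:page-description.",
    );
  }
  return key;
}
```

In the script this would typically be called once at startup as `requireOpenAiKey(process.env)`, so the failure is reported before any file is read.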

Related PRs and Issues

Credits

Closes #2368

Have you read the Contributing Guidelines on issues?

yes

Summary by CodeRabbit

  • New Features

    • Added a command to auto-generate concise (up to 250 characters) page descriptions for docs, improving metadata quality and consistency.
  • Refactor

    • Centralized Markdoc schema configuration for more consistent docs processing and layouts.
  • Chores

    • Introduced new dependencies to support description generation and Markdoc processing.
    • Added an npm script to run the description generator.


coderabbitai bot commented Sep 12, 2025

Walkthrough

Adds a CLI and programmatic tool to generate ~250-character page descriptions for Markdoc docs using an LLM. Introduces scripts/llm-generate-description.ts with two exports (getDocPageContent, generateDescriptionForDocsPage), a main entrypoint handling args (file path, skip existing), prompt construction, retries if over limit, and structured output. Updates package.json with a new script (generate:page-description) and dependencies (dedent, @ai-sdk/openai, @markdoc/markdoc, ai, front-matter, jsdom, tsx). Refactors svelte.config.js to export a reusable markdocSchema and uses it in preprocess.

Pre-merge checks (4 passed, 1 warning)

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning)
    Explanation: Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.
    Resolution: You can run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (4 passed)

  • Description Check (✅ Passed)
    Explanation: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title Check (✅ Passed)
    Explanation: The current title succinctly and accurately describes the primary change (adding an LLM-powered script to generate page descriptions) and aligns with the changeset; however, the parenthetical "(improvements from original PR)" is extraneous and could be removed for clarity.
  • Linked Issues Check (✅ Passed)
    Explanation: The changes align with the primary coding objectives of [#2368]: the new scripts/llm-generate-description.ts implements a CLI and programmatic API that produces descriptions and character counts, package.json adds the generate:page-description script and required dependencies, and svelte.config.js exports a centralized markdocSchema for preprocessing. Based on the provided summaries, the PR satisfies the linked issue's requirements for a command-line workflow, --file-path support, outputting length/status, and schema centralization, though it should be confirmed that the script explicitly fails with a clear error when OPENAI_API_KEY is missing and that usage/docs mention dependency installation.
  • Out of Scope Changes Check (✅ Passed)
    Explanation: I do not detect out-of-scope changes in the provided summaries; modified files (package.json, scripts/llm-generate-description.ts, and svelte.config.js) directly support the LLM description generator and the stated goal of centralizing the Markdoc schema from [#2368].

@tessamero tessamero marked this pull request as ready for review September 12, 2025 00:16

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (8)
scripts/llm-generate-description.ts (8)

1-7: Use node: specifiers and keep pathToFileURL for robust direct-run detection.

Minor import nit and sets up for a better isDirect check below.

Apply:

-import { readFileSync } from "fs";
-import { pathToFileURL } from "url";
-import path from "path";
+import { readFileSync } from "node:fs";
+import { pathToFileURL } from "node:url";
+import path from "node:path";
+
+type FrontmatterValue =
+  | string
+  | number
+  | boolean
+  | Date
+  | string[]
+  | number[]
+  | boolean[]
+  | Record<string, unknown>
+  | null;
+export type FrontmatterAttributes = Record<string, FrontmatterValue>;

24-33: Type the frontmatter more explicitly.

Use a shared FrontmatterAttributes alias instead of an inline Record union.

-}: {
-  articleText: string;
-  frontmatterAttributes: Record<
-    string,
-    string | number | boolean | Date | string[] | number[] | boolean[]
-  >;
-}) {
+}: {
+  articleText: string;
+  frontmatterAttributes: FrontmatterAttributes;
+}) {
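
With a shared alias in place, reading a single attribute can narrow the union before use. The sketch below repeats the alias so it is self-contained; `getStringAttribute` is a hypothetical helper, and the "Untitled" fallback simply mirrors the one used in the script's retry prompt.

```typescript
type FrontmatterValue =
  | string
  | number
  | boolean
  | Date
  | string[]
  | number[]
  | boolean[]
  | Record<string, unknown>
  | null;
type FrontmatterAttributes = Record<string, FrontmatterValue>;

// Hypothetical helper: returns the attribute only if it is a non-empty string,
// otherwise the supplied fallback.
function getStringAttribute(
  attrs: FrontmatterAttributes,
  key: string,
  fallback: string,
): string {
  const value = attrs[key];
  return typeof value === "string" && value.trim() !== "" ? value.trim() : fallback;
}

// Example: a numeric or missing title falls back instead of leaking into a prompt.
const title = getStringAttribute({ title: "Release notes" }, "title", "Untitled");
```

Narrowing with `typeof` avoids interpolating a Date, array, or object into the prompt by accident.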

56-63: Stabilize generations.

Add a small temperature for consistency.

     const { text: description } = await generateText({
       model: openai("gpt-4o-mini"),
       system: systemPrompt,
       prompt: userPrompt,
       maxTokens: 100,
+      temperature: 0.2,
     });

98-117: Add explicit return type for the exported API.

Improves DX and catches accidental shape changes.

-export async function getDocPageContent(markdocPath: string) {
+export async function getDocPageContent(
+  markdocPath: string,
+): Promise<{ articleText: string; frontmatterAttributes: FrontmatterAttributes }> {

119-123: Type the public function return.

Small DX improvement.

-export async function generateDescriptionForDocsPage(
+export async function generateDescriptionForDocsPage(
   filePath: string,
-  options: { skipIfExists?: boolean } = {},
-) {
+  options: { skipIfExists?: boolean } = {},
+): Promise<{ description: string; characterCount: number; skipped?: boolean }> {
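
The optional `skipped` field in that return type suggests a short-circuit when a description already exists. A minimal sketch of such a check follows; it assumes the existing description lives under a `description` frontmatter key, which this review does not confirm, and `skipResultIfDescribed` is a hypothetical name.

```typescript
interface DescriptionResult {
  description: string;
  characterCount: number;
  skipped?: boolean;
}

// Hypothetical helper: returns a "skipped" result when skipping is requested
// and a non-empty description is already present; otherwise returns null so
// the caller can go on to generate one.
function skipResultIfDescribed(
  attrs: Record<string, unknown>,
  options: { skipIfExists?: boolean } = {},
): DescriptionResult | null {
  const existing = attrs["description"]; // assumed frontmatter key
  if (options.skipIfExists && typeof existing === "string" && existing.trim() !== "") {
    const description = existing.trim();
    return { description, characterCount: description.length, skipped: true };
  }
  return null;
}
```

Returning the existing text alongside `skipped: true` lets callers log or reuse it without re-reading the file.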

127-131: Redundant null check.

front-matter always returns an attributes object; this branch is unreachable.

-  if (!frontmatterAttributes) {
-    throw new Error(
-      "Frontmatter attributes are undefined - file may be malformed",
-    );
-  }

209-216: More reliable direct-run detection in ESM/tsx.

Compare import.meta.url to argv[1] as a file URL.

-const isDirect =
-  process.argv[1] && process.argv[1].endsWith("llm-generate-description.ts");
+const argv1 = process.argv[1];
+const isDirect = !!argv1 && import.meta.url === pathToFileURL(argv1).href;
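
The difference between the two checks is easiest to see when they are factored into a function. In this sketch `isDirectRun` is a hypothetical wrapper around the comparison suggested above; `pathToFileURL` is the standard helper from `node:url`.

```typescript
import { pathToFileURL } from "node:url";

// Hypothetical wrapper: true only when the module at `moduleUrl` is the file
// Node was asked to run (process.argv[1]).
function isDirectRun(moduleUrl: string, argv1: string | undefined): boolean {
  return !!argv1 && moduleUrl === pathToFileURL(argv1).href;
}

// A same-named script in another directory satisfies an endsWith() check on
// the file name, but not the URL comparison.
const entry = "/tmp/scripts/llm-generate-description.ts";
const lookalike = pathToFileURL("/tmp/other/llm-generate-description.ts").href;
const matchesEntry = isDirectRun(pathToFileURL(entry).href, entry);
const matchesLookalike = isDirectRun(lookalike, entry);
```

Inside the script itself the call would be `isDirectRun(import.meta.url, process.argv[1])`.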

169-184: Use pnpm in help text

Replace npm with pnpm in the Usage / Examples / CI lines in scripts/llm-generate-description.ts.

Occurrences: scripts/llm-generate-description.ts — lines 170, 178-179, 182

-Usage:
-  npm run generate:page-description -- --file-path <path> [options]
+Usage:
+  pnpm run generate:page-description -- --file-path <path> [options]
@@
-Examples:
-  npm run generate:page-description -- --file-path ./blog-post.markdoc
-  npm run generate:page-description -- --file-path ./blog-post.markdoc --skip-existing
+Examples:
+  pnpm run generate:page-description -- --file-path ./blog-post.markdoc
+  pnpm run generate:page-description -- --file-path ./blog-post.markdoc --skip-existing
@@
-For CI/CD usage:
-  npm run generate:page-description -- --file-path ./new-post.markdoc --skip-existing
+For CI/CD usage:
+  pnpm run generate:page-description -- --file-path ./new-post.markdoc --skip-existing
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9f53b6f and ea9b789.

⛔ Files ignored due to path filters (1)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (3)
  • package.json (2 hunks)
  • scripts/llm-generate-description.ts (1 hunks)
  • svelte.config.js (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: Tests
svelte.config.js

[warning] 1-1: Prettier formatting issue detected. Run 'prettier --write' to fix.

scripts/llm-generate-description.ts

[warning] 1-1: Prettier formatting issue detected. Run 'prettier --write' to fix.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (6)
package.json (2)

16-16: Script entrypoint LGTM.

CLI wired via tsx is appropriate and consistent with the ESM setup.


67-73: New devDependencies look correct for the CLI.

All additions belong in devDependencies for a tooling-only workflow.

svelte.config.js (3)

8-26: Centralized Markdoc schema export looks good.

This improves reuse and keeps preprocess config single-sourced.


33-33: Good switch to use the shared schema in preprocess.

This prevents config drift between CLI and app.


8-26: Fix Prettier warning in CI.

Run: pnpm format (or prettier --write .) to resolve the reported formatting issue.

scripts/llm-generate-description.ts (1)

1-1: Fix Prettier warning in CI.

Run: pnpm format (or prettier --write .) to resolve the reported formatting issue.

Comment on lines +64 to +91
    const trimmedDescription = description.trim();
    const characterCount = trimmedDescription.length;

    // If the description is too long, try again with a more specific prompt
    if (characterCount > 250) {
      const retryPrompt = `The previous description was too long (${characterCount} characters). Generate a shorter description (maximum 250 characters) for this page:

Title: ${frontmatterAttributes?.title || "Untitled"}
Content: ${articleText.substring(0, 500)}...

Make it concise and under 250 characters.`;

      const { text: retryDescription } = await generateText({
        model: openai("gpt-4o-mini"),
        system: systemPrompt,
        prompt: retryPrompt,
        maxTokens: 80,
      });

      const finalDescription = retryDescription.trim();
      return {
        description: finalDescription,
        characterCount: finalDescription.length,
      };
    }

    return { description: trimmedDescription, characterCount };
  } catch (error) {

🛠️ Refactor suggestion

Enforce ≤250 chars and sanitize output (single line, ASCII hyphens, quotes).

Guarantees meta-safe text even if the model overruns.

-    const trimmedDescription = description.trim();
-    const characterCount = trimmedDescription.length;
+    const sanitize = (s: string) =>
+      s.replace(/\s+/g, " ").replace(/[–—]/g, "-").replace(/"/g, "'").trim();
+    const trimmedDescription = sanitize(description);
+    let characterCount = trimmedDescription.length;
@@
-      const { text: retryDescription } = await generateText({
+      const { text: retryDescription } = await generateText({
         model: openai("gpt-4o-mini"),
         system: systemPrompt,
         prompt: retryPrompt,
-        maxTokens: 80,
+        maxTokens: 80,
+        temperature: 0.2,
       });
 
-      const finalDescription = retryDescription.trim();
+      const retrimmed = sanitize(retryDescription);
+      const finalDescription =
+        retrimmed.length > 250
+          ? retrimmed.slice(0, 247).replace(/\s+\S*$/, "") + "…"
+          : retrimmed;
       return {
         description: finalDescription,
         characterCount: finalDescription.length,
       };
     }
 
-    return { description: trimmedDescription, characterCount };
+    const final =
+      trimmedDescription.length > 250
+        ? trimmedDescription.slice(0, 247).replace(/\s+\S*$/, "") + "…"
+        : trimmedDescription;
+    return { description: final, characterCount: final.length };

Also applies to: 76-88

🤖 Prompt for AI Agents
In scripts/llm-generate-description.ts around lines 64-91, the generated
description paths (both initial and retry) need to enforce a hard ≤250-character
limit and sanitize the string into a single-line, meta-safe form: after
trimming, replace newlines with a single space and collapse multiple spaces,
normalize smart quotes to straight ASCII quotes, replace en/em-dashes with ASCII
hyphen, optionally remove or normalize other problematic unicode if present,
then truncate to 250 characters and re-trim; compute characterCount from this
sanitized/truncated string and return that value. Apply the same
sanitization/truncation logic to the retryDescription path (lines ~76-88) so
both branches return a single-line ASCII-safe description no longer than 250
chars.
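
The sanitization and hard cap described above can be sketched as two small pure functions that both the initial and retry paths call. The function names are hypothetical. Unlike the diff above, this version ends a truncated description with three ASCII periods rather than "…", on the assumption that a fully ASCII result is wanted.

```typescript
const MAX_DESCRIPTION_LENGTH = 250;

// Collapses all whitespace (including newlines) to single spaces and replaces
// smart quotes and en/em dashes with ASCII equivalents.
function sanitizeDescription(raw: string): string {
  return raw
    .replace(/\s+/g, " ")
    .replace(/[\u2018\u2019]/g, "'")
    .replace(/[\u201C\u201D]/g, '"')
    .replace(/[\u2013\u2014]/g, "-")
    .trim();
}

// Returns a sanitized description no longer than `max` characters. When it
// must cut, it drops the trailing partial word and appends "...".
function capDescription(raw: string, max: number = MAX_DESCRIPTION_LENGTH): string {
  const clean = sanitizeDescription(raw);
  if (clean.length <= max) {
    return clean;
  }
  const cut = clean.slice(0, max - 3).replace(/\s+\S*$/, "");
  return cut + "...";
}

// Usage: compute characterCount from the capped string, never from raw output.
const description = capDescription("A \u201Cquick\u201D guide \u2013 with\nline breaks");
const characterCount = description.length;
```

Because the cap runs after the model call, the limit holds even if the retry also overruns.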
