feat: generate page description with llm script (improvements from original PR) #2389

tessamero · 2025-09-12T00:13:58Z

What does this PR do?

This PR implements an LLM-powered description generator for blog posts, building on the initial work from PR #2368 by @arielweinberger.

Key Improvements

Type Safety: Replaced any types with proper TypeScript interfaces
Simplified Processing: Streamlined content extraction using raw Markdoc content
Error Handling: Added comprehensive fallbacks for empty content and parsing failures
CI/CD Support: Added a --skip-existing flag for automated workflows (not used in CI yet; included now so it is not missed later)
Professional Output: Refined prompts for senior engineering audiences
Enhanced CLI: Added help system and better error messages
Compatibility: Fixed tsx execution issues
Documentation: Added usage examples and troubleshooting

Usage

# Generate description for a single file
npm run generate:page-description -- --file-path ./blog-post.markdoc

# Skip files that already have descriptions (intended for future CI/CD automation)
npm run generate:page-description -- --file-path ./blog-post.markdoc --skip-existing

# Show help
npm run generate:page-description -- --help
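
For reviewers who want a mental model of the three flags above, here is a minimal sketch of how they could be parsed. It is not the code from scripts/llm-generate-description.ts; the `CliOptions` interface and the `parseCliArgs` name are hypothetical, and the error messages are illustrative.

```typescript
// Hypothetical option shape; the real script may structure this differently.
interface CliOptions {
  filePath?: string;
  skipExisting: boolean;
  help: boolean;
}

// Parses the arguments that follow the script name,
// e.g. the result of process.argv.slice(2).
function parseCliArgs(args: string[]): CliOptions {
  const options: CliOptions = { skipExisting: false, help: false };
  for (let i = 0; i < args.length; i++) {
    const arg = args[i];
    if (arg === "--help") {
      options.help = true;
    } else if (arg === "--skip-existing") {
      options.skipExisting = true;
    } else if (arg === "--file-path") {
      const value = args[i + 1];
      if (value === undefined || value.startsWith("--")) {
        throw new Error("--file-path requires a value");
      }
      options.filePath = value;
      i++; // the value has been consumed
    } else {
      throw new Error(`Unknown argument: ${arg}`);
    }
  }
  return options;
}
```

Taking the argument array as a parameter, rather than reading process.argv inside the function, keeps the parser easy to exercise in isolation.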

Test Plan

Manual Testing Steps:

Install Dependencies:
```
npm install
```
Test Script with Help:
```
npm run generate:page-description -- --help
```
Expected: Shows usage instructions and available options

Test Script with Sample File:

npm run generate:page-description -- --file-path "./src/routes/changelog/(entries)/2025-07-10.markdoc"

Expected: Generates SEO description and displays it in terminal

Test Skip Existing Flag:

npm run generate:page-description -- --file-path "./src/routes/changelog/(entries)/2025-07-10.markdoc" --skip-existing

Expected: Skips file if description already exists, or generates description if none exists

Test Error Handling:

npm run generate:page-description -- --file-path "./nonexistent-file.markdoc"

Expected: Shows appropriate error message

Verification:

✅ Script runs without errors
✅ Generates professional SEO descriptions
✅ Handles missing files gracefully
✅ Skip functionality works as expected
✅ Help system displays correctly

Note: Requires valid OPENAI_API_KEY environment variable for full functionality.
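
A fail-fast check for the key could look roughly like the sketch below. This is an assumption about how such a guard might be written, not a description of what the script does today; `requireOpenAiKey` is a hypothetical helper and the error wording is illustrative.

```typescript
// Returns the API key or throws a descriptive error.
// The environment is passed in (e.g. process.env) so the check is testable.
function requireOpenAiKey(env: Record<string, string | undefined>): string {
  const key = env["OPENAI_API_KEY"]?.trim();
  if (!key) {
    throw new Error(
      "OPENAI_API_KEY is not set. Export it before running generate:page-description.",
    );
  }
  return key;
}
```

Calling a guard like this before any model request would surface a missing key as a clear message instead of a provider-side authentication error.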

Related PRs and Issues

Credits

Initial implementation: @arielweinberger (PR feat: generate page description with llm script #2368)
Improvements and fixes: @tessamero

Closes #2368

Have you read the Contributing Guidelines on issues?

yes

Summary by CodeRabbit

New Features
- Added a command to auto-generate concise (up to 250 characters) page descriptions for docs, improving metadata quality and consistency.
Refactor
- Centralized Markdoc schema configuration for more consistent docs processing and layouts.
Chores
- Introduced new dependencies to support description generation and Markdoc processing.
- Added an npm script to run the description generator.

coderabbitai · 2025-09-12T00:14:05Z

Walkthrough

Adds a CLI and programmatic tool to generate ~250-character page descriptions for Markdoc docs using an LLM. Introduces scripts/llm-generate-description.ts with two exports (getDocPageContent, generateDescriptionForDocsPage), a main entrypoint handling args (file path, skip existing), prompt construction, retries if over limit, and structured output. Updates package.json with a new script (generate:page-description) and dependencies (dedent, @ai-sdk/openai, @markdoc/markdoc, ai, front-matter, jsdom, tsx). Refactors svelte.config.js to export a reusable markdocSchema and uses it in preprocess.
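
The "retries if over limit" behavior mentioned in the walkthrough can be pictured with the sketch below. It is a simplified illustration under stated assumptions: the model call is replaced by an injected `generate` function, and `generateWithinLimit`, `maxAttempts`, and the retry wording are hypothetical rather than taken from the PR.

```typescript
// Stand-in for the LLM call; the real script uses generateText from the ai package.
type Generate = (prompt: string) => Promise<string>;

interface LimitedResult {
  description: string;
  characterCount: number;
  attempts: number;
}

// Calls `generate`, and re-prompts with the overrun reported if the
// result exceeds `maxLength`. Gives up after `maxAttempts` tries and
// returns the last result, which may still be too long.
async function generateWithinLimit(
  generate: Generate,
  basePrompt: string,
  maxLength = 250,
  maxAttempts = 2,
): Promise<LimitedResult> {
  let prompt = basePrompt;
  let description = "";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    description = (await generate(prompt)).trim();
    if (description.length <= maxLength) {
      return { description, characterCount: description.length, attempts: attempt };
    }
    prompt =
      `${basePrompt}\n\nThe previous description was too long ` +
      `(${description.length} characters). Keep it under ${maxLength} characters.`;
  }
  return { description, characterCount: description.length, attempts: maxAttempts };
}
```

Note that a retry alone does not guarantee the limit; a caller that needs a hard cap still has to truncate the final string.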

Pre-merge checks (4 passed, 1 warning)

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The current title succinctly and accurately describes the primary change, adding an LLM-powered script to generate page descriptions, and aligns with the changeset; however the parenthetical "(improvements from original PR)" is extraneous and could be removed for clarity. |
| Linked Issues Check | ✅ Passed | The changes align with the primary coding objectives of [#2368]: the new scripts/llm-generate-description.ts implements a CLI and programmatic API that produces descriptions and character counts, package.json adds the generate:page-description script and required dependencies, and svelte.config.js exports a centralized markdocSchema for preprocessing. Based on the provided summaries, the PR satisfies the linked issue's requirements for a command-line workflow, --file-path support, outputting length/status, and schema centralization, though it should be confirmed that the script explicitly fails with a clear error when OPENAI_API_KEY is missing and that usage/docs mention dependency installation. |
| Out of Scope Changes Check | ✅ Passed | I do not detect out-of-scope changes in the provided summaries; modified files (package.json, scripts/llm-generate-description.ts, and svelte.config.js) directly support the LLM description generator and the stated goal of centralizing the Markdoc schema from [#2368]. |



coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (8)

scripts/llm-generate-description.ts (8)

1-7: Use node: specifiers and keep pathToFileURL for robust direct-run detection.

Minor import nit and sets up for a better isDirect check below.

Apply:

-import { readFileSync } from "fs";
-import { pathToFileURL } from "url";
-import path from "path";
+import { readFileSync } from "node:fs";
+import { pathToFileURL } from "node:url";
+import path from "node:path";
+
+type FrontmatterValue =
+  | string
+  | number
+  | boolean
+  | Date
+  | string[]
+  | number[]
+  | boolean[]
+  | Record<string, unknown>
+  | null;
+export type FrontmatterAttributes = Record<string, FrontmatterValue>;

24-33: Type the frontmatter more explicitly.

Use a shared FrontmatterAttributes alias instead of an inline Record union.

-}: {
-  articleText: string;
-  frontmatterAttributes: Record<
-    string,
-    string | number | boolean | Date | string[] | number[] | boolean[]
-  >;
-}) {
+}: {
+  articleText: string;
+  frontmatterAttributes: FrontmatterAttributes;
+}) {

56-63: Stabilize generations.

Add a small temperature for consistency.

     const { text: description } = await generateText({
       model: openai("gpt-4o-mini"),
       system: systemPrompt,
       prompt: userPrompt,
       maxTokens: 100,
+      temperature: 0.2,
     });

98-117: Add explicit return type for the exported API.

Improves DX and catches accidental shape changes.

-export async function getDocPageContent(markdocPath: string) {
+export async function getDocPageContent(
+  markdocPath: string,
+): Promise<{ articleText: string; frontmatterAttributes: FrontmatterAttributes }> {

119-123: Type the public function return.

Small DX improvement.

-export async function generateDescriptionForDocsPage(
+export async function generateDescriptionForDocsPage(
   filePath: string,
-  options: { skipIfExists?: boolean } = {},
-) {
+  options: { skipIfExists?: boolean } = {},
+): Promise<{ description: string; characterCount: number; skipped?: boolean }> {
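
To make the `skipped` flag in that return type concrete, the decision behind --skip-existing might reduce to a check like the one below. This is a sketch under assumptions: `shouldSkip` is a hypothetical helper, and it assumes the existing description lives under a `description` key in the frontmatter.

```typescript
// Frontmatter as a loose key/value map; the real type may be narrower.
type Attributes = Record<string, unknown>;

// Skip only when the caller asked for it AND a non-empty
// string description is already present in the frontmatter.
function shouldSkip(attributes: Attributes, skipIfExists: boolean): boolean {
  const existing = attributes["description"];
  return skipIfExists && typeof existing === "string" && existing.trim().length > 0;
}
```

Treating a whitespace-only description as missing means a placeholder value does not block generation.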

127-131: Redundant null check.

front-matter always returns an attributes object; this branch is unreachable.

-  if (!frontmatterAttributes) {
-    throw new Error(
-      "Frontmatter attributes are undefined - file may be malformed",
-    );
-  }

209-216: More reliable direct-run detection in ESM/tsx.

Compare import.meta.url to argv[1] as a file URL.

-const isDirect =
-  process.argv[1] && process.argv[1].endsWith("llm-generate-description.ts");
+const argv1 = process.argv[1];
+const isDirect = !!argv1 && import.meta.url === pathToFileURL(argv1).href;
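
The same comparison can be wrapped in a small helper, shown here as a sketch. `isDirectRun` is a hypothetical name; it expects the caller to pass import.meta.url and process.argv[1], which keeps the helper free of globals. `pathToFileURL` is the real function from node:url.

```typescript
import { pathToFileURL } from "node:url";

// `moduleUrl` is expected to be import.meta.url of the calling module.
// `argv1` is expected to be process.argv[1], which may be undefined.
function isDirectRun(moduleUrl: string, argv1: string | undefined): boolean {
  if (!argv1) return false;
  // Convert the script path to a file URL so both sides use the same form.
  return moduleUrl === pathToFileURL(argv1).href;
}
```

A call site would then read `if (isDirectRun(import.meta.url, process.argv[1])) main();`. One caveat to verify before relying on it: the two values may not match when the script is invoked through a symlink.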

169-184: Use pnpm in help text

Replace npm with pnpm in the Usage / Examples / CI lines in scripts/llm-generate-description.ts.

Occurrences: scripts/llm-generate-description.ts — lines 170, 178-179, 182

-Usage:
-  npm run generate:page-description -- --file-path <path> [options]
+Usage:
+  pnpm run generate:page-description -- --file-path <path> [options]
@@
-Examples:
-  npm run generate:page-description -- --file-path ./blog-post.markdoc
-  npm run generate:page-description -- --file-path ./blog-post.markdoc --skip-existing
+Examples:
+  pnpm run generate:page-description -- --file-path ./blog-post.markdoc
+  pnpm run generate:page-description -- --file-path ./blog-post.markdoc --skip-existing
@@
-For CI/CD usage:
-  npm run generate:page-description -- --file-path ./new-post.markdoc --skip-existing
+For CI/CD usage:
+  pnpm run generate:page-description -- --file-path ./new-post.markdoc --skip-existing

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9f53b6f and ea9b789.

⛔ Files ignored due to path filters (1)

pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml

📒 Files selected for processing (3)

package.json (2 hunks)
scripts/llm-generate-description.ts (1 hunks)
svelte.config.js (1 hunks)

🧰 Additional context used

🪛 GitHub Actions: Tests

svelte.config.js

[warning] 1-1: Prettier formatting issue detected. Run 'prettier --write' to fix.

scripts/llm-generate-description.ts

[warning] 1-1: Prettier formatting issue detected. Run 'prettier --write' to fix.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: build

🔇 Additional comments (6)

package.json (2)

16-16: Script entrypoint LGTM.

CLI wired via tsx is appropriate and consistent with the ESM setup.

67-73: New devDependencies look correct for the CLI.

All additions belong in devDependencies for a tooling-only workflow.

svelte.config.js (3)

8-26: Centralized Markdoc schema export looks good.

This improves reuse and keeps preprocess config single-sourced.

33-33: Good switch to use the shared schema in preprocess.

This prevents config drift between CLI and app.

8-26: Fix Prettier warning in CI.

Run: pnpm format (or prettier --write .) to resolve the reported formatting issue.

scripts/llm-generate-description.ts (1)

1-1: Fix Prettier warning in CI.

Run: pnpm format (or prettier --write .) to resolve the reported formatting issue.

coderabbitai · 2025-09-12T00:26:36Z

scripts/llm-generate-description.ts

+    const trimmedDescription = description.trim();
+    const characterCount = trimmedDescription.length;
+
+    // If the description is too long, try again with a more specific prompt
+    if (characterCount > 250) {
+      const retryPrompt = `The previous description was too long (${characterCount} characters). Generate a shorter description (maximum 250 characters) for this page:
+
+Title: ${frontmatterAttributes?.title || "Untitled"}
+Content: ${articleText.substring(0, 500)}...
+
+Make it concise and under 250 characters.`;
+
+      const { text: retryDescription } = await generateText({
+        model: openai("gpt-4o-mini"),
+        system: systemPrompt,
+        prompt: retryPrompt,
+        maxTokens: 80,
+      });
+
+      const finalDescription = retryDescription.trim();
+      return {
+        description: finalDescription,
+        characterCount: finalDescription.length,
+      };
+    }
+
+    return { description: trimmedDescription, characterCount };
+  } catch (error) {


🛠️ Refactor suggestion

Enforce ≤250 chars and sanitize output (single line, ASCII hyphens, quotes).

Guarantees meta-safe text even if the model overruns.

-    const trimmedDescription = description.trim();
-    const characterCount = trimmedDescription.length;
+    const sanitize = (s: string) =>
+      s.replace(/\s+/g, " ").replace(/[–—]/g, "-").replace(/"/g, "'").trim();
+    const trimmedDescription = sanitize(description);
+    let characterCount = trimmedDescription.length;
@@
-      const { text: retryDescription } = await generateText({
+      const { text: retryDescription } = await generateText({
         model: openai("gpt-4o-mini"),
         system: systemPrompt,
         prompt: retryPrompt,
-        maxTokens: 80,
+        maxTokens: 80,
+        temperature: 0.2,
       });
-      const finalDescription = retryDescription.trim();
+      const retrimmed = sanitize(retryDescription);
+      const finalDescription =
+        retrimmed.length > 250
+          ? retrimmed.slice(0, 247).replace(/\s+\S*$/, "") + "…"
+          : retrimmed;
       return {
         description: finalDescription,
         characterCount: finalDescription.length,
       };
     }
-    return { description: trimmedDescription, characterCount };
+    const final =
+      trimmedDescription.length > 250
+        ? trimmedDescription.slice(0, 247).replace(/\s+\S*$/, "") + "…"
+        : trimmedDescription;
+    return { description: final, characterCount: final.length };

Also applies to: 76-88

🤖 Prompt for AI Agents

In scripts/llm-generate-description.ts around lines 64-91, the generated description paths (both initial and retry) need to enforce a hard ≤250-character limit and sanitize the string into a single-line, meta-safe form: after trimming, replace newlines with a single space and collapse multiple spaces, normalize smart quotes to straight ASCII quotes, replace en/em-dashes with ASCII hyphen, optionally remove or normalize other problematic unicode if present, then truncate to 250 characters and re-trim; compute characterCount from this sanitized/truncated string and return that value. Apply the same sanitization/truncation logic to the retryDescription path (lines ~76-88) so both branches return a single-line ASCII-safe description no longer than 250 chars.
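
The steps in that prompt can be sketched as two small pure functions. This is one possible reading of the instructions, not the code merged in the PR: `sanitizeDescription` and `clampDescription` are hypothetical names, smart quotes are mapped to straight ASCII quotes as the prompt describes (the diff above converts double quotes to single quotes instead), and truncation backs up to a word boundary without appending an ellipsis.

```typescript
const MAX_DESCRIPTION_LENGTH = 250;

// Collapse the text to one line and normalize typographic characters.
function sanitizeDescription(raw: string): string {
  return raw
    .replace(/[\u2018\u2019]/g, "'") // curly single quotes
    .replace(/[\u201C\u201D]/g, '"') // curly double quotes
    .replace(/[\u2013\u2014]/g, "-") // en and em dashes
    .replace(/\s+/g, " ") // newlines, tabs, repeated spaces
    .trim();
}

// Sanitize, then enforce a hard upper bound on length.
function clampDescription(
  raw: string,
  maxLength = MAX_DESCRIPTION_LENGTH,
): { description: string; characterCount: number } {
  const clean = sanitizeDescription(raw);
  if (clean.length <= maxLength) {
    return { description: clean, characterCount: clean.length };
  }
  const cut = clean.slice(0, maxLength);
  const lastSpace = cut.lastIndexOf(" ");
  // If the cut lands mid-word, back up to the previous space;
  // with no space available, fall back to the hard cut.
  const endsOnBoundary = clean[maxLength] === " ";
  const description = (
    !endsOnBoundary && lastSpace > 0 ? cut.slice(0, lastSpace) : cut
  ).trimEnd();
  return { description, characterCount: description.length };
}
```

Running both the initial and the retry result through the same function would give the two branches identical guarantees, which is the point of the "Also applies to" note.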

arielweinberger and others added 8 commits September 4, 2025 13:43

feat: generate page description with llm script

a4900d4

feat: expose reusable function

36c5475

do not run main() if imported

ba95757

better looking output

b0cd081

fix output

37da5ed

fix: formatting

7204134

Merge branch 'main' into feat/generate-llm-page-description-script

0ac69b2

improvements to the llm generator

ea9b789

tessamero marked this pull request as ready for review September 12, 2025 00:16

tessamero requested a review from arielweinberger September 12, 2025 00:16

coderabbitai bot reviewed Sep 12, 2025
