new extension: rich metadata for SEO #1182

lbliii · 2025-10-15T18:30:57Z

Programmatically extends and enriches the meta tags across all docs pages by reading the frontmatter we added during the docs refactor. As far as I know, this change makes NeMo Curator the first fully SEO-optimized Sphinx-based docs site at NVIDIA. It also supports other cross-cutting docs initiatives by providing an early working example of what should be "baked in" to default Sphinx builds for docs.

Before

<head class="at-element-marker">
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0"><meta name="viewport" content="width=device-width, initial-scale=1">

    <title>Common Crawl — NeMo-Curator</title>
  
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <meta name="docsearch:language" content="en">
    <meta name="docsearch:version" content="">

After

<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">

    <!-- SEO Meta Tags -->
    <meta name="description" content="Download and extract text from Common Crawl web archives using Curator.">
    <meta name="keywords"
        content="common-crawl, web-data, warc, language-detection, distributed, html-extraction, pipeline">

    <!-- Open Graph / Facebook -->
    <meta property="og:description" content="Download and extract text from Common Crawl web archives using Curator.">
    <meta property="og:type" content="article">
    <meta property="og:title" content="Common Crawl: Download Data - NeMo-Curator | NVIDIA">
    <meta property="og:url" content="None">

    <!-- Twitter -->
    <meta name="twitter:description" content="Download and extract text from Common Crawl web archives using Curator.">
    <meta name="twitter:title" content="Common Crawl: Download Data - NeMo-Curator | NVIDIA">
    <meta name="twitter:card" content="summary">

    <!-- Content Metadata -->
    <meta name="audience" content="Data Scientists, Machine Learning Engineers">
    <meta name="content-type-category" content="how-to">
    <meta name="difficulty" content="intermediate">
    <meta name="modality" content="text-only">

    <!-- Structured Data (JSON-LD) -->
    <script type="text/javascript" async=""
        src="https://cdn.bizible.com/xdc.js?_biz_u=757ee508329f464dcb7f443545173063&amp;_biz_h=1579209556&amp;cdn_o=a&amp;jsVer=4.25.10.02&amp;a=nvidia.com"></script>
    <script type="application/ld+json">
    {
    "@context": "https://schema.org/",
    "@type": "TechArticle",
    "headline": "Common Crawl",
    "name": "Common Crawl",
    "description": "Download and extract text from Common Crawl web archives using Curator.",
    "keywords": [
        "common-crawl",
        "web-data",
        "warc",
        "language-detection",
        "distributed",
        "html-extraction",
        "pipeline"
    ],
    "proficiencyLevel": "Intermediate",
    "audience": {
        "@type": "Audience",
        "audienceType": [
        "Data Scientists",
        "Machine Learning Engineers"
        ]
    },
    "url": null,
    "publisher": {
        "@type": "Organization",
        "name": "NVIDIA Corporation",
        "url": "https://www.nvidia.com/"
    }
    }
    </script>

    <title>Common Crawl: Download Data - NeMo-Curator | NVIDIA</title>
    ...
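For readers who want to see how the JSON-LD block above could be assembled, here is a minimal sketch. It is not the extension's code: `build_json_ld` and `render_json_ld` are hypothetical helpers, and the sketch assumes the frontmatter has already been parsed into a dict whose `personas` values are display names. Unlike the output above, it leaves `url` out when no URL is known instead of emitting `null`.

```python
import json
from typing import Optional


def build_json_ld(frontmatter: dict, title: str, url: Optional[str] = None) -> dict:
    """Build a schema.org TechArticle dict from parsed frontmatter (hypothetical helper)."""
    data = {
        "@context": "https://schema.org/",
        "@type": "TechArticle",
        "headline": title,
        "name": title,
    }
    if frontmatter.get("description"):
        data["description"] = frontmatter["description"]
    if frontmatter.get("tags"):
        data["keywords"] = list(frontmatter["tags"])
    if frontmatter.get("difficulty"):
        data["proficiencyLevel"] = frontmatter["difficulty"].capitalize()
    if frontmatter.get("personas"):
        data["audience"] = {
            "@type": "Audience",
            "audienceType": list(frontmatter["personas"]),
        }
    if url:
        # Assumption: omit the key entirely when no canonical URL is configured.
        data["url"] = url
    data["publisher"] = {
        "@type": "Organization",
        "name": "NVIDIA Corporation",
        "url": "https://www.nvidia.com/",
    }
    return data


def render_json_ld(data: dict) -> str:
    """Serialize the dict into a script tag, escaping "</" so it cannot close the tag early."""
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'
```

Skipping empty fields keeps the structured data free of `null` values, which may be worth considering for `og:url` as well.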

…tegration

Add new Sphinx extension that injects SEO-optimized metadata into HTML head from frontmatter:

Core Features:
- Extract frontmatter (description, tags, personas, difficulty, content_type, modality)
- Generate standard meta tags (description, keywords, audience)
- Generate Open Graph tags for social sharing (Facebook, LinkedIn)
- Generate Twitter Card tags for enhanced previews
- Generate JSON-LD structured data (schema.org) for search engines
- Support product versioning via cascade.product fields

Components:
- rich_metadata/__init__.py: Main extension with config-inited and html-page-context hooks
- rich_metadata/templates/layout.html: Template override for metadata injection
- rich_metadata/README.md: Technical overview and features
- rich_metadata/USAGE.md: Complete usage guide with examples
- rich_metadata/SUMMARY.md: Quick reference guide
- rich_metadata/IMPLEMENTATION.md: Architecture and implementation details
- rich_metadata/verify_metadata.py: Automated verification script

Template handling follows search_assets pattern - templates live within extension folder and are automatically added to Sphinx template search path via config-inited hook. Extension enabled in conf.py and ready for use with existing frontmatter.

… and enhanced titles

- Extract frontmatter from markdown files and inject SEO metadata
- Support standard meta tags (description, keywords)
- Support Open Graph tags (og:description, og:title, og:type, og:url)
- Support Twitter Card tags (twitter:description, twitter:title, twitter:card)
- Support custom content metadata (audience, difficulty, modality, content_type)
- Generate JSON-LD structured data (schema.org Article/TechArticle)
- Organize metadata with HTML comments for readability
- Enhanced page titles: 'Page: Section - Site | NVIDIA'
- Template override for clean title rendering
- Suppress warnings for generated pages (genindex, search, etc.)
- Add frontmatter to homepage (index.md)
- Fix invalid Jinja2 meta tag in search.html template

…l sharing

- Remove duplicate metatags rendering (parent theme already renders it)
- Use enhanced structured title format for og:title and twitter:title
- All titles now consistent: 'Page: Section - Site | NVIDIA'
- Improves social sharing context on Facebook, Twitter, LinkedIn
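As an illustration of the 'Page: Section - Site | NVIDIA' format, a small sketch follows. The function names and the fallback for pages without a section are assumptions made for this example, not the extension's actual API.

```python
def build_enhanced_title(page: str, section: str, site: str, brand: str = "NVIDIA") -> str:
    """Format a title as 'Page: Section - Site | Brand' (hypothetical helper)."""
    # Assumption: pages without a section drop the ': Section' part.
    head = f"{page}: {section}" if section else page
    return f"{head} - {site} | {brand}"


def social_title_tags(title: str) -> dict:
    """Reuse one title for both social tags so they stay consistent."""
    return {"og:title": title, "twitter:title": title}


title = build_enhanced_title("Common Crawl", "Download Data", "NeMo-Curator")
print(title)  # Common Crawl: Download Data - NeMo-Curator | NVIDIA
```

Deriving `og:title` and `twitter:title` from the same string is what keeps the `<title>` and the social previews identical, as in the "After" output.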

Signed-off-by: Lawrence Lane <[email protected]>

copy-pr-bot · 2025-10-15T18:31:01Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Copilot

Pull Request Overview

This PR adds a new Sphinx extension (rich_metadata) that programmatically enriches HTML meta tags across all documentation pages by extracting metadata from frontmatter. The extension generates SEO-optimized tags including Open Graph, Twitter Cards, JSON-LD structured data, and custom content metadata.

Key Changes:

- Implements a comprehensive SEO metadata injection system that reads YAML frontmatter from documentation pages
- Adds frontmatter to the documentation home page (docs/index.md) with metadata fields for description, tags, personas, difficulty, content type, and modality
- Creates a verification utility to validate metadata injection in built HTML files

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.


| File | Description |
| --- | --- |
| `docs/index.md` | Adds frontmatter with comprehensive metadata (description, tags, personas, difficulty, content type, modality) to enable SEO features on the home page |
| `docs/conf.py` | Enables the new `rich_metadata` extension in Sphinx configuration |
| `docs/_extensions/search_assets/templates/search.html` | Reformats HTML markup (code style changes only, no functional changes) |
| `docs/_extensions/rich_metadata/verify_metadata.py` | Implements verification script to validate metadata injection in built HTML files |
| `docs/_extensions/rich_metadata/templates/layout.html` | Provides Jinja2 template override to inject enhanced page titles and metadata into HTML head |
| `docs/_extensions/rich_metadata/__init__.py` | Core extension implementation that extracts frontmatter, builds meta tags (standard, Open Graph, Twitter, custom), generates JSON-LD structured data, and injects into page context |
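The kind of check a verification script can run against built HTML might be sketched as follows, using only the standard library. `MetaCollector`, `find_missing_metadata`, and the `REQUIRED` list are hypothetical and are not taken from `verify_metadata.py`.

```python
from html.parser import HTMLParser

# Assumption: a page counts as enriched if these tags and a JSON-LD block exist.
REQUIRED = ("description", "og:title", "twitter:card")


class MetaCollector(HTMLParser):
    """Collect meta tags and count JSON-LD script blocks."""

    def __init__(self) -> None:
        super().__init__()
        self.meta = {}
        self.json_ld_blocks = 0

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta":
            # Standard tags use name=, Open Graph tags use property=.
            key = attr.get("name") or attr.get("property")
            if key and attr.get("content") is not None:
                self.meta[key] = attr["content"]
        elif tag == "script" and attr.get("type") == "application/ld+json":
            self.json_ld_blocks += 1


def find_missing_metadata(html: str, required=REQUIRED) -> list:
    """Return the required items that are absent from the HTML."""
    collector = MetaCollector()
    collector.feed(html)
    missing = [key for key in required if key not in collector.meta]
    if collector.json_ld_blocks == 0:
        missing.append("json-ld")
    return missing
```

Running this over each file in the build output and failing on a non-empty result would give a simple CI gate.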

docs/_extensions/rich_metadata/__init__.py

Co-authored-by: Copilot <[email protected]> Signed-off-by: L.B. <[email protected]>

…elper functions _add_basic_fields, _add_opengraph_fields, _add_twitter_fields, _add_custom_fields

- Replace aliased errors with OSError in frontmatter extraction
- Refactor _add_custom_fields into smaller functions to reduce complexity
- Remove unnecessary variable assignment before return in build_meta_tags
- Break down verify_html_file into helper functions (_display_meta_tags, _display_json_ld, _display_no_metadata_help)
- Add return type annotation to main() function
- Remove trailing whitespace from blank lines throughout verify_metadata.py
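To make the helper-function split concrete, here is a rough sketch of how `build_meta_tags` could delegate to the four helpers named in the commits. The function names come from the commit messages; their signatures, the frontmatter keys, and `render_meta_tags` are assumptions for illustration only.

```python
from html import escape


def _add_basic_fields(tags: dict, fm: dict) -> None:
    if fm.get("description"):
        tags["description"] = fm["description"]
    if fm.get("tags"):
        tags["keywords"] = ", ".join(fm["tags"])


def _add_opengraph_fields(tags: dict, fm: dict, title: str) -> None:
    tags["og:title"] = title
    tags["og:type"] = "article"
    if fm.get("description"):
        tags["og:description"] = fm["description"]


def _add_twitter_fields(tags: dict, fm: dict, title: str) -> None:
    tags["twitter:title"] = title
    tags["twitter:card"] = "summary"
    if fm.get("description"):
        tags["twitter:description"] = fm["description"]


def _add_custom_fields(tags: dict, fm: dict) -> None:
    if fm.get("personas"):
        tags["audience"] = ", ".join(fm["personas"])
    for key in ("difficulty", "modality"):
        if fm.get(key):
            tags[key] = fm[key]
    if fm.get("content_type"):
        tags["content-type-category"] = fm["content_type"]


def build_meta_tags(fm: dict, title: str) -> dict:
    """Each helper owns one tag family, which keeps this function short."""
    tags = {}
    _add_basic_fields(tags, fm)
    _add_opengraph_fields(tags, fm, title)
    _add_twitter_fields(tags, fm, title)
    _add_custom_fields(tags, fm)
    return tags


def render_meta_tags(tags: dict) -> str:
    """Render tags as HTML; Open Graph keys use property=, the rest use name=."""
    lines = []
    for key, value in tags.items():
        attr = "property" if key.startswith("og:") else "name"
        lines.append(f'<meta {attr}="{escape(key)}" content="{escape(value)}">')
    return "\n".join(lines)
```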

sarahyurick

Cool concept, PR looks nice to me from a high level overview. Happy to help unblock.

sarahyurick · 2025-10-16T21:13:28Z

docs/_extensions/rich_metadata/__init__.py

+    structured_data["publisher"] = {
+        "@type": "Organization",
+        "name": "NVIDIA Corporation",
+        "url": "https://www.nvidia.com",


This could link to https://www.nvidia.com/en-us/ai-data-science/products/nemo/ or something more NVIDIA NeMo specific, but ultimately I don't have a strong preference.

lbliii · 2025-10-17

This is great feedback. I'll look into whether this can be set from conf.py in the next version so it isn't hard-coded in the extension.
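One possible shape for that follow-up, sketched under assumptions: the option name `rich_metadata_publisher` is hypothetical and does not exist in this PR, while `app.add_config_value(name, default, rebuild)` is the standard Sphinx call for registering a conf.py option.

```python
from typing import Any

DEFAULT_PUBLISHER = {"name": "NVIDIA Corporation", "url": "https://www.nvidia.com"}


def register_publisher_option(app: Any) -> None:
    """Register a conf.py option (hypothetical name) with the current values as defaults."""
    app.add_config_value("rich_metadata_publisher", DEFAULT_PUBLISHER, "html")


def build_publisher(config: Any) -> dict:
    """Merge conf.py overrides over the defaults and return the JSON-LD publisher."""
    overrides = getattr(config, "rich_metadata_publisher", None) or {}
    merged = {**DEFAULT_PUBLISHER, **overrides}
    return {"@type": "Organization", "name": merged["name"], "url": merged["url"]}
```

With something like this in place, a project could point the publisher at a product page by setting `rich_metadata_publisher = {"url": "https://www.nvidia.com/en-us/ai-data-science/products/nemo/"}` in conf.py.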

* docs(extensions): add rich metadata SEO extension with frontmatter integration

  Add new Sphinx extension that injects SEO-optimized metadata into HTML head from frontmatter:

  Core Features:
  - Extract frontmatter (description, tags, personas, difficulty, content_type, modality)
  - Generate standard meta tags (description, keywords, audience)
  - Generate Open Graph tags for social sharing (Facebook, LinkedIn)
  - Generate Twitter Card tags for enhanced previews
  - Generate JSON-LD structured data (schema.org) for search engines
  - Support product versioning via cascade.product fields

  Components:
  - rich_metadata/__init__.py: Main extension with config-inited and html-page-context hooks
  - rich_metadata/templates/layout.html: Template override for metadata injection
  - rich_metadata/README.md: Technical overview and features
  - rich_metadata/USAGE.md: Complete usage guide with examples
  - rich_metadata/SUMMARY.md: Quick reference guide
  - rich_metadata/IMPLEMENTATION.md: Architecture and implementation details
  - rich_metadata/verify_metadata.py: Automated verification script

  Template handling follows search_assets pattern - templates live within extension folder and are automatically added to Sphinx template search path via config-inited hook. Extension enabled in conf.py and ready for use with existing frontmatter.

* docs(rich_metadata): add SEO metadata extension with organized output and enhanced titles

  - Extract frontmatter from markdown files and inject SEO metadata
  - Support standard meta tags (description, keywords)
  - Support Open Graph tags (og:description, og:title, og:type, og:url)
  - Support Twitter Card tags (twitter:description, twitter:title, twitter:card)
  - Support custom content metadata (audience, difficulty, modality, content_type)
  - Generate JSON-LD structured data (schema.org Article/TechArticle)
  - Organize metadata with HTML comments for readability
  - Enhanced page titles: 'Page: Section - Site | NVIDIA'
  - Template override for clean title rendering
  - Suppress warnings for generated pages (genindex, search, etc.)
  - Add frontmatter to homepage (index.md)
  - Fix invalid Jinja2 meta tag in search.html template

* docs(rich_metadata): fix duplicates and use enhanced titles for social sharing

  - Remove duplicate metatags rendering (parent theme already renders it)
  - Use enhanced structured title format for og:title and twitter:title
  - All titles now consistent: 'Page: Section - Site | NVIDIA'
  - Improves social sharing context on Facebook, Twitter, LinkedIn

* rich metadata

  Signed-off-by: Lawrence Lane <[email protected]>

* docs: rich metadata for seo

  Signed-off-by: Lawrence Lane <[email protected]>

* Update docs/_extensions/rich_metadata/__init__.py

  Co-authored-by: Copilot <[email protected]>
  Signed-off-by: L.B. <[email protected]>

* docs(extensions): refactor build_meta_tags to reduce complexity via helper functions _add_basic_fields, _add_opengraph_fields, _add_twitter_fields, _add_custom_fields

* docs(rich_metadata): fix linter errors in extension code

  - Replace aliased errors with OSError in frontmatter extraction
  - Refactor _add_custom_fields into smaller functions to reduce complexity
  - Remove unnecessary variable assignment before return in build_meta_tags
  - Break down verify_html_file into helper functions (_display_meta_tags, _display_json_ld, _display_no_metadata_help)
  - Add return type annotation to main() function
  - Remove trailing whitespace from blank lines throughout verify_metadata.py

* docs(rich_metadata): fix quote style to use double quotes per Ruff Q000

---------

Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: L.B. <[email protected]>
Co-authored-by: Copilot <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>


lbliii added 5 commits October 15, 2025 12:57

rich metadata

d97accf

Signed-off-by: Lawrence Lane <[email protected]>

docs: rich metadata for seo

25c27a9

Signed-off-by: Lawrence Lane <[email protected]>

lbliii requested a review from Copilot October 15, 2025 18:30

lbliii self-assigned this Oct 15, 2025

Copilot AI reviewed Oct 15, 2025

lbliii requested review from arhamm1 and sarahyurick October 15, 2025 18:32

lbliii and others added 5 commits October 15, 2025 14:35

Update docs/_extensions/rich_metadata/__init__.py

8eed4f3

Co-authored-by: Copilot <[email protected]> Signed-off-by: L.B. <[email protected]>

docs(extensions): refactor build_meta_tags to reduce complexity via h…

f7341cb

…elper functions _add_basic_fields, _add_opengraph_fields, _add_twitter_fields, _add_custom_fields

Merge branch 'main' into llane/rich-metadata

333808a

docs(rich_metadata): fix quote style to use double quotes per Ruff Q000

2ebe29f

lbliii marked this pull request as ready for review October 15, 2025 19:30

sarahyurick approved these changes Oct 16, 2025

Merge branch 'main' into llane/rich-metadata

4d420e1

lbliii merged commit 4a3e76b into NVIDIA-NeMo:main Oct 17, 2025
10 of 11 checks passed
