@lbliii lbliii commented Oct 15, 2025

Programmatically extends and enriches the meta tags across all docs pages by reading the frontmatter we added during the docs refactor. This change arguably makes NeMo Curator the first truly SEO-optimized Sphinx-based docs site at NVIDIA that I know of. It also supports other cross-cutting docs initiatives by providing an early working example of what should be "baked in" to default Sphinx builds for docs.

Before

<head class="at-element-marker">
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0"><meta name="viewport" content="width=device-width, initial-scale=1">

    <title>Common Crawl — NeMo-Curator</title>
  
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <meta name="docsearch:language" content="en">
    <meta name="docsearch:version" content="">

After

<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">

    <!-- SEO Meta Tags -->
    <meta name="description" content="Download and extract text from Common Crawl web archives using Curator.">
    <meta name="keywords"
        content="common-crawl, web-data, warc, language-detection, distributed, html-extraction, pipeline">

    <!-- Open Graph / Facebook -->
    <meta property="og:description" content="Download and extract text from Common Crawl web archives using Curator.">
    <meta property="og:type" content="article">
    <meta property="og:title" content="Common Crawl: Download Data - NeMo-Curator | NVIDIA">
    <meta property="og:url" content="None">

    <!-- Twitter -->
    <meta name="twitter:description" content="Download and extract text from Common Crawl web archives using Curator.">
    <meta name="twitter:title" content="Common Crawl: Download Data - NeMo-Curator | NVIDIA">
    <meta name="twitter:card" content="summary">

    <!-- Content Metadata -->
    <meta name="audience" content="Data Scientists, Machine Learning Engineers">
    <meta name="content-type-category" content="how-to">
    <meta name="difficulty" content="intermediate">
    <meta name="modality" content="text-only">

    <!-- Structured Data (JSON-LD) -->
    <script type="text/javascript" async=""
        src="https://cdn.bizible.com/xdc.js?_biz_u=757ee508329f464dcb7f443545173063&amp;_biz_h=1579209556&amp;cdn_o=a&amp;jsVer=4.25.10.02&amp;a=nvidia.com"></script>
    <script type="application/ld+json">
    {
    "@context": "https://schema.org/",
    "@type": "TechArticle",
    "headline": "Common Crawl",
    "name": "Common Crawl",
    "description": "Download and extract text from Common Crawl web archives using Curator.",
    "keywords": [
        "common-crawl",
        "web-data",
        "warc",
        "language-detection",
        "distributed",
        "html-extraction",
        "pipeline"
    ],
    "proficiencyLevel": "Intermediate",
    "audience": {
        "@type": "Audience",
        "audienceType": [
        "Data Scientists",
        "Machine Learning Engineers"
        ]
    },
    "url": null,
    "publisher": {
        "@type": "Organization",
        "name": "NVIDIA Corporation",
        "url": "https://www.nvidia.com/"
    }
    }
    </script>

    <title>Common Crawl: Download Data - NeMo-Curator | NVIDIA</title>
    ...

…tegration

Add new Sphinx extension that injects SEO-optimized metadata into HTML head from frontmatter:

Core Features:
- Extract frontmatter (description, tags, personas, difficulty, content_type, modality)
- Generate standard meta tags (description, keywords, audience)
- Generate Open Graph tags for social sharing (Facebook, LinkedIn)
- Generate Twitter Card tags for enhanced previews
- Generate JSON-LD structured data (schema.org) for search engines
- Support product versioning via cascade.product fields

Components:
- rich_metadata/__init__.py: Main extension with config-inited and html-page-context hooks
- rich_metadata/templates/layout.html: Template override for metadata injection
- rich_metadata/README.md: Technical overview and features
- rich_metadata/USAGE.md: Complete usage guide with examples
- rich_metadata/SUMMARY.md: Quick reference guide
- rich_metadata/IMPLEMENTATION.md: Architecture and implementation details
- rich_metadata/verify_metadata.py: Automated verification script

Template handling follows search_assets pattern - templates live within extension folder
and are automatically added to Sphinx template search path via config-inited hook.

Extension enabled in conf.py and ready for use with existing frontmatter.
… and enhanced titles

- Extract frontmatter from markdown files and inject SEO metadata
- Support standard meta tags (description, keywords)
- Support Open Graph tags (og:description, og:title, og:type, og:url)
- Support Twitter Card tags (twitter:description, twitter:title, twitter:card)
- Support custom content metadata (audience, difficulty, modality, content_type)
- Generate JSON-LD structured data (schema.org Article/TechArticle)
- Organize metadata with HTML comments for readability
- Enhanced page titles: 'Page: Section - Site | NVIDIA'
- Template override for clean title rendering
- Suppress warnings for generated pages (genindex, search, etc.)
- Add frontmatter to homepage (index.md)
- Fix invalid Jinja2 meta tag in search.html template
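
For the JSON-LD item in the list above, a builder might look like the sketch below. It mirrors the field names visible in the "After" sample (`headline`, `keywords`, `proficiencyLevel`, `audience`, `publisher`), but the function name `build_json_ld`, its signature, and the choice to omit `url` when none is known are assumptions of this sketch, not the extension's confirmed behavior.

```python
import json
from typing import Any, Dict, Optional


def build_json_ld(title: str, frontmatter: Dict[str, Any], url: Optional[str] = None) -> str:
    """Render a schema.org TechArticle as a JSON-LD <script> element."""
    data: Dict[str, Any] = {
        "@context": "https://schema.org/",
        "@type": "TechArticle",
        "headline": title,
        "name": title,
    }
    if frontmatter.get("description"):
        data["description"] = frontmatter["description"]
    if frontmatter.get("tags"):
        data["keywords"] = list(frontmatter["tags"])
    if frontmatter.get("difficulty"):
        data["proficiencyLevel"] = str(frontmatter["difficulty"]).capitalize()
    if frontmatter.get("personas"):
        # Assumption: personas already hold display names such as "Data Scientists".
        data["audience"] = {
            "@type": "Audience",
            "audienceType": list(frontmatter["personas"]),
        }
    if url:  # assumption of this sketch: leave the key out when no URL is known
        data["url"] = url
    data["publisher"] = {
        "@type": "Organization",
        "name": "NVIDIA Corporation",
        "url": "https://www.nvidia.com/",
    }
    # Escape "</" so a text value cannot close the <script> element early.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'
```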
…l sharing

- Remove duplicate metatags rendering (parent theme already renders it)
- Use enhanced structured title format for og:title and twitter:title
- All titles now consistent: 'Page: Section - Site | NVIDIA'
- Improves social sharing context on Facebook, Twitter, LinkedIn
Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
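
The 'Page: Section - Site | NVIDIA' title format from the commits above can be sketched as a single helper whose result is reused for `<title>`, `og:title`, and `twitter:title`. The function names and the fallback when a page has no section are assumptions for illustration only.

```python
from typing import Dict


def build_enhanced_title(page: str, site: str, section: str = "", org: str = "NVIDIA") -> str:
    """Format a title as 'Page: Section - Site | Org'."""
    # Assumption: when no section is known, the 'Page' part stands alone.
    head = f"{page}: {section}" if section else page
    return f"{head} - {site} | {org}"


def title_fields(title: str) -> Dict[str, str]:
    """Use one title string for both social title tags so they stay consistent."""
    return {"og:title": title, "twitter:title": title}


title = build_enhanced_title("Common Crawl", "NeMo-Curator", section="Download Data")
print(title)
```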
@lbliii lbliii requested a review from Copilot October 15, 2025 18:30
@lbliii lbliii self-assigned this Oct 15, 2025

copy-pr-bot bot commented Oct 15, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Copilot AI left a comment

Pull Request Overview

This PR adds a new Sphinx extension (rich_metadata) that programmatically enriches HTML meta tags across all documentation pages by extracting metadata from frontmatter. The extension generates SEO-optimized tags including Open Graph, Twitter Cards, JSON-LD structured data, and custom content metadata.

Key Changes:

  • Implements a comprehensive SEO metadata injection system that reads YAML frontmatter from documentation pages
  • Adds frontmatter to the documentation home page (docs/index.md) with metadata fields for description, tags, personas, difficulty, content type, and modality
  • Creates a verification utility to validate metadata injection in built HTML files

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| docs/index.md | Adds frontmatter with comprehensive metadata (description, tags, personas, difficulty, content type, modality) to enable SEO features on the home page |
| docs/conf.py | Enables the new rich_metadata extension in Sphinx configuration |
| docs/_extensions/search_assets/templates/search.html | Reformats HTML markup (code style changes only, no functional changes) |
| docs/_extensions/rich_metadata/verify_metadata.py | Implements verification script to validate metadata injection in built HTML files |
| docs/_extensions/rich_metadata/templates/layout.html | Provides Jinja2 template override to inject enhanced page titles and metadata into HTML head |
| docs/_extensions/rich_metadata/__init__.py | Core extension implementation that extracts frontmatter, builds meta tags (standard, Open Graph, Twitter, custom), generates JSON-LD structured data, and injects into page context |
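
As a rough illustration of what validating metadata in built HTML can involve, the sketch below parses a page with the standard-library `html.parser` and reports which expected entries are absent. It is not the contents of `verify_metadata.py`; the class name, the function name, and the list of required tags are assumptions.

```python
import json
from html.parser import HTMLParser
from typing import Dict, List, Sequence

# Hypothetical minimum set of tags a page is expected to carry.
REQUIRED_TAGS = ("description", "og:title", "twitter:card")


class HeadMetadataParser(HTMLParser):
    """Collect <meta> name/property values and JSON-LD script bodies."""

    def __init__(self) -> None:
        super().__init__()
        self.meta: Dict[str, str] = {}
        self.json_ld_blocks: List[str] = []
        self._in_json_ld = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta":
            key = attr.get("name") or attr.get("property")
            if key and attr.get("content") is not None:
                self.meta[key] = attr["content"]
        elif tag == "script" and attr.get("type") == "application/ld+json":
            self._in_json_ld = True
            self.json_ld_blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_json_ld = False

    def handle_data(self, data):
        if self._in_json_ld:
            self.json_ld_blocks[-1] += data


def find_missing_metadata(html_text: str, required: Sequence[str] = REQUIRED_TAGS) -> List[str]:
    """Return the names of required entries that are absent or invalid."""
    parser = HeadMetadataParser()
    parser.feed(html_text)
    parser.close()
    missing = [name for name in required if not parser.meta.get(name)]
    if not parser.json_ld_blocks:
        missing.append("json-ld")
    for block in parser.json_ld_blocks:
        try:
            json.loads(block)
        except json.JSONDecodeError:
            missing.append("json-ld (invalid JSON)")
    return missing
```

A build check could call `find_missing_metadata` on each file under the HTML output directory and fail when any list is non-empty.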

@lbliii lbliii requested review from arhamm1 and sarahyurick October 15, 2025 18:32
lbliii and others added 5 commits October 15, 2025 14:35
…elper functions _add_basic_fields, _add_opengraph_fields, _add_twitter_fields, _add_custom_fields
- Replace aliased errors with OSError in frontmatter extraction
- Refactor _add_custom_fields into smaller functions to reduce complexity
- Remove unnecessary variable assignment before return in build_meta_tags
- Break down verify_html_file into helper functions (_display_meta_tags, _display_json_ld, _display_no_metadata_help)
- Add return type annotation to main() function
- Remove trailing whitespace from blank lines throughout verify_metadata.py
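
The helper split described in these commits might be organized along the following lines. Only the helper names (`_add_basic_fields`, `_add_opengraph_fields`, `_add_twitter_fields`, `_add_custom_fields`) and `build_meta_tags` come from the commit messages; the signatures, the `_meta` formatter, the bodies, and the mapping from `content_type` to `content-type-category` are assumptions inferred from the "After" sample.

```python
from html import escape
from typing import Any, Dict, List


def _meta(attr: str, key: str, value: Any) -> str:
    """Format one meta tag, escaping the content for use inside an attribute."""
    return f'<meta {attr}="{key}" content="{escape(str(value), quote=True)}">'


def _add_basic_fields(tags: List[str], fm: Dict[str, Any]) -> None:
    if fm.get("description"):
        tags.append(_meta("name", "description", fm["description"]))
    if fm.get("tags"):
        tags.append(_meta("name", "keywords", ", ".join(fm["tags"])))


def _add_opengraph_fields(tags: List[str], fm: Dict[str, Any], title: str) -> None:
    if fm.get("description"):
        tags.append(_meta("property", "og:description", fm["description"]))
    tags.append(_meta("property", "og:type", "article"))
    tags.append(_meta("property", "og:title", title))


def _add_twitter_fields(tags: List[str], fm: Dict[str, Any], title: str) -> None:
    if fm.get("description"):
        tags.append(_meta("name", "twitter:description", fm["description"]))
    tags.append(_meta("name", "twitter:title", title))
    tags.append(_meta("name", "twitter:card", "summary"))


def _add_custom_fields(tags: List[str], fm: Dict[str, Any]) -> None:
    # Assumed mapping from frontmatter keys to the meta names in the sample output.
    names = {"content_type": "content-type-category", "difficulty": "difficulty", "modality": "modality"}
    if fm.get("personas"):
        tags.append(_meta("name", "audience", ", ".join(fm["personas"])))
    for key, meta_name in names.items():
        if fm.get(key):
            tags.append(_meta("name", meta_name, fm[key]))


def build_meta_tags(fm: Dict[str, Any], title: str) -> str:
    """Assemble all tag groups; each helper skips fields that are missing."""
    tags: List[str] = []
    _add_basic_fields(tags, fm)
    _add_opengraph_fields(tags, fm, title)
    _add_twitter_fields(tags, fm, title)
    _add_custom_fields(tags, fm)
    return "\n".join(tags)
```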
@lbliii lbliii marked this pull request as ready for review October 15, 2025 19:30

@sarahyurick sarahyurick left a comment

Cool concept, PR looks nice to me from a high level overview. Happy to help unblock.

structured_data["publisher"] = {
"@type": "Organization",
"name": "NVIDIA Corporation",
"url": "https://www.nvidia.com",

This could link to https://www.nvidia.com/en-us/ai-data-science/products/nemo/ or something more NVIDIA NeMo specific, but ultimately I don't have a strong preference.

This is great feedback. I'll look into making this settable from conf.py in the next version so it isn't hardcoded in the extension.

@lbliii lbliii merged commit 4a3e76b into NVIDIA-NeMo:main Oct 17, 2025
10 of 11 checks passed
lbliii added a commit to lbliii/NeMo-Curator that referenced this pull request Oct 22, 2025
* docs(extensions): add rich metadata SEO extension with frontmatter integration

Add new Sphinx extension that injects SEO-optimized metadata into HTML head from frontmatter:

Core Features:
- Extract frontmatter (description, tags, personas, difficulty, content_type, modality)
- Generate standard meta tags (description, keywords, audience)
- Generate Open Graph tags for social sharing (Facebook, LinkedIn)
- Generate Twitter Card tags for enhanced previews
- Generate JSON-LD structured data (schema.org) for search engines
- Support product versioning via cascade.product fields

Components:
- rich_metadata/__init__.py: Main extension with config-inited and html-page-context hooks
- rich_metadata/templates/layout.html: Template override for metadata injection
- rich_metadata/README.md: Technical overview and features
- rich_metadata/USAGE.md: Complete usage guide with examples
- rich_metadata/SUMMARY.md: Quick reference guide
- rich_metadata/IMPLEMENTATION.md: Architecture and implementation details
- rich_metadata/verify_metadata.py: Automated verification script

Template handling follows search_assets pattern - templates live within extension folder
and are automatically added to Sphinx template search path via config-inited hook.

Extension enabled in conf.py and ready for use with existing frontmatter.

* docs(rich_metadata): add SEO metadata extension with organized output and enhanced titles

- Extract frontmatter from markdown files and inject SEO metadata
- Support standard meta tags (description, keywords)
- Support Open Graph tags (og:description, og:title, og:type, og:url)
- Support Twitter Card tags (twitter:description, twitter:title, twitter:card)
- Support custom content metadata (audience, difficulty, modality, content_type)
- Generate JSON-LD structured data (schema.org Article/TechArticle)
- Organize metadata with HTML comments for readability
- Enhanced page titles: 'Page: Section - Site | NVIDIA'
- Template override for clean title rendering
- Suppress warnings for generated pages (genindex, search, etc.)
- Add frontmatter to homepage (index.md)
- Fix invalid Jinja2 meta tag in search.html template

* docs(rich_metadata): fix duplicates and use enhanced titles for social sharing

- Remove duplicate metatags rendering (parent theme already renders it)
- Use enhanced structured title format for og:title and twitter:title
- All titles now consistent: 'Page: Section - Site | NVIDIA'
- Improves social sharing context on Facebook, Twitter, LinkedIn

* rich metadata

Signed-off-by: Lawrence Lane <[email protected]>

* docs: rich metadata for seo

Signed-off-by: Lawrence Lane <[email protected]>

* Update docs/_extensions/rich_metadata/__init__.py

Co-authored-by: Copilot <[email protected]>
Signed-off-by: L.B. <[email protected]>

* docs(extensions): refactor build_meta_tags to reduce complexity via helper functions _add_basic_fields, _add_opengraph_fields, _add_twitter_fields, _add_custom_fields

* docs(rich_metadata): fix linter errors in extension code

- Replace aliased errors with OSError in frontmatter extraction
- Refactor _add_custom_fields into smaller functions to reduce complexity
- Remove unnecessary variable assignment before return in build_meta_tags
- Break down verify_html_file into helper functions (_display_meta_tags, _display_json_ld, _display_no_metadata_help)
- Add return type annotation to main() function
- Remove trailing whitespace from blank lines throughout verify_metadata.py

* docs(rich_metadata): fix quote style to use double quotes per Ruff Q000

---------

Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: L.B. <[email protected]>
Co-authored-by: Copilot <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
jnke2016 pushed a commit to jnke2016/Curator that referenced this pull request Nov 12, 2025