Fix asset resolution when baseUrl points to a different origin #558

@ashleythedeveloper

Description

When baseUrl is set, the intention is to:

  • keep crawling and asset loading tied to the origin of initialDocURLs (e.g. a local/static build or staging site), while
  • rewriting only the hyperlinks in the final PDF so they point at the canonical production URL (baseUrl).

At the moment, this doesn’t quite happen.

When baseUrl is provided and its origin differs from the origin of initialDocURLs, the final PDF generation attempts to load images and other assets from baseUrl instead of from the crawl source.

This is because concatHtml() inserts:

<base href="https://example.com" />

The browser applies the <base> tag to all relative URLs, not just hyperlinks. As a result, during the final render pass Puppeteer resolves every relative asset URL (images, stylesheets, scripts, etc.) against baseUrl.

Example

Suppose:

  • You are crawling docs from a local Docusaurus build:
    initialDocURLs = ["http://localhost:3000/docs/intro"]
  • Your canonical production site is:
    baseUrl = "https://docs.example.com"

Your HTML contains relative asset paths:

<img src="/img/logo.png" />
<link rel="stylesheet" href="/assets/styles.css" />
<a href="/guide">Read more</a>

Because concatHtml() adds:

<base href="https://docs.example.com" />

the browser now resolves every relative URL against baseUrl:

| Original | Resolved under <base> | Intended |
| --- | --- | --- |
| /img/logo.png | https://docs.example.com/img/logo.png | http://localhost:3000/img/logo.png |
| /assets/styles.css | https://docs.example.com/assets/styles.css | http://localhost:3000/assets/styles.css |
| /guide (link) | https://docs.example.com/guide | This one should use baseUrl |
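
The resolution shown above can be reproduced outside the browser with the standard URL constructor, which follows the same rule for a root-relative path and a base. This is only an illustrative sketch: resolveUnderBase is a hypothetical helper, not something that exists in the project.

```typescript
// Hypothetical helper: resolve a relative reference against a base URL,
// as a browser does when the document contains <base href="...">.
function resolveUnderBase(relative: string, baseHref: string): string {
  return new URL(relative, baseHref).href;
}

const canonicalBase = "https://docs.example.com"; // baseUrl from the example
const crawlOrigin = "http://localhost:3000";      // origin of initialDocURLs from the example

for (const path of ["/img/logo.png", "/assets/styles.css", "/guide"]) {
  console.log(
    path,
    "| under <base>:", resolveUnderBase(path, canonicalBase),
    "| under crawl origin:", resolveUnderBase(path, crawlOrigin),
  );
}
```

Only the /guide hyperlink is meant to end up under the canonical base; the two asset paths are the ones that need the crawl origin.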

This causes asset failures whenever:

  • content must be loaded from a local or static build,
  • the preview/staging environment differs from the canonical URL, or
  • the canonical hostname is not resolvable from the execution environment.

Expected behaviour

  • Assets (images, CSS, JS) should load from the crawl origin:
    http://localhost:3000/...
  • Hyperlinks inside the PDF should still use the canonical baseUrl:
    https://docs.example.com/...

Proposed fix

In generatePDF(), extend the existing request interception logic:

  1. Let Puppeteer build the request URL (which may incorrectly resolve under baseUrl).
  2. Check whether the origin of that resolved URL matches the baseUrl origin.
  3. If so, replace the origin with that of initialDocURLs[0], leaving the path and query unchanged.

Example rewrite:

From: https://docs.example.com/img/logo.png  
To:   http://localhost:3000/img/logo.png

This keeps asset loading tied to the crawl source while preserving canonical hyperlink rewriting.
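
A minimal sketch of the three steps, assuming the interception handler receives an object with url() and continue({ url }) methods as Puppeteer's HTTPRequest does. The names InterceptedRequest, rewriteToCrawlOrigin and handleRequest are hypothetical, and the existing interception logic in generatePDF() is not reproduced here.

```typescript
// Minimal structural type for the two request methods used below.
// It mirrors the shape of Puppeteer's HTTPRequest so the sketch runs
// without Puppeteer installed.
interface InterceptedRequest {
  url(): string;
  continue(overrides?: { url?: string }): Promise<void>;
}

// Hypothetical helper: if requestUrl sits under baseUrl's origin, move it
// to the crawl origin while keeping path, query and fragment.
function rewriteToCrawlOrigin(requestUrl: string, baseUrl: string, crawlUrl: string): string {
  let target: URL;
  try {
    target = new URL(requestUrl);
  } catch {
    return requestUrl; // not a parseable absolute URL: leave untouched
  }
  const canonicalOrigin = new URL(baseUrl).origin;
  const crawlOrigin = new URL(crawlUrl).origin;
  // Compare parsed origins rather than string prefixes, and do nothing
  // when both origins are already the same.
  if (target.origin !== canonicalOrigin || canonicalOrigin === crawlOrigin) {
    return requestUrl;
  }
  return crawlOrigin + target.pathname + target.search + target.hash;
}

// Sketch of a handler that could be registered for intercepted requests.
async function handleRequest(req: InterceptedRequest, baseUrl: string, crawlUrl: string): Promise<void> {
  const original = req.url();
  const rewritten = rewriteToCrawlOrigin(original, baseUrl, crawlUrl);
  if (rewritten !== original) {
    await req.continue({ url: rewritten });
  } else {
    await req.continue();
  }
}
```

With Puppeteer this would presumably be wired up after page.setRequestInterception(true), inside the page.on('request', ...) handler, passing baseUrl and initialDocURLs[0]. How it should be merged with the handler that generatePDF() already registers is left open.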

Backward compatibility

The proposed change is backward compatible in all cases where:

  • baseUrl is not provided, or
  • baseUrl shares the same origin as initialDocURLs.
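
Both cases reduce to a single origin comparison, which could also serve as a guard so the rewrite is never attempted when it is not needed. The sketch below is an assumption about how such a guard might look; needsOriginRewrite is a hypothetical name.

```typescript
// Hypothetical guard: true only when baseUrl is set and its origin differs
// from the origin of the first crawl URL.
function needsOriginRewrite(baseUrl: string | undefined, initialDocURLs: string[]): boolean {
  const [firstDocUrl] = initialDocURLs;
  if (!baseUrl || !firstDocUrl) {
    return false; // nothing to compare: keep current behaviour
  }
  return new URL(baseUrl).origin !== new URL(firstDocUrl).origin;
}

const docs = ["http://localhost:3000/docs/intro"];
console.log(needsOriginRewrite(undefined, docs));                  // false: baseUrl not provided
console.log(needsOriginRewrite("http://localhost:3000", docs));    // false: same origin
console.log(needsOriginRewrite("https://docs.example.com", docs)); // true: origins differ
```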
