Description
When baseUrl is set, the intention is to:
- keep crawling and asset loading tied to the origin of initialDocURLs (e.g. a local/static build or staging site), while
- rewriting only the hyperlinks in the final PDF so they point at the canonical production URL (baseUrl).
At the moment, this doesn’t quite happen.
When baseUrl is provided and its origin differs from the origin of initialDocURLs, the final PDF generation attempts to load images and other assets from baseUrl instead of from the crawl source.
This is because concatHtml() inserts:
<base href="https://example.com" />
The browser applies the <base> tag to all relative URLs, not just hyperlinks. As a result, during the final render pass Puppeteer resolves every relative asset URL (images, stylesheets, scripts, etc.) against baseUrl.
Example
Suppose:
- You are crawling docs from a local Docusaurus build:
  initialDocURLs = ["http://localhost:3000/docs/intro"]
- Your canonical production site is:
  baseUrl = "https://docs.example.com"
Your HTML contains relative asset paths:
<img src="/img/logo.png" />
<link rel="stylesheet" href="/assets/styles.css" />
<a href="/guide">Read more</a>
Because concatHtml() adds:
<base href="https://docs.example.com" />
the browser now rewrites all relative URLs as:
| Original | Resolved under <base> | Intended |
|---|---|---|
| /img/logo.png | https://docs.example.com/img/logo.png | http://localhost:3000/img/logo.png |
| /assets/styles.css | https://docs.example.com/assets/styles.css | http://localhost:3000/assets/styles.css |
| link /guide | https://docs.example.com/guide | This one should use baseUrl |
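The resolution shown in the table can be reproduced outside the browser. The sketch below uses the standard WHATWG URL constructor, which resolves a relative reference against a base in the same way the browser does once a <base> tag is present. The variable names are illustrative and the values are taken from the example above; this is not code from the project.

```typescript
// Values from the example above; the variable names are illustrative.
const crawlOrigin = "http://localhost:3000";
const canonicalBase = "https://docs.example.com";

const relativePaths: string[] = ["/img/logo.png", "/assets/styles.css", "/guide"];

// What the browser computes once <base href="https://docs.example.com" /> is present.
const resolvedUnderBase: string[] = relativePaths.map(
  (path) => new URL(path, canonicalBase).href,
);

// What the assets actually need: resolution against the crawl origin.
const resolvedUnderCrawl: string[] = relativePaths.map(
  (path) => new URL(path, crawlOrigin).href,
);

console.log(resolvedUnderBase);
console.log(resolvedUnderCrawl);
```

Only the last entry (/guide, a hyperlink) is meant to end up under the canonical base; the first two are assets and need the crawl origin.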
This causes asset failures whenever:
- content must be loaded from a local or static build,
- the preview/staging environment differs from the canonical URL, or
- the canonical hostname is not resolvable from the execution environment.
Expected behaviour
- Assets (images, CSS, JS) should load from the crawl origin:
  http://localhost:3000/...
- Hyperlinks inside the PDF should still use the canonical baseUrl:
  https://docs.example.com/...
Proposed fix
In generatePDF(), extend the existing request interception logic:
- Let Puppeteer build the request URL (which may incorrectly resolve under baseUrl).
- Check if that resolved URL starts with the baseUrl origin.
- If so, rewrite the origin to match initialDocURLs[0].
Example rewrite:
From: https://docs.example.com/img/logo.png
To: http://localhost:3000/img/logo.png
This keeps asset loading tied to the crawl source while preserving canonical hyperlink rewriting.
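A minimal sketch of the rewrite step, assuming it lives in a standalone helper. The name rewriteToCrawlOrigin and its signature are hypothetical and not part of the existing code. It compares parsed origins rather than doing a string prefix check, so a host such as docs.example.com.other.net is not matched by accident.

```typescript
/**
 * Hypothetical helper: map a request URL that was resolved under baseUrl
 * back to the origin of the crawl source. Returns the input unchanged
 * whenever no rewrite applies.
 */
function rewriteToCrawlOrigin(
  requestUrl: string,
  baseUrl: string | undefined,
  initialDocUrl: string,
): string {
  // No baseUrl configured: no <base> tag was injected, nothing to undo.
  if (!baseUrl) return requestUrl;

  let request: URL;
  try {
    request = new URL(requestUrl);
  } catch {
    return requestUrl; // not a parseable absolute URL, leave it alone
  }

  const canonicalOrigin = new URL(baseUrl).origin;
  const crawlOrigin = new URL(initialDocUrl).origin;

  // Same origin on both sides: the <base> tag changes nothing for assets.
  if (canonicalOrigin === crawlOrigin) return requestUrl;

  // Third-party requests (CDNs, fonts, data: URLs) are not touched.
  if (request.origin !== canonicalOrigin) return requestUrl;

  // Swap the origin only; keep path, query string and fragment.
  return crawlOrigin + request.pathname + request.search + request.hash;
}

console.log(
  rewriteToCrawlOrigin(
    "https://docs.example.com/img/logo.png",
    "https://docs.example.com",
    "http://localhost:3000/docs/intro",
  ),
);
```

Inside the existing interception handler, the returned string could be passed as the url override to Puppeteer's request.continue(). How that fits with the current handler in generatePDF() is an assumption, since the handler itself is not shown here.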
Backward compatibility
The change proposed is backward compatible for all cases where:
- baseUrl is not provided, or
- baseUrl shares the same origin as initialDocURLs.