Complex Website Download Examples

Below are a few examples of complex sites, and the techniques required to download them successfully.

Needs Update: The following examples were written for Crystal 1.x and target live websites rather than archived versions of them. Some steps are likely outdated. If you encounter issues, please file a bug.

To download a website with dynamically generated links (ex: The Pragmatic Engineer):

Open Crystal and create a new project, call it "Pragmatic Engineer".
Press the "New Root URL" button and add the https://newsletter.pragmaticengineer.com/ URL, named "Home".
Select the added "Home" and press the "Download" button. Wait for it to finish downloading.
With "Home" still selected, press the "View" button. A web browser should open and display the downloaded home page.
While browsing a downloaded site from a web browser, Crystal's server will log information about requests it receives from the web browser. For example:
- "GET /_/https/newsletter.pragmaticengineer.com/ HTTP/1.1" 200 -
  - This line says the web browser tried to fetch the https://newsletter.pragmaticengineer.com/ URL from Crystal.
Notice in the server log that many red lines appeared saying "Requested resource not in archive".
- Because the browser fetched these immediately while loading the page, they must be resources "embedded" in the page. When Crystal downloads a page, it also downloads every embedded resource it can find statically. These resources were instead fetched dynamically by JavaScript code running on the page, which Crystal cannot see.
We want to eliminate those red lines that appear when viewing the home page.
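
To make the server log easier to read, the request path can be mapped back to the original URL. The following is a minimal Python sketch, not part of Crystal. It assumes that every archived URL is served under a "/_/<scheme>/<rest>" path, which is inferred only from the single log line shown above. The function name original_url_from_log_line is hypothetical.

```python
from typing import Optional

# Assumption: archived URLs are served under "/_/<scheme>/<host and path>",
# as suggested by the one example log line in this page.
ARCHIVE_PREFIX = "/_/"


def original_url_from_log_line(line: str) -> Optional[str]:
    """Recover the original URL from an access-log line, or None if it does not fit."""
    parts = line.split('"')
    if len(parts) < 3:
        return None  # no quoted request section
    request = parts[1].split()  # e.g. ['GET', '/_/https/host/', 'HTTP/1.1']
    if len(request) != 3 or not request[1].startswith(ARCHIVE_PREFIX):
        return None
    scheme, sep, rest = request[1][len(ARCHIVE_PREFIX):].partition("/")
    if not scheme or not sep:
        return None
    return f"{scheme}://{rest}"


line = '"GET /_/https/newsletter.pragmaticengineer.com/ HTTP/1.1" 200 -'
print(original_url_from_log_line(line))
```

For the log line above, this prints the same URL that was added as the "Home" root URL.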

Eliminate red lines:

Let's start by eliminating the "Requested resource not in archive" red lines related to URLs like https://substackcdn.com/bundle/assets/entry-f6e60c95.js
Press the "New Group" button and add https://substackcdn.com/**, named "Substack CDN Asset".
Reload the home page in the web browser.
Notice in the server log that many green lines appeared saying "*** Dynamically downloading existing resource in group 'Substack CDN Asset':" and that all red lines related to https://substackcdn.com/** are gone.
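
To see why a single group covers every Substack CDN URL, it can help to sketch how a pattern ending in "**" might be matched. The Python below is only an illustration, not Crystal's implementation. It assumes that "**" matches any run of characters, including "/", and that every other character is matched literally. The helper names are hypothetical.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    # Assumption: "**" matches any run of characters, including "/".
    # All other characters (such as "?" and ".") are treated literally.
    literal_parts = [re.escape(part) for part in pattern.split("**")]
    return re.compile(".*".join(literal_parts))


def matches_group(pattern: str, url: str) -> bool:
    """Return True if the whole URL matches the group pattern."""
    return pattern_to_regex(pattern).fullmatch(url) is not None


group = "https://substackcdn.com/**"
print(matches_group(group, "https://substackcdn.com/bundle/assets/entry-f6e60c95.js"))
print(matches_group(group, "https://newsletter.pragmaticengineer.com/"))
```

Under these assumptions the first URL falls inside the "Substack CDN Asset" group and the second does not, which is why reloading the page downloads only the CDN assets dynamically.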

Eliminate the last red lines:

There should be only a few red lines left:
- *** Requested resource not in archive: https://newsletter.pragmaticengineer.com/api/v1/firehose?...
- *** Requested resource not in archive: https://newsletter.pragmaticengineer.com/api/v1/archive?...
- *** Requested resource not in archive: https://newsletter.pragmaticengineer.com/api/v1/homepage_links
- *** Requested resource not in archive: https://newsletter.pragmaticengineer.com/api/v1/recommendations/...
- *** Requested resource not in archive: https://newsletter.pragmaticengineer.com/api/v1/homepage_data
- *** Requested resource not in archive: https://newsletter.pragmaticengineer.com/service-worker.js
Eliminate these red lines by creating:
- a group https://newsletter.pragmaticengineer.com/api/v1/firehose?**, named "Firehose API"
- a group https://newsletter.pragmaticengineer.com/api/v1/archive?**, named "Archive API"
- a root URL https://newsletter.pragmaticengineer.com/api/v1/homepage_links, named "Homepage Links"
- a group https://newsletter.pragmaticengineer.com/api/v1/recommendations/**, named "Recommendations API"
- a root URL https://newsletter.pragmaticengineer.com/api/v1/homepage_data, named "Homepage Data"
- a root URL https://newsletter.pragmaticengineer.com/service-worker.js, named "Service Worker"
Reload the home page in the web browser.
There should be no red lines left.
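
The choice between a group and a root URL above follows a simple rule: a URL with a variable tail becomes a group ending in "**", and a fixed URL becomes a root URL. A rough Python sketch of that rule follows. It is not a Crystal feature, the function name suggest_entity is hypothetical, and it assumes the log lines are abbreviated with a trailing "..." exactly as in the list above.

```python
from typing import Optional, Tuple

PREFIX = "*** Requested resource not in archive: "
ELLIPSIS = "..."  # assumption: marks the variable tail of a URL, as in the list above


def suggest_entity(log_line: str) -> Optional[Tuple[str, str]]:
    """Return ("group", pattern) or ("root URL", url) for one red log line."""
    if not log_line.startswith(PREFIX):
        return None  # not a "not in archive" line
    url = log_line[len(PREFIX):].strip()
    if url.endswith(ELLIPSIS):
        # Variable tail: cover every variant with a "**" wildcard.
        return ("group", url[: -len(ELLIPSIS)] + "**")
    # Fixed URL: a single root URL is enough.
    return ("root URL", url)


red_lines = [
    "*** Requested resource not in archive: https://newsletter.pragmaticengineer.com/api/v1/firehose?...",
    "*** Requested resource not in archive: https://newsletter.pragmaticengineer.com/api/v1/homepage_links",
]
for red_line in red_lines:
    print(suggest_entity(red_line))
```

In a real server log the full URL appears instead of "...", so deciding where the variable tail begins still requires comparing several requests by eye.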

Final testing:

If you click the "Let me read it first" link at the bottom of the page, a list of article links should appear.
Congratulations! You've fully downloaded the page! 🎉

To download a website that requires login (ex: The Pragmatic Engineer):

Using a browser like Chrome, log in to the website you want to download.
Right-click anywhere on the page and choose Inspect to open the Chrome Developer Tools.
Switch to the Network pane and enable the Doc filter.
Reload the page by pressing the ⟳ button.
Select the page's URL in the Network pane.
Scroll down to see the "Request Headers" section and look for a "cookie" request header.
Copy the value of the "cookie" request header to a text file for safekeeping.
Open Crystal, either creating a new project or opening an existing project.
Click the "Preferences..." / "Settings..." button, paste the cookie value in the text box, and click "OK".
- This cookie value is remembered only while the project remains open. If you reopen Crystal later, you'll need to paste the cookie value in again.
Now download pages using Crystal as you would normally. The specified cookie header value (which logs you in to the remote server) will be used as you download pages.
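
For readers unfamiliar with cookie-based login, the sketch below shows what it means to attach a copied "cookie" header value to a request. It uses Python's standard urllib.request module and is not Crystal's own code. The function name and the cookie value "session=EXAMPLE" are placeholders; substitute the value copied from the Chrome Developer Tools.

```python
import urllib.request


def build_logged_in_request(url: str, cookie_header_value: str) -> urllib.request.Request:
    """Build (but do not send) a request that carries the copied cookie header."""
    return urllib.request.Request(url, headers={"Cookie": cookie_header_value})


# Placeholder cookie value; a real one comes from the browser's "cookie" request header.
request = build_logged_in_request(
    "https://newsletter.pragmaticengineer.com/",
    "session=EXAMPLE",
)
print(request.get_header("Cookie"))
```

Passing the request to urllib.request.urlopen would send it to the server with the cookie attached. Treat the cookie value like a password, since anyone holding it can act as the logged-in user.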
