@simonw simonw commented Dec 28, 2025

Add a new option to the har command: -x/--extract

If specified, then after the .har or .har.zip file has been created it also creates a new directory parallel to that file with the same name (minus the extension), and then extracts all of the resources from that HAR into that folder. File names are based on the original URL (using the same conventions used elsewhere to automatically derive image file names). The extension comes from the original URL, or from the content-type of that file if the URL has no extension or if its extension doesn't match the one expected for that content-type.
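
The extension rule described above could be sketched roughly like this. This is an illustration only, not the code in this PR: `choose_extension()` and the `ACCEPTED` table are hypothetical names, and the table is a deliberately tiny subset of MIME types.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical, illustrative subset: MIME type -> acceptable extensions.
# The first extension in each tuple is treated as the preferred one.
ACCEPTED = {
    "text/html": (".html", ".htm"),
    "text/css": (".css",),
    "text/javascript": (".js", ".mjs"),
    "application/javascript": (".js", ".mjs"),
    "application/json": (".json",),
    "image/png": (".png",),
    "image/jpeg": (".jpg", ".jpeg"),
    "image/svg+xml": (".svg",),
}


def choose_extension(url, content_type):
    """Pick an extension from the URL, falling back to the content-type."""
    # Extension found in the URL path, or "" if the path has none.
    url_ext = posixpath.splitext(urlparse(url).path)[1].lower()
    # Drop parameters such as "; charset=utf-8" before the lookup.
    mime = (content_type or "").split(";")[0].strip().lower()
    accepted = ACCEPTED.get(mime)
    if accepted is None:
        # Unknown or missing content-type: keep whatever the URL offers.
        return url_ext
    if url_ext in accepted:
        # The URL extension already matches the content-type.
        return url_ext
    # Missing or mismatched URL extension: use the preferred one.
    return accepted[0]
```

For example, `choose_extension("https://example.com/image.php", "image/png")` picks `.png` because `.php` is not an extension this table accepts for `image/png`.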

Use TDD. Write failing tests first - run with "uv run --with-editable '.[test]'". Manually test it too by running a python -m http.server and hitting it with "uv run --with-editable . shot-scraper"

The new --extract / -x option extracts all resources from the HAR file
into a directory parallel to the HAR file, with meaningful filenames
derived from URLs and extensions based on content-type.

Example usage:

shot-scraper har https://example.com/ --extract

This creates both example-com.har and an example-com/ directory
containing all the resources.

Works with both .har and .har.zip formats:

shot-scraper har https://example.com/ --extract --zip
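
As a rough sketch of how the parallel directory name could be derived for both formats (assuming a hypothetical helper called `extract_dir_for()`, which is not necessarily what this PR implements):

```python
from pathlib import Path


def extract_dir_for(har_path):
    """Return the directory that sits next to the HAR file.

    Hypothetical helper: strips a trailing ".har.zip" or ".har" from the
    file name and keeps the parent directory unchanged.
    """
    path = Path(har_path)
    name = path.name
    # Check the longer suffix first so ".har.zip" is removed as a whole.
    for suffix in (".har.zip", ".har"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    # If neither suffix matched, the name is returned unchanged; a real
    # implementation would need to handle that collision explicitly.
    return path.with_name(name)
```

With this sketch both `example-com.har` and `example-com.har.zip` map to an `example-com` directory alongside the file.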

For the is-it-a-bird demo, use:

shot-scraper har https://tools.simonwillison.net/is-it-a-bird \
  --extract -o isitabird.har \
  -j "document.querySelector('button')?.click()" \
  --wait-for "document.body.innerText.includes('Model loaded')" \
  --timeout 120000

Implements:

  • New extension_for_content_type() utility for mapping MIME types
  • New filename_for_har_entry() for deriving filenames from HAR entries
  • Extraction logic handles both plain HAR and zip formats
  • Supports content stored in HAR text field or as _file reference in zip
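
To illustrate the last two points, here is a minimal sketch of reading a response body from a single HAR entry. It assumes the standard HAR `response.content` layout (`text`, optional `encoding` of `base64`) and assumes that `_file` names a member of the zip archive; `body_for_entry()` is a hypothetical name, not a function from this PR.

```python
import base64
import zipfile
from typing import Optional


def body_for_entry(entry: dict, archive: Optional[zipfile.ZipFile] = None) -> Optional[bytes]:
    """Return the response body for one HAR entry, or None if unavailable."""
    content = entry.get("response", {}).get("content", {})
    text = content.get("text")
    if text is not None:
        # HAR stores binary bodies as base64 and flags them via "encoding".
        if content.get("encoding") == "base64":
            return base64.b64decode(text)
        return text.encode("utf-8")
    # Assumption: in the zip format the body lives in a separate archive
    # member whose name is given by the "_file" key.
    file_name = content.get("_file")
    if file_name and archive is not None:
        return archive.read(file_name)
    return None
```

Returning `None` for entries without a captured body lets the caller skip them instead of writing empty files.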

Closes #184


📚 Documentation preview 📚: https://shot-scraper--183.org.readthedocs.build/en/183/

https://gistpreview.github.io/?8958f38250b3bf8f5693fa7f0c73b57c/index.html

@simonw simonw linked an issue Dec 28, 2025 that may be closed by this pull request
@simonw simonw added the enhancement New feature or request label Dec 28, 2025

simonw commented Dec 28, 2025

Manual testing shows the interaction with -o is a bit confusing:

uv run --with-editable '.[test]' shot-scraper har datasette.io -o /tmp/wumpit.har -x
/tmp % ls -lah wumpit.har  
-rw-r--r--  1 simon  wheel  160766 Dec 28 12:33 wumpit.har
/tmp % rm wumpit.har 
/tmp % ls wumpit
datasette-io-static-datasette-logo.svg
datasette-io-static-lite-yt-embed.css
datasette-io-static-lite-yt-embed.js
datasette-io-static-site.css
datasette-io.html
i-ytimg-com-vi-7kDFBnXaw-c-hqdefault.jpg
img-shields-io-badge-license-Apache202-0-blue.svg
img-shields-io-badge-mastodon-datasette-blueviolet.svg
img-shields-io-discord-823971286308356157.svg
img-shields-io-github-v-release-simonw-datasette.svg
img-shields-io-pypi-pyversions-datasette.svg
img-shields-io-pypi-v-datasette.svg
plausible-io-js-plausible.js

@simonw simonw merged commit a6ca48d into main Dec 29, 2025
12 of 26 checks passed
simonw added a commit that referenced this pull request Dec 29, 2025
Development

Successfully merging this pull request may close these issues.

shot-scraper har -x/--extract option