Add -x/--extract option to har command #183

simonw · 2025-12-28T19:57:30Z

Add a new option to the har command: -x/--extract

If specified, then after the .har or .har.zip file has been created it also creates a new directory parallel to that file with the same name (minus the extension), and then extracts all of the resources from that HAR into that folder. File names are based on the original URL (using the same conventions used elsewhere to automatically derive image file names). The extension comes from the original URL, or from the content-type of that file if an extension is missing from the URL or if that extension doesn't match the expected one.
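A minimal sketch of the naming rule described above, not the code in this PR. The function name sketch_filename_for_url and the SKETCH_EXTENSIONS table are hypothetical, and the hyphen-collapsing rule is an assumption that may differ from the convention shot-scraper already uses for image file names:

```python
import re
from urllib.parse import urlparse

# Hypothetical, deliberately tiny MIME type table (an assumption for this sketch)
SKETCH_EXTENSIONS = {
    "text/html": ".html",
    "text/css": ".css",
    "text/javascript": ".js",
    "application/json": ".json",
    "image/svg+xml": ".svg",
    "image/png": ".png",
    "image/jpeg": ".jpg",
}


def sketch_filename_for_url(url, content_type=None):
    """Derive a flat file name from a URL, preferring the content-type extension."""
    parsed = urlparse(url)
    path = parsed.path.strip("/")
    url_ext = ""
    # Only treat a dot in the final path segment as an extension separator
    last_segment = path.rsplit("/", 1)[-1]
    if "." in last_segment:
        path, _, ext = path.rpartition(".")
        url_ext = "." + ext.lower()
    # Collapse every run of non-alphanumeric characters into a single hyphen
    base = re.sub(r"[^A-Za-z0-9]+", "-", parsed.netloc + "/" + path).strip("-")
    expected_ext = None
    if content_type:
        # Drop parameters such as "; charset=utf-8" before the lookup
        mime = content_type.split(";")[0].strip().lower()
        expected_ext = SKETCH_EXTENSIONS.get(mime)
    # Use the content-type extension when the URL has none or disagrees with it
    if expected_ext and url_ext != expected_ext:
        return base + expected_ext
    return base + url_ext
```

With this sketch an unknown content-type leaves the URL's own extension alone, which seemed like the least surprising fallback.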

Use TDD. Write failing tests first - run with "uv run --with-editable '.[test]'". Manually test it too by running a python -m http.server and hitting it with "uv run --with-editable . shot-scraper"

The new --extract / -x option extracts all resources from the HAR file
into a directory parallel to the HAR file, with meaningful filenames
derived from URLs and extensions based on content-type.

Example usage:

shot-scraper har https://example.com/ --extract

This creates both example-com.har and an example-com/ directory
containing all the resources.

Works with both .har and .har.zip formats:

shot-scraper har https://example.com/ --extract --zip
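As a rough illustration of how the parallel directory name could be derived for both formats (the helper name sketch_extract_dir is hypothetical and this is not the PR's implementation):

```python
from pathlib import Path


def sketch_extract_dir(har_path):
    """Return the directory that sits next to a .har or .har.zip file."""
    path = Path(har_path)
    name = path.name
    # Check the longer suffix first so "x.har.zip" becomes "x", not "x.har"
    for suffix in (".har.zip", ".har"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    return path.with_name(name)
```

Under these assumptions example-com.har and example-com.har.zip both map to example-com/.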

For the is-it-a-bird demo, use:

shot-scraper har https://tools.simonwillison.net/is-it-a-bird \
  --extract -o isitabird.har \
  -j "document.querySelector('button')?.click()" \
  --wait-for "document.body.innerText.includes('Model loaded')" \
  --timeout 120000

Implements:

- New extension_for_content_type() utility for mapping MIME types
- New filename_for_har_entry() for deriving filenames from HAR entries
- Extraction logic handles both plain HAR and zip formats
- Supports content stored in HAR text field or as _file reference in zip
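For readers unfamiliar with the HAR layout, here is a hedged sketch of reading response bodies from the text field of each entry. It relies only on the standard HAR 1.2 fields (log.entries, request.url, response.content with text, encoding and mimeType). The function name sketch_iter_har_bodies is hypothetical, and the sketch skips the _file-in-zip case entirely rather than guessing at how the PR handles it:

```python
import base64
import json


def sketch_iter_har_bodies(har_text):
    """Yield (url, mime_type, body_bytes) for entries that embed their body."""
    har = json.loads(har_text)
    for entry in har["log"]["entries"]:
        content = entry["response"].get("content", {})
        text = content.get("text")
        if text is None:
            # No embedded body, e.g. stored outside the HAR JSON; not handled here
            continue
        if content.get("encoding") == "base64":
            # Binary bodies are base64-encoded in the text field
            body = base64.b64decode(text)
        else:
            body = text.encode("utf-8")
        yield entry["request"]["url"], content.get("mimeType"), body
```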

Closes #184

📚 Documentation preview 📚: https://shot-scraper--183.org.readthedocs.build/en/183/

https://gistpreview.github.io/?8958f38250b3bf8f5693fa7f0c73b57c/index.html


simonw · 2025-12-28T20:36:07Z

Manual testing shows the interaction with -o is a bit confusing:

uv run --with-editable '.[test]' shot-scraper har datasette.io -o /tmp/wumpit.har -x

/tmp % ls -lah wumpit.har  
-rw-r--r--  1 simon  wheel  160766 Dec 28 12:33 wumpit.har
/tmp % rm wumpit.har 
/tmp % ls wumpit
datasette-io-static-datasette-logo.svg
datasette-io-static-lite-yt-embed.css
datasette-io-static-lite-yt-embed.js
datasette-io-static-site.css
datasette-io.html
i-ytimg-com-vi-7kDFBnXaw-c-hqdefault.jpg
img-shields-io-badge-license-Apache202-0-blue.svg
img-shields-io-badge-mastodon-datasette-blueviolet.svg
img-shields-io-discord-823971286308356157.svg
img-shields-io-github-v-release-simonw-datasette.svg
img-shields-io-pypi-pyversions-datasette.svg
img-shields-io-pypi-v-datasette.svg
plausible-io-js-plausible.js

Refs #183, #185, #186, #187

simonw linked an issue Dec 28, 2025 that may be closed by this pull request

shot-scraper har -x/--extract option #184

Closed

simonw added the enhancement New feature or request label Dec 28, 2025

Drop testing on 3.9, test on 3.14

c9fdd16

simonw added 2 commits December 28, 2025 20:36

Merge branch 'main' into claude/har-extract-option-O7jGS

c26fd42

Merge branch 'main' into claude/har-extract-option-O7jGS

be86f7c

simonw merged commit a6ca48d into main Dec 29, 2025
12 of 26 checks passed

simonw mentioned this pull request Dec 29, 2025

Improve the way shot-scraper har -x -o out/ works #187

Closed

simonw added a commit that referenced this pull request Dec 29, 2025

Release 1.9

fdabd36

Refs #183, #185, #186, #187
