- grab-site
- =========
+ # grab-site

[![Build status][travis-image]][travis-url]

@@ -35,35 +34,11 @@ please [file an issue](https://github.com/ArchiveTeam/grab-site/issues) - thank
The installation methods below are the only ones supported in our GitHub issues.
Please do not modify the installation steps unless you really know what you're
doing, with both Python packaging and your operating system. grab-site runs
- on a specific version of Python (3.7 or 3.8) and with specific dependency versions.
-
- **Contents**
-
- - [Install on Ubuntu 18.04, 20.04, 22.04, Debian 10 (buster), Debian 11 (bullseye)](#install-on-ubuntu-1804-2004-2204-debian-10-buster-debian-11-bullseye)
- - [Install on NixOS](#install-on-nixos)
- - [Install on another distribution lacking Python 3.7.x or 3.8.x](#install-on-another-distribution-lacking-python-37x-or-38x)
- - [Install on macOS](#install-on-macos)
- - [Install on Windows 10 (experimental)](#install-on-windows-10-experimental)
- - [Upgrade an existing install](#upgrade-an-existing-install)
- - [Usage](#usage)
- - [`grab-site` options, ordered by importance](#grab-site-options-ordered-by-importance)
- - [Warnings](#warnings)
- - [Tips for specific websites](#tips-for-specific-websites)
- - [Changing ignores during the crawl](#changing-ignores-during-the-crawl)
- - [Inspecting the URL queue](#inspecting-the-url-queue)
- - [Preventing a crawl from queuing any more URLs](#preventing-a-crawl-from-queuing-any-more-urls)
- - [Stopping a crawl](#stopping-a-crawl)
- - [Advanced `gs-server` options](#advanced-gs-server-options)
- - [Viewing the content in your WARC archives](#viewing-the-content-in-your-warc-archives)
- - [Inspecting WARC files in the terminal](#inspecting-warc-files-in-the-terminal)
- - [Automatically pausing grab-site processes when free disk is low](#automatically-pausing-grab-site-processes-when-free-disk-is-low)
- - [Thanks](#thanks)
- - [Help](#help)
-
-
-
- Install on Ubuntu 18.04, 20.04, 22.04, Debian 10 (buster), Debian 11 (bullseye)
- ---
+ on a specific version of Python (3.7+) and with specific dependency versions.
+
+ ## Installation Instructions
+
+ ### Install on Debian and Ubuntu

1. On Debian, use `su` to become root if `sudo` is not configured to give you access.

@@ -99,45 +74,31 @@ Install on Ubuntu 18.04, 20.04, 22.04, Debian 10 (buster), Debian 11 (bullseye)
 and then restart your shell (e.g. by opening a new terminal tab/window).


- Install on NixOS
- ---
+ ### Install on NixOS

grab-site was removed from nixpkgs master; 23.05 is the last release to contain grab-site.

```
nix-env -f https://github.com/NixOS/nixpkgs/archive/release-23.05.tar.gz -iA grab-site
```

- or, if you are using profiles (ie when you have flakes enabled):
+ or, if you are using profiles, i.e., when you have flakes enabled:

```
nix profile install nixpkgs/release-22.11#grab-site
```

-
- Install on another distribution lacking Python 3.7.x or 3.8.x
- ---
+ ### Install on another distribution

After installing [uv](https://docs.astral.sh/uv/), you can run
```
uv tool install --python=3.8 --no-binary-package lxml git+https://github.com/ArchiveTeam/grab-site/
```


- Install on macOS
- ---
-
- On OS X 10.10 - macOS 11:
-
- 1. Run `locale` in your terminal. If the output includes "UTF-8", you
- are all set. If it does not, your terminal is misconfigured and grab-site
- will fail to start. This can be corrected with:
-
- - Terminal.app: Preferences... -> Profiles -> Advanced -> **check** Set locale environment variables on startup
-
- - iTerm2: Preferences... -> Profiles -> Terminal -> Environment -> **check** Set locale variables automatically
+ ### Install on macOS

- ### Using Homebrew (**Intel Mac**)
+ #### Using Homebrew (**Intel Mac**)

For M1 Macs, use the next section instead of this one.

@@ -147,8 +108,8 @@ For M1 Macs, use the next section instead of this one.

 ```
 brew update
- brew install python@3.8 libxslt re2 pkg-config
- /usr/local/opt/python@3.8/bin/python3 -m venv ~/gs-venv
+ brew install python@3 libxslt re2 pkg-config
+ python3 -m venv ~/gs-venv
 PKG_CONFIG_PATH="/usr/local/opt/libxml2/lib/pkgconfig" ~/gs-venv/bin/pip install --no-binary lxml --upgrade git+https://github.com/ArchiveTeam/grab-site
 ```

@@ -160,7 +121,7 @@ For M1 Macs, use the next section instead of this one.

 and then restart your shell (e.g. by opening a new terminal tab/window).

- ### Using Homebrew (**M1 Mac**)
+ #### Using Homebrew (**M1 Mac**)

2. Install Homebrew using the install step on https://brew.sh/

@@ -170,8 +131,8 @@ For M1 Macs, use the next section instead of this one.

 ```
 brew update
- brew install python@3.8 libxslt re2 pkg-config
- /opt/homebrew/opt/python@3.8/bin/python3 -m venv ~/gs-venv
+ brew install python@3 libxslt re2 pkg-config
+ python3 -m venv ~/gs-venv
 PKG_CONFIG_PATH="/opt/homebrew/opt/libxml2/lib/pkgconfig" ~/gs-venv/bin/pip install --no-binary lxml --upgrade git+https://github.com/ArchiveTeam/grab-site
 ```

@@ -185,8 +146,7 @@ For M1 Macs, use the next section instead of this one.



- Install on Windows 10 (experimental)
- ---
+ ### Install on Windows 10 (experimental)

On Windows 10 Fall Creators Update (1703) or newer:

@@ -208,8 +168,7 @@ On Windows 10 Fall Creators Update (1703) or newer:



- Upgrade an existing install
- ---
+ ### Upgrade an existing install

To update grab-site, simply run the `~/gs-venv/bin/pip install ...` or
`nix-env ...` command used to install it originally (see above).
@@ -219,8 +178,7 @@ Existing `grab-site` crawls will automatically reconnect to the new server.



- Usage
- ---
+ ## Usage

First, start the dashboard with:

@@ -518,9 +476,8 @@ while a [from:user](https://twitter.com/search?q=from%3Ainternetarchive&src=typd
query can return more.


+ ## Changing ignores during the crawl

- Changing ignores during the crawl
- ---
While the crawl is running, you can edit `DIR/ignores` and `DIR/igsets`; the
changes will be applied within a few seconds.

@@ -536,8 +493,8 @@ Note that ignores will not apply to any of the crawl's start URLs.
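Editing `DIR/ignores` by hand works, but a small helper can make repeated additions less error-prone. The sketch below assumes `DIR/ignores` holds one regular expression per line; `add_ignore` is a hypothetical name, not something grab-site ships.

```shell
# Hypothetical helper (not part of grab-site): append PATTERN to DIR/ignores
# unless that exact line is already present.
# Assumption: DIR/ignores contains one regular expression per line.
add_ignore() {
    dir=$1
    pattern=$2
    # -x: match whole lines, -F: compare literally, -q: no output
    if ! grep -qxF -e "$pattern" "$dir/ignores" 2>/dev/null; then
        printf '%s\n' "$pattern" >> "$dir/ignores"
    fi
}
```

For example, `add_ignore DIR '^https?://example\.com/calendar/'` would add that pattern once, no matter how often it is run.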



- Inspecting the URL queue
- ---
+ ## Inspecting the URL queue
+
Inspecting the URL queue is usually not necessary, but may be helpful
for adding ignores before grab-site crawls a large number of junk URLs.

@@ -558,22 +515,22 @@ gs-dump-urls DIR/wpull.db todo | sort | less -S
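When the queue is large, a per-host summary can show where junk URLs are coming from faster than paging through the full list. This is a minimal sketch that assumes the input is one absolute URL per line; `count_hosts` is a hypothetical helper name.

```shell
# Hypothetical helper: read URLs on stdin, print "count host" pairs,
# most frequent host first.
count_hosts() {
    # Splitting "scheme://host/path" on "/" puts the host in field 3.
    awk -F/ '{ print $3 }' | sort | uniq -c | sort -rn
}
```

It could be fed from the queue dump, e.g. `gs-dump-urls DIR/wpull.db todo | count_hosts | head`.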



- Preventing a crawl from queuing any more URLs
- ---
+ ## Preventing a crawl from queuing any more URLs
+
`rm DIR/scrape`. Responses will no longer be scraped for URLs. Scraping cannot
be re-enabled for a crawl.



- Stopping a crawl
- ---
+ ## Stopping a crawl
+
You can `touch DIR/stop` or press ctrl-c, which will do the same. You will
have to wait for the current downloads to finish.



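If several crawls are running at once, the `touch DIR/stop` step can be applied to each crawl directory in a loop. The helper below is a sketch; the name `stop_crawls` is hypothetical, and the arguments are assumed to be grab-site crawl directories.

```shell
# Hypothetical helper: request a graceful stop for every crawl directory
# given as an argument by creating DIR/stop in each one.
stop_crawls() {
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            touch "$dir/stop"
        else
            # Report, but keep going with the remaining directories.
            printf 'skipping %s: not a directory\n' "$dir" >&2
        fi
    done
}
```

As with a single crawl, each process still finishes its current downloads before exiting.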
- Advanced `gs-server` options
- ---
+ ## Advanced `gs-server` options
+
These environmental variables control what `gs-server` listens on:

* `GRAB_SITE_INTERFACE` (default `0.0.0.0`)
@@ -586,15 +543,14 @@ These environmental variables control which server each `grab-site` process conn
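As a rough illustration of how such a variable interacts with its documented default, the sketch below prints the interface value that would be in effect for `GRAB_SITE_INTERFACE`. It only mirrors the default stated above and does not query `gs-server` itself; `effective_interface` is a hypothetical name.

```shell
# Hypothetical helper: show the value gs-server would see, falling back to
# the documented default of 0.0.0.0 when the variable is unset or empty.
effective_interface() {
    printf '%s\n' "${GRAB_SITE_INTERFACE:-0.0.0.0}"
}
```

Setting the variable for a single invocation, e.g. `GRAB_SITE_INTERFACE=127.0.0.1 gs-server`, should limit the dashboard to the loopback interface, assuming the variable is read at startup.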



- Viewing the content in your WARC archives
- ---
+ ## Viewing the content in your WARC archives

Try [ReplayWeb.page](https://replayweb.page/) or [webrecorder-player](https://github.com/webrecorder/webrecorder-player).



- Inspecting WARC files in the terminal
- ---
+ ## Inspecting WARC files in the terminal
+
`zless` is a wrapper over `less` that can be used to view raw WARC content:

```
@@ -609,8 +565,7 @@ However, some servers will send compressed responses anyway.
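Beyond paging through a file, the same decompress-and-read approach can give a quick record count. The sketch below counts `WARC-Type: response` headers in a gzip-compressed WARC; `count_warc_responses` is a hypothetical name, and a response body that happens to contain that exact header line would be counted as well.

```shell
# Hypothetical helper: count response records in a .warc.gz file.
count_warc_responses() {
    # gzip -dc decompresses to stdout; grep -a treats binary payloads as text,
    # and -c prints the number of matching lines.
    gzip -dc "$1" | grep -a -c '^WARC-Type: response'
}
```

It takes the path of a single `.warc.gz` file as its only argument.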



- Automatically pausing grab-site processes when free disk is low
- ---
+ ## Automatically pausing grab-site processes when free disk is low

If you automatically upload and remove finished .warc.gz files, you can still
run into a situation where grab-site processes fill up your disk faster than
@@ -621,8 +576,7 @@ crosses a threshold value.
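The threshold check at the heart of such a script might look like the sketch below. It is not the script the text refers to; `free_kb` and `disk_is_low` are hypothetical names, and the threshold is given in 1024-byte blocks.

```shell
# Hypothetical helper: print the free space, in 1024-byte blocks, of the
# filesystem that holds PATH.
free_kb() {
    # df -P prints one POSIX-format line per filesystem; -k uses 1 KiB blocks.
    # Column 4 of the second line is the available space.
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Hypothetical helper: succeed when free space under PATH is below MIN_KB.
disk_is_low() {
    [ "$(free_kb "$1")" -lt "$2" ]
}
```

A monitoring loop could call `disk_is_low DIR 10485760` (about 10 GiB) every few seconds; what it does when the check succeeds is left out here, since the pausing mechanism is not shown in this section.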



- Thanks
- ---
+ ## Thanks

grab-site is made possible only because of [wpull](https://github.com/chfoo/wpull),
written by [Christopher Foo](https://github.com/chfoo) who spent a year
@@ -651,8 +605,8 @@ in various browsers.



- Help
- ---
+ ## Support
+
grab-site bugs and questions are welcome in
[grab-site/issues](https://github.com/ArchiveTeam/grab-site/issues).
