Commit a176949

Generalize instructions
This change removes 3.8 pins as that version of python is no longer supported by the upstream python project. Remove outdated OS X instructions while here and do some minor formatting tweaks with h1 and h2 sections. Remove the manually provided ToC while here: GitHub automatically generates the ToC for "free".

Signed-off-by: Enji Cooper <yaneurabeya@gmail.com>

1 parent e0b11ac commit a176949

1 file changed: +35 -81 lines changed

README.md

Lines changed: 35 additions & 81 deletions
@@ -1,5 +1,4 @@
1-
grab-site
2-
=========
1+
# grab-site
32

43
[![Build status][travis-image]][travis-url]
54

@@ -35,35 +34,11 @@ please [file an issue](https://github.com/ArchiveTeam/grab-site/issues) - thank
3534
The installation methods below are the only ones supported in our GitHub issues.
3635
Please do not modify the installation steps unless you really know what you're
3736
doing, with both Python packaging and your operating system. grab-site runs
38-
on a specific version of Python (3.7 or 3.8) and with specific dependency versions.
39-
40-
**Contents**
41-
42-
- [Install on Ubuntu 18.04, 20.04, 22.04, Debian 10 (buster), Debian 11 (bullseye)](#install-on-ubuntu-1804-2004-2204-debian-10-buster-debian-11-bullseye)
43-
- [Install on NixOS](#install-on-nixos)
44-
- [Install on another distribution lacking Python 3.7.x or 3.8.x](#install-on-another-distribution-lacking-python-37x-or-38x)
45-
- [Install on macOS](#install-on-macos)
46-
- [Install on Windows 10 (experimental)](#install-on-windows-10-experimental)
47-
- [Upgrade an existing install](#upgrade-an-existing-install)
48-
- [Usage](#usage)
49-
- [`grab-site` options, ordered by importance](#grab-site-options-ordered-by-importance)
50-
- [Warnings](#warnings)
51-
- [Tips for specific websites](#tips-for-specific-websites)
52-
- [Changing ignores during the crawl](#changing-ignores-during-the-crawl)
53-
- [Inspecting the URL queue](#inspecting-the-url-queue)
54-
- [Preventing a crawl from queuing any more URLs](#preventing-a-crawl-from-queuing-any-more-urls)
55-
- [Stopping a crawl](#stopping-a-crawl)
56-
- [Advanced `gs-server` options](#advanced-gs-server-options)
57-
- [Viewing the content in your WARC archives](#viewing-the-content-in-your-warc-archives)
58-
- [Inspecting WARC files in the terminal](#inspecting-warc-files-in-the-terminal)
59-
- [Automatically pausing grab-site processes when free disk is low](#automatically-pausing-grab-site-processes-when-free-disk-is-low)
60-
- [Thanks](#thanks)
61-
- [Help](#help)
62-
63-
64-
65-
Install on Ubuntu 18.04, 20.04, 22.04, Debian 10 (buster), Debian 11 (bullseye)
66-
---
37+
on a specific version of Python (3.7+) and with specific dependency versions.
38+
39+
## Installation Instructions
40+
41+
### Install on Debian and Ubuntu
6742

6843
1. On Debian, use `su` to become root if `sudo` is not configured to give you access.
6944

@@ -99,45 +74,31 @@ Install on Ubuntu 18.04, 20.04, 22.04, Debian 10 (buster), Debian 11 (bullseye)
9974
and then restart your shell (e.g. by opening a new terminal tab/window).
10075

10176

102-
Install on NixOS
103-
---
77+
### Install on NixOS
10478

10579
grab-site was removed from nixpkgs master; 23.05 is the last release to contain grab-site.
10680

10781
```
10882
nix-env -f https://github.com/NixOS/nixpkgs/archive/release-23.05.tar.gz -iA grab-site
10983
```
11084

111-
or, if you are using profiles (ie when you have flakes enabled):
85+
or, if you are using profiles, i.e., when you have flakes enabled:
11286

11387
```
11488
nix profile install nixpkgs/release-22.11#grab-site
11589
```
11690

117-
118-
Install on another distribution lacking Python 3.7.x or 3.8.x
119-
---
91+
### Install on another distribution
12092

12193
After installing [uv](https://docs.astral.sh/uv/), you can run
12294
```
12395
uv tool install --python=3.8 --no-binary-package lxml git+https://github.com/ArchiveTeam/grab-site/
12496
```
12597

12698

127-
Install on macOS
128-
---
129-
130-
On OS X 10.10 - macOS 11:
131-
132-
1. Run `locale` in your terminal. If the output includes "UTF-8", you
133-
are all set. If it does not, your terminal is misconfigured and grab-site
134-
will fail to start. This can be corrected with:
135-
136-
- Terminal.app: Preferences... -> Profiles -> Advanced -> **check** Set locale environment variables on startup
137-
138-
- iTerm2: Preferences... -> Profiles -> Terminal -> Environment -> **check** Set locale variables automatically
99+
### Install on macOS
139100

140-
### Using Homebrew (**Intel Mac**)
101+
#### Using Homebrew (**Intel Mac**)
141102

142103
For M1 Macs, use the next section instead of this one.
143104

@@ -147,8 +108,8 @@ For M1 Macs, use the next section instead of this one.
147108

148109
```
149110
brew update
150-
brew install python@3.8 libxslt re2 pkg-config
151-
/usr/local/opt/python@3.8/bin/python3 -m venv ~/gs-venv
111+
brew install python@3 libxslt re2 pkg-config
112+
python3 -m venv ~/gs-venv
152113
PKG_CONFIG_PATH="/usr/local/opt/libxml2/lib/pkgconfig" ~/gs-venv/bin/pip install --no-binary lxml --upgrade git+https://github.com/ArchiveTeam/grab-site
153114
```
154115

@@ -160,7 +121,7 @@ For M1 Macs, use the next section instead of this one.
160121

161122
and then restart your shell (e.g. by opening a new terminal tab/window).
162123

163-
### Using Homebrew (**M1 Mac**)
124+
#### Using Homebrew (**M1 Mac**)
164125

165126
2. Install Homebrew using the install step on https://brew.sh/
166127

@@ -170,8 +131,8 @@ For M1 Macs, use the next section instead of this one.
170131

171132
```
172133
brew update
173-
brew install python@3.8 libxslt re2 pkg-config
174-
/opt/homebrew/opt/python@3.8/bin/python3 -m venv ~/gs-venv
134+
brew install python@3 libxslt re2 pkg-config
135+
python3 -m venv ~/gs-venv
175136
PKG_CONFIG_PATH="/opt/homebrew/opt/libxml2/lib/pkgconfig" ~/gs-venv/bin/pip install --no-binary lxml --upgrade git+https://github.com/ArchiveTeam/grab-site
176137
```
177138

@@ -185,8 +146,7 @@ For M1 Macs, use the next section instead of this one.
185146

186147

187148

188-
Install on Windows 10 (experimental)
189-
---
149+
### Install on Windows 10 (experimental)
190150

191151
On Windows 10 Fall Creators Update (1703) or newer:
192152

@@ -208,8 +168,7 @@ On Windows 10 Fall Creators Update (1703) or newer:
208168

209169

210170

211-
Upgrade an existing install
212-
---
171+
### Upgrade an existing install
213172

214173
To update grab-site, simply run the `~/gs-venv/bin/pip install ...` or
215174
`nix-env ...` command used to install it originally (see above).
@@ -219,8 +178,7 @@ Existing `grab-site` crawls will automatically reconnect to the new server.
219178

220179

221180

222-
Usage
223-
---
181+
## Usage
224182

225183
First, start the dashboard with:
226184

@@ -518,9 +476,8 @@ while a [from:user](https://twitter.com/search?q=from%3Ainternetarchive&src=typd
518476
query can return more.
519477

520478

479+
## Changing ignores during the crawl
521480

522-
Changing ignores during the crawl
523-
---
524481
While the crawl is running, you can edit `DIR/ignores` and `DIR/igsets`; the
525482
changes will be applied within a few seconds.
526483

@@ -536,8 +493,8 @@ Note that ignores will not apply to any of the crawl's start URLs.
536493

537494

538495

539-
Inspecting the URL queue
540-
---
496+
## Inspecting the URL queue
497+
541498
Inspecting the URL queue is usually not necessary, but may be helpful
542499
for adding ignores before grab-site crawls a large number of junk URLs.
543500

@@ -558,22 +515,22 @@ gs-dump-urls DIR/wpull.db todo | sort | less -S
558515

559516

560517

561-
Preventing a crawl from queuing any more URLs
562-
---
518+
## Preventing a crawl from queuing any more URLs
519+
563520
`rm DIR/scrape`. Responses will no longer be scraped for URLs. Scraping cannot
564521
be re-enabled for a crawl.
565522

566523

567524

568-
Stopping a crawl
569-
---
525+
## Stopping a crawl
526+
570527
You can `touch DIR/stop` or press ctrl-c, which will do the same. You will
571528
have to wait for the current downloads to finish.
572529

573530

574531

575-
Advanced `gs-server` options
576-
---
532+
## Advanced `gs-server` options
533+
577534
These environmental variables control what `gs-server` listens on:
578535

579536
* `GRAB_SITE_INTERFACE` (default `0.0.0.0`)
@@ -586,15 +543,14 @@ These environmental variables control which server each `grab-site` process conn
586543

587544

588545

589-
Viewing the content in your WARC archives
590-
---
546+
## Viewing the content in your WARC archives
591547

592548
Try [ReplayWeb.page](https://replayweb.page/) or [webrecorder-player](https://github.com/webrecorder/webrecorder-player).
593549

594550

595551

596-
Inspecting WARC files in the terminal
597-
---
552+
## Inspecting WARC files in the terminal
553+
598554
`zless` is a wrapper over `less` that can be used to view raw WARC content:
599555

600556
```
@@ -609,8 +565,7 @@ However, some servers will send compressed responses anyway.
609565

610566

611567

612-
Automatically pausing grab-site processes when free disk is low
613-
---
568+
## Automatically pausing grab-site processes when free disk is low
614569

615570
If you automatically upload and remove finished .warc.gz files, you can still
616571
run into a situation where grab-site processes fill up your disk faster than
@@ -621,8 +576,7 @@ crosses a threshold value.
621576

622577

623578

624-
Thanks
625-
---
579+
## Thanks
626580

627581
grab-site is made possible only because of [wpull](https://github.com/chfoo/wpull),
628582
written by [Christopher Foo](https://github.com/chfoo) who spent a year
@@ -651,8 +605,8 @@ in various browsers.
651605

652606

653607

654-
Help
655-
---
608+
## Support
609+
656610
grab-site bugs and questions are welcome in
657611
[grab-site/issues](https://github.com/ArchiveTeam/grab-site/issues).
658612
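
The context lines of this diff mention two control files, `DIR/scrape` and `DIR/stop`. As a minimal sketch that is not part of the commit, they could be wrapped in shell helpers. The function names `stop_scraping` and `stop_crawl` are hypothetical, and the crawl directory is assumed to be passed as the first argument.

```shell
# Hypothetical helpers, not part of grab-site itself.
# $1 is the crawl directory (called DIR in the README).

stop_scraping() {
    # Removing DIR/scrape stops responses from being scraped for new URLs.
    # Per the README, scraping cannot be re-enabled for that crawl.
    rm -f -- "$1/scrape"
}

stop_crawl() {
    # Creating DIR/stop asks the crawl to stop, the same as pressing ctrl-c.
    # Downloads already in progress still have to finish.
    touch -- "$1/stop"
}
```

For example, `stop_scraping ~/crawls/example && stop_crawl ~/crawls/example` would first freeze the queue and then request a stop; the path here is only illustrative.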
