30 commits
dcb5603
Add simplistic Dockerfile
Fusl Mar 5, 2019
17507d8
added libffi-dev libressl-dev packages
Fusl Jan 4, 2021
927a68b
added patch package
Fusl Jan 4, 2021
59da795
lock python version to 3.7
Fusl Jan 4, 2021
33cd061
Merge branch 'master' into feature/dockerfile
acrois Jul 20, 2021
172479e
Parameterize dockerfile, separate volume for data vs installation dir…
acrois Jul 21, 2021
6fb46f7
Add Docker usage to README.md
acrois Jul 22, 2021
067dfdd
Add docker logs usage, attach to container, pause and resume crawl
acrois Jul 22, 2021
9fc98eb
Add method to access running container (not PID 1)
acrois Jul 22, 2021
cbaf124
Update gs-server container name in documentation, update documentatio…
acrois Aug 4, 2021
e4fb82b
Update README.md
acrois Aug 6, 2021
ce419ad
Quick start usage, network isolation, remove need for --dir parameter
acrois Aug 7, 2021
e66de97
Merge branch 'ArchiveTeam:master' into feature/dockerfile
acrois Nov 29, 2021
ded3f49
Update Dockerfile
acrois Aug 14, 2022
f233c3f
Merge branch 'ArchiveTeam:master' into feature/dockerfile
acrois Aug 14, 2022
b434f39
Set executable bit in entrypoint.sh, rename grab-network to gs-networ…
acrois Aug 15, 2022
74feab8
Executable bit
acrois Aug 15, 2022
84d0236
Rename grab-network to gs-network, update documentation for single co…
acrois Aug 15, 2022
ec4c5d4
Executable bit
acrois Aug 15, 2022
6570b0c
Merge branch 'feature/dockerfile' of https://github.com/acrois/grab-s…
acrois Aug 15, 2022
44ae2fd
Add .gitattributes for LF preservation on Windows, update Python and …
acrois Aug 15, 2022
de68a6e
Use su-exec for step-down from root to grab-site user, update Docker …
acrois Aug 15, 2022
d4fbb98
Update documentation for more consistent Docker first-time run, adjus…
acrois Aug 15, 2022
c48f129
Adjust pip installation parameters
acrois Aug 15, 2022
a6210db
Update README to be easier to follow
acrois Aug 15, 2022
e8f82d4
Additional formatting and context to Docker README
acrois Aug 15, 2022
7e20f21
Document Debian 11 Docker daemon setup, configuration option usage an…
acrois Aug 16, 2022
b7df3bf
Merge branch 'ArchiveTeam:master' into feature/dockerfile
acrois May 21, 2023
534f7dc
Merge branch 'ArchiveTeam:master' into feature/dockerfile
acrois Jan 8, 2024
3ee0d2f
feat: Docker build and release
acrois Jan 9, 2024
3 changes: 3 additions & 0 deletions .dockerignore
@@ -0,0 +1,3 @@
__pycache__
Dockerfile
data
1 change: 1 addition & 0 deletions .gitattributes
@@ -0,0 +1 @@
* text=auto eol=lf
87 changes: 87 additions & 0 deletions .github/workflows/build.yaml
@@ -0,0 +1,87 @@
name: Build Image
on:
  workflow_dispatch:
  push:
concurrency: build-image
env:
  REGISTRY: ghcr.io/
  IMAGE: ${{ github.repository }}
  TAG: ${{ github.sha }}
jobs:
  build_deploy_cached:
    runs-on: ubuntu-latest
    name: Build and Deploy with Cache
    steps:
      - id: registry
        uses: ASzc/change-string-case-action@v1
        with:
          string: ${{ env.REGISTRY }}

      - id: image
        uses: ASzc/change-string-case-action@v1
        with:
          string: ${{ env.IMAGE }}

      - id: tag
        uses: ASzc/change-string-case-action@v1
        with:
          string: ${{ env.TAG }}

      - name: Checkout Code
        uses: actions/checkout@v3
        with:
          fetch-depth: 0

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2
        id: buildx

      - name: Cache Docker layers
        uses: actions/cache@v2
        with:
          path: /tmp/.buildx-cache
          key: ${{ runner.os }}-buildx-${{ github.sha }}
          restore-keys: |
            ${{ runner.os }}-buildx

      - name: Login to the Container Registry
        uses: docker/login-action@v2
        with:
          registry: ${{ steps.registry.outputs.lowercase }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata (tags, labels) for Docker
        id: meta
        uses: docker/metadata-action@v4
        with:
          # context: git
          images: ${{ steps.registry.outputs.lowercase }}${{ steps.image.outputs.lowercase }}
          tags: |
            type=edge,priority=100
            type=sha,priority=200
            type=ref,event=tag,priority=300
            type=raw,priority=150,value=latest,enable={{is_default_branch}}

      - name: Build Image
        uses: docker/build-push-action@v4
        with:
          context: .
          builder: ${{ steps.buildx.outputs.name }}
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          build-args: |
            BUILDTIME=${{ fromJSON(steps.meta.outputs.json).labels['org.opencontainers.image.created'] }}
            VERSION=${{ fromJSON(steps.meta.outputs.json).labels['org.opencontainers.image.version'] }}
            REVISION=${{ fromJSON(steps.meta.outputs.json).labels['org.opencontainers.image.revision'] }}
          cache-from: type=local,src=/tmp/.buildx-cache
          cache-to: type=local,dest=/tmp/.buildx-cache-new

      # Temp fix
      # https://github.com/docker/build-push-action/issues/252
      # https://github.com/moby/buildkit/issues/1896
      - name: Move cache
        run: |
          rm -rf /tmp/.buildx-cache
          mv /tmp/.buildx-cache-new /tmp/.buildx-cache
1 change: 1 addition & 0 deletions .gitignore
@@ -1 +1,2 @@
__pycache__
data
58 changes: 58 additions & 0 deletions Dockerfile
@@ -0,0 +1,58 @@
ARG PYTHON_VERSION=3.8
ARG ALPINE_VERSION=3.16

FROM python:${PYTHON_VERSION}-alpine${ALPINE_VERSION}

WORKDIR /app
VOLUME [ "/data" ]

ENV GRAB_SITE_INTERFACE=0.0.0.0
ENV GRAB_SITE_PORT=29000
ARG GRAB_SITE_HOST=gs-server
ENV GRAB_SITE_HOST=${GRAB_SITE_HOST}
EXPOSE 29000

# Add user and group for grab-site
RUN addgroup -g 10000 -S grab-site \
    && adduser -u 10000 -S -G grab-site grab-site \
    && chown -R grab-site:grab-site "$(pwd)" \
    && mkdir -p /data \
    && chown -R grab-site:grab-site /data

# Install system dependencies
# The version constraint is quoted so the shell does not treat ">" as a redirect.
RUN apk add --no-cache \
    'su-exec>=0.2' \
    git \
    gcc \
    libxml2-dev \
    musl-dev \
    libxslt-dev \
    g++ \
    re2-dev \
    libffi-dev \
    openssl-dev \
    patch \
    && ln -s /usr/include/libxml2/libxml /usr/include/libxml

ENV PATH="/app:$PATH"
ENTRYPOINT [ "entrypoint.sh" ]
CMD [ "gs-server" ]

# TODO: resolve dependencies before loading library code to take advantage of build caching
# setup.py requires libgrabsite/__init__.py (__version__ property) to work

# Copy application files
COPY --chown=grab-site:grab-site . .

# Install application dependencies
RUN pip install --no-cache-dir .

# Set up runtime environment
WORKDIR /data

# docker build -t grab-site:latest .
# docker run --rm -it --entrypoint sh grab-site:latest
# docker network create -d bridge gs-network
# docker run --net=gs-network --name=gs-server -d -p 29000:29000 --restart=unless-stopped grab-site:latest
# docker run --net=gs-network --rm -d -e GRAB_SITE_HOST=gs-server -v "$(pwd)/data:/data:rw" grab-site:latest grab-site https://www.example.com/
# docker run --net=gs-network --rm -d -e GRAB_SITE_HOST=gs-server -v C:\projects\grab-site\data:/data:rw grab-site:latest grab-site https://www.example.com/
176 changes: 176 additions & 0 deletions README.md
@@ -47,6 +47,7 @@ on a specific version of Python (3.7 or 3.8) and with specific dependency versio
- [Upgrade an existing install](#upgrade-an-existing-install)
- [Usage](#usage)
- [`grab-site` options, ordered by importance](#grab-site-options-ordered-by-importance)
- [Docker](#docker)
- [Warnings](#warnings)
- [Tips for specific websites](#tips-for-specific-websites)
- [Changing ignores during the crawl](#changing-ignores-during-the-crawl)
@@ -373,6 +374,181 @@ Options can come before or after the URL.

* `--help`: print help text.

### Docker

`grab-site` and `gs-server` can be run from Docker!

Please see [Get Docker](https://docs.docker.com/get-docker/) to get started.

#### Quick Start

##### Debian Docker Daemon

On a Debian host running the Docker daemon, you will want to run everything as the grab-site user. The following commands create that user on the host with the same uid/gid as the user/group in the container, and add it to the `docker` group that your Docker installation should have set up.

```sh
sudo addgroup --system --gid 10000 grab-site
sudo adduser --system --uid 10000 --gid 10000 grab-site
sudo usermod -aG docker grab-site
```

As configured, the grab-site user cannot log in directly. If your account has sudo privileges, you can open a shell as that user:

```sh
sudo su -l grab-site -s /bin/bash
```

Once you have a bash shell as the grab-site user, you can run the remaining commands in this section.

#### Data directory

Make sure you have cloned the repository and have created the data directory:

```sh
mkdir -p ./data
```

On Windows, Docker creates this directory automatically when it is used as the host side of the mount in the run step.
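On a Linux host, the container's `grab-site` user has uid 10000, and the entrypoint changes ownership of files in the working directory to that user when it starts as root, so `./data` will typically end up owned by uid 10000. A minimal sketch for checking this from the host follows. The function name `owned_by_uid` is hypothetical, and the sketch assumes a `stat` that supports `-c` (GNU coreutils or BusyBox, not BSD/macOS):

```sh
# Hypothetical helper: succeed if directory $1 is owned by numeric uid $2.
# Assumes GNU or BusyBox stat, which supports the -c format option.
owned_by_uid() {
    [ "$(stat -c '%u' "$1" 2>/dev/null)" = "$2" ]
}

if owned_by_uid ./data 10000; then
    echo "./data is owned by uid 10000"
else
    echo "./data is missing or not owned by uid 10000"
fi
```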

##### Configuration

The paths in `--input-file`, `--import-ignores`, `--dir`, `--finished-warc-dir`, and `--wpull-args` refer to paths in the container.

The instructions in this document mount the `./data` directory to `/data` within the container. `/data` is the container's working directory, while the application itself is stored in `/app`, which is included in the `$PATH` environment variable.

With this Docker configuration and the default configuration of grab-site, all `grab-site` instances share the `/data` directory, and each crawl works in its own subdirectory of it.
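Because options such as `--input-file` take container paths, a file you place under `./data` on the host has to be referenced by its `/data/...` path. The translation can be sketched as below. `to_container_path` is a hypothetical helper, not part of grab-site, and the paths are illustrative:

```sh
# Hypothetical helper: map a host path under the mounted data directory
# to the path the container sees, assuming the directory is mounted at /data.
to_container_path() {
    host_data_dir=$1   # host directory mounted at /data, e.g. "$(pwd)/data"
    host_path=$2       # a file somewhere under host_data_dir
    case "$host_path" in
        "$host_data_dir"/*)
            rel=${host_path#"$host_data_dir"/}
            printf '/data/%s\n' "$rel"
            ;;
        *)
            echo "not under $host_data_dir: $host_path" >&2
            return 1
            ;;
    esac
}

to_container_path /srv/gs/data /srv/gs/data/urls.txt   # prints /data/urls.txt
```

The result is the value you would pass to the option inside the container, for example `--input-file /data/urls.txt`.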

#### Docker Build

The first major step is to build the application, including all dependencies:

```sh
docker build -t grab-site:latest .
```

##### Monolithic Container

Optionally, you can specify the build argument `GRAB_SITE_HOST`. It defaults to `gs-server`, on the expectation that a container named `gs-server` will be reachable by `grab-site` instances.

This argument becomes the default `GRAB_SITE_HOST` environment variable for all containers.

You can override `GRAB_SITE_HOST` if you plan to run a single container that hosts gs-server and to start grab-site processes manually inside that same container, for example:

```sh
docker build --build-arg GRAB_SITE_HOST=127.0.0.1 -t grab-site:latest .
```

Note that this is not the recommended configuration _in the Docker environment_. The [current best practice](https://docs.docker.com/config/containers/multi-service_container/) for deploying a Docker application is to run each process in a separate container _without_ an init process such as systemd, rc.d, or upstart, so that Docker manages the lifecycle of the process instead.

##### Image Inspection

In development and for configuration, you may find it handy to get a new container's shell to inspect files and run the programs within the built image:

```sh
docker run --rm -it --entrypoint sh grab-site:latest
```

#### Docker Network

The second major step is to create a network on which the containers running `grab-site` and `gs-server` can communicate with each other, isolated from any container outside of the network.

The following will create a docker network called `gs-network` for our gs-server and grab-site instances to connect within:

```sh
docker network create -d bridge gs-network
```

We will use this `gs-network` later in `docker run` commands with the `--net=gs-network` flag.

#### Run gs-server on Docker

The third major step is to run a container named `gs-server` to host the dashboard and for grab-site instances to connect to:

```sh
docker run --net=gs-network --name=gs-server -d -p 29000:29000 --restart=unless-stopped grab-site:latest
```

The server will be running with the port forwarded (the `-p` parameter) from host port 29000 to container port 29000. You can access it via [http://localhost:29000](http://localhost:29000). The name of the container should correspond to the `GRAB_SITE_HOST` build argument, which defaults to `gs-server`.

##### View gs-server logs

Optionally, to tail the logs of the `gs-server` instance:

```sh
docker logs -f gs-server
```

##### Attach to gs-server process

Optionally, you can attach local STDIN/STDOUT/STDERR to your running `gs-server` instance:

```sh
docker attach gs-server
```

You can detach by pressing CTRL-p followed by CTRL-q, as documented in the [`docker attach` reference](https://docs.docker.com/engine/reference/commandline/attach/).

##### Access gs-server container

Optionally, to enter the currently running `gs-server` container:

```sh
docker exec -it gs-server sh
```

You will normally not need to do this.

#### Run grab-site on Docker

The final step is to run the following command, which downloads `example.com` into the local `data` directory.

This example works for various Linux shells and PowerShell:

```sh
docker run --net=gs-network --rm -d -e GRAB_SITE_HOST=gs-server -v "$(pwd)/data:/data:rw" grab-site:latest grab-site https://www.example.com/
```

Note: Windows can share files with containers in several ways. This example uses the legacy Windows full-path volume sharing, which can be slower if you are using WSL2.
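If you start crawls often, the command above can be wrapped in a small shell function. The following is a sketch for POSIX shells only, not something shipped with this change. `gs_crawl` and the `GS_DRY_RUN` variable are hypothetical names, and the flags are copied from the command above:

```sh
# Hypothetical wrapper around the `docker run` command shown above.
# With GS_DRY_RUN=1 it prints the command instead of executing it.
gs_crawl() {
    url=$1
    # Rebuild the positional parameters as the full docker command line.
    set -- docker run --net=gs-network --rm -d \
        -e GRAB_SITE_HOST=gs-server \
        -v "$(pwd)/data:/data:rw" \
        grab-site:latest grab-site "$url"
    if [ "${GS_DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Print the command that would be run, without starting a container.
GS_DRY_RUN=1 gs_crawl https://www.example.com/
```

Printing the command first makes it easy to check the volume path before a container is actually started.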

##### View grab-site logs

When you use `docker run` with the `-d` flag, Docker prints the unique ID of the new container. If you'd like to name your container, add `--name` to the `docker run` command.

If you do not specify a name, Docker assigns a randomly generated one, which you can find in the list of all containers using:
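One way to pick a predictable name is to derive it from the URL being crawled. The sketch below is only an illustration. `container_name_for` and the `gs-` prefix are hypothetical, and the character set assumes Docker's container-name rule of letters, digits, `_`, `.` and `-`:

```sh
# Hypothetical helper: derive a container name from a crawl URL.
container_name_for() {
    host=${1#*://}     # drop the scheme, if any
    host=${host%%/*}   # keep only the host (and port) part
    # Replace every character outside [A-Za-z0-9_.-] with '-'.
    printf 'gs-%s\n' "$(printf '%s' "$host" | tr -c 'A-Za-z0-9_.-' '-')"
}

container_name_for https://www.example.com/   # prints gs-www.example.com
```

The result can then be passed as `--name "$(container_name_for https://www.example.com/)"` and reused with `docker logs`, `docker attach`, and `docker pause`.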

```sh
docker ps -a
```

You can then use the container ID or the name here:

```sh
docker logs -f ead9034470ed
```

##### Attach to grab-site process

You can attach local STDIN/STDOUT/STDERR to your running `grab-site` instance:

```sh
docker attach ead9034470ed
```

You can detach by pressing CTRL-p followed by CTRL-q, as documented in the [`docker attach` reference](https://docs.docker.com/engine/reference/commandline/attach/).

##### Suspend a grab-site crawl

You can pause a crawl by using [docker pause](https://docs.docker.com/engine/reference/commandline/pause/):

```sh
docker pause ead9034470ed
```

Resume it using [docker unpause](https://docs.docker.com/engine/reference/commandline/unpause/):

```sh
docker unpause ead9034470ed
```

### Warnings

If you pay no attention to your crawls, a crawl may head down some infinite bot
13 changes: 13 additions & 0 deletions entrypoint.sh
@@ -0,0 +1,13 @@
#!/usr/bin/env sh
set -eax

# If running any scripts we recognize, step-down to "grab-site" user.
# su-exec takes the target user as its first argument, followed by the command.
for v in grab-site gs-server gs-dump-urls; do
    if [ "$1" = "$v" ] && [ "$(id -u)" = '0' ]; then
        find /app \! -user grab-site -exec chown grab-site '{}' +
        find . \! -user grab-site -exec chown grab-site '{}' +
        exec su-exec grab-site "$@"
    fi
done

exec "$@"
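
The decision this entrypoint makes (drop privileges only when running as root and only for the recognized commands) can be sketched as a standalone function that is easy to test outside a container. `should_step_down` is a hypothetical name and is not part of the script above:

```sh
# Hypothetical helper mirroring the entrypoint's decision.
# $1 is the command name, $2 is the numeric uid of the caller.
# Returns 0 when privileges should be dropped to the grab-site user.
should_step_down() {
    cmd=$1
    uid=$2
    [ "$uid" = 0 ] || return 1
    case "$cmd" in
        grab-site|gs-server|gs-dump-urls) return 0 ;;
        *) return 1 ;;
    esac
}

if should_step_down gs-server 0; then
    echo "gs-server started as root would run as grab-site"
fi
```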