72 changes: 72 additions & 0 deletions .github/workflows/build.yml
@@ -0,0 +1,72 @@
name: Build

on:
workflow_dispatch:
inputs:
language-code:
description: 'the language of the wikimedia property e.g. tr - turkish, en - english'
required: true
default: 'en'
wiki-type:
description: 'the type of the wikimedia property e.g. wikipedia, wikiquote'
required: true
default: 'wikipedia'
tag:
description: 'the tag of the wikimedia property e.g. all, top'
required: true
default: 'all'
edition:
description: 'the edition of the wikimedia property e.g. maxi, mini'
required: true
default: 'maxi'
date:
description: 'the date of the wikimedia property e.g. latest'
required: true
default: 'latest'
hosting-dns-domain:
description: 'the DNS domain name the mirror will be hosted at e.g. tr.wikipedia-on-ipfs.org'
required: false
default: ''
hosting-ipns-hash:
description: 'the IPNS hash the mirror will be hosted at e.g. QmVH1VzGBydSfmNG7rmdDjAeBZ71UVeEahVbNpFQtwZK8W'
required: false
default: ''
main-page-version:
description: 'an override hack used on Turkish Wikipedia, it sets the main page version as there are issues with the Kiwix version id'
required: false
default: ''

jobs:
build:
runs-on: ubuntu-latest
env:
AWS_S3_BUCKET: wikipedia-on-ipfs
AWS_REGION: eu-central-1
steps:
- uses: actions/checkout@v2
- uses: ./
with:
language-code: ${{ github.event.inputs.language-code }}
wiki-type: ${{ github.event.inputs.wiki-type }}
tag: ${{ github.event.inputs.tag }}
edition: ${{ github.event.inputs.edition }}
date: ${{ github.event.inputs.date }}
hosting-dns-domain: ${{ github.event.inputs.hosting-dns-domain }}
hosting-ipns-hash: ${{ github.event.inputs.hosting-ipns-hash }}
main-page-version: ${{ github.event.inputs.main-page-version }}
- run: |
sudo chown -R $USER tmp
cd tmp
for d in *; do
if [[ -d "${d}" ]]; then
echo "Processing ${d} ..."
tar -czf "${d}.tar.gz" "${d}"
aws s3 cp "${d}.tar.gz" "s3://${{ env.AWS_S3_BUCKET }}/website-packages/${d}.tar.gz" \
--acl 'public-read' --metadata "Name=${d},Url=${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
echo "::notice name=You can now publish $d::publish_website_from_s3.sh '${d}'"
fi
done
shell: bash
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
62 changes: 34 additions & 28 deletions Dockerfile
@@ -1,28 +1,34 @@
FROM debian:stable

ENV DEBIAN_FRONTEND=noninteractive

RUN apt update
RUN apt -y install --no-install-recommends git ca-certificates curl wget apt-utils

# install:
# - node and yarn
# - go-ipfs
RUN curl -sL https://deb.nodesource.com/setup_14.x -o nodesource_setup.sh \
&& bash nodesource_setup.sh \
&& apt -y install --no-install-recommends nodejs \
&& npm install -g yarn \
&& wget -nv https://dist.ipfs.io/go-ipfs/v0.8.0/go-ipfs_v0.8.0_linux-amd64.tar.gz \
&& tar xvfz go-ipfs_v0.8.0_linux-amd64.tar.gz \
&& mv go-ipfs/ipfs /usr/local/bin/ipfs \
&& rm -r go-ipfs && rm go-ipfs_v0.8.0_linux-amd64.tar.gz \
&& ipfs init -p server,local-discovery,flatfs,randomports --empty-repo \
&& ipfs config --json 'Experimental.ShardingEnabled' true

# TODO: move repo init after external volume is mounted

ENV DEBIAN_FRONTEND=dialog

RUN mkdir /root/distributed-wikipedia-mirror
VOLUME ["/root/distributed-wikipedia-mirror"]
WORKDIR /root/distributed-wikipedia-mirror
# This Dockerfile creates a self-contained image in which mirrorzim.sh can be executed
#
# You can build the image as follows (remember to use this repo as context for the build):
# docker build . -f Dockerfile -t distributed-wikipedia-mirror
#
# You can then run the container anywhere as follows
# docker run --rm -v $(pwd)/snapshots:/github/workspace/snapshots -v $(pwd)/tmp:/github/workspace/tmp distributed-wikipedia-mirror <mirrorzim.sh arguments>
# NOTE(s):
# - volume attached at /github/workspace/snapshots will contain downloaded zim files after the run
# - volume attached at /github/workspace/tmp will contain created website directories after the run

FROM openzim/zim-tools:3.1.0 AS openzim

FROM node:16.14.0-buster-slim

RUN apt update && apt upgrade -y && apt install -y curl wget rsync

COPY --from=openzim /usr/local/bin/zimdump /usr/local/bin

COPY tools/docker_entrypoint.sh /usr/local/bin

RUN mkdir -p /github/distributed-wikipedia-mirror
RUN mkdir -p /github/distributed-wikipedia-mirror/snapshots
RUN mkdir -p /github/distributed-wikipedia-mirror/tmp
RUN mkdir -p /github/workspace

COPY . /github/distributed-wikipedia-mirror

RUN cd /github/distributed-wikipedia-mirror && yarn

VOLUME [ "/github/workspace" ]

WORKDIR /github/distributed-wikipedia-mirror
ENTRYPOINT [ "docker_entrypoint.sh" ]
16 changes: 10 additions & 6 deletions README.md
@@ -136,7 +136,7 @@ This step won't be necessary when automatic sharding lands in go-ipfs (wip).

### Step 3: Download the latest snapshot from kiwix.org

Source of ZIM files is at https://download.kiwix.org/zim/wikipedia/
Make sure you download `_all_maxi_` snapshots, as those include images.

To automate this, you can also use the `getzim.sh` script:
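For example, the following sketch mirrors the way `mirrorzim.sh` calls the script later in this PR; the Turkish-Wikipedia arguments are illustrative only:

```sh
# Sketch: getzim.sh download <wiki> <wiki> <language> <tag> <edition> <date>
# "tr", "all", "maxi" and "latest" are example values, not requirements.
./tools/getzim.sh download wikipedia wikipedia tr all maxi latest
```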
@@ -164,8 +164,8 @@ $ zimdump dump ./snapshots/wikipedia_tr_all_maxi_2021-01.zim --dir ./tmp/wikiped

> ### ℹ️ ZIM's main page
>
> Each ZIM file has a "main page" attribute which defines the landing page set for the ZIM archive.
> It is often different from the "main page" of upstream Wikipedia.
> The Kiwix main page needs to be passed in the next step, so until there is an automated way to determine the "main page" of a ZIM, you need to open the ZIM in Kiwix reader and eyeball the name of the landing page.

### Step 5: Convert the unpacked zim directory to a website with mirror info
@@ -242,7 +242,7 @@ Make sure at least two full reliable copies exist before updating DNSLink.

## mirrorzim.sh

It is possible to automate steps 3-6 via a wrapper script named `mirrorzim.sh`.
It will download the latest snapshot of the specified language (if needed), unpack it, and add it to IPFS.

To see how the script behaves try running it on one of the smallest wikis, such as `cu`:
@@ -253,9 +253,9 @@ $ ./mirrorzim.sh --languagecode=cu --wikitype=wikipedia --hostingdnsdomain=cu.wi

## Docker build

A `Dockerfile` with all the software requirements is provided.
For now it is only a handy container for running the process on non-Linux
systems or if you don't want to pollute your system with all the dependencies.
In the future it will be an end-to-end black box that takes a ZIM file and spits out a CID
and repo.
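A typical invocation, following the commands noted at the top of the updated `Dockerfile` (the `cu` Wikipedia arguments are only an example), would look like this:

```sh
# Build the image using this repository as the build context
docker build . -f Dockerfile -t distributed-wikipedia-mirror

# Run mirrorzim.sh inside the container; the mounted volumes keep the
# downloaded ZIM files and the generated website directories on the host.
docker run --rm \
  -v $(pwd)/snapshots:/github/workspace/snapshots \
  -v $(pwd)/tmp:/github/workspace/tmp \
  distributed-wikipedia-mirror --languagecode=cu --wikitype=wikipedia
```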

Expand Down Expand Up @@ -340,3 +340,7 @@ We are working on improving deduplication between snapshots, but for now YMMV.
## Code

If you would like to contribute more to this effort, look at the [issues](https://github.com/ipfs/distributed-wikipedia-mirror/issues) in this github repo. Especially check for [issues marked with the "wishlist" label](https://github.com/ipfs/distributed-wikipedia-mirror/labels/wishlist) and issues marked ["help wanted"](https://github.com/ipfs/distributed-wikipedia-mirror/labels/help%20wanted).

## GitHub Actions Workflow

The GitHub Actions workflow available in this repository takes information about the wiki you want to mirror, downloads its ZIM file, unpacks it, converts it to a website, and uploads the result to S3 as a publicly accessible tar.gz package.
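The workflow is triggered manually via `workflow_dispatch`. As a sketch, assuming you use the GitHub CLI, a run for Turkish Wikipedia could be dispatched like this (input names come from `.github/workflows/build.yml`; the values are illustrative):

```sh
# Dispatch the Build workflow with explicit inputs.
gh workflow run build.yml \
  -f language-code=tr \
  -f wiki-type=wikipedia \
  -f tag=all \
  -f edition=maxi \
  -f date=latest \
  -f hosting-dns-domain=tr.wikipedia-on-ipfs.org
```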
50 changes: 50 additions & 0 deletions action.yml
@@ -0,0 +1,50 @@
name: Build
description: Download a zim file, unpack it, convert to website
inputs:
language-code:
description: 'the language of the wikimedia property e.g. tr - turkish, en - english'
required: true
default: 'en'
wiki-type:
description: 'the type of the wikimedia property e.g. wikipedia, wikiquote'
required: true
default: 'wikipedia'
tag:
description: 'the tag of the wikimedia property e.g. all, top'
required: true
default: 'all'
edition:
description: 'the edition of the wikimedia property e.g. maxi, mini'
required: true
default: 'maxi'
date:
description: 'the date of the wikimedia property e.g. latest'
required: true
default: 'latest'
hosting-dns-domain:
description: 'the DNS domain name the mirror will be hosted at e.g. tr.wikipedia-on-ipfs.org'
required: false
default: ''
hosting-ipns-hash:
description: 'the IPNS hash the mirror will be hosted at e.g. QmVH1VzGBydSfmNG7rmdDjAeBZ71UVeEahVbNpFQtwZK8W'
required: false
default: ''
main-page-version:
description: 'an override hack used on Turkish Wikipedia, it sets the main page version as there are issues with the Kiwix version id'
required: false
default: ''
outputs:
time: # id of output
description: 'The time we greeted you'
runs:
using: docker
image: Dockerfile
args:
- '--languagecode=${{ inputs.language-code }}'
- '--wikitype=${{ inputs.wiki-type }}'
- '--tag=${{ inputs.tag }}'
- '--edition=${{ inputs.edition }}'
- '--date=${{ inputs.date }}'
- '--hostingdnsdomain=${{ inputs.hosting-dns-domain }}'
- '--hostingipnshash=${{ inputs.hosting-ipns-hash }}'
- '--mainpageversion=${{ inputs.main-page-version }}'
76 changes: 54 additions & 22 deletions mirrorzim.sh
@@ -11,19 +11,25 @@ usage() {
echo ""
echo "SYNOPSIS"
echo " $0 --languagecode=<LANGUAGE_CODE> --wikitype=<WIKI_TYPE>"
echo " [--tag=<TAG>]"
echo " [--edition=<EDITION>]"
echo " [--hostingdnsdomain=<HOSTING_DNS_DOMAIN>]"
echo " [--hostingipnshash=<HOSTING_IPNS_HASH>]"
echo " [--mainpageversion=<MAIN_PAGE_VERSION>]"
echo " [--push=<true|false>]"
echo ""
echo "OPTIONS"
echo ""
echo " -l, --languagecode string - the language of the wikimedia property e.g. tr - turkish, en - english"
echo " -w, --wikitype string - the type of the wikimedia property e.g. wikipedia, wikiquote"
echo " -d, --hostingdnsdomain string - the DNS domain name the mirror will be hosted at e.g. tr.wikipedia-on-ipfs.org"
echo " -i, --hostingipnshash string - the IPNS hash the mirror will be hosted at e.g. QmVH1VzGBydSfmNG7rmdDjAeBZ71UVeEahVbNpFQtwZK8W"
echo " -v, --mainpageversion string - an override hack used on Turkish Wikipedia, it sets the main page version as there are issues with the Kiwix version id"

exit 2
echo " -l, --languagecode string - the language of the wikimedia property e.g. tr - turkish, en - english"
echo " -w, --wikitype string - the type of the wikimedia property e.g. wikipedia, wikiquote"
echo " -t, --tag string - the tag of the wikimedia property e.g. all, top (defaults to all)"
echo " -e, --edition string - the edition of the wikimedia property e.g. maxi, mini (defaults to maxi)"
echo " -c, --date string - the date of the wikimedia property e.g. latest (defaults to latest)"
echo " -d, --hostingdnsdomain string - the DNS domain name the mirror will be hosted at e.g. tr.wikipedia-on-ipfs.org"
echo " -i, --hostingipnshash string - the IPNS hash the mirror will be hosted at e.g. QmVH1VzGBydSfmNG7rmdDjAeBZ71UVeEahVbNpFQtwZK8W"
echo " -v, --mainpageversion string - an override hack used on Turkish Wikipedia, it sets the main page version as there are issues with the Kiwix version id"
echo " -p, --push boolean - push to local ipfs instance (defaults to true)"
exit 2
}


@@ -38,6 +44,18 @@ case $i in
WIKI_TYPE="${i#*=}"
shift
;;
-t=*|--tag=*)
TAG="${i#*=}"
shift
;;
-e=*|--edition=*)
EDITION="${i#*=}"
shift
;;
-c=*|--date=*)
DATE="${i#*=}"
shift
;;
-d=*|--hostingdnsdomain=*)
HOSTING_DNS_DOMAIN="${i#*=}"
shift
@@ -50,6 +68,10 @@ case $i in
MAIN_PAGE_VERSION="${i#*=}"
shift
;;
-p=*|--push=*)
PUSH="${i#*=}"
shift
;;
--default)
DEFAULT=YES
shift
@@ -70,6 +92,18 @@ if [ -z ${WIKI_TYPE+x} ]; then
usage
fi

if [ -z ${TAG+x} ]; then
TAG="all"
fi

if [ -z ${EDITION+x} ]; then
EDITION="maxi"
fi

if [ -z ${DATE+x} ]; then
DATE="latest"
fi

if [ -z ${HOSTING_DNS_DOMAIN+x} ]; then
HOSTING_DNS_DOMAIN=""
fi
@@ -82,12 +116,16 @@ if [ -z ${MAIN_PAGE_VERSION+x} ]; then
MAIN_PAGE_VERSION=""
fi

if [ -z ${PUSH+x} ]; then
PUSH="true"
fi

printf "\nEnsure zimdump is present...\n"
PATH=$PATH:$(realpath ./bin)
which zimdump &> /dev/null || (curl --progress-bar -L https://download.openzim.org/release/zim-tools/zim-tools_linux-x86_64-3.0.0.tar.gz | tar -xvz --strip-components=1 -C ./bin zim-tools_linux-x86_64-3.0.0/zimdump && chmod +x ./bin/zimdump)

printf "\nDownload and verify the zim file...\n"
ZIM_FILE_SOURCE_URL="$(./tools/getzim.sh download $WIKI_TYPE $WIKI_TYPE $LANGUAGE_CODE all maxi latest | grep 'URL:' | cut -d' ' -f3)"
ZIM_FILE_SOURCE_URL="$(./tools/getzim.sh download $WIKI_TYPE $WIKI_TYPE $LANGUAGE_CODE $TAG $EDITION $DATE | grep 'URL:' | cut -d' ' -f3)"
ZIM_FILE=$(echo $ZIM_FILE_SOURCE_URL | rev | cut -d'/' -f1 | rev)
TMP_DIRECTORY="./tmp/$(echo $ZIM_FILE | cut -d'.' -f1)"

@@ -116,17 +154,11 @@ node ./bin/run $TMP_DIRECTORY \
${HOSTING_IPNS_HASH:+--hostingipnshash=$HOSTING_IPNS_HASH} \
${MAIN_PAGE_VERSION:+--mainpageversion=$MAIN_PAGE_VERSION}

printf "\n-------------------------\n"
printf "\nIPFS_PATH=$IPFS_PATH\n"

printf "\nAdding the processed tmp directory to IPFS\n(this part may take long time on a slow disk):\n"
CID=$(ipfs add -r --cid-version 1 --pin=false --offline -Qp $TMP_DIRECTORY)
MFS_DIR="/${ZIM_FILE}__$(date +%F_%T)"

# pin by adding to MFS under a meaningful name
ipfs files cp /ipfs/$CID "$MFS_DIR"

printf "\n\n-------------------------\nD O N E !\n-------------------------\n"
printf "MFS: $MFS_DIR\n"
printf "CID: $CID"
printf "\n-------------------------\n"
if [[ "$PUSH" == "true" ]]; then
./tools/add_website_to_ipfs.sh "$ZIM_FILE" "$TMP_DIRECTORY" "-p"
else
printf "\n\n-------------------------\nD O N E !\n-------------------------\n"
printf "ZIM: $ZIM_FILE\n"
printf "TMP: $TMP_DIRECTORY"
printf "\n-------------------------\n"
fi
3 changes: 3 additions & 0 deletions packer/README.md
@@ -0,0 +1,3 @@
The Packer configuration that resides here creates an AMI in which (see the build sketch after this list):
- ipfs service is started on machine boot
- `publish_website_from_s3.sh` is available
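A minimal build sketch, assuming Packer is installed, the templates use the HCL2 layout, and the command is run from the repository root (any variables the templates actually require are omitted here):

```sh
# Build every template found in the packer/ directory.
cd packer
packer build .
```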