
Build and Deploy

Tom Barber edited this page Feb 13, 2022 · 17 revisions

Development using Gitpod

You can of course develop Sparkler on your local machine, but that requires a fair amount of setup: installing a JDK, Scala, SBT, and Elasticsearch or Solr. To skip that work, you can develop in a Gitpod environment instead.

Gitpod.io lets you develop in VS Code or IntelliJ in a remote environment that is already built and tested against the latest Sparkler source code.

To get started, simply hit the Open in Gitpod button here or elsewhere on this platform. You'll also see integration with the PR and issue tracker, making it easy to test and verify fixes and pull requests.

What happens?

When you launch an environment, if the prebuild succeeded there will be a build executable in the sparkler-core/build/ directory, which you can run as in the other instructions (./bin/sparkler.sh inject etc.). If not, follow the build instructions below to build Sparkler.

Once built, there is a pre-configured Elasticsearch instance running in your Gitpod environment, so you can run Sparkler with no further configuration. To access the results, query Elasticsearch as follows:

https://9200-.ws-eu31.gitpod.io/crawldb/_search?q=crawl_id:

A slightly more refined query may look like this:

https://9200-.ws-eu31.gitpod.io/crawldb/_search?q=crawl_id:&_source_includes=url,status&pretty&size=200
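From a terminal inside the Gitpod workspace you can address the same query to the local Elasticsearch port instead of the public workspace URL. A minimal sketch, assuming Elasticsearch on its default port 9200 and a hypothetical job id sjob-1:

```shell
# Hypothetical job id -- substitute the crawl_id of your own run
JOB_ID="sjob-1"

# Same query parameters as above, addressed to the local Elasticsearch port
QUERY_URL="http://localhost:9200/crawldb/_search?q=crawl_id:${JOB_ID}&_source_includes=url,status&pretty&size=200"

echo "$QUERY_URL"
# Fetch the results (requires the crawl to have run):
# curl "$QUERY_URL"
```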

Using prebuilt Docker image

docker run -it uscdatascience/sparkler
# Tagging lets bin/dockler.sh use this downloaded image instead of rebuilding from scratch
docker tag uscdatascience/sparkler sparkler-local

If you prefer to build the latest image from source code, use the instructions below.

Docker build

cd to the root directory of the project and issue the following commands:

$ bin/dockler.sh

When the script asks 'Y/N', press 'Y'. The script will:

  • Build this project (mvn and git are required)
  • Build a Docker image named sparkler-local (the docker command is required)
  • Start a Docker container
  • Start Solr
  • Give you a bash shell inside the Docker container

Inside the Docker container

  • /data/solr/bin/solr - start/stop Solr with this tool
  • /data/sparkler/bin/sparkler.sh - CLI interface to Sparkler

Test the build

# inject a seed url, assign a job id to it
/data/sparkler/bin/sparkler.sh inject -id sjob-1 -su https://isi.edu
# Crawl it
/data/sparkler/bin/sparkler.sh crawl -id sjob-1
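To check what the crawl stored, you can query Solr from inside the container. A minimal sketch, assuming Solr on its default port 8983 and a core named crawldb (both are assumptions; check the Solr admin UI if your setup differs):

```shell
# Hypothetical job id from the inject step above
JOB_ID="sjob-1"

# Solr select query; 'crawldb' core name is an assumption --
# fl limits returned fields, rows caps the result count
SOLR_URL="http://localhost:8983/solr/crawldb/select?q=crawl_id:${JOB_ID}&fl=url,status&rows=200"

echo "$SOLR_URL"
# Inside the container:
# curl "$SOLR_URL"
```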

NOTE: if you would like to build the Docker image directly:

docker build -f sparkler-deployment/docker/Dockerfile . -t sparkler-local
docker run -it -p 8984:8983 sparkler-local
# inside
sparkler@inside # /data/solr/bin/solr start
sparkler@inside # /data/sparkler/bin/sparkler.sh [crawl|inject] -h

Local or native jar build

Requirements

To build:

  • Apache Maven (Tested on v3.3.x)
  • JDK (Tested on Oracle JDK 1.8)
  • Working internet connection to retrieve maven dependencies

The following dependencies will be downloaded from Maven Central. Look inside pom.xml for the current versions in use.

  • Apache Spark
  • Apache Nutch
  • Apache Kafka Client
  • Apache Solr Client
  • Scala

Note that libraries like the Solr client, Spark, and Kafka should match your own deployment versions. For instance, if you have a Spark cluster running v1.6 with Scala 2.11, set the client libraries in pom.xml to those same versions.
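Version pinning in a Maven build is typically done through properties in pom.xml. The snippet below is illustrative only; the property names are hypothetical, so check Sparkler's actual pom.xml for the real property names before editing:

```xml
<!-- Illustrative only: match these to your cluster's deployed versions.
     Property names are hypothetical; see the project's pom.xml. -->
<properties>
  <scala.version>2.11</scala.version>
  <spark.version>1.6.3</spark.version>
</properties>
```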

mvn clean compile package

This should produce a build directory containing everything (except Solr) required to run Sparkler. For Solr setup, see this page.
