
Commit 0eb8767

Oss docker (#503)
* initial commit
* resolve conflicts
* cleanup
* typo
* break up into layers and add genomics libs
* add root user
* add libncurses
* fix pip path
* fix sudo
* update readme and getting started guide
* update readme
* update readme
* parameterize build scripts
* remove minus-ganglia from push

Signed-off-by: William Brandler <william.brandler@databricks.com>
1 parent 0dd8c7e commit 0eb8767

File tree

7 files changed, +274 -33 lines changed

docker/README.md

Lines changed: 12 additions & 6 deletions

@@ -2,11 +2,14 @@
 
 As of this time the following are supported:
 
-* Glow 1.1.2 + Databricks Runtime (DBR) 9.1 (Spark 3.1)
+* Glow 1.1.2 + connectors to Azure Data Lake, Google Cloud Storage, Amazon Web Services (S3), Snowflake and Delta Lake (via data mechanics' Spark image)
+* Glow 1.1.2 + Databricks Runtime (DBR) 9.1 (Spark 3.1) + Ganglia
 * Hail 0.2.78 + DBR 9.1 (Spark 3.1)
 
-These Dockerfiles are built to run on Databricks,
-but can be adapted to run Glow & Hail in the open source,
+The containers are hosted on the [projectglow dockerhub](https://hub.docker.com/u/projectglow).
+Please see the Glow [Getting Started](https://glow.readthedocs.io/en/latest/getting-started.html) guide for documentation on how to use the containers.
+
+## Building the containers
 
 ##### Troubleshooting
 

@@ -25,8 +28,8 @@ export COMPOSE_DOCKER_CLI_BUILD=0
 
 Please see this [stack overflow post](https://stackoverflow.com/questions/64221861/an-error-failed-to-solve-with-frontend-dockerfile-v0) for explanation.
 
-Note: Docker builds may run out of memory, please increase
-Docker's default memorry setting, which is 2.0 GB, via Preferences -> Resources -> Advanced.
+Important: Docker builds may run out of memory; please increase
+Docker's default memory setting, which is 2.0 GB, via Docker Desktop -> Preferences -> Resources -> Advanced.
 
 To learn more about contributing to these images, please review the Glow [contributing guide](https://glow.readthedocs.io/en/latest/contributing.html#add-libraries-to-the-glow-docker-environment)
 

@@ -37,9 +40,12 @@ Ganglia is an optional layer for monitoring cluster metrics such as CPU load.
 
 ![Docker layer architecture](../static/glow_genomics_docker_image_architecture.png?raw=true "Glow Docker layer architecture")
 
+The open source version of this architecture, which runs outside of Databricks, is simpler,
+with a base layer that pulls from data mechanics' Spark image, followed by the ```genomics``` and ```genomics-with-glow``` layers.
+
 
 ### Build the docker images as follows:
 
-run ```docker/databricks/build.sh``` to build all of the layers.
+Run ```docker/databricks/build.sh``` or ```docker/open-source-glow/build.sh``` to build all of the layers.
 
 To build any layer individually, change directory into the layer and run:
 

docker/databricks/build.sh

Lines changed: 20 additions & 19 deletions

@@ -3,25 +3,26 @@
 #
 # Usage: ./build.sh
 
-DOCKER_REPOSITORY="projectglow"
+DOCKER_HUB="projectglow"
+DATABRICKS_RUNTIME_VERSION="9.1"
+GLOW_VERSION="1.1.2"
+HAIL_VERSION="0.2.85"
 
-# Add commands to build DBR 9.1 images below
-pushd dbr/dbr9.1/
-docker build -t "${DOCKER_REPOSITORY}/minimal:9.1" minimal/
-docker build -t "${DOCKER_REPOSITORY}/python:9.1" python/
-docker build -t "${DOCKER_REPOSITORY}/dbfsfuse:9.1" dbfsfuse/
-docker build -t "${DOCKER_REPOSITORY}/standard:9.1" standard/
-docker build -t "${DOCKER_REPOSITORY}/with-r:9.1" r/
-docker build -t "${DOCKER_REPOSITORY}/genomics:9.1" genomics/
-docker build -t "${DOCKER_REPOSITORY}/databricks-hail:0.2.85" genomics-with-hail/
-docker build -t "${DOCKER_REPOSITORY}/databricks-glow-minus-ganglia:1.1.2" genomics-with-glow/
-docker build -t "${DOCKER_REPOSITORY}/databricks-glow:1.1.2" ganglia/
-docker build -t "${DOCKER_REPOSITORY}/databricks-glow-minus-ganglia:9.1" genomics-with-glow/
-docker build -t "${DOCKER_REPOSITORY}/databricks-glow:9.1" ganglia/
+# Add commands to build images below
+pushd dbr/dbr$DATABRICKS_RUNTIME_VERSION/
+docker build -t "${DOCKER_HUB}/minimal:${DATABRICKS_RUNTIME_VERSION}" minimal/
+docker build -t "${DOCKER_HUB}/python:${DATABRICKS_RUNTIME_VERSION}" python/
+docker build -t "${DOCKER_HUB}/dbfsfuse:${DATABRICKS_RUNTIME_VERSION}" dbfsfuse/
+docker build -t "${DOCKER_HUB}/standard:${DATABRICKS_RUNTIME_VERSION}" standard/
+docker build -t "${DOCKER_HUB}/with-r:${DATABRICKS_RUNTIME_VERSION}" r/
+docker build -t "${DOCKER_HUB}/genomics:${DATABRICKS_RUNTIME_VERSION}" genomics/
+docker build -t "${DOCKER_HUB}/databricks-hail:${HAIL_VERSION}" genomics-with-hail/
+docker build -t "${DOCKER_HUB}/databricks-glow-minus-ganglia:${GLOW_VERSION}" genomics-with-glow/
+docker build -t "${DOCKER_HUB}/databricks-glow:${GLOW_VERSION}" ganglia/
+docker build -t "${DOCKER_HUB}/databricks-glow-minus-ganglia:${DATABRICKS_RUNTIME_VERSION}" genomics-with-glow/
+docker build -t "${DOCKER_HUB}/databricks-glow:${DATABRICKS_RUNTIME_VERSION}" ganglia/
 popd
 
-docker push "${DOCKER_REPOSITORY}/databricks-hail:0.2.85"
-docker push "${DOCKER_REPOSITORY}/databricks-glow:1.1.2"
-docker push "${DOCKER_REPOSITORY}/databricks-glow:9.1"
-docker push "${DOCKER_REPOSITORY}/databricks-glow-minus-ganglia:1.1.2"
-docker push "${DOCKER_REPOSITORY}/databricks-glow-minus-ganglia:9.1"
+docker push "${DOCKER_HUB}/databricks-hail:${HAIL_VERSION}"
+docker push "${DOCKER_HUB}/databricks-glow:${GLOW_VERSION}"
+docker push "${DOCKER_HUB}/databricks-glow:${DATABRICKS_RUNTIME_VERSION}"
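Parameterizing the script this way means a runtime or Glow version bump only touches the variables at the top; every image tag is composed by ordinary shell variable substitution. A minimal sketch of that expansion (runnable without Docker; the values mirror the script above):

```shell
# How the build script's variables compose image tags
DOCKER_HUB="projectglow"
DATABRICKS_RUNTIME_VERSION="9.1"
GLOW_VERSION="1.1.2"

echo "${DOCKER_HUB}/genomics:${DATABRICKS_RUNTIME_VERSION}"   # projectglow/genomics:9.1
echo "${DOCKER_HUB}/databricks-glow:${GLOW_VERSION}"          # projectglow/databricks-glow:1.1.2
```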

docker/open-source-glow/build.sh

Lines changed: 13 additions & 0 deletions

@@ -0,0 +1,13 @@
+#!/bin/bash -xue
+# Builds all the Docker images
+#
+# Usage: ./build.sh
+
+DOCKER_HUB="projectglow"
+GLOW_VERSION="1.1.2"
+
+# Add commands to build DBR 9.1 images below
+docker build -t "${DOCKER_HUB}/open-source-base:${GLOW_VERSION}" datamechanics/
+docker build -t "${DOCKER_HUB}/open-source-genomics:${GLOW_VERSION}" genomics/
+docker build -t "${DOCKER_HUB}/open-source-glow:${GLOW_VERSION}" genomics-with-glow/
+docker push "${DOCKER_HUB}/open-source-glow:${GLOW_VERSION}"
Lines changed: 33 additions & 0 deletions

@@ -0,0 +1,33 @@
+#builds off Docker images for Apache Spark by Data Mechanics
+#this image includes connectors to
+# - Azure blob and datalake
+# - Google cloud storage
+# - Amazon AWS S3
+# - Snowflake
+# - deltalake
+#to learn more, see https://hub.docker.com/r/datamechanics/spark
+FROM gcr.io/datamechanics/spark:3.1.2-hadoop-3.2.0-java-11-scala-2.12-python-3.8-dm16
+LABEL author="Edoardo Giacopuzzi"
+LABEL contact="edoardo.giacopuzzi@fht.org"
+LABEL spark_version="3.1.2"
+LABEL hadoop_version="3.2.0"
+LABEL java_version="11"
+LABEL scala_version="2.12"
+LABEL deltalake_version="1.0.0"
+LABEL glowgr_version="spark3-1.1.2"
+LABEL description="Spark with Glow support and glow.py"
+
+ENV PYSPARK_MAJOR_PYTHON_VERSION=3
+
+#Install some glow.py dependencies in the base conda env
+RUN conda update conda \
+    && conda update --all \
+    && conda config --set channel_priority false \
+    && conda install -c bioconda -c conda-forge \
+    nptyping=1.3.0 \
+    numpy>=1.18.1 \
+    opt_einsum>=3.2.0 \
+    pandas>=1.0.1 \
+    statsmodels>=0.10.0 \
+    typeguard=2.9.1 \
+    pyarrow>=1.0.1
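One caveat in the conda install above: a Dockerfile RUN instruction is executed by the shell, which treats the unquoted `>` in specifiers like `numpy>=1.18.1` as output redirection. Conda may therefore receive only the bare package name while a stray file named `=1.18.1` is created; quoting each spec (e.g. `'numpy>=1.18.1'`) avoids this. A quick demonstration of the shell behavior (scratch directory, no conda needed):

```shell
cd "$(mktemp -d)"
# Unquoted: the shell redirects 'echo numpy' into a file named '=1.18.1'
echo numpy>=1.18.1
ls                     # shows the stray file '=1.18.1'
# Quoted: the full version spec survives as a single argument
echo 'numpy>=1.18.1'
```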
Lines changed: 35 additions & 0 deletions

@@ -0,0 +1,35 @@
+FROM projectglow/open-source-genomics:1.1.2 AS genomics
+LABEL author="Edoardo Giacopuzzi"
+LABEL contact="edoardo.giacopuzzi@fht.org"
+LABEL spark_version="3.1.2"
+LABEL hadoop_version="3.2.0"
+LABEL java_version="11"
+LABEL scala_version="2.12"
+LABEL deltalake_version="1.0.0"
+LABEL glowgr_version="spark3-1.1.2"
+LABEL description="Spark with Glow support and glow.py"
+
+USER root
+WORKDIR /opt
+
+ENV GLOW_VERSION=1.1.2
+ENV SCALA_LOGGING_VERSION=3.7.2
+ENV PICARD_VERSION=2.23.3
+ENV HTSJDK_VERSION=2.21.2
+ENV NETTY_VERSION=3.9.9
+ENV JDBI_VERSION=2.78
+ENV HADOOP_BAM_VERSION=7.9.2
+
+#Download Glow JAR and its dependencies
+RUN wget https://repo1.maven.org/maven2/io/projectglow/glow-spark3_2.12/${GLOW_VERSION}/glow-spark3_2.12-${GLOW_VERSION}.jar \
+    && wget https://repo1.maven.org/maven2/com/typesafe/scala-logging/scala-logging_2.12/${SCALA_LOGGING_VERSION}/scala-logging_2.12-${SCALA_LOGGING_VERSION}.jar \
+    && wget https://repo1.maven.org/maven2/com/github/broadinstitute/picard/${PICARD_VERSION}/picard-${PICARD_VERSION}.jar \
+    && wget https://repo1.maven.org/maven2/com/github/samtools/htsjdk/${HTSJDK_VERSION}/htsjdk-${HTSJDK_VERSION}.jar \
+    && wget https://repo1.maven.org/maven2/io/netty/netty/${NETTY_VERSION}.Final/netty-${NETTY_VERSION}.Final.jar \
+    && wget https://repo1.maven.org/maven2/org/jdbi/jdbi/${JDBI_VERSION}/jdbi-${JDBI_VERSION}.jar \
+    && wget https://repo1.maven.org/maven2/org/seqdoop/hadoop-bam/${HADOOP_BAM_VERSION}/hadoop-bam-${HADOOP_BAM_VERSION}.jar \
+    && mv *.jar /opt/spark/jars
+
+#Install Glow python interface
+RUN pip3 install --upgrade pip
+RUN pip3 install glow.py==${GLOW_VERSION}
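Each wget above follows the standard Maven repository layout (group path / artifact / version / artifact-version.jar), so bumping a dependency only requires editing its ENV line. For example, the Glow jar URL expands as:

```shell
# Sketch: how the Glow jar URL is composed from the version variable
GLOW_VERSION="1.1.2"
echo "https://repo1.maven.org/maven2/io/projectglow/glow-spark3_2.12/${GLOW_VERSION}/glow-spark3_2.12-${GLOW_VERSION}.jar"
```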
Lines changed: 144 additions & 0 deletions

@@ -0,0 +1,144 @@
+FROM projectglow/open-source-base:1.1.2 AS base
+USER root
+
+# ===== Set up base required libraries =============================================================
+
+RUN apt-get update && apt-get install -y \
+    apt-utils \
+    build-essential \
+    git \
+    apt-transport-https \
+    ca-certificates \
+    cpanminus \
+    libncurses-dev \
+    libpng-dev \
+    zlib1g-dev \
+    libbz2-dev \
+    liblzma-dev \
+    perl \
+    perl-base \
+    unzip \
+    curl \
+    gnupg2 \
+    software-properties-common \
+    jq \
+    libjemalloc2 \
+    libjemalloc-dev \
+    libdbi-perl \
+    libdbd-mysql-perl \
+    libdbd-sqlite3-perl \
+    zlib1g \
+    zlib1g-dev \
+    libxml2 \
+    libxml2-dev \
+    r-base \
+    r-base-dev \
+    && wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc \
+    && add-apt-repository "deb [arch=amd64,i386] https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/" \
+    && apt-get clean \
+    && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
+
+# ===== Set up VEP environment =====================================================================
+
+ENV OPT_SRC /opt/vep/src
+ENV PERL5LIB $PERL5LIB:$OPT_SRC/ensembl-vep:$OPT_SRC/ensembl-vep/modules
+RUN cpanm DBI && \
+    cpanm Set::IntervalTree && \
+    cpanm JSON && \
+    cpanm Text::CSV && \
+    cpanm Module::Build && \
+    cpanm PerlIO::gzip && \
+    cpanm IO::Uncompress::Gunzip
+
+RUN mkdir -p $OPT_SRC
+WORKDIR $OPT_SRC
+RUN git clone https://github.com/Ensembl/ensembl-vep.git
+WORKDIR ensembl-vep
+
+# The commit is the most recent one on release branch 100 as of July 29, 2020
+
+RUN git checkout 10932fab1e9c113e8e5d317e1f668413390344ac && \
+    perl INSTALL.pl --NO_UPDATE -AUTO a && \
+    perl INSTALL.pl -n -a p --PLUGINS AncestralAllele && \
+    chmod +x vep
+
+# ===== Set up samtools ============================================================================
+
+ENV SAMTOOLS_VERSION=1.9
+
+WORKDIR /opt
+RUN wget https://github.com/samtools/samtools/releases/download/${SAMTOOLS_VERSION}/samtools-${SAMTOOLS_VERSION}.tar.bz2 && \
+    tar -xjf samtools-1.9.tar.bz2
+WORKDIR samtools-1.9
+RUN ./configure && \
+    make && \
+    make install
+
+ENV PATH=${DEST_DIR}/samtools-{$SAMTOOLS_VERSION}:$PATH
+
+
+# ===== Set up htslib ==============================================================================
+# access htslib tools from the shell, for example,
+# %sh
+# /opt/htslib-1.9/tabix
+# /opt/htslib-1.9/bgzip
+
+WORKDIR /opt
+RUN wget https://github.com/samtools/htslib/releases/download/${SAMTOOLS_VERSION}/htslib-${SAMTOOLS_VERSION}.tar.bz2 && \
+    tar -xjvf htslib-1.9.tar.bz2
+WORKDIR htslib-1.9
+RUN ./configure && \
+    make && \
+    make install
+
+# ===== bgenix ==============================================================================
+# access bgenix from the shell, for example,
+# /opt/bgen/build/apps/bgenix
+
+RUN apt-get update && apt-get install -y \
+    npm
+
+RUN npm install --save sqlite3
+
+WORKDIR /opt
+RUN wget http://code.enkre.net/bgen/tarball/release/bgen.tgz && \
+    tar zxvf bgen.tgz && \
+    mv bgen.tgz bgen
+WORKDIR bgen
+RUN CXX=/usr/bin/g++ && \
+    CC=/usr/bin/gcc && \
+    ./waf configure && \
+    ./waf && \
+    ./build/test/unit/test_bgen && \
+    ./build/apps/bgenix -g example/example.16bits.bgen -list
+
+# ===== Set up MLR dependencies ====================================================================
+
+ENV QQMAN_VERSION=1.0.6
+RUN pip3 install qqman==$QQMAN_VERSION
+
+# ===== Set up R genomics packages =================================================================
+
+RUN R -e "install.packages('sim1000G',dependencies=TRUE,repos='https://cran.rstudio.com')" \
+    && R -e "install.packages('gplots',dependencies=TRUE,repos='http://cran.us.r-project.org')" \
+    && R -e "install.packages('bigsnpr',dependencies=TRUE,repos='http://cran.us.r-project.org')" \
+    && R -e "install.packages('ukbtools',dependencies=TRUE,repos='https://cran.rstudio.com')" \
+    && R -e "install.packages('qqman',dependencies=TRUE,repos='http://cran.us.r-project.org')"
+
+# ===== plink ==============================================================================
+# install both plink 1.07 and 1.9
+# access plink from the shell, for example,
+# v1.07
+# /opt/plink-1.07-x86_64/plink --noweb
+# v1.90
+# /opt/plink --noweb
+
+WORKDIR /opt
+RUN wget http://zzz.bwh.harvard.edu/plink/dist/plink-1.07-x86_64.zip && \
+    unzip plink-1.07-x86_64.zip
+RUN wget http://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20200616.zip && \
+    unzip plink_linux_x86_64_20200616.zip
+
+# ===== Reset current directory ====================================================================
+
+WORKDIR /root
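One line worth double-checking when adapting this Dockerfile is the samtools PATH export: in `ENV PATH=${DEST_DIR}/samtools-{$SAMTOOLS_VERSION}:$PATH`, the braces in `{$SAMTOOLS_VERSION}` are placed outside the `$`, so they survive expansion as literal characters, and `DEST_DIR` is never defined earlier in the file. The PATH entry therefore likely points at a nonexistent directory; the image still works because `make install` already copies samtools into `/usr/local/bin`. The brace-placement difference in a runnable sketch:

```shell
SAMTOOLS_VERSION="1.9"
echo "samtools-{$SAMTOOLS_VERSION}"   # braces are literal -> samtools-{1.9}
echo "samtools-${SAMTOOLS_VERSION}"   # intended expansion -> samtools-1.9
```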

docs/source/getting-started.rst

Lines changed: 17 additions & 8 deletions

@@ -73,12 +73,6 @@ Glow requires Apache Spark 3.2.0.
     val sess = Glow.register(spark)
     val df = sess.read.format("vcf").load(path)
 
-Notebooks embedded in the docs
-------------------------------
-
-Documentation pages are accompanied by embedded notebook examples. Most code in these notebooks can be run on Spark and Glow alone, but functions such as ``display()`` or ``dbutils()`` are only available on Databricks. See :ref:`dbnotebooks` for more info.
-
-These notebooks are located in the Glow github repository `here <https://github.com/projectglow/glow/blob/master/docs/source/_static/zzz_GENERATED_NOTEBOOK_SOURCE/>`_ and are tested nightly end-to-end. They include notebooks to define constants such as the number of samples to simulate and the output paths for each step in the pipeline. Notebooks that define constants are ``%run`` at the start of each notebook in the documentation. Please see :ref:`data_simulation` to get started.
 
 Getting started on Databricks
 -----------------------------

@@ -89,8 +83,23 @@ The Databricks documentation shows how to get started with Glow on,
 - **Microsoft Azure** (`docs <https://docs.microsoft.com/en-us/azure/databricks/applications/genomics/tertiary-analytics/glow>`_)
 - **Google Cloud Platform** (GCP - `docs <https://docs.gcp.databricks.com/applications/genomics/tertiary-analytics/glow.html>`_)
 
-
+We recommend using the `Databricks Glow docker container <https://hub.docker.com/r/projectglow/databricks-glow>`_ to manage the environment,
+which includes `genomics libraries <https://github.com/projectglow/glow/blob/master/docker/databricks/dbr/dbr9.1/genomics/Dockerfile>`_ that complement Glow.
+This container can be installed via Databricks container services using the ``projectglow/databricks-glow:<tag>`` Docker image URL, replacing ``<tag>`` with the latest version of Glow.
+
 Getting started on other cloud services
 ---------------------------------------
 
-Please submit a pull request to add a guide for other cloud services.
+Glow is packaged into a Docker container based on an image from `data mechanics <https://hub.docker.com/r/datamechanics/spark>`_ that can be run locally and that also includes connectors to Azure Data Lake, Google Cloud Storage, Amazon Web Services S3, Snowflake, and `Delta Lake <https://docs.delta.io/latest/index.html>`_. This container can be installed using the ``projectglow/open-source-glow:<tag>`` Docker image URL, replacing ``<tag>`` with the latest version of Glow.
+
+This container can be used or adapted to run Glow outside of Databricks (`source code <https://github.com/projectglow/glow/tree/master/docker>`_),
+and was contributed by Edoardo Giacopuzzi (``edoardo.giacopuzzi at fht.org``) from Human Technopole.
+
+Please submit a pull request to add guides for specific cloud services.
+
+Notebooks embedded in the docs
+------------------------------
+
+Documentation pages are accompanied by embedded notebook examples. Most code in these notebooks can be run on Spark and Glow alone, but functions such as ``display()`` or ``dbutils()`` are only available on Databricks. See :ref:`dbnotebooks` for more info.
+
+These notebooks are located in the Glow github repository `here <https://github.com/projectglow/glow/blob/master/docs/source/_static/zzz_GENERATED_NOTEBOOK_SOURCE/>`_ and are tested nightly end-to-end. They include notebooks to define constants such as the number of samples to simulate and the output paths for each step in the pipeline. Notebooks that define constants are ``%run`` at the start of each notebook in the documentation. Please see :ref:`data_simulation` to get started.
