
Commit 0eb8767

Oss docker (#503)
* initial commit
* resolve conflicts
* cleanup
* typo
* break up into layers and add genomics libs
* add root user
* add libncurses
* fix pip path
* fix sudo
* update readme and getting started guide
* update readme
* update readme
* parameterize build scripts
* remove minus-ganglia from push

Signed-off-by: William Brandler <william.brandler@databricks.com>
1 parent 0dd8c7e commit 0eb8767

File tree

7 files changed, +274 -33 lines changed

docker/README.md

Lines changed: 12 additions & 6 deletions

@@ -2,11 +2,14 @@
 
 As of this time the following are supported:
 
-* Glow 1.1.2 + Databricks Runtime (DBR) 9.1 (Spark 3.1)
+* Glow 1.1.2 + connectors to Azure Data Lake, Google Cloud Storage, Amazon Web Services (S3), Snowflake and Delta Lake (via data mechanics' Spark image)
+* Glow 1.1.2 + Databricks Runtime (DBR) 9.1 (Spark 3.1) + Ganglia
 * Hail 0.2.78 + DBR 9.1 (Spark 3.1)
 
-These Dockerfiles are built to run on Databricks,
-but can be adapted to run Glow & Hail in the open source,
+The containers are hosted on the [projectglow dockerhub](https://hub.docker.com/u/projectglow).
+Please see the Glow [Getting Started](https://glow.readthedocs.io/en/latest/getting-started.html) guide for documentation on how to use the containers.
+
+## Building the containers
 
 ##### Troubleshooting
 

@@ -25,8 +28,8 @@ export COMPOSE_DOCKER_CLI_BUILD=0
 
 Please see this [stack overflow post](https://stackoverflow.com/questions/64221861/an-error-failed-to-solve-with-frontend-dockerfile-v0) for explanation.
 
-Note: Docker builds may run out of memory, please increase
-Docker's default memorry setting, which is 2.0 GB, via Preferences -> Resources -> Advanced.
+Important: Docker builds may run out of memory; please increase
+Docker's default memory setting, which is 2.0 GB, via Docker Desktop -> Preferences -> Resources -> Advanced.
 
 To learn more about contributing to these images, please review the Glow [contributing guide](https://glow.readthedocs.io/en/latest/contributing.html#add-libraries-to-the-glow-docker-environment)
 

@@ -37,9 +40,12 @@ Ganglia is an optional layer for monitoring cluster metrics such as CPU load.
 
 ![Docker layer architecture](../static/glow_genomics_docker_image_architecture.png?raw=true "Glow Docker layer architecture")
 
+The open source version of this architecture, which runs outside of Databricks, is simpler,
+with a base layer that pulls from data mechanics' Spark image, followed by the ```genomics``` and ```genomics-with-glow``` layers.
+
 
 ### Build the docker images as follows:
 
-run ```docker/databricks/build.sh``` to build all of the layers.
+Run ```docker/databricks/build.sh``` or ```docker/open-source-glow/build.sh``` to build all of the layers.
 
 To build any layer individually, change directory into the layer and run:
 

docker/databricks/build.sh

Lines changed: 20 additions & 19 deletions

@@ -3,25 +3,26 @@
 #
 # Usage: ./build.sh
 
-DOCKER_REPOSITORY="projectglow"
+DOCKER_HUB="projectglow"
+DATABRICKS_RUNTIME_VERSION="9.1"
+GLOW_VERSION="1.1.2"
+HAIL_VERSION="0.2.85"
 
-# Add commands to build DBR 9.1 images below
-pushd dbr/dbr9.1/
-docker build -t "${DOCKER_REPOSITORY}/minimal:9.1" minimal/
-docker build -t "${DOCKER_REPOSITORY}/python:9.1" python/
-docker build -t "${DOCKER_REPOSITORY}/dbfsfuse:9.1" dbfsfuse/
-docker build -t "${DOCKER_REPOSITORY}/standard:9.1" standard/
-docker build -t "${DOCKER_REPOSITORY}/with-r:9.1" r/
-docker build -t "${DOCKER_REPOSITORY}/genomics:9.1" genomics/
-docker build -t "${DOCKER_REPOSITORY}/databricks-hail:0.2.85" genomics-with-hail/
-docker build -t "${DOCKER_REPOSITORY}/databricks-glow-minus-ganglia:1.1.2" genomics-with-glow/
-docker build -t "${DOCKER_REPOSITORY}/databricks-glow:1.1.2" ganglia/
-docker build -t "${DOCKER_REPOSITORY}/databricks-glow-minus-ganglia:9.1" genomics-with-glow/
-docker build -t "${DOCKER_REPOSITORY}/databricks-glow:9.1" ganglia/
+# Add commands to build images below
+pushd dbr/dbr$DATABRICKS_RUNTIME_VERSION/
+docker build -t "${DOCKER_HUB}/minimal:${DATABRICKS_RUNTIME_VERSION}" minimal/
+docker build -t "${DOCKER_HUB}/python:${DATABRICKS_RUNTIME_VERSION}" python/
+docker build -t "${DOCKER_HUB}/dbfsfuse:${DATABRICKS_RUNTIME_VERSION}" dbfsfuse/
+docker build -t "${DOCKER_HUB}/standard:${DATABRICKS_RUNTIME_VERSION}" standard/
+docker build -t "${DOCKER_HUB}/with-r:${DATABRICKS_RUNTIME_VERSION}" r/
+docker build -t "${DOCKER_HUB}/genomics:${DATABRICKS_RUNTIME_VERSION}" genomics/
+docker build -t "${DOCKER_HUB}/databricks-hail:${HAIL_VERSION}" genomics-with-hail/
+docker build -t "${DOCKER_HUB}/databricks-glow-minus-ganglia:${GLOW_VERSION}" genomics-with-glow/
+docker build -t "${DOCKER_HUB}/databricks-glow:${GLOW_VERSION}" ganglia/
+docker build -t "${DOCKER_HUB}/databricks-glow-minus-ganglia:${DATABRICKS_RUNTIME_VERSION}" genomics-with-glow/
+docker build -t "${DOCKER_HUB}/databricks-glow:${DATABRICKS_RUNTIME_VERSION}" ganglia/
 popd
 
-docker push "${DOCKER_REPOSITORY}/databricks-hail:0.2.85"
-docker push "${DOCKER_REPOSITORY}/databricks-glow:1.1.2"
-docker push "${DOCKER_REPOSITORY}/databricks-glow:9.1"
-docker push "${DOCKER_REPOSITORY}/databricks-glow-minus-ganglia:1.1.2"
-docker push "${DOCKER_REPOSITORY}/databricks-glow-minus-ganglia:9.1"
+docker push "${DOCKER_HUB}/databricks-hail:${HAIL_VERSION}"
+docker push "${DOCKER_HUB}/databricks-glow:${GLOW_VERSION}"
+docker push "${DOCKER_HUB}/databricks-glow:${DATABRICKS_RUNTIME_VERSION}"
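Parameterizing the script this way means a runtime or Glow version bump only touches the variables at the top; every image tag is composed by ordinary shell variable substitution. A minimal sketch of that expansion (runnable without Docker; the values mirror the script above):

```shell
# How the build script's variables compose image tags
DOCKER_HUB="projectglow"
DATABRICKS_RUNTIME_VERSION="9.1"
GLOW_VERSION="1.1.2"

echo "${DOCKER_HUB}/genomics:${DATABRICKS_RUNTIME_VERSION}"   # projectglow/genomics:9.1
echo "${DOCKER_HUB}/databricks-glow:${GLOW_VERSION}"          # projectglow/databricks-glow:1.1.2
```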

docker/open-source-glow/build.sh

Lines changed: 13 additions & 0 deletions

@@ -0,0 +1,13 @@
+#!/bin/bash -xue
+# Builds all the Docker images
+#
+# Usage: ./build.sh
+
+DOCKER_HUB="projectglow"
+GLOW_VERSION="1.1.2"
+
+# Add commands to build DBR 9.1 images below
+docker build -t "${DOCKER_HUB}/open-source-base:${GLOW_VERSION}" datamechanics/
+docker build -t "${DOCKER_HUB}/open-source-genomics:${GLOW_VERSION}" genomics/
+docker build -t "${DOCKER_HUB}/open-source-glow:${GLOW_VERSION}" genomics-with-glow/
+docker push "${DOCKER_HUB}/open-source-glow:${GLOW_VERSION}"
Lines changed: 33 additions & 0 deletions

@@ -0,0 +1,33 @@
+#builds off Docker images for Apache Spark by Data Mechanics
+#this image includes connectors to
+# - Azure blob and datalake
+# - Google cloud storage
+# - Amazon AWS S3
+# - Snowflake
+# - deltalake
+#to learn more, see https://hub.docker.com/r/datamechanics/spark
+FROM gcr.io/datamechanics/spark:3.1.2-hadoop-3.2.0-java-11-scala-2.12-python-3.8-dm16
+LABEL author="Edoardo Giacopuzzi"
+LABEL contact="edoardo.giacopuzzi@fht.org"
+LABEL spark_version="3.1.2"
+LABEL hadoop_version="3.2.0"
+LABEL java_version="11"
+LABEL scala_version="2.12"
+LABEL deltalake_version="1.0.0"
+LABEL glowgr_version="spark3-1.1.2"
+LABEL description="Spark with Glow support and glow.py"
+
+ENV PYSPARK_MAJOR_PYTHON_VERSION=3
+
+#Install some glow.py dependencies in the base conda env
+RUN conda update conda \
+    && conda update --all \
+    && conda config --set channel_priority false \
+    && conda install -c bioconda -c conda-forge \
+    nptyping=1.3.0 \
+    numpy>=1.18.1 \
+    opt_einsum>=3.2.0 \
+    pandas>=1.0.1 \
+    statsmodels>=0.10.0 \
+    typeguard=2.9.1 \
+    pyarrow>=1.0.1
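One caveat in the conda install above: a Dockerfile RUN instruction is executed by the shell, which treats the unquoted `>` in specifiers like `numpy>=1.18.1` as output redirection. Conda may therefore receive only the bare package name while a stray file named `=1.18.1` is created; quoting each spec (e.g. `'numpy>=1.18.1'`) avoids this. A quick demonstration of the shell behavior (scratch directory, no conda needed):

```shell
cd "$(mktemp -d)"
# Unquoted: the shell redirects 'echo numpy' into a file named '=1.18.1'
echo numpy>=1.18.1
ls                     # shows the stray file '=1.18.1'
# Quoted: the full version spec survives as a single argument
echo 'numpy>=1.18.1'
```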
Lines changed: 35 additions & 0 deletions

@@ -0,0 +1,35 @@
+FROM projectglow/open-source-genomics:1.1.2 AS genomics
+LABEL author="Edoardo Giacopuzzi"
+LABEL contact="edoardo.giacopuzzi@fht.org"
+LABEL spark_version="3.1.2"
+LABEL hadoop_version="3.2.0"
+LABEL java_version="11"
+LABEL scala_version="2.12"
+LABEL deltalake_version="1.0.0"
+LABEL glowgr_version="spark3-1.1.2"
+LABEL description="Spark with Glow support and glow.py"
+
+USER root
+WORKDIR /opt
+
+ENV GLOW_VERSION=1.1.2
+ENV SCALA_LOGGING_VERSION=3.7.2
+ENV PICARD_VERSION=2.23.3
+ENV HTSJDK_VERSION=2.21.2
+ENV NETTY_VERSION=3.9.9
+ENV JDBI_VERSION=2.78
+ENV HADOOP_BAM_VERSION=7.9.2
+
+#Download Glow JAR and its dependencies
+RUN wget https://repo1.maven.org/maven2/io/projectglow/glow-spark3_2.12/${GLOW_VERSION}/glow-spark3_2.12-${GLOW_VERSION}.jar \
+    && wget https://repo1.maven.org/maven2/com/typesafe/scala-logging/scala-logging_2.12/${SCALA_LOGGING_VERSION}/scala-logging_2.12-${SCALA_LOGGING_VERSION}.jar \
+    && wget https://repo1.maven.org/maven2/com/github/broadinstitute/picard/${PICARD_VERSION}/picard-${PICARD_VERSION}.jar \
+    && wget https://repo1.maven.org/maven2/com/github/samtools/htsjdk/${HTSJDK_VERSION}/htsjdk-${HTSJDK_VERSION}.jar \
+    && wget https://repo1.maven.org/maven2/io/netty/netty/${NETTY_VERSION}.Final/netty-${NETTY_VERSION}.Final.jar \
+    && wget https://repo1.maven.org/maven2/org/jdbi/jdbi/${JDBI_VERSION}/jdbi-${JDBI_VERSION}.jar \
+    && wget https://repo1.maven.org/maven2/org/seqdoop/hadoop-bam/${HADOOP_BAM_VERSION}/hadoop-bam-${HADOOP_BAM_VERSION}.jar \
+    && mv *.jar /opt/spark/jars
+
+#Install Glow python interface
+RUN pip3 install --upgrade pip
+RUN pip3 install glow.py==${GLOW_VERSION}
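Each wget above follows the standard Maven repository layout (group path / artifact / version / artifact-version.jar), so bumping a dependency only requires editing its ENV line. For example, the Glow jar URL expands as:

```shell
# Sketch: how the Glow jar URL is composed from the version variable
GLOW_VERSION="1.1.2"
echo "https://repo1.maven.org/maven2/io/projectglow/glow-spark3_2.12/${GLOW_VERSION}/glow-spark3_2.12-${GLOW_VERSION}.jar"
```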
Lines changed: 144 additions & 0 deletions

@@ -0,0 +1,144 @@
+FROM projectglow/open-source-base:1.1.2 AS base
+USER root
+
+# ===== Set up base required libraries =============================================================
+
+RUN apt-get update && apt-get install -y \
+    apt-utils \
+    build-essential \
+    git \
+    apt-transport-https \
+    ca-certificates \
+    cpanminus \
+    libncurses-dev \
+    libpng-dev \
+    zlib1g-dev \
+    libbz2-dev \
+    liblzma-dev \
+    perl \
+    perl-base \
+    unzip \
+    curl \
+    gnupg2 \
+    software-properties-common \
+    jq \
+    libjemalloc2 \
+    libjemalloc-dev \
+    libdbi-perl \
+    libdbd-mysql-perl \
+    libdbd-sqlite3-perl \
+    zlib1g \
+    zlib1g-dev \
+    libxml2 \
+    libxml2-dev \
+    r-base \
+    r-base-dev \
+    && wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc \
+    && add-apt-repository "deb [arch=amd64,i386] https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/" \
+    && apt-get clean \
+    && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
+
+# ===== Set up VEP environment =====================================================================
+
+ENV OPT_SRC /opt/vep/src
+ENV PERL5LIB $PERL5LIB:$OPT_SRC/ensembl-vep:$OPT_SRC/ensembl-vep/modules
+RUN cpanm DBI && \
+    cpanm Set::IntervalTree && \
+    cpanm JSON && \
+    cpanm Text::CSV && \
+    cpanm Module::Build && \
+    cpanm PerlIO::gzip && \
+    cpanm IO::Uncompress::Gunzip
+
+RUN mkdir -p $OPT_SRC
+WORKDIR $OPT_SRC
+RUN git clone https://github.com/Ensembl/ensembl-vep.git
+WORKDIR ensembl-vep
+
+# The commit is the most recent one on release branch 100 as of July 29, 2020
+
+RUN git checkout 10932fab1e9c113e8e5d317e1f668413390344ac && \
+    perl INSTALL.pl --NO_UPDATE -AUTO a && \
+    perl INSTALL.pl -n -a p --PLUGINS AncestralAllele && \
+    chmod +x vep
+
+# ===== Set up samtools ============================================================================
+
+ENV SAMTOOLS_VERSION=1.9
+
+WORKDIR /opt
+RUN wget https://github.com/samtools/samtools/releases/download/${SAMTOOLS_VERSION}/samtools-${SAMTOOLS_VERSION}.tar.bz2 && \
+    tar -xjf samtools-1.9.tar.bz2
+WORKDIR samtools-1.9
+RUN ./configure && \
+    make && \
+    make install
+
+ENV PATH=${DEST_DIR}/samtools-{$SAMTOOLS_VERSION}:$PATH
+
+
+# ===== Set up htslib ==============================================================================
+# access htslib tools from the shell, for example,
+# %sh
+# /opt/htslib-1.9/tabix
+# /opt/htslib-1.9/bgzip
+
+WORKDIR /opt
+RUN wget https://github.com/samtools/htslib/releases/download/${SAMTOOLS_VERSION}/htslib-${SAMTOOLS_VERSION}.tar.bz2 && \
+    tar -xjvf htslib-1.9.tar.bz2
+WORKDIR htslib-1.9
+RUN ./configure && \
+    make && \
+    make install
+
+# ===== bgenix ==============================================================================
+# access bgenix from the shell, for example,
+# /opt/bgen/build/apps/bgenix
+
+RUN apt-get update && apt-get install -y \
+    npm
+
+RUN npm install --save sqlite3
+
+WORKDIR /opt
+RUN wget http://code.enkre.net/bgen/tarball/release/bgen.tgz && \
+    tar zxvf bgen.tgz && \
+    mv bgen.tgz bgen
+WORKDIR bgen
+RUN CXX=/usr/bin/g++ && \
+    CC=/usr/bin/gcc && \
+    ./waf configure && \
+    ./waf && \
+    ./build/test/unit/test_bgen && \
+    ./build/apps/bgenix -g example/example.16bits.bgen -list
+
+# ===== Set up MLR dependencies ====================================================================
+
+ENV QQMAN_VERSION=1.0.6
+RUN pip3 install qqman==$QQMAN_VERSION
+
+# ===== Set up R genomics packages =================================================================
+
+RUN R -e "install.packages('sim1000G',dependencies=TRUE,repos='https://cran.rstudio.com')" \
+    && R -e "install.packages('gplots',dependencies=TRUE,repos='http://cran.us.r-project.org')" \
+    && R -e "install.packages('bigsnpr',dependencies=TRUE,repos='http://cran.us.r-project.org')" \
+    && R -e "install.packages('ukbtools',dependencies=TRUE,repos='https://cran.rstudio.com')" \
+    && R -e "install.packages('qqman',dependencies=TRUE,repos='http://cran.us.r-project.org')"
+
+# ===== plink ==============================================================================
+# install both plink 1.07 and 1.9
+# access plink from the shell, for example,
+# v1.07
+# /opt/plink-1.07-x86_64/plink --noweb
+# v1.90
+# /opt/plink --noweb
+
+WORKDIR /opt
+RUN wget http://zzz.bwh.harvard.edu/plink/dist/plink-1.07-x86_64.zip && \
+    unzip plink-1.07-x86_64.zip
+RUN wget http://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20200616.zip && \
+    unzip plink_linux_x86_64_20200616.zip
+
+# ===== Reset current directory ====================================================================
+
+WORKDIR /root
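One line worth double-checking when adapting this Dockerfile is the samtools PATH export: in `ENV PATH=${DEST_DIR}/samtools-{$SAMTOOLS_VERSION}:$PATH`, the braces in `{$SAMTOOLS_VERSION}` are placed outside the `$`, so they survive expansion as literal characters, and `DEST_DIR` is never defined earlier in the file. The PATH entry therefore likely points at a nonexistent directory; the image still works because `make install` already copies samtools into `/usr/local/bin`. The brace-placement difference in a runnable sketch:

```shell
SAMTOOLS_VERSION="1.9"
echo "samtools-{$SAMTOOLS_VERSION}"   # braces are literal -> samtools-{1.9}
echo "samtools-${SAMTOOLS_VERSION}"   # intended expansion -> samtools-1.9
```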

docs/source/getting-started.rst

Lines changed: 17 additions & 8 deletions

@@ -73,12 +73,6 @@ Glow requires Apache Spark 3.2.0.
     val sess = Glow.register(spark)
     val df = sess.read.format("vcf").load(path)
 
-Notebooks embedded in the docs
-------------------------------
-
-Documentation pages are accompanied by embedded notebook examples. Most code in these notebooks can be run on Spark and Glow alone, but functions such as ``display()`` or ``dbutils()`` are only available on Databricks. See :ref:`dbnotebooks` for more info.
-
-These notebooks are located in the Glow github repository `here <https://github.com/projectglow/glow/blob/master/docs/source/_static/zzz_GENERATED_NOTEBOOK_SOURCE/>`_ and are tested nightly end-to-end. They include notebooks to define constants such as the number of samples to simulate and the output paths for each step in the pipeline. Notebooks that define constants are ``%run`` at the start of each notebook in the documentation. Please see :ref:`data_simulation` to get started.
 
 Getting started on Databricks
 -----------------------------

@@ -89,8 +83,23 @@ The Databricks documentation shows how to get started with Glow on,
 - **Microsoft Azure** (`docs <https://docs.microsoft.com/en-us/azure/databricks/applications/genomics/tertiary-analytics/glow>`_)
 - **Google Cloud Platform** (GCP - `docs <https://docs.gcp.databricks.com/applications/genomics/tertiary-analytics/glow.html>`_)
 
-
+We recommend using the `Databricks Glow docker container <https://hub.docker.com/r/projectglow/databricks-glow>`_ to manage the environment,
+which includes `genomics libraries <https://github.com/projectglow/glow/blob/master/docker/databricks/dbr/dbr9.1/genomics/Dockerfile>`_ that complement Glow.
+This container can be installed via Databricks container services using the ``projectglow/databricks-glow:<tag>`` Docker image URL, replacing ``<tag>`` with the latest version of Glow.
+
 Getting started on other cloud services
 ---------------------------------------
 
-Please submit a pull request to add a guide for other cloud services.
+Glow is packaged into a Docker container based on an image from `data mechanics <https://hub.docker.com/r/datamechanics/spark>`_ that can be run locally and that also includes connectors to Azure Data Lake, Google Cloud Storage, Amazon Web Services S3, Snowflake, and `Delta Lake <https://docs.delta.io/latest/index.html>`_. This container can be installed using the ``projectglow/open-source-glow:<tag>`` Docker image URL, replacing ``<tag>`` with the latest version of Glow.
+
+This container can be used or adapted to run Glow outside of Databricks (`source code <https://github.com/projectglow/glow/tree/master/docker>`_),
+and was contributed by Edoardo Giacopuzzi (``edoardo.giacopuzzi at fht.org``) from Human Technopole.
+
+Please submit a pull request to add guides for specific cloud services.
+
+Notebooks embedded in the docs
+------------------------------
+
+Documentation pages are accompanied by embedded notebook examples. Most code in these notebooks can be run on Spark and Glow alone, but functions such as ``display()`` or ``dbutils()`` are only available on Databricks. See :ref:`dbnotebooks` for more info.
+
+These notebooks are located in the Glow github repository `here <https://github.com/projectglow/glow/blob/master/docs/source/_static/zzz_GENERATED_NOTEBOOK_SOURCE/>`_ and are tested nightly end-to-end. They include notebooks to define constants such as the number of samples to simulate and the output paths for each step in the pipeline. Notebooks that define constants are ``%run`` at the start of each notebook in the documentation. Please see :ref:`data_simulation` to get started.
