# Data Science Pet Containers
M. Edward (Ed) Borasky <znmeb@znmeb.net>, 2018-04-04
Data Science Pet Containers is a collection of open-source software for all phases of the data science workflow, from ingestion of raw data through visualization, exploration, analysis and reporting. We provide the following tools:
- PostgreSQL / PostGIS / pgRouting: an industrial strength relational database management system with geographic information systems (GIS) extensions,
- Anaconda Python tools, including a Jupyter notebook server, and
- R language tools, including RStudio® Server.
As the name implies, the software is distributed via Docker. The user
simply clones a Git repository and uses the command docker-compose up
to bring up the services.
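For example, a minimal first session looks something like this - a sketch of the steps detailed below, assuming Docker and Docker Compose are installed and using the `postgis.yml` configuration:

```bash
# Clone the repository and enter the containers directory
git clone https://github.com/hackoregon/data-science-pet-containers
cd data-science-pet-containers/containers

# Create .env from the sample, then edit it to set POSTGRES_PASSWORD
cp sample.env .env

# Build the image and bring up the PostGIS service in the background
docker-compose -f postgis.yml up -d --build
```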
Why do it this way?
- Provide a standardized common working environment for data scientists and DevOps engineers at Hack Oregon. We want to build using the same tools we'll use for deployment as much as possible.
- Deliver advanced open source technologies to Windows and MacOS desktops and laptops. While there are "native" installers for most of these tools, some are only readily available for, and extensively tested on, Linux.
- Isolation: for the most part, software running in containers is contained. It interacts with the desktop / laptop user through well-defined mechanisms, often as a web server.
I've coded up some examples of how I've used this toolset for Hack
Oregon. They're in data-science-pet-containers/examples. Those that
require host-side operations mostly use Bash for scripting, and they
work on a Linux host, including Windows 10 Pro Windows Subsystem for
Linux (WSL) Ubuntu.
- Clone this repository and `cd data-science-pet-containers/containers`.
- Copy `sample.env` to `.env`. Edit `.env` and change the
  `POSTGRES_PASSWORD`. You don't need to change the other values.
- Copy any PostgreSQL database backups you want restored to
  `data-science-pet-containers/containers/Backups`. Copy any raw data
  files you want on the image to
  `data-science-pet-containers/containers/Raw`.
- `docker-compose -f postgis.yml up -d --build`. The first time you run
  this, it will take some time. Once the image is built and the
  databases restored, it will be faster. When it's done you'll see

  ```
  Successfully tagged postgis:latest
  Creating containers_postgis_1 ... done
  ```

- Type `docker logs -f containers_postgis_1` to verify that the restores
  worked and the service is listening.

  ```
  PostgreSQL init process complete; ready for start up.
  LOG:  database system was shut down at 2018-03-18 05:19:22 UTC
  LOG:  MultiXact member wraparound protections are now enabled
  LOG:  database system is ready to accept connections
  LOG:  autovacuum launcher started
  ```

  Type `CTL-C` to stop following the container log.
- Connect to the container from the host: user name is `postgres`, host
  is `localhost`, port is the value of `HOST_POSTGRES_PORT`, usually
  5439, and password is the value of `POSTGRES_PASSWORD`. You can
  connect with any client that uses the PostgreSQL protocol, including
  pgAdmin and QGIS. You can also connect with Jupyter notebooks and
  RStudio Desktop.
To stop the service, type `docker-compose -f postgis.yml stop`. To start
it back up again, `docker-compose -f postgis.yml start`.

The container and its filesystem will persist across host reboots. To
destroy them, type `docker-compose -f postgis.yml down`.
- Clone this repository and `cd data-science-pet-containers/containers`.
- Define the environment variables:
  - Copy the file `sample.env` to `.env`. For security reasons, `.env`
    is listed in `.gitignore`, so it won't be checked into version
    control.
  - Edit `.env`. The variables you need to define are:
    - `HOST_POSTGRES_PORT`: If you have PostgreSQL installed on your
      host, it's probably listening on port 5432. The `postgis` service
      listens on port 5432 inside the Docker network, so you'll need to
      map its port 5432 to another port. Set `HOST_POSTGRES_PORT` to the
      value you want; 5439 is what I use.
    - `POSTGRES_PASSWORD`: To connect to the `postgis` service, you need
      a user name and a password. The user name is the default, the
      database superuser `postgres`. Docker will set the password for
      the `postgres` user in the `postgis` service to the value of
      `POSTGRES_PASSWORD`.
    - `DB_USERS_TO_CREATE`: When the `postgis` service first comes up,
      the users in this list are created in the database. If you're
      working on the 2018 Hack Oregon projects, there's no reason to
      change this.
Here's `sample.env`:

```bash
# postgis container
HOST_POSTGRES_PORT=5439
POSTGRES_PASSWORD=some.string.you.can.remember.that.nobody.else.can.guess
DB_USERS_TO_CREATE=disaster-resilience housing-affordability local-elections transportation-systems urban-development
```
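If you need a password like that, one option is to generate a random one on the host - an optional convenience, assuming `openssl` is installed:

```bash
# Print a random 24-byte, base64-encoded string suitable for .env
openssl rand -base64 24
```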
I've provided these to facilitate database ownership wrangling and
backup-restore testing. As given in `sample.env`, these are the accounts
Hack Oregon's PostgreSQL server will have. Some notes:
- In the `postgis` image, they are also Linux users. For example, you
  can `docker exec -it -u disaster-resilience containers_postgis_1 /bin/bash`
  and you'll be at a command prompt as `disaster-resilience`.
- In the `postgis` image, they are database superusers. For example, you
  can do `createuser` to create a database user and `createdb` to create
  a database. The "home database" for each of these users - a database
  with the same name as the Linux and database user - is created the
  first time the container comes up.
- In the `amazon` image, they are not Linux users, nor are they database
  superusers. They are created with the same permissions they have on
  the Hack Oregon PostgreSQL server:
  `--no-createdb --no-createrole --no-superuser --no-replication`.
- Because these names have hyphens in them, PostgreSQL in both images
  requires that they be enclosed in double quotes in SQL statements.
  - WRONG: `ALTER DATABASE disaster OWNER TO disaster-resilience;`
  - RIGHT: `ALTER DATABASE disaster OWNER TO "disaster-resilience";`
See the demo in `examples/reowning_a_database` for an example of how
these users can be used in database wrangling.
- Choose your version:
  - `postgis.yml`: PostGIS only. If you're doing all the analysis on the
    host and just want the PostGIS service, choose this. If you're an
    experienced Linux command-line user, this image has a comprehensive
    collection of extract-transform-load (ETL) and GIS tools.
  - `jupyter.yml`: PostGIS and Jupyter. Choose this if you want to run a
    Jupyter notebook server inside the Docker network.
  - `rstats.yml`: PostGIS and RStudio Server. Choose this if you want an
    RStudio Server inside the Docker network.
  - `amazon.yml`: PostGIS and an Amazon Linux 2 server running
    PostgreSQL. This is a specialized configuration for testing database
    backups for AWS server readiness. Most users won't need to use this.
- Type `docker-compose -f <version> up -d --build`. Docker will
  build/rebuild the images and start the services.

Note that if you want to bring up all the services in one shot, just
type `docker-compose up -d --build`. This takes quite a bit of time -
from 45 minutes to an hour the first time, depending on download
bandwidth and disk I/O speed.
The `postgis` service is based on the official PostgreSQL image from the
Docker Store: https://store.docker.com/images/postgres. It is running

- PostgreSQL 9.6,
- PostGIS 2.4,
- pgRouting 2.5, and
- all of the foreign data wrappers that are available in a Debian
  `jessie` PostgreSQL server.
All the images except amazon acquire PostgreSQL and its accomplices
from the official PostgreSQL Global Development Group (PGDG) Debian
repositories: https://www.postgresql.org/download/linux/debian/.
I've tried to provide a comprehensive command line experience. Git,
`curl`, `wget`, `lynx`, `nano`, `emacs` and `vim` are there, as is most
of the command-line GIS stack (`gdal`, `proj`, `spatialite`,
`rasterlite`, `geotiff`, `osm2pgsql` and `osm2pgrouting`), and of course
`psql`.
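As a sketch of what that stack enables, here's a hypothetical shapefile load into PostGIS with GDAL's `ogr2ogr`, run inside the container; the file `parcels.shp` and table name `parcels` are placeholders, and I'm assuming you're logged in as `dbsuper` so the local-socket connection needs no password:

```bash
# Load a shapefile into the postgres database as table "parcels"
ogr2ogr -f PostgreSQL PG:"dbname=postgres" /home/dbsuper/Raw/parcels.shp -nln parcels
```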
I've also included `python3-csvkit` for Excel, CSV and other text files,
`unixodbc` for ODBC connections and `mdbtools` for Microsoft Access
files. If you want to extend this image further, it is based on Debian
`jessie`.
You can log in as the Linux superuser `root` with
`docker exec -it -u root containers_postgis_1 /bin/bash`.

I've added a database superuser called `dbsuper`. This should be your
preferred login, rather than using the system database superuser
`postgres`. Log in with
`docker exec -it -u dbsuper -w /home/dbsuper containers_postgis_1 /bin/bash`.
For Linux hosts, including Windows 10 Pro Windows Subsystem for Linux Ubuntu (https://github.com/hackoregon/data-science-pet-containers/blob/master/win10pro-wsl-ubuntu-tools/README.md), I've created some convenience scripts:
- `login2postgis.bash`: Type `./login2postgis.bash` and you'll be logged
  into the `containers_postgis_1` container as `dbsuper`. You can also
  log in as one of the users in `DB_USERS_TO_CREATE`, for example,
  `./login2postgis.bash local-elections`.
- `login2amazon.bash`: Log in to `containers_amazon_1` as `dbsuper`.
- `pull-postgis-raw.bash`: This script does
  `docker cp containers_postgis_1:/home/dbsuper/Raw .`. That is, it
  copies all the files from `/home/dbsuper/Raw` into `Raw` in your
  current directory, creating `Raw` if it doesn't exist. I use this for
  transferring backup files for testing; I'll create a database in the
  PostGIS container, create a backup in `/home/dbsuper/Raw` and copy it
  out to the host with this script. I create the backups in `Raw`
  instead of `Backups` so they don't get automatically restored.
- `push-amazon-raw.bash`: This is the next step - it copies `Raw` to
  `/home/dbsuper/Raw` in `containers_amazon_1` for restore testing.
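Putting those scripts together, a typical backup round-trip looks something like this (a sketch; `mydb` is a placeholder database name):

```bash
./login2postgis.bash     # log in to containers_postgis_1 as dbsuper
# inside the container: dump a database into Raw, then log out
pg_dump -Fp -v -C -c --if-exists -d mydb | gzip -c > /home/dbsuper/Raw/mydb.sql.gz
exit
./pull-postgis-raw.bash  # copy Raw from postgis to the host
./push-amazon-raw.bash   # push Raw into containers_amazon_1
```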
You can use the Python `virtualenvwrapper` utility. See
https://virtualenvwrapper.readthedocs.io/en/latest/# for the
documentation.

To activate, enter
`source /usr/share/virtualenvwrapper/virtualenvwrapper.sh`.
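Once it's sourced, the standard `virtualenvwrapper` commands are available, for example:

```bash
source /usr/share/virtualenvwrapper/virtualenvwrapper.sh
mkvirtualenv scratch   # create and activate a virtualenv named "scratch"
workon scratch         # re-activate it in a later session
deactivate             # leave the virtualenv
```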
- Log in with `docker exec` as `dbsuper` as described above.
- `cd /home/dbsuper`.
- Edit `configure-git.bash`. You'll need to supply your email address
  and name.
- Enter `./configure-git.bash`.
To clone a repository, use its https URL. For a private repository,
you'll need to authenticate when you clone. For a public one, you'll
only have to authenticate if you want to push.
In either case, once you've authenticated, git will cache your
credentials for an hour. As you probably noticed, this timeout is
adjustable in `configure-git.bash`.
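I haven't reproduced `configure-git.bash` here, but a minimal sketch of what such a script does looks like this; the email, name and one-hour timeout are placeholders to edit:

```bash
# Identify yourself to Git
git config --global user.email "you@example.com"
git config --global user.name "Your Name"

# Cache HTTPS credentials in memory; the timeout is in seconds (3600 = 1 hour)
git config --global credential.helper 'cache --timeout=3600'
```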
Cloning this repository:

- Log in with `docker exec` as `dbsuper` as described above.
- `cd /home/dbsuper`.
- Enter `./clone-me.bash`.

You will find the repository in
`$HOME/Projects/data-science-pet-containers`.
- From the host, connect to `localhost`, port `HOST_POSTGRES_PORT`.
- Inside the Docker network, connect to `postgis`, port 5432.
- In both cases, the user name and maintenance database are `postgres`
  and the password is `POSTGRES_PASSWORD`.
- From the command line, when you are logged in as the database
  superuser `postgres`, you do not need a password to connect.
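For example, with the `psql` command-line client, assuming the `HOST_POSTGRES_PORT` value of 5439 from `sample.env`:

```bash
# From the host (will prompt for POSTGRES_PASSWORD)
psql -h localhost -p 5439 -U postgres -d postgres

# From inside the Docker network, e.g. another container
psql -h postgis -p 5432 -U postgres -d postgres
```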
If you've installed the EnterpriseDB PostgreSQL distribution, you probably already have pgAdmin, although it may not be the latest version. If you want to install pgAdmin without PostgreSQL:
- macOS installer: https://www.pgadmin.org/download/pgadmin-4-macos/
- Windows installer: https://www.pgadmin.org/download/pgadmin-4-windows/
To connect to the `postgis` service on `localhost:HOST_POSTGRES_PORT`
with pgAdmin:

- Right-click on `Servers` and create a server. Give it any name you
  want.
- On the `Connection` tab, set the host to `localhost`, the port to
  `HOST_POSTGRES_PORT`, the maintenance database to `postgres`, the user
  name to `postgres` and the password to the value you set for
  `POSTGRES_PASSWORD`.
- Check the `Save password` box and press the `Save` button. pgAdmin
  will add the tree for the `postgis` service.
When the `postgis` service first starts, it initializes the database
cluster. After that, it looks in a directory called
`/docker-entrypoint-initdb.d/` and restores any `.sql` or `.sql.gz`
files it finds. Then it looks for `.sh` scripts and runs them. We use
this to restore databases automatically at startup.
To use this feature:

- Make sure all objects in the source databases have owners that will
  exist in the destination database. If the owner of an object doesn't
  exist in the destination, the restore will fail. Note that the
  `postgres` user will always exist, as will `dbsuper` and all users
  listed in `DB_USERS_TO_CREATE`.
- For each database you want restored, create a backup file. For
  documentation / repeatability, do this with `pg_dump` on the command
  line or in a script:

  ```bash
  pg_dump -Fp -v -C -c --if-exists -d <database> \
    | gzip -c > <database>.sql.gz
  ```

  where `<database>` is the name of the database. At restore time, a new
  database will be created (`-C -c`). This is done by DROPping existing
  objects; the `--if-exists` keeps the DROPs from failing if the objects
  don't exist.
- Copy the database backup files to
  `data-science-pet-containers/containers/Backups`. Note that
  `.gitignore` is set for the common backup file extensions -
  `*.sql.gz`, `*.sql` and `*.backup` - so these backup files won't be
  version-controlled.
- Type `docker-compose -f postgis.yml build`.
Docker will copy the backup files into `/home/dbsuper/Backups` on the
`postgis` image, and place a script `restore-all.sh` in
`/docker-entrypoint-initdb.d/`. The first time the image runs,
`restore-all.sh` will

- create all the database users you defined in `DB_USERS_TO_CREATE`,
  and then
- restore all the `.backup`, `.sql.gz` and `.sql` files it finds in
  `/home/dbsuper/Backups`. Note that `.backup` files will be restored to
  a freshly-created database owned by `postgres`.
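I haven't reproduced `restore-all.sh` here, but a rough sketch of the logic described above would look like this (an illustration only, not the shipped script):

```bash
#!/bin/bash
# Create the project users, then restore every backup file found.
for user in $DB_USERS_TO_CREATE; do
  createuser "$user"           # permission flags differ between the postgis and amazon images
  createdb -O "$user" "$user"  # the user's "home database" (postgis image)
done
for f in /home/dbsuper/Backups/*; do
  case "$f" in
    *.sql.gz) gunzip -c "$f" | psql -d postgres ;;  # gzipped plain-text dump
    *.sql)    psql -d postgres -f "$f" ;;           # plain-text dump
    *.backup) db="$(basename "$f" .backup)"         # custom-format dump
              createdb -O postgres "$db"
              pg_restore -d "$db" "$f" ;;
  esac
done
```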
If you want to load raw data onto the `postgis` image, copy the files to
the `data-science-pet-containers/containers/Raw` directory. The next
time the image is built they will be copied to `/home/dbsuper/Raw`.
You can put backup files in `Raw`, but they will not be restored
automatically. See
`data-science-pet-containers/examples/reowning_a_database` for an
example. Like the backups, these files are not version-controlled.
This service is based on the Anaconda, Inc. (formerly Continuum)
`miniconda3` image: https://hub.docker.com/r/continuumio/miniconda3/.
I've added a non-root user `jupyter` to avoid the security issues
associated with running Jupyter notebooks as `root`.

The `jupyter` user has a Conda environment, also called `jupyter`. To
keep the image size manageable, I have not installed any data
science tools up front.
To install a (large) data science stack, run the script
`/home/jupyter/kitchen-sink.bash` as the `jupyter` user; there's an
example after the package list below. This will install
- cookiecutter,
- geopandas,
- gwr,
- matplotlib,
- osmnx,
- pandas,
- psycopg2,
- pysal,
- requests,
- seaborn, and
- statsmodels.
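For example, from the host:

```bash
# Open a shell in the Jupyter container as the jupyter user...
docker exec -it -u jupyter containers_jupyter_1 /bin/bash
# ...then, inside the container, run the installer
bash /home/jupyter/kitchen-sink.bash
```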
By default the Jupyter notebook server starts when Docker brings up the
service. Type `docker logs containers_jupyter_1`. You'll see something
like this:

```
$ docker logs containers_jupyter_1
[I 08:00:22.931 NotebookApp] Writing notebook server cookie secret to /home/jupyter/.local/share/jupyter/runtime/notebook_cookie_secret
[I 08:00:23.238 NotebookApp] Serving notebooks from local directory: /home/jupyter
[I 08:00:23.238 NotebookApp] 0 active kernels
[I 08:00:23.238 NotebookApp] The Jupyter Notebook is running at:
[I 08:00:23.238 NotebookApp] http://0.0.0.0:8888/?token=d90b23c9368933095c9fd8e25f29d2ba48f7ce67247e216d
[I 08:00:23.238 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 08:00:23.238 NotebookApp]

    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://0.0.0.0:8888/?token=d90b23c9368933095c9fd8e25f29d2ba48f7ce67247e216d
```

Browse to `localhost:8888` and copy/paste the token when the server asks
for it.
- Edit `configure-git.bash` with the Jupyter notebook file editor.
  You'll need to supply your email address and name.
- Open a new terminal using the `New -> Terminal` dropdown at the upper
  right of the `Home` tab.
- Enter `./configure-git.bash`.
To clone a repository, use its https URL. For a private repository,
you'll need to authenticate when you clone. For a public one, you'll
only have to authenticate if you want to push.
In either case, once you've authenticated, git will cache your
credentials for an hour. As you probably noticed, this timeout is
adjustable in `configure-git.bash`.
Cloning this repository:

- Open a new terminal using the `New -> Terminal` dropdown at the upper
  right of the `Home` tab.
- Enter `./clone-me.bash`.

You will find the repository in
`$HOME/Projects/data-science-pet-containers`.
To install packages:

- Open a new terminal using the `New -> Terminal` dropdown at the upper
  right of the `Home` tab.
- Enter `bash`. The terminal comes up initially in the `sh` shell, which
  is missing many command-line conveniences.
- Enter `source activate jupyter`.
- Use `conda search` to find packages in the Conda ecosystem, then
  install them with `conda install`. You can also install packages with
  `pip` if they're not in the Conda repositories.
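For example (`scikit-learn` here is just an illustration):

```bash
bash                           # leave sh for bash
source activate jupyter        # activate the jupyter Conda environment
conda search scikit-learn      # look the package up in the Conda repositories
conda install -y scikit-learn  # install it
pip install <package-name>     # fallback for packages not in the Conda repos
```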
To connect to the `postgis` service, use the user name and maintenance
database name `postgres`. The host is `postgis`, the port is 5432 and
the password is the value of `POSTGRES_PASSWORD`.
Reference: https://drivendata.github.io/cookiecutter-data-science/
- Open a new terminal using the `New -> Terminal` dropdown at the upper
  right of the `Home` tab.
- Enter `./make-cookiecutter-project`.
The script will install `cookiecutter` in the `jupyter` environment if
necessary. Then it will launch the Cookiecutter data science interactive
setup to create a new project in `/home/jupyter/Projects`.
Follow the instructions to set up the project.
This service is based on the `rocker/rstudio` image from Docker Hub:
https://hub.docker.com/r/rocker/rstudio/. I've added header files so
that the R packages `RPostgres`, `odbc`, `sf` and `devtools` will
install from source, but there are no R packages on the image besides
those that ship with `rocker/rstudio`.

Browse to `localhost:8787`. The user name and password are both
`rstudio`. Note that if you're using Firefox, you may have to adjust
a setting to use the terminal feature.
- Go to `Tools -> Global Options -> Terminal`.
- For Firefox, uncheck the `Connect with WebSockets` option.
- Edit `configure-git.bash`. You'll need to supply your email address
  and name.
- Open a new terminal and enter `./configure-git.bash`.
To clone a repository, use its https URL. For a private repository,
you'll need to authenticate when you clone. For a public one, you'll
only have to authenticate if you want to push.
In either case, once you've authenticated, git will cache your
credentials for an hour. As you probably noticed, this timeout is
adjustable in `configure-git.bash`.
Cloning this repository:

- Open a new terminal and enter `./clone-me.bash`.

You will find the repository in
`$HOME/Projects/data-science-pet-containers`.
As noted above, to keep the image size down, I've only installed header
files so that the R packages `RPostgres`, `odbc`, `sf` and `devtools`
will install. That covers the majority of use cases.
However, if you find an R package that won't install because of a missing header or other Linux dependency, open an issue at https://github.com/hackoregon/data-science-pet-containers/issues/new.
Most packages that have missing dependencies will list the names of the
Debian packages you need to install. If that's the case, open a root
console with `docker exec -it -u root containers_rstudio_1 /bin/bash`.
Then type `apt install <package-name>`. After the Debian package is
installed, you should be able to install the R package.
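For example, the R package `magick` wants the Debian package `libmagick++-dev` (a hypothetical case; substitute whatever package the error message names):

```bash
# On the host: open a root shell in the RStudio container
docker exec -it -u root containers_rstudio_1 /bin/bash

# Inside the container: install the missing Debian dependency
apt update && apt install -y libmagick++-dev
```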
To connect to the `postgis` service, use the user name and maintenance
database name `postgres`. The host is `postgis`, the port is 5432 and
the password is the value of `POSTGRES_PASSWORD`.
This image is based on the Amazon Linux 2 "2-with-sources" Docker image at https://hub.docker.com/_/amazonlinux/. The main reason it's in this collection is to provide a means of restore-testing backup files before handing them off to the DevOps engineers for deployment on AWS.
- Read the section on automatic restores and backup file preparation
  above (Using automatic database restores).
- Copy the backup files into
  `data-science-pet-containers/containers/Backups`.
- `docker-compose -f amazon.yml up -d --build`. The backup files will be
  copied to `/home/dbsuper/Backups` on both the `postgis` and `amazon`
  images.
- When the services are up, type `docker logs -f containers_postgis_1`.
  The backup files should be automatically restored. If there are
  errors, you'll need to fix your backup files. When the restores are
  done, type `CTL-C` to stop following the log.
- Log in to the `amazon` container -
  `docker exec -it -u dbsuper -w /home/dbsuper containers_amazon_1 /bin/bash`.
- `cd Backups; ls`. You'll see the backup files. For example:

  ```
  $ cd Backups; ls
  odot_crash_data.sql.gz  passenger_census.sql.gz  restore-all.sh
  ```

  Those are the same backup files you just successfully restored in the
  `postgis` image.
- Type `./restore-all.sh`. This is the same script that did the
  automatic restores on `postgis` and it should have the same result. If
  there are no errors in the automatic restore on `postgis` and the
  restore you just did in `amazon`, the backup files are good.

To bring the `amazon` configuration up by itself, type
`docker-compose -f amazon.yml up -d --build`.
This all started with an infamous "cattle, not pets" blog post. For some history, see http://cloudscaling.com/blog/cloud-computing/the-history-of-pets-vs-cattle/. In the Red Hat / Kubernetes / OpenShift universe, it's common for people to have a workstation that's essentially a Docker / Kubernetes host with all the actual work being done in containers. See https://rhelblog.redhat.com/2016/06/08/in-defense-of-the-pet-container-part-1-prelude-the-only-constant-is-complexity/ and https://www.projectatomic.io/blog/2018/02/fedora-atomic-workstation/.
So - pet containers for data scientists.