A starter template for data science workflows utilizing Docker as an alternative to Conda or venv environments.
4. Setting Up Docker for a Data Science Project
4.2. Step 2: Set Up Your Project Repository
4.3. Step 3: Write the Dockerfile
4.4. Step 4: Write the .dockerignore file
4.5. Step 5: Write the Docker Compose File
4.6. Step 6: requirements.txt
4.7. Step 7: Build and Run Your Container
4.8. Step 8: Verify the Container
4.9. Step 9: Attach VS Code to the Container
4.10. Step 10: Run the Python Script
4.11. Step 11: Work with Jupyter Notebooks in VS Code
4.12. Step 12: Stop and remove the container
4.13. Note 1: Jupyter on browser
4.14. Note 2: Keeping Your Environment Up-to-Date
6. Advanced Topics and FAQ
6.2. Docker Port Mapping in Detail
6.3. Common Issues and Solutions
6.4. Data Science Specific Considerations
6.5. Docker Shortcuts (alias)
6.6. Understanding and Cleaning Dangling Images
6.7. Tagging Docker Images
6.8. Working with Docker Volumes
6.9. Frequently Asked Questions (FAQ)
This repository provides a complete Docker workflow for data science projects. It enables you to train machine learning models, develop Python scripts, experiment with Jupyter notebooks, and manage datasets—all within Docker containers. The setup is designed for reproducibility and maintainability.
Anyone interested in data science, Python programming, or Docker containerization can benefit from this project. Whether you are a student, developer, or data scientist, this resource will walk you through building and deploying a data science environment using Docker.
By working through this project, you will:
- Gain a solid understanding of Docker and containerization
- Learn to set up a full data science environment inside containers
- Discover how to manage dependencies with Docker
- See how to develop and execute Python scripts and Jupyter notebooks in containers
- Work through practical examples for reproducible data science workflows
- Learn Docker best practices tailored for data science
This project is suitable for three types of users:
- If you already know Docker:
- Jump right into the data science applications. The provided examples and configurations will help you refine your skills and explore best practices.
- If you know Python/data science but are new to Docker:
- This project will introduce you to containerization, guiding you through building and deploying reproducible environments.
- If you are a beginner:
- This project is beginner-friendly. You will start with the basics, learning how to set up Docker, then move on to building data science applications in containers.
Folder PATH listing
.
+---data <-- Contains sample datasets
| README.md <-- Documentation for the data folder
| sample.csv <-- Example dataset for experimentation
|
+---figures <-- Contains images for documentation
| README.md <-- Documentation for the figures folder
| docker.jpg <-- Docker concepts illustration
| port.jpg <-- Network port illustration
| volume.jpg <-- Docker volumes illustration
|
+---notebooks <-- Jupyter notebooks
| README.md <-- Documentation for the notebooks folder
| 01_data_exploration.ipynb <-- Notebook for data exploration
| 02_model_training.ipynb <-- Notebook for model training
|
+---scripts <-- Python scripts
| README.md <-- Documentation for the scripts folder
| data_prep.py <-- Sample data preparation script
|
| .dockerignore <-- Files to exclude from Docker build
| .gitignore <-- Files to exclude from git
| docker-compose.yml <-- Docker Compose configuration
| Dockerfile <-- Docker image definition
| LICENSE <-- License information
| README.md <-- This documentation file
| requirements.txt <-- Python dependencies
In simple terms:
- Docker: An environment manager that isolates the operating system layer as well as your packages
- Dockerfile: A recipe for a dish
- Docker Image: A cooked dish
- Docker Compose: Instructions for serving the dish
- Docker Container: A served dish
In technical terms:
- The "Dockerfile" (capital D) specifies how to build the image. For example, it defines the Python version and points to the requirements.txt file for dependencies.
- This file is typically located at the root of your project.
- The docker build command creates an image based on the instructions in the Dockerfile.
- The resulting image is essentially a packaged filesystem containing a lightweight Linux system (Debian-based for the python:3.9 base image used here) with installed packages, such as Python and its libraries.
- The image acts like a compressed archive.
- It is portable and easy to share.
- However, it cannot be used until it is unpacked.
The docker run command generates a container from an image.
- It unpacks the image (like extracting a compressed file) to make it usable.
- The command is often lengthy and varies for each image, making it hard to memorize.
- docker-compose.yml file: To simplify running containers, this command is written in a yml file and placed at the root of the project. From then on, you can start and stop containers with a simple, consistent command:
docker-compose up --build -d
docker-compose down
- Writing the docker-compose.yml file is often the most challenging part of Docker and has project-specific requirements. This repository provides a ready-to-use file for common data science tasks. For other use cases, such as web development, you may need to learn more or consult ChatGPT.
- A container is a running instance of an image: an isolated, lightweight Linux system with the installed packages.
- Containers themselves are not meant to be shared. If you make changes and want to share them, create a new image and distribute that image.
The following questions are addressed:
- Is a .dockerignore file necessary if there is already a .gitignore? Yes.
- What distinguishes .dockerignore from .gitignore?
- What should a .dockerignore file for a data science project look like?
- Explanation of .dockerignore contents.
A .dockerignore file is essential even if you have a .gitignore. While both exclude files, they serve different purposes:
- .gitignore prevents files from being tracked by Git
- .dockerignore prevents files from being copied into Docker images during builds
A .dockerignore file is important because it:
- Reduces the build context size, speeding up builds
- Keeps sensitive files out of Docker images
- Improves build cache efficiency
- Prevents unnecessary files from bloating images
A .dockerignore for data science should exclude:
- Python-specific: Compiled files, cache, and build artifacts
- Virtual environments: Local venvs should not be copied
- Development/IDE files: Editor configs and Git files
- Docker-specific: Dockerfile, docker-compose files, and .dockerignore itself
- Build/distribution: Local build artifacts
- System files: OS-specific files like .DS_Store and Windows Zone identifiers
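The categories above can be turned into a concrete file. The following is a minimal sketch in Python that renders one possible .dockerignore; the specific patterns and the `render_dockerignore` helper are illustrative assumptions, not the exact contents of this repository's file.

```python
# Hypothetical pattern groups, one per category in the list above.
DOCKERIGNORE_GROUPS = {
    "Python-specific": ["__pycache__/", "*.py[cod]", "*.egg-info/", ".ipynb_checkpoints/"],
    "Virtual environments": [".venv/", "venv/", "env/"],
    "Development/IDE files": [".git/", ".gitignore", ".vscode/", ".idea/"],
    "Docker-specific": ["Dockerfile", "docker-compose*.yml", ".dockerignore"],
    "Build/distribution": ["build/", "dist/"],
    "System files": [".DS_Store", "Thumbs.db", "*:Zone.Identifier"],
}


def render_dockerignore(groups=DOCKERIGNORE_GROUPS):
    """Return .dockerignore text with one commented section per category."""
    lines = []
    for title, patterns in groups.items():
        lines.append(f"# {title}")
        lines.extend(patterns)
        lines.append("")  # blank line between sections
    return "\n".join(lines)


# Print the result; redirect it to .dockerignore if the patterns suit your project.
print(render_dockerignore())
```

Review the patterns before using them: if your Dockerfile copies a file that matches one of them, the build will not find it.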
Install Docker using the official documentation or with ChatGPT's help. After installation, verify with commands like docker images, docker ps, and by running the hello-world container.
# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
# Add the repository to Apt sources:
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
# Install Docker and Docker Compose
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
# Install the latest Docker Compose
sudo curl -L "https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
On Windows, the process is straightforward: download and install the 64-bit Docker Desktop for Windows.
Note: To connect VS Code to Docker, Docker must be installed on Windows itself; installing it in WSL is not enough.
After installation, confirm Docker is working:
docker --version
sudo systemctl enable docker
sudo service docker start
Note for WSL users:
WSL distributions that do not have systemd enabled cannot use systemctl commands. In that case, run sudo service docker start each time you boot WSL. You can automate this with a script or alias.
# Check Docker installation
sudo docker images
sudo docker ps
To use Docker without sudo:
sudo usermod -aG docker $USER
Log out and back in (or run newgrp docker) so the group change takes effect, then test it:
docker images
docker ps
In WSL without systemd enabled, sudo systemctl enable docker does not work. Here are ways to start Docker automatically:
If you do not mind running a command daily, use:
sudo service docker start
Create an alias to shorten the command:
echo 'alias start-docker="sudo service docker start"' >> ~/.bashrc
source ~/.bashrc
Now, you can simply type:
start-docker
To start Docker automatically when WSL starts:
- Open WSL and edit the WSL configuration file:
sudo nano /etc/wsl.conf
- Add these lines:
[boot]
command="service docker start"
- Save and exit (Ctrl + X, then Y, then Enter).
- Restart WSL (run this from Windows PowerShell or Command Prompt, not from inside WSL):
wsl --shutdown
A guide to creating a portable and reproducible Docker project template for developing Python scripts and Jupyter notebooks in a containerized environment using VS Code.
- Install Docker Desktop with WSL integration on Windows 11.
- Install Visual Studio Code.
- In VS Code, add these extensions: Docker, Dev Containers (formerly Remote - Containers), Python, and Jupyter.
- Create a new Git repository (or clone an existing one).
- In the repository folder, create these files:
- Dockerfile
- .dockerignore
- docker-compose.yml
- requirements.txt
- scripts/data_prep.py
- notebooks/01_data_exploration.ipynb
- notebooks/02_model_training.ipynb
Add the following to your Dockerfile:
# Base image with Python 3.9
FROM python:3.9
# Set the working directory
WORKDIR /app
# Copy requirements and install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Install Jupyter Notebook and JupyterLab
RUN pip install notebook jupyterlab
# Expose port 8888 for Jupyter
EXPOSE 8888
# Start Jupyter Notebook with no token for development
ENTRYPOINT ["sh", "-c", "exec jupyter notebook --ip=0.0.0.0 --port=8888 --no-browser --allow-root --NotebookApp.token=''" ]
Create a .dockerignore file in your project root to exclude unnecessary files from your Docker image.
In docker-compose.yml, include:
services:
  your-project:
    build: .
    image: your-project_image
    container_name: your-project_container
    volumes:
      - .:/app
    stdin_open: true
    tty: true
    ports:
      - "8888:8888"
This mounts your entire project folder into the container at /app. Note: In lines 2, 4, and 5, replace your-project with your project's name, for example: dockerproject1.
Keep the requirements.txt file clean and current. This ensures all dependencies are installed and maintains compatibility and performance. The ipykernel package is essential for Jupyter notebook support.
ipykernel # This package is essential for running Jupyter notebooks.
numpy==1.26.0
pandas==2.1.3
matplotlib==3.8.0
Best Practice: Use pip in Docker unless Conda is required. Stick to requirements.txt for best compatibility and performance.
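For reproducible builds it helps to know which entries in requirements.txt are not pinned to an exact version. The sketch below is one simple way to check; the `unpinned_requirements` helper is hypothetical, and it only understands plain `name==version` lines, not every requirement form that pip accepts.

```python
import re

# A line counts as pinned if it has the form "name==version" (extras allowed).
PINNED = re.compile(r"[A-Za-z0-9._-]+(\[[^\]]+\])?==[^=\s]+")


def unpinned_requirements(text):
    """Return the requirement lines that lack an exact '==' pin."""
    loose = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.fullmatch(line):
            loose.append(line)
    return loose


sample = """\
ipykernel  # This package is essential for running Jupyter notebooks.
numpy==1.26.0
pandas==2.1.3
matplotlib==3.8.0
"""
print(unpinned_requirements(sample))  # ['ipykernel']
```

In the sample above only ipykernel is reported, so a rebuild at a later date could pull a different ipykernel release than the one you tested with.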
On your host machine (in the project folder), you have two options:
- First (recommended): This method extracts the project name to use as the image and container names.
To make start.sh executable if it is not:
chmod +x start.sh
To extract the project name and then build the image and run the container:
./start.sh
- Second: In this method, the image and container names default to data-science-project.
docker-compose up --build -d
Note:
- --build: Omitting "--build" means changes to Dockerfile or dependencies will not be applied.
- -d: The "-d" flag runs the container in detached mode, so you can keep using the terminal.
Run:
docker-compose ps
Ensure the container status is "Up" and port 8888 is mapped.
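Besides reading the docker-compose ps output, you can confirm from the host that something is actually listening on the mapped port. This is a minimal sketch using only the Python standard library; `port_is_open` is a hypothetical helper, and a successful connection only shows that the port accepts TCP connections, not that Jupyter itself is healthy.

```python
import socket


def port_is_open(host="localhost", port=8888, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


if __name__ == "__main__":
    print("Port 8888 reachable:", port_is_open())
```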
Follow these steps:
1. Press Ctrl+Shift+P to open the command palette.
2. Type and select Dev Containers: Attach to Running Container….
3. Choose the container named your-project_container (the container_name set in docker-compose.yml). A second VS Code window will open.
4. In the new window, click Open Folder. At the top, you will see /root. Delete root to reveal app. Select app and click OK. You will then see all your project's folders and files.
5. In the second VS Code window, install the following extensions: Docker, Dev Containers, Python, and Jupyter. If you see a Reload the window button after installing an extension, click it each time.
6. You are all set and can continue.
Note: In Step 11, if you cannot select the kernel, close the second VS Code window and repeat steps 1–4. The correct kernel will then be automatically attached to the notebooks.
In the second VS Code window, open the integrated terminal. You will see a bash prompt, indicating you are inside the container. Run:
python scripts/data_prep.py
You should see the expected output (for example, "hi").
- Open 01_data_exploration.ipynb in VS Code.
- In the top-right corner of the notebook, you should see a kernel with the same name as your project. If not, click the Select Kernel button and choose the Jupyter kernel option. This will display a kernel with your project's name and the Python kernel specified in the Dockerfile. The libraries from the requirements.txt file, installed in the Docker container, will be automatically available for use.
- You can now run and edit cells within the container.
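A quick way to confirm that the notebook is using the container's interpreter is to print the Python version and the versions of the packages from requirements.txt in the first cell. The following is a minimal sketch; `package_versions` is a hypothetical helper and the default package names are simply those listed in this template's requirements.txt.

```python
import sys
from importlib import metadata


def package_versions(names=("ipykernel", "numpy", "pandas", "matplotlib")):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print("Python:", sys.version.split()[0])
for name, version in package_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

If the Python version does not match the base image in the Dockerfile, or a pinned package shows a different version, the notebook is probably attached to a kernel outside the container.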
docker-compose down
To use Jupyter in a browser instead of VS Code, open localhost:8888/tree while the container is running.
- To rebuild your container with any changes, run on your host:
docker-compose up --build
- After installing a new package, update requirements.txt inside the container by running:
pip freeze > requirements.txt
- For pulling the latest base image, run:
docker-compose build --pull
# Pull images from Docker Hub
docker pull nginx
docker pull hello-world
# List all images
docker images
# Remove images
docker rmi <image1> <image2> ...
# List running containers
docker ps
# List all containers (including stopped ones)
docker ps -a
# List only container IDs
docker ps -aq
# Remove containers
docker rm <CONTAINER1> <CONTAINER2> ...
# Remove all containers
docker rm $(docker ps -aq)
# Run a container in detached mode
docker run -d <IMAGE name or ID>
# Start/stop containers
docker start <CONTAINER name or ID>
docker stop <CONTAINER name or ID>
# Start/stop all containers at once
docker start $(docker ps -aq)
docker stop $(docker ps -aq)
Note: You can identify a container by any unique prefix of its ID. When only a few containers exist, the first two characters are often enough. For example: docker stop 2f
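The prefix shortcut only works while the prefix matches exactly one container. The sketch below models that rule in Python to show when a short prefix is enough; `resolve_prefix` and the container IDs are made up for illustration, and this is not Docker's own implementation.

```python
def resolve_prefix(prefix, container_ids):
    """Return the single ID that starts with prefix, or raise if none or several match."""
    matches = [cid for cid in container_ids if cid.startswith(prefix)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise LookupError(f"no container ID starts with {prefix!r}")
    raise LookupError(f"prefix {prefix!r} is ambiguous: {matches}")


ids = ["2f8a1c0b9d3e", "2b77e41fa0c2", "9c01d5e6f7a8"]  # made-up IDs
print(resolve_prefix("2f", ids))  # 2f8a1c0b9d3e
```

With these IDs, the prefix 2 alone would be ambiguous because two IDs start with it, so a longer prefix is needed.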
# Run nginx and map port 80 of the host to port 80 of the container
docker run -d -p 80:80 nginx
# Run another nginx instance on a different host port
docker run -d -p 8080:80 nginx
# Map multiple ports
docker run -d -p 80:80 -p 443:443 nginx
# Map all exposed ports to random ports
docker run -d -P nginx
The -p host_port:container_port option maps ports between your host system and the container.
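To make the two halves of the mapping explicit, here is a minimal Python sketch that splits a host_port:container_port string and validates both numbers. `parse_port_mapping` is a hypothetical helper that handles only this two-part form; the docker CLI accepts additional variants (such as a bind address or protocol suffix) that the sketch ignores.

```python
def parse_port_mapping(spec):
    """Split 'host_port:container_port' into a pair of validated integers."""
    host_text, sep, container_text = spec.partition(":")
    if not sep:
        raise ValueError(f"expected host_port:container_port, got {spec!r}")
    host_port, container_port = int(host_text), int(container_text)
    for port in (host_port, container_port):
        if not 0 < port <= 65535:
            raise ValueError(f"port {port} is outside 1-65535")
    return host_port, container_port


for spec in ("80:80", "8080:80", "443:443"):
    host_port, container_port = parse_port_mapping(spec)
    print(f"localhost:{host_port} -> container port {container_port}")
```

The left number is what you type in the browser on the host; the right number is what the service inside the container listens on.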
# Enter a container's bash shell
docker exec -it <CONTAINER name or ID> bash
# Save an image to a tar file
docker save -o /home/mostafa/docker-projects/nginx.tar nginx
# Load an image from a tar file
docker load -i /home/mostafa/docker-projects/nginx.tar
Docker assigns random names to containers by default. To specify a custom name:
docker run -d --name <arbitrary-name> -p 80:80 <image-name>
Example:
docker run -d --name webserver -p 80:80 nginx
In networking:
- IP address identifies which device you're communicating with ("who")
- Port number specifies which service or application on that device ("what")
For example, suppose google.com resolves to 215.114.85.17 (an illustrative address) and your browser connects to 215.114.85.17:80:
- 215.114.85.17 is the IP address (who you're talking to)
- 80 is the port number for HTTP (what service you're requesting)
Ports can range from 0 to 65,535 (2^16 - 1), with standard services typically using well-known ports:
- Web servers:
  - HTTP: port 80
  - HTTPS: port 443
- Development servers:
  - FastAPI: port 8000
  - Jupyter: port 8888
  - SSH: port 22
- Database Management Systems (DBMS):
  - MySQL: port 3306
  - PostgreSQL: port 5432
  - MongoDB: port 27017
Important Notes on Database Ports:
- Databases themselves don't have ports; the Database Management Systems (DBMS) do.
- All databases within a single DBMS instance typically use the same port.
- If you want to run two versions of the same DBMS on one server, you must use different ports.
- To serve databases on different ports, run separate DBMS instances (for example, several mongod processes), each configured with its own port. Within one instance, all databases share the same port.
Port mapping in Docker (-p 80:80) allows you to:
- Access containerized services from your host machine
- Run multiple instances of the same service on different host ports
- Avoid port conflicts when multiple containers need the same internal port
With these commands:
- First container: access via localhost:80 in browser
- Second container: access via localhost:8080 in browser
- Both containers are running nginx on their internal port 80
This approach is especially useful for data science projects when you need to:
- Run multiple Jupyter servers
- Access databases from both containerized applications and host tools
- Expose machine learning model APIs
If your container won't start, check:
- Port conflicts: Is another service using the same port?
- Resource limitations: Do you have enough memory/CPU?
- Permission issues: Are volume mounts correctly configured?
When using volume mounts, file permission issues can occur. Solutions:
- Use the --user flag when running the container
- Set appropriate permissions in the Dockerfile
- Use Docker Compose's user option
- Use .dockerignore to reduce build context size
- Minimize the number of layers in your Dockerfile
- Consider multi-stage builds for smaller images
For production:
- Don't use --NotebookApp.token=''
- Set up proper authentication
- Use HTTPS for connections
For deep learning:
- Install NVIDIA Container Toolkit
- Use the --gpus all flag with docker run
- Use appropriate base images (e.g., tensorflow/tensorflow:latest-gpu)
When working with large datasets:
- Don't include data in the Docker image
- Use volume mounts for data directories
- Consider using data volumes or bind mounts
Add these aliases to your .bashrc or .zshrc file to make Docker commands more convenient:
#-----------------------------------------------------------------------------------------
# Docker aliases
# --- Image Management ---
alias di=" docker images --format 'table {{.ID}}\t{{.Repository}}\t{{.Tag}}\t{{.Size}}\t{{.CreatedSince}}'"
alias dia=" docker images -a --format 'table {{.ID}}\t{{.Repository}}\t{{.Tag}}\t{{.Size}}\t{{.CreatedSince}}'"
alias drmi=" docker rmi"
drmia() { docker rmi $(docker images -aq); } # Remove All Images
drmif() { # Remove All dangling images
    local images=$(docker images -q -f dangling=true)
    if [ -n "$images" ]; then
        echo "Removing dangling images: $images"
        docker rmi $images
    else
        echo "No dangling images to remove."
    fi
}
# --- Container Management ---
alias dps=" docker ps --format 'table {{.ID}}\t{{.Image}}\t{{.Names}}\t{{.Status}}\t{{.Ports}}'"
alias dpsa=" docker ps -a --format 'table {{.ID}}\t{{.Image}}\t{{.Names}}\t{{.Status}}\t{{.Ports}}'"
alias dpsaq=" docker ps -aq"
alias dst=" docker start"
alias dsp=" docker stop"
alias drm=" docker rm"
dsta() { docker start $(docker ps -aq); } # Start All Containers
dspa() { docker stop $(docker ps -aq); } # Stop All Containers
drma() { docker rm $(docker ps -aq); } # Remove All Containers
# --- Docker Compose Commands ---
alias dcu=" docker compose up -d --build"
alias dcd=" docker compose down"
# --- Docker Exec Bash ---
deb() { docker exec -it "$1" bash; }
These shortcuts provide:
- di: Lists images with formatted output showing ID, repository, tag, size, and age
- dps/dpsa: Shows running/all containers with formatted output
- drmia: Removes all images
- drmif: Removes only "dangling" images (untagged images)
- dsta/dspa: Starts/stops all containers
- drma: Removes all containers
- dst/dsp: Quick container start/stop
- dcu/dcd: Docker compose up/down with build and detached mode
To use these aliases:
- Add the code block to your shell profile file (~/.bashrc or ~/.zshrc)
- Run source ~/.bashrc or source ~/.zshrc to apply changes
- Start using the shortened commands
When you run docker images, you may see entries with <none> as their repository and tag:
REPOSITORY TAG IMAGE ID CREATED SIZE
p1-ml-engineering-api-fastapi-docker-jupyter latest 5afe18f4594a 13 hours ago 745MB
<none> <none> 808f843b9362 13 hours ago 748MB
<none> <none> 5706fd96eca0 14 hours ago 742MB
<none> <none> 1e904ba38c6d 14 hours ago 742MB
These are called "dangling images" and usually appear when:
- You rebuild an image with the same tag; the old image becomes "dangling" and shows as <none>:<none>
- A build fails or is interrupted
- You pull a new version of an image, and the old one loses its tag
Dangling images:
- Consume disk space unnecessarily
- Make your image list harder to read
- Serve no practical purpose
You can safely remove all dangling images with:
docker image prune -f
Or use the alias defined earlier:
drmif
After running this command, you'll see output listing all deleted images:
Deleted Images:
deleted: sha256:1e904ba38c6dabb0c8c9dd896954c07b5f1b1cf196364ff1de5da46d18aa9fb
deleted: sha256:c73b8c1cc3550886ac1cc5965f89c6c2553b08fb0c472e1a1f9106b26ee4b14
...
This helps keep your Docker environment clean and efficient.
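To get a feel for how much clutter dangling images add, the sample docker images listing shown earlier can be parsed with a few lines of Python. This is a rough sketch: `dangling_images` is a hypothetical helper, it assumes the default column layout, and the summed sizes overstate real disk usage because images share layers.

```python
SAMPLE = """\
REPOSITORY                                     TAG       IMAGE ID       CREATED        SIZE
p1-ml-engineering-api-fastapi-docker-jupyter   latest    5afe18f4594a   13 hours ago   745MB
<none>                                         <none>    808f843b9362   13 hours ago   748MB
<none>                                         <none>    5706fd96eca0   14 hours ago   742MB
<none>                                         <none>    1e904ba38c6d   14 hours ago   742MB
"""


def dangling_images(listing):
    """Return (image_id, size) pairs for rows whose repository and tag are <none>."""
    found = []
    for row in listing.splitlines()[1:]:  # skip the header row
        fields = row.split()
        if len(fields) >= 4 and fields[0] == "<none>" and fields[1] == "<none>":
            found.append((fields[2], fields[-1]))
    return found


images = dangling_images(SAMPLE)
# The sample sizes are all in MB, so stripping the two-letter unit is enough here.
total_mb = sum(int(size[:-2]) for _, size in images)
print(len(images), "dangling images,", total_mb, "MB")  # 3 dangling images, 2232 MB
```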
Proper tagging of Docker images is crucial for organizing, versioning, and deploying your containerized applications, especially in data science projects where model versions matter.
- Use semantic versioning (e.g., v1.0.1, v2.1)
- Avoid using latest in production
- Use environment-specific tags (dev, staging, prod)
- Tag images before pushing to a registry
To tag a Docker image, use:
docker tag SOURCE_IMAGE[:TAG] TARGET_IMAGE[:TAG]
Simple version tagging:
# Tag the current 'latest' image with a version number
docker tag my-datascience-app:latest my-datascience-app:v1.0
Preparing for Docker Hub:
# Tag for pushing to Docker Hub
docker tag my-datascience-app:latest username/my-datascience-app:v1.0
# Then push to Docker Hub
docker push username/my-datascience-app:v1.0
Multiple tags for different environments:
# Create production-ready tag
docker tag my-ml-model:v1.2.3 my-ml-model:prod
# Create development tag
docker tag my-ml-model:latest my-ml-model:dev
For data science projects, consider including model information in your tags:
# Include model architecture and training data version
docker tag my-model:latest my-model:lstm-v2-dataset20230512
# Include accuracy metrics
docker tag my-model:latest my-model:v1.2-acc95.4
Proper tagging helps maintain reproducibility and track which model version is deployed where.
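Model-aware tags like the ones above are easier to keep consistent when a small script generates them. The sketch below builds a tag from a version and an optional accuracy value and checks it against Docker's tag rules (letters, digits, underscores, periods, and hyphens; not starting with a period or hyphen; at most 128 characters). The `model_tag` helper and its naming scheme are assumptions for illustration.

```python
import re

# Docker tag rule: letters, digits, '_', '.', '-'; must not start with '.' or '-';
# at most 128 characters.
TAG_PATTERN = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}")


def model_tag(version, accuracy=None):
    """Build a tag such as 'v1.2-acc95.4' and verify that it is a valid Docker tag."""
    tag = f"v{version}"
    if accuracy is not None:
        tag += f"-acc{accuracy:.1f}"
    if not TAG_PATTERN.fullmatch(tag):
        raise ValueError(f"invalid Docker tag: {tag!r}")
    return tag


print(model_tag("1.2", accuracy=95.4))  # v1.2-acc95.4
print(model_tag("1.0.1"))               # v1.0.1
```

The returned string can then be passed to docker tag, for example as my-model:v1.2-acc95.4.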
By default, when a container is removed, all data written inside it is lost (stopping a container keeps its filesystem until the container is removed). Docker volumes provide persistent storage that exists outside of containers.
- Data Persistence: Retain data even when containers are removed
- Data Sharing: Share data between multiple containers
- Performance: Better I/O performance than bind mounts, especially on Windows/Mac
- Isolation: Manage container data separately from the host filesystem
Syntax for mounting volumes:
docker run -v /host/path:/container/path[:options] image_name
Example 1: Exploring a Container's Default Storage
First, see what's inside a container without volumes:
# Start an nginx container
docker run -d --name nginx-test -p 80:80 nginx
# Enter the container
docker exec -it nginx-test bash
# Check the content of nginx's web directory
cd /usr/share/nginx/html
ls -la
Example 2: Using a Volume for Persistence
Now mount a local directory to nginx's web directory:
docker run -d -p 3000:80 -v /home/username/projects/my-website:/usr/share/nginx/html nginx
This mounts your local directory /home/username/projects/my-website to the container's /usr/share/nginx/html directory. Any changes in either location will be reflected in the other.
The previous example gives full read/write access to the container. For better security, add the :ro (read-only) option:
docker run -d -p 3000:80 -v /home/username/projects/my-website:/usr/share/nginx/html:ro nginx
This prevents the container from modifying files in your local directory.
For data science projects, volumes are especially useful for:
Persisting Jupyter notebooks and data:
docker run -d -p 8888:8888 -v /home/username/ds-project:/app jupyter/datascience-notebook
Sharing datasets between containers:
# Create a named volume
docker volume create dataset-vol
# Mount the volume to multiple containers
docker run -d --name training -v dataset-vol:/data training-image
docker run -d --name inference -v dataset-vol:/data inference-image
Storing model artifacts:
docker run -d -p 8501:8501 -v /home/username/models:/models -e MODEL_PATH=/models/my_model ml-serving-image
- Named Volumes (managed by Docker):
docker volume create my-volume
docker run -v my-volume:/container/path image_name
- Bind Mounts (direct mapping to host):
docker run -v /absolute/host/path:/container/path image_name
- Tmpfs Mounts (stored in host memory):
docker run --tmpfs /container/path image_name