Work in progress

Concourse CI Machine Charm


A Juju machine charm for deploying Concourse CI - a modern, scalable continuous integration and delivery system. This charm supports flexible deployment patterns including single-unit, multi-unit with automatic role assignment, and separate web/worker configurations.

Note: This is a machine charm designed for bare metal, VMs, and LXD deployments. For Kubernetes deployments, see https://charmhub.io/concourse-web and https://charmhub.io/concourse-worker.

Features

  • Flexible Deployment Modes: Deploy as auto-scaled web/workers or explicit roles
  • Automatic Role Detection: Leader unit becomes web server, followers become workers
  • Fully Automated Key Distribution: TSA keys automatically shared via peer relations - zero manual setup!
  • Secure Random Passwords: Auto-generated admin password stored in Juju peer data
  • Latest Version Detection: Automatically downloads the latest Concourse release from GitHub
  • PostgreSQL 16+ Integration: Full support with Juju secrets API for secure credential management
  • Dynamic Port Configuration: Change web port on-the-fly with automatic service restart
  • Privileged Port Support: Run on port 80 with proper Linux capabilities (CAP_NET_BIND_SERVICE)
  • Auto External-URL: Automatically detects unit IP for external-url configuration
  • Ubuntu 24.04 LTS: Optimized for Ubuntu 24.04 LTS
  • Container Runtime: Uses containerd with LXD-compatible configuration
  • Automatic Key Management: TSA keys, session signing keys, and worker keys auto-generated
  • Prometheus Metrics: Optional metrics endpoint for monitoring
  • Download Progress: Real-time installation progress in Juju status
  • GPU Support: NVIDIA (CUDA) and AMD (ROCm) GPU workers for ML/AI workloads (GPU Guide)
  • Dataset Mounting: Automatic dataset injection for GPU tasks (Dataset Guide)
  • 🆕 General Folder Mounting: Automatic discovery and mounting of ANY folder under /srv (General Mounting Guide); a short sketch follows this list
    • ✅ Zero configuration - just mount folders to /srv and go
    • ✅ Read-only by default for data safety
    • ✅ Writable folders with _writable or _rw suffix
    • ✅ Multiple concurrent folders (datasets, models, outputs, caches)
    • ✅ Works on both GPU and non-GPU workers
    • ✅ Automatic permission validation and fail-fast
    • ✅ Backward compatible with existing GPU dataset mounting
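
A minimal host-side sketch of the /srv convention above (the unit number and folder names are illustrative, not required):

# On a worker unit
juju ssh concourse-ci/1

# Read-only by default: safe for sharing source data with tasks
sudo mkdir -p /srv/datasets

# A "_writable" (or "_rw") suffix makes the folder writable from tasks
sudo mkdir -p /srv/outputs_writable

# Folders under /srv are discovered automatically; no extra charm configuration is needed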

Quick Start

Prerequisites

  • Juju 3.x
  • Ubuntu 24.04 LTS (on Juju-managed machines)
  • PostgreSQL charm 16/stable (for web server)

Basic Deployment (Auto Mode)

# Create a Juju model
juju add-model concourse

# Deploy PostgreSQL
juju deploy postgresql --channel 16/stable --base ubuntu@24.04

# Deploy Concourse CI charm as application "concourse-ci"
juju deploy concourse-ci-machine concourse-ci --config mode=auto

# Relate to database (uses PostgreSQL 16 client interface with Juju secrets)
juju integrate concourse-ci:postgresql postgresql:database

# Expose the web interface (opens port in Juju)
juju expose concourse-ci

# Wait for deployment (takes ~5-10 minutes)
juju status --watch 1s

The charm automatically:

  • Reads database credentials from Juju secrets
  • Configures the external URL based on unit IP
  • Opens the configured web port (default: 8080)
  • Generates and stores admin password in peer relation data

Naming Convention:

  • Charm name: concourse-ci-machine (what you deploy from Charmhub)
  • Application name: concourse-ci (used throughout this guide)
  • Unit names: concourse-ci/0, concourse-ci/1, etc.

Once deployed, get credentials with juju run concourse-ci/leader get-admin-password

Multi-Unit Deployment with Auto Mode (Recommended)

Deploy multiple units with automatic role assignment and key distribution:

# Deploy PostgreSQL
juju deploy postgresql --channel 16/stable --base ubuntu@24.04

# Deploy Concourse charm (named "concourse-ci") with 1 web + 2 workers
juju deploy concourse-ci-machine concourse-ci -n 3 --config mode=auto

# Relate to database (using application name "concourse-ci")
juju relate concourse-ci:postgresql postgresql:database

# Check deployment
juju status

Result:

  • concourse-ci/0 (leader): Web server
  • concourse-ci/1-2: Workers
  • All keys automatically distributed via peer relations!

Note: The application is named concourse-ci for easier reference (shorter than the charm name concourse-ci-machine).

Separated Web/Worker Deployment (For Independent Scaling)

For maximum flexibility with separate applications:

# Deploy PostgreSQL
juju deploy postgresql --channel 16/stable --base ubuntu@24.04

# Deploy web server (1 unit)
juju deploy concourse-ci-machine web --config mode=web

# Deploy workers (2 units)  
juju deploy concourse-ci-machine worker -n 2 --config mode=worker

# Relate web to database
juju relate web:postgresql postgresql:database

# Relate web and worker for automatic TSA key exchange
juju relate web:tsa worker:flight

# Check deployment
juju status

Result:

  • web/0: Web server only
  • worker/0, worker/1: Workers only, connected to the web server via TSA

Note: The tsa / flight relation automatically handles SSH key exchange between web and worker applications, eliminating the need for manual key management.
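
To confirm the relations are established, a quick check (endpoint names as used above):

juju status --relations
# Expect to see web:tsa <-> worker:flight and web:postgresql <-> postgresql:database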

Deployment Modes

The charm supports three deployment modes via the mode configuration:

1. auto (Multi-Unit - Fully Automated ✨)

Leader unit runs web server, non-leader units run workers. Keys automatically distributed via peer relations!

Note: You need at least 2 units for this mode to have functional workers (Unit 0 = Web, Unit 1+ = Workers).

juju deploy concourse-ci-machine concourse-ci -n 3 --config mode=auto
juju relate concourse-ci:postgresql postgresql:database

Best for: Production, scalable deployments
Key Distribution: Fully automatic - zero manual intervention required!

2. web + worker (Separate Apps - Automatic TSA Setup)

Deploy web and workers as separate applications for independent scaling.

# Web application
juju deploy concourse-ci-machine web --config mode=web

# Worker application (scalable)
juju deploy concourse-ci-machine worker -n 2 --config mode=worker

# Relate web to PostgreSQL
juju relate web:postgresql postgresql:database

# Relate web and worker for automatic TSA key exchange
juju relate web:tsa worker:flight

Best for: Independent scaling of web and workers
Key Distribution: ✅ Automatic via tsa / flight relation

Configuration Options

| Option | Type | Default | Description |
|---|---|---|---|
| mode | string | auto | Deployment mode: auto, web, or worker |
| version | string | latest | Concourse version to install (auto-detects latest from GitHub) |
| web-port | int | 8080 | Web UI and API port |
| worker-procs | int | 1 | Number of worker processes per unit |
| log-level | string | info | Log level: debug, info, warn, error |
| enable-metrics | bool | true | Enable Prometheus metrics on port 9391 |
| external-url | string | (auto) | External URL for webhooks and OAuth |
| initial-admin-username | string | admin | Initial admin username |
| container-placement-strategy | string | volume-locality | Container placement: volume-locality, random, etc. |
| max-concurrent-downloads | int | 10 | Max concurrent resource downloads |
| containerd-dns-proxy-enable | bool | false | Enable containerd DNS proxy |
| containerd-dns-server | string | 1.1.1.1,8.8.8.8 | DNS servers for containerd containers |

Changing Configuration

Configuration changes are applied dynamically with automatic service restart.

# Set custom web port (automatically restarts service)
juju config concourse-ci web-port=9090

# Change to privileged port 80 (requires CAP_NET_BIND_SERVICE - already configured)
juju config concourse-ci web-port=80

# Enable debug logging
juju config concourse-ci log-level=debug

# Set external URL (auto-detects unit IP if not set)
juju config concourse-ci external-url=https://ci.example.com

Upgrading Concourse Version

Use the upgrade action to change the Concourse CI version. Update the version configuration first so the change persists across charm refreshes, then run the action:

# Set version configuration first (essential for persistence)
juju config concourse-ci version=7.14.3

# Trigger the upgrade action (check actions.yaml if the action name differs); all workers upgrade automatically
juju run concourse-ci/leader upgrade

# Downgrade is also supported (update config, then run the action)
juju config concourse-ci version=7.12.1
juju run concourse-ci/leader upgrade

Auto-upgrade behavior:

  • When the web server (leader in mode=auto) is upgraded, all workers automatically upgrade to match
  • Works across separate applications connected via TSA relations
  • Workers show "Auto-upgrading Concourse CI to X.X.X..." during automatic upgrades

Note: The web-port configuration supports dynamic changes including privileged ports (< 1024) thanks to AmbientCapabilities=CAP_NET_BIND_SERVICE in the systemd service.
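
A sketch for verifying the capability on a web unit (the unit file name comes from the Systemd Services section below):

juju ssh concourse-ci/0 -- systemctl cat concourse-server | grep -i ambientcapabilities
# Expected to include: AmbientCapabilities=CAP_NET_BIND_SERVICE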

Using Concourse

Access the Web UI

  1. Get the web server IP:
juju status
  2. Check the exposed port (shown in the Ports column):
juju status concourse-ci
# Look for: Ports column showing "80/tcp" or "8080/tcp"
  3. Open in browser: http://<web-unit-ip>:<port>

  4. Get the admin credentials:

juju run concourse-ci/leader get-admin-password

Example output:

message: Use these credentials to login to Concourse web UI
password: 01JfF@I!9W^0%re!3I!hyy3C
username: admin

Security: A random password is automatically generated on first deployment and stored securely in Juju peer relation data. All units in the deployment share the same credentials.

Using Fly CLI

The Fly CLI is Concourse's command-line tool for managing pipelines:

# Download fly from your Concourse instance
curl -Lo fly "http://<web-unit-ip>:8080/api/v1/cli?arch=amd64&platform=linux"
chmod +x fly
sudo mv fly /usr/local/bin/

# Get credentials (the unit key in the JSON output, "unit-concourse-ci-2" here, depends on which unit is the leader; adjust it to match your deployment)
ADMIN_PASSWORD=$(juju run concourse-ci/leader get-admin-password --format=json | jq -r '."unit-concourse-ci-2".results.password')

# Login
fly -t prod login -c http://<web-unit-ip>:8080 -u admin -p "$ADMIN_PASSWORD"

# Sync fly version
fly -t prod sync

Create Your First Pipeline

⚠️ Important: This charm uses containerd runtime. All tasks must include an image_resource.

  1. Create a pipeline file hello.yml:
jobs:
- name: hello-world
  plan:
  - task: say-hello
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: busybox
      run:
        path: sh
        args:
        - -c
        - |
          echo "=============================="
          echo "Hello from Concourse CI!"
          echo "Date: $(date)"
          echo "=============================="
  2. Set the pipeline:
fly -t prod set-pipeline -p hello -c hello.yml
fly -t prod unpause-pipeline -p hello
  3. Trigger the job:
fly -t prod trigger-job -j hello/hello-world -w

Note: Common lightweight images: busybox (~2MB), alpine (~5MB), ubuntu (~28MB)

Scaling

Add More Workers

# Add 2 more worker units to the concourse-ci application
juju add-unit concourse-ci -n 2

# Verify workers
juju ssh concourse-ci/0  # SSH to unit 0 of concourse-ci application
fly -t local workers

Remove Workers

# Remove specific unit
juju remove-unit concourse-ci/3

Relations

Required Relations

PostgreSQL (Required for Web Server)

The web server requires a PostgreSQL database:

juju relate concourse-ci:postgresql postgresql:database

Supported PostgreSQL Charms:

  • postgresql (16/stable recommended)
  • Any charm providing the postgresql interface

Optional Relations

Monitoring

Concourse exposes Prometheus metrics on port 9391:

juju relate concourse-ci:monitoring prometheus:target
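
To check the metrics endpoint directly (assuming enable-metrics is left at its default of true and the default port 9391):

curl -s http://<web-unit-ip>:9391/metrics | head
# Concourse-prefixed metric names indicate the endpoint is serving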

Peer Relation

Units automatically coordinate via the peers relation (automatic, no action needed).

Storage

The charm uses Juju storage for persistent data:

# Deploy with specific storage
juju deploy concourse-ci-machine concourse-ci --storage concourse-data=20G

# Add storage to existing unit
juju add-storage concourse-ci/0 concourse-data=10G

Storage is mounted at /var/lib/concourse.
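
To confirm the storage attachment and the mount point, a quick sketch:

# List storage known to Juju
juju storage

# Confirm the mount on a unit
juju ssh concourse-ci/0 -- df -h /var/lib/concourse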

GPU Support

Concourse workers can utilize NVIDIA GPUs for ML/AI workloads, GPU-accelerated builds, and compute-intensive tasks.

Prerequisites

  • NVIDIA GPU hardware on the host machine
  • NVIDIA drivers installed on the host (tested with driver 580.95+)
  • For LXD/containers: GPU passthrough configured (see below)

Note: The charm automatically installs nvidia-container-toolkit and configures the GPU runtime. No manual setup required!

Quick Start: Deploy with GPU

Complete deployment from scratch:

# 1. Deploy PostgreSQL
juju deploy postgresql --channel 16/stable --base ubuntu@24.04

# 2. Deploy web server
juju deploy concourse-ci-machine web --config mode=web

# 3. Deploy GPU-enabled worker
juju deploy concourse-ci-machine worker \
  --config mode=worker \
  --config compute-runtime=cuda

# 4. Add GPU to LXD container (only manual step for localhost cloud)
lxc config device add <container-name> gpu0 gpu
# Example: lxc config device add juju-abc123-0 gpu0 gpu

# 5. Create relations
juju relate web:postgresql postgresql:database
juju relate web:tsa worker:flight

# 6. Check status
juju status worker
# Expected: "Worker ready (GPU: 1x NVIDIA)"

Enable GPU on Existing Worker

# Enable NVIDIA GPU on already deployed worker
juju config worker compute-runtime=cuda

# Enable AMD GPU on already deployed worker
juju config worker compute-runtime=rocm

# Disable GPU
juju config worker compute-runtime=none

LXD GPU Passthrough (One-time setup)

If deploying on LXD (localhost cloud), add GPU to the container:

# Find your worker container name
lxc list | grep juju

# Add GPU device (requires container restart)
lxc config device add <container-name> gpu0 gpu

# Example:
lxc config device add juju-abc123-0 gpu0 gpu

Everything else is automated! The charm will:

  • ✅ Install nvidia-container-toolkit
  • ✅ Create GPU wrapper script
  • ✅ Configure runtime for GPU passthrough
  • ✅ Set up automatic GPU device injection

GPU Configuration Options

| Option | Default | Description |
|---|---|---|
| compute-runtime | none | GPU compute runtime: none, cuda (NVIDIA), or rocm (AMD) |
| gpu-device-ids | all | GPU devices to expose: "all" or "0,1,2" |
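
For example, to expose only the first GPU on a worker application named worker (a sketch using the gpu-device-ids option above):

juju config worker gpu-device-ids="0"

# Revert to exposing all GPUs
juju config worker gpu-device-ids="all"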

GPU Worker Tags

When GPU is enabled, workers are automatically tagged:

  • cuda - NVIDIA GPU worker (when compute-runtime=cuda)
  • rocm - AMD GPU worker (when compute-runtime=rocm)
  • gpu-count=N - Number of GPUs available
  • gpu-devices=0,1 - Specific device IDs (if configured)

Example: GPU Pipeline

Create a pipeline that targets GPU-enabled workers:

jobs:
- name: train-model-nvidia
  plan:
  - task: gpu-training
    tags: [cuda]  # Target NVIDIA GPU workers
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: nvidia/cuda
          tag: 13.1.0-runtime-ubuntu24.04
      run:
        path: sh
        args:
        - -c
        - |
          # Verify GPU access
          nvidia-smi
          
          # Run your GPU workload
          python train.py --use-gpu

- name: gpu-benchmark
  plan:
  - task: benchmark
    tags: [cuda, gpu-count=1]  # More specific targeting
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: nvidia/cuda
          tag: 13.1.0-base-ubuntu24.04
      run:
        path: nvidia-smi

Verifying GPU Access

# Check worker status
juju status worker
# Should show: "Worker ready (GPU: 1x NVIDIA)"

# Verify GPU tags in Concourse
fly -t local workers
# Worker should show tags: cuda, gpu-count=1
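
A one-off task is another way to confirm GPU access from inside a container; the sketch below mirrors the ROCm section's test-gpu.yml idea (the file name and image tag are illustrative):

# Create a minimal GPU test task
cat > test-gpu.yml <<'EOF'
platform: linux
image_resource:
  type: registry-image
  source:
    repository: nvidia/cuda
    tag: 13.1.0-base-ubuntu24.04
run:
  path: nvidia-smi
EOF

# Run it on a CUDA-tagged worker
fly -t local execute -c test-gpu.yml --tag=cuda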

Common GPU Images

  • nvidia/cuda:13.1.0-base-ubuntu24.04 - CUDA base (~174MB)
  • nvidia/cuda:13.1.0-runtime-ubuntu24.04 - CUDA runtime (~1.38GB)
  • nvidia/cuda:13.1.0-devel-ubuntu24.04 - CUDA development (~3.39GB)
  • tensorflow/tensorflow:latest-gpu - TensorFlow with GPU
  • pytorch/pytorch:latest - PyTorch with GPU

GPU Troubleshooting

Worker shows "GPU enabled but no GPU detected"

  • Verify the GPU is present on the host: lspci | grep -i nvidia
  • Check the driver installation: nvidia-smi

Container cannot access GPU

  • Verify nvidia-container-runtime: which nvidia-container-runtime
  • Check containerd config: cat /etc/containerd/config.toml
  • Restart containerd: sudo systemctl restart containerd

GPU not showing in task

  • Ensure using NVIDIA CUDA base image
  • Run nvidia-smi in task to debug
  • Check worker tags: fly -t local workers

AMD GPU Support (ROCm)

Concourse workers can utilize AMD GPUs with ROCm for ML/AI workloads, GPU-accelerated computations, and HPC tasks.

Prerequisites

  • AMD GPU hardware on the host machine (e.g., Radeon RX 6000/7000 series, MI series)
  • AMD GPU drivers installed on the host
  • ROCm tools (optional, for host-side management)
  • For LXD/containers: GPU passthrough configured (see below)

Note: The charm automatically installs amd-container-toolkit, generates CDI specifications, and configures the ROCm runtime. No manual setup required!

Quick Start: Deploy with AMD GPU

Complete deployment from scratch:

# 1. Deploy PostgreSQL
juju deploy postgresql --channel 16/stable --base ubuntu@24.04

# 2. Deploy web server
juju deploy concourse-ci-machine web --config mode=web

# 3. Deploy ROCm-enabled worker
juju deploy concourse-ci-machine worker \
  --config mode=worker \
  --config compute-runtime=rocm

# 4. Add AMD GPU to LXD container (use specific GPU ID for multi-GPU systems)
# Note: On systems with multiple GPU vendors, use 'id=N' to target specific GPU
lxc query /1.0/resources | jq '.gpu.cards[] | {id: .drm.id, vendor, driver, product_id, vendor_id, pci_address}'
lxc config device add <container-name> gpu1 gpu id=1
# Example: lxc config device add juju-abc123-0 gpu1 gpu id=1

# 5. Create relations
juju relate web:postgresql postgresql:database
juju relate web:tsa worker:flight

# 6. Check status
juju status worker
# Expected: "Worker ready (v7.14.2) (GPU: 1x AMD)"

Enable ROCm on Existing Worker

# Enable AMD GPU on already deployed worker
juju config worker compute-runtime=rocm

LXD GPU Passthrough for AMD (Critical for Multi-GPU Systems)

If deploying on LXD (localhost cloud), add AMD GPU to the container:

# Find available GPUs and their IDs
lxc query /1.0/resources | jq '.gpu.cards[] | {id: .drm.id, vendor, driver, product_id, vendor_id, pci_address}'

# Output example:
# {
#   "id": 0,
#   "vendor": "NVIDIA Corporation",
#   "driver": "nvidia",
#   "product": "GA104 [GeForce RTX 3070]"
# }
# {
#   "id": 1,
#   "vendor": "Advanced Micro Devices, Inc. [AMD/ATI]",
#   "driver": "amdgpu",
#   "product": "Navi 31 [Radeon RX 7900 XT]"
# }

# Add AMD GPU device using specific ID (GPU 1 in this example)
lxc config device add <container-name> gpu1 gpu id=1

# Add /dev/kfd device (required for ROCm compute)
lxc config device add <container-name> kfd unix-char source=/dev/kfd path=/dev/kfd

# Example:
lxc config device add juju-abc123-0 gpu1 gpu id=1
lxc config device add juju-abc123-0 kfd unix-char source=/dev/kfd path=/dev/kfd

⚠️ IMPORTANT for Multi-GPU Systems:

  • Generic lxc config device add ... gpu passes ALL GPUs to the container
  • This causes ambiguity when both NVIDIA and AMD GPUs are present
  • Always use id=N to target the specific AMD GPU
  • GPU ID corresponds to /dev/dri/cardN (e.g., id=1 → /dev/dri/card1)

⚠️ CRITICAL for ROCm Compute:

  • /dev/kfd (Kernel Fusion Driver) is required for ROCm compute workloads
  • Without /dev/kfd, GPU monitoring works but PyTorch/TensorFlow cannot use the GPU
  • Must be added as separate device after GPU passthrough

⚠️ Supported AMD GPUs:

  • Discrete GPUs (fully supported): RX 6000/7000 series, Radeon Pro, Instinct MI series - work natively
  • Integrated GPUs (requires workaround): APUs like Phoenix1 (gfx1103), Renoir, Cezanne
    • CAN work with HSA_OVERRIDE_GFX_VERSION environment variable (see below)
    • ⚠️ Lower performance due to shared system memory
    • Recommended for development/testing, not production ML workloads
  • Check ROCm compatibility: https://rocm.docs.amd.com/en/latest/release/gpu_os_support.html

Everything else is automated! The charm will:

  • ✅ Install amd-container-toolkit
  • ✅ Generate CDI specification
  • ✅ Install rocm-smi for GPU monitoring
  • ✅ Create AMD GPU wrapper script
  • ✅ Configure runtime for ROCm GPU passthrough
  • ✅ Set up automatic GPU device injection into task containers (including /dev/kfd)

ROCm Configuration Options

| Option | Default | Description |
|---|---|---|
| compute-runtime | none | GPU compute runtime: none, cuda (NVIDIA), or rocm (AMD) |
| gpu-device-ids | all | GPU devices to expose: "all" or "0,1,2" |

ROCm Worker Tags

When ROCm GPU is enabled, workers are automatically tagged:

  • rocm - AMD GPU worker (when compute-runtime=rocm)
  • gpu-count=N - Number of AMD GPUs available
  • gpu-devices=0,1 - Specific device IDs (if configured)

Example: ROCm GPU Pipeline

Create a pipeline that targets ROCm-enabled workers:

jobs:
- name: rocm-benchmark
  plan:
  - task: gpu-test
    tags: [rocm]  # Target ROCm workers
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: rocm/dev-ubuntu-24.04
          tag: latest
      run:
        path: sh
        args:
        - -c
        - |
          # Verify GPU access
          rocm-smi
          
          # Check available devices
          ls -la /dev/dri/
          
          # Run your ROCm workload
          python train.py --rocm

- name: amd-gpu-compute
  plan:
  - task: compute
    tags: [rocm, gpu-count=1]  # More specific targeting
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: rocm/pytorch
          tag: latest
      run:
        path: sh
        args:
        - -c
        - |
          # For integrated AMD GPUs (Phoenix1/gfx1103, etc.)
          export HSA_OVERRIDE_GFX_VERSION=11.0.0
          
          python3 -c "import torch; print('CUDA available:', torch.cuda.is_available()); x = torch.rand(5,3).cuda(); print('Result:', x * 2)"

Verifying ROCm GPU Access

# Check worker status
juju status worker
# Should show: "Worker ready (v7.14.2) (GPU: 1x AMD)"

# Verify GPU tags in Concourse
fly -t local workers
# Worker should show tags: rocm, gpu-count=1

# Test GPU access in a task
fly -t local execute -c test-gpu.yml --tag=rocm
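
test-gpu.yml is not defined elsewhere in this README; a minimal sketch of such a task config (image choice follows the list below):

cat > test-gpu.yml <<'EOF'
platform: linux
image_resource:
  type: registry-image
  source:
    repository: rocm/dev-ubuntu-24.04
    tag: latest
run:
  path: sh
  args:
  - -c
  - |
    rocm-smi
    ls -la /dev/dri/ /dev/kfd
EOF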

Common ROCm Images

  • rocm/dev-ubuntu-24.04:latest - ROCm development base (~1.1GB)
  • rocm/tensorflow:latest - TensorFlow with ROCm
  • rocm/pytorch:latest - PyTorch with ROCm (~6GB, includes PyTorch 2.9.1+rocm7.2.0)
  • rocm/rocm-terminal:latest - ROCm with utilities

HSA_OVERRIDE_GFX_VERSION Workaround for Integrated GPUs

Integrated AMD GPUs (APUs) like Phoenix1 (gfx1103), Renoir, and Cezanne are not officially supported by ROCm, but can work with the HSA_OVERRIDE_GFX_VERSION environment variable.

Why it's needed:

  • ROCm checks GPU architecture (GFX version) and rejects unsupported GPUs
  • Integrated GPUs often use newer GFX versions without full ROCm kernel support
  • Override tells ROCm to use kernels from a supported architecture

How to use:

jobs:
- name: pytorch-rocm-integrated-gpu
  plan:
  - task: test-gpu
    tags: [rocm]
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: rocm/pytorch
          tag: latest
      run:
        path: sh
        args:
        - -c
        - |
          # Set override for gfx1103 (Phoenix1) - use gfx11.0.0 kernels
          export HSA_OVERRIDE_GFX_VERSION=11.0.0
          
          # Your PyTorch code
          python3 -c "
          import torch
          print('CUDA (ROCm) available:', torch.cuda.is_available())
          x = torch.rand(5, 3).cuda()
          y = x * 2
          print('GPU computation succeeded!')
          print('Result:', y)
          "

Override values for common integrated GPUs:

| GPU Architecture | GFX Version | Override Value |
|---|---|---|
| Phoenix1 (780M) | gfx1103 | 11.0.0 |
| Renoir (4000 series) | gfx90c | 9.0.0 |
| Cezanne (5000 series) | gfx90c | 9.0.0 |

Limitations:

  • ⚠️ Uses suboptimal kernels → lower performance than discrete GPUs
  • ⚠️ Shared system memory → memory bandwidth limitations
  • ⚠️ May not support all ROCm features
  • ✅ Good for development, testing, and light compute workloads
  • ❌ Not recommended for production ML training

Testing on host (before deploying pipeline):

# Test if your integrated GPU works with override
docker run --rm -it --device=/dev/kfd --device=/dev/dri \
  rocm/pytorch:latest sh -c "
    export HSA_OVERRIDE_GFX_VERSION=11.0.0
    python3 -c 'import torch; x = torch.rand(5,3).cuda(); print(x * 2)'
  "

ROCm Troubleshooting

Worker shows "GPU enabled but no GPU detected"

  • Verify AMD GPU present: lspci | grep -i amd
  • Check driver: lsmod | grep amdgpu
  • Check devices: ls -la /dev/dri/

Container cannot access AMD GPU

  • Verify LXD device passthrough: lxc config device show <container-name>
  • Check devices in container: juju ssh worker/0 -- ls -la /dev/dri/
  • Ensure using correct GPU ID on multi-GPU systems
  • Check /dev/kfd: Must be present for compute workloads

PyTorch/TensorFlow shows "CUDA (ROCm) available: False"

  • Most common: Missing /dev/kfd device
    • Check in container: ls -la /dev/kfd
    • Add if missing: lxc config device add <container-name> kfd unix-char source=/dev/kfd path=/dev/kfd
  • Integrated GPU without override: Try HSA_OVERRIDE_GFX_VERSION workaround (see above)
    • Verify GPU model: lspci | grep -i vga
    • Check PCI ID: cat /sys/class/drm/card*/device/uevent | grep PCI_ID
    • For gfx1103 (Phoenix1): export HSA_OVERRIDE_GFX_VERSION=11.0.0
  • HSA_STATUS_ERROR_OUT_OF_RESOURCES: Usually indicates unsupported GPU or missing drivers

rocm-smi works but PyTorch doesn't detect GPU

  • This indicates /dev/kfd is missing or inaccessible
  • rocm-smi only needs /dev/dri/* for monitoring
  • PyTorch needs /dev/kfd for compute operations
  • Solution: Add /dev/kfd device to container (see above)

rocm-smi not working in container

  • Ensure using ROCm-enabled image (rocm/dev-ubuntu-24.04 or similar)
  • Check device permissions: ls -la /dev/dri/ in task
  • ROCm version mismatch: Host and container ROCm versions should be compatible

GPU not showing in task

  • Ensure using ROCm-enabled image
  • Run ls -la /dev/dri/ in task to debug device availability
  • Check worker tags: fly -t local workers
  • Verify task uses correct tags: --tag=rocm

Multi-GPU system issues

  • If worker detects wrong GPU type, check LXD device configuration
  • Use specific GPU ID: lxc config device add ... gpu id=1 (not generic gpu)
  • Query GPU IDs: lxc query /1.0/resources | jq '.gpu.cards[] | {id: .drm.id, vendor, driver, product_id, vendor_id, pci_address}'

Integrated GPU performance issues

  • If compute works but is slow, this is expected (shared memory bandwidth)
  • Consider discrete GPU for production workloads
  • Use integrated GPU for testing/development only
  • Monitor memory usage: integrated GPUs share system RAM

Troubleshooting

Charm Shows "Blocked" Status

Cause: Usually means PostgreSQL relation is missing (for web units).

Fix:

juju relate concourse-ci:postgresql postgresql:database

Web Server Won't Start

Check logs:

juju debug-log --include concourse-ci/0 --replay --no-tail | tail -50

# Or SSH and check systemd
juju ssh concourse-ci/0
sudo journalctl -u concourse-server -f

Common issues:

  • Database not configured: Check PostgreSQL relation
  • Auth configuration missing: Check /var/lib/concourse/config.env
  • Port already in use: Change web-port config

Workers Not Connecting

Check worker status:

juju ssh concourse-ci/1  # Worker unit
sudo systemctl status concourse-worker
sudo journalctl -u concourse-worker -f

Common issues:

  • TSA keys not generated: Check /var/lib/concourse/keys/
  • Containerd not running: sudo systemctl status containerd
  • Network connectivity: Ensure workers can reach the web server (a quick check is sketched below)
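
A minimal connectivity check from a worker unit, assuming Concourse's default TSA port (2222); nc may need to be installed on the unit:

juju ssh concourse-ci/1 -- nc -vz <web-unit-ip> 2222
# A "succeeded"/"open" result means the worker can reach the web server's TSA endpoint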

View Configuration

juju ssh concourse-ci/0
sudo cat /var/lib/concourse/config.env

Architecture

Components

┌─────────────────────────────────────────────────────────┐
│                     Web Server                          │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐         │
│  │ Web UI/API │  │    TSA     │  │ Scheduler  │         │
│  └────────────┘  └────────────┘  └────────────┘         │
│         │              │                 │              │
│         └──────────────┴─────────────────┘              │
│                        │                                │
└────────────────────────┼────────────────────────────────┘
                         │
                         │ (SSH over TSA)
                         │
        ┌────────────────┴────────────────┐
        │                                 │
  ┌─────▼──────┐                   ┌─────▼──────┐
  │  Worker 1  │                   │  Worker 2  │
  │┌──────────┐│                   │┌──────────┐│
  ││Container ││                   ││Container ││
  ││Runtime   ││                   ││Runtime   ││
  │└──────────┘│                   │└──────────┘│
  └────────────┘                   └────────────┘

For more on Concourse's internals, see https://concourse-ci.org/internals.html

Key Directories

  • /opt/concourse/: Concourse binaries
  • /var/lib/concourse/: Data and configuration
  • /var/lib/concourse/keys/: TSA and worker keys
  • /var/lib/concourse/worker/: Worker runtime directory

Systemd Services

  • concourse-server.service: Web server (runs as concourse user)
  • concourse-worker.service: Worker (runs as root)
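
A quick status check from the Juju client (unit numbers assume the auto-mode layout described earlier):

# Web service on the leader unit
juju ssh concourse-ci/0 -- sudo systemctl status concourse-server --no-pager

# Worker service on a follower unit
juju ssh concourse-ci/1 -- sudo systemctl status concourse-worker --no-pager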

Development

Building from Source

# Install charmcraft
sudo snap install charmcraft --classic

# Clone repository
git clone https://github.com/fourdollars/concourse-ci-machine.git
cd concourse-ci-machine

# Build charm
charmcraft pack

# Deploy locally
juju deploy ./concourse-ci-machine_amd64.charm

Project Structure

concourse-ci-machine/
├── src/
│   └── charm.py                  # Main charm logic
├── lib/
│   ├── concourse_common.py       # Shared utilities
│   ├── concourse_installer.py    # Installation logic
│   ├── concourse_web.py          # Web server management
│   └── concourse_worker.py       # Worker management
├── metadata.yaml                 # Charm metadata
├── config.yaml                   # Configuration options
├── charmcraft.yaml               # Build configuration
├── actions.yaml                  # Charm actions
└── README.md                     # This file

Security

Initial Setup

  1. Retrieve the admin credentials (a random password is generated automatically on first deployment):
juju run concourse-ci/leader get-admin-password
fly -t prod login -c http://<ip>:8080 -u admin -p '<password-from-above>'
  2. Configure proper authentication:

    • Set up OAuth providers (GitHub, GitLab, etc.)
    • Use Juju secrets for credentials
    • Enable HTTPS with reverse proxy (nginx/haproxy)
  3. Network security:

    • Use Juju spaces to isolate networks
    • Configure firewall rules to restrict access
    • Use private PostgreSQL endpoints

Database Credentials

Database credentials are passed securely via Juju relations, not environment variables.

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Commit your changes (git commit -m 'Add amazing feature')
  5. Push to the branch (git push origin feature/amazing-feature)
  6. Open a Pull Request

License

This charm is licensed under the Apache 2.0 License. See LICENSE for details.

Support

  • Community Support: Open an issue on GitHub
  • Commercial Support: Contact maintainers

Acknowledgments

  • Concourse CI team for the amazing CI/CD system
  • Canonical for Juju and the Operator Framework
  • Contributors to this charm
