Work in progress

Concourse CI Machine Charm


A Juju machine charm for deploying Concourse CI - a modern, scalable continuous integration and delivery system. This charm supports flexible deployment patterns including single-unit, multi-unit with automatic role assignment, and separate web/worker configurations.

Note: This is a machine charm designed for bare metal, VMs, and LXD deployments. For Kubernetes deployments, see https://charmhub.io/concourse-web and https://charmhub.io/concourse-worker.

Features

  • Flexible Deployment Modes: Deploy as auto-scaled web/workers or explicit roles
  • Automatic Role Detection: Leader unit becomes web server, followers become workers
  • Fully Automated Key Distribution: TSA keys automatically shared via peer relations - zero manual setup!
  • Secure Random Passwords: Auto-generated admin password stored in Juju peer data
  • Latest Version Detection: Automatically downloads the latest Concourse release from GitHub
  • PostgreSQL 16+ Integration: Full support with Juju secrets API for secure credential management
  • Dynamic Port Configuration: Change web port on-the-fly with automatic service restart
  • Privileged Port Support: Run on port 80 with proper Linux capabilities (CAP_NET_BIND_SERVICE)
  • Auto External-URL: Automatically detects unit IP for external-url configuration
  • Ubuntu 24.04 LTS: Optimized for Ubuntu 24.04 LTS
  • Container Runtime: Uses containerd with LXD-compatible configuration
  • Automatic Key Management: TSA keys, session signing keys, and worker keys auto-generated
  • Prometheus Metrics: Optional metrics endpoint for monitoring
  • Download Progress: Real-time installation progress in Juju status
  • GPU Support: NVIDIA (CUDA) and AMD (ROCm) GPU workers for ML/AI workloads (GPU Guide)
  • Dataset Mounting: Automatic dataset injection for GPU tasks (Dataset Guide)
  • 🆕 General Folder Mounting: Automatic discovery and mounting of ANY folder under /srv (General Mounting Guide); a short sketch follows this list
    • ✅ Zero configuration - just mount folders to /srv and go
    • ✅ Read-only by default for data safety
    • ✅ Writable folders with _writable or _rw suffix
    • ✅ Multiple concurrent folders (datasets, models, outputs, caches)
    • ✅ Works on both GPU and non-GPU workers
    • ✅ Automatic permission validation and fail-fast
    • ✅ Backward compatible with existing GPU dataset mounting
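
A minimal host-side sketch of the /srv convention above (the unit number and folder names are illustrative, not required):

# On a worker unit
juju ssh concourse-ci/1

# Read-only by default: safe for sharing source data with tasks
sudo mkdir -p /srv/datasets

# A "_writable" (or "_rw") suffix makes the folder writable from tasks
sudo mkdir -p /srv/outputs_writable

# Folders under /srv are discovered automatically; no extra charm configuration is needed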

Quick Start

Prerequisites

  • Juju 3.x
  • Ubuntu 24.04 LTS (on Juju-managed machines)
  • PostgreSQL charm 16/stable (for web server)

Basic Deployment (Auto Mode)

# Create a Juju model
juju add-model concourse

# Deploy PostgreSQL
juju deploy postgresql --channel 16/stable --base ubuntu@24.04

# Deploy Concourse CI charm as application "concourse-ci"
juju deploy concourse-ci-machine concourse-ci --config mode=auto

# Relate to database (uses PostgreSQL 16 client interface with Juju secrets)
juju integrate concourse-ci:postgresql postgresql:database

# Expose the web interface (opens port in Juju)
juju expose concourse-ci

# Wait for deployment (takes ~5-10 minutes)
juju status --watch 1s

The charm automatically:

  • Reads database credentials from Juju secrets
  • Configures the external URL based on unit IP
  • Opens the configured web port (default: 8080)
  • Generates and stores admin password in peer relation data

Naming Convention:

  • Charm name: concourse-ci-machine (what you deploy from Charmhub)
  • Application name: concourse-ci (used throughout this guide)
  • Unit names: concourse-ci/0, concourse-ci/1, etc.

Once deployed, get credentials with juju run concourse-ci/leader get-admin-password

Multi-Unit Deployment with Auto Mode (Recommended)

Deploy multiple units with automatic role assignment and key distribution:

# Deploy PostgreSQL
juju deploy postgresql --channel 16/stable --base ubuntu@24.04

# Deploy Concourse charm (named "concourse-ci") with 1 web + 2 workers
juju deploy concourse-ci-machine concourse-ci -n 3 --config mode=auto

# Relate to database (using application name "concourse-ci")
juju relate concourse-ci:postgresql postgresql:database

# Check deployment
juju status

Result:

  • concourse-ci/0 (leader): Web server
  • concourse-ci/1-2: Workers
  • All keys automatically distributed via peer relations!

Note: The application is named concourse-ci for easier reference (shorter than the charm name concourse-ci-machine).

Separated Web/Worker Deployment (For Independent Scaling)

For maximum flexibility with separate applications:

# Deploy PostgreSQL
juju deploy postgresql --channel 16/stable --base ubuntu@24.04

# Deploy web server (1 unit)
juju deploy concourse-ci-machine web --config mode=web

# Deploy workers (2 units)  
juju deploy concourse-ci-machine worker -n 2 --config mode=worker

# Relate web to database
juju relate web:postgresql postgresql:database

# Relate web and worker for automatic TSA key exchange
juju relate web:tsa worker:flight

# Check deployment
juju status

Result:

  • web/0: Web server only
  • worker/0, worker/1: Workers only, connected to the web server via TSA

Note: The tsa / flight relation automatically handles SSH key exchange between web and worker applications, eliminating the need for manual key management.
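
To confirm the relations are established, a quick check (endpoint names as used above):

juju status --relations
# Expect to see web:tsa <-> worker:flight and web:postgresql <-> postgresql:database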

Deployment Modes

The charm supports three deployment modes via the mode configuration:

1. auto (Multi-Unit - Fully Automated ✨)

Leader unit runs web server, non-leader units run workers. Keys automatically distributed via peer relations!

Note: You need at least 2 units for this mode to have functional workers (Unit 0 = Web, Unit 1+ = Workers).

juju deploy concourse-ci-machine concourse-ci -n 3 --config mode=auto
juju relate concourse-ci:postgresql postgresql:database

Best for: Production, scalable deployments
Key Distribution: Fully automatic - zero manual intervention required!

2. web + worker (Separate Apps - Automatic TSA Setup)

Deploy web and workers as separate applications for independent scaling.

# Web application
juju deploy concourse-ci-machine web --config mode=web

# Worker application (scalable)
juju deploy concourse-ci-machine worker -n 2 --config mode=worker

# Relate web to PostgreSQL
juju relate web:postgresql postgresql:database

# Relate web and worker for automatic TSA key exchange
juju relate web:tsa worker:flight

Best for: Independent scaling of web and workers
Key Distribution: ✅ Automatic via tsa / flight relation

Configuration Options

| Option | Type | Default | Description |
|---|---|---|---|
| mode | string | auto | Deployment mode: auto, web, or worker |
| version | string | latest | Concourse version to install (auto-detects latest from GitHub) |
| web-port | int | 8080 | Web UI and API port |
| worker-procs | int | 1 | Number of worker processes per unit |
| log-level | string | info | Log level: debug, info, warn, error |
| enable-metrics | bool | true | Enable Prometheus metrics on port 9391 |
| external-url | string | (auto) | External URL for webhooks and OAuth |
| initial-admin-username | string | admin | Initial admin username |
| container-placement-strategy | string | volume-locality | Container placement: volume-locality, random, etc. |
| max-concurrent-downloads | int | 10 | Max concurrent resource downloads |
| containerd-dns-proxy-enable | bool | false | Enable containerd DNS proxy |
| containerd-dns-server | string | 1.1.1.1,8.8.8.8 | DNS servers for containerd containers |

Changing Configuration

Configuration changes are applied dynamically with automatic service restart.

# Set custom web port (automatically restarts service)
juju config concourse-ci web-port=9090

# Change to privileged port 80 (requires CAP_NET_BIND_SERVICE - already configured)
juju config concourse-ci web-port=80

# Enable debug logging
juju config concourse-ci log-level=debug

# Set external URL (auto-detects unit IP if not set)
juju config concourse-ci external-url=https://ci.example.com

Upgrading Concourse Version

Use the upgrade action to change the Concourse CI version. Update the version configuration first so the change persists across charm refreshes, then run the action:

# Set version configuration first (essential for persistence)
juju config concourse-ci version=7.14.3

# Trigger the upgrade action (check actions.yaml if the action name differs); all workers upgrade automatically
juju run concourse-ci/leader upgrade

# Downgrade is also supported (update config, then run the action)
juju config concourse-ci version=7.12.1
juju run concourse-ci/leader upgrade

Auto-upgrade behavior:

  • When the web server (leader in mode=auto) is upgraded, all workers automatically upgrade to match
  • Works across separate applications connected via TSA relations
  • Workers show "Auto-upgrading Concourse CI to X.X.X..." during automatic upgrades

Note: The web-port configuration supports dynamic changes including privileged ports (< 1024) thanks to AmbientCapabilities=CAP_NET_BIND_SERVICE in the systemd service.
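
A sketch for verifying the capability on a web unit (the unit file name comes from the Systemd Services section below):

juju ssh concourse-ci/0 -- systemctl cat concourse-server | grep -i ambientcapabilities
# Expected to include: AmbientCapabilities=CAP_NET_BIND_SERVICE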

Using Concourse

Access the Web UI

  1. Get the web server IP:
juju status
  2. Check the exposed port (shown in the Ports column):
juju status concourse-ci
# Look for: Ports column showing "80/tcp" or "8080/tcp"
  3. Open in browser: http://<web-unit-ip>:<port>

  4. Get the admin credentials:

juju run concourse-ci/leader get-admin-password

Example output:

message: Use these credentials to login to Concourse web UI
password: 01JfF@I!9W^0%re!3I!hyy3C
username: admin

Security: A random password is automatically generated on first deployment and stored securely in Juju peer relation data. All units in the deployment share the same credentials.

Using Fly CLI

The Fly CLI is Concourse's command-line tool for managing pipelines:

# Download fly from your Concourse instance
curl -Lo fly "http://<web-unit-ip>:8080/api/v1/cli?arch=amd64&platform=linux"
chmod +x fly
sudo mv fly /usr/local/bin/

# Get credentials (the unit key in the JSON output, "unit-concourse-ci-2" here, depends on which unit is the leader; adjust it to match your deployment)
ADMIN_PASSWORD=$(juju run concourse-ci/leader get-admin-password --format=json | jq -r '."unit-concourse-ci-2".results.password')

# Login
fly -t prod login -c http://<web-unit-ip>:8080 -u admin -p "$ADMIN_PASSWORD"

# Sync fly version
fly -t prod sync

Create Your First Pipeline

⚠️ Important: This charm uses containerd runtime. All tasks must include an image_resource.

  1. Create a pipeline file hello.yml:
jobs:
- name: hello-world
  plan:
  - task: say-hello
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: busybox
      run:
        path: sh
        args:
        - -c
        - |
          echo "=============================="
          echo "Hello from Concourse CI!"
          echo "Date: $(date)"
          echo "=============================="
  2. Set the pipeline:
fly -t prod set-pipeline -p hello -c hello.yml
fly -t prod unpause-pipeline -p hello
  3. Trigger the job:
fly -t prod trigger-job -j hello/hello-world -w

Note: Common lightweight images: busybox (~2MB), alpine (~5MB), ubuntu (~28MB)

Scaling

Add More Workers

# Add 2 more worker units to the concourse-ci application
juju add-unit concourse-ci -n 2

# Verify workers
juju ssh concourse-ci/0  # SSH to unit 0 of concourse-ci application
fly -t local workers

Remove Workers

# Remove specific unit
juju remove-unit concourse-ci/3

Relations

Required Relations

PostgreSQL (Required for Web Server)

The web server requires a PostgreSQL database:

juju relate concourse-ci:postgresql postgresql:database

Supported PostgreSQL Charms:

  • postgresql (16/stable recommended)
  • Any charm providing the postgresql interface

Optional Relations

Monitoring

Concourse exposes Prometheus metrics on port 9391:

juju relate concourse-ci:monitoring prometheus:target
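
To check the metrics endpoint directly (assuming enable-metrics is left at its default of true and the default port 9391):

curl -s http://<web-unit-ip>:9391/metrics | head
# Concourse-prefixed metric names indicate the endpoint is serving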

Peer Relation

Units automatically coordinate via the peers relation (automatic, no action needed).

Storage

The charm uses Juju storage for persistent data:

# Deploy with specific storage
juju deploy concourse-ci-machine concourse-ci --storage concourse-data=20G

# Add storage to existing unit
juju add-storage concourse-ci/0 concourse-data=10G

Storage is mounted at /var/lib/concourse.
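
To confirm the storage attachment and the mount point, a quick sketch:

# List storage known to Juju
juju storage

# Confirm the mount on a unit
juju ssh concourse-ci/0 -- df -h /var/lib/concourse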

GPU Support

Concourse workers can utilize NVIDIA GPUs for ML/AI workloads, GPU-accelerated builds, and compute-intensive tasks.

Prerequisites

  • NVIDIA GPU hardware on the host machine
  • NVIDIA drivers installed on the host (tested with driver 580.95+)
  • For LXD/containers: GPU passthrough configured (see below)

Note: The charm automatically installs nvidia-container-toolkit and configures the GPU runtime. No manual setup required!

Quick Start: Deploy with GPU

Complete deployment from scratch:

# 1. Deploy PostgreSQL
juju deploy postgresql --channel 16/stable --base ubuntu@24.04

# 2. Deploy web server
juju deploy concourse-ci-machine web --config mode=web

# 3. Deploy GPU-enabled worker
juju deploy concourse-ci-machine worker \
  --config mode=worker \
  --config compute-runtime=cuda

# 4. Add GPU to LXD container (only manual step for localhost cloud)
lxc config device add <container-name> gpu0 gpu
# Example: lxc config device add juju-abc123-0 gpu0 gpu

# 5. Create relations
juju relate web:postgresql postgresql:database
juju relate web:tsa worker:flight

# 6. Check status
juju status worker
# Expected: "Worker ready (GPU: 1x NVIDIA)"

Enable GPU on Existing Worker

# Enable NVIDIA GPU on already deployed worker
juju config worker compute-runtime=cuda

# Enable AMD GPU on already deployed worker
juju config worker compute-runtime=rocm

# Disable GPU
juju config worker compute-runtime=none

LXD GPU Passthrough (One-time setup)

If deploying on LXD (localhost cloud), add GPU to the container:

# Find your worker container name
lxc list | grep juju

# Add GPU device (requires container restart)
lxc config device add <container-name> gpu0 gpu

# Example:
lxc config device add juju-abc123-0 gpu0 gpu

Everything else is automated! The charm will:

  • ✅ Install nvidia-container-toolkit
  • ✅ Create GPU wrapper script
  • ✅ Configure runtime for GPU passthrough
  • ✅ Set up automatic GPU device injection

GPU Configuration Options

| Option | Default | Description |
|---|---|---|
| compute-runtime | none | GPU compute runtime: none, cuda (NVIDIA), or rocm (AMD) |
| gpu-device-ids | all | GPU devices to expose: "all" or "0,1,2" |
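
For example, to expose only the first GPU on a worker application named worker (a sketch using the gpu-device-ids option above):

juju config worker gpu-device-ids="0"

# Revert to exposing all GPUs
juju config worker gpu-device-ids="all"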

GPU Worker Tags

When GPU is enabled, workers are automatically tagged:

  • cuda - NVIDIA GPU worker (when compute-runtime=cuda)
  • rocm - AMD GPU worker (when compute-runtime=rocm)
  • gpu-count=N - Number of GPUs available
  • gpu-devices=0,1 - Specific device IDs (if configured)

Example: GPU Pipeline

Create a pipeline that targets GPU-enabled workers:

jobs:
- name: train-model-nvidia
  plan:
  - task: gpu-training
    tags: [cuda]  # Target NVIDIA GPU workers
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: nvidia/cuda
          tag: 13.1.0-runtime-ubuntu24.04
      run:
        path: sh
        args:
        - -c
        - |
          # Verify GPU access
          nvidia-smi
          
          # Run your GPU workload
          python train.py --use-gpu

- name: gpu-benchmark
  plan:
  - task: benchmark
    tags: [cuda, gpu-count=1]  # More specific targeting
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: nvidia/cuda
          tag: 13.1.0-base-ubuntu24.04
      run:
        path: nvidia-smi

Verifying GPU Access

# Check worker status
juju status worker
# Should show: "Worker ready (GPU: 1x NVIDIA)"

# Verify GPU tags in Concourse
fly -t local workers
# Worker should show tags: cuda, gpu-count=1
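
A one-off task is another way to confirm GPU access from inside a container; the sketch below mirrors the ROCm section's test-gpu.yml idea (the file name and image tag are illustrative):

# Create a minimal GPU test task
cat > test-gpu.yml <<'EOF'
platform: linux
image_resource:
  type: registry-image
  source:
    repository: nvidia/cuda
    tag: 13.1.0-base-ubuntu24.04
run:
  path: nvidia-smi
EOF

# Run it on a CUDA-tagged worker
fly -t local execute -c test-gpu.yml --tag=cuda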

Common GPU Images

  • nvidia/cuda:13.1.0-base-ubuntu24.04 - CUDA base (~174MB)
  • nvidia/cuda:13.1.0-runtime-ubuntu24.04 - CUDA runtime (~1.38GB)
  • nvidia/cuda:13.1.0-devel-ubuntu24.04 - CUDA development (~3.39GB)
  • tensorflow/tensorflow:latest-gpu - TensorFlow with GPU
  • pytorch/pytorch:latest - PyTorch with GPU

GPU Troubleshooting

Worker shows "GPU enabled but no GPU detected"

  • Verify the GPU is present on the host: lspci | grep -i nvidia
  • Check the driver installation: nvidia-smi

Container cannot access GPU

  • Verify nvidia-container-runtime: which nvidia-container-runtime
  • Check containerd config: cat /etc/containerd/config.toml
  • Restart containerd: sudo systemctl restart containerd

GPU not showing in task

  • Ensure using NVIDIA CUDA base image
  • Run nvidia-smi in task to debug
  • Check worker tags: fly -t local workers

AMD GPU Support (ROCm)

Concourse workers can utilize AMD GPUs with ROCm for ML/AI workloads, GPU-accelerated computations, and HPC tasks.

Prerequisites

  • AMD GPU hardware on the host machine (e.g., Radeon RX 6000/7000 series, MI series)
  • AMD GPU drivers installed on the host
  • ROCm tools (optional, for host-side management)
  • For LXD/containers: GPU passthrough configured (see below)

Note: The charm automatically installs amd-container-toolkit, generates CDI specifications, and configures the ROCm runtime. No manual setup required!

Quick Start: Deploy with AMD GPU

Complete deployment from scratch:

# 1. Deploy PostgreSQL
juju deploy postgresql --channel 16/stable --base ubuntu@24.04

# 2. Deploy web server
juju deploy concourse-ci-machine web --config mode=web

# 3. Deploy ROCm-enabled worker
juju deploy concourse-ci-machine worker \
  --config mode=worker \
  --config compute-runtime=rocm

# 4. Add AMD GPU to LXD container (use specific GPU ID for multi-GPU systems)
# Note: On systems with multiple GPU vendors, use 'id=N' to target specific GPU
lxc query /1.0/resources | jq '.gpu.cards[] | {id: .drm.id, vendor, driver, product_id, vendor_id, pci_address}'
lxc config device add <container-name> gpu1 gpu id=1
# Example: lxc config device add juju-abc123-0 gpu1 gpu id=1

# 5. Create relations
juju relate web:postgresql postgresql:database
juju relate web:tsa worker:flight

# 6. Check status
juju status worker
# Expected: "Worker ready (v7.14.2) (GPU: 1x AMD)"

Enable ROCm on Existing Worker

# Enable AMD GPU on already deployed worker
juju config worker compute-runtime=rocm

LXD GPU Passthrough for AMD (Critical for Multi-GPU Systems)

If deploying on LXD (localhost cloud), add AMD GPU to the container:

# Find available GPUs and their IDs
lxc query /1.0/resources | jq '.gpu.cards[] | {id: .drm.id, vendor, driver, product_id, vendor_id, pci_address}'

# Output example:
# {
#   "id": 0,
#   "vendor": "NVIDIA Corporation",
#   "driver": "nvidia",
#   "product": "GA104 [GeForce RTX 3070]"
# }
# {
#   "id": 1,
#   "vendor": "Advanced Micro Devices, Inc. [AMD/ATI]",
#   "driver": "amdgpu",
#   "product": "Navi 31 [Radeon RX 7900 XT]"
# }

# Add AMD GPU device using specific ID (GPU 1 in this example)
lxc config device add <container-name> gpu1 gpu id=1

# Add /dev/kfd device (required for ROCm compute)
lxc config device add <container-name> kfd unix-char source=/dev/kfd path=/dev/kfd

# Example:
lxc config device add juju-abc123-0 gpu1 gpu id=1
lxc config device add juju-abc123-0 kfd unix-char source=/dev/kfd path=/dev/kfd

⚠️ IMPORTANT for Multi-GPU Systems:

  • Generic lxc config device add ... gpu passes ALL GPUs to the container
  • This causes ambiguity when both NVIDIA and AMD GPUs are present
  • Always use id=N to target the specific AMD GPU
  • GPU ID corresponds to /dev/dri/cardN (e.g., id=1 → /dev/dri/card1)

⚠️ CRITICAL for ROCm Compute:

  • /dev/kfd (Kernel Fusion Driver) is required for ROCm compute workloads
  • Without /dev/kfd, GPU monitoring works but PyTorch/TensorFlow cannot use the GPU
  • Must be added as separate device after GPU passthrough

⚠️ Supported AMD GPUs:

  • Discrete GPUs (fully supported): RX 6000/7000 series, Radeon Pro, Instinct MI series - work natively
  • Integrated GPUs (requires workaround): APUs like Phoenix1 (gfx1103), Renoir, Cezanne
    • CAN work with HSA_OVERRIDE_GFX_VERSION environment variable (see below)
    • ⚠️ Lower performance due to shared system memory
    • Recommended for development/testing, not production ML workloads
  • Check ROCm compatibility: https://rocm.docs.amd.com/en/latest/release/gpu_os_support.html

Everything else is automated! The charm will:

  • ✅ Install amd-container-toolkit
  • ✅ Generate CDI specification
  • ✅ Install rocm-smi for GPU monitoring
  • ✅ Create AMD GPU wrapper script
  • ✅ Configure runtime for ROCm GPU passthrough
  • ✅ Set up automatic GPU device injection into task containers (including /dev/kfd)

ROCm Configuration Options

| Option | Default | Description |
|---|---|---|
| compute-runtime | none | GPU compute runtime: none, cuda (NVIDIA), or rocm (AMD) |
| gpu-device-ids | all | GPU devices to expose: "all" or "0,1,2" |

ROCm Worker Tags

When ROCm GPU is enabled, workers are automatically tagged:

  • rocm - AMD GPU worker (when compute-runtime=rocm)
  • gpu-count=N - Number of AMD GPUs available
  • gpu-devices=0,1 - Specific device IDs (if configured)

Example: ROCm GPU Pipeline

Create a pipeline that targets ROCm-enabled workers:

jobs:
- name: rocm-benchmark
  plan:
  - task: gpu-test
    tags: [rocm]  # Target ROCm workers
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: rocm/dev-ubuntu-24.04
          tag: latest
      run:
        path: sh
        args:
        - -c
        - |
          # Verify GPU access
          rocm-smi
          
          # Check available devices
          ls -la /dev/dri/
          
          # Run your ROCm workload
          python train.py --rocm

- name: amd-gpu-compute
  plan:
  - task: compute
    tags: [rocm, gpu-count=1]  # More specific targeting
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: rocm/pytorch
          tag: latest
      run:
        path: sh
        args:
        - -c
        - |
          # For integrated AMD GPUs (Phoenix1/gfx1103, etc.)
          export HSA_OVERRIDE_GFX_VERSION=11.0.0
          
          python3 -c "import torch; print('CUDA available:', torch.cuda.is_available()); x = torch.rand(5,3).cuda(); print('Result:', x * 2)"

Verifying ROCm GPU Access

# Check worker status
juju status worker
# Should show: "Worker ready (v7.14.2) (GPU: 1x AMD)"

# Verify GPU tags in Concourse
fly -t local workers
# Worker should show tags: rocm, gpu-count=1

# Test GPU access in a task
fly -t local execute -c test-gpu.yml --tag=rocm
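
test-gpu.yml is not defined elsewhere in this README; a minimal sketch of such a task config (image choice follows the list below):

cat > test-gpu.yml <<'EOF'
platform: linux
image_resource:
  type: registry-image
  source:
    repository: rocm/dev-ubuntu-24.04
    tag: latest
run:
  path: sh
  args:
  - -c
  - |
    rocm-smi
    ls -la /dev/dri/ /dev/kfd
EOF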

Common ROCm Images

  • rocm/dev-ubuntu-24.04:latest - ROCm development base (~1.1GB)
  • rocm/tensorflow:latest - TensorFlow with ROCm
  • rocm/pytorch:latest - PyTorch with ROCm (~6GB, includes PyTorch 2.9.1+rocm7.2.0)
  • rocm/rocm-terminal:latest - ROCm with utilities

HSA_OVERRIDE_GFX_VERSION Workaround for Integrated GPUs

Integrated AMD GPUs (APUs) like Phoenix1 (gfx1103), Renoir, and Cezanne are not officially supported by ROCm, but can work with the HSA_OVERRIDE_GFX_VERSION environment variable.

Why it's needed:

  • ROCm checks GPU architecture (GFX version) and rejects unsupported GPUs
  • Integrated GPUs often use newer GFX versions without full ROCm kernel support
  • Override tells ROCm to use kernels from a supported architecture

How to use:

jobs:
- name: pytorch-rocm-integrated-gpu
  plan:
  - task: test-gpu
    tags: [rocm]
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: rocm/pytorch
          tag: latest
      run:
        path: sh
        args:
        - -c
        - |
          # Set override for gfx1103 (Phoenix1) - use gfx11.0.0 kernels
          export HSA_OVERRIDE_GFX_VERSION=11.0.0
          
          # Your PyTorch code
          python3 -c "
          import torch
          print('CUDA (ROCm) available:', torch.cuda.is_available())
          x = torch.rand(5, 3).cuda()
          y = x * 2
          print('GPU computation succeeded!')
          print('Result:', y)
          "

Override values for common integrated GPUs:

| GPU Architecture | GFX Version | Override Value |
|---|---|---|
| Phoenix1 (780M) | gfx1103 | 11.0.0 |
| Renoir (4000 series) | gfx90c | 9.0.0 |
| Cezanne (5000 series) | gfx90c | 9.0.0 |

Limitations:

  • ⚠️ Uses suboptimal kernels → lower performance than discrete GPUs
  • ⚠️ Shared system memory → memory bandwidth limitations
  • ⚠️ May not support all ROCm features
  • ✅ Good for development, testing, and light compute workloads
  • ❌ Not recommended for production ML training

Testing on host (before deploying pipeline):

# Test if your integrated GPU works with override
docker run --rm -it --device=/dev/kfd --device=/dev/dri \
  rocm/pytorch:latest sh -c "
    export HSA_OVERRIDE_GFX_VERSION=11.0.0
    python3 -c 'import torch; x = torch.rand(5,3).cuda(); print(x * 2)'
  "

ROCm Troubleshooting

Worker shows "GPU enabled but no GPU detected"

  • Verify AMD GPU present: lspci | grep -i amd
  • Check driver: lsmod | grep amdgpu
  • Check devices: ls -la /dev/dri/

Container cannot access AMD GPU

  • Verify LXD device passthrough: lxc config device show <container-name>
  • Check devices in container: juju ssh worker/0 -- ls -la /dev/dri/
  • Ensure using correct GPU ID on multi-GPU systems
  • Check /dev/kfd: Must be present for compute workloads

PyTorch/TensorFlow shows "CUDA (ROCm) available: False"

  • Most common: Missing /dev/kfd device
    • Check in container: ls -la /dev/kfd
    • Add if missing: lxc config device add <container-name> kfd unix-char source=/dev/kfd path=/dev/kfd
  • Integrated GPU without override: Try HSA_OVERRIDE_GFX_VERSION workaround (see above)
    • Verify GPU model: lspci | grep -i vga
    • Check PCI ID: cat /sys/class/drm/card*/device/uevent | grep PCI_ID
    • For gfx1103 (Phoenix1): export HSA_OVERRIDE_GFX_VERSION=11.0.0
  • HSA_STATUS_ERROR_OUT_OF_RESOURCES: Usually indicates unsupported GPU or missing drivers

rocm-smi works but PyTorch doesn't detect GPU

  • This indicates /dev/kfd is missing or inaccessible
  • rocm-smi only needs /dev/dri/* for monitoring
  • PyTorch needs /dev/kfd for compute operations
  • Solution: Add /dev/kfd device to container (see above)

rocm-smi not working in container

  • Ensure using ROCm-enabled image (rocm/dev-ubuntu-24.04 or similar)
  • Check device permissions: ls -la /dev/dri/ in task
  • ROCm version mismatch: Host and container ROCm versions should be compatible

GPU not showing in task

  • Ensure using ROCm-enabled image
  • Run ls -la /dev/dri/ in task to debug device availability
  • Check worker tags: fly -t local workers
  • Verify task uses correct tags: --tag=rocm

Multi-GPU system issues

  • If worker detects wrong GPU type, check LXD device configuration
  • Use specific GPU ID: lxc config device add ... gpu id=1 (not generic gpu)
  • Query GPU IDs: lxc query /1.0/resources | jq '.gpu.cards[] | {id: .drm.id, vendor, driver, product_id, vendor_id, pci_address}'

Integrated GPU performance issues

  • If compute works but is slow, this is expected (shared memory bandwidth)
  • Consider discrete GPU for production workloads
  • Use integrated GPU for testing/development only
  • Monitor memory usage: integrated GPUs share system RAM

Troubleshooting

Charm Shows "Blocked" Status

Cause: Usually means PostgreSQL relation is missing (for web units).

Fix:

juju relate concourse-ci:postgresql postgresql:database

Web Server Won't Start

Check logs:

juju debug-log --include concourse-ci/0 --replay --no-tail | tail -50

# Or SSH and check systemd
juju ssh concourse-ci/0
sudo journalctl -u concourse-server -f

Common issues:

  • Database not configured: Check PostgreSQL relation
  • Auth configuration missing: Check /var/lib/concourse/config.env
  • Port already in use: Change web-port config

Workers Not Connecting

Check worker status:

juju ssh concourse-ci/1  # Worker unit
sudo systemctl status concourse-worker
sudo journalctl -u concourse-worker -f

Common issues:

  • TSA keys not generated: Check /var/lib/concourse/keys/
  • Containerd not running: sudo systemctl status containerd
  • Network connectivity: Ensure workers can reach the web server (a quick check is sketched below)
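
A minimal connectivity check from a worker unit, assuming Concourse's default TSA port (2222); nc may need to be installed on the unit:

juju ssh concourse-ci/1 -- nc -vz <web-unit-ip> 2222
# A "succeeded"/"open" result means the worker can reach the web server's TSA endpoint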

View Configuration

juju ssh concourse-ci/0
sudo cat /var/lib/concourse/config.env

Architecture

Components

┌─────────────────────────────────────────────────────────┐
│                     Web Server                          │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐         │
│  │ Web UI/API │  │    TSA     │  │ Scheduler  │         │
│  └────────────┘  └────────────┘  └────────────┘         │
│         │              │                 │              │
│         └──────────────┴─────────────────┘              │
│                        │                                │
└────────────────────────┼────────────────────────────────┘
                         │
                         │ (SSH over TSA)
                         │
        ┌────────────────┴────────────────┐
        │                                 │
  ┌─────▼──────┐                   ┌─────▼──────┐
  │  Worker 1  │                   │  Worker 2  │
  │┌──────────┐│                   │┌──────────┐│
  ││Container ││                   ││Container ││
  ││Runtime   ││                   ││Runtime   ││
  │└──────────┘│                   │└──────────┘│
  └────────────┘                   └────────────┘

For more on Concourse's internals, see https://concourse-ci.org/internals.html

Key Directories

  • /opt/concourse/: Concourse binaries
  • /var/lib/concourse/: Data and configuration
  • /var/lib/concourse/keys/: TSA and worker keys
  • /var/lib/concourse/worker/: Worker runtime directory

Systemd Services

  • concourse-server.service: Web server (runs as concourse user)
  • concourse-worker.service: Worker (runs as root)
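
A quick status check from the Juju client (unit numbers assume the auto-mode layout described earlier):

# Web service on the leader unit
juju ssh concourse-ci/0 -- sudo systemctl status concourse-server --no-pager

# Worker service on a follower unit
juju ssh concourse-ci/1 -- sudo systemctl status concourse-worker --no-pager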

Development

Building from Source

# Install charmcraft
sudo snap install charmcraft --classic

# Clone repository
git clone https://github.com/fourdollars/concourse-ci-machine.git
cd concourse-ci-machine

# Build charm
charmcraft pack

# Deploy locally
juju deploy ./concourse-ci-machine_amd64.charm

Project Structure

concourse-ci-machine/
├── src/
│   └── charm.py                  # Main charm logic
├── lib/
│   ├── concourse_common.py       # Shared utilities
│   ├── concourse_installer.py    # Installation logic
│   ├── concourse_web.py          # Web server management
│   └── concourse_worker.py       # Worker management
├── metadata.yaml                 # Charm metadata
├── config.yaml                   # Configuration options
├── charmcraft.yaml               # Build configuration
├── actions.yaml                  # Charm actions
└── README.md                     # This file

Security

Initial Setup

  1. Retrieve the admin credentials (a random password is generated automatically on first deployment):
juju run concourse-ci/leader get-admin-password
fly -t prod login -c http://<ip>:8080 -u admin -p '<password-from-above>'
  2. Configure proper authentication:

    • Set up OAuth providers (GitHub, GitLab, etc.)
    • Use Juju secrets for credentials
    • Enable HTTPS with reverse proxy (nginx/haproxy)
  3. Network security:

    • Use Juju spaces to isolate networks
    • Configure firewall rules to restrict access
    • Use private PostgreSQL endpoints

Database Credentials

Database credentials are passed securely via Juju relations, not environment variables.

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Commit your changes (git commit -m 'Add amazing feature')
  5. Push to the branch (git push origin feature/amazing-feature)
  6. Open a Pull Request

License

This charm is licensed under the Apache 2.0 License. See LICENSE for details.

Support

  • Community Support: Open an issue on GitHub
  • Commercial Support: Contact maintainers

Acknowledgments

  • Concourse CI team for the amazing CI/CD system
  • Canonical for Juju and the Operator Framework
  • Contributors to this charm
