
[BUG] agent-stack-k8s does not work with Ubuntu-based agent images #808

@seemethere

Describe the bug

agent-stack-k8s Alpine Hardcoding Issue

Version: v0.37.0 (current latest)

Location: internal/controller/scheduler/scheduler.go

Problem

The controller hardcodes Alpine-specific shell and commands, making it impossible to use Ubuntu/Debian-based agent images:

  1. Shell hardcoded to ash (Alpine shell):
// Line 987 - copy-agent init container
Command: []string{"ash"}
Args:    []string{"-cefx", containerArgs.String()}

// Lines 1111, 1129, 1142 - checkout container
Command: []string{"ash", "-c"}
  2. Alpine-specific user/group commands:
// Lines 1112-1115
checkoutContainer.Args = []string{fmt.Sprintf(`set -exufo pipefail
addgroup -g %d buildkite-agent
adduser -D -u %d -G buildkite-agent -h /workspace buildkite-agent
su buildkite-agent -c "%s && buildkite-agent-entrypoint kubernetes-bootstrap"`,
  • addgroup and adduser -D, as called here, use BusyBox (Alpine) syntax
  • Ubuntu/Debian images use groupadd / useradd instead
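
If the hardcoded ash were swapped for a generic /bin/sh, the `set -exufo pipefail` line in the checkout script may need attention as well: some sh implementations (older dash releases, for example) reject `set -o pipefail`. Below is a minimal sketch of probing for the option before enabling it. `enable_strict_mode` is a hypothetical helper name used for illustration, not something that exists in agent-stack-k8s.

```shell
#!/bin/sh
# Hypothetical helper: turn on strict error handling in a way that
# works on shells with and without pipefail support.
enable_strict_mode() {
  # -e: exit on error, -u: treat unset variables as errors
  set -eu

  # Probe in a subshell so an unsupported option cannot abort the caller.
  if (set -o pipefail) 2>/dev/null; then
    set -o pipefail
  fi
}
```

The subshell probe keeps the failure contained: on a shell without pipefail the script simply continues with `-eu` only.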

Why This Matters

Some Kubernetes environments require the use-vc resolv.conf option to force TCP-based DNS queries. musl libc (Alpine) doesn't support use-vc, so DNS resolution fails there, while glibc-based images (Ubuntu, Rocky) work correctly. More generally, I think it would be good for all images published by buildkite/agent to be compatible with this stack.

Requested Enhancement

Add a configuration option to specify the shell, and either use portable user creation or detect the image type and adapt accordingly. Example:

config:
  shell: "/bin/bash"  # or auto-detect

Or use a more portable approach that skips creation when the group or user already exists:

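# Instead of Alpine-specific adduser/addgroup
# ($AGENT_UID / $AGENT_GID are placeholders; UID itself is read-only in bash)
getent group buildkite-agent >/dev/null || groupadd -g "$AGENT_GID" buildkite-agent
getent passwd buildkite-agent >/dev/null || useradd -u "$AGENT_UID" -g buildkite-agent -d /workspace buildkite-agent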
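
One caveat: groupadd and useradd come from the shadow utilities, which stock Alpine images usually do not ship, so a script that calls them unconditionally may just move the failure to the other image family. Below is a minimal sketch that picks the tool at runtime. `ensure_agent_user` and its arguments are hypothetical names for illustration. The user name and /workspace home are taken from the snippet above.

```shell
#!/bin/sh
# Hypothetical helper: create the buildkite-agent group and user with
# whichever tool family the image provides.
# Usage: ensure_agent_user UID GID [HOME]
ensure_agent_user() {
  _agent_uid="$1"
  _agent_gid="$2"
  _agent_home="${3:-/workspace}"

  # Create the group unless it already exists.
  if ! getent group buildkite-agent >/dev/null 2>&1; then
    if command -v groupadd >/dev/null 2>&1; then
      # shadow utilities (typical on Ubuntu/Debian)
      groupadd -g "$_agent_gid" buildkite-agent
    else
      # BusyBox fallback (typical on Alpine)
      addgroup -g "$_agent_gid" buildkite-agent
    fi
  fi

  # Create the user unless it already exists.
  if ! getent passwd buildkite-agent >/dev/null 2>&1; then
    if command -v useradd >/dev/null 2>&1; then
      useradd -u "$_agent_uid" -g buildkite-agent -d "$_agent_home" buildkite-agent
    else
      adduser -D -u "$_agent_uid" -G buildkite-agent -h "$_agent_home" buildkite-agent
    fi
  fi
}
```

Checking for groupadd/useradd first matters on Debian-family images, where an adduser command also exists but takes different options than the BusyBox one.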

To Reproduce

Steps to reproduce the behavior:

  1. Deploy with the following configuration:
  # Helm values for agent-stack-k8s
  config:
    # Custom agent image (Ubuntu-based instead of default Alpine)
    image: "ghcr.io/buildkite/agent:3.115.4-ubuntu-24.04"

    # Required for our environment - forces TCP DNS queries
    pod-spec-patch:
      dnsPolicy: "None"
      dnsConfig:
        options:
          - name: use-vc  # Force TCP for DNS (not supported by musl/Alpine)
  2. Run a pipeline on the agents
  3. See the error

Expected behavior

Jobs should run on the Ubuntu-based agent image. In general, I think it would be good for all images published by buildkite/agent to be compatible with this stack.

Environment

  • agent-stack-k8s version: v0.37.0
  • Kubernetes version: v1.34.2
  • Deployment method: modified helm chart

Logs

The following init containers failed:

CONTAINER   EXIT CODE  SIGNAL  REASON      MESSAGE
copy-agent  128        0       StartError  failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: exec: "ash": executable file not found in $PATH
