Commit 9fc4cc7

miguelgila and claude authored
feat: add reaper-agent per-node DaemonSet (#27)
* feat: add reaper-agent per-node DaemonSet binary

  Introduces reaper-agent, a per-node Kubernetes DaemonSet that provides operational capabilities for Reaper: ConfigMap-based config sync to host, stale state GC with dead PID detection, health checks, and Prometheus metrics on :9100. Gated behind the "agent" cargo feature.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add --with-agent install flag and agent integration tests

  Add --with-agent flag to install-reaper.sh for optional reaper-agent DaemonSet deployment. Add Phase 4a integration tests that verify agent deployment, ConfigMap sync, /healthz, /metrics, and stale state GC. Agent tests auto-skip if the image isn't loaded into Kind.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: auto-build and load reaper-agent image in integration tests

  Add scripts/build-agent-image.sh that cross-compiles the agent binary via musl, packages it into a distroless container image, and loads it into Kind. Wire it into Phase 2 setup so Phase 4a agent tests run automatically instead of being skipped. Build failure is non-fatal (tests gracefully skip if image is unavailable).

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: make reaper-agent build and tests mandatory in CI

  Agent image build failure now aborts the test suite instead of silently skipping. Phase 4a agent tests fail hard if the image isn't found — the agent is core infrastructure, not optional.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use dedicated KUBECONFIG to avoid context conflicts

  Export a per-cluster KUBECONFIG file so all kubectl commands target the correct Kind cluster, even when the user has other clusters or contexts active. Previously, bare kubectl defaulted to localhost:8080 if another context was selected.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: set imagePullPolicy to IfNotPresent for Kind-loaded agent image

  The :latest tag defaults imagePullPolicy to Always, causing kubelet to attempt pulling from ghcr.io instead of using the locally loaded image.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add --agent-only flag to integration test harness

  Skips cargo tests and Phase 4 integration tests, running only infrastructure setup and Phase 4a agent tests for fast iteration.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: run reaper-agent as root and use port-forward for tests

  The agent needs root to write /etc/reaper/reaper.conf and clean up /run/reaper/ on the host. Switch from distroless:nonroot to distroless root image with securityContext.runAsUser: 0. Also fix healthz/metrics tests to use kubectl port-forward instead of kubectl exec (distroless has no shell or wget).

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add BUGS.md documenting DNS annotation test flake

  Documents the intermittent containerd sandbox teardown race that causes the DNS mode annotation override test to fail under load.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update TODO with agent Phase 2 items and mark ConfigMap eval done

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
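The commit message mentions stale state GC with dead PID detection but the agent source is not part of this view. The following is only a minimal sketch of how such a check could look, not the agent's actual code. It assumes a hypothetical layout in which each subdirectory of the state directory holds a `pid` file, and it treats a PID as alive when `<proc_root>/<pid>` exists (with `hostPID: true`, `/proc` inside the pod reflects host processes). The names `pid_is_alive` and `find_stale_state` are invented for illustration.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Returns true if `<proc_root>/<pid>` is a directory, i.e. the process exists.
/// `proc_root` is a parameter (rather than a hard-coded "/proc") so the check
/// can be exercised against a fake directory tree.
fn pid_is_alive(proc_root: &Path, pid: u32) -> bool {
    proc_root.join(pid.to_string()).is_dir()
}

/// Scans `state_dir` for subdirectories containing a `pid` file (hypothetical
/// layout) and returns, in sorted order, those whose recorded PID is gone.
/// Directories without a readable, numeric `pid` file are left alone.
fn find_stale_state(state_dir: &Path, proc_root: &Path) -> Vec<PathBuf> {
    let mut stale = Vec::new();
    let entries = match fs::read_dir(state_dir) {
        Ok(entries) => entries,
        Err(_) => return stale, // missing or unreadable state dir: nothing to collect
    };
    for entry in entries.flatten() {
        let dir = entry.path();
        if !dir.is_dir() {
            continue;
        }
        let pid = fs::read_to_string(dir.join("pid"))
            .ok()
            .and_then(|s| s.trim().parse::<u32>().ok());
        if let Some(pid) = pid {
            if !pid_is_alive(proc_root, pid) {
                stale.push(dir);
            }
        }
    }
    stale.sort();
    stale
}
```

A caller would pass the mounted state directory (for example the `--state-dir` value) and `/proc`, then decide what to do with the returned paths. An existence check alone cannot distinguish a reused PID from the original process, so a real implementation would likely need an additional identity check such as the process start time.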
1 parent 78dae3a commit 9fc4cc7

18 files changed, +2694 -19 lines changed

Cargo.lock

Lines changed: 1221 additions & 14 deletions

Cargo.toml

Lines changed: 16 additions & 1 deletion
@@ -10,7 +10,7 @@ license = "MIT"
 [dependencies]
 anyhow = "1"
 thiserror = "1"
-clap = { version = "4", features = ["derive"] }
+clap = { version = "4", features = ["derive", "env"] }
 serde = { version = "1", features = ["derive"] }
 serde_json = "1"
 nix = { version = "0.28", features = ["signal", "process", "user", "sched", "mount", "fs", "term"] }
@@ -23,6 +23,16 @@ protobuf = "3.3"
 containerd-shim = { version = "0.10", features = ["async", "tracing"] }
 containerd-shim-protos = { version = "0.10", features = ["async"] }
 
+# reaper-agent dependencies
+kube = { version = "0.98", features = ["runtime", "client", "derive"], optional = true }
+k8s-openapi = { version = "0.24", features = ["v1_31"], optional = true }
+prometheus-client = { version = "0.23", optional = true }
+axum = { version = "0.8", optional = true }
+futures = { version = "0.3", optional = true }
+
+[features]
+agent = ["kube", "k8s-openapi", "prometheus-client", "axum", "futures"]
+
 [dev-dependencies]
 tempfile = "3"
 serial_test = "3"
@@ -35,6 +45,11 @@ path = "src/bin/reaper-runtime/main.rs"
 name = "containerd-shim-reaper-v2"
 path = "src/bin/containerd-shim-reaper-v2/main.rs"
 
+[[bin]]
+name = "reaper-agent"
+path = "src/bin/reaper-agent/main.rs"
+required-features = ["agent"]
+
 [lints.rust]
 unexpected_cfgs = { level = "warn", check-cfg = ['cfg(tarpaulin_include)'] }

Dockerfile.agent

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+# Multi-stage build for reaper-agent
+# Produces a minimal static binary for Kubernetes DaemonSet deployment.
+
+# --- Builder stage ---
+FROM messense/rust-musl-cross:x86_64-musl AS builder-amd64
+WORKDIR /work
+COPY . .
+RUN cargo build --release --features agent --bin reaper-agent --target x86_64-unknown-linux-musl
+
+FROM messense/rust-musl-cross:aarch64-musl AS builder-arm64
+WORKDIR /work
+COPY . .
+RUN cargo build --release --features agent --bin reaper-agent --target aarch64-unknown-linux-musl
+
+# --- Runtime stage ---
+# Use distroless for a minimal image with ca-certificates (needed for K8s API TLS)
+FROM gcr.io/distroless/static-debian12
+
+ARG TARGETARCH
+COPY --from=builder-amd64 /work/target/x86_64-unknown-linux-musl/release/reaper-agent /reaper-agent-amd64
+COPY --from=builder-arm64 /work/target/aarch64-unknown-linux-musl/release/reaper-agent /reaper-agent-arm64
+
+# Select binary based on target architecture
+# Note: For single-arch builds, use docker buildx with --platform
+COPY --from=builder-${TARGETARCH}64 /work/target/*/release/reaper-agent /reaper-agent
+
+ENTRYPOINT ["/reaper-agent"]

deploy/ansible/install-reaper.yml

Lines changed: 1 addition & 0 deletions
@@ -237,5 +237,6 @@
 - "1. Create RuntimeClass: kubectl apply -f deploy/kubernetes/runtimeclass.yaml"
 - "2. Deploy test pod: kubectl apply -f deploy/kubernetes/runtimeclass.yaml"
 - "3. Verify: kubectl logs reaper-example"
+- "4. Optional: kubectl apply -f deploy/kubernetes/reaper-agent.yaml (config sync, GC, metrics)"
 - ""
 - "To rollback: ansible-playbook -i inventory.ini deploy/ansible/rollback-reaper.yml"
Lines changed: 152 additions & 0 deletions
@@ -0,0 +1,152 @@
+---
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: reaper-system
+  labels:
+    app.kubernetes.io/part-of: reaper
+---
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: reaper-config
+  namespace: reaper-system
+  labels:
+    app.kubernetes.io/part-of: reaper
+    app.kubernetes.io/component: config
+data:
+  reaper.conf: |
+    # Reaper runtime configuration
+    # Managed by reaper-agent ConfigMap sync.
+    # Edit this ConfigMap to change Reaper settings on all nodes.
+    REAPER_DNS_MODE=host
+    REAPER_OVERLAY_ISOLATION=namespace
+    REAPER_ANNOTATIONS_ENABLED=true
+---
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: reaper-agent
+  namespace: reaper-system
+  labels:
+    app.kubernetes.io/part-of: reaper
+    app.kubernetes.io/component: agent
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: reaper-agent
+  labels:
+    app.kubernetes.io/part-of: reaper
+    app.kubernetes.io/component: agent
+rules:
+  - apiGroups: [""]
+    resources: ["configmaps"]
+    verbs: ["get", "watch", "list"]
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: reaper-agent
+  labels:
+    app.kubernetes.io/part-of: reaper
+    app.kubernetes.io/component: agent
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: ClusterRole
+  name: reaper-agent
+subjects:
+  - kind: ServiceAccount
+    name: reaper-agent
+    namespace: reaper-system
+---
+apiVersion: apps/v1
+kind: DaemonSet
+metadata:
+  name: reaper-agent
+  namespace: reaper-system
+  labels:
+    app.kubernetes.io/name: reaper-agent
+    app.kubernetes.io/part-of: reaper
+    app.kubernetes.io/component: agent
+spec:
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: reaper-agent
+  updateStrategy:
+    type: RollingUpdate
+    rollingUpdate:
+      maxUnavailable: 1
+  template:
+    metadata:
+      labels:
+        app.kubernetes.io/name: reaper-agent
+        app.kubernetes.io/part-of: reaper
+        app.kubernetes.io/component: agent
+      annotations:
+        prometheus.io/scrape: "true"
+        prometheus.io/port: "9100"
+        prometheus.io/path: "/metrics"
+    spec:
+      serviceAccountName: reaper-agent
+      hostPID: true
+      tolerations:
+        - operator: Exists
+          effect: NoSchedule
+      containers:
+        - name: agent
+          image: ghcr.io/miguelgila/reaper-agent:latest
+          imagePullPolicy: IfNotPresent
+          securityContext:
+            runAsUser: 0
+          args:
+            - --config-namespace=reaper-system
+            - --config-name=reaper-config
+            - --config-path=/host/etc/reaper/reaper.conf
+            - --state-dir=/host/run/reaper
+            - --shim-path=/host/usr/local/bin/containerd-shim-reaper-v2
+            - --runtime-path=/host/usr/local/bin/reaper-runtime
+          ports:
+            - containerPort: 9100
+              name: metrics
+              protocol: TCP
+          livenessProbe:
+            httpGet:
+              path: /healthz
+              port: metrics
+            initialDelaySeconds: 10
+            periodSeconds: 30
+          readinessProbe:
+            httpGet:
+              path: /readyz
+              port: metrics
+            initialDelaySeconds: 5
+            periodSeconds: 10
+          resources:
+            requests:
+              cpu: 10m
+              memory: 32Mi
+            limits:
+              cpu: 100m
+              memory: 64Mi
+          volumeMounts:
+            - name: etc-reaper
+              mountPath: /host/etc/reaper
+            - name: run-reaper
+              mountPath: /host/run/reaper
+            - name: usr-local-bin
+              mountPath: /host/usr/local/bin
+              readOnly: true
+      volumes:
+        - name: etc-reaper
+          hostPath:
+            path: /etc/reaper
+            type: DirectoryOrCreate
+        - name: run-reaper
+          hostPath:
+            path: /run/reaper
+            type: DirectoryOrCreate
+        - name: usr-local-bin
+          hostPath:
+            path: /usr/local/bin
+            type: Directory
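The `reaper.conf` payload in the ConfigMap above is a plain list of `KEY=VALUE` lines with `#` comments. How Reaper itself parses this file is not shown in this commit view, so the following is only an illustrative sketch of reading that format. The function name `parse_reaper_conf` is hypothetical, and the handling of malformed lines (silently skipped) is an assumption.

```rust
use std::collections::BTreeMap;

/// Parses `KEY=VALUE` lines into a sorted map.
/// Blank lines and lines starting with `#` are ignored; lines without `=`
/// are skipped (an assumption, since the real parser's behavior is not shown).
fn parse_reaper_conf(text: &str) -> BTreeMap<String, String> {
    let mut settings = BTreeMap::new();
    for line in text.lines() {
        let line = line.trim();
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        // Split on the first `=` only, so values may themselves contain `=`.
        if let Some((key, value)) = line.split_once('=') {
            settings.insert(key.trim().to_string(), value.trim().to_string());
        }
    }
    settings
}
```

Applied to the ConfigMap data shown above, this would yield three entries: `REAPER_ANNOTATIONS_ENABLED`, `REAPER_DNS_MODE`, and `REAPER_OVERLAY_ISOLATION`.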

docs/BUGS.md

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
+# Known Bugs and Flaky Tests
+
+## DNS Mode Annotation Override Test Flake
+
+**Test:** `DNS mode annotation override (host vs kubernetes)`
+**Severity:** Low (intermittent, CI-only)
+**Status:** Open
+
+### Symptoms
+
+The test times out (64s) waiting for the `reaper-dns-annot-default` or
+`reaper-dns-annot-host` pod to reach `Succeeded` phase. The pod gets stuck
+and containerd reports:
+
+```
+failed to stop sandbox: task must be stopped before deletion: running: failed precondition
+```
+
+### Root Cause
+
+A timing race in containerd's sandbox lifecycle. When the shim reports the
+container has exited, containerd sometimes tries to delete the task before
+it has fully transitioned out of the `running` state. This causes a
+`failed precondition` error that prevents sandbox teardown, leaving the pod
+stuck.
+
+This is a containerd-level issue, not a Reaper bug. It tends to surface
+under load (e.g., when many pods are created/deleted in quick succession
+during the integration test suite).
+
+### Workarounds
+
+- Re-running the test suite usually passes on retry.
+- The `--agent-only` flag skips this test entirely for fast agent iteration.
+- Running with `--no-cleanup` and re-running `--skip-cargo --no-cleanup`
+  often avoids the race since the cluster is warmer.
+
+### Related
+
+- Observed in Kind clusters with containerd v1.7+.
+- The `Combined annotations` test exercises similar annotation logic and
+  passes reliably, suggesting the issue is timing-related rather than
+  functional.

docs/TODO.md

Lines changed: 6 additions & 1 deletion
@@ -19,4 +19,9 @@ List of tasks to do, not ordered in any specific way.
 - [x] Add certain configuration parameters as annotations, so users can influence how Reaper works (DNS, overlay name and mount point, etc.). But ensuring adminsistrator parameters cannot be overriden.
 - [ ] Introduce more complex examples, answer this question: can we have a sssd containerd pod expose its socks file so a sample reaper pod can utilize it?
 - [ ] Produce RPM an DEB packages compatible with major distributions (SUSE, RHEL, Debian, Ubuntu). This will help with installation and deployment.
-- [ ] Evaluate if Reaper can be configured using a Kubernetes ConfigMap instead of relying on a node-level config file.
+- [x] Evaluate if Reaper can be configured using a Kubernetes ConfigMap instead of relying on a node-level config file. (Implemented via `reaper-agent` DaemonSet — PR #27)
+- [ ] reaper-agent Phase 2: Overlay GC — reconcile overlay namespaces against Kubernetes API, delete overlays for namespaces that no longer exist
+- [ ] reaper-agent Phase 2: Binary self-update — watch ConfigMap version field, download and replace shim/runtime binaries
+- [ ] reaper-agent Phase 2: Node condition reporting — patch Node object with `ReaperReady` condition
+- [ ] reaper-agent Phase 2: Mount namespace cleanup — detect and unmount stale `/run/reaper/ns/*` bind-mounts
+- [ ] Fix known bugs documented in [docs/BUGS.md](BUGS.md)
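The planned Overlay GC item describes a reconciliation: compare the overlay namespaces present on a node with the namespaces that still exist in the Kubernetes API, and delete the leftovers. That work is not implemented in this commit, so the sketch below only illustrates the core set difference under assumed inputs. The function name `orphaned_overlays` and the idea that both sides are available as sets of namespace names are hypothetical.

```rust
use std::collections::BTreeSet;

/// Returns the overlay namespaces found on disk that no longer exist in the
/// cluster, in ascending order. Fetching `live` from the Kubernetes API and
/// listing `on_disk` from the node are left to the caller.
fn orphaned_overlays(on_disk: &BTreeSet<String>, live: &BTreeSet<String>) -> Vec<String> {
    on_disk.difference(live).cloned().collect()
}
```

Keeping this step a pure function separates the decision (what is orphaned) from the side effects (listing, deleting), which makes the deletion logic easier to test without a cluster. A real implementation would also need to guard against acting on an empty or failed API listing, since that would make every overlay look orphaned.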
