Add gensim-eks scenario: end-to-end episode runner with full observability#47691

Open
scottopell wants to merge 42 commits into main from sopell/gensim-eks-runner

Conversation

@scottopell scottopell commented Mar 10, 2026

What does this PR do?

Adds a new e2e-framework scenario (aws/gensim-eks) that provisions a persistent EKS cluster and runs GenSim episodes end-to-end. Supports two modes:

  • record-parquet (default): Collects observer data to parquet files for offline analysis, uploads to S3
  • live-anomaly-detection: Runs live edge anomaly detection with all correlators enabled, sends detected anomaly events to Datadog via the Events v2 API

For each episode, the scenario deploys application services, installs a fully-configured Datadog agent, executes fault injection cycles, and collects results (parquet files or live anomaly events depending on mode).

Motivation

GenSim episodes are self-contained incident simulations -- each ships a Helm chart with application services, a play-episode.sh fault injection script, and (in 261 of 267 episodes) a stub Datadog agent Deployment. To evaluate the observer against these episodes at scale, we need infrastructure that:

  1. Replaces the stub agent with a custom observer-instrumented build
  2. Autodiscovers per-episode integration check configs (redis, postgres, etc.) via pod annotations
  3. Enables full observability -- APM traces, logs, DogStatsD, integration checks -- not just internal agent metrics
  4. Runs multiple episodes serially with clean teardown between each
  5. Supports both offline parquet collection and live anomaly detection modes

Data flow

On your laptop (dda inv aws.eks.gensim.submit --image=... --episodes=A:s1,B:s2 --mode=live-anomaly-detection):

Pulumi reads each episode's play-episode.sh, scenario YAML, and Helm chart from your local GENSIM_REPO_PATH, packages them into per-episode ConfigMaps, renders agent-values.yaml.tmpl with the target image and mode, and creates an orchestrator Job. Your laptop is done.
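The `--episodes` flag takes comma-separated `episode:scenario` pairs. A minimal sketch of that parsing (the helper name `parse_episodes` is illustrative, not from the PR):

```python
def parse_episodes(spec: str) -> list[tuple[str, str]]:
    """Split an --episodes value like "A:s1,B:s2" into (episode, scenario) pairs."""
    pairs = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        if ":" not in item:
            raise ValueError(f"expected episode:scenario, got {item!r}")
        episode, scenario = item.split(":", 1)
        pairs.append((episode, scenario))
    return pairs
```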

In the EKS cluster (orchestrator Job, alpine/k8s:1.31.0):

for each episode:

  # EPISODE: helm install episode chart with --set agent.enabled=false
  #          (suppresses stub agent resources; post-renderer fixes imagePullPolicy)

  # AGENT: helm install datadog-agent with:
  #   - agent-values.yaml (APM, logs, DogStatsD, process agent, cluster agent,
  #     admission controller, observer config based on mode)
  #   (autodiscovery annotations on episode pods handle check configs)

  # RUN: play-episode.sh executes warmup -> baseline -> fault injection -> cooldown
  #      (20-80 min per episode)

  # COLLECT (record-parquet mode only):
  #   kubectl cp parquet from agent pod
  #   aws s3 cp to $BUCKET/$IMAGE_TAG/$EPISODE--$SCENARIO/gensim-$DATE-$SHA/
  # (live-anomaly-detection mode: anomaly events sent to Datadog in real-time)

  # REPORT:  DD events + metrics via API, update gensim-run-status ConfigMap
  # TEARDOWN: helm uninstall episode + agent, clean workspace

Observer mode configuration

The --mode flag controls what the agent's observer component does:

| Config | record-parquet | live-anomaly-detection |
|---|---|---|
| observer.recording.enabled | true | false |
| observer.analysis.enabled | (default) | true |
| observer.event_reporter.sending_enabled | false | true |
| Parquet collection + S3 upload | yes | skipped |
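The mode-to-settings mapping above can be sketched as a small lookup. This is an illustrative helper (the real rendering happens in the Go template), with key names taken from the settings listed above:

```python
def observer_config(mode: str) -> dict:
    """Return the observer settings implied by --mode (illustrative sketch)."""
    if mode == "record-parquet":
        return {
            "observer.recording.enabled": True,
            "observer.event_reporter.sending_enabled": False,
        }
    if mode == "live-anomaly-detection":
        return {
            "observer.recording.enabled": False,
            "observer.analysis.enabled": True,
            "observer.event_reporter.sending_enabled": True,
        }
    raise ValueError(f"unknown mode: {mode}")
```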

Key design decisions

  • fullnameOverride: datadog-agent -- episode pods hardcode datadog-agent:8125 and datadog-agent:8126 as DNS names for DogStatsD and APM. This makes our Helm chart create Services with the names they expect.
  • Single-pass episode deployment -- helm install with --set agent.enabled=false suppresses stub agent resources. A lightweight sed post-renderer patches imagePullPolicy: Never to Always for the few episodes that set it. Autodiscovery annotations on episode pods handle check configs -- no extraction step needed.
  • pullPolicy: Always on the agent image -- prevents stale cached images on EKS nodes when the same tag is rebuilt with new code.
  • Admission controller with mutateUnlabelled: true -- auto-injects DD_AGENT_HOST into all episode pods without requiring labels.
  • Agent values as a Go template (agent-values.yaml.tmpl) -- embedded via go:embed, rendered at Pulumi plan time with target image repo/tag and mode.
  • --refresh on pulumi up -- reconciles Pulumi state with cluster reality before planning changes. Prevents silent no-ops when resources are deleted outside Pulumi (e.g. by the orchestrator's helm uninstall during teardown).
  • --non-interactive on pulumi up/destroy -- disables the TUI progress display so error output is captured in logs instead of being overwritten by cursor repositioning.
  • Content-addressed build cache -- docker-compose.yaml is included in the services hash (alongside the services/ directory) so adding/removing images triggers a rebuild.
  • Image reference validation -- renderAgentValues validates the image contains : before splitting into repo:tag, returning a clear error instead of panicking.
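The last two decisions can be sketched briefly. The actual implementation is Go (`renderAgentValues` and the Pulumi trigger hash); this is a hedged Python illustration of the same logic, with hypothetical helper names:

```python
import hashlib
from pathlib import Path

def services_hash(services_dir: str, compose_file: str) -> str:
    """Content hash over the services/ tree plus docker-compose.yaml, so a
    change to either triggers a rebuild. Returns the first 12 hex chars."""
    h = hashlib.sha256()
    for path in sorted(Path(services_dir).rglob("*")):
        if path.is_file():
            h.update(str(path.relative_to(services_dir)).encode())
            h.update(path.read_bytes())
    h.update(Path(compose_file).read_bytes())
    return h.hexdigest()[:12]

def split_image_ref(image: str) -> tuple[str, str]:
    """Validate that the reference has a tag before splitting, instead of
    failing with a slice-bounds error on a bare repo name."""
    if ":" not in image:
        raise ValueError(f"image reference {image!r} has no tag (expected repo:tag)")
    repo, tag = image.rsplit(":", 1)
    return repo, tag
```

Note the `rsplit` sketch does not handle registry-port references without a tag (e.g. `registry:5000/img`); the real Go code may differ.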

Describe how you validated your changes

Record-parquet mode -- 3-episode end-to-end run:

| Episode | Parquet Files | Duration | Signals |
|---|---|---|---|
| 213_PagerDuty (cassandra) | 381 | 49m | APM + logs + DogStatsD + observer |
| food-delivery (redis) | 235 | 33m | APM + logs + DogStatsD + redis check + postgres check + observer |
| 353_postmark (dns) | 602 | 84m | APM + logs + DogStatsD + observer |

Live-anomaly-detection mode -- 213_PagerDuty episode:

  • 20 Datadog events successfully delivered via the event reporter (source:agent-q-branch-observer)
  • TimeCluster correlator detected temporal co-occurrence of BOCPD anomalies across trace agent metrics (2-8 anomalies per cluster)
  • Surprise correlator detected statistically unusual metric co-occurrences (lift scores 2.0-24.0)
  • Monitor received data (DD_ENV propagation working), went to Alert during disruption, recovered to OK
  • Clean teardown, no resource leaks

Future checklist: hacks to retire

  • awk filter stripping agent resources -- Retired. Episodes now gate agent resources behind agent.enabled: false (gensim-episodes PR #31).
  • extract_check_configs with yq -- Retired. Episodes migrated to autodiscovery annotations (gensim-episodes PR #33).
  • helm template + kubectl apply instead of helm install -- Retired. With agent.enabled=false, episodes install cleanly via helm install.
  • imagePullPolicy: Never -> Always patching (awk filter) -- Retired as an awk filter. Replaced by a 3-line sed post-renderer on helm install.
  • Build VM + docker buildx bake + ECR push -- EC2 instance to build episode service images from source and push to ECR. ~150 lines of Pulumi + bash. Retires when: gensim-episodes publishes pre-built images to a shared registry.

Evaluated and kept as-is:

  • fullnameOverride: datadog-agent -- load-bearing DNS name that 227 episodes depend on.
  • mutateUnlabelled: true -- correct for single-tenant namespace.
  • kubelet.tlsVerify: false -- standard EKS workaround, permanent.

Files

| File | What |
|---|---|
| test/e2e-framework/scenarios/aws/gensim-eks/run.go | Scenario: EKS cluster, orchestrator Job, build VM, orchestrator bash script, mode plumbing |
| test/e2e-framework/scenarios/aws/gensim-eks/agent-values.yaml.tmpl | Agent Helm values template (mode-conditional observer config) |
| tasks/e2e_framework/aws/gensim_eks.py | Invoke tasks: submit (with --mode), status, destroy, logs |
| tasks/e2e_framework/deploy.py | Pulumi deploy: --refresh, --non-interactive, pty=False |
| tasks/e2e_framework/destroy.py | Pulumi destroy: --non-interactive, pty=False |

@dd-octo-sts dd-octo-sts bot added the internal Identify a non-fork PR label Mar 10, 2026
scottopell and others added 29 commits March 12, 2026 16:05
Adds a new e2e scenario (aws/gensim-eks) that provisions a real EKS cluster
for running gensim episodes, as an alternative to the existing Kind-on-EC2
approach. Key improvements over the Kind path:

- EC2 build VM builds episode service images natively (linux/amd64) and pushes
  to ECR via instance IAM role — no local Docker or credential setup required
- No cross-platform issues (Apple Silicon vs x86_64 EKS nodes)
- play-episode.sh will run as a Kubernetes Job (M4) rather than a VM-side script

M1 (✓): EKS cluster with Linux node group, kubeconfig export
M2 (✓): EC2 build VM builds/pushes episode images to ECR, deploys episode
         Helm chart with imagePullPolicy post-renderer (Never→IfNotPresent)
M3-M5: Datadog Agent, autonomous Job runner, S3 upload (upcoming)

New files:
- test/e2e-framework/scenarios/aws/gensim-eks/run.go
- tasks/e2e_framework/aws/gensim_eks.py

Invoke: dda inv aws.eks.gensim.create --episode=<name>

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ConfigMap

pulumi-eks manages aws-auth with patchForce:true, giving Pulumi full ownership
of the mapRoles list. The Fargate pod execution role was previously auto-created
inside eks.NewCluster and never included in RoleMappings, so it was silently
dropped from aws-auth on every update. This caused CoreDNS and other kube-system
Fargate pods to fail scheduling with:
  "Pod execution role is not found in auth config or does not have all
   required permissions for launching fargate pods"

Fix: pre-create the Fargate execution role (GetFargatePodExecutionRole in
role.go) before calling eks.NewCluster, pass it as FargateProfileArgs.
PodExecutionRoleArn, and include it in RoleMappings with the standard
system:bootstrappers + system:nodes groups.

Pre-creation is required because the role ARN must be known upfront -
it cannot be read from the cluster after creation without a circular
Pulumi dependency.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ld VM

Two boot-time optimizations for the gensim-eks build VM:

1. Pre-baked Amazon Linux ECS AMI
   Switch build VM from default Ubuntu to Amazon Linux ECS, which ships with
   Docker pre-installed and the daemon already running. Eliminates the
   apt-get update + install docker.io/awscli step (~3-5 min per fresh cluster).
   Only docker-compose needs to be installed via pip3.

2. ECR image caching by content hash
   Images are now pushed with two tags: :latest (for the Helm chart) and
   :<hash> where hash is the first 12 chars of the SHA256 of the services/
   directory. On each build run, the script checks ECR for the hash tag
   before starting docker-compose build. If all images are present at that
   tag, the build is skipped and images are pulled + retagged as :latest
   (~5-10 min saved on re-runs with unchanged source).

   This is most valuable when destroying and recreating a stack without
   changing episode code — a common pattern during cluster debugging.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…aner

The ECR image cache (hash-tagged images) is wiped by the e2e account's
weekly infra-cleaner job, limiting cache hits to within the same week.
To make it durable, ECR repos need a protection tag or cleaner exclusion.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…LI install

The ECS AMI ships Docker 25 (including buildx) but not AWS CLI or
docker-compose. Previous approach tried pip3 install docker-compose
which failed due to Python 3.7 + OpenSSL 1.0.2k incompatibility with
urllib3 v2.0.

Fix:
- Install only awscli via yum (~30s, the only missing tool)
- Replace docker-compose build with docker buildx bake, which understands
  docker-compose.yaml natively, requires no separate install, and builds
  images in parallel
- Replace docker-compose config --images with Python yaml parsing
- Restore explicit docker login (no ECR credential helper on this AMI;
  --password-stdin works without a TTY as Pulumi uses for remote commands)

Net result: setup step goes from apt-get docker.io + docker-compose + awscli
(~4 min) to yum install awscli (~30s). Docker and buildx are pre-installed.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Fargate nodes carry eks.amazonaws.com/compute-type=fargate:NoSchedule.
Any DaemonSet — including the Datadog agent — gets permanently-Pending
pods on those nodes, blocking Helm readiness checks and adding operational
complexity. The only benefit of Fargate is that CoreDNS starts before EC2
nodes join (a ~5 min provisioning optimisation that is irrelevant for
long-running test scenarios).

WithoutFargate() skips both the Fargate profile and the pre-created
execution role. CoreDNS schedules on the EC2 node group once nodes join.
The Fargate execution role and aws-auth complexity are eliminated entirely.

gensim-eks now uses WithoutFargate() so the Datadog agent DaemonSet
(added in M3) schedules cleanly on EC2 nodes without stuck-Pending pods.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Deploys the full DaemonSet-based Datadog agent after the episode chart,
using the framework's helm.NewKubernetesAgent wrapper.

Changes:
- run.go: add M3 agent deployment block, gated on awsEnv.AgentDeploy().
  Reads gensim:datadogValuesPath (sets clusterName + kubelet.tlsVerify:false,
  required on EKS since kubelet uses a self-signed cert). Waits for the
  episode Helm release before deploying.
- gensim_eks.py: set install_agent=True when episode is provided, which
  sets ddagent:deploy=True and injects ddagent:apiKey via the framework.
  Passes gensim:datadogValuesPath when datadog-values.yaml exists at the
  postmortems root. Adds _delete_stub_agent() post-pulumi step to remove
  the episode chart's built-in Deployment-based agent (which would produce
  duplicate metrics alongside the real DaemonSet).

Success: DD agent pod Running on EC2 node, metrics visible in DD under the
episode's env tag, stub agent gone.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…uesFile

pulumi.NewStringAsset (used by WithHelmValues) serialises to []interface{}
in the local Pulumi state backend, causing the Helm provider to fail with
"unsupported type for 'valueYamlFiles' arg: []interface{}".

Fix: add WithHelmValuesFile(path) which uses pulumi.NewFileAsset instead.
File assets are read from disk at apply time and round-trip through the
local backend correctly. gensim-eks uses this for datadogValuesPath.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…tonomously

Creates the RBAC, credentials Secret, ConfigMap, and Job needed to run
play-episode.sh inside the cluster without the developer's laptop staying open.

Resources created when gensim:scenario is set:
- ServiceAccount gensim-runner
- ClusterRole with: get/list/watch pods; get/list/update deployments/scale
  (play-episode.sh uses kubectl scale + kubectl wait only)
- ClusterRoleBinding
- Secret gensim-secrets (DD_API_KEY, DD_APP_KEY)
- ConfigMap gensim-episode (play-episode.sh + <scenario>.yaml)
  Two separate volume mounts map them to /episode/ and /episode/episodes/
  so play-episode.sh finds the scenario YAML at its expected path.
- Job gensim-runner running alpine/k8s:1.31.0 (has kubectl/bash/curl/jq).
  Annotated pulumi.com/skipAwait=true so Pulumi returns immediately
  rather than waiting 30-60 min for the episode to finish.

Python: --scenario flag added to create_gensim_eks; post-deploy prints
kubectl logs -f command for monitoring.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…kend bug properly

AssetOrArchiveArray (used by WithHelmValues/WithHelmValuesFile) deserialises
as []interface{} in local Pulumi state on resource update, causing the Helm
provider to fail with "unsupported type for 'valueYamlFiles' arg: []interface{}".

Fix: add ExtraValues pulumi.Map to HelmInstallationArgs, merged into the main
values map before ToYAMLPulumiAssetOutput(). Map values flow through the
computed output path which survives local-state round-trips correctly on both
create and update.

Changes:
- kubernetes_helm.go: ExtraValues field; merge loop before ToYAMLPulumiAssetOutput
- kubernetes_agent.go: pass params.ExtraHelmValues through
- kubernetesagentparams/params.go: ExtraHelmValues field + WithExtraHelmValues option
- gensim-eks/run.go: use WithExtraHelmValues for kubelet.tlsVerify + clusterName

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…luesYAML

The previous code serialised ALL agent values (including framework defaults)
through ValuesYAML/AssetOrArchiveArray via ToYAMLPulumiAssetOutput(). This
caused the same local-backend deserialisation bug as WithHelmValues:
  "unsupported type for 'valueYamlFiles' arg: []interface {}"

Fix: pass the HelmValues map (after ExtraValues merge) directly as the
Values pulumi.MapInput instead of converting to YAML and going through
ValueYamlFiles. pulumi.Map values serialise as JSON in local state and
survive update round-trips correctly.

User-provided ValuesYAML (WithHelmValues/WithHelmValuesFile) still use the
AssetOrArchiveArray path for now — prefer WithExtraHelmValues instead.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…olume

play-episode.sh creates a results/ directory relative to its script path.
ConfigMap volumes are read-only so mkdir fails. Set RESULTS_DIR=/tmp/results
so the script writes results to a writable location instead.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
play-episode.sh creates results/ relative to its script path (hardcoded,
not overridable via env var). ConfigMap volumes are read-only so mkdir
fails. Add an emptyDir volume mounted at /episode/results to give the
script a writable location without modifying the upstream script.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Cluster-checks runners are not needed for a single-node test cluster.
Their readiness probe blocks the Helm timeout when the forwarder hasn't
fully initialised — particularly noticeable on first deploy where the
pods predate the real API key being available.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
play-episode.sh hardcodes RESULTS_DIR="${SCRIPT_DIR}/results". Since the
script is mounted from a read-only ConfigMap volume at /episode/, mkdir
fails. Add an emptyDir volume and mount at /episode/results to provide a
writable location without modifying the upstream script.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
… scale

kubectl scale issues a PATCH on deployments/scale, not update. Without
patch in the ClusterRole verbs, the scale command fails silently (|| true
suppresses the error) and surge generators never spin up during disruption.

Validated: monitor transitioned OK→Alert after surge scaled to 5 replicas,
then Alert→OK after scale-down during cooldown.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…r agent

After play-episode.sh completes, the Job now:
1. kubectl cp observer parquet from the agent pod (/tmp/observer-parquet)
   into /episode/results/parquet/ — requires pods/exec added to ClusterRole
2. aws s3 cp /episode/results/ to s3://<bucket>/gensim-results-<episode>-<date>/
   — requires s3:PutObject on the EKS Linux node role, attached by Pulumi

Observer-recorder agent support:
- additionalConfig injects observer config directly into datadog.yaml via
  the Helm chart's additionalConfig field
- --full-image-path flag allows specifying the observer-recorder image
  (e.g. docker.io/datadog/agent-dev:q-branch-observer-full)

New invoke flags:
  --s3-bucket   bucket to upload results (optional; skipped if unset)
  --full-image-path   custom agent image path

Example:
  inv aws.eks.gensim.create \
    --episode=authcore-pgbouncer-connection-pool-saturation \
    --scenario=pool-saturation \
    --s3-bucket=qbranch-gensim-recordings \
    --full-image-path=docker.io/datadog/agent-dev:q-branch-observer-full

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…true explicitly

buildLinuxHelmValues builds an agents map without setting agents.enabled.
When this map is passed via Values (pulumi.Map — equivalent to --set style),
Helm's merge semantics can omit the agents.enabled key from the effective
values, causing the node agent DaemonSet template condition to evaluate to
false and the DaemonSet to be skipped entirely.

clusterAgent already sets "enabled": pulumi.Bool(true) explicitly.
agents must do the same for the DaemonSet to be reliably created.

Root cause identified via: Helm manifest had 0 DaemonSets despite
chart defaults having agents.enabled=True, because the Values map
passed from buildLinuxHelmValues had agents.{image,containers,...}
but no agents.enabled key.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The Job command was still the M4 form (direct play-episode.sh call).
Apply the M5 bash -c script that chains: episode run → parquet collection
via kubectl cp → flat S3 upload via aws s3 cp --recursive.

The S3 upload destination follows Maxime's naming:
  s3://<bucket>/gensim-results-<episode>-<YYYYMMDD>/

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Amazon Linux ECS ships Python 3.7 without the yaml module. Replace the
python3 yaml parsing of docker-compose.yaml image names with grep+awk
which is always available. Also fixes backtick in Go raw string literal.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…tick

Three bugs found during clean end-to-end validation run (M5):

1. Backtick in raw string comment terminated the Go string literal,
   causing compile error. Removed backticks from comment text.

2. date format string had `%%%%Y%%%%m%%%%d` -> produced `%%Y%%m%%d` in
   the shell script -> `date -u` received `%%Y%%m%%d` as format and
   expanded it to the literal string `%Y%m%d` instead of YYYYMMDD.
   Fix: `%%Y%%m%%d` in pulumi.Sprintf -> `%Y%m%d` in the script.

3. Parquet kubectl cp used `-l app.kubernetes.io/component=agent` but
   the DaemonSet pods have label `app=datadog-agent`. Selector returned
   empty AGENT_POD, silently skipping parquet collection.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
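The double-escaping chain in bug 2 can be reproduced with Python's %-formatting, which collapses `%%` to `%` per pass just like Go's Sprintf (an analogy for illustration, not the actual Go code):

```python
import time

# Each formatting pass halves the percent signs. Starting from the buggy
# quadruple escape, one Sprintf-style pass still leaves %%Y%%m%%d in the
# shell script, and `date` then prints the literal string %Y%m%d.
buggy = "%%%%Y%%%%m%%%%d" % ()
assert buggy == "%%Y%%m%%d"

# The fix: double escape, so one pass yields a real strftime format.
fixed = "%%Y%%m%%d" % ()
assert fixed == "%Y%m%d"
assert len(time.strftime(fixed)) == 8  # YYYYMMDD
```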
Previously `|| echo "Warning: ..."` swallowed the exit code and the
error was easy to miss. Now:
- If no agent pod is found, prints to stderr with ERROR prefix
- If kubectl cp fails, prints to stderr with ERROR prefix + hint
- On success, prints file count
Job still exits 0 in all cases (parquet is best-effort).

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Two root causes prevented the dda-linux DaemonSet from being created:

1. ExtraValues shallow merge replaced the entire `datadog` map from
   buildLinuxHelmValues, losing `apiKeyExistingSecret`. The chart's
   DaemonSet template requires either apiKey or apiKeyExistingSecret;
   with neither set, it skips the DaemonSet entirely.

   Fix: add deepMergeHelmValues() that recursively merges pulumi.Map
   values instead of replacing top-level keys. This preserves all
   framework defaults while allowing scenarios to override nested keys
   like datadog.kubelet.tlsVerify or agents.customAgentConfig.

2. The episode chart's built-in datadog-agent DaemonSet occupied the
   cluster before dda-linux deployed, creating confusion about which
   agent was running. The stub used gcr.io/datadoghq/agent:7 (stock)
   instead of the observer-recorder image.

   Fix: extend the Helm post-renderer to strip DaemonSet, ServiceAccount,
   ClusterRole, and ClusterRoleBinding resources named `datadog-agent`
   from the episode chart output. The framework's dda-linux release now
   deploys the sole node agent DaemonSet.

Also: update Job label selector from app=datadog-agent to
app=dda-linux-datadog to match the framework's DaemonSet naming.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
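The shallow-merge bug above is worth a sketch. The actual fix is Go (`deepMergeHelmValues` over `pulumi.Map`); this Python version shows why a shallow merge silently drops `apiKeyExistingSecret`:

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base; nested dicts merge key by key,
    anything else in override replaces the base value."""
    out = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], value)
        else:
            out[key] = value
    return out

# A shallow merge replaces the whole "datadog" map, dropping
# apiKeyExistingSecret; the deep merge keeps it while overriding
# only the nested kubelet.tlsVerify key.
defaults = {"datadog": {"apiKeyExistingSecret": "dd-secret",
                        "kubelet": {"tlsVerify": True}}}
scenario = {"datadog": {"kubelet": {"tlsVerify": False}}}
merged = deep_merge(defaults, scenario)
assert merged["datadog"]["apiKeyExistingSecret"] == "dd-secret"
assert merged["datadog"]["kubelet"]["tlsVerify"] is False
```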
…pace

The q-branch-observer image renamed config keys from flat
observer.capture_metrics.enabled / observer.parquet_* to nested
observer.recording.enabled / observer.recording.parquet_*.
The old keys were flagged as unknown and recording never started.

Validated end-to-end from clean stack: 230 parquet files (474 MiB)
collected and uploaded to S3.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… tasks

Replace the single-episode Pulumi-managed runner with a multi-episode
orchestrator architecture:

Persistent layer (Pulumi): EKS cluster, RBAC (cluster-admin SA),
gensim-secrets Secret, S3 IAM policy. Stays up between runs.

Orchestrator Job (per-submission): Loops through episode:scenario pairs
serially. For each: helm install agent with observer-recorder config,
helm install episode chart with post-renderer, run play-episode.sh,
collect parquet via kubectl cp, upload to S3, emit DD events + metrics,
teardown, next episode. Updates gensim-run-status ConfigMap at each
phase transition.

Invoke tasks:
- submit: validates episodes, captures gensim SHA, enforces clean
  checkout, guards against busy cluster, deploys via Pulumi
- status: reads ConfigMap, renders per-episode progress table
- destroy: unchanged

Removes dead code: buildAndPushImages, hashDir, writePatchScript,
buildRunnerScript (all superseded by orchestrator).

Fixes specs: removes sopell review comments, temporal qualifiers,
misplaced blocked-by notes. Updates executive status table.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Design.md ASCII diagram: use full inv aws.eks.gensim.* task names
- Orchestrator script: capture play-episode.sh exit code into
  EP_OUTCOME (success/failure) and pass to emit_dd_event

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
scottopell and others added 8 commits March 12, 2026 16:06
The build VM was accidentally removed during the orchestrator
restructure. Episodes with docker-compose.yaml need their service
images built on EC2 (x86_64) and pushed to ECR before the orchestrator
can helm-install them.

Split into provisionBuildVM (one VM) + buildEpisodeImages (per-episode
with unique resource names) to support multi-episode submissions where
multiple episodes have custom service images.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Secret keys: use api-key/app-key (Helm chart convention) not DD_API_KEY
- Helm: use upgrade --install (idempotent, handles stale releases)
- Helm: remove --skip-tests (invalid flag), use --skip-crds
- Cluster agent: disable (observer image doesn't include it)
- Post-renderer: also strip Service named datadog-agent
- DD_ENV: export before play-episode.sh (required by episode scripts)
- ConfigMap names: sanitize to lowercase+hyphens (K8s RFC 1123)
- Status task: drop aws_wrapper from kubectl (kubeconfig has auth)
- Kubeconfig lookup: glob for full stack name prefix
- Build VM: write docker-compose.yaml via inline command (fix collisions)
- Bash: fix ${3:-{}} default (extra brace from bash parsing)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
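The ConfigMap name sanitization mentioned above (lowercase + hyphens per Kubernetes RFC 1123 labels) might look like this; the helper name is hypothetical and the real implementation lives in the orchestrator:

```python
import re

def sanitize_k8s_name(name: str, max_len: int = 63) -> str:
    """Coerce a string into an RFC 1123 label: lowercase [a-z0-9-],
    alphanumeric at both ends, at most 63 characters."""
    s = re.sub(r"[^a-z0-9-]+", "-", name.lower())
    return s[:max_len].strip("-")
```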
Replace minimal agent Helm values with full observability config: APM
(with admission controller injection), logs (containerCollectAll),
process agent, DogStatsD (hostPort), cluster agent with admission
controller (mutateUnlabelled), and 3Gi agent memory limits.

Replace helm install + post-renderer with two-pass deployment:
helm template -> extract_check_configs (yq) -> awk_filter -> kubectl
apply. This extracts per-episode check configs (redis, postgres, etc.)
from the episode chart and passes them to the agent via datadog.confd.

Key changes:
- fullnameOverride: datadog-agent (episode pods hardcode this DNS name)
- Cluster agent uses chart-default image (not custom observer image)
- extract_check_configs() uses yq to pull datadog-checks ConfigMap
- awk_filter() strips DaemonSet/Deployment/Service/SA/ClusterRole/
  ClusterRoleBinding/ConfigMap named datadog-agent or datadog-checks
- Episode teardown via kubectl delete -f (not helm uninstall)
- Parquet collection label updated to app=datadog-agent
- Helm release name changed to datadog-agent (was dda-linux)
- S3 path includes date in gensim SHA segment for readability
- Removed post-renderer ConfigMap, volume, and volume mount

Validated end-to-end: 3 episodes, 1218 parquet files, APM traces +
logs + DogStatsD + redis/postgres check metrics all flowing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Revert changes to kubernetes_helm.go, kubernetes_agent.go, and
params.go that added ExtraHelmValues, deepMergeHelmValues, and
WithHelmValuesFile. These were from an earlier Go-level approach
that was replaced by the bash-based helm calls in the orchestrator
Job. The gensim-eks scenario never imports these components.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Revert the full Fargate refactor (pre-created pod execution role,
restructured clusterArgs, RoleMappings changes) from this branch.
The only change to cluster.go is wrapping the existing Fargate block
in `if !params.DisableFargate`. The full refactor lives on branch
sopell/eks-fargate-refactor for separate review.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ew comments

Move the agent Helm values from an inline bash heredoc (with sed
placeholder substitution) to a standalone agent-values.yaml.tmpl
file. The template is embedded via go:embed, rendered at Pulumi
plan time with the actual image repo/tag, and mounted as a ConfigMap.
Template errors are now caught at deploy time, not at Job runtime.

Also addresses inline review comments:
- Trim verbose WithoutFargate doc comment
- Add TODO for buildEpisodeImages removal
- Clarify hashDir's purpose (Pulumi Trigger, not Docker cache)
- Remove stale cluster comment block

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
These were planning artifacts used during development, not
intended for the final PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pisodes

Replace helm template/awk filter/kubectl apply with single-pass helm
install --set agent.enabled=false. Remove extract_check_configs and
awk_filter functions (autodiscovery annotations handle check configs
since gensim-episodes PR #33, agent.enabled gating since PR #31).

Update _get_gensim_repo_path() to return the repo root instead of
hardcoding /postmortems, add _find_episode_dir() to search both
postmortems/ and synthetics/ subdirectories.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@scottopell scottopell force-pushed the sopell/gensim-eks-runner branch from d6f6d6e to c8fbae5 Compare March 12, 2026 20:14
@scottopell scottopell added changelog/no-changelog No changelog entry needed qa/no-code-change No code change in Agent code requiring validation team/agent-devx labels Mar 12, 2026
Move top-level `from tasks.e2e_framework.config import ...` to lazy
imports inside wrapper functions. The config module unconditionally
imports pydantic, which isn't installed in standard CI runners -- only
in e2e-specific environments. Every other e2e_framework file uses lazy
imports for this reason.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the long review PR is complex, plan time to review it label Mar 12, 2026
@scottopell scottopell marked this pull request as ready for review March 12, 2026 20:30
@scottopell scottopell requested a review from a team as a code owner March 12, 2026 20:30
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dd4d8f599b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@agent-platform-auto-pr
Contributor

agent-platform-auto-pr bot commented Mar 12, 2026

Files inventory check summary

File checks results against ancestor 0b457eee:

Results for datadog-agent_7.78.0~devel.git.723.16412a7.pipeline.103341240-1_amd64.deb:

No change detected

… ref

P1: Include docker-compose.yaml content in the services hash used as a
Pulumi trigger. Previously only the services/ directory was hashed, so
changes to docker-compose.yaml (adding/removing images, changing build
contexts) would not trigger a rebuild.

P2: Validate that the agent image reference contains a colon before
splitting into repo:tag. Returns a clear error instead of panicking
with a slice-bounds error.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
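The P2 validation amounts to something like the following sketch. The actual change is in Go in run.go; this Python version is illustrative only, and the function name is hypothetical.

```python
def split_image_ref(ref: str) -> tuple[str, str]:
    """Split an agent image reference into (repo, tag), returning a clear
    error instead of an opaque slice-bounds failure when no tag is present."""
    repo, sep, tag = ref.rpartition(":")
    # A "/" after the last colon means that colon was a registry port,
    # not a tag separator (e.g. "registry:5000/agent" has no tag).
    if not sep or "/" in tag:
        raise ValueError(f"image reference {ref!r} must be of the form repo:tag")
    return repo, tag
```

Validating up front turns a panic deep in the deploy path into an immediate, actionable error at submit time.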
@agent-platform-auto-pr
Contributor

agent-platform-auto-pr bot commented Mar 12, 2026

Static quality checks

✅ Please find below the results from static quality gates
Comparison made with ancestor 0b457ee
📊 Static Quality Gates Dashboard
🔗 SQG Job

30 successful checks with minimal change (< 2 KiB)
| Quality gate | Current size |
|---|---|
| agent_deb_amd64 | 748.919 MiB |
| agent_deb_amd64_fips | 707.285 MiB |
| agent_heroku_amd64 | 312.824 MiB |
| agent_rpm_amd64 | 748.903 MiB |
| agent_rpm_amd64_fips | 707.268 MiB |
| agent_rpm_arm64 | 727.411 MiB |
| agent_rpm_arm64_fips | 688.727 MiB |
| agent_suse_amd64 | 748.903 MiB |
| agent_suse_amd64_fips | 707.268 MiB |
| agent_suse_arm64 | 727.411 MiB |
| agent_suse_arm64_fips | 688.727 MiB |
| docker_agent_amd64 | 809.250 MiB |
| docker_agent_arm64 | 812.559 MiB |
| docker_agent_jmx_amd64 | 1000.165 MiB |
| docker_agent_jmx_arm64 | 992.253 MiB |
| docker_cluster_agent_amd64 | 203.688 MiB |
| docker_cluster_agent_arm64 | 218.167 MiB |
| docker_cws_instrumentation_amd64 | 7.142 MiB |
| docker_cws_instrumentation_arm64 | 6.689 MiB |
| docker_dogstatsd_amd64 | 39.207 MiB |
| docker_dogstatsd_arm64 | 37.445 MiB |
| dogstatsd_deb_amd64 | 29.847 MiB |
| dogstatsd_deb_arm64 | 28.003 MiB |
| dogstatsd_rpm_amd64 | 29.847 MiB |
| dogstatsd_suse_amd64 | 29.847 MiB |
| iot_agent_deb_amd64 | 43.065 MiB |
| iot_agent_deb_arm64 | 40.124 MiB |
| iot_agent_deb_armhf | 40.860 MiB |
| iot_agent_rpm_amd64 | 43.066 MiB |
| iot_agent_suse_amd64 | 43.066 MiB |
On-wire sizes (compressed)
| Quality gate | Change | Size (prev → curr → max, MiB; neutral rows show curr → max) |
|---|---|---|
| agent_deb_amd64 | -4.41 KiB (0.00% reduction) | 174.191 → 174.186 → 177.700 |
| agent_deb_amd64_fips | -21.94 KiB (0.01% reduction) | 165.191 → 165.170 → 172.230 |
| agent_heroku_amd64 | neutral | 74.896 → 79.970 |
| agent_rpm_amd64 | +35.9 KiB (0.02% increase) | 177.138 → 177.173 → 180.780 |
| agent_rpm_amd64_fips | -15.66 KiB (0.01% reduction) | 167.197 → 167.181 → 173.370 |
| agent_rpm_arm64 | +41.76 KiB (0.03% increase) | 159.608 → 159.648 → 161.610 |
| agent_rpm_arm64_fips | -2.48 KiB (0.00% reduction) | 151.234 → 151.231 → 155.910 |
| agent_suse_amd64 | +35.9 KiB (0.02% increase) | 177.138 → 177.173 → 180.780 |
| agent_suse_amd64_fips | -15.66 KiB (0.01% reduction) | 167.197 → 167.181 → 173.370 |
| agent_suse_arm64 | +41.76 KiB (0.03% increase) | 159.608 → 159.648 → 161.610 |
| agent_suse_arm64_fips | -2.48 KiB (0.00% reduction) | 151.234 → 151.231 → 155.910 |
| docker_agent_amd64 | neutral | 267.206 → 271.240 |
| docker_agent_arm64 | +4.08 KiB (0.00% increase) | 254.476 → 254.480 → 259.800 |
| docker_agent_jmx_amd64 | neutral | 335.848 → 339.870 |
| docker_agent_jmx_arm64 | neutral | 319.126 → 324.390 |
| docker_cluster_agent_amd64 | neutral | 71.289 → 72.920 |
| docker_cluster_agent_arm64 | neutral | 66.926 → 68.220 |
| docker_cws_instrumentation_amd64 | neutral | 2.999 → 3.330 |
| docker_cws_instrumentation_arm64 | neutral | 2.729 → 3.090 |
| docker_dogstatsd_amd64 | neutral | 15.155 → 15.820 |
| docker_dogstatsd_arm64 | neutral | 14.478 → 14.830 |
| dogstatsd_deb_amd64 | neutral | 7.886 → 8.790 |
| dogstatsd_deb_arm64 | neutral | 6.772 → 7.710 |
| dogstatsd_rpm_amd64 | neutral | 7.897 → 8.800 |
| dogstatsd_suse_amd64 | neutral | 7.897 → 8.800 |
| iot_agent_deb_amd64 | neutral | 11.348 → 12.040 |
| iot_agent_deb_arm64 | neutral | 9.664 → 10.450 |
| iot_agent_deb_armhf | neutral | 9.897 → 10.620 |
| iot_agent_rpm_amd64 | -3.48 KiB (0.03% reduction) | 11.369 → 11.366 → 12.060 |
| iot_agent_suse_amd64 | -3.48 KiB (0.03% reduction) | 11.369 → 11.366 → 12.060 |

@cit-pr-commenter-54b7da

cit-pr-commenter-54b7da bot commented Mar 12, 2026

Regression Detector

Regression Detector Results

Metrics dashboard
Target profiles
Run ID: 2403b26f-8c95-408d-a2da-313a1ce50dac

Baseline: 0b457ee
Comparison: 16412a7
Diff

Optimization Goals: ✅ No significant changes detected

Experiments ignored for regressions

Regressions in experiments with settings containing erratic: true are ignored.

| Experiment | Goal | Δ mean % | Δ mean % CI | Trials | Links |
|---|---|---|---|---|---|
| docker_containers_cpu | % cpu utilization | +3.27 | [+0.23, +6.31] | 1 | Logs |

Fine details of change detection per experiment

| Experiment | Goal | Δ mean % | Δ mean % CI | Trials | Links |
|---|---|---|---|---|---|
| docker_containers_cpu | % cpu utilization | +3.27 | [+0.23, +6.31] | 1 | Logs |
| otlp_ingest_logs | memory utilization | +0.73 | [+0.62, +0.84] | 1 | Logs |
| ddot_metrics_sum_cumulative | memory utilization | +0.36 | [+0.22, +0.50] | 1 | Logs |
| quality_gate_metrics_logs | memory utilization | +0.26 | [+0.02, +0.50] | 1 | Logs, bounds checks dashboard |
| docker_containers_memory | memory utilization | +0.24 | [+0.17, +0.31] | 1 | Logs |
| file_to_blackhole_0ms_latency | egress throughput | +0.07 | [-0.47, +0.60] | 1 | Logs |
| quality_gate_idle | memory utilization | +0.06 | [+0.02, +0.11] | 1 | Logs, bounds checks dashboard |
| quality_gate_idle_all_features | memory utilization | +0.05 | [+0.01, +0.09] | 1 | Logs, bounds checks dashboard |
| tcp_dd_logs_filter_exclude | ingress throughput | +0.01 | [-0.10, +0.11] | 1 | Logs |
| uds_dogstatsd_to_api_v3 | ingress throughput | +0.00 | [-0.20, +0.20] | 1 | Logs |
| uds_dogstatsd_to_api | ingress throughput | -0.00 | [-0.21, +0.20] | 1 | Logs |
| file_to_blackhole_500ms_latency | egress throughput | -0.01 | [-0.40, +0.39] | 1 | Logs |
| file_to_blackhole_1000ms_latency | egress throughput | -0.01 | [-0.43, +0.42] | 1 | Logs |
| otlp_ingest_metrics | memory utilization | -0.01 | [-0.17, +0.14] | 1 | Logs |
| file_to_blackhole_100ms_latency | egress throughput | -0.02 | [-0.10, +0.05] | 1 | Logs |
| ddot_metrics_sum_delta | memory utilization | -0.08 | [-0.25, +0.10] | 1 | Logs |
| ddot_logs | memory utilization | -0.08 | [-0.16, +0.00] | 1 | Logs |
| uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | -0.15 | [-0.21, -0.10] | 1 | Logs |
| file_tree | memory utilization | -0.16 | [-0.21, -0.10] | 1 | Logs |
| ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | -0.38 | [-0.60, -0.16] | 1 | Logs |
| tcp_syslog_to_blackhole | ingress throughput | -0.44 | [-0.58, -0.30] | 1 | Logs |
| ddot_metrics | memory utilization | -0.50 | [-0.67, -0.33] | 1 | Logs |
| quality_gate_logs | % cpu utilization | -0.66 | [-2.24, +0.92] | 1 | Logs, bounds checks dashboard |

Bounds Checks: ✅ Passed

| Experiment | Bounds check | Replicates passed | Observed value | Links |
|---|---|---|---|---|
| docker_containers_cpu | simple_check_run | 10/10 | 709 ≥ 26 | |
| docker_containers_memory | memory_usage | 10/10 | 273.68MiB ≤ 370MiB | |
| docker_containers_memory | simple_check_run | 10/10 | 695 ≥ 26 | |
| file_to_blackhole_0ms_latency | memory_usage | 10/10 | 0.19GiB ≤ 1.20GiB | |
| file_to_blackhole_0ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| file_to_blackhole_1000ms_latency | memory_usage | 10/10 | 0.23GiB ≤ 1.20GiB | |
| file_to_blackhole_1000ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| file_to_blackhole_100ms_latency | memory_usage | 10/10 | 0.20GiB ≤ 1.20GiB | |
| file_to_blackhole_100ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| file_to_blackhole_500ms_latency | memory_usage | 10/10 | 0.22GiB ≤ 1.20GiB | |
| file_to_blackhole_500ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| quality_gate_idle | intake_connections | 10/10 | 3 = 3 | bounds checks dashboard |
| quality_gate_idle | memory_usage | 10/10 | 173.66MiB ≤ 175MiB | bounds checks dashboard |
| quality_gate_idle_all_features | intake_connections | 10/10 | 2 ≤ 3 | bounds checks dashboard |
| quality_gate_idle_all_features | memory_usage | 10/10 | 486.85MiB ≤ 550MiB | bounds checks dashboard |
| quality_gate_logs | intake_connections | 10/10 | 3 ≤ 6 | bounds checks dashboard |
| quality_gate_logs | memory_usage | 10/10 | 203.47MiB ≤ 220MiB | bounds checks dashboard |
| quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| quality_gate_metrics_logs | cpu_usage | 10/10 | 345.91 ≤ 2000 | bounds checks dashboard |
| quality_gate_metrics_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| quality_gate_metrics_logs | memory_usage | 10/10 | 418.82MiB ≤ 475MiB | bounds checks dashboard |
| quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".
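Taken together, the three criteria reduce to a simple predicate. The sketch below is illustrative only, not the detector's actual code:

```python
def is_regression(delta_mean_pct: float, ci: tuple[float, float],
                  erratic: bool = False, tolerance: float = 5.0) -> bool:
    """Flag a change only when the effect is large enough, the confidence
    interval excludes zero, and the experiment is not marked erratic."""
    lo, hi = ci
    big_enough = abs(delta_mean_pct) >= tolerance
    ci_excludes_zero = lo > 0 or hi < 0
    return big_enough and ci_excludes_zero and not erratic

# docker_containers_cpu above: +3.27 with CI [+0.23, +6.31]. The CI
# excludes zero, but the effect is under the 5% tolerance, so no flag.
```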

CI Pass/Fail Decision

Passed. All Quality Gates passed.

  • quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.

scottopell and others added 3 commits March 13, 2026 10:56
…bility

Add --mode flag to gensim submit (record-parquet or live-anomaly-detection).
In live mode, the observer runs with analysis + event_reporter enabled and
parquet collection/S3 upload are skipped.

Changes:
- agent-values.yaml.tmpl: templatize observer config based on mode,
  add pullPolicy: Always to prevent stale cached images on EKS nodes
- run.go: plumb mode through Pulumi config, orchestrator Job env,
  renderAgentValues, and buildOrchestratorScript. Conditional parquet
  collection based on GENSIM_MODE env var.
- gensim_eks.py: add --mode parameter with validation, pass as
  gensim:mode Pulumi config
- deploy.py: add --refresh and --non-interactive to pulumi up,
  disable pty. Fixes stale Pulumi state causing silent no-ops when
  cluster resources are deleted outside Pulumi.
- destroy.py: same --non-interactive and pty=False treatment so
  error output is captured instead of swallowed by the TUI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
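The mode gating can be pictured with the following sketch. The `GENSIM_MODE` env var and the two mode names come from this commit; the helper function itself is hypothetical:

```python
import os

VALID_MODES = ("record-parquet", "live-anomaly-detection")


def should_collect_parquet(environ=os.environ) -> bool:
    """record-parquet collects parquet files and uploads them to S3;
    live-anomaly-detection skips collection and emits anomaly events."""
    mode = environ.get("GENSIM_MODE", "record-parquet")
    if mode not in VALID_MODES:
        raise ValueError(f"unknown GENSIM_MODE {mode!r}; expected one of {VALID_MODES}")
    return mode == "record-parquet"
```

Validating the mode in one place (mirroring the --mode validation in gensim_eks.py) keeps the orchestrator from silently running with a misspelled mode and collecting nothing.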

Labels

- changelog/no-changelog: No changelog entry needed
- internal: Identify a non-fork PR
- long review: PR is complex, plan time to review it
- qa/no-code-change: No code change in Agent code requiring validation
- team/agent-devx
