Add gensim-eks scenario: end-to-end episode runner with full observability#47691

Open
scottopell wants to merge 42 commits into main from sopell/gensim-eks-runner

Conversation

@scottopell scottopell commented Mar 10, 2026

What does this PR do?

Adds a new e2e-framework scenario (aws/gensim-eks) that provisions a persistent EKS cluster and runs GenSim episodes end-to-end. Supports two modes:

  • record-parquet (default): Collects observer data to parquet files for offline analysis, uploads to S3
  • live-anomaly-detection: Runs live edge anomaly detection with all correlators enabled, sends detected anomaly events to Datadog via the Events v2 API

For each episode, the scenario deploys application services, installs a fully-configured Datadog agent, executes fault injection cycles, and collects results (parquet files or live anomaly events depending on mode).

Motivation

GenSim episodes are self-contained incident simulations -- each ships a Helm chart with application services, a play-episode.sh fault injection script, and (in 261 of 267 episodes) a stub Datadog agent Deployment. To evaluate the observer against these episodes at scale, we need infrastructure that:

  1. Replaces the stub agent with a custom observer-instrumented build
  2. Autodiscovers per-episode integration check configs (redis, postgres, etc.) via pod annotations
  3. Enables full observability -- APM traces, logs, DogStatsD, integration checks -- not just internal agent metrics
  4. Runs multiple episodes serially with clean teardown between each
  5. Supports both offline parquet collection and live anomaly detection modes

Data flow

On your laptop (dda inv aws.eks.gensim.submit --image=... --episodes=A:s1,B:s2 --mode=live-anomaly-detection):

Pulumi reads each episode's play-episode.sh, scenario YAML, and Helm chart from your local GENSIM_REPO_PATH, packages them into per-episode ConfigMaps, renders agent-values.yaml.tmpl with the target image and mode, and creates an orchestrator Job. Your laptop is done.
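The `--episodes` flag takes comma-separated `episode:scenario` pairs. A minimal sketch of that parsing (the helper name `parse_episodes` is illustrative, not from the PR):

```python
def parse_episodes(spec: str) -> list[tuple[str, str]]:
    """Split an --episodes value like "A:s1,B:s2" into (episode, scenario) pairs."""
    pairs = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        if ":" not in item:
            raise ValueError(f"expected episode:scenario, got {item!r}")
        episode, scenario = item.split(":", 1)
        pairs.append((episode, scenario))
    return pairs
```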

In the EKS cluster (orchestrator Job, alpine/k8s:1.31.0):

for each episode:

  # EPISODE: helm install episode chart with --set agent.enabled=false
  #          (suppresses stub agent resources; post-renderer fixes imagePullPolicy)

  # AGENT: helm install datadog-agent with:
  #   - agent-values.yaml (APM, logs, DogStatsD, process agent, cluster agent,
  #     admission controller, observer config based on mode)
  #   (autodiscovery annotations on episode pods handle check configs)

  # RUN: play-episode.sh executes warmup -> baseline -> fault injection -> cooldown
  #      (20-80 min per episode)

  # COLLECT (record-parquet mode only):
  #   kubectl cp parquet from agent pod
  #   aws s3 cp to $BUCKET/$IMAGE_TAG/$EPISODE--$SCENARIO/gensim-$DATE-$SHA/
  # (live-anomaly-detection mode: anomaly events sent to Datadog in real-time)

  # REPORT:  DD events + metrics via API, update gensim-run-status ConfigMap
  # TEARDOWN: helm uninstall episode + agent, clean workspace

Observer mode configuration

The --mode flag controls what the agent's observer component does:

| Config | record-parquet | live-anomaly-detection |
|---|---|---|
| observer.recording.enabled | true | false |
| observer.analysis.enabled | (default) | true |
| observer.event_reporter.sending_enabled | false | true |
| Parquet collection + S3 upload | yes | skipped |
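The mode-to-settings mapping above can be sketched as a small lookup. This is an illustrative helper (the real rendering happens in the Go template), with key names taken from the settings listed above:

```python
def observer_config(mode: str) -> dict:
    """Return the observer settings implied by --mode (illustrative sketch)."""
    if mode == "record-parquet":
        return {
            "observer.recording.enabled": True,
            "observer.event_reporter.sending_enabled": False,
        }
    if mode == "live-anomaly-detection":
        return {
            "observer.recording.enabled": False,
            "observer.analysis.enabled": True,
            "observer.event_reporter.sending_enabled": True,
        }
    raise ValueError(f"unknown mode: {mode}")
```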

Key design decisions

  • fullnameOverride: datadog-agent -- episode pods hardcode datadog-agent:8125 and datadog-agent:8126 as DNS names for DogStatsD and APM. This makes our Helm chart create Services with the names they expect.
  • Single-pass episode deployment -- helm install with --set agent.enabled=false suppresses stub agent resources. A lightweight sed post-renderer patches imagePullPolicy: Never to Always for the few episodes that set it. Autodiscovery annotations on episode pods handle check configs -- no extraction step needed.
  • pullPolicy: Always on the agent image -- prevents stale cached images on EKS nodes when the same tag is rebuilt with new code.
  • Admission controller with mutateUnlabelled: true -- auto-injects DD_AGENT_HOST into all episode pods without requiring labels.
  • Agent values as a Go template (agent-values.yaml.tmpl) -- embedded via go:embed, rendered at Pulumi plan time with target image repo/tag and mode.
  • --refresh on pulumi up -- reconciles Pulumi state with cluster reality before planning changes. Prevents silent no-ops when resources are deleted outside Pulumi (e.g. by the orchestrator's helm uninstall during teardown).
  • --non-interactive on pulumi up/destroy -- disables the TUI progress display so error output is captured in logs instead of being overwritten by cursor repositioning.
  • Content-addressed build cache -- docker-compose.yaml is included in the services hash (alongside the services/ directory) so adding/removing images triggers a rebuild.
  • Image reference validation -- renderAgentValues validates the image contains : before splitting into repo:tag, returning a clear error instead of panicking.
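The last two decisions can be sketched briefly. The actual implementation is Go (`renderAgentValues` and the Pulumi trigger hash); this is a hedged Python illustration of the same logic, with hypothetical helper names:

```python
import hashlib
from pathlib import Path

def services_hash(services_dir: str, compose_file: str) -> str:
    """Content hash over the services/ tree plus docker-compose.yaml, so a
    change to either triggers a rebuild. Returns the first 12 hex chars."""
    h = hashlib.sha256()
    for path in sorted(Path(services_dir).rglob("*")):
        if path.is_file():
            h.update(str(path.relative_to(services_dir)).encode())
            h.update(path.read_bytes())
    h.update(Path(compose_file).read_bytes())
    return h.hexdigest()[:12]

def split_image_ref(image: str) -> tuple[str, str]:
    """Validate that the reference has a tag before splitting, instead of
    failing with a slice-bounds error on a bare repo name."""
    if ":" not in image:
        raise ValueError(f"image reference {image!r} has no tag (expected repo:tag)")
    repo, tag = image.rsplit(":", 1)
    return repo, tag
```

Note the `rsplit` sketch does not handle registry-port references without a tag (e.g. `registry:5000/img`); the real Go code may differ.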

Describe how you validated your changes

Record-parquet mode -- 3-episode end-to-end run:

| Episode | Parquet Files | Duration | Signals |
|---|---|---|---|
| 213_PagerDuty (cassandra) | 381 | 49m | APM + logs + DogStatsD + observer |
| food-delivery (redis) | 235 | 33m | APM + logs + DogStatsD + redis check + postgres check + observer |
| 353_postmark (dns) | 602 | 84m | APM + logs + DogStatsD + observer |

Live-anomaly-detection mode -- 213_PagerDuty episode:

  • 20 Datadog events successfully delivered via the event reporter (source:agent-q-branch-observer)
  • TimeCluster correlator detected temporal co-occurrence of BOCPD anomalies across trace agent metrics (2-8 anomalies per cluster)
  • Surprise correlator detected statistically unusual metric co-occurrences (lift scores 2.0-24.0)
  • Monitor received data (DD_ENV propagation working), went to Alert during disruption, recovered to OK
  • Clean teardown, no resource leaks

Future checklist: hacks to retire

  • awk filter stripping agent resources -- Retired. Episodes now gate agent resources behind agent.enabled: false (gensim-episodes PR #31).
  • extract_check_configs with yq -- Retired. Episodes migrated to autodiscovery annotations (gensim-episodes PR #33).
  • helm template + kubectl apply instead of helm install -- Retired. With agent.enabled=false, episodes install cleanly via helm install.
  • imagePullPolicy: Never -> Always patching (awk filter) -- Retired as an awk filter. Replaced by a 3-line sed post-renderer on helm install.
  • Build VM + docker buildx bake + ECR push -- EC2 instance to build episode service images from source and push to ECR. ~150 lines of Pulumi + bash. Retires when: gensim-episodes publishes pre-built images to a shared registry.

Evaluated and kept as-is:

  • fullnameOverride: datadog-agent -- load-bearing DNS name that 227 episodes depend on.
  • mutateUnlabelled: true -- correct for single-tenant namespace.
  • kubelet.tlsVerify: false -- standard EKS workaround, permanent.

Files

| File | What |
|---|---|
| test/e2e-framework/scenarios/aws/gensim-eks/run.go | Scenario: EKS cluster, orchestrator Job, build VM, orchestrator bash script, mode plumbing |
| test/e2e-framework/scenarios/aws/gensim-eks/agent-values.yaml.tmpl | Agent Helm values template (mode-conditional observer config) |
| tasks/e2e_framework/aws/gensim_eks.py | Invoke tasks: submit (with --mode), status, destroy, logs |
| tasks/e2e_framework/deploy.py | Pulumi deploy: --refresh, --non-interactive, pty=False |
| tasks/e2e_framework/destroy.py | Pulumi destroy: --non-interactive, pty=False |

@dd-octo-sts dd-octo-sts bot added the internal Identify a non-fork PR label Mar 10, 2026
scottopell and others added 29 commits March 12, 2026 16:05
Adds a new e2e scenario (aws/gensim-eks) that provisions a real EKS cluster
for running gensim episodes, as an alternative to the existing Kind-on-EC2
approach. Key improvements over the Kind path:

- EC2 build VM builds episode service images natively (linux/amd64) and pushes
  to ECR via instance IAM role — no local Docker or credential setup required
- No cross-platform issues (Apple Silicon vs x86_64 EKS nodes)
- play-episode.sh will run as a Kubernetes Job (M4) rather than a VM-side script

M1 (✓): EKS cluster with Linux node group, kubeconfig export
M2 (✓): EC2 build VM builds/pushes episode images to ECR, deploys episode
         Helm chart with imagePullPolicy post-renderer (Never→IfNotPresent)
M3-M5: Datadog Agent, autonomous Job runner, S3 upload (upcoming)

New files:
- test/e2e-framework/scenarios/aws/gensim-eks/run.go
- tasks/e2e_framework/aws/gensim_eks.py

Invoke: dda inv aws.eks.gensim.create --episode=<name>

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ConfigMap

pulumi-eks manages aws-auth with patchForce:true, giving Pulumi full ownership
of the mapRoles list. The Fargate pod execution role was previously auto-created
inside eks.NewCluster and never included in RoleMappings, so it was silently
dropped from aws-auth on every update. This caused CoreDNS and other kube-system
Fargate pods to fail scheduling with:
  "Pod execution role is not found in auth config or does not have all
   required permissions for launching fargate pods"

Fix: pre-create the Fargate execution role (GetFargatePodExecutionRole in
role.go) before calling eks.NewCluster, pass it as FargateProfileArgs.
PodExecutionRoleArn, and include it in RoleMappings with the standard
system:bootstrappers + system:nodes groups.

Pre-creation is required because the role ARN must be known upfront -
it cannot be read from the cluster after creation without a circular
Pulumi dependency.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ld VM

Two boot-time optimizations for the gensim-eks build VM:

1. Pre-baked Amazon Linux ECS AMI
   Switch build VM from default Ubuntu to Amazon Linux ECS, which ships with
   Docker pre-installed and the daemon already running. Eliminates the
   apt-get update + install docker.io/awscli step (~3-5 min per fresh cluster).
   Only docker-compose needs to be installed via pip3.

2. ECR image caching by content hash
   Images are now pushed with two tags: :latest (for the Helm chart) and
   :<hash> where hash is the first 12 chars of the SHA256 of the services/
   directory. On each build run, the script checks ECR for the hash tag
   before starting docker-compose build. If all images are present at that
   tag, the build is skipped and images are pulled + retagged as :latest
   (~5-10 min saved on re-runs with unchanged source).

   This is most valuable when destroying and recreating a stack without
   changing episode code — a common pattern during cluster debugging.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…aner

The ECR image cache (hash-tagged images) is wiped by the e2e account's
weekly infra-cleaner job, limiting cache hits to within the same week.
To make it durable, ECR repos need a protection tag or cleaner exclusion.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…LI install

The ECS AMI ships Docker 25 (including buildx) but not AWS CLI or
docker-compose. Previous approach tried pip3 install docker-compose
which failed due to Python 3.7 + OpenSSL 1.0.2k incompatibility with
urllib3 v2.0.

Fix:
- Install only awscli via yum (~30s, the only missing tool)
- Replace docker-compose build with docker buildx bake, which understands
  docker-compose.yaml natively, requires no separate install, and builds
  images in parallel
- Replace docker-compose config --images with Python yaml parsing
- Restore explicit docker login (no ECR credential helper on this AMI;
  --password-stdin works without a TTY as Pulumi uses for remote commands)

Net result: setup step goes from apt-get docker.io + docker-compose + awscli
(~4 min) to yum install awscli (~30s). Docker and buildx are pre-installed.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Fargate nodes carry eks.amazonaws.com/compute-type=fargate:NoSchedule.
Any DaemonSet — including the Datadog agent — gets permanently-Pending
pods on those nodes, blocking Helm readiness checks and adding operational
complexity. The only benefit of Fargate is that CoreDNS starts before EC2
nodes join (a ~5 min provisioning optimisation that is irrelevant for
long-running test scenarios).

WithoutFargate() skips both the Fargate profile and the pre-created
execution role. CoreDNS schedules on the EC2 node group once nodes join.
The Fargate execution role and aws-auth complexity are eliminated entirely.

gensim-eks now uses WithoutFargate() so the Datadog agent DaemonSet
(added in M3) schedules cleanly on EC2 nodes without stuck-Pending pods.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Deploys the full DaemonSet-based Datadog agent after the episode chart,
using the framework's helm.NewKubernetesAgent wrapper.

Changes:
- run.go: add M3 agent deployment block, gated on awsEnv.AgentDeploy().
  Reads gensim:datadogValuesPath (sets clusterName + kubelet.tlsVerify:false,
  required on EKS since kubelet uses a self-signed cert). Waits for the
  episode Helm release before deploying.
- gensim_eks.py: set install_agent=True when episode is provided, which
  sets ddagent:deploy=True and injects ddagent:apiKey via the framework.
  Passes gensim:datadogValuesPath when datadog-values.yaml exists at the
  postmortems root. Adds _delete_stub_agent() post-pulumi step to remove
  the episode chart's built-in Deployment-based agent (which would produce
  duplicate metrics alongside the real DaemonSet).

Success: DD agent pod Running on EC2 node, metrics visible in DD under the
episode's env tag, stub agent gone.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…uesFile

pulumi.NewStringAsset (used by WithHelmValues) serialises to []interface{}
in the local Pulumi state backend, causing the Helm provider to fail with
"unsupported type for 'valueYamlFiles' arg: []interface{}".

Fix: add WithHelmValuesFile(path) which uses pulumi.NewFileAsset instead.
File assets are read from disk at apply time and round-trip through the
local backend correctly. gensim-eks uses this for datadogValuesPath.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…tonomously

Creates the RBAC, credentials Secret, ConfigMap, and Job needed to run
play-episode.sh inside the cluster without the developer's laptop staying open.

Resources created when gensim:scenario is set:
- ServiceAccount gensim-runner
- ClusterRole with: get/list/watch pods; get/list/update deployments/scale
  (play-episode.sh uses kubectl scale + kubectl wait only)
- ClusterRoleBinding
- Secret gensim-secrets (DD_API_KEY, DD_APP_KEY)
- ConfigMap gensim-episode (play-episode.sh + <scenario>.yaml)
  Two separate volume mounts map them to /episode/ and /episode/episodes/
  so play-episode.sh finds the scenario YAML at its expected path.
- Job gensim-runner running alpine/k8s:1.31.0 (has kubectl/bash/curl/jq).
  Annotated pulumi.com/skipAwait=true so Pulumi returns immediately
  rather than waiting 30-60 min for the episode to finish.

Python: --scenario flag added to create_gensim_eks; post-deploy prints
kubectl logs -f command for monitoring.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…kend bug properly

AssetOrArchiveArray (used by WithHelmValues/WithHelmValuesFile) deserialises
as []interface{} in local Pulumi state on resource update, causing the Helm
provider to fail with "unsupported type for 'valueYamlFiles' arg: []interface{}".

Fix: add ExtraValues pulumi.Map to HelmInstallationArgs, merged into the main
values map before ToYAMLPulumiAssetOutput(). Map values flow through the
computed output path which survives local-state round-trips correctly on both
create and update.

Changes:
- kubernetes_helm.go: ExtraValues field; merge loop before ToYAMLPulumiAssetOutput
- kubernetes_agent.go: pass params.ExtraHelmValues through
- kubernetesagentparams/params.go: ExtraHelmValues field + WithExtraHelmValues option
- gensim-eks/run.go: use WithExtraHelmValues for kubelet.tlsVerify + clusterName

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…luesYAML

The previous code serialised ALL agent values (including framework defaults)
through ValuesYAML/AssetOrArchiveArray via ToYAMLPulumiAssetOutput(). This
caused the same local-backend deserialisation bug as WithHelmValues:
  "unsupported type for 'valueYamlFiles' arg: []interface {}"

Fix: pass the HelmValues map (after ExtraValues merge) directly as the
Values pulumi.MapInput instead of converting to YAML and going through
ValueYamlFiles. pulumi.Map values serialise as JSON in local state and
survive update round-trips correctly.

User-provided ValuesYAML (WithHelmValues/WithHelmValuesFile) still use the
AssetOrArchiveArray path for now — prefer WithExtraHelmValues instead.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…olume

play-episode.sh creates a results/ directory relative to its script path.
ConfigMap volumes are read-only so mkdir fails. Set RESULTS_DIR=/tmp/results
so the script writes results to a writable location instead.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
play-episode.sh creates results/ relative to its script path (hardcoded,
not overridable via env var). ConfigMap volumes are read-only so mkdir
fails. Add an emptyDir volume mounted at /episode/results to give the
script a writable location without modifying the upstream script.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Cluster-checks runners are not needed for a single-node test cluster.
Their readiness probe blocks the Helm timeout when the forwarder hasn't
fully initialised — particularly noticeable on first deploy where the
pods predate the real API key being available.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
play-episode.sh hardcodes RESULTS_DIR="${SCRIPT_DIR}/results". Since the
script is mounted from a read-only ConfigMap volume at /episode/, mkdir
fails. Add an emptyDir volume and mount at /episode/results to provide a
writable location without modifying the upstream script.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
… scale

kubectl scale issues a PATCH on deployments/scale, not update. Without
patch in the ClusterRole verbs, the scale command fails silently (|| true
suppresses the error) and surge generators never spin up during disruption.

Validated: monitor transitioned OK→Alert after surge scaled to 5 replicas,
then Alert→OK after scale-down during cooldown.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…r agent

After play-episode.sh completes, the Job now:
1. kubectl cp observer parquet from the agent pod (/tmp/observer-parquet)
   into /episode/results/parquet/ — requires pods/exec added to ClusterRole
2. aws s3 cp /episode/results/ to s3://<bucket>/gensim-results-<episode>-<date>/
   — requires s3:PutObject on the EKS Linux node role, attached by Pulumi

Observer-recorder agent support:
- additionalConfig injects observer config directly into datadog.yaml via
  the Helm chart's additionalConfig field
- --full-image-path flag allows specifying the observer-recorder image
  (e.g. docker.io/datadog/agent-dev:q-branch-observer-full)

New invoke flags:
  --s3-bucket   bucket to upload results (optional; skipped if unset)
  --full-image-path   custom agent image path

Example:
  inv aws.eks.gensim.create \
    --episode=authcore-pgbouncer-connection-pool-saturation \
    --scenario=pool-saturation \
    --s3-bucket=qbranch-gensim-recordings \
    --full-image-path=docker.io/datadog/agent-dev:q-branch-observer-full

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…true explicitly

buildLinuxHelmValues builds an agents map without setting agents.enabled.
When this map is passed via Values (pulumi.Map — equivalent to --set style),
Helm's merge semantics can omit the agents.enabled key from the effective
values, causing the node agent DaemonSet template condition to evaluate to
false and the DaemonSet to be skipped entirely.

clusterAgent already sets "enabled": pulumi.Bool(true) explicitly.
agents must do the same for the DaemonSet to be reliably created.

Root cause identified via: Helm manifest had 0 DaemonSets despite
chart defaults having agents.enabled=True, because the Values map
passed from buildLinuxHelmValues had agents.{image,containers,...}
but no agents.enabled key.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The Job command was still the M4 form (direct play-episode.sh call).
Apply the M5 bash -c script that chains: episode run → parquet collection
via kubectl cp → flat S3 upload via aws s3 cp --recursive.

The S3 upload destination follows Maxime's naming:
  s3://<bucket>/gensim-results-<episode>-<YYYYMMDD>/

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Amazon Linux ECS ships Python 3.7 without the yaml module. Replace the
python3 yaml parsing of docker-compose.yaml image names with grep+awk
which is always available. Also fixes backtick in Go raw string literal.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…tick

Three bugs found during clean end-to-end validation run (M5):

1. Backtick in raw string comment terminated the Go string literal,
   causing compile error. Removed backticks from comment text.

2. date format string had `%%%%Y%%%%m%%%%d` -> produced `%%Y%%m%%d` in
   the shell script -> `date -u` received `%%Y%%m%%d` as format and
   expanded it to the literal string `%Y%m%d` instead of YYYYMMDD.
   Fix: `%%Y%%m%%d` in pulumi.Sprintf -> `%Y%m%d` in the script.

3. Parquet kubectl cp used `-l app.kubernetes.io/component=agent` but
   the DaemonSet pods have label `app=datadog-agent`. Selector returned
   empty AGENT_POD, silently skipping parquet collection.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
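The double-escaping chain in bug 2 can be reproduced with Python's %-formatting, which collapses `%%` to `%` per pass just like Go's Sprintf (an analogy for illustration, not the actual Go code):

```python
import time

# Each formatting pass halves the percent signs. Starting from the buggy
# quadruple escape, one Sprintf-style pass still leaves %%Y%%m%%d in the
# shell script, and `date` then prints the literal string %Y%m%d.
buggy = "%%%%Y%%%%m%%%%d" % ()
assert buggy == "%%Y%%m%%d"

# The fix: double escape, so one pass yields a real strftime format.
fixed = "%%Y%%m%%d" % ()
assert fixed == "%Y%m%d"
assert len(time.strftime(fixed)) == 8  # YYYYMMDD
```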
Previously `|| echo "Warning: ..."` swallowed the exit code and the
error was easy to miss. Now:
- If no agent pod is found, prints to stderr with ERROR prefix
- If kubectl cp fails, prints to stderr with ERROR prefix + hint
- On success, prints file count
Job still exits 0 in all cases (parquet is best-effort).

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Two root causes prevented the dda-linux DaemonSet from being created:

1. ExtraValues shallow merge replaced the entire `datadog` map from
   buildLinuxHelmValues, losing `apiKeyExistingSecret`. The chart's
   DaemonSet template requires either apiKey or apiKeyExistingSecret;
   with neither set, it skips the DaemonSet entirely.

   Fix: add deepMergeHelmValues() that recursively merges pulumi.Map
   values instead of replacing top-level keys. This preserves all
   framework defaults while allowing scenarios to override nested keys
   like datadog.kubelet.tlsVerify or agents.customAgentConfig.

2. The episode chart's built-in datadog-agent DaemonSet occupied the
   cluster before dda-linux deployed, creating confusion about which
   agent was running. The stub used gcr.io/datadoghq/agent:7 (stock)
   instead of the observer-recorder image.

   Fix: extend the Helm post-renderer to strip DaemonSet, ServiceAccount,
   ClusterRole, and ClusterRoleBinding resources named `datadog-agent`
   from the episode chart output. The framework's dda-linux release now
   deploys the sole node agent DaemonSet.

Also: update Job label selector from app=datadog-agent to
app=dda-linux-datadog to match the framework's DaemonSet naming.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
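The shallow-merge bug above is worth a sketch. The actual fix is Go (`deepMergeHelmValues` over `pulumi.Map`); this Python version shows why a shallow merge silently drops `apiKeyExistingSecret`:

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base; nested dicts merge key by key,
    anything else in override replaces the base value."""
    out = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], value)
        else:
            out[key] = value
    return out

# A shallow merge replaces the whole "datadog" map, dropping
# apiKeyExistingSecret; the deep merge keeps it while overriding
# only the nested kubelet.tlsVerify key.
defaults = {"datadog": {"apiKeyExistingSecret": "dd-secret",
                        "kubelet": {"tlsVerify": True}}}
scenario = {"datadog": {"kubelet": {"tlsVerify": False}}}
merged = deep_merge(defaults, scenario)
assert merged["datadog"]["apiKeyExistingSecret"] == "dd-secret"
assert merged["datadog"]["kubelet"]["tlsVerify"] is False
```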
…pace

The q-branch-observer image renamed config keys from flat
observer.capture_metrics.enabled / observer.parquet_* to nested
observer.recording.enabled / observer.recording.parquet_*.
The old keys were flagged as unknown and recording never started.

Validated end-to-end from clean stack: 230 parquet files (474 MiB)
collected and uploaded to S3.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… tasks

Replace the single-episode Pulumi-managed runner with a multi-episode
orchestrator architecture:

Persistent layer (Pulumi): EKS cluster, RBAC (cluster-admin SA),
gensim-secrets Secret, S3 IAM policy. Stays up between runs.

Orchestrator Job (per-submission): Loops through episode:scenario pairs
serially. For each: helm install agent with observer-recorder config,
helm install episode chart with post-renderer, run play-episode.sh,
collect parquet via kubectl cp, upload to S3, emit DD events + metrics,
teardown, next episode. Updates gensim-run-status ConfigMap at each
phase transition.

Invoke tasks:
- submit: validates episodes, captures gensim SHA, enforces clean
  checkout, guards against busy cluster, deploys via Pulumi
- status: reads ConfigMap, renders per-episode progress table
- destroy: unchanged

Removes dead code: buildAndPushImages, hashDir, writePatchScript,
buildRunnerScript (all superseded by orchestrator).

Fixes specs: removes sopell review comments, temporal qualifiers,
misplaced blocked-by notes. Updates executive status table.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Design.md ASCII diagram: use full inv aws.eks.gensim.* task names
- Orchestrator script: capture play-episode.sh exit code into
  EP_OUTCOME (success/failure) and pass to emit_dd_event

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
scottopell and others added 8 commits March 12, 2026 16:06
The build VM was accidentally removed during the orchestrator
restructure. Episodes with docker-compose.yaml need their service
images built on EC2 (x86_64) and pushed to ECR before the orchestrator
can helm-install them.

Split into provisionBuildVM (one VM) + buildEpisodeImages (per-episode
with unique resource names) to support multi-episode submissions where
multiple episodes have custom service images.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Secret keys: use api-key/app-key (Helm chart convention) not DD_API_KEY
- Helm: use upgrade --install (idempotent, handles stale releases)
- Helm: remove --skip-tests (invalid flag), use --skip-crds
- Cluster agent: disable (observer image doesn't include it)
- Post-renderer: also strip Service named datadog-agent
- DD_ENV: export before play-episode.sh (required by episode scripts)
- ConfigMap names: sanitize to lowercase+hyphens (K8s RFC 1123)
- Status task: drop aws_wrapper from kubectl (kubeconfig has auth)
- Kubeconfig lookup: glob for full stack name prefix
- Build VM: write docker-compose.yaml via inline command (fix collisions)
- Bash: fix ${3:-{}} default (extra brace from bash parsing)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
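The ConfigMap name sanitization mentioned above (lowercase + hyphens per Kubernetes RFC 1123 labels) might look like this; the helper name is hypothetical and the real implementation lives in the orchestrator:

```python
import re

def sanitize_k8s_name(name: str, max_len: int = 63) -> str:
    """Coerce a string into an RFC 1123 label: lowercase [a-z0-9-],
    alphanumeric at both ends, at most 63 characters."""
    s = re.sub(r"[^a-z0-9-]+", "-", name.lower())
    return s[:max_len].strip("-")
```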
Replace minimal agent Helm values with full observability config: APM
(with admission controller injection), logs (containerCollectAll),
process agent, DogStatsD (hostPort), cluster agent with admission
controller (mutateUnlabelled), and 3Gi agent memory limits.

Replace helm install + post-renderer with two-pass deployment:
helm template -> extract_check_configs (yq) -> awk_filter -> kubectl
apply. This extracts per-episode check configs (redis, postgres, etc.)
from the episode chart and passes them to the agent via datadog.confd.

Key changes:
- fullnameOverride: datadog-agent (episode pods hardcode this DNS name)
- Cluster agent uses chart-default image (not custom observer image)
- extract_check_configs() uses yq to pull datadog-checks ConfigMap
- awk_filter() strips DaemonSet/Deployment/Service/SA/ClusterRole/
  ClusterRoleBinding/ConfigMap named datadog-agent or datadog-checks
- Episode teardown via kubectl delete -f (not helm uninstall)
- Parquet collection label updated to app=datadog-agent
- Helm release name changed to datadog-agent (was dda-linux)
- S3 path includes date in gensim SHA segment for readability
- Removed post-renderer ConfigMap, volume, and volume mount

Validated end-to-end: 3 episodes, 1218 parquet files, APM traces +
logs + DogStatsD + redis/postgres check metrics all flowing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Revert changes to kubernetes_helm.go, kubernetes_agent.go, and
params.go that added ExtraHelmValues, deepMergeHelmValues, and
WithHelmValuesFile. These were from an earlier Go-level approach
that was replaced by the bash-based helm calls in the orchestrator
Job. The gensim-eks scenario never imports these components.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Revert the full Fargate refactor (pre-created pod execution role,
restructured clusterArgs, RoleMappings changes) from this branch.
The only change to cluster.go is wrapping the existing Fargate block
in `if !params.DisableFargate`. The full refactor lives on branch
sopell/eks-fargate-refactor for separate review.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ew comments

Move the agent Helm values from an inline bash heredoc (with sed
placeholder substitution) to a standalone agent-values.yaml.tmpl
file. The template is embedded via go:embed, rendered at Pulumi
plan time with the actual image repo/tag, and mounted as a ConfigMap.
Template errors are now caught at deploy time, not at Job runtime.

Also addresses inline review comments:
- Trim verbose WithoutFargate doc comment
- Add TODO for buildEpisodeImages removal
- Clarify hashDir's purpose (Pulumi Trigger, not Docker cache)
- Remove stale cluster comment block

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
These were planning artifacts used during development, not
intended for the final PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pisodes

Replace helm template/awk filter/kubectl apply with single-pass helm
install --set agent.enabled=false. Remove extract_check_configs and
awk_filter functions (autodiscovery annotations handle check configs
since gensim-episodes PR #33, agent.enabled gating since PR #31).

Update _get_gensim_repo_path() to return the repo root instead of
hardcoding /postmortems, add _find_episode_dir() to search both
postmortems/ and synthetics/ subdirectories.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@scottopell scottopell force-pushed the sopell/gensim-eks-runner branch from d6f6d6e to c8fbae5 Compare March 12, 2026 20:14
@scottopell scottopell added changelog/no-changelog No changelog entry needed qa/no-code-change No code change in Agent code requiring validation team/agent-devx labels Mar 12, 2026
Move top-level `from tasks.e2e_framework.config import ...` to lazy
imports inside wrapper functions. The config module unconditionally
imports pydantic, which isn't installed in standard CI runners -- only
in e2e-specific environments. Every other e2e_framework file uses lazy
imports for this reason.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the long review PR is complex, plan time to review it label Mar 12, 2026
@scottopell scottopell marked this pull request as ready for review March 12, 2026 20:30
@scottopell scottopell requested a review from a team as a code owner March 12, 2026 20:30
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dd4d8f599b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@agent-platform-auto-pr
Contributor

agent-platform-auto-pr bot commented Mar 12, 2026

Files inventory check summary

File checks results against ancestor 0b457eee:

Results for datadog-agent_7.78.0~devel.git.723.16412a7.pipeline.103341240-1_amd64.deb:

No change detected

… ref

P1: Include docker-compose.yaml content in the services hash used as a
Pulumi trigger. Previously only the services/ directory was hashed, so
changes to docker-compose.yaml (adding/removing images, changing build
contexts) would not trigger a rebuild.

P2: Validate that the agent image reference contains a colon before
splitting into repo:tag. Returns a clear error instead of panicking
with a slice-bounds error.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
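The P2 validation amounts to something like the following sketch. The actual change is in Go in run.go; this Python version is illustrative only, and the function name is hypothetical.

```python
def split_image_ref(ref: str) -> tuple[str, str]:
    """Split an agent image reference into (repo, tag), returning a clear
    error instead of an opaque slice-bounds failure when no tag is present."""
    repo, sep, tag = ref.rpartition(":")
    # A "/" after the last colon means that colon was a registry port,
    # not a tag separator (e.g. "registry:5000/agent" has no tag).
    if not sep or "/" in tag:
        raise ValueError(f"image reference {ref!r} must be of the form repo:tag")
    return repo, tag
```

Validating up front turns a panic deep in the deploy path into an immediate, actionable error at submit time.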
@agent-platform-auto-pr
Contributor

agent-platform-auto-pr bot commented Mar 12, 2026

Static quality checks

✅ Please find below the results from static quality gates
Comparison made with ancestor 0b457ee
📊 Static Quality Gates Dashboard
🔗 SQG Job

30 successful checks with minimal change (< 2 KiB)
| Quality gate | Current size |
|---|---|
| agent_deb_amd64 | 748.919 MiB |
| agent_deb_amd64_fips | 707.285 MiB |
| agent_heroku_amd64 | 312.824 MiB |
| agent_rpm_amd64 | 748.903 MiB |
| agent_rpm_amd64_fips | 707.268 MiB |
| agent_rpm_arm64 | 727.411 MiB |
| agent_rpm_arm64_fips | 688.727 MiB |
| agent_suse_amd64 | 748.903 MiB |
| agent_suse_amd64_fips | 707.268 MiB |
| agent_suse_arm64 | 727.411 MiB |
| agent_suse_arm64_fips | 688.727 MiB |
| docker_agent_amd64 | 809.250 MiB |
| docker_agent_arm64 | 812.559 MiB |
| docker_agent_jmx_amd64 | 1000.165 MiB |
| docker_agent_jmx_arm64 | 992.253 MiB |
| docker_cluster_agent_amd64 | 203.688 MiB |
| docker_cluster_agent_arm64 | 218.167 MiB |
| docker_cws_instrumentation_amd64 | 7.142 MiB |
| docker_cws_instrumentation_arm64 | 6.689 MiB |
| docker_dogstatsd_amd64 | 39.207 MiB |
| docker_dogstatsd_arm64 | 37.445 MiB |
| dogstatsd_deb_amd64 | 29.847 MiB |
| dogstatsd_deb_arm64 | 28.003 MiB |
| dogstatsd_rpm_amd64 | 29.847 MiB |
| dogstatsd_suse_amd64 | 29.847 MiB |
| iot_agent_deb_amd64 | 43.065 MiB |
| iot_agent_deb_arm64 | 40.124 MiB |
| iot_agent_deb_armhf | 40.860 MiB |
| iot_agent_rpm_amd64 | 43.066 MiB |
| iot_agent_suse_amd64 | 43.066 MiB |
On-wire sizes (compressed)
| Quality gate | Change | Size (prev → curr → max, MiB; neutral rows show curr → max) |
|---|---|---|
| agent_deb_amd64 | -4.41 KiB (0.00% reduction) | 174.191 → 174.186 → 177.700 |
| agent_deb_amd64_fips | -21.94 KiB (0.01% reduction) | 165.191 → 165.170 → 172.230 |
| agent_heroku_amd64 | neutral | 74.896 → 79.970 |
| agent_rpm_amd64 | +35.9 KiB (0.02% increase) | 177.138 → 177.173 → 180.780 |
| agent_rpm_amd64_fips | -15.66 KiB (0.01% reduction) | 167.197 → 167.181 → 173.370 |
| agent_rpm_arm64 | +41.76 KiB (0.03% increase) | 159.608 → 159.648 → 161.610 |
| agent_rpm_arm64_fips | -2.48 KiB (0.00% reduction) | 151.234 → 151.231 → 155.910 |
| agent_suse_amd64 | +35.9 KiB (0.02% increase) | 177.138 → 177.173 → 180.780 |
| agent_suse_amd64_fips | -15.66 KiB (0.01% reduction) | 167.197 → 167.181 → 173.370 |
| agent_suse_arm64 | +41.76 KiB (0.03% increase) | 159.608 → 159.648 → 161.610 |
| agent_suse_arm64_fips | -2.48 KiB (0.00% reduction) | 151.234 → 151.231 → 155.910 |
| docker_agent_amd64 | neutral | 267.206 → 271.240 |
| docker_agent_arm64 | +4.08 KiB (0.00% increase) | 254.476 → 254.480 → 259.800 |
| docker_agent_jmx_amd64 | neutral | 335.848 → 339.870 |
| docker_agent_jmx_arm64 | neutral | 319.126 → 324.390 |
| docker_cluster_agent_amd64 | neutral | 71.289 → 72.920 |
| docker_cluster_agent_arm64 | neutral | 66.926 → 68.220 |
| docker_cws_instrumentation_amd64 | neutral | 2.999 → 3.330 |
| docker_cws_instrumentation_arm64 | neutral | 2.729 → 3.090 |
| docker_dogstatsd_amd64 | neutral | 15.155 → 15.820 |
| docker_dogstatsd_arm64 | neutral | 14.478 → 14.830 |
| dogstatsd_deb_amd64 | neutral | 7.886 → 8.790 |
| dogstatsd_deb_arm64 | neutral | 6.772 → 7.710 |
| dogstatsd_rpm_amd64 | neutral | 7.897 → 8.800 |
| dogstatsd_suse_amd64 | neutral | 7.897 → 8.800 |
| iot_agent_deb_amd64 | neutral | 11.348 → 12.040 |
| iot_agent_deb_arm64 | neutral | 9.664 → 10.450 |
| iot_agent_deb_armhf | neutral | 9.897 → 10.620 |
| iot_agent_rpm_amd64 | -3.48 KiB (0.03% reduction) | 11.369 → 11.366 → 12.060 |
| iot_agent_suse_amd64 | -3.48 KiB (0.03% reduction) | 11.369 → 11.366 → 12.060 |

@cit-pr-commenter-54b7da

cit-pr-commenter-54b7da bot commented Mar 12, 2026

Regression Detector

Regression Detector Results

Metrics dashboard
Target profiles
Run ID: 2403b26f-8c95-408d-a2da-313a1ce50dac

Baseline: 0b457ee
Comparison: 16412a7
Diff

Optimization Goals: ✅ No significant changes detected

Experiments ignored for regressions

Regressions in experiments with settings containing erratic: true are ignored.

| Experiment | Goal | Δ mean % | Δ mean % CI | Trials | Links |
|---|---|---|---|---|---|
| docker_containers_cpu | % cpu utilization | +3.27 | [+0.23, +6.31] | 1 | Logs |

Fine details of change detection per experiment

| Experiment | Goal | Δ mean % | Δ mean % CI | Trials | Links |
|---|---|---|---|---|---|
| docker_containers_cpu | % cpu utilization | +3.27 | [+0.23, +6.31] | 1 | Logs |
| otlp_ingest_logs | memory utilization | +0.73 | [+0.62, +0.84] | 1 | Logs |
| ddot_metrics_sum_cumulative | memory utilization | +0.36 | [+0.22, +0.50] | 1 | Logs |
| quality_gate_metrics_logs | memory utilization | +0.26 | [+0.02, +0.50] | 1 | Logs, bounds checks dashboard |
| docker_containers_memory | memory utilization | +0.24 | [+0.17, +0.31] | 1 | Logs |
| file_to_blackhole_0ms_latency | egress throughput | +0.07 | [-0.47, +0.60] | 1 | Logs |
| quality_gate_idle | memory utilization | +0.06 | [+0.02, +0.11] | 1 | Logs, bounds checks dashboard |
| quality_gate_idle_all_features | memory utilization | +0.05 | [+0.01, +0.09] | 1 | Logs, bounds checks dashboard |
| tcp_dd_logs_filter_exclude | ingress throughput | +0.01 | [-0.10, +0.11] | 1 | Logs |
| uds_dogstatsd_to_api_v3 | ingress throughput | +0.00 | [-0.20, +0.20] | 1 | Logs |
| uds_dogstatsd_to_api | ingress throughput | -0.00 | [-0.21, +0.20] | 1 | Logs |
| file_to_blackhole_500ms_latency | egress throughput | -0.01 | [-0.40, +0.39] | 1 | Logs |
| file_to_blackhole_1000ms_latency | egress throughput | -0.01 | [-0.43, +0.42] | 1 | Logs |
| otlp_ingest_metrics | memory utilization | -0.01 | [-0.17, +0.14] | 1 | Logs |
| file_to_blackhole_100ms_latency | egress throughput | -0.02 | [-0.10, +0.05] | 1 | Logs |
| ddot_metrics_sum_delta | memory utilization | -0.08 | [-0.25, +0.10] | 1 | Logs |
| ddot_logs | memory utilization | -0.08 | [-0.16, +0.00] | 1 | Logs |
| uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | -0.15 | [-0.21, -0.10] | 1 | Logs |
| file_tree | memory utilization | -0.16 | [-0.21, -0.10] | 1 | Logs |
| ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | -0.38 | [-0.60, -0.16] | 1 | Logs |
| tcp_syslog_to_blackhole | ingress throughput | -0.44 | [-0.58, -0.30] | 1 | Logs |
| ddot_metrics | memory utilization | -0.50 | [-0.67, -0.33] | 1 | Logs |
| quality_gate_logs | % cpu utilization | -0.66 | [-2.24, +0.92] | 1 | Logs, bounds checks dashboard |

Bounds Checks: ✅ Passed

| Experiment | Bounds check | Replicates passed | Observed value | Links |
|---|---|---|---|---|
| docker_containers_cpu | simple_check_run | 10/10 | 709 ≥ 26 | |
| docker_containers_memory | memory_usage | 10/10 | 273.68MiB ≤ 370MiB | |
| docker_containers_memory | simple_check_run | 10/10 | 695 ≥ 26 | |
| file_to_blackhole_0ms_latency | memory_usage | 10/10 | 0.19GiB ≤ 1.20GiB | |
| file_to_blackhole_0ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| file_to_blackhole_1000ms_latency | memory_usage | 10/10 | 0.23GiB ≤ 1.20GiB | |
| file_to_blackhole_1000ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| file_to_blackhole_100ms_latency | memory_usage | 10/10 | 0.20GiB ≤ 1.20GiB | |
| file_to_blackhole_100ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| file_to_blackhole_500ms_latency | memory_usage | 10/10 | 0.22GiB ≤ 1.20GiB | |
| file_to_blackhole_500ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| quality_gate_idle | intake_connections | 10/10 | 3 = 3 | bounds checks dashboard |
| quality_gate_idle | memory_usage | 10/10 | 173.66MiB ≤ 175MiB | bounds checks dashboard |
| quality_gate_idle_all_features | intake_connections | 10/10 | 2 ≤ 3 | bounds checks dashboard |
| quality_gate_idle_all_features | memory_usage | 10/10 | 486.85MiB ≤ 550MiB | bounds checks dashboard |
| quality_gate_logs | intake_connections | 10/10 | 3 ≤ 6 | bounds checks dashboard |
| quality_gate_logs | memory_usage | 10/10 | 203.47MiB ≤ 220MiB | bounds checks dashboard |
| quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| quality_gate_metrics_logs | cpu_usage | 10/10 | 345.91 ≤ 2000 | bounds checks dashboard |
| quality_gate_metrics_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| quality_gate_metrics_logs | memory_usage | 10/10 | 418.82MiB ≤ 475MiB | bounds checks dashboard |
| quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".
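Taken together, the three criteria reduce to a simple predicate. The sketch below is illustrative only, not the detector's actual code:

```python
def is_regression(delta_mean_pct: float, ci: tuple[float, float],
                  erratic: bool = False, tolerance: float = 5.0) -> bool:
    """Flag a change only when the effect is large enough, the confidence
    interval excludes zero, and the experiment is not marked erratic."""
    lo, hi = ci
    big_enough = abs(delta_mean_pct) >= tolerance
    ci_excludes_zero = lo > 0 or hi < 0
    return big_enough and ci_excludes_zero and not erratic

# docker_containers_cpu above: +3.27 with CI [+0.23, +6.31]. The CI
# excludes zero, but the effect is under the 5% tolerance, so no flag.
```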

CI Pass/Fail Decision

Passed. All Quality Gates passed.

  • quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.

scottopell and others added 3 commits March 13, 2026 10:56
…bility

Add --mode flag to gensim submit (record-parquet or live-anomaly-detection).
In live mode, the observer runs with analysis + event_reporter enabled and
parquet collection/S3 upload are skipped.

Changes:
- agent-values.yaml.tmpl: templatize observer config based on mode,
  add pullPolicy: Always to prevent stale cached images on EKS nodes
- run.go: plumb mode through Pulumi config, orchestrator Job env,
  renderAgentValues, and buildOrchestratorScript. Conditional parquet
  collection based on GENSIM_MODE env var.
- gensim_eks.py: add --mode parameter with validation, pass as
  gensim:mode Pulumi config
- deploy.py: add --refresh and --non-interactive to pulumi up,
  disable pty. Fixes stale Pulumi state causing silent no-ops when
  cluster resources are deleted outside Pulumi.
- destroy.py: same --non-interactive and pty=False treatment so
  error output is captured instead of swallowed by the TUI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
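The mode gating can be pictured with the following sketch. The `GENSIM_MODE` env var and the two mode names come from this commit; the helper function itself is hypothetical:

```python
import os

VALID_MODES = ("record-parquet", "live-anomaly-detection")


def should_collect_parquet(environ=os.environ) -> bool:
    """record-parquet collects parquet files and uploads them to S3;
    live-anomaly-detection skips collection and emits anomaly events."""
    mode = environ.get("GENSIM_MODE", "record-parquet")
    if mode not in VALID_MODES:
        raise ValueError(f"unknown GENSIM_MODE {mode!r}; expected one of {VALID_MODES}")
    return mode == "record-parquet"
```

Validating the mode in one place (mirroring the --mode validation in gensim_eks.py) keeps the orchestrator from silently running with a misspelled mode and collecting nothing.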

Labels

- changelog/no-changelog: No changelog entry needed
- internal: Identify a non-fork PR
- long review: PR is complex, plan time to review it
- qa/no-code-change: No code change in Agent code requiring validation
- team/agent-devx
