Add gensim-eks scenario: end-to-end episode runner with full observability #47691
scottopell wants to merge 42 commits into main
Conversation
Adds a new e2e scenario (aws/gensim-eks) that provisions a real EKS cluster
for running gensim episodes, as an alternative to the existing Kind-on-EC2
approach. Key improvements over the Kind path:
- EC2 build VM builds episode service images natively (linux/amd64) and pushes
to ECR via instance IAM role — no local Docker or credential setup required
- No cross-platform issues (Apple Silicon vs x86_64 EKS nodes)
- play-episode.sh will run as a Kubernetes Job (M4) rather than a VM-side script
M1 (✓): EKS cluster with Linux node group, kubeconfig export
M2 (✓): EC2 build VM builds/pushes episode images to ECR, deploys episode
Helm chart with imagePullPolicy post-renderer (Never→IfNotPresent)
M3-M5: Datadog Agent, autonomous Job runner, S3 upload (upcoming)
New files:
- test/e2e-framework/scenarios/aws/gensim-eks/run.go
- tasks/e2e_framework/aws/gensim_eks.py
Invoke: dda inv aws.eks.gensim.create --episode=<name>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ConfigMap

pulumi-eks manages aws-auth with patchForce:true, giving Pulumi full ownership of the mapRoles list. The Fargate pod execution role was previously auto-created inside eks.NewCluster and never included in RoleMappings, so it was silently dropped from aws-auth on every update. This caused CoreDNS and other kube-system Fargate pods to fail scheduling with:

"Pod execution role is not found in auth config or does not have all required permissions for launching fargate pods"

Fix: pre-create the Fargate execution role (GetFargatePodExecutionRole in role.go) before calling eks.NewCluster, pass it as FargateProfileArgs.PodExecutionRoleArn, and include it in RoleMappings with the standard system:bootstrappers + system:nodes groups. Pre-creation is required because the role ARN must be known upfront: it cannot be read from the cluster after creation without a circular Pulumi dependency.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ld VM

Two boot-time optimizations for the gensim-eks build VM:

1. Pre-baked Amazon Linux ECS AMI. Switch the build VM from default Ubuntu to Amazon Linux ECS, which ships with Docker pre-installed and the daemon already running. Eliminates the apt-get update + install docker.io/awscli step (~3-5 min per fresh cluster). Only docker-compose needs to be installed via pip3.

2. ECR image caching by content hash. Images are now pushed with two tags: :latest (for the Helm chart) and :<hash>, where hash is the first 12 chars of the SHA256 of the services/ directory. On each build run, the script checks ECR for the hash tag before starting docker-compose build. If all images are present at that tag, the build is skipped and images are pulled + retagged as :latest (~5-10 min saved on re-runs with unchanged source). This is most valuable when destroying and recreating a stack without changing episode code, a common pattern during cluster debugging.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…aner The ECR image cache (hash-tagged images) is wiped by the e2e account's weekly infra-cleaner job, limiting cache hits to within the same week. To make it durable, ECR repos need a protection tag or cleaner exclusion. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…LI install

The ECS AMI ships Docker 25 (including buildx) but not AWS CLI or docker-compose. The previous approach tried pip3 install docker-compose, which failed due to Python 3.7 + OpenSSL 1.0.2k incompatibility with urllib3 v2.0.

Fix:
- Install only awscli via yum (~30s, the only missing tool)
- Replace docker-compose build with docker buildx bake, which understands docker-compose.yaml natively, requires no separate install, and builds images in parallel
- Replace docker-compose config --images with Python yaml parsing
- Restore explicit docker login (no ECR credential helper on this AMI; --password-stdin works without a TTY, as Pulumi uses for remote commands)

Net result: the setup step goes from apt-get docker.io + docker-compose + awscli (~4 min) to yum install awscli (~30s). Docker and buildx are pre-installed.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Fargate nodes carry the eks.amazonaws.com/compute-type=fargate:NoSchedule taint. Any DaemonSet — including the Datadog agent — gets permanently-Pending pods on those nodes, blocking Helm readiness checks and adding operational complexity. The only benefit of Fargate is that CoreDNS starts before EC2 nodes join (a ~5 min provisioning optimisation that is irrelevant for long-running test scenarios).

WithoutFargate() skips both the Fargate profile and the pre-created execution role. CoreDNS schedules on the EC2 node group once nodes join. The Fargate execution role and aws-auth complexity are eliminated entirely.

gensim-eks now uses WithoutFargate() so the Datadog agent DaemonSet (added in M3) schedules cleanly on EC2 nodes without stuck-Pending pods.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Deploys the full DaemonSet-based Datadog agent after the episode chart, using the framework's helm.NewKubernetesAgent wrapper.

Changes:
- run.go: add M3 agent deployment block, gated on awsEnv.AgentDeploy(). Reads gensim:datadogValuesPath (sets clusterName + kubelet.tlsVerify:false, required on EKS since the kubelet uses a self-signed cert). Waits for the episode Helm release before deploying.
- gensim_eks.py: set install_agent=True when an episode is provided, which sets ddagent:deploy=True and injects ddagent:apiKey via the framework. Passes gensim:datadogValuesPath when datadog-values.yaml exists at the postmortems root. Adds _delete_stub_agent() post-pulumi step to remove the episode chart's built-in Deployment-based agent (which would produce duplicate metrics alongside the real DaemonSet).

Success: DD agent pod Running on EC2 node, metrics visible in DD under the episode's env tag, stub agent gone.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…uesFile
pulumi.NewStringAsset (used by WithHelmValues) serialises to []interface{}
in the local Pulumi state backend, causing the Helm provider to fail with
"unsupported type for 'valueYamlFiles' arg: []interface{}".
Fix: add WithHelmValuesFile(path) which uses pulumi.NewFileAsset instead.
File assets are read from disk at apply time and round-trip through the
local backend correctly. gensim-eks uses this for datadogValuesPath.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…tonomously

Creates the RBAC, credentials Secret, ConfigMap, and Job needed to run play-episode.sh inside the cluster without the developer's laptop staying open.

Resources created when gensim:scenario is set:
- ServiceAccount gensim-runner
- ClusterRole with: get/list/watch pods; get/list/update deployments/scale (play-episode.sh uses kubectl scale + kubectl wait only)
- ClusterRoleBinding
- Secret gensim-secrets (DD_API_KEY, DD_APP_KEY)
- ConfigMap gensim-episode (play-episode.sh + <scenario>.yaml). Two separate volume mounts map them to /episode/ and /episode/episodes/ so play-episode.sh finds the scenario YAML at its expected path.
- Job gensim-runner running alpine/k8s:1.31.0 (has kubectl/bash/curl/jq), annotated pulumi.com/skipAwait=true so Pulumi returns immediately rather than waiting 30-60 min for the episode to finish.

Python: --scenario flag added to create_gensim_eks; post-deploy prints the kubectl logs -f command for monitoring.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
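The skipAwait behaviour above comes from an annotation the Pulumi Kubernetes provider recognises. A hedged manifest fragment (names taken from the commit message; field layout is illustrative, not the exact generated manifest):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gensim-runner
  annotations:
    # Tell the Pulumi Kubernetes provider not to await Job completion,
    # so `pulumi up` returns immediately instead of blocking 30-60 min.
    pulumi.com/skipAwait: "true"
spec:
  template:
    spec:
      serviceAccountName: gensim-runner
      restartPolicy: Never
      containers:
        - name: runner
          image: alpine/k8s:1.31.0
```

Without the annotation, Pulumi's default await logic treats an unfinished Job as a deployment still in progress.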
…kend bug properly
AssetOrArchiveArray (used by WithHelmValues/WithHelmValuesFile) deserialises
as []interface{} in local Pulumi state on resource update, causing the Helm
provider to fail with "unsupported type for 'valueYamlFiles' arg: []interface{}".
Fix: add ExtraValues pulumi.Map to HelmInstallationArgs, merged into the main
values map before ToYAMLPulumiAssetOutput(). Map values flow through the
computed output path which survives local-state round-trips correctly on both
create and update.
Changes:
- kubernetes_helm.go: ExtraValues field; merge loop before ToYAMLPulumiAssetOutput
- kubernetes_agent.go: pass params.ExtraHelmValues through
- kubernetesagentparams/params.go: ExtraHelmValues field + WithExtraHelmValues option
- gensim-eks/run.go: use WithExtraHelmValues for kubelet.tlsVerify + clusterName
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
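The ExtraValues merge loop above is a top-level overlay. A minimal sketch with plain Go maps (the real code operates on pulumi.Map; `mergeValues` is a hypothetical name for illustration) — note that, as a later commit on this branch observes, this shallow behaviour replaces nested maps wholesale:

```go
package main

import "fmt"

// mergeValues overlays extra onto base at the top level only: each key in
// extra replaces the corresponding key in base wholesale. A nested override
// therefore drops sibling keys inside the replaced map.
func mergeValues(base, extra map[string]any) map[string]any {
	out := make(map[string]any, len(base)+len(extra))
	for k, v := range base {
		out[k] = v
	}
	for k, v := range extra {
		out[k] = v // top-level replacement, not a deep merge
	}
	return out
}

func main() {
	base := map[string]any{"datadog": map[string]any{"apiKeyExistingSecret": "dd-secret"}}
	extra := map[string]any{"datadog": map[string]any{"kubelet": map[string]any{"tlsVerify": false}}}
	merged := mergeValues(base, extra)
	// The whole "datadog" map was replaced, so apiKeyExistingSecret is gone.
	fmt.Println(merged["datadog"].(map[string]any)["apiKeyExistingSecret"]) // <nil>
}
```

This pitfall is exactly what motivates the recursive merge introduced later on the branch.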
…luesYAML
The previous code serialised ALL agent values (including framework defaults)
through ValuesYAML/AssetOrArchiveArray via ToYAMLPulumiAssetOutput(). This
caused the same local-backend deserialisation bug as WithHelmValues:
"unsupported type for 'valueYamlFiles' arg: []interface {}"
Fix: pass the HelmValues map (after ExtraValues merge) directly as the
Values pulumi.MapInput instead of converting to YAML and going through
ValueYamlFiles. pulumi.Map values serialise as JSON in local state and
survive update round-trips correctly.
User-provided ValuesYAML (WithHelmValues/WithHelmValuesFile) still use the
AssetOrArchiveArray path for now — prefer WithExtraHelmValues instead.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…olume play-episode.sh creates a results/ directory relative to its script path. ConfigMap volumes are read-only so mkdir fails. Set RESULTS_DIR=/tmp/results so the script writes results to a writable location instead. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
play-episode.sh creates results/ relative to its script path (hardcoded, not overridable via env var). ConfigMap volumes are read-only so mkdir fails. Add an emptyDir volume mounted at /episode/results to give the script a writable location without modifying the upstream script. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Cluster-checks runners are not needed for a single-node test cluster. Their readiness probe blocks the Helm timeout when the forwarder hasn't fully initialised — particularly noticeable on first deploy where the pods predate the real API key being available. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
… scale

kubectl scale issues a PATCH on deployments/scale, not an update. Without the patch verb in the ClusterRole, the scale command fails silently (|| true suppresses the error) and surge generators never spin up during disruption.

Validated: monitor transitioned OK→Alert after surge scaled to 5 replicas, then Alert→OK after scale-down during cooldown.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
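The verb set implied by the commit above can be expressed as a ClusterRole rule like the following. This is an illustrative fragment assembled from the commit messages, not the exact manifest from the scenario:

```yaml
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments/scale"]
    # kubectl scale sends a PATCH to the scale subresource, so "patch"
    # is required in addition to get/list/update.
    verbs: ["get", "list", "update", "patch"]
```

Subresources like `deployments/scale` are authorised independently of the parent resource, which is why granting verbs on `deployments` alone is not enough.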
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…r agent
After play-episode.sh completes, the Job now:
1. kubectl cp observer parquet from the agent pod (/tmp/observer-parquet)
into /episode/results/parquet/ — requires pods/exec added to ClusterRole
2. aws s3 cp /episode/results/ to s3://<bucket>/gensim-results-<episode>-<date>/
— requires s3:PutObject on the EKS Linux node role, attached by Pulumi
Observer-recorder agent support:
- additionalConfig injects observer config directly into datadog.yaml via
the Helm chart's additionalConfig field
- --full-image-path flag allows specifying the observer-recorder image
(e.g. docker.io/datadog/agent-dev:q-branch-observer-full)
New invoke flags:
--s3-bucket bucket to upload results (optional; skipped if unset)
--full-image-path custom agent image path
Example:
inv aws.eks.gensim.create \
--episode=authcore-pgbouncer-connection-pool-saturation \
--scenario=pool-saturation \
--s3-bucket=qbranch-gensim-recordings \
--full-image-path=docker.io/datadog/agent-dev:q-branch-observer-full
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…true explicitly
buildLinuxHelmValues builds an agents map without setting agents.enabled.
When this map is passed via Values (pulumi.Map — equivalent to --set style),
Helm's merge semantics can omit the agents.enabled key from the effective
values, causing the node agent DaemonSet template condition to evaluate to
false and the DaemonSet to be skipped entirely.
clusterAgent already sets "enabled": pulumi.Bool(true) explicitly.
agents must do the same for the DaemonSet to be reliably created.
Root cause identified via: Helm manifest had 0 DaemonSets despite
chart defaults having agents.enabled=True, because the Values map
passed from buildLinuxHelmValues had agents.{image,containers,...}
but no agents.enabled key.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The Job command was still the M4 form (direct play-episode.sh call). Apply the M5 bash -c script that chains: episode run → parquet collection via kubectl cp → flat S3 upload via aws s3 cp --recursive. The S3 upload destination follows Maxime's naming: s3://<bucket>/gensim-results-<episode>-<YYYYMMDD>/ Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Amazon Linux ECS ships Python 3.7 without the yaml module. Replace the python3 yaml parsing of docker-compose.yaml image names with grep+awk, which is always available. Also fixes a backtick in a Go raw string literal. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…tick

Three bugs found during the clean end-to-end validation run (M5):

1. A backtick in a raw string comment terminated the Go string literal, causing a compile error. Removed backticks from comment text.
2. The date format string had `%%%%Y%%%%m%%%%d`, which produced `%%Y%%m%%d` in the shell script, so `date -u` received `%%Y%%m%%d` as its format and expanded it to the literal string `%Y%m%d` instead of YYYYMMDD. Fix: `%%Y%%m%%d` in pulumi.Sprintf -> `%Y%m%d` in the script.
3. Parquet kubectl cp used `-l app.kubernetes.io/component=agent`, but the DaemonSet pods have label `app=datadog-agent`. The selector returned an empty AGENT_POD, silently skipping parquet collection.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Previously `|| echo "Warning: ..."` swallowed the exit code and the error was easy to miss. Now: - If no agent pod is found, prints to stderr with ERROR prefix - If kubectl cp fails, prints to stderr with ERROR prefix + hint - On success, prints file count Job still exits 0 in all cases (parquet is best-effort). Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Two root causes prevented the dda-linux DaemonSet from being created:

1. ExtraValues shallow merge replaced the entire `datadog` map from buildLinuxHelmValues, losing `apiKeyExistingSecret`. The chart's DaemonSet template requires either apiKey or apiKeyExistingSecret; with neither set, it skips the DaemonSet entirely. Fix: add deepMergeHelmValues(), which recursively merges pulumi.Map values instead of replacing top-level keys. This preserves all framework defaults while allowing scenarios to override nested keys like datadog.kubelet.tlsVerify or agents.customAgentConfig.

2. The episode chart's built-in datadog-agent DaemonSet occupied the cluster before dda-linux deployed, creating confusion about which agent was running. The stub used gcr.io/datadoghq/agent:7 (stock) instead of the observer-recorder image. Fix: extend the Helm post-renderer to strip DaemonSet, ServiceAccount, ClusterRole, and ClusterRoleBinding resources named `datadog-agent` from the episode chart output. The framework's dda-linux release now deploys the sole node agent DaemonSet.

Also: update the Job label selector from app=datadog-agent to app=dda-linux-datadog to match the framework's DaemonSet naming.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
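The deep merge described in fix 1 can be sketched as follows, using plain Go maps rather than pulumi.Map (the function name `deepMerge` and the map types are simplifications of the real deepMergeHelmValues):

```go
package main

import "fmt"

// deepMerge recursively overlays src onto dst: nested maps are merged
// key-by-key instead of being replaced wholesale, so framework defaults
// like datadog.apiKeyExistingSecret survive a scenario override of
// datadog.kubelet.tlsVerify.
func deepMerge(dst, src map[string]any) map[string]any {
	out := make(map[string]any, len(dst))
	for k, v := range dst {
		out[k] = v
	}
	for k, v := range src {
		if sv, ok := v.(map[string]any); ok {
			if dv, ok := out[k].(map[string]any); ok {
				out[k] = deepMerge(dv, sv)
				continue
			}
		}
		out[k] = v // non-map values (or type mismatches) still replace
	}
	return out
}

func main() {
	defaults := map[string]any{"datadog": map[string]any{"apiKeyExistingSecret": "dd-secret"}}
	override := map[string]any{"datadog": map[string]any{"kubelet": map[string]any{"tlsVerify": false}}}
	merged := deepMerge(defaults, override)
	dd := merged["datadog"].(map[string]any)
	fmt.Println(dd["apiKeyExistingSecret"], dd["kubelet"].(map[string]any)["tlsVerify"])
	// dd-secret false
}
```

With a shallow merge, the same inputs would lose `apiKeyExistingSecret` entirely, which is exactly the DaemonSet-skipping symptom described above.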
…pace The q-branch-observer image renamed config keys from flat observer.capture_metrics.enabled / observer.parquet_* to nested observer.recording.enabled / observer.recording.parquet_*. The old keys were flagged as unknown and recording never started. Validated end-to-end from clean stack: 230 parquet files (474 MiB) collected and uploaded to S3. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… tasks

Replace the single-episode Pulumi-managed runner with a multi-episode orchestrator architecture:

- Persistent layer (Pulumi): EKS cluster, RBAC (cluster-admin SA), gensim-secrets Secret, S3 IAM policy. Stays up between runs.
- Orchestrator Job (per-submission): loops through episode:scenario pairs serially. For each: helm install agent with observer-recorder config, helm install episode chart with post-renderer, run play-episode.sh, collect parquet via kubectl cp, upload to S3, emit DD events + metrics, teardown, next episode. Updates the gensim-run-status ConfigMap at each phase transition.

Invoke tasks:
- submit: validates episodes, captures the gensim SHA, enforces a clean checkout, guards against a busy cluster, deploys via Pulumi
- status: reads the ConfigMap, renders a per-episode progress table
- destroy: unchanged

Removes dead code: buildAndPushImages, hashDir, writePatchScript, buildRunnerScript (all superseded by the orchestrator). Fixes specs: removes sopell review comments, temporal qualifiers, misplaced blocked-by notes. Updates the executive status table.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Design.md ASCII diagram: use full inv aws.eks.gensim.* task names - Orchestrator script: capture play-episode.sh exit code into EP_OUTCOME (success/failure) and pass to emit_dd_event Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The build VM was accidentally removed during the orchestrator restructure. Episodes with docker-compose.yaml need their service images built on EC2 (x86_64) and pushed to ECR before the orchestrator can helm-install them. Split into provisionBuildVM (one VM) + buildEpisodeImages (per-episode with unique resource names) to support multi-episode submissions where multiple episodes have custom service images. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Secret keys: use api-key/app-key (Helm chart convention) not DD_API_KEY
- Helm: use upgrade --install (idempotent, handles stale releases)
- Helm: remove --skip-tests (invalid flag), use --skip-crds
- Cluster agent: disable (observer image doesn't include it)
- Post-renderer: also strip Service named datadog-agent
- DD_ENV: export before play-episode.sh (required by episode scripts)
- ConfigMap names: sanitize to lowercase+hyphens (K8s RFC 1123)
- Status task: drop aws_wrapper from kubectl (kubeconfig has auth)
- Kubeconfig lookup: glob for full stack name prefix
- Build VM: write docker-compose.yaml via inline command (fix collisions)
- Bash: fix ${3:-{}} default (extra brace from bash parsing)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
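The ConfigMap-name sanitization in the fix list above (lowercase + hyphens per K8s RFC 1123) can be sketched like this; the function name `sanitizeName` and the exact replacement strategy are assumptions for the example:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Characters outside the RFC 1123 label alphabet (lowercase letters,
// digits, hyphens) are collapsed into single hyphens.
var invalidChars = regexp.MustCompile(`[^a-z0-9-]+`)

// sanitizeName lowercases a string, replaces invalid runs with hyphens,
// trims leading/trailing hyphens (names must start and end with an
// alphanumeric), and truncates to the 63-char label limit.
func sanitizeName(s string) string {
	s = strings.ToLower(s)
	s = invalidChars.ReplaceAllString(s, "-")
	s = strings.Trim(s, "-")
	if len(s) > 63 {
		s = strings.Trim(s[:63], "-")
	}
	return s
}

func main() {
	fmt.Println(sanitizeName("MyEpisode_PoolSaturation:v2"))
	// myepisode-poolsaturation-v2
}
```

Without this step, episode names containing underscores, colons, or uppercase letters would be rejected by the API server when used as ConfigMap names.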
Replace minimal agent Helm values with a full observability config: APM (with admission controller injection), logs (containerCollectAll), process agent, DogStatsD (hostPort), cluster agent with admission controller (mutateUnlabelled), and 3Gi agent memory limits.

Replace helm install + post-renderer with a two-pass deployment: helm template -> extract_check_configs (yq) -> awk_filter -> kubectl apply. This extracts per-episode check configs (redis, postgres, etc.) from the episode chart and passes them to the agent via datadog.confd.

Key changes:
- fullnameOverride: datadog-agent (episode pods hardcode this DNS name)
- Cluster agent uses the chart-default image (not the custom observer image)
- extract_check_configs() uses yq to pull the datadog-checks ConfigMap
- awk_filter() strips DaemonSet/Deployment/Service/SA/ClusterRole/ClusterRoleBinding/ConfigMap named datadog-agent or datadog-checks
- Episode teardown via kubectl delete -f (not helm uninstall)
- Parquet collection label updated to app=datadog-agent
- Helm release name changed to datadog-agent (was dda-linux)
- S3 path includes the date in the gensim SHA segment for readability
- Removed post-renderer ConfigMap, volume, and volume mount

Validated end-to-end: 3 episodes, 1218 parquet files, APM traces + logs + DogStatsD + redis/postgres check metrics all flowing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Revert changes to kubernetes_helm.go, kubernetes_agent.go, and params.go that added ExtraHelmValues, deepMergeHelmValues, and WithHelmValuesFile. These were from an earlier Go-level approach that was replaced by the bash-based helm calls in the orchestrator Job. The gensim-eks scenario never imports these components. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Revert the full Fargate refactor (pre-created pod execution role, restructured clusterArgs, RoleMappings changes) from this branch. The only change to cluster.go is wrapping the existing Fargate block in `if !params.DisableFargate`. The full refactor lives on branch sopell/eks-fargate-refactor for separate review. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ew comments

Move the agent Helm values from an inline bash heredoc (with sed placeholder substitution) to a standalone agent-values.yaml.tmpl file. The template is embedded via go:embed, rendered at Pulumi plan time with the actual image repo/tag, and mounted as a ConfigMap. Template errors are now caught at deploy time, not at Job runtime.

Also addresses inline review comments:
- Trim the verbose WithoutFargate doc comment
- Add a TODO for buildEpisodeImages removal
- Clarify hashDir's purpose (Pulumi Trigger, not Docker cache)
- Remove a stale cluster comment block

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
These were planning artifacts used during development, not intended for the final PR. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pisodes Replace helm template/awk filter/kubectl apply with single-pass helm install --set agent.enabled=false. Remove extract_check_configs and awk_filter functions (autodiscovery annotations handle check configs since gensim-episodes PR #33, agent.enabled gating since PR #31). Update _get_gensim_repo_path() to return the repo root instead of hardcoding /postmortems, add _find_episode_dir() to search both postmortems/ and synthetics/ subdirectories. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move top-level `from tasks.e2e_framework.config import ...` to lazy imports inside wrapper functions. The config module unconditionally imports pydantic, which isn't installed in standard CI runners -- only in e2e-specific environments. Every other e2e_framework file uses lazy imports for this reason. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Files inventory check summary

File checks results against ancestor 0b457eee. Results for datadog-agent_7.78.0~devel.git.723.16412a7.pipeline.103341240-1_amd64.deb: No change detected
… ref

P1: Include docker-compose.yaml content in the services hash used as a Pulumi trigger. Previously only the services/ directory was hashed, so changes to docker-compose.yaml (adding/removing images, changing build contexts) would not trigger a rebuild.

P2: Validate that the agent image reference contains a colon before splitting into repo:tag. Returns a clear error instead of panicking with a slice-bounds error.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Static quality checks

✅ Please find below the results from static quality gates: 30 successful checks with minimal change (< 2 KiB)

On-wire sizes (compressed)
Regression Detector Results

Metrics dashboard. Baseline: 0b457ee

Optimization Goals: ✅ No significant changes detected
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | +3.27 | [+0.23, +6.31] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | +3.27 | [+0.23, +6.31] | 1 | Logs |
| ➖ | otlp_ingest_logs | memory utilization | +0.73 | [+0.62, +0.84] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | +0.36 | [+0.22, +0.50] | 1 | Logs |
| ➖ | quality_gate_metrics_logs | memory utilization | +0.26 | [+0.02, +0.50] | 1 | Logs bounds checks dashboard |
| ➖ | docker_containers_memory | memory utilization | +0.24 | [+0.17, +0.31] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | +0.07 | [-0.47, +0.60] | 1 | Logs |
| ➖ | quality_gate_idle | memory utilization | +0.06 | [+0.02, +0.11] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_idle_all_features | memory utilization | +0.05 | [+0.01, +0.09] | 1 | Logs bounds checks dashboard |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.01 | [-0.10, +0.11] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | +0.00 | [-0.20, +0.20] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | -0.00 | [-0.21, +0.20] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | -0.01 | [-0.40, +0.39] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.01 | [-0.43, +0.42] | 1 | Logs |
| ➖ | otlp_ingest_metrics | memory utilization | -0.01 | [-0.17, +0.14] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | -0.02 | [-0.10, +0.05] | 1 | Logs |
| ➖ | ddot_metrics_sum_delta | memory utilization | -0.08 | [-0.25, +0.10] | 1 | Logs |
| ➖ | ddot_logs | memory utilization | -0.08 | [-0.16, +0.00] | 1 | Logs |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | -0.15 | [-0.21, -0.10] | 1 | Logs |
| ➖ | file_tree | memory utilization | -0.16 | [-0.21, -0.10] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | -0.38 | [-0.60, -0.16] | 1 | Logs |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | -0.44 | [-0.58, -0.30] | 1 | Logs |
| ➖ | ddot_metrics | memory utilization | -0.50 | [-0.67, -0.33] | 1 | Logs |
| ➖ | quality_gate_logs | % cpu utilization | -0.66 | [-2.24, +0.92] | 1 | Logs bounds checks dashboard |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | 709 ≥ 26 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | 273.68MiB ≤ 370MiB | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | 695 ≥ 26 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | 0.19GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_0ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | 0.23GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_1000ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | 0.20GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_100ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | 0.22GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_500ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | 3 = 3 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | 173.66MiB ≤ 175MiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | 2 ≤ 3 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | 486.85MiB ≤ 550MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | 3 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | 203.47MiB ≤ 220MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | 345.91 ≤ 2000 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | 418.82MiB ≤ 475MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
…bility

Add a --mode flag to gensim submit (record-parquet or live-anomaly-detection). In live mode, the observer runs with analysis + event_reporter enabled and parquet collection/S3 upload are skipped.

Changes:
- agent-values.yaml.tmpl: templatize observer config based on mode; add pullPolicy: Always to prevent stale cached images on EKS nodes
- run.go: plumb mode through Pulumi config, orchestrator Job env, renderAgentValues, and buildOrchestratorScript. Conditional parquet collection based on the GENSIM_MODE env var.
- gensim_eks.py: add --mode parameter with validation, passed as gensim:mode Pulumi config
- deploy.py: add --refresh and --non-interactive to pulumi up, disable pty. Fixes stale Pulumi state causing silent no-ops when cluster resources are deleted outside Pulumi.
- destroy.py: same --non-interactive and pty=False treatment so error output is captured instead of swallowed by the TUI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
What does this PR do?
Adds a new e2e-framework scenario (`aws/gensim-eks`) that provisions a persistent EKS cluster and runs GenSim episodes end-to-end. Supports two modes:

- `record-parquet` (default): collects observer data to parquet files for offline analysis and uploads them to S3
- `live-anomaly-detection`: runs live edge anomaly detection with all correlators enabled and sends detected anomaly events to Datadog via the Events v2 API

For each episode, the scenario deploys application services, installs a fully configured Datadog agent, executes fault injection cycles, and collects results (parquet files or live anomaly events, depending on mode).
Motivation
GenSim episodes are self-contained incident simulations -- each ships a Helm chart with application services, a `play-episode.sh` fault injection script, and (in 261 of 267 episodes) a stub Datadog agent Deployment. To evaluate the observer against these episodes at scale, we need infrastructure that builds episode images natively for the linux/amd64 EKS nodes, pushes them to ECR, and runs each episode autonomously in the cluster.

Data flow
On your laptop (`dda inv aws.eks.gensim.submit --image=... --episodes=A:s1,B:s2 --mode=live-anomaly-detection`): Pulumi reads each episode's `play-episode.sh`, scenario YAML, and Helm chart from your local `GENSIM_REPO_PATH`, packages them into per-episode ConfigMaps, renders `agent-values.yaml.tmpl` with the target image and mode, and creates an orchestrator Job. Your laptop is done.
alpine/k8s:1.31.0):Observer mode configuration
The `--mode` flag controls what the agent's observer component does:

| Setting | record-parquet | live-anomaly-detection |
| --- | --- | --- |
| `observer.recording.enabled` | true | false |
| `observer.analysis.enabled` | false | true |
| `observer.event_reporter.sending_enabled` | false | true |

Key design decisions
- `fullnameOverride: datadog-agent` -- episode pods hardcode `datadog-agent:8125` and `datadog-agent:8126` as DNS names for DogStatsD and APM. This makes our Helm chart create Services with the names they expect.
- `helm install` with `--set agent.enabled=false` suppresses stub agent resources. A lightweight `sed` post-renderer patches `imagePullPolicy: Never` to `Always` for the few episodes that set it. Autodiscovery annotations on episode pods handle check configs -- no extraction step needed.
- `pullPolicy: Always` on the agent image -- prevents stale cached images on EKS nodes when the same tag is rebuilt with new code.
- `mutateUnlabelled: true` -- auto-injects `DD_AGENT_HOST` into all episode pods without requiring labels.
- Agent values template (`agent-values.yaml.tmpl`) -- embedded via `go:embed`, rendered at Pulumi plan time with the target image repo/tag and mode.
- `--refresh` on `pulumi up` -- reconciles Pulumi state with cluster reality before planning changes. Prevents silent no-ops when resources are deleted outside Pulumi (e.g. by the orchestrator's helm uninstall during teardown).
- `--non-interactive` on `pulumi up`/`destroy` -- disables the TUI progress display so error output is captured in logs instead of being overwritten by cursor repositioning.
- `renderAgentValues` validates the image contains `:` before splitting into repo:tag, returning a clear error instead of panicking.

Describe how you validated your changes
Record-parquet mode -- 3-episode end-to-end run:
Live-anomaly-detection mode -- 213_PagerDuty episode:
(`source:agent-q-branch-observer`)

Future checklist: hacks to retire
- awk filter stripping agent resources -- Retired. Episodes now gate agent resources behind `agent.enabled: false` (gensim-episodes PR #31).
- `extract_check_configs` with yq -- Retired. Episodes migrated to autodiscovery annotations (gensim-episodes PR #33).
- `helm template` + `kubectl apply` instead of `helm install` -- Retired. With `agent.enabled=false`, episodes install cleanly via `helm install`.
- `imagePullPolicy: Never -> Always` patching (awk filter) -- Retired as an awk filter. Replaced by a 3-line `sed` post-renderer on `helm install`.
- `docker buildx bake` + ECR push -- EC2 instance to build episode service images from source and push to ECR. ~150 lines of Pulumi + bash. Retires when gensim-episodes publishes pre-built images to a shared registry.
Evaluated and kept as-is:

- `fullnameOverride: datadog-agent` -- load-bearing DNS name that 227 episodes depend on.
- `mutateUnlabelled: true` -- correct for a single-tenant namespace.
- `kubelet.tlsVerify: false` -- standard EKS workaround, permanent.

Files
- test/e2e-framework/scenarios/aws/gensim-eks/run.go
- test/e2e-framework/scenarios/aws/gensim-eks/agent-values.yaml.tmpl
- tasks/e2e_framework/aws/gensim_eks.py -- `submit` (with `--mode`), `status`, `destroy`, `logs`
- tasks/e2e_framework/deploy.py -- `--refresh`, `--non-interactive`, `pty=False`
- tasks/e2e_framework/destroy.py -- `--non-interactive`, `pty=False`