Commit 4ae3450

Tests retries (#799)
adds a few retry mechanisms to the `imagetest_tests` resource:

- a "resource" scoped one, that'll just recreate the entire resource
- a "step" scoped one, that'll just retry the step itself

both of which are opt-in
1 parent d825654 commit 4ae3450

5 files changed, +504 -7 lines changed

docs/resources/tests.md

Lines changed: 26 additions & 0 deletions
@@ -26,6 +26,7 @@ description: |-
 - `labels` (Map of String) Metadata to attach to the tests resource. Used for filtering and grouping.
 - `name` (String) The name of the test. If one is not provided, a random name will be generated.
 - `repo` (String) The target repository the provider will use for pushing/pulling dynamically built images, overriding provider config.
+- `retry` (Attributes) On failure, tears down the driver completely, creates a fresh one, and re-runs all tests from scratch. This gives each attempt a clean driver, but external side effects from previous attempts are not rolled back: pushed images, written files, cloud resources created outside the driver (e.g. IAM roles, DNS records), and any other out-of-band mutations will still exist. All per-test retry blocks also reset — every test runs from its first attempt on each resource-level retry. (see [below for nested schema](#nestedatt--retry))
 - `skipped` (Boolean) Whether or not the tests were skipped. This is set to true if the tests were skipped, and false otherwise.
 - `tests` (Attributes List) An ordered list of test suites to run (see [below for nested schema](#nestedatt--tests))
 - `timeout` (String) The maximum amount of time to wait for all tests to complete. This includes the time it takes to start and destroy the driver.
@@ -231,6 +232,18 @@ Optional:
 
 
 
+<a id="nestedatt--retry"></a>
+### Nested Schema for `retry`
+
+Required:
+
+- `attempts` (Number) Total number of attempts including the initial run. Must be >= 1.
+
+Optional:
+
+- `delay` (String) Delay between retry attempts as a Go duration string (e.g. "5s", "1m"). Defaults to 5s.
+
+
 <a id="nestedatt--tests"></a>
 ### Nested Schema for `tests`
 
@@ -246,6 +259,7 @@ Optional:
 - `content` (Attributes List) The content to use for the test (see [below for nested schema](#nestedatt--tests--content))
 - `envs` (Map of String) Environment variables to set on the test container. These will overwrite the environment variables set in the image's config on conflicts.
 - `on_failure` (List of String) Commands to run in the sandbox on test failure for diagnostic collection. Each command runs independently (best-effort); failures do not prevent subsequent commands from executing.
+- `retry` (Attributes) Re-runs this individual test within the same driver instance. Each retry launches a fresh test sandbox container, but all driver-level state persists: for Kubernetes-based drivers (k3s_in_docker, EKS, AKS) this means the cluster, namespace, RBAC, secrets, and any objects created by previous attempts are still present. For EC2, the instance filesystem and Docker daemon state carry over. Tests must be idempotent — use create-or-update patterns, unique names, or explicit cleanup to avoid conflicts with leftover state from failed attempts. (see [below for nested schema](#nestedatt--tests--retry))
 - `timeout` (String) The maximum amount of time to wait for the individual test to complete. This is encompassed by the overall timeout of the parent tests resource.
 
 <a id="nestedatt--tests--artifact"></a>
@@ -267,3 +281,15 @@ Required:
 Optional:
 
 - `target` (String) The target path to use for the test
+
+
+<a id="nestedatt--tests--retry"></a>
+### Nested Schema for `tests.retry`
+
+Required:
+
+- `attempts` (Number) Total number of attempts including the initial run. Must be >= 1.
+
+Optional:
+
+- `delay` (String) Delay between retry attempts as a Go duration string (e.g. "5s", "1m"). Defaults to 5s.
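
Because a resource-level retry resets every per-test `retry` block, the two scopes multiply rather than add. The following is a rough sketch of the worst-case number of executions of a single, consistently failing test under the semantics described above. The helper name `maxRuns` is hypothetical and is not part of the provider.

```go
package main

import "fmt"

// maxRuns returns an upper bound on how many times one test can execute
// when both retry scopes are configured. Values below 1 are treated as 1,
// matching the "Must be >= 1" contract documented for `attempts`.
// NOTE: maxRuns is an illustrative helper, not provider code.
func maxRuns(resourceAttempts, testAttempts int) int {
	if resourceAttempts < 1 {
		resourceAttempts = 1
	}
	if testAttempts < 1 {
		testAttempts = 1
	}
	// Each resource-level attempt restarts the per-test counter from 1.
	return resourceAttempts * testAttempts
}

func main() {
	// retry = { attempts = 2 } on the resource, retry = { attempts = 3 } on the test.
	fmt.Println(maxRuns(2, 3)) // prints 6
}
```

This is worth keeping in mind when choosing `timeout`, since the resource-level timeout covers every attempt plus driver setup and teardown.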

internal/provider/tests_resource.go

Lines changed: 142 additions & 7 deletions
@@ -8,6 +8,7 @@ import (
 	"maps"
 	"net/url"
 	"os"
+	"strconv"
 	"strings"
 	"time"
 
@@ -18,6 +19,7 @@ import (
 	internallog "github.com/chainguard-dev/terraform-provider-imagetest/internal/log"
 	"github.com/chainguard-dev/terraform-provider-imagetest/internal/o11y"
 	"github.com/chainguard-dev/terraform-provider-imagetest/internal/provider/framework"
+	"github.com/chainguard-dev/terraform-provider-imagetest/internal/retry"
 	"github.com/chainguard-dev/terraform-provider-imagetest/internal/skip"
 	"github.com/google/go-containerregistry/pkg/name"
 	v1 "github.com/google/go-containerregistry/pkg/v1"
@@ -77,6 +79,7 @@ type TestsResourceModel struct {
 	Labels       map[string]string   `tfsdk:"labels"`
 	Skipped      types.Bool          `tfsdk:"skipped"`
 	RepoOverride types.String        `tfsdk:"repo"`
+	Retry        *RetryResourceModel `tfsdk:"retry"`
 }
 
 type TestsImageResource map[string]string
@@ -114,6 +117,41 @@ type TestResourceModel struct {
 	Timeout   types.String        `tfsdk:"timeout"`
 	Artifact  types.Object        `tfsdk:"artifact"`
 	OnFailure []string            `tfsdk:"on_failure"`
+	Retry     *RetryResourceModel `tfsdk:"retry"`
+}
+
+type RetryResourceModel struct {
+	Attempts types.Int64  `tfsdk:"attempts"`
+	Delay    types.String `tfsdk:"delay"`
+}
+
+func (r *RetryResourceModel) config() (retry.Config, diag.Diagnostics) {
+	if r == nil {
+		return retry.Config{}, nil
+	}
+	cfg := retry.Config{
+		Attempts: int(r.Attempts.ValueInt64()),
+		Delay:    5 * time.Second,
+	}
+	if v := os.Getenv("IMAGETEST_RETRY_ATTEMPTS"); v != "" {
+		if n, err := strconv.Atoi(v); err == nil {
+			cfg.Attempts = n
+		}
+	}
+	if cfg.Attempts < 1 {
+		cfg.Attempts = 1
+	}
+	if !r.Delay.IsNull() && r.Delay.ValueString() != "" {
+		d, err := time.ParseDuration(r.Delay.ValueString())
+		if err != nil {
+			return cfg, diag.Diagnostics{diag.NewErrorDiagnostic(
+				"invalid retry delay",
+				fmt.Sprintf("failed to parse delay %q: %s", r.Delay.ValueString(), err),
+			)}
+		}
+		cfg.Delay = d
+	}
+	return cfg, nil
 }
 
 type TestContentResourceModel struct {
@@ -206,6 +244,11 @@ func (t *TestsResource) Schema(ctx context.Context, req resource.SchemaRequest,
 							Optional:    true,
 							ElementType: types.StringType,
 						},
+						"retry": retrySchema("Re-runs this individual test within the same driver instance. " +
+							"Each retry launches a fresh test sandbox container, but all driver-level state persists: " +
+							"for Kubernetes-based drivers (k3s_in_docker, EKS, AKS) this means the cluster, namespace, RBAC, secrets, and any objects created by previous attempts are still present. " +
+							"For EC2, the instance filesystem and Docker daemon state carry over. " +
+							"Tests must be idempotent — use create-or-update patterns, unique names, or explicit cleanup to avoid conflicts with leftover state from failed attempts."),
 						"artifact": schema.SingleNestedAttribute{
 							Description: "The bundled artifact generated by the test.",
 							Optional:    true,
@@ -240,6 +283,27 @@ func (t *TestsResource) Schema(ctx context.Context, req resource.SchemaRequest,
 				Optional: true,
 				Computed: true,
 			},
+			"retry": retrySchema("On failure, tears down the driver completely, creates a fresh one, and re-runs all tests from scratch. " +
+				"This gives each attempt a clean driver, but external side effects from previous attempts are not rolled back: " +
+				"pushed images, written files, cloud resources created outside the driver (e.g. IAM roles, DNS records), and any other out-of-band mutations will still exist. " +
+				"All per-test retry blocks also reset — every test runs from its first attempt on each resource-level retry."),
+		},
+	}
+}
+
+func retrySchema(description string) schema.SingleNestedAttribute {
+	return schema.SingleNestedAttribute{
+		Description: description,
+		Optional:    true,
+		Attributes: map[string]schema.Attribute{
+			"attempts": schema.Int64Attribute{
+				Description: "Total number of attempts including the initial run. Must be >= 1.",
+				Required:    true,
+			},
+			"delay": schema.StringAttribute{
+				Description: "Delay between retry attempts as a Go duration string (e.g. \"5s\", \"1m\"). Defaults to 5s.",
+				Optional:    true,
+			},
 		},
 	}
 }
@@ -380,17 +444,13 @@ func (t *TestsResource) do(ctx context.Context, data *TestsResourceModel) (ds di
 		return []diag.Diagnostic{diag.NewErrorDiagnostic("failed to create target repository", err.Error())}
 	}
 
-	tracer := otel.Tracer("imagetest")
-
+	// Build test images once — refs are digest-based and stable across retries.
 	trefs, buildDiags := t.buildTestImages(ctx, data, trepo, imgsResolvedData, id)
 	if buildDiags.HasError() {
 		return buildDiags
 	}
 
-	dr, err := t.LoadDriver(ctx, data)
-	if err != nil {
-		return []diag.Diagnostic{diag.NewErrorDiagnostic("failed to load driver", err.Error())}
-	}
+	tracer := otel.Tracer("imagetest")
 
 	ctx, suiteSpan := tracer.Start(ctx, "imagetest.suite",
 		trace.WithAttributes(
@@ -412,6 +472,51 @@ func (t *TestsResource) do(ctx context.Context, data *TestsResourceModel) (ds di
 		suiteSpan.End()
 	}()
 
+	// Resource-level retry: on failure, tear down the driver, create a fresh
+	// one, and re-run all tests from scratch.
+	retryCfg, cfgDiags := data.Retry.config()
+	if cfgDiags.HasError() {
+		return cfgDiags
+	}
+
+	result := retry.Do(ctx, retryCfg, func(ctx context.Context, attempt int) error {
+		if attempt > 1 {
+			suiteSpan.AddEvent("retry", trace.WithAttributes(
+				attribute.Int("test.attempt", attempt),
+			))
+		}
+
+		ds = t.doAttempt(ctx, data, trefs, tracer)
+		if ds.HasError() {
+			return fmt.Errorf("%s", ds[len(ds)-1].Detail())
+		}
+		return nil
+	})
+
+	if result.Retried {
+		suiteSpan.SetAttributes(
+			attribute.Int("test.attempts", result.Attempts),
+			attribute.Bool("test.retried", true),
+		)
+		if !ds.HasError() {
+			ds = append(ds, diag.NewWarningDiagnostic(
+				fmt.Sprintf("tests passed after retry (attempt %d/%d)", result.Attempts, retryCfg.Attempts),
+				fmt.Sprintf("previous attempt failed: %s", result.LastError),
+			))
+		}
+	}
+
+	return ds
+}
+
+// doAttempt runs a single attempt of the full driver lifecycle: load → setup →
+// run tests → teardown. Each resource-level retry calls this with a fresh driver.
+func (t *TestsResource) doAttempt(ctx context.Context, data *TestsResourceModel, trefs []name.Reference, tracer trace.Tracer) (ds diag.Diagnostics) {
+	dr, err := t.LoadDriver(ctx, data)
+	if err != nil {
+		return []diag.Diagnostic{diag.NewErrorDiagnostic("failed to load driver", err.Error())}
+	}
+
 	defer func() {
 		ctx, teardownSpan := tracer.Start(ctx, "imagetest.teardown",
 			trace.WithAttributes(
@@ -443,7 +548,7 @@ func (t *TestsResource) do(ctx context.Context, data *TestsResourceModel) (ds di
 	setupSpan.End()
 
 	for i, tref := range trefs {
-		ds.Append(t.doTest(ctx, dr, data.Tests[i], tref)...)
+		ds.Append(t.doTestWithRetry(ctx, dr, data.Tests[i], tref)...)
 		if ds.HasError() {
 			return ds
 		}
@@ -452,6 +557,36 @@ func (t *TestsResource) do(ctx context.Context, data *TestsResourceModel) (ds di
 	return ds
 }
 
+// doTestWithRetry wraps doTest with per-test retry. Each retry re-runs d.Run()
+// within the same driver — the test author asserts idempotency.
+func (t *TestsResource) doTestWithRetry(ctx context.Context, d drivers.Tester, test *TestResourceModel, ref name.Reference) diag.Diagnostics {
+	cfg, cfgDiags := test.Retry.config()
+	if cfgDiags.HasError() {
+		return cfgDiags
+	}
+	if cfg.Attempts <= 1 {
+		return t.doTest(ctx, d, test, ref)
+	}
+
+	var lastDiags diag.Diagnostics
+	result := retry.Do(ctx, cfg, func(ctx context.Context, attempt int) error {
+		lastDiags = t.doTest(ctx, d, test, ref)
+		if lastDiags.HasError() {
+			return fmt.Errorf("%s", lastDiags[len(lastDiags)-1].Detail())
+		}
+		return nil
+	})
+
+	if result.Retried && !lastDiags.HasError() {
+		lastDiags = append(lastDiags, diag.NewWarningDiagnostic(
+			fmt.Sprintf("test %q passed after retry (attempt %d/%d)", test.Name.ValueString(), result.Attempts, cfg.Attempts),
+			fmt.Sprintf("previous attempt failed: %s", result.LastError),
+		))
+	}
+
+	return lastDiags
+}
+
 func (t *TestsResource) doTest(ctx context.Context, d drivers.Tester, test *TestResourceModel, ref name.Reference) diag.Diagnostics {
 	// Get the test_id from context
 	testID, ok := ctx.Value(contextKeyResourceTestID).(string)

internal/provider/tests_resource_test.go

Lines changed: 110 additions & 0 deletions
@@ -296,6 +296,116 @@ resource "imagetest_tests" "foo" {
 				Check: checkArtifact(t),
 			},
 		},
+		// Per-test retry on a passing test: retry block is accepted, test still passes.
+		"dockerindocker-per-test-retry-passes": {
+			{
+				Config: fmt.Sprintf(`
+resource "imagetest_tests" "foo" {
+  name = "dind-per-test-retry-passes"
+  driver = "docker_in_docker"
+
+  images = {
+    foo = "cgr.dev/chainguard/busybox:latest@sha256:c546e746013d75c1fc9bf01b7a645ce7caa1ec46c45cb618c6e28d7b57bccc85"
+  }
+
+  tests = [
+    {
+      name = "sample"
+      image = "cgr.dev/chainguard/busybox:latest"
+      content = [{ source = "${path.module}/testdata/TestAccTestsResource" }]
+      cmd = "./%[1]s"
+      retry = { attempts = 3, delay = "1s" }
+    }
+  ]
+
+  timeout = "5m"
+}
+`, "docker-in-docker-basic.sh"),
+			},
+		},
+		// Per-test retry on a failing test: all attempts exhausted, error surfaces.
+		"dockerindocker-per-test-retry-exhausted": {
+			{
+				Config: fmt.Sprintf(`
+resource "imagetest_tests" "foo" {
+  name = "dind-per-test-retry-exhausted"
+  driver = "docker_in_docker"
+
+  images = {
+    foo = "cgr.dev/chainguard/busybox:latest@sha256:c546e746013d75c1fc9bf01b7a645ce7caa1ec46c45cb618c6e28d7b57bccc85"
+  }
+
+  tests = [
+    {
+      name = "sample"
+      image = "cgr.dev/chainguard/busybox:latest"
+      content = [{ source = "${path.module}/testdata/TestAccTestsResource" }]
+      cmd = "./%[1]s"
+      retry = { attempts = 2, delay = "1s" }
+    }
+  ]
+
+  timeout = "5m"
+}
+`, "docker-in-docker-fails.sh"),
+				ExpectError: regexp.MustCompile(`.*can't open 'imalittleteapot'.*`),
+			},
+		},
+		// Resource-level retry on a passing test: retry block is accepted, test still passes.
+		"dockerindocker-resource-retry-passes": {
+			{
+				Config: fmt.Sprintf(`
+resource "imagetest_tests" "foo" {
+  name = "dind-resource-retry-passes"
+  driver = "docker_in_docker"
+
+  images = {
+    foo = "cgr.dev/chainguard/busybox:latest@sha256:c546e746013d75c1fc9bf01b7a645ce7caa1ec46c45cb618c6e28d7b57bccc85"
+  }
+
+  tests = [
+    {
+      name = "sample"
+      image = "cgr.dev/chainguard/busybox:latest"
+      content = [{ source = "${path.module}/testdata/TestAccTestsResource" }]
+      cmd = "./%[1]s"
+    }
+  ]
+
+  retry = { attempts = 2, delay = "1s" }
+  timeout = "5m"
+}
+`, "docker-in-docker-basic.sh"),
+			},
+		},
+		// Resource-level retry on a failing test: all attempts exhausted, error surfaces.
+		"dockerindocker-resource-retry-exhausted": {
+			{
+				Config: fmt.Sprintf(`
+resource "imagetest_tests" "foo" {
+  name = "dind-resource-retry-exhausted"
+  driver = "docker_in_docker"
+
+  images = {
+    foo = "cgr.dev/chainguard/busybox:latest@sha256:c546e746013d75c1fc9bf01b7a645ce7caa1ec46c45cb618c6e28d7b57bccc85"
+  }
+
+  tests = [
+    {
+      name = "sample"
+      image = "cgr.dev/chainguard/busybox:latest"
+      content = [{ source = "${path.module}/testdata/TestAccTestsResource" }]
+      cmd = "./%[1]s"
+    }
+  ]
+
+  retry = { attempts = 2, delay = "1s" }
+  timeout = "5m"
+}
+`, "docker-in-docker-fails.sh"),
+				ExpectError: regexp.MustCompile(`.*can't open 'imalittleteapot'.*`),
+			},
+		},
 	}
 
 	for name, tc := range testCases {
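
The two `*-exhausted` cases assert only on the final error text. As a small aside on those `ExpectError` patterns: Go's `regexp` matching is unanchored, so the pattern matches as long as the quoted fragment appears anywhere in the diagnostic, and the surrounding `.*` is optional. The sketch below illustrates this; the helper `matches` and the sample diagnostic string are invented for illustration and are not real provider output.

```go
package main

import (
	"fmt"
	"regexp"
)

// matches reports whether pattern is found anywhere in text.
// NOTE: illustrative helper, not part of the test suite in this commit.
func matches(pattern, text string) bool {
	return regexp.MustCompile(pattern).MatchString(text)
}

func main() {
	// Hypothetical multi-line error; only the quoted fragment comes from the tests.
	diagnostic := "test failed after 2 attempts:\ncat: can't open 'imalittleteapot': No such file or directory\n"

	wrapped := matches(`.*can't open 'imalittleteapot'.*`, diagnostic)
	bare := matches(`can't open 'imalittleteapot'`, diagnostic)
	fmt.Println(wrapped, bare) // prints: true true
}
```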
