
fix: Addressing test flakes for TestReadTerragruntConfigDependencyInStack#5781

Open
yhakbar wants to merge 4 commits into main from fix/addressing-flake

Conversation

@yhakbar
Collaborator

@yhakbar yhakbar commented Apr 1, 2026

Description

A race between cache directory creation and reset causes flakes in TestReadTerragruntConfigDependencyInStack. This PR fixes that race.

TODOs

Read the Gruntwork contribution guidelines.

  • I authored this code entirely myself
  • I am submitting code based on open source software (e.g. MIT, MPL-2.0, Apache)
  • I am adding or upgrading a dependency or adapted code and confirm it has a compatible open source license
  • Update the docs.
  • Run the relevant tests successfully, including pre-commit checks.
  • Include release notes. If this PR is backward incompatible, include a migration guide.

Release Notes (draft)

Added / Removed / Updated [X].

Migration Guide

Summary by CodeRabbit

  • Refactor

    • Improved Terraform source caching to serialize concurrent downloads and avoid race conditions.
    • Adjusted module copying so local modules are only copied when necessary, reducing redundant work.
  • Tests

    • Added a parallelized regression test for concurrent runs.
    • Updated integration tests (including Windows) to reflect the refined caching and module-handling behavior.

@vercel

vercel bot commented Apr 1, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
|---|---|---|---|
| terragrunt-docs | Ready | Preview, Comment | Apr 2, 2026 5:22pm |


@coderabbitai
Contributor

coderabbitai bot commented Apr 1, 2026

📝 Walkthrough

Modify Terraform source download flow to return a boolean indicating whether a download occurred, add per-cache-directory mutexes to serialize concurrent downloads/copying, and conditionally skip module copying when no download and source equals the working directory. Tests and docs updated; a new concurrent integration test added.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Download logic & locking: internal/runner/run/download_source.go, internal/runner/run/run.go | Changed DownloadTerraformSourceIfNecessary(...) signature to return (bool, error) indicating whether a download occurred. Introduced per-DownloadDir mutexes (sourceChangeLocks) to serialize concurrent cache operations. Adjusted control flow to call DownloadTerraformSourceIfNecessary first and run module copy only when downloaded == true or source differs from working dir. Updated inline comment for sourceChangeLocks. |
| Unit tests: internal/runner/run/download_source_test.go | Updated tests to unpack the new boolean return value (using _, err = ...) while preserving existing assertions and error handling. |
| Integration tests (concurrency): test/integration_regressions_test.go | Added TestReadTerragruntConfigDependencyInStackWithRacing, a parallelized duplicate of the existing stack test to exercise concurrent runs. |
| Integration tests (manifest behavior): test/integration_windows_test.go | Relaxed TestWindowsManifestFileIsRemoved to assert the manifest file exists after the second run but no longer require its modtime to be newer, reflecting conditional module-copy behavior. |
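
To make the control flow in the summary concrete, here is a minimal sketch of a download step that reports whether it refreshed the cache, followed by the copy decision. The names downloadIfNecessary and needsModuleCopy are hypothetical stand-ins, not the PR's functions; the boolean expression follows the `downloaded || !tf.IsLocalSource(...)` line quoted later in this review.

```go
package main

import "fmt"

// downloadIfNecessary is a hypothetical stand-in for a download step that
// returns (bool, error): true when it refreshed the cache, false on a cache hit.
func downloadIfNecessary(cacheIsCurrent bool) (bool, error) {
	if cacheIsCurrent {
		return false, nil // cache hit, nothing was downloaded
	}
	// A real implementation would fetch the source into the cache here.
	return true, nil
}

// needsModuleCopy decides whether working-dir files are copied into the cache.
// The copy is skipped only for a local source whose cache is already current.
func needsModuleCopy(downloaded, sourceIsLocal bool) bool {
	return downloaded || !sourceIsLocal
}

func main() {
	cases := []struct {
		cacheIsCurrent bool
		sourceIsLocal  bool
	}{
		{cacheIsCurrent: true, sourceIsLocal: true},
		{cacheIsCurrent: false, sourceIsLocal: true},
		{cacheIsCurrent: true, sourceIsLocal: false},
	}

	for _, tc := range cases {
		downloaded, err := downloadIfNecessary(tc.cacheIsCurrent)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("cacheIsCurrent=%v local=%v -> copy=%v\n",
			tc.cacheIsCurrent, tc.sourceIsLocal, needsModuleCopy(downloaded, tc.sourceIsLocal))
	}
}
```

Returning the boolean keeps the copy decision at the call site instead of hiding it inside the download step.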

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 69.23%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ❓ Inconclusive | The description provides the root cause and general intent but lacks specific details about implementation changes, making it incomplete against the template. | Add more details about the changes made (mutex introduction, function signature updates) and populate release notes and migration guide sections as required by the template. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately describes the primary fix: addressing test flakes for a specific test via race condition resolution. |


@yhakbar yhakbar force-pushed the fix/addressing-flake branch from 1dff515 to 7c5461a Compare April 2, 2026 12:02
@yhakbar yhakbar marked this pull request as ready for review April 2, 2026 14:40
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/runner/run/download_source_test.go (1)

952-975: ⚠️ Potential issue | 🟠 Major

Don't write to the parent err from parallel subtests.

These subtests call t.Parallel(), but Line 965 assigns into the outer err variable. That creates a race in the test itself under -race, so this coverage can flake independently of the production code.

🛠️ Minimal fix
-			_, err = run.DownloadTerraformSourceIfNecessary(t.Context(), l, src, configbridge.NewRunOptions(opts), cfg, r)
+			_, downloadErr := run.DownloadTerraformSourceIfNecessary(t.Context(), l, src, configbridge.NewRunOptions(opts), cfg, r)
 
 			if tc.name == "Local file source" {
-				require.NoError(t, err)
+				require.NoError(t, downloadErr)
 
 				expectedFilePath := filepath.Join(tmpDir, "main.tf")
 				assert.FileExists(t, expectedFilePath)
 			} else {
-				t.Logf("Source %s result: %v", tc.sourceURL, err)
+				t.Logf("Source %s result: %v", tc.sourceURL, downloadErr)
 			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/runner/run/download_source_test.go` around lines 952 - 975, The
parallel subtests are writing to the outer err variable and also risk using a
loop variable unsafely; inside the for loop capture tc (e.g., tc := tc) before
calling t.Run and inside the subtest use a local error shadow (_, err :=
run.DownloadTerraformSourceIfNecessary(...), since the function now returns two
values) instead of assigning to the outer err so the parallel goroutines do not
race on the shared err variable; update the test body around
run.DownloadTerraformSourceIfNecessary and the tc usage accordingly.
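
The same idea can be shown outside the testing package. The sketch below uses plain goroutines so it runs as a standalone program; fetch and fetchAll are hypothetical names, and fetch only imitates a call that returns (bool, error). Each goroutine declares its own err with := and writes only to its own slice slot, so no variable is shared between concurrent bodies.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// fetch is a hypothetical stand-in for the call under test.
func fetch(source string) (bool, error) {
	if source == "" {
		return false, errors.New("empty source")
	}
	return true, nil
}

// fetchAll runs fetch concurrently for every source. Each goroutine keeps its
// error in a local variable and stores it in a slot that no other goroutine
// touches, so there is no shared err to race on.
func fetchAll(sources []string) []error {
	results := make([]error, len(sources))

	var wg sync.WaitGroup
	for i, src := range sources {
		wg.Add(1)
		go func(i int, src string) {
			defer wg.Done()
			_, err := fetch(src) // local to this goroutine, not the outer scope
			results[i] = err
		}(i, src)
	}
	wg.Wait()

	return results
}

func main() {
	for i, err := range fetchAll([]string{"./modules/app", "", "./modules/db"}) {
		fmt.Printf("source %d: err=%v\n", i, err)
	}
}
```

In a parallel subtest the equivalent change is the one in the diff above: declare the error inside the subtest closure rather than assigning to a variable owned by the parent test.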

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4a7d2b9f-3d95-4780-aba0-240f89d3fa0a

📥 Commits

Reviewing files that changed from the base of the PR and between 0945c60 and bb9ac34.

📒 Files selected for processing (5)
  • internal/runner/run/download_source.go
  • internal/runner/run/download_source_test.go
  • internal/runner/run/run.go
  • test/integration_regressions_test.go
  • test/integration_windows_test.go

Comment on lines +61 to +67
// Serialize concurrent downloads to the same cache directory. Without this,
// manifest.Clean() in one goroutine can delete files while another goroutine
// is checking for them (e.g. during CheckFolderContainsTerraformCode).
rawLock, _ := sourceChangeLocks.LoadOrStore(terraformSource.DownloadDir, &sync.Mutex{})
dirLock := rawLock.(*sync.Mutex)
dirLock.Lock()
defer dirLock.Unlock()

⚠️ Potential issue | 🔴 Critical

The new mutex doesn't cover the whole cache-dir lifetime.

The deferred unlock on line 67 fires when the enclosing function returns, before Run continues, so another goroutine can grab the same lock, run CopyFolderContents/manifest.Clean(), and mutate the cache while the first goroutine is already in GenerateConfig, CheckFolderContainsTerraformCode, or Terraform execution. This narrows the original flake, but the race is still reachable whenever the second caller takes the copy path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/runner/run/download_source.go` around lines 61 - 67, The current
mutex taken from sourceChangeLocks for terraformSource.DownloadDir is unlocked
before Run returns, allowing concurrent goroutines to mutate the cache during
later steps (GenerateConfig, CheckFolderContainsTerraformCode, Terraform
execution); change the locking so the mutex covers the entire cache-dir lifetime
by acquiring the mutex before entering Run (or at the start of the code path
that ultimately calls Run/CopyFolderContents) and only unlocking after Run fully
completes, ensuring manifest.Clean(), CopyFolderContents, GenerateConfig,
CheckFolderContainsTerraformCode and any Terraform execution run while holding
the same mutex for that terraformSource.DownloadDir.
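
One way to picture the wider lock scope requested here is a helper that holds the per-directory mutex across an entire callback. This is only a sketch: dirLocks, withDirLock, and maxConcurrent are hypothetical names, and the callback stands in for the download, copy, config generation, and Terraform steps.

```go
package main

import (
	"fmt"
	"sync"
)

// dirLocks maps a cache directory to its mutex. It is a hypothetical analog of
// the sourceChangeLocks map discussed in this review.
var dirLocks sync.Map

// withDirLock holds the mutex for dir for the whole duration of fn, so every
// step inside fn sees a cache directory that no other caller is mutating.
func withDirLock(dir string, fn func() error) error {
	raw, _ := dirLocks.LoadOrStore(dir, &sync.Mutex{})

	mu, ok := raw.(*sync.Mutex)
	if !ok {
		return fmt.Errorf("unexpected lock type for %s", dir)
	}

	mu.Lock()
	defer mu.Unlock()

	return fn()
}

// maxConcurrent starts n workers on the same dir and returns the largest
// number of workers that were inside the locked section at the same time.
func maxConcurrent(dir string, n int) int {
	var (
		wg      sync.WaitGroup
		stateMu sync.Mutex
		active  int
		peak    int
	)

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = withDirLock(dir, func() error {
				stateMu.Lock()
				active++
				if active > peak {
					peak = active
				}
				stateMu.Unlock()

				// Download, copy, config generation, and execution would go here.

				stateMu.Lock()
				active--
				stateMu.Unlock()
				return nil
			})
		}()
	}
	wg.Wait()

	return peak
}

func main() {
	fmt.Println("peak holders:", maxConcurrent("/tmp/cache/unit-a", 8))
}
```

The trade-off is that units sharing a cache directory then run strictly one at a time, including their Terraform execution, which may or may not be acceptable for this code path.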

Comment on lines +74 to +105
// For local sources, when no download was needed (AlreadyHaveLatestCode=true),
// skip the module copy: the version hash incorporates all file mod times, so
// no files have changed and the cache already has the correct content from a
// previous run. Skipping avoids manifest.Clean() deleting files that a
// concurrent goroutine expects to exist.
//
// For remote sources, always do the module copy because local working-dir
// files may have changed independently of the remote source version.
needsModuleCopy := downloaded || !tf.IsLocalSource(terraformSource.CanonicalSourceURL)

if needsModuleCopy {
	l.Debugf(
		"Copying files from %s into %s",
		util.RelPathForLog(opts.WorkingDir, opts.WorkingDir, opts.Writers.LogShowAbsPaths),
		util.RelPathForLog(opts.RootWorkingDir, terraformSource.WorkingDir, opts.Writers.LogShowAbsPaths),
	)

	// Always include the .tflint.hcl file, if it exists
	includeInCopy := slices.Concat(cfg.Terraform.IncludeInCopy, []string{tfLintConfig})

	err = util.CopyFolderContents(
		l,
		opts.WorkingDir,
		terraformSource.WorkingDir,
		ModuleManifestName,
		includeInCopy,
		cfg.Terraform.ExcludeFromCopy,
	)
	if err != nil {
		return nil, err
	}
}

⚠️ Potential issue | 🔴 Critical

Skipping the copy on local cache hits can serve stale unit files.

AlreadyHaveLatestCode is documented below as hashing the local source dir, but this block is also what copies unit-local files from opts.WorkingDir into the cache (IncludeInCopy, .tflint.hcl, tfvars, etc.). If those files change while the local module source does not, downloaded stays false and Terragrunt keeps running with the previous cached copies. Please either keep copying on cache hits or fold the copied working-dir inputs into the cache/version key first.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/runner/run/download_source.go` around lines 74 - 105, The current
logic uses needsModuleCopy := downloaded || !tf.IsLocalSource(...) which skips
copying when downloaded==false (cache hit) and that can serve stale
working-dir-specific files; update the logic so working-dir inputs are never
skipped: either (A) always perform the module copy when any of the includeInCopy
items (the computed includeInCopy / tfLintConfig / cfg.Terraform.IncludeInCopy
set) or ModuleManifestName could differ from the cached copy by setting
needsModuleCopy = true when those working-dir inputs exist/changed, or (B) fold
those working-dir inputs into the cache/version key instead of skipping the copy
(i.e., incorporate their hash into the version calculation used to set
downloaded). Locate the decision around needsModuleCopy (the downloaded variable
and tf.IsLocalSource call) and the util.CopyFolderContents call and implement
one of these two fixes so unit-local files (includeInCopy/.tflint.hcl/tfvars)
are either always copied on cache hits or are included in the cache version.
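
Option (B) can be sketched as a version key that hashes the source version together with the contents of the working-dir files that get copied into the cache. Everything here is illustrative: versionKey and keyChangesOnEdit are hypothetical helpers, SHA-256 is an arbitrary choice, and the real version calculation in Terragrunt may work differently.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"sort"
)

// versionKey mixes the source version with the contents of the listed
// working-dir files, so an edit to any of them produces a different key.
func versionKey(sourceVersion, workingDir string, copiedFiles []string) (string, error) {
	names := append([]string(nil), copiedFiles...)
	sort.Strings(names) // stable order, independent of how the list was built

	h := sha256.New()
	h.Write([]byte(sourceVersion))

	for _, name := range names {
		data, err := os.ReadFile(filepath.Join(workingDir, name))
		if errors.Is(err, fs.ErrNotExist) {
			continue // optional files such as .tflint.hcl may be absent
		}
		if err != nil {
			return "", err
		}
		// Name and length delimit each entry so adjacent files cannot blur together.
		fmt.Fprintf(h, "\x00%s\x00%d\x00", name, len(data))
		h.Write(data)
	}

	return hex.EncodeToString(h.Sum(nil)), nil
}

// keyChangesOnEdit writes a file, computes the key, edits the file, and
// reports whether the key changed.
func keyChangesOnEdit() (bool, error) {
	dir, err := os.MkdirTemp("", "unit")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, ".tflint.hcl")
	files := []string{".tflint.hcl", "extra.tfvars"} // extra.tfvars is absent on purpose

	if err := os.WriteFile(path, []byte("rule = 1\n"), 0o600); err != nil {
		return false, err
	}
	before, err := versionKey("module-v1", dir, files)
	if err != nil {
		return false, err
	}

	if err := os.WriteFile(path, []byte("rule = 2\n"), 0o600); err != nil {
		return false, err
	}
	after, err := versionKey("module-v1", dir, files)
	if err != nil {
		return false, err
	}

	return before != after, nil
}

func main() {
	changed, err := keyChangesOnEdit()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("key changed after edit:", changed)
}
```

With a key like this, a cache hit would mean that neither the module source nor the copied unit files changed, which is the condition under which skipping the copy is safe.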

Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
internal/runner/run/download_source.go (1)

61-67: Lock scope addresses the download/copy race; consider defensive type assertion.

The mutex serializes concurrent DownloadTerraformSourceIfNecessary and CopyFolderContents calls to the same cache directory, which addresses the stated race condition in the PR objectives.

Minor defensive improvement: the type assertion on line 65 could use the comma-ok idiom to avoid a potential panic if the map ever contains an unexpected type.

🛡️ Optional defensive fix
 rawLock, _ := sourceChangeLocks.LoadOrStore(terraformSource.DownloadDir, &sync.Mutex{})
-dirLock := rawLock.(*sync.Mutex)
+dirLock, ok := rawLock.(*sync.Mutex)
+if !ok {
+	return nil, errors.Errorf("unexpected lock type for %s", terraformSource.DownloadDir)
+}
 dirLock.Lock()
 defer dirLock.Unlock()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/runner/run/download_source.go` around lines 61 - 67, The code
serializes concurrent downloads using sourceChangeLocks but does a blind type
assertion rawLock.(*sync.Mutex) which can panic if the map contains an
unexpected type; update the assertion in the
DownloadTerraformSourceIfNecessary/CopyFolderContents lock section to use the
comma-ok form (m, ok := rawLock.(*sync.Mutex)) and handle the false case
defensively (e.g., allocate a new sync.Mutex, store it back into
sourceChangeLocks for terraformSource.DownloadDir and use that mutex) so the
code never panics on a bad type and still preserves the serialized lock
behavior.
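
An alternative that avoids the type assertion altogether is a typed map guarded by its own mutex. This is a sketch of a different design, not what the PR does; keyedLocks and its get method are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// keyedLocks hands out one mutex per key. The map is typed, so callers never
// need a type assertion and there is no panic path to guard against.
type keyedLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

// get returns the mutex for key, creating it on first use. The zero value of
// keyedLocks is ready to use.
func (k *keyedLocks) get(key string) *sync.Mutex {
	k.mu.Lock()
	defer k.mu.Unlock()

	if k.locks == nil {
		k.locks = make(map[string]*sync.Mutex)
	}

	m, ok := k.locks[key]
	if !ok {
		m = &sync.Mutex{}
		k.locks[key] = m
	}

	return m
}

func main() {
	var locks keyedLocks

	a := locks.get("/cache/unit-a")
	fmt.Println("same key, same mutex:", a == locks.get("/cache/unit-a"))
	fmt.Println("different key, same mutex:", a == locks.get("/cache/unit-b"))
}
```

The outer mutex is held only while looking up or creating an entry, so callers locking different directories do not block each other beyond that brief lookup.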

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bac3f02a-2bb5-40da-b161-6854f9d3e26d

📥 Commits

Reviewing files that changed from the base of the PR and between bb9ac34 and 4ff19fe.

📒 Files selected for processing (1)
  • internal/runner/run/download_source.go
