Skip to content

Fix race condition on deployment creation#979

Merged
agarciamontoro merged 2 commits intomasterfrom
fix.file.owner
Mar 19, 2026
Merged

Fix race condition on deployment creation#979
agarciamontoro merged 2 commits intomasterfrom
fix.file.owner

Conversation

@agarciamontoro
Copy link
Member

@agarciamontoro agarciamontoro commented Mar 17, 2026

Summary

In my recent deployments for running load tests, I've been seeing this bug where the deployment fails the n-th time (with n > 1, never n = 1) with an error on trying to remove a nested subdirectory of browser/node_modules.

This is a race condition caused by the ltbrowserapi pr ocess actively writing to that directory, so rm -rf complains about it when trying to delete what it expects to be an empty directory.

This is fixed by stopping the processes (I also stop ltapi, just in case) before removing all the directories, since the goal is to restart everything.

One of my first hypothesis on this error was that the node_modules directory somehow had the permissions messed up, so I added a change to the user_data template to run the provisioner scripts by the base AMI user, not by root. This ended up not being the root cause of the problem, but I still left the change because the scripts were originally designed to run under a plain user, and I think this is safer.

Ticket Link

--

Summary by CodeRabbit

Release Notes

  • Chores
    • Improved deployment cleanup process to prevent race conditions during service shutdown.
    • Enhanced provisioning reliability with better error handling and status tracking during system initialization.

@agarciamontoro agarciamontoro requested a review from carlisgg March 17, 2026 19:44
@agarciamontoro agarciamontoro added the 2: Dev Review Requires review by a core committer label Mar 17, 2026
@agarciamontoro
Copy link
Member Author

@CodeRabbit review

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 17, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 17, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 38baedda-cb82-4b0f-a3bb-39c8c0d87796

📥 Commits

Reviewing files that changed from the base of the PR and between 841f485 and 2d28eab.

📒 Files selected for processing (2)
  • deployment/terraform/agent.go
  • deployment/terraform/assets/user_data.sh.tpl

📝 Walkthrough

Walkthrough

The pull request modifies two Terraform deployment scripts. The agent cleanup routine now stops services before removing directories to prevent race conditions. The user data provisioner transitions from direct script execution to using runuser with proper login shell setup and adds exit code handling for improved error tracking.

Changes

Cohort / File(s) Summary
Agent Cleanup Logic
deployment/terraform/agent.go
Added service stop commands (ltapi and ltbrowserapi) before directory removal to prevent race conditions; suppresses errors during service shutdown.
User Data Provisioning
deployment/terraform/assets/user_data.sh.tpl
Removed explicit HOME and USER environment exports; replaced direct provisioner script execution with runuser -l for full login shell; added exit code capture and logging to files for success/failure cases.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • mattermost-load-test-ng#973: Directly related to the user_data.sh.tpl provisioner invocation changes and exit code handling improvements for cloud-init script execution.
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Fix race condition on deployment creation' accurately describes the main change: stopping services before removing directories to prevent race conditions during cleanup.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix.file.owner
📝 Coding Plan
  • Generate coding plan for human review comments

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.3)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

You can disable sequence diagrams in the walkthrough.

Disable the reviews.sequence_diagrams setting to disable sequence diagrams in the walkthrough.

Copy link

@carlisgg carlisgg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@agarciamontoro agarciamontoro merged commit 43de7d6 into master Mar 19, 2026
2 checks passed
@agarciamontoro agarciamontoro deleted the fix.file.owner branch March 19, 2026 11:10
agarciamontoro added a commit that referenced this pull request Mar 20, 2026
* Run the provisioner scripts as the base AMI user

* Stop processes before removing directory
agarciamontoro added a commit that referenced this pull request Mar 20, 2026
* Run the provisioner scripts as the base AMI user

* Stop processes before removing directory
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

2: Dev Review Requires review by a core committer

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants