Fix race condition on deployment creation #979
Walkthrough
The pull request modifies two Terraform deployment scripts. The agent cleanup routine now stops services before removing directories to prevent race conditions. The user data provisioner transitions from direct script execution to using …
* Run the provisioner scripts as the base AMI user
* Stop processes before removing directory
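As a rough illustration of the first commit, here is a minimal sketch of a helper that runs a command as a chosen unprivileged user instead of `root`. The helper name `run_as_user`, the user name `ubuntu`, and the script path in the comment are all hypothetical, and delegating to `sudo` is an assumption; the actual user_data template may use a different mechanism.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: run a command as the given user.
# If we are already that user, run the command directly; otherwise
# delegate to sudo (-H sets HOME to the target user's home directory).
run_as_user() {
  local user="$1"
  shift
  if [ "$(id -un)" = "$user" ]; then
    "$@"
  else
    sudo -H -u "$user" -- "$@"
  fi
}

# Assumed usage from a user_data script running as root, where "ubuntu"
# stands in for the base AMI user and the path is a placeholder:
#   run_as_user ubuntu bash /path/to/provisioner.sh

# Self-contained demonstration: run a command as the current user.
run_as_user "$(id -un)" id -un
```

Running the scripts this way means files they create are owned by the plain user, which matches how the description says the scripts were originally designed to run.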
Summary
In my recent deployments for running load tests, I've been seeing a bug where the deployment fails the n-th time (with n > 1, never n = 1) with an error when trying to remove a nested subdirectory of `browser/node_modules`.

This is a race condition caused by the `ltbrowserapi` process actively writing to that directory, so `rm -rf` complains when it tries to delete what it expects to be an empty directory.

This is fixed by stopping the processes (I also stop `ltapi`, just in case) before removing all the directories, since the goal is to restart everything.

One of my first hypotheses about this error was that the `node_modules` directory somehow had its permissions messed up, so I added a change to the user_data template to run the provisioner scripts as the base AMI user, not as `root`. This ended up not being the root cause of the problem, but I left the change in because the scripts were originally designed to run under a plain user, and I think this is safer.

Ticket Link
--
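The ordering described in the Summary (stop the writers first, then remove the directories) can be sketched as follows. This is only an illustration under stated assumptions: a background loop stands in for the `ltbrowserapi` process, and a temporary directory stands in for the real deployment path. The real cleanup script presumably stops the actual services through whatever service manager the agents use.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Temporary stand-in for the deployment directory (assumption, not the real path).
workdir="$(mktemp -d)"
target="$workdir/browser/node_modules"
mkdir -p "$target"

# Stand-in for the service: a background loop that keeps creating files
# inside the directory we are about to delete.
(
  while true; do
    : > "$target/file.$RANDOM" 2>/dev/null || true
    sleep 0.01
  done
) &
writer_pid=$!
sleep 0.1   # let the writer create a few files

# Step 1: stop the writer and wait until it has actually exited.
kill "$writer_pid"
wait "$writer_pid" 2>/dev/null || true

# Step 2: only now remove the directory. With no process writing into it,
# rm -rf cannot find new entries in a directory it has just emptied.
rm -rf "$workdir/browser"
```

If the two steps were swapped, the writer could create a file between the moment `rm -rf` empties a subdirectory and the moment it removes it, which matches the intermittent failure described above.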