Fix disk test timeout by making disk occupation asynchronous #901
bonzofenix wants to merge 33 commits into main
Update scripts/run-acceptance-tests-task.sh
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Fix reviewdog linter issues
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Update error message for missing config in tests to include $ACCEPTANCE_CONFIG_JSON option
Update shellcheck fail behavior in linters workflow
Enable test coverage for acceptance tests and update shellcheck source references in scripts
Disable shellcheck warning and update service broker creation in register-broker.sh; enable test coverage in run-acceptance-tests-task.sh
- Include Go module files and main.go in acceptance build artifacts
- Create a placeholder main.go for acceptance tests
- Set GOVERSION to go1.23 in mta.tpl.yaml
- Change buildpack from binary_buildpack to go_buildpack for acceptance tests
…o binary_buildpack in mta.tpl.yaml
…ding if the extension file already exists
…inkgo binary compilation in compile-acceptance-tests.sh. Updated messaging in build-extension-file.sh for existing extension file check.
… BOSH login
• Replace explicit environment variable checks with default assignments
• Remove usage function and associated checks for required variables
• Simplify BOSH login process by using a bbl_login function
Simplified the BBL_STATE_PATH setup to avoid redundant realpath calls and prevent script failure when the default path doesn't exist. The path is now set to the default first, then validated once with proper error suppression. Co-Authored-By: Claude <noreply@anthropic.com>
Moved BBL_STATE_PATH initialization from vars.source.sh into the bbl_login function in common.sh. This prevents path resolution errors when BBL is not being used and the path doesn't exist.
Changes:
- Removed BBL_STATE_PATH setup from vars.source.sh
- Updated bbl_login() to define and validate BBL_STATE_PATH internally
- Removed parameter from all bbl_login() calls across scripts
- Simplified validation logic in scripts that checked BBL_STATE_PATH
Benefits:
- Scripts that don't use BBL won't fail if BBL_STATE_PATH doesn't exist
- BBL_STATE_PATH is only resolved when actually needed
- Cleaner separation of concerns
Co-Authored-By: Claude <noreply@anthropic.com>
• Simplified BOSH login by removing unnecessary path canonicalization and error messages in bbl_login function.
• Removed redundant comments and environment variable checks.
• Updated cf_target to use non-optional variables for org and space.
• Ensured consistent use of BBL_STATE_PATH across scripts for BOSH login.
• Clarified error message for missing bbl-state folder in bbl_login.
Co-authored-by: Arsalan Khan <asalan316@hotmail.com>
The disk test was failing with a 502 Bad Gateway error because writing 800MB of random data to disk was taking longer than the HTTP timeout.
Changes:
- Made disk occupation operation asynchronous (runs in goroutine)
- HTTP response is returned immediately after starting occupation
- The isRunning flag is set before starting the goroutine to prevent races
- On error, the isRunning flag is properly reset
This fixes the failing acceptance test: "AutoScaler dynamic policy > when there is a scaling policy for diskutil > should scale out and in"
Co-Authored-By: Claude <noreply@anthropic.com>
    return err
    }

    if err := d.occupy(space); err != nil {
How long does this synchronous disk occupation take? What is the configured timeout in the router?
    }

    func (d *defaultDiskOccupier) Occupy(space int64, duration time.Duration) error {
        if err := d.checkAlreadyRunning(); err != nil {
Why does the test for diskutil fail (as described in the PR description) while the one for disk does not? 🤷
    d.mu.Unlock()

    // Start disk occupation asynchronously to avoid HTTP timeout
    go func() {
Do we need to adjust anything here since this may be running async now? I wonder if the timeouts are still enough. Shall we potentially add a comment explaining why the timeout is longer here?
app-autoscaler/acceptance/app/dynamic_policy_test.go
Lines 320 to 321 in 4855d4a
    }

    func (d *defaultDiskOccupier) Occupy(space int64, duration time.Duration) error {
        if err := d.checkAlreadyRunning(); err != nil {
The test was not flaky when it was introduced with feat(disk, diskutil): Add new metric type disk and diskutil by geigerj0 · Pull Request #2811 · cloudfoundry/app-autoscaler-release
How come it has become "flaky" all of a sudden?
Summary
Fixes the failing acceptance test by making the disk occupation operation asynchronous.
The test "AutoScaler dynamic policy > when there is a scaling policy for diskutil > should scale out and in" was failing with a 502 Bad Gateway error because writing 800MB of random data to disk was taking longer than the HTTP timeout (GoRouter backend timeout).
Changes
- Set the isRunning flag before starting the goroutine to prevent race conditions
- Reset the isRunning flag if the operation fails
- Check for an occupation already in progress via the checkAlreadyRunning helper method
Root Cause
The /disk/800/5 endpoint was performing a synchronous blocking operation that wrote 800MB of random data from crypto/rand.Reader to disk. This operation exceeded the GoRouter's backend timeout, causing 502 errors.
Solution
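By making the disk write asynchronous, the HTTP response is returned immediately after the occupation starts, while the write itself continues in a goroutine.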
Testing
This fix should resolve the flaky test failure seen in PR #879 and allow the disk utilization scaling tests to pass consistently.
🤖 Generated with Claude Code