Skip to content

Fix disk test timeout by making disk occupation asynchronous#901

Open
bonzofenix wants to merge 33 commits intomainfrom
fix-disk-test-timeout
Open

Fix disk test timeout by making disk occupation asynchronous#901
bonzofenix wants to merge 33 commits intomainfrom
fix-disk-test-timeout

Conversation

@bonzofenix
Copy link
Contributor

Summary

Fixes the failing acceptance test by making the disk occupation operation asynchronous.

The test AutoScaler dynamic policy > when there is a scaling policy for diskutil > should scale out and in was failing with a 502 Bad Gateway error because writing 800MB of random data to disk was taking longer than the HTTP timeout (GoRouter backend timeout).

Changes

  • Made disk occupation operation asynchronous (runs in goroutine)
  • HTTP response is returned immediately after starting occupation
  • Proper synchronization with isRunning flag to prevent race conditions
  • Error handling properly resets the isRunning flag if the operation fails
  • Removed unused checkAlreadyRunning helper method

Root Cause

The /disk/800/5 endpoint was performing a synchronous blocking operation that:

  1. Generated 800MB of cryptographically random data using crypto/rand.Reader
  2. Wrote it to disk
  3. Only then returned the HTTP response

This operation exceeded the GoRouter's backend timeout, causing 502 errors.

Solution

By making the disk write asynchronous:

  • The HTTP handler returns immediately with a success response
  • The disk occupation continues in the background as intended
  • The test can proceed without waiting for the slow I/O operation to complete
  • The autoscaler can still detect the disk usage increase and scale accordingly

Testing

This fix should resolve the flaky test failure seen in PR #879 and allow the disk utilization scaling tests to pass consistently.

🤖 Generated with Claude Code

app-autoscaler-ci-bot and others added 30 commits January 8, 2026 22:52
Update scripts/run-acceptance-tests-task.sh

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Fix reviewdog linter issues

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Fix reviewdog linter issues

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Fix reviewdog linter issues

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Fix reviewdog linter issues

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Fix reviewdog linter issues

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Fix reviewdog linter issues

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Fix reviewdog linter issues

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Fix reviewdog linter issues

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Fix reviewdog linter issues

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Fix reviewdog linter issues

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Fix reviewdog linter issues

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Fix reviewdog linter issues

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Update error message for missing config in tests to include $ACCEPTANCE_CONFIG_JSON option

Update shellcheck fail behavior in linters workflow

Enable test coverage for acceptance tests and update shellcheck source references in scripts

 Disable shellcheck warning and update service broker creation in register-broker.sh; enable test coverage in run-acceptance-tests-task.sh
 - Include Go module files and main.go in acceptance build artifacts
 - Create a placeholder main.go for acceptance tests
 - Set GOVERSION to go1.23 in mta.tpl.yaml
 - Change buildpack from binary_buildpack to go_buildpack for acceptance tests
…inkgo binary compilation in

compile-acceptance-tests.sh. Updated messaging in build-extension-file.sh for existing extension file check.
… BOSH login

 • Replace explicit environment variable checks with default assignments
 • Remove usage function and associated checks for required variables
 • Simplify BOSH login process by using a bbl_login function
Simplified the BBL_STATE_PATH setup to avoid redundant realpath calls and
prevent script failure when the default path doesn't exist. The path is now
set to the default first, then validated once with proper error suppression.

Co-Authored-By: Claude <noreply@anthropic.com>
Moved BBL_STATE_PATH initialization from vars.source.sh into the bbl_login
function in common.sh. This prevents path resolution errors when BBL is not
being used and the path doesn't exist.

Changes:
- Removed BBL_STATE_PATH setup from vars.source.sh
- Updated bbl_login() to define and validate BBL_STATE_PATH internally
- Removed parameter from all bbl_login() calls across scripts
- Simplified validation logic in scripts that checked BBL_STATE_PATH

Benefits:
- Scripts that don't use BBL won't fail if BBL_STATE_PATH doesn't exist
- BBL_STATE_PATH is only resolved when actually needed
- Cleaner separation of concerns

Co-Authored-By: Claude <noreply@anthropic.com>
 • Simplified BOSH login by removing unnecessary path canonicalization and error messages in bbl_login function.
 • Removed redundant comments and environment variable checks.
 • Updated cf_target to use non-optional variables for org and space.
 • Ensured consistent use of BBL_STATE_PATH across scripts for BOSH login.
 • Clarified error message for missing bbl-state folder in bbl_login.
Co-authored-by: Arsalan Khan <asalan316@hotmail.com>
bonzofenix and others added 2 commits January 22, 2026 19:01
The disk test was failing with a 502 Bad Gateway error because writing
800MB of random data to disk was taking longer than the HTTP timeout.

Changes:
- Made disk occupation operation asynchronous (runs in goroutine)
- HTTP response is returned immediately after starting occupation
- The isRunning flag is set before starting the goroutine to prevent races
- On error, the isRunning flag is properly reset

This fixes the failing acceptance test:
"AutoScaler dynamic policy > when there is a scaling policy for diskutil > should scale out and in"

Co-Authored-By: Claude <noreply@anthropic.com>
@bonzofenix bonzofenix changed the base branch from main to acceptance-test-tasks January 22, 2026 19:02
@bonzofenix bonzofenix added the bug label Jan 22, 2026
return err
}

if err := d.occupy(space); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how long does this synchronous occupying of disk space take? what is the configured timeout in the router?

}

func (d *defaultDiskOccupier) Occupy(space int64, duration time.Duration) error {
if err := d.checkAlreadyRunning(); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does the test for diskutil fail (as it was described in the PR description) but the one for disk not? 🤷

Context("when there is a scaling policy for disk", func() {

d.mu.Unlock()

// Start disk occupation asynchronously to avoid HTTP timeout
go func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to adjust anything here since this may be running async now? I wonder if the timeouts are still enough. Shall we potentially add a comment explaining why the timeout is longer here?

helpers.StartDiskUsage(cfg, appToScaleName, 800, 5)
helpers.WaitForNInstancesRunning(appToScaleGUID, 2, 5*time.Minute)

}

func (d *defaultDiskOccupier) Occupy(space int64, duration time.Duration) error {
if err := d.checkAlreadyRunning(); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test has not been flaky whenever it got introduced with feat(disk, diskutil): Add new metric type disk and diskutil by geigerj0 · Pull Request #2811 · cloudfoundry/app-autoscaler-release

How come it become "flaky" now out of a sudden?

Base automatically changed from acceptance-test-tasks to main January 26, 2026 11:45
@sonarqubecloud
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants