@xgboosted
Contributor

This PR introduces built-in, configurable retry logic to the runner creation step in the Hetzner GitHub Action. By letting users specify the number of retry attempts and the delay between attempts, it aims to improve robustness against transient API/network errors and to mitigate temporary resource unavailability.


Key Changes

1. New Inputs in action.yml

  • create_retries: Number of retry attempts for runner creation (default: 1).
  • create_retry_delay: Delay (in seconds) between attempts (default: 10).
  • Both are passed as environment variables to the shell script.

2. Retry Logic in action.sh

  • Parses the new environment variables and validates them as integers.
  • Wraps the Hetzner Cloud server creation (curl -X POST ...) in a retry loop:
    • Logs each attempt and error.
    • Retries up to the configured limit, waiting the specified delay between attempts.
    • Exits with a clear error message if all attempts fail.
  • Only applies to the server creation step (not deletion or unrelated steps).
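
The loop described above could look roughly like the following. This is a minimal sketch, not the code in action.sh: the function names retry_command and is_uint and the MY_ variables are hypothetical, and it wraps an arbitrary command rather than the actual curl -X POST call.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a retry loop; names are illustrative only.

# Return 0 if the argument is a non-negative integer.
is_uint() {
	[[ "$1" =~ ^[0-9]+$ ]]
}

# retry_command MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry_command() {
	local MY_MAX="$1" MY_DELAY="$2"
	shift 2
	# Validate both settings as integers before using them.
	if ! is_uint "$MY_MAX" || ! is_uint "$MY_DELAY"; then
		echo "retries and delay must be non-negative integers" >&2
		return 2
	fi
	# Force base 10 so values such as "08" are not parsed as octal.
	MY_MAX=$((10#$MY_MAX))
	MY_DELAY=$((10#$MY_DELAY))
	if [[ "$MY_MAX" -lt 1 ]]; then
		echo "retries must be at least 1" >&2
		return 2
	fi
	local MY_ATTEMPT=1
	while true; do
		echo "Attempt $MY_ATTEMPT of $MY_MAX" >&2
		if "$@"; then
			return 0
		fi
		if [[ "$MY_ATTEMPT" -ge "$MY_MAX" ]]; then
			echo "All $MY_MAX attempts failed" >&2
			return 1
		fi
		echo "Attempt $MY_ATTEMPT failed, waiting ${MY_DELAY}s" >&2
		sleep "$MY_DELAY"
		MY_ATTEMPT=$((MY_ATTEMPT + 1))
	done
}
```

A call such as `retry_command 3 15 some_command` would try up to three times with 15 seconds between attempts. The wrapped command has to exit non-zero on failure; for curl that typically means adding --fail or checking the HTTP status separately.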

3. Documentation in README.md

  • Inputs table updated to include create_retries and create_retry_delay.
  • New section describes the retry logic, its purpose, and usage.

Example Usage

with:
  mode: create
  github_token: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
  hcloud_token: ${{ secrets.HCLOUD_TOKEN }}
  create_retries: 3
  create_retry_delay: 15

Motivation

  • Addresses transient failures in Hetzner API or network connectivity.
  • Reduces manual intervention for temporary issues.
  • Provides clear feedback and logging for troubleshooting.

Style & Compatibility

  • All new code follows project conventions: uppercase MY_ variables, lowercase functions, tab indentation, and preservation of comments.
  • No breaking changes; defaults maintain previous behavior.

Closes: #8

Introduce create_retries and create_retry_delay inputs to control the number of retry attempts and delay between attempts when creating a Hetzner Cloud server. Update action.sh to implement the retry loop and improve error handling for transient failures. Document new options in README.md and action.yml to enhance reliability.
@Cyclenerd
Owner

OK, thanks for the revised pull request. Now I can read and understand it much better. curl has a built-in retry option. Wouldn't that be enough? Then we could avoid the extra loop.

Documentation: https://everything.curl.dev/usingcurl/downloads/retry.html

Example:

curl --retry 12 --retry-all-errors "https://api.hetzner.cloud/v1/servers"

@xgboosted
Contributor Author

The --retry feature of the curl command (as described at https://everything.curl.dev/usingcurl/downloads/retry.html) cannot always be used in CI/CD shell scripts for these reasons:

  1. Portability: Not all environments have a curl version that supports --retry. Many CI runners or minimal Linux images ship with older curl versions lacking this flag, so relying on it can break cross-platform compatibility.

  2. Control Over Retry Logic: The built-in --retry flag only retries on network errors or certain HTTP codes, and does not allow fine-grained control over which failures to retry, how to handle output, or custom logging per attempt. Custom shell logic allows for more detailed error handling, logging, and integration with other script steps.

  3. Consistent Logging and Output: With a manual retry loop, you can log each attempt, capture and inspect output, and decide exactly what to do on each failure. This is important for debugging and for providing clear feedback in CI logs.

  4. Project Coding Standards: Some projects require all retry logic to be explicit in the script for auditability, readability, and to ensure all error handling is visible and testable.

Summary:
While curl --retry is convenient, custom retry logic in shell scripts is more portable, auditable, and flexible—especially in environments where you can't guarantee curl's version or want detailed control over error handling and logging.

@Cyclenerd
Owner

I have a question about the curl version and the retry mechanism. I'm trying to understand the reasoning behind the suggested changes fully.

Since the script runs in a GitHub Actions runner with a standard image, are we concerned about the curl version not supporting --retry? The --retry flag has been around for a very long time (since curl 7.12.3 https://curl.se/ch/7.12.3.html), and I suspect it's included in all curl versions used in the GitHub Actions environment.

Also, I'm wondering if retrying on all non-zero exit codes is the best approach. For example, if the API returns a 401, should we really keep retrying? curl --retry has built-in logic to avoid retrying on certain error codes, which seems more robust. My understanding is that the original issue was about network errors, and curl --retry is specifically designed to handle those. Could retrying on everything potentially mask underlying issues or cause unnecessary load? Maybe your AI missed that?
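
To make the concern concrete, here is a minimal sketch of a check that separates retryable from non-retryable responses, assuming the loop has the HTTP status code at hand. The function name is_retryable_status is hypothetical. The status list follows the codes that the curl man page describes as transient for --retry; which status the Hetzner API returns when resources are exhausted is not stated in this thread and would have to be added if it should be retried.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decide whether an HTTP status is worth retrying.
# Assumption: 408, 429 and the listed 5xx codes are treated as transient;
# everything else (for example 401) is treated as a permanent failure.
is_retryable_status() {
	case "$1" in
		408|429|500|502|503|504) return 0 ;;
		*) return 1 ;;
	esac
}

# Demonstration with a few sample codes.
for MY_CODE in 401 429 503; do
	if is_retryable_status "$MY_CODE"; then
		echo "$MY_CODE: retry"
	else
		echo "$MY_CODE: give up"
	fi
done
```

With a check like this, a 401 would stop the loop immediately instead of repeating the same failing request until the attempts run out.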

@xgboosted
Contributor Author

My understanding is that the original issue was about network errors

The main issue is "resource unavailability," where Cloud Servers are unavailable because of resource exhaustion. This has been happening since last year, and I face it every day from 10 a.m. to 12 p.m. German time. When I tested the branch by releasing the Action to the marketplace (that is why I had to delete the last word "Cloud" once in the prior PR, as duplicate Actions are not allowed), I set 50 tries, and the Cloud server was created on the 27th try (see the screenshot).

Hetzner's notice on this issue since last year: https://status.hetzner.com/incident/aa5ce33b-faa5-4fd0-9782-fde43cd270cf

[Screenshot: Cloud server created on the 27th try]

@Cyclenerd Cyclenerd merged commit 480db35 into Cyclenerd:master May 17, 2025
1 check passed
@xgboosted
Contributor Author

@Cyclenerd thanks for accepting the PR.

I am guessing that only one variable was added, named create_wait, whose value is the total number of tries, while the wait interval is hardcoded in action.sh as 10 seconds? So create_wait = 20 means retrying for 200 seconds at 10-second intervals.

Did I get it right?

I think an update to the example template in the readme.md would be helpful :)

@Cyclenerd
Owner

Yes:

20 * 10 = 200 sec

Default:

360 * 10 = 3600 sec (1 hour)
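
The same arithmetic as a small shell sketch, assuming the fixed 10-second interval mentioned above. The helper name total_wait_seconds is hypothetical and not part of action.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: total waiting time in seconds for a create_wait value.
# Assumption: the interval defaults to the 10 seconds discussed in this thread.
total_wait_seconds() {
	local MY_TRIES="$1" MY_INTERVAL="${2:-10}"
	echo $((MY_TRIES * MY_INTERVAL))
}

total_wait_seconds 20   # prints 200
total_wait_seconds 360  # prints 3600
```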

@Cyclenerd
Owner

In my opinion, nothing beyond the description and explanation in the table is necessary. Adding it to the example would likely confuse many users, since it only affects a very small number of people.

@xgboosted
Contributor Author

xgboosted commented May 18, 2025

This is how I have defined the new parameter. Is this correct?

   steps:
      - name: Create runner
        id: create-runner
        uses: Cyclenerd/hcloud-github-runner@v1.1.0
        with:
          mode: create
          github_token: ${{ ----------- }}
          hcloud_token: ${{  -----------  }}
          server_type: ${{  -----------  }}
          location: ${{  -----------  }}
          image: ${{  -----------  }}
          name: ${{  -----------  }}
          create_wait: 100
        continue-on-error: false

@xgboosted xgboosted deleted the feature-create-rerty-logic branch June 4, 2025 14:40