Automatic retry of failed policy automations of scripts and software installs #38018

ksykulev · 2026-01-08T05:04:35Z

Related issue: Resolves #31916

Checklist for submitter

If some of the following don't apply, delete the relevant line.

Changes file added for user-visible changes in changes/, orbit/changes/ or ee/fleetd-chrome/changes.
See Changes files for more information.
Input data is properly validated, SELECT * is avoided, SQL injection is prevented (using placeholders for values in statements)

Testing

Added/updated automated tests
Where appropriate, automated tests simulate multiple hosts and test for host isolation (updates to one host's records do not affect another)
QA'd all new/changed functionality manually

Database migrations

Checked schema for all modified tables for columns that will auto-update timestamps during migration.
Confirmed that updating the timestamps is acceptable, and will not cause unwanted side effects.
Ensured the correct collation is explicitly set for character columns (COLLATE utf8mb4_unicode_ci).

Summary by CodeRabbit

New Features
- Script and software installer policy automations now automatically retry up to three times on failure.
- Retry attempt counters automatically reset when policies transition from failing to passing state.
- Enhanced attempt tracking for improved monitoring and troubleshooting of policy automation executions.


…installs

ksykulev · 2026-01-08T05:14:51Z

server/service/integration_enterprise_test.go

+	require.NotNil(t, results[0].ExitCode)
+	require.Equal(t, int64(1), *results[0].ExitCode)
+	// queued retry
+	// Note: attempt_number is only calculated when the result is submitted, not when queued, so it will be NULL


Why am I waiting until the result comes in to update attempt_number?
When maybeRetryPolicyAutomationScript calls NewHostScriptExecutionRequest, the row in host_script_results doesn't exist yet. NewHostScriptExecutionRequest creates rows in the upcoming_activities and script_upcoming_activities tables; activateNextScriptActivity is what actually creates the host_script_results row.

We could pass attempt_number through the upcoming_activities payload, then have activateNextScriptActivity read it from the payload and insert it into host_script_results when creating the row. The current approach, which calculates attempt_number on result submission, is simpler. It just means there's a brief period of time when attempt_number is NULL. That might actually not be semantically incorrect, because orbit hasn't attempted to run the script yet.
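To make the trade-off concrete, here is a minimal sketch of assigning the attempt number only when a result arrives. The `store` type, its methods, and the `execution` struct are hypothetical stand-ins, not Fleet's datastore; the sketch only illustrates why a queued row carries no attempt number until the host reports back.

```go
package main

import "fmt"

// execution is a hypothetical stand-in for a row in host_script_results.
type execution struct {
	PolicyID      uint
	ExitCode      *int // nil until the host reports a result
	AttemptNumber *int // nil while the execution is only queued
}

// store is an in-memory stand-in for the datastore (assumption, not Fleet's API).
type store struct{ rows []*execution }

// queue records a pending execution; no attempt number is assigned yet.
func (s *store) queue(policyID uint) *execution {
	e := &execution{PolicyID: policyID}
	s.rows = append(s.rows, e)
	return e
}

// submitResult sets attempt_number to (completed attempts for the policy) + 1.
func (s *store) submitResult(e *execution, exitCode int) {
	prior := 0
	for _, r := range s.rows {
		if r.PolicyID == e.PolicyID && r.AttemptNumber != nil {
			prior++
		}
	}
	n := prior + 1
	e.ExitCode, e.AttemptNumber = &exitCode, &n
}

func main() {
	s := &store{}
	first := s.queue(1)
	fmt.Println("queued, attempt number set:", first.AttemptNumber != nil)
	s.submitResult(first, 1)
	second := s.queue(1)
	s.submitResult(second, 1)
	fmt.Println("attempts:", *first.AttemptNumber, *second.AttemptNumber)
}
```

Passing the number through the queued payload instead would remove the nil window, at the cost of threading one more field through the activation path.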

codecov · 2026-01-08T05:30:40Z

Codecov Report

❌ Patch coverage is 76.95167% with 62 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.88%. Comparing base (05d3420) to head (668f4f6).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
server/service/orbit.go	76.80%	17 Missing and 12 partials ⚠️
server/datastore/mysql/policies.go	70.12%	12 Missing and 11 partials ⚠️
...16_AddAttemptNumberToScriptsAndSoftwareInstalls.go	66.66%	4 Missing and 2 partials ⚠️
server/datastore/mysql/scripts.go	90.47%	1 Missing and 1 partial ⚠️
server/datastore/mysql/software.go	89.47%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@           Coverage Diff            @@
##             main   #38018    +/-   ##
========================================
  Coverage   65.87%   65.88%            
========================================
  Files        2392     2393     +1     
  Lines      190798   191017   +219     
  Branches     8364     8364            
========================================
+ Hits       125689   125844   +155     
- Misses      53688    53724    +36     
- Partials    11421    11449    +28

Flag	Coverage Δ
backend	`67.71% <76.95%> (+<0.01%)`	⬆️


mostlikelee · 2026-01-08T17:47:43Z

server/fleet/datastore.go


 	CleanupPolicyMembership(ctx context.Context, now time.Time) error
+	// IsPolicyStillFailing checks if a policy is currently failing for a given host.
+	IsPolicyStillFailing(ctx context.Context, policyID, hostID uint) (bool, error)


nit: "still" sounds like it's also evaluating attempt count

Suggested change

IsPolicyStillFailing(ctx context.Context, policyID, hostID uint) (bool, error)

IsPolicyFailing(ctx context.Context, policyID, hostID uint) (bool, error)

mostlikelee · 2026-01-08T17:52:03Z

server/service/orbit.go

+
+	// Only calculate attempt_number for policy automation installs
+	if currentInstall.PolicyID != nil && currentInstall.SoftwareInstallerID != nil {
+		count, err := svc.ds.CountHostSoftwareInstallAttempts(ctx, host.ID, *currentInstall.SoftwareInstallerID, *currentInstall.PolicyID)


do we save multiple rows if there are >1 successful installs? that could affect how we're counting attempts here

Indeed, we are storing multiple rows in host_script_results and host_software_installs for each attempt. Is it possible to get >1 successful installs for a policy for a host for a software installer 🤔?

I guess if the policy is failing, it attempts 2 times. Then the policy turns into a passing state. Then later it goes back to failing. That means it will only attempt it 1 more time :/ because we don't really keep track of the "sequence".
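One way to picture the "sequence" problem is a counter that is cleared whenever the policy passes, so a later failure starts with a full retry budget. This is only a sketch under that assumption; `tracker` and its methods are hypothetical and do not reflect how the PR stores attempts.

```go
package main

import "fmt"

// maxRetries mirrors the "up to three times" limit described in this PR.
const maxRetries = 3

// tracker is a hypothetical per-(host, policy) attempt counter.
type tracker struct{ attempts int }

// onAutomationResult records one finished attempt and reports whether
// another attempt may still be queued in the current sequence.
func (t *tracker) onAutomationResult() bool {
	t.attempts++
	return t.attempts < maxRetries
}

// onPolicyPassed ends the current sequence so the next failure starts fresh.
func (t *tracker) onPolicyPassed() { t.attempts = 0 }

func main() {
	tr := &tracker{}
	tr.onAutomationResult() // attempt 1
	tr.onAutomationResult() // attempt 2
	tr.onPolicyPassed()     // policy flips to passing: sequence ends
	canRetry := tr.onAutomationResult()
	fmt.Println("attempts in new sequence:", tr.attempts, "can retry:", canRetry)
}
```

Without the reset, the third call would be counted as attempt 3 and no retry would follow, which is the behavior described in the comment above.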

ksykulev · 2026-01-08T22:40:01Z

@coderabbitai full review

coderabbitai · 2026-01-08T22:40:09Z

✅ Actions performed

Full review triggered.

coderabbitai · 2026-01-08T22:47:52Z

Walkthrough

This PR implements automatic retry logic for policy automations (scripts and software installations). When a policy remains failing after triggering an automation, Fleet now retries the automation up to 3 times. The implementation adds attempt tracking to the database schema, updates datastore method signatures to handle attempt numbers, resets attempt counts when policies transition from failing to passing, and introduces retry queueing in the service layer.

Changes

Cohort / File(s)	Change Summary
Policy & Attempt Tracking `server/fleet/policies.go`, `server/datastore/mysql/policies.go`, `server/datastore/mysql/policies_test.go`	Adds `MaxPolicyAutomationRetries` constant (value 3). Implements `IsPolicyFailing` method to query if a policy is failing for a host. Adds pre-update flip detection in `RecordPolicyQueryExecutions` and `AsyncBatchInsertPolicyMembership` to reset attempt counts when policies transition from failing to passing. Includes comprehensive test coverage for flip detection and attempt resets in both sync and async paths.
Datastore Interface & Methods `server/fleet/datastore.go`, `server/fleet/scripts.go`, `server/fleet/software_installer.go`	Updates `SetHostScriptExecutionResult` and `SetHostSoftwareInstallResult` signatures to accept optional `attemptNumber *int` parameter. Adds three new public datastore methods: `IsPolicyFailing`, `CountHostScriptAttempts`, and `CountHostSoftwareInstallAttempts` to support retry tracking and policy state queries.
Scripts Datastore Implementation `server/datastore/mysql/scripts.go`, `server/datastore/mysql/scripts_test.go`	Extends `SetHostScriptExecutionResult` to accept and persist `attemptNumber`. Adds `CountHostScriptAttempts` method to count only current-sequence attempts. Updates SQL statements to populate `attempt_number` column. Includes test coverage with new `testCountHostScriptAttempts` verifying counts across multiple attempts and policy combinations.
Software Installation Datastore `server/datastore/mysql/software.go`, `server/datastore/mysql/software_test.go`, `server/datastore/mysql/software_installers.go`, `server/datastore/mysql/software_installers_test.go`	Extends `SetHostSoftwareInstallResult` to accept and persist `attemptNumber`. Adds `CountHostSoftwareInstallAttempts` method. Updates software installer queries to expose `software_installer_id` and `attempt_number` columns. Includes comprehensive test coverage for attempt counting across multiple scenarios.
Database Schema & Migrations `server/datastore/mysql/schema.sql`, `server/datastore/mysql/migrations/tables/20260108223716_.go`	Adds `attempt_number INT DEFAULT NULL` column to both `host_script_results` and `host_software_installs` tables. Includes migration up/down functions and comprehensive migration test verifying column additions and NULL handling.
Service Layer & Retry Logic `server/service/orbit.go`	Computes `attemptNumber` for policy-automation-driven scripts and software installs via `CountHostScriptAttempts` and `CountHostSoftwareInstallAttempts`. Implements gating logic to suppress activity creation for intermediate retry attempts. Adds retry queuing via `maybeRetryPolicyAutomationScript` and `maybeRetryPolicyAutomationSoftwareInstall` helpers. Modifies `SaveHostScriptResult` and `SaveHostSoftwareInstallResult` to handle policy automation retries and setup experience flows.
Mock Datastore `server/mock/datastore_mock.go`	Updates `SetHostSoftwareInstallResultFunc` and `SetHostScriptExecutionResultFunc` signatures to include `attemptNumber *int` parameter. Adds three new function types (`IsPolicyFailingFunc`, `CountHostSoftwareInstallAttemptsFunc`, `CountHostScriptAttemptsFunc`) with corresponding struct fields and invocation tracking. Implements wrapper methods for new functions.
Test Files & Helpers `server/datastore/mysql/activities_test.go`, `server/datastore/mysql/hosts_test.go`, `server/test/activities.go`, `server/datastore/mysql/software_titles_test.go`	Updates all call sites to pass additional `nil` or computed `attemptNumber` argument to `SetHostScriptExecutionResult` and `SetHostSoftwareInstallResult` to match updated signatures.
Integration Tests `server/service/integration_enterprise_test.go`	Adds extensive enterprise test suite covering policy automation retries for both scripts and software, label-based scope changes, conditional access integration, device authentication, calendar events, software installers, and maintained apps. Includes tests for retry suppression, attempt counting, scope-based retrigger scenarios, and interaction with other enterprise features.
Documentation `changes/31916-retry-policy-automations`	Adds policy note documenting that script and software installer policy automations will retry up to three times if attempts fail.

Sequence Diagram(s)

sequenceDiagram
    participant Policy as Policy Evaluation
    participant Automation as Policy Automation Trigger
    participant DataStore as Datastore
    participant Retry as Retry Manager

    Policy->>DataStore: Check if policy is failing for host
    DataStore-->>Policy: IsPolicyFailing result
    
    alt Policy is failing
        Policy->>Automation: Trigger automation (script/install)
        Automation->>DataStore: CountAttempts for this policy+automation
        DataStore-->>Automation: Current attempt count
        Automation->>DataStore: SetResult with attemptNumber
        DataStore->>DataStore: Persist result with attempt_number
        DataStore-->>Automation: Result saved
        
        alt Attempt < 3 and still failing
            Automation->>Retry: Queue retry for next policy run
        else Attempt >= 3
            Automation->>Retry: Final attempt - create activity
        end
    else Policy is passing
        DataStore->>DataStore: Reset attempt_number to NULL
        Policy->>Automation: No automation triggered
    end

sequenceDiagram
    participant Host as Host
    participant Service as Service Layer
    participant DataStore as Datastore
    participant Queue as Retry Queue

    Host->>Service: Submit script/software result
    Service->>DataStore: Fetch existing results for policy
    DataStore-->>Service: Prior results with attempt counts
    Service->>Service: Calculate nextAttemptNumber
    
    alt Is policy automation?
        Service->>DataStore: CountAttempts for this automation
        DataStore-->>Service: Current attempt count
        Service->>Service: Suppress activity creation if attempt < 3
    else Manual execution
        Service->>Service: Create activity normally
    end
    
    Service->>DataStore: SetResult with attemptNumber
    DataStore-->>Service: Result persisted
    
    alt Still failing and attempt < 3?
        Service->>Queue: Queue retry request
        Queue-->>Service: Retry queued
    else Passing or final attempt
        Service->>Service: Complete flow
    end
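The gating in the second diagram can be sketched as a small pure function. The `outcome` type and `decide` function are hypothetical, and the rule for successful attempts (always create an activity) is an assumption, since the diagram only spells out the failing path.

```go
package main

import "fmt"

// maxAttempts mirrors the limit of three described in the walkthrough.
const maxAttempts = 3

// outcome is a hypothetical summary of what the service does after a result.
type outcome struct {
	QueueRetry     bool
	CreateActivity bool
}

// decide sketches the handling of a policy-automation result.
// Assumption: an activity is created whenever no further retry is queued,
// so intermediate failed attempts stay silent.
func decide(attempt int, failed, policyStillFailing bool) outcome {
	retry := failed && policyStillFailing && attempt < maxAttempts
	return outcome{QueueRetry: retry, CreateActivity: !retry}
}

func main() {
	fmt.Printf("attempt 1, failed:    %+v\n", decide(1, true, true))
	fmt.Printf("attempt 3, failed:    %+v\n", decide(3, true, true))
	fmt.Printf("attempt 2, succeeded: %+v\n", decide(2, false, false))
}
```

Keeping the decision free of datastore calls makes the boundary cases (attempt 3, policy already passing) easy to cover with table-driven tests.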

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Ensure a host vitals refetch is queued when installs/uninstalls are successful #30505: Directly modifies SaveHostScriptResult and SaveHostSoftwareInstallResult service methods, overlapping with core retry logic implementation.
Stop setup experience on software install failure #34173: Touches service/orbit paths for result handling and modifies the same datastore method signatures for attempt number tracking.
API endpoints for Linux setup experience #32493: Modifies orbit.go's result processing flows with setup-experience handling that interacts with the retry attempt logic introduced here.

Suggested reviewers

sgress454
mostlikelee

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 11.76% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: automatic retry of policy automations for scripts and software installs.
Description check	✅ Passed	The description covers key checklist items including changes file, data validation, SQL injection prevention, automated tests, database migrations, and manual QA.
Linked Issues check	✅ Passed	The PR implements the core coding requirements from issue #31916: retry logic for script/software policy automations up to 3 times, attempt tracking, and resetting on policy transitions.
Out of Scope Changes check	✅ Passed	All code changes are directly related to implementing the retry mechanism, attempt tracking, and policy failure detection required by issue #31916.


coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In @server/datastore/mysql/software.go:
- Around line 5899-5917: The CountHostSoftwareInstallAttempts function currently
counts attempts regardless of canceled or removed records; update its SQL WHERE
clause in host_software_installs (inside CountHostSoftwareInstallAttempts) to
exclude canceled and removed attempts by adding checks like canceled_at IS NULL
and removed_at IS NULL (or the equivalent removed flag used elsewhere in the
codebase) so only active, non-removed attempts are counted for the retry limit.

In @server/service/integration_enterprise_test.go:
- Around line 18481-18482: Update the comment above the
submitPolicyResult(policy.ID, false) call to correct the typo: change "fail
policy fails to trigger first attempt" to "fail policy to trigger first attempt"
so the comment accurately describes the intent of the test step; locate the
comment adjacent to the submitPolicyResult invocation in the
integration_enterprise_test suite.
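For the first item above, the intended counting rule can be illustrated in memory before touching the SQL. The `installRow` type and its `Canceled` and `Removed` flags are hypothetical simplifications; the real table may use different columns, as the comment itself notes.

```go
package main

import "fmt"

// installRow is a hypothetical, trimmed-down view of a host_software_installs row.
// Canceled and Removed stand in for whatever columns mark those states.
type installRow struct {
	HostID, InstallerID, PolicyID uint
	Canceled, Removed             bool
}

// countActiveAttempts counts rows for the (host, installer, policy) tuple,
// skipping attempts that were canceled or removed.
func countActiveAttempts(rows []installRow, hostID, installerID, policyID uint) int {
	n := 0
	for _, r := range rows {
		if r.HostID != hostID || r.InstallerID != installerID || r.PolicyID != policyID {
			continue
		}
		if r.Canceled || r.Removed {
			continue
		}
		n++
	}
	return n
}

func main() {
	rows := []installRow{
		{HostID: 1, InstallerID: 10, PolicyID: 100},
		{HostID: 1, InstallerID: 10, PolicyID: 100, Canceled: true},
		{HostID: 2, InstallerID: 10, PolicyID: 100},
	}
	fmt.Println("active attempts for host 1:", countActiveAttempts(rows, 1, 10, 100))
}
```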

🧹 Nitpick comments (6)

server/fleet/scripts.go (1)

280-283: HostScriptResult.AttemptNumber is well-wired; consider future-proofing the comment

The new AttemptNumber *int field is correctly tagged for JSON/DB and matches the schema’s nullable int column, so it should integrate cleanly with existing scan/encode paths and is a backward-compatible API extension. The only small suggestion is that the comment bakes in “3 being max retries”; if the max ever changes, this can drift. Consider either referencing a shared constant in the comment (by name) or rephrasing to “values 1..N where N is the configured max retries” to avoid future inconsistencies.

server/datastore/mysql/migrations/tables/20260108223716_AddAttemptNumberToScriptsAndSoftwareInstalls_test.go (1)

72-78: Consider asserting row presence before indexing attempt_number slices

The test correctly validates that attempt_number defaults to NULL and persists explicit values for both tables. To make failures clearer if the setup ever breaks, you could assert the query returned exactly one row before indexing:
require.Len(t, attempt_number, 1)
require.Equal(t, sql.NullInt64{Valid: false}, attempt_number[0])
(and similarly for scriptAttemptNumber). This avoids panics and yields more informative test failures.
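Outside of testify, the same guard can be written as a tiny helper that refuses to index unless exactly one row came back. `single` is a hypothetical name used only for this sketch.

```go
package main

import (
	"database/sql"
	"fmt"
)

// single returns the only element of rows, or an error when the slice does
// not contain exactly one element, instead of panicking on rows[0].
func single[T any](rows []T) (T, error) {
	var zero T
	if len(rows) != 1 {
		return zero, fmt.Errorf("expected exactly 1 row, got %d", len(rows))
	}
	return rows[0], nil
}

func main() {
	// An empty result produces a descriptive error rather than an index panic.
	_, err := single([]sql.NullInt64{})
	fmt.Println("empty result:", err)

	// A single NULL row is returned as-is; Valid is false for NULL.
	v, err := single([]sql.NullInt64{{Valid: false}})
	fmt.Println("one row, valid:", v.Valid, "err:", err)
}
```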

Also applies to: 85-89, 90-105

server/datastore/mysql/software.go (1)

5842-5842: Consider: Add validation for attemptNumber parameter

The attemptNumber parameter is a *int with no validation. While nil is intentional (represents "currently processing"), if a non-nil value is passed, it could be negative or exceed the retry limit without any checks.

Consider adding defensive validation in the calling code to ensure attemptNumber is within the expected range (e.g., 1-3 for the 3-retry limit).
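A defensive check along these lines could look like the sketch below. It assumes the valid range is 1 through the retry limit, as suggested above; `validateAttemptNumber` and the local constant are hypothetical names, not part of the PR.

```go
package main

import "fmt"

// maxPolicyAutomationRetries mirrors the retry limit of 3 described in this review.
const maxPolicyAutomationRetries = 3

// validateAttemptNumber accepts nil (no attempt number to record) or a value
// in the range 1..maxPolicyAutomationRetries, and rejects everything else.
func validateAttemptNumber(n *int) error {
	if n == nil {
		return nil
	}
	if *n < 1 || *n > maxPolicyAutomationRetries {
		return fmt.Errorf("attempt number %d out of range [1, %d]", *n, maxPolicyAutomationRetries)
	}
	return nil
}

func main() {
	for _, v := range []int{0, 1, 3, 4} {
		v := v // copy so each pointer refers to its own value
		fmt.Println(v, "->", validateAttemptNumber(&v))
	}
	fmt.Println("nil ->", validateAttemptNumber(nil))
}
```

If the datastore ever stores 0 as a "reset" value, the lower bound would need to change accordingly.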

server/datastore/mysql/policies_test.go (1)

6478-6642: Reset-attempt tests validate correct behavior; consider minor DRY helper

The sync and async reset tests both:

Set up a failing policy (p1) and a non-transitioning policy (p2).

Seed host_script_results and host_software_installs with both completed (attempt_number = 1) and pending (attempt_number IS NULL) rows.

Drive a failing→passing transition for p1 only, and assert all p1 attempts (completed and pending) are reset to 0 while p2’s rows remain unchanged.

This accurately captures the desired retry-reset behavior and correctly scopes the verification queries by host_id and policy_id. Logic and SQL look sound.
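The failing→passing detection these tests exercise can be sketched as a comparison of the stored state against the incoming results. This is a simplified illustration: `flippedToPassing` is a hypothetical helper, and plain booleans are used where the real code may need to represent "no result yet".

```go
package main

import (
	"fmt"
	"sort"
)

// flippedToPassing returns the IDs of policies that were failing before the
// update and are passing in the incoming results. Policies with no previous
// state are ignored. Both maps go from policy ID to "passing".
func flippedToPassing(previous, incoming map[uint]bool) []uint {
	var ids []uint
	for id, passes := range incoming {
		wasPassing, known := previous[id]
		if known && !wasPassing && passes {
			ids = append(ids, id)
		}
	}
	// Sort for deterministic output; map iteration order is random in Go.
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	return ids
}

func main() {
	previous := map[uint]bool{1: false, 2: false, 3: true}
	incoming := map[uint]bool{1: true, 2: false, 3: true, 4: true}
	// Only policy 1 moved from failing to passing, so only its attempts
	// would be reset; policies 2, 3, and 4 are left alone.
	fmt.Println(flippedToPassing(previous, incoming))
}
```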

There is some nearly identical setup logic between the sync and async tests (manual INSERT statements for both policies and tables); if this grows further, you might consider a small shared helper to seed attempts for a (host, policy) pair to reduce duplication and keep future schema tweaks localized. This is purely optional test hygiene.

server/service/orbit.go (1)

1057-1065: Good defensive check, but consider error handling consistency.

The retry logic correctly checks for both PolicyID and ScriptID before attempting to retry. However, when maybeRetryPolicyAutomationScript returns an error, it's wrapped and returned, which will cause the entire SaveHostScriptResult call to fail.

Consider whether failing to queue a retry should be a hard error or just logged. The result was already saved successfully, and the user might expect at least that to persist.
Consider logging retry failures instead of returning error
 	// If this is a policy automation script that failed, maybe retry
 	if hsr != nil && hsr.PolicyID != nil && hsr.ScriptID != nil {
 		scriptFailed := hsr.ExitCode == nil || *hsr.ExitCode != 0
 		if scriptFailed {
 			if err := svc.maybeRetryPolicyAutomationScript(ctx, host, hsr); err != nil {
-				return ctxerr.Wrap(ctx, err, "retry policy automation script")
+				// Log but don't fail - the result was already saved successfully
+				level.Error(svc.logger).Log(
+					"msg", "failed to queue policy automation script retry",
+					"host_id", host.ID,
+					"policy_id", *hsr.PolicyID,
+					"err", err,
+				)
 			}
 		}
 	}

server/datastore/mysql/software_test.go (1)

10346-10497: testCountHostSoftwareInstallAttempts covers main scenarios; consider aligning first attempt with production path

The test thoroughly exercises:

Zero-attempt baseline.

Multiple attempts on the same (host, installer, policy) tuple.

Isolation from:

installs without policy_id,

a different host,

a different policy,

a different installer.

This gives solid coverage of the counting behavior.

One small realism improvement: for the first policy-driven attempt you currently call SetHostSoftwareInstallResult with attemptNumber == nil, whereas production is likely to pass attemptNumber == 1 for automation-driven installs. If CountHostSoftwareInstallAttempts or related logic ever relies directly on the attempt_number column instead of just row count, this test could miss regressions.

You could optionally pass ptr.Int(1) for the first attempt to mirror the real flow more closely:

♻️ Optional tweak to set explicit attempt number on first attempt
-	// Set result
-	_, err = ds.SetHostSoftwareInstallResult(ctx, &fleet.HostSoftwareInstallResultPayload{
-		HostID:                host.ID,
-		InstallUUID:           install1UUID,
-		InstallScriptExitCode: ptr.Int(1), // failed
-		InstallScriptOutput:   ptr.String("install failed"),
-	}, nil)
+	// Set result for first attempt (attempt_number = 1)
+	_, err = ds.SetHostSoftwareInstallResult(ctx, &fleet.HostSoftwareInstallResultPayload{
+		HostID:                host.ID,
+		InstallUUID:           install1UUID,
+		InstallScriptExitCode: ptr.Int(1), // failed
+		InstallScriptOutput:   ptr.String("install failed"),
+	}, ptr.Int(1))
If the datastore implementation intentionally ignores attempt_number for counting and only uses row count, leaving this as-is is also acceptable.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 05d3420 and 0553ab0.

📒 Files selected for processing (23)

changes/31916-retry-policy-automations
server/datastore/mysql/activities_test.go
server/datastore/mysql/hosts_test.go
server/datastore/mysql/migrations/tables/20260108223716_AddAttemptNumberToScriptsAndSoftwareInstalls.go
server/datastore/mysql/migrations/tables/20260108223716_AddAttemptNumberToScriptsAndSoftwareInstalls_test.go
server/datastore/mysql/policies.go
server/datastore/mysql/policies_test.go
server/datastore/mysql/schema.sql
server/datastore/mysql/scripts.go
server/datastore/mysql/scripts_test.go
server/datastore/mysql/software.go
server/datastore/mysql/software_installers.go
server/datastore/mysql/software_installers_test.go
server/datastore/mysql/software_test.go
server/datastore/mysql/software_titles_test.go
server/fleet/datastore.go
server/fleet/policies.go
server/fleet/scripts.go
server/fleet/software_installer.go
server/mock/datastore_mock.go
server/service/integration_enterprise_test.go
server/service/orbit.go
server/test/activities.go

🧰 Additional context used

📓 Path-based instructions (1)

**/*.go

⚙️ CodeRabbit configuration file

When reviewing SQL queries that are added or modified, ensure that appropriate filtering criteria are applied—especially when a query is intended to return data for a specific entity (e.g., a single host). Check for missing WHERE clauses or incorrect filtering that could lead to incorrect or non-deterministic results (e.g., returning the first row instead of the correct one). Flag any queries that may return unintended results due to lack of precise scoping.

Files:

server/fleet/policies.go
server/datastore/mysql/scripts.go
server/datastore/mysql/activities_test.go
server/datastore/mysql/policies.go
server/service/orbit.go
server/fleet/software_installer.go
server/datastore/mysql/policies_test.go
server/datastore/mysql/migrations/tables/20260108223716_AddAttemptNumberToScriptsAndSoftwareInstalls.go
server/fleet/datastore.go
server/datastore/mysql/scripts_test.go
server/fleet/scripts.go
server/test/activities.go
server/datastore/mysql/migrations/tables/20260108223716_AddAttemptNumberToScriptsAndSoftwareInstalls_test.go
server/datastore/mysql/software.go
server/datastore/mysql/software_titles_test.go
server/datastore/mysql/software_installers_test.go
server/service/integration_enterprise_test.go
server/datastore/mysql/software_installers.go
server/datastore/mysql/hosts_test.go
server/mock/datastore_mock.go
server/datastore/mysql/software_test.go

🧠 Learnings (9)

📚 Learning: 2025-08-08T07:40:05.301Z

Learnt from: getvictor
Repo: fleetdm/fleet PR: 31726
File: server/datastore/mysql/labels_test.go:2031-2031
Timestamp: 2025-08-08T07:40:05.301Z
Learning: Fleet repo targets Go 1.24.5 (root go.mod), which supports testing.T.Context(). Do not flag usage of t.Context() or suggest replacing it with context.Background() in tests (e.g., server/datastore/mysql/labels_test.go Line 2031 and similar).

Applied to files:

server/datastore/mysql/activities_test.go

📚 Learning: 2025-07-08T16:12:48.797Z

Learnt from: getvictor
Repo: fleetdm/fleet PR: 30589
File: ee/server/service/hostidentity/depot/depot.go:108-111
Timestamp: 2025-07-08T16:12:48.797Z
Learning: In ee/server/service/hostidentity/depot/depot.go, the SCEP depot interface methods like Put() do not accept context parameters, and the common_mysql.WithRetryTxx callback function type TxFn only receives a transaction parameter, not a context. Therefore, using context.Background() in tx.ExecContext calls within the transaction callback is the correct approach.

Applied to files:

server/datastore/mysql/policies.go

📚 Learning: 2025-08-29T00:30:04.502Z

Learnt from: getvictor
Repo: fleetdm/fleet PR: 32438
File: server/datastore/mysql/scripts.go:505-515
Timestamp: 2025-08-29T00:30:04.502Z
Learning: In the Fleet codebase, database operations are typically wrapped in `ds.withRetryTxx` which provides transaction isolation, eliminating the need for additional race condition checks within the same transaction boundary.

Applied to files:

server/datastore/mysql/policies.go

📚 Learning: 2025-07-08T16:13:39.114Z

Learnt from: getvictor
Repo: fleetdm/fleet PR: 30589
File: server/datastore/mysql/migrations/tables/20250707095725_HostIdentitySCEPCertificates.go:53-55
Timestamp: 2025-07-08T16:13:39.114Z
Learning: In the Fleet codebase, Down migration functions are intentionally left empty/no-op. The team does not implement rollback functionality for database migrations, so empty Down_* functions in migration files are correct and should not be flagged as issues.

Applied to files:

server/datastore/mysql/migrations/tables/20260108223716_AddAttemptNumberToScriptsAndSoftwareInstalls.go

📚 Learning: 2025-08-13T18:20:42.136Z

Learnt from: titanous
Repo: fleetdm/fleet PR: 31075
File: tools/redis-tests/elasticache/iam_auth.go:4-10
Timestamp: 2025-08-13T18:20:42.136Z
Learning: For test harnesses and CLI tools in the Fleet codebase, resource cleanup on error paths (like closing connections before log.Fatalf) may not be necessary since the OS handles cleanup when the process exits. These tools prioritize simplicity over defensive programming patterns used in production code.

Applied to files:

server/datastore/mysql/scripts_test.go
server/test/activities.go

📚 Learning: 2025-07-08T16:11:49.555Z

Learnt from: getvictor
Repo: fleetdm/fleet PR: 30589
File: ee/server/service/hostidentity/depot/depot.go:115-115
Timestamp: 2025-07-08T16:11:49.555Z
Learning: In ee/server/service/hostidentity/depot/depot.go, the error from result.RowsAffected() is intentionally ignored because the information is only used for logging purposes, not for critical program logic.

Applied to files:

server/datastore/mysql/software.go

📚 Learning: 2025-10-03T18:16:11.482Z

Learnt from: MagnusHJensen
Repo: fleetdm/fleet PR: 33805
File: server/service/integration_mdm_test.go:1248-1251
Timestamp: 2025-10-03T18:16:11.482Z
Learning: In server/service/integration_mdm_test.go, the helper createAppleMobileHostThenEnrollMDM(platform string) is exclusively for iOS/iPadOS hosts (mobile). Do not flag macOS model/behavior issues based on changes within this helper; macOS provisioning uses different helpers such as createHostThenEnrollMDM.

Applied to files:

server/service/integration_enterprise_test.go
server/datastore/mysql/software_test.go

📚 Learning: 2025-08-01T15:08:16.858Z

Learnt from: sgress454
Repo: fleetdm/fleet PR: 31508
File: server/datastore/mysql/schema.sql:102-116
Timestamp: 2025-08-01T15:08:16.858Z
Learning: The schema.sql file in server/datastore/mysql/ is auto-generated from migrations for use with tests, so it cannot be manually edited. Any changes must be made through migrations.

Applied to files:

server/datastore/mysql/schema.sql

📚 Learning: 2025-09-12T13:04:23.777Z

Learnt from: getvictor
Repo: fleetdm/fleet PR: 32823
File: server/datastore/mysql/software.go:4457-4471
Timestamp: 2025-09-12T13:04:23.777Z
Learning: In the Fleet codebase, the `status` column in the `host_software_installs` table is defined as `GENERATED ALWAYS`, meaning it's automatically computed by the database based on other field values (like exit codes) and should not be explicitly included in INSERT statements.

Applied to files:

server/datastore/mysql/schema.sql
server/datastore/mysql/software_installers.go

🧬 Code graph analysis (8)

server/datastore/mysql/scripts.go (2)

server/fleet/datastore.go (1)

Datastore (50-2625)

server/fleet/scripts.go (2)

HostScriptResultPayload (196-203)

HostScriptResult (208-284)

server/datastore/mysql/policies_test.go (2)

server/ptr/ptr.go (2)

T (86-88)

Bool (35-37)

server/fleet/policies.go (1)

PolicyMembershipResult (466-470)

server/fleet/datastore.go (3)

server/test/activities.go (1)

SetHostSoftwareInstallResult (66-74)

server/fleet/software_installer.go (1)

HostSoftwareInstallResultPayload (936-953)

server/fleet/scripts.go (1)

HostScriptResult (208-284)

server/datastore/mysql/software.go (2)

server/fleet/datastore.go (1)

Datastore (50-2625)

server/fleet/software_installer.go (1)

HostSoftwareInstallResultPayload (936-953)

server/datastore/mysql/software_installers_test.go (3)

server/test/activities.go (1)

SetHostSoftwareInstallResult (66-74)

server/fleet/software_installer.go (1)

HostSoftwareInstallResultPayload (936-953)

server/ptr/ptr.go (1)

Int (15-17)

server/service/integration_enterprise_test.go (5)

server/fleet/hosts.go (1)

Host (279-411)

server/fleet/software_installer.go (1)

UploadSoftwareInstallerPayload (497-533)

server/datastore/mysql/testing_utils.go (1)

ExecAdhocSQL (421-425)

server/fleet/policies.go (3)

PolicyPayload (15-59)

ModifyPolicyPayload (170-206)

Policy (280-303)

server/fleet/scripts.go (2)

Script (17-32)

HostScriptResult (208-284)

server/mock/datastore_mock.go (3)

server/fleet/software_installer.go (1)

HostSoftwareInstallResultPayload (936-953)

server/fleet/scripts.go (2)

HostScriptResultPayload (196-203)

HostScriptResult (208-284)

server/test/activities.go (1)

SetHostSoftwareInstallResult (66-74)

server/datastore/mysql/software_test.go (2)

server/test/activities.go (1)

SetHostSoftwareInstallResult (66-74)

server/fleet/datastore.go (1)

Datastore (50-2625)

🔇 Additional comments (43)

server/fleet/policies.go (1)

110-112: LGTM: Well-documented retry limit constant.

The addition of MaxPolicyAutomationRetries clearly defines the retry behavior for policy automations. The constant value of 3 aligns with the PR objectives, and the comment provides clear context.

server/fleet/software_installer.go (1)

422-425: LGTM: Well-designed attempt tracking field.

The AttemptNumber field addition is well-implemented:

Pointer type correctly represents optional nature (nil for non-policy installs)

Clear documentation of semantics (nil = not policy-triggered, 1-3 for retry attempts)

Appropriate JSON tag with omitempty

DB tag matches expected schema

changes/31916-retry-policy-automations (1)

1-1: LGTM: Clear user-facing documentation.

The changelog entry clearly communicates the retry behavior to users. The "up to three times" limit matches the MaxPolicyAutomationRetries constant defined in the codebase.

server/datastore/mysql/software_titles_test.go (1)

2232-2232: Signature update looks correct.

The addition of nil as the third argument correctly updates the call to match the new SetHostSoftwareInstallResult signature. Passing nil is appropriate for this test since it's not testing policy automation retry behavior.

Comprehensive test coverage for the attemptNumber parameter already exists in the integration tests, including verification of attempt counting (1, 2, 3), behavior at maximum attempts, and persistence of attempt numbers.

server/datastore/mysql/schema.sql (3)

975-1009: host_script_results.attempt_number schema looks correct and matches Go struct

The new attempt_number int DEFAULT NULL column integrates cleanly here: type and nullability match HostScriptResult.AttemptNumber *int (db:"attempt_number"), and there are no dependent generated columns/constraints to adjust. No changes needed from this file’s side.

Based on learnings, this looks like the expected output of the new migration rather than a manual edit.

1045-1084: host_software_installs.attempt_number addition is consistent with retries design

Similarly, adding attempt_number int DEFAULT NULL is consistent with the retry-tracking design and does not interfere with the existing generated status / execution_status columns. It gives the datastore a straightforward place to persist attempt numbers for installs.

Based on learnings, this also appears to come from regenerating schema.sql from the latest migrations.

1759-1768: Migration status table updates are internally consistent

The AUTO_INCREMENT=469 and the extended INSERT INTO migration_status_tables list (up through id 468 / version 20260108223716) are coherent: last inserted id matches AUTO_INCREMENT−1, and the versions sequence continues as expected. This aligns with adding the new migration that introduced the attempt_number columns.

Based on learnings, this matches the normal pattern of schema.sql being regenerated from migrations.

server/datastore/mysql/software_installers.go (1)

1422-1505: LGTM! Query modifications correctly add retry tracking columns.

The additions to GetSoftwareInstallResults properly extend both parts of the UNION query:

software_installer_id, source, and attempt_number are added consistently to both the historical (host_software_installs) and upcoming (upcoming_activities) result paths

The NULL AS attempt_number for upcoming activities is appropriate since those installs haven't been attempted yet

The existing filtering by execution_id ensures precise scoping, and the ua.activated_at IS NULL condition prevents duplicates between the UNION parts

The changes align with the PR's objective to track retry attempts for policy automations.

server/datastore/mysql/hosts_test.go (2)

4199-4217: SetHostScriptExecutionResult signature usage looks correct

The added nil argument aligns these test calls with the new SetHostScriptExecutionResult signature while preserving existing behavior; using nil is appropriate here since these tests don’t validate attempt-count logic.

8491-8496: Consistent use of new optional attempt parameter in SetHostScriptExecutionResult

This updated call also correctly passes nil for the new attempt parameter, matching the intended optional semantics without changing the test’s behavior.

server/datastore/mysql/migrations/tables/20260108223716_AddAttemptNumberToScriptsAndSoftwareInstalls.go (1)

9-34: Migration shape and error handling look correct

The Up migration cleanly adds nullable attempt_number columns to both tables with contextual error wrapping, and the no-op Down function matches this repo’s convention of non-rollbackable migrations. Based on learnings, leaving Down_20260108223716 empty is expected.

server/datastore/mysql/software_installers_test.go (1)

183-188: Test callsites correctly adapted to new attemptNumber parameter

All updated calls to SetHostSoftwareInstallResult and SetHostScriptExecutionResult now pass a trailing nil for the new attemptNumber argument, which is appropriate for these non-policy / non-retry scenarios. Assertions remain unchanged, so behavior is preserved while tests stay aligned with the new API.

Also applies to: 198-203, 239-244, 404-409, 460-465, 531-536, 552-557, 1280-1285, 1353-1358, 2275-2280, 2359-2365, 2411-2424, 3133-3139

server/test/activities.go (1)

30-36: Helpers correctly pass nil attemptNumber for non-policy executions

The activity helpers now pass nil as the attemptNumber when recording script and software results for manual/uninstall flows, which matches the intended “not policy automation” semantics and keeps tests aligned with the updated datastore interface.

Also applies to: 66-73, 105-113

server/datastore/mysql/scripts.go (1)

145-229: SetHostScriptExecutionResult / GetHostScriptExecutionResult wiring for attempt_number looks sound; please double-check NULL semantics in CountHostScriptAttempts

SetHostScriptExecutionResult now updates host_script_results.attempt_number alongside the other result fields and keeps the duplicate-result guard and MDM/uninstall handling intact.

getHostScriptExecutionResultDB correctly projects batch_execution_id and attempt_number for active results, and supplies NULL for both in the upcoming-activity path, keeping the struct mapping consistent.

CountHostScriptAttempts is properly scoped by host_id, script_id, and policy_id, and only counts rows where attempt_number > 0 OR attempt_number IS NULL, effectively treating attempt_number = 0 as “old sequence” while including current-sequence (positive) and unset attempts.

One nuance to confirm: any pre-existing host_script_results rows for a given (host_id, script_id, policy_id) that predate this feature (and thus have attempt_number = NULL) will be included in the count. That means the very first retry-aware run for a long-lived policy/host pair could start with a non-zero attempt count unless some higher-level logic proactively resets those older rows to attempt_number = 0 when the policy is known to have passed. If that’s not desired, you may want to revisit how/when old rows are normalized.

As per coding guidelines, the new COUNT query uses precise filtering and avoids under-scoped SELECT usage.

Also applies to: 387-417, 469-487
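The counting predicate discussed above can be illustrated without a database. The following is a minimal sketch, assuming a hypothetical in-memory `attemptRow` type and helper names that do not exist in the PR; it only mirrors the `attempt_number > 0 OR attempt_number IS NULL` filter, and it shows why rows that still hold NULL (including any that predate the feature) are counted.

```go
package main

// attemptRow is a hypothetical stand-in for one host_script_results row
// that already matches the host_id, script_id and policy_id filters.
type attemptRow struct {
	AttemptNumber *int // nil models SQL NULL
}

// countCurrentAttempts mirrors the SQL predicate
// (attempt_number > 0 OR attempt_number IS NULL):
// rows reset to 0 are treated as an old sequence and skipped.
func countCurrentAttempts(rows []attemptRow) int {
	n := 0
	for _, r := range rows {
		if r.AttemptNumber == nil || *r.AttemptNumber > 0 {
			n++
		}
	}
	return n
}

// intPtr is a small illustrative helper, similar in spirit to ptr.Int.
func intPtr(v int) *int { return &v }
```

With one row reset to 0, two rows numbered 1 and 2, and one row still NULL, the sketch counts three attempts. A legacy NULL row would be counted the same way, which is the nuance raised above.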

server/datastore/mysql/activities_test.go (1)

1285-1288: LGTM! Test call sites updated to match expanded API signatures.

The test updates correctly pass nil for the new optional attemptNumber parameter in SetHostScriptExecutionResult and SetHostSoftwareInstallResult calls, which is appropriate since these tests don't exercise the retry behavior.

Also applies to: 1426-1431, 1448-1453, 1678-1683, 1693-1697, 1705-1710

server/datastore/mysql/policies.go (3)

540-613: LGTM! Flip detection and attempt reset logic is correctly implemented.

The implementation properly:

Computes newly-passing policies before updating membership to compare against previous state

Resets attempt_number to 0 for both script results and software installs when policies flip from failing to passing

Uses proper SQL filtering with host_id and policy_id IN (?) clauses

The flip detection happening before the transaction is a pragmatic design choice that avoids holding locks during computation, with the withRetryTxx providing conflict resolution.
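As a rough illustration of the flip detection described here, the sketch below compares stored results with incoming ones. The function name, the map-based inputs, and the rule that only an explicit stored failure counts as "previously failing" are assumptions made for this example; the PR's actual datastore code may treat NULL or missing rows differently.

```go
package main

import "sort"

// newlyPassingPolicies returns, in ascending order, the IDs of policies whose
// stored result was an explicit failure and whose incoming result is a pass.
// A nil *bool models a NULL result (the policy did not complete).
func newlyPassingPolicies(previous, incoming map[uint]*bool) []uint {
	var flipped []uint
	for policyID, now := range incoming {
		if now == nil || !*now {
			continue // not passing in the incoming results
		}
		before, ok := previous[policyID]
		if ok && before != nil && !*before {
			flipped = append(flipped, policyID)
		}
	}
	sort.Slice(flipped, func(i, j int) bool { return flipped[i] < flipped[j] })
	return flipped
}

// boolPtr is an illustrative helper for building test inputs.
func boolPtr(v bool) *bool { return &v }
```

The returned IDs are the ones for which attempt_number would be reset to 0. Computing them from the previous state first is what allows the reset to be limited to policies that actually flipped.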

1428-1484: LGTM! Async batch processing correctly handles per-host flip detection.

The implementation properly extends the flip detection and attempt reset logic to handle batches of hosts:

Groups incoming results per host before computing flips

Computes newly-passing policies separately for each host

Resets attempt numbers within the same transaction as the membership upsert

This ensures consistency between sync and async policy membership update paths.

1755-1775: LGTM! IsPolicyFailing correctly implements the failure check semantics.

The method correctly returns true (failing) when:

No membership record exists (sql.ErrNoRows) — policy hasn't been evaluated yet

passes is nil — policy execution didn't complete

passes is false — policy is explicitly failing

This aligns with the retry logic requirement: automations should retry while the policy remains in a non-passing state. The SQL query properly filters on both policy_id and host_id.
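The three cases listed above reduce to a small truth table. This is a minimal sketch with hypothetical names, assuming the membership lookup has already been performed: `found == false` stands for the sql.ErrNoRows case and a nil `passes` stands for a NULL column.

```go
package main

// policyIsFailing models the semantics described for IsPolicyFailing.
//   - found == false: no membership row (the sql.ErrNoRows case) -> failing
//   - passes == nil:  policy execution did not complete          -> failing
//   - *passes false:  policy explicitly failing                  -> failing
//   - *passes true:   policy passing                             -> not failing
func policyIsFailing(found bool, passes *bool) bool {
	if !found || passes == nil {
		return true
	}
	return !*passes
}

// passResult is an illustrative helper for building inputs.
func passResult(v bool) *bool { return &v }
```

Only an explicit pass stops retries under this reading; every other state is treated as "still failing".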

server/fleet/datastore.go (3)

687-687: LGTM: Signature extension supports retry tracking.

The addition of attemptNumber *int parameter is well-designed. Using a pointer allows distinguishing between policy-triggered installs (non-nil with attempt number) and regular installs (nil), which aligns with the retry feature requirements.

868-875: LGTM: New methods provide proper scoping parameters.

The three new methods for policy retry logic are well-designed:

Each method includes both host and policy identifiers, enabling precise filtering

Comments clearly explain their purpose

Parameter types are consistent with existing datastore patterns

Note: The actual implementations should be reviewed to ensure SQL queries use these parameters properly with WHERE clauses (per coding guidelines for **/*.go).

1827-1827: LGTM: Consistent signature extension.

The attemptNumber *int parameter addition mirrors the change to SetHostSoftwareInstallResult, maintaining consistency across the retry tracking implementation.

server/datastore/mysql/policies_test.go (2)

83-85: New policy retry tests correctly wired into TestPolicies suite

The three new helpers are cleanly added to the table-driven TestPolicies suite with descriptive names and consistent ordering. No changes needed.

6414-6476: IsPolicyFailing test semantics look correct and comprehensive

This test exercises the important edge cases for IsPolicyFailing (no membership row, passes = NULL, passes = false, passes = true, different hosts, different policies) and clearly documents the intended semantics (treat missing/NULL/false as failing, true as passing). This aligns well with how policy automations typically reason about “still failing” vs “resolved”. No issues from my side.

server/datastore/mysql/scripts_test.go (3)

57-57: LGTM!

New test case entry correctly wired into the test suite.

114-134: LGTM!

All SetHostScriptExecutionResult calls correctly updated to pass nil for the new attemptNumber parameter, which is appropriate for non-policy-automation script executions.

2989-3132: LGTM! Comprehensive test coverage for attempt counting.

The test covers:

Zero attempts case

Multiple consecutive attempts with the same policy/script

Executions without policy_id (should not affect count)

Different hosts with same policy/script

Same host with different policies

Same host/policy with different scripts

One minor observation: at line 3029, the first call to SetHostScriptExecutionResult passes nil for attemptNumber (correct for the initial attempt), but subsequent calls at lines 3053 and 3077 pass explicit attempt numbers via ptr.Int(2) and ptr.Int(3). This asymmetry is intentional to test that the count increments correctly regardless of what's passed, which is good coverage.

server/service/orbit.go (6)

1020-1033: Verify activity suppression logic handles nil ExitCode correctly.

The condition at line 1028 checks hsr.ExitCode == nil || *hsr.ExitCode != 0 to determine if the script failed. This is correct for the failure check. However, when hsr.AttemptNumber is not nil and the script succeeded (exit code 0), shouldCreateActivity remains true, which is the expected behavior.

The logic correctly suppresses activity creation only for intermediate failed attempts (attempts 1 and 2 when max is 3).

1482-1496: Consistent approach with script attempt counting.

The software install attempt counting mirrors the script counting logic, which is good for consistency. The same observation about verifying the counting logic applies here.

1556-1572: Activity suppression and retry logic looks correct but has asymmetric error handling.

The retry is triggered inside the activity suppression block for failed installs, and errors are only logged (line 1558-1563), which is different from the script retry handling. This inconsistency is actually better—logging instead of failing is more resilient.

However, note that maybeRetryPolicyAutomationSoftwareInstall is called before checking shouldCreateActivity, meaning a retry could be queued even if we then create an activity (when AttemptNumber is nil, which shouldn't happen for policy automation but is worth noting).

915-934: The counting logic is correct. The CountHostScriptAttempts SQL query explicitly includes rows where attempt_number IS NULL in addition to those where attempt_number > 0 (line 480 of server/datastore/mysql/scripts.go). This means newly created rows with NULL attempt_number are counted. Therefore, when a new script execution is created via NewHostScriptExecutionRequest, the subsequent CountHostScriptAttempts call returns a count that already includes that new row. Setting attemptNumber = &count produces the correct attempt number with no off-by-one error.

1607-1642: The nil check on AttemptNumber is defensive programming and does not indicate a data inconsistency issue.

The concern about race conditions is unfounded because the operations are sequential, not concurrent. SetHostSoftwareInstallResult uses a database transaction (withRetryTxx) that ensures the attempt_number is committed before GetSoftwareInstallResults is called immediately after. Additionally, maybeRetryPolicyAutomationSoftwareInstall is only invoked when status == fleet.SoftwareInstallFailed, which can only originate from the first UNION part of the GetSoftwareInstallResults query (the host_software_installs table), which always includes the attempt_number field. The WHERE clause properly filters by execution_id, ensuring the correct record is retrieved.

1644-1682: Script retry correctly uses original execution contents for reproducibility.

The maybeRetryPolicyAutomationScript function correctly passes hsr.ScriptContents (the original script contents from the initial execution) to the retry request. This is intentional: each script execution is versioned through separate script_contents records, ensuring that retries reproduce the exact same script execution for proper debugging and auditability. If the script is later modified via UpdateScriptContents, the retry will still use the original contents that caused the policy to fail—which is the correct behavior.

server/datastore/mysql/software_test.go (2)

106-107: Registering testCountHostSoftwareInstallAttempts in matrix looks good

The new test is correctly wired into TestSoftware, so it will run with the rest of the datastore tests. No issues here.

4175-4331: Signature updates for SetHostSoftwareInstallResult / SetHostScriptExecutionResult are consistent

Across these blocks, all updated call sites now pass the new third argument:

nil for legacy/non-automation flows where attempt numbering is irrelevant.

Explicit *int only where tests care about attempt semantics.

Argument ordering and payloads remain unchanged, so behavior is preserved while satisfying the new API.

Also applies to: 6116-6149, 6200-6206, 6313-6319, 9131-9138, 9220-9221

server/mock/datastore_mock.go (3)

536-540: SetHostSoftwareInstallResult mock correctly extended with attemptNumber

The updated SetHostSoftwareInstallResultFunc type and its wrapper now accept and forward attemptNumber *int, matching the production signature while preserving the existing lock/flag pattern. This keeps tests able to observe and drive retry-specific behavior without altering the mock conventions.

Also applies to: 6095-6099

684-688: New policy retry helpers are consistently wired through the mock

IsPolicyFailingFunc, CountHostSoftwareInstallAttemptsFunc, and CountHostScriptAttemptsFunc are added as function types, stored on DataStore with matching *Invoked flags, and exposed via thin wrappers that set the flag under lock and delegate with the correct parameter order. This is consistent with the rest of the mock and should integrate cleanly with the new retry logic.

Also applies to: 2726-2733, 6613-6632

1210-1212: SetHostScriptExecutionResult mock updated to support attemptNumber

The SetHostScriptExecutionResultFunc type and its wrapper now include attemptNumber *int and forward it correctly while retaining the existing invocation-tracking pattern. This keeps the mock aligned with the main datastore interface and supports script retry tests without behavioral surprises.

Also applies to: 8454-8458

server/service/integration_enterprise_test.go (6)

7107-7107: LGTM!

The additional nil parameter correctly passes no attempt number for this test case, which is appropriate since this test isn't specifically exercising the retry tracking logic.

17134-17267: Well-structured test setup with properly scoped queries.

The test setup is comprehensive and all SQL queries include appropriate filtering criteria (host_id, software_installer_id, policy_id) to ensure correct data isolation. The helper functions nicely encapsulate the test operations.

17269-17361: LGTM!

The retry logic testing correctly verifies:

Attempt number progression (1 → 2 → 3)

Activity is created only after the final attempt

No additional retries are queued after reaching the maximum of 3 attempts

17362-17459: LGTM!

The test correctly verifies:

Policy passing resets the retry counter (attempts marked as obsolete with attempt_number=0)

A fresh retry sequence can be started after reset (new attempts start at 1)

The CountHostSoftwareInstallAttempts datastore method behaves correctly

The final database state verification at lines 17444-17458 provides good coverage of the expected state.

18280-18383: Well-structured test setup with properly scoped queries.

Similar to the software install test, all SQL queries include appropriate filtering criteria (host_id, script_id, policy_id). The helper functions are well-designed for the test operations.

18385-18576: LGTM!

The script retry test correctly verifies the same behaviors as the software install test:

Retry attempt progression (1 → 2 → 3)

Activity creation only after final attempt

Policy passing resets attempts

Fresh sequence can start after reset

The parallel structure between both tests provides good consistency and coverage of the retry feature for both automation types.

coderabbitai · 2026-01-08T22:47:55Z

server/datastore/mysql/software.go

+func (ds *Datastore) CountHostSoftwareInstallAttempts(ctx context.Context, hostID, softwareInstallerID, policyID uint) (int, error) {
+	var count int
+	// Only count attempts from the current retry sequence.
+	// When a policy passes, all attempt_number values are reset to 0 to mark them as "old sequence".
+	// We count attempts where attempt_number > 0 (current sequence) OR attempt_number IS NULL (currently being processed).
+	err := sqlx.GetContext(ctx, ds.reader(ctx), &count, `
+		SELECT COUNT(*)
+		FROM host_software_installs
+		WHERE host_id = ?
+		  AND software_installer_id = ?
+		  AND policy_id = ?
+		  AND (attempt_number > 0 OR attempt_number IS NULL)
+	`, hostID, softwareInstallerID, policyID)
+	if err != nil {
+		return 0, ctxerr.Wrap(ctx, err, "count host software install attempts")
+	}
+
+	return count, nil
+}


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
echo "=== Finding calls to CountHostSoftwareInstallAttempts ==="
rg -n -C 5 'CountHostSoftwareInstallAttempts' --type go
echo -e "\n=== Looking for retry limit logic (3 retries) ==="
rg -n -C 3 'attempt.*3|max.*attempt|retry.*limit' --type go server/
echo -e "\n=== Checking host_software_installs table schema ==="
rg -n -C 5 'host_software_installs' --type go | head -100

Repository: fleetdm/fleet

Length of output: 50369

🏁 Script executed:

#!/bin/bash
echo "=== Checking how removed and canceled are used in CountHostSoftwareInstallAttempts context ==="
rg -n -C 10 'removed|canceled' server/datastore/mysql/software.go | head -150
echo -e "\n=== Checking orbit.go usage more carefully ==="
rg -n -B 15 -A 15 'CountHostSoftwareInstallAttempts' server/service/orbit.go
echo -e "\n=== Looking for integration test that checks behavior with canceled/removed ==="
rg -n -C 20 'removed.*=.*1\|canceled.*=.*1' server/service/integration_enterprise_test.go | head -100

Repository: fleetdm/fleet

Length of output: 8644

🏁 Script executed:

#!/bin/bash
echo "=== Checking testCountHostSoftwareInstallAttempts for any canceled/removed tests ==="
sed -n '10346,10497p' server/datastore/mysql/software_test.go
echo -e "\n=== Checking if there are other tests with canceled/removed installs ==="
rg -n 'canceled.*=.*1|removed.*=.*1' server/datastore/mysql/software_test.go

Repository: fleetdm/fleet

Length of output: 5550

🏁 Script executed:

#!/bin/bash
echo "=== Looking for code that sets canceled = 1 or removed = 1 ==="
rg -n 'canceled.*=.*1|removed.*=.*1' server/datastore/mysql/software.go | head -30
echo -e "\n=== Looking for CancelSoftwareInstall or similar cancel logic ==="
rg -n -C 5 'CancelSoftwareInstall|cancel.*software.*install' --type go server/ | head -80

Repository: fleetdm/fleet

Length of output: 6776

🏁 Script executed:

#!/bin/bash
echo "=== Looking for code that sets canceled = 1 when activity is deleted ==="
rg -n -B 5 -A 5 'canceled.*=.*1' server/datastore/mysql/ | grep -A 10 -B 10 'activity\|DELETE\|upcoming'
echo -e "\n=== Check the DeleteUpcomingActivity or similar function ==="
rg -n -C 10 'DeleteUpcomingActivity|canceled' server/datastore/mysql/activities.go | head -100

Repository: fleetdm/fleet

Length of output: 11600

Add filters for removed and canceled installs to retry attempt count.

The CountHostSoftwareInstallAttempts query should exclude canceled and removed attempts, consistent with how all other queries in the codebase handle these records. Currently, if an admin cancels an attempt or a record is marked as removed, it still counts toward the 3-retry limit, which prevents valid retries.

Recommended fix

 func (ds *Datastore) CountHostSoftwareInstallAttempts(ctx context.Context, hostID, softwareInstallerID, policyID uint) (int, error) {
 	var count int
 	// Only count attempts from the current retry sequence.
 	// When a policy passes, all attempt_number values are reset to 0 to mark them as "old sequence".
 	// We count attempts where attempt_number > 0 (current sequence) OR attempt_number IS NULL (currently being processed).
 	err := sqlx.GetContext(ctx, ds.reader(ctx), &count, `
 		SELECT COUNT(*)
 		FROM host_software_installs
 		WHERE host_id = ?
 		  AND software_installer_id = ?
 		  AND policy_id = ?
+		  AND removed = 0
+		  AND canceled = 0
 		  AND (attempt_number > 0 OR attempt_number IS NULL)
 	`, hostID, softwareInstallerID, policyID)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-func (ds *Datastore) CountHostSoftwareInstallAttempts(ctx context.Context, hostID, softwareInstallerID, policyID uint) (int, error) {
-	var count int
-	// Only count attempts from the current retry sequence.
-	// When a policy passes, all attempt_number values are reset to 0 to mark them as "old sequence".
-	// We count attempts where attempt_number > 0 (current sequence) OR attempt_number IS NULL (currently being processed).
-	err := sqlx.GetContext(ctx, ds.reader(ctx), &count, `
-		SELECT COUNT(*)
-		FROM host_software_installs
-		WHERE host_id = ?
-		  AND software_installer_id = ?
-		  AND policy_id = ?
-		  AND (attempt_number > 0 OR attempt_number IS NULL)
-	`, hostID, softwareInstallerID, policyID)
-	if err != nil {
-		return 0, ctxerr.Wrap(ctx, err, "count host software install attempts")
-	}
-	return count, nil
-}
+func (ds *Datastore) CountHostSoftwareInstallAttempts(ctx context.Context, hostID, softwareInstallerID, policyID uint) (int, error) {
+	var count int
+	// Only count attempts from the current retry sequence.
+	// When a policy passes, all attempt_number values are reset to 0 to mark them as "old sequence".
+	// We count attempts where attempt_number > 0 (current sequence) OR attempt_number IS NULL (currently being processed).
+	err := sqlx.GetContext(ctx, ds.reader(ctx), &count, `
+		SELECT COUNT(*)
+		FROM host_software_installs
+		WHERE host_id = ?
+		  AND software_installer_id = ?
+		  AND policy_id = ?
+		  AND removed = 0
+		  AND canceled = 0
+		  AND (attempt_number > 0 OR attempt_number IS NULL)
+	`, hostID, softwareInstallerID, policyID)
+	if err != nil {
+		return 0, ctxerr.Wrap(ctx, err, "count host software install attempts")
+	}
+	return count, nil
+}

🤖 Prompt for AI Agents

In @server/datastore/mysql/software.go around lines 5899 - 5917, the CountHostSoftwareInstallAttempts function currently counts attempts regardless of whether the records are canceled or removed. Update the WHERE clause of its query on host_software_installs to exclude them by adding removed = 0 and canceled = 0 (the flags used in the recommended fix above), so that only active, non-removed attempts count toward the retry limit.

coderabbitai · 2026-01-08T22:47:55Z

server/service/integration_enterprise_test.go

+	// fail policy fails to trigger first attempt
+	submitPolicyResult(policy.ID, false)


⚠️ Potential issue | 🟡 Minor

Minor typo in comment.

The comment reads "fail policy fails to trigger first attempt" but should likely be "fail policy to trigger first attempt".

📝 Suggested fix

 // pass policy
 submitPolicyResult(policy.ID, true)
-// fail policy fails to trigger first attempt
+// fail policy to trigger first attempt
 submitPolicyResult(policy.ID, false)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-// fail policy fails to trigger first attempt
-submitPolicyResult(policy.ID, false)
+// pass policy
+submitPolicyResult(policy.ID, true)
+// fail policy to trigger first attempt
+submitPolicyResult(policy.ID, false)

🤖 Prompt for AI Agents

In @server/service/integration_enterprise_test.go around lines 18481 - 18482, Update the comment above the submitPolicyResult(policy.ID, false) call to correct the typo: change "fail policy fails to trigger first attempt" to "fail policy to trigger first attempt" so the comment accurately describes the intent of the test step; locate the comment adjacent to the submitPolicyResult invocation in the integration_enterprise_test suite.

ksykulev added 2 commits January 7, 2026 23:03

Automatic retry of failed policy automations of scripts and software …

6de4e06

…installs

Merge branch 'main' into 31916-policy-automation-retry

40edaac

ksykulev temporarily deployed to Docker Hub January 8, 2026 05:07 — with GitHub Actions Inactive

ksykulev commented Jan 8, 2026

mostlikelee reviewed Jan 8, 2026

ksykulev added 2 commits January 8, 2026 16:34

retries work across multiple policy success/failure transitions

dc1b2f5

changes file

0a2a7e6

ksykulev marked this pull request as ready for review January 8, 2026 22:36

ksykulev requested a review from a team as a code owner January 8, 2026 22:36

Merge branch 'main' into 31916-policy-automation-retry

0553ab0

ksykulev temporarily deployed to Docker Hub January 8, 2026 22:38 — with GitHub Actions Inactive

ksykulev requested review from juan-fdz-hawa, mostlikelee and nulmete January 8, 2026 22:38

coderabbitai bot reviewed Jan 8, 2026

fixes

668f4f6

ksykulev temporarily deployed to Docker Hub January 8, 2026 23:27 — with GitHub Actions Inactive

3 participants

	IsPolicyStillFailing(ctx context.Context, policyID, hostID uint) (bool, error)
	IsPolicyFailing(ctx context.Context, policyID, hostID uint) (bool, error)

		// fail policy fails to trigger first attempt
		submitPolicyResult(policy.ID, false)
