Skip to content

Conversation

@ksykulev
Copy link
Contributor

@ksykulev ksykulev commented Jan 8, 2026

Related issue: Resolves #31916

Checklist for submitter

If some of the following don't apply, delete the relevant line.

  • Changes file added for user-visible changes in changes/, orbit/changes/ or ee/fleetd-chrome/changes.
    See Changes files for more information.
  • Input data is properly validated, SELECT * is avoided, SQL injection is prevented (using placeholders for values in statements)

Testing

Database migrations

  • Checked schema for all modified table for columns that will auto-update timestamps during migration.
  • Confirmed that updating the timestamps is acceptable, and will not cause unwanted side effects.
  • Ensured the correct collation is explicitly set for character columns (COLLATE utf8mb4_unicode_ci).

Summary by CodeRabbit

  • New Features
    • Script and software installer policy automations now automatically retry up to three times on failure.
    • Retry attempt counters automatically reset when policies transition from failing to passing state.
    • Enhanced attempt tracking for improved monitoring and troubleshooting of policy automation executions.

✏️ Tip: You can customize this high-level summary in your review settings.

require.NotNil(t, results[0].ExitCode)
require.Equal(t, int64(1), *results[0].ExitCode)
// queued retry
// Note: attempt_number is only calculated when the result is submitted, not when queued, so it will be NULL
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why am I waiting until the result comes in to update attempt_number?
When maybeRetryPolicyAutomationScript calls NewHostScriptExecutionRequest, the row in host_script_results doesn't exist yet. NewHostScriptExecutionRequest creates rows in upcoming_activities and script_upcoming_activities tables. activateNextScriptActivity is what actually creates the host_script_results row. We could pass the attempt_number through the upcoming_activities payload, then have activateNextScriptActivity read it from the payload and insert it into host_script_results when creating the row. The current approach calculates on result submission is simpler - it just means there's a brief period of time when attempt_number is NULL. This might actually not be semantically incorrect because orbit hasn't actually attempted to run the script.

@codecov
Copy link

codecov bot commented Jan 8, 2026

Codecov Report

❌ Patch coverage is 76.95167% with 62 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.88%. Comparing base (05d3420) to head (668f4f6).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
server/service/orbit.go 76.80% 17 Missing and 12 partials ⚠️
server/datastore/mysql/policies.go 70.12% 12 Missing and 11 partials ⚠️
...16_AddAttemptNumberToScriptsAndSoftwareInstalls.go 66.66% 4 Missing and 2 partials ⚠️
server/datastore/mysql/scripts.go 90.47% 1 Missing and 1 partial ⚠️
server/datastore/mysql/software.go 89.47% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##             main   #38018    +/-   ##
========================================
  Coverage   65.87%   65.88%            
========================================
  Files        2392     2393     +1     
  Lines      190798   191017   +219     
  Branches     8364     8364            
========================================
+ Hits       125689   125844   +155     
- Misses      53688    53724    +36     
- Partials    11421    11449    +28     
Flag Coverage Δ
backend 67.71% <76.95%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.


CleanupPolicyMembership(ctx context.Context, now time.Time) error
// IsPolicyStillFailing checks if a policy is currently failing for a given host.
IsPolicyStillFailing(ctx context.Context, policyID, hostID uint) (bool, error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "still" sounds like it's also evaluating attempt count

Suggested change
IsPolicyStillFailing(ctx context.Context, policyID, hostID uint) (bool, error)
IsPolicyFailing(ctx context.Context, policyID, hostID uint) (bool, error)


// Only calculate attempt_number for policy automation installs
if currentInstall.PolicyID != nil && currentInstall.SoftwareInstallerID != nil {
count, err := svc.ds.CountHostSoftwareInstallAttempts(ctx, host.ID, *currentInstall.SoftwareInstallerID, *currentInstall.PolicyID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we save multiple rows if there are >1 successful installs? that could affect how we're counting attempts here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, we are storing multiple rows in host_script_results and host_software_installs for each attempt. Is it possible to get >1 successful installs for a policy for a host for a software installer 🤔?

Copy link
Contributor Author

@ksykulev ksykulev Jan 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess if the policy is failing. it attempts 2 times. Then the policy turns into a successful state. Then later it goes back to failed. It means it will only attempt it 1 time. : / because we don't really keep track of the "sequence".

@ksykulev ksykulev marked this pull request as ready for review January 8, 2026 22:36
@ksykulev ksykulev requested a review from a team as a code owner January 8, 2026 22:36
@ksykulev
Copy link
Contributor Author

ksykulev commented Jan 8, 2026

@coderabbitai full review

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 8, 2026

✅ Actions performed

Full review triggered.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 8, 2026

Walkthrough

This PR implements automatic retry logic for policy automations (scripts and software installations). When a policy remains failing after triggering an automation, Fleet now retries the automation up to 3 times. The implementation adds attempt tracking to the database schema, updates datastore method signatures to handle attempt numbers, resets attempt counts when policies transition from failing to passing, and introduces retry queueing in the service layer.

Changes

Cohort / File(s) Change Summary
Policy & Attempt Tracking
server/fleet/policies.go, server/datastore/mysql/policies.go, server/datastore/mysql/policies_test.go
Adds MaxPolicyAutomationRetries constant (value 3). Implements IsPolicyFailing method to query if a policy is failing for a host. Adds pre-update flip detection in RecordPolicyQueryExecutions and AsyncBatchInsertPolicyMembership to reset attempt counts when policies transition from failing to passing. Includes comprehensive test coverage for flip detection and attempt resets in both sync and async paths.
Datastore Interface & Methods
server/fleet/datastore.go, server/fleet/scripts.go, server/fleet/software_installer.go
Updates SetHostScriptExecutionResult and SetHostSoftwareInstallResult signatures to accept optional attemptNumber *int parameter. Adds three new public datastore methods: IsPolicyFailing, CountHostScriptAttempts, and CountHostSoftwareInstallAttempts to support retry tracking and policy state queries.
Scripts Datastore Implementation
server/datastore/mysql/scripts.go, server/datastore/mysql/scripts_test.go
Extends SetHostScriptExecutionResult to accept and persist attemptNumber. Adds CountHostScriptAttempts method to count only current-sequence attempts. Updates SQL statements to populate attempt_number column. Includes test coverage with new testCountHostScriptAttempts verifying counts across multiple attempts and policy combinations.
Software Installation Datastore
server/datastore/mysql/software.go, server/datastore/mysql/software_test.go, server/datastore/mysql/software_installers.go, server/datastore/mysql/software_installers_test.go
Extends SetHostSoftwareInstallResult to accept and persist attemptNumber. Adds CountHostSoftwareInstallAttempts method. Updates software installer queries to expose software_installer_id and attempt_number columns. Includes comprehensive test coverage for attempt counting across multiple scenarios.
Database Schema & Migrations
server/datastore/mysql/schema.sql, server/datastore/mysql/migrations/tables/20260108223716_*.go*
Adds attempt_number INT DEFAULT NULL column to both host_script_results and host_software_installs tables. Includes migration up/down functions and comprehensive migration test verifying column additions and NULL handling.
Service Layer & Retry Logic
server/service/orbit.go
Computes attemptNumber for policy-automation-driven scripts and software installs via CountHostScriptAttempts and CountHostSoftwareInstallAttempts. Implements gating logic to suppress activity creation for intermediate retry attempts. Adds retry queuing via maybeRetryPolicyAutomationScript and maybeRetryPolicyAutomationSoftwareInstall helpers. Modifies SaveHostScriptResult and SaveHostSoftwareInstallResult to handle policy automation retries and setup experience flows.
Mock Datastore
server/mock/datastore_mock.go
Updates SetHostSoftwareInstallResultFunc and SetHostScriptExecutionResultFunc signatures to include attemptNumber *int parameter. Adds three new function types (IsPolicyFailingFunc, CountHostSoftwareInstallAttemptsFunc, CountHostScriptAttemptsFunc) with corresponding struct fields and invocation tracking. Implements wrapper methods for new functions.
Test Files & Helpers
server/datastore/mysql/activities_test.go, server/datastore/mysql/hosts_test.go, server/test/activities.go, server/datastore/mysql/software_titles_test.go
Updates all call sites to pass additional nil or computed attemptNumber argument to SetHostScriptExecutionResult and SetHostSoftwareInstallResult to match updated signatures.
Integration Tests
server/service/integration_enterprise_test.go
Adds extensive enterprise test suite covering policy automation retries for both scripts and software, label-based scope changes, conditional access integration, device authentication, calendar events, software installers, and maintained apps. Includes tests for retry suppression, attempt counting, scope-based retrigger scenarios, and interaction with other enterprise features.
Documentation
changes/31916-retry-policy-automations
Adds policy note documenting that script and software installer policy automations will retry up to three times if attempts fail.

Sequence Diagram(s)

sequenceDiagram
    participant Policy as Policy Evaluation
    participant Automation as Policy Automation Trigger
    participant DataStore as Datastore
    participant Retry as Retry Manager

    Policy->>DataStore: Check if policy is failing for host
    DataStore-->>Policy: IsPolicyFailing result
    
    alt Policy is failing
        Policy->>Automation: Trigger automation (script/install)
        Automation->>DataStore: CountAttempts for this policy+automation
        DataStore-->>Automation: Current attempt count
        Automation->>DataStore: SetResult with attemptNumber
        DataStore->>DataStore: Persist result with attempt_number
        DataStore-->>Automation: Result saved
        
        alt Attempt < 3 and still failing
            Automation->>Retry: Queue retry for next policy run
        else Attempt >= 3
            Automation->>Retry: Final attempt - create activity
        end
    else Policy is passing
        DataStore->>DataStore: Reset attempt_number to NULL
        Policy->>Automation: No automation triggered
    end
Loading
sequenceDiagram
    participant Host as Host
    participant Service as Service Layer
    participant DataStore as Datastore
    participant Queue as Retry Queue

    Host->>Service: Submit script/software result
    Service->>DataStore: Fetch existing results for policy
    DataStore-->>Service: Prior results with attempt counts
    Service->>Service: Calculate nextAttemptNumber
    
    alt Is policy automation?
        Service->>DataStore: CountAttempts for this automation
        DataStore-->>Service: Current attempt count
        Service->>Service: Suppress activity creation if attempt < 3
    else Manual execution
        Service->>Service: Create activity normally
    end
    
    Service->>DataStore: SetResult with attemptNumber
    DataStore-->>Service: Result persisted
    
    alt Still failing and attempt < 3?
        Service->>Queue: Queue retry request
        Queue-->>Service: Retry queued
    else Passing or final attempt
        Service->>Service: Complete flow
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested reviewers

  • sgress454
  • mostlikelee
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 11.76% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: automatic retry of policy automations for scripts and software installs.
Description check ✅ Passed The description covers key checklist items including changes file, data validation, SQL injection prevention, automated tests, database migrations, and manual QA.
Linked Issues check ✅ Passed The PR implements the core coding requirements from issue #31916: retry logic for script/software policy automations up to 3 times, attempt tracking, and resetting on policy transitions.
Out of Scope Changes check ✅ Passed All code changes are directly related to implementing the retry mechanism, attempt tracking, and policy failure detection required by issue #31916.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch 31916-policy-automation-retry

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @server/datastore/mysql/software.go:
- Around line 5899-5917: The CountHostSoftwareInstallAttempts function currently
counts attempts regardless of canceled or removed records; update its SQL WHERE
clause in host_software_installs (inside CountHostSoftwareInstallAttempts) to
exclude canceled and removed attempts by adding checks like canceled_at IS NULL
and removed_at IS NULL (or the equivalent removed flag used elsewhere in the
codebase) so only active, non-removed attempts are counted for the retry limit.

In @server/service/integration_enterprise_test.go:
- Around line 18481-18482: Update the comment above the
submitPolicyResult(policy.ID, false) call to correct the typo: change "fail
policy fails to trigger first attempt" to "fail policy to trigger first attempt"
so the comment accurately describes the intent of the test step; locate the
comment adjacent to the submitPolicyResult invocation in the
integration_enterprise_test suite.
🧹 Nitpick comments (6)
server/fleet/scripts.go (1)

280-283: HostScriptResult.AttemptNumber is well-wired; consider future-proofing the comment

The new AttemptNumber *int field is correctly tagged for JSON/DB and matches the schema’s nullable int column, so it should integrate cleanly with existing scan/encode paths and is a backward-compatible API extension. The only small suggestion is that the comment bakes in “3 being max retries”; if the max ever changes, this can drift—consider either referencing a shared constant in the comment (by name) or rephrasing to “values 1..N where N is the configured max retries” to avoid future inconsistencies.

server/datastore/mysql/migrations/tables/20260108223716_AddAttemptNumberToScriptsAndSoftwareInstalls_test.go (1)

72-78: Consider asserting row presence before indexing attempt_number slices

The test correctly validates that attempt_number defaults to NULL and persists explicit values for both tables. To make failures clearer if the setup ever breaks, you could assert the query returned exactly one row before indexing:

require.Len(t, attempt_number, 1)
require.Equal(t, sql.NullInt64{Valid: false}, attempt_number[0])

(and similarly for scriptAttemptNumber). This avoids panics and yields more informative test failures.

Also applies to: 85-89, 90-105

server/datastore/mysql/software.go (1)

5842-5842: Consider: Add validation for attemptNumber parameter

The attemptNumber parameter is a *int with no validation. While nil is intentional (represents "currently processing"), if a non-nil value is passed, it could be negative or exceed the retry limit without any checks.

Consider adding defensive validation in the calling code to ensure attemptNumber is within the expected range (e.g., 1-3 for the 3-retry limit).

server/datastore/mysql/policies_test.go (1)

6478-6642: Reset-attempt tests validate correct behavior; consider minor DRY helper

The sync and async reset tests both:

  • Set up a failing policy (p1) and a non-transitioning policy (p2).
  • Seed host_script_results and host_software_installs with both completed (attempt_number = 1) and pending (attempt_number IS NULL) rows.
  • Drive a failing→passing transition for p1 only, and assert all p1 attempts (completed and pending) are reset to 0 while p2’s rows remain unchanged.

This accurately captures the desired retry-reset behavior and correctly scopes the verification queries by host_id and policy_id. Logic and SQL look sound.

There is some nearly identical setup logic between the sync and async tests (manual INSERT statements for both policies and tables); if this grows further, you might consider a small shared helper to seed attempts for a (host, policy) pair to reduce duplication and keep future schema tweaks localized. This is purely optional test hygiene.

server/service/orbit.go (1)

1057-1065: Good defensive check, but consider error handling consistency.

The retry logic correctly checks for both PolicyID and ScriptID before attempting to retry. However, when maybeRetryPolicyAutomationScript returns an error, it's wrapped and returned, which will cause the entire SaveHostScriptResult call to fail.

Consider whether failing to queue a retry should be a hard error or just logged. The result was already saved successfully, and the user might expect at least that to persist.

Consider logging retry failures instead of returning error
 	// If this is a policy automation script that failed, maybe retry
 	if hsr != nil && hsr.PolicyID != nil && hsr.ScriptID != nil {
 		scriptFailed := hsr.ExitCode == nil || *hsr.ExitCode != 0
 		if scriptFailed {
 			if err := svc.maybeRetryPolicyAutomationScript(ctx, host, hsr); err != nil {
-				return ctxerr.Wrap(ctx, err, "retry policy automation script")
+				// Log but don't fail - the result was already saved successfully
+				level.Error(svc.logger).Log(
+					"msg", "failed to queue policy automation script retry",
+					"host_id", host.ID,
+					"policy_id", *hsr.PolicyID,
+					"err", err,
+				)
 			}
 		}
 	}
server/datastore/mysql/software_test.go (1)

10346-10497: testCountHostSoftwareInstallAttempts covers main scenarios; consider aligning first attempt with production path

The test thoroughly exercises:

  • Zero-attempt baseline.
  • Multiple attempts on the same (host, installer, policy) tuple.
  • Isolation from:
    • installs without policy_id,
    • a different host,
    • a different policy,
    • a different installer.

This gives solid coverage of the counting behavior.

One small realism improvement: for the first policy-driven attempt you currently call SetHostSoftwareInstallResult with attemptNumber == nil, whereas production is likely to pass attemptNumber == 1 for automation-driven installs. If CountHostSoftwareInstallAttempts or related logic ever relies directly on the attempt_number column instead of just row count, this test could miss regressions.

You could optionally pass ptr.Int(1) for the first attempt to mirror the real flow more closely:

[inspect for fit before changing; optional]

♻️ Optional tweak to set explicit attempt number on first attempt
-	// Set result
-	_, err = ds.SetHostSoftwareInstallResult(ctx, &fleet.HostSoftwareInstallResultPayload{
-		HostID:                host.ID,
-		InstallUUID:           install1UUID,
-		InstallScriptExitCode: ptr.Int(1), // failed
-		InstallScriptOutput:   ptr.String("install failed"),
-	}, nil)
+	// Set result for first attempt (attempt_number = 1)
+	_, err = ds.SetHostSoftwareInstallResult(ctx, &fleet.HostSoftwareInstallResultPayload{
+		HostID:                host.ID,
+		InstallUUID:           install1UUID,
+		InstallScriptExitCode: ptr.Int(1), // failed
+		InstallScriptOutput:   ptr.String("install failed"),
+	}, ptr.Int(1))

If the datastore implementation intentionally ignores attempt_number for counting and only uses row count, leaving this as-is is also acceptable.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 05d3420 and 0553ab0.

📒 Files selected for processing (23)
  • changes/31916-retry-policy-automations
  • server/datastore/mysql/activities_test.go
  • server/datastore/mysql/hosts_test.go
  • server/datastore/mysql/migrations/tables/20260108223716_AddAttemptNumberToScriptsAndSoftwareInstalls.go
  • server/datastore/mysql/migrations/tables/20260108223716_AddAttemptNumberToScriptsAndSoftwareInstalls_test.go
  • server/datastore/mysql/policies.go
  • server/datastore/mysql/policies_test.go
  • server/datastore/mysql/schema.sql
  • server/datastore/mysql/scripts.go
  • server/datastore/mysql/scripts_test.go
  • server/datastore/mysql/software.go
  • server/datastore/mysql/software_installers.go
  • server/datastore/mysql/software_installers_test.go
  • server/datastore/mysql/software_test.go
  • server/datastore/mysql/software_titles_test.go
  • server/fleet/datastore.go
  • server/fleet/policies.go
  • server/fleet/scripts.go
  • server/fleet/software_installer.go
  • server/mock/datastore_mock.go
  • server/service/integration_enterprise_test.go
  • server/service/orbit.go
  • server/test/activities.go
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go

⚙️ CodeRabbit configuration file

When reviewing SQL queries that are added or modified, ensure that appropriate filtering criteria are applied—especially when a query is intended to return data for a specific entity (e.g., a single host). Check for missing WHERE clauses or incorrect filtering that could lead to incorrect or non-deterministic results (e.g., returning the first row instead of the correct one). Flag any queries that may return unintended results due to lack of precise scoping.

Files:

  • server/fleet/policies.go
  • server/datastore/mysql/scripts.go
  • server/datastore/mysql/activities_test.go
  • server/datastore/mysql/policies.go
  • server/service/orbit.go
  • server/fleet/software_installer.go
  • server/datastore/mysql/policies_test.go
  • server/datastore/mysql/migrations/tables/20260108223716_AddAttemptNumberToScriptsAndSoftwareInstalls.go
  • server/fleet/datastore.go
  • server/datastore/mysql/scripts_test.go
  • server/fleet/scripts.go
  • server/test/activities.go
  • server/datastore/mysql/migrations/tables/20260108223716_AddAttemptNumberToScriptsAndSoftwareInstalls_test.go
  • server/datastore/mysql/software.go
  • server/datastore/mysql/software_titles_test.go
  • server/datastore/mysql/software_installers_test.go
  • server/service/integration_enterprise_test.go
  • server/datastore/mysql/software_installers.go
  • server/datastore/mysql/hosts_test.go
  • server/mock/datastore_mock.go
  • server/datastore/mysql/software_test.go
🧠 Learnings (9)
📚 Learning: 2025-08-08T07:40:05.301Z
Learnt from: getvictor
Repo: fleetdm/fleet PR: 31726
File: server/datastore/mysql/labels_test.go:2031-2031
Timestamp: 2025-08-08T07:40:05.301Z
Learning: Fleet repo targets Go 1.24.5 (root go.mod), which supports testing.T.Context(). Do not flag usage of t.Context() or suggest replacing it with context.Background() in tests (e.g., server/datastore/mysql/labels_test.go Line 2031 and similar).

Applied to files:

  • server/datastore/mysql/activities_test.go
📚 Learning: 2025-07-08T16:12:48.797Z
Learnt from: getvictor
Repo: fleetdm/fleet PR: 30589
File: ee/server/service/hostidentity/depot/depot.go:108-111
Timestamp: 2025-07-08T16:12:48.797Z
Learning: In ee/server/service/hostidentity/depot/depot.go, the SCEP depot interface methods like Put() do not accept context parameters, and the common_mysql.WithRetryTxx callback function type TxFn only receives a transaction parameter, not a context. Therefore, using context.Background() in tx.ExecContext calls within the transaction callback is the correct approach.

Applied to files:

  • server/datastore/mysql/policies.go
📚 Learning: 2025-08-29T00:30:04.502Z
Learnt from: getvictor
Repo: fleetdm/fleet PR: 32438
File: server/datastore/mysql/scripts.go:505-515
Timestamp: 2025-08-29T00:30:04.502Z
Learning: In the Fleet codebase, database operations are typically wrapped in `ds.withRetryTxx` which provides transaction isolation, eliminating the need for additional race condition checks within the same transaction boundary.

Applied to files:

  • server/datastore/mysql/policies.go
📚 Learning: 2025-07-08T16:13:39.114Z
Learnt from: getvictor
Repo: fleetdm/fleet PR: 30589
File: server/datastore/mysql/migrations/tables/20250707095725_HostIdentitySCEPCertificates.go:53-55
Timestamp: 2025-07-08T16:13:39.114Z
Learning: In the Fleet codebase, Down migration functions are intentionally left empty/no-op. The team does not implement rollback functionality for database migrations, so empty Down_* functions in migration files are correct and should not be flagged as issues.

Applied to files:

  • server/datastore/mysql/migrations/tables/20260108223716_AddAttemptNumberToScriptsAndSoftwareInstalls.go
📚 Learning: 2025-08-13T18:20:42.136Z
Learnt from: titanous
Repo: fleetdm/fleet PR: 31075
File: tools/redis-tests/elasticache/iam_auth.go:4-10
Timestamp: 2025-08-13T18:20:42.136Z
Learning: For test harnesses and CLI tools in the Fleet codebase, resource cleanup on error paths (like closing connections before log.Fatalf) may not be necessary since the OS handles cleanup when the process exits. These tools prioritize simplicity over defensive programming patterns used in production code.

Applied to files:

  • server/datastore/mysql/scripts_test.go
  • server/test/activities.go
📚 Learning: 2025-07-08T16:11:49.555Z
Learnt from: getvictor
Repo: fleetdm/fleet PR: 30589
File: ee/server/service/hostidentity/depot/depot.go:115-115
Timestamp: 2025-07-08T16:11:49.555Z
Learning: In ee/server/service/hostidentity/depot/depot.go, the error from result.RowsAffected() is intentionally ignored because the information is only used for logging purposes, not for critical program logic.

Applied to files:

  • server/datastore/mysql/software.go
📚 Learning: 2025-10-03T18:16:11.482Z
Learnt from: MagnusHJensen
Repo: fleetdm/fleet PR: 33805
File: server/service/integration_mdm_test.go:1248-1251
Timestamp: 2025-10-03T18:16:11.482Z
Learning: In server/service/integration_mdm_test.go, the helper createAppleMobileHostThenEnrollMDM(platform string) is exclusively for iOS/iPadOS hosts (mobile). Do not flag macOS model/behavior issues based on changes within this helper; macOS provisioning uses different helpers such as createHostThenEnrollMDM.

Applied to files:

  • server/service/integration_enterprise_test.go
  • server/datastore/mysql/software_test.go
📚 Learning: 2025-08-01T15:08:16.858Z
Learnt from: sgress454
Repo: fleetdm/fleet PR: 31508
File: server/datastore/mysql/schema.sql:102-116
Timestamp: 2025-08-01T15:08:16.858Z
Learning: The schema.sql file in server/datastore/mysql/ is auto-generated from migrations for use with tests, so it cannot be manually edited. Any changes must be made through migrations.

Applied to files:

  • server/datastore/mysql/schema.sql
📚 Learning: 2025-09-12T13:04:23.777Z
Learnt from: getvictor
Repo: fleetdm/fleet PR: 32823
File: server/datastore/mysql/software.go:4457-4471
Timestamp: 2025-09-12T13:04:23.777Z
Learning: In the Fleet codebase, the `status` column in the `host_software_installs` table is defined as `GENERATED ALWAYS`, meaning it's automatically computed by the database based on other field values (like exit codes) and should not be explicitly included in INSERT statements.

Applied to files:

  • server/datastore/mysql/schema.sql
  • server/datastore/mysql/software_installers.go
🧬 Code graph analysis (8)
server/datastore/mysql/scripts.go (2)
server/fleet/datastore.go (1)
  • Datastore (50-2625)
server/fleet/scripts.go (2)
  • HostScriptResultPayload (196-203)
  • HostScriptResult (208-284)
server/datastore/mysql/policies_test.go (2)
server/ptr/ptr.go (2)
  • T (86-88)
  • Bool (35-37)
server/fleet/policies.go (1)
  • PolicyMembershipResult (466-470)
server/fleet/datastore.go (3)
server/test/activities.go (1)
  • SetHostSoftwareInstallResult (66-74)
server/fleet/software_installer.go (1)
  • HostSoftwareInstallResultPayload (936-953)
server/fleet/scripts.go (1)
  • HostScriptResult (208-284)
server/datastore/mysql/software.go (2)
server/fleet/datastore.go (1)
  • Datastore (50-2625)
server/fleet/software_installer.go (1)
  • HostSoftwareInstallResultPayload (936-953)
server/datastore/mysql/software_installers_test.go (3)
server/test/activities.go (1)
  • SetHostSoftwareInstallResult (66-74)
server/fleet/software_installer.go (1)
  • HostSoftwareInstallResultPayload (936-953)
server/ptr/ptr.go (1)
  • Int (15-17)
server/service/integration_enterprise_test.go (5)
server/fleet/hosts.go (1)
  • Host (279-411)
server/fleet/software_installer.go (1)
  • UploadSoftwareInstallerPayload (497-533)
server/datastore/mysql/testing_utils.go (1)
  • ExecAdhocSQL (421-425)
server/fleet/policies.go (3)
  • PolicyPayload (15-59)
  • ModifyPolicyPayload (170-206)
  • Policy (280-303)
server/fleet/scripts.go (2)
  • Script (17-32)
  • HostScriptResult (208-284)
server/mock/datastore_mock.go (3)
server/fleet/software_installer.go (1)
  • HostSoftwareInstallResultPayload (936-953)
server/fleet/scripts.go (2)
  • HostScriptResultPayload (196-203)
  • HostScriptResult (208-284)
server/test/activities.go (1)
  • SetHostSoftwareInstallResult (66-74)
server/datastore/mysql/software_test.go (2)
server/test/activities.go (1)
  • SetHostSoftwareInstallResult (66-74)
server/fleet/datastore.go (1)
  • Datastore (50-2625)
🔇 Additional comments (43)
server/fleet/policies.go (1)

110-112: LGTM: Well-documented retry limit constant.

The addition of MaxPolicyAutomationRetries clearly defines the retry behavior for policy automations. The constant value of 3 aligns with the PR objectives, and the comment provides clear context.

server/fleet/software_installer.go (1)

422-425: LGTM: Well-designed attempt tracking field.

The AttemptNumber field addition is well-implemented:

  • Pointer type correctly represents optional nature (nil for non-policy installs)
  • Clear documentation of semantics (nil = not policy-triggered, 1-3 for retry attempts)
  • Appropriate JSON tag with omitempty
  • DB tag matches expected schema
changes/31916-retry-policy-automations (1)

1-1: LGTM: Clear user-facing documentation.

The changelog entry clearly communicates the retry behavior to users. The "up to three times" limit matches the MaxPolicyAutomationRetries constant defined in the codebase.

server/datastore/mysql/software_titles_test.go (1)

2232-2232: Signature update looks correct.

The addition of nil as the third argument correctly updates the call to match the new SetHostSoftwareInstallResult signature. Passing nil is appropriate for this test since it's not testing policy automation retry behavior.

Comprehensive test coverage for the attemptNumber parameter already exists in the integration tests, including verification of attempt counting (1, 2, 3), behavior at maximum attempts, and persistence of attempt numbers.

server/datastore/mysql/schema.sql (3)

975-1009: host_script_results.attempt_number schema looks correct and matches Go struct

The new attempt_number int DEFAULT NULL column integrates cleanly here: type and nullability match HostScriptResult.AttemptNumber *int (db:"attempt_number"), and there are no dependent generated columns/constraints to adjust. No changes needed from this file’s side.

Based on learnings, this looks like the expected output of the new migration rather than a manual edit.


1045-1084: host_software_installs.attempt_number addition is consistent with retries design

Similarly, adding attempt_number int DEFAULT NULL is consistent with the retry-tracking design and does not interfere with the existing generated status / execution_status columns. It gives the datastore a straightforward place to persist attempt numbers for installs.

Based on learnings, this also appears to come from regenerating schema.sql from the latest migrations.


1759-1768: Migration status table updates are internally consistent

The AUTO_INCREMENT=469 and the extended INSERT INTO migration_status_tables list (up through id 468 / version 20260108223716) are coherent: last inserted id matches AUTO_INCREMENT−1, and the versions sequence continues as expected. This aligns with adding the new migration that introduced the attempt_number columns.

Based on learnings, this matches the normal pattern of schema.sql being regenerated from migrations.

server/datastore/mysql/software_installers.go (1)

1422-1505: LGTM! Query modifications correctly add retry tracking columns.

The additions to GetSoftwareInstallResults properly extend both parts of the UNION query:

  • software_installer_id, source, and attempt_number are added consistently to both the historical (host_software_installs) and upcoming (upcoming_activities) result paths
  • The NULL AS attempt_number for upcoming activities is appropriate since those installs haven't been attempted yet
  • The existing filtering by execution_id ensures precise scoping, and the ua.activated_at IS NULL condition prevents duplicates between the UNION parts

The changes align with the PR's objective to track retry attempts for policy automations.

server/datastore/mysql/hosts_test.go (2)

4199-4217: SetHostScriptExecutionResult signature usage looks correct

The added nil argument aligns these test calls with the new SetHostScriptExecutionResult signature while preserving existing behavior; using nil is appropriate here since these tests don’t validate attempt-count logic.


8491-8496: Consistent use of new optional attempt parameter in SetHostScriptExecutionResult

This updated call also correctly passes nil for the new attempt parameter, matching the intended optional semantics without changing the test’s behavior.

server/datastore/mysql/migrations/tables/20260108223716_AddAttemptNumberToScriptsAndSoftwareInstalls.go (1)

9-34: Migration shape and error handling look correct

The Up migration cleanly adds nullable attempt_number columns to both tables with contextual error wrapping, and the no-op Down function matches this repo’s convention of non-rollbackable migrations. Based on learnings, leaving Down_20260108223716 empty is expected.

server/datastore/mysql/software_installers_test.go (1)

183-188: Test callsites correctly adapted to new attemptNumber parameter

All updated calls to SetHostSoftwareInstallResult and SetHostScriptExecutionResult now pass a trailing nil for the new attemptNumber argument, which is appropriate for these non-policy / non-retry scenarios. Assertions remain unchanged, so behavior is preserved while tests stay aligned with the new API.

Also applies to: 198-203, 239-244, 404-409, 460-465, 531-536, 552-557, 1280-1285, 1353-1358, 2275-2280, 2359-2365, 2411-2424, 3133-3139

server/test/activities.go (1)

30-36: Helpers correctly pass nil attemptNumber for non-policy executions

The activity helpers now pass nil as the attemptNumber when recording script and software results for manual/uninstall flows, which matches the intended “not policy automation” semantics and keeps tests aligned with the updated datastore interface.

Also applies to: 66-73, 105-113

server/datastore/mysql/scripts.go (1)

145-229: SetHostScriptExecutionResult / GetHostScriptExecutionResult wiring for attempt_number looks sound; please double-check NULL semantics in CountHostScriptAttempts

  • SetHostScriptExecutionResult now updates host_script_results.attempt_number alongside the other result fields and keeps the duplicate-result guard and MDM/uninstall handling intact.
  • getHostScriptExecutionResultDB correctly projects batch_execution_id and attempt_number for active results, and supplies NULL for both in the upcoming-activity path, keeping the struct mapping consistent.
  • CountHostScriptAttempts is properly scoped by host_id, script_id, and policy_id, and only counts rows where attempt_number > 0 OR attempt_number IS NULL, effectively treating attempt_number = 0 as “old sequence” while including current-sequence (positive) and unset attempts.

One nuance to confirm: any pre-existing host_script_results rows for a given (host_id, script_id, policy_id) that predate this feature (and thus have attempt_number = NULL) will be included in the count. That means the very first retry-aware run for a long-lived policy/host pair could start with a non-zero attempt count unless some higher-level logic proactively resets those older rows to attempt_number = 0 when the policy is known to have passed. If that’s not desired, you may want to revisit how/when old rows are normalized.

As per coding guidelines, the new COUNT query uses precise filtering and avoids under-scoped SELECT usage.

Also applies to: 387-417, 469-487

server/datastore/mysql/activities_test.go (1)

1285-1288: LGTM! Test call sites updated to match expanded API signatures.

The test updates correctly pass nil for the new optional attemptNumber parameter in SetHostScriptExecutionResult and SetHostSoftwareInstallResult calls, which is appropriate since these tests don't exercise the retry behavior.

Also applies to: 1426-1431, 1448-1453, 1678-1683, 1693-1697, 1705-1710

server/datastore/mysql/policies.go (3)

540-613: LGTM! Flip detection and attempt reset logic is correctly implemented.

The implementation properly:

  1. Computes newly-passing policies before updating membership to compare against previous state
  2. Resets attempt_number to 0 for both script results and software installs when policies flip from failing to passing
  3. Uses proper SQL filtering with host_id and policy_id IN (?) clauses

The flip detection happening before the transaction is a pragmatic design choice that avoids holding locks during computation, with the withRetryTxx providing conflict resolution.


1428-1484: LGTM! Async batch processing correctly handles per-host flip detection.

The implementation properly extends the flip detection and attempt reset logic to handle batches of hosts:

  1. Groups incoming results per host before computing flips
  2. Computes newly-passing policies separately for each host
  3. Resets attempt numbers within the same transaction as the membership upsert

This ensures consistency between sync and async policy membership update paths.


1755-1775: LGTM! IsPolicyFailing correctly implements the failure check semantics.

The method correctly returns true (failing) when:

  • No membership record exists (sql.ErrNoRows) — policy hasn't been evaluated yet
  • passes is nil — policy execution didn't complete
  • passes is false — policy is explicitly failing

This aligns with the retry logic requirement: automations should retry while the policy remains in a non-passing state. The SQL query properly filters on both policy_id and host_id.

server/fleet/datastore.go (3)

687-687: LGTM: Signature extension supports retry tracking.

The addition of attemptNumber *int parameter is well-designed. Using a pointer allows distinguishing between policy-triggered installs (non-nil with attempt number) and regular installs (nil), which aligns with the retry feature requirements.


868-875: LGTM: New methods provide proper scoping parameters.

The three new methods for policy retry logic are well-designed:

  • Each method includes both host and policy identifiers, enabling precise filtering
  • Comments clearly explain their purpose
  • Parameter types are consistent with existing datastore patterns

Note: The actual implementations should be reviewed to ensure SQL queries use these parameters properly with WHERE clauses (per coding guidelines for **/*.go).


1827-1827: LGTM: Consistent signature extension.

The attemptNumber *int parameter addition mirrors the change to SetHostSoftwareInstallResult, maintaining consistency across the retry tracking implementation.

server/datastore/mysql/policies_test.go (2)

83-85: New policy retry tests correctly wired into TestPolicies suite

The three new helpers are cleanly added to the table-driven TestPolicies suite with descriptive names and consistent ordering. No changes needed.


6414-6476: IsPolicyFailing test semantics look correct and comprehensive

This test exercises the important edge cases for IsPolicyFailing (no membership row, passes = NULL, passes = false, passes = true, different hosts, different policies) and clearly documents the intended semantics (treat missing/NULL/false as failing, true as passing). This aligns well with how policy automations typically reason about “still failing” vs “resolved”. No issues from my side.

server/datastore/mysql/scripts_test.go (3)

57-57: LGTM!

New test case entry correctly wired into the test suite.


114-134: LGTM!

All SetHostScriptExecutionResult calls correctly updated to pass nil for the new attemptNumber parameter, which is appropriate for non-policy-automation script executions.


2989-3132: LGTM! Comprehensive test coverage for attempt counting.

The test covers:

  • Zero attempts case
  • Multiple consecutive attempts with the same policy/script
  • Executions without policy_id (should not affect count)
  • Different hosts with same policy/script
  • Same host with different policies
  • Same host/policy with different scripts

One minor observation: at line 3029, the first call to SetHostScriptExecutionResult passes nil for attemptNumber (correct for the initial attempt), but subsequent calls at lines 3053 and 3077 pass explicit attempt numbers via ptr.Int(2) and ptr.Int(3). This asymmetry is intentional to test that the count increments correctly regardless of what's passed, which is good coverage.

server/service/orbit.go (6)

1020-1033: Verify activity suppression logic handles nil ExitCode correctly.

The condition at line 1028 checks hsr.ExitCode == nil || *hsr.ExitCode != 0 to determine if the script failed. This is correct for the failure check. However, when hsr.AttemptNumber is not nil and the script succeeded (exit code 0), shouldCreateActivity remains true, which is the expected behavior.

The logic correctly suppresses activity creation only for intermediate failed attempts (attempts 1 and 2 when max is 3).


1482-1496: Consistent approach with script attempt counting.

The software install attempt counting mirrors the script counting logic, which is good for consistency. The same observation about verifying the counting logic applies here.


1556-1572: Activity suppression and retry logic looks correct but has asymmetric error handling.

The retry is triggered inside the activity suppression block for failed installs, and errors are only logged (line 1558-1563), which is different from the script retry handling. This inconsistency is actually better—logging instead of failing is more resilient.

However, note that maybeRetryPolicyAutomationSoftwareInstall is called before checking shouldCreateActivity, meaning a retry could be queued even if we then create an activity (when AttemptNumber is nil, which shouldn't happen for policy automation but is worth noting).


915-934: The counting logic is correct. The CountHostScriptAttempts SQL query explicitly includes rows where attempt_number IS NULL in addition to those where attempt_number > 0 (line 480 of server/datastore/mysql/scripts.go). This means newly created rows with NULL attempt_number are counted. Therefore, when a new script execution is created via NewHostScriptExecutionRequest, the subsequent CountHostScriptAttempts call returns a count that already includes that new row. Setting attemptNumber = &count produces the correct attempt number with no off-by-one error.


1607-1642: The nil check on AttemptNumber is defensive programming and does not indicate a data inconsistency issue.

The concern about race conditions is unfounded because the operations are sequential, not concurrent. SetHostSoftwareInstallResult uses a database transaction (withRetryTxx) that ensures the attempt_number is committed before GetSoftwareInstallResults is called immediately after. Additionally, maybeRetryPolicyAutomationSoftwareInstall is only invoked when status == fleet.SoftwareInstallFailed, which can only originate from the first UNION part of the GetSoftwareInstallResults query (the host_software_installs table), which always includes the attempt_number field. The WHERE clause properly filters by execution_id, ensuring the correct record is retrieved.


1644-1682: Script retry correctly uses original execution contents for reproducibility.

The maybeRetryPolicyAutomationScript function correctly passes hsr.ScriptContents (the original script contents from the initial execution) to the retry request. This is intentional: each script execution is versioned through separate script_contents records, ensuring that retries reproduce the exact same script execution for proper debugging and auditability. If the script is later modified via UpdateScriptContents, the retry will still use the original contents that caused the policy to fail—which is the correct behavior.

server/datastore/mysql/software_test.go (2)

106-107: Registering testCountHostSoftwareInstallAttempts in matrix looks good

The new test is correctly wired into TestSoftware, so it will run with the rest of the datastore tests. No issues here.


4175-4331: Signature updates for SetHostSoftwareInstallResult / SetHostScriptExecutionResult are consistent

Across these blocks, all updated call sites now pass the new third argument:

  • nil for legacy/non-automation flows where attempt numbering is irrelevant.
  • Explicit *int only where tests care about attempt semantics.

Argument ordering and payloads remain unchanged, so behavior is preserved while satisfying the new API.

Also applies to: 6116-6149, 6200-6206, 6313-6319, 9131-9138, 9220-9221

server/mock/datastore_mock.go (3)

536-540: SetHostSoftwareInstallResult mock correctly extended with attemptNumber

The updated SetHostSoftwareInstallResultFunc type and its wrapper now accept and forward attemptNumber *int, matching the production signature while preserving the existing lock/flag pattern. This keeps tests able to observe and drive retry-specific behavior without altering the mock conventions.

Also applies to: 6095-6099


684-688: New policy retry helpers are consistently wired through the mock

IsPolicyFailingFunc, CountHostSoftwareInstallAttemptsFunc, and CountHostScriptAttemptsFunc are added as function types, stored on DataStore with matching *Invoked flags, and exposed via thin wrappers that set the flag under lock and delegate with the correct parameter order. This is consistent with the rest of the mock and should integrate cleanly with the new retry logic.

Also applies to: 2726-2733, 6613-6632


1210-1212: SetHostScriptExecutionResult mock updated to support attemptNumber

The SetHostScriptExecutionResultFunc type and its wrapper now include attemptNumber *int and forward it correctly while retaining the existing invocation-tracking pattern. This keeps the mock aligned with the main datastore interface and supports script retry tests without behavioral surprises.

Also applies to: 8454-8458

server/service/integration_enterprise_test.go (6)

7107-7107: LGTM!

The additional nil parameter correctly passes no attempt number for this test case, which is appropriate since this test isn't specifically exercising the retry tracking logic.


17134-17267: Well-structured test setup with properly scoped queries.

The test setup is comprehensive and all SQL queries include appropriate filtering criteria (host_id, software_installer_id, policy_id) to ensure correct data isolation. The helper functions nicely encapsulate the test operations.


17269-17361: LGTM!

The retry logic testing correctly verifies:

  • Attempt number progression (1 → 2 → 3)
  • Activity is created only after the final attempt
  • No additional retries are queued after reaching the maximum of 3 attempts

17362-17459: LGTM!

The test correctly verifies:

  • Policy passing resets the retry counter (attempts marked as obsolete with attempt_number=0)
  • A fresh retry sequence can be started after reset (new attempts start at 1)
  • The CountHostSoftwareInstallAttempts datastore method behaves correctly

The final database state verification at lines 17444-17458 provides good coverage of the expected state.


18280-18383: Well-structured test setup with properly scoped queries.

Similar to the software install test, all SQL queries include appropriate filtering criteria (host_id, script_id, policy_id). The helper functions are well-designed for the test operations.


18385-18576: LGTM!

The script retry test correctly verifies the same behaviors as the software install test:

  • Retry attempt progression (1 → 2 → 3)
  • Activity creation only after final attempt
  • Policy passing resets attempts
  • Fresh sequence can start after reset

The parallel structure between both tests provides good consistency and coverage of the retry feature for both automation types.

Comment on lines 5899 to 5917
func (ds *Datastore) CountHostSoftwareInstallAttempts(ctx context.Context, hostID, softwareInstallerID, policyID uint) (int, error) {
var count int
// Only count attempts from the current retry sequence.
// When a policy passes, all attempt_number values are reset to 0 to mark them as "old sequence".
// We count attempts where attempt_number > 0 (current sequence) OR attempt_number IS NULL (currently being processed).
err := sqlx.GetContext(ctx, ds.reader(ctx), &count, `
SELECT COUNT(*)
FROM host_software_installs
WHERE host_id = ?
AND software_installer_id = ?
AND policy_id = ?
AND (attempt_number > 0 OR attempt_number IS NULL)
`, hostID, softwareInstallerID, policyID)
if err != nil {
return 0, ctxerr.Wrap(ctx, err, "count host software install attempts")
}

return count, nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
echo "=== Finding calls to CountHostSoftwareInstallAttempts ==="
rg -n -C 5 'CountHostSoftwareInstallAttempts' --type go

echo -e "\n=== Looking for retry limit logic (3 retries) ==="
rg -n -C 3 'attempt.*3|max.*attempt|retry.*limit' --type go server/

echo -e "\n=== Checking host_software_installs table schema ==="
rg -n -C 5 'host_software_installs' --type go | head -100

Repository: fleetdm/fleet

Length of output: 50369


🏁 Script executed:

#!/bin/bash
echo "=== Checking how removed and canceled are used in CountHostSoftwareInstallAttempts context ==="
rg -n -C 10 'removed|canceled' server/datastore/mysql/software.go | head -150

echo -e "\n=== Checking orbit.go usage more carefully ==="
rg -n -B 15 -A 15 'CountHostSoftwareInstallAttempts' server/service/orbit.go

echo -e "\n=== Looking for integration test that checks behavior with canceled/removed ==="
rg -n -C 20 'removed.*=.*1\|canceled.*=.*1' server/service/integration_enterprise_test.go | head -100

Repository: fleetdm/fleet

Length of output: 8644


🏁 Script executed:

#!/bin/bash
echo "=== Checking testCountHostSoftwareInstallAttempts for any canceled/removed tests ==="
sed -n '10346,10497p' server/datastore/mysql/software_test.go

echo -e "\n=== Checking if there are other tests with canceled/removed installs ==="
rg -n 'canceled.*=.*1|removed.*=.*1' server/datastore/mysql/software_test.go

Repository: fleetdm/fleet

Length of output: 5550


🏁 Script executed:

#!/bin/bash
echo "=== Looking for code that sets canceled = 1 or removed = 1 ==="
rg -n 'canceled.*=.*1|removed.*=.*1' server/datastore/mysql/software.go | head -30

echo -e "\n=== Looking for CancelSoftwareInstall or similar cancel logic ==="
rg -n -C 5 'CancelSoftwareInstall|cancel.*software.*install' --type go server/ | head -80

Repository: fleetdm/fleet

Length of output: 6776


🏁 Script executed:

#!/bin/bash
echo "=== Looking for code that sets canceled = 1 when activity is deleted ==="
rg -n -B 5 -A 5 'canceled.*=.*1' server/datastore/mysql/ | grep -A 10 -B 10 'activity\|DELETE\|upcoming'

echo -e "\n=== Check the DeleteUpcomingActivity or similar function ==="
rg -n -C 10 'DeleteUpcomingActivity|canceled' server/datastore/mysql/activities.go | head -100

Repository: fleetdm/fleet

Length of output: 11600


Add filters for removed and canceled installs to retry attempt count.

The CountHostSoftwareInstallAttempts query should exclude canceled and removed attempts, consistent with how all other queries in the codebase handle these records. Currently, if an admin cancels an attempt or a record is marked as removed, it still counts toward the 3-retry limit, which prevents valid retries.

Recommended fix
 func (ds *Datastore) CountHostSoftwareInstallAttempts(ctx context.Context, hostID, softwareInstallerID, policyID uint) (int, error) {
 	var count int
 	// Only count attempts from the current retry sequence.
 	// When a policy passes, all attempt_number values are reset to 0 to mark them as "old sequence".
 	// We count attempts where attempt_number > 0 (current sequence) OR attempt_number IS NULL (currently being processed).
 	err := sqlx.GetContext(ctx, ds.reader(ctx), &count, `
 		SELECT COUNT(*)
 		FROM host_software_installs
 		WHERE host_id = ?
 		  AND software_installer_id = ?
 		  AND policy_id = ?
+		  AND removed = 0
+		  AND canceled = 0
 		  AND (attempt_number > 0 OR attempt_number IS NULL)
 	`, hostID, softwareInstallerID, policyID)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func (ds *Datastore) CountHostSoftwareInstallAttempts(ctx context.Context, hostID, softwareInstallerID, policyID uint) (int, error) {
var count int
// Only count attempts from the current retry sequence.
// When a policy passes, all attempt_number values are reset to 0 to mark them as "old sequence".
// We count attempts where attempt_number > 0 (current sequence) OR attempt_number IS NULL (currently being processed).
err := sqlx.GetContext(ctx, ds.reader(ctx), &count, `
SELECT COUNT(*)
FROM host_software_installs
WHERE host_id = ?
AND software_installer_id = ?
AND policy_id = ?
AND (attempt_number > 0 OR attempt_number IS NULL)
`, hostID, softwareInstallerID, policyID)
if err != nil {
return 0, ctxerr.Wrap(ctx, err, "count host software install attempts")
}
return count, nil
}
func (ds *Datastore) CountHostSoftwareInstallAttempts(ctx context.Context, hostID, softwareInstallerID, policyID uint) (int, error) {
var count int
// Only count attempts from the current retry sequence.
// When a policy passes, all attempt_number values are reset to 0 to mark them as "old sequence".
// We count attempts where attempt_number > 0 (current sequence) OR attempt_number IS NULL (currently being processed).
err := sqlx.GetContext(ctx, ds.reader(ctx), &count, `
SELECT COUNT(*)
FROM host_software_installs
WHERE host_id = ?
AND software_installer_id = ?
AND policy_id = ?
AND removed = 0
AND canceled = 0
AND (attempt_number > 0 OR attempt_number IS NULL)
`, hostID, softwareInstallerID, policyID)
if err != nil {
return 0, ctxerr.Wrap(ctx, err, "count host software install attempts")
}
return count, nil
}
🤖 Prompt for AI Agents
In @server/datastore/mysql/software.go around lines 5899 - 5917, The
CountHostSoftwareInstallAttempts function currently counts attempts regardless
of canceled or removed records; update its SQL WHERE clause in
host_software_installs (inside CountHostSoftwareInstallAttempts) to exclude
canceled and removed attempts by adding checks like canceled_at IS NULL and
removed_at IS NULL (or the equivalent removed flag used elsewhere in the
codebase) so only active, non-removed attempts are counted for the retry limit.

Comment on lines +18481 to +18482
// fail policy fails to trigger first attempt
submitPolicyResult(policy.ID, false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Minor typo in comment.

The comment reads "fail policy fails to trigger first attempt" but should likely be "fail policy to trigger first attempt".

📝 Suggested fix
 	// pass policy
 	submitPolicyResult(policy.ID, true)
-	// fail policy fails to trigger first attempt
+	// fail policy to trigger first attempt
 	submitPolicyResult(policy.ID, false)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// fail policy fails to trigger first attempt
submitPolicyResult(policy.ID, false)
// pass policy
submitPolicyResult(policy.ID, true)
// fail policy to trigger first attempt
submitPolicyResult(policy.ID, false)
🤖 Prompt for AI Agents
In @server/service/integration_enterprise_test.go around lines 18481 - 18482,
Update the comment above the submitPolicyResult(policy.ID, false) call to
correct the typo: change "fail policy fails to trigger first attempt" to "fail
policy to trigger first attempt" so the comment accurately describes the intent
of the test step; locate the comment adjacent to the submitPolicyResult
invocation in the integration_enterprise_test suite.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Policy automations: add retries for scripts & software

3 participants