-
Notifications
You must be signed in to change notification settings - Fork 754
Automatic retry of failed policy automations of scripts and software installs #38018
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| require.NotNil(t, results[0].ExitCode) | ||
| require.Equal(t, int64(1), *results[0].ExitCode) | ||
| // queued retry | ||
| // Note: attempt_number is only calculated when the result is submitted, not when queued, so it will be NULL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why am I waiting until the result comes in to update attempt_number?
When maybeRetryPolicyAutomationScript calls NewHostScriptExecutionRequest, the row in host_script_results doesn't exist yet. NewHostScriptExecutionRequest creates rows in upcoming_activities and script_upcoming_activities tables. activateNextScriptActivity is what actually creates the host_script_results row. We could pass the attempt_number through the upcoming_activities payload, then have activateNextScriptActivity read it from the payload and insert it into host_script_results when creating the row. The current approach calculates on result submission is simpler - it just means there's a brief period of time when attempt_number is NULL. This might actually not be semantically incorrect because orbit hasn't actually attempted to run the script.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #38018 +/- ##
========================================
Coverage 65.87% 65.88%
========================================
Files 2392 2393 +1
Lines 190798 191017 +219
Branches 8364 8364
========================================
+ Hits 125689 125844 +155
- Misses 53688 53724 +36
- Partials 11421 11449 +28
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
server/fleet/datastore.go
Outdated
|
|
||
| CleanupPolicyMembership(ctx context.Context, now time.Time) error | ||
| // IsPolicyStillFailing checks if a policy is currently failing for a given host. | ||
| IsPolicyStillFailing(ctx context.Context, policyID, hostID uint) (bool, error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: "still" sounds like it's also evaluating attempt count
| IsPolicyStillFailing(ctx context.Context, policyID, hostID uint) (bool, error) | |
| IsPolicyFailing(ctx context.Context, policyID, hostID uint) (bool, error) |
|
|
||
| // Only calculate attempt_number for policy automation installs | ||
| if currentInstall.PolicyID != nil && currentInstall.SoftwareInstallerID != nil { | ||
| count, err := svc.ds.CountHostSoftwareInstallAttempts(ctx, host.ID, *currentInstall.SoftwareInstallerID, *currentInstall.PolicyID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we save multiple rows if there are >1 successful installs? that could affect how we're counting attempts here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, we are storing multiple rows in host_script_results and host_software_installs for each attempt. Is it possible to get >1 successful installs for a policy for a host for a software installer 🤔?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess if the policy is failing. it attempts 2 times. Then the policy turns into a successful state. Then later it goes back to failed. It means it will only attempt it 1 time. : / because we don't really keep track of the "sequence".
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
WalkthroughThis PR implements automatic retry logic for policy automations (scripts and software installations). When a policy remains failing after triggering an automation, Fleet now retries the automation up to 3 times. The implementation adds attempt tracking to the database schema, updates datastore method signatures to handle attempt numbers, resets attempt counts when policies transition from failing to passing, and introduces retry queueing in the service layer. Changes
Sequence Diagram(s)sequenceDiagram
participant Policy as Policy Evaluation
participant Automation as Policy Automation Trigger
participant DataStore as Datastore
participant Retry as Retry Manager
Policy->>DataStore: Check if policy is failing for host
DataStore-->>Policy: IsPolicyFailing result
alt Policy is failing
Policy->>Automation: Trigger automation (script/install)
Automation->>DataStore: CountAttempts for this policy+automation
DataStore-->>Automation: Current attempt count
Automation->>DataStore: SetResult with attemptNumber
DataStore->>DataStore: Persist result with attempt_number
DataStore-->>Automation: Result saved
alt Attempt < 3 and still failing
Automation->>Retry: Queue retry for next policy run
else Attempt >= 3
Automation->>Retry: Final attempt - create activity
end
else Policy is passing
DataStore->>DataStore: Reset attempt_number to NULL
Policy->>Automation: No automation triggered
end
sequenceDiagram
participant Host as Host
participant Service as Service Layer
participant DataStore as Datastore
participant Queue as Retry Queue
Host->>Service: Submit script/software result
Service->>DataStore: Fetch existing results for policy
DataStore-->>Service: Prior results with attempt counts
Service->>Service: Calculate nextAttemptNumber
alt Is policy automation?
Service->>DataStore: CountAttempts for this automation
DataStore-->>Service: Current attempt count
Service->>Service: Suppress activity creation if attempt < 3
else Manual execution
Service->>Service: Create activity normally
end
Service->>DataStore: SetResult with attemptNumber
DataStore-->>Service: Result persisted
alt Still failing and attempt < 3?
Service->>Queue: Queue retry request
Queue-->>Service: Retry queued
else Passing or final attempt
Service->>Service: Complete flow
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In @server/datastore/mysql/software.go:
- Around line 5899-5917: The CountHostSoftwareInstallAttempts function currently
counts attempts regardless of canceled or removed records; update its SQL WHERE
clause in host_software_installs (inside CountHostSoftwareInstallAttempts) to
exclude canceled and removed attempts by adding checks like canceled_at IS NULL
and removed_at IS NULL (or the equivalent removed flag used elsewhere in the
codebase) so only active, non-removed attempts are counted for the retry limit.
In @server/service/integration_enterprise_test.go:
- Around line 18481-18482: Update the comment above the
submitPolicyResult(policy.ID, false) call to correct the typo: change "fail
policy fails to trigger first attempt" to "fail policy to trigger first attempt"
so the comment accurately describes the intent of the test step; locate the
comment adjacent to the submitPolicyResult invocation in the
integration_enterprise_test suite.
🧹 Nitpick comments (6)
server/fleet/scripts.go (1)
280-283:HostScriptResult.AttemptNumberis well-wired; consider future-proofing the commentThe new
AttemptNumber *intfield is correctly tagged for JSON/DB and matches the schema’s nullableintcolumn, so it should integrate cleanly with existing scan/encode paths and is a backward-compatible API extension. The only small suggestion is that the comment bakes in “3 being max retries”; if the max ever changes, this can drift—consider either referencing a shared constant in the comment (by name) or rephrasing to “values 1..N where N is the configured max retries” to avoid future inconsistencies.server/datastore/mysql/migrations/tables/20260108223716_AddAttemptNumberToScriptsAndSoftwareInstalls_test.go (1)
72-78: Consider asserting row presence before indexing attempt_number slicesThe test correctly validates that
attempt_numberdefaults to NULL and persists explicit values for both tables. To make failures clearer if the setup ever breaks, you could assert the query returned exactly one row before indexing:require.Len(t, attempt_number, 1) require.Equal(t, sql.NullInt64{Valid: false}, attempt_number[0])(and similarly for
scriptAttemptNumber). This avoids panics and yields more informative test failures.Also applies to: 85-89, 90-105
server/datastore/mysql/software.go (1)
5842-5842: Consider: Add validation for attemptNumber parameterThe
attemptNumberparameter is a*intwith no validation. Whilenilis intentional (represents "currently processing"), if a non-nil value is passed, it could be negative or exceed the retry limit without any checks.Consider adding defensive validation in the calling code to ensure
attemptNumberis within the expected range (e.g., 1-3 for the 3-retry limit).server/datastore/mysql/policies_test.go (1)
6478-6642: Reset-attempt tests validate correct behavior; consider minor DRY helperThe sync and async reset tests both:
- Set up a failing policy (
p1) and a non-transitioning policy (p2).- Seed
host_script_resultsandhost_software_installswith both completed (attempt_number = 1) and pending (attempt_number IS NULL) rows.- Drive a failing→passing transition for
p1only, and assert allp1attempts (completed and pending) are reset to0whilep2’s rows remain unchanged.This accurately captures the desired retry-reset behavior and correctly scopes the verification queries by
host_idandpolicy_id. Logic and SQL look sound.There is some nearly identical setup logic between the sync and async tests (manual
INSERTstatements for both policies and tables); if this grows further, you might consider a small shared helper to seed attempts for a(host, policy)pair to reduce duplication and keep future schema tweaks localized. This is purely optional test hygiene.server/service/orbit.go (1)
1057-1065: Good defensive check, but consider error handling consistency.The retry logic correctly checks for both
PolicyIDandScriptIDbefore attempting to retry. However, whenmaybeRetryPolicyAutomationScriptreturns an error, it's wrapped and returned, which will cause the entireSaveHostScriptResultcall to fail.Consider whether failing to queue a retry should be a hard error or just logged. The result was already saved successfully, and the user might expect at least that to persist.
Consider logging retry failures instead of returning error
// If this is a policy automation script that failed, maybe retry if hsr != nil && hsr.PolicyID != nil && hsr.ScriptID != nil { scriptFailed := hsr.ExitCode == nil || *hsr.ExitCode != 0 if scriptFailed { if err := svc.maybeRetryPolicyAutomationScript(ctx, host, hsr); err != nil { - return ctxerr.Wrap(ctx, err, "retry policy automation script") + // Log but don't fail - the result was already saved successfully + level.Error(svc.logger).Log( + "msg", "failed to queue policy automation script retry", + "host_id", host.ID, + "policy_id", *hsr.PolicyID, + "err", err, + ) } } }server/datastore/mysql/software_test.go (1)
10346-10497:testCountHostSoftwareInstallAttemptscovers main scenarios; consider aligning first attempt with production pathThe test thoroughly exercises:
- Zero-attempt baseline.
- Multiple attempts on the same
(host, installer, policy)tuple.- Isolation from:
- installs without
policy_id,- a different host,
- a different policy,
- a different installer.
This gives solid coverage of the counting behavior.
One small realism improvement: for the first policy-driven attempt you currently call
SetHostSoftwareInstallResultwithattemptNumber == nil, whereas production is likely to passattemptNumber == 1for automation-driven installs. IfCountHostSoftwareInstallAttemptsor related logic ever relies directly on theattempt_numbercolumn instead of just row count, this test could miss regressions.You could optionally pass
ptr.Int(1)for the first attempt to mirror the real flow more closely:[inspect for fit before changing; optional]
♻️ Optional tweak to set explicit attempt number on first attempt
- // Set result - _, err = ds.SetHostSoftwareInstallResult(ctx, &fleet.HostSoftwareInstallResultPayload{ - HostID: host.ID, - InstallUUID: install1UUID, - InstallScriptExitCode: ptr.Int(1), // failed - InstallScriptOutput: ptr.String("install failed"), - }, nil) + // Set result for first attempt (attempt_number = 1) + _, err = ds.SetHostSoftwareInstallResult(ctx, &fleet.HostSoftwareInstallResultPayload{ + HostID: host.ID, + InstallUUID: install1UUID, + InstallScriptExitCode: ptr.Int(1), // failed + InstallScriptOutput: ptr.String("install failed"), + }, ptr.Int(1))If the datastore implementation intentionally ignores
attempt_numberfor counting and only uses row count, leaving this as-is is also acceptable.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (23)
changes/31916-retry-policy-automationsserver/datastore/mysql/activities_test.goserver/datastore/mysql/hosts_test.goserver/datastore/mysql/migrations/tables/20260108223716_AddAttemptNumberToScriptsAndSoftwareInstalls.goserver/datastore/mysql/migrations/tables/20260108223716_AddAttemptNumberToScriptsAndSoftwareInstalls_test.goserver/datastore/mysql/policies.goserver/datastore/mysql/policies_test.goserver/datastore/mysql/schema.sqlserver/datastore/mysql/scripts.goserver/datastore/mysql/scripts_test.goserver/datastore/mysql/software.goserver/datastore/mysql/software_installers.goserver/datastore/mysql/software_installers_test.goserver/datastore/mysql/software_test.goserver/datastore/mysql/software_titles_test.goserver/fleet/datastore.goserver/fleet/policies.goserver/fleet/scripts.goserver/fleet/software_installer.goserver/mock/datastore_mock.goserver/service/integration_enterprise_test.goserver/service/orbit.goserver/test/activities.go
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go
⚙️ CodeRabbit configuration file
When reviewing SQL queries that are added or modified, ensure that appropriate filtering criteria are applied—especially when a query is intended to return data for a specific entity (e.g., a single host). Check for missing WHERE clauses or incorrect filtering that could lead to incorrect or non-deterministic results (e.g., returning the first row instead of the correct one). Flag any queries that may return unintended results due to lack of precise scoping.
Files:
server/fleet/policies.goserver/datastore/mysql/scripts.goserver/datastore/mysql/activities_test.goserver/datastore/mysql/policies.goserver/service/orbit.goserver/fleet/software_installer.goserver/datastore/mysql/policies_test.goserver/datastore/mysql/migrations/tables/20260108223716_AddAttemptNumberToScriptsAndSoftwareInstalls.goserver/fleet/datastore.goserver/datastore/mysql/scripts_test.goserver/fleet/scripts.goserver/test/activities.goserver/datastore/mysql/migrations/tables/20260108223716_AddAttemptNumberToScriptsAndSoftwareInstalls_test.goserver/datastore/mysql/software.goserver/datastore/mysql/software_titles_test.goserver/datastore/mysql/software_installers_test.goserver/service/integration_enterprise_test.goserver/datastore/mysql/software_installers.goserver/datastore/mysql/hosts_test.goserver/mock/datastore_mock.goserver/datastore/mysql/software_test.go
🧠 Learnings (9)
📚 Learning: 2025-08-08T07:40:05.301Z
Learnt from: getvictor
Repo: fleetdm/fleet PR: 31726
File: server/datastore/mysql/labels_test.go:2031-2031
Timestamp: 2025-08-08T07:40:05.301Z
Learning: Fleet repo targets Go 1.24.5 (root go.mod), which supports testing.T.Context(). Do not flag usage of t.Context() or suggest replacing it with context.Background() in tests (e.g., server/datastore/mysql/labels_test.go Line 2031 and similar).
Applied to files:
server/datastore/mysql/activities_test.go
📚 Learning: 2025-07-08T16:12:48.797Z
Learnt from: getvictor
Repo: fleetdm/fleet PR: 30589
File: ee/server/service/hostidentity/depot/depot.go:108-111
Timestamp: 2025-07-08T16:12:48.797Z
Learning: In ee/server/service/hostidentity/depot/depot.go, the SCEP depot interface methods like Put() do not accept context parameters, and the common_mysql.WithRetryTxx callback function type TxFn only receives a transaction parameter, not a context. Therefore, using context.Background() in tx.ExecContext calls within the transaction callback is the correct approach.
Applied to files:
server/datastore/mysql/policies.go
📚 Learning: 2025-08-29T00:30:04.502Z
Learnt from: getvictor
Repo: fleetdm/fleet PR: 32438
File: server/datastore/mysql/scripts.go:505-515
Timestamp: 2025-08-29T00:30:04.502Z
Learning: In the Fleet codebase, database operations are typically wrapped in `ds.withRetryTxx` which provides transaction isolation, eliminating the need for additional race condition checks within the same transaction boundary.
Applied to files:
server/datastore/mysql/policies.go
📚 Learning: 2025-07-08T16:13:39.114Z
Learnt from: getvictor
Repo: fleetdm/fleet PR: 30589
File: server/datastore/mysql/migrations/tables/20250707095725_HostIdentitySCEPCertificates.go:53-55
Timestamp: 2025-07-08T16:13:39.114Z
Learning: In the Fleet codebase, Down migration functions are intentionally left empty/no-op. The team does not implement rollback functionality for database migrations, so empty Down_* functions in migration files are correct and should not be flagged as issues.
Applied to files:
server/datastore/mysql/migrations/tables/20260108223716_AddAttemptNumberToScriptsAndSoftwareInstalls.go
📚 Learning: 2025-08-13T18:20:42.136Z
Learnt from: titanous
Repo: fleetdm/fleet PR: 31075
File: tools/redis-tests/elasticache/iam_auth.go:4-10
Timestamp: 2025-08-13T18:20:42.136Z
Learning: For test harnesses and CLI tools in the Fleet codebase, resource cleanup on error paths (like closing connections before log.Fatalf) may not be necessary since the OS handles cleanup when the process exits. These tools prioritize simplicity over defensive programming patterns used in production code.
Applied to files:
server/datastore/mysql/scripts_test.goserver/test/activities.go
📚 Learning: 2025-07-08T16:11:49.555Z
Learnt from: getvictor
Repo: fleetdm/fleet PR: 30589
File: ee/server/service/hostidentity/depot/depot.go:115-115
Timestamp: 2025-07-08T16:11:49.555Z
Learning: In ee/server/service/hostidentity/depot/depot.go, the error from result.RowsAffected() is intentionally ignored because the information is only used for logging purposes, not for critical program logic.
Applied to files:
server/datastore/mysql/software.go
📚 Learning: 2025-10-03T18:16:11.482Z
Learnt from: MagnusHJensen
Repo: fleetdm/fleet PR: 33805
File: server/service/integration_mdm_test.go:1248-1251
Timestamp: 2025-10-03T18:16:11.482Z
Learning: In server/service/integration_mdm_test.go, the helper createAppleMobileHostThenEnrollMDM(platform string) is exclusively for iOS/iPadOS hosts (mobile). Do not flag macOS model/behavior issues based on changes within this helper; macOS provisioning uses different helpers such as createHostThenEnrollMDM.
Applied to files:
server/service/integration_enterprise_test.goserver/datastore/mysql/software_test.go
📚 Learning: 2025-08-01T15:08:16.858Z
Learnt from: sgress454
Repo: fleetdm/fleet PR: 31508
File: server/datastore/mysql/schema.sql:102-116
Timestamp: 2025-08-01T15:08:16.858Z
Learning: The schema.sql file in server/datastore/mysql/ is auto-generated from migrations for use with tests, so it cannot be manually edited. Any changes must be made through migrations.
Applied to files:
server/datastore/mysql/schema.sql
📚 Learning: 2025-09-12T13:04:23.777Z
Learnt from: getvictor
Repo: fleetdm/fleet PR: 32823
File: server/datastore/mysql/software.go:4457-4471
Timestamp: 2025-09-12T13:04:23.777Z
Learning: In the Fleet codebase, the `status` column in the `host_software_installs` table is defined as `GENERATED ALWAYS`, meaning it's automatically computed by the database based on other field values (like exit codes) and should not be explicitly included in INSERT statements.
Applied to files:
server/datastore/mysql/schema.sqlserver/datastore/mysql/software_installers.go
🧬 Code graph analysis (8)
server/datastore/mysql/scripts.go (2)
server/fleet/datastore.go (1)
Datastore(50-2625)server/fleet/scripts.go (2)
HostScriptResultPayload(196-203)HostScriptResult(208-284)
server/datastore/mysql/policies_test.go (2)
server/ptr/ptr.go (2)
T(86-88)Bool(35-37)server/fleet/policies.go (1)
PolicyMembershipResult(466-470)
server/fleet/datastore.go (3)
server/test/activities.go (1)
SetHostSoftwareInstallResult(66-74)server/fleet/software_installer.go (1)
HostSoftwareInstallResultPayload(936-953)server/fleet/scripts.go (1)
HostScriptResult(208-284)
server/datastore/mysql/software.go (2)
server/fleet/datastore.go (1)
Datastore(50-2625)server/fleet/software_installer.go (1)
HostSoftwareInstallResultPayload(936-953)
server/datastore/mysql/software_installers_test.go (3)
server/test/activities.go (1)
SetHostSoftwareInstallResult(66-74)server/fleet/software_installer.go (1)
HostSoftwareInstallResultPayload(936-953)server/ptr/ptr.go (1)
Int(15-17)
server/service/integration_enterprise_test.go (5)
server/fleet/hosts.go (1)
Host(279-411)server/fleet/software_installer.go (1)
UploadSoftwareInstallerPayload(497-533)server/datastore/mysql/testing_utils.go (1)
ExecAdhocSQL(421-425)server/fleet/policies.go (3)
PolicyPayload(15-59)ModifyPolicyPayload(170-206)Policy(280-303)server/fleet/scripts.go (2)
Script(17-32)HostScriptResult(208-284)
server/mock/datastore_mock.go (3)
server/fleet/software_installer.go (1)
HostSoftwareInstallResultPayload(936-953)server/fleet/scripts.go (2)
HostScriptResultPayload(196-203)HostScriptResult(208-284)server/test/activities.go (1)
SetHostSoftwareInstallResult(66-74)
server/datastore/mysql/software_test.go (2)
server/test/activities.go (1)
SetHostSoftwareInstallResult(66-74)server/fleet/datastore.go (1)
Datastore(50-2625)
🔇 Additional comments (43)
server/fleet/policies.go (1)
110-112: LGTM: Well-documented retry limit constant.The addition of
MaxPolicyAutomationRetriesclearly defines the retry behavior for policy automations. The constant value of 3 aligns with the PR objectives, and the comment provides clear context.server/fleet/software_installer.go (1)
422-425: LGTM: Well-designed attempt tracking field.The
AttemptNumberfield addition is well-implemented:
- Pointer type correctly represents optional nature (nil for non-policy installs)
- Clear documentation of semantics (nil = not policy-triggered, 1-3 for retry attempts)
- Appropriate JSON tag with
omitempty- DB tag matches expected schema
changes/31916-retry-policy-automations (1)
1-1: LGTM: Clear user-facing documentation.The changelog entry clearly communicates the retry behavior to users. The "up to three times" limit matches the
MaxPolicyAutomationRetriesconstant defined in the codebase.server/datastore/mysql/software_titles_test.go (1)
2232-2232: Signature update looks correct.The addition of
nilas the third argument correctly updates the call to match the newSetHostSoftwareInstallResultsignature. Passingnilis appropriate for this test since it's not testing policy automation retry behavior.Comprehensive test coverage for the
attemptNumberparameter already exists in the integration tests, including verification of attempt counting (1, 2, 3), behavior at maximum attempts, and persistence of attempt numbers.server/datastore/mysql/schema.sql (3)
975-1009:host_script_results.attempt_numberschema looks correct and matches Go structThe new
attempt_number int DEFAULT NULLcolumn integrates cleanly here: type and nullability matchHostScriptResult.AttemptNumber *int(db:"attempt_number"), and there are no dependent generated columns/constraints to adjust. No changes needed from this file’s side.Based on learnings, this looks like the expected output of the new migration rather than a manual edit.
1045-1084:host_software_installs.attempt_numberaddition is consistent with retries designSimilarly, adding
attempt_number int DEFAULT NULLis consistent with the retry-tracking design and does not interfere with the existing generatedstatus/execution_statuscolumns. It gives the datastore a straightforward place to persist attempt numbers for installs.Based on learnings, this also appears to come from regenerating schema.sql from the latest migrations.
1759-1768: Migration status table updates are internally consistentThe
AUTO_INCREMENT=469and the extendedINSERT INTO migration_status_tableslist (up through id 468 / version 20260108223716) are coherent: last inserted id matches AUTO_INCREMENT−1, and the versions sequence continues as expected. This aligns with adding the new migration that introduced theattempt_numbercolumns.Based on learnings, this matches the normal pattern of schema.sql being regenerated from migrations.
server/datastore/mysql/software_installers.go (1)
1422-1505: LGTM! Query modifications correctly add retry tracking columns.The additions to
GetSoftwareInstallResultsproperly extend both parts of the UNION query:
software_installer_id,source, andattempt_numberare added consistently to both the historical (host_software_installs) and upcoming (upcoming_activities) result paths- The
NULL AS attempt_numberfor upcoming activities is appropriate since those installs haven't been attempted yet- The existing filtering by
execution_idensures precise scoping, and theua.activated_at IS NULLcondition prevents duplicates between the UNION partsThe changes align with the PR's objective to track retry attempts for policy automations.
server/datastore/mysql/hosts_test.go (2)
4199-4217: SetHostScriptExecutionResult signature usage looks correctThe added
nilargument aligns these test calls with the newSetHostScriptExecutionResultsignature while preserving existing behavior; usingnilis appropriate here since these tests don’t validate attempt-count logic.
8491-8496: Consistent use of new optional attempt parameter in SetHostScriptExecutionResultThis updated call also correctly passes
nilfor the new attempt parameter, matching the intended optional semantics without changing the test’s behavior.server/datastore/mysql/migrations/tables/20260108223716_AddAttemptNumberToScriptsAndSoftwareInstalls.go (1)
9-34: Migration shape and error handling look correctThe Up migration cleanly adds nullable
attempt_numbercolumns to both tables with contextual error wrapping, and the no-op Down function matches this repo’s convention of non-rollbackable migrations. Based on learnings, leavingDown_20260108223716empty is expected.server/datastore/mysql/software_installers_test.go (1)
183-188: Test callsites correctly adapted to new attemptNumber parameterAll updated calls to
SetHostSoftwareInstallResultandSetHostScriptExecutionResultnow pass a trailingnilfor the newattemptNumberargument, which is appropriate for these non-policy / non-retry scenarios. Assertions remain unchanged, so behavior is preserved while tests stay aligned with the new API.Also applies to: 198-203, 239-244, 404-409, 460-465, 531-536, 552-557, 1280-1285, 1353-1358, 2275-2280, 2359-2365, 2411-2424, 3133-3139
server/test/activities.go (1)
30-36: Helpers correctly pass nil attemptNumber for non-policy executionsThe activity helpers now pass
nilas theattemptNumberwhen recording script and software results for manual/uninstall flows, which matches the intended “not policy automation” semantics and keeps tests aligned with the updated datastore interface.Also applies to: 66-73, 105-113
server/datastore/mysql/scripts.go (1)
145-229: SetHostScriptExecutionResult / GetHostScriptExecutionResult wiring for attempt_number looks sound; please double-check NULL semantics in CountHostScriptAttempts
SetHostScriptExecutionResultnow updateshost_script_results.attempt_numberalongside the other result fields and keeps the duplicate-result guard and MDM/uninstall handling intact.getHostScriptExecutionResultDBcorrectly projectsbatch_execution_idandattempt_numberfor active results, and suppliesNULLfor both in the upcoming-activity path, keeping the struct mapping consistent.CountHostScriptAttemptsis properly scoped byhost_id,script_id, andpolicy_id, and only counts rows whereattempt_number > 0 OR attempt_number IS NULL, effectively treatingattempt_number = 0as “old sequence” while including current-sequence (positive) and unset attempts.One nuance to confirm: any pre-existing
host_script_resultsrows for a given(host_id, script_id, policy_id)that predate this feature (and thus haveattempt_number = NULL) will be included in the count. That means the very first retry-aware run for a long-lived policy/host pair could start with a non-zero attempt count unless some higher-level logic proactively resets those older rows toattempt_number = 0when the policy is known to have passed. If that’s not desired, you may want to revisit how/when old rows are normalized.As per coding guidelines, the new COUNT query uses precise filtering and avoids under-scoped
SELECTusage.Also applies to: 387-417, 469-487
server/datastore/mysql/activities_test.go (1)
1285-1288: LGTM! Test call sites updated to match expanded API signatures.The test updates correctly pass
nilfor the new optionalattemptNumberparameter inSetHostScriptExecutionResultandSetHostSoftwareInstallResultcalls, which is appropriate since these tests don't exercise the retry behavior.Also applies to: 1426-1431, 1448-1453, 1678-1683, 1693-1697, 1705-1710
server/datastore/mysql/policies.go (3)
540-613: LGTM! Flip detection and attempt reset logic is correctly implemented.The implementation properly:
- Computes newly-passing policies before updating membership to compare against previous state
- Resets
attempt_numberto 0 for both script results and software installs when policies flip from failing to passing- Uses proper SQL filtering with
host_idandpolicy_id IN (?)clausesThe flip detection happening before the transaction is a pragmatic design choice that avoids holding locks during computation, with the
withRetryTxxproviding conflict resolution.
1428-1484: LGTM! Async batch processing correctly handles per-host flip detection.The implementation properly extends the flip detection and attempt reset logic to handle batches of hosts:
- Groups incoming results per host before computing flips
- Computes newly-passing policies separately for each host
- Resets attempt numbers within the same transaction as the membership upsert
This ensures consistency between sync and async policy membership update paths.
1755-1775: LGTM! IsPolicyFailing correctly implements the failure check semantics.The method correctly returns
true(failing) when:
- No membership record exists (
sql.ErrNoRows) — policy hasn't been evaluated yetpassesisnil— policy execution didn't completepassesisfalse— policy is explicitly failingThis aligns with the retry logic requirement: automations should retry while the policy remains in a non-passing state. The SQL query properly filters on both
policy_idandhost_id.server/fleet/datastore.go (3)
687-687: LGTM: Signature extension supports retry tracking.The addition of
attemptNumber *intparameter is well-designed. Using a pointer allows distinguishing between policy-triggered installs (non-nil with attempt number) and regular installs (nil), which aligns with the retry feature requirements.
868-875: LGTM: New methods provide proper scoping parameters.The three new methods for policy retry logic are well-designed:
- Each method includes both host and policy identifiers, enabling precise filtering
- Comments clearly explain their purpose
- Parameter types are consistent with existing datastore patterns
Note: The actual implementations should be reviewed to ensure SQL queries use these parameters properly with WHERE clauses (per coding guidelines for
**/*.go).
1827-1827: LGTM: Consistent signature extension.The
attemptNumber *intparameter addition mirrors the change toSetHostSoftwareInstallResult, maintaining consistency across the retry tracking implementation.server/datastore/mysql/policies_test.go (2)
83-85: New policy retry tests correctly wired into TestPolicies suiteThe three new helpers are cleanly added to the table-driven
TestPoliciessuite with descriptive names and consistent ordering. No changes needed.
6414-6476: IsPolicyFailing test semantics look correct and comprehensiveThis test exercises the important edge cases for
IsPolicyFailing(no membership row,passes = NULL,passes = false,passes = true, different hosts, different policies) and clearly documents the intended semantics (treat missing/NULL/false as failing, true as passing). This aligns well with how policy automations typically reason about “still failing” vs “resolved”. No issues from my side.server/datastore/mysql/scripts_test.go (3)
57-57: LGTM!New test case entry correctly wired into the test suite.
114-134: LGTM!All
SetHostScriptExecutionResultcalls correctly updated to passnilfor the newattemptNumberparameter, which is appropriate for non-policy-automation script executions.
2989-3132: LGTM! Comprehensive test coverage for attempt counting.The test covers:
- Zero attempts case
- Multiple consecutive attempts with the same policy/script
- Executions without policy_id (should not affect count)
- Different hosts with same policy/script
- Same host with different policies
- Same host/policy with different scripts
One minor observation: at line 3029, the first call to
SetHostScriptExecutionResultpassesnilforattemptNumber(correct for the initial attempt), but subsequent calls at lines 3053 and 3077 pass explicit attempt numbers viaptr.Int(2)andptr.Int(3). This asymmetry is intentional to test that the count increments correctly regardless of what's passed, which is good coverage.server/service/orbit.go (6)
1020-1033: Verify activity suppression logic handles nil ExitCode correctly.The condition at line 1028 checks
hsr.ExitCode == nil || *hsr.ExitCode != 0to determine if the script failed. This is correct for the failure check. However, whenhsr.AttemptNumberis not nil and the script succeeded (exit code 0),shouldCreateActivityremainstrue, which is the expected behavior.The logic correctly suppresses activity creation only for intermediate failed attempts (attempts 1 and 2 when max is 3).
1482-1496: Consistent approach with script attempt counting.The software install attempt counting mirrors the script counting logic, which is good for consistency. The same observation about verifying the counting logic applies here.
1556-1572: Activity suppression and retry logic looks correct but has asymmetric error handling.The retry is triggered inside the activity suppression block for failed installs, and errors are only logged (line 1558-1563), which is different from the script retry handling. This inconsistency is actually better—logging instead of failing is more resilient.
However, note that
maybeRetryPolicyAutomationSoftwareInstallis called before checkingshouldCreateActivity, meaning a retry could be queued even if we then create an activity (whenAttemptNumberis nil, which shouldn't happen for policy automation but is worth noting).
915-934: The counting logic is correct. TheCountHostScriptAttemptsSQL query explicitly includes rows whereattempt_number IS NULLin addition to those whereattempt_number > 0(line 480 of server/datastore/mysql/scripts.go). This means newly created rows with NULLattempt_numberare counted. Therefore, when a new script execution is created viaNewHostScriptExecutionRequest, the subsequentCountHostScriptAttemptscall returns a count that already includes that new row. SettingattemptNumber = &countproduces the correct attempt number with no off-by-one error.
1607-1642: The nil check on AttemptNumber is defensive programming and does not indicate a data inconsistency issue.The concern about race conditions is unfounded because the operations are sequential, not concurrent.
SetHostSoftwareInstallResultuses a database transaction (withRetryTxx) that ensures theattempt_numberis committed beforeGetSoftwareInstallResultsis called immediately after. Additionally,maybeRetryPolicyAutomationSoftwareInstallis only invoked whenstatus == fleet.SoftwareInstallFailed, which can only originate from the first UNION part of theGetSoftwareInstallResultsquery (thehost_software_installstable), which always includes theattempt_numberfield. The WHERE clause properly filters byexecution_id, ensuring the correct record is retrieved.
1644-1682: Script retry correctly uses original execution contents for reproducibility.The
maybeRetryPolicyAutomationScriptfunction correctly passeshsr.ScriptContents(the original script contents from the initial execution) to the retry request. This is intentional: each script execution is versioned through separatescript_contentsrecords, ensuring that retries reproduce the exact same script execution for proper debugging and auditability. If the script is later modified viaUpdateScriptContents, the retry will still use the original contents that caused the policy to fail—which is the correct behavior.server/datastore/mysql/software_test.go (2)
106-107: RegisteringtestCountHostSoftwareInstallAttemptsin matrix looks goodThe new test is correctly wired into
TestSoftware, so it will run with the rest of the datastore tests. No issues here.
4175-4331: Signature updates forSetHostSoftwareInstallResult/SetHostScriptExecutionResultare consistentAcross these blocks, all updated call sites now pass the new third argument:
nilfor legacy/non-automation flows where attempt numbering is irrelevant.- Explicit
*intonly where tests care about attempt semantics.Argument ordering and payloads remain unchanged, so behavior is preserved while satisfying the new API.
Also applies to: 6116-6149, 6200-6206, 6313-6319, 9131-9138, 9220-9221
server/mock/datastore_mock.go (3)
536-540: SetHostSoftwareInstallResult mock correctly extended with attemptNumberThe updated
SetHostSoftwareInstallResultFunctype and its wrapper now accept and forwardattemptNumber *int, matching the production signature while preserving the existing lock/flag pattern. This keeps tests able to observe and drive retry-specific behavior without altering the mock conventions.Also applies to: 6095-6099
684-688: New policy retry helpers are consistently wired through the mock
IsPolicyFailingFunc,CountHostSoftwareInstallAttemptsFunc, andCountHostScriptAttemptsFuncare added as function types, stored onDataStorewith matching*Invokedflags, and exposed via thin wrappers that set the flag under lock and delegate with the correct parameter order. This is consistent with the rest of the mock and should integrate cleanly with the new retry logic.Also applies to: 2726-2733, 6613-6632
1210-1212: SetHostScriptExecutionResult mock updated to support attemptNumberThe
SetHostScriptExecutionResultFunctype and its wrapper now includeattemptNumber *intand forward it correctly while retaining the existing invocation-tracking pattern. This keeps the mock aligned with the main datastore interface and supports script retry tests without behavioral surprises.Also applies to: 8454-8458
server/service/integration_enterprise_test.go (6)
7107-7107: LGTM!The additional
nilparameter correctly passes no attempt number for this test case, which is appropriate since this test isn't specifically exercising the retry tracking logic.
17134-17267: Well-structured test setup with properly scoped queries.The test setup is comprehensive and all SQL queries include appropriate filtering criteria (host_id, software_installer_id, policy_id) to ensure correct data isolation. The helper functions nicely encapsulate the test operations.
17269-17361: LGTM!The retry logic testing correctly verifies:
- Attempt number progression (1 → 2 → 3)
- Activity is created only after the final attempt
- No additional retries are queued after reaching the maximum of 3 attempts
17362-17459: LGTM!The test correctly verifies:
- Policy passing resets the retry counter (attempts marked as obsolete with
attempt_number=0)- A fresh retry sequence can be started after reset (new attempts start at 1)
- The
CountHostSoftwareInstallAttemptsdatastore method behaves correctlyThe final database state verification at lines 17444-17458 provides good coverage of the expected state.
18280-18383: Well-structured test setup with properly scoped queries.Similar to the software install test, all SQL queries include appropriate filtering criteria (host_id, script_id, policy_id). The helper functions are well-designed for the test operations.
18385-18576: LGTM!The script retry test correctly verifies the same behaviors as the software install test:
- Retry attempt progression (1 → 2 → 3)
- Activity creation only after final attempt
- Policy passing resets attempts
- Fresh sequence can start after reset
The parallel structure between both tests provides good consistency and coverage of the retry feature for both automation types.
| func (ds *Datastore) CountHostSoftwareInstallAttempts(ctx context.Context, hostID, softwareInstallerID, policyID uint) (int, error) { | ||
| var count int | ||
| // Only count attempts from the current retry sequence. | ||
| // When a policy passes, all attempt_number values are reset to 0 to mark them as "old sequence". | ||
| // We count attempts where attempt_number > 0 (current sequence) OR attempt_number IS NULL (currently being processed). | ||
| err := sqlx.GetContext(ctx, ds.reader(ctx), &count, ` | ||
| SELECT COUNT(*) | ||
| FROM host_software_installs | ||
| WHERE host_id = ? | ||
| AND software_installer_id = ? | ||
| AND policy_id = ? | ||
| AND (attempt_number > 0 OR attempt_number IS NULL) | ||
| `, hostID, softwareInstallerID, policyID) | ||
| if err != nil { | ||
| return 0, ctxerr.Wrap(ctx, err, "count host software install attempts") | ||
| } | ||
|
|
||
| return count, nil | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
echo "=== Finding calls to CountHostSoftwareInstallAttempts ==="
rg -n -C 5 'CountHostSoftwareInstallAttempts' --type go
echo -e "\n=== Looking for retry limit logic (3 retries) ==="
rg -n -C 3 'attempt.*3|max.*attempt|retry.*limit' --type go server/
echo -e "\n=== Checking host_software_installs table schema ==="
rg -n -C 5 'host_software_installs' --type go | head -100Repository: fleetdm/fleet
Length of output: 50369
🏁 Script executed:
#!/bin/bash
echo "=== Checking how removed and canceled are used in CountHostSoftwareInstallAttempts context ==="
rg -n -C 10 'removed|canceled' server/datastore/mysql/software.go | head -150
echo -e "\n=== Checking orbit.go usage more carefully ==="
rg -n -B 15 -A 15 'CountHostSoftwareInstallAttempts' server/service/orbit.go
echo -e "\n=== Looking for integration test that checks behavior with canceled/removed ==="
rg -n -C 20 'removed.*=.*1\|canceled.*=.*1' server/service/integration_enterprise_test.go | head -100Repository: fleetdm/fleet
Length of output: 8644
🏁 Script executed:
#!/bin/bash
echo "=== Checking testCountHostSoftwareInstallAttempts for any canceled/removed tests ==="
sed -n '10346,10497p' server/datastore/mysql/software_test.go
echo -e "\n=== Checking if there are other tests with canceled/removed installs ==="
rg -n 'canceled.*=.*1|removed.*=.*1' server/datastore/mysql/software_test.goRepository: fleetdm/fleet
Length of output: 5550
🏁 Script executed:
#!/bin/bash
echo "=== Looking for code that sets canceled = 1 or removed = 1 ==="
rg -n 'canceled.*=.*1|removed.*=.*1' server/datastore/mysql/software.go | head -30
echo -e "\n=== Looking for CancelSoftwareInstall or similar cancel logic ==="
rg -n -C 5 'CancelSoftwareInstall|cancel.*software.*install' --type go server/ | head -80Repository: fleetdm/fleet
Length of output: 6776
🏁 Script executed:
#!/bin/bash
echo "=== Looking for code that sets canceled = 1 when activity is deleted ==="
rg -n -B 5 -A 5 'canceled.*=.*1' server/datastore/mysql/ | grep -A 10 -B 10 'activity\|DELETE\|upcoming'
echo -e "\n=== Check the DeleteUpcomingActivity or similar function ==="
rg -n -C 10 'DeleteUpcomingActivity|canceled' server/datastore/mysql/activities.go | head -100Repository: fleetdm/fleet
Length of output: 11600
Add filters for removed and canceled installs to retry attempt count.
The CountHostSoftwareInstallAttempts query should exclude canceled and removed attempts, consistent with how all other queries in the codebase handle these records. Currently, if an admin cancels an attempt or a record is marked as removed, it still counts toward the 3-retry limit, which prevents valid retries.
Recommended fix
func (ds *Datastore) CountHostSoftwareInstallAttempts(ctx context.Context, hostID, softwareInstallerID, policyID uint) (int, error) {
var count int
// Only count attempts from the current retry sequence.
// When a policy passes, all attempt_number values are reset to 0 to mark them as "old sequence".
// We count attempts where attempt_number > 0 (current sequence) OR attempt_number IS NULL (currently being processed).
err := sqlx.GetContext(ctx, ds.reader(ctx), &count, `
SELECT COUNT(*)
FROM host_software_installs
WHERE host_id = ?
AND software_installer_id = ?
AND policy_id = ?
+ AND removed = 0
+ AND canceled = 0
AND (attempt_number > 0 OR attempt_number IS NULL)
`, hostID, softwareInstallerID, policyID)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| func (ds *Datastore) CountHostSoftwareInstallAttempts(ctx context.Context, hostID, softwareInstallerID, policyID uint) (int, error) { | |
| var count int | |
| // Only count attempts from the current retry sequence. | |
| // When a policy passes, all attempt_number values are reset to 0 to mark them as "old sequence". | |
| // We count attempts where attempt_number > 0 (current sequence) OR attempt_number IS NULL (currently being processed). | |
| err := sqlx.GetContext(ctx, ds.reader(ctx), &count, ` | |
| SELECT COUNT(*) | |
| FROM host_software_installs | |
| WHERE host_id = ? | |
| AND software_installer_id = ? | |
| AND policy_id = ? | |
| AND (attempt_number > 0 OR attempt_number IS NULL) | |
| `, hostID, softwareInstallerID, policyID) | |
| if err != nil { | |
| return 0, ctxerr.Wrap(ctx, err, "count host software install attempts") | |
| } | |
| return count, nil | |
| } | |
| func (ds *Datastore) CountHostSoftwareInstallAttempts(ctx context.Context, hostID, softwareInstallerID, policyID uint) (int, error) { | |
| var count int | |
| // Only count attempts from the current retry sequence. | |
| // When a policy passes, all attempt_number values are reset to 0 to mark them as "old sequence". | |
| // We count attempts where attempt_number > 0 (current sequence) OR attempt_number IS NULL (currently being processed). | |
| err := sqlx.GetContext(ctx, ds.reader(ctx), &count, ` | |
| SELECT COUNT(*) | |
| FROM host_software_installs | |
| WHERE host_id = ? | |
| AND software_installer_id = ? | |
| AND policy_id = ? | |
| AND removed = 0 | |
| AND canceled = 0 | |
| AND (attempt_number > 0 OR attempt_number IS NULL) | |
| `, hostID, softwareInstallerID, policyID) | |
| if err != nil { | |
| return 0, ctxerr.Wrap(ctx, err, "count host software install attempts") | |
| } | |
| return count, nil | |
| } |
🤖 Prompt for AI Agents
In @server/datastore/mysql/software.go around lines 5899 - 5917, The
CountHostSoftwareInstallAttempts function currently counts attempts regardless
of canceled or removed records; update its SQL WHERE clause in
host_software_installs (inside CountHostSoftwareInstallAttempts) to exclude
canceled and removed attempts by adding checks like canceled_at IS NULL and
removed_at IS NULL (or the equivalent removed flag used elsewhere in the
codebase) so only active, non-removed attempts are counted for the retry limit.
| // fail policy fails to trigger first attempt | ||
| submitPolicyResult(policy.ID, false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor typo in comment.
The comment reads "fail policy fails to trigger first attempt" but should likely be "fail policy to trigger first attempt".
📝 Suggested fix
// pass policy
submitPolicyResult(policy.ID, true)
- // fail policy fails to trigger first attempt
+ // fail policy to trigger first attempt
submitPolicyResult(policy.ID, false)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // fail policy fails to trigger first attempt | |
| submitPolicyResult(policy.ID, false) | |
| // pass policy | |
| submitPolicyResult(policy.ID, true) | |
| // fail policy to trigger first attempt | |
| submitPolicyResult(policy.ID, false) |
🤖 Prompt for AI Agents
In @server/service/integration_enterprise_test.go around lines 18481 - 18482,
Update the comment above the submitPolicyResult(policy.ID, false) call to
correct the typo: change "fail policy fails to trigger first attempt" to "fail
policy to trigger first attempt" so the comment accurately describes the intent
of the test step; locate the comment adjacent to the submitPolicyResult
invocation in the integration_enterprise_test suite.
Related issue: Resolves #31916
Checklist for submitter
If some of the following don't apply, delete the relevant line.
changes/,orbit/changes/oree/fleetd-chrome/changes.See Changes files for more information.
SELECT *is avoided, SQL injection is prevented (using placeholders for values in statements)Testing
Database migrations
COLLATE utf8mb4_unicode_ci).Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.