Skip to content

[Synthetics] Silent error handling in package policy deletion can mask failures #254278

@dominiqueclarke

Description

@dominiqueclarke

Summary

The deletePolicyBulk method in synthetics_private_location.ts silently swallows all errors during package policy deletion, making it impossible for callers to detect and handle failures. This can lead to orphaned package policies and inconsistent state.

Current Behavior

In synthetics_private_location.ts:

async deletePolicyBulk(policyIdsToDelete: string[]) {
  const soClient = this.server.coreStart.savedObjects.createInternalRepository();
  const esClient = this.server.coreStart.elasticsearch.client.asInternalUser;
  if (policyIdsToDelete.length > 0) {
    try {
      return await this.server.fleet.packagePolicyService.delete(
        soClient,
        esClient,
        policyIdsToDelete,
        {
          force: true,
          asyncDeploy: true,
        }
      );
    } catch (e) {
      this.server.logger.error(e);
      // Returns undefined on error - caller cannot detect failure
    }
  }
}

The calling code in editMonitors does not capture or check the delete response:

const [_createResponse, failedUpdatesRes] = await Promise.all([
  this.createPolicyBulk(policiesToCreate),
  this.updatePolicyBulk(policiesToUpdate),
  this.deletePolicyBulk(uniqueToDelete),  // Delete response is discarded
]);

Problems

1. Silent failure masking

All errors (500s, 404s, permission errors, etc.) are caught and logged but not surfaced to the caller. The method returns undefined on error, which is indistinguishable from a successful operation with no policies to delete.

Impact:

  • 500 errors during deletion are completely invisible — orphaned package policies may remain
  • The monitor edit operation will appear successful even when policies weren't properly cleaned up

2. Potential race condition with parallel operations

The editMonitors function executes create, update, and delete operations in parallel using Promise.all. While this improves performance, it can create race conditions where:

  • A "migrated" policy is being created with a new ID while the legacy policy is being deleted
  • Fleet agents may receive operations in an unexpected order

Impact:

  • During migration flows, agents could temporarily see inconsistent state
  • In edge cases, the wrong policy could be active on the agent

Suggested Improvements

Option A: Return detailed results and handle in caller

async deletePolicyBulk(policyIdsToDelete: string[]): Promise<DeletePackagePoliciesResponse | undefined> {
  if (policyIdsToDelete.length === 0) {
    return undefined;
  }
  
  const soClient = this.server.coreStart.savedObjects.createInternalRepository();
  const esClient = this.server.coreStart.elasticsearch.client.asInternalUser;
  
  const result = await this.server.fleet.packagePolicyService.delete(
    soClient,
    esClient,
    policyIdsToDelete,
    { force: true, asyncDeploy: true }
  );
  
  // Log and return failures for caller to handle
  const failures = result.filter(r => r.statusCode !== 200 && r.statusCode !== 404);
  if (failures.length > 0) {
    this.server.logger.error(`Failed to delete ${failures.length} package policies: ${JSON.stringify(failures)}`);
  }
  
  return result;
}

Option B: Distinguish 404s from actual failures

async deletePolicyBulk(policyIdsToDelete: string[]) {
  // ... existing setup ...
  try {
    return await this.server.fleet.packagePolicyService.delete(/* ... */);
  } catch (e) {
    // 404s are expected (policy already deleted or never existed)
    if (e.statusCode === 404) {
      this.server.logger.debug(`Policy not found during cleanup: ${e.message}`);
      return [];
    }
    // Re-throw actual failures so caller can handle them
    throw e;
  }
}

Option C: Sequential operations for migration scenarios

For migration scenarios specifically, consider executing delete operations after create/update to avoid race conditions:

const [_createResponse, failedUpdatesRes] = await Promise.all([
  this.createPolicyBulk(policiesToCreate),
  this.updatePolicyBulk(policiesToUpdate),
]);

// Delete after create/update completes to avoid race conditions
const deleteResult = await this.deletePolicyBulk(uniqueToDelete);

Additional Context

This issue exists independently of recent changes and affects all monitor edit operations involving private locations. The Fleet packagePolicyService.delete method returns detailed results per policy (including statusCode and body.message), but this information is currently discarded.

Acceptance Criteria

  • deletePolicyBulk returns meaningful results to callers
  • Actual deletion failures (5xx errors) are surfaced, not just logged
  • Expected "not found" scenarios (404s) are handled gracefully
  • Callers can detect and respond to deletion failures appropriately

Metadata

Metadata

Assignees

No one assigned

    Labels

    Team:actionable-obsFormerly "obs-ux-management", responsible for SLO, o11y alerting, significant events, & synthetics.bugFixes for quality problems that affect the customer experience

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions