@pchila pchila commented Nov 12, 2025

What does this PR do?

Introduces manual rollback for managed agents (requires elastic/fleet-server#5975 on fleet-server side)

Why is it important?

It implements the manual rollback feature for Fleet-managed agents, matching what #9643 implemented for standalone agents.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

How to test this PR locally

Prerequisites

In order to test this PR we need to use a fleet-server that contains the changes of PR elastic/fleet-server#5975.

  1. (Option A) elastic/fleet-server#5975 (Support agent available rollback fields) isn't available yet
    If the fleet-server PR isn't merged yet, we can use a local fleet-server build to produce both a Docker image (to create a stack in the ECH CFT region or with elastic-package) and a zip/tar.gz archive that includes the local fleet-server artifact, as follows (MANIFEST_URL, PLATFORMS, and the AGENT_DROP_PATH location can be changed as needed):
    # create a drop directory
    mkdir ./drop

    # create cloud image
    MANIFEST_URL=https://snapshots.elastic.co/9.3.0-9546ac47/manifest-9.3.0-SNAPSHOT.json AGENT_DROP_PATH=./drop mage downloadManifest \
      && cp <fleet-server containing dir>/fleet-server/build/distributions/* ./drop \
      && AGENT_DROP_PATH=./drop SNAPSHOT=true PACKAGES="tar.gz" DOCKER_VARIANTS=cloud PLATFORMS=linux/amd64 mage package

    # create a .tar.gz agent package (windows .zip works the same).
    # For some reason the drop directory gets dirty during the packaging so we have to launch a separate command
    # instead of specifying PACKAGES="docker,tar.gz" in a single package command
    MANIFEST_URL=https://snapshots.elastic.co/9.3.0-9546ac47/manifest-9.3.0-SNAPSHOT.json AGENT_DROP_PATH=./drop mage downloadManifest \
      && cp <fleet-server containing dir>/fleet-server/build/distributions/* ./drop \
      && AGENT_DROP_PATH=./drop SNAPSHOT=true PACKAGES="tar.gz" PLATFORMS=linux/amd64 mage package

    # Create another package with a different version from the same commit (useful to test upgrade/rollback both ways)
    MANIFEST_URL=https://snapshots.elastic.co/9.3.0-9546ac47/manifest-9.3.0-SNAPSHOT.json AGENT_DROP_PATH=./drop mage downloadManifest \
      && cp <fleet-server containing dir>/fleet-server/build/distributions/* ./drop \
      && AGENT_PACKAGE_VERSION=9.3.0+build20251125 BEAT_VERSION=9.3.0 AGENT_DROP_PATH=./drop SNAPSHOT=true PACKAGES="tar.gz" PLATFORMS=linux/amd64 mage package
  2. (Option B) elastic/fleet-server#5975 (Support agent available rollback fields) is merged and already used in elastic-agent packaging
    Alternatively, if fleet-server is already available in the manifest pointed at by .package-version, we only need to create two elastic-agent archives with a simpler command:
    # 9.3.0-SNAPSHOT version
    USE_PACKAGE_VERSION=true SNAPSHOT=true PACKAGES="tar.gz" PLATFORMS=linux/amd64 mage package

    # 9.3.0+build20251125-SNAPSHOT version
    AGENT_PACKAGE_VERSION=9.3.0+build20251125 USE_PACKAGE_VERSION=true SNAPSHOT=true PACKAGES="tar.gz" PLATFORMS=linux/amd64 mage package
  3. Prepare an HTTP server to provide the repackaged agent version
  • Create a beats/elastic-agent directory tree under build/distributions (or under a different root folder, if preferred)
    mkdir -p beats/elastic-agent
  • Copy the relevant files under beats/elastic-agent
    cp elastic-agent-9.3.0+build20251125-SNAPSHOT-linux-x86_64.tar.gz* beats/elastic-agent
  • Run a simple Python HTTP server from that root folder (it listens on port 8000 by default)
    python -m http.server
  4. Either apply this patch
    0001-DO-NOT-MERGE-Test-commit-to-skip-verifying-upgrade-p.patch
    or set up an alternative PGP key to sign and verify the packages produced from this PR
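The HTTP server preparation described above can also be scripted. The following is a minimal Python sketch, assuming the packages were written to build/distributions. The helper names stage_artifacts and serve are hypothetical and are not part of this repository's tooling. It copies the matching archives (together with companion files such as checksums that share the same prefix) into beats/elastic-agent, and can serve the root folder much like python -m http.server does.

```python
import functools
import shutil
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path


def stage_artifacts(src_dir, root, pattern):
    """Copy files matching pattern from src_dir into <root>/beats/elastic-agent.

    Hypothetical helper: returns the sorted list of copied file names.
    """
    dest = Path(root) / "beats" / "elastic-agent"
    dest.mkdir(parents=True, exist_ok=True)  # same effect as: mkdir -p beats/elastic-agent
    copied = []
    for src in sorted(Path(src_dir).glob(pattern)):
        if src.is_file():
            shutil.copy2(src, dest / src.name)
            copied.append(src.name)
    return copied


def serve(root, port=8000):
    """Serve root over HTTP until interrupted (8000 is the http.server default port)."""
    handler = functools.partial(SimpleHTTPRequestHandler, directory=str(root))
    with ThreadingHTTPServer(("", port), handler) as httpd:
        httpd.serve_forever()
```

For example, calling stage_artifacts("build/distributions", "build/distributions", "elastic-agent-9.3.0+build20251125-SNAPSHOT-linux-x86_64.tar.gz*") followed by serve("build/distributions") would reproduce the manual steps. The Agent binary download configured in Fleet then points at the host running this server.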

Testing

  1. Create a deployment using the elastic-agent image containing the right fleet-server, for example on ECH (after uploading the image to one of the allowed Docker repositories, then using Terraform or the Cloud API to specify the custom integration_server.config.docker_image value)
  2. Log in to Kibana, navigate to the Fleet UI, and create a new Agent binary download pointing at the HTTP server set up in the Prerequisites section
  3. Create a new policy (empty for the sake of simplicity), specifying the new Agent binary download
  4. Set a rollback window on the policy using the override API from the Dev Console (15m in this example)
    PUT kbn:/api/fleet/agent_policies/<policy id>
    {
        "name": "TestRollback",
        "namespace": "default",
        "overrides": {
            "agent": {
                "upgrade": {
                    "rollback":{
                        "window": "15m"
                    }
                }
            }
        }
    }
  5. Extract and install/enroll version 9.3.0-SNAPSHOT
    tar xvf elastic-agent-9.3.0-SNAPSHOT-linux-x86_64.tar.gz
    cd elastic-agent-9.3.0-SNAPSHOT-linux-x86_64
    sudo ./elastic-agent install --url=https://fleet-server:8220 --enrollment-token=<enrollment token>
    Then verify that the agent appears online in Fleet.
  6. Check the document for the elastic-agent in .fleet-agents from the Dev Console
    GET .fleet-agents/_search
    {
        "query": {
            "match": {
              "agent.id": "8432a3dc-0b75-43f2-9fac-87f8701b82a2"
            }
        }
    }
    The interesting part of the document is toward the bottom; note the available_rollbacks key (some details are omitted for brevity and marked with ... in the example below).
    {
        "_index": ".fleet-agents-7",
        "_id": "8432a3dc-0b75-43f2-9fac-87f8701b82a2",
        "_score": 0.6931471,
        "_source": {
          "access_api_key_id": "OLrfxZoBUWrzn3OjJDDo",
          "action_seq_no": [
            -1
          ],
          "active": true,
          "agent": {
            "id": "8432a3dc-0b75-43f2-9fac-87f8701b82a2",
            "version": "9.3.0"
          },
          "enrolled_at": "2025-11-27T15:12:06Z",
          "local_metadata": { ... },
          "namespaces": [
            "default"
          ],
          "policy_id": "6eeca4e4-79eb-477f-8f60-73b4c04b6be2",
          "type": "PERMANENT",
          "outputs": { ... },
          "policy_revision_idx": 2,
          "updated_at": "2025-11-27T15:12:53Z",
          "available_rollbacks": [],
          "components": [ ... ],
          "last_checkin_message": "Running",
          "last_checkin_status": "online",
          "last_checkin": "2025-11-27T15:12:44Z",
          "unhealthy_reason": null,
          "last_known_status": "online"
        }
      }
  7. Trigger an upgrade to version 9.3.0+build20251125-SNAPSHOT (or whatever version was used for repackaging the agent) via the Fleet UI,
    then wait until the agent restarts with the new version and is in the Upgrade monitoring state
  8. Check again the agent document in .fleet-agents
    {
        "_index": ".fleet-agents-7",
        "_id": "8432a3dc-0b75-43f2-9fac-87f8701b82a2",
        "_score": 0.6931471,
        "_ignored": [
          "local_metadata.elastic.agent.version.keyword",
          "upgrade_details.target_version.keyword"
        ],
        "_source": {
          "access_api_key_id": "OLrfxZoBUWrzn3OjJDDo",
          "action_seq_no": [
            1
          ],
          "active": true,
          "agent": {
            "id": "8432a3dc-0b75-43f2-9fac-87f8701b82a2",
            "version": "9.3.0+build20251125"
          },
          "enrolled_at": "2025-11-27T15:12:06Z",
          "local_metadata": { ... },
          "namespaces": [
            "default"
          ],
          "policy_id": "6eeca4e4-79eb-477f-8f60-73b4c04b6be2",
          "type": "PERMANENT",
          "outputs": { ... },
          "policy_revision_idx": 4,
          "updated_at": "2025-11-27T15:35:03Z",
          "available_rollbacks": [
            {
              "valid_until": "2025-11-27T15:48:59Z",
              "version": "9.3.0-SNAPSHOT"
            }
          ],
          "components": [],
          "last_checkin_message": "Running",
          "last_checkin_status": "online",
          "last_checkin": "2025-11-27T15:35:02Z",
          "unhealthy_reason": [
            "output"
          ],
          "last_known_status": "online",
          "upgrade_started_at": null,
          "upgraded_at": "2025-11-27T15:34:00Z",
          "upgrade_details": {
            "metadata": {
              "download_percent": 1
            },
            "action_id": "e26c5b33-5ab3-44a1-9f4a-163832603b0f",
            "state": "UPG_WATCHING",
            "target_version": "9.3.0+build20251125"
          },
          "upgrade_status": null
        }
      }
    Now available_rollbacks shows 9.3.0-SNAPSHOT as a possible rollback target.
  9. Before the rollback window expires (as indicated by the valid_until attribute), manually roll back the agent
    POST kbn:/api/fleet/agents/8432a3dc-0b75-43f2-9fac-87f8701b82a2/actions 
    {
        "action": {
            "type": "UPGRADE",
            "data": {
                "version": "9.3.0-SNAPSHOT",
                "rollback": true
            }    
        }
    }
    Verify that the agent rolls back to 9.3.0-SNAPSHOT.
  10. Verify the agent document in .fleet-agents
    {
        "_index": ".fleet-agents-7",
        "_id": "8432a3dc-0b75-43f2-9fac-87f8701b82a2",
        "_score": 0.53899646,
        "_source": {
          "access_api_key_id": "OLrfxZoBUWrzn3OjJDDo",
          "action_seq_no": [
            3
          ],
          "active": true,
          "agent": {
            "id": "8432a3dc-0b75-43f2-9fac-87f8701b82a2",
            "version": "9.3.0"
          },
          "enrolled_at": "2025-11-27T15:12:06Z",
          "local_metadata": { ... },
          "namespaces": [
            "default"
          ],
          "policy_id": "6eeca4e4-79eb-477f-8f60-73b4c04b6be2",
          "type": "PERMANENT",
          "outputs": { ... },
          "policy_revision_idx": 4,
          "updated_at": "2025-11-27T15:45:23Z",
          "available_rollbacks": [],
          "components": [],
          "last_checkin_message": "Running",
          "last_checkin_status": "online",
          "last_checkin": "2025-11-27T15:45:14Z",
          "unhealthy_reason": [
            "output"
          ],
          "last_known_status": "online",
          "upgrade_started_at": null,
          "upgraded_at": "2025-11-27T15:45:12Z",
          "upgrade_details": {
            "metadata": {
              "reason": "manual rollback requested to version 9.3.0-SNAPSHOT"
            },
            "action_id": "f46e5d9e-3418-48dc-a18b-f7a8bfae8cd9",
            "state": "UPG_ROLLBACK",
            "target_version": "9.3.0-SNAPSHOT"
          },
          "upgrade_status": null
        }
      }
    Now available_rollbacks is empty again, and upgrade_details reports the state UPG_ROLLBACK with a specific reason.
  11. If the manual rollback is triggered outside the rollback window or the version specified is not available as a rollback target, for example
    POST kbn:/api/fleet/agents/8432a3dc-0b75-43f2-9fac-87f8701b82a2/actions 
    {
        "action": {
            "type": "UPGRADE",
            "data": {
                "version": "8.19.7",
                "rollback": true
            }    
        }
    }
    the agent will display an Upgrade failed message (the correct error message appears in the i tooltip, which is incredibly difficult to screenshot).
    The same info can be found in the upgrade_details section of the .fleet-agents document:
          "upgrade_details": {
            "metadata": {
              "download_percent": 1,
              "reason": "manual rollback requested to version 9.3.0-SNAPSHOT",
              "error_msg": "version \"8.19.7\" not listed among the available rollbacks: no rollbacks available",
              "failed_state": "UPG_REQUESTED"
            },
            "action_id": "9a6d8770-1854-46be-9254-1a0f31f9c309",
            "state": "UPG_FAILED",
            "target_version": "8.19.7"
          },
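The acceptance rule exercised in steps 8 to 11 can be sketched in a few lines of Python. This is only an illustration of the behavior observed above, not the agent's actual implementation, which this description does not show. The helper names can_roll_back and rollback_action are hypothetical, and treating valid_until as an inclusive bound is an assumption. The check takes the available_rollbacks array from the .fleet-agents document and accepts a request only if the version is listed and its window has not expired.

```python
from datetime import datetime, timezone


def parse_ts(value):
    """Parse timestamps such as 2025-11-27T15:48:59Z into timezone-aware datetimes."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def can_roll_back(available_rollbacks, version, now):
    """Return (ok, reason) for a manual rollback request (hypothetical helper)."""
    for entry in available_rollbacks:
        if entry["version"] != version:
            continue
        # Assumption: valid_until is treated as an inclusive upper bound.
        if now <= parse_ts(entry["valid_until"]):
            return True, "rollback target available"
        return False, f"rollback window for {version} expired at {entry['valid_until']}"
    return False, f'version "{version}" not listed among the available rollbacks'


def rollback_action(version):
    """Build the request body used in step 9 for the agent actions API."""
    return {"action": {"type": "UPGRADE", "data": {"version": version, "rollback": True}}}


# available_rollbacks as reported by the document in step 8
rollbacks = [{"valid_until": "2025-11-27T15:48:59Z", "version": "9.3.0-SNAPSHOT"}]
in_window = parse_ts("2025-11-27T15:40:00Z")
after_window = parse_ts("2025-11-27T15:50:00Z")
```

With the step 8 document, can_roll_back(rollbacks, "9.3.0-SNAPSHOT", in_window) accepts the request, while asking for 8.19.7, or asking for 9.3.0-SNAPSHOT at after_window, is rejected. A pre-flight check like this could save a round trip before posting rollback_action("9.3.0-SNAPSHOT") from the Dev Console.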

Related issues

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

mergify bot commented Nov 12, 2025

This pull request does not have a backport label. Could you fix it @pchila? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-./d./d is the label that automatically backports to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@mergify mergify bot assigned pchila Nov 12, 2025

mergify bot commented Nov 21, 2025

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b add-rollback-flag-to-update-action upstream/add-rollback-flag-to-update-action
git merge upstream/main
git push upstream add-rollback-flag-to-update-action

@pchila pchila force-pushed the add-rollback-flag-to-update-action branch 2 times, most recently from 2d1051f to d79ff2e Nov 26, 2025 17:30
@pchila pchila force-pushed the add-rollback-flag-to-update-action branch from d79ff2e to cbd20bb Nov 27, 2025 13:30
@pchila pchila changed the title from "Add rollback flag to update action" to "Support manual rollback for Fleet-managed agents" Nov 27, 2025
@elasticmachine

💛 Build succeeded, but was flaky

cc @pchila

@pchila pchila added the enhancement, Team:Elastic-Agent-Control-Plane, Team:Elastic-Agent, backport-skip, and skip-changelog labels Dec 1, 2025
@pchila pchila marked this pull request as ready for review December 1, 2025 13:37
@pchila pchila requested a review from a team as a code owner December 1, 2025 13:37
@elasticmachine

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@swiatekm swiatekm left a comment

I haven't tried to manually test this, but the logic looks good to me.

Development

Successfully merging this pull request may close these issues.

  • Trigger manual rollback using a Fleet action for a managed agent
  • Add rollback field to actionUpgrade
