salt-run state.orchestrate fails because it tries to run a second task while a first is ongoing #18564

@bernieke

To start with, this problem seems to be related to disk speed. We can reliably reproduce it on SATA-backed virtual machines, but not on SSD-backed ones (two different OpenStack flavors on otherwise identical hardware).
This is probably because the task takes longer to execute on a SATA disk (there is a lot of deb package installation going on, so this would make sense).

This is the relevant part of the output:

# salt-run state.orchestrate orchestration.sls
awingu_master:
----------
          ID: common
    Function: salt.state
      Result: True
     Comment: States ran successfully.
     Started: 12:50:39.093757
    Duration: 63217.649 ms
     Changes:   
----------
          ID: dns_server
    Function: salt.state
      Result: False
     Comment: Run failed on minions: awingu
              Failures:
                  awingu:
                      Data failed to compile:
                  ----------
                      The function "state.sls" is running as PID 6334 and was started at 2014, Nov 28 12:50:39.102732 with jid 20141128125039102732
     Started: 12:51:42.312406
    Duration: 630.701 ms

The error doesn't always occur at that point in the orchestration; sometimes it happens in a later task, but always right after a task that takes a long time to run (several minutes).

When I check right after the error, I can see that the referenced job is in fact still running. Issuing salt 'awingu' saltutil.running confirms this.
The job will, in time, do its thing and finish properly.
This job runs the "common" task that precedes the dns_server task (and which the dns_server task requires!).
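The link between the blocking job and the "common" step can be checked from the output alone, because a Salt jid is a timestamp in the form year, month, day, hour, minute, second, microsecond. A minimal sketch that decodes the jid from the error message and compares it with the "Started" time of the "common" step (the helper name jid_to_datetime is made up for this illustration):

```python
from datetime import datetime

# Salt jids of this era are plain timestamps: YYYYMMDDhhmmss + microseconds.
JID_FORMAT = "%Y%m%d%H%M%S%f"


def jid_to_datetime(jid):
    """Decode a 20-digit Salt jid into the time the job was created."""
    return datetime.strptime(jid, JID_FORMAT)


# jid taken from the "is running as PID 6334" error message above.
blocking_job = jid_to_datetime("20141128125039102732")

# "Started" value of the "common" orchestration step (same day as the jid).
common_started = datetime(2014, 11, 28, 12, 50, 39, 93757)

# How long after the "common" step began was the blocking job created?
gap = blocking_job - common_started
print(blocking_job.time(), gap)
```

The job was created about 9 ms after the "common" step started, which is consistent with it being the state.sls call issued by that step rather than an unrelated job.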

So it's pretty clear that the orchestrate runner considers the "common" task finished, while in truth it is not.

It looks like a salt execution returns prematurely, before the task has actually finished (I've noticed this on one or two occasions). Apart from that being a bug in itself, I would expect the salt runner to be smarter and check the job queue.
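As an illustration of what "checking the job queue" could look like from the outside, here is a rough sketch of a wait loop that polls saltutil.running and only returns once the minion reports no state job in progress. This is not how the orchestrate runner works and not a fix for it; it is a stopgap idea under stated assumptions. The function names (state_jobs, wait_until_idle, salt_get_running) and the timeout and interval values are invented for this example, and it assumes the result has the shape returned by salt.client.LocalClient().cmd(minion, "saltutil.running"), i.e. a dict mapping the minion id to a list of job dicts with a "fun" key.

```python
import time


def state_jobs(running):
    """Keep only the entries of a saltutil.running list that are state runs."""
    return [job for job in running if job.get("fun", "").startswith("state.")]


def wait_until_idle(get_running, minion, timeout=1800, interval=5,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until `minion` reports no running state job.

    `get_running(minion)` must return {minion_id: [job, ...]}.
    Returns True once the minion is idle, False if `timeout` seconds pass.
    """
    deadline = clock() + timeout
    while True:
        result = get_running(minion)
        # A minion missing from the result did not answer; that is treated
        # as "unknown", not "idle", so polling continues.
        if minion in result and not state_jobs(result[minion] or []):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def salt_get_running(minion):
    """Ask the minion for its running jobs (must be run on the master)."""
    import salt.client  # imported lazily so the helpers above work without Salt
    return salt.client.LocalClient().cmd(minion, "saltutil.running")
```

It could be called as wait_until_idle(salt_get_running, "awingu") between two separate salt-run invocations, splitting the orchestration at the slow step. Passing the polling function in as an argument keeps the loop testable without a running master.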

For us this is currently a major issue, so any recommendations on how to handle it would be extremely welcome. Even a dirty monkeypatch to apply on top of 2014.7.0 would mean the world to us!

    Labels

    P1 (Priority 1)
    Platform (relates to OS, containers, platform-based utilities like FS, system-based apps)
    State-Module
    ZD (the issue is related to a Zendesk customer support ticket)
    bug (broken, incorrect, or confusing behavior)
    severity-medium (3rd level: incorrect or bad functionality, confusing and lacks a workaround)
