Puppetrunner can get stuck in endless starvation loop #56

@abaddononion

Description

We use the puppetrunner.rb functionality to loop Puppet runs with concurrency across our enterprise via mcollective-on-choria. This is normally a rock-solid process, but this week we ran into an interesting problem. One node in the company had its Puppet run get stuck, which left the node in the "running" status forever. Overall this shouldn't be that big a deal. It needs to be fixed, but you wouldn't expect it to impact other nodes. However, it does, because of this block of code:

```ruby
while !host_list.empty?
  # clone list so we can safely remove items host_list
  working_list = host_list.clone

  working_list.each do |host|
    if running.size >= @concurrency
      # we do not have room for another running host
      break
    end
    # check if host is already in a running state
    log("Checking #{host}")
    log(" -- running: %d, concurrency: %d" % [running.size, @concurrency])
    if running.find{ |running_host| running_host[:name] == host }
      # already in a running state, leave it for now
      log("#{host} is already being tracked, skipping it")
    else
      # host is not running - kick it, put it in the running bucket
      Log.debug("Triggering #{host}")
      initiated_at = runhost(host)
      running << make_status(host, initiated_at)
      host_list.delete(host)
      next
    end
  end

  Log.debug("Sleeping")
  # wait a second to give some time for something to happen
  sleep 1
  # update our view of the network
  running = find_applying_nodes(hosts, running)
end

log("Iteration complete. Initiated a Puppet run on #{hosts.size} nodes.")
```

The bug here is more ideological than technical. This block of code is designed with the goal of triggering a run on every server in the company. If it encounters a server that is presently running, it marks it to be revisited later. The outer while loop will never exit until it has triggered a run on every (enabled) host.

In our case, we had one host that was "enabled", but Puppet on it was frozen. So the `host_list` pool was never able to fully empty out. It remained an array with a single entry, and the loop spun on that entry forever, constantly logging only that Puppet was running and the node was being skipped.

I appreciate the effort toward thoroughness here, but I think it's unnecessary. If you're launching a full "run Puppet on all hosts" run across your enterprise and the runner encounters a host where Puppet is already running, I think it's fair for the runner to simply acknowledge, "Puppet is already running here, so I will mark this node as good". I understand that may be less than ideal under perfect circumstances, but it would prevent this particular failure scenario from dragging the entire process into starvation.

The line `host_list.delete(host)` would simply need to be added to the branch that logs "#{host} is already being tracked, skipping it". I'm not certain whether the `next` is also necessary under that condition; I stared at it for a while and haven't quite figured that out yet.
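
For illustration, here's a minimal sketch of the change I have in mind, based only on the excerpt above (untested against the real runner):

```ruby
if running.find { |running_host| running_host[:name] == host }
  # host is already running elsewhere - count it as handled and drop it
  # from the pending list so the outer while loop can eventually terminate
  log("#{host} is already being tracked, skipping it")
  host_list.delete(host)
else
  # host is not running - kick it, put it in the running bucket
  Log.debug("Triggering #{host}")
  initiated_at = runhost(host)
  running << make_status(host, initiated_at)
  host_list.delete(host)
end
```

As far as I can tell, the `next` isn't needed in either branch, since it would already be the last statement executed in that iteration of the `each` block.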
