We use the puppetrunner.rb functionality to loop Puppet runs with concurrency across our enterprise via mcollective-on-choria. This is normally a rock-solid process; however, this week we ran into an interesting problem. One node in the company had its Puppet run get stuck, which caused the node to stay in the "running" status forever. Overall this shouldn't be that big of a deal. It needs to be fixed, but you wouldn't expect it to impact other nodes. However, it does, because of this block of code:
```ruby
  while !host_list.empty?
    # clone the list so we can safely remove items from host_list
    working_list = host_list.clone

    working_list.each do |host|
      if running.size >= @concurrency
        # we do not have room for another running host
        break
      end

      # check if host is already in a running state
      log("Checking #{host}")
      log(" -- running: %d, concurrency: %d" % [running.size, @concurrency])

      if running.find { |running_host| running_host[:name] == host }
        # already in a running state, leave it for now
        log("#{host} is already being tracked, skipping it")
      else
        # host is not running - kick it, put it in the running bucket
        Log.debug("Triggering #{host}")
        initiated_at = runhost(host)
        running << make_status(host, initiated_at)
        host_list.delete(host)
        next
      end
    end

    Log.debug("Sleeping")
    # wait a second to give some time for something to happen
    sleep 1

    # update our view of the network
    running = find_applying_nodes(hosts, running)
  end

  log("Iteration complete. Initiated a Puppet run on #{hosts.size} nodes.")
end
```
The bug here is more ideological than technical. This block of code is designed with the goal of triggering a run on every server in the company. If it encounters a server that is presently running, it marks it to be revisited later. The outer while loop will never exit until it has triggered a run on every (enabled) host.
In our case, we had one host that was "enabled" but whose Puppet agent was frozen. So `host_list` was never able to fully empty out. It remained an array with only a single entry, and the runner proceeded to loop on that entry forever, repeatedly logging "is already being tracked, skipping it" for that one node.
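
To make the failure mode concrete, here is a stripped-down, hypothetical reproduction of the pattern (not the actual puppetrunner.rb code; the host name is made up). Because the stuck host never leaves `running` and is never deleted from `host_list`, the loop never terminates:

```ruby
# Standalone sketch of the starvation pattern; this loops forever by design.
host_list = ["stuck-node.example.com"]
running   = [{ name: "stuck-node.example.com" }] # agent frozen mid-run

while !host_list.empty?
  host_list.clone.each do |host|
    if running.find { |r| r[:name] == host }
      # the only reachable branch: the host is never removed from host_list
      puts "#{host} is already being tracked, skipping it"
    end
  end
  sleep 1
end
```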
I appreciate the effort for thoroughness here; however, I think it's unnecessary. If you're launching a full "run Puppet on all hosts" run across your enterprise and the runner encounters a host where Puppet is already running, I think it's fair to have the runner simply acknowledge, "Puppet is already running here, so I will mark this node as good". I understand that may be less than ideal under perfect circumstances, but it would prevent this particular failure scenario from dragging the entire process into starvation.
The line `host_list.delete(host)` would simply need to be added to the block that outputs "#{host} is already being tracked, skipping it". I'm not certain whether the `next` is also necessary under that condition... I stared at it for a while and haven't quite figured that out yet.
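
For illustration, a minimal sketch of what that conditional might look like with the change applied (untested; names come from the excerpt above, and the new comment wording is mine):

```ruby
if running.find { |running_host| running_host[:name] == host }
  # Puppet is already running here; count the node as handled instead of
  # revisiting it forever, so one stuck agent cannot starve the whole loop
  log("#{host} is already being tracked, skipping it")
  host_list.delete(host)
else
  # host is not running - kick it, put it in the running bucket
  Log.debug("Triggering #{host}")
  initiated_at = runhost(host)
  running << make_status(host, initiated_at)
  host_list.delete(host)
end
```

As far as I can tell, the `next` is not needed in either branch: it was the last statement in the block body, so dropping it (as above) shouldn't change the iteration. Deleting from `host_list` mid-iteration also looks safe here, since the `each` walks the `working_list` clone.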