We use the puppetrunner.rb functionality to loop Puppet runs with concurrency across our enterprise via mcollective-on-choria. This is normally a rock-solid process; however, this week we ran into an interesting problem. One node in the company had its Puppet run get stuck, which caused the node to stay in the "running" status forever. Overall this shouldn't be that big of a deal. It needs to be fixed, but you wouldn't expect it to impact other nodes. However, it does, because of this block of code:
```ruby
  while !host_list.empty?
    # clone the list so we can safely remove items from host_list
    working_list = host_list.clone

    working_list.each do |host|
      if running.size >= @concurrency
        # we do not have room for another running host
        break
      end

      # check if host is already in a running state
      log("Checking #{host}")
      log(" -- running: %d, concurrency: %d" % [running.size, @concurrency])

      if running.find { |running_host| running_host[:name] == host }
        # already in a running state, leave it for now
        log("#{host} is already being tracked, skipping it")
      else
        # host is not running - kick it, put it in the running bucket
        Log.debug("Triggering #{host}")
        initiated_at = runhost(host)
        running << make_status(host, initiated_at)
        host_list.delete(host)
        next
      end
    end

    Log.debug("Sleeping")
    # wait a second to give some time for something to happen
    sleep 1

    # update our view of the network
    running = find_applying_nodes(hosts, running)
  end

  log("Iteration complete. Initiated a Puppet run on #{hosts.size} nodes.")
end
```
The bug here is more ideological than technical. This block of code is designed with the goal of triggering a run on every server in the company. If it encounters a server that is presently running, it marks it to be revisited later. The outer while loop will never exit until it has triggered a run on every (enabled) host.
In our case, we had one host that was "enabled" but whose Puppet agent was frozen. So `host_list` was never able to fully empty out. It remained an array with only a single entry, and the runner proceeded to loop on that entry forever, repeatedly logging "is already being tracked, skipping it" for that one node.
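
To make the failure mode concrete, here is a stripped-down, hypothetical reproduction of the pattern (not the actual puppetrunner.rb code; the host name is made up). Because the stuck host never leaves `running` and is never deleted from `host_list`, the loop never terminates:

```ruby
# Standalone sketch of the starvation pattern; this loops forever by design.
host_list = ["stuck-node.example.com"]
running   = [{ name: "stuck-node.example.com" }] # agent frozen mid-run

while !host_list.empty?
  host_list.clone.each do |host|
    if running.find { |r| r[:name] == host }
      # the only reachable branch: the host is never removed from host_list
      puts "#{host} is already being tracked, skipping it"
    end
  end
  sleep 1
end
```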
I appreciate the effort for thoroughness here; however, I think it's unnecessary. If you're launching a full "run Puppet on all hosts" run across your enterprise and the runner encounters a host where Puppet is already running, I think it's fair to have the runner simply acknowledge, "Puppet is already running here, so I will mark this node as good". I understand that may be less than ideal under perfect circumstances, but it would prevent this particular failure scenario from dragging the entire process into starvation.
The line `host_list.delete(host)` would simply need to be added to the block that outputs "#{host} is already being tracked, skipping it". I'm not certain whether the `next` is also necessary under that condition... I stared at it for a while and haven't quite figured that out yet.
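
For illustration, a minimal sketch of what that conditional might look like with the change applied (untested; names come from the excerpt above, and the new comment wording is mine):

```ruby
if running.find { |running_host| running_host[:name] == host }
  # Puppet is already running here; count the node as handled instead of
  # revisiting it forever, so one stuck agent cannot starve the whole loop
  log("#{host} is already being tracked, skipping it")
  host_list.delete(host)
else
  # host is not running - kick it, put it in the running bucket
  Log.debug("Triggering #{host}")
  initiated_at = runhost(host)
  running << make_status(host, initiated_at)
  host_list.delete(host)
end
```

As far as I can tell, the `next` is not needed in either branch: it was the last statement in the block body, so dropping it (as above) shouldn't change the iteration. Deleting from `host_list` mid-iteration also looks safe here, since the `each` walks the `working_list` clone.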