160 changes: 160 additions & 0 deletions agents/heuristics_resource/fence_heuristics_resource.py
@@ -0,0 +1,160 @@
#!/usr/libexec/platform-python -tt

import io
import re
import subprocess
import shlex
import sys, stat
import logging
import atexit
import time
sys.path.append("/usr/share/fence")
from fencing import fail_usage, run_command, fence_action, all_opt
from fencing import atexit_handler, check_input, process_input, show_docs
from fencing import run_delay

def heuristics_resource(con, options):
    # Search for the node where the resource is running and determine
    # whether this node is the ACT node. On the SBY node, a delay is generated.
    # Note that this method always returns False.

    if "--nodename" not in options or options["--nodename"] == "":
        logging.error("nodename parameter required")
        return False

    if "--resource" not in options or options["--resource"] == "":
        logging.error("resource parameter required")
        return False

    target = options["--nodename"]
    resource = options["--resource"]
    promotable = options["--promotable"] in ["", "1"]
    standby_wait = int(options["--standby-wait"])
    crm_resource_path = options["--crm-resource-path"]
    crm_node_path = options["--crm-node-path"]

    (rc, out, err) = run_command(options, "%s --name" % crm_node_path)
    if rc != 0 or out is None:
        logging.error("Can not get my nodename. rc=%s, stderr=%s" % (rc, err))
        return False

    mynodename = out.strip()

    if mynodename == target:
        logging.info("Skip standby wait due to self-fencing.")
        return False

    (rc, out, err) = run_command(options, "%s -r %s -W" % (crm_resource_path, resource))
    if rc != 0 or out is None:
        logging.error("Command failed. rc=%s, stderr=%s" % (rc, err))
        return False

    search_str = re.compile(r"\s%s%s$" % (mynodename, r'\sMaster' if promotable else ''))

Have you tested this? I think there's a space at the end of the line if "Master" isn't printed.

I'd be a little concerned about introducing a dependency on the output format. We regularly have problems crop up when other software (e.g. pcs) looks for specific strings in the output, since we don't always remember that something requires it when we need to update the pacemaker output for some reason.

We actually are in the middle of a project to have all the pacemaker tools support XML output, so external code can use a format that we will try to keep backward-compatible while still allowing us to change text output as needed. Unfortunately crm_resource has not yet been adapted, and you'd want to support older pacemaker versions anyway.

Another concern I have is that crm_resource -W doesn't distinguish between running cleanly and failed. The output would be the same in either case. If the resource is failed and the node needs to be fenced, this will unnecessarily delay recovery.

One possibility would be to use crm_mon --as-xml instead, which would avoid both the magic string problem and the success vs failure problem. crm_mon's XML output has been supported "forever". It will generate a bunch of output including something like:

<resource id="R" ... failed="true" failure_ignored="false" ... >
    <node name="N" ... />
</resource>

Another corner case is that a resource can be (wrongly) active on more than one node. In a 2-node cluster, it would be simple to skip the delay here if the resource is active on the local node, even if it is also active on the other node. In a larger cluster the best course of action is not as obvious, but a safe choice would be to skip the delay if the resource is active on multiple nodes.

Side note: the upcoming release of pacemaker will have a slight change in how crm_mon does XML output. --as-xml will still behave the same, but it will be deprecated (however it will remain supported for years to come). A new --output-as=xml option will generate nearly identical XML output, but with a different outermost tag. That is part of the project to make all tools support XML -- that option and outermost tag will be identical for all tools.
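
As a rough illustration of the crm_mon --as-xml approach described above, the sketch below parses a hand-written XML sample with Python's standard xml.etree.ElementTree. The sample document, its `crm_mon` and `resources` wrapper tags, and the helper name `resource_status` are assumptions made for illustration; only the `resource` element with its `id`, `failed`, and `failure_ignored` attributes and the nested `node name` come from the snippet in this comment.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, NOT captured crm_mon output; wrapper tags are assumed.
SAMPLE = """
<crm_mon>
  <resources>
    <resource id="R" failed="false" failure_ignored="false">
      <node name="node1"/>
    </resource>
    <resource id="S" failed="true" failure_ignored="false">
      <node name="node1"/>
      <node name="node2"/>
    </resource>
  </resources>
</crm_mon>
"""


def resource_status(xml_text, resource_id):
    """Return (nodes, failed) for all <resource> entries with the given id.

    Hypothetical helper: 'nodes' lists every node the resource is active on,
    'failed' is True if any matching entry has a failure that is not ignored.
    """
    root = ET.fromstring(xml_text)
    nodes = []
    failed = False
    for res in root.iter("resource"):
        if res.get("id") != resource_id:
            continue
        # A failure only counts when it is not flagged as ignored.
        if res.get("failed") == "true" and res.get("failure_ignored") != "true":
            failed = True
        nodes.extend(node.get("name") for node in res.findall("node"))
    return nodes, failed


print(resource_status(SAMPLE, "R"))
print(resource_status(SAMPLE, "S"))
```

Returning the full node list rather than a single name keeps the "active on more than one node" corner case visible to the caller.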

Contributor Author

> I think there's a space at the end of the line if "Master" isn't printed.

In that version of the code, the trailing space was removed by strip() before matching.
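
To make the trailing-space point concrete, here is a small sketch of the same pattern applied with and without strip(). The two sample lines are assumed stand-ins for crm_resource -W output, not captured from a real cluster; `runs_on` is a hypothetical helper, and the `re.escape` call is an addition that the code under review does not have.

```python
import re


def runs_on(line, node, promotable=False):
    """Check whether a status line ends with the node name (plus 'Master')."""
    suffix = r"\sMaster" if promotable else ""
    pattern = re.compile(r"\s%s%s$" % (re.escape(node), suffix))
    # strip() removes the trailing space that would otherwise defeat '$'.
    return pattern.search(line.strip()) is not None


# Assumed sample lines; note the trailing space on the first one.
plain = "resource R is running on: node1 "
master = "resource R is running on: node1 Master"

print(runs_on(plain, "node1"))
print(runs_on(master, "node1", promotable=True))
# Without strip(), the trailing space prevents a match at end of line.
print(re.search(r"\snode1$", plain) is None)
```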

> One possibility would be to use crm_mon --as-xml instead, which would avoid both the magic string problem and the success vs failure problem. crm_mon's XML output has been supported "forever". It will generate a bunch of output including something like:

I added a new commit that uses crm_mon --as-xml.
Is it necessary and sufficient to detect resource failure by checking the following two attributes?

failed="true" failure_ignored="false"

If even one of the resources has failed, including resources inside the resource group or promotable resource, the delay is skipped.

> Another corner case is that a resource can be (wrongly) active on more than one node. In a 2-node cluster, it would be simple to skip the delay here if the resource is active on the local node, even if it is also active on the other node. In a larger cluster the best course of action is not as obvious, but a safe choice would be to skip the delay if the resource is active on multiple nodes.

Regardless of the number of nodes in the cluster, the delay is skipped when the resource is started on multiple nodes.

In the latest commit, the delay occurs only when the specified resource, resource group, or promotable master resource is started on exactly one other node (not a remote node) and has not failed. The delay is also skipped for self-fencing.
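
The decision rule described here can be sketched as a small pure function. This is only an illustration under stated assumptions: `should_delay` and its parameters are hypothetical names, the inputs are assumed to have been extracted from the crm_mon XML already, and the remote-node check is left out.

```python
def should_delay(my_node, target_node, active_nodes, failed):
    """Return True only when the standby wait should be applied.

    Hypothetical helper; 'active_nodes' is the list of nodes the resource
    is started on and 'failed' says whether any part of it has failed.
    """
    if my_node == target_node:
        return False  # self-fencing: never delay
    if failed:
        return False  # a failed resource must not hold up recovery
    if len(active_nodes) != 1:
        return False  # started nowhere, or (wrongly) on several nodes
    # Delay only if the single active node is some other node.
    return active_nodes[0] != my_node


print(should_delay("node2", "node1", ["node1"], False))
print(should_delay("node1", "node2", ["node1"], False))
```

Keeping the rule free of I/O makes each corner case (self-fencing, failure, multiple active nodes) easy to check in isolation.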

    for line in out.splitlines():
        searchres = search_str.search(line.strip())
        if searchres:
            logging.info("This node is ACT! Skip standby wait.")
            return False

    logging.info("Resource %s NOT found on this node" % resource)

    if standby_wait > 0:
        # The SBY node waits for fencing from the ACT node, and tries to fence
        # the ACT node on next fencing level waking up from sleep.
        logging.info("Standby wait %s sec" % standby_wait)
        time.sleep(standby_wait)

    return False


def define_new_opts():
    all_opt["nodename"] = {
        "getopt" : "n:",
        "longopt" : "nodename",
        "required" : "1",
        "help" : "-n, --nodename=[nodename] Name of node to be fenced",
        "shortdesc" : "Name of node to be fenced",
        "default" : "",
        "order" : 1
    }
    all_opt["resource"] = {
        "getopt" : "r:",
        "longopt" : "resource",
        "required" : "1",
        "help" : "-r, --resource=[resource-id] ID of the resource that should be running in the ACT node",

I would document that it does not make sense to specify a cloned or bundled resource unless it is promotable and has only a single master instance.

Contributor Author

Ok. I documented it.

        "shortdesc" : "Resource ID",
        "default" : "",
        "order" : 1
    }
    all_opt["promotable"] = {
        "getopt" : "p",
        "longopt" : "promotable",
        "required" : "0",
        "help" : "-p, --promotable Specify if resource parameter is promotable (master/slave) resource",
        "shortdesc" : "Handle the promotable resource. The node on which the master resource is running is considered as ACT.",
        "default" : "False",
        "order" : 1
    }

If you use crm_mon --as-xml, you can detect this automatically. Cloned resources are wrapped in clone tags that include a multi_state true/false attribute that is equivalent to promotable.
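
A minimal sketch of that detection, assuming the clone wrapper looks roughly like the hand-written sample below. Only the `clone` tag and its `multi_state` attribute are taken from this comment; the `resources` wrapper, the sample IDs, and the helper name `is_promotable` are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, NOT captured crm_mon output.
SAMPLE = """
<resources>
  <clone id="ms-db" multi_state="true">
    <resource id="db"><node name="node1"/></resource>
    <resource id="db"><node name="node2"/></resource>
  </clone>
  <clone id="cl-ping" multi_state="false">
    <resource id="ping"><node name="node1"/></resource>
  </clone>
</resources>
"""


def is_promotable(xml_text, clone_id):
    """Return True if the named clone is marked multi_state="true"."""
    root = ET.fromstring(xml_text)
    for clone in root.iter("clone"):
        if clone.get("id") == clone_id:
            return clone.get("multi_state") == "true"
    return False  # not a clone at all, so not promotable


print(is_promotable(SAMPLE, "ms-db"))
print(is_promotable(SAMPLE, "cl-ping"))
```

With a check like this, a separate --promotable option would no longer need to be passed by the user.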

Contributor Author

Fixed: promotable resources are now distinguished based on the clone tags and attributes in the XML output.

    all_opt["standby_wait"] = {
        "getopt" : "w:",
        "longopt" : "standby-wait",
        "required" : "0",
        "help" : "-w, --standby-wait=[seconds] Wait X seconds on SBY node. If a positive number is specified, fencing action of this agent will always succeed after waits.",
        "shortdesc" : "Wait X seconds on SBY node. If a positive number is specified, fencing action of this agent will always succeed after waits.",

The agent will delay but not succeed

Contributor Author

Fixed.

        "default" : "0",
        "order" : 1
    }
    all_opt["crm_resource_path"] = {
        "getopt" : ":",
        "longopt" : "crm-resource-path",
        "required" : "0",
        "help" : "--crm-resource-path=[path] Path to crm_resource",
        "shortdesc" : "Path to crm_resource command",
        "default" : "@CRM_RESOURCE_PATH@",
        "order" : 1
    }
    all_opt["crm_node_path"] = {
        "getopt" : ":",
        "longopt" : "crm-node-path",
        "required" : "0",
        "help" : "--crm-node-path=[path] Path to crm_node",
        "shortdesc" : "Path to crm_node command",
        "default" : "@CRM_NODE_PATH@",
        "order" : 1
    }


def main():
    device_opt = ["no_status", "no_password", "nodename", "resource", "promotable", "standby_wait", "crm_resource_path", "crm_node_path", "method"]
    define_new_opts()
    atexit.register(atexit_handler)

    all_opt["method"]["default"] = "cycle"
    all_opt["method"]["help"] = "-m, --method=[method] Method to fence (cycle|onoff) (Default: cycle)"

    options = check_input(device_opt, process_input(device_opt))

    docs = {}
    docs["shortdesc"] = "Fence agent for resource-heuristic based fencing delay"
    docs["longdesc"] = "fence_heuristics_resource uses resource-heuristics to delay execution of fence agent running on next level.\
\n.P\n\
This is not a fence agent by itself! \
Its only purpose is to delay execution of another fence agent that lives on next fencing level. \
Note that this agent always returns FALSE. Therefore, subsequent agents on the same fencing level will not run"
    docs["vendorurl"] = ""
    show_docs(options, docs)

    run_delay(options)

    result = fence_action(\
            None, \
            options, \
            None, \
            None, \
            reboot_cycle_fn = heuristics_resource,
            sync_set_power_fn = heuristics_resource)

    sys.exit(result)

if __name__ == "__main__":
    main()
2 changes: 2 additions & 0 deletions configure.ac
@@ -279,6 +279,8 @@ AC_PATH_PROG([SNMPSET_PATH], [snmpset], [/usr/bin/snmpset])
AC_PATH_PROG([SNMPGET_PATH], [snmpget], [/usr/bin/snmpget])
AC_PATH_PROG([NOVA_PATH], [nova], [/usr/bin/nova])
AC_PATH_PROG([POWERMAN_PATH], [powerman], [/usr/bin/powerman])
AC_PATH_PROG([CRM_RESOURCE_PATH], [crm_resource], [/usr/sbin/crm_resource])
AC_PATH_PROG([CRM_NODE_PATH], [crm_node], [/usr/sbin/crm_node])

AC_PATH_PROG([PING_CMD], [ping])
AC_PATH_PROG([PING6_CMD], [ping6])
14 changes: 14 additions & 0 deletions fence-agents.spec.in
@@ -50,6 +50,7 @@ fence-agents-emerson \\
fence-agents-eps \\
fence-agents-hds-cb \\
fence-agents-heuristics-ping \\
fence-agents-heuristics-resource \\
fence-agents-hpblade \\
fence-agents-ibmblade \\
fence-agents-ifmib \\
@@ -536,6 +537,19 @@ ping-heuristics.
%{_sbindir}/fence_heuristics_ping
%{_mandir}/man8/fence_heuristics_ping.8*

%package heuristics-resource
License: GPLv2+ and LGPLv2+
Summary: Pseudo fence agent to affect other agents based on resource-heuristics
Requires: fence-agents-common = %{version}-%{release}
BuildArch: noarch
Obsoletes: fence-agents
%description heuristics-resource
Fence pseudo agent used to affect other agents based on
resource-heuristics.
%files heuristics-resource
%{_sbindir}/fence_heuristics_resource
%{_mandir}/man8/fence_heuristics_resource.8*

%package hpblade
License: GPLv2+ and LGPLv2+
Summary: Fence agent for HP BladeSystem devices
2 changes: 2 additions & 0 deletions make/fencebuild.mk
@@ -28,6 +28,8 @@ define gen_agent_from_py
-e 's#@''SNMPGET_PATH@#${SNMPGET_PATH}#g' \
-e 's#@''NOVA_PATH@#${NOVA_PATH}#g' \
-e 's#@''POWERMAN_PATH@#${POWERMAN_PATH}#g' \
-e 's#@''CRM_RESOURCE_PATH@#${CRM_RESOURCE_PATH}#g' \
-e 's#@''CRM_NODE_PATH@#${CRM_NODE_PATH}#g' \
-e 's#@''PING_CMD@#${PING_CMD}#g' \
-e 's#@''PING6_CMD@#${PING6_CMD}#g' \
-e 's#@''PING4_CMD@#${PING4_CMD}#g' \
119 changes: 119 additions & 0 deletions tests/data/metadata/fence_heuristics_resource.xml
@@ -0,0 +1,119 @@
<?xml version="1.0" ?>
<resource-agent name="fence_heuristics_resource" shortdesc="Fence agent for resource-heuristic based fencing delay" >
<longdesc>fence_heuristics_resource uses resource-heuristics to delay execution of fence agent running on next level.

This is not a fence agent by itself! Its only purpose is to delay execution of another fence agent that lives on next fencing level. Note that this agent always returns FALSE. Therefore, subsequent agents on the same fencing level will not run</longdesc>
<vendor-url></vendor-url>
<parameters>
<parameter name="action" unique="0" required="1">
<getopt mixed="-o, --action=[action]" />
<content type="string" default="reboot" />
<shortdesc lang="en">Fencing action</shortdesc>
</parameter>
<parameter name="crm_node_path" unique="0" required="0">
<getopt mixed="--crm-node-path=[path]" />
<shortdesc lang="en">Path to crm_node command</shortdesc>
</parameter>
<parameter name="crm_resource_path" unique="0" required="0">
<getopt mixed="--crm-resource-path=[path]" />
<shortdesc lang="en">Path to crm_resource command</shortdesc>
</parameter>
<parameter name="method" unique="0" required="0">
<getopt mixed="-m, --method=[method]" />
<content type="select" default="cycle" >
<option value="onoff" />
<option value="cycle" />
</content>
<shortdesc lang="en">Method to fence</shortdesc>
</parameter>
<parameter name="nodename" unique="0" required="1">
<getopt mixed="-n, --nodename=[nodename]" />
<content type="string" default="" />
<shortdesc lang="en">Name of node to be fenced</shortdesc>
</parameter>
<parameter name="promotable" unique="0" required="0">
<getopt mixed="-p, --promotable" />
<content type="boolean" default="False" />
<shortdesc lang="en">Handle the promotable resource. The node on which the master resource is running is considered as ACT.</shortdesc>
</parameter>
<parameter name="resource" unique="0" required="1">
<getopt mixed="-r, --resource=[resource-id]" />
<content type="string" default="" />
<shortdesc lang="en">Resource ID</shortdesc>
</parameter>
<parameter name="standby_wait" unique="0" required="0">
<getopt mixed="-w, --standby-wait=[seconds]" />
<content type="string" default="0" />
<shortdesc lang="en">Wait X seconds on SBY node. If a positive number is specified, fencing action of this agent will always succeed after waits.</shortdesc>
</parameter>
<parameter name="quiet" unique="0" required="0">
<getopt mixed="-q, --quiet" />
<content type="boolean" />
<shortdesc lang="en">Disable logging to stderr. Does not affect --verbose or --debug-file or logging to syslog.</shortdesc>
</parameter>
<parameter name="verbose" unique="0" required="0">
<getopt mixed="-v, --verbose" />
<content type="boolean" />
<shortdesc lang="en">Verbose mode</shortdesc>
</parameter>
<parameter name="debug" unique="0" required="0" deprecated="1">
<getopt mixed="-D, --debug-file=[debugfile]" />
<content type="string" />
<shortdesc lang="en">Write debug information to given file</shortdesc>
</parameter>
<parameter name="debug_file" unique="0" required="0" obsoletes="debug">
<getopt mixed="-D, --debug-file=[debugfile]" />
<content type="string" />
<shortdesc lang="en">Write debug information to given file</shortdesc>
</parameter>
<parameter name="version" unique="0" required="0">
<getopt mixed="-V, --version" />
<content type="boolean" />
<shortdesc lang="en">Display version information and exit</shortdesc>
</parameter>
<parameter name="help" unique="0" required="0">
<getopt mixed="-h, --help" />
<content type="boolean" />
<shortdesc lang="en">Display help and exit</shortdesc>
</parameter>
<parameter name="delay" unique="0" required="0">
<getopt mixed="--delay=[seconds]" />
<content type="second" default="0" />
<shortdesc lang="en">Wait X seconds before fencing is started</shortdesc>
</parameter>
<parameter name="login_timeout" unique="0" required="0">
<getopt mixed="--login-timeout=[seconds]" />
<content type="second" default="5" />
<shortdesc lang="en">Wait X seconds for cmd prompt after login</shortdesc>
</parameter>
<parameter name="power_timeout" unique="0" required="0">
<getopt mixed="--power-timeout=[seconds]" />
<content type="second" default="20" />
<shortdesc lang="en">Test X seconds for status change after ON/OFF</shortdesc>
</parameter>
<parameter name="power_wait" unique="0" required="0">
<getopt mixed="--power-wait=[seconds]" />
<content type="second" default="0" />
<shortdesc lang="en">Wait X seconds after issuing ON/OFF</shortdesc>
</parameter>
<parameter name="shell_timeout" unique="0" required="0">
<getopt mixed="--shell-timeout=[seconds]" />
<content type="second" default="3" />
<shortdesc lang="en">Wait X seconds for cmd prompt after issuing command</shortdesc>
</parameter>
<parameter name="retry_on" unique="0" required="0">
<getopt mixed="--retry-on=[attempts]" />
<content type="integer" default="1" />
<shortdesc lang="en">Count of attempts to retry power on</shortdesc>
</parameter>
</parameters>
<actions>
<action name="on" automatic="0"/>
<action name="off" />
<action name="reboot" />
<action name="monitor" />
<action name="metadata" />
<action name="manpage" />
<action name="validate-all" />
</actions>
</resource-agent>