   :log_dir: directory to save tensorboard event logs. If None, defaults to a fixed path on local filesystem.
   :driver_ps_nodes: run the PS nodes on the driver locally instead of on the spark executors; this helps maximize computing resources (esp. GPU). You will need to set cluster_size = num_executors + num_ps
+  :master_node: name of the "master" or "chief" node in the cluster_template, used for `tf.estimator` applications.
   :reservation_timeout: number of seconds after which cluster reservation times out (600 sec default)
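To illustrate how a `master_node` name might slot into a cluster template, here is a minimal sketch. The helper `build_cluster_template` and its ordering of executor ids (PS nodes first, then the master, then workers) are assumptions made for illustration; they are not taken from this diff and may not match the library's actual layout.

```python
def build_cluster_template(num_executors, num_ps, master_node=None):
    """Hypothetical helper: map each job name to a list of executor ids.

    Assumed ordering: PS nodes first, then the optional master/chief node,
    then the remaining executors as workers.
    """
    if not 0 <= num_ps < num_executors:
        raise ValueError("num_ps must be >= 0 and smaller than num_executors")

    template = {}
    if num_ps > 0:
        template["ps"] = list(range(num_ps))

    next_id = num_ps
    if master_node is not None:
        # Exactly one executor takes the "master" or "chief" role.
        template[master_node] = [next_id]
        next_id += 1

    if next_id < num_executors:
        template["worker"] = list(range(next_id, num_executors))
    return template


# Example: 4 executors, 1 PS node, and a node named "chief".
template = build_cluster_template(4, 1, master_node="chief")
```

With `master_node=None`, the sketch falls back to a plain `ps`/`worker` split.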
-  # since our "primary key" for each executor's TFManager is (host, ppid), sanity check for duplicates
+  # since our "primary key" for each executor's TFManager is (host, executor_id), sanity check for duplicates
   # Note: this may occur if Spark retries failed Python tasks on the same executor.
   tb_nodes = set()
   for node in cluster_info:
-    node_id = (node['host'], node['ppid'])
+    node_id = (node['host'], node['executor_id'])
     if node_id in tb_nodes:
-      raise Exception("Duplicate cluster node id detected (host={0}, ppid={1}). Please ensure that (1) the number of executors >= number of TensorFlow nodes, (2) the number of tasks per executors == 1, and (3) TFCluster.shutdown() is successfully invoked when done.".format(node_id[0], node_id[1]))
+      raise Exception("Duplicate cluster node id detected (host={0}, executor_id={1}). ".format(node_id[0], node_id[1]) +
+                      "Please ensure that:\n" +
+                      "1. Number of executors >= number of TensorFlow nodes\n" +
+                      "2. Number of tasks per executor is 1\n" +
+                      "3. TFCluster.shutdown() is successfully invoked when done.")
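The duplicate check in this hunk can be exercised on its own. The following is a minimal standalone sketch, assuming `cluster_info` is a list of dicts that each carry `host` and `executor_id` keys, as the diff suggests. The function name `check_unique_nodes`, the use of `ValueError`, and the sample addresses are illustrative choices, not part of the commit.

```python
def check_unique_nodes(cluster_info):
    """Raise ValueError if two nodes share the same (host, executor_id) pair."""
    seen = set()
    for node in cluster_info:
        node_id = (node["host"], node["executor_id"])
        if node_id in seen:
            raise ValueError(
                "Duplicate cluster node id detected "
                "(host={0}, executor_id={1})".format(node_id[0], node_id[1])
            )
        # Record the key so a later node with the same pair is caught.
        seen.add(node_id)


# Two executors on the same host are fine as long as their ids differ.
nodes = [
    {"host": "10.0.0.1", "executor_id": 0},
    {"host": "10.0.0.1", "executor_id": 1},
]
check_unique_nodes(nodes)  # no exception raised
```

Appending a copy of an existing entry, for example `nodes + [nodes[0]]`, makes the same call raise.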