Error when submitting a new job #1123
Unanswered
Michael-tech88 asked this question in Q&A
Replies: 1 comment
-
I have also run into this problem. Have you solved it?
-
Here is the error output. How can I solve it? Thanks!
nohup: ignoring input
INFO:dpgen:-------------------------iter.000000 task 01--------------------------
Traceback (most recent call last):
  File "/home/gengzi/.local/lib/python3.9/site-packages/dpdispatcher/submission.py", line 247, in handle_unexpected_submission_state
    job.handle_unexpected_job_state()
  File "/home/gengzi/.local/lib/python3.9/site-packages/dpdispatcher/submission.py", line 613, in handle_unexpected_job_state
    raise RuntimeError(f"job:{self.job_hash} {self.job_id} failed {self.fail_count} times.job_detail:{self}")
RuntimeError: job:f5dd75978fefc50e8387502acd89bcee87b890fa 678281 failed 3 times.job_detail:{'f5dd75978fefc50e8387502acd89bcee87b890fa': {'job_task_list': [{'command': "/bin/sh -c '{ if [ ! -f model.ckpt.index ]; then dp train input.json; else dp train input.json --restart model.ckpt; fi }'&&dp freeze", 'task_work_path': '002', 'forward_files': ['input.json'], 'backward_files': ['frozen_model.pb', 'lcurve.out', 'train.log', 'model.ckpt.meta', 'model.ckpt.index', 'model.ckpt.data-00000-of-00001', 'checkpoint'], 'outlog': 'train.log', 'errlog': 'train.log'}], 'resources': {'number_node': 1, 'cpu_per_node': 2, 'gpu_per_node': 1, 'queue_name': '', 'group_size': 1, 'custom_flags': [], 'strategy': {'if_cuda_multi_devices': False, 'ratio_unfinished': 0.0}, 'para_deg': 1, 'module_purge': False, 'module_unload_list': [], 'module_list': [], 'source_list': [], 'envs': {}, 'prepend_script': [], 'append_script': [], 'wait_time': 0, 'kwargs': {}}, 'job_state': <JobStatus.terminated: 4>, 'job_id': 678281, 'fail_count': 3}}
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/opt/pub/toolkits/anaconda3/envs/deepmd/bin/dpgen", line 8, in <module>
    sys.exit(main())
  File "/opt/pub/toolkits/anaconda3/envs/deepmd/lib/python3.9/site-packages/dpgen/main.py", line 185, in main
    args.func(args)
  File "/opt/pub/toolkits/anaconda3/envs/deepmd/lib/python3.9/site-packages/dpgen/generator/run.py", line 3926, in gen_run
    run_iter (args.PARAM, args.MACHINE)
  File "/opt/pub/toolkits/anaconda3/envs/deepmd/lib/python3.9/site-packages/dpgen/generator/run.py", line 3788, in run_iter
    run_train (ii, jdata, mdata)
  File "/opt/pub/toolkits/anaconda3/envs/deepmd/lib/python3.9/site-packages/dpgen/generator/run.py", line 615, in run_train
    submission.run_submission()
  File "/home/gengzi/.local/lib/python3.9/site-packages/dpdispatcher/submission.py", line 214, in run_submission
    self.handle_unexpected_submission_state()
  File "/home/gengzi/.local/lib/python3.9/site-packages/dpdispatcher/submission.py", line 250, in handle_unexpected_submission_state
    raise RuntimeError(
RuntimeError: Meet errors will handle unexpected submission state.
Debug information: remote_root==/home/gengzi/DeepMD/85e527d8b3e39bf235fb9c015c058ac4d4fc3341.
Debug information: submission_hash==85e527d8b3e39bf235fb9c015c058ac4d4fc3341.
Please check the dirs and scripts in remote_root. The job information mentioned above may help.
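The final message only says that the job failed three times; the underlying cause is usually written to the task's own log. According to the job detail above, both `outlog` and `errlog` point to `train.log`, and the task ran in `task_work_path` `002` under `remote_root`. As a rough sketch (not part of dpgen or dpdispatcher; the helper names `tail` and `collect_log_tails` are made up for illustration), the following collects the last lines of every `train.log` found under `remote_root` so the real training error can be read directly:

```python
from collections import deque
from pathlib import Path


def tail(path, n=20):
    """Return the last n lines of a text file."""
    with open(path, errors="replace") as fh:
        return [line.rstrip("\n") for line in deque(fh, maxlen=n)]


def collect_log_tails(remote_root, log_name="train.log", n=20):
    """Map each log file found under remote_root to its last n lines."""
    root = Path(remote_root)
    return {
        str(p.relative_to(root)): tail(p, n)
        for p in sorted(root.rglob(log_name))
    }


if __name__ == "__main__":
    # remote_root as printed in the "Debug information" lines of the traceback
    remote_root = "/home/gengzi/DeepMD/85e527d8b3e39bf235fb9c015c058ac4d4fc3341"
    for rel_path, lines in collect_log_tails(remote_root).items():
        print(f"==== {rel_path} ====")
        print("\n".join(lines))
```

If the logs are empty or missing, it may also help to run the command from the job detail (`dp train input.json`) by hand inside the task directory on the compute machine, since that shows whether `dp` itself starts correctly there.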