Ask for help: I encounter an error when I run dpgen. Thank you very much! #1342

@12jscvb

2023-09-26 11:57:18,486 - INFO : Find old submission; recover submission from json file;submission.submission_hash:c40cbdcc7d969ad3ae8930771348a761b6d46f47; machine.context.remote_root:/home/jiang/work/dpgen_example/run/nnwork/c40cbdcc7d969ad3ae8930771348a761b6d46f47; submission.work_base:iter.000000/00.train;
2023-09-26 11:57:18,536 - INFO : info:check_all_finished: False
2023-09-26 11:57:18,539 - INFO : job: 35225006374c02dda988d09b9556589231a34548 6329 terminated;fail_cout is 10; resubmitting job
2023-09-26 11:57:18,547 - INFO : job:35225006374c02dda988d09b9556589231a34548 re-submit after terminated; new job_id is 13634
2023-09-26 11:57:18,794 - INFO : job:35225006374c02dda988d09b9556589231a34548 job_id:13634 after re-submitting; the state now is <JobStatus.terminated: 4>
2023-09-26 11:57:18,794 - INFO : job: 35225006374c02dda988d09b9556589231a34548 13634 terminated;fail_cout is 11; resubmitting job
2023-09-26 11:57:18,799 - INFO : job:35225006374c02dda988d09b9556589231a34548 re-submit after terminated; new job_id is 13656
2023-09-26 11:57:19,046 - INFO : job:35225006374c02dda988d09b9556589231a34548 job_id:13656 after re-submitting; the state now is <JobStatus.terminated: 4>
2023-09-26 11:57:19,047 - INFO : job: 35225006374c02dda988d09b9556589231a34548 13656 terminated;fail_cout is 12; resubmitting job
Traceback (most recent call last):
File "/home/jiang/.local/lib/python3.11/site-packages/dpdispatcher/submission.py", line 352, in handle_unexpected_submission_state
job.handle_unexpected_job_state()
File "/home/jiang/.local/lib/python3.11/site-packages/dpdispatcher/submission.py", line 861, in handle_unexpected_job_state
self.handle_unexpected_job_state()
File "/home/jiang/.local/lib/python3.11/site-packages/dpdispatcher/submission.py", line 861, in handle_unexpected_job_state
self.handle_unexpected_job_state()
File "/home/jiang/.local/lib/python3.11/site-packages/dpdispatcher/submission.py", line 846, in handle_unexpected_job_state
raise RuntimeError(
RuntimeError: job:35225006374c02dda988d09b9556589231a34548 13656 failed 12 times.job_detail:{'35225006374c02dda988d09b9556589231a34548': {'job_task_list': [{'command': "/bin/sh -c '{ if [ ! -f model.ckpt.index ]; then dp train input.json; else dp train input.json --restart model.ckpt; fi }'&&dp freeze", 'task_work_path': '002', 'forward_files': ['input.json'], 'backward_files': ['frozen_model.pb', 'lcurve.out', 'train.log', 'model.ckpt.meta', 'model.ckpt.index', 'model.ckpt.data-00000-of-00001', 'checkpoint'], 'outlog': 'train.log', 'errlog': 'train.log'}, {'command': "/bin/sh -c '{ if [ ! -f model.ckpt.index ]; then dp train input.json; else dp train input.json --restart model.ckpt; fi }'&&dp freeze", 'task_work_path': '001', 'forward_files': ['input.json'], 'backward_files': ['frozen_model.pb', 'lcurve.out', 'train.log', 'model.ckpt.meta', 'model.ckpt.index', 'model.ckpt.data-00000-of-00001', 'checkpoint'], 'outlog': 'train.log', 'errlog': 'train.log'}, {'command': "/bin/sh -c '{ if [ ! -f model.ckpt.index ]; then dp train input.json; else dp train input.json --restart model.ckpt; fi }'&&dp freeze", 'task_work_path': '003', 'forward_files': ['input.json'], 'backward_files': ['frozen_model.pb', 'lcurve.out', 'train.log', 'model.ckpt.meta', 'model.ckpt.index', 'model.ckpt.data-00000-of-00001', 'checkpoint'], 'outlog': 'train.log', 'errlog': 'train.log'}, {'command': "/bin/sh -c '{ if [ ! 
-f model.ckpt.index ]; then dp train input.json; else dp train input.json --restart model.ckpt; fi }'&&dp freeze", 'task_work_path': '000', 'forward_files': ['input.json'], 'backward_files': ['frozen_model.pb', 'lcurve.out', 'train.log', 'model.ckpt.meta', 'model.ckpt.index', 'model.ckpt.data-00000-of-00001', 'checkpoint'], 'outlog': 'train.log', 'errlog': 'train.log'}], 'resources': {'number_node': 1, 'cpu_per_node': 4, 'gpu_per_node': 0, 'queue_name': '', 'group_size': 4, 'custom_flags': [], 'strategy': {'if_cuda_multi_devices': False, 'ratio_unfinished': 0.0}, 'para_deg': 1, 'module_purge': False, 'module_unload_list': [], 'module_list': [], 'source_list': [], 'envs': {}, 'prepend_script': [], 'append_script': [], 'wait_time': 0, 'kwargs': {}}, 'job_state': <JobStatus.terminated: 4>, 'job_id': 13656, 'fail_count': 12}}

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/jiang/.local/bin/dpgen", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/jiang/.local/lib/python3.11/site-packages/dpgen/main.py", line 233, in main
args.func(args)
File "/home/jiang/.local/lib/python3.11/site-packages/dpgen/generator/run.py", line 5109, in gen_run
run_iter(args.PARAM, args.MACHINE)
File "/home/jiang/.local/lib/python3.11/site-packages/dpgen/generator/run.py", line 4440, in run_iter
run_train(ii, jdata, mdata)
File "/home/jiang/.local/lib/python3.11/site-packages/dpgen/generator/run.py", line 776, in run_train
submission.run_submission()
File "/home/jiang/.local/lib/python3.11/site-packages/dpdispatcher/submission.py", line 229, in run_submission
self.handle_unexpected_submission_state()
File "/home/jiang/.local/lib/python3.11/site-packages/dpdispatcher/submission.py", line 355, in handle_unexpected_submission_state
raise RuntimeError(
RuntimeError: Meet errors will handle unexpected submission state.
Debug information: remote_root==/home/jiang/work/dpgen_example/run/nnwork/c40cbdcc7d969ad3ae8930771348a761b6d46f47.
Debug information: submission_hash==c40cbdcc7d969ad3ae8930771348a761b6d46f47.
Please check the dirs and scripts in remote_root. The job information mentioned above may help.

Following the information above, I found that the train.log file shows 'dp: no vocab file specified', but I don't know how to solve this problem. Thanks!
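For anyone following the hint in the error message ("Please check the dirs and scripts in remote_root"), here is a minimal Python sketch that prints the last lines of each task's train.log in one pass. It assumes the task directories (000 to 003, taken from the `task_work_path` entries in the job detail) sit directly under `remote_root`; the helper name `collect_log_tails` and its parameters are hypothetical and not part of dpgen or dpdispatcher.

```python
from pathlib import Path


def collect_log_tails(remote_root, task_dirs, log_name="train.log", n_lines=20):
    """Return {task_dir: last n_lines of its log, or None if the log is missing}.

    Hypothetical helper: assumes each task directory lives directly
    under remote_root, which may differ on your machine.
    """
    tails = {}
    for task in task_dirs:
        log_path = Path(remote_root) / task / log_name
        if not log_path.is_file():
            tails[task] = None
            continue
        # errors="replace" keeps the script from failing on odd bytes in the log.
        lines = log_path.read_text(errors="replace").splitlines()
        tails[task] = "\n".join(lines[-n_lines:])
    return tails


if __name__ == "__main__":
    # Path and task names copied from the traceback above; adjust for your run.
    remote_root = (
        "/home/jiang/work/dpgen_example/run/nnwork/"
        "c40cbdcc7d969ad3ae8930771348a761b6d46f47"
    )
    tails = collect_log_tails(remote_root, ["000", "001", "002", "003"])
    for task, tail in tails.items():
        print(f"--- {task}/train.log ---")
        print(tail if tail is not None else "(missing)")
```

Comparing the four tails shows whether every task fails with the same message or only some of them do.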
