Dpgen cannot submit tasks on the slurm operating system #1216
Unanswered · SEU-NiuWenLong asked this question in Q&A
Replies: 1 comment
-
I see you execute |
-
Tasks submit without problems when I run them directly on the local server, but submission always fails when I go through the Slurm job management system. The error is as follows:
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/gcniu/.local/bin/dpgen", line 8, in <module>
    sys.exit(main())
  File "/home/gcniu/.local/lib/python3.9/site-packages/dpgen/main.py", line 233, in main
    args.func(args)
  File "/home/gcniu/.local/lib/python3.9/site-packages/dpgen/generator/run.py", line 5109, in gen_run
    run_iter(args.PARAM, args.MACHINE)
  File "/home/gcniu/.local/lib/python3.9/site-packages/dpgen/generator/run.py", line 4440, in run_iter
    run_train(ii, jdata, mdata)
  File "/home/gcniu/.local/lib/python3.9/site-packages/dpgen/generator/run.py", line 776, in run_train
    submission.run_submission()
  File "/home/gcniu/.local/lib/python3.9/site-packages/dpdispatcher/submission.py", line 252, in run_submission
    self.handle_unexpected_submission_state()
  File "/home/gcniu/.local/lib/python3.9/site-packages/dpdispatcher/submission.py", line 290, in handle_unexpected_submission_state
    raise RuntimeError(
RuntimeError: Meet errors will handle unexpected submission state.
Debug information: remote_root==/home/gcniu/workspace/deepmd/dpgen_example/run/temp/e622221beffedcf69212d3c858c1daabc5c00c26.
Debug information: submission_hash==e622221beffedcf69212d3c858c1daabc5c00c26.
Please check the dirs and scripts in remote_root. The job information mentioned above may help.
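The error message points at the submission directory, so a first step is to pull that path out of the message. The following is a minimal sketch, not part of dpgen or dpdispatcher: the helper name `parse_debug_info` is hypothetical, and it assumes the "Debug information: key==value." layout shown in the error above.

```python
import re

# Matches lines of the form "Debug information: key==value." as seen in the error above.
_DEBUG_LINE = re.compile(r"Debug information: (\w+)==(\S+)")


def parse_debug_info(error_text: str) -> dict:
    """Return the key/value pairs found on 'Debug information:' lines."""
    info = {}
    for key, value in _DEBUG_LINE.findall(error_text):
        # Each line ends with a sentence period that is not part of the value.
        info[key] = value.rstrip(".")
    return info


sample = """\
Debug information: remote_root==/home/gcniu/workspace/deepmd/dpgen_example/run/temp/e622221beffedcf69212d3c858c1daabc5c00c26.
Debug information: submission_hash==e622221beffedcf69212d3c858c1daabc5c00c26.
"""
info = parse_debug_info(sample)
print(info["remote_root"])
print(info["submission_hash"])
```

The extracted `remote_root` is the directory that holds the generated job script and the per-task folders.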
My machine configuration file is as follows:
{
  "api_version": "1.0",
  "deepmd_version": "2.2.1",
  "train": [
    {
      "command": "mpirun -np 24 dp",
      "machine": {
        "batch_type": "Slurm",
        "context_type": "local",
        "local_root": "./",
        "remote_root": "/home/gcniu/workspace/deepmd/dpgen_example/run/temp"
      },
      "resources": {
        "number_node": 1,
        "cpu_per_node": 24,
        "gpu_per_node": 0,
        "group_size": 4,
        "queue_name": " C6326LI",
        "source_list": ["~/.bashrc"]
      }
    }
  ],
  "model_devi": [
    {
      "command": "mpirun -np 24 lmp -i input.lammps",
      "machine": {
        "batch_type": "Slurm",
        "context_type": "local",
        "local_root": "./",
        "remote_root": "/home/gcniu/workspace/deepmd/dpgen_example/run/temp"
      },
      "resources": {
        "number_node": 1,
        "cpu_per_node": 24,
        "gpu_per_node": 0,
        "queue_name": "C6326LI",
        "group_size": 5,
        "source_list": ["~/.bashrc"]
      }
    }
  ],
  "fp": [
    {
      "command": "mpirun -np 24 vasp_std >& log",
      "machine": {
        "batch_type": "Slurm",
        "context_type": "local",
        "local_root": "./",
        "remote_root": "/home/gcniu/workspace/deepmd/dpgen_example/run/temp"
      },
      "resources": {
        "number_node": 1,
        "cpu_per_node": 24,
        "gpu_per_node": 0,
        "group_size": 1,
        "queue_name": "C6326LI",
        "module_list": [
          "vasp/5.4.4"
        ]
      }
    }
  ]
}
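One detail worth checking in a configuration like this is stray whitespace in string values, such as the queue name of the train stage. The sketch below is only an illustration: `find_whitespace_issues` is a hypothetical helper, not a dpgen or dpdispatcher function, and whether such whitespace actually affects submission is an assumption that has not been verified here.

```python
def find_whitespace_issues(config: dict) -> list:
    """List string settings that carry leading or trailing whitespace."""
    issues = []
    for stage in ("train", "model_devi", "fp"):
        for index, entry in enumerate(config.get(stage, [])):
            for section in ("machine", "resources"):
                for key, value in entry.get(section, {}).items():
                    if isinstance(value, str) and value != value.strip():
                        issues.append(f"{stage}[{index}].{section}.{key} = {value!r}")
    return issues


# Trimmed copy of the configuration above; only a few fields are kept for the example.
config = {
    "train": [{"machine": {"batch_type": "Slurm"}, "resources": {"queue_name": " C6326LI"}}],
    "model_devi": [{"machine": {"batch_type": "Slurm"}, "resources": {"queue_name": "C6326LI"}}],
    "fp": [{"machine": {"batch_type": "Slurm"}, "resources": {"queue_name": "C6326LI"}}],
}
for issue in find_whitespace_issues(config):
    print(issue)
```

To check a real file, load it with `json.load` and pass the resulting dictionary to the function.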
The job script generated in the error directory is as follows:
#!/bin/bash -l
#SBATCH --parsable
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 24
#SBATCH --gres=gpu:0
#SBATCH --partition C6326LI
REMOTE_ROOT=$(readlink -f /home/gcniu/workspace/deepmd/dpgen_example/run/temp/e622221beffedcf69212d3c858c1daabc5c00c26)
echo 0 > $REMOTE_ROOT/3abaec2376676ab674ad3a3a6b889ccbd69c3a52_flag_if_job_task_fail
test $? -ne 0 && exit 1
{ source ~/.bashrc; }
cd $REMOTE_ROOT
cd 002
test $? -ne 0 && exit 1
if [ ! -f c56a28e087307cb91099b8a97f0855b1abcb3753_task_tag_finished ] ;then
( /bin/sh -c '{ if [ ! -f model.ckpt.index ]; then mpirun -np 24 dp train input.json; else mpirun -np 24 dp train input.json --restart model.ckpt; fi }'&&mpirun -np 24 dp freeze ) 1>>train.log 2>>train.log
if test $? -eq 0; then touch c56a28e087307cb91099b8a97f0855b1abcb3753_task_tag_finished; else echo 1 > $REMOTE_ROOT/3abaec2376676ab674ad3a3a6b889ccbd69c3a52_flag_if_job_task_fail;fi
fi &
wait
cd $REMOTE_ROOT
cd 001
test $? -ne 0 && exit 1
if [ ! -f 03c6ca0813d7d931279cea63148513b40410778e_task_tag_finished ] ;then
( /bin/sh -c '{ if [ ! -f model.ckpt.index ]; then mpirun -np 24 dp train input.json; else mpirun -np 24 dp train input.json --restart model.ckpt; fi }'&&mpirun -np 24 dp freeze ) 1>>train.log 2>>train.log
if test $? -eq 0; then touch 03c6ca0813d7d931279cea63148513b40410778e_task_tag_finished; else echo 1 > $REMOTE_ROOT/3abaec2376676ab674ad3a3a6b889ccbd69c3a52_flag_if_job_task_fail;fi
fi &
wait
cd $REMOTE_ROOT
cd 003
test $? -ne 0 && exit 1
if [ ! -f eb73a2c2157e76f447256f3df2a5b6cad417eeb8_task_tag_finished ] ;then
( /bin/sh -c '{ if [ ! -f model.ckpt.index ]; then mpirun -np 24 dp train input.json; else mpirun -np 24 dp train input.json --restart model.ckpt; fi }'&&mpirun -np 24 dp freeze ) 1>>train.log 2>>train.log
if test $? -eq 0; then touch eb73a2c2157e76f447256f3df2a5b6cad417eeb8_task_tag_finished; else echo 1 > $REMOTE_ROOT/3abaec2376676ab674ad3a3a6b889ccbd69c3a52_flag_if_job_task_fail;fi
fi &
wait
cd $REMOTE_ROOT
cd 000
test $? -ne 0 && exit 1
if [ ! -f 72c48bd42034ee677ccd2a1e0fea48690f6b0b15_task_tag_finished ] ;then
( /bin/sh -c '{ if [ ! -f model.ckpt.index ]; then mpirun -np 24 dp train input.json; else mpirun -np 24 dp train input.json --restart model.ckpt; fi }'&&mpirun -np 24 dp freeze ) 1>>train.log 2>>train.log
if test $? -eq 0; then touch 72c48bd42034ee677ccd2a1e0fea48690f6b0b15_task_tag_finished; else echo 1 > $REMOTE_ROOT/3abaec2376676ab674ad3a3a6b889ccbd69c3a52_flag_if_job_task_fail;fi
fi &
wait
cd $REMOTE_ROOT
test $? -ne 0 && exit 1
wait
FLAG_IF_JOB_TASK_FAIL=$(cat 3abaec2376676ab674ad3a3a6b889ccbd69c3a52_flag_if_job_task_fail)
if test $FLAG_IF_JOB_TASK_FAIL -eq 0; then touch 3abaec2376676ab674ad3a3a6b889ccbd69c3a52_job_tag_finished; else exit 1;fi
Why does this happen?
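The job script above records its progress in marker files: a `*_flag_if_job_task_fail` file holding 0 or 1, and a `*_task_tag_finished` file in each task directory that completed. A small sketch for collecting that state in one place could look like the following. The function `summarize_submission` is hypothetical, and reading `train.log` assumes the training stage shown above; if the job never started on the cluster, none of these files may exist.

```python
from pathlib import Path


def summarize_submission(remote_root: str) -> dict:
    """Report the job fail flags and per-task finished tags under remote_root."""
    root = Path(remote_root)
    summary = {"fail_flags": {}, "tasks": {}}

    # The job script writes 0 to this file at start and 1 when a task fails.
    for flag in sorted(root.glob("*_flag_if_job_task_fail")):
        summary["fail_flags"][flag.name] = flag.read_text().strip()

    # Each task directory (000, 001, ...) gets a *_task_tag_finished file on success.
    for task_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        log = task_dir / "train.log"
        tail = log.read_text(errors="replace").splitlines()[-5:] if log.exists() else []
        summary["tasks"][task_dir.name] = {
            "finished": any(task_dir.glob("*_task_tag_finished")),
            "log_tail": tail,
        }
    return summary
```

Calling it with the `remote_root` path from the error message returns a dictionary that shows which tasks finished and the last lines of each training log, which is usually where the underlying failure is reported.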