Error: dpgen run param.json machine.json #1109

Kangxuxin · 2023-01-08T11:21:33Z

Teacher, when I run the command "dpgen run param.json machine.json", I get the following error:

"Traceback (most recent call last):
File "/public/home/duanxiangmei/softwore/dpgen/lib/python3.8/site-packages/dpdispatcher/submission.py", line 215, in handle_unexpected_submission_state
job.handle_unexpected_job_state()
File "/public/home/duanxiangmei/softwore/dpgen/lib/python3.8/site-packages/dpdispatcher/submission.py", line 532, in handle_unexpected_job_state
raise RuntimeError(f"job:{self.job_hash} {self.job_id} failed {self.fail_count} times.job_detail:{self}")
RuntimeError: job:72173a39f8ec32e18711bd340dfc0bcf6068dacc 20661 failed 6 times.job_detail:{'72173a39f8ec32e18711bd340dfc0bcf6068dacc': {'job_task_list': [{'command': "/bin/sh -c '{ if [ ! -f model.ckpt.index ]; then dp train input.json; else dp train input.json --restart model.ckpt; fi }'&&dp freeze", 'task_work_path': '002', 'forward_files': ['input.json'], 'backward_files': ['frozen_model.pb', 'lcurve.out', 'train.log', 'model.ckpt.meta', 'model.ckpt.index', 'model.ckpt.data-00000-of-00001', 'checkpoint'], 'outlog': 'train.log', 'errlog': 'train.log'}], 'resources': {'number_node': 1, 'cpu_per_node': 4, 'gpu_per_node': 0, 'queue_name': 'train', 'group_size': 1, 'custom_flags': [], 'strategy': {'if_cuda_multi_devices': False}, 'para_deg': 1, 'module_unload_list': [], 'module_list': [], 'source_list': ['/public/software/profile.d/compiler_intel-compiler-2021.3.0.sh', '/public/software/profile.d/mpi_intelmpi-2021.3.0.sh', '/public/home/duanxiangmei/.bashrc'], 'envs': {}, 'kwargs': {}}, 'job_state': <JobStatus.terminated: 4>, 'job_id': 20661, 'fail_count': 6}}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/public/home/duanxiangmei/softwore/dpgen/bin/dpgen", line 10, in <module>
sys.exit(main())
File "/public/home/duanxiangmei/softwore/dpgen/lib/python3.8/site-packages/dpgen/main.py", line 175, in main
args.func(args)
File "/public/home/duanxiangmei/softwore/dpgen/lib/python3.8/site-packages/dpgen/generator/run.py", line 2944, in gen_run
run_iter (args.PARAM, args.MACHINE)
File "/public/home/duanxiangmei/softwore/dpgen/lib/python3.8/site-packages/dpgen/generator/run.py", line 2909, in run_iter
run_train (ii, jdata, mdata)
File "/public/home/duanxiangmei/softwore/dpgen/lib/python3.8/site-packages/dpgen/generator/run.py", line 607, in run_train
submission.run_submission()
File "/public/home/duanxiangmei/softwore/dpgen/lib/python3.8/site-packages/dpdispatcher/submission.py", line 164, in run_submission
self.handle_unexpected_submission_state()
File "/public/home/duanxiangmei/softwore/dpgen/lib/python3.8/site-packages/dpdispatcher/submission.py", line 219, in handle_unexpected_submission_state
f"Meet errors will handle unexpected submission state.\n"
AttributeError: 'Submission' object has no attribute 'remote_root'"

Then I checked the train.log file in ch4/run/work/789b60381b5f7811b1e59f4b19fcbb340b2316a6/000:

"WARNING:tensorflow:From /public/home/duanxiangmei/softwore/deepmd-kit/lib/python3.10/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS.
WARNING:root:Environment variable KMP_BLOCKTIME is empty. Use the default value 0
WARNING:root:Environment variable KMP_AFFINITY is empty. Use the default value granularity=fine,verbose,compact,1,0
/public/home/duanxiangmei/softwore/deepmd-kit/lib/python3.10/importlib/__init__.py:169: UserWarning: The NumPy module was reloaded (imported a second time). This can in some cases result in small but subtle issues and is discouraged.
_bootstrap._exec(spec, module)
/public/home/duanxiangmei/softwore/deepmd-kit/lib/python3.10/site-packages/deepmd/utils/compat.py:316: UserWarning: It seems that you are using a deepmd-kit input of version 1.x.x, which is deprecated. we have converted the input to >2.0.0 compatible
warnings.warn(msg)
Traceback (most recent call last):
File "/public/home/duanxiangmei/softwore/deepmd-kit/bin/dp", line 10, in <module>
sys.exit(main())
File "/public/home/duanxiangmei/softwore/deepmd-kit/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 562, in main
train_dp(**dict_args)
File "/public/home/duanxiangmei/softwore/deepmd-kit/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 91, in train
jdata = normalize(jdata)
File "/public/home/duanxiangmei/softwore/deepmd-kit/lib/python3.10/site-packages/deepmd/utils/argcheck.py", line 782, in normalize
base.check_value(data, strict=True)
File "/public/home/duanxiangmei/softwore/deepmd-kit/lib/python3.10/site-packages/dargs/dargs.py", line 278, in check_value
self.traverse_value(argdict,
File "/public/home/duanxiangmei/softwore/deepmd-kit/lib/python3.10/site-packages/dargs/dargs.py", line 241, in traverse_value
self._traverse_sub(value,
File "/public/home/duanxiangmei/softwore/deepmd-kit/lib/python3.10/site-packages/dargs/dargs.py", line 260, in _traverse_sub
subarg.traverse(value,
File "/public/home/duanxiangmei/softwore/deepmd-kit/lib/python3.10/site-packages/dargs/dargs.py", line 228, in traverse
self.traverse_value(value,
File "/public/home/duanxiangmei/softwore/deepmd-kit/lib/python3.10/site-packages/dargs/dargs.py", line 241, in traverse_value
self._traverse_sub(value,
File "/public/home/duanxiangmei/softwore/deepmd-kit/lib/python3.10/site-packages/dargs/dargs.py", line 256, in _traverse_sub
sub_hook(self, value, path)
File "/public/home/duanxiangmei/softwore/deepmd-kit/lib/python3.10/site-packages/dargs/dargs.py", line 307, in _check_strict
raise ArgumentKeyError(path,
dargs.dargs.ArgumentKeyError: [at location training] undefined key stop_batch is not allowed in strict mode"

I hope you can give me some tips on what to do next, thank you.
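
When dpgen reports only that a job "failed 6 times", the underlying cause is usually in the task's own train.log, as it was here. The following is a minimal sketch of pulling the final exception line out of such a log. The function name `last_error_line` is hypothetical, and it assumes each traceback frame is a `File "..."` line followed by exactly one source line, which matches the log above but is not guaranteed in general.

```python
def last_error_line(log_text):
    """Return the exception summary of the last traceback in a log, or None.

    Assumption: every frame is a 'File "..."' line followed by one source line.
    """
    # Drop blank lines and indentation so frames are easy to recognise.
    lines = [ln.strip() for ln in log_text.splitlines() if ln.strip()]
    starts = [i for i, ln in enumerate(lines)
              if ln.startswith("Traceback (most recent call last)")]
    if not starts:
        return None
    i = starts[-1] + 1
    # Skip each frame header together with its source line.
    while i < len(lines) and lines[i].startswith('File "'):
        i += 2
    return lines[i] if i < len(lines) else None


sample = """WARNING:root:some warning
Traceback (most recent call last):
  File "/x/bin/dp", line 10, in <module>
    sys.exit(main())
  File "/x/dargs/dargs.py", line 307, in _check_strict
    raise ArgumentKeyError(path,
dargs.dargs.ArgumentKeyError: [at location training] undefined key stop_batch is not allowed in strict mode
"""
print(last_error_line(sample))
```

To check every training task at once, the same function could be applied to each train.log found under the work directory, for example with `pathlib.Path(root).rglob("train.log")`.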

AnguseZhang · 2023-01-08T15:45:05Z

Maintainer

dargs.dargs.ArgumentKeyError: [at location training] undefined key stop_batch is not allowed in strict mode

This error occurs because your param.json follows the old format. You should update default_training_param so that it is compatible with your DeePMD-kit version.

You can refer to https://github.com/deepmodeling/dpgen/blob/master/examples/run/dp2.x-lammps-vasp/param_CH4_deepmd-kit-2.0.1.json and https://github.com/deepmodeling/deepmd-kit/blob/master/examples/water/se_e2_a/input.json, and compare them with your own file.

Also, could you tell us where your param.json came from? We can update the example if it is outdated.
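
The comparison suggested above can be partly automated. The sketch below lists the keys that appear in one nested JSON dictionary but not in the other. The helper names `key_paths` and `compare_keys` are hypothetical, and the small `reference` dictionary is only an illustrative stand-in, not a copy of the linked example files; the real key names should be taken from those files.

```python
def key_paths(obj, prefix=""):
    """Collect the dotted path of every key in a nested dict."""
    paths = set()
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key.startswith("_"):
                continue  # skip comment-style keys such as "_comment"
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= key_paths(value, path)
    return paths


def compare_keys(mine, reference):
    """Return (keys only in mine, keys only in reference), both sorted."""
    mine_paths, ref_paths = key_paths(mine), key_paths(reference)
    return sorted(mine_paths - ref_paths), sorted(ref_paths - mine_paths)


# Inline stand-ins; in practice both dicts would come from json.load().
mine = {"training": {"stop_batch": 2000, "disp_freq": 1000, "_comment": "x"}}
reference = {"training": {"numb_steps": 2000, "disp_freq": 1000}}  # illustrative only

only_mine, only_reference = compare_keys(mine, reference)
print("only in my file:  ", only_mine)
print("only in reference:", only_reference)
```

Keys reported as "only in my file" are candidates for the kind of "undefined key ... is not allowed in strict mode" error shown above. The sketch compares key names only; it does not check values or types.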

2 replies

Kangxuxin Jan 9, 2023
Author

Here are my param.json and machine.json:
{
"type_map": ["H","C"],
"mass_map": [1,12],
"init_data_prefix": "../",
"init_data_sys": ["CH4.POSCAR.01x01x01/02.md/sys-0004-0001/deepmd"],
"sys_configs_prefix": "../",
"sys_configs": [
["CH4.POSCAR.01x01x01/01.scale_pert/sys-0004-0001/scale-1.000/00000*/POSCAR"],
["CH4.POSCAR.01x01x01/01.scale_pert/sys-0004-0001/scale-1.000/00001*/POSCAR"]
],
"_comment": " that's all ",
"numb_models": 4,
"default_training_param": {
"model": {
"type_map": ["H","C"],
"descriptor": {
"type": "se_a",
"sel": [16,4],
"rcut_smth": 0.5,
"rcut": 5.0,
"neuron": [120,120,120],
"resnet_dt": true,
"axis_neuron": 12,
"seed": 1
},
"fitting_net": {
"neuron": [25,50,100],
"resnet_dt": false,
"seed": 1
}
},
"learning_rate": {
"type": "exp",
"start_lr": 0.001,
"decay_steps": 5000
},
"loss": {
"start_pref_e": 0.02,
"limit_pref_e": 2,
"start_pref_f": 1000,
"limit_pref_f": 1,
"start_pref_v": 0.0,
"limit_pref_v": 0.0
},
"training": {
"stop_batch": 2000,
"disp_file": "lcurve.out",
"disp_freq": 1000,
"numb_test": 4,
"save_freq": 1000,
"save_ckpt": "model.ckpt",
"disp_training": true,
"time_training": true,
"profiling": false,
"profiling_file": "timeline.json",
"_comment": "that's all"
}
},
"model_devi_dt": 0.002,
"model_devi_skip": 0,
"model_devi_f_trust_lo": 0.05,
"model_devi_f_trust_hi": 0.15,
"model_devi_e_trust_lo": 10000000000.0,
"model_devi_e_trust_hi": 10000000000.0,
"model_devi_clean_traj": true,
"model_devi_jobs": [
{"sys_idx": [0],"temps": [100],"press": [1.0],"trj_freq": 10,"nsteps": 300,"ensemble": "nvt","_idx": "00"},
{"sys_idx": [1],"temps": [100],"press": [1.0],"trj_freq": 10,"nsteps": 3000,"ensemble": "nvt","_idx": "01"}
],
"fp_style": "vasp",
"shuffle_poscar": false,
"fp_task_max": 20,
"fp_task_min": 5,
"fp_pp_path": "./",
"fp_pp_files": ["POTCAR_H","POTCAR_C"],
"fp_incar": "./INCAR_methane"
}

{
"api_version": "1.0",
"train" :[
{
"machine": {
"batch_type": "Shell",
"context_type": "local",
"local_root" : "./",
"remote_root": "/public/home/duanxiangmei/KXX/DPMD/init_bulk/ch4/run/work"
},
"resources": {
"number_node": 1,
"cpu_per_node": 4,
"gpu_per_node": 0,
"group_size": 1,
"queue_name":"train",
"source_list":["/public/software/profile.d/compiler_intel-compiler-2021.3.0.sh",
"/public/software/profile.d/mpi_intelmpi-2021.3.0.sh",
"/public/home/duanxiangmei/.bashrc"]
},
"command": "dp"
}
],
"model_devi":[
{
"machine": {
"batch_type": "Shell",
"context_type": "local",
"local_root" : "./",
"remote_root": "/public/home/duanxiangmei/KXX/DPMD/init_bulk/ch4/run/work"
},
"resources": {
"number_node": 1,
"cpu_per_node": 4,
"gpu_per_node": 0,
"group_size": 1,
"queue_name": "model_devi",
"source_list":["/public/software/profile.d/compiler_intel-compiler-2021.3.0.sh",
"/public/software/profile.d/mpi_intelmpi-2021.3.0.sh",
"/public/home/duanxiangmei/.bashrc"]
},
"command": "lmp"
}
],
"fp":[
{
"machine": {
"batch_type": "Shell",
"context_type": "local",
"local_root" : "./",
"remote_root": "/public/home/duanxiangmei/KXX/DPMD/init_bulk/ch4/run/work"
},
"resources": {
"number_node": 1,
"cpu_per_node": 16,
"gpu_per_node": 0,
"group_size": 4,
"queue_name": "fp",
"source_list": ["/public/software/profile.d/compiler_intel-compiler-2021.3.0.sh",
"/public/software/profile.d/mpi_intelmpi-2021.3.0.sh"]
},
"command":"mpirun -np 16 /public/software/apps/vasp5.4.4-neb/bin/vasp_std"
}
]
}

Kangxuxin Jan 9, 2023
Author

Teacher, the problem has been solved by adjusting the param.json file. 00.train now runs smoothly.
However, 01.model_devi fails with an error:

use deepmd-kit at: /public/home/duanxiangmei/softwore/deepmd-kitpair_coeff
ERROR: Incorrect args for pair coefficients (src/input.cpp:1680)
Last command: pair_coeff

I checked the input.lammps file and found that the pair_coeff line has no arguments:
"pair_style deepmd ../graph.000.pb ../graph.003.pb ../graph.002.pb ../graph.001.pb out_freq ${THERMO_FREQ} out_file model_devi.out
pair_coeff"

If I add * * after pair_coeff in input.lammps in every task folder, 01.model_devi runs smoothly. But I have to make the same change again in the second iteration, which is very troublesome.
Is this caused by a missing parameter in param.json or machine.json?
Thanks a lot.
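
The manual edit described above could be scripted as a stopgap. This is a minimal sketch, assuming the intended arguments are `* *` and that the bare `pair_coeff` sits alone on its line, as in the input.lammps excerpt. The function name `fill_pair_coeff` is hypothetical and is not part of dpgen.

```python
import re

# A line containing only "pair_coeff", with optional spaces or tabs around it.
BARE_PAIR_COEFF = re.compile(r"^([ \t]*pair_coeff)[ \t]*$", re.MULTILINE)


def fill_pair_coeff(script_text, args="* *"):
    """Append args to any pair_coeff line that has no arguments (assumed fix)."""
    return BARE_PAIR_COEFF.sub(lambda m: f"{m.group(1)} {args}", script_text)


before = (
    "pair_style deepmd ../graph.000.pb ../graph.001.pb\n"
    "pair_coeff\n"
    "thermo 10\n"
)
after = fill_pair_coeff(before)
print(after)
```

Lines that already carry arguments are left unchanged, so running it twice is harmless. It only automates the workaround: it does not explain why the generated input.lammps had an empty pair_coeff, so it would have to be rerun in each iteration.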
