I first created the init_bulk structure for the initial data by running dpgen init_bulk param.json machine.json in the "init" folder, and it successfully generated the 02.md folder with DeePMD data for model training.
However, when I went to the "run" folder and ran dpgen run param.json machine.json, it got stuck at task 01, as shown below:
2023-11-05 17:06:18,590 - INFO : start running
2023-11-05 17:06:18,599 - INFO : =============================iter.000000==============================
2023-11-05 17:06:18,599 - INFO : -------------------------iter.000000 task 00--------------------------
2023-11-05 17:06:18,735 - INFO : -------------------------iter.000000 task 01--------------------------
2023-11-05 17:08:57,278 - INFO : start running
2023-11-05 17:08:57,283 - INFO : continue from iter 000 task 00
2023-11-05 17:08:57,283 - INFO : =============================iter.000000==============================
2023-11-05 17:08:57,283 - INFO : -------------------------iter.000000 task 01--------------------------
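For context, my understanding (an assumption on my part, inferred from the "continue from iter 000 task 00" line) is that dpgen checkpoints its progress in a record.dpgen file, with one "<iteration> <task>" pair per line, and resumes after the last recorded pair on restart. A small stand-alone sketch of how I read that file:

```shell
# Assumed record.dpgen format: one "<iteration> <task>" pair per line;
# dpgen resumes after the last recorded pair on restart. "0 0" matches
# the "continue from iter 000 task 00" line in my log.
printf '0 0\n' > record.demo        # stand-in for the real record.dpgen

set -- $(tail -n 1 record.demo)     # split the last line into fields
last_iter=$1
last_task=$2
printf 'resume after iter %06d, task %02d\n' "$last_iter" "$last_task"
rm -f record.demo
```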
I've noticed that all four ML models have completed their training, as evidenced by the presence of "frozen_model.pb" files in folders 000 to 003 and by the information in "train.log". Despite this, the run stage appears to be stalled, and my Slurm cluster is still actively training the model.
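For reference, this is roughly how I checked for the frozen models; the snippet below builds a stand-in copy of the iter.000000/00.train layout under a temp directory (the paths are taken from my run, not something dpgen guarantees), so it is safe to run anywhere:

```shell
# Stand-in layout mirroring iter.000000/00.train/000..003 from my run,
# built under a temp dir with empty placeholder files.
root=$(mktemp -d)
for i in 000 001 002 003; do
    mkdir -p "$root/iter.000000/00.train/$i"
    : > "$root/iter.000000/00.train/$i/frozen_model.pb"   # fake frozen model
done

# Count the frozen models the same way I checked on the cluster:
count=$(ls "$root"/iter.000000/00.train/*/frozen_model.pb | wc -l | tr -d ' ')
echo "frozen models found: $count"
rm -rf "$root"
```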
Here is one of the sub files used in my case:
#!/bin/bash -l
#SBATCH --parsable
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 1
#SBATCH --gres=gpu:1
#SBATCH --partition a100_normal_q
#SBATCH -A oxides_1
#SBATCH -J dpgen
module load apps site/tinkercliffs/easybuild/setup
source activate deepmd
export OMP_NUM_THREADS=8
export TF_INTRA_OP_PARALLELISM_THREADS=8
export TF_INTER_OP_PARALLELISM_THREADS=2
nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 30 >> gpu_log.csv &
GPU_LOG_PID=$! # Save the PID to terminate it later
REMOTE_ROOT=$(readlink -f /home/wenjiang0716/dpgen/dpgen_example/3d89208a118cddc7647f92706baafdb51eed8c2f)
echo 0 > $REMOTE_ROOT/f788c8b4a37095f3a02f3e3b1fb400c17bf29f55_flag_if_job_task_fail
test $? -ne 0 && exit 1
source $REMOTE_ROOT/f788c8b4a37095f3a02f3e3b1fb400c17bf29f55.sub.run
cd $REMOTE_ROOT
test $? -ne 0 && exit 1
wait
FLAG_IF_JOB_TASK_FAIL=$(cat f788c8b4a37095f3a02f3e3b1fb400c17bf29f55_flag_if_job_task_fail)
if test $FLAG_IF_JOB_TASK_FAIL -eq 0; then touch f788c8b4a37095f3a02f3e3b1fb400c17bf29f55_job_tag_finished; else exit 1;fi
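While writing this up I noticed one thing I'm unsure about (my own guess, not something the logs confirm): the nvidia-smi logger is started with "-l 30" and never killed, and a plain "wait" blocks until every background child of the shell exits, so the script might never reach the touch of the ..._job_tag_finished file even after training completes. A minimal demonstration of that "wait" behavior, with sleep commands standing in for the logger and the training task:

```shell
# Stand-ins: `sleep 1000` plays the never-ending nvidia-smi logger,
# `sleep 1` plays the training task.
sleep 1000 &
LOG_PID=$!
sleep 1 &
TASK_PID=$!

# A bare `wait` would also block on the logger; waiting on the task's
# PID alone returns as soon as the task is done.
wait "$TASK_PID"
kill "$LOG_PID"                      # terminate the logger explicitly
echo "task finished, logger terminated"
```

If that guess is right, waiting on specific task PIDs, or killing the logger before the "wait", would let the finished tag get touched.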
I'm wondering where I mis-defined the parameters. Thanks for any help.
Hi,
I'm going through the dpgen tutorial to understand how the model works, but I've hit a snag when trying to perform the "run" stage.
Here is some basic info about the package versions: deepmd-kit 2.2.4.
Here is the machine.json I crafted based on my system:
Best,
JJ