Am I really making use of the GPU? If so, why is the wall time longer than on my local machine? #3929
Replies: 2 comments
-
It looks like the GPU is not used, and you may first check in |
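One possible check, offered as a sketch and not necessarily the one the reply had in mind, is to ask TensorFlow directly which GPUs it can see from inside the job. The helper name `visible_gpus` is hypothetical; `tf.config.list_physical_devices` is the standard TensorFlow 2.x call.

```python
import importlib.util


def visible_gpus():
    """Return the names of GPUs TensorFlow can see.

    Returns None when TensorFlow is not installed in the active environment,
    and an empty list when it is installed but no GPU is visible.
    """
    if importlib.util.find_spec("tensorflow") is None:
        return None
    import tensorflow as tf  # imported lazily so the check above can fail softly

    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = visible_gpus()
if gpus is None:
    print("TensorFlow is not installed in this environment")
elif not gpus:
    print("TensorFlow is installed but sees no GPU")
else:
    print("TensorFlow sees:", ", ".join(gpus))
```

Running this on the compute node (inside the batch job) matters, because a login node usually has no GPU attached. An empty list there would point to a CPU-only build or missing CUDA libraries.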
-
(deepmd) [jayaprakash@login01 gpu3]$ conda create -n deepmd_gpu deepmd-kit=*=gpu libdeepmd==*gpu lammps cudatoolkit=11.6 horovod -c https://conda.deepmodeling.com -c defaults
ERROR conda.notices.fetch:get_channel_notice_response(73): Request error <HTTPSConnectionPool(host='repo.anaconda.com', port=443): Max retries exceeded with url: /pkgs/main/notices.json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ff3b927e230>: Failed to establish a new connection: [Errno -2] Name or service not known'))> for channel: defaults url: https://repo.anaconda.com/pkgs/main/notices.json
ERROR conda.notices.fetch:get_channel_notice_response(73): Request error <HTTPSConnectionPool(host='repo.anaconda.com', port=443): Max retries exceeded with url: /pkgs/r/notices.json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ff3b927e1a0>: Failed to establish a new connection: [Errno -2] Name or service not known'))> for channel: defaults url: https://repo.anaconda.com/pkgs/r/notices.json
Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ff3b066c1f0>: Failed to establish a new connection: [Errno -2] Name or service not known')': /conda-forge/noarch/repodata.json.zst
Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ff3b066c6d0>: Failed to establish a new connection: [Errno -2] Name or service not known')': /pkgs/r/noarch/repodata.json.zst
failed
CondaHTTPError: HTTP 000 CONNECTION FAILED for url https://conda.deepmodeling.com/linux-64/repodata.json
An HTTP error occurred when trying to retrieve this URL.
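The repeated "[Errno -2] Name or service not known" means hostname lookup failed on that node, so conda never reached any channel. A minimal sketch for confirming this from Python is below; the helper names `can_resolve` and `report` are hypothetical, and the host list is taken from the URLs in the error output above.

```python
import socket

# Hosts that appear in the conda error output above.
HOSTS = ["repo.anaconda.com", "conda.deepmodeling.com"]


def can_resolve(host, port=443):
    """Return True if the host name resolves to an address, False otherwise."""
    try:
        socket.getaddrinfo(host, port)
        return True
    except socket.gaierror:
        # Same failure class that conda reported as "[Errno -2] Name or service not known".
        return False


def report(hosts):
    """Map each host name to whether it currently resolves on this machine."""
    return {host: can_resolve(host) for host in hosts}
```

Calling `report(HOSTS)` on the login node shows whether the channels are reachable at all. If every entry is False, the node probably has no outbound DNS, and the environment would need to be created on a node with internet access or through a proxy or mirror provided by the cluster.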
-
My university HPC cluster has a TensorFlow dependency error. The system admin asked me to install DeePMD-kit on the cluster the same way I did on my local machine, so I installed it with Miniforge and used the submission script below.
history:
302 conda create -n deepmd deepmd-kit lammps horovod -c conda-forge
303 conda activate deepmd
(deepmd) [user@login04 gpu]$ cat submit.sh
#!/bin/bash
#SBATCH --job-name=Job #Job name
#SBATCH -N 1 #Number of nodes
#SBATCH --ntasks-per-node=1 #Number of tasks per node
#SBATCH --gres=gpu:2 #Number of GPUs
#SBATCH --error=job.%J.err #Name of error file
#SBATCH --output=job.%J.out #Name of output file
#SBATCH --time=72:00:00 #Wall-time limit for the job
#SBATCH --partition=gpu #specifies queue name(standard is the default partiti>
module load openmpi/4.1.4
conda init bash
conda activate deepmd
export OMP_NUM_THREADS=4
export TF_INTRA_OP_PARALLELISM_THREADS=4
export TF_INTER_OP_PARALLELISM_THREADS=4
mpirun -np 1 dp train input.json > output.txt
(deepmd)[user@login04 gpu]$ dp --version
DeePMD-kit v2.2.10
From the job error file:
DEEPMD INFO ---Summary of the training---------------------------------------
DEEPMD INFO running on: gpu008
DEEPMD INFO computing device: cpu:0
DEEPMD INFO Count of visible GPU: 0
DEEPMD INFO num_intra_threads: 4
DEEPMD INFO num_inter_threads: 4
DEEPMD INFO -----------------------------------------------------------------
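The summary above already answers part of the question: it reports `computing device: cpu:0` and `Count of visible GPU: 0`. A small sketch for pulling those fields out of a job log automatically is shown here; the function names `parse_summary` and `uses_gpu` are hypothetical, and the line format is assumed to match the excerpt above.

```python
import re

# Matches lines such as "DEEPMD INFO computing device: cpu:0".
# Separator lines made of dashes contain no colon and are skipped.
SUMMARY_LINE = re.compile(r"DEEPMD INFO\s+(?P<key>[^:]+):\s*(?P<value>.+)")

SAMPLE_LOG = """\
DEEPMD INFO ---Summary of the training---------------------------------------
DEEPMD INFO running on: gpu008
DEEPMD INFO computing device: cpu:0
DEEPMD INFO Count of visible GPU: 0
DEEPMD INFO num_intra_threads: 4
DEEPMD INFO num_inter_threads: 4
DEEPMD INFO -----------------------------------------------------------------
"""


def parse_summary(log_text):
    """Collect the key/value pairs from the training summary lines."""
    fields = {}
    for line in log_text.splitlines():
        match = SUMMARY_LINE.match(line.strip())
        if match:
            fields[match.group("key").strip()] = match.group("value").strip()
    return fields


def uses_gpu(fields):
    """Return True only if a GPU is visible and the device is not a CPU."""
    device = fields.get("computing device", "")
    gpu_count = int(fields.get("Count of visible GPU", "0"))
    return gpu_count > 0 and not device.startswith("cpu")


summary = parse_summary(SAMPLE_LOG)
print(summary["computing device"], "| GPU in use:", uses_gpu(summary))
```

For the log in this post the check comes out negative, so the job ran on the CPU of a GPU node even though two GPUs were requested.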
Am I really making use of the GPU? If so, why is the wall time longer than on my local machine?
Please help me.