Failed to read 8 bytes from input stream at first SCF iteration #6132

@stilldown

Describe the bug

When running ABACUS with the command OMP_NUM_THREADS=12 nohup mpirun -n 2 --map-by socket --bind-to none abacus | tee output.log & , the program crashed at the first SCF iteration when using the HSE functional. I built with -DDEBUG_INFO=ON to provide more detail for debugging.

DIAMINODUT5-HSE.tar.gz

Expected behavior

No response

To Reproduce

Before running the toolchain, I modified the scripts install_openmpi.sh and install_elpa.sh to enable CUDA-aware MPI and cuSOLVERMp support, and to disable compilation of the GPU version of ELPA.
Open MPI configure step:

      ./configure CFLAGS="${CFLAGS}" \
        --prefix=${pkg_install_dir} \
        --libdir="${pkg_install_dir}/lib" \
        --with-zlib=${ZLIB} \
        --with-libevent=internal \
        --with-cuda=${CUDA_PATH} \
        --with-ucx=${UCX} \
        --with-ucc=${UCC} \
        ${EXTRA_CONFIGURE_FLAGS} \
        > configure.log 2>&1 || tail -n ${LOG_LINES} configure.log

ELPA build loop (excerpt):

      for TARGET in "cpu" ; do
        [ "$TARGET" = "nvidia" ] && [ "$ENABLE_CUDA" != "__TRUE__" ] && continue
        # disable cpu if cuda is enabled
        # [ "$TARGET" != "nvidia" ] && [ "$ENABLE_CUDA" = "__TRUE__" ] && continue
        echo "Installing from scratch into ${pkg_install_dir}/${TARGET}"
        mkdir -p "build_${TARGET}"
        cd "build_${TARGET}"
        if [ "${with_amd}" != "__DONTUSE__" ] && [ "${WITH_FLANG}" = "yes" ] ; then
        echo "AMD fortran compiler detected, enable special option operation"

toolchain_gnu.sh:

./install_abacus_toolchain.sh \
--with-gcc=install \
--with-intel=no \
--with-openblas=install \
--with-openmpi=install \
--with-cmake=install \
--with-scalapack=install \
--with-libxc=install \
--with-fftw=install \
--with-elpa=install \
--with-cereal=install \
--with-rapidjson=install \
--with-libtorch=install \
--with-libnpy=install \
--with-libri=install \
--with-libcomm=install \
--with-4th-openmpi=no \
--enable-cuda \
--gpu-ver=86 \
| tee compile.log

build_abacus_gnu.sh:

cmake -B $BUILD_DIR -DCMAKE_INSTALL_PREFIX=$PREFIX \
        -DCMAKE_CXX_COMPILER=g++ \
        -DMPI_CXX_COMPILER=mpicxx \
        -DLAPACK_DIR=$LAPACK \
        -DSCALAPACK_DIR=$SCALAPACK \
        -DUSE_ELPA=ON \
        -DELPA_DIR=$ELPA \
        -DCEREAL_INCLUDE_DIR=$CEREAL \
        -DFFTW3_DIR=$FFTW3 \
        -DLibxc_DIR=$LIBXC \
        -DENABLE_LCAO=ON \
        -DENABLE_LIBXC=ON \
        -DUSE_OPENMP=ON \
        -DENABLE_RAPIDJSON=ON \
        -DRapidJSON_DIR=$RAPIDJSON \
        -DUSE_CUDA=ON \
        -DUSE_CUDA_MPI=ON \
        -DENABLE_DEEPKS=ON \
        -DTorch_DIR=$LIBTORCH \
        -Dlibnpy_INCLUDE_DIR=$LIBNPY \
        -DENABLE_LIBRI=ON \
        -DLIBRI_DIR=$LIBRI \
        -DLIBCOMM_DIR=$LIBCOMM \
        -DENABLE_CUSOLVERMP=ON \
        -DCAL_CUSOLVERMP_PATH=$CUDA_PATH/lib64 \
        -DDEBUG_INFO=ON
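
Since the Open MPI configure step above passes --with-cuda and ABACUS is built with -DUSE_CUDA_MPI=ON, it may help to confirm that the installed Open MPI actually reports CUDA support before debugging the crash itself. The following is a minimal sketch, not part of the toolchain: the function name cuda_aware_from_ompi_info is hypothetical, and it assumes that the ompi_info on PATH is the one installed by the toolchain. It relies on the mpi_built_with_cuda_support parameter that Open MPI prints in its parsable output.

```shell
# Hypothetical helper (not part of ABACUS or its toolchain).
# Reads the output of `ompi_info --parsable --all` on stdin and reports
# whether the build advertises CUDA-aware support.
cuda_aware_from_ompi_info() {
  if grep -q 'mpi_built_with_cuda_support:value:true'; then
    echo "cuda-aware"
  else
    echo "not-cuda-aware"
  fi
}

# Query the local installation only if ompi_info is available.
# Assumption: this ompi_info belongs to the toolchain-built Open MPI.
if command -v ompi_info >/dev/null 2>&1; then
  ompi_info --parsable --all | cuda_aware_from_ompi_info
fi
```

A result of not-cuda-aware would suggest that the --with-cuda option did not take effect, in which case configure.log is the first place to look. A result of cuda-aware does not by itself rule out a problem elsewhere in the MPI, UCX, or cuSOLVERMp stack.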

Environment

No response

Additional Context

build - 副本.log

Task list for Issue attackers (only for developers)

  • Verify the issue is not a duplicate.
  • Describe the bug.
  • Steps to reproduce.
  • Expected behavior.
  • Error message.
  • Environment details.
  • Additional context.
  • Assign a priority level (low, medium, high, urgent).
  • Assign the issue to a team member.
  • Label the issue with relevant tags.
  • Identify possible related issues.
  • Create a unit test or automated test to reproduce the bug (if applicable).
  • Fix the bug.
  • Test the fix.
  • Update documentation (if necessary).
  • Close the issue and inform the reporter (if applicable).
