practice on ParallelMPO for mem and speed management. #213
Replies: 1 comment
In case you are interested: for standard quantum chemistry Hamiltonians, the recommended approach for manually setting the MPO distribution in the Python API is given in #145 (reply in thread). But that approach mostly targets custom Hamiltonians, and it will not give you speed and memory costs comparable to options "1" and "2" given below.
```python
import numpy as np
from pyblock2._pyscf.ao2mo import integrals as itg
from pyblock2.driver.core import DMRGDriver, SymmetryTypes, MPOAlgorithmTypes

bond_dims = [250] * 4 + [500] * 4
noises = [1e-4] * 4 + [1e-5] * 4 + [0]
thrds = [1e-10] * 8

from pyscf import gto, scf
mol = gto.M(atom="N 0 0 0; N 0 0 1.1", basis="sto3g", symmetry="d2h", verbose=0)
mf = scf.RHF(mol).run(conv_tol=1E-14)
ncas, n_elec, spin, ecore, h1e, g2e, orb_sym = itg.get_rhf_integrals(mf, ncore=0, ncas=None, g2e_symm=8)

driver = DMRGDriver(scratch="./tmp", symm_type=SymmetryTypes.SU2, n_threads=64, mpi=False)
driver.initialize_system(n_sites=ncas, n_elec=n_elec, spin=spin, orb_sym=orb_sym)
mpo = driver.get_qc_mpo(h1e=h1e, g2e=g2e, ecore=ecore, algo_type=MPOAlgorithmTypes.Conventional, iprint=1)
ket = driver.get_random_mps(tag="GS", bond_dim=250, nroots=1)
energy = driver.dmrg(mpo, ket, n_sweeps=20, bond_dims=bond_dims, noises=noises, lowmem_noise=True,
    dav_def_max_size=30, thrds=thrds, iprint=1)
print('DMRG energy = %20.15f' % energy)
```
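As an aside, the three schedule lists (`bond_dims`, `noises`, `thrds`) are consumed one entry per sweep. A quick pure-Python sketch of how they line up, under the assumption (my understanding of block2's behavior) that the last entry of a list is reused once `n_sweeps` exceeds its length:

```python
# Sketch (assumption: block2 reuses the last list entry once a schedule list
# is exhausted, so later sweeps here keep M=500, noise=0, tol=1e-10).
bond_dims = [250] * 4 + [500] * 4
noises = [1e-4] * 4 + [1e-5] * 4 + [0]
thrds = [1e-10] * 8

def schedule(n_sweeps):
    """Return the (bond_dim, noise, davidson_threshold) used on each sweep."""
    pick = lambda lst, i: lst[min(i, len(lst) - 1)]
    return [(pick(bond_dims, i), pick(noises, i), pick(thrds, i))
            for i in range(n_sweeps)]

for i, (m, eps, tol) in enumerate(schedule(10)):
    print(f"sweep {i}: M={m}, noise={eps:g}, dav_thrd={tol:g}")
```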
```python
import numpy as np
from pyblock2._pyscf.ao2mo import integrals as itg
from pyblock2.driver.core import DMRGDriver, SymmetryTypes, MPOAlgorithmTypes, ParallelTypes

bond_dims = [250] * 4 + [500] * 4
noises = [1e-4] * 4 + [1e-5] * 4 + [0]
thrds = [1e-10] * 8

from pyscf import gto, scf
mol = gto.M(atom="N 0 0 0; N 0 0 1.1", basis="sto3g", symmetry="d2h", verbose=0)
mf = scf.RHF(mol).run(conv_tol=1E-14)
ncas, n_elec, spin, ecore, h1e, g2e, orb_sym = itg.get_rhf_integrals(mf, ncore=0, ncas=None, g2e_symm=8)

driver = DMRGDriver(scratch="./tmp", symm_type=SymmetryTypes.SU2, n_threads=64, mpi=True)
driver.prule = driver.bw.bs.ParallelRuleQC(driver.mpi)
driver.initialize_system(n_sites=ncas, n_elec=n_elec, spin=spin, orb_sym=orb_sym)
mpo = driver.get_qc_mpo(h1e=h1e, g2e=g2e, ecore=ecore, algo_type=MPOAlgorithmTypes.Conventional,
    para_type=ParallelTypes.Nothing, iprint=2)
ket = driver.get_random_mps(tag="GS", bond_dim=250, nroots=1)
energy = driver.dmrg(mpo, ket, n_sweeps=20, bond_dims=bond_dims, noises=noises, lowmem_noise=True,
    dav_def_max_size=30, thrds=thrds, iprint=1)
print('DMRG energy = %20.15f' % energy)
```

and executed as

```bash
export OMP_NUM_THREADS=64
mpirun --map-by ppr:$SLURM_TASKS_PER_NODE:node:pe=$OMP_NUM_THREADS python3 -u dmrg.py > dmrg.out
```
1.

```python
# delete: driver.prule = driver.bw.bs.ParallelRuleQC(driver.mpi)
mpo = driver.get_qc_mpo(h1e=h1e, g2e=g2e, ecore=ecore, algo_type=MPOAlgorithmTypes.FastBipartite,
    para_type=ParallelTypes.I, iprint=2)
```

2.

```python
# delete: driver.prule = driver.bw.bs.ParallelRuleQC(driver.mpi)
mpo = driver.get_qc_mpo(h1e=h1e, g2e=g2e, ecore=ecore, algo_type=MPOAlgorithmTypes.FastBipartite,
    para_type=ParallelTypes.SIJ, iprint=2)
```
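For intuition only: my understanding is that `ParallelTypes.I` keys the distribution of Hamiltonian terms on a single orbital index, while `ParallelTypes.SIJ` keys it on a combination of indices, which usually gives a more even split across ranks. This is *not* block2's actual distribution rule, just a toy illustration of why a multi-index key balances better:

```python
# Toy sketch (NOT block2's implementation): distribute (i, j) term pairs
# over `size` ranks, either by a single index or by a combined-index hash.
def owner_i(i, j, size):
    # coarse split: every term with the same i lands on the same rank
    return i % size

def owner_ij(i, j, size):
    # finer split: illustrative hash over both indices
    return (i + j) % size

n_orb, size = 7, 4
counts_i = [0] * size
counts_ij = [0] * size
for i in range(n_orb):
    for j in range(n_orb):
        counts_i[owner_i(i, j, size)] += 1
        counts_ij[owner_ij(i, j, size)] += 1

print("per-rank term counts, split by i:  ", counts_i)   # → [14, 14, 14, 7]
print("per-rank term counts, split by i+j:", counts_ij)  # → [12, 12, 13, 12]
```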
Hi Huanchen @hczhai
I’m running a fairly large-scale DMRG calculation on a 78-orbital system; the MPO bond dimension is 6320.
On a SerialMPO run, using a single node with ~1000 GB memory, I can push the MPS bond dimension up to around 4000 before hitting OOM, and the calculation becomes very slow at that point. I have not yet tried the MPO reloading trick in this setup, but I plan to test that separately.
To speed things up, I tried using ParallelMPO, loading the per-rank MPO files (mpo_rank.bin) with the minimal=True flag. However, in this parallel setup the calculation hits OOM already at M ~ 500.
My initial intuition was that sub-Hamiltonian parallelization would keep the total memory usage roughly similar to the serial case, since each sub-Hamiltonian would have its own environment and we’d just add their contributions. Clearly I’m misunderstanding something about how memory scales here, especially when Pham is large (I noticed in your 2021 JCP paper that the largest Pham used was about 16).
So I was wondering what the recommended setup is here. Any guidance or rule-of-thumb would be very helpful. Thanks a lot for your time and for block2!
Below are my running scripts: