
Batch shared resources and batched APIs


TrialWaveFunction (TWF) tree and shared resource

TrialWaveFunction
  |    \
Slater  Jastrow    (WFCs)
  |   \
Det_up Det_dn
  |       |  
SPOSet  SPOSet

A resource may be defined anywhere in the tree, but the tree uses multiple levels of type erasure and is built dynamically from the input. The details of a resource are known only to the concrete derived class, not globally. For instance, there are many derived classes of WaveFunctionComponent whose resource needs are unknown to the TWF, and an SPOSet is not visible outside its determinant class. Currently there is no mechanism for building this set of resources other than instantiating the complete TWF tree.
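
To make this concrete, here is a minimal sketch of the type-erasure idea (the names Resource, ResourceCollection::addResource, DiracDeterminantLike, DetResource, and createResource below are illustrative, not the exact QMCPACK API): the collection only stores type-erased handles, and only the derived class knows what its resource actually contains, so the full set can only be discovered by walking an instantiated tree.

#include <cstddef>
#include <memory>
#include <vector>

// Type-erased handle; the collection stores these without knowing their contents.
struct Resource
{
  virtual ~Resource() = default;
  virtual std::unique_ptr<Resource> makeClone() const = 0; // one copy is made per crowd
};

struct ResourceCollection
{
  // Returns the index of the new entry so the creating class can find its resource later.
  std::size_t addResource(std::unique_ptr<Resource> res)
  {
    resources_.push_back(std::move(res));
    return resources_.size() - 1;
  }
  std::vector<std::unique_ptr<Resource>> resources_;
};

// Only this derived class knows that it needs, say, a BLAS handle; the TWF above it does not.
class DiracDeterminantLike
{
public:
  void createResource(ResourceCollection& collection) const
  {
    resource_index_ = collection.addResource(std::make_unique<DetResource>());
  }

private:
  struct DetResource : Resource
  {
    std::unique_ptr<Resource> makeClone() const override { return std::make_unique<DetResource>(*this); }
    // BLAS handle, device scratch buffers, ... would live here, invisible to the rest of the tree.
  };
  mutable std::size_t resource_index_ = 0;
};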

A shared resource simply serves a collection of indistinguishable walker objects. For example,

  1. Det_up clones share a BLAS handle, and Det_dn clones share another BLAS handle; the two handles are independent (see the sketch after this list).
  2. An SPOSet shares a device memory buffer for the output orbital values when multiple walkers are computed together.
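
For item 1, here is a sketch of what such a per-determinant resource could own, assuming cuBLAS (the struct and member names are hypothetical); all Det_up clones in a crowd would borrow this single instance, while Det_dn clones borrow a separate one. Item 2 would similarly be a device buffer owned by a resource attached to the SPOSet.

#include <cublas_v2.h>

// Hypothetical shared resource owned per determinant "species" (e.g. Det_up).
struct DetCublasResource
{
  DetCublasResource() { cublasCreate(&handle); }
  ~DetCublasResource() { cublasDestroy(handle); }
  DetCublasResource(const DetCublasResource&) = delete;
  DetCublasResource& operator=(const DetCublasResource&) = delete;

  cublasHandle_t handle; // one handle shared by all walker clones of this determinant
};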

A ResourceCollection (right now used only in the TWF; it may later cover ParticleSet and the Hamiltonian) collects all the shared resources created when instantiating the gold TWF tree, which is built when parsing the wavefunction input, before creating any of the drivers. This is the first step in preparing shared resources for batching. We can now capture a set of resources from the TWF, store them in a composed object, and move that object from one TWF to another, so it can be kept at crowd scope even as walkers enter and leave that scope.
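
Reusing the illustrative names from the sketch above (again, not the exact QMCPACK calls), the capture step looks roughly like this:

// Stand-ins for nodes of the gold TWF tree built from the wavefunction input.
DiracDeterminantLike gold_det_up, gold_det_dn;

// Capture the shared resources once: each node contributes the resources only it knows about.
ResourceCollection gold_twf_resources;            // the composed object
gold_det_up.createResource(gold_twf_resources);
gold_det_dn.createResource(gold_twf_resources);
// gold_twf_resources can now be handed around (e.g. copied per crowd) independently
// of any particular TWF clone.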

When crowds are created in a driver, the gold ResourceCollection is copied once per crowd to serve the collection of walker objects in that crowd, but it is not yet assigned to any particular walker. In each resource scope in the driver, a RefVectorWithLeader is built by designating a leader and taking a collection of walker objects from the crowd. The per-crowd shared resource is given to the crowd leader at the beginning of the scope and taken back at the end. For now this is good enough, since resource distribution is static within driver scopes such as PbyP or T-moves: once per scope, we give the resources to a leader and take them back afterwards. The only change to the other APIs is replacing RefVector with RefVectorWithLeader, which is a simple change.
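
The shape of RefVectorWithLeader is roughly the following simplified sketch (not the real class): a vector of references plus one designated leader, which is not required to be an element of the vector.

#include <functional>
#include <vector>

// Simplified sketch of the idea behind RefVectorWithLeader (illustrative only).
template<class T>
class RefVectorWithLeaderSketch : public std::vector<std::reference_wrapper<T>>
{
public:
  RefVectorWithLeaderSketch(T& leader, const std::vector<std::reference_wrapper<T>>& refs)
      : std::vector<std::reference_wrapper<T>>(refs), leader_(leader)
  {}

  T& getLeader() const { return leader_; }

private:
  T& leader_; // holds the per-crowd shared resources and provides the vtable for mw_ calls
};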

Every class in the TWF tree has acquireResource/releaseResource APIs, allowing the shared resources in this ResourceCollection to be acquired at the beginning of a scope and released at the end. This process is hidden behind ResourceCollectionLock, which uses RAII.
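
ResourceCollectionLock is essentially the following RAII sketch (illustrative, built on the acquireResource/releaseResource APIs just mentioned): the constructor hands the per-crowd collection to the leader and the destructor takes it back, so a scope cannot forget the release.

// RAII sketch of ResourceCollectionLock (illustrative, not the exact implementation).
template<class LEADER>
class ResourceCollectionLockSketch
{
public:
  ResourceCollectionLockSketch(ResourceCollection& collection, LEADER& leader)
      : collection_(collection), leader_(leader)
  {
    leader_.acquireResource(collection_); // the leader holds the shared resources for this scope
  }

  ~ResourceCollectionLockSketch() { leader_.releaseResource(collection_); } // returned at scope exit

private:
  ResourceCollection& collection_;
  LEADER& leader_;
};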

{ // crowd scope
  {
    // designate the leader
    const RefVectorWithLeader<TrialWaveFunction> walker_twfs(crowd.get_walker_twfs()[0], crowd.get_walker_twfs());
    // hand the shared resource to the leader.
    ResourceCollectionLock<TrialWaveFunction> resource_lock(crowd.getTWFSharedResource(), crowd.get_walker_twfs()[0]);
    // pbyp + hamiltonian
  }
  {
    // T-move section. only work on walkers touched by T-moves.
    const RefVectorWithLeader<TrialWaveFunction> walker_twfs(tmove_touched_walkers[0], tmove_touched_walkers);
    ResourceCollectionLock<TrialWaveFunction> resource_lock(crowd.getTWFSharedResource(), tmove_touched_walkers[0]);
  }
}

{ // population scope
  // walker control: only work on walkers that were received or copied.
  const RefVectorWithLeader<TrialWaveFunction> walker_twfs(gold_walker, received_or_copied_walkers);
  ResourceCollectionLock<TrialWaveFunction> resource_lock(population.getTWFSharedResource(), gold_walker);
}

Peter: It would be nice to pass the shared resources in the flex and then mw arguments, but for expedience we build up the resource collection as the TWF goes through its many initialization steps, and unrolling this sequence of states is essential to delivering the correct shared resources to the correct WFCs later. More analysis is needed to break the resources out of the TWF state machine and into a scheme where they can be distributed based on a WFC's scope and evaluation requirements, without recourse to the sequence of states that instantiated them in the TWF. This is what the cursor in the ResourceCollection is about.

Ye: Passing a ResourceCollection can only be done after we change the cursor method to indexes, which has been attempted and failed; more preparation work is needed. Once that is completed, we may pass the shared resources in the flex and then mw arguments. Even if we can pass resources via arguments, we still need to handle the vtable explicitly from the leader instead of walker[0], I assume. One more thought: a shared resource may have dependencies. The TWF needs pair distances from the ParticleSet. Grabbing P_leader.shared_resource.Temp_r is more natural than indexing into a piece of anonymous resource in the ResourceCollection.
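
A small self-contained illustration of the access pattern Ye prefers versus the anonymous one (every name here is a placeholder, not a real QMCPACK type):

#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

struct ResourceBase { virtual ~ResourceBase() = default; };
struct DistanceResource : ResourceBase { std::vector<double> Temp_r; };

struct ParticleSetLeaderSketch
{
  DistanceResource shared_resource; // typed resource owned by the leader
};

int main()
{
  ParticleSetLeaderSketch p_leader;

  // Natural: the dependent component (e.g. the TWF) grabs the pair-distance scratch
  // directly from the ParticleSet leader, with the type known at the call site.
  auto& temp_r = p_leader.shared_resource.Temp_r;
  temp_r.resize(8);

  // Anonymous alternative: the same kind of data sits behind a type-erased entry in a
  // ResourceCollection, so the consumer must know an index and the concrete type out of band.
  std::vector<std::unique_ptr<ResourceBase>> collection;
  collection.push_back(std::make_unique<DistanceResource>());
  const std::size_t temp_r_index = 0;
  auto& temp_r_anon = dynamic_cast<DistanceResource&>(*collection[temp_r_index]).Temp_r;
  temp_r_anon.resize(8);
  assert(temp_r_anon.size() == temp_r.size());
}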

Batched APIs

In a resource scope, where do the batched APIs find the resources they use?

static void A::mw_evaluate(const RefVectorWithLeader<A>& a_list)
{
  auto& a_leader = a_list.getLeader();
  batched_gemm(a_leader.shared_resource.get_cublas_handle(),
               extractDataPtrsFromAs(a_list),
               a_list.size(),
               a_leader.shared_resource.get_memory_resource());
}

However, not all functions can be made fully static, especially when virtual functions are involved.

virtual void Base::mw_evaluate(const RefVectorWithLeader<Base>& b_list) const;

class B : public Base
{
  void mw_evaluate(const RefVectorWithLeader<Base>& b_list) const override
  // object members are read only here; writes may only go through the leader.
  {
    assert(this == &b_list.getLeader()); // safety check: the virtual call came through the leader,
                                         // so reading members of *this reads the leader's state
    auto& b_leader = static_cast<B&>(b_list.getLeader());
    batched_gemm(b_leader.shared_resource.get_cublas_handle(),
                 extractDataPtrsFromBs(b_list),
                 b_list.size(),
                 b_leader.shared_resource.get_memory_resource());
  }
};

So the "Leader" owns the shared resource in a given driver/resource scope. All the vtable dispatch is done via the leader consistently with shared resources. Leader is not necessary be part of the RefVector collection. So the leader could just be So no more treating element 0 in the collection special and the collection can serve 0-N elements.
