[BUG] How does BinningStrategy::extend_element_col_map() work? #1677

@vovannikov

Bug Description: What's going wrong?

I have run into an issue when running the benchmarks of @slfuchs with a recent version of 4C. I am still collecting data to describe the issue properly, but in short: 4C crashes in MPI runs on more than one rank with the following error:

PROC 1 ERROR in /home/vladimir/development/4C/src/core/fem/src/discretization/4C_fem_discretization_partition.cpp, line 247:
Proc 1: Element gid=93 from oldmap is not in newmap
------------------
 0# void FourC::Core::Internal::format_and_throw_error<int const&, int&>(std::source_location const&, std::basic_format_string<char, std::type_identity<int const&>::type, std::type_identity<int&>::type>, int const&, int&) in /home/vladimir/development/4C-debug/lib4C.so
 1# FourC::Core::FE::Discretization::export_column_elements(FourC::Core::LinAlg::Map const&, bool, bool) in /home/vladimir/development/4C-debug/lib4C.so
 2# FourC::Core::Binstrategy::Utils::extend_discretization_ghosting(FourC::Core::FE::Discretization&, FourC::Core::LinAlg::Map&, bool, bool, bool) in /home/vladimir/development/4C-debug/lib4C.so
 3# FourC::Particle::WallHandlerDiscretCondition::extend_wall_element_ghosting(std::map<int, std::set<int, std::less<int>, std::allocator<int> >, std::less<int>, std::allocator<std::pair<int const, std::set<int, std::less<int>, std::allocator<int> > > > >&) in /home/vladimir/development/4C-debug/lib4C.so
 4# FourC::Particle::WallHandlerDiscretCondition::distribute_wall_elements_and_nodes() in /home/vladimir/development/4C-debug/lib4C.so
 5# FourC::Particle::ParticleAlgorithm::distribute_load_among_procs() in /home/vladimir/development/4C-debug/lib4C.so
 6# FourC::Particle::ParticleAlgorithm::setup() in /home/vladimir/development/4C-debug/lib4C.so
 7# FourC::PaSI::PartitionedAlgo::setup() in /home/vladimir/development/4C-debug/lib4C.so
 8# FourC::PaSI::PasiPartTwoWayCoup::setup() in /home/vladimir/development/4C-debug/lib4C.so
 9# FourC::pasi_dyn() in /home/vladimir/development/4C-debug/lib4C.so
10# entrypoint_switch() in /home/vladimir/development/4C-debug/4C
11# run(FourC::CommandlineArguments&, FourC::Core::Communication::Communicators&) in /home/vladimir/development/4C-debug/4C
12# main in /home/vladimir/development/4C-debug/4C
13# 0x00007F35B0E6A1CA in /lib/x86_64-linux-gnu/libc.so.6
14# __libc_start_main in /lib/x86_64-linux-gnu/libc.so.6
15# _start in /home/vladimir/development/4C-debug/4C
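
The wording of the message suggests that the check at line 247 expects every element GID of the old map to be present in the new map. This is inferred from the error text only, not from the source of export_column_elements(). Under that assumption, a small helper for listing the offending GIDs before the call could look like the sketch below. The name missing_in_new_map and the use of plain std::set<int> instead of Core::LinAlg::Map are hypothetical and not part of 4C.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <vector>

// Hypothetical helper (not part of 4C): returns all GIDs that are contained
// in old_gids but absent from new_gids, in ascending order.
std::vector<int> missing_in_new_map(const std::set<int>& old_gids, const std::set<int>& new_gids)
{
  std::vector<int> missing;
  // std::set iterates in sorted order, which std::set_difference requires
  std::set_difference(old_gids.begin(), old_gids.end(), new_gids.begin(), new_gids.end(),
      std::back_inserter(missing));
  return missing;
}
```

Filling old_gids and new_gids with the GIDs of the two maps on each rank would show whether gid=93 is the only element affected on proc 1 or one of several.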

While debugging, I stepped through the code at several points along the path to the crash. The part that is least clear to me is the following function:

std::shared_ptr<Core::LinAlg::Map> Core::Binstrategy::BinningStrategy::extend_element_col_map(
    std::map<int, std::set<int>> const& bin_to_row_ele_map,
    std::map<int, std::set<int>>& bin_to_row_ele_map_to_lookup_requests,
    std::map<int, std::set<int>>& ext_bin_to_ele_map, std::shared_ptr<Core::LinAlg::Map> bin_colmap,
    std::shared_ptr<Core::LinAlg::Map> bin_rowmap,
    const Core::LinAlg::Map* ele_colmap_from_standardghosting) const
{
  // do communication to gather all elements for extended ghosting
  const int numproc = Core::Communication::num_mpi_ranks(comm_);
  for (int iproc = 0; iproc < numproc; ++iproc)
  {
    // gather set of column bins for each proc
    std::set<int> bins;
    if (iproc == myrank_)
    {
      // either use given column layout of bins ...
      if (bin_colmap != nullptr)
      {
        int nummyeles = bin_colmap->num_my_elements();
        int* entries = bin_colmap->my_global_elements();
        bins.insert(entries, entries + nummyeles);
      }
      else  // ... or add an extra layer to the given bin distribution
      {
        std::map<int, std::set<int>>::const_iterator iter;
        for (iter = bin_to_row_ele_map.begin(); iter != bin_to_row_ele_map.end(); ++iter)
        {
          int binId = iter->first;
          // avoid getting two layer ghosting as this is not needed
          if (bin_rowmap != nullptr)
          {
            const int lid = bin_rowmap->lid(binId);
            if (lid < 0) continue;
          }
          std::vector<int> binvec;
          // get neighboring bins
          get_neighbor_and_own_bin_ids(binId, binvec);
          bins.insert(binvec.begin(), binvec.end());
        }
      }
    }
    // copy set to vector in order to broadcast data
    std::vector<int> binids(bins.begin(), bins.end());
    // first: proc i tells all procs how many bins it has
    int numbin = binids.size();
    Core::Communication::broadcast(&numbin, 1, iproc, comm_);
    // second: proc i tells all procs which bins it has
    binids.resize(numbin);
    Core::Communication::broadcast(binids.data(), numbin, iproc, comm_);
    // loop over all own bins and find requested ones, fill in elements in these bins
    std::map<int, std::set<int>> sdata;
    std::map<int, std::set<int>> rdata;
    for (int i = 0; i < numbin; ++i)
    {
      if (bin_to_row_ele_map_to_lookup_requests.find(binids[i]) !=
          bin_to_row_ele_map_to_lookup_requests.end())
        sdata[binids[i]].insert(bin_to_row_ele_map_to_lookup_requests[binids[i]].begin(),
            bin_to_row_ele_map_to_lookup_requests[binids[i]].end());
    }
    Core::LinAlg::gather<int>(sdata, rdata, 1, &iproc, comm_);
    // proc i has to store the received data
    if (iproc == myrank_)
    {
      ext_bin_to_ele_map = rdata;
    }
  }
  // reduce map of sets to one set and copy to a vector to create extended elecolmap
  std::set<int> coleleset;
  std::map<int, std::set<int>>::iterator iter;
  for (iter = ext_bin_to_ele_map.begin(); iter != ext_bin_to_ele_map.end(); ++iter)
    coleleset.insert(iter->second.begin(), iter->second.end());
  // insert standard ghosting
  if (ele_colmap_from_standardghosting != nullptr)
    for (int lid = 0; lid < ele_colmap_from_standardghosting->num_my_elements(); ++lid)
      coleleset.insert(ele_colmap_from_standardghosting->gid(lid));
  std::vector<int> colgids(coleleset.begin(), coleleset.end());
  // return extended elecolmap
  return std::make_shared<Core::LinAlg::Map>(-1, (int)colgids.size(), colgids.data(), 0, comm_);
}

I cannot work out the exact logic and purpose of this function.

It looks as though it can change the ownership of the elements listed in the original bin_to_row_ele_map, but the function's description and the context in which it is called indicate that this should not happen. Nevertheless, it does happen, and I get the crash shown above.
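
One way to read the data flow of the function above is the following serial sketch, which replaces the broadcast/gather loop with plain loops over per-rank containers. It is a simplified model and not the 4C implementation. The name extended_col_gids and all three parameters are hypothetical. It also assumes that Core::LinAlg::gather merges the per-bin element sets contributed by all ranks, which the quoted code does not show.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <vector>

using BinToEles = std::map<int, std::set<int>>;

// Serial stand-in (hypothetical, not part of 4C) for the communication loop.
// requested_bins[p]    : bins rank p broadcasts (its column bins, or its row
//                        bins plus their neighbors)
// lookup[q]            : bin -> element GIDs that rank q can answer for
// standard_ghosting[p] : GIDs already in rank p's standard element column map
// Returns, per rank, the sorted GIDs of the extended element column map.
std::vector<std::vector<int>> extended_col_gids(const std::vector<std::set<int>>& requested_bins,
    const std::vector<BinToEles>& lookup, const std::vector<std::set<int>>& standard_ghosting)
{
  const std::size_t numproc = requested_bins.size();
  std::vector<std::vector<int>> result(numproc);
  for (std::size_t iproc = 0; iproc < numproc; ++iproc)
  {
    std::set<int> coleleset;
    // every rank q answers with the elements it holds in the requested bins
    for (std::size_t q = 0; q < numproc; ++q)
    {
      for (int bin : requested_bins[iproc])
      {
        const auto it = lookup[q].find(bin);
        if (it != lookup[q].end()) coleleset.insert(it->second.begin(), it->second.end());
      }
    }
    // union with the standard ghosting of the requesting rank
    coleleset.insert(standard_ghosting[iproc].begin(), standard_ghosting[iproc].end());
    result[iproc].assign(coleleset.begin(), coleleset.end());
  }
  return result;
}
```

In this model the result describes only which GIDs a rank will hold in its column map. A GID that a rank owns ends up in that rank's result only if one of its requested bins contains the GID in some rank's lookup map, or if the GID is part of the standard ghosting. Whether that is what happens to gid=93 in the crash above is not established by this sketch.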
