Description
Bug Description: What's going wrong?
I ran into an issue while running the benchmarks of @slfuchs with a recent version of 4C. I am still collecting data to describe the issue properly, but in short: 4C crashes when run with MPI on more than one rank, with the following error:
PROC 1 ERROR in /home/vladimir/development/4C/src/core/fem/src/discretization/4C_fem_discretization_partition.cpp, line 247:
Proc 1: Element gid=93 from oldmap is not in newmap
------------------
0# void FourC::Core::Internal::format_and_throw_error<int const&, int&>(std::source_location const&, std::basic_format_string<char, std::type_identity<int const&>::type, std::type_identity<int&>::type>, int const&, int&) in /home/vladimir/development/4C-debug/lib4C.so
1# FourC::Core::FE::Discretization::export_column_elements(FourC::Core::LinAlg::Map const&, bool, bool) in /home/vladimir/development/4C-debug/lib4C.so
2# FourC::Core::Binstrategy::Utils::extend_discretization_ghosting(FourC::Core::FE::Discretization&, FourC::Core::LinAlg::Map&, bool, bool, bool) in /home/vladimir/development/4C-debug/lib4C.so
3# FourC::Particle::WallHandlerDiscretCondition::extend_wall_element_ghosting(std::map<int, std::set<int, std::less<int>, std::allocator<int> >, std::less<int>, std::allocator<std::pair<int const, std::set<int, std::less<int>, std::allocator<int> > > > >&) in /home/vladimir/development/4C-debug/lib4C.so
4# FourC::Particle::WallHandlerDiscretCondition::distribute_wall_elements_and_nodes() in /home/vladimir/development/4C-debug/lib4C.so
5# FourC::Particle::ParticleAlgorithm::distribute_load_among_procs() in /home/vladimir/development/4C-debug/lib4C.so
6# FourC::Particle::ParticleAlgorithm::setup() in /home/vladimir/development/4C-debug/lib4C.so
7# FourC::PaSI::PartitionedAlgo::setup() in /home/vladimir/development/4C-debug/lib4C.so
8# FourC::PaSI::PasiPartTwoWayCoup::setup() in /home/vladimir/development/4C-debug/lib4C.so
9# FourC::pasi_dyn() in /home/vladimir/development/4C-debug/lib4C.so
10# entrypoint_switch() in /home/vladimir/development/4C-debug/4C
11# run(FourC::CommandlineArguments&, FourC::Core::Communication::Communicators&) in /home/vladimir/development/4C-debug/4C
12# main in /home/vladimir/development/4C-debug/4C
13# 0x00007F35B0E6A1CA in /lib/x86_64-linux-gnu/libc.so.6
14# __libc_start_main in /lib/x86_64-linux-gnu/libc.so.6
15# _start in /home/vladimir/development/4C-debug/4C
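Going by the error message and the frame in `export_column_elements`, the failing check appears to require that every element GID a rank holds in its old map is still present in the new column map on that rank. The following is a minimal sketch of such a check, not the actual 4C code: `missing_from_new_map` is a hypothetical helper, and plain `std::set<int>` stands in for `Core::LinAlg::Map`.

```cpp
#include <set>
#include <vector>

// Hypothetical stand-in for the consistency check: returns every GID that is
// contained in the old map but absent from the new column map.
// Assumption: the real check compares per-rank GID lists in this way.
std::vector<int> missing_from_new_map(
    const std::set<int>& old_gids, const std::set<int>& new_col_gids)
{
  std::vector<int> missing;
  for (int gid : old_gids)
  {
    // An element that vanishes from the new map would have to change owner.
    if (new_col_gids.count(gid) == 0) missing.push_back(gid);
  }
  return missing;
}
```

Printing the result of a helper like this on each rank just before the export would list all affected GIDs at once instead of only the first one that triggers the error.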
I stepped through the code along the path to the crash, and the part that is least clear to me is the following:
4C/src/core/binstrategy/4C_binstrategy.cpp
Lines 1253 to 1341 in bbb4375
std::shared_ptr<Core::LinAlg::Map> Core::Binstrategy::BinningStrategy::extend_element_col_map(
    std::map<int, std::set<int>> const& bin_to_row_ele_map,
    std::map<int, std::set<int>>& bin_to_row_ele_map_to_lookup_requests,
    std::map<int, std::set<int>>& ext_bin_to_ele_map, std::shared_ptr<Core::LinAlg::Map> bin_colmap,
    std::shared_ptr<Core::LinAlg::Map> bin_rowmap,
    const Core::LinAlg::Map* ele_colmap_from_standardghosting) const
{
  // do communication to gather all elements for extended ghosting
  const int numproc = Core::Communication::num_mpi_ranks(comm_);
  for (int iproc = 0; iproc < numproc; ++iproc)
  {
    // gather set of column bins for each proc
    std::set<int> bins;
    if (iproc == myrank_)
    {
      // either use given column layout of bins ...
      if (bin_colmap != nullptr)
      {
        int nummyeles = bin_colmap->num_my_elements();
        int* entries = bin_colmap->my_global_elements();
        bins.insert(entries, entries + nummyeles);
      }
      else  // ... or add an extra layer to the given bin distribution
      {
        std::map<int, std::set<int>>::const_iterator iter;
        for (iter = bin_to_row_ele_map.begin(); iter != bin_to_row_ele_map.end(); ++iter)
        {
          int binId = iter->first;
          // avoid getting two layer ghosting as this is not needed
          if (bin_rowmap != nullptr)
          {
            const int lid = bin_rowmap->lid(binId);
            if (lid < 0) continue;
          }
          std::vector<int> binvec;
          // get neighboring bins
          get_neighbor_and_own_bin_ids(binId, binvec);
          bins.insert(binvec.begin(), binvec.end());
        }
      }
    }
    // copy set to vector in order to broadcast data
    std::vector<int> binids(bins.begin(), bins.end());
    // first: proc i tells all procs how many bins it has
    int numbin = binids.size();
    Core::Communication::broadcast(&numbin, 1, iproc, comm_);
    // second: proc i tells all procs which bins it has
    binids.resize(numbin);
    Core::Communication::broadcast(binids.data(), numbin, iproc, comm_);
    // loop over all own bins and find requested ones, fill in elements in these bins
    std::map<int, std::set<int>> sdata;
    std::map<int, std::set<int>> rdata;
    for (int i = 0; i < numbin; ++i)
    {
      if (bin_to_row_ele_map_to_lookup_requests.find(binids[i]) !=
          bin_to_row_ele_map_to_lookup_requests.end())
        sdata[binids[i]].insert(bin_to_row_ele_map_to_lookup_requests[binids[i]].begin(),
            bin_to_row_ele_map_to_lookup_requests[binids[i]].end());
    }
    Core::LinAlg::gather<int>(sdata, rdata, 1, &iproc, comm_);
    // proc i has to store the received data
    if (iproc == myrank_)
    {
      ext_bin_to_ele_map = rdata;
    }
  }
  // reduce map of sets to one set and copy to a vector to create extended elecolmap
  std::set<int> coleleset;
  std::map<int, std::set<int>>::iterator iter;
  for (iter = ext_bin_to_ele_map.begin(); iter != ext_bin_to_ele_map.end(); ++iter)
    coleleset.insert(iter->second.begin(), iter->second.end());
  // insert standard ghosting
  if (ele_colmap_from_standardghosting != nullptr)
    for (int lid = 0; lid < ele_colmap_from_standardghosting->num_my_elements(); ++lid)
      coleleset.insert(ele_colmap_from_standardghosting->gid(lid));
  std::vector<int> colgids(coleleset.begin(), coleleset.end());
  // return extended elecolmap
  return std::make_shared<Core::LinAlg::Map>(-1, (int)colgids.size(), colgids.data(), 0, comm_);
}
I cannot figure out the exact logic and purpose of this function.
It looks like it can change the ownership of the elements listed in the original bin_to_row_ele_map, but the function's description and the context in which it is called say that this should not happen. It does happen, however, and I get the crash described above.
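To make the question concrete, here is a serial model of how the function reads for a single requesting rank. It is a sketch under assumptions, not the 4C implementation: `extended_col_gids` and `BinToEles` are hypothetical names, the broadcast and gather are replaced by a loop over per-rank lookup maps, and `std::set<int>` stands in for `Core::LinAlg::Map`.

```cpp
#include <map>
#include <set>
#include <vector>

using BinToEles = std::map<int, std::set<int>>;

// Serial model of one `iproc` iteration: the requesting rank announces a set
// of bins, every rank answers with the elements it stores for those bins in
// its lookup map, and the answers are merged into one column GID set.
std::set<int> extended_col_gids(const std::set<int>& requested_bins,
    const std::vector<BinToEles>& lookup_per_rank,
    const std::set<int>& standard_ghosting)
{
  std::set<int> col_gids;
  for (const BinToEles& lookup : lookup_per_rank)
  {
    for (int bin : requested_bins)
    {
      auto it = lookup.find(bin);
      if (it == lookup.end()) continue;  // this rank has nothing in that bin
      col_gids.insert(it->second.begin(), it->second.end());
    }
  }
  // Elements from the standard ghosting are always kept.
  col_gids.insert(standard_ghosting.begin(), standard_ghosting.end());
  return col_gids;
}
```

In this model, an element ends up in the result only if it sits in a requested bin on some rank or is part of the standard ghosting. If a rank owns an element whose bin is not among the bins it requests (for example because that bin is skipped by the `bin_rowmap` filter), the element would be missing from the returned map, which would match the observed error. Whether this is what actually happens in the failing run is still to be confirmed.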