Description
This is a potential bug that I discovered during development. I am putting the recreation setup and behaviour here:
Environment
I set up v5.4.0-rc1-a9 using the instructions from https://github.com/DUNE-DAQ/daqconf/wiki/Setting-up-a-fddaq%E2%80%90v5.4.0-software-area.
I then cloned the develop branch of [fdreadoutlibs](https://github.com/DUNE-DAQ/fdreadoutlibs) into sourcecode; the only packages in there are fdreadoutlibs and daqsystemtest.
Following the instructions, the command
drunc-unified-shell ssh-standalone config/daqsystemtest/example-configs.data.xml local-1x1-config ${USER}-local-test boot --no-override-logs wait 5 conf wait 3 start --run-number 101 enable-triggers wait 10 disable-triggers drain-dataflow stop-trigger-sources stop scrap terminate
runs without problem.
Recreation
Make the following modification to the private members of WIBEthFrameProcessor:
private:
bool m_first_hit = true;
bool m_tpg_metric_collect_enabled{false};
uint32_t m_metric_collect_opmon_period { 128 };
std::unique_ptr<tpglibs::TPGenerator> m_tp_generator;
std::vector<std::pair<std::string, nlohmann::json>> m_tpg_configs;
std::vector<std::pair<std::string, nlohmann::json>> new_variable_test; // This is a new variable to cause the bug
uint32_t m_tp_max_width;
std::set<unsigned int> m_channel_mask_set;
uint16_t m_tpg_threshold_selected;
std::map<uint, std::atomic<int>> m_tp_channel_rate_map;
This breaks the system. From the ru-01 log, we see:
2025-Aug-18 17:54:08,175 LOG [void dunedaq::iomanager::NetworkSenderModel<Datatype>::get_sender(const dunedaq::iomanager::Sender::timeout_t&) [with Datatype = dunedaq::dfmessages::TimeSync; dunedaq::iomanager::Sender::timeout_t = std::chrono::duration<long int, std::ratio<1, 1000> >] at /cvmfs/dunedaq-development.opensciencegrid.org/candidates/coredaq-v5.4.0-rc1-a9/spack-0.22.0/opt/spack/linux-almalinux9-x86_64/gcc-13.2.0/iomanager-v3.0.0-lllkrnhmsuitfq36n3n52y7m5no6rubt/include/iomanager/network/detail/NetworkSenderModel.hxx:88] Setting topic to TimeSync
malloc(): unaligned tcache chunk detected
bash: line 1: 4047540 Aborted (core dumped) daq_application -s xinyue-local-test -k local-1x1-config -n ru-01 -c rest://localhost:0 -d oksconflibs:config/daqsystemtest/example-configs.data.xml
Likewise, if we instead add the new member at the end of the list:
uint32_t m_tp_max_width;
std::set<unsigned int> m_channel_mask_set;
uint16_t m_tpg_threshold_selected;
std::map<uint, std::atomic<int>> m_tp_channel_rate_map;
std::map<uint, std::atomic<int>> m_new_variable_test; // This is a new variable to cause the bug
This causes the same malloc error.
Adding a new
std::shared_ptr<uint32_t> member causes a failure to boot (ru-01 ends up in error); the error is a segfault. (This looks like a missing-initializer issue.)
Other tests I have tried:
Adding a uint16_t or uint32_t member is okay.
std::shared_ptr<uint32_t> new_variable_test { nullptr }; is okay, but if it is not explicitly initialized to nullptr, it breaks the system.
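For reference, a minimal sketch of why the last observation is surprising. In standard C++, a default-constructed std::shared_ptr is already empty, so the two declarations should behave the same when the object is constructed normally. The struct names ImplicitInit and ExplicitInit and the helper both_start_null are hypothetical and exist only for this illustration; they are not part of fdreadoutlibs.

```cpp
#include <cstdint>
#include <memory>

// Hypothetical stand-in: member declared without an initializer.
struct ImplicitInit {
  std::shared_ptr<uint32_t> new_variable_test;
};

// Hypothetical stand-in: member explicitly initialized to nullptr.
struct ExplicitInit {
  std::shared_ptr<uint32_t> new_variable_test { nullptr };
};

// Returns true when both variants start out as empty (null) pointers.
// The shared_ptr default constructor guarantees this for ImplicitInit.
inline bool both_start_null() {
  ImplicitInit a;
  ExplicitInit b;
  return a.new_variable_test == nullptr && a.new_variable_test.use_count() == 0 &&
         b.new_variable_test == nullptr && b.new_variable_test.use_count() == 0;
}
```

If both variants start out null in isolation, as this sketch expects, then the difference seen in the DAQ application may come from something other than the initializer itself, for example the object's memory not being constructed with the layout that the rest of the code assumes. That is an assumption, not something confirmed here.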
Discussion
Since WIBEthFrameProcessor does not define an explicit constructor (it relies on the default one), adding another member of a type that is already present, and that is never used in the code, should not change the behaviour. The next thing I will try is to get the size of WIBEthFrameProcessor from upstream and see how the size changes when these member variables are added.
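The size comparison could be sketched as follows. ProcessorBefore and ProcessorAfter are hypothetical stand-ins that copy only the tail of the member list shown above; std::string replaces nlohmann::json so the example needs only the standard library, and unsigned int replaces uint. The helper max_width_offset is also an assumption made for illustration. The real numbers have to come from sizeof(WIBEthFrameProcessor) in the actual build.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the member list before the change.
struct ProcessorBefore {
  std::vector<std::pair<std::string, std::string>> m_tpg_configs;
  uint32_t m_tp_max_width;
  std::set<unsigned int> m_channel_mask_set;
  uint16_t m_tpg_threshold_selected;
  std::map<unsigned int, std::atomic<int>> m_tp_channel_rate_map;
};

// Same members with the extra, unused vector inserted after m_tpg_configs.
struct ProcessorAfter {
  std::vector<std::pair<std::string, std::string>> m_tpg_configs;
  std::vector<std::pair<std::string, std::string>> new_variable_test;
  uint32_t m_tp_max_width;
  std::set<unsigned int> m_channel_mask_set;
  uint16_t m_tpg_threshold_selected;
  std::map<unsigned int, std::atomic<int>> m_tp_channel_rate_map;
};

// Byte offset of m_tp_max_width inside T, measured on a live object.
template <typename T>
std::ptrdiff_t max_width_offset() {
  T obj{};
  return reinterpret_cast<const char*>(&obj.m_tp_max_width) -
         reinterpret_cast<const char*>(&obj);
}

// The extra member makes the object larger and shifts every later member.
static_assert(sizeof(ProcessorAfter) > sizeof(ProcessorBefore),
              "adding a member must grow the object");
```

The point of the comparison: even an unused member changes both the object size and the offsets of the members that follow it. If some already-built code (for example a library compiled against the upstream header) still allocates or accesses the object with the old layout, writes can land outside the allocation, which is one possible way to end up with a heap-corruption abort like the malloc error above. Whether that is what happens here is an open question.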