
Failure at boot time after adding new member variable to WIBEthFrameProcessor #275

@xinyue-uoft

This is a potential bug that I discovered during development. The setup needed to reproduce it and the observed behaviour are described below:

Environment

I set up v5.4.0-rc1-a9 using instructions from https://github.com/DUNE-DAQ/daqconf/wiki/Setting-up-a-fddaq%E2%80%90v5.4.0-software-area.

I then cloned the develop branch of [fdreadoutlibs](https://github.com/DUNE-DAQ/fdreadoutlibs) into sourcecode; the only packages there are fdreadoutlibs and daqsystemtest.

Following the instructions, the command

drunc-unified-shell ssh-standalone config/daqsystemtest/example-configs.data.xml local-1x1-config ${USER}-local-test boot --no-override-logs wait 5 conf wait 3 start --run-number 101 enable-triggers wait 10 disable-triggers drain-dataflow stop-trigger-sources stop scrap terminate

runs without problems.

Recreation

Make the following modification to the private members of WIBEthFrameProcessor:

private:
  bool m_first_hit = true;
  bool m_tpg_metric_collect_enabled{false};
  uint32_t m_metric_collect_opmon_period { 128 };
  std::unique_ptr<tpglibs::TPGenerator> m_tp_generator;
  std::vector<std::pair<std::string, nlohmann::json>> m_tpg_configs;
  std::vector<std::pair<std::string, nlohmann::json>> new_variable_test; // This is a new variable to cause the bug
  uint32_t m_tp_max_width;
  std::set<unsigned int> m_channel_mask_set;
  uint16_t m_tpg_threshold_selected;

  std::map<uint, std::atomic<int>> m_tp_channel_rate_map;

This breaks the system. The ru-01 log shows:

2025-Aug-18 17:54:08,175 LOG [void dunedaq::iomanager::NetworkSenderModel<Datatype>::get_sender(const dunedaq::iomanager::Sender::timeout_t&) [with Datatype = dunedaq::dfmessages::TimeSync; dunedaq::iomanager::Sender::timeout_t = std::chrono::duration<long int, std::ratio<1, 1000> >] at /cvmfs/dunedaq-development.opensciencegrid.org/candidates/coredaq-v5.4.0-rc1-a9/spack-0.22.0/opt/spack/linux-almalinux9-x86_64/gcc-13.2.0/iomanager-v3.0.0-lllkrnhmsuitfq36n3n52y7m5no6rubt/include/iomanager/network/detail/NetworkSenderModel.hxx:88] Setting topic to TimeSync
malloc(): unaligned tcache chunk detected
bash: line 1: 4047540 Aborted                 (core dumped) daq_application -s xinyue-local-test -k local-1x1-config -n ru-01 -c rest://localhost:0 -d oksconflibs:config/daqsystemtest/example-configs.data.xml

Likewise, if we instead add a second map after m_tp_channel_rate_map:

  uint32_t m_tp_max_width;
  std::set<unsigned int> m_channel_mask_set;
  uint16_t m_tpg_threshold_selected;

  std::map<uint, std::atomic<int>> m_tp_channel_rate_map;
  std::map<uint, std::atomic<int>> m_new_variable_test; // This is a new variable to cause the bug

This causes the same malloc error.

Adding a new std::shared_ptr<uint32_t> member causes a failure at boot (ru-01 ends up in error). The error is a segfault, which is presumably a missing-initializer issue.

Other tests I have tried:
Adding a uint16_t or uint32_t member is okay.
std::shared_ptr<uint32_t> new_variable_test { nullptr }; is okay, but without the explicit nullptr initializer it breaks the system.
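For reference, in standard C++ a std::shared_ptr member with no initializer is default-constructed empty, so taken in isolation the two declarations above should behave the same. A minimal sketch of that check, using a hypothetical stand-in struct (`ProcessorStandIn` and the helper names are assumptions for illustration, not code from fdreadoutlibs):

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical stand-in for the class; only the two member styles are kept.
struct ProcessorStandIn
{
  std::shared_ptr<std::uint32_t> implicit_member;            // no initializer
  std::shared_ptr<std::uint32_t> explicit_member{ nullptr }; // explicit initializer
};

// Returns true when both members are empty after default construction.
inline bool both_members_are_null()
{
  ProcessorStandIn p;
  return p.implicit_member == nullptr && p.explicit_member == nullptr;
}

// Aborts (when assertions are enabled) if either member is not empty.
inline void check_default_construction()
{
  assert(both_members_are_null());
}
```

Calling check_default_construction() from a small test program should pass on a conforming compiler, which suggests the difference seen in the DAQ application comes from somewhere other than the initializer itself.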

Discussion

Since WIBEthFrameProcessor does not define an explicit constructor (it relies on the default one), adding another member of a type the class already uses, and never referencing it in the code, should not change behaviour. The next thing I will try is to get the size of WIBEthFrameProcessor from upstream and see how the size changes when these member variables are added.
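The size comparison could be sketched as follows. The two structs are hypothetical, cut-down stand-ins for the class before and after the change (`LayoutBefore`, `LayoutAfter`, and the helper are assumptions, and `int` replaces nlohmann::json to keep the example dependency-free). It only shows that the object grows and that members declared after the new one move to a higher offset; it does not establish that this is the cause of the crash.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Stand-in for the json-carrying config vector used in the real class.
using ConfigList = std::vector<std::pair<std::string, int>>;

// Hypothetical layout before the change.
struct LayoutBefore
{
  ConfigList m_tpg_configs;
  std::uint32_t m_tp_max_width;
};

// Hypothetical layout after inserting the extra, unused member.
struct LayoutAfter
{
  ConfigList m_tpg_configs;
  ConfigList new_variable_test; // the added member
  std::uint32_t m_tp_max_width;
};

// Byte offset of m_tp_max_width from the start of the object.
template<typename T>
std::ptrdiff_t offset_of_max_width(const T& obj)
{
  return reinterpret_cast<const char*>(&obj.m_tp_max_width) -
         reinterpret_cast<const char*>(&obj);
}

// Number of bytes by which the object grows.
constexpr std::size_t size_growth = sizeof(LayoutAfter) - sizeof(LayoutBefore);

// Aborts (when assertions are enabled) unless the later member moved up.
inline void check_layout_shift()
{
  assert(offset_of_max_width(LayoutAfter{}) > offset_of_max_width(LayoutBefore{}));
}
```

Printing sizeof(WIBEthFrameProcessor) from both the upstream build and the local build would give the equivalent numbers for the real class.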
