Skip to content

update run marker to be on at start and off at stop-trigger-sources #10

Closed
wesketchum wants to merge 1 commit intopatch/fddaq-v5.3.xfrom
wketchum/modify_run_marker_location
Closed

update run marker to be on at start and off at stop-trigger-sources #10
wesketchum wants to merge 1 commit intopatch/fddaq-v5.3.xfrom
wketchum/modify_run_marker_location

Conversation

@wesketchum
Copy link
Collaborator

(… trying to avoid an infinite while loop)

This is in response to the issue raised here: DUNE-DAQ/daqsystemtest#218

@bieryAtFnal reports that this resolved the problem when he ran it (though @denizergonul and I were not able to reproduce the problem in our own tests).

@wesketchum
Copy link
Collaborator Author

To test:

source /cvmfs/dunedaq.opensciencegrid.org/setup_dunedaq.sh
setup_dbt latest_v5

dbtx-create-workarea-from-recipe.py --use-ref https://gist.githubusercontent.com/wesketchum/c64eb0dedccdf43219b86c41d68f7fcc/raw

cd testarea_wketchum_crt_test_fix*
source env.sh
dbt-build
dbt-workarea-env

drunc-unified-shell ssh-standalone config/daqsystemtest/example-configs.data.xml local-crt-grenoble-1x1-config ${USER}-local-test boot wait 5 conf wait 3 start --run-number 134 enable-triggers wait 30 disable-triggers drain-dataflow stop-trigger-sources stop scrap terminate

@denizergonul
Copy link
Contributor

Thank you for your efforts to solve the "stop-trigger-sources" transition issue. I must say I cannot see how the current implementation may be causing the issue and how this PR will fix it.

I used dpdklibs::DPDKReaderModule as a reference to write this part.

In DPDKReaderModule::do_configure:

if (!m_run_marker.load()) {
  set_running(true);
  TLOG() << "Starting iface wrappers.";
  for (auto& [iface_id, iface] : m_ifaces) {
    iface->start();
  }
} else {
  TLOG_DEBUG(5) << "iface wrappers are already running!";
}

We don’t have m_ifaces for CRT, but what the start function does is to schedule work (IfaceWrapper::rx_runner) for the RX threads. I believe this is similar to:

m_producer_thread.set_work(&CRTGrenobleReaderModule::run_produce, this);

RX threads then start reading data over DPDK. But when they arrive to the point where the handle_eth_payload call is, they will not make the call because IfaceWrapper::m_lcore_enable_flow is still false at this point. In other words, data taking has started, but data is not being handled yet.

Parallelly, this is CRTGrenobleReaderModule::do_conf:

// Configure HW interface?
if (!m_run_marker.load()) {
  set_running(true);
} else {
  TLOG_DEBUG(5) << "Already running!";
}  

We could put the call to set_work here, but we were not sure if this “start taking data but don’t handle it yet” logic is wanted for CRT.

Moving on to DPDKReaderModule::do_start:

for (auto& [iface_id, iface] : m_ifaces) {
  iface->enable_flow();
}

This is the trigger for data handling to start.

CRTGrenobleReaderModule::do_start:

enable_flow(); 
m_producer_thread.set_work(&CRTGrenobleReaderModule::run_produce, this);

As I mentioned, the logic is currently not being used. We enable the flow before scheduling the work anyway, so, currently for CRT, we start data taking and immediately handling. We can change this by moving set_work inside do_conf, if desired.

Until this point I covered "conf" and "start". I don’t see the point of duplicating set_running(true); at start as it is already being done at conf.

Moving on to DPDKReaderModule::do_stop:

for (auto& [iface_id, iface] : m_ifaces) {
  iface->disable_flow();
}

This stops data handling, but data taking will continue.

CRTGrenobleReaderModule::do_stop:

disable_flow();

Same for CRT. (Again, maybe we don’t even want this.)

Finally, DPDKReaderModule::do_scrap:

if (m_run_marker.load()) {
  TLOG() << "Raising stop through variables!";
  set_running(false);
  TLOG() << "Stopping iface wrappers.";
  for (auto& [iface_id, iface] : m_ifaces) {
    iface->stop();
  }
  ealutils::wait_for_lcores();
  TLOG() << "Stoppped DPDK lcore processors and internal threads...";
} else {
  TLOG_DEBUG(5) << "DPDK lcore processor is already stopped!";
}

What IfaceWrapper::stop does is to break the data taking while loop. For CRT, this corresponds to checking m_run_marker anyway (while (m_run_marker.load())), so, we don’t need to do anything else. Then ealutils::wait_for_lcores corresponds to CRT waiting for the thread to end.

CRTGrenobleReaderModule::do_scrap:

if (m_run_marker.load()) {
  TLOG() << "Raising stop through variables!";
  set_running(false);
  while (!m_producer_thread.get_readiness()) {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }
} else {
  TLOG_DEBUG(5) << "Already stopped!";
} 

Now we covered "stop" and "scrap". Similarly, I don’t see the point of duplicating set_running(false); at stop as it is already being done at scrap.

@jcfreeman2
Copy link
Contributor

As far as the problem of a timeout occurring on the stop triggers transition, I don't believe this PR actually solves it. Please note that in my experience, the problem only occurs roughly ~20% of the time which means we need to do multiple runs to figure out if things are working. However, I created a test build which uses this PR (CRTFD_DEV_250709_A9) (*) and I just created a work area on daq.fnal.gov based on it; after running

drunc-unified-shell ssh-standalone config/daqsystemtest/example-configs.data.xml local-crt-grenoble-1x1-config ${USER}-local-test boot --no-override-logs wait 5 conf wait 3 start --run-number 111 enable-triggers wait 10 disable-triggers drain-dataflow stop-trigger-sources stop scrap terminate

10 times in a loop, 3 of those times we get the timeout problem.

(*) https://github.com/DUNE-DAQ/daq-release/actions/runs/16180927746

@wesketchum
Copy link
Collaborator Author

Closing as it's resolved otherwise in a different PR.

@wesketchum wesketchum closed this Aug 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants