Monitoring for FragmentAggregatorModule #418

Merged
MRiganSUSX merged 4 commits into patch/fddaq-v5.3.x from mrigan/fragment_agg_monitoring
Jun 11, 2025

@MRiganSUSX MRiganSUSX commented Jun 10, 2025

Upon request from @mroda88, here is an initial version of opmon monitoring for FragmentAggregatorModule.

This includes:

  • removing old / obsolete portions of opmon vars
  • adding opmon metrics for data requests (received, sent, failed)
  • adding opmon metrics for fragments (received, sent, failed)
  • additional error checking for the fragment error bits (empty, incomplete, invalid window)
  • adding opmon metrics for processing time (avg, min, max) for both data requests and fragments
  • associated protobuf messages
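
A minimal sketch of how the error-bit checks could feed the fragment counters. The bit positions, the `ErrorBit` enum, the `FragmentCounters` struct and `classify_fragment()` are hypothetical names chosen for illustration; the actual layout comes from the fragment header definition and is not shown in this PR description.

```cpp
#include <atomic>
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical bit positions, for illustration only.
enum class ErrorBit : std::size_t { kEmpty = 0, kIncomplete = 1, kInvalidWindow = 2 };

// Hypothetical counter bundle mirroring the metric names in the description.
struct FragmentCounters {
  std::atomic<uint64_t> received{0};
  std::atomic<uint64_t> empty{0};
  std::atomic<uint64_t> incomplete{0};
  std::atomic<uint64_t> invalid{0};
};

// Inspect the 32-bit error word of one fragment and bump the matching counters.
// Returns true when none of the three tracked bits is set.
inline bool classify_fragment(uint32_t error_bits, FragmentCounters& c) {
  const std::bitset<32> bits(error_bits);
  c.received.fetch_add(1, std::memory_order_relaxed);
  bool clean = true;
  if (bits.test(static_cast<std::size_t>(ErrorBit::kEmpty))) {
    c.empty.fetch_add(1, std::memory_order_relaxed);
    clean = false;
  }
  if (bits.test(static_cast<std::size_t>(ErrorBit::kIncomplete))) {
    c.incomplete.fetch_add(1, std::memory_order_relaxed);
    clean = false;
  }
  if (bits.test(static_cast<std::size_t>(ErrorBit::kInvalidWindow))) {
    c.invalid.fetch_add(1, std::memory_order_relaxed);
    clean = false;
  }
  return clean;
}
```

Each bit is tested independently, so a fragment with several bits set is counted once per bit but only once as received.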

The failure counters are set up to be cumulative (no resets).
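
To illustrate the difference between a cumulative counter and one that restarts at each publish, here is a rough sketch. It assumes the non-failure counters are reset on every publish, which the description implies but does not state; `RequestCounters`, `RequestSnapshot` and `publish()` are hypothetical names, not the module's API.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical counters for data requests.
struct RequestCounters {
  std::atomic<uint64_t> received{0};  // assumed to restart at each publish
  std::atomic<uint64_t> failed{0};    // cumulative, never reset
};

// Plain values handed to the publisher.
struct RequestSnapshot {
  uint64_t received;
  uint64_t failed;
};

// Take a snapshot for publishing.
// exchange(0) reads and clears in one atomic step, so no increment is lost
// between the read and the reset; load() leaves the cumulative value intact.
inline RequestSnapshot publish(RequestCounters& c) {
  RequestSnapshot s;
  s.received = c.received.exchange(0);
  s.failed = c.failed.load();
  return s;
}
```

With cumulative failure counters, a dashboard can plot the difference between consecutive samples to get a rate, and a missed sample does not lose any failures.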

Testing:

  • passes unit tests
  • passes daqsystemtest_integtest_bundle
  • local:
    • Able to see the relevant metrics being published and the counters changing over time:
      • Data Requests processing times:

        {
         "time": "2025-06-11T09:48:20.972742681Z",
         "origin": {
          "session": "mrigan-local-test",
          "application": "ru-01",
          "substructure": [
           "fragmentaggregator-ru-01"
          ]
         },
         "custom_origin": {
          "data": "DataRequest"
         },
         "measurement": "dunedaq.dfmodules.opmon.FragmentAggregatorTimeInfo",
         "data": {
          "average_us": {
           "uint8_value": "10"
          },
          "min_us": {
           "uint8_value": "3"
          },
          "max_us": {
           "uint8_value": "218"
          }
         }
        }
        
      • Fragments processing times:

        {
         "time": "2025-06-11T09:48:20.972992235Z",
         "origin": {
          "session": "mrigan-local-test",
          "application": "ru-01",
          "substructure": [
           "fragmentaggregator-ru-01"
          ]
         },
         "custom_origin": {
          "data": "Fragment"
         },
         "measurement": "dunedaq.dfmodules.opmon.FragmentAggregatorTimeInfo",
         "data": {
          "average_us": {
           "uint8_value": "29"
          },
          "max_us": {
           "uint8_value": "169"
          },
          "min_us": {
           "uint8_value": "2"
          }
         }
        }
        
      • Fragments counters:

        {
         "time": "2025-06-11T09:48:20.973457987Z",
         "origin": {
          "session": "mrigan-local-test",
          "application": "ru-01",
          "substructure": [
           "fragmentaggregator-ru-01"
          ]
         },
         "measurement": "dunedaq.dfmodules.opmon.FAFragmentsCounterInfo",
         "data": {
          "fragments_empty": {
           "uint8_value": "4"
          },
          "fragments_failed": {
           "uint8_value": "0"
          },
          "fragments_processed": {
           "uint8_value": "231"
          },
          "fragments_invalid": {
           "uint8_value": "0"
          },
          "fragments_received": {
           "uint8_value": "231"
          },
          "fragments_incomplete": {
           "uint8_value": "0"
          }
         }
        }
        
      • Data requests counters:

        {
         "time": "2025-06-11T09:48:51.078466196Z",
         "origin": {
          "session": "mrigan-local-test",
          "application": "ru-01",
          "substructure": [
           "fragmentaggregator-ru-01"
          ]
         },
         "measurement": "dunedaq.dfmodules.opmon.FADataRequestsCounterInfo",
         "data": {
          "data_requests_failed": {
           "uint8_value": "0"
          },
          "data_requests_processed": {
           "uint8_value": "301"
          },
          "data_requests_received": {
           "uint8_value": "301"
          }
         }
        }
        

@MRiganSUSX MRiganSUSX self-assigned this Jun 11, 2025
@MRiganSUSX MRiganSUSX requested review from Copilot and mroda88 June 11, 2025 10:02

Copilot AI left a comment

Pull Request Overview

Adds operational monitoring metrics to the FragmentAggregatorModule, including counters for data requests and fragments, error‐bit tracking, timing statistics, and updated protobuf definitions.

  • Removes deprecated opmon methods and variables
  • Introduces atomic counters and timing logic in header and implementation
  • Defines new protobuf messages for counters and timing metrics

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| schema/dfmodules/opmon/FragmentAggregatorModule.proto | Adds counter and timing message definitions |
| plugins/FragmentAggregatorModule.hpp | Removes old stats, declares new atomic metrics and generate_opmon_data() |
| plugins/FragmentAggregatorModule.cpp | Implements metric updates, error‐bit checks, timing, and publishing logic |
Comments suppressed due to low confidence (3)

plugins/FragmentAggregatorModule.hpp:95

  • [nitpick] The field m_fragments_time_average_us actually accumulates total time, not the average. Renaming it to m_fragments_time_total_us would more accurately reflect its purpose.
std::atomic<metric_counter_type> m_fragments_time_average_us{ 0 };
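
For reference, a sketch of the pattern the nitpick describes: accumulate a total and a sample count, and derive the average only when publishing. `TimeStats` and its members are hypothetical names, not the fields declared in the header, and the min/max handling is one possible approach rather than what the PR does.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical timing accumulator; times are in microseconds.
struct TimeStats {
  std::atomic<uint64_t> total_us{0};
  std::atomic<uint64_t> count{0};
  std::atomic<uint64_t> min_us{std::numeric_limits<uint64_t>::max()};
  std::atomic<uint64_t> max_us{0};

  // Record one measurement.
  void record(uint64_t elapsed_us) {
    total_us.fetch_add(elapsed_us);
    count.fetch_add(1);
    // compare_exchange_weak refreshes `cur` on failure, so each loop
    // re-checks against the latest stored value before retrying.
    uint64_t cur = min_us.load();
    while (elapsed_us < cur && !min_us.compare_exchange_weak(cur, elapsed_us)) {
    }
    cur = max_us.load();
    while (elapsed_us > cur && !max_us.compare_exchange_weak(cur, elapsed_us)) {
    }
  }

  // Average is computed on demand; returns 0 when nothing was recorded.
  uint64_t average_us() const {
    const uint64_t n = count.load();
    return n == 0 ? 0 : total_us.load() / n;
  }
};
```

Naming the accumulator `total_us` keeps the stored quantity and the published `average_us` clearly separate.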

plugins/FragmentAggregatorModule.cpp:69

  • Consider adding unit tests for generate_opmon_data() to validate that counters reset correctly and timing metrics publish expected values under various scenarios.
void
FragmentAggregatorModule::generate_opmon_data()

plugins/FragmentAggregatorModule.cpp:160

  • Storing the start timestamp in a shared member risks race conditions if methods run concurrently. Consider using a local variable instead of m_timestamp_before_dr.
m_timestamp_before_dr = get_current_time_us();
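
A sketch of the local-variable alternative suggested here, assuming the elapsed time is added to an atomic accumulator. `now_us()` is a hypothetical stand-in for `get_current_time_us()`, and `timed_call()` is an illustrative helper, not part of the module.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for get_current_time_us(), based on a monotonic clock.
inline uint64_t now_us() {
  using namespace std::chrono;
  return static_cast<uint64_t>(
      duration_cast<microseconds>(steady_clock::now().time_since_epoch()).count());
}

// Run `work`, measure how long it took, and add the elapsed time to `total_us`.
// The start timestamp lives on the caller's stack, so concurrent calls cannot
// overwrite each other's start time the way a shared member could.
template <typename Work>
uint64_t timed_call(std::atomic<uint64_t>& total_us, Work&& work) {
  const uint64_t start = now_us();
  work();
  const uint64_t elapsed = now_us() - start;
  total_us.fetch_add(elapsed, std::memory_order_relaxed);
  return elapsed;
}
```

If the callbacks are only ever invoked from a single thread, the shared member is safe as written; the local variable simply removes the need to reason about that.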

@mroda88 mroda88 left a comment

Tested at EHN1; dashboards have been created.

@MRiganSUSX MRiganSUSX merged commit 4be3d51 into patch/fddaq-v5.3.x Jun 11, 2025
@MRiganSUSX MRiganSUSX deleted the mrigan/fragment_agg_monitoring branch June 11, 2025 21:46
@eflumerf eflumerf restored the mrigan/fragment_agg_monitoring branch June 13, 2025 16:37
@eflumerf eflumerf deleted the mrigan/fragment_agg_monitoring branch June 13, 2025 16:37
4 participants