Conversation

@qiluo-msft (Collaborator) commented Nov 12, 2025

Why I did it

Rewrite Python version of supervisor-proc-exit-listener in Rust (supervisor-proc-exit-listener-rs)
Add unit tests, achieving 73.60% coverage (184/250 lines covered).
Change the executable name to supervisor-proc-exit-listener-rs.
However, the executable is not yet used in any of the docker containers; a follow-up PR (#23688) will switch each docker container over to it.

Before this PR:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  13062 root      20   0  130.3m  33.4m  19.0m S   0.0   0.2   0:00.23 /usr/bin/python3 /usr/local/bin/supervisor-proc-exit-listener --container-name swss
  15163 root      20   0  130.3m  32.1m  17.8m S   0.0   0.2   0:00.17 /usr/bin/python3 /usr/local/bin/supervisor-proc-exit-listener --container-name eventd
  13170 root      20   0  130.3m  32.1m  17.8m S   0.0   0.2   0:00.26 /usr/bin/python3 /usr/local/bin/supervisor-proc-exit-listener --container-name sysmgr
  20784 root      20   0  130.3m  32.1m  17.8m S   0.0   0.2   0:00.23 /usr/bin/python3 /usr/local/bin/supervisor-proc-exit-listener --container-name gnmi
  19288 root      20   0  130.3m  32.1m  17.7m S   0.0   0.2   0:00.17 /usr/bin/python3 /usr/local/bin/supervisor-proc-exit-listener --container-name bmp
  21357 root      20   0  129.3m  32.0m  17.8m S   0.0   0.2   0:00.12 /usr/bin/python3 /usr/local/bin/supervisor-proc-exit-listener --container-name snmp
  13286 root      20   0  130.3m  32.0m  17.7m S   0.0   0.2   0:00.27 /usr/bin/python3 /usr/local/bin/supervisor-proc-exit-listener --container-name teamd
  15647 root      20   0  129.3m  31.9m  17.7m S   0.0   0.2   0:00.17 /usr/bin/python3 /usr/local/bin/supervisor-proc-exit-listener --container-name bgp
  17187 root      20   0  130.2m  31.9m  17.6m S   0.0   0.2   0:00.15 /usr/bin/python3 /usr/local/bin/supervisor-proc-exit-listener --container-name syncd
   5191 root      20   0  129.3m  31.9m  17.6m S   0.0   0.2   0:00.20 /usr/bin/python3 /usr/local/bin/supervisor-proc-exit-listener --container-name database
  17893 root      20   0  129.3m  31.9m  17.6m S   0.0   0.2   0:00.13 /usr/bin/python3 /usr/local/bin/supervisor-proc-exit-listener --container-name pmon
  21026 root      20   0  129.3m  31.8m  17.6m S   0.0   0.2   0:00.13 /usr/bin/python3 /usr/local/bin/supervisor-proc-exit-listener --container-name lldp
  17195 root      20   0  129.3m  31.8m  17.6m S   0.0   0.2   0:00.12 /usr/bin/python3 /usr/local/bin/supervisor-proc-exit-listener --container-name radv
Memory calc sheet:

RES    SHR    RssAnon   All
33.4   19.0   14.4
32.1   17.8   14.3
32.1   17.8   14.3
32.1   17.8   14.3
32.1   17.7   14.4
32.0   17.8   14.2
32.0   17.7   14.3
31.9   17.7   14.2
31.9   17.6   14.3
31.9   17.6   14.3
31.9   17.6   14.3
31.8   17.6   14.2
31.8   17.6   14.2
Total         185.7     203.49
Avg    17.79

After this PR:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  20051 root      20   0   94.9m   8.4m   7.7m S   0.0   0.1   0:00.01 /usr/bin/supervisor-proc-exit-listener-rs --container-name bmp
  16240 root      20   0   94.9m   8.4m   7.7m S   0.0   0.1   0:00.01 /usr/bin/supervisor-proc-exit-listener-rs --container-name eventd
  17235 root      20   0   94.9m   8.3m   7.6m S   0.0   0.1   0:00.02 /usr/bin/supervisor-proc-exit-listener-rs --container-name syncd
  21093 root      20   0   94.9m   8.3m   7.7m S   0.0   0.1   0:00.01 /usr/bin/supervisor-proc-exit-listener-rs --container-name gnmi
  16207 root      20   0   94.9m   8.3m   7.6m S   0.0   0.1   0:00.01 /usr/bin/supervisor-proc-exit-listener-rs --container-name teamd
  16092 root      20   0   94.9m   8.3m   7.6m S   0.0   0.1   0:00.02 /usr/bin/supervisor-proc-exit-listener-rs --container-name sysmgr
  16288 root      20   0   94.9m   8.3m   7.6m S   0.0   0.1   0:00.01 /usr/bin/supervisor-proc-exit-listener-rs --container-name bgp
  21158 root      20   0   94.9m   8.3m   7.6m S   0.0   0.1   0:00.01 /usr/bin/supervisor-proc-exit-listener-rs --container-name lldp
  17281 root      20   0   94.9m   8.3m   7.6m S   0.0   0.1   0:00.01 /usr/bin/supervisor-proc-exit-listener-rs --container-name radv
  16093 root      20   0   94.9m   8.3m   7.6m S   0.0   0.1   0:00.02 /usr/bin/supervisor-proc-exit-listener-rs --container-name swss
  21452 root      20   0   94.9m   8.3m   7.6m S   0.0   0.1   0:00.01 /usr/bin/supervisor-proc-exit-listener-rs --container-name snmp
  18123 root      20   0   94.9m   8.2m   7.6m S   0.0   0.1   0:00.01 /usr/bin/supervisor-proc-exit-listener-rs --container-name pmon
   7246 root      20   0   94.9m   8.2m   7.5m S   0.0   0.1   0:00.04 /usr/bin/supervisor-proc-exit-listener-rs --container-name database

Memory calc sheet:

RES    SHR    RssAnon   All
8.4    7.7    0.7
8.4    7.7    0.7
8.3    7.6    0.7
8.3    7.7    0.6
8.3    7.6    0.7
8.3    7.6    0.7
8.3    7.6    0.7
8.3    7.6    0.7
8.3    7.6    0.7
8.3    7.6    0.7
8.3    7.6    0.7
8.2    7.6    0.6
8.2    7.5    0.7
Total         8.9       16.52
Avg    7.62

The total memory reduction is about 203.49 - 16.52 = 186.97 MB

Work item tracking
  • Microsoft ADO (number only):

How I did it

How to verify it

Which release branch to backport (provide reason below if selected)

  • 202205
  • 202211
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@mssonicbld
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@qiluo-msft qiluo-msft marked this pull request as ready for review November 13, 2025 21:27
@qiluo-msft qiluo-msft requested a review from xumia as a code owner November 13, 2025 21:27
Copilot AI review requested due to automatic review settings November 13, 2025 21:27
@qiluo-msft qiluo-msft requested a review from lguohan as a code owner November 13, 2025 21:27

[eventlistener:supervisor-proc-exit-listener]
-command=/usr/local/bin/supervisor-proc-exit-listener --container-name dhcp_relay
+command=/usr/bin/supervisor-proc-exit-listener-rs --container-name dhcp_relay

@saiarcot895 (Contributor) Nov 13, 2025

Why/how does test data need to be updated? #Resolved

Copilot finished reviewing on behalf of qiluo-msft November 13, 2025 21:30

Copilot AI left a comment

Pull Request Overview

This PR introduces a Rust implementation of the Python-based supervisor-proc-exit-listener utility, providing significant memory reduction (~187 MB total across all containers). The new binary is named supervisor-proc-exit-listener-rs and includes comprehensive unit tests achieving 73.60% coverage. The implementation maintains functional parity with the Python version while being performance-optimized.

Key Changes

  • Complete Rust rewrite of the supervisor process exit listener with matching functionality
  • New test suite with unit tests, integration tests, and mock implementations
  • Build infrastructure including Cargo configuration, Debian packaging, and Makefiles

Reviewed Changes

Copilot reviewed 19 out of 21 changed files in this pull request and generated 10 comments.

Summary per file:

  • src/sonic-supervisord-utilities-rs/tests/test_listener.rs: Comprehensive unit tests with mocking framework for the listener
  • src/sonic-supervisord-utilities-rs/tests/integration_tests.rs: Integration tests for end-to-end functionality
  • src/sonic-supervisord-utilities-rs/src/proc_exit_listener.rs: Core listener implementation with event handling and alerting
  • src/sonic-supervisord-utilities-rs/src/childutils.rs: Supervisor protocol parsing utilities
  • src/sonic-supervisord-utilities-rs/src/lib.rs: Library entry point and module exports
  • src/sonic-supervisord-utilities-rs/src/bin/supervisor_proc_exit_listener.rs: Binary entry point
  • src/sonic-supervisord-utilities-rs/Cargo.toml: Rust package configuration and dependencies
  • src/sonic-supervisord-utilities-rs/debian/*: Debian packaging configuration files
  • src/sonic-supervisord-utilities-rs/Makefile: Build automation and testing targets
  • rules/sonic-supervisord-utilities-rs.mk: SONiC build system integration
  • rules/docker-config-engine-bookworm.mk: Docker build dependency configuration
  • src/sonic-supervisord-utilities-rs/tests/test_data/config_db.json: Test fixture for ConfigDB data
  • src/sonic-supervisord-utilities-rs/tests/etc/supervisor/critical_processes: Test fixture for critical process configuration
  • src/sonic-supervisord-utilities-rs/tests/etc/supervisor/watchdog_processes: Test fixture for watchdog process configuration
  • src/sonic-supervisord-utilities-rs/tests/dev/stdin: Test fixture for supervisor protocol input


Comment on lines +14 to +20
fn create_test_config_db() -> ConfigDBConnector {
    // Try to create a ConfigDB connector for testing
    // In test environments this might fail, but we'll let the individual tests handle it
    ConfigDBConnector::new(false, None).unwrap_or_else(|_| {
        // If it fails, try with unix socket path
        ConfigDBConnector::new(true, None).expect("Failed to create test ConfigDB connector with any method")
    })

Copilot AI Nov 13, 2025

The create_test_config_db() function uses unwrap_or_else with expect, which will still panic if the unix socket path connection also fails. Consider returning a Result instead of panicking in a test helper function, allowing individual tests to handle unavailable ConfigDB appropriately:

fn create_test_config_db() -> Result<ConfigDBConnector, Box<dyn std::error::Error>> {
    ConfigDBConnector::new(false, None)
        .or_else(|_| ConfigDBConnector::new(true, None))
}
Suggested change:
-fn create_test_config_db() -> ConfigDBConnector {
-    // Try to create a ConfigDB connector for testing
-    // In test environments this might fail, but we'll let the individual tests handle it
-    ConfigDBConnector::new(false, None).unwrap_or_else(|_| {
-        // If it fails, try with unix socket path
-        ConfigDBConnector::new(true, None).expect("Failed to create test ConfigDB connector with any method")
-    })
+fn create_test_config_db() -> Result<ConfigDBConnector, Box<dyn std::error::Error>> {
+    // Try to create a ConfigDB connector for testing
+    // In test environments this might fail, but we'll let the individual tests handle it
+    ConfigDBConnector::new(false, None)
+        .or_else(|_| ConfigDBConnector::new(true, None))
+        .map_err(|e| Box::new(e) as Box<dyn std::error::Error>)

#WontFix
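
For illustration, a minimal sketch of how a test might consume the Result-returning variant above; the test name and the skip-on-unavailable behavior are hypothetical, not part of the PR.

#[test]
fn test_with_optional_config_db() {
    // Skip gracefully when no ConfigDB is reachable instead of panicking the whole test binary.
    let config_db = match create_test_config_db() {
        Ok(db) => db,
        Err(e) => {
            eprintln!("ConfigDB unavailable in this environment ({e}), skipping test");
            return;
        }
    };
    // ... exercise code paths that need `config_db` here ...
    let _ = config_db;
}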

@qiluo-msft (Collaborator, Author)

Panicking in this situation in a test case is okay; the test fails anyway.

Comment on lines 495 to 511
    // Test InjectorPP mocking of nix::sys::signal::kill function
    let mut injector = InjectorPP::new();

    // Mock the kill function to track calls
    injector
        .when_called(injectorpp::func!(fn (nix::sys::signal::kill)(nix::unistd::Pid, nix::sys::signal::Signal) -> Result<(), nix::errno::Errno>))
        .will_execute(injectorpp::fake!(
            func_type: fn(_pid: nix::unistd::Pid, _signal: nix::sys::signal::Signal) -> Result<(), nix::errno::Errno>,
            returns: Ok(())
        ));

    // Test that kill would be called with correct parameters
    let test_pid = nix::unistd::Pid::from_raw(1234);
    let result = nix::sys::signal::kill(test_pid, nix::sys::signal::Signal::SIGTERM);

    assert!(result.is_ok(), "Mocked kill should succeed");
}

Copilot AI Nov 13, 2025

The test test_injectorpp_kill_mocking_basic sets up a mock for nix::sys::signal::kill but doesn't actually verify that the mock was called or test any real functionality. This test only verifies the mock setup works. Consider either:

  1. Adding assertions to verify the mock behavior
  2. Removing this test if it's redundant with other tests that actually exercise the mocked functionality
  3. Adding a comment explaining this is a basic smoke test for the mocking framework #Resolved

fn test_critical_process_with_autorestart_disabled_no_kill() {
// Test that when a critical process exits and auto-restart is disabled, kill() is NOT called
let mut injector = InjectorPP::new();
let mut kill_call_count = 0;

Copilot AI Nov 13, 2025

Unused variable: kill_call_count is declared and initialized to 0 but never incremented or used. This appears to be leftover from an earlier implementation. Remove this unused variable.

Suggested change:
-let mut kill_call_count = 0;

#Resolved

Comment on lines 67 to 89
fn poll(&mut self, events: &mut mio::Events, _timeout: Option<std::time::Duration>) -> std::io::Result<()> {
    // We need to actually add an event to make the main loop detect stdin_ready = true
    // Since mio::Event is not easily constructible, we'll use a different approach

    // Create a temporary pipe just to generate a real mio event
    let (read_fd, write_fd) = nix::unistd::pipe()?;
    let mut source = mio::unix::SourceFd(&read_fd);

    // Write something to the pipe to make it readable
    nix::unistd::write(write_fd, &[1])?;
    nix::unistd::close(write_fd)?;

    // Create a temporary registry and poll to generate a real event
    let mut temp_poll = mio::Poll::new()?;
    temp_poll.registry().register(&mut source, mio::Token(0), mio::Interest::READABLE)?;

    // Poll the temporary poll to get real events
    temp_poll.poll(events, Some(std::time::Duration::from_millis(1)))?;

    // Clean up
    nix::unistd::close(read_fd)?;

    Ok(())

Copilot AI Nov 13, 2025

The MockPoller::poll implementation creates temporary file descriptors in each call but never closes them in the error paths. If temp_poll.registry().register() or temp_poll.poll() fails, the read_fd will leak. Consider wrapping the file descriptor cleanup in a RAII guard or ensuring cleanup happens in all code paths.

// Clean up should happen even on error
let _read_fd_guard = scopeguard::guard(read_fd, |fd| { let _ = nix::unistd::close(fd); });

#Resolved
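
A minimal sketch of the same RAII idea without the scopeguard dependency, assuming the nix version used above (pipe() and close() operating on RawFd); the FdGuard type is illustrative, not part of the PR.

// Illustrative only: close the pipe fds on every exit path, including early `?` returns.
struct FdGuard(std::os::unix::io::RawFd);

impl Drop for FdGuard {
    fn drop(&mut self) {
        // Best-effort close; errors are ignored during cleanup.
        let _ = nix::unistd::close(self.0);
    }
}

let (read_fd, write_fd) = nix::unistd::pipe()?;
let _read_guard = FdGuard(read_fd);
let _write_guard = FdGuard(write_fd);
// ... write to write_fd, register read_fd, and poll as before; no explicit close() calls needed ...

With the guards in place the fds are released even if register() or poll() returns early; closing write_fd at scope end instead of mid-function does not matter, since the byte already written keeps read_fd readable.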

Ok(())
}

fn register(&self, _stdin_fd: std::os::unix::io::RawFd) -> std::io::Result<()> {

Copilot AI Nov 13, 2025

The MockPoller::register method signature doesn't match the Poller trait. The trait expects fn register(&self, stdin_fd: RawFd, token: Token) but this implementation only has _stdin_fd parameter. Add the missing token parameter:

fn register(&self, _stdin_fd: std::os::unix::io::RawFd, _token: mio::Token) -> std::io::Result<()> {
Suggested change:
-fn register(&self, _stdin_fd: std::os::unix::io::RawFd) -> std::io::Result<()> {
+fn register(&self, _stdin_fd: std::os::unix::io::RawFd, _token: mio::Token) -> std::io::Result<()> {

#Resolved

* Rust implementation of SONiC supervisor process exit listener
* Performance-optimized port of Python sonic-supervisord-utilities

-- SONiC Team <[email protected]> Sat, 03 Aug 2025 00:00:00 -0800
\ No newline at end of file

Copilot AI Nov 13, 2025

The date format in the changelog is incorrect. August 3, 2025 is a Sunday, not a Saturday. The day should be Sun instead of Sat.

Suggested change:
- -- SONiC Team <[email protected]> Sat, 03 Aug 2025 00:00:00 -0800
+ -- SONiC Team <[email protected]> Sun, 03 Aug 2025 00:00:00 -0800

#Resolved

Comment on lines 608 to 610
let mut kill_called = false;
let mut kill_pid = nix::unistd::Pid::from_raw(0);
let mut kill_signal = nix::sys::signal::Signal::SIGTERM;

Copilot AI Nov 13, 2025

Unused variables: kill_called, kill_pid, and kill_signal are declared but never used in the test. These appear to be leftover from an earlier implementation. Remove these unused variables:

fn test_critical_process_with_autorestart_enabled_calls_kill() {
    let mut injector = InjectorPP::new();
    
    // Mock kill function to capture calls
    injector
        .when_called(...)
Suggested change:
-let mut kill_called = false;
-let mut kill_pid = nix::unistd::Pid::from_raw(0);
-let mut kill_signal = nix::sys::signal::Signal::SIGTERM;

#Resolved

use nix::sys::signal::{self, Signal};
use nix::unistd::getppid;
use std::collections::HashMap;
use std::collections::HashMap as StdHashMap;

Copilot AI Nov 13, 2025

Duplicate import: HashMap is imported twice on lines 10 and 11. Remove the redundant import on line 11:

use std::collections::HashMap;
Suggested change:
-use std::collections::HashMap as StdHashMap;

#Resolved

Comment on lines 273 to 278
/// Get current monotonic time as seconds since an arbitrary epoch - helper function
pub fn get_current_time() -> f64 {
    static START_TIME: std::sync::OnceLock<Instant> = std::sync::OnceLock::new();
    let start = START_TIME.get_or_init(|| Instant::now());
    start.elapsed().as_secs_f64()
}

Copilot AI Nov 13, 2025

The function comment says "helper function" but this is a public API (pub fn). Consider adding proper documentation with /// doc comments explaining the return value and any important behavior (like the fact that it returns time since an arbitrary epoch, not wall clock time). #Resolved
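
One possible wording for such a doc comment, reusing the body quoted above; the exact phrasing is an editorial suggestion, not part of the PR.

/// Returns the number of seconds elapsed since the first call to this function.
///
/// This is a monotonic duration measured from an arbitrary, process-local epoch
/// (the first invocation), not wall-clock time; it is only meaningful for
/// computing differences between two calls within the same process.
pub fn get_current_time() -> f64 {
    static START_TIME: std::sync::OnceLock<Instant> = std::sync::OnceLock::new();
    let start = START_TIME.get_or_init(|| Instant::now());
    start.elapsed().as_secs_f64()
}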

Comment on lines +494 to +502
if elapsed_secs >= ALERTING_INTERVAL_SECS as f64 {
    let elapsed_mins = (elapsed_secs / 60.0) as u64;
    process_info.insert("last_alerted".to_string(), current_time);
    let current_dead_minutes = process_info.get("dead_minutes").unwrap_or(&0.0);
    let new_dead_minutes = current_dead_minutes + elapsed_mins as f64;
    process_info.insert("dead_minutes".to_string(), new_dead_minutes);

    generate_alerting_message(process_name, "not running", new_dead_minutes as u64, Severity::LOG_ERR);
}

Copilot AI Nov 13, 2025

The elapsed time calculation elapsed_secs >= ALERTING_INTERVAL_SECS as f64 could result in precision issues due to float comparison. Consider using a small epsilon for comparison or ensuring the interval is slightly less than the actual threshold to avoid edge cases where alerts might be skipped due to floating-point precision. #Resolved

@qiluo-msft (Collaborator, Author)

Precision issues do not matter in this use case.

@mssonicbld
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Copilot AI review requested due to automatic review settings November 14, 2025 01:59
@mssonicbld
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Copilot finished reviewing on behalf of qiluo-msft November 14, 2025 02:00

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 18 out of 20 changed files in this pull request and generated 10 comments.



.when_called(injectorpp::func!(fn (sonic_supervisord_utilities_rs::proc_exit_listener::get_current_time)() -> f64))
.will_execute(injectorpp::fake!(
func_type: fn() -> f64,
returns: get_mock_time() // This will be called each time and return progressive time

Copilot AI Nov 14, 2025

The get_mock_time() function is called during test setup (at line 284), but injectorpp mocking evaluates the return value once at setup time, not on each call. This means all calls to get_current_time() will return the same value rather than progressive times. The comment incorrectly states 'This will be called each time and return progressive time'. Either the mocking approach needs to be changed to support progressive time, or the comment should be corrected.

Suggested change:
-returns: get_mock_time() // This will be called each time and return progressive time
+calls: || get_mock_time() // This closure will be called each time and return progressive time

#Resolved

@qiluo-msft (Collaborator, Author)

Invalid comment. The output of cargo test -- --nocapture shows progressive time:

READY
RESULT 2
OKREADY
DEBUG get_mock_time(): call #1, start_time=1609459200, counter=0, result=1609459200
DEBUG get_mock_time(): call #2, start_time=1609459200, counter=1, result=1609462800
RESULT 2
OKREADY
DEBUG get_mock_time(): call #3, start_time=1609459200, counter=2, result=1609466400
DEBUG get_mock_time(): call #4, start_time=1609459200, counter=3, result=1609470000
RESULT 2
OKREADY
DEBUG get_mock_time(): call #5, start_time=1609459200, counter=4, result=1609473600
✓ Main function processed stdin successfully
test tests::test_main_swss_success ... ok

.when_called(injectorpp::func!(fn (nix::sys::signal::kill)(nix::unistd::Pid, nix::sys::signal::Signal) -> Result<(), nix::errno::Errno>))
.will_execute(injectorpp::fake!(
func_type: fn(_pid: nix::unistd::Pid, _signal: nix::sys::signal::Signal) -> Result<(), nix::errno::Errno>,
returns: panic!("kill() should NOT be called for snmp container with auto-restart disabled")

Copilot AI Nov 14, 2025

Using panic! in mock return values is unconventional. If kill() is called, the panic message won't be displayed clearly. Consider using a flag or counter to track whether kill() was called, then assert after the test completes. For example: let kill_called = Arc::new(AtomicBool::new(false)); and check its value after the test.
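
A framework-agnostic sketch of that flag-then-assert pattern (not tied to injectorpp's macro syntax; fake_kill and KILL_CALLED are illustrative names, not from the PR):

use std::sync::atomic::{AtomicBool, Ordering};

static KILL_CALLED: AtomicBool = AtomicBool::new(false);

// Fake used in place of nix::sys::signal::kill: record the call instead of panicking.
fn fake_kill(_pid: nix::unistd::Pid, _signal: nix::sys::signal::Signal) -> Result<(), nix::errno::Errno> {
    KILL_CALLED.store(true, Ordering::SeqCst);
    Ok(())
}

// ... run the listener through the snmp scenario with the fake installed, then assert on the flag,
// which produces a clear failure message:
// assert!(!KILL_CALLED.load(Ordering::SeqCst),
//         "kill() should NOT be called for snmp container with auto-restart disabled");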

.when_called(injectorpp::func!(fn (nix::sys::signal::kill)(nix::unistd::Pid, nix::sys::signal::Signal) -> Result<(), nix::errno::Errno>))
.will_execute(injectorpp::fake!(
func_type: fn(_pid: nix::unistd::Pid, _signal: nix::sys::signal::Signal) -> Result<(), nix::errno::Errno>,
returns: panic!("kill() should not be called when auto-restart is disabled")

Copilot AI Nov 14, 2025

Same issue as in test_main_snmp - using panic! in mock return values is not ideal. Consider using a flag or counter to verify kill() was not called, then assert after the test completes.

Comment on lines +494 to +500
    let elapsed_mins = (elapsed_secs / 60.0) as u64;
    process_info.insert("last_alerted".to_string(), current_time);
    let current_dead_minutes = process_info.get("dead_minutes").unwrap_or(&0.0);
    let new_dead_minutes = current_dead_minutes + elapsed_mins as f64;
    process_info.insert("dead_minutes".to_string(), new_dead_minutes);

    generate_alerting_message(process_name, "not running", new_dead_minutes as u64, Severity::LOG_ERR);

Copilot AI Nov 14, 2025

The dead_minutes calculation is incorrect. It should track the total time the process has been dead, not accumulate elapsed intervals. This line adds the current elapsed_mins to the previous dead_minutes, causing exponential growth. It should be: let new_dead_minutes = (current_time - first_dead_time) / 60.0; where first_dead_time is the time when the process was first added to process_under_alerting.

Suggested change:
-let elapsed_mins = (elapsed_secs / 60.0) as u64;
-process_info.insert("last_alerted".to_string(), current_time);
-let current_dead_minutes = process_info.get("dead_minutes").unwrap_or(&0.0);
-let new_dead_minutes = current_dead_minutes + elapsed_mins as f64;
-process_info.insert("dead_minutes".to_string(), new_dead_minutes);
-generate_alerting_message(process_name, "not running", new_dead_minutes as u64, Severity::LOG_ERR);
+process_info.insert("last_alerted".to_string(), current_time);
+// Calculate dead_minutes based on first_dead_time
+if let Some(&first_dead_time) = process_info.get("first_dead_time") {
+    let new_dead_minutes = (current_time - first_dead_time) / 60.0;
+    process_info.insert("dead_minutes".to_string(), new_dead_minutes);
+    generate_alerting_message(process_name, "not running", new_dead_minutes as u64, Severity::LOG_ERR);
+}

#WontFix

@qiluo-msft (Collaborator, Author)

Not a valid comment. It mimics the behavior of the Python version. The dead_minutes calculation is not super accurate, but good enough.
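
For illustration only (not from the PR), a tiny simulation of the accumulation scheme, assuming a 60-second alerting interval: each pass adds the whole minutes elapsed since the last alert, so dead_minutes grows roughly linearly with how long the process has been down.

// Process dies at t = 0; the alert check fires at these times (seconds since death).
let checks = [61.0_f64, 122.0, 183.0];
let mut last_alerted = 0.0_f64;
let mut dead_minutes = 0.0_f64;
for t in checks {
    let elapsed_mins = ((t - last_alerted) / 60.0) as u64; // 1 on each pass
    dead_minutes += elapsed_mins as f64;
    last_alerted = t;
}
assert_eq!(dead_minutes, 3.0); // ~3 minutes dead after ~183 s: close enough for alerting purposes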

pub use_unix_socket_path: bool,
}

/// Read the critical processes/group names

Copilot AI Nov 14, 2025

The function documentation should describe what the return tuple represents. Consider: /// Returns (group_list, process_list) where group_list contains group names and process_list contains process names from the config file.

Suggested change
/// Read the critical processes/group names
/// Read the critical processes/group names.
///
/// Returns (group_list, process_list) where `group_list` contains group names and `process_list` contains process names from the config file.

ALERTING_INTERVAL_SECS as f64
}

/// Publish events

Copilot AI Nov 14, 2025

Missing documentation for this public function. Should describe what events are published and for what purpose.

Suggested change
/// Publish events
/// Publishes a process exit event to the event publisher.
///
/// This function sends an event with the tag specified by `EVENTS_PUBLISHER_TAG` to the provided
/// `events_handle`. The event includes the process name and container name as parameters.
///
/// # Parameters
/// - `events_handle`: The event publisher handle used to send the event.
/// - `process_name`: The name of the process for which the event is being published.
/// - `container_name`: The name of the container in which the process is running.
///
/// # Events Published
/// Publishes an event tagged with `EVENTS_PUBLISHER_TAG` containing the process and container names.
/// This is typically used to notify other components or systems that a process has exited or requires
/// attention, enabling monitoring or alerting mechanisms to respond accordingly.

Comment on lines +488 to +491
let time1 = get_mock_time(); // Should return start_time (1609459200.0)
let time2 = get_mock_time(); // Should return start_time + 3600.0
let time3 = get_mock_time(); // Should return start_time + 7200.0
let time4 = get_mock_time(); // Should return start_time + 10800.0

Copilot AI Nov 14, 2025

Magic numbers 3600.0, 7200.0, 10800.0 should be defined as constants or computed as multiples of a base interval. For example: const HOUR_SECS: f64 = 3600.0; and then use start_time + HOUR_SECS, start_time + 2.0 * HOUR_SECS, etc.

Ok(is_auto_restart.unwrap().clone())
}

/// Load heartbeat alert intervals from ConfigDB

Copilot AI Nov 14, 2025

Missing documentation for this public function. Should describe what heartbeat alert intervals are loaded and from where.

Suggested change
/// Load heartbeat alert intervals from ConfigDB
/// Loads heartbeat alert intervals for monitored processes from the ConfigDB.
///
/// This function retrieves the `HEARTBEAT` table from the ConfigDB, which contains
/// per-process configuration for heartbeat monitoring. For each process entry,
/// it reads the `alert_interval` field (in milliseconds), converts it to seconds,
/// and stores the value in a global mapping used for alerting logic.
///
/// The function sets a global flag to indicate that heartbeat alert intervals
/// have been initialized. If the `HEARTBEAT` table is empty or missing, no intervals
/// are loaded and the mapping remains unchanged.
///
/// # Arguments
///
/// * `config_db` - A reference to an object implementing `ConfigDBTrait` used to access ConfigDB.
///
/// # Errors
///
/// Returns an error if the `HEARTBEAT` table cannot be retrieved or if the initialization flag cannot be set.

if let Some(&last_alerted) = process_info.get("last_alerted") {
let elapsed_secs = current_time - last_alerted;
if elapsed_secs >= ALERTING_INTERVAL_SECS as f64 {
let elapsed_mins = (elapsed_secs / 60.0) as u64;

Copilot AI Nov 14, 2025

The constant 60.0 (seconds per minute) appears multiple times in the code. Consider defining it as a named constant: const SECS_PER_MINUTE: f64 = 60.0;

/// Get progressive time
/// Each call returns start_time + (call_count * 3600) seconds (1 hour increment)
pub fn get_mock_time() -> f64 {
let start_time = *TIME_MOCK_START.get().unwrap_or(&1609459200.0); // Default: 2021-01-01 00:00:00 UTC

Copilot AI Nov 14, 2025

The magic number 1609459200.0 (Unix timestamp) should be defined as a named constant with a descriptive comment. For example: const DEFAULT_TEST_START_TIME: f64 = 1609459200.0; // 2021-01-01 00:00:00 UTC

@saiarcot895 (Contributor) Nov 14, 2025

Please create the .dep file for this as well. #Resolved

Maintainer: SONiC Team <[email protected]>
Section: net
Priority: optional
Build-Depends: debhelper (>= 12)

@saiarcot895 (Contributor) Nov 14, 2025

Suggested change:
-Build-Depends: debhelper (>= 12)
+Build-Depends: debhelper-compat (= 13)

#Resolved

@saiarcot895 (Contributor) Nov 14, 2025

If doing the debhelper-compat change, please remove this file to avoid conflicts. #Resolved

@mssonicbld
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).
