Skip to content

Conversation

@tobz
Copy link
Member

@tobz tobz commented Feb 1, 2026

Summary

This PR updates Supervisable::initialize to be async, allowing a wider number of operations to be undertaken as part of initializing the child process future.

As part of working up towards being able to supervisor-ify the various tasks we spawn, we need one additional chunk of functionality: the ability to run async code. For many tasks, we do some amount of async work to prepare before actually spawning the task itself, such as listening on sockets, reading files on disk, making queries to remote services, and more. These operations are usually fundamental, and we can't build/spawn the actual task unless they complete successfully, and so they need to be run as part of an initialization phase: if we fail to do them, we don't just try again, we intentionally give up/bail out.

This PR adds the ability for us to do this sort of asynchronous work when initializing a child process future by updating the Supervisable trait to have async fn initialize(...). In tandem, we now also lift up initialization errors as their own thing, so we can specifically fail fast when they occur. This means that while normal process failures (runtime errors during steady state) are still subject to the normal restart strategy behavior of supervision, initialization errors cause a supervisor to immediately fail.

Change Type

  • Bug fix
  • New feature
  • Non-functional (chore, refactoring, docs)
  • Performance

How did you test this PR?

Existing unit tests.

References

AGTMETRICS-393

Copilot AI review requested due to automatic review settings February 1, 2026 21:32
@dd-octo-sts dd-octo-sts bot added the area/core Core functionality, event model, etc. label Feb 1, 2026
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR enhances the supervisor system by making the Supervisable::initialize method asynchronous, allowing processes to perform async initialization (e.g., binding ports, establishing connections) before running.

Changes:

  • Introduced InitializationError type to distinguish initialization failures from runtime failures
  • Changed Supervisable::initialize from synchronous to async, returning Result<SupervisorFuture, InitializationError> instead of Option<SupervisorFuture>
  • Updated supervisor internals to await initialization and propagate initialization errors immediately without restart logic

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
lib/saluki-core/src/runtime/supervisor.rs Added InitializationError enum, made Supervisable::initialize async, updated supervisor spawn methods to await initialization
lib/saluki-core/src/runtime/mod.rs Exported InitializationError from supervisor module
lib/saluki-core/examples/basic_supervisor.rs Updated MockWorker to implement async initialize with example initialization pattern

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@pr-commenter
Copy link

pr-commenter bot commented Feb 1, 2026

Binary Size Analysis (Agent Data Plane)

Target: 9fe09ef (baseline) vs 850dc8d (comparison) diff
Analysis Type: Stripped binaries (debug symbols excluded)
Baseline Size: 26.96 MiB
Comparison Size: 26.98 MiB
Size Change: +25.34 KiB (+0.09%)
Pass/Fail Threshold: +5%
Result: PASSED ✅

Changes by Module

Module File Size Symbols
tokio +19.15 KiB 2709
core +6.78 KiB 11100
[Unmapped] -2.59 KiB 1
alloc +1.88 KiB 1216
[sections] +1.09 KiB 7
saluki_env::workload::helpers -698 B 60
std +386 B 285
saluki_app::api::APIBuilder +370 B 4
saluki_common::task::instrument -361 B 86
agent_data_plane::cli::dogstatsd -301 B 34
agent_data_plane::cli::debug +301 B 92
agent_data_plane::internal::remote_agent -244 B 62
miniz_oxide -227 B 1
futures_util -208 B 86
tonic -161 B 356
saluki_components::common::datadog +160 B 326
tokio_util -158 B 38
hyper +130 B 594
saluki_components::forwarders::otlp +120 B 60
agent_data_plane::main -120 B 2

Detailed Symbol Changes

    FILE SIZE        VM SIZE    
 --------------  -------------- 
  [NEW] +1.79Mi  [NEW] +1.79Mi    std::thread::local::LocalKey<T>::with::h938f6ddf7aedbbb8
  [NEW]  +113Ki  [NEW]  +113Ki    agent_data_plane::cli::run::create_topology::_{{closure}}::hfb0b645d0bb8c8cf
  [NEW] +84.5Ki  [NEW] +84.4Ki    agent_data_plane::internal::control_plane::spawn_control_plane::_{{closure}}::h6a61f8e463394a17
  [NEW] +68.2Ki  [NEW] +68.1Ki    h2::hpack::decoder::Decoder::try_decode_string::hd1c9a39c48e78a6e
  [NEW] +63.7Ki  [NEW] +63.6Ki    saluki_components::common::datadog::io::run_endpoint_io_loop::_{{closure}}::hd35493454b2aefa0
  [NEW] +62.3Ki  [NEW] +62.2Ki    agent_data_plane::main::_{{closure}}::hd35953e90113f5c5
  [NEW] +57.8Ki  [NEW] +57.7Ki    agent_data_plane::cli::run::handle_run_command::_{{closure}}::hb1b79147c6e6e2ef
  [NEW] +48.8Ki  [NEW] +48.7Ki    saluki_app::bootstrap::AppBootstrapper::bootstrap::_{{closure}}::hc8e634acdf97ad33
  [NEW] +47.7Ki  [NEW] +47.6Ki    moka::sync::base_cache::Inner<K,V,S>::do_run_pending_tasks::h0ef6755dbc77ea2a
  [NEW] +46.1Ki  [NEW] +46.0Ki    h2::proto::connection::Connection<T,P,B>::poll::h1671860da0918c66
  +0.2% +26.0Ki  +0.2% +23.3Ki    [29572 Others]
  [DEL] -46.1Ki  [DEL] -46.0Ki    h2::proto::connection::Connection<T,P,B>::poll::h2aedcbe1089b311c
  [DEL] -47.7Ki  [DEL] -47.6Ki    moka::sync::base_cache::Inner<K,V,S>::do_run_pending_tasks::ha97a2f55834f17a3
  [DEL] -48.8Ki  [DEL] -48.7Ki    saluki_app::bootstrap::AppBootstrapper::bootstrap::_{{closure}}::hbb5d0d8e944d3a21
  [DEL] -57.8Ki  [DEL] -57.7Ki    agent_data_plane::cli::run::handle_run_command::_{{closure}}::hd8d13580d16cc8a3
  [DEL] -62.3Ki  [DEL] -62.2Ki    agent_data_plane::main::_{{closure}}::h1c7c640002ce8ad1
  [DEL] -63.8Ki  [DEL] -63.6Ki    saluki_components::common::datadog::io::run_endpoint_io_loop::_{{closure}}::hc194831d658db8cc
  [DEL] -68.2Ki  [DEL] -68.1Ki    h2::hpack::decoder::Decoder::try_decode_string::hbfc0bb5a77e7669f
  [DEL] -84.5Ki  [DEL] -84.4Ki    agent_data_plane::internal::control_plane::spawn_control_plane::_{{closure}}::heace61ab0f2bde0c
  [DEL]  -113Ki  [DEL]  -113Ki    agent_data_plane::cli::run::create_topology::_{{closure}}::h6a8b41fa76a89e2c
  [DEL] -1.79Mi  [DEL] -1.79Mi    std::thread::local::LocalKey<T>::with::h0d8770940d38805b
  +0.1% +25.3Ki  +0.1% +22.6Ki    TOTAL

@pr-commenter
Copy link

pr-commenter bot commented Feb 1, 2026

Regression Detector (Agent Data Plane)

Regression Detector Results

Run ID: 5802f1b6-eafa-4426-9bba-15b9e6580d8e

Baseline: 9fe09ef
Comparison: 850dc8d
Diff

❌ Experiments with retried target crashes

This is a critical error. One or more replicates failed with a non-zero exit code. These replicates may have been retried. See Replicate Execution Details for more information.

  • otlp_ingest_logs_5mb_memory
  • quality_gates_rss_dsd_heavy
  • quality_gates_rss_dsd_medium
  • otlp_ingest_logs_5mb_throughput

Optimization Goals: ✅ No significant changes detected

Experiments ignored for regressions

Regressions in experiments with settings containing erratic: true are ignored.

perf experiment goal Δ mean % Δ mean % CI trials links
otlp_ingest_logs_5mb_throughput ingress throughput +0.00 [-0.13, +0.13] 1 (metrics) (profiles) (logs)
otlp_ingest_logs_5mb_cpu % cpu utilization -0.27 [-5.55, +5.00] 1 (metrics) (profiles) (logs)
otlp_ingest_logs_5mb_memory memory utilization -9.70 [-10.06, -9.34] 1 (metrics) (profiles) (logs)

Fine details of change detection per experiment

perf experiment goal Δ mean % Δ mean % CI trials links
dsd_uds_512kb_3k_contexts_cpu % cpu utilization +12.46 [-45.47, +70.39] 1 (metrics) (profiles) (logs)
otlp_ingest_metrics_5mb_cpu % cpu utilization +2.12 [-3.99, +8.23] 1 (metrics) (profiles) (logs)
dsd_uds_10mb_3k_contexts_memory memory utilization +1.02 [+0.83, +1.22] 1 (metrics) (profiles) (logs)
dsd_uds_100mb_3k_contexts_cpu % cpu utilization +0.83 [-5.13, +6.79] 1 (metrics) (profiles) (logs)
quality_gates_rss_idle memory utilization +0.63 [+0.59, +0.68] 1 (metrics) (profiles) (logs)
dsd_uds_1mb_3k_contexts_memory memory utilization +0.61 [+0.43, +0.80] 1 (metrics) (profiles) (logs)
dsd_uds_100mb_3k_contexts_memory memory utilization +0.61 [+0.41, +0.80] 1 (metrics) (profiles) (logs)
quality_gates_rss_dsd_low memory utilization +0.54 [+0.37, +0.70] 1 (metrics) (profiles) (logs)
dsd_uds_500mb_3k_contexts_memory memory utilization +0.48 [+0.30, +0.67] 1 (metrics) (profiles) (logs)
quality_gates_rss_dsd_medium memory utilization +0.27 [+0.08, +0.45] 1 (metrics) (profiles) (logs)
quality_gates_rss_dsd_heavy memory utilization +0.26 [+0.13, +0.38] 1 (metrics) (profiles) (logs)
dsd_uds_512kb_3k_contexts_memory memory utilization +0.13 [-0.06, +0.31] 1 (metrics) (profiles) (logs)
otlp_ingest_metrics_5mb_throughput ingress throughput +0.01 [-0.12, +0.14] 1 (metrics) (profiles) (logs)
otlp_ingest_traces_5mb_throughput ingress throughput +0.00 [-0.08, +0.08] 1 (metrics) (profiles) (logs)
dsd_uds_1mb_3k_contexts_throughput ingress throughput +0.00 [-0.06, +0.06] 1 (metrics) (profiles) (logs)
otlp_ingest_logs_5mb_throughput ingress throughput +0.00 [-0.13, +0.13] 1 (metrics) (profiles) (logs)
dsd_uds_10mb_3k_contexts_throughput ingress throughput -0.00 [-0.17, +0.16] 1 (metrics) (profiles) (logs)
dsd_uds_512kb_3k_contexts_throughput ingress throughput -0.01 [-0.06, +0.05] 1 (metrics) (profiles) (logs)
dsd_uds_100mb_3k_contexts_throughput ingress throughput -0.01 [-0.06, +0.04] 1 (metrics) (profiles) (logs)
dsd_uds_500mb_3k_contexts_cpu % cpu utilization -0.27 [-1.52, +0.98] 1 (metrics) (profiles) (logs)
otlp_ingest_logs_5mb_cpu % cpu utilization -0.27 [-5.55, +5.00] 1 (metrics) (profiles) (logs)
quality_gates_rss_dsd_ultraheavy memory utilization -0.32 [-0.45, -0.18] 1 (metrics) (profiles) (logs)
dsd_uds_500mb_3k_contexts_throughput ingress throughput -0.73 [-0.87, -0.58] 1 (metrics) (profiles) (logs)
dsd_uds_10mb_3k_contexts_cpu % cpu utilization -1.34 [-30.99, +28.30] 1 (metrics) (profiles) (logs)
otlp_ingest_traces_5mb_memory memory utilization -1.42 [-1.66, -1.17] 1 (metrics) (profiles) (logs)
otlp_ingest_traces_5mb_cpu % cpu utilization -1.90 [-4.35, +0.55] 1 (metrics) (profiles) (logs)
otlp_ingest_metrics_5mb_memory memory utilization -2.84 [-3.06, -2.62] 1 (metrics) (profiles) (logs)
dsd_uds_1mb_3k_contexts_cpu % cpu utilization -6.81 [-55.83, +42.20] 1 (metrics) (profiles) (logs)
otlp_ingest_logs_5mb_memory memory utilization -9.70 [-10.06, -9.34] 1 (metrics) (profiles) (logs)

Bounds Checks: ✅ Passed

perf experiment bounds_check_name replicates_passed links
quality_gates_rss_dsd_heavy memory_usage 10/10 (metrics) (profiles) (logs)
quality_gates_rss_dsd_low memory_usage 10/10 (metrics) (profiles) (logs)
quality_gates_rss_dsd_medium memory_usage 10/10 (metrics) (profiles) (logs)
quality_gates_rss_dsd_ultraheavy memory_usage 10/10 (metrics) (profiles) (logs)
quality_gates_rss_idle memory_usage 10/10 (metrics) (profiles) (logs)

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".

Replicate Execution Details

We run multiple replicates for each experiment/variant. However, we allow replicates to be automatically retried if there are any failures, up to 8 times, at which point the replicate is marked dead and we are unable to run analysis for the entire experiment. We call each of these attempts at running replicates a replicate execution. This section lists all replicate executions that failed due to the target crashing or being oom killed.

Note: In the below tables we bucket failures by experiment, variant, and failure type. For each of these buckets we list out the replicate indexes that failed with an annotation signifying how many times said replicate failed with the given failure mode. In the below example the baseline variant of the experiment named experiment_with_failures had two replicates that failed by oom kills. Replicate 0, which failed 8 executions, and replicate 1 which failed 6 executions, all with the same failure mode.

Experiment Variant Replicates Failure Logs Debug Dashboard
experiment_with_failures baseline 0 (x8) 1 (x6) Oom killed Debug Dashboard

The debug dashboard links will take you to a debugging dashboard specifically designed to investigate replicate execution failures.

❌ Retried Normal Replicate Execution Failures (non-profiling)

Experiment Variant Replicates Failure Debug Dashboard
otlp_ingest_logs_5mb_memory comparison 0 Failed to shutdown when requested Debug Dashboard
otlp_ingest_logs_5mb_throughput baseline 9 Failed to shutdown when requested Debug Dashboard
otlp_ingest_logs_5mb_throughput comparison 4 Failed to shutdown when requested Debug Dashboard
quality_gates_rss_dsd_heavy comparison 9 Failed to shutdown when requested Debug Dashboard
quality_gates_rss_dsd_medium baseline 7 Failed to shutdown when requested Debug Dashboard
quality_gates_rss_dsd_medium comparison 9 Failed to shutdown when requested Debug Dashboard

@tobz tobz force-pushed the tobz/supervisor-dedicated-runtime branch from 6a8f7e2 to 10571d0 Compare February 2, 2026 14:32
@tobz tobz force-pushed the tobz/supervisable-async-initialize branch from 9870966 to 871f057 Compare February 2, 2026 14:32
@tobz tobz changed the title enhancement(core): make initialization async in Supervisable enhancement(core): make initialization async in Supervisable Feb 2, 2026
@tobz tobz added the type/enhancement An enhancement in functionality or support. label Feb 2, 2026
@tobz tobz marked this pull request as ready for review February 2, 2026 18:12
@tobz tobz requested a review from a team as a code owner February 2, 2026 18:12
Copilot AI review requested due to automatic review settings February 2, 2026 18:12
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@tobz tobz force-pushed the tobz/supervisor-dedicated-runtime branch from 10571d0 to d4f07a2 Compare February 2, 2026 18:18
@tobz tobz force-pushed the tobz/supervisable-async-initialize branch from 871f057 to ebbe088 Compare February 2, 2026 18:18
@tobz tobz force-pushed the tobz/supervisor-dedicated-runtime branch from d4f07a2 to b278bff Compare February 4, 2026 15:10
Copilot AI review requested due to automatic review settings February 4, 2026 15:10
@tobz tobz force-pushed the tobz/supervisable-async-initialize branch from ebbe088 to 8db420f Compare February 4, 2026 15:10
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@tobz tobz force-pushed the tobz/supervisable-async-initialize branch from 9402e2f to 850dc8d Compare February 11, 2026 16:24
Copilot AI review requested due to automatic review settings February 11, 2026 16:24
@tobz tobz force-pushed the tobz/supervisor-dedicated-runtime branch from 332f393 to 694d69e Compare February 11, 2026 16:24
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 378 to 392
Some((child_spec_idx, worker_result)) => {
let child_spec = self.get_child_spec(child_spec_idx);
match restart_state.evaluate_restart() {
RestartAction::Restart(mode) => match mode {
RestartMode::OneForOne => {
warn!(supervisor_id = %self.supervisor_id, worker_name = child_spec.name(), ?worker_result, "Child process terminated, restarting.");
self.spawn_child(child_spec_idx, &mut worker_state)?;
// Note: spawn_child is now async, but init errors during restart should still fail
// the supervisor immediately (no restart for init errors).
self.spawn_child(child_spec_idx, &mut worker_state).await?;
}
RestartMode::OneForAll => {
warn!(supervisor_id = %self.supervisor_id, worker_name = child_spec.name(), ?worker_result, "Child process terminated, restarting all processes.");
worker_state.shutdown_workers().await;
self.spawn_all_children(&mut worker_state)?;
self.spawn_all_children(&mut worker_state).await?;
}
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awaiting (re)initialization inside the select! branch can block the supervisor from processing the shutdown signal and/or other worker completions while a child is doing potentially long async initialization. Consider restructuring so restarts remain interruptible (e.g., wrap restart init in a tokio::select! against the supervisor shutdown future, or move child (re)initialization into a separate task and have the supervisor loop continue to service shutdown/events).

Copilot uses AI. Check for mistakes.
Comment on lines +354 to 359
// Do the initial spawn of all child processes and supervisors. If any fail to initialize, we fail immediately
// (no restart).
self.spawn_all_children(&mut worker_state).await?;

// Now we supervise.
let shutdown = process_shutdown.wait_for_shutdown();
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since initialization is now async, spawn_all_children(...).await can take arbitrarily long, during which a shutdown request won’t be observed (the shutdown wait is created afterward). To keep shutdown responsive during lengthy initialization, consider awaiting initialization in a select! with the shutdown signal (or otherwise checking for shutdown during the initial spawn phase) so run_inner can exit promptly when shutdown is requested.

Copilot uses AI. Check for mistakes.
Comment on lines +112 to +119
///
/// This error indicates that a child could not complete its async initialization. This is distinct from runtime
/// failures and does NOT trigger restart logic.
#[snafu(display("Child process failed to initialize: {}", source))]
FailedToInitialize {
/// The underlying initialization error.
source: InitializationError,
},
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FailedToInitialize doesn’t include which child failed to initialize, which can make diagnosing startup failures difficult (especially when multiple children are spawned). Consider adding a child_name/process field to this error variant and including it in the display string (and populating it at the call sites in ChildSpecification::initialize), so the propagated error clearly identifies the failing child.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/core Core functionality, event model, etc. type/enhancement An enhancement in functionality or support.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant