-
Notifications
You must be signed in to change notification settings - Fork 8
enhancement(core): make initialization async in Supervisable
#1146
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: tobz/supervisor-dedicated-runtime
Are you sure you want to change the base?
enhancement(core): make initialization async in Supervisable
#1146
Conversation
|
Warning This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite. Learn more about stacking. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR enhances the supervisor system by making the Supervisable::initialize method asynchronous, allowing processes to perform async initialization (e.g., binding ports, establishing connections) before running.
Changes:
- Introduced
InitializationErrortype to distinguish initialization failures from runtime failures - Changed
Supervisable::initializefrom synchronous to async, returningResult<SupervisorFuture, InitializationError>instead ofOption<SupervisorFuture> - Updated supervisor internals to await initialization and propagate initialization errors immediately without restart logic
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| lib/saluki-core/src/runtime/supervisor.rs | Added InitializationError enum, made Supervisable::initialize async, updated supervisor spawn methods to await initialization |
| lib/saluki-core/src/runtime/mod.rs | Exported InitializationError from supervisor module |
| lib/saluki-core/examples/basic_supervisor.rs | Updated MockWorker to implement async initialize with example initialization pattern |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Binary Size Analysis (Agent Data Plane)Target: 9fe09ef (baseline) vs 850dc8d (comparison) diff
|
| Module | File Size | Symbols |
|---|---|---|
tokio |
+19.15 KiB | 2709 |
core |
+6.78 KiB | 11100 |
[Unmapped] |
-2.59 KiB | 1 |
alloc |
+1.88 KiB | 1216 |
[sections] |
+1.09 KiB | 7 |
saluki_env::workload::helpers |
-698 B | 60 |
std |
+386 B | 285 |
saluki_app::api::APIBuilder |
+370 B | 4 |
saluki_common::task::instrument |
-361 B | 86 |
agent_data_plane::cli::dogstatsd |
-301 B | 34 |
agent_data_plane::cli::debug |
+301 B | 92 |
agent_data_plane::internal::remote_agent |
-244 B | 62 |
miniz_oxide |
-227 B | 1 |
futures_util |
-208 B | 86 |
tonic |
-161 B | 356 |
saluki_components::common::datadog |
+160 B | 326 |
tokio_util |
-158 B | 38 |
hyper |
+130 B | 594 |
saluki_components::forwarders::otlp |
+120 B | 60 |
agent_data_plane::main |
-120 B | 2 |
Detailed Symbol Changes
FILE SIZE VM SIZE
-------------- --------------
[NEW] +1.79Mi [NEW] +1.79Mi std::thread::local::LocalKey<T>::with::h938f6ddf7aedbbb8
[NEW] +113Ki [NEW] +113Ki agent_data_plane::cli::run::create_topology::_{{closure}}::hfb0b645d0bb8c8cf
[NEW] +84.5Ki [NEW] +84.4Ki agent_data_plane::internal::control_plane::spawn_control_plane::_{{closure}}::h6a61f8e463394a17
[NEW] +68.2Ki [NEW] +68.1Ki h2::hpack::decoder::Decoder::try_decode_string::hd1c9a39c48e78a6e
[NEW] +63.7Ki [NEW] +63.6Ki saluki_components::common::datadog::io::run_endpoint_io_loop::_{{closure}}::hd35493454b2aefa0
[NEW] +62.3Ki [NEW] +62.2Ki agent_data_plane::main::_{{closure}}::hd35953e90113f5c5
[NEW] +57.8Ki [NEW] +57.7Ki agent_data_plane::cli::run::handle_run_command::_{{closure}}::hb1b79147c6e6e2ef
[NEW] +48.8Ki [NEW] +48.7Ki saluki_app::bootstrap::AppBootstrapper::bootstrap::_{{closure}}::hc8e634acdf97ad33
[NEW] +47.7Ki [NEW] +47.6Ki moka::sync::base_cache::Inner<K,V,S>::do_run_pending_tasks::h0ef6755dbc77ea2a
[NEW] +46.1Ki [NEW] +46.0Ki h2::proto::connection::Connection<T,P,B>::poll::h1671860da0918c66
+0.2% +26.0Ki +0.2% +23.3Ki [29572 Others]
[DEL] -46.1Ki [DEL] -46.0Ki h2::proto::connection::Connection<T,P,B>::poll::h2aedcbe1089b311c
[DEL] -47.7Ki [DEL] -47.6Ki moka::sync::base_cache::Inner<K,V,S>::do_run_pending_tasks::ha97a2f55834f17a3
[DEL] -48.8Ki [DEL] -48.7Ki saluki_app::bootstrap::AppBootstrapper::bootstrap::_{{closure}}::hbb5d0d8e944d3a21
[DEL] -57.8Ki [DEL] -57.7Ki agent_data_plane::cli::run::handle_run_command::_{{closure}}::hd8d13580d16cc8a3
[DEL] -62.3Ki [DEL] -62.2Ki agent_data_plane::main::_{{closure}}::h1c7c640002ce8ad1
[DEL] -63.8Ki [DEL] -63.6Ki saluki_components::common::datadog::io::run_endpoint_io_loop::_{{closure}}::hc194831d658db8cc
[DEL] -68.2Ki [DEL] -68.1Ki h2::hpack::decoder::Decoder::try_decode_string::hbfc0bb5a77e7669f
[DEL] -84.5Ki [DEL] -84.4Ki agent_data_plane::internal::control_plane::spawn_control_plane::_{{closure}}::heace61ab0f2bde0c
[DEL] -113Ki [DEL] -113Ki agent_data_plane::cli::run::create_topology::_{{closure}}::h6a8b41fa76a89e2c
[DEL] -1.79Mi [DEL] -1.79Mi std::thread::local::LocalKey<T>::with::h0d8770940d38805b
+0.1% +25.3Ki +0.1% +22.6Ki TOTAL
Regression Detector (Agent Data Plane)Regression Detector ResultsRun ID: 5802f1b6-eafa-4426-9bba-15b9e6580d8e Baseline: 9fe09ef ❌ Experiments with retried target crashesThis is a critical error. One or more replicates failed with a non-zero exit code. These replicates may have been retried. See Replicate Execution Details for more information.
Optimization Goals: ✅ No significant changes detected
|
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | otlp_ingest_logs_5mb_throughput | ingress throughput | +0.00 | [-0.13, +0.13] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_logs_5mb_cpu | % cpu utilization | -0.27 | [-5.55, +5.00] | 1 | (metrics) (profiles) (logs) |
| ✅ | otlp_ingest_logs_5mb_memory | memory utilization | -9.70 | [-10.06, -9.34] | 1 | (metrics) (profiles) (logs) |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | dsd_uds_512kb_3k_contexts_cpu | % cpu utilization | +12.46 | [-45.47, +70.39] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_metrics_5mb_cpu | % cpu utilization | +2.12 | [-3.99, +8.23] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_10mb_3k_contexts_memory | memory utilization | +1.02 | [+0.83, +1.22] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_100mb_3k_contexts_cpu | % cpu utilization | +0.83 | [-5.13, +6.79] | 1 | (metrics) (profiles) (logs) |
| ➖ | quality_gates_rss_idle | memory utilization | +0.63 | [+0.59, +0.68] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_1mb_3k_contexts_memory | memory utilization | +0.61 | [+0.43, +0.80] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_100mb_3k_contexts_memory | memory utilization | +0.61 | [+0.41, +0.80] | 1 | (metrics) (profiles) (logs) |
| ➖ | quality_gates_rss_dsd_low | memory utilization | +0.54 | [+0.37, +0.70] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_500mb_3k_contexts_memory | memory utilization | +0.48 | [+0.30, +0.67] | 1 | (metrics) (profiles) (logs) |
| ➖ | quality_gates_rss_dsd_medium | memory utilization | +0.27 | [+0.08, +0.45] | 1 | (metrics) (profiles) (logs) |
| ➖ | quality_gates_rss_dsd_heavy | memory utilization | +0.26 | [+0.13, +0.38] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_512kb_3k_contexts_memory | memory utilization | +0.13 | [-0.06, +0.31] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_metrics_5mb_throughput | ingress throughput | +0.01 | [-0.12, +0.14] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_5mb_throughput | ingress throughput | +0.00 | [-0.08, +0.08] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_1mb_3k_contexts_throughput | ingress throughput | +0.00 | [-0.06, +0.06] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_logs_5mb_throughput | ingress throughput | +0.00 | [-0.13, +0.13] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_10mb_3k_contexts_throughput | ingress throughput | -0.00 | [-0.17, +0.16] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_512kb_3k_contexts_throughput | ingress throughput | -0.01 | [-0.06, +0.05] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_100mb_3k_contexts_throughput | ingress throughput | -0.01 | [-0.06, +0.04] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_500mb_3k_contexts_cpu | % cpu utilization | -0.27 | [-1.52, +0.98] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_logs_5mb_cpu | % cpu utilization | -0.27 | [-5.55, +5.00] | 1 | (metrics) (profiles) (logs) |
| ➖ | quality_gates_rss_dsd_ultraheavy | memory utilization | -0.32 | [-0.45, -0.18] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_500mb_3k_contexts_throughput | ingress throughput | -0.73 | [-0.87, -0.58] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_10mb_3k_contexts_cpu | % cpu utilization | -1.34 | [-30.99, +28.30] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_5mb_memory | memory utilization | -1.42 | [-1.66, -1.17] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_5mb_cpu | % cpu utilization | -1.90 | [-4.35, +0.55] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_metrics_5mb_memory | memory utilization | -2.84 | [-3.06, -2.62] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_1mb_3k_contexts_cpu | % cpu utilization | -6.81 | [-55.83, +42.20] | 1 | (metrics) (profiles) (logs) |
| ✅ | otlp_ingest_logs_5mb_memory | memory utilization | -9.70 | [-10.06, -9.34] | 1 | (metrics) (profiles) (logs) |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | links |
|---|---|---|---|---|
| ✅ | quality_gates_rss_dsd_heavy | memory_usage | 10/10 | (metrics) (profiles) (logs) |
| ✅ | quality_gates_rss_dsd_low | memory_usage | 10/10 | (metrics) (profiles) (logs) |
| ✅ | quality_gates_rss_dsd_medium | memory_usage | 10/10 | (metrics) (profiles) (logs) |
| ✅ | quality_gates_rss_dsd_ultraheavy | memory_usage | 10/10 | (metrics) (profiles) (logs) |
| ✅ | quality_gates_rss_idle | memory_usage | 10/10 | (metrics) (profiles) (logs) |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
-
Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
-
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
-
Its configuration does not mark it "erratic".
Replicate Execution Details
We run multiple replicates for each experiment/variant. However, we allow replicates to be automatically retried if there are any failures, up to 8 times, at which point the replicate is marked dead and we are unable to run analysis for the entire experiment. We call each of these attempts at running replicates a replicate execution. This section lists all replicate executions that failed due to the target crashing or being oom killed.
Note: In the below tables we bucket failures by experiment, variant, and failure type. For each of these buckets we list out the replicate indexes that failed with an annotation signifying how many times said replicate failed with the given failure mode. In the below example the baseline variant of the experiment named experiment_with_failures had two replicates that failed by oom kills. Replicate 0, which failed 8 executions, and replicate 1 which failed 6 executions, all with the same failure mode.
| Experiment | Variant | Replicates | Failure | Logs | Debug Dashboard |
|---|---|---|---|---|---|
| experiment_with_failures | baseline | 0 (x8) 1 (x6) | Oom killed | Debug Dashboard |
The debug dashboard links will take you to a debugging dashboard specifically designed to investigate replicate execution failures.
❌ Retried Normal Replicate Execution Failures (non-profiling)
| Experiment | Variant | Replicates | Failure | Debug Dashboard |
|---|---|---|---|---|
| otlp_ingest_logs_5mb_memory | comparison | 0 | Failed to shutdown when requested | Debug Dashboard |
| otlp_ingest_logs_5mb_throughput | baseline | 9 | Failed to shutdown when requested | Debug Dashboard |
| otlp_ingest_logs_5mb_throughput | comparison | 4 | Failed to shutdown when requested | Debug Dashboard |
| quality_gates_rss_dsd_heavy | comparison | 9 | Failed to shutdown when requested | Debug Dashboard |
| quality_gates_rss_dsd_medium | baseline | 7 | Failed to shutdown when requested | Debug Dashboard |
| quality_gates_rss_dsd_medium | comparison | 9 | Failed to shutdown when requested | Debug Dashboard |
6a8f7e2 to
10571d0
Compare
9870966 to
871f057
Compare
Supervisable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
10571d0 to
d4f07a2
Compare
871f057 to
ebbe088
Compare
d4f07a2 to
b278bff
Compare
ebbe088 to
8db420f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
8db420f to
9402e2f
Compare
b278bff to
332f393
Compare
9402e2f to
850dc8d
Compare
332f393 to
694d69e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| Some((child_spec_idx, worker_result)) => { | ||
| let child_spec = self.get_child_spec(child_spec_idx); | ||
| match restart_state.evaluate_restart() { | ||
| RestartAction::Restart(mode) => match mode { | ||
| RestartMode::OneForOne => { | ||
| warn!(supervisor_id = %self.supervisor_id, worker_name = child_spec.name(), ?worker_result, "Child process terminated, restarting."); | ||
| self.spawn_child(child_spec_idx, &mut worker_state)?; | ||
| // Note: spawn_child is now async, but init errors during restart should still fail | ||
| // the supervisor immediately (no restart for init errors). | ||
| self.spawn_child(child_spec_idx, &mut worker_state).await?; | ||
| } | ||
| RestartMode::OneForAll => { | ||
| warn!(supervisor_id = %self.supervisor_id, worker_name = child_spec.name(), ?worker_result, "Child process terminated, restarting all processes."); | ||
| worker_state.shutdown_workers().await; | ||
| self.spawn_all_children(&mut worker_state)?; | ||
| self.spawn_all_children(&mut worker_state).await?; | ||
| } |
Copilot
AI
Feb 11, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awaiting (re)initialization inside the select! branch can block the supervisor from processing the shutdown signal and/or other worker completions while a child is doing potentially long async initialization. Consider restructuring so restarts remain interruptible (e.g., wrap restart init in a tokio::select! against the supervisor shutdown future, or move child (re)initialization into a separate task and have the supervisor loop continue to service shutdown/events).
| // Do the initial spawn of all child processes and supervisors. If any fail to initialize, we fail immediately | ||
| // (no restart). | ||
| self.spawn_all_children(&mut worker_state).await?; | ||
|
|
||
| // Now we supervise. | ||
| let shutdown = process_shutdown.wait_for_shutdown(); |
Copilot
AI
Feb 11, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since initialization is now async, spawn_all_children(...).await can take arbitrarily long, during which a shutdown request won’t be observed (the shutdown wait is created afterward). To keep shutdown responsive during lengthy initialization, consider awaiting initialization in a select! with the shutdown signal (or otherwise checking for shutdown during the initial spawn phase) so run_inner can exit promptly when shutdown is requested.
| /// | ||
| /// This error indicates that a child could not complete its async initialization. This is distinct from runtime | ||
| /// failures and does NOT trigger restart logic. | ||
| #[snafu(display("Child process failed to initialize: {}", source))] | ||
| FailedToInitialize { | ||
| /// The underlying initialization error. | ||
| source: InitializationError, | ||
| }, |
Copilot
AI
Feb 11, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FailedToInitialize doesn’t include which child failed to initialize, which can make diagnosing startup failures difficult (especially when multiple children are spawned). Consider adding a child_name/process field to this error variant and including it in the display string (and populating it at the call sites in ChildSpecification::initialize), so the propagated error clearly identifies the failing child.

Summary
This PR updates
Supervisable::initializeto beasync, allowing a wider number of operations to be undertaken as part of initializing the child process future.As part of working up towards being able to supervisor-ify the various tasks we spawn, we need one additional chunk of functionality: the ability to run
asynccode. For many tasks, we do some amount ofasyncwork to prepare before actually spawning the task itself, such as listening on sockets, reading files on disk, making queries to remote services, and more. These operations are usually fundamental, and we can't build/spawn the actual task unless they complete successfully, and so they need to be run as part of an initialization phase: if we fail to do them, we don't just try again, we intentionally give up/bail out.This PR adds the ability for us to do this sort of asynchronous work when initializing a child process future by updating the
Supervisabletrait to haveasync fn initialize(...). In tandem, we now also lift up initialization errors as their own thing, so we can specifically fail fast when they occur. This means that while normal process failures (runtime errors during steady state) are still subject to the normal restart strategy behavior of supervision, initialization errors cause a supervisor to immediately fail.Change Type
How did you test this PR?
Existing unit tests.
References
AGTMETRICS-393