| 1 | +# Design Document: async-profiler Rust Agent |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +The async-profiler Rust agent is an in-process profiling library that integrates with [async-profiler](https://github.com/async-profiler/async-profiler) to collect performance data and upload it to various backends. The agent is designed to run continuously in production environments with minimal overhead. |
| 6 | + |
| 7 | +For a more how-to-focused guide on running the profiler in various contexts, read the README. |
| 8 | + |
| 9 | +This guide is based on an AI-driven summary, but it includes many comments from the development team. |
| 10 | + |
| 11 | +## Architecture |
| 12 | + |
| 13 | +The async-profiler agent runs as an agent within a Rust process and profiles it using [async-profiler]. |
| 14 | + |
| 15 | +Currently, the agent only supports loading async-profiler dynamically, as `libasyncProfiler.so`, |
| 16 | +via [libloading]. Future versions might also support linking against it statically or through |
| 17 | +ordinary dynamic linking. |
| 18 | + |
| 19 | +[async-profiler]: https://github.com/async-profiler/async-profiler |
| 20 | +[libloading]: https://crates.io/crates/libloading |
| 21 | + |
| 22 | +## Code Architecture |
| 23 | + |
| 24 | +The crate follows a modular architecture with clear separation of concerns: |
| 25 | + |
| 26 | +``` |
| 27 | +async-profiler-agent/ |
| 28 | +├── src/ |
| 29 | +│ ├── lib.rs # Public API and documentation |
| 30 | +│ ├── profiler.rs # Core profiler orchestration |
| 31 | +│ ├── asprof/ # async-profiler FFI bindings |
| 32 | +│ ├── metadata/ # Host and report metadata |
| 33 | +│ ├── pollcatch/ # Tokio poll time tracking |
| 34 | +│ └── reporter/ # Data upload backends |
| 35 | +├── examples/ # Sample applications |
| 36 | +├── decoder/ # JFR analysis tool |
| 37 | +└── tests/ # Integration tests |
| 38 | +``` |
| 39 | + |
| 40 | +## Core Modules |
| 41 | + |
| 42 | +### 1. Profiler (`profiler`) |
| 43 | + |
| 44 | +**Purpose**: Central orchestration of profiling lifecycle and data collection. |
| 45 | + |
| 46 | +**Key Components**: |
| 47 | +- `Profiler` & `ProfilerBuilder`: Main entry point for starting profiling |
| 48 | +- `ProfilerOptions`: Profiling behavior configuration |
| 49 | +- `RunningProfiler`: Handle for controlling active profiler |
| 50 | +- `ProfilerEngine` trait: used to allow mocking async-profiler (the C library) during tests |
| 51 | + |
| 52 | +#### Profiler lifecycle management |
| 53 | + |
| 54 | +As of version 4.1, async-profiler does not have a mode where it can run continuously |
| 55 | +with bounded memory usage and periodically collect samples. |
| 56 | + |
| 57 | +Therefore, every [`reporting_interval`] seconds, the async-profiler agent restarts async-profiler by sending a `stop` command (which flushes the JFR file) followed by a `start` command. |
| 58 | + |
| 59 | +This is managed by `Profiler` (see the [`profiler_tick`] function). |
| 60 | + |
| 61 | +This is a supported async-profiler operation mode. |
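
The stop/start cycle can be sketched with a mock engine. This is a minimal illustration, not the agent's code: the `Engine` trait, `MockEngine`, and `tick` are hypothetical names standing in for `ProfilerEngine` and `profiler_tick`, whose real signatures may differ.

```rust
/// Hypothetical stand-in for the agent's `ProfilerEngine` trait.
/// The real trait's methods and signatures may differ.
trait Engine {
    fn start(&mut self, jfr_path: &str) -> Result<(), String>;
    fn stop(&mut self) -> Result<(), String>;
}

/// Mock engine that records the commands it receives, in order.
#[derive(Default)]
struct MockEngine {
    running: bool,
    log: Vec<String>,
}

impl Engine for MockEngine {
    fn start(&mut self, jfr_path: &str) -> Result<(), String> {
        // Starting a session while one is already running is an error.
        if self.running {
            return Err("profiler already running".to_string());
        }
        self.running = true;
        self.log.push(format!("start {jfr_path}"));
        Ok(())
    }

    fn stop(&mut self) -> Result<(), String> {
        if !self.running {
            return Err("profiler not running".to_string());
        }
        self.running = false;
        self.log.push("stop".to_string());
        Ok(())
    }
}

/// One reporting tick: stop (which flushes the JFR), then start again
/// on the next path. Reading and reporting the JFR is omitted here.
fn tick<E: Engine>(engine: &mut E, next_path: &str) -> Result<(), String> {
    engine.stop()?;
    engine.start(next_path)
}

fn main() {
    let mut engine = MockEngine::default();
    engine.start("a.jfr").unwrap();
    // A real agent would wait for the reporting interval between ticks.
    for path in ["b.jfr", "a.jfr"] {
        tick(&mut engine, path).unwrap();
    }
    println!("{:?}", engine.log);
}
```

Putting the engine behind a trait is what makes this kind of test possible without loading the native library.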
| 62 | + |
| 63 | +[`reporting_interval`]: https://docs.rs/async-profiler-agent/0.1/async_profiler_agent/profiler/struct.ProfilerBuilder.html#method.with_reporting_interval |
| 64 | +[`profiler_tick`]: https://github.com/async-profiler/rust-agent/blob/506718fff274b49cf2eb03305a4f9547b61720e3/src/profiler.rs#L1083 |
| 65 | + |
| 66 | +#### Agent lifecycle management |
| 67 | + |
| 68 | +The async-profiler agent can be stopped and started at run-time. |
| 69 | + |
| 70 | +Trying to start an async-profiler session while async-profiler is already running causes async-profiler |
| 71 | +to return an error, so to restart the profiler (possibly with a different configuration), the profiler must |
| 72 | +be stopped before it is started again. |
| 73 | + |
| 74 | +When stopped, the async-profiler agent stops async-profiler, flushes the last profile to the reporter, and then signals |
| 75 | +that it has finished. After that signal, it is possible to start a different instance of the async-profiler |
| 76 | +agent on the same process. |
| 77 | + |
| 78 | +#### Profiler configuration |
| 79 | + |
| 80 | +async-profiler is configured via [`ProfilerOptions`] and [`ProfilerOptionsBuilder`]. Read |
| 81 | +their documentation together with the [async-profiler options docs] for more details. |
| 82 | + |
| 83 | +[`ProfilerOptions`]: https://docs.rs/async-profiler-agent/0.1/async_profiler_agent/profiler/struct.ProfilerOptions.html |
| 84 | +[`ProfilerOptionsBuilder`]: https://docs.rs/async-profiler-agent/0.1/async_profiler_agent/profiler/struct.ProfilerOptionsBuilder.html |
| 85 | +[async-profiler options docs]: https://github.com/async-profiler/async-profiler/blob/v4.0/docs/ProfilerOptions.md |
| 86 | + |
| 87 | +#### JFR file rotation |
| 88 | + |
| 89 | +async-profiler expects to write the current JFR to a "fresh" file path. To that |
| 90 | +end, the agent creates two unnamed temporary files via `JfrFile` and gives |
| 91 | +async-profiler alternating paths of the form `/proc/self/fd/<N>` to write the |
| 92 | +JFRs into. |
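
The alternation between the two paths can be sketched as follows. This is an illustration under stated assumptions: `JfrRotation` is a hypothetical helper, not the agent's `JfrFile`, and the descriptor numbers are placeholders rather than real open files.

```rust
/// Hypothetical rotation helper. The agent's `JfrFile` may be structured
/// differently; here the two files are represented only by their
/// file descriptor numbers.
struct JfrRotation {
    fds: [i32; 2],
    active: usize,
}

impl JfrRotation {
    fn new(fd_a: i32, fd_b: i32) -> Self {
        JfrRotation { fds: [fd_a, fd_b], active: 0 }
    }

    /// Path handed to async-profiler for the file currently being written.
    fn active_path(&self) -> String {
        format!("/proc/self/fd/{}", self.fds[self.active])
    }

    /// Switch to the other file and return the descriptor of the file
    /// that was just completed, so it can be read and reported.
    fn rotate(&mut self) -> i32 {
        let finished = self.fds[self.active];
        self.active = 1 - self.active;
        finished
    }
}

fn main() {
    // Placeholder descriptor numbers; a real agent would use the
    // descriptors of its two unnamed temporary files.
    let mut rotation = JfrRotation::new(7, 8);
    for _ in 0..3 {
        let path = rotation.active_path();
        let finished = rotation.rotate();
        println!("profiled into {path}, now reporting fd {finished}");
    }
}
```

Because the files are unnamed, the `/proc/self/fd/<N>` form is the only path by which the profiler can reach them.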
| 93 | + |
| 94 | +### 2. async-profiler FFI (`asprof`) |
| 95 | + |
| 96 | +**Purpose**: Safe Rust bindings to the native async-profiler library. |
| 97 | + |
| 98 | +**Key Components**: |
| 99 | +- `AsProf`: Safe interface to async-profiler |
| 100 | +- `raw`: Low-level FFI declarations |
| 101 | + |
| 102 | +**Responsibilities**: |
| 103 | +- Dynamic loading of `libasyncProfiler.so` using [libloading] |
| 104 | +- Safe, Rust-native wrappers around C API calls |
| 105 | + |
| 106 | +[libloading]: https://crates.io/crates/libloading |
| 107 | + |
| 108 | +### 3. Metadata (`metadata/`) |
| 109 | + |
| 110 | +**Purpose**: Host identification and report context information. |
| 111 | + |
| 112 | +**Key Components**: |
| 113 | +- `AgentMetadata`: Host identification (EC2, Fargate, or generic) |
| 114 | +- `aws`: AWS-specific metadata autodetection via IMDS |
| 115 | + |
| 116 | +The metadata is sent to the [`Reporter`] implementation and can be used to |
| 117 | +identify the host that generated a particular profiling report. The local reporter |
| 118 | +ignores it. The S3 reporter attaches it to the zip uploaded |
| 119 | +to S3 as `metadata.json`. |
| 120 | + |
| 121 | +### 4. Reporters (`reporter/`) |
| 122 | + |
| 123 | +**Purpose**: Pluggable backends for uploading profiling data. |
| 124 | + |
| 125 | +**Key Components**: |
| 126 | +- [`Reporter`] trait: Common interface for all backends |
| 127 | +- [`LocalReporter`]: Filesystem output for development/testing |
| 128 | +- [`S3Reporter`]: AWS S3 upload with metadata |
| 129 | +- [`MultiReporter`]: Composition of multiple reporters |
| 130 | + |
| 131 | +[`Reporter`]: https://docs.rs/async-profiler-agent/0.1/async_profiler_agent/reporter/trait.Reporter.html |
| 132 | +[`LocalReporter`]: https://docs.rs/async-profiler-agent/0.1/async_profiler_agent/reporter/local/struct.LocalReporter.html |
| 133 | +[`S3Reporter`]: https://docs.rs/async-profiler-agent/0.1/async_profiler_agent/reporter/s3/struct.S3Reporter.html |
| 134 | +[`MultiReporter`]: https://docs.rs/async-profiler-agent/0.1/async_profiler_agent/reporter/multi/struct.MultiReporter.html |
| 135 | + |
| 136 | +The reporter trait is as follows: |
| 137 | + |
| 138 | +```rust |
| 139 | +#[async_trait] |
| 140 | +pub trait Reporter: fmt::Debug { |
| 141 | + async fn report( |
| 142 | + &self, |
| 143 | + jfr: Vec<u8>, |
| 144 | + metadata: &ReportMetadata, |
| 145 | + ) -> Result<(), Box<dyn std::error::Error + Send>>; |
| 146 | +} |
| 147 | +``` |
| 148 | + |
| 149 | +Customers whose needs are not met by the built-in reporters can write their |
| 150 | +own reporters by implementing this trait. |
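
As a rough idea of what a custom reporter looks like, here is a sketch that keeps reports in memory. To stay self-contained it uses a simplified synchronous stand-in, `SyncReporter`, instead of the real async `Reporter` trait, and a placeholder `ReportMetadata` with a single hypothetical `instance` field. `InMemoryReporter` and `EmptyProfile` are likewise illustrative names.

```rust
use std::error::Error;
use std::fmt;
use std::sync::Mutex;

/// Placeholder for the crate's `ReportMetadata`; the real type has
/// different fields. `instance` is a hypothetical field.
#[derive(Debug)]
struct ReportMetadata {
    instance: String,
}

/// Simplified synchronous stand-in for the async `Reporter` trait.
trait SyncReporter: fmt::Debug {
    fn report(
        &self,
        jfr: Vec<u8>,
        metadata: &ReportMetadata,
    ) -> Result<(), Box<dyn Error + Send>>;
}

/// Error returned when there is nothing to report.
#[derive(Debug)]
struct EmptyProfile;

impl fmt::Display for EmptyProfile {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "refusing to report an empty JFR")
    }
}

impl Error for EmptyProfile {}

/// Custom reporter that records (instance, JFR size) pairs in memory.
#[derive(Debug, Default)]
struct InMemoryReporter {
    reports: Mutex<Vec<(String, usize)>>,
}

impl SyncReporter for InMemoryReporter {
    fn report(
        &self,
        jfr: Vec<u8>,
        metadata: &ReportMetadata,
    ) -> Result<(), Box<dyn Error + Send>> {
        if jfr.is_empty() {
            return Err(Box::new(EmptyProfile));
        }
        let mut reports = self.reports.lock().unwrap();
        reports.push((metadata.instance.clone(), jfr.len()));
        Ok(())
    }
}

fn main() {
    let reporter = InMemoryReporter::default();
    let metadata = ReportMetadata { instance: "host-1".to_string() };
    reporter.report(vec![0u8; 16], &metadata).unwrap();
    println!("{:?}", reporter);
}
```

The `&self` receiver is why the sketch needs interior mutability (`Mutex`) to record anything. A reporter written against the real trait would face the same constraint.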
| 151 | + |
| 152 | +### 5. PollCatch (`pollcatch/`) |
| 153 | + |
| 154 | +**Purpose**: Tokio-specific instrumentation for detecting long poll times. |
| 155 | + |
| 156 | +**Key Components**: |
| 157 | +- `before_poll_hook()`: Pre-poll timestamp capture |
| 158 | +- `after_poll_hook()`: Post-poll analysis and reporting |
| 159 | +- `tsc.rs`: CPU timestamp counter utilities |
| 160 | + |
| 161 | +**Responsibilities**: |
| 162 | +- Minimal-overhead poll time tracking |
| 163 | +- Integration with Tokio's task hooks |
| 164 | +- JFR event emission for long polls |
| 165 | +- CPU timestamp correlation with profiler samples |
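
The before/after hook pairing can be sketched as below. This is a minimal sketch under stated assumptions, not the module's implementation: it uses `std::time::Instant` in place of the CPU timestamp counter, returns the long poll duration instead of emitting a JFR event, and the names `sketch_before_poll` and `sketch_after_poll` and the `threshold` parameter are hypothetical.

```rust
use std::cell::Cell;
use std::time::{Duration, Instant};

thread_local! {
    // Start time of the poll currently running on this thread, if any.
    static POLL_START: Cell<Option<Instant>> = Cell::new(None);
}

/// Record when the current poll began (hypothetical analogue of
/// `before_poll_hook()`).
fn sketch_before_poll() {
    POLL_START.with(|start| start.set(Some(Instant::now())));
}

/// Measure the poll that just finished (hypothetical analogue of
/// `after_poll_hook()`). Returns the elapsed time only if it reached
/// `threshold`; returns `None` for short polls or if no start was recorded.
fn sketch_after_poll(threshold: Duration) -> Option<Duration> {
    let started = POLL_START.with(|start| start.take())?;
    let elapsed = started.elapsed();
    (elapsed >= threshold).then_some(elapsed)
}

fn main() {
    sketch_before_poll();
    // Simulate a task poll that blocks the thread for a while.
    std::thread::sleep(Duration::from_millis(20));
    match sketch_after_poll(Duration::from_millis(5)) {
        Some(elapsed) => println!("long poll: {elapsed:?}"),
        None => println!("poll finished quickly"),
    }
}
```

Keeping the state in a thread-local avoids any synchronization between worker threads, which keeps the per-poll cost low.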
| 166 | + |
| 167 | +## Data Flow |
| 168 | + |
| 169 | +1. **Initialization**: Profiler loads `libasyncProfiler.so` and initializes |
| 170 | +2. **Session Start**: Creates temporary JFR files and starts async-profiler |
| 171 | +3. **Continuous Profiling**: async-profiler writes samples to the active JFR file |
| 172 | +4. **Periodic Reporting**: |
| 173 | + - Stop profiler and rotate JFR files |
| 174 | + - Read completed JFR data |
| 175 | + - Package with metadata |
| 176 | + - Upload via configured reporters |
| 177 | + - Restart profiler with new JFR file |
| 178 | +5. **Shutdown**: Stop profiler and perform final report |
| 179 | + |
| 180 | +## Key Design Decisions |
| 181 | + |
| 182 | +### Dual JFR File Strategy |
| 183 | +Uses two temporary files to enable continuous profiling during report uploads. While one file receives new samples, the other is being processed and uploaded. |
| 184 | + |
| 185 | +### Builder Pattern Configuration |
| 186 | +Provides type-safe, ergonomic configuration with sensible defaults while supporting advanced customization. |
| 187 | + |
| 188 | +### Trait-Based Reporters |
| 189 | +Enables pluggable upload destinations without coupling core profiling logic to specific backends. |
| 190 | + |
| 191 | +### Optional AWS Integration |
| 192 | +AWS-specific features are behind feature flags, allowing use in non-AWS environments without unnecessary dependencies. |
| 193 | + |
| 194 | +### Thread Safety |
| 195 | +Designed for multi-threaded environments with careful synchronization around profiler state and file operations. |
| 196 | + |
| 197 | +## Feature Flags |
| 198 | + |
| 199 | +- `s3`: Full S3 reporter with default AWS SDK features |
| 200 | +- `s3-no-defaults`: S3 reporter without default features (for custom TLS) |
| 201 | +- `aws-metadata`: AWS metadata detection with default features |
| 202 | +- `aws-metadata-no-defaults`: AWS metadata without default features |
| 203 | +- `__unstable-fargate-cpu-count`: Experimental Fargate CPU metrics |
| 204 | + |
| 205 | +## Error Handling |
| 206 | + |
| 207 | +The design emphasizes resilience: |
| 208 | +- Reporter errors don't stop profiling |
| 209 | +- Profiler errors are logged but allow graceful degradation |
| 210 | +- Resource cleanup is guaranteed via RAII patterns |
| 211 | +- Temporary file management prevents resource leaks |
| 212 | + |
| 213 | +## Performance Considerations |
| 214 | + |
| 215 | +- Minimal overhead during normal operation |
| 216 | +- JFR file I/O is asynchronous and non-blocking |
| 217 | +- PollCatch hooks are optimized for the common case (no sample) |
| 218 | +- Memory allocation is minimized in hot paths |
| 219 | +- Background reporting doesn't interfere with application performance |