| 1 | +# Tracing |
| 2 | + |
| 3 | +Tracing is a powerful tool for identifying performance issues and bottlenecks in code. |
| 4 | + |
| 5 | +> Profiling on GPUs is trickier because of asynchronous execution; see the [GPU section](#gpu). |
| 6 | + |
| 7 | +## Overview |
| 8 | + |
| 9 | +Candle uses the [tracing](https://docs.rs/tracing/latest/tracing/) crate for instrumentation. |
| 10 | + |
| 11 | +To try it out, run an example in `candle-examples` with the `--tracing` flag. |
| 12 | +This generates a trace file, typically named `trace-<timestamp>.json`. |
| 13 | +You can view the trace in Chrome by navigating to `chrome://tracing/`, clicking **Load**, and selecting the generated trace file. |
| 14 | + |
| 15 | +## Adding Tracing |
| 16 | + |
| 17 | +Candle includes built-in tracing for many internal operations, using [spans](https://docs.rs/tracing/latest/tracing/struct.Span.html) to mark key points of execution. |
| 18 | + |
| 19 | +To add custom tracing in your code, you can define a span like this: |
| 20 | + |
| 21 | +```rust |
| 22 | +let span = tracing::span!(tracing::Level::TRACE, "my-span"); // "my-span" is a placeholder; the name must be a string literal or constant |
| 23 | +``` |
| 24 | + |
| 25 | +Then, to record the span during execution, create a guard: |
| 26 | + |
| 27 | +```rust |
| 28 | +let _enter = span.enter(); |
| 29 | +``` |
| 30 | + |
| 31 | +The span stays entered from the moment the guard is created until the guard is dropped, and the subscriber currently installed through the tracing crate records that duration. |
| 32 | + |
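The guard behaviour described above relies on Rust's drop semantics. As a rough illustration, here is a std-only sketch of the same pattern. `TimedGuard`, `RECORDS`, and `recorded_names` are hypothetical names invented for this example; this is not how the tracing crate is implemented, only the general RAII idea behind a guard that records an interval when it goes out of scope.

```rust
use std::cell::RefCell;
use std::time::{Duration, Instant};

thread_local! {
    // Hypothetical stand-in for a subscriber's storage: (span name, elapsed time).
    static RECORDS: RefCell<Vec<(&'static str, Duration)>> = RefCell::new(Vec::new());
}

/// Hypothetical guard type; it is NOT the guard returned by `Span::enter`.
struct TimedGuard {
    name: &'static str,
    start: Instant,
}

impl TimedGuard {
    /// Starts timing as soon as the guard is created.
    fn enter(name: &'static str) -> Self {
        TimedGuard { name, start: Instant::now() }
    }
}

impl Drop for TimedGuard {
    /// Records the elapsed time when the guard goes out of scope.
    fn drop(&mut self) {
        let elapsed = self.start.elapsed();
        RECORDS.with(|r| r.borrow_mut().push((self.name, elapsed)));
    }
}

/// Runs a small workload inside a guard and returns the recorded span names.
fn recorded_names() -> Vec<&'static str> {
    {
        let _enter = TimedGuard::enter("matmul");
        let _sum: u64 = (0..1_000u64).sum();
    } // `_enter` is dropped here, which ends the measured interval
    RECORDS.with(|r| r.borrow().iter().map(|(name, _)| *name).collect())
}
```

Note that the guard is bound to `_enter` rather than `_`: in Rust, `let _ = ...` drops the value immediately, which would end the interval before any work runs. The same applies to the guard returned by `span.enter()`.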
| 33 | +## Recording and Saving a Trace |
| 34 | + |
| 35 | +To capture and save trace data, you need to configure the tracing system with an output format. Candle uses the [tracing_subscriber](https://docs.rs/tracing-subscriber/latest/tracing_subscriber/) and [tracing_chrome](https://docs.rs/tracing-chrome/latest/tracing_chrome/) crates. |
| 36 | + |
| 37 | +The snippet below sets up a Chrome-compatible recorder that logs all tracing activity between the creation and the drop of the guard: |
| 38 | + |
| 39 | +```rust |
| 40 | +use tracing_chrome::ChromeLayerBuilder; |
| 41 | +use tracing_subscriber::prelude::*; |
| 42 | + |
| 43 | +let _guard = { |
| 44 | + let (chrome_layer, guard) = ChromeLayerBuilder::new().build(); |
| 45 | + tracing_subscriber::registry().with(chrome_layer).init(); |
| 46 | + guard |
| 47 | +}; |
| 48 | +``` |
| 49 | + |
| 50 | +## GPU |
| 51 | + |
| 52 | +When using CUDA, Metal, or other asynchronous GPU backends, tracing may produce misleading timing data because operations are queued rather than executed immediately. |
| 53 | + |
| 54 | +### CUDA |
| 55 | + |
| 56 | +For CUDA-specific profiling, you have two options: |
| 57 | + |
| 58 | +1. Set the environment variable `CUDA_LAUNCH_BLOCKING=1`, which forces synchronous execution. This makes trace timings more accurate, at the cost of reduced performance. |
| 59 | +2. Use [NVIDIA's Nsight Systems](https://developer.nvidia.com/nsight-systems) (`nsys profile` and `nsys-ui`), which is designed specifically for profiling asynchronous CUDA execution. |
| 60 | + |
| 61 | +We recommend using NVIDIA's Nsight Systems when possible, as it offers accurate performance data without altering typical execution patterns. Setting `CUDA_LAUNCH_BLOCKING`, by contrast, can significantly change how the program executes. |
| 62 | + |
| 63 | +#### Performance Profiling with NVIDIA Nsight Systems |
| 64 | + |
| 65 | +1. Generate an `.nsys-rep` file containing performance data ([docs](https://docs.nvidia.com/nsight-systems/UserGuide/index.html#example-single-command-lines)) |
| 66 | + - Run `nsys profile --trace cuda,nvtx,osrt --gpu-metrics-device=all --output profile_run ./target/debug/... --prompt "whatever "` |
| 67 | +1. Open the generated `.nsys-rep` report file in the Nsight Systems GUI |
| 68 | + - File > Open |