This repo is largely motivated by the need for observability in init4 and
the broader crypto ecosystem. The Rust tracing crate is a modern and sleek
solution for instrumenting code, but it is massively underutilitized in the
crypto ecosystem.
I think this is largely because crypto-industry developers are generally isolated from institutional best practices. We tend to be self-taught, or come from non-traditional backgrounds. We don't have the same exposure to the tools and best-practices that are common in large enterprises, and we tend not to have experience operating large-scale systems.
Today, lighthouse and reth supports distributed tracing via OTLP, but while
ethrex is instrumented with events, it only uses tracing as a form of
logging. This is a missed opportunity, and the client can be considered ready
for institutional use until it is properly spanned, and supports exporting to
a distributed tracing backend.
So first of all, what is tracing? "Tracing" is best understood as a way of recording information about a program's execution, in a structured way. It is an evolution of, and improvement on, traditional logging.
Developers are familiar with logging. We're used to console debugging like
println!("what the heck is {}, {}", x, y);, and logging is a natural
extension of that. We add log lines to our code, and then we can see what
happened when we ran the program. This works great for individuals, or in small
programs, but it doesn't scale well to large systems.
There are a few problems that prevent logging from scaling up. The most annoying for me are the following:
- Logs are discrete, while systems are continuous.
- Logs are localized, while systems are distributed.
- Log data is text, while system data is structured.
- Logs are uncorrelated, while system events are related.
So what do these mean?
-
Logs are discrete, while systems are continuous.
A program runs continuously, and we want to understand what it is doing over time. Logs are discrete events, and we have to infer the program's behavior from a series of log lines. This is hard, and often impossible. To correct this, tracing's core primitive is not the event, but the span. A span represents a unit of work, and has a start and end time. Spans can be nested, and we can see the relationships between spans. This gives us a much richer understanding of what the program is doing.
-
Logs are isolated, while system events are related.
Logs are just a series of unrelated events, and a reader has to infer the relationships between them. Tracing spans are nested by default, and events are associated with a specific set of spans. This makes it much easier to understand the relationships between events. Filtering events by the span that contains them is a powerful way to understand the behavior of a system.
-
Logs are localized, while systems are distributed.
Logs are generated by a single program, and localized to the machine that the program is running on. In a distributed system, we have many programs running on many machines, and we want to understand how they interact. Tracing standards allow correlation of events across service boundaries (!!!). E.g. the
traceparentheader standard allows correlation of events between different system components communicating over HTTP. Properly-instrumented systems can be traced end-to-end, even when they are distributed across many services. -
Log data is text, while system data is structured.
Logs are just text, and we have to parse them to extract useful information. This is hard, and often impossible. Tracing data is structured, and we can attach arbitrary key-value pairs to spans and events. This makes it much easier to extract useful information. Structured data allows consumers to filter, aggregate, and analyze the data in powerful ways.
If we can stop thinking of tracing as "better logging", and start thinking of
it as critical infrastructure for understanding complex systems, our software
will be more maintainable, more mature, and more ready for institutional use.
The goal of instrumenting code with tracing is to allow us to observe its behavior both in specific instances, and in aggregate. We want to be able to see how it behaved in a specific run, and we want to be able to see how it behaves across many runs.
Distributed tracing is MANDATORY for operating institutional-grade systems. It is not enough to have logs and log collectors. We need to be able to trace the entire system.
There are many options, both open-source and commercial. Some popular options are:
- Jaeger (open-source)
- Zipkin (open-source)
- Grafana Tempo (open-source)
- Honeycomb (commercial)
- Datadog (commercial)
- Groundcover (commercial, init4's current vendor)
Remember, all software is work-in-progress. You do not have to ship with a perfectly instrumented codebase. These best practices are a guide to improving your codebase over time. You can start small, and improve your instrumentation when you have spare time, or when investigating specific issues. All tracing is good tracing, so keep it up!
Event and span level should correspond to the significance of the event as follows:
TRACE- low-level, detailed information. This is useful for debugging, but is not necessary for normal operation. This may include very large data or very frequent events. This should be used rarely. Examples: the body of an http request. Every packet received by a network server.DEBUG- low-level lifecycle information that is useful for debugging, but not for tracing normal operation. This should be used sparingly. Examples: the result of a single database query. The result of a single function call.INFO- lifecycle information that is useful for understanding the operation of the system. This should be the default level for most events. Examples: the start of request processing. Connection established to a database. Initialization of a service succeeded.WARN- lifecycle information that indicates a potential problem. These are often ignorable errors that may indicate a problem, but do not prevent the system from operating. Examples: a request took longer than expected. Input failed to parse, but was ignored.ERROR- lifecycle information that indicates a problem that prevents the system from operating correctly. These are often fatal errors that require human intervention. Examples: a database connection failed. A required file was not found.
The tracing example in this repo uses the OTEL_FILTER env var to set the
level of tracing. e.g. export OTEL_FILTER=info will show only INFO, WARN,
and ERROR events.
// avoid this
warn!("Connected to database");
// instead do this
info!("Connected to database");Spans represent the duration of a unit of work. They are the primary data structure in tracing, and wrap span events produced by the event macros. Spans should be:
- time-limited - at most a few seconds.
- work-associated - a specific action the program is taking.
- informative - have useful data attached to them, and not over-verbose.
Spans inherit the currently-entered span as their parent, when they are created. Try to avoid spurious span relationships, as it can lead to non-useful trace data.
// avoid this
let span = info_span!("accidental child");
...
let my_closure = move || {
// Moving the span into the closure captures the outer current span
// WHEN THE CLOSURE IS CREATED, not when it is called.
let guard = span.entered();
// do some work
};
do_work(my_closure);
// instead do this
let span = info_span!("outer_function").entered();
let my_closure = || {
let span = info_span!("intentional child").entered();
// do some work
};
do_work(my_closure);This is related to Root Spans
When instrumenting a fn that takes self as its first argument, you should
almost always skip(self). Add properties from self to the span manually. By
default, the self argument will be added to the span as a field using
its Debug implementation. This can be very verbose, and is rarely useful. It
is better to add only the fields you need to the span.
// avoid this
#[instrument]
async fn my_method(&self) {
// ...
}
// instead do this
#[instrument(skip(self), fields(self.id = self.id))]
async fn my_method(&self) {
// ...
}When there are several arguments, you should consider skipping all of them, and then adding back only the ones you need. This makes the span more readable, and makes it clear which arguments are important.
// avoid this
#[instrument]
async fn my_method(&self, arg1: i32, arg2: String) {
// ...
}
// instead do this
#[instrument(skip_all, fields(arg1))]
async fn my_method(&self, arg1: i32, arg2: String) {
// ...
}
// this works too, but is not preferred, as it is not immediately clear
#[instrument(skip(self, arg2))]
async fn my_method(&self, arg1: i32, arg2: String) {
// ...
}When spawning a future, use the instrument method on the future itself,
rather than on the JoinHandle that tokio::spawn returns. Attaching the span
to the JoinHandle will NOT propagate the span to the future when it runs, and
the span contents will not be recorded alongside the future's events.
// avoid this
tokio::spawn(fut).instrument(span);
// instead do this
tokio::spawn(fut.instrument(span));Avoid adding spans to tasks that you expect to run for a long time. Instead, make a new span in the task's internal working loop each time it runs. Then drop that span at the end of the loop. This way, the span will be closed and its contents recorded each time the task runs, rather than being left open for the entire lifetime of the task.
// avoid this
let span = info_span!("task");
tokio::spawn(async {
loop {
// do some work
tokio::time::sleep(Duration::from_secs(1)).await;
}
}.instrument(span));
// instead do this
tokio::spawn(async {
loop {
let guard = info_span!("loop_iteration").entered();
// do some work
drop(guard);
tokio::time::sleep(Duration::from_secs(1)).await;
}
};When creating a span, think carefully if it should be a "root" span. Root spans are the top-level spans in a trace, and are the entry points for the trace. In groundcover, a root span will be the base of the UI tree. As such, it is important that your root span corresponds to a SINGLE UNIT OF WORK.
If your span contains a succession of work units, consider making a new root span for each work unit. This will make the trace easier to read and understand.
This is related to Instrument work, not tasks, but is a higher-level concern.
// avoid this
let span = info_span!("task");
for item in my_vec {
let work_span = info_span!("work_unit").entered();
// do some work
}
// instead do this
let work_loop = info_span!("work_loop").entered();
for item in my_vec {
// We explicitly set the parent to `None`
let span = info_span!(parent: None, "work_unit").entered();
// do some work
drop(span);
}When using the #[instrument] attribute, you can set parent to None to create a root span.
// avoid this
#[instrument]
async fn a_unit_of_work() {
// do some work
}
// instead do this
#[instrument(parent = None)]
async fn a_unit_of_work() {
// do some work
}Using #[instrument(err)] can result in the same error being emitted many
times, as EACH span generated this way will contain a unique error! event.
Instead, only root spans should have instrument(err) on them. This
ensures that we always associate the error with the root span that it
terminated, rather than associating it at each individual level
// avoid this
#[instrument(err)]
async fn one() -> Result<(), ()> {
// do work
}
#[instrument(err)]
async fn two() -> Result<(), ()> {
one().await?;
// more work
}
// instead do this
#[instrument]
async fn one() -> Result<(), ()> {
// do work
}
#[instrument(parent = None, err)]
async fn two() -> Result<(), ()> {
one().await?;
// more work
}If we need to inspect the error-bubbling process more, we can add additional info to the current span like this:
#[instrument]
async fn do_thing2() -> std::io::Result<()> { }
#[instrument(err, fields(err_source = tracing::field::Empty))]
async fn do_thing3() -> std::io::Result<()> {
do_thing2().await.inspect_err(|_| {
tracing::span::Span::current().record("err_source", "do_thing2");
})
}Events represent a single point in time. They are associated with the currently-open span. In init4 systems, each event will result in a log line AND will be exported via OTLP to the tracing backend. Events should be:
- informative - have useful data attached to them, and not over-verbose.
- descriptive - the message should be clear and concise.
- lifecycle-aware - events should be used to record the lifecycle of a unit of work.
- non-repetitive - the event should fire ONCE in its span's lifetime.
Events are structured data, like JSON. They are converted to log lines by a specific formatter, but also kept as fully typed data. The typed data is exported via OTLP. Using string interpolation leads to loss of type information, and can make the data harder to parse and use.
// avoid this
info!("Value calculated: {}", x);
// instead do this
info!(x, "Value calculated");Events should be used to record changes in the state of a unit of work. This means they should capture significant lifecycle steps. Avoid using events to record every step in a process, as this can lead to over-verbose traces. And avoid using events to record starts/ends of a process, as this is what spans are for. Events should be achievements, not steps.
// avoid this
info!("Parsing input");
let parsed = parse_input(input);
info!("Input parsed");
// instead do this
let span = info_span!("parse_input").entered();
let parsed = parse_input(input);
drop(span);
// even better
#[instrument(skip(input), fields(input_size = input.len()))]
fn parse_input(input: String) -> Option<ParsedInput> {
// do some work
}
let parsed = parse_input(input);If you’re firing the same event many times, it’s likely that you’re violating either a span rule, or a verbosity rule. We want to instrument a single unit of work, which implies that everything important happens once. If the same event is firing many times in the same span, we’ve probably got some poorly designed spans that need to be fixed, or we are firing unimportant events and their level should be downgraded.
// avoid this
for i in my_vec {
info!(i, "processing");
do_work(i);
}
// instead do this
for i in my_vec {
do_work(i);
trace!(i, "processed vec item");
}
info!(my_vec.len(), "processed my vec");