Skip to content

Commit e3302e3

Browse files
authored
fill out more sections in doc/maintainers.md (#56)
* error handling * logging
1 parent e9b715d commit e3302e3

File tree

1 file changed

+281
-2
lines changed

1 file changed

+281
-2
lines changed

doc/maintainers.md

Lines changed: 281 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -355,10 +355,278 @@ representation." In part, `finalize_config` already mitigates the problem.
355355
Abstaining from storing the finalized config as a data member is a step further.
356356
357357
### Error Handling
358-
TODO
358+
Most error scenarios within this library are individually enumerated by `enum
359+
Error::Code`, defined in [error.h][36].
360+
361+
As of this writing, some of the error codes are emitted in more than one place,
362+
but most of them are emitted in only one place, as is illustrated by the
363+
following incomprehensible command shell pipeline:
364+
```
365+
david@ein:~/src/dd-trace-cpp/src/datadog$ sed -n 's/^\s*\(\w\+\)\s*=\s*[0-9]\+,\?\s*$/{Error::\1/p' error.h | paste -d '|' -s | xargs git grep -P | sed -n 's/.*Error::\(\w\+\),.*/\1/p' | sort | uniq -c | sort -rn
366+
3 MESSAGEPACK_ENCODE_FAILURE
367+
2 MAX_PER_SECOND_OUT_OF_RANGE
368+
2 INVALID_INTEGER
369+
2 INVALID_DOUBLE
370+
2 CURL_REQUEST_SETUP_FAILED
371+
2 CURL_HTTP_CLIENT_ERROR
372+
1 ZERO_TRACE_ID
373+
1 URL_UNSUPPORTED_SCHEME
374+
1 URL_UNIX_DOMAIN_SOCKET_PATH_NOT_ABSOLUTE
375+
1 URL_MISSING_SEPARATOR
376+
1 UNKNOWN_PROPAGATION_STYLE
377+
1 TRACE_SAMPLING_RULES_WRONG_TYPE
378+
1 TRACE_SAMPLING_RULES_UNKNOWN_PROPERTY
379+
1 TRACE_SAMPLING_RULES_SAMPLE_RATE_WRONG_TYPE
380+
1 TRACE_SAMPLING_RULES_INVALID_JSON
381+
1 TAG_MISSING_SEPARATOR
382+
1 SPAN_SAMPLING_RULES_WRONG_TYPE
383+
1 SPAN_SAMPLING_RULES_UNKNOWN_PROPERTY
384+
1 SPAN_SAMPLING_RULES_SAMPLE_RATE_WRONG_TYPE
385+
1 SPAN_SAMPLING_RULES_MAX_PER_SECOND_WRONG_TYPE
386+
1 SPAN_SAMPLING_RULES_INVALID_JSON
387+
1 SPAN_SAMPLING_RULES_FILE_IO
388+
1 SERVICE_NAME_REQUIRED
389+
1 RULE_WRONG_TYPE
390+
1 RULE_TAG_WRONG_TYPE
391+
1 RULE_PROPERTY_WRONG_TYPE
392+
1 RATE_OUT_OF_RANGE
393+
1 OUT_OF_RANGE_INTEGER
394+
1 NO_SPAN_TO_EXTRACT
395+
1 MISSING_TRACE_ID
396+
1 MISSING_SPAN_INJECTION_STYLE
397+
1 MISSING_SPAN_EXTRACTION_STYLE
398+
1 MISSING_PARENT_SPAN_ID
399+
1 MALFORMED_TRACE_TAGS
400+
1 DUPLICATE_PROPAGATION_STYLE
401+
1 DATADOG_AGENT_NULL_HTTP_CLIENT
402+
1 DATADOG_AGENT_INVALID_FLUSH_INTERVAL
403+
1 CURL_REQUEST_FAILURE
404+
1 CURL_HTTP_CLIENT_SETUP_FAILED
405+
1 CURL_HTTP_CLIENT_NOT_RUNNING
406+
```
407+
The integer values of the enumerated `Error::Code`s are intended to be
408+
permanent. Given an error message from a log mentioning an error code 24, one
409+
can look at the source code for the matching `Error::Code`
410+
(`TAG_MISSING_SEPARATOR`) and then find all of the possible source origins for
411+
that error. At first this seems like a poor substitute for including the source
412+
file and line in the error message, but I suspect that this compromise is
413+
better.
414+
415+
In addition to an error code, each error condition is associated with a
416+
contextual diagnostic message. The diagnostic is not only a description of the
417+
error, but also contains runtime context that might help a user or a maintainer
418+
identify the underlying issue. For example, a message for
419+
`TAG_MISSING_SEPARATOR` might include the text that was being parsed, or mention
420+
the name of the relevant environment variable.
421+
422+
The `Error::Code code` and `std::string message` are combined in a `struct
423+
Error`, which is the error type used by this library.
424+
425+
`struct Error` has a convenience member function, `Error with_prefix(StringView)
426+
const`, that allows context to be added to an error. It's analogous to
427+
`catch`ing one exception and then `throw`ing another, e.g.
428+
```c++
429+
Expected<Cat> adopt_cat(AnimalShelter& shelter) {
430+
Expected<Cat> result = shelter.adopt(Animals::CAT, 1);
431+
if (Error *error = result.if_error()) {
432+
return error->with_prefix("While trying to adopt from AnimalShelter: ");
433+
}
434+
return result;
435+
}
436+
```
437+
438+
`struct Error` is most often used in conjunction with the `Expected` class
439+
template, which is defined in [expected.h][37].
440+
441+
`template <typename T> class Expected` is either a `T` or an `Error`. It is a
442+
wrapper around `std::variant<T, Error>`. It's inspired by C++23's
443+
[std::expected][38], but the two are not compatible.
444+
445+
Functions in this library that intend to return a "`Value`," but that might
446+
instead fail with an `Error`, return an `Expected<Value>`.
447+
448+
This library never reports errors by throwing an exception. However, the library
449+
will allow exceptions, such as `std::bad_alloc` and `std::bad_variant_access`,
450+
to pass through it, and does sometimes use exceptions internally. The intention
451+
is that a client of this library does not need to write `catch`. An alternative
452+
design would be to use exceptions for their intended purpose. We decided that,
453+
for a tracing library embedded in proxies, the ergonomics of error values are a
454+
better fit than exceptions.
455+
456+
`Expected` supports two syntaxes of use. The first is [std::get_if][39]-style
457+
unpacking with the `if_error` member function:
458+
```c++
459+
Expected<Salad> lunch = go_buy_lunch();
460+
if (Error *error = lunch.if_error()) {
461+
return error->with_prefix("Probably getting a hotdog instead: ");
462+
}
463+
eat_salad(*lunch); // or, lunch.value()
464+
```
465+
`if_error` returns a pointer to the `Error` if there is one, otherwise it
466+
returns `nullptr`. The body of a conditional statement such as `if` or `while`
467+
is allowed to contain a variable definition. If the conditional statement
468+
contains a variable definition, then the truth value of the resulting variable
469+
is used as the condition. So, if there is _not_ an error, then `if_error`
470+
returns `nullptr`, and so the declared `Error*` is `nullptr`, which is false for
471+
the purposes of conditionals, and so in that case the body (block) of the
472+
conditional statement is skipped.
473+
474+
`if_error` cannot be called on an [rvalue][40], because allowing this makes it
475+
far too easy to obtain a pointer to a destroyed object. I wrote the component
476+
and nonetheless made this mistake repeatedly. So, the rvalue-flavored overload
477+
of `Expected::if_error` is `delete`d.
478+
479+
The other syntax of use supported by `Expected` is the traditional
480+
check-and-accessor pattern:
481+
```c++
482+
Expected<Salad> lunch = go_buy_lunch();
483+
if (!lunch) {
484+
return lunch.error().with_prefix("Probably getting halal instead: ");
485+
}
486+
eat_salad(*lunch); // or, lunch.value()
487+
```
488+
489+
`Expected` defines `explicit operator bool`, which is `true` if the `Expected`
490+
contains a non-`Error`, and `false` if the `Expected` contains an `Error`. There
491+
is also `has_value()`, which returns the same thing.
492+
493+
Then the `Error` can be obtained via `error()`, or the non-`Error` can be
494+
obtained via `value()` or `operator*()`.
495+
496+
Why bother with `if_error`? Because I like it! I'm a sucker for structured
497+
bindings and pattern matching.
498+
499+
`template <typename T> class Expected` has one specialization: `Expected<void>`.
500+
`Expected<void>` is "either an `Error` or nothing." It's used to convey the
501+
result of an operation that might fail but that doesn't yield a value when it
502+
succeeds. It behaves in the same way as `Expected<T>`, except that `value()` and
503+
`operator*()` are not defined.
504+
505+
At first I instead used `std::optional<Error>` directly. The problem there was
506+
that `explicit operator bool` has the opposite meaning as it does in
507+
`Expected<T>`. I wanted error handling code to be the same in the two cases, and
508+
so I specialized `Expected<void>`. `Expected<void>` is implemented in terms of
509+
`std::optional<Error>`, but inverts the value of `explicit operator bool`.
359510

360511
### Logging
361-
TODO
512+
Can we write a tracing library that does not do any logging by itself? The
513+
previous section describes how errors are reported by the library, and no
514+
logging is involved there. Why not leave it up to the client to decide whether
515+
to note error conditions in the log or to proceed silently?
516+
517+
The problem is that the default implementations of [HTTPClient][15] ([Curl][18])
518+
and [EventScheduler][16] ([ThreadedEventScheduler][19]) do some of their work on
519+
background threads, where error conditions may occur without having a "caller"
520+
to notify, and where execution will proceed despite the error. In those cases,
521+
should the library squelch the error?
522+
523+
One option is for these async components to accept an `on_error(Error)` callback
524+
that will be invoked whenever an error occurs on the background thread that
525+
would not otherwise be reported to client code. What would a client library do
526+
in `on_error`? I think that it would either do nothing or log the error. So, an
527+
`on_error` callback for use in background threads is the same as a logging
528+
interface.
529+
530+
It's tempting to omit a logging interface, leaving it to clients to decide
531+
whether and how to log. Why have logging _and_ error reporting when you can have
532+
just error reporting?
533+
534+
We could do that. Here are three reasons why this library has a logging interface
535+
anyway:
536+
537+
1. Logging a `Tracer`'s configuration when it's initialized is a helpful
538+
diagnostic tool. A logging interface allows `Tracer` to do this explicitly,
539+
as opposed to counting on client code to log `Tracer::config_json()` itself.
540+
A client library can still suppress the startup message in its implementation
541+
of the logging interface, but this is more opt-out than opt-in.
542+
2. Along the same lines, reporting errors that occur on a background thread by
543+
invoking a logging interface allows for the library's default behavior to be
544+
to print an error message to a log, as opposed to having an `on_error`
545+
callback that a client library might choose to log within.
546+
3. A logging interface allows warnings to be logged, notifying a user of a
547+
potentially problematic, but valid, configuration. Sometimes these warnings
548+
are a matter of taste, and sometimes they are required by the specification
549+
of the feature being configured. Logging is not the only way to handle this —
550+
imagine if the return value of `finalize_config` included a list of warnings.
551+
But, as with the previous two points, a logging interface allows for the
552+
default behavior to be a logged message.
553+
554+
On the other hand, the primary clients of this library are NGINX and Envoy,
555+
where Datadog engineers have a say in how the library is used. So, the
556+
distinction is probably not so important.
557+
558+
The logging interface is `class Logger`, defined in [logger.h][41].
559+
560+
The design of `class Logger` is informed by three constraints:
561+
562+
1. Most "warn," "info," "debug," and "trace" severity logging is noise. It is
563+
more an artifact of feature development than it is a helpful event with which
564+
issues can be diagnosed. The logger should have few severity levels, or
565+
ideally only one.
566+
2. Logging libraries that obscure a conditional branch via the use of
567+
preprocessor macros are difficult to understand. Reverse engineering a
568+
composition of preprocessor macros is hard, and is not justified by the
569+
syntactic convenience that the macros provide.
570+
3. When client code decides that a message will not be logged, the library
571+
should minimize the amount of computation that is done — as close to nothing
572+
as is feasible.
573+
574+
(1) and (3) work well together.
575+
576+
On account of (1), `Logger` has only two "severities": "error" and "startup."
577+
`log_startup` is called once by a `Tracer` when it is initialized. `log_error`
578+
is called in all other logging circumstances.
579+
580+
On account of (3), `Logger` must do as little work as possible when an
581+
implementer decides that a logging function should not log. If this library were
582+
to build diagnostic messages in all cases, and then pass them to the `Logger`,
583+
then the cost of building the message would be paid even when the `Logger`
584+
decides not to log. On account of (2), the library cannot define a `DD_LOG`
585+
macro that hides an `if (logger.severity > ...)` branch. As a matter of taste, I
586+
don't want the library to be littered with such branches explicitly.
587+
588+
The compromise is to have `Logger::log_error` and `Logger::log_startup` accept a
589+
[std::function][44] that, if invoked, does the work of building the diagnostic
590+
message. When a `Logger` implementation decides not to log, the only cost paid
591+
at the call site is the construction of the `std::function` and the underlying
592+
capturing lambda expression with which it was likely initialized. The signature
593+
of the `std::function`, aliased as `Logger::LogFunc`, is `void(std::ostream&)`.
594+
I don't know whether this results in a real runtime savings in the not-logging
595+
case, but it seems likely.
596+
597+
`Logger` also has two convenience overloads of `log_error`: one that takes a
598+
`const Error&` and one that takes a `StringView`.
599+
600+
The `const Error&` overload reveals a contradiction in the design. If an `Error`
601+
is produced in a context where its only destiny is to be passed to
602+
`Logger::log_error(const Error&)`, then the cost of building `Error::message`
603+
will always be paid, in violation of constraint (3). One way to avoid this would
604+
be to have versions of the `Error`-returning operations that instead accept a
605+
`Logger&` and use `Logger::log_error(Logger::LogFunc)` directly. I think that
606+
the present state of things is acceptable and that such a change is not
607+
warranted.
608+
609+
The default implementation of `Logger` is `CerrLogger`, defined in
610+
[cerr_logger.h][42]. `CerrLogger` logs to [std::cerr][43] in both `log_error`
611+
and `log_startup`.
612+
613+
The `Logger` used by a `Tracer` is configured via `std::shared_ptr<Logger>
614+
TracerConfig::logger`, defined in [tracer_config.h][10].
615+
616+
Finally, constraint (1) is still a matter of debate. Support representatives
617+
have expressed frustration with this library's, and its predecessor's, lack of a
618+
"debug mode" that logs the reason for every decision made throughout the
619+
processing of a trace. This issue deserves revisiting. Here are some potential
620+
approaches:
621+
622+
- Introduce a "debug mode" that sends fine-grained internal information to
623+
Datadog, rather than to a log, either as hidden tags on the trace or via a
624+
different data channel, such as the [Telemetry API][45].
625+
- Introduce a "trace the tracer" mode that generates additional traces that
626+
represent operations performed by the tracer. This technique was prototyped in
627+
the [david.goffredo/traception][46] branch.
628+
- Add a `Logger::log_debug` function that optionally prints fine-grained tracing
629+
information to the log, in violation of constraint (1).
362630

363631
Operations
364632
----------
@@ -417,3 +685,14 @@ TODO
417685
[33]: ../src/datadog/trace_sampler_config.h
418686
[34]: ../src/datadog/span_sampler_config.h
419687
[35]: ../src/datadog/rate.h
688+
[36]: ../src/datadog/error.h
689+
[37]: ../src/datadog/expected.h
690+
[38]: https://en.cppreference.com/w/cpp/utility/expected
691+
[39]: https://en.cppreference.com/w/cpp/utility/variant/get_if
692+
[40]: https://en.cppreference.com/w/cpp/language/value_category
693+
[41]: ../src/datadog/logger.h
694+
[42]: ../src/datadog/cerr_logger.h
695+
[43]: https://en.cppreference.com/w/cpp/io/cerr
696+
[44]: https://en.cppreference.com/w/cpp/utility/functional/function
697+
[45]: https://github.com/DataDog/datadog-agent/blob/796ccb9e92326c85b51f519291e86eb5bc950180/pkg/trace/api/endpoints.go#L97
698+
[46]: https://github.com/DataDog/dd-trace-cpp/tree/david.goffredo/traception

0 commit comments

Comments
 (0)