@@ -355,10 +355,278 @@ representation." In part, `finalize_config` already mitigates the problem.
355355Abstaining from storing the finalized config as a data member is a step further.
356356
357357### Error Handling
358- TODO
358+ Most error scenarios within this library are individually enumerated by `enum
359+ Error::Code`, defined in [error.h][36].
360+
361+ As of this writing, some of the error codes are emitted in more than one place,
362+ but most of them are emitted in only one place, as is illustrated by the
363+ following incomprehensible command shell pipeline:
364+ ```
365+ david@ein:~ /src/dd-trace-cpp/src/datadog$ sed -n 's/^\s* \( \w\+\) \s* =\s* [ 0-9] \+ ,\? \s* $/{Error::\1/p' error.h | paste -d '|' -s | xargs git grep -P | sed -n 's/.* Error::\( \w\+\) ,.* /\1/p' | sort | uniq -c | sort -rn
366+ 3 MESSAGEPACK_ENCODE_FAILURE
367+ 2 MAX_PER_SECOND_OUT_OF_RANGE
368+ 2 INVALID_INTEGER
369+ 2 INVALID_DOUBLE
370+ 2 CURL_REQUEST_SETUP_FAILED
371+ 2 CURL_HTTP_CLIENT_ERROR
372+ 1 ZERO_TRACE_ID
373+ 1 URL_UNSUPPORTED_SCHEME
374+ 1 URL_UNIX_DOMAIN_SOCKET_PATH_NOT_ABSOLUTE
375+ 1 URL_MISSING_SEPARATOR
376+ 1 UNKNOWN_PROPAGATION_STYLE
377+ 1 TRACE_SAMPLING_RULES_WRONG_TYPE
378+ 1 TRACE_SAMPLING_RULES_UNKNOWN_PROPERTY
379+ 1 TRACE_SAMPLING_RULES_SAMPLE_RATE_WRONG_TYPE
380+ 1 TRACE_SAMPLING_RULES_INVALID_JSON
381+ 1 TAG_MISSING_SEPARATOR
382+ 1 SPAN_SAMPLING_RULES_WRONG_TYPE
383+ 1 SPAN_SAMPLING_RULES_UNKNOWN_PROPERTY
384+ 1 SPAN_SAMPLING_RULES_SAMPLE_RATE_WRONG_TYPE
385+ 1 SPAN_SAMPLING_RULES_MAX_PER_SECOND_WRONG_TYPE
386+ 1 SPAN_SAMPLING_RULES_INVALID_JSON
387+ 1 SPAN_SAMPLING_RULES_FILE_IO
388+ 1 SERVICE_NAME_REQUIRED
389+ 1 RULE_WRONG_TYPE
390+ 1 RULE_TAG_WRONG_TYPE
391+ 1 RULE_PROPERTY_WRONG_TYPE
392+ 1 RATE_OUT_OF_RANGE
393+ 1 OUT_OF_RANGE_INTEGER
394+ 1 NO_SPAN_TO_EXTRACT
395+ 1 MISSING_TRACE_ID
396+ 1 MISSING_SPAN_INJECTION_STYLE
397+ 1 MISSING_SPAN_EXTRACTION_STYLE
398+ 1 MISSING_PARENT_SPAN_ID
399+ 1 MALFORMED_TRACE_TAGS
400+ 1 DUPLICATE_PROPAGATION_STYLE
401+ 1 DATADOG_AGENT_NULL_HTTP_CLIENT
402+ 1 DATADOG_AGENT_INVALID_FLUSH_INTERVAL
403+ 1 CURL_REQUEST_FAILURE
404+ 1 CURL_HTTP_CLIENT_SETUP_FAILED
405+ 1 CURL_HTTP_CLIENT_NOT_RUNNING
406+ ```
407+ The integer values of the enumerated `Error::Code`s are intended to be
408+ permanent. Given an error message from a log mentioning an error code 24, one
409+ can look at the source code for the matching `Error::Code`
410+ (`TAG_MISSING_SEPARATOR`) and then find all of the possible source origins for
411+ that error. At first this seems like a poor substitute for including the source
412+ file and line in the error message, but I suspect that this compromise is
413+ better.
414+
415+ In addition to an error code, each error condition is associated with a
416+ contextual diagnostic message. The diagnostic is not only a description of the
417+ error, but also contains runtime context that might help a user or a maintainer
418+ identify the underlying issue. For example, a message for
419+ `TAG_MISSING_SEPARATOR` might include the text that was being parsed, or mention
420+ the name of the relevant environment variable.
421+
422+ The `Error::Code code` and `std::string message` are combined in a `struct
423+ Error`, which is the error type used by this library.
424+
425+ `struct Error` has a convenience member function, `Error with_prefix(StringView)
426+ const`, that allows context to be added to an error. It's analogous to
427+ `catch`ing one exception and then `throw`ing another, e.g.
428+ ```c++
429+ Expected<Cat> adopt_cat(AnimalShelter& shelter) {
430+ Expected<Cat> result = shelter.adopt(Animals::CAT, 1);
431+ if (Error *error = result.if_error()) {
432+ return error->with_prefix("While trying to adopt from AnimalShelter: ");
433+ }
434+ return result;
435+ }
436+ ```
437+
438+ ` struct Error ` is most often used in conjunction with the ` Expected ` class
439+ template, which is defined in [ expected.h] [ 37 ] .
440+
441+ ` template <typename T> class Expected ` is either a ` T ` or an ` Error ` . It is a
442+ wrapper around ` std::variant<T, Error> ` . It's inspired by C++23's
443+ [ std::expected] [ 38 ] , but the two are not compatible.
444+
445+ Functions in this library that intend to return a "` Value ` ," but that might
446+ instead fail with an ` Error ` , return an ` Expected<Value> ` .
447+
448+ This library never reports errors by throwing an exception. However, the library
449+ will allow exceptions, such as ` std::bad_alloc ` and ` std::bad_variant_access ` ,
450+ to pass through it, and does sometimes use exceptions internally. The intention
451+ is that a client of this library does not need to write ` catch ` . An alternative
452+ design would be to use exceptions for their intended purpose. We decided that,
453+ for a tracing library embedded in proxies, the ergonomics of error values are a
454+ better fit than exceptions.
455+
456+ ` Expected ` supports two syntaxes of use. The first is [ std::get_if] [ 39 ] -style
457+ unpacking with the ` if_error ` member function:
458+ ``` c++
459+ Expected<Salad> lunch = go_buy_lunch();
460+ if (Error *error = lunch.if_error()) {
461+ return error->with_prefix("Probably getting a hotdog instead: ");
462+ }
463+ eat_salad (* lunch); // or, lunch.value()
464+ ```
465+ `if_error` returns a pointer to the `Error` if there is one, otherwise it
466+ returns `nullptr`. The body of a conditional statement such as `if` or `while`
467+ is allowed to contain a variable definition. If the conditional statement
468+ contains a variable definition, then the truth value of the resulting variable
469+ is used as the condition. So, if there is _not_ an error, then `if_error`
470+ returns `nullptr`, and so the declared `Error*` is `nullptr`, which is false for
471+ the purposes of conditionals, and so in that case the body (block) of the
472+ conditional statement is skipped.
473+
474+ `if_error` cannot be called on an [rvalue][40], because allowing this makes it
475+ far too easy to obtain a pointer to a destroyed object. I wrote the component
476+ and nonetheless made this mistake repeatedly. So, the rvalue-flavored overload
477+ of `Expected::if_error` is `delete`d.
478+
479+ The other syntax of use supported by `Expected` is the traditional
480+ check-and-accessor pattern:
481+ ```c++
482+ Expected<Salad> lunch = go_buy_lunch();
483+ if (!lunch) {
484+ return lunch.error().with_prefix("Probably getting halal instead: ");
485+ }
486+ eat_salad(*lunch); // or, lunch.value()
487+ ```
488+
489+ ` Expected ` defines ` explicit operator bool ` , which is ` true ` if the ` Expected `
490+ contains a non-` Error ` , and ` false ` if the ` Expected ` contains an ` Error ` . There
491+ is also ` has_value() ` , which returns the same thing.
492+
493+ Then the ` Error ` can be obtained via ` error() ` , or the non-` Error ` can be
494+ obtained via ` value() ` or ` operator*() ` .
495+
496+ Why bother with ` if_error ` ? Because I like it! I'm a sucker for structured
497+ bindings and pattern matching.
498+
499+ ` template <typename T> class Expected ` has one specialization: ` Expected<void> ` .
500+ ` Expected<void> ` is "either an ` Error ` or nothing." It's used to convey the
501+ result of an operation that might fail but that doesn't yield a value when it
502+ succeeds. It behaves in the same way as ` Expected<T> ` , except that ` value() ` and
503+ ` operator*() ` are not defined.
504+
505+ At first I instead used ` std::optional<Error> ` directly. The problem there was
506+ that ` explicit operator bool ` has the opposite meaning as it does in
507+ ` Expected<T> ` . I wanted error handling code to be the same in the two cases, and
508+ so I specialized ` Expected<void> ` . ` Expected<void> ` is implemented in terms of
509+ ` std::optional<Error> ` , but inverts the value of ` explicit operator bool ` .
359510
360511### Logging
361- TODO
512+ Can we write a tracing library that does not do any logging by itself? The
513+ previous section describes how errors are reported by the library, and no
514+ logging is involved there. Why not leave it up to the client to decide whether
515+ to note error conditions in the log or to proceed silently?
516+
517+ The problem is that the default implementations of [ HTTPClient] [ 15 ] ([ Curl] [ 18 ] )
518+ and [ EventScheduler] [ 16 ] ([ ThreadedEventScheduler] [ 19 ] ) do some of their work on
519+ background threads, where error conditions may occur without having a "caller"
520+ to notify, and where execution will proceed despite the error. In those cases,
521+ should the library squelch the error?
522+
523+ One option is for these async components to accept an ` on_error(Error) ` callback
524+ that will be invoked whenever an error occurs on the background thread that
525+ would not otherwise be reported to client code. What would a client library do
526+ in ` on_error ` ? I think that it would either do nothing or log the error. So, an
527+ ` on_error ` callback for use in background threads is the same as a logging
528+ interface.
529+
530+ It's tempting to omit a logging interface, leaving it to clients to decide
531+ whether and how to log. Why have logging _ and_ error reporting when you can have
532+ just error reporting?
533+
534+ We could do that. Here are three reasons why this library has a logging interface
535+ anyway:
536+
537+ 1 . Logging a ` Tracer ` 's configuration when it's initialized is a helpful
538+ diagnostic tool. A logging interface allows ` Tracer ` to do this explicitly,
539+ as opposed to counting on client code to log ` Tracer::config_json() ` itself.
540+ A client library can still suppress the startup message in its implementation
541+ of the logging interface, but this is more opt-out than opt-in.
542+ 2 . Along the same lines, reporting errors that occur on a background thread by
543+ invoking a logging interface allows for the library's default behavior to be
544+ to print an error message to a log, as opposed to having an ` on_error `
545+ callback that a client library might choose to log within.
546+ 3 . A logging interface allows warnings to be logged, notifying a user of a
547+ potentially problematic, but valid, configuration. Sometimes these warnings
548+ are a matter of taste, and sometimes they are required by the specification
549+ of the feature being configured. Logging is not the only way to handle this —
550+ imagine if the return value of ` finalize_config ` included a list of warnings.
551+ But, as with the previous two points, a logging interface allows for the
552+ default behavior to be a logged message.
553+
554+ On the other hand, the primary clients of this library are NGINX and Envoy,
555+ where Datadog engineers have a say in how the library is used. So, the
556+ distinction is probably not so important.
557+
558+ The logging interface is ` class Logger ` , defined in [ logger.h] [ 41 ] .
559+
560+ The design of ` class Logger ` is informed by three constraints:
561+
562+ 1 . Most "warn," "info," "debug," and "trace" severity logging is noise. It is
563+ more an artifact of feature development than it is a helpful event with which
564+ issues can be diagnosed. The logger should have few severity levels, or
565+ ideally only one.
566+ 2 . Logging libraries that obscure a conditional branch via the use of
567+ preprocessor macros are difficult to understand. Reverse engineering a
568+ composition of preprocessor macros is hard, and is not justified by the
569+ syntactic convenience that the macros provide.
570+ 3 . When client code decides that a message will not be logged, the library
571+ should minimize the amount of computation that is done — as close to nothing
572+ as is feasible.
573+
574+ (1) and (3) work well together.
575+
576+ On account of (1), ` Logger ` has only two "severities": "error" and "startup."
577+ ` log_startup ` is called once by a ` Tracer ` when it is initialized. ` log_error `
578+ is called in all other logging circumstances.
579+
580+ On account of (3), ` Logger ` must do as little work as possible when an
581+ implementer decides that a logging function should not log. If this library were
582+ to build diagnostic messages in all cases, and then pass them to the ` Logger ` ,
583+ then the cost of building the message would be paid even when the ` Logger `
584+ decides not to log. On account of (2), the library cannot define a ` DD_LOG `
585+ macro that hides an ` if (logger.severity > ...) ` branch. As a matter of taste, I
586+ don't want the library to be littered with such branches explicitly.
587+
588+ The compromise is to have ` Logger::log_error ` and ` Logger::log_startup ` accept a
589+ [ std::function] [ 44 ] that, if invoked, does the work of building the diagnostic
590+ message. When a ` Logger ` implementation decides not to log, the only cost paid
591+ at the call site is the construction of the ` std::function ` and the underlying
592+ capturing lambda expression with which it was likely initialized. The signature
593+ of the ` std::function ` , aliased as ` Logger::LogFunc ` , is ` void(std::ostream&) ` .
594+ I don't know whether this results in a real runtime savings in the not-logging
595+ case, but it seems likely.
596+
597+ ` Logger ` also has two convenience overloads of ` log_error ` : one that takes a
598+ ` const Error& ` and one that takes a ` StringView ` .
599+
600+ The ` const Error& ` overload reveals a contradiction in the design. If an ` Error `
601+ is produced in a context where its only destiny is to be passed to
602+ ` Logger::log_error(const Error&) ` , then the cost of building ` Error::message `
603+ will always be paid, in violation of constraint (3). One way to avoid this would
604+ be to have versions of the ` Error ` -returning operations that instead accept a
605+ ` Logger& ` and use ` Logger::log_error(Logger::LogFunc) ` directly. I think that
606+ the present state of things is acceptable and that such a change is not
607+ warranted.
608+
609+ The default implementation of ` Logger ` is ` CerrLogger ` , defined in
610+ [ cerr_logger.h] [ 42 ] . ` CerrLogger ` logs to [ std::cerr] [ 43 ] in both ` log_error `
611+ and ` log_startup ` .
612+
613+ The ` Logger ` used by a ` Tracer ` is configured via `std::shared_ptr<Logger >
614+ TracerConfig::logger`, defined in [ tracer_config.h] [ 10 ] .
615+
616+ Finally, constraint (1) is still a matter of debate. Support representatives
617+ have expressed frustration with this library's, and its predecessor's, lack of a
618+ "debug mode" that logs the reason for every decision made throughout the
619+ processing of a trace. This issue deserves revisiting. Here are some potential
620+ approaches:
621+
622+ - Introduce a "debug mode" that sends fine-grained internal information to
623+ Datadog, rather than to a log, either as hidden tags on the trace or via a
624+ different data channel, such as the [ Telemetry API] [ 45 ] .
625+ - Introduce a "trace the tracer" mode that generates additional traces that
626+ represent operations performed by the tracer. This technique was prototyped in
627+ the [ david.goffredo/traception] [ 46 ] branch.
628+ - Add a ` Logger::log_debug ` function that optionally prints fine-grained tracing
629+ information to the log, in violation of constraint (1).
362630
363631Operations
364632----------
@@ -417,3 +685,14 @@ TODO
417685[ 33 ] : ../src/datadog/trace_sampler_config.h
418686[ 34 ] : ../src/datadog/span_sampler_config.h
419687[ 35 ] : ../src/datadog/rate.h
688+ [ 36 ] : ../src/datadog/error.h
689+ [ 37 ] : ../src/datadog/expected.h
690+ [ 38 ] : https://en.cppreference.com/w/cpp/utility/expected
691+ [ 39 ] : https://en.cppreference.com/w/cpp/utility/variant/get_if
692+ [ 40 ] : https://en.cppreference.com/w/cpp/language/value_category
693+ [ 41 ] : ../src/datadog/logger.h
694+ [ 42 ] : ../src/datadog/cerr_logger.h
695+ [ 43 ] : https://en.cppreference.com/w/cpp/io/cerr
696+ [ 44 ] : https://en.cppreference.com/w/cpp/utility/functional/function
697+ [ 45 ] : https://github.com/DataDog/datadog-agent/blob/796ccb9e92326c85b51f519291e86eb5bc950180/pkg/trace/api/endpoints.go#L97
698+ [ 46 ] : https://github.com/DataDog/dd-trace-cpp/tree/david.goffredo/traception
0 commit comments