| 1 | +# Tracing Design |
| 2 | + |
| 3 | +* **Type**: Design |
| 4 | +* **Author(s)**: Ian Botsford |
| 5 | + |
| 6 | +# Abstract |
| 7 | + |
| 8 | +Tracing describes the emission of logging and metric events in a structured manner for the purposes of analyzing SDK |
| 9 | +performance and debugging. This document presents a design for how tracing will work in the SDK. |
| 10 | + |
| 11 | +# Concepts |
| 12 | + |
| 13 | +The following terms are defined: |
| 14 | + |
| 15 | +* **Trace span**: A logical grouping of tracing events that encompasses some operation. Trace spans may be subdivided |
| 16 | + into child spans which group a narrower set of events within the context of the parent. Trace spans are hierarchical; |
| 17 | + events that occur within one span also logically occur within the ancestors of that span. |
| 18 | + |
| 19 | +* **Trace probe**: A receiver for tracing events. A probe is notified when new events occur within a span and may take |
| 20 | + appropriate action to route the event (e.g., forward to a downstream logging/metrics framework, print to the console, |
| 21 | + write to a file, etc.). |
| 22 | + |
| 23 | +# Design |
| 24 | + |
| 25 | +The following components provide tracing support: |
| 26 | + |
| 27 | +## Tracer |
| 28 | + |
| 29 | +A `Tracer` is a top-level provider of tracing capabilities. It bridges trace spans (into which events are emitted) and |
| 30 | +trace probes (which receive events and handle them accordingly). Typically, each service client will have its own |
| 31 | +internal `Tracer` instance. That `Tracer` need not be publicly accessible but must be configurable with a trace probe |
| 32 | +and client name at service client construction. |
| 33 | + |
| 34 | +The `Tracer` interface is specified as: |
| 35 | + |
| 36 | +```kotlin |
| 37 | +interface Tracer { |
| 38 | + fun createRootSpan(id: String): TraceSpan |
| 39 | +} |
| 40 | +``` |
| 41 | + |
| 42 | +A `Tracer` provides root spans for a service client, into which all events over the lifetime of an operation will be |
| 43 | +emitted. Child spans can be created as mentioned below in the [Trace Span](#trace-span) section. |
| 44 | + |
| 45 | +**Note**: The interface does not specify how trace probes will be configured or utilized. These are implementation |
| 46 | +details of the tracer and aren't necessary in the public interface. |
| 47 | + |
| 48 | +## Trace Span |
| 49 | + |
| 50 | +A `TraceSpan` is a logical grouping of tracing events that are associated with some operation. Spans may be subdivided |
| 51 | +into child spans which group a narrower set of events within the context of the parent. Spans are hierarchical; events |
| 52 | +that occur within one span also logically occur within the ancestors of that span. |
| 53 | + |
| 54 | +The `TraceSpan` interface is specified as: |
| 55 | + |
| 56 | +```kotlin |
| 57 | +interface TraceSpan : Closeable { |
| 58 | + val id: String |
| 59 | + val parent: TraceSpan? |
| 60 | + |
| 61 | + fun child(id: String): TraceSpan |
| 62 | + fun postEvent(event: TraceEvent) |
| 63 | +} |
| 64 | +``` |
| 65 | + |
| 66 | +Spans have an ID (or name) which must be unique among sibling spans within the same parent. Span IDs will generally be |
| 67 | +used by probes to contextualize events. |
| 68 | + |
| 69 | +`TraceSpan` instances are `Closeable` and must be closed when no more events will be emitted to them. Probes may choose |
| 70 | +to batch/aggregate events within a span until a span is closed. |
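
The span lifecycle described above can be sketched with Kotlin's `use`, which closes a span once its block finishes. This is a minimal illustration rather than the SDK's implementation. It assumes the JVM's `java.io.Closeable`, a `TraceEvent` reduced to a single message, and a hypothetical `RecordingSpan` class that writes to an in-memory log.

```kotlin
import java.io.Closeable

// Simplified stand-in for the design's TraceEvent; only a message is kept here.
data class TraceEvent(val message: String)

// Mirrors the TraceSpan interface from this section.
interface TraceSpan : Closeable {
    val id: String
    val parent: TraceSpan?

    fun child(id: String): TraceSpan
    fun postEvent(event: TraceEvent)
}

// Hypothetical in-memory span that appends everything it sees to a shared log.
class RecordingSpan(
    override val id: String,
    override val parent: TraceSpan?,
    private val log: MutableList<String>,
) : TraceSpan {
    private val childIds = mutableSetOf<String>()
    private var closed = false

    override fun child(id: String): TraceSpan {
        check(!closed) { "span ${this.id} is closed" }
        // IDs must be unique among sibling spans within the same parent.
        require(childIds.add(id)) { "duplicate child span ID: $id" }
        return RecordingSpan(id, this, log)
    }

    override fun postEvent(event: TraceEvent) {
        check(!closed) { "span $id is closed" }
        log += "$id: ${event.message}"
    }

    override fun close() {
        closed = true
        log += "$id: closed"
    }
}

// Opens a root span and one child span, closing each with `use` when its work is done.
fun runOperation(): List<String> {
    val log = mutableListOf<String>()
    RecordingSpan("S3-ListBuckets", null, log).use { root ->
        root.postEvent(TraceEvent("serializing request"))
        root.child("Attempt-1").use { attempt ->
            attempt.postEvent(TraceEvent("signing request"))
        }
        root.postEvent(TraceEvent("deserializing response"))
    }
    return log
}
```

In this sketch the child span is closed before the root span posts its final event, so a probe that batches per span could already flush the `Attempt-1` events at that point.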
| 71 | + |
| 72 | +## Trace Event |
| 73 | + |
| 74 | +A `TraceEvent` is the recording of a single event that took place and its associated metadata: |
| 75 | + |
| 76 | +```kotlin |
| 77 | +data class TraceEvent( |
| 78 | + val level: EventLevel, |
| 79 | + val sourceComponent: String, |
| 80 | + val timestamp: Instant, |
| 81 | + val threadId: String, |
| 82 | + val data: TraceEventData, |
| 83 | +) |
| 84 | + |
| 85 | +enum class EventLevel { Fatal, Error, Warning, Info, Debug, Trace } |
| 86 | + |
| 87 | +sealed interface TraceEventData { |
| 88 | + data class Message(val exception: Throwable? = null, val content: () -> Any?) : TraceEventData |
| 89 | + |
| 90 | + sealed interface Metric : TraceEventData { val metric: String } |
| 91 | + data class Count<T : Number>(override val metric: String, val count: () -> T) : Metric |
| 92 | + data class Timespan(override val metric: String, val duration: () -> Duration) : Metric |
| 93 | +} |
| 94 | +``` |
| 95 | + |
| 96 | +Trace events occur at different levels (e.g., fatal, info, debug, etc.). These levels may be used by probes to |
| 97 | +include/omit events in their output. |
| 98 | + |
| 99 | +Trace event data can be one of three types:
| 100 | +* `Message`: Typically, a free-form text message used for logging |
| 101 | +* `Count`: The numerical measurement of some value (e.g., results returned, bytes written, etc.) |
| 102 | +* `Timespan`: The temporal measurement of some occurrence (e.g., time elapsed, latency, etc.) |
| 103 | + |
| 104 | +Probes are free to handle these different types of events however they see fit (e.g., they may log some messages, |
| 105 | +aggregate some metrics, ignore some events, etc.). |
| 106 | + |
| 107 | +Event data values (i.e., message text, count values, and timespan durations) are provided as lambdas rather than as
| 108 | +direct values. This allows probe implementations to skip computing values for events they would otherwise discard
| 109 | +(e.g., events emitted at a level ignored by the probe).
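
The benefit of lambda-valued event data can be illustrated with a small sketch. The types below are reduced stand-ins rather than the design's actual classes: `LeveledMessage` flattens the level and message content into one hypothetical class, and `render` is a hypothetical helper that plays the role of a level-filtering probe.

```kotlin
enum class EventLevel { Fatal, Error, Warning, Info, Debug, Trace }

// Reduced stand-in for a message event: the content is a lambda, not a value.
class LeveledMessage(val level: EventLevel, val content: () -> Any?)

// Hypothetical filter: renders a message only if it is at least as severe as `minLevel`.
// EventLevel is declared from most to least severe, so a lower ordinal means more severe.
fun render(message: LeveledMessage, minLevel: EventLevel): String? =
    if (message.level.ordinal <= minLevel.ordinal) message.content().toString() else null

// Returns the rendered messages and how many times the content lambda actually ran.
fun demoLazyContent(): Pair<List<String?>, Int> {
    var evaluations = 0
    val describePayload: () -> Any? = {
        evaluations++ // stands in for an expensive computation
        "payload dump"
    }
    val rendered = listOf(
        // Trace is below the Info threshold, so the lambda is never invoked.
        render(LeveledMessage(EventLevel.Trace, describePayload), minLevel = EventLevel.Info),
        // Warning passes the threshold, so the lambda runs once.
        render(LeveledMessage(EventLevel.Warning, describePayload), minLevel = EventLevel.Info),
    )
    return rendered to evaluations
}
```

Here the content lambda runs only for the `Warning` message; the `Trace` message is discarded without its content ever being computed.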
| 110 | + |
| 111 | +## Trace Probe |
| 112 | + |
| 113 | +A `TraceProbe` is a sink for receiving events from spans. Probes typically form a bridge between the SDK's events and
| 114 | +downstream libraries/frameworks/services which can handle the events. Examples of such downstream systems include Log4j, |
| 115 | +CloudWatch, local files on disk, the console, etc. |
| 116 | + |
| 117 | +SDKs will typically not bundle many implementations of `TraceProbe` themselves. Common probe implementations may be |
| 118 | +available as separate libraries or from third-party sources. Users may implement probes themselves to bridge SDK events |
| 119 | +to whatever downstream systems they desire. |
| 120 | + |
| 121 | +The `TraceProbe` interface is defined as: |
| 122 | + |
| 123 | +```kotlin |
| 124 | +interface TraceProbe { |
| 125 | + fun postEvent(span: TraceSpan, event: TraceEvent) |
| 126 | + fun spanClosed(span: TraceSpan) |
| 127 | +} |
| 128 | +``` |
| 129 | + |
| 130 | +The methods of `TraceProbe` are invoked by the top-level `Tracer` (or `TraceSpan` instances created by it). |
| 131 | + |
| 132 | +The `postEvent` method indicates that an event has been emitted to a span. Probe implementations may choose to |
| 133 | +immediately handle/discard events or to batch them until later. Once `spanClosed` is called, no more events will be |
| 134 | +posted for the given span. |
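
As an illustration of the batching option, the following sketch buffers events per span and hands them off only when `spanClosed` is called. It is one possible probe, not a bundled implementation. `BatchingProbe`, `SimpleSpan`, and the `flush` callback are hypothetical names, the event and span types are reduced stand-ins, and the sketch is not thread-safe.

```kotlin
// Simplified stand-ins for the design's types, reduced to what this sketch needs.
data class TraceEvent(val message: String)
interface TraceSpan { val id: String }
class SimpleSpan(override val id: String) : TraceSpan

// Mirrors the TraceProbe interface from this section.
interface TraceProbe {
    fun postEvent(span: TraceSpan, event: TraceEvent)
    fun spanClosed(span: TraceSpan)
}

// Hypothetical probe that buffers events per span and flushes them when the span closes.
class BatchingProbe(
    private val flush: (spanId: String, events: List<TraceEvent>) -> Unit,
) : TraceProbe {
    // Keyed by span instance; SimpleSpan uses identity equality, so equal IDs do not collide.
    private val pending = mutableMapOf<TraceSpan, MutableList<TraceEvent>>()

    override fun postEvent(span: TraceSpan, event: TraceEvent) {
        pending.getOrPut(span) { mutableListOf() }.add(event)
    }

    override fun spanClosed(span: TraceSpan) {
        // No further events arrive for this span, so the batch is complete.
        val events = pending.remove(span) ?: emptyList()
        flush(span.id, events)
    }
}

// Posts two events to one span and returns what was flushed after the span closed.
fun demoBatching(): List<String> {
    val flushed = mutableListOf<String>()
    val probe = BatchingProbe { spanId, events ->
        flushed += "$spanId: ${events.joinToString { it.message }}"
    }
    val span = SimpleSpan("Attempt-1")
    probe.postEvent(span, TraceEvent("signing request"))
    probe.postEvent(span, TraceEvent("sending request"))
    probe.spanClosed(span)
    return flushed
}
```

Nothing reaches the `flush` callback until the span closes, at which point both events arrive together as a single batch.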
| 135 | + |
| 136 | +## Client config |
| 137 | + |
| 138 | +The following additional parameters will be added to client config: |
| 139 | + |
| 140 | +* `tracer`: An optional `Tracer` implementation to use. If not provided explicitly, this will default to a tracer which |
| 141 | + sends logging events to **kotlin-logging** and ignores metric events. The `DefaultTracer` class is available to |
| 142 | + provide a simple `Tracer` implementation with a configurable probe and root prefix. Using a root prefix can help |
| 143 | + differentiate events from multiple clients of a single service used for different use cases. |
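
To make the idea of a tracer with a configurable probe and root prefix concrete, here is a minimal sketch. It is not the SDK's `DefaultTracer`: `PrefixedTracer` and `ProbedSpan` are hypothetical, joining the prefix to the span ID with a hyphen is an assumption, and `TraceEvent` is reduced to a single message.

```kotlin
import java.io.Closeable

// Simplified stand-in for the design's TraceEvent.
data class TraceEvent(val message: String)

// The interfaces below mirror the ones specified in this document.
interface TraceSpan : Closeable {
    val id: String
    val parent: TraceSpan?
    fun child(id: String): TraceSpan
    fun postEvent(event: TraceEvent)
}

interface TraceProbe {
    fun postEvent(span: TraceSpan, event: TraceEvent)
    fun spanClosed(span: TraceSpan)
}

interface Tracer {
    fun createRootSpan(id: String): TraceSpan
}

// Hypothetical span that forwards every event and its closure to a single probe.
private class ProbedSpan(
    override val id: String,
    override val parent: TraceSpan?,
    private val probe: TraceProbe,
) : TraceSpan {
    override fun child(id: String): TraceSpan = ProbedSpan(id, this, probe)
    override fun postEvent(event: TraceEvent) = probe.postEvent(this, event)
    override fun close() = probe.spanClosed(this)
}

// Hypothetical tracer: prefixes root span IDs (assumed format: "<prefix>-<id>").
class PrefixedTracer(private val probe: TraceProbe, private val rootPrefix: String) : Tracer {
    override fun createRootSpan(id: String): TraceSpan = ProbedSpan("$rootPrefix-$id", null, probe)
}

// Wires a tracer to an in-memory probe and returns what the probe observed.
fun demoPrefixedTracer(): List<String> {
    val seen = mutableListOf<String>()
    val probe = object : TraceProbe {
        override fun postEvent(span: TraceSpan, event: TraceEvent) {
            seen += "${span.id}: ${event.message}"
        }
        override fun spanClosed(span: TraceSpan) {
            seen += "${span.id}: closed"
        }
    }
    val tracer = PrefixedTracer(probe, rootPrefix = "S3-uploads")
    tracer.createRootSpan("ListBuckets").use { it.postEvent(TraceEvent("starting")) }
    return seen
}
```

With a second tracer constructed using a different prefix (say, `S3-downloads`), the same probe could tell the two clients' events apart by root span ID alone.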
| 144 | + |
| 145 | +# Implementation guidance |
| 146 | + |
| 147 | +The following guidelines are intended to inform implementation and usage of tracing features by SDK contributors and |
| 148 | +those who customize their usage of the SDK: |
| 149 | + |
| 150 | +## Trace span hierarchy |
| 151 | + |
| 152 | +Trace spans form a taxonomy that categorizes tracing events into a hierarchy. Discrete spans help group related events in
| 153 | +a way that's useful to downstream tools which facilitate analysis. Consequently, choosing meaningful trace spans is key |
| 154 | +to maximizing the usefulness of tracing events. Trace spans which are too specific and too deeply nested may create |
| 155 | +noise and obscure events in an opaque hierarchy. Trace spans which are too shallow may bundle together too many events |
| 156 | +and hinder meaningful analyses by downstream systems. |
| 157 | + |
| 158 | +The following trace span levels are recommended for implementors: |
| 159 | + |
| 160 | +* A top-level span for each operation invocation, in the form of `<clientName>-<operation>-<uuid>` (e.g., |
| 161 | + `S3-ListBuckets-8e6bf409-c119-4661-bd99-523c70701aac`) |
| 162 | +* A span for retry attempts, in the form of `Attempt-<n>` (e.g., `Attempt-1`) or `Non-retryable attempt` in the case
| 163 | + of operations which cannot be retried |
| 164 | +* A span for credentials chains, named `Credentials chain`. Note that individual credentials providers (e.g., static, |
| 165 | + profile, environment, etc.) don't get their own child spans—only the chain. |
| 166 | +* A span for HTTP engine events within a request, named `HTTP` |
| 167 | +* Spans for subclients (or _inner clients_) which are embedded in the logic for superclients (or _outer clients_). An |
| 168 | + example of a subclient is using a nested STS client as part of credential resolution while invoking an operation for a |
| 169 | + different service. Spans for subclients effectively reproduce the span hierarchy listed above nested within the outer |
| 170 | + span hierarchy. |
| 171 | + |
| 172 | +The following are examples of suggested trace span hierarchies: |
| 173 | + |
| 174 | +* `S3-ListBuckets-8e6bf409-c119-4661-bd99-523c70701aac`: events which occur during invocation of an S3 `ListBuckets` |
| 175 | + operation _before_ or _after_ retry middleware (e.g., serialization/deserialization, endpoint resolution, etc.) |
| 176 | +* `S3-ListBuckets-8e6bf409-c119-4661-bd99-523c70701aac/Attempt-1`: events which occur during the first attempt at |
| 177 | + calling `ListBuckets` _outside of_ the HTTP engine (e.g., signing) |
| 178 | +* `S3-ListBuckets-8e6bf409-c119-4661-bd99-523c70701aac/Attempt-1/Credentials chain`: events which occur during |
| 179 | + credential resolution in a credentials chain during the first attempt at calling `ListBuckets` |
| 180 | +* `S3-ListBuckets-8e6bf409-c119-4661-bd99-523c70701aac/Attempt-1/HTTP`: events which occur inside the HTTP engine during |
| 181 | + the first attempt at calling `ListBuckets` (e.g., sending/receiving bytes from service) |
| 182 | + |
| 183 | +The following is an example of a nested span hierarchy for a subclient: |
| 184 | + |
| 185 | +* `S3-ListBuckets-8e6bf409-c119-4661-bd99-523c70701aac/Attempt-1/Credentials chain/STS-AssumeRole-c080b2e2-ff6d-4504-bce4-3433f9f4ac1b/Attempt-2`:
| 186 | + events which occur during the second attempt to call STS's `AssumeRole` as part of credential chain resolution during
| 187 | + the first attempt to call S3's `ListBuckets`. |
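
The slash-separated notation used in the examples above can be derived from a span's chain of parents. The following sketch shows one way a probe might do that. `Span` is a minimal stand-in for `TraceSpan`, and `spanPath` is a hypothetical helper, not part of the proposed interfaces.

```kotlin
// Minimal stand-in for TraceSpan: only the fields needed to walk the hierarchy.
class Span(val id: String, val parent: Span? = null)

// Joins the IDs from the root span down to `span` with "/".
fun spanPath(span: Span): String =
    generateSequence(span) { it.parent } // span, parent, grandparent, ...
        .map { it.id }
        .toList()
        .asReversed() // root first
        .joinToString("/")

// Builds the operation/attempt/HTTP hierarchy from the examples and returns its path.
fun demoSpanPath(): String {
    val root = Span("S3-ListBuckets-8e6bf409-c119-4661-bd99-523c70701aac")
    val attempt = Span("Attempt-1", parent = root)
    val http = Span("HTTP", parent = attempt)
    return spanPath(http)
}
```

Because events that occur in a span also logically occur within its ancestors, a probe could use a path like this to filter or group events by any prefix of the hierarchy.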
| 188 | + |
| 189 | +### Adding new spans |
| 190 | + |
| 191 | +New spans may be necessary for certain features in the future and thus the above list and examples are not exhaustive. |
| 192 | +For the reasons described above, care should be taken to ensure that new spans add enough value and distinctiveness |
| 193 | +without nesting so deeply as to obscure event relationships. |
| 194 | + |
| 195 | +# Revision history |
| 196 | + |
| 197 | +* 8/19/2022 - Initial draft |
| 198 | +* 11/15/2022 - Revised draft with latest proposed interfaces |