@sophiatev sophiatev commented Mar 28, 2025

This PR adds support for distributed tracing for entities in the .NET isolated framework. This repo is where the trace Activities are actually created for signaling and calling entities and for entities starting orchestrations from an isolated app.

  • RequestMessage, OperationRequest, OperationResult, SendSignalOperationAction, and StartNewOrchestrationOperationAction were extended to carry extra information from the durabletask-dotnet repo, where the requests to entities are generated and also executed. This extra information includes the request time, the end time of the execution, any error messages, and the parent trace contexts.
    • It's worth noting that the CreateTrace field added to RequestMessage is only used in the isolated case to indicate that we want to make an entity-specific trace for this request (it is set to true by the appropriate signal/call methods in durabletask-dotnet). In the in-process case, all of the traces are created in the WebJobs repo, so this field is not populated (and will be false by default).
  • TraceHelper, Schema, and TraceActivityConstants were updated with the instantiation of the entity-specific trace Activities
  • The ClientEntityHelpers and OrchestrationEntityContext methods that generate the EntityMessageEvents used by the durabletask-dotnet repo were updated to attach this additional information to the message events.
  • TaskEntityDispatcher, which is where all entity requests end up (orchestrations calling/signaling entities, clients signaling entities, entities signaling other entities, and entities starting orchestrations) and where the entities are actually invoked to fulfill the requests, was updated to instantiate the corresponding traces. The one exception is clients signaling entities via gRPC (i.e., when the DurableEntityClient is a GrpcDurableEntityClient): this case is handled in the WebJobs repo, where the call ultimately reaches the LocalGrpcListener. The PR for that repo is linked below.
    • TaskEntityDispatcher.StartTraceActivityForSignalingEntity creates the Activity when a client signals an entity (via ShimDurableEntityClient in the dotnet repo, since the gRPC client call is handled by WebJobs) or when an orchestration signals an entity. In the client case, ShimDurableEntityClient has access to the correct parent trace context via Activity.Current.Context, so it attaches this context to the request message itself. StartTraceActivityForSignalingEntity then parses this context and uses it as the parent of the Activity for signaling the entity. When an orchestration signals an entity, neither the dotnet repo nor TaskEntityDispatcher has access to the orchestration trace context. Instead, the parent trace context is attached in TaskOrchestrationDispatcher.ProcessSendEvent, where Activity.Current.Context holds the orchestration context. This method only has access to the associated EventRaisedEvent, so it attaches the parent trace context to that event, which StartTraceActivityForSignalingEntity eventually parses and uses. Finally, when an orchestration calls an entity, the Activity is only created at the very end, once the call has completed. The code at that point only has access to the RequestMessage, so StartTraceActivityForSignalingEntity copies the parent trace context from the EventRaisedEvent to the RequestMessage so that it can be used when the Activity for the call is created.
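To illustrate the "parse the attached context and use it as the parent" step described above, here is a minimal sketch using only System.Diagnostics. It is not the code from this PR: the class name, the source name "Sketch.EntityTracing", the operation name "signal_entity", and the assumption that the parent context travels as a W3C traceparent string are all hypothetical.

```csharp
#nullable enable
using System;
using System.Diagnostics;

public static class ParentContextSketch
{
    // Hypothetical source name, chosen for this sketch only.
    private const string SourceName = "Sketch.EntityTracing";
    private static readonly ActivitySource Source = new ActivitySource(SourceName);

    // Without a listener that samples the source, StartActivity returns null.
    public static void EnableListening()
    {
        var listener = new ActivityListener
        {
            ShouldListenTo = source => source.Name == SourceName,
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                ActivitySamplingResult.AllData,
        };
        ActivitySource.AddActivityListener(listener);
    }

    // Parses a W3C traceparent (assumed to have been attached to the request by the
    // sender) and starts a Producer Activity whose parent is that remote context.
    public static Activity? StartSignalActivity(string? traceParent, string? traceState)
    {
        if (!ActivityContext.TryParse(traceParent, traceState, out ActivityContext parent))
        {
            // No usable parent context on the message: do not create a trace.
            return null;
        }

        return Source.StartActivity("signal_entity", ActivityKind.Producer, parent);
    }
}
```

The resulting Activity shares the trace ID of the attached context and records the sender's span ID as its ParentSpanId, which is what ties the entity span back to the client or orchestration that issued the request.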

The various other PRs related to this effort are

It is worth noting that the Activities for signaling an entity in the isolated case will have longer durations than in the in-process case. In the in-process case, the Activity for a signal to an entity is created upon the request and almost immediately disposed. In the isolated case, we cannot immediately dispose the Activity upon the request, since this would require creating the Activity in the dotnet repo where the request is generated. Instead, we create the Activity once the signal request reaches DurableTask.Core and is actually processed by the TaskEntityDispatcher, and pass the request time as the start time of the Activity. The gap between its start time (the request time) and its end time will therefore be much larger than in the in-process case. This is not an issue for calls to entities, since these Activities are only ended once the operation completes (in the isolated case, once we send the result back to the orchestration instance, and in the in-process case, once the entity invocation completes).
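The effect of passing the request time as the start time can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: the class name, the source name "Sketch.DelayedActivity", and the operation name are hypothetical, and the Activity is stopped immediately after creation to mimic a signal span.

```csharp
#nullable enable
using System;
using System.Diagnostics;

public static class DelayedActivitySketch
{
    // Hypothetical source name, chosen for this sketch only.
    private const string SourceName = "Sketch.DelayedActivity";
    private static readonly ActivitySource Source = new ActivitySource(SourceName);

    // Without a listener that samples the source, StartActivity returns null.
    public static void EnableListening()
    {
        var listener = new ActivityListener
        {
            ShouldListenTo = source => source.Name == SourceName,
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                ActivitySamplingResult.AllData,
        };
        ActivitySource.AddActivityListener(listener);
    }

    // Creates the Activity when the request is processed, but backdates its start
    // to the time the request was made, then stops it right away.
    public static TimeSpan RecordSignal(DateTimeOffset requestTime)
    {
        Activity? activity = Source.StartActivity(
            "signal_entity",
            ActivityKind.Producer,
            default(ActivityContext),
            startTime: requestTime);

        if (activity is null)
        {
            return TimeSpan.Zero;
        }

        // Stop() records the current UTC time as the end time.
        activity.Stop();
        return activity.Duration;
    }
}
```

If the request was made five seconds before it is processed, the returned duration is at least about five seconds even though the Activity object itself lived only for an instant, which is the offset described above.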

This is also true when a client creates an orchestration using the ShimDurableTaskClient: we only create the Activity for the orchestration once the request reaches DurableTask.Core and is processed by the TaskOrchestrationDispatcher. The duration of the create-orchestration Activity will therefore be much longer than in all other cases, where the Activity is started upon the request and ended almost immediately afterwards.

An example trace generated by this simple orchestration
image
looks as follows
image

Each signal request has type ActivityKind.Producer and each call request has type ActivityKind.Client (an entity starting an orchestration is also of type ActivityKind.Producer). When an entity actually processes the request, the span has type ActivityKind.Consumer for a signal and ActivityKind.Server for a call. Note that the call to add_to_other_entity_step_1 starts a cascade of entities signaling other entities until eventually the last call is simply an add to the third entity.

If instead of starting the orchestration via an HTTP request we signal an entity to start the orchestration, the trace would look like this
image

@sophiatev sophiatev requested a review from jviau April 3, 2025 19:09
Sophia Tevosyan added 3 commits May 12, 2025 20:45
…n orchestration signaling an entity is too short, then the message gets redelivered and a trace is created for each redelivery. we fixed this and only make the trace once
bachuv previously approved these changes May 19, 2025
Base automatically changed from stevosyan/distributed-tracing-for-entities to main May 20, 2025 21:49
@sophiatev sophiatev dismissed bachuv’s stale review May 20, 2025 21:49

The base branch was changed.

@sophiatev sophiatev merged commit 4467e99 into main May 20, 2025
44 checks passed
@sophiatev sophiatev deleted the stevosyan/distributed-tracing-for-entities-isolated branch May 20, 2025 23:34