Skip to content
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 3 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,9 @@

## (Unreleased)

- Add automatic retry on gateway timeout in `GrpcDurableTaskClient.WaitForInstanceCompletionAsync` in [#412](https://github.com/microsoft/durabletask-dotnet/pull/412))
- - Add automatic retry on gateway timeout in `GrpcDurableTaskClient.WaitForInstanceCompletionAsync` in [#412](https://github.com/microsoft/durabletask-dotnet/pull/412))
- Add specific logging for NotFound error on worker connection by @halspang in ([#413](https://github.com/microsoft/durabletask-dotnet/pull/413))


## v1.10.0

Expand Down
5 changes: 5 additions & 0 deletions src/Worker/Grpc/GrpcDurableTaskWorker.Processor.cs
Original file line number Diff line number Diff line change
Expand Up @@ -65,6 +65,11 @@ public async Task ExecuteAsync(CancellationToken cancellation)
// Sidecar is down - keep retrying
this.Logger.SidecarUnavailable();
}
catch (RpcException ex) when (ex.StatusCode == StatusCode.NotFound)
{
// Task hub is not found or insufficient permissions - retry
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically, 404 Not Found responses are not supposed to be retried. Why would we want to retry in our case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two main reasons I had for this were:

  1. A 404 is returned from the scheduler if the task hub isn't present or if there are insufficient permissions to access it. Given permissions can take time to propagate, I didn't want to have the customer have to go in and restart their worker to get things working again.
  2. A scheduler and a task hub can be deployed in separate steps of a deployment. If the worker is mistakenly put in between them, it would mean the customer has to restart their worker again to get things going.
  3. It's the behavior today (since we retry on unknown) and stopping that may be a breaking change. Who knows what customers depend on so I changed it to at least highlight the log better.

Let me know if you think those are valid reasons or if you think we should change the behavior or add additional sleeping.

Copy link
Member

@cgillum cgillum Apr 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me! Might be worth mentioning this in the comments. (thought this particular code path is agnostic to DTS, so ideally you'd explain it in a more agnostic way :))

this.Logger.TaskHubNotFound();
}
catch (OperationCanceledException) when (cancellation.IsCancellationRequested)
{
// Shutting down, lets exit gracefully.
Expand Down
3 changes: 3 additions & 0 deletions src/Worker/Grpc/Logs.cs
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,9 @@ static partial class Logs
[LoggerMessage(EventId = 4, Level = LogLevel.Information, Message = "Sidecar work-item streaming connection established.")]
public static partial void EstablishedWorkItemConnection(this ILogger logger);

[LoggerMessage(EventId = 5, Level = LogLevel.Warning, Message = "Task hub NotFound. Will continue retrying.")]
public static partial void TaskHubNotFound(this ILogger logger);

[LoggerMessage(EventId = 10, Level = LogLevel.Debug, Message = "{instanceId}: Received request to run orchestrator '{name}' with {oldEventCount} replay and {newEventCount} new history events.")]
public static partial void ReceivedOrchestratorRequest(this ILogger logger, string name, string instanceId, int oldEventCount, int newEventCount);

Expand Down
Loading