@awln-temporal awln-temporal commented Dec 10, 2025

What changed?

Add batch workflow refresh tasks Admin API.

Add tdbg commands to invoke StartAdminBatchOperation.

Details

AdminBatchOperations are executed in the Batched System Worker, similar to existing Frontend BatchOperations. The worker activity will fetch pages of workflow executions to perform operations on, and periodically heartbeat its results. For RefreshWorkflowTasks, each individual Refresh call will go through the admin handler.

Refreshing Workflow Tasks regenerates all pending tasks of an execution given its mutable state.
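
The paging, rate-limiting, and heartbeating flow described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the actual batcher activity: `Execution`, `PageFetcher`, `Progress`, `runBatch`, and `demo` are hypothetical names, the fixed per-call delay stands in for whatever rate limiter the worker really uses, and the heartbeat is modeled as a plain callback.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Execution is a simplified, hypothetical stand-in for a workflow execution.
type Execution struct{ WorkflowID, RunID string }

// PageFetcher returns one page of executions and the token of the next page.
// An empty next token means the listing is exhausted.
type PageFetcher func(ctx context.Context, token string) ([]Execution, string, error)

// Progress is the state reported in each heartbeat (assumed shape).
type Progress struct {
	NextPageToken     string
	Succeeded, Failed int
}

// runBatch walks every page, waits interval before each call to op,
// counts successes and failures, and heartbeats once per completed page.
func runBatch(
	ctx context.Context,
	fetch PageFetcher,
	op func(context.Context, Execution) error,
	interval time.Duration,
	heartbeat func(Progress),
) (Progress, error) {
	var p Progress
	for {
		page, next, err := fetch(ctx, p.NextPageToken)
		if err != nil {
			return p, err
		}
		for _, e := range page {
			select {
			case <-ctx.Done():
				return p, ctx.Err()
			case <-time.After(interval):
			}
			if err := op(ctx, e); err != nil {
				p.Failed++
			} else {
				p.Succeeded++
			}
		}
		// Record where a retry should resume, then report progress.
		p.NextPageToken = next
		heartbeat(p)
		if next == "" {
			return p, nil
		}
	}
}

// demo runs the loop against two in-memory pages; the refresh of wf-2 fails.
func demo() (Progress, int, error) {
	type page struct {
		items []Execution
		next  string
	}
	pages := map[string]page{
		"":   {items: []Execution{{"wf-1", "r1"}, {"wf-2", "r2"}}, next: "p2"},
		"p2": {items: []Execution{{"wf-3", "r3"}}},
	}
	fetch := func(_ context.Context, token string) ([]Execution, string, error) {
		p := pages[token]
		return p.items, p.next, nil
	}
	refresh := func(_ context.Context, e Execution) error {
		if e.WorkflowID == "wf-2" {
			return errors.New("simulated refresh failure")
		}
		return nil
	}
	beats := 0
	p, err := runBatch(context.Background(), fetch, refresh, time.Millisecond, func(Progress) { beats++ })
	return p, beats, err
}

func main() {
	p, beats, err := demo()
	fmt.Println(p.Succeeded, p.Failed, beats, err)
}
```

Heartbeating the next page token together with the counters is one way to let a retried activity resume from the last completed page instead of starting over.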

CLI usage

tdbg workflow batch-refresh-tasks \
  --namespace <ns> \
  --query "WorkflowType='MyType'" \
  --reason "fixing stuck workflows" \
  [--job-id <optional-job-id>] \
  [--archetype <optional-archetype>]

Why?

Unblocks the new Matcher migration and allows general-purpose batch Admin calls. Currently, only BatchOperationRefreshWorkflowTasks is supported.

How did you test it?

  • built
  • run locally and tested manually
  • covered by existing tests
  • added new unit test(s)
  • added new functional test(s)

@awln-temporal awln-temporal requested review from a team as code owners December 10, 2025 16:48
@awln-temporal awln-temporal requested a review from yycptt December 10, 2025 16:48
@awln-temporal awln-temporal force-pushed the admin-batch-refresh-workflows branch from bfd3e1b to 32e6d5e Compare December 10, 2025 17:14
@awln-temporal awln-temporal force-pushed the admin-batch-refresh-workflows branch from 32e6d5e to 94f9a55 Compare December 10, 2025 18:22

yycptt commented Dec 10, 2025

Haven't reviewed the actual changes yet, but for the tdbg experience, let's match what we have in the temporal CLI. So something like `tdbg workflow refresh-tasks -q ''` means to start a batch operation, and if -q is not specified and --wid is specified, then it's not a batch op and only operates on that single workflow.
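
The flag behavior suggested here could look roughly like the sketch below. It is only an illustration of the decision rule, not tdbg code: `refreshMode`, `chooseMode`, and the error text are hypothetical, and returning an error when neither flag is set is an assumption the comment does not spell out.

```go
package main

import (
	"errors"
	"fmt"
)

// refreshMode says whether the command targets one workflow or a batch.
type refreshMode int

const (
	modeSingle refreshMode = iota
	modeBatch
)

// chooseMode applies the suggested rule: a visibility query starts a batch
// operation; otherwise a workflow ID selects a single-workflow refresh.
// Rejecting the case where neither is set is an assumption of this sketch.
func chooseMode(query, workflowID string) (refreshMode, error) {
	switch {
	case query != "":
		return modeBatch, nil
	case workflowID != "":
		return modeSingle, nil
	default:
		return modeSingle, errors.New("either a query or a workflow ID is required")
	}
}

func main() {
	mode, err := chooseMode("WorkflowType='MyType'", "")
	fmt.Println(mode == modeBatch, err)

	mode, err = chooseMode("", "my-workflow-id")
	fmt.Println(mode == modeSingle, err)

	_, err = chooseMode("", "")
	fmt.Println(err)
}
```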

@yycptt yycptt left a comment

cc @prathyushpv Can you help review this PR as well?

yycptt commented Dec 15, 2025

After this PR lands, CGS can consider adding a batch force-migrate command as well.
cc @yux0 @xwduan @hai719

@yycptt yycptt requested a review from prathyushpv December 15, 2025 06:44
@awln-temporal
Contributor Author

> Haven't reviewed the actual changes yet, but for the tdbg experience, let's match what we have in the temporal CLI. So something like `tdbg workflow refresh-tasks -q ''` means to start a batch operation, and if -q is not specified and --wid is specified, then it's not a batch op and only operates on that single workflow.

Do we care about setting Memos for the Admin Batch Operation, e.g. Reason and JobID, that are common with User Batch Operations but not usable for single RefreshWorkflowTasks ops?

case *adminservice.StartAdminBatchOperationRequest_RefreshWorkflowTasksOperation:
    return processTask(ctx, limiter, task,
        func(execution *commonpb.WorkflowExecution) error {
            _, err := a.AdminClient.RefreshWorkflowTasks(ctx, &adminservice.RefreshWorkflowTasksRequest{

Contributor

just to be clear, we're okay with any user being able to do this at any time?

refresh tasks is safe, but my concern is that someone is going to add a more dangerous admin operation here and not realize that it's exposed to all users in the namespace.

I would say we need a big warning comment somewhere but I'm not sure where it should go.

Contributor Author

I can add a big warning to the proto definition

Member

I think we will need a general way to prevent users from starting this workflow directly, even for normal operations like signal/terminate, since starting it directly lets them bypass, for example, the concurrent limit.

@awln-temporal awln-temporal force-pushed the admin-batch-refresh-workflows branch from a7bd46f to 54bd801 Compare December 17, 2025 17:37
@awln-temporal awln-temporal force-pushed the admin-batch-refresh-workflows branch from 54bd801 to 30b305c Compare December 17, 2025 20:39

yycptt commented Dec 22, 2025

> > Haven't reviewed the actual changes yet, but for the tdbg experience, let's match what we have in the temporal CLI. So something like `tdbg workflow refresh-tasks -q ''` means to start a batch operation, and if -q is not specified and --wid is specified, then it's not a batch op and only operates on that single workflow.
>
> Do we care about setting Memos for the Admin Batch Operation, e.g. Reason and JobID, that are common with User Batch Operations but not usable for single RefreshWorkflowTasks ops?

I think it's nice to have them (reason and batch type) in the memo. JobID is not part of the memo I believe.


// StartAdminBatchOperationRequest starts an admin batch operation.
// WARNING: Batch Operations are exposed to all users of the namespace. Admin Batch Operations should be exercised with caution.
message StartAdminBatchOperationRequest {

Contributor

I was wondering if a field to specify batch operation RPS would be helpful in input. We have max_operations_per_second in StartBatchOperationRequest. We can use this field to control the rate. It may not be strictly necessary since we are marking this batch operation with the lowest priority, and it should not block other operations.
cc: @yycptt

Contributor Author

@awln-temporal awln-temporal Dec 23, 2025

looks like this max_operations_per_second field isn't properly handled by the activity? It seems to always default to the Dynamic Config RPS value https://github.com/temporalio/temporal/blob/admin-batch-refresh-workflows/service/worker/batcher/fx.go#L89.

Member

yeah looks like it's not used at all right now and we just rely on the dc value to protect us.
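
If the request field were honored in the future, one possible shape is to treat the dynamic config value as both the default and the upper bound. The sketch below is purely illustrative and does not reflect what the batcher does today (per the thread, it always uses the dynamic config value); `effectiveRPS` and its parameters are hypothetical names.

```go
package main

import "fmt"

// effectiveRPS picks the rate for a batch job under the assumption that the
// dynamic config value (dcLimit) is both the default and the hard cap.
// A non-positive requested value means the caller did not set the field.
func effectiveRPS(requested, dcLimit float64) float64 {
	if requested <= 0 || requested > dcLimit {
		return dcLimit
	}
	return requested
}

func main() {
	for _, req := range []float64{0, 10, 100} {
		fmt.Printf("requested=%v effective=%v\n", req, effectiveRPS(req, 50))
	}
}
```

Capping at the dynamic config value keeps the operator-controlled protection in place while still letting a caller slow a job down.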


_, err = adminClient.StartAdminBatchOperation(ctx, &adminservice.StartAdminBatchOperationRequest{
    Namespace: nsName,
    VisibilityQuery: query,

Member

not related to this PR.

Does the visibility query parser support BusinessID today? I mean we added support for each component to define its own alias for the underlying WorkflowID column, but we will also need a default alias I think. Very low priority though.

Contributor Author

At least right now, WorkflowID can be used as a default system search attribute to retrieve the BusinessID. We never made the change to disallow querying against WorkflowID, since that might impact scheduler query backwards compatibility.

Member

oh I didn't mean to disallow WorkflowID. just we need to allow BusinessID.

Contributor Author

gotcha, I just recall a discussion with Roey/Rodrigo about disallowing before, but maybe we just ignore that.

@awln-temporal awln-temporal force-pushed the admin-batch-refresh-workflows branch from acea49c to a286808 Compare December 29, 2025 19:06

_, err = adminClient.StartAdminBatchOperation(ctx, &adminservice.StartAdminBatchOperationRequest{
    Namespace: nsName,
    VisibilityQuery: query,

Member

nit: consider adding tests that verify (1) admin batch and normal batch don't share the same limit, and (2) ListBatchOperations doesn't show admin batch ops.

Contributor Author

added tests to validate separate namespace divisions.
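
The two properties requested in this thread can be modeled with a toy registry that tracks open jobs per division. This is a simplified illustration of the intended behavior, not the server's implementation or the tests that were added: `registry`, `start`, `list`, `demo`, and the division names `"user-batch"` and `"admin-batch"` are all hypothetical.

```go
package main

import "fmt"

// registry tracks open batch jobs per division and enforces a separate
// limit on each division (a toy model, assumed for illustration).
type registry struct {
	limit int
	open  map[string][]string // division -> open job IDs
}

func newRegistry(limit int) *registry {
	return &registry{limit: limit, open: make(map[string][]string)}
}

// start records a job unless that job's own division is already at its limit.
func (r *registry) start(division, jobID string) error {
	if len(r.open[division]) >= r.limit {
		return fmt.Errorf("division %q reached its limit of %d open jobs", division, r.limit)
	}
	r.open[division] = append(r.open[division], jobID)
	return nil
}

// list returns only the jobs in the given division, so a listing of user
// batch jobs never includes admin batch jobs.
func (r *registry) list(division string) []string {
	return r.open[division]
}

// demo starts one user job, one admin job, and a second user job with limit 1.
func demo() (userErr, adminErr, secondUserErr error, userJobs []string) {
	r := newRegistry(1)
	userErr = r.start("user-batch", "job-1")
	adminErr = r.start("admin-batch", "job-2")
	secondUserErr = r.start("user-batch", "job-3")
	return userErr, adminErr, secondUserErr, r.list("user-batch")
}

func main() {
	userErr, adminErr, secondUserErr, userJobs := demo()
	fmt.Println(userErr, adminErr, secondUserErr, userJobs)
}
```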

@awln-temporal awln-temporal force-pushed the admin-batch-refresh-workflows branch from 3a579fa to cb3823c Compare January 6, 2026 21:04
@awln-temporal awln-temporal merged commit 98e7254 into main Jan 7, 2026
60 checks passed
@awln-temporal awln-temporal deleted the admin-batch-refresh-workflows branch January 7, 2026 00:21

5 participants