A2A-Go SDK distributed mode #139

yarolegovich · 2025-12-12T14:42:40Z

yarolegovich
Dec 12, 2025
Maintainer

Objective

Make the SDK suitable for distributed deployments where multiple servers run behind a load-balancer. This means providing extension points which have enough information and controls for:

Detecting and preventing concurrent executions.
Supporting idempotent message handling in case of duplicate identical requests.
Supporting task event subscription through different servers.
Supporting abandoned task detection (e.g. on process reboot).
Propagating cancellation signals to different processes.

re #115

PoC

There are four PRs which implement the changes suggested in this proposal:

EventQueue use for task update notifications (refactor: event queue semantics #120)
TaskVersion concept and extended TaskStore API (feat: introduce and integrate TaskVersion #127)
WorkQueue abstraction (feat: introduce and integrate TaskVersion #127)
MySQL-based demo (feat: db-backend cluster setup #133). An example of 3 A2A server replicas behind nginx which use MySQL to implement the proposed extension points.

Background

The current high-level interactions look like the following (with extension points highlighted in green):

API

// AgentExecutor acts as an application agent adapter which is responsible 
// for invoking an agent and translating its outputs to protocol data types 
type AgentExecutor interface {
  Execute(ctx context.Context, reqCtx *RequestContext, queue eventqueue.Queue) error
  Cancel(ctx context.Context, reqCtx *RequestContext, queue eventqueue.Queue) error
}

// TaskStore is used for persisting the state.
type TaskStore interface {
  Save(ctx context.Context, task *a2a.Task) error
  Get(ctx context.Context, taskID a2a.TaskID) (*a2a.Task, error)
}

// Queue is a communication mechanism between AgentExecutor and A2A stack
// responsible for task updates, task store management and notifications.
type Queue interface {
  Write(ctx context.Context, event a2a.Event) error
  Read(ctx context.Context) (a2a.Event, error)
  Close() error
}

Solution

The proposal is to change this to the following:

Here ExecutionFrontend and ExecutionBackend are logical entities. Every process runs both a frontend, which writes work to a WorkQueue, and a backend, which invokes AgentExecutor after reading work from the queue.

WorkQueue

A new extension point responsible for work ownership and concurrent request detection. Implementations might distribute work by (taskID, payloadType), keep track of concurrent executions, reject duplicate messages (using message IDs as a deduplication key), implement idempotency (using message hashing), and implement heartbeats and retries. Both push and pull models are supported.

workqueue.Queue is the single dependency which a2asrv.RequestHandler takes on construction.

package workqueue

type Payload struct {
  Type PayloadType
  TaskID a2a.TaskID
  CancelParams *a2a.TaskIDParams
  ExecuteParams *a2a.MessageSendParams
}

type HandlerFn func(context.Context, *Payload) (a2a.SendMessageResult, error)

type Writer interface {
  // Return TaskID to allow handling idempotency (if a message with the same
  // ID already generated a Task through another instance)
  Write(context.Context, *Payload) (a2a.TaskID, error)
}

type Queue interface {
  Writer
  RegisterHandler(limiter.ConcurrencyConfig, HandlerFn)
}

ExecutionBackend registers itself as a handler when it is created:

func newClusterBackend(cfg *ClusterConfig) *clusterBackend {
  backend := &clusterBackend{...}
  cfg.WorkQueue.RegisterHandler(cfg.ConcurrencyConfig, backend.handle)
  return backend
}

This happens when an a2asrv.RequestHandler is created in the new “cluster mode”:

requestHandler := a2asrv.NewHandler(
  newAgentExecutor(workerID), 
  a2asrv.WithClusterMode(&a2asrv.ClusterConfig{
    TaskStore:    newDBTaskStore(db),
    QueueManager: newDBEventQueueManager(db),
    WorkQueue:    newDBWorkQueue(db, workerID),
  }),
)

Push-based

This API provides a high level of control and adds no overhead for integrations where things like heartbeats or load-based work distribution are already handled by the framework. A push-based work queue adapter takes only a writer as a dependency and returns a handler function. That function delegates to whichever handler (i.e. ExecutionBackend) registers itself with the queue:

// ErrConcurrencyLimitExceeded is a backpressure mechanism.
var ErrConcurrencyLimitExceeded = errors.New("concurrency limit exceeded")

func NewPushQueue(writer Writer) (Queue, HandlerFn) {
  queue := &pushQueue{Writer: writer}
  handler := HandlerFn(func(ctx context.Context, p *Payload) (a2a.SendMessageResult, error) {
    return queue.handlerFn(ctx, p)
  })
  return queue, handler
}

type pushQueue struct {
  Writer
  concurrencyConfig limiter.ConcurrencyConfig
  handlerFn         HandlerFn
}

func (q *pushQueue) RegisterHandler(cfg limiter.ConcurrencyConfig, handlerFn HandlerFn) {
  q.concurrencyConfig = cfg
  q.handlerFn = handlerFn
}

Pull-based queue

A pull-based API makes it possible to provide utilities for work polling strategies, concurrency control, retry policies, etc. A pull-based work queue adapter takes a pullable source as a dependency and starts a polling goroutine when the execution handler is registered:

type Message interface {
  Payload() *Payload
  Complete(ctx context.Context, result a2a.SendMessageResult) error
  Return(ctx context.Context, cause error) error
}

func NewPullQueue(rw ReadWriter) Queue {
  return &pullQueue{ReadWriter: rw}
}

type ReadWriter interface {
  Writer
  Read(context.Context) (Message, error)
}

type pullQueue struct {
  ReadWriter
}

func (q *pullQueue) RegisterHandler(cfg limiter.ConcurrencyConfig, handlerFn HandlerFn) {
  go func() {
    ctx := context.Background()
    for {
      // start polling loop
    }
  }()
}

TaskStore

TaskStore API will be extended to handle:

Concurrent task updates. A valid concurrent update scenario is a cancellation request during an active execution. TaskStore methods will take and return an opaque TaskVersion which can be used for optimistic concurrency control.
Transactional event storage. Save will take the event which triggered an update, so that it can be stored transactionally with the new task state.

package a2a

type TaskVersion interface {
  After(another TaskVersion) bool
}

type TaskVersionInt int64

var _ TaskVersion = TaskVersionInt(0)

func (i TaskVersionInt) After(another TaskVersion) bool {
  if ai, ok := another.(TaskVersionInt); ok {
    return i > ai
  }
  return false
}

package a2asrv

type TaskStore interface {
  Save(ctx context.Context, task *a2a.Task, event a2a.Event, prev a2a.TaskVersion) (a2a.TaskVersion, error)
  Get(context.Context, a2a.TaskID) (*a2a.Task, a2a.TaskVersion, error)
}

TaskVersion should be used to detect when another update (e.g. task cancellation) was applied concurrently with task execution. In this case execution should be aborted immediately.

EventQueue

EventQueue gets repurposed by rearranging component connections. Instead of connecting AgentExecutor to the event processor, it will act as an event bus for updates that have already been processed and applied. The interface will also be extended to work with TaskVersion.

package eventqueue

type Queue interface {
  Read(context.Context) (a2a.Event, a2a.TaskVersion, error)
  Write(context.Context, a2a.Event, a2a.TaskVersion) error
}

With the new API the task resubscription logic will be roughly:

// Subscribe to task-related events
queue, err := qm.GetOrCreate(ctx, tid)
if err != nil {
  yield(nil, err)
  return
}
// Get the latest stored task
snapshot, snapshotVersion, err := taskStore.Get(ctx, tid)
if err != nil {
  yield(nil, err)
  return
}
// Emit the snapshot
if !yield(snapshot, nil) {
  return
}
for {
  event, version, err := queue.Read(ctx)
  if err != nil {
    yield(nil, err)
    return
  }
  // Skip events which were included in the snapshot we emitted
  if !version.After(snapshotVersion) {
    continue
  }
  if !yield(event, nil) {
    return
  }
}

AgentExecutor

In the next major SDK version update the suggestion is to change the abstraction to something independent of other extension points (i.e. EventQueue), for example:

type AgentAdapter interface {
  Execute(ctx context.Context, reqCtx *RequestContext) iter.Seq2[a2a.Event, error]
  Cancel(ctx context.Context, reqCtx *RequestContext) iter.Seq2[a2a.Event, error]
}

Renaming AgentExecutor to AgentAdapter should, subjectively, make the purpose of the extension point clearer.

Migration

To make migration of existing implementations trivial, we can export a TaskVersionMissing value which can be returned from and passed to all the changed methods:

type taskVersionMissingType struct{}

var TaskVersionMissing = taskVersionMissingType{}

func (taskVersionMissingType) After(another TaskVersion) bool {
  return true
}

If implementations did not use versioning in any way they should continue working with the version placeholder object.
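
To see how the placeholder interacts with the version check used for resubscription, the sketch below copies the TaskVersion definitions from this proposal into one file. shouldEmit is a hypothetical helper that mirrors the `version.After(snapshotVersion)` check.

```go
package main

import "fmt"

type TaskVersion interface {
	After(another TaskVersion) bool
}

type TaskVersionInt int64

func (i TaskVersionInt) After(another TaskVersion) bool {
	if ai, ok := another.(TaskVersionInt); ok {
		return i > ai
	}
	return false
}

type taskVersionMissingType struct{}

var TaskVersionMissing = taskVersionMissingType{}

func (taskVersionMissingType) After(another TaskVersion) bool {
	return true
}

// shouldEmit is a hypothetical helper: an event is emitted only if its
// version is after the version of the snapshot that was already sent.
func shouldEmit(event, snapshot TaskVersion) bool {
	return event.After(snapshot)
}

func main() {
	fmt.Println(shouldEmit(TaskVersionInt(3), TaskVersionInt(3)))   // false: already in the snapshot
	fmt.Println(shouldEmit(TaskVersionInt(4), TaskVersionInt(3)))   // true: newer than the snapshot
	fmt.Println(shouldEmit(TaskVersionMissing, TaskVersionMissing)) // true: unversioned events always pass
	fmt.Println(shouldEmit(TaskVersionInt(4), TaskVersionMissing))  // false: mixed versions are never "after"
}
```

The last case suggests that, with the definitions as written, a versioned event queue paired with an unversioned task store would have every event filtered out, so a migration would probably want to move both extension points together.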

yarolegovich · 2026-01-30T15:11:00Z

yarolegovich
Jan 30, 2026
Maintainer Author

The proposal was implemented (see #115).
