|
| 1 | +# Event Queue Architecture |
| 2 | + |
| 3 | +## Overview |
| 4 | +The A2A Java SDK uses a producer-consumer pattern for handling asynchronous agent execution and event streaming. |
| 5 | + |
| 6 | +## Core Components |
| 7 | + |
| 8 | +### EventQueue Hierarchy |
| 9 | +- **MainQueue**: Primary queue for a task, manages child queues |
| 10 | +- **ChildQueue**: "Tapped" view that receives copies of parent events |
| 11 | +- **Reference Counting**: MainQueue tracks children, closes when all children close (or immediate flag set) |
| 12 | + |
| 13 | +### Queue Lifecycle Management |
| 14 | + |
| 15 | +**Creation:** |
| 16 | +```java |
| 17 | +EventQueue queue = queueManager.createOrTap(taskId); |
| 18 | +// Creates MainQueue if new, or ChildQueue if tapping existing |
| 19 | +``` |
| 20 | + |
| 21 | +**Closing:** |
| 22 | +```java |
| 23 | +queue.close(immediate, notifyParent); |
| 24 | +// immediate: clear pending events vs graceful drain |
| 25 | +// notifyParent: decrement MainQueue ref count (ChildQueue only) |
| 26 | +``` |
| 27 | + |
| 28 | +### Thread Model |
| 29 | + |
| 30 | +**Agent-Executor Threads:** |
| 31 | +- Pool of 15 threads (configurable via `a2a.executor.max-pool-size`) |
| 32 | +- Execute `AgentExecutor.execute()` which enqueues events |
| 33 | +- Return immediately after execution (as of fix 3e93dd2) |
| 34 | +- No longer block waiting for consumer to start |
| 35 | + |
| 36 | +**Vert.x Worker Threads:** |
| 37 | +- Handle HTTP/gRPC/REST requests |
| 38 | +- Create and run EventConsumer to poll queue |
| 39 | +- Process events via ResultAggregator |
| 40 | +- Close queue on final events/errors |
| 41 | + |
| 42 | +**Background Threads:** |
| 43 | +- Cleanup tasks tracked in `backgroundTasks` set |
| 44 | +- Manage ChildQueue/MainQueue lifecycle after agent completion |
| 45 | +- Handle reference counting for resubscription support |
| 46 | + |
| 47 | +## Event Flow |
| 48 | + |
| 49 | +### Non-Streaming (onMessageSend) |
| 50 | +``` |
| 51 | +1. Vert.x worker: Create EventQueue, EventConsumer |
| 52 | +2. Agent-executor: Execute agent async → enqueue events → return immediately |
| 53 | +3. Vert.x worker: Poll queue in consumeAndBreakOnInterrupt() |
| 54 | + - Blocks until terminal event or AUTH_REQUIRED |
| 55 | + - Consumer closes queue on final event |
| 56 | +4. Background: Wait for agent → cleanup ChildQueue → manage MainQueue refs |
| 57 | +``` |
| 58 | + |
| 59 | +### Streaming (onMessageSendStream) |
| 60 | +``` |
| 61 | +1. Vert.x worker: Create EventQueue, EventConsumer |
| 62 | +2. Agent-executor: Execute agent async → enqueue events → return immediately |
| 63 | +3. Vert.x worker: Return Flow.Publisher immediately |
| 64 | + - Consumer polls queue in background |
| 65 | + - Streams events to client |
| 66 | + - On client disconnect: continue consuming in background |
| 67 | + - Consumer closes queue on final event |
| 68 | +4. Background: Always async cleanup for streaming |
| 69 | +``` |
| 70 | + |
| 71 | +## Queue Closing Semantics |
| 72 | + |
| 73 | +### Who Closes What |
| 74 | + |
| 75 | +**EventConsumer.consumeAll()** - Primary queue closer: |
| 76 | +- Closes queue when final event received (COMPLETED, CANCELED, FAILED, REJECTED, UNKNOWN) |
| 77 | +- Closes queue on Message event (legacy) |
| 78 | +- Closes queue on EventQueueClosedException |
| 79 | + |
| 80 | +**Background cleanupProducer()** - Secondary cleanup: |
| 81 | +- Waits for agent completion |
| 82 | +- Closes ChildQueue (may already be closed by consumer) |
| 83 | +- Uses `close(immediate=false, notifyParent=!keepMainQueueAlive)` for reference counting |
| 84 | + |
| 85 | +**Error Handling:** |
| 86 | +- Agent errors: Set via EnhancedRunnable.setError(), consumer picks up via callback |
| 87 | +- Consumer errors: EventConsumer fails tube, queue closes |
| 88 | +- No immediate queue close in agent whenComplete() callback |
| 89 | + |
| 90 | +### Reference Counting for Resubscription |
| 91 | + |
| 92 | +**Purpose:** Keep MainQueue alive for resubscription when task is non-final |
| 93 | + |
| 94 | +**Mechanism:** |
| 95 | +```java |
| 96 | +boolean keepMainQueueAlive = !isStreaming && task != null && !task.isFinal(); |
| 97 | +queue.close(false, !keepMainQueueAlive); |
| 98 | +``` |
| 99 | + |
| 100 | +**Effect:** |
| 101 | +- Non-streaming non-final tasks: Close ChildQueue without notifying MainQueue |
| 102 | +- MainQueue stays open, allowing onResubscribeToTask() to tap it |
| 103 | +- Streaming or final tasks: Close ChildQueue and notify MainQueue (decrements ref count) |
| 104 | + |
| 105 | +## Signal Mechanism (Deprecated) |
| 106 | + |
| 107 | +### awaitQueuePollerStart() - NO LONGER USED as of 3e93dd2 |
| 108 | + |
| 109 | +**Original Purpose:** |
| 110 | +Agent-executor threads waited for consumer to start polling before returning, ensuring queue wouldn't close prematurely. |
| 111 | + |
| 112 | +**Problem:** |
| 113 | +Created cascading delays when Vert.x workers were busy. Agent threads blocked up to 10 seconds waiting for worker threads to become available. |
| 114 | + |
| 115 | +**Replacement:** |
| 116 | +Queue lifecycle entirely managed by consumer. Consumer is guaranteed to be running when it closes queue, eliminating race condition. |
| 117 | + |
| 118 | +**Implementation Details:** |
| 119 | +- CountDownLatch in MainQueue, countdown in dequeueEvent() finally block |
| 120 | +- Still exists in code but no longer called |
| 121 | +- Can be removed in future cleanup |
| 122 | + |
| 123 | +## Configuration |
| 124 | + |
| 125 | +### Thread Pool Settings |
| 126 | +```properties |
| 127 | +# tck/src/main/resources/application.properties |
| 128 | +a2a.executor.core-pool-size=5 |
| 129 | +a2a.executor.max-pool-size=15 |
| 130 | +a2a.executor.keep-alive-seconds=60 |
| 131 | +``` |
| 132 | + |
| 133 | +### Vert.x Settings |
| 134 | +```properties |
| 135 | +# Worker thread blocking timeout |
| 136 | +quarkus.vertx.max-worker-execute-time=300s # Temporary workaround, should be removed |
| 137 | +``` |
| 138 | + |
| 139 | +### Queue Settings |
| 140 | +```java |
| 141 | +EventQueue.DEFAULT_QUEUE_SIZE = 1000 |
| 142 | +EventConsumer.QUEUE_WAIT_MILLISECONDS = 500 # Polling timeout |
| 143 | +``` |
| 144 | + |
| 145 | +## Best Practices |
| 146 | + |
| 147 | +### Creating Queues |
| 148 | +- Use `queueManager.createOrTap()` for both create and resubscribe scenarios |
| 149 | +- MainQueue created with onClose callback for QueueManager cleanup |
| 150 | +- ChildQueues inherit MainQueue events via internalEnqueueEvent() |
| 151 | + |
| 152 | +### Closing Queues |
| 153 | +- Let EventConsumer.consumeAll() handle queue closing (it knows when to close) |
| 154 | +- Background cleanup manages reference counting |
| 155 | +- Don't close queue in agent execution code |
| 156 | + |
| 157 | +### Resubscription Support |
| 158 | +- Non-streaming non-final tasks keep MainQueue alive |
| 159 | +- Use `queue.close(false, false)` to close ChildQueue without notifying parent |
| 160 | +- MainQueue closes when last child notifies or immediate close requested |
| 161 | + |
| 162 | +### Error Handling |
| 163 | +- Set errors via EnhancedRunnable.setError() for consumer pickup |
| 164 | +- Consumer handles errors via createAgentRunnableDoneCallback() |
| 165 | +- Background cleanup is safe to call even if queue already closed |
| 166 | + |
| 167 | +## Common Patterns |
| 168 | + |
| 169 | +### Standard Request Handler Pattern |
| 170 | +```java |
| 171 | +EventQueue queue = queueManager.createOrTap(taskId); |
| 172 | +EnhancedRunnable producer = registerAndExecuteAgentAsync(taskId, context, queue); |
| 173 | +try { |
| 174 | + EventConsumer consumer = new EventConsumer(queue); |
| 175 | + producer.addDoneCallback(consumer.createAgentRunnableDoneCallback()); |
| 176 | + // Consume events (blocking or streaming) |
| 177 | +} finally { |
| 178 | + runningAgents.remove(taskId); |
| 179 | + trackBackgroundTask(cleanupProducer(agentFuture, taskId, queue, isStreaming)); |
| 180 | +} |
| 181 | +``` |
| 182 | + |
| 183 | +### Cancel Pattern |
| 184 | +```java |
| 185 | +EventQueue queue = queueManager.tap(taskId); // Tap existing or create new |
| 186 | +agentExecutor.cancel(context, queue); |
| 187 | +Optional.ofNullable(runningAgents.get(taskId)).ifPresent(cf -> cf.cancel(true)); |
| 188 | +EventConsumer consumer = new EventConsumer(queue); |
| 189 | +EventKind result = resultAggregator.consumeAll(consumer); |
| 190 | +``` |
| 191 | + |
| 192 | +### Resubscribe Pattern |
| 193 | +```java |
| 194 | +EventQueue queue = queueManager.tap(taskId); |
| 195 | +if (queue == null && !task.isFinal()) { |
| 196 | + queue = queueManager.createOrTap(taskId); // Create new for future events |
| 197 | +} |
| 198 | +EventConsumer consumer = new EventConsumer(queue); |
| 199 | +return resultAggregator.consumeAndEmit(consumer); |
| 200 | +``` |
| 201 | + |
| 202 | +## Troubleshooting |
| 203 | + |
| 204 | +### Queue Not Closing |
| 205 | +- Check if consumer reached final event (DEBUG logs show "Dequeued event") |
| 206 | +- Verify task state is actually final (isFinal() returns true) |
| 207 | +- Check background cleanup completed (CLEANUP END log message) |
| 208 | + |
| 209 | +### Events Missing |
| 210 | +- Verify queue not closed before consumer started polling |
| 211 | +- Check for EventQueueClosedException in logs |
| 212 | +- Ensure agent enqueued events before returning |
| 213 | + |
| 214 | +### Resource Leaks |
| 215 | +- Monitor runningAgents.size() (should return to 0) |
| 216 | +- Check backgroundTasks.size() (tasks should complete) |
| 217 | +- Verify QueueManager active queue count |
| 218 | +- Use thread dumps to identify blocked threads |
| 219 | + |
| 220 | +### Performance Issues |
| 221 | +- Check for blocking waits in agent-executor threads (should be none) |
| 222 | +- Monitor Vert.x worker thread availability |
| 223 | +- Review queue wait timeouts and polling frequency |
| 224 | +- Consider increasing thread pool size if CPU allows |
0 commit comments