Agent and Action Early Termination #1537

Merged
igordayen merged 5 commits into main from agent-termination
Mar 27, 2026

@igordayen igordayen commented Mar 24, 2026

OVERVIEW

  • Introduced a graceful termination signal via the Blackboard
  • Introduced immediate termination via termination exceptions
  • Added termination handling logic to the abstract, simple, and concurrent agent processes
  • Added termination handling logic to the Default and Parallel tool loops

    Agent/Action Early Termination

Please refer to:

#783 and
#1481

Problem Statement

Agents and actions currently lack a mechanism to terminate gracefully or immediately based on
runtime conditions. When a tool detects an unrecoverable situation (e.g., required MCP service
unavailable, invalid state, resource exhaustion), it has no clean way to signal that the action or
entire agent should stop.

This PR introduces a termination API with two scopes (ACTION, AGENT) and two mechanisms (graceful,
immediate), giving tools and actions fine-grained control over execution flow.

Usage

Immediate Termination (Exception-based)

Use when you need to stop right now:

                                                                                  
// Stop current action, agent continues with next action                                           
@Action                                                                                            
fun processData(input: Input): Output {                                                            
    if (!validator.isValid(input)) {                                                               
        throw TerminateActionException("Invalid input - skipping action")                          
    }                                                                                              
    // ...                                                                                         
}                                                                                                  
                                                                                                   
// Stop entire agent                                                                               
@Action                                                                                            
fun criticalOperation(input: Input): Output {                                                      
    if (!mcpClient.isConnected("required_service")) {                                              
        throw TerminateAgentException("Required MCP service unavailable")                          
    }                                                                                              
    // ...                                                                                         
}                                                                                                  
                                                                                                   
Graceful Termination (Signal-based)                                                                
                                                                                                   
Use when you want to stop at the next natural checkpoint:                                          
                                                                                                   
@Action                                                                                            
fun longRunningOperation(ctx: ProcessContext): Output {                                            
    // Set signal - agent will stop before next tick                                               
    ctx.terminateAgent("User requested shutdown")                                                  
                                                                                                   
    // Current action completes normally                                                           
    return Output("Completed, but agent will stop after this")                                     
}                                                                                                  
                                                                                                   
// For LLM-based actions with tool loop                                                            
@Tool                                                                                              
fun checkCondition(ctx: ProcessContext): String {                                                  
    if (shouldStop()) {                                                                            
        // Signal - tool loop stops before next tool call                                          
        ctx.terminateAction("Condition met - no more tools needed")                                
    }                                                                                              
    return "Checked"                                                                               
}                                                                                                  

DETAILED COMPARISON

Important: Graceful terminateAction() only works in LLM-based actions with tool loops. For normal agent actions, use throw TerminateActionException().

Graceful vs Immediate Termination

  ┌──────────────┬────────────────────────────────────┬────────────────────────────────────┐
  │ Aspect       │ Graceful (Signal)                  │ Immediate (Exception)              │
  ├──────────────┼────────────────────────────────────┼────────────────────────────────────┤
  │ Mechanism    │ Sets signal on blackboard          │ Throws exception                   │
  ├──────────────┼────────────────────────────────────┼────────────────────────────────────┤
  │ When checked │ At next checkpoint                 │ Immediately                        │
  ├──────────────┼────────────────────────────────────┼────────────────────────────────────┤
  │ Current work │ Completes                          │ Stops                              │
  ├──────────────┼────────────────────────────────────┼────────────────────────────────────┤
  │ API          │ ctx.terminateAgent() /             │ throw TerminateAgentException() /  │
  │              │ ctx.terminateAction()              │ throw TerminateActionException()   │
  ├──────────────┼────────────────────────────────────┼────────────────────────────────────┤
  │ Use case     │ Clean shutdown, allow cleanup      │ Critical failure, invalid state    │
  └──────────────┴────────────────────────────────────┴────────────────────────────────────┘

Tool Loop Handling Comparison

  ┌─────────────┬─────────────────────────────────────────┬─────────────────────────────────────────┐
  │ Termination │             DefaultToolLoop             │            ParallelToolLoop             │
  │     Type    │                                         │                                         │
  ├─────────────┼─────────────────────────────────────────┼─────────────────────────────────────────┤
  │ Graceful    │ checkForActionTerminationSignal()       │ checkForActionTerminationSignal()       │
  │ ACTION      │ called before each tool call. Throws    │ called once before batch. Throws        │
  │             │ TerminateActionException.               │ TerminateActionException.               │
  ├─────────────┼─────────────────────────────────────────┼─────────────────────────────────────────┤
  │             │ Not handled by tool loop. Signal        │                                         │
  │ Graceful    │ checked at agent process level          │ Same as DefaultToolLoop. Action         │
  │ AGENT       │ (TerminationSignalPolicy) before next   │ completes normally.                     │
  │             │ tick. Action completes normally.        │                                         │
  ├─────────────┼─────────────────────────────────────────┼─────────────────────────────────────────┤
  │ Immediate   │ TerminateActionException propagates     │ Exception captured in exceptionally     │
  │ ACTION      │ immediately. Remaining tools in         │ handler. All parallel tools complete,   │
  │             │ sequence not called.                    │ then exception re-thrown.               │
  ├─────────────┼─────────────────────────────────────────┼─────────────────────────────────────────┤
  │             │ TerminateAgentException propagates      │ Exception captured in exceptionally     │
  │ Immediate   │ immediately. Remaining tools in         │ handler. All parallel tools complete,   │
  │ AGENT       │ sequence not called.                    │ then exception re-thrown with priority  │
  │             │                                         │ (agent > action).                       │
  └─────────────┴─────────────────────────────────────────┴─────────────────────────────────────────┘
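
The ParallelToolLoop column above can be illustrated with a small, self-contained sketch. It is not the PR's implementation: the exception classes are local stand-ins, and `ToolOutcome` and `runBatch` are hypothetical names. It only shows the behavior the table describes, where every tool in the batch runs to completion, termination exceptions are captured, and an agent-level termination is re-thrown in preference to an action-level one.

```kotlin
import java.util.concurrent.CompletableFuture
import java.util.concurrent.CompletionException

// Local stand-ins for the PR's exception types (assumed shapes, not the real classes).
open class TerminationException(val reason: String) : RuntimeException(reason)
class TerminateActionException(reason: String) : TerminationException(reason)
class TerminateAgentException(reason: String) : TerminationException(reason)

// Hypothetical holder for one tool call: either a value or a captured error.
data class ToolOutcome(val value: String?, val error: Throwable?)

// Hypothetical helper: run all tools in parallel, wait for all of them,
// then re-throw a captured termination with agent > action priority.
fun runBatch(tools: List<() -> String>): List<String> {
    // Launch every tool first so each one runs regardless of its siblings.
    val futures = tools.map { tool ->
        CompletableFuture.supplyAsync { tool() }.handle { value, error ->
            // Exceptions thrown inside supplyAsync arrive wrapped in CompletionException.
            val cause = if (error is CompletionException) error.cause ?: error else error
            ToolOutcome(value, cause)
        }
    }
    // join() does not throw here because handle() turned failures into values.
    val outcomes = futures.map { it.join() }
    val errors = outcomes.mapNotNull { it.error }

    errors.firstOrNull { it is TerminateAgentException }?.let { throw it }
    errors.firstOrNull { it is TerminateActionException }?.let { throw it }
    errors.firstOrNull()?.let { throw it }
    return outcomes.map { checkNotNull(it.value) }
}
```

Calling `runBatch` with one tool that throws `TerminateActionException` and another that throws `TerminateAgentException` surfaces the agent-level exception, and any third tool in the batch still runs.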

Key Implementation Details

  • ActionStatusCode.TERMINATED: New status for action-level termination. Maps to
    AgentProcessStatusCode.RUNNING (agent continues).
  • ActionStatusCode.AGENT_TERMINATED: New status for agent-level termination. Maps to
    AgentProcessStatusCode.TERMINATED (agent stops).
  • Retry exclusion: Both termination exceptions excluded from retry policies (SpringAiRetryPolicy).
  • hasRun handling: Terminated actions don't set hasRun=true, allowing retry on next tick.
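
The two status mappings listed above can be sketched as a single `when` expression. The enums below are reduced stand-ins (the real ones in core/OperationStatus.kt have more members), `toProcessStatus` is a hypothetical name, and the SUCCEEDED and FAILED branches are assumptions added only to make the sketch exhaustive.

```kotlin
// Simplified stand-ins; only TERMINATED and AGENT_TERMINATED come from the PR description.
enum class ActionStatusCode { SUCCEEDED, FAILED, TERMINATED, AGENT_TERMINATED }
enum class AgentProcessStatusCode { RUNNING, FAILED, TERMINATED }

// Hypothetical mapping function illustrating the two new statuses.
fun toProcessStatus(action: ActionStatusCode): AgentProcessStatusCode = when (action) {
    ActionStatusCode.SUCCEEDED -> AgentProcessStatusCode.RUNNING          // assumption
    ActionStatusCode.FAILED -> AgentProcessStatusCode.FAILED              // assumption
    ActionStatusCode.TERMINATED -> AgentProcessStatusCode.RUNNING         // action stopped, agent continues
    ActionStatusCode.AGENT_TERMINATED -> AgentProcessStatusCode.TERMINATED // agent stops
}
```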

Files Changed

New files:

  • api/termination/TerminationExtensions.kt - Extension functions for graceful termination
  • api/common/TerminationSignal.kt - Signal data class and scope enum
  • api/tool/TerminationExceptions.kt - Exception classes

Modified:

  • core/OperationStatus.kt - Added TERMINATED, AGENT_TERMINATED to ActionStatusCode
  • core/support/AbstractAgentProcess.kt - Catch termination exceptions in executeAction
  • core/support/ConcurrentAgentProcess.kt - Handle termination statuses
  • spi/loop/support/DefaultToolLoop.kt - Check for graceful action termination signal
  • spi/loop/support/ParallelToolLoop.kt - Capture and propagate termination in parallel execution
  • spi/support/springai/SpringAiRetryPolicy.kt - Exclude termination from retry

@jasperblues

The termination context is currently string-only (reason: String) across the public API (TerminateAgentException, TerminateActionException, ProcessContext.terminateAgent/terminateAction). If we ever need structured context (error codes, metadata, severity, recovery hints), we'd need to add overloads or break the API.

String-only is probably fine — adding an optional metadata: Map<String, Any> = emptyMap() parameter later is backwards-compatible in Kotlin (default args) and Java (overloads). But wrapping in a TerminationContext type later would be a breaking change, so if there is appetite for structured context (e.g., data class TerminationContext(val reason: String, val errorCode: String? = null, val metadata: Map<String, Any> = emptyMap())) then it's worth introducing the type now rather than migrating callers later.
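
A minimal sketch of the backwards-compatible option mentioned above, assuming the exception keeps its `reason` and later gains an optional `metadata` parameter. The class below is a local stand-in, not the PR's class; `@JvmOverloads` is the standard Kotlin annotation that generates the extra constructor overloads for Java callers.

```kotlin
// Hypothetical future shape: the PR itself keeps a string-only reason.
class TerminateActionException @JvmOverloads constructor(
    val reason: String,
    // Optional with a default, so existing Kotlin call sites keep compiling;
    // @JvmOverloads keeps the single-argument constructor available to Java.
    val metadata: Map<String, Any> = emptyMap(),
) : RuntimeException(reason)
```

Existing callers such as `TerminateActionException("Service unavailable")` are unaffected, while new callers can pass `mapOf("errorCode" to "E42")` as a second argument.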

@jasperblues

The termination signal is currently stored on the Blackboard via a magic string key (__termination_signal__), but it seems like purely process-level control flow — written by ProcessContext extensions, read by internal checkpoints (TerminationSignalPolicy, checkForActionTerminationSignal), never consumed externally. Since the Blackboard is the data plane for action inputs/outputs, is there a reason the signal needs to live there rather than on the process itself?

Asking because it leads to a couple of friction points: the Blackboard doesn't support key removal ("objects are immutable and may not be removed"), so clearTerminationSignal() has to set Unit as a sentinel value. And the signal sits alongside domain data with a magic string key.

Given that AbstractAgentProcess already has private mutable fields for process-level state (_status, _failureInfo, _goal, _lastWorldState), a _terminationSignal: TerminationSignal? = null would fit that same pattern — clearing becomes = null, and the BLACKBOARD_KEY / Unit sentinel / clearTerminationSignal all go away. What do you think?
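
The suggestion above could look roughly like the following. `ProcessSketch` is a hypothetical stand-in for AbstractAgentProcess reduced to the one field under discussion, and `requestTermination` is an illustrative name; the `TerminationSignal` shape (scope plus reason) is assumed from the PR description.

```kotlin
// Assumed shape of the signal: a scope and a human-readable reason.
enum class TerminationScope { AGENT, ACTION }
data class TerminationSignal(val scope: TerminationScope, val reason: String)

// Hypothetical stand-in for AbstractAgentProcess.
class ProcessSketch {
    // Process-level state, next to fields like _status or _failureInfo.
    private var _terminationSignal: TerminationSignal? = null

    val terminationSignal: TerminationSignal?
        get() = _terminationSignal

    fun requestTermination(scope: TerminationScope, reason: String) {
        _terminationSignal = TerminationSignal(scope, reason)
    }

    // Clearing is a plain null assignment: no magic key and no Unit sentinel.
    fun clearTerminationSignal() {
        _terminationSignal = null
    }
}
```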

@jasperblues

The PR description documents the mechanics of signal vs exception well. The project docs use practical code examples throughout (e.g., the tools reference) — how about enhancing the docs with practical "when would I choose which" supported by examples? The key distinction is about side effects:

// Signal: "Let me finish my work, then stop"
@LlmTool(description = "Save and shutdown")
fun saveAndStop(ctx: ProcessContext): String {
    customerRepository.save(record)  // side effect completes
    ctx.terminateAction("Save complete, no more work needed")
    return "Saved"  // tool finishes normally
}

// Exception: "Stop now, nothing left to do"
@LlmTool(description = "Check service health")
fun checkHealth(): String {
    if (!mcpClient.isConnected("required_service")) {
        throw TerminateActionException("Service unavailable")
        // nothing after this runs
    }
    return "Healthy"
}

It is especially worth noting the key difference between sequential and parallel stop points. In both modes, neither mechanism can stop sibling tools already in flight — that's inherent to parallelism. For example, if deleteCustomer and chargeCard run in the same parallel batch and deleteCustomer signals termination — chargeCard is already executing and will attempt to charge a deleted customer.

Currently there's an additional difference: in ParallelToolLoop, the signal path isn't detected until the next batch, costing an extra LLM round-trip and potentially more side effects:

Signal path (current) — not detected until next batch:

Batch 1: [deleteCustomer, chargeCard] launch in parallel
  ├── deleteCustomer → sets signal, returns normally ───┐
  └── chargeCard     → runs to completion (charges!)    │
                                                         ▼
  propagateControlFlowSignals() finds nothing
  → results sent to LLM
  → LLM responds with Batch 2: [sendConfirmation, ...]
  → checkForActionTerminationSignal() detects signal
  → action stops HERE (one extra LLM round-trip + batch 2 side effects)

Adding a checkForActionTerminationSignal() call right after propagateControlFlowSignals(results) would catch signals set during the batch, eliminating the extra round-trip and making the stop-point behavior consistent between sequential and parallel — the only difference becomes the inherent in-flight sibling behavior.
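
The effect of the extra check can be modelled with a toy loop. Everything here is a simplification: `LoopSketch` is hypothetical, tools run sequentially rather than in parallel, `pendingReason` stands in for the action-scoped signal, and `llmRoundTrips` simply counts how many batches the LLM was asked to produce.

```kotlin
class TerminateActionException(reason: String) : RuntimeException(reason)

// Hypothetical toy model of a batch-oriented tool loop.
class LoopSketch {
    var pendingReason: String? = null // stand-in for the action-scoped termination signal
    var llmRoundTrips = 0

    private fun checkForActionTerminationSignal() {
        val reason = pendingReason ?: return
        pendingReason = null
        throw TerminateActionException(reason)
    }

    fun run(batches: List<List<(LoopSketch) -> String>>, checkAfterBatch: Boolean) {
        for (batch in batches) {
            llmRoundTrips++                      // the LLM proposed this batch
            checkForActionTerminationSignal()    // existing check before the batch
            batch.forEach { tool -> tool(this) } // parallelism omitted for brevity
            if (checkAfterBatch) checkForActionTerminationSignal() // proposed extra check
        }
    }
}
```

With a first batch whose tool sets `pendingReason`, the loop without the extra check reaches the second round-trip before throwing, while the loop with it throws after the first.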

@jasperblues

Quick question on the retry behavior for terminated actions: hasRun isn't set to true for TERMINATED actions, which means the planner keeps selecting them on every tick. The existing test handles this with an attemptCount that makes the condition transient — but what happens if the termination condition is permanent (e.g., an external service that stays down)?

I wrote a test to check:

val PermanentlyTerminatingActionAgent = agent("PermanentTerminator", description = "Agent that always terminates action") {
    transformation<UserInput, TestPerson>(name = "always_terminating_action") {
        val attemptCount = (it["attemptCount"] as? Int) ?: 0
        it["attemptCount"] = attemptCount + 1
        throw TerminateActionException("Service permanently unavailable")
    }
    // ...goal and other transformations...
}

@Test
fun `permanently terminating action retries until action budget exhausted`() {
    val actionBudget = 5
    val agentProcess = SimpleAgentProcess(
        id = "test-permanent-action-termination",
        agent = PermanentlyTerminatingActionAgent,
        processOptions = ProcessOptions(budget = Budget(actions = actionBudget)),
        // ...
    )

    val result = agentProcess.run()

    // The action was retried on every tick until budget exhausted
    assertThat(blackboard["attemptCount"] as Int).isEqualTo(actionBudget)  // all 5 burned
    assertThat(result.status).isEqualTo(AgentProcessStatusCode.TERMINATED)
}

This passes — the action retries all 5 times before maxActions terminates the process. With the default budget of 50 actions, and a slow action (e.g., network timeout), that could mean a long wait before the process gives up.

Would it make sense to still set hasRun = true on terminated actions, and let actions that genuinely need to retry after termination opt in via canRerun = true? That way it stays consistent with the existing mechanism rather than introducing a special case.
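
To make the trade-off concrete, here is a toy model of the two policies. It is not the framework's planner: `ActionSketch` and `runTicks` are hypothetical, and the `markTerminatedAsRun` flag switches between the PR's original behavior (terminated actions stay eligible) and the proposal (terminated actions count as run unless `canRerun` is set).

```kotlin
class TerminateActionException(reason: String) : RuntimeException(reason)

// Hypothetical action with the two flags discussed above.
class ActionSketch(val canRerun: Boolean = false, val body: () -> Unit) {
    var hasRun = false
}

// Hypothetical tick loop: returns how many ticks were spent on the action.
fun runTicks(action: ActionSketch, budget: Int, markTerminatedAsRun: Boolean): Int {
    var ticks = 0
    while (ticks < budget) {
        // An action that has run is only selected again if it opted in.
        if (action.hasRun && !action.canRerun) break
        ticks++
        try {
            action.body()
            action.hasRun = true
            break // completed normally, nothing left to do in this sketch
        } catch (e: TerminateActionException) {
            if (markTerminatedAsRun) action.hasRun = true
        }
    }
    return ticks
}
```

For an action that always throws, a budget of 5 is fully consumed under the original behavior, a single tick is spent under the proposal, and the full budget is again consumed only when the action explicitly declares `canRerun = true`.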

@jasperblues

In ConcurrentAgentProcess.actionStatusToAgentProcessStatus, AGENT_TERMINATED is checked before FAILED. Since we support concurrent actions, consider two running at the same time:

  • Action A: tries to connect to a database, gets a NullPointerException → FAILED
  • Action B: checks a condition and throws TerminateAgentException("shutting down") → AGENT_TERMINATED

The process reports TERMINATED, and the failure from Action A is silently dropped — no log, no failureInfo. Someone debugging why the agent stopped sees "terminated" and might not look for the NPE.

Is that intentional? Could be worth either logging the concurrent failure when AGENT_TERMINATED wins, or returning a richer result that surfaces both signals. What do you think?
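
One way the logging option could look, as a sketch only. The enums are reduced stand-ins, `ActionOutcome` and `combine` are hypothetical names, and the logger is passed in as a plain function so the example stays self-contained. AGENT_TERMINATED still wins, but concurrent failures are reported rather than dropped.

```kotlin
// Simplified stand-ins for the real status enums.
enum class ActionStatusCode { SUCCEEDED, FAILED, TERMINATED, AGENT_TERMINATED }
enum class AgentProcessStatusCode { RUNNING, FAILED, TERMINATED }

// Hypothetical record of one concurrent action's result.
data class ActionOutcome(val name: String, val status: ActionStatusCode, val error: Throwable? = null)

// Hypothetical combiner: same precedence as described above, plus logging.
fun combine(outcomes: List<ActionOutcome>, log: (String) -> Unit): AgentProcessStatusCode {
    val failures = outcomes.filter { it.status == ActionStatusCode.FAILED }
    return when {
        outcomes.any { it.status == ActionStatusCode.AGENT_TERMINATED } -> {
            // Surface failures that would otherwise be hidden by the termination.
            failures.forEach { log("Action ${it.name} failed concurrently with agent termination: ${it.error}") }
            AgentProcessStatusCode.TERMINATED
        }
        failures.isNotEmpty() -> AgentProcessStatusCode.FAILED
        else -> AgentProcessStatusCode.RUNNING
    }
}
```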

@igordayen
igordayen commented Mar 25, 2026

@jasperblues - thanks for the extensive feedback.

  1. TerminationContext - failureInfo: Any? already supports structured data.

  Flow:

  AGENT scope:
    Action sets _terminationRequest(AGENT, "reason")
         ↓
    identifyEarlyTermination() detects
         ↓
    Creates EarlyTermination(reason="reason", policy=TerminationSignalPolicy)
         ↓
    _failureInfo = earlyTermination    ← stored here (existing field)
    _terminationRequest = null         ← cleared
    status = TERMINATED

  ACTION scope:
    Tool sets _terminationRequest(ACTION, "reason")
         ↓
    checkForActionTerminationSignal() detects
         ↓
    _terminationRequest = null         ← cleared
    throw TerminateActionException("reason")
         ↓
    (_failureInfo NOT set - process continues)

  2. Using AbstractAgentProcess vs. the Blackboard ==> very good point, agreed
  3. TERMINATED ==> hasRun=true set (like normal actions)
    • Actions needing retry must declare canRerun=true
    • Consistent, explicit, safer ==> agreed
  4. Priority of Failure over Terminated:
    • Arguable - failures are unexpected, termination is intentional
    • But losing failure info is bad either way
    • AGENT_TERMINATED is intentional (developer requested it), but FAILED is unexpected
      (something went wrong). Both should be visible for debugging. ==> will address
  5. Re: "eliminating the extra round-trip and making the stop-point behavior consistent between sequential and parallel — the only difference becomes the inherent in-flight sibling behavior." ==> agreed

@igordayen
igordayen commented Mar 25, 2026

Review feedback

Items completed:

  1. Moved termination signal to AbstractAgentProcess field - _terminationRequest with
    terminationRequest/setTerminationRequest()/resetTerminationRequest() accessors
  2. Updated TerminationExtensions - uses AbstractAgentProcess cast instead of Blackboard
  3. Updated identifyEarlyTermination() - handles AGENT scope and clears stale ACTION signals
  4. Updated tool loops - checkForActionTerminationSignal() uses new field
  5. Removed BLACKBOARD_KEY from TerminationSignal
  6. Used canRerun=true instead of hasRun special case for TERMINATED actions
  7. Added signal check after transformers in both DefaultToolLoop and ParallelToolLoop (performance concern on extra LLM call)
  8. Added concurrent failure logging in ConcurrentAgentProcess

@jasperblues
jasperblues commented Mar 25, 2026

Looks good!

A couple of small things I noticed:

  1. Stale KDoc in TerminationSignal.kt — The doc still says "When placed on the blackboard" but the signal now lives on the process. Minor, but worth updating so it doesn't confuse anyone reading the API docs.

  2. Hard cast in TerminationExtensions.kt: terminateAgent/terminateAction use agentProcess as AbstractAgentProcess, which would throw a bare ClassCastException if someone has a custom AgentProcess implementation. Maybe a safe cast with a descriptive error? e.g.:

    val process = agentProcess as? AbstractAgentProcess
        ?: error("Termination signals require AbstractAgentProcess")

Also — do you think it would be worth adding some "when would I choose signal vs exception" examples to the user-facing docs? Something like the signal-for-side-effects vs exception-for-immediate-stop pattern. Totally optional, but could help users pick the right mechanism without having to read the internals.

@jasperblues jasperblues left a comment

Looks good. A couple of minor nitpicks in my latest comment. I think they're worth tacking on, but I added approval in advance.

@igordayen

@jasperblues - agreed with comments, will address. thank you

@igordayen

  • Stale KDoc in TerminationSignal.kt — The doc still says "When placed on the blackboard" but the signal now lives on the process. Minor, but worth updating so it doesn't confuse anyone reading the API docs.
  • Hard cast in TerminationExtensions.kt: terminateAgent/terminateAction use agentProcess as AbstractAgentProcess, which would throw a bare ClassCastException if someone has a custom AgentProcess implementation. Maybe a safe cast with a descriptive error? e.g.:

==> addressed both.

examples for user guide will be added as per suggestion.

@poutsma poutsma left a comment

Looking good in general, couple of comments.

- Introduced TerminationException as a parent to Action and Agent Termination Exceptions
- Introduced public API terminateAgent/Action on AgentProcess interface
- Updated User Guide
@igordayen

Summary of changes:

  • TerminationException sealed base class
  • AgentProcess interface with terminateAgent()/terminateAction()
  • AbstractAgentProcess implementation with private setTerminationRequest
  • TerminationExtensions simplified to delegate
  • Tests updated to use public API
  • Cognitive complexity fixed in SimpleAgentProcess
  • Checkpoints verified correct in both tool loops
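
A sealed base class, as listed in the summary above, lets callers handle both termination kinds exhaustively and lets retry logic exclude them with one type check. The sketch below uses local stand-in classes with assumed constructors; `describe` and `isRetryable` are hypothetical helpers, not part of the PR.

```kotlin
// Assumed shapes; the real classes live in api/tool/TerminationExceptions.kt.
sealed class TerminationException(val reason: String) : RuntimeException(reason)
class TerminateActionException(reason: String) : TerminationException(reason)
class TerminateAgentException(reason: String) : TerminationException(reason)

// Exhaustive `when`: the compiler rejects this if a new subtype is left unhandled.
fun describe(e: TerminationException): String = when (e) {
    is TerminateActionException -> "action stopped: ${e.reason}"
    is TerminateAgentException -> "agent stopped: ${e.reason}"
}

// Hypothetical retry predicate: termination is deliberate, so it is never retried.
fun isRetryable(t: Throwable): Boolean = t !is TerminationException
```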

@poutsma poutsma left a comment

Looks good now, thanks!

@igordayen igordayen merged commit eb57f0b into main Mar 27, 2026
17 checks passed
@igordayen igordayen deleted the agent-termination branch March 27, 2026 13:44