.NET: [Bug]: Workflow checkpoints are not restorable across SDK upgrades — TypeId uses Assembly.FullName (incl. version) for executor type matching #6466

@saikir1994

Description

Labels: bug, workflows, checkpointing

Summary

Workflow checkpoint restore fails with:

System.IO.InvalidDataException: The specified checkpoint is not compatible with the workflow associated with this runner.

…whenever the Microsoft.Agents.AI.Workflows (or other executor/port type-owning) assembly version changes between the run that wrote the checkpoint and the run that restores it — e.g. after any package upgrade and redeploy. The workflow topology, executor IDs, and state shape are all unchanged; only the assembly version differs.

Because the agent SDK is in fast-moving preview (we've taken 1.3.0 → 1.6.1 → 1.6.2 → 1.8.0 → 1.9.0 over ~5 weeks), every upgrade silently invalidates all previously persisted checkpoints, breaking in-flight conversations/workflows that resume after a deploy.

Root cause

Checkpoint/workflow compatibility is gated by WorkflowInfo.IsMatch, which compares each executor's type through ExecutorInfo and its TypeId. TypeId bases identity on Assembly.FullName, which embeds Version, Culture, and PublicKeyToken:

// TypeId
public TypeId(Type type)
    : this(type.Assembly.FullName, type.FullName) { }   // AssemblyName = "...Version=1.8.0.0, Culture=..., PublicKeyToken=..."

public bool IsMatch(Type type)
{
    if (AssemblyName == type.Assembly.FullName)          // <-- version-sensitive comparison
        return TypeName == type.FullName;
    return false;
}

The runner serializes these TypeIds into the checkpoint and, on restore, re-derives them from the currently loaded assemblies:

// InProcessRunner.RestoreCheckpointCoreAsync
Checkpoint checkpoint = await CheckpointManager.LookupCheckpointAsync(SessionId, checkpointInfo);
if (!CheckWorkflowMatch(checkpoint))                      // checkpoint.Workflow.IsMatch(Workflow)
{
    throw new InvalidDataException(
        "The specified checkpoint is not compatible with the workflow associated with this runner.");
}

So a checkpoint written under ...Version=1.8.0.0 can never match a runner whose executor/port types now resolve to ...Version=1.9.0.0, even though the types (namespace + name) and serialized state are identical.
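
The difference between the two comparisons can be seen without the SDK. The following is a minimal sketch using only System.Reflection; the two display names are illustrative, and the PublicKeyToken value in particular is a placeholder rather than the real token of the shipped assembly.

```csharp
using System;
using System.Reflection;

// Illustrative display names only: the PublicKeyToken below is a placeholder,
// not the token of the real Microsoft.Agents.AI.Workflows assembly.
string writtenBy  = "Microsoft.Agents.AI.Workflows, Version=1.8.0.0, Culture=neutral, PublicKeyToken=null";
string restoredBy = "Microsoft.Agents.AI.Workflows, Version=1.9.0.0, Culture=neutral, PublicKeyToken=null";

// Same kind of check as TypeId.IsMatch: compare the full display names.
bool fullNamesMatch = writtenBy == restoredBy;

// AssemblyName parses a display name; Name is the simple name without
// Version, Culture, or PublicKeyToken.
var simpleWritten  = new AssemblyName(writtenBy).Name;
var simpleRestored = new AssemblyName(restoredBy).Name;
bool simpleNamesMatch = simpleWritten == simpleRestored;

Console.WriteLine($"Full display names match: {fullNamesMatch}");   // False
Console.WriteLine($"Simple names match:       {simpleNamesMatch}"); // True
```

Only the Version component differs, yet the full-name comparison fails, which is the mismatch that surfaces as the InvalidDataException on restore.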

Steps to reproduce

  1. Build a workflow whose executors are framework-provided (e.g. any agent bound via AsAIAgent(...).WithCheckpointing(...)), run a turn, and persist a checkpoint via an ICheckpointStore/JsonCheckpointStore.
  2. Upgrade Microsoft.Agents.AI.Workflows to any different version (patch/minor/major) — or otherwise change the assembly version.
  3. Reconstruct the same workflow and call RestoreCheckpointAsync (or resume the workflow agent) with the previously stored CheckpointInfo.

Expected: Restore succeeds, since the workflow shape and state are unchanged.
Actual: InvalidDataException: The specified checkpoint is not compatible with the workflow associated with this runner.

Impact

  • Any host that persists workflow checkpoints across process restarts/deploys (the intended durability use case) loses all existing checkpoints on every SDK bump.
  • For interactive multi-turn agents, this surfaces as a hard, unrecoverable error on the first turn after a deploy. The conversation is effectively bricked unless the app detects the exception message string and resets the session.
  • The failure is opaque: it's a generic InvalidDataException with a message string, with no indication that an assembly version mismatch (vs. a genuine topology change) caused it, and no machine-readable detail about which executor/type diverged.
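
For reference, the string-matching workaround mentioned above looks roughly like the sketch below. RestoreOrResetAsync and its restore/startFresh callbacks are hypothetical host-side names, not SDK APIs; the only detail taken from the SDK is the exception type and message quoted in this issue. It is brittle by construction, since it breaks if the message text ever changes.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

// Message text copied from the exception quoted in this issue.
const string IncompatibleMessage =
    "The specified checkpoint is not compatible with the workflow associated with this runner.";

// Hypothetical host helper: 'restore' would wrap the checkpoint restore call,
// 'startFresh' would discard the checkpoint and begin a new session.
// Returns true if the restore succeeded, false if the host had to reset.
async Task<bool> RestoreOrResetAsync(Func<Task> restore, Func<Task> startFresh)
{
    try
    {
        await restore();
        return true;
    }
    catch (InvalidDataException ex)
        when (ex.Message.Contains(IncompatibleMessage, StringComparison.Ordinal))
    {
        // Cannot tell a version-only mismatch from a real topology change here.
        await startFresh();
        return false;
    }
}

// Simulated failing restore, standing in for the real runner call.
bool resumed = await RestoreOrResetAsync(
    restore: () => throw new InvalidDataException(IncompatibleMessage),
    startFresh: () => Task.CompletedTask);

Console.WriteLine(resumed); // False
```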

Suggested fixes (in rough priority order)

  1. Don't include assembly version in type identity for matching. Match on Type.FullName (namespace + type name), and optionally Assembly.GetName().Name (simple name) — not Assembly.FullName. This makes checkpoints portable across version-only changes while still distinguishing genuinely different types.
  2. Make type compatibility pluggable. Allow callers to supply an ITypeCompatibilityResolver/comparer (or a TypeId matching policy: Exact vs NameOnly vs NameAndSimpleAssembly) so hosts can opt into version-tolerant restore.
  3. Add a checkpoint compatibility/version envelope with a documented forward/backward-compatibility contract, instead of relying on Assembly.FullName equality as an implicit schema check.
  4. At minimum, fail better. Throw a typed, catchable exception (e.g. WorkflowCheckpointMismatchException) that includes the specific diff (expected vs. actual TypeId/executor id), so hosts can distinguish "incompatible SDK version" from "topology actually changed" and react deterministically rather than string-matching the message.
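
A rough sketch of how fixes 1 and 2 could fit together is shown below. PortableTypeId and TypeMatchPolicy are hypothetical names invented for illustration, not existing SDK types, and the real TypeId may need to keep other behavior that this ignores. The older checkpoint is simulated by cloning the assembly name of System.Uri and changing only its version.

```csharp
using System;
using System.Reflection;

// Simulate an identity stored by an older build: same type, different version.
Type current = typeof(Uri);
var older = (AssemblyName)current.Assembly.GetName().Clone();
older.Version = new Version(1, 0, 0, 0);
var stored = new PortableTypeId(older.FullName, current.FullName ?? current.Name);

bool exact = stored.IsMatch(current, TypeMatchPolicy.Exact);
bool tolerant = stored.IsMatch(current, TypeMatchPolicy.NameAndSimpleAssembly);

Console.WriteLine(exact);    // False: display names differ in Version
Console.WriteLine(tolerant); // True: simple name and type name agree

// Hypothetical policy, mirroring the options named in fix 2.
public enum TypeMatchPolicy { Exact, NameAndSimpleAssembly, NameOnly }

// Hypothetical stand-in for TypeId; stores the full display name as today.
public sealed record PortableTypeId(string AssemblyDisplayName, string TypeName)
{
    public bool IsMatch(Type type, TypeMatchPolicy policy)
    {
        // The namespace-qualified type name must agree under every policy.
        if (TypeName != type.FullName) return false;

        return policy switch
        {
            TypeMatchPolicy.Exact => AssemblyDisplayName == type.Assembly.FullName,
            TypeMatchPolicy.NameAndSimpleAssembly => string.Equals(
                new AssemblyName(AssemblyDisplayName).Name,
                type.Assembly.GetName().Name,
                StringComparison.OrdinalIgnoreCase),
            TypeMatchPolicy.NameOnly => true,
            _ => throw new ArgumentOutOfRangeException(nameof(policy)),
        };
    }
}
```

Because the stored value is still the full display name, checkpoints written today would remain readable under such a scheme; only the comparison changes.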

Environment

  • Microsoft.Agents.AI.Workflows 1.9.0 (also observed on 1.6.x/1.8.x)
  • Runtime: .NET 10
  • Checkpoint store: custom JsonCheckpointStore (Azure Blob), but the matching logic is store-agnostic
  • OS/host: Linux containers (Azure Container Apps), one process per tenant; fails specifically on the first resume after a deploy that bumps the SDK

Package Versions

<PackageVersion Include="Microsoft.Agents.AI" Version="1.9.0" />
<PackageVersion Include="Microsoft.Agents.AI.Workflows" Version="1.9.0" />

.NET Version

.NET 10
