Add Azure Kubernetes Service (AKS) hosting support#16088

Draft
mitchdenny wants to merge 49 commits into main from feature/aks-support

Conversation

@mitchdenny
Member

Description

WIP — Adds first-class Azure Kubernetes Service (AKS) support to Aspire via a new Aspire.Hosting.Azure.Kubernetes package.

Motivation

Aspire's Aspire.Hosting.Kubernetes package supports end-to-end deployment to any conformant Kubernetes cluster via Helm charts, but it has no awareness of Azure-specific capabilities. Users who deploy to AKS must manually provision the cluster, configure workload identity, set up monitoring, and manage networking outside of Aspire.

What's here so far (Phase 1)

  • New Aspire.Hosting.Azure.Kubernetes package with dependencies on Aspire.Hosting.Kubernetes and Aspire.Hosting.Azure
  • AzureKubernetesEnvironmentResource — unified resource that extends AzureProvisioningResource and implements IAzureComputeEnvironmentResource, internally wrapping a KubernetesEnvironmentResource for Helm deployment
  • AddAzureKubernetesEnvironment() entry point (mirrors AddAzureContainerAppEnvironment() pattern)
  • Configuration extensions: WithVersion, WithSkuTier, WithNodePool, AsPrivateCluster, WithContainerInsights, WithAzureLogAnalyticsWorkspace
  • AzureKubernetesInfrastructure eventing subscriber
  • Implementation spec at docs/specs/aks-support.md

What's planned next

  • Workload identity (federated credentials + ServiceAccount YAML generation)
  • VNet integration (WithDelegatedSubnet)
  • Full Bicep provisioning (pending Azure.Provisioning.ContainerService package availability in internal feeds)
  • Unit tests with Bicep snapshot verification
  • E2E deployment tests

Validation

  • Package builds successfully with dotnet build /p:SkipNativeBuild=true
  • Follows established patterns from Aspire.Hosting.Azure.AppContainers

Fixes # (issue)

Checklist

  • Is this feature complete?
    • Yes. Ready to ship.
    • No. Follow-up changes expected.
  • Are you including unit tests for the changes and scenario tests if relevant?
    • Yes
    • No
  • Did you add public API?
    • Yes
      • If yes, did you have an API Review for it?
        • Yes
        • No
      • Did you add <remarks /> and <code /> elements on your triple slash comments?
        • Yes
        • No
    • No
  • Does the change make any security assumptions or guarantees?
    • Yes
    • No
  • Does the change require an update in our Aspire docs?

mitchdenny and others added 2 commits April 12, 2026 21:24
Create core implementation files for AKS hosting support:
- AzureKubernetesEnvironmentResource: resource class with BicepOutputReference properties
- AzureKubernetesEnvironmentExtensions: AddAzureKubernetesEnvironment and configuration methods
- AzureKubernetesInfrastructure: eventing subscriber for compute resource processing
- AksNodePoolConfig, AksSkuTier, AksNetworkProfile: supporting types
- Project file with dependencies on Hosting.Azure, Hosting.Kubernetes, etc.
- Add project to Aspire.slnx solution
- Add InternalsVisibleTo in Aspire.Hosting.Kubernetes for internal API access

Note: Azure.Provisioning.ContainerService package is not yet available in
internal NuGet feeds. ConfigureAksInfrastructure uses placeholder outputs.
When the package becomes available, replace with typed ContainerServiceManagedCluster.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Contributor

github-actions bot commented Apr 12, 2026

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 16088

Or

  • Run remotely in PowerShell:
iex "& { $(irm https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 16088"

mitchdenny and others added 8 commits April 12, 2026 21:31
- Bicep snapshot verification tests
- Configuration extension tests (version, SKU, node pools, private cluster)
- Monitoring integration tests (Container Insights, Log Analytics)
- Argument validation tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- AzureKubernetesEnvironmentResource now implements IAzureDelegatedSubnetResource
  and IAzureNspAssociationTarget for VNet and network perimeter integration
- WithWorkloadIdentity() on AKS environment enables OIDC and workload identity
- WithAzureWorkloadIdentity<T>() on compute resources for federated credential
  setup with auto-create identity support
- AksWorkloadIdentityAnnotation for ServiceAccount YAML generation
- AsExisting() works automatically via AzureProvisioningResource base class
- Additional unit tests for all new functionality

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Changed WithNodePool to AddNodePool returning IResourceBuilder<AksNodePoolResource>
- AksNodePoolResource is a child resource (IResourceWithParent) of AKS environment
- WithNodePoolAffinity<T> extension lets compute resources target specific node pools
- AksNodePoolAffinityAnnotation carries scheduling info for Helm chart nodeSelector
- Made AksNodePoolConfig and AksNodePoolMode public (exposed via AksNodePoolResource)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When no user node pool is explicitly added via AddNodePool(), the
AzureKubernetesInfrastructure subscriber creates a default 'workload'
user pool (Standard_D4s_v5, 1-10 nodes) during BeforeStartEvent.

Compute resources without explicit WithNodePoolAffinity() are
automatically assigned to the first available user pool (either
explicitly created or the auto-generated default).

This ensures workloads are never scheduled on the system pool,
which should only run system pods (kube-system, etc.).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
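
The default-pool behavior described in the commit above can be sketched in a few lines. This is a language-neutral illustration in Python, not the C# code in the PR: `NodePool`, `assign_pools`, and the `Standard_D2s_v5` system pool size used in testing are hypothetical, while the `workload` pool name, `Standard_D4s_v5`, and the 1-10 node range come from the commit message.

```python
from dataclasses import dataclass


@dataclass
class NodePool:
    name: str
    vm_size: str
    min_count: int
    max_count: int
    mode: str = "User"  # "System" or "User"


# Default user pool, using the values quoted in the commit message.
DEFAULT_USER_POOL = NodePool("workload", "Standard_D4s_v5", 1, 10)


def assign_pools(pools, workloads):
    """Return (pools, assignment).

    workloads maps a workload name to an explicit pool name, or None when
    no affinity was requested. Workloads without an explicit pool go to the
    first user pool, so nothing lands on the system pool by default.
    """
    pools = list(pools)
    user_pools = [p for p in pools if p.mode == "User"]
    if not user_pools:
        # No user pool was added explicitly: create the default one.
        pools.append(DEFAULT_USER_POOL)
        user_pools = [DEFAULT_USER_POOL]
    fallback = user_pools[0].name
    assignment = {name: (explicit or fallback) for name, explicit in workloads.items()}
    return pools, assignment
```

With only a system pool defined, `assign_pools` adds `workload` and routes every unpinned workload to it; with an explicit user pool present, that pool becomes the fallback instead.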
…tion

Two issues with aspire publish:

1. No Helm chart output: The inner KubernetesEnvironmentResource was stored
   as a property but never added to the application model. KubernetesInfrastructure
   looks for KubernetesEnvironmentResource instances in the model to generate
   Helm charts. Fix: add the inner K8s environment to the model (excluded from
   manifest) with the default Helm engine.

2. Duplicate DeploymentTargetAnnotation: AzureKubernetesInfrastructure was
   adding its own DeploymentTargetAnnotation, conflicting with the one that
   KubernetesInfrastructure adds (which points to the correct KubernetesResource
   deployment target with Helm chart data). Fix: remove the duplicate annotation
   from our subscriber — KubernetesInfrastructure handles it.

Also made EnsureDefaultHelmEngine internal (was private) so the AKS package
can call it to set up the Helm deployment engine on the inner K8s environment.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Override GetBicepTemplateString and GetBicepTemplateFile to generate
proper AKS ManagedCluster Bicep directly, bypassing the
Azure.Provisioning SDK infrastructure (which requires the unavailable
Azure.Provisioning.ContainerService package).

The generated Bicep includes:
- Microsoft.ContainerService/managedClusters resource with SystemAssigned identity
- Configurable SKU tier, Kubernetes version, DNS prefix
- Agent pool profiles with autoscaling from NodePools config
- OIDC issuer profile and workload identity security profile
- Optional private cluster API server access profile
- Optional network profile (Azure CNI)
- All outputs: id, name, clusterFqdn, oidcIssuerUrl,
  kubeletIdentityObjectId, nodeResourceGroup

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Container Registry:
- Auto-create a default Azure Container Registry when AddAzureKubernetesEnvironment
  is called (same pattern as Container Apps)
- WithContainerRegistry() extension to use an explicit ACR, replacing the default
- FlowContainerRegistry() in AzureKubernetesInfrastructure propagates the registry
  to the inner KubernetesEnvironmentResource via ContainerRegistryReferenceAnnotation
  so KubernetesInfrastructure can discover it for image push/pull

Localhive fix:
- Added SuppressFinalPackageVersion to csproj (required for new packages in Arcade)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Packages with SuppressFinalPackageVersion=true (like Aspire.Hosting.Kubernetes
and Aspire.Hosting.Azure.Kubernetes) are placed in the NonShipping output
directory by Arcade SDK. The localhive script was only looking in the
Shipping directory, causing these packages to be missing from the hive.

Changes:
- Added Get-AllPackagePaths that returns both Shipping and NonShipping dirs
- Package collection now scans all available package directories
- When packages span multiple directories, auto-uses copy mode (can't
  symlink to two dirs)
- Single-dir case still uses symlink/junction for performance

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Contributor

Re-running the failed jobs in the CI workflow for this pull request because 1 job was identified as a retry-safe transient failure in the CI run attempt.
GitHub was asked to rerun all failed jobs for that attempt, and the rerun is being tracked in the rerun attempt.
The job links below point to the failed attempt jobs that matched the retry-safe transient failure rules.

/// <param name="builder">The resource builder.</param>
/// <param name="version">The Kubernetes version (e.g., "1.30").</param>
/// <returns>A reference to the <see cref="IResourceBuilder{AzureKubernetesEnvironmentResource}"/> for chaining.</returns>
[AspireExportIgnore(Reason = "AKS hosting is not yet supported in ATS")]
Contributor

Why?

Member Author

Will be addressed.

mitchdenny and others added 16 commits April 13, 2026 09:41
Internal methods from Aspire.Hosting.Kubernetes (AddKubernetesInfrastructureCore,
EnsureDefaultHelmEngine, KubernetesInfrastructure, HelmDeploymentEngine) are not
accessible at runtime across NuGet package boundaries, even with InternalsVisibleTo
set. The InternalsVisibleTo attribute only works at compile time with project
references, not with signed NuGet packages.

Fix: call the public AddKubernetesEnvironment() API instead. This handles all the
internal setup (registering KubernetesInfrastructure subscriber, creating the
resource, setting up Helm engine) through a single public entry point.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The localhive.ps1 modifications were unnecessary - packages with
SuppressFinalPackageVersion go to Shipping, not NonShipping.
The package discovery issue was caused by running localhive from
the wrong worktree, not a script problem.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…placeholders

The ConfigureAksInfrastructure callback was still adding ProvisioningOutput
objects with no values, even though GetBicepTemplateString/GetBicepTemplateFile
now generate the Bicep directly. While our overrides prevent these from being
used for Bicep generation, the stale outputs could confuse the
AzureResourcePreparer's parameter analysis.

Emptied the callback body since all Bicep generation is handled by the
resource's overrides. Also removed unused Azure.Provisioning using directive.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two changes to ensure Helm/kubectl target the AKS cluster instead of
the user's default kubectl context:

1. KubernetesEnvironmentResource.KubeConfigPath (Aspire.Hosting.Kubernetes):
   New public property. When set, HelmDeploymentEngine passes --kubeconfig
   to all helm and kubectl commands. This is non-breaking — null means
   use default behavior.

2. AzureKubernetesInfrastructure get-credentials step (Aspire.Hosting.Azure.Kubernetes):
   Adds a pipeline step that runs after AKS Bicep provisioning and before
   Helm prepare. It calls 'az aks get-credentials --file <isolated-path>'
   to write credentials to a temp kubeconfig file, then sets KubeConfigPath
   on the inner KubernetesEnvironmentResource. This ensures Helm deploys
   to the provisioned AKS cluster without mutating ~/.kube/config.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
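
The isolated-kubeconfig flow in the commit above might look roughly like the following. This is a Python sketch of the command construction only, not the PR's C# implementation; the helper names and the temp-directory prefix are hypothetical. The `az aks get-credentials --file` option and the `--kubeconfig` flag on `helm` and `kubectl` are the pieces named in the commit message.

```python
import os
import tempfile


def isolated_kubeconfig_path(cluster_name):
    """Create a private temp directory and return a kubeconfig path inside it."""
    directory = tempfile.mkdtemp(prefix="aspire-aks-")  # hypothetical prefix
    return os.path.join(directory, f"{cluster_name}.kubeconfig")


def get_credentials_args(cluster_name, resource_group, kubeconfig_path):
    """Arguments that write cluster credentials to kubeconfig_path
    instead of merging them into ~/.kube/config."""
    return [
        "az", "aks", "get-credentials",
        "--resource-group", resource_group,
        "--name", cluster_name,
        "--file", kubeconfig_path,
    ]


def with_kubeconfig(tool_args, kubeconfig_path):
    """Append --kubeconfig only when a path is set.

    None keeps the tool's default behavior, which is why the new
    property is described as non-breaking.
    """
    if kubeconfig_path is None:
        return list(tool_args)
    return list(tool_args) + ["--kubeconfig", kubeconfig_path]
```

Passing arguments as a list (rather than one shell string) also sidesteps quoting problems when names contain special characters.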
…meterResource

The resourceGroupName is a ParameterResource that requires configuration
key 'Parameters:resourceGroupName' to be set. During deploy, this isn't
available as a raw parameter value — it's resolved by the Azure provisioning
context and stored in the 'Azure:ResourceGroup' configuration key.

Changed GetAksCredentialsAsync to read from IConfiguration['Azure:ResourceGroup']
which is populated by the Azure provisioner during context creation, before
our get-credentials step runs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…luster name

BicepOutputReference.GetValueAsync() triggers parameter resolution on
the AzureProvisioningResource, which tries to resolve the 'location'
parameter that depends on 'resourceGroup().location'. In a fresh
environment without Parameters:resourceGroupName configured, this fails.

Since we set the cluster name directly in the Bicep template (name: '{Name}'),
we can just use environment.Name as the cluster name. This avoids the
parameter resolution chain entirely.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The aks-get-credentials step was depending on provision-{name} (individual
AKS resource step) but the Azure:ResourceGroup config key is set by the
create-provisioning-context step. In a fresh environment, the step ordering
wasn't guaranteed to have the config available.

Changed to depend on the provision-azure-bicep-resources aggregation step
which gates on ALL provisioning completing, ensuring both the provisioning
context (with resource group) and the AKS cluster are ready.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…uring event

The ContainerRegistryReferenceAnnotation was being added to the inner
KubernetesEnvironmentResource during BeforeStartEvent via FlowContainerRegistry.
But KubernetesInfrastructure also runs during BeforeStartEvent and reads
the registry annotation — if it ran first, it wouldn't see the annotation,
resulting in no push steps being created and images never getting pushed.

Fix: Add the ContainerRegistryReferenceAnnotation to the inner K8s
environment immediately in AddAzureKubernetesEnvironment and WithContainerRegistry,
at resource creation time before any events fire. This guarantees
KubernetesInfrastructure always sees the registry regardless of subscriber
execution order.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Push steps call registry.Endpoint.GetValueAsync() which awaits the
BicepOutputReference for loginServer. If the ACR hasn't been provisioned
yet, this blocks indefinitely — the push step just hangs after push-prereq.

Push steps depend on build + push-prereq, but neither of those depend
on the ACR's provision step. Added a PipelineConfigurationAnnotation on
the inner K8s environment that makes all compute resource push steps
depend on the ACR's provision step.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…step

Changed from depending on individual ACR provision step (which required
resource-to-step lookup that may not resolve correctly) to depending on
the provision-azure-bicep-resources aggregation step by name. This is
simpler and ensures ALL Azure provisioning (including ACR output
population) completes before any image push begins.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Previous attempts tried to wire dependencies via GetSteps(resource, tag)
which uses the StepToResourceMap. This approach failed because the push
steps are keyed to the compute resources, not the K8s environment.

New approach: find the push-prereq step by name in the Steps collection
and directly call DependsOn(provision-azure-bicep-resources). Since all
push steps already depend on push-prereq, this ensures the entire push
chain waits for Azure provisioning to complete.

This mirrors how ACA works: ACA doesn't need this because it implements
IContainerRegistry directly on the environment resource, so the endpoint
values are resolved differently. For AKS, the ACR is a separate Bicep
resource whose outputs need to be populated before push can proceed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Temporary console output to debug why push steps hang after push-prereq.
Logs whether the PipelineConfigurationAnnotation runs, whether push-prereq
is found, how many push steps exist, and their dependencies.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Diagnostics revealed push steps had EMPTY DependsOnSteps lists. The
standard wiring from ProjectResource's PipelineConfigurationAnnotation
(pushSteps.DependsOn(buildSteps, push-prereq)) wasn't working because
context.GetSteps(resource, tag) returned empty — the resource lookup
via ResourceNameComparer didn't match when K8s deployment targets are
involved.

Fix: directly find push steps by tag in the Steps collection and
explicitly wire dependencies on:
- provision-azure-bicep-resources (ACR must be provisioned for endpoint)
- push-prereq (ACR login must complete)
- build-{resourceName} (container image must be built)

This ensures the correct execution order:
  provision → push-prereq → build → push → helm-deploy

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
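
To make the intended ordering concrete, here is a small Python sketch that models the explicit wiring as a dependency graph and asks the standard library for a valid execution order. The step names `provision-azure-bicep-resources`, `push-prereq`, and `build-{resourceName}` come from the commit messages; `push-{name}` and `helm-deploy` are placeholder names chosen for illustration, and the real pipeline has more steps than shown.

```python
from graphlib import TopologicalSorter


def push_pipeline(resource_names):
    """Map each step to the set of steps it depends on."""
    graph = {
        "provision-azure-bicep-resources": set(),
        # ACR login can only happen once the registry exists.
        "push-prereq": {"provision-azure-bicep-resources"},
    }
    push_steps = set()
    for name in resource_names:
        graph[f"build-{name}"] = set()
        graph[f"push-{name}"] = {
            "provision-azure-bicep-resources",  # ACR endpoint must be populated
            "push-prereq",                      # ACR login must complete
            f"build-{name}",                    # image must be built
        }
        push_steps.add(f"push-{name}")
    graph["helm-deploy"] = push_steps
    return graph


order = list(TopologicalSorter(push_pipeline(["apiservice"])).static_order())
print(order)
```

Any order the sorter returns places each push step after provisioning, `push-prereq`, and its own build, and places the deploy step last. An empty dependency set on a push step, which is what the diagnostics found, would let it run before the registry exists.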
The Azure provisioning context internals (ProvisioningContextTask,
AzureProvisionerOptions) are all internal to Aspire.Hosting.Azure
and inaccessible from our package. IConfiguration['Azure:ResourceGroup']
is also not reliably set when our step runs because the deployment
state manager writes to a different configuration scope.

New approach: query Azure directly with 'az aks list --query' to find
the cluster's resource group. This is guaranteed to work after
provisioning completes, regardless of internal configuration state.
The az CLI is already available (validated by validate-azure-login step).

Also wires push step dependencies directly by finding steps by tag
in the Steps collection, fixing the issue where push steps had empty
DependsOnSteps lists in K8s compute environments.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… group

- Quote --resource-group and --name values to handle special characters
- Strip line endings from az aks list output to prevent argument parsing issues
- Add logging of cluster name and resource group values for debugging

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The JMESPath query in 'az aks list --query [?name==...].resourceGroup'
had quote-escaping issues on Windows when passed via ProcessStartInfo.
The quotes in the JMESPath expression were being mangled by cmd.exe,
producing truncated/malformed resource group names.

Switched to 'az resource list --resource-type ... --name ... --query [0].resourceGroup'
which uses --name as a proper CLI argument (no embedded quotes in JMESPath)
and the simpler [0].resourceGroup query has no quote escaping issues.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
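
The resource-group lookup described above can be sketched as follows. This is a Python illustration, not the PR's C# code: the helper names are hypothetical, and `--output tsv` is an assumption added here to get a bare string back (the commit message does not say which output format is used). The `--resource-type`, `--name`, and `--query [0].resourceGroup` arguments are the ones quoted in the commit.

```python
def resource_group_lookup_args(cluster_name):
    """Build the lookup command with the cluster name as a plain CLI
    argument, so the JMESPath query contains no embedded quotes."""
    return [
        "az", "resource", "list",
        "--resource-type", "Microsoft.ContainerService/managedClusters",
        "--name", cluster_name,
        "--query", "[0].resourceGroup",
        "--output", "tsv",  # assumption: plain-text output
    ]


def parse_resource_group(stdout):
    """Strip line endings so the value is safe to reuse as an argument."""
    value = stdout.strip()
    if not value:
        raise ValueError("cluster not found: no resource group returned")
    return value
```

Stripping the trailing `\r\n` matters on Windows, where a stray carriage return would otherwise end up inside the `--resource-group` value of the next command.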
mitchdenny and others added 2 commits April 13, 2026 21:42
The publish context and step factory used GetDeploymentTargetAnnotation(environment)
where environment is the inner K8s env. But WithComputeEnvironment(aksEnv)
sets the compute env to the AKS resource, and KubernetesInfrastructure now
sets DeploymentTargetAnnotation.ComputeEnvironment to match the resource's
actual compute env (the AKS resource, not the inner K8s env).

Updated all GetDeploymentTargetAnnotation calls to use ParentComputeEnvironment
when available, so the lookup matches correctly.

Also fixed KubernetesInfrastructure to set ComputeEnvironment on the
DeploymentTargetAnnotation to the resource's actual compute env rather
than always using the inner K8s env.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
BicepOutputReference.ValueExpression uses single braces ({storage.outputs.blobEndpoint})
but ResolveUnknownValue only stripped double braces ({{ }}) via HelmExtensions delimiters.
Single braces passed through to the Helm template, causing a parse error.

Fix: also strip single { and } characters when sanitizing the values key.

Fixes #16114

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
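
A minimal sketch of the sanitization fix, assuming the only goal is to remove both delimiter styles from the key. `sanitize_values_key` is a hypothetical name, and the real `ResolveUnknownValue` may normalize the key further; only the brace handling is shown.

```python
def sanitize_values_key(expression):
    """Remove Helm-style double braces and Bicep-style single braces."""
    # Double braces are listed first for readability; stripping every
    # single brace afterwards covers any that remain.
    for token in ("{{", "}}", "{", "}"):
        expression = expression.replace(token, "")
    return expression.strip()
```

For example, `{storage.outputs.blobEndpoint}` and `{{ storage.outputs.blobEndpoint }}` both reduce to `storage.outputs.blobEndpoint`, so neither form leaks a brace into the Helm template.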
@github-actions
Contributor

🎬 CLI E2E Test Recordings — 68 recordings uploaded (commit 0748329)

View recordings
Test Recording
AddPackageInteractiveWhileAppHostRunningDetached ▶️ View Recording
AddPackageWhileAppHostRunningDetached ▶️ View Recording
AgentCommands_AllHelpOutputs_AreCorrect ▶️ View Recording
AgentInitCommand_DefaultSelection_InstallsSkillOnly ▶️ View Recording
AgentInitCommand_MigratesDeprecatedConfig ▶️ View Recording
AllPublishMethodsBuildDockerImages ▶️ View Recording
AspireAddPackageVersionToDirectoryPackagesProps ▶️ View Recording
AspireUpdateRemovesAppHostPackageVersionFromDirectoryPackagesProps ▶️ View Recording
Banner_DisplayedOnFirstRun ▶️ View Recording
Banner_DisplayedWithExplicitFlag ▶️ View Recording
Banner_NotDisplayedWithNoLogoFlag ▶️ View Recording
CertificatesClean_RemovesCertificates ▶️ View Recording
CertificatesTrust_WithNoCert_CreatesAndTrustsCertificate ▶️ View Recording
CertificatesTrust_WithUntrustedCert_TrustsCertificate ▶️ View Recording
ConfigSetGet_CreatesNestedJsonFormat ▶️ View Recording
CreateAndRunAspireStarterProject ▶️ View Recording
CreateAndRunAspireStarterProjectWithBundle ▶️ View Recording
CreateAndRunEmptyAppHostProject ▶️ View Recording
CreateAndRunJavaEmptyAppHostProject ▶️ View Recording
CreateAndRunJsReactProject ▶️ View Recording
CreateAndRunPythonReactProject ▶️ View Recording
CreateAndRunTypeScriptEmptyAppHostProject ▶️ View Recording
CreateAndRunTypeScriptStarterProject ▶️ View Recording
CreateJavaAppHostWithViteApp ▶️ View Recording
CreateStartAndStopAspireProject ▶️ View Recording
CreateTypeScriptAppHostWithViteApp ▶️ View Recording
DashboardRunWithOtelTracesReturnsNoTraces ▶️ View Recording
DeployK8sBasicApiService ▶️ View Recording
DeployK8sWithGarnet ▶️ View Recording
DeployK8sWithMongoDB ▶️ View Recording
DeployK8sWithMySql ▶️ View Recording
DeployK8sWithPostgres ▶️ View Recording
DeployK8sWithRabbitMQ ▶️ View Recording
DeployK8sWithRedis ▶️ View Recording
DeployK8sWithSqlServer ▶️ View Recording
DeployK8sWithValkey ▶️ View Recording
DeployTypeScriptAppToKubernetes ▶️ View Recording
DescribeCommandResolvesReplicaNames ▶️ View Recording
DescribeCommandShowsRunningResources ▶️ View Recording
DetachFormatJsonProducesValidJson ▶️ View Recording
DoctorCommand_DetectsDeprecatedAgentConfig ▶️ View Recording
DoctorCommand_WithSslCertDir_ShowsTrusted ▶️ View Recording
DoctorCommand_WithoutSslCertDir_ShowsPartiallyTrusted ▶️ View Recording
GlobalMigration_HandlesCommentsAndTrailingCommas ▶️ View Recording
GlobalMigration_HandlesMalformedLegacyJson ▶️ View Recording
GlobalMigration_PreservesAllValueTypes ▶️ View Recording
GlobalMigration_SkipsWhenNewConfigExists ▶️ View Recording
GlobalSettings_MigratedFromLegacyFormat ▶️ View Recording
InitTypeScriptAppHost_AugmentsExistingViteRepoAtRoot ▶️ View Recording
InvalidAppHostPathWithComments_IsHealedOnRun ▶️ View Recording
LegacySettingsMigration_AdjustsRelativeAppHostPath ▶️ View Recording
LogsCommandShowsResourceLogs ▶️ View Recording
PsCommandListsRunningAppHost ▶️ View Recording
PsFormatJsonOutputsOnlyJsonToStdout ▶️ View Recording
PublishWithDockerComposeServiceCallbackSucceeds ▶️ View Recording
RestoreGeneratesSdkFiles ▶️ View Recording
RestoreSupportsConfigOnlyHelperPackageAndCrossPackageTypes ▶️ View Recording
RunFromParentDirectory_UsesExistingConfigNearAppHost ▶️ View Recording
SecretCrudOnDotNetAppHost ▶️ View Recording
SecretCrudOnTypeScriptAppHost ▶️ View Recording
StagingChannel_ConfigureAndVerifySettings_ThenSwitchChannels ▶️ View Recording
StartAndWaitForTypeScriptSqlServerAppHostWithNativeAssets ▶️ View Recording
StopAllAppHostsFromAppHostDirectory ▶️ View Recording
StopAllAppHostsFromUnrelatedDirectory ▶️ View Recording
StopNonInteractiveMultipleAppHostsShowsError ▶️ View Recording
StopNonInteractiveSingleAppHost ▶️ View Recording
StopWithNoRunningAppHostExitsSuccessfully ▶️ View Recording
UnAwaitedChainsCompileWithAutoResolvePromises ▶️ View Recording

📹 Recordings uploaded automatically from CI run #24345283063

mitchdenny and others added 8 commits April 14, 2026 11:27
When WithDelegatedSubnet(subnet) is called, the generated AKS Bicep now:
- Accepts a subnetId parameter wired from the subnet's ID output
- Sets vnetSubnetID on all agent pool profiles
- Configures Azure CNI network profile (networkPlugin: azure) with
  default service CIDR and DNS service IP

This follows the ACA pattern where DelegatedSubnetAnnotation is read
during infrastructure configuration and the subnet ID is passed as
a provisioning parameter.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
AKS requires plain (non-delegated) subnets for node pools. The previous
approach used WithDelegatedSubnet which implements IAzureDelegatedSubnetResource
and adds a 'Microsoft.ContainerService/managedClusters' service delegation
to the subnet. Azure rejects this with 'SubnetIsDelegated' error.

Changes:
- Removed IAzureDelegatedSubnetResource from AzureKubernetesEnvironmentResource
- Added WithSubnet(subnet) extension that stores the subnet ID without
  adding a service delegation (via AksSubnetAnnotation)
- Added Aspire.Hosting.Azure.Network project reference for AzureSubnetResource
- WithDelegatedSubnet still works as fallback but users should use WithSubnet

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
During publish, config values backed by IValueProvider (e.g., Bicep output
references like {storage.outputs.blobEndpoint}) were written as raw
expression strings to values.yaml. The prepare step only resolved
ParameterResource values, leaving IValueProvider expressions unresolved.

Fix (all in Aspire.Hosting.Kubernetes — no Azure dependency):

1. HelmValue.ValueProviderSource: new optional IValueProvider property
   set when ResolveUnknownValue detects the expression provider also
   implements IValueProvider

2. KubernetesPublishingContext: when a HelmValue has a ValueProviderSource,
   writes an empty placeholder and captures a CapturedHelmValueProvider
   for deploy-time resolution

3. KubernetesEnvironmentResource.CapturedHelmValueProvider: new record
   storing (Section, ResourceKey, ValueKey, IValueProvider)

4. HelmDeploymentEngine Phase 4: calls GetValueAsync() on captured
   IValueProvider entries to resolve values from external sources

This is cloud-provider agnostic — works with any IValueProvider
implementation (Azure Bicep outputs, AWS outputs, etc.).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
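
The publish-then-resolve split described above can be illustrated with a short Python sketch. It is not the PR's implementation: `CapturedValue` mirrors the shape of the `CapturedHelmValueProvider` record (section, resource key, value key, provider), and `resolve_captured` is a hypothetical stand-in for the deploy-time phase that awaits each provider.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class CapturedValue:
    section: str        # e.g. "config"
    resource_key: str   # e.g. "apiservice"
    value_key: str      # e.g. an environment variable name
    get_value: Callable[[], Awaitable[str]]  # stand-in for IValueProvider.GetValueAsync


async def resolve_captured(values, captured):
    """Fill the placeholders written at publish time with deploy-time values."""
    for item in captured:
        section = values.setdefault(item.section, {})
        resource = section.setdefault(item.resource_key, {})
        resource[item.value_key] = await item.get_value()
    return values


async def fake_bicep_output():
    # Hypothetical value; a real provider would read a provisioning output.
    return "https://example.invalid/"


published = {"config": {"apiservice": {"BLOB_ENDPOINT": ""}}}
captured = [CapturedValue("config", "apiservice", "BLOB_ENDPOINT", fake_bicep_output)]
resolved = asyncio.run(resolve_captured(published, captured))
```

Because the resolver only depends on an awaitable that returns a string, the same flow works for any provider, which is the cloud-agnostic property the commit calls out.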
Three fixes:

1. Composite expressions (e.g., 'Endpoint={storage.outputs.blobEndpoint};ContainerName=photos')
   containing IValueProvider references (like BicepOutputReference) are now deferred for
   deploy-time resolution. Added IsUnresolvedAtPublishTime() check before processing
   sub-expressions — if any sub-expression would fall through to ResolveUnknownValue,
   the entire composite expression is captured as a CapturedHelmValueProvider.

2. Helm chart names are now scoped per AKS environment (e.g., 'k8stest5-corek8s' instead of
   'k8stest5-apphost') to avoid conflicts when multiple environments deploy to the same
   cluster or when re-deploying with different environment names.

3. Updated K8s snapshot tests for the new deferred value handling.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ressions

Previous check only detected direct IValueProvider references. For
composite expressions like connection strings (Endpoint={storage.outputs.blobEndpoint};ContainerName=photos),
the BicepOutputReference is nested inside a ReferenceExpression chain:
  ConnectionStringReference → ReferenceExpression → BicepOutputReference

Fixed IsUnresolvedAtPublishTime to recursively check:
- ConnectionStringReference → check inner ConnectionStringExpression
- IResourceWithConnectionString → check inner ConnectionStringExpression
- ReferenceExpression → check all value providers recursively
- IManifestExpressionProvider + IValueProvider → true (leaf unresolvable)

Also added early deferral in ProcessValueAsync for ConnectionStringReference
and IResourceWithConnectionString before they get unwrapped.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
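
The recursive check can be sketched with a few stand-in types. This is a Python illustration under stated assumptions: the classes below are simplified placeholders for the C# types named in the commit (`DeployTimeOutput` plays the role of a `BicepOutputReference`), and the `IResourceWithConnectionString` case is omitted because it follows the same pattern as `ConnectionStringReference`.

```python
from dataclasses import dataclass


@dataclass
class DeployTimeOutput:
    """Leaf value that only exists after provisioning."""
    expression: str


@dataclass
class ReferenceExpression:
    """Format string whose parts are literals (str) or nested providers."""
    parts: list


@dataclass
class ConnectionStringReference:
    connection_string_expression: ReferenceExpression


def is_unresolved_at_publish_time(value):
    if isinstance(value, DeployTimeOutput):
        return True  # leaf: cannot be resolved until deploy
    if isinstance(value, ConnectionStringReference):
        return is_unresolved_at_publish_time(value.connection_string_expression)
    if isinstance(value, ReferenceExpression):
        return any(is_unresolved_at_publish_time(p) for p in value.parts)
    return False  # plain literals are known at publish time


photos = ConnectionStringReference(
    ReferenceExpression([
        "Endpoint=",
        DeployTimeOutput("{storage.outputs.blobEndpoint}"),
        ";ContainerName=photos",
    ])
)
```

A check that only inspected the top-level object would miss `photos`, since the deploy-time output sits two levels down; the recursion is what catches it.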
…ed values

The deferred value's Helm key was derived from ValueExpression which
contains format strings like 'Endpoint={storage.outputs.blobEndpoint};ContainerName=photos'.
This produced invalid Helm paths with = and ; characters.

Fix: moved deferral check to ProcessEnvironmentAsync (outer loop) where
the env var key name is available. CreateDeferredHelmValue now takes the
env var key directly, producing clean paths like
'{{ .Values.config.apiservice.ConnectionStrings__photos }}'.

Removed deferral checks from ProcessValueAsync — all deferral is now
handled before ProcessValueAsync is called.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
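
A small sketch of why keying on the environment variable name yields a clean path. `deferred_helm_value` is a hypothetical Python helper, not the PR's `CreateDeferredHelmValue`, and the identifier check is an assumption added to show what the old format-string-derived keys would have failed.

```python
import re


def deferred_helm_value(section, resource_name, env_key):
    """Build the Helm template expression for a deferred value."""
    # Keys derived from format strings contained '=' and ';', which are
    # not valid in a Helm values path; env var names avoid that.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", env_key):
        raise ValueError(f"not a usable Helm values key: {env_key!r}")
    return "{{ " + f".Values.{section}.{resource_name}.{env_key}" + " }}"
```

Called with `("config", "apiservice", "ConnectionStrings__photos")`, this produces the path quoted in the commit message, whereas the raw expression `Endpoint={storage.outputs.blobEndpoint};ContainerName=photos` is rejected.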
- Add KubernetesNodePoolResource as base class in Aspire.Hosting.Kubernetes
- Add KubernetesNodePoolAnnotation for nodeSelector scheduling
- Add AddNodePool and WithNodePool extensions on K8s environment
- AksNodePoolResource now extends KubernetesNodePoolResource
- Remove WithNodePoolAffinity (replaced by WithNodePool)
- Apply nodeSelector in KubernetesPublishingContext when annotation present
- Delete AksNodePoolAffinityAnnotation (replaced by KubernetesNodePoolAnnotation)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- AzureVmSizes.Generated.cs with common VM sizes grouped by family
  (GeneralPurpose, ComputeOptimized, MemoryOptimized, GpuAccelerated,
  StorageOptimized, Burstable, Arm)
- GenVmSizes.cs tool that fetches VM SKUs from Azure REST API
- update-azure-vm-sizes.yml workflow (monthly, like GitHub Models pattern)
- Users can now write: aks.AddNodePool("gpu", AzureVmSizes.GpuAccelerated.StandardNC6sV3, 0, 5)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Contributor

Re-running the failed jobs in the CI workflow for this pull request because 1 job was identified as a retry-safe transient failure in the CI run attempt.
GitHub was asked to rerun all failed jobs for that attempt, and the rerun is being tracked in the rerun attempt.
The job links below point to the failed attempt jobs that matched the retry-safe transient failure rules.

mitchdenny and others added 5 commits April 14, 2026 19:45
- Add WithSubnet overload for IResourceBuilder<AksNodePoolResource>
- Per-pool subnets generate separate Bicep params (subnetId_{poolName})
- Each pool uses its own subnet if set, else falls back to env-level default
- Environment-level WithSubnet remains unchanged as the default
- Network profile auto-configured when any subnet (default or per-pool) is set
- Add 2 new tests for per-pool subnet scenarios

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
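
The fallback rule for per-pool subnets can be sketched as below. This is a Python illustration, not the Bicep generator in the PR: `subnet_parameters` is a hypothetical helper, while the parameter names `subnetId` and `subnetId_{poolName}` follow the commit messages.

```python
def subnet_parameters(pools, default_subnet=None):
    """Resolve which subnet parameter each pool should reference.

    pools maps pool name to a pool-specific subnet ID, or None.
    Returns (params, pool_to_param, needs_network_profile).
    """
    params = {}
    pool_to_param = {}
    if default_subnet is not None:
        params["subnetId"] = default_subnet
    for pool, subnet in pools.items():
        if subnet is not None:
            key = f"subnetId_{pool}"      # pool-specific parameter
            params[key] = subnet
            pool_to_param[pool] = key
        elif default_subnet is not None:
            pool_to_param[pool] = "subnetId"  # environment-level default
        else:
            pool_to_param[pool] = None        # no subnet configured
    # The network profile is needed as soon as any subnet is in play.
    needs_network_profile = bool(params)
    return params, pool_to_param, needs_network_profile
```

A pool with its own subnet gets a dedicated parameter, every other pool shares the environment default, and the network profile switches on if either kind is present.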
- Remove WithAzureWorkloadIdentity and AksWorkloadIdentityAnnotation
- Honor AppIdentityAnnotation (same mechanism as ACA/AppService)
- AzureKubernetesInfrastructure detects AppIdentityAnnotation and:
  - Enables OIDC + workload identity on the AKS cluster
  - Generates K8s ServiceAccount with azure.workload.identity/client-id
  - Sets serviceAccountName on pod spec
  - Adds azure.workload.identity/use pod label via customization annotation
- Generate federated identity credential Bicep per workload identity
- Add ServiceAccountV1 resource to Aspire.Hosting.Kubernetes
- AKS admission controller injects AZURE_CLIENT_ID automatically

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The identity clientId was captured but never wired as a deferred Helm
value, resulting in an empty azure.workload.identity/client-id
annotation on the ServiceAccount. This caused pods to authenticate
with the identity but lack the actual client ID, leading to 403
AuthorizationPermissionMismatch errors on Azure resources.

Fix: Add identity.ClientId as a CapturedHelmValueProvider so it gets
resolved from Bicep output at deploy time and written into the Helm
override values under parameters.<name>.identityClientId.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The azure.workload.identity/use label was only added to the
ServiceAccount but NOT to the pod template. The AKS workload identity
admission webhook requires this label on the pod to inject
AZURE_CLIENT_ID, AZURE_TENANT_ID, and token volume mounts.

Without the pod label, the webhook doesn't fire and the pod
authenticates with the default SA token instead of the federated
identity, resulting in 403 AuthorizationPermissionMismatch.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
More specific name that clarifies these are VM sizes for AKS node
pools, not general Azure VM sizes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member Author

@mitchdenny mitchdenny left a comment

Code review: found 7 issues — 2 bugs (deadlock + missing exit code check), 1 security concern (credential file leak), 2 correctness issues (orphaned resources, redundant allocation), 1 behavioral concern (FindNodePoolResource identity), 1 documentation gap (region-locked VM sizes).

process.Start();

var stdout = await process.StandardOutput.ReadToEndAsync(context.CancellationToken).ConfigureAwait(false);
await process.StandardError.ReadToEndAsync(context.CancellationToken).ConfigureAwait(false);
Member Author

Bug: Potential deadlock — sequential ReadToEndAsync on stdout then stderr.

If the process writes enough to stderr to fill its buffer while stdout is being drained, this will deadlock. The correct pattern (already used in GetAksCredentialsAsync above) is to start both reads concurrently:

```csharp
var stdoutTask = process.StandardOutput.ReadToEndAsync(context.CancellationToken);
var stderrTask = process.StandardError.ReadToEndAsync(context.CancellationToken);

await process.WaitForExitAsync(context.CancellationToken).ConfigureAwait(false);

var stdout = await stdoutTask.ConfigureAwait(false);
var stderr = await stderrTask.ConfigureAwait(false);
```

var stdout = await process.StandardOutput.ReadToEndAsync(context.CancellationToken).ConfigureAwait(false);
await process.StandardError.ReadToEndAsync(context.CancellationToken).ConfigureAwait(false);

await process.WaitForExitAsync(context.CancellationToken).ConfigureAwait(false);
Member Author

Bug: Process exit code is never checked.

If az resource list fails, stdout will be empty and the error thrown is the misleading "Could not resolve resource group" with no indication of the actual az CLI error. The stderr output is read and discarded.

Check process.ExitCode after WaitForExitAsync (like GetAksCredentialsAsync does above) and include stderr in the error message.
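
A minimal sketch of how the two fixes above (concurrent reads plus an exit code check) could be combined. ProcessRunner and RunAsync are hypothetical names used for illustration and are not part of this PR; the ReadToEndAsync(CancellationToken) overloads assume .NET 7 or later.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public static class ProcessRunner
{
    // Hypothetical helper: runs a command, drains both pipes concurrently,
    // and includes stderr in the exception when the exit code is non-zero.
    public static async Task<string> RunAsync(
        string fileName, string arguments, CancellationToken cancellationToken = default)
    {
        using var process = new Process
        {
            StartInfo = new ProcessStartInfo(fileName, arguments)
            {
                RedirectStandardOutput = true,
                RedirectStandardError = true,
                UseShellExecute = false,
            }
        };

        process.Start();

        // Start both reads before waiting so neither pipe buffer can fill and block the child.
        var stdoutTask = process.StandardOutput.ReadToEndAsync(cancellationToken);
        var stderrTask = process.StandardError.ReadToEndAsync(cancellationToken);

        await process.WaitForExitAsync(cancellationToken).ConfigureAwait(false);

        var stdout = await stdoutTask.ConfigureAwait(false);
        var stderr = await stderrTask.ConfigureAwait(false);

        if (process.ExitCode != 0)
        {
            // Surface the real CLI error instead of a downstream "could not resolve" message.
            throw new InvalidOperationException(
                $"'{fileName} {arguments}' exited with code {process.ExitCode}: {stderr.Trim()}");
        }

        return stdout;
    }
}
```

With a helper of this shape, the caller would only see an empty stdout when the command actually succeeded with no output.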

{
return new AksNodePoolResource(poolName,
environment.NodePools.First(p => p.Name == poolName),
environment);
Member Author

Bug: FindNodePoolResource always creates a new instance instead of finding an existing one.

This method creates a new AksNodePoolResource(...) every time, so the node pool object assigned to workloads via KubernetesNodePoolAnnotation will always be a different object from any pool added via AddNodePool(). If any downstream code relies on reference equality or object identity between the annotation's pool and the pool added to the model, it will break silently.

Consider searching the app model's resources for an existing AksNodePoolResource matching the given name.
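
One possible shape for that lookup, sketched with stand-in types. NodePool and NodePoolLookup are hypothetical and only illustrate the idea of returning the instance that is already in the model; the real code would search for AksNodePoolResource in whatever resource collection the model exposes.

```csharp
using System.Collections.Generic;
using System.Linq;

// Stand-in for AksNodePoolResource; a class (not a record) so equality is by reference.
public sealed class NodePool
{
    public NodePool(string name) => Name = name;
    public string Name { get; }
}

public static class NodePoolLookup
{
    // Returns the pool instance already present in the model, or null when
    // no pool with that name has been registered. Never allocates a new pool.
    public static NodePool? FindByName(IEnumerable<object> modelResources, string poolName) =>
        modelResources.OfType<NodePool>().FirstOrDefault(p => p.Name == poolName);
}
```

Returning null for a missing pool leaves the decision to create and register one with the caller, which keeps object identity consistent between the annotation and the model.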

.ConfigureAwait(false);

// Write credentials to an isolated kubeconfig file
var kubeConfigDir = Directory.CreateTempSubdirectory("aspire-aks");
Member Author

Security/Resource leak: Temp directory containing cluster credentials never cleaned up.

Directory.CreateTempSubdirectory("aspire-aks") creates a directory that persists after the pipeline completes. The kubeconfig file, which contains cluster credentials, remains on disk indefinitely. Consider registering cleanup (e.g., via IAsyncDisposable on the step or a finally block), or at minimum document that credentials persist.
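
A rough sketch of the try/finally variant, assuming the kubeconfig is only needed for the duration of one operation. IsolatedKubeConfig and UseAsync are hypothetical names, not APIs from this PR; Directory.CreateTempSubdirectory requires .NET 7 or later.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class IsolatedKubeConfig
{
    // Hypothetical helper: writes the kubeconfig into a fresh temp directory,
    // passes its path to the caller, and deletes the directory afterwards,
    // even when the callback throws.
    public static async Task UseAsync(string kubeConfigContents, Func<string, Task> action)
    {
        var dir = Directory.CreateTempSubdirectory("aspire-aks");
        try
        {
            var path = Path.Combine(dir.FullName, "kubeconfig");
            await File.WriteAllTextAsync(path, kubeConfigContents).ConfigureAwait(false);
            await action(path).ConfigureAwait(false);
        }
        finally
        {
            // Remove the credentials from disk once the operation is done.
            dir.Delete(recursive: true);
        }
    }
}
```

If the kubeconfig has to outlive a single step, the same cleanup could instead live in a disposable owned by the pipeline, at the cost of a longer window in which the file exists.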

return new(helmExpression, parameter.ValueExpression);
var helmValue = new HelmValue(helmExpression, parameter.ValueExpression);

// If the expression provider also implements IValueProvider, attach it
Member Author

Correctness: Redundant HelmValue allocation.

A HelmValue is created on line 681, then immediately discarded and a new one with identical constructor args is created here inside the if block. Simplify to:

```csharp
var helmValue = new HelmValue(helmExpression, parameter.ValueExpression)
{
    ValueProviderSource = parameter as IValueProvider
};
return helmValue;
```

var defaultConfig = new AksNodePoolConfig("workload", "Standard_D4s_v5", 1, 10, AksNodePoolMode.User);
environment.NodePools.Add(defaultConfig);

var defaultPool = new AksNodePoolResource("workload", defaultConfig, environment);
Member Author

Bug: Default node pool is created but never added to the distributed application model.

When no user pool exists, a new AksNodePoolResource("workload", ...) is created and returned for use in annotations, but it's never registered via AddResource(). This means it won't appear in manifests, pipelines, or any resource enumeration, unlike pools created via AddNodePool(), which do call builder.AddResource(). The Bicep generation adds the pool config to NodePools, but the resource object is orphaned.


// Fetch resource SKUs filtered to virtualMachines
var url = $"https://management.azure.com/subscriptions/{subscriptionId}/providers/Microsoft.Compute/skus?api-version=2021-07-01&$filter=location eq 'eastus'";
var json = await RunAzCommand($"rest --method get --url \"{url}\"").ConfigureAwait(false);
Member Author

Correctness: VM size query is hardcoded to the eastus region.

Some VM sizes are region-specific and may not be available in eastus, while others available in other regions will be missing from the generated constants. Consider documenting this limitation in the generated file's header comment, or querying across multiple representative regions to produce a more complete list.

- Mark implemented features (Phase 1-3, node pools, IValueProvider)
- Update workload identity to document AppIdentityAnnotation approach
- Update VNet to document WithSubnet (not WithDelegatedSubnet)
- Document removed APIs (WithAzureWorkloadIdentity, AksWorkloadIdentityAnnotation)
- List remaining gaps: monitoring Bicep, WithHelm/WithDashboard, AsExisting

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

private string GenerateAksBicep()
{
var sb = new StringBuilder();
Member

Can we get Azure Provisioning for this?

Comment on lines +172 to +173
// Fallback: check for DelegatedSubnetAnnotation (legacy WithDelegatedSubnet usage)
hasDefaultSubnet = this.TryGetLastAnnotation<DelegatedSubnetAnnotation>(out var delegatedAnnotation);
Member

If AKS doesn't use delegated subnets, we shouldn't support DelegatedSubnetAnnotation. There is no "legacy" here...

}

// AKS cluster resource
sb.Append("resource ").Append(id).AppendLine(" 'Microsoft.ContainerService/managedClusters@2024-06-02-preview' = {");
Member

The latest version is 2026-01-01. Can we use that? Is there a reason we are using a couple years old "preview" version?

Comment on lines +300 to +301
sb.AppendLine(" serviceCidr: '10.0.0.0/16'");
sb.AppendLine(" dnsServiceIP: '10.0.0.10'");
Member

Are we OK with these being hard-coded?

/// This class is auto-generated. To update, run the GenVmSizes tool:
/// <code>dotnet run --project src/Aspire.Hosting.Azure.Kubernetes/tools GenVmSizes.cs</code>
/// </remarks>
public static partial class AksNodeVmSizes
Member

In other places, we've used Azure.Provisioning types for things like this that are defined by Azure.


3 participants